* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57eee5b5cc81390750d07e4800b19c0c3084)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Bogus comments that start with "<![CDATA[" should not include the starting "!"
in its value.
(cherry picked from commit 7636a66635a0da849cfccd06a52d0a21fb692271)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
"] ]>" and "]] >" no longer end the CDATA section.
Make CDATA section parsing context depending.
Add private method HTMLParser._set_support_cdata() to change the context.
If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>".
If called with False, "<[CDATA[" starts a bogus comments which ends with ">".
(cherry picked from commit 0cbbfc462119b9107b373c24d2bda5a1271bed36)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
This fixes a regression introduced in GH-135930.
(cherry picked from commit dee650189497735edbc08a54edabb5b06ef1bd09)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* Whitespaces no longer accepted between `</` and the tag name.
E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized
as whitespaces. The only whitespaces are `\t\n\r\f `.
* Null character (U+0000) no longer ends the tag name.
* Attributes and slashes after the tag name in end tags are now ignored,
instead of terminating after the first `>` in quoted attribute value.
E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespaces between the last attribute and closing `>`
are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between attribute name and value are no longer collapsed.
E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespaces between the `=` separator and attribute name or value are no
longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and
"=bar", both with value None; `<a foo= bar>` produces two attributes:
"foo" with value "" and "bar" with value None.
---------
(cherry picked from commit 0243f97cbadec8d985e63b1daec5d1cbc850cae3)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
End-of-file errors are now handled according to the HTML5 specs --
comments and declarations are automatically closed, tags are ignored.
(cherry picked from commit 6eb6c5dbfb528bd07d77b60fd71fd05d81d45c41)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
When calling .close() the HTMLParser should flush all remaining content,
even when that content is in an unclosed script or style tag.
(cherry picked from commit 53383e90e4df7029f792b7aa81aa2e4cff348ed0)
Co-authored-by: Waylan Limberg <waylan.limberg@icloud.com>
According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.
(cherry picked from commit 77b14a6d58e527f915966446eb0866652a46feb5)
https: //html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
* bpo-41748: Adds tests for unquoted attributes with comma
* bpo-41748: Handles unquoted attributes with comma
* bpo-41748: Addresses review comments
* bpo-41748: Addresses review comments
* Adds more test cases
* Simplifies the regex for handling spaces
* bpo-41748: Moves attributes tests under the right class
* bpo-41748: Addresses review about duplicate attributes
* bpo-41748: Adds NEWS.d entry for this patch
* Revert "Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a The docstring was correct. I read the patch in opposite direction, as *adding* the "r" prefix.
This reverts commit 5ba185039f1bd465d3f82531324fd3fe1ee42f0c.