gh-118350: Add escapable-raw-text mode to html parser#121770
gh-118350: Add escapable-raw-text mode to html parser#121770timonviola wants to merge 10 commits intopython:mainfrom
Conversation
serhiy-storchaka
left a comment
There was a problem hiding this comment.
What is the difference between processing raw text elements and escapable raw text elements? I do not see any this code.
Lib/test/test_htmlparser.py
Outdated
There was a problem hiding this comment.
Is it right? The test name is test_escapable_raw_text_with_closing_tags, but it tests the script element. It looks very similar to test_cdata_with_closing_tags.
Lib/test/test_htmlparser.py
Outdated
There was a problem hiding this comment.
Why test this in the title and textarea elements?
There was a problem hiding this comment.
Add also examples of valid character references and an ambiguous ampersand.
|
@ezio-melotti @serhiy-storchaka can you help with the review? |
Lib/test/test_htmlparser.py
Outdated
There was a problem hiding this comment.
This is not correct. Charrefs should be resolved in escapable raw text elements. Data should be '"X"X"' instead of text. Except for an ambiguous ampersand.
Lib/test/test_htmlparser.py
Outdated
There was a problem hiding this comment.
How does this test differ from test_cdata_content? BTW, most examples use JavaScript syntax, and only relevant for <script>.
Lib/html/parser.py
Outdated
Lib/html/parser.py
Outdated
There was a problem hiding this comment.
This is incorrect. Charrefs should be resolved in an escapable raw text element. Except an ambiguous ampersand.
We need also tests for convert_charrefs=False in an escapable raw text element.
Lib/html/parser.py
Outdated
There was a problem hiding this comment.
Since the behavior for raw text elements and escapable raw text elements is so similar, and they cannot be nested, why not use set_cdata_mode() and cdata_elem for both? Just add an optional boolean parameter to specify whether it is escapable (charrefs should be unescaped) or not.
| ('entityref', 'amp'), | ||
| ('data', ' Pumba') | ||
| ], | ||
| collector=Collector(convert_charrefs=False), |
There was a problem hiding this comment.
Did you mean this test? @serhiy-storchaka
|
It seems that you accidentally removed all changes to the parser. |
@serhiy-storchaka I made tried to revert my changes so I can rebase on top of the latest changes. I created a new PR, I hope that will help you with the review: #135310. Thanks for the comments/input so far @serhiy-storchaka ! |
escapable raw text elements are not handled in the current HTMLParser implementation.
This PR extends the existing parser with an additional mode to handle this correctly.