[Repo Assist] Fix HTML parser dropping whitespace between inline elements (issue #1330) #1630
Draft
github-actions[bot] wants to merge 1 commit into main
Add an InlineWhitespace token to the HTML tokeniser so that normalised whitespace-only DefaultMode text is tracked distinctly from real text content. In the tree builder, an InlineWhitespace token is turned into an HtmlText " " node only when its nearest previous and next tokens are both inline content; otherwise it is silently discarded.
This preserves the significant inter-element space in:
<span>Hello,</span> <span>World</span> -> "Hello, World"
two entity references separated by a space -> "< >"
while still dropping insignificant inter-block whitespace (e.g. newlines between <head> and <body>), so existing tests continue to pass.
Whitespace produced by character references (e.g. &#9;) goes through the CharRefMode path and is never turned into InlineWhitespace, so it is never filtered.
Three regression tests added covering the cases from the issue.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅ Pull request created: #1630
This was referenced Feb 23, 2026
🤖 This is a draft PR from Repo Assist, an automated AI assistant.
Fixes #1330.
Root Cause
In `HtmlParser.fs`, the `Emit()` method normalises inter-element whitespace (e.g. the newline/spaces between two tags) to a single space, then checks whether the entire content was whitespace-only. Before this fix it silently dropped that `Text ""` token, which also discarded meaningful whitespace between inline elements such as the space in `<span>Hello,</span> <span>World</span>`.
Fix
A new `InlineWhitespace` case is added to the private `HtmlToken` discriminated union. When `Emit()` produces a whitespace-only DefaultMode text node, it now emits `InlineWhitespace` instead of `Text ""`.
In the tree builder (`parse'`), an `InlineWhitespace` token is turned into a real `HtmlText " "` node only when both its nearest previous accumulated node and the next token in the stream represent inline content (i.e. a non-block-level element or non-whitespace text). Otherwise it is discarded, preserving the original behaviour for inter-block whitespace.
Character references that happen to produce whitespace (e.g. `&#9;`) go through the `CharRefMode` path and emit `Text t`, never `InlineWhitespace`, so they are never filtered by this logic.
The `blockLevelElements` set (introduced in an earlier iteration) remains to classify element names for the inline/block decision.
Examples fixed
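The keep-or-discard rule described under Fix can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the F# code from `HtmlParser.fs`: the `Tag`, `Text` and `InlineWhitespace` classes, the `resolve_whitespace` and `text_of` helpers, and the tiny `BLOCK_LEVEL` set are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass


# Hypothetical token types; names echo the PR text, not the real HtmlToken union.
@dataclass(frozen=True)
class Tag:
    name: str
    closing: bool = False


@dataclass(frozen=True)
class Text:
    value: str


@dataclass(frozen=True)
class InlineWhitespace:
    pass


# Assumption: a deliberately small set. The real blockLevelElements set
# in the parser is not reproduced here.
BLOCK_LEVEL = {"html", "head", "body", "div", "p", "ul", "li", "table"}


def is_inline(token) -> bool:
    """Inline content: non-whitespace text or a non-block-level element."""
    if isinstance(token, Text):
        return token.value.strip() != ""
    if isinstance(token, Tag):
        return token.name not in BLOCK_LEVEL
    return False  # None (start/end of stream) or another InlineWhitespace


def resolve_whitespace(tokens):
    """Turn InlineWhitespace into Text(" ") only between inline neighbours."""
    out = []
    for i, tok in enumerate(tokens):
        if not isinstance(tok, InlineWhitespace):
            out.append(tok)  # ordinary tokens, including char-ref text, pass through
            continue
        prev = out[-1] if out else None  # nearest previously accumulated node
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None  # next token in stream
        if is_inline(prev) and is_inline(nxt):
            out.append(Text(" "))
        # otherwise the whitespace is dropped
    return out


def text_of(tokens) -> str:
    """Concatenate the text content of a token list."""
    return "".join(t.value for t in tokens if isinstance(t, Text))


spans = [Tag("span"), Text("Hello,"), Tag("span", True), InlineWhitespace(),
         Tag("span"), Text("World"), Tag("span", True)]
print(text_of(resolve_whitespace(spans)))  # Hello, World
```

In this sketch, whitespace that arrives as an ordinary `Text` token (the stand-in for the `CharRefMode` path) is never inspected by the rule, which mirrors the claim that character-reference whitespace is never filtered.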
Test Status
Build: ✅ 0 errors, 0 warnings
Tests: ✅ 2846 passed, 0 failed (1 pre-existing infrastructure failure: `Two custom header with different names don't throw an exception`, 403 Forbidden, unrelated to this change)
Three regression tests added:
- `Preserves space between entity references in inline content (issue 1330)`
- `Preserves space between adjacent inline elements (issue 1330)`
- `Drops inter-block whitespace but keeps inline whitespace`
Warning
The following domain was blocked by the firewall during workflow execution:
www.google.com