strip out navigational and other superfluous elements#862
strip out navigational and other superfluous elements#862woutervanwijk wants to merge 2 commits intomozilla:mainfrom
Conversation
gijsk
left a comment
There was a problem hiding this comment.
Hey @woutervanwijk , thanks for the patch. Did you see this broke the linter, and probably tests?
The current code unfortunately cannot assume that querySelectorAll or similar are available on the DOM implementation, so using class/ID selectors for _clean probably doesn't quite work.
It would also be really helpful to have some examples where this change produces some improvement (apart from the likely required changes to existing testcases), to understand the motivation behind some of the changes.
| var fullArticleText = this._doc.body.innerText; | ||
| if (fullArticleText.length) { |
There was a problem hiding this comment.
This is breaking all the tests as innerText is not defined.
| // get article published time | ||
| metadata.publishedTime = jsonld.datePublished || | ||
| values["article:published_time"] || null; | ||
| values["article:published_time"] || null; |
There was a problem hiding this comment.
This looks like a mistake as this is a continuation line vs. the line before.
gijsk
left a comment
There was a problem hiding this comment.
Thanks for fixing the linting - but we really need some better tests, and ideally we'd make these changes as individual ones with a test per change, so that it's more obvious what changes in behaviour each of the suggested modifications has.
| //scale down h2-h5 because it's too large most of the time (intro's in h2, etc) | ||
| this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h5"]), "h6"); | ||
| this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h4"]), "h5"); | ||
| this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h3"]), "h4"); | ||
| this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h2"]), "h3"); |
There was a problem hiding this comment.
The comment doesn't seem like a great reason to change the semantics of the headers - the "right" solution would probably to revert intro paragraphs into, well, paragraphs, instead of headers, if they meet some threshold / algorithm.
| UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog" ], | ||
| UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog", "nav" ], | ||
|
|
||
| NODES_TO_CLEAN_FIRST: ["object", "embed", "footer", "link", "aside", "nav", ".icons", ".byline", ".sub-nav", ".identity", ".logo", ".video-player", ".jw-player", ".jw-wrapper", ".video", ".byline", ".author", ".dateline", ".credentials", ".writtenby", ".p-author", ".article-author", ".navigation", ".hidden-xs", ".hidden-sm", ".brand", ".modalContent", ".noPrint", ".noprint", ".screenonly", ".breadcrumb", ".breadcrumbs", "amp-iframe", "amp-img", "amp-ad", ".advert", ".ads", ".brand", ".search", ".nav", ".user", ".users", "#onetrust-consent-sdk", "#branding", "#branding-content" ], |
There was a problem hiding this comment.
A lot of these aren't nodenames, so _clean won't actually remove them, I think?
It would also be really helpful to better understand what the source of this list of elements is.
| unlikelyCandidates: /-ad-|ai2html|banner|combx|comment|community|cover-wrap|credentials|date|hide|hidden|disqus|extra|footer|gdpr|legends|nav|paywall|meta|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|share|sharing|yom-remote|byline|topbar|article-meta|brand|tooltip/i, | ||
| okMaybeItsACandidate: /and|article|body|column|content|main|shadow|header|summary/i, | ||
|
|
||
| positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i, | ||
| negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i, | ||
| extraneous: /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility/i, | ||
| byline: /byline|author|dateline|writtenby|p-author/i, | ||
| positive: /article|body|content|entry|header|hentry|h-entry|intro|intro|intro|intro|main|main-article|main-content|page|lead|leading|pagination|primary|post|text|blog|story|summary|strapline/i, | ||
| negative: /-ad-|affiliate|credentials|controls|date|desktop|hidden|nav|^hid$| hid$| hid |^hid |hide|banner|login|gate|combx|comment|com-|contact|foot|footer|footnote|gdpr|icon|^icon|icons$|icons|masthead|media|meta|paywall|nav|outbrain|promo|related|scroll|share|sharing|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|tooltip|widget|video-player|video|jw-player|jw-aspect|modal|carousel|overlay|byline|brand|disclosure|nav|logo|account|cart|dock/i, | ||
| extraneous: /print|affiliate|archive|button|comment|controls|discuss|e[\-]?mail|meta|icons|share|reply|all|login|sign|single|utility|icons|nav|video-player|jw-player|modal|video|paidcontent|carousel|overlay|social|topbar|article-meta|onetrust-consent-sdk|logo|account|cart|hamburger|traffic|weather|search/i, | ||
| byline: /byline|author|dateline|credentials|writtenby|p-author|article-author/i, |
There was a problem hiding this comment.
These are some really significant changes, and we'd need to have some testcases to help explain why we're making these changes.
Some of the changes also seem like mistakes, e.g. nav is in there several times, and icon will always match when ^icon and icons$ and icons match.
Small tweaks to strip out navigational elements and other common superfluous elements. Works really nicely