
Updates to HTML Processing

Since its inception, FeedMail has processed the HTML content in feeds to ensure that it renders as expected in email form. At first this meant fairly simple things, like rewriting URLs to point to the correct location (many feeds use non-absolute URLs that won't work in email), but over time more complex transformations were added, such as adding fallback content to media embeds that lack any. The full-text scraping feature requires even more complex processing, as it involves stripping away most of the page and handling content that was designed for full-featured browsers.

What changed?

Recently FeedMail migrated all HTML rewriting to new infrastructure. This provides more flexibility, enables new features (such as showing controls on all media embeds) and makes our processing much more reliable.

What does this mean to me?

As a user you shouldn't see much difference. Overall the emails you receive should be better formatted, but the difference will be subtle. Full-text scraped feeds will see a bigger impact: on the whole you should see better extraction of article content and better removal of extraneous content.

However, as with any significant change, there will be regressions, especially for full-text scraping. We have tested on a wide variety of popular feeds, but to respect user privacy we did not evaluate the results for every user's feeds (as they may contain personal content). If you see a feed that doesn't render how you expect, and you are OK with us viewing the content to investigate, please reply to the email notification to get in touch with support. While we can't add support for every strange website, we are eager to fix common issues and ensure that our HTML processing works well for the vast majority of websites.

Technical Details

If you aren't interested in the technical stuff you can stop reading now.

Previously FeedMail used two main tools to process HTML: lol-html and readability-rs. We have now moved all of our processing to the html5ever library.

lol-html

lol-html was our general-purpose rewriter. It worked really well for simple transformations like fixing URLs, and even for replacing content (like <iframe>, which is near-universally unsupported by mail clients). On top of that it is very fast, as it was designed to pass most HTML through unchanged (it was nearly twice as fast as our new solution).
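For illustration, here is a minimal sketch of the kind of simple rewrite lol-html handles well. This is not FeedMail's actual code; the `base` URL and the use of the `url` crate are assumptions for the example:

```rust
use lol_html::{element, rewrite_str, RewriteStrSettings};
use url::Url;

// Resolve relative link URLs against a base URL (assumed to come from the
// feed entry or page being processed).
fn absolutize_links(html: &str, base: &Url) -> Result<String, lol_html::errors::RewritingError> {
    rewrite_str(
        html,
        RewriteStrSettings {
            element_content_handlers: vec![element!("a[href]", |el| {
                if let Some(href) = el.get_attribute("href") {
                    if let Ok(resolved) = base.join(&href) {
                        el.set_attribute("href", resolved.as_str())?;
                    }
                }
                Ok(())
            })],
            ..RewriteStrSettings::default()
        },
    )
}
```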

Unfortunately it got very complicated to use once you needed to make contextual decisions, for example adding fallback content for <video> elements only if they had none already. This is because the lol-html API is a series of callbacks for different CSS selectors (using a subset of the full CSS selector language) or text content. Each callback is a separate closure, so if you want to share mutable state between them you need runtime mutual exclusion like RefCell. On top of that, if you want to do something when an element closes you need to schedule a callback, which is required to have a 'static lifetime; this forces runtime lifetime tracking like Rc. This added overhead and, more importantly, complicated the implementation, as the sketch below shows.
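Here is an illustrative sketch (again, not FeedMail's actual code) of inserting fallback text into a <video> element only when it has none, assuming lol-html 1.x, where end-tag handlers must be 'static:

```rust
use std::cell::RefCell;
use std::rc::Rc;

use lol_html::html_content::ContentType;
use lol_html::{element, rewrite_str, text, RewriteStrSettings};

fn add_video_fallback(html: &str) -> Result<String, lol_html::errors::RewritingError> {
    // Mutable state shared between separate handler closures.
    let has_content = Rc::new(RefCell::new(false));
    let start_state = has_content.clone();
    let text_state = has_content.clone();

    rewrite_str(
        html,
        RewriteStrSettings {
            element_content_handlers: vec![
                element!("video", move |el| {
                    *start_state.borrow_mut() = false;
                    // End-tag handlers must be 'static, hence another Rc clone.
                    let end_state = start_state.clone();
                    el.on_end_tag(move |end| {
                        if !*end_state.borrow() {
                            end.before("<p>(video omitted)</p>", ContentType::Html);
                        }
                        Ok(())
                    })?;
                    Ok(())
                }),
                text!("video", move |chunk| {
                    if !chunk.as_str().trim().is_empty() {
                        *text_state.borrow_mut() = true;
                    }
                    Ok(())
                }),
            ],
            ..RewriteStrSettings::default()
        },
    )
}
```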

This is a shame, since the underlying engine could definitely support a better interface here. If it exposed a trait of callback methods instead of taking a list of callbacks, more complex cases would be much more convenient, and the current API could even be implemented on top of the trait version to keep simple cases easy. lol-html also has no support for extraction, which means it couldn't solve all of our use cases; two systems working in sync would have been needed.
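Purely as a hypothetical, such a trait might look roughly like this (reusing lol-html's existing content types). A single implementor owns all of its state, so no Rc/RefCell juggling is needed:

```rust
use lol_html::html_content::{Element, EndTag, TextChunk};
use lol_html::HandlerResult;

// Hypothetical interface only; this is not part of lol-html's actual API.
trait RewriteHandler {
    fn element_start(&mut self, el: &mut Element) -> HandlerResult;
    fn element_end(&mut self, tag: &mut EndTag) -> HandlerResult;
    fn text(&mut self, chunk: &mut TextChunk) -> HandlerResult;
}
```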

lol-html also had some correctness issues, such as attributes not being decoded when read and being insufficiently encoded when written. It also had very poor support for HTML5 tag omission. Together these limitations made it hard to write a reliable transformer.

At the end of the day performance is not a problem for us—even the longest articles only take a few milliseconds to process—so we decided to move away from lol-html.

readability-rs

When scraping full-text content, readability-rs was used to extract the article content from the entire webpage. This content was then processed by our regular formatting engine to produce the email content. readability-rs is a re-implementation of Mozilla's Readability built on top of html5ever, and overall it did a good job at content extraction. However it did have some shortcomings.

  1. Some cleanups were performed by both our processor and readability-rs, leading to conflicts, for example when fixing up URLs.
  2. readability-rs applied very strict cleaning; we already had to modify the library to disable much of it.
  3. readability-rs had very limited support for influencing candidate selection, for example boosting the ranking of content that appears in the feed summary.
  4. readability-rs was unmaintained, which held back other dependencies of ours.

Overall we needed more flexibility than readability-rs provided. I would still recommend it for people who want a drop-in solution, but with our new rewriting framework available it was time to graduate to a custom system that we could fine-tune.

html5ever

Our new rewriting framework is built on top of the html5ever tokenizer. html5ever is the parser used by the Servo browser engine. We use only the tokenizer so that transformations can be done in a streaming manner. This also allows passing invalid HTML through without making matters worse. It does make dealing with tag omission difficult, but we found that we could get away with only basic support for these cleanups without rewriting the entire document.
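As a rough illustration of what driving only the tokenizer looks like, here is a minimal sketch that walks the token stream without building a tree. It is not FeedMail's code, and it assumes the older html5ever TokenSink API where process_token takes &mut self; exact signatures vary between html5ever releases:

```rust
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

// A sink that just prints tag names and text; a real rewriter would emit
// (possibly modified) tokens to an output buffer instead.
struct PrintSink;

impl TokenSink for PrintSink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
        match token {
            Token::TagToken(tag) => println!("tag: {}", tag.name),
            Token::CharacterTokens(text) => println!("text: {}", text.trim()),
            _ => {}
        }
        TokenSinkResult::Continue
    }
}

fn walk(html: &str) {
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from_slice(html));

    let mut tokenizer = Tokenizer::new(PrintSink, TokenizerOpts::default());
    let _ = tokenizer.feed(&mut input);
    tokenizer.end();
}
```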

One issue we had was with the html5ever serializer. It is quite opinionated, attempting to fix up the HTML it writes and sometimes raising errors. This led to surprising results when trying to pass a document through mostly unchanged. Instead we used our existing HTML writing framework (also used for the FeedMail web interface). This was not much more work, since the html5ever serializer doesn't match the tokenizer's API anyway, so we would have had to do some conversion in either case.

First the HTML-to-text converter and general HTML-to-email rewriters were migrated. These were fairly direct conversions of the existing code.

After that was done, the full-text content extractor and formatter were rewritten. This was a big improvement because they could use the same framework, allowing extraction, general rewriting and full-text-specific rewriting in a single pass.

At a high level the process looks something like this:

  1. Transform the HTML using the general transformer as well as more aggressive full-text-specific options.
  2. After emitting each element, determine its "content quality" based on Readability heuristics such as text, words, commas, paragraphs and media, contrasted with the amount of markup it contains (a rough sketch of this scoring follows the list).
    1. If it does not contain enough content, remove it from the output.
    2. Adjust its score based on its specificity and compare it to the best candidates so far. If it is a new top candidate, record its location in the output buffer.
  3. When the document is done, extract the top candidate.
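To make the scoring step more concrete, here is a rough, purely hypothetical sketch; none of these names, weights or thresholds come from FeedMail's actual code:

```rust
use std::ops::Range;

// Candidate bookkeeping: the byte range lets a rejected subtree be erased
// from the output buffer, or the winning subtree be extracted at the end.
struct Candidate {
    score: f32,
    range: Range<usize>,
}

// Reward "article-like" signals and penalise heavy markup. The weights are
// invented for illustration only.
fn content_quality(text_len: usize, commas: usize, paragraphs: usize, media: usize, markup_len: usize) -> f32 {
    let signal = text_len as f32
        + 15.0 * commas as f32
        + 25.0 * paragraphs as f32
        + 40.0 * media as f32;
    signal / (1.0 + markup_len as f32)
}

// Prefer more specific (deeper) elements over huge wrappers, and keep track
// of the best candidate seen so far.
fn consider(best: &mut Option<Candidate>, quality: f32, depth: usize, range: Range<usize>) {
    let score = quality * (1.0 + 0.1 * depth as f32);
    if best.as_ref().map_or(true, |b| score > b.score) {
        *best = Some(Candidate { score, range });
    }
}
```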

This can be thought of as two processes working with a similar scoring system to extract the content from the page:

  1. A subtree pruner that removes uninteresting subtrees (such as sharing buttons and advertisements).
  2. A subtree selector that removes the website "chrome", such as navigation, headers and footers, and picks just the subtree that contains the article content.

One downside of this approach is that it does result in writing the entire (rewritten) document to the output buffer and then erasing most of it in chunks. It would be desirable to avoid this "overdraw" as much as possible, but it is still much faster than constructing the whole DOM in memory.
