
Updates to HTML Processing

Since its inception FeedMail has processed the HTML content in feeds to ensure that it renders as expected in email form. At first this involved fairly simple things, like rewriting URLs to point to the correct location (many feeds use non-absolute URLs that won't work in email), but over time more complex transformations were added, such as adding fallback content to media embeds that lack any. The full-text scraping feature demands even more complex processing, as it must strip away most of the page and handle content that was designed for full-featured browsers.
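As an illustration, URL rewriting of this kind can be sketched as follows. This is a simplified stand-in, not FeedMail's actual code, and a real implementation would use a proper URL parser:

```rust
// Minimal sketch: resolve a URL found in a feed against the feed's
// base URL so it works in email. Handles absolute, protocol-relative,
// root-relative and path-relative hrefs.
fn absolutize(base: &str, href: &str) -> String {
    if href.contains("://") {
        // Already absolute.
        href.to_string()
    } else if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative: borrow the scheme from the base.
        let scheme = base.split("://").next().unwrap_or("https");
        format!("{scheme}://{rest}")
    } else if href.starts_with('/') {
        // Root-relative: keep scheme and host from the base.
        let end = base.find("://").map(|i| i + 3).unwrap_or(0);
        let host_end = base[end..].find('/').map(|i| end + i).unwrap_or(base.len());
        format!("{}{}", &base[..host_end], href)
    } else {
        // Path-relative: append to the base's directory.
        let dir_end = base.rfind('/').map(|i| i + 1).unwrap_or(base.len());
        format!("{}{}", &base[..dir_end], href)
    }
}

fn main() {
    assert_eq!(
        absolutize("https://example.com/blog/post", "/img/a.png"),
        "https://example.com/img/a.png"
    );
}
```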

What changed?

Recently FeedMail has migrated all HTML rewriting to new infrastructure. This provides more flexibility, enables new features (such as showing controls on all media embeds) and makes our processing much more reliable.

What does this mean to me?

As a user you shouldn't see much difference. Overall the emails you receive should be better formatted, but the difference will be subtle. Full-text scraped feeds will see a bigger impact: on the whole you should see better extraction of article content and better removal of extraneous content.

However, as with any significant change there will be regressions, especially for full-text scraping. We have tested on a wide variety of popular feeds, but to respect user privacy we did not evaluate the results for every user's feeds (as they may contain personal content). If you see a feed that doesn't render how you expect, and you are OK with us viewing the content to investigate, please reply to the email notification to get in touch with support. While we can't add support for every strange website, we are eager to fix common issues and ensure that our HTML processing works well for the vast majority of websites.

Technical Details

If you aren't interested in the technical stuff you can stop reading now.

Previously FeedMail used two main tools to process HTML: lol-html and readability-rs. We have updated to do all of our processing via the html5ever library.

lol-html

lol-html was our general-purpose rewriter. It worked really well for simple transformations like fixing URLs, and even for replacing content (like <iframe>, which is near-universally unsupported by mail clients). On top of that it is very fast, as it was designed to pass most HTML through unchanged (it was nearly twice as fast as our new solution).

Unfortunately it got very complicated to use once you needed to make contextual decisions, for example adding fallback for <video> elements only if they had none already. This is because the lol-html API is a series of callbacks for different CSS selectors (using a subset of the full CSS selector language) or text content. Each callback is a separate closure, so if you want to share mutable state you need runtime mutual exclusion like RefCell. On top of that, if you want to do something when an element closes you must schedule a callback, which is required to have a 'static lifetime. That forces runtime lifetime tracking like Rc. This added overhead and, more importantly, complicated the implementation.
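The pattern looks roughly like this. This is a simplified stand-in for a callback-per-selector API like lol-html's, with illustrative names rather than its real signatures:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Each handler is an independent 'static closure, so shared mutable
// state has to be smuggled in via Rc<RefCell<...>>.
type Handler = Box<dyn FnMut(&str)>;

#[derive(Default)]
struct State {
    video_needs_fallback: bool,
    fallbacks_added: u32,
}

fn build_handlers(state: Rc<RefCell<State>>) -> Vec<(&'static str, Handler)> {
    let open_state = state.clone();
    let on_open: Handler = Box::new(move |_tag| {
        // Assume the <video> has no fallback until proven otherwise.
        open_state.borrow_mut().video_needs_fallback = true;
    });
    let on_close: Handler = Box::new(move |_tag| {
        let mut s = state.borrow_mut();
        if s.video_needs_fallback {
            // A real rewriter would inject fallback HTML at this point.
            s.fallbacks_added += 1;
        }
    });
    vec![("video", on_open), ("video (end)", on_close)]
}

fn main() {
    let state = Rc::new(RefCell::new(State::default()));
    let mut handlers = build_handlers(state.clone());
    for (_sel, h) in handlers.iter_mut() {
        h("video");
    }
    assert_eq!(state.borrow().fallbacks_added, 1);
}
```

A trait with open/close methods on a single struct would let the two handlers share state directly, without the Rc<RefCell<...>> dance.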

This is a shame, since the underlying engine could definitely support a better interface. If it exposed a trait of callback methods instead of taking a list of closures, more complex cases would be much more convenient. The current API could even be implemented on top of the trait version to keep simple cases easy. lol-html also has no support for extraction, which means it couldn't cover all of our use cases on its own; two systems working in sync would have been needed.

lol-html also had some correctness issues, such as attributes not being decoded when read and being insufficiently encoded when written. It also had very poor support for HTML5 tag omission. Together these limitations made it hard to write a reliable transformer.
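The encoding side of the problem is easy to illustrate: when writing a double-quoted attribute value back out, at minimum the characters below must be escaped or the output can change meaning. This is a minimal sketch, not lol-html's or FeedMail's code:

```rust
// Escape an attribute value for emission inside double quotes.
// Leaving '&' or '"' unescaped can break or change the markup.
fn encode_attr(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            '<' => out.push_str("&lt;"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(encode_attr(r#"a "b" & c"#), "a &quot;b&quot; &amp; c");
}
```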

At the end of the day performance is not a problem for us—even the longest articles only take a few milliseconds to process—so we decided to move away from lol-html.

readability-rs

When scraping full-text content, readability-rs was used to extract the article content from the entire webpage. This content was then processed by our regular formatting engine to produce the email content. readability-rs is a re-implementation of Mozilla's Readability built on top of html5ever, and overall it did a good job at content extraction. However, it had some shortcomings.

  1. Some cleanups were performed by both our processor and readability-rs, leading to conflicts, for example when fixing up URLs.
  2. readability-rs performed very strict cleaning; we had already modified the library to disable much of it.
  3. readability-rs had very limited support for influencing candidate selection, for example boosting the ranking of content that also appears in the feed summary.
  4. readability-rs was unmaintained, holding back other dependencies of ours.

Overall we needed more flexibility than readability-rs provided. I would still recommend it for people who want a drop-in solution, but with our new rewriting framework available it was time to graduate to a custom system that we could fine-tune.

html5ever

Our new rewriting framework is built on top of the html5ever tokenizer system. html5ever is the parser used by the Servo browser engine. We use only the tokenizer, which allows transformations in a streaming manner. This also allows passing invalid HTML through without making matters worse. It does make dealing with tag omission difficult, but we found that we could get away with basic support for these cleanups without rewriting the entire document.
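The streaming approach can be sketched like this, with illustrative Token and TokenSink types standing in for html5ever's real tokenizer API:

```rust
// Simplified sketch of streaming rewriting over tokens: process each
// token as it arrives, never building a DOM. These types are
// illustrative only, not html5ever's.
#[derive(Debug, PartialEq)]
enum Token {
    Open(String),
    Close(String),
    Text(String),
}

trait TokenSink {
    fn process(&mut self, token: Token);
}

// A rewriter that passes tokens through while dropping <script> subtrees.
struct StripScripts {
    depth: u32,
    out: Vec<Token>,
}

impl TokenSink for StripScripts {
    fn process(&mut self, token: Token) {
        match token {
            Token::Open(ref tag) if tag == "script" => self.depth += 1,
            Token::Close(ref tag) if tag == "script" && self.depth > 0 => self.depth -= 1,
            tok if self.depth == 0 => self.out.push(tok),
            _ => {} // inside a <script>: swallow the token
        }
    }
}

fn main() {
    let mut sink = StripScripts { depth: 0, out: Vec::new() };
    for t in [
        Token::Open("p".into()),
        Token::Open("script".into()),
        Token::Text("evil()".into()),
        Token::Close("script".into()),
        Token::Text("hello".into()),
        Token::Close("p".into()),
    ] {
        sink.process(t);
    }
    assert_eq!(sink.out.len(), 3);
}
```

Note that a stray close tag with no matching open simply passes through, which is one way invalid input can be forwarded without making matters worse.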

One issue we had was with the html5ever serializer. It is quite opinionated, attempting to fix up the HTML it writes and sometimes raising errors. This led to surprising results when trying to pass a document through mostly unchanged. Instead we used our existing HTML writing framework (used for the FeedMail web interface). This was not much more work, since the html5ever serializer doesn't match the tokenizer's API anyways, so some conversion was needed in either case.

First the HTML-to-text converter and general HTML-to-email rewriters were migrated. These were fairly direct conversions of the existing code.

After that was done the full-text content extractor and formatter was rewritten. This was a big improvement because it could use the same framework, allowing extraction, general rewriting and full-text specific rewriting in a single pass.

At a high level the process looks something like this:

  1. Transform the HTML using the general transformer as well as more aggressive full-text-specific options.
  2. After emitting each element determine its "content quality" based on Readability heuristics such as text, words, commas, paragraphs and media contrasted with the amount of markup it contains.
    1. If it does not contain enough content remove it from the output.
    2. Adjust its score based on its specificity and compare to the best candidates so far. If it is a new top candidate record its location in the output buffer.
  3. When the document is done extract the top candidate.
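The scoring step above can be sketched as follows. The weights and inputs here are illustrative guesses, not FeedMail's actual tuning:

```rust
// Hedged sketch of a Readability-style "content quality" score: text,
// commas and paragraphs count for an element, while heavy markup
// relative to text (navigation, widgets) counts against it.
fn content_score(text: &str, paragraphs: u32, markup_bytes: usize) -> f32 {
    let commas = text.matches(',').count() as f32;
    let words = text.split_whitespace().count() as f32;
    let text_score = words / 10.0 + commas + paragraphs as f32;
    // Ratio of markup to text; clamp so the score never goes negative.
    let markup_density = markup_bytes as f32 / text.len().max(1) as f32;
    text_score * (1.0 - markup_density.min(1.0))
}

fn main() {
    let article = content_score("Long, thoughtful prose, with many words and clauses.", 3, 20);
    let nav = content_score("Home About Contact", 0, 200);
    // Article-like content should outrank markup-heavy navigation.
    assert!(article > nav);
}
```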

This can be thought of as two processes working with a similar scoring system to extract the content from the page:

  1. A subtree pruner that removes uninteresting subtrees (such as sharing buttons and advertisements).
  2. A subtree selector that removes the website "chrome" such as navigation, headers and footers and picks just the subtree that contains the article content.

One downside of this approach is that it does result in writing the entire (rewritten) document to the output buffer and then erasing most of it in chunks. It would be desirable to avoid this "overdraw" as much as possible, but it is still much faster than constructing the whole DOM in memory.
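The write-then-erase pattern can be sketched as follows (names are illustrative):

```rust
// Sketch of the "overdraw" buffer: each candidate subtree is written
// eagerly and pruned by truncating back to a recorded mark if it
// turns out to score too low.
struct OutputBuffer {
    buf: String,
}

impl OutputBuffer {
    fn mark(&self) -> usize {
        self.buf.len()
    }
    fn write(&mut self, chunk: &str) {
        self.buf.push_str(chunk);
    }
    // Erase everything emitted since `mark`.
    fn rollback(&mut self, mark: usize) {
        self.buf.truncate(mark);
    }
}

fn main() {
    let mut out = OutputBuffer { buf: String::new() };
    out.write("<article>");
    let mark = out.mark();
    out.write("<div class=\"share-buttons\">share</div>");
    // Scored too low: prune the subtree after the fact.
    out.rollback(mark);
    out.write("<p>Real content</p></article>");
    assert_eq!(out.buf, "<article><p>Real content</p></article>");
}
```

Truncating a String is cheap, which is why this still beats building a full DOM even when most of the written bytes are later erased.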
