Updates to HTML Processing

Since its inception FeedMail has processed the HTML content in feeds to ensure that it renders as expected in email form. At first this was fairly simple: things like rewriting URLs to point to the correct location (many feeds use non-absolute URLs that won't work in email). Over time more complex transformations were added, such as adding fallback content to media embeds that lack any. The full-text scraping feature requires even more complex processing, as it involves stripping away most of the page and handling content that was designed for full-featured browsers.

What changed?

Recently FeedMail migrated all HTML rewriting to new infrastructure. This provides more flexibility, enables new features (such as showing controls on all media embeds) and makes our processing much more reliable.

What does this mean to me?

As a user you shouldn't see much difference. Overall the emails you receive should be better formatted, but the difference will be subtle. Full-text scraped feeds will see a bigger impact: on the whole you should see better extraction of article content and better removal of extraneous content.

However, as with any significant change, there will be regressions, especially for full-text scraping. We have tested on a wide variety of popular feeds, but to respect user privacy we did not evaluate the results for every user's feeds (as they may contain personal content). If you see a feed that doesn't render how you expect, and you are OK with us viewing the content to investigate, please reply to the email notification to get in touch with support. While we can't add support for every strange website, we are eager to fix common issues and ensure that our HTML processing works well for the vast majority of websites.

Technical Details

If you aren't interested in the technical stuff you can stop reading now.

Previously FeedMail used two main tools to process HTML: lol-html and readability-rs. We have updated to do all of our processing via the html5ever library.

lol-html

lol-html was our general-purpose rewriter. It worked really well for simple transformations like fixing URLs, and even for replacing content (like <iframe>, which is near-universally unsupported by mail clients). On top of that it is very fast, as it was designed to pass most HTML through unchanged (it was nearly twice as fast as our new solution).
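
Roughly, the simple case looks like this with lol-html (a minimal sketch, not FeedMail's actual code; absolutize and feed_base are placeholder names):

    use lol_html::{element, rewrite_str, RewriteStrSettings};

    // Placeholder resolver: real code would use a proper URL library.
    fn absolutize(feed_base: &str, href: &str) -> String {
        if href.starts_with("http") {
            href.to_string()
        } else {
            format!("{feed_base}{href}")
        }
    }

    fn rewrite_links(html: &str, feed_base: &str) -> String {
        rewrite_str(
            html,
            RewriteStrSettings {
                element_content_handlers: vec![
                    // Make relative links absolute so they still work in an email client.
                    element!("a[href]", |el| {
                        if let Some(href) = el.get_attribute("href") {
                            el.set_attribute("href", &absolutize(feed_base, &href))?;
                        }
                        Ok(())
                    }),
                ],
                ..RewriteStrSettings::default()
            },
        )
        .expect("rewriting failed")
    }

    fn main() {
        println!("{}", rewrite_links(r#"<a href="/post/1">Read more</a>"#, "https://example.com"));
    }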

Unfortunately it got very complicated to use once you needed to make contextual decisions, for example adding fallback content for <video> elements only if they had none already. This is because the lol-html API is a series of callbacks for different CSS selectors (using a subset of the full CSS selector language) or text content. Each callback is a different closure, so if you want to share mutable state between them you need interior mutability with runtime borrow checking, like RefCell. On top of that, if you want to do something when an element closes you need to schedule a callback, which is required to have a 'static lifetime. That means reference counting with Rc. All of this added overhead and, more importantly, complicated the implementation.
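
A sketch of where this starts to hurt: sharing even one piece of state between two handlers, here a <base href> discovered earlier in the document, pulls Rc and RefCell into the picture (again an illustrative sketch, not FeedMail's code):

    use std::cell::RefCell;
    use std::rc::Rc;

    use lol_html::{element, rewrite_str, RewriteStrSettings};

    fn main() {
        // Shared state needs Rc + RefCell because each handler is a
        // separate closure owned by the rewriter.
        let base: Rc<RefCell<Option<String>>> = Rc::new(RefCell::new(None));
        let base_for_reader = Rc::clone(&base);

        let output = rewrite_str(
            r#"<base href="https://example.com/"><img src="img/cat.png">"#,
            RewriteStrSettings {
                element_content_handlers: vec![
                    // Remember the document's <base href> when we see it...
                    element!("base[href]", move |el| {
                        *base.borrow_mut() = el.get_attribute("href");
                        Ok(())
                    }),
                    // ...and use it to absolutize image URLs seen later.
                    element!("img[src]", move |el| {
                        if let (Some(b), Some(src)) =
                            (base_for_reader.borrow().clone(), el.get_attribute("src"))
                        {
                            el.set_attribute("src", &format!("{b}{src}"))?;
                        }
                        Ok(())
                    }),
                ],
                ..RewriteStrSettings::default()
            },
        )
        .expect("rewriting failed");

        println!("{output}");
    }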

This is a shame, since the underlying engine could definitely support a better interface here. If it exposed a trait of callback methods instead of taking a list of callbacks, more complex cases would be much more convenient. The current API could even be implemented on top of the trait version to keep simple cases easy. lol-html also has no support for extraction, which means it couldn't solve all of our use cases: two systems working in sync would be needed.
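
To make that concrete, the kind of trait-based interface being wished for might look something like this. Everything here is a made-up stand-in; lol-html exposes nothing like it today:

    // Hypothetical sketch only: a single trait of callback methods.
    // All state lives in one struct, so no Rc/RefCell juggling is needed.
    struct StartTag {
        name: String,
    }

    trait RewriteHandler {
        fn start_tag(&mut self, _tag: &mut StartTag) {}
        fn end_tag(&mut self, _name: &str) {}
        fn text(&mut self, _text: &mut String) {}
    }

    // Example: "add a fallback to <video> only if it has none" becomes
    // plain fields on the handler instead of shared RefCell state.
    #[derive(Default)]
    struct VideoFallback {
        inside_video: bool,
        has_fallback: bool,
    }

    impl RewriteHandler for VideoFallback {
        fn start_tag(&mut self, tag: &mut StartTag) {
            match tag.name.as_str() {
                "video" => {
                    self.inside_video = true;
                    self.has_fallback = false;
                }
                // Anything nested inside <video> counts as existing fallback here.
                _ if self.inside_video => self.has_fallback = true,
                _ => {}
            }
        }

        fn end_tag(&mut self, name: &str) {
            if name == "video" {
                if !self.has_fallback {
                    // ...emit fallback markup at this point in the output...
                }
                self.inside_video = false;
            }
        }
    }

    fn main() {
        let mut handler = VideoFallback::default();
        handler.start_tag(&mut StartTag { name: "video".into() });
        handler.end_tag("video"); // no fallback seen, so one would be emitted here
    }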

lol-html also had some correctness issues, such as attributes not being decoded when read and being insufficiently encoded when written. It also had very poor support for HTML5 tag omission. Together these limitations made it hard to write a reliable transformer.

At the end of the day performance is not a problem for us—even the longest articles only take a few milliseconds to process—so we decided to move away from lol-html.

readability-rs

When scraping full-text content, readability-rs was used to extract the article content from the entire webpage. This content was then processed by our regular formatting engine to produce the email content. readability-rs is a re-implementation of Mozilla's Readability built on top of html5ever, and overall it did a good job at content extraction. However, it had some shortcomings.

  1. Some cleanups were performed by both our processor and readability-rs, leading to conflicts, for example when fixing up URLs.
  2. readability-rs had very strict cleaning; we already had to modify the library to disable much of it.
  3. readability-rs had very limited support for influencing candidate selection, for example boosting the ranking of content that is contained in the feed summary.
  4. readability-rs was unmaintained, holding back other dependencies of ours.

Overall we needed more flexibility than readability-rs provided. I would still recommend it for people who want a drop-in solution, but with our new rewriting framework available it was time to graduate to a custom system that we could fine-tune.

html5ever

Our new rewriting framework is built on top of the html5ever tokenizer. html5ever is the parser used by the Servo browser engine. We use only the tokenizer so that transformations can run in a streaming manner. This also allows passing invalid HTML through without making matters worse. It does make dealing with tag omission difficult, but we found that we could get away with only basic support for these cleanups, without rewriting the entire document.
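
As a rough illustration of the tokenizer-only approach, here is a minimal sink that watches the token stream, which is the spot where a rewriter would emit modified tokens. This sketch assumes the 0.26-era html5ever tokenizer API; newer releases have changed some signatures:

    use html5ever::tendril::StrTendril;
    use html5ever::tokenizer::{
        BufferQueue, Tag, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
    };

    // A sink receives tokens as the input is scanned; a rewriter would
    // write (possibly modified) tokens to an output buffer from here.
    struct PrintSink;

    impl TokenSink for PrintSink {
        type Handle = ();

        fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
            match token {
                Token::TagToken(Tag { kind: TagKind::StartTag, name, .. }) => {
                    println!("start <{}>", &*name);
                }
                Token::TagToken(Tag { kind: TagKind::EndTag, name, .. }) => {
                    println!("end </{}>", &*name);
                }
                Token::CharacterTokens(text) => println!("text {:?}", &*text),
                _ => {}
            }
            TokenSinkResult::Continue
        }
    }

    fn main() {
        let mut input = BufferQueue::new();
        input.push_back(StrTendril::from_slice("<p>Hello <b>world</b></p>"));

        let mut tok = Tokenizer::new(PrintSink, TokenizerOpts::default());
        let _ = tok.feed(&mut input);
        tok.end();
    }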

One issue we had was with the html5ever serializer. It is quite opinionated, attempting to fix up the HTML it writes and sometimes raising errors. This led to surprising results when trying to pass a document through mostly unchanged. Instead we used our existing HTML writing framework (used for the FeedMail web interface). This was not much more work, since the html5ever serializer doesn't match the tokenizer's API anyway, so we would have had to do some conversion in either case.

First the HTML-to-text converter and general HTML-to-email rewriters were migrated. These were fairly direct conversions of the existing code.

After that was done the full-text content extractor and formatter was rewritten. This was a big improvement because it could use the same framework, allowing extraction, general rewriting and full-text specific rewriting in a single pass.

At a high level the process looks something like this:

  1. Transform the HTML using the general transformer as well as more aggressive full-text-specific options.
  2. After emitting each element determine its "content quality" based on Readability heuristics such as text, words, commas, paragraphs and media contrasted with the amount of markup it contains.
    1. If it does not contain enough content remove it from the output.
    2. Adjust its score based on its specificity and compare to the best candidates so far. If it is a new top candidate record its location in the output buffer.
  3. When the document is done extract the top candidate.

This can be thought of as two processes working with a similar scoring system to extract the content from the page:

  1. A subtree pruner that removes uninteresting subtrees (such as sharing buttons and advertisements).
  2. A subtree selector that removes the website "chrome" such as navigation, headers and footers and picks just the subtree that contains the article content.

One downside of this approach is that it does result in writing the entire (rewritten) document to the output buffer and then erasing most of it in chunks. It would be desirable to avoid this "overdraw" as much as possible, but it is still much faster than constructing the whole DOM in memory.
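
For a flavour of the scoring step, here is an illustrative Readability-style heuristic. The signals match those described above, but the weights and thresholds are invented for the example and are not FeedMail's:

    /// A rough "content quality" score for one element: reward visible text,
    /// commas, paragraphs and media; penalise link-heavy, markup-heavy subtrees.
    struct ElementStats {
        text_len: usize,      // bytes of visible text inside the element
        link_text_len: usize, // bytes of that text sitting inside <a> descendants
        commas: usize,        // comma count is a classic Readability signal
        paragraphs: usize,    // number of <p> descendants
        media: usize,         // images, video and audio embeds
        tags: usize,          // total descendant elements (markup overhead)
    }

    fn content_score(s: &ElementStats) -> f64 {
        let text = s.text_len as f64;
        // Link density: mostly-link subtrees are usually navigation or "related posts".
        let link_density = if s.text_len > 0 {
            s.link_text_len as f64 / text
        } else {
            1.0
        };
        let base = text / 100.0
            + s.commas as f64
            + s.paragraphs as f64 * 3.0
            + s.media as f64 * 2.0
            - s.tags as f64 * 0.5;
        base * (1.0 - link_density)
    }

    fn main() {
        let article_body = ElementStats {
            text_len: 4200, link_text_len: 120, commas: 35, paragraphs: 12, media: 2, tags: 40,
        };
        let nav_menu = ElementStats {
            text_len: 300, link_text_len: 280, commas: 0, paragraphs: 0, media: 0, tags: 25,
        };
        // The article body scores far higher, so it becomes the top candidate;
        // the navigation subtree falls below the content bar and is pruned.
        println!("article: {:.1}", content_score(&article_body));
        println!("nav:     {:.1}", content_score(&nav_menu));
    }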
