Skip to main content

FeedMail was Down

FeedMail was offline for 26 minutes. During this period the website was unavailable and feed updates were not sent.

This outage was caused by our CoreDNS resolver failing. While FeedMail continued operating normally for a while as most operations such as feed fetching and mail sending don't rely on the Kubernetes DNS server FeedMail does use the Kubernetes DNS server for a few operations such as connecting to it's own database. When database connections needed to be refreshed the DNS resolution failure caused FeedMail to become unhealthy and it was unable to continue operation.

Timeline

All times are in UTC.

13:28StartFeedMail goes down. Website is offline and feeds are not being checked.
13:32DetectionAutomated monitoring reported that the FeedMail website was unavailable.
13:38
Automated monitoring reported that feeds were not being fetched.
13:42
Kubernetes cluster update was started.
13:53MitigatedFeedMail was restored to operation. The website was again available and feeds started being checked.
13:54ResolvedAll feeds were checked and mail was sent.
Note that WebSub updates that fired during the downtime may take slightly longer to appear as the server will select the retry interval.

Analysis

CoreDNS was returning 503 to its readiness healthy check and had the following message repeated in its logs. 

plugin/ready: Still waiting on: "kubernetes"

No changed had recently been made to CoreDNS. Restarting CoreDNS did not help.

This incident was resolved by updating Kubernetes. This update was announced earlier in the day and we were planning on waiting a few days to apply it in case any bugs were found and fixed in the new version. Instead it was decided to apply it immediately to reconfigure CoreDNS or the Kubernetes API server to a working state. This was a risky maneuver but since FeedMail runs on a managed Kubernetes cluster we don't configure CoreDNS ourselves so it seemed safer than manually tweaking settings, especially since the true issue may have been with the Kubernetes API server.

What Went Well

  • Monitoring quickly detected the issue.
  • The service quickly and gracefully recovered once DNS resolution was restored.

What Went Poorly

Nothing.

Where We Got Lucky

  • The Kubernetes update was released only hours before fixed the issue.
    • If it didn't or hadn't been released we would have had to file a service request which likely would have taken longer.

Action Items

At this time we don't except to take any action. This downtime is within our reliability targets. The cost to resolve this issue is not deemed worth it at this time.

One mitigation would be to run multiple Kubernetes clusters. This would give us software version and geographical isolation. However this would increase operational complexity as well as costs. Another option may be to run more instances of CoreDNS but this is managed by our provider so we would prefer not to customize it at this time.

One last option would be to override DNS settings and use our own DNS resolvers for all operations. This is something that we will continue to revisit in the future.

Comments

Popular posts from this blog

DNS Outage

From 2024-08-26 19:46 to 2024-08-27 11:21 UTC FeedMail had an outage. Until 2024-06-26 20:34 FeedMail was completely down. For the remainder of the outage most emails not sent. It is expected that no feed updates were lost during this outage. Updates would only be lost if they were only present on the feed within the 50min of total outage. Most feeds ensure that updates are present for days so this would not be an issue. Notifications have been delayed and should be sent by 2024-08-27 12:31. This may take longer if your mail provider applies limits and FeedMail needs to retry delivery at a later time. Update : All delayed notifications have been sent successfully. Timeline All times are in UTC . 2024-08-26 19:46 Start FeedMail goes down.   19:53 Detection Automated monitoring reported that feeds were not being checked. 20:34 The Database IP was hardcoded, restoring most functionality. 2024-08-27 11:21 Resolution FeedMail was switched external DNS. 11:24 Schedule of ...

Digests Now Respect Category Filters

Due to an oversight category filters did not apply to digests. This has been corrected and future digests will be filtered by your selected categories. If you do not want this filtering to occur please update your filters to "Ignore selected categories" and deselect all categories to inactivate the filter.

Digests are now Supported for Owner-Paid Feeds

Owner-paid feeds allow feed publishers to provide FeedMail to their subscribers at no cost. For example the FeedMail Blog is an owner-paid feed. Up until now digest subscriptions were not covered by owner-paid plans. Subscribers could select a digest but they would have to pay for the subscriptions themselves. Digests are now fully supported under owner-paid plans. For users: The owner-paid feeds in your digests no longer count towards the cost of the digest. For publishers: Users will now be able to receive your feed as a digest or included in one of their existing digests. You will be charged one credit for each digest issue containing items from your feed (no matter how many items from your feed are in that issue). Notably this cost will never be more than real-time subscriptions would be.