From 2024-08-26 19:46 to 2024-08-27 11:21 UTC, FeedMail experienced an outage.
- Until 2024-08-26 20:34 FeedMail was completely down.
- For the remainder of the outage most emails were not sent.
It is expected that no feed updates were lost during this outage. An update would only have been lost if it appeared on a feed and disappeared again entirely within the roughly 50 minutes of total downtime. Most feeds keep updates available for days, so this should not be an issue.
Notifications have been delayed and should be sent by 2024-08-27 12:31. This may take longer if your mail provider applies limits and FeedMail needs to retry delivery at a later time.
Update: All delayed notifications have been sent successfully.
Timeline
All times are in UTC.
| Date | Time | Milestone | Description |
|------|------|-----------|-------------|
| 2024-08-26 | 19:46 | Start | FeedMail goes down. |
| 2024-08-26 | 19:53 | Detection | Automated monitoring reported that feeds were not being checked. |
| 2024-08-26 | 20:34 | | The database IP was hardcoded, restoring most functionality. |
| 2024-08-27 | 11:21 | Resolution | FeedMail was switched to an external DNS server. |
| 2024-08-27 | 11:24 | | The schedule for delayed mail was adjusted to send within the next hour. |
Analysis
This outage was triggered by the failure of our hosting provider's default DNS resolver. This broke several pieces of key functionality, including:
- Database Access (and anything that depends on it).
- Monitoring
- Mail Sending
It is worth noting that feed checking was not directly affected, as FeedMail uses its own DNS resolver for most internal requests. The affected functionality involved requests made by libraries and APIs that use the system resolver.
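FeedMail's stack isn't described in this post, so the following is only a minimal sketch, in Go, of the general pattern: an HTTP client wired to an application-level resolver rather than the system one. The resolver address ("10.0.0.53:53") and the fetched URL are placeholders. Library and API calls that build their own clients would still fall back to the system resolver, which is exactly the gap this outage exposed.

```go
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

// newInternalClient builds an http.Client whose connections resolve host
// names through a specific DNS server instead of the operating system's
// resolver. dnsAddr (e.g. "10.0.0.53:53") is a placeholder address.
func newInternalClient(dnsAddr string) *http.Client {
	resolver := &net.Resolver{
		PreferGo: true, // use Go's built-in resolver rather than the libc one
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			// Ignore the address Go picked and talk to our own DNS server.
			return d.DialContext(ctx, network, dnsAddr)
		},
	}
	dialer := &net.Dialer{Timeout: 10 * time.Second, Resolver: resolver}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   30 * time.Second,
	}
}

func main() {
	client := newInternalClient("10.0.0.53:53") // placeholder resolver address
	resp, err := client.Get("https://example.com/feed.xml")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}
```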
Database access is required by almost all functionality, so its loss eventually led to FeedMail failing health checks and being restarted, at which point it failed to start up.
At first, some experiments were attempted to resolve the DNS issue:
- Create a fresh Kubernetes node with our hosting provider and move FeedMail to that node. This node had the same DNS issue.
- Restart all system Kubernetes pods to see if they came back functioning.
None of these resolved the problem.
Next, the database IP was hardcoded in the configuration, avoiding the DNS dependency for database access and restoring most functionality; a sketch of this workaround follows the list below. At this point the only missing functionality was:
- Monitoring
- Mail Sending
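The workaround referenced above, sketched in Go. The Postgres driver, connection-string format, credentials, environment variable name, and the IP address are all assumptions for illustration; the point is only that a literal IP in the connection string keeps the resolver out of the path entirely.

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq" // Postgres driver; the actual database engine is an assumption
)

func main() {
	// Normally the connection string would use a DNS name such as
	// "db.internal". Substituting the resolved IP ("10.0.0.12" is a
	// placeholder) means opening the connection never performs a lookup.
	dsn := os.Getenv("DATABASE_URL")
	if dsn == "" {
		dsn = "postgres://feedmail:secret@10.0.0.12:5432/feedmail?sslmode=require"
	}

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	if err := db.Ping(); err != nil {
		log.Fatalf("ping: %v", err)
	}
	log.Println("connected without a DNS lookup")
}
```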
A support ticket was filed with our hosting provider describing the problem. As of now it hasn't been resolved, but we have been informed that it has been forwarded to their response team.
Next we updated our DNS configuration to avoid the provider's DNS server. This restored all remaining functionality.
What Went Well
- It was easy to hardcode the database IP and restore most functionality.
What Went Poorly
- Monitoring was taken offline, so manual checks had to be performed. These were unreliable and missed the problem with mail sending.
- Despite having our own DNS resolver configured, it isn't used for everything.
- It was not immediately identified that mail sending depended on the system DNS.
- Our hosting provider did not respond promptly to a serious outage.
Action Items
We will move mail sending to fully utilize our own DNS resolver. It was believed that this was already the case, since we look up MX records using our resolver, but it wasn't noticed that the mail server host names are then passed to the OS for resolution. We will ensure that we resolve these records ourselves to gain the reliability and security benefits of our internal resolver.
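A rough sketch of the intended change, again in Go rather than FeedMail's actual code: both the MX lookup and the address lookup for the chosen MX host go through the internal resolver, so the SMTP dial only ever receives an already-resolved IP. The resolver address and the probe domain are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/smtp"
	"time"
)

// deliverProbe connects to the recipient domain's mail server using only the
// given resolver: both the MX lookup and the address lookup for the MX host
// are done explicitly, so no host name is ever handed to the OS resolver.
func deliverProbe(ctx context.Context, r *net.Resolver, domain string) error {
	mxs, err := r.LookupMX(ctx, domain)
	if err != nil {
		return fmt.Errorf("mx lookup for %s: %w", domain, err)
	}
	if len(mxs) == 0 {
		return fmt.Errorf("no MX records for %s", domain)
	}

	// Resolve the MX host ourselves instead of passing the name to Dial.
	addrs, err := r.LookupHost(ctx, mxs[0].Host)
	if err != nil {
		return fmt.Errorf("host lookup for %s: %w", mxs[0].Host, err)
	}

	// smtp.Dial only ever sees an already-resolved IP address.
	c, err := smtp.Dial(net.JoinHostPort(addrs[0], "25"))
	if err != nil {
		return fmt.Errorf("smtp dial: %w", err)
	}
	return c.Quit()
}

func main() {
	// "10.0.0.53:53" stands in for the internal resolver's address.
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, "10.0.0.53:53")
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	fmt.Println(deliverProbe(ctx, r, "example.com"))
}
```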