From 2024-08-26 19:46 to 2024-08-27 11:21 UTC, FeedMail experienced an outage.
- Until 2024-08-26 20:34 FeedMail was completely down.
- For the remainder of the outage most emails were not sent.
It is expected that no feed updates were lost during this outage. An update would only have been lost if it appeared on a feed and disappeared again entirely within the roughly 50 minutes of total downtime. Most feeds keep updates available for days, so this should not be an issue.
Notifications have been delayed and should be sent by 2024-08-27 12:31. This may take longer if your mail provider applies limits and FeedMail needs to retry delivery at a later time.
Update: All delayed notifications have been sent successfully.
Timeline
All times are in UTC.
| Date | Time | Milestone | Description |
|------|------|-----------|-------------|
| 2024-08-26 | 19:46 | Start | FeedMail goes down. |
| 2024-08-26 | 19:53 | Detection | Automated monitoring reported that feeds were not being checked. |
| 2024-08-26 | 20:34 | | The database IP was hardcoded, restoring most functionality. |
| 2024-08-27 | 11:21 | Resolution | FeedMail was switched to an external DNS server. |
| 2024-08-27 | 11:24 | | The schedule for delayed mail was adjusted to send within the next hour. |
Analysis
This outage was triggered by the failure of our hosting provider's default DNS resolver. This broke several pieces of key functionality, including:
- Database Access (and anything that depends on it).
- Monitoring
- Mail Sending
It is worth noting that feed checking was not directly affected, as FeedMail uses its own DNS resolver for most internal requests. The affected functionality involved requests made by libraries and APIs that use the system resolver.
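FeedMail's stack isn't described in this post, so the following is only a minimal sketch, in Go, of the general pattern: an HTTP client wired to an application-level resolver rather than the system one. The resolver address ("10.0.0.53:53") and the fetched URL are placeholders. Library and API calls that build their own clients would still fall back to the system resolver, which is exactly the gap this outage exposed.

```go
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

// newInternalClient builds an http.Client whose connections resolve host
// names through a specific DNS server instead of the operating system's
// resolver. dnsAddr (e.g. "10.0.0.53:53") is a placeholder address.
func newInternalClient(dnsAddr string) *http.Client {
	resolver := &net.Resolver{
		PreferGo: true, // use Go's built-in resolver rather than the libc one
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			// Ignore the address Go picked and talk to our own DNS server.
			return d.DialContext(ctx, network, dnsAddr)
		},
	}
	dialer := &net.Dialer{Timeout: 10 * time.Second, Resolver: resolver}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   30 * time.Second,
	}
}

func main() {
	client := newInternalClient("10.0.0.53:53") // placeholder resolver address
	resp, err := client.Get("https://example.com/feed.xml")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}
```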
Database access is required by almost all functionality, so its loss eventually led to FeedMail failing health checks and being restarted, at which point it failed to start up.
At first, some experiments were attempted to resolve the DNS issue:
- Create a fresh Kubernetes node with our hosting provider and move FeedMail to that node. This node had the same DNS issue.
- Restart all system Kubernetes pods to see if they came back functioning.
None of these resolved the problem.
Next, the database IP was hardcoded in the configuration, avoiding the DNS dependency for database access and restoring most functionality; a sketch of this workaround follows the list below. At this point the only missing functionality was:
- Monitoring
- Mail Sending
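The workaround referenced above, sketched in Go. The Postgres driver, connection-string format, credentials, environment variable name, and the IP address are all assumptions for illustration; the point is only that a literal IP in the connection string keeps the resolver out of the path entirely.

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq" // Postgres driver; the actual database engine is an assumption
)

func main() {
	// Normally the connection string would use a DNS name such as
	// "db.internal". Substituting the resolved IP ("10.0.0.12" is a
	// placeholder) means opening the connection never performs a lookup.
	dsn := os.Getenv("DATABASE_URL")
	if dsn == "" {
		dsn = "postgres://feedmail:secret@10.0.0.12:5432/feedmail?sslmode=require"
	}

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	if err := db.Ping(); err != nil {
		log.Fatalf("ping: %v", err)
	}
	log.Println("connected without a DNS lookup")
}
```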
A support ticket was filed with our hosting provider describing the problem. As of now it hasn't been resolved, but we have been informed that it has been forwarded to their response team.
Next we updated our DNS configuration to avoid the provider's DNS server. This restored all remaining functionality.
What Went Well
- It was easy to hardcode the database IP and restore most functionality.
What Went Poorly
- Monitoring was taken offline, so manual checks had to be performed. These were unreliable and missed the problem with mail sending.
- Despite having our own DNS resolver configured, it isn't used for everything.
- It was not immediately identified that mail sending depended on the system DNS.
- Our hosting provider did not respond promptly to a serious outage.
Action Items
We will move mail sending to fully utilize our own DNS resolver. It was believed that this was already the case, since we look up MX records using our resolver, but it wasn't noticed that the mail server host names are then passed to the OS for resolution. We will ensure that we resolve these records ourselves to gain the reliability and security benefits of our internal resolver.
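A rough sketch of the intended change, again in Go rather than FeedMail's actual code: both the MX lookup and the address lookup for the chosen MX host go through the internal resolver, so the SMTP dial only ever receives an already-resolved IP. The resolver address and the probe domain are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/smtp"
	"time"
)

// deliverProbe connects to the recipient domain's mail server using only the
// given resolver: both the MX lookup and the address lookup for the MX host
// are done explicitly, so no host name is ever handed to the OS resolver.
func deliverProbe(ctx context.Context, r *net.Resolver, domain string) error {
	mxs, err := r.LookupMX(ctx, domain)
	if err != nil {
		return fmt.Errorf("mx lookup for %s: %w", domain, err)
	}
	if len(mxs) == 0 {
		return fmt.Errorf("no MX records for %s", domain)
	}

	// Resolve the MX host ourselves instead of passing the name to Dial.
	addrs, err := r.LookupHost(ctx, mxs[0].Host)
	if err != nil {
		return fmt.Errorf("host lookup for %s: %w", mxs[0].Host, err)
	}

	// smtp.Dial only ever sees an already-resolved IP address.
	c, err := smtp.Dial(net.JoinHostPort(addrs[0], "25"))
	if err != nil {
		return fmt.Errorf("smtp dial: %w", err)
	}
	return c.Quit()
}

func main() {
	// "10.0.0.53:53" stands in for the internal resolver's address.
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, "10.0.0.53:53")
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	fmt.Println(deliverProbe(ctx, r, "example.com"))
}
```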