Executive Summary
Between 2022-05-12 19:37 and 2022-05-13 12:06 UTC (a period of 16h 29min), notifications for feed updates were not delivered to most customers. The delayed notifications were then delivered between 12:16 and 12:36. No notifications were lost.
Technical Details
There was an attempt to enable DNSSEC for MX record lookups. However, the applied configuration resulted in errors for domains without DNSSEC configured. This error was not caught during testing for two reasons: the configuration was expected to validate only domains that had DNSSEC configured, and the haphazardly selected test set happened to include only domains with DNSSEC configured.
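The difference between the expected and the applied behaviour can be illustrated with a minimal sketch. It does not reproduce FeedMail's actual resolver configuration, which is not described here; the two policy functions and their names are hypothetical. The status names are the validation outcomes defined in RFC 4035.

```python
from enum import Enum


class DnssecStatus(Enum):
    # Validation outcomes as named in RFC 4035, section 4.3.
    SECURE = "secure"                # signed, and the signatures validate
    INSECURE = "insecure"            # zone is provably unsigned
    BOGUS = "bogus"                  # signed, but validation failed
    INDETERMINATE = "indeterminate"  # validator could not decide


def strict_policy(status: DnssecStatus) -> bool:
    """Hypothetical policy: accept only validated answers.

    Domains without DNSSEC are rejected, which matches the failure described above.
    """
    return status is DnssecStatus.SECURE


def opportunistic_policy(status: DnssecStatus) -> bool:
    """Hypothetical policy: reject only answers that fail validation.

    Domains without DNSSEC still resolve, which matches what was expected.
    """
    return status is not DnssecStatus.BOGUS


# A test set containing only signed, valid domains cannot tell the policies apart.
only_signed = [DnssecStatus.SECURE, DnssecStatus.SECURE]
policies_agree = all(strict_policy(s) == opportunistic_policy(s) for s in only_signed)
```

Under these assumptions, `policies_agree` is true for a test set of signed domains only, so the two behaviours diverge only once an unsigned domain is included.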
The error was not noticed until 2022-05-13 12:00. The delay was due to quota exhaustion on FeedMail's primary error monitoring service, which meant that errors were not reported immediately. Instead, they were detected only by manual polling of our error dashboard.
Once the cause was identified, the following actions were taken:
- All items in the unsent email queue were updated to ensure that they would not fail permanently.
- A version of FeedMail with DNSSEC disabled was deployed.
- All items in the unsent email queue were manually updated to send over the next 20 minutes.
- Logs were checked to identify if any entries were permanently dropped.
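Spreading the backlog over a window, rather than sending everything at once, could look like the sketch below. How the queue was actually updated is not described here; `spread_send_times` is a hypothetical helper, and only the 20-minute window comes from the steps above.

```python
from datetime import datetime, timedelta, timezone


def spread_send_times(count, start, window=timedelta(minutes=20)):
    """Return `count` send times spaced evenly across [start, start + window)."""
    if count <= 0:
        return []
    step = window / count  # timedelta divided by int gives a timedelta
    return [start + i * step for i in range(count)]


# Example: four queued items starting at 12:16 UTC are sent five minutes apart.
start = datetime(2022, 5, 13, 12, 16, tzinfo=timezone.utc)
schedule = spread_send_times(4, start)
```

Even spacing keeps the send rate constant during recovery, which avoids a burst of outgoing mail.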
Action Items
Error Monitoring Quota Alerts
Right now we lack a good way to identify whether we are consuming our primary error alerting quota at a rate that will deplete it before the end of the month.
In this case, a dependency update caused a high-frequency warning message, which depleted our error quota over a few days early in the month. By the time it was clear that the dependency needed to be reverted, it was too late. Early warning of quota usage would have sufficiently mitigated this problem.
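One way to get that early warning is to extrapolate the month's usage so far to the end of the month and compare it with the quota. The sketch below assumes a simple linear projection and a quota that resets on the first of each month; the function names and the numbers in the example are hypothetical.

```python
import calendar
from datetime import datetime, timezone


def projected_month_end_usage(used, now):
    """Linearly extrapolate usage so far this month to the end of the month."""
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    elapsed = (now - month_start).total_seconds()
    if elapsed <= 0:
        return float(used)  # nothing to extrapolate from yet
    return used * (days_in_month * 86400) / elapsed


def quota_at_risk(used, quota, now):
    """Return True if the current burn rate would exceed the quota this month."""
    return projected_month_end_usage(used, now) > quota


# Example (made-up numbers): three days into May with a quota of 50,000 events.
now = datetime(2022, 5, 4, tzinfo=timezone.utc)
normal = quota_at_risk(3_000, 50_000, now)   # projects to about 31,000 events
runaway = quota_at_risk(6_000, 50_000, now)  # projects to about 62,000 events
```

A linear projection is deliberately crude. It would flag a sudden high-frequency warning within days, at the cost of occasional false alarms early in the month.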
Backup Error Monitoring
Instead of relying on manual polling for backup monitoring, an alert could have been set up in our logging framework to notify us of this problem rapidly. This likely would have resulted in the error being fixed in 1 hour instead of 16.
This alert has now been set up.
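The general shape of such a log-based alert is a count of errors within a sliding time window. The sketch below is illustrative only and does not describe the alert that was actually configured; the class name, threshold, and window are assumptions.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class ErrorRateAlert:
    """Fire when at least `threshold` errors occur within `window`."""

    def __init__(self, threshold, window=timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self._seen = deque()  # timestamps of recent errors, oldest first

    def record(self, when):
        """Record one error; return True if the alert should fire."""
        self._seen.append(when)
        cutoff = when - self.window
        # Drop errors that have fallen out of the window.
        while self._seen and self._seen[0] < cutoff:
            self._seen.popleft()
        return len(self._seen) >= self.threshold


# Example: three errors within two minutes trip a threshold of three.
t0 = datetime(2022, 5, 12, 19, 37, tzinfo=timezone.utc)
alert = ErrorRateAlert(threshold=3)
fired = [alert.record(t0 + timedelta(minutes=m)) for m in (0, 1, 2)]
```

Because this counts log entries directly, it keeps working when the primary error monitoring service has stopped accepting events.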
Maintain Test MX List
This was also a preventable error. The change to DNSSEC was intentional, but the handful of domains it was tested on all happened to have DNSSEC configured. To ensure sufficient testing in the future, a list of "Test MXes" will be documented, containing accounts on publicly accessible MXes that exercise a variety of DNS configurations and mail software.
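A documented list also makes it possible to check coverage mechanically. The sketch below assumes each entry records whether its domain is DNSSEC-signed; the `MxEntry` fields, the helper name, and the example domains are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MxEntry:
    domain: str          # placeholder domains, not real test accounts
    dnssec_signed: bool
    mail_software: str


def missing_dnssec_coverage(test_set):
    """Return which DNSSEC cases ("signed", "unsigned") the test set lacks."""
    present = {"signed" if mx.dnssec_signed else "unsigned" for mx in test_set}
    return sorted({"signed", "unsigned"} - present)


# A set like the one used in this incident: every domain is signed.
only_signed = [
    MxEntry("signed-a.example", True, "postfix"),
    MxEntry("signed-b.example", True, "exim"),
]
# Adding one unsigned domain closes the gap.
mixed = only_signed + [MxEntry("unsigned.example", False, "postfix")]

gap_before = missing_dnssec_coverage(only_signed)
gap_after = missing_dnssec_coverage(mixed)
```

Here `gap_before` reports the missing `"unsigned"` case and `gap_after` is empty. The same pattern extends to other attributes worth varying, such as mail software.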