AWS S3 Outage Caused by a Typo in an Internal Maintenance Command

Amazon

What happened

An AWS engineer debugging an S3 billing issue in us-east-1 mistakenly removed too many servers from the index subsystem, which handles object metadata. S3 went down for four hours, taking a significant fraction of the internet with it, including Slack, GitHub, Quora, and many other services.[1]

What went wrong

A maintenance runbook command took an integer argument specifying how many servers to remove, and the value entered was larger than intended. The affected S3 subsystem had not been fully restarted in years, and restarting it took far longer than expected, which significantly extended the outage.[1]

Lesson learned

Operational commands that can cause large-scale impact should require explicit confirmation and enforce hard limits on their blast radius. Regularly exercising restart procedures in production-like environments helps ensure that recovery times match expectations when real incidents occur.
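
As an illustration of a hard blast-radius limit combined with explicit confirmation, here is a minimal Python sketch. It is not AWS's tooling: the names `RemovalPolicy`, `plan_removal`, and `BlastRadiusError`, the 5% cap, and the minimum of 10 remaining servers are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


class BlastRadiusError(Exception):
    """Raised when a removal request exceeds the configured safety limits."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Hypothetical limits, chosen only for illustration.
    max_percent: int = 5      # remove at most 5% of the fleet per command
    min_remaining: int = 10   # never leave fewer than this many servers


def plan_removal(fleet_size: int, requested: int,
                 policy: RemovalPolicy, confirm: str = "") -> int:
    """Validate a removal request; return the count to remove if it is safe."""
    if requested <= 0:
        raise ValueError("requested must be a positive integer")

    # Hard cap on how much capacity one command may remove.
    limit = fleet_size * policy.max_percent // 100
    if requested > limit:
        raise BlastRadiusError(
            f"refusing to remove {requested} servers; limit is {limit}")

    # Floor on the capacity that must remain after the removal.
    if fleet_size - requested < policy.min_remaining:
        raise BlastRadiusError(
            f"removal would leave fewer than {policy.min_remaining} servers")

    # Explicit confirmation: the operator must retype the exact action,
    # so a mistyped number cannot match the confirmation string by accident.
    expected = f"remove {requested} of {fleet_size}"
    if confirm != expected:
        raise PermissionError(f"confirmation must be exactly: {expected!r}")

    return requested
```

With these assumed limits, `plan_removal(1000, 20, RemovalPolicy(), confirm="remove 20 of 1000")` is accepted, while a mistyped request for 200 servers is rejected before any confirmation is even considered. Checking the limits before the prompt means an operator cannot confirm their way past the cap.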

Est. value burned: ~$150M in downstream business losses

Sources

  [1] Amazon, "AWS S3 Outage Caused by a Typo in an Internal Maintenance Command"