AWS S3 Outage Caused by a Typo in an Internal Maintenance Command

Amazon Web Services data center building exterior, representing the cloud infrastructure behind the S3 outage. Image: Amazon Web Services, Inc., via Wikimedia Commons (Apache License 2.0)

What happened

An AWS engineer debugging an S3 billing issue in us-east-1 mistyped a command parameter and removed far more servers than intended from the subsystem that manages index metadata for S3 objects. S3 went down for four hours, taking with it a significant fraction of the internet, including Slack, GitHub, Quora, and many other services.[1]

The AWS S3 status dashboard during the February 2017 outage, caused by an engineer mistyping a parameter in a maintenance command. Image: Bad.Technology archive

What went wrong

A maintenance runbook command accepted an integer argument specifying how many servers to remove, and the value entered was larger than intended. The affected subsystems had not been fully restarted in years and took far longer to come back than expected, which significantly extended the outage.[1]

Lesson learned

Operational commands that can cause large-scale impact should require explicit confirmation and have hard limits on their blast radius. Regularly exercising restart procedures in production-like environments ensures recovery times match expectations when real incidents occur.
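As a minimal sketch of what such guardrails might look like (all names here are illustrative, not AWS internals): a destructive command can enforce a hard cap on the fraction of the fleet it will touch in one invocation, and refuse to run without explicit confirmation.

```python
# Hypothetical guardrails for a destructive maintenance command.
# The names (remove_servers, MAX_REMOVAL_FRACTION) are illustrative,
# not taken from any real AWS tooling.

MAX_REMOVAL_FRACTION = 0.05  # hard blast-radius limit: at most 5% per run


def remove_servers(fleet_size: int, count: int, confirm: bool = False) -> int:
    """Validate and return the number of servers to take out of service."""
    if count <= 0:
        raise ValueError("count must be positive")

    # Cap the blast radius regardless of what the operator typed.
    limit = max(1, int(fleet_size * MAX_REMOVAL_FRACTION))
    if count > limit:
        raise ValueError(
            f"refusing to remove {count} of {fleet_size} servers; "
            f"hard limit is {limit} per invocation"
        )

    # Require an explicit acknowledgement for any destructive action.
    if not confirm:
        raise RuntimeError("destructive action requires confirm=True")

    return count
```

With a cap like this, a mistyped count fails loudly instead of silently draining a critical subsystem; the operator must rerun the command several times to remove a large fraction of the fleet, which gives monitoring a chance to catch the impact early.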

Est. value burned: ~$150M in estimated downstream business losses

Sources

  1. [1]

External links can go dark — pages move, paywalls appear, domains expire. Every source above includes a Wayback Machine snapshot link as a fallback. All citations are best-effort research; if a source contradicts our summary, the primary source takes precedence.