Infrastructure Major Historical

AWS S3 Outage Caused by a Typo in an Internal Maintenance Command

Amazon 28 February 2017 Indexed 16 Mar 2026

What happened

An AWS engineer debugging an S3 billing issue in us-east-1 mistakenly removed too many servers from the index subsystem, which handles object metadata. S3 went down for four hours, taking with it a significant fraction of the internet, including Slack, GitHub, Quora, and many other services.[1]

Figure: The AWS S3 status dashboard during the February 2017 outage, which was caused by an engineer mistyping a parameter in a maintenance command. Image: Bad.Technology archive

What went wrong

A maintenance runbook command took an integer argument specifying how many servers to remove, and the value entered was larger than intended. The S3 subsystem had not been fully restarted in years, so the restart took far longer than expected and extended the outage significantly.[1]

Lesson learned

Operational commands that can cause large-scale impact should require explicit confirmation and have hard limits on their blast radius. Regularly exercising restart procedures in production-like environments ensures recovery times match expectations when real incidents occur.
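The first recommendation can be sketched as a guard that sits in front of any capacity-removal command. This is a minimal illustration under stated assumptions, not AWS's actual tooling: the names `RemovalPolicy`, `plan_removal`, and `remove_servers`, and the specific limits, are hypothetical. It shows two hard limits (a cap on the share of the fleet removed per run and a floor on remaining capacity) plus an explicit confirmation step in which the operator retypes the count.

```python
from dataclasses import dataclass


class BlastRadiusError(Exception):
    """Raised when a removal request exceeds a hard safety limit."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Hypothetical limits; real values would come from capacity planning.
    max_percent_per_run: int  # hard cap on the share of the fleet removed at once
    min_remaining: int        # capacity floor the fleet must never drop below


def plan_removal(fleet_size: int, requested: int, policy: RemovalPolicy) -> int:
    """Validate a removal request and return the count that may be removed."""
    if requested <= 0:
        raise ValueError("requested must be a positive integer")
    if requested > fleet_size:
        raise BlastRadiusError(
            f"requested {requested} servers but the fleet has only {fleet_size}"
        )
    # Integer arithmetic avoids floating-point surprises at the boundary.
    max_allowed = fleet_size * policy.max_percent_per_run // 100
    if requested > max_allowed:
        raise BlastRadiusError(
            f"requested {requested} exceeds the per-run cap of {max_allowed}"
        )
    if fleet_size - requested < policy.min_remaining:
        raise BlastRadiusError(
            f"removal would leave {fleet_size - requested} servers, "
            f"below the floor of {policy.min_remaining}"
        )
    return requested


def remove_servers(
    fleet_size: int, requested: int, policy: RemovalPolicy, typed_confirmation: str
) -> int:
    """Return the new fleet size, or raise if a limit or the confirmation fails."""
    count = plan_removal(fleet_size, requested, policy)
    # Explicit confirmation: the operator must retype the exact count.
    if typed_confirmation.strip() != str(count):
        raise BlastRadiusError("confirmation did not match the requested count")
    return fleet_size - count


if __name__ == "__main__":
    policy = RemovalPolicy(max_percent_per_run=5, min_remaining=90)
    print(remove_servers(100, 3, policy, typed_confirmation="3"))
    try:
        remove_servers(100, 30, policy, typed_confirmation="30")
    except BlastRadiusError as err:
        print(f"refused: {err}")
```

The limits are checked before the confirmation prompt so that a mistyped value is rejected even when the operator confirms it, which is the failure mode described above.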

Est. value burned: ~$150M in downstream business losses

Sources

[1] Amazon, Wikipedia
    AWS S3 Outage Caused by a Typo in an Internal Maintenance Command, Wayback Machine snapshot

