AWS S3 Outage Caused by a Typo in an Internal Maintenance Command

What happened
An AWS engineer debugging an S3 billing issue in us-east-1 mistakenly removed far more servers than intended from the subsystem that manages index metadata for S3 objects. S3 was down for roughly four hours, taking with it a significant fraction of the internet, including Slack, GitHub, Quora, and many other services.[1]
What went wrong
A maintenance runbook command accepted an integer argument specifying how many servers to remove, and the value entered was larger than intended. Compounding the problem, the affected subsystem had not been fully restarted in years, and the restart took far longer than expected, significantly extending the outage.[1]
Lesson learned
Operational commands capable of large-scale impact should require explicit confirmation and enforce hard limits on their blast radius. Restart procedures should also be exercised regularly in production-like environments so that recovery times match expectations when real incidents occur.
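The two safeguards above can be sketched in code. The following is a minimal, hypothetical illustration, not AWS's actual tooling: the function name remove_servers, the cap MAX_REMOVAL_FRACTION, and the confirmation scheme are all assumptions made for the example.

```python
# Hypothetical guard for a destructive capacity-removal command.
# All names here are illustrative; they are not real AWS APIs.

MAX_REMOVAL_FRACTION = 0.05  # hard blast-radius cap: at most 5% of the fleet per invocation


def remove_servers(fleet_size: int, requested: int, confirm: str) -> int:
    """Validate a request to remove servers and return the count actually removed.

    Enforces a hard limit on how many servers one command can take out,
    and requires the operator to retype the count, catching fat-fingered input.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive integer")

    # Hard limit: a single invocation can never exceed the cap,
    # no matter what number the operator types.
    limit = max(1, int(fleet_size * MAX_REMOVAL_FRACTION))
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers; "
            f"hard limit is {limit} per invocation"
        )

    # Explicit confirmation: the operator must retype the exact count.
    if confirm != str(requested):
        raise ValueError("confirmation does not match requested count; aborting")

    return requested
```

With a guard like this, a mistyped argument either exceeds the cap and is refused outright, or fails the retyped confirmation, so a single typo cannot take down a large fraction of the fleet.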