AT&T Long-Distance Collapse: One-Line Software Bug Cascades Through 114 Switches, Blocks 75 Million Calls

What happened

On 15 January 1990, AT&T's long-distance telephone network suffered a nine-hour collapse that blocked approximately 75 million calls, around 60% of all US long-distance traffic. The outage began at a single switching centre in New York and cascaded within minutes through all 114 of AT&T's 4ESS electronic switching systems nationwide. The cause was a single-line bug in error-recovery code, introduced in a routine software update three weeks earlier. It was the first major software-induced cascade failure of a national telecommunications network, and its pattern has recurred in many large-scale network outages since.[1]

What went wrong

A routine software update to the 4ESS switching system introduced a bug in the error-recovery code path. When a switch experienced a brief, normal signalling anomaly, it would recover and then send a status message to adjacent switches. The bug caused that recovery message to trigger a brief reset in each receiving switch, which then sent its own recovery messages, which triggered further resets: a chain reaction that spread through all 114 switches in minutes. Testing had not caught the bug because the precise sequence of events that triggers it (anomaly, recovery, heavy traffic) had not been simulated. A similar pattern, in which a software fault turns a local problem into a widespread outage, recurred in the 2003 Northeast Blackout and the 2024 CrowdStrike outage.[1]
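The reset-and-announce loop described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the ring topology, the round-based timing, and the name `simulate_cascade` are all hypothetical and are not taken from the 4ESS software or AT&T's actual network layout. It shows how a single flaw in the handling of recovery messages is enough for one local reset to reach every switch.

```python
def simulate_cascade(num_switches=114, buggy=True, rounds=60):
    """Toy model of recovery messages spreading between switches.

    Assumption (not from the source): switches form a ring, each linked
    to its two neighbours, and messages are delivered in discrete rounds.
    Returns the number of resets each switch went through.
    """
    neighbours = {
        i: [(i - 1) % num_switches, (i + 1) % num_switches]
        for i in range(num_switches)
    }
    resets = [0] * num_switches

    # Switch 0 hits a routine anomaly, resets once, and recovers.
    resets[0] = 1
    announcing = {0}  # switches sending "I have recovered" this round

    for _ in range(rounds):
        if not announcing:
            break  # no messages in flight, the network is quiet
        receivers = {n for s in announcing for n in neighbours[s]}
        if not buggy:
            # Correct behaviour: receivers note the recovery and carry on.
            announcing = set()
            continue
        # Hypothetical fault: handling the recovery message forces a reset,
        # after which the receiver announces its own recovery.
        for r in receivers:
            resets[r] += 1
        announcing = receivers
    return resets


if __name__ == "__main__":
    healthy = simulate_cascade(buggy=False)
    faulty = simulate_cascade(buggy=True)
    print("switches reset without the bug:", sum(1 for c in healthy if c))
    print("switches reset with the bug:", sum(1 for c in faulty if c))
```

With `buggy=False` the fault stays local to the originating switch. With `buggy=True` the resets never die out on their own, which is why the loop is capped by `rounds`.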

Lesson learned

Error-recovery code paths are among the most critical and least tested sections of network software. A bug in the code that handles failures can turn a local fault into a nationwide cascade. Staged rollout with canary deployment, chaos-engineering tests of recovery paths under realistic load, and mandatory simulation of failure sequences have since become standard industry practice. More than three decades later, the CrowdStrike outage showed that the lesson was still not universally applied.
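A staged rollout with a canary stage can be sketched as follows. This is a minimal illustration, not AT&T's or any vendor's actual deployment process: `staged_rollout`, its parameters, and the stage fractions are hypothetical names and values chosen for the example. The idea is that an update reaches a small fraction of switches first, and the rollout halts as soon as a wave looks unhealthy, which limits how far a faulty update can spread.

```python
def staged_rollout(targets, stage_fractions, deploy, is_healthy):
    """Deploy to growing waves of targets; halt at the first unhealthy wave.

    targets:         ordered list of switch identifiers (hypothetical)
    stage_fractions: cumulative fractions, e.g. [0.01, 0.1, 0.5, 1.0]
    deploy:          callable that applies the update to one target
    is_healthy:      callable that returns True if a target looks healthy
    """
    done = 0
    for fraction in stage_fractions:
        if done == len(targets):
            break
        # Each stage covers at least one new target, never more than all.
        upto = min(len(targets), max(done + 1, int(len(targets) * fraction)))
        wave = targets[done:upto]
        for target in wave:
            deploy(target)
        if not all(is_healthy(target) for target in wave):
            return {
                "completed": False,
                "deployed": targets[:upto],
                "halted_at": fraction,
            }
        done = upto
    return {
        "completed": done == len(targets),
        "deployed": targets[:done],
        "halted_at": None,
    }


if __name__ == "__main__":
    switches = list(range(114))
    updated = []
    # Pretend the update is faulty: every updated switch fails its check.
    result = staged_rollout(
        switches, [0.01, 0.1, 0.5, 1.0], updated.append, lambda s: False
    )
    print("switches touched by the faulty update:", len(result["deployed"]))
```

In practice the health check would need to exercise the recovery path under load, since a check that only observes normal operation would not have caught a fault like the one described above.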

Est. value burned: ~$70M in lost AT&T revenue and FCC penalties. 75 million calls blocked over nine hours, including emergency services in affected areas.

Sources

  1. [1]
