Epic Bugs: AT&T 1990 Crash


Just as today, back in 1990 AT&T were the largest provider of long-distance telephone services in the US.

In the early afternoon of Monday, 15th January 1990, the maintenance staff at AT&T's network monitoring facility in Bedminster, N.J. started seeing alarm signals on their panels: a few red dots at first, then, within seconds, the whole board turned red.

The board stayed red for nine long hours, with over 50% of the long-distance calls placed by AT&T's customers failing to connect. AT&T's own loss of business that day was estimated at $60 million; the cumulative loss to customers whose businesses relied on their telephone service was many times higher.

So how could all 114 switches in their distributed network fail almost simultaneously? Hackers were suspected at first, prompting the company to bring in law enforcement officials to help find the source of the failure.

AT&T were known for their dedication to reliability. No expense was spared to make the system bulletproof, and all conceivable doom scenarios were accounted for. In particular, the network was designed along the principle of “paranoid democracy”: geographically dispersed switches monitored each other, ready to take over at the first sign of a peer being in trouble.

The problem was eventually traced to a piece of code, introduced a few weeks earlier, that controlled how a switch behaved when it received a distress signal from another switch. In essence, a switch that was struggling to process all its signals would go into reset (that good old cure for all computer problems), but would first inform the next switch in line to take over and not bother it while it was resetting. Unfortunately, a one-line bug in this latest upgrade made the receiving switch believe that it, too, needed to reset and should pass its traffic on to the next switch in line. Before long, switches all over the network were resetting over and over, and none of them was doing any useful work.
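
Publicly reported post-mortems pinned the fault on a misplaced C break statement sitting inside an if clause that was itself nested in a switch statement. The snippet below is a simplified, hypothetical sketch of that pattern, not AT&T's actual code: all names, message types and the surrounding bookkeeping are invented for illustration.

    /*
     * Hypothetical sketch of a "misplaced break" bug of the kind publicly
     * blamed for the 1990 outage. All names and logic are invented.
     */
    #include <stdio.h>

    enum msg { MSG_STATUS, MSG_PEER_BACK_IN_SERVICE };

    struct switch_state {
        int peer_down;        /* do we believe the neighbouring switch is down? */
        int messages_logged;  /* bookkeeping the handler is meant to update     */
    };

    static void handle(struct switch_state *s, enum msg m)
    {
        switch (m) {
        case MSG_PEER_BACK_IN_SERVICE:
            if (s->peer_down) {
                s->peer_down = 0;
                break;            /* meant to leave only the if clause, but in C
                                     this leaves the whole switch statement ...  */
            }
            s->messages_logged++; /* ... so this line is silently skipped        */
            break;
        case MSG_STATUS:
            s->messages_logged++;
            break;
        }
    }

    int main(void)
    {
        struct switch_state s = { .peer_down = 1, .messages_logged = 0 };

        handle(&s, MSG_PEER_BACK_IN_SERVICE);  /* bookkeeping skipped here */
        handle(&s, MSG_STATUS);

        /* The counter no longer matches the two messages actually handled;
           in the real network, inconsistent internal state of this sort is
           what triggered yet another reset and propagated the failure.     */
        printf("messages_logged = %d (expected 2)\n", s.messages_logged);
        return 0;
    }

The trap is that break is legal in both places: the compiler accepts it without complaint, the intended if-only exit and the actual switch-wide exit look identical on the page, and the damage only shows up later as inconsistent internal state.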

"We didn't live up to our own standards of quality. We didn't live up to our customers' standards of quality. And it's as simple as that," Company Chairman Robert E. Allen told the press regarding the issue.

Once the dust settled, industry analysts had a chance to look at the “disaster” with a cool head. Certainly, things could have been done better. The code that caused the error was written in C, where this kind of one-line slip is easy to overlook; a higher-level language would have made the bug easier to spot. Equally, a hardware reset should have been treated as a last resort, not as the default remedy. And, of course, more testing could have helped.

But AT&T’s drive for reliability and redundancy through distribution, although it was what allowed this particular failure to spread, should not have been the baby thrown out with the bathwater. On many other occasions, that same architecture stopped major disasters from happening. A new architecture can, and frequently does, bring its own risks, but those must always be weighed against the overall benefits it brings.
