Epic Bugs: The Big Dark of 2003

How could a hot day, a tall tree and a software bug save your relationship?

Between them, they can cause a blackout which will stop you watching TV, make you talk to your partner by candlelight, and remind you of the good old days when you used to laugh and have fun instead of watching Netflix the whole damn evening. Then, one thing leads to another.

So if your relationship was saved by the Northeast Blackout of 2003, do send a "thank you" note to FirstEnergy, the main protagonist of this blackout. They’ll appreciate it, trust me.

If, on the other hand, you were on the receiving end of this blackout — say your business's losses that day were part of the overall $6 billion loss to the economy — then feel free to send FirstEnergy a different type of note. They're used to it by now, trust me.

How the system works

Electric power producers tend to primarily serve a specific geographic area, but they are also connected into a country-wide grid so that they can draw electricity from others when supply runs short, or feed electricity to others when producing a surplus.

Most of the balancing happens automatically, but human operators can and do step in to make sure that no specific power plant or piece of transmission infrastructure (e.g. power lines) gets overloaded. Safety mechanisms are in place to shut an overloaded component down to protect it.

The bad news is as follows: say a transmission cable has a capacity of 100 of some units, and its current load is quite high, say 99 units. You add a further 2 units, its safety relay trips, and the cable goes out of action — creating an extra load for the rest of the grid of not just 2 units but the full 101 units it was carrying. See? That's how cascading grid failures happen.
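The arithmetic above can be turned into a toy cascade model. This is a hypothetical sketch, not how real grid software models load flow; the line names and numbers are made up for illustration:

```python
def cascade(lines, extra_load):
    """Toy cascade model: add extra_load to the grid, trip any line
    pushed past capacity, and dump its ENTIRE load back on the
    survivors (not just the excess). `lines` is mutated in place."""
    tripped = []
    while extra_load > 0 and lines:
        share = extra_load / len(lines)   # spread the orphaned load evenly
        for line in lines.values():
            line["load"] += share
        extra_load = 0
        for name in list(lines):
            if lines[name]["load"] > lines[name]["capacity"]:
                extra_load += lines[name]["load"]  # the whole 101, not the 2
                tripped.append(name)
                del lines[name]
    return tripped

# Hypothetical three-line grid, echoing the 99-out-of-100 example
grid = {
    "A": {"capacity": 100, "load": 99},
    "B": {"capacity": 100, "load": 50},
    "C": {"capacity": 100, "load": 50},
}
gone = cascade(grid, 6)  # 2 extra units each push A over; its 101 units then sink B and C
```

A mere 6 extra units take down the whole toy grid: line A trips at 101, and its full load — not the 2-unit excess — overwhelms B and C.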

What caused the 2003 blackout?

Factor 1: Hot day

August 14, 2003 was a very hot day in the northeastern United States, with temperatures approaching 90°F. People everywhere turned up their air conditioning, causing unusually high demand for electric power.

Factor 2: Untrimmed trees

With high air temperatures and/or increased power demand, power transmission cables heat up and expand. They become longer and sag, and as they sag they get closer to the ground. When the trees beneath them are not trimmed back, a sagging cable can make contact with them, causing a fault that trips the safety relays and isolates the cable from the grid.

FirstEnergy had a schedule in place to trim trees every five years, but for one reason or another, some of the trees under the power lines were left unattended for much longer.

Factor 3: Race condition bug

In software, a "race condition" occurs when two or more pieces of code access shared data and try to change it at the same time. It is similar to buying an item from a not-so-smart online shop: just as your PayPal payment goes through, the site tells you that someone else has in the meantime bought the last item in stock.
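The read-modify-write pattern behind most race conditions fits in a few lines. This is a deliberately slowed-down Python sketch, not the XA21's actual C/C++ code; the sleep merely widens the race window so the bug fires reliably instead of once in a million runs:

```python
import threading
import time

counter = 0                  # shared state, like "items left in stock"
lock = threading.Lock()

def unsafe_add_one():
    global counter
    value = counter          # 1. read the shared value
    time.sleep(0.05)         # 2. widen the race window ("at the same time")
    counter = value + 1      # 3. write back a now-stale result

def safe_add_one():
    global counter
    with lock:               # the fix: make the read-modify-write atomic
        value = counter
        time.sleep(0.05)
        counter = value + 1

def run(worker, n=5):
    """Run n workers concurrently and return the final counter."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

unsafe_total = run(unsafe_add_one)  # usually 1: most updates are lost
safe_total = run(safe_add_one)      # always 5
```

Without the lock, all five threads read the same stale value and overwrite each other's work; with it, each increment lands. Remove the sleep and the unsafe version passes most test runs — which is exactly why such bugs survive testing.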

The computers managing the power system, including the monitoring at FirstEnergy, were purpose-built XA21 Unix systems produced by General Electric. Unknown at the time, this system harboured a race condition bug, buried deep within millions of lines of C and C++ code, just waiting to strike at the right moment.

Sequence of events on August 14, 2003

At 1:31 PM, the Eastlake, Ohio power plant, owned by FirstEnergy, shuts down because of problems caused by high demand. Its load spreads to its neighbours.

Due to the heat and the extra load shifted from Eastlake, the first 345-kilovolt transmission line sags, comes into contact with a tree, and trips its safety relays.

The GE XA21 monitoring system receives the signal about the cable outage. However, it stays silent and fails to alert the operators. The race condition bug has struck: the computer is stuck in an infinite loop while alarm signals queue up unprocessed.
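The symptom — a stuck processing thread leaving alarms queued and unseen — can be sketched as a single consumer hanging on an alarm queue. This is a hypothetical Python illustration, not the XA21's actual design; the event names and timings are invented:

```python
import queue
import threading
import time

alarms = queue.Queue()     # incoming alarm events from the field
delivered = []             # alarms that reached the operators' screens
stuck = threading.Event()  # flips on to simulate the bug striking

def alarm_processor():
    """Single consumer thread, standing in for the alarm subsystem."""
    while True:
        event = alarms.get()
        if stuck.is_set():
            while True:           # the simulated bug: spin forever,
                time.sleep(0.01)  # never touching the queue again
        delivered.append(event)

threading.Thread(target=alarm_processor, daemon=True).start()

alarms.put("line 1 tripped")  # processed normally, operators see it
time.sleep(0.2)
stuck.set()                   # the race condition fires here
alarms.put("line 2 tripped")  # the consumer hangs on this one
alarms.put("voltage drop")    # ...and this never leaves the queue
time.sleep(0.2)
```

The system never crashes and never complains; alarms simply pile up behind the stuck thread while the operators' screens stay quiet — which is why nobody at FirstEnergy knew anything was wrong.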

At 3:32 PM the power shifted by the first line failure causes a second transmission line to sag and hit the trees. That transmission line also goes out, causing a drop in voltage in the Cleveland area.

The power controllers at FirstEnergy are puzzled by the drop in voltage without any accompanying system alarms, but choose to trust their computers and do not alert the controllers in the nearby states. As a result, FirstEnergy's power neighbours cannot prepare for the increased power demand coming their way from the Ohio failures.

The cascade of failures travels through the FirstEnergy network.

Even at 3:40 PM there is still an opportunity to save the grid by cutting the power to Cleveland, but FirstEnergy fails to do so.

Finally, the whole Ohio power system collapses, causing cascading power failures, first in Michigan and then in most of the other northeastern regions. It all happens in a matter of minutes, even seconds.

By 4:13 PM it was all over. 256 power plants had gone offline, and 40 million people in the U.S. and Canada were without power.

It took until the next evening to restore power in the region.

Any lessons learned?

Other than not trimming trees in time, FirstEnergy was criticised for many other "inadequacies", including that it "did not recognize or understand the deteriorating condition of its system". (Interestingly, the state commission could not penalise them, as there were no legal mechanisms in place for this sort of situation. Basically, there was no obligation on FirstEnergy to adhere to reliability standards.)

As we know, they did not understand the deteriorating condition of their system because of the race condition bug in the General Electric system used by FirstEnergy and over 100 other power companies.

The trouble with a race condition bug is that it is hard to detect. It falls into the Heisenbug category of defects, which almost never show themselves under testing conditions. It requires two independent events to coincide within a matter of milliseconds, and such conditions rarely arise during a structured system test, especially if one is not explicitly looking for them.

Mike Unum, the manager of commercial solutions at GE Energy said in their defence: "I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software."

Although there is a general consensus that for certain types of bugs no system is bulletproof, experts will argue that we are looking in the wrong direction. Instead of focusing your software QA efforts on testing and bug discovery, you need to embed QA in all aspects of the software development life cycle, and especially in the design and overall architecture. In short, what we really need are fault-tolerant rather than fault-free systems.

Another lesson learned in this case is that we need a mind shift in how we work with computerised systems. Had the power operators at FirstEnergy been sceptical enough, they would have trusted the voltage-drop readings more and taken investigative action, irrespective of the fact that no alarms had fired.

Tom Kropp, manager of the enterprise information security program at the Electric Power Research Institute, says that "...if we see a system that's behaving abnormally well, we should probably be suspicious, rather than assuming that it's behaving well."

Or, if you like: if it looks like a fault and smells like a fault, it probably is a fault, no matter what the computer says.
