January 17, 2005
It's the software reliability, stupid
I once sat in a conference room while a so-called "system reliability" expert explained that software reliability problems are not really such a big concern, then went on to describe his actual area of expertise, which was server hardware reliability, not software. So I know exactly the mentality that let this Comair disaster happen. They spent millions on high-speed, reliable servers, but never upgraded, or analyzed for faults, the software they had been using for twenty years.
From an anonymous poster at Slashdot:
The computer system in question runs AIX. The box itself is still up and running just fine; this is purely an application error. This application was not written in-house at Comair, but by another large aerospace company -- SBS (http://www.sbsint.com/, owned by Boeing). This bit of software does not use an external database, it tracks everything itself. It is a dedicated system responsible only for flight crew assignments. (The blather in the original submission about passenger reservations is way off-base. Those functions are handled by a completely different system.)
The great majority of Comair's traffic flows through the midwest, and the central base of operations is in Cincinnati. The midwest was hit by a major snowstorm this week, causing many, many crew reassignments. It appears right now that the application in question has a hard limit of 32,000 changes per month (ouch). Consider that Comair runs 1,100 flights a day and there are usually 3 crew members on each aircraft. A big storm like this can cause problems for days after the snow stops falling. That's a whole lot of crew changes.
In Comair's defense, this has never happened before and is unlikely to happen again. The crew system was already on the chopping block long before this incident, with its replacement scheduled to go live in January. If this freak storm had happened a month later, this likely never would have occurred.
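To put the poster's numbers side by side, here is a back-of-envelope sketch in Python. The flight count, crew size and 32,000 limit come from the quote above; the idea that each reassigned crew member counts as exactly one "change", and the `changed_fraction` knob, are my own assumptions for illustration and not anything the poster or Comair stated.

```python
FLIGHTS_PER_DAY = 1_100        # from the Slashdot post
CREW_PER_FLIGHT = 3            # "usually 3 crew members on each aircraft"
MONTHLY_CHANGE_LIMIT = 32_000  # the poster's rounded figure


def crew_assignments_per_day():
    """Crew-member-to-flight assignments scheduled on a normal day."""
    return FLIGHTS_PER_DAY * CREW_PER_FLIGHT


def days_until_limit(changed_fraction, already_used=0):
    """Days of disruption needed to use up the remaining monthly budget.

    changed_fraction is a hypothetical share of daily assignments that
    get rescheduled; one rescheduled assignment is assumed to be one change.
    """
    changes_per_day = crew_assignments_per_day() * changed_fraction
    remaining = MONTHLY_CHANGE_LIMIT - already_used
    return remaining / changes_per_day


if __name__ == "__main__":
    print(crew_assignments_per_day())
    print(days_until_limit(changed_fraction=1.0))
    print(days_until_limit(changed_fraction=0.5, already_used=25_000))
```

Under those assumptions there are 3,300 assignments a day, so even a completely empty monthly budget lasts less than ten days if every assignment has to be touched, and far less once most of the month's allowance is already spent.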
More confirming details from the Cincinnati Enquirer:
ERLANGER - The technology was working on borrowed time. And on a frigid Christmas Eve, time ran out.
When Comair's crew-scheduling computer system failed, it caused a crisis that grounded the airline's 1,160 daily flights during one of the busiest travel times of the year.
The computer failure and Midwest storm disrupted travel for more than 100,000 holiday travelers at Cincinnati/Northern Kentucky International Airport and dozens of airports around the country.
The grounding was a public relations disaster not only for Comair but for parent Delta Air Lines. It took four days for Comair to get all its jets back in the air, and about that long to clear the piles of luggage and passengers stranded at the airport here.
And now, a federal investigation is under way.
One week after the fleet was grounded, the probe is just beginning. Comair officials initially said that no one could have prepared for the unprecedented chain of events that occurred. But interviews with dozens of company workers, the computer system manufacturer and outside technology experts raise questions about why an aging computer system was still in use - and why its backup failed as well.
It gets better:
However, it was considered state-of-the-art, experts and Jeppesen officials say. Still, it could handle only 32,767 transactions in a month. A "transaction" is counted each time a change or new entry is made to crew schedules. Count limits were common when the system was purchased because computer memory was expensive - something that's no longer true.
Last week, the system was already near its memory limit, and the end of the month was still a week away. When the storm hit, it caused a huge spike in crew schedule changes. The system worked until it reached its limit on Christmas Eve. Then it simply refused to take more data.
Jeppesen spokesman Mike Pound says six SBS advisers worked via phone and online with Comair tech workers in Erlanger throughout the night and came up with a time-intensive solution: Rebuilding the system from scratch in a "fresh computing environment."
"At about 6 a.m. Christmas morning, that re-entry began, but unfortunately it wasn't something that could be done quickly," Pound says.
Comair had no choice but to shut down the airline while the system was rebuilt. Last Sunday, the day after Christmas and typically one of the busiest travel days of the year, only about 15 percent of the company's flights were in the air. Comair didn't fully recover until Wednesday....
Another key question about the week's events is what happened to the backup system. Jeppesen spokesman Pound referred questions to Comair, which would not comment.
"If they didn't have something in place when it happened to fix it, it was already too late," says Ellis Johnson, a Georgia Tech industrial and systems engineering professor with extensive experience in airline crew computer programming. "Something like this can't be done on the fly. And it makes no sense that there was no recovery or redundant systems."
Adds Anbil of OHM Technologies: "That is what is blowing everybody's mind - what happened to the backup in the first place. Nobody goes completely down like that without backup."
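The Enquirer's 32,767 is exactly 2^15 - 1, the largest value a signed 16-bit integer can hold. Neither article says how the counter was actually stored, so treat the following as my guess at the kind of limit involved, not a description of the real SBS code. The sketch is in Python, and `TransactionLog` and `record_change` are made-up names.

```python
import struct

INT16_MAX = 2**15 - 1  # 32,767, the largest signed 16-bit value


def fits_in_int16(n):
    """Return True if n can be stored in a signed 16-bit field."""
    try:
        struct.pack("<h", n)  # "h" is struct's 2-byte signed short
        return True
    except struct.error:
        return False


class TransactionLog:
    """Hypothetical change log whose monthly counter is a signed 16-bit field."""

    def __init__(self):
        self.count = 0

    def record_change(self):
        # Refuse the write instead of letting the counter wrap negative.
        if not fits_in_int16(self.count + 1):
            raise OverflowError("monthly transaction limit reached")
        self.count += 1


if __name__ == "__main__":
    log = TransactionLog()
    for _ in range(INT16_MAX):
        log.record_change()
    try:
        log.record_change()
    except OverflowError as err:
        print(f"change {log.count + 1} rejected: {err}")
```

Once the log holds 32,767 changes, the next call raises `OverflowError`, which is roughly the "simply refused to take more data" behavior the article describes. A wider counter, or an alarm well before the ceiling, is the kind of fix a fault analysis of the software would have turned up.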
The need for quality software still far exceeds the world supply. Unfortunately for all the unemployed software engineers out there, a focus on quarterly profits means the price of that quality still gives most companies sticker shock.
Posted by Mike at January 17, 2005 11:53 PM