Delta Air Lines' failure to keep its data center flying at a cruising altitude this week holds lessons for enterprise IT pros.
The airline's continued dependence on legacy applications and its failure to quickly and successfully switch over to a backup data center are the two broader culprits behind six hours of systems downtime on Monday, Aug. 8, and the resulting flight cancellations and delays that lingered for days.
This is at least the fourth time in the past year that a data center outage has hit a major airline, following incidents at United Airlines, Inc., JetBlue Airways Corp. and Southwest Airlines Co. It is yet another teaching moment for data center IT pros to keep their own houses in order.
Diagnosis: A legacy of problems
This latest incident started early Monday morning when a critical power control module at a Delta data center malfunctioned, which caused a surge to the transformer and a loss of power, Delta COO Gil West said in a statement posted to the airline's website. Power was quickly restored, but "critical systems and network equipment didn't switch over to backups," he said, and the systems that did switch over were unstable.
Most airlines are dealing with a combination of legacy systems -- some with operating systems dating back to the 1950s -- plus mainframes and more modern, open systems for web and mobile applications, according to Robert Mann Jr., an airline industry analyst at R.W. Mann & Company, Inc. in Port Washington, N.Y. He has managed multiple airline systems in his career, for American Airlines, Inc., Pan American World Airways, Inc. and Trans World Airlines, Inc.
Despite the challenges of maintaining that patchwork setup, it is "garden-variety hardware" that has been blamed for many recent airline outages, including a router failure at Southwest Airlines earlier this summer. Mann called a switchgear failure like the one Delta experienced a "low probability occurrence." Airlines should do a better job of monitoring hardware so they know when a failure could be imminent, he said.
It is unclear whether the Delta data center that suffered the outage had a redundant power system, said Julius Neudorfer, CTO and founder at North American Access Technologies, Inc. in Westchester, N.Y.
"Is it a 2N environment or an old mainframe environment that was never upgraded?" Neudorfer asked, noting that it could have had a single point of failure. "There is a big difference in an upper tier versus an older design."
One thing seems certain: As the airline industry and the number of devices accessing airline IT infrastructure grow, similar IT disruptions will become more common, Mann predicted.
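Neudorfer's question about a 2N environment versus a single point of failure can be put in rough numbers. The following is a back-of-the-envelope sketch, not a description of Delta's facility: it assumes each power path is available 99.9% of the time (an illustrative figure) and that the two paths in a 2N design fail independently, an idealization that shared switchgear or a common fault would break.

```latex
\begin{align*}
A_{\text{single}} &= A = 0.999 \\
A_{\text{2N}} &= 1 - (1 - A)^2 = 1 - (0.001)^2 = 0.999999
\end{align*}
```

Under those assumptions, expected downtime falls from about 8.76 hours a year on a single path to roughly 32 seconds a year with two independent paths. The benefit depends on the backup path actually taking the load when the primary fails.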
Takeaways for enterprise IT: Plan, test and share
After seeing the effect of this and similar outages -- not only on Delta, but on its customers and their businesses -- enterprises should closely examine their infrastructure and operations and make sure they can sustain a hit from the most likely causes of failure. The first checklist item: have a backup data center in place and test the failover plan on a regular basis.
Robert Johnson, executive vice president at Vision Solutions, Inc. in Irvine, Calif., which has worked with Delta in the past on data protection for IBM Power Systems, said he sees customers buy backup systems and protection but remain exposed because they don't test those systems often enough. As a result, "when they have a failure, they go into a crisis mode," he said. And it's not only the systems that need testing. "If there are employees involved in this who haven't been trained properly or haven't participated in the testing, when it does happen, people are scrambling around, so many things can go wrong," he said.
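One way to keep both halves of that advice visible is to track failover drills the same way other operational work is tracked. The Python sketch below is a minimal illustration, not a tool any of the sources describe: the system names, the staff names and the 90-day interval are all hypothetical. It flags systems with no recent successful drill and on-call staff who have never taken part in one.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class FailoverDrill:
    """One rehearsal of switching a system over to its backup site."""
    system: str
    run_on: date
    succeeded: bool
    participants: set = field(default_factory=set)


def overdue_systems(systems, drills, today, max_age=timedelta(days=90)):
    """Return systems with no successful drill within max_age of today.

    The 90-day default is an assumed policy, not an industry rule.
    """
    latest = {}
    for drill in drills:
        if not drill.succeeded:
            continue  # a failed drill does not count as a tested failover
        if drill.system not in latest or drill.run_on > latest[drill.system]:
            latest[drill.system] = drill.run_on
    return sorted(
        s for s in systems
        if s not in latest or today - latest[s] > max_age
    )


def untrained_staff(on_call, drills):
    """Return on-call staff who have not taken part in any drill."""
    trained = set().union(*(d.participants for d in drills))
    return sorted(set(on_call) - trained)
```

A report like this can run on a schedule, so that a stale or failed drill shows up well before a real outage.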
Like many enterprises, Delta likely still depends on its own data center because many of its workloads are not cloud ready, said Gary Sloper, vice president of global sales engineering at internet performance management company Dyn, Inc. in Manchester, N.H.
"From a planning and execution standpoint, you need to make sure you have backup plans for the legacy workloads that aren't cloud ready," said Sloper, whose experience includes time at CenturyLink and ColoSpace, Inc.
A hybrid environment that makes use of cloud computing could help avoid disruptions like the one suffered by Delta, Sloper said. Cloud computing could help disperse workloads closer to the users and also help mitigate the risk of failure by quickly standing up new instances or using other instances that are load balanced.
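The load-balancing idea Sloper describes can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not any vendor's API: the instance names are hypothetical, and `is_healthy` stands in for whatever health check an operator actually uses. The function scans a pool of instances and returns the first healthy one, or `None` when the whole pool is down, which would be the cue to stand up a new instance elsewhere.

```python
from typing import Callable, Optional, Sequence


def pick_instance(
    instances: Sequence[str],
    is_healthy: Callable[[str], bool],
    start: int = 0,
) -> Optional[str]:
    """Return the first healthy instance, scanning round-robin from start.

    Returns None when every instance fails its health check (or the pool
    is empty), signalling that new capacity is needed.
    """
    n = len(instances)
    for offset in range(n):
        candidate = instances[(start + offset) % n]
        if is_healthy(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical pool: one on-premises site plus two cloud regions.
    pool = ["onprem-dc", "cloud-east", "cloud-west"]
    down = {"onprem-dc"}  # simulate the on-premises site losing power
    print(pick_instance(pool, lambda name: name not in down))
```

With the on-premises site marked as down, traffic in this sketch moves to the next healthy instance in the pool. Real failover also has to handle data replication and session state, which this example leaves out.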
"That takes a lot of planning, but it is a cultural challenge, too," Sloper said. "There is not a playbook about how to deliver a hybrid infrastructure."
If more enterprises that suffer data center outages shared the cause and analysis publicly, operations would improve for everyone, said Lee Kirby, president of Uptime Institute LLC, a data center organization best known for its Tier standards. Uptime runs a closed-door network that lets data center operators learn from each other through an outage reporting system.
"These four airlines -- it would be interesting if they would share more information" about their incidents and resolutions, Kirby said, but "their marketing departments would squash that."
About the author:
Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT.