
BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
Delta outage raises backup data center, power questions
Another outage at an airline data center offers yet another lesson about the need to fail over to a backup data center and bounce back quickly after a power problem.
Delta Air Lines' failure to keep its data center flying at a cruising altitude this week holds lessons for enterprise IT pros.
The airline's continued dependence on legacy applications and its failure to quickly and successfully switch over to a backup data center are the two broader culprits behind six hours of systems downtime on Monday, Aug. 8, and the resulting flight cancellations and delays that lingered for days.
This is at least the fourth instance in the past year where a major airline has been affected by a data center outage, with other incidents affecting United Airlines, Inc., JetBlue Airways Corp. and Southwest Airlines Co. As before, it's yet another teaching moment for all data center IT pros to keep their own houses in order.
Diagnosis: A legacy of problems
This latest incident started early Monday morning when a critical power control module at a Delta data center malfunctioned, which caused a surge to the transformer and a loss of power, Delta COO Gil West said in a statement posted to the airline's website. Power was quickly restored, but "critical systems and network equipment didn't switch over to backups," he said, and the systems that did switch over were unstable.
Most airlines are dealing with a combination of legacy systems -- some with operating systems dating back to the 1950s -- plus mainframes and more modern, open systems for web and mobile applications, according to Robert Mann Jr., an airline industry analyst at R.W. Mann & Company, Inc. in Port Washington, N.Y. He has managed multiple airline systems in his career, for American Airlines, Inc., Pan American World Airways, Inc. and Trans World Airlines, Inc.
Despite the challenges of maintaining that patchwork setup, it is "garden-variety hardware" that has been blamed recently for many airline outages, including a router for Southwest Airlines earlier this summer. Mann called the probability of switchgear failing, as experienced by Delta, a "low probability occurrence." Airlines should do a better job monitoring hardware to know when failure could be imminent, he said.
It is unclear whether the Delta data center that suffered the outage had a redundant power system, said Julius Neudorfer, CTO and founder at North American Access Technologies, Inc. in Westchester, N.Y.
"Is it a 2N environment or an old mainframe environment that was never upgraded?" Neudorfer asked, noting that it could have had a single point of failure. "There is a big difference in an upper tier versus an older design."
One thing seems certain: as the airline industry and the number of devices accessing airline IT infrastructure grow, so will similar IT disruptions, predicts Mann.
Takeaways for enterprise IT: Plan, test and share
After seeing the effect of this and similar outages -- not only on Delta, but on its customers and their businesses -- enterprises should closely examine their infrastructure and operations and make sure they can sustain a hit from the most likely causes of failure. The first checklist item: have a backup data center in place and test the failover plan on a regular basis.
Robert Johnson, executive vice president at Vision Solutions, Inc. in Irvine, Calif., which has worked with Delta in the past for data protection on IBM Power Systems, sees customers buy backup systems and protection, but remain exposed because they don't test their systems often enough. As a result, "when they have a failure, they go into a crisis mode," he said. And it's not only the systems that need testing. "If there are employees involved in this who haven't been trained properly or haven't participated in the testing, when it does happen, people are scrambling around, so many things can go wrong," he said.
Like many enterprises, Delta also likely still depends on its own data center because it has many workloads that are not cloud ready, said Gary Sloper, vice president of global sales engineering at internet performance management company Dyn, Inc. in Manchester, N.H.
"From a planning and execution standpoint, you need to make sure you have backup plans for the legacy workloads that aren't cloud ready," said Sloper, whose experience includes time at CenturyLink and ColoSpace, Inc.
A hybrid environment that makes use of cloud computing could help avoid disruptions like the one suffered by Delta, Sloper said. Cloud computing could help disperse workloads closer to the users and also help mitigate the risk of failure by quickly standing up new instances or using other instances that are load balanced.
"That takes a lot of planning, but it is a cultural challenge, too," Sloper said. "There is not a playbook about how to deliver a hybrid infrastructure."
If more enterprises that suffer data center outages share the cause and analysis of the outage publicly, that would help improve operations for all, said Lee Kirby, president of Uptime Institute LLC, a data center organization best known for its Tier standards. Uptime has a closed-door network to help data center operators learn from each other through an outage reporting system.
"These four airlines -- it would be interesting if they would share more information" about their incidents and resolutions, Kirby said, but "their marketing departments would squash that."
About the author:
Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT or email him at rgates@techtarget.com.
Join the conversation
8 comments
some general thoughts ..
It is likely that a similar fate belies many firms across all longer-standing industries should they experience a critical failure that necessitates some form of site/DC wide recovery. I would agree that it isn't always feasible in terms of risk, and the very high probability of disruption to business, to simply 'test' site wide failover end-to-end. Some HA systems enable failover to be tested in a sandbox or bubble, but people tend to kid themselves that this represents complete DR testing, whereas it typically only tests what lies immediately within it's boundary, which doesn't always then include the complex mesh of network, directory/identity, firewall, load balancing, etc etc.. factors that are forced into play in the real world scenario.
Possible ways to combat this?
Lay the situation on the line before the board and see if that generates any enthusiasm or increased appetite for risk in testing more deeply.
Try and break down your testing perhaps to targeted systems, e.g. the next test will be disruptive, but will target only the firewall failover, 6 months later we'll target an aspect of the (non software defined :o)) network.
Also, try and change the line of thinking and understanding of the business. you purchase and build an 'unbreakable system', but the D testing is then done with the intention and expectation that something 'will' break, the purpose of the testing is to flush out that thing under somewhat more controlled and scheduled conditions. Never expect a system simply to work, and never expect it to then stay working without regular attention of some kind. Not a popular conversation, but perhaps a more realistic one.Then come the day it happens in an unscheduled disaster scenario, you have more confidence, familiarity, and wider capability to respond across all the complex and diverse areas of the IT landscape.
lastly, as has been said, understand DR, i.e. what needs to be brought up first, how it get's brought up, what its dependencies are (probably all the other stuff), build your recovery plan to this effect. however, don't expect to test it in one hit, assuming you haven't gradually built up the confidence through many iterations of regular testing across all other aspects of the system, to the point you are ready to failover the Site/DC it it's entirety.