
Delta outage raises backup data center, power questions

Another outage at an airline data center offers a fresh lesson about the need to fail over to a backup data center and bounce back quickly after a power problem.

Delta Air Lines' failure to keep its data center flying at a cruising altitude this week holds lessons for enterprise IT pros.

The airline's continued dependence on legacy applications and its failure to quickly and successfully switch over to a backup data center are the two main culprits behind six hours of systems downtime on Monday, Aug. 8, and the resulting flight cancellations and delays that lingered for days.

This is at least the fourth time in the past year that a major airline has been hit by a data center outage, with earlier incidents affecting United Airlines, Inc., JetBlue Airways Corp. and Southwest Airlines Co. As before, it's yet another teaching moment for all data center IT pros to keep their own houses in order.

Diagnosis: A legacy of problems

This latest incident started early Monday morning when a critical power control module at a Delta data center malfunctioned, which caused a surge to the transformer and a loss of power, Delta COO Gil West said in a statement posted to the airline's website. Power was quickly restored, but "critical systems and network equipment didn't switch over to backups," he said, and the systems that did switch over were unstable.

Most airlines are dealing with a combination of legacy systems -- some with operating systems dating back to the 1950s -- plus mainframes and more modern, open systems for web and mobile applications, according to Robert Mann Jr., an airline industry analyst at R.W. Mann & Company, Inc. in Port Washington, N.Y. He has managed multiple airline systems in his career, for American Airlines, Inc., Pan American World Airways, Inc. and Trans World Airlines, Inc.

Despite the challenges of maintaining that patchwork setup, "garden-variety hardware" has been blamed for many recent airline outages, including a router failure at Southwest Airlines earlier this summer. Mann called a switchgear failure like the one Delta experienced a "low probability occurrence." Airlines should do a better job of monitoring hardware so they know when a failure could be imminent, he said.
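Monitoring for those early warning signs doesn't require exotic tooling. As a rough sketch of the idea, the Python snippet below polls health readings and flags anything drifting past a warning threshold; the read_health_metric function, device names and thresholds are placeholders for illustration -- not details of any airline's setup -- and a real deployment would pull its readings from a DCIM or SNMP monitoring system.

```python
# Minimal sketch of threshold-based hardware health monitoring.
# read_health_metric() is a hypothetical stand-in for a real DCIM or SNMP query.
import random

WARN_THRESHOLDS = {
    "switchgear_temp_c": 55.0,   # assumed warning threshold, for illustration only
    "ups_battery_pct": 80.0,     # warn when reported capacity drops below this
}

def read_health_metric(device: str, metric: str) -> float:
    """Placeholder: return a simulated reading; a real version would poll the device."""
    return random.uniform(40.0, 100.0)

def check_device(device: str) -> list:
    """Return human-readable warnings for any metric outside its assumed safe range."""
    warnings = []
    temp = read_health_metric(device, "switchgear_temp_c")
    if temp > WARN_THRESHOLDS["switchgear_temp_c"]:
        warnings.append(f"{device}: temperature {temp:.1f} C is above the warning threshold")
    battery = read_health_metric(device, "ups_battery_pct")
    if battery < WARN_THRESHOLDS["ups_battery_pct"]:
        warnings.append(f"{device}: UPS battery at {battery:.0f}%, below the warning threshold")
    return warnings

if __name__ == "__main__":
    for device in ("power-module-1", "power-module-2"):
        for warning in check_device(device):
            print("WARN:", warning)
```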

It is unclear whether the Delta data center that suffered the outage had a redundant power system, said Julius Neudorfer, CTO and founder at North American Access Technologies, Inc. in Westchester, N.Y.

"Is it a 2N environment or an old mainframe environment that was never upgraded?" Neudorfer asked, noting that it could have had a single point of failure. "There is a big difference in an upper tier versus an older design."

One thing seems certain: as the airline industry and the number of devices accessing airline IT infrastructure grow, so will similar IT disruptions, predicts Mann.

Takeaways for enterprise IT: Plan, test and share

After seeing the effect of this and similar outages -- not only on Delta, but on its customers and their businesses -- enterprises should closely examine their infrastructure and operations and make sure they can sustain a hit from the most likely causes of failure. The first checklist item: have a backup data center in place and test the failover plan on a regular basis.

Robert Johnson, executive vice president at Vision Solutions, Inc. in Irvine, Calif., which has worked with Delta in the past for data protection on IBM Power Systems, sees customers buy backup systems and protection, but remain exposed because they don't test their systems often enough. As a result, "when they have a failure, they go into a crisis mode," he said. And it's not only the systems that need testing. "If there are employees involved in this who haven't been trained properly or haven't participated in the testing, when it does happen, people are scrambling around, so many things can go wrong," he said.
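A regular drill doesn't have to be elaborate to expose gaps before a crisis does. The sketch below is a minimal, generic example -- not anything Vision Solutions ships -- that runs a scripted checklist of recovery checks, times the run and reports failures; the step names, commands and recovery time objective are illustrative assumptions.

```python
# Minimal DR drill harness: run scripted recovery checks and compare against an RTO.
# Step names, commands and the 4-hour RTO are illustrative assumptions.
import subprocess
import time

RTO_SECONDS = 4 * 60 * 60  # assumed recovery time objective

# Each step is (description, shell command that must succeed at the backup site).
DRILL_STEPS = [
    ("backup site reachable", "ping -c 3 backup-site.example.internal"),
    ("database replica healthy", "echo 'placeholder: query replica lag here'"),
    ("app tier responds", "echo 'placeholder: probe the standby load balancer here'"),
]

def run_drill() -> None:
    start = time.monotonic()
    failures = []
    for description, command in DRILL_STEPS:
        result = subprocess.run(command, shell=True, capture_output=True)
        status = "OK" if result.returncode == 0 else "FAIL"
        print(f"[{status}] {description}")
        if result.returncode != 0:
            failures.append(description)
    elapsed = time.monotonic() - start
    print(f"Drill finished in {elapsed:.0f}s; RTO budget is {RTO_SECONDS}s")
    if failures:
        print("Follow up on:", ", ".join(failures))

if __name__ == "__main__":
    run_drill()
```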

Like many enterprises, Delta also likely still depends on its own data center because it has many workloads that are not cloud ready, said Gary Sloper, vice president of global sales engineering at internet performance management company Dyn, Inc. in Manchester, N.H.

"From a planning and execution standpoint, you need to make sure you have backup plans for the legacy workloads that aren't cloud ready," said Sloper, whose experience includes time at CenturyLink and ColoSpace, Inc.

A hybrid environment that makes use of cloud computing could help avoid disruptions like the one suffered by Delta, Sloper said. Cloud computing could help disperse workloads closer to the users and also help mitigate the risk of failure by quickly standing up new instances or using other instances that are load balanced.

"That takes a lot of planning, but it is a cultural challenge, too," Sloper said. "There is not a playbook about how to deliver a hybrid infrastructure."

If more enterprises that suffer data center outages shared the cause and analysis publicly, it would help improve operations for all, said Lee Kirby, president of Uptime Institute LLC, a data center organization best known for its Tier standards. Uptime runs a closed-door network that helps data center operators learn from each other through an outage reporting system.

"These four airlines -- it would be interesting if they would share more information" about their incidents and resolutions, Kirby said, but "their marketing departments would squash that."

About the author:
Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT or email him at rgates@techtarget.com.

Next Steps

How to prep for an IT disaster

Learn from one data center disaster

How one airline is transforming its IT strategy


Essential Guide

Building a disaster recovery architecture with cloud and colocation

Join the conversation


What is the best way that the data center industry, as a whole, can help each other come up with better backup data center plans?
Interesting article, but it raises an immediate question. How can one actually test backup/failover systems & procedures in a complex, legacy environment like this without actually risking the very costly & time-consuming disaster that just occurred? Seems to me that failure testing itself could trigger a massive cascading failure that would be just as bad as an actual real world failure. Given the high stakes of any single failure anywhere in the infrastructure and the enormous complexity of such a heterogeneous information system, how on Earth could IT arrange any meaningful failover testing that wouldn't, itself, be a high probability for causing a disaster?
It's not legacy systems that are the issue; it's prioritizing which functions NEED to be up first to run the airline (or any business) when a disaster like this occurs. Do kiosks need to be up first? No. Does the entire website need to run? No. Priority should go to the check-in, boarding, baggage and ground crew systems, for example. Get up whatever systems are needed to keep planes and booked passengers moving, then work on new reservations, kiosks, etc. An airline should be able to switch these over in a reasonable time and limit delays. A good HA system should allow you to do a virtual switch to test the features.
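For example, a tiered restoration list might look something like the rough sketch below (the tiers and system names are made up for illustration, not Delta's actual systems):

```python
# Illustrative restoration priority list -- tiers and system names are invented,
# not Delta's actual architecture.
RESTORE_TIERS = {
    1: ["check-in", "boarding", "baggage", "ground-crew-dispatch"],  # keep planes and booked passengers moving
    2: ["crew-scheduling", "flight-ops-messaging"],
    3: ["website", "new-reservations", "kiosks"],
    4: ["loyalty-program", "reporting", "analytics"],
}

def restoration_order() -> list:
    """Flatten the tiers into the order systems should be brought back up."""
    ordered = []
    for tier in sorted(RESTORE_TIERS):
        ordered.extend(RESTORE_TIERS[tier])
    return ordered

if __name__ == "__main__":
    for position, system in enumerate(restoration_order(), start=1):
        print(f"{position:2d}. {system}")
```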
Anyone else out there want to chime in? Prioritizing services for restoration when they are hosted in a hybrid infrastructure and some critical apps are definitely not cloud ready doesn't sound like enough of a HA solution to me. In fact, the statement that some systems that did fail over were "unstable" makes it sound like some inter-app datacom processes are either realtime or quickly "broken". All or nothing inter-app processes don't sound amenable to prioritized service restoration - they sound extremely fragile. If a system like the BluePoint Continuity Engine could handle legacy mainframe OSes as well as Windows Servers then there might be a way to manage this legacy fragility. Anybody else have some insight into how the airlines can improve the robustness of their IT systems?
Prioritizing services is not an HA solution, but a DR solution that could get the business up and running. It's obvious to anyone with IT experience that Delta did not have sufficient HA capability, if any, much less DR. When you have human beings waiting at airports and their lives being disrupted, you need to move those people ASAP. Even though I am not in the airline industry, common sense tells me to keep moving the passengers who are already booked. It's what we would do in any other industry, and I stand by my original statement.

some general thoughts ..

It is likely that a similar fate awaits many firms across all longer-standing industries should they experience a critical failure that necessitates some form of site/DC-wide recovery. I would agree that it isn't always feasible, in terms of risk and the very high probability of disruption to the business, to simply 'test' site-wide failover end to end. Some HA systems enable failover to be tested in a sandbox or bubble, but people tend to kid themselves that this represents complete DR testing, whereas it typically only tests what lies immediately within its boundary, which doesn't always then include the complex mesh of network, directory/identity, firewall, load balancing, etc. factors that are forced into play in a real-world scenario.

Possible ways to combat this?

Lay the situation on the line before the board and see if that generates any enthusiasm or increased appetite for risk in testing more deeply.

Try to break down your testing into targeted systems, e.g., the next test will be disruptive but will target only the firewall failover; six months later we'll target an aspect of the (non-software-defined :o)) network.

Also, try to change the business's line of thinking and understanding. You purchase and build an 'unbreakable' system, but the DR testing is then done with the intention and expectation that something will break; the purpose of the testing is to flush that thing out under somewhat more controlled and scheduled conditions. Never expect a system simply to work, and never expect it to stay working without regular attention of some kind. Not a popular conversation, but perhaps a more realistic one. Then, come the day it happens in an unscheduled disaster scenario, you have more confidence, familiarity and wider capability to respond across all the complex and diverse areas of the IT landscape.

Lastly, as has been said, understand DR: what needs to be brought up first, how it gets brought up and what its dependencies are (probably all the other stuff), and build your recovery plan to that effect. However, don't expect to test it in one hit if you haven't gradually built up confidence through many iterations of regular testing across all other aspects of the system, to the point where you are ready to fail over the site/DC in its entirety.
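To make the dependency point concrete, here's a rough sketch of turning a dependency map into a bring-up order; the system names and dependencies are invented for illustration:

```python
# Sketch: derive a bring-up order from a dependency map via topological sort.
# System names and dependencies are invented for illustration.
from graphlib import TopologicalSorter

# Each key depends on the systems in its set being up first.
DEPENDENCIES = {
    "network-core": set(),
    "directory-identity": {"network-core"},
    "firewalls": {"network-core"},
    "load-balancers": {"network-core", "firewalls"},
    "mainframe-reservations": {"network-core", "directory-identity"},
    "check-in-app": {"mainframe-reservations", "load-balancers"},
    "website": {"check-in-app", "load-balancers"},
}

def bring_up_order(deps: dict) -> list:
    """Return a valid start order; raises CycleError if the dependencies loop."""
    return list(TopologicalSorter(deps).static_order())

if __name__ == "__main__":
    for step, system in enumerate(bring_up_order(DEPENDENCIES), start=1):
        print(f"step {step}: start {system}")
```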

Jept77 has given us some practical ideas about how to approach BC via a combination of HA and targeted system failover testing, plus DR planning that restores services based on both priority and dependencies. (Given the instabilities reported when Delta's failover system kicked in, one has to wonder if co-dependent services on disparate hosts made restoration prioritization a deadly embrace.) He also points out that the HA components in an IT system often allow virtualized failover testing in a "sandbox," but that such testing is confined to "what lies immediately within its boundary, which doesn't always then include the complex mesh of network, directory/identity, firewall, load balancing, etc. factors that are forced into play in a real-world scenario." Well put, and it cuts to the heart of the matter for a system as vast, complex and cobbled together as a major airline's: a complex mesh is greater than the sum of its individually tested parts. So, unless Delta could live with repeated catastrophic outages from each scheduled, targeted failover test of a component or subsystem, I don't see how they could realistically get totally on top of their BC/HA/DR nightmare scenarios. Unless they could somehow create an exact duplicate of their core infrastructure, either real or simulated, and run their failover testing on the clone. And that sounds enormously expensive and time-consuming. Anybody else have a handle on Delta's possible BC/HA/DR planning?
The Delta Air Lines IT meltdown is such a huge teachable moment -- I can't believe that our IT community has so little to say about it. Look at the line-of-business apps that your employer must have running every day, without fail, in order for you to have a job, and consider the complex mesh of disparate technologies that must work flawlessly to host those apps: a cascade failure like Delta's is a huge opportunity to get your C-level suits to invest in more business continuity, high availability and disaster recovery robustness. Okay, I know you've been screaming "security, security, security" so much that they've gone deaf to your concerns, but a line-of-business IT cascade failure that could take days to recover from is something that just became very real for every business person who isn't brain dead. I hope we're not going to let this teachable moment pass like a summer thunderstorm -- with barely a nod.
