
Lessons learned from data center outages, but still a long trip ahead

The hits keep on coming for the airline industry, with several more IT outages that have stranded angry passengers in recent months. Are there any new lessons for IT pros?

Airlines may be making headway against data center outages and other IT-related troubles that have caused turbulence for airline passengers.

The recovery time for Delta Air Lines Inc.'s outage at the end of January was shorter than the one last year, said Mark Thomas Jaggers, a research director at analyst firm Gartner who focuses on disaster recovery (DR) and IT resilience.

"I would expect that they have learned their lesson," he said.

Jaggers has reviewed the architecture diagrams for some airlines and said the biggest challenge is the complexity of the interdependent systems that must be available around the clock without downtime for upgrades or maintenance.

Newer, born-in-the-cloud and cloud-native applications are built to expect unstable and unreliable infrastructure, with resiliency built into the application and middleware. But legacy environments like those run by the airlines lack the same capabilities to deal with fragile environments, he said.

Some airlines do plan to shift to newer, more resilient infrastructure. American Airlines Inc., the world's largest airline as measured by passengers and fleet size, has begun to move some applications to the cloud for greater flexibility, scalability and reliability. It has already inked a deal with IBM but is exploring other cloud deals and providers as well.

In January, Southwest CEO Gary Kelly said the airline's largest technology project ever -- a move to the Amadeus reservation platform -- launched in December and has gone off without a glitch. IT spending at Southwest will start to level out in 2017, but significant projects are still in the works, including a new maintenance record-keeping system.

Airlines must continue that cloud push, said Rod Berger, a consultant with Bergmen Group Inc., in New York, which provides IT strategy and project management to major airlines.

"Cloud-based architecture really helps the airlines ... to shift the responsibility of the infrastructure to companies that are experienced with it and that can ensure resiliency," he said.

As airlines increasingly use cloud computing, they should pay particular attention to local resources for end users, he said.

"If your local workstations or your local processes are not there, you will not be able to use those cloud-based tools," he said.

Why airline IT systems are so stressed out

Delta and United Airlines Inc. each suffered data center outages in recent weeks. Delta also suffered an outage last August that it said cost the company $150 million. A United outage two weeks ago was pinned on the Aircraft Communications Addressing and Reporting System (ACARS), one of several decades-old systems that airlines rely on every day. Another United outage this week, which caused hundreds of flight delays but no cancellations, reportedly affected the creation and filing of flight plans, but has not specifically been blamed on ACARS.

In the 1980s, ACARS handled data transmission for gate departure time, takeoff time, landing time and gate arrival time. Today, it also incorporates information about weight and balance, weather and wind, and flight plans, as well as air traffic control data from the Federal Aviation Administration, said Robert Mann, an airline industry analyst at R.W. Mann & Co. Inc., whose career has included time at TWA and American Airlines.

"It's no surprise [with] the expansion of the number of aircraft, flights and the message types that that system is stressed," he said, adding that it is overloaded in many hub cities.

Industry experts said airlines have a particular challenge when it comes to IT systems such as ACARS.

"They have ancient systems written in ancient languages running on relatively ancient hardware that has been built organically like a Jenga stack," said Bill Mansfield, solution architect for data protection and availability at U.K.-based Logicalis Inc. "If you pull out any block in the Jenga stack, the whole pile can come down."

The integration of disparate systems, often after a merger or acquisition, can compound the problems. It is nearly impossible for airlines to start over with IT infrastructure because of the complexity of what already exists and the time and effort a rebuild would take. In recent years, US Airways merged with American and Continental merged with United. Before that, Northwest merged with Delta and Southwest bought AirTran.

Airline outage lessons for IT pros

Airlines' data center outages typically happen for the same reasons as those in other industries: an undetected reliability failure or an update that went bad. The methods and practices that address these problems are therefore well-known and apply equally to enterprise IT pros.

Change control and tests are the keys to keep any environment healthy, Mansfield said. Robust change control is needed to recognize and review changes, and there should be a plan to back out of them.

When IT pros get ready to make a change, they need to rigorously test it in an environment representative of the one to be changed. Users are most often the cause of mistakes, and automation helps avoid them, he said.
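
The back-out plan Mansfield describes can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not a description of any airline's tooling: the function `apply_change_with_backout`, the `config` dictionary and the health check are all hypothetical stand-ins for a real change, a real verification step and a real rollback.

```python
from typing import Callable


def apply_change_with_backout(
    apply: Callable[[], None],
    verify: Callable[[], bool],
    back_out: Callable[[], None],
) -> bool:
    """Apply a change, verify it, and back it out if verification fails.

    Returns True if the change was kept, False if it was backed out.
    """
    try:
        apply()
        if verify():
            return True
    except Exception:
        # An error while applying or verifying is treated as a failed check.
        pass
    back_out()
    return False


# Hypothetical example: a config value stands in for a real system change.
config = {"max_connections": 100}


def demo() -> dict:
    previous = dict(config)  # snapshot taken before the change

    def apply() -> None:
        config["max_connections"] = 0  # simulates an update that went bad

    def verify() -> bool:
        return config["max_connections"] > 0  # post-change health check

    def back_out() -> None:
        config.clear()
        config.update(previous)  # restore the snapshot

    kept = apply_change_with_backout(apply, verify, back_out)
    return {"kept": kept, "config": dict(config)}
```

The point of the sketch is that the back-out step is written and wired in before the change is applied, so a failed check restores the earlier state without manual intervention.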

Despite the progress that airlines are making in the eyes of some experts, a six- to eight-hour outage is substantial, and airlines must address the severity and duration of data center outages, said Ahmed Abdelghany, a professor of airline operations at Embry-Riddle Aeronautical University in Daytona Beach, Fla., and a former analyst in United's information services division.

"If the problem happens for a few seconds and there is a backup system, nobody will notice," he said. "But the airlines have a lot of work to do."

But many big banks are also built on old infrastructure, and automation can connect it with the new infrastructure, according to Mehul Amin, director of engineering at Advanced Systems Concepts Inc., based in Morristown, N.J.

"Integrating the two sides -- that's where the problem starts," he said, pointing to problems that come up from manual handoffs from disparate systems. "That manual intervention can be dangerous."

Automation tools high in the stack can span and tie together disparate systems and orchestrate them into one workflow, he said. Without automation, delays, errors and integration problems can surface.

Companies talk about better DR plans, but in many cases, the plans are still not enough. Many businesses look at a DR plan as insurance, in case something goes wrong.

"As long as they can check the box and say they have a disaster recovery plan, there is a systemic bias toward doing too little," said Mike Grossman, CEO at Zetta Inc. in Sunnyvale, Calif., which helps companies eliminate data loss and downtime.

Instead, a DR plan should be something they actually expect to use and that is tested rigorously and often, so that when things go wrong, the businesses can stay up, he said.

Many businesses claim to have a DR plan, but he said he often finds these are out of date or are not robust enough. In some cases, they were developed in response to an internal political process instead of an objective assessment.

"The lessons are pretty consistent, regardless of company size," he said.
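
One way to make a DR plan something that is actually exercised is to run a timed drill and compare the result with a recovery-time target. The Python sketch below is illustrative only: `run_dr_drill`, `DrillResult` and the `fail_over` and `is_healthy` callables are hypothetical names, and a real drill would trigger an actual failover and probe a real service.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    recovered: bool          # did the health check pass before the deadline?
    elapsed_seconds: float   # time from failover start to the final check
    within_objective: bool   # recovered within the recovery-time target?


def run_dr_drill(
    fail_over: Callable[[], None],
    is_healthy: Callable[[], bool],
    objective_seconds: float,
    poll_seconds: float = 0.01,
) -> DrillResult:
    """Trigger a failover, then poll until healthy or the target is exceeded."""
    start = time.monotonic()
    fail_over()
    while True:
        elapsed = time.monotonic() - start
        if is_healthy():
            return DrillResult(True, elapsed, elapsed <= objective_seconds)
        if elapsed > objective_seconds:
            return DrillResult(False, elapsed, False)
        time.sleep(poll_seconds)


# Hypothetical usage: a flag stands in for a standby system coming online.
state = {"standby_up": False}


def start_standby() -> None:
    state["standby_up"] = True


result = run_dr_drill(start_standby, lambda: state["standby_up"], objective_seconds=1.0)
```

Running a drill like this on a schedule, and recording `elapsed_seconds` each time, gives a plan that has been measured rather than merely documented.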

Airlines are in a heavily regulated industry with constant ups and downs, ranging from oil prices to union contracts, which make these companies even slower to respond and transform, said Justin Barney, president and CEO at ScaleArc Inc., based in Santa Clara, Calif.

Barney examined the resilience of airline IT systems after he booked a Southwest Airlines flight last year, encountered a slow online booking process and vowed to use a different airline next time.

ScaleArc then conducted a survey that found one-third of respondents had experienced poor performance from an airline booking website, with close to half saying a website outage affects them emotionally. A whopping 68% admitted they took retaliatory action based on an outage, meaning they intentionally booked a flight with a different airline.

Airlines don't need the most resilient IT infrastructure in the world, as long as it's better than the competition -- not a very high bar for performance, Jaggers said.

The focus should remain on DR tests to make sure IT pros understand how the recovery process works, and on the high expectation that systems will come back quickly.

"Being able to quickly recover happens because you understand it and practiced it," he said.

Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT.
