A recent batch of delays for JetBlue traced to a data center problem has two major consumer brands seeing red and singing the blues.
Three hours of data center downtime for JetBlue -- blamed on a power outage at a Verizon data center -- presents lessons on the importance of having redundant power and a failover plan that can work when needed the most.
"My first thought was 'what was their disaster recovery plan and why didn't it work?'" said Kelly Quinn, research manager at IDC.
A "maintenance operation" at a Verizon Communications Inc. data center at an unspecified location caused the outage, which created cascading flight delays into the evening and headaches for thousands of flyers, JetBlue Airways Corp. said on its blog on Jan. 14. The blog post was later removed.
"That really struck a chord with me," Quinn said. Human error is the number one cause of data center outages in all of IDC's recent surveys on data center downtime.
After power was lost, there should have been a plan to get back up and running quickly, either by failing over or by moving to backup power.
"They can failover to another site -- it is Verizon," Quinn quipped.
Verizon said in a statement that a data center experienced a power outage that impacted JetBlue's operations. The power was disrupted during a maintenance operation at the Verizon data center, according to JetBlue.
Verizon would not comment further on the outage, its causes or internal mitigation procedures. It is also unclear whether the maintenance was routine or an emergency.
JetBlue first signed an agreement with Verizon in 2009 to manage its data center, network infrastructure and help desk. In November 2014, the two companies expanded the relationship, making Verizon the "primary technology infrastructure business partner" for JetBlue and expanding the services Verizon offered the airline to include cloud computing, managed security, communications and mobility networks, and professional services.
A three-hour outage after power loss is atypical, and Quinn said she would want to know why Verizon did not implement its disaster recovery (DR) plan. Quinn doesn't know what, if any, DR plan JetBlue has in place but assumes there is one, and that it would call for the system to be restored in less than three hours.
"Not having a DR plan is unfathomable," she said.
A 2013 study by the Ponemon Institute and sponsored by Emerson Network Power found that the average cost of data center downtime per incident is $627,418. The same study pegged the average data center downtime at 107 minutes.
Moreover, Quinn thinks it is "stunning" that all of JetBlue's digital businesses -- from its website to airport systems -- were affected.
JetBlue's service-level agreement (SLA) with Verizon should have guaranteed that its systems failed over, and the data center downtime likely fell "far outside the scope" of its SLA, she said.
If and when Verizon releases further details about the outage, "that would be helpful to the market," Quinn said. Beyond industry awareness, such transparency would be in the company's own best interests -- reports have indicated Verizon is considering putting its data centers up for sale, and this event "is going to introduce a huge amount of doubt in buyers' minds."
It is unclear where the affected data center is located, but Quinn speculates it is on the East Coast, likely in the New York-New Jersey area where both Verizon's and JetBlue's headquarters are located.
The Verizon outage pales in comparison to the outage at 365 Main in San Francisco in 2007, which Robert McFarlane, a data center design consultant for Shen Milsom and Wilke LLC, calls "the most famous and documented data center failure ever." Still, he's surprised that something like this happened because, "I know the levels they go to."
Verizon may have had everything in place and thought it had everything covered, but something was missed, he said.
McFarlane said he has been talking for years about how an uninterruptible power supply (UPS) is often not uninterruptible.
"I've encountered more instances of UPS failure than anything else," he said.
Transferring a data center's power demand to a UPS is another common failure point.
McFarlane and Quinn both suspect the Verizon data center outage also caused lower-profile outages at other businesses, though none have been publicized and attributed to this specific data center problem.
Peter Kelly-Detwiler, an analyst with energy-consulting firm NorthBridge Energy Partners LLC in Lexington, Mass., was delayed in Cleveland for about an hour by the JetBlue outage. He got a first-hand look at the problems the outage caused.
It is important for data centers to have good batteries and backup generation and to test them regularly, Kelly-Detwiler said. Many organizations test only under routine conditions and do not stress-test out of sequence. These more sophisticated tests may be more costly, but the cost of something going wrong can be higher, he said.
"Operate your data center with a healthy dose of paranoia," he said. "Always assume the worst is going to happen, and think about what you will do."
A post-mortem about the outage will be important too, Kelly-Detwiler noted.
"It surprised me it happened to Verizon -- but it can happen to anyone," he said.
Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT.