Essential Guide

Building a disaster recovery architecture with cloud and colocation

A comprehensive collection of articles, videos and more, hand-picked by our editors

JetBlue, Verizon data center downtime raises DR, UPS questions

A power failure at a Verizon data center knocked out JetBlue's digital infrastructure for several hours, giving IT pros plenty to consider when thinking about uptime.

A recent batch of delays for JetBlue traced to a data center problem has two major consumer brands seeing red and singing the blues.

Three hours of data center downtime for JetBlue -- blamed on a power outage at a Verizon data center -- presents lessons on the importance of having redundant power and a failover plan that can work when needed the most.

"My first thought was 'what was their disaster recovery plan and why didn't it work?'" said Kelly Quinn, research manager at IDC.

A "maintenance operation" at a Verizon Communications Inc. data center at an unspecified location caused the outage, which created cascading flight delays into the evening and headaches for thousands of flyers, JetBlue Airways Corp. said on its blog on Jan. 14. The blog post was later removed.

"That really struck a chord with me," Quinn said. Human error is the number one cause of data center outages in all of IDC's recent surveys into the cause of data center downtime.

After power was lost, there should have been a way to get back up and running quickly, either by failing over to another site or by moving to backup power.

"They can failover to another site -- it is Verizon," Quinn quipped.

Verizon said in a statement that a data center experienced a power outage that impacted JetBlue's operations. The power was disrupted during a maintenance operation at the Verizon data center, according to JetBlue.

Verizon would not further comment about the outage, its causes or internal mitigation procedures. It is also unclear whether it was routine or emergency maintenance.

JetBlue first signed an agreement with Verizon in 2009 to manage its data center, network infrastructure and help desk. In November 2014, the two companies expanded the relationship, making Verizon the "primary technology infrastructure business partner" for JetBlue and expanding the services Verizon offered the airline, including cloud computing, managed security, communications and mobility networks, and professional services.

"My first thought was 'what was their disaster recovery plan and why didn't it work?'"
Kelly Quinn, research manager, IDC

A three-hour outage after power loss is atypical, and Quinn said she would want to know why Verizon did not implement its disaster recovery (DR) plan. Quinn doesn't know what, if any, DR plan JetBlue has in place but assumes there is one, and that it would call for the system to be restored in fewer than three hours.

"Not having a DR plan is unfathomable," she said.

A 2013 study by the Ponemon Institute and sponsored by Emerson Network Power found that the average cost of data center downtime per incident is $627,418. The same study pegged the average data center downtime at 107 minutes.
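For a rough sense of scale, the short Python sketch below converts those averages into a per-minute cost and applies it to a three-hour outage. It assumes cost scales linearly with duration, which is a simplification the study itself does not make.

```python
# Back-of-the-envelope math using the 2013 Ponemon/Emerson averages cited above.
# Assumes cost scales linearly with outage duration -- a simplification, not a
# claim made by the study.
avg_cost_per_incident = 627_418   # dollars
avg_downtime_minutes = 107

cost_per_minute = avg_cost_per_incident / avg_downtime_minutes
three_hour_cost = cost_per_minute * 180

print(f"~${cost_per_minute:,.0f} per minute")       # ~$5,864 per minute
print(f"~${three_hour_cost:,.0f} for three hours")  # ~$1,055,470 for three hours
```

By that yardstick, a three-hour outage lands well above the study's per-incident average, before counting the cost of cascading flight delays.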

Moreover, Quinn thinks it is "stunning" that all of JetBlue's digital businesses -- from its website to airport systems -- were affected.

JetBlue's service-level agreement (SLA) with Verizon should have guaranteed that its systems failed over, and the data center downtime likely fell "far outside the scope" of its SLA, she said.

If and when Verizon releases further details about the outage, "that would be helpful to the market," Quinn said. Beyond industry awareness, such transparency would be in the company's own best interests -- reports have indicated Verizon is considering putting its data centers up for sale, and this event "is going to introduce a huge amount of doubt in buyers' minds."

It is unclear where the affected data center is located, but Quinn speculates it is on the East Coast, likely in the New York-New Jersey area where both Verizon's and JetBlue's headquarters are located.

The Verizon outage pales in comparison to the 2007 outage at 365 Main in San Francisco, which Robert McFarlane, a data center design consultant at Shen Milsom and Wilke LLC, calls "the most famous and documented data center failure ever." Still, he's surprised that something like this happened, because "I know the levels they go to."

Verizon may have had everything in place and thought it had everything covered, but something was missed, he said.

For years, McFarlane said, he has been talking about how an uninterruptible power supply (UPS) is often not uninterruptible.

"I've encountered more instances of UPS failure than anything else," he said.

Transferring a data center's power demand to a UPS is another common failure point.

McFarlane and Quinn both suspect the Verizon data center outage also caused lower-profile outages at other businesses, though none have been publicized and attributed to this specific data center problem.

Peter Kelly-Detwiler, an analyst with energy-consulting firm NorthBridge Energy Partners LLC in Lexington, Mass., was delayed in Cleveland for about an hour by the JetBlue outage. He got a first-hand look at the problems the outage caused.

It is important for data centers to have good batteries and backup generation, and to test them regularly, Kelly-Detwiler said. Many organizations test only under routine conditions and never stress-test out of sequence. These more sophisticated tests may be more costly, but the cost of something going wrong can be higher, he said.
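As a loose illustration of what "out of sequence" testing could mean in practice, here is a minimal sketch; the scenario names and the idea of drawing one at random each quarter are hypothetical examples, not anything Kelly-Detwiler or the companies involved describe.

```python
import random

# Illustrative "out of sequence" stress scenarios that go beyond a routine
# monthly generator start. The scenario names and the quarterly cadence are
# hypothetical examples, not anyone's published test plan.
STRESS_SCENARIOS = [
    "Pull the utility feed during a live maintenance window",
    "Fail the primary UPS and confirm the transfer switch carries the load",
    "Start generators under full load rather than at idle",
    "Suppress automated alerts and confirm a human still notices the event",
]

def pick_unannounced_drill(scenarios=STRESS_SCENARIOS):
    """Choose one scenario at random so the drill never follows a fixed script."""
    return random.choice(scenarios)

print("This quarter's unannounced drill:", pick_unannounced_drill())
```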

"Operate your data center with a healthy dose of paranoia," he said. "Always assume the worst is going to happen, and think about what you will do."

A post-mortem about the outage will be important too, Kelly-Detwiler noted.

"It surprised me it happened to Verizon -- but it can happen to anyone," he said.

Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT or email him at rgates@techtarget.com.

Next Steps

Disaster recovery testing technology options

How virtualization can help DR planning

Essential business disaster recovery plan checklist



Join the conversation

11 comments


What is a preventable source of data center downtime that you think is most often ignored?

The problem I see most is the use of stale diesel fuel. I've been to sites with fuel that is years old. Diesel is a carbon-based petrochemical, so it starts to oxidize as soon as it leaves the refinery, forming the sediments and gums that clog fuel systems. Without fuel stabilizers, diesel can go bad in as little as 30 days, and once oxidation takes hold it creates deposits that can damage fuel injectors, fuel lines and other system components.

I've also been to a military site where all the battery-bad lights were lit on the data center UPSes.
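To put the commenter's 30-day rule of thumb into something actionable, here is a minimal sketch of a fuel-age check; the thresholds and the function name are illustrative assumptions, not an industry standard.

```python
from datetime import date

# Rough staleness check based on the rule of thumb above: untreated diesel can
# start to go bad in as little as 30 days. Both thresholds are illustrative
# assumptions, not an industry standard.
UNTREATED_LIMIT_DAYS = 30
STABILIZED_LIMIT_DAYS = 365   # assumption: treated fuel still gets re-checked yearly

def fuel_needs_attention(delivered, stabilized, today=None):
    """Flag generator fuel that is older than its assumed shelf life."""
    today = today or date.today()
    limit = STABILIZED_LIMIT_DAYS if stabilized else UNTREATED_LIMIT_DAYS
    return (today - delivered).days > limit

# Untreated fuel delivered the previous spring is well past its window by mid-January.
print(fuel_needs_attention(date(2015, 4, 1), stabilized=False, today=date(2016, 1, 14)))  # True
```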

I have worked in the UPS field for 16 years, selling service maintenance contracts for one of the largest OEMs in the U.S. Battery replacement is often ignored, or rather held off for too long due to budget constraints. I always provide budgetary battery replacement pricing two years before the date the batteries should typically be replaced, but some customers continue to push it out further and further until there is an issue. Proper Preparation Prevents Poor Performance... budget for it and stick to it is my advice!

Both DR and BC add significant cost. In this case, JetBlue must have signed up for a business continuity solution from Verizon. Did they? It may not be included by default.
@3Ttodaytomorrowtechnology The 4 P's... that's a good thing to remember.
If BC is not a default service from your data center provider, do you have it? Great question, @Ignesius.
And @destroy... stale diesel fuel, so easy to avoid.
Great discussion, everyone.

I have also seen failures where the UPS does not switch over at all, and a lack of alerts or notifications when a failure happens.

Another cause of data center collapse is failing cooling systems: chillers fail, the data center becomes a hot sauna, and then you're running dozens of large fans to protect a $100 million-plus investment.

Testing generators monthly is a good practice to burn off bad/old fuel.

Fire up generators when a major storm is blowing through the area. Why wait for a problem to happen and hope everything works when you lose power?

Have temp cooling systems on site for when chillers go out. Better to have them when needed than to kick yourself for failing to be prepared. Fans just do not do the job; they only move hot air around.

Test your UPS monthly and ensure you have adequate UPS capacity so that your generators can take over properly.

Test your alerting system monthly; you don't want to find out about your data center issues from the business or customers.

Run paranoid that something will happen, especially around the holidays, and you will at least have a fighting chance to prevent business disruption. I have seen some bad things happen during a holiday weekend... grrrr.

My 2 cents.
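A minimal sketch of how the monthly tests described in this comment could be tracked as a simple checklist; the item wording, owners and structure are illustrative, not taken from any real runbook.

```python
from dataclasses import dataclass
from datetime import date

# Turns the practices above into something trackable: a monthly checklist with
# owners and done/not-done status. Items paraphrase the comment; fields are
# illustrative.
@dataclass
class ChecklistItem:
    task: str
    owner: str
    done: bool = False

MONTHLY_DR_CHECKLIST = [
    ChecklistItem("Run generators under load to burn off old fuel", "facilities"),
    ChecklistItem("Transfer load to UPS and confirm generators take over", "facilities"),
    ChecklistItem("Trigger a test alert and confirm on-call staff receive it", "ops"),
    ChecklistItem("Verify temporary cooling units are on site and serviceable", "facilities"),
]

def outstanding(items):
    """Return the tasks that have not been completed yet this month."""
    return [item.task for item in items if not item.done]

if __name__ == "__main__":
    print(f"{date.today():%B} outstanding items:", outstanding(MONTHLY_DR_CHECKLIST))
```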
Procedures and checklists! Given that the number one cause of downtime is human error, I would suggest a careful review of procedures and checklists for both normal and abnormal situations. We cannot expect teams to outperform the guidance we give them. All guidance should be walked through and improved on a continuous basis. Let's blow the dust off those procedures and make sure teams have quality guidance to trap procedural errors. If your teams are still using three-ring binders full of procedures, I would also suggest moving to digital and mobile technology to improve the quality and availability of procedures and to give leadership oversight of compliance. The thought of the "B team" running a poorly written abnormal checklist on a holiday weekend should put fear in all of us who value uptime.
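As a rough sketch of the "binders to digital" idea, here is one way a procedure could require step-by-step confirmation and leave a timestamped audit trail for leadership review; the procedure name, steps and field names are made up for illustration.

```python
from datetime import datetime

# A minimal sketch of a digital procedure: each step is confirmed and
# timestamped so leadership can review compliance after the fact. The
# procedure name and steps are made-up examples, not a real runbook.
ABNORMAL_PROCEDURE = {
    "name": "Loss of utility power",
    "steps": [
        "Confirm UPS has picked up the critical load",
        "Confirm generators have started and accepted load",
        "Notify the on-call manager and open an incident record",
        "Decide within 15 minutes whether to declare a site failover",
    ],
}

def run_procedure(procedure, confirm=input):
    """Walk each step, require an explicit confirmation, and log a timestamp."""
    log = []
    for step in procedure["steps"]:
        confirm(f"{step} -- press Enter when complete: ")
        log.append((datetime.now().isoformat(timespec="seconds"), step))
    return log

# Example (interactive): audit_trail = run_procedure(ABNORMAL_PROCEDURE)
```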
A power outage is not a disaster. This isn't a DR issue at all; it is a business continuity issue, which is a much simpler matter. Our little company has a generator and we test it once a month. Verizon should have banks of generators and an enormous battery bank. This mess-up is inexcusable. I could see a blip, maybe, but hours of downtime? Someone needs to get the boot.

Great point, John. It seems so simple, so it would be interesting to know what happened.

Everyone's making a bunch of assumptions here, without any facts. This is all speculation until Verizon actually advises what happened.

There is an awful lot more to power delivery into an infrastructure than "having some battery banks and generators." That's an insanely simplistic view of what is a core and complicated component of a large data center.

If we're all going to speculate, then maybe some poor engineer cut the physical power feed running to these racks and electrocuted himself. It might have taken a couple of hours to try to save his life, rather than worry about getting some servers back online.

I suspect that only Verizon customers are going to get details as to what _precisely_ happened. As a result, there are a number of assumptions that are reasonable to make.
1. The explanation so far is power loss. Is this data center fed from only a single neighborhood power station? If so, it is likely not a top-tier data center.
2. The racks should be powered from separate power feeds, so the failure of a PDU serving multiple racks should not be the issue.
3. The UPS should provide about 15 minutes of battery life, and then the generators should have kicked in for at least 48 hours. Did the switch from line power to UPS fail, or the switch from UPS to generator power?
4. Now, to DR: for the airline industry, they'll need an RTO of 15 minutes. So either (a) their DR solution was also a problem, with a failure to recover or paralysis by indecision as information came in ("oh, power will be restored in just a minute"), or (b) they felt at the time that their RPO was insufficient, e.g. "we need data from 10 minutes before the failure."
Lots of questions for Verizon and JetBlue, and as public companies they will not want to disclose much, since it may affect the stock price when divulging secrets (e.g. "oh, they have to spend a boatload to fix this").
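Taking the commenter's figures at face value, here is a minimal sketch of the timing math in Python. The 15 minutes of UPS battery, 48 hours of generator runtime and 15-minute RTO come from the comment above; the one-minute generator start time and the function names are assumptions for illustration only.

```python
# Sanity check on the commenter's numbers: can the power chain and the DR plan
# both keep systems inside a 15-minute RTO? All figures are assumptions from
# the comment above, not published specs for the Verizon facility.
UPS_BATTERY_MINUTES = 15      # battery ride-through before generators must carry the load
GENERATOR_START_MINUTES = 1   # assumed time for generators to start and accept load
RTO_MINUTES = 15              # recovery time objective assumed for airline systems

def power_chain_ok():
    """Generators must pick up the load before the UPS batteries run out."""
    return GENERATOR_START_MINUTES < UPS_BATTERY_MINUTES

def dr_failover_ok(failover_minutes):
    """If local power cannot be held, a site failover must finish inside the RTO."""
    return failover_minutes <= RTO_MINUTES

print(power_chain_ok())      # True -- on paper the power chain holds
print(dr_failover_ok(180))   # False -- a three-hour outage blows far past a 15-minute RTO
```

On paper the chain holds, which is why a three-hour outage points to either a failed transfer or a DR plan that never kicked in, as the commenter suggests.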
