Minute by minute, data center outage costs stack up

A study pinpoints the cost of a data center outage and lends insight into the causes behind outages at more than 60 data centers in the past year.

The cost of an unplanned data center outage is now up to nearly $9,000 per minute.

Some data center pros, though, suggested this estimate is too low, and the more meaningful number is the total hit over an entire data center outage cycle.

The Ponemon Institute's statistics -- based on confidential interviews with operators of 63 data centers to update two similar studies in 2010 and 2013 -- put the average cost of a data center outage at $8,851 per minute. The average total cost of an outage is up 38% since 2010, to $740,357. The numbers aren't adjusted for inflation, which Ponemon said wasn't significant over the five-year period.

The latest numbers follow a recent data center outage at a Verizon Communications Inc. data center that affected JetBlue Airways Corp.

For many big-name brands with a strong reputation, customers may accept one outage, but will be far less accepting after that, said Larry Ponemon, chairman and founder of the Ponemon Institute, which conducted the study that was funded by Emerson Network Power.

Customer expectations may even vary depending on a brand's reputation or place in the market, he said. A high-end department store may have a higher bar to meet versus a discount shoe retailer, for example.

The study looked at a variety of businesses across many verticals, including several colocation providers. If a colo operator's outage spanned several tenants, it was counted as a separate outage for each business with infrastructure within the colo building.

The per-minute breakdown of the cost of a data center outage isn't a particularly useful number compared with the full cycle cost of an outage, asserted Paul Hines, a 20-year data center industry veteran. Hines is senior vice president of operations and engineering at Sentinel Data Centers, which owns and manages data centers in New Jersey and North Carolina -- mostly for Fortune 500 companies.

"Being down for a second can be equal to being down for several hours," Hines said. He suggested the Ponemon per-minute estimates "are on the very low side," and the total cost per minute is likely closer to $10,000 or more.

Knocking out an ERP system, for example, could take down a company's manufacturing operation, which could easily send damages higher than $8,851 per minute, Hines said.
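As a back-of-the-envelope illustration of the per-minute debate, the Python sketch below applies a simple linear model to the figures quoted above. The linear assumption and the function names are illustrative only; they are not part of Ponemon's methodology, and Hines's point is precisely that real losses do not scale this neatly with time.

```python
# Figures quoted in the article (USD).
PONEMON_PER_MINUTE = 8_851   # Ponemon's average per-minute cost
PONEMON_TOTAL = 740_357      # Ponemon's average total cost per outage
HINES_PER_MINUTE = 10_000    # Hines's "closer to $10,000 or more" estimate


def linear_outage_cost(minutes: float, per_minute: float) -> float:
    """Estimate outage cost, ASSUMING cost grows linearly with downtime."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * per_minute


def implied_minutes(total: float, per_minute: float) -> float:
    """Duration implied by a total and a per-minute rate under the same
    linear assumption. This is derived arithmetic, not a study finding."""
    return total / per_minute


if __name__ == "__main__":
    for label, rate in [("Ponemon", PONEMON_PER_MINUTE), ("Hines", HINES_PER_MINUTE)]:
        print(f"{label}: 60-minute outage ~ ${linear_outage_cost(60, rate):,.0f}")
    minutes = implied_minutes(PONEMON_TOTAL, PONEMON_PER_MINUTE)
    print(f"Implied duration at Ponemon's figures: {minutes:.1f} minutes")
```

Under this model, the gap between the two per-minute estimates widens with every minute of downtime. The model cannot capture cases such as an ERP failure halting manufacturing, where a brief interruption triggers a much longer recovery.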

Ponemon's methodology applies "activity-based cost," which helps determine costs for areas that are hard to measure. In addition to collecting data, the monthslong study involved interviews with two to 20 people at each affected company. Ponemon said it uses a "confidential and proprietary benchmark method," and it lists several caveats. Among them: The 63 respondents came from inquiries sent to 600 organizations believed to have experienced one or more data center outages in the previous 12 months.

But the "nonresponse bias" was not tested, Ponemon said in the study, noting it is possible companies that did not participate are "substantially different" from those that responded, each of which suffered two to three outages in the prior year.

The study illustrated the problem of "normalizing an industry with distinct and varied behaviors by vertical and by type, age and utilization of each data center," said Julian Kudritzki, COO at Uptime Institute, a think tank devoted to maximizing efficiency and uptime in data centers.

The cost should always include the actual loss, the opportunity lost and the loss of good will, Hines said.

Most enterprises likely run at least two data centers hot-hot, so they would suffer negligible or no loss when one site goes down and the other picks up the load. An outage becomes costly when the second data center doesn't work or can't carry the full workload, Hines said.

The reasons for data center outages

The study also looked at the top reasons for unplanned data center outages. Failure of uninterruptible power supply (UPS) systems, including their batteries, remained No. 1, cited as the primary root cause of 25% of the outages. The fastest-growing cause of data center outages is "cybercrime," which rose from 2% of outages in 2010 to 22% in the latest study.

Ponemon agreed that even though human error was the cause of just 22% of outages, there is often a human element to many of the other causes, which also included water, heat or computer room air conditioner failure, weather-related incidents, generator failure and IT equipment failure.

"They shouldn't be bringing their Starbucks into the data center, but they still do," Ponemon said.

The leading cause of data center outages is, by far, "management-related," according to Kudritzki. That includes inadequate training, staffing, processes and procedures.

"Data center equipment performs very reliably if maintained well -- but that has the mighty assumption of the right processes and procedures, and training and staff," he said.

Some causes may overlap, Ponemon said. For example, a UPS may fail, but there also may have been a denial-of-service (DoS) attack, or equipment may have failed because it wasn't inspected and replaced on schedule -- both of which are essentially human errors.

Since batteries and generators are standby pieces of equipment, there is always a risk that they won't function as expected when needed, according to Hines. However, it is rare to suffer failure in a UPS, which is a "pretty rock solid device," he said.

One of the most common failures occurs during a utility outage, when the battery can't carry the load.

"Batteries and generators are the Achilles' heel of data centers," Hines said.

To this end, some helpful lessons came out of Superstorm Sandy and Hurricane Irene, Hines said, where generators failed to start or the fuel system failed. Sentinel uses a "fuel polishing system" that works to avoid that.

DoS attacks are "our biggest concern," Hines said. Because most data centers are automated, someone hacking in could disrupt mechanical systems and shut down a data center in a few minutes, he said.

To combat this, Sentinel's control systems are not connected to the Internet, and the company has strict internal controls that limit who has access to the systems. The firm also conducts annual background checks on employees.

Human error, and not UPS failure, should be at the top of the list of outage causes, according to William Dougherty, senior vice president and CTO at RagingWire Data Centers in Reno, Nev.

"From experience, as well as informal surveys with industry peers, I would say without hesitation that the No. 1 cause of unplanned outages is human error, and I'd put the rate at 80% or more," he said.

That percentage could be as high as 90%, he added, by including human error in the design, engineering and budgeting phases of a data center. An outage caused by a UPS failure could be traced to a design that allows for a single point of failure, a faulty sequence of operations or a mistake made during maintenance, he said. Most outages in a data center are caused during maintenance, Dougherty said, which is why so many companies defer maintenance or schedule it for off-peak times.

Data center reliability is directly proportional to the scale of the facility and the amount invested in availability, Dougherty explained. Thus, a small enterprise data center running at N+1 with single points of failure is more vulnerable than a large data center running at 2N+2.
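To make the N+1 versus 2N+2 comparison concrete, here is a minimal Python sketch under strong simplifying assumptions: every unit (a UPS module, say) is up independently with the same probability, N units are needed to carry the load, and 2N+2 is read as two independent N+1 systems. The per-unit probability is a hypothetical input, not a figure from the study, and the model ignores the single points of failure and human errors that the sources quoted here consider decisive.

```python
from math import comb


def p_at_least(k: int, n: int, p_up: float) -> float:
    """Probability that at least k of n independent units are up."""
    return sum(
        comb(n, i) * p_up**i * (1 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


def availability_n_plus_1(n: int, p_up: float) -> float:
    """N units needed, N+1 installed: the load holds if at most one fails."""
    return p_at_least(n, n + 1, p_up)


def availability_2n_plus_2(n: int, p_up: float) -> float:
    """ASSUMPTION: 2N+2 modeled as two independent N+1 systems;
    the load holds if either system can carry it."""
    one_side = availability_n_plus_1(n, p_up)
    return 1 - (1 - one_side) ** 2


if __name__ == "__main__":
    n, p_up = 2, 0.99  # hypothetical: 2 units needed, each up 99% of the time
    print(f"N+1:  {availability_n_plus_1(n, p_up):.6f}")
    print(f"2N+2: {availability_2n_plus_2(n, p_up):.10f}")
```

Even in this idealized model the second, independent system sharply reduces the chance of losing the load. In practice, shared dependencies and maintenance mistakes erode that independence, which is consistent with Dougherty's emphasis on human error.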

"Even with the high cost of outages, most companies cannot cost justify the investment required to make their own facilities that reliable," he said.

Data center operators should focus on costs in terms of longevity and lifecycle performance, not a series of adverse events at unpredictable intervals, he said.

"You should know the cost of downtime, but this is not a comparative exercise or a race to the bottom," Kudritzki said.

Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT.
