Minute by minute, data center outage costs stack up

A study pinpoints the cost of a data center outage and offers insight into the causes behind outages at more than 60 data centers in the past year.

The cost of an unplanned data center outage is now up to nearly $9,000 per minute.

Some data center pros, though, suggested this estimate is too low, and the more meaningful number is the total hit over an entire data center outage cycle.

The Ponemon Institute's statistics -- based on confidential interviews with operators of 63 data centers to update two similar studies in 2010 and 2013 -- calculated that a data center outage costs businesses an average of $8,851 per minute. The average total cost of a data center outage is up 38% since 2010, to $740,357. The numbers aren't adjusted for inflation, which Ponemon said wasn't significant over the five-year period.
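
As a rough illustration of how the two averages relate, the sketch below scales the $8,851 per-minute figure linearly to a few outage lengths and then divides the $740,357 average total by the per-minute cost. The outage lengths are hypothetical inputs, the function name is invented for this example, and the linear scaling is an assumption rather than the study's activity-based method. The article does not state an average outage duration, so the quotient is only what the two published averages would imply if cost grew in proportion to downtime.

```python
COST_PER_MINUTE = 8_851        # average cost per minute reported in the study (USD)
AVERAGE_TOTAL_COST = 740_357   # average total cost per outage reported in the study (USD)


def linear_outage_cost(minutes, cost_per_minute=COST_PER_MINUTE):
    """Naive estimate: cost grows in proportion to downtime (an assumption)."""
    return minutes * cost_per_minute


# Hypothetical outage lengths, in minutes, chosen only for illustration.
for minutes in (1, 30, 120):
    print(f"{minutes:>3} min -> ${linear_outage_cost(minutes):,}")

# Duration implied by the two averages under the same linear assumption.
implied_minutes = AVERAGE_TOTAL_COST / COST_PER_MINUTE
print(f"Implied duration: {implied_minutes:.1f} minutes")
```

A linear model like this is the kind of per-minute view that some operators quoted in the article consider less meaningful than the full cost of an outage cycle.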

The latest numbers follow a recent outage at a Verizon Communications Inc. data center that affected JetBlue Airways Corp.

For many big-name brands with a strong reputation, customers may accept one outage, but will be far less accepting after that, said Larry Ponemon, chairman and founder of the Ponemon Institute, which conducted the study with funding from Emerson Network Power.

Customer expectations may even vary depending on a brand's reputation or place in the market, he said. A high-end department store may have a higher bar to meet than a discount shoe retailer, for example.

The study looked at a variety of businesses across many verticals, including several colocation providers. If a colo operator's outage spanned several tenants, it was counted as a separate outage for each business with infrastructure within the colo building.

The per-minute breakdown of the cost of a data center outage isn't a particularly useful number, compared with the full cycle cost of an outage, asserted Paul Hines, a 20-year data center industry veteran who is senior vice president of operations and engineering at Sentinel Data Centers, which owns and manages data centers in New Jersey and North Carolina -- mostly for Fortune 500 companies.

"Being down for a second can be equal to being down for several hours," Hines said. He suggested the Ponemon per-minute estimates "are on the very low side," and the total cost per minute is likely closer to $10,000 or more.

Knocking out an ERP system, for example, could take down a company's manufacturing operation, which could easily send damages higher than $8,851 per minute, Hines said.

Ponemon's methodology applies "activity-based cost," which helps determine costs for areas that are hard to measure. In addition to collecting data, the monthslong study involved interviews with two to 20 people at each affected company. Ponemon said it uses a "confidential and proprietary benchmark method," and the study lists several caveats, among them that the 63 respondents came from inquiries sent to 600 organizations believed to have experienced one or more data center outages in the previous 12 months.

But the "nonresponse bias" was not tested, Ponemon said in the study, noting it is possible companies that did not participate are "substantially different" from those that responded, each of which suffered two to three outages in the prior year.
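
To put the sample size in perspective, the short sketch below works out the response rate and the rough number of outages behind the averages. It assumes each of the 63 participating data centers counts as one response to the 600 inquiries, and it treats "two to three outages" as a simple range per respondent. Both are simplifications for illustration, not figures stated in the study.

```python
INVITED = 600     # organizations contacted, per the study's caveats
RESPONDED = 63    # data centers that took part

# Share of contacted organizations that ended up in the sample.
response_rate = RESPONDED / INVITED

# Each respondent reported two to three outages in the prior year.
min_outages = RESPONDED * 2
max_outages = RESPONDED * 3

print(f"Response rate: {response_rate:.1%}")
print(f"Outages covered: roughly {min_outages} to {max_outages}")
```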

The study illustrated the problem of "normalizing an industry with distinct and varied behaviors by vertical and by type, age and utilization of each data center," said Julian Kudritzki, COO at Uptime Institute, a think tank devoted to maximizing efficiency and uptime in data centers.

The cost should always include the actual loss, the opportunity lost and the loss of good will, Hines said.

Most enterprises likely run at least two data centers hot-hot and would suffer negligible or no loss when one site fails and the other picks up the load. Costly downtime comes when the second data center doesn't work, or can't carry the full workload, Hines said.

The reasons for data center outages

The study also looked at the top reasons for unplanned data center outages. Uninterruptible power supply (UPS) system failure, including batteries, remained No. 1, cited as the primary root cause of 25% of the outages. The fastest-growing cause of data center outages is "cybercrime," which rose from 2% in 2010 to 22% in the latest study.

Ponemon agreed that even though human error was the cause of just 22% of outages, there is often a human element to many of the other causes, which also included water, heat or computer room air conditioner failure, weather-related incidents, generator failure and IT equipment failure.
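
The shares quoted so far can be tallied in a few lines. The sketch below assumes the percentages are parts of a single breakdown in which each outage has one primary root cause and the shares sum to 100%. The article does not give individual figures for the remaining causes, so they appear only as a combined remainder.

```python
# Primary root-cause shares quoted in the article, in percent.
primary_root_causes = {
    "UPS failure": 25,
    "cybercrime": 22,
    "human error": 22,
}

named_total = sum(primary_root_causes.values())
# Remainder covers water, heat or CRAC failure, weather, generator and IT equipment failure.
other_share = 100 - named_total

CYBERCRIME_2010 = 2  # percent of outages attributed to cybercrime in the 2010 study
growth_points = primary_root_causes["cybercrime"] - CYBERCRIME_2010
growth_factor = primary_root_causes["cybercrime"] / CYBERCRIME_2010

print(f"Named causes: {named_total}%, all others combined: {other_share}%")
print(f"Cybercrime: up {growth_points} percentage points ({growth_factor:.0f}x)")
```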

"They shouldn't be bringing their Starbucks into the data center, but they still do," Ponemon said.

The leading cause of data center outages is, by far, "management-related," according to Kudritzki. That includes inadequate training, staffing, processes and procedures.


"Data center equipment performs very reliably if maintained well -- but that has the mighty assumption of the right processes and procedures, and training and staff," he said.

Some causes may overlap, Ponemon said. For example, a UPS may fail, but there also may have been a denial-of-service (DoS) attack, or equipment may have failed because it wasn't inspected and replaced on schedule -- both of which are essentially human errors.

Since batteries and generators are standby pieces of equipment, there is always a risk that they won't function as expected when needed, according to Hines. However, it is rare to suffer failure in a UPS, which is a "pretty rock solid device," he said.

One of the most common failures occurs during a utility outage, when the battery can't carry the load.

"Batteries and generators are the Achilles' heel of data centers," Hines said.

To this end, some helpful lessons came out of Superstorm Sandy and Hurricane Irene, Hines said, where generators failed to start or the fuel system failed. Sentinel uses a "fuel polishing system" that works to avoid that.

DoS attacks are "our biggest concern," Hines said. Because most data centers are automated, someone hacking in could disrupt mechanical systems and shut down a data center in a few minutes, he said.

To combat this, Sentinel's control systems are not connected to the Internet, and the company has strict internal controls that limit who has access to the systems. The firm also conducts annual background checks on employees.

Human error, and not UPS failure, should be at the top of the list of outage causes, according to William Dougherty, senior vice president and CTO at RagingWire Data Centers in Reno, Nev.

"From experience, as well as informal surveys with industry peers, I would say without hesitation that the No. 1 cause of unplanned outages is human error, and I'd put the rate at 80% or more," he said.

That percentage could be as high as 90%, he added, by including human error in the design, engineering and budgeting phases of a data center. An outage caused by a UPS failure could be traced to a design that allows for a single point of failure, a faulty sequence of operations or a mistake made during maintenance, he said. Most outages in a data center are caused during maintenance, Dougherty said, which is why so many companies defer maintenance or schedule it for off-peak times.

Data center reliability is directly proportional to the scale of the facility and the amount invested in availability, Dougherty explained. Thus, a small enterprise data center running at N+1 with single points of failure is more vulnerable than a large data center running at 2N+2.
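
A simplified probability model shows why the redundancy level matters. The sketch below is not from the study or from Dougherty. It assumes units fail independently, uses a hypothetical load of N = 4 units and a hypothetical 99% availability per unit, and reads 2N+2 as two independent N+1 systems. It ignores shared single points of failure, maintenance windows and human error, which are the factors the operators quoted here say dominate in practice.

```python
from math import comb


def side_availability(needed, installed, unit_availability):
    """Probability that at least `needed` of `installed` independent units work."""
    p = unit_availability
    return sum(
        comb(installed, k) * p**k * (1 - p) ** (installed - k)
        for k in range(needed, installed + 1)
    )


N = 4                      # hypothetical number of units required to carry the load
UNIT_AVAILABILITY = 0.99   # hypothetical availability of a single unit

# N+1: one spare unit on a single system.
n_plus_1 = side_availability(N, N + 1, UNIT_AVAILABILITY)

# 2N+2, read as two independent N+1 systems: down only if both sides are down.
two_n_plus_2 = 1 - (1 - n_plus_1) ** 2

print(f"N+1:  {n_plus_1:.6f}")
print(f"2N+2: {two_n_plus_2:.8f}")
```

Under these assumptions the second system removes most of the remaining risk, but the result depends entirely on the two sides sharing no common point of failure.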

"Even with the high cost of outages, most companies cannot cost justify the investment required to make their own facilities that reliable," he said.

Data center operators should focus on costs in terms of longevity and lifecycle performance, not a series of adverse events at unpredictable intervals, he said.

"You should know the cost of downtime, but this is not a comparative exercise or a race to the bottom," Kudritzki said.

Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT or email him at rgates@techtarget.com.


What would you estimate is the average cost of a data center outage?

The key issue: does the "loss" mean the loss to the data center operator or to the tenant customer? Normally data center operators have insurance, so they may be able to recover the cost of an outage. Of course, the insurance premium will probably go up because of the outage. However, can the customer claim a penalty from the data center operator? I don't think so... at least in my country (Taiwan), no data center offers a penalty tied to the loss from an outage.

How about in the US or other regions?

It is very simple to estimate the total cost of downtime. We keep it simple because it is difficult to calculate precisely in a financial institution. Total earnings / 12 = monthly profit. Then divide it by daily business hours, usually 8. If the business asks about reputation, multiply it by annual growth. Thanks, all.

The question can be answered in several ways. One might be, in the case of batch applications processing payroll, inventory and financials, how much time or how many work units were lost and had to be caught up either internally, or by processing at another site, owned or rented, in a DR or BCP mode. Add that cost, plus the cost of SLAs to any customers you might have been processing work for who now, in an example from the banking industry, did not make the trucks in the morning and cannot meet their customer, manufacturing and other financial obligations. Some customers might be hard-pressed to then meet their needs, and that costs them too, and if they were smart enough to have SLAs with you, they would now be seeking redress and recompense.
Now, if you have online applications, then the cost could be per seat: the users who normally would be connecting to your application, times the units of work or time they could not work, times the cost contracted for in the SLA. SLA, of course, means service-level agreement. I have seen these outage totals approach the totally ludicrous, such that the company could not possibly have had the trillions of dollars of outage that it reported.
Add all of the outage costs for every application denied to a customer and the costs of the DR and SLA payments (punishment). Other items, such as fixing damage caused by a physical outage, are another thing. That is just the cost of doing business (BP, the business plan, which all companies have). Insurance and repairs to physical plant are all just a part of that. Not having adequate battery or UPS is a failure of the BCP (business continuity plan: how to stay in business each day if something goes wrong). In other words, human error. There is no way you can blame the equipment for planners' shortcomings.
The other answers all seem to be gross oversimplifications of the problem and would not even come close to evaluating or calculating the cost of not doing business.
But then, I worked for Big Giant Telephone Company, and the cost of outages was counted down to the micro dot and tittle. If a customer was offline and could not do their work, then that cost is real and counted in the total, hence the need for SLAs. Their purpose was to contractually specify and limit the damages to reasonable amounts.
So, the cost of an outage would be the costs of not doing business for any of the above reasons and others not listed here, plus the cost of getting the situation back to where it is productive.

Very well written, Robert. A data center outage should never be taken lightly, and these numbers show how important it is to avoid it altogether. A DDoS attack is the modern-day equivalent of a workers' strike 50 years ago. It's just paralyzing to an organization, and it's nice to see some metrics around its impact today.
