In an every-second-counts business, hardware or software failures mean lost money and loads of frustration. For IT pros, they mean hard lessons that can make the data center more resilient.
When financial data provider Bloomberg went dark one morning in early April, it interrupted the sale of three billion pounds of treasury bills by the United Kingdom's Debt Management Office.
The data center outage was caused by a combination of hardware and software failures in the network, which led to disconnections that lasted one to two hours for most customers, according to a brief statement Bloomberg issued following the incident. The company stated it had multiple redundant systems, which failed to prevent the disruption. Bloomberg declined to comment further.
A data center failure can often be traced to one of two things: an architecture flaw or process breakdown, according to Mat Mathews, a former software engineer and co-founder at Plexxi, a networking technology provider in Nashua, N.H.
The network, he said, is often at the root of outages since it is the most fragile part of the data center. Earlier this year, for example, a network connection failure brought Google's cloud service down for two hours.
More than 300,000 customers rely on Bloomberg's terminals to supply real-time financial data and messaging. Major outages threaten to send customers to competitors. It's up to IT pros to ensure customers don't leave due to technical issues.
Learning from data center outages
"They realize that every time this happens, it is a learning experience," Mathews said. "As an industry we need to be in a learning mode about this."
Data center outages are always a learning experience, according to Aaron Sawchuk, chairman and co-founder of ColoSpace, a data center operator in New England. He said all outages involve an "extensive post-mortem analysis." Lessons learned at one location -- for example, a faulty switch from a certain manufacturer -- result in a change across all of the company's data centers.
Additionally, he is involved in regional off-the-record consortiums that share experiences. High-profile outages, such as one at Amazon, were caused entirely by human error, he said.
A fundamental way to avoid an outage is to rely on physically separate data centers.
"It's pretty much about geographic redundancy," said Don Jones, curriculum director for IT Pro Content at online training company Pluralsight in Farmington, Utah.
That's a long way from the days when Jones worked for Bell Atlantic and would go to the fall-back site to try to stand up critical parts of the operation after an outage.
"It wasn't until we got to cloud scale did we start to think about software," he said, pointing to load balancers as an example. "The more legacy a company you are, the harder it is to re-engineer your software to take care of the way we do things today."
Duplication in one location was the old model of data center resiliency. Today, it is having one of everything at two locations, according to Sawchuk.
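The two-location model described above can be illustrated with a minimal Python sketch. It assumes each site is represented by a plain function and that a site signals an outage by raising an exception; the names `serve_with_failover`, `primary_site`, `secondary_site` and `SiteUnavailable` are hypothetical and are not drawn from any operator quoted here.

```python
from typing import Callable, Sequence


class SiteUnavailable(Exception):
    """Raised by a site handler when that location cannot serve the request."""


def serve_with_failover(request: str, sites: Sequence[Callable[[str], str]]) -> str:
    """Try each site in order and return the first successful response."""
    errors = []
    for site in sites:
        try:
            return site(request)
        except SiteUnavailable as exc:
            # Remember why this location failed, then move on to the next one.
            errors.append(str(exc))
    raise SiteUnavailable(f"all {len(sites)} sites failed: {errors}")


def primary_site(request: str) -> str:
    # Hypothetical primary location, simulated here as being down.
    raise SiteUnavailable("primary site is down")


def secondary_site(request: str) -> str:
    # Hypothetical second location that is still healthy.
    return f"secondary handled {request}"


result = serve_with_failover("quote:XYZ", [primary_site, secondary_site])
print(result)  # prints "secondary handled quote:XYZ"
```

In practice this decision usually sits in a load balancer or DNS layer rather than in application code, but the principle is the same: the request succeeds as long as one location is up.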
Once a data center goes down, the restoration process is often manpower-intensive with a lot of trial and error.
"These recoveries are rarely run as well as they look in a three-ring binder," he said.
The three-ring binder is the key to disaster recovery (DR), he said, but it often lacks the detailed information needed when something truly goes wrong.
"You really want a DR picture book that your grandmother could follow," Jones said. "When the heat turns up, half your brain can check out."
Sawchuk conducts quarterly training for his technical, non-technical and facilities staff about how to respond to an outage.
"It really comes down to an appropriate level of planning," he said.
There's an increasing effort to do predictive analysis to find areas that may melt down and cause an outage. Web-scale businesses such as Netflix led the way with programs such as its Chaos Monkey, a software tool that tests cloud resiliency and recoverability by simulating failures.
"New software makes no assumption about the hardware," he said.
That software, though, is often the culprit of high-profile outages. Facebook, Google and Twitter have all had outages where the root cause analysis found that an incorrect assumption in the logic of the programming was the culprit.
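The idea of testing resiliency by simulating failures, as described above, can be sketched in a few lines of Python. This is not how Chaos Monkey itself is implemented; it is a toy model under stated assumptions, with a hypothetical `Service` class whose replicas are just entries in a dictionary and a hypothetical `chaos_round` function that terminates one replica at random and then probes the service.

```python
import random


class Service:
    """Toy service backed by several replicas (hypothetical model)."""

    def __init__(self, replica_names):
        self.healthy = {name: True for name in replica_names}

    def kill(self, name):
        # Simulate the failure of a single replica.
        self.healthy[name] = False

    def handle(self, request):
        # Route the request to the first replica that is still healthy.
        for name, up in self.healthy.items():
            if up:
                return f"{name}:{request}"
        raise RuntimeError("no healthy replicas")


def chaos_round(service, rng):
    """Terminate one randomly chosen healthy replica, then probe the service."""
    candidates = [name for name, up in service.healthy.items() if up]
    victim = rng.choice(candidates)
    service.kill(victim)
    try:
        service.handle("probe")
        survived = True
    except RuntimeError:
        survived = False
    return victim, survived


rng = random.Random(7)  # fixed seed so a run can be repeated
service = Service(["replica-a", "replica-b", "replica-c"])
results = [chaos_round(service, rng) for _ in range(3)]
for victim, survived in results:
    print(victim, "killed; service survived:", survived)
```

With three replicas, the service survives the first two rounds and fails on the third, when the last replica is terminated. The point of running such an exercise is to learn how much failure a design tolerates before customers notice.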
Unlike Google or Facebook, which keep their networks inside their own data centers, Bloomberg has a very distributed WAN, Mathews said.
"There are likely things that weren't in their control," he said.
There's always the simple human factor, such as a spilled drink or "fat fingers." In one of his first jobs, Mathews said, he dealt with yogurt spilled on a frame relay switch. Everything was OK, but the idea that the Bloomberg outage was caused by a spilled can of Coca-Cola is not beyond the realm of possibility.
"I've seen weirder things happen," he said.
The physical infrastructure of data centers has moved beyond the days of redundant network tabs that all lead to the same trench out of the building, where a backhoe cuts through a cable.
"That's a sophomore mistake," Mathews said. Instead, today's applications are mainly client-server based and have a scale-out design that scatters many processes in the infrastructure so one process will not take down an application.
Jones said a lights-out data center, which is automated and which workers enter only on a regular schedule to complete maintenance tasks, is the best way to prevent the unpredictable effects of a spilled can of soda. Plus, automated tasks that are done the same way each time help to avoid outages.
"Error comes from inconsistency," Jones said.
Data center managers should tighten the processes and control what can be done by a human, and then make sure that a human error will not be compounded by automation, Sawchuk said.
Robert Gates covers data centers, data center strategies, server technologies, converged and hyperconverged infrastructure and open source operating systems for SearchDataCenter. Follow him @RBGatesTT.