An explosion knocks out power near an Equinix and Internap data center in Los Angeles. Lightning strikes near a Google data center in Belgium. A chilled water pipe fails in a CenturyLink data center in New Jersey, affecting New York Stock Exchange data displays.
All these incidents have occurred in recent weeks, showing the types of surprises that cause data center downtime and infrastructure failures.
Enter the integrated systems test (IST), which validates that emergency power, mechanical and monitoring systems operate as designed and built, and that applications, clusters or even an entire data center will respond the way you expect when the power goes out or chilled water stops flowing.
"[Integrated systems test is] the only opportunity you will have to test the full intensity of a facility," said Stephen Ford, managing director at data center testing company E1E10 Ltd. in Peterborough, U.K., who has performed ISTs for more than a decade.
The idea of an IST makes sense, but not everyone tests as much as they should or as much as they say they do, and some don't test at all. Compare the IST to backups or disaster recovery (DR): Everyone performs backups and takes snapshots, but how many organizations actually test those backups to verify they're recoverable?
Data centers that pull the plug
The integrated systems test is done at all levels. Facebook recently shut off one of its data centers -- after all the necessary preparations were in place -- and nothing happened. That is how it is supposed to work.
Ford said he sees banks and government institutions as the most diligent sectors about conducting an IST. In some verticals, regulatory compliance rules about business continuity or DR preparedness may force an organization's hand. But others may still skip it or cut corners.
"Some of the colo guys just go through things just to say they have done it rather than really test the system," Ford said.
One colocation provider that does a full IST annually is vXchnge, a carrier-neutral provider that owns 15 data centers across the United States.
The company performs an IST before any customers are up and running, and then once a year thereafter.
"It creates the possibility for chaos in a controlled environment," said Ali Marashi, senior vice president of engineering and CTO at vXchnge in Tampa, Fla.
An IST can uncover all sorts of things. In a data center that vXchnge acquired, the first IST revealed that the control and monitoring system didn't have all of its power circuits connected to the uninterruptible power supply (UPS) system.
"When we pulled the plug, we found the monitoring system went dark," he said.
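The kind of gap Marashi describes can also be looked for on paper before the plug is pulled. The following is a minimal sketch, assuming a facility keeps an inventory of circuits that records whether each one is fed from the UPS; the `Circuit` fields, circuit names and system labels are hypothetical and do not come from vXchnge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    name: str         # hypothetical circuit label
    system: str       # system the circuit feeds, e.g. "monitoring"
    ups_backed: bool  # True if the circuit is fed from the UPS


def circuits_off_ups(circuits, critical_systems):
    """Return circuits that feed a critical system but are not on the UPS."""
    return [
        c for c in circuits
        if c.system in critical_systems and not c.ups_backed
    ]


# Illustrative inventory: one monitoring circuit was never moved to the UPS.
circuits = [
    Circuit("MON-A", "monitoring", True),
    Circuit("MON-B", "monitoring", False),
    Circuit("LIGHT-1", "lighting", False),
]

gaps = circuits_off_ups(circuits, {"monitoring"})
for c in gaps:
    print(f"{c.name} feeds {c.system} but is not UPS-backed")
```

A check like this only reflects what the records say. The IST is still what confirms the wiring matches the paperwork.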
Marashi noted that in an N+1 facility or higher there is no single failure point and the risk is low.
"The transfer event validates that the redundant systems catch the load in a seamless way," he said.
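As a rough illustration of what N+1 means for capacity, the sketch below checks whether the remaining units could still carry the load if the largest single unit failed. It is a simplified model under stated assumptions: the function name and the kilowatt figures are made up for the example, and it ignores distribution paths, transfer switches and anything else a real design review would cover.

```python
def survives_single_failure(unit_capacities_kw, load_kw):
    """Return True if the load is still covered after losing any one unit.

    Losing the largest unit is the worst case, so it is enough to check
    that the total capacity minus the largest unit still meets the load.
    """
    if not unit_capacities_kw:
        return False
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining_kw >= load_kw


# Hypothetical example: three 500 kW UPS modules.
modules_kw = [500, 500, 500]
print(survives_single_failure(modules_kw, 900))   # two modules cover 900 kW
print(survives_single_failure(modules_kw, 1100))  # two modules cannot cover 1,100 kW
```

The arithmetic shows only that spare capacity exists. The transfer event in an IST is what shows the redundant systems actually pick up the load.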
VXchnge is growing and has made several data center purchases in recent months. In one case, the company could not confirm the last time an IST was conducted.
"We don't know what we don't know, and the IST will be the only way we can answer those questions," Marashi said.
Marashi, who has performed integrated systems tests for more than 15 years, said he looks for two things: that the data center operates as expected from end-to-end and that the people and processes react correctly.
That's important because "human error still accounts for the largest source of downtime in a data center," Marashi said.
Most of the major multi-tenant data center colocation providers do ISTs, but the frequency may vary, he said. He worked at Equinix for three years and said ISTs were standard practice there.
Don't fear the IST
The IST can be a chance for colocation customers to work with their providers by conducting similar tests of their own. Customers can use it to fail their primary node and have it picked up in a redundant data center, for example, Marashi said.
For vXchnge, customers are always notified in advance about the IST, with enough notice so that they can plan tests of their own, if they want.
E1E10 Ltd.'s Ford recommends that data center operators test their generators and UPS systems offline every week. He was involved in an IST at a data center that had not run its generator in a year. In another case, he found water in the diesel tank, which commonly happens when diesel fuel sits unused for a long period or is exposed to temperature changes.
The C-suite's fear of an integrated systems test's results may be its biggest roadblock, Ford said. But if your data center has backup power and a failover plan in place and you've never used them, how do you know they really work?
"They only see it as testing that creates risk," Ford said. "Until something goes wrong, it is very difficult to convince them to get these done."
For one large car company, it took no car sales on what would otherwise be a busy Saturday to show them just how wrong things can go. Management came in on Monday morning to review weekend car sale numbers and found that no vehicles had been sold in the U.K., Ford said. It turned out that the company's data center had gone down and it hadn't failed over to its backup.
"The business hadn't realized how mission-critical the data center had become," Ford said.
Michael Fluegeman, a professional engineer and manager of data center support systems at PlanNet in Brea, Calif., said he encounters similar questions about whether to perform a complete integrated systems test in a live facility.
"The naysayers say it is a lot of risk," Fluegeman said.
But if the IST is planned, rehearsed and closely watched, the risk is low, he said. And the risk is far less than the risk of a data center failure coming at a business-critical time.
"Usually it is 3 a.m. on a Sunday or some crazy time like that," Fluegeman said.
Fluegeman, like Marashi, pointed to the purchase of a data center as a good time to conduct an IST.
For example, Fluegeman had a client that had purchased a five-year-old data center that had been largely unused. It didn't have any known failures, "but they really didn't know what they had."
Often, individual tests of components and subsystems are performed first. That helps detect any problems where equipment from different vendors isn't interacting well.
"A number of things may not be right and when it is all happening at once, it may be hard to figure out," he said.
It all leads up to a larger "pull the plug test."
"That tends to be the final one where everyone wants to watch," Fluegeman said.
In the enterprise data center, the integrated systems test is a "maturing discipline," Marashi said. It often takes more detailed coordination to execute.
There is no reason not to do it, Marashi said, only to delay it if there are known conditions that need to be addressed or there are outstanding jobs that should be completed before pulling the plug.
Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter.