In the past, the mantra "keep it cool" was king in data center facilities. But demands for high availability, new guidelines and rising server densities are driving data center cooling trends in a different direction.
ASHRAE guidelines allow servers to run at higher temperatures. What do you think about the concept of "running hot"?
Fred Homewood: It's certainly an interesting concept, and it really does improve efficiency in the data center. We're seeing a lot of interest in it from our customers. Some are selectively partitioning their data center buildings so they can run higher-grade equipment, or equipment that can run hotter, separately from equipment they've determined won't run at those same temperatures. That has given them large savings. Running hot, bringing in cold air from outside and using it is helping to bring significant savings in cooling.
The cooling budget is about twice the cost of the equipment, so we see this trend bringing our customers significant savings.
When you're talking about partitioning, do you mean building separate sections of the facility or adding walls? How would this work?
Homewood: Some of the big operators that already run hot/cold aisles are zoning their buildings by equipment (hard drives, servers, motherboards, memory, etc.) that they've determined runs well at higher temperatures, and positively grading it out. Where conventionally they ran at up to 40 degrees Celsius (104 degrees Fahrenheit) intake temperature, they're now running at 45 degrees C (113 degrees F), or in some cases even above 50 degrees C (122 degrees F). Say they find that a particular batch of hard drives will run at higher temperatures without additional errors, or vice versa. That batch will be set apart from a batch where they are finding errors at the higher temperatures. They're not partitioning physically so much as separating equipment into zones, so they can run parts of the data center at higher temperatures.
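The batch-grading practice described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not any operator's actual process: it assumes each batch has been burn-in tested for the same number of device-hours at a conventional intake temperature (40 degrees C) and at an elevated one (for example 45 degrees C), and it assigns a batch to the hot zone only if running hotter added no more errors than a chosen tolerance. The names `BatchTest` and `assign_zones` and the sample data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatchTest:
    """Hypothetical burn-in results for one batch of parts (e.g. hard drives)."""
    batch_id: str
    errors_baseline: int  # errors seen at the conventional intake temperature (40 C)
    errors_hot: int       # errors seen at the elevated intake temperature (e.g. 45 C)
    hours: float          # device-hours of testing, assumed equal for both runs


def error_rate(errors: int, hours: float) -> float:
    """Errors per device-hour."""
    return errors / hours


def assign_zones(tests: list[BatchTest], tolerance: float = 0.0) -> dict[str, str]:
    """Put a batch in the 'hot' zone only if running hotter added at most
    `tolerance` extra errors per device-hour; otherwise keep it 'cool'."""
    zones = {}
    for t in tests:
        extra = error_rate(t.errors_hot, t.hours) - error_rate(t.errors_baseline, t.hours)
        zones[t.batch_id] = "hot" if extra <= tolerance else "cool"
    return zones


if __name__ == "__main__":
    sample = [BatchTest("A", 2, 2, 1000.0), BatchTest("B", 1, 9, 1000.0)]
    print(assign_zones(sample))
```

With the sample data, batch A (2 errors at both temperatures) lands in the hot zone, while batch B (rising from 1 to 9 errors) stays in the cooler zone. A real qualification would also weigh statistical confidence and long-term life effects, which this sketch ignores.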
How has the high-density server trend changed the way people cool their equipment?
Homewood: The one thing we've seen significant interest in is using outside environmental air. Density is definitely going up, so by doing this people are seeing a higher temperature differential between the equipment and the incoming air, which makes the air more effective at carrying heat away. That's important for us because our Ethernet switches are high-density switches, designed for customers who are cramming more and more servers, storage, etc. into these data centers, and that is pushing up temperatures.
Whereas a few years ago a rack might be 5 kW, we've now got a bigger customer with a rack of 110 kW. In the old scheme, that's near to melting! What that means is that exhaust temperatures are very high. Allowing intake of higher-temperature air, particularly ambient air from outside, using it to free-cool the equipment and venting straight to the outside world gives significant cost savings. Running at higher temperatures makes that cooling more efficient because there's a higher differential between the incoming and outgoing air, although you're also dependent on the temperature of the incoming air. It's something we're using in our own data center: we just started venting straight to the outside, and that's saving us a significant amount of money because we don't have to install a load of new air conditioning units.
For bigger customers, we see these cost savings being significant.
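The point about a higher differential between incoming and outgoing air follows from the basic sensible-heat balance, P = ρ · V̇ · c_p · ΔT: for a fixed heat load P, the larger the temperature rise ΔT across the equipment, the less air V̇ has to be moved. The sketch below applies that relationship to the 5 kW and 110 kW racks mentioned above. The air density and specific heat are textbook approximations (assumptions), and the 10 K and 20 K temperature rises are illustrative values, not figures from the interview.

```python
AIR_DENSITY = 1.2  # kg/m^3, approximate density of air near room temperature (assumption)
AIR_CP = 1005.0    # J/(kg*K), approximate specific heat of air at constant pressure (assumption)


def required_airflow_m3s(heat_load_w: float, delta_t_k: float) -> float:
    """Volumetric airflow (m^3/s) needed to carry away heat_load_w watts
    when the air warms by delta_t_k kelvin from intake to exhaust.
    Sensible heat only: P = rho * V_dot * c_p * delta_T."""
    if delta_t_k <= 0:
        raise ValueError("exhaust air must be hotter than intake air")
    return heat_load_w / (AIR_DENSITY * AIR_CP * delta_t_k)


if __name__ == "__main__":
    for rack_kw in (5, 110):
        for dt in (10, 20):
            flow = required_airflow_m3s(rack_kw * 1000, dt)
            print(f"{rack_kw:>3} kW rack, temperature rise {dt} K: {flow:.2f} m^3/s")
```

Doubling the temperature rise from 10 K to 20 K halves the airflow for the 110 kW rack, from roughly 9.1 to roughly 4.6 cubic metres per second. The calculation covers only the heat carried by the air; it says nothing about fan power or about how cool the outside air is, which the free-cooling approach still depends on.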
What effect does the increased demand for high availability in data centers have on cooling costs?
Homewood: In high-availability environments, you have to operate within the temperature guidelines and constraints of the equipment. One of the very large data center customers we're working with has taken this beyond just looking at the bulk parameters of their supplied equipment, to looking at individual batches, as I mentioned before. Moving the ambient air from 40 degrees Celsius (104 degrees Fahrenheit) to 45 degrees Celsius (113 degrees Fahrenheit) or even above 50 degrees Celsius (122 degrees Fahrenheit) adds up to huge cost savings overall.
To get high availability, the equipment needs to perform. It's no good having equipment that is failing, so they're doing life analysis and availability analysis of individual pieces of equipment. That has filtered down to us because our switches have to go into this environment. It's no good having the servers or disks performing in those high-availability, hot conditions if the switches connecting them into the network don't work. We've had customers say, 'Hey, we want you to qualify our switches at 45 to 50 degrees Celsius air intake and give us guarantees around availability and life expectancy.' They're doing that for cost reasons, so we're responding to help with their overall savings.
If you look at the cost of running a data center for three years, even a new build, power is the dominant cost in that equation. It trumps the cost of the equipment. Cooling accounts for about two thirds of that power, so cooling is the most significant single contributor to the cost of running a data center.
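To see why a two-thirds cooling share matters so much, it helps to turn it into a multiplier on the IT load. A minimal sketch, assuming for simplicity that all facility power goes either to cooling or to the IT equipment itself and ignoring other overheads such as power distribution losses: if cooling takes a fraction f of total power, total power is the IT load divided by (1 − f). The function name and the example inputs are hypothetical.

```python
def facility_power_kw(it_load_kw: float, cooling_share: float) -> float:
    """Total facility power when cooling consumes `cooling_share` of it.

    Simplifying assumption: every watt not spent on cooling goes to the
    IT equipment; distribution losses, lighting, etc. are ignored.
    """
    if not 0.0 <= cooling_share < 1.0:
        raise ValueError("cooling_share must be in [0, 1)")
    return it_load_kw / (1.0 - cooling_share)


if __name__ == "__main__":
    it_load = 110.0  # kW, the rack size quoted earlier in the interview
    for share in (2 / 3, 1 / 3):
        total = facility_power_kw(it_load, share)
        print(f"cooling share {share:.2f}: total {total:.0f} kW, cooling {total - it_load:.0f} kW")
```

With f at two thirds, a 110 kW IT load draws about 330 kW in total, 220 kW of it for cooling; cutting the cooling share to one third would bring the total down to about 165 kW.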
What are some of the reasons companies might want to rethink their data center cooling infrastructure?
Homewood: In addition to the cost of supplying cooled data center environments, where you're reliant on relatively high intake power (about two thirds of incoming power can be spent on cooling if you're not careful), you want to offset or decrease your reliance on that power source during brown-outs and blackouts. Going to a higher temperature, where there's a higher temperature differential between intake and exhaust air, means you're getting better cooling from your air. People are already looking at liquid cooling or front-door liquid cooling on these systems to try to improve the efficiency of the cooling.
So there are other reasons besides cost. Cost is the principal one, but having to switch immediately or semi-immediately to a backup power supply, which alone can be very expensive and troublesome, is another reason why cooling infrastructure is so important in a data center. There are a lot of cases where the incoming power supply is fine and the data center is running fine, but part of your cooling system goes out, and you know that pretty soon you're going to have to start shutting down equipment.
You have to look at the reliability of the whole system, and I think this is one of the primary movers in people looking to go to these high temperatures and using environmental air for cooling. The environmental air will be there apart from unusual conditions when outside temperatures go very high, which is very rare in the UK and even in some of the northern states of the U.S. A lot of dense data centers have been sited near cooling lakes or in places with cooler air. This reliance on the environment for cooling is giving significant improvements in the overall risk/reliability of the system.
One of the things we see with our equipment is that customers get higher utilization of their equipment from free cooling. A heavily used switching system and network can effectively become overloaded or go down, and throughput in the network has to be maintained. Customers' networks in high-density and high-availability systems are being pressed more and more, and they want high data rates. It's all part of the system, and the network needs to perform. These systems can become so complex and interconnected that you can't change one variable without changing many of the others.
If you had to pick a single issue that's most likely to cause cooling problems in a data center, what would it be?
Homewood: Gnodal concentrates on the network and keeping it going; we've all experienced network death. In the data center you can get network 'brown-outs' where TCP is just retrying and no one is making progress, so we see improving throughput as key. If the network is down or not making progress, all that power is doing nothing. Efficiency and interlock between all the components are critical, and the network provides that.
ABOUT THE CONTRIBUTOR: Fred Homewood has been a chip designer for more than 20 years, specializing in the development of processors and communication systems. Homewood founded Gnodal in 2007 to develop standards-based, high-performance, low-power interconnects for large data centers. Previously, he was a design manager at Quadrics, responsible for ASIC projects. He also worked at STMicro, first as an architect for VLIW processor design and later as chief architect of the ST200 processor family. Additionally, Homewood was an architect at Inmos, where he helped to develop the first floating-point processor chip, the T800 Transputer.
This was first published in July 2012