How to prevent and correct server cooling failures

Advance planning can prevent server cooling failures, and good practices can help remedy them and minimize the negative impact on your data center.

Network infrastructure design is an intricate, delicate process in which every decision can affect every server in your data center. Too often, server cooling requirements are overlooked, misjudged or severely underestimated, and this can happen in environments of any size. Experience has shown that tight budgets tempt IT managers, even in huge environments, to skimp on the crucial cooling and temperature control needs of their data centers. This is where accidents happen. Starting out with the right cooling and knowing how to handle failures when they occur can keep you online and save your organization a lot of money.

Start with the right amount of server cooling
Before you ever deal with a failure in your server cooling infrastructure, it's imperative to understand how cooling works and how to calculate the correct amount of cooling for your setting. Cooling is directly related to the amount of power drawn by all of the electrical equipment in the data center. Power consumption is based on the number of amps drawn by the equipment: the more amps drawn, the more cooling is required. The simplest way to start the calculation is to know how much power your servers will be drawing (per server or in total). A typical single-CPU server will draw about 1 amp, or 120 watts (1 amp x 120 volts = 120 watts); a dual-CPU server, such as one built on Xeon or AMD processors, will draw about 2 amps, or 240 watts. Now multiply the power requirement by 3.4 BTU per watt to approximate the cooling requirement.

For example, a server drawing 120 watts will require about 408 BTU of cooling (120 x 3.4). Before doing these calculations, use a cooling calculator to gauge what sort of cooling you might need, and make sure a certified HVAC consultant is on hand. Mistakes occur when assumptions are made with little or no cooling knowledge. Have an expert ready to assist you in deciding not only how much cooling you'll need, but which types.
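The arithmetic above can be sketched as a short script. The constants follow the article's rules of thumb (120-volt circuits, 3.4 BTU per watt); real loads vary, so treat the output as a starting point for a conversation with an HVAC consultant, not a final spec.

```python
# Rough cooling estimate from server power draw.
# Assumptions: 120 V circuits and the 3.4 BTU/hr-per-watt
# conversion described above.

BTU_PER_WATT = 3.4  # approximate BTU/hr of cooling per watt of load
VOLTAGE = 120       # typical North American circuit voltage

def cooling_btu(amps_per_server: float, server_count: int) -> float:
    """Approximate BTU/hr of cooling needed for a group of servers."""
    watts = amps_per_server * VOLTAGE * server_count
    return watts * BTU_PER_WATT

# One single-CPU server drawing 1 amp (120 W):
print(round(cooling_btu(1, 1)))    # 408 BTU/hr, matching the example above
# Ten dual-CPU servers drawing 2 amps each:
print(round(cooling_btu(2, 10)))   # 8160 BTU/hr
```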

Dealing with a failed server cooling system
When a server cooling failure occurs, it's important to take prompt and decisive action. The right moves will keep you running longer and prevent more serious failures. Here are a few guidelines that will help a data center administrator out of a bind, especially if the entire cooling system has just failed:

  1. Know who your maintenance people are and how to reach them quickly. Your first call will be to your HVAC engineers. Be detailed in the description of the problem, too. If your engineers need to bring spare parts, this will help them. At this point, every second counts!
  2. Understand and anticipate what the impact will be if the cooling unit fails. For example, know how long your servers will last before the room reaches critical temperatures (over 120 degrees F). It will help to conduct heat stress tests on your environment to fully understand how long your servers can last before those critical temperatures are reached.
  3. Have a service-level agreement ready for your critical environment. If you don't, have a portable cooling system on hand (or ready for rent). For example, Tripp Lite's SRCOOL12K is specifically designed as an emergency cooling solution and provides about 12,000 BTU of cooling power. Spot coolers can be expensive, but many can be rented on short notice. Still, if you only have a few hours before the room reaches those critical temps, you may have to make some financial sacrifices.
  4. Shut off non-essential servers. Development servers are often big power users that don't need to run during production. Any test server or other non-essential business service needs to be turned off.
  5. If you start approaching those critical temperature levels, open some windows and doors. Your goal is to reduce the temperatures in that room. If the temperature outside of the room is lower than in the room, use any sort of fan to blow the hot air out.
  6. This is one of the most important rules when dealing with a failed cooling unit: Make sure you have all backups prepared ahead of time.
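The "know how long your servers will last" advice in step 2 can be roughed out on paper. The sketch below is a back-of-the-envelope worst case, assuming a sealed room where all server power becomes heat absorbed only by the room air; the function name and example figures are illustrative, and a real heat stress test is the only trustworthy measure.

```python
# Crude estimate of how fast a sealed room heats up once cooling fails.
# Assumptions: all server power becomes heat, no heat escapes through
# walls or equipment, and only the room air absorbs it -- a pessimistic
# lower bound on your reaction time.

AIR_DENSITY = 1.2         # kg/m^3, air near room temperature
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K)

def minutes_to_critical(load_watts: float, room_volume_m3: float,
                        start_f: float = 72.0,
                        critical_f: float = 120.0) -> float:
    """Minutes for room air to rise from start_f to critical_f (deg F)."""
    delta_k = (critical_f - start_f) * 5.0 / 9.0  # Fahrenheit degrees -> kelvins
    air_mass = AIR_DENSITY * room_volume_m3       # kg of air in the room
    joules = air_mass * AIR_SPECIFIC_HEAT * delta_k
    return joules / load_watts / 60.0

# Example: 5 kW of servers in a 100 m^3 room
print(round(minutes_to_critical(5000, 100), 1))   # about 10.7 minutes
```

In practice walls, racks and leakage buy you more time than this, which is exactly why the article recommends measuring with a real stress test rather than trusting the math alone.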

Having a plan in place can help with some of the cooling issues you will face if a unit fails. If you have a secondary hot/cold site or a contingency plan, you will want to start executing the first stages of that plan. If you know your servers will not be able to stay up, fire up your remote site and be prepared to transfer the load. A business contingency plan will sustain normal business operations despite there being an emergency, and a disaster recovery plan will help you move your entire data center to an emergency site to prevent major downtime.

Creating server cooling redundancy
Having a failed cooling unit may not necessarily bring your entire environment to a halt. If you have built-in redundancy and a failover plan, you will be ready for a failed unit.

The base cooling redundancy methodology is n+1, but a lot will depend on the cooling needs and budget of the company. For a medium-sized data center (about 1,000 square feet), there are a few cooling options. For example, you can install a 1.5-ton AC unit for your server room and an additional 1-ton unit for backup. The units can be load balanced, with the smaller cooling unit taking over temporarily should the main one fail. The budget here will probably not allow for an elaborate setup, but the environment can still be properly cooled.
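The 1.5-ton-plus-1-ton example above can be sanity-checked with a small script. The helper name, the load figure and the one-ton-equals-12,000-BTU/hr conversion are illustrative assumptions; the point is simply to ask, for each unit in turn, whether the rest can carry the room alone.

```python
# Quick n+1 sanity check: if any single cooling unit fails, can the
# remaining units still cover the room's heat load? Unit sizes and
# loads here are illustrative, not prescriptive.

BTU_PER_TON = 12_000  # one ton of cooling is 12,000 BTU/hr

def survives_single_failure(unit_tons: list, load_btu: float) -> bool:
    """True if every single-unit failure leaves enough cooling capacity."""
    for i in range(len(unit_tons)):
        remaining = sum(unit_tons) - unit_tons[i]
        if remaining * BTU_PER_TON < load_btu:
            return False
    return True

# 1.5-ton primary + 1-ton backup against a 10,000 BTU/hr load:
print(survives_single_failure([1.5, 1.0], 10_000))   # True
# The same pair against an 18,000 BTU/hr load:
print(survives_single_failure([1.5, 1.0], 18_000))   # False
```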

"For your medium-sized environment, be ready to talk to a certified HVAC expert. If money allows, a fully redundant n+1 environment will have 3-4 units operating at a given time," said David Langlands, a Network Architecture and Security Partner at ESPO Systems. "That is, you will always have 100% of your cooling needs met."

For a large data center (above 5,000 square feet), cooling requirements are going to vary. Whereas medium and small server rooms are designed to prevent failures, large environments are designed to anticipate them. There will be times when you will need to take your cooling units down for maintenance, even during production hours; this isn't a problem when you have multiple systems up and running. Companies such as Emerson Electric Co. or independent consultants can help you develop a server cooling system that can handle multiple failures.

Large corporate data centers are often well developed and have several cooling features in their environments:

  1. Raised floors that increase airflow and move cool air throughout the data center.
  2. Blanking plates installed in the gaps between servers to block air loss and increase airflow efficiency.
  3. A hot aisle/cold aisle design that lets equipment take in cool air from the cold side of each aisle and exhaust hot air into the hot side. Remember, a large server room will have multiple aisles, and each aisle has a hot and a cold side.
  4. Rooftop chillers that push ethylene glycol through closed pipe loops into the data center. The chilled liquid cools the servers, and the warmed liquid returns to the chiller to be cooled again, continuing the cycle.
  5. Blowers, a type of AC unit that uses the chilled ethylene glycol to push cold air into the server room.
  6. Plenums, enclosed air ducts that push cold air from the bottom of the server room up through the racks and exhaust it out of the top.

There are other devices and hardware as well, but planning and speaking with your HVAC specialist will help you get a grasp on what is needed for a securely cooled environment.

Best practices for cooling servers
Finally, let's summarize a few of the most important principles for dealing with cooling problems in the data center:

  1. Have an HVAC expert on speed dial. Have a plan for cooling redundancy ready.
  2. Monitor your temperatures! Log temperature ranges and have a system that will alert you if the server room gets too hot. AVTECH TemPageR is one such system; it will also log and graph temperatures. This $200 device is well worth it in the long run.
  3. Remember, it's not just about temperature, but also humidity. The last thing you want in your server room is condensation.
  4. That said, be ready for condensation that will arise from your cooling system. Plan for where that water will go. Drip trays, water routing methods, and other condensation control methods are available.
  5. An often overlooked fact is that most businesses that are big enough to have cooling needs are growing entities. "Don't design for what you have to cool today, design for what you have to cool tomorrow," Langlands said. "Always plan ahead and design for growth."
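The temperature monitoring in point 2 above can be sketched as a small polling loop. Everything here is an assumption to adapt: `read_room_temp_f` is a placeholder for whatever your sensor (for example, a TemPageR unit) actually exposes, and the thresholds and alert mechanism are illustrative.

```python
# Minimal temperature-monitoring loop. read_room_temp_f() stands in
# for your sensor's real API; thresholds and alerting are examples.

import time

WARN_F = 85.0       # start paying attention
CRITICAL_F = 100.0  # well below the 120 F danger point discussed above

def read_room_temp_f() -> float:
    """Placeholder: replace with a call to your monitoring hardware."""
    return 72.0

def check_temperature(temp_f: float) -> str:
    """Classify a reading so the alerting logic stays testable."""
    if temp_f >= CRITICAL_F:
        return "critical"
    if temp_f >= WARN_F:
        return "warning"
    return "ok"

def monitor(poll_seconds: int = 60) -> None:
    """Poll the sensor and raise an alert whenever a reading is not ok."""
    while True:
        status = check_temperature(read_room_temp_f())
        if status != "ok":
            print(f"ALERT ({status}): server room temperature high")
        time.sleep(poll_seconds)

print(check_temperature(95.0))   # warning
```

Keeping the classification separate from the polling loop makes the thresholds easy to test and lets you swap the `print` for email, SMS or a pager integration.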

There is a lot to consider when planning or working with a failed server cooling environment. The best course of action is to be vigilant and prepared for anything that may arise. The key message: Have a backup plan ready and know your HVAC specialist.

What did you think of this feature? Write to SearchDataCenter.com's Matt Stansberry about your data center concerns at  [email protected].
