Reduce chances of hardware failure with preventive server maintenance

Data center hardware failure may occur on occasion, but proper preventive maintenance on your servers can minimize the likelihood of failure.

Servers as hardware tend to get attention only when it's time to upgrade or when something goes wrong. The most common types of server hardware failures involve hard drives, power supplies, RAID adapters, motherboards, RAM or CPUs. There are a number of preventive measures you can take to minimize the likelihood of a failure and to have the best chance of a quick recovery if one occurs.

Use good equipment and care for it well

First, use good hardware and protect it well. While it is certainly possible to use a standard PC as a server, it's a recipe for disaster in any production server. Dedicated servers use boards that are intended to run 24/7 and are generally better engineered to make failures less likely.

At the high end, servers from makers like Hewlett-Packard Co. and Dell may even include features such as dual power supplies, hot-swap capability for PCI slots and fault-tolerant RAM that will continue to function even if one RAM module fails. In a similar vein, a RAID 5 or RAID 10 array is a basic requirement, but using enterprise-class drives is also important. Look for RAID Edition or Enterprise designations to ensure the drives are engineered to run in a 24/7 duty cycle.
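When choosing between RAID 5 and RAID 10, it helps to see how much usable space each leaves from the same set of drives. The following is a minimal Python sketch, assuming identical drives; the function name and parameters are illustrative and do not come from any vendor tool.

```python
def usable_capacity_tb(level: int, drive_count: int, drive_size_tb: float) -> float:
    """Usable capacity, in TB, of an array built from identical drives.

    RAID 5 spends one drive's worth of space on parity. RAID 10 mirrors
    every drive, so only half of the raw space is usable.
    """
    if level == 5:
        if drive_count < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drive_count - 1) * drive_size_tb
    if level == 10:
        if drive_count < 4 or drive_count % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return (drive_count // 2) * drive_size_tb
    raise ValueError("only RAID 5 and RAID 10 are handled in this sketch")


if __name__ == "__main__":
    # Made-up example: four 2 TB drives.
    for level in (5, 10):
        print(f"RAID {level}: {usable_capacity_tb(level, 4, 2.0)} TB usable")
```

RAID 5 yields more space from the same drives, while RAID 10 gives up capacity in exchange for mirroring every drive.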

Second, protect the hardware. Ensure that servers are plugged into high-quality (not $7.99) surge protectors or uninterruptible power supplies (UPSes). Servers -- and their air inlet filters -- should be cleaned regularly. The dust that often accumulates in industrial buildings may contain metallic particles that can not only cause a server to overheat but also short it out. Don't leave slots uncovered if boards are removed -- insects and mice can enter through uncovered holes and wreak havoc on the system.

Make sure that the systems are well ventilated. Servers mounted in racks usually keep their ventilation holes clear, but equipment stacked in a wiring closet can easily end up with obstructed vents, which is a sure way to overheat a server and shorten its life. Adequate cooling for the number of servers is also a necessity. The server room should be kept at no more than 70 degrees Fahrenheit (about 21 degrees Celsius), and the cooler the equipment stays, the longer it should last.
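The 70-degree ceiling is easy to check automatically if you already collect temperature readings. The following is a minimal Python sketch rather than a monitoring product: the sensor names are hypothetical, the readings are made up, and it assumes sensors report in Celsius. Only the 70-degree limit comes from the text above.

```python
# Limit from the article: the server room should stay at or below 70 F.
MAX_ROOM_TEMP_F = 70.0


def celsius_to_fahrenheit(temp_c: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return temp_c * 9.0 / 5.0 + 32.0


def overheated_sensors(readings_c: dict[str, float],
                       limit_f: float = MAX_ROOM_TEMP_F) -> list[str]:
    """Return the names of sensors reading above limit_f, sorted by name.

    readings_c maps a (hypothetical) sensor name to a Celsius reading.
    """
    return sorted(
        name for name, temp_c in readings_c.items()
        if celsius_to_fahrenheit(temp_c) > limit_f
    )


if __name__ == "__main__":
    # Made-up sample readings, for illustration only.
    sample = {"rack1-inlet": 20.0, "rack2-inlet": 24.5, "closet-shelf": 29.0}
    for name in overheated_sensors(sample):
        temp_f = celsius_to_fahrenheit(sample[name])
        print(f"{name}: {temp_f:.1f} F exceeds {MAX_ROOM_TEMP_F:.0f} F")
```

A check like this could run on a schedule and feed whatever alerting you already use; how the readings are collected depends on your hardware and is left out here.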

What to do if hardware fails anyway

If, after all the precautions, equipment fails anyway (which it will, on occasion), you can do one of two things: pay more for overnight, on-site service and trust that the manufacturer will live up to the terms of the agreement, or keep spares on hand. If all your servers were purchased from the same manufacturer at the same time, it will be relatively cheap to keep at least some spares on hand -- one power supply, one motherboard, an extra CPU or two, enough RAM for one system, and two or three extra drives.
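A spares shelf is only useful if it is actually stocked, so it can be worth comparing what is on hand against the list above from time to time. The following Python sketch is one way to do that. The part names and the sample inventory are hypothetical, and where the text gives a range, the lower number is used as the minimum.

```python
# Minimum spares suggested in the article for a fleet of identical servers.
# Where the text gives a range ("an extra CPU or two", "two or three extra
# drives"), the lower bound is used here; that choice is an assumption.
RECOMMENDED_SPARES = {
    "power_supply": 1,
    "motherboard": 1,
    "cpu": 1,
    "ram_kit": 1,  # enough RAM for one system, counted as one kit
    "drive": 2,
}


def missing_spares(on_hand: dict[str, int],
                   recommended: dict[str, int] = RECOMMENDED_SPARES) -> dict[str, int]:
    """Return how many of each part must be bought to reach the recommended count."""
    shortfall = {}
    for part, wanted in recommended.items():
        have = on_hand.get(part, 0)  # a part not listed counts as zero on hand
        if have < wanted:
            shortfall[part] = wanted - have
    return shortfall


if __name__ == "__main__":
    shelf = {"power_supply": 1, "drive": 1, "cpu": 2}  # made-up inventory
    print(missing_spares(shelf))
```

An empty result means the shelf meets the minimum; anything else is a shopping list.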

If you didn't purchase your systems at the same time, it gets more complex. Even if they're all from the same manufacturer, motherboards may not be the same even if the model number is, and the same model can have different CPUs and memory as well. It may be simpler to keep an extra server ready to go and move the disks from an existing server that breaks to the backup system.

Replacing components is generally straightforward once you figure out what's causing the problem. Servers are modular, and it's a straight remove-and-replace job. Dedicated servers can even make this part easier, with trouble lights that show exactly where a fault is located or which part has failed. The only critical thing to watch for is electrostatic discharge -- make sure to ground yourself by touching the metal case of the power supply before handling any components.

Logan Harbaugh is a freelance reviewer, network systems analyst and consultant, specializing in reviews of network hardware and software, including network operating systems, clustering, load balancing, network-attached storage and storage area networks, traffic simulation, network management and server hardware.

This was first published in September 2009

