No faults here: Advanced techniques to attain more reliable servers

A new generation of reliability features can keep busy servers running in the face of serious fault conditions.

Stephen J. Bigelow, Senior Technology Editor

Published: 05 Nov 2013

Reliable servers top any IT pro's wish list. Server virtualization has heightened the need for reliability, since a single server may support dozens of workloads -- one hardware fault or bad migration and they all crash.

Progress in server reliability technologies like redundant power supplies and memory error detection and correction is slow. The protocols and behaviors needed to recognize, contain and address failure conditions aren't yet cost-effective, widely implemented standards that interoperate at all levels. Here are some of the newest tools IT pros can use in the fight for reliable servers.

Memory subsystem reliability

Parity checking and error correction code (ECC) are techniques that date back decades, and more recent options like memory sparing or mirroring are now well-established. Still, as the amount of memory in servers and its importance skyrocket alongside virtualization, we need more aggressive error-handling technologies.

Demand and patrol scrubbing are advanced applications of ECC memory. In demand scrubbing, the system corrects random or accidental ECC read errors on the fly. Patrol scrubbing locates and corrects errors in system memory proactively. If these actions fail to fix memory errors, it indicates a permanent fault. Potentially permanent faults invoke other resiliency features, like drawing data from mirrored memory modules instead. Some systems tag the failed locations to prevent further use of questionable memory.
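The control flow behind the two scrubbing modes can be sketched in a few lines of Python. This is a toy model, not firmware: the class name ToyEccMemory, the golden list (standing in for the value that ECC logic would reconstruct) and the stuck set (standing in for hard faults) are all assumptions made for illustration.

```python
class ToyEccMemory:
    """Toy model of scrubbing; `golden` stands in for what ECC would reconstruct."""

    def __init__(self, size):
        self.golden = [0] * size   # hypothetical: the ECC-corrected value per address
        self.cells = [0] * size    # what the (possibly faulty) cells actually hold
        self.stuck = set()         # addresses with a hard, permanent fault
        self.retired = set()       # addresses tagged to prevent further use

    def write(self, addr, value):
        self.golden[addr] = value
        # A stuck cell always corrupts bit 0, modeling a permanent fault.
        self.cells[addr] = value ^ 1 if addr in self.stuck else value

    def flip_bit(self, addr, bit=0):
        """Inject a transient (soft) single-bit error."""
        self.cells[addr] ^= 1 << bit

    def has_error(self, addr):
        return self.cells[addr] != self.golden[addr]

    def scrub(self, addr):
        """Correct one location; return 'ok', 'corrected' or 'retired'."""
        if not self.has_error(addr):
            return "ok"
        self.write(addr, self.golden[addr])  # write the corrected value back
        if self.has_error(addr):             # error persists: treat as permanent
            self.retired.add(addr)
            return "retired"
        return "corrected"

    def demand_read(self, addr):
        """Demand scrubbing: fix an error only when a read happens to hit it."""
        self.scrub(addr)
        return self.golden[addr]

    def patrol_scrub(self):
        """Patrol scrubbing: walk all usable memory, whether or not it is read."""
        report = {}
        for addr in range(len(self.cells)):
            if addr in self.retired:
                continue
            result = self.scrub(addr)
            if result != "ok":
                report[addr] = result
        return report
```

In this sketch a soft error disappears once the corrected value is written back, while a stuck cell fails the re-check and is tagged as retired, which is the point where a real system would fall back on features such as mirroring.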

Standard ECC can correct only single-bit errors in a memory location, so more complex errors require other techniques. Single device data correction (SDDC), or Advanced ECC, combines ECC modules to correct multi-bit memory errors within one memory chip. By comparison, double device data correction (DDDC) lets servers withstand simultaneous multi-bit errors on two memory chips. Enhanced DDDC, or DDDC+1, detects and corrects an additional single-bit error on top of that. These techniques resolve a much wider range of memory glitches, and prevent more workload-disrupting crashes, than the tried-and-true options.
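To see why basic ECC stops at single-bit correction, consider a small Hamming(7,4) code. This is only an illustrative sketch: real ECC memory uses wider codes over much larger words, and the function names here are made up for the example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (an int 0..15) into a 7-bit codeword (list of bits).

    Positions are 1-indexed; positions 1, 2 and 4 hold the parity bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 is unused
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # covers positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome); a non-zero syndrome is the position it flipped."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos        # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1        # assume a single-bit error and correct it
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


word = hamming74_encode(0b1011)

one_error = list(word)
one_error[4] ^= 1                  # flip one bit (position 5)
recovered, _ = hamming74_decode(one_error)

two_errors = list(word)
two_errors[0] ^= 1                 # flip two bits (positions 1 and 5)
two_errors[4] ^= 1
miscorrected, _ = hamming74_decode(two_errors)
```

Here recovered equals the original 0b1011, but miscorrected does not: with two flipped bits the decoder "fixes" the wrong position and silently returns bad data. Handling multi-bit errors of that kind is the gap that chip-level schemes such as SDDC and DDDC are meant to close.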

Memory mirroring protects memory by providing duplicate dual in-line memory modules (DIMMs) that retain a synchronized copy of memory contents. When memory failure is detected, the system switches to the mirrored copy until the suspect DIMM is replaced. New servers on the market support partial memory mirroring: only mirroring the memory used by the server's mission-critical workload, for example. This reduces costs compared to an all-or-nothing feature set.
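The trade-off in partial mirroring can be illustrated with a short Python sketch. Everything in it is hypothetical (the MirroredMemory class, the failed set that stands in for hardware error reporting, the address range); it models only the idea that reads fail over to the mirror for protected addresses.

```python
class UncorrectableMemoryError(Exception):
    """Raised when a failed location has no mirrored copy to fall back on."""


class MirroredMemory:
    """Toy model of partial memory mirroring (not a real platform API)."""

    def __init__(self, size, mirrored_range):
        self.primary = [0] * size
        self.mirrored_range = mirrored_range          # a range() of protected addresses
        self.mirror = {a: 0 for a in mirrored_range}  # synchronized copy, protected range only
        self.failed = set()                           # primary addresses reported as faulty

    def write(self, addr, value):
        self.primary[addr] = value
        if addr in self.mirrored_range:
            self.mirror[addr] = value                 # keep the mirror in sync on every write

    def read(self, addr):
        if addr not in self.failed:
            return self.primary[addr]
        if addr in self.mirror:
            return self.mirror[addr]                  # fail over to the mirrored copy
        raise UncorrectableMemoryError(f"no mirror for address {addr}")


mem = MirroredMemory(size=16, mirrored_range=range(0, 4))
mem.write(1, 111)      # inside the mirrored range (the critical workload)
mem.write(9, 999)      # outside it
mem.failed.update({1, 9})
critical_value = mem.read(1)   # served from the mirror
```

Reading address 9 in this state raises UncorrectableMemoryError, which is the cost of mirroring only part of memory: the protected range consumes twice the capacity, and everything else gets no fallback.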

Processor subsystem reliability

The greatest threat to server reliability is a memory or processor failure that gets passed on to the system and propagates from one workload to others. Data containment mode recognizes when one or more memory locations are faulty or poisoned and prevents other processes from using them. Features like viral mode prevent the system from moving network data to a PCI Express (PCIe) bus when an uncorrectable error occurs, isolating the server and preventing any unintended network data from reaching users or other servers.

Servers use processor sparing to shift a workload from a faulty processor core to a spare one seamlessly. The faulty processor sits idle in the system until a technician replaces it. As with memory sparing, processor sparing only works if you have spare cores on the server, so it isn't convenient for high-utilization hosts. Deploy processor sparing when a server hosts critical workloads that are intolerant of downtime. If your server uses a socket disable feature, it can even boot normally with a failed processor in place.
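As a rough illustration of the bookkeeping involved, the sketch below models sparing as moving a faulty core's workloads onto a reserved core. The CorePool class and its method names are invented for this example; real sparing is handled by firmware and the OS, not by application code.

```python
class CorePool:
    """Toy model of processor sparing with a fixed set of reserved cores."""

    def __init__(self, active, spares):
        self.assignments = {core: [] for core in active}  # core id -> workloads
        self.spares = list(spares)                        # idle cores held in reserve
        self.faulty = []                                  # idle until a technician replaces them

    def assign(self, core, workload):
        self.assignments[core].append(workload)

    def report_fault(self, core):
        """Move every workload on a faulty core to a spare; return the spare's id."""
        if not self.spares:
            raise RuntimeError("no spare core available")
        spare = self.spares.pop(0)
        self.assignments[spare] = self.assignments.pop(core)
        self.faulty.append(core)
        return spare


pool = CorePool(active=[0, 1], spares=[2])
pool.assign(0, "database")
pool.assign(1, "web")
replacement = pool.report_fault(0)   # core 2 takes over the database workload
```

A second fault in this pool would raise RuntimeError because the only spare is already in use, which mirrors the caveat above: the feature depends on keeping capacity in reserve.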

Other features for reliable servers

In the past, a server failure meant shutting the system down completely to fix the faulty equipment. Some servers now include hot add or hot plug capabilities, so technicians can upgrade or replace core components -- CPU, DIMM, PCIe cards -- while the server operates.

Hot add is a feat of electrical engineering, BIOS and operating system (OS) intelligence. Some OSes, such as Windows Server 2008 R2, Red Hat Enterprise Linux 6 and SUSE Linux Enterprise Server 11, recognize the new resources and provision them for use on the fly.
