New server processor features improve reliability and fix otherwise-critical errors in real time, preventing a server failure or workload outage.
With virtualization, each server in a data center
Processor reliability problems
Every machine fails eventually, but not every failure is created equal. Hard failures are physical breakdowns within the processor, memory or another server component: something is actually damaged, and the fix is a repair or a component replacement.
Server makers can mitigate hard failures by using better-quality components and rigorously stress testing during manufacture. Manufacturers also circumvent hard failure by introducing redundant components within the server -- when one power supply fails, for example, a second one is there to take over and keep the server running. When one processor die fails, virtual machines can restart on an unused die and operators will replace the processor during the server's regular maintenance.
Many more server faults occur due to transient or spontaneous bit changes within memory, the processor, other components or the numerous bus pathways that interconnect them. These soft errors are not physical defects. They occur because of sporadic electrical noise, power disturbances, cosmic radiation or other physical phenomena.
A single incorrect bit can corrupt an instruction or data word, disrupting workload availability as suddenly and seriously as an overheated processor die. Because nothing is physically damaged, recovery usually means simply reloading and restarting any affected workloads.
Electronic designers are aggressively developing ways to detect and correct soft errors, preventing unnecessary workload disruption and improving system reliability. The simplest soft error detection method uses parity bits in memory devices; parity reveals that a bit has flipped but cannot repair it. Enterprise-class memory subsystems use error-correcting code (ECC) to fix bit errors, as well as resiliency techniques like memory sparing, which switches workloads to redundant memory modules when errors are detected.
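The difference between detecting and correcting a flipped bit can be sketched in a few lines of Python. This is an illustration only: the function names are made up for this example, and the tiny Hamming(7,4) code stands in for the much wider codes that real ECC memory implements in hardware.

```python
def parity_bit(word: int) -> int:
    """Even parity: returns 1 if the word holds an odd number of 1 bits."""
    return bin(word).count("1") % 2


def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0 is unused so list indices match bit positions
    code[3], code[5], code[6], code[7] = d[0], d[1], d[2], d[3]
    code[1] = code[3] ^ code[5] ^ code[7]  # check bit covering positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # check bit covering positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # check bit covering positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits: list) -> tuple:
    """Return (nibble, corrected_position); position 0 means no error was seen."""
    code = [0] + list(bits)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    if syndrome:
        code[syndrome] ^= 1  # the syndrome names the flipped position
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome
```

Parity notices a single flipped bit but cannot say which one, and it misses two flips entirely. The Hamming codeword carries enough redundancy to locate and repair any single flipped bit, which is the basic idea behind ECC.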
Soft error features are moving into processor cores. The processor detects and corrects errant bits or contains the corrupted workloads. Reboots, lost data and workload outages occur less frequently as a result.
What to shop for in reliable servers
Virtually all processors from Intel Corp., Advanced Micro Devices Inc. and ARM Holdings support memory error detection with parity and ECC-based correction. Modern processors apply these techniques more aggressively. For example, late-model Intel processors use data bus error checking to extend parity or cyclic redundancy checks (CRC) into data bus traffic and repair single-bit errors. ECC protection has also extended to cache, preventing soft errors from disrupting program operation.
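As a rough illustration of how a cyclic redundancy check catches corruption in transit, the sketch below attaches a checksum to a payload and verifies it on arrival. It uses CRC-32 from Python's standard `zlib` module purely as a stand-in; the article does not say which polynomial or recovery mechanism a given processor bus uses, and the helper names are hypothetical.

```python
import zlib


def send(payload: bytes) -> tuple:
    """Sender attaches a CRC-32 checksum to the payload."""
    return payload, zlib.crc32(payload)


def receive(payload: bytes, checksum: int) -> bool:
    """Receiver recomputes the checksum; a mismatch signals corruption."""
    return zlib.crc32(payload) == checksum


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a soft error by flipping one bit in transit."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


payload, checksum = send(b"cache line contents")
print(receive(payload, checksum))                # True: intact transfer
print(receive(flip_bit(payload, 13), checksum))  # False: single-bit error caught
```

In this sketch a failed check would simply prompt the sender to repeat the transfer, which is one common way to recover from a transient error.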
Some Intel Xeon and Itanium processors also support a machine check architecture (MCA) feature that logs and reports soft errors. When administrators know what kind of errors occurred, the components involved and actions taken, they can expose weak or questionable components and replace them before a hard failure occurs.
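A simple way to picture how logged errors expose a weak component is to count corrected errors per component and flag the outliers. The record format, component names and threshold below are all hypothetical; real machine check logs use a vendor-defined layout that this sketch does not attempt to reproduce.

```python
from collections import Counter

# Hypothetical corrected-error records as (component, error_type) pairs.
# A real machine check log has a different, vendor-defined format.
ERROR_LOG = [
    ("DIMM_A1", "corrected_ecc"),
    ("DIMM_A1", "corrected_ecc"),
    ("DIMM_B2", "corrected_ecc"),
    ("DIMM_A1", "corrected_ecc"),
    ("CPU0_L2", "corrected_parity"),
]


def suspect_components(log, threshold=3):
    """Return components whose corrected-error count reaches the threshold."""
    counts = Counter(component for component, _ in log)
    return sorted(name for name, count in counts.items() if count >= threshold)


print(suspect_components(ERROR_LOG))  # ['DIMM_A1']
```

A module that keeps producing corrected errors is a reasonable candidate for replacement at the next maintenance window, before it fails outright.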
Processors that electrically isolate logical elements prevent a hardware fault in one section from causing problems in other areas of the processor. This is a requirement for hot-swappable components like expansion cards or memory modules.
Servers and their firmware support these processor reliability features. Common examples include ECC and advanced ECC in memory modules, paired with single device error correction to help if a single DRAM chip fails. If ECC detects a double-bit error, the memory controller will retry the read cycle. Memory sparing also helps prevent system disruption. Processors with MCA capabilities require software to query error register data within the processor and send it to reporting software.
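Memory sparing can be pictured with a toy model: once the active module has logged too many corrected errors, its contents move to a spare module that then takes over. The class name, the threshold and the failover policy here are assumptions made for illustration, not a description of any particular server's firmware.

```python
class SparedMemory:
    """Toy model of memory sparing with one active and one spare module."""

    def __init__(self, size, error_threshold=3):
        self.active = [0] * size
        self.spare = [0] * size
        self.error_threshold = error_threshold  # hypothetical policy value
        self.corrected_errors = 0
        self.failed_over = False

    def write(self, address, value):
        self.active[address] = value

    def read(self, address):
        return self.active[address]

    def report_corrected_error(self):
        """Called whenever ECC corrects an error on the active module."""
        self.corrected_errors += 1
        if not self.failed_over and self.corrected_errors >= self.error_threshold:
            # Copy the contents to the spare, then make the spare active.
            self.spare[:] = self.active
            self.active, self.spare = self.spare, self.active
            self.corrected_errors = 0
            self.failed_over = True
```

The workload keeps reading and writing the same addresses throughout; only the module behind them changes, so no restart is needed.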
Some Intel and ARM processors support lockstep operation, where the same program runs on multiple processors at the same time. Added logic compares the outputs of each processor for consistency or corruption. This kind of functionality requires extended support from the server chipset and motherboard hardware, but enables a high degree of data integrity.
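The comparison at the heart of lockstep operation can be sketched in software, with the caveat that real lockstep is implemented in hardware on a cycle-by-cycle basis. The functions below are hypothetical stand-ins for two processors running the same program, one of which suffers a simulated bit flip.

```python
def lockstep(replicas, value):
    """Run the same computation on every replica and compare the outputs."""
    results = [replica(value) for replica in replicas]
    if any(result != results[0] for result in results[1:]):
        raise RuntimeError(f"lockstep mismatch: {results}")
    return results[0]


def healthy_core(x):
    return x * x + 1


def faulty_core(x):
    # Simulate a soft error that flips bit 2 of the result.
    return healthy_core(x) ^ 0b100


print(lockstep([healthy_core, healthy_core], 6))  # 37
```

Calling `lockstep([healthy_core, faulty_core], 6)` raises a `RuntimeError`, because the two outputs disagree. With only two replicas the comparison can detect corruption but cannot tell which result is correct; a design with three or more replicas could vote on the answer.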
Processor vendors like AMD tend to underemphasize chip-based reliability features in favor of broader system reliability. For example, AMD Opteron 6000 family processors do not list specific error detection and correction capabilities, but chipset elements like the SR5650, SR5670 and SR5690 I/O Hub chips provide features like HyperTransport error handling, PCIe advanced error reporting and PCIe CRC.
ARM chip designs are licensed through ARM Holdings, and individual licensees can decide which features to include or omit in their own products. So, just because a vendor releases a reduced instruction set computer chip based on some ARM architecture, don't assume that the chip has specific reliability features.
Security's role in processor reliability
Server reliability extends beyond the hardware into the workloads that run on the system. Malicious software attacks can corrupt memory locations and other system states to disrupt an application and spread that disruption across workloads. System designers try to isolate and contain malicious activity.
Early protection techniques include Intel's eXecute Disable (XD) bit, ARM's eXecute Never (XN) bit and AMD's No eXecute (NX) bit. All three let the OS mark areas of memory as data-only; the processor then refuses to run instructions from those non-executable areas.
Intel Xeon models and other modern processors identify memory locations that contain suspect data. Those locations can be restricted to use by the current workload, preventing multiple workloads from accessing suspect data and possibly spreading malware. When the workload finishes, the memory is cleared or overwritten to disarm the potential threat without a reboot. ARM memory protection unit options can also protect areas of memory from unexpected software access.
Reliability technologies are so pervasive and integral to the system's design that chip and system documentation may not detail or even list all of the capabilities available. Discuss features with the vendor in detail and review the system's documentation closely for a complete picture of reliability features.
This was first published in February 2014