IT Operations.com

Tips for improving server reliability and available memory

By Stephen J. Bigelow

With so much attention focused on processing ability, network bandwidth and storage IOPS, it is easy to overlook server memory availability and reliability. Processors are central to any server, but all of the workload's instructions and data are stored in memory.

In today's virtualized data center, a single server may operate numerous virtual machines (VMs) where each VM exists as a file residing in memory. But as new servers hold more and faster memory to meet increased computing demands, the issue of memory reliability is gaining importance. IT staff must be aware of memory failures and take advantage of server features designed to boost available memory.

Understanding memory reliability problems

Today, enterprise-class servers employ several terabytes of 64-bit memory in the form of prefabricated modules designed and manufactured in adherence to JEDEC DDR3 and DDR3L (low-voltage) standards. This makes it easy to source affordable memory from a variety of vendors, but adherence to standards does not guarantee reliability.

The biggest threat to memory reliability is not outright failure, though faults due to manufacturing defects, electrical events and other physical anomalies can occur. Rather, the biggest threats to server memory come from random bit errors -- the spontaneous reversal of a single bit. If left unchecked, the error of just a single bit can alter an instruction or change a data stream in unexpected and potentially catastrophic ways.
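To make the effect of a single reversed bit concrete, here is a minimal sketch. The function name `flip_bit` and the stored value are hypothetical, chosen only for illustration; real memory errors happen in hardware, not in application code.

```python
def flip_bit(value: int, position: int) -> int:
    """Return `value` with the bit at `position` inverted (0 = least significant)."""
    return value ^ (1 << position)


# Hypothetical integer held in memory, e.g. a counter or a record length.
stored = 1000

# Simulate the spontaneous reversal of one bit (bit 20 here).
corrupted = flip_bit(stored, 20)

print(f"stored    = {stored}")
print(f"corrupted = {corrupted}")
print(f"bits that differ = {bin(stored ^ corrupted).count('1')}")
```

Only one bit differs between the two values, yet the number the workload reads back is no longer anywhere near the original. The same applies when the affected word holds an instruction or a pointer rather than data.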

Bit errors can occur spontaneously. Memory modules cite error rates anywhere from about 1 bit per hour per gigabyte of memory -- sometimes listed as 10^-10 errors/bit*h -- to 1 bit per century per gigabyte of memory (10^-17 errors/bit*h). It's a vast range but, as memory subsystems get faster, electrical operating voltages get lower and the total amount of memory on the server increases, the possibility of a bit being "misinterpreted" and affecting a workload becomes significant.
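A per-bit rate is hard to picture, so it can help to scale it up to a whole server. The sketch below does that arithmetic under some loud assumptions: errors are independent and spread evenly across all bits, a gigabyte is taken as 2^30 bytes, and the quoted rates are treated as rough orders of magnitude. The function names and the memory sizes are hypothetical.

```python
# Number of bits in one gigabyte, taking 1 GB as 2**30 bytes (an assumption).
BITS_PER_GB = 8 * 2**30


def expected_errors_per_hour(rate_per_bit_hour: float, memory_gb: float) -> float:
    """Expected bit errors per hour for `memory_gb` gigabytes of memory."""
    return rate_per_bit_hour * BITS_PER_GB * memory_gb


def mean_hours_between_errors(rate_per_bit_hour: float, memory_gb: float) -> float:
    """Average time between bit errors, in hours (inverse of the hourly rate)."""
    return 1.0 / expected_errors_per_hour(rate_per_bit_hour, memory_gb)


if __name__ == "__main__":
    # The two ends of the range quoted in the article, in errors per bit-hour.
    for rate in (1e-10, 1e-17):
        # Hypothetical server memory sizes, in gigabytes.
        for memory_gb in (1, 256, 2048):
            per_hour = expected_errors_per_hour(rate, memory_gb)
            print(f"rate={rate:.0e}  memory={memory_gb:>5} GB  "
                  f"errors/hour={per_hour:.3e}")
```

At the high end of the range, a single gigabyte works out to a little under one expected error per hour, and the figure grows linearly with installed memory. That scaling is why the total amount of memory matters as much as the per-bit rate.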

Additional factors can also precipitate single-bit errors: background radiation (alpha particles), spurious electrical events such as nearby electromagnetic interference, poor motherboard shielding or design, and even corroded or poor-quality electrical contacts on the DIMM sockets.

Features that enhance memory availability

The lack of available memory has always been a concern, and error detection techniques like parity have been around for years. Parity is simple and effective for detecting single-bit errors, but it cannot correct them, so it is rarely used in servers. Fortunately, there are numerous other features available or emerging that can help enhance memory reliability. Consider a few of the approaches below:
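As a brief aside on the parity limitation mentioned above, the following minimal sketch shows even parity over one 8-bit word. It is an illustration in software, not how a memory controller implements it; the helper names and the sample word are hypothetical.

```python
def even_parity_bit(data: int) -> int:
    """Return the parity bit that makes the total count of 1 bits even."""
    return bin(data).count("1") % 2


def parity_ok(data: int, parity: int) -> bool:
    """Check stored data against its stored parity bit."""
    return even_parity_bit(data) == parity


# Hypothetical 8-bit word written to memory along with its parity bit.
word = 0b10110010
parity = even_parity_bit(word)

# One bit reverses while the word sits in memory.
one_flip = word ^ 0b00001000

# Two bits reverse in the same word.
two_flips = word ^ 0b00001001

print("clean word passes:     ", parity_ok(word, parity))
print("single flip passes:    ", parity_ok(one_flip, parity))
print("double flip passes:    ", parity_ok(two_flips, parity))
```

The single flip fails the check, so the error is detected, but the parity bit says nothing about which of the eight bits changed, so there is no way to repair the word. The double flip passes the check even though the data is wrong, which is a well-known weakness of single-bit parity.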

Memory now has a more pivotal role in modern virtualized servers, so it is more important than ever to address and mitigate the disruptive effects of memory errors. IT professionals have access to an evolving set of memory reliability features, but they must first conduct a more careful evaluation of memory availability requirements and then deploy servers with the features that will accommodate those needs.

20 Dec 2012
