Tips for improving server reliability and available memory

IT Operations.com

By Stephen J. Bigelow

With so much attention focused on processing ability, network bandwidth and storage IOPS, it is easy to overlook server memory availability and reliability. Processors are central to any server, but all of the workload's instructions and data are stored in memory.

In today's virtualized data center, a single server may run numerous virtual machines (VMs), each of which is loaded into and executes from memory. But as new servers hold more and faster memory to meet increased computing demands, the issue of memory reliability is gaining importance. IT staff must be aware of memory failures and take advantage of server features designed to boost memory availability.

Understanding memory reliability problems

Today, enterprise-class servers employ several terabytes of 64-bit memory in the form of prefabricated modules designed and manufactured in adherence to JEDEC DDR3 and DDR3L (low-voltage) standards. This makes it easy to source affordable memory from a variety of vendors, but adherence to standards does not guarantee reliability.

The biggest threat to memory reliability is not outright failure, though faults due to manufacturing defects, electrical events and other physical anomalies can occur. Rather, the biggest threats to server memory come from random bit errors -- the spontaneous reversal of a single bit. If left unchecked, the error of just a single bit can alter an instruction or change a data stream in unexpected and potentially catastrophic ways.
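
As a small illustration of how much damage one reversed bit can do, the Python sketch below flips a single bit in an integer, in a text character and in a floating-point value. The helper names `flip_bit` and `flip_bit_in_double` are made up for this example; real bit errors happen in hardware, and this only imitates their effect on stored values.

```python
import struct


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (hypothetical helper)."""
    return value ^ (1 << bit)


def flip_bit_in_double(x: float, bit: int) -> float:
    """Invert one bit in the 64-bit IEEE 754 encoding of x (hypothetical helper)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


# An integer counter of 1000 becomes a very different number.
print(flip_bit(1000, 30))              # 1073742824

# The character "A" (0x41) becomes "C" (0x43).
print(chr(flip_bit(ord("A"), 1)))      # C

# Flipping the top exponent bit turns 1.0 into infinity.
print(flip_bit_in_double(1.0, 62))     # inf
```

Which bit is hit matters a great deal: a flip in a low-order bit of a large number may go unnoticed, while a flip in an exponent, a pointer or an opcode can change program behavior completely.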

Bit errors can occur spontaneously. Memory modules cite error rates anywhere from about 1 bit per hour per gigabyte of memory -- sometimes listed as 10⁻¹⁰ errors/bit·h -- to about 1 bit per millennium per gigabyte of memory (10⁻¹⁷ errors/bit·h). It's a vast range but, as memory subsystems get faster, operating voltages get lower and the total amount of memory on the server increases, the possibility of a bit being "misinterpreted" and affecting a workload becomes significant.
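
To see how a per-bit rate scales with installed memory, the following Python sketch converts a rate in errors per bit-hour into expected errors per hour for a whole server. It assumes the two quoted rates are 1e-10 and 1e-17 errors per bit-hour, treats errors as independent and evenly spread, and uses a 256 GiB server purely as an example size; the function names are illustrative.

```python
BITS_PER_GIB = 8 * 2**30          # bits in one gibibyte
HOURS_PER_YEAR = 24 * 365.25


def errors_per_hour(rate_per_bit_hour: float, memory_gib: float) -> float:
    """Expected bit errors per hour across all installed memory."""
    return rate_per_bit_hour * memory_gib * BITS_PER_GIB


def mean_hours_between_errors(rate_per_bit_hour: float, memory_gib: float) -> float:
    """Average time between bit errors, in hours."""
    return 1.0 / errors_per_hour(rate_per_bit_hour, memory_gib)


# Assumed rates (errors per bit-hour) and an assumed server size.
for rate in (1e-10, 1e-17):
    per_hour = errors_per_hour(rate, 256)
    years = mean_hours_between_errors(rate, 256) / HOURS_PER_YEAR
    print(f"rate={rate:g}: {per_hour:.3g} errors/hour, "
          f"one error every {years:.3g} years on 256 GiB")
```

Under these assumptions, the high-end rate works out to roughly 220 bit errors per hour on a 256 GiB server, while the low-end rate gives about one error every five years. The point is only that the expected error count grows in direct proportion to the amount of memory installed.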

Contributing factors include background radiation (such as alpha particles), spurious electrical events such as nearby electromagnetic interference, poor motherboard shielding or design, and even corroded or poor-quality electrical contacts in the DIMM sockets. Any of these can precipitate single bit errors.

Features that enhance memory availability

Memory availability has always been a concern, and error detection techniques like parity have been around for years. Parity is simple and effective for detecting single bit errors, but it cannot correct them, so it is not used much for servers. Fortunately, there are numerous other features available or emerging that can help enhance memory reliability. Consider a few of the approaches below:

ECC. Instead of parity, system vendors have relied on error correcting code (ECC) techniques. ECC builds on the basics of parity by using an algorithm to create and store an 8-bit code for every 64 bits of memory (a total of 72 bits per address). This algorithm and code allow the system to detect and correct single bit errors in real time -- as well as detect (but not correct) double-bit errors and prevent the system from using corrupt data. ECC is typically the default memory reliability technique used on many general purpose servers.
Advanced ECC. Advanced ECC extends the ECC approach across multiple memory devices, allowing advanced ECC to detect and correct multi-bit failures as long as they occur within the same memory device. However, ECC and advanced ECC do not support any kind of failover, so the system still has to be shut down (or rely on other system techniques) in order to overcome a questionable memory module. Many enterprise-class servers such as the HP ProLiant or Dell PowerEdge can offer some form of advanced ECC.
Memory error tracking. Part of dealing with memory errors is keeping track of them in the first place. Emerging server designs are starting to keep track of correctable errors by creating a running list of error rates and locations. Some servers can also save error information in the rewritable serial presence detect (SPD) memory space on a memory module, which can be read for future assessment and analysis. Once a system can track correctable memory errors and move that information into the system's management tool, it becomes possible to predict possible memory failures by noting DIMMs with sudden increases in error rates. Error tracking is an essential precursor to more advanced memory reliability features that involve DIMM failover or movement of data within the physical memory space.
Hot spare memory. The concept of a hot spare is common in disk storage, but has only recently found traction in server designs. This is because the system must have the intelligence to identify and track correctable memory errors before it can make the decision to move data to a spare memory module. Advances in memory error tracking allow the server's memory controller to move data from a DIMM with unacceptable errors to a spare DIMM in the same channel. This is also called rank sparing. The disadvantage to this approach is the expense of adding memory that sits idle until an error occurs.
Device tagging. One memory failover technique is a BIOS-based technology called device tagging. When the system tracks a memory module that experiences an increasing level of errors, the system can basically move the data from the questionable memory to the ECC memory -- essentially using the ECC memory as a small hot spare. This may alleviate memory failures, but it also prevents error detection and correction within that portion of memory. Device tagging is used as a short-term measure to keep the system running until the questionable memory module can be replaced.
Memory mirroring. The ultimate memory reliability technique is to duplicate memory contents on the server from one channel to another paired channel. This is basically RAID 1 for memory. If a fault occurs within the memory of one channel, the memory controller shifts to the paired channel without disruption, and the channels can resynchronize when repairs (if any) are completed. The disadvantage of mirroring is the same as that of RAID 1 in storage: usable memory capacity is cut in half -- or memory costs are effectively doubled -- because memory contents are duplicated.
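
To make the ECC entry in the list above more concrete, here is a minimal Python sketch of a textbook extended Hamming code that stores 8 check bits alongside 64 data bits, for 72 bits in total. It is only an illustration of single-error correction and double-error detection: the bit layout and the names `encode` and `decode` are assumptions made for this example, and real memory controllers typically implement related but different codes in hardware.

```python
# Codeword positions 0..71. Position 0 holds an overall parity bit,
# positions 1, 2, 4, ..., 64 hold Hamming check bits, the rest hold data.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
DATA_POSITIONS = tuple(p for p in range(1, 72) if p not in PARITY_POSITIONS)


def _syndrome(word: int) -> int:
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s


def encode(data: int) -> int:
    """Encode a 64-bit value into a 72-bit codeword."""
    assert 0 <= data < 1 << 64
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    # Choose the check bits so that the syndrome of the codeword is zero.
    s = _syndrome(word)
    for p in PARITY_POSITIONS:
        if s & p:
            word |= 1 << p
    # Overall parity bit makes the total number of set bits even.
    if bin(word).count("1") % 2:
        word |= 1
    return word


def decode(word: int):
    """Return (data, status); data is None if the error is uncorrectable."""
    s = _syndrome(word)
    odd = bin(word).count("1") % 2 == 1
    if s == 0 and not odd:
        status = "ok"
    elif odd:
        # Odd overall parity: assume one flipped bit, located at position s
        # (s == 0 means the overall parity bit itself was hit).
        if s > 71:
            return None, "uncorrectable"
        word ^= 1 << s
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


stored = encode(0x0123456789ABCDEF)
print(decode(stored)[1])                           # ok
print(decode(stored ^ (1 << 17))[1])               # corrected
print(decode(stored ^ (1 << 17) ^ (1 << 40))[1])   # uncorrectable
```

A single flipped bit anywhere in the 72-bit word is repaired transparently, while two flipped bits are reported rather than silently passed along, which matches the behavior described for ECC above. Three or more flipped bits can defeat a code like this, which is one reason the more advanced techniques in the list exist.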

Memory now has a more pivotal role in modern virtualized servers, so it is more important than ever to address and mitigate the disruptive effects of memory errors. IT professionals have access to an evolving set of memory reliability features, but they must first conduct a more careful evaluation of memory availability requirements and then deploy servers with the features that will accommodate those needs.

20 Dec 2012