The growing importance of on-server memory is changing server design and configuration: systems now accommodate more memory and deliver higher memory performance. That shift also means overhauling how servers handle memory errors and reliability.
How important are on-server memory features now that our data center is consolidated onto virtualized servers?
As virtualization expands across most data centers, fewer hardware platforms are responsible for a greater number of workloads.
Memory capacity and reliability are critical to successful consolidation and workload integrity, because the effect of any server fault is multiplied by the number of workloads running on that server. For example, if a server running 10 workloads experiences a memory fault that causes a system crash or reboot, all 10 workloads are affected until the system restarts or each workload fails over to another server.
New technologies are bolstering memory resilience far beyond error correction code (ECC) and memory sparing. These developments address correctable errors over the long term, tipping off administrators to chronic memory faults. Server administrators can inspect and replace questionable components during routine maintenance before hard failures occur.
An error threshold allows a dual in-line memory module (DIMM) to track the location and frequency of correctable errors -- those that ECC can catch and fix on the fly -- using serial presence detect (SPD) error logging and other DIMM capabilities.
If advanced ECC is implemented, the system can detect and recover from multi-bit errors. Advanced ECC splits data words between separate ECC DIMMs, which usually means deploying matched DIMMs with the same capacity and rank. An even number of DIMMs should be installed in the server.
When a server identifies a chronic problem with a DIMM (that is, when correctable errors exceed the set threshold), the error report can alert the systems management tool to flag the DIMM for pre-emptive replacement. Some servers go a step further and effectively remove an entire memory page from use while the remainder of the DIMM stays in service. Alternatively, memory sparing switches operations over to a spare module.
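The threshold logic described above can be illustrated with a short Python sketch. It is a simplified model only, not any vendor's firmware or management API: the class name, the DIMM labels, the page numbers and both threshold values are assumptions made up for the example.

```python
from collections import defaultdict


class CorrectableErrorTracker:
    """Hypothetical model: counts correctable errors per DIMM and per page."""

    def __init__(self, dimm_threshold, page_threshold):
        # Both thresholds are assumed policy values, not vendor defaults.
        self.dimm_threshold = dimm_threshold
        self.page_threshold = page_threshold
        self.dimm_counts = defaultdict(int)
        self.page_counts = defaultdict(int)

    def record(self, dimm, page):
        """Log one correctable error at a given DIMM and memory page."""
        self.dimm_counts[dimm] += 1
        self.page_counts[(dimm, page)] += 1

    def dimms_to_replace(self):
        """DIMMs whose error count exceeds the DIMM threshold."""
        return sorted(d for d, n in self.dimm_counts.items()
                      if n > self.dimm_threshold)

    def pages_to_retire(self):
        """(DIMM, page) pairs whose error count exceeds the page threshold."""
        return sorted(k for k, n in self.page_counts.items()
                      if n > self.page_threshold)


tracker = CorrectableErrorTracker(dimm_threshold=5, page_threshold=2)
for page in (0x10, 0x10, 0x10, 0x20, 0x30, 0x40):
    tracker.record("DIMM_A1", page)
tracker.record("DIMM_B1", 0x10)

print(tracker.dimms_to_replace())  # ['DIMM_A1']
print(tracker.pages_to_retire())   # [('DIMM_A1', 16)]
```

In this sketch, DIMM_A1 is flagged for replacement after six logged errors, and the one page responsible for three of them is a candidate for retirement. A real server makes these decisions in firmware and reports them through its management tool.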
When deploying memory sparing or mirroring, use DIMMs in two channels that are matched in capacity and rank. This ensures the system can switch to a backup DIMM with precisely the same data format as the original memory module. If dissimilar DIMMs are used, the server BIOS may detect the difference and disable these features.
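As a rough illustration of the matching rules above, the following Python sketch checks a planned DIMM layout before installation. The `Dimm` record, its fields and the `check_resilience_layout` function are hypothetical names invented for this example. The actual rules for any given server come from its documentation and BIOS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    # Hypothetical description of one installed module.
    slot: str
    channel: int
    capacity_gb: int
    ranks: int


def check_resilience_layout(dimms):
    """Return a list of problems; an empty list means the layout looks matched."""
    problems = []
    if len(dimms) % 2 != 0:
        problems.append("odd number of DIMMs installed")
    if len({d.channel for d in dimms}) < 2:
        problems.append("DIMMs do not span two channels")
    if len({(d.capacity_gb, d.ranks) for d in dimms}) > 1:
        problems.append("DIMMs differ in capacity or rank")
    return problems


matched = [Dimm("A1", 0, 16, 2), Dimm("B1", 1, 16, 2)]
mixed = [Dimm("A1", 0, 16, 2), Dimm("B1", 1, 8, 1)]

print(check_resilience_layout(matched))  # []
print(check_resilience_layout(mixed))    # ['DIMMs differ in capacity or rank']
```

A check like this only catches the mismatches named in the text (count, channel spread, capacity and rank). It does not model vendor-specific population rules.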
Server technicians should always refer to a new system's documentation to determine the number and type of memory modules needed to meet the required level of resiliency.
This was first published in February 2014