Server problems are unavoidable, but advances in server hardware designs and instrumentation have made systems more resilient than ever before.
IT technicians should treat any server fault as a serious issue, but many problems are not fatal, and a server will continue to operate -- albeit with reduced reliability or degraded capabilities. Prompt alerting and diagnosis of degraded server states allow IT staff to take proactive measures, often rectifying the issues without any perceptible disruption to server workloads. Let's define degraded operation and consider the principal system areas where degradation can occur.
The degraded server
Degradation occurs when the server experiences a hardware fault and is no longer running in a redundant state. Alternatively, the server may still be in a redundant mode, but diagnostic sensors or algorithms within the server report an impending failure.
For example, the Self-Monitoring, Analysis and Reporting Technology (SMART) in a hard drive may report an impending failure of the drive. This may prompt the IT staff to replace the suspect drive preemptively. As another example, one of two redundant power supplies may fail, forcing the server to rely on the remaining supply. In this case, the system would no longer be in a redundant mode until the faulty power supply is replaced.
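The two forms of degradation described above can be sketched in a few lines of Python. This is only an illustration: the `ComponentGroup` type, its fields and the status strings are hypothetical names, and the sketch assumes each group was designed with at least one spare unit.

```python
from dataclasses import dataclass


@dataclass
class ComponentGroup:
    """A set of interchangeable parts, such as power supplies or mirrored drives."""
    name: str
    working: int                 # units currently functioning
    required: int                # minimum units needed to keep the server running
    predicted_failures: int = 0  # units flagged by predictive diagnostics such as SMART


def classify(group: ComponentGroup) -> str:
    """Return a coarse health state for one component group (hypothetical labels)."""
    if group.working < group.required:
        return "failed"
    if group.working == group.required:
        # Still running, but the spare capacity is gone.
        return "degraded: redundancy lost"
    if group.predicted_failures > 0:
        # Still redundant, but diagnostics warn of an impending failure.
        return "degraded: failure predicted"
    return "healthy"
```

For instance, `classify(ComponentGroup("psu", working=1, required=1))` reports lost redundancy, while a drive pair with one SMART warning reports a predicted failure.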
Degraded server operation typically breaks down into three general categories: power or temperature problems, memory problems and management problems.
Power and temperature problems
Modern servers contain a wealth of sensors designed to measure power and thermal conditions within the server.
From a power perspective, problems with AC input voltages, DC output voltages, DC output current levels or even predictive alerts can cause a system to flag a power supply as faulty and rely on the redundant supply until the suspect supply is replaced. It's important to remember that a server must have redundant power supplies for this kind of feature to work, and that the server runs without that safety margin until the faulty unit is replaced.
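As a rough sketch of this kind of check, the Python below compares each supply's readings against limits and reports whether the server still has power redundancy. The limit values, dictionary keys and status strings are assumptions made for illustration; real thresholds come from the server vendor.

```python
# Hypothetical limits for a 12 V supply; real values come from the vendor.
LIMITS = {
    "ac_input_volts": (90.0, 264.0),
    "dc_output_volts": (11.4, 12.6),
    "dc_output_amps": (0.0, 62.0),
}


def supply_faults(reading):
    """Return the names of the out-of-range measurements for one power supply."""
    faults = []
    for key, (low, high) in LIMITS.items():
        if not low <= reading[key] <= high:
            faults.append(key)
    # A predictive alert counts as a fault even when readings are in range.
    if reading.get("predictive_alert", False):
        faults.append("predictive_alert")
    return faults


def power_status(readings):
    """Summarize redundancy from a list of per-supply readings."""
    healthy = sum(1 for r in readings if not supply_faults(r))
    if healthy >= 2:
        return "redundant"
    if healthy == 1:
        return "not redundant"
    return "no healthy supply"
```

Note that a server with a single healthy supply is reported as "not redundant" whether or not a second unit was ever installed, which mirrors the point that redundancy requires more than one supply.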
Servers also rely on cooling fans and temperature sensors to keep the system within an acceptable thermal envelope. Fans that fail to run at acceptable speeds or temperatures that exceed safe limits may flag a fault and force the server into a lower-performance mode to reduce heat output. In most cases, replacing a failed fan or clearing dust-clogged or blocked vents can address this type of problem.
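The thermal logic might look something like the following Python sketch. The thresholds and the action names ("throttle", "alert", "normal") are hypothetical; actual firmware behavior varies by vendor and model.

```python
# Hypothetical thresholds; actual limits depend on the server model.
MIN_FAN_RPM = 2000
MAX_INLET_C = 35.0
MAX_CPU_C = 85.0


def thermal_action(fan_rpms, inlet_c, cpu_c):
    """Return (action, slow_fan_indexes) for one set of sensor readings."""
    slow_fans = [i for i, rpm in enumerate(fan_rpms) if rpm < MIN_FAN_RPM]
    too_hot = inlet_c > MAX_INLET_C or cpu_c > MAX_CPU_C
    if too_hot:
        # Over temperature: drop to a lower-performance mode to shed heat.
        return "throttle", slow_fans
    if slow_fans:
        # Temperatures are still safe, but a fan needs attention.
        return "alert", slow_fans
    return "normal", slow_fans
```

Returning the indexes of slow fans alongside the action gives a technician a starting point for deciding which fan to replace.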
Memory problems
Servers rely on memory to hold workloads and data. With virtualization, servers shoulder more simultaneous workloads, and the amount of installed memory is increasing dramatically. Memory problems can reduce system redundancy, disrupt workloads and limit the number of workloads that a system can support.
A common problem occurs when the server is unable to use all of its installed memory. This may happen when a DIMM fails and makes an entire memory channel inaccessible. If the system identifies the faulty module in its error message, a technician should have little problem exchanging the suspect module. Otherwise, some trial-and-error troubleshooting may be needed to isolate the failed module.
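When no error message names the module, comparing the memory each slot should hold against what the system actually detects can narrow the search. The Python sketch below assumes both inventories are available as slot-to-capacity mappings; the slot names and the `memory_shortfall` function are hypothetical.

```python
def memory_shortfall(expected_gib, detected_gib):
    """Return {slot: missing GiB} for every slot reporting less than expected.

    Both arguments map a slot label to its capacity in GiB.
    """
    shortfall = {}
    for slot, expected in expected_gib.items():
        missing = expected - detected_gib.get(slot, 0)
        if missing > 0:
            shortfall[slot] = missing
    return shortfall
```

If every slot on one channel shows up in the result, the fault may lie with a single DIMM that has taken the whole channel offline rather than with each module individually.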
There are also several memory problems that compromise redundancy. For example, the server may determine that the number of correctable memory errors is over an acceptable limit, causing the system to switch to a spare memory module. With the spare in use, the system is no longer redundant. A similar condition occurs in memory-mirroring mode when the first module fails, leaving the system to rely on the mirrored module. The system will continue to operate, but has lost redundancy because the memory copy -- the mirror -- is now being used.
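The sparing behavior can be modeled with a simple counter, as in the Python sketch below. The `MemoryBank` class, its error limit and its return strings are assumptions for illustration and do not represent any vendor's firmware logic.

```python
class MemoryBank:
    """Toy model of memory sparing driven by a correctable-error limit."""

    def __init__(self, error_limit, spares=1):
        self.error_limit = error_limit  # hypothetical threshold
        self.spares = spares            # spare modules still unused
        self.correctable_errors = 0

    @property
    def redundant(self):
        """True while at least one spare module remains available."""
        return self.spares > 0

    def record_correctable_error(self):
        """Count one correctable error and fail over once the limit is exceeded."""
        self.correctable_errors += 1
        if self.correctable_errors <= self.error_limit:
            return "ok"
        if self.spares > 0:
            # Switch to the spare module; the error count starts over.
            self.spares -= 1
            self.correctable_errors = 0
            return "failed over to spare"
        return "over limit, no spare left"
```

After the failover, `redundant` returns `False`, which is the degraded state the server would report until the faulty module is replaced.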
All of these conditions can be corrected by replacing the faulty memory devices. Unfortunately, memory is usually not a hot-swappable component on inexpensive white box systems -- though hot-swappable memory is available on some high-end servers -- so most memory problems will require a workload migration to other systems, followed by a full shutdown and power-off, before checking or replacing memory modules.
System management problems
Busy data centers require systems management, especially when virtualization abstracts the workloads from server hardware. Gathering the granular details of system conditions and status requires manageability features that are still evolving for modern servers. A common example is the baseboard management controller (BMC) used in servers from Dell and other vendors, along with any management architecture based on the Intelligent Platform Management Interface (IPMI) standard. Trouble with a system's management features won't disrupt the server's overall operation, but can make it impossible for technicians to gauge the system's status or receive warnings that may indicate impending problems.
One common issue involves exhausted batteries. Most systems and peripheral devices -- such as RAID controllers -- preserve onboard management settings and setup details using a small lithium cell or larger battery pack, which must be replaced periodically. If the battery fails, any setup details are lost, and the system may fail to function at top performance. Battery replacement should be part of every server's routine maintenance program.
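A maintenance schedule for batteries can be as simple as tracking install dates. The Python sketch below flags batteries that are approaching an assumed service life; the three-year life, the 90-day warning window and the device names are hypothetical, so use the intervals the manufacturer recommends.

```python
from datetime import date, timedelta


def batteries_due(installed, today, service_life_days=3 * 365, warn_days=90):
    """Return the sorted names of batteries at or near the end of their service life.

    `installed` maps a device name to the date its battery was fitted.
    """
    due = []
    for name, installed_on in installed.items():
        replace_by = installed_on + timedelta(days=service_life_days)
        # Flag the battery once the warning window before replace_by begins.
        if today >= replace_by - timedelta(days=warn_days):
            due.append(name)
    return sorted(due)
```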
A more serious issue occurs when the management components -- such as the BMC -- fail outright. Typically, the server can still boot, but the boot process may be prolonged or unusual; manageability features may be unavailable until the motherboard -- or the entire server -- is replaced.
Server redundancy features keep busy data centers running, reducing workload disruptions and allowing continued system operation in the wake of noncritical faults. But when a fault occurs and a server faces degraded operation, it's important to address the fault quickly and decisively to restore system redundancy. Routine maintenance plays a big role in preventing noncritical faults; tasks such as clearing vents and replacing batteries can prevent unnecessary failures. Trends in management data over time might also reveal impending problems that IT staff can forestall or prevent.
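One simple way to look for such trends is to fit a straight line to a series of evenly spaced sensor readings and estimate when it will cross a limit. The Python sketch below uses an ordinary least-squares slope; the function names are hypothetical, and a linear fit is only an assumption that may not hold for real sensor data.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced readings, in units per sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_limit(values, limit):
    """Estimate how many more samples pass before the trend reaches `limit`.

    Returns None when the readings are flat or falling.
    """
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    remaining = limit - values[-1]
    return max(0.0, remaining / slope)
```

With daily inlet temperatures of 40, 41, 42 and 43 degrees and a limit of 50, the estimate is about seven more days, which gives staff time to check fans and vents before the server starts to throttle.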