Data center servers never stay static for long, and failures happen. Vendors are designing servers for faster repairs, upgrades and preventative maintenance, but your IT staff can also make a difference.
Addressing system hardware problems isn't easy, but a variety of tactics improve the speed and efficiency of system repairs.
Muster the troops
Set up uninterrupted maintenance agreements with service vendors, reflecting the needs of different workloads. For example, a mission-critical server may demand a service contract with a 60-minute or faster response window, 24/7/365, while less-critical systems may need only a two-hour or four-hour window. Service contracts are expensive, so match coverage to each workload's importance.
When in-house IT staff has to perform service, implement a clear chain of command and escalation scheme that leverages available monitoring and reporting tools to immediately send alerts to the appropriate team member. The slowest scenario for server repairs sends information to a supervisor and awaits manual task delegation. Keep the alerting system up to date with staff changes.
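As a rough illustration of routing alerts directly to a responder instead of waiting on manual delegation, the sketch below walks an escalation chain and picks the first person on duty. The alert categories, contact names and the `route_alert` helper are all hypothetical; a real deployment would pull this data from its monitoring or paging tool.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    on_duty: bool


# Hypothetical escalation chains, ordered from first responder upward.
ESCALATION = {
    "power": [Contact("alice", False), Contact("bob", True), Contact("shift-lead", True)],
    "memory": [Contact("carol", True), Contact("shift-lead", True)],
}
# Used when an alert category has no chain of its own.
DEFAULT_CHAIN = [Contact("shift-lead", True)]


def route_alert(category, chains=ESCALATION, default=DEFAULT_CHAIN):
    """Return the name of the first on-duty contact for this alert category."""
    for contact in chains.get(category, default):
        if contact.on_duty:
            return contact.name
    return None  # nobody reachable: the caller should page a fallback


print(route_alert("power"))
```

Keeping the chains in one data structure makes staff changes a data update rather than a code change, which is the point of keeping the alerting system current.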
The goal is to get every issue addressed as soon as possible. Service ticketing systems, suitable for large data centers, integrate emergency reporting with routine service requests. The ticketing process prioritizes and streamlines IT staff workflows, helping resolve tasks efficiently.
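One way to picture how a ticketing system can mix emergencies with routine requests is a priority queue. The sketch below is a minimal illustration, not the behavior of any particular product; the priority levels and the `TicketQueue` class are assumptions made for the example.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is served first.
PRIORITY = {"emergency": 0, "degraded": 1, "routine": 2}


class TicketQueue:
    def __init__(self):
        self._heap = []
        # The counter breaks ties so tickets of equal priority stay first-in, first-out.
        self._counter = itertools.count()

    def submit(self, kind, summary):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), summary))

    def next_ticket(self):
        """Return the most urgent ticket summary, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = TicketQueue()
queue.submit("routine", "add RAM to test server")
queue.submit("emergency", "PSU failure on db01")
queue.submit("routine", "replace rack label")
print(queue.next_ticket())
```

The emergency ticket comes out first even though it was submitted second, while the two routine tickets keep their submission order.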
Parts in inventory are expensive and easily misallocated, so use a parts tracking or other inventory control system to track spare parts or upgrade components. Integrate inventories with a change management system so that server updates are documented and benchmarked. This preserves compliance and internal service-level agreement (SLA) obligations. Many help desk and trouble ticket systems include an inventory feature.
System documents are the first things lost, and missing documentation can seriously impede timely repairs. Maintain all system documentation and original software installation media, or at least keep a list of the websites that host electronic documentation, driver or software updates and so on. This can be a major time-saver when trouble strikes and minutes matter.
Spare parts become extremely difficult to find -- and exorbitantly expensive -- for older servers. Even if a server continues to adequately support workloads beyond its depreciation point, repairs can become problematic.
Servers change to facilitate hardware repairs
Server designs are incorporating accessibility features like articulated rails that allow technicians to lower the server once it's extended from the rack. Easy-to-open enclosures permit convenient access, and snap-in components -- including plastic air ducting, fans and expansion card brackets -- minimize the need for tools.
The best repair is the one you can avoid. Server resilience features keep systems running through errors that would invariably crash older-generation servers. While resilience features don't prevent problems, they can often prevent -- or even correct -- a fault's catastrophic consequences.
The oldest resilience feature is the redundant power supply, an option on many enterprise-class servers. Two modular power supplies run together, and when one fails, the other powers the server until the failed module is replaced. Replacement can be accomplished 'hot,' without powering down the system. However, live workload migration via virtualization and growing awareness of power usage make redundant power supplies less attractive today.
Server memory resilience relies on error-correcting code (ECC) and memory sparing. When a working memory module reports a fault, its contents can be rebuilt on a spare module until the failed one is replaced (often 'hot' swapped). Another option, memory mirroring, retains a duplicate copy of the main working memory space.
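To show the idea behind error-correcting code, here is a toy Hamming(7,4) sketch in Python. It is an illustration only: real ECC memory uses wider codes implemented in hardware, and the function names here are made up for the example. Four data bits gain three parity bits, which is enough to locate and flip back any single corrupted bit.

```python
def hamming_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_correct(codeword):
    """Return (data bits after correction, error position); position 0 means no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome spells out the bad bit's position
    if position:
        c[position - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], position


word = hamming_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-bit memory fault at position 5
print(hamming_correct(word))
```

This toy code corrects one flipped bit per codeword; two simultaneous flips would be miscorrected, which is one reason production memory adds extra check bits to detect double errors.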
CPU reliability has also vastly improved. Processors like Intel's Itanium 2 can recover from data bus errors and gracefully reset a server when an otherwise-fatal error occurs. The latest CPUs support a lockstep mode where multiple processors compare program information to ensure the integrity of computing operations.
Servers are also using lower-power components that depend less on aggressive cooling.
Beyond the server itself, virtualization features like live migration mean that hardware failures don't take down computing workloads. Clustering and redundant virtual machines mean data centers can host multiple copies of critical workloads to fend off hardware-based downtime. Virtual workloads also make scheduled maintenance on hardware easier, because the load can use existing resources on other machines.
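Before taking a host down for scheduled maintenance, it helps to confirm that its virtual machines actually fit on the remaining hosts. The sketch below is a simplified capacity check that considers memory only and uses a first-fit-decreasing heuristic; the host names, VM sizes and the `plan_evacuation` function are hypothetical, and a real hypervisor scheduler weighs many more factors.

```python
def plan_evacuation(vms, free_capacity):
    """Place each VM (name -> GB of RAM) on a host with enough free RAM.

    Returns a dict of vm -> host, or None if the VMs do not fit
    under this simple heuristic.
    """
    free = dict(free_capacity)  # copy so the caller's data is left untouched
    plan = {}
    # First-fit decreasing: place the largest VMs first.
    for vm, need in sorted(vms.items(), key=lambda item: item[1], reverse=True):
        target = next((host for host, cap in free.items() if cap >= need), None)
        if target is None:
            return None  # no remaining host can take this VM
        free[target] -= need
        plan[vm] = target
    return plan


vms_on_host_a = {"web": 8, "db": 32, "cache": 16}
free_ram = {"host-b": 40, "host-c": 24}
print(plan_evacuation(vms_on_host_a, free_ram))
```

A `None` result signals that maintenance should wait until capacity is freed or added, rather than risking downtime partway through the migration.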
When you host workloads at an external provider, hardware maintenance becomes the provider's responsibility. These outsourcing vendors often allow for liberal or "best effort" repair windows when problems occur, which may expose the organization to extended downtime without significant recourse under the SLA. This remains a principal reason why many IT shops choose to retain mission-critical workloads in-house, where they exercise more control over the environment.