Data center servers are just sophisticated machines. Like any machine, they require regular maintenance to operate at their best. Simple maintenance procedures reduce serious service calls and extend the working life of servers.
Even with the performance and redundant features of modern servers, increased workload consolidation and reliability expectations can take a toll on your fleet. Your server maintenance checklist should cover physical elements, as well as the system's critical configuration.
Stick to a routine
Server administrators too often overlook planning maintenance windows. Don't wait until there is an actual failure; set aside time for routine maintenance to prevent problems.
Handy preventive server maintenance checklist
☐ Migrate workloads off server
☐ Physical inspection
☐ Airflow check
☐ Hard-disk scan
☐ Event-log alerts or trends
☐ Test patches and updates
☐ Install patches and updates
☐ Next maintenance scheduled
Maintenance frequency depends on the age of the equipment, the data center environment, the volume of servers requiring maintenance and other factors. For example, older equipment located in an equipment closet will need more frequent inspections than new servers deployed in a HEPA-filtered, well-cooled data center. Organizations can base routine maintenance schedules on vendor or third-party provider routines; if the vendor's service contract calls for system inspections every four or six months, follow that schedule.
Before virtualization, maintenance windows disrupted workloads, forcing IT personnel to do maintenance at night or on weekends. Virtualized servers enable workload migration instead of downtime, so applications are safe whenever server maintenance occurs.
Make sure the server can breathe
Once a server is offline, visually inspect its external and internal air flow paths. Remove any accumulations of dust and other debris that can impede cooling air.
Start with the exterior air inlets and outlets, then proceed into the system chassis, looking at the CPU heat sink and fan assemblies, memory modules, and all cooling fan blades and air duct pathways. Remove dust or debris on an appropriate, static-safe workspace with clean, dry compressed air. Do not clean the server right there in the rack.
Dust-busting is an old-school process, but that doesn't mean it's outdated. Dust is a thermal insulator, making it all the more important to remove it, now that alternative cooling schemes and ASHRAE recommendations have raised data center operating temperatures. Dust and other air-flow obstructions will cause the server to use more energy, even precipitating avoidable premature component failures.
Check local hard disks
Many servers rely on internal hard disks for booting, workload startup and storage, user data, and other functions. Disk media problems seriously hurt workload performance and stability, often leading to premature disk failures.
Magnetic media isn't perfect; common problems include bad sectors and fragmentation. RAID will go a long way toward preserving data integrity in the wake of storage errors, but smaller, 1U rack servers don't provide enough physical space to deploy an array of disks. Use tools like the CHKDSK (Check Disk) utility to verify the disk's integrity and attempt to recover any bad sectors on it. Windows Server 2012's updated version of CHKDSK quickly analyzes and fixes disk problems in the file system structure.
Disk fragmentation simply won't go away, as long as the NTFS and file allocation table or FAT, file systems use disk space by first-available clusters. Fragmentation can slow down a server's disk and cause failures. A utility like Optimize-Volume under Windows Server 2012 will arrange each file's clusters contiguously on the disk.
Read the event log's fine print
Servers record a wealth of information in event logs, particularly details about problems. No server inspection is complete without a careful review of system, malware and other event logs. Sure, critical system issues should have attracted the attention of IT administrators and technicians right away, but there are countless noncritical issues that signal chronic and serious problems.
While you're there, check the reporting setup and verify the correct alert and alarm recipients. For example, if a technician leaves the server group, you'll need to update the server's reporting system. Double-check the contact methods too; a critical error reported to a technician's company email address might be entirely inadequate if the error occurs outside of business hours.
Be proactive with log data. When a log inspection reveals chronic or recurring issues, a proactive investigation can resolve the problem before it escalates. For example, if the server's log reports recoverable errors in a memory module, it will not trigger critical alarms. But repeated instances signal problems with the module, and IT staff can perform more detailed diagnostics to identify impending failures.
If the problems are not severe enough to warrant shutting down a server, it can return to production until replacement hardware comes in.
Make time for patches and updates
The server's software stack -- BIOS, OS, hypervisors, drivers, applications, support tools -- must all interact and work together. Unfortunately, software code is rarely elegant or problem-free, so pieces of this software puzzle are frequently patched or updated to fix bugs, improve security, streamline interoperability, enhance performance and so on.
No production software should be able to update automatically. Administrators will determine if a patch or upgrade is necessary, then evaluate and test the change thoroughly. If the update fixes a problem your server doesn't have, why risk creating other problems?
Software developers cannot possibly test every potential combination of hardware and software, so patches and updates can cause more problems than they fix on your specific server or software stack. For example, a monitoring-agent patch could cause performance problems with an important workload because the new agent takes more bandwidth than expected.
The shift to DevOps, with smaller and more frequent updates, exacerbates the potential for problems. You still need to test any patch or update in a lab before deploying it. And always be sure you can undo the change and restore the original software configuration if necessary.
Dig deeper on x86 commodity rackmount servers