Data center servers are just sophisticated machines. Like any machine, they require regular maintenance to operate at peak performance. Simple maintenance procedures reduce serious service calls and extend the working life of servers.
Even with the performance and redundant features of modern servers, increased workload consolidation and reliability expectations can take a toll on your fleet. Your server maintenance checklist should cover physical elements as well as the system's critical configuration.
Stick to a routine
Server administrators too often overlook planning maintenance windows. Don't wait until there is an actual failure; set aside time for routine server preventative maintenance.
Maintenance frequency depends on the age of the equipment, the data center environment, the volume of servers requiring maintenance and other factors. For example, older equipment located in an equipment closet needs more frequent inspections than new servers deployed in a HEPA-filtered, well-cooled data center. Organizations can base routine maintenance schedules on vendor or third-party provider routines; if the vendor's service contract calls for system inspections every four or six months, follow that schedule.
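As a rough illustration of turning a vendor interval into a schedule, the sketch below computes the next inspection date. The function names and the rule of halving the interval for older equipment in a poor environment are assumptions made for the example, not a vendor recommendation.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return the date `months` later, clamping to the last day of short months."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_inspection(last_inspection: date,
                    vendor_interval_months: int,
                    harsh_environment: bool = False) -> date:
    """Next inspection date based on the vendor interval.

    Assumption for this sketch: equipment in a harsh environment (for
    example, an older server in an equipment closet) is inspected twice
    as often as the vendor interval suggests.
    """
    months = vendor_interval_months
    if harsh_environment:
        months = max(1, vendor_interval_months // 2)
    return add_months(last_inspection, months)


# A server on a four-month vendor schedule, last inspected on Jan. 31, 2024.
print(next_inspection(date(2024, 1, 31), 4))
print(next_inspection(date(2024, 1, 31), 4, harsh_environment=True))
```

The halving rule is only a placeholder; substitute whatever adjustment fits your own equipment age and environment.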
Preparation is everything
Have a plan before you tackle the items on a server maintenance checklist. This includes checking the system logs for any errors or events that require more direct attention. For example, if system logs indicate errors with a specific memory module, you should order a replacement DIMM and have it available for installation. Similarly, if there are firmware, operating system or agent patches/updates available, test and vet those patches before the maintenance window.
Also have a clear plan for taking the system offline and returning it to service later. Before the advent of virtualization, the server and its resident application would require downtime to accommodate the maintenance window -- often forcing IT personnel to perform maintenance at night or on weekends. Virtualized servers enable workload migration instead of downtime, so you can migrate applications to other servers and they'll remain available whenever server maintenance occurs on the underlying host system. Before service, know where the VMs should go, migrate VMs to selected systems and verify each workload is working before taking the server down for maintenance.
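The planning step can be sketched in a few lines. The example below is a simplified illustration, assuming memory is the only constraint and that VM and host names, sizes and health status come from your own inventory; it does not call any real hypervisor API. It places the largest VMs first on the host with the most free memory, then refuses to clear the host for shutdown until every migrated workload reports healthy.

```python
def plan_migration(vms: dict[str, int], hosts: dict[str, int]) -> dict[str, str]:
    """Map each VM (name -> GB of memory needed) to a target host.

    `hosts` maps host name -> GB of free memory. Raises RuntimeError if
    any VM cannot be placed, so the problem surfaces before the window.
    """
    free = dict(hosts)  # work on a copy so the caller's data is untouched
    plan: dict[str, str] = {}
    # Place the largest VMs first; they are the hardest to fit.
    for vm, need in sorted(vms.items(), key=lambda kv: kv[1], reverse=True):
        candidates = sorted(free.items(), key=lambda kv: kv[1], reverse=True)
        target = next((h for h, cap in candidates if cap >= need), None)
        if target is None:
            raise RuntimeError(f"no host has {need} GB free for VM {vm!r}")
        plan[vm] = target
        free[target] -= need
    return plan


def ready_for_shutdown(plan: dict[str, str], healthy_vms: set[str]) -> bool:
    """True only when every VM in the plan has been verified as working."""
    return all(vm in healthy_vms for vm in plan)


plan = plan_migration({"web": 8, "db": 32, "cache": 16},
                      {"host-b": 48, "host-c": 24})
print(plan)
print(ready_for_shutdown(plan, healthy_vms={"web", "db"}))  # cache not verified yet
```

A real plan would also weigh CPU, storage, network and affinity rules, but the principle is the same: know where each VM goes, and confirm it works there, before the host goes down.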
At this point, you can typically shut down the server and remove it from the rack or other enclosure.
Make sure the server can breathe
Once a server is offline, visually inspect its external and internal airflow paths. Remove any accumulations of dust and other debris that can impede cooling air.
Start with the exterior air inlets and outlets, then proceed into the system chassis, looking at the CPU heat sink and fan assemblies, memory modules and all cooling fan blades and air duct pathways. Remove dust or debris on an appropriate, static-safe workspace with clean, dry compressed air. Do not clean the server right there in the rack.
Dust-busting is an old-school process, but that doesn't mean it's outdated. Dust is a thermal insulator, and removing it matters even more now that alternative cooling schemes and ASHRAE recommendations have raised data center operating temperatures. Dust and other airflow obstructions cause the server to use more energy and can even precipitate avoidable premature component failures.
Check local hard disks
Many servers rely on internal hard disks for booting, workload startup and storage, user data, and other functions. Disk media problems seriously hurt workload performance and stability, often leading to premature disk failures.
Magnetic media isn't perfect; common problems include bad sectors and fragmentation. RAID goes a long way toward preserving data integrity in the wake of storage errors, but smaller, 1U rack servers don't provide enough physical space to deploy an array of disks. Use tools such as the CHKDSK (Check Disk) utility to verify the disk's integrity and attempt to recover any bad sectors on it. Windows Server 2012's updated version of CHKDSK quickly analyzes and fixes disk problems in the file system structure.
Disk fragmentation simply won't go away as long as the NTFS and file allocation table (FAT) file systems allocate disk space by first-available clusters. Fragmentation can slow down a server's disk and cause failures. A utility such as the Optimize-Volume PowerShell cmdlet in Windows Server 2012 defragments a volume, arranging each file's clusters contiguously on the disk.
Read the event log's fine print
Servers record a wealth of information in event logs, particularly details about problems. No server maintenance checklist is complete without a careful review of system, malware and other event logs. Sure, critical system issues should have attracted the attention of IT administrators and technicians right away, but countless minor issues can signal chronic and serious problems.
While you're there, check the reporting setup and verify the correct alert and alarm recipients. For example, if a technician leaves the server group, you'll need to update the server's reporting system. Double-check the contact methods too; a critical error reported to a technician's company email address might be entirely inadequate if the error occurs outside of business hours.
Be proactive with log data. When a log inspection reveals chronic or recurring issues, a proactive investigation can resolve the problem before it escalates. For example, if the server's log reports recoverable errors in a memory module, it will not trigger critical alarms. But repeated instances signal problems with the module, and IT staff can perform more detailed diagnostics to identify impending failures.
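One simple way to surface this kind of pattern is to count recoverable errors per component and flag anything that repeats. The sketch below assumes log entries have already been parsed into dictionaries with "source" and "severity" fields; that record format, the "corrected" severity label and the threshold of three are all assumptions for illustration, not a real log schema.

```python
from collections import Counter


def recurring_issues(events: list[dict], threshold: int = 3) -> dict[str, int]:
    """Return components whose recoverable-error count meets the threshold.

    Each event is a dict with hypothetical keys "source" (the component)
    and "severity" ("corrected" marks a recoverable error).
    """
    counts = Counter(
        event["source"] for event in events if event["severity"] == "corrected"
    )
    return {source: n for source, n in counts.items() if n >= threshold}


events = [
    {"source": "DIMM_A2", "severity": "corrected"},
    {"source": "DIMM_A2", "severity": "corrected"},
    {"source": "NIC_1", "severity": "corrected"},
    {"source": "DIMM_A2", "severity": "corrected"},
    {"source": "PSU_2", "severity": "info"},
]
print(recurring_issues(events))
```

Here only the memory module crosses the threshold, which makes it the candidate for detailed diagnostics even though no single entry raised a critical alarm.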
If the problems are not severe enough to warrant shutting down a server, it can return to production until replacement hardware comes in.
Make time for patches and updates
The server's software stack -- BIOS, OS, hypervisors, drivers, applications, support tools -- must all interact and work together. Unfortunately, software code is rarely elegant or problem-free, so pieces of this software puzzle are frequently patched or updated to fix bugs, improve security, streamline interoperability and enhance performance.
No production software should be able to update automatically. Administrators should determine if a patch or upgrade is necessary, then evaluate and test the change thoroughly. If the update fixes a problem your server doesn't have, why risk creating other problems?
Software developers cannot possibly test every potential combination of hardware and software, so patches and updates can cause more problems than they fix on your specific server or software stack. For example, a monitoring-agent patch could cause performance problems with an important workload because the new agent takes more bandwidth than expected.
The shift to DevOps, with smaller and more frequent updates, exacerbates the potential for problems. You still need to test any patch or update in a lab before deploying it. And always be sure you can undo the change and restore the original software configuration if necessary.
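For a file-based configuration change, the undo path can be as small as the sketch below: back up the file, apply the change, run a verification check, and restore the backup if the check fails. The function name, the ".bak" naming and the verification callback are assumptions for the example; real patches usually need package-level or snapshot-level rollback as well.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def apply_with_rollback(config_path: Path,
                        new_text: str,
                        verify: Callable[[Path], bool]) -> bool:
    """Write new_text to config_path; restore the original if verify fails.

    Returns True when the change was kept, False when it was rolled back.
    """
    backup = config_path.with_name(config_path.name + ".bak")
    shutil.copy2(config_path, backup)      # keep the original configuration
    config_path.write_text(new_text)       # apply the change
    if verify(config_path):
        return True
    shutil.copy2(backup, config_path)      # undo: restore the original
    return False


# Demo in a throwaway directory: the new setting is misspelled, so the
# (hypothetical) verification check fails and the original is restored.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "agent.conf"
    cfg.write_text("interval=60\n")
    kept = apply_with_rollback(
        cfg, "intervall=5\n", verify=lambda p: p.read_text().startswith("interval=")
    )
    print(kept, cfg.read_text().strip())
```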
Verify and record any changes
A lot can happen to a server during a maintenance window, including hardware, software and system configuration changes. When you've completed the server maintenance checklist, it's vital for IT staff to verify and record any new system state. For example, changing a network adapter, adding or replacing DIMMs, updating an OS, and many other actions can alter the system's configuration. Organizations that depend on system configuration management tools may need to update or "discover" the changes -- recording those changes to the configuration management database before the system is allowed back into service. IT staff may need to update any enforced or desired state configuration posture to allow the changes.
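Recording the new state is easier when you capture a snapshot before and after the window and compare the two. The sketch below assumes each snapshot is a flat dictionary of configuration items; the keys and values shown are hypothetical, and a real configuration management tool would collect them for you.

```python
def diff_state(before: dict, after: dict) -> dict:
    """Return {item: (old_value, new_value)} for every item that changed.

    Items present in only one snapshot appear with None on the other side.
    """
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


before = {"os_build": "9200", "memory_gb": 64, "nic0": "1GbE"}
after = {"os_build": "9600", "memory_gb": 128, "nic0": "1GbE", "nic1": "10GbE"}

for item, (old, new) in diff_state(before, after).items():
    print(f"{item}: {old} -> {new}")
```

The resulting list of changes is what needs to land in the configuration management database before the server returns to service.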
Also verify system security postures such as firewall settings, anti-malware versions or scanning frequency and intrusion detection/prevention (IDS/IPS) settings. Security checks can help ensure that changes to system software did not inadvertently expose any attack surfaces that might have been closed in the prior configuration.
And finally, don't forget to update any system backups or disaster recovery (DR) content once the server is back online. Verify that the server's backup/DR posture or frequency remains unchanged, unless those related settings specifically need to be adjusted to reflect the server's changing role.