Because routine operation may not exercise every part of an IT system, a component might fail or a software module might crash while the system appears to run normally for months or even years. The failure might not manifest until the system restarts, often unexpectedly, causing unplanned workload disruption and downtime.
Conduct a periodic power cycling test to identify possible problems and resolve them proactively.
Why should I perform a power cycling test if I don't see any problems?
Systems management tools including Microsoft System Center, SolarWinds, Nagios and Zabbix are powerful and versatile platforms. Almost all systems management tools can provide features for fault, configuration, accounting, performance and security management, making them indispensable to the modern enterprise.
Some faults, however, can occur at the hardware level without affecting the system or its workloads immediately. For example, a memory fault might be detected in a server's dual in-line memory module (DIMM). But if no workload uses that memory space, or the defective DIMM is protected by an error-correcting technology, the server can continue to function with little direct error reporting to systems management. In most cases, a modern server's Intelligent Platform Management Interface (IPMI) or baseboard management controller (BMC) can report these errors, but that information is typically just logged while the system and its workloads continue to operate.
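Errors that are logged but never acted on can be surfaced before a planned restart by scanning the hardware event log. The Python sketch below is illustrative only: it assumes the log has been exported as pipe-delimited text with the sensor name in the fourth field (similar in shape to what IPMI command-line tools print), and the keyword list is an assumption of this example rather than any standard.

```python
from collections import Counter

# Keywords that mark an event as worth reviewing before a restart.
# This list is an assumption for the sketch; tune it for your hardware.
SUSPECT_KEYWORDS = ("ecc", "uncorrectable", "critical", "failure", "fault")


def parse_event_line(line):
    """Split one pipe-delimited event log line into stripped fields."""
    return [field.strip() for field in line.split("|")]


def find_suspect_events(log_text, keywords=SUSPECT_KEYWORDS):
    """Return the parsed lines whose sensor/description mentions a keyword."""
    suspects = []
    for line in log_text.splitlines():
        if "|" not in line:
            continue  # skip blank lines and anything that is not an event row
        fields = parse_event_line(line)
        # Fields 0-2 are assumed to be id, date and time; the rest describe
        # the sensor and the event.
        description = " ".join(fields[3:]).lower()
        if any(keyword in description for keyword in keywords):
            suspects.append(fields)
    return suspects


def summarize_by_sensor(events):
    """Count suspect events per sensor (assumed to be the fourth field)."""
    return Counter(fields[3] for fields in events if len(fields) > 3)
```

Repeated entries for the same sensor, such as a DIMM that keeps logging corrected errors, are the kind of quiet fault a cold restart may turn into a boot failure.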
These undetected and unresolved hardware issues become most problematic at restart. If an unexpected system crash or power disruption causes an unplanned reboot, the system's internal firmware may detect the problems and refuse to complete the boot process. For example, if the server's south bridge chip fails and USB or onboard disk controller functions don't initialize or respond, the boot process will stop, even though the enterprise may not use the server's USB ports and may access storage across a network instead. Now, IT must recover from an unexpected disruption and address defective systems at the same time.
To avoid this, conduct a periodic, proactive power cycling test that forces a full restart at the hardware level. Instead of scrambling during unplanned outages or downtime, use planned restarts to ensure data protection and migrate VMs or storage instances off target devices in an organized manner. Next, cycle power and allow the hardware to boot fully to reveal unknown or unresolved problems. System power cycling is often included in an organization's existing shutdown documentation. If problems arise during a restart, you'll be better prepared to take corrective action.
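The order of operations in a planned restart matters: data protection and workload migration come before the power cycle, and a failure in either should stop the run before power is touched. A minimal Python sketch of that sequencing follows. The four step functions are hypothetical placeholders that you would wire to your own backup, hypervisor and out-of-band management tooling; none of them refers to a real API.

```python
from dataclasses import dataclass, field


@dataclass
class CycleResult:
    """Outcome of one planned power cycle on one host."""
    host: str
    completed_steps: list = field(default_factory=list)
    problem: str = ""  # empty string means every step succeeded


def run_planned_power_cycle(host, verify_backup, migrate_workloads,
                            cycle_power, check_boot):
    """Run the steps in order and stop at the first one that fails.

    Each step is a caller-supplied function (hypothetical in this sketch)
    that takes the host name and returns True on success.
    """
    steps = [
        ("verify_backup", verify_backup),
        ("migrate_workloads", migrate_workloads),
        ("cycle_power", cycle_power),
        ("check_boot", check_boot),
    ]
    result = CycleResult(host)
    for name, step in steps:
        if not step(host):
            # Record where the run stopped so the problem can be addressed
            # before anyone retries; later steps are not attempted.
            result.problem = f"{name} failed"
            break
        result.completed_steps.append(name)
    return result
```

A result whose `problem` is `"check_boot failed"` is the useful outcome of the test: the host did not come back cleanly, and you found out on your own schedule.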
How should I approach a power cycling test, and how often should it be done?
Quality server designs can conceivably operate for years. When you deploy those server designs in resilient configurations, such as server clusters, the workloads supported on those systems can survive the loss of individual machines. In fact, the emphasis on system resilience and uptime often causes organizations to forgo periodic power cycling.
But if a server or storage subsystem runs for several years, how do you know it will start up again properly? You don't, and the only way to be confident that the systems are capable of a successful cold restart is to conduct it on a regular basis.
What dependencies are needed for a power cycling test?
Conduct a power cycling test as often as needed to achieve a reasonable level of confidence for your business needs. Generally, you can conduct power cycling every few months or several times a year. It may be a good idea to synchronize a power cycling test with routine disaster recovery and shutdown testing to address both goals at the same time.
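One way to keep to a chosen interval is to track the date of each system's last cold boot and flag anything that has gone too long without one. The Python sketch below assumes you already record those dates somewhere; the 120-day default is only an illustrative stand-in for "every few months", and the host names in the usage note are made up.

```python
from datetime import date, timedelta


def overdue_hosts(last_cold_boot, today, max_interval=timedelta(days=120)):
    """Return, in sorted order, hosts whose last cold boot is older than max_interval.

    last_cold_boot maps host name -> date of the most recent full power cycle.
    """
    return sorted(
        host
        for host, booted in last_cold_boot.items()
        if today - booted > max_interval
    )
```

For example, calling `overdue_hosts({"db01": date(2024, 1, 10), "web01": date(2024, 5, 1)}, date(2024, 6, 1))` flags only `db01`, which has gone 143 days without a cold boot.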
There are times when external factors, such as electrical substation upgrades or important building renovations, force an enterprise to take a data center offline for some period of time. Today, planned shutdowns can be less disruptive to the business because you can easily migrate workloads to a secondary data center or the cloud. So any IT team poised to handle a planned facility shutdown should also be able to conduct routine power cycling.
Are there any risks to the server hardware?
There is always some level of electrical, thermal and mechanical stress when you start and run a server or storage array. When electronic components cool down and heat up again, the resulting thermal stress can break a marginal connection and precipitate a premature system fault. Similarly, if you let an aging disk or cooling fan cool down, exhausted lubricant can seize up and cause problems with the disk or fan spindle or other delicate mechanisms.
There are also potential logical risks. Unexpected configuration changes may put the system out of bounds and cause configuration management tools to raise warnings or block application startup. For example, if a system restarts and attempts to install an unexpected or unapproved patch, a configuration management tool may halt workload or server cluster startups until the system's approved configuration is restored.
Such problems are rare, especially in modern, energy-efficient systems. While some IT experts argue against power cycling to reduce the possibility of such failures, the idea of a planned power cycling test is exactly to precipitate, isolate and address those types of problems. It's better to expose any issues early on than to wait until an unexpected power failure or application crash and discover that a system isn't starting properly. With today's use of virtualization and clustering, workloads will continue running while you identify and repair failed systems.
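The logical risk described above, a restart that leaves a system outside its approved configuration, can be checked for directly after a test boot. The Python sketch below compares an approved baseline against what the system reports. Both inputs are plain name-to-version mappings that you would have to collect yourself; the function is hypothetical and does not correspond to any particular configuration management product.

```python
def find_config_drift(approved, observed):
    """Return {item: (approved_version, observed_version)} for every mismatch.

    A value of None means the item is missing on that side: either an
    unapproved addition or an approved item that did not come back.
    """
    drift = {}
    for name in sorted(set(approved) | set(observed)):
        want = approved.get(name)
        have = observed.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift
```

An empty result means the system restarted into its approved configuration; anything else is worth resolving before workloads are migrated back.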