Resiliency is the ability of a server, network, storage system or entire data center to recover quickly and continue operating after an equipment failure, power outage or other disruption.
Data center resiliency is a planned part of a facility’s architecture and is usually associated with other disaster planning and data center disaster-recovery considerations such as data protection. The adjective resilient means "having the ability to spring back."
Data center resiliency is often achieved through the use of redundant components, subsystems, systems or facilities. When one element fails or experiences a disruption, the redundant element takes over seamlessly and continues to support computing services to the user base. Ideally, users of a resilient system never know that a disruption has even occurred.
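As a rough illustration of why redundancy helps, the sketch below estimates the combined availability of a group of redundant elements. It rests on two simplifying assumptions that the text above does not state: failures are independent, and a single working element is enough to keep the service running. The function name and the sample figure are hypothetical.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Estimate the availability of `copies` redundant elements.

    Assumes independent failures and that one working element is
    sufficient; real systems often share failure modes, so treat the
    result as an optimistic upper bound.
    """
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("single_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    probability_all_down = (1.0 - single_availability) ** copies
    return 1.0 - probability_all_down


# Hypothetical figure: one element that is available 99% of the time.
single = redundant_availability(0.99, 1)
paired = redundant_availability(0.99, 2)
```

Under these assumptions, pairing two elements that are each available 99% of the time yields roughly 99.99% availability for the pair.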
For example, if an ordinary server’s power supply fails, the server fails -- and all of the workloads on that server become unavailable until the server is repaired and restarted (or the workloads can be restarted on another suitable server). If the server incorporates a redundant power supply, the backup supply keeps the server running until a technician can replace the failed power supply. Techniques such as server clustering support redundant workloads on multiple physical servers. When one server in the cluster fails, another node takes over with its redundant copies of the workloads.
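The clustering behavior described above can be sketched in a few lines. This is a minimal model, not how any particular clustering product works: it treats failover as reassigning a failed node's workloads to the least-loaded healthy peer, and the `Node` class, the `fail_over` function and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster member and the workloads it is running."""
    name: str
    healthy: bool = True
    workloads: list = field(default_factory=list)


def fail_over(nodes, failed_name):
    """Mark one node as failed and move its workloads to a healthy peer."""
    failed = next((n for n in nodes if n.name == failed_name), None)
    if failed is None:
        raise KeyError(f"unknown node: {failed_name}")
    failed.healthy = False

    survivors = [n for n in nodes if n.healthy]
    if not survivors:
        # With no redundancy left, the workloads stay unavailable.
        raise RuntimeError("no healthy node left to take over")

    # Assumption: choose the least-loaded survivor as the takeover target.
    target = min(survivors, key=lambda n: len(n.workloads))
    target.workloads.extend(failed.workloads)
    failed.workloads.clear()
    return target


cluster = [Node("node-a", workloads=["database", "web"]), Node("node-b")]
target = fail_over(cluster, "node-a")
```

After the call, `node-b` carries both workloads and `node-a` carries none. Calling `fail_over` again for `node-b` raises an error, which mirrors the single-server case where no redundant element remains.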
The same concept holds true all the way up to entire data center facilities. For example, an organization may power its data center with two separate utility feeds from different utility providers so that a backup feed is available when the primary provider fails. As another example, organizations that maintain hot sites can fail over at the facility level, shifting an entire operation from one facility to another in response to any kind of local disruption or regional disaster.
The resiliency techniques employed in a data center can vary with the importance of the respective workloads. Organizations with mission-critical workloads will use more resiliency techniques at more levels within the data center, because losing critical computing services during a prolonged outage is typically costlier than losing nonessential ones. For example, critical business services, such as transaction processing software or database systems, may be designed with comprehensive data center resiliency, including clustering, snapshots and off-site redundancy. Conversely, nonessential workloads that can tolerate some level of disruption may receive little resiliency or simply remain offline until they can be restored.
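One way to picture this tiering is as a lookup from workload importance to the techniques applied. The sketch below is illustrative only: the mission-critical and nonessential entries follow the examples above, while the intermediate "standard" tier, the names and the sample workloads are assumptions added for the example.

```python
from enum import Enum


class Criticality(Enum):
    MISSION_CRITICAL = "mission-critical"
    STANDARD = "standard"          # assumed middle tier, not from the text
    NONESSENTIAL = "nonessential"


# Hypothetical policy table mapping each tier to its resiliency techniques.
RESILIENCY_POLICY = {
    Criticality.MISSION_CRITICAL: ("clustering", "snapshots", "off-site redundancy"),
    Criticality.STANDARD: ("snapshots",),
    Criticality.NONESSENTIAL: (),  # stays offline until it can be restored
}


def techniques_for(criticality):
    """Return the resiliency techniques assigned to a workload tier."""
    return RESILIENCY_POLICY[criticality]


# Hypothetical workloads and their assigned tiers.
workloads = {
    "transaction-processing": Criticality.MISSION_CRITICAL,
    "internal-wiki": Criticality.NONESSENTIAL,
}
plan = {name: techniques_for(tier) for name, tier in workloads.items()}
```

Keeping the policy in a single table makes the tradeoff explicit: each technique added to a tier raises cost, so only the workloads that justify it receive the full set.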