As the number of servers, storage arrays and network equipment continues to grow, equipment failure is unavoidable. IT pros must consider different ways to come to grips with this guarantee of failure.
The next generation of data centers could use a fragile IT methodology -- knowing that equipment will fail but architecting to accept that failure with minimal human intervention. While an N+M (primary plus multiples for backup) architecture requires someone to replace or repair the failed equipment, fragile IT leaves failed equipment in place and ensures it doesn't draw power or drain the overall system performance.
If you're aiming for a data center technology platform that will last three years, you have to make sure the installed platform can support the workloads envisaged over that time. Predicting workloads that far ahead is nearly impossible in practice, so taking a modular approach is best.
A modular approach uses engineered racks and rows that can be replicated fairly easily, combined with a rolling lifecycle in which nearly one third of the equipment is replaced each year to keep the platform modern.
For example, imagine a data center with 3,000 servers, each with direct-attached storage (DAS), and 300 network switches in a fabric network. Due to expected workload growth, IT plans to increase server and storage numbers by 10% per year. It will therefore have 3,300 servers by the end of year one, 3,630 by the end of year two and roughly 4,000 (3,993) by the end of year three.
On a yearly basis, this data center is replacing upwards of 1,000 servers, 1,000 hard disk drives and 100 network switches.
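The arithmetic in this example can be checked with a short sketch. It assumes simple compound growth and a three-year rolling lifecycle; the function names are illustrative only and are not taken from any tool.

```python
def project_fleet(initial, growth_rate, years):
    """Return fleet size at the start, then at the end of each year."""
    sizes = [initial]
    for _ in range(years):
        # Compound growth: each year's fleet is the previous year's plus growth_rate.
        sizes.append(round(sizes[-1] * (1 + growth_rate)))
    return sizes


def annual_refresh(count, lifecycle_years=3):
    """Units replaced per year when the fleet rolls over every lifecycle_years."""
    return round(count / lifecycle_years)


servers = project_fleet(3000, 0.10, 3)
print(servers)               # [3000, 3300, 3630, 3993]
print(annual_refresh(3000))  # 1000 servers (and as many drives) per year
print(annual_refresh(300))   # 100 switches per year
```

As the fleet grows, a one-third refresh grows with it, which is why the yearly replacement count is "upwards of" 1,000.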
Now, factor in predicted failure rates. Network switches are the least prone to failure. Servers and storage systems, while fairly robust, will lose several boxes over the expected three-year lifecycle.
Understand your organization's risk profile before implementing a fragile IT platform. In high-workload environments, it's possible that 5% to 10% of servers and hard drives could fail in one year. Therefore, an organization with no room for error should build in enough extra hardware (N+M) to leave the headroom to keep running without the failed systems.
If you have 1,000 servers and grow at 10% annually, with a failure rate of 10% per year, engineer for 1,542 servers and storage devices with 44 network switches. This platform should be able to last three years without the need for any human intervention. It absorbs failures without outages and grows alongside workload needs.
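The sizing logic can be sketched as follows. The article does not state the model behind its figures, so this sketch makes its own assumptions: only powered-on units fail, failures are counted once per year against that year's active fleet, and failed units are never repaired. Its output will not necessarily match the numbers quoted above, and the function name is hypothetical.

```python
def size_for_unattended_life(initial, growth_rate, failure_rate, years):
    """Units to install up front so no repairs are needed for `years`.

    Assumed model (not from the article): the active fleet grows by
    growth_rate each year, and failure_rate of that year's active units
    fail and are left in place, so each failure consumes one spare.
    """
    active = initial
    failed_total = 0
    for _ in range(years):
        active = round(active * (1 + growth_rate))    # working units needed this year
        failed_total += round(active * failure_rate)  # units lost and not replaced
    return active + failed_total


# Growth only, no failures: the final working fleet.
print(size_for_unattended_life(1000, 0.10, 0.0, 3))
# Growth plus 10% annual failures: working fleet plus consumed spares.
print(size_for_unattended_life(1000, 0.10, 0.10, 3))
```

Changing when failures are counted, or applying a lower rate to lightly loaded units, shifts the result noticeably, which is why the risk profile should drive the inputs.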
The servers that make up this built-in excess capacity use no energy while switched off, but can be turned on when another unit fails. They can also meet any unexpected workload spikes.
As IT equipment prices decrease, the cost of human intervention increases. Paying a technician to remove and replace a $1,500 server could cost more than the server itself, when all costs are factored in along with the risk of error during the installation.
Hardware vendors are looking at how to offer fragile IT systems. Today, containerized systems pack all IT equipment into a standard shipping container. Prior to use, the organization plugs in power and data connections to the container and carries out simple configurations.
A container-based approach combined with above-average operating temperatures produces higher than normal equipment failure rates, which fragile IT offsets. The business can buy and use a container with zero human intervention, which is cost effective. Containers minimize cooling and real estate costs. Once the lifecycle is completed, the original vendor can refurbish the container or strip it bare and refill it with new equipment to use elsewhere.
Deployment speed improves with fragile IT in modular setups. Data centers run completely lights out, with system management focused far more on the performance of the platform than on hardware maintenance and repairs. The business receives a platform with far greater availability, and IT teams can concentrate on strategic tasks.
Break with your visceral fears. Embrace the fact that IT is fragile; engineer to manage it.
About the author: Clive Longbottom is the co-founder and service director of IT research and analysis firm Quocirca, based in the U.K. Longbottom has more than 15 years of experience in the field. With a background in chemical engineering, he's worked on automation, control of hazardous substances, document management and knowledge management projects. Clive.Longbottom@quocirca.com
27 May 2014