The cost and impact of downtime include a wide range of variables that extend beyond lost revenue and customers. To accurately assess the true cost, study all the items that contribute to reduced availability, as well as the cost of implementing mechanisms that reduce the impact of a loss of service in critical areas, and determine a cost-versus-benefit ratio. Here is a list of ten things to examine when conducting this analysis.
- Analysts estimate that hardware is responsible for 15-20% of downtime. Failure rates vary by vendor, and even by server line from the same vendor, because the Mean Time Between Failures (MTBF) differs for the various components. Using dual components (fans, NICs, power supplies, RAID controllers, etc.) in a single server can help eliminate some single points of failure at less cost than implementing a second server.
- Operator error is also a frequent cause of outages. These errors include misconfigured load balancers, application installation problems, incorrectly installed hardware, and faulty backups. Investing in training or in change control processes (including roll-back plans) for mission-critical systems can reduce the impact of errors.
- Network and environment issues may or may not be avoidable in a cost-effective manner. Be sure to consider LAN and WAN redundancy costs, circuit provider SLAs, power outages, weather and seismic risks for your specific area, and cooling issues.
- Operating system availability varies by vendor and configuration. Review your servers' configurations and determine whether they have been tuned for optimal availability and performance. Upgrading older operating systems may be an option to consider.
- Analysts and IT managers consider databases to be the biggest cause of downtime. Database outages can be due to faulty configuration and/or tuning, poorly trained administrators, or undersized hardware. Your organization may want to consider hiring an outside consultant to independently verify that your database is configured to provide the required level of availability.
- Build a model to show the relative impacts of downtime. Define what happens to end-to-end availability when specific components fail. Not every component failure causes a total loss of service; some result only in degraded service. Losing two of eight web servers to a DoS attack or a misconfigured load balancer will have less impact than losing both database servers in a misconfigured cluster that didn't fail over properly.
- Once you have determined how many servers (and which ones) must fail to cause an outage, put a plan in place to buy the right amount of hardware to fit the model of degraded-but-acceptable performance rather than complete end-to-end redundancy.
- Consider all options for your model. A premium 4-hour on-site support contract with a vendor to repair, replace, or troubleshoot issues may be more cost-effective than purchasing and deploying redundant hardware that the staff isn't trained on. Untrained staff can lengthen an outage because they take longer to identify and fix the problem.
- Analysts agree that the best approach to minimizing downtime is not to reduce the number of failures but to reduce the duration of each outage. Create rapid response policies that are appropriate for your organization and train your IT staff in them. The occasional "fire drill" is great for practicing the execution of these policies before a real event occurs, not only from a training standpoint but also because holes in the processes can be identified and corrected before they affect a true crisis event.
- Consider the cost of lost productivity and revenue for a specific application versus the cost of supporting the platform that application uses. Yankee Group found that while Windows server downtime was 3-4 times as expensive as Linux downtime, mission-critical data and applications were running on Windows more often than on Linux. The relative cost-to-loss ratio for your organization is important to understand.
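The availability model described in the list above can be roughed out in a few lines of Python. This is a minimal sketch, not a definitive method: the function names and every number in the usage section (MTBF, repair times, server counts, cost per hour) are hypothetical placeholders, and the calculation assumes servers fail independently, which a DoS attack or a misconfigured load balancer would violate. It uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair.

```python
from math import comb

HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def k_of_n_availability(a: float, n: int, k: int) -> float:
    """Probability that at least k of n identical servers are up.

    Assumes each server is up with probability a, independently of the others.
    """
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


def end_to_end(tier_availabilities) -> float:
    """Availability of tiers in series: every tier must be up at once."""
    result = 1.0
    for a in tier_availabilities:
        result *= a
    return result


def annual_downtime_cost(avail: float, cost_per_hour: float) -> float:
    """Expected yearly cost of the unavailable hours."""
    return (1 - avail) * HOURS_PER_YEAR * cost_per_hour


# Hypothetical inputs, for illustration only: a web tier that stays
# acceptable with 6 of 8 servers up, a 2-node database cluster that
# needs 1 node, and an assumed outage cost of 10,000 per hour.
for mttr in (8, 4):  # compare an 8-hour repair time with a 4-hour one
    server = availability(mtbf_hours=2000, mttr_hours=mttr)
    web = k_of_n_availability(server, n=8, k=6)
    db = k_of_n_availability(server, n=2, k=1)
    site = end_to_end([web, db])
    cost = annual_downtime_cost(site, cost_per_hour=10_000)
    print(f"MTTR {mttr}h: availability {site:.6f}, expected annual cost {cost:,.0f}")
```

Rerunning the model with different repair times, server counts, or per-hour costs gives a quick way to compare options such as a faster support contract against additional hardware before committing budget to either.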
About the author: Kackie Cohen is a Silicon Valley-based consultant providing data
center planning and operations management to government and private sector clients. Kackie is the
author of Windows 2000 Routing and Remote Access Service and co-author of Windows XP
This was first published in February 2006