Server uptime and hardware failure guide
IT managers hate system downtime, but the harsh reality is that even the best plans and preparation cannot cover every circumstance, and even the simplest oversights can snowball into serious events that are difficult and costly to remediate. This month, we asked our Advisory Board about the underlying causes of data center downtime, the effect on personnel stress and morale, the costs involved and the steps that IT staff can take to mitigate the effects.
You can hear more about preventing system downtime in the second part of this Advisory Board Q&A. Readers from India can find local coverage of risk assessment in this tip on risk assessment methodology for disaster recovery.
Robert McFarlane, principal and data center design expert, Shen Milsom Wilke Inc.
Reputable studies have concluded that as much as 75% of downtime is the result of some sort of human error. But what is behind those human errors? It's always easy to say "lack of training," but even the best trained people still make mistakes when they are in a rush, are tired, weren't really thinking, or just thought they could get away with taking a shortcut. My answer probably leans toward the "lack of planning" reason. It has always been my contention that many things (data centers in particular) invite human mistakes, simply because they are illogical in their layout, poorly labeled (if labeled at all) and generally doomed to trap some poor soul into an error that would not have been made if what was being worked on made sense in the first place.
For example, most everything today is "dual corded," connecting to two different power receptacles that are supposed to come from two different power centers. Left to their own devices, electricians may connect one receptacle to breaker 7 in panel A, and the other receptacle to breaker 16 in panel B. Further, they may put circuit labels on the outlets inside a cabinet, which are impossible to read, and put identifications on the panel schedules that don't correspond to the cabinet numbering. This makes it too easy to turn off circuits in different cabinets or fail to power down the intended cabinet.
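One way to catch this kind of labeling problem before it causes an outage is to audit the panel schedule against the cabinet list. The sketch below is illustrative only: the Receptacle record, the find_labeling_problems function and the sample data are all hypothetical, and the rule that both feeds should use the same breaker position in each panel is just one possible convention, assumed here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receptacle:
    cabinet: str  # cabinet the receptacle serves (hypothetical naming, e.g. "C01")
    panel: str    # power panel feeding it, e.g. "A" or "B"
    breaker: int  # breaker position within that panel


def find_labeling_problems(receptacles):
    """Return one warning string per cabinet whose feeds look error-prone."""
    by_cabinet = {}
    for r in receptacles:
        by_cabinet.setdefault(r.cabinet, []).append(r)

    problems = []
    for cabinet, feeds in sorted(by_cabinet.items()):
        panels = {r.panel for r in feeds}
        breakers = {r.breaker for r in feeds}
        if len(feeds) != 2:
            problems.append(f"{cabinet}: expected 2 feeds, found {len(feeds)}")
        elif len(panels) < 2:
            # Dual-corded equipment gains nothing if both cords share a panel.
            problems.append(f"{cabinet}: both feeds come from panel {feeds[0].panel}")
        elif len(breakers) > 1:
            # Assumed convention: same breaker position in each panel.
            detail = ", ".join(
                f"{r.panel}/{r.breaker}" for r in sorted(feeds, key=lambda r: r.panel)
            )
            problems.append(f"{cabinet}: breaker positions differ ({detail})")
    return problems


# Made-up sample schedule for illustration.
schedule = [
    Receptacle("C01", "A", 7), Receptacle("C01", "B", 16),
    Receptacle("C02", "A", 8), Receptacle("C02", "B", 8),
    Receptacle("C03", "A", 9), Receptacle("C03", "A", 10),
]
problems = find_labeling_problems(schedule)
for line in problems:
    print(line)
```

With this sample data, cabinet C01 is flagged for mismatched breaker positions and C03 for drawing both feeds from one panel, while C02 passes.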
Morale is seriously affected by system downtime, because IT lives in dread of failures. Small events are bad enough, but big ones suck the life out of staff. IT has become the new "utility." Systems are expected to simply be there, just as power, gas and water are not expected to fail, and are expected to be restored quickly if they do. IT staff know very well that a failure that truly affects the business, or that puts people's lives at risk, will be investigated and maybe even publicized, possibly resulting in job loss. There's daily pressure to avoid downtime, but there's astronomical stress during recovery. I have seen only one data center where the uptime was publicized regularly.
The most often overlooked cost of system downtime is corporate image. It varies greatly by business, but for some companies, the damage to their image could be beyond monetary valuation. Another is loss of customers. Suppose a manufacturer supplying the auto industry suddenly found that its shipping system, which depended on its central data center, was interrupted by a downtime event. A car company that relies on "just in time" parts delivery would switch to its second source as soon as the delay was realized. That customer may never come back.
It's hard to mitigate downtime. IT is a pressure business. There's always another server to be installed, or another application to roll out, and rarely enough time or resources to do it carefully or to document it fully. Sometimes it's necessary to stand up to management and say, "This timetable isn't realistic, and it's an invitation to a disaster down the road." There has to be discipline and an insistence on proper planning and procedure, which includes all the things noted above. Human beings are failure-prone. We can't push an IT staff into mistakes and then act surprised when downtime occurs.
Matt Stansberry, director of content and publications, Uptime Institute
I turned to Rick Schuknecht, Uptime Institute vice president, to answer these questions. Schuknecht works with Uptime Institute's elite data center end user network, and he said 73% of data center downtime is caused by human error. Human error includes poor training, poor maintenance practices and poor operational governance. He said an outage can be very stressful and damaging to morale, because jobs and compensation are often based on an organization's availability goals.
Schuknecht also said that if an organization has a good investigation protocol in place, they can determine the root cause of the outage and identify steps to take in the short and long term. But that only works if you have an effective protocol in place.
There are some overlooked repercussions to an outage. For example, there is a regulatory penalty in financial industries. An outage can also erode a company's competitive edge through a loss of business reputation within the industry, the customer base or both. Where would you rather put your money? In the bank with no downtime or the one with repeated downtime? Most financial companies have processes in place to preserve or recover data; it's the loss of transactional continuity that can cause the biggest problems.
What can data center staff do to avoid and mitigate system downtime? Schuknecht recommends establishing a good facilities and computing maintenance program for each piece of equipment, creating a staff training program that describes how and when to respond to downtime events, providing adequate funding levels for operating expenses to make sure everything works properly and instituting a good governance program where site infrastructure is operated in accordance with manufacturer expectations.
Chuck Goolsbee, data center manager and SearchDataCenter.com blogger
The two factors I most often see are unrecoverable partial failures and pilot errors. In the case of an unrecoverable partial failure, the usual culprit is a combination of network protocols and network hardware problems that don't cause a complete failure. Network hardware and protocols usually work as expected in the case of complete failure, such as a line card dying, loss of power to one half of a redundant pair, etc. But where things go really wrong is when components continue to partially work while in the process of failing. While this most often happens with network hardware, I've also seen similar partial failures in electrical switchgear and uninterruptible power supply equipment cause downtime, such as the loss of a single phase in a three-phase distribution system.
By comparison, pilot error can almost always be traced back to either the lack of a comprehensive checklist for a particular procedure, or somebody deviating from one. Get your methods of procedure in order, folks, and stick to them!
There are tangible and intangible costs of downtime. It can be expensive, but beyond the financial implications, there is also a loss of credibility and trust. When you are down, both of these erode fast and take much longer to rebuild.
The best way to mitigate system downtime is communication. Have a communication strategy and use it. Train your customers to expect it on a particular channel. Make sure there is an out-of-band backup. If you communicate well, your credibility and trust [are more likely to remain intact].
Bill Kleyman, director of technology, World Wide Fittings Inc.
Outages and downtime are the kind of events that an IT administrator (even in a larger environment) doesn't often think about, but they become emergencies when they actually happen. The first step in mitigating any downtime is planning. If an outage occurs and there was no plan for it, one can expect prolonged negative results. Being well trained and prepared for an emergency will create a more stable environment when the need for a disaster recovery (DR) solution arises. Planning, testing and actual execution of a DR plan will help any environment be ready for an emergency. There's no secret sauce to surviving outages. The more redundant and prepared an environment is, the better it can handle an emergency outage.
A stable environment creates a stable workflow for both staff and data. The last thing an IT engineer wants is to be bombarded with 100 emails or phone calls from employees saying, "The network is down." This will create unnecessary stress and could very well lead to more mistakes being made during the process of recovery. It's almost impossible to plan for everything, but being as ready as possible will help reduce the number of things that can go wrong. If an outage occurs, stay calm and resolve the situation as quickly as possible. If you have the opportunity, document everything. Note the point of failure, what broke, what needs to be fixed, how it should be fixed and the final result. Then, take this documentation and incorporate it into your existing DR plan. An emergency situation may prevent proper documentation, but take the time to learn from what happened. In the IT world, anything can happen (sometimes multiple times).
As a worldwide manufacturing firm, a single day of network outages would cost us roughly $250,000-$350,000, depending on which system went down. Being unprepared, or having systems not capable of redundant operation, can cost your company dearly in the long run.
What does that mean? During an initial purchase, IT managers have the option of going cheaper and buying equipment without redundant fans, power supplies, CPUs and so on. This first step can be a mistake that comes back to bite the entire environment. Let's say, for example, that a power surge knocks out a server with a single power supply, damaging internal components. Now the entire environment is down, and the machine needs to be replaced. On the other hand, consider an IT manager who spent some extra dollars on a better power supply and a power distribution unit to help protect the machines. In this case, a simple power supply swap will fix the problem with little or no downtime. Intangible factors also come into play when downtime or outages happen. No one wants additional gray hairs because an environment is down and the only feasible resolution is days away. This sort of stress can be relieved with a bit of planning. You also don't want the executive board to lose confidence in your IT department.
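To put the daily outage figure quoted above into perspective for shorter incidents, the range can be prorated by the hour. This is only a rough sketch: it assumes the cost accrues evenly across a 24-hour day, which the figure above does not claim, and the function name outage_cost_range is hypothetical.

```python
# Per-day outage cost range quoted in the text, in US dollars.
DAILY_COST_RANGE_USD = (250_000, 350_000)


def outage_cost_range(hours_down, daily_range=DAILY_COST_RANGE_USD):
    """Prorate a per-day cost range to an outage lasting hours_down hours.

    Assumption: cost accrues linearly over a 24-hour day. Real losses are
    rarely this uniform, so treat the result as a rough estimate only.
    """
    if hours_down < 0:
        raise ValueError("hours_down must not be negative")
    fraction_of_day = hours_down / 24
    low, high = daily_range
    return (low * fraction_of_day, high * fraction_of_day)


low, high = outage_cost_range(4)
print(f"A 4-hour outage: roughly ${low:,.0f} to ${high:,.0f}")
```

Even a figure this rough can help when weighing the up-front price of redundant components against the cost of a single incident.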
If an environment needs to be up 99% of the time, then plan for it. It's really as simple as that. The more planning, the better an infrastructure can handle outages. Be ready for an outage, down to the simplest element. That means a data center should have backup power generators, dormant virtual machines ready to go, or a hot/warm site ready to kick in when the need arises. Have multiple points of data restoration (cloud, local, storage area network and remote) and test these solutions regularly. Every environment should have some sort of DR solution. The more redundant the plan, the better an environment can handle an emergency. Ask the simple questions. Do I have redundant Internet service providers? Are they on different circuits? Do I have a backup power plan? Are my batteries all in order? Can my virtual environment handle a physical host failure? Because every environment is unique, plans for an outage will be relative to the needs of the infrastructure. Staff should be trained to know both their main and backup systems very well. The more prepared even the most junior engineer is, the better an entire network environment can handle downtime or an outage.
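Planning for an uptime target is easier once the percentage is translated into a downtime budget. The following sketch shows the arithmetic; the function name allowed_downtime_hours and the comparison targets of 99.9% and 99.99% are illustrative additions, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime an availability target permits per period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

A 99% target works out to 87.6 hours, or roughly 3.65 days, of downtime per year, which is a useful number to hold against the recovery times of the backup generators, standby virtual machines and restore points in the plan.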