Downtime has serious consequences for modern businesses. It is difficult, if not impossible, to recoup lost revenue and rebuild a corporate reputation that is hurt by an outage. While IT professionals can't expect to avoid every downtime event, the majority of system downtime is caused by preventable failures. That's why we're asking our Advisory Board about the consequences of data center downtime and the steps that IT staff can take to reduce the risk of an outage.
You can also read more system downtime insights in the first part of this Advisory Board Q&A. Readers from India can find local coverage in this tip on risk assessment methodology for disaster recovery.
Robert Rosen, CIO, mainframe user group leader
You have to distinguish between planned and unplanned system downtime. Both are stressful: planned downtime has to be finished on time, but unplanned downtime is the worst. That said, unplanned downtime can be a valuable troubleshooting learning experience for staff, as I see fewer and fewer technicians with troubleshooting skills. I blame the rip-and-replace mentality in auto repair and the "just reboot" mentality from PCs.
The biggest cost is the effect on customers and what the outage makes them think of you. Consider doing planned downtime during the day versus at night. Which do you think the customer will prefer?
Planning, having people with good troubleshooting skills, and documenting how you found the problem and fixed it means the issue will be resolved faster the next time.
Robert Crawford, lead systems programmer and mainframe columnist
Unfortunately, the biggest cause of system downtime is human error. Many times it’s a procedure someone didn’t follow or think all the way through. Another cause may be a system quirk or obscure design flaw that someone didn’t account for. Sometimes, it’s as simple as a typo.
Most technicians genuinely want to do a good job and are proud of their work, which is why failures are stressful. The most immediate stress caused by downtime involves the "battlefield conditions" (often during non-work hours), when everyone tries to figure out what's going on. This stress, in turn, affects morale, especially during a run of bad luck. Things do gradually get better as the system stabilizes.
The consequences of outages vary by industry, although there are some commonalities. An outage can cause any company to lose sales through missed opportunities and bad customer experiences. Beyond that, manufacturing companies may have to shut down production, and financial companies may face fines and lawsuits.
The two most important strategies for avoiding downtime are planning and automation. Planning, of course, works out the best way to make changes and avoid conflicts. Long-range planning also comes into play as systems and application programmers design redundant and resilient systems. Automated changes greatly reduce the chance for human error.
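The point about automation deserves a concrete illustration. The sketch below is a minimal, hypothetical example (the function and config names are illustrative, not from any specific tool) of what "automated changes reduce human error" looks like in practice: a scripted change that snapshots the prior state, verifies the result, and rolls back automatically if verification fails, so a typo or skipped step can't silently take a system down.

```python
# Minimal sketch of an automated change with built-in verification and
# rollback. All names are illustrative, not from any specific product.

def apply_change(config: dict, key: str, new_value: str) -> dict:
    """Apply one change, verify it, and roll back on failure."""
    before = dict(config)                # snapshot for rollback
    config[key] = new_value
    if config.get(key) != new_value:     # post-change verification
        config.clear()
        config.update(before)            # restore the snapshot
        raise RuntimeError(f"verification failed for {key}")
    return config

cfg = {"max_connections": "100"}
apply_change(cfg, "max_connections", "250")
print(cfg["max_connections"])  # prints 250
```

Real change-automation tools add far more (dry runs, approvals, audit logs), but the snapshot/verify/rollback loop is the core discipline that a manual edit typically skips.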
Michael Coté, analyst, RedMonk
When I look at the big name failures out there, like Amazon's cloud going down, I see an odd pattern of systems bringing themselves down in the course of trying to automatically fix themselves. In these scenarios, something goes flaky, a cloud's ability to "heal itself" goes into overdrive, and it ends up bringing down the system. The other thing I see is just the law of large numbers. The more nodes and moving parts in your network, the more frequently problems will occur. Thanks to virtualization, cloud and rogue IT, there are more elements than ever for IT to manage. Even if the percentage of failure stays the same, that's still more [devices and systems] that'll fail.
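The arithmetic behind that point is worth making explicit. If each node independently fails with probability p over some window, the chance that at least one node fails is 1 - (1 - p)^n, which climbs quickly as n grows. A quick back-of-the-envelope calculation (the 0.1% per-node figure is purely illustrative):

```python
# Chance of at least one node failure in a window, assuming each of n
# nodes fails independently with probability p (p here is illustrative).
p = 0.001  # 0.1% per-node failure probability
for n in (10, 100, 1000):
    at_least_one = 1 - (1 - p) ** n
    print(f"{n:>5} nodes: {at_least_one:.1%} chance of some failure")
```

At 10 nodes the risk is about 1%; at 1,000 nodes it is over 60%, even though nothing about any individual node got less reliable.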
To some extent, the problem is due to a lack of planning, but that's not entirely fair. There's only so much planning you can do before the process becomes too slow or expensive. While organizations like NASA (which also experience failures from time to time) can spend a lot of time and money making sure "it'll work," the rest of the world is not so lucky. Rock-solid IT is a luxury, and most businesses aren't willing to budget properly for it.
To some extent, downtime is the most exciting time for a certain type of IT staffer: the hero. The system is down for some mysterious reason, no one knows why, and only one person can save the company! For whatever reason, both programmers and administrators tend to be rewarded more for troubleshooting skills than for building stable systems. After all, if everything worked and never failed, there would be no need for much of what an IT staff does, so the fact that IT breaks frequently keeps a lot of people's bills paid.
Of course, if staff is punished (for example, verbal dressing-downs, lower compensation or getting fired) for IT services that are continually down and failing, they won't be so proud of their ability to fix things. The other side, though, is that the only way to learn from your mistakes is to make them, and there's a lot to be learned from downtime. One danger to avoid is winding up with policies that address past failures rather than anticipating future ones. Look at the policies of industries beyond IT and you can see a documented history of things that went wrong, like airport security.
The first cost of downtime is customer satisfaction, whether those customers are internal (the business) or external (paying customers). The IT department is always challenged to prove itself internally, and downtime just confirms what the business really thinks: IT is a black box that burns money. In recent years, as Google, Amazon, Facebook and other services have become part of the culture, we've also seen that IT can be a source of great satisfaction.
Each time internal IT fails, the business wonders what's wrong with the IT department. They must be thinking, "How hard can it be?" Of course, they conveniently forget all those annoying requirements and customizations they asked IT to throw into each system.
External customers are increasingly vicious when downtime occurs. They have so many options that churn is already a problem, and giving them one more excuse to leave is a risk that's difficult to quantify on the balance sheet. Think of the bills you pay each month. You probably have near-zero loyalty to those brands and are staying with them only for lack of a better option. Once their IT is down and you can't quickly and easily get what you want, you get angry and want to move. Most businesses do little to engender loyalty in their customers, and without it, there's little to keep customers from moving on when downtime strikes.
Testing is part of mitigating downtime. I like Netflix's Chaos Monkey idea: run your system through crazy scenarios in which parts of it are killed off. That's a little extreme, but the general principle of testing and planning for failure is sound. We're a long way from being able to re-architect existing systems or build new ones around it, but running through failure scenarios is an appealing idea.
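The Chaos Monkey idea can be sketched in a few lines. The toy model below (all class and function names are hypothetical, not Netflix's actual tooling) kills a random replica, then checks that the service as a whole still answers because a load balancer fails over to a surviving node. That is the whole premise: inject the failure on purpose, while you're watching, instead of waiting for it to happen at 3 a.m.

```python
# Toy illustration of chaos testing: randomly disable one replica and
# verify the service still responds. Names are hypothetical examples.
import random

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def service(replicas, request: str) -> str:
    """Try each replica until one answers, as a load balancer might."""
    for r in replicas:
        try:
            return r.handle(request)
        except ConnectionError:
            continue  # fail over to the next replica
    raise RuntimeError("total outage: no replica available")

replicas = [Replica(f"node-{i}") for i in range(3)]
random.choice(replicas).alive = False     # the "monkey" strikes
print(service(replicas, "GET /health"))   # still served by a survivor
```

In a real deployment the "monkey" terminates actual instances in production during business hours, which is exactly what forces teams to build the redundancy this sketch takes for granted.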
Bill Bradford, senior systems administrator, SUNHELP.org
In the past few years, the leading cause of system downtime that I've been involved with has been hardware failure. Whether it's server hardware or infrastructure (data center power and/or cooling), [hardware failure] has taken down more systems and caused more outages than user error or software configuration problems.
Morale is challenged by downtime. The best thing to do for morale during downtime is to support the staff while they work, or work with vendors, to fix what's wrong and get systems and services back up again. Pointing fingers and laying blame can wait until after the crisis is resolved. When staff is stressed and trying to resolve outages, management should never breathe down their necks saying, "This is all your fault. Why isn't it fixed yet?" That will drive people to the breaking point. Have a meeting to talk about causes, solutions and resolutions after services and machines are back up.
Diagnosing and resolving the cause of an outage can often be more stressful than day-to-day operations. To management, it might just look like the IT guys had to stay late to fix something, but have that happen enough times in a short period, and you have the makings of emotional and physical exhaustion for your IT staff. That can lead to more problems down the road.
Management should recognize and publicly acknowledge the extra effort and overtime that IT staff puts in. Give them time off (for example, coming in late the next day) to make up for extra hours worked and, in general, treat them like human beings, not automatons.
Planning will help mitigate downtime. A proper change management procedure, done right, can make the difference between a planned outage and an accidental one.
As for tactics and skill sets, the best is the ability to think on your feet. Think outside the box and come up with solutions that resolve outages as quickly as possible. Sometimes a fix involves the equivalent of "duct tape and baling wire," and that's fine, as long as you come back later with a scheduled maintenance window to do it right as a long-term solution.
The best of a bad situation
It’s clear that there is no absolute means for preventing downtime in a data center. There are simply too many people, too much gear and a growing reliance on factors that the organization cannot control. So while it’s important to work on ways to prevent system downtime, it’s at least equally important to consider the steps and actions that will address downtime when (not if) it occurs. There is a lot to be learned in the aftermath of downtime, and wise organizations will find ways to grow and improve because of it.