Server uptime and hardware failure guide
If server uptime is IT's holy grail, technology and people both must play a big role in the quest.
Constant, working data center availability is a core requirement of any organization. However, the "dial tone" availability that IT has pursued over the years has never quite materialized.
Maybe we are getting closer to achieving this vision, thanks to newer technical architectures such as virtualization and cloud computing. But new technologies only go so far. If organizations really want to improve their data center availability, they need to focus on three core principles: automation, modularity and redundancy.
Two, four, six, eight: What can we automate?
If the goal is uptime, the first area that needs to be addressed is not the silicon-based equipment that makes up the enterprise data center, but the carbon-based life forms who nominally maintain and update it.
Unfortunately, people are the main cause of data center downtime. Poor scripting, applying patches incorrectly, unplugging the wrong piece of equipment -- if you need something done completely wrong, bring in a human.
Fortunately, much of what is needed to keep systems running and available these days can be done in a lights-out environment. It's now possible to automate patches, updates and any number of other software tasks, such as provisioning and deprovisioning applications.
Many problems are caused by attempts to apply a patch or upgrade to an ineligible system, such as when there is insufficient storage on a server or when a specific device driver is required but not available on the machine. Good tools should automatically identify such issues before attempting any action. They should either fix them automatically or raise an exception to an admin and skip the action until a human has dealt with the problem.
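As a rough illustration of that kind of pre-flight check, here is a minimal Python sketch. The `Server` and `Patch` records, their field names and the sample inventory are all hypothetical; a real tool would pull these facts from its own inventory and might attempt an automatic fix before escalating.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    free_disk_mb: int
    drivers: set = field(default_factory=set)


@dataclass
class Patch:
    # Hypothetical description of what a patch needs on the target machine.
    name: str
    required_disk_mb: int
    required_drivers: set = field(default_factory=set)


def preflight(server, patch):
    """Return the reasons the patch cannot be applied (empty list if eligible)."""
    problems = []
    if server.free_disk_mb < patch.required_disk_mb:
        problems.append(
            f"needs {patch.required_disk_mb} MB free, has {server.free_disk_mb} MB"
        )
    missing = patch.required_drivers - server.drivers
    if missing:
        problems.append("missing drivers: " + ", ".join(sorted(missing)))
    return problems


def plan_rollout(servers, patch):
    """Split servers into those to patch now and those escalated to an admin."""
    to_patch, escalated = [], {}
    for server in servers:
        problems = preflight(server, patch)
        if problems:
            escalated[server.name] = problems  # skip until a human resolves it
        else:
            to_patch.append(server.name)
    return to_patch, escalated


servers = [
    Server("web-01", free_disk_mb=4096, drivers={"nic-v2"}),
    Server("web-02", free_disk_mb=200, drivers={"nic-v2"}),
    Server("db-01", free_disk_mb=8192),
]
patch = Patch("kernel-update", required_disk_mb=1024, required_drivers={"nic-v2"})
to_patch, escalated = plan_rollout(servers, patch)
print(to_patch)
print(escalated)
```

The point of the design is that nothing is touched until every check has run: ineligible machines are reported with a reason and left alone, rather than being discovered halfway through a failed install.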
Automation tools should also be able to monitor and report on the status of not only individual applications, but all the apps that support enterprise processes. It is a waste of time to start a process if the last part of the process cannot be completed because a downstream application or piece of hardware has failed. Better to identify any problems early, then look at remediating them in real time.
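The idea of checking the whole chain before starting can be sketched in a few lines of Python. The step names and the `status` lookup below are made up for illustration; in practice the health information would come from whatever monitoring system is in place.

```python
def first_unhealthy(steps, is_healthy):
    """Return the first step in the chain that is down, or None if all are up."""
    for step in steps:
        if not is_healthy(step):
            return step
    return None


def start_process(steps, is_healthy):
    """Only start the process when every step in the chain is available."""
    blocked_by = first_unhealthy(steps, is_healthy)
    if blocked_by is not None:
        return f"not started: {blocked_by} is unavailable"
    return "started"


# Hypothetical order-handling chain with a stubbed health lookup.
order_steps = ["order-entry", "credit-check", "warehouse", "invoicing"]
status = {
    "order-entry": True,
    "credit-check": True,
    "warehouse": True,
    "invoicing": False,
}

result = start_process(order_steps, lambda step: status.get(step, False))
print(result)
```

Here the first three steps are healthy, but the process is still held back because the final step could not complete.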
This may involve moving a virtual machine (VM) from one physical environment to another, along with all of its dependencies around storage and networks. Again, this can be done rapidly and effectively through automation. By catching problems at an early stage, the move can be made in real time and systems switched over without any noticeable change for workers. This proactive approach has far more going for it than a standard reactive response. Waiting for users to phone the help desk and then sending people into a data center to address a problem is no good to a modern organization.
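To make the "along with all of its dependencies" part concrete, the following Python sketch picks a destination host for a VM. The `Host` and `VM` records and the sample names are assumptions for illustration only, and real placement logic in a virtualization platform weighs many more factors than memory, storage and network reachability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    # Hypothetical view of a physical host.
    name: str
    free_memory_gb: int
    datastores: frozenset
    networks: frozenset


@dataclass(frozen=True)
class VM:
    # Hypothetical view of a VM and the storage and networks it depends on.
    name: str
    memory_gb: int
    datastores: frozenset
    networks: frozenset


def pick_target(vm, hosts, exclude):
    """Pick the host with the most free memory that satisfies every dependency."""
    candidates = [
        h for h in hosts
        if h.name != exclude                      # not the host we are leaving
        and h.free_memory_gb >= vm.memory_gb      # enough capacity
        and vm.datastores <= h.datastores         # can reach the VM's storage
        and vm.networks <= h.networks             # can reach the VM's networks
    ]
    if not candidates:
        return None  # nothing suitable: escalate rather than guess
    return max(candidates, key=lambda h: h.free_memory_gb).name


hosts = [
    Host("host-a", 64, frozenset({"san-1"}), frozenset({"vlan-10"})),
    Host("host-b", 32, frozenset({"san-1", "san-2"}), frozenset({"vlan-10", "vlan-20"})),
    Host("host-c", 128, frozenset({"san-2"}), frozenset({"vlan-10", "vlan-20"})),
]
vm = VM("erp-app", 16, frozenset({"san-1"}), frozenset({"vlan-10", "vlan-20"}))

target = pick_target(vm, hosts, exclude="host-a")
print(target)
```

In this example the host with the most free memory is rejected because it cannot reach the VM's storage, which is exactly the kind of mistake a rushed manual move tends to make.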
Again, avoid human intervention as much as possible. Machines rarely do anything wrong -- they carry out the same activity time after time without deviating from the rules provided to them. If the rule is programmed correctly the first time, servers will continue to do it correctly from there on, time after time after time. A staffer may have done the same task correctly 99 times and then have an off day or just an off moment on the 100th occasion. Use automation, and get people to focus on getting that first-time rule coded correctly.
Modular, not monolithic
In a virtualized, cloud-based environment, it is actually quite unlikely that the failure of an individual piece of hardware will cause a data center to have appreciably lower overall availability. Older applications are generally the problem. Having large, monolithic applications causes difficulties even within the world of superfast virtualized environments. Provisioning and spinning up a new VM containing a full stack, from operating system through to a full instance of SAP ERP or Oracle E-Business Suite, will take time because of scale and complexity.
Moving towards a composite application approach can really help here. The first job is to take the business process, break it down into a set of tasks and then see what technical capabilities are required to facilitate each of these tasks. By finding the right technical functions as small pieces of capability and pulling them together on an as-needed basis, you can get a greater level of flexibility. Processes can be changed, and only the tasks that are affected require new technical components. In addition, you gain much higher availability and overall system uptime.
Consider a process that consists of five tasks. Each of these tasks is facilitated by a different technical function. One of the functions fails -- for whatever reason. A replacement for that single function can be spun up far more quickly than if the same function had failed as part of a monolithic application, where the whole stack would have to be reprovisioned.
Indeed, since the other four functions are still capable of running, activities can be carried on while the failed component is fixed. Assuming that the organization is storing and forwarding transactions correctly, individuals can still carry out their own parts of the overall process, even during an extended outage.
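Store-and-forward is simple enough to sketch in Python. The `StoreAndForward` class, the `invoicing` handler and the order IDs are hypothetical, and the sketch keeps its queue in memory; a production system would persist queued transactions so they survive a restart.

```python
from collections import deque


class StoreAndForward:
    """Queue work for a task whose backing function may be temporarily down."""

    def __init__(self, handler):
        self.handler = handler   # callable that raises ConnectionError when down
        self.pending = deque()   # transactions stored until they can be forwarded

    def submit(self, transaction):
        """Store the transaction, then try to forward everything queued."""
        self.pending.append(transaction)
        return self.flush()

    def flush(self):
        """Forward queued transactions in order; stop at the first failure."""
        delivered = 0
        while self.pending:
            try:
                self.handler(self.pending[0])
            except ConnectionError:
                break            # leave it queued and retry on a later flush
            self.pending.popleft()
            delivered += 1
        return delivered


# Simulate one failed function out of the five in the process.
state = {"up": False}
processed = []


def invoicing(transaction):
    if not state["up"]:
        raise ConnectionError("invoicing function is down")
    processed.append(transaction)


queue = StoreAndForward(invoicing)
queue.submit("order-1001")
queue.submit("order-1002")
held = len(queue.pending)    # work accepted while the function was down

state["up"] = True           # the failed component has been fixed
delivered = queue.flush()
print(held, delivered, processed)
```

Staff upstream of the failed function keep submitting work throughout the outage; once the component is back, the backlog is forwarded in the order it arrived.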
Double down on redundancy
Although I've said that hardware is not the real issue, don't take that as an excuse not to protect the data center against equipment failure. Engineering for data center availability requires a degree of equipment redundancy. This goes not just for servers and storage, but for the network and the facility as well. Virtualized networks allow for dynamic reallocation of network connections should a network interface card fail or a specific route become congested. Modular chillers, uninterruptible power supplies and auxiliary generators allow facilities to survive equipment failures.
For basic server or other hardware uptime, go for one more piece of equipment than is required (N+1). For higher levels of uptime, go for more items of redundant equipment (N+M). For the highest levels of platform uptime, consider long-distance mirroring.
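A back-of-the-envelope way to compare N, N+1 and N+M is to ask how likely it is that at least N units are working at any moment. The Python sketch below does that with a binomial sum. The 99% per-unit availability and the group of four are illustrative figures only, and the calculation assumes units fail independently, which equipment sharing power, cooling or firmware often does not.

```python
from math import comb


def group_availability(needed, spares, unit_availability):
    """Probability that at least `needed` of `needed + spares` units are up.

    Assumes every unit has the same availability and fails independently.
    """
    total = needed + spares
    p = unit_availability
    return sum(
        comb(total, up) * p**up * (1 - p) ** (total - up)
        for up in range(needed, total + 1)
    )


# Illustrative figures: four units needed, each available 99% of the time.
for spares in (0, 1, 2):
    availability = group_availability(needed=4, spares=spares, unit_availability=0.99)
    print(f"N+{spares}: {availability:.6f}")
```

Each extra spare raises the result, but by a smaller amount than the one before it, which is why going beyond N+1 only makes sense where the business case demands it.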
Businesses that cannot tolerate any data center downtime whatsoever need complete mirroring in real time of live VMs, storage and virtual network dependencies across a suitable distance. Redundancy must be built into how the two facilities are networked together, via multiple wide-area network connections operated by different carriers. The costs are high enough to be prohibitive for many organizations, so make sure that this is really necessary.
In many cases, the business will actually be well served by live synchronized data backed up by on-demand resources for spinning up VMs. Application images can be spun up rapidly, matching against the data within minutes in many circumstances. There will be a hit on availability while the images do spin up, but the lower cost of not having to maintain two hot facilities can make this good enough for the majority of an organization's needs.
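The trade-off can be put into rough numbers. The Python sketch below estimates yearly availability from how often a failover happens and how long each one interrupts service. The failover counts and durations are invented for illustration, not measurements, and the model ignores everything except failover time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_availability(failovers_per_year, minutes_per_failover):
    """Fraction of the year the service is up, counting only failover time."""
    downtime = failovers_per_year * minutes_per_failover
    return 1 - downtime / MINUTES_PER_YEAR


# Hypothetical comparison: a hot mirror that switches over in well under a
# minute versus on-demand images that take several minutes to spin up.
hot_mirror = expected_availability(failovers_per_year=2, minutes_per_failover=0.5)
on_demand = expected_availability(failovers_per_year=2, minutes_per_failover=10)

print(f"hot mirror: {hot_mirror:.5%}")
print(f"on demand:  {on_demand:.5%}")
```

Setting the gap between the two figures against the cost of running a second hot facility is a quick way to test whether "good enough" really is good enough for a given workload.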
The key is to automate wherever possible. Keep humans away from IT systems wherever possible, and use suitable tools to provide repeatable approaches to common tasks. Architect for failure; use redundancy for failover, but make sure that you understand what the business means by "highly available." In many cases, you will find that it really means "minimize downtime and maintain data integrity." This data center approach is different -- and can save an organization millions of dollars.
About the author:
Clive Longbottom is the co-founder and service director of IT research and analysis firm Quocirca, based in the U.K. Longbottom has more than 15 years of experience in the field. With a background in chemical engineering, he's worked on automation, control of hazardous substances, document management and knowledge management projects.