With so many business processes now dependent on technology, IT system downtime can bring an enterprise to its knees. Although guaranteed 100% uptime is still not possible, old targets of 99% are now too low. Understand all the dependencies in any IT process and plan for redundancy to maintain as much uptime as possible.
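To see why a 99% target is too low, it helps to convert availability percentages into hours of downtime per year. The following is a minimal sketch of that arithmetic; the function name and the list of targets are illustrative assumptions, and the calculation ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; pick the ones that matter to your business.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

At 99%, the budget works out to roughly 87.6 hours, or more than three and a half days, of downtime a year.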
Address individual component failures through standard approaches to equipment redundancy. However, if you apply redundancy only on a per-subsystem basis, you risk missing a dependency that sits between subsystems. For example, a hosted email provider might build a highly redundant server farm and a multi-redundant, SAN-based storage array to provide high availability, yet connect the two through a single Fibre Channel controller. If that controller fails, the email servers cannot save any data and will lose all emails until the provider replaces the controller.
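The email example can be modeled as a chain of tiers, each with some number of redundant units. The sketch below is an assumption-laden illustration, not a sizing tool: the 99% per-unit availability, the unit counts and all function names are hypothetical, and it assumes units fail independently of one another.

```python
from math import prod
from typing import Dict, List, Tuple

# Each tier maps to (availability of one unit, number of redundant units).
Tiers = Dict[str, Tuple[float, int]]


def tier_availability(unit_availability: float, redundant_units: int) -> float:
    """A redundant tier is down only when every one of its units is down."""
    return 1 - (1 - unit_availability) ** redundant_units


def chain_availability(tiers: Tiers) -> float:
    """Tiers in series: the service is up only when every tier is up."""
    return prod(tier_availability(a, n) for a, n in tiers.values())


def single_points_of_failure(tiers: Tiers) -> List[str]:
    """Return the tiers that have no redundancy at all."""
    return [name for name, (_, units) in tiers.items() if units == 1]


# Hypothetical figures for the hosted email example.
email_service: Tiers = {
    "server farm": (0.99, 4),
    "fibre channel controller": (0.99, 1),
    "SAN storage array": (0.99, 3),
}

print("Single points of failure:", single_points_of_failure(email_service))
print(f"End-to-end availability: {chain_availability(email_service):.6f}")
```

Under these assumptions, the end-to-end figure cannot exceed the availability of the lone controller, however much redundancy the server farm and storage array have.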
Automate to avoid human error
Software is more resilient than it was just a few years ago. It is now rare for commercial off-the-shelf or open-source software to fail catastrophically. Even when there are problems, you can use virtual machines or containerization, as well as provisioning and orchestration, to get systems back up and running quickly.
The biggest causes of system downtime are no longer equipment or application failures. Instead, the biggest cause is at the human level -- systems administrators making mistakes. With the complexity of modern IT platforms that stretch across multiple virtualized private and public platforms, the possibility for human error increases. And, with many administrators still using command-line interfaces (CLIs), there is little filter between what the administrator types and what then happens to IT systems.
Even when an administrator correctly enters information into the CLI, the effect across a highly shared, complex environment can be massive. While the administrator's instruction may fix an immediate problem, it may cause issues with other workloads by taking away resources or creating conflicts with data access.
As a result, IT teams must automate, audit and orchestrate the way administrators work.
Rather than use the CLI directly, create a library of scripts that have been proven to run, and then mandate the use of these scripts. A proven script runs the same steps every time, while a human typing commands by hand has a higher chance of error.
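One way to mandate a script library is to route every operation through a small runner that refuses anything not registered and records who ran what. The following is a minimal sketch under stated assumptions: the registry, the `restart-mail-queue` operation and the operator names are all hypothetical, and the placeholder function stands in for a real, tested script.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

APPROVED: Dict[str, Callable[[], str]] = {}  # the vetted script library
AUDIT_LOG: List[str] = []                    # who ran (or tried to run) what


def approved(name: str):
    """Decorator that registers a function as a vetted operation."""
    def register(func: Callable[[], str]) -> Callable[[], str]:
        APPROVED[name] = func
        return func
    return register


@approved("restart-mail-queue")
def restart_mail_queue() -> str:
    # Placeholder body; a real entry would call the proven script here.
    return "mail queue restarted"


def run_approved(name: str, operator: str) -> str:
    """Run a vetted operation by name, logging both successes and refusals."""
    timestamp = datetime.now(timezone.utc).isoformat()
    if name not in APPROVED:
        AUDIT_LOG.append(f"{timestamp} {operator} DENIED {name}")
        raise PermissionError(f"'{name}' is not in the approved script library")
    AUDIT_LOG.append(f"{timestamp} {operator} RAN {name}")
    return APPROVED[name]()


print(run_approved("restart-mail-queue", operator="alice"))
```

An ad hoc command never reaches the system in this design, because it has no entry in the registry; the attempt is still written to the audit log.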
In addition to repeatable scripts, use orchestration systems to provision not only script outcomes, but also patches, updates and code rollouts. This also enables rapid rollback to a known position in case a rollout causes issues.
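The rollback idea can be sketched in a few lines: record the known position, apply the change, check health, and restore the record if the check fails. This is a simplified illustration rather than how any particular orchestration product works; the version dictionary, the component names and the health check are hypothetical.

```python
from typing import Callable, Dict

Versions = Dict[str, str]


def rollout(
    state: Versions,
    changes: Versions,
    health_check: Callable[[Versions], bool],
) -> bool:
    """Apply changes in place; restore the previous state if the check fails."""
    snapshot = dict(state)  # the known position to return to
    state.update(changes)
    if health_check(state):
        return True
    state.clear()
    state.update(snapshot)  # rapid rollback
    return False


# Hypothetical deployment: component name -> deployed version.
deployed = {"mail-server": "2.3", "spam-filter": "1.0"}


def no_beta_versions(state: Versions) -> bool:
    """Stand-in health check that rejects any beta release."""
    return not any(version.endswith("-beta") for version in state.values())


succeeded = rollout(deployed, {"spam-filter": "2.0-beta"}, no_beta_versions)
print("Rollout succeeded:", succeeded)
print("Deployed versions:", deployed)
```

Real orchestration tools snapshot far more than version strings, but the sequence of snapshot, apply, verify and restore is the same.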
Many DevOps orchestration systems, such as HashiCorp Terraform, Chef Automate and Canonical Juju, fall into this category.
For organizations with a hybrid cloud deployment, consider orchestration tools, such as Electric Cloud and Platform9. Incumbent software providers, such as IBM, Hewlett Packard Enterprise, BMC and Dell EMC, also all have tools that can help automate and manage the rollout of scripts, updates, patches and new code across hybrid clouds.
Raise security to dodge system downtime
Malicious activity is another cause of system downtime. Distributed denial-of-service (DDoS) attacks can bring down a service, and Trojans and worms delivered through spear phishing attacks can cause data corruption and system downtime.
Vendors such as Akamai, Exponential-e and Cloudflare can provide DDoS mitigation services; Norton, Kaspersky Lab and others can provide a degree of protection against spear phishing and other malware attacks.
Ransomware attacks are harder to deal with, since a user can trigger them with a single click on a seemingly innocuous link in an email. Once activated, the ransomware can spread encryption rapidly across the network, locking not just the user's own data, but everything the infected machine can connect to, including backups.
Malwarebytes and Sophos offer tools to defend against ransomware.