If application availability rested solely on servers' shoulders, virtualization and cloud computing would guarantee high availability. But beware failures in network and storage availability.
In a completely physical world, a single application runs on a single server. If anything goes wrong with that server, the application also falters. We circumvent the danger by clustering: bringing together a collection of servers to support the application so that one server cannot completely take it down.
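Clustering pays off because the application goes down only when every node in the cluster is down at the same time. The sketch below uses the textbook parallel-redundancy model. It assumes that node failures are independent and that a single surviving node can carry the application; shared power or shared switches break that independence in practice. The function name and the 99% figure are illustrative only.

```python
def cluster_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one node is up, assuming independent failures."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # The application is down only if every node is down simultaneously.
    all_down = (1.0 - node_availability) ** nodes
    return 1.0 - all_down

for n in (1, 2, 3):
    print(n, f"{cluster_availability(0.99, n):.6f}")
```

Under these assumptions, two nodes that are each up 99% of the time give roughly 99.99% availability for the application.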
Virtualization aggregates hundreds, even thousands, of servers, making the failure of a single server far less of an issue. Cloud computing applies the same concept to create dynamic, elastic resource pools that can be assigned to application workloads, so a single component failure has hardly any noticeable effect on application availability or performance.
Great, we now have high availability: end of story, surely? Unfortunately, no. High availability can be far more complex than it first appears. It is not just about servers, although their fragility explains why availability efforts have historically focused on them. Failures in network availability, storage availability or data mirroring, and even flaws in the application's own code, can bring a running application's performance to its knees.
The highly connected network
Virtualized fabric networks aggregate multiple physical network interface cards (NICs) into a pool of bandwidth that can be allocated dynamically as needed. If one physical network link goes down, traffic can be rerouted immediately over the remaining physical links.
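The pooling-and-rerouting idea can be sketched as a logical interface that spreads traffic across whichever physical links are currently up. `NicPool` and `Link` are hypothetical names for illustration. In a real deployment this logic lives in the network fabric or in OS features such as the Linux bonding driver, not in application code.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    name: str
    up: bool = True

@dataclass
class NicPool:
    """Hypothetical pool of physical links behind one logical interface."""
    links: list[Link]
    _next: int = field(default=0, init=False)

    def healthy(self) -> list[Link]:
        return [link for link in self.links if link.up]

    def pick(self) -> Link:
        # Round-robin over the links that are currently up, so traffic
        # spreads across the pool and automatically skips a failed link.
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("all physical links are down")
        link = candidates[self._next % len(candidates)]
        self._next += 1
        return link

pool = NicPool([Link("eth0"), Link("eth1")])
print([pool.pick().name for _ in range(4)])   # alternates eth0 / eth1
pool.links[0].up = False                       # simulate a cable pull on eth0
print([pool.pick().name for _ in range(3)])   # everything now goes via eth1
```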
Network systems must be architected to remove single points of failure. Each physical server should have multiple NICs. Fortunately, this is neither a significant technical issue nor a significant cost for most IT shops.
The issue with storage
Storage has long been the biggest issue for application availability. Storage can be virtualized, but at some stage the data is still stored physically, and a failure at that physical level will always cause problems.
Technological approaches such as RAID and other multi-store techniques protect against the failure of a single disk drive, but they still leave open the possibility of a disk controller or RAID controller failure. The easiest way to combat storage failure is to mirror data in real time to another physical store. This is pricey, however, because the mirror store needs to be as big as the primary store.
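Real-time (synchronous) mirroring means a write is acknowledged only after both stores hold it, and reads can fall back to the mirror when the primary store or its controller fails. The classes below are a hypothetical, in-memory sketch of that behavior. They also show why the mirror must be as large as the primary: every block is stored twice. A production system would additionally track a "degraded" state when one side stops accepting writes, which this sketch omits.

```python
class Store:
    """Stand-in for one physical store (array, controller and disks)."""
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}
        self.failed = False

    def write(self, block: int, data: bytes) -> None:
        if self.failed:
            raise OSError(f"{self.name} is unavailable")
        self.blocks[block] = data

    def read(self, block: int) -> bytes:
        if self.failed:
            raise OSError(f"{self.name} is unavailable")
        return self.blocks[block]

class MirroredVolume:
    """Hypothetical volume that mirrors every write to a second store."""
    def __init__(self, primary: Store, mirror: Store):
        self.primary, self.mirror = primary, mirror

    def write(self, block: int, data: bytes) -> None:
        # The caller gets its acknowledgement (a normal return) only after
        # both copies are written; any failure surfaces as an exception.
        self.primary.write(block, data)
        self.mirror.write(block, data)

    def read(self, block: int) -> bytes:
        try:
            return self.primary.read(block)
        except OSError:
            # Primary store or its controller has failed: serve from the mirror.
            return self.mirror.read(block)

a, b = Store("array-A"), Store("array-B")
vol = MirroredVolume(a, b)
vol.write(7, b"order 1042")
a.failed = True                 # simulate a RAID controller failure on array A
print(vol.read(7))              # still readable from array B
```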
Real-time mirroring also raises technical issues: ensuring that the data really is mirrored in real time, identifying which transactions were in progress when the storage failure occurred, and determining what the recovery point will be. Once the recovery point is identified, the application has to fail over to the mirror, remapping its storage to new virtual logical unit numbers (LUNs).
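One common way to find the recovery point is to scan a transaction log that was mirrored along with the data. Transactions with a commit record are kept, and transactions that began but never committed were in flight and are rolled back. The sketch below assumes a simple write-ahead-style log with `begin`, `write` and `commit` records. It illustrates the general idea that databases use, not the behavior of any particular storage product.

```python
from typing import NamedTuple

class LogRecord(NamedTuple):
    lsn: int        # log sequence number, strictly increasing
    txid: str
    kind: str       # "begin", "write" or "commit"

def recovery_point(log: list[LogRecord]) -> tuple[int, set[str], set[str]]:
    """Return (LSN of last commit, committed txids, in-flight txids)."""
    begun: set[str] = set()
    committed: set[str] = set()
    last_commit_lsn = 0
    for rec in log:
        if rec.kind == "begin":
            begun.add(rec.txid)
        elif rec.kind == "commit":
            committed.add(rec.txid)
            last_commit_lsn = rec.lsn
    # Anything begun but not committed was in flight at the failure
    # and must be rolled back on the mirror before failover.
    return last_commit_lsn, committed, begun - committed

log = [
    LogRecord(1, "T1", "begin"),
    LogRecord(2, "T1", "write"),
    LogRecord(3, "T2", "begin"),
    LogRecord(4, "T1", "commit"),
    LogRecord(5, "T2", "write"),   # storage fails before T2 commits
]
print(recovery_point(log))
```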
The technology behind business continuity at a storage level is improving rapidly, and with the right investments, high availability within a data center is now possible.
The problem with everyone else
So, what happens when the issue is not within the data center? In many cases, the data center itself is still running effectively, but connectivity to and from it fails. For example, a backhoe can accidentally rip through a leased line and cut off the data center's access. Multiple connections coming into the data center from different directions and from different vendors can provide the levels of availability required for a critical application, but at a cost. In most circumstances, that cost is a worthwhile investment.
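The advice to bring lines in from different directions and different vendors matters because redundancy only helps when failures are independent. The sketch below treats anything the lines share (one trench, one carrier) as a single component in series with the redundant pair. The 99.5% figures are hypothetical and chosen only to show the effect.

```python
def site_connectivity(link_availabilities: list[float], shared_path: float = 1.0) -> float:
    """Availability of reaching the data center over redundant links.

    link_availabilities: availability of each independently failing line.
    shared_path: availability of anything all lines have in common,
    e.g. one trench or one carrier; 1.0 means fully diverse routes.
    """
    all_down = 1.0
    for a in link_availabilities:
        all_down *= 1.0 - a          # every line must fail for an outage
    return shared_path * (1.0 - all_down)

diverse = site_connectivity([0.995, 0.995])
same_trench = site_connectivity([0.995, 0.995], shared_path=0.995)
print(f"{diverse:.6f} {same_trench:.6f}")
```

With fully diverse routes, the pair reaches about 99.9975%. If both lines run through a single trench that is itself 99.5% available, the combined figure falls slightly below that of either line alone, which is exactly the backhoe scenario.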
Then consider trickier failures, such as a fire in the data center or a wide-ranging natural disaster. No amount of high availability within a single facility will protect you here. To provide high availability in these circumstances, mirror across data centers. Data center mirroring can be prohibitively expensive, but when application downtime causes immediate and severe financial losses, it is necessary.
Across short distances, mirroring between data centers presents roughly the same technical difficulties as mirroring within a single facility. As the distance between facilities increases, latency can become a major issue, particularly in high-speed transactional environments. To fail over mirrored systems with minimal inconvenience to the business and its customers, you must identify the state each transaction was in when the failure happened.
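A back-of-the-envelope calculation shows why distance hurts. Light in optical fiber travels at roughly two-thirds of its speed in a vacuum, about 200 km per millisecond. A synchronous mirror therefore adds about 1 ms of round-trip delay per 100 km to every write it waits on. The sketch counts propagation delay only. It assumes straight-line fiber and ignores switching, routing and storage time, so real figures will be higher.

```python
# Assumption: light in fiber covers roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0

def sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Minimum extra latency that synchronous remote writes add (propagation only)."""
    one_way_ms = distance_km / FIBER_KM_PER_MS
    return 2 * one_way_ms * round_trips

for km in (10, 100, 1000):
    print(km, sync_write_penalty_ms(km))
```

A transaction that waits on several synchronous writes pays the penalty each time, which is why high-speed transactional systems feel distance first.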
No matter what you put in place for server, network and storage availability, it won't fix the problems of a poorly written application. Memory leaks and other code flaws must be found through proper testing and contained with run-time garbage collection, or high availability will always be a mirage.
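Garbage collection reclaims only objects that nothing references anymore. A common leak in managed runtimes is therefore a cache that only ever grows, and testing should catch it. The hypothetical `BoundedCache` below caps the number of entries and evicts the least recently used one, so old values become collectable. For caching function results, Python's standard `functools.lru_cache` provides the same protection.

```python
from collections import OrderedDict

class BoundedCache:
    """LRU cache with a hard size limit, so evicted entries can be collected."""
    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: object) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict least recently used entry

    def __len__(self) -> int:
        return len(self._data)

unbounded = {}
bounded = BoundedCache(max_entries=1000)
for request_id in range(100_000):
    unbounded[f"req-{request_id}"] = bytes(100)   # every entry stays referenced
    bounded.put(f"req-{request_id}", bytes(100))
print(len(unbounded), len(bounded))  # 100000 1000
```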
Finally, don't design applications so that they must be taken down for planned maintenance. One instance of an application can keep running while another instance is patched or upgraded. The updated instance can then be brought up, and users can be failed over from the placeholder instance to the new version in what is effectively real time. There may be a few seconds of delay, but this should not have a material effect.
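The placeholder-and-cutover pattern, often called a blue-green deployment, can be sketched as a front end that switches traffic to the patched instance only once it reports healthy. `Router`, `Instance` and the health flag are hypothetical stand-ins. In practice a load balancer or service mesh performs the switch, driven by real health checks.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    version: str
    healthy: bool = False

class Router:
    """Hypothetical front end that sends all traffic to one active instance."""
    def __init__(self, active: Instance):
        self.active = active

    def handle(self, request: str) -> str:
        return f"{self.active.name} ({self.active.version}) served {request}"

    def cut_over(self, candidate: Instance) -> bool:
        # Switch only if the patched instance passes its health check;
        # otherwise the placeholder keeps serving and nothing goes down.
        if not candidate.healthy:
            return False
        self.active = candidate
        return True

placeholder = Instance("app-a", "1.4", healthy=True)
router = Router(placeholder)
print(router.handle("GET /orders"))

upgraded = Instance("app-b", "1.5")   # patched while app-a keeps serving
print(router.cut_over(upgraded))      # False: not healthy yet
upgraded.healthy = True               # e.g. after warm-up and smoke tests
print(router.cut_over(upgraded))      # True
print(router.handle("GET /orders"))
```

Keeping the placeholder instance around after the cutover also gives a quick rollback path if the new version misbehaves.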
Which application availability approach is best?
Most organizations will invest in a hybrid mix of high-availability solutions. Some businesses can cope with losing their main applications for a few hours. For them, the standard approach of server and network virtualization, combined with standard RAID for storage, will be enough.
For applications where downtime will hurt the business, higher levels of protection, built on additional cloud platforms and mirrored data, are needed. If a lack of application availability would hit the company's financials or brand image hard, consider facility mirroring.
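The tiers above can be written down as a simple lookup that IT and business stakeholders might use as a starting point. The `Impact` labels and the assumption that each tier keeps everything from the tier below are illustrative choices, not a prescription from this article. The article describes a hybrid mix, and real plans will vary per application.

```python
from enum import Enum

class Impact(Enum):
    """Hypothetical labels for how much downtime costs the business."""
    TOLERABLE = 1   # main applications can be lost for a few hours
    HURTS = 2       # downtime will hurt the business
    SEVERE = 3      # downtime hits financials or brand image hard

BASELINE = ["server and network virtualization", "standard RAID for storage"]
ELEVATED = ["additional cloud platforms", "mirrored data"]
FACILITY = ["facility mirroring across data centers"]

def protection_for(impact: Impact) -> list[str]:
    """Assumption: each tier keeps the protections below it and adds its own."""
    plan = list(BASELINE)
    if impact.value >= Impact.HURTS.value:
        plan += ELEVATED
    if impact is Impact.SEVERE:
        plan += FACILITY
    return plan

for level in Impact:
    print(level.name, protection_for(level))
```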
IT and business executives must agree on priorities and what the business is willing to spend on protecting application uptime.
About the author:
Clive Longbottom is the co-founder and service director of IT research and analysis firm Quocirca, based in the U.K. Longbottom has more than 15 years of experience in the field. With a background in chemical engineering, he's worked on automation, control of hazardous substances, document management and knowledge management projects.