Architecting application failover in your data center

IT Operations.com

By Frank J. Ohlhorst, Contributor

Failover technologies have become common in the data center. Most larger enterprises have set up failover technologies to handle server failures, database failures and even complete data center failures. The idea is to create a "hot site" or "hot spares" that run concurrently with active systems and services. Ideally, when a primary service, system or site fails, end users are seamlessly switched over to the spares, which have been completely synchronized with the active systems.

When it works properly, there is no interruption in service, no loss of data and the end user may only experience a minor delay while things switch over. High-level failover technologies have been proven to work very well. However, traditional failover technology is very expensive to purchase, deploy and manage -- and it's overkill for the majority of businesses, which are only concerned with protecting a few line-of-business applications.

The challenge for a modern enterprise is to implement affordable application failover that protects a limited number of important apps. In some cases, the answer is to redesign the applications to accommodate failover. However, it may not be possible to re-engineer canned applications to use your own in-house failover technologies. In most cases, canned applications require vendor-engineered failover options, which are proprietary and monolithic in nature. Organizations that run only unmodified canned applications will usually consider a vendor-specific offering to meet their needs. But if the organization has customized the canned application, added custom code or integrated it with other applications, it may be necessary to re-code for application failover. Technologies have emerged that allow application developers to incorporate failover capabilities directly into an application. Virtualization also provides new alternatives for failover that may alleviate the need to re-architect applications in some circumstances. The direction you follow will depend upon your business needs and application development capabilities.

Coding for failover

There is no single solution that can be applied to any given application to make it resilient. Resiliency and failover should be planned from the outset to properly protect an individual application. That may sound like an impossible task, especially with applications that are already in service, but there are several options available from application development tool vendors. Most of these options are proprietary to each individual application development platform. That means application developers must look to their particular development ecosystems to find the appropriate tools that enable failover capabilities.

For example, those developing under Microsoft Visual Studio can use optional coding methodologies available from Microsoft to make applications "failover-aware." That awareness relies on companion technologies that run on Microsoft databases and Microsoft servers. Ideally, an application can be developed that maintains its data state and session state while the user is redirected to an alternate application server, database server and file server.
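
One common way to keep session state intact across a redirect is to hold it outside the application server. The following is a minimal Python sketch of that idea, not Microsoft's method: SessionStore, AppServer and add_to_cart are hypothetical names, and the in-memory dictionary stands in for whatever replicated database or cache a real deployment would use.

```python
class SessionStore:
    """Hypothetical shared store; assumed to be a replicated database or cache."""

    def __init__(self):
        self._sessions = {}

    def load(self, session_id):
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._sessions.get(session_id, {}))

    def save(self, session_id, data):
        self._sessions[session_id] = dict(data)


class AppServer:
    """Hypothetical application server that keeps no session state locally."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        # Load state from the shared store, never from server-local memory.
        session = self.store.load(session_id)
        cart = session.get("cart", []) + [item]
        session["cart"] = cart
        self.store.save(session_id, session)
        return cart


store = SessionStore()
primary = AppServer("primary", store)
standby = AppServer("standby", store)

primary.add_to_cart("session-1", "disk")
# The user is redirected to the standby; the cart survives the switch.
print(standby.add_to_cart("session-1", "tape"))
```

Because neither server owns the session, either one can pick up a user's next request, which is the property the redirect depends on.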

That same style of resiliency can be applied to applications that run on top of an Oracle database, especially those coded in Java or those relying on Ajax technologies. Here, the code is made aware of the failover capabilities incorporated into an Oracle database server. The Oracle server actually handles the failover event while notifying the application server of the change in data delivery venue. Developers also have the option of using aspect-oriented programming (AOP) to wrap calls in a retry-and-recover process without changing the calling code. Some may consider this a brute-force approach to enabling failover; however, it is a method that reduces the amount of coding needed and is proven to work well.
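
The retry-and-recover idea can be illustrated with a Python decorator, which plays a role similar to AOP "around" advice. This is a sketch under stated assumptions rather than any vendor's implementation: ConnectionLost, retry_on_failover, switch_to_standby and fetch_order are all hypothetical names, and the back end is simulated with a dictionary.

```python
import functools


class ConnectionLost(Exception):
    """Hypothetical error raised when the active database node goes away."""


def retry_on_failover(attempts=3, recover=None):
    """Wrap a function so a lost connection triggers recover() and a retry."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return func(*args, **kwargs)
                except ConnectionLost as exc:
                    last_error = exc
                    if recover is not None:
                        recover()  # e.g. reconnect to the standby node
            raise last_error
        return wrapper
    return decorator


# Simulated back end: the primary is down, the standby is up.
state = {"node": "primary"}


def switch_to_standby():
    state["node"] = "standby"


@retry_on_failover(attempts=3, recover=switch_to_standby)
def fetch_order(order_id):
    if state["node"] == "primary":
        raise ConnectionLost("primary is unreachable")
    return f"order {order_id} served by {state['node']}"


print(fetch_order(42))  # prints "order 42 served by standby"
```

The business function stays free of failover logic; the retry policy lives in one place and can be applied to any call that touches the database.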

For applications that execute locally using a client/server model, failover becomes somewhat more complex. Those applications have to rely on status polling or "keep alive" ticks to be aware of the status of the server (database or application) and then appropriately reroute requests to the live systems. Here, more technology, such as load-balancing and failover appliances, is needed on the back end to make failover possible. Nevertheless, depending upon the complexity of the application, coding failover capabilities into existing applications could take as long as the original development of the applications. Also, extensive testing is needed to ensure that the application works properly under failover conditions, and this will extend the development cycle even further.
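
The polling-and-reroute pattern described above might look like the following Python sketch. It assumes the client holds an ordered list of servers and a probe function; FailoverClient, first_live_server and the server names are hypothetical, and the probe is simulated with a dictionary where a real client would use a TCP connect or a health-check request.

```python
from typing import Callable, Iterable, Optional


def first_live_server(servers: Iterable[str],
                      is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first server that answers the keep-alive probe, or None."""
    for server in servers:
        if is_alive(server):
            return server
    return None


class FailoverClient:
    """Hypothetical client that checks server status before each request."""

    def __init__(self, servers, is_alive):
        self.servers = list(servers)
        self.is_alive = is_alive
        self.active = None

    def send(self, request: str) -> str:
        # Poll the current target; reroute only if it stopped answering.
        if self.active is None or not self.is_alive(self.active):
            self.active = first_live_server(self.servers, self.is_alive)
        if self.active is None:
            raise RuntimeError("no live server available")
        return f"{request} -> {self.active}"


# Simulated probe results; a real probe would contact the server.
up = {"app-primary": True, "app-standby": True}
client = FailoverClient(["app-primary", "app-standby"], lambda s: up[s])

print(client.send("GET /orders"))  # prints "GET /orders -> app-primary"
up["app-primary"] = False          # the primary stops answering
print(client.send("GET /orders"))  # prints "GET /orders -> app-standby"
```

Note that the client in this sketch does not fail back to the primary on its own once it has moved to the standby; whether and when to fail back is a separate policy decision.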

Applications that are developed under C++ or other programming suites usually take a hybrid approach to failover, with the application just being aware of where requests and updates need to be sent, while back-office technology handles the physical routing. That intelligence is added to those applications using libraries available from the software developer that offers the programming suite. For example, Microsoft offers two APIs that are engineered to create failover capabilities: the Failover Cluster API and the Cluster Automation Server. The Failover Cluster API is a rich set of C/C++ libraries for developing cluster-aware applications, while the Cluster Automation Server provides a set of scriptable objects that are used to monitor and administer failover clusters. Oracle developers can turn to the Oracle Call Interface (OCI) to incorporate user-defined callback functions.
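
The callback style of integration can be sketched generically in Python. The example below does not reproduce the Failover Cluster API or the Oracle Call Interface; ClusterConnection, on_failover and simulate_node_failure are hypothetical stand-ins that only show the shape of the pattern, in which the back end performs the switch and the application is told about it.

```python
class ClusterConnection:
    """Hypothetical stand-in for a vendor library's cluster-aware connection."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.current = self.nodes[0]
        self._callbacks = []

    def on_failover(self, callback):
        """Register a function called as callback(old_node, new_node)."""
        self._callbacks.append(callback)

    def simulate_node_failure(self):
        """Pretend the back end moved the session to the next surviving node."""
        old = self.current
        remaining = [node for node in self.nodes if node != old]
        if not remaining:
            raise RuntimeError("no surviving node")
        self.nodes = remaining
        self.current = remaining[0]
        for callback in self._callbacks:
            callback(old, self.current)


events = []


def note_failover(old_node, new_node):
    # A real application might replay in-flight work or refresh caches here.
    events.append(f"failover {old_node} -> {new_node}")


conn = ClusterConnection(["db1", "db2"])
conn.on_failover(note_failover)
conn.simulate_node_failure()
print(events)  # prints ['failover db1 -> db2']
```

The application never decides where to go next; it only reacts, which matches the hybrid division of labor described above.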

Other application failover methods

However, diving deep into the code base to engineer application failover may not be the most cost-effective way to proceed. There are other options that meet the needs of resiliency yet don't involve extensive recoding. In the best cases, these options eliminate the need to change any code, but they require taking a look at how applications are executed and where the data is stored.

The first potential option is virtualization. Many enterprises have used virtualization technology in the data center to save costs, reduce data center footprints and maximize the return on investment of the equipment. The issue with virtualization is that it's usually an all-or-nothing approach. For virtualized failover to work, the application server has to be virtualized as well as the application itself -- and some applications share servers with other applications, meaning that virtualization may prove more complicated to deploy and manage than coding a single application for failover capabilities. Nevertheless, virtualization can simplify data system resiliency and support failover solutions for key applications.

Many third-party vendors offer canned solutions to combine the benefits of virtualization with the resiliency of failover. For example, VMware Inc. offers its Site Recovery Manager, which is a management framework for detecting failures and automatically bringing up a disaster recovery environment. Another vendor, InMage Systems Inc., offers a product that captures I/O and replicates data on a virtual desktop or to an appliance and can recover a system quickly, even across a WAN. Another option is to look at redundant virtual machines using a product such as everRun from Marathon Technologies.

Replication technology, combined with virtualization and warm spares, seems to be one of the quickest ways to introduce failover into an enterprise. Since the applications and data are served from the enterprise data center, very little needs to be done on the client side to create an instant failover environment. Arguably, that is the best way to introduce failover into legacy applications, but the combination of synchronization with virtualization can lead to an added benefit -- application load balancing.

By incorporating a load-balancing appliance into the mix, application failover is not only enabled, but it is extended to create a higher return on investment. Instead of "warm or hot spares" sitting on the sidelines waiting for a failover event to occur, those systems can be used to share the load during high-demand circumstances. Companies including Barracuda Networks, Coyote Point Systems, F5 Networks and several others offer load-balancing appliances that incorporate failover capabilities.
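
The way a load balancer doubles as a failover mechanism can be shown with a small Python sketch. It assumes simple round-robin distribution and an externally supplied health status; LoadBalancer, mark and pick are hypothetical names, and commercial appliances use richer health checks and scheduling policies than this.

```python
import itertools


class LoadBalancer:
    """Hypothetical round-robin balancer that skips unhealthy nodes."""

    def __init__(self, nodes):
        nodes = list(nodes)
        self.health = {node: True for node in nodes}
        self._ring = itertools.cycle(nodes)
        self._count = len(nodes)

    def mark(self, node, healthy):
        """Record the result of a health check for one node."""
        self.health[node] = healthy

    def pick(self):
        """Return the next healthy node, visiting each node at most once."""
        for _ in range(self._count):
            node = next(self._ring)
            if self.health[node]:
                return node
        raise RuntimeError("no healthy node available")


lb = LoadBalancer(["app1", "app2"])
print([lb.pick() for _ in range(4)])  # prints ['app1', 'app2', 'app1', 'app2']

lb.mark("app1", False)                # app1 fails its health check
print([lb.pick() for _ in range(3)])  # prints ['app2', 'app2', 'app2']
```

While both nodes are healthy, the former spare carries half of the traffic; when one fails, the same selection logic routes everything to the survivor without a separate failover step.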

Other vendors' offerings can create unified, virtualized failover solutions, and it is worthwhile looking into vendors and their various products, such as SteelEye Technology's LifeKeeper High Availability Clustering appliance, Racemi Inc.'s Automated Server and Disaster Recovery suites, Novell's PlateSpin Protect, IBM's System x Blade Server Failover solution as well as VMware's High Availability services. Each of those vendors offers products that enable different levels of failover. The best advice is to consider all the possibilities before rewriting a single line of code.

What did you think of this feature? Write to SearchDataCenter.com's Matt Stansberry about your data center concerns at [email protected].

29 Apr 2010