Business resiliency (also often called business continuity) is the ability of the enterprise to continue to function as effectively as possible in the face of natural and man-made problems and disasters affecting its IT. For the last four decades it has been one of the three most important value propositions of IT itself, along with competitive advantage and cost savings. Its importance relative to the other two has waxed and waned over the years, but recent natural and man-made disasters, plus new requirements for business compliance, have increased the importance of business resiliency to the point where even SMBs (small to medium-sized businesses) must plan and implement resiliency strategies, with input from the highest levels of the organization.
Today the mainframe is a necessary but not sufficient condition for good business resiliency. Instead, the business strategist usually considers the mainframe as first among equals: A platform whose resiliency can be counted on, but which must be integrated with other systems via resiliency software and hardware in order to achieve the quality-of-service, availability, and recoverability goals that the enterprise now needs.
However, the mainframe can do more for business resiliency strategies than simply be the best at what it does. In the first place, the mainframe can act consciously (where before it has acted unconsciously) as the spawning and testing ground for new resiliency technologies and strategies that will then flow downwards to the rest of an enterprise's integrated resiliency infrastructure. In the second place, the mainframe can act as a data resiliency hub, supervising key data resiliency tasks, just as it now supervises key data management tasks as a data hub. Together, these two tasks can make the mainframe again the focus of users' business resiliency strategies.
The fundamentals of business resiliency strategies
Resiliency includes availability (the ability of systems and the entire enterprise architecture to continue to function and respond to outside requests as much as possible) and recoverability (the ability of systems and the enterprise architecture to return to functioning and responding to requests as quickly as possible, with as little information as possible lost in the meantime). Resiliency also includes performance, because responses that are too slow can have serious effects on customer satisfaction, but this is typically a less important problem. More importantly, the IT strategist can divide resiliency into operational resiliency (problems at a site that can be dealt with at the site) and disaster resiliency (all other problems, which must be handled by involving remote sites, such as hurricanes knocking out a site or denial-of-service attacks aimed at a site). Finally, business resiliency involves application resiliency, data resiliency, network resiliency, and people resiliency (how to find substitutes for people prevented from fulfilling their roles in the organization). Process resiliency sometimes substitutes for application and people resiliency, because a business process can be thought of as a composite application with people involved. Table 1 shows how a strategist can plan for business resiliency problems more effectively by examining a matrix of the last two classifications: operational versus disaster resiliency on one axis, and application, data, network, and people resiliency on the other.
Table 1: Example of a business resiliency matrix
| |Operational resiliency|Disaster resiliency|
|---|---|---|
|Application resiliency|How quickly can I fail over an application from one system to another: seconds, minutes, or hours?|Which applications are important to keep running when disaster strikes, and which can wait for an hour, a day, or a week?|
|Data resiliency|How fast and how close to the time of failure can I recover data from backups if a business-critical system fails?|What is my tradeoff between costs of mirroring to a site 150 or more miles away and risks if I use less storage or a closer site?|
|Network resiliency|Does my intranet have enough alternate pathways in case an electrical outage occurs?|What do I do if power lines between the local site and the disaster recovery site are knocked out?|
|People resiliency|Who can substitute for my systems administrator if he/she is sick?|Who can act temporarily for the CEO, COO, or CFO if he/she has a heart attack?|
Source: Infostructure Associates, April 2006
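One way to put the matrix in Table 1 to work is to record, for each square, the plans the business already has and then list the squares that remain empty. The following is a minimal sketch of that idea, not a prescribed method; the class names, the `uncovered_cells` function, and the sample plans are all hypothetical and chosen only for illustration.

```python
from enum import Enum
from itertools import product


class Scope(Enum):
    """Columns of the matrix."""
    OPERATIONAL = "operational"
    DISASTER = "disaster"


class Layer(Enum):
    """Rows of the matrix."""
    APPLICATION = "application"
    DATA = "data"
    NETWORK = "network"
    PEOPLE = "people"


def uncovered_cells(plans):
    """Return the (layer, scope) squares that have no plan recorded."""
    return [
        (layer, scope)
        for layer, scope in product(Layer, Scope)
        if not plans.get((layer, scope))
    ]


# Hypothetical plans for an example business; the entries are illustrative only.
plans = {
    (Layer.DATA, Scope.OPERATIONAL): ["nightly backup to tape"],
    (Layer.DATA, Scope.DISASTER): ["mirroring to a remote site"],
    (Layer.APPLICATION, Scope.OPERATIONAL): ["failover to a standby system"],
}

gaps = uncovered_cells(plans)
for layer, scope in gaps:
    print(f"No plan yet: {layer.value} resiliency / {scope.value} resiliency")
```

Keeping the matrix as data rather than as a document makes it straightforward to re-run the gap check whenever a plan is added or retired.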
The good news about this approach to achieving business resiliency is that there are now plenty of technologies and solutions to aid the strategist in each portion of the business resiliency matrix (even people resiliency, with human capital management solutions), and that these technologies and solutions rarely cover more than one square of the matrix. The bad news is that, in many cases, there are so many technologies/solutions in each area that it is often very difficult to determine which technology/solution best fits a particular business. Thus, for example, many businesses continue to believe that backup/restore to tape is their best strategy for data resiliency, whereas backup/restore may not meet the availability (it takes a long time to back up and restore) and recoverability (some transactions may be lost) requirements of the business for some applications.
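The tape example can be made concrete with some simple arithmetic. The sketch below compares a backup strategy against an application's requirements: if a failure strikes just before the next backup, everything written since the last one is lost, and the application stays down for as long as the restore takes. All names and figures here (`BackupStrategy`, `Requirement`, the 24-hour interval, the 8-hour restore) are assumptions made up for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    max_downtime_hours: float    # how long the application may be unavailable
    max_data_loss_hours: float   # how much recent work the business can afford to lose


@dataclass
class BackupStrategy:
    backup_interval_hours: float  # time between successive backups
    restore_hours: float          # time needed to restore from the latest backup


def shortfalls(strategy, req):
    """Return which requirements the strategy fails to meet in the worst case."""
    problems = []
    # Worst case: the failure occurs just before the next backup would have run.
    if strategy.backup_interval_hours > req.max_data_loss_hours:
        problems.append("recoverability")
    # The application is unavailable for at least the duration of the restore.
    if strategy.restore_hours > req.max_downtime_hours:
        problems.append("availability")
    return problems


# Hypothetical figures for illustration only.
nightly_tape = BackupStrategy(backup_interval_hours=24, restore_hours=8)
order_entry = Requirement(max_downtime_hours=1, max_data_loss_hours=0.25)
archive_lookup = Requirement(max_downtime_hours=48, max_data_loss_hours=24)

print("order entry:", shortfalls(nightly_tape, order_entry))
print("archive lookup:", shortfalls(nightly_tape, archive_lookup))
```

Under these assumed numbers, the same tape strategy is adequate for one application and falls short on both counts for another, which is why the choice has to be made per application rather than once for the whole business.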
The resiliency customer should also understand that there are now close links between resiliency efforts and other imperatives of the business. For example, the same technology used for resiliency may be used for business compliance (archiving), for confidentiality (defenses against unauthorized access), or for security (firewalls).
In a recent disaster, a law firm lost all data on its customers, as well as its legal and office applications, network, and PCs. Replacing the software, hardware, and network was the matter of a business day. Replacing the data was effectively impossible, and the lack of data on the law firm's clients forced the firm to fold.
What was true then of a medium-sized law firm is increasingly true now for all enterprises of all sizes. The Internet offers many cost-effective paths from any node in the enterprise to any other, and enterprises have adjusted nicely to its viruses and spyware. Swapping and hot-swapping computers and computer components, while never without pain, is always doable, and hardware and system software (with the aid of such technologies as self-healing) continues to improve in robustness. Time to develop, upgrade, or buy and deploy a business-critical application continues to decrease, and it is relatively easy (compared to data) to save unchanging applications in a safe place. What remains hard is to replace data once lost at the local site.
War, it has been remarked, is too important to leave to the generals; and data resiliency has become too important to leave to each application, or even each data management system. The result in the last few years has been an outpouring of solutions from storage vendors, allowing multiple data stores and file stores to be coordinated via SANs (storage area networks), NAS (network-attached storage for files), and a wealth of new operational-continuity and disaster-recovery technologies (e.g., Continuous Data Protection, point-in-time copies, and semi-synchronous replication).
However, a new technology called Information Lifecycle Management (ILM) is beginning to shift the balance of resiliency technologies back towards being the shared responsibility of storage software and server software. ILM separates out the data that is no longer being updated (physically or logically), and places it in a fixed-content store or an active archive. (The effects of ILM implementation on business resiliency, by the way, are impressive: Active-archive data no longer needs to be backed up or mirrored periodically, just once when it becomes read-only, and thus overall business backup and recovery can be quicker and get closer to the point of failure.)
In order to determine what data should be thus isolated, storage systems cannot depend just on information available from the storage itself. Much of that information about data is presently stored in data-management repositories, and accessed by databases, data-integration software, and other infrastructure software. So, to handle data resiliency effectively in an ILM world, vendors and users need to combine storage software and infrastructure software.
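As a rough illustration of why both kinds of information are needed, the sketch below treats a data set as ready for the active archive only when the storage layer reports no recent writes and the data-management repository reports that the underlying record is closed. It then totals the volume that still needs periodic backup. The field names, the 90-day threshold, and the sample data sets are hypothetical assumptions, not part of any ILM product.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    size_gb: float
    days_since_last_write: int  # visible to the storage layer
    record_closed: bool         # hypothetical flag held in a data-management repository


def is_archivable(ds, quiet_days=90):
    """Read-only candidate: no recent writes AND the repository says the record is closed."""
    return ds.days_since_last_write >= quiet_days and ds.record_closed


def recurring_backup_gb(datasets, quiet_days=90):
    """Volume that still needs periodic backup after archivable data is copied once and set aside."""
    return sum(ds.size_gb for ds in datasets if not is_archivable(ds, quiet_days))


# Hypothetical data sets for illustration only.
datasets = [
    DataSet("open_orders", 200.0, 0, False),
    DataSet("closed_cases_2004", 500.0, 400, True),
    # Quiet at the storage level, but the accounts are still open, so it must stay active.
    DataSet("dormant_but_open_accounts", 300.0, 200, False),
]

total = sum(ds.size_gb for ds in datasets)
recurring = recurring_backup_gb(datasets)
print(f"{recurring:.0f} GB of {total:.0f} GB still needs periodic backup")
```

The third data set shows the point of the paragraph above: storage information alone would have archived it, while the repository knows it can still change.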
The mainframe in business resiliency
The logical end-point of the importance of data resiliency and of ILM is the creation of a resiliency management architecture that coordinates backup/recovery and other technologies in the service of operational continuity and disaster recovery. Such an architecture must run on all platforms -- but it must first be implemented on the mainframe, and then flow down to all other platforms from there. The mainframe is where the majority of today's data-resiliency technologies are deployed, and where the majority of mission-critical and business-critical data resides. Users must first get the mainframe right before getting the enterprise right.
That, in turn, means that when a new business-resiliency technology is first deployed, it should be created and tested on a mainframe platform. The Web-servicization of the mainframe makes this a much easier task, because it is easier to integrate the new technology with Web-service-provider interfaces to existing infrastructure and storage resiliency software.
By the same logic, just as the mainframe can today act as a data hub or enterprise data server, the mainframe should be able to act as a data resiliency hub. More specifically, the mainframe should be the locus of the metadata repository that contains information about all enterprise data's lifecycle and the location and format of data copies (in local and remote sites); and the mainframe should be the locus of the infrastructure and storage software that coordinates all of the enterprise's data-resiliency activities.
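To make the idea of a resiliency metadata repository more tangible, here is a minimal in-memory sketch that records each data set's lifecycle state and the site and format of its copies, and can report which data sets have no copy at a remote site. This is only an assumption about what such a repository might track; the class and method names are hypothetical and do not correspond to any vendor's product.

```python
from dataclasses import dataclass, field


@dataclass
class Copy:
    site: str     # where the copy lives, e.g. "local" or a remote site name
    fmt: str      # format of the copy, e.g. "disk mirror" or "tape"
    remote: bool  # True if the copy is held away from the local site


@dataclass
class Entry:
    lifecycle: str  # e.g. "active" or "archived"
    copies: list = field(default_factory=list)


class ResiliencyRepository:
    """Hypothetical hub-side catalog of data sets and their copies."""

    def __init__(self):
        self._entries = {}

    def register(self, dataset, lifecycle):
        self._entries[dataset] = Entry(lifecycle)

    def add_copy(self, dataset, copy):
        self._entries[dataset].copies.append(copy)

    def without_remote_copy(self):
        """Names of data sets that would be lost if the local site were lost."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if not any(c.remote for c in entry.copies)
        )


# Hypothetical contents for illustration only.
repo = ResiliencyRepository()
repo.register("customer_master", "active")
repo.register("closed_cases_2004", "archived")
repo.add_copy("customer_master", Copy("local", "disk mirror", remote=False))
repo.add_copy("customer_master", Copy("recovery_site", "disk mirror", remote=True))
repo.add_copy("closed_cases_2004", Copy("local", "tape", remote=False))

print(repo.without_remote_copy())
```

A single catalog of this kind is what would let one platform coordinate data-resiliency activities across sites, since it can answer questions that no individual application or storage system can answer on its own.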
How the System z9 fits
The IBM System z9 and IBM's storage arm offer an extensive list of business-resiliency products, solutions, and services. Key technologies that can form the foundation of a data resiliency architecture include GDPS (allowing remote management of copying and storage and remote recovery from a single point), Capacity Backup for zSeries (supporting flexible addition of processors in a crisis), System z9 and Tivoli systems management software, and information on demand infrastructure software for coordinating business-critical data enterprise-wide (e.g., master data management) and creating an enterprise-wide metadata repository (WebSphere Information Integrator). We should also note the new self-healing capabilities of the z/OS operating system.
Vendors continue to be more effective at providing business resiliency -- but users' continually increasing dependence on IT makes even minor outages less bearable than ever. In this situation, not only is the mainframe's business-resiliency prowess more important than ever, but it can also take a lead role in providing the hub of an overall business-resiliency architecture that will leverage key new technologies such as ILM.
The IBM System z9 is well-positioned to act as such a hub, with extensive products, services, and solutions that span the full range of business resiliency (network, application, and data). Above all, System z9 and IBM storage offer strengths in data resiliency, the one area where it is absolutely business-critical for enterprises to succeed. Enterprises of all sizes should focus on business resiliency in general and data resiliency in particular; enterprises with mainframes such as System z9 should use these mainframes as the focus of their efforts.
About the author: Kernochan is the president of Infostructure Associates, LLC.