Disaster recovery on the mainframe: New options for site recovery


Robert Crawford, Contributor
06.06.2007


Site recovery capabilities were limited in the late 20th century. If an event took a campus down, the only option was to send a subset of programmers to a backup site where they would spend several sleepless nights spinning tapes, fighting JCL errors and replying "cancel" to device allocation error messages.

Things have come a long way since then. As DASD evolved from a string of boxes to large cabinets with great chunks of cache, manufacturers added capabilities for replicating data across campuses. Similarly, mainframe hardware and software changed to support redundancy and failover. This evolution, along with automation, makes possible campus-wide recoverability that drops the need to execute an actual disaster recovery (DR) to almost zero.

Data replication for basic disaster recovery planning

In the basic site recovery scenario, two data centers are placed some distance apart on the same campus. The distance should be far enough to prevent simultaneous failure but close enough for synchronous hardware communication. The processors, coupling facilities (CFs) and DASD boxes are all connected over redundant links. For the maximum benefit and fastest recovery, each data center should run part of a "hot-hot" split of the production workload.

DASD vendors have added synchronous data replication over limited, campus-wide distances. It is designed so that a write to a primary disk is replicated to an alternate volume in another box. Because the replication is synchronous, the system that initiated the I/O does not get acknowledgement of the write until the replication is complete.

It's important to note that while this functionality may be valuable to some crucial applications, you can expect the I/O time to roughly double. The solutions I have seen were device based, with the primary and secondary DASD cabinets talking directly to each other.
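To see where the doubling comes from, here is a rough back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions: the function name and all timings are hypothetical, and it assumes the remote write starts only after the local one finishes. Real controllers may behave differently.

```python
def synchronous_write_time_ms(local_write_ms, remote_write_ms, link_round_trip_ms):
    """Estimate the time before the host sees a replicated write acknowledged.

    Assumption: the primary cabinet completes its own write, ships the data
    to the secondary cabinet, and waits for confirmation before answering.
    """
    return local_write_ms + link_round_trip_ms + remote_write_ms


# Hypothetical timings, for illustration only.
local_ms = 0.5
replicated_ms = synchronous_write_time_ms(
    local_write_ms=local_ms,
    remote_write_ms=0.5,
    link_round_trip_ms=0.1,
)
ratio = replicated_ms / local_ms
print(f"unreplicated: {local_ms} ms, replicated: {replicated_ms:.1f} ms, ratio: {ratio:.1f}x")
```

With similar cabinets at both ends, the remote write costs about as much as the local one, and the link round trip adds a little more on top.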

Problem detection software, I/O strategies also part of DR planning

Being able to replicate is good, but there are more pieces needed for data recovery. For instance, some manufacturers supply software able to detect I/O problems and initiate a "swap" to the alternate DASD volumes. This is not instantaneous and may take anywhere from several seconds to a couple dozen. You also must be very careful to set up the software so it only swaps when there is a real problem. Lastly, the software may need UCB definitions for both the primary and secondary volumes. Given z/OS's 64K UCB limit, this might be a squeeze for some configurations.
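One common way to keep monitoring software from reacting to a momentary glitch is to require a run of consecutive failures before acting. The Python sketch below illustrates that general idea only. The class name, the threshold and the probe results are hypothetical, and it does not represent how any vendor's swap software actually decides.

```python
class SwapTrigger:
    """Signal a swap only after a run of consecutive failed I/O probes."""

    def __init__(self, required_consecutive_failures=5):
        self.required = required_consecutive_failures
        self.consecutive_failures = 0

    def record(self, io_succeeded):
        """Record one probe result; return True when a swap looks warranted."""
        if io_succeeded:
            # Any success resets the count, so isolated blips never add up.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.required


trigger = SwapTrigger(required_consecutive_failures=3)
probes = [True, False, True, False, False, False]
decisions = [trigger.record(ok) for ok in probes]
print(decisions)
```

The single failure early in the sequence is ignored, and only the final run of three failures trips the trigger. A higher threshold means fewer false positives but a longer wait before a genuine swap.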

Implicit in the desire for replication is the requirement for some planning. An administrator must decide which volumes need to be replicated and how to organize the recovery groups. Asking if your manufacturer supports replicating special types of volumes such as page packs is another important point. If not, a volume swap will mean an IPL even if all the other DASD successfully switches to the backup copies.
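Part of that planning can be simple arithmetic. If the swap software needs a UCB for both the primary and the secondary copy, every replicated volume counts twice against the 64K limit mentioned earlier. The Python sketch below assumes that doubling and uses made-up volume counts; the function names are hypothetical.

```python
UCB_LIMIT = 64 * 1024  # the "64K UCB limit" cited in this article


def ucbs_required(replicated_volumes, unreplicated_volumes, other_devices=0):
    """Count UCBs, assuming each replicated volume needs a primary and a secondary UCB."""
    return 2 * replicated_volumes + unreplicated_volumes + other_devices


def fits_ucb_limit(replicated_volumes, unreplicated_volumes, other_devices=0):
    """Return True if the planned configuration stays within the limit."""
    needed = ucbs_required(replicated_volumes, unreplicated_volumes, other_devices)
    return needed <= UCB_LIMIT


# Hypothetical configuration, for illustration only.
needed = ucbs_required(replicated_volumes=30000, unreplicated_volumes=4000, other_devices=3000)
print(needed, "UCBs needed; fits:", fits_ucb_limit(30000, 4000, 3000))
```

In this made-up case, replicating 30,000 volumes pushes the total past the limit, so the administrator would have to replicate fewer volumes or trim other device definitions.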

Take heed of IBM's recommendations on which files, such as couple datasets, should not be replicated because I/O to them must be as fast as possible to maintain system throughput. In the case of couple datasets, z/OS already maintains backup or alternate copies that the system can switch to in a pinch.

Automatic failover is grand, but I would also recommend looking into and developing procedures for swapping back to the primary copy. While the same software that made the initial swap can help, getting back isn't quite as easy. Building and testing the procedures ahead of time can save a lot of headaches if you get any false positives.

Data recovery is just a part of site recovery

You also have to think about processors and CFs. It's easy enough to put processors at opposite ends of a campus provided you can build the specialized and hardened environment required. For failover's sake, the processors at each site should connect to their own CF. Physically separating the CFs also entails detailed analysis of which structures need replication.

The devil is in the details as structure use and recoverability options will differ subsystem by subsystem. You may also expect elongated CF response times as a synchronous write to a structure must be replicated and confirmed at the other CF before it is considered complete.

Rounding out the site recovery picture is any automation included to ease the transition from one site to another. This is probably the most difficult subject to tackle. Someone must decide how much human intervention is required, which workloads recover where and what really constitutes a site failure. Then, after writing this automation, someone must decide on the safest way to test it.

Capacity planning eases disaster recovery

Planners should not neglect capacity planning. If DASD I/O times double, you might want to beef up your I/O subsystem infrastructure with more paths or faster devices. If you decide to replicate all your DASD, then every cabinet you purchase must have a partner for mirroring, along with all the cables and definitions to go with it.

Processors and CFs must be able to execute double their usual workload in case of a recovery. This may be accomplished by buying either bigger processors or capacity backup units (CBUs) that can be switched on in times of trouble.
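A quick way to sanity-check this is to ask, for each site, whether its installed capacity plus any reserve it can switch on covers both its own load and its partner's. The Python sketch below does that with made-up figures. The site names, units and numbers are hypothetical, and it treats CBU capacity as simply additive, which is a simplification.

```python
def can_absorb_partner(site_capacity, own_load, partner_load, cbu_capacity=0):
    """Return True if a surviving site can carry both sites' workloads."""
    return site_capacity + cbu_capacity >= own_load + partner_load


# Hypothetical figures in arbitrary capacity units (for example, MIPS).
sites = {
    "site_a": {"capacity": 10000, "load": 6000, "cbu": 0},
    "site_b": {"capacity": 8000, "load": 5000, "cbu": 4000},
}

results = {}
for name, site in sites.items():
    # With exactly two sites, the partner is whichever one is not this site.
    partner = next(other for other in sites if other != name)
    results[name] = can_absorb_partner(
        site["capacity"], site["load"], sites[partner]["load"], site["cbu"]
    )
    print(f"{name} can carry the full workload alone: {results[name]}")
```

Here the larger site fails the check because it has no reserve, while the smaller site passes once its backup capacity is switched on.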

The capability and choice of products available for site recovery are astounding. While I can glibly touch on the components and procedures for site recovery, rest assured it isn't easy. I also didn't mention recovery of UNIX and Windows servers, for which planning is equally, if not more, gruesome. It takes a commitment of money, resources, deep analysis and brainpower from your company. However, depending on how much downtime your enterprise can afford, it might all be worthwhile.

ABOUT THE AUTHOR: Robert Crawford has been a CICS systems programmer off and on for 24 years. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.
