Site recovery capabilities were limited in the late 20th century. If an event took a campus down, the only option was to send a subset of programmers to a backup site, where they would spend several sleepless nights spinning tapes, fighting JCL errors and replying "cancel" to device allocation error messages.
Things have come a long way since then. As DASD evolved from a string of boxes to large cabinets with great chunks of cache, manufacturers added capabilities for replicating data across campuses. Similarly, mainframe hardware and software changed to support redundancy and failover. This evolution, along with automation, makes possible campus-wide recoverability that drops the need to execute an actual disaster recovery (DR) to almost zero.
Data replication for basic disaster recovery planning
In the basic site recovery scenario, two data centers are placed some distance apart on the same campus. The distance should be far enough to prevent simultaneous failure but close enough for synchronous hardware communication. The processors, coupling facilities (CF's) and DASD boxes are all connected over redundant links. For the maximum benefit and fastest recovery, the two data centers should run a "hot-hot" split of the production workload.
DASD vendors have added synchronous data replication over limited, campus-wide distances, designed so that a write to a primary disk is replicated to an alternate volume in another box. Since it is synchronous, the system that initiated the I/O will not get acknowledgement of the write until the replication is complete.
It's important to note that while this functionality may be valuable to some crucial applications, you can basically expect the I/O time to double. The solutions I have seen were device based; that is, the primary and secondary DASD cabinets talked to each other.
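To see where the doubling comes from, here is a minimal sketch of the arithmetic. It assumes the primary write, the trip across the link and the secondary write happen one after another; real DASD controllers may overlap some of this work. The function name and the sample timings are hypothetical and not taken from any vendor's documentation.

```python
def sync_write_time_ms(primary_ms: float, secondary_ms: float,
                       link_round_trip_ms: float) -> float:
    """Estimate the elapsed time of one synchronously mirrored write.

    Simplifying assumption: the three steps run strictly in sequence, and
    the host is acknowledged only after the secondary box has confirmed.
    """
    return primary_ms + link_round_trip_ms + secondary_ms


if __name__ == "__main__":
    unmirrored_ms = 1.0  # hypothetical service time of a plain write
    mirrored_ms = sync_write_time_ms(unmirrored_ms, unmirrored_ms,
                                     link_round_trip_ms=0.1)
    print(f"unmirrored: {unmirrored_ms:.1f} ms, mirrored: {mirrored_ms:.1f} ms")
```

With a secondary box about as fast as the primary and a short campus link, the mirrored time lands at roughly twice the unmirrored time, and a longer link pushes it higher.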
Problem detection software, I/O strategies also part of DR planning
Being able to replicate is good, but there are more pieces needed for data recovery. For instance, some manufacturers supply software able to detect I/O problems and initiate a "swap" to the alternate DASD volumes. The swap is not instantaneous and may take anywhere from several seconds to a couple dozen. You also must be very careful to set up the software so it only swaps when there is a real problem. Lastly, the software may need UCB definitions for both the primary and secondary volumes. Given z/OS's 64K UCB limit, this might be a squeeze for some configurations.
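A rough way to check the UCB squeeze is to count two definitions for every mirrored volume and one for everything else. The sketch below assumes exactly that accounting and uses the 64K figure mentioned above; the function names and the device counts in the example are hypothetical.

```python
UCB_LIMIT = 64 * 1024  # the 64K limit mentioned above


def ucbs_required(mirrored_volumes: int, other_devices: int) -> int:
    """Each mirrored volume needs a primary and a secondary UCB (assumption)."""
    return 2 * mirrored_volumes + other_devices


def ucb_headroom(mirrored_volumes: int, other_devices: int,
                 limit: int = UCB_LIMIT) -> int:
    """UCBs left over; a negative result means the configuration does not fit."""
    return limit - ucbs_required(mirrored_volumes, other_devices)


if __name__ == "__main__":
    # Hypothetical shop: 20,000 mirrored volumes plus 10,000 other devices.
    print("UCBs required:", ucbs_required(20_000, 10_000))
    print("Headroom:", ucb_headroom(20_000, 10_000))
```

Running the numbers before deciding which volumes to replicate shows how quickly mirroring everything eats into the limit.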
Implicit in the desire for replication is the requirement for some planning. An administrator must decide which volumes need to be replicated and how to organize the recovery groups. Asking if your manufacturer supports replicating special types of volumes such as page packs is another important point. If not, a volume swap will mean an IPL even if all the other DASD successfully switches to the backup copies.
Take heed of IBM's recommendations on which files, such as couple datasets, should not be replicated because I/O to them must be as fast as possible to maintain system throughput. In the case of couple datasets, z/OS already maintains backup or alternate copies the system can switch to in a pinch.
Automatic failover is grand, but I would also recommend looking into and developing procedures for swapping back to the primary copy. While the same software that made the initial swap can help, getting back isn't quite as easy. Building and testing the procedures ahead of time can save a lot of headaches if you get any false positives.
Data recovery is just a part of site recovery
You also have to think about processors and CF's. It's easy enough to put processors at opposite ends of a campus provided you can build the specialized and hardened environment required. For failover's sake, the processors at each site should connect to their own CF. Physically separating the CF's also entails detailed analysis of which structures need replication.
The devil is in the details as structure use and recoverability options will differ subsystem by subsystem. You may also expect elongated CF response times as a synchronous write to a structure must be replicated and confirmed at the other CF before it is considered complete.
Rounding out the site recovery picture is any automation included to ease the transition from one site to another. This is probably the most difficult subject to tackle. Someone must decide how much human intervention is required, which workloads recover where and what really constitutes a site failure. Then, after writing this automation, someone must decide on the safest way to test it.
Capacity planning eases disaster recovery
Planners should not neglect capacity planning. If DASD I/O times double, you might want to beef up your I/O subsystem infrastructure with more paths or faster devices. If you decide to replicate all your DASD, then every cabinet you purchase must have a partner for mirroring, along with all the cables and definitions to go with it.
Processors and CF's must be able to execute double their usual workload in case of a recovery. This may be accomplished either by buying bigger processors or by adding capacity backup units (CBU's) that can be switched on in times of trouble.
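The capacity question boils down to a simple inequality: what a site has installed, plus any CBU capacity it can switch on, must cover its own workload and its partner's. Here is a minimal sketch of that check. The function name and the figures are hypothetical, and the units can be whatever capacity measure your shop already uses.

```python
def can_absorb_failover(installed: float, cbu: float,
                        own_load: float, partner_load: float) -> bool:
    """Return True if one site can carry both workloads after a failure.

    All four arguments must use the same capacity unit. `cbu` is the extra
    capacity that can be switched on during a recovery.
    """
    return installed + cbu >= own_load + partner_load


if __name__ == "__main__":
    # Hypothetical hot-hot split: each site normally carries 4,000 units.
    print(can_absorb_failover(installed=5_000, cbu=0,
                              own_load=4_000, partner_load=4_000))
    print(can_absorb_failover(installed=5_000, cbu=3_000,
                              own_load=4_000, partner_load=4_000))
```

In the example, the site falls short on installed capacity alone and only passes once the CBU capacity is counted.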
The capability and choice of products available for site recovery are astounding. While I can glibly touch on the components and procedures for site recovery, rest assured it isn't easy. I also didn't mention recovery of UNIX and Windows servers, for which planning is equally, if not more, gruesome. It takes a commitment of money, resources, deep analysis and brainpower from your company. However, depending on how much downtime your enterprise can afford, it might all be worthwhile.
ABOUT THE AUTHOR: Robert Crawford has been a CICS systems programmer off and on for 24 years. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.