A couple of months ago I wrote about how new DASD features have greatly expanded the number of available recovery options. With synchronous replication an enterprise can use two data centers on the same campus to back up each other. Similarly, there are asynchronous replication techniques over geographic distances that can drastically reduce the time to recover a data center and shrink the recovery time objective (RTO).
Early disaster recovery
As recently as the 1990s, if a company had a disaster recovery (DR) plan at all, it usually meant contracting for a computer, DASD and floor space to use in the event of a disaster. The contract would also include provisions for periodic, typically annual, tests. Next, the company would develop processes to ship weekly volume backups and database recovery tapes to the DR provider or a third-party vaulting company. In the event of a disaster or DR test, the systems programmers and database administrators (DBAs) would fly to the provider's site and start slapping the volume backups on the provided DASD.
This worked well enough for companies that put in some serious practice. A well-executed plan could get a system up and running in under 12 hours. The DBAs might have all the databases recovered about 12 to 24 hours after the system came up. The recovered data would be only as current as the last backup: up to a week old for the volume backups or, depending on the breaks, hours old for the databases.
Disaster recovery changes
The unfortunate experience of 9/11 and subsequent government regulations changed all that. After 9/11 many disaster recovery site providers ran out of space and left some companies without a home. In addition, commercial flights were suspended, keeping programmers away from the backup sites. A little later, government regulation added requirements for how long financial companies could be down and how stale the recovered data could be. Soon it became apparent to a lot of IT departments that they would just have to do it themselves.
Now a typical DR situation for larger companies may look like this: The enterprise owns one or more secondary or DR sites a long distance from the primary data center. The secondary site is most likely active. It may be the company's development environment or another production data center.
Either way, the alternate site must have sufficient capacity to run the other site's workload, either through a surplus of processors or capacity backup units (CBUs). A CBU is a dormant processor that can be turned on with the right key and command at the Hardware Management Console (HMC). A CBU doesn't count towards software licensing costs and can be enabled immediately. The bad news is you must buy them in packages from IBM and there are a limited number of times they can be used before you may have to purchase more.
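The capacity question above comes down to simple arithmetic, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: capacity is counted in whole processors, all processors are treated as equivalent, and the function name `cbus_to_enable` and its parameters are hypothetical rather than part of any IBM tool.

```python
def cbus_to_enable(required_cps: int, idle_cps: int, cbus_installed: int) -> int:
    """Return how many CBUs must be turned on to cover the failed site's workload.

    required_cps   -- processors the failed site's workload needs (assumed known)
    idle_cps       -- spare processors already available at the backup site
    cbus_installed -- dormant CBUs purchased for the backup machine
    """
    # Use idle processors first; CBUs only cover what is left over.
    shortfall = max(0, required_cps - idle_cps)
    if shortfall > cbus_installed:
        raise ValueError(
            f"backup site is short by {shortfall - cbus_installed} processor(s)"
        )
    return shortfall


if __name__ == "__main__":
    # Example figures are made up for illustration.
    print(cbus_to_enable(required_cps=6, idle_cps=2, cbus_installed=5))
```

Running a check like this against each year's workload forecast is one way to notice early that the backup site no longer has enough dormant capacity.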
The primary and DR sites both have DASD subsystems connected through a high-bandwidth communication pipe. The primary data center writes to its local DASD and updates are transmitted asynchronously to the backup site. The changes are applied to offline volumes at the backup site. How far the backup data lags behind production depends on the pipe's bandwidth, the amount of data and the data replication method.
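To make the relationship between write volume and pipe bandwidth concrete, here is a minimal Python sketch. It assumes a deliberately simple model: writes are summarized per hour, the link moves a fixed amount of data per hour, and anything the link cannot move is carried over as backlog. The function name and the sample figures are invented for illustration and do not describe any particular replication product.

```python
from typing import Iterable, Tuple


def replication_backlog(
    hourly_writes_gb: Iterable[float], link_gb_per_hour: float
) -> Tuple[float, float]:
    """Simulate asynchronous replication hour by hour.

    Returns (peak backlog, backlog left after the last hour), both in GB.
    """
    backlog = 0.0
    peak = 0.0
    for written in hourly_writes_gb:
        # New writes join the queue; the link drains up to its hourly capacity.
        backlog = max(0.0, backlog + written - link_gb_per_hour)
        peak = max(peak, backlog)
    return peak, backlog


if __name__ == "__main__":
    # A made-up batch window: quiet, ramping up, heavy, then tapering off.
    peak, remaining = replication_backlog([50, 120, 200, 80, 40], 100)
    print(f"peak backlog {peak} GB, still queued {remaining} GB")
```

The peak backlog is the amount of data that would be missing at the backup site if the disaster struck at the worst moment, so it is the figure to compare against any data currency requirement.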
Storage tapes and virtual tape alternatives
Tape will always be with us, and the tapes needed for recovery must be physically shipped to the backup site. An alternative is to put essential recovery datasets on DASD or to use a "tape on DASD," or virtual tape, product. By virtual tape, I'm referring to software products that redirect tape I/O to DASD volumes transparently, as opposed to a hardware solution such as IBM's Virtual Tape Server (VTS).
The virtual tape option has a couple of advantages. First, of course, is that DASD performs better than tape and switching can only improve batch performance. Second, the faux tape data on DASD will be replicated to the backup site along with everything else. In the event of a disaster, operators at the alternate site execute a plan to bring the backup volumes online and IPL the dead site's LPARs. The plan may also include adding processor capacity through CBUs or varying idle CPs online. Depending on the breaks, a DR plan should be able to get the systems up much faster with more current data.
New disaster recovery considerations
Capacity planning becomes a big issue. First, of course, are the difficult decisions about which applications must be restarted at once and which may wait. This decision feeds into capacity planning in a number of ways:
- Processor needed during an emergency at the backup site
- The amount of data to be transmitted to the backup site
- DASD needed at the backup site for replication
- Bandwidth for the communication link between the two sites. Network planners should also consider situations where communication breaks between the two DASD subsystems and old data has to be transmitted in order to catch up at the backup site.
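The catch-up scenario in the last bullet can be estimated with a back-of-the-envelope calculation, sketched here in Python. The model is an assumption made for illustration: production keeps writing at a steady rate during and after the outage, and only the link capacity left over after carrying new writes is available to drain the backlog. The name `catch_up_hours` is hypothetical.

```python
import math


def catch_up_hours(
    outage_hours: float, write_gb_per_hour: float, link_gb_per_hour: float
) -> float:
    """Estimate hours needed to resynchronize the backup site after a link outage."""
    # Everything written during the outage is waiting to be sent.
    backlog_gb = outage_hours * write_gb_per_hour
    # New writes keep arriving, so only the spare bandwidth drains the backlog.
    spare_gb_per_hour = link_gb_per_hour - write_gb_per_hour
    if spare_gb_per_hour <= 0:
        return math.inf  # the link can never catch up at this write rate
    return backlog_gb / spare_gb_per_hour


if __name__ == "__main__":
    # Made-up figures: a 2-hour outage, 60 GB/hour of writes, a 100 GB/hour link.
    print(catch_up_hours(2, 60, 100))
```

A link sized only for the average write rate leaves no spare bandwidth, which is why the estimate returns infinity in that case and why planners need headroom above steady-state traffic.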
There are other things to think about, such as automated scripts for these actions at the recovery site:
- Enabling CBUs
- Bringing the offline recovery volumes online
- IPLing the dead LPARs
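The ordering of those actions matters, and an automated script should stop as soon as one of them fails. The Python sketch below shows that control flow only. The three step functions are hypothetical placeholders that simply report success; in practice each would call whatever automation the site already uses, which this example does not attempt to model.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_recovery(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that reports failure.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical placeholders: each would drive the site's own automation.
def enable_cbus() -> bool:
    return True


def bring_volumes_online() -> bool:
    return True


def ipl_lpars() -> bool:
    return True


RECOVERY_STEPS: List[Step] = [
    ("enable CBUs", enable_cbus),
    ("bring recovery volumes online", bring_volumes_online),
    ("IPL the failed site's LPARs", ipl_lpars),
]

if __name__ == "__main__":
    done, failed = run_recovery(RECOVERY_STEPS)
    print("completed:", done)
    print("failed at:", failed)
```

Keeping the steps in a plain list makes the sequence easy to review during a DR test and easy to extend when the plan changes.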
Some enterprises may also elect to run a small LPAR at the DR site that serves no other purpose than to kick-start the recovery process.
Even with today's tools, DR remains a daunting task. The basics of DR (application triage, backups and detailed planning) are still very important. And, as always, the most important contributor to a successful disaster recovery is still practice, practice and more practice.
ABOUT THE AUTHOR: Robert Crawford has been a CICS systems programmer off and on for 24 years. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.