Today's disaster recovery is about an irresistible force meeting an immovable object. The irresistible force is the continued growth in storage size by at least 50% per year; the immovable object is the IT budget, which is predicted to remain flat or see only single-digit growth in 2010. The result is a frantic search for ways to store more backup information in the same amount of space, year after year.
Enter information lifecycle management (ILM). ILM takes the old mainframe concept of a remote backup copy to be restored when a disaster hits the primary site and adds the concept of separating the mass of online and backup data into "pools" based on metadata associated with "stages in the information's lifecycle" (these definitions are borrowed from Data Protection by David Hill, a former colleague of mine). More specifically, there are three typical pools: an "active changeable" one for just-arrived data that is likely to be changed further (e.g., a hotel reservation record); an "active archive" pool for data that will probably never be changed again ("fixed-content" data); and a "deep archive" pool that corresponds to the old idea of an archive (sending tapes to Iron Mountain or dumping them in a landfill). Each pool is associated with a particular type or set of storage tiers; therefore, each has a different set of policies for disaster recovery. And, of course, the administrator must now deal more often with the business of moving the data from pool to pool as it ages.
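To make the three-pool idea concrete, here is a minimal sketch of how lifecycle metadata might drive pool assignment. The field names (`last_modified`, `last_accessed`) and both age thresholds are assumptions chosen for illustration; they are not taken from Data Protection or from any particular ILM product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Pool(Enum):
    ACTIVE_CHANGEABLE = "active changeable"
    ACTIVE_ARCHIVE = "active archive"
    DEEP_ARCHIVE = "deep archive"


@dataclass
class RecordMetadata:
    last_modified: date   # hypothetical metadata field
    last_accessed: date   # hypothetical metadata field


# Hypothetical thresholds; a real site would derive these from its own SLAs.
FIXED_CONTENT_AFTER = timedelta(days=90)
DEEP_ARCHIVE_AFTER = timedelta(days=3 * 365)


def assign_pool(meta: RecordMetadata, today: date) -> Pool:
    """Pick a pool from the record's age since last access and last change."""
    if today - meta.last_accessed >= DEEP_ARCHIVE_AFTER:
        return Pool.DEEP_ARCHIVE        # nobody has touched it in years
    if today - meta.last_modified >= FIXED_CONTENT_AFTER:
        return Pool.ACTIVE_ARCHIVE      # still read, but effectively fixed-content
    return Pool.ACTIVE_CHANGEABLE       # recently changed, likely to change again


reservation = RecordMetadata(last_modified=date(2009, 12, 20),
                             last_accessed=date(2009, 12, 30))
print(assign_pool(reservation, today=date(2010, 1, 1)).value)
```

Rerunning the same function over the whole data set on a schedule is one simple way to picture the pool-to-pool movement the administrator now has to manage.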
Implementing ILM in a mainframe-dominated data center and disaster recovery site has several interesting effects. ILM can be used to achieve more cost-effective application performance by associating high-performance disk with a much smaller cache of active changeable data (smaller compared to active archive data). It can speed up the backup and recovery of a large mass of data, since backup to and recovery from high-performance disk is a much faster affair -- theoretically, backup/recovery time can be reduced by 62%. It can reduce the amount to be backed up, since active archive data typically only needs to be backed up once. Above all, the metadata necessary to implement ILM provides ever-better knowledge of when the data moves from pool to pool, enabling finer definition of pools within pools and more effective identification of the tiers that will optimize a particular pool, so ILM just keeps improving.
The net effect of ILM is to reduce costs, reduce the amount of storage needed for disaster recovery, and defer spending for scalability upgrades. That is, ILM gives IT perhaps two to three years of breathing room before the pressure of the irresistible force begins to push at the immovable object again.
ILM best practices
However, ILM complicates the job of not only the mainframe storage administrator but also the mainframe database administrator. There are now several pools of storage rather than just one and several service-level agreements (SLAs) instead of one. The ability to fine-tune storage performance depending on the pool has led to a proliferation of tiers, with the addition of high-performance disk and SSDs, nearline tape, and active-archive-optimized SATA disk that is less costly but also lower-performing. Moreover, each tier has its unique "mirror" for disaster recovery. The administrator must determine the size and location of the tiers; he/she must also change that fine-tuning at least yearly, as the mix of data arriving changes. And now that most large enterprises have at least a local and maybe a regional disaster recovery site, they must choose between active-active and active-passive; point-in-time copy and "constant" replication; and raw or de-duplicated data.
Moreover, the type of data that the mainframe handles is changing significantly. The main driver is the advent of Linux on the mainframe. It's true that Linux workloads are often more compute-intensive and less concerned with business transactions, but they're also more likely to deal with email, document handling, Web-interface video/audio/graphics, and similar semi-structured and unstructured megabyte-chunk data. This data is more likely to move quickly to the active archive, but it's also likely to speed the growth of storage size; it's more susceptible to being compressed; and it decreases the amount of parallelization that you can achieve in backup/recovery.
Therefore, mainframe disaster recovery practices in the age of ILM include the following:
- Fitting the backup/recovery technology to the type of pool. Typically, this might mean active-active replication to a high-performance disk on a local site of all active changeable data and some period of active archive data (to be safe); active-passive replication of the same data to a regional facility; and active-passive replication of the rest of the active archive to SATA disk, with some older data sent to nearline tape and, if necessary, older backups kept on tape.
- Proactively defining policies for moving the data from pool to pool -- and including compliance, legal discovery, and governance concerns in those policies. As a rule of thumb, the more fine-grained the ability to define when data becomes fixed-content, the better both the system performance and the disaster recovery speed.
- Anticipating how those policies and technology mixes will change as mainframe data includes more unstructured data. In very broad-brush terms, unstructured data will require higher-speed transfer to local/remote sites; compression before replication (despite the performance hit, if any); and possibly a greater admixture of high-performance disk in the active archive pool, because too great a proportion of fixed-content rather than active changeable data will mean too-slow performance for customers if the secondary site takes over.
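The "typical" arrangement in the first practice above can be written down as a small policy table, which is roughly what a storage administrator would need to maintain per pool. This is only a sketch: the class name `DrPolicy`, the data-class labels, and the "tape copy" mode label are hypothetical, and any detail the text does not state is marked as unspecified rather than guessed.

```python
from dataclasses import dataclass
from typing import Dict, List

UNSPECIFIED = "unspecified in the text"


@dataclass(frozen=True)
class DrPolicy:
    mode: str    # replication style, e.g. "active-active" or "active-passive"
    site: str    # where the copy lives
    media: str   # storage tier holding the copy


# Illustrative mapping only; labels on the left are assumptions for this sketch.
DR_POLICIES: Dict[str, List[DrPolicy]] = {
    "active changeable": [
        DrPolicy("active-active", "local", "high-performance disk"),
        DrPolicy("active-passive", "regional", UNSPECIFIED),
    ],
    # "Some period" of active archive data is treated like active changeable data.
    "recent active archive": [
        DrPolicy("active-active", "local", "high-performance disk"),
        DrPolicy("active-passive", "regional", UNSPECIFIED),
    ],
    "older active archive": [
        DrPolicy("active-passive", UNSPECIFIED, "SATA disk"),
    ],
    "oldest active archive": [
        DrPolicy("tape copy", UNSPECIFIED, "nearline tape"),
    ],
}


def policies_for(data_class: str) -> List[DrPolicy]:
    """Return the disaster recovery policies for one class of data."""
    try:
        return DR_POLICIES[data_class]
    except KeyError:
        raise ValueError(f"no disaster recovery policy defined for {data_class!r}") from None


for policy in policies_for("active changeable"):
    print(policy.mode, "->", policy.site, "/", policy.media)
```

Keeping the mapping as data rather than logic makes the yearly re-tuning described earlier a matter of editing the table, and it gives compliance and governance reviewers one place to look.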
The virtues of compression
There has recently been pushback from users about the virtues of compression and de-duplication. Concerns have been expressed that the technology is immature and therefore risks the data, that compression may have a negative impact on performance, and that in the real world, compression is more likely to be a negligible 20-30% rather than the 60-70% touted by vendors.
All of this may or may not be true, depending on the individual customer. However, I would still recommend that users move into compression and de-duplication technology sooner rather than later, for several reasons.
- As I have indicated, ILM only buys you about two to three years of relief from the pressure to increase storage size. Effective compression, by shrinking storage size per datum, buys you another one to two years -- long enough to allow new database, storage hardware and data-handling software technology to kick in and buy you a couple more years after that.
- As the amount of unstructured data increases relative to structured data (as is now happening with Linux workloads), the percentage of the total that can be compressed increases.
- Compression not only improves backup/recovery speed, but it also improves I/O speed -- there's less to upload. The reason for the performance hit from compression in the old days was that decompression cost CPU time in the typical CPU-intensive application. IBM is now reporting that DB2 works faster with compressed data. Moreover, recently, columnar databases have started to handle database-type data without decompressing it and to find better ways of compressing relational data. The result is credible order-of-magnitude performance increases in large relational data warehouses.
- A neat side effect of compression is a crude additional form of encryption, which increases data security.
But remember, compression is best in the context of ILM so that the SLA of each pool can be factored into the choice of whether to compress or not.
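The "years of relief" argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes that storage compounds at a steady annual rate (50% by default, the growth figure used in this article) and that compression or ILM delivers a one-time proportional reduction; both are simplifications, and the function name `years_of_relief` is made up for this example.

```python
import math


def years_of_relief(reduction: float, annual_growth: float = 0.50) -> float:
    """Years until storage grows back to its size before a one-time reduction.

    reduction:     fraction of storage saved, e.g. 0.30 for 30%
    annual_growth: compound yearly growth rate, e.g. 0.50 for 50%
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    # Solve (1 - reduction) * (1 + annual_growth) ** years == 1 for years.
    return math.log(1 / (1 - reduction)) / math.log(1 + annual_growth)


for saved in (0.20, 0.30, 0.60, 0.70):
    print(f"{saved:.0%} saved buys about {years_of_relief(saved):.1f} years")
```

Under these assumptions, a one-to-two-year reprieve corresponds to shrinking stored data by roughly a third to a little over half, so the measured compression ratio for a given pool matters a great deal when deciding where to apply it.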
The study I cited above also shows that only 10% of IT respondents are doing anything about energy use and emissions in the data center and disaster recovery site. This is short-sighted, if understandable: The same 50% yearly increase in storage size is driving a rapid increase in IT/personal-computer energy use, which already accounts for well over 2% of the world's energy usage. While today's business typically "punts the future" by moving computation to places with less stringent energy limitations, the disaster recovery site often has no such option -- and even if it did, the overall result of all businesses fleeing to the least restrictive spot is simply to make the problem worse.
For that reason, users need to factor into their storage-buying decisions the ability to power down when not active (as tapes do, and as some disk products are beginning to do). Users should actively seek out power-down disks, look for ILM storage-tier designs that integrate with a power-saving data center and begin to apply energy-monitoring tools to their disaster-recovery sites -- and the mainframe is a prime locus of IBM's energy-administration software utilities.
One other thought, and this one's a bit visionary: Effectively, with the addition of compliance responsibilities, the secondary site has become a potential separate online data center -- in walks the lawyer and he wants to see the email and security-camera video records for the last couple of years. At the same time, as I have pointed out in a parallel article, security is becoming not only firewall-based and network-oriented but also information-centric. One of IT's new disaster recovery responsibilities, therefore, will be to identify information security policies for the active archive pool of secondary-site data, which will probably be different from those of the primary site.
Overall, mainframe users and vendors alike should be aware that the fundamental disaster recovery problem remains. The irresistible force (storage size growth) may have been shut off for a while by ILM, compression and energy-saving measures, but at some point it will return, and there is little prospect of the immovable object (budgets) moving in the foreseeable future. I look forward to the next technology leap, four years from now: a perpetual-shrinkage machine.