Build a resilient mainframe system with redundancy, automation

To ensure your mainframe is fault-tolerant, resilient and ready to support a growing number of apps, follow these best practices for hardware allocation, automation and more.

Operations and application teams have different goals. Operations staff is interested in efficient and high-performing systems. Application programmers, on the other hand, want to provide quick and cheap business functionality. In most cases, these goals are irreconcilable, and in this hypercompetitive age, the application viewpoint prevails. Therefore, operations teams must provide resilient, fault-tolerant systems -- including mainframes -- that can handle whatever the application code dishes out.

Parallelism, redundancy and beyond

IBM has touted parallelism and redundancy on the mainframe system since the introduction of the Parallel Sysplex in the '90s. The ideal Parallel Sysplex contains identical system instances, such as logical partitions (LPARs), that share data and coordinate activity through a coupling facility. Incoming online traffic comes through a workload or connection balancer, such as Sysplex Distributor.

This setup provides resiliency through parallelism and redundancy, but admins might need to tweak it to ensure nondisruptive and automatic recovery from issues such as failed address spaces or a dead central processor complex (CPC).

To avoid these problems, consider overallocating hardware on your mainframe system. Rather than run all the processors full-tilt, consider buying enough capacity to run them at 60% utilization or less. That way, if an LPAR fails, even under peak workload, the remaining processors can absorb its work. IT teams can overallocate other hardware components, such as real memory and I/O infrastructure, as well.
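A quick way to sanity-check the 60% guideline is to work out what the survivors would run at after a failure. The sketch below is a simplification that assumes equal-sized LPARs and evenly redistributed work. The function names are hypothetical and do not belong to any IBM tool.

```python
def utilization_after_failure(n_lpars: int, utilization: float, failed: int = 1) -> float:
    """Per-LPAR utilization after `failed` LPARs drop out.

    Assumes equal-sized LPARs and that the lost work is spread evenly
    across the survivors (a simplification of real workload balancing).
    """
    survivors = n_lpars - failed
    if survivors <= 0:
        raise ValueError("at least one LPAR must survive")
    return utilization * n_lpars / survivors


def max_safe_utilization(n_lpars: int, failed: int = 1, ceiling: float = 1.0) -> float:
    """Highest steady-state utilization that keeps survivors at or below `ceiling`."""
    survivors = n_lpars - failed
    if survivors <= 0:
        raise ValueError("at least one LPAR must survive")
    return ceiling * survivors / n_lpars


if __name__ == "__main__":
    for n in (2, 3, 4):
        after = utilization_after_failure(n, 0.60)
        print(f"{n} LPARs at 60%: survivors run at {after:.0%}, "
              f"safe steady-state limit is {max_safe_utilization(n):.0%}")
```

Under these assumptions, three LPARs at 60% leave the two survivors at about 90% after one failure. With only two LPARs, 60% is already too high, because the single survivor would need 120%.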

In addition to overallocation, use two CPCs, even if your entire workload fits on one. Another option is to double up on direct access storage devices (DASD) to take advantage of automated recovery capabilities, such as IBM's HyperSwap or EMC's AutoSwap.

Run more CICS instances than you need to protect a workload against bad application code that might take out a region or to absorb a sudden workload spike during LPAR failures. You must configure the OS and subsystems to withstand sudden workload surges. If the OS runs out of common virtual storage or CICS reaches its maximum task limit, it doesn't matter that a surviving LPAR has all of the CPU that it needs.

The role of automation and monitoring in a mainframe

If mainframe system monitoring tools find an issue, automated processes or alarms should notify IT. In addition, IBM Health Checker for z/OS is built into the OS and flags settings that deviate from best practices for performance and availability.

A poorly written or lightly tested application can be disruptive and create infinite loops, database lockouts and memory leaks. Programmers used to have the luxury of gathering diagnostic data and taking problems back to the application team for repair. But now, the application programmer needs to focus on functionality and move on to the next task before the bad code hits production.

Programming best practices

To prevent shared resources from locking up a Sysplex, use best programming practices. For example, programs should always serialize resources in the same order to avoid lockouts and deadly embraces. To process workloads in the proper order and avoid withholding resources from more important tasks, evaluate discretionary workload policies to make sure that if a lock occurs, enough resources are present to continue the other tasks. During busy periods, separate shared resources, such as user catalogs and data hotspots. A partitioned database can also be helpful. If you maintain a small LPAR for software cost containment, make sure you have a backup for it.
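The same-order serialization rule can be illustrated with a small sketch. It uses Python threads and locks as a stand-in for mainframe serialization, not the z/OS mechanisms themselves. The `Resource` class, the helper names and the resource names are hypothetical, and the sketch assumes each resource has a unique name and is requested only once per call.

```python
import threading


class Resource:
    """A shared resource guarded by its own lock (hypothetical stand-in)."""

    def __init__(self, name: str):
        self.name = name
        self.lock = threading.Lock()


def acquire_in_order(*resources: Resource) -> list:
    """Lock resources in one global order (by name), whatever order the caller used."""
    ordered = sorted(resources, key=lambda r: r.name)
    for resource in ordered:
        resource.lock.acquire()
    return ordered


def release_all(held: list) -> None:
    """Release locks in reverse order of acquisition."""
    for resource in reversed(held):
        resource.lock.release()


def update_both(first: Resource, second: Resource, log: list, label: str) -> None:
    """Do some work that needs both resources at once."""
    held = acquire_in_order(first, second)
    try:
        log.append(label)  # placeholder for the real update
    finally:
        release_all(held)


if __name__ == "__main__":
    accounts, orders = Resource("ACCOUNTS"), Resource("ORDERS")
    log = []
    # The two threads name the resources in opposite orders, the classic
    # recipe for a deadly embrace, but both lock ACCOUNTS before ORDERS.
    t1 = threading.Thread(target=update_both, args=(accounts, orders, log, "t1"))
    t2 = threading.Thread(target=update_both, args=(orders, accounts, log, "t2"))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(sorted(log))
```

Because every caller ends up locking in the same sequence, no thread can hold the second resource while waiting for the first, which is the condition a deadly embrace requires.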

With little attention paid to application bugs or maintenance, IT operations teams resort to automatically canceling bad actors before they hurt the mainframe system. Finding the correct threshold is tricky; it often comes down to a gut feeling for when a unit of work consumes more than a "reasonable" amount of resources. However, automatic cancellation isn't a long-term fix; sooner or later, someone must look into the bugs.
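One way to make the "reasonable" threshold less of a gut feeling is to derive it from each transaction's own history. The sketch below is only one possible heuristic, not an established rule: it flags work that has used more than a multiple of the historical median CPU time. The multiplier, the floor and all names are assumptions chosen for illustration, and the data would have to come from whatever monitor the site already runs.

```python
from statistics import median


def cancel_threshold(history: list, multiplier: float = 10.0, floor: float = 1.0) -> float:
    """CPU-seconds limit for a transaction, based on its past runs.

    `multiplier` and `floor` are illustrative defaults, not recommendations.
    With no history, only the floor applies.
    """
    if not history:
        return floor
    return max(floor, multiplier * median(history))


def find_bad_actors(running: dict, history_by_name: dict,
                    multiplier: float = 10.0, floor: float = 1.0) -> list:
    """Return IDs of running units of work that exceed their threshold.

    `running` maps a work ID to (transaction name, CPU seconds used so far).
    """
    flagged = []
    for work_id, (name, cpu_seconds) in running.items():
        limit = cancel_threshold(history_by_name.get(name, []), multiplier, floor)
        if cpu_seconds > limit:
            flagged.append(work_id)
    return flagged


if __name__ == "__main__":
    history = {"PAYROLL": [0.2, 0.3, 0.25]}
    running = {
        "T1": ("PAYROLL", 0.4),   # normal
        "T2": ("PAYROLL", 30.0),  # likely looping
        "T3": ("NEWTXN", 0.5),    # no history, judged against the floor
    }
    print(find_bad_actors(running, history))
```

Keeping a record of what was flagged, rather than only canceling it, also gives the application team the evidence it needs when someone eventually looks into the bugs.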

Automatic recovery applies to system components as well. For instance, a CICS region that runs without storage protection is vulnerable to application overlays. The worst-case scenario is an overlay that renders CICS dysfunctional but doesn't bring it down. As a result, a zombie region can keep accepting work that immediately fails, and the condition can last for hours. To protect against this, create automation to detect and cancel damaged regions.
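Detection of a zombie region can be sketched as a sliding-window check on task outcomes: a region that is still accepting work but failing nearly all of it is a candidate for cancellation. The class below is a minimal illustration in Python. The window size, minimum sample count and failure ratio are assumed values, and how outcomes are collected from the region is left to the site's monitoring tools.

```python
from collections import deque


class RegionHealth:
    """Tracks recent task outcomes for one region (hypothetical helper)."""

    def __init__(self, window: int = 100, min_samples: int = 20,
                 max_failure_ratio: float = 0.9):
        self.outcomes = deque(maxlen=window)  # True = task succeeded
        self.min_samples = min_samples
        self.max_failure_ratio = max_failure_ratio

    def record(self, succeeded: bool) -> None:
        """Add one task outcome; the oldest falls out once the window is full."""
        self.outcomes.append(succeeded)

    def is_zombie(self) -> bool:
        """True when enough recent tasks exist and nearly all of them failed."""
        if len(self.outcomes) < self.min_samples:
            return False  # too little evidence to act on
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) >= self.max_failure_ratio


if __name__ == "__main__":
    region = RegionHealth()
    for _ in range(50):
        region.record(True)
    print(region.is_zombie())   # healthy traffic
    for _ in range(100):
        region.record(False)    # every new task fails immediately
    print(region.is_zombie())   # window is now all failures
```

The minimum sample count keeps a handful of failures on a quiet region from triggering a cancel, which matters because canceling a healthy region is itself an outage.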
