Efficiently managing z/OS WLM

IBM's z/OS Workload Manager (WLM) is designed for the best response time, but it doesn't always produce balanced workloads. This tip delves into z/OS WLM efficiency.

IBM introduced Workload Manager (WLM) in the '90s as the next-generation workload and performance management tool. At the Sysplex level, WLM works in concert with workload portals, such as Sysplex Distributor (SD) and VTAM Generic Resources, to send incoming work to the LPAR with the most white space. While this works well at the global level, once a unit of work lands on a subsystem, that subsystem's pursuit of a different kind of efficiency can cause unintended results.

On the LPAR
When work arrives at an LPAR, it is at the mercy of the subsystem to which it's destined. For the various subsystems, the definition of efficiency seems to center around response time, instruction path length and I/O avoidance.

This, by the way, has been a recent subject of some backpedalling and redefinition. For instance, last year, a CICS representative prefaced her remarks with (in paraphrase), "CICSplex System Manager [CPSM] dynamic transaction routing was not meant to be a workload balancer. Its job is to get the work to the region best able to process the transaction in the least amount of time." In this case, "least amount of time" means CICS tries to keep the workload on the same LPAR to avoid CPU and I/O.

Similarly, MQ support recently backed off the previous idea that using shared queues naturally balanced workloads through trigger interface message pulls. This is explained very well in the Orwellian documentation APAR PK60692 that outlines the new process. This new strategy (with the fabulous name "fast put to waiting getters") feeds messages directly to applications with outstanding get waits before putting the work up for grabs in the coupling facility (CF). This avoids a lot of overhead but tends to push more work through the queue manager that receives the most messages.
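The decision logic can be sketched in a few lines. This is a toy model for illustration only, not MQ's actual implementation; the class and attribute names are invented:

```python
from collections import deque

class QueueManager:
    """Toy model of one queue manager in a queue sharing group."""
    def __init__(self, name, shared_queue):
        self.name = name
        self.waiting_getters = deque()    # local apps with outstanding gets
        self.shared_queue = shared_queue  # stands in for the CF structure
        self.delivered_locally = 0

    def put(self, message):
        # "Fast put to waiting getters": hand the message straight to a
        # local getter if one is waiting, bypassing the coupling facility.
        if self.waiting_getters:
            getter = self.waiting_getters.popleft()
            getter(message)
            self.delivered_locally += 1
        else:
            # No local getter: the message goes to the shared queue,
            # where any queue manager in the group can retrieve it.
            self.shared_queue.append(message)
```

The skew follows directly from the first branch: the more messages a queue manager receives while its local applications keep gets outstanding, the more work stays on that LPAR and never reaches the CF.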

IMS, for its part, places less emphasis on routing and maintains its renowned efficiency by processing work in parallel on the local LPAR through common storage.

The consequences
The unintended consequences are pretty easy to understand. For instance, suppose you have a workload connected through SD to IMS. The system that WLM recommends at 3 a.m. may not be the best place to run later in the day. This means "sticky" applications that tend to maintain long sessions with the host may get crowded onto an LPAR unable to sustain the load. The Parallel Sysplex answer to this problem is to configure each IMS subsystem to be able to process the entire workload. While this is efficient considering response time and network traffic, it comes at the cost of CF activity, virtual storage and systems management.

Another example would be a CICSPlex spread over several LPARs driven by shared MQ queues from a group of distributed servers. The local queue manager receiving an incoming message will attempt to give the work to a local CICS that has an outstanding get. This works out well as long as the queue managers on the different LPARs receive roughly the same number of messages. But if communication between one distributed server and a host queue manager goes down, the traffic will be asymmetric, as the LPAR with the disabled channel will get fewer messages. CICS' workload will reflect the imbalance as the LPARs with full complements of channels drive higher transaction rates. Again, one answer would be to create enough regions and configure each LPAR to handle the entire load. In this case, the shorter response time and instruction path will drive higher CPU rates and real storage usage.
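A quick simulation makes the imbalance concrete. This toy model, with invented names and numbers, assumes the distributed servers spread their puts evenly across whichever host queue managers they can still reach, and that fast put keeps each message on the receiving LPAR:

```python
import random

def distribute(messages, channels_up):
    """Toy model: spread puts across reachable queue managers; fast put
    keeps each message on the LPAR that received it. Returns per-LPAR
    message counts."""
    counts = {qm: 0 for qm in channels_up}
    reachable = [qm for qm, up in channels_up.items() if up]
    for _ in range(messages):
        counts[random.choice(reachable)] += 1
    return counts

# With all channels up, the three LPARs each see roughly a third of the
# traffic. Take one channel down and its LPAR's CICS regions go idle
# while the other two absorb the entire load.
```

Under these assumptions, the LPAR with the disabled channel receives nothing at all, which is exactly the transaction-rate imbalance described above.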

Inefficiency is the answer
I don't think most customers can afford to configure each LPAR to handle entire workloads. Instead, most customers define efficiency in terms of hardware and software costs. They would also like to be able to manage workloads directly instead of hoping they'll reach the right balance by manipulating WLM goals or LPAR weights. To achieve these goals, they introduce some inefficiency of their own.

In the IMS case, an enterprise may use shared message queues where the IMS that receives a message puts it onto the CF, where any other control region can get it. Thus, the workload is distributed based on which IMS can pull a message first. Another answer may be to configure the clients to break and reconnect to IMS every now and then. Ideally, when a client reconnects to IMS, SD bases its decision on more current WLM information. Thus, the workload may migrate from LPAR to LPAR depending on current whitespace percentages.

The MQ decision to drive local CICS triggers could be short-circuited through the classic strategy of dynamic transaction routing (DTR). In this scheme, the CICS consuming the MQ triggers would start other transactions to be routed to application owning regions (AORs) on various LPARs.
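The routing decision at the heart of DTR reduces to picking the best target region. This sketch assumes a simple least-loaded rule over a hypothetical active-task metric; real CPSM routing weighs more factors:

```python
def route_transaction(aors):
    """Toy dynamic-routing decision: send the transaction to the AOR with
    the most spare capacity, regardless of which LPAR the MQ trigger
    fired on. `aors` maps region name -> current active task count."""
    # min() over task counts picks the least-loaded region.
    return min(aors, key=aors.get)
```

The trigger-consuming CICS would make a call like this for each message, trading the locality MQ worked to preserve for a Sysplex-wide spread of the work.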

Counterintuitively, IBM customers end up saving money even though they may experience elongated response times, network thrash, higher CPU, elevated CF activity and I/O. They get the savings from the ability to run asymmetric hardware configurations with more confidence that the workloads will be balanced as necessary. The ultimate solution, of course, would be for IBM to provide options for customers so they can start working with the subsystems instead of against them.

ABOUT THE AUTHOR: For 24 years, Robert Crawford has worked off and on as a CICS systems programmer. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.

What did you think of this feature? Write to SearchDataCenter.com's Matt Stansberry about your data center concerns at [email protected].
