Capacity planning strained by diverse workloads
Capacity planning is a well-understood, strategic exercise. Planners look at Resource Measurement Facility (RMF) reports where the workload is divided up in ways that make sense for their enterprise. It is then relatively easy to use IBM's MIPS charts and stack the workloads in processors of appropriate size with a little left over for paydays.
That's not to say it's trivial; there are thousands of dollars on the line, and each new application implementation or system software configuration change brings the possibility of emergency hardware upgrades. Luckily for us, most of the time things ramp up smoothly and enterprises are able to buy the CPU when they need it.
But what happens during that odd ten-second interval when sensitive workloads suddenly choke and set off alarms?
Most of today's mainframe shops run diverse workloads. Besides the individual quirks and requirements of individual applications, some run more than one DBMS or transaction processor. Performance analysts who deal with tactical performance issues must pluck "loved ones" out of the mess and ensure the darlings get the juice they need.
A performance capacity planning scenario
The problem can be further exacerbated by having too many top priority workloads. No matter how high the priority, if you have more than one top dog someone will have to wait when all the processors are busy.
For instance, picture an LPAR running two high-priority workloads, A and B. A is a heavy-duty transaction manager that steadily grinds away, doing most of the heavy lifting of the enterprise. B does much less work but back-ends the web and runs at a slightly lower priority than A. Together they are the highest priority work in the LPAR. Things run pretty smoothly with overall processor utilization running in the 80s. Then suddenly, B experiences five to ten seconds of slow response. Since B is the back end to your company's web site, management wants answers and wants them now. Don't laugh; it's happened.
The monitors for B show an elongated dispatch time. Here, dispatch time is defined as the time during which elements of the workload are eligible to run and sit on a dispatch chain but have to wait for the CPU. The performance analyst for B therefore concludes that the workload was temporarily processor starved.
On the other hand, RMF shows the CPUs at 90% busy over the minute interval and the job delay panel (DELAYJ) shows B had a healthy workflow rate. Besides, B is defined to Workload Manager (WLM) such that B should be able to get the processor it needs, and RMF says it was available. Even more puzzling, the slowdowns are periodic and appear at nearly the same time every day.
After a few iterations of the problem, the analysts decide to set a trace for workload B. They also begin looking into processor utilization at one-second intervals with RMF Monitor II. After another incident, B's trace shows several gaps of 100 milliseconds or more when B didn't execute. During that same interval, the RMF monitor shows CPU utilization jumping up into the high 90s.
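The gap hunt described above can be sketched in a few lines of Python. This is only an illustration: the timestamps are made up, the function name find_gaps is hypothetical, and a real trace would have to be parsed from whatever format the monitor produces. It assumes each entry is the time, in milliseconds, at which B was seen executing.

```python
def find_gaps(timestamps_ms, min_gap_ms=100):
    """Return (start, length) pairs for every pause of at least min_gap_ms."""
    gaps = []
    # Compare each trace entry with the one that follows it.
    for prev, curr in zip(timestamps_ms, timestamps_ms[1:]):
        if curr - prev >= min_gap_ms:
            gaps.append((prev, curr - prev))
    return gaps


# Hypothetical trace: times (in ms) at which workload B was executing.
trace = [0, 20, 45, 60, 190, 210, 230, 380, 400]
gaps = find_gaps(trace)

for start, length in gaps:
    print(f"B did not run for {length} ms after t={start} ms")
```

With this invented trace the sketch reports two pauses, one of 130 ms and one of 150 ms, which is the kind of evidence that points to short bursts of processor starvation.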
The root cause turns out to be periodic automated processes in workload A that are crowding out B. The spikes in processor usage didn't show in RMF Monitor III because they were too short to register in one-minute intervals. In addition, WLM couldn't make adjustments because the spikes were too quick for its 10-second monitor and update window.
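A little arithmetic shows how a one-minute average can hide a spike like this. The Python sketch below uses invented numbers, not measurements from the incident: it assumes 50 seconds at 89% busy followed by a 10-second burst at 98%, and the function names are hypothetical.

```python
def interval_average(samples):
    """Average utilization over the whole reporting interval."""
    return sum(samples) / len(samples)


def seconds_at_or_above(samples, threshold):
    """Indexes of the one-second samples at or above the threshold."""
    return [i for i, busy in enumerate(samples) if busy >= threshold]


# Hypothetical one-second CPU busy percentages for a single minute:
# 50 quiet seconds followed by a 10-second burst.
one_minute = [89.0] * 50 + [98.0] * 10

average = interval_average(one_minute)
spikes = seconds_at_or_above(one_minute, 97.0)

print(f"One-minute average: {average:.1f}% busy")
print(f"Seconds at 97% or more: {len(spikes)}")
```

Under these assumptions the minute averages out to 90.5% busy, which looks healthy on a one-minute report, while the one-second view shows ten consecutive seconds with almost no processor to spare.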
Strategic versus tactical performance planning
There are several solutions to this problem. The first may be to adjust B's priority so it runs even with A. However, if both run at the same level, whichever workload gets there second will have to wait. Adding another processor will help by giving B another place to run if all the other CPs are busy. Perhaps the best solution is to find another home for B where it can be the solitary king of the hill. This may not be as hard as it sounds. If management is worried enough about 5 to 10 second problems, they should also be motivated to pony up to protect the workload.
The lesson here is to differentiate between strategic and tactical performance planning. It also means we should pay attention to our systems even when it looks like they're not running at full capacity. The good news is the mainframe has the tools to capture detailed data down to millisecond intervals, and we still have MIPS to do long-range planning. The bad news is, given today's business environment, we're going to have to spend more hours searching for the elusive root cause of five-second "problems."
ABOUT THE AUTHOR: Robert Crawford has been a CICS systems programmer off and on for 24 years. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.
This was first published in April 2007