With fewer, more expensive components, wringing every last drop of performance out of a mainframe has always been important. Traditionally, mainframe performance and tuning (P&T) centered on things like instruction paths, I/O avoidance and software configuration. While these are still important, recent generations of mainframe processors make memory locality of reference, which is relatively unpredictable, the dominant factor affecting performance.
Clock speed and memory
Processor clock speed increases with each new hardware generation. IBM's new z196 mainframe boasts a 5.2 GHz processor. While this is certainly quite an improvement on the z10's relatively anemic 4.4 GHz, faster clocks have some consequences.
First, the processor has less time to do things within a single clock cycle. IBM mitigated this in recent models by lengthening the instruction pipeline and adding more stages. This isn't all bad -- more stages mean more things can happen in parallel even if they have to be done in smaller bites.
The second outcome involves accessing data. As clock speed accelerates, the number of cycles the processor must wait to get information out of memory or non-local cache increases. Externally, the wait looks like CPU busy, which means a well-running workload may experience CPU time increases for no other reason than a change in processor model. The catch is figuring out how to measure memory locality of reference.
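To put rough numbers on the second point, here is a minimal Python sketch. The 100-nanosecond latency and the two clock speeds are made-up values chosen for easy arithmetic, not measurements of any IBM processor, and `stall_cycles` is a hypothetical helper name.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Cycles that tick by while the processor waits latency_ns for data."""
    # A clock running at N GHz completes N cycles every nanosecond.
    return latency_ns * clock_ghz


# Assumption: a fixed, illustrative 100 ns trip to memory on both machines.
ASSUMED_LATENCY_NS = 100.0

slower_clock = stall_cycles(ASSUMED_LATENCY_NS, 4.0)  # illustrative 4.0 GHz part
faster_clock = stall_cycles(ASSUMED_LATENCY_NS, 5.0)  # illustrative 5.0 GHz part

print(f"4.0 GHz: {slower_clock:.0f} cycles spent waiting")
print(f"5.0 GHz: {faster_clock:.0f} cycles spent waiting")
```

If the memory does not get faster, the same miss costs more cycles on the faster clock, and all of those cycles show up as CPU busy.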
Measuring mainframe performance using relative nest intensity
To help measure mainframe performance by way of memory locality of reference, IBM came up with the relative nest intensity (RNI) metric. Note that, although the z196 has separate instruction and data level 1 (L1) caches, the discussion below applies the term "data" generically to both.
The first step in understanding RNI is the concept of the nest.
The z196 has four cache levels. Levels 1 and 2 belong to a core. All the cores on a chip share the L3 cache. L4 is local to the book, although a processor can, if necessary, get data from remote L4 cache in another book. Last is main memory, which is shared among all the books in the CEC.
By definition, the nest consists of L3 and L4 cache along with memory, the levels for which the processor encounters the greatest access penalty. For the z196, the calculation is:
RNI = 1.6 * (0.4*L3P + 1.0*L4LP + 2.4*L4RP + 7.5*MEMP) / 100
L3P - Percentage of L1 misses sourced from chip-level L3 cache
L4LP - Percentage of L1 misses pulled from local L4 cache
L4RP - Percentage of L1 misses retrieved from L4 cache in another book
MEMP - Percentage of L1 misses gotten from memory
The numbers in front of each percentage represent the relative penalty for retrieving data from the level. IBM says these numbers are subject to change.
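As a worked example, the z196 formula above can be sketched in a few lines of Python. The weights and the 1.6 scaling factor come straight from the formula as quoted here, so they may not match later IBM revisions. The two sets of miss percentages are invented for illustration and do not come from any real system; `rni_z196` is a hypothetical helper name.

```python
def rni_z196(l3p: float, l4lp: float, l4rp: float, memp: float) -> float:
    """Relative nest intensity from the z196 formula quoted in this article.

    Each argument is a percentage (0-100) of L1 misses sourced from that level.
    """
    weighted = 0.4 * l3p + 1.0 * l4lp + 2.4 * l4rp + 7.5 * memp
    return 1.6 * weighted / 100


# Hypothetical workload with tight locality: few misses go past L3.
tight = rni_z196(l3p=20, l4lp=5, l4rp=1, memp=2)

# Hypothetical workload with scattered references: many misses reach memory.
scattered = rni_z196(l3p=30, l4lp=15, l4rp=5, memp=10)

print(f"tight locality RNI:     {tight:.3f}")
print(f"scattered locality RNI: {scattered:.3f}")
```

With these invented inputs the first workload comes out near 0.49 and the second near 1.82. Most of the gap comes from the memory term, because a miss served from memory carries the heaviest weight.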
From the formula, you've probably concluded that a higher index denotes a workload spending a lot of time waiting for data. A lower number indicates a tight locality of reference and a workload making effective use of the processor. A high RNI may also indicate a system experiencing a lot of competition for memory at the cache level, as opposed to real storage. Again, the time an instruction waits for data will be measured as CPU busy. Thus, the very same workload may show a different CPU profile based solely on how busy the system is and how much competition there is for cache.
Most tuners are comfortable with mainframe performance factors, such as I/O and instruction path. RNI, however, is something relatively new and difficult to change. IBM has a few tips:
- IBM created the L1 instruction cache under the assumption it would rarely need updates. Therefore, there is a substantial penalty for any program that modifies itself or any storage that might get loaded into instruction cache. This is old advice from when IBM introduced z/Architecture.
- Customers should turn on HiperDispatch. With HiperDispatch, the mainframe hypervisor PR/SM collaborates with the z/OS dispatcher to keep workloads on the same physical processor. The hope is that matching workloads with processors will result in better cache use, especially on the local chip.
- Use link pack area (LPA) modules where possible. LPA modules are shared across all the address spaces in the LPAR. Therefore, once LPA code makes it into the L1 cache, many different address spaces should be able to use it.
- Rewrite programs for better locality of reference. This is probably the most difficult option. It may also be impossible in systems that use dynamic storage structures or address data outside of their own program.
- Reduce the number of address spaces in an LPAR. Fewer address spaces mean less competition for memory, especially at the cache level.
This last recommendation is perhaps the hardest to swallow. Online subsystems, such as CICS and IMS, tend to spread horizontally for mainframe performance and availability. Pressure to shorten batch windows also encourages customers to run as many jobs in parallel as possible. As always, the answer seems to be "your mileage may vary," and each installation will have to find its own sweet spot between memory reference and parallelism.
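Of the tips above, rewriting for better locality of reference is the easiest to illustrate in code. The Python sketch below is a toy model, not a measurement: it walks a table stored row by row and counts how often consecutive references land on a different cache line. The 8-byte element size and 256-byte line size are assumptions made for the example, and `line_switches` is a hypothetical helper name.

```python
def line_switches(rows: int, cols: int, row_by_row: bool,
                  element_bytes: int = 8, line_bytes: int = 256) -> int:
    """Count how often a walk over a row-stored table moves to a new cache line."""
    if row_by_row:
        # Visit elements in the same order they sit in storage.
        order = ((r, c) for r in range(rows) for c in range(cols))
    else:
        # Visit column by column, jumping a whole row ahead on every step.
        order = ((r, c) for c in range(cols) for r in range(rows))

    switches = 0
    previous_line = None
    for r, c in order:
        address = (r * cols + c) * element_bytes  # offset within the table
        line = address // line_bytes              # cache line holding that offset
        if line != previous_line:
            switches += 1
            previous_line = line
    return switches


sequential = line_switches(64, 64, row_by_row=True)
strided = line_switches(64, 64, row_by_row=False)
print(f"row-by-row walk:       {sequential} line changes")
print(f"column-by-column walk: {strided} line changes")
```

Under these assumptions the row-by-row walk changes lines 128 times, while the column-by-column walk changes lines on every one of its 4,096 references. The data and the amount of work are identical; only the order of reference differs, and that is what the rewrite would change.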
ABOUT THE AUTHOR: For 24 years, Robert Crawford has worked off and on as a CICS systems programmer. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.