Application tuning is a frustrating business. Changes that perform well in test have no effect when moved to production.
A small modification to a program that isn't called very often suddenly inflates a batch job's elapsed time. Then there are the hours spent staring at CPU graphs whose values jump up and down for no discernible reason. All this raises the question: how can machines built to perform the same operations the same way in the same order seem so random?
Mainframe application performance can vary based on processor clock speed, the type of instruction being executed, and the potential for out-of-sequence instructions in the pipeline.
Processor clock speed
A principal -- and often misleading -- measure of processor speed is its clock. The assumption is that a faster processor's clock will have a higher frequency. This is true until you think about what the clock actually does.
The processor clock controls and synchronizes processor operations. It reminds me of when we ran wind sprints in P.E. We would all line up at one end of the field, and when the coach whistled we would run until he blew his whistle again. At that point we would freeze until the next signal. The operations inside the processor work on the same logic. At the start of a clock signal, data moves around, bits are tested or signals are raised. At the end of the cycle, everything should end up where it's going to be, awaiting the next interval.
Knowing this, it becomes clear that some of the randomness we see in performance measurement comes from the fact that what we call CPU time is not measured by the number of instructions executed, but rather by the number of clock cycles taken to complete an operation -- and the number of cycles may vary for many reasons.
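As a rough illustration of that accounting, the sketch below turns cycle counts into CPU seconds. The clock frequency and the per-instruction cycle counts are invented numbers, and cpu_seconds is a hypothetical helper, not something the hardware or SMF exposes.

```python
# Assumed clock frequency, for illustration only.
CLOCK_HZ = 4.4e9


def cpu_seconds(cycles_per_instruction):
    """Charge CPU time by total cycles, not by instruction count."""
    return sum(cycles_per_instruction) / CLOCK_HZ


# Two hypothetical runs of the same five instructions. In the second run
# one instruction happens to need far more cycles (invented values).
run_a = [1, 1, 2, 1, 1]
run_b = [1, 1, 2, 60, 1]

print(len(run_a), cpu_seconds(run_a))
print(len(run_b), cpu_seconds(run_b))
```

Both runs execute the same number of instructions, yet the second is charged for many more cycles and therefore more CPU time.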
For instance, some instructions take more cycles than others. A load address (LA) instruction is fairly straightforward, as it just has to add two registers together with an offset and put the answer in another register. A move character long (MVCL) instruction, on the other hand, is a miniature program in its own right. Its steps, paraphrased, are:
- Determine if the program has access to the source and destination addresses.
- If the source length is zero, move the pad character to the destination address. Otherwise move the next source byte to the destination area.
- Decrement the destination length register. If it is zero, the instruction is done.
- Decrement the source length register if it is non-zero.
- Increment the source and destination address registers.
- Go back to step one.
Not only is the instruction more complicated, but a MVCL's execution time is proportional to the amount of data it moves. By extension, the CPU time reported in System Management Facility (SMF) for a program also depends on the amount of data it slings around.
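The paraphrased steps can be sketched as an ordinary loop. This is a toy model only: the function name, its parameters and the bytearray standing in for storage are assumptions made for illustration, and access checking, overlap detection and interruption are left out. In this sketch the source address advances only while source bytes remain.

```python
def mvcl(memory, dst, dst_len, src, src_len, pad):
    """Toy model of the paraphrased MVCL loop.

    memory is a bytearray; dst and src are offsets into it. Returns the
    number of loop iterations as a stand-in for the work performed.
    """
    iterations = 0
    while dst_len > 0:
        if src_len == 0:
            memory[dst] = pad          # source exhausted: store the pad byte
        else:
            memory[dst] = memory[src]  # copy the next source byte
            src += 1
            src_len -= 1
        dst += 1
        dst_len -= 1
        iterations += 1
    return iterations


# Copy a 5-byte source into an 8-byte destination, padding with blanks.
storage = bytearray(b"HELLO" + bytes(8))
steps = mvcl(storage, dst=5, dst_len=8, src=0, src_len=5, pad=ord(" "))
print(bytes(storage[5:13]), steps)
```

The iteration count tracks the destination length, which is the proportionality described above: move twice as many bytes and the loop runs twice as long.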
Z architecture machines have an additional twist, as the processors use pipelines in an attempt to parallelize as much work as possible. This means an instruction or pieces of multiple instructions may be processing at the same time and out of order. This works very well until the processor reaches the point where one instruction depends on the results of another.
Say, for instance, a program has an LA instruction followed by a move character (MVC) that uses the value computed in the first instruction as one of its bases. Obviously the processor cannot execute the MVC until the LA completes, which brings things to a temporary stop. But even though there's a small pause in the processor, the clock still ticks, so the cycles spent waiting are counted as another quantum of CPU.
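A toy cycle counter makes the cost of that dependency visible. It assumes a drastically simplified pipeline in which one instruction may start per cycle, in program order, and must wait until the registers it reads are ready. The latencies, register names and the cycles_to_finish helper are all invented for illustration and do not describe any real processor.

```python
# Invented result latencies in cycles; real values depend on the machine.
LATENCY = {"LA": 3, "MVC": 5}


def cycles_to_finish(program):
    """program is a list of (opcode, target_register, source_registers)."""
    ready_at = {}    # register -> cycle at which its new value is usable
    next_start = 0   # earliest cycle the next instruction may start
    finish = 0
    for opcode, target, sources in program:
        # Stall until every register this instruction reads is ready.
        start = max([next_start] + [ready_at.get(reg, 0) for reg in sources])
        done = start + LATENCY[opcode]
        if target is not None:
            ready_at[target] = done
        next_start = start + 1
        finish = max(finish, done)
    return finish


# MVC uses R5, which the LA is still computing: the MVC has to wait.
dependent = [("LA", "R5", ["R1", "R2"]), ("MVC", None, ["R5", "R6"])]
# Same two instructions, but the MVC uses an unrelated base register.
independent = [("LA", "R5", ["R1", "R2"]), ("MVC", None, ["R7", "R6"])]

print(cycles_to_finish(dependent), cycles_to_finish(independent))
```

Under these made-up latencies the dependent pair finishes two cycles later than the independent pair, and those extra cycles are charged as CPU even though no useful work happened during them.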
Thus, we understand instruction order impacts CPU time. Some of the smarter compilers and Assembler programmers know this.
In addition to the instruction order, there is also processor cache. Z10 machines have levels 1, 1.5 and 2 cache, which take progressively longer to access from the processor's point of view. Then there's main memory, which, at CPU speeds, is nearly the equivalent of an I/O. Obviously it's nice if a program's favorite bytes are in level 1 cache. But, given limited space on a busy machine that spends a lot of time switching context (as mainframes are wont to do), this is not always possible. There will be times when the CPU pipeline must wait to fetch data from cache or memory, and each pause will be accounted for in clock cycles. Therefore, locality of reference also impacts CPU time because the same instruction may use varying amounts of processor time, depending on where its data is.
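To put rough numbers on locality of reference, the sketch below totals the cycles spent on the same number of data fetches under two placements of the data. The per-level access costs are invented for illustration and are not measured Z10 figures; fetch_cycles is a hypothetical helper.

```python
# Invented access costs in cycles for each level; not real Z10 timings.
ACCESS_CYCLES = {"L1": 1, "L1.5": 4, "L2": 40, "memory": 300}


def fetch_cycles(fetches_by_level):
    """Total cycles spent waiting on data, given where each fetch was served."""
    return sum(ACCESS_CYCLES[level] * count
               for level, count in fetches_by_level.items())


# The same 1,000 fetches: once with everything in level 1 cache, once
# after context switches pushed some of the data further away.
hot = {"L1": 1000}
cold = {"L1": 900, "L1.5": 50, "L2": 40, "memory": 10}

print(fetch_cycles(hot), fetch_cycles(cold))
```

With these assumed costs, sending only 1% of the fetches to main memory accounts for more than half of the cycles in the second case, which is why the same code can report very different CPU times from run to run.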
Finally, the dispatcher can interrupt a program at any time. When the interruption occurs, the system spends lots of cycles handling the interrupt and redirecting the instruction flow to the proper handlers. Some of this may get charged to the victim depending on the type of interrupt and what needs to be done.
There are many reasons why the deterministic operation of a computer can look awfully random, and I can tell you from personal experience it will drive you nuts. However, tuning is not impossible; it's just a matter of eliminating the noise.
For anyone interested in knowing more about Z processor architecture, I highly recommend Robert Rogers' SHARE presentation, "How Do You Do What You Do When You're a CPU." Robert manages to clearly explain Z's pipeline architecture and its impact on performance. Anyone with access to www.share.org can find the presentation in the proceedings section. I have not been able to find it on the open Web.
ABOUT THE AUTHOR: For 24 years, Robert Crawford has worked off and on as a CICS systems programmer. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.