IBM has been talking a lot lately about a relatively new z10 feature called the CPU Measurement Facility (C-MF).
C-MF captures some very detailed processor-related information that can be used for debugging and performance measurement. It consists of two pieces: sampling and counters.
Sampling with the CPU Measurement Facility
The sampling component in the CPU Measurement Facility seems to operate like a hardware version of software performance sampling tools like Strobe or TriTune. In either case, the operation is basically the same: Something wakes up every so often and records program status word (PSW) information. After collecting a couple thousand of these samples, a follow-on process compares the saved PSW instruction addresses against an address space map to figure out where a workload spends most of its time.
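The follow-on matching step described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Strobe, TriTune or C-MF actually implement it; the module names, address ranges and sampled addresses are all made up.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical address space map: (start, end exclusive, module name).
ADDRESS_MAP = [
    (0x1000, 0x2000, "MODA"),
    (0x2000, 0x5000, "MODB"),
    (0x8000, 0x9000, "MODC"),
]

def attribute_samples(samples, address_map):
    """Count how many sampled PSW instruction addresses land in each module."""
    ranges = sorted(address_map)
    starts = [start for start, _, _ in ranges]
    hits = Counter()
    for addr in samples:
        # Find the last range whose start is <= addr, then check its end.
        i = bisect_right(starts, addr) - 1
        if i >= 0 and addr < ranges[i][1]:
            hits[ranges[i][2]] += 1
        else:
            hits["(unmapped)"] += 1
    return hits

def hot_spots(hits):
    """Return (module, share of samples) pairs, busiest module first."""
    total = sum(hits.values())
    if total == 0:
        return []
    return [(name, count / total) for name, count in hits.most_common()]

# Made-up sample addresses; a real run would have a couple thousand.
samples = [0x1004, 0x2010, 0x2F00, 0x4FFE, 0x6000, 0x8100, 0x2222]
hits = attribute_samples(samples, ADDRESS_MAP)

for name, share in hot_spots(hits):
    print(f"{name:12s} {share:6.1%}")
```

The module with the largest share of samples is where the workload most likely spends its time, subject to the usual sampling error.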
C-MF does this at the logical CP level. In every interval, defined as some number of clock cycles, C-MF stores the PSW instruction address along with the dynamic address translation, problem state, wait state and address space control bits. The information goes into a linked list of table entries with the help of two control registers, the table entry address register and the data entry address register. When a block fills, the sampling facility raises an interrupt, at which point, presumably, something saves the information somewhere else.
In addition to the abovementioned basic information, the sampling component will, if requested, save “diagnostic” information. IBM is somewhat vague on what constitutes diagnostic data -- we are only warned that it is model-dependent.
There are a couple of things that make this feature less interesting. First, the captured instruction address may sometimes be indeterminate given the z10’s pipelined processor. Second, the documentation plainly states that sampling is not meant for ordinary use and can only be authorized by IBM technical personnel. Then, as always, IBM warns us to carefully pick the sampling interval, as this facility may have a significant impact on performance.
Maybe in the future, when IBM puts suitable controls around sampling, it will be available to the rest of us.
CPU Measurement Counter Facility
As the name says, this facility maintains sets of counters that record processor events. There are sets of counters per processor and another set for global or book-wide events. The number and types of counters vary with the model, but there is a special register that provides the counter version number.
Version level 1 (L1) comes with these counters:
- Basic counters. These counters include the number of clock cycles since the last sample as well as the total number of instructions executed. In addition, there are counts of writes to the L1 instruction and data caches, along with the number of cycles the CPU waited for data to be promoted to L1 cache.
- Problem state counters. This set contains the same counters as above except the events are recorded when the processor is in problem state.
- Crypto activity counters. These counters record the different sorts of cryptographic activities. Another helpful set of counters, one per function, shows how long a processor waited because the crypto engine was busy.
- Extended counter set. As of this writing, this set is model-dependent and “undefined in this architecture.” One assumes this area is set aside for future growth.
Doing some math on these counters yields some more interesting numbers. Dividing the cycles by the number of instructions yields the average number of cycles per instruction. While this number is not immediately useful, it may be a good benchmark for how a workload changes over time. It can also prove how the very same code runs differently depending on the time of day, what else is running or which machine it’s on.
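As a rough sketch of that arithmetic, the Python below computes cycles per instruction for two intervals. The interval labels and counter values are invented for illustration, and the code assumes the counters have already been reduced to per-interval deltas.

```python
def cycles_per_instruction(cycles, instructions):
    """Average cycles per instruction, or None if nothing was executed."""
    if instructions == 0:
        return None
    return cycles / instructions

# Hypothetical (cycles, instructions) deltas for two measurement intervals.
intervals = {
    "02:00 batch": (8_000_000, 2_000_000),
    "14:00 online": (9_000_000, 1_500_000),
}

cpi_by_interval = {
    label: cycles_per_instruction(cycles, instructions)
    for label, (cycles, instructions) in intervals.items()
}

for label, cpi in cpi_by_interval.items():
    print(f"{label}: {cpi:.2f} cycles per instruction")
```

Tracking this ratio over days or weeks, rather than reading a single value, is what makes it useful as a benchmark.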
The buckets recording L1 cache writes indicate a workload’s memory locality of reference, which is now an important performance metric. Knowing the number of cycles each cache write took, one can calculate the percentage of time the processor spent just waiting for data.
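The waiting-time calculation might look something like the following Python sketch. The field names and numbers are hypothetical, not the actual counter names, and the code assumes that instruction-cache and data-cache wait cycles do not overlap, which may not hold on real hardware.

```python
from dataclasses import dataclass

@dataclass
class BasicCounters:
    """Hypothetical per-interval values from the basic counter set."""
    cycles: int
    instructions: int
    l1i_writes: int       # writes into the L1 instruction cache
    l1i_wait_cycles: int  # cycles spent waiting on those writes
    l1d_writes: int       # writes into the L1 data cache
    l1d_wait_cycles: int  # cycles spent waiting on those writes

def cache_wait_percent(c):
    """Percentage of all cycles spent waiting for L1 cache fills."""
    if c.cycles == 0:
        return 0.0
    return 100.0 * (c.l1i_wait_cycles + c.l1d_wait_cycles) / c.cycles

def avg_wait_per_write(wait_cycles, writes):
    """Average wait cycles per cache write, or 0.0 if there were no writes."""
    return wait_cycles / writes if writes else 0.0

# Made-up numbers for one interval.
sample = BasicCounters(
    cycles=10_000_000,
    instructions=2_500_000,
    l1i_writes=20_000,
    l1i_wait_cycles=600_000,
    l1d_writes=50_000,
    l1d_wait_cycles=1_900_000,
)

print(f"waiting on cache: {cache_wait_percent(sample):.1f}% of cycles")
print(f"avg data-cache wait: "
      f"{avg_wait_per_write(sample.l1d_wait_cycles, sample.l1d_writes):.1f} cycles")
```

A rising wait percentage for the same workload would suggest its locality of reference is getting worse.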
The crypto activity counters point out how much cryptography a workload is doing and which functions are the most popular. The number of blocked cycles will present trends for crypto-engine use and tell you when the engines are getting too busy.
Eventually all this information gets dumped into Systems Management Facility (SMF) type 113 records for post processing.
IBM spends some time explaining that these counters are for statistical estimation only. They may outline a workload but not accurately describe how it works. IBM also warns us to use the facility sparingly, as it may impact performance.
Another problem is the counter’s granularity. At the processor level there’s no indication of what specific job or task drove the activity. After all, it would be nice to know if a particular batch job runs slowly because it can’t keep working storage inside of L1 cache or a CICS transaction’s CPU went up because it’s using longer records.
Both CPU Measurement Facility components allow a tantalizing glimpse inside of the mainframe processor. However, their usefulness will be limited for at least a little while.
What did you think of this feature? Write to SearchDataCenter.com's Matt Stansberry about your data center concerns at firstname.lastname@example.org.