IBM's zEC12 includes plenty of goodies for the "big iron" faithful. One particularly interesting feature is a new mainframe monitoring tool called zAware.
According to IBM, zAware, which stands for z Advanced Workload Analysis Reporter, is an out-of-band, integrated, self-learning analytic tool. In essence, zAware analyzes system (OPERLOG) messages and learns what is normal for a given z/OS system. With that baseline, zAware monitors OPERLOG and recognizes when unusual things happen. Unusual things might include infrequent messages appearing in large volume, messages showing up when they're not supposed to, or expected messages failing to appear.
According to a draft IBM Redbook, zAware assigns a z/OS logical partition (LPAR) a score through a moving 10-minute window based on the normality of the message traffic. The book notes that zAware is smart enough to know that a relatively low score on a stable system may represent a problem whereas a high score on a volatile system may be normal.
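The three kinds of "unusual" described above can be illustrated with a small sketch. This is not IBM's algorithm, which the article does not describe; it is a simplified frequency comparison under stated assumptions. The function names, the `burst_factor` threshold and the sample message IDs are all invented for illustration, and the caller is assumed to pass in the message IDs from a single 10-minute window.

```python
from collections import Counter


def learn_baseline(training_ids):
    """Return the relative frequency of each message ID seen in training."""
    counts = Counter(training_ids)
    total = sum(counts.values())
    return {msg_id: n / total for msg_id, n in counts.items()}


def classify_window(baseline, window_ids, expected_ids=(), burst_factor=5.0):
    """Label what looks unusual in one window of message IDs.

    burst_factor is a hypothetical threshold: an ID is flagged when its
    share of the window exceeds burst_factor times its training share.
    """
    counts = Counter(window_ids)
    total = sum(counts.values())
    findings = {}
    for msg_id, n in counts.items():
        if msg_id not in baseline:
            findings[msg_id] = "never seen in training"
        elif n / total > burst_factor * baseline[msg_id]:
            findings[msg_id] = "unusually frequent"
    for msg_id in expected_ids:
        if msg_id not in counts:
            findings[msg_id] = "expected but absent"
    return findings


# Invented sample data: one common ID, one occasional ID, one rare ID.
baseline = learn_baseline(["ABC0001I"] * 90 + ["ABC0002I"] * 9 + ["ABC0003W"])
window = ["ABC0001I"] * 5 + ["ABC0003W"] * 4 + ["XYZ0001E"]
findings = classify_window(baseline, window, expected_ids=("ABC0002I",))
for msg_id, reason in sorted(findings.items()):
    print(msg_id, "->", reason)
```

A real scoring engine would also have to weigh how stable the system normally is, as the Redbook notes; this sketch makes no attempt at that.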
Mainframe messages usually follow the standard format PPPNNNNS, where:
PPP is a prefix unique to a product or subsystem
NNNN is a message number
S is a severity code (e.g., E for error, W for warning and I for informational)
Individual systems may vary from this structure without significantly changing its nature. For instance, CICS message prefixes are five bytes long. Many subsystems issue messages without the severity code.
Of course, a lot of this intelligence depends on how well zAware can identify messages. By convention, mainframe console messages start with a structured 8- to 10-byte message ID. IBM and most independent software vendors do a good job following that standard. However, customers might not have applied the same discipline to their own systems, which may make it difficult for zAware to act on those messages.
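To make the identification problem concrete, here is a minimal sketch that splits the first token of a console line into the prefix, number and severity fields described above. The regular expression is an assumption chosen to cover the cases mentioned here (a three- to five-letter prefix and an optional severity letter); real message IDs vary more widely, and nothing here reflects how zAware itself parses messages. The sample lines are invented.

```python
import re

# Assumed layout: 3-5 uppercase letters, 3-5 digits, optional severity letter.
MSG_ID = re.compile(
    r"^(?P<prefix>[A-Z]{3,5})(?P<number>\d{3,5})(?P<severity>[A-Z])?$"
)


def parse_message_id(line):
    """Return the ID fields of a console line, or None if it has no ID."""
    tokens = line.split()
    if not tokens:
        return None
    match = MSG_ID.match(tokens[0])
    if match is None:
        return None
    return match.groupdict()


print(parse_message_id("ABC1234E SAMPLE SUBSYSTEM ERROR"))
print(parse_message_id("DFHAC2001 sample text with a five-byte prefix"))
print(parse_message_id("Payroll run finished"))  # homegrown, no message ID
```

The last line shows the problem with undisciplined in-house messages: with no recognizable ID, there is nothing to count or compare.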
ZAware also includes an easy-to-use Web user interface (WUI) with extensive "drill-down" abilities.
ZAware must run on a zEC12 machine, but it can monitor LPARs on z196 and z10 processors. Technically, it is firmware that runs in its own logical partition, much like the Coupling Facility Control Code. As an extra bonus, zAware can execute on either a general-purpose processor or an Integrated Facility for Linux (IFL) engine. Partitions running on the same machine can talk to zAware through HiperSockets, while images on other boxes must communicate over network adapters.
As for the operating system, zAware can monitor only z/OS 1.13 systems with system logger APAR OA38747 applied. After applying the APAR, customers will have to make further changes to the OPERLOG log stream configuration.
The firmware requires 4 GB of memory with an additional 200 MB for each monitored LPAR. Of course, zAware can't learn if it can't remember, so it needs 300 MB of count-key-data (CKD) direct-access storage device (DASD) attached to it. The total amount of the DASD depends on the number of monitored LPARs.
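The memory figures above reduce to simple arithmetic, sketched here for planning purposes. The function name is made up, and the conversion of 4 GB to 4,096 MB is an assumption (binary gigabytes). DASD is left out because the article gives only a single 300 MB figure and says the total depends on the LPAR count without stating the rule.

```python
BASE_MEMORY_MB = 4 * 1024    # 4 GB base, assuming 1 GB = 1,024 MB
PER_LPAR_MEMORY_MB = 200     # additional memory per monitored LPAR


def zaware_memory_mb(monitored_lpars):
    """Estimate zAware partition memory in MB from the figures quoted above."""
    if monitored_lpars < 0:
        raise ValueError("monitored_lpars must not be negative")
    return BASE_MEMORY_MB + PER_LPAR_MEMORY_MB * monitored_lpars


for lpars in (1, 10, 25):
    print(f"{lpars:>2} LPARs: {zaware_memory_mb(lpars):,} MB")
```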
ZAware cannot divine system weirdness without first knowing what is normal. To establish a baseline, IBM recommends a 90-day "training period" to allow zAware to become familiar with the systems. Customers can shorten the training period, although that may affect the accuracy of zAware's analysis. In addition to the training period, systems programmers may "prime" zAware by feeding it archived OPERLOG data.
The WUI is supported from a Web server built into zAware, and zAware can export its knowledge and assertions to other components through extensible markup language (XML) documents. In addition, the Redbook contains a statement of direction whereby IBM plans to integrate zAware's analytics into the Tivoli Integrated Service Management family for alerting and events.
IBM touts zAware's usefulness for debugging complicated events. Take a situation where a company's favorite online system is running slowly due to DB2 database locks. The standard way to debug this type of problem is to paw through system logs for messages identifying the locked resources and the processes that own them. It's quite a tedious and lengthy procedure in sysplexes with large data-sharing groups.
In these situations, zAware's WUI can display message ID frequency charts in which a user can drill down. Thus, not only are the deadlock messages in one place, but also the user can quickly click through the bars to find the resources and owners that may be causing the slowdown.
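The idea of a frequency chart with drill-down can be approximated in a few lines. This is only an illustration of the concept, not the WUI's implementation: the function names, the text-bar rendering and the sample lock messages are all invented.

```python
from collections import defaultdict


def build_drilldown(records):
    """Group message texts by message ID so each chart bar can be expanded."""
    by_id = defaultdict(list)
    for msg_id, text in records:
        by_id[msg_id].append(text)
    return by_id


def frequency_chart(by_id, width=20):
    """Render one text bar per message ID, most frequent first."""
    rows = sorted(by_id.items(), key=lambda item: (-len(item[1]), item[0]))
    if not rows:
        return []
    top = len(rows[0][1])
    lines = []
    for msg_id, texts in rows:
        bar = "#" * max(1, width * len(texts) // top)
        lines.append(f"{msg_id:<10}{bar} {len(texts)}")
    return lines


# Invented sample records: (message ID, message text).
records = [
    ("LCK0001I", "resource=TABLEA holder=JOB1"),
    ("LCK0001I", "resource=TABLEB holder=JOB2"),
    ("ABC0002W", "unrelated warning"),
]
by_id = build_drilldown(records)
chart = frequency_chart(by_id, width=10)
print("\n".join(chart))
print(by_id["LCK0001I"])  # "clicking" the tallest bar lists resources and owners
```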
Of course, a lot of "big data" offerings could do the same thing, so what does IBM have to offer that other monitors or massive data search tools don't?
On an operational level, zAware is truly out of band because it doesn't run inside the z/OS image itself. Instead, there's the relatively tiny overhead of socket communication. Running on IFLs is even more attractive to customers that want the functionality but don't want to add to their software costs by upgrading their CPUs.
While a lot of big data tools show you frequency reports and let you draw your own conclusions, zAware has analytics built in. These analytics are based on the expertise of the very people who write, maintain and debug the OS and subsystems under analysis. This means zAware has very deep, specific knowledge of what the system is supposed to do and how it acts under stress. This unique sort of expertise is certainly added value.
ZAware also has enormous potential for growth. This first release sifts OPERLOG messages for trouble. Future versions might be able to understand Resource Measurement Facility (RMF) data and flag performance problems almost immediately, or analyze System Management Facilities (SMF) resource usage information to predict how much CPU tonight's batch cycle will need. This becomes especially powerful when you think about marrying this sort of intelligence with other Tivoli automation tools.
I think zAware is at least worth a try.
ABOUT THE EXPERT: Robert Crawford has been a systems programmer for 29 years. While specializing in CICS technical support, he has also worked with VSAM, DB2, IMS and other mainframe products. He has programmed in Assembler, Rexx, C, C++, PL/1 and COBOL. The latest phase in his career finds him an operations architect responsible for establishing mainframe strategy and direction for a large insurance company. He works in south Texas where he lives with his family.