Mainframe problem analysis: What's the worst that could happen?

In this mainframe humor observation, an overconfident programmer gets more than he bargained for when he takes his shot at analyzing a TSO problem.

Bob slurped some coffee and drummed his fingers on the desk. Logging onto TSO was particularly slow this morning. A few seconds later, he took another swig and hit the Enter key sharply eight or nine times. TSO rewarded his patience with a third “logon proceeding” message.

He rolled his eyes and stood up. “Anyone else having problems?” he started, when he noticed several people huddled inside of Hugh Britz’s cubicle. Bob’s inerrant instinct told him that’s where the action was. He ambled down to Hugh’s cube and asked what was going on.

“Got a looping task on system five. Unfortunately, it’s glommed onto some locks that are holding up everyone else,” Hugh said cheerily to Bob. He then turned around and pointed to the monitor. “We were trying to decide if it’s okay to cancel the task when, for grits and shins, we used the monitor to figure out where the loop is. We tracked it down to these instructions.”

Hugh double-clicked on the emulator window to make it full-screen. Bob pushed his glasses over the bridge of his nose and focused on the dump display. He did a quick disassembly in his head. The offending code was five instructions long.

“Whew! That’s a pretty tight loop,” said Bob.

Hugh nodded, a grin plastered on his face. “Classic control block walking algorithm. Somewhere, there must be a bogus address that turned this into a bad day. We (by which Hugh usually meant I) thought about canceling the task, but we might be able to get out of this relatively painlessly by zapping that branch instruction --” Hugh tapped his monitor -- “into a no-op.”

Bob cleared his throat. In the back of his mind he heard his boss, Stan, telling him to be more of a technical leader. Bob preferred not to lead, but he did feel compelled to offer advice in this situation.

“Well, sure Hugh, that’ll get you out of the loop, but is it really such a good idea?” Bob felt the alpha-technician rising inside of him as he warmed to the subject. “We don’t know what happens when the code falls through the loop. It won’t fix the broken control block chain, and we would have to zap the instruction back to its original state before the next task comes through that path. I don’t think that can be done fast enough."

Hugh’s grin got wider, as if that were possible, meaning he had anticipated Bob’s objections.

“Most likely this thing will fall through, run into some sort of error condition and ABEND," said Hugh. "If the problem recurs, well, then at least we tried and we can go with plan B. Besides --” Hugh raised a hand to point out the piece de resistance -- “I stacked the zap and unzap on the command line. As soon as I hit Enter, they will be applied one right after the other.”

Bob had no reply to Hugh’s careless confidence. The others in the cube kept their faces carefully neutral. Hugh looked around cheerfully and said, “Anybody for a peer review?”

Everyone found something interesting to look at; a couple were even whistling.

Hugh emphatically pressed the Enter key. “What’s the worst thing that could happen?”

The following week…
The problem manager flipped open her notebook and called the meeting to order. 

“We’re here to talk about last Tuesday's outage. Who has the timeline?” she asked.

A different problem analyst referenced his spiral notebook. 

“Around 8:30, TSO users began to experience slow response and batch stalled," said the other analyst. "The online systems seemed to be okay. About 15 minutes later, system five went down and the other three production systems locked up.”

Hugh, sitting several people to the left of the speaker, appeared to be thoughtfully rubbing his forehead.

The problem manager continued, “This was followed by an initial program load of the production Sysplex and recovery of the online systems thirty minutes later.”

Bob, sitting next to Hugh, thought he heard a small whimper.

“A few of the online systems did not successfully emergency restart and had to be cold started. The cold starts required several databases to be stopped and recovered. Two hours later, the databases were successfully recovered and restarted. That’s when we had full return to service,” noted the manager.

A soft sob echoed lightly off of the white board in the front of the room.

The meeting leader finished copying the information down in her notes. “Hugh, I think you have information for root cause?”

Hugh straightened up in his seat as the blood drained from his face. His mouth opened and closed several times before Bob interjected, “Uh, if I may? Hugh did the original problem analysis on the loop, but the rest of us collected the diagnostic information that was sent to the vendor. Fortunately, this was a known problem fixed by patch --” Bob checked his own notes -- “AA44244. We applied the fix to test Tuesday night and hit production yesterday. We’re in good shape.”

Hugh, feeling a little calmer, started to take a sip of coffee.

Unfortunately, the meeting chair was one of the smart problem managers. 

“That was the cause of the original loop, correct? But why would one looping task bring down production?” she asked.

“It turns out the looping task was holding onto some important Sysplex-wide locks the other machines needed,” said Bob, ignoring the choking sounds coming from Hugh.

The meeting chair looked puzzled. “Well, once the task went away, shouldn’t it have freed the resource and let the other systems continue?”

Bob deadpanned, “You would think so.”

The problem manager paused and said, “I would?”

“Yes ma’am, that’s the way it’s supposed to work.”

After an uncomfortable silence, the chairwoman turned to the other problem analyst. 

“Do we have the number of impacted customers?” she asked.

“Yes. Employees couldn’t use the system for two hours. The website was down for 45 minutes.”

Hugh was staring intently at the third button of his shirt.

“Okay, then, if there aren’t any other action items, I guess we can call it a day,” said the manager.

That afternoon, Hugh gently tapped on Bob’s cubicle frame. His hair was somewhat disheveled and his tie was half unknotted. His confident grin was gone and he was rather twitchy.

“Bob, I wanted to let you know I’m…I’m going to a new position," said Hugh. "I’ll be working in Production Control. I’ll have more time to, er, think over there. It’ll be a safer, um, better place for me.”

Bob shook his hand and wished him good luck. Hugh picked up his briefcase and headed for the exit.

Bob was watching the stairway door close behind Hugh when he heard Stan’s gruff voice behind him: “What’s the matter with him?”

“Work-related stress.”

“Stress, eh?” Stan’s face screwed up into a complicated image of thought. “Maybe that’s why he asked for the transfer to Production Control.”

“Don’t understand it myself," Bob said, shaking his head. "This is where the action is -- the real technical stuff.”

Stan growled up at Bob, “What could be more fun than playing around with the bits, zappin’ ‘em off and on like you know what you’re doing?”

Stan trudged back to his desk shaking his head.

ABOUT THE AUTHOR: For over 25 years, Robert Crawford has worked off and on as a CICS systems programmer. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.
