IT service outages are a data center nightmare. But Malvern, Pa.-based Siemens Medical Solutions Inc.'s Enterprise Hosting Solutions developed a system for reducing outages by identifying and classifying root causes in a standardized way. In this Q&A, Siemens service-level manager Charles T. Foy, whose research paper on Siemens' problem management system won an award from the Turnersville, N.J.-based Computer Measurement Group Inc., shares how his team developed better systems management processes.
What sparked your interest in creating your problem management system (PMS) for gathering statistics about past IT service outages and their root causes?
Charles Foy: We had several problem management processes in place, and we needed to standardize on one and make it even better to meet the needs of our customers.
What has the system achieved?
Foy: Our standardized problem management system has resulted in reduced downtime because of the following factors:
- The standardized PMS is a repeatable method for defining root causes, addressing them and implementing corrective action.
- The database yields actionable intelligence by identifying trends that might otherwise have been overlooked.
- The PMS has enabled us to be even more proactive in our operations. For example, the PMS reinforces the need to complete scheduled maintenance within the designated time frame to avoid an extended outage, which would be classified as unscheduled downtime.
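The maintenance example above can be illustrated with a minimal sketch. It assumes, purely for illustration, that only the part of an outage falling inside the approved maintenance window counts as scheduled and that any overrun counts as unscheduled downtime; the function name and timestamps are hypothetical and do not come from Siemens' system.

```python
from datetime import datetime, timedelta


def split_downtime(window_start, window_end, outage_start, outage_end):
    """Return (scheduled, unscheduled) downtime as timedeltas.

    Assumption: time inside the approved maintenance window is scheduled;
    any time outside the window is treated as unscheduled downtime.
    """
    if outage_end < outage_start:
        raise ValueError("outage_end precedes outage_start")
    total = outage_end - outage_start
    # Overlap between the outage and the maintenance window (may be empty).
    overlap_start = max(window_start, outage_start)
    overlap_end = min(window_end, outage_end)
    scheduled = max(overlap_end - overlap_start, timedelta(0))
    return scheduled, total - scheduled


# Hypothetical two-hour window that overruns by 45 minutes.
scheduled, unscheduled = split_downtime(
    window_start=datetime(2009, 3, 7, 1, 0),
    window_end=datetime(2009, 3, 7, 3, 0),
    outage_start=datetime(2009, 3, 7, 1, 0),
    outage_end=datetime(2009, 3, 7, 3, 45),
)
print(scheduled, unscheduled)
```

Under this assumed rule, finishing inside the window leaves the unscheduled figure at zero, which is the behavior the process is meant to encourage.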
You moved beyond the idea of individual reports to a central database of outages, causes and resolutions. What did you hope to achieve by classifying problems this way?
Foy: We realized that a system that classified outage root causes in a consistent manner could be searched for trends that might not otherwise be obvious. Once you have a series of similar root causes and their outage time impacts, you have a tremendous amount of business intelligence. Remediation of issues can be quantified and costs of implementation defined. These can then be compared with the impacts of the outages over time, and informed business decisions made to address the issues.
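The comparison Foy describes can be sketched roughly as follows. The categories, downtime figures, cost per minute and remediation cost are all made-up illustrations, and the sketch assumes that a remediation would eliminate the whole category of outages, which a real analysis would need to verify.

```python
from collections import defaultdict

# Hypothetical outage records: (root-cause category, downtime in minutes).
outages = [
    ("process/change-management", 90),
    ("hardware/storage", 45),
    ("process/change-management", 120),
    ("software/patching", 30),
    ("hardware/storage", 60),
]


def downtime_by_category(records):
    """Total the outage count and downtime minutes for each category."""
    totals = defaultdict(lambda: {"count": 0, "minutes": 0})
    for category, minutes in records:
        totals[category]["count"] += 1
        totals[category]["minutes"] += minutes
    return dict(totals)


def remediation_pays_off(minutes_per_period, cost_per_minute,
                         remediation_cost, periods):
    """True if the avoided outage cost over `periods` exceeds the fix cost.

    Assumption: the remediation removes this category of outage entirely.
    """
    avoided = minutes_per_period * cost_per_minute * periods
    return avoided > remediation_cost


totals = downtime_by_category(outages)
# Rank categories by total downtime so recurring causes surface first.
ranked = sorted(totals.items(), key=lambda kv: kv[1]["minutes"], reverse=True)
worst_category, worst = ranked[0]
print(worst_category, worst)
print(remediation_pays_off(worst["minutes"], 100.0, 50_000.0, periods=3))
```

The ranking only works because every record uses the same category vocabulary, which is the point of classifying root causes consistently.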
What benefits did you gain by brainstorming with and involving others in your efforts?
Foy: We knew that other managers and staff would have valuable insight into the future process. In addition, we knew their participation would add to their sense of ownership. In fact, we did get a lot of great ideas that made the end product much better, like the addition of an external communication template. Similar sections in the new external template and the existing internal template are written simultaneously, saving time and ensuring consistency of internal and external communications.
You quickly learned that outages often have multiple root causes and that multiple corrective actions might be needed. How did you design your database to address those complex issues?
Foy: We had a lot of discussion on how to categorize root causes, both primary and contributing root causes. The challenge was to classify them at a high enough level to reveal trends but at a low enough level to provide actionable intelligence. Our defect-tracking database allowed for multiple levels of identifiers, called keywords, which aided categorization. In addition, this database allowed multiple defects to be associated with one another in a relational construct, yet each defect record had its own corrective action that could be tracked independently, from design through implementation.
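One way to picture such a structure is the sketch below. It is not the schema of Siemens' defect-tracking database; the class names, fields, status values and sample keywords are hypothetical. It shows multi-level keywords, linked defects that each carry their own corrective action, and counting at a broad or narrow keyword level.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CorrectiveAction:
    description: str
    status: str = "design"  # hypothetical stages: design, implementation, done


@dataclass
class Defect:
    defect_id: str
    keywords: Tuple[str, ...]  # multi-level keywords, broadest first
    role: str                  # "primary" or "contributing"
    action: CorrectiveAction   # tracked independently per defect
    related_ids: List[str] = field(default_factory=list)


def link(a, b):
    """Associate two defects with each other (symmetric relation)."""
    if b.defect_id not in a.related_ids:
        a.related_ids.append(b.defect_id)
    if a.defect_id not in b.related_ids:
        b.related_ids.append(a.defect_id)


def count_by_level(defects, level):
    """Count defects by keyword path truncated to `level` entries."""
    counts = {}
    for d in defects:
        key = d.keywords[:level]
        counts[key] = counts.get(key, 0) + 1
    return counts


primary = Defect(
    "D-1", ("process", "change-management", "missing-rollback-plan"),
    "primary", CorrectiveAction("Require a rollback plan in change requests"))
contributing = Defect(
    "D-2", ("process", "monitoring", "alert-threshold"),
    "contributing", CorrectiveAction("Tune alert thresholds"))
link(primary, contributing)

# Each corrective action advances on its own schedule.
contributing.action.status = "implementation"

defects = [primary, contributing]
print(count_by_level(defects, 1))  # broad view, useful for spotting trends
print(count_by_level(defects, 2))  # narrower view, closer to actionable
```

Truncating the keyword path at different depths mirrors the trade-off described above: the top level reveals trends, while deeper levels point to specific fixes.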
How did you create root cause classifications that would help get to the source of the problem? What was the key?
Foy: There are two keys in getting to the root cause. The first is understanding that there is seldom a single root cause, especially in major incidents. The second is having a process that supports exploration of the depth of root causes, using a method such as the "five whys."
After creating the database and defining the key terms to be tracked, how did you create a plan and team to resolve future outages?
Foy: Most companies that adopt ITIL [IT Infrastructure Library] best practices realize that most of the ITIL disciplines are already in place in some form or another, and it is just a matter of identifying the existing processes and matching them to ITIL. Siemens was no different; we merely identified and coalesced our existing best practices into one process and then assisted our co-workers with migration.
What implementation issues did you face?
Foy: Our initial process had too many steps, which made it cumbersome to use. After gathering feedback, we redesigned the process to reduce handoffs. In addition, we now issue daily reports to eliminate the initial confusion about which problems are resolved and which still have outstanding requirements.
How long did it take you to develop this system from start to finish?
Foy: The project took six months to define and was implemented over the next year. Since then, we have periodically reviewed the process and updated the standardized templates for communicating the root cause, the permanent resolution and the preventive measures needed to avoid recurrence.
What is the one take-away message you would leave for our data center audience?
Foy: You cannot address that which you cannot measure. Service outages are defects, whether in process, software or hardware. Once you identify and measure them, you can improve.
This was first published in March 2009