Server uptime and hardware failure guide
When everything is working right, people aren't learning anything. When things go wrong, the real learning starts.
This adage holds true for the military, athletic organizations and certainly IT. Take VMware vSphere, for instance. When your organization installed it, did the team really learn anything? Sure, you probably had some training, and you learned where things like network configurations and virtual machine (VM) settings are located. But when did you really learn something? For me, the real learning started the day one of the storage arrays didn't get along with the vSphere environment. Or perhaps it was the day we needed to clone a VM's snapshot, which isn't an option in vSphere's graphical user interface. On that day, the data center had its real training on the infinitely powerful PowerCLI -- lessons that opened doors to many good things.
Ask yourself what happens when something goes wrong. Let's say an important business process fails to run overnight. Can your staff fix it, or do you need a consultant or a support call? What happens after that call? Does your organization record the fix and related steps in a knowledge base or wiki so the information is available in the future, or is everyone just glad it got fixed and ready to move on? Was there a lot of yelling and finger-pointing during the problem, or was it dealt with calmly and professionally?
The way an organization deals with failure tells you a lot about its culture. Some organizations handle failure extremely poorly, with managers roaming the halls screaming at people and firing them in the middle of the outage. Does behavior like that lengthen or shorten an outage? What does it tell employees about taking risks, even calculated ones? What does it tell you about the employees who still work there? I know a number of organizations whose employees are so scared of being blamed for anything that they won't even apply desperately needed security patches to their systems. Because of this pervasive, fear-driven culture, it's easier to count the systems that don't have massive security problems than the ones that do.
Some IT teams just freeze up when they encounter failure. They don't know what to do or where to begin, so they do nothing. Maybe the problem will fix itself, or maybe someone else will step up and fix it. People work around problems, sometimes going outside of IT for solutions in the cloud. That isn't good, either, because shadow IT expenditures should be avoided at all costs. One organization, no longer in existence, ended up switching to pen and paper because its inventory systems became so broken over time. Yes, really -- pen and paper. No shadow IT, but no company anymore, either.
My favorite kind of IT organization is the one that treats failure as an opportunity for data center training. These organizations keep blame to a minimum, even during post-mortem analysis, because defensive people aren't open to learning. They stay focused on the problems at hand and work as a team to get things done. This lends itself to professionalism and honesty, which in turn generates frank discussions of problems and solutions.
Experimentation and failure are also encouraged as part of new implementations and upgrades in the data center. Failure of this sort isn't seen as a risk or a detour, but as a way to find the best path forward. This ideal organization also embraces the DevOps and lean software development ideas of failing fast and learning rapidly. Employees learn to make good decisions and take calculated risks, and they succeed.
Consider your organization -- what will it take to start encouraging better risk-taking and less blame? Or, if it is hopeless, why are you still there?