Data center outages at Delta Airlines and Amazon Web Services stole the headlines in recent months, but there are plenty of other outages at everyday enterprises that fly under the radar.
IT pros dished the dirt last week on the show floor at IBM Interconnect, anonymously sharing tales about their data center outages at the hybrid cloud booth. Their stories illustrate the variety of problems behind data center downtime, and offer a reality check: the next outage could be caused by just about anything.
A CIO, two weeks into the new position, claimed he was hired to implement a "transformational agenda" -- but first he endured a one-week outage of a core, externally facing customer system. "I spent months delaying my agenda to focus on sustainability," wrote the unnamed CIO.
An insurance company in Connecticut performed a data migration from its original system to a new platform, then shut down the old system, claimed another contributor. But when they attempted to bring up the new system, the data was corrupt.
In a networking tale of woe, an F5 refresh took out an entire website when a parameter meant to direct traffic to the least-loaded server instead sent the traffic to a test server. You can probably guess what happened next.
Another debacle cited the failure of an unspecified storage component, which degraded performance and ultimately triggered the disaster recovery plan. But there was one problem: "We had no way to fail back -- not good," wrote the IT pro.
Nature was blamed for one data center takedown: a squirrel chewed into a main power feed during maintenance on the data center's battery backup. That caused a blackout -- albeit a short one -- with the data center going down for about five seconds until the generators kicked in. No word on whether any data, or the squirrel, was lost.
One IT pro lamented a load test conducted on production storage during working hours. It was a virtualized environment and nothing should have gone wrong, but the ports became saturated, the network couldn't handle the load, and downtime ensued.
Timing can be everything, and that was certainly the case when the hard drive died in a network staging server at one company — just before a new product was to be launched, according to the anonymous writer.
Backups for data center cooling and power systems are especially important, as one story showed: an IT pro claimed there was no UPS or generator backup for the cooling towers on the roof of the data center. When the power went out, CPUs overheated with no working cooling system.
Don’t blame me
Notice a common theme? None of the authors accepts blame in their stories of data center downtime. In fact, in most cases nobody is blamed at all. So much for the blameless post mortem, even when it is anonymous. A majority of data center outages are caused by human error, which leaves us wondering exactly what the painful truth behind these outages was.
Now that you’ve read some tales from the data center trenches, what’s your best story about an outage and downtime?