The famous Pogo line applies: "We have met the enemy and he is us."
Sad but true: the vast majority of data center failures are caused, triggered or exacerbated by human error. Like it or not, the hardware and software are a lot more reliable than we are.
We'll never completely eliminate our human failings, but we can certainly make ourselves less vulnerable. Here are a few ways. You can undoubtedly add to the list yourself, especially after giving some thought to past incidents, to the next one waiting to happen, and to the conditions you've noticed but haven't acted on yet.
Logic! Nothing traps us into inadvertent mistakes like things that aren't what they seem. Here are some cases of what should be, but too often isn't:
- Circuit breaker order in the panels clearly related to cabinet rows.
- A and B circuits in each cabinet on the same breaker number in each pair of PDUs (assuming proper design for dual-corded hardware).
- Receptacles exactly "mirrored" inside cabinets.
- Dual-corded circuits run from "paired" PDUs. (All devices on PDU A are also on PDU B. NOT Device 1 from PDUs A and B, and Device 2 from PDUs B and C, etc. This is nearly impossible to keep track of and makes power balancing a nightmare with a built-in overload trap if a PDU fails.)
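The pairing rule and the overload trap in that last item are easy to check once the circuit schedule is written down somewhere. Here's a minimal sketch, assuming a made-up inventory of device names, PDU names and amp figures; none of it comes from a real facility or tool. It flags any device whose two cords don't land on an approved PDU pair, then totals what each PDU would carry if it had to pick up the full load of every device plugged into it. That total is a deliberately conservative upper bound, not a precise failure simulation.

```python
from collections import defaultdict

# Hypothetical inventory: (device, feed A PDU, feed B PDU, full-load amps).
DEVICES = [
    ("srv-01", "PDU-A", "PDU-B", 4.0),
    ("srv-02", "PDU-A", "PDU-B", 6.0),
    ("srv-03", "PDU-B", "PDU-C", 5.0),  # breaks the pairing rule
]

# Hypothetical list of approved PDU pairs for this room.
ALLOWED_PAIRS = {frozenset({"PDU-A", "PDU-B"})}


def unpaired_devices(devices, allowed_pairs):
    """Return names of devices whose two feeds are not an approved PDU pair."""
    return [
        name
        for name, feed_a, feed_b, _ in devices
        if frozenset({feed_a, feed_b}) not in allowed_pairs
    ]


def failover_load(devices):
    """Upper-bound amps per PDU if it had to carry every attached device alone."""
    load = defaultdict(float)
    for _, feed_a, feed_b, amps in devices:
        load[feed_a] += amps
        load[feed_b] += amps
    return dict(load)


def overloaded(devices, capacity_amps):
    """Return PDUs whose upper-bound failover load exceeds their capacity."""
    return sorted(
        pdu
        for pdu, amps in failover_load(devices).items()
        if amps > capacity_amps.get(pdu, 0.0)
    )
```

With properly paired PDUs, the failover figure for each PDU is simply the total load of the pair, which is what makes balancing manageable. Once pairs are mixed, as with the third device here, one PDU ends up backing two others and its worst case climbs accordingly.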
Labels! Even if you can't get everything physically organized, you can still label it clearly. The best job I've ever seen used color-coded labels for every PDU, with character sizes you could read across the room and corresponding color labels on every circuit. Likewise, clear, meaningful and organized cabinet and patch panel labels make everyone's life easier. Avoid relying on the "one guy" who knows where everything is.
Another great approach is labeling each tile row at the upper wall or ceiling so any location can be identified by an alpha-numeric grid identification that can be seen from anywhere in the room. It makes it faster to find things, above or below the floor, even for the inexperienced.
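If the same grid goes into your inventory records, it helps to generate the identifiers one consistent way. The sketch below is just one possible convention, assumed for illustration: letters along one wall (continuing spreadsheet-style past Z) and two-digit numbers along the other. The function names are made up, and your room may well use a different scheme.

```python
def column_letters(index):
    """Convert a zero-based column index to letters: 0 -> A, 25 -> Z, 26 -> AA."""
    if index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    index += 1  # switch to one-based for the base-26 conversion
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def tile_id(column_index, row_index):
    """Build a grid identifier from zero-based tile indices, e.g. (2, 11) -> C12."""
    if row_index < 0:
        raise ValueError("row index must be non-negative")
    return f"{column_letters(column_index)}{row_index + 1:02d}"
```

Whatever convention you pick, the point is that the identifier on the wall, on the drawing and in the records is the same one.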
And don't forget about machine labels! "Cutesy" names can be fun for the programmers, but they don't tell people much about what each machine does, what goes with what or who's responsible for it. If the apps guys insist on clever server names, add a second, more descriptive name to the tag. And absolutely include the name and phone number or e-mail of whoever is knowledgeable about and responsible for each device.
Visibility! Something else to consider is clear Plexiglas panels at key locations in your raised floor. Put them over CRAC unit thermometers and valves, ground bars and anywhere else something could go wrong but go unnoticed. Just don't put excessive weight on them. They're not as strong as regular panels.
In short, make your data center an easy place for everyone to work in. If you're doing your job well, there's no need to withhold key information as a form of job protection, and you certainly shouldn't let anyone else do it either. If you're not there and a problem occurs, you'll be found out real fast, with a different end result than the one you were planning on.