Data center operations and maintenance best practices for critical facilities

This tip discusses best practices associated with the operations and maintenance (O&M) processes for data center facilities, including the physical aspects of critical infrastructures, staffing considerations, and appropriate O&M processes, tools, and procedures necessary to support the demands of 7/24/forever expectations.

Operations and maintenance (O&M) of today's critical facilities is being recognized as equally important as the engineering and design phases of these complex sites.

As the robustness and associated complexity of critical infrastructures have increased to allow for improved fault tolerance and concurrent maintenance capabilities, the importance of establishing equally robust O&M practices to manage these facilities has become apparent. Studies show that 60% or more of "impact events" where critical missions have been compromised are associated with human activity. This activity includes routine switching and reconfiguration of critical systems, maintenance tasks and, of course, human error.

The requisite staff and processes necessary to support continuous operations must be in place on the first day that the site goes live and must continue through to the final day that critical operations occur. This requires that efforts toward establishing these processes begin well before the facility begins operations, and ideally will begin during the site programming and requirements definition stage.

Data center design considerations

High availability for critical facilities typically necessitates complex redundancy schemes such as 2N, 2(N+1) or even 2(N+1)/3 configurations. The expectation is that even if critical equipment or systems fail, there is sufficient redundancy available to support uninterrupted operations.
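As a rough illustration of what these schemes mean in equipment counts, the sketch below computes the number of installed units for a few common configurations. It assumes N is the minimum number of units needed to carry the design load. The load and unit capacity figures and the function names are hypothetical, and the 2(N+1)/3 variant is left out because the text does not define it.

```python
import math


def units_required(load_kw: float, unit_capacity_kw: float) -> int:
    """N: the minimum number of units needed to carry the load."""
    return math.ceil(load_kw / unit_capacity_kw)


def installed_units(n: int, scheme: str) -> int:
    """Total units installed for a given redundancy scheme."""
    schemes = {
        "N": n,                  # no redundancy
        "N+1": n + 1,            # one spare unit
        "2N": 2 * n,             # two full systems
        "2(N+1)": 2 * (n + 1),   # two full systems, each with a spare
    }
    return schemes[scheme]


# Hypothetical example: a 900 kW load served by 500 kW units.
n = units_required(load_kw=900, unit_capacity_kw=500)
for scheme in ("N", "N+1", "2N", "2(N+1)"):
    print(f"{scheme}: {installed_units(n, scheme)} units installed")
```

The count alone says nothing about whether a failed unit can be isolated and serviced while the others carry the load, which is the point of the next paragraph.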

But if the affected infrastructure doesn't include adequate means to isolate the failed equipment, and staff subsequently cannot access, repair or replace the equipment during sustained operations, outages are still incurred. This means that the requirements for sustaining critical operations over the life of a facility must be included in the design and construction before operations begin. This is called designing for maintainability.

Construction, startup and commissioning

The best engineered and designed facility is no better than the implementation of the design during construction. The need to provide strict oversight and quality control of the construction process requires frequent field progress inspections during construction. In addition, comprehensive startup and testing by qualified technicians is necessary to prepare for formal acceptance testing before the facility can be certified as ready to commence critical operations. This process is called commissioning, and it also includes ensuring that the project is appropriately staffed and that workers are provided site-specific training and have accurate as-built documentation on hand on Day One.

Formal commissioning starts during the design phase (if not earlier) to provide reviews for constructability and maintainability and to ensure that the design intent (captured in the basis-of-design document) complies with the owner's requirements and expectations for the facility's performance. Commissioning also includes various levels of testing and verification, including factory acceptance tests, shipping and receiving requirements, field progress inspections, pre-functional and functional performance tests, and, finally, integrated systems tests.

The site O&M staff should participate in the commissioning process throughout the construction, startup and acceptance testing of the site. This provides valuable and sometimes unique opportunities for the O&M staff to participate in activities where they can learn and prepare for future tasks they will be responsible for during critical operations. There is no better opportunity for hands-on training and developing a deep understanding of site-specific nuances than at this time.

Operations and maintenance staff and organization

The staff assigned to operate and maintain a critical facility deserves as much foresight, consideration and attention as any other aspect of the process. The O&M staff should be identified, organized and trained before the site goes live. Some important questions to answer: Which skills are required to operate and maintain the site? To whom should this department report? What will the permanent staff be responsible for and what will be outsourced, including service-level agreements?

One of the first questions should be, "Will the O&M organization dedicate separate staff to the critical infrastructure, or will one organization cover all critical and noncritical O&M activities?" Ideally, a dedicated staff is assigned responsibility for the critical infrastructure and a separate staff for noncritical infrastructure. Continuous operations require constant vigilance and focus on the critical, 7/24/forever systems. As urgent as a leaking window may be, especially when it is in a highly visible location, it can be a distraction for staff who should be totally focused on critical operations. Likewise, critical O&M budgets should not have to compete for scarce resources with furniture, landscaping and other necessary expenditures.

Operations and maintenance processes

Operations and maintenance of critical facilities is not just a set of procedures. It is a strategy that should include clear goals and objectives, well-defined roles and responsibilities, an organization that focuses on continuous operations, and sufficient resources to accomplish the goals.

When is the site most vulnerable and deserving of the best staffing? Nights and weekends, when contractors, vendors and parts are hardest to come by? Or during the business day, when outages can have the most impact? Obviously, the answer is connected to the mission of the site. If the site supports business activities that are more valuable during normal business hours, you may get one answer. If, on the other hand, the site has a true 24/7/forever mission in which 9 a.m. Monday is no more important than 9 p.m. Saturday, you may get another answer.

The answer to these questions can generate even more questions. For instance, where will you store critical spare parts? Will they require environmental conditioning or routine maintenance (like rotating equipment to preserve lubrication and preclude "bowing" of shafts)? Will the site require in-house expertise for administration of complex monitoring and control systems or just what's necessary to operate the systems?

Which spare parts will be considered critical and maintained on-site? What tools, equipment and inventory will be necessary? Will a Computerized Maintenance Management System be employed, and if so, who will build and configure it?

There are also significant variations in maintenance programs for facilities in general, with critical facilities tending more toward the high end. Most facilities have some level of planned maintenance. Routine tasks based on time intervals, or frequency, are referred to as preventive maintenance. For instance, on a particular piece of equipment, inspections may occur monthly, belts may be checked and adjusted quarterly, filters may be replaced every six months, and internal cleaning, alignment checks and sensor calibration may be performed yearly. The shortcoming here is that the tasks occur regardless of actual operating condition. These programs can be improved when based on actual equipment runtime, but they still do not take actual operating condition into consideration.
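The interval-based schedule in the example above can be sketched as a simple lookup. This is a minimal illustration, not a substitute for a maintenance management system: the intervals come from the example in the text, while the function names and the 2,000-hour runtime interval are hypothetical.

```python
# Task intervals in months, taken from the example in the text.
PM_INTERVALS_MONTHS = {
    "inspection": 1,
    "belt check and adjustment": 3,
    "filter replacement": 6,
    "internal cleaning, alignment check and sensor calibration": 12,
}


def tasks_due(month_number: int) -> list[str]:
    """Tasks due in a given month, where 1 is the first month of operation."""
    return [
        task
        for task, interval in PM_INTERVALS_MONTHS.items()
        if month_number % interval == 0
    ]


def due_by_runtime(hours_since_service: float, interval_hours: float = 2000.0) -> bool:
    """Runtime-based variant: due once the unit has actually run long enough.

    The 2,000-hour default is a hypothetical figure for illustration.
    """
    return hours_since_service >= interval_hours


for month in (1, 3, 6, 12):
    print(month, tasks_due(month))
```

Neither function looks at the condition of the equipment. A lightly loaded unit and a heavily loaded one get the same tasks on the same dates, which is the shortcoming the text describes.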

An improvement is to implement condition-based monitoring technologies that allow maintenance to occur based on actual operating conditions. A simple example is using a differential pressure sensor to monitor filter condition. When the filter loads up, the delta-P increases and the filter is replaced when appropriate.
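The filter example might look like the following sketch. The 250 Pa replacement threshold, the readings and the function name are assumptions made for illustration; real limits depend on the filter and the air handler.

```python
# Hypothetical replacement threshold in pascals; actual limits depend on
# the filter and the air-handling unit.
REPLACE_AT_PA = 250.0


def filter_needs_replacement(delta_p_pa: float, replace_at_pa: float = REPLACE_AT_PA) -> bool:
    """True when the pressure drop across the filter has reached the limit."""
    return delta_p_pa >= replace_at_pa


# Hypothetical differential pressure readings as the filter loads up.
readings_pa = [110.0, 145.0, 190.0, 255.0]
for reading in readings_pa:
    action = "replace filter" if filter_needs_replacement(reading) else "no action"
    print(f"delta-P {reading:.0f} Pa: {action}")
```

Here the filter is changed when its measured condition calls for it, not because a calendar interval has elapsed.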

When these condition-monitoring technologies are used and the data trended, you can predict in advance when maintenance will be required. This is called predictive maintenance. Thresholds can be assigned for alerts and alarm conditions, and by analyzing the trends, you can predict when the thresholds will be exceeded and even predict failures.
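One simple way to trend data and estimate when a threshold will be crossed is a straight-line fit, sketched below. This is only an assumed approach for illustration: real degradation is often not linear, and the readings, units and alert and alarm thresholds here are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares straight-line fit; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_threshold(times, values, threshold):
    """Time at which the fitted trend reaches the threshold.

    Returns None when the trend is flat or falling, since the threshold
    would then never be reached by extrapolation.
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical weekly vibration readings in mm/s.
weeks = [0, 1, 2, 3, 4]
vibration = [2.0, 2.5, 3.0, 3.5, 4.0]
ALERT, ALARM = 4.5, 7.0  # hypothetical thresholds

print("alert expected at week", time_to_threshold(weeks, vibration, ALERT))
print("alarm expected at week", time_to_threshold(weeks, vibration, ALARM))
```

With a forecast like this, the work can be scheduled in a planned window before the alarm threshold is reached.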

Some examples of operating condition monitoring technologies include vibration analysis, tribology (lubrication analysis) and infrared thermal scans. These technologies can reveal incredible insights into the operational condition of equipment while the equipment is online, without requiring shutdowns or maintenance outages.


It is necessary that all aspects of the facility's operations and maintenance are considered early in the development of the site requirements. Otherwise, opportunities may be lost to embed the requisite O&M requirements into the facility's design and construction. Given the extremely large capital investment required to design, construct and bring online critical facilities today, and the importance of the missions associated with these facilities, it should be intuitively obvious that equal consideration must be given to the staff, programs and resources that will be entrusted to operate and maintain the site over its intended lifespan.

About the authors:
John B. Collins, Associate Partner at Syska Hennessy Group, has over 30 years of experience in critical facilities operations and management. John has provided operating and maintenance consultation in the field of mechanical and electrical systems. He is a technical specialist in the field of development of operation and maintenance documentation through the use of field surveys, CAD schematics and computerized database configurations. John also has worked as a project manager in many data center power downs and has extensive experience with operations and maintenance of critical facilities. John was employed in the financial sector in Data Center operations before joining Syska. Currently at Syska Hennessy, John works in the Critical Facilities and Facility Management groups.

Terry L. Rodgers, CPE and Associate Partner at Syska Hennessy Group, has over 25 years of experience in critical facilities operations and management. Terry earned a BSME from Virginia Tech in 1981. Currently at Syska Hennessy, Terry works in the Critical Facilities and National Commissioning groups and has been an active member of ASHRAE Technical Committee 9.9 since 2004. He is a member of Syska's Critical Facilities Technical Leadership Committee and chair of the Syska Green Critical Facilities Committee. Terry has co-authored various books, whitepapers and presentations on critical facilities.
