Regardless of the size or complexity of your organization, a capacity planning and performance monitoring program is essential for proactive planning and management of your data center. Monitoring technical aspects of your environment, such as CPU, disk, memory and network utilization, is important, but the ideal capacity planning and performance monitoring program goes beyond these types of measurements and includes both technical operations and business concerns. This article lists some important items to include in your data center's capacity planning and performance monitoring program.
- Set up and follow a regular schedule for review, analysis and presentation of the information you gather. The most important element of any planning and monitoring program is making it a routine event within your organization. Quarterly review and analysis is a good interval to plan for, but it may make more sense for your organization to do this review more or less frequently. Present your findings to stakeholders, along with next steps for addressing problem areas.
- Success and failure rates of backups. Regularly pulling tapes from on- and off-site storage facilities, and restoring files to ensure data is being backed up successfully and is accessible within the timeframe set forth by your organization, is part of standard operating procedures for many data centers. Analyzing the timing and success/failure rates over time is a good thing to include in your stakeholder presentations. Many stakeholders aren't interested in the mechanics of "how" the process works, but they do want to know that it works and how long it takes to restore a file in the event of a catastrophe.
- Put together a lifecycle roadmap of standard hardware and software supported in your data center. Get in touch with the product marketing teams for the hardware and software you have installed in your organization. Find out their three-to-five-year plans for the specific features and components of interest to your organization. Keeping up with this information on a regular basis will help you match corporate rollout and upgrade plans with the products that will have the longest support window.
- Determine which reports your organization really needs. A daily dashboard approach for outages and basic utilization levels is great, and creating trend reports based on the standard information that comes from your monitoring system is essential. However, your organization may be most interested in reports that demonstrate cause-and-effect relationships between seemingly disparate activities. For example, the recent implementation of a policy limiting the size of mailboxes may coincide with an increase in the amount of space utilized on file servers, as users begin to archive old mail elsewhere in the network.
- Hardware support and repair. Examine how your current repair process works over time, paying particular attention to speed of resolution and cost of repairs. Your organization may have an extensive hot or cold spare inventory, but buying and deploying redundant hardware that the staff isn't trained on can result in longer downtime, because it takes the staff longer to diagnose and fix the problem. It may be more cost-effective to purchase a premium four-hour on-site support contract with a vendor to troubleshoot, repair or replace failed equipment.
- Determine which phase of an outage (initial response, troubleshooting and resolution, or service restoration) is where critical time is lost. Track how long each of these phases lasts, and use this information to determine where improvements can be made to shorten the duration of the outage. By looking at the tasks performed in each phase and how long they take, an organization can determine whether changes need to be made in notification and escalation processes, whether additional staff training is required, or whether a platform change for a specific application would be the most resource- and cost-effective way to reduce the impact of outages.
- Utilization levels of staff. It's common to determine how many transactions per hour or minute a single server can perform, but staffing is typically reviewed only in terms of how many servers a single administrator can support. Complex environments, such as multiple virtual machines on a single server, clusters, or multiple load-balanced servers configured to increase availability, increase the workload on each administrator. Practical experience shows that organizations with lower administration and engineering skill sets should stick with running multiple servers at lower utilization rates. Lack of skilled staff, combined with running clustered servers on any OS, can actually reduce availability rather than increase it. Over time, it may be more cost-effective to run more hardware in a simple configuration than to run less hardware in a more complex configuration that requires more skilled, and more expensive, staff.
- Infrastructure. Don't forget to review the physical facilities in addition to the computational gear. Electrical and cooling needs should be matched up with the lifecycle roadmap described above and planned accordingly. If your data center has limited capacity for expansion of power or cooling, selecting server and network equipment with lower power and cooling requirements is essential. Also evaluate infrastructure services like Netegrity, Active Directory, DNS and Dynamic Host Configuration Protocol (DHCP), as well as racking, cabling and the number of available ports on LAN/WAN/SAN gear, with regard to planned upgrades, rollouts or migrations.
- Staging requirements. The amount of space and resources an organization devotes to staging equipment for preproduction application testing is an important consideration. Work with development teams to determine the needs for services, such as Active Directory authentication, database space, availability and the hours the environment must be available for use. Include these plans and findings in the final analysis of the data center as a whole. It may be possible to identify ways to repurpose staff or equipment from other uses for support of the staging environment on an ongoing basis, limiting the total capital outlay but increasing support services provided for development teams.
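
The outage-phase tracking recommended in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log layout, the sample records, the phase names (`response`, `resolution`, `restoration`) and the function names are all hypothetical, and a real program would read this data from your own ticketing or monitoring system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical outage log: (outage id, phase, phase start, phase end).
# The records below are made-up sample data, not real measurements.
OUTAGE_LOG = [
    ("OUT-1", "response", "2024-01-05 09:00", "2024-01-05 09:20"),
    ("OUT-1", "resolution", "2024-01-05 09:20", "2024-01-05 11:00"),
    ("OUT-1", "restoration", "2024-01-05 11:00", "2024-01-05 11:30"),
    ("OUT-2", "response", "2024-02-10 14:00", "2024-02-10 14:45"),
    ("OUT-2", "resolution", "2024-02-10 14:45", "2024-02-10 15:15"),
    ("OUT-2", "restoration", "2024-02-10 15:15", "2024-02-10 15:30"),
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def minutes_per_phase(log):
    """Return the total minutes spent in each phase across all outages."""
    totals = defaultdict(float)
    for _outage_id, phase, start, end in log:
        started = datetime.strptime(start, TIME_FORMAT)
        ended = datetime.strptime(end, TIME_FORMAT)
        totals[phase] += (ended - started).total_seconds() / 60
    return dict(totals)


def slowest_phase(log):
    """Return the name of the phase with the largest total duration."""
    totals = minutes_per_phase(log)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    for phase, minutes in minutes_per_phase(OUTAGE_LOG).items():
        print(f"{phase}: {minutes:.0f} minutes")
    print("Most time lost in:", slowest_phase(OUTAGE_LOG))
```

With the sample records, the troubleshooting and resolution phase accounts for the most time, which would point toward staff training or a platform change rather than the notification and escalation process. Summing across outages is one possible design choice; per-outage averages or medians may be more useful when a single long incident would otherwise dominate the totals.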