Use Nagios to trend and troubleshoot performance issues


Kyle Rankin, Contributor
07.30.2008

The more servers you manage, the less likely it is that you can continually monitor their health on your own. Systems administrators do actually sleep every now and then -- so an ever-vigilant computer that probes system health and can monitor performance issues on a large number of machines is a huge help.

My personal preference for monitoring tools happens to be Nagios. There are a number of good monitoring solutions out there (several of which seem to be based on Nagios), but I've long liked Nagios for its price (free), its completeness, and the fact that it is an open source project.

The open source nature of Nagios, combined with the modular nature of its probes and the fact that the plug-ins themselves are pretty easy to write, means that if Nagios doesn't happen to check an attribute out of the box, I can either write a new script that does or, more likely, find that someone has already written it for me. There is a large set of third-party plug-ins available online that go far beyond system load and ping checks and move into SAN multipathing and more advanced Apache monitoring.

For the longest time, I just thought of Nagios as a monitoring tool. It would probe all of my servers and send alerts if a particular service was down or if system load or other statistics were outside the norm. Then one day, while I was looking for a good solution for trending performance data on my servers, I realized that while many other tools, such as Cacti, can poll system statistics and graph them for you, I already had a system in place that polled every server I cared about in my network. I just needed to figure out a way to extend it to graph all of the data it collected.

While a monitoring server mostly values data that falls outside of the norm, it still collects tons of valuable data every time it does a probe. Even though Nagios does not graph performance data by default, it does offer a mechanism to collect the data it gets from its probes. Basically, all a Nagios plug-in has to do to support Nagios's performance data collection is output the extra performance data at the end of its standard output. The format for the output is pretty straightforward and is documented for Nagios 3.0. Once the plug-in outputs this data, Nagios can be configured either to dump it into a file in one of several formats for later parsing or to pass it to a third-party program. There are a number of programs to manage this data, but I settled on one called PNP. PNP stores the performance data in RRD (Round-Robin Database) files that can then easily be graphed.
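To make the output format concrete, here is a minimal sketch in Python of how a load-check plug-in might build its output line. It follows the standard Nagios plug-in convention: human-readable text first, then a pipe, then performance data as label=value;warn;crit;min. The function name and the warning and critical thresholds are assumptions chosen for illustration, not part of any shipped plug-in.

```python
# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def format_load_check(load1, load5, load15, warn, crit):
    """Return (exit_code, output_line) for a hypothetical load check.

    Everything after the "|" is performance data in the form
    label=value;warn;crit;min, one entry per load average.
    """
    if load1 >= crit:
        status = CRITICAL
    elif load1 >= warn:
        status = WARNING
    else:
        status = OK
    text = (
        f"LOAD {STATUS_NAMES[status]} - load average: "
        f"{load1:.2f}, {load5:.2f}, {load15:.2f}"
    )
    perfdata = " ".join(
        f"{label}={value:.2f};{warn:.2f};{crit:.2f};0"
        for label, value in (("load1", load1), ("load5", load5), ("load15", load15))
    )
    return status, f"{text} | {perfdata}"


# Fixed sample values; the thresholds 4.0 and 8.0 are illustrative assumptions.
code, line = format_load_check(0.42, 0.35, 0.30, warn=4.0, crit=8.0)
print(line)
# LOAD OK - load average: 0.42, 0.35, 0.30 | load1=0.42;4.00;8.00;0 load5=0.35;4.00;8.00;0 load15=0.30;4.00;8.00;0
```

A real plug-in would feed in live values (for example from os.getloadavg() on Unix), print the line, and exit with the returned status code so Nagios can tell OK from WARNING or CRITICAL.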

Using graphs for troubleshooting server systems performance
Now what exactly is the advantage of graphing all of this data? Graphs aren't just for vendor presentations. They can be invaluable when you are trying to identify and track performance problems on your network. While you could certainly just pore through the performance numbers by hand, you will find you can identify problem points more quickly when all the system stats for a machine are graphed and lined up according to time.

Whether you use Nagios and PNP or some other graphing tool, once your system is set up, how do you use these graphs to track down performance issues? Sometimes you get lucky, but it's not always as easy as finding that one graph with a spike. For example, let's take one of the most basic statistics you will likely monitor and graph: system load. On Linux and a number of other Unix systems, the system load is displayed as three numbers: the average number of running or uninterruptible processes over one-, five-, and 15-minute intervals. These numbers aren't normalized across multiple CPUs, so a load average of one on a single-CPU machine means the processor is fully busy on average, while on a two-CPU machine a load of one means that, on average, one processor is idle.
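Because the raw numbers aren't normalized, it can help to divide them by the CPU count before comparing machines. The following is a minimal sketch in Python; the helper name load_per_cpu is my own, and the live reading at the end assumes a Unix system where os.getloadavg() is available.

```python
import os


def load_per_cpu(load_averages, cpu_count):
    """Divide each load average by the CPU count.

    A result near 1.0 means roughly one running or waiting process per CPU.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in load_averages)


# A load of one on a single-CPU machine versus a two-CPU machine.
print(load_per_cpu((1.0, 1.0, 1.0), 1))  # (1.0, 1.0, 1.0)
print(load_per_cpu((1.0, 1.0, 1.0), 2))  # (0.5, 0.5, 0.5)

# Live reading; os.getloadavg() is Unix-only and can raise OSError.
if hasattr(os, "getloadavg"):
    try:
        print(load_per_cpu(os.getloadavg(), os.cpu_count() or 1))
    except OSError:
        print("load average not available")
```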

However, spikes in load average can be misleading. While it's easy to blame a performance problem on high load, it's important to remember that all a load average really tells you is how many processes are running and potentially waiting. Load averages don't tell you why they are waiting. There are a number of different causes for high load averages on a system, and they can cause the system performance to degrade in different ways.

Probably the simplest cause of a high load average is a large number of processes on the system, many of which fully use a CPU. If all of your CPUs are completely busy and new processes spawn, each of those processes will have to wait for its turn with the CPU. This CPU-bound load can behave interestingly. Depending on how many of the waiting processes use the CPU heavily, you could have a very high load but still have a relatively responsive system. I've seen systems with CPU-bound loads in the hundreds that, while not exactly zippy, could still be logged into to check performance without much of a problem. I've also seen machines with relatively low CPU-bound loads bog down because there were enough CPU hogs running at the same time to more than tie up all CPUs.

Another common cause of high load is an I/O bottleneck. When processes compete for the same disk resources, some have to wait, and during high disk I/O the waiting processes can stack up. In my experience, high I/O-bound load can make the system even more sluggish than CPU-bound load does, even at lower load averages.

Since there are a number of different causes of load, it can sometimes take a bit of detective work to track down the root cause. However, good graphing tools can often help you pinpoint the cause much more quickly. For instance, a few metrics I monitor and graph on my systems are the load averages, RAM and swap utilization, disk I/O for each mount point, and network I/O. Once all the graphs are lined up, you can easily tell whether the spikes in load correspond to spikes in any of the other metrics. If I see high load but no spike in disk I/O, then there's a good chance the load is CPU-bound. If I see high load that correlates with high disk I/O, then I can be fairly confident that the load is disk I/O-bound. If I also notice my overall RAM use increasing before the spike in load, along with an increase in swap use, then I would have a good hunch that the system ran out of available RAM and fell back on swap, which in turn caused a large increase in disk I/O.
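The eyeball comparison described above can also be roughed out in code. This is only a sketch in Python under stated assumptions: the sample series are made up, the 0.8 correlation threshold is arbitrary, and the function names are hypothetical. It simply checks whether load rises and falls together with disk I/O and swap use over the same time window, which is a much cruder test than reading the graphs yourself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series cannot track anything
    return cov / math.sqrt(var_x * var_y)


def likely_cause(load, disk_io, swap_used, threshold=0.8):
    """Rough guess at what a load pattern tracks; threshold is an assumption."""
    disk_bound = pearson(load, disk_io) >= threshold
    swapping = pearson(load, swap_used) >= threshold
    if disk_bound and swapping:
        return "disk I/O-bound, possibly from swapping"
    if disk_bound:
        return "disk I/O-bound"
    return "probably CPU-bound"


# Made-up samples taken at the same timestamps.
load = [0.5, 0.6, 0.5, 4.0, 6.5, 5.0, 0.7, 0.5]
busy_disk = [10, 12, 11, 180, 260, 210, 15, 10]
quiet_disk = [10, 12, 11, 12, 10, 11, 12, 10]
no_swap = [0, 0, 0, 0, 0, 0, 0, 0]
rising_swap = [0, 0, 0, 200, 400, 350, 50, 0]

print(likely_cause(load, quiet_disk, no_swap))
print(likely_cause(load, busy_disk, no_swap))
print(likely_cause(load, busy_disk, rising_swap))
```

With these samples the three calls report a probably CPU-bound load, a disk I/O-bound load, and a disk I/O-bound load that may come from swapping, in that order.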

What's even better about using graphs to track down performance problems is that you can do it after the fact. For whatever reason, sometimes you can't access a machine while it is experiencing a performance issue. By the time you get an alert and log into the system, everything may have returned to normal. There are a number of times I have been able to piece together the cause of a performance issue strictly from the graphs. On my graphs I can always tell when my nightly backup job has run by the series of spikes in disk I/O followed by network traffic. This has been particularly handy when I've needed to rule out the backup job as the cause of sluggish performance, as I can do it at a glance and not have to dig through backup logs.

Over time, your monitoring can also provide good baselines for trending. Whether it's something more complex, like the gradual increase in the number of Apache processes your Web servers use during peak times and how that correlates with spikes in your RAM usage, or something simple, like the rate at which your databases have consumed disk space over the past few months, good graphing tools tied into your monitoring can automatically provide reports that would otherwise require you to devote time to mundane data collection and manual graphing. Plus, with proper arrangement of your statistics, you can more easily see the relationship of spikes across a number of different systems.
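As one example of what a baseline makes possible, here is a small Python sketch that fits a straight line to disk-usage samples and estimates how long until a volume fills. The sample figures, the 200 GB capacity, and the function names are all hypothetical, and real growth is rarely this linear, so treat the result as a rough planning number.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for (day, used_gb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    denom = sum((x - mean_x) ** 2 for x, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one day")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / denom
    return slope, mean_y - slope * mean_x


def days_until_full(samples, capacity_gb):
    """Days after the last sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_day = max(x for x, _ in samples)
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - last_day)


# Hypothetical monthly readings: (day number, GB used).
usage = [(0, 100.0), (30, 115.0), (60, 130.0), (90, 145.0)]
slope, _ = linear_trend(usage)
print(f"growth: {slope:.2f} GB/day")
print(f"days until a 200 GB volume fills: {days_until_full(usage, 200.0):.0f}")
```

With these made-up readings the growth rate works out to 0.50 GB/day, which puts a 200 GB volume about 110 days from full.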

Combining a reliable, extensible monitoring tool with automated graphing and trending tools makes otherwise time-consuming and mundane work, such as gathering statistics and tracking performance bottlenecks, manageable. When your downtime is measured in dollars, not seconds, you need every advantage you can get to isolate the cause of performance issues quickly and accurately. Plus, you get the added advantage of fancy graphs to throw into your next presentation in front of management.

ABOUT THE AUTHOR: Kyle Rankin is a systems administrator in the San Francisco Bay Area and the author of a number of books including Knoppix Hacks and Ubuntu Hacks for O'Reilly Media.
