If you don't count spam, I get more emails from computers than from people these days. Every server I am responsible for checks itself for health, and to ensure a second opinion, I have a central server that monitors the health of all of the machines on my network. Every blip, flap and crash is detected and I am alerted via email. Since all of these emails ultimately land on my BlackBerry, I feel like I'm at the nerve center of my network no matter where I am.
Sometimes my nerve center has been more, well, nervous than other times. One of those instances was when I first set up monitoring. If you've ever set up an intrusion detection system (IDS) or central monitoring for a new network, then you're familiar with the stream of emails you get when you first turn it on. It's amazing how many conditions that an out-of-the-box monitoring solution treats as exceptions are actually normal for your network. It might be those servers that always use the maximum amount of RAM, or it could be that the database machine has a load spike every few hours when a major report is run. If you have an IDS, it will probably generate a flood of alerts about all of the port scans and Slammer worms you never knew hit your gateway.
Of course, the danger of all of this is that if you continually get alerts that amount to nothing, over time you'll naturally start to take these alerts less seriously. That's why, after a monitoring solution is installed, the next step is to weed out false alarms one at a time. After a few days or weeks of working with a new system, I've typically been able to work out all of the flapping and false alarms so that when I do get an alert, it's for something serious.
If everything stayed as it was once I finished tuning my monitoring server, life would be easy. The fact is, entropy eventually sets in. It doesn't happen all at once; otherwise, after one or two sleepless nights, you would drop everything and give your server a tune-up. No, the entropy sets in slowly, one false alarm at a time. It might start with a server that gets close to filling its disk. You get the alert, note it, order more storage, and either put up with the alert in the meantime or, more likely, disable it. Then a second server has a load spike every few hours, and a third starts to max out its RAM, but then stabilizes just above your alert threshold. Finally, two new IIS vulnerabilities hit the Internet, and even though you only run Apache, your IDS alerts you about the attack attempts anyway. Over time you have an inbox full of these false alarms that you just haven't had the time to track down and remove.
If you're lucky (or smart), you will hit a certain threshold, decide that enough is enough, and then do a little spring cleaning on your monitoring thresholds and overall server health. If you are unlucky, then, as in the story of the boy who cried wolf, a real alert will come through unnoticed in the stream of false alarms and cause real downtime. In my opinion, a poorly tuned monitoring server or IDS is as bad as or worse than no server monitoring at all. At least with no monitoring you are less likely to become complacent. If you don't have a car alarm and live in a bad neighborhood, you'll probably be more careful to put away valuables and lock your doors. But if you have a car alarm that goes off every time another car drives by, you will naturally start to ignore it over time.
When it's time to clean up your monitoring server, you have to ask yourself what is really exceptional behavior for a server and what is normal. Not all servers fit into the same mold of "normal." For instance, on an active mail server it might be normal for there to be 50 or more deferred messages in the queue for a few hours, yet that could signal a problem on a Linux Web server that has its own local mail service. A fair rule of thumb is that if you are constantly getting alerts for something, it might be normal for that server. If so, disable the alert or raise its threshold. Even a disabled alert doesn't do any more harm than an inbox full of ignored alerts.
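To make the idea concrete, here is a minimal Python sketch of both points: a threshold that depends on the server's role, and a tally of recent alerts that flags anything firing so often it is probably normal. The role names, the threshold numbers, the host names and the helper functions are all hypothetical and chosen only for illustration; they don't come from any particular monitoring package.

```python
from collections import Counter

# Hypothetical per-role limits for the deferred mail queue. The numbers are
# illustrative assumptions, not recommendations.
DEFERRED_QUEUE_THRESHOLDS = {
    "mail": 200,  # a busy mail server may hold 50 or more deferred messages
    "web": 10,    # a Web server's local mail queue should stay nearly empty
}


def should_alert(role, deferred_count):
    """Return True when the queue exceeds the limit for this server role."""
    return deferred_count > DEFERRED_QUEUE_THRESHOLDS[role]


def noisy_checks(alerts, min_count):
    """Return the (host, check) pairs that fired at least min_count times."""
    counts = Counter(alerts)
    return sorted(pair for pair, n in counts.items() if n >= min_count)


# The same queue depth is normal on one server and a warning sign on another.
print(should_alert("mail", 50))
print(should_alert("web", 50))

# Hypothetical alert history: one (host, check) pair per alert email received.
recent_alerts = [
    ("db1", "load"),
    ("db1", "load"),
    ("db1", "load"),
    ("web2", "disk"),
]
print(noisy_checks(recent_alerts, min_count=3))
```

Anything that noisy_checks returns is a candidate for a closer look: either the server has a real, chronic problem, or the threshold no longer matches what is normal for that machine.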
Finally, once your monitoring server has been tuned, try to keep it that way. I know how easy it is to let little problems accumulate, but it's like cleaning a room -- as annoying as it can be to pick up a mess right after you make it, if you do it then you won't be overwhelmed by the huge mess it will otherwise become. Your mailbox will be smaller and easier to manage and your servers will ultimately be healthier since you'll be more aware of real problems as soon as they arise. Plus, with a cleaner inbox, you'll be able to notice those other important emails -- the ones from actual people.
ABOUT THE AUTHOR: Kyle Rankin is a systems administrator in the San Francisco Bay Area and the author of a number of books including Knoppix Hacks and Ubuntu Hacks for O'Reilly Media.