In this interview, author Taylor Dondich suggests a workaround for balancing active and passive checks and explains the impact of service failures.
If you've already deployed Nagios in your IT environment, what are some tricks you can use to enhance and improve performance?
Dondich: As your IT environment grows, the number of monitored devices will grow with it. As it continues to grow, you may see performance degrade or the bandwidth in your network saturate with the number of checks Nagios is performing. The thing I can't stress enough is to use active checks only when necessary and to really leverage passive checking.
Active checks occur when Nagios itself is responsible for checking the status of a device at regular intervals. By contrast, with a passive check the device reports its own status to Nagios, and it does so only when that status changes.
Increasing the number of passive checks you use instead of active checks will increase the number of devices you can monitor with Nagios. Beyond that, you may need to start looking into using a distributed Nagios implementation. This requires separate Nagios instances communicating with a central Nagios system. It's tough to maintain the configuration files, since each Nagios instance requires its own set, but in the end, it'll do the job.
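To make the passive side concrete, here is a minimal Python sketch of how a device-side script might hand a result to Nagios through its external command file, using the PROCESS_SERVICE_CHECK_RESULT external command. The host name web01, the service name HTTP, and the command file path passed to submit() are assumptions for illustration; the real path is whatever your nagios.cfg specifies, and the service has to be configured to accept passive checks.

```python
import time


def passive_service_result(host, service, return_code, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    # Plugin return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit(line, command_file):
    """Write one command line to the external command file.

    command_file is an assumed path; use the one set in your nagios.cfg.
    """
    with open(command_file, "w") as fh:
        fh.write(line)


# Hypothetical host and service names, reported only when the status changes.
line = passive_service_result("web01", "HTTP", 2, "connection refused",
                              timestamp=1700000000)
```

Because the device only writes a line when something changes, Nagios does no polling work for that service, which is where the scaling benefit comes from.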
Won't continual host checks become a performance drain in Nagios?
Dondich: You need to strike a balance between passive and active checks. For example, I may use passive checks for most of the services on a device but still use an active check, a PING check via check_ping, to verify that the device is reachable.
If a service monitored by Nagios recovers from an error before reaching a hard state (i.e., a soft recovery), administrators won't be informed. Is this important?
Dondich: When a service fails for the first time, Nagios will put that service in a "soft" state. Nagios will then check the service a configured number of times to see if it comes back up. If it does not come back up within that preconfigured number of checks, then Nagios will put the service in a "hard" state and notifications will be sent out. If the service recovers within those checks, Nagios will not send out notifications. So why do this?
Well, event handlers can be used to perform actions based on a status change, whether it is a soft or hard state. For example, if you have an Apache Web service which fails, an event handler may be run to attempt to restart the Apache service. If the service comes back up while Nagios is checking it, then there's probably no real reason to send out notifications.
But if the attempt to restart Apache fails, then Nagios will eventually put the service in a hard state, causing the notification to be sent out. If you want notifications to always be sent out, adjust the max_check_attempts parameter, which is available for both hosts and services and specifies how many checks to perform before the state becomes "hard." Setting it to 1 makes the very first failure a hard state.
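The soft and hard state logic described above can be sketched as a small Python simulation. This is a simplified model rather than Nagios's actual implementation: it only tracks consecutive failures against max_check_attempts, and it ignores event handlers, retry intervals, and repeat notifications. The function name simulate and the tuple layout are made up for the example.

```python
def simulate(results, max_check_attempts):
    """Return one (state, state_type, notify) tuple per check result.

    results is a sequence of "OK" or "CRITICAL" strings.
    """
    timeline = []
    attempt = 0           # consecutive non-OK results seen so far
    hard_problem = False  # True once a problem has reached a hard state
    for result in results:
        if result == "OK":
            if hard_problem:
                # Recovery from a hard problem: a notification goes out.
                timeline.append(("OK", "HARD", True))
            elif attempt > 0:
                # Soft recovery: the problem cleared before going hard.
                timeline.append(("OK", "SOFT", False))
            else:
                timeline.append(("OK", "HARD", False))
            attempt = 0
            hard_problem = False
        else:
            attempt = min(attempt + 1, max_check_attempts)
            if hard_problem:
                # Already hard; repeat notifications are not modeled here.
                timeline.append((result, "HARD", False))
            elif attempt >= max_check_attempts:
                hard_problem = True
                timeline.append((result, "HARD", True))
            else:
                timeline.append((result, "SOFT", False))
    return timeline


# A service that fails twice and then recovers, with max_check_attempts = 3.
soft_recovery = simulate(["CRITICAL", "CRITICAL", "OK"], 3)

# The same first failure with max_check_attempts = 1 goes hard immediately.
always_notify = simulate(["CRITICAL"], 1)
```

In the first run nobody is notified, because the service came back within the allowed attempts. In the second, the single failure is already a hard state, so a notification is sent.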
What are some common mistakes that occur when configuring problem escalation?
Dondich: I think the biggest one would be that of multiple escalation levels for a device or service. For example, say you have the initial contact for a Web server as Bob. Bob gets notified about a problem which is occurring. Bob is lazy, doesn't fix the problem, so the notifications escalate properly to the network admin, Tim.
Tim is working on something and doesn't initially see his notifications. So it gets escalated again, this time sending notifications to Tim's boss, Rich. But something strange has happened. Rich is getting notifications that there is this problem, but Tim is no longer receiving them, because the second escalation rule lists only Rich. It's a common problem, and to put it plainly, you need to make sure that your escalation rules include all individuals that should be notified at that time, not just the additional contact.
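The Bob, Tim, and Rich scenario can be reproduced with a short Python sketch. It models each escalation by its first_notification and last_notification range plus a contact list, and it assumes that the contacts of whichever escalations match the current notification number take the place of the default contacts. The notification numbers 3 and 6, the Escalation class, and the recipients function are assumptions chosen for illustration, not Nagios code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    first_notification: int
    last_notification: int  # 0 means "no upper limit"
    contacts: frozenset


def recipients(notification_number, default_contacts, escalations):
    """Return the set of contacts for a given notification number."""
    matched = [
        e for e in escalations
        if e.first_notification <= notification_number
        and (e.last_notification == 0
             or notification_number <= e.last_notification)
    ]
    if not matched:
        # No escalation applies, so the default contacts are used.
        return set(default_contacts)
    result = set()
    for e in matched:
        result |= e.contacts
    return result


default = {"bob"}

# The mistake: each level lists only the newly added contact.
broken = [
    Escalation(3, 5, frozenset({"tim"})),
    Escalation(6, 0, frozenset({"rich"})),
]

# The fix: each level lists everyone who should hear about it at that point.
fixed = [
    Escalation(3, 5, frozenset({"bob", "tim"})),
    Escalation(6, 0, frozenset({"bob", "tim", "rich"})),
]

sixth_broken = recipients(6, default, broken)
sixth_fixed = recipients(6, default, fixed)
```

With the broken rules, the sixth notification reaches only Rich. With the fixed rules, Bob, Tim, and Rich all receive it.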
Some good online documentation describes this escalation scenario.