It's a familiar story: Your Linux server is busy, but you have no clue what it's doing. Before adding memory or processing power, find out what's happening inside the box and monitor what the CPU is doing. In this article, you'll learn how to do just that with some easy-to-use utilities.
CPU monitoring commands
A good starting point for analyzing CPU activity is the top command. Although top adds some CPU load of its own, it is still good enough to get a generic view of what is happening. When using top, look at the information provided in two lines. First, there is the load average, which gives three values that indicate the status of the server CPU: the load average for the previous 1, 5 and 15 minutes.
Interpreting these values is not difficult. If the value is higher than 1.00 per CPU core, there is more work to do than the CPU can handle. For example, a value of 2.00 on a single-core CPU indicates that the queue of runnable processes contains twice as much work as the CPU can handle. On a server with dual Xeon processors and four cores per processor (eight cores in total), the threshold is 8.00, so a value of 6.00 would still be within capacity.
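The per-core arithmetic can be sketched in a few lines of shell. The load and core-count values below are made up for illustration; on a live system you could take them from uptime (or /proc/loadavg) and nproc instead:

```shell
# Illustrative values; on a live system, take the load from `uptime`
# or /proc/loadavg and the core count from `nproc`.
load=6.00
cores=8

# Divide the load average by the core count and compare against 1.00.
awk -v l="$load" -v c="$cores" 'BEGIN {
    per = l / c
    printf "per-core load: %.2f -> %s\n", per, (per > 1.0 ? "over capacity" : "within capacity")
}'
# per-core load: 0.75 -> within capacity
```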
This data might cause one to jump to conclusions. For a single-core system that consistently shows values above 1.00, one might assume that a faster CPU is all that is needed to solve the problem, but the solution could lie elsewhere. For example, if your server has heavy disk I/O and your disk is slow, your processes would be waiting for disk I/O all the time, which keeps the load average high. This is where the second important line in top becomes relevant: the third line of output, which displays the current CPU status.
Typically, a CPU can be doing any number of things, including:
- running kernel code. This is referred to as system time (sy)
- running user code (us)
- running code that has been set at a lower priority using the nice command (ni)
- doing nothing, which is represented as idle time (id)
- waiting for I/O (wa)
- servicing a high-priority hardware interrupt (hi)
- servicing a low-priority software interrupt (si)
Top can also show an st (steal) value for the CPU in a virtualized environment, indicating how much processor time the hypervisor has taken away from this virtual machine to serve other guests.
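These states correspond to the per-CPU counters the kernel keeps in /proc/stat, which is where top reads them. As a sketch, the jiffy counts below are made-up illustrative numbers; note that /proc/stat lists the fields in a different order (user, nice, system, idle, iowait, irq, softirq, steal) than top displays them:

```shell
# First line of /proc/stat, with made-up jiffy counts for illustration.
# Fields after "cpu": user nice system idle iowait irq softirq steal
line="cpu 4705 150 1120 16250 520 20 5 0"

# Convert the raw counters into the percentages top shows.
echo "$line" | awk '{
    total = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9
    printf "us %.1f%%  sy %.1f%%  ni %.1f%%  id %.1f%%  wa %.1f%%  hi %.1f%%  si %.1f%%  st %.1f%%\n",
        100*$2/total, 100*$4/total, 100*$3/total, 100*$5/total,
        100*$6/total, 100*$7/total, 100*$8/total, 100*$9/total
}'
# us 20.7%  sy 4.9%  ni 0.7%  id 71.4%  wa 2.3%  hi 0.1%  si 0.0%  st 0.0%
```

Strictly speaking, top computes these percentages from the difference between two successive readings of these counters, not from the running totals shown here.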
The above parameters can indicate the type of problem afflicting a server. Generally speaking, if the sy value is high, some kernel-related task is causing too high a workload. Finding this process isn't hard, either: the suspect will appear near the top of the process list that top displays. The same goes for the us value, which indicates a high workload caused by user-space processes.
A high id value is the sign of a healthy server. The id value represents the idle loop, or the amount of time that your server is doing nothing. Thus, the higher the idle value, the more spare capacity the server has.
If the wa or hi parameter is high, it can indicate a real problem. The wa parameter shows how much time the CPU has spent waiting for I/O. This I/O can come from the hard disk or from the network. Therefore, a high value for the wa parameter often indicates a slow hard disk or a slow network connection, which will require some fine-tuning. To find out whether it is the hard disk or the network, you can use ifconfig and vmstat. The ifconfig command shows statistics on packets handled by a network card, whereas the vmstat command provides information about the amount of traffic handled by a hard disk. The latter is displayed in the bi (blocks in) and bo (blocks out) columns. If these are really high, the disk may be the cause of the high value in top's wa parameter. In that case it may be useful to upgrade the disk subsystem.
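As a sketch of what to look for, the vmstat output below is made up for illustration; on a live system you would run vmstat 2 and watch the bi and bo columns directly:

```shell
# Made-up vmstat output; the column positions match vmstat's standard layout.
sample='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  1      0 123456   7890 345678    0    0  9800  4200  300  250 10  5 35 50  0'

# Pull the bi, bo and wa figures from the data line (line 3).
echo "$sample" | awk 'NR == 3 { printf "bi=%s bo=%s wa=%s\n", $9, $10, $16 }'
# bi=9800 bo=4200 wa=50
```

Here the high bi/bo figures together with a wa of 50 would point to the disk as the bottleneck.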
If problems are related to the network interfaces, the ifconfig command will tell you. You might see high values for the TX packets and RX packets counters, but ifconfig may also reveal errors, such as dropped packets or overruns. If so, the network interface is probably the culprit. Upgrade the driver for your network card, or the card itself.
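A quick way to spot this is to scan the error counters. The statistics lines below are made up for illustration; on a live system you would feed this the output of ifconfig (or ip -s link on modern distributions):

```shell
# Made-up RX/TX counter lines in the style of ifconfig's statistics output.
stats='RX packets 1284756  errors 12  dropped 48  overruns 3
TX packets 998321  errors 0  dropped 0  overruns 0'

# Flag any direction whose error, dropped or overrun counters are non-zero.
echo "$stats" | awk '$5 > 0 || $7 > 0 || $9 > 0 {
    printf "%s problems: errors=%s dropped=%s overruns=%s\n", $1, $5, $7, $9
}'
# RX problems: errors=12 dropped=48 overruns=3
```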
Sometimes nothing is failing; the server is just busy. This is typical when one process with a heavy workload is active on a server, or when there are simply too many processes for the server to handle. In the latter scenario, the processor(s) try to give each process a fair amount of CPU time, which means they switch between processes very quickly. This is the foundation of a multitasking operating system, but these context switches have disadvantages as well. To make a context switch to a new process, the CPU has to save all context information for the old process and retrieve the context information for the new one. In terms of CPU cycles, that's a rather expensive procedure.
To find out if a server suffers from a high number of context switches, use the vmstat utility. Add an interval argument, such as vmstat 2, to instruct the utility to refresh every two seconds, which will reveal trends in all of the displayed parameters. The column to look at is the cs column. If it is stable, there's no problem. But if there are peaks in the context-switch trend, a particular process is probably causing these high cs values. Open a top window at the same time and try to observe which process is responsible. Perhaps it can be moved to another server.
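The spike pattern can also be picked out mechanically. A sketch with made-up cs samples (in practice these would come from successive lines of vmstat 2), using a rough "three times the average" heuristic:

```shell
# Made-up cs (context switch) samples from successive vmstat lines.
cs_values='250 260 245 4800 255'

# Flag any sample far above the average -- a rough spike heuristic.
echo "$cs_values" | tr ' ' '\n' | awk '
    { sum += $1; vals[NR] = $1 }
    END {
        avg = sum / NR
        for (i = 1; i <= NR; i++)
            if (vals[i] > 3 * avg)
                printf "spike: %d (average %.0f)\n", vals[i], avg
    }'
# spike: 4800 (average 1162)
```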
If a server is slow, use these tools to analyze the situation on the CPU. These techniques can help improve server performance, creating a more efficient IT infrastructure.
About the author: Sander van Vugt is an author and independent technical trainer, specializing in Linux since 1994. Vugt is also a technical consultant for high availability (HA) clustering and performance optimization, as well as an expert on SUSE Linux Enterprise Desktop 10 (SLED 10) administration.