Troubleshooting is not a skill best learned during a Linux system panic or a security breach, says Steve Best, author of Linux Debugging and Performance Tuning, a new book from Prentice Hall PTR. In this interview, he helps IT managers prepare for fixing and preventing Linux crashes and discusses the art of troubleshooting.
Linux servers are known for not crashing, but crashes do occur. Can you offer some advice for someone faced with a Linux crash?
Steve Best: First, look at the system's error logs and check for messages that don't seem normal. Finding one can be a very helpful pointer to where to look next, and sometimes a message tells you exactly what happened to the system. Maybe the system had an oops, which could lead you to the component that is having a problem, or maybe a key component didn't start.
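As a rough illustration of that first pass through the logs, the sketch below flags lines that match a few suspicious patterns. The pattern list and the function name `find_suspicious` are assumptions made for this example, not something from the interview, and the set of messages worth flagging will differ from system to system.

```python
import re

# Illustrative patterns only (an assumption); extend for your own systems.
SUSPICIOUS = re.compile(
    r"Oops|kernel BUG|Kernel panic|Out of memory|segfault|I/O error",
    re.IGNORECASE,
)

def find_suspicious(lines):
    """Return (line_number, text) pairs for lines matching a suspicious pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPICIOUS.search(line)
    ]

# Made-up sample log lines, used here instead of a real log file.
sample_log = [
    "Jan  1 00:00:01 host kernel: eth0: link up\n",
    "Jan  1 00:00:02 host kernel: Oops: 0002 [#1]\n",
    "Jan  1 00:00:03 host kernel: Out of memory: Killed process 123\n",
]

for number, text in find_suspicious(sample_log):
    print(f"{number}: {text}")
```

On a real system the same function can be fed an open log file; the file's location varies by distribution, so no path is assumed here.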
What are the most common causes of a Linux system crash?
Best: Sometimes not having a component (application, device, etc.) set up correctly can cause a crash, as can hardware that isn't working correctly or is starting to fail.
Another cause could be the system running low on a key system resource. This one might not cause a crash, but the performance of the system won't be where you would like it to be. There could be a race condition that only happens when the system is under heavy load.
Once you find that a system has crashed or isn't performing as expected, the detective work begins. The diagnostic aids available in Linux can help you find the problem quickly and get the system back up and running.
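One way to check whether a system is running low on a key resource is to read the memory figures the kernel exposes in /proc/meminfo. The sketch below parses text in that format and applies a cutoff; the 10% threshold, the function names and the sample numbers are assumptions chosen for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def memory_is_low(info, threshold=0.10):
    """Return True if available memory is below `threshold` of total.

    The 10% default is an assumed cutoff, not a recommended value.
    """
    total = info["MemTotal"]
    # MemAvailable is absent on older kernels; fall back to MemFree.
    available = info.get("MemAvailable", info["MemFree"])
    return available / total < threshold

# Made-up sample in the /proc/meminfo layout.
sample = """MemTotal:       16000000 kB
MemFree:          400000 kB
MemAvailable:     800000 kB
"""

info = parse_meminfo(sample)
print("low on memory:", memory_is_low(info))
```

On a live system, the text would come from reading /proc/meminfo instead of the sample string.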
What makes Linux easier or harder to troubleshoot than other operating systems?
Best: Whether an operating system is difficult or easy to debug depends on the toolset available. Luckily, Linux has a rich toolset for debugging. Knowing the toolset available on the operating system is key to troubleshooting a problem.
Would you consider system and performance troubleshooting an art, in that it requires creative thinking on the part of the detective?
Best: Most definitely, yes. As I said, it helps to know all of the tools that you have available to make the detective work easier.
Say you need to tune an application: it is beneficial if you can identify the code that will give you the most performance improvement and focus on that area. A profiler can help with this task. There are application profilers like gprof and a system-wide profiler called oprofile, for example.
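To make the idea of finding the hot spot concrete, here is a minimal sketch using cProfile from Python's standard library. It is a stand-in for illustration, not gprof or oprofile, and the functions `slow_sum`, `fast_part` and `workload` are hypothetical examples of application code.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately wasteful loop; the hypothetical hot spot."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total

def fast_part():
    """Cheap function that should barely register in the profile."""
    return len("quick")

def workload():
    fast_part()
    return slow_sum(20000)

def profile_report(func, top=5):
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()

print(profile_report(workload))
```

In the report, the functions with the largest cumulative time are the ones worth examining first when tuning.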
Another tool that is in development is called SystemTap, which will allow developers and system administrators to view the system and take performance measurements. SystemTap uses kprobe technology to collect the performance information.
What tools or best practices are must-haves in any troubleshooting situation?
Best: You need to be able to ask questions about what was occurring in the system when the problem happened. Sometimes, this is the key to finding the cause of problems.
Once I have access to the system, one of the first things I do to troubleshoot a problem is look at the system and application error logs. Next, I look at the processes on the system and see what state they are in; a good way to do that is to use one of the process viewing tools. There are usually several on a system (ps, pgrep, pstree and top).
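As a small sketch of checking what state the processes are in, the code below tallies processes by the first letter of the STAT column, as printed by `ps -eo pid,stat,comm`. The sample output and its process names are made up for illustration, and `count_states` is a hypothetical helper.

```python
from collections import Counter

def count_states(ps_output):
    """Count processes by the first letter of the STAT column.

    Expects text shaped like the output of `ps -eo pid,stat,comm`.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) >= 2:
            counts[fields[1][0]] += 1
    return counts

# Made-up sample; D = uninterruptible sleep, R = running,
# S = interruptible sleep, Z = zombie.
sample = """  PID STAT COMMAND
    1 Ss   systemd
   42 D    backup-job
   57 R+   ps
   63 Z    defunct-worker
   70 S    sshd
"""

for state, count in sorted(count_states(sample).items()):
    print(state, count)
```

A rising count of processes in the D or Z state is often a useful hint about where to look next. On a live system the text could be captured by running the ps command above.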
When you know that the source code is the problem, having a debugger available can be key to solving it. A debugger is an exceptionally powerful tool that developers use to investigate the behavior of programs and troubleshoot programs that are failing.
By stepping through the code instruction by instruction, examining and changing program variables, and setting breakpoints at certain code locations, the developer can use the debugger to find what is causing the problem.