Linux's record for reliability may be the polar opposite of what critics consider the crash-a-day life of Windows. Yet the very rarity of Linux crashes means that an unexpected outage throws many IT administrators into uncharted territory. Learning the proper steps to handle and prevent such crashes can help Linux admins avoid many headaches over the long term.
In this SearchOpenSource.com interview, Mark Wilding and Dan Behman, authors of the new book Self-Service Linux: Mastering the Art of Problem Determination from Prentice-Hall PTR, provide a straightforward guide to Linux crash prevention and recovery.
Linux on the server is known for not crashing, but it can and does crash and hang. What's the difference between a crash or a hang at the application level, as opposed to the kernel?
Mark Wilding: A crash or a hang at the application level is isolated only to a particular thread or process. The crash or hang will not directly cause other unrelated processes or threads running on the same system to crash or hang. At the kernel level, a crash or hang will affect all processes running on the system.
What's the difference between a crash and a hang?
Dan Behman: The properties of a crash and a hang at either level are basically the same. A hang occurs when a process or thread gets stuck waiting for something -- usually a lock of some kind or some hardware resource -- to become free. Waiting for a lock or a resource is not uncommon, but when that lock or resource never becomes available, the result is a hang.
It's also important to note that hangs can sometimes be diagnosed too early. What I mean is that, say, a resource is very busy at a given time; a process or thread that needs that resource may then have to wait an unusually long time for it to become free. A user may be unaware that the resource is busy and only sees the process waiting, so he interprets that as a 'hang' when it's actually working as designed, albeit slowly.
A crash is very different from a hang and occurs when an unexpected hardware or software error occurs. When these errors occur, special error handling is ideally invoked to dump out diagnostic information and reports that will be useful in tracking down the cause of the error.
Crashes can be thought of as point-in-time problems that require post-mortem analysis, and hangs can be thought of as real-time problems that one can analyze live.
Besides the fact that the source code is available -- and I know that's a huge advantage -- are there other reasons why Linux crashes are easier to handle than crashes on other operating systems?
Behman: Along with the source code being available, there is a plethora of documentation at just about every level. Also, since the source code is open, so too is the development community. Getting your question read by key Linux kernel developers, including the guy that started it all, Linus Torvalds himself, is simply a matter of posting to a mailing list. That ability does not exist with any 'closed source' operating system that I know of.
What are the challenges of dealing with a hang?
Wilding: An application hang can have several causes, including a hang that is caused by something in kernel space; this usually means the problem is beyond the control of the developer. But that's the beauty of Linux. All the source code is available, so if you can obtain a kernel stack dump of the process, that can be correlated with the source code to get an idea of what the process is doing in the kernel. Very often, going to this length isn't required, though. [To determine why their process is hanging], application developers will examine the evidence -- such as stack traces -- at the application level.
For users or support personnel who may not have an intimate knowledge of the application's workings, or access to its source code, hangs can get very tricky to diagnose. Consider a case where process A is waiting for a lock to be released by process B, and process B is waiting for a lock to be released by process A. This is known as a deadlock and is a common problem in complex applications that is often diagnosed simply as a hang.
If you do not know what specifically processes A and B are waiting for, then you wouldn't even know that a deadlock has occurred and you would probably have no choice but to kill the processes and start over. In cases like these, it is very important for the application to have thorough lock tracking of some kind built into it to help diagnose these tricky problems.
Behman: Another challenge with hangs is that when a hang occurs, the process or thread usually does not know that it's hung, and rarely knows when it's about to hang. Compare this to a crash: when a crash occurs, the process can intercept most signals, and signal handling can be added to perform special actions, such as dumping memory, stack traces, etc. Adding that kind of special handling for a hang is not impossible, but it is very tricky.
With a hang, there is also the knee-jerk reaction to restart the system or application. Keep in mind that with a hang, the evidence you need to diagnose the problem is probably captured in the live kernel/application that is hanging. If you restart without collecting the right information, you will not be able to diagnose the problem and therefore you will not be able to prevent it from occurring again in the future.
For mission-critical environments, the availability of the system is directly related to how quickly a problem can be diagnosed and prevented. Collect first, then restart.
What's the first thing you'd do when faced with a hang, as opposed to a crash?
Behman: Again, dealing with hangs at the kernel level is very different from dealing with them at the application level.
Let's assume you're asking about the application level. When a crash occurs, there are special functions called signal handlers that get invoked to dump various information, such as memory contents, stack tracebacks, etc. So usually, in the case of a crash, it's simply a matter of gathering, organizing and analyzing that data.
In the case of a hang, this data is not automatically gathered; collecting it is very much a manual process. Two key things to gather in a hang situation are strace output and stack tracebacks. The strace output will give an indication of what the process is doing -- for instance, is it still moving? -- while strace is watching the process. The stack tracebacks will give an indication of where in the source code the process currently is. This is very useful for developers so they can determine why the process might be in an apparent hang situation.
What are the most common causes of crashes and hangs?
Wilding: For crashes, we can split the common causes into either panics or traps/exceptions. A panic (or abort of some kind) is a crash where the kernel or application decides to crash because of a severe situation. The software itself realizes that there is a problem and literally panics, 'committing suicide' in a way to prevent further errors that could be more serious. A trap/exception means that memory was accessed in an invalid way and is almost always a programming error. In this case, the hardware actually detects the invalid memory access and raises an exception, which results in the application getting sent a signal to terminate processing.
There are generally two causes of hangs. One is a process or thread waiting on a resource that may or may not become available. Other processes or threads can then block on resources (e.g., locks) that this process/thread is holding while it is hung. An example would be a process that is holding a critical lock while waiting indefinitely to receive information from the network. The second general cause is a dependency loop, where two or more processes or threads are each waiting for the other to do something -- release a lock, write something to an area of shared memory, etc.
In crash and hang situations, what are some basic investigation practices admins should use?
Wilding: One basic best practice is to be organized. It's important to keep all collected data in a well-defined location so the data can be easily found in the future. This is especially important when one is working on several different problems at the same time.
Behman: Another basic best practice is to be quantitative rather than qualitative when collecting data. For example, saying 'last night at 6 p.m. the system was low on memory' is a qualitative observation. This isn't very useful in determining the problem. The quantitative version of this example would be to collect and save the output of the 'date' command, along with output of the 'free,' 'vmstat,' 'top' and other related diagnostic commands. The goal is to collect enough data to not have to go back and have the problem reproduced to gather another piece of data.