Linux crashes. Yes, it does. Sure, Linux on the server crashes rarely; but, like a Boy Scout, an IT manager has to be prepared for any emergency. So, here are some handy tools to use and ways to find out why Linux has crashed, courtesy of an interview with Mark Wilding and Dan Behman, authors of the new book Self-Service Linux: Mastering the Art of Problem Determination from Prentice-Hall PTR. Even more importantly, they explain when an IT administrator shouldn't try troubleshooting a crash.
When a crash occurs, what's the first thing an admin should do after the restart?
Dan Behman: The first thing should be to gather, categorize, and save the diagnostic data created by the crash. Whether the admin will be looking at the problem themselves or not, saving this data in a safe place is important so that it can be analyzed when the time is right to do so.
It's also important to save the data from every occurrence of a particular crash, so that you can determine whether the occurrences are identical or not. In crash situations, an admin who is supporting a production server, for example, might be under the gun to get the system back up and running as quickly as possible, and the diagnostic data is easily lost in that rush. This is why collecting and saving this data properly is so important.
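As a rough illustration of the gather-and-save step Behman describes, the Python sketch below copies whatever log files exist into a fresh, timestamped directory and records a checksum for each copy. The function name `save_crash_data`, the `label` argument, and the directory layout are hypothetical, invented for this example; which files actually hold crash data depends on your distribution.

```python
import hashlib
import shutil
import time
from pathlib import Path


def save_crash_data(sources, archive_root, label="crash"):
    """Copy each existing source file into a new timestamped directory.

    Returns the new directory and a {filename: sha256 hex digest} map.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / f"{label}-{stamp}"
    # Refuse to reuse a directory so one occurrence never overwrites another.
    dest.mkdir(parents=True, exist_ok=False)
    digests = {}
    for src in map(Path, sources):
        if not src.is_file():
            continue  # skip logs that this system does not have
        copied = dest / src.name
        shutil.copy2(src, copied)  # copy2 also preserves the modification time
        digests[src.name] = hashlib.sha256(copied.read_bytes()).hexdigest()
    return dest, digests
```

A call might look like `save_crash_data(["/var/log/messages"], "/srv/crash-archive")`, where both paths are assumptions. The checksums only show that a saved copy has not changed since it was archived; deciding whether two crashes are really the same still means reading the data.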
What's the role of the serial console in troubleshooting a crash?
Mark Wilding: When crashes occur at the system level, the crash can be severe enough that it affects the system's ability to write the diagnostic data to disk. A serial console basically provides a much safer way of capturing and writing very important diagnostic data to disk.
When a serial console is set up, the diagnostic data that is written to the local console (usually the monitor connected to the system) is also written to the serial port, where a remote machine receives it and writes it to its own disk.
Could you offer some tips for using the serial console effectively?
Wilding: The first thing to do is ensure that you're using a null modem cable and not just a regular serial cable. Once the cable is in place between the two machines, it's important to run a terminal program such as minicom on the remote end and send a test message from the Linux server via stty and echo.
Behman: Next, it's important to boot your system with the additional kernel parameters that enable the serial console. Once this is done, you should see bootup messages appearing on the remote console.
Lastly, ensure that saving the incoming data to disk is enabled in the remote terminal program so that the important diagnostic data is captured when it's really needed. In our book, chapter seven discusses more details and gives step-by-step instructions for setting up a serial console.
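The three steps above can be sketched in Python, purely as an illustration and not as the procedure from the book. The function names are hypothetical. The port `ttyS0` and the speed of 9600 baud are assumptions that must match your hardware, and `console=` is the standard kernel boot parameter for selecting console devices. A terminal program such as minicom already does the capture for you; the last function only shows the idea.

```python
import time


def send_test_message(stream, text="serial console test"):
    """Write one line to a binary stream, such as an opened serial device."""
    stream.write(text.encode("ascii") + b"\r\n")
    stream.flush()


def console_boot_params(port="ttyS0", baud=9600):
    """Build kernel parameters that send console output to a serial port too."""
    # 'n8' means no parity and 8 data bits; tty0 keeps the local monitor active.
    return f"console=tty0 console={port},{baud}n8"


def capture_console(stream, log_file, clock=time.time):
    """Append each line read from a binary stream to log_file with a timestamp.

    Returns the number of lines written. Every line is flushed to the
    operating system right away instead of waiting in a Python buffer.
    """
    count = 0
    for raw in iter(stream.readline, b""):
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log_file.write(f"{clock():.0f} {text}\n")
        log_file.flush()
        count += 1
    return count
```

On the remote machine, something like `capture_console(open("/dev/ttyS0", "rb", buffering=0), open("console.log", "a"))` would run until the stream ends, assuming the port has already been configured (for example with stty) to the same speed as the server.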
What's an Oops Report, and how can it be set up and used?
Behman: An Oops report is the diagnostic data that the Linux kernel generates when it encounters a panic or a trap/exception. It contains a wealth of information that can be used to determine the cause of the abnormal termination.
Oops reports are built into the kernel and are always enabled, so no special configuration is needed. The data dumped out is very detailed and not for the faint of heart. If you're a kernel developer or someone who is interested in diving into these types of problems, then you can use an Oops report yourself. For the average user, Oops reports are better analyzed by kernel developers, distribution support personnel, or support specialists. We dissect a sample Oops report in our book.
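For admins who plan to hand the report to someone else, a small script can at least pull it out of a long console capture. The Python sketch below is a minimal example under stated assumptions: the marker strings are ones that Linux kernels commonly print at the start of an Oops or panic, but the exact wording varies by kernel version and architecture. The fixed `context` of 40 lines is an arbitrary choice, and `extract_reports` is a hypothetical name.

```python
import re

# Assumed markers; adjust them to match what your own kernel prints.
MARKERS = re.compile(r"Oops:|Kernel panic|Unable to handle kernel")


def extract_reports(lines, context=40):
    """Return a list of reports, each a list of lines starting at a marker.

    A report runs for at most `context` lines. Markers seen inside a report
    that is still being collected do not start a new one.
    """
    reports = []
    remaining = 0
    for line in lines:
        line = line.rstrip("\n")
        if remaining == 0 and MARKERS.search(line):
            reports.append([])
            remaining = context
        if remaining > 0:
            reports[-1].append(line)
            remaining -= 1
    return reports
```

Passing an open log file works directly, for example `extract_reports(open("console.log"))`, where the file name is an assumption. The function does no analysis; it only isolates the text that support personnel are likely to ask for.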
When is it not a good idea for an admin to try to diagnose and fix a system problem on Linux?
Wilding: Basically, the admin needs to ask themselves two questions:
- Does my role allow me to troubleshoot Linux system problems?
- Do I want to dive in and learn about this stuff?
Let me explain these questions further. Very often in today's IT market, admins are overworked and under the gun to ensure their systems are available and performing acceptably.
Troubleshooting system problems can be an involved and difficult process for the inexperienced, so an admin could be taking a great risk by spending time on troubleshooting a system problem instead of maintaining other running systems or performing other admin tasks. If, however, this is not an issue, then the admin really needs to want to dive in head first and learn about the problem. This can be very arduous at first, which is why the interest needs to be there. The interest level needs to surpass the frustration that can occur. It's important to remember, though, that everyone had to start somewhere, even the experts.
An admin who has a high level of interest but is under the gun to get a solution to the problem may opt to enlist the help of professional support services, but still work to resolve the problem on their own at the same time. This can often lead to learning a great deal in a short amount of time, since the support team will usually provide a solution the admin can learn from.