In many cases, Linux servers are managed by groups of people. Configuration files are constantly edited; processes and daemons are frequently restarted. The likelihood of a configuration error causing a problem is often greater than that of a hardware failure. So, with this in mind, and at the risk of seeming cavalier, I'll say that one of the most important troubleshooting tools is the less command. The less command allows you to scroll back and forth through configuration and error log files. It also lets you search within files for text such as timestamps.
Once you have identified a key piece of error information, you can use the grep command to run highly specific searches for the error pertaining to your application or subsystem.
The less and grep commands go hand in hand. Use the less command to search for an error message timestamp at the approximate time of the event in the files of your system's syslog directory. Once you find something interesting, use the grep command to search for similar messages or timestamps across multiple log files for the purposes of event correlation.
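The less-then-grep workflow described above might look something like the following minimal sketch. The log file names, hostnames, and messages are hypothetical samples written to a temporary directory so the script runs anywhere; on a real server you would point the commands at your syslog directory (often /var/log) instead.

```shell
#!/bin/sh
# Hypothetical sample logs; replace with the files in your syslog directory.
logdir=$(mktemp -d)
cat > "$logdir/messages" <<'EOF'
Mar  3 14:01:58 web01 kernel: eth0: link up
Mar  3 14:02:07 web01 httpd[2211]: error: cannot bind to port 80
Mar  3 14:09:30 web01 crond[901]: job finished
EOF
cat > "$logdir/secure" <<'EOF'
Mar  3 14:02:05 web01 sshd[1822]: Accepted password for admin
Mar  3 14:30:00 web01 sshd[1901]: session closed for admin
EOF

# Interactive step (not run here): open a log with
#   less "$logdir/messages"
# then type /14:02 to search forward, n for the next match,
# and ?14:02 to search backward.

# Correlation step: show every line from the same minute in all log files.
# With several files, grep prefixes each match with its file name; -n adds
# the line number.
matches=$(grep -n 'Mar  3 14:02' "$logdir"/*)
printf '%s\n' "$matches"
match_count=$(printf '%s\n' "$matches" | grep -c .)

# Count lines mentioning "error" in one file (-c counts, -i ignores case).
error_count=$(grep -ci 'error' "$logdir/messages")
echo "error lines in messages: $error_count"

rm -rf "$logdir"
```

Searching on the timestamp rather than the message text is a deliberate choice here: it surfaces entries from unrelated services that were logged at the same moment, which is what makes correlation possible.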
I would say that next in importance is the man command, as it provides help on the commands you'll need to fix the problem. Books are often sufficient, but when you are under pressure, the man command will provide detailed information on known commands much faster. For me, these would be the two most important command sets.
There are other commands that are obvious but frequently forgotten when IT managers are under pressure. The ls command will help determine when the configuration files were last edited. The vmstat, top, ps, and free commands will give a good idea of the general CPU, memory and swap partition loads and can help discover rogue processes that could be affecting performance. Commands such as ping are also helpful in eliminating sources of network-related problems.
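As a rough illustration of the checks just listed, the sketch below strings them together. The configuration directory and file names are hypothetical stand-ins created in a temporary directory, the run_if_present helper is introduced here only for illustration, and the loopback address passed to ping is a placeholder for whichever host you suspect. Because minimal systems may not have every tool installed, each command is run only if it is found.

```shell
#!/bin/sh
# Hypothetical config directory; on a real server inspect e.g. /etc instead.
confdir=$(mktemp -d)
touch -t 202401010900 "$confdir/httpd.conf"
touch -t 202403031401 "$confdir/network.conf"

# ls -lt sorts by modification time, newest first, so a recent edit stands out.
ls -lt "$confdir"
newest=$(ls -t "$confdir" | head -n 1)
echo "most recently edited: $newest"

# Illustrative helper: run a command only if it is installed.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@" || echo "($1 exited with an error)"
    else
        echo "== $1 not installed, skipping"
    fi
}

run_if_present free -m                    # memory and swap usage in MiB
run_if_present vmstat 1 2                 # two samples, one second apart
run_if_present ps -eo pid,pcpu,vsz,comm   # per-process CPU and memory use
run_if_present ping -c 2 127.0.0.1        # placeholder host; use the suspect one

# top is interactive, so it is not run here; start it by hand and look for
# processes at the head of the list with unusually high CPU or memory use.

rm -rf "$confdir"
```

Checking file modification times first tends to be the cheapest step, since a recent edit often points directly at the configuration change that caused the problem.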
Once you have the error message, a possible configuration file change and some performance figures in hand, the best tool to use is a Web browser to check search engine results for possible clues as to what the problem could be. Remember to check both Web pages and user group results for better coverage of the problem. Also remember to search the Web sites of your hardware and software vendors for information. Books are a good resource as well, but are usually not as quickly searchable.
Armed with this information, you can use some of the commands related to system and network performance I mentioned in the answer to the previous question to determine whether the condition that triggered the event still exists. This will help to isolate the source of the problem with the aim of fixing it.
Familiarity with troubleshooting tools will obviously help you to rectify problems more quickly; but, equally important, it will make you aware of potential issues that may need to be fixed proactively.