Troubleshooting Linux Systems Guide

This is a guide to basic, and not so basic, troubleshooting and debugging on Red Hat Linux systems.
Goals include describing common tools and their usage, how to find information, and so on: basically, info that may be helpful to someone diagnosing a problem. The emphasis is on software issues, but hardware may be covered as well.
Environment settings

- Allowing Core Files

"core" files are dumps of a process's memory. When a program crashes it can leave behind a core file, which can help determine the cause of the crash when loaded into a debugger.

By default, most Linux distributions turn off core file support by setting the maximum allowed core file size to 0. To allow a segfaulting application to leave a core, you need to raise this limit. This is done via `ulimit`. To allow core files of unlimited size, issue:

    ulimit -c unlimited

See the section on GDB for more information on what to do with core files.

- LD_ASSUME_KERNEL

LD_ASSUME_KERNEL is an environment variable used by the dynamic linker to decide which implementation of a library is used. In most cases the most important lib is the C library, "libc" or "glibc". glibc matters because it contains the thread implementation for a system.

The values you can set LD_ASSUME_KERNEL to correspond to Linux kernel versions. Since glibc and the kernel are tightly bound, it's necessary for glibc to change its behaviour based on what kernel version is installed.

For properly written apps, there should be no reason to use this setting. However, for some legacy apps that depend on a particular thread implementation in glibc, LD_ASSUME_KERNEL can be used to force the app to use an older implementation.

The primary targets are:

    LD_ASSUME_KERNEL=2.4.20   use the NPTL thread library
    LD_ASSUME_KERNEL=2.4.1    use the implementation in /lib/i686 (newer LinuxThreads)
    LD_ASSUME_KERNEL=2.2.5    (or older) use the implementation in /lib (old LinuxThreads)

An app that requires the old thread implementation can be launched as:

    LD_ASSUME_KERNEL=2.2.5 ./some-old-app

See http://people.redhat.com/drepper/assumekernel.html for more details.

- glibc environment variables

There's a wide variety of environment variables that glibc uses to alter its behaviour, many of which are useful for debugging or troubleshooting purposes. A good reference on these variables is at:

    http://www.scratchbox.org/documentation/general/tutorials/glibcenv.html

Some interesting ones:

LANG and LANGUAGE

LANG sets the default locale for all the LC_* categories, while LANGUAGE sets the priority list of languages used for message catalogs. LC_ALL overrides LANG and every LC_* variable. These control the locale specific parts of glibc.

Lots of programs are written expecting to be run in one locale, and can break in other locales. Since locale settings can change things like sort order (LC_COLLATE) and time formats (LC_TIME), shell scripts are particularly prone to problems from this. A script that assumes the sort order of something is a good example.

A common way to test this is to try running the troublesome app with the locale set to "C", the default locale.
    LC_ALL=C ls -al

LC_ALL overrides LANG and all the LC_* variables, so this forces every locale category to "C". If the app starts behaving when run that way, there is probably something in the code that is assuming the "C" locale (sorted lists and time formats are strong candidates).

- glibc malloc stuff
    - all the glibc env variable stuff

Tools

Efficient debugging and troubleshooting is often a matter of knowing the right tools for the job, and how to use them.

- strace
    - simple usage
    - filtering output
    - examples
    - use as profiling
    - see what files are open
    - network connections

strace is one of the most powerful tools available for troubleshooting. It allows you to see what an application is doing, to some degree.

strace displays all the system calls that an application is making, what arguments it passes to them, and what the return code is. A system call is generally something that requires the kernel to do something. This generally means I/O of all sorts, process management, shared memory and IPC usage, memory allocation, and network usage.

- examples

The simplest example of using strace is as follows:

    strace ls -al

This starts the strace process, which then starts `ls -al` and shows every system call. For `ls -al` this is mostly I/O related calls. You can see it stat'ing files, opening config files, opening the libs it is linked against, allocating memory, and write()'ing out the contents to the screen.
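Raw strace output gets long quickly, so a quick way to get a feel for what an app like `ls -al` spends its calls on is to count the calls by name. The following is only a rough sketch: the function name `syscall_histogram` is made up for illustration, and it assumes plain strace output (saved with strace's `-o` option, without `-f`, `-tt` or `-r` prefixes on each line).

```shell
# Count how often each system call appears in a saved strace log.
# Reads the log on stdin. Lines that do not look like "name(args) = ret"
# (signal notices, the "+++ exited +++" line) are skipped.
# Assumes no pid or timestamp prefixes (no -f, -t, -tt or -r).
syscall_histogram() {
    sed -n 's/^\([a-z_0-9][a-z_0-9]*\)(.*/\1/p' | sort | uniq -c | sort -rn
}

# Typical use (hypothetical file name):
#   strace -o /tmp/strace.out ls -al
#   syscall_histogram < /tmp/strace.out
```

The most frequent calls end up at the top of the list, which is often enough to tell whether an app is mostly reading files, stat'ing things, or talking to the network.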
- what files is this thing trying to open

A common troubleshooting technique is to see what files an app is reading. You might want to make sure it's reading the proper config file, or looking at the correct cache, etc. strace by default shows all file I/O operations. But to make it a bit easier, you can filter strace output. To see just file open()'s:

    strace -e open ls -al

On newer systems most files are opened with openat() rather than open(), so `-e trace=open,openat` (or `-e trace=file`, which covers every call that takes a filename) may be needed to see them.

- what's this thing doing to the network

To see all network related system calls (name resolution, opening sockets, writing/reading to sockets, etc):

    strace -e trace=network curl --head http://www.redhat.com

- rudimentary profiling

One thing strace can be used for that is useful for debugging performance problems is some simple profiling.

    strace -c ls -la

Invoking strace with '-c' causes a cumulative report of system call usage to be printed. This includes the approximate amount of time spent in each call, and how many times each system call is made. This can sometimes help pinpoint performance issues, especially if an app is doing something like repeatedly opening/closing the same files.

    strace -tt ls -al

The -tt option causes strace to prefix each call with the time of day it was made, with microsecond resolution.

    strace -r ls -al

The -r option causes strace to print the time elapsed since the previous system call. This can be used to spot where a process is spending large amounts of time in user space, or especially slow syscalls.

- following forks and attaching to running processes
Often it is difficult or impossible to run a command under strace (an apache httpd for instance). In this case, it's possible to attach to an already running process.

    strace -p 12345

where 12345 is the PID of a process. This is very handy for trying to determine why a process has stalled. Many times a process might be blocking while waiting for I/O. With strace -p, this is easy to detect.

Lots of processes start other processes. It is often desirable to see a strace of all the processes.

    strace -f /etc/init.d/httpd start

will strace not just the bash process that runs the script, but any helper utilities executed by the script, and httpd itself.

Since strace output is often a handy way to help a developer solve a problem, it's useful to be able to write it to a file. The easiest way to do this is with the -o option.

    strace -o /tmp/strace.out program

Being somewhat familiar with the common syscalls for Linux is helpful in understanding strace output. But most of the common ones are simple enough to figure out from context.

A line in strace output is essentially the system call name, the arguments to the call in parens (sometimes truncated...), and then the return status. A return status for error is typically -1, but varies sometimes. For more information about the return status of a particular system call, see `man 2 syscallname`. Usually the return status is documented in the "RETURN VALUE" section.

Another thing to note about strace is that it often shows "errno" status. If you're not familiar with unix system programming, errno is a global variable that gets set to a specific value when a system call fails. The value depends on the kind of error. More info on this can be found in `man errno`. Typically, strace will show the brief description for any errno values it gets, ie:

    open("/foo/bar", O_RDONLY) = -1 ENOENT (No such file or directory)

    strace -s X
The -s option tells strace to show the first X characters of strings. The default is 32 characters, which sometimes isn't enough. This will increase the info available to the user.

- ltrace
    - simple usage
    - filtering output

ltrace is very similar to strace, except ltrace focuses on tracing library calls.

For apps that use a lot of libs, this can be a very powerful debugging tool. However, because most modern apps use libs very heavily, the output from ltrace can sometimes be painfully verbose.

There is a distinction between a "system call" and a call to a library function. Sometimes the line between the two is blurry, but the basic difference is that system calls are essentially communicating with the kernel, and library calls are just running more userland code. System calls are usually required for things like I/O, process control, memory management, and other "kernel" things.

Library calls are, by bulk, generally calls to the standard C library (glibc), but can of course be calls to any library, say Gtk, libjpeg, libnss, etc. Luckily most glibc functions are well documented and have either man or info pages. Documentation for other libraries varies greatly.

ltrace supports the -r, -tt, -p, and -c options the same as strace. In addition it supports the -S option, which tells it to print out system calls as well as library calls.

One of the more useful options is "-n 2", which will indent 2 spaces for each nested call. This can make the output much easier to read.

Another useful option is "-l", which allows you to specify a specific library to trace, potentially cutting down on the rather verbose output.
- gdb

`gdb` is the GNU debugger. A debugger is typically used by developers to debug applications in development. It allows for a very detailed examination of exactly what a program is doing.

That said, gdb isn't as useful as strace/ltrace for troubleshooting/sysadmin types of issues, but occasionally it comes in handy.
For troubleshooting, it's useful for determining what caused a core file (`file core` will also typically show you this information). But gdb can also show you "where" the program crashed. Once you determine the name of the app that caused the failure, you can start gdb with:

    gdb filename core

then at the prompt type `where`.

The unfortunate thing is that all the binaries we ship are stripped of debugging symbols to make them smaller, so this often returns less than useful information. However, starting in Red Hat Enterprise Linux 3, and included in Fedora, there are "debuginfo" packages. These packages include all the debugging symbols. You can install them the same as any other rpm, so `rpm`, `up2date`, and `yum` all work.

The only difficult part about debuginfo rpms is figuring out which ones you need. Generally, you want the debuginfo package for the src rpm of the package that's crashing.

    rpm -qif /path/to/app

will tell you the info for the binary package the app is part of. Part of that info includes the src.rpm. Just use the package name of the src rpm plus "-debuginfo".

- python debugging

- perl debugging

    `splain`
    `perl -V`
    perldoc -q
    perldoc -l
- sh debugging

- bugbuddy etc

- top

`top` is a simple text based system monitoring tool. It packs a lot of information onto the screen, which can be helpful when troubleshooting problems, particularly performance related problems.

The top of the "top" output includes a basic summary of the system. The top line is the current time, uptime since the last reboot, users logged in, and load average. The load average values here are the load for the last 1, 5, and 15 minutes. On a single CPU machine a load of 1.0 is considered 100% cpu utilization, so loads over one typically mean stuff is having to wait. There is a lot of leeway and approximation in these load values however.
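Since the "1.0 means fully busy" rule of thumb really applies per processor, it can help to divide the load by the number of CPUs before deciding whether a box is overloaded. Here is a minimal sketch of that arithmetic. The function name `load_per_cpu` and its two optional arguments are made up for this example, and it assumes a Linux /proc filesystem.

```shell
# Print the 1-minute load average divided by the number of CPUs.
# $1: file to read the load from (defaults to /proc/loadavg)
# $2: CPU count (defaults to the number of online processors)
load_per_cpu() {
    loadavg_file=${1:-/proc/loadavg}
    cpus=${2:-$(getconf _NPROCESSORS_ONLN)}
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r one_min _rest < "$loadavg_file"
    awk -v avg="$one_min" -v n="$cpus" 'BEGIN { printf "%.2f\n", avg / n }'
}

# Typical use: load_per_cpu
```

A result that stays well above 1.00 suggests processes are queueing for CPU time. As with the raw load numbers, treat it as a rough indicator only.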
The memory line shows the total physical ram available on the system, how much of it is used, how much is free, and how much is shared, along with the amount of ram in buffers. These buffers are typically file system caching, but can be other things. On a system with a significant uptime, expect the buffer value to take up all free physical ram not in use by a process. The swap line is similar.
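Because buffers and cache soak up otherwise idle ram, the "free" figure on its own tends to look alarmingly small. A rough way to estimate what is really available is to add the buffer and cache figures back in. This is just a sketch: the function name `reclaimable_kb` is invented here, it assumes the Linux /proc/meminfo format, and not every cached page can actually be reclaimed, so read the result as an upper bound.

```shell
# Sum MemFree, Buffers and Cached (all in kB) from a meminfo-style file.
# $1: file to read (defaults to /proc/meminfo)
reclaimable_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}

# Typical use: reclaimable_kb
```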
Each of the entries viewable in the system contains several fields by default. The most interesting are RSS, %CPU, and time. RSS shows the amount of physical ram the process is consuming. %CPU shows the percentage of the available processor time a process is taking, and time shows the total amount of processor time the process has had. A processor intensive program can easily have more "time" in just a few seconds than a long running low cpu process.

Sorting the output:

    M : sorts the output by memory usage. Pretty handy for figuring out which version of openoffice.org to kill.
    P : sorts the processes by the % of cpu time they are using.
    T : sorts by cumulative time
    A : sorts by age of the process, newest process first

Command line options:

The only really useful command line options are:
    b [batch mode]  writes the standard top output to stdout. Useful for a quick "system monitoring hack", ie:

    top -b -d 360 >> foo.output

to get a snapshot of the system appended to foo.output every six minutes.

- ps

`ps` can be thought of as a one shot top. But it's a bit more flexible in its output than top.

As far as `ps` commandline options go, it can get pretty hairy. The Linux version of `ps` inherits ideas from both the BSD version and the SYSV version. So be warned. The `ps` man page does a pretty good job of explaining this, so look there for more examples.

Some examples:
    ps aux

shows all the processes on the system in a "user" oriented format. In this case that means the username of the owner of the process is shown in the first column.

    ps auxww

The "w" option, when used twice, allows the output to be of unlimited width. For apps started with lots of commandline options, this will allow you to see all the options.

    ps auxf

The "f" option, for "forest", tries to present the list of processes in a tree format. This is a quick and easy way to see which processes are child processes of what.

    ps -eo pid,%cpu,vsz,args,wchan

This is an interesting example of the -eo option, which allows you to customize the output of `ps`. In this case, the interesting bit is the wchan field, which attempts to show what syscall the process is in when `ps` checks. For things like apache httpds, this can be useful to get an idea what all the processes are doing at one time. See the strace section for more on understanding system call info.

- sysstat/sar

Sysstat works in two parts: a daemon process that collects information, and a "monitoring" tool. The daemon is called "sysstat", and the monitoring tool is called `sar`.

To start it, start the sysstat daemon:

    /etc/init.d/sysstat start

To see a list of `sar` options, just try `sar --help`.

Things to note: there are lots of commandline options. The trailing number is the interval, meaning the time in seconds between updates (an optional second number is the count of reports to print).

    sar 3

will run the default sar stuff every three seconds. For a complete summary, try:

    sar -A
This generates a big pile of info ;->

To get a good idea of disk I/O activity:

    sar -b 3

For something like a heavily used web server, you may want to get a good idea how many processes are being created per second:

    sar -c 2

Kind of surprising to see how many processes can be created.

There's also some degree of hardware monitoring built in. Monitoring how many times an IRQ is triggered can also provide good hints at what's causing system performance problems.

    sar -I SUM 3

will show the total number of system interrupts.

    sar -I 14 2

will watch the standard IDE controller IRQ every two seconds.

Network monitoring is in here too:

    sar -n DEV 2

will show the number of packets sent/received, number of bytes transferred, etc.

    sar -n EDEV 2

will show stats on network errors.

Memory usage can be monitored with something like:

    sar -r 2

This is similar to the output from `free`, except more easily parsed.

For SMP machines, you can monitor per CPU stats with:

    sar -U 0

where 0 is the first processor. The keyword ALL will show all of them.

A really useful one on web servers and other configurations that use lots and lots of open files is:

    sar -v

This will show the number of used file handles, the % of available file handles, and the same for inodes.
To show the number of context switches (a good indication of how much time the system is wasting switching between processes):

    sar -w 2

- vmstat

This util is part of the procps package, and can provide lots of useful info when diagnosing performance problems.

Here's a sample vmstat output on a lightly used desktop:

       procs                      memory    swap          io     system         cpu
     r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
     1  0  0   5416   2200   1856  34612   0   1     2     1  140   194   2   1  97
And here's some sample output on a heavily used server:

       procs                      memory    swap          io     system         cpu
     r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
    16  0  0   2360 264400  96672   9400   0   0     0     1   53    24   3   1  96
    24  0  0   2360 257284  96672   9400   0   0     0     6 3063 17713  64  36   0
    15  0  0   2360 250024  96672   9400   0   0     0     3 3039 16811  66  34   0
The most interesting number here is the first one (r): the number of processes on the run queue. This value shows how many processes are ready to be executed, but can not be run at the moment because other processes need to finish. For lightly loaded systems, this is almost never above 1-3, and numbers consistently higher than 10 indicate the machine is getting pounded.

Other interesting values include the "system" numbers for in and cs. The in value is the number of interrupts per second a system is getting. A system doing a lot of network or disk I/O will have high values here, as interrupts are generated every time something is read or written to the disk or network.

The cs value is the number of context switches per second. A context switch is when the kernel has to take the executable code for a program out of memory, and switch in another. It's actually _way_ more complicated than that, but that's the basic idea. Lots of context switches are bad, since it takes some fairly large number of cycles to perform a context switch, so if you are doing lots of them, you are spending all your time changing jobs and not actually doing any work. I think we can all understand that concept.

- tcpdump/ethereal

- netstat

Netstat is an app for getting general information about the status of network connections to the machine.
    netstat

will just show all the current open sockets on the machine. This will include unix domain sockets, tcp sockets, udp sockets, etc.

One of the more useful options is:

    netstat -pa

The `-p` option tells it to try to determine what program has the socket open, which is often very useful info. For example, say someone nmap's their system and wants to know what is using port 666. Running netstat -pa will show you it's satand running on that tcp port.

One of the most twisted, but useful invocations is:

    netstat -a -n|grep -E "^(tcp)"| cut -c 68-|sort|uniq -c|sort -n

This will show you a sorted list of how many sockets are in each connection state. For example:

      9 LISTEN
     21 ESTABLISHED

    - what process is doing what and to whom over the network
    - number of sockets open
    - socket status

- lsof

/usr/sbin/lsof is a utility that shows all the open files on the system. There's a ton of options, almost none of which you ever need.

This is mostly useful for seeing what processes have what file open. Useful in cases where you need to umount a partition, or perhaps you have deleted some file, but its space wasn't reclaimed and you want to know why.

The EXAMPLES section of the lsof man page includes many useful examples.

- fuser

- ldd

ldd prints out shared library dependencies.

For apps that are reporting missing libraries, this is a handy utility. It shows all the libraries a given app or library is linked to.
For most cases, what you will be looking for is missing libs. In the ldd output, they will show up as something like:

    libpng.so.3 => (file not found)

In this case, you need to figure out why libpng.so.3 isn't being found. It might not be in the standard lib paths, or perhaps not in a path listed in /etc/ld.so.conf. Or you may need to run `ldconfig` again to update the ld cache.

ldd can also be useful when tracking down cases where an app is finding a library, but it's finding the wrong library. This can happen if there are two libraries with the same name installed on a system in different paths.

Since the `ldd` output includes the full path to the lib, you can see if anything is pointing at a wrong path. One thing to look for when scanning for this is one lib that's in a different lib path than the rest. If an app uses libs from /usr/lib, except for one from /usr/local/lib, there's a good chance that's your culprit.

- nm

- file

`file` is a simple utility that tries to figure out what kind of file a given file is. It does this using magic(5).

Where this sometimes comes in handy for troubleshooting is looking for rogue files. A .jpg file that is actually a .html file. A tar.gz that's not actually compressed. Cases like those can sometimes cause apps to behave very strangely.

- netcat
    - to see network stuff

- md5sum
    - verifying files
    - verifying iso's

- diff

diff compares two files, and shows the difference between the two.

For troubleshooting, this is most often used on config files. If one version of a config file works, but another does not, a `diff` of the two files can often be enlightening. Since it can be very easy to miss a small difference in a file, being able to see just the differences is useful.

For debugging during development, diff (especially the versions built into revision control systems like cvs) is invaluable. Seeing exactly what changed between two versions is a great help.
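To make the config file case concrete, here is a small sketch of comparing a working config against a broken one. The file names and settings (app.conf.good, app.conf.bad, loglevel) are invented purely for illustration. The `-u` option asks for "unified" output, which shows a few lines of context around each change.

```shell
# Build two throwaway config files that differ in a single setting.
tmp=$(mktemp -d)
printf 'port=80\nloglevel=info\n'  > "$tmp/app.conf.good"
printf 'port=80\nloglevel=debug\n' > "$tmp/app.conf.bad"

# diff exits with status 1 when the files differ, which is not an error here.
diff_out=$(diff -u "$tmp/app.conf.good" "$tmp/app.conf.bad" || true)
printf '%s\n' "$diff_out"

rm -rf "$tmp"
```

Lines starting with "-" exist only in the first file and lines starting with "+" only in the second, so the one changed setting stands out even in a config file hundreds of lines long.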
For example, if foo-2.2 is acting weird, where foo-2.1 worked fine, it's not uncommon to `diff` the source code between the two versions to see if anything related to your problem changed.

- find

For troubleshooting a system that seems to have suddenly stopped working, find has a few tricks up its sleeve.

When a system stops working suddenly, the first question to ask is "what changed?".

    find / -mtime -1

That command will recursively list all the files from / that have changed in the last day.

    find /usr/lib -mmin -30

will list all the files in /usr/lib that changed in the last 30 minutes.

Similar options exist for ctime and atime.

    find /tmp -amin -30

will show all the files in /tmp that have been accessed in the last 30 minutes.

The -atime/-amin options are useful when trying to determine if an app is actually reading the files it is supposed to. If you run the app, then run that command where the files are, and nothing has been accessed, something is wrong.

If no "+" or "-" is given for the time value, find will match only exactly that time. This is handy in several cases. You can determine what files were modified/created at the same time.

A good example of this is cleaning up from a tar package that was unpacked into the wrong directory. Since all the files will have the same access time, you can use find and -exec to delete them all.

- executables

`find` can also find files with particular permissions set.

    find / -perm -0002

will find all world writeable files from / down. (`-perm -0777` would only match files that have every permission bit set.)
    find /tmp -user "alikins"

will find all files in /tmp owned by "alikins".

- used in combo with grep to find markers (errors, filename, etc)

When troubleshooting, there are plenty of cases where you want to find all instances of a filename, or a hostname, etc.

To recursively grep a large number of files, you can use find and its exec options:

    find . -type f -exec grep -H foo {} \;

This will grep for "foo" in all files from the current working directory on down. The -H makes grep print the name of the file each match was found in. Note that in many cases, you can also use `grep -r` to do this.

- ls/stat
    - finding [sym|hard] links
    - out of space

- df

Running out of disk space causes so many apps to fail in weird and bizarre ways that a quick `df -h` is a pretty good troubleshooting starting point.

Use is easy: look for any volume that's 100% full. Or, in the case of apps that might be writing lots of data at once, reasonably close to being filled.

It's pretty common to spend more time than anyone would like to admit debugging a problem, only to suddenly hear someone yell "Damnit! It's out of disk space!". A quick check avoids that problem.

In addition to running out of space, it's possible to run out of file system inodes. A `df -h` will not show this, but a `df -i` will show the number of inodes available on each filesystem.

Being out of inodes can cause even more obscure failures than being out of space, so it's something to keep in mind.

- watch
    - used to see if process output changes
    - free, df, etc

- ipcs/ipcrm
    - anything that uses shm/ipc
    - oracle/apache/etc

- google
    - googling for error messages can be very handy

- source code

For Red Hat Linux, you have the source code, so it can often be useful to search through the code for error messages, filenames, or other markers related to the problem. In many cases, you don't really need to be able to understand the programming language to get some useful info.

Kernel drivers are a great example of this, since they often include very detailed info about which hardware is supported, what's likely to break, etc.

- strings

`strings` is a utility that will search through a file and try to find text strings. For troubleshooting, it is sometimes handy to be able to look for strings in an executable.

For example, you can run strings on a binary to see if it has any hard coded paths to helper utilities. If those utils are in the wrong place, that app may fail.

Searching for error messages can help as well, especially in cases where you're not sure what binary is reporting an error message.

In some ways, it's a bit like grepping through source code for error messages, but a bit easier. Unfortunately, it also provides far less info.

- syslog/log levels
    - what goes to syslog
    - how to get more stuff there

- ksymoops
    - get somewhat meaningful info out of kernel traces

- netdump?

- xev
    - debugging keycode/mouseclick weirdness, etc

Logs

- messages, dmesg, lastlog, etc
- log filtering tools?
Using RPM to help troubleshoot

- package verify
- missing deps

Types Of Problems

- Things are missing.

This type of problem occurs in many forms. Shell scripts that expect an executable to exist that doesn't. Applications linked against a library that can not be found. Applications expecting a config file to be found that isn't.

It can get even more subtle when file permissions are involved. An app can report a file as "not found" when in reality it exists, but the permissions are wrong.

- Missing Files

Often an app will fail because of missing files, but will not be so helpful as to tell you which file is missing. Or it reports the error in a vague manner like "config file not found".

For most of these cases where something is missing, but you are not sure _what_, strace is the best tool.

    strace -eopen trouble_causing_app

That commandline will list all the files that the app is trying to open, and whether each open succeeded or not. The type of line to look for is something like:

    open("/path/to/some/file/", O_RDONLY) = -1 ENOENT (No such file or directory)

That indicates the file wasn't found. In many cases, these errors are harmless. For example, some apps will try to open config files in the user's home directory, in addition to system config files. If the user config file doesn't exist, the app might just continue.

- Missing Libs

For missing libraries, the same approach will work. Another approach is to run `ldd` against the app, and see if any shared libraries show up as missing. See the `ldd` section for more details.

- File Permissions

For cases where it's the file permissions causing the problem, you are looking for a line like:

    open("/path/to/file/you/cant/read", O_RDONLY) = -1 EACCES (Permission denied)

Something about that file is not letting you read it. So the permissions need to be checked, or perhaps elevated privileges obtained (aka, does the app require running as root?).

- networking
On modern systems, having networking problems is crippling at times. Troubleshooting what's causing them can be just as painful at times.

Some common problems include firewall issues (both on the client and external), kernel/kernel module issues, routing problems, name resolution, etc.

Name resolution issues deserve their own category, so see the name resolution section for more info.

- firewall checks

When seeing odd network behaviour these days, local firewalls are a pretty good suspect. Client side firewalls are getting more and more aggressive. If you see issues using a network service, especially a non standard service, it's possible local firewall rules are causing it.

Insert info about seeing what firewall rules are up.

Insert info about increasing log levels to see firewall rejections in system logs.

Insert info about temporarily dropping firewalls to diagnose problems.

- Crappy Connections

A common problem is connections that are having problems. Here are a few easy things to look for to see if an external connection might be having issues.

- ping to a remote host

`ping` is very simple, and very low level, so it's a good tool to get an idea if an interface or route is working correctly.

    ping www.yahoo.com

That will start pinging www.yahoo.com and reporting ping times. Stopping it with ctrl-c will show a report of any missed packets. Generally healthy links will have 0 dropped packets, so anything higher than that is something to be worried about.

- traceroute

    traceroute www.yahoo.com

attempts to gather info about each node in the connection. Generally these map to physical routers, but in these days of VPN's, it's hard to tell.

If a traceroute stalls at some point, it usually indicates a problem. Also look for high ping times, particularly any node that seems much slower than the others.

- /sbin/ifconfig

ifconfig does a lot. It can control and configure network interfaces of all types. See the man page.

When trying to determine if there are networking issues, run `ifconfig` and look for the interface showing issues. If there is a high "error" count, there could be physical layer issues, or possibly overloaded routers etc.

That said, with modern networks, it's pretty rare to see interface errors, but it's still something to take a quick look at.

- Bandwidth Usage

When the available network bandwidth runs dry, it can be difficult to find the culprits. There are a couple of subtle variations of this. One is a client machine that has some process using a lot of bandwidth. Another is a server application that has one or more clients using a lot of bandwidth.

- /sbin/ifconfig

ifconfig reports the number of packets sent/received on a network interface, so this can be a quick way to get an idea what interface is out of bandwidth.

- sar

As mentioned in the section on sar, `sar -n DEV` can be used to see info about the number of packets each interface is sending at a given time.

- trafshow

I don't know anything about trafshow.

- ntop/intop

Haven't used these in ages.

- netstat

`netstat` won't show bandwidth usage, but it is a quick way to see what applications have open network connections, which is often a good start to finding bandwidth hogs. See the netstat section for more info.

- tcpdump/ethereal

tcpdump and ethereal are both applications to monitor network traffic. tcpdump is pretty standard, but ethereal is more featureful. ethereal also has a nice graphical user interface, which can be very handy when attempting to digest the large amounts of data a network trace can deliver.

The basic approach is to fire up ethereal, start a capture, let whatever weird networking you're trying to diagnose happen, then stop the capture. Ethereal will display all the connections it traced during the capture.

There are a couple of ways to look for bandwidth hogs. The "Statistics" menu has a couple of useful options. The "Protocol Hierarchy" shows what % of packets in the trace is from each type of protocol. In the case of a bandwidth hog, at least what protocol is the culprit should be easy to spot here.

The "Conversations" screen is also helpful for looking for bandwidth hogs. Since you can sort the "conversations" by number of packets, the culprit is likely to hop to the top. This isn't always the case, as it could easily be many small connections killing the bandwidth, not one big heavy connection.

As far as tcpdump goes, the best way to spot bandwidth hogs is just to start it up. Since it pretty much dumps all traffic to the screen in a text format, just keep your eyes peeled for what seems to be coming up a lot.

- using iptables

iptables can log how much traffic is flowing through a given rule. Something like:

    iptables -nvxL

lists every rule along with exact packet and byte counters (-v adds the counters, -x prints them unabbreviated, and -n skips name lookups).
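With more than a handful of rules, the counter listing gets hard to eyeball, so one option is to sort the rules by byte count. The sketch below assumes the listing comes from `iptables -nvxL` (run as root), where the first two columns of each rule line are the packet and byte counters. The function name `rank_rules` is made up for this example.

```shell
# Sort iptables rule lines by their byte counter, biggest first.
# Reads an "iptables -nvxL" listing on stdin. Chain headers and column
# titles are dropped because their second field is not a number.
# $1: how many rules to show (defaults to 5)
rank_rules() {
    awk '$2 ~ /^[0-9]+$/' | sort -k2,2nr | head -n "${1:-5}"
}

# Typical use (as root):
#   iptables -nvxL | rank_rules 5
```

Whatever rule lands at the top is matching the most traffic, which narrows down which port or host to look at next.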
- routing issues
- kernel module flakiness
- dropped connections
- tcpdump/ethereal
- netcat
- netstat

- Programs Crashing
You just finished the last page in your 1200 page novel about how aliens invaded Siberia in the 19th century and made everyone depressed. *boom* The word processor disappears off the screen faster than it really should. It segfaulted. Your work is lost.

Crashing applications are annoying, to say the least. But sometimes there are ways to figure out why they crashed. And if you can figure out why, you might be able to avoid it next time.

- Crash Catchers
Most GNOME/KDE apps are now linked against libs that include a crash catching utility. Basically, whenever the app gets a segfault, the handler for it invokes a process that attaches to the app with a debugger, gets a stack trace, and offers to upload it to a bug tracking system. Since these include the option to see the stack trace, it can be a handy way to get one.

Once you have a stack trace, it should point you to where the app is crashing. Figuring out why it crashed varies greatly in complexity.

- strace
`strace` can also be handy for tracking down crashes. It doesn't provide as much detail as ltrace or gdb, but it is commonly available. The idea is to start the app under strace, wait for it to crash, and look at the last few things it did. Some things to look for include recently opened files (maybe the app is trying to load a corrupted file), memory management calls (maybe something is causing it to use large amounts of RAM), and failed network connections (maybe the app has poor error handling).

- ltrace
`ltrace` is a bit more useful for debugging crashing apps, as it can give you an idea of what function an app was in when it crashed. Not as useful as a real stack trace, but it's easier.

- gdb
When it comes to figuring out all the gory details of why an app crashed, nothing is better than `gdb`. For basic usage, see the gdb section in the tools section of this document. For more detailed usage, see the gdb documentation. As a starting point, `gdb /path/to/program` followed by `run` starts the program under the debugger, and `bt` prints a backtrace once it has crashed.

- debuginfo packages
One caveat with using gdb on most apps is that they are stripped of debugging information. You can still get a stack trace, but it will not be as meaningful as one with the debug information available. In the past, this meant recompiling the application with debugging turned on and "stripping" turned off, which can at times be a slow and painful process. In Red Hat Enterprise Linux and later, you can install the "debuginfo" packages. See the gdb section in the tools section for more info on debug packages.

- core files
If an application has crashed and left a core file (see the "Allowing Core Files" section under the "Environment Settings" section for info on how to do this), you can use gdb to debug the core file. Invocation is easy:
    gdb /path/to/program /path/to/core/file
gdb needs the program that crashed as well as the core file in order to resolve symbols. After loading the core file, you can issue `bt` to get a backtrace. See the gdb section above for information about "debuginfo" packages.

- configs screwed up
An incorrect, missing, or corrupt config file can wreak havoc. Well coded apps will usually give you some idea if a config file is bogus, but that's not always the case.

- Finding the config files
The first thing is figuring out if an app uses a config file and what it is. There are a couple of ways to do this.

- finding config files with rpm
If a package was installed from an rpm, it
should have the config files flagged as such. To query a package to see what its config files are, issue the command:
    rpm -q --configfiles packagename
While you are using rpm, you should see if the config files have been modified from the defaults:
    rpm -V packagename
That command will list all the files in that package that have been changed in some way. The rpm man page includes more details on what the output means, but the basics are:
    S    the file's size has changed
    5    the file's contents have changed (the checksum differs)
    M    the file's permissions have changed

- strace
Using `strace -e open command` is a good way to see what files a process is opening, including any config files. On newer systems, where files are usually opened with openat(), `strace -e trace=file command` catches those calls as well.

- documentation
If all else fails, try reading the docs. Often the man pages or docs describe where and what the config files are.

- Verifying the Config Files
Once you know what the config files are, you need to verify they are correct. This is highly application dependent.

- diff'ing against known good files
If you have a known good config file, diffing the old file and the new one can often be useful.

- look for .rpmnew or .rpmorig files
In some cases, rpm will install a new config file alongside the existing one. This happens on package upgrades where the default config file has changed between the two packages, and the version on disk is different from either version. The idea is that if the default config file is different, it's possible the config file format changed. That means the previous on-disk config file may not work with the new version, so a .rpmnew version is installed alongside the existing one.
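As a minimal sketch, leftover files of this kind can be located with find and compared against the live config with diff. The function names `find_rpm_leftovers` and `diff_rpmnew` are made up for illustration, /etc as the default search root is an assumption, and .rpmsave (another suffix rpm uses when it moves an old config aside) is included as an extra that the text above does not mention.

```shell
# List config files rpm left behind under a directory (default: /etc).
find_rpm_leftovers() {
    find "${1:-/etc}" \( -name '*.rpmnew' -o -name '*.rpmsave' \
        -o -name '*.rpmorig' \) -print 2>/dev/null
}

# Show how each .rpmnew file differs from the config file currently in use.
diff_rpmnew() {
    find "${1:-/etc}" -name '*.rpmnew' -print 2>/dev/null |
    while IFS= read -r new; do
        live=${new%.rpmnew}          # strip the suffix to get the live file
        if [ -f "$live" ]; then
            # diff exits 1 when the files differ, which is expected here.
            diff -u "$live" "$new" || true
        fi
    done
}

# Typical use:
#   find_rpm_leftovers
#   diff_rpmnew /etc
```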
So if an app is suddenly behaving oddly, especially after a package update, see if there are any .rpmnew or .rpmorig files. If so, you may need to update the existing config file to use the new format.

- stat/ls
If an app is behaving oddly, and you believe it is because of a config file, you should check to see when that file was modified:
    stat /path/to/config/file
The `stat` command will give you the last modified and last accessed times. If the file seems to have changed later than you expected, it's possible something or someone has changed it more recently. See the information on the `find` utility for ways to look for all files modified at/since/before a certain time.

- gconf

- The config file has changed but the app is ignoring it

- is it the correct config file?
Often an application will look for config files in several places. In some cases, some versions of the config file have precedence over other versions. A common example is for an app to have a default config file, a per-system config file, and a per-user config file, with the user and system ones overriding the default one. For some apps, individual config items have their own inheritance rules. So, for example, if you're modifying a system-wide config file, make sure there isn't a per-user config file that overrides the change.

- is it a daemon?
Daemon and server processes typically only read their config file when they start up. Sometimes a restart of the process is required. Sometimes it is possible to send a "HUP" signal to an app to force it to reload its configs. To send a "HUP" signal:
    kill -HUP $pid
Where $pid is the process id of the running process. Sometimes init scripts will have options for reloading config files. Apache httpd's init script has a reload option:
    service httpd reload

- shell config?
Some processes, user shells in particular, have fairly complicated rules about when their config files are read. See the "INVOCATION" section of the bash man page for an example of when the various bash config files get loaded.

- kernel issues
- single user
- init=/bin/bash
- bootloader configs
- log levels

- stuff not writing to disk

- out of space
You run a command to write a file, or save a file from an app. When you go to look at the file, it's not there, or it's empty. Or the app complains that it is "unable to write to device". What's going on?

More than likely, the system does not have any storage space left for the file. The file system that the app is trying to write to is full, or nearly full. This can cause a wide variety of problems. The easiest way to check whether this is the case is the `df` command. See the df section in the tools section for more info on df.

One thing to keep in mind is to check that the correct filesystem has space. Just because something in `df` shows free space doesn't mean the app can use it.

- out of inodes
`df -i` can catch this one as well. It's fairly uncommon these days, but it can still happen.

- file permissions
Check the file permissions for the file and the directory the app is trying to write to. You can use strace to see where it's writing to
if nothing else tells you.

- ACLs
If the system is using ACLs, you need to verify the user/app is in the proper ACLs.

- selinux
selinux can control what app can write where and how. So verify the selinux perms are correct.

need more info on tracking down selinux issues

- quotas
If the system has file system quotas enabled, it's possible the user is over quota.
    quota
That command will show the current quota usage, and indicate if quotas are in effect or not.

- read-only mounts
Network file systems in particular tend to mount shared partitions read-only. The mount options override any file permissions on the file system that is being shared.

- read only media
CD-ROMs are read-only media. The app isn't trying to write to one, is it?

- chattr/lsattr
One feature of ext2/3 is the ability to `chattr` files. These are per-file attributes beyond standard unix permissions. See the chattr/lsattr section of the tools section for more details.

If a file has had the command `chattr +i` run on it, then the file is "immutable" and nothing can modify it or delete it, including the root user. The only way to change it is to run `chattr -i` on it. Then it can be treated as a normal file.

- files doing weird stuff
The app is reading the right file. The file _looks_ correct, but the app is still behaving weirdly. A few things to look for:
- hidden chars
Sometimes a file can get hidden characters in it that give parsers headaches. This is increasingly common as support for more character encodings becomes common.
- dos style carriage returns
- embedded tabs
- high byte chars
One good approach is to open the file with vi in binary mode:
    vi -b filename
Then put vi into "list" mode. Do this by hitting escape and entering ":set list". This should show any non-ASCII chars, new lines, tabs, etc.

The `od` utility can be useful for viewing files "in the raw" as well. For example, `od -c filename` prints every byte as a character or an escape such as \r, \n, or \t.

- ending new line
Some apps are picky about having any extra new lines at the end of files. So it's something to look for.

- trailing spaces
A particularly hard to spot circumstance that can break some parsers: trailing spaces after a string. This can be particularly difficult to spot in cases where it's a space, then a new line. This seems to be particularly common for config options for usernames and passwords.
    "foobar" != "foobar "

- env stuff
The user's "environment" can often cause problems for applications. Some typical cases and how to detect them. X DISPLAY settings, the PATH, HOME, TERM settings, etc. can cause issues.

- things work as user/not root, vice versa
There can be any number of reasons something works as root but not as a user. Most of them are related to permissions (on files or devices).
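Since most of these cases come down to permissions, one quick check is to look at every directory on the way to the file, not just the file itself, because a missing execute (search) bit on any parent directory blocks access. A minimal sketch follows; `show_path_perms` is a hypothetical helper name, and on systems with util-linux installed `namei -l /path/to/file` gives a similar listing.

```shell
# Print the mode, owner and group of a path and of each parent directory.
# show_path_perms is a hypothetical helper name used for illustration.
show_path_perms() {
    p=$1
    while :; do
        ls -ld -- "$p" || return 1
        case $p in
            /|.) break ;;            # reached the top of the path
        esac
        p=$(dirname "$p")
    done
}

# Typical use, run once as root and once as the user, then compare:
#   show_path_perms /path/to/config/file
```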
Another common cause is PATH. On Red Hat at least, users do not have /sbin:/usr/sbin in their PATH by default. So some scripts or commands can fail because the programs they call are not in the PATH. Even having the PATH order differ between root and user can cause problems.

X forwarding crap

- env
The easiest way to see environment variables is just to run:
    env

- what basic env stuff means
- su/sudo issues

- env -i to launch with clean env
If an app seems to be having issues that are environment dependent, one thing that can be useful when troubleshooting is to launch it with `env -i`. Something like:
    env -i /bin/someapp
`env -i` basically strips all environment variables, so the app launches with nothing set.

- su -, etc
If `su` is being used to gain root, one thing to keep in mind is the difference between `su` and `su -`. The '-' tells su to start up a new login shell. In practice, this means `su -` essentially gets a shell with the same environment a normal root shell would have. A shell created with plain `su` still carries over most of the user's environment, typically including PATH and USERNAME (su itself normally resets HOME and SHELL). A shell created with `su -` has the same variables a root shell has. This difference can often cause weird behavior in apps that depend on the environment.

- sudo -l

- shell scripting
Scripting in bash or sh is often the quickest and easiest way to solve a problem. It's also heavily used in
the configuration and startup of Red Hat Linux systems. Unfortunately, debugging shell scripts can be quite painful. Debugging shell scripts someone else wrote a decade ago is even worse.

- echo
The bash builtin "echo" is often the best debugging tool. A common trick is just to add "echo" to the beginning of a line of code that you believe is doing something incorrect. This will just print out the line, but after variable expansion. Particularly handy if the line in question is using lots of shell variables.

- sh -x
Bash includes some support for getting more verbose information out of scripts as they run. Invoke a shell script as follows:
    sh -x /path/to/someshell.sh
That command will at least try to print out every line as it executes it.

- trap

- bash debugger
There is a bash debugger available at http://bashdb.sourceforge.net/. It's essentially a "gdb" style debugger for bash, including support for step debugging, breakpoints, and watchpoints.

- DNS/name resolution
Once a network is up and going, name resolution can continue to be a source of problems. Since so many applications expect reliable name resolution, a failure there tends to show up in many places.

- usage of dig
`dig` is probably the most useful tool for tracking down DNS issues. For example, `dig hostname` queries the name servers listed in /etc/resolv.conf, `dig @server hostname` queries one specific name server, and `dig -x address` does a reverse lookup.

- /etc/hosts
Check /etc/hosts for spurious entries. It's not uncommon for "temporary" /etc/hosts entries to become permanent, and when the host IP does change, things break.

- nscd
nscd is a name service caching daemon. It's very useful when using name services like hesiod and ldap. But it can also cache DNS as well. Most of the time, it just works. But it's been known to break in odd and mysterious ways before. So trying DNS with and without nscd running is a good idea.

- /etc/nsswitch.conf

- splat names/host typos
"*" matching on DNS servers is pretty common these days. It normally doesn't cause any problems, but it can make certain types of errors harder to track down. A typo in a hostname will get redirected to another server (typically a web server) instead of giving a name resolution error. Since the obvious "host not found" errors don't happen, these kinds of problems are harder to track down when "wildcard" DNS is in use.

- auth info
- getent
- ypwhich/match/cat

- certificate/crypto issues
- ssl CA certs
- gpg keys/signatures
- rpm gpg keys
- ssltool
- curl

- Network File Systems
- NFS causes weird issues
- timestamps
- perms/rootsquash/etc
- weird inode caching
- samba
- it touches windows stuff, icky

- Some app/apps are rapidly forking short-lived processes
- gah, what a PITA to troubleshoot
- psacct?
- sar -X?
- watching pids grow?
- dump-acct + parsing?

App specific
- apache
- scorecard stuff
- module debugging
- log files
- init file "configtest"
- -X debug mode
- php
- gtk apps
- event debugging stuff?
- X apps
- nosync stuff
- X log
- ssh
- debug flags
- sshd -d -d
- pam/auth/nss
- logging options?
- getent
- sendmail

Credits
Comments, suggestions, hints, ideas, criticisms, pointers, and other useful info from various folks including: Mihai Ibanescu, Chip Turner, Chris MacLeod, Todd Warner, Nicholas Hansen, Sven Riedel, Jacob Frelinger, James Clark, Brian Naylor, Drew Puch
