Friday, August 3, 2007

Kernel debugging, part 1: "Understanding what's gone wrong"

*****************************************************************************
* Kernel debugging, part 1: "Understanding what's gone wrong" *
* *
* Copyright (c) 2001 Daniel P. Bovet, Marco Cesati, and Cosimo Comella *
* Permission is granted to copy, distribute and/or modify this document *
* under the terms of the GNU Free Documentation License, Version 1.1, *
* published by the Free Software Foundation; with no Invariant Sections, *
* with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the *
* license is included in the file named LICENSE. *
* *
* (version 1.0) *
*****************************************************************************
Every complex program has bugs, and the Linux kernel is not an
exception. However, debugging the kernel is an hard task, because:
* the kernel is BIG (millions of lines)
* the kernel is very complex (multithreaded,
hardware-related, ...)
* there is no higher program that monitors it
You are usually made aware of a kernel bug when:
* your User Mode program crashes, freezes, or behaves in
unexpected ways
* you discover a kernel oops on the console or in the log files
* your system does not reply to ping, your console no longer
responds ...
* your cabinet smokes :-)
Fixing a kernel bug involves three different steps:
1. finding where the bug showed itself
2. understanding what is wrong with the existing code
3. fixing the code
As you may well suppose, steps 1 and 3 are relatively easy, while
step 2 is hairy. You need to understand how the kernel is supposed to
work, and how it actually works. If you finds a difference, then you
have likely found a bug.
Now, I have to admit that the title of this lecture is a lie: I
cannot give you any help to cope with step 2. You have to study the
kernel by yourself, and trying to make a sense of it.
In this lecture we'll focus on step 1, namely, locating the exact
instruction in the kernel source code where a particular bug showed
up. Be warned that the bug is not necessarily due to a wrong
instruction in the exact place where it showed up: the bug might be a
design flaw, a stale pointer passed by another function, and so on.
We shall proceed by examples.
Example 1. The "mystery1" program hangs up.
If "mystery1" is in foreground, it cannot be stopped or killed. But we can use
the "ps" command to collect some information about it:
$ ps u -C mystery1
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
cesati 208 0.0 0.1 1200 312 pts/1 D 18:35 0:00 ./mystery1

The "D" flag gives us two bits of information:
1. the process is in Kernel Mode
2. the process is in TASK_UNINTERRUPTIBLE state
In Linux, every process alternates between User Mode and Kernel Mode. In User
Mode the process executes application-specific code, while in Kernel Mode it
executes kernel code. (In Intel processors, User and Kernel Modes are different
hardware states encoded in the last two bits of the cs and ss segmentation
registers.)
While in Kernel Mode, the process can be in one of the following states:

TASK_RUNNING running or waiting for the CPU
TASK_INTERRUPTIBLE waiting for some event, interruptible by signals
TASK_UNINTERRUPTIBLE waiting for some event, not interruptible
TASK_STOPPED stopped by a signal or terminal keypress (Ctrl-Z)
TASK_ZOMBIE execution terminated, waiting to be destroyed

Thus, our "mystery1" program is stuck in Kernel Mode waiting for some event.
Understanding what precisely the process is waiting for might clarify a lot.
The "ps" command is still useful here:
$ ps n -o pid,tt,user,fname,wchan -C mystery1
PID TT USER COMMAND WCHAN
208 pts/1 1000 mystery1 161dad
The WCHAN field is the information we need: it denotes the "wait
channel" of the process, namely, an offset of a function where the
process is blocked. The "n" flag of "ps" forces numerical output, thus
161dad must be interpreted as hex number 0x161dad.
How can we make sense out of this number? First of all, recall that the kernel
lives in the fourth gigabyte of linear address space. The address returned by
WCHAN should therefore be considered as 0xc00161dad.
Now, we must know what instruction lies in RAM at that address. Association
between kernel symbols and RAM addresses is mostly done in the linking phase
of kernel compilation. Beside the kernel image file to be loaded in RAM, the
compilation yields a file named System.map, which lists the linear addresses of
all kernel symbols. When installing a new kernel, you should always save such
file in a safe place.
Even better, you should save this file in a place where the "ps" command may
find it. IMHO, the best place is the /boot directory. Also, "ps" looks first for
a file named "System.map-X.Y.Z", where "X.Y.Z" is the release of the kernel
as returned by "uname -r" (first four lines of the top Makefile).
Without the "n" flag, "ps" will search the System.map file and translate the
address by yourself:
$ ps -o pid,tt,user,fname,wchan -C mystery1
PID TT USER COMMAND WCHAN
208 pts/1 cesati mystery1 semctl_down
WARNING: be sure that "ps" looks at the right System.map file before using the
symbolic name! In our case, "ps" looks at the wrong file, therefore it gives
a wrong result. Using the correct System.map file yields:
$ ps -o pid,tt,user,fname,wchan -C mystery1
PID TT USER COMMAND WCHAN
208 pts/1 cesati mystery1 noninterruptible_delay

Now, we should look up noninterruptible_delay() in the kernel source
code. Either you already know the source file where the function is defined,
or you have to search. I'm used to type a command like this in the top source
directory:
$ find . -name \*.[chS] -exec grep -l interruptible_delay {} \;
./lkhc/hack1.c
./lkhc/hack-static.c
In the lkhc/hack1.c file, interruptible_delay() is invoked:
====== excerpt from lkhc/hack1.c ======
static int hack1_read(char *page, char **start, off_t off, int count, int *eof,
void *data)
{
int written;

written = sprintf(page+off, "jiffies=%ld ", jiffies);
noninterruptible_delay(1); /* sleep one second */
written += sprintf(page+off, "%ld\n", jiffies);
return written;
}
=======================================
In the lkhc/hack-static.c file, interruptible_delay() is defined as:
====== excerpt from lkhc/hack-static.c ======
void noninterruptible_delay(int seconds)
{
DECLARE_WAIT_QUEUE_HEAD(wait);

sleep_on_timeout(&wait, seconds*HZ*1000);
}
=============================================
We eventually found where our process is blocked: the sleep_on_timeout()
function, which puts the current process in TASK_UNINTERRUPTIBLE state
until either it is awakened by another process, or a timeout expires.
Now step 2 can start: how is the kernel behaving? From its definition, it is
quite clear that the parameter "seconds" represents the timeout in seconds.
However, there is a mistake in the invocation of sleep_on_timeout(), because
this function expects a time interval in ticks.
Step 3 is also trivial: just replace the wrong line with:
sleep_on_timeout(&wait, seconds*HZ);

Let's try a slightly different example.
Example 2. The "mystery2" program hangs up.
When applying our "ps" command, we get a weird result:
$ ps -o pid,tt,user,fname,wchan -C mystery2
PID TT USER COMMAND WCHAN
10687 pts/3 cesati mystery2 end
The symbol "end" is too vague to be a real symbol name, and actually it
corresponds to the linker symbol denoting the end of the kernel code.
Something's going wrong. Let's try "ps" with numerical output:
$ ps n -o pid,tt,user,fname,wchan -C mystery2
PID TT USER COMMAND WCHAN
10687 pts/3 1000 mystery2 5509d
Now, if we try to interpret 5509d as the linear address 0xc005509d, we fail:
looking at the Symbol.map file, we don't find it.
As you have already guessed, the wait channel falls inside a module. It is easy
to understand that the System.map file cannot hold the linear addresses of
symbols defined inside modules, because any module is relocated when it is
loaded at run-time. Thus, we need another source of information: /proc/ksyms
virtual file to the rescue!
The /proc/ksyms file has not been designed for debugging. It is mainly used
when loading a module, since it includes the linear addresses of all kernel
symbols (both from static code and loaded modules) that are exported. For
example, consider our hack1.c module, that refers to the jiffies global
variable: the /proc/ksyms file includes the linear address of jiffies:
$ grep jiffies /proc/ksyms
c02559e4 jiffies
The insmod program looks into /proc/ksyms, fetches the address, and replace any
occurrence of jiffies in hack1.o when preparing the relocated executable code
of the module.
This discussion suggests that /proc/ksyms does not include the address of every
function of every module. In fact:
$ cat /proc/ksyms
cd055000 __insmod_hack2_O/home/cesati/LKHC/linux-2.4.16-lkhc1/lkhc/hack2.o_M3C1659AA_V132112 [hack2]
cd055060 __insmod_hack2_S.text_L208 [hack2]
cd055130 __insmod_hack2_S.rodata_L24 [hack2]
cd0551f4 __insmod_hack2_S.bss_L4 [hack2]
cd053000 __insmod_hack1_O/home/cesati/LKHC/linux-2.4.16-lkhc1/lkhc/hack1.o_M3C164209_V132112 [hack1]
cd053060 __insmod_hack1_S.text_L144 [hack1]
cd0530f0 __insmod_hack1_S.rodata_L24 [hack1]
cd0531b4 __insmod_hack1_S.bss_L4 [hack1]
cd046c88 pcmcia_lookup_bus [pcmcia_core]
cd0465e4 pcmcia_access_configuration_register [pcmcia_core]
cd044d8c pcmcia_adjust_resource_info [pcmcia_core]
cd0466a0 pcmcia_bind_device [pcmcia_core]
cd04676c pcmcia_bind_mtd [pcmcia_core]
cd045830 pcmcia_check_erase_queue [pcmcia_core]
cd0458f8 pcmcia_close_memory [pcmcia_core]
(snipped)
The hack1 and hack2 modules don't export any symbol. Fortunately, the insmod
program produces a few symbols that mark the beginning and the end of the
kernel address space allocated to any loaded module. In our example, the code
of module hack1 starts at linear address 0xcd053060 (text section), and the
code of module hack2 starts at 0xcd055060. These symbols allow us to make sense
of the number 0x5509d returned by "ps": it looks like an offset inside the code
of hack2.
(Short digression: you might wonder why statically compiled symbols start at
0xc0000000, while module symbols start at 0xcd000000. The kernel uses the
initial portion of the fourth gigabyte of linear address space to linearly map
all physical RAM in the system; the statically compiled kernel "lives" here. The
rest of the fourth gigabyte is used by the kernel to establish temporary
mappings; modules are loaded using these temporary mappings.)
Now, we would like to establish the exact function inside the hack2 module
that corresponds to 0xcd05509d. The module starts at 0xcd055060, thus
the instruction is at offset 0xcd05509d-0xcd055060=0x3d=61. In order to get the
Assembly code of the module, let's use the objdump command:
$ objdump -d hack2.o
hack2.o: file format elf32-i386
Disassembly of section .text:
0000000000000000 <_noninterruptible_delay>:
0: 83 ec 18 sub $0x18,%esp
3: 53 push %ebx
4: 8b 4c 24 20 mov 0x20(%esp,1),%ecx
8: c7 44 24 04 00 00 00 movl $0x0,0x4(%esp,1)
f: 00
10: 8b 44 24 04 mov 0x4(%esp,1),%eax
14: 8d 54 24 14 lea 0x14(%esp,1),%edx
18: 89 54 24 08 mov %edx,0x8(%esp,1)
1c: 89 54 24 0c mov %edx,0xc(%esp,1)
20: 89 54 24 14 mov %edx,0x14(%esp,1)
24: 89 54 24 18 mov %edx,0x18(%esp,1)
28: 69 d1 a0 86 01 00 imul $0x186a0,%ecx,%edx
2e: 8d 5c 24 10 lea 0x10(%esp,1),%ebx
32: 89 44 24 10 mov %eax,0x10(%esp,1)
36: 89 d8 mov %ebx,%eax
38: e8 fc ff ff ff call 39 <_noninterruptible_delay+0x39>
3d: 5b pop %ebx
3e: 83 c4 18 add $0x18,%esp
41: c3 ret
42: 89 f6 mov %esi,%esi
0000000000000044 :
44: 6a 00 push $0x0
46: 68 a4 01 00 00 push $0x1a4
4b: 68 00 00 00 00 push $0x0
50: e8 fc ff ff ff call 51
55: a3 00 00 00 00 mov %eax,0x0
5a: 83 c4 0c add $0xc,%esp
(snipped)
The instruction precedings those starting at offset 0x3d is where our process is
blocked. As you see, the instruction is inside the _uninterruptible_timeout
function, and it is a call to an external function whose symbol is yet to be
resolved by insmod.
If things are not clear, you can also compare the output of objdump with the
Assembly code generated by the compiler. To generate the Assembly file, go
into the top source directory and give the command:
# make lkhc/hack2.s
Assembly files have suffix ".s". In our case it contains:
====== excerpt from lkhc/hack2.s ======
.file "hack2.c"
.version "01.01"
gcc2_compiled.:
.text
.align 4
.type _noninterruptible_delay,@function
_noninterruptible_delay:
subl $24,%esp
pushl %ebx
movl 32(%esp),%ecx
movl $0,4(%esp)
movl 4(%esp),%eax
leal 20(%esp),%edx
movl %edx,8(%esp)
movl %edx,12(%esp)
movl %edx,20(%esp)
movl %edx,24(%esp)
imull $100000,%ecx,%edx
leal 16(%esp),%ebx
movl %eax,16(%esp)
movl %ebx,%eax
call sleep_on_timeout
popl %ebx
addl $24,%esp
ret
.Lfe1:
.size _noninterruptible_delay,.Lfe1-_noninterruptible_delay
.section .rodata
.LC11:
(snipped)
Thus, the problem is again the invocation of the sleep_on_timeout() function.
We look in the source code of the module, and we discover the same problem as
in the previous example:
====== excerpt from lkhc/hack2.c ======
static void _noninterruptible_delay(int seconds)
{
DECLARE_WAIT_QUEUE_HEAD(wait);

sleep_on_timeout(&wait, seconds*HZ*1000);
}
(A hint that may save you some trouble: remember to remove your Assembly file
from the kernel source code tree before recompiling, otherwise the make program
will use it instead of the corresponding C file.)
Example 3. The "mystery3" program fails.
The "mystery3" program fails with a "Segmentation fault" message. This message
is usually due to a system call that returned the -EFAULT error code, but it
could also be raised by a uncaught SIGSEGV signal sent to the process.
If we have the source code of the failing program, we could easily understand
what happened. Things are a bit harder if the source code of the program is not
available.
Let's suppose we don't have the source code of mystery3. In order to understand
what happens, we must trace the system calls invoked by the program by means of
the "strace" command:
$ strace mystery3
execve("./mystery3", ["./mystery3"], [/* 36 vars */]) = 0
brk(0) = 0x80496c8
old_mmap(NULL, 4096, PROT_READPROT_WRITE, MAP_PRIVATEMAP_ANONYMOUS, -1, 0) = 0x40016000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/kde/lib/i686/mmx/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64(0xbfffedbc, 0xbfffee14) = -1 ENOENT (No such file or directory)
open("/opt/kde/lib/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64(0xbfffedbc, 0xbfffee14) = -1 ENOENT (No such file or directory)
open("/opt/kde/lib/mmx/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64(0xbfffedbc, 0xbfffee14) = -1 ENOENT (No such file or directory)
open("/opt/kde/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64(0xbfffedbc, 0xbfffee14) = 0
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(0x3, 0xbfffedc4) = 0
old_mmap(NULL, 37203, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\264\323"..., 1024) = 1024
fstat64(0x3, 0xbfffee0c) = 0
old_mmap(NULL, 1116516, PROT_READPROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40021000
mprotect(0x40128000, 39268, PROT_NONE) = 0
old_mmap(0x40128000, 24576, PROT_READPROT_WRITE, MAP_PRIVATEMAP_FIXED, 3, 0x106000) = 0x40128000
old_mmap(0x4012e000, 14692, PROT_READPROT_WRITE, MAP_PRIVATEMAP_FIXEDMAP_ANONYMOUS, -1, 0) = 0x4012e000
close(3) = 0
munmap(0x40017000, 37203) = 0
getpid() = 220
open("/proc/hack3", O_RDONLY) = 3
read(3, "jiffies=14462\n", 128) = 14
fstat64(0x1, 0xbffff20c
+++ killed by SIGSEGV +++
The first system calls are issued by the ELF program interpreter that sets up
the execution context of the program. We are interested in the last three system
calls. The program successfully open the /proc/hack3 virtual file (file
descriptor 3). Then it read 128 bytes from the file. The read() system call
fills the program buffer with the string "jiffies=14462\n" (length = 14 bytes).
Then the program invokes the fstat64() system call, but this system call never
terminates: the process receives a SIGSEGV signal and aborts.
The SIGSEGV signal is sent either by a User Mode process, or by the kernel
when a serious error occurs. In this second case, the kernel builds a so-called
"kernel Oops".
A kernel Oops is a short message that describes the erroneous condition.
There are several ways to read the message (and you often have to try many of
them):
1. The Oops appears on the system console, or on a terminal that acts as the
console. This is not very convenient for you, because console messages
scrolls as new messages arrive, and the Oops is eventually lost.

2. The Oops can be read in the /proc/kmsg virtual file. Usually a system process
named klogd continuously read from this file, and sends its content to
another system process called syslogd. In turn, syslogd could write the
message in one of the system log files in /var/log.

3. If the klogd process is not running, you could read the /proc/kmsg file by
yourself.

4. The kernel messages (including Oops) can also be read with the dmesg
command. This is my preferred method. Be aware, however, that the kernel
message buffer usually contains only 16KB, thus the kernel messages
are cyclically discarded.

Let's use the dmesg command to look for the Oops:
$ dmesg
(snipped)
Unable to handle kernel NULL pointer dereference at virtual address 00000008
printing eip:
c012ecc9
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[] Not tainted
EFLAGS: 00010246
eax: 00000000 ebx: c8840000 ecx: 00000001 edx: 00000000
esi: fffffff7 edi: 00000001 ebp: bffff1ac esp: c8841fb0
ds: 0018 es: 0018 ss: 0018
Process mystery3 (pid: 217, stackpage=c8841000)
Stack: c0134065 c8840000 40131170 00000001 c0106b0b 00000001 bffff20c 4012d760
40131170 00000001 bffff1ac 000000c5 0000002b 0000002b 000000c5 400df05c
00000023 00000212 bffff140 0000002b
Call Trace: [] []

Code: 3b 48 08 73 06 8b 40 14 8b 14 88 85 d2 74 03 ff 42 14 89 d0
As you see, there is lot of information here. First of all, a description of
the error: "Unable to handle kernel NULL pointer dereference at virtual address
XXX". This is equivalent to state that the kernel tried to access an object
addressed by a null pointer. The CPU raised an exception ("General protection
fault"), and the corresponding exception handler established that the fault was
due to a kernel bug.
The Oops then shows the content of the eip register (the Program Counter) when
the exception was raised. Finally, this is something we can understand. We
could look at the System.map file, and search the corresponding function:
====== excerpt from System.map ======
00000000c012ecb8 T fget
00000000c012ece0 T put_filp
=====================================
Thus, we have a function name, fget(), and an offset inside its code:
0xc012ecc9 - 0xc012ecb8 = 0x11 = 17
Next steps are straightforward: we search fget() inside the kernel source code
with the find command. It turns out to be defined in the fs/file_table.c file:
====== excerpt from fs/file_table.c ======
struct file * fget(unsigned int fd)
{
struct file * file;
struct files_struct *files = current->files;

read_lock(&files->file_lock);
file = fcheck(fd);
if (file)
get_file(file);
read_unlock(&files->file_lock);
return file;
}
===========================================
Now we have to locate the faulting instruction. By invoking objdump on the
correspoding object file we get:
$ objdump -d fs/file_table.o
(snipped)
0000000000000248 :
248: 89 c1 mov %eax,%ecx
24a: b8 00 e0 ff ff mov $0xffffe000,%eax
24f: 21 e0 and %esp,%eax
251: 8b 80 54 05 00 00 mov 0x554(%eax),%eax
257: 31 d2 xor %edx,%edx
259: 3b 48 08 cmp 0x8(%eax),%ecx
25c: 73 06 jae 264
25e: 8b 40 14 mov 0x14(%eax),%eax
261: 8b 14 88 mov (%eax,%ecx,4),%edx
264: 85 d2 test %edx,%edx
266: 74 03 je 26b
268: ff 42 14 incl 0x14(%edx)
26b: 89 d0 mov %edx,%eax
26d: c3 ret
26e: 89 f6 mov %esi,%esi
(snipped)
Linus suggests using the GNU debugger on the uncompressed kernel image:
$ gdb linux-2.4.16-lkhc/vmlinux
GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-slackware-linux"...(no debugging symbols found)...
(gdb) disassemble fget
Dump of assembler code for function fget:
0xc012ecb8 : mov %eax,%ecx
0xc012ecba : mov $0xffffe000,%eax
0xc012ecbf : and %esp,%eax
0xc012ecc1 : mov 0x554(%eax),%eax
0xc012ecc7 : xor %edx,%edx
0xc012ecc9 : cmp 0x8(%eax),%ecx
0xc012eccc : jae 0xc012ecd4
0xc012ecce : mov 0x14(%eax),%eax
0xc012ecd1 : mov (%eax,%ecx,4),%edx
0xc012ecd4 : test %edx,%edx
0xc012ecd6 : je 0xc012ecdb
0xc012ecd8 : incl 0x14(%edx)
0xc012ecdb : mov %edx,%eax
0xc012ecdd : ret
0xc012ecde : mov %esi,%esi
End of assembler dump.
(gdb) quit
The instruction at offset 17 is "cmp 0x8(%eax),%ecx". (This is also confirmed
by the Code section in our Oops, which shows the bytes of the machine code of
the faulting instruction.) The Assembly instruction addresses the data
structure pointed to by the eax register, and store the field at offset 8 in
the ecx register. The source code of the function suggests that eax should
contain the content of the files field of the current process descriptor (at
offset 0x554); 8 is the offset of the file_lock field inside the data structure
pointed to by files.
Now, we have to consider another precious information coming from the Oops: the
dump of all register contents. In particular, before executing the faulting
instruction, %eax is set to 0. This causes the "NULL pointer dereference", the
general protection fault, the exception handler, the SIGSEGV signal sent to
mystery3, and so on.
However, the crucial point is: what is the bug? Of course, the Oops could be
avoided by simply adding a check on the files field before using its value. In
this way, however, the bug is not fixed, it is just hidden. Step 2 of our bug
fixing procedure starts here: is it conceivable that the files field is NULL? A
basic understanding of the kernel is enough to establish that files should
definitevely be not null here. In fact, the field stores a pointer to the table
of opened file of the process; moreover, the fget() function is invoked to
lookup a particular opened file specified by its file descriptor.
Clearly, the bug is not in the fget() function itself. Most likely, the bug has
been introduced in some function invoked before fget(). The Oops comes once
again to our rescue, because it shows a dump of the kernel stack when the
exception was raised. Even better, the "call trace" section lists the addresses
on the stack corresponding to the functions, discarding the function arguments:
Call Trace: [] []
Looking into the System.map file we find that 0xc0134065 corresponds to
sys_fstat64(), while 0xc0106b0b is inside the system_call() function. In this
particular case, the call trace section is not particularly rewarding: the Oops
occurred while processing the fstat64() system call, as we already know.
Searching the function and looking at it doesn't help, either:
====== excerpt from fs/stat.c ======
asmlinkage long sys_fstat64(unsigned long fd, struct stat64 * statbuf,
long flags)
{
struct file * f;
int err = -EBADF;

f = fget(fd);
if (f) {
struct dentry * dentry = f->f_dentry;

err = do_revalidate(dentry);
if (!err)
err = cp_new_stat64(dentry->d_inode, statbuf);
fput(f);
}
return err;
}
===================================
The function doesn't touch the files field. Therefore, the bug has been
introduced in a previous kernel function, most likely in the previous read()
system call. Unfortunately, the kernel stack is cleared when returning in User
Mode, thus the Oops cannot show the information relative to the read() system
call execution.
However, the strace command told us that the read() was reading from the
/proc/hack3 virtual file. We can thus dig in the kernel source code looking for
the function that implements /proc/hack3. We easily find the lkhc/hack3.c
source file, and discover the buggy instruction:
====== excerpt from lkhc/hack3.c ======
static int hack3_read(char *page, char **start, off_t off, int count,
int *eof, void *data)
{
int written;
current->files = NULL;
written = sprintf(page+off, "jiffies=%d\n", jiffies);
return written;
}
=======================================
Eureka!
(A final note on kernel Oops. A small program named ksymoops fetches several
bits of information from the System.map file, the uncompressed kernel image,
the /proc/ksyms file, and so on, and then interprets the Oops for you. The home
page of the program is listed in the scripts/ksymoops directory of the Linux
source code.)

Thursday, July 12, 2007

Inside the kernel debugger (KDB)

http://linuxdevices.com/articles/AT3761062961.html


Inside the Linux kernel debuggerA guide to getting started with KDBThe Linux kernel debugger (KDB) allows you to debug the Linux kernel. This aptly named tool is essentially a patch to the kernel code that lets hackers access kernel memory and data structures. One of the main advantages of KDB is that it does not require an additional machine for debugging: you can debug the kernel that you are running on.
Setting up a machine for KDB requires a little work in that the kernel needs to be patched and recompiled. Users of KDB should be familiar with compiling the Linux kernel (and with kernel internals, to an extent), but refer to the Resources section at the end of this article if you need help compiling the kernel.
In this article we'll start with information on downloading KDB patches, applying them, (re)compiling the kernel, and starting KDB. Then we'll get into the KDB commands and review some of the more often-used commands. Finally, we'll look at some details about the setup and display options.
Getting startedThe KDB project is maintained by Silicon Graphics (see Resources for a link), and you will need to download kernel-version dependent patches from its FTP site. The latest version of KDB available (at the time this article was written) is 4.2. You'll need to download and apply two patches. One is the "common" patch that contains the changes for the generic kernel code, and the other is the architecture-specific patch. The patches are available as bz2 files. On an x86 machine running the 2.4.20 kernel, for example, you would need kdb-v4.2-2.4.20-common-1.bz2 and kdb-v4.2-2.4.20-i386-1.bz2.
All examples provided here are for the i386 architecture and the 2.4.20 kernel. You will need to make suitable changes based on your machine and kernel version. You will also need root permissions for performing these operations.
Copy the files into the /usr/src/linux directory and extract the patch files from the bzipped files:
#bzip2 -d kdb-v4.2-2.4.20-common-1.bz2
#bzip2 -d kdb-v4.2-2.4.20-i386-1.bz2
You will get the kdb-v4.2-2.4.20-common-1 and kdb-v4.2-2.4-i386-1 files.
Now, apply the patches:
#patch -p1 #patch -p1
The patches should apply cleanly. Look for any files ending with .rej. This indicates failed patches. If the kernel tree is clean, the patches should apply without any problems.
Next, the kernel needs to be built with KDB enabled. The first step is to set the CONFIG_KDB option. Use your favorite configuration mechanism (xconfig, menuconfig, etc.) to do this. Go to the "Kernel hacking" section at the end and select the "Built-in Kernel Debugger support" option.
There are also a couple of other options you can select based on your preferences. Selecting the "Compile the kernel with frame pointers" option (if present) sets the CONFIG_FRAME_POINTER flag. This will lead to better stack tracebacks, as the frame pointer register is used as a frame pointer rather than a general purpose register. You can also select the "KDB off by default" option. This will set the CONFIG_KDB_OFF flag and will turn off KDB by default. We'll cover this more in a later section.
Save the configuration and exit. Recompile the kernel. Doing a "make clean" is recommended before building the kernel. Install the kernel in the usual manner and boot to it.
Initializing and setting the environment variablesYou can define KDB commands that will be executed during the initialization of KDB. These commands need to be defined in a plain text file called kdb_cmds, which exists in the KDB directory of the Linux source tree (post patching, of course). This file can also be used to define environment variables for setting the display and print options. The comments at the beginning of the file offer help on editing the file. The disadvantage of using this file is that the kernel needs to be rebuilt and reinstalled you change the file.
Activating KDBIf CONFIG_KDB_OFF was not selected during compilation, KDB is active by default. Otherwise, you need to activate it explicitly -- either by passing the kdb=on flag to the kernel during boot or by doing this once /proc has been mounted:
#echo "1" >/proc/sys/kernel/kdb
The reverse of the above steps will deactivate KDB. That is, either passing the kdb=off flag to the kernel or doing the following will deactivate KDB, if KDB is on by default:
#echo "0" >/proc/sys/kernel/kdb
There is yet another flag that can be passed to the kernel during boot. The kdb=early flag will result in control being passed to KDB very early in the boot process. This would help if you need to debug very early during the boot process.
There are various ways to invoke KDB. If KDB is on, it will get invoked automatically whenever there is a panic in the kernel. Pressing the PAUSE key on the keyboard would manually invoke KDB. Another way to invoke KDB is through the serial console. Of course, to do this, you need to set up the serial console (see Resources for help on this) and you need a program that reads from the serial console. The key sequence Ctrl-A will invoke KDB from the serial console.
KDB commandsKDB is a very powerful tool that allows several operations such as memory and register modification, applying breakpoints, and stack tracing. Based on these, the KDB commands can be classified into several categories. Here are details on the most commonly used commands in each of these categories.
Memory display and modificationThe most often-used commands in this category are md, mdr, mm, and mmW.
The md command takes an address/symbol and a line count and displays memory starting at the address for line-count number of lines. If line-count is not specified, the defaults as specified by the environment variables are used. If an address is not specified, md continues from the last address that was printed. The address is printed at the beginning, and the character conversion is printed at the end.
The mdr command takes an address/symbol and a byte count and displays the raw contents of memory starting at the specified address for byte-count number of bytes. It is essentially the same as md, but it does not display the starting address and the character conversion at the end. The mdr command is of very little use.
The mm command modifies memory contents. It takes an address/symbol and new contents as parameters and replaces the contents at the address with new-contents.
The mmW command changes W bytes starting at the address. Note that mm changes a machine word.
Examples
To display 15 lines of memory starting at 0xc000000:
[0]kdb> md 0xc000000 15
To change the contents of memory location 0xc000000 to 0x10:
[0]kdb> mm 0xc000000 0x10
Register display and modificationThe commands in this category are rd, rm, and ef.
The rd command (without any arguments) displays the contents of the processor registers. It optionally takes three arguments. If the c argument is passed, rd displays the processor's control registers; with the d argument, it displays the debug registers; and with the u argument, the register set of the current task, the last time they entered the kernel, are displayed.
The rm command modifies the contents of a register. It takes a register name and new-contents as arguments and modifies the register with new-contents. The register names depend on the specific architecture. Currently, the control registers cannot be modified.
The ef command takes an address as an argument and displays an exception frame at the specified address.
Examples
To display the general register set:
[0]kdb> rd
To set the contents of register ebx to 0x25:
[0]kdb> rm %ebx 0x25
BreakpointsThe commonly used breakpoint commands are bp, bc, bd, be, and bl.
The bp command takes an address/symbol as an argument and applies a breakpoint at the address. Execution is stopped and control is given to KDB when this breakpoint is hit. There are a couple of variations of this command that could be useful. The bpa command applies the breakpoint on all processors in an SMP system. The bph command forces the use of a hardware register on systems that support it. The bpha command is similar to the bpa command except that it forces the use of a hardware register.
The bd command disables a particular breakpoint. It takes in a breakpoint number as an argument. This command does not remove the breakpoint from the breakpoint table but just disables it. The breakpoint numbers start from 0 and are allocated to breakpoints in the order of availability.
The be command enables a breakpoint. The argument to this command is also the breakpoint number.
The bl command lists the current set of breakpoints. It includes both the enabled and the disabled breakpoints.
The bc command removes a breakpoint from the breakpoint table. It takes either a specific breakpoint number as an argument or it takes *, in which case it will remove all breakpoints.
Examples
To set up a breakpoint at the function sys_write():
[0]kdb> bp sys_write
To list all the breakpoints in the breakpoint table:
[0]kdb> bl
To clear breakpoint number 1:
[0]kdb> bc 1
Stack tracingThe main stack-tracing commands are bt, btp, btc, and bta.
The bt command attempts to provide information on the stack for the current thread. It optionally takes a stack frame address as an argument. If no address is provided, it takes the current registers to traceback the stack. Otherwise, it assumes the provided address as a valid stack frame start address and attempts to traceback. If the CONFIG_FRAME_POINTER option was set during kernel compilation, the frame pointer register is used to maintain stacks and, hence, the stack traceback can be performed correctly. The bt command may not produce correct results in the event of CONFIG_FRAME_POINTER not being set.
The btp command takes a process ID as an argument and does a stack traceback for that particular process.
The btc command does a stack traceback for the running process on each live CPU. Starting from the first live CPU, it does a bt, switches to the next live CPU, and so on.
The bta command does a traceback for all processes in a particular state. Without any argument, it does a traceback for all processes. Optionally, various arguments can be passed to this command. The processes in a particular state will be processed depending on the argument. The options and the corresponding states are as follows:
D: Uninterruptible state
R: Running
S: Interruptible sleep
T: Traced or stopped
Z: Zombie
U: Unrunnable
Each of these commands prints out a whole lot of information. Check out the Resources below for detailed documentation on these fields.
Examples
To trace the stack for the current active thread:
[0]kdb> bt
To trace the stack for process with ID 575:
[0]kdb> btp 575
Other commandsHere are a few other KDB commands that are useful in kernel debugging.
The id command takes an address/symbol as an argument and disassembles instructions starting at that address. The environment variable IDCOUNT determines how many lines of output are displayed.
The ss command single steps an instruction and returns control to KDB. A variation of this instruction is ssb, which executes instructions from the current instruction pointer address (printing the instruction on the screen) until it encounters an instruction that would cause a branch. Typical examples of branch instructions are call, return, and jump.
The go command lets the system continue normal execution. The execution continues until a breakpoint is hit (if one has been applied).
The reboot command reboots the system immediately. It does not bring down the system cleanly, and hence the results are unpredictable.
The ll command takes an address, an offset, and another KDB command as arguments. It repeats the command for each element of a linked list. The command executed takes the address of the current element in the list as the argument.
Examples
To disassemble instructions starting from the routine schedule. The number of lines displayed depends on the environment variable IDCOUNT:
[0]kdb> id schedule
To execute instructions until it encounters a branch condition (in this case, instruction jne):
[0]kdb> ssb0xc0105355 default_idle+0x25: cli0xc0105356 default_idle+0x26: mov 0x14(%edx),%eax0xc0105359 default_idle+0x29: test %eax, %eax0xc010535b default_idle+0x2b: jne 0xc0105361 default_idle+0x31
Tips and tricksDebugging a problem involves locating the source of the problem using a debugger (or any other tool) and using the source code to track the root cause of the problem. Using source code alone to determine problems is extremely difficult and may be possible only for expert kernel hackers. On the other hand, newbies tend to rely excessively on debuggers to fix bugs. This approach may lead to incorrect solutions to problems. The fear is that such an approach leads to fixing symptoms rather than the actual problems. A classic example of such a mistake is adding error handling code to take care of NULL pointers or bad references without looking into the real cause of the invalid references.
The dual approach of studying the code and using debugging tools is the best way to identify and fix problems.
The primary use of debuggers is to get to the location of the bug, confirm the symptom (and cause, in some cases), determine the values of variables, and determine how the program got there (that is, establish the call stack). An experienced hacker will know which debugger to use for a particular kind of problem and will quickly obtain the necessary information from debugging and get on with analyzing the code to identify the cause.
Here, then, are some tips for you to achieve the above mentioned results quickly using KDB. Of course, keep in mind that speed and accuracy in debugging comes with experience, practice, and good knowledge of the system (hardware, kernel internals, etc.)
Tip #1In KDB, typing an address at the prompt returns its nearest symbol match. This is extremely useful in stack analysis and in determining the addresses/values of global data and function addresses. Similarly, typing the name of a symbol returns its virtual address.
Examples
To indicate that the function sys_read starts at address 0xc013db4c:
[0]kdb> 0xc013db4c0xc013db4c = 0xc013db4c (sys_read)
Similarly,
Similarly, to indicate that sys_write is at address 0xc013dcc8:
[0]kdb> sys_writesys_write = 0xc013dcc8 (sys_write)
These help in locating global data and function addresses while analyzing stacks.

Tip #2
Whenever present, use the CONFIG_FRAME_POINTER option while compiling the kernel with KDB. To do this, you need to select the "Compile the kernel with frame pointers" option under the "Kernel hacking" section while configuring the kernel. This ensures that the frame pointer register will be used as a frame pointer leading to accurate tracebacks. You could, in fact, manually dump the contents of the frame pointer register and trace the entire stack. For example, on an i386 machine, the %ebp register can be used to traceback the entire stack.
For example, after executing the first instruction at function rmqueue(), the stack looked as follows:
[0]kdb> md %ebp0xc74c9f38 c74c9f60 c0136c40 000001f0 000000000xc74c9f48 08053328 c0425238 c04253a8 000000000xc74c9f58 000001f0 00000246 c74c9f6c c0136a250xc74c9f68 c74c8000 c74c9f74 c0136d6d c74c9fbc0xc74c9f78 c014fe45 c74c8000 00000000 08053328
[0]kdb> 0xc0136c400xc0136c40 = 0xc0136c40 (__alloc_pages +0x44)
[0]kdb> 0xc0136a250xc0136a25 = 0xc0136a25 (_alloc_pages +0x19)
[0]kdb> 0xc0136d6d0xc0136d6d = 0xc0136d6d (__get_free_pages +0xd)
We can see that rmqueue() has been called by __alloc_pages, which in turn has been called by _alloc_pages, and so on.
The first double word of every frame points to the next frame and this is immediately followed by the address of the calling function. Hence, tracing the stack becomes an easy job.

Tip #3
The go command, optionally, takes an address as a parameter. If you want to continue execution at a particular address, you could provide that address as a parameter. The other alternative is to modify the instruction pointer register using the rm command and just type go. This would be useful if you wanted to skip a particular instruction or a set of instructions that appeared to be causing a problem. Note, though, that using this instruction without care could lead to severe problems, and the system might crash badly.

Tip #4
You can define your own set of commands with a useful command called defcmd. For example, whenever you hit a breakpoint, you might wish to simultaneously check a particular variable, check the contents of some registers, and dump the stack. Normally, you would have to type a series of commands to be able to do this all at the same time. The defcmd allows you to define your own command, which could consist of one or more of the predefined KDB commands. Only one command would then be needed to do all three jobs. The syntax for this is as follows:
[0]kdb> defcmd name "usage" "help"
[0]kdb> [defcmd] type the commands here
[0]kdb> [defcmd] endefcmd
For example, a (trivial) new command called hari could be defined that would display one line of memory starting at address 0xc000000, display the contents of the registers, and dump the stack:
[0]kdb> defcmd hari "" "no arguments needed"
[0]kdb> [defcmd] md 0xc000000 1
[0]kdb> [defcmd] rd
[0]kdb> [defcmd] md %ebp 1
[0]kdb> [defcmd] endefcmd

The output of this command would be:
[0]kdb> hari
[hari]kdb> md 0xc000000 10xc000000 00000001 f000e816 f000e2c3 f000e816
[hari]kdb> rdeax = 0x00000000 ebx = 0xc0105330 ecx = 0xc0466000 edx = 0xc0466000.......
[hari]kdb> md %ebp 10xc0467fbc c0467fd0 c01053d2 00000002 000a0200
[0]kdb>

Tip #5
The bph and bpha commands can be used (provided the architecture supports use of hardware registers) to apply read and write breakpoints. This means we can get control whenever data is read from or written into a particular address. This can be extremely handy when debugging data/memory corruption problems, where you can use it to identify the corrupting code/process.
Examples
To enter the kernel debugger whenever four bytes are written into address 0xc0204060:
[0]kdb> bph 0xc0204060 dataw 4
To enter the kernel debugger when at least two bytes of data starting at 0xc000000 are read:[0]kdb> bph 0xc000000 datar 2
ConclusionKDB is a handy and powerful tool for performing kernel debugging. It offers various options and enables analysis of memory contents and data structures. Best of all, it does not require an additional machine to perform the debugging.
Resources
Find the KDB man pages in the Documentation/kdb directory.
For information on setting up the serial console, look for serial-console.txt in the Documentation directory.
Download KDB at SGI's kernel debugger project Web site.
For an overview of several scenario-based debugging techniques for Linux, read "Mastering Linux debugging techniques" (developerWorks, August 2002).
The tutorial "Compiling the Linux kernel" (developerWorks, August 2000) walks you through configuring, compiling, and installing the kernel.
IBM AIX users can get help using KDB for AIX on the KDB Kernel Debugger and Command page.
Those looking for information on debugging OS/2 should read Volume II of the four-volume IBM Redbook, The OS/2 Debugging Handbook.
Find more resources for Linux developers in the developerWorks Linux zone.
About the author: Hariprasad Nellitheertha works in IBM's Linux Technology Center in Bangalore, India. He is currently working on the Linux Change Team fixing kernel and other Linux bugs. Hari has worked on the OS/2 kernel and file systems. His interests include Linux kernel internals, file systems, and autonomic computing. You can contact Hari at nharipra@in.ibm.com.
First published by IBM developerWorks. Reproduced by LinuxDevices.com with permission.

Tuesday, May 15, 2007

HOWTO: Profile Memory in a Linux System

1. Introduction

It's important to determine how your system utilizes it's
resources. If your systems performance is unacceptable, it is
necessary to determine which resource is slowing the system
down. This document attempts to identify the following:

a. What is the system memory usage per unit time?
b. How much swap is being used per unit time?
c. What does each process' memory use look like over time?
d. What processes are using the most memory?

I used a RedHat-7.3 machine (kernel-2.4.18) for my experiments,
but any modern Linux distribution with the commands "ps" and
"free" would work.

2. Definitions

RAM (Random Access Memory) - Location where programs reside when
they are running. Other names for this are system memory or
physical memory. The purpose of this document is to determine if
you have enough of this.

Memory Buffers - A page cache for the virtual memory system. The
kernel keeps track of frequently accessed memory and stores the
pages here.

Memory Cached - Any modern operating system will cache files
frequently accessed. You can see the effects of this with the
following commands:

for i in 1 2 ; do
free -o
time grep -r foo /usr/bin >/dev/null 2>/dev/null
done

Memory Used - Amount of RAM in use by the computer. The kernel
will attempt to use as much of this as possible through buffers
and caching.

Swap - It is possible to extend the memory space of the computer
by using the hard drive as memory. This is called swap. Hard
drives are typically several orders of magnitude slower than RAM
so swap is only used when no RAM is available.

Swap Used - Amount of swap space used by the computer.

PID (Process IDentifier) - Each process (or instance of a running
program) has a unique number. This number is called a PID.

PPID (Parent Process IDentifier) - A process (or running program)
can create new processes. The new process created is called a
child process. The original process is called the parent
process. The child process has a PPID equal to the PID of the
parent process. There are two exceptions to this rule. The first
is a program called "init". This process always has a PID of 1 and
a PPID of 0. The second exception is when a parent process exit
all of the child processes are adopted by the "init" process and
have a PPID of 1.

VSIZE (Virtual memory SIZE) - The amount of memory the process is
currently using. This includes the amount in RAM and the amount in
swap.

RSS (Resident Set Size) - The portion of a process that exists in
physical memory (RAM). The rest of the program exists in swap. If
the computer has not used swap, this number will be equal to
VSIZE.

3. What consumes System Memory?

The kernel - The kernel will consume a couple of MB of memory. The
memory that the kernel consumes can not be swapped out to
disk. This memory is not reported by commands such as "free" or
"ps".

Running programs - Programs that have been executed will consume
memory while they run.

Memory Buffers - The amount of memory used is managed by the
kernel. You can get the amount with "free".

Memory Cached - The amount of memory used is managed by the
kernel. You can get the amount with "free".

4. Determining System Memory Usage

The inputs to this section were obtained with the command:

free -o

The command "free" is a c program that reads the "/proc"
filesystem.

There are three elements that are useful when determining the
system memory usage. They are:

a. Memory Used
b. Memory Used - Memory Buffers - Memory Cached
c. Swap Used

A graph of "Memory Used" per unit time will show the "Memory Used"
asymptotically approach the total amount of memory in the system
under heavy use. This is normal, as RAM unused is RAM wasted.

A graph of "Memory Used - Memory Buffered - Memory Cached" per
unit time will give a good sense of the memory use of your
applications minus the effects of your operating system. As you
start new applications, this value should go up. As you quit
applications, this value should go down. If an application has a
severe memory leak, this line will have a positive slope.

A graph of "Swap Used" per unit time will display the swap
usage. When the system is low on RAM, a program called kswapd will
swap parts of process if they haven't been used for some time. If
the amount of swap continues to climb at a steady rate, you may
have a memory leak or you might need more RAM.

5. Per Process Memory Usage

The inputs to this section were obtained with the command:

ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=pid

The command "ps" is a c program that reads the "/proc"
filesystem.

There are two elements that are useful when determining the per
process memory usage. They are:

a. RSS
b. VSIZE

A graph of RSS per unit time will show how much RAM the process is
using over time.

A graph of VSIZE per unit time will show how large the process is
over time.

6. Collecting Data

a. Reboot the system. This will reset your systems memory use

b. Run the following commands every ten seconds and redirect the
results to a file.

free -o
ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=pid

c. Do whatever you normally do on your system

d. Stop logging your data

7. Generate a Graph

a. System Memory Use

For the output of "free", place the following on one graph

1. X-axis is "MB Used"

2. Y-axis is unit time

3. Memory Used per unit time

4. Memory Used - Memory Buffered - Memory Cached per unit time

5. Swap Used per unit time

b. Per Process Memory Use

For the output of "ps", place the following on one graph

1. X-axis is "MB Used"

2. Y-axis is unit time

3. For each process with %MEM > 10.0

a. RSS per unit time

b. VSIZE per unit time

8. Understand the Graphs

a. System Memory Use

"Memory Used" will approach "Memory Total"

If "Memory Used - Memory Buffered - Memory Cached" is 75% of
"Memory Used", you either have a memory leak or you need to
purchase more memory.

b. Per Process Memory Use

This graph will tell you what processes are hogging the
memory.

If the VSIZE of any of these programs has a constant, positive
slope, it may have a memory leak.