When a program misbehaves, debugging is the process of finding out why. There are various strategies for finding errors in a program. The crudest one is debugging by print statements. If you have a notion of where in your code the error arises, you can edit your code to insert print statements, recompile, rerun, and see if the output gives you any suggestions. There are several problems with this:
For these reasons, the best way to debug is by the use of an interactive debugger, a program that allows you to monitor and control the behaviour of a running program. In this section you will familiarize yourself with gdb, which is the open source debugger of the GNU project. Other debuggers are proprietary, and typically come with a compiler suite. Another distinction is that gdb is a commandline debugger; there are graphical debuggers such as ddd (a frontend to gdb) or DDT and TotalView (debuggers for parallel codes). We limit ourselves to gdb, since it incorporates the basic concepts common to all debuggers.
In this tutorial you will debug a number of simple programs with gdb and valgrind. The files can be found in the repository in the directory tutorials/debug_tutorial_files.
crumb trail: > debug > Step 0: compiling for debug
You often need to recompile your code before you can debug it. A first reason for this is that the binary code typically knows nothing about which variable names correspond to which memory locations, or which lines in the source correspond to which instructions. In order to make the binary executable know this, you have to include the symbol table in it, which is done by adding the -g option to the compiler line.
Usually, you also need to lower the compiler optimization level: a production code will often be compiled with flags such as -O2 or -xHost that try to make the code as fast as possible, but for debugging you need to replace this by -O0 ('oh-zero'). The reason is that higher levels will reorganize your code, making it hard to relate the execution to the source\footnote{Typically, actual code motion is done by -O3, but at level -O2 the compiler will inline functions and make other simplifications.}.
crumb trail: > debug > Invoking gdb
There are three ways of using gdb: using it to start a program, attaching it to an already running program, or using it to inspect a core dump. We will only consider the first possibility.
Here is an example of how to start gdb with a program that has no arguments (Fortran users, use hello.F): \codelisting{tutorials/gdb/c/hello.c}
%% cc -g -o hello hello.c
# regular invocation:
%% ./hello
hello world
# invocation from gdb:
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
Copyright 2004 Free Software Foundation, Inc.
.... copyright info ....
(gdb) run
Starting program: /home/eijkhout/tutorials/gdb/hello
Reading symbols for shared libraries +. done
hello world
Program exited normally.
(gdb) quit
%%
Important note: the program was compiled with the flag -g. This causes the symbol table (that is, the translation from machine address to program variables) and other debug information to be included in the binary. This will make your binary larger than strictly necessary, and it can also make it slower, for instance because the compiler will not perform certain optimizations\footnote{Compiler optimizations are not supposed to change the semantics of a program, but sometimes do. This can lead to the nightmare scenario where a program crashes or gives incorrect results, but magically works correctly when compiled with debug and run in a debugger.}.
To illustrate the presence of the symbol table, do
%% cc -g -o hello hello.c
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
(gdb) list
and compare it with leaving out the -g flag:
%% cc -o hello hello.c
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
(gdb) list
For a program with commandline input we give the arguments to the run command (Fortran users use say.F): \codelisting{tutorials/gdb/c/say.c}
%% cc -o say -g say.c
%% ./say 2
hello world
hello world
%% gdb say
.... the usual messages ...
(gdb) run 2
Starting program: /home/eijkhout/tutorials/gdb/c/say 2
Reading symbols for shared libraries +. done
hello world
hello world
Program exited normally.
crumb trail: > debug > Finding errors
Let us now consider some programs with errors.
crumb trail: > debug > Finding errors > C programs
\codelisting{tutorials/gdb/c/square.c}
%% cc -g -o square square.c
%% ./square 5000
Segmentation fault
The segmentation fault (other messages are possible too) indicates that we are accessing memory that we are not allowed to, making the program stop. A debugger will quickly tell us where this happens:
%% gdb square
(gdb) run 50000
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x000000000000eb4a
0x00007fff824295ca in __svfscanf_l ()
Apparently the error occurred in a function __svfscanf_l, which is not one of ours, but a system function. Using the backtrace (or bt, also where or w) command we quickly find out how this came to be called:
(gdb) backtrace
#0 0x00007fff824295ca in __svfscanf_l ()
#1 0x00007fff8244011b in fscanf ()
#2 0x0000000100000e89 in main (argc=1, argv=0x7fff5fbfc7c0) at square.c:7
We take a close look at line 7, and see that we need to change nmax to &nmax.
There is still an error in our program:
(gdb) run 50000
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000000010000f000
0x0000000100000ebe in main (argc=2, argv=0x7fff5fbfc7a8) at square1.c:9
9 squares[i] = 1./(i*i); sum += squares[i];
We investigate further:
(gdb) print i
$1 = 11237
(gdb) print squares[i]
Cannot access memory at address 0x10000f000
and we quickly see that we forgot to allocate squares.
By the way, we were lucky here: this sort of memory error is not always detected. Starting our program with a smaller input does not lead to an error:
(gdb) run 50
Sum: 1.625133e+00
Program exited normally.
crumb trail: > debug > Finding errors > Fortran programs
Compile and run the following program: \codelisting{tutorials/gdb/f/square.F} It should end prematurely with a message such as 'Illegal instruction'. Running the program in gdb quickly tells you where the problem lies:
(gdb) run
Starting program: tutorials/gdb//fsquare
Reading symbols for shared libraries ++++. done
Program received signal EXC_BAD_INSTRUCTION, Illegal instruction/operand.
0x0000000100000da3 in square () at square.F:7
7 sum = sum + squares(i)
We take a close look at the code and see that we did not allocate squares properly.
crumb trail: > debug > Memory debugging with Valgrind
Insert the following allocation of squares in your program:
squares = (float *) malloc( nmax*sizeof(float) );
Compile and run your program. The output will likely be correct, although the program is not. Can you see the problem?
To find such subtle memory errors you need a different tool: a memory debugging tool. A popular (because open source) one is valgrind; a common commercial tool is purify.
\codelisting{tutorials/gdb/c/square1.c} Compile this program with cc -g -o square1 square1.c and run it with valgrind square1 (you need to type the input value). You will get lots of output, starting with:
%% valgrind square1
==53695== Memcheck, a memory error detector
==53695== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==53695== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==53695== Command: a.out
==53695==
10
==53695== Invalid write of size 4
==53695== at 0x100000EB0: main (square1.c:10)
==53695== Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695== at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695== by 0x100000E77: main (square1.c:8)
==53695==
==53695== Invalid read of size 4
==53695== at 0x100000EC1: main (square1.c:11)
==53695== Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695== at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695== by 0x100000E77: main (square1.c:8)
Valgrind is informative but cryptic, since it works on the bare memory, not on variables. Thus, these error messages take some exegesis. They state that line 10 writes a 4-byte object immediately after a block of 40 bytes that was allocated. In other words: the code is writing outside the bounds of an allocated array. Do you see what the problem in the code is?
Note that valgrind also reports at the end of the program run how much memory is still in use, meaning not properly freed.
If you fix the array bounds and recompile and rerun the program, valgrind still complains:
==53785== Conditional jump or move depends on uninitialised value(s)
==53785== at 0x10006FC68: __dtoa (in /usr/lib/libSystem.B.dylib)
==53785== by 0x10003199F: __vfprintf (in /usr/lib/libSystem.B.dylib)
==53785== by 0x1000738AA: vfprintf_l (in /usr/lib/libSystem.B.dylib)
==53785== by 0x1000A1006: printf (in /usr/lib/libSystem.B.dylib)
==53785== by 0x100000EF3: main (in ./square2)
Although no line number is given, the mention of printf gives an indication where the problem lies. The reference to an 'uninitialized value' is again cryptic: the only value being output is sum, and that is not uninitialized: it has been added to several times. Do you see why valgrind calls it uninitialized all the same?
crumb trail: > debug > Stepping through a program
Often the error in a program is sufficiently obscure that you need to investigate the program run in detail. Compile the following program \codelisting{tutorials/gdb/c/roots.c} and run it:
%% ./roots
sum: nan
Start it in gdb as follows:
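%% gdb roots
GNU gdb 6.3.50-20050815 (Apple version gdb-1469) (Wed May 5 04:36:56 UTC 2010)
Copyright 2004 Free Software Foundation, Inc.
....
(gdb) break main
Breakpoint 1 at 0x100000ea6: file root.c, line 14.
(gdb) run
Starting program: tutorials/gdb/c/roots
Reading symbols for shared libraries +. done
Breakpoint 1, main () at roots.c:14
14 float x=0;
Here you have done the following: with break main you set a breakpoint at the main program, so that execution stops when it reaches main; with run you started the program; execution then stopped at the first statement of main, line 14 of roots.c.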
If execution is stopped at a breakpoint, you can do various things, such as issuing the step command:
Breakpoint 1, main () at roots.c:14
14 float x=0;
(gdb) step
15 for (i=100; i>-100; i--)
(gdb)
16 x += root(i);
(gdb)
(If you just hit return, the previously issued command is repeated.) Do a number of steps in a row by hitting return. What do you notice about the function and the loop?
Switch from doing step to doing next. Now what do you notice about the loop and the function?
Set another breakpoint: break 17 and do cont. What happens?
Rerun the program after you set a breakpoint on the line with the sqrt call. When the execution stops there, do where and list.
crumb trail: > debug > Inspecting values
Run the previous program again in gdb: set a breakpoint at the line that does the sqrt call before you actually call run. When the program gets to line 8 you can do print n. Do cont. Where does the program stop?
If you want to repair a variable, you can do set var <name>=<value>. Change the variable n and confirm that the square root of the new value is computed. Which commands do you do?
If a problem occurs in a loop, it can be tedious to keep typing cont and inspecting the variable with print. Instead you can add a condition to an existing breakpoint:
condition 1 n<0
or set the condition when you define the breakpoint:
break 8 if (n<0)
Another possibility is to use ignore 1 50, which will not stop at breakpoint 1 the next 50 times.
Remove the existing breakpoint, redefine it with the condition n<0 and rerun your program. When the program breaks, find for what value of the loop variable it happened. What is the sequence of commands you use?
crumb trail: > debug > Parallel debugging
Debugging parallel programs is harder than debugging sequential programs, because every sequential bug may show up, plus a number of new types, caused by the interaction of the various processes.
Here are a few possible parallel bugs:
There are few low-budget solutions to parallel debugging. The main one is to create an xterm for each process. We will describe this next. There are also commercial packages such as DDT and TotalView that offer a GUI. They are very convenient but also expensive. The Eclipse project has a parallel package, Eclipse PTP, that includes a graphic debugger.
crumb trail: > debug > Parallel debugging > MPI debugging with gdb
You cannot run parallel programs in gdb, but you can start multiple gdb processes that behave just like MPI processes! The command
mpirun -np <NP> xterm -e gdb ./program
creates a number of xterm windows, each of which executes the commandline gdb ./program. And because these xterms have been started with mpirun, they actually form a communicator.
\begin{pcse}
crumb trail: > debug > Parallel debugging > Full-screen parallel debugging with DDT
In this tutorial you will run and diagnose a few incorrect MPI programs using DDT. You can start a session with ddt yourprogram &, or use File > New Session > Run to specify a program name, and possibly parameters. In both cases you get a dialog where you can specify program parameters. It is also important to check the following:
When DDT opens on your main program, it halts at the MPI_Init statement, and you need to press the forward arrow at the top left of the main window.
Problem1
This program has every process independently generate random numbers, and if the number meets a certain condition, stops execution. There is no problem with this code as such, so let's suppose you simply want to monitor its execution.
Problem2
Compile problem1.c and run it in DDT. You'll get a dialog warning about an error condition.
Problem2
\end{pcse}
crumb trail: > debug > Further reading
A good tutorial: http://www.dirac.org/linux/gdb/
Reference manual: http://www.ofb.net/gnu/gdb/gdb_toc.html