Debugging

Experimental HTML version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse
49.1 : Step 0: compiling for debug
49.2 : Invoking gdb
49.3 : Finding errors
49.3.1 : C programs
49.3.2 : Fortran programs
49.4 : Memory debugging with Valgrind
49.5 : Stepping through a program
49.6 : Inspecting values
49.7 : Parallel debugging
49.7.1 : MPI debugging with gdb
49.7.2 : Full-screen parallel debugging with DDT
49.8 : Further reading

49 Debugging

When a program misbehaves, debugging is the process of finding out why. There are various strategies for finding errors in a program. The crudest one is debugging by print statements. If you have a notion of where in your code the error arises, you can edit your code to insert print statements, recompile, rerun, and see if the output gives you any suggestions. There are several problems with this:

  • The edit/compile/run cycle is time consuming, especially since
  • often the error will be caused by an earlier section of code, requiring you to edit, compile, and rerun repeatedly. Furthermore,
  • the amount of data produced by your program can be too large to display and inspect effectively, and
  • if your program is parallel, you probably need to print out data from all processors, making the inspection process very tedious.

For these reasons, the best way to debug is by the use of an interactive debugger, a program that allows you to monitor and control the behaviour of a running program. In this section you will familiarize yourself with gdb, which is the open source debugger of the GNU project. Other debuggers are proprietary, and typically come with a compiler suite. Another distinction is that gdb is a commandline debugger; there are graphical debuggers such as ddd (a frontend to gdb) or DDT and TotalView (debuggers for parallel codes). We limit ourselves to gdb, since it incorporates the basic concepts common to all debuggers.

In this tutorial you will debug a number of simple programs with gdb and valgrind. The files can be found in the repository in the directory tutorials/debug_tutorial_files.

49.1 Step 0: compiling for debug

You often need to recompile your code before you can debug it. A first reason for this is that the binary code typically knows nothing about what variable names correspond to what memory locations, or what lines in the source to what instructions. In order to make the binary executable know this, you have to include the symbol table in it, which is done by adding the -g option to the compiler line.

Usually, you also need to lower the compiler optimization level: production code is often compiled with flags such as -O2 or -xHost that try to make the code as fast as possible, but for debugging you need to replace this by -O0 ('oh-zero'). The reason is that higher levels reorganize your code, making it hard to relate the execution to the source. (Typically, actual code motion is done by -O3, but at level -O2 the compiler will already inline functions and make other simplifications.)

49.2 Invoking gdb

There are three ways of using gdb: using it to start a program, attaching it to an already running program, or using it to inspect a core dump. We will only consider the first possibility.

Here is an example of how to start gdb with a program that has no arguments (Fortran users, use hello.F); the source is tutorials/gdb/c/hello.c:

%% cc -g -o hello hello.c
# regular invocation:
%% ./hello
hello world
# invocation from gdb:
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
Copyright 2004 Free Software Foundation, Inc. .... copyright info ....
(gdb) run
Starting program: /home/eijkhout/tutorials/gdb/hello
Reading symbols for shared libraries +. done
hello world


Program exited normally.
(gdb) quit
%%

Important note: the program was compiled with the flag -g. This causes the symbol table (that is, the translation from machine address to program variables) and other debug information to be included in the binary. This makes your binary larger than strictly necessary, and it can also make it slower, for instance because the compiler will not perform certain optimizations. (Compiler optimizations are not supposed to change the semantics of a program, but sometimes do. This can lead to the nightmare scenario where a program crashes or gives incorrect results, but magically works correctly when compiled with debug and run in a debugger.)

To illustrate the presence of the symbol table, do:

%% cc -g -o hello hello.c
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
(gdb) list

and compare it with leaving out the -g flag:

%% cc -o hello hello.c
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
(gdb) list

For a program with commandline input we give the arguments to the run command (Fortran users, use say.F); the source is tutorials/gdb/c/say.c:

%% cc -o say -g say.c
%% ./say 2
hello world
hello world
%% gdb say
.... the usual messages ...
(gdb) run 2
Starting program: /home/eijkhout/tutorials/gdb/c/say 2
Reading symbols for shared libraries +. done
hello world
hello world


Program exited normally.

49.3 Finding errors

Let us now consider some programs with errors.

49.3.1 C programs

Compile and run the program tutorials/gdb/c/square.c:

%% cc -g -o square square.c
%% ./square
5000
Segmentation fault

The segmentation fault (other messages are possible too) indicates that we are accessing memory that we are not allowed to, making the program stop. A debugger will quickly tell us where this happens:

%% gdb square
(gdb) run
50000


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x000000000000eb4a
0x00007fff824295ca in __svfscanf_l ()

Apparently the error occurred in a function __svfscanf_l, which is not one of ours, but a system function. Using the backtrace (or bt, also where or w) command we quickly find out how this came to be called:

(gdb) backtrace
#0  0x00007fff824295ca in __svfscanf_l ()
#1  0x00007fff8244011b in fscanf ()
#2  0x0000000100000e89 in main (argc=1, argv=0x7fff5fbfc7c0) at square.c:7

We take a close look at line 7, and see that we need to change nmax to &nmax.

There is still an error in our program:

(gdb) run
50000


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000000010000f000
0x0000000100000ebe in main (argc=2, argv=0x7fff5fbfc7a8) at square1.c:9
9           squares[i] = 1./(i*i); sum += squares[i];

We investigate further:

(gdb) print i
$1 = 11237
(gdb) print squares[i]
Cannot access memory at address 0x10000f000

and we quickly see that we forgot to allocate squares.

By the way, we were lucky here: this sort of memory error is not always detected. Starting our program with a smaller input does not lead to an error:

(gdb) run
50
Sum: 1.625133e+00


Program exited normally.

49.3.2 Fortran programs

Compile and run the program tutorials/gdb/f/square.F. It should end prematurely with a message such as 'Illegal instruction'. Running the program in gdb quickly tells you where the problem lies:

(gdb) run
Starting program: tutorials/gdb//fsquare
Reading symbols for shared libraries ++++. done


Program received signal EXC_BAD_INSTRUCTION, Illegal instruction/operand.
0x0000000100000da3 in square () at square.F:7
7                sum = sum + squares(i)

We take a close look at the code and see that we did not allocate squares properly.

49.4 Memory debugging with Valgrind

Insert the following allocation of squares in your program:

squares = (float *) malloc( nmax*sizeof(float) );

Compile and run your program. The output will likely be correct, although the program is not. Can you see the problem?

To find such subtle memory errors you need a different tool: a memory debugging tool. A popular (because open source) one is valgrind; purify is a common commercial tool.

Compile the program tutorials/gdb/c/square1.c with cc -g -o square1 square1.c and run it with valgrind square1 (you need to type the input value). You will see lots of output, starting with:

%% valgrind square1
==53695== Memcheck, a memory error detector
==53695== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==53695== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==53695== Command: a.out
==53695==
10
==53695== Invalid write of size 4
==53695==    at 0x100000EB0: main (square1.c:10)
==53695==  Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695==    at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695==    by 0x100000E77: main (square1.c:8)
==53695==
==53695== Invalid read of size 4
==53695==    at 0x100000EC1: main (square1.c:11)
==53695==  Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695==    at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695==    by 0x100000E77: main (square1.c:8)

Valgrind is informative but cryptic, since it works on the bare memory, not on variables. Thus, these error messages take some exegesis. They state that line 10 writes a 4-byte object immediately after a block of 40 bytes that was allocated. In other words: the code is writing outside the bounds of an allocated array. Do you see what the problem in the code is?

Note that valgrind also reports at the end of the program run how much memory is still in use, meaning not properly freed.

If you fix the array bounds and recompile and rerun the program, valgrind still complains:

==53785== Conditional jump or move depends on uninitialised value(s)
==53785==    at 0x10006FC68: __dtoa (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x10003199F: __vfprintf (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x1000738AA: vfprintf_l (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x1000A1006: printf (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x100000EF3: main (in ./square2)

Although no line number is given, the mention of printf gives an indication where the problem lies. The reference to an 'uninitialised value' is again cryptic: the only value being output is sum, and that is not uninitialized: it has been added to several times. Do you see why valgrind calls it uninitialized all the same?

49.5 Stepping through a program

Often the error in a program is sufficiently obscure that you need to investigate the program run in detail. Compile the program tutorials/gdb/c/roots.c and run it:

%% ./roots
sum: nan

Start it in gdb as follows:

%% gdb roots
GNU gdb 6.3.50-20050815 (Apple version gdb-1469) (Wed May  5 04:36:56 UTC 2010)
Copyright 2004 Free Software Foundation, Inc.
....
(gdb) break main
Breakpoint 1 at 0x100000ea6: file roots.c, line 14.
(gdb) run
Starting program: tutorials/gdb/c/roots
Reading symbols for shared libraries +. done


Breakpoint 1, main () at roots.c:14
14        float x=0;

Here you have done the following:

  • Before calling run you set a breakpoint at the main program, meaning that the execution will stop when it reaches the main program.
  • You then call run and the program execution starts;
  • The execution stops at the first instruction in main.

If execution is stopped at a breakpoint, you can do various things, such as issuing the step command:

Breakpoint 1, main () at roots.c:14
14        float x=0;
(gdb) step
15        for (i=100; i>-100; i--)
(gdb)
16          x += root(i);
(gdb)

If you just hit return, the previously issued command is repeated. Do a number of steps in a row by hitting return. What do you notice about the function and the loop?

Switch from doing step to doing next. Now what do you notice about the loop and the function?

Set another breakpoint: break 17 and do cont. What happens?

Rerun the program after you set a breakpoint on the line with the sqrt call. When the execution stops there do where and list.

  • If you set many breakpoints, you can find out what they are with info breakpoints.
  • You can remove breakpoints with delete n where n is the number of the breakpoint.
  • If you restart your program with run without leaving gdb, the breakpoints stay in effect.
  • If you leave gdb, the breakpoints are cleared but you can save them: save breakpoints <file>. Use source <file> to read them in on the next gdb run.

49.6 Inspecting values

Run the previous program again in gdb: set a breakpoint at the line that does the sqrt call before you actually call run. When the program gets to line 8 you can do print n. Do cont. Where does the program stop?

If you want to repair a variable, you can do set var <name>=<value>. Change the variable n and confirm that the square root of the new value is computed. Which commands do you do?

If a problem occurs in a loop, it can be tedious to keep typing cont and inspecting the variable with print. Instead you can add a condition to an existing breakpoint:

condition 1 n<0

or set the condition when you define the breakpoint:

break 8 if (n<0)

Another possibility is to use ignore 1 50, which will not stop at breakpoint 1 the next 50 times.

Remove the existing breakpoint, redefine it with the condition n<0 and rerun your program. When the program breaks, find for what value of the loop variable it happened. What is the sequence of commands you use?

49.7 Parallel debugging

Debugging parallel programs is harder than debugging sequential programs, because every sequential bug may show up, plus a number of new types, caused by the interaction of the various processes.

Here are a few possible parallel bugs:

  • Processes can deadlock because they are waiting for a message that never comes. This typically happens with blocking send/receive calls due to an error in program logic.
  • If an incoming message is unexpectedly larger than anticipated, a memory error can occur.
  • A collective call will hang if somehow one of the processes does not call the routine.

There are few low-budget solutions to parallel debugging. The main one is to create an xterm for each process. We will describe this next. There are also commercial packages such as DDT and TotalView that offer a GUI. They are very convenient but also expensive. The Eclipse project has a parallel package, Eclipse PTP, that includes a graphic debugger.

49.7.1 MPI debugging with gdb

You cannot run parallel programs in gdb, but you can start multiple gdb processes that behave just like MPI processes! The command

mpirun -np <NP> xterm -e gdb ./program

creates a number of xterm windows, each of which executes the commandline gdb ./program. And because these xterms have been started with mpirun, they actually form a communicator.

\begin{pcse}

49.7.2 Full-screen parallel debugging with DDT

In this tutorial you will run and diagnose a few incorrect MPI programs using DDT. You can start a session with ddt yourprogram &, or use File > New Session > Run to specify a program name, and possibly parameters. In both cases you get a dialog where you can specify program parameters. It is also important to check the following:

  • You can specify the number of cores here;
  • It is usually a good idea to turn on memory checking;
  • Make sure you specify the right MPI.

When DDT opens on your main program, it halts at the MPI_Init statement, and you need to press the forward arrow, top left of the main window.

Problem1
This program has every process independently generate random numbers, and if the number meets a certain condition, stops execution. There is no problem with this code as such, so let's suppose you simply want to monitor its execution.

  • Compile abort.c. Don't forget about the -g -O0 flags; if you use the makefile they are included automatically.
  • Run the program with DDT; you'll see that it concludes successfully.
  • Set a breakpoint at the Finalize statement in the subroutine, by clicking to the left of the line number. Now if you run the program you'll get a message that all processes are stopped at a breakpoint. Pause the execution.
  • The 'Stacks' tab will tell you that all processes are at the same point in the code, but they are not in fact in the same iteration.
  • You can for instance use the 'Input/Output' tabs to see what every process has been doing.
  • Alternatively, use the variables pane on the right to examine the it variable. You can do that for individual processes, but you can also control-click on the it variable and choose View as Array. Set up the display as a one-dimensional array and check the iteration numbers.
  • Activate the barrier statement and rerun the code. Make sure you have no breakpoints. Reason that the code will not complete, but just hang.
  • Hit the general Pause button. Now what difference do you see in the 'Stacks' tab?

Problem2
Compile problem1.c and run it in DDT. You'll get a dialog warning about an error condition.

  • Pause the program in the dialog. Notice that only the root process is paused. If you want to inspect other processes, press the general pause button. Do this.
  • In the bottom panel click on Stacks. This gives you the 'call stack', which tells you what the processes were doing when you paused them. Where is the root process in the execution? Where are the others?
  • From the call stack it is clear what the error was. Fix it and rerun with File > Restart Session.

\end{pcse}

49.8 Further reading

A good tutorial: http://www.dirac.org/linux/gdb/

Reference manual: http://www.ofb.net/gnu/gdb/gdb_toc.html
