Debugging

Debugging distributed applications is always a complex task. Here is a set of tips to help you along the way.

Single Process Execution

Debugging OmpCluster programs can be tricky because they run on multiple MPI processes, so it is usually a good idea to start by getting the application to run correctly on a single node without using mpirun.

Runtime Information

You can enable the debug messages of the libomptarget runtime by setting the environment variable LIBOMPTARGET_INFO=-1. Re-run the application and you should see many more messages in the execution log, including additional error messages.

It is recommended to build your application with debugging information enabled, so that the information messages include filenames and variable declarations. Use the -g flag in Clang/GCC or configure CMake with -DCMAKE_BUILD_TYPE=Debug or -DCMAKE_BUILD_TYPE=RelWithDebInfo.

More information on how to use this environment variable is available [here][libomptarget-info].
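As a minimal sketch of the workflow described in this section, the helper below runs any command with LIBOMPTARGET_INFO=-1 and keeps a copy of the output in a log file for later inspection. The function name run_with_info and the log file name are hypothetical; only the environment variable comes from the runtime.

```shell
# Run a command with libomptarget information messages enabled and keep
# a copy of everything it prints (stdout and stderr) in a log file.
# Usage: run_with_info <log-file> <command> [args...]
# Note: because of the pipe, the return status is the one from tee,
# not the one from the command itself.
run_with_info() {
    local log_file="$1"
    shift
    # env sets the variable only for this one command.
    env LIBOMPTARGET_INFO=-1 "$@" 2>&1 | tee "$log_file"
}
```

For example, run_with_info run.log ./program args prints the messages to the terminal and stores them in run.log.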

Advanced Runtime Information

LIBOMPTARGET_DEBUG and OMPCLUSTER_DEBUG can also be used to enable additional logs. This feature is only available if libomptarget was built with -DOMPTARGET_DEBUG. The debugging output provided is intended for use by libomptarget and OMPC developers. More user-friendly output is presented when using LIBOMPTARGET_INFO.

The GNU Debugger (GDB)

GDB is one of the most popular terminal debuggers out there, even though it is very serial-oriented. To make debugging smoother, be sure to enable debug information when compiling your program. Use the -g flag in Clang/GCC or configure CMake with -DCMAKE_BUILD_TYPE=Debug or -DCMAKE_BUILD_TYPE=RelWithDebInfo. When troubleshooting MPI applications, we usually launch one instance of GDB per MPI process. The following sections discuss some tools to help with that.

Here is a non-exhaustive table of useful GDB commands:

| Command | Description |
| --- | --- |
| run | Start your program and run it until it exits or hits a breakpoint. |
| continue, c | Resume your program until it hits the next breakpoint. |
| break file:line | Set a breakpoint at the given line of a file. |
| backtrace, bt | Show the stack trace. |
| step, s | Execute until the next statement, stepping into function calls. |
| next, n | Execute until the next statement, stepping over function calls. |
| finish, fin | Execute until the current function returns. |
| print expression | Evaluate expression and print the result. |
| info break | List all breakpoints. |
| info threads | List all threads. |
| info locals | List all local variables. |

The official GDB documentation covers many more commands and describes them in more detail, so be sure to check it out.
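The commands listed above can also be collected in a command file so that every session starts the same way. The sketch below writes such a file and shows how to pass it to GDB with -x. The function name write_gdb_commands, the file name debug.gdb, and the breakpoint location main.c:42 are hypothetical placeholders.

```shell
# Write a GDB command file that sets a breakpoint, starts the program,
# and prints the stack trace and local variables at the first stop.
# Usage: write_gdb_commands <output-file> <file:line>
write_gdb_commands() {
    local out_file="$1" location="$2"
    cat > "$out_file" <<EOF
break $location
run
backtrace
info locals
EOF
}

# Hypothetical breakpoint location; replace it with one from your code.
write_gdb_commands debug.gdb main.c:42

# Then start GDB with the command file and the program to debug:
#   gdb -x debug.gdb --args ./program args
```

GDB stays open after running the commands in the file, so you can continue the session interactively.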

The LLVM Debugger (LLDB)

The LLVM debugger is a more modern alternative to GDB and, as the name suggests, is part of the LLVM compiler infrastructure. As with GDB, compile your program with debug information enabled, then launch it with lldb -- <program> <args>. The commands differ from GDB's; you can check a tutorial here and the equivalent commands from GDB here.

LLDB commands follow a well-defined structure:

<noun> <verb> [-options [option-value]] [argument [argument...]]

Here is a non-exhaustive list:

| LLDB Command | GDB Equivalent |
| --- | --- |
| run, process launch | run |
| step, thread step-in | step |
| next, thread step-over | next |
| finish, thread step-out | finish |
| breakpoint set --file file --line line | break file:line |
| breakpoint list | info break |
| frame variable | info locals |

Debugging with TMPI

Note: This is the recommended way of debugging. Both TMPI and Tmux are installed in our containers, so you can debug your application anywhere with no setup other than the container image itself.

TMPI (repo) is a bash script that launches multiple MPI processes in a Tmux window and attaches one pane to each process. Combined with GDB, it makes debugging distributed applications considerably easier. By default, TMPI enables pane synchronization, which means that the keys you type in one pane are also sent to the others.

TMPI usage is:

tmpi <nprocs> <command>

Where <nprocs> is the number of processes to be launched and <command> is the command you wish to run. A more concrete example would be:

tmpi 4 gdb --args <program> <args>

Tip: If your command is really long or you need to set variables before execution you can write a bash script and then make TMPI invoke your script instead.
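As an illustration of this tip, the sketch below creates a small wrapper script that sets an environment variable and then starts GDB, so that TMPI only has to launch one short command. The file name debug_wrapper.sh is hypothetical, and ./program args stands in for your own executable and arguments.

```shell
# Create a small wrapper so that TMPI only has to launch one short command.
cat > debug_wrapper.sh <<'EOF'
#!/usr/bin/env bash
# Variables exported here are visible to GDB and to the program it starts.
export LIBOMPTARGET_INFO=-1
exec gdb --args ./program args
EOF
chmod +x debug_wrapper.sh

# Launch four processes, each running the wrapper in its own pane:
#   tmpi 4 ./debug_wrapper.sh
```

Using exec replaces the wrapper's shell with GDB, so each pane holds a single debugger process.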

Tmux has a lot of interesting features, and it is highly recommended that you take some time to learn how to use it properly. In the meantime, check this cheatsheet out for quick reference.

Tip: Here is @leiteg’s Tmux configuration, which has a bunch of shortcuts that make Tmux feel smoother. Feel free to use it as you like. If you are unsure what the options do, ask him on Slack or simply run man tmux. :wink:

Closing panes after execution

TMPI sets the Tmux option remain-on-exit on, which keeps the panes open after the command finishes. To close the Tmux window, you can use Ctrl + B, &, y. If you are exclusively using TMPI with GDB, you do not need remain-on-exit on, since GDB does not exit when the executable finishes. To disable this option, comment out line 128 of the TMPI script so that it reads: #tmux set-window-option -t ${window} remain-on-exit on &> /dev/null.
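The edit described above can also be scripted. The sketch below is one possible way to do it, assuming your copy of the TMPI script still contains the tmux set-window-option line quoted above. It matches the line by content rather than by number, in case the line number differs in your copy. The function name disable_remain_on_exit is hypothetical.

```shell
# Comment out the "remain-on-exit on" line of a TMPI script in place.
# Usage: disable_remain_on_exit <path-to-tmpi-script>
disable_remain_on_exit() {
    local script="$1" tmp
    tmp=$(mktemp) || return 1
    # Insert '#' before the tmux call, keeping the original indentation.
    # Lines that are already commented out do not match and stay unchanged.
    sed 's/^\([[:space:]]*\)\(tmux set-window-option .*remain-on-exit on\)/\1#\2/' \
        "$script" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write the result back into the original file to keep its permissions.
    cat "$tmp" > "$script"
    rm -f "$tmp"
}
```

Call it with the path to your TMPI script, for example disable_remain_on_exit ./tmpi. Running it a second time leaves the file unchanged.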

Unreproducible bugs

Note: This is not thoroughly tested, it may not work properly.

When working with parallel/distributed code, it is very common to run into a bug which does not always occur and consequently is really hard to reproduce and debug. A trick using TMPI and GDB can be used to run a command multiple times and quit GDB automatically if the program succeeds or leave it open otherwise.

How to do it: First, edit your TMPI script and comment out line 128 so that it reads: #tmux set-window-option -t ${window} remain-on-exit on &> /dev/null. This configures your Tmux window to close when all the processes (in this case, GDB) exit. Next, we need to tell GDB itself to quit when everything runs fine. Use the following script, which runs the same program repeatedly in order to catch one faulty execution:

for i in {1..100}; do
    tmpi 2 gdb -quiet -ex='!sleep 1' -ex='set confirm on' -ex=run -ex=quit --args ./program args
done

How it works: The GDB flags do the heavy lifting:

| Flag | What it does |
| --- | --- |
| -quiet | Suppress the GDB startup message. |
| -ex='!sleep 1' | Introduce a small delay in GDB. This is needed because sometimes GDB exits before the TMPI script finishes executing, and then the latter complains “window not found”. |
| -ex='set confirm on' | Confirm before doing dangerous operations. More specifically, confirm before quitting a program in progress. |
| -ex=run | Start running the program immediately. |
| -ex=quit | Quit GDB right after running. If the program exits successfully, no confirmation is needed and GDB quits. If the program received a signal (error), GDB waits for confirmation; just say “no” and you have your debug session. |
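Before opening a debugger, it can be useful to check how often the failure shows up at all. The sketch below is a plain shell loop, not part of TMPI or GDB: it re-runs any command until the first non-zero exit status and reports which run failed. The function name repeat_until_failure is hypothetical.

```shell
# Re-run a command up to <max> times and stop at the first failing run.
# Returns the exit status of the failing run, or 0 if every run succeeds.
# Usage: repeat_until_failure <max> <command> [args...]
repeat_until_failure() {
    local max="$1" i=1 status=0
    shift
    while [ "$i" -le "$max" ]; do
        "$@" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i of $max failed with exit status $status" >&2
            return "$status"
        fi
        i=$((i + 1))
    done
    echo "all $max runs succeeded" >&2
}
```

For example, repeat_until_failure 100 mpirun -np 2 ./program args stops as soon as one run fails, which gives a rough idea of how many iterations the GDB loop above may need.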

Missing RTTI information

Sometimes GDB may not correctly parse the RTTI information embedded in Clang debug binaries. In such cases, you can use LLDB instead to debug an MPI program compiled with Clang.

tmpi 4 lldb -- <program> <args>

Keep in mind that the commands accepted by LLDB may not directly match the ones supported by GDB. For more information, see the LLDB section.

Common Errors

Fatal error

You might get the following error, which is quite common but not very helpful:

Libomptarget fatal error 1: failure of target construct while offloading is mandatory

This error basically means the offloading of the computation failed.

In this case, it is usually helpful to enable the debug messages of the libomptarget runtime using LIBOMPTARGET_INFO=-1. Then re-run the application and you should see many more messages in the execution log, including more informative errors.

Undefined symbol

Target library loading error: /tmp/tmpfile_zSK0IW: undefined symbol: xxx

This means xxx is used in the target region and should be declared as such, using the declare target pragmas.

Segfault error

If you get a segfault, try debugging the program with TMPI and GDB. Adding printf calls can also help.