Getting started with OpenMP

17.1 : The OpenMP model
17.1.1 : Target hardware
17.1.2 : Target software
17.1.3 : About threads and cores
17.2 : Compiling and running an OpenMP program
17.2.1 : Compiling
17.2.2 : Running an OpenMP program
17.3 : Your first OpenMP program
17.3.1 : Directives
17.3.2 : Parallel regions
17.3.3 : Code and execution structure
17.4 : Thread data
17.5 : Creating parallelism

17 Getting started with OpenMP

This chapter explains the basic concepts of OpenMP, and helps you get started on running your first OpenMP program.

17.1 The OpenMP model

We start by establishing a mental picture of the hardware and software that OpenMP targets.

17.1.1 Target hardware

Modern computers have a multi-layered design. Maybe you have access to a cluster, and maybe you have learned how to use MPI to communicate between cluster nodes. OpenMP, the topic of this chapter, is concerned with a single cluster node or motherboard, and with getting the most out of the parallelism available there.

FIGURE 17.1: A node with two sockets and a co-processor

Figure  17.1 pictures a typical design of a node: within one enclosure you find two sockets, that is, single processor chips, plus an accelerator. (The picture is of a node of the TACC Stampede cluster, no longer in service, with two sockets and an Intel Xeon PHI co-processor.)

Your personal laptop or desktop computer will probably have one socket; most supercomputers have nodes with two or four sockets. In either case there can be a GPU as co-processor; supercomputer clusters can also have other types of accelerators. OpenMP versions as of OpenMP-4.0 can target such offloadable devices.

FIGURE 17.2: Structure of an Intel Sandybridge eight-core socket

To see where OpenMP operates we need to dig into the sockets. Figure  17.2 shows a picture of an Intel Sandybridge socket. You recognize a structure with eight cores: independent processing units that all have access to the same memory. (In figure  17.1 you saw four memory chips, or DIMMs, attached to each of the two sockets; all of the sixteen cores have access to all that memory.) OpenMP makes it easy to exploit all these cores in the same program. The OpenMP-4.0 standard also added the possibility to offload computations to the GPU or other accelerator.

To summarize: OpenMP targets the architecture of a single node, with one or more sockets, each containing a number of cores that all share the same memory, and possibly an attached accelerator.

17.1.2 Target software

OpenMP is based on two concepts: the use of threads and the fork/join model of parallelism. For now you can think of a thread as a sort of process: the computer executes a sequence of instructions. The fork/join model says that a thread can split itself (`fork') into a number of threads that are identical copies. At some point these copies go away and the original thread is left (`join'), but while the team of threads created by the fork exists, you have parallelism available to you. The part of the execution between fork and join is known as a parallel region.

Figure  17.3 gives a simple picture of this: a thread forks into a team of threads, and these threads themselves can fork again.

FIGURE 17.3: Thread creation and deletion during parallel execution

The threads that are forked are all copies of the master thread: they have access to all the data computed so far; this is their shared data. Of course, if the threads were completely identical the parallelism would be pointless, so they also have private data, and they can identify themselves: they know their thread number. This allows you to do meaningful parallel computations with threads.

This brings us to the third important concept: that of work sharing constructs. In a team of threads, initially there will be replicated execution; a work sharing construct divides available work over the threads.

So there you have it: OpenMP uses teams of threads, and inside a parallel region the work is distributed over the threads with a work sharing construct. Threads can access shared data, and they have some private data.

An important difference between OpenMP and MPI is that parallelism in OpenMP is dynamically activated by a thread spawning a team of threads. Furthermore, the number of threads used can differ between parallel regions, and threads can create threads recursively. This is known as dynamic mode. By contrast, in an MPI program the number of running processes is (mostly) constant throughout the run, and determined by factors external to the program.

17.1.3 About threads and cores

OpenMP programming is typically done to take advantage of multicore processors. Thus, to get a good speedup you would typically let your number of threads be equal to the number of cores. However, there is nothing to prevent you from creating more threads if that serves the natural expression of your algorithm: the operating system will use time slicing to let them all be executed. You just don't get a speedup beyond the number of actually available cores.

On some modern processors there are hardware threads, meaning that a core can actually execute more than one thread, with some speedup over a single thread. To use such a processor efficiently you would let the number of OpenMP threads be 2 or 4 times the number of cores, depending on the hardware.
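
As an illustration, the thread count can also be set from inside the program. The following is a minimal sketch, assuming two hardware threads per core; note that on some systems omp_get_num_procs already counts hardware threads rather than physical cores, in which case the factor should be dropped.

#include <stdio.h>
#include "omp.h"

int main() {
  // assumption: two hardware threads per core; adjust for your processor
  int nprocs = omp_get_num_procs();
  omp_set_num_threads(2 * nprocs);
#pragma omp parallel
#pragma omp master
  printf("Running %2d threads on %2d processing units\n",
         omp_get_num_threads(), omp_get_num_procs());
  return 0;
}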

17.2 Compiling and running an OpenMP program

17.2.1 Compiling

A C program needs to contain:

#include "omp.h"
while a Fortran program needs to contain:
use omp_lib
or
#include "omp_lib.h"

OpenMP is handled by extensions to your regular compiler, typically by adding an option to your commandline:

# gcc
gcc -o foo foo.c -fopenmp
# Intel compiler
icc -o foo foo.c -qopenmp
If you have separate compile and link stages, you need that option in both.

When you use the above compiler options, the cpp macro _OPENMP will be defined. Thus, you can have conditional compilation by writing

#ifdef _OPENMP
   ...
#else
   ...
#endif
The value of this macro is a decimal value yyyymm denoting the OpenMP standard release that this compiler supports; see section  28.7  .
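
For instance, here is a minimal sketch of a program that reports the supported standard release when compiled with OpenMP enabled, and notes its absence otherwise:

#include <stdio.h>

int main() {
#ifdef _OPENMP
  printf("OpenMP enabled, standard release %d\n", _OPENMP);
#else
  printf("Compiled without OpenMP support\n");
#endif
  return 0;
}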

Fortran note The parameter openmp_version contains the version in yyyymm format.

!! version.F90
  integer :: standard

standard = openmp_version

End of Fortran note

17.2.2 Running an OpenMP program

You run an OpenMP program by invoking it in the regular way (for instance ./a.out), but its behavior is influenced by some OpenMP environment variables. The most important one is OMP_NUM_THREADS:

export OMP_NUM_THREADS=8
which sets the number of threads that a program will use. You would typically set this equal to the number of cores in your hardware, and hope for approximately linear speedup.

See section  28.1 for a list of all environment variables.

17.3 Your first OpenMP program

In this section you will see just enough of OpenMP to write a first program and to explore its behavior. For this we need to introduce a couple of OpenMP language constructs. They will all be discussed in much greater detail in later chapters.

17.3.1 Directives

OpenMP is not magic, so you have to tell it when something can be done in parallel. This is mostly done through directives; additional specifications can be made through library calls.

In C/C++ the pragma mechanism is used: annotations for the benefit of the compiler that are otherwise not part of the language. This looks like:

#pragma omp somedirective clause(value,othervalue)
  statement;

#pragma omp somedirective clause(value,othervalue)
 {
  statement 1;
  statement 2;
 }
where the directive applies to the single statement or the block that follows it. Directives in C/C++ are case-sensitive. They can be broken over multiple lines by escaping the line end.

Fortran note The sentinel in Fortran looks like a comment:

!$omp directive clause(value)
  statements
!$omp end directive
The difference with the C directive is that Fortran does not have code blocks, so there is an explicit end-of-directive line.

If you break a directive over more than one line, all but the last line need to have a continuation character, and each line needs to have the sentinel:

!$omp parallel &
!$omp     num_threads(7)
  tp = omp_get_thread_num()
!$omp end parallel

The directives are case-insensitive. In Fortran fixed-form source files (which is the only possibility in Fortran77), c$omp and *$omp are allowed too. End of Fortran note

17.3.2 Parallel regions

The simplest way to create parallelism in OpenMP is to use the parallel pragma. A block preceded by the parallel pragma is called a parallel region; it is executed by a newly created team of threads. This is an instance of the SPMD model: all threads execute (redundantly) the same segment of code.

#pragma omp parallel
{
  // this is executed by a team of threads
}
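
Put together as a complete program, a minimal sketch (essentially what the exercise below asks you to write) looks like:

#include <stdio.h>

int main() {
#pragma omp parallel
  {
    // executed once by every thread in the team
    printf("hello world\n");
  }
  return 0;
}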

Exercise Write a `hello world' program, where the print statement is in a parallel region. Compile and run.

Run your program with different values of the environment variable OMP_NUM_THREADS  . If you know how many cores your machine has, can you set the value higher?
End of exercise

Let's start exploring how OpenMP handles parallelism, using the following functions: omp_get_num_procs, which reports the number of available processing units; omp_get_num_threads, which reports the number of threads in the current team; and omp_get_thread_num, by which a thread can query its own number within the team.

Exercise Take the hello world program of exercise  17.3.2 and insert the above functions, before, in, and after the parallel region. What are your observations?
End of exercise

Exercise Extend the program from exercise  17.3.2  . Make a complete program based on these lines:

[Code snippet reductthreads from code/omp/c, with its output sumthread, not shown here.]

Compile and run again. (In fact, run your program a number of times.) Do you see something unexpected? Can you think of an explanation?
End of exercise

If the above puzzles you, read about race conditions in Eijkhout:IntroHPC.
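
For reference, here is a minimal sketch (not the book's snippet) that exhibits such a race condition: all threads increment the same shared counter without any protection, so the final count can come out lower than the number of threads.

#include <stdio.h>
#include "omp.h"

int main() {
  int count = 0;
#pragma omp parallel
  count = count + 1;   // unprotected update of a shared variable
  printf("counted %d threads out of %d\n",
         count, omp_get_max_threads());
  return 0;
}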

17.3.3 Code and execution structure

Here are a couple of important concepts:

17.4 Thread data

In most programming languages, visibility of data is governed by rules on the scope of variables: a variable is declared in a block, and it is then visible to any statement in that block and in blocks lexically contained in it, but not in surrounding blocks:

int main() {
  // no variable `x' defined here
  {
    int x = 5;
    if (somecondition) { x = 6; }
    printf("x=%d\n",x); // prints 5 or 6
  }
  printf("x=%d\n",x); // compilation error: `x' undefined
}
Fortran has simpler rules, since it does not have blocks inside blocks.

OpenMP has similar rules concerning data in parallel regions and other OpenMP constructs. First of all, data is visible in enclosed scopes:

int main() {
  int x;
#pragma omp parallel
  {
     // you can use and set `x' here
  }
  printf("x=%d\n",x); // value depends on what
                      // happened in the parallel region
}

In C, you can redeclare a variable inside a nested scope:

{
  int x;
  if (something) {
    double x; // same name, different entity
  }
  x = ... // this refers to the integer again
}
Doing so makes the outer variable inaccessible.

OpenMP has a similar mechanism:

{
  int x;
#pragma omp parallel
  {
    double x;
  }
}
There is an important difference: each thread in the team gets its own instance of the enclosed variable.

FIGURE 17.4: Locality of variables in threads

This is illustrated in figure  17.4  .

In addition to such scoped variables, which live on a stack, there are variables on the heap, typically created by a call to malloc (in C) or new (in C++). Rules for them are more complicated.
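
For instance, in the following minimal sketch (assuming fewer than 100 threads) the array is allocated on the heap before the parallel region; every thread can reach that same memory, regardless of how the pointer variable itself is treated.

#include <stdlib.h>
#include "omp.h"

int main() {
  // heap allocation before the parallel region
  double *data = (double*) malloc(100 * sizeof(double));
#pragma omp parallel
  { // every thread writes into the same heap array
    data[omp_get_thread_num()] = 1.;
  }
  free(data);
  return 0;
}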

Summarizing the above, there are shared variables, where all threads in a team access the same instance, and private variables, where each thread has its own instance.

In addition to using scoping, OpenMP also uses options on the directives to control whether data is private or shared.
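
As a preview, here is a minimal sketch of such clauses, with illustrative variable names; the private and shared clauses are treated in detail in later chapters.

#include <stdio.h>
#include "omp.h"

int main() {
  int x = 17, team_size = 0;
#pragma omp parallel private(x) shared(team_size)
  {
    x = omp_get_thread_num();           // each thread sets its own copy of x
#pragma omp master
    team_size = omp_get_num_threads();  // all threads see the same team_size
  }
  printf("x is still %d; the team had %d threads\n", x, team_size);
  return 0;
}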

Many of the difficulties of parallel programming with OpenMP stem from the use of shared variables. For instance, if two threads update a shared variable, there is no guarantee on the order of the updates.

We will discuss all this in detail in section  OpenMP topic: Work sharing  .

17.5 Creating parallelism

The fork/join model of OpenMP means that you need some way of indicating where an activity can be forked for independent execution. There are two ways of doing this:

  1. You can declare a parallel region and split one thread into a whole team of threads. We will discuss this next in chapter  OpenMP topic: Parallel regions  . The division of the work over the threads is controlled by work sharing constructs; see chapter  OpenMP topic: Work sharing  .
  2. Alternatively, you can use tasks and specify one parallel activity at a time. You will see this in section  OpenMP topic: Tasks  .

Note that OpenMP only indicates how much parallelism is present; whether independent activities are in fact executed in parallel is a runtime decision.

Declaring a parallel region tells OpenMP that a team of threads can be created. The actual size of the team depends on various factors (see section  28.1 for variables and functions mentioned in this section).

To ask how much parallelism is actually used in your parallel region, use omp_get_num_threads. To query how many processing units the hardware offers, use omp_get_num_procs. You can query the maximum number of threads with omp_get_max_threads; this equals the value of OMP_NUM_THREADS, not the number of threads actually active in a parallel region.

// proccount.c
void nested_report() {
#pragma omp parallel
#pragma omp master
  printf("Nested    : %2d cores and %2d threads out of max %2d\n",
         omp_get_num_procs(),
         omp_get_num_threads(),
         omp_get_max_threads());
}
  // in the main program:
  int env_num_threads;
#pragma omp parallel
#pragma omp master
  {
    env_num_threads = omp_get_num_threads();
    printf("Parallel  : %2d cores and %2d threads out of max %2d\n",
           omp_get_num_procs(),
           omp_get_num_threads(),
           omp_get_max_threads());
  }

#pragma omp parallel \
  num_threads(2*env_num_threads)
#pragma omp master
  {
    printf("Double    : %2d cores and %2d threads out of max %2d\n",
           omp_get_num_procs(),
           omp_get_num_threads(),
           omp_get_max_threads());
  }

#pragma omp parallel
#pragma omp master
  nested_report();

[c:48] for t in 1 2 4 8 16 ; do OMP_NUM_THREADS=$t ./proccount ; done
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  1
Parallel  : count  4 cores and  1 threads out of max  1
Parallel  : count  4 cores and  1 threads out of max  1
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  2
Parallel  : count  4 cores and  2 threads out of max  2
Parallel  : count  4 cores and  1 threads out of max  2
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  4
Parallel  : count  4 cores and  4 threads out of max  4
Parallel  : count  4 cores and  1 threads out of max  4
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  8
Parallel  : count  4 cores and  8 threads out of max  8
Parallel  : count  4 cores and  1 threads out of max  8
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max 16
Parallel  : count  4 cores and 16 threads out of max 16
Parallel  : count  4 cores and  1 threads out of max 16

Another limit on the number of threads is imposed when you use nested parallel regions. This can arise if you have a parallel region in a subprogram which is sometimes called sequentially, sometimes in parallel. For details, see section  18.2  .
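
As a minimal sketch of that situation, consider a subprogram containing a parallel region, called once sequentially and once from inside another parallel region; whether the inner region actually gets more than one thread depends on the settings discussed in that section.

#include <stdio.h>
#include "omp.h"

void work() {
#pragma omp parallel num_threads(2)
#pragma omp master
  printf("inner region: %2d threads at nesting level %d\n",
         omp_get_num_threads(), omp_get_level());
}

int main() {
  work();              // called sequentially
#pragma omp parallel   // called from inside a parallel region
  work();
  return 0;
}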
