28.1 : Runtime functions, environment variables, internal control variables
28.2 : Timing
28.3 : Thread safety
28.4 : Performance and tuning
28.5 : Accelerators
28.6 : Tools interface
28.7 : OpenMP standards
28.8 : Memory model
28.8.1 : Dekker's algorithm

28 OpenMP remaining topics

28.1 Runtime functions, environment variables, internal control variables

OpenMP has a number of settings that can be set through environment variables, and both queried and set through library routines. These settings are called Internal Control Variables (ICVs): an OpenMP implementation behaves as if there is an internal variable storing this setting.

The runtime functions are:

Counting threads and cores: omp_set_num_threads, omp_get_num_threads, omp_get_max_threads, omp_get_num_procs; see section 17.5.
Querying the current thread: omp_get_thread_num , omp_in_parallel
omp_set_dynamic
omp_get_dynamic
omp_set_nested
omp_get_nested
omp_get_wtime
omp_get_wtick
omp_set_schedule
omp_get_schedule
omp_set_max_active_levels
omp_get_max_active_levels
omp_get_thread_limit
omp_get_level
omp_get_active_level
omp_get_ancestor_thread_num
omp_get_team_size
omp_

Here are the OpenMP environment variables :

OMP_CANCELLATION Set whether cancellation is activated; see section 18.3.
OMP_DISPLAY_ENV Show OpenMP version (section 28.7) and environment variables.
OMP_DEFAULT_DEVICE Set the device used in target regions
OMP_DYNAMIC Dynamic adjustment of threads
OMP_MAX_ACTIVE_LEVELS Set the maximum number of nested parallel regions; section 18.2.
OMP_MAX_TASK_PRIORITY Set the maximum task priority value; section 24.6.2.
OMP_NESTED Nested parallel regions
OMP_NUM_THREADS Specifies the number of threads to use
OMP_PROC_BIND Whether threads may be moved between CPUs; section 25.1.
OMP_PLACES Specifies on which CPUs the threads should be placed; section 25.1.
OMP_STACKSIZE Set default thread stack size; section 22.2.
OMP_SCHEDULE How loop iterations are scheduled over threads; section 19.3.
OMP_THREAD_LIMIT Set the maximum number of threads; see section 27.3.
OMP_WAIT_POLICY How waiting threads are handled; ICV wait-policy-var. Values: ACTIVE for keeping threads spinning, PASSIVE for possibly yielding the processor when threads are waiting. There is no runtime function for setting this.

There are 4 ICVs that behave as if each thread has its own copy of them. The default is implementation-defined unless otherwise noted.

It may be possible to adjust dynamically the number of threads for a parallel region. Variable: OMP_DYNAMIC; routines: omp_set_dynamic, omp_get_dynamic.
If a code contains nested parallel regions, the inner regions may create new teams, or they may be executed by the single thread that encounters them. Variable: OMP_NESTED; routines: omp_set_nested, omp_get_nested. Allowed values are TRUE and FALSE; the default is false.
The number of threads used for an encountered parallel region can be controlled. Variable: OMP_NUM_THREADS; routines: omp_set_num_threads, omp_get_max_threads.
The schedule for a parallel loop can be set. Variable: OMP_SCHEDULE; routines: omp_set_schedule, omp_get_schedule.

Nonobvious syntax:

export OMP_SCHEDULE="static,100"

Other settings:

omp_get_num_threads : query the number of threads in the current team; this can be lower than what was set with omp_set_num_threads. For a meaningful answer, this should be done in a parallel region: outside of one the result is 1.
omp_get_thread_num : query the number of the calling thread within its team.
omp_in_parallel : test if you are in a parallel region.
omp_get_num_procs : query the number of processors available to the program; depending on the system this may count hardware threads rather than physical cores.

Other environment variables:

OMP_STACKSIZE controls the amount of space that is allocated as per-thread stack, that is, the space for private variables; see section 22.2.
OMP_WAIT_POLICY determines the behavior of threads that wait, for instance at a critical section:
- ACTIVE puts the thread in a spin-lock, where it actively checks whether it can continue;
- PASSIVE puts the thread to sleep until the OS wakes it up.
The 'active' strategy uses CPU while the thread is waiting; on the other hand, activating it after the wait is nearly instantaneous. With the 'passive' strategy, the thread does not use any CPU while waiting, but activating it again is expensive. Thus, the passive strategy only makes sense if threads will be waiting for a (relatively) long time.
OMP_PROC_BIND with values TRUE and FALSE can bind threads to a processor. On the one hand, doing so can minimize data movement; on the other hand, it may increase load imbalance.

28.2 Timing

OpenMP has a wall clock timer routine omp_get_wtime:

double omp_get_wtime(void);

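The starting point is arbitrary and is different for each program run, so only the difference between two calls is meaningful. In practice the starting point is usually the same for all threads in one run, but the standard describes the result as a per-thread time and does not require it to be consistent across threads. This timer has a resolution given by omp_get_wtick.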

Exercise Use the timing routines to demonstrate speedup from using multiple threads.

Write a code segment that takes a measurable amount of time, that is, it should take a multiple of the tick time.

Write a parallel loop and measure the speedup. You can for instance do this:

for (int use_threads=1; use_threads<=nthreads; use_threads++) {
    double tstart = omp_get_wtime();
#pragma omp parallel for num_threads(use_threads)
    for (int i=0; i<nthreads; i++) {
        .....
    }
    double tend = omp_get_wtime();
    if (use_threads==1)
      time1 = tend-tstart;
    else // compute speedup
      speedup = time1/(tend-tstart);
}

In order to prevent the compiler from optimizing your loop away, let the body compute a result and use a reduction to preserve these results.

End of exercise

28.3 Thread safety

With OpenMP it is relatively easy to take existing code and make it parallel by introducing parallel regions. If you're careful to declare the appropriate variables shared and private, this may work fine. However, your code may include calls to library routines that contain a race condition; such code is said not to be thread-safe.

For example, a routine

static int isave;
int next_one() {
 int i = isave;
 isave += 1;
 return i;
}

 ...
for ( .... ) {
  int ivalue = next_one();
}

has a clear race condition: two iterations that execute next_one at the same time can read the same value of isave, so the iterations of the loop may get different next_one values, as they are supposed to, or not. This can be solved by using a critical pragma for the next_one call; another solution is to use a threadprivate declaration for isave. The latter is for instance the right solution if the next_one routine implements a random number generator.
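
The first solution can be sketched as follows. This is one possible arrangement, not necessarily the one the text has in mind: the critical construct is placed inside next_one rather than around the call, the name next_one_lock is arbitrary, and the checking routine all_values_distinct is a hypothetical addition for illustration.

```c
#include <stdlib.h>

static int isave = 0;

/* Thread-safe variant: the read and the update of isave are done
   inside one critical section, so no two callers see the same value. */
int next_one(void) {
  int i;
#pragma omp critical(next_one_lock)
  {
    i = isave;
    isave += 1;
  }
  return i;
}

/* Hypothetical check: call next_one from n parallel iterations and
   verify that every value 0..n-1 was handed out exactly once. */
int all_values_distinct(int n) {
  int *seen = calloc((size_t)n, sizeof(int));
  if (seen == NULL) return 0;
  isave = 0;
#pragma omp parallel for
  for (int k = 0; k < n; k++) {
    int ivalue = next_one();
    seen[ivalue] = 1; /* values are unique, so slots do not collide */
  }
  int distinct = 0;
  for (int k = 0; k < n; k++)
    distinct += seen[k];
  free(seen);
  return distinct == n;
}
```

With the critical section removed, the check can fail intermittently, which is characteristic of race conditions.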

28.4 Performance and tuning

[epcc-ompbench] .

The performance of an OpenMP code can be influenced by the following.

[Amdahl effects] Your code needs to have enough parts that are parallel (see Eijkhout:IntroHPC). Sequential parts may be sped up by having them executed redundantly on each thread, since that keeps data local.
[Dynamism] Creating a thread team takes time. In practice, a team is not created and deleted for each parallel region, but creating teams of different sizes, or recursive thread creation, may introduce overhead.
[Load imbalance] Even if your program is parallel, you need to worry about load balance. In the case of a parallel loop you can set the schedule clause to dynamic, which evens out the work, but may cause increased communication.
[Communication] Cache coherence causes communication. Threads should, as much as possible, refer to their own data.
- Threads are likely to read from each other's data. That is largely unavoidable.
- Threads writing to each other's data should be avoided: it may require synchronization, and it causes coherence traffic.
- If threads can migrate, data that was local at one time is no longer local after migration.
- Reading data from one socket that was allocated on another socket is inefficient; see section 25.2 .
[Affinity] Both data and execution threads can be bound to a specific locale to some extent. Using local data is more efficient than remote data, so you want to use local data, and minimize the extent to which data or execution can move.
- See the above points about phenomena that cause communication.
- Section 25.1.1 describes how you can specify the binding of threads to places. This can, but need not, have an effect on affinity. For instance, if an OpenMP thread migrates between hardware threads on the same core, cached data will stay local. Leaving an OpenMP thread completely free to migrate can be advantageous for load balancing, but you should only do that if data affinity is of lesser importance.
- Static loop schedules have a higher chance of using data that has affinity with the place of execution, but they are worse for load balancing. On the other hand, the nowait clause can alleviate some of the problems with static loop schedules.
[Binding] You can choose to put OpenMP threads close together or to spread them apart. Having them close together makes sense if they use lots of shared data. Spreading them apart may increase bandwidth. (See the examples in section 25.1.2 .)
[Synchronization] Barriers are a form of synchronization. They are expensive by themselves, and they expose load imbalance. Implicit barriers happen at the end of worksharing constructs; they can be removed with nowait .

Critical sections imply a loss of parallelism, but they are also slow as they are realized through operating system functions. These are often quite costly, taking many thousands of cycles. Critical sections should be used only if the parallel work far outweighs their cost.
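
To illustrate the point about critical sections, here is a sketch of the same summation written twice; the function names and the critical section name sum_update are hypothetical. In the first version every iteration enters a critical section, so the updates are serialized. In the second version a reduction lets each thread accumulate privately, and the partial results are combined once at the end.

```c
/* Correct but slow: each iteration waits its turn to update sum. */
long sum_with_critical(int n) {
  long sum = 0;
#pragma omp parallel for
  for (int i = 0; i < n; i++) {
#pragma omp critical(sum_update)
    sum += i;
  }
  return sum;
}

/* Same result: threads accumulate into private copies of sum. */
long sum_with_reduction(int n) {
  long sum = 0;
#pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < n; i++)
    sum += i;
  return sum;
}
```

Both versions give the same answer; how much faster the second one is depends on the system and is best measured with the timer of section 28.2.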

28.5 Accelerators

In OpenMP there is support for offloading work to an accelerator or co-processor:

#pragma omp target [clauses]

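Data movement is controlled through the map clause and through related directives such as

target data : place data on the device for the duration of a region
target update : make data consistent between host and device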
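
As a minimal sketch of offloading (the function scale_on_device is hypothetical, and the combined construct used here is only one of several ways to write this), the following loop is executed on the default device, with a map clause that copies the array to the device before the loop and back afterwards. If no accelerator is available, implementations commonly execute the region on the host instead.

```c
/* Hypothetical helper: multiply every element of x by factor.
   map(tofrom: x[0:n]) copies n elements to the device on entry
   and back to the host on exit. */
void scale_on_device(double *x, int n, double factor) {
#pragma omp target teams distribute parallel for map(tofrom : x[0 : n])
  for (int i = 0; i < n; i++)
    x[i] *= factor;
}
```

For data that is used by several target regions in a row, a surrounding target data region avoids repeating the transfers.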

28.6 Tools interface

The OpenMP standard defines a tools interface. This means that routines can be defined that get called by the OpenMP runtime. For instance, the following example defines callbacks that are invoked when OpenMP is initialized and finalized, thereby giving the runtime of the application.

int ompt_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                    ompt_data_t *tool_data) {
  printf("libomp init time: %f\n",
         omp_get_wtime() - *(double *)(tool_data->ptr));
  *(double *)(tool_data->ptr) = omp_get_wtime();
  return 1; // success: activates tool
}

void ompt_finalize(ompt_data_t *tool_data) {
  printf("application runtime: %f\n",
         omp_get_wtime() - *(double *)(tool_data->ptr));
}

ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
  static double time = 0; // static definition needs constant assignment
  time = omp_get_wtime();
  static ompt_start_tool_result_t ompt_start_tool_result = {
      &ompt_initialize, &ompt_finalize, {.ptr = &time}};
  return &ompt_start_tool_result; // success: registers tool
}

(Example courtesy of https://git.rwth-aachen.de/OpenMPTools/OMPT-Examples .)

28.7 OpenMP standards

Here is the correspondence between the values of the _OPENMP macro and the versions of the OpenMP standard:

201511 OpenMP-4.5,
201611 Technical report 4: information about OpenMP-5.0 but not yet mandated,
201811 OpenMP-5.0,
202011 OpenMP-5.1,
202111 OpenMP-5.2.

// version.c
int standard = _OPENMP;
printf("Supported OpenMP standard: %d\n",standard);
switch (standard) {
case  201511: printf("4.5\n");
  break;
case 201611: printf("Technical report 4: information about 5.0 but not yet mandated.\n");
  break;
case 201811: printf("5.0\n");
  break;
case 202011:
  printf("5.1\n");
  break;
case 202111: printf("5.2\n");
  break;
default:
  printf("Unrecognized version\n");
  break;
}

The openmp.org website maintains a record of which compilers support which standards: https://www.openmp.org/resources/openmp-compilers-tools/ .

28.8 Memory model

28.8.1 Dekker's algorithm

A standard illustration of the weak memory model is Dekker's algorithm. We model that in OpenMP as follows:

// weak1.c
int a=0,b=0,r1,r2;
#pragma omp parallel sections shared(a, b, r1, r2)
{
#pragma omp section
  {
	a = 1;
	r1 = b;
	tasks++;
  }
#pragma omp section
  {
	b = 1;
	r2 = a;
	tasks++;
  }
}

Under any reasonable interpretation of parallel execution, the possible values for r1,r2 are $1,1$, $0,1$, or $1,0$. This is known as sequential consistency: the parallel outcome is consistent with a sequential execution that interleaves the parallel computations, respecting their local statement orderings. (See also Eijkhout:IntroHPC.)

However, running this, we get a small number of cases where $r_1=r_2=0$. There are two possible explanations:

The compiler is allowed to interchange the first and second statements, since there is no dependence between them; or
The thread is allowed to have a local copy of the variable that is not coherent with the value in memory.

We fix this by flushing both a,b:

// weak2.c
int a=0,b=0,r1,r2;
#pragma omp parallel sections shared(a, b, r1, r2)
{
#pragma omp section
  {
	a = 1;
#pragma omp flush (a,b)
	r1 = b;
	tasks++;
  }
#pragma omp section
  {
	b = 1;
#pragma omp flush (a,b)
	r2 = a;
	tasks++;
  }
}
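
As an alternative to explicit flushes, the same experiment can be written with sequentially consistent atomic operations. This is a swapped-in technique, not the approach of the examples above; the function count_both_zero is hypothetical, and the sketch assumes a compiler that supports the seq_cst clause on atomic constructs. It repeats the two-section experiment a number of times and counts how often both reads return zero, which sequential consistency rules out.

```c
/* Hypothetical harness: run the two-section experiment `trials` times
   and count the outcomes with r1 == 0 and r2 == 0. */
int count_both_zero(int trials) {
  int both_zero = 0;
  for (int t = 0; t < trials; t++) {
    int a = 0, b = 0, r1 = -1, r2 = -1;
#pragma omp parallel sections shared(a, b, r1, r2)
    {
#pragma omp section
      {
#pragma omp atomic write seq_cst
        a = 1;
#pragma omp atomic read seq_cst
        r1 = b;
      }
#pragma omp section
      {
#pragma omp atomic write seq_cst
        b = 1;
#pragma omp atomic read seq_cst
        r2 = a;
      }
    }
    /* the implicit barrier above makes r1, r2 safe to read here */
    if (r1 == 0 && r2 == 0)
      both_zero++;
  }
  return both_zero;
}
```

Removing the seq_cst atomics gives back the original experiment, in which a small number of such outcomes may be observed.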

Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse