##### Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse

28.1 : Runtime functions, environment variables, internal control variables
28.2 : Timing
28.3 : Thread safety
28.4 : Performance and tuning
28.5 : Accelerators
28.6 : Tools interface
28.7 : OpenMP standards
28.8 : Memory model
28.8.1 : Dekker's algorithm
28.8.2 : Relaxed memory model

# 28 OpenMP remaining topics

## 28.1 Runtime functions, environment variables, internal control variables

OpenMP has a number of settings that can be set through environment variables, and both queried and set through library routines. These settings are called ICVs (Internal Control Variables): an OpenMP implementation behaves as if there is an internal variable storing each setting.
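For instance, the number-of-threads setting can be given in the environment and then overridden at runtime; a minimal sketch using only standard routines:

```c
#include <stdio.h>
#include <omp.h>

int main() {
  // override any OMP_NUM_THREADS environment setting
  omp_set_num_threads(4);
  // query the resulting ICV value
  printf("max threads: %d\n", omp_get_max_threads());
  return 0;
}
```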

The runtime functions are:

• omp_get_max_threads, omp_get_num_procs; see section 17.5.
• omp_set_dynamic
• omp_get_dynamic
• omp_set_nested
• omp_get_nested
• omp_get_wtime
• omp_get_wtick
• omp_set_schedule
• omp_get_schedule
• omp_set_max_active_levels
• omp_get_max_active_levels
• omp_get_level
• omp_get_active_level
• omp_get_team_size

Here are the OpenMP environment variables:

• OMP_CANCELLATION Set whether cancellation is activated; see section 18.3.
• OMP_DISPLAY_ENV Show OpenMP version (section 28.7) and environment variables.
• OMP_DEFAULT_DEVICE Set the device used in target regions.
• OMP_MAX_ACTIVE_LEVELS Set the maximum number of nested parallel regions; section 18.2.
• OMP_NESTED Allow nested parallel regions.
• OMP_PROC_BIND Whether threads may be moved between CPUs; section 25.1.
• OMP_PLACES Specifies on which CPUs the threads should be placed; section 25.1.
• OMP_STACKSIZE Set default thread stack size; section 22.2.
• OMP_SCHEDULE How threads are scheduled; section 19.3.
• OMP_THREAD_LIMIT Set the maximum number of threads; see section 27.3.
• OMP_WAIT_POLICY How waiting threads are handled; ICV wait-policy-var. Values: ACTIVE for keeping threads spinning, PASSIVE for possibly yielding the processor when threads are waiting.

There are 4 ICVs that behave as if each thread has its own copy of them. The default is implementation-defined unless otherwise noted.

• It may be possible to adjust dynamically the number of threads for a parallel region. Variable: OMP_DYNAMIC; routines: omp_set_dynamic, omp_get_dynamic.
• If a code contains nested parallel regions, the inner regions may create new teams, or they may be executed by the single thread that encounters them. Variable: OMP_NESTED; routines: omp_set_nested, omp_get_nested. Allowed values are TRUE and FALSE; the default is false.
• The number of threads used for an encountered parallel region can be controlled. Variable: OMP_NUM_THREADS; routines: omp_set_num_threads, omp_get_max_threads.
• The schedule for a parallel loop can be set. Variable: OMP_SCHEDULE; routines: omp_set_schedule, omp_get_schedule.

Nonobvious syntax:

```
export OMP_SCHEDULE="static,100"
```

Other settings:

• omp_get_num_threads: query the number of threads active at the current place in the code; this can be lower than what was set with omp_set_num_threads. For a meaningful answer, this should be done in a parallel region.
• omp_in_parallel: test if you are in a parallel region.
• omp_get_num_procs: query the physical number of cores available.
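A small sketch contrasting these queries inside and outside a parallel region:

```c
#include <stdio.h>
#include <omp.h>

int main() {
  // outside a parallel region: num_threads is 1, in_parallel is 0
  printf("outside: in_parallel=%d num_threads=%d procs=%d\n",
         omp_in_parallel(), omp_get_num_threads(), omp_get_num_procs());
#pragma omp parallel
#pragma omp single
  // inside: num_threads reports the team size
  printf("inside : in_parallel=%d num_threads=%d\n",
         omp_in_parallel(), omp_get_num_threads());
  return 0;
}
```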

Other environment variables:

• OMP_STACKSIZE controls the amount of space that is allocated as per-thread stack; the space for private variables; see section 22.2.
• OMP_WAIT_POLICY determines the behavior of threads that wait, for instance at a critical section:
  • ACTIVE puts the thread in a spin-lock, where it actively checks whether it can continue;
  • PASSIVE puts the thread to sleep until the OS wakes it up.

  The 'active' strategy uses CPU while the thread is waiting; on the other hand, activating it after the wait is instantaneous. With the 'passive' strategy, the thread does not use any CPU while waiting, but activating it again is expensive. Thus, the passive strategy only makes sense if threads will be waiting for a (relatively) long time.
• OMP_PROC_BIND with values TRUE and FALSE can bind threads to a processor. On the one hand, doing so can minimize data movement; on the other hand, it may increase load imbalance.

## 28.2 Timing


OpenMP has a wall clock timer routine omp_get_wtime:

```c
double omp_get_wtime(void);
```

The starting point is arbitrary and is different for each program run; however, in one run it is identical for all threads. This timer has a resolution given by omp_get_wtick  .
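A minimal sketch of using the timer, with the tick as a lower bound on what can be measured:

```c
#include <stdio.h>
#include <omp.h>

int main() {
  printf("timer resolution: %e sec\n", omp_get_wtick());
  double tstart = omp_get_wtime();
  // ... code to be timed goes here ...
  printf("elapsed: %e sec\n", omp_get_wtime() - tstart);
  return 0;
}
```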

Exercise Use the timing routines to demonstrate speedup from using multiple threads.

• Write a code segment that takes a measurable amount of time, that is, it should take a multiple of the tick time.

• Write a parallel loop and measure the speedup. You can for instance do this as follows, where the timer calls and the num_threads clause are one possible way of filling in the details:

```c
for (int use_threads=1; use_threads<=nthreads; use_threads++) {
  tstart = omp_get_wtime();
#pragma omp parallel for num_threads(use_threads)
  for (int i=0; i<N; i++) { // N: problem size
    .....
  }
  tend = omp_get_wtime();
  if (use_threads==1)
    time1 = tend-tstart;
  else // compute speedup
    printf("speedup: %5.2f\n", time1/(tend-tstart));
}
```


• In order to prevent the compiler from optimizing your loop away, let the body compute a result and use a reduction to preserve these results.

End of exercise

## 28.3 Thread safety

With OpenMP it is relatively easy to take existing code and make it parallel by introducing parallel sections. If you're careful to declare the appropriate variables shared and private, this may work fine. However, your code may include calls to library routines that include a race condition; such code is said not to be thread-safe.

For example, a routine

```c
static int isave;
int next_one() {
  int i = isave;
  isave += 1;
  return i;
}

...
for ( .... ) {
  int ivalue = next_one();
}
```


has a clear race condition, as the iterations of the loop may get different next_one values, as they are supposed to, or not. This can be solved by using a critical pragma around the next_one call; another solution is to use a threadprivate declaration for isave. This is for instance the right solution if the next_one routine implements a random number generator; a sketch of that fix follows below.
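A minimal sketch of the threadprivate solution, where each thread keeps and updates its own copy of the counter:

```c
static int isave;
#pragma omp threadprivate(isave)

int next_one() {
  int i = isave;  // each thread reads its own copy
  isave += 1;     // and increments it without racing
  return i;
}
```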

## 28.4 Performance and tuning


The performance of an OpenMP code can be influenced by the following.

• [Amdahl effects] Your code needs to have enough parts that are parallel (see Eijkhout:IntroHPC). Sequential parts may be sped up by having them executed redundantly on each thread, since that keeps the data local.
• [Dynamism] Creating a thread team takes time. In practice, a team is not created and deleted for each parallel region, but creating teams of different sizes, or recursive thread creation, may introduce overhead.
• [Load imbalance] Even if your program is parallel, you need to worry about load balance. In the case of a parallel loop you can set the schedule clause to dynamic, which evens out the work, but may cause increased communication.
• [Communication] Cache coherence causes communication. Threads should, as much as possible, refer to their own data.

• Threads are likely to read from each other's data. That is largely unavoidable.

• Threads writing to each other's data should be avoided: it may require synchronization, and it causes coherence traffic.

• If threads can migrate, data that was local at one time is no longer local after migration.

• Reading data from one socket that was allocated on another socket is inefficient; see section  25.2  .

• [Affinity] Both data and execution threads can be bound to a specific locale to some extent. Using local data is more efficient than remote data, so you want to use local data, and minimize the extent to which data or execution can move.

• See the above points about phenomena that cause communication.

• Section 25.1.1 describes how you can specify the binding of threads to places. This can, but need not, have an effect on affinity. For instance, if an OpenMP thread can migrate between hardware threads, cached data will stay local. Leaving an OpenMP thread completely free to migrate can be advantageous for load balancing, but you should only do that if data affinity is of lesser importance.

• Static loop schedules have a higher chance of using data that has affinity with the place of execution, but they are worse for load balancing. On the other hand, the nowait clause can alleviate some of the problems with static loop schedules.

• [Binding] You can choose to put OpenMP threads close together or to spread them apart. Having them close together makes sense if they use lots of shared data. Spreading them apart may increase bandwidth. (See the examples in section  25.1.2  .)
• [Synchronization] Barriers are a form of synchronization. They are expensive by themselves, and they expose load imbalance. Implicit barriers happen at the end of worksharing constructs; they can be removed with nowait, as in the sketch below.

Critical sections imply a loss of parallelism, but they are also slow as they are realized through operating system functions. These are often quite costly, taking many thousands of cycles. Critical sections should be used only if the parallel work far outweighs their cost.
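A minimal sketch of removing an implicit barrier with nowait; N, x, y, f, g are placeholder names, and the construction is only safe because the second loop does not read x:

```c
#pragma omp parallel
{
#pragma omp for nowait
  for (int i=0; i<N; i++)
    x[i] = f(i);   // no implicit barrier after this loop
#pragma omp for
  for (int i=0; i<N; i++)
    y[i] = g(i);   // threads can start here early; does not read x
}
```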

## 28.5 Accelerators


In OpenMP- there is support for offloading work to an accelerator or co-processor

```
#pragma omp target [clauses]
```


with clauses such as

• data: place data on the device
• update: make data consistent between host and device
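A minimal sketch of offloading a loop, where x, y, N are placeholder names and the map clauses govern host-device data traffic:

```c
// copy x to the device, copy y back when the region ends
#pragma omp target map(to: x[0:N]) map(from: y[0:N])
#pragma omp teams distribute parallel for
for (int i=0; i<N; i++)
  y[i] = 2.0 * x[i];
```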

## 28.6 Tools interface


The OpenMP- defines a tools interface. This means that routines can be defined that get called by the OpenMP runtime. For instance, the following example defines callback that are evaluated when OpenMP is initialized and finalized, thereby giving the runtime for the application.

```c
#include <stdio.h>
#include <omp.h>
#include <omp-tools.h>

int ompt_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                    ompt_data_t *tool_data) {
  printf("libomp init time: %f\n",
         omp_get_wtime() - *(double *)(tool_data->ptr));
  *(double *)(tool_data->ptr) = omp_get_wtime();
  return 1; // success: activates tool
}

void ompt_finalize(ompt_data_t *tool_data) {
  printf("application runtime: %f\n",
         omp_get_wtime() - *(double *)(tool_data->ptr));
}

ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
  static double time = 0; // static definition needs constant assignment
  time = omp_get_wtime();
  static ompt_start_tool_result_t ompt_start_tool_result = {
      &ompt_initialize, &ompt_finalize, {.ptr = &time}};
  return &ompt_start_tool_result; // success: registers tool
}
```


(Example courtesy of https://git.rwth-aachen.de/OpenMPTools/OMPT-Examples  .)
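Such a tool is typically compiled into a shared library; as of OpenMP-5.0 it can then be activated by naming that library in the OMP_TOOL_LIBRARIES environment variable, or the tool can be linked statically with the application.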

## 28.7 OpenMP standards


Here is the correspondence between the value of the _OPENMP macro and the standard versions:

• 201511 OpenMP-4.5;
• 201611 Technical report 4: information about OpenMP-5.0, but not yet mandated;
• 201811 OpenMP-5.0;
• 202011 OpenMP-5.1;
• 202111 OpenMP-5.2.

```c
// version.c
int standard = _OPENMP;
printf("Supported OpenMP standard: %d\n",standard);
switch (standard) {
case 201511: printf("4.5\n"); break;
case 201611: printf("Technical report 4: information about 5.0 but not yet mandated.\n"); break;
case 201811: printf("5.0\n"); break;
case 202011: printf("5.1\n"); break;
case 202111: printf("5.2\n"); break;
default: printf("Unrecognized version\n"); break;
}
```


The openmp.org website maintains a record of which compilers support which standards: https://www.openmp.org/resources/openmp-compilers-tools/  .

## 28.8 Memory model


### 28.8.1 Dekker's algorithm


A standard illustration of the weak memory model is Dekker's algorithm. We model that in OpenMP as follows:

```c
// weak1.c
int a=0,b=0,r1,r2;
#pragma omp parallel sections shared(a, b, r1, r2)
{
#pragma omp section
  {
    a = 1;
    r1 = b;
  }
#pragma omp section
  {
    b = 1;
    r2 = a;
  }
}
```


Under any reasonable interpretation of parallel execution, the possible values for $(r_1,r_2)$ are $(1,1)$, $(0,1)$, or $(1,0)$. This is known as sequential consistency: the parallel outcome is consistent with a sequential execution that interleaves the parallel computations, respecting their local statement orderings. (See also Eijkhout:IntroHPC.)

However, running this, we get a small number of cases where $r_1=r_2=0$. There are two possible explanations:

1. The compiler is allowed to interchange the first and second statements, since there is no dependence between them; or

2. The thread is allowed to have a local copy of the variable that is not coherent with the value in memory.

We fix this by flushing both a,b:

```c
// weak2.c
int a=0,b=0,r1,r2;
#pragma omp parallel sections shared(a, b, r1, r2)
{
#pragma omp section
  {
    a = 1;
#pragma omp flush (a,b)
    r1 = b;
  }
#pragma omp section
  {
    b = 1;
#pragma omp flush (a,b)
    r2 = a;
  }
}
```


### 28.8.2 Relaxed memory model


The flush directive makes a thread's temporary view of memory consistent with shared memory. There is an implicit flush:

• of all variables at the start and end of a parallel region;
• at each barrier, whether explicit or implicit, such as at the end of a worksharing construct;
• at entry and exit of a critical section;
• when a lock is set or unset.
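For instance, the implicit flushes at lock operations make the following counter update safe; count is a hypothetical shared variable declared in the enclosing scope:

```c
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel shared(count)
{
  omp_set_lock(&lock);   // implies a flush: we see other threads' updates
  count += 1;
  omp_unset_lock(&lock); // implies a flush: our update becomes visible
}
omp_destroy_lock(&lock);
```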