##### Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse

28.1 : Runtime functions, environment variables, internal control variables
28.2 : Timing
28.3 : Thread safety
28.4 : Performance and tuning
28.5 : Accelerators
28.6 : Tools interface
28.7 : OpenMP standards
28.8 : Memory model
28.8.1 : Dekker's algorithm
28.8.2 : Relaxed memory model

# 28 OpenMP remaining topics

## 28.1 Runtime functions, environment variables, internal control variables

OpenMP has a number of settings that can be set through environment variables, and both queried and set through library routines. These settings are called ICVs (Internal Control Variables): an OpenMP implementation behaves as if there is an internal variable storing each setting.
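For instance, the number-of-threads setting can be given in the environment and then overridden at runtime; a minimal sketch using only standard routines:

```c
#include <stdio.h>
#include <omp.h>

int main() {
  // override any OMP_NUM_THREADS environment setting
  omp_set_num_threads(4);
  // query the resulting ICV value
  printf("max threads: %d\n", omp_get_max_threads());
  return 0;
}
```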

The runtime functions are:

• omp_get_max_threads, omp_get_num_procs; see section 17.5.
• omp_set_dynamic
• omp_get_dynamic
• omp_set_nested
• omp_get_nested
• omp_get_wtime
• omp_get_wtick
• omp_set_schedule
• omp_get_schedule
• omp_set_max_active_levels
• omp_get_max_active_levels
• omp_get_level
• omp_get_active_level
• omp_get_team_size

Here are the OpenMP environment variables:

• OMP_CANCELLATION Set whether cancellation is activated; see section 18.3.
• OMP_DISPLAY_ENV Show OpenMP version (section 28.7) and environment variables.
• OMP_DEFAULT_DEVICE Set the device used in target regions.
• OMP_MAX_ACTIVE_LEVELS Set the maximum number of nested parallel regions; section 18.2.
• OMP_NESTED Allow nested parallel regions.
• OMP_PROC_BIND Whether threads may be moved between CPUs; section 25.1.
• OMP_PLACES Specifies on which CPUs the threads should be placed; section 25.1.
• OMP_STACKSIZE Set default thread stack size; section 22.2.
• OMP_SCHEDULE How threads are scheduled; section 19.3.
• OMP_THREAD_LIMIT Set the maximum number of threads; see section 27.3.
• OMP_WAIT_POLICY How waiting threads are handled; ICV wait-policy-var. Values: ACTIVE for keeping threads spinning, PASSIVE for possibly yielding the processor when threads are waiting.

There are 4 ICVs that behave as if each thread has its own copy of them. The default is implementation-defined unless otherwise noted.

• It may be possible to adjust dynamically the number of threads for a parallel region. Variable: OMP_DYNAMIC; routines: omp_set_dynamic, omp_get_dynamic.
• If a code contains nested parallel regions, the inner regions may create new teams, or they may be executed by the single thread that encounters them. Variable: OMP_NESTED; routines: omp_set_nested, omp_get_nested. Allowed values are TRUE and FALSE; the default is false.
• The number of threads used for an encountered parallel region can be controlled. Variable: OMP_NUM_THREADS; routines: omp_set_num_threads, omp_get_max_threads.
• The schedule for a parallel loop can be set. Variable: OMP_SCHEDULE; routines: omp_set_schedule, omp_get_schedule.

Nonobvious syntax:

```
export OMP_SCHEDULE="static,100"
```

Other settings:

• omp_get_num_threads: query the number of threads active at the current place in the code; this can be lower than what was set with omp_set_num_threads. For a meaningful answer, this should be done in a parallel region.
• omp_in_parallel: test if you are in a parallel region.
• omp_get_num_procs: query the physical number of cores available.
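A small sketch contrasting these queries inside and outside a parallel region:

```c
#include <stdio.h>
#include <omp.h>

int main() {
  // outside a parallel region: num_threads is 1, in_parallel is 0
  printf("outside: in_parallel=%d num_threads=%d procs=%d\n",
         omp_in_parallel(), omp_get_num_threads(), omp_get_num_procs());
#pragma omp parallel
#pragma omp single
  // inside: num_threads reports the team size
  printf("inside : in_parallel=%d num_threads=%d\n",
         omp_in_parallel(), omp_get_num_threads());
  return 0;
}
```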

Other environment variables:

• OMP_STACKSIZE controls the amount of space that is allocated as per-thread stack; the space for private variables; see section 22.2.
• OMP_WAIT_POLICY determines the behavior of threads that wait, for instance at a critical section:
  • ACTIVE puts the thread in a spin-lock, where it actively checks whether it can continue;
  • PASSIVE puts the thread to sleep until the OS wakes it up.

  The 'active' strategy uses CPU while the thread is waiting; on the other hand, activating it after the wait is instantaneous. With the 'passive' strategy, the thread does not use any CPU while waiting, but activating it again is expensive. Thus, the passive strategy only makes sense if threads will be waiting for a (relatively) long time.
• OMP_PROC_BIND with values TRUE and FALSE can bind threads to a processor. On the one hand, doing so can minimize data movement; on the other hand, it may increase load imbalance.

## 28.2 Timing


OpenMP has a wall clock timer routine omp_get_wtime:

```c
double omp_get_wtime(void);
```

The starting point is arbitrary and is different for each program run; however, in one run it is identical for all threads. This timer has a resolution given by omp_get_wtick  .
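A minimal sketch of using the timer, with the tick as a lower bound on what can be measured:

```c
#include <stdio.h>
#include <omp.h>

int main() {
  printf("timer resolution: %e sec\n", omp_get_wtick());
  double tstart = omp_get_wtime();
  // ... code to be timed goes here ...
  printf("elapsed: %e sec\n", omp_get_wtime() - tstart);
  return 0;
}
```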

Exercise Use the timing routines to demonstrate speedup from using multiple threads.

• Write a code segment that takes a measurable amount of time, that is, it should take a multiple of the tick time.

• Write a parallel loop and measure the speedup. You can for instance do this as follows, where the timer calls and the num_threads clause are one possible way of filling in the details:

```c
for (int use_threads=1; use_threads<=nthreads; use_threads++) {
  tstart = omp_get_wtime();
#pragma omp parallel for num_threads(use_threads)
  for (int i=0; i<N; i++) { // N: problem size
    .....
  }
  tend = omp_get_wtime();
  if (use_threads==1)
    time1 = tend-tstart;
  else // compute speedup
    printf("speedup: %5.2f\n", time1/(tend-tstart));
}
```


• In order to prevent the compiler from optimizing your loop away, let the body compute a result and use a reduction to preserve these results.

End of exercise

## 28.3 Thread safety

With OpenMP it is relatively easy to take existing code and make it parallel by introducing parallel sections. If you're careful to declare the appropriate variables shared and private, this may work fine. However, your code may include calls to library routines that include a race condition; such code is said not to be thread-safe.

For example, a routine

```c
static int isave;
int next_one() {
  int i = isave;
  isave += 1;
  return i;
}

...
for ( .... ) {
  int ivalue = next_one();
}
```


has a clear race condition, as the iterations of the loop may get different next_one values, as they are supposed to, or not. This can be solved by using a critical pragma around the next_one call; another solution is to use a threadprivate declaration for isave. This is for instance the right solution if the next_one routine implements a random number generator; a sketch of that fix follows below.
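A minimal sketch of the threadprivate solution, where each thread keeps and updates its own copy of the counter:

```c
static int isave;
#pragma omp threadprivate(isave)

int next_one() {
  int i = isave;  // each thread reads its own copy
  isave += 1;     // and increments it without racing
  return i;
}
```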

## 28.4 Performance and tuning


The performance of an OpenMP code can be influenced by the following.

• [Amdahl effects] Your code needs to have enough parts that are parallel (see Eijkhout:IntroHPC). Sequential parts may be sped up by having them executed redundantly on each thread, since that keeps the data local.
• [Dynamism] Creating a thread team takes time. In practice, a team is not created and deleted for each parallel region, but creating teams of different sizes, or recursive thread creation, may introduce overhead.
• [Load imbalance] Even if your program is parallel, you need to worry about load balance. In the case of a parallel loop you can set the schedule clause to dynamic, which evens out the work, but may cause increased communication.
• [Communication] Cache coherence causes communication. Threads should, as much as possible, refer to their own data.

• Threads are likely to read from each other's data. That is largely unavoidable.

• Threads writing to each other's data should be avoided: it may require synchronization, and it causes coherence traffic.

• If threads can migrate, data that was local at one time is no longer local after migration.

• Reading data from one socket that was allocated on another socket is inefficient; see section  25.2  .

• [Affinity] Both data and execution threads can be bound to a specific locale to some extent. Using local data is more efficient than remote data, so you want to use local data, and minimize the extent to which data or execution can move.

• See the above points about phenomena that cause communication.

• Section 25.1.1 describes how you can specify the binding of threads to places. This can, but need not, have an effect on affinity. For instance, if an OpenMP thread can migrate between hardware threads, cached data will stay local. Leaving an OpenMP thread completely free to migrate can be advantageous for load balancing, but you should only do that if data affinity is of lesser importance.

• Static loop schedules have a higher chance of using data that has affinity with the place of execution, but they are worse for load balancing. On the other hand, the nowait clause can alleviate some of the problems with static loop schedules.

• [Binding] You can choose to put OpenMP threads close together or to spread them apart. Having them close together makes sense if they use lots of shared data. Spreading them apart may increase bandwidth. (See the examples in section  25.1.2  .)
• [Synchronization] Barriers are a form of synchronization. They are expensive by themselves, and they expose load imbalance. Implicit barriers happen at the end of worksharing constructs; they can be removed with nowait, as in the sketch below.

Critical sections imply a loss of parallelism, but they are also slow as they are realized through operating system functions. These are often quite costly, taking many thousands of cycles. Critical sections should be used only if the parallel work far outweighs their cost.
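A minimal sketch of removing an implicit barrier with nowait; N, x, y, f, g are placeholder names, and the construction is only safe because the second loop does not read x:

```c
#pragma omp parallel
{
#pragma omp for nowait
  for (int i=0; i<N; i++)
    x[i] = f(i);   // no implicit barrier after this loop
#pragma omp for
  for (int i=0; i<N; i++)
    y[i] = g(i);   // threads can start here early; does not read x
}
```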

## 28.5 Accelerators


In OpenMP- there is support for offloading work to an accelerator or co-processor

```
#pragma omp target [clauses]
```


with clauses such as

• data: place data on the device
• update: make data consistent between host and device
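A minimal sketch of offloading a loop, where x, y, N are placeholder names and the map clauses govern host-device data traffic:

```c
// copy x to the device, copy y back when the region ends
#pragma omp target map(to: x[0:N]) map(from: y[0:N])
#pragma omp teams distribute parallel for
for (int i=0; i<N; i++)
  y[i] = 2.0 * x[i];
```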

## 28.6 Tools interface


The OpenMP- defines a tools interface. This means that routines can be defined that get called by the OpenMP runtime. For instance, the following example defines callback that are evaluated when OpenMP is initialized and finalized, thereby giving the runtime for the application.

```c
#include <stdio.h>
#include <omp.h>
#include <omp-tools.h>

int ompt_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                    ompt_data_t *tool_data) {
  printf("libomp init time: %f\n",
         omp_get_wtime() - *(double *)(tool_data->ptr));
  *(double *)(tool_data->ptr) = omp_get_wtime();
  return 1; // success: activates tool
}

void ompt_finalize(ompt_data_t *tool_data) {
  printf("application runtime: %f\n",
         omp_get_wtime() - *(double *)(tool_data->ptr));
}

ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
  static double time = 0; // static definition needs constant assignment
  time = omp_get_wtime();
  static ompt_start_tool_result_t ompt_start_tool_result = {
      &ompt_initialize, &ompt_finalize, {.ptr = &time}};
  return &ompt_start_tool_result; // success: registers tool
}
```


(Example courtesy of https://git.rwth-aachen.de/OpenMPTools/OMPT-Examples  .)
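Such a tool is typically compiled into a shared library; as of OpenMP-5.0 it can then be activated by naming that library in the OMP_TOOL_LIBRARIES environment variable, or the tool can be linked statically with the application.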

## 28.7 OpenMP standards


Here is the correspondence between the value of the _OPENMP macro and the standard versions:

• 201511 OpenMP-4.5;
• 201611 Technical report 4: information about OpenMP-5.0, but not yet mandated;
• 201811 OpenMP-5.0;
• 202011 OpenMP-5.1;
• 202111 OpenMP-5.2.

```c
// version.c
int standard = _OPENMP;
printf("Supported OpenMP standard: %d\n",standard);
switch (standard) {
case 201511: printf("4.5\n"); break;
case 201611: printf("Technical report 4: information about 5.0 but not yet mandated.\n"); break;
case 201811: printf("5.0\n"); break;
case 202011: printf("5.1\n"); break;
case 202111: printf("5.2\n"); break;
default: printf("Unrecognized version\n"); break;
}
```


The openmp.org website maintains a record of which compilers support which standards: https://www.openmp.org/resources/openmp-compilers-tools/  .

## 28.8 Memory model


### 28.8.1 Dekker's algorithm


A standard illustration of the weak memory model is Dekker's algorithm. We model that in OpenMP as follows:

```c
// weak1.c
int a=0,b=0,r1,r2;
#pragma omp parallel sections shared(a, b, r1, r2)
{
#pragma omp section
  {
    a = 1;
    r1 = b;
  }
#pragma omp section
  {
    b = 1;
    r2 = a;
  }
}
```


Under any reasonable interpretation of parallel execution, the possible values for $(r_1,r_2)$ are $(1,1)$, $(0,1)$, or $(1,0)$. This is known as sequential consistency: the parallel outcome is consistent with a sequential execution that interleaves the parallel computations, respecting their local statement orderings. (See also Eijkhout:IntroHPC.)

However, running this, we get a small number of cases where $r_1=r_2=0$. There are two possible explanations:

1. The compiler is allowed to interchange the first and second statements, since there is no dependence between them; or

2. The thread is allowed to have a local copy of the variable that is not coherent with the value in memory.

We fix this by flushing both a,b:

```c
// weak2.c
int a=0,b=0,r1,r2;
#pragma omp parallel sections shared(a, b, r1, r2)
{
#pragma omp section
  {
    a = 1;
#pragma omp flush (a,b)
    r1 = b;
  }
#pragma omp section
  {
    b = 1;
#pragma omp flush (a,b)
    r2 = a;
  }
}
```


### 28.8.2 Relaxed memory model


The flush directive makes a thread's temporary view of memory consistent with shared memory. There is an implicit flush:

• of all variables at the start and end of a parallel region;
• at each barrier, whether explicit or implicit, such as at the end of a worksharing construct;
• at entry and exit of a critical section;
• when a lock is set or unset.
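For instance, the implicit flushes at lock operations make the following counter update safe; count is a hypothetical shared variable declared in the enclosing scope:

```c
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel shared(count)
{
  omp_set_lock(&lock);   // implies a flush: we see other threads' updates
  count += 1;
  omp_unset_lock(&lock); // implies a flush: our update becomes visible
}
omp_destroy_lock(&lock);
```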