OpenMP topic: Synchronization

Experimental HTML version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse

23.1 : Barrier
23.1.1 : Implicit barriers
23.2 : Mutual exclusion
23.2.1 : Race conditions
23.2.2 : critical and atomic
23.2.3 : atomic construct
23.3 : Locks
23.3.1 : Routines
23.3.2 : Example: object with atomic update
23.3.3 : Example: histogram / binning
23.3.4 : Nested locks
23.4 : Relaxed memory model
23.5 : Example: Fibonacci computation

23 OpenMP topic: Synchronization

In the constructs for declaring parallel regions above, you had little control over the order in which threads executed the work they were assigned. This section discusses synchronization constructs: ways of telling threads to bring a certain order to the sequence in which they do things.

Loop-related synchronization constructs were discussed earlier.

23.1 Barrier

A barrier defines a point in the code where all active threads will stop until all threads have arrived at that point. With this, you can guarantee that certain calculations are finished. For instance, in this code snippet, the computation of y[mytid] cannot safely proceed until thread mytid+1 has computed x[mytid+1].

#pragma omp parallel 
{
  int mytid = omp_get_thread_num();
  x[mytid] = some_calculation();
  y[mytid] = x[mytid]+x[mytid+1];
}
This can be guaranteed with a barrier pragma:
#pragma omp parallel 
{
  int mytid = omp_get_thread_num();
  x[mytid] = some_calculation();
#pragma omp barrier
  y[mytid] = x[mytid]+x[mytid+1];
}

23.1.1 Implicit barriers

Apart from the barrier directive, which inserts an explicit barrier, OpenMP has implicit barriers after a work sharing construct; see section 21.3. Thus the following code is well defined:

#pragma omp parallel 
{
#pragma omp for
  for (int mytid=0; mytid<number_of_threads; mytid++)
    x[mytid] = some_calculation();
#pragma omp for
  for (int mytid=0; mytid<number_of_threads-1; mytid++)
    y[mytid] = x[mytid]+x[mytid+1];
}

You can also put each parallel loop in a parallel region of its own, but there is some overhead associated with creating and deleting the team of threads in between the regions.
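As a comparison, a sketch of the same computation with each loop in its own parallel region; here the team of threads is created and dissolved twice:

#pragma omp parallel for
for (int mytid=0; mytid<number_of_threads; mytid++)
  x[mytid] = some_calculation();
// the team is dissolved and re-created here
#pragma omp parallel for
for (int mytid=0; mytid<number_of_threads-1; mytid++)
  y[mytid] = x[mytid]+x[mytid+1];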

At the end of a parallel region the team of threads is dissolved and only the master thread continues. Therefore, there is an implicit barrier at the end of a parallel region. This barrier behavior can be canceled with the nowait clause.

You will often see the idiom

#pragma omp parallel
{
#pragma omp for nowait
  for (i=0; i<N; i++)
    a[i] = // some expression
#pragma omp for
  for (i=0; i<N; i++)
  for (i=0; i<N; i++)
    b[i] = ...... a[i] ......
}
Here the nowait clause implies that threads can start on the second loop while other threads are still working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can indeed rely on that value having been computed in the first loop.

23.2 Mutual exclusion

Sometimes it is necessary to limit a piece of code so that it can be executed by only one thread at a time. Such a piece of code is called a critical section, and OpenMP has several mechanisms for realizing this.

23.2.1 Race conditions

OpenMP, being based on shared memory, has a potential for race conditions. These happen when two threads access the same data item, with at least one of the accesses being a write. The problem with race conditions is that the constructs most convenient for the programmer often run counter to efficient execution.

For a simple example, consider several threads that all increment a shared counter.
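A minimal sketch of such a race, standing in for the book's racecounter example, which is not reproduced here:

#include <stdio.h>
#include <omp.h>
int main() {
  int counter = 0;
#pragma omp parallel
  for (int i=0; i<10000; i++)
    counter = counter+1; // unsynchronized read-modify-write: a race
  // the result is typically less than 10000 times the number of threads:
  printf("counter: %d\n",counter);
  return 0;
}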

The basic rule about multiple-thread access of a single data item is:

Any memory location that is written by one thread cannot be read by another thread in the same parallel region, unless synchronization is done in between.

To start with that last clause: any workshare construct ends with an implicit barrier, so data written before that barrier can safely be read after it.

23.2.2 critical and atomic

There are two pragmas for critical sections: critical and atomic. Both denote atomic operations in a technical sense. The first one is general and can contain an arbitrary sequence of instructions; the second one is more limited but has performance advantages.

Beginning programmers are often tempted to use critical for updates in a loop:

#pragma omp parallel
{
  int mytid = omp_get_thread_num();
  double tmp = some_function(mytid);
  // This Works, but Not Best Solution:
#pragma omp critical
  sum += tmp;
}
but this should really be done with a reduction clause, which will be far more efficient.
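For comparison, a sketch of the same update with a reduction clause; each thread then accumulates into a private copy of sum, and the copies are combined at the end of the region:

#pragma omp parallel reduction(+:sum)
{
  int mytid = omp_get_thread_num();
  double tmp = some_function(mytid);
  sum += tmp; // no critical section needed: this updates the thread-private copy
}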

A good use of critical sections is doing file writes or database updates.

Exercise Consider a loop where each iteration updates a variable.

#pragma omp parallel for shared(result)
  for ( i ) {
      result += some_function_of(i);
  }
Discuss qualitatively the difference between handling the update in a critical section, and using a reduction clause as described above. Then do an Amdahl-style quantitative analysis of the first case, assuming that you do $n$ iterations on $p$ threads, and each iteration has a critical section that takes a fraction $f$. Assume the number of iterations $n$ is a multiple of the number of threads $p$. Also assume the default static distribution of loop iterations over the threads.
End of exercise

Critical sections are an easy way to turn an existing code into a correct parallel code. However, there are performance disadvantages to critical sections, and sometimes a more drastic rewrite is called for.

A critical section works by acquiring a lock, which carries a substantial overhead. Furthermore, if your code has multiple critical sections, they are all mutually exclusive: if a thread is in one critical section, the other ones are all blocked.

The problem with critical sections being mutually exclusive can be mitigated by naming them:

#pragma omp critical (optional_name_in_parens)
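For instance, two critical sections with different names do not exclude each other; a sketch, where the names update_sum and update_count are made up for illustration:

#pragma omp critical (update_sum)
  sum += tmp;
// elsewhere; does not block the section above:
#pragma omp critical (update_count)
  count += 1;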

On the other hand, the syntax for atomic sections is limited to the update of a single memory location, but such sections are not exclusive and they can be more efficient, since they assume that there is a hardware mechanism for making them critical. See the next section.

23.2.3 atomic construct

While the critical construct can enclose arbitrary blocks of code, the atomic construct has one of a limited number of forms, for which hardware support is likely. These consist of an update of a variable:

x++;
// or:
x += y;
possibly in combination with reading that variable:
v = x; x++;

There are various further refinements on the atomic specification:

  1. omp atomic write is followed by a single assignment statement to a shared variable.
  2. omp atomic read is followed by a single assignment statement from a shared variable.
  3. omp atomic is equivalent to omp atomic update; it accommodates statements such as
        x++; x += 1.5;
    
  4. omp atomic capture can accommodate a single statement similar to omp atomic update, or a block that essentially combines a read and an update form; see the sketch below.
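A minimal sketch of the capture form, reading the old value of a shared variable x and updating it in one atomic step:

int oldval;
#pragma omp atomic capture
{ oldval = x; x += 1; } // fetch-and-add: oldval gets the value before the update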

23.3 Locks

OpenMP also has the traditional mechanism of a lock  . A lock is somewhat similar to a critical section: it guarantees that some instructions can only be performed by one process at a time. However, a critical section is indeed about code; a lock is about data. With a lock you make sure that some data elements can only be touched by one process at a time.

23.3.1 Routines

Create/destroy:

void omp_init_lock(omp_lock_t *lock);
void omp_destroy_lock(omp_lock_t *lock);
Set and release:
void omp_set_lock(omp_lock_t *lock);
void omp_unset_lock(omp_lock_t *lock);
Since the set call is blocking, there is also a nonblocking variant that reports whether the lock was obtained:
int omp_test_lock(omp_lock_t *lock);

Unsetting a lock needs to be done by the thread that set it.

Lock operations implicitly have a flush; see section 23.4.
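The test call makes it possible to do other work while waiting for a lock; a sketch, where do_other_work and critical_update are hypothetical:

omp_lock_t the_lock;          // initialized elsewhere with omp_init_lock
while ( ! omp_test_lock(&the_lock) )
  do_other_work();            // hypothetical: useful work while the lock is held elsewhere
critical_update();            // hypothetical: we now hold the lock
omp_unset_lock(&the_lock);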

Exercise In the following code, one process sets array A and then uses it to update B; the other process sets array B and then uses it to update A. Argue that this code can deadlock. How could you fix this?

#pragma omp parallel shared(a, b, nthreads, locka, lockb)
  {
  #pragma omp sections nowait
    {
    #pragma omp section
      {
      omp_set_lock(&locka);
      for (i=0; i<N; i++)
        a[i] = ..

      omp_set_lock(&lockb);
      for (i=0; i<N; i++)
        b[i] = .. a[i] ..
      omp_unset_lock(&lockb);
      omp_unset_lock(&locka);
      }

    #pragma omp section
      {
      omp_set_lock(&lockb);
      for (i=0; i<N; i++)
        b[i] = ...

      omp_set_lock(&locka);
      for (i=0; i<N; i++)
        a[i] = .. b[i] ..
      omp_unset_lock(&locka);
      omp_unset_lock(&lockb);
      }
    }  /* end of sections */
  }  /* end of parallel region */

End of exercise

23.3.2 Example: object with atomic update

OO languages such as C++ allow for syntactic simplification, for instance building the locking and unlocking actions into the update operator.

C++ note

// lockobject.cxx
class atomic_int {
private:
  omp_lock_t the_lock;
  int _value{0};
  bool delay;
public:
  atomic_int(bool delay=false)
    : delay(delay) {
    omp_init_lock(&the_lock);
  };
  atomic_int( const atomic_int& )
      = delete;
  atomic_int& operator=( const atomic_int& )
      = delete;
  ~atomic_int() {
    omp_destroy_lock(&the_lock);
  };
  int operator +=( int i ) {
// atomic increment
    omp_set_lock(&the_lock);
    _value += i; int rv = _value;
    omp_unset_lock(&the_lock);
    return rv;
  };
  auto value() { return _value; };
};

Running this:

for (int ithread=0; ithread<NTHREADS; ithread++) {
  threads.push_back
    ( std::thread(
             [=,&my_object] () {
               for (int iop=0; iop<nops; iop++)
                 my_object += 1; } ) );
}
for ( auto &t : threads )
  t.join();
End of C++ note

23.3.3 Example: histogram / binning

One simple example of the use of locks is the generation of a histogram (also known as the binning problem). A histogram consists of a number of bins that are updated depending on some data. Here is the basic structure of such a code:

int count[100];
float x = some_function();
int ix = (int)x;
if (ix>=100)
  error();
else
  count[ix]++;
It would be possible to guard the last line:
#pragma omp critical
  count[ix]++;
but that is unnecessarily restrictive: it serializes updates to all bins, even different ones. If there are enough bins in the histogram, and if some_function takes enough time, conflicting writes are unlikely. The solution then is to create an array of locks, with one lock for each count location; see the sketch below.

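A minimal sketch of the lock-per-bin approach, where the name bin_lock is made up for illustration:

omp_lock_t bin_lock[100];
for (int i=0; i<100; i++)
  omp_init_lock(&bin_lock[i]);

// inside the parallel region:
omp_set_lock(&bin_lock[ix]);
count[ix]++;                  // only updates to the same bin exclude each other
omp_unset_lock(&bin_lock[ix]);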
Another solution would be to give each thread a local copy of the result array, and perform a reduction on these; see section 20.2.2.

23.3.4 Nested locks

A lock as explained above cannot be locked again if it is already locked. A nested lock can be locked multiple times by the same thread before being unlocked; the corresponding routines are sketched below.
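Nested locks have their own type and set of routines; a minimal sketch:

omp_nest_lock_t lock;
omp_init_nest_lock(&lock);
omp_set_nest_lock(&lock);
omp_set_nest_lock(&lock);   // same thread: succeeds, nesting count becomes 2
omp_unset_nest_lock(&lock);
omp_unset_nest_lock(&lock); // released when the count reaches zero
omp_destroy_nest_lock(&lock);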

23.4 Relaxed memory model

The OpenMP memory model is relaxed: a thread may temporarily keep its own view of shared data, which is not guaranteed to be consistent with memory at every point. The flush directive forces a thread's view of shared variables to be made consistent with memory; barriers, critical sections, and lock operations imply a flush.

23.5 Example: Fibonacci computation

The Fibonacci sequence is recursively defined as \begin{equation} F(0)=1,\qquad F(1)=1,\qquad F(n)=F(n-1)+F(n-2) \hbox{ for $n\geq2$}. \end{equation} We start by sketching the basic single-threaded solution. The naive code looks like:

int nmax = 10;
int *value; // table of Fibonacci values

int main() {
  value = new int[nmax+1];
  value[0] = 1;
  value[1] = 1;
  fib(10);
}

int fib(int n) {
  int i, j, result;
  if (n>=2) {
    i=fib(n-1); j=fib(n-2);
    value[n] = i+j;
  }
  return value[n];
}
However, this is inefficient, since most intermediate values will be computed more than once. We solve this by keeping track of which results are known:
  ...
  done = new int[nmax+1];
  for (i=0; i<=nmax; i++)
    done[i] = 0;
  done[0] = 1;
  done[1] = 1;
  ...
int fib(int n) {
  int i, j;
  if (!done[n]) {
    i = fib(n-1); j = fib(n-2);
    value[n] = i+j; done[n] = 1;
  }
  return value[n];
}
The OpenMP parallel solution calls for two different ideas. First of all, we parallelize the recursion by using tasks (see the chapter on OpenMP tasks):
int fib(int n) {
  int i, j;
  if (n>=2) {
#pragma omp task shared(i) firstprivate(n)
    i=fib(n-1);
#pragma omp task shared(j) firstprivate(n)
    j=fib(n-2);
#pragma omp taskwait
    value[n] = i+j;
  }
  return value[n];
}
This computes the right solution, but, as in the naive single-threaded solution, it recomputes many of the intermediate values.

A naive addition of the done array leads to data races, and probably an incorrect solution:

int fib(int n) {
  int i, j, result;
  if (!done[n]) {
#pragma omp task shared(i) firstprivate(n)
    i=fib(n-1);
#pragma omp task shared(j) firstprivate(n)
    j=fib(n-2);
#pragma omp taskwait
    value[n] = i+j;
    done[n] = 1;
  }
  return value[n];
}
For instance, there is no guarantee that the done array is updated later than the value array, so a thread can think that done[n-1] is true, but value[n-1] does not have the right value yet.

One solution to this problem is to use a lock, and make sure that, for a given index  n  , the values done[n] and value[n] are never touched by more than one thread at a time:

int fib(int n)
{
  int i, j;
  omp_set_lock( &(dolock[n]) );
  if (!done[n]) {
#pragma omp task shared(i) firstprivate(n)
    i = fib(n-1);
#pragma omp task shared(j) firstprivate(n)
    j = fib(n-2);
#pragma omp taskwait
    value[n] = i+j;
    done[n] = 1;
  }
  omp_unset_lock( &(dolock[n]) );
  return value[n];
}
This solution is correct, optimally efficient in the sense that it does not recompute anything, and it uses tasks to obtain a parallel execution.

However, the efficiency of this solution is only up to a constant: a lock is still set and unset, even if a value has already been computed and will therefore only be read. This can be solved with a more complicated use of critical sections, but we will forgo this.
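For completeness, the dolock array used above must be allocated and its locks initialized before the first call to fib; a minimal sketch, in the style of the allocations above:

omp_lock_t *dolock = new omp_lock_t[nmax+1];
for (int n=0; n<=nmax; n++)
  omp_init_lock( &(dolock[n]) ); // one lock per index n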
