OpenMP topic: Work sharing

Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse
21.1 : Work sharing constructs
21.2 : Sections
21.3 : Single/master
21.4 : Fortran array syntax parallelization

21 OpenMP topic: Work sharing

The declaration of a parallel region establishes a team of threads. This offers the possibility of parallelism, but to actually get meaningful parallel activity you need something more. OpenMP uses the concept of a work sharing construct: a way of dividing parallelizable work over a team of threads.

You have already seen loop parallelism as a way of distributing parallel work in the chapter OpenMP topic: Loop parallelism. We will now discuss other work sharing constructs.

21.1 Work sharing constructs

crumb trail: > omp-share > Work sharing constructs

The work sharing constructs are:

- for (in C) and do (in Fortran), for loop parallelism; see the chapter OpenMP topic: Loop parallelism;
- sections, for a fixed number of independent units of work; see section 21.2;
- single, which limits a block to execution by a single thread; see section 21.3;
- workshare, which parallelizes Fortran array syntax; see section 21.4.

21.2 Sections

crumb trail: > omp-share > Sections

A parallel loop is an example of independent work units that are numbered. If you have a pre-determined number of independent work units, the sections construct is more appropriate. A sections construct contains any number of section constructs. These need to be independent, and they can be executed by any available thread in the current team, including multiple sections being done by the same thread.

#pragma omp sections
{
#pragma omp section
  // one calculation
#pragma omp section
  // another calculation
}
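The fragment above assumes an enclosing parallel region. As a minimal, self-contained sketch (the printf calls stand in for actual work):

#include <stdio.h>
#include <omp.h>

int main() {
#pragma omp parallel
  {
#pragma omp sections
    {
#pragma omp section
      { // any thread of the team may pick up this section
        printf("section 1 executed by thread %d\n",omp_get_thread_num());
      }
#pragma omp section
      {
        printf("section 2 executed by thread %d\n",omp_get_thread_num());
      }
    } // implicit barrier at the end of the sections construct
  }
  return 0;
}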

This construct can be used to divide large blocks of independent work. Suppose that in the following line, both f(x) and g(x) are big calculations:

  y = f(x) + g(x)
You could then write
double y1,y2;
#pragma omp sections
{
#pragma omp section
  y1 = f(x);
#pragma omp section
  y2 = g(x);
}
y = y1+y2;
Instead of using two temporaries, you could also use a critical section; see section  23.2.2  . However, the best solution is to have a reduction clause on the parallel sections directive. For the sum
  y = f(x) + g(x)
You could then write
// sectionreduct.c
float y=0;
#pragma omp parallel reduction(+:y)
#pragma omp sections
{
#pragma omp section
  y += f();
#pragma omp section
  y += g();
}
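For comparison, a sketch of the critical-section variant mentioned above, assuming f, g, and x as in the running example (the critical construct is discussed in section 23.2.2):

double y = 0.;
#pragma omp parallel sections
{
#pragma omp section
  { // compute the partial result, then update y under mutual exclusion
    double y1 = f(x);
#pragma omp critical
    y += y1;
  }
#pragma omp section
  {
    double y2 = g(x);
#pragma omp critical
    y += y2;
  }
}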

21.3 Single/master

crumb trail: > omp-share > Single/master

The single pragma limits the execution of a block to a single thread. This can for instance be used to print tracing information or to perform I/O operations.

#pragma omp parallel
{
#pragma omp single
  printf("We are starting this section!\n");
  // parallel stuff
}
Another use of single is to perform initializations in a parallel region:
int a;
#pragma omp parallel
{
  #pragma omp single
    a = f(); // some computation
  #pragma omp sections
    // various different computations using a
}

The point of the single directive in this last example is that the computation needs to be done only once, because of the shared memory. The single directive has an implicit barrier after it, which guarantees that all threads have the correct value in their local memory (see section  23.4  ).

Exercise What is the difference between this approach and how the same computation would be parallelized in MPI?
End of exercise

The master directive also enforces execution on a single thread, specifically the master thread of the team. However, it is not a work sharing construct, and therefore it does not have an implicit barrier to provide synchronization.

Exercise Modify the above code to read:

int a;
#pragma omp parallel
{
  #pragma omp master
    a = f(); // some computation
  #pragma omp sections
    // various different computations using a
}
This code is no longer correct. Explain.
End of exercise

Above we motivated the single directive as a way of initializing shared variables. It is also possible to use single to initialize private variables. In that case you add the copyprivate clause. This is a good solution if setting the variable involves I/O.
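As an illustration, a minimal sketch of single with copyprivate; the constant 42 stands in for, say, a value read from a file:

#include <stdio.h>
#include <omp.h>

int main() {
  int a;
#pragma omp parallel private(a)
  {
#pragma omp single copyprivate(a)
    { // one thread sets the value; copyprivate broadcasts it to the
      // private copies of all other threads at the implicit barrier
      a = 42;
    }
    printf("thread %d has a = %d\n",omp_get_thread_num(),a);
  }
  return 0;
}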

Exercise Give two other ways to initialize a private variable, with all threads receiving the same value. Can you give scenarios where each of the three strategies would be preferable?
End of exercise

21.4 Fortran array syntax parallelization

crumb trail: > omp-share > Fortran array syntax parallelization

The parallel do directive is used to parallelize loops, and this applies to both C and Fortran. However, Fortran also has implied loops in its array syntax. To parallelize array syntax you can use the workshare directive.

The workshare directive exists only in Fortran. It can be used to parallelize the implied loops in array syntax, as well as forall loops.

We compare two versions of $C\leftarrow C+A\times B$ (where all operations are elementwise), running on TACC Frontera up to 56 cores.

Workshare based:

!! workshare2d.F90
           !$omp parallel workshare
           C = A*B + C
           !$omp end parallel workshare

SIMD'ized loop:

!$omp parallel do simd
do i=1,dim
   do j=1,dim
      C(i,j) = C(i,j) + A(i,j) * B(i,j)
   end do
end do
!$omp end parallel do simd

With results, which show that the workshare version scales considerably better:

SIMD times:
0.07115 0.04053 0.02498 0.01609 0.01210 0.01247 0.01765 0.02689
Speedup:
 1 1.75549 2.84828 4.422 5.88017 5.70569 4.03116 2.64597

Workshare times:
0.06188 0.03186 0.01625 0.00867 0.00619 0.00379 0.00354 0.00373
Speedup:
 1 1.94225 3.808 7.13725 9.99677 16.3272 17.4802 16.5898  