The declaration of a parallel region establishes a team of threads. This offers the possibility of parallelism, but to actually get meaningful parallel activity you need something more. OpenMP uses the concept of a work sharing construct: a way of dividing parallelizable work over a team of threads.
You have already seen loop parallelism as a way of distributing parallel work in the chapter OpenMP topic: Loop parallelism. We will now discuss other work sharing constructs.
Work sharing constructs
The work sharing constructs are:
- for (in C) and do (in Fortran): distributing the iterations of a loop;
- sections: distributing a fixed number of independent blocks of work;
- single: limiting execution of a block to one thread;
- workshare (Fortran only): parallelizing the implied loops of array syntax.
Sections
A parallel loop is an example of independent work units that are numbered. If instead you have a pre-determined number of independent work units, the sections construct is more appropriate. A sections construct can contain any number of section constructs. These need to be independent, and they can be executed by any available thread in the current team, including having multiple sections done by the same thread.
#pragma omp sections
{
  #pragma omp section
  // one calculation
  #pragma omp section
  // another calculation
}
This construct can be used to divide large blocks of independent work. Suppose that in the following line, both f(x) and g(x) are big calculations:
y = f(x) + g(x)

You could then write:
double y1,y2;
#pragma omp sections
{
  #pragma omp section
  y1 = f(x);
  #pragma omp section
  y2 = g(x);
}
y = y1+y2;

Instead of using two temporaries, you could also use a critical section; see section 23.2.2. However, the best solution is to have a reduction clause on the parallel sections directive. For the sum
y = f(x) + g(x)

you could then write:
// sectionreduct.c
float y=0;
#pragma omp parallel reduction(+:y)
#pragma omp sections
{
  #pragma omp section
  y += f();
  #pragma omp section
  y += g();
}
Single/master
The single pragma limits the execution of a block to a single thread. This can for instance be used to print tracing information or to perform I/O operations.
#pragma omp parallel
{
  #pragma omp single
  printf("We are starting this section!\n");
  // parallel stuff
}

Another use of single is to perform initializations in a parallel region:
int a;
#pragma omp parallel
{
  #pragma omp single
  a = f(); // some computation
  #pragma omp sections
  // various different computations using a
}
The point of the single directive in this last example is that the computation needs to be done only once: because of the shared memory, all threads subsequently see the computed value. The single directive has an implicit barrier after it, which guarantees that all threads have the correct value in their local memory (see section 23.4).
Exercise
What is the difference between this approach and how the same computation would be parallelized in MPI?
End of exercise
The master directive also enforces execution on a single thread, specifically the master thread of the team. However, this is not a work sharing construct, and therefore it does not carry the synchronization of an implicit barrier.
Exercise Modify the above code to read:
int a;
#pragma omp parallel
{
  #pragma omp master
  a = f(); // some computation
  #pragma omp sections
  // various different computations using a
}

This code is no longer correct. Explain.

End of exercise
Above we motivated the single directive as a way of initializing shared variables. It is also possible to use single to initialize private variables. In that case you add the copyprivate clause. This is a good solution if setting the variable requires I/O.
Exercise
Give two other ways to initialize a private variable, with all threads receiving the same value. Can you give scenarios where each of the three strategies would be preferable?
End of exercise
Fortran array syntax parallelization
Loops can be parallelized in both C and Fortran, with the parallel for and parallel do directives respectively. However, Fortran also has implied loops in its array syntax. To parallelize these you use the workshare directive, which exists only in Fortran; it can be applied to the implied loops of array syntax, as well as to forall loops.
We compare two versions of $C\leftarrow C+A\times B$ (where all operations are elementwise), running on TACC Frontera up to 56 cores.
Workshare based:
!! workshare2d.F90
!$omp parallel workshare
C = A*B + C
!$omp end parallel workshare
SIMD'ized loop:
!$omp parallel do simd
do i=1,dim
  do j=1,dim
    C(i,j) = C(i,j) + A(i,j) * B(i,j)
  end do
end do
!$omp end parallel do simd
With results:
SIMD times:      0.07115 0.04053 0.02498 0.01609 0.01210 0.01247 0.01765 0.02689
Speedup:         1       1.75549 2.84828 4.422   5.88017 5.70569 4.03116 2.64597
Workshare times: 0.06188 0.03186 0.01625 0.00867 0.00619 0.00379 0.00354 0.00373
Speedup:         1       1.94225 3.808   7.13725 9.99677 16.3272 17.4802 16.5898