OpenMP Review

29.1 : Concepts review
29.1.1 : Basic concepts
29.1.2 : Parallel regions
29.1.3 : Work sharing
29.1.4 : Data scope
29.1.5 : Synchronization
29.1.6 : Tasks
29.2 : Review questions
29.2.1 : Directives
29.2.2 : Parallelism
29.2.3 : Data and synchronization
29.2.3.1 :
29.2.3.2 :
29.2.3.3 :
29.2.4 : Reductions
29.2.4.1 :
29.2.4.2 :
29.2.5 : Barriers
29.2.6 : Data scope
29.2.7 : Tasks
29.2.8 : Scheduling

29 OpenMP Review

29.1 Concepts review


29.1.1 Basic concepts

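An OpenMP program begins execution as a single thread; parallelism is introduced through directives, and the number of threads is typically controlled through the OMP_NUM_THREADS environment variable or a call to omp_set_num_threads. A minimal sketch (compile with an OpenMP flag such as -fopenmp):

#include <stdio.h>
#include <omp.h>

int main() {
  // outside any parallel region only the initial thread executes
  printf("cores available: %d\n", omp_get_num_procs());
  printf("max threads: %d\n", omp_get_max_threads());
  return 0;
}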

29.1.2 Parallel regions


A parallel region is a block of code that is executed by a team of threads: every thread in the team executes the same block.
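A minimal sketch:

#include <stdio.h>
#include <omp.h>

int main() {
#pragma omp parallel
  { // this block is executed once by every thread of the team
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}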

29.1.3 Work sharing

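Work-sharing constructs, chiefly the loop construct, divide work among the threads of an existing team; the loop construct carries an implicit barrier at its end unless a nowait clause is given. A minimal sketch:

#include <stdio.h>
#include <omp.h>
#define N 100

int main() {
  double y[N];
  // the N iterations are divided over the threads of the team;
  // an implicit barrier follows the loop
#pragma omp parallel for
  for (int i=0; i<N; i++)
    y[i] = 2.0*i;
  printf("y[N-1] = %g\n", y[N-1]);
  return 0;
}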

29.1.4 Data scope

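Variables declared outside a parallel region are shared between the threads by default; clauses such as private, firstprivate, and shared override this per variable. A minimal sketch:

#include <stdio.h>
#include <omp.h>

int main() {
  int t = -1;  // shared by default, but made private below
#pragma omp parallel private(t)
  {
    // each thread has its own, initially undefined, copy of t
    t = omp_get_thread_num();
    printf("my copy of t: %d\n", t);
  }
  printf("the outer t is still: %d\n", t);  // the outer t was never written
  return 0;
}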

29.1.5 Synchronization

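Synchronization constructs include the explicit barrier, critical, and atomic directives, as well as the implicit barriers at the end of parallel regions and work-sharing constructs. A minimal sketch of a critical section protecting a shared update:

#include <stdio.h>
#include <omp.h>

int main() {
  int count = 0;  // shared counter
#pragma omp parallel
  {
#pragma omp critical
    count++;  // only one thread at a time performs the update
  }
  printf("count: %d\n", count);  // equals the number of threads
  return 0;
}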

29.1.6 Tasks

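A task packages a block of code for later execution by any thread of the team; tasks are typically generated by one thread inside a single construct, and taskwait suspends the generating task until its child tasks have finished. A minimal sketch:

#include <stdio.h>
#include <omp.h>

int main() {
  int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
  {
    // one thread generates the tasks; any thread may execute them
#pragma omp task shared(x)
    x = 1;
#pragma omp task shared(y)
    y = 2;
#pragma omp taskwait  // wait for the child tasks before using x and y
    printf("x+y = %d\n", x+y);
  }
  return 0;
}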

29.2 Review questions


29.2.1 Directives


What do the following programs output?


#include <stdio.h>
#include <omp.h>

int main() {
  printf("procs %d\n",
    omp_get_num_procs());
  printf("threads %d\n",
    omp_get_num_threads());
  printf("num %d\n",
    omp_get_thread_num());
  return 0;
}

#include <stdio.h>
#include <omp.h>

int main() {
#pragma omp parallel
  {
  printf("procs %d\n",
    omp_get_num_procs());
  printf("threads %d\n",
    omp_get_num_threads());
  printf("num %d\n",
    omp_get_thread_num());
  }
  return 0;
}


Program main
  use omp_lib
  print *,"Procs:",&
    omp_get_num_procs()
  print *,"Threads:",&
    omp_get_num_threads()
  print *,"Num:",&
    omp_get_thread_num()
End Program

Program main
  use omp_lib
!$OMP parallel
  print *,"Procs:",&
    omp_get_num_procs()
  print *,"Threads:",&
    omp_get_num_threads()
  print *,"Num:",&
    omp_get_thread_num()
!$OMP end parallel
End Program


29.2.2 Parallelism


Can the following loops be parallelized? If so, how? (Assume that all arrays are already filled in, and that there are no out-of-bounds errors.)


// variant #1
for (i=0; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i] + c[i+1];
}

// variant #2
for (i=0; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i+1] + c[i+1];
}

// variant #3
for (i=1; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i-1] + c[i+1];
}

// variant #4
for (i=1; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i+1] = 2*x[i-1] + c[i+1];
}


! variant #1
do i=1,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i) + c(i+1)
end do

! variant #2
do i=1,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i+1) + c(i+1)
end do

! variant #3
do i=2,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i-1) + c(i+1)
end do

! variant #4
do i=2,N
  x(i) = a(i)+b(i+1)
  a(i+1) = 2*x(i-1) + c(i+1)
end do


29.2.3 Data and synchronization


29.2.3.1


What is the output of the following fragments? Assume that there are four threads.


// variant #1
int nt;
#pragma omp parallel
  {
  nt = omp_get_thread_num();
  printf("thread number: %d\n",nt);
  }

// variant #2
int nt;
#pragma omp parallel private(nt)
  {
  nt = omp_get_thread_num();
  printf("thread number: %d\n",nt);
  }

// variant #3
int nt;
#pragma omp parallel
  {
#pragma omp single
    {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
    }
  }

// variant #4
int nt;
#pragma omp parallel
  {
#pragma omp master
    {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
    }
  }

// variant #5
int nt;
#pragma omp parallel
  {
#pragma omp critical
    {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
    }
  }


! variant #1
  integer nt
!$OMP parallel
  nt = omp_get_thread_num()
  print *,"thread number:",nt
!$OMP end parallel

! variant #2
  integer nt
!$OMP parallel private(nt)
  nt = omp_get_thread_num()
  print *,"thread number:",nt
!$OMP end parallel

! variant #3
  integer nt
!$OMP parallel
!$OMP single
    nt = omp_get_thread_num()
    print *,"thread number:",nt
!$OMP end single
!$OMP end parallel

! variant #4
  integer nt
!$OMP parallel
!$OMP master
    nt = omp_get_thread_num()
    print *,"thread number:",nt
!$OMP end master
!$OMP end parallel

! variant #5
  integer nt
!$OMP parallel
!$OMP critical
    nt = omp_get_thread_num()
    print *,"thread number:",nt
!$OMP end critical
!$OMP end parallel

29.2.3.2


The following is an attempt to parallelize a serial code. Assume that all variables and arrays are defined. What errors and potential problems do you see in this code? How would you fix them?


#pragma omp parallel
{
  x = f();
  #pragma omp for
  for (i=0; i<N; i++)
    y[i] = g(x,i);
  z = h(y);
}

!$OMP parallel
  x = f()
!$OMP do
  do i=1,N
    y(i) = g(x,i)
  end do
!$OMP end do 
  z = h(y)
!$OMP end parallel


29.2.3.3


Assume two threads. What does the following program output?

int a;
#pragma omp parallel private(a)
{
  ...
  a = 0;
  #pragma omp for
  for (int i = 0; i < 10; i++)
  {
    #pragma omp atomic
    a++;
  }
  #pragma omp single
    printf("a=%e\n",a);
}

29.2.4 Reductions


29.2.4.1


Is the following code correct? Is it efficient? If not, can you improve it?

#pragma omp parallel shared(r)
{
  int x;
  x = f(omp_get_thread_num());
#pragma omp critical
  r += f(x);
}

29.2.4.2


Compare two fragments:

// variant 1
#pragma omp parallel reduction(+:s)
#pragma omp for
  for (i=0; i<N; i++)
    s += f(i);

// variant 2
#pragma omp parallel 
#pragma omp for reduction(+:s)
  for (i=0; i<N; i++)
    s += f(i);

! variant 1
!$OMP parallel reduction(+:s)
!$OMP do
  do i=1,N
    s = s + f(i)
  end do
!$OMP end do
!$OMP end parallel 

! variant 2
!$OMP parallel 
!$OMP do reduction(+:s)
  do i=1,N
    s = s + f(i)
  end do
!$OMP end do
!$OMP end parallel 

Do they compute the same thing?


29.2.5 Barriers


Are the following two code fragments well defined?

#pragma omp parallel 
{
#pragma omp for
for (mytid=0; mytid<nthreads; mytid++)
  x[mytid] = some_calculation();
#pragma omp for
for (mytid=0; mytid<nthreads-1; mytid++)
  y[mytid] = x[mytid]+x[mytid+1];
}

#pragma omp parallel 
{
#pragma omp for
for (mytid=0; mytid<nthreads; mytid++)
  x[mytid] = some_calculation();
#pragma omp for nowait
for (mytid=0; mytid<nthreads-1; mytid++)
  y[mytid] = x[mytid]+x[mytid+1];
}

29.2.6 Data scope


The following program is supposed to initialize as many rows of the array as there are threads.


int main() {
  int i,icount,iarray[100][100];
  icount = -1;
#pragma omp parallel private(i)
  {
#pragma omp critical
    { icount++; }
    for (i=0; i<100; i++) 
      iarray[icount][i] = 1;
  }
  return 0;
}

Program main
  integer :: i,icount,iarray(100,100)
  icount = 0
!$OMP parallel private(i)
!$OMP critical
    icount = icount + 1
!$OMP end critical
    do i=1,100
      iarray(icount,i) = 1
    end do
!$OMP end parallel
End program

Describe the behavior of the program, with supporting argumentation.

What do you think of this solution:


#pragma omp parallel private(i) shared(icount)
  {
#pragma omp critical
    { icount++;
      for (i=0; i<100; i++) 
        iarray[icount][i] = 1;
    }
  }

!$OMP parallel private(i) shared(icount)
!$OMP critical
    icount = icount+1
    do i=1,100
      iarray(icount,i) = 1
    end do
!$OMP end critical
!$OMP end parallel

29.2.7 Tasks


Fix two things in the following example:


#pragma omp parallel
#pragma omp single
{
  int x,y,z;
#pragma omp task
  x = f();
#pragma omp task
  y = g();
#pragma omp task
  z = h();
  printf("sum=%d\n",x+y+z);
}

  integer :: x,y,z
!$OMP parallel
!$OMP single

!$OMP task
  x = f()
!$OMP end task

!$OMP task
  y = g()
!$OMP end task

!$OMP task
  z = h()
!$OMP end task

  print *,"sum=",x+y+z
!$OMP end single
!$OMP end parallel

29.2.8 Scheduling


Compare these two fragments. Do they compute the same result? What can you say about their efficiency?

#pragma omp parallel
#pragma omp single
  {
    for (i=0; i<N; i++) {
    #pragma omp task
      x[i] = f(i);
    }
    #pragma omp taskwait
  }

#pragma omp parallel
#pragma omp for schedule(dynamic)
  for (i=0; i<N; i++) {
    x[i] = f(i);
  }

How would you make the second loop more efficient? Can you do something similar for the first loop?
