OpenMP topic: Tasks

Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse
24.1 : Task generation
24.2 : Task data
24.3 : Task synchronization
24.3.1 : Task groups
24.3.2 : Task loop
24.4 : Task dependencies
24.5 : Task reduction
24.6 : More
24.6.1 : Scheduling points
24.6.2 : Hints for performance improvement
24.6.3 : Task canceling
24.7 : Examples
24.7.1 : Fibonacci
24.7.2 : Binomial coefficients

24 OpenMP topic: Tasks

Tasks are a mechanism that OpenMP uses behind the scenes: if you specify something as being a task, OpenMP will create a `block of work': a section of code plus the data environment in which it occurred. This block is set aside for execution at some later point. Thus, task-based code usually looks something like this:

#pragma omp parallel
{
  // generate a bunch of tasks
# pragma omp taskwait
  // the result from the tasks is now available  
}
For instance, a parallel loop has always implicitly been translated to something like this:

Sequential loop:

for (int i=0; i<N; i++)
  f(i);

Parallel loop:

for (int ib=0; ib<nblocks; ib++) {
  int first=... last=... ;
# pragma omp task
  for (int i=first; i<last; i++)
    f(i);
}
#pragma omp taskwait
// the results from the loop are available
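
To make this concrete, here is a minimal compilable sketch of the same blocking scheme; the work function f, the problem size N, and the block count NBLOCKS are placeholder names chosen for illustration, not part of the original example.

#include <stdio.h>

#define N 1000       // hypothetical problem size
#define NBLOCKS 10   // hypothetical number of blocks, hence of tasks

void f(int i) { (void)i; /* work on element i would go here */ }

int main(void) {
#pragma omp parallel
#pragma omp single
  {
    for (int ib=0; ib<NBLOCKS; ib++) {
      int first = ib*(N/NBLOCKS);
      int last  = (ib==NBLOCKS-1) ? N : first + N/NBLOCKS;
#pragma omp task firstprivate(first,last)  // one task per block of iterations
      for (int i=first; i<last; i++)
        f(i);
    }
#pragma omp taskwait
    // the results from the loop are available here
  }
  return 0;
}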

24.1 Task generation


If we stick with this example of implementing a parallel loop through tasks, the next question is: precisely who generates the tasks? The following code has a serious problem:

// WRONG. DO NOT WRITE THIS
#pragma omp parallel
for (int ib=0; ib<nblocks; ib++) {
  int first=... last=... ;
# pragma omp task
  for (int i=first; i<last; i++)
    f(i);
}
because the parallel region creates a team, and each thread in the team executes the task-generating code, so every task would be generated once per thread. Instead, we use the following idiom:
#pragma omp parallel
#pragma omp single
for (int ib=0; ib<nblocks; ib++) {
  // setup stuff
# pragma omp task
  // task stuff
}
  1. A parallel region creates a team of threads;
  2. a single thread then creates the tasks, adding them to a queue that belongs to the team;
  3. and all the threads in that team (possibly including the one that generated the tasks) execute the tasks at some point in the future.

Note that the actual task queue is not visible to the programmer. Another aspect that is out of the programmer's control is the exact timing of the execution of a task: this is up to a task scheduler, which operates invisibly to the programmer.

The task mechanism allows you to do things that are hard or impossible with the loop and section constructs. For instance, a while loop traversing a linked list can be implemented with tasks:

Code                                    Execution
p = head_of_list();                     one thread traverses the list
while (!end_of_list(p)) {
#pragma omp task                        a task is created,
  process( p );                           one for each element
  p = next_element(p);                  the generating thread goes on without waiting
}                                       the tasks are executed while more are being generated

The while loop is a construct that was hard to parallelize before: it does not fit the requirement for OpenMP parallel loops that the loop bound needs to be known before the loop executes.
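
As a minimal compilable sketch of this pattern, consider the following; the node structure, the process function, and the traverse routine are made up for illustration and not part of the original text.

#include <stdio.h>

// hypothetical list node type
struct node { int value; struct node *next; };

void process(struct node *p) { printf("processing %d\n",p->value); }

void traverse(struct node *head) {
#pragma omp parallel
#pragma omp single
  {
    struct node *p = head;
    while (p != NULL) {
#pragma omp task firstprivate(p)  // p is captured at task creation time
      process(p);
      p = p->next;                // the generating thread goes on without waiting
    }
  } // implicit barrier: all tasks are finished when the parallel region ends
}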

Exercise Use tasks to find the smallest factor of a large number (using $2999\cdot 3001$ as test case): generate a task for each trial factor. Start with this code:

  long int N = 2999L*3001;  // the test case suggested above
  int factor=0;
#pragma omp parallel
#pragma omp single
  for (int f=2; f<4000; f++) {
    { // see if `f' is a factor
      if (N%f==0) { // found factor!
        factor = f;
      }
    }
    if (factor>0)
      break;
  }
  if (factor>0)
    printf("Found a factor: %d\n",factor);

End of exercise

24.2 Task data


Treatment of data in a task is somewhat subtle. The basic problem is that a task gets created at one time, and executed at some later time. Thus, if shared data is accessed, does the task see the value at creation time or at execution time? In fact, both possibilities make sense depending on the application, so we need to discuss the rules that determine which possibility applies when.

The first rule is that shared data is shared in the task, but private data becomes firstprivate. To see the distinction, consider two code fragments.

int count = 100;
#pragma omp parallel
#pragma omp single
{
  while (count>0) {
#   pragma omp task
    {
      int countcopy = count;
      if (count==50) {
        sleep(1);
        printf("%d,%d\n",
            count,countcopy);
      } // end if
    }   // end task
    count--;
  }     // end while
}       // end single

#pragma omp parallel
#pragma omp single
{
  int count = 100;
  while (count>0) {
#   pragma omp task
    {
      int countcopy = count;
      if (count==50) {
        sleep(1);
        printf("%d,%d\n",
            count,countcopy);
      } // end if
    }   // end task
    count--;
  }     // end while
}       // end single

In the first example, the variable count is declared outside the parallel region and is therefore shared. By the time the print statement is executed, all tasks will have been generated and count will have been decremented to zero; the copy, however, was made when count was 50. Thus, the output will likely be 0,50.

In the second example, the count variable is private to the thread creating the tasks, and so it becomes firstprivate in the task, preserving the value that was current when the task was created: the output will be 50,50.

24.3 Task synchronization


Even though a code fragment with tasks may look like a linear sequence of statements, it is impossible to say when the code after a task directive will be executed relative to the task itself. This means that the following code is incorrect:

  x = f();
#pragma omp task
  { y = g(x); }
  z = h(y);  
Explanation: when the statement computing z is executed, the task computing  y has only been scheduled; it has not necessarily been executed yet.

In order to have a guarantee that a task is finished, you need the taskwait directive. The following creates two tasks, which can be executed in parallel, and then waits for the results:

Code                                    Execution
x = f();                                the variable x gets a value
#pragma omp task
  { y1 = g1(x); }                       two tasks are created
#pragma omp task                          with the current value of x
  { y2 = g2(x); }
#pragma omp taskwait                    the thread waits until the tasks are finished
z = h(y1)+h(y2);                        the variable z is computed using the task results

The task pragma is followed by a structured block; each time the task construct is encountered, a new task is generated from that block. On the other hand, taskwait is a standalone directive: the code that follows it is just code, not a structured block belonging to the directive.

You can indicate task dependencies in several ways:

  1. Using the taskwait directive you can explicitly indicate the join of the forked tasks. The instructions after the taskwait will therefore be dependent on the spawned tasks.
  2. The taskgroup directive is discussed in section 24.3.1.
  3. The taskloop directive is discussed in section 24.3.2.
  4. Each OpenMP task can have a depend clause, indicating the data dependencies of the task; see section 24.4. By indicating what data is produced or absorbed by the tasks, the scheduler can construct the dependency graph for you.

24.3.1 Task groups


The taskgroup directive, followed by a structured block, ensures completion of all tasks created in the block, even if recursively created.

A task group is somewhat similar to having a taskwait directive after the block. The big difference is that the taskwait directive does not wait for tasks that are recursively generated, while a taskgroup does.
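
A sketch of the difference, assuming a recursive tree traversal; the tree type and the visit function are hypothetical, chosen only to show where the two directives differ.

// hypothetical tree node and per-node work
struct tree_node { int value; int nchildren; struct tree_node **children; };
void visit(struct tree_node *n) { (void)n; /* work on node n */ }

void explore(struct tree_node *n) {
  visit(n);
  for (int c=0; c<n->nchildren; c++) {
#pragma omp task firstprivate(c)   // recursively created task
    explore(n->children[c]);
  }
}

void driver(struct tree_node *root) {
#pragma omp parallel
#pragma omp single
  {
#pragma omp taskgroup
    {
#pragma omp task
      explore(root);
    } // waits for this task and everything it created recursively

    // a taskwait here would only wait for the one task created above,
    // not for the tasks generated inside explore()
  }
}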

24.3.2 Task loop


The taskloop directive prefaces a for/do loop, just like a for pragma. The difference is that now every iteration is turned into a task, rather than groups of iterations as in the for case. The end of the loop is a synchronization point: statements after the loop are only executed when all tasks from the loop are finished.

There is a master taskloop directive that is shorthand for a master region containing only a taskloop.
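
A minimal sketch of taskloop; the array, the loop body, and the sizes are made up for illustration.

#include <stdio.h>
#define N 1000

int main(void) {
  double x[N];
#pragma omp parallel
#pragma omp single
  {
#pragma omp taskloop               // iterations of the loop become tasks
    for (int i=0; i<N; i++)
      x[i] = 2.0*i;
    // the taskloop acts as a synchronization point (unless nogroup is given),
    // so the array is fully computed here:
    printf("last value: %e\n",x[N-1]);
  }
  return 0;
}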

24.4 Task dependencies


It is possible to put a partial ordering on tasks through use of the depend clause. For example, in

#pragma omp task
  x = f();
#pragma omp task
  y = g(x);
it is conceivable that the second task is executed before the first, possibly leading to an incorrect result. This is remedied by specifying:
#pragma omp task depend(out:x)
  x = f();
#pragma omp task depend(in:x)
  y = g(x);

Exercise Consider the following code. Use tasks with depend clauses to parallelize the second loop nest, respecting the dependence of each x[i,j] on x[i-1,j] and x[i,j-1]:

for i in [1:N]:
    x[0,i] = some_function_of(i)
    x[i,0] = some_function_of(i)

for i in [1:N]:
    for j in [1:N]:
        x[i,j] = x[i-1,j]+x[i,j-1]

End of exercise

Task dependencies are used to indicate how two uses of one data item relate to each other. Since either use can be a read or a write, there are four types of dependencies.
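
As an illustration of these cases, here is a hedged sketch (the variables and values are made up) of how read and write uses translate into depend clauses; only the read-after-read case needs no ordering.

#include <stdio.h>

int main(void) {
  double x=0., y=0.;
#pragma omp parallel
#pragma omp single
  {
#pragma omp task depend(out:x)               // write x
    x = 1.;
#pragma omp task depend(in:x) depend(out:y)  // read-after-write on x, write y
    y = 2.*x;
#pragma omp task depend(in:y)                // read y: waits for the writer of y
    printf("y=%e\n",y);
#pragma omp task depend(out:y)               // write-after-read: waits for the reader
    y = 0.;
#pragma omp taskwait
  }
  return 0;
}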

24.5 Task reduction


The reduction clause only pertains to ordinary parallel loops, not to groups of tasks. To do a reduction over computations in tasks you need the task_reduction clause (an OpenMP 5.0 feature):

#pragma omp taskgroup task_reduction(+:sum)
The task group can contain both tasks that contribute to the reduction and ones that don't. The former type needs an in_reduction clause:
#pragma omp task in_reduction(+:sum)

As an example, here the sum $\sum_{i=1}^{100} i$ is computed with tasks:

// taskreduct.c
#pragma omp parallel
#pragma omp single
  {
#pragma omp taskgroup task_reduction(+:sum)
    for (int itask=1; itask<=bound; itask++) {
#pragma omp task in_reduction(+:sum)
      sum += itask;
    }
  }

24.6 More


24.6.1 Scheduling points


Normally, a task stays tied to the thread that first executes it. However, at a task scheduling point the thread may switch to the execution of another task created by the same team.

On the other hand, a task created with the untied clause on the task pragma is never tied to one thread. This means that after suspension at a scheduling point any thread can resume execution of the task. If you do this, beware that the value of a thread-id does not stay fixed. Also, locks become a problem.

Example: if a thread is waiting for a lock, a scheduling point lets it suspend the current task and work on another task:

while (!omp_test_lock(lock))
#pragma omp taskyield
  ;

24.6.2 Hints for performance improvement


If a task involves only a small amount of work, the scheduling overhead may negate any performance gain. There are two ways of executing the task code directly:

  1. an if clause with a false condition makes the task undeferred: the encountering thread suspends its current task and executes the new task immediately;
  2. a final clause with a true condition makes the task final: all tasks created inside it become included tasks, executed immediately rather than being deferred (sketched below).
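
For instance, a sketch of the second mechanism on a recursive task; the routine and the cutoff value are made up for illustration.

// recursive task generation with a cutoff: below depth 5, the created task
// is final, so all tasks generated inside it are executed immediately
void recurse(int depth) {
  if (depth==0) return;       // base case: actual work would go here
#pragma omp task final(depth<5)
  recurse(depth-1);
#pragma omp task final(depth<5)
  recurse(depth-1);
#pragma omp taskwait
}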

If you want to indicate that certain tasks are more important than others, use the priority clause:

#pragma omp task priority(5)
where the priority is any non-negative integer less than OMP_MAX_TASK_PRIORITY.

24.6.3 Task canceling


It is possible (as of OpenMP 4.0) to cancel tasks. This is useful when tasks are used to perform a search: the task that finds the result first can cancel any outstanding search tasks. See section 18.3 for details.
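
A generic sketch of that pattern follows; the search predicate and the search range are made up, and cancellation has to be activated by setting OMP_CANCELLATION=true in the environment for the cancel directive to take effect.

#include <stdio.h>

// hypothetical test for whether a candidate is the item searched for
int is_match(long candidate) { return candidate==123456; }

int main(void) {
  long found = -1;            // only the matching task writes this, so no race
#pragma omp parallel
#pragma omp single
  {
#pragma omp taskgroup
    {
      for (long c=0; c<1000000; c++) {
#pragma omp task firstprivate(c)
        {
          if (is_match(c)) {
            found = c;
#pragma omp cancel taskgroup            // discard the remaining search tasks
          }
#pragma omp cancellation point taskgroup
        }
      }
    }
  }
  printf("found: %ld\n",found);
  return 0;
}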

Exercise Modify the prime finding example to use cancel.
End of exercise

24.7 Examples


24.7.1 Fibonacci


As an example of the use of tasks, consider computing an array of Fibonacci values (see examples/omp/c/taskgroup0.c in the book's example code). If you simply turn each calculation into a task, results will be unpredictable (confirm this!) since tasks can be executed in any sequence. To solve this, we put dependencies on the tasks (see examples/omp/c/taskgroup2.c).
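
Since the snippets are not reproduced here, the following is a minimal sketch of the dependency-based variant; the array length and the use of long are illustrative choices, not taken from the book's source files.

#include <stdio.h>
#define N 45

int main(void) {
  long fib[N];
  fib[0] = 1; fib[1] = 1;
#pragma omp parallel
#pragma omp single
  {
    for (int i=2; i<N; i++) {
      // each task waits for the two tasks that produce its inputs
#pragma omp task depend(in:fib[i-2],fib[i-1]) depend(out:fib[i]) firstprivate(i)
      fib[i] = fib[i-2] + fib[i-1];
    }
#pragma omp taskwait
  }
  printf("fib(%d) = %ld\n",N-1,fib[N-1]);
  return 0;
}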

24.7.2 Binomial coefficients


Exercise An array of binomial coefficients can be computed as in code/omp/c/binomial1.c. Put a single task group around the double loop, and use depend clauses to make the execution satisfy the proper dependencies.
End of exercise
