Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. download the textbook at https:/theartofhpc.com/pcse
$\newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{{\equiv\atop\mathrm{\scriptstyle D}}}}}$ 24.1 : Task data
24.5 : More
24.5.1 : Scheduling points
24.5.2 : Hints for performance improvement
24.6 : Examples
24.6.1 : Fibonacci
24.6.2 : Binomial coefficients

Tasks are a mechanism that OpenMP uses behind the scenes: if you specify something as being a task, OpenMP will create a block of work': a section of code plus the data environment in which it occurred. This block is set aside for execution at some later point. Thus, task-based code usually looks something like this:

#pragma omp parallel
{
// generate a bunch of tasks
// the result from the tasks is now available
}


For instance, a parallel loop was always implicitly translated to something like:

Sequential loop:

for (int i=0; i<N; i++)
f(i);


Parallel loop:

for (int ib=0; ib<nblocks; ib++) {
int first=... last=... ;
for (int i=first; i<last; i++)
f(i)
}
// the results from the loop are available


If we stick with this example of implementing a parallel loop through tasks, the next question is: precisely who generates the tasks? The following code has a serious problem:

// WRONG. DO NOT WRITE THIS
#pragma omp parallel
for (int ib=0; ib<nblocks; ib++) {
int first=... last=... ;
for (int i=first; i<last; i++)
f(i)
}


because the parallel region creates a team, and each thread in the team executes the task-generating code. Instead, we use the following idiom:

#pragma omp parallel
#pragma omp single
for (int ib=0; ib<nblocks; ib++) {
// setup stuff
}


1. A parallel region creates a team of threads;

2. a single thread then creates the tasks, adding them to a queue that belongs to the team,

3. and all the threads in that team (possibly including the one that generated the tasks)

Btw, the actual task queue is not visible to the programmer. Another aspect that is out of the programmer's control is the exact timing of the execution of the task: this is up to a task scheduler  , which operates invisible to the programmer.

The task mechanism allows you to do things that are hard or impossible with the loop and section constructs. For instance, a \indexterm{while loop} traversing a linked list can be implemented with tasks:

 \toprule Code Execution \midrule p = head_of_list(); one thread traverses the list \n{while (!end_of_list(p)) \{} #pragma omp task a task is created, process( p ); one for each element p = next_element(p); the generating thread goes on without waiting \ } the tasks are executed while more are being generated. \bottomrule

Another concept that was hard to parallelize earlier is the while loop'. This does not fit the requirement for OpenMP parallel loops that the loop bound needs to be known before the loop executes.

Exercise

Use tasks to find the smallest factor of a large number (using $2999\cdot 3001$ as test case): generate a task for each trial factor. Start with this code:

  int factor=0;
#pragma omp parallel
#pragma omp single
for (int f=2; f<4000; f++) {
{ // see if f' is a factor
if (N%f==0) { // found factor!
factor = f;
}
}
if (factor>0)
break;
}
if (factor>0)
printf("Found a factor: %d\n",factor);


• Turn the factor finding block into a task.

• Run your program a number of times:

for i in seq 1 1000 ; do ./taskfactor ; done | grep -v 2999


Does it find the wrong factor? Why? Try to fix this.
• Once a factor has been found, you should stop generating tasks. Let tasks that should not have been generated, meaning that they test a candidate larger than the factor found, print out a message.

End of exercise

Treatment of data in a task is somewhat subtle. The basic problem is that a task gets created at one time, and executed at some later time. Thus, if shared data is accessed, does the task see the value at creation time or at execution time? In fact, both possibilities make sense depending on the application, so we need to discuss the rules when which possibility applies.

The first rule is that shared data is shared in the task, but private data becomes \indexclause{firstprivate}. To see the distinction, consider two code fragments.

int count = 100;
#pragma omp parallel
#pragma omp single
{
while (count>0) {
{
int countcopy = count;
if (count==50) {
sleep(1);
printf("%d,%d\n",
count,countcopy);
} // end if
count--;
}     // end while
}       // end single


#pragma omp parallel
#pragma omp single
{
int count = 100;
while (count>0) {
{
int countcopy = count;
if (count==50) {
sleep(1);
printf("%d,%d\n",
count,countcopy);
} // end if
count--;
}     // end while
}       // end single


In the first example, the variable count is declared outside the parallel region and is therefore shared. When the print statement is executed, all tasks will have been generated, and so count will be zero. Thus, the output will likely be 0,50  .

In the second example, the count variable is private to the thread creating the tasks, and so it will be firstprivate in the task, preserving the value that was current when the task was created.

Even though the above segment looks like a linear set of statements, it is impossible to say when the code after the task directive will be executed. This means that the following code is incorrect:

  x = f();
{ y = g(x); }
z = h(y);


Explanation: when the statement computing z is executed, the task computing  y has only been scheduled; it has not necessarily been executed yet.

In order to have a guarantee that a task is finished, you need the taskwait directive. The following creates two tasks, which can be executed in parallel, and then waits for the results:

 \toprule Code Execution \midrule x = f(); the variable x gets a value #pragma omp task \multirow{4}{*}{two tasks are created with the current value of x } \n{ \{ y1 = g1(x); \}} #pragma omp task \n{ \{ y2 = g2(x); \}} #pragma omp taskwait the thread waits until the tasks are finished z = h(y1)+h(y2); the variable z is computed using the task results \bottomrule

The task pragma is followed by a structured block. Each time the structured block is encountered, a new task is generated. On the other hand taskwait is a standalone directive; the code that follows is just code, it is not a structured block belonging to the directive.

Another aspect of the distinction between generating tasks and executing them: usually the tasks are generated by one thread, but executed by many threads. Thus, the typical idiom is:

#pragma omp parallel
#pragma omp single
{
}


This makes it possible to execute loops in parallel that do not have the right kind of iteration structure for a omp parallel for  . As an example, you could traverse and process a linked list:

#pragma omp parallel
#pragma omp single
{
while (!tail(p)) {
p = p->next();
process(p)
}
}


You can indicate task dependencies in several ways:

1. Using the task wait' directive you can explicitly indicate the join of the

forked tasks. The instruction after the wait directive will therefore be dependent on the spawned tasks.

2. The taskgroup directive, followed by a structured block, ensures completion of all tasks created in the block, even if recursively created.

3. Each OpenMP task can have a \indexclause{depend} clause, indicating what data dependency of the task; section  24.3  . By indicating what data is produced or absorbed by the tasks, the scheduler can construct the dependency graph for you.

Another mechanism for dealing with tasks is the taskgroup : a task group is a code block that can contain task directives; all these tasks need to be finished before any statement after the block is executed.

A task group is somewhat similar to having a taskwait directive after the block. The big difference is that that taskwait directive does not wait for tasks that are recursively generated, while a taskgroup does.

It is possible to put a partial ordering on tasks through use of the \indexclause{depend} clause. For example, in

#pragma omp task
x = f()
y = g(x)


it is conceivable that the second task is executed before the first, possibly leading to an incorrect result. This is remedied by specifying:

#pragma omp task depend(out:x)
x = f()
y = g(x)


• These dependencies only hold between sibling tasks.
• The depdending data items of the various tasks are either identical or disjoint. In particular, dependencies on different sections of an array are not allowed, though a compiler may not always catch this.

Exercise Consider the following code:

for i in [1:N]:
x[0,i] = some_function_of(i)
x[i,0] = some_function_of(i)

for i in [1:N]:
for j in [1:N]:
x[i,j] = x[i-1,j]+x[i,j-1]


• Observe that the second loop nest is not amenable to OpenMP loop parallelism.

• Can you think of a way to realize the computation with OpenMP loop parallelism? Hint: you need to rewrite the code so that the same operations are done in a different order.

• Use tasks with dependencies to make this code parallel without any rewriting: the only change is to add OpenMP directives.

End of exercise

Tasks dependencies are used to indicated how two uses of one data item relate to each other. Since either use can be a read or a write, there are four types of dependencies.

 ... omp task depend(OUT:x)
foo(x)
foo(x)


• [WaR (Write after Read)] The first task reads and item, and the second task overwrites it. The second task has to be executed second to prevent overwriting the initial value:

 ... omp task depend( IN:x)
foo(x)
foo(x)


• [WaW (Write after Write)] Both tasks set the same variable. Since the variable can be used by an intermediate task, the two writes have to be executed in this order.

 ... omp task depend(OUT:x)
foo(x)
foo(x)


• [RaR (Read after Read)] Both tasks read a variable. Since neither tasks has an out' declaration, they can run in either order.

 ... omp task depend(IN:x)
foo(x)
foo(x)


The \indexclause{reduction} only pertains to ordinary parallel loops, not to taskgroup loops of tasks. To do a reduction over computations in tasks you need the \indexclausedef{task_reduction} clause (a OpenMP- feature):

#pragma omp taskgroup task_reduction(+:sum)


The task group can contain both task that contribute to the reduction, and ones that don't. The former type needs a clause \indexclausedef{in_reduction}:

#pragma omp task in_reduction(+:sum)


As an example, here the sum $\sum_{i=1}^{100} i$ is computed with tasks:

// taskreduct.c
#pragma omp parallel
#pragma omp single
{
}
}


24.5 More

crumb trail: > omp-task > More

24.5.1 Scheduling points

crumb trail: > omp-task > More > Scheduling points

Normally, a task stays tied to the thread that first executes it. However, at a task scheduling point the thread may switch to the execution of another task created by the same team.

• There is a scheduling point after explicit task creation. This means that, in the above examples, the thread creating the tasks can also participate in executing them.

On the other hand a task created with them \indexclause{untied} clause on the task pragma is never tied to one thread. This means that after suspension at a scheduling point any thread can resume execution of the task. If you do this, beware that the value of a thread-id does not stay fixed. Also locks become a problem.

Example: if a thread is waiting for a lock, with a scheduling point it can suspend the task and work on another task.

while (!omp_test_lock(lock))
;


24.5.2 Hints for performance improvement

crumb trail: > omp-task > More > Hints for performance improvement

If a task involves only a small amount of work, the scheduling overhead may negate any performance gain. There are two ways of executing the task code directly:

• The if clause will only create a task if the test is true:

#pragma omp task if (n>100)
f(n)


• The if clause may still lead to recursively generated tasks. On the other hand, final will execute the code, and will also skip any recursively created tasks:

#pragma omp task final(level<3)


If you want to indicate that certain tasks are more important than others, use the priority clause:

#pragma omp task priority(5)
`

where the priority is any non-negative scalar less than OMP_MAX_TASK_PRIORITY  .

It is possible (in OpenMP-) to cancel tasks. This is useful when tasks are used to perform a search: the task that finds the result first can cancel any outstanding search tasks. See section  18.3 for details.

Exercise Modify the prime finding example to use cancel  .
End of exercise

24.6 Examples

crumb trail: > omp-task > Examples

24.6.1 Fibonacci

crumb trail: > omp-task > Examples > Fibonacci

As an example of the use of tasks, consider computing an array of Fibonacci values: \cverbatimsnippet[examples/omp/c/taskgroup0.c]{fiboarray} If you simply turn each calculation into a task, results will be unpredictable (confirm this!) since tasks can be executed in any sequence. To solve this, we put dependencies on the tasks: \cverbatimsnippet[examples/omp/c/taskgroup2.c]{fibotaskdepend}

24.6.2 Binomial coefficients

crumb trail: > omp-task > Examples > Binomial coefficients

Exercise An array of binomial coefficients can be computed as follows: \cverbatimsnippet[code/omp/c/binomial1.c]{binomialarray} Putting a single task group around the double loop, and use

\indexclause{depend} clauses to make the execution satisfy the proper dependencies.
End of exercise