Tasks are a mechanism that OpenMP uses behind the scenes: if you specify something as being a task, OpenMP will create a `block of work': a section of code plus the data environment in which it occurred. This block is set aside for execution at some later point. Thus, task-based code usually looks something like this:
#pragma omp parallel { // generate a bunch of tasks # pragma omp taskwait // the result from the tasks is now available }For instance, a parallel loop was always implicitly translated to something like:
Sequential loop:
for (int i=0; i<N; i++) f(i);
Parallel loop:
for (int ib=0; ib<nblocks; ib++) { int first=... last=... ; # pragma omp task for (int i=first; i<last; i++) f(i) } #pragma omp taskwait // the results from the loop are available
crumb trail: > omp-task > Task generation
If we stick with this example of implementing a parallel loop through tasks, the next question is: precisely who generates the tasks? The following code has a serious problem:
// WRONG. DO NOT WRITE THIS #pragma omp parallel for (int ib=0; ib<nblocks; ib++) { int first=... last=... ; # pragma omp task for (int i=first; i<last; i++) f(i) }because the parallel region creates a team, and each thread in the team executes the task-generating code. Instead, we use the following idiom:
#pragma omp parallel #pragma omp single for (int ib=0; ib<nblocks; ib++) { // setup stuff # pragma omp task // task stuff }
Btw, the actual task queue is not visible to the programmer. Another aspect that is out of the programmer's control is the exact timing of the execution of the task: this is up to a task scheduler , which operates invisible to the programmer.
The task mechanism allows you to do things that are hard or impossible with the loop and section constructs. For instance, a \indexterm{while loop} traversing a linked list can be implemented with tasks:
\toprule Code | Execution |
\midrule p = head_of_list(); | one thread traverses the list |
\n{while (!end_of_list(p)) \{} | |
#pragma omp task | a task is created, |
process( p ); | one for each element |
p = next_element(p); | the generating thread goes on without waiting |
\ } | the tasks are executed while |
more are being generated. | |
\bottomrule |
Another concept that was hard to parallelize earlier is the `while loop'. This does not fit the requirement for OpenMP parallel loops that the loop bound needs to be known before the loop executes.
Exercise Use tasks to find the smallest factor of a large number (using $2999\cdot 3001$ as test case): generate a task for each trial factor. Start with this code:
int factor=0; #pragma omp parallel #pragma omp single for (int f=2; f<4000; f++) { { // see if `f' is a factor if (N%f==0) { // found factor! factor = f; } } if (factor>0) break; } if (factor>0) printf("Found a factor: %d\n",factor);
for i in `seq 1 1000` ; do ./taskfactor ; done | grep -v 2999Does it find the wrong factor? Why? Try to fix this.
crumb trail: > omp-task > Task data
Treatment of data in a task is somewhat subtle. The basic problem is that a task gets created at one time, and executed at some later time. Thus, if shared data is accessed, does the task see the value at creation time or at execution time? In fact, both possibilities make sense depending on the application, so we need to discuss the rules which possibility applies when.
The first rule is that shared data is shared in the task, but private data becomes \indexclause{firstprivate}. To see the distinction, consider two code fragments.
int count = 100; #pragma omp parallel #pragma omp single { while (count>0) { # pragma omp task { int countcopy = count; if (count==50) { sleep(1); printf("%d,%d\n", count,countcopy); } // end if } // end task count--; } // end while } // end single
#pragma omp parallel #pragma omp single { int count = 100; while (count>0) { # pragma omp task { int countcopy = count; if (count==50) { sleep(1); printf("%d,%d\n", count,countcopy); } // end if } // end task count--; } // end while } // end single
In the first example, the variable count is declared outside the parallel region and is therefore shared. When the print statement is executed, all tasks will have been generated, and so count will be zero. Thus, the output will likely be 0,50 .
In the second example, the count variable is private to the thread creating the tasks, and so it will be firstprivate in the task, preserving the value that was current when the task was created.
crumb trail: > omp-task > Task synchronization
Even though the above segment looks like a linear set of statements, it is impossible to say when the code after the task directive will be executed. This means that the following code is incorrect:
x = f(); #pragma omp task { y = g(x); } z = h(y);Explanation: when the statement computing z is executed, the task computing y has only been scheduled; it has not necessarily been executed yet.
In order to have a guarantee that a task is finished, you need the taskwait directive. The following creates two tasks, which can be executed in parallel, and then waits for the results:
\toprule Code | Execution |
\midrule x = f(); | the variable x gets a value |
#pragma omp task | \multirow{4}{*}{two tasks are created with the current value of x } |
\n{ \{ y1 = g1(x); \}} | |
#pragma omp task | |
\n{ \{ y2 = g2(x); \}} | |
#pragma omp taskwait | the thread waits until the tasks are finished |
z = h(y1)+h(y2); | the variable z is computed using the task results |
\bottomrule |
The task pragma is followed by a structured block. Each time the structured block is encountered, a new task is generated. On the other hand taskwait is a standalone directive; the code that follows is just code, it is not a structured block belonging to the directive.
You can indicate task dependencies in several ways:
crumb trail: > omp-task > Task synchronization > Task groups
The taskgroup directive, followed by a structured block, ensures completion of all tasks created in the block, even if recursively created.
A task group is somewhat similar to having a taskwait directive after the block. The big difference is that that taskwait directive does not wait for tasks that are recursively generated, while a taskgroup does.
crumb trail: > omp-task > Task synchronization > Task loop
The taskloop directive prefaces a for/do loop, just like an for pragma. The difference is that now every iteration is turned into a task, rather than groups of iterations as in the for case. The end of the loop is a synchronization point: statements after the loop are only executed when all tasks from the loop are finished.
There is a mast taskloop directive that is shorthand for a master containing only a taskloop .
crumb trail: > omp-task > Task dependencies
It is possible to put a partial ordering on tasks through use of the \indexclause{depend} clause. For example, in
#pragma omp task x = f() #pragma omp task y = g(x)it is conceivable that the second task is executed before the first, possibly leading to an incorrect result. This is remedied by specifying:
#pragma omp task depend(out:x) x = f() #pragma omp task depend(in:x) y = g(x)
Exercise Consider the following code:
for i in [1:N]: x[0,i] = some_function_of(i) x[i,0] = some_function_of(i) for i in [1:N]: for j in [1:N]: x[i,j] = x[i-1,j]+x[i,j-1]
Tasks dependencies are used to indicated how two uses of one data item relate to each other. Since either use can be a read or a write, there are four types of dependencies.
... omp task depend(OUT:x) foo(x) ... omp task depend( IN:x) foo(x)
... omp task depend( IN:x) foo(x) ... omp task depend(OUT:x) foo(x)
... omp task depend(OUT:x) foo(x) ... omp task depend(OUT:x) foo(x)
... omp task depend(IN:x) foo(x) ... omp task depend(IN:x) foo(x)
crumb trail: > omp-task > Task reduction
The \indexclause{reduction} clause only pertains to ordinary parallel loops, not to taskgroup loops of tasks. To do a reduction over computations in tasks you need the \indexclausedef{task_reduction} clause (a OpenMP- feature):
#pragma omp taskgroup task_reduction(+:sum)The task group can contain both task that contribute to the reduction, and ones that don't. The former type needs a clause \indexclausedef{in_reduction}:
#pragma omp task in_reduction(+:sum)
As an example, here the sum $\sum_{i=1}^{100} i$ is computed with tasks:
// taskreduct.c #pragma omp parallel #pragma omp single { #pragma omp taskgroup task_reduction(+:sum) for (int itask=1; itask<=bound; itask++) { #pragma omp task in_reduction(+:sum) sum += itask; } }
crumb trail: > omp-task > More
crumb trail: > omp-task > More > Scheduling points
Normally, a task stays tied to the thread that first executes it. However, at a task scheduling point the thread may switch to the execution of another task created by the same team.
On the other hand a task created with them \indexclause{untied} clause on the task pragma is never tied to one thread. This means that after suspension at a scheduling point any thread can resume execution of the task. If you do this, beware that the value of a thread-id does not stay fixed. Also locks become a problem.
Example: if a thread is waiting for a lock, with a scheduling point it can suspend the task and work on another task.
while (!omp_test_lock(lock)) #pragma omp taskyield ;
crumb trail: > omp-task > More > Hints for performance improvement
If a task involves only a small amount of work, the scheduling overhead may negate any performance gain. There are two ways of executing the task code directly:
#pragma omp task if (n>100) f(n)
#pragma omp task final(level<3)
If you want to indicate that certain tasks are more important than others, use the priority clause:
#pragma omp task priority(5)where the priority is any non-negative scalar less than OMP_MAX_TASK_PRIORITY .
crumb trail: > omp-task > More > Task canceling
It is possible (in OpenMP-) to cancel tasks. This is useful when tasks are used to perform a search: the task that finds the result first can cancel any outstanding search tasks. See section 18.3 for details.
Exercise
Modify the prime finding example to use
cancel
.
End of exercise
crumb trail: > omp-task > Examples
crumb trail: > omp-task > Examples > Fibonacci
As an example of the use of tasks, consider computing an array of Fibonacci values: \cverbatimsnippet[examples/omp/c/taskgroup0.c]{fiboarray} If you simply turn each calculation into a task, results will be unpredictable (confirm this!) since tasks can be executed in any sequence. To solve this, we put dependencies on the tasks: \cverbatimsnippet[examples/omp/c/taskgroup2.c]{fibotaskdepend}
crumb trail: > omp-task > Examples > Binomial coefficients
Exercise
An array of binomial coefficients can be computed as follows:
\cverbatimsnippet[code/omp/c/binomial1.c]{binomialarray}
Putting a single task group around the double loop, and use
\indexclause{depend} clauses to make the execution satisfy the proper dependencies.
End of exercise