Reductions: why, what, how?
Parallel tasks often produce some quantity that needs to be summed or otherwise combined. If you write:
int sum=0;
#pragma omp parallel for
for (int i=0; i<N; i++)
  sum += f(i);

you will find that the sum value depends on the number of threads, and is likely not the same as when you execute the code sequentially. The problem is a race condition on the sum variable, which is shared between all threads.
We will discuss several strategies of dealing with this.
Reduction clause
The easiest way to effect a reduction is of course to use the reduction clause. Adding this to an omp parallel region has the following effect: each thread gets a private copy of the reduction variable, initialized to the unit of the reduction operator; at the end of the region these partial results are combined into the shared variable.
The simplest case is a reduction over a parallel loop. Here we compute $\pi/4$ as a Riemann sum:
// pi.c
#pragma omp parallel for reduction(+:pi4)
for (int isample=0; isample<N; isample++) {
  float xsample = isample * h;
  float y = sqrt(1-xsample*xsample);
  pi4 += h*y;
}
You can also reduce over sections:
// sectionreduct.c
float y=0;
#pragma omp parallel reduction(+:y)
#pragma omp sections
{
  #pragma omp section
  y += f();
  #pragma omp section
  y += g();
}
Another reduction, this time over a parallel region, without any work sharing:
// reductpar.c
m = INT_MIN;
#pragma omp parallel reduction(max:m) num_threads(ndata)
{
  int t = omp_get_thread_num();
  int d = data[t];
  m = d>m ? d : m;
}
If you want to reduce multiple variables with the same operator, use

reduction(+:x,y,z)

For multiple reductions with different operators, use more than one clause.
Remark A reduction is one of those cases where the parallel execution can give a slightly different value from the one that is computed sequentially: floating point operations are not associative, so roundoff will lead to differing results. See Eijkhout:IntroHPC for more explanation.
The OpenMP standard does not even specify that two runs with the same number of threads, and the same scheduling, have to give identical results. Some runtimes may have a setting to enforce this; for instance KMP_DETERMINISTIC_REDUCTION for the Intel runtime.
End of remark
Code your own solution
The most immediate way is to eliminate the race condition by declaring a critical section:
double result = 0;
#pragma omp parallel
{
  double local_result;
  int num = omp_get_thread_num();
  if (num==0)      local_result = f(x);
  else if (num==1) local_result = g(x);
  else if (num==2) local_result = h(x);
#pragma omp critical
  result += local_result;
}
This is a good solution if the amount of serialization in the critical section is small compared to computing the functions $f,g,h$. On the other hand, you may not want to do that in a loop:
double result = 0;
#pragma omp parallel
{
  double local_result;
#pragma omp for
  for (int i=0; i<N; i++) {
    local_result = f(x,i);
#pragma omp critical
    result += local_result;
  } // end of for loop
}
Exercise
Can you think of a small modification of this code, still using a critical section, that is more efficient? Time both codes.
End of exercise
False sharing
If your code cannot easily be structured as a reduction, you can realize the above scheme by hand, by 'duplicating' the global variable and gathering the contributions later. This example presumes three threads, and gives each a location of its own to store the result computed on that thread:
double result,local_results[3];
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0)      local_results[num] = f(x);
  else if (num==1) local_results[num] = g(x);
  else if (num==2) local_results[num] = h(x);
}
result = local_results[0]+local_results[1]+local_results[2];

While this code is correct, it may be inefficient because of a phenomenon called false sharing. Even though the threads write to separate variables, those variables are likely to be on the same cacheline (see Eijkhout:IntroHPC for an explanation). This means that the cores will be wasting a lot of time and bandwidth updating each other's copy of this cacheline.
False sharing can be prevented by giving each thread its own cacheline:
double result,local_results[3][8];
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0) local_results[num][1] = f(x);
  // et cetera
}

A more elegant solution gives each thread a true local variable, and uses a critical section to sum these, at the very end:
double result = 0;
#pragma omp parallel
{
  double local_result;
  local_result = .....
#pragma omp critical
  result += local_result;
}
Built-in reduction
Operators
Arithmetic reductions: +, *, -, max, min.
Logical operator reductions in C: && and ||; bitwise operators: &, |, ^.
Logical operator reductions in Fortran: .and., .or., .eqv., .neqv.; also the intrinsic procedures iand, ior, ieor.
Exercise
The maximum and minimum reductions were not added to OpenMP until OpenMP-3.1. Write a parallel loop that computes the maximum and minimum values in an array without using the reduction clause. Discuss the various options. Do timings to evaluate the speedup that is attained and to find the best option.
End of exercise
Reduction on arrays
Starting with the OpenMP-4.5 standard, you can reduce on statically and dynamically allocated arrays:
// reductarray.c
#pragma omp parallel for schedule(static,1) \
    reduction(+:data[:nthreads])
for (int it=0; it<nthreads; it++) {
  for (int i=0; i<nthreads; i++)
    data[i]++;
}
int *alloced = (int*)malloc( nthreads*sizeof(int) );
for (int i=0; i<nthreads; i++)
  alloced[i] = 0;
#pragma omp parallel for schedule(static,1) \
    reduction(+:alloced[:nthreads])
for (int it=0; it<nthreads; it++) {
  for (int i=0; i<nthreads; i++)
    alloced[i]++;
}
C++ note Use the data method to extract the array on which to reduce. Also, the reduction clause wants a variable, not an expression, for the array, so you need an extra bare pointer:
// reductarray.cxx
vector<int> data(nthreads,0);
int *datadata = data.data();
#pragma omp parallel for schedule(static,1) \
    reduction(+:datadata[:nthreads])
for (int it=0; it<nthreads; it++) {
  for (int i=0; i<nthreads; i++)
    datadata[i]++;
}
Types
Reduction can be applied to any type for which the operator is defined. The types to which max/min are applicable are limited.
Initial value for reductions
The treatment of initial values in reductions is slightly involved.
x = init_x
#pragma omp parallel for reduction(min:x)
for (int i=0; i<N; i++)
  x = min(x,data[i]);

Each thread does a partial reduction, but its initial value is not the user-supplied init_x value, but a value dependent on the operator. In the end, the partial results will then be combined with the user initial value. The initialization values are mostly self-evident, such as zero for addition and one for multiplication. For min and max they are respectively the maximal and minimal representable value of the result type.
Figure 20.3: Reduction of four items on two threads, taking into account initial values.

Figure 20.3 illustrates this, where 1,2,3,4 are four data items, i is the OpenMP initialization, and u is the user initialization; each p stands for a partial reduction value. The figure is based on execution using two threads.
Exercise
Write a program to test the fact that the partial results are initialized to the unit of the reduction operator.
End of exercise
User-defined reductions
In a loop that performs a reduction, most of the element-by-element reduction is done in user code. However, in a parallel version of that loop, OpenMP needs to perform that same reduction on the partial results from the threads. Thus, if you want to perform your own reduction, you need to declare this reduction to OpenMP.
With user-defined reductions, the programmer specifies the function that does the elementwise comparison. We discuss two strategies: reduction functions, and overloaded operators.
Reduction functions
This takes two steps: first you define the reduction with a declare reduction directive; then you use the defined identifier in one or more reduction clauses. This is the syntax of the definition of the reduction, which can then be used in multiple reduction clauses.
#pragma omp declare reduction ( identifier : typelist : combiner ) \
    [initializer(initializer-expression)]

where:
identifier is a name for the reduction, used in subsequent reduction clauses;
typelist is a comma-separated list of types to which the reduction applies;
combiner is an expression that updates the special variable omp_out, using omp_in for the value being folded in;
initializer-expression sets the special variable omp_priv, the initial value of each private copy.
Explicit expressions
For very simple cases:
for (i=0; i<N; i++) {
  if (abs(data[i]) < result) {
    result = abs(data[i]);
  }
}
// reductexpr.c
#pragma omp declare reduction \
    (minabs : int : \
     omp_out = abs(omp_in) > omp_out ? omp_out : abs(omp_in) ) \
    initializer (omp_priv=LARGENUM)
#pragma omp parallel for reduction(minabs:result)
C++ note You can use lambda expressions in the explicit expression:
// reductexpr.cxx
#pragma omp declare reduction \
    (minabs : int : \
     omp_out = \
       [] (int x,int y) -> int { \
           return abs(x) > abs(y) ? abs(y) : abs(x); } \
       (omp_in,omp_out) ) \
    initializer (omp_priv=limit::max())
Reduction functions
For instance, recreating the maximum reduction would look like this:
// ireduct.c
int mymax(int r,int n) {
  // r is the already reduced value
  // n is the new value
  int m;
  if (n>r) {
    m = n;
  } else {
    m = r;
  }
  return m;
}

#pragma omp declare reduction \
    (rwz:int:omp_out=mymax(omp_out,omp_in)) \
    initializer(omp_priv=INT_MIN)

m = INT_MIN;
#pragma omp parallel for reduction(rwz:m)
for (int idata=0; idata<ndata; idata++)
  m = mymax(m,data[idata]);
Exercise
Write a reduction routine that operates on an array of nonnegative integers, finding the smallest nonzero one. If the array has size zero, or entirely consists of zeros, return -1.
End of exercise
C++ note Support for C++ iterators:

#pragma omp declare reduction (merge : std::vector<int> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

End of C++ note
C++ note You can reduce with a templated function if you put both the declaration and the reduction in the same templated function:
template<typename T>
T generic_reduction( vector<T> tdata ) {
#pragma omp declare reduction \
    (rwzt:T:omp_out=reduce_without_zero<T>(omp_out,omp_in)) \
    initializer(omp_priv=-1.f)
  T tmin = -1;
#pragma omp parallel for reduction(rwzt:tmin)
  for (int id=0; id<tdata.size(); id++)
    tmin = reduce_without_zero<T>(tmin,tdata[id]);
  return tmin;
}
auto tmin = generic_reduction<float>(fdata);
C++ note You can do a reduction over a std::map by merging thread-local maps:
// mapreduce.cxx
template<typename key>
class bincounter : public map<key,int> {
public:
  // merge this with other map
  void operator+=( const bincounter<key>& other ) {
    for ( auto [k,v] : other )
      if ( map<key,int>::contains(k) ) // c++20
        this->at(k) += v;
      else
        this->insert( {k,v} );
  };
  // insert one char in this map
  void inc(char k) {
    if ( map<key,int>::contains(k) )
      this->at(k) += 1;
    else
      this->insert( {k,1} );
  };
};
using CharCounter = bincounter<char>;
#pragma omp declare reduction \
    ( +:CharCounter:omp_out += omp_in ) \
    initializer( omp_priv = CharCounter{} )

string text{"the quick brown fox jumps over the lazy dog"};
CharCounter charcount;
#pragma omp parallel for reduction(+ : charcount)
for ( int i=0; i<text.size(); i++ )
  charcount.inc( text[i] );
End of C++ note
Overloaded operators
Fortran note Reduction can be applied to any derived type that has the reduction operator defined.
!! reducttype.F90
Type inttype
  integer :: value = 0
end type inttype
Interface operator(+)
  module procedure addints
end Interface operator(+)
Type(inttype),dimension(nsize) :: intarray
Type(inttype) :: intsum = inttype(0)
!$OMP parallel do reduction(+:intsum)
do i=1,nsize
  intsum = intsum + intarray(i)
end do
!$OMP end parallel do
End of Fortran note
C++ note Reduction can be applied to any class for which the reduction operator is defined as operator+ or whichever operator the case may be.
// reductcomplex.cxx
class Thing {
private:
  float x;
public:
  Thing() : Thing( 0.f ) {};
  Thing( float x ) : x(x) {};
  Thing operator+( const Thing& other ) {
    return Thing( x + other.x );
  };
};
vector< Thing > things(500,Thing(1.f) );
Thing result(0.f);
#pragma omp parallel for reduction( +:result )
for ( const auto& t : things )
  result = result + t;
A default constructor is required for the internally used init value; see figure 20.3. End of C++ note
Scan / prefix operations
A 'scan' or prefix operation is like a reduction, except that you're interested in the partial results. For this OpenMP, as of OpenMP-5.0, has the scan directive. This needs the following:
The reduction clause gets the inscan modifier:

#pragma omp parallel for reduction(inscan,+:sumvar)

In the loop body, the scan directive separates the update of the reduction variable from its use. For an inclusive scan the variable is updated before the scan pragma:

sumvar // update
#pragma omp scan inclusive(sumvar)
partials[i] = sumvar

For exclusive scans the reduction variable is updated after the scan pragma, which takes the exclusive clause:

partials[i] = sumvar
#pragma omp scan exclusive(sumvar)
sumvar // update
Reductions and floating-point math
The mechanisms that OpenMP uses to make a reduction parallel go against the strict rules for floating point expression evaluation in C; see Eijkhout:IntroHPC. OpenMP ignores this issue: it is the programmer's job to ensure proper rounding behavior.