# OpenMP topic: Reductions

##### Experimental HTML version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse

20.1 : Reductions: why, what, how?
20.1.1 : Reduction clause
20.1.2 : Code your own solution
20.1.2.1 : False sharing
20.2 : Built-in reduction
20.2.1 : Operators
20.2.2 : Reduction on arrays
20.2.3 : Types
20.3 : Initial value for reductions
20.4 : User-defined reductions
20.4.1 : Reduction functions
20.4.1.1 : Explicit expressions
20.4.1.2 : Reduction functions
20.5 : Scan / prefix operations
20.6 : Reductions and floating-point math

# 20 OpenMP topic: Reductions

## 20.1 Reductions: why, what, how?


Parallel tasks often produce some quantity that needs to be summed or otherwise combined. If you write:

  int sum=0;
#pragma omp parallel for
for (int i=0; i<N; i++)
sum += f(i);

you will find that the sum value depends on the number of threads, and is likely not the same as when you execute the code sequentially. The problem here is the race condition involving the sum variable, since this variable is shared between all threads.

We will discuss several strategies of dealing with this.

### 20.1.1 Reduction clause


The easiest way to effect a reduction is of course to use the reduction clause. Adding this to an omp parallel region has the following effect:

• OpenMP will make a copy of the reduction variable per thread, initialized to the identity of the reduction operator, for instance $1$ for multiplication;
• each thread will then reduce into its local copy;
• at the end of the parallel region, the local results are combined, again using the reduction operator, into the global variable.

The simplest case is a reduction over a parallel loop. Here we compute $\pi/4$ as a Riemann sum:

// pi.c
float pi4 = 0.f, h = 1.f/N; // accumulator and sample spacing
#pragma omp parallel for reduction(+:pi4)
for (int isample=0; isample<N; isample++) {
  float xsample = isample * h;
  float y = sqrt(1-xsample*xsample);
  pi4 += h*y;
}


You can also reduce over sections:

// sectionreduct.c
float y=0;
#pragma omp parallel reduction(+:y)
#pragma omp sections
{
#pragma omp section
y += f();
#pragma omp section
y += g();
}


Another reduction, this time over a parallel region, without any work sharing:

// reductpar.c
m = INT_MIN;
#pragma omp parallel reduction(max:m)
{
  int t = omp_get_thread_num(); // each thread reduces its own element
  int d = data[t];
  m = d>m ? d : m;
};


If you want to reduce multiple variables with the same operator, use

reduction(+:x,y,z)

For multiple reductions with different operators, use more than one clause.

Remark A reduction is one of those cases where the parallel execution can have a slightly different value from the one that is computed sequentially: floating point operations are not associative, so roundoff will lead to differing results. See Eijkhout:IntroHPC for more explanation.

The OpenMP standard does not even specify that two runs with the same number of threads, and the same scheduling, have to give identical results. Some runtimes may have a setting to enforce this; for instance KMP_DETERMINISTIC_REDUCTION for the Intel runtime.
End of remark
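With the Intel runtime, enabling this setting could look as follows; the program name here is hypothetical:

```shell
# Assumption: Intel OpenMP runtime (libiomp5); other runtimes ignore this variable.
export KMP_DETERMINISTIC_REDUCTION=true
OMP_NUM_THREADS=8 ./my_reduction_program
```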

### 20.1.2 Code your own solution


The most immediate way is to eliminate the race condition by declaring a critical section:

double result = 0;
#pragma omp parallel
{
  int num = omp_get_thread_num();
  double local_result;
  if (num==0)      local_result = f(x);
  else if (num==1) local_result = g(x);
  else if (num==2) local_result = h(x);
#pragma omp critical
  result += local_result;
}


This is a good solution if the amount of serialization in the critical section is small compared to computing the functions $f,g,h$. On the other hand, you may not want to do that in a loop:

double result = 0;
#pragma omp parallel
{
  double local_result;
#pragma omp for
  for (int i=0; i<N; i++) {
    local_result = f(x,i);
#pragma omp critical
    result += local_result;
  } // end of for loop
}


Exercise Can you think of a small modification of this code, that still uses a critical section, that is more efficient? Time both codes.
End of exercise

#### 20.1.2.1 False sharing

If your code cannot easily be structured as a reduction, you can realize the above scheme by hand, by 'duplicating' the global variable and gathering the contributions later. This example presumes three threads, and gives each a location of its own to store the result computed on that thread:

double result, local_results[3];
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0)      local_results[num] = f(x);
  else if (num==1) local_results[num] = g(x);
  else if (num==2) local_results[num] = h(x);
}
result = local_results[0] + local_results[1] + local_results[2];

While this code is correct, it may be inefficient because of a phenomenon called false sharing. Even though the threads write to separate variables, those variables are likely to be on the same cacheline (see Eijkhout:IntroHPC for an explanation). This means that the cores will be wasting a lot of time and bandwidth updating each other's copy of this cacheline.

False sharing can be prevented by giving each thread its own cacheline:

double result, local_results[3*8]; // stride of 8 doubles: one 64-byte cacheline per thread
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0)      local_results[num*8] = f(x);
  // et cetera
}

A more elegant solution gives each thread a true local variable, and uses a critical section to sum these, at the very end:

double result = 0;
#pragma omp parallel
{
  double local_result;
  local_result = .....
#pragma omp critical
  result += local_result;
}


## 20.2 Built-in reduction


### 20.2.1 Operators


Arithmetic reductions: +, *, -, max, min.

Logical and bitwise operator reductions in C: & && | || ^

Logical operator reductions in Fortran: .and. .or. .eqv. .neqv. and the bitwise intrinsics iand ior ieor

Exercise The maximum and minimum reductions were not added to OpenMP until OpenMP-3.1. Write a parallel loop that computes the maximum and minimum values in an array without using the reduction clause. Discuss the various options. Do timings to evaluate the speedup that is attained and to find the best option.
End of exercise

### 20.2.2 Reduction on arrays


Starting with the OpenMP-4.5 standard, you can reduce over statically and dynamically allocated arrays:

// reductarray.c
int data[nthreads];
for (int i=0; i<nthreads; i++)
  data[i] = 0;
#pragma omp parallel for schedule(static,1) \
  reduction(+:data)
for (int it=0; it<nthreads; it++) {
  data[it]++;
}

int *alloced = (int*)malloc( nthreads*sizeof(int) );
for (int i=0; i<nthreads; i++)
  alloced[i] = 0;
#pragma omp parallel for schedule(static,1) \
  reduction(+:alloced[:nthreads])
for (int it=0; it<nthreads; it++) {
  alloced[it]++;
}


C++ note Use the data method to extract the array on which to reduce. Also, the reduction clause wants a variable, not an expression, for the array, so you need an extra bare pointer:

// reductarray.cxx
vector<int> data(nthreads,0);
int *datadata = data.data(); // bare pointer for the reduction clause
#pragma omp parallel for schedule(static,1) \
  reduction(+:datadata[:nthreads])
for (int it=0; it<nthreads; it++) {
  datadata[it]++;
}

End of C++ note

### 20.2.3 Types


Reduction can be applied to any type for which the operator is defined. The types to which max/min are applicable are limited.

## 20.3 Initial value for reductions


The treatment of initial values in reductions is slightly involved.

  x = init_x
#pragma omp parallel for reduction(min:x)
for (int i=0; i<N; i++)
x = min(x,data[i]);

Each thread does a partial reduction, but its initial value is not the user-supplied init_x value, but a value dependent on the operator. In the end, the partial results are combined with the user initial value. The initialization values are mostly self-evident, such as zero for addition and one for multiplication. For min and max they are respectively the maximal and minimal representable value of the result type.

Figure 20.3: Reduction of four items on two threads, taking into account initial values.

Figure 20.3 illustrates this, where 1,2,3,4 are four data items, i is the OpenMP initialization, and u is the user initialization; each p stands for a partial reduction value. The figure is based on execution using two threads.

Exercise Write a program to test the fact that the partial results are initialized to the unit of the reduction operator.
End of exercise

## 20.4 User-defined reductions


In a loop that performs a reduction, most of the element-by-element reduction is done in user code. However, in a parallel version of that loop, OpenMP needs to perform that same reduction on the partial results from the threads. Thus, if you want to perform your own reduction, you need to declare this reduction to OpenMP.

With user-defined reductions  , the programmer specifies the function that does the elementwise comparison. We discuss two strategies:

1. In non-OO languages you can define a function, and declare that to be a reduction operator with the declare reduction construct.
2. In OO languages (C++ and Fortran2003) you can overload ordinary operators for types, including class objects.

### 20.4.1 Reduction functions


This takes the following steps.

1. You need a function of two arguments that returns the result of the combination. You can write this yourself, but, especially with the C++ standard library, you can also use an existing function such as std::vector::insert.
2. You specify how this function operates on the two variables omp_out and omp_in, corresponding to the partially reduced result and the new operand respectively. The new partial result should be left in omp_out.
3. Optionally, you can specify the value to which the reduction should be initialized.

This is the syntax of the definition of the reduction, which can then be used in multiple reduction clauses.

#pragma omp declare reduction
( identifier : typelist : combiner )
[initializer(initializer-expression)]

where:
• identifier is a name; this can be overloaded for different types, and redefined in inner scopes;
• typelist is a list of types;
• combiner is an expression that updates the internal variable omp_out as function of itself and omp_in;
• initializer sets omp_priv to the identity of the reduction; this can be an expression or a brace initializer.

#### 20.4.1.1 Explicit expressions

For very simple cases:

int result = LARGENUM;
for (int i=0; i<N; i++) {
  if (abs(data[i]) < result) {
    result = abs(data[i]);
  }
}

you can declare the reduction through an expression:
// reductexpr.c
#pragma omp declare reduction\
(minabs : int :							\
omp_out = abs(omp_in) > omp_out ? omp_out : abs(omp_in) )		\
initializer (omp_priv=LARGENUM)

and use that in the reduction clause:
#pragma omp parallel for reduction(minabs:result)


C++ note You can use lambda expressions in the explicit expression:

// reductexpr.cxx
#pragma omp declare reduction\
(minabs : int : \
omp_out = \
[] (int x,int y) -> int { \
return abs(x) > abs(y) ? abs(y) : abs(x); } \
(omp_in,omp_out) ) \
initializer (omp_priv=limit::max())

You cannot assign the lambda expression to a variable and use that, because omp_in/omp_out are the only variables allowed in the explicit expression. End of C++ note

#### 20.4.1.2 Reduction functions

For instance, recreating the maximum reduction would look like this:

// ireduct.c
int mymax(int r,int n) {
// r is the already reduced value
// n is the new value
int m;
if (n>r) {
m = n;
} else {
m = r;
}
return m;
}
#pragma omp declare reduction \
(rwz:int:omp_out=mymax(omp_out,omp_in)) \
initializer(omp_priv=INT_MIN)
m = INT_MIN;
#pragma omp parallel for reduction(rwz:m)
for (int idata=0; idata<ndata; idata++)
m = mymax(m,data[idata]);


Exercise Write a reduction routine that operates on an array of nonnegative integers, finding the smallest nonzero one. If the array has size zero, or entirely consists of zeros, return  -1  .
End of exercise

C++ note Support for C++ iterators

#pragma omp declare reduction (merge : std::vector<int>
: omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

End of C++ note

C++ note You can reduce with a templated function if you put both the declaration and the reduction in the same templated function:

template<typename T>
T generic_reduction( vector<T> tdata ) {
#pragma omp declare reduction					\
(rwzt:T:omp_out=reduce_without_zero<T>(omp_out,omp_in))	\
initializer(omp_priv=-1.f)

T tmin = -1;
#pragma omp parallel for reduction(rwzt:tmin)
for (int id=0; id<tdata.size(); id++)
tmin = reduce_without_zero<T>(tmin,tdata[id]);
return tmin;
};

which is then called with specific data:
auto tmin = generic_reduction<float>(fdata);

End of C++ note

C++ note You can do a reduction over a std::map by merging thread-local maps:

// mapreduce.cxx
template<typename key>
class bincounter : public map<key,int> {
public:
// merge this with other map
void operator+=( const bincounter<key>& other ) {
for ( auto [k,v] : other )
if ( map<key,int>::contains(k) ) // c++20
this->at(k) += v;
else
this->insert( {k,v} );
};
// insert one char in this map
void inc(char k) {
if ( map<key,int>::contains(k) )
this->at(k) += 1;
else
this->insert( {k,1} );
};
};


using CharCounter = bincounter<char>;
#pragma omp declare reduction\
(	\
+:CharCounter:omp_out += omp_in	\
) \
initializer( omp_priv = CharCounter{} )

string text{"the quick brown fox jumps over the lazy dog"};
CharCounter charcount;
#pragma omp parallel for reduction(+ : charcount)
for ( int i=0; i<text.size(); i++ )
charcount.inc( text[i] );


End of C++ note

### 20.4.2 Overloaded operators

Fortran note Reduction can be applied to any derived type that has the reduction operator defined.

!! reducttype.F90
Type inttype
integer :: value = 0
end type inttype
Interface operator(+)
  ! module procedure implementing + for inttype (elided in this snippet)
end Interface operator(+)


Type(inttype),dimension(nsize) :: intarray
Type(inttype) :: intsum = inttype(0)
!$OMP parallel do reduction(+:intsum)
do i=1,nsize
   intsum = intsum + intarray(i)
end do
!$OMP end parallel do


End of Fortran note

C++ note Reduction can be applied to any class for which the reduction operator is defined as operator+, or whichever operator applies.

// reductcomplex.cxx
class Thing {
private:
float x;
public:
Thing() : Thing( 0.f ) {};
Thing( float x ) : x(x) {};
Thing operator+( const Thing& other ) {
return Thing( x + other.x );
};
};


vector< Thing > things(500,Thing(1.f) );
Thing result(0.f);
#pragma omp parallel for reduction( +:result )
for ( const auto& t : things )
result = result + t;


A default constructor is required for the internally used init value; see figure  20.3  . End of C++ note

## 20.5 Scan / prefix operations


A 'scan' or prefix operation is like a reduction, except that you are also interested in the partial results. For this, OpenMP has the scan directive as of OpenMP-5.0. This needs the following:

• The reduction clause gets a modifier inscan:
#pragma omp parallel for reduction(inscan,+:sumvar)

• In the body of the parallel loop there is a scan directive that allows you to store the partial results. For inclusive scans the reduction variable is updated before the scan pragma:
  sumvar // update
#pragma omp scan inclusive(sumvar)
partials[i] = sumvar

For exclusive scans the reduction variable is updated after the scan pragma:
  partials[i] = sumvar
#pragma omp scan exclusive(sumvar)
sumvar // update



## 20.6 Reductions and floating-point math


The mechanisms that OpenMP uses to make a reduction parallel go against the strict rules for floating point expression evaluation in C; see Eijkhout:IntroHPC. OpenMP ignores this issue: it is the programmer's job to ensure proper rounding behavior.