Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse

41.1 : Parallel code execution
41.1.1 : Example: 1D loop
41.1.2 : Reduction
41.1.3 : Examples: Multi-D loops
41.2 : Data
41.2.1 : Data layout
41.3 : Execution and memory spaces
41.3.1 : Memory spaces
41.3.2 : Space coherence
41.4 : Configuration
41.5 : Stuff
41.5.1 : OpenMP integration

41 Kokkos

Much of this material is based on the Kokkos Tutorial that Jeff Miles and Christian Trott gave April 21-24, 2020.

Include file:

// hello.cxx
#include "Kokkos_Core.hpp"

41.1 Parallel code execution


In parallel execution we basically have two issues:

  1. The parallel structure of the algorithm; that is what we discuss in this section.
  2. The memory structure of how the data is laid out; this is discussed in section 41.2.

The algorithmic parallel structure is indicated with constructs such as Kokkos::parallel_for and Kokkos::parallel_reduce.


41.1.1 Example: 1D loop


Hello world:

Kokkos::parallel_for
  ( 10,
    [](int i){ cout << "hello " << i << "\n"; }
    );

41.1.2 Reduction


Reductions add a parameter to the construct: the reduction variable.

double pi{0.};
int n{100};
Kokkos::parallel_reduce
  ( "PI",
    n,
    KOKKOS_LAMBDA ( int i, double& partial ) {
      double h = 1./n, x = i*h;
      partial += h * sqrt( 1-x*x );
    },
    pi );

For reductions other than summing, a reducer is needed.

// reduxmax.cxx
double max=0.;
Kokkos::parallel_reduce
  ( npoints,
    KOKKOS_LAMBDA (int i,double& m) {
      if (x(i)>m)
        m = x(i);
    },
    Kokkos::Max<double>(max) );
cout << "max: " << max << "\n";

41.1.3 Examples: Multi-D loops


You can of course parallelize over the outer loop, and do the inner loops in the functor. This code computes $r\leftarrow y^tAx$:

Kokkos::parallel_reduce( "yAx", N,
  KOKKOS_LAMBDA ( int j, double &update ) {
    double temp2 = 0;
    for ( int i = 0; i < M; ++i ) {
      temp2 += A[ j * M + i ] * x[ i ];
    }
    update += y[ j ] * temp2;
  }, result );

You can also leave all the loops to Kokkos, with a RangePolicy or MDRangePolicy. Here you indicate the rank (as in: number of dimensions) of the object, as well as arrays of first/last values. In the above examples:

Kokkos::parallel_reduce( N, ... );
// equivalent:
Kokkos::parallel_reduce( Kokkos::RangePolicy<>(0,N), ... );

An example with a higher rank than one:

// matyax.cxx
Kokkos::parallel_reduce
  ( "ytAx product",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>( {0,0}, {m,n} ),
    KOKKOS_LAMBDA (int i,int j,double &partial ) {
        partial += yvec(i) * matrix(i,j) * xvec(j); },
    sum );

Note the multi-D indexing in this example: this parenthesis notation gets translated to the correct row/column-major depending on whether the code runs on a CPU or GPU; see section  19.5.2  .

41.2 Data


One of the problems Kokkos addresses is the coherence of data between the main processor and attached devices such as GPUs. This is handled through the Kokkos::View mechanism.

// matsum.cxx
int m=10,n=100;
Kokkos::View<double**> matrix("flat",m,n);
assert( matrix.extent(0)==10 );

These act like a C++ shared_ptr: capturing them by value gives you the data by reference anyway. Storage is automatically freed, RAII-style, when the last one goes out of scope.

Indexing is best done with a Fortran-style parenthesis notation, matrix(i,j), which makes indexing in your algorithm independent of the actual layout.

Compile-time dimensions can be accommodated, with the compile-time dimensions trailing. Naming is optional:

View<double*[2]>    tallskinny("tallthin",100);
View<double*[2][3]> tallthin(100);


41.2.1 Data layout


The view declaration has an optional template argument for the data layout.

View<double***, Layout, Space> name(...);

Values are LayoutRight, meaning row-major (the default for CPU memory spaces), and LayoutLeft, meaning column-major (the default for GPU memory spaces).

Practically speaking, the traversal of a two-dimensional array is now a function of the layout and of which index is used as the parallel iteration index.

It is probably best to stick with this Rule of Thumb:

With a layout determined by the memory space,

let the iterator index be first,

and let loops inside the functor range over subsequent indexes.

41.3 Execution and memory spaces


The body of the functor can be executed on the CPU or on a GPU. These are the execution spaces. Kokkos needs to be installed with support for such spaces.

To indicate that a function or lambda expression can be executed on more than one possible execution space, mark it with the KOKKOS_FUNCTION or KOKKOS_LAMBDA macro.

Execution spaces can be explicitly indicated using the RangePolicy keyword:

Kokkos::parallel_for
  ( Kokkos::RangePolicy<>( 0,10 ), // default execution space
    [] (int i) {} );
Kokkos::parallel_for
  ( Kokkos::RangePolicy<SomeExecutionSpace>( 0,10 ),
    [] (int i) {} );
The default

Kokkos::parallel_for( N, ... );

is equivalent to

Kokkos::parallel_for( RangePolicy<>(0,N), ... );

41.3.1 Memory spaces


Where data is stored is an independent story. Each execution space has a memory space. When creating a View, you can optionally indicate a memory space argument:

View<double***,MemorySpace> data(...);
Available memory spaces include: HostSpace, CudaSpace, CudaUVMSpace. Leaving out the memory space argument is equivalent to
View<double**,DefaultExecutionSpace::memory_space> x(1,2);


View<double*,HostSpace> hostarray(5);
View<double*,CudaSpace> cudaarray(5);

The CudaSpace is only available if Kokkos has been configured with CUDA support.

41.3.2 Space coherence


Kokkos never makes implicit deep copies, so you cannot immediately run a functor in the Cuda execution space on a view in host space.

You can create a mirror of CUDA data on the host:

using CuMatrix = Kokkos::View<double**,CudaSpace>;
CuMatrix matrix(m,n);
CuMatrix::HostMirror hostmatrix =
  Kokkos::create_mirror_view(matrix);
// populate matrix on the host
for (i) for (j) hostmatrix(i,j) = ....;
// deep copy to GPU
Kokkos::deep_copy(matrix,hostmatrix);
// do something on the GPU
Kokkos::parallel_for
  ( RangePolicy<Kokkos::Cuda>( 0,n ),
    some lambda );
// if needed, deep copy back:
Kokkos::deep_copy(hostmatrix,matrix);

41.4 Configuration


An accelerator-free installation with OpenMP:

cmake \
  -D Kokkos_ENABLE_OPENMP=ON \
  ...

Threading (Kokkos_ENABLE_THREADS) is not compatible with OpenMP.

Cuda installation:

cmake \
  -D Kokkos_ENABLE_CUDA=ON \
  ...

41.5 Stuff


There are init/finalize calls, which are not always needed:

// pi.cxx
Kokkos::initialize(argc,argv);
// ... all parallel code ...
Kokkos::finalize();

41.5.1 OpenMP integration


Cmake flag to enable OpenMP: -D Kokkos_ENABLE_OPENMP=ON

After that, all the usual OpenMP environment variables work.


int nthreads = Kokkos::OpenMP::concurrency();

Parallelism control:

--kokkos-threads=123 # threads
--kokkos-numa=45     # numa regions
--kokkos-device=6    # GPU id to use
