
41.1 : Parallel code execution
41.1.1 : Example: 1D loop
41.1.2 : Reduction
41.1.3 : Examples: Multi-D loops
41.2 : Data
41.2.1 : Data layout
41.3 : Execution and memory spaces
41.3.1 : Memory spaces
41.3.2 : Space coherence
41.4 : Configuration
41.5 : Stuff
41.5.1 : OpenMP integration

41 Kokkos

Much of this material is based on the Kokkos Tutorial that Jeff Miles and Christian Trott gave April 21-24, 2020.

Include file:

// hello.cxx
#include "Kokkos_Core.hpp"

41.1 Parallel code execution


In parallel execution we basically have two issues:

  1. The parallel structure of the algorithm; that's what we discuss in this section.
  2. The memory structure of how the data is laid out; this will be discussed in section 41.2.

The algorithmic parallel structure is indicated with the following constructs.

Kokkos::parallel_for
Kokkos::parallel_reduce
Kokkos::parallel_scan
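
The first two constructs are illustrated in the sections below; parallel_scan does not otherwise appear in this chapter, so here is a minimal sketch of an exclusive prefix sum (the view name and values are purely illustrative):

// sketch: exclusive prefix sum with parallel_scan
int n{10};
Kokkos::View<int*> sums("sums",n);
Kokkos::parallel_scan
  ( n,
    KOKKOS_LAMBDA (int i,int& partial,bool final) {
      // on the final pass, write the sum of all preceding elements
      if (final) sums(i) = partial;
      partial += i+1;
    } );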

41.1.1 Example: 1D loop


Hello world:

Kokkos::parallel_for
  ( 10,
    [](int i){ std::cout << "hello " << i << "\n"; }
    );

41.1.2 Reduction


Reductions add a parameter to the construct: the reduction variable.

double pi{0.};
int n{100};
Kokkos::parallel_reduce
  ( "PI",
    n,
    KOKKOS_LAMBDA ( int i, double& partial ) {
    double h = 1./n, x = i*h;
    partial += h * sqrt( 1-x*x );
  },
    pi
    );

For reductions other than summing, a reducer is needed.

// reduxmax.cxx
double max=0.;
Kokkos::parallel_reduce
  ( npoints,
    KOKKOS_LAMBDA (int i,double& m) {
    if (x(i)>m)
      m = x(i);
  },
    Kokkos::Max<double>(max)
    );
cout << "max: " << max << "\n";

41.1.3 Examples: Multi-D loops


You can of course parallelize over the outer loop, and do the inner loops in the functor. This code computes $r\leftarrow y^tAx$:

Kokkos::parallel_reduce( "yAx", N,
  KOKKOS_LAMBDA ( int j, double &update ) {
    double temp2 = 0;

    for ( int i = 0; i < M; ++i ) {
      temp2 += A[ j * M + i ] * x[ i ];
    }

    update += y[ j ] * temp2;
  },
  result
 );  

You can also leave all the loops to Kokkos, with a RangePolicy or MDRangePolicy. Here you indicate the rank (as in: number of dimensions) of the object, as well as arrays of first/last values. In the above examples:

Kokkos::parallel_reduce( N, ... );
// equivalent:
Kokkos::parallel_reduce( Kokkos::RangePolicy<>(0,N), ... );

An example with a higher rank than one:

// matyax.cxx
Kokkos::parallel_reduce
  ( "ytAx product",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>( {0,0}, {m,n} ),
    KOKKOS_LAMBDA (int i,int j,double &partial ) {
        partial += yvec(i) * matrix(i,j) * xvec(j); },
    sum
    );

Note the multi-D indexing in this example: this parenthesis notation gets translated to the correct row-major or column-major order, depending on whether the code runs on a CPU or a GPU; see section 19.5.2.

41.2 Data


One of the problems Kokkos addresses is the coherence of data between the main processor and attached devices such as GPUs. This is handled through the Kokkos::View mechanism.

// matsum.cxx
int m=10,n=100;
Kokkos::View<double**> matrix("flat",m,n);
assert( matrix.extent(0)==10 );

These act like a C++ shared_ptr: capturing a View by value copies only the handle, so you still get access to the same data. Storage is automatically freed, RAII-style, when the last View referring to it goes out of scope.
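
A minimal sketch of this behavior (the view name x is assumed):

int n{100};
Kokkos::View<double*> x("x",n);
Kokkos::parallel_for
  ( n,
    // the lambda captures x by value, but writes land in the shared data
    KOKKOS_LAMBDA (int i) { x(i) = 2.*i; }
    );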

Indexing is best done with a Fortran-style notation:

matrix(i,j)
which makes indexing in your algorithm independent of the actual layout.

Compile-time dimensions can be accommodated:

View<double*[2]>  tallskinny("tallthin",100);
View<double*[2][3]>  tallthin(100);
with the compile-time dimensions trailing. Naming is optional.

Methods on a View include extent(d), which gives the size in dimension d, as used in the assert above.

41.2.1 Data layout


The View declaration has an optional template argument for the data layout:

View<double***, Layout, Space> name(...);

Possible values are LayoutLeft (column-major, Fortran style; the default for GPU memory spaces) and LayoutRight (row-major, C style; the default for host memory spaces).

Practically speaking, the traversal of a two-dimensional array is now a function of its layout, and of which index the parallel iteration ranges over.

It is probably best to stick with this Rule of Thumb: with the layout determined by the memory space, let the iterator index be first, and let loops inside the functor range over subsequent indexes.
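
As a sketch of what this rule looks like in code (the view name and extents n,m are assumed):

// parallel index first, inner loop over the trailing index
Kokkos::View<double**> a("a",n,m);
Kokkos::parallel_for
  ( n,
    KOKKOS_LAMBDA (int i) {
      for (int j=0; j<m; j++)
        a(i,j) = 1.;
    } );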

41.3 Execution and memory spaces


The body of the functor can be executed on the CPU or on a GPU. These are the execution spaces. Kokkos needs to be installed with support for such spaces.

To indicate that a function or lambda expression can be executed on more than one possible execution space, mark it with the KOKKOS_FUNCTION or KOKKOS_LAMBDA macro, respectively; under CUDA these add the required host/device annotations.
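
A minimal sketch (the function square is purely illustrative):

// compiled for all enabled execution spaces
KOKKOS_FUNCTION double square(double x) { return x*x; }

Kokkos::parallel_for
  ( 10,
    // KOKKOS_LAMBDA makes the lambda callable on host and device
    KOKKOS_LAMBDA (int i) { double s = square(i); /* ... use s ... */ } );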

Execution spaces can be explicitly indicated using the RangePolicy keyword:
Kokkos::parallel_for
  ( Kokkos::RangePolicy<>( 0,10 ), // default execution space
    [] (int i) {} );
Kokkos::parallel_for
  ( Kokkos::RangePolicy<SomeExecutionSpace>( 0,10 ),
    [] (int i) {} );
The default
Kokkos::parallel_for( N, ... );
is equivalent to
Kokkos::parallel_for( Kokkos::RangePolicy<>(0,N), ... );

41.3.1 Memory spaces


Where data is stored is an independent story. Each execution space has a memory space. When creating a View, you can optionally indicate a memory space argument:

View<double***,MemorySpace> data(...);

Available memory spaces include: HostSpace, CudaSpace, CudaUVMSpace. Leaving out the memory space argument is equivalent to
View<double**,DefaultExecutionSpace::memory_space> x(1,2);

Examples:

View<double*,HostSpace> hostarray(5);
View<double*,CudaSpace> cudaarray(5);

The CudaSpace is only available if Kokkos has been configured with CUDA.

41.3.2 Space coherence


Kokkos never makes implicit deep copies, so you cannot immediately run a functor in the CUDA execution space on a view in host space.

You can create a mirror of CUDA data on the host:

using CuMatrix = Kokkos::View<double**,Kokkos::CudaSpace>;
CuMatrix matrix("matrix",m,n);
CuMatrix::HostMirror hostmatrix =
    Kokkos::create_mirror_view(matrix);
// populate the matrix on the host
for (int i=0; i<m; i++)
  for (int j=0; j<n; j++)
    hostmatrix(i,j) = 1.;
// deep copy to the GPU
Kokkos::deep_copy(matrix,hostmatrix);
// do something on the GPU, in the Cuda execution space
Kokkos::parallel_for
  ( Kokkos::RangePolicy<Kokkos::Cuda>( 0,m ),
    KOKKOS_LAMBDA (int i) { /* work on matrix(i,...) */ } );
// if needed, deep copy back:
Kokkos::deep_copy(hostmatrix,matrix);
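
Note that if the view is already host-accessible, create_mirror_view may return the view itself, in which case the deep copies involve no actual data motion; Kokkos::create_mirror, by contrast, always allocates a separate host copy.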

41.4 Configuration


An accelerator-free installation with OpenMP:

cmake \
    -D Kokkos_ENABLE_SERIAL=ON -D Kokkos_ENABLE_OPENMP=ON

The threads backend is not compatible with OpenMP, so enable at most one of the two:

-D Kokkos_ENABLE_THREADS=ON  

Cuda installation:

cmake \
    -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_TURING75=ON -D Kokkos_ENABLE_CUDA_LAMBDA=ON

41.5 Stuff


There are init/finalize calls, which are not always needed.

// pi.cxx
Kokkos::initialize(argc,argv);
Kokkos::finalize();
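
A minimal sketch of the resulting program structure:

#include "Kokkos_Core.hpp"
int main(int argc,char **argv) {
  Kokkos::initialize(argc,argv);
  { // scope, so that views are destroyed before the finalize call
    Kokkos::View<double*> x("x",10);
    Kokkos::parallel_for
      ( 10, KOKKOS_LAMBDA (int i) { x(i) = i; } );
  }
  Kokkos::finalize();
  return 0;
}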

41.5.1 OpenMP integration


CMake flag to enable OpenMP: -D Kokkos_ENABLE_OPENMP=ON

After that, all the usual OpenMP environment variables, such as OMP_NUM_THREADS and OMP_PROC_BIND, work.

Alternatively:

int nthreads = Kokkos::OpenMP::concurrency();
Kokkos::initialize(Kokkos::InitializationSettings().set_num_threads(nthreads));

Parallelism control:

--kokkos-threads=123 # threads
--kokkos-numa=45     # numa regions
--kokkos-device=6    # GPU id to use
