Much of this material is based on the Kokkos Tutorial that Jeff Miles and Christian Trott gave April 21-24, 2020.
Include file:
// hello.cxx
#include "Kokkos_Core.hpp"
Parallel code execution
In parallel execution we basically have two issues: the parallel structure of the algorithm, and where the data lives in memory.
The algorithmic parallel structure is indicated with the following constructs.
Kokkos::parallel_for
Kokkos::parallel_reduce
Kokkos::parallel_scan
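Of these three constructs, the scan is the only one that is not illustrated later in this section. As a point of reference, here is a plain serial C++ sketch (not Kokkos code) of what an inclusive and an exclusive prefix sum compute. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
void inclusive_scan_serial(const std::vector<double>& in,
                           std::vector<double>& out) {
  assert(out.size() == in.size());
  double running = 0.0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    running += in[i];
    out[i] = running;
  }
}

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
void exclusive_scan_serial(const std::vector<double>& in,
                           std::vector<double>& out) {
  assert(out.size() == in.size());
  double running = 0.0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = running;
    running += in[i];
  }
}
```

A parallel scan has to produce the same values even though the iterations are no longer executed in order.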
Example: 1D loop
Hello world:
Kokkos::parallel_for
( 10,
[](int i){ cout << "hello " << i << "\n"; }
);
Reduction
Reductions add a parameter to the construct: the reduction variable.
double pi{0.};
int n{100};
Kokkos::parallel_reduce
( "PI",
n,
KOKKOS_LAMBDA ( int i, double& partial ) {
double h = 1./n, x = i*h;
partial += h * sqrt( 1-x*x );
},
pi
);
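Note that the sum in this reduction is the area under the quarter circle $\sqrt{1-x^2}$ on $[0,1]$, so by itself it approximates $\pi/4$ and needs a factor of 4 to give $\pi$. The following serial C++ sketch (not Kokkos code; the function names are hypothetical) reproduces the same arithmetic so that the result of the parallel version can be checked against it.

```cpp
#include <cassert>
#include <cmath>

// Serial reference for the "PI" reduction: left Riemann sum of
// sqrt(1-x^2) on [0,1] with n intervals, i.e. a quarter-circle area.
double quarter_circle_sum(int n) {
  assert(n > 0);
  const double h = 1.0 / n;
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    const double x = i * h;
    sum += h * std::sqrt(1.0 - x * x);
  }
  return sum;
}

// Scale the quarter-circle area to an estimate of pi.
double pi_estimate(int n) { return 4.0 * quarter_circle_sum(n); }
```

Since the integrand is decreasing, the left Riemann sum overestimates the integral; the error shrinks as n grows.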
For reductions other than summing, a reducer is needed.
// reduxmax.cxx
double max=0.;
Kokkos::parallel_reduce
( npoints,
KOKKOS_LAMBDA (int i,double& m) {
if (x(i)>m)
m = x(i);
},
Kokkos::Max<double>(max)
);
cout << "max: " << max << "\n";
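A reducer supplies both the combining operation and its identity element. For a maximum the identity is the lowest representable value, not zero; starting from zero would give a wrong answer on all-negative data. The serial C++ sketch below (not Kokkos code; the function name is made up) shows the same reduction written out by hand.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Serial max reduction that starts from the identity of max,
// the lowest representable double, rather than from zero.
double max_reduce_serial(const std::vector<double>& x) {
  assert(!x.empty());
  double m = std::numeric_limits<double>::lowest();
  for (double v : x) m = std::max(m, v);
  return m;
}
```

For example, on the values -3, -1.5, -7 this returns -1.5, where a reduction seeded with zero would return 0.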
Examples: Multi-D loops
You can of course parallelize over the outer loop, and do the inner loops in the functor. This code computes $r\leftarrow y^tAx$:
Kokkos::parallel_reduce( "yAx", N,
KOKKOS_LAMBDA ( int j, double &update ) {
double temp2 = 0;
for ( int i = 0; i < M; ++i ) {
temp2 += A[ j * M + i ] * x[ i ];
}
update += y[ j ] * temp2;
},
result
);
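For checking the parallel result, a serial reference is useful. The sketch below (plain C++, not Kokkos; the function name and the use of std::vector are assumptions for illustration) uses the same flat row-major indexing A[ j * M + i ] as the code above, with N rows and M columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for r = y^T A x, with A stored row-major in a flat
// array of N rows and M columns, so element (j,i) lives at A[j*M + i].
double yt_A_x(std::size_t N, std::size_t M,
              const std::vector<double>& A,
              const std::vector<double>& x,
              const std::vector<double>& y) {
  assert(A.size() == N * M && x.size() == M && y.size() == N);
  double result = 0.0;
  for (std::size_t j = 0; j < N; ++j) {  // the loop handed to Kokkos
    double row_dot = 0.0;
    for (std::size_t i = 0; i < M; ++i)  // the loop kept inside the functor
      row_dot += A[j * M + i] * x[i];
    result += y[j] * row_dot;
  }
  return result;
}
```

The outer loop iterations are independent except for the final accumulation, which is why the outer loop maps onto a reduction.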
You can also leave all the loops to Kokkos, with a RangePolicy or MDRangePolicy. Here you indicate the rank (as in: number of dimensions) of the object, as well as arrays of first/last values. In the above examples, passing a bare count is shorthand for a one-dimensional range policy:
Kokkos::parallel_reduce( N, ... );
// equivalent:
Kokkos::parallel_reduce( Kokkos::RangePolicy<>(0,N), ... );
An example with a higher rank than one:
// matyax.cxx
Kokkos::parallel_reduce
( "ytAx product",
Kokkos::MDRangePolicy<Kokkos::Rank<2>>( {0,0}, {m,n} ),
KOKKOS_LAMBDA (int i,int j,double &partial ) {
partial += yvec(i) * matrix(i,j) * xvec(j); },
sum
);
Note the multi-D indexing in this example: the parenthesis notation gets translated to row-major or column-major addressing depending on whether the code runs on a CPU or GPU; see section 19.5.2.
One of the problems Kokkos addresses is the coherence of data between the main processor and attached devices such as GPUs. This is handled through the Kokkos::View mechanism.
// matsum.cxx
int m=10,n=100;
Kokkos::View<double**> matrix("flat",m,n);
assert( matrix.extent(0)==10 );
Views act like a C++ shared_ptr, so capturing them by value gives you the data by reference anyway. Storage is automatically freed, RAII-style, when they go out of scope.
Indexing is best done with a Fortran-style notation:
matrix(i,j)
which makes indexing in your algorithm independent of the actual layout.
Compile-time dimensions can be accommodated:
View<double*[2]> tallskinny("tallthin",100);
View<double*[2][3]> tallthin(100);
with the compile-time dimensions trailing. Naming is optional.
Methods:
Data layout
The view declaration has an optional template argument for the data layout.
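View<double***, Layout, Space> name(...);
Values are LayoutLeft (column-major, the default in GPU memory spaces) and LayoutRight (row-major, the default in host memory).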
Practically speaking, the traversal of a two-dimensional array is now a function of
Kokkos::parallel_whatever(
    N,
    KOKKOS_LAMBDA ( size_t i ) {
      // in a loop over j: access matrix(i,j) or matrix(j,i)?
    }
);
With a layout determined by the memory space, let the iterator index be first, and let loops inside the functor range over subsequent indexes.
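To see why this rule works for both layouts, it helps to write out where element (i,j) of an m-by-n array lands in memory. The sketch below is plain C++ with made-up function names, and it ignores any padding an actual implementation may add. With the column-major layout, consecutive values of the first index are adjacent in memory, which suits GPU threads that are numbered by i; with the row-major layout, each value of i owns a contiguous row that a CPU thread can stream through in its inner loop over j.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i,j) in an m-by-n array stored row-major
// (C order, LayoutRight): the last index j varies fastest.
std::size_t offset_right(std::size_t i, std::size_t j,
                         std::size_t m, std::size_t n) {
  assert(i < m && j < n);
  return i * n + j;
}

// Offset of the same element stored column-major
// (Fortran order, LayoutLeft): the first index i varies fastest.
std::size_t offset_left(std::size_t i, std::size_t j,
                        std::size_t m, std::size_t n) {
  assert(i < m && j < n);
  return i + j * m;
}
```

In both cases the algorithm keeps writing matrix(i,j); only the offset computation behind the parentheses differs.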
Execution and memory spaces
The body of the functor can be executed on the CPU or on a GPU. Those are the execution spaces. Kokkos needs to be installed with support for such spaces.
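To indicate that a function or lambda expression can be executed on more than one possible execution space, mark it with KOKKOS_INLINE_FUNCTION or KOKKOS_LAMBDA respectively. The execution space on which a construct runs is chosen through the policy: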
Kokkos::parallel_for
( Kokkos::RangePolicy<>( 0,10 ), // default execution space
[] (int i) {} );
Kokkos::parallel_for
( Kokkos::RangePolicy<SomeExecutionSpace>( 0,10 ),
[] (int i) {} );
The default
Kokkos::parallel_for( N, ...
is equivalent to
Kokkos::parallel_for( Kokkos::RangePolicy<>(0,N), ...
Memory spaces
Where data is stored is an independent story. Each execution space has a memory space. When creating a View, you can optionally indicate a memory space argument:
View<double***,MemorySpace> data(...);
Available memory spaces include: HostSpace, CudaSpace, CudaUVMSpace. Leaving out the memory space argument is equivalent to
View<double**,DefaultExecutionSpace::memory_space> x(1,2);
Examples:
View<double*,HostSpace> hostarray(5);
View<double*,CudaSpace> cudaarray(5);
The CudaSpace is only available if Kokkos has been configured with CUDA.
Space coherence
Kokkos never makes implicit deep copies, so you cannot immediately run a functor in the Cuda execution space on a view in Host space.
You can create a mirror of CUDA data on the host:
using CuMatrix = Kokkos::View<double**,CudaSpace>;
CuMatrix matrix(m,n);
CuMatrix::HostMirror hostmatrix =
Kokkos::create_mirror_view(matrix);
// populate matrix on the host
for (i) for (j) hostmatrix(i,j) = ....;
// deep copy to GPU
Kokkos::deep_copy(matrix,hostmatrix);
// do something on the GPU
Kokkos::parallel_whatever(
    RangePolicy<Kokkos::Cuda>( 0,n ), // the policy takes an execution space, not a memory space
    some lambda );
// if needed, deep copy back.
Configuration
An accelerator-free installation with OpenMP:
cmake \
-D Kokkos_ENABLE_SERIAL=ON -D Kokkos_ENABLE_OPENMP=ON
Threading is not compatible with OpenMP:
-D Kokkos_ENABLE_THREADS=ON
Cuda installation:
cmake \
-D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_TURING75=ON -D Kokkos_ENABLE_CUDA_LAMBDA=ON
There are init/finalize calls, which are not always needed.
// pi.cxx
Kokkos::initialize(argc,argv);
Kokkos::finalize();
OpenMP integration
CMake flag to enable OpenMP: -D Kokkos_ENABLE_OPENMP=ON
After that, all the usual OpenMP environment variables work.
Alternatively:
int nthreads = Kokkos::OpenMP::concurrency();
Kokkos::initialize(Kokkos::InitializationSettings().set_num_threads(nthreads));
Parallelism control:
--kokkos-threads=123 # threads
--kokkos-numa=45     # numa regions
--kokkos-device=6    # GPU id to use