Much of this material is based on the Kokkos Tutorial that Jeff Miles and Christian Trott gave April 21-24, 2020.
Include file:
// hello.cxx #include "Kokkos_Core.hpp"
crumb trail: > kokkos > Parallel code execution
In parallel execution we basically have two issues:
The algorithmic parallel structure is indicated with the following constructs.
Kokkos::parallel_for Kokkos::parallel_reduce Kokkos::parallel_scan
crumb trail: > kokkos > Parallel code execution > Example: 1D loop
Hello world:
Kokkos::parallel_for ( 10, [](int i){ cout << "hello " << i << "\n"; } );
crumb trail: > kokkos > Parallel code execution > Reduction
Reductions add a parameter to the construct: the reduction variable.
double pi{0.}; int n{100}; Kokkos::parallel_reduce ( "PI", n, KOKKOS_LAMBDA ( int i, double& partial ) { double h = 1./n, x = i*h; partial += h * sqrt( 1-x*x ); }, pi );
For reductions other than summing, a \indexkokkosshow{reducer} is needed.
// reduxmax.cxx double max=0.; Kokkos::parallel_reduce ( npoints, KOKKOS_LAMBDA (int i,double& m) { if (x(i)>m) m = x(i); }, Kokkos::Max<double>(max) ); cout << "max: " << max << "\n";
crumb trail: > kokkos > Parallel code execution > Examples: Multi-D loops
You can of course parallelize over the outer loop, and do the inner loops in the functor. This code computes $r\leftarrow y^tAx$:
Kokkos::parallel_reduce( "yAx", N, KOKKOS_LAMBDA ( int j, double &update ) { double temp2 = 0; for ( int i = 0; i < M; ++i ) { temp2 += A[ j * M + i ] * x[ i ]; } update += y[ j ] * temp2; }, result );
You can also leave all the loops to Kokkos, with an \indexkokkos{RangePolicy} or \indexkokkos{MDRangePolicy}. Here you indicate the rank (as in: number of dimensions) of the object, as well as arrays of first/last values. In the above examples
Kokkos::parallel_reduce( N, ... ); // equivalent: Kokkos::parallel_reduce( Kokkos:RangePolicy<>(0,N), ... );
An example with a higher rank than one:
// matyax.cxx Kokkos::parallel_reduce ( "ytAx product", Kokkos::MDRangePolicy<Kokkos::Rank<2>>( {0,0}, {m,n} ), KOKKOS_LAMBDA (int i,int j,double &partial ) { partial += yvec(i) * matrix(i,j) * xvec(j); }, sum );
Note the multi-D indexing in this example: this parenthesis notation gets translated to the correct row/column-major depending on whether the code runs on a CPU or GPU; see section 19.5.2 .
One of the problems Kokkos addresses is the coherence of data between main processor and attached devicees such GPUs . This is handled through the Kokkos::View mechanism.
// matsum.cxx int m=10,n=100; Kokkos::View<double**> matrix("flat",m,n); assert( matrix.extent(0)==10 );
These act like C++ shared_ptr , so capturing them by value gives you the data by reference anyway. Storage is automatically freed, RAII-style, when they go out of scope.
Indexing is best done with a Fortran-style notation:
matrix(i,j)which makes indexing in your algorithm independent of the actual layout.
Compile-time dimensions can be accomodated:
View<double*[2]> tallskinny("tallthin",100); View<double*[2][3]> tallthin(100);with the compile-time dimensions trailing. Naming is optional.
Methods:
crumb trail: > kokkos > Data > Data layout
The view declaration has an optional template argument for the data layout.
View<double***, Layout, Space> name(...);Values are
Practically speaking, the traversal of a two-dimensional array is now a function of
Kokkos:parallel_whatever( N, KOKKOS_LAMBDA ( size_t i ) { matrix(i,j) or matrix(j,i); } );
With a layout determined by the memory space,
let the iterator index be first,
and let loops inside the functor range over subsequent indexes.
crumb trail: > kokkos > Execution and memory spaces
The body of the functor can be executed on the CPU or on a GPU. Those are the execution space s. Kokkos needs to be installed with support for such spaces.
To indicate that a function or lambda expression can be executed on more than one possible execution space:
Kokkos::parallel_for ( Kokkos::RangePolicy<>( 0,10 ), # default execution space [] (int i) {} ); Kokkos::parallel_for ( Kokkos::RangePolicy<SomeExecutionSpace>( 0,10 ), [] (int i) {} );The default
Kokkos::parallel_for( N, ...is equivalent to
Kokkos::parallel_for( RangePolicy<>(N), ...
crumb trail: > kokkos > Execution and memory spaces > Memory spaces
Where data is stored is an independent story. Each execution space has a memory space . When creating a \indexkokkos{View}, you can optionally indicate a memory space argument:
View<double***,MemorySpace> data(...);Available memory spaces include: \indexkokkos{HostSpace}, \indexkokkos{CudaSpace}, \indexkokkos{CudaUVMSpace}. Leaving out the memory space argument is equivalent to
View<double**,DefaultExecutionSpace::memory_space> x(1,2);
Examples:
View<double*,HostSpace> hostarray(5); View<double*,CudaSpace> cudaarray(5);
The \indexkokkos{CudaSpace} is only available if Kokkos has been configured with CUDA
crumb trail: > kokkos > Execution and memory spaces > Space coherence
Kokkos never makes implicit deep copies, so you can not immediately run a functor in the Cuda execution space on a view in Host space.
You can create a mirror of CUDA data on the host:
using CuMatrix = Kokkos::View<double**,CudaSpace>; CuMatrix matrix(m,n); CuMatrix::HostMirror hostmatrix = Kokkos::create_mirror_view(matrix); // populate matrix on the host for (i) for (j) hostmatrix(i,j) = ....; // deep copy to GPU Kokkos::deep_copy(matrix,hostmatrix); // do something on the GPU Kokkos:parallel_whatever( RangePolicy<CudaSpace>( 0,n ), some lambda ); // if needed, deep copy back.
crumb trail: > kokkos > Configuration
An accelerator-free installation with OpenMP:
cmake \ -D Kokkos_ENABLE_SERIAL=ON -D Kokkos_ENABLE_OPENMP=ON
Threading is not compatible with OpenMP:
-D Kokkos_ENABLE_THREADS=ON
Cuda installation:
cmake \ -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_TURING75=ON -D Kokkos_ENABLE_CUDA_LAMBDA=ON
There are init/finalize calls, which are not always needed.
// pi.cxx Kokkos::initialize(argc,argv); Kokkos::finalize();
crumb trail: > kokkos > Stuff > OpenMP integration
Cmake flag to enable OpenMP: -D Kokkos_ENABLE_OPENMP=ON
After that, all the usual OpenMP environment variables work.
Alternatively:
int nthreads = Kokkos::OpenMP::concurrency(); Kokkos::initialize(Kokkos::InitializationSettings().set_num_threads(nthreads))
Parallelism control:
--kokkos-threads=123 # threads --kokkos-numa=45 # numa regions --kokkos-device=6 * GPU id to use