Sycl, OneAPI, DPC++

Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse
42.1 : Logistics
42.2 : Platforms and devices
42.3 : Queues
42.3.1 : Device selectors
42.3.2 : Queue submission and execution
42.3.3 : Kernel ordering
42.4 : Kernels
42.5 : Parallel operations
42.5.1 : Loops
42.5.1.1 : Loop bounds: ranges
42.5.1.2 : Loop indices
42.5.1.3 : Multi-dimensional indexing
42.5.2 : Task dependencies
42.5.3 : Race conditions
42.5.4 : Reductions
42.6 : Memory access
42.6.1 : Unified shared memory
42.6.2 : Buffers and accessors
42.6.2.1 : Multi-D buffers
42.6.3 : Querying
42.7 : Parallel output
42.8 : DPCPP extensions
42.9 : Intel devcloud notes
42.10 : Examples
42.10.1 : Kernels in a loop
42.10.2 : Stencil computations

42 Sycl, OneAPI, DPC++

This chapter explains the basic concepts of Sycl/Dpc++, and helps you get started running your first program.

\begin{dpcppnote} The various Intel extensions are listed here:

https://spec.oneapi.com/versions/latest/elements/dpcpp/source/index.html#extensions-table

\end{dpcppnote}

42.1 Logistics


Headers:

#include <CL/sycl.hpp>

You can open the namespace, but with care! If you use

using namespace cl;

you have to prefix all SYCL classes with sycl:: , which is a bit of a bother. However, if you use

using namespace cl::sycl;

you run into the fact that SYCL has its own versions of many STL commands, and so you will get name collisions. The most obvious example is that the cl::sycl namespace has its own versions of cout and endl. Therefore you have to use std::cout and std::endl explicitly. Using the wrong I/O will cause tons of inscrutable error messages. Additionally, SYCL has its own version of free , and of several math routines.

\begin{dpcppnote}

    using namespace sycl;
  

\end{dpcppnote}

42.2 Platforms and devices


Since DPCPP is cross-platform, we first need to discover the available devices.

First we list the platforms:

// devices.cxx
std::vector<sycl::platform> platforms = sycl::platform::get_platforms();
for (const auto &plat : platforms) {
// get_info is a template, so we pass the type as a template argument.
  std::cout << "Platform: "
            << plat.get_info<sycl::info::platform::name>() << " "
            << plat.get_info<sycl::info::platform::vendor>() << " "
            << plat.get_info<sycl::info::platform::version>() 
	      << '\n';

Then for each platform we list the devices:

std::vector<sycl::device> devices = plat.get_devices();
for (const auto &dev : devices) {
  std::cout << "-- Device: "
            << dev.get_info<sycl::info::device::name>()
            << (dev.is_host() ? ": is the host" : "")
            << (dev.is_cpu() ? ": is a cpu" : "")
            << (dev.is_gpu() ? ": is a gpu" : "")
            << std::endl;

You can query what type of device you are dealing with by \indexsyclshow{is_host}, \indexsyclshow{is_cpu}, \indexsyclshow{is_gpu}.

42.3 Queues


The execution mechanism of SYCL is the queue: a sequence of actions that will be executed on a selected device. The only user action is submitting actions to a queue; the queue is executed at the end of the scope where it is declared.

Queue execution is asynchronous with host code.

42.3.1 Device selectors


You need to select a device on which to execute the queue. A queue is coupled to one specific device, so it cannot spread work over multiple devices.

You can let the runtime pick a default device for the queue with

  sycl::queue myqueue;

The following example explicitly assigns the queue to the CPU device using the sycl:: \indexsyclshow{cpu_selector}.

// cpuname.cxx
sycl::queue myqueue( sycl::cpu_selector{} );

The sycl:: \indexsyclshow{host_selector} bypasses any devices and makes the code run on the host.

It is good for your sanity to print the name of the device you are running on:

// devname.cxx
std::cout << myqueue.get_device().get_info<sycl::info::device::name>()
          << std::endl;

If you try to select a device that is not available, a sycl:: \indexsyclshow{runtime_error} exception will be thrown.

\begin{dpcppnote}

    #include "CL/sycl/intel/fpga_extensions.hpp"
    fpga_selector
  

\end{dpcppnote}

42.3.2 Queue submission and execution


Queued kernels are also executed when the scope containing them ends, even if the queue itself does not go out of scope: destruction of a buffer waits for the kernels that use it.

// doubler.cxx
  sycl::range<1> mySize{SIZE};
  sycl::buffer<int, 1> bufferA(myArray.data(), mySize);
  myqueue.submit
    ( [&](sycl::handler &myHandle) {
      auto deviceAccessorA =
        bufferA.get_access<sycl::access::mode::read_write>(myHandle);
} // scope ends: buffer destruction forces execution

42.3.3 Kernel ordering


Kernels are not necessarily executed in the order in which they are submitted. You can enforce submission order by specifying an in-order queue:

sycl::queue myqueue{sycl::property::queue::in_order()};

42.4 Kernels


One kernel per submit.

myqueue.submit( [&] ( handler &commandgroup ) {
    commandgroup.parallel_for<uniquename> 
      ( range<1>{N},
        [=] ( id<1> idx ) { ... idx }
      )
    } );

Note that the lambda in the kernel captures by value. Capturing by reference makes no sense, since the kernel is executed on a device.

cgh.single_task(
  [=]() {
    // kernel function is executed EXACTLY once on a SINGLE work-item
});

The \indexsyclshow{submit} call results in an event object:

auto myevent = myqueue.submit( /* stuff */ );

This can be used for two purposes:

  1. It becomes possible to wait for this specific event:

    myevent.wait();    
    

  2. It can be used to indicate kernel dependencies:

    myqueue.submit( [&] (handler &h) {
        h.depends_on(myevent);
        /* stuff */
        } );
    

42.5 Parallel operations


42.5.1 Loops


cgh.parallel_for(
  range<3>(1024,1024,1024),
  // using 3D in this example
  [=](id<3> myID) {
    // kernel function is executed on an n-dimensional range (NDrange)
});

cgh.parallel_for(
  nd_range<3>( {1024,1024,1024},{16,16,16} ),
  // using 3D in this example 
  [=](nd_item<3> myID) {
    // kernel function is executed on an n-dimensional range (NDrange)
});

cgh.parallel_for_work_group(
  range<2>(1024,1024),
  // using 2D in this example
  [=](group<2> myGroup) {
    // kernel function is executed once per work-group
});

grp.parallel_for_work_item(
  range<1>(1024),
  // using 1D in this example
  [=](h_item<1> myItem) {
    // kernel function is executed once per work-item
});

42.5.1.1 Loop bounds: ranges


SYCL adopts the modern C++ philosophy that one does not iterate by explicitly enumerating indices, but by indicating their range. This is realized by the \indexsycldef{range} class, which is templated over the number of space dimensions.

sycl::range<2> matrix{10,10};

Some compilers are sensitive to the type of the integer arguments:

sycl::range<1> array{ static_cast<size_t>(size)} ;

42.5.1.2 Loop indices


Kernels such as \indexsyclshow{parallel_for} expect two arguments: a range to iterate over, and a kernel lambda.

There are several ways of indexing; the \indexsyclshow{id} class provides multi-dimensional indices.

myHandle.parallel_for<class uniqueID>
   ( mySize,
     [=]( id<1> index ) {
       float x = index.get(0) * h;
       deviceAccessorA[index] *= 2.;
     }
   )

cgh.parallel_for<class foo>(
    range<1>{D*D*D},
    [=](id<1> item) {
        xx[ item[0] ] = 2 * item[0] + 1;
    }
 )

While the C++ vector and its buffer remain one-dimensional, you can iterate over them with a multi-dimensional range:

std::vector<int> y(D*D*D);
buffer<int,1> y_buf(y.data(), range<1>(D*D*D));
cgh.parallel_for<class foo2D>
   (range<2>{D,D*D},
    [=](id<2> item) {
        yy[ item[0] + D*item[1] ] = 2;
    }
   );

\begin{dpcppnote} There is an implicit conversion from the one-dimensional sycl:: \indexsyclshow{id<1>} to size_t , so

[=](sycl::id<1> i) {
   data[i] = i;
}

is legal, which in SYCL requires

data[i[0]] = i[0];

\end{dpcppnote}

42.5.1.3 Multi-dimensional indexing


// stencil2d.cxx
sycl::range<2> stencil_range(N, M);
sycl::range<2> alloc_range(N + 2, M + 2);
std::vector<float>
  input(alloc_range.size()), 
  output(alloc_range.size());
  sycl::buffer<float, 2> input_buf(input.data(), alloc_range);
  sycl::buffer<float, 2> output_buf(output.data(), alloc_range);

constexpr size_t B = 4;
sycl::range<2> local_range(B, B);
sycl::range<2> tile_range = local_range + sycl::range<2>(2, 2); // Includes boundary cells
auto tile = local_accessor<float, 2>(tile_range, h); // see templated def'n above

We first copy global data into an array local to the work group:

sycl::id<2> offset(1, 1);
h.parallel_for
  ( sycl::nd_range<2>(stencil_range, local_range, offset),
    [=] ( sycl::nd_item<2> it ) {
// Load this tile into work-group local memory
    sycl::id<2>    lid    = it.get_local_id();
    sycl::range<2> lrange = it.get_local_range();
    for   (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
      for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
        int gi = ti + B * it.get_group(0);
        int gj = tj + B * it.get_group(1);
        tile[ti][tj] = input[gi][gj];
      }
    }

Global coordinates in the input are computed from the \indexsyclshow{nd_item}'s local coordinate and group number:

[=] ( sycl::nd_item<2> it ) {
for   (int ti ... ) {
  for (int tj ... ) {
    int gi = ti + B * it.get_group(0);
    int gj = tj + B * it.get_group(1);
    ... = input[gi][gj];

Local coordinates in the tile, including the boundary layer, are traversed with a stride equal to the work-group size:

[=] ( sycl::nd_item<2> it ) {
sycl::id<2>    lid    = it.get_local_id();
sycl::range<2> lrange = it.get_local_range();
for   (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
  for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
    tile[ti][tj] = ..

42.5.2 Task dependencies


Each submit call can be said to correspond to a `task'. Since it returns a token, it becomes possible to specify task dependencies by passing the token of an earlier task as a dependency in a later one.

queue myQueue;
auto myTokA = myQueue.submit
   ( [&](handler& h) {
       h.parallel_for<class taskA>(...);
     }
   );
auto myTokB = myQueue.submit
   ( [&](handler& h) {
       h.depends_on(myTokA);
       h.parallel_for<class taskB>(...);
     }
   );

42.5.3 Race conditions


Sycl has the same problems with race conditions that other shared-memory systems have:

// sum1d.cxx
auto array_accessor =
  array_buffer.get_access<sycl::access::mode::read>(h);
auto scalar_accessor =
  scalar_buffer.get_access<sycl::access::mode::read_write>(h);
h.parallel_for<class uniqueID>
  ( array_range,
    [=](sycl::id<1> index) 
    {
      scalar_accessor[0] += array_accessor[index];
    }
    ); // end of parallel for

To get this working correctly we need either a reduction primitive or atomic updates on the accumulator. The proposed SYCL 2020 standard has improved atomics. Here is the same summation with a reduction object:

// reduct1d.cxx
auto input_values = array_buffer.get_access<sycl::access::mode::read>(h);
auto sum_reduction = sycl::reduction( scalar_buffer,h,std::plus<>() );
h.parallel_for
  ( array_range,sum_reduction,
    [=]( sycl::id<1> index,auto& sum ) 
    {
      sum += input_values[index];
    }
    ); // end of parallel for

42.5.4 Reductions


Reduction operations were added in the SYCL 2020 Provisional Standard, meaning that they are not yet finalized.

Here is one approach, which works in hipsycl :

// reductscalar.cxx
auto reduce_to_sum =
  sycl::reduction( sum_array, static_cast<float>(0.), std::plus<float>() );
myqueue.parallel_for
  ( array_range,    // sycl::range<1>(input_size)
    reduce_to_sum,  // the sycl::reduction object
    [=] (sycl::id<1> idx, auto& reducer) { // reducer type is implementation-defined, so use auto
      reducer.combine(shared_array[idx[0]]);
    } ).wait();

Here a sycl:: \indexsycldef{reduction} object is created from the target data and the reduction operator. This is then passed to the \indexsyclshow{parallel_for} and its \indexsycldef{combine} method is called.

42.6 Memory access


Memory treatment in SYCL is a little complicated, because there is (at the very least) host memory and device memory, which are not necessarily coherent.

There are several allocation mechanisms:

TABLE 42.1: Memory types and treatments

Location | Allocation | Coherence / copy to and from device
Host | malloc | explicit transfer with \indexsyclshow{queue::memcpy}
Host | \indexsyclshow{malloc_host} | coherent host/device
Device | \indexsyclshow{malloc_device} | explicit transfer with \indexsyclshow{queue::memcpy}
Shared | \indexsyclshow{malloc_shared} | coherent host/device

42.6.1 Unified shared memory


Memory allocated with \indexsyclshow{malloc_host} is visible on the host:

// outshared.cxx
floattype
  *host_float = (floattype*)malloc_host( sizeof(floattype),ctx ),
  *shar_float = (floattype*)malloc_shared( sizeof(floattype),dev,ctx );
    cgh.single_task
      ( [=] () {
        shar_float[0] = 2 * host_float[0];
        sout << "Device sets " << shar_float[0] << sycl::endl;
      } );

Device memory is allocated with \indexsyclshow{malloc_device}, passing the queue as parameter:

// reductimpl.cxx
floattype
  *host_float = (floattype*)malloc( sizeof(floattype) ),
  *devc_float = (floattype*)malloc_device( sizeof(floattype),dev,ctx );
   [&](sycl::handler &cgh) {
     cgh.memcpy(devc_float,host_float,sizeof(floattype));
   }
Note the corresponding \indexsyclshow{free} call that also has the queue as parameter.

Note that you need to be in a parallel task. The following gives a segmentation error:

  [&](sycl::handler &cgh) {
    shar_float[0] = host_float[0];
  }

Ordinary memory, for instance from \indexsyclshow{malloc}, has to be copied in a kernel:

   [&](sycl::handler &cgh) {
     cgh.memcpy(devc_float,host_float,sizeof(floattype));
   }
   [&](sycl::handler &cgh) {
     sycl::stream sout(1024, 256, cgh);
     cgh.single_task
       (
        [=] () {
          sout << "Number " << devc_float[0] << sycl::endl;
        }
        );
   } // end of submitted lambda
free(host_float);
sycl::free(devc_float,myqueue);

42.6.2 Buffers and accessors


Arrays need to be declared in such a way that they can be accessed from any device.

// forloop.cxx
std::vector<int> myArray(SIZE);
  range<1> mySize{myArray.size()};
  buffer<int, 1> bufferA(myArray.data(), myArray.size());

Remark

sycl::range takes a size_t parameter; specifying an int may give a compiler warning about a narrowing conversion.
End of remark

Inside the kernel, the array is then unpacked from the buffer:

myqueue.submit( [&] (handler &h) {
    auto deviceAccessorA =
      bufferA.get_access<access::mode::read_write>(h);

However, the \indexsyclshow{get_access} function results in a sycl:: \indexsyclshow{accessor}, not a pointer to a simple type. The precise type is templated and complicated, so this is a good place to use auto.

Accessors can have an access mode associated with them: sycl::access::mode:: \indexsyclshow{read}, sycl::access::mode:: \indexsyclshow{write}.

\begin{dpcppnote}

    array<floattype,1> leftsum{0.};
#ifdef __INTEL_CLANG_COMPILER
    sycl::buffer leftbuf(leftsum);
#else
    sycl::range<1> scalar{1};
    sycl::buffer<floattype,1> leftbuf(leftsum.data(),scalar);
#endif

\end{dpcppnote}

\begin{dpcppnote} Access modes can be specified in standard SYCL, or with a DPCPP shorthand:

// standard
sycl::accessor acc = buffer.get_access<sycl::access::mode::write>(h);
// dpcpp extension
sycl::accessor acc( buffer,h,sycl::read_only );
sycl::accessor acc( buffer,h,sycl::write_only );

\end{dpcppnote}

42.6.2.1 Multi-D buffers


To create a multi-dimensional buffer object, use a sycl::range to specify the dimensions:

// jordan.cxx
vector<double> matrix(vecsize*vecsize);
sycl::range<2> mat_range{vecsize,vecsize};
sycl::buffer<double,2> matrix_buffer( matrix.data(),mat_range );

42.6.3 Querying


The function \indexsyclshow{get_range} can query the size of either a buffer or an accessor: \cxxverbatimsnippet[code/dpcpp/cxx/range2.cxx]{syclbufrange} \cxxverbatimsnippet[code/dpcpp/cxx/range2.cxx]{syclaccrange}

42.7 Parallel output


For output from inside a kernel there is the sycl:: \indexsyclshow{stream} class, with its own sycl:: \indexsyclshow{endl}.

// hello.cxx
[&](sycl::handler &cgh) {
  sycl::stream sout(1024, 256, cgh);
  cgh.parallel_for<class hello_world>
    (
     sycl::range<1>(global_range), [=](sycl::id<1> idx) {
       sout << "Hello, World: World rank " << idx << sycl::endl;
     }); // End of the kernel function
}

Since the end of a queue does not flush stdout, it may be necessary to call sycl::queue:: \indexsyclshow{wait}:

myQueue.wait();  

42.8 DPCPP extensions


Intel has made a number of extensions to SYCL; see the extensions table linked at the start of this chapter.

42.9 Intel devcloud notes


Use qsub -I for an interactive session, and gdb-oneapi for debugging.

For support: https://community.intel.com/t5/Intel-oneAPI-Toolkits/ct-p/oneapi

42.10 Examples


42.10.1 Kernels in a loop


The following idiom works: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{syclkernelloop}

42.10.2 Stencil computations


The problem with stencil computations is that only interior points are updated. Translated to SYCL: we need to iterate over a subrange of the range over which the buffer is defined. First let us define these ranges: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{syclrangebc} Note the boundary value $1.$ on the right boundary.

Restricting the iteration to the interior points is done through the \indexsyclshow{offset} parameter of the \indexsyclshow{parallel_for}: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{sycliteratebc}
