This chapter explains the basic concepts of SYCL/DPC++, and helps you get started on running your first program.
\begin{dpcppnote} The various Intel extensions are listed here:
https://spec.oneapi.com/versions/latest/elements/dpcpp/source/index.html#extensions-table
\end{dpcppnote}
\section{Logistics}
Headers:
\begin{lstlisting}
#include <CL/sycl.hpp>
\end{lstlisting}
You can now use the namespace, but with care! If you use
\begin{lstlisting}
using namespace cl;
\end{lstlisting}
you have to prefix all SYCL classes with \lstinline+sycl::+, which is a bit of a bother. However, if you use
\begin{lstlisting}
using namespace cl::sycl;
\end{lstlisting}
you run into the fact that SYCL has its own versions of many STL commands, so you will get name collisions. The most obvious example is that the cl::sycl namespace has its own versions of \n{cout} and \n{endl}. Therefore you have to use explicitly \lstinline+std::cout+ and \lstinline+std::endl+. Using the wrong I/O will cause tons of inscrutable error messages. Additionally, SYCL has its own version of \n{free}, and of several math routines.
\begin{dpcppnote}
\begin{lstlisting}
using namespace sycl;
\end{lstlisting}
\end{dpcppnote}
\section{Platforms and devices}
Since DPCPP is cross-platform, we first need to discover the devices.
First we list the platforms:
\begin{lstlisting}
// devices.cxx
std::vector<sycl::platform> platforms = sycl::platform::get_platforms();
for (const auto &plat : platforms) {
  // get_info is a template: we pass the type as a template argument
  std::cout << "Platform: "
            << plat.get_info<sycl::info::platform::name>() << " "
            << plat.get_info<sycl::info::platform::vendor>() << " "
            << plat.get_info<sycl::info::platform::version>() << '\n';
\end{lstlisting}
Then for each platform we list the devices:
\begin{lstlisting}
std::vector<sycl::device> devices = plat.get_devices();
for (const auto &dev : devices) {
  std::cout << "-- Device: "
            << dev.get_info<sycl::info::device::name>()
            << (dev.is_host() ? ": is the host" : "")
            << (dev.is_cpu()  ? ": is a cpu" : "")
            << (dev.is_gpu()  ? ": is a gpu" : "")
            << std::endl;
\end{lstlisting}
You can query what type of device you are dealing with by \indexsyclshow{is_host}, \indexsyclshow{is_cpu}, \indexsyclshow{is_gpu}.
\section{Queues}
The execution mechanism of SYCL is the queue: a sequence of actions that will be executed on a selected device. The only user action is submitting actions to a queue; the queue is executed at the end of the scope where it is declared.
Queue execution is asynchronous with host code.
\subsection{Device selectors}
You need to select a device on which to execute the queue. A queue is coupled to one specific device, so it can not spread work over multiple devices. You can let the queue pick a default device with
\begin{lstlisting}
sycl::queue myqueue;
\end{lstlisting}
The following example explicitly assigns the queue to the CPU device using the sycl::\indexsyclshow{cpu_selector}.
\begin{lstlisting}
// cpuname.cxx
sycl::queue myqueue( sycl::cpu_selector{} );
\end{lstlisting}
The sycl::\indexsyclshow{host_selector} bypasses any devices and makes the code run on the host.
It is good for your sanity to print the name of the device you are running on:
\begin{lstlisting}
// devname.cxx
std::cout << myqueue.get_device().get_info<sycl::info::device::name>()
          << std::endl;
\end{lstlisting}
If you try to select a device that is not available, a sycl:: \indexsyclshow{runtime_error} exception will be thrown.
\begin{dpcppnote}
\begin{lstlisting}
#include "CL/sycl/intel/fpga_extensions.hpp"
fpga_selector
\end{lstlisting}
\end{dpcppnote}
\subsection{Queue submission and execution}
It seems that queued kernels are also executed when they go out of scope, even if the queue itself does not:
\begin{lstlisting}
// doubler.cxx
sycl::range<1> mySize{SIZE};
sycl::buffer<int, 1> bufferA(myArray.data(), mySize);
myqueue.submit
  ( [&](sycl::handler &myHandle) {
      auto deviceAccessorA =
        bufferA.get_access<sycl::access::mode::read_write>(myHandle);
    } );
// queue goes out of scope, executes
\end{lstlisting}
\subsection{Kernel ordering}
Kernels are not necessarily executed in the order in which they are submitted. You can enforce ordering by specifying an in-order queue:
\begin{lstlisting}
sycl::queue myqueue{ sycl::property::queue::in_order() };
\end{lstlisting}
\section{Kernels}
One kernel per submit:
\begin{lstlisting}
myqueue.submit
  ( [&] ( handler &commandgroup ) {
      commandgroup.parallel_for<uniquename>
        ( range<1>{N},
          [=] ( id<1> idx ) { /* ... use idx ... */ }
          );
    } );
\end{lstlisting}
Note that the lambda in the kernel captures by value. Capturing by reference makes no sense, since the kernel is executed on a device.
\begin{lstlisting}
cgh.single_task
  ( [=]() {
      // kernel function is executed exactly once, on a single work-item
    } );
\end{lstlisting}
The \indexsyclshow{submit} call results in an event object:
\begin{lstlisting}
auto myevent = myqueue.submit( /* stuff */ );
\end{lstlisting}
This can be used for two purposes: waiting for the event on the host,
\begin{lstlisting}
myevent.wait();
\end{lstlisting}
or making a later submission depend on it:
\begin{lstlisting}
myqueue.submit
  ( [=] (handler &h) {
      h.depends_on(myevent);
      /* stuff */
    } );
\end{lstlisting}
\section{Parallel operations}
\subsection{Loops}
\begin{lstlisting}
cgh.parallel_for(
    range<3>(1024,1024,1024), // using 3D in this example
    [=](id<3> myID) {
      // kernel function is executed on an n-dimensional range (nd-range)
    });

cgh.parallel_for(
    nd_range<3>( {1024,1024,1024}, {16,16,16} ), // global and work-group size
    [=](nd_item<3> myID) {
      // kernel function is executed on an n-dimensional range (nd-range)
    });

cgh.parallel_for_work_group(
    range<2>(1024,1024), // using 2D in this example
    [=](group<2> myGroup) {
      // kernel function is executed once per work-group
    });

grp.parallel_for_work_item(
    range<1>(1024), // using 1D in this example
    [=](h_item<1> myItem) {
      // kernel function is executed once per work-item
    });
\end{lstlisting}
\subsubsection{Loop bounds: ranges}
SYCL adopts the modern C++ philosophy that one does not iterate by explicitly enumerating indices, but by indicating their range. This is realized by the \indexsycldef{range} class, which is templated over the number of space dimensions.
\begin{lstlisting}
sycl::range<2> matrix{10,10};
\end{lstlisting}
Some compilers are sensitive to the type of the integer arguments:
\begin{lstlisting}
sycl::range<1> array{ static_cast<size_t>(size) };
\end{lstlisting}
\subsubsection{Loop indices}
Kernels such as \indexsyclshow{parallel_for} expect two arguments: a range to iterate over, and the kernel lambda. There are several ways of indexing.
The \indexsyclshow{id} class is the basic mechanism for indexing inside a kernel:
\begin{lstlisting}
myHandle.parallel_for<class uniqueID>
  ( mySize,
    [=]( id<1> index ) {
      float x = index.get(0) * h;
      deviceAccessorA[index] *= 2.;
    }
  );
\end{lstlisting}
\begin{lstlisting}
cgh.parallel_for<class foo>
  ( range<1>{D*D*D},
    [=](id<1> item) {
      xx[ item[0] ] = 2 * item[0] + 1;
    } );
\end{lstlisting}
While the C++ vectors remain one-dimensional, DPCPP allows you to make multi-dimensional buffers:
\begin{lstlisting}
std::vector<int> y(D*D*D);
buffer<int,1> y_buf(y.data(), range<1>(D*D*D));
cgh.parallel_for<class foo2D>
  ( range<2>{D,D*D},
    [=](id<2> item) {
      yy[ item[0] + D*item[1] ] = 2;
    } );
\end{lstlisting}
\begin{dpcppnote} There is an implicit conversion from the one-dimensional sycl::\indexsyclshow{id<1>} to \lstinline+size_t+, so
\begin{lstlisting}
[=](sycl::id<1> i) { data[i] = i; }
\end{lstlisting}
is legal, which in standard SYCL requires
\begin{lstlisting}
data[i[0]] = i[0];
\end{lstlisting}
\end{dpcppnote}
\subsubsection{Multi-dimensional indexing}
\begin{lstlisting}
// stencil2d.cxx
sycl::range<2> stencil_range(N, M);
sycl::range<2> alloc_range(N + 2, M + 2);
std::vector<float> input(alloc_range.size()), output(alloc_range.size());
sycl::buffer<float, 2> input_buf(input.data(), alloc_range);
sycl::buffer<float, 2> output_buf(output.data(), alloc_range);
\end{lstlisting}
\begin{lstlisting}
constexpr size_t B = 4;
sycl::range<2> local_range(B, B);
sycl::range<2> tile_range = local_range + sycl::range<2>(2, 2); // includes boundary cells
auto tile = local_accessor<float, 2>(tile_range, h); // see templated def'n above
\end{lstlisting}
We first copy global data into an array local to the work group:
\begin{lstlisting}
sycl::id<2> offset(1, 1);
h.parallel_for
  ( sycl::nd_range<2>(stencil_range, local_range, offset),
    [=] ( sycl::nd_item<2> it ) {
      // Load this tile into work-group local memory
      sycl::id<2> lid = it.get_local_id();
      sycl::range<2> lrange = it.get_local_range();
      for (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
        for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
          int gi = ti + B * it.get_group(0);
          int gj = tj + B * it.get_group(1);
          tile[ti][tj] = input[gi][gj];
        }
      }
\end{lstlisting}
Global coordinates in the input are computed from the \indexsyclshow{nd_item}'s coordinate and group:
\begin{lstlisting}
[=] ( sycl::nd_item<2> it ) {
  for (int ti ... ) {
    for (int tj ... ) {
      int gi = ti + B * it.get_group(0);
      int gj = tj + B * it.get_group(1);
      ... = input[gi][gj];
\end{lstlisting}
Local coordinates in the tile, including the boundary, are derived from the local id and the local range:
\begin{lstlisting}
[=] ( sycl::nd_item<2> it ) {
  sycl::id<2> lid = it.get_local_id();
  sycl::range<2> lrange = it.get_local_range();
  for (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
    for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
      tile[ti][tj] = ..
\end{lstlisting}
\subsection{Task dependencies}
Each submit call can be said to correspond to a `task'. Since it returns a token, it becomes possible to specify task dependencies by referring to a token as a dependency in a later specified task.
\begin{lstlisting}
queue myQueue;
auto myTokA = myQueue.submit
  ( [&](handler& h) {
      h.parallel_for<class taskA>(...);
    } );
auto myTokB = myQueue.submit
  ( [&](handler& h) {
      h.depends_on(myTokA);
      h.parallel_for<class taskB>(...);
    } );
\end{lstlisting}
\subsection{Race conditions}
SYCL has the same problems with race conditions that other shared-memory systems have:
\begin{lstlisting}
// sum1d.cxx
auto array_accessor =
  array_buffer.get_access<sycl::access::mode::read>(h);
auto scalar_accessor =
  scalar_buffer.get_access<sycl::access::mode::read_write>(h);
h.parallel_for<class uniqueID>
  ( array_range,
    [=](sycl::id<1> index) {
      scalar_accessor[0] += array_accessor[index];
    } ); // end of parallel for
\end{lstlisting}
Getting this to work correctly would need either a reduction primitive or atomics on the accumulator. The SYCL 2020 proposed standard has improved atomics.
\begin{lstlisting}
// reduct1d.cxx
auto input_values =
  array_buffer.get_access<sycl::access::mode::read>(h);
auto sum_reduction = sycl::reduction( scalar_buffer,h,std::plus<>() );
h.parallel_for
  ( array_range,sum_reduction,
    [=]( sycl::id<1> index,auto& sum ) {
      sum += input_values[index];
    } ); // end of parallel for
\end{lstlisting}
\subsection{Reductions}
Reduction operations were added in the SYCL 2020 Provisional Standard, meaning that they are not yet finalized.
Here is one approach, which works in hipsycl :
\begin{lstlisting}
// reductscalar.cxx
auto reduce_to_sum =
  sycl::reduction( sum_array, static_cast<float>(0.), std::plus<float>() );
myqueue.parallel_for
  ( array_range, reduce_to_sum,
    [=] (sycl::id<1> idx, auto& reducer) {
      // the type of the reducer is implementation-dependent, so use auto
      reducer.combine( shared_array[idx[0]] );
    } ).wait();
\end{lstlisting}
\section{Memory access}
Memory treatment in SYCL is a little complicated, because there is (at the very least) host memory and device memory, which are not necessarily coherent.
There are also three allocation mechanisms:
\begin{tabular}{llll}
\toprule
Location & allocation & coherence & copy to/from device \\
\midrule
Host & malloc & explicit transfer & \indexsyclshow{queue::memcpy} \\
     & \indexsyclshow{malloc_host} & coherent host/device & \\
Device & \indexsyclshow{malloc_device} & explicit transfer & \indexsyclshow{queue::memcpy} \\
Shared & \indexsyclshow{malloc_shared} & coherent host/device & \\
\bottomrule
\end{tabular}

TABLE 42.1: Memory types and treatments
\subsection{Unified shared memory}
Memory allocated with \indexsyclshow{malloc_host} is visible on the host.
Device memory is allocated with \indexsyclshow{malloc_device}, passing the queue as parameter:
\begin{lstlisting}
// reductimpl.cxx
floattype
  *host_float = (floattype*)malloc( sizeof(floattype) ),
  *devc_float = (floattype*)malloc_device( sizeof(floattype),dev,ctx );
[&](sycl::handler &cgh) {
  cgh.memcpy(devc_float,host_float,sizeof(floattype));
}
\end{lstlisting}
Note that you need to be in a parallel task. The following gives a segmentation error:
\begin{lstlisting}
[&](sycl::handler &cgh) {
  shar_float[0] = host_float[0];
}
\end{lstlisting}
Ordinary memory, for instance from \indexsyclshow{malloc}, has to be copied in a kernel:
\begin{lstlisting}
[&](sycl::handler &cgh) {
  cgh.memcpy(devc_float,host_float,sizeof(floattype));
}
[&](sycl::handler &cgh) {
  sycl::stream sout(1024, 256, cgh);
  cgh.single_task
    ( [=] () {
        sout << "Number " << devc_float[0] << sycl::endl;
      } );
} // end of submitted lambda
free(host_float);
sycl::free(devc_float,myqueue);
\end{lstlisting}
\subsection{Buffers and accessors}
Arrays need to be declared in a way such that they can be accessed from any device.
\begin{lstlisting}
// forloop.cxx
std::vector<int> myArray(SIZE);
range<1> mySize{myArray.size()};
buffer<int, 1> bufferA(myArray.data(), myArray.size());
\end{lstlisting}
Remark: \lstinline+sycl::range+ takes a \lstinline+size_t+ parameter; specifying an \lstinline+int+ may give a compiler warning about a narrowing conversion.
Inside the kernel, the array is then unpacked from the buffer:
\begin{lstlisting}
myqueue.submit( [&] (handler &h) {
    auto deviceAccessorA =
      bufferA.get_access<access::mode::read_write>(h);
\end{lstlisting}
However, the \indexsyclshow{get_access} function results in a sycl::\indexsyclshow{accessor}, not a pointer to a simple type. The precise type is templated and complicated, so this is a good place to use \lstinline+auto+.
Accessors can have an access mode associated, such as sycl::access::mode::\indexsyclshow{read} and sycl::access::mode::\indexsyclshow{write}.
\begin{dpcppnote}
\begin{lstlisting}
array<floattype,1> leftsum{0.};
#ifdef __INTEL_CLANG_COMPILER
sycl::buffer leftbuf(leftsum);
#else
sycl::range<1> scalar{1};
sycl::buffer<floattype,1> leftbuf(leftsum.data(),scalar);
#endif
\end{lstlisting}
\end{dpcppnote}
\begin{dpcppnote} There are shorthand forms for the access modes:
\begin{lstlisting}
// standard
sycl::accessor acc = buffer.get_access<sycl::access::mode::write>(h);
// dpcpp extension
sycl::accessor acc( buffer,h,sycl::read_only );
sycl::accessor acc( buffer,h,sycl::write_only );
\end{lstlisting}
\end{dpcppnote}
\subsubsection{Multi-D buffers}
To create a multi-dimensional buffer object, use a sycl::range to specify the dimensions:
\begin{lstlisting}
// jordan.cxx
vector<double> matrix(vecsize*vecsize);
sycl::range<2> mat_range{vecsize,vecsize};
sycl::buffer<double,2> matrix_buffer( matrix.data(),mat_range );
\end{lstlisting}
\subsection{Querying}
The function \indexsyclshow{get_range} can query the size of either a buffer or an accessor: \cxxverbatimsnippet[code/dpcpp/cxx/range2.cxx]{syclbufrange} \cxxverbatimsnippet[code/dpcpp/cxx/range2.cxx]{syclaccrange}
\section{Parallel output}
For output inside a kernel there is the sycl::\indexsyclshow{stream} class, with its own sycl::\indexsyclshow{endl}.
\begin{lstlisting}
// hello.cxx
[&](sycl::handler &cgh) {
  sycl::stream sout(1024, 256, cgh);
  cgh.parallel_for<class hello_world>
    ( sycl::range<1>(global_range),
      [=](sycl::id<1> idx) {
        sout << "Hello, World: World rank " << idx << sycl::endl;
      }); // end of the kernel function
}
\end{lstlisting}
Since the end of a queue does not flush stdout, it may be necessary to call sycl::queue::\indexsyclshow{wait}:
\begin{lstlisting}
myQueue.wait();
\end{lstlisting}
\section{DPCPP extensions}
Intel has made some extensions to SYCL.
\section{Intel devcloud notes}
\begin{itemize}
\item \lstinline+qsub -I+ for an interactive session.
\item \lstinline+gdb-oneapi+ for debugging.
\item https://community.intel.com/t5/Intel-oneAPI-Toolkits/ct-p/oneapi for support.
\end{itemize}
\section{Examples}
\subsection{Kernels in a loop}
The following idiom works: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{syclkernelloop}
\subsection{Stencil computations}
The problem with stencil computations is that only interior points are updated. Translated to SYCL: we need to iterate over a subrange of the range over which the buffer is defined. First let us define these ranges: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{syclrangebc} Note the boundary value $1.$ on the right boundary.
Restricting the iteration to the interior points is done through the \indexsyclshow{offset} parameter of the \indexsyclshow{parallel_for}: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{sycliteratebc}