Sycl, OneAPI, DPC++

Experimental HTML version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https:/
42.1 : Logistics
42.2 : Platforms and devices
42.3 : Queues
42.3.1 : Device selectors
42.3.2 : Queue submission and execution
42.3.3 : Kernel ordering
42.4 : Kernels
42.5 : Parallel operations
42.5.1 : Loops : Loop bounds: ranges : Loop indices : Multi-dimensional indexing
42.5.2 : Task dependencies
42.5.3 : Race conditions
42.5.4 : Reductions
42.6 : Memory access
42.6.1 : Unified shared memory
42.6.2 : Buffers and accessors : Multi-D buffers
42.6.3 : Querying
42.7 : Parallel output
42.8 : DPCPP extensions
42.9 : Intel devcloud notes
42.10 : Examples
42.10.1 : Kernels in a loop
42.10.2 : Stencil computations

42 Sycl, OneAPI, DPC++

This chapter explains the basic concepts of SYCL/DPC++, and helps you get started on running your first program.

\begin{dpcppnote} The various Intel extensions are discussed in section 42.8. \end{dpcppnote}


42.1 Logistics



All SYCL functionality comes from a single header:

#include <CL/sycl.hpp>

You can then use the namespace, but with care. If you use

using namespace cl;

you have to prefix all SYCL classes with sycl::, which is a bit of a bother. However, if you use

using namespace cl::sycl;

you run into the fact that SYCL has its own versions of many STL commands, and so you will get name collisions. The most obvious example is that the cl::sycl namespace has its own versions of cout and endl. Therefore you have to write std::cout and std::endl explicitly. Using the wrong I/O will cause tons of inscrutable error messages. Additionally, SYCL has its own version of free, and of several math routines.


In newer SYCL versions the namespace is simply sycl:

    using namespace sycl;


42.2 Platforms and devices


Since DPCPP is cross-platform, we first need to discover the available devices.

First we list the platforms:

// devices.cxx
std::vector<sycl::platform> platforms = sycl::platform::get_platforms();
for (const auto &plat : platforms) {
  // get_info is a template: pass the requested property as the template argument
  std::cout << "Platform: "
            << plat.get_info<sycl::info::platform::name>() << " "
            << plat.get_info<sycl::info::platform::vendor>() << " "
            << plat.get_info<sycl::info::platform::version>()
            << '\n';

Then for each platform we list the devices:

  std::vector<sycl::device> devices = plat.get_devices();
  for (const auto &dev : devices) {
    std::cout << "-- Device: "
              << dev.get_info<sycl::info::device::name>()
              << (dev.is_host() ? ": is the host" : "")
              << (dev.is_cpu() ? ": is a cpu" : "")
              << (dev.is_gpu() ? ": is a gpu" : "")
              << std::endl;
  }
}

You can query what type of device you are dealing with by is_host, is_cpu, is_gpu.

42.3 Queues


The execution mechanism of SYCL is the queue: a sequence of actions that will be executed on a selected device. The only user action is submitting work to a queue; the queue is executed at the end of the scope where it is declared.

Queue execution is asynchronous with host code.

42.3.1 Device selectors


You need to select a device on which to execute the queue. A queue is coupled to one specific device, so it cannot spread work over multiple devices. You can let the runtime pick a default device for the queue with

  sycl::queue myqueue;

The following example explicitly assigns the queue to the CPU device using the sycl::cpu_selector.

// cpuname.cxx
sycl::queue myqueue( sycl::cpu_selector{} );

The sycl::host_selector bypasses any devices and makes the code run on the host.

It is good for your sanity to print the name of the device you are running on:

// devname.cxx
std::cout << myqueue.get_device().get_info<sycl::info::device::name>()
          << std::endl;

If you try to select a device that is not available, a sycl::runtime_error exception will be thrown.


For Intel FPGA devices an extra header is needed:

    #include "CL/sycl/intel/fpga_extensions.hpp"


42.3.2 Queue submission and execution


Queued kernels are also executed when the queue goes out of scope:

// doubler.cxx
{
  sycl::queue myqueue;
  sycl::range<1> mySize{SIZE};
  // hypothetical host array `myArray` of SIZE ints
  sycl::buffer<int, 1> bufferA(myArray.data(), mySize);
  myqueue.submit( [&](sycl::handler &myHandle) {
      auto deviceAccessorA =
        bufferA.get_access<sycl::access::mode::read_write>(myHandle);
      // ... kernel that doubles the array elements ...
    } );
} // queue goes out of scope, executes

42.3.3 Kernel ordering


Kernels are not necessarily executed in the order in which they are submitted. You can enforce ordering by specifying an in-order queue:

sycl::queue myqueue{ sycl::property::queue::in_order{} };

42.4 Kernels


One kernel per submit call.

myqueue.submit( [&] ( handler &commandgroup ) {
    commandgroup.parallel_for<class uniqueID>
      ( range<1>{N},
        [=] ( id<1> idx ) { /* ... use idx ... */ }
        );
  } );

Note that the lambda in the kernel captures by value. Capturing by reference makes no sense, since the kernel is executed on a device.

commandgroup.single_task
  ( [=]() {
      // kernel function is executed EXACTLY once on a SINGLE work-item
    } );

The submit call results in an event object:

auto myevent = myqueue.submit( /* stuff */ );

This can be used for two purposes:

  1. It becomes possible to wait for this specific event:

    myevent.wait();
  2. It can be used to indicate kernel dependencies:

    myqueue.submit( [&] (handler &h) {
        h.depends_on(myevent);
        /* stuff */
      } );

42.5 Parallel operations


42.5.1 Loops


There are several forms of parallel dispatch. A parallel_for over a plain range gets an id argument:

h.parallel_for(
  range<3>( /* global sizes */ ),
  // using 3D in this example
  [=](id<3> myID) {
    // kernel function is executed on an n-dimensional range (NDrange)
  });

With an nd_range, which also specifies the work-group size, the kernel gets an nd_item:

h.parallel_for(
  nd_range<3>( {1024,1024,1024},{16,16,16} ),
  // using 3D in this example
  [=](nd_item<3> myID) {
    // kernel function is executed on an n-dimensional range (NDrange)
  });

h.parallel_for_work_group(
  range<2>( /* number of work-groups */ ),
  // using 2D in this example
  [=](group<2> myGroup) {
    // kernel function is executed once per work-group
  });

myGroup.parallel_for_work_item(
  range<1>( /* work-items per group */ ),
  // using 1D in this example
  [=](h_item<1> myItem) {
    // kernel function is executed once per work-item
  });

Loop bounds: ranges


SYCL adopts the modern C++ philosophy that one does not iterate by explicitly enumerating indices, but by indicating their range. This is realized by the range class, which is templated over the number of space dimensions.

sycl::range<2> matrix{10,10};

Some compilers are sensitive to the type of the integer arguments:

sycl::range<1> array{ static_cast<size_t>(size) };

Loop indices


Kernels such as parallel_for expect two arguments: a range (or nd_range) describing the iteration space, and the kernel lambda itself.

There are several ways of indexing. The id class describes multi-dimensional indices.

myHandle.parallel_for<class uniqueID>
   ( mySize,
     [=]( id<1> index ) {
       float x = index.get(0) * h;
       deviceAccessorA[index] *= 2.;
     } );

cgh.parallel_for<class foo>(
    sycl::range<1>(N),   // hypothetical extent
    [=](id<1> item) {
        xx[ item[0] ] = 2 * item[0] + 1;
    });

While C++ vectors remain one-dimensional, DPCPP allows you to iterate over them with multi-dimensional indices:

std::vector<int> y(D*D*D);
buffer<int,1> y_buf( y.data(), range<1>(D*D*D) );
cgh.parallel_for<class foo2D>
    ( range<2>(D,D),   // hypothetical 2D iteration range
      [=](id<2> item) {
          yy[ item[0] + D*item[1] ] = 2;
      });

\begin{dpcppnote} There is an implicit conversion from the one-dimensional sycl::id<1> to size_t, so

[=](sycl::id<1> i) {
   data[i] = i;
}

is legal, where standard SYCL requires

data[i[0]] = i[0];

\end{dpcppnote}

Multi-dimensional indexing


// stencil2d.cxx
sycl::range<2> stencil_range(N, M);
sycl::range<2> alloc_range(N + 2, M + 2);
// hypothetical host vectors `input_data`, `output_data` of size (N+2)*(M+2)
sycl::buffer<float, 2> input_buf( input_data.data(), alloc_range );
sycl::buffer<float, 2> output_buf( output_data.data(), alloc_range );

constexpr size_t B = 4;
sycl::range<2> local_range(B, B);
sycl::range<2> tile_range = local_range + sycl::range<2>(2, 2); // includes boundary cells
auto tile = sycl::local_accessor<float, 2>(tile_range, h);

We first copy global data into an array local to the work group:

sycl::id<2> offset(1, 1);
h.parallel_for
  ( sycl::nd_range<2>(stencil_range, local_range, offset),
    [=] ( sycl::nd_item<2> it ) {
      // Load this tile into work-group local memory
      sycl::id<2>    lid    = it.get_local_id();
      sycl::range<2> lrange = it.get_local_range();
      for   (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
        for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
          int gi = ti + B * it.get_group(0);
          int gj = tj + B * it.get_group(1);
          tile[ti][tj] = input[gi][gj];
        }
      }
      // ... stencil update using the tile follows ...
    } );

Global coordinates in the input are computed from the nd_item's coordinate and group:

[=] ( sycl::nd_item<2> it ) {
for   (int ti ... ) {
  for (int tj ... ) {
    int gi = ti + B * it.get_group(0);
    int gj = tj + B * it.get_group(1);
    ... = input[gi][gj];

Local coordinates in the tile include the boundary layer: each work-item starts at its own local id and strides by the work-group size, so that the B x B work-items together cover all (B+2) x (B+2) tile locations.

[=] ( sycl::nd_item<2> it ) {
sycl::id<2>    lid    = it.get_local_id();
sycl::range<2> lrange = it.get_local_range();
for   (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
  for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
    tile[ti][tj] = ..

42.5.2 Task dependencies


Each submit call can be said to correspond to a `task'. Since it returns a token, it becomes possible to specify task dependencies by referring to the token of an earlier task in a later one.

queue myQueue;
auto myTokA = myQueue.submit
   ( [&](handler& h) {
       h.parallel_for<class taskA>(...);
     } );
auto myTokB = myQueue.submit
   ( [&](handler& h) {
       h.depends_on(myTokA); // this task waits for taskA
       h.parallel_for<class taskB>(...);
     } );

42.5.3 Race conditions


SYCL has the same problems with race conditions that other shared memory systems have:

// sum1d.cxx
auto array_accessor =
  array_buffer.get_access<sycl::access::mode::read>(h);
auto scalar_accessor =
  scalar_buffer.get_access<sycl::access::mode::read_write>(h);
h.parallel_for<class uniqueID>
  ( array_range,
    [=](sycl::id<1> index) {
      scalar_accessor[0] += array_accessor[index]; // race!
    } ); // end of parallel for

To get this working correctly one needs either a reduction primitive or atomic updates of the accumulator. The SYCL 2020 proposed standard has improved atomics.

// reduct1d.cxx
auto input_values = array_buffer.get_access<sycl::access::mode::read>(h);
auto sum_reduction = sycl::reduction( scalar_buffer,h,std::plus<>() );
h.parallel_for
  ( array_range,sum_reduction,
    [=]( sycl::id<1> index,auto& sum ) {
      sum += input_values[index];
    } ); // end of parallel for

42.5.4 Reductions


Reduction operations were added in the SYCL 2020 Provisional Standard, meaning that they were not yet finalized at the time of writing.

Here is one approach, which works in hipSYCL:

// reductscalar.cxx
auto reduce_to_sum =
  sycl::reduction( sum_array, static_cast<float>(0.), std::plus<float>() );
myqueue.parallel_for
  ( array_range,
    reduce_to_sum,
    [=] (sycl::id<1> idx, auto& reducer) { // type of reducer is implementation-dependent, so use auto
      reducer.combine(shared_array[idx[0]]);
      // alternative: reducer += shared_array[idx[0]];
    } ).wait();

Here a sycl::reduction object is created from the target data, an identity element, and the reduction operator. This is then passed to the parallel_for, and inside the kernel its combine method is called.

42.6 Memory access


Memory treatment in SYCL is a little complicated, because there is (at the very least) host memory and device memory, which are not necessarily coherent.

Besides ordinary malloc there are three SYCL allocation mechanisms, summarized in table 42.1.

TABLE 42.1: Memory types and treatments

Location | allocation    | coherence            | copy to/from device
---------+---------------+----------------------+--------------------
Host     | malloc        | explicit transfer    | queue::memcpy
Host     | malloc_host   | coherent host/device |
Device   | malloc_device | explicit transfer    | queue::memcpy
Shared   | malloc_shared | coherent host/device |

42.6.1 Unified shared memory


Memory allocated with malloc_host is visible on the host:

// outshared.cxx
floattype
  *host_float = (floattype*)malloc_host( sizeof(floattype),ctx ),
  *shar_float = (floattype*)malloc_shared( sizeof(floattype),dev,ctx );
myqueue.submit( [&](sycl::handler &cgh) {
    sycl::stream sout(1024, 256, cgh);
    cgh.single_task
      ( [=] () {
          shar_float[0] = 2 * host_float[0];
          sout << "Device sets " << shar_float[0] << sycl::endl;
        } );
  } );

Device memory is allocated with malloc_device, passing the device and context (or alternatively a queue) as parameters:

// reductimpl.cxx
floattype
  *host_float = (floattype*)malloc( sizeof(floattype) ),
  *devc_float = (floattype*)malloc_device( sizeof(floattype),dev,ctx );

Note the corresponding sycl::free call, which takes the context (or a queue) as an extra parameter.

Note that you need to be in a parallel task to use these pointers. The following, which assigns in the command group function rather than in a kernel, gives a segmentation error:

myqueue.submit( [&](sycl::handler &cgh) {
    shar_float[0] = host_float[0]; // wrong: not inside a kernel
  } );

Ordinary memory, for instance from malloc, has to be copied to the device explicitly:

myqueue.submit( [&](sycl::handler &cgh) {
    cgh.memcpy(devc_float,host_float,sizeof(floattype));
  } );
myqueue.submit( [&](sycl::handler &cgh) {
    sycl::stream sout(1024, 256, cgh);
    cgh.single_task
      ( [=] () {
          sout << "Number " << devc_float[0] << sycl::endl;
        } );
  } ); // end of submitted lambda

42.6.2 Buffers and accessors


Arrays need to be declared in such a way that they can be accessed from any device.

// forloop.cxx
std::vector<int> myArray(SIZE);
  range<1> mySize{myArray.size()};
  buffer<int, 1> bufferA(myArray.data(), mySize);


Remark: sycl::range takes size_t parameters; specifying an int may give a compiler warning about a narrowing conversion.

Inside the kernel, the array is then unpacked from the buffer:

myqueue.submit( [&] (handler &h) {
    auto deviceAccessorA =
      bufferA.get_access<access::mode::read_write>(h);

However, the get_access function results in a sycl::accessor, not a pointer to a simple type. The precise type is templated and complicated, so this is a good place to use auto.

Accessors can have an access mode associated with them: sycl::access::mode::read, sycl::access::mode::write.


A scalar buffer can be created directly from a container:

    array<floattype,1> leftsum{0.};
    sycl::buffer leftbuf(leftsum);

or from a pointer plus an explicit range:

    sycl::range<1> scalar{1};
    sycl::buffer<floattype,1> leftbuf( leftsum.data(),scalar );


\begin{dpcppnote} There are shorthand notations for the access modes:

// standard
sycl::accessor acc = buffer.get_access<sycl::access::mode::write>(h);
// dpcpp extension
sycl::accessor acc( buffer,h,sycl::read_only );
sycl::accessor acc( buffer,h,sycl::write_only );

\end{dpcppnote}

Multi-D buffers


To create a multi-dimensional buffer object, use a sycl::range to specify the dimensions:

// jordan.cxx
vector<double> matrix(vecsize*vecsize);
sycl::range<2> mat_range{vecsize,vecsize};
sycl::buffer<double,2> matrix_buffer( matrix.data(),mat_range );

42.6.3 Querying


The function get_range can query the size of either a buffer or an accessor; see the snippets syclbufrange and syclaccrange in code/dpcpp/cxx/range2.cxx.

42.7 Parallel output


Output from inside a kernel goes through a sycl::stream object; there is a sycl::endl for line ends.

// hello.cxx
myqueue.submit( [&](sycl::handler &cgh) {
    sycl::stream sout(1024, 256, cgh);
    cgh.parallel_for<class hello_world>
      ( sycl::range<1>(global_range), [=](sycl::id<1> idx) {
          sout << "Hello, World: World rank " << idx << sycl::endl;
        } ); // End of the kernel function
  } );

Since the end of a queue does not flush stdout, it may be necessary to call sycl::queue::wait().


42.8 DPCPP extensions


Intel has made a number of extensions to SYCL; these are marked as dpcpp notes throughout this chapter.

42.9 Intel devcloud notes


qsub -I for interactive session.

gdb-oneapi for debugging.


42.10 Examples


42.10.1 Kernels in a loop


The following idiom, with one kernel submission per iteration of an outer loop, works; see the snippet syclkernelloop in code/dpcpp/cxx/jacobi1d.cxx.

42.10.2 Stencil computations


The problem with stencil computations is that only interior points are updated. Translated to SYCL: we need to iterate over a subrange of the range over which the buffer is defined. First we define these ranges; see the snippet syclrangebc in code/dpcpp/cxx/jacobi1d.cxx. Note the boundary value $1.$ on the right boundary.

Restricting the iteration to the interior points is done through the offset parameter of the parallel_for; see the snippet sycliteratebc in code/dpcpp/cxx/jacobi1d.cxx.
