##### Experimental HTML version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse

27.2 : Data on the device
27.3 : Execution on the device

This chapter explains the mechanisms for offloading work to a GPU, introduced in OpenMP 4.0.

The memory of a processor and that of an attached GPU are not coherent: they are separate memory spaces, and writing data in one is not automatically reflected in the other.

OpenMP transfers data (or maps it) when you enter an \indexompclause{target} construct.

#pragma omp target
{
// do stuff on the GPU
}

You can test whether the target region is indeed executed on a device with omp_is_initial_device:

int on_host;
#pragma omp target map(from: on_host)
on_host = omp_is_initial_device();
// on_host is true if the region ran on the host after all


The \indexompclause{target} clause causes OpenMP to create a target task  . This is a task running on the host, dedicated to managing the offloaded region.

The \indexompclause{target} region is executed by a new initial task  . This is distinct from the initial task that executes the main program.

By default, the generating task is blocked while the task on the device is running, but adding the \indexompclauseoption{target}{nowait} clause makes it asynchronous. This then requires a taskwait directive to synchronize host and device.

## 27.2 Data on the device


• Scalars are treated as \indexompclause{firstprivate}, that is, they are copied in but not out.

• Stack arrays are mapped \indexompclause{tofrom}, that is, copied to the device on entry and copied back on exit.

• Heap arrays are not mapped by default.

For explicit mapping with \indexompclauseoption{target}{map}:

#pragma omp target map(...)
{
// do stuff on the GPU
}

The following map options exist:

• map(to: x,y,z) copy from host to device when entering the target region.

• map(from: x,y,z) copy from device to host when exiting the target region.

• map(tofrom: x,y,z) is equivalent to combining the previous two.

• map(alloc: x,y,z) allocates data on the device, without copying in either direction.

Fortran note {Array sizes in map clause}: If the compiler can deduce the array bounds and size, it is not necessary to specify them in the map clause.

Data transfer to a device is probably slow, so mapping the data at the start of an offloaded section of code is probably not the best idea. Additionally, in many cases data will stay resident on the device throughout several iterations of, for instance, a time-stepping PDE solver. For such reasons, it is possible to move data onto, and off from, the device explicitly, using the \indexompclauseoption{target}{enter data} and \indexompclauseoption{target}{exit data} directives.

#pragma omp target enter data map(to: x,y)
#pragma omp target
{
// do something
}
#pragma omp target exit data map(from: x,y)

There are also \indexompclauseoption{target}{update to} (copy data from host to device) and \indexompclauseoption{target}{update from} (copy data from device to host) directives.

## 27.3 Execution on the device


For parallel execution of a loop on the device, use the \indexompclause{teams} construct:

#pragma omp target teams distribute parallel for

On GPU devices and the like, there is a structure to threads:

• threads are grouped in \indexompterm{team}s, and they can be synchronized only within these teams;

• teams are grouped in \indexompterm{league}s, and no synchronization between leagues is possible inside a target region.

The combination teams distribute splits the iteration space over teams. By default a static schedule is used, but the \indexompclause{dist_schedule} clause can be used to specify a different one. However, this combination only gives each chunk of the iteration space to the master thread of each team. Next we need parallel for (in C) or parallel do (in Fortran) to spread the chunk over the threads in the team.

When creating teams, it's often useful to limit the number of threads in each team with \indexompclause{thread_limit}. This limit can also be set with the OMP_THREAD_LIMIT environment variable, and queried with omp_get_thread_limit.