45.1 : Concurrency
45.2 : Affinity
45.3 : What does the hardware look like?
45.4 : Affinity control
45.5 : Discussion
45.6 : Processes and cores and affinity
45.7 : Practical specification

45 Hybrid computing

So far, you have learned to use MPI for distributed memory and OpenMP for shared memory parallel programming. However, distributed memory architectures actually have a shared memory component, since each cluster node is typically of a multicore design. Accordingly, you could program your cluster using MPI for inter-node and OpenMP for intra-node parallelism.

You now have to find the right balance between processes and threads, since each can keep a core fully busy. Complicating this story, a node can have more than one socket, each with a corresponding NUMA domain.

FIGURE 45.1: Three modes of MPI/OpenMP usage on a multi-core cluster

Figure 45.1 illustrates three modes: pure MPI with no threads used; one MPI process per node and full multi-threading; two MPI processes per node, one per socket, and multiple threads on each socket.

45.1 Concurrency


With hybrid multi-process / multi-thread computing, one thing that goes out the door is the sequential semantics of each MPI process. For instance, the fact that messages between a single sender and a single receiver are non-overtaking no longer holds if the messages originated in different threads.

// anytag.c
// sender: two threads each post a send, with different tags
#pragma omp parallel sections
    {
#pragma omp section
    MPI_Isend
      ( &x,1,MPI_DOUBLE,
        receiver,xtag,comm,requests+0);
#pragma omp section
    MPI_Isend
      ( &y,1,MPI_DOUBLE,
        receiver,ytag,comm,requests+1);
    }
    MPI_Waitall(2,requests,MPI_STATUSES_IGNORE);

// receiver: two threads each post a receive that accepts any tag,
// so either message can land in either buffer
#pragma omp parallel sections
    {
#pragma omp section
    MPI_Irecv
      ( &xy1,1,MPI_DOUBLE,
        sender, MPI_ANY_TAG, comm, requests+0);
#pragma omp section
    MPI_Irecv
      ( &xy2,1,MPI_DOUBLE,
        sender, MPI_ANY_TAG, comm, requests+1);
    }
    MPI_Waitall(2,requests,statuses);

45.2 Affinity


In the preceding chapters we mostly considered all MPI processes or OpenMP threads as being in one flat pool. However, for high performance you need to worry about affinity: the question of which process or thread is placed where, and how efficiently they can interact.

FIGURE 45.2: The NUMA structure of a Ranger node

Here are some situations where affinity becomes a concern.

In pure MPI mode processes that are on the same node can typically communicate faster than processes on different nodes. Since processes are typically placed sequentially, this means that a scheme where process $p$ interacts mostly with $p+1$ will be efficient, while communication with large jumps will be less so.
If the cluster network has a structure (processor grid as opposed to fat-tree), placement of processes has an effect on program efficiency. MPI tries to address this with graph topology; see section 11.2.
Even on a single node there can be asymmetries. Figure 45.2 illustrates the structure of the four sockets of the Ranger supercomputer (no longer in production). Two cores have no direct connection. This asymmetry affects both MPI processes and threads on that node.
Another problem with multi-socket designs is that each socket has memory attached to it. While every socket can address all the memory on the node, its local memory is faster to access. This asymmetry becomes quite visible in the first-touch phenomenon; see section 25.2.
If a node has fewer MPI processes than there are cores, you want to be in control of their placement. Also, the operating system can migrate processes, which is detrimental to performance since it negates data locality. For this reason, utilities such as numactl can be used to pin a thread or process to a specific core.
Processors with hyperthreading or hardware threads introduce another level of worry about where threads go.
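
The first situation above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming ranks are placed sequentially so that rank $p$ lives on node $\lfloor p/r \rfloor$ for $r$ ranks per node; the function names `node_of_rank` and `count_on_node_pairs` are made up for this illustration and are not part of MPI.

```c
#include <assert.h>

/* Node that hosts a given rank, assuming sequential ("block")
   placement: the first ranks_per_node ranks go on node 0, and so on.
   This placement rule is an assumption of the sketch. */
int node_of_rank(int rank, int ranks_per_node) {
  assert(rank >= 0 && ranks_per_node > 0);
  return rank / ranks_per_node;
}

/* Count the pairs (p, p+stride) whose two ranks share a node. */
int count_on_node_pairs(int nranks, int ranks_per_node, int stride) {
  assert(stride > 0);
  int on_node = 0;
  for (int p = 0; p + stride < nranks; p++)
    if (node_of_rank(p, ranks_per_node) ==
        node_of_rank(p + stride, ranks_per_node))
      on_node++;
  return on_node;
}
```

For 64 ranks on four 16-core nodes, `count_on_node_pairs(64,16,1)` counts 60 of the 63 nearest-neighbor pairs as on-node, while a stride of 16 leaves no pair on the same node.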

45.3 What does the hardware look like?


If you want to optimize affinity, you should first know what the hardware looks like. The hwloc utility is valuable here [goglin:hwloc] (https://www.open-mpi.org/projects/hwloc/).

FIGURE 45.3: Structure of a Stampede compute node

FIGURE 45.4: Structure of a Stampede largemem four-socket compute node

FIGURE 45.5: Structure of a Lonestar5 compute node

Figure 45.3 depicts a Stampede compute node, which is a two-socket Intel Sandybridge design; figure 45.4 shows a Stampede largemem node, which is a four-socket design. Finally, figure 45.5 shows a Lonestar5 compute node, a two-socket design with 12-core Intel Haswell processors with two hardware threads each.

45.4 Affinity control


See chapter OpenMP topic: Affinity for OpenMP affinity control.

45.5 Discussion


The performance implications of the pure MPI strategy versus hybrid are subtle.

First of all, we note that there is no obvious speedup: in a well balanced MPI application all cores are busy all the time, so using threading can give no immediate improvement.
Both MPI and OpenMP are subject to Amdahl's law that quantifies the influence of sequential code; in hybrid computing there is a new version of this law regarding the amount of code that is MPI-parallel, but not OpenMP-parallel.
MPI processes run unsynchronized, so small variations in load or in processor behavior can be tolerated. The frequent barriers in OpenMP constructs make a hybrid code more tightly synchronized, so load balancing becomes more critical.
On the other hand, in OpenMP codes it is easier to divide the work into more tasks than there are threads, so statistically a certain amount of load balancing happens automatically.
Each MPI process has its own buffers, so a hybrid code, which uses fewer processes, incurs less buffer overhead.
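
To get a feel for the hybrid version of Amdahl's law mentioned above, here is a minimal sketch of one possible model; it is an assumption made for illustration, not a formula taken from this text. Write $s$ for the fraction of work that is sequential, $m$ for the fraction that is parallel over MPI processes but not over threads, and let the remainder be parallel over both. With $P$ processes of $T$ threads each, the modeled speedup is $1/\bigl(s + m/P + (1-s-m)/(PT)\bigr)$. The function name `hybrid_speedup` is hypothetical.

```c
#include <assert.h>

/* Modeled speedup over a single core (illustrative model only).
   seq_frac : fraction of the work that is fully sequential
   mpi_frac : fraction that is parallel over MPI processes only
   The remaining fraction is parallel over processes and threads. */
double hybrid_speedup(double seq_frac, double mpi_frac,
                      int nprocs, int nthreads) {
  assert(seq_frac >= 0 && mpi_frac >= 0 && seq_frac + mpi_frac <= 1);
  assert(nprocs > 0 && nthreads > 0);
  double both_frac = 1.0 - seq_frac - mpi_frac;
  double time = seq_frac
              + mpi_frac / nprocs
              + both_frac / ((double)nprocs * nthreads);
  return 1.0 / time;
}
```

In this model, 4 processes with 8 threads each give a speedup of 32 only if all work is parallel at both levels; if half of the work is MPI-parallel but not OpenMP-parallel, the modeled speedup drops to roughly 7.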

Exercise Review the scalability argument for 1D versus 2D matrix decomposition in Eijkhout:IntroHPC. Would you get scalable performance from doing a 1D decomposition (for instance, of the rows) over MPI processes, and decomposing the other direction (the columns) over OpenMP threads?
End of exercise

Another performance argument we need to consider concerns message traffic. If you let all threads make MPI calls (see section 13.1) there is going to be little difference. However, in one popular hybrid computing strategy we would keep MPI calls out of the OpenMP regions and have them in effect done by the master thread. In that case there are only MPI messages between nodes, instead of between cores. This leads to a decrease in message traffic, though this is hard to quantify. The number of messages goes down approximately by a factor equal to the number of cores per node, so this is an advantage if the average message size is small. On the other hand, the amount of data sent is only reduced if there is overlap in content between the messages.

Limiting MPI traffic to the master thread also means that no buffer space is needed for the on-node communication.

45.6 Processes and cores and affinity


In OpenMP, threads are purely a software construct and you can create however many you want. The hardware limit of the available cores can be queried with omp_get_num_procs (section 17.5). How does that work in a hybrid context? Does the `proc' count return the total number of cores, or does the MPI scheduler limit it to a number exclusive to each MPI process?

The following code fragment explores this:

// procthread.c
// query the number of cores that OpenMP sees in this process
int ncores;
#pragma omp parallel
#pragma omp master
ncores = omp_get_num_procs();

// sum the per-process counts onto process zero
int totalcores;
MPI_Reduce(&ncores,&totalcores,1,MPI_INT,MPI_SUM,0,comm);
if (procid==0) {
  printf("Omp procs on this process: %d\n",ncores);
  printf("Omp procs total: %d\n",totalcores);
}

Running this with Intel MPI (version 19) gives the following:

---- nprocs: 14
Omp procs on this process: 4
Omp procs total: 56
---- nprocs: 15
Omp procs on this process: 3
Omp procs total: 45
---- nprocs: 16
Omp procs on this process: 3
Omp procs total: 48

We see that

Each process gets an equal number of cores, and
Some cores will go unused.
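
The numbers in this output are consistent with a simple rounding-down rule. The sketch below assumes a node with 56 cores, which the output suggests but the text does not state, and it models the division as integer division; whether Intel MPI actually computes the count this way is likewise an assumption. The function names are hypothetical.

```c
#include <assert.h>

/* Cores available to each process if the node's cores are divided
   evenly and the remainder is left idle (assumed model). */
int cores_per_process(int node_cores, int nprocs) {
  assert(node_cores > 0 && nprocs > 0);
  return node_cores / nprocs;   /* integer division rounds down */
}

/* Cores that no process gets under the same model. */
int unused_cores(int node_cores, int nprocs) {
  return node_cores - nprocs * cores_per_process(node_cores, nprocs);
}
```

Under these assumptions, 14 processes get 4 cores each with none left over, while 15 and 16 processes get 3 cores each, leaving 11 and 8 cores unused respectively.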

While the OpenMP `proc' count is such that the MPI processes will not oversubscribe cores, the actual placement of processes and threads is not expressed here. This assignment is known as affinity and it is determined by the MPI/OpenMP runtime system. Typically it can be controlled through environment variables, but one hopes the default assignment makes sense.

FIGURE 45.6: Process and thread placement on an Intel Knights Landing

Figure 45.6 illustrates this for the Intel Knights Landing:

Placing four MPI processes on 68 cores gives 17 cores per process.
Each process receives a contiguous set of cores.
However, cores are grouped in `tiles' of two, so processes 1 and 3 start halfway through a tile.
Therefore, thread zero of each of those processes is bound to the second core of a tile.
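
The tile observation follows from the core numbering. Here is a minimal sketch, assuming cores are numbered from zero, each process gets a contiguous block, and tile $t$ consists of cores $2t$ and $2t+1$; the function names `first_core` and `starts_mid_tile` are invented for this illustration.

```c
#include <assert.h>

/* First core of a process when ncores cores are divided into
   contiguous, equal blocks over nprocs processes. */
int first_core(int proc, int ncores, int nprocs) {
  assert(nprocs > 0 && proc >= 0 && proc < nprocs);
  return proc * (ncores / nprocs);
}

/* Nonzero if the core is not the first core of its tile,
   assuming tiles are consecutive groups of cores_per_tile cores. */
int starts_mid_tile(int core, int cores_per_tile) {
  assert(cores_per_tile > 0);
  return core % cores_per_tile != 0;
}
```

With 68 cores and four processes the blocks start at cores 0, 17, 34, and 51. The odd starting cores 17 and 51 belong to processes 1 and 3, which is why those two start in the middle of a tile.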

45.7 Practical specification


Say you use 100 cluster nodes, each with 16 cores. You could then start 1600 MPI processes, one for each core, but you could also start 100 processes, and give each access to 16 OpenMP threads.

There is a third choice, in between these extremes, that makes sense. A cluster node often has more than one socket, so you could put one MPI process on each socket, and use a number of threads equal to the number of cores per socket.
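
The three choices differ only in how many processes you place on each node. The sketch below works out the resulting process and thread counts; the type `hybrid_layout` and the function `make_layout` are hypothetical names, and the two-sockets-per-node figure used for the third choice is an assumption, since the text only says "more than one socket".

```c
#include <assert.h>

typedef struct {
  int nprocs;    /* total number of MPI processes */
  int nthreads;  /* OpenMP threads per process */
} hybrid_layout;

/* Process and thread counts when every node runs procs_per_node
   processes and each process fills its share of the cores.
   procs_per_node must divide cores_per_node evenly. */
hybrid_layout make_layout(int nnodes, int cores_per_node,
                          int procs_per_node) {
  assert(nnodes > 0 && procs_per_node > 0);
  assert(cores_per_node % procs_per_node == 0);
  hybrid_layout layout;
  layout.nprocs = nnodes * procs_per_node;
  layout.nthreads = cores_per_node / procs_per_node;
  return layout;
}
```

For 100 nodes of 16 cores, `make_layout(100,16,16)` gives 1600 processes of one thread, `make_layout(100,16,1)` gives 100 processes of 16 threads, and, assuming two sockets per node, `make_layout(100,16,2)` gives 200 processes of 8 threads. The thread count would typically be passed to the program through the OMP_NUM_THREADS environment variable.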


Experimental HTML version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse