So far, you have learned to use MPI for distributed memory and OpenMP for shared memory parallel programming. However, distributed memory architectures actually have a shared memory component, since each cluster node is typically of a multicore design. Accordingly, you could program your cluster using MPI for inter-node and OpenMP for intra-node parallelism.
You now have to find the right balance between processes and threads, since each can keep a core fully busy. Complicating this story, a node can have more than one socket, each with a corresponding NUMA domain.
FIGURE 45.1: Three modes of MPI/OpenMP usage on a multi-core cluster
Figure 45.1 illustrates three modes: pure MPI with no threads used; one MPI process per node and full multi-threading; two MPI processes per node, one per socket, and multiple threads on each socket.
crumb trail: > hybrid > Concurrency
With hybrid multi-process / multi-thread computing, one thing that goes out the door is the sequential semantics of each MPI process. For instance, the fact that messages between a single sender and a single receiver are non-overtaking no longer holds if the messages originated in different threads.
// anytag.c
#pragma omp parallel sections
{
#pragma omp section
  MPI_Isend( &x,1,MPI_DOUBLE, receiver,xtag,comm,requests+0);
#pragma omp section
  MPI_Isend( &y,1,MPI_DOUBLE, receiver,ytag,comm,requests+1);
}
MPI_Waitall(2,requests,MPI_STATUSES_IGNORE);
// receiver: either message may match either receive
#pragma omp parallel sections
{
#pragma omp section
  MPI_Irecv( &xy1,1,MPI_DOUBLE, sender,MPI_ANY_TAG,comm,requests+0);
#pragma omp section
  MPI_Irecv( &xy2,1,MPI_DOUBLE, sender,MPI_ANY_TAG,comm,requests+1);
}
MPI_Waitall(2,requests,statuses);
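Since either send can be matched by either receive, the receiver cannot tell from the request index which buffer holds x and which holds y; it has to inspect the tag in the returned status. The following is a minimal sketch of that bookkeeping, written as plain C so that it can be read on its own. The helper name assign_by_tag is hypothetical; in the MPI fragment above, values would hold xy1 and xy2, and tags[i] would be read from statuses[i].MPI_TAG.

```c
#include <assert.h>

/* Hypothetical helper: given the two received values and the tags
 * under which they arrived, store each value in the right variable.
 * In the MPI code, tags[i] would come from statuses[i].MPI_TAG. */
void assign_by_tag(const double values[2], const int tags[2],
                   int xtag, int ytag, double *x, double *y)
{
  /* the two messages must carry distinct tags to be told apart */
  assert(xtag != ytag);
  for (int i = 0; i < 2; i++) {
    if (tags[i] == xtag)
      *x = values[i];
    else if (tags[i] == ytag)
      *y = values[i];
  }
}
```

The result is the same whichever receive picked up which message, which is the property you lose if you rely on arrival order alone.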
crumb trail: > hybrid > Affinity
In the preceding chapters we mostly considered all MPI processes or OpenMP threads as being in one flat pool. However, for high performance you need to worry about affinity: the question of which process or thread is placed where, and how efficiently they can interact.
FIGURE 45.2: The NUMA structure of a Ranger node
Here are some situations where affinity becomes a concern.
This asymmetry affects both MPI processes and threads on that node.
can be used to pin a thread or process to a specific core.
crumb trail: > hybrid > What does the hardware look like?
If you want to optimize affinity, you should first know what the hardware looks like. The hwloc utility is valuable here [goglin:hwloc] (https://www.open-mpi.org/projects/hwloc/).
FIGURE 45.3: Structure of a Stampede compute node
FIGURE 45.4: Structure of a Stampede largemem four-socket compute node
FIGURE 45.5: Structure of a Lonestar5 compute node
Figure 45.3 depicts a Stampede compute node, which is a two-socket Intel Sandy Bridge design; figure 45.4 shows a Stampede largemem node, which is a four-socket design. Finally, figure 45.5 shows a Lonestar5 compute node, a two-socket design with 12-core Intel Haswell processors with two hardware threads each.
crumb trail: > hybrid > Affinity control
See chapter OpenMP topic: Affinity for OpenMP affinity control.
crumb trail: > hybrid > Discussion
The performance implications of the pure MPI strategy versus hybrid are subtle.
Exercise
Review the scalability argument for 1D versus 2D matrix decomposition in Eijkhout:IntroHPC. Would you get scalable performance from doing a 1D decomposition (for instance, of the rows) over MPI processes, and decomposing the other direction (the columns) over OpenMP threads?
End of exercise
Another performance argument we need to consider concerns message traffic. If you let all threads make MPI calls (see section 13.1) there is going to be little difference. However, in one popular hybrid computing strategy we would keep MPI calls out of the OpenMP regions and have them in effect done by the master thread. In that case there are only MPI messages between nodes, instead of between cores. This leads to a decrease in message traffic, though this is hard to quantify. The number of messages goes down approximately by a factor equal to the number of cores per node, so this is an advantage if the average message size is small. On the other hand, the amount of data sent is only reduced if there is overlap in content between the messages.
Limiting MPI traffic to the master thread also means that no buffer space is needed for the on-node communication.
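The message-count argument can be made concrete with a simple cost model. The sketch below assumes, as an illustration that is not taken from the text above, that a message of n bytes costs alpha + beta*n, with alpha the latency and beta the time per byte. It further assumes that the messages of the cores on a node have no overlap in content, so the single per-node message carries all of their data. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed cost model: each of `nmessages` messages of
 * `bytes_per_message` bytes costs alpha + beta*bytes_per_message. */
double exchange_time(long nmessages, double bytes_per_message,
                     double alpha, double beta)
{
  assert(nmessages >= 0 && bytes_per_message >= 0.0);
  return (double)nmessages * (alpha + beta * bytes_per_message);
}

/* Pure MPI: every core on the node sends its own message.
 * Funneled hybrid: one thread sends a single message that carries
 * the data of all cores (no overlap in content assumed). */
double funneled_saving(int cores_per_node, double bytes_per_core,
                       double alpha, double beta)
{
  double pure = exchange_time(cores_per_node, bytes_per_core, alpha, beta);
  double hybrid = exchange_time(1, cores_per_node * bytes_per_core,
                                alpha, beta);
  return pure - hybrid; /* equals (cores_per_node - 1) * alpha */
}
```

Under these assumptions the saving is (cores_per_node - 1) * alpha, independent of the message size. It is therefore a large fraction of the total time only when messages are small enough for latency to dominate.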
crumb trail: > hybrid > Processes and cores and affinity
In OpenMP, threads are purely a software construct and you can create however many you want. The hardware limit of the available cores can be queried with omp_get_num_procs (section 17.5). How does that work in a hybrid context? Does the `proc' count return the total number of cores, or does the MPI scheduler limit it to a number exclusive to each MPI process?
The following code fragment explores this:
// procthread.c
int ncores;
#pragma omp parallel
#pragma omp master
ncores = omp_get_num_procs();

int totalcores;
MPI_Reduce(&ncores,&totalcores,1,MPI_INT,MPI_SUM,0,comm);
if (procid==0) {
  printf("Omp procs on this process: %d\n",ncores);
  printf("Omp procs total: %d\n",totalcores);
}
Running this with Intel MPI (version 19) gives the following:
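---- nprocs: 14
Omp procs on this process: 4
Omp procs total: 56
---- nprocs: 15
Omp procs on this process: 3
Omp procs total: 45
---- nprocs: 16
Omp procs on this process: 3
Omp procs total: 48
We see that each process is given an equal number of cores. With 15 or 16 processes the total is lower than the 56 reported for 14 processes, so some cores go unused.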
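These numbers are consistent with each process receiving an equal share of the node's cores, rounded down, on a node that exposes 56 cores (the total reported for 14 processes). The sketch below reproduces that arithmetic. It is an inference from this one output, not a documented rule of any MPI runtime, and the function names are hypothetical.

```c
#include <assert.h>

/* Cores given to each process if the node's cores are divided
 * equally and the remainder is left idle (an assumption inferred
 * from the output, not taken from MPI documentation). */
int cores_per_process(int node_cores, int nprocs)
{
  assert(node_cores > 0 && nprocs > 0);
  return node_cores / nprocs; /* integer division rounds down */
}

/* Cores that no process can use under this equal split. */
int unused_cores(int node_cores, int nprocs)
{
  return node_cores - nprocs * cores_per_process(node_cores, nprocs);
}
```

With node_cores set to 56, the calls for 14, 15, and 16 processes give 4, 3, and 3 cores per process, matching the output, and leave 0, 11, and 8 cores unused respectively.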
While the OpenMP `proc' count is such that the MPI processes will not oversubscribe cores, the actual placement of processes and threads is not expressed here. This assignment is known as affinity and it is determined by the MPI/OpenMP runtime system. Typically it can be controlled through environment variables, but one hopes the default assignment makes sense.
FIGURE 45.6: Process and thread placement on an Intel Knights Landing
Figure 45.6 illustrates this for the Intel Knights Landing.
crumb trail: > hybrid > Practical specification
Say you use 100 cluster nodes, each with 16 cores. You could then start 1600 MPI processes, one for each core, but you could also start 100 processes, and give each access to 16 OpenMP threads.
There is a third choice, in between these extremes, that makes sense. A cluster node often has more than one socket, so you could put one MPI process on each socket, and use a number of threads equal to the number of cores per socket.
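The three choices differ only in how many MPI processes you place on each node. As a minimal sketch, the following computes the resulting process and thread counts; the type hybrid_layout and the function make_layout are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* A process/thread layout: how many MPI processes to start and how
 * many OpenMP threads each one gets. */
typedef struct {
  int processes;
  int threads_per_process;
} hybrid_layout;

/* Hypothetical helper: place `procs_per_node` MPI processes on each
 * node and divide the node's cores evenly among them as threads. */
hybrid_layout make_layout(int nodes, int cores_per_node, int procs_per_node)
{
  assert(nodes > 0 && procs_per_node > 0);
  assert(cores_per_node % procs_per_node == 0); /* no idle cores */
  hybrid_layout l;
  l.processes = nodes * procs_per_node;
  l.threads_per_process = cores_per_node / procs_per_node;
  return l;
}
```

For the 100 nodes of 16 cores above, make_layout(100,16,16) gives 1600 processes with 1 thread each, and make_layout(100,16,1) gives 100 processes with 16 threads each. Assuming two sockets per node, which the example does not state, make_layout(100,16,2) gives 200 processes with 8 threads each. In all three cases the product of processes and threads is 1600, the total core count.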