Hybrid computing

Experimental HTML version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. Download the textbook at https://theartofhpc.com/pcse

45.1 : Concurrency
45.2 : Affinity
45.3 : What does the hardware look like?
45.4 : Affinity control
45.5 : Discussion
45.6 : Processes and cores and affinity
45.7 : Practical specification

45 Hybrid computing

So far, you have learned to use MPI for distributed memory and OpenMP for shared memory parallel programming. However, distributed memory architectures actually have a shared memory component, since each cluster node is typically of a multicore design. Accordingly, you could program your cluster using MPI for inter-node and OpenMP for intra-node parallelism.

You now have to find the right balance between processes and threads, since each can keep a core fully busy. Complicating this story, a node can have more than one socket, each with its own NUMA domain.

FIGURE 45.1: Three modes of MPI/OpenMP usage on a multi-core cluster

Figure  45.1 illustrates three modes: pure MPI with no threads used; one MPI process per node and full multi-threading; two MPI processes per node, one per socket, and multiple threads on each socket.

45.1 Concurrency

With hybrid multi-process / multi-thread computing, one thing that goes out the door is the sequential semantics of each MPI process. For instance, the fact that messages between a single sender and a single receiver are non-overtaking no longer holds if the messages originated in different threads.

// anytag.c
if (procid==sender) {
  // sender: the two sends are issued from different threads,
  // so their order is not defined
#pragma omp parallel sections
  {
#pragma omp section
    MPI_Isend
      ( &x,1,MPI_DOUBLE,
        receiver,xtag,comm,requests+0);
#pragma omp section
    MPI_Isend
      ( &y,1,MPI_DOUBLE,
        receiver,ytag,comm,requests+1);
  }
  MPI_Waitall(2,requests,MPI_STATUSES_IGNORE);
} else if (procid==receiver) {
  // receiver: with MPI_ANY_TAG either message may match either receive
#pragma omp parallel sections
  {
#pragma omp section
    MPI_Irecv
      ( &xy1,1,MPI_DOUBLE,
        sender, MPI_ANY_TAG, comm, requests+0);
#pragma omp section
    MPI_Irecv
      ( &xy2,1,MPI_DOUBLE,
        sender, MPI_ANY_TAG, comm, requests+1);
  }
  MPI_Waitall(2,requests,statuses);
}

45.2 Affinity

In the preceding chapters we mostly considered all MPI processes or OpenMP threads as being in one flat pool. However, for high performance you need to worry about affinity : the question of which process or thread is placed where, and how efficiently they can interact.

FIGURE 45.2: The NUMA structure of a Ranger node

There are several situations where affinity becomes a concern.

45.3 What does the hardware look like?

If you want to optimize affinity, you should first know what the hardware looks like. The hwloc utility is valuable here [goglin:hwloc] ( https://www.open-mpi.org/projects/hwloc/ ).
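Besides command-line tools such as lstopo and hwloc-ls, which render the node structure graphically, hwloc also has a C library interface. The following is a minimal sketch, not one of the book's example codes; it assumes hwloc version 2 or later, where a socket is called a `package', and it is compiled and linked with -lhwloc.

// hwloccount.c (hypothetical name; sketch only)
#include <stdio.h>
#include <hwloc.h>

int main(void) {
  hwloc_topology_t topology;
  // discover the topology of the machine we are running on
  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);
  printf("packages: %d, cores: %d, hardware threads: %d\n",
         hwloc_get_nbobjs_by_type(topology,HWLOC_OBJ_PACKAGE),
         hwloc_get_nbobjs_by_type(topology,HWLOC_OBJ_CORE),
         hwloc_get_nbobjs_by_type(topology,HWLOC_OBJ_PU));
  hwloc_topology_destroy(topology);
  return 0;
}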

FIGURE 45.3: Structure of a Stampede compute node

FIGURE 45.4: Structure of a Stampede largemem four-socket compute node

FIGURE 45.5: Structure of a Lonestar5 compute node

Figure  45.3 depicts a Stampede compute node  , which is a two-socket Intel Sandybridge design; figure  45.4 shows a Stampede largemem node  , which is a four-socket design. Finally, figure  45.5 shows a Lonestar5 compute node, a two-socket design with 12-core Intel Haswell processors with two hardware threads each.

45.4 Affinity control

See chapter  OpenMP topic: Affinity for OpenMP affinity control.
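Although the control mechanisms are discussed in that chapter, a quick runtime check is often useful in a hybrid run. The following is a minimal sketch, not one of the book's example codes; it assumes an OpenMP 4.5 runtime, and OMP_PLACES / OMP_PROC_BIND should be set for the output to be meaningful.

// ompplaces.c (hypothetical name; sketch only)
#include <stdio.h>
#include <omp.h>

int main(void) {
  // how many places (for instance cores or sockets) the runtime uses
  printf("number of places: %d\n",omp_get_num_places());
#pragma omp parallel
  {
#pragma omp critical
    printf("thread %2d is bound to place %d\n",
           omp_get_thread_num(),omp_get_place_num());
  }
  return 0;
}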

45.5 Discussion

The performance implications of the pure MPI strategy versus hybrid are subtle.

Exercise Review the scalability argument for 1D versus 2D matrix decomposition in Eijkhout:IntroHPC  . Would you get scalable performance from doing a 1D decomposition (for instance, of the rows) over MPI processes, and decomposing the other directions (the columns) over OpenMP threads?
End of exercise

Another performance argument we need to consider concerns message traffic. If we let all threads make MPI calls (see section 13.1) there is going to be little difference. However, in one popular hybrid computing strategy we would keep MPI calls out of the OpenMP regions and have them in effect done by the master thread. In that case there are only MPI messages between nodes, instead of between cores. This leads to a decrease in message traffic, though this is hard to quantify. The number of messages goes down approximately by the number of cores per node, so this is an advantage if the average message size is small. On the other hand, the amount of data sent is only reduced if there is overlap in content between the messages.

Limiting MPI traffic to the master thread also means that no buffer space is needed for the on-node communication.
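The required MPI threading support also differs between these strategies: if all threads make MPI calls you need MPI_THREAD_MULTIPLE, whereas funneling all communication through the master thread only needs MPI_THREAD_FUNNELED. As a minimal sketch (hypothetical code, not one of the book's examples):

// initfunneled.c (hypothetical name; sketch only)
#include <stdio.h>
#include <mpi.h>

int main(int argc,char **argv) {
  int provided;
  // only the master thread will make MPI calls;
  // request MPI_THREAD_MULTIPLE instead if all threads communicate
  MPI_Init_thread(&argc,&argv,MPI_THREAD_FUNNELED,&provided);
  if (provided<MPI_THREAD_FUNNELED) {
    fprintf(stderr,"Insufficient MPI threading support\n");
    MPI_Abort(MPI_COMM_WORLD,1);
  }
  /* ... hybrid computation ... */
  MPI_Finalize();
  return 0;
}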

45.6 Processes and cores and affinity

In OpenMP, threads are purely a software construct and you can create however many you want. The hardware limit of the available cores can be queried with omp_get_num_procs (section  17.5  ). How does that work in a hybrid context? Does the `proc' count return the total number of cores, or does the MPI scheduler limit it to a number exclusive to each MPI process?

The following code fragment explores this:

// procthread.c
int ncores;
#pragma omp parallel
#pragma omp master
ncores = omp_get_num_procs();

int totalcores;
MPI_Reduce(&ncores,&totalcores,1,MPI_INT,MPI_SUM,0,comm);
if (procid==0) {
  printf("Omp procs on this process: %d\n",ncores);
  printf("Omp procs total: %d\n",totalcores);
}

Running this with Intel MPI (version 19) gives the following:

---- nprocs: 14
Omp procs on this process: 4
Omp procs total: 56
---- nprocs: 15
Omp procs on this process: 3
Omp procs total: 45
---- nprocs: 16
Omp procs on this process: 3
Omp procs total: 48

We see that the MPI startup divides the 56 available cores over the processes, rounding down: with 14 processes each sees 4 cores, but with 15 or 16 processes each sees only 3, so some cores are left unused.

While the OpenMP `proc' count is such that the MPI processes will not oversubscribe cores, the actual placement of processes and threads is not expressed here. This assignment is known as affinity and it is determined by the MPI/OpenMP runtime system. Typically it can be controlled through environment variables, but one hopes the default assignment makes sense.

FIGURE 45.6: Process and thread placement on an Intel Knights Landing

Figure 45.6 illustrates this for the Intel Knights Landing.
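To see what the default assignment actually is on your own system, each thread can report the core it is currently executing on. The following sketch is not from the book and is Linux-specific: sched_getcpu is a GNU extension, and without binding the reported core may change between calls.

// reportcores.c (hypothetical name; Linux-only sketch)
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc,char **argv) {
  int provided,procid;
  MPI_Init_thread(&argc,&argv,MPI_THREAD_FUNNELED,&provided);
  MPI_Comm_rank(MPI_COMM_WORLD,&procid);
#pragma omp parallel
  {
#pragma omp critical
    printf("process %d, thread %d currently runs on core %d\n",
           procid,omp_get_thread_num(),sched_getcpu());
  }
  MPI_Finalize();
  return 0;
}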

45.7 Practical specification

Say you use 100 cluster nodes, each with 16 cores. You could then start 1600 MPI processes, one for each core, but you could also start 100 processes, and give each access to 16 OpenMP threads.

There is a third choice, in between these extremes, that makes sense. A cluster node often has more than one socket, so you could put one MPI process on each socket, and use a number of threads equal to the number of cores per socket.
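Whichever balance you choose, it is easy to get the launch configuration wrong, so it can be worth verifying it at runtime. The following sketch is not one of the book's examples; it assumes an MPI-3 library for MPI_Comm_split_type. It counts the MPI processes that ended up on the same node, which you can compare against the intended one-per-node, one-per-socket, or one-per-core layout.

// nodebalance.c (hypothetical name; sketch only)
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc,char **argv) {
  MPI_Init(&argc,&argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  int procid; MPI_Comm_rank(comm,&procid);

  // group the processes that share this node's memory
  MPI_Comm nodecomm; int ranks_on_node;
  MPI_Comm_split_type(comm,MPI_COMM_TYPE_SHARED,0,MPI_INFO_NULL,&nodecomm);
  MPI_Comm_size(nodecomm,&ranks_on_node);

  if (procid==0)
    printf("%d processes on this node; OpenMP sees %d procs per process\n",
           ranks_on_node,omp_get_num_procs());

  MPI_Comm_free(&nodecomm);
  MPI_Finalize();
  return 0;
}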
