MPI topic: Shared memory

Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. download the textbook at https:/theartofhpc.com/pcse
\[ \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{${\equiv\atop\mathrm{\scriptstyle D}}$}}} \] 12.1 : Recognizing shared memory
12.2 : Shared memory for windows
12.2.1 : Pointers to a shared window
12.2.2 : Querying the shared structure
12.2.3 : Heat equation example
12.2.4 : Shared bulk data
Back to Table of Contents

12 MPI topic: Shared memory

Some programmers are under the impression that MPI would not be efficient on shared memory, since all operations are done through what looks like network calls. This is not correct: many MPI implementations have optimizations that detect shared memory and can exploit it, so that data is copied, rather than going through a communication layer. (Conversely, programming systems for shared memory such as OpenMP can actually have inefficiencies associated with thread handling.) The main inefficiency associated with using MPI on shared memory is then that processes can not actually share data.

The one-sided MPI calls (chapter  MPI topic: One-sided communication  ) can also be used to emulate shared memory, in the sense that an origin process can access data from a target process without the target's active involvement. However, these calls do not distinguish between actually shared memory and one-sided access across the network.

In this chapter we will look at the ways MPI can interact with the presence of actual shared memory. (This functionality was added in the \mpistandard{3} standard.) This relies on the MPI_Win windows concept, but otherwise uses direct access of other processes' memory.

12.1 Recognizing shared memory

crumb trail: > mpi-shared > Recognizing shared memory

MPI's one-sided routines take a very symmetric view of processes: each process can access the window of every other process (within a communicator). Of course, in practice there will be a difference in performance depending on whether the origin and target are actually on the same shared memory, or whether they can only communicate through the network. For this reason MPI makes it easy to group processes by shared memory domains using

 .

Here the split_type parameter has to be from the following (short) list:

  • MPI_COMM_TYPE_SHARED : split the communicator into subcommunicators of processes sharing a memory area. \begin{mpifournote} {Split by guided types}

  • MPI_COMM_TYPE_HW_GUIDED (\mpistandard{4}): split using an info value from MPI_Get_hw_resource_types  .

  • MPI_COMM_TYPE_HW_UNGUIDED (\mpistandard{4}): similar to MPI_COMM_TYPE_HW_GUIDED  , but the resulting communicators should be a strict subset of the original communicator. On processes where this condition can not be fullfilled,

    MPI_COMM_NULL will be returned. \end{mpifournote}

Splitting by shared memory: \begingroup \def\snippetcodefraction{.45} \def\snippettextfraction{.5} \csnippetwithoutput{commsplittype}{examples/mpi/c}{commsplittype} \endgroup

Exercise

Write a program that uses MPI_Comm_split_type to analyze for a run

  1. How many nodes there are;

  2. How many processes there are on each node.

If you run this program on an unequal distribution, say 10 processes on 3 nodes, what distribution do you find? \lstinputlisting{code/mpi/c/nodecount.runout}
End of exercise

Remark The OpenMPI implementation of MPI has a number of non-standard split types, such as OMPI_COMM_TYPE_SOCKET ; see https://www.open-mpi.org/doc/v4.1/man3/MPI_Comm_split_type.3.php


End of remark

MPL note Similar to ordinary communicator splitting  7.1 : communicator:: split_shared. End of MPL note

12.2 Shared memory for windows

crumb trail: > mpi-shared > Shared memory for windows

Processes that exist on the same physical shared memory should be able to move data by copying, rather than through MPI send/receive calls -- which of course will do a copy operation under the hood. In order to do such user-level copying:

  1. We need to create a shared memory area with

    MPI_Win_allocate_shared  . This creates a window with the

    unified memory model ; and

  2. We need to get pointers to where a process' area is in this shared space; this is done with MPI_Win_shared_query  .

12.2.1 Pointers to a shared window

crumb trail: > mpi-shared > Shared memory for windows > Pointers to a shared window

The first step is to create a window (in the sense of one-sided MPI; section  9.1  ) on the processes on one node. Using the

call presumably will put the memory close to the socket on which the process runs.

// sharedbulk.c
MPI_Win node_window;
MPI_Aint window_size; double *window_data;
if (onnode_procid==0)
  window_size = sizeof(double);
else window_size = 0;
MPI_Win_allocate_shared
  ( window_size,sizeof(double),MPI_INFO_NULL,
    nodecomm,
    &window_data,&node_window);

The memory allocated by MPI_Win_allocate_shared is contiguous between the processes. This makes it possible to do address calculation. However, if a cluster node has a NUMA structure, for instance if two sockets have memory directly attached to each, this would increase latency for some processes. To prevent this, the key alloc_shared_noncontig can be set to true in the MPI_Info object. \begin{mpifournote} {Window memory alignment} In the contiguous case, the mpi_minimum_memory_alignment info argument (section  9.1.1  ) applies only to the memory on the first process; in the noncontiguous case it applies to all. \end{mpifournote}

// numa.c
MPI_Info window_info;
MPI_Info_create(&window_info);
  MPI_Info_set(window_info,"alloc_shared_noncontig","true");
MPI_Win_allocate_shared( window_size,sizeof(double),window_info,
                         nodecomm,
                         &window_data,&node_window);
MPI_Info_free(&window_info);

Let's explore this. We create a shared window where each process stores exactly one double, that is, 8 bytes. The following code fragment queries the window locations, and prints the distance in bytes to the window on process 0.

for (int p=1; p<onnode_nprocs; p++) {
  MPI_Aint window_sizep; int windowp_unit; double *winp_addr;
  MPI_Win_shared_query( node_window,p,
                        &window_sizep,&windowp_unit, &winp_addr );
  distp = (size_t)winp_addr-(size_t)win0_addr;
  if (procno==0)
    printf("Distance %d to zero: %ld\n",p,(long)distp);

With the default strategy, these windows are contiguous, and so the distances are multiples of 8 bytes. Not so for the the non-contiguous allocation:

Strategy: default behavior of shared window allocation

Distance 1 to zero: 8
Distance 2 to zero: 16
Distance 3 to zero: 24
Distance 4 to zero: 32
Distance 5 to zero: 40
Distance 6 to zero: 48
Distance 7 to zero: 56
Distance 8 to zero: 64
Distance 9 to zero: 72

Strategy: allow non-contiguous shared window allocation

Distance 1 to zero: 4096
Distance 2 to zero: 8192
Distance 3 to zero: 12288
Distance 4 to zero: 16384
Distance 5 to zero: 20480
Distance 6 to zero: 24576
Distance 7 to zero: 28672
Distance 8 to zero: 32768
Distance 9 to zero: 36864

The explanation here is that each window is placed on its own small page  , which on this particular system has a size of 4K.

Remark The ampersand operator in C is not a

physical address  , but a

virtual address  . The translation of where pages are placed in physical memory is determined by the page table  .
End of remark

12.2.2 Querying the shared structure

crumb trail: > mpi-shared > Shared memory for windows > Querying the shared structure

Even though the window created above is shared, that doesn't mean it's contiguous. Hence it is necessary to retrieve the pointer to the area of each process that you want to communicate with:

 .

MPI_Aint window_size0; int window_unit; double *win0_addr;
MPI_Win_shared_query
  ( node_window,0,
    &window_size0,&window_unit, &win0_addr );

12.2.3 Heat equation example

crumb trail: > mpi-shared > Shared memory for windows > Heat equation example

As an example, which consider the 1D heat equation. On each process we create a local area of three point:

// sharedshared.c
MPI_Win_allocate_shared(3,sizeof(int),info,sharedcomm,&shared_baseptr,&shared_window);

12.2.4 Shared bulk data

crumb trail: > mpi-shared > Shared memory for windows > Shared bulk data

In applications such as ray tracing  , there is a read-only large data object (the objects in the scene to be rendered) that is needed by all processes. In traditional MPI, this would need to be stored redundantly on each process, which leads to large memory demands. With MPI shared memory we can store the data object once per node. Using as above MPI_Comm_split_type to find a communicator per NUMA domain, we store the object on process zero of this node communicator.

Exercise

Let the `shared' data originate on process zero in

MPI_COMM_WORLD  . Then:

  • create a communicator per shared memory domain;

  • create a communicator for all the processes with number zero on their node;

  • broadcast the shared data to the processes zero on each node.

\skeleton{shareddata}
End of exercise

Back to Table of Contents