Some programmers are under the impression that MPI would not be efficient on shared memory, since all operations are done through what looks like network calls. This is not correct: many MPI implementations have optimizations that detect shared memory and can exploit it, so that data is copied, rather than going through a communication layer. (Conversely, programming systems for shared memory such as OpenMP can actually have inefficiencies associated with thread handling.) The main inefficiency associated with using MPI on shared memory is then that processes can not actually share data.
The one-sided MPI calls (chapter MPI topic: One-sided communication ) can also be used to emulate shared memory, in the sense that an origin process can access data from a target process without the target's active involvement. However, these calls do not distinguish between actually shared memory and one-sided access across the network.
In this chapter we will look at the ways MPI can interact with the presence of actual shared memory. (This functionality was added in the MPI-3 standard.) This relies on the MPI_Win windows concept, but otherwise uses direct access of other processes' memory.
crumb trail: > mpi-shared > Recognizing shared memory
MPI's one-sided routines take a very symmetric view of processes: each process can access the window of every other process (within a communicator). Of course, in practice there will be a difference in performance depending on whether the origin and target are actually on the same shared memory, or whether they can only communicate through the network. For this reason MPI makes it easy to group processes by shared memory domains using MPI_Comm_split_type .
Here the split_type parameter has to be from the following (short) list:
Splitting by shared memory: \begingroup \def\snippetcodefraction{.45} \def\snippettextfraction{.5} \csnippetwithoutput{commsplittype}{examples/mpi/c}{commsplittype} \endgroup
Exercise Write a program that uses MPI_Comm_split_type to analyze for a run
Remark The OpenMPI implementation of MPI has a number of non-standard split types, such as OMPI_COMM_TYPE_SOCKET ; see https://www.open-mpi.org/doc/v4.1/man3/MPI_Comm_split_type.3.php
End of remark
MPL note Similar to ordinary communicator splitting 7.1 : communicator:: split_shared. End of MPL note
crumb trail: > mpi-shared > Shared memory for windows
Processes that exist on the same physical shared memory should be able to move data by copying, rather than through MPI send/receive calls -- which of course will do a copy operation under the hood. In order to do such user-level copying:
crumb trail: > mpi-shared > Shared memory for windows > Pointers to a shared window
The first step is to create a window (in the sense of one-sided MPI; section 9.1 ) on the processes on one node. Using the MPI_Win_allocate_shared call presumably will put the memory close to the socket on which the process runs.
The memory allocated by MPI_Win_allocate_shared is contiguous between the processes. This makes it possible to do address calculation. However, if a cluster node has a NUMA structure, for instance if two sockets have memory directly attached to each, this would increase latency for some processes. To prevent this, the key alloc_shared_noncontig can be set to true in the MPI_Info object. \begin{mpifournote} {Window memory alignment} In the contiguous case, the mpi_minimum_memory_alignment info argument (section 9.1.1 ) applies only to the memory on the first process; in the noncontiguous case it applies to all. \end{mpifournote}
// numa.c MPI_Info window_info; MPI_Info_create(&window_info); MPI_Info_set(window_info,"alloc_shared_noncontig","true"); MPI_Win_allocate_shared( window_size,sizeof(double),window_info, nodecomm, &window_data,&node_window); MPI_Info_free(&window_info);
Let's explore this. We create a shared window where each process stores exactly one double, that is, 8 bytes. The following code fragment queries the window locations, and prints the distance in bytes to the window on process 0.
for (int p=1; p<onnode_nprocs; p++) { MPI_Aint window_sizep; int windowp_unit; double *winp_addr; MPI_Win_shared_query( node_window,p, &window_sizep,&windowp_unit, &winp_addr ); distp = (size_t)winp_addr-(size_t)win0_addr; if (procno==0) printf("Distance %d to zero: %ld\n",p,(long)distp);
With the default strategy, these windows are contiguous, and so the distances are multiples of 8 bytes. Not so for the the non-contiguous allocation:
Strategy: default behavior of shared window allocation
Distance 1 to zero: 8 Distance 2 to zero: 16 Distance 3 to zero: 24 Distance 4 to zero: 32 Distance 5 to zero: 40 Distance 6 to zero: 48 Distance 7 to zero: 56 Distance 8 to zero: 64 Distance 9 to zero: 72
Strategy: allow non-contiguous shared window allocation
Distance 1 to zero: 4096 Distance 2 to zero: 8192 Distance 3 to zero: 12288 Distance 4 to zero: 16384 Distance 5 to zero: 20480 Distance 6 to zero: 24576 Distance 7 to zero: 28672 Distance 8 to zero: 32768 Distance 9 to zero: 36864
The explanation here is that each window is placed on its own small page , which on this particular system has a size of 4K.
Remark
The ampersand operator in C is not a
physical address
, but a
virtual address
.
The translation of where pages are placed in physical memory
is determined by the
page table
.
End of remark
crumb trail: > mpi-shared > Shared memory for windows > Querying the shared structure
Even though the window created above is shared, that doesn't mean it's contiguous. Hence it is necessary to retrieve the pointer to the area of each process that you want to communicate with: MPI_Win_shared_query .
crumb trail: > mpi-shared > Shared memory for windows > Heat equation example
As an example, which consider the 1D heat equation. On each process we create a local area of three point:
crumb trail: > mpi-shared > Shared memory for windows > Shared bulk data
In applications such as ray tracing , there is a read-only large data object (the objects in the scene to be rendered) that is needed by all processes. In traditional MPI, this would need to be stored redundantly on each process, which leads to large memory demands. With MPI shared memory we can store the data object once per node. Using as above MPI_Comm_split_type to find a communicator per NUMA domain, we store the object on process zero of this node communicator.
Exercise Let the `shared' data originate on process zero in MPI_COMM_WORLD . Then: