In this chapter you will learn the use of the main tool for distributed memory programming: the MPI library. The MPI library has about 250 routines, many of which you may never need. Since this is a textbook, not a reference manual, we will focus on the important concepts and give the important routines for each concept. What you learn here should be enough for most common purposes. You are advised to keep a reference document handy, in case there is a specialized routine, or to look up subtleties about the routines you use.
crumb trail: > mpi-started > Distributed memory and message passing
In its simplest form, a distributed memory machine is a collection of single computers hooked up with network cables. In fact, this has a name: a Beowulf cluster . As you recognize from that setup, each processor can run an independent program, and has its own memory without direct access to other processors' memory. MPI is the magic that makes multiple instantiations of the same executable run so that they know about each other and can exchange data through the network.
One of the reasons that MPI is so successful as a tool for high performance on clusters is that it is very explicit: the programmer controls many details of the data motion between the processors. Consequently, a capable programmer can write very efficient code with MPI. Unfortunately, that programmer will have to spell things out in considerable detail. For this reason, people sometimes call MPI `the assembly language of parallel programming'. If that sounds scary, be assured that things are not that bad. You can get started fairly quickly with MPI, using just the basics, and coming to the more sophisticated tools only when necessary.
Another reason that MPI was a big hit with programmers is that it does not ask you to learn a new language: it is a library that can be interfaced to C/C++ or Fortran; there are even bindings to Python and Java (not described in this course). This does not mean, however, that it is simple to `add parallelism' to an existing sequential program. An MPI version of a serial program takes considerable rewriting; certainly more than shared memory parallelism through OpenMP, discussed later in this book.
MPI is also easy to install: there are free implementations that you can download and install on any computer that has a Unix-like operating system, even if that is not a parallel machine. However, if you are working on a supercomputer cluster, likely there will already be an MPI installation, tuned to that machine's network.
crumb trail: > mpi-started > History
Before the MPI standard was developed in 1993-4, there were many libraries for distributed memory computing, often proprietary to a vendor platform. MPI standardized the inter-process communication mechanisms. Other features, such as process management in PVM , or parallel I/O were omitted. Later versions of the standard have included many of these features.
Since MPI was designed by a large number of academic and commercial participants, it quickly became a standard. A few packages from the pre-MPI era, such as Charmpp [charmpp] , are still in use since they support mechanisms that do not exist in MPI.
crumb trail: > mpi-started > Basic model
Here we sketch the two most common scenarios for using MPI. In the first, the user is working on an interactive machine, which has network access to a number of hosts, typically a network of workstations; see figure 1.1 .
FIGURE 1.1: Interactive MPI setup
The user types the command mpiexec \footnote {A command variant is mpirun ; your local cluster may have a different mechanism.} and supplies
In the second scenario (figure 1.2 ) the user prepares a batch job script with commands, and these will be run when the batch scheduler gives a number of hosts to the job.
FIGURE 1.2: Batch MPI setup
Now the batch script contains the mpiexec command% \begin{istc} , or some variant such as ibrun \end{istc} , and the hostfile is dynamically generated when the job starts. Since the job now runs at a time when the user may not be logged in, any screen output goes into an output file.
You see that in both scenarios the parallel program is started by the mpiexec command using an SPMD mode of execution: all hosts execute the same program. It is possible for different hosts to execute different programs, but we will not consider that in this book.
There can be options and environment variables that are specific to some MPI installations, or to the network.
https://www.mpich.org/static/docs/v3.1/www1/mpiexec.html
crumb trail: > mpi-started > Making and running an MPI program
MPI is a library, called from programs in ordinary programming languages such as C/C++ or Fortran. To compile such a program you use your regular compiler:
gcc -c my_mpi_prog.c -I/path/to/mpi.h gcc -o my_mpi_prog my_mpi_prog.o -L/path/to/mpi -lmpichHowever, MPI libraries may have different names between different architectures, making it hard to have a portable makefile. Therefore, MPI typically has shell scripts around your compiler call, called mpicc , mpicxx , mpif90 for C/C++/Fortran respectively.
mpicc -c my_mpi_prog.c mpicc -o my_mpi_prog my_mpi_prog.o
If you want to know what \indextermttdef{mpicc} does, there is usually an option that prints out its definition. On a Mac with the clang compiler:
$$ mpicc -show clang -fPIC -fstack-protector -fno-stack-check -Qunused-arguments -g3 -O0 -Wno-implicit-function-declaration -Wl,-flat_namespace -Wl,-commons,use_dylibs -I/Users/eijkhout/Installation/petsc/petsc-3.16.1/macx-clang-debug/include -L/Users/eijkhout/Installation/petsc/petsc-3.16.1/macx-clang-debug/lib -lmpi -lpmpi
Remark
In
OpenMPI
, these commands are
binary executables by default,
but you can make it a shell script by passing the
--enable-script-wrapper-compilers
option at configure time.
End of remark
MPI programs can be run on many different architectures. Obviously it is your ambition (or at least your dream) to run your code on a cluster with a hundred thousand processors and a fast network. But maybe you only have a small cluster with plain ethernet . Or maybe you're sitting in a plane, with just your laptop. An MPI program can be run in all these circumstances -- within the limits of your available memory of course.
The way this works is that you do not start your executable directly, but you use a program, typically called mpiexec or something similar, which makes a connection to all available processors and starts a run of your executable there. So if you have a thousand nodes in your cluster, mpiexec can start your program once on each, and if you only have your laptop it can start a few instances there. In the latter case you will of course not get great performance, but at least you can test your code for correctness.
crumb trail: > mpi-started > Language bindings
crumb trail: > mpi-started > Language bindings > C
The MPI library is written in C. However, the standard is careful to distinguish between MPI routines, versus their C bindings . In fact, as of MPI MPI-4, for a number of routines there are two bindings, depending on whether you want 4 byte integers, or larger. See section 6.4 , in particular 6.4.1 .
crumb trail: > mpi-started > Language bindings > C++, including MPL
C++ bindings but they were declared deprecated, and have been officially removed in the \mpistandardsub{3}{C++ bindings removed} standard. Thus, MPI can be used from C++ by including
#include <mpi.h>and using the C API.
The boost library has its own version of MPI, but it seems not to be under further development. A recent effort at idiomatic C++ support is MPL
https://github.com/rabauke/mpl . This book has an index of MPL notes and commands: section sec:idx:mpl .
MPL note MPL is a C++ header-only library. Notes on MPI usage from MPL will be indicated like this. End of MPL note
crumb trail: > mpi-started > Language bindings > Fortran
Fortran note Fortran-specific notes will be indicated with a note like this. End of Fortran note
Traditionally, Fortran bindings for MPI look very much like the C ones, except that each routine has a final error return parameter. You will find that a lot of MPI code in Fortran conforms to this.
However, in the MPI 3 % standard it is recommended that an MPI implementation providing a Fortran interface provide a module named \indextermttdef{mpi_f08} that can be used in a Fortran program. This incorporates the following improvements:
There is no matching specific subroutine for this generic subroutine call [MPI_Send]For details see
http://mpi-forum.org/docs/mpi-3.1/mpi31-report/node409.htm .
crumb trail: > mpi-started > Language bindings > Python
Python note Python-specific notes will be indicated with a note like this.
The \indextermttdef{mpi4py} package [Dalcin:m4py12,mpi4py:homepage] of python bindings is not defined by the MPI standards committee. Instead, it is the work of an individual, Lisandro Dalcin .
In a way, the Python interface is the most elegant. It uses OO techniques such as methods on objects, and many default arguments.
Notable about the Python bindings is that many communication routines exist in two variants:
Codes with mpi4py can be interfaced to other languages through Swig or conversion routines.
Data in numpy can be specified as a simple object, or [data, (count,displ), datatype] .
crumb trail: > mpi-started > Language bindings > How to read routine signatures
Throughout the MPI part of this book we will give the reference syntax of the routines. This typically comprises:
These `routine signatures' look like code but they are not! Here is how you translate them.
crumb trail: > mpi-started > Language bindings > How to read routine signatures > C
The typically C routine specification in MPI looks like:
int MPI_Comm_size(MPI_Comm comm,int *nprocs)This means that
MPI_Comm comm = MPI_COMM_WORLD; int nprocs; int errorcode; errorcode = MPI_Comm_size( MPI_COMM_WORLD,&nprocs); if (errorcode!=MPI_SUCCESS) { printf("Routine MPI_Comm_size failed! code=%d\n", errorcode); return 1; }However, the error codes are hardly ever useful, and there is not much your program can do to recover from an error. Most people call the routine as
MPI_Comm_size( /* parameter ... */ );For more on error handling, see section 15.2 .
MPI_Comm my_comm = MPI_COMM_WORLD; // using a predefined value MPI_Comm_size( comm, /* remaining parameters */ );
MPI_Comm my_comm = MPI_COMM_WORLD; // using a predefined value int nprocs; MPI_Comm_size( comm, &nprocs );Seeing a `star' parameter usually means either: the routine has an array argument, or: the routine internally sets the value of a variable. The latter is the case here.
crumb trail: > mpi-started > Language bindings > How to read routine signatures > Fortran
The Fortran specification looks like:
MPI_Comm_size(comm, size, ierror) Type(MPI_Comm), Intent(In) :: comm Integer, Intent(Out) :: size Integer, Optional, Intent(Out) :: ierroror for the Fortran90 legacy mode:
MPI_Comm_size(comm, size, ierror) INTEGER, INTENT(IN) :: comm INTEGER, INTENT(OUT) :: size INTEGER, OPTIONAL, INTENT(OUT) :: ierrorThe syntax of using this routine is close to this specification: you write
Type(MPI_Comm) :: comm = MPI_COMM_WORLD ! legacy: Integer :: comm = MPI_COMM_WORLD Integer :: comm = MPI_COMM_WORLD Integer :: size,ierr CALL MPI_Comm_size( comm, size ) ! without the optional ierr
crumb trail: > mpi-started > Language bindings > How to read routine signatures > Python
The Python interface to MPI uses classes and objects. Thus, a specification like:
MPI.Comm.Send(self, buf, int dest, int tag=0)should be parsed as follows.
from mpi4py import MPI
comm = MPI.COMM_WORLD
comm.Send( .... )
comm.Send(sendbuf,dest=other)specifying the send buffer as positional parameter, the destination as keyword parameter, and using the default value for the optional tag.
MPI.Request.Waitall(type cls, requests, statuses=None)would be used as
MPI.Request.Waitall(requests)
crumb trail: > mpi-started > Review
Review What determines the parallelism of an MPI job?
Review
T/F: the number of cores of your laptop is the limit of how many MPI
proceses you can start up.
End of review
Review Do the following languages have an object-oriented interface to MPI? In what sense?