This chapter discusses the I/O support of MPI, which is intended to alleviate the problems inherent in parallel file access. Let us first explore the issues. The story partly depends on what sort of parallel computer you are running on. Here are some of the hardware scenarios you may encounter:
Based on this, the following strategies are possible, even before we start talking about MPI I/O.
For these reasons, MPI has a number of routines that make it possible to read and write a single file from a large number of processes, giving each process its own well-defined location at which to access the data. These locations can use MPI derived datatypes for both the source data (that is, in memory) and target data (that is, on disk). Thus, in one call that is collective on a communicator, each process can address data that is not contiguous in memory, and place it in locations that are not contiguous on disk.
There are dedicated libraries for file I/O, such as hdf5, netcdf, or silo. However, these often add header information to a file that may not be understandable to post-processing applications. With MPI I/O you are in complete control of what goes to the file. (A useful tool for viewing your file is the unix utility od.)
TACC note Each node has a private /tmp file system (typically flash storage), to which you can write files. Considerations:
crumb trail: > mpi-io > File handling
MPI has a datatype for files: MPI_File. This acts a little like a traditional file handle, in that there are open, close, read/write, and seek operations on it. However, unlike traditional file handling, which in parallel would mean having one handle per process, this handle is collective: MPI processes act as if they share one file handle.
You open a file with MPI_File_open. This routine is collective, even if only certain processes will access the file with a read or write call. Similarly, MPI_File_close is collective.
Python note Note the slightly unusual syntax for opening a file:
mpifile = MPI.File.Open(comm,filename,mode)
Even though the file is opened on a communicator, it is a class method for the MPI.File class, rather than for the communicator object. The latter is passed in as an argument.
File access modes:
As a small illustration: \csnippetwithoutput{mpifilebasic}{examples/mpi/c}{write}
You can delete a file with MPI_File_delete.
Buffers can be flushed with MPI_File_sync, which is a collective call.
crumb trail: > mpi-io > File reading and writing
The basic file operations, in between the open and close calls, are the POSIX-like, noncollective, calls
For thread safety it is good to combine seek and read/write operations:
Writing to and reading from a parallel file is rather similar to sending and receiving:
crumb trail: > mpi-io > File reading and writing > Nonblocking read/write
Just as there are blocking and nonblocking sends, there are also nonblocking writes and reads: MPI_File_iwrite and MPI_File_iread, and their collective versions MPI_File_iwrite_all and MPI_File_iread_all.
The explicit-offset variants are MPI_File_iwrite_at, MPI_File_iwrite_at_all, MPI_File_iread_at, and MPI_File_iread_at_all.
These routines output an MPI_Request object, which can then be completed with MPI_Wait or tested with MPI_Test.
Nonblocking collective I/O functions much like other nonblocking collectives (section 3.11): the request is satisfied if all processes finish the collective.
There are also split collectives that function like nonblocking collective I/O, but without the request/wait mechanism. Instead, the collective is split into a pair of calls: MPI_File_write_all_begin / MPI_File_write_all_end (and similarly MPI_File_read_all_begin / MPI_File_read_all_end), where the second routine blocks until the collective write/read has been concluded.
There are also nonblocking routines that use the shared file pointer: MPI_File_iread_shared and MPI_File_iwrite_shared.
crumb trail: > mpi-io > File reading and writing > Individual file pointers, contiguous writes
After the collective open call, each process holds an individual file pointer that it can individually position somewhere in the shared file. Let's explore this modality.
The simplest way of writing data to a file is much like a send call: a buffer is specified with the usual count/datatype specification, and a target location in the file is given. The routine MPI_File_write_at gives this location in absolute terms with a parameter of type MPI_Offset, which counts bytes.
FIGURE 10.1: Writing at an offset
Exercise Create a buffer of length nwords=3 on each process, and write these buffers as a sequence to one file with MPI_File_write_at. (There is a skeleton code blockwrite in the repository) End of exercise
Instead of giving the position in the file explicitly, you can also use an MPI_File_seek call to position the file pointer, and then write with MPI_File_write at the pointer location. The write call itself also advances the file pointer, so successive calls that write contiguous elements need no intervening seek calls with MPI_SEEK_CUR.
Exercise Rewrite the code of exercise 10.1 to use a loop where each iteration writes only one item to file. Note that no explicit advance of the file pointer is needed. End of exercise
Exercise Construct a file with the consecutive integers $0,\ldots,WP$ where $W$ is some integer, and $P$ the number of processes. Each process $p$ writes the numbers $p,p+W,p+2W,\ldots$. Use a loop where each iteration
crumb trail: > mpi-io > File reading and writing > File views
The previous mode of writing is enough for writing simple contiguous blocks in the file. However, you can also access noncontiguous areas in the file. For this you use MPI_File_set_view. This call is collective, even if not all processes access the file.
// scatterwrite.c
MPI_File_set_view(mpifile, offset, MPI_INT, scattertype, "native", MPI_INFO_NULL);
FIGURE 10.2: Writing at a view
Exercise Write a file in the same way as in exercise 10.1, but now use MPI_File_write and use MPI_File_set_view to set a view that determines where the data is written. (There is a skeleton code viewwrite in the repository) End of exercise
You can get very creative effects by setting the view to a derived datatype.
FIGURE 10.3: Writing at a derived type
Fortran note In Fortran you have to ensure that the displacement parameter is of `kind' MPI_OFFSET_KIND. In particular, you cannot specify a literal zero `0' as the displacement; use 0_MPI_OFFSET_KIND instead. End of Fortran note
More: MPI_File_set_size, MPI_File_get_size, MPI_File_preallocate, MPI_File_get_view.
crumb trail: > mpi-io > File reading and writing > Shared file pointers
It is possible to have a file pointer that is shared (and therefore identical) between all processes of the communicator that was used to open the file. This file pointer is set with MPI_File_seek_shared. For reading and writing there are then two sets of routines:
Shared file pointers require that the same view is used on all processes. Also, these operations are less efficient because of the need to maintain the shared pointer.
crumb trail: > mpi-io > Consistency
It is possible for one process to read data previously written by another process. For this, it is of course necessary to impose a temporal order, for instance by using MPI_Barrier, or by using a zero-byte send from the writing to the reading process.
However, the file also needs to be declared atomic with MPI_File_set_atomicity.
crumb trail: > mpi-io > Constants
MPI_SEEK_SET used to be called SEEK_SET, which gave conflicts with the C++ library. This had to be circumvented with
make CPPFLAGS="-DMPICH_IGNORE_CXX_SEEK -DMPICH_SKIP_MPICXX"
and such.
crumb trail: > mpi-io > Error handling
By default, MPI uses MPI_ERRORS_ARE_FATAL since parallel errors are almost impossible to recover from. File handling errors, on the other hand, are less serious: if a file is not found, the operation can be abandoned. For this reason, the default error handler for file operations is MPI_ERRORS_RETURN.
The default I/O error handler can be queried and set with MPI_File_get_errhandler and MPI_File_set_errhandler respectively, passing MPI_FILE_NULL as argument.
Exercise T/F? File views ( MPI_File_set_view ) are intended to
Exercise The sequence MPI_File_seek_shared, MPI_File_read_shared can be replaced by MPI_File_seek, MPI_File_read if you make what changes? End of exercise