MPI topic: File I/O

Experimental html version of Parallel Programming in MPI, OpenMP, and PETSc by Victor Eijkhout. download the textbook at https:/

10.1 : File handling
10.2 : File reading and writing
10.2.1 : Nonblocking read/write
10.2.2 : Individual file pointers, contiguous writes
10.2.3 : File views
10.2.4 : Shared file pointers
10.3 : Consistency
10.4 : Constants
10.5 : Error handling
10.6 : Review questions

10 MPI topic: File I/O

This chapter discusses the I/O support of MPI, which is intended to alleviate the problems inherent in parallel file access. Let us first explore the issues. This story partly depends on what sort of parallel computer you are running on. Here are some of the hardware scenarios you may encounter:

Based on this, the following strategies are possible, even before we start talking about MPI I/O.

For these reasons, MPI has a number of routines that make it possible to read and write a single file from a large number of processes, giving each process its own well-defined location at which to access the data. These locations can use MPI derived datatypes for both the source data (that is, in memory) and target data (that is, on disk). Thus, in one call that is collective on a communicator, each process can address data that is not contiguous in memory, and place it in locations that are not contiguous on disk.

There are dedicated libraries for file I/O, such as hdf5, netcdf, or silo. However, these often add header information to a file that may not be understandable to post-processing applications. With MPI I/O you are in complete control of what goes to the file. (A useful tool for viewing your file is the unix utility od.)

TACC note Each node has a private /tmp file system (typically flash storage), to which you can write files. Considerations:

10.1 File handling

MPI has a datatype for files: MPI_File. This acts a little like a traditional file handle, in that there are open, close, read/write, and seek operations on it. However, unlike traditional file handling, which in parallel would mean having one handle per process, this handle is collective: MPI processes act as if they share one file handle.

You open a file with MPI_File_open. This routine is collective, even if only certain processes will access the file with a read or write call. Similarly, MPI_File_close is collective.

Python note Note the slightly unusual syntax for opening a file:

mpifile = MPI.File.Open(comm,filename,mode)
Even though the file is opened on a communicator, it is a class method for the MPI.File class, rather than for the communicator object. The latter is passed in as an argument.

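File access modes, as defined by the MPI standard:

  1. MPI_MODE_RDONLY : read only,
  2. MPI_MODE_RDWR : reading and writing,
  3. MPI_MODE_WRONLY : write only,
  4. MPI_MODE_CREATE : create the file if it does not exist,
  5. MPI_MODE_EXCL : give an error if the file to be created already exists,
  6. MPI_MODE_DELETE_ON_CLOSE : delete the file when it is closed,
  7. MPI_MODE_UNIQUE_OPEN : the file will not be opened concurrently elsewhere,
  8. MPI_MODE_SEQUENTIAL : the file will only be accessed sequentially,
  9. MPI_MODE_APPEND : set the initial position of all file pointers to the end of the file.

These modes can be added or bitwise-or'ed.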

As a small illustration, see the snippet mpifilebasic in examples/mpi/c (program write) in the repository.

You can delete a file with MPI_File_delete.

Buffers can be flushed with MPI_File_sync, which is a collective call.

10.2 File reading and writing

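The basic file operations, in between the open and close calls, are the POSIX-like, noncollective calls MPI_File_seek, MPI_File_read, and MPI_File_write.

For thread safety it is good to combine the seek and read/write operations in a single call: MPI_File_read_at and MPI_File_write_at.

Writing to and reading from a parallel file is rather similar to sending and receiving: a buffer is described with the usual count/datatype specification, and a location in the file takes the place of the target process.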

10.2.1 Nonblocking read/write

Just like there are blocking and nonblocking sends, there are also nonblocking writes and reads: MPI_File_iwrite and MPI_File_iread, and their collective versions MPI_File_iwrite_all and MPI_File_iread_all.

The variants with an explicit offset are MPI_File_iwrite_at, MPI_File_iwrite_at_all, MPI_File_iread_at, and MPI_File_iread_at_all.

These routines output an MPI_Request object, which can then be completed with MPI_Wait or tested with MPI_Test.

Nonblocking collective I/O functions much like other nonblocking collectives (section 3.11): the request is satisfied if all processes finish the collective.

There are also split collectives that function like nonblocking collective I/O, but with a pair of begin/end calls instead of the request/wait mechanism: MPI_File_write_all_begin / MPI_File_write_all_end (and similarly MPI_File_read_all_begin / MPI_File_read_all_end), where the second routine blocks until the collective write/read has been concluded.

For shared file pointers there are also MPI_File_iread_shared and MPI_File_iwrite_shared.

10.2.2 Individual file pointers, contiguous writes

After the collective open call, each process holds an individual file pointer that it can individually position somewhere in the shared file. Let's explore this modality.

The simplest way of writing data to a file is much like a send call: a buffer is specified with the usual count/datatype specification, and a target location in the file is given. The routine MPI_File_write_at gives this location in absolute terms with a parameter of type MPI_Offset, which counts bytes as long as no file view has been set.

FIGURE 10.1: Writing at an offset

Exercise Create a buffer of length nwords=3 on each process, and write these buffers as a sequence to one file with MPI_File_write_at. (There is a skeleton code blockwrite in the repository.)
End of exercise

Instead of giving the position in the file explicitly, you can also use an MPI_File_seek call to position the file pointer, and write with MPI_File_write at the pointer location. The write call itself also advances the file pointer, so separate calls for writing contiguous elements need no seek calls with MPI_SEEK_CUR.

Exercise Rewrite the code of exercise 10.1 to use a loop where each iteration writes only one item to file. Note that no explicit advance of the file pointer is needed.
End of exercise

Exercise Construct a file with the consecutive integers $0,\ldots,WP-1$ where $W$ is some integer, and $P$ is the number of processes. Each process $p$ writes the $W$ numbers $p,p+P,p+2P,\ldots$. Use a loop where each iteration

  1. writes a single number with MPI_File_write, and
  2. advances the file pointer with MPI_File_seek with a whence parameter of MPI_SEEK_CUR.

End of exercise

10.2.3 File views

The previous mode of writing is enough for writing simple contiguous blocks in the file. However, you can also access noncontiguous areas in the file. For this you use MPI_File_set_view. This call is collective, even if not all processes access the file.

// scatterwrite.c

FIGURE 10.2: Writing at a view

Exercise (There is a skeleton code viewwrite in the repository.) Write a file in the same way as in exercise 10.1, but now use MPI_File_write and use MPI_File_set_view to set a view that determines where the data is written.
End of exercise

You can get very creative effects by setting the view to a derived datatype.

FIGURE 10.3: Writing at a derived type

Fortran note In Fortran you have to ensure that the displacement parameter is of 'kind' MPI_OFFSET_KIND. In particular, you cannot specify a literal zero '0' as the displacement; use 0_MPI_OFFSET_KIND instead. End of Fortran note

More: MPI_File_set_size, MPI_File_get_size, MPI_File_preallocate, MPI_File_get_view.

10.2.4 Shared file pointers

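It is possible to have a file pointer that is shared (and therefore identical) between all processes of the communicator that was used to open the file. This file pointer is set with MPI_File_seek_shared. For reading and writing there are then two sets of routines: the noncollective MPI_File_read_shared and MPI_File_write_shared, where the order of accesses from different processes is not determined, and the collective MPI_File_read_ordered and MPI_File_write_ordered, which access the file in order of process rank.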

Shared file pointers require that the same view is used on all processes. Also, these operations are less efficient because of the need to maintain the shared pointer.

10.3 Consistency

It is possible for one process to read data previously written by another process. For this, it is of course necessary to impose a temporal order, for instance by using MPI_Barrier, or using a zero-byte send from the writing to the reading process.

However, the file also needs to be declared atomic with MPI_File_set_atomicity.

10.4 Constants

MPI_SEEK_SET used to be called SEEK_SET which gave conflicts with the C++ library. This had to be circumvented with

and such.

10.5 Error handling

By default, MPI uses MPI_ERRORS_ARE_FATAL since parallel errors are almost impossible to recover from. File handling errors, on the other hand, are less serious: if a file is not found, the operation can be abandoned. For this reason, the default error handler for file operations is MPI_ERRORS_RETURN.

The default I/O error handler can be queried and set with MPI_File_get_errhandler and MPI_File_set_errhandler respectively, passing MPI_FILE_NULL as argument.


10.6 Review questions

Exercise T/F? File views (MPI_File_set_view) are intended to

End of exercise

Exercise The sequence MPI_File_seek_shared, MPI_File_read_shared can be replaced by MPI_File_seek, MPI_File_read if you make what changes?
End of exercise
