
27 Scientific Data Storage

There are many ways of storing data, in particular data that comes in arrays. A surprising number of people store data in spreadsheets, then export it to ASCII files with comma or tab delimiters, and expect other people (or other programs written by themselves) to read that in again. Such a process is wasteful in several respects: the ASCII representation of a number takes far more space than its binary representation, the conversion to and from ASCII costs time, and the conversion can lose precision.

For such reasons, it is desirable to have a file format that is based on binary storage. There are a few more requirements on a useful file format: it should be machine-independent, so that files can be exchanged between architectures, and self-documenting, so that data in a file can be identified by name rather than by its location.

This tutorial will introduce the HDF5 library, which fulfills these requirements. HDF5 is a large and complicated library, so this tutorial will only touch on the basics. For further information, consult http://www.hdfgroup.org/HDF5/ . While you do this tutorial, keep your browser open on http://www.hdfgroup.org/HDF5/doc/ or http://www.hdfgroup.org/HDF5/RM/RM_H5Front.html for the exact syntax of the routines.

27.1 Introduction to HDF5


As described above, HDF5 is a file format that is machine-independent and self-documenting. Each HDF5 file is set up like a directory tree, with subdirectories, and leaf nodes that contain the actual data. This means that data can be found in a file by referring to its name, rather than its location in the file. In this section you will learn to write programs that write to and read from HDF5 files. In order to check that the files are as you intend, you can use the h5dump utility on the command line. (In order to do the examples, the h5dump utility needs to be in your path, and you need to know the location of the hdf5.h header and the libhdf5.a and related library files.)

Just a word about compatibility. The HDF5 format is not compatible with the older version HDF4, which is no longer under development. You can still come across people using HDF4 for historic reasons. This tutorial is based on HDF5 version 1.6. Some interfaces changed in the current version 1.8; in order to use 1.6 APIs with 1.8 software, add the flag -DH5_USE_16_API to your compile line.

Many HDF5 routines are about creating objects: file handles, members in a dataset, et cetera. The general syntax for that is

hid_t h_id;
h_id = H5Xsomething(...);

Failure to create the object is indicated by a negative return value, so it would be a good idea to create a file myh5defs.h containing:

#include "hdf5.h"

#define H5REPORT(e) \

  {if (e<0) {printf("\nHDF5 error on line %d\n\n",__LINE__); \

   return e;}}

and use this as:

#include "myh5defs.h"

hid_t h_id;
h_id = H5Xsomething(...); H5REPORT(h_id);

27.2 Creating a file


First of all, we need to create an HDF5 file.

hid_t file_id;
herr_t status;

file_id = H5Fcreate( filename, ... );
...
status = H5Fclose(file_id);

This file will be the container for a number of data items, organized like a directory tree.

Exercise: Create an HDF5 file by compiling and running the create.c example below.

Expected outcome: A file file.h5 should be created.

Caveats: Be sure to add the HDF5 include and library directories:
cc -c create.c -I. -I/opt/local/include
and
cc -o create create.o -L/opt/local/lib -lhdf5
The include and lib directories will be system dependent. On the TACC clusters, do module load hdf5, which will give you environment variables TACC_HDF5_INC and TACC_HDF5_LIB for the include and library directories, respectively.

[Listing: tutorials/hdf5/create.c]
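The create.c listing is not reproduced in this rendering. A minimal sketch of such a program, using only the calls and the myh5defs.h header described above, could look as follows; the actual file may differ in details.

/* create.c sketch: create an empty HDF5 file and close it again */
#include "myh5defs.h"
#define FILE "file.h5"

int main() {
   hid_t file_id;
   herr_t status;

   /* Create a new file, truncating any existing file of the same name. */
   file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
   H5REPORT(file_id);

   /* Close the file again; it now contains only the root group. */
   status = H5Fclose(file_id);
   H5REPORT(status);

   return 0;
}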

You can display the created file on the command line:

%% h5dump file.h5

HDF5 "file.h5" {

GROUP "/" {

}

}

Note that an empty file corresponds to just the root of the directory tree that will hold the data.

27.3 Datasets


Next we create a dataset, in this example a 2D grid. To describe this, we first need to construct a dataspace:

   dims[0] = 4; dims[1] = 6;
   dataspace_id = H5Screate_simple(2, dims, NULL);
   dataset_id = H5Dcreate(file_id, "/dset", dataspace_id, .... );
   ....
   status = H5Dclose(dataset_id);
   status = H5Sclose(dataspace_id);

Note that datasets and dataspaces need to be closed, just like files.

Exercise: Create a dataset by compiling and running the dataset.c code below.

Expected outcome: This creates a file dset.h5 that can be displayed with h5dump.

[Listing: tutorials/hdf5/dataset.c]
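The dataset.c listing is likewise not reproduced here. The following sketch is assembled from the calls just shown; it assumes the H5T_STD_I32BE datatype that appears in the dump below, and uses the five-argument 1.6 form of H5Dcreate, where the datatype is the third argument.

/* dataset.c sketch: create a file with a 4x6 integer dataset; no data written */
#include "myh5defs.h"
#define FILE "dset.h5"

int main() {
   hid_t file_id, dataset_id, dataspace_id;
   hsize_t dims[2];
   herr_t status;

   file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
   H5REPORT(file_id);

   /* A dataspace describes the shape of the data: here a 4x6 grid. */
   dims[0] = 4; dims[1] = 6;
   dataspace_id = H5Screate_simple(2, dims, NULL);

   /* The dataset couples a name and a datatype to the dataspace. */
   dataset_id = H5Dcreate
     (file_id, "/dset", H5T_STD_I32BE, dataspace_id, H5P_DEFAULT);
   H5REPORT(dataset_id);

   /* Datasets and dataspaces need to be closed, just like files. */
   status = H5Dclose(dataset_id);
   status = H5Sclose(dataspace_id);
   status = H5Fclose(file_id);
   return 0;
}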

We again view the created file on the command line:

%% h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
      (0,0): 0, 0, 0, 0, 0, 0,
      (1,0): 0, 0, 0, 0, 0, 0,
      (2,0): 0, 0, 0, 0, 0, 0,
      (3,0): 0, 0, 0, 0, 0, 0
      }
   }
}
}

The datafile contains such information as the size of the arrays you store. Still, you may want to add related scalar information. For instance, if the array is the output of a program, you could record the input parameters with which it was generated.

   parmspace = H5Screate(H5S_SCALAR);
   parm_id = H5Dcreate
     (file_id,"/parm",H5T_NATIVE_INT,parmspace,H5P_DEFAULT);

Exercise: Add a scalar dataspace to the HDF5 file, by compiling and running the parmwrite.c code below.

Expected outcome: A new file wdset.h5 is created.

[Listing: tutorials/hdf5/parmdataset.c]
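The referenced listing is not reproduced here; the parmwrite.c program in the next section shows the complete code, including the writing of the values that appear in the dump below. Relative to dataset.c, the new part is the scalar parameter dataset, in outline:

   /* add a scalar integer dataset "/parm" next to "/dset" */
   parmspace = H5Screate(H5S_SCALAR);
   parm_id = H5Dcreate
     (file_id, "/parm", H5T_NATIVE_INT, parmspace, H5P_DEFAULT);
   /* ... after writing, close the dataset and dataspace again ... */
   status = H5Dclose(parm_id);
   status = H5Sclose(parmspace);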

%% h5dump wdset.h5
HDF5 "wdset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
      (0,0): 0.5, 1.5, 2.5, 3.5, 4.5, 5.5,
      (1,0): 6.5, 7.5, 8.5, 9.5, 10.5, 11.5,
      (2,0): 12.5, 13.5, 14.5, 15.5, 16.5, 17.5,
      (3,0): 18.5, 19.5, 20.5, 21.5, 22.5, 23.5
      }
   }
   DATASET "parm" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SCALAR
      DATA {
      (0): 37
      }
   }
}
}

27.4 Writing the data


The datasets you created allocate the space in the HDF5 file. Now you need to put actual data in them. This is done with the H5Dwrite call.

/* Write floating point data */
for (i=0; i<24; i++) data[i] = i+.5;
status = H5Dwrite
  (dataset,H5T_NATIVE_DOUBLE,H5S_ALL,H5S_ALL,H5P_DEFAULT,
   data);

/* write parameter value */
parm = 37;
status = H5Dwrite
  (parmset,H5T_NATIVE_INT,H5S_ALL,H5S_ALL,H5P_DEFAULT,
   &parm);

/*
 * File: parmwrite.c
 * Author: Victor Eijkhout
 */
#include "myh5defs.h"
#define FILE "wdset.h5"

int main() {
   hid_t file_id, dataset, dataspace;  /* identifiers */
   hid_t parmset, parmspace;
   hsize_t dims[2];
   herr_t status;
   double data[24];
   int i, parm;

   /* Create a new file using default properties. */
   file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

   /* Create the dataset. */
   dims[0] = 4; dims[1] = 6;
   dataspace = H5Screate_simple(2, dims, NULL);
   dataset = H5Dcreate
     (file_id, "/dset", H5T_NATIVE_DOUBLE, dataspace, H5P_DEFAULT);

   /* Add a descriptive parameter */
   parmspace = H5Screate(H5S_SCALAR);
   parmset = H5Dcreate
     (file_id, "/parm", H5T_NATIVE_INT, parmspace, H5P_DEFAULT);

   /* Write data to file */
   for (i=0; i<24; i++) data[i] = i+.5;
   status = H5Dwrite
     (dataset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
   H5REPORT(status);

   /* Write parameter value */
   parm = 37;
   status = H5Dwrite
     (parmset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, &parm);
   H5REPORT(status);

   /* End access to the datasets and release resources used by them. */
   status = H5Dclose(dataset);
   status = H5Dclose(parmset);

   /* Terminate access to the dataspaces. */
   status = H5Sclose(dataspace);
   status = H5Sclose(parmspace);

   /* Close the file. */
   status = H5Fclose(file_id);
   return 0;
}

%% h5dump wdset.h5
HDF5 "wdset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
      (0,0): 0.5, 1.5, 2.5, 3.5, 4.5, 5.5,
      (1,0): 6.5, 7.5, 8.5, 9.5, 10.5, 11.5,
      (2,0): 12.5, 13.5, 14.5, 15.5, 16.5, 17.5,
      (3,0): 18.5, 19.5, 20.5, 21.5, 22.5, 23.5
      }
   }
   DATASET "parm" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SCALAR
      DATA {
      (0): 37
      }
   }
}
}

If you look closely at the source and the dump, you see that the data types are declared as `native', but rendered as LE. The `native' declaration makes the datatypes behave like the built-in C or Fortran data types. Alternatively, you can explicitly indicate whether data is little-endian or big-endian. These terms describe how the bytes of a data item are ordered in memory. Most architectures use little endian, as you can see in the dump output, but, notably, IBM uses big endian.
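For example, to fix the byte order in the file regardless of the host architecture, you could create a dataset with an explicit standard type instead of a native one; a hypothetical variation on the earlier creation call:

   /* store as big-endian 32-bit integers, even on a little-endian machine */
   dataset_id = H5Dcreate
     (file_id, "/dset", H5T_STD_I32BE, dataspace_id, H5P_DEFAULT);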

27.5 Reading


Now that we have a file with some data, we can do the mirror part of the story: reading from that file. The essential commands are

  h5file = H5Fopen( .... )
  ....
  H5Dread( dataset, .... data .... )

where the H5Dread command has the same arguments as the corresponding H5Dwrite.

Exercise: Read data from the wdset.h5 file that you created in the previous exercise, by compiling and running the allread.c example below.

Expected outcome: Running the allread executable will print the value 37 of the parameter, and the value 8.5 of the (1,2) data point of the array.

Caveats: Make sure that you run parmwrite to create the input file.

[Listing: tutorials/hdf5/allread.c]
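The allread.c listing is not reproduced in this rendering. A sketch of how such a program plausibly proceeds, using H5Fopen and H5Dopen (in the two-argument 1.6 form) together with the H5Dread call above; the actual file may differ in details.

/* allread.c sketch: read the array and the parameter back from wdset.h5 */
#include <stdio.h>
#include "myh5defs.h"
#define FILE "wdset.h5"

int main() {
   hid_t file_id, dataset, parmset;
   herr_t status;
   double data[24];
   int parm;

   /* Open the existing file read-only. */
   file_id = H5Fopen(FILE, H5F_ACC_RDONLY, H5P_DEFAULT);
   H5REPORT(file_id);

   /* Read the scalar parameter. */
   parmset = H5Dopen(file_id, "/parm");
   status = H5Dread
     (parmset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, &parm);
   H5REPORT(status);
   printf("parameter value: %d\n", parm);

   /* Read the whole 4x6 array; element (1,2) is data[1*6+2]. */
   dataset = H5Dopen(file_id, "/dset");
   status = H5Dread
     (dataset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
   H5REPORT(status);
   printf("arbitrary data point [1,2]: %e\n", data[1*6+2]);

   status = H5Dclose(parmset);
   status = H5Dclose(dataset);
   status = H5Fclose(file_id);
   return 0;
}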

%% ./allread
parameter value: 37
arbitrary data point [1,2]: 8.500000e+00
