HDF5/netCDF-4
This section describes additional conventions that are specific to the HDF5 file format.
It also applies to the netCDF-4 format, because netCDF-4 uses HDF5 as its underlying storage format.
The following table shows the mapping between HARP data types and HDF5 data types. The HDF5 backend uses this mapping for writing only.
| HARP data type | HDF5 data type |
|---|---|
| int8 | H5T_NATIVE_SCHAR |
| int16 | H5T_NATIVE_SHORT |
| int32 | H5T_NATIVE_INT |
| float | H5T_NATIVE_FLOAT |
| double | H5T_NATIVE_DOUBLE |
| string | H5T_C_S1 |
The mapping used by the HDF5 backend for reading is shown below. The HDF5 data type interface (H5T) is used to introspect the data type of the variable to be read.
| H5Tget_class() | H5Tget_size() | H5Tget_sign() | H5Tget_native_type() | HARP data type |
|---|---|---|---|---|
| H5T_INTEGER | 1 | H5T_SGN_2 | | int8 |
| H5T_INTEGER | 2 | H5T_SGN_2 | | int16 |
| H5T_INTEGER | 4 | H5T_SGN_2 | | int32 |
| H5T_FLOAT | | | H5T_NATIVE_FLOAT | float |
| H5T_FLOAT | | | H5T_NATIVE_DOUBLE | double |
| H5T_STRING | | | | string |
HDF5 data types not covered in this table are not supported by HARP.
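The read mapping can be sketched as a small lookup function. This is an illustrative Python sketch, not the HARP implementation: the function name `harp_type_from_hdf5` is hypothetical, and the HDF5 constants are passed as plain strings instead of being obtained from the HDF5 library.

```python
from typing import Optional


def harp_type_from_hdf5(type_class: str,
                        size: Optional[int] = None,
                        sign: Optional[str] = None,
                        native_type: Optional[str] = None) -> str:
    """Return the HARP data type name for an introspected HDF5 data type.

    The arguments stand in for the results of H5Tget_class(), H5Tget_size(),
    H5Tget_sign() and H5Tget_native_type(), given as constant names (strings).
    """
    if type_class == "H5T_INTEGER":
        # Only signed (two's complement) integers of 1, 2 or 4 bytes are mapped.
        if sign == "H5T_SGN_2":
            by_size = {1: "int8", 2: "int16", 4: "int32"}
            if size in by_size:
                return by_size[size]
    elif type_class == "H5T_FLOAT":
        # Floating point types are distinguished by their native type.
        by_native = {"H5T_NATIVE_FLOAT": "float", "H5T_NATIVE_DOUBLE": "double"}
        if native_type in by_native:
            return by_native[native_type]
    elif type_class == "H5T_STRING":
        return "string"
    # Anything not covered by the table is unsupported.
    raise ValueError("HDF5 data type not supported by HARP")
```

For example, `harp_type_from_hdf5("H5T_INTEGER", size=2, sign="H5T_SGN_2")` returns `"int16"`, while an unsigned or 8-byte integer raises `ValueError`.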
In the HDF5 data model there is no concept of shared dimensions (unlike netCDF). The shape of an HDF5 dataset is specified as a list of dimension lengths. However, the netCDF-4 library uses HDF5 as its storage backend. It represents shared dimensions using HDF5 dimension scales.
Dimension scales were introduced in HDF5 version 1.8.0. A dimension scale is a special dataset that can be attached to one or more dimensions of other datasets. Multiple dimension scales can be attached to a single dimension, and the length of the dimension scale does not have to be the same as the length of the dimension it is attached to. There are no limitations on the shape or dimensionality of a dimension scale, since it is just a dataset with particular attributes attached.
To represent shared dimensions, netCDF-4 creates a dimension scale for each shared dimension and attaches these dimension
scales to the corresponding dimensions of all variables. If a product contains a one-dimensional variable with the same
name as a shared dimension, and the single dimension of the variable also matches that shared dimension, then the dataset
containing the values of the variable will be used as the dimension scale. Such a variable is called a
coordinate variable in netCDF-4. If a variable has the same name as the dimension, but the variable is not
one-dimensional or its single dimension does not match the dimension, then the variable will be stored
with a `_nc4_non_coord_` name prefix in the HDF5 file (this is identical to how netCDF-4 deals with this case).
For shared dimensions for which a variable with the same name does not exist, a stub dataset containing fill values is
created and used as the dimension scale. The optional `NAME` attribute of the dimension scale is set to
`This is a netCDF dimension but not a netCDF variable.`, which causes the netCDF-4 library to hide the stub dataset
from the user. For more information about the netCDF-4 format, see the NetCDF User’s Guide.
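The naming and dimension scale rules described above can be sketched as follows. This is a hedged illustration in plain Python, without any HDF5 calls: the function `plan_datasets` and its return format are hypothetical. It also assumes that a stub dataset is needed when a variable shares its name with a dimension but is not a coordinate variable, which the text above does not state explicitly.

```python
NON_COORD_PREFIX = "_nc4_non_coord_"


def plan_datasets(shared_dims, variables):
    """Decide dataset names and dimension scale sources.

    shared_dims: list of shared dimension names.
    variables: dict mapping a variable name to a tuple of its dimension names.
    Returns (dataset_names, scales): dataset_names maps each variable to the
    name of its HDF5 dataset, scales maps each shared dimension to either
    "coordinate variable" or "stub".
    """
    dataset_names = {}
    for name, dims in variables.items():
        # A variable named after a dimension that is not a coordinate
        # variable gets the prefix.
        clashes = name in shared_dims and tuple(dims) != (name,)
        dataset_names[name] = NON_COORD_PREFIX + name if clashes else name

    scales = {}
    for dim in shared_dims:
        dims = variables.get(dim)
        # A coordinate variable is one-dimensional and uses the dimension
        # that carries its own name.
        is_coordinate = dims is not None and tuple(dims) == (dim,)
        # Assumption: every other case falls back to a stub dataset.
        scales[dim] = "coordinate variable" if is_coordinate else "stub"
    return dataset_names, scales
```

With shared dimensions `time` and `vertical`, a variable `time` with dimensions `("time",)` acts as the dimension scale itself, whereas a variable `vertical` with dimensions `("time", "vertical")` would be stored as `_nc4_non_coord_vertical`.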
The HDF5 file format conventions used by HARP are designed to be compatible with netCDF-4. Like netCDF-4, HARP uses
dimension scales to represent shared dimensions. For independent dimensions, the same approach is used as for the
netCDF-3 backend. For each unique dimension length `L`, a dimension scale named `independent_L` is created.
To summarize, HARP dimension types are mapped to HDF5 dimension scales as follows:
| HARP dimension type | HDF5 dimension scale |
|---|---|
| time | time |
| latitude | latitude |
| longitude | longitude |
| vertical | vertical |
| spectral | spectral |
| independent | independent_<length> |
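As a minimal sketch of this mapping, the dimension scale name can be derived from the dimension type and length. The helper name `dimension_scale_name` is hypothetical and is not part of the HARP API.

```python
SHARED_DIMENSION_TYPES = ("time", "latitude", "longitude", "vertical", "spectral")


def dimension_scale_name(dimension_type: str, length: int) -> str:
    """Return the HDF5 dimension scale name for a HARP dimension."""
    if dimension_type in SHARED_DIMENSION_TYPES:
        # Shared dimensions use the dimension type itself as the scale name.
        return dimension_type
    if dimension_type == "independent":
        # One scale per unique length, so the length is part of the name.
        return f"independent_{length}"
    raise ValueError(f"unknown HARP dimension type: {dimension_type!r}")
```

Two independent dimensions of the same length therefore share a single dimension scale, for example `independent_2` for every independent dimension of length 2.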
The `_nc3_strict` attribute is attached to the root group of the HDF5 file so that the netCDF-4 library interprets the file using the
netCDF classic data model. Enhanced features of netCDF-4 beyond the classic data model, such as
groups and user-defined types, are not supported by HARP.
HDF5 can represent strings in several ways: both fixed-length and variable-length strings are supported. The HDF5 backend stores a HARP variable of type string as an HDF5 dataset of fixed-length strings. The fixed string length equals the length of the longest string, or 1 if the longest string is empty. Shorter strings are padded with null characters.
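The length and padding rule can be illustrated with a short Python sketch. The function name `to_fixed_length` is hypothetical, and the ASCII encoding is an assumption made only for this example.

```python
def to_fixed_length(strings, encoding="ascii"):
    """Pad strings to a common byte length, as done for fixed-length storage.

    Returns (width, padded), where width is the fixed string length and
    padded is a list of byte strings of exactly that length.
    """
    encoded = [s.encode(encoding) for s in strings]
    # Length of the longest string, but never less than 1.
    width = max((len(b) for b in encoded), default=0)
    width = max(width, 1)
    # Shorter strings are padded on the right with null bytes.
    padded = [b.ljust(width, b"\0") for b in encoded]
    return width, padded
```

For the input `["ab", "", "abcd"]` the fixed length is 4 and `"ab"` is stored as `b"ab\x00\x00"`. If every string is empty, the fixed length is 1 and each element is a single null byte.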
HARP uses empty strings to represent the unit of dimensionless quantities (to distinguish them from non-quantities,
which will lack a unit attribute). However, HDF5 cannot store string attributes with length zero. For this reason
an empty unit string will be written as a `units` attribute with value `"1"` when writing data to HDF5.
When reading from HDF5 a unit string value `"1"` will be converted back again to an empty unit string.
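The unit conversion on write and read can be sketched as a pair of helper functions. Both function names are hypothetical, and `None` is used here as an assumed stand-in for "no unit attribute present".

```python
def unit_to_hdf5(unit):
    """Convert a HARP unit to the value of the HDF5 units attribute."""
    if unit is None:
        # Non-quantities have no unit attribute at all.
        return None
    # HDF5 cannot store zero-length string attributes, so use "1" instead.
    return "1" if unit == "" else unit


def unit_from_hdf5(units_attribute):
    """Convert the value of the HDF5 units attribute back to a HARP unit."""
    if units_attribute is None:
        return None
    # "1" is read back as the empty (dimensionless) unit string.
    return "" if units_attribute == "1" else units_attribute
```

One consequence of these rules is that a unit written literally as `"1"` is indistinguishable from an empty unit after a round trip: both are read back as an empty unit string.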
Note that even though the `time` dimension is conceptually considered appendable, this dimension is not stored as an
actual appendable dimension in HDF5. Products are read/written from/to files in full and are only modified in memory.
The appendable aspect is only relevant for tools such as plotting routines that combine the data from a series of HARP
products in order to provide plots/statistics for a whole dataset (and thus, where data from different files will have
to be concatenated).