Categorical Variables
HARP has special support for categorical variables, which are variables that can be represented by an enumeration type.
Categorical variables are represented in HARP using integer data types that provide an index into a fixed list of
enumeration labels. These enumeration labels are stored as metadata together with a variable. For int8
data the
labels are stored as a flag_meanings
attribute (together with a flag_values
attribute) in netCDF3, HDF4, and
HDF5 files.
HARP does not impose a rule on which integer value maps to a specific enumeration label. The integer values are just plain indices into the enumeration label list, and the list can be sorted in any way.
Any integer value in a categorical variable that is outside the range of the available enumeration labels
(value < 0
or value >= N
) will be treated as an invalid value (valid_min
and valid_max
should therefore
always be set to 0
and N-1
respectively for these variables). Invalid values are represented by an empty string
as label.
Operations in HARP on categorical variables will need to primarily use the enumeration label names (and not the integer values). Any standardization (such as for the variable names) on the categories for categorical variables in HARP will thus only be on enumeration names and not on their numbers.
In HARP we distinguish category types and category values. The name of the category type <cattype>
is used in the
variable name and the names of the category values <catval>
are the enumeration labels. Variables for category
types should generally be called <cattype>_type
. For instance, if we want to classify clouds we would have
<cattype>=cloud
and a variable name cloud_type
, and for land use classification we would have land_type
.
The enumeration labels for a variable should ideally be fully distinct categories. This means that if you have land
types A
, B
, and C
, you should try to avoid also having land types for A_and_B
, B_and_C
,
A_and_B_and_C
, etc.
A <cattype>_type
variable should ideally indicate just a single dominant applicable category.
To deal with mixed presence of multiple categories one can use variables such as <catval>_flag
and
<catval>_fraction
.
For instance, for a land_type
variable we could have forest
as enumeration label, but then also have variables
forest_flag
and forest_fraction
.
A <catval>_flag
variable in HARP should always be a binary/dichotomous variable (it should contain true/false, 1/0
values). A flag variable denotes whether something is present/applicable/true or not. A flag variable will therefore
generally always have an int8
data type and is not treated as categorical variable (i.e. it will not have
enumeration labels). Important to note is that <*>_flag
variables in HARP should not be used as bitfield variables
(unlike the <*>_validity
variables).
A <catval>_fraction
variable can be used to define the fraction of the area that has the given <catval>
classification. This fraction is a floating point value between 0 and 1.
Note that many other data product formats use bitfields to indicate applicability of multiple classification categories in order to save storage space. However, the aim of HARP is not to provide a compact data format, but to have data variables that can be readily used by algorithms. For this reason, combining multiple information elements in a single variable is discouraged within the HARP conventions.