
3.3 CEOP-AEGIS Data Interface

3.3.1 Intermediate data model

The required metadata, that is, structured additional information about the data (documented in section 3.2.3 on page 89), must be fully available so that NetCDF files compliant with the CEOP-AEGIS Data Model can be generated from it. The intermediate data model was developed out of the need to easily modify and complete the metadata of input data provided by project partners. Post-processing is often necessary because the input is heterogeneous not only in the content and organization of data and metadata, but also in the file formats provided, as mentioned in the problem statement (see section 1.2 on page 5). Converting files into this intermediate data model first separates data and metadata elements from each other, so that they can easily be standardized to conform to the requirements of the CEOP-AEGIS Data Portal (see section 3.2 on page 71). The development of the intermediate data model therefore focused on a data and metadata representation whose elements are easy to post-process, so that standardized NetCDF files can be generated in a simple way.

Data in the intermediate data model is held in three components that must all be present and consistent with each other so that NetCDF files can be generated from them. Together, these three components form a unit and represent one single data file. They are summarized in table 3.2 on page 95 and must all share the same filename, except for the filename extension.

Format        Filename         Description
XML (NcML)    *__ncml.xml      NetCDF metadata information
numpy         *__data.npy      Multi-dimensional numerical data
XML           *__coords.xml    Coordinate metadata information

Table 3.2: Components of the intermediate data model
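The naming rule of table 3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `component_paths` and `missing_components` and the base name used in the usage note are hypothetical and not part of the CEOP-AEGIS software; only the three filename extensions are taken from the table.

```python
from pathlib import Path

# Filename extensions from table 3.2; everything else here is illustrative.
COMPONENT_SUFFIXES = ("__ncml.xml", "__data.npy", "__coords.xml")


def component_paths(directory, base_name):
    """Return the three component paths expected for one data file."""
    directory = Path(directory)
    return [directory / f"{base_name}{suffix}" for suffix in COMPONENT_SUFFIXES]


def missing_components(directory, base_name):
    """Return the expected component paths that do not exist on disk."""
    return [p for p in component_paths(directory, base_name) if not p.is_file()]
```

For a hypothetical base name such as `lst_2009`, `component_paths(".", "lst_2009")` yields the three paths that must exist together, and an empty result from `missing_components` indicates that the unit is complete.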

NetCDF metadata information

Simple metadata post-processing is achieved within this data interface by storing the NetCDF metadata in NcML, an XML representation of NetCDF metadata that was described in detail in subsection 2.1.11 on page 35. Using the XML dialect NcML addresses the problem of heterogeneity because its structured markup separates the description from the content of the data. Data stored in XML is readable by both machines and humans, and its elements are easy to modify. The filename extension of the NcML metadata component of the intermediate data model must be *__ncml.xml. Listing A.3 on page 136 is an NcML representation of a NetCDF file containing gridded data, and listing A.4 on page 137 is the NcML representation of a NetCDF file containing in-situ data.

Multi-dimensional numerical data

NetCDF variables are implemented as multidimensional arrays of arbitrary rank. Such arrays can also be implemented by other means than NetCDF: Rew et al. (2010) explain that arrays of a programming language can be employed to hold the data of NetCDF variables. In order to separate the data content from the descriptive metadata information in the intermediate data model, the file format of the Python package numpy was chosen to store the content of data variables. With numpy, multidimensional array objects can be created as containers for numerical data, and linear algebra can be applied to these data arrays. Moreover, a strict separation of data and metadata components is achieved, and the data elements stored in a numpy array can easily be modified through the extensive numpy Python API. Within the CEOP-AEGIS Data Interface, the filename extension of the numpy data part must be *__data.npy.
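As a minimal sketch of how such a data component can be written, read back and modified, the following example uses `numpy.save` and `numpy.load`. The base name `example`, the array contents and the fill value of -9999 are assumptions made for illustration; they are not taken from the project data.

```python
import tempfile
from pathlib import Path

import numpy as np

FILL_VALUE = -9999.0  # assumed missing-data marker, for illustration only

# Small rank-3 array standing in for real measurement data.
data = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
data[0, 0, 0] = FILL_VALUE

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example__data.npy"  # hypothetical base name "example"
    np.save(path, data)        # write the array in the .npy format
    restored = np.load(path)   # read it back fully into memory

# Post-processing step: replace the assumed fill value with NaN.
restored[restored == FILL_VALUE] = np.nan
```

The array keeps its shape and data type across the round trip, so the data can be edited independently of the metadata files.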

The order of the dimensions is crucial for multi-dimensional data and depends on the storage order of the programming language used. A distinction must be made between column-major order, used for example by Fortran, and row-major order, used for instance by C, Java and Python. As a result, the outermost (slowest-varying) dimension is the first dimension in C, Java and Python (usually the time dimension), but the last dimension in Fortran. By definition in NetCDF, the innermost axis (the last axis in C, Java and Python) runs from left to right (longitude), the last-but-one axis from top to bottom (latitude in reversed order), and all higher-level axes are likewise ordered from top to bottom, with an empty line between the datasets of the next lower dimension. The origin of the coordinate system is the upper left corner. With an index iterator, specific array elements can be accessed sequentially in logical row-major order (see Caron, 2004, 2009, 2011; Hartnett, 2009; Rew et al., 2010).
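The difference between the two storage orders can be made visible with numpy, which uses row-major (C) order by default and can also hold an array in column-major (Fortran) order. The following sketch uses an arbitrary 2 x 3 array chosen for illustration; `numpy.ndindex` serves here as one possible index iterator and is not necessarily the iterator used by the application.

```python
import numpy as np

# 2 x 3 array of 4-byte integers; numpy stores it in row-major order by default.
c_array = np.arange(6, dtype=np.int32).reshape(2, 3)

# Same logical content, but laid out in column-major (Fortran) order in memory.
f_array = np.asfortranarray(c_array)

# Strides give the byte step per axis: the last axis varies fastest in C order,
# the first axis varies fastest in Fortran order.
c_strides = c_array.strides
f_strides = f_array.strides

# Flattening the same array in both orders shows the different element sequence.
row_major = c_array.ravel(order="C")
col_major = c_array.ravel(order="F")

# An index iterator that visits all elements in logical row-major order.
index_order = list(np.ndindex(c_array.shape))
```

Both arrays compare equal element by element; only the sequence in which the elements lie in memory differs, which is why the dimension order must be fixed when data is exchanged between languages.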

A numpy data array as part of the CEOP-AEGIS Data Interface must generally consist of five dimensions in the order [Variable;T;Z;Y;X]. The index values of the first dimension identify the data variables, followed by the temporal dimension at the second position, and finally the spatial dimensions height, latitude and longitude. In-situ data from ground observation measurements have constant spatial coordinate values, but varying temporal coordinate values on which one or more measured variables depend (see subsection 3.1.2 on page 71). To meet these needs, the numpy data array can also be two-dimensional with the order [T;Variable].
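The two array layouts can be sketched as follows. The dimension lengths are arbitrary values chosen for illustration, and the arrays are filled with zeros instead of real measurements.

```python
import numpy as np

# Illustrative dimension lengths (assumptions, not project values).
n_var, n_t, n_z, n_y, n_x = 2, 3, 1, 4, 5

# Gridded data: five dimensions in the order [Variable;T;Z;Y;X].
gridded = np.zeros((n_var, n_t, n_z, n_y, n_x), dtype=np.float32)

# First variable, first time step, first height level: a [Y;X] grid.
grid_slice = gridded[0, 0, 0]

# In-situ data: two dimensions in the order [T;Variable].
in_situ = np.zeros((n_t, n_var), dtype=np.float32)

# Time series of the second variable.
time_series = in_situ[:, 1]
```

Note that the variable axis comes first in the gridded layout but last in the in-situ layout, so code that handles both cases has to select the variable along a different axis.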

Array sections are specified with zero-based indices, and a range of index values selects a subset of an array. The extent of the array must correspond to the shape of the related NetCDF variable. Consequently, the length of the variable dimension of the numpy data array must equal the number of variables defined in the NcML metadata file. The same condition applies to the number of temporal and spatial coordinate values provided in the coordinate metadata. If this constraint is violated, the intermediate data model is inconsistent.
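A check of this constraint for the gridded case could look like the following sketch. The function name `find_shape_mismatches` and the dictionary keys are hypothetical; the counts are assumed to have been read from the NcML and coordinate metadata files beforehand.

```python
def find_shape_mismatches(data_shape, n_variables, coord_counts):
    """Compare a [Variable;T;Z;Y;X] array shape with the metadata counts.

    data_shape   -- shape tuple of the numpy data array
    n_variables  -- number of variables listed in the NcML file
    coord_counts -- number of coordinate values per axis, keyed "T", "Z", "Y", "X"

    Returns a list of problem descriptions; an empty list means consistent.
    """
    axes = ("Variable", "T", "Z", "Y", "X")
    if len(data_shape) != len(axes):
        return [f"expected {len(axes)} dimensions, found {len(data_shape)}"]
    expected = (n_variables,) + tuple(coord_counts[axis] for axis in axes[1:])
    return [
        f"axis {axis}: array has {actual}, metadata has {wanted}"
        for axis, actual, wanted in zip(axes, data_shape, expected)
        if actual != wanted
    ]
```

Returning all mismatches at once, rather than stopping at the first one, makes it easier to repair several inconsistent components in a single pass.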

NetCDF record variables as well as fixed-length variables appear in the order in which they were defined, and an increasing numerical index determines the variable ID. The variable order in the numpy data array must therefore be the same as the order of the variables in the NcML metadata file. Since the data in the numpy data array and the NetCDF metadata in the NcML file are read and written in the same order, the first variable entry in the NcML file must correspond to the first dataset in the numpy data array (index = 0), and the last variable entry in the NcML file to the last dataset in the numpy data array. If this constraint is violated, data values are assigned to the wrong variable.
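The positional link between NcML entries and array indices can be illustrated with Python's standard XML parser. The NcML fragment below is a hypothetical, heavily shortened document and not one of the listings in the appendix; the sketch also assumes that every `variable` element at the top level describes a data variable.

```python
import xml.etree.ElementTree as ET

import numpy as np

# Namespace of NcML 2.2 documents.
NCML_NS = {"ncml": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# Hypothetical, shortened NcML content with two data variables.
NCML_TEXT = """<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <variable name="air_temperature" type="float"/>
  <variable name="wind_speed" type="float"/>
</netcdf>"""


def variable_names(ncml_text):
    """Return the variable names in the order in which they are defined."""
    root = ET.fromstring(ncml_text)
    return [v.get("name") for v in root.findall("ncml:variable", NCML_NS)]


names = variable_names(NCML_TEXT)

# Gridded array [Variable;T;Z;Y;X]; index 1 belongs to the second NcML entry.
data = np.zeros((len(names), 3, 1, 4, 5), dtype=np.float32)
data[1] = 7.5

# Pair each variable name with its dataset purely by position.
by_name = {name: data[index] for index, name in enumerate(names)}
```

Because the pairing is purely positional, reordering the entries in the NcML file without reordering the array silently relabels the data, which is the failure described above.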

Coordinate metadata information

Coordinate metadata information is necessary to describe the index values of the multidimensional numpy array that stores the data values. Each running index value has a corresponding coordinate value, and this coordinate information is provided by an XML metadata file that follows an internal standardized schema. In NetCDF terms, it stores the content of the coordinate variables. Within the intermediate data model, its filename extension is *__coords.xml.

If both the min and max coordinate values are present, they are read in preference to the individual coordinate values stored in the values attribute. From the minimum and maximum coordinate values, the application calculates equally spaced intermediate coordinate values so that the number of coordinate values matches the length of the corresponding dimension of the numpy data array. Single coordinate values, or multiple coordinate values that are not equally spaced, can be stored in the values attribute. The number of coordinate values in this attribute must, however, equal the length of the corresponding dimension of the numpy data array. For in-situ data, a unique 32-bit integer identifier must additionally be provided for the scalar coordinate variable _id. For gridded data, only the coordinate information of the temporal axis and the three spatial axes must be fully present in this coordinate metadata file. Listing A.5 on page 138 shows a typical XML file storing coordinate metadata information for gridded data, and listing A.6 on page 138 shows a similar file containing typical in-situ metadata information.
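The precedence rule for min/max over explicitly listed values can be sketched as follows. The function `resolve_coordinates` and its parameters are hypothetical and operate on values that have already been read from the coordinate metadata file. It is also an assumption of this sketch that both end points are included in the equally spaced result, as `numpy.linspace` does by default; the application may space the values differently.

```python
import numpy as np


def resolve_coordinates(n, minimum=None, maximum=None, values=None):
    """Return n coordinate values for one array dimension.

    If both minimum and maximum are given, they take precedence and n equally
    spaced values (end points included, an assumption) are computed.
    Otherwise the explicitly listed values are used and must number exactly n.
    """
    if minimum is not None and maximum is not None:
        return np.linspace(minimum, maximum, num=n)
    if values is None:
        raise ValueError("neither min/max nor explicit values were provided")
    coords = np.asarray(values, dtype=float)
    if coords.size != n:
        raise ValueError(f"expected {n} coordinate values, found {coords.size}")
    return coords
```

For example, `resolve_coordinates(5, minimum=0.0, maximum=1.0)` produces five values with a spacing of 0.25, while `resolve_coordinates(3, values=[1, 2, 5])` returns the irregularly spaced values unchanged.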