An XML Based Marine Data Management Framework
Paul Sliogeris
Australian Oceanographic Data Centre.
Maritime Headquarters, Wylde St, Potts Point, NSW, 2011, Australia. http://www.aodc.gov.au
Abstract
This document describes how XML can be used to encode the marine data an agency is responsible for, in support of end-to-end (whole-of-life) data management. MarineXML provides a strong foundation for a standardised data framework that supports the creation of generic marine data management applications. The development and application of an XML-based data management framework at the Australian Oceanographic Data Centre is outlined here.
Table of Contents
1 Introduction
2 Establishing A Framework
3 A Case Study: AODC
3.1 MarineXML Generator
3.2 MarineQC
3.3 MarineDB
3.4 MarineXML Converter
4 Conclusions
List of Figures
Figure 1 – Simplified MarineXML Structure
Figure 2 – MarineXML Generator
Figure 3 – MarineQC
Figure 4 – MarineDB data query
Figure 5 – MarineDB data visualisation
1 Introduction
When the extensible markup language (XML) was first touted around the Australian Oceanographic Data Centre (AODC), those involved in data management groaned at the thought, sarcastically mumbling “Great, another format to deal with!” Interestingly, as those initial months unfolded, it became clear that, thanks to XML’s inherent extensibility, any existing marine data format we held could in fact be encapsulated in an application of XML coined MarineXML [1].
What then became increasingly apparent was that if XML could encode all our data then we needed to rethink the way our data management and information systems breathe data. If XML was to become our internal standard format then there was no longer any need to write conversion scripts to get differing data formats into the quality checking or viewing software, to code a multitude of unique parsers to extract metadata, or to struggle to fuse data sets together. Instead of reformatting data to fit our systems, our systems would be redesigned to work with MarineXML.
We already had some experience using XML to encode metadata from various data sets we held and knew about its popularity as a vehicle for data exchange. However, our focus wasn’t to use XML as an exchange format. After all, even now, without a common document type definition (DTD) that all agencies adhere to, who is there to exchange data with in this format? Our vision was to lock in a DTD that works for us, with the idea of translating it with XSLT to whatever DTD the good people of the ICES-IODE SGXML decide on.
Why did we believe MarineXML to be so beneficial as to warrant the massive task of rebuilding our applications? It was more than the generic benefits of XML being self-describing, object-oriented and easy to reformat through extensible stylesheet language transformation (XSLT). It was specifically how XML could benefit our agency through whole-of-life data management and by forming the basis of a marine data management framework.
2 Establishing A Framework
In determining how to establish an XML framework for data management, it is important to first recognise the boundaries of the framework within your agency. Commonly these boundaries are set by the data the agency is directly responsible for or is the primary custodian of.
[1] More information on MarineXML is available at http://www.aodc.gov.au > Products > MarineXML
To clarify, an agency acquires data, adds value to that data, perhaps fuses the data with other data of its type, and produces data and information products meaningful to its users. Sometimes the data that is acquired can be passed directly to the end user with very little intervention, thereby bypassing the framework. Examples of this type of “processed” data include data products on CD, online map services, or complete data sets from other agencies. Other data may need to be managed more intensively, applying functions such as archiving, quality assessment, editing, extraction and merging of valuable data. This data is often referred to as “raw” data, commonly in the form of paper logsheets, outputs from sensors and instruments, satellite feeds or digitised data. It is to this realm that the majority of data management systems apply, and hence it forms the boundaries of our framework.
XML is often described as building blocks that house data, which is indeed very apt: XML can not only accommodate the data but also includes rooms (or elements) that can be designed to accommodate the data’s metadata, quality assessment processes, lookup tables, units and edit history (see Figure 1 below).
Figure 1 - Simplified MarineXML Structure (see how self describing XML is!)
XML can store all inputs from the data management systems, providing two huge benefits:
· Whole-of-life data management
XML encapsulation enables the preservation of all original data, metadata, units and lookup tables, quality flags and a history of edited values in the one file. This simplifies data management by avoiding the perils of creating and managing multiple files.
· Basis of a marine data management framework
With data in a common format, XML provides a standardised data framework that can support the creation of generic software.
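To make the "one file" idea concrete, here is a minimal sketch in Python of a MarineXML-like document that carries data, metadata, units, a quality flag and an edit history together. Only the <MarineDataSet> element is named in this paper; every other element and attribute name, and every value, is a hypothetical placeholder rather than the actual MarineXML DTD.

```python
import xml.etree.ElementTree as ET


def build_example_dataset():
    """Build a small MarineXML-like document.

    All names except MarineDataSet are hypothetical, and all values
    are made up for illustration.
    """
    root = ET.Element("MarineDataSet", name="XBT")

    # Metadata travels inside the same file as the data.
    metadata = ET.SubElement(root, "Metadata")
    ET.SubElement(metadata, "CruiseID").text = "EXAMPLE-001"

    # Units are declared once and referred to by parameter name.
    units = ET.SubElement(root, "Units")
    ET.SubElement(units, "Unit", parameter="temperature", symbol="degC")

    # One observation whose value carries a quality flag.
    obs = ET.SubElement(root, "Observation",
                        lat="-33.86", lon="151.23",
                        time="2003-05-01T02:15:00Z")
    value = ET.SubElement(obs, "Value", parameter="temperature",
                          depth="10", flag="corrected")
    value.text = "18.4"

    # The edit history keeps the original value next to the correction.
    history = ET.SubElement(root, "History")
    ET.SubElement(history, "Edit", parameter="temperature",
                  original="81.4", corrected="18.4", editor="example_user")
    return root


if __name__ == "__main__":
    print(ET.tostring(build_example_dataset(), encoding="unicode"))
```

Because everything sits under one root element, a generic tool only needs to know the element names to find the data, its units and its history; it does not need a separate parser for each original format.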
With the boundaries and the language of the framework defined, there remain only the processes and tools that lie within the framework. There are four main components within the framework that support the end-to-end data management of marine environmental data:
· Encapsulation
All data within the framework must exist as XML and be valid against a common DTD. Hopefully the data your agency is responsible for is reasonably consistent in terms of acquisition and format, because a converter needs to be created for each format to encapsulate it into MarineXML. For data that is digitised from hardcopy, a direct encoding utility can be created. Server-side scripting languages do this task well. Basic metadata about the data set should also be included at this stage.
· Value Adding
The application and recording of quality assurance procedures to data and in general adding value to data is central to any agency’s data management responsibilities. These applications should write into MarineXML, populating its elements by recording tests performed and their results, history of edits, when made and by whom and of course adding more information to the data’s metadata. Because of data format commonality, the QC process can now be split into a common spatiotemporal component and unique modules specifically designed to check the differing aspatial attributes of the data sets.
· Search & Query
A compulsory requirement in effective data management is the ability to view the spatial distribution of data, including the ability to query temporally and by aspatial attributes. XML works well with web-based technologies, and a browser search and query tool with an interactive scalable vector graphics (SVG) display has been proven to work effectively.
· Export
Just as an import converter is required to get data into XML and the framework, a simple export utility is helpful at the exit end of the framework to write the data out in other formats (comma-delimited format provides the widest accessibility). An XML parser can read through the elements of any XML file and list all the attributes it finds, so that a user can select which fields are to be written out and in what order.
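The Value Adding component above can be sketched as follows: a pair of helper functions that write a quality test result and an edit into the document itself, so the original value is never lost. This is only an illustration in Python. The element names (QualityTests, Test, History, Edit, Value), the attribute names and the flag value "corrected" are assumptions and are not taken from the MarineXML DTD.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def _child(parent, tag):
    """Return the first child with this tag, creating it if absent."""
    found = parent.find(tag)
    return found if found is not None else ET.SubElement(parent, tag)


def record_test(dataset, name, passed):
    """Record that a named quality test was run and whether it passed."""
    tests = _child(dataset, "QualityTests")
    return ET.SubElement(tests, "Test", name=name,
                         result="pass" if passed else "fail")


def record_edit(dataset, value_elem, new_text, editor, when=None):
    """Change a value but keep the original in a History entry."""
    when = when or datetime.now(timezone.utc).isoformat()
    history = _child(dataset, "History")
    ET.SubElement(history, "Edit", {
        "parameter": value_elem.get("parameter", ""),
        "original": value_elem.text or "",
        "corrected": new_text,
        "editor": editor,
        "when": when,
    })
    value_elem.text = new_text
    value_elem.set("flag", "corrected")
```

A QC tool built this way never overwrites information: the corrected value replaces the visible one, while the original, the editor and the time of the edit remain in the same file.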
3 A Case Study: AODC
To better illustrate the ideas discussed here, the framework as it exists at the AODC will be described in some detail, focussing on the marine data environment as it applies to the Centre’s data management responsibilities and, in particular, on the components of the data management information systems.
Taking a holistic approach to how the marine data environment relates to procedures within the AODC, it is best to keep in mind the three primary components: end users, data acquisition and data management.
· End Users
As well as operating as the National Oceanographic Data Centre for Australia in various national and international oceanographic data collection, exchange and management programs, the AODC functions as Australia’s Defence Oceanographic Data Centre. Our main output is meteorological and oceanographic (METOC) information products, that is, graphical representations of fused “best of breed” data sets. To achieve this the AODC operates an enterprise Geographic Information System (GIS) incorporating an online Internet Mapping Service (IMS). The GIS fuses products from processed data, such as CD-ROM atlases, with the outputs of the raw data that the AODC is responsible for and manages within the framework.
· Data Acquisition
Besides acquiring already processed data sets, the AODC is responsible for the end-to-end data management of Royal Australian Navy (RAN) METOC data. It is this data that is put through the XML-based marine data management framework. At the time of writing, that data comprises digitised spreadsheets of bioluminescence and seabed sample observations, Secchi disk and sound speed measurements, marine meteorological observations, ASCII and binary data from expendable bathythermograph (XBT) probes, and digitised paper traces of temperature profiles.
· Data Management
The majority of data management at the AODC operates within the framework on the RAN-collected METOC data outlined above. The core data management applications are described below.
3.1 MarineXML Generator
MarineXML Generator (Figure 2) encodes the known formats of the RAN collected METOC data as listed above into MarineXML. Its function is to wrap up multiple or single, binary or ASCII input files into one XML file, all within the parent element <MarineDataSet>. Some basic collector’s metadata can also be included within the MarineDataSet.
Figure 2 – MarineXML Generator application about to create an XML file from XBT data.
An encoding module has been created for each known data type format to be encoded. Each module is activated based on the file extension of the acquired data file. A very basic error checking function reports if the format differs from what is expected for that data type or if compulsory fields such as position are blank.
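The dispatch-by-extension design described above can be sketched in Python as a lookup table from file extension to encoding module. The ".sec" extension, the four-field comma-delimited layout, the "SecchiDisk" data set name and the Observation/Value element names are all invented for this illustration; the real Generator's formats and module names are not given in this paper.

```python
import os
import xml.etree.ElementTree as ET


def encode_secchi(lines):
    """Hypothetical encoder: each line is 'lat,lon,time,depth'."""
    observations = []
    for number, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.split(",")]
        # Basic format check: report when the layout is not as expected.
        if len(fields) != 4:
            raise ValueError(
                f"line {number}: expected 4 fields, found {len(fields)}")
        lat, lon, time, depth = fields
        # Compulsory fields such as position must not be blank.
        if not lat or not lon:
            raise ValueError(f"line {number}: position is blank")
        obs = ET.Element("Observation", lat=lat, lon=lon, time=time)
        ET.SubElement(obs, "Value", parameter="secchi_depth").text = depth
        observations.append(obs)
    return observations


# File extension -> (data set name, encoding module). Assumed mapping.
ENCODERS = {".sec": ("SecchiDisk", encode_secchi)}


def generate(filename, lines):
    """Wrap the contents of one input file in a MarineDataSet element."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ENCODERS:
        raise ValueError(f"no encoding module for extension {extension!r}")
    name, encoder = ENCODERS[extension]
    root = ET.Element("MarineDataSet", name=name)
    root.extend(encoder(lines))
    return root
```

Supporting a new input format then means writing one encoder function and adding one entry to the table; the wrapping logic in generate() stays unchanged.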
3.2 MarineQC
MarineQC (Figure 3) is the flagship application within AODC’s XML-based data management framework. It reads MarineXML and runs a series of algorithms that test for operator or instrument errors within the spatiotemporal or aspatial data. These checks are run automatically upon reading a file and any errors are graphically highlighted for user intervention. The user can agree with or ignore flagged errors, make changes to the data, insert comments and assign releasability and quality caveats to the data. All edits are saved within the MarineXML structure.
Figure 3 – MarineQC application loaded with a marine meteorological MarineXML file
The application is divided into two distinct modules: the left side performs spatiotemporal quality control, checking for incorrect positions (positions on land, impossible ship speeds or possible duplications), and the right side assists the user in verifying the integrity of aspatial attribute data. The aspatial module changes according to the MarineDataSet name attribute that is read from the XML file. This makes the application very extensible, with only a new aspatial module required to QC a new data type.
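The split between a common spatiotemporal check and data-type-specific aspatial modules might look like the Python sketch below. It is not MarineQC's actual algorithm: the 40 knot speed limit, the temperature range and the "XBT" registry key are assumed values chosen for illustration, and the land check is omitted because it needs a coastline data set.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_NAUTICAL_MILE = 1.852
MAX_SPEED_KNOTS = 40.0  # assumed limit for a ship


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two positions (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def check_track(track, max_knots=MAX_SPEED_KNOTS):
    """Common spatiotemporal check.

    track is a list of (datetime, lat, lon) tuples sorted by time.
    Returns (index, reason) pairs for points that need attention.
    """
    problems = []
    for i in range(1, len(track)):
        t0, lat0, lon0 = track[i - 1]
        t1, lat1, lon1 = track[i]
        km = distance_km(lat0, lon0, lat1, lon1)
        hours = (t1 - t0).total_seconds() / 3600.0
        if hours <= 0:
            # Same time stamp: a repeat if the position is also the same.
            reason = "possible duplicate" if km == 0 else "impossible speed"
            problems.append((i, reason))
        elif (km / KM_PER_NAUTICAL_MILE) / hours > max_knots:
            problems.append((i, "impossible speed"))
    return problems


def check_sea_temperature(values):
    """Aspatial module: indices of values outside an assumed range."""
    return [i for i, v in enumerate(values) if not -2.5 <= v <= 40.0]


# MarineDataSet name attribute -> aspatial module (assumed registry).
ASPATIAL_MODULES = {"XBT": check_sea_temperature}


def check_aspatial(dataset_name, values):
    """Pick the aspatial module from the data set's name attribute."""
    return ASPATIAL_MODULES[dataset_name](values)
```

Note that each point is compared only with its predecessor, so a single bad position will usually cause the following good point to be flagged as well; an interactive tool can leave that judgement to the user.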
3.3 MarineDB
To view and query MarineXML encoded files, the MarineDB servlet (Figure 4), built with JavaServer Pages (JSP) technology, has been created for the AODC Intranet. It incorporates an SVG window to define search areas and display the spatial distribution of resulting data. MarineDB supports temporal queries by period or interval and queries on data type, cruise ID, source and all parameters that are present within MarineXML files.
Figure 4 – MarineDB provides visualisation and query functionality to MarineXML files.
MarineDB has been designed to overcome one of the two main problems with XML. Being a hierarchical file format, XML does not support an indexed method of searching through its elements for retrieval. Hence we cheated by squeezing in a MySQL database between the servlet and the XML files to fulfil the role of indexing. Indexing speeds up searches by storing spatial and temporal values in a table with a pointer to the corresponding XML file stored on disk. Thus, to use MarineDB to view and query XML files, the files must first be “registered” with the servlet by passing it the file location. The servlet then extracts the position, time and whatever parameters are present within the XML file and stores them in the indexing table.
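The indexing idea can be sketched with a small relational table. The Python example below uses the standard library's sqlite3 module as a stand-in for MySQL so that it runs without a server; the table layout, the column names and the Observation/Value element names are assumptions, not MarineDB's actual schema. For brevity, register() takes the XML text directly instead of reading the file from disk.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_index(path=":memory:"):
    """Create (or open) the index table. sqlite3 stands in for MySQL."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS obs_index ("
        "file TEXT, lat REAL, lon REAL, time TEXT, parameter TEXT)"
    )
    return db


def register(db, file_path, xml_text):
    """Extract position, time and parameters and store them with a
    pointer (the file path) back to the XML file."""
    root = ET.fromstring(xml_text)
    rows = []
    for obs in root.iter("Observation"):
        lat, lon = float(obs.get("lat")), float(obs.get("lon"))
        for value in obs.iter("Value"):
            rows.append((file_path, lat, lon, obs.get("time"),
                         value.get("parameter")))
    db.executemany("INSERT INTO obs_index VALUES (?, ?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)


def find_files(db, south, north, west, east, start, end):
    """Return the XML files holding data inside a box and time period.

    Times are ISO 8601 strings in one consistent format, so plain
    string comparison orders them correctly.
    """
    cursor = db.execute(
        "SELECT DISTINCT file FROM obs_index "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ? "
        "AND time BETWEEN ? AND ? ORDER BY file",
        (south, north, west, east, start, end),
    )
    return [row[0] for row in cursor]
```

Only the files returned by find_files() then need to be opened and parsed, which is what keeps searches fast even though the XML files themselves carry no index.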
Returned results can subsequently be queried to display the aspatial data, and associated metadata (Figure 5). SVG assists with visualisation by allowing the user to toggle data types on and off or similarly show or hide data that has been flagged as “bad”. Using the onscreen cursor a user can click and select data that can subsequently be viewed and/or exported.
3.4 MarineXML Converter
As has been described, XML works very well inside our framework. Externally, however, it can admittedly be difficult to work with if, for example, a user simply requires observed wave height data in a simple comma-delimited file. Hence MarineXML Converter (Figure 6) has been created to export any valid MarineXML file into a comma-delimited file. Upon adding a file, the application parses the parameters and lists all unique values for the user to select which are required and in what order.
Figure 6 – MarineXML Converter that exports XML into comma delimited format
Various output date and time formats can be set as well as position in decimal degrees or degrees minutes seconds. Original, corrected, suspect and/or erroneous flagged data can be output to send back to data providers for attention. Also if it is known that a specific file format is requested on a regular basis, this format can be coded into the application and will appear in the drop down list “Files of Type”.
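The export step can be sketched in Python in two parts: list the unique parameters found in a file, then write the chosen ones out in the chosen order, optionally restricted to certain quality flags. The Observation and Value element names and the flag attribute are hypothetical stand-ins for whatever the MarineXML DTD actually defines, and date and position reformatting is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET


def list_parameters(root):
    """Unique parameter names in the order they first appear."""
    names = []
    for value in root.iter("Value"):
        name = value.get("parameter")
        if name not in names:
            names.append(name)
    return names


def export_csv(root, columns, flags=None):
    """Write the selected parameters as comma-delimited text.

    columns gives the parameters and their order; flags, if given,
    is the set of quality flags whose values should be output.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time", "lat", "lon"] + list(columns))
    for obs in root.iter("Observation"):
        values = {
            v.get("parameter"): v.text
            for v in obs.iter("Value")
            if flags is None or v.get("flag") in flags
        }
        writer.writerow(
            [obs.get("time"), obs.get("lat"), obs.get("lon")]
            + [values.get(name, "") for name in columns]
        )
    return out.getvalue()
```

Passing flags={"suspect", "erroneous"} (again, assumed flag names) would produce the kind of file that can be sent back to a data provider for attention.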
4 Conclusions
MarineXML has provided a strong basis for a framework that supports the streamlined flow of marine environmental data through data management and has enabled the integration of all data management applications at the AODC. The implementation of the framework significantly changed the way we traditionally viewed data and data management and continues to challenge traditional ideas. One such example is the inherited conception of metadata being data about data created in a file that is held separately from the data. By including metadata within a data set’s MarineXML file, applications like MarineDB can provide traditional metadata search and retrieve functions with the added bonus of revealing the actual data as well if required.
The future for the framework lies in web services and moving from our present achievement of application integration to seamless application integration. Following the development of MarineDB, it became clear that browser-based applications open up our services to a far greater range of users and minimise the number of manual steps now present in the framework. Server-side scripting languages can accomplish most of what the present stand-alone Java applications do with far less overhead.
“God help us if we need to change our MarineXML format” is the now oft-quoted phrase at AODC. Hence the immense importance of thorough planning and reviewing of any considered design of XML. At the AODC we believe that we built enough extensibility into MarineXML to survive into our future and, as they say, “so far, so good”.