An Introduction to XML

The parser has to guess what was meant and so does not always do sensible things with the input. For instance, it considers unquoted attributes within a start-tag as text outside of the start-tag. However, it can be useful in some circumstances.

When working with HTML documents, we can also try to correct these using the HTML Tidy facility. This can be used via a Web browser, or by sending HTML content to the Web interface, or directly within R using the RTidyHTML package [33].

2.3 Examples of XML Grammars

Many government agencies, commercial entities, and scientific research areas make their data available in XML formats. In the commercial arena, Microsoft Office [27], Libre Office [21], iWork [1], and Google Docs [14] use XML in their office suite tools, including spreadsheets and word processing documents. Google and the Open Geospatial Consortium (OGC) developed the Keyhole Markup Language (KML) [11] as a language for describing geo-spatial information that can be rendered interactively using Google Earth [12], Google Maps [13], and Google Sky [15]. The eXtensible HyperText Markup Language (XHTML) [19] is a collection of XML tags for representing Web pages, similar to HTML but using well-formed and structured XML.

Scientists have developed specific grammars suitable for data in their fields. As an example, the Systems Biology Markup Language (SBML, sbml.org) offers a common format for describing biological systems, e.g., metabolic pathways, biochemical reactions, and gene regulation. With SBML, researchers can share the essential aspects of their models independently of the software environment. CellML is another biological markup language that has evolved into a general format to store and exchange computer-based mathematical models of biological processes. The Geography Markup Language (GML) was defined by the Open Geospatial Consortium (OGC) [25] to express geographical features. The Materials Markup Language (MatML) is an XML standard for the exchange of a material’s property information and has been used in various applications, e.g., for contaminant emissions data. The Sloan Digital Sky Survey provides results from Internet database queries in an XML format [22, 32].

The International Monetary Fund, World Bank, the Statistical Office of the European Communities, and other organizations have sponsored and developed SDMX, the Statistical Data and Metadata eXchange [30]. This is an initiative to foster XML standards for the exchange of statistical information.

Foreign exchange rate data from the US Federal Reserve Bank and the European Central Bank conform to SDMX standards, and the United Nations Statistical Commission has recognized SDMX as the preferred standard for the exchange and sharing of data and metadata [36]. As another example, the US Food and Drug Administration (USFDA) and the European Agency for the Evaluation of Medical Products (EMEA) are working on a series of initiatives to develop XML-based standards for data exchange, which includes Structured Product Labeling (SPL) for USFDA-regulated products [37]. Also, the US Census Bureau uses XML to facilitate the layout and assembly of economic census forms of US businesses. Similarly, a large number of data sets are available from http://www.data.gov in a variety of formats, including XML.

XML is widely used, but it is by no means the only format. JSON is popular for various reasons, most notably for its simplicity and brevity relative to XML. In many cases, JSON is simpler, but XML is more robust and better suited to precise description of the data structure via schema. This difference is analogous to the comparison between dynamic, untyped languages and strongly typed, compiled languages. The former are often better for one-off tasks or interactive exploration, while the latter often provide meta-tools for working with the data and better support long-term development and reproducibility needs. Some of the features of XML are shown in the box below. As consumers of data, we work with whatever format the data are made available in.

Features of XML

• XML is self-describing in that it can contain the format and structural information needed to properly read and interpret the content. For example, an XML document typically specifies its character encoding in the XML declaration. It can contain a DTD or schema that describes the structure of all documents within that XML vocabulary. For traditional data sets, it can include the missing value identifier, description of its provenance, etc. Different data sets can be clearly identified within a document. Compare each of these with a CSV file.

• XML typically separates information (content and structure) from the appearance of the information. This is generally considered important in all aspects of software.

• The highly extensible format allows XML content and data to be easily merged into higher-level container documents and to be easily exchanged with other applications.

• The content of an XML document is human-readable using any simple plain-text viewer. Although human-readable, XML also supports binary data and arbitrary character sets.

• Since XML is highly structured, it is easily machine generated and read.

• Many communities are actively using XML and providing extensive collections of tools for working with XML, and these tools have been, or can be, incorporated into other environments and programming languages relatively transparently.

We provide a few examples in this section to highlight the important features of XML, e.g., that it is self-describing, separates content from form, and is easily machine generated.

Example 2-1 A DocBook Document

As mentioned at the start of this chapter, DocBook is a vocabulary designed for authoring technical material. It leverages the extensive collection of XML tools, such as XSL (eXtensible StyleSheet Language), XPath, XInclude, HTML, and FO (Formatting Objects) (and also LaTeX) for transforming structured XML documents and generating rendered versions that can be displayed on computer screens or printed on paper. For example, this book was written using DocBook.

Below is a snippet of author information in the DocBook format.

<author>
  <personname>
    <firstname>Jane</firstname>
    <surname>Smith Doe</surname>
  </personname>
  <email>[email protected]</email>
  <affiliation>
    <orgname>University of California</orgname>
    <orgdiv>Department of Statistics</orgdiv>
  </affiliation>
</author>

From this sample, one can see why XML is called self-describing. Without the mark up, the content reduces to:

Jane Smith Doe [email protected]

University of California Department of Statistics

Someone may have trouble deciding whether the author’s surname is Doe or Smith Doe. The element identifiers <personname>, <firstname>, and <surname> provide metadata about the data, i.e., the meaning/role of each of the strings. We see from these element names that it is a person’s name (as opposed to a corporate name), that the person’s first name is Jane, and her surname is Smith Doe.
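As a rough illustration of this point, the sketch below reads the author snippet with Python's standard xml.etree.ElementTree module (not the R tools used in this book) and recovers each string by its role. The function names describe_author and bare_text are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The DocBook author snippet from the text; the e-mail text is kept as printed.
AUTHOR_XML = """<author>
  <personname>
    <firstname>Jane</firstname>
    <surname>Smith Doe</surname>
  </personname>
  <email>[email protected]</email>
  <affiliation>
    <orgname>University of California</orgname>
    <orgdiv>Department of Statistics</orgdiv>
  </affiliation>
</author>"""


def describe_author(xml_text):
    """Return the strings keyed by the role their element names announce."""
    author = ET.fromstring(xml_text)
    return {
        "firstname": author.findtext("personname/firstname"),
        "surname": author.findtext("personname/surname"),
        "orgname": author.findtext("affiliation/orgname"),
        "orgdiv": author.findtext("affiliation/orgdiv"),
    }


def bare_text(xml_text):
    """Drop the markup and keep only the whitespace-separated words."""
    author = ET.fromstring(xml_text)
    return " ".join("".join(author.itertext()).split())


if __name__ == "__main__":
    print(describe_author(AUTHOR_XML))
    print(bare_text(AUTHOR_XML))
```

With the markup, the surname is unambiguous; the output of bare_text() is the flat string in which the boundary between first name and surname has to be guessed.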

Of course, there are many other self-describing formats. For example, name:value pairs also provide metadata, e.g.,

firstname:Jane surname:Smith Doe email:[email protected]

orgname:University of California orgdiv:Department of Statistics

However, this approach is not as rich as DocBook’s structure, which allows complex nesting of elements.

We could achieve this with JSON. However, nobody would ever suggest authoring a book in JSON, primarily because text needs to be within quotes. Also, JSON does not have a way to specify the character encoding, but assumes/demands Unicode. Similarly, it does not support schema for validating and describing documents.

Note also, the meta information in DocBook does not contain instructions for formatting the data, e.g., that the author’s name should appear in italicized font. In general, XML separates the content and structure from the way the information is rendered. We process the semantic information with separate information (e.g., XSL style sheets [31]) for how to render it for different audiences. This is much the same way we have learned to use CSS [2] for controlling the appearance of HTML and reducing or eliminating formatting information in the HTML.

While we have not formally introduced the R functions for working with XML content, it is useful to note that the rich structure and formal grammar of XML make it easy to work with XML documents. For example, we can find all <email> elements, or all <r:func> or <r:package> nodes.

We can even locate the <section> node in a book which is, e.g., a) within a chapter whose title contains the phrase "social network" and b) which has a paragraph with <r:code> that contains a call to load the graph package [10]. These are significantly harder to do robustly with markup languages such as LaTeX or Markdown [16] since they do not have formal grammars. Typically, people use line-oriented regular expressions for querying such documents and so cannot use the hierarchical context to locate nodes. This also makes it much harder to programmatically update content.
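The kind of hierarchical query described above can be sketched as follows, again using Python's standard library and not the R functions introduced later. The miniature book, its id attributes, the <para> element name for paragraphs, and the function name matching_sections are all invented for this illustration; the namespace URI for the r prefix is the one used in this chapter's xmlns:r example.

```python
import xml.etree.ElementTree as ET

R_NS = "http://www.r-project.org"  # URI bound to the r prefix

# A made-up miniature book, used only to exercise the query.
BOOK = """<book xmlns:r="http://www.r-project.org">
  <chapter>
    <title>Visualizing a social network</title>
    <section id="s1">
      <para>Load the package: <r:code>library(graph)</r:code></para>
    </section>
    <section id="s2">
      <para>No code in this section.</para>
    </section>
  </chapter>
  <chapter>
    <title>Time series</title>
    <section id="s3">
      <para><r:code>library(graph)</r:code></para>
    </section>
  </chapter>
</book>"""


def matching_sections(xml_text):
    """Ids of sections that (a) sit in a chapter whose title mentions
    'social network' and (b) have a paragraph with r:code loading graph."""
    root = ET.fromstring(xml_text)
    hits = []
    for chapter in root.iter("chapter"):
        if "social network" not in chapter.findtext("title", default=""):
            continue  # condition (a) fails for this chapter
        for section in chapter.iter("section"):
            codes = section.findall("./para/{%s}code" % R_NS)
            if any("library(graph)" in (c.text or "") for c in codes):
                hits.append(section.get("id"))
    return hits


if __name__ == "__main__":
    print(matching_sections(BOOK))
```

Section s3 contains the same call but is skipped because its chapter title does not match; that use of the ancestor's content is what a line-oriented regular expression cannot easily express.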

Our <dataFrame> example from earlier in the chapter also suggests how XML can be self-contained and self-describing. Rather than requiring the consumer of the data to both know and specify information about the structure of the data, software can determine this information by reading it from the XML document itself. We can read the number of observations and the types of the variables, obtain the names of the variables, determine the missing value symbol, and identify each separate data set within the document from XML markup. Also, we automatically determine the character encoding when parsing the XML document. Contrast this with CSV files and calls to read.table() and read.csv(). The extensibility of XML means that we can easily add new metadata to an XML representation and allow clients to query this as they make sense of the content.
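The point about character encoding can be checked with a small sketch. It uses Python's standard library, not R, and the <surname> document, the name in it, and the function names are made up for the illustration. The bytes are Latin-1 encoded and say so in the XML declaration, so the parser decodes them without being told anything by the caller.

```python
import xml.etree.ElementTree as ET

# Bytes as they might sit on disk: Latin-1 encoded, with the encoding
# named in the XML declaration. The content is a made-up example.
RAW = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
       '<surname>Mu\u00f1oz</surname>').encode("iso-8859-1")


def read_surname(raw_bytes):
    """Parse the bytes; the parser reads the encoding from the declaration."""
    return ET.fromstring(raw_bytes).text


def utf8_guess_fails(raw_bytes):
    """True if blindly decoding the same bytes as UTF-8 raises an error."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(read_surname(RAW))
    print(utf8_guess_fails(RAW))
```

A reader of a plain-text file with the same bytes would have to be told the encoding separately, or guess it.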

The previous example illustrated DocBook’s choice of simple words for tag names that suggest their meaning or purpose. We saw in Section 2.1 that some names appeared to contain a colon, e.g., <r:code>. In fact, the name <r:code> is made up of two terms, code and the r prefix, separated via the : character. The r prefix identifies a namespace. XML namespaces have a role similar to package namespaces in R. They allow us to avoid (potential) conflicts from using the same name in different contexts or with different meaning, specifically when we mix element names from two or more different XML vocabularies. In R, we can refer to a function in either of two packages with pkg1::aFunc or pkg2::aFunc. We have a similar problem if we use an element named <code> in XML. Does this mean R code, C code, shell code, or code in any other language? We could use <rcode> for the element, but Ruby programmers might use that also to refer to Ruby code. There is still a conflict. We could use a URL uniquely identifying the project, e.g., <r-project.orgcode>. This would remove the conflict, but be very verbose, tedious, and error-prone. Instead, XML namespaces provide a more robust, flexible, and richer approach to disambiguate conflicts. Within the XML document, we define an XML namespace as a pair consisting of a prefix and the uniquely identifying URI, e.g., r and www.r-project.org. Then, we can use the prefix to qualify the element name, e.g., <r:code>. We can define the prefix-URI mapping in an element using the form xmlns:prefix=URI, e.g.,

<dataFrame xmlns:r="http://www.r-project.org">...

This looks like an attribute, but is technically different. We can use the prefix in the node in which it is defined or in any of its descendant nodes, i.e., its child nodes, their children, and so on. In addition to making the node names shorter, the mapping of the URI to a prefix allows us, the authors of the XML document, to choose the prefix or to select which is the default namespace. The next example provides another illustration of namespaces.
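A small sketch may help show that it is the URI, not the prefix, that identifies a namespace. It uses Python's standard xml.etree.ElementTree module in place of R; the element content, the alternative prefix rlang, and the function names are invented for the illustration. The parser expands every qualified name to the form {URI}localname, so two documents that bind different prefixes to the same URI yield the same element name, while an unqualified <code> stays distinct.

```python
import xml.etree.ElementTree as ET

R_URI = "http://www.r-project.org"

# Two documents that bind different prefixes to the same URI.
DOC_A = ('<dataFrame xmlns:r="http://www.r-project.org">'
         '<r:code>x = 1</r:code></dataFrame>')
DOC_B = ('<dataFrame xmlns:rlang="http://www.r-project.org">'
         '<rlang:code>x = 1</rlang:code></dataFrame>')
# An unqualified <code> element, which belongs to no namespace here.
DOC_C = '<dataFrame><code>x = 1</code></dataFrame>'


def child_names(xml_text):
    """Expanded names of the root's children, in '{URI}local' form."""
    return [child.tag for child in ET.fromstring(xml_text)]


def find_r_code(xml_text):
    """Look up the R code element using our own prefix-to-URI mapping."""
    node = ET.fromstring(xml_text).find("r:code", {"r": R_URI})
    return None if node is None else node.text


if __name__ == "__main__":
    for doc in (DOC_A, DOC_B, DOC_C):
        print(child_names(doc), find_r_code(doc))
```

The query in find_r_code() uses the prefix r regardless of which prefix the document author chose, because the lookup is resolved through the URI.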

Example 2-2 A Climate Science Modelling Language (CSML) Document

The Climate Science Modelling Language (CSML, http://ndg.nerc.ac.uk/csml) was developed by the British Atmospheric Data Centre and British Oceanographic Data Centre through the UK’s Natural Environment Research Council’s (NERC) DataGrid project [20]. Rather than starting from scratch to create their vocabulary for climate science modeling, NERC built on an existing grammar, the Geography Markup Language (GML), which already had many of the needed XML elements and features for CSML. This is an example of the extensibility represented by the “X” in XML.

The following snippet of CSML data contains daily rainfall measurements at a specified location for each day in the month of January. These measurements (5 3 10 1 2 ...) are the text content of the <gml:QuantityList> element, which is an element within <PointSeriesFeature> (see footnote 2).

<gml:featureMember>
  <PointSeriesFeature gml:id="feat02">
    <gml:description>
      January timeseries of raingauge measurements
    </gml:description>
    <PointSeriesDomain>
      <domainReference>
        <Trajectory srsName="urn:EPSG:geographicCRS:4979">
          <locations>0.1 1.5 25</locations>
          <times frame="#RefSys01">
            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
            24 25 26 27 28 29 30 31
          </times>
        </Trajectory>
      </domainReference>
    </PointSeriesDomain>
    <gml:rangeSet>
      <gml:QuantityList uom="udunits.xml#mm">
        5 3 10 1 2 8 10 2 5 10 20 21 12 3 5 19 12 23 32 10 8 8 2 0 0 1 5 6 10 17 20
      </gml:QuantityList>
    </gml:rangeSet>
    <parameter xlink:href="#rainfall"></parameter>
  </PointSeriesFeature>
</gml:featureMember>

Footnote 2: The values can be marked up individually, e.g., within <value> or <double> elements. However, since the values do not contain white space, we can separate them by spaces and recover them faithfully.

Notice that the tag name, <gml:QuantityList>, begins with the prefix gml: while <PointSeriesFeature> has no prefix. The prefix distinguishes GML element identifiers from CSML element identifiers. In creating CSML, NERC began with the element names and structure of GML and added to it tags needed for climate science. We can simply add new element and attribute names to an XML vocabulary without using a namespace. However, if we use the same name for a different concept, we have a conflict. We use a namespace to avoid this conflict.

The creators of CSML decided to clearly separate the additional CSML elements by using a separate namespace for them. The sample XML suggests that this situation is reversed, i.e., that the GML elements have a namespace prefix and the CSML elements do not. In fact, both sets of elements have a namespace, but the CSML elements use the default namespace (defined earlier in the document). The majority of the elements come from the CSML vocabulary so we use that as the default namespace to reduce the number of prefixes. We can equivalently use explicit namespace prefixes for both sets of elements, or use the GML namespace as the default and so qualify <PointSeriesFeature>, <PointSeriesDomain>, etc. We can use any prefix, and are not required to use the name of the vocabulary, e.g., "gml". The namespace is defined by specifying a URI and a document-local prefix.

We leave this example with one final observation. The CSML snippet shows several layers of nesting. We use indentation to make the nesting of the elements clear and emphasize the tree structure. This hierarchical structure gives great flexibility in describing complex data structures. We can represent linear, tree, and even graph structures easily with XML. This generality, flexibility, and expressiveness make XML useful.
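To make the default-namespace discussion in this example concrete, here is a hedged sketch in Python's standard library (not R) that pulls the rainfall values out of a trimmed copy of the CSML snippet. The snippet as printed omits the namespace declarations, so the two URIs below are assumptions added to make the fragment parse: http://ndg.nerc.ac.uk/csml for the default (CSML) namespace and http://www.opengis.net/gml for gml. The prefix csml in the lookup table and the function name rainfall are also choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Prefixes used for querying; they need not match those in the document.
# Both URIs are assumptions, since the printed snippet omits the declarations.
NS = {
    "csml": "http://ndg.nerc.ac.uk/csml",
    "gml": "http://www.opengis.net/gml",
}

# Trimmed copy of the snippet (the domain part is dropped), with the
# assumed namespace declarations added to the outermost element.
CSML = """<gml:featureMember xmlns="http://ndg.nerc.ac.uk/csml"
                   xmlns:gml="http://www.opengis.net/gml">
  <PointSeriesFeature gml:id="feat02">
    <gml:rangeSet>
      <gml:QuantityList uom="udunits.xml#mm">
        5 3 10 1 2 8 10 2 5 10 20 21 12 3 5 19 12 23 32 10 8 8 2 0 0 1 5 6 10 17 20
      </gml:QuantityList>
    </gml:rangeSet>
  </PointSeriesFeature>
</gml:featureMember>"""


def rainfall(xml_text):
    """Return (feature id, unit of measure, list of daily values)."""
    root = ET.fromstring(xml_text)
    # Unprefixed in the document, but still in the default (CSML) namespace.
    feature = root.find("csml:PointSeriesFeature", NS)
    qlist = feature.find("gml:rangeSet/gml:QuantityList", NS)
    # The values contain no white space, so splitting recovers them.
    values = [float(v) for v in qlist.text.split()]
    feature_id = feature.get("{%s}id" % NS["gml"])  # the gml:id attribute
    return feature_id, qlist.get("uom"), values


if __name__ == "__main__":
    fid, unit, values = rainfall(CSML)
    print(fid, unit, len(values), max(values))
```

Note that <PointSeriesFeature> has to be looked up through a namespace even though it carries no prefix in the document, which is the reversal the text describes.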

Example 2-3 A Statistical Data and Metadata Exchange (SDMX) Exchange Rate Document

The European Central Bank (ECB) provides daily foreign exchange rates between the euro and the most common currencies [7]. These are provided in several file formats, including an HTML format for the iPhone and two XML formats. Both XML formats were developed in accordance with the Statistical Data and Metadata Exchange initiative [6]. These foreign exchange reference rates (eurofxref, for short) use both the SDMX-EDI (GESMES/TS, GEneric Statistical MESsage for Time Series) format (http://sdmx.org/) and the ECB’s extension vocabulary. The ECB uses this format to exchange data with its partners in the European System of Central Banks. According to them, this format “was a key element in the statistical preparations for Monetary Union and has proved both efficient and effective in meeting the ESCB’s rapidly evolving statistical requirements” [8].

An XML snippet of exchange rates for four currencies on two days is shown below (see footnote 3). Notice the extensive use of attributes in this vocabulary and little use of text elements for representing data. The attributes time, currency, and rate contain, respectively, the date, name of the currency, and exchange rate to buy one euro in this currency.

<Envelope>
  <subject>Reference rates</subject>
  <Sender>
    <name>European Central Bank</name>
  </Sender>
  <Cube>
    <Cube time="2008-04-21">
      <Cube currency="USD" rate="1.5898"/>
      <Cube currency="JPY" rate="164.43"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="25.091"/>
    </Cube>
    <Cube time="2008-04-17">
      <Cube currency="USD" rate="1.5872"/>
      <Cube currency="JPY" rate="162.74"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="24.975"/>
    </Cube>
  </Cube>
</Envelope>

This document appears quite different from the XML in Example 2-2 (page 32). Here the snippet shows three levels of tags with the same name, i.e., <Cube>, and each has no text content. All of the relevant information is contained in the attribute values of the <Cube> elements. There is one parent <Cube> that holds all of the others. The next layer of <Cube> elements pertains to the date; there is one <Cube> element for each day, where the time attribute identifies the specific date, e.g., "2008-04-17". Within each of these “time” cubes are four <Cube> elements, one for each currency (US dollar, Japanese yen, Bulgarian lev, and Czech koruna). These innermost <Cube>s provide the name of the currency in currency and the exchange rate for that currency in rate. The exchange rate is for the date found in the parent <Cube> in which the element is nested. The <Cube> element corresponds to a multidimensional data cube, which represents the dimensions in a generic way. The values are grouped by time, space/geography, and other variables as <Cube>s, with arbitrary metadata associated with each dimension.
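The nesting just described maps naturally onto a flat table of (date, currency, rate) rows, where each innermost <Cube> inherits its date from its parent. The sketch below does this with Python's standard library in place of R, and the function name exchange_rates is invented here. It works on the snippet exactly as printed above, which carries no namespace declarations; a complete file may declare namespaces, in which case the paths would need to be qualified.

```python
import xml.etree.ElementTree as ET

# The exchange rate snippet as printed in the text.
ECB = """<Envelope>
  <subject>Reference rates</subject>
  <Sender>
    <name>European Central Bank</name>
  </Sender>
  <Cube>
    <Cube time="2008-04-21">
      <Cube currency="USD" rate="1.5898"/>
      <Cube currency="JPY" rate="164.43"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="25.091"/>
    </Cube>
    <Cube time="2008-04-17">
      <Cube currency="USD" rate="1.5872"/>
      <Cube currency="JPY" rate="162.74"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="24.975"/>
    </Cube>
  </Cube>
</Envelope>"""


def exchange_rates(xml_text):
    """Flatten the nested cubes into (date, currency, rate) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    # Second-level cubes carry the date in their time attribute.
    for day in root.findall("./Cube/Cube[@time]"):
        # Innermost cubes carry the currency and rate; the date comes
        # from the parent cube in which they are nested.
        for cube in day.findall("Cube[@currency]"):
            rows.append((day.get("time"),
                         cube.get("currency"),
                         float(cube.get("rate"))))
    return rows


if __name__ == "__main__":
    for row in exchange_rates(ECB):
        print(row)
```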

The eurofxref grammar uses the SDMX cube model for data. That is, the data are viewed as an n-dimensional object where the value of each dimension is derived from a hierarchy. According to SDMX [30]: “The utility of such cube systems is that it is possible to ‘roll up’ or ‘drill down’ each of the hierarchy levels for each of the dimensions to specify the level of granularity required to give a