XPath, XPointer, and XInclude
aim to provide the reader with a reasonably complete understanding of using XPath in R. Additional information can be gained from books dedicated to XPath such as [9].
XPath is also used in other XML technologies. For example, XPath is an important part of XInclude and XPointer (Section 4.9). XInclude is a mechanism that allows one XML document to include part, or all, of another XML document. This is a general merge mechanism that allows us to maintain XML content in different documents and act as if it is one single document. It is like, but more powerful than, LaTeX’s include or input commands. For example, XInclude allows us to include only a part of the document. To specify which nodes to include, we use XPath as part of the XPointer language to identify the nodes we want.
We also mention here the eXtensible Stylesheet Language (XSL) [1, 12, 13], which is an XML-based language for transforming XML documents into different forms, be they other XML/HTML documents or text. We often use XSL to generate HTML or PDF from our articles written in DocBook. We can also use XSL to convert the data in a tree to a different format, e.g., CSV, where appropriate. XSL templates, or rules, for processing and transforming a node are written using XPath. These XPath-based technologies illustrate that XPath is a general technology and very useful to know for many purposes, not just extracting data from XML content.
4.2 XPath and the XML Tree
XPath is a language for querying and locating elements in an XML document. It operates on the hierarchy of a well-formed XML document to specify the desired chunks to obtain. XPath is not an XML vocabulary; it has a syntax that is similar to but more powerful than the way files are located in a hierarchy of directories in a computer file system. For example, on Windows, the C: drive acts as a root node, and within this, a file is located in a hierarchy of folders (directories) by an expression such as C:\MyDocuments\Memo_Aug2.docx. On UNIX, the root node of all file systems is represented by a forward slash /, and within this there are files and sub-directories, e.g., /Users/nolan/Documents/Memo_Aug2.docx. Anyone familiar with navigating file-system trees, either with command line utilities in UNIX such as ls for listing directories and cd for changing directory, or with a graphical user interface such as Mac OS X’s Finder or Microsoft’s Explorer, will find similarities to XPath expressions. XPath, however, is much more succinct and expressive.
XPath has many similarities to regular expressions. In both cases, we are identifying patterns to match data or content. There is a trade-off between matching too liberally/permissively and being overly specific. We often mix the pattern matching with subsequent R computations on the resulting matches. Like regular expressions, experience helps compose correct XPath expressions. However, XPath is a simpler language and has a simpler computational model to understand than regular expressions.
Let’s consider the XML document from Example 2-3 (page 33) that contains the currency exchange rates relative to the euro. The basic structure of the document (with the namespaces removed for simplicity) is shown below.
<Envelope>
  <subject>Reference rates</subject>
  <Sender>
    <name>European Central Bank</name>
  </Sender>
  <Cube>
    <Cube time="2008-04-21">
      <Cube currency="USD" rate="1.5898"/>
      <Cube currency="JPY" rate="164.43"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="25.091"/>
    </Cube>
    <Cube time="2008-04-17">
      <Cube currency="USD" rate="1.5872"/>
      <Cube currency="JPY" rate="162.74"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="24.975"/>
    </Cube>
  </Cube>
</Envelope>
This snippet of an XML document is in the SDMX format. See Example 2-3 (page 33) for more details about this particular XML vocabulary. The exchange rates for each currency are in <Cube> nodes with the currency and that day’s exchange rate given as attributes. The currencies are grouped together in a parent node for each day. This parent is also named <Cube> and it has a time attribute. To further confuse matters, the collection of daily data is organized as elements of yet another <Cube> node. The <Cube> is a general way to represent multidimensional data. Here we have three dimensions. There is the overarching <Cube> to indicate what is being measured (exchange rates), and within this the different days, and within each day the different currency values.
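To experiment with the queries in this section, the snippet can be parsed directly from a string. The following is a minimal sketch, assuming the XML package is installed; the variable names txt and root are ours, and the text holds only the abbreviated, namespace-free version of the document shown above rather than the full file from Example 2-3.

```r
library(XML)  # assumed to be installed, e.g., from CRAN

# The abbreviated document from the text, held as one string.
txt = '<Envelope>
  <subject>Reference rates</subject>
  <Sender>
    <name>European Central Bank</name>
  </Sender>
  <Cube>
    <Cube time="2008-04-21">
      <Cube currency="USD" rate="1.5898"/>
      <Cube currency="JPY" rate="164.43"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="25.091"/>
    </Cube>
    <Cube time="2008-04-17">
      <Cube currency="USD" rate="1.5872"/>
      <Cube currency="JPY" rate="162.74"/>
      <Cube currency="BGN" rate="1.9558"/>
      <Cube currency="CZK" rate="24.975"/>
    </Cube>
  </Cube>
</Envelope>'

# asText = TRUE says that txt is XML content rather than a file name.
doc = xmlParse(txt, asText = TRUE)

root = xmlRoot(doc)
xmlName(root)                       # "Envelope"

# One overarching <Cube>, two daily <Cube>s, and eight currency <Cube>s.
length(getNodeSet(doc, "//Cube"))   # 11
```

The resulting doc plays the role of the doc object used in the R calls in this section.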
The following XPath expression,
/Envelope/Sender/name
locates the <name> element near the top of the document. The document hierarchy shown in Figure 4.1 shows the realization of this XPath expression, i.e., the nodes that have been identified by the query. We can evaluate the query in R with
nm = getNodeSet(doc, "/Envelope/Sender/name")
An XPath expression defines a location path consisting of one or more location steps, each separated by a forward slash. In this expression, we start at the root (/) and look for a child element named <Envelope>. Having found that, we continue with the next step in the search, and from this position we look for a child node named <Sender>. Finally, we start from this <Sender> node and search for a child called <name>. Here the steps are very specific, but we will see that they can be much more general, e.g., any descendant at any level.
The XPath computational model is designed to identify node-sets, which are collections of nodes in the target tree that meet the criteria in the XPath expression. The result of our query in R is of class XMLNodeSet and is a list of references to those nodes which the XPath query matched. This is a set in the sense that there are no repeated elements, i.e., each node in the result is unique within this result. In this case, it contains a single <name> element from the document, but in general the expression can match more than one node. Indeed, in our example, there was only one matching node at each location step. However, there might be many, and XPath follows all matching nodes at each step. In this way, it is vectorized in its searching.
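As a small illustration of what the query returns, the sketch below evaluates the same expression on a cut-down document that contains only the <Sender> branch. It assumes the XML package is installed, and the string txt is our own abbreviation of the example.

```r
library(XML)

# Only the part of the document that this query touches.
txt = '<Envelope>
  <Sender>
    <name>European Central Bank</name>
  </Sender>
</Envelope>'
doc = xmlParse(txt, asText = TRUE)

nm = getNodeSet(doc, "/Envelope/Sender/name")

inherits(nm, "XMLNodeSet")   # the result is a list with class XMLNodeSet
length(nm)                   # 1: a single node matched
xmlValue(nm[[1]])            # "European Central Bank"
```

Because the result is a list, we use [[ to get at an individual node, even when there is only one match.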
For an example that matches multiple nodes, consider the XPath expression
/Envelope/Cube/Cube
Figure 4.1: Simple XPath Expression Applied to a Tree. The shading in this diagram shows how the XPath expression /Envelope/Sender/name locates the <name> node. The shaded nodes are location steps in the path to the matching node, which are progressively more brightly shaded as we move from one location step to the next and get more specific in the query.
1. The first location step identifies the root node, <Envelope>.
2. The next step locates the <Sender> child of <Envelope>.
3. The third step identifies <Sender>’s child called <name>.
This expression matches two nodes as shown in Figure 4.2, corresponding to the two days of data in our document. These matches are two sibling <Cube> nodes that are grandchildren of <Envelope>, and children of the topmost <Cube> node.
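The expression /Envelope/Cube/Cube can be tried in R along the following lines. This is a sketch that assumes the XML package is installed and uses a shortened copy of the document with one currency per day; the names days and times are ours.

```r
library(XML)

# Shortened copy of the example: two days, one currency each.
txt = '<Envelope>
  <Cube>
    <Cube time="2008-04-21">
      <Cube currency="USD" rate="1.5898"/>
    </Cube>
    <Cube time="2008-04-17">
      <Cube currency="USD" rate="1.5872"/>
    </Cube>
  </Cube>
</Envelope>'
doc = xmlParse(txt, asText = TRUE)

# The third step starts from the second-level <Cube> and matches
# both of its <Cube> children.
days = getNodeSet(doc, "/Envelope/Cube/Cube")
length(days)                              # 2

# Pull the time attribute from each matched node.
times = sapply(days, xmlGetAttr, "time")
times                                     # "2008-04-21" "2008-04-17"
```

The currency-level <Cube> nodes are not in the result because they sit one level deeper than the three steps in the path.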
Figure 4.2: XPath Expression Locating Multiple Nodes in a Tree. The shading in the diagram shows how the XPath expression /Envelope/Cube/Cube locates two sibling <Cube> nodes. The lightly shaded nodes denote steps in the path toward the match of the two nodes that are brightly shaded in the diagram.
1. The first location step identifies the root node, <Envelope>.
2. The next step locates the <Cube> child of <Envelope>.
3. The third step identifies the two <Cube> children of the second-level <Cube> node matched from the second step.
The notion of a node-set, i.e., where a node can occur just once in the set, may seem problematic. A node may match multiple conditions in a composite XPath expression, i.e., where we specify two or more node tests in the XPath query. In contrast, when we subset the same element multiple times in an R vector, we explicitly obtain multiple copies of the element. But the concept of a node-set with each element occurring at most once is precisely what makes XPath useful. We can find all nodes in a tree that match a particular query and then work with just those. We do not have to worry whether we have already processed that node earlier in the node-set since we know it is unique. We are also guaranteed that the nodes will appear in the node-set in the same order that they occur in the document, i.e., in document order (see page 96). If we did two separate XPath queries, we would end up with two node-sets and they might contain some of the same nodes. If we were to process these two node-sets, we might end up “double-counting” a node. Also, we would not know if we were processing the nodes in an appropriate order. The node-set is precisely what we want.
For example, to extract all exchange rates for the Japanese yen from the SDMX data, we might look for all the <Cube> elements with a currency attribute value of JPY. The XPath expression
//Cube[@currency = "JPY"]
does just that. The expression //Cube is a shortcut for specifying the location step that indicates the <Cube> element may appear at any level in the document. It means: match all of the descendants, including the current node, by name. Having obtained the resultant matches, we apply a predicate to restrict the set. The square brackets are analogous to subsetting in R and provide a test on the nodes that have matched. That is, the expression within [] provides a condition that must be met by these matching <Cube> nodes, if they are to remain in the node-set. The expression //Cube matches all <Cube> nodes, but the condition [@currency = "JPY"] filters out those nodes that do not have a currency attribute or whose currency attribute does not have the value "JPY". The @ symbol is shorthand for attribute in XPath. Two nodes match our expression; these are the two nodes with an exchange rate for the yen, as shown in Figure 4.3.
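To see the predicate at work, we can run the query in R and then continue with ordinary R computations on the matches. The sketch below assumes the XML package is installed and uses a shortened copy of the document with only the USD and JPY entries; the names jpy, rates, and days are ours.

```r
library(XML)

# Shortened copy of the example: two days, two currencies each.
txt = '<Envelope>
  <Cube>
    <Cube time="2008-04-21">
      <Cube currency="USD" rate="1.5898"/>
      <Cube currency="JPY" rate="164.43"/>
    </Cube>
    <Cube time="2008-04-17">
      <Cube currency="USD" rate="1.5872"/>
      <Cube currency="JPY" rate="162.74"/>
    </Cube>
  </Cube>
</Envelope>'
doc = xmlParse(txt, asText = TRUE)

# //Cube matches every <Cube>; the predicate keeps only the yen entries.
jpy = getNodeSet(doc, '//Cube[@currency = "JPY"]')
length(jpy)                                        # 2

# Attribute values arrive as strings, so convert the rates to numbers.
rates = as.numeric(sapply(jpy, xmlGetAttr, "rate"))
rates                                              # 164.43 162.74

# The date is stored on each match's parent <Cube>.
days = sapply(jpy, function(node) xmlGetAttr(xmlParent(node), "time"))
days                                               # "2008-04-21" "2008-04-17"
```

The <Cube> nodes that have no currency attribute, such as the daily and topmost ones, fail the predicate and so never reach the R code.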
Figure 4.3: XPath Predicate Filters a Node-set. The XPath expression //Cube matches all <Cube> nodes anywhere in the document. The // is shorthand in XPath for all descendant nodes from this point down, including this “self” node. The addition of the predicate [@currency="JPY"] filters the set of matches to those elements that have a currency attribute with a value of “JPY”. In this case, two nodes meet this condition. The shaded nodes highlight all the <Cube> elements in the document, and the two darkest satisfy the predicate.
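The earlier point about uniqueness within a node-set can also be checked in R. The sketch below is our own illustration, not taken from the example document's original treatment. It assumes the XML package is installed, uses a shortened copy of the data, and combines two conditions with the XPath union operator |, which this section has not otherwise introduced. In this data, the yen entries are also the only ones with a rate above 100, so both conditions select the same two nodes.

```r
library(XML)

# Shortened copy of the example: two days, two currencies each.
txt = '<Envelope>
  <Cube>
    <Cube time="2008-04-21">
      <Cube currency="USD" rate="1.5898"/>
      <Cube currency="JPY" rate="164.43"/>
    </Cube>
    <Cube time="2008-04-17">
      <Cube currency="USD" rate="1.5872"/>
      <Cube currency="JPY" rate="162.74"/>
    </Cube>
  </Cube>
</Envelope>'
doc = xmlParse(txt, asText = TRUE)

# Two separate queries give two node-sets that share the same nodes.
byName = getNodeSet(doc, '//Cube[@currency = "JPY"]')
byRate = getNodeSet(doc, '//Cube[@rate > 100]')
length(byName) + length(byRate)   # 4: each yen node is counted twice

# A single query with a union yields one node-set without repeats.
both = getNodeSet(doc, '//Cube[@currency = "JPY"] | //Cube[@rate > 100]')
length(both)                      # 2
```

Processing byName and byRate one after the other would double-count each yen node, whereas the single node-set both holds each node once.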