XPath, XPointer, and XInclude
the last node in the node-set; this can be abbreviated to [last()]. Similarly, we can subset by position, e.g., //section[2] yields the second <section> node. There are functions to compute the name or value of a node and also to perform string manipulation and comparisons. Simple expressions can be combined using the "and" and "or" operators, e.g.,
//Cube[@currency = "USD" or @rate > 1.5]
Location steps are concatenated together with / to form a location path. At each step in the location path, the step’s expression is evaluated within the current context, i.e., once for each node that matched the previous step, with that node as the context. For example, with
//Cube[@time]/Cube[@rate < 25]
or
/descendant-or-self::Cube[@time]/child::Cube[attribute::rate<25]
the first step matches the two <Cube> nodes that have a time attribute. For each of these, we evaluate the next step which searches for a child <Cube> node with a rate attribute with value less than 25. The result is the union of the matching nodes from each of the two separate searches.
More than one element can be located by an XPath expression. The located nodes are called the node-set. Each matching node appears in the node-set just once. This is useful and especially important to remember when we work with compound XPath queries, i.e., using multiple XPath expressions together to search for this OR that.
4.4 XPath Functions and Logical Operators
XPath provides logical operators for combining predicates. Predicates can be combined into a compound predicate using one of the binary operators "and" or "or". For instance, to match <Cube> nodes for the US dollar or Japanese yen, we can use
//Cube[@currency = "JPY" or @currency = "USD"]
Other boolean functions in XPath include not(), true(), and false(). The not() function computes the opposite, or negation, of a condition. It is analogous to the ! operator in R, and we can use it in an XPath expression such as
//graphic[ not( contains(@fileref, '.jpg') )]
to find all <graphic> nodes that do not have a fileref attribute with the extension jpg. We should note, as an aside, that this expression is not precise enough, for two reasons. Firstly, it also matches <graphic> elements which have no fileref attribute at all. We can remedy this by testing for the presence of the fileref attribute before the test for the extension, i.e.,
//graphic[ @fileref and not( contains(@fileref, '.jpg') )]
The second problem is that this does not test for the string ".jpg" at the end of the file name. We would like to use a function such as ends-with(), similar to starts-with(). Unfortunately, XPath 1.0 does not provide such a function. However, we can get the same effect with substring() and string-length() via
//graphic[ @fileref and
not( substring(@fileref, string-length(@fileref) - 3, 4)
= '.jpg' ) ]
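The index arithmetic in this workaround is easy to get wrong, so it can help to check it outside XPath. The following is a minimal sketch in Python (standard library only) rather than in R; the helper names xpath_round, xpath_substring, and ends_with are hypothetical. It mimics the XPath 1.0 rule that string positions start at 1 and that substring() keeps the characters whose position p satisfies round(start) <= p < round(start) + round(length), assuming finite numeric arguments.

```python
import math


def xpath_round(x):
    """Round half up, as XPath 1.0 round() does for finite numbers."""
    return math.floor(x + 0.5)


def xpath_substring(s, start, length):
    """Emulate XPath 1.0 substring(s, start, length) with 1-based positions."""
    first = xpath_round(start)
    end = first + xpath_round(length)  # exclusive upper bound
    return "".join(ch for pos, ch in enumerate(s, start=1) if first <= pos < end)


def ends_with(s, suffix):
    """Emulate ends-with() the way the XPath 1.0 workaround does."""
    n = len(suffix)
    # For a 4-character suffix this is substring(s, string-length(s) - 3, 4).
    return xpath_substring(s, len(s) - n + 1, n) == suffix


for name in ["Images/plot.jpg", "Images/plot.jpg.png", "jpg"]:
    print(name, ends_with(name, ".jpg"))
```

Only the first file name passes the test. The second contains ".jpg" but does not end with it, which is exactly the case that contains() gets wrong.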
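This expression negates an equality test with not() rather than using !=. XPath 1.0 does have a != operator, but it is not a direct equivalent of R’s != operator: when an operand is a node-set, the comparison is true if any node in the set satisfies it, so a != b and not(a = b) can give different answers. More generally, XPath 1.0 was designed to be minimal, so that much of the functionality we need can be built from a few primitive functions.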
Predicates can look at both the content and location of the node they are testing to see if these match the condition(s). For this, we need to be able to perform computations on the node and its contents. XPath provides many functions that are helpful in constructing predicates, and queries generally. Some of these functions access the context of a node. For example, position() returns a numeric value giving the position of the node in the current node-set associated with the predicate.
This function can be quite useful in extracting a specific node according to its position. As an example,
//section[position() = 2]/r:code[position() = 1]
locates the first <r:code> node in the second section of the document. This can be abbreviated to
//section[2]/r:code[1]
That is, XPath treats [2] as [position() = 2]. Unlike in R, the expression [2] is actually an implicit logical predicate. We do not have to use the logical form, but it is good to know how it is being evaluated.
As another example, suppose we wish to extract the last node in a node-set, and we do not know how many nodes it contains, just that we want the last one. The XPath function last() combined with position() in the predicate below returns true when the position of the node in the node-set matches the position of the last node in the node-set (i.e., last() is equivalent to the size of the node-set).
/Envelope/Cube/Cube[position() = last()]
As before, this predicate can be abbreviated to [last()] because when the result of a predicate expression is a number, XPath treats it as a logical condition that compares this number to the context node’s position and, if they match, returns true.
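To see the position predicates at work on a concrete tree, here is a small sketch in Python rather than R. It relies on the standard library's xml.etree.ElementTree, which understands only a limited subset of XPath but does accept [2] and [last()]. The sample document, its dates, and its rates are made up for illustration and omit namespaces.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of the exchange-rate document (no namespaces).
doc = """
<Envelope>
  <Cube>
    <Cube time="2008-04-21"><Cube currency="USD" rate="1.59"/></Cube>
    <Cube time="2008-04-18"><Cube currency="USD" rate="1.58"/></Cube>
    <Cube time="2008-04-17"><Cube currency="USD" rate="1.59"/></Cube>
  </Cube>
</Envelope>
"""
root = ET.fromstring(doc)  # root is the <Envelope> element

# Abbreviated position predicates, evaluated relative to each parent node.
last = root.findall("./Cube/Cube[last()]")
second = root.findall("./Cube/Cube[2]")

# The same nodes, picked by explicit position among their siblings.
siblings = root.findall("./Cube/Cube")
print(last[0].get("time"), siblings[-1].get("time"))
print(second[0].get("time"), siblings[1].get("time"))
```

Each printed pair shows the same time value twice: the predicate and the explicit index select the same node. Python lists are indexed from 0, whereas XPath positions start at 1.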
In addition to position() and last(), there are many functions available in XPath for use in predicates. Some provide access to a node’s properties, i.e., to the node’s name (both local and qualified by its namespace), its position within the node-set, family relationship with other nodes in the tree, and string-value. In addition to functions that operate on a node, XPath provides functions that operate on strings. The most commonly used of these functions are summarized in Table 4.2.
As an example, we find <r:code> nodes that contain the word 'library' with
//r:code[ contains(., 'library') ]
There are also several functions in XPath that deal with numbers. The function number() converts its (string) argument to a number, e.g., number(@rate) > 1. As we mentioned, XPath typically does the implicit conversion for us. The functions floor(), ceiling(), and round() perform the same tasks as the corresponding functions in R. For example, we can find all magnitude 6 earthquakes with
//event[floor(number(./mag)) = 6]
where each earthquake is of the form
<event>
<mag>number</mag>
....
</event>
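The floor() test can also be applied after the query, in the host language. The sketch below uses Python's standard library instead of R and assumes a made-up <quakes> wrapper element and invented magnitudes. It selects every <event> with a simple path and then keeps those whose <mag> value lies in the interval [6, 7).

```python
import math
import xml.etree.ElementTree as ET

# Invented sample data; <quakes> is a hypothetical wrapper element.
doc = """
<quakes>
  <event><mag>5.9</mag></event>
  <event><mag>6.0</mag></event>
  <event><mag>6.7</mag></event>
  <event><mag>7.1</mag></event>
</quakes>
"""
root = ET.fromstring(doc)

# Simple query first, then the numeric predicate in Python,
# mirroring //event[floor(number(./mag)) = 6].
# This assumes every <event> has a numeric <mag> child.
events = root.findall(".//event")
magnitude_6 = [e for e in events if math.floor(float(e.findtext("mag"))) == 6]

print([e.findtext("mag") for e in magnitude_6])
```

With this sample, the events with magnitudes 6.0 and 6.7 are kept, while 5.9 and 7.1 are dropped.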
Table 4.2: XPath Functions

Function             Input    Return Value
last()               node     Number of nodes in the context node-set.
position()           node     Position of the context node within the node-set.
count()              node     Number of nodes in the node-set.
id()                 node     Element with id matching the input string.
name()               node     Name of the first node in the node-set.
namespace-uri()      node     Namespace of the context node or the first node in the node-set.
concat()             string   One string that concatenates the strings provided as input arguments.
starts-with()        string   true if the first string passed to the function begins with the second string. For example, starts-with(@fileref, 'Images/')
contains()           string   true if the first string contains the second.
substring()          string   Portion of the string starting at the position given by the first value for a length of the second value.
substring-after()    string   Portion of the first string that appears after the first occurrence of the second string.
substring-before()   string   Portion of the first string that appears before the first occurrence of the second string.
string-length()      string   Number of characters in the string.
normalize-space()    string   String with leading and trailing whitespace stripped and internal runs of whitespace reduced to a single space.
translate()          string   Original string with each character that appears in the second string replaced by the character at the same position in the third string. For example, to change a, c, g and t to A, C, G and T respectively, we use translate(string(.), 'acgt', 'ACGT')

This table describes some of the important XPath (1.0) functions that we can use within XPath expressions.
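As a cross-check of the translate() semantics, the following Python sketch (the function name xpath_translate is hypothetical) maps each character found in the second argument to the character at the same position in the third. As in XPath 1.0, a character with no counterpart, because the third argument is shorter, is removed.

```python
def xpath_translate(s, from_chars, to_chars):
    """Emulate XPath 1.0 translate(s, from_chars, to_chars)."""
    mapping = {}
    for i, ch in enumerate(from_chars):
        key = ord(ch)
        if key in mapping:
            continue  # the first occurrence of a repeated character wins
        # None tells str.translate to delete the character.
        mapping[key] = to_chars[i] if i < len(to_chars) else None
    return s.translate(mapping)


print(xpath_translate("gattaca", "acgt", "ACGT"))
print(xpath_translate("--aaa--", "abc-", "ABC"))
```

The first call upper-cases the four nucleotide letters. The second shows the deletion rule: the hyphen has no counterpart in the third argument, so it is dropped.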
Additionally, XPath supports the usual arithmetic operators +, -, and *, along with div for division and mod for the remainder. The / character is the path separator, so it is not used for division.
We should note that these functions are useful not only in predicates. XPath is used in the related technology XSL for creating transformations of XML documents. In these cases, we can output the results of computations into text or nodes in other XML documents. For instance, we can convert references to JPEG file names to PNG by replacing the extension using substring-before(). Similarly, we can do calculations across nodes in a node-set to compute aggregates, e.g., using sum().
Unfortunately, the set of numeric functions in XPath is quite limited, not even including the log function. As a result, we often do computations in R. We can even use R functions within XSL templates by integrating R and the XSL transformation engine (XSLT) [3]. The Sxslt package [11] makes this possible, embedding XSLT in R and also R in XSLT.
It is important to note that getNodeSet() and related functions in R use XPath 1.0 via the libxml2 C-level library. XPath 2.0 provides additional and richer functions than are available in XPath 1.0, but we cannot use those within getNodeSet(), etc. While it would be convenient to use these additional functions, we do not actually need them. Instead, we can perform simpler XPath queries and then apply our own predicates or transformations to the nodes that getNodeSet() returns.
We have a much richer language and set of functions in R than is available in XPath 2.0. Therefore, we can combine XPath and R to perform the computations we need. Readers interested in more powerful facilities than XPath 1.0 might explore the XQuery language [2, 15]. The RXQuery package [10] is a prototype of integrating R and Zorba [5], an Open Source implementation of XQuery.