Combining XPath Location Paths in a Single Query

XPath, XPointer, and XInclude

4.6 Combining XPath Location Paths in a Single Query

There is one final and important aspect of using XPath to match nodes, which we alluded to earlier: multiple criteria in a single query. For example, we might want to find all references to external files in an HTML document, including images in <img> nodes, JavaScript [4] code in <script> nodes in the <head> node of the document, and Cascading Style Sheet (CSS) files also referenced in the <head> node. Alternatively, when working with an article or a book, we might want the <title> nodes within a <table> or a <figure>. We can typically perform such multi-criteria queries in multiple passes, with a separate call to getNodeSet() for each individual criterion or query.

For instance, for the second of these examples, we can use

tbl = getNodeSet(doc, "//section//table/title")
fig = getNodeSet(doc, "//section//figure/title")

However, while not true in this case, performing multiple queries can mean that the same node appears in the results of more than one query. This destroys the simplicity of the set characteristic of a node-set, where we know each node is unique. It also involves traversing the tree multiple times, which can be expensive for very large trees.

XPath (and hence getNodeSet(), xpathApply(), etc.) allows us to combine multiple queries into a single query. We separate the individual queries using the | character, with or without surrounding spaces.

ti = getNodeSet(doc,
                "//section//table/title | //section//figure/title")

Importantly, by combining the two XPath queries, the nodes will be returned in the correct document order, i.e., the order in which the nodes appear in the document. This can be important if we need to process them in this order, and it would not be easily feasible if we had to perform multiple separate queries to get the nodes.

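The effect on ordering can be seen with a minimal sketch. The toy document below is an assumption made purely for illustration (it is not one of the data sets used in the text); it places a figure before a table inside one section.

```r
library(XML)

# Hypothetical toy document: a figure, then a table, then another figure.
txt = '<article>
  <section>
    <title>Results</title>
    <figure><title>Figure A</title></figure>
    <table><title>Table B</title></table>
    <figure><title>Figure C</title></figure>
  </section>
</article>'
doc = xmlParse(txt, asText = TRUE)

# Two separate passes: results are grouped by query, not by position.
tbl = xpathSApply(doc, "//section//table/title", xmlValue)
fig = xpathSApply(doc, "//section//figure/title", xmlValue)
separate = c(tbl, fig)

# One pass with the union operator: results follow document order.
combined = xpathSApply(doc,
                       "//section//table/title | //section//figure/title",
                       xmlValue)

print(separate)  # "Table B"  "Figure A" "Figure C"
print(combined)  # "Figure A" "Table B"  "Figure C"
```
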
Since these two queries are so similar in structure, there is a natural tendency to avoid repetition and combine the two into a shortened query such as

//section//(table|figure)/title

or

//section(//table|//figure)/title

These may seem sensible but, simply put, they are invalid XPath expressions and will cause an error. Each query separated by a | must be a complete and valid XPath location path in its own right.
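
One way to get some of the compactness those invalid forms aim for is a predicate on the self axis, which keeps everything in a single valid location path. This is a sketch of an alternative, not something the text above prescribes, and the toy document is assumed for illustration.

```r
library(XML)

# Hypothetical toy document with one figure and one table in a section.
txt = '<article><section>
  <title>Results</title>
  <figure><title>Figure A</title></figure>
  <table><title>Table B</title></table>
</section></article>'
doc = xmlParse(txt, asText = TRUE)

# Union of two complete location paths, as the | operator requires.
viaUnion = xpathSApply(doc,
                       "//section//table/title | //section//figure/title",
                       xmlValue)

# A single location path: match any descendant element, then keep only
# those that are themselves a table or a figure.
viaSelf = xpathSApply(doc,
                      "//section//*[self::table or self::figure]/title",
                      xmlValue)

print(viaUnion)
print(viaSelf)
```

Both forms should select the same <title> nodes here; the union form is closer to the approach described above and extends more naturally to paths that differ in more than one step.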

4.6.1 Programmatically Generating XPath Queries in R

Since we are working within R, we can create queries programmatically to help us. For example, rather than pasting the queries together ourselves, we can pass getNodeSet() and xpathApply() a character vector of individual queries. These functions will combine them into a single string, separating the queries with the | character. This allows us to keep related queries in a vector and to subset them to perform specific subqueries. To locate the <title> nodes within <table> or <figure> nodes in <section> nodes, we can pass getNodeSet() a vector of XPath expressions as

xpQueries = c(section = "//section/title",
              table = "//section//table/title",
              figure = "//section//figure/title")
ti = getNodeSet(doc, xpQueries[c("table", "figure")])

We can also change individual elements or add new ones without having to manipulate the single string containing all the queries, e.g.,

all.titles = getNodeSet(doc, xpQueries)

The three XPath queries in xpQueries have the same structure and differ only in the element in which we look for the <title>. We can create them more readily with

xpQueries = sprintf("//section%s/title",
                    c("", "//table", "//figure"))

This illustrates that we can create XPath queries programmatically in R by manipulating strings and substituting the values of R variables. The following examples illustrate this programmatic approach further.
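
The same idea can be wrapped in a small function. The helper below, makeTitleQueries(), is hypothetical (it is not part of the XML package), and the element name "example" is likewise just an illustration. The sketch only builds strings, so it needs no parsed document.

```r
# Hypothetical helper: build one "<root>//<name>/title" query per element
# name, named by element so the vector can be subset later.
makeTitleQueries = function(elements, root = "//section") {
    structure(sprintf("%s//%s/title", root, elements), names = elements)
}

xpQueries = makeTitleQueries(c("table", "figure", "example"))

# Collapse a subset of the queries into a single union query by hand.
q = paste(xpQueries[c("table", "figure")], collapse = " | ")
print(q)  # "//section//table/title | //section//figure/title"
```

Collapsing the queries explicitly with paste() makes the final query visible, which can help when debugging a generated expression.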

Example 4-4 Creating Multiple XPath Queries for Exchange Rates

We can create a query to get the exchange rates for different currencies with something like

currencies = c("USD", "JPY", "BGN")
q = sprintf("//Cube[@currency='%s']/@rate", currencies)

This gives us three separate queries and we can evaluate them separately to get the exchange rates for the different currencies with

exRates = lapply(q, function(q) as.numeric(getNodeSet(doc, q)))

and we can put them into a data frame with

as.data.frame(structure(exRates, names = currencies))
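
To try these steps end to end, here is a minimal sketch on a miniature document. The document, its dates, and its rate values are all made up for illustration, and it deliberately omits the namespaces a real exchange-rate file may carry. It also flattens each result with unlist() before converting, a small variation on the code above.

```r
library(XML)

# Hypothetical miniature exchange-rate document: two days, three currencies.
txt = '<Cube>
  <Cube time="2013-01-02">
    <Cube currency="USD" rate="1.3262"/>
    <Cube currency="JPY" rate="115.38"/>
    <Cube currency="BGN" rate="1.9558"/>
  </Cube>
  <Cube time="2013-01-03">
    <Cube currency="USD" rate="1.3102"/>
    <Cube currency="JPY" rate="114.24"/>
    <Cube currency="BGN" rate="1.9558"/>
  </Cube>
</Cube>'
doc = xmlParse(txt, asText = TRUE)

currencies = c("USD", "JPY", "BGN")
q = sprintf("//Cube[@currency='%s']/@rate", currencies)

# One query per currency; unlist() flattens the list of attribute values
# before they are converted to numbers.
exRates = lapply(q, function(x) as.numeric(unlist(getNodeSet(doc, x))))
rates = as.data.frame(structure(exRates, names = currencies))
print(rates)
```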

Should we want to, we can even compute values in one query and put them into another query. We demonstrate how in the next example.

Example 4-5 Using XPath Functions to Retrieve Loan Information for a Large Number of Kiva Loans

Each node in the collection of <loan> elements of the Kiva data set gives many details about the loan. These include the name of the person given the loan, their geo-location, the purpose of the loan, the amount, the history of payments, etc. We may be interested in all loans above the 90th percentile for the amount loaned. We could read each loan into R and then subset these based on the loan amount. However, this would involve processing the entire node for the 90 percent of nodes that we do not want.

Instead, we can get the value of the loan from the <funded_amount> node with

xpx = "//loan/funded_amount"
loanAmounts = as.numeric(xpathSApply(kiva, xpx, xmlValue))

where kiva contains the parsed XML document. Next, we find the quantile of interest with quantile(loanAmounts, .9) and use it to get the nodes that exceed this funded amount:

q = sprintf("//loan[string(funded_amount) > %.2f]", quantile(loanAmounts, .9))

bigLoans = getNodeSet(kiva, q)
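
The two-step pattern (compute a value with one query, substitute it into another) can be tried on a small stand-in document. The ten loans below are synthetic and only mimic the <loan>/<funded_amount> structure described above; they are not Kiva data.

```r
library(XML)

# Hypothetical stand-in for the Kiva data: ten loans of 100, 200, ..., 1000.
amounts = seq(100, 1000, by = 100)
loans = sprintf("<loan><id>%d</id><funded_amount>%d</funded_amount></loan>",
                seq_along(amounts), amounts)
txt = paste0("<loans>", paste0(loans, collapse = ""), "</loans>")
kiva = xmlParse(txt, asText = TRUE)

# Step 1: extract only the amounts and compute the cutoff in R.
loanAmounts = as.numeric(xpathSApply(kiva, "//loan/funded_amount", xmlValue))
cutoff = quantile(loanAmounts, .9)

# Step 2: substitute the computed cutoff into a second query.
q = sprintf("//loan[string(funded_amount) > %.2f]", cutoff)
bigLoans = getNodeSet(kiva, q)

# Inspect the amounts of the loans that were selected.
bigAmounts = sapply(bigLoans, function(node) xmlValue(node[["funded_amount"]]))
print(q)
print(bigAmounts)
```

With R's default quantile method the cutoff for these synthetic amounts is 910, so only the loan of 1000 is selected.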

Combining Queries and Location Paths

Location paths can be combined using the | operator, where the expression on each side must be a valid location path, e.g.,

/book/chapter/section[1]/table | /book/chapter/section[1]/figure

locates all tables and figures in the first section of each chapter. We can combine any number of queries together, not just two.

Complicated compound expressions may have nodes that match more than one subexpression. However, the node-set will contain unique nodes and these will be in document order, e.g.,

xpQ = c("//section[.//table]", "//section[.//figure]")

getNodeSet(book, xpQ)

locates all sections that have either a table or figure in them in the parsed book object book. Sections that have both will appear once in the node-set and the sections will be in the order that they appear in the document. This is very different from

sapply(xpQ, getNodeSet, doc = book)

This returns two lists of nodes (i.e., two node-sets), where sections that have both tables and figures appear in both, and the document order of the union of these two node-sets is lost.
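
A small sketch makes the difference concrete. The three-section book below and its id values are assumptions for illustration, and the union query is collapsed explicitly with paste() so that the single query string is visible.

```r
library(XML)

# Hypothetical book: s1 has a table, s2 has both, s3 has only a figure.
txt = '<book>
  <section id="s1"><table/></section>
  <section id="s2"><figure/><table/></section>
  <section id="s3"><figure/></section>
</book>'
book = xmlParse(txt, asText = TRUE)

xpQ = c("//section[.//table]", "//section[.//figure]")

# Single union query: each section appears once, in document order.
both = getNodeSet(book, paste(xpQ, collapse = " | "))
unionIds = sapply(both, xmlGetAttr, "id")

# Separate queries: section s2 shows up in both results.
separate = lapply(xpQ, function(q) sapply(getNodeSet(book, q), xmlGetAttr, "id"))

print(unionIds)  # "s1" "s2" "s3"
print(separate)  # "s1" "s2" and "s2" "s3"
```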

Document Order

XPath adds the matching nodes to a node-set in what is called “document order.” This defines which elements of an XML document are considered to be “before” other elements. The root element is the first node in the document. Next come any namespace definitions on that node. The attributes are next in order. Next come the child nodes and the order is defined in the same manner for each of those. The order within a set of namespaces on a node is implementation-dependent. The same is true for attributes within a node. Consider the following XML document:

<section id="...">
<para>
The following figure displays a hierarchy.
<figure>
<graphic width="6in" format="SVG"
         fileref="images/SDMXEnvelope-Sender.svg"/>
</figure>
The elements of the <xml/> are ordered.
</para>
</section>

We can get all of the elements (except the namespace definition) with an XPath query

getNodeSet(doc, "//text() | //* | //attribute::*")

and see the order of the elements. The <section> node comes first, then the id attribute. Next comes the <para> node and then its child nodes. The first of these is the text "The following ...". Next is the <figure> node and since it has no attributes, its children are next. This means the <graphic> element is next, then its attributes. After this is the next text child of the <para> node. This is, "The elements of the". After this is the node <xml> and finally we have the text node containing " are ordered".

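The ordering just described can be checked with a short sketch. Two things here are assumptions: the id value "s1" is invented for the purpose of the example, and the document is written without whitespace between tags so that no whitespace-only text nodes enter the node-set. The label() helper is hypothetical; it relies on attributes being returned as named character strings and other nodes as node objects.

```r
library(XML)

# The example document on one logical line; the id value is an assumption.
txt = paste0('<section id="s1"><para>The following figure displays a hierarchy.',
             '<figure><graphic width="6in" format="SVG" ',
             'fileref="images/SDMXEnvelope-Sender.svg"/></figure>',
             'The elements of the <xml/> are ordered.</para></section>')
doc = xmlParse(txt, asText = TRUE)

nodes = getNodeSet(doc, "//text() | //* | //attribute::*")

# Hypothetical helper: label attributes as "@name" and other nodes by their
# node name (text nodes report the name "text").
label = function(x) if (is.character(x)) paste0("@", names(x)) else xmlName(x)
ord = sapply(nodes, label)
print(ord)
```

The printed labels should follow the sequence given above: the section and its attribute, the para, its first text, the figure, the graphic and its attributes, and then the remaining text and <xml> nodes.
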
In document XML and Web Technologies for Data Sciences With R-Springer-Verlag New York (2014) (Page 118-121)