XPath, XPointer, and XInclude
4.1 Getting Started with XPath
When we work with XML documents, we typically want to extract data from them and bring these data into R structures such as vectors or data frames. (We will discuss creating XML documents within R in Chapter 6.) For example, we may want to extract: R code from the examples in this book, daily exchange rates for the yen in an HTML document, lender information from a Kiva response to a Web service request, or articles published by a particular author in the Journal of Statistical Software. (JSS provides its bibliographic data in XML.)
In the previous chapter, we saw how we can manipulate a node in R to get its name, attributes, namespaces, text content, and children. This small collection of functions (xmlName(), xmlAttrs(), xmlNamespace(), xmlValue(), and xmlChildren()) allows us to process an entire tree, as we can recursively traverse the hierarchy by processing the root node and then its children and their children and so on. In theory, we have all the functionality we need to extract data from any XML tree using recursive functions. With the xmlParent() function, we can even go back up the tree rather than only working downwards. However, many people find recursive functions difficult to understand and write. Also, it is slightly challenging to collect the results across function calls as we descend the tree. Closures and lexical scoping in R can help here, but again not all R users are familiar or comfortable with these concepts. Fortunately, there is a technology associated with XML named XPath [9, 14] that frees us from having to recursively traverse the tree with R code.
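To make the recursive approach concrete, the following is a minimal sketch of a traversal that uses a closure to collect results as it descends the tree. The HTML fragment and the helper name collectLinks() are made up for illustration; only xmlName(), xmlGetAttr(), xmlChildren(), xmlRoot(), and htmlParse() come from the XML package.

```r
library(XML)

# A small in-memory HTML fragment (hypothetical content for illustration).
html = '<html><body>
  <p>See <a href="http://www.r-project.org">R</a> and <a name="top">here</a>.</p>
  <ul><li><a href="http://www.omegahat.org">Omegahat</a></li></ul>
</body></html>'
doc = htmlParse(html, asText = TRUE)

# collectLinks() is a hypothetical helper. The inner function visit()
# shares the variable 'links' with its enclosing function (a closure),
# so results can accumulate across the recursive calls.
collectLinks = function(root) {
  links = character()
  visit = function(node) {
    if (xmlName(node) == "a") {
      href = xmlGetAttr(node, "href")   # NULL if there is no href attribute
      if (!is.null(href))
        links <<- c(links, href)
    }
    # Recurse into the children, then their children, and so on.
    for (kid in xmlChildren(node))
      visit(kid)
    invisible(NULL)
  }
  visit(root)
  links
}

links = collectLinks(xmlRoot(doc))
```

Even for this simple task, the code has to manage the recursion and the shared state itself, which is the bookkeeping described above.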
D. Nolan and D. Temple Lang, XML and Web Technologies for Data Sciences with R, Use R!, DOI 10.1007/978-1-4614-7900-0_4, © Springer Science+Business Media New York 2014
Suppose we want to process an HTML document and extract the URLs for all of the links it contains. We are not interested in where these nodes appear in the document; we are just interested in the value of each href attribute of the <a> nodes. Consider how involved it would be to write a function to loop over all nodes in the tree and extract the value of the href attribute if and only if the node name is “a”. We would have to start at the root node of the tree, check its name, and then recursively do the same thing for each of the node’s children. In contrast, if we have a list of all of the <a> nodes that have an href attribute, we can loop over these in R to get the value of the href attribute with
links = sapply(listOfANodes, xmlGetAttr, "href")
We can use XPath to get this list of <a> nodes using, e.g.,
doc = htmlParse("http://www.omegahat.org")
listOfANodes = getNodeSet(doc, "//a[@href]")
The XPath expression //a[@href] is very succinct and means essentially: find all nodes named “a” throughout the tree that have an attribute (@) named “href”. This expression uses some shorthand in the XPath language for common operations which we will explain later in this chapter, but it should be clear that XPath is very succinct and powerful. We do not need to worry about where the nodes are in the tree. Also, we can add some constraints on the <a> nodes. For example, we can require them to be within a <table> node, in a <table> with a class attribute value "data", or within an ordered list node (i.e., <ol>) which has at least three list items. This is the power of XPath.
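As a sketch of how such constraints might be written, the following evaluates three XPath expressions against a small made-up HTML fragment. The document content and the helper hrefs() are hypothetical; the XPath expressions are one plausible way to express the constraints described above, not the only one.

```r
library(XML)

# Hypothetical HTML fragment with links in several different contexts.
html = '<html><body>
  <a href="top.html">top</a>
  <table class="data"><tr><td><a href="d1.html">d1</a></td></tr></table>
  <table class="layout"><tr><td><a href="l1.html">l1</a></td></tr></table>
  <ol><li><a href="o1.html">o1</a></li><li>two</li><li>three</li></ol>
  <ol><li><a href="short.html">short</a></li></ol>
</body></html>'
doc = htmlParse(html, asText = TRUE)

# hrefs() is a hypothetical convenience wrapper around xpathSApply().
hrefs = function(xpath)
  unname(xpathSApply(doc, xpath, xmlGetAttr, "href"))

# <a> nodes anywhere inside any <table>
inTable = hrefs("//table//a[@href]")

# <a> nodes inside a <table> whose class attribute is "data"
inDataTable = hrefs("//table[@class = 'data']//a[@href]")

# <a> nodes inside an <ol> that has at least three <li> children
inLongList = hrefs("//ol[count(li) >= 3]//a[@href]")
```

Each constraint is a small change to the query string; the R code that evaluates it stays the same.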
Much the same as with regular expressions, XPath is a separate language from R and XML, and typically consists of short strings that express a query. We can form these in R and then use the R function getNodeSet() to evaluate the XPath query and return the matching elements of the tree to which we applied the query. Note that once we retrieved the list of <a> nodes above, we used the function xmlGetAttr(), which we saw in earlier chapters, to retrieve the URL for the link. If we want the text displayed for the link in the HTML page, we can apply the xmlValue() function to each of the <a> nodes to get the text content, or we can process the children with xmlChildren(). Generally, we use XPath to find nodes and then we process them in R. In this particular example, we can combine finding the nodes and getting the href attribute in either of two ways:
xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")
or, entirely within XPath with
getNodeSet(doc, "//a/@href")
The latter shows that we can actually return attributes and not just nodes from an XPath query. The xpathSApply() function (and similarly xpathApply()) is a generalized version of getNodeSet() that allows us to find elements of a tree and apply a function to each of them in a single R command.
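The following sketch compares these approaches on a small made-up HTML fragment. The document content is hypothetical. Because the exact form in which getNodeSet() returns attribute values is an assumption here, the code flattens that result to plain strings before using it.

```r
library(XML)

# Hypothetical fragment: two links and one <a> node without an href.
html = '<html><body>
  <a href="a.html">A</a> <a name="anchor">no link</a> <a href="b.html">B</a>
</body></html>'
doc = htmlParse(html, asText = TRUE)

# 1. Find the nodes with XPath, then process them in R.
nodes = getNodeSet(doc, "//a[@href]")
viaSapply = sapply(nodes, xmlGetAttr, "href")

# 2. Find the nodes and apply the function in a single call.
viaXPathSApply = xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")

# 3. Select the attributes themselves within XPath.
#    unlist() and as.character() reduce the result to a plain
#    character vector, whatever names or classes it carries.
viaXPath = as.character(unlist(getNodeSet(doc, "//a/@href")))

# The text displayed for each link, via xmlValue().
linkText = xpathSApply(doc, "//a[@href]", xmlValue)
```

All three approaches should give the same two URLs here; which one to use is largely a matter of what further processing is needed in R.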
The following example demonstrates the usefulness of XPath when working with a large data set.
Example 4-1 Efficient Extractions from a Michigan Molecular Interactions (MiMI) Document
The Michigan Molecular Interactions (MiMI) database [8] is part of the National Institutes of Health’s National Center for Integrative Biomedical Informatics (http://www.ncibi.org) and is available at http://mimi.ncibi.org/MimiWeb/. MiMI provides access to data from several curated protein interaction databases for people studying systems biology and gene pathways and their interactions. The data are available via a Web service, but they are also available as an XML file. This file is reasonably large, with over 25,000 top-level <molecule> nodes. It is 6 megabytes when compressed and 70 megabytes as raw text. We can parse the document without uncompressing it via xmlParse() and the call
system.time(mi1 <- xmlParse("~/XML/mi1.txt.gz"))
Depending on the machine and amount of memory, this takes between 2 and 4 seconds to parse the entire 70 MB. We have not converted any of the content into R objects, but merely have a reference to the C-level tree. There are almost 3 million nodes in this tree, and so traversing the entire hierarchy with R functions would be extremely time-consuming.
The basic structure is a collection of <molecule> nodes that look something like
<molecule>
<prov><im><imid>30</imid></im></prov>
<moleculeID>116226</moleculeID>
<moleculeType>protein
<prov><im><imid>30</imid></im></prov>
</moleculeType>
<organismID>9606
<prov><im><imid>30</imid></im></prov>
</organismID>
<id><prov><im><imid>30</imid></im></prov>
<idType>HGNC</idType><idValue>9859</idValue></id>
<name>RAP1GDS1 <prov><im><imid>30</imid></im></prov> </name>
<name>GDS1 <prov><im><imid>30</imid></im></prov> </name>
<name>MGC118859 <prov><im><imid>30</imid></im></prov> </name>
<name>MGC118861 <prov><im><imid>30</imid></im></prov> </name>
<variant>
<prov><im><imid>30</imid></im></prov> <variantID>0</variantID>
</variant>
<interaction><interactionRef>93569</interactionRef>
<moleculeRef>116280</moleculeRef>
<moleculeName>RAC1</moleculeName>
<selfVariantRef>0</selfVariantRef>
<partnerVariantRef>0</partnerVariantRef>
</interaction>
<interaction><interactionRef>104132</interactionRef>
<moleculeRef>103040</moleculeRef>
<moleculeName>RHOA</moleculeName>
<selfVariantRef>0</selfVariantRef>
<partnerVariantRef>0</partnerVariantRef>
</interaction>
<interaction><interactionRef>121818</interactionRef>
<moleculeRef>74726</moleculeRef>
<moleculeName>MBIP</moleculeName>
<selfVariantRef>0</selfVariantRef>
<partnerVariantRef>0</partnerVariantRef>
</interaction>
</molecule>
A task we were asked to do was to find the content of the <moleculeName> nodes within the <molecule> nodes for only those <molecule> nodes that have a <name> node containing the string 'frm-1'. For example, the following <name> node satisfies the requirement on the <name> node’s value:
<name>frm-1<prov><im><imid>30</imid></im></prov></name>
We can get the list of <molecule> nodes that match this criterion using XPath with the command
mol = getNodeSet(mi1, "/*/molecule[.//name/text() = 'frm-1']")
This searches the entire tree of 25,452 molecules and returns the two matching nodes in about four-tenths of a second.
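Since the MiMI file itself is large, the behavior of this predicate can be explored on a miniature stand-in. The following sketch uses a made-up document with the same element names; the root element name <molecules>, the IDs, and the molecule names are all hypothetical.

```r
library(XML)

# Hypothetical miniature of the MiMI structure (not real MiMI data).
xml = "<molecules>
  <molecule>
    <moleculeID>1</moleculeID>
    <name>frm-1<prov><im><imid>30</imid></im></prov></name>
    <interaction><moleculeName>AAA</moleculeName></interaction>
    <interaction><moleculeName>BBB</moleculeName></interaction>
  </molecule>
  <molecule>
    <moleculeID>2</moleculeID>
    <name>other</name>
    <name>frm-1</name>
    <interaction><moleculeName>CCC</moleculeName></interaction>
  </molecule>
  <molecule>
    <moleculeID>3</moleculeID>
    <name>frm-10</name>
    <interaction><moleculeName>DDD</moleculeName></interaction>
  </molecule>
</molecules>"
doc = xmlParse(xml, asText = TRUE)

# Keep only the <molecule> children of the root that have a descendant
# <name> node with a text child exactly equal to 'frm-1'.
mol = getNodeSet(doc, "/*/molecule[.//name/text() = 'frm-1']")

# Identify which molecules were matched.
ids = sapply(mol, function(node) xmlValue(node[["moleculeID"]]))
```

In this sketch the first two molecules match. The third does not, because the comparison with = tests the text node for equality with 'frm-1' rather than for containing it, and a molecule with several <name> nodes matches as soon as any one of them satisfies the test.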
Now that we have these nodes, we can loop over them in R and fetch the text in the <moleculeName> node within each <interaction> node. Without XPath we can do this with
lapply(mol, function(node)
         sapply(node[names(node) == "interaction"],
                function(x) xmlValue(x[["moleculeName"]])))
This is not very complicated as the structure of these <molecule> nodes is quite simple. We demonstrate how we can use the XPath expression
.//interaction/moleculeName/text()
to find the same information in a given <molecule> node. This XPath expression translates to:
starting at this current node in the tree, look at all its descendants for nodes named <interaction>;
for each of these find its child nodes named <moleculeName>; and for these <moleculeName>
nodes, extract the child nodes that are just text. We use this XPath expression on each node from R with
xpexpr = ".//interaction/moleculeName/text()"
lapply(mol, function(node) xpathSApply(node, xpexpr, xmlValue))
yielding
[[1]]
[1] "alecting"
[[2]]
[1] "09H1.6W"
in about two-thousandths of a second!
If we just want the names of the molecules in the <interaction> nodes matching 'frm-1' and do not care about the molecule with which they interacted, we can make the entire query within a single XPath expression by combining the two steps, i.e.,
xpexpr = "/*/molecule[.//name/text() = 'frm-1']//interaction/moleculeName"
int = xpathSApply(mi1, xpexpr, xmlValue)
This is very fast, about half a second. If there are many matching <molecule> nodes, this can be faster than looping over them in R. The key point is that we can use either or both languages to find the “best” (fastest or easiest) solution.
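The trade-off between the two approaches can be seen on a small made-up document. In the sketch below, the document content is hypothetical and only mimics the MiMI element names. The two-step version keeps the results grouped by molecule, while the single XPath expression returns one flat vector.

```r
library(XML)

# Hypothetical miniature of the MiMI structure (not real MiMI data).
xml = "<molecules>
  <molecule>
    <name>frm-1</name>
    <interaction><moleculeName>AAA</moleculeName></interaction>
    <interaction><moleculeName>BBB</moleculeName></interaction>
  </molecule>
  <molecule>
    <name>frm-1</name>
    <interaction><moleculeName>CCC</moleculeName></interaction>
  </molecule>
  <molecule>
    <name>unrelated</name>
    <interaction><moleculeName>DDD</moleculeName></interaction>
  </molecule>
</molecules>"
doc = xmlParse(xml, asText = TRUE)

# Two steps: select the molecules, then evaluate a relative XPath
# expression (note the leading '.') on each matching node from R.
mol = getNodeSet(doc, "/*/molecule[.//name/text() = 'frm-1']")
perMolecule = lapply(mol, function(node)
  xpathSApply(node, ".//interaction/moleculeName/text()", xmlValue))

# One step: a single XPath expression evaluated on the whole document.
allNames = xpathSApply(doc,
  "/*/molecule[.//name/text() = 'frm-1']//interaction/moleculeName",
  xmlValue)
```

Here perMolecule is a list with one element per matching molecule, whereas allNames contains the same names without any record of which molecule each came from.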
At this point, we have seen some examples of the power and efficiency of XPath. In the next section, we give an informal introduction to XPath and describe how to think about it heuristically. In Section 4.3, we describe the full syntax and computational model that underlies XPath. Then in Section 4.4 we discuss some of the XPath functions we can use within queries which make the language more powerful, and in Sections 4.5 and 4.6, we address how to create compound expressions. In Section 4.7, we see how to use XPath in the context of some short examples, and we provide a case study that demonstrates more complex expressions. Section 4.8 covers how to work with namespaces. We