XPath, XPointer, and XInclude
4.8 Namespaces and XPath Queries
When we want to use XPath to query an XML document that uses namespaces, we need to do slightly more work: we must ensure that the namespaces in the XPath query match those in the document. Recall that we use namespaces in XML documents to identify the vocabulary to which a node or attribute name belongs. They allow us to disambiguate when the same name is used for a node or attribute in different senses within the same document. A namespace has a URI that uniquely identifies it, and a prefix, which is what we work with locally within our document. For example, we can use the node <code> to refer to R code. However, if we also want to refer to C code, we cannot use <code> for that too. Rather than knowing about the conflict a priori and using different names such as <rcode> and <ccode>, we use namespaces to differentiate between the two uses. For example, consider the following simple document:
<article xmlns="http://docbook.org/ns/docbook"
xmlns:r="http://www.r-project.org"
xmlns:c="http://www.C.org">
...
<section> <title>Random Numbers</title>
<para>...
<r:code><![CDATA[
x <- rnorm(1000) ]]></r:code>
<c:code><![CDATA[
for(int i = 0; i < n; i++) total += x[i];
]]></c:code>
</para>
</section>
</article>
Here we have defined two namespaces with their own URIs and local document-specific prefixes (r and c). These prefixes are used to qualify the <code> nodes, e.g., <r:code> and <c:code>.
Rather than having to qualify the many DocBook nodes with a prefix, we made the DocBook namespace the default that applies to all unqualified nodes. That is, the unqualified nodes are in the default namespace associated with the URI http://docbook.org/ns/docbook.
XPath expressions also use namespaces to qualify node names. When we need to refer to a namespace in an XPath query, we need to define the namespace as a URI and prefix pair and then use that prefix locally within the XPath expression. The XPath query effectively replaces the prefix with the URI and only considers a match if the fully qualified name matches. In R, we specify the namespace mappings via the namespaces parameter for each of the XPath functions, i.e., getNodeSet(), xpathSApply() and xpathApply(). We give this a named character vector where the names are our own choice of prefix for the namespace, and the values are the URIs. These URIs must match the corresponding namespace URIs in the target document exactly if we are to match any nodes. The prefixes, however, do not have to match at all as these are local to the XPath expression, just as the prefixes are local to the document. For example, to retrieve the <r:code> nodes in the above document we use
getNodeSet(doc, "//r:code",
           namespaces = c(r = "http://www.r-project.org"))
We did not have to use the same prefix (r) as that used in the target document. We can choose any prefix, but we do have to use that same prefix in both the XPath query and the namespaces argument to getNodeSet(). For example,
getNodeSet(doc, "//s:code",
           namespaces = c(s = "http://www.r-project.org"))
is equivalent to the previous command, but uses the prefix s rather than r.
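The following is a minimal, self-contained sketch of this point, assuming the XML package is installed. The document is a trimmed copy of the example above, with the elided parts dropped and the C code wrapped in CDATA so that the text parses. The variable names (txt, withR, withS, wrong) are our own and serve only as illustration.

```r
library(XML)

# Trimmed copy of the example document (elided content omitted).
txt <- '<article xmlns="http://docbook.org/ns/docbook"
         xmlns:r="http://www.r-project.org"
         xmlns:c="http://www.C.org">
  <section><title>Random Numbers</title>
    <para>
      <r:code><![CDATA[x <- rnorm(1000)]]></r:code>
      <c:code><![CDATA[for(int i = 0; i < n; i++) total += x[i];]]></c:code>
    </para>
  </section>
</article>'
doc <- xmlParse(txt, asText = TRUE)

# Same URI, different local prefixes: both queries select the same node.
withR <- getNodeSet(doc, "//r:code",
                    namespaces = c(r = "http://www.r-project.org"))
withS <- getNodeSet(doc, "//s:code",
                    namespaces = c(s = "http://www.r-project.org"))
sapply(withR, xmlValue)
sapply(withS, xmlValue)

# The URI, by contrast, must match exactly: a trailing slash is enough
# to make the query select nothing.
wrong <- getNodeSet(doc, "//r:code",
                    namespaces = c(r = "http://www.r-project.org/"))
length(wrong)
```

The last call illustrates why it is the URI, and not the prefix, that carries the identity of the namespace.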
We can, and sometimes must, use multiple namespaces within an XPath query. For example, suppose we wanted to get the <code> nodes for both R and C languages in our document. We can do this using a vector with multiple prefix = URI elements as follows:
getNodeSet(doc, "//s:code | //c:code",
namespaces = c(c = "http://www.C.org",
s = "http://www.r-project.org"))
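To see which nodes such a combined query returns, we can inspect the result. This is a sketch only, assuming the XML package is installed and using a trimmed, parseable copy of the example document; the names txt, ns and nodes are our own.

```r
library(XML)

# Trimmed copy of the example document (elided content omitted,
# C code wrapped in CDATA so that the "<" does not break parsing).
txt <- '<article xmlns="http://docbook.org/ns/docbook"
         xmlns:r="http://www.r-project.org"
         xmlns:c="http://www.C.org">
  <section><title>Random Numbers</title>
    <para>
      <r:code><![CDATA[x <- rnorm(1000)]]></r:code>
      <c:code><![CDATA[for(int i = 0; i < n; i++) total += x[i];]]></c:code>
    </para>
  </section>
</article>'
doc <- xmlParse(txt, asText = TRUE)

# Two prefix = URI pairs; "s" is our local prefix for the R namespace.
ns <- c(c = "http://www.C.org",
        s = "http://www.r-project.org")

# The XPath union operator | combines the two node sets.
nodes <- getNodeSet(doc, "//s:code | //c:code", namespaces = ns)
length(nodes)

# Qualified names of the matched nodes, and the code each one holds.
sapply(nodes, xmlName, full = TRUE)
trimws(sapply(nodes, xmlValue))
```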
Perhaps the most common confusion arises when querying a document that has a default namespace (as opposed to no namespace). We have to tell XPath about that default namespace. Recall that the choice of default namespace and prefix (or lack thereof) is the choice of the document’s author, i.e., local to the document. Similarly, the choice for the namespace prefix in our XPath query is local to us and not connected to the target document. Indeed, we want to be able to use the same XPath query across different instances of the same class of documents where some documents may have a default namespace and some may not, but they will all use the same namespace definitions (i.e., URIs).
Therefore, even if the document has a default namespace (i.e., with no explicit prefix), we have to explicitly identify and use that namespace in our XPath expression. For example, to query our document and find the <r:code> nodes in the first <section> node, we use the query
Namespaces = c(x = 'http://docbook.org/ns/docbook',
               r = 'http://www.r-project.org')
getNodeSet(doc, "//x:section[1]//r:code", namespaces = Namespaces)
(Note that it is a good idea to define the namespaces once in a character vector and reference it in calls to getNodeSet(). This avoids repeating them in different calls, having to change them in more than one place, and making typing errors in any of those places.)
In our XPath query, we have to map the document’s default namespace to an actual prefix, e.g., x, and then use that in our query to qualify the <section> node. If we did not introduce this explicit mapping to the default namespace, but instead used a query such as //section//r:code, we would get no matching nodes. This is because XPath is looking for a node named <section> with no namespace. It does not apply the document’s default namespace to unprefixed names in the query. Forgetting to deal with a default namespace is a common mistake that people make when using XPath initially.
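The difference is easy to check directly. The sketch below assumes the XML package is installed and uses a trimmed, parseable copy of the example document; the names txt, unqualified and qualified are ours.

```r
library(XML)

# Trimmed copy of the example document (elided content omitted).
txt <- '<article xmlns="http://docbook.org/ns/docbook"
         xmlns:r="http://www.r-project.org"
         xmlns:c="http://www.C.org">
  <section><title>Random Numbers</title>
    <para>
      <r:code><![CDATA[x <- rnorm(1000)]]></r:code>
      <c:code><![CDATA[for(int i = 0; i < n; i++) total += x[i];]]></c:code>
    </para>
  </section>
</article>'
doc <- xmlParse(txt, asText = TRUE)

# Unprefixed "section" means "section in no namespace", so nothing matches.
unqualified <- getNodeSet(doc, "//section//r:code",
                          namespaces = c(r = "http://www.r-project.org"))
length(unqualified)   # 0

# Mapping the default namespace to a prefix of our choosing fixes the query.
qualified <- getNodeSet(doc, "//x:section//r:code",
                        namespaces = c(x = "http://docbook.org/ns/docbook",
                                       r = "http://www.r-project.org"))
length(qualified)     # 1
```

An empty result from a query that looks correct is therefore a good prompt to check whether the document declares a default namespace.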
Basically, to deal with namespaces, we really need to know about the document before we try to query it. However, we often want to query it to find out about its contents. The getNodeSet() function and related functions try to help us in these situations. Firstly, if the document has namespaces defined on the root node, then we can find these via the xmlNamespaceDefinitions() function. This returns the namespace definitions on a particular node. However, the XPath functions such as getNodeSet() query the namespace definitions on the target document’s root node. We can get the individual namespace definitions in full form or in simplified form. The simplified form is sufficient for our needs here:
ns = xmlNamespaces(doc, simplify = TRUE)
"http://docbook.org/ns/docbook"
"r"
"http://www.r-project.org"
"c"
"http://www.C.org"
attr(,"class")
[1] "SimplifiedXMLNamespaceDefinitions"
[2] "XMLNamespaceDefinitions"
Here we see three namespace definitions. The prefixes are given in the names of the ns object’s elements, so we can find the default with
names(ns) == ""
Alternatively, we can use the function getDefaultNamespace() to find the default namespace, if there is one:
getDefaultNamespace(xmlRoot(doc))
"http://docbook.org/ns/docbook"
If there is no default namespace on our document’s root node, we get an empty vector.
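One way to put these definitions to work is to turn them into a namespaces vector for later queries. The following is a sketch under stated assumptions, not a prescribed recipe: it assumes the XML package is installed, uses a trimmed copy of the example document, and picks x as a stand-in prefix for the default namespace (any prefix not already used in the document would do). The names txt, prefixes, uris and found are our own.

```r
library(XML)

# Trimmed copy of the example document (elided content omitted).
txt <- '<article xmlns="http://docbook.org/ns/docbook"
         xmlns:r="http://www.r-project.org"
         xmlns:c="http://www.C.org">
  <section><title>Random Numbers</title>
    <para>
      <r:code><![CDATA[x <- rnorm(1000)]]></r:code>
      <c:code><![CDATA[for(int i = 0; i < n; i++) total += x[i];]]></c:code>
    </para>
  </section>
</article>'
doc <- xmlParse(txt, asText = TRUE)

# Namespace definitions on the root node, in simplified (prefix = URI) form.
ns <- xmlNamespaceDefinitions(xmlRoot(doc), simplify = TRUE)

# The default namespace has an empty prefix; give it a usable one.
prefixes <- names(ns)
uris <- as.character(ns)
prefixes[prefixes == ""] <- "x"   # "x" is our arbitrary choice
Namespaces <- structure(uris, names = prefixes)

# The resulting vector can be passed to any of the XPath functions.
found <- getNodeSet(doc, "//x:section//r:code", namespaces = Namespaces)
length(found)
```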
If there is a default namespace, we have to somehow express this in our XPath queries. We can have getNodeSet() (and the other XPath functions) assist us in this. If we do not specify an explicit definition for namespaces, but just a simple prefix, getNodeSet() will try to match that prefix to the target document’s set of namespace prefixes. If it does not match, but there is a default namespace, getNodeSet() maps the prefix we specified to the URI of the default namespace. This allows us to use an arbitrary prefix in our XPath queries to identify the default namespace without having to know or specify its URI. For example, to find the <para> nodes in the first <section>, we can use the command
getNodeSet(doc, "//x:section[1]//x:para", namespaces = "x")
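As a check on this shorthand, we can compare it with the fully specified form. This sketch assumes the XML package is installed and that the prefix-only behavior is as described in the text; it uses a trimmed copy of the example document, and the names txt, shorthand and explicit are our own.

```r
library(XML)

# Trimmed copy of the example document (elided content omitted).
txt <- '<article xmlns="http://docbook.org/ns/docbook"
         xmlns:r="http://www.r-project.org"
         xmlns:c="http://www.C.org">
  <section><title>Random Numbers</title>
    <para>
      <r:code><![CDATA[x <- rnorm(1000)]]></r:code>
      <c:code><![CDATA[for(int i = 0; i < n; i++) total += x[i];]]></c:code>
    </para>
  </section>
</article>'
doc <- xmlParse(txt, asText = TRUE)

# Prefix only: "x" matches no prefix in the document, so it is mapped
# to the document's default namespace.
shorthand <- getNodeSet(doc, "//x:section[1]//x:para", namespaces = "x")

# The same query with the URI spelled out.
explicit <- getNodeSet(doc, "//x:section[1]//x:para",
                       namespaces = c(x = "http://docbook.org/ns/docbook"))

length(shorthand)
length(explicit)
```

The shorthand is convenient for exploration. The explicit form makes the dependence on a particular URI visible, which may be preferable in code that is meant to be reused across documents.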