
Parsing XML Content


• Access the root node via root = xmlRoot(doc)

• Operate on a node as if it is a list of its children, i.e., use [ and [[ to access elements in the tree. For example, we access the <e> node in the document with either of the following:

node3_1 = root[[3]][[1]]

node3_1 = root[["d"]][["e"]]

• The XML package provides functions for determining information about a node. These include xmlName(), xmlSize(), xmlAttrs(), xmlGetAttr(), xmlValue(), xmlNamespace(), and getDefaultNamespace(), which provide, in order, the node’s name, number of children, attributes, a specific attribute, text content of the node and its descendants, namespace, and default namespace. For example,

xmlValue(root[["b"]])

"text 2"

and xmlSize(node3_1) returns 0.

• In addition to [ and [[, other functions in XML enable us to work with a node’s siblings, children, parent, and ancestors. These are getSibling(), xmlChildren(), xmlParent(), and xmlAncestors(), respectively. For example, we retrieve the parent of node3_1 with

xmlParent(node3_1)

and the sibling following node3_1 with

getSibling(node3_1)

With these functions, we can traverse the tree from one node to another anywhere in the tree.

For example, from node3_1 we access the first child of <a> with either of these expressions:

xmlParent(xmlParent(node3_1))[[1]]

getSibling(getSibling(xmlParent(node3_1), after = FALSE), after = FALSE)

• The tree object behaves differently from regular R objects. When we make the assignment node3_1 = root[[3]][[1]], we have a reference to that point in the tree rather than a copy. Any operations on node3_1 will be made to the tree as well. For example, xmlParent(xmlParent(node3_1)) references the root of the parsed document, i.e., root.
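The points above can be tried on a small stand-in document. The XML string below is hypothetical (the document used for these bullets is not reproduced in this section); it is only shaped so that root[[3]][[1]] is an <e> node inside <d>. The attribute name status is likewise invented for illustration. A minimal sketch, assuming the XML package is installed:

```r
library(XML)

# Hypothetical stand-in document: <a> is the root, <d> is its third child,
# and <e> is the first child of <d>.
txt <- '<a><b>text 2</b><c id="c1">text 3</c><d><e/><f/></d></a>'
doc <- xmlParse(txt, asText = TRUE)
root <- xmlRoot(doc)

# Index by position or by name; both reach the same <e> node.
node3_1 <- root[[3]][[1]]
xmlName(node3_1)                  # "e"
xmlName(root[["d"]][["e"]])       # "e"

# Information about a node.
xmlSize(node3_1)                  # 0, since <e> has no children
xmlValue(root[["b"]])             # "text 2"
xmlGetAttr(root[["c"]], "id")     # "c1"

# Move around the tree.
xmlName(xmlParent(node3_1))                    # "d"
xmlName(getSibling(node3_1))                   # "f"
xmlName(xmlParent(xmlParent(node3_1))[[1]])    # "b", the first child of <a>

# Reference semantics: changing node3_1 changes the tree itself,
# so the new attribute is visible when we reach the node from root.
addAttributes(node3_1, status = "seen")
xmlGetAttr(root[[3]][[1]], "status")           # "seen"
```

Because node3_1 is a reference, no reassignment into root is needed after addAttributes(); this is the behavior the last bullet describes.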

3.4 Parsing Other XML Element Types

The xmlParse() function parses the entire document, including comments, text content, processing instructions, etc. These nodes are not all “regular” XML nodes. Indeed, a text node does not have a name or any attributes. The various types of nodes in the document each have their own class in R. A comment has class XMLInternalCommentNode, a processing instruction has class XMLInternalPINode, text content has class XMLInternalTextNode, and a CDATA section has class XMLInternalCDATANode (by default). These are all extensions of the base class XMLInternalNode. (See Section 2.5 for a description of the various elements of an XML document.) As xmlParse() encounters a node, it knows what type it is and maps it to the corresponding R object. To illustrate, let us work with the following simple DocBook document, which we have annotated to highlight the different kinds of elements it contains:

<?xml version="1.0" encoding="UTF-8"?> (1)
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:r="http://www.r-project.org"> (2)
<title>A Title</title>
<!-- A comment --> (3)
<para>
This paragraph includes text and a comment (4)
<!-- a comment in a paragraph -->
and a processing instruction <?R sum( 1, 3, 5) ?> (5)
The paragraph includes code in a CDATA node
<r:code><![CDATA[ (6)
x <- (y > 1 & z < 0)
]]></r:code> (7)
</para>
</article>

(1) The XML declaration along with the encoding information.

(2) The topmost root node with two namespace definitions.

(3) A comment node.

(4) Text content.

(5) A processing instruction.

(6) Escaped character data (CDATA).

(7) There is text content after the <r:code> element which consists of a space and a new-line character.
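The node classes named above can be checked directly with inherits(). The sketch below uses a hypothetical one-line paragraph rather than the annotated document, and the variable names (txt, para, kids, vals) are illustrative only. It assumes the XML package is installed; whether the trailing white-space node survives parsing is something the last line lets you inspect rather than something the sketch asserts.

```r
library(XML)

# Hypothetical paragraph mixing text, a comment, a processing
# instruction, and an element followed by a space and a new line.
txt <- "<para>some text<!-- note --> more text <?R sum(1, 3, 5) ?><code>x</code> \n</para>"
para <- xmlRoot(xmlParse(txt, asText = TRUE))
kids <- xmlChildren(para)

# inherits() tests each child against a single node class.
isText    <- sapply(kids, inherits, "XMLInternalTextNode")
isComment <- sapply(kids, inherits, "XMLInternalCommentNode")
isPI      <- sapply(kids, inherits, "XMLInternalPINode")

# Content of every child, then only the text pieces.
vals <- sapply(kids, xmlValue)
unname(vals[isText])

# Flag text children that hold nothing but white space.
isBlank <- isText & grepl("^[[:space:]]*$", vals)
names(vals)[isBlank]
```

Testing with inherits() rather than comparing class(node) to a single string is a deliberate choice here: each node carries more than one class, since the specific classes extend XMLInternalNode.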

This document contains comments, processing instructions, CDATA, and elements from the R DocBook extension of DocBook (see Example 2-1 (page 30)). The extension elements begin with the namespace prefix r. Those without a prefix are element names from the default namespace, DocBook.

The two namespaces are both declared on the root node, <article>.

We read this document into R with a call to xmlParse() and access the root node with xmlRoot(), as shown here:

rdbRoot = xmlRoot(xmlParse("simpleDoc.xml"))

We can look at the individual nodes in the usual manner using [[ and, e.g., confirm that the second child is a comment:

rdbRoot[[2]]

Additionally, we can access the text content within the comment with the xmlValue() function. That is, we can extract the information between the comment delimiters as follows:


xmlValue(rdbRoot[[2]])

[1] " A comment "

The xmlValue() function is generic, so it works on different types of nodes. For nodes that are mixtures of text content and other nodes, xmlValue() returns a character string that concatenates the text content of all the node’s descendants.
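Both behaviors of xmlValue() can be seen on a tiny example. The fragment below is hypothetical and deliberately written without white space between elements so that positions are unambiguous; it is a sketch assuming the XML package is installed, not the document used in this section.

```r
library(XML)

# Hypothetical fragment: a comment followed by a mixed-content element.
txt <- '<doc><!-- A comment --><p>one <b>two</b> three</p></doc>'
root <- xmlRoot(xmlParse(txt, asText = TRUE))

# On a comment node, xmlValue() gives the text between the delimiters.
xmlValue(root[[1]])   # " A comment "

# On an element with mixed content, it concatenates the text
# of all descendants, including the text inside <b>.
xmlValue(root[[2]])   # "one two three"
```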

Next, we explore the contents of the <para> node in rdbRoot:

names(rdbRoot[["para"]])

     text   comment      text         R      text      code      text
   "text" "comment"    "text"       "R"    "text"    "code"    "text"

These seven elements consist of text prior to the comment, the comment, text between the comment and processing instruction, the processing instruction, text immediately following the processing instruction, code, and the final text at the close of the paragraph. We check their classes with

sapply(xmlChildren(rdbRoot[["para"]]), class)

                    text                  comment
   "XMLInternalTextNode" "XMLInternalCommentNode"
                    text                        R
   "XMLInternalTextNode"      "XMLInternalPINode"
                    text                     code
   "XMLInternalTextNode" "XMLInternalElementNode"
                    text
   "XMLInternalTextNode"

The seventh node may seem unexpected because it does not appear that there is any text in the <para> node that follows the <r:code> node. With xmlValue(), we see that there is indeed a blank space followed by a new-line character:

xmlSApply(rdbRoot[["para"]], xmlValue)

text
"\nThis is text including < and a comment\n"
comment
" a comment in a paragraph "
text
"\nand a PI "
R
"sum( 1, 3, 5) "
text
"\nThis paragraph includes code in a CDATA node \n"
code
"\nx <- (y > 1 & z < 0) \n"
text
" \n"

We also have access to the namespace on a node with xmlNamespace(). For example, we can find the namespace on the <r:code> node in a DocBook document with, e.g.,

xmlNamespace(rdbRoot[[3]][[6]])

                         r
"http://www.r-project.org"
attr(,"class")
[1] "XMLNamespace"

This returns the URI associated with the r prefix on the node, which we see is http://www.r-project.org. Additionally, the getDefaultNamespace() function retrieves the default namespace declared on the top-level node in the document. We can pass it any node, e.g.,

getDefaultNamespace(rdbRoot[[3]][[6]])

"http://docbook.org/ns/docbook"

and the function retrieves the default namespace for the root node.
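The namespace lookups can be reproduced without the file simpleDoc.xml by parsing an inline string. The document below is a reduced, hypothetical stand-in that keeps only the two namespace declarations and one <r:code> element; the names txt, root, codeNode, and ns are illustrative. A sketch, assuming the XML package is installed:

```r
library(XML)

# Reduced stand-in for the sample document: the root declares the
# default (DocBook) namespace and the r prefix.
txt <- paste0(
  '<article xmlns="http://docbook.org/ns/docbook" ',
  'xmlns:r="http://www.r-project.org">',
  '<para>Code: <r:code>x = 1</r:code></para>',
  '</article>')
root <- xmlRoot(xmlParse(txt, asText = TRUE))
codeNode <- root[["para"]][["code"]]

# xmlNamespace() returns the URI, named by the prefix used on the node.
ns <- xmlNamespace(codeNode)
names(ns)          # "r"
as.character(ns)   # "http://www.r-project.org"

# As in the text, getDefaultNamespace() can be handed any node.
getDefaultNamespace(codeNode)
```

Wrapping the result in as.character() drops the name and the XMLNamespace class, which is convenient when the URI is compared with a plain string.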

In document XML and Web Technologies for Data Sciences With R-Springer-Verlag New York (2014) (Page 87-90)