
Strategies for Extracting Data from HTML and XML Content

5.7 Element Handler Functions

Whether we use XPath or recursively traverse the tree directly with R code, the key concept is that we are traversing a tree. XPath queries do this in C code and so tend to be much faster. However, we frequently have to iterate over the resulting nodes in R and also make multiple XPath queries, and so traverse the tree multiple times. Even so, this is typically significantly faster than iterating recursively over all of the nodes in the tree with R functions such as xmlSApply().

The fact that we are traversing the tree suggests another possible approach. We can iterate over the entire tree in C (rather than with R code) and collect the information we want from the nodes as we process each node in the tree. The xmlParse() function allows us to do this. Instead of just asking for the parsed tree, we can pass a list of R functions via the handlers parameter to xmlParse(). If we provide a value for this handlers parameter, xmlParse() uses its elements to process the individual nodes as it makes a single pass over all of the nodes in the tree. The functions in our handlers list can extract information from each of the relevant nodes and combine and store the information in a "central" location. When xmlParse() has finished traversing all of the nodes in the tree, we can pick up this information as an R object constructed by the handler functions.

The handlers list must contain named elements that either match a node name or correspond to generic node names such as .startElement, .comment, .text and so on. This allows us to provide specific functions for processing particular nodes based on their name, while also providing catch-all functions that can process all nodes of a particular type, e.g., generic nodes, text nodes, comments.

If the handlers argument is specified, the XML parser consults it each time it attempts to traverse a new node in the tree:

• It looks at the name of the XML element and searches for an element in the handlers list with this name. If it finds an entry, it calls that function and passes it the XML node as its primary argument. Then, it takes the return value from this call and, if it is non-NULL, it adds that value to the tree. If it is NULL, it drops that node from the tree.

• If there is no matching element in the handlers list, then the parser looks for a general function in the handlers list for handling the particular type of XML node and uses that, if it exists. For example, for a text node it will look for a function named text() in the list of handler functions. The association between node type and function name in the handlers list is given in Table 5.1.

• Finally, for those nodes that do not have a matching handler (of either of the two types described above), the DOM parser proceeds as usual.
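The lookup order in the bullets above can be modeled in a few lines of plain R. This is only a sketch of the rule as described here, not the XML package's implementation: pickHandler() is a hypothetical helper, and the generic names used are the undotted ones from Table 5.1.

```r
# Hypothetical helper (not part of the XML package): models the order in
# which the parser is described as choosing a handler for a node.
pickHandler = function(nodeName, nodeType, handlers) {
    # 1. A handler named after the element takes precedence.
    if (!is.null(nodeName) && nodeName %in% names(handlers))
        return(handlers[[nodeName]])
    # 2. Otherwise fall back to the generic handler for this node type.
    if (nodeType %in% names(handlers))
        return(handlers[[nodeType]])
    # 3. No handler at all: the parser treats the node as usual.
    NULL
}

handlers = list(param        = function(node) "param handler",
                startElement = function(node) "generic element handler",
                text         = function(node) "text handler")

byName    = pickHandler("param", "startElement", handlers)  # specific handler
byType    = pickHandler("event", "startElement", handlers)  # generic fallback
forText   = pickHandler(NULL, "text", handlers)             # text nodes have no name
noHandler = pickHandler(NULL, "comment", handlers)          # NULL: default processing
```

The name-specific entry is consulted first so that a catch-all such as startElement never hides a handler written for one particular element.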

While we will use the handler functions to collect the content from the tree, the functions can return arbitrary objects which are then combined into a hierarchical structure that mimics the tree of nodes. This allows us to create our own tree rather than the C-level tree or the simple R tree of nodes. Next we provide an example for the earthquake data from the USGS introduced in Section 3.2.

Table 5.1: General DOM Handler Names

Node Type               Example                          Function name
----------------------  -------------------------------  ---------------------
XML element             <node>                           startElement
Text node               Simple text inside a node        text
Comment node            <!-- a comment -->               comment
CDATA node              <![CDATA[ literal text ]]>       cdata
Processing instruction  <?R library(XML)?>               processingInstruction
XML namespace           xmlns:r="http://www.r-proj..."   namespace
Entity reference        &gt;                             entity

This table describes the different elements we can specify in the list of functions passed to xmlTreeParse() or htmlTreeParse() via the handlers parameter. These functions respond to the different types of nodes the parser encounters in the tree/DOM as it traverses the tree after parsing it. The startElement() function will be called when processing a generic <xml> node. However, we can specify a more specific function to handle all nodes with a particular name by adding an entry to the list of handler functions with the name of the target nodes.


Example 5-10 Reading Earthquake Data with Handler Functions

Recall that the data are organized as a collection of <event> nodes within a root node named <merge>. Each <event> node is of the form

    <event id="71880980" network-code="NC"
           time-stamp="2012/11/12_16:25:49 " version="0">
      <param name="latitude" value="38.7953"/>
      <param name="longitude" value="-122.7520"/>
      <param name="depth" value="1.7"/>
      <param name="magnitude" value="0.9"/>
      <param name="magnitude-type-ext"
             value="Mcd = coda duration magnitude"/>
      <param name="num-stations-mag" value="4"/>
      <param name="stand-mag-error" value="0.0"/>
      ...
    </event>

We have attributes giving the time the earthquake was recorded, the network on which it was recorded and a unique identifier for the event. Details describing the event, such as where it occurred, its depth, magnitude and metadata about the details are all provided in <param> nodes with a name and value attribute.

Suppose we want to create a data frame with variables corresponding to the parameter names latitude, longitude, depth, magnitude, and rms-error. We can do this quite easily with XPath. For each of the variable names, we find all the <param> nodes with a name attribute that matches that variable name. We then extract the value of the corresponding value attributes. We can do all of this with

    doc = xmlParse("merged_catalog.xml")
    varNames = c("latitude", "longitude", "depth",
                 "magnitude", "rms-error")
    values = lapply(varNames,
                    function(var) {
                        # all <param> nodes whose name attribute is var
                        xp = sprintf("//param[@name = '%s']", var)
                        xpathSApply(doc, xp, xmlGetAttr, "value")
                    })
    names(values) = varNames

If some <event> nodes do not have a <param> for a given variable, we will end up with the values for the different variables misaligned across the events/observations. See Section 5.4 for a discussion of this issue.
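The misalignment is easy to reproduce on a small document. The following is a minimal sketch, assuming a two-event toy document written inline (it is not the USGS file) and a hypothetical helper getParam(). It shows one possible way to keep the variables aligned: query within each <event> node and substitute NA when a <param> is absent. It is not necessarily the remedy given in Section 5.4.

```r
library(XML)

# Toy document (illustrative values only): the second event has no rms-error.
txt = paste0('<merge>',
             '<event id="1"><param name="depth" value="1.7"/>',
             '<param name="rms-error" value="0.2"/></event>',
             '<event id="2"><param name="depth" value="3.0"/></event>',
             '</merge>')
doc = xmlParse(txt, asText = TRUE)
events = getNodeSet(doc, "//event")

# Hypothetical helper: value of one named <param> within a single event,
# or NA if that event has no such <param>.
getParam = function(ev, var) {
    val = xpathSApply(ev, sprintf("./param[@name = '%s']", var),
                      xmlGetAttr, "value")
    if (length(val) == 0) NA_character_ else val[[1]]
}

depth = sapply(events, getParam, "depth")
rms   = sapply(events, getParam, "rms-error")
```

Each result has one element per event, so the vectors can be combined column-wise without shifting later observations up to fill a gap. The price is one XPath query per event and variable rather than one per variable.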

Let's consider how we can do this with our handlers parameter for xmlParse(). We want to specify a function that will process each of the <param> nodes. That function will get the value of the name attribute of that node. If it is one of the variables we want to collect, it will add the value of the value attribute to an R vector for that variable. We can write our function as

    paramFun = function(node) {
        var = xmlGetAttr(node, "name")
        if(var %in% varNames)
            values[[var]] <<- c(values[[var]], xmlGetAttr(node, "value"))
    }

These handler functions typically use our familiar node manipulation functions such as xmlGetAttr() to process the internal node object they are passed. Note that we use nonlocal assignment to update a variable across calls to this function. This is the variable values. We have to create it before we call this function. We can create a list with an empty vector for each variable:

    values = structure(replicate(length(varNames),
                                 character(), simplify = FALSE),
                       names = varNames)

Note also that we concatenate the new value to the existing vector, which will be very inefficient, but we will return to these issues.
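The cost of that idiom can be seen in isolation, without any XML. The sketch below compares growing a character vector with c() against filling a pre-allocated one. The functions grow() and prealloc() and the size n are illustrative choices, and the timings will vary by machine, so none are shown.

```r
# Grow the vector one element at a time: each c() call allocates a new
# vector and copies the old contents into it.
grow = function(n) {
    x = character()
    for (i in seq_len(n))
        x = c(x, "a")
    x
}

# Allocate once, then assign by position.
prealloc = function(n) {
    x = rep(NA_character_, n)
    for (i in seq_len(n))
        x[i] = "a"
    x
}

n = 20000L
sameResult = identical(grow(n), prealloc(n))

system.time(grow(n))
system.time(prealloc(n))
```

Both functions return the same vector. Only the amount of copying differs, and the gap widens as n increases.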

Now that we have initialized the values object in which we will store the result and defined our sole handler function, we can parse the document and process the nodes with

    xmlParse("merged_catalog.xml", handlers = list(param = paramFun))

When this returns, we can examine values with

    sapply(values, length)

     latitude longitude     depth magnitude rms-error
         1291      1291      1291      1291      1277

We see that this now contains an observation for each event in the XML document. The rms-error element has fewer observations. This means that not all <event> nodes have a <param> node for this detail. This also means our XPath approach needs to be modified as we suggested in Section 5.4.

We can, however, adjust our strategy here relatively easily to a) handle the missing values in some <event> nodes, and b) also make our code more efficient.

Firstly, we want to avoid concatenating the current value from the <param> node being processed to the end of a vector. This causes R to create a copy of the old vector with one extra element and then to populate that. This is a very expensive idiom in R generally. Instead, we would like to pre-allocate a vector of the correct length, or at least a guess, and enlarge or shrink it as necessary. Unfortunately, we do not know the number of <event> nodes in our document. We can parse the document and then query this with xmlSize(xmlRoot(doc)). However, we want to parse and traverse the tree in one step. We can still use this code, but rather than parsing the document first, we can specify a handler function for our root node: <merge>. We can create the pre-allocated version of values in this function. We define it as

    mergeFun = function(node) {
        num = xmlSize(node)
        values <<- structure(replicate(length(varNames),
                                       rep(NA_character_, num),
                                       simplify = FALSE),
                             names = varNames)
        counter <<- 0
    }

This assumes all of the children of <merge> are <event> nodes, but that is merely a detail we can easily fix. Our handler function then assigns our list of template vectors to a nonlocal variable values. We need to create this before we run the code so that mergeFun() can assign to it.

We also create another variable counter which we will use to identify to which row/position we are currently adding. Each time we process an <event> node, we will increment this. When we insert the value for any of the <param> nodes, we will use this counter to specify the position in our vector. We can define the handler function for <event> as

    eventFun = function(node)
        counter <<- counter + 1L

We do not need to look at the node itself. If we wanted to collect the time-stamp or network information, we could store those also at this point. Again, we need to create this global variable counter before eventFun() or mergeFun() is called and tries to assign to it.
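As a sketch of that variation, the <event> handler below also records two attributes of each event. It assumes a toy inline document with made-up values for the second event, and the vector names stamps and networks are hypothetical. For brevity the vectors simply grow by index here. In the full example they would be pre-allocated in mergeFun() alongside values.

```r
library(XML)

# Toy document (illustrative values only), written without whitespace
# between elements.
txt = paste0('<merge>',
             '<event id="1" network-code="NC" time-stamp="2012/11/12_16:25:49">',
             '<param name="depth" value="1.7"/></event>',
             '<event id="2" network-code="AK" time-stamp="2012/11/12_16:30:00"/>',
             '</merge>')

# State updated by the handler; created before parsing.
counter  = 0L
stamps   = character()
networks = character()

# Variant of eventFun() that also stores attributes of the <event> node.
# The third argument to xmlGetAttr() is the default if the attribute is absent.
eventFun = function(node) {
    counter <<- counter + 1L
    stamps[counter]   <<- xmlGetAttr(node, "time-stamp", NA_character_)
    networks[counter] <<- xmlGetAttr(node, "network-code", NA_character_)
}

invisible(xmlParse(txt, asText = TRUE, handlers = list(event = eventFun)))
```

Because the attributes are read at the same moment the counter is incremented, stamps and networks stay aligned with the per-event positions used for the <param> values.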

We want to change our function paramFun() so that instead of concatenating the result, it uses counter to insert the value at the appropriate position. This is easily done with

    paramFun = function(node) {
        var = xmlGetAttr(node, "name")
        if(var %in% varNames)
            values[[var]][counter] <<- xmlGetAttr(node, "value")
    }

With these three handler functions defined, we can now pass them to xmlParse() via the handlers argument. There is one thing we have to specify, however. By default, xmlParse() processes the children of a node before it processes the node itself. Therefore, the handler for the <merge> node will not be called until all of the <event> nodes have been processed, and this involves processing all of the <param> nodes in each <event> node first. We need to change the order of evaluation so that the node is processed by the handler functions before the children. We indicate this via the parentFirst parameter:

    xmlParse("merged_catalog.xml", parentFirst = TRUE,
             handlers = list(param = paramFun, merge = mergeFun,
                             event = eventFun))

We can now examine the contents of values. We can turn this into a data frame and convert the variables from strings to numbers and factors as appropriate, e.g.,

    values = data.frame(lapply(values, as.numeric))
    summary(values)

        latitude        longitude          depth
     Min.   :-59.35   Min.   :-178.4   Min.   :  0.00
     1st Qu.: 34.34   1st Qu.:-142.1   1st Qu.:  3.10
     Median : 38.79   Median :-121.7   Median :  8.10
     Mean   : 39.63   Mean   :-115.2   Mean   : 19.26
     3rd Qu.: 48.22   3rd Qu.:-116.8   3rd Qu.: 14.80
     Max.   : 67.12   Max.   : 179.8   Max.   :635.10
       magnitude        rms.error
     Min.   :-0.500   Min.   :0.0000
     1st Qu.: 0.900   1st Qu.:0.0000
     Median : 1.400   Median :0.1550
     Mean   : 1.678   Mean   :0.2901
     3rd Qu.: 2.100   3rd Qu.:0.4600
     Max.   : 6.800   Max.   :3.5200
                      NA's   :11

We see the 11 NA values and all of the variables have values that appear to be sensible.
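Note that the column appears as rms.error in the summary although the parameter is named rms-error. By default data.frame() rewrites names that are not syntactically valid R names. A small sketch with made-up values shows the effect and how check.names = FALSE preserves the original name, should that be preferred:

```r
# Illustrative values only, stored as strings as the handlers collect them.
vals = list(latitude = c("38.7953", "61.2"),
            `rms-error` = c("0.07", NA))

# Default: names are passed through make.names(), so "-" becomes ".".
df1 = data.frame(lapply(vals, as.numeric))
names(df1)

# Keep the original parameter names; the column then needs backticks or [[ ]].
df2 = data.frame(lapply(vals, as.numeric), check.names = FALSE)
names(df2)
df2[["rms-error"]]
```

In both cases the missing string becomes a numeric NA, which is how the NA count in the summary arises.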

One final issue we have to deal with is avoiding the global variables values and counter. We show how to do this in the next example.

Example 5-11 Extracting Earthquake Information Using Handler Functions with Closures

A much better approach to Example 5-10 is to make the values and counter variables nonglobal, but shared across the three functions: mergeFun(), eventFun() and paramFun(). We can do this by defining a generator function that both defines these three handler functions and defines the shared variables within its body, i.e., with

    quakeHandlers = function() {
        counter = 0
        values = NULL

        paramFun = function(node) {
            var = xmlGetAttr(node, "name")
            if(var %in% varNames)
                values[[var]][counter] <<- xmlGetAttr(node, "value")
        }

        eventFun = function(node)
            counter <<- counter + 1L

        mergeFun = function(node) {
            num = xmlSize(node)
            values <<- structure(replicate(length(varNames),
                                           rep(NA_character_, num),
                                           simplify = FALSE),
                                 names = varNames)
            counter <<- 0
        }

        list(event = eventFun, merge = mergeFun, param = paramFun,
             .result = function()
                           data.frame(lapply(values, as.numeric)))
    }

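When we call this function, we obtain a list of functions. We can pass these directly to xmlParse(), as in

    h = quakeHandlers()
    # parentFirst = TRUE so that mergeFun() runs before the <event>
    # and <param> handlers, as in Example 5-10
    xmlParse("merged_catalog.xml", parentFirst = TRUE, handlers = h)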

We can then get the data frame with h$.result(). If we call quakeHandlers() again, we get a different list of functions. They behave the same way, but have their own variables values and counter. The two lists of functions operate independently of each other and each other's shared variables. Hence, we have removed the troublesome global variables and at the same time made creating and passing the event handlers easier.
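The independence of two sets of handlers follows from how closures work in R and can be checked without any XML. In the sketch below, makeCounter() is a hypothetical stand-in for quakeHandlers(): each call creates a fresh environment holding its own counter, and the functions returned from that call are the only ones that can reach it.

```r
# Hypothetical generator with the same shape as quakeHandlers(): private
# state plus a list of functions that share it.
makeCounter = function() {
    counter = 0L
    list(increment = function() counter <<- counter + 1L,
         .result   = function() counter)
}

a = makeCounter()
b = makeCounter()

a$increment()
a$increment()
b$increment()

a$.result()   # state owned by a
b$.result()   # state owned by b, unaffected by calls on a
```

Here `<<-` finds counter in the environment created by the enclosing makeCounter() call, so nothing is written to the global workspace.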

This approach of using handler functions is not likely to be as efficient as even multiple XPath queries over the entire tree. Indeed, comparing the two approaches for this earthquake data with only 1291 <event> nodes, the handler function approach is about 10 times slower than the XPath approach. One reason for this is that we are processing individual nodes and making calls to functions such as xmlName(), xmlChildren(), xmlGetAttr(). These are not vectorized and involve a lot of interpreted R code. If calling R functions were significantly faster and the code itself were faster, e.g., compiled to native machine code or byte-code, then this approach might be faster than or competitive with an XPath approach as we can avoid repeated traversal of the entire tree. (Temple Lang is working on using LLVM, the Low Level Virtual Machine, to compile R code to machine code and, in some simple cases, has seen dramatic speedup. Similarly, the byte-code compiler already in R can often improve the performance of code by a factor of three or four.) However, if we can compile R code generally, the speed of traversing a DOM tree in R code may also be significantly faster and competitive with the XPath or handler functions approach.

If the function handler mechanism is slow, why do we include a description of it? As we mentioned, it may get more competitive with compilation of R code. However, it is conceptually a different and important approach and it serves as a good introduction to a mechanism for parsing very large XML documents for which we cannot keep the entire DOM in memory.