Strategies for Extracting Data from HTML and XML Content
5.8 SAX: Simple API for XML
We have now seen how to parse an XML document using the DOM approach and to provide functions that are called asynchronously, or, as needed, to convert nodes. In this section, we will use the same idea of providing handler functions, but in a slightly different way. The Simple API for XML, commonly known as SAX, is an alternative parsing model that differs from DOM parsing.
Unlike the DOM approach, the SAX parser never creates a tree or even an XML node. As the SAX parser reads content, it converts the bytes into tokens such as the start of an XML node (e.g., <lender id="14232">), the close of a node (</lender>), an entire processing instruction (<?xsl-stylesheet html.xsl?>), a comment (<!-- -->), or an entity (&lt;). As it encounters these different low-level tokens in the XML document, it generates events and invokes our handler/callback functions. These can then construct R objects that contain the data of interest. They can create a tree or any data structure for that matter, but critically, the parser does not create the tree.
SAX works in a linear manner on the XML stream, reading tokens from that stream until it finds enough to constitute an event. This leads to one of the biggest differences between the DOM and SAX models: SAX works top-down. In the DOM model, the parser collects all of the child nodes and their children and so on and then processes each node. When our handler function is called, it has access to the node and its descendants. The handler functions can then manipulate the entire node and its subcontents as a self-contained, meaningful unit. In the SAX model, however, we get information about the start of the parent node before we see anything about its subnodes. Our handler function is told of the start of a node, but it cannot process the child nodes. It must leave processing these children to calls to other handler functions. This means that we cannot transform a node into an R object in a single handler function since we do not have access to all the node’s descendants. Instead, we can often use the start of the parent node to create an empty or default object and then fill it in as we encounter the child nodes later in the XML stream, in separate calls to other handler functions. Only when we see the event that announces the closing of the parent node can we finalize the construction of the object corresponding to the complete node. In this way, the SAX model encourages a very incremental construction approach, typically one that involves sharing state across the callback functions to remember what object is currently being constructed. This is somewhat similar to the handler example in Example 5-11 (page 157). There we used one handler function to create the empty data frame, another to update a counter, and a handler for the <param> nodes to populate cells/elements of the data frame.
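The incremental style described above can be sketched on a tiny, made-up document. This is only an illustration under stated assumptions: the <lenders> input, the makeHandlers() function, and the done, current, and inName variables are hypothetical names, not part of the XML package. The handler for the start of a <lender> node creates a partial record, the text handler fills in the name, and the handler for the end of the node finalizes the record.

```r
library(XML)

# Hypothetical input, loosely modeled on the <lender> node mentioned above.
txt = paste0('<lenders>',
             '<lender id="1"><name>Ann</name></lender>',
             '<lender id="2"><name>Bob</name></lender>',
             '</lenders>')

makeHandlers = function() {
    done = list()      # completed records
    current = NULL     # record currently under construction
    inName = FALSE     # TRUE while we are inside a <name> node

    list(
        startElement = function(name, attrs) {
            if (name == "lender")          # parent starts: create a partial record
                current <<- list(id = unname(attrs["id"]), name = "")
            else if (name == "name")
                inName <<- TRUE
        },
        text = function(content) {
            if (inName)                    # text may arrive in several pieces
                current$name <<- paste0(current$name, content)
        },
        endElement = function(name) {
            if (name == "name")
                inName <<- FALSE
            else if (name == "lender") {   # parent closes: record is complete
                done[[length(done) + 1]] <<- current
                current <<- NULL
            }
        },
        results = function() done
    )
}

h = makeHandlers()
invisible(xmlEventParse(txt, handlers = h, asText = TRUE))
lenders = h$results()
```

No single handler ever sees a whole <lender> node; the record only becomes complete when the closing event arrives.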
The primary advantage of SAX is memory efficiency. The SAX parser does not incur the penalty of having both a tree and the target data structure in memory simultaneously. However, this typically comes at the expense of more complexity in the callbacks than one would have in the DOM processing. We will see in Section 5.10 that we can reduce this complexity by combining SAX with local DOM parsing and building nodes that we can manipulate as entire units.
We use the R function xmlEventParse() to implement SAX parsing. This function handles the XML input source in the same way that xmlParse() does, by assuming it is either a file name (either compressed or not), a remote URL, or a string containing the XML. As one might expect, the main difference between the functions is that, to be useful, you must supply callbacks to handle the different SAX events. As in DOM parsing, these functions are provided as a named list of functions via the handlers argument. The names of the list’s elements correspond to the SAX event types, which are listed in Table 5.2.
Table 5.2: Event Handlers Available for SAX Parsing
Example                              Function name
<node att1="value" att2="...">       startElement
</node>                              endElement
some text                            text
<!-- a comment .. -->                comment
<?R library(XML) ?>                  processingInstruction
%lt;                                 externalEntity
<!ENTITY % lt '<'>                   entityDeclaration
This table lists the different events that can arise in the SAX parser along with the names of the elements of the handlers list that are invoked to respond to such an event.
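As a rough illustration of how the element names in the handlers list select the callbacks, the sketch below registers functions for three of the events in Table 5.2 and records each call. The input string, the eventLog variable, and the note() helper are hypothetical and exist only for this example.

```r
library(XML)

# Hypothetical input, used only for illustration.
txt = '<doc><item id="a">first</item><item id="b">second</item></doc>'

eventLog = character()    # one entry per event, in the order reported
note = function(...) eventLog[length(eventLog) + 1] <<- paste(...)

# The names of the list elements determine which event each function handles.
handlers = list(
    startElement = function(name, attrs) note("startElement", name),
    endElement   = function(name) note("endElement", name),
    text         = function(content) note("text", content)
)

invisible(xmlEventParse(txt, handlers = handlers, asText = TRUE))
print(eventLog)
```

Events for which no handler is supplied, such as comments here, are simply not reported to R code.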
We will take a look at a simple and familiar example to show the basics of SAX parsing.
Example 5-12 Extracting Exchange Rates via SAX Parsing
We return to the daily euro exchange rates from the European Central Bank’s XML files that we saw in Example 2-3 (page 33). A snippet is shown below. Recall that the data are provided in the attribute values of nested <Cube> elements. One outer <Cube> contains the entire set of exchange rates. This element has a separate <Cube> element for each day in the dataset, representing the time/date dimension. These <Cube> nodes have a time attribute that gives the date. Within each of these “time” elements, there is another set of <Cube> nodes, one for each currency. These innermost <Cube> nodes have two attributes: currency, which has a three-letter abbreviation for the particular currency, and rate, which contains the exchange rate for that currency relative to the euro. The data look like
<Cube>
<Cube time="2006-10-06">
<Cube currency="USD" rate="1.2664"/>
...
</Cube>
<Cube time="2006-10-05">
<Cube currency="USD" rate="1.2721"/>
...
</Cube>
...
</Cube>
Our goal is to end up with a vector of dates (the time attributes) and the exchange-rate values for a subset of one or more of the currencies. We want the code to allow us to indicate which currencies to collect so that we can skip those that are not of interest. Suppose we want the rates for the US and New Zealand dollars (USD and NZD, respectively). We want to end up with a list with two numeric vectors, each of which contains the exchange rates for that currency. We also want a vector of the dates corresponding to those exchange rates.
As we encounter each <Cube> node with a time attribute, we will append that value to the end of the times vector. When we encounter a <Cube> node for one of the currencies, we will append the value of the rate attribute to the appropriate vector. Before we start, we need to first create R variables to store the values. We create a rates list with numeric vectors for each currency, and a times vector to store the date as it is encountered in the attributes of the <Cube> subelements.
We need a place where all of the different callback/handler functions can access and update these variables. Essentially, we need to make these shared objects available to the different callback functions and have any changes these functions make to the objects be visible to subsequent calls. We can do this with closures as we did in Example 5-11 (page 157). We define a function that defines and returns a list of handler functions and that defines variables that these handler functions share and can update. We do this with
saxHandlers =
function(currencies = c("USD", "NZD")) {
    rates = vector("list", length(currencies))
    names(rates) = currencies
    times = numeric()
    day = 0

    startElement = function(name, attrs) {
        ...  # to be defined
    }

    list(startElement = startElement,
         rateData = function()
                      list(times = times, rates = rates))
}
Here the day variable is used as a counter to keep track of where to add the next time value and exchange rate values. We use this when appending the values to the times vector and the individual elements of rates.
We will use the rateData() function (in the list we return) to access the results, i.e., the times and rates variables.
Note that the currencies to collect are specified by the caller of the saxHandlers() function. We have specified defaults, but we can collect different currencies with different collections of handler functions created with different calls to this function. We will develop the code for the startElement() function next.
The important part of our saxHandlers() function is the startElement() function. Since it is named startElement, this function is used to process every node in the XML document. We can return this function in the list with the name Cube to apply only to <Cube> nodes. However, since there are only <Cube> nodes in our XML document, it will see all of them. Moreover, this function needs to handle the different <Cube> nodes differently. The startElement() function is called with both the name of the node and the vector of attributes for the XML node. It can ignore the top-level <Cube> node, i.e., the one that has no attributes. It can also ignore any node not named <Cube>, should there be any in the document. When the <Cube> node has a time attribute, we want to increment day and append the value of the time attribute to the times vector. When the attributes contain a currency element, we will add the exchange rate to the appropriate element of rates. We can implement all of this with
startElement = function(name, attrs) {
    if (name != "Cube")
        return(NULL)

    if ("time" %in% names(attrs)) {
        day <<- day + 1
        times[day] <<- attrs["time"]
        return(TRUE)
    }

    if ("currency" %in% names(attrs) &&
          attrs["currency"] %in% currencies)
        rates[[attrs["currency"]]][day] <<- attrs["rate"]

    TRUE
}
Since we define startElement() within the body of the saxHandlers() function above, rather than as a regular top-level function, it will have access to the variables currencies, day, times, and rates.
Note that we use the non-local assignment operator (<<-) to make changes in these variables, so that the assignments update the shared variables rather than create local copies inside startElement().
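The sharing of variables through closures, and the role of <<-, can be seen in isolation in the following minimal sketch. It does not involve the XML package; makeCounter() and the elements of the list it returns are hypothetical names chosen for illustration.

```r
# Each call to makeCounter() creates a fresh 'count' variable that is
# shared by the two functions returned in the list.
makeCounter = function() {
    count = 0
    list(increment = function() {
             count <<- count + 1    # updates the shared variable
             invisible(count)
         },
         value = function() count)
}

a = makeCounter()
b = makeCounter()

a$increment()
a$increment()
b$increment()

a$value()    # reflects the two calls made through 'a'
b$value()    # unaffected by the calls made through 'a'
```

Had increment() used = instead of <<-, it would have created a local variable named count on each call and the shared value would never change.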
Now we can use these handlers to extract the data.
h = saxHandlers()
xmlEventParse("../Data/eurofxref-hist.xml", handlers = h)
exchange.rate = h$rateData()
We can then transform the results from strings to numbers and dates and, for example, plot the values
rates = as.data.frame(lapply(exchange.rate$rates, as.numeric))
rates = cbind(rates,
              date = as.Date(exchange.rate$times, "%Y-%m-%d"))
matplot(rates$date, rates[, -ncol(rates)])
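To try these steps without the data file, the following self-contained sketch runs the same handlers on the two days shown in the snippet earlier, passed as a string via asText = TRUE. The inline string and the choice to collect only USD are assumptions made for this illustration; the elided parts of the snippet are simply left out.

```r
library(XML)

# The two days shown in the snippet, with the elided parts omitted.
txt = '<Cube>
  <Cube time="2006-10-06"><Cube currency="USD" rate="1.2664"/></Cube>
  <Cube time="2006-10-05"><Cube currency="USD" rate="1.2721"/></Cube>
</Cube>'

saxHandlers = function(currencies = c("USD", "NZD")) {
    rates = vector("list", length(currencies))
    names(rates) = currencies
    times = numeric()
    day = 0

    startElement = function(name, attrs) {
        if (name != "Cube")
            return(NULL)
        if ("time" %in% names(attrs)) {       # a "day" node
            day <<- day + 1
            times[day] <<- attrs["time"]
            return(TRUE)
        }
        if ("currency" %in% names(attrs) &&   # a currency node we want
              attrs["currency"] %in% currencies)
            rates[[attrs["currency"]]][day] <<- attrs["rate"]
        TRUE
    }

    list(startElement = startElement,
         rateData = function() list(times = times, rates = rates))
}

h = saxHandlers(currencies = "USD")
invisible(xmlEventParse(txt, handlers = h, asText = TRUE))
exchange.rate = h$rateData()

# The handlers store strings; convert them after parsing.
usd = as.numeric(exchange.rate$rates$USD)
dates = as.Date(exchange.rate$times, "%Y-%m-%d")
```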
Since we do not need the handler functions after we have parsed the document and extracted the results by calling their rateData() element, we can extract the exchange rates in a single call
exchange.rate = xmlEventParse("../Data/eurofxref-hist.xml",
                              handlers = saxHandlers())$rateData()
In some cases, we do want to reuse the exact same handler functions and shared variables across different documents. In other cases, we may want to use the same functions, but re-initialize the shared variables. We do this by adding a reset element to the list of handler functions.
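One way such a reset element might look is sketched below. This is an assumption about its form rather than a definitive implementation: saxHandlersWithReset() is a hypothetical, cut-down variant that collects only the dates, and the handler is called by hand here to stand in for the parser.

```r
saxHandlersWithReset = function() {
    times = character()
    day = 0

    startElement = function(name, attrs) {
        if (name == "Cube" && "time" %in% names(attrs)) {
            day <<- day + 1
            times[day] <<- attrs["time"]
        }
        TRUE
    }

    # Restore the shared variables to their initial values so that the
    # same handler functions can be used for another document.
    reset = function() {
        times <<- character()
        day <<- 0
        invisible(TRUE)
    }

    list(startElement = startElement,
         reset = reset,
         rateData = function() list(times = times))
}

h = saxHandlersWithReset()
h$startElement("Cube", c(time = "2006-10-06"))   # simulate one SAX event
beforeReset = length(h$rateData()$times)
h$reset()
afterReset = length(h$rateData()$times)
```

Calling h$reset() between two calls to xmlEventParse() would keep the results of the second document separate from those of the first.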
Another approach for implementing our startElement() handler function would be to keep a count of the depth of the nodes, i.e., how many start and end events we have processed for <Cube> nodes. When we see the first one, we would have depth 1 and ignore the node. For a second <Cube> node, we would collect the time attribute and increment the depth to 2. When we encounter another <Cube> node, we recognize that this must be a currency rate node since the depth is 2. We can increment depth for each node and then decrement it for each closing event for each <Cube> node. In this way, the depth tells us how to process the node without looking at the attributes. To implement this, we need an endElement in our list of handler functions which decrements a depth variable. We would implement this something like
saxHandlers =
function(currencies = c("USD", "NZD")) {
    rates = vector("list", length(currencies))
    names(rates) = currencies
    times = numeric()
    day = 0
    depth = 0    # number of currently open <Cube> nodes

    startElement = function(name, attrs) {
        if (name != "Cube")    # count only <Cube> nodes
            return(NULL)
        depth <<- depth + 1
        if (depth == 2) {
            day <<- day + 1
            times[day] <<- attrs["time"]
        } else if (depth == 3 &&
                   attrs["currency"] %in% currencies)
            rates[[attrs["currency"]]][day] <<- attrs["rate"]
    }

    endElement = function(name) {
        if (name == "Cube")
            depth <<- depth - 1
    }

    list(startElement = startElement,
         endElement = endElement,
         rateData = function()
                      list(times = times, rates = rates))
}
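To see why the depth alone identifies the role of each <Cube> node, the sketch below records the depth and the attribute names at every start event for the two days in the snippet. The inline string and the seen variable are assumptions introduced for this illustration.

```r
library(XML)

# The two days shown in the snippet, with the elided parts omitted.
txt = paste0('<Cube>',
             '<Cube time="2006-10-06"><Cube currency="USD" rate="1.2664"/></Cube>',
             '<Cube time="2006-10-05"><Cube currency="USD" rate="1.2721"/></Cube>',
             '</Cube>')

depth = 0
seen = character()    # "<depth> <attribute names>" for each start event

handlers = list(
    startElement = function(name, attrs) {
        depth <<- depth + 1
        label = if (length(attrs)) paste(names(attrs), collapse = ",") else "(none)"
        seen[length(seen) + 1] <<- paste(depth, label)
    },
    endElement = function(name) depth <<- depth - 1
)

invisible(xmlEventParse(txt, handlers = handlers, asText = TRUE))
print(seen)
```

In this input, depth 1 always corresponds to the outer node, depth 2 to a node with a time attribute, and depth 3 to a currency node, which is the property the depth-based handlers rely on.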
For comparison, we provide an XPath approach to the problem. We pass over the tree several times: once for each currency for which we want the exchange rate, and again to get the date. The XPath expression for the dates extracts those <Cube> nodes that have a parent <Cube> and an attribute time. Note that we do not check the value for the attribute, only that it exists.
doc = xmlParse("eurofxref-hist.xml")
dates = xpathApply(doc, "//x:Cube/x:Cube[@time]",
                   xmlGetAttr, "time", namespaces = "x")
currency = c("USD", "NZD")
currencies = as.data.frame(
    lapply(currency, function(cr) {
        xp = sprintf("//x:Cube/x:Cube[@currency = '%s']", cr)
        as.numeric(xpathApply(doc, xp, xmlGetAttr, "rate",
                              namespaces = "x"))
    }))
names(currencies) = currency
currencies = cbind(currencies,
                   dates = as.POSIXct(strptime(dates, "%Y-%m-%d")))
summary(currencies)
      USD              NZD            dates
 Min.   :0.8252   Min.   :1.641   Min.   :1999-01-04
 1st Qu.:0.9664   1st Qu.:1.868   1st Qu.:2001-04-02
 Median :1.1466   Median :1.955   Median :2003-07-12
 Mean   :1.1214   Mean   :1.954   Mean   :2003-07-08
 3rd Qu.:1.2614   3rd Qu.:2.044   3rd Qu.:2005-10-09
 Max.   :1.4895   Max.   :2.292   Max.   :2008-01-15
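The same XPath queries can be tried on the two days from the snippet without the data file. In this sketch the inline string is an assumption made for illustration; because it declares no namespace, the x: prefix and the namespaces argument used above are not needed here.

```r
library(XML)

# The two days shown in the snippet, with the elided parts omitted.
txt = paste0('<Cube>',
             '<Cube time="2006-10-06"><Cube currency="USD" rate="1.2664"/></Cube>',
             '<Cube time="2006-10-05"><Cube currency="USD" rate="1.2721"/></Cube>',
             '</Cube>')

doc = xmlParse(txt, asText = TRUE)

# One pass over the tree for the dates ...
dates = xpathSApply(doc, "//Cube/Cube[@time]", xmlGetAttr, "time")

# ... and another for the rates of the currency of interest.
usd = as.numeric(xpathSApply(doc, "//Cube/Cube[@currency = 'USD']",
                             xmlGetAttr, "rate"))
```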
The SAX approach is less direct and more complicated than the equivalent process for the DOM model. This is because when we use the DOM approach we have the entire node and its subnodes within each callback and we can process all the information together. The SAX model requires us to construct the necessary information ourselves and store it so that we can process it when we have enough to make sense of it, i.e., the day and depth variables in our examples. Not only are we responsible for making sense of the information, we have the additional task of building the information from the low-level pieces the XML parser hands us, across calls to different functions. We definitely have more work to do when using a SAX parser. However, what this gives us is control over what intermediate information is created. When we have to be concerned with the potential for a DOM parser creating an excessive amount of the tree that we will never use, we can assume control and use SAX. Of course, if we are reading small data sets, it is easier to program using the DOM approach.
If we need to read a very large data set that will exceed the capacity of the DOM parser, we will need a SAX parser. Hence, we have this difficult trade-off of whether we implement both approaches and use them on different inputs depending on their (expected) size, or do we just implement a single, complicated parser? Unfortunately, there is no good, general answer to this problem. It will depend on the circumstances in which you find yourself. Issues such as the cost of developing the code compared with maintaining and (re)testing it will be important. What we can say is that SAX parsing is not as complex as it may appear.