Strategies for Extracting Data from HTML and XML Content
5.4 Extracting Multiple Variables From XML Content
It is very natural to use different XPath queries to retrieve all of the values in an XML document for a given variable. We can then combine the resulting vectors into a data frame in R and then we have the data we want from the XML document. Unfortunately, this approach sometimes runs into a particular problem. To illustrate the issue, we consider a very simple made-up example of an XML document displayed below. We have two observations—people—and three variables recording each person’s first name, middle initial, and surname. For one of the records, we do not have the middle initial.
<data>
<obs id="1">
<firstname>Deborah</firstname>
<surname>Nolan</surname>
</obs>
<obs id="2">
<firstname>Duncan</firstname>
<initial>W</initial>
<surname>Temple Lang</surname>
</obs>
</data>
Suppose we want to create a data frame with the values of <firstname> and <surname> as two separate columns. We can do this quite simply with
doc = xmlParse("MissingObsNode.xml")
data = data.frame(firstname = xpathSApply(doc, "//obs/firstname", xmlValue),
surname = xpathSApply(doc, "//obs/surname", xmlValue))
The problem arises if we also want to have a column for the initial. We can get this from the observa-tions in the same way with
data$initial = xpathSApply(doc, "/data/obs/initial", xmlValue) However, thexpathSApply()call returns a vector with only one element since only one node matches the XPath expression. When we add this to the data frame, the values are recycled and we end up with an initial W for both observations. The problem is that the <initial> node is optional and not present in all of the observations. When we extract the value for this optional variable via XPath , we lose the association between the values and the observation. The single value we got was for the second observation, not the first.
This fabricated example illustrates the problem of having optional nodes/variables within XML nodes that we treat as observations. When we use XPath to extract the values of individual variables one at a time, we may be processing different numbers of observations and the results may not be aligned correctly across the observations. We need to ensure that the values of the separate variables are associated with the correct observations. There are a few ways to do this.
This is not an abstract problem. It arises in many situations since XML allows us to have differ-ent child nodes within nodes of the same name. An example of this is information about lenders in the Kiva data that we introduced in Section1.3 and read in Example3-1 (page54). The data de-scribe different individuals (or groups) that provide micro loans. Each lender has a <lender id>, a <name>, an <occupation> and so on. However, most but not all of the <lender> nodes also have a <loan count> node giving the number of loans that lender has made since joining Kiva.
Even fewer have a <country code>, and very few have an <invitee count>. We can find this information using some simple R commands on the <lender> nodes within the document. We calculate the names of the child nodes within each <lender> node and then count the total number of occurrences of each node name across all the lenders, i.e.,
lendersDoc = xmlParse("lenders1.xml")
lenderNodes = xmlRoot(lendersDoc)[["lenders"]]
varNames = xmlSApply(lenderNodes, names) table(unlist(varNames))
country_code image invitee_count lender_id
957 686 334 1000
loan_because loan_count member_since name
1000 985 1000 1000
occupation occupational_info personal_url uid
148 5 Strategies for Extracting Data from HTML and XML Content
1000 1000 1000 1000
whereabouts 948
We start by creating the basic data frame with variables that are present in all of the <lender>
nodes, e.g.,lender id,loan because,name,occupation. We can get these with
varNames = c("lender_id", "loan_because", "name", "occupation") vars = lapply(varNames,
function(var) {
xp = sprintf("//lender/%s", var) xpathSApply(lendersDoc, xp, xmlValue) })
names(vars) = varNames
Now if we try to do the same operation to get the values of the <loan count> nodes, we will end up with a vector of 985 values, and not 1000. We have lost the association between the values and the observation/row in our would-be data frame. In this case, the observational unit is the lender. We consider three approaches to address this problem. We describe the first approach, which uses unique identifiers in the nodes, in the following example.
Example 5-9 Extracting Loan Counts from Kiva Using Unique Identifiers on Nodes
If we can find a unique identifier for each observation, then we can use this identifier to associate the loan countvalues with the appropriate observation. That is, we can then use the identifier to set the values for specific observations for variables that are optional in the XML nodes.
With these data, we have at least two potential candidate variables—lender idanduid. Thename variable is not actually unique across lenders so we cannot use this. We can create a data frame from the variables we have computed above and then uselender idas the row names. We do this with lenders = as.data.frame(vars, stringsAsFactors = FALSE,
row.names = vars$lender_id)
We can now create loan count from the corresponding nodes in the XML document. However, we also need to get thelender id at the same time so that we know how to associate the val-ues with the correct row of our data frame. We can do this in two ways: a) a single XPath query to get the <lender> nodes that have a <loan count> child and then extract both the value of theloan countand theloan id, or b) make two separate XPath queries to identify the same
<lender>nodes, but get the specific children separately. Both approaches will require that we cre-ate theloan countvariable in our data frame with suitable default values for each observation, e.g., lenders$loan_count = rep(NA, nrow(lenders))
In both approaches, we will assign values to a subset of the observations using thelender idvalues.
We can implement the first approach with
tmp = getNodeSet(lendersDoc, "//lender[loan_count]")
ids = sapply(tmp, function(node) xmlValue(node[["loan_id"]])) lenders$loan_count[ids] =
as.integer(sapply(tmp, function(node)
xmlValue(node[["loan_count"]]))) Heretmpis a list of the 985lendernodes that have aloan countchild. We loop over these to get the values oflender idandloan countand then use these to assign the values.
The second approach is quite similar. However, we get the values oflender idandloan count in two separate but similar and related XPath queries:
ids = xpathSApply(lendersDoc, "//lender[loan_count]/lender_id", xmlValue)
lenders$loan_count[ids] =
as.integer(xpathSApply(lendersDoc,
"//lender[loan_count]/loan_count", xmlValue))
In our example, we had two unique identifiers for each observation—lender idanduid. We can check they are unique with
length(unique(lenders$id)) == nrow(lenders)
However, what happens if we do not have a variable or node within each observation that uniquely identifies it? Well, we can always create a unique identifier for the nodes corresponding to the obser-vational unit, i.e., <lender> in our example. The simplest identifier to use is the index of the node, i.e., 1, 2, 3, . . . . We can also compute a checksum from the text representation of the each node, e.g., sapply(lenderNodes,
function(node)
digest(saveXML(node), "md5", serialize = FALSE))
or we might use XSL ’s generate-id() function to generate the identifiers. Regardless of how we create the identifiers, we have to somehow associate them with the XML nodes. There is no value having the identifiers in R as we cannot match the nodes resulting from an XPath expression (unless we compute the checksum for each of those nodes). Instead, we want to put the identifiers into the observation nodes in the XML document. We might add each identifier as an attribute named id.
Alternatively, we can use a different attribute name (if any of the identifiers are already being used in the XML document for any node, not just the observation node). Instead of using an attribute, we can add the identifiers as child nodes, similar to <lender id> in our example. In other words, we will modify the XML document to add our unique identifiers. Creating and modifying nodes is the topic of Chapter6.
The unique identifiers we create are only for our own purpose. They have no other role in the XML document other than allowing us to associate a node with a row in our data frame or element of a vector. So why are we changing the XML document, and won’t this have repercussions? In short, this is fine. We are not changing the original XML document on disk. Instead, we are changing a copy of it in memory. When we are finished with the modified document, it will be garbage collected and the modifications will disappear. If we want to reuse the XML document later in the R session and do not want our modifications to be present, we can make a copy of the XML tree before we make the changes. We can do this with thexmlClone()function, e.g.,
tempDoc = xmlClone(lendersDoc)
and then work with this instead of the original document, which will remain unaltered.
150 5 Strategies for Extracting Data from HTML and XML Content
5.4.1 Extracting an Entire Observation: A Different Approach
The problem we had above was that we were extracting variables for our data frame individually and needed to associate the values with the correct observations. An alternative approach is to process an entire observation node in one step and create all of the different variables. We can then combine the different observations into a data frame knowing that we have values for all of the variables for each observation. To do this, we operate on the observation nodes, i.e., in <lender> nodes in our example. We can do this with code like
obs = lapply(lenderNodes, function(node) {
ans = structure(sapply(node[varNames], xmlValue), names = varNames)
ans["loan_count"] =
if(!is.null(a <- node[["loan_count"]])) xmlValue(a)
else NA ans
})
The key idea here is that we check if the optional elements (node or attribute) are present in the XML node for each observation and ensure that there is a value in the R object representing the observation.
The result of our code above is a vector (or a list or a data frame) for each observation. We can combine these with
lenders = do.call(rbind, obs)
We can then convert the columns to the appropriate types, e.g.,integer,logical,factor, and so on.
5.4.2 Modifying the Tree Before Extracting Variables: A Final Approach
We end this section by looking at a different approach whereby we can still use separate XPath queries over the <lender> nodes to get different pieces of information. The problem we encountered was due to the XML having optional elements. If every <lender> node has a <loan count> child, we can create theloan countvariable in our data frame in the same way as we did for the other nodes. There is no issue in associating the values with the correct observations. This suggests that we can modify our XML document by adding the missing elements to any of the relevant observation nodes. In our example, we can find the <lender> nodes that do not have a <loan count> child.
For each of these, we insert an empty <loan count> node. We can perform these steps with lendersDoc = xmlParse("lenders1.xml")
nodes = getNodeSet(lendersDoc, "//lender[not(loan_count)]") sapply(nodes, function(node) node[["loan_count"]] = "")
The command node[["loan_count"]] = "" inserts (or replaces) the XML node named
<loancount>with no child nodes.
Now every <lender> node in our document has a <loan count> node. When we extract the value for each <lender> node with
xpathSApply(lendersDoc, "//lender/loan_count", xmlValue)
we end up with a value for each observation. Converting this to anintegervector will result in NA values for each of the observations for which we had no <loan count> node in the XML document.
Up to this point, we have parsed the entire document, including other XML documents or parts of them. Next we turn to examining how to control the inclusion of these parts when parsing a document.