Reading Data from XML-formatted Documents

Getting Started with XML and JSON

1.3 Reading Data from XML-formatted Documents

We next turn to reading data from XML documents. XML documents are very general and can describe very complex data structures; however, many XML documents are quite simple. As an example, we consider the lender information provided by Kiva [4], a nonprofit organization that provides microloans to individuals in developing countries by connecting them with people around the world who want to lend money for such activities. Kiva makes information about lenders and loans available via a Web service and also provides a “dump” of the entire database in both XML and JSON formats. These data include characteristics of the lenders and of the loans, e.g., the amount and purpose of each loan, which loans were paid back and which were not, and when. The data are available from http://build.kiva.org.

Looking at one of the documents describing lenders below, we see the basic structure. In addition, the graphical representation of the document in Figure 1.2 makes clear the hierarchical structure of the document. We are interested in the information about each lender. A <lender> has a <lender_id>, a <name>, information about a picture of the lender, and a location given by the <whereabouts> and <country_code> nodes. We also have information about the number of loans the individual has provided (<loan_count>), the short label for his or her stated occupation and a more comprehensive description of it (in <occupation> and <occupational_info>), and why the lender participates in Kiva (<loan_because>).

<?xml version="1.0" encoding="UTF-8"?>
<snapshot>
  <header>
    <total>576803</total>
    <page>1</page>
    <date>2010...</date>
    <page_size>1000</page_size>
  </header>
  <lenders type="list">
    <lender>
      <lender_id>matt</lender_id>
      <name>Matt</name>
      <image>
        <id>12829</id>
        <template_id>1</template_id>
      </image>
      <whereabouts>San Francisco CA</whereabouts>
      <country_code>US</country_code>
      <uid>matt</uid>
      <member_since>2006-01-01T09:01:01Z</member_since>
      <personal_url>
        www.socialedge.org/blogs/kiva-chronicles
      </personal_url>
      <occupation>Entrepreneur</occupation>
      <loan_because>I love the stories. </loan_because>
      <occupational_info>I co-founded a startup
        nonprofit (this one!) and I work with an amazing group of people dreaming up ways to
        alleviate poverty through personal lending.
      </occupational_info>
      <loan_count>89</loan_count>
      <invitee_count>23</invitee_count>
    </lender>
    <lender>
      <lender_id>jessica</lender_id>
      <name>Jessica</name>
      <image>
        <id>197292</id>
        <template_id>1</template_id>
      </image>
      <whereabouts>San Francisco CA</whereabouts>
      <country_code>US</country_code>
      <uid>jessica</uid>
      <member_since>2006-01-01T09:01:01Z</member_since>
      <personal_url>www.kiva.org</personal_url>
      <occupation>Kiva cofounder</occupation>
      <loan_because>
        Life is about connecting with each other.
      </loan_because>
      <occupational_info/>
      <loan_count>54</loan_count>
      <invitee_count>26</invitee_count>
    </lender>
    ....
  </lenders>
</snapshot>

The sequence of <lender> nodes in the Kiva data naturally maps to a list in R with an element for each <lender> node. Similarly, each <lender> can be represented as a list. That is, each child node of a <lender> node can be mapped to either a string containing the text content of the child or, in the case of <image>, a list or vector with two elements: <id> and <template_id>. Basically, it is natural to map an XML node to a list with an element for each child node, using the name of the child as the name for the list element. In the next example, we use the function xmlToList() to help us do this.

Figure 1.2: Tree Diagram of a Kiva Lender XML Document. This graphical representation of a Kiva lender document shows the basic structure of the XML document. Notice the hierarchical format of the XML where we have a single root node called <snapshot>, its two children <header> and <lenders>, and so on. In the figure, a rectangle with dashed lines denotes text content.

Example 1-2 Converting XML-formatted Kiva Data to an R List or Data Frame

We begin by loading the XML package and parsing the XML document with

library(XML)
doc = xmlParse("kiva_lender.xml")

We then pass doc to the xmlToList() function, and it will return an R list with an element for each of its top-level child nodes, mapping each of these children to an R list and so on, in the same recursive way:

kivaList = xmlToList(doc, addAttributes = FALSE)

The result has one element for each child of the root node; its lenders element, kivaList$lenders, is a list with 1000 elements, one for each <lender> node. (The addAttributes = FALSE argument ensures that any XML attributes are not included in the result, e.g., the type="list" in the <lenders> node.) The first lender element in that list is

$lender_id
[1] "matt"

$name
[1] "Matt"

$image
$image$id
[1] "12829"

$image$template_id
[1] "1"


$whereabouts
[1] "San Francisco CA"

$country_code
[1] "US"

$uid
[1] "matt"

$member_since
[1] "2006-01-01T09:01:01Z"

$personal_url
[1] "www.socialedge.org/blogs/kiva-chronicles"

$occupation
[1] "Entrepreneur"

$loan_because
[1] "I love the stories. "

$occupational_info
[1] "I co-founded a startup nonprofit (this one!) and I work with an amazing group of people dreaming up ways to alleviate poverty through personal lending. "

$loan_count
[1] "89"

$invitee_count
[1] "23"

When appropriate, xmlToList() makes converting XML content to R quite easy.
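The recursive mapping can be tried without the full Kiva file. The following is a minimal sketch on a small, made-up document with the same shape as the lender data; the names txt, toyDoc, toyList, toyLenders, and loanCounts are illustrative only, and the header values are invented. It assumes the parser drops whitespace-only text between elements, which is the usual default for xmlParse().

```r
library(XML)

# A tiny, invented document shaped like the Kiva lender file.
txt = '<snapshot>
  <header><total>2</total><page>1</page></header>
  <lenders type="list">
    <lender>
      <lender_id>matt</lender_id>
      <name>Matt</name>
      <image><id>12829</id><template_id>1</template_id></image>
      <loan_count>89</loan_count>
    </lender>
    <lender>
      <lender_id>jessica</lender_id>
      <name>Jessica</name>
      <image><id>197292</id><template_id>1</template_id></image>
      <loan_count>54</loan_count>
    </lender>
  </lenders>
</snapshot>'

# asText = TRUE tells xmlParse() that txt holds XML content, not a file name.
toyDoc = xmlParse(txt, asText = TRUE)
toyList = xmlToList(toyDoc, addAttributes = FALSE)

# One list element per child of the root node.
names(toyList)

# The lender records sit one level down, under "lenders".
toyLenders = toyList[["lenders"]]
length(toyLenders)

# Nested nodes become nested lists.
toyLenders[[1]]$image$id

# Text content comes back as strings, so numeric fields need explicit coercion.
loanCounts = as.integer(sapply(toyLenders, function(x) x$loan_count))
loanCounts
```

Because every leaf is returned as a character string, any numeric or date field has to be converted after the fact, as in the last step.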

On the other hand, if the XML data have a simple structure, we can read them into a data frame with the xmlToDataFrame() function. We might arrange the Kiva lenders data as a data frame, with an observation for each lender and a column/variable for each node within <lender>, e.g., <lender_id>, <name>, <country_code>, <loan_count>. In our situation, the <lender> nodes are two levels below the root node, so we need to access them to pass to xmlToDataFrame(). We get the top-level/root node and then its <lenders> node as follows:

lendersNode = xmlRoot(doc)[["lenders"]]

The xmlRoot() function gives us the top-level node of our document, i.e., <snapshot>. To fetch the <lenders> subnode, we treat the root node as if it were a list in R and use the expression node[["lenders"]] to extract the (first) child node whose element name is <lenders>. This is a convenient way to access child nodes. We can also index by position, e.g.,

xmlRoot(doc)[[2]]

as we know the second element is the <lenders> node. We want the list of <lender> nodes. The function xmlChildren() is the means for getting the list of all child nodes of a given node, e.g., the <lender> nodes under <lenders>. We then pass this list of the individual <lender> nodes to xmlToDataFrame() to create a data frame with

lenders = xmlToDataFrame(xmlChildren(lendersNode))

This function returns a 1000 by 13 data frame. The variables in the data frame correspond to the top-level XML elements in each of the <lender> nodes, i.e.,

names(lenders)

 [1] "lender_id"         "name"              "image"
 [4] "whereabouts"       "country_code"      "uid"
 [7] "member_since"      "personal_url"      "occupation"
[10] "loan_because"      "occupational_info" "loan_count"
[13] "invitee_count"

Note that this approach collapses the image column to just the value of the first child node in <image>. This may not be what we want. In Chapter 3, we continue with this example and demonstrate how to include the children of <image> in our data frame.
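The same steps can be sketched end to end on a small, invented document. The names txt, toyDoc, toyRoot, toyLendersNode, and toyLenders are hypothetical, and the sketch assumes the parser ignores whitespace-only text between elements. The columns of the result hold text, so the count column is converted explicitly; going through as.character() keeps the conversion correct whether the column arrives as a character vector or as a factor.

```r
library(XML)

# Invented data with flat <lender> records (no nested <image> node).
txt = '<snapshot>
  <header><total>2</total></header>
  <lenders type="list">
    <lender><lender_id>matt</lender_id><country_code>US</country_code><loan_count>89</loan_count></lender>
    <lender><lender_id>jessica</lender_id><country_code>US</country_code><loan_count>54</loan_count></lender>
  </lenders>
</snapshot>'
toyDoc = xmlParse(txt, asText = TRUE)

# Walk from the root node to <lenders>, by name.
toyRoot = xmlRoot(toyDoc)
xmlName(toyRoot)
toyLendersNode = toyRoot[["lenders"]]

# Number of child nodes, i.e., the number of <lender> records.
xmlSize(toyLendersNode)

# One row per <lender>, one column per child element.
toyLenders = xmlToDataFrame(xmlChildren(toyLendersNode))

# Convert the text column to integers.
toyLenders$loan_count = as.integer(as.character(toyLenders$loan_count))
toyLenders
```

Flat records like these are the case xmlToDataFrame() handles directly; records with nested children, such as <image>, need the extra handling mentioned above.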

The previous example introduced two high-level functions, xmlToList() and xmlToDataFrame(), for extracting XML content into R lists and data frames, respectively. We also got a glimpse of other functions available in the XML package, such as xmlParse(), xmlRoot(), xmlChildren(), and the [[ operator for accessing child nodes within a node. These and other functions provide much greater control over data extraction. The example in the next section gives a preview of the possibilities, particularly for working with attributes on XML nodes. Chapter 3 provides a more in-depth introduction to these parsing functions.

1.3.1 Extracting Data from XML Attributes

When we examine the XML for the Kiva lenders, we see only one attribute being used. This is type="list" within the <lenders> node. In this case, the attribute conveys metadata about the content of the <lenders> node. In other XML documents, the attributes often contain data.

For example, the following is a segment describing activities related to a bill in the US Congress [1] (available at http://www.govtrack.us/developers):

<bill session="111" type="h" number="1"
      updated="2011-01-29T15:03:08-05:00">
  <state datetime="2009-02-17">ENACTED:SIGNED</state>
  <status><enacted datetime="2009-02-17" /></status>
  ...
  <relatedbills>
    <bill relation="rule" session="111" type="hr" number="88" />
    ....
  </relatedbills>
  ...
</bill>

We see that the date the bill was enacted is given in the attribute datetime in the <enacted> element, and the information about each related bill is provided via the attributes relation, session, type, and number in a <bill> node within <relatedbills>. Hence we need a tool that extracts attributes from XML nodes to access this information. There are two functions for this in the XML package: xmlAttrs() and xmlGetAttr(). The xmlAttrs() function returns a named character vector of all the attributes for a given node, from which we can extract individual elements as strings. The xmlGetAttr() function is used to retrieve a single attribute, rather than returning the entire collection of attributes. xmlGetAttr() also allows us to provide a default value if the attribute is not present in the node and to coerce the value if it is present. We demonstrate both functions in the next example.

Example 1-3 Retrieving Attribute Values from XML-formatted Bills in the US Congress

Consider the <bill> nodes within <relatedbills> for a particular bill. We have a list of these <bill> nodes available in the rBills variable; the attributes on the first related <bill> node are

xmlAttrs(rBills[[1]])

relation  session     type   number
  "rule"    "111"     "hr"     "88"

We can combine these attributes across <bill> nodes with rbind(), which stacks the named character vectors into a character matrix, and then convert that matrix into a data frame with

as.data.frame(do.call(rbind, lapply(rBills, xmlAttrs)), stringsAsFactors = FALSE)

With the xmlGetAttr() function we can retrieve a single attribute and specify a default value that is returned if the attribute is not present. Furthermore, we can coerce the attribute’s string value to a particular type. For example, if we want the number attribute back as an integer, or NA if it is not present, then we can use

xmlGetAttr(rBills[[1]], "number", NA, as.integer)

We can collect all the number values across the <bill> nodes with

as.integer(sapply(rBills, xmlGetAttr, name = "number"))

 [1]  88  92 168 290 291 598 629 679 679 861
[11] 336 336 350 350

Here, it is better to convert the whole vector of attribute values to integer in a single call rather than converting each value individually.
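Since rBills is assumed to exist already in the example, the following self-contained sketch shows one way such a list might be built and queried. The document in billTxt is invented for illustration (only the first related bill mirrors the values shown above), and billTxt, billDoc, billRoot, toyBills, enacted, and nums are hypothetical names. The third <bill> deliberately lacks a number attribute so that the default value comes into play.

```r
library(XML)

# Invented bill document; the third related bill has no number attribute.
billTxt = '<bill session="111" type="h" number="1">
  <status><enacted datetime="2009-02-17"/></status>
  <relatedbills>
    <bill relation="rule" session="111" type="hr" number="88"/>
    <bill relation="unknown" session="111" type="h" number="92"/>
    <bill session="111" type="s"/>
  </relatedbills>
</bill>'
billDoc = xmlParse(billTxt, asText = TRUE)
billRoot = xmlRoot(billDoc)

# The list of <bill> nodes under <relatedbills>.
toyBills = xmlChildren(billRoot[["relatedbills"]])

# All attributes of one node, as a named character vector.
xmlAttrs(toyBills[[1]])

# A single attribute, reached by walking <status> then <enacted>.
enacted = xmlGetAttr(billRoot[["status"]][["enacted"]], "datetime")

# Missing attribute: the default (NA) is returned instead of a value.
xmlGetAttr(toyBills[[3]], "number", NA, as.integer)

# Collect the attribute across nodes, then coerce the whole vector once.
nums = as.integer(sapply(toyBills, xmlGetAttr, name = "number", default = NA))
nums
```

Supplying default = NA inside sapply() keeps the result a simple vector even when some nodes lack the attribute; without a default, the missing case would return NULL and sapply() would fall back to a list.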

In document XML and Web Technologies for Data Sciences With R-Springer-Verlag New York (2014) (Page 32-38)