
Strategies for Extracting Data from HTML and XML Content

5.2 Using High-level Functions to Read XML Content

5.2.4 Helper Functions for Converting Nodes

We have briefly seen the functions xmlToList() and xmlToDataFrame(). These functions can be useful for converting simple XML documents into R objects. They work best when the documents are very shallow, i.e., have only two or three levels of nodes. For more complex documents, in which the descendants of the root node span three or more generations, the functions may not map the content to an appropriate R representation. These functions are, however, useful as helper functions for converting subnodes within a tree. As such, we can sometimes use them as part of a larger strategy for processing an entire XML document. They are additional examples of where we want the functions to be flexible in allowing us to specify a document by name, as a parsed document, or as a collection of nodes, e.g., a simple list from a call such as node[ "player" ] or from an XPath query, and so of class XMLNodeList.
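As a minimal sketch of how these two helpers behave on a shallow document, consider the following. The <people> document and its values are made up for illustration and are not data from the text.

```r
library(XML)

# A made-up, shallow document: root -> records -> fields.
txt = '<people>
  <person><name>Ann</name><age>31</age></person>
  <person><name>Bob</name><age>45</age></person>
</people>'
doc = xmlParse(txt, asText = TRUE)

# One row per <person>, one column per child node.
people = xmlToDataFrame(doc, stringsAsFactors = FALSE)

# The same helper applied to a collection of nodes from an XPath query.
nodes = getNodeSet(doc, "//person")
people2 = xmlToDataFrame(nodes = nodes, stringsAsFactors = FALSE)

# xmlToList() on a single subnode gives a list with one element per child.
first = xmlToList(nodes[[1]])
```

The values arrive as strings, so a column such as age still needs an explicit conversion, e.g., as.integer(people$age).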

In addition to these two functions, there are two other high-level helper functions. One is named xmlAttrsToDataFrame(), which processes a collection of nodes and takes their attributes rather than the subnodes and turns them into a data frame. This can be useful when all of the content is in the attributes of the nodes of interest. Alternatively, we can use xmlAttrsToDataFrame() to process the attributes and use xmlToDataFrame() to process the subnodes and then combine the results. We will look at an example of using this function.
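The idea of combining attributes and subnodes can be sketched as follows. This is an illustration under assumptions: the <stations> document is made up, and the attribute step is written out by hand with xmlAttrs() so that each step is visible, rather than calling xmlAttrsToDataFrame().

```r
library(XML)

# Made-up document: some values are attributes, others are subnodes.
txt = '<stations>
  <station id="s1" state="CA"><name>Davis</name><elev>16</elev></station>
  <station id="s2" state="NY"><name>Ithaca</name><elev>125</elev></station>
</stations>'
doc = xmlParse(txt, asText = TRUE)
nodes = getNodeSet(doc, "//station")

# Attributes -> one row per node (done by hand here with xmlAttrs()).
attrRows = lapply(nodes, function(n)
                    as.data.frame(as.list(xmlAttrs(n)),
                                  stringsAsFactors = FALSE))
attrDF = do.call(rbind, attrRows)

# Subnodes -> one row per node.
childDF = xmlToDataFrame(nodes = nodes, stringsAsFactors = FALSE)

# Both data frames come from the same nodes in the same order,
# so the rows line up and can be bound column-wise.
stations = cbind(attrDF, childDF)
```

Binding by position is only safe because both pieces are computed from the same node list. If the two were built from separate queries, merging on an identifier such as id would be the more cautious choice.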

Example 5-3 Reading Baseball Player Information from Attributes into an R Data Frame

Some people (Americans mostly) love to explore statistics about the game of baseball. We compute batting averages, earned run averages (ERA), hits in different ball parks, percentages of hits against left-handed pitchers, and percentages of hits against right-handed pitchers named Ernest or Joe, pitching on a Tuesday (0 for 1!). While the number of observations is small, the inference may be useful, so it may be valuable to analyze games. The Major League Baseball (MLB) site http://gd2.mlb.com/components/game/mlb/ provides detailed information about each game played for many years.¹

Each baseball game has information about the players involved in the game. The document looks something like

<game venue="Busch Stadium" date="October 28, 2011">
  <team type="away" id="TEX" name="Texas Rangers">
    <player id="119984" first="Darren" last="Oliver"
            num="28" boxname="Oliver" rl="L" position="P"
            status="A" avg=".000" hr="0" rbi="0" wins="0"... />
    <player id="134181" first="Adrian" last="Beltre"
            num="29" boxname="Beltre, A" rl="R" position="3B"
            status="A" bat_order="5" game_position="3B" .../>
    <coach position="manager" first="Ron"
           last="Washington" id="123965" num="38"/>...
  </team>
  <team>
    <player..../>
    <player..../>
    <coach .../>
  </team>
  <umpires>
    <umpire....>
    ...
  </umpires>
</game>

¹ Use of these data is governed by the license at http://gdx.mlb.com/components/copyright.txt.

Each of the <player> nodes has various attributes giving the first and last name, jersey number, batting and fielding position, average, rbi (runs batted in), wins and so on.

We want to create a data frame with a row for each player and columns corresponding to the attributes. Not all attributes are present in each <player> node. This means we have to decide if we want just those that are present in all nodes, or if we want to use all of the available attributes and have missing values for those nodes in which an attribute is not present. Indeed, we might be interested in just a subset of the attributes, e.g., id, first, last. In other circumstances, we might want to ignore particular attributes and include the rest. The xmlAttrsToDataFrame() function allows us to choose which approach to follow.
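The choice between "attributes common to all nodes" and "all attributes, with missing values" can be illustrated by hand. The following is a sketch with made-up nodes and a hypothetical helper, toDF(); it is not how xmlAttrsToDataFrame() is implemented.

```r
library(XML)

# Made-up nodes: the second <player> lacks the avg attribute.
txt = '<team>
  <player id="1" first="A" last="X" avg=".250"/>
  <player id="2" first="B" last="Y"/>
</team>'
doc = xmlParse(txt, asText = TRUE)
nodes = getNodeSet(doc, "//player")

attrList = lapply(nodes, xmlAttrs)   # named character vector per node
nameList = lapply(attrList, names)

common = Reduce(intersect, nameList) # present in every node
anyNode = Reduce(union, nameList)    # present in at least one node

# Hypothetical helper: build a data frame for a given set of names.
# Indexing a named character vector by an absent name yields NA.
toDF = function(vars) {
  rows = lapply(attrList, function(a) unname(a[vars]))
  df = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(df) = vars
  df
}

playersCommon = toDF(common)   # no missing values
playersAll = toDF(anyNode)     # avg is NA for the second player
```

Selecting a fixed subset, such as toDF(c("id", "last")), or dropping particular attributes with setdiff(anyNode, "avg"), follows the same pattern.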

The first step is to read the XML document, i.e.,

doc = xmlParse("players2.xml")

We do not want to process the <team> and <umpires> nodes. Instead, we want only the <player> nodes, ignoring the <coach> nodes within the <team> elements. To do this, we will use XPath to retrieve the <player> nodes:

playerNodes = getNodeSet(doc, "//player")

We can now pass this list of nodes to xmlAttrsToDataFrame():

players = xmlAttrsToDataFrame(playerNodes, stringsAsFactors = TRUE)

This results in a data frame with 50 rows and 16 variables. The columns include all of the attributes in any of the <player> nodes:

names(players)
 [1] "id"            "first"         "last"          "num"
 [5] "boxname"       "rl"            "position"      "status"
 [9] "bat_order"     "game_position" "avg"           "hr"
[13] "rbi"           "wins"          "losses"        "era"

We can request xmlAttrsToDataFrame() to use only the names of the attributes that are common to all of the nodes. We can do this either by computing and specifying the names of those attributes ourselves, or by using the function XML:::inAllRecords():

players = xmlAttrsToDataFrame(playerNodes,
                              attrs = XML:::inAllRecords,
                              stringsAsFactors = TRUE)

This gives a data frame with only 11 columns:

names(players)
 [1] "id"       "first"    "last"     "num"      "boxname"  "rl"
 [7] "position" "status"   "avg"      "hr"       "rbi"

Suppose we just want the id, first, and last variables. We can, of course, subset the data frame after we have created it. We can also specify that we want just those variables in the call to xmlAttrsToDataFrame() with


players = xmlAttrsToDataFrame(playerNodes,
                              attrs = c("id", "last", "first"),
                              stringsAsFactors = TRUE)

We can either specify the variables as a character vector or by providing a function that dynamically processes the nodes and returns the collection of desired names. This allows us to determine which variables we want based on the contents of the nodes and attributes.

We can also use the omit parameter to discard some attributes.

There is also the function xmlToS4(). It is similar to xmlToList() in that it decomposes the contents of a node and puts them individually into an R list. The xmlToS4() function takes an XML node and the name of an S4 class as arguments. It then attempts to match the slot names in the S4 class to node and attribute names in the XML node, and it converts the matching XML elements to the type of the corresponding slot. This allows us to transform string values to numbers, logical values, and so on.

This function also works recursively, so subnodes with children can be transformed to other R classes using the class of the target slot. This can be useful when we define S4 classes either manually or programmatically from an XML schema. (See Chapter 14.) We will use the same XML document from Example 5-3 (page 124) to illustrate xmlToS4().

Example 5-4 Converting Player Information into S4 Objects

Suppose instead of a data frame, we wanted to represent the player information as a list with an element for each player. We want these elements to be S4 objects of class Player. We define this class as

setClass("Player",
         representation(id = "character", first = "character",
                        last = "character", position = "character",
                        avg = "numeric", num = "integer"))

We can also add a prototype to provide default values for the slots.
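A minimal sketch of such a prototype follows. The class name PlayerWithDefaults and the choice of NA defaults are assumptions made for illustration; the slots mirror those of Player above.

```r
library(methods)

# Same slots as Player, under a hypothetical name so the two
# definitions do not clash. The NA defaults mark "value not yet known".
setClass("PlayerWithDefaults",
         representation(id = "character", first = "character",
                        last = "character", position = "character",
                        avg = "numeric", num = "integer"),
         prototype(position = NA_character_, avg = NA_real_,
                   num = NA_integer_))

# Slots not supplied in the call to new() take the prototype values.
p = new("PlayerWithDefaults", id = "0", first = "Test", last = "Player")
```

Without a prototype, an unset numeric slot is a zero-length vector. An explicit NA makes an attribute that is absent from a node easier to detect after conversion.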

We can now loop over the <player> nodes and convert each to this class:

players = lapply(playerNodes, xmlToS4, "Player")

The first element is

An object of class "Player"
Slot "id":
      id
"119984"

Slot "first":
   first
"Darren"

Slot "last":
    last
"Oliver"

Slot "position":
position
     "P"

Slot "avg":
[1] 0

Slot "num":
[1] 28

If we do not specify the name of the target class, then xmlToS4() uses the name of the XML node, e.g., player in this case. Therefore, if we had named our class player, rather than with a capital P, we would not have needed to specify the name of the class in our calls.
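As a small sketch of this default, assume a lower-case class named player with only three of the slots. Both the class and the node below are made up for illustration.

```r
library(XML)

# Hypothetical lower-case class whose name matches the <player> node name.
setClass("player",
         representation(id = "character", last = "character",
                        num = "integer"))

# Made-up node for illustration.
doc = xmlParse('<player id="0" last="Sample" num="7"/>', asText = TRUE)
node = xmlRoot(doc)

# No class name is given, so the node name "player" is used.
p = xmlToS4(node)
```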

Instead of passing the name of the target class, we can pass an actual instance of the object. This is useful when we are working with subclasses and want to fill in the slots of the parent or base class(es).
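The following sketch passes an instance of a subclass. The classes BasePlayer and Pitcher and the node are hypothetical, and the example assumes that xmlToS4() treats inherited slots the same way as slots defined directly in the class.

```r
library(XML)

# Hypothetical base class and a subclass that adds one slot.
setClass("BasePlayer",
         representation(id = "character", last = "character"))
setClass("Pitcher",
         representation(era = "numeric"),
         contains = "BasePlayer")

# Made-up node for illustration.
doc = xmlParse('<player id="0" last="Sample" era="3.5"/>', asText = TRUE)
node = xmlRoot(doc)

# Pass an instance rather than a class name. The inherited slots
# (id, last) belong to the instance too, so they are candidates
# for matching against the attributes.
p = xmlToS4(node, new("Pitcher"))
```

Because an existing object is passed in, any slot values set before the call and not matched by the node would be expected to remain as they were.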