Strategies for Extracting Data from HTML and XML Content

5.2 Using High-level Functions to Read XML Content

    if(xmlName(node) == "td" && !is.null(node[["a"]]))
        xmlGetAttr(node[["a"]], "href", character())
    else
        character()
}

Note that we test whether the name is td by using the xmlName() function, which returns the name of the node, as we might expect. We can pass our function to readHTMLTable() via its elFun argument as

links = readHTMLTable(kmlUrl, which = 1, elFun = getCellLink, stringsAsFactors = FALSE)

where kmlUrl contains the eqarchives URL shown above. Now, links is a data frame containing all the links, so we can just unlist them to get a character vector with the names of the 391 data files:

unique(unlist(links))

The purpose of this example is to show how we can extract general information from each cell, and not just its displayed value. We can also use getHTMLLinks() to get the links in the entire document, or use XPath to find the <table> node and then get the links within that, e.g.,

doc = htmlParse(kmlUrl)
tb = getNodeSet(doc, "//table[1]")
getHTMLLinks(tb[[1]])
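As a small offline illustration of the XPath route, the sketch below parses a made-up HTML fragment and collects the href attributes of the links inside the first table. The page content and file names are invented for this example and are not taken from the eqarchives page. We use xpathSApply() with a path relative to the table node rather than getHTMLLinks(), simply to show the node-relative query explicitly.

```r
library(XML)

# A tiny, invented HTML page standing in for the real archive listing.
html = '<html><body>
<p><a href="about.html">About</a></p>
<table>
 <tr><td><a href="eq1.kml">First</a></td><td>10 KB</td></tr>
 <tr><td><a href="eq2.kml">Second</a></td><td>12 KB</td></tr>
</table>
</body></html>'

doc = htmlParse(html, asText = TRUE)

# Find the first <table>, then search for <a> nodes relative to that node.
tb = getNodeSet(doc, "//table[1]")
tableLinks = xpathSApply(tb[[1]], ".//a", xmlGetAttr, "href")

# For comparison, every link in the document, including the one outside the table.
allLinks = xpathSApply(doc, "//a", xmlGetAttr, "href")
```

The leading "." in ".//a" is what restricts the search to the subtree below the table node; without it, the query would start again from the top of the document.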

5.2.3 XML Property List Documents

Property list files are used on Mac OS X to store data. On a Mac, there may be upwards of 25,000 property list files ranging in purpose from storing the current state of windows in an application, to the history list for a Web browser, to the preferences for a user’s account, to printer configurations, to a description of the applications and bundles, even for R itself. While property lists are used mainly on the Mac, there is nothing about these documents that makes them specific to Mac. Indeed, Major League Baseball (MLB) also uses plist documents to describe aspects of each game; see http://gdx.mlb.com/components/copyright.txt.

Some of these files are stored as XML and we can read them directly. More recently, property list files are stored in a binary format. We can convert these to XML using the plutil tool, e.g.,

plutil -convert xml1 -o theFile.xml theFile.plist

We can even convert them to the JSON format.

A property list in XML format has a root node <plist> which has any number of children that are <string>, <real>, <integer>, <true>, <false>, <date>, <array>, <dict>, or <data>. The first six of these are used to represent scalar values of the corresponding type. An <array> is like a vector or a list in R without names, while a <dict> element corresponds to a vector or list that has names for the elements. A <dict> element is made up of <key> and value nodes, e.g.,

<plist>
  <dict>
    <key>CFBundleTypeExtensions</key>
    <array>
      <string>Rd</string>
      <string>rd</string>
    </array>
    <key>CFBundleTypeName</key>
    <string>Rd Documentation File</string>
  </dict>
</plist>

There are two keys here, each followed by one of the possible value node elements we just listed. In this case, these are <array> and <string>. The <key> node contains the name of the element that follows.

If we want to read a property list into R, we can parse the XML document and then process the nodes. We convert each of the different scalar nodes into the corresponding R value, i.e., a vector of length 1. We convert an <array> node into a list, as the elements may be of different types. Once we determine the class of all of the elements, we can collapse them to a vector if they are of compatible types. We convert a <dict> node into a named vector or list in the same way. The only difference between an <array> and a <dict> is that for a <dict> we use the names from the <key> nodes.
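The collapsing step can be sketched in base R. The helper below, collapseIfCompatible(), is a hypothetical name used only for this illustration and is not part of the XML package. The rule it applies (every element is atomic, has length 1, and shares one class) is our assumption about what "compatible" means here.

```r
# Hypothetical helper: collapse a list of scalars to an atomic vector
# when every element is atomic, has length 1, and all share a single class.
collapseIfCompatible = function(x) {
    if (length(x) == 0)
        return(x)
    atomic  = all(vapply(x, is.atomic, logical(1)))
    scalar  = all(vapply(x, length, integer(1)) == 1L)
    classes = unique(vapply(x, function(el) class(el)[1], character(1)))
    if (atomic && scalar && length(classes) == 1L)
        unlist(x)    # names, if any, are carried over
    else
        x            # leave mixed or nested content as a list
}

# An <array> of two <string> nodes would collapse to a character vector ...
a = collapseIfCompatible(list("Rd", "rd"))

# ... while elements of different types stay as a list.
b = collapseIfCompatible(list("Rd", 1L, TRUE))

# For a <dict>, the names taken from the <key> nodes are kept.
d = collapseIfCompatible(list(major = 3L, minor = 1L))
```

Requiring a single shared class is deliberately strict. A looser rule could, for example, allow integers and reals to be combined into a numeric vector.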

The function readKeyValueDB() reads an XML document in this form. This is a generic function that has methods for the various different classes of inputs, i.e., name of a file or URL, the content itself, a parsed XML document, or an XML node. We can call this function as, for example,

readKeyValueDB(content)

The result is

$CFBundleTypeExtensions
string string
  "Rd"   "rd"

$CFBundleTypeName
[1] "Rd Documentation File"
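To make the call above reproducible without a file, the sketch below stores the sample property list in a string and parses it first. It then uses the method for a parsed document. The variable names plistText and info are ours, chosen for this illustration.

```r
library(XML)

# The sample property list from the text, held in a character string.
plistText = '<plist>
<dict>
  <key>CFBundleTypeExtensions</key>
  <array>
    <string>Rd</string>
    <string>rd</string>
  </array>
  <key>CFBundleTypeName</key>
  <string>Rd Documentation File</string>
</dict>
</plist>'

# Parse the text ourselves, then hand the parsed document to readKeyValueDB().
doc = xmlParse(plistText, asText = TRUE)
info = readKeyValueDB(doc)

# The element names come from the <key> nodes.
names(info)
```

Parsing explicitly with asText = TRUE avoids any ambiguity about whether a character string is a file name or the XML content itself.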

The methods for readKeyValueDB() give us a great deal of control over how we provide the property list content to the function. We can use the method for an individual node or subtree when we extract the property list from within a larger document. We can also use it recursively when processing the subnodes within a plist tree. Indeed, this is the most sensible approach to converting the XML to R. We start at the root node of the plist tree and convert it to R by processing its subnodes, if there are any. The method that takes an XMLInternalNode object handles the nodes for scalars by creating the corresponding R value, e.g., <string> yields a character vector, <true> yields TRUE, and so on. When we process an <array> node, we process each of its child elements using this same method, i.e., calling it recursively. We do this with xmlSApply(). This simplifies the result to a vector when it can and otherwise returns a list.

A <dict> node is processed by pairing each <key> element with the value node that immediately follows it:

kids = xmlChildren(node)
keyPos = seq(1, by = 2, length = xmlSize(node)/2)
structure(sapply(kids[ keyPos + 1], readKeyValueDB),
          names = sapply(kids[ keyPos ], xmlValue))

There is no point in using XPath to process the content as we have to process each node relative to its parent. We need to do this by traversing the tree, which we can do with the node manipulation functions. (We will see another approach we can use in Section 5.7.)
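The recursive traversal described above can be sketched as a single function. This is not the code of readKeyValueDB(); plistToR() is a hypothetical name and a simplified stand-in that handles only <string>, <integer>, <real>, <true>, <false>, <array>, <dict>, and the <plist> root. It always returns an <array> as a list rather than trying to simplify it. It uses the relative query "./*" only as a convenient way to obtain the element children of the current node.

```r
library(XML)

# Hypothetical, simplified converter written for illustration only.
plistToR = function(node) {
    elementKids = function(n) getNodeSet(n, "./*")   # child elements only
    switch(xmlName(node),
        string  = xmlValue(node),
        integer = as.integer(xmlValue(node)),
        real    = as.numeric(xmlValue(node)),
        true    = TRUE,
        false   = FALSE,
        array   = unname(lapply(elementKids(node), plistToR)),
        dict    = {
            kids = elementKids(node)
            # keys sit at positions 1, 3, 5, ...; each value follows its key
            keyPos = seq(1, by = 2, length.out = length(kids) / 2)
            structure(lapply(kids[keyPos + 1], plistToR),
                      names = vapply(kids[keyPos], xmlValue, character(1)))
        },
        plist   = plistToR(elementKids(node)[[1]]),
        stop("unhandled node type: ", xmlName(node)))
}

plistText = '<plist><dict>
  <key>CFBundleTypeExtensions</key>
  <array><string>Rd</string><string>rd</string></array>
  <key>CFBundleTypeName</key>
  <string>Rd Documentation File</string>
</dict></plist>'

res = plistToR(xmlRoot(xmlParse(plistText, asText = TRUE)))
```

Each branch either produces a scalar directly or calls plistToR() again on the child nodes, so the shape of the R result mirrors the shape of the XML tree.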

There are two purposes to this example. One is to show that if we want to process property lists, there exists a function to do it. The second is to show how we process a document of this nature and structure the code.

The concept underlying a property list is very common and general—representing scalars, arrays, and named/associative arrays. These occur in many contexts and are covered well by XML schema, JSON, and so on. For whatever reason, there are various different formats with the same basic ideas.

One of these is a Solr document which is used in the Open Source Lucene text search engine. Below is a sample Solr document.

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">90</int>
</lst>
<lst name="index">
  <!-- ANN: Provides info about the state of the index -->
  <int name="numDocs">17</int>
  <int name="maxDoc">17</int>
  <int name="numTerms">1044</int>
  <long name="version">1297337332283</long>
  <bool name="optimized">true</bool>
  <bool name="current">true</bool>
  <bool name="hasDeletions">false</bool>
  <str name="directory">
    <!-- ANN: The choice of Directory can sometimes effect performance. Lucene tries to automatically pick the correct one, but ... -->
    org.apache.lucene.store.NIOFSDirectory:...@[{PATH}...
    lockFactory=org.apache.lucene.store....
  </str>
  <date name="lastModified">2011-02-10T11:29:03Z</date>
</lst>

This document is conceptually similar to a property list document. The <bool> nodes correspond to <true> and <false> but have the actual value as text within the node. The <int> and <long> nodes correspond to numbers; <str> corresponds to a string or character vector with one element; <lst> is an <array>. Names can appear on any element, not via a separate <key> element, but via a name attribute. The function readSolrDoc() can read documents of this form. It too uses a similar approach of recursively processing nodes.
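The same recursive pattern carries over to this format. The sketch below is not the implementation of readSolrDoc(); solrToR() is a hypothetical name used for illustration. The sample is a shortened version of the document above, wrapped in a <response> root that we add here as an assumption so that the text is a well-formed XML document.

```r
library(XML)

# Shortened sample; the <response> wrapper is an assumption of this sketch.
solrText = '<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">90</int>
</lst>
<lst name="index">
  <int name="numDocs">17</int>
  <long name="version">1297337332283</long>
  <bool name="optimized">true</bool>
  <bool name="hasDeletions">false</bool>
  <date name="lastModified">2011-02-10T11:29:03Z</date>
</lst>
</response>'

# Hypothetical converter: scalars by node name, containers by recursion.
solrToR = function(node) {
    switch(xmlName(node),
        int  = as.integer(xmlValue(node)),
        long = as.numeric(xmlValue(node)),   # may exceed R's integer range
        bool = xmlValue(node) == "true",
        str  = xmlValue(node),
        date = as.POSIXct(xmlValue(node), format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
        {   # <lst> and any other container, e.g., the assumed <response> root
            kids = getNodeSet(node, "./*")
            ans = lapply(kids, solrToR)
            # names come from the name attribute rather than from <key> nodes
            names(ans) = vapply(kids, xmlGetAttr, character(1),
                                "name", NA_character_)
            ans
        })
}

res = solrToR(xmlRoot(xmlParse(solrText, asText = TRUE)))
res$index$numDocs
```

Compared with the property list case, the only structural change is where the names come from: each child carries its own name attribute, so there is no need to step through alternating key and value nodes.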
