Automated
Data Collection with R
A
Practical Guide
to
Web
Scraping
and
Text
Mining
Simon Munzert
Department
of
Politics and Public
Administration,
University ofKonstanz,
Germany
Christian
Rubba
Department of
Political
Science,
University of
Zurich and National Centerof
Competence
inResearch, Switzerland
Peter MeiBner
Department
of
Politics and Public
Administration, University
ofKonstanz,
Germany
Dominic
Nyhuis
Department of
PoliticalScience,
University
of
Mannheim,
Germany
Contents
Preface
xv1 Introduction 1
1.1 Case
study:
WorldHeritage
Sites inDanger
11.2 Some remarks on web data
quality
7 1.3Technologies
fordisseminating, extracting,
andstoring
web data 91.3.1
Technologies
fordisseminating
contentontheWeb 9 1.3.2Technologies
for information extraction from web documents 11 1.3.3Technologies
for datastorage 121.4 Structureofthebook 13
Part One A
Primer
onWeb and Data
Technologies
152 HTML 17
2.1 Browser
presentation
andsourcecode 182.2
Syntax
rules 192.2.1
Tags,
elements,and attributes 20 2.2.2 Treestructure 212.2.3 Comments 22
2.2.4 Reserved and
special
characters 22 2.2.5 Documenttype
definition 232.2.6
Spaces
and line breaks 23 2.3Tags
and attributes 24 2.3.1 The anchortag
<a> 242.3.2 The metadatatag <meta> 25 2.3.3 The external referencetag<link> 26 2.3.4
Emphasizing tags <b>, <i>,
<strong> 262.3.5 The
paragraphs
tag <p> 27 2.3.6Heading
tags <hl>, <h2>, <h3>,... 27 2.3.7Listing
content with <ul>,<ol>,
and <dl> 27viii CONTENTS
2.3.9 The <form> tag and its
companions
29 2.3.10 Theforeign script
tag<script>
30 2.3.11 Tabletags <table>,<tr>, <td>,and <th> 322.4
Parsing
322.4.1 What is
parsing?
332.4.2
Discarding
nodes 35 2.4.3Extracting
information in thebuilding
process 37Summary
38Further
reading
38Problems 39
3 XMLandJSON 41
3.1 Ashort
example
XML document 423.2 XMLsyntaxrules 43
3.2.1 Elements and attributes 44
3.2.2 XMLstructure 46
3.2.3
Naming
andspecial
characters 48 3.2.4 Comments and character data 49 3.2.5 XMLsyntaxsummary 503.3 When isanXML document well formedorvalid? 51 3.4 XML extensions and
technologies
533.4.1
Namespaces
53 3.4.2 Extensions of XML 54 3.4.3Example: Really
Simple Syndication
55 3.4.4Example:
scalable vectorgraphics
58 3.5 XML and R inpractice
603.5.1
Parsing
XML 603.5.2 Basic
operations
on XML documents 63 3.5.3 From XMLtodata frames orlists 65 3.5.4 Event-drivenparsing
66 3.6 A shortexample
JSONdocument 683.7 JSONsyntaxrules 69
3.8 JSONand R in
practice
71Summary
76Further
reading
76Problems 76
4 XPath 79
4.1 XPath—a query
language
forwebdocuments 804.2
Identifying
node sets with XPath 814.2.1 Basicstructure ofanXPath query 81 4.2.2 Node relations 84 4.2.3 XPath
predicates
864.3
Extracting
node elements 934.3.1
Extending
the funargument 944.3.2 XML namespaces 96 4.3.3 Little XPath
helper
tools 97CONTENTS ix
Summary
98 Furtherreading
99 Problems 99 5 HTTP 101 5.1 HTTP fundamentals 102 5.1.1 A short conversation witha webserver 102 5.1.2 URLsyntax 104 5.1.3 HTTP messages 106 5.1.4Request
methods 108 5.1.5 Statuscodes 108 5.1.6 Header fields 109 5.2 Advanced features of HTTP 116 5.2.1 Identification 116 5.2.2 Authentication 121 5.2.3 Proxies 123 5.3 Protocolsbeyond
HTTP 124 5.3.1 HTTP Secure 124 5.3.2 FTP 126 5.4 HTTP in action 1265.4.1 The libcurl
library
127 5.4.2 Basicrequestmethods 128 5.4.3 A low-level function of RCurl 131 5.4.4Maintaining
connectionsacrossmultiple
requests 132 5.4.5Options
133 5.4.6Debugging
139 5.4.7 Errorhandling
143 5.4.8 RCurlorhttr—whattouse? 144Summary
144Further
reading
144Problems 146
6 AJAX 149
6.1
JavaScript
1506.1.1 How
JavaScript
is used 150 6.1.2 DOMmanipulation
1516.2 XHR 154
6.2.1
Loading
external HTML/XML documents 1556.2.2
Loading
JSON 1576.3
Exploring
AJAX withWebDeveloper
Tools 158 6.3.1Getting
started with Chrome's WebDeveloper
Tools 1596.3.2 TheElements
panel
1596.3.3 The Network
panel
160Summary
161Further
reading
162CONTENTS
7
SQL
andrelational databases 164 7.1 Overview andterminology
165 7.2 Relational Databases 167 7.2.1Storing
data in tables 1677.2.2 Normalization 170 7.2.3 Advanced features of relational databases and DBMS 174 7.3
SQL:
alanguage
tocommunicate with Databases 175 7.3.1 General remarksonSQL,
syntax,andourrunning example
175 7.3.2 Data controllanguage—DCL
1777.3.3 Data definition
language—DDL
178 7.3.4 Datamanipulation language—DML
180 7.3.5 Clauses 184 7.3.6 Transaction controllanguage—TCL
187 7.4 Databases in action 188 7.4.1 Rpackages
tomanage databases 1887.4.2
Speaking R-SQL
via DBI-basedpackages
189 7.4.3Speaking
R-SQL
via RODBC 191Summary
192Further
reading
193Problems 193
8
Regular expressions
and essentialstring
functions 1968.1
Regular expressions
1988.1.1 Exact character
matching
198 8.1.2Generalizing regular expressions
200 8.1.3 Theintroductory example
reconsidered 2068.2
String processing
2078.2.1 The
stringr
package
207 8.2.2 Acouple
morehandy
functions 211 8.3 A wordoncharacterencodings
214Summary
216Further
reading
217Problems 217
Part Two A Practical Toolbox for Web
Scraping
and TextMining
2199
Scraping
the Web 2219.1 Retrieval scenarios 222
9.1.1
Downloading ready-made
files 223 9.1.2Downloading multiple
filesfromanFTPindex 226 9.1.3Manipulating
URLs toaccessmultiple
pages 228 9.1.4 Convenient functions togather
links, lists,
and tables from HTMLdocuments 232
9.1.5
Dealing
with HTML forms 2359.1.6 HTTPauthentication 245
9.1.7 Connections via HTTPS 246 9.1.8
Using
cookies 247CONTENTS
9.1.9
Scraping
data fromAJAX-enrichedwebpages
withSelenium/Rwebdriver 251 9.1.10
Retrieving
data from APIs 259 9.1.11 Authentication with OAuth 2669.2 Extraction
strategies
2709.2.1
Regular
expressions
270 9.2.2 XPath 273 9.2.3Application Programming
Interfaces 2769.3 Web
scraping:
Goodpractice
2789.3.1 Isweb
scraping legal?
278 9.3.2 What is robots.txt? 280 9.3.3 Befriendly!
284 9.4 Valuablesourcesofinspiration
290Summary
291Further
reading
292Problems 293
10 Statisticaltext
processing
29510.1 The
running example: Classifying
pressreleasesof the Britishgovernment 296
10.2
Processing
textual data 29810.2.1
Large-scale
textoperations—The
tmpackage
298 10.2.2Building
aterm-document matrix 303 10.2.3 Datacleansing
304 10.2.4Sparsity
and n-grams 30510.3
Supervised learning techniques
30710.3.1
Support
vectormachines 309 10.3.2 Random Forest 30910.3.3 Maximum entropy 309
10.3.4 The RTextTools
package
309 10.3.5Application:
Governmentpressreleases 31010.4
Unsupervised learning techniques
31310.4.1 Latent Dirichlet Allocation and correlated
topic
models 314 10.4.2Application:
Governmentpressreleases 314Summary
320Further
reading
32011
Managing
dataprojects
32211.1
Interacting
with the filesystem 322 11.2Processing multiple
documents/links 32311.2.1
Using
/or-loops
32411.2.2
Using while-loops
and controlstructures 326 11.2.3Using
theplyr package
32711.3
Organizing
scraping procedures
32811.3.1
Implementation
of progress feedback:Messages
andprogressbars 331
xii CONTENTS
11.4
Executing
Rscripts
onaregular
basis 334 11.4.1Scheduling
tasksonMac OS and Linux 335 11.4.2Scheduling
tasksonWindowsplatforms
337Part Three
ABag
of Case Studies 34112 Collaboration networks in the US Senate 343 12.1 Informationonthebills 344 12.2 Informationonthe senators 350
12.3
Analyzing
the networkstructure 35312.3.1
Descriptive
statistics 354 12.3.2 Networkanalysis
35612.4 Conclusion 358
13
Parsing
information fromsemistructureddocuments 35913.1
Downloading
data from the FTPserver 36013.2
Parsing
semistructuredtextdata 361 13.3Visualizing
station andtemperaturedata 36814
Predicting
the2014Academy
Awardsusing
Twitter 37114.1 Twitter APIs: Overview 372
14.1.1 TheREST API 372 14.1.2 The
Streaming
APIs 373 14.1.3Collecting
andpreparing
thedata 373 14.2 Twitter-based forecast ofthe2014Academy
Awards 374 14.2.1Visualizing
the data 374 14.2.2Mining
tweetsforpredictions
37514.3 Conclusion 379
15
Mapping
thegeographic
distribution ofnames 38015.1
Developing
adatacollection strategy 38115.2 Website
inspection
38215.3 Data retrieval andinformation extraction 384
15.4
Mapping
names 38715.5
Automating
theprocess 389Summary
39516
Gathering
dataon mobilephones
39616.1
Page exploration
39616.1.1
Searching
mobilephones
ofaspecific
brand 396 16.1.2Extracting
product
information 40016.2
Scraping
procedure
40416.2.1
Retrieving
dataonseveralproducers
40416.2.2 Data
cleansing
405CONTENTS xiii
16.4
Datastorage
40816.4.1 General considerations 408
16.4.2 Tabledefinitionsforstorage 409 16.4.3 Table definitions for futurestorage 410 16.4.4 View definitions for convenient dataaccess 411 16.4.5 Functions for
storing
data 41316.4.6 Datastorageand
inspection
41517
Analyzing
sentiments ofproduct
reviews 41617.1 Introduction 416
17.2
Collecting
the data 41717.2.1
Downloading
the files 417 17.2.2 Information extraction 42117.2.3 Databasestorage 424
17.3
Analyzing
the data 42617.3.1 Data
preparation
42617.3.2
Dictionary-based
sentimentanalysis
427 17.3.3Mining
thecontentof reviews 43217.4 Conclusion 434
References 435
General index 442
Package
index 448