4 RESEARCH METHODS 4.1 Introduction
5. Data measurements – Section 4.6 (p164) describes the development of a measurement and scoring system used here to categorise baseline levels of
4.4 Data procedures
The previous section has described how social media interaction data, downloaded from DataSift in CSV and JSON files, were loaded into the Oracle 12c RDBMS.
Although Oracle's (2012) Text features enabled increasingly sophisticated indexing and querying of free-form text, additional procedures – using even more advanced Natural Language Processing (NLP) software – were adopted to search for and find mentions of place in OSN interaction message text and linked/shared content.
These text-mining ‘augmentations’, and the systems used to perform them, are described in the following section.
4.4.1 Data augmentation
Massive recent growth in the amount of unstructured electronic text available for analysis (JISC, 2012; Manyika et al., 2011) has spurred the development of many commercial and open-source software systems designed to ‘mine’, ‘augment’ or
‘enhance’ textual data. As JISC (2012, p13) state, the ‘availability [of large amounts of text data] does not equate to being able to analyse easily the content to find sought after information or to develop new insights.’ There is simply too much text for individual researchers to read; e.g., JISC note that upwards of ‘1.5 million [journal] articles are added [by 11,500 journals] per year’ and specialist domain knowledge is required to make sense of certain text terms (e.g., ‘tree’, ‘branch’,
‘leaf’ in JISC’s example) that may have very different meanings in different disciplines.
JISC (2012, p13) propose that ‘Text mining offers a solution to these problems, drawing on techniques from information retrieval, natural language processing, information extraction and data mining/knowledge discovery’ in four stages:
1. Enhanced information retrieval
2. Linguistic analysis and entity recognition
3. Information extraction
4. Data mining/Knowledge discovery
Enhanced information retrieval has been used in this research programme (a) to search for relevant academic literature to contextualise the study (Chapter 2, p51), and (b) to search for relevant OSN interactions to provide case study material for the research (Chapter 3, p94 and Chapter 4, p118). As the ~8 million social media interactions recorded here could not possibly be examined individually, three Natural Language Processing (NLP) systems have been used to address JISC’s suggested stages 2 and 3. Two of the three systems, GATEcloud and CLAVIN-rest, offer somewhat similar geoparsing information extraction functionality, allowing cross-comparison, while the third, AlchemyAPI, is particularly well suited to information extraction and knowledge discovery operations against Web-hosted URLs, which are widely shared in OSN interactions. Several data mining and
knowledge discovery processes address the fourth stage of JISC’s suggestions, using a mixture (Sections 4.5.1 and 4.5.2, pp159-161) of relational, non-relational and graph databases, queries, visualisation and statistical analyses (Section 4.5.3, p163).
Stock (2018, p209) has noted that ‘During the last ten years, a large body of research extracting and analysing geographic data from social media has
developed.’ Reviewing 690 papers accessing 20 social media platforms she states that ‘a wide array of […] approaches have been developed, with methods that extract place names from message text providing the highest accuracy.’ The three NLP packages successfully used for geographical entity recognition and extraction from message text and linked/shared content in this research are discussed below.
A fourth subsection (4.4.1.4, p157) briefly describes two others since, as Gritta, Pilehvar, Limsopatham, & Collier (2018) have noted, there is a ‘substantial disparity’
between working or workable NLP entity extraction systems and those that are difficult to install, use or simply will not run.
4.4.1.1 General Architecture for Text Engineering (GATE)
GATE, released as open-source software by a team at the University of Sheffield, ‘is over 15 years old and is in active use for all types of computational task[s] involving human language’ (GATE, 2017). Originally a desktop application, or a set of Java Archives (JARs) for use in ‘embedded’ applications, two more recent developments prompted adoption of the software for use in this research. First, the team released TwitIE (Bontcheva et al., 2013), a Twitter Information Extraction engine and ‘open-source NLP pipeline customised to microblog text at every stage.’ Around 90% of the ~8m records in the case study research data corpus originate from Twitter, and tweets are notoriously ‘difficult [to process]: the genre is noisy, documents have little context, and utterances are very short’ (Bontcheva et al., 2013, p1).
TwitIE is designed to process terse and frequently ungrammatical tweets using the
‘sentence splitter’ and ‘name gazetteer’ functions of ANNIE (A Nearly-New Information Extraction system and another GATE component), supplemented by specially developed functions for language identification, tokenisation,
normalisation, Part of Speech (POS) tagging and Named Entity Recognition (NER).
GATE software operates on a Corpus, a set of documents or, in this case, a set of tweets. A Corpus can be constructed by searching for records within the Oracle 12c database, e.g., to find the 19 Twitter tweets recorded in the SCOT2014 data set made by Scottish First Minister Alex Salmond during the 2014 Scottish
Independence Referendum campaign (Appendix 11 listing 8, p480).
After loading the necessary plugins (File -> Manage CREOLE Plugins… to activate plugins Format_Twitter and Twitter) the TwitIE Ready Made Application may be launched. The appropriate Corpus is selected, and TwitIE run against it, yielding the sort of output shown in Figure 4-11 (p150). In this example checkboxes on the rightmost panel of GATE Desktop have been used to highlight Locations (pink), Persons (purple) and Organizations (green) recognised by the software in the text of Salmond’s 19 Twitter tweets. No human intervention has been necessary.
The software works well on the desktop, but is constrained by memory limits, and cannot deal with the millions of tweets in the research data corpus. Recognising the need to analyse increasingly massive text corpora, the GATE team developed
GATEcloud.net, ‘a platform for large-scale, open-source text processing on the cloud’ (Tablan et al., 2012). NLP pipelines developed on GATE Desktop software can be exported to GATEcloud.net and run on dedicated machines, ‘harnessing the vast, on-demand compute power of the Amazon cloud’ (Tablan et al., 2012, p1).
Figure 4-11 – Scottish First Minister Alex Salmond’s Twitter tweets processed using TwitIE on GATE Desktop
Sharding, load-balancing and other ‘important infrastructural issues’ of the process are handled by the GATEcloud.net application which, the authors suggest, helps enable the ‘democratization of science’ by providing individual researchers or small research groups with ‘cutting-edge, data-driven, text-processing’ systems that are otherwise extremely difficult to set up (Tablan et al., 2012, p2). Having used GATE Desktop experimentally for some time, the entire research data corpus of ~8 million records was processed using GATEcloud.net (Figure A8-3, p441). The run was
designed to perform Information Extraction and Named Entity Recognition, particularly of locations and, coincidentally, helped in beta testing of a new
deployment of GATEcloud software (Roberts, personal communication, 2016). Input
files in DataSift’s JSON format, used earlier to help develop GATE’s DataSift reader (Bontcheva & Greenwood, personal communication, 2014), were processed using TwitIE with output to a set of 86 (US2012=16, SCOT2014=70) JSON files
subsequently re-imported to the Oracle 12c database.
Crucially, rather than outputting only input text (i.e., the message) and TwitIE’s augmentations, co-development work with Roberts (2017) ensured that ‘tweet IDs [were] passed through into the output’; making it much easier to join input and output for analysis using SQL in the database (Section 5.2.2.1, p193). The ability to join two tables of data together on a common key (e.g., an ID or identifier field) held in each is a central concept in data management (Codd, 1970). While tables can be joined on text fields (e.g., the message text of OSN interactions) there are many duplicated rows of message text (n=4,739,827), mainly Twitter retweets, in the OSNDATA database; each retweeted by an individual user with different characteristics (e.g., coordinate-geotagging or not). Using message text as the key, in this or similar instances, would not enable correct joining of metadata to the respective message input to and output from GATEcloud. The co-developed functionality in GATEcloud newly resulting from this research should, therefore, prove extremely useful for subsequent researchers.
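The difference between joining on an identifier and joining on message text can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the Oracle 12c database; the table names, column names and sample rows are assumptions made for illustration and are not taken from the OSNDATA schema.

```python
import sqlite3

# In-memory stand-in for the relational database.
# Table and column names are illustrative, not the thesis schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interactions (id TEXT PRIMARY KEY, message TEXT, geotagged INTEGER);
    CREATE TABLE nlp_output (id TEXT PRIMARY KEY, message TEXT, location TEXT);
""")

# Two users retweet the same text; only one of them is coordinate-geotagged.
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [("t1", "RT Vote in Edinburgh today", 1),
     ("t2", "RT Vote in Edinburgh today", 0)],
)
conn.executemany(
    "INSERT INTO nlp_output VALUES (?, ?, ?)",
    [("t1", "RT Vote in Edinburgh today", "Edinburgh"),
     ("t2", "RT Vote in Edinburgh today", "Edinburgh")],
)

def count_rows(join_column):
    # join_column is fixed by the calls below, never taken from user input
    sql = (f"SELECT COUNT(*) FROM interactions i "
           f"JOIN nlp_output o ON i.{join_column} = o.{join_column}")
    return conn.execute(sql).fetchone()[0]

rows_by_id = count_rows("id")         # one output row per input row
rows_by_text = count_rows("message")  # every duplicate matches every duplicate
print(rows_by_id, rows_by_text)
```

Joining on the identifier returns two rows, one per interaction, whereas joining on message text returns four, because each retweet matches every copy of the same text and the geotagging metadata can no longer be attributed to the correct user.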
GZIP-compressed in Linux, GATEcloud.net output totalled 170MB (US2012) and 790MB (SCOT2014) in size. Uncompressed, the US data set required 901MB of file storage and the Scottish data set 4.19GB. The tables storing this data in Oracle 12c are 2.34GB and 11.19GB in size respectively.
4.4.1.2 AlchemyAPI
AlchemyAPI, now re-branded Watson Natural Language Understanding and part of IBM’s Watson Developer Cloud service (IBM, 2017b), is Cloud-hosted, commercial software, available on a rate-throttled basis upon request to academic researchers.
This RESTful API service has been used by several scholars (Cios & Kurgan, 2006;
Gelernter & Mushegian, 2011; Kulshrestha, Zafar, Espin-Noboa, Gummadi, &
Ghosh, 2017; Quercia et al., 2012; Saif, He, Fernandez, & Alani, 2016) particularly for Twitter sentiment analysis, where it may help to ‘alleviate data sparsity [and]
performs better than [other Web-hosted systems including Zemanta or OpenCalais]
in terms of the quality and the quantity of the extracted entities [returned]’ (Saif, He, & Alani, 2012, p4).
IBM (2017a) documentation states that AlchemyAPI offers:
• Entity Extraction
• Sentiment Analysis
• Emotion Analysis
• Keyword Extraction
• Concept Tagging
• Relation Extraction
• Taxonomy Classification
• Author Extraction
• Language Detection
• Text Extraction
• Microformats Parsing
• Feed Detection
• Linked Data Support
As a RESTful web service, calls to AlchemyAPI are made over Hyper Text Transfer Protocol (HTTP) using an API Key for authentication. Academic usage is restricted to 30,000 ‘daily transactions’, compared to 2 million/day or more for commercial users, and the number of ‘transactions’ used to process each piece of text (e.g., OSN message text or text found at a linked/shared URL) will vary according to which calls, from the list above, are made to the service.
Bespoke software was developed using Ruby (2017) scripts running on a CentOS 7 virtual machine (Appendix 8, p436) to select data from Oracle 12c, pass it to the
AlchemyAPI service and store the returned JSON directly in the database. As the service is rate-limited, the XML-parsing Nokogiri (2017) plugin for Ruby was used to decode responses from the AlchemyAPI management URL to determine how many
‘daily transactions’ remained to be consumed (Appendix 10, p451). Two applications were developed:
1. PROCESS_RECS – A Ruby script, executed through a shell script called from cron, running every 10 minutes to process up to 150 records per run (Appendix A10.3, p451) selected from a ‘queueing’ table (ALCHEMY_API) created in Oracle 12c on the VM host, a Dell Latitude E7440 laptop running Windows 10. The queueing table was populated, with five SQL INSERT statements, to store five tranches of OSN interaction messages for
AlchemyAPI processing (US2012_GEO Stream=146,424, SCOT2014 geo tagged=1,074, US2012_NON_GEO 1% sample tweets=92,304, and SCOT2014 1% sample tweets=56,622 records). Message text was
processed using AlchemyAPI calls for Entity Extraction, Keyword Extraction, Concept Tagging, Sentiment Analysis, Relation Extraction, Text Extraction and Taxonomy Classification. The data processing allows for comparison, according to these various augmentations, for all coordinate geotagged records from each Stream against a random sample of non-coordinate-geotagged records from both US2012 and SCOT2014 data sets. Results are presented in Chapter 5 (p186).
2. PROCESS_URL_RECS – A Ruby script, executed through a shell script called from cron, running every 15 minutes to process up to 250 records per run (Appendix A10.4, p461) selected from the LI_LINKS_URLS_DISTINCT
‘queueing’ table created in Oracle 12c on the Dell laptop VM host, as above.
The queueing table was populated, using a SQL INSERT statement, with 641,472 distinct link URLs (pointers to linked URLs made by Twitter or Facebook users in an approximate 80:20 ratio) derived from 3,485,840 URL links to online media (e.g., newspaper websites, blogs, YouTube videos etc.)
recorded in the LINKS_URL field of the main INTERACTIONS table.
Linked URLs were processed using AlchemyAPI calls for Entity Extraction.
The data processing allows for comparison of many detected entity types, particularly location, in linked URLs. Numbers of locations referenced in linked URLs may then be compared by user class, e.g., coordinate-geotagging or not. Results are presented in Chapter 5 (p186).
While GATEcloud.net could be set up and used within days to process ~8 million OSN interactions, rate-throttling of the AlchemyAPI service required the
development of queuing tables, populated by numbers of records likely to be processed within the timescales available. All coordinate-geotagged US2012 and SCOT2014 OSN interactions were processed, and all distinct linked URLs, but only a 1% sample of all other OSN interactions from each of the two case study events.
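The queue-and-batch pattern described above can be sketched as follows. This is not the Ruby code listed in Appendix 10: it is a Python illustration under stated assumptions, in which sqlite3 stands in for Oracle 12c, the `queue` table and the `annotate` function are hypothetical, and the cost of seven transactions per record is an assumed figure (the source notes only that the cost varies with the calls made). The remaining daily allowance, which the original scripts read from the AlchemyAPI management URL, is simply passed in as a parameter.

```python
import sqlite3

BATCH_SIZE = 150       # records per run, as stated for PROCESS_RECS
CALLS_PER_RECORD = 7   # assumption: one transaction per call type requested

def annotate(text):
    """Hypothetical stand-in for the remote NLP call; returns a JSON string."""
    return '{"length": %d}' % len(text)

def process_batch(conn, remaining_transactions):
    """Process queued records until the batch size or daily allowance is reached."""
    affordable = remaining_transactions // CALLS_PER_RECORD
    limit = min(BATCH_SIZE, affordable)
    rows = conn.execute(
        "SELECT id, message FROM queue WHERE result IS NULL ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    for rec_id, message in rows:
        # Store the returned JSON next to the queued record
        conn.execute("UPDATE queue SET result = ? WHERE id = ?",
                     (annotate(message), rec_id))
    conn.commit()
    return len(rows)

# Illustrative queue of 400 unprocessed messages
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, message TEXT, result TEXT)")
conn.executemany("INSERT INTO queue (message) VALUES (?)",
                 [(f"message {n}",) for n in range(400)])

first = process_batch(conn, remaining_transactions=30000)  # full batch
second = process_batch(conn, remaining_transactions=70)    # allowance nearly spent
pending = conn.execute(
    "SELECT COUNT(*) FROM queue WHERE result IS NULL").fetchone()[0]
print(first, second, pending)
```

Because each run marks the records it has processed, a scheduler such as cron can call the same routine repeatedly; runs that find the allowance exhausted process nothing and leave the queue intact for the next day.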
4.4.1.3 Cartographic Location and Vicinity INdexer (CLAVIN)
CLAVIN (the Cartographic Location and Vicinity Indexer) is, according to its developers Berico-Technologies (2017), ‘an award-winning open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution. It extracts location names from unstructured text and resolves them against a gazetteer to produce data-rich geographic entities.’ It is one of several gazetteer-based geoparsing solutions evaluated in this research (see Section 4.4.1.4, p157) but the only one that would compile and run reliably. The version of CLAVIN used, the CLAVIN-rest variant, is a ‘DropWizard RESTful micro-service demonstration of CLAVIN, GeoNames, and OpenNLP or CLAVIN-NERD’
(Berico-Technologies, 2017). The software uses the Stanford CoreNLP toolkit (Manning et al., 2014; Stanford University, 2017) which ‘can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations
between entity mentions, get quotes people said, etc.’ (Stanford University, 2017).
Stanford CoreNLP is widely used and can also be used in GATE. Like CLAVIN-rest, Stanford’s CoreNLP code is written in Java.
Figure 4-12 – Scottish First Minister Alex Salmond’s Twitter tweets processed using CLAVIN-rest running on a CentOS 7 virtual machine
The Apache Maven ‘software project management and comprehension tool’
(Apache Software Foundation, 2017) was used to build and compile the code (mvn package was run against the repository downloaded from GitHub) on a CentOS 7 virtual machine (Appendix 8, p436). The GeoNames (2016) gazetteer database allCountries.zip, listing 11,370,639 place names and locations with many language-specific spelling alternatives (e.g., Londres for London), was downloaded on 4 April 2017 and used as the location master file. When started, CLAVIN-rest presents a Web browser-based interface on localhost:9090 as shown in Figure 4-12.
Text data (in this case First Minister Alex Salmond’s 19 sampled Twitter tweets from the 2014 Scottish Independence Referendum) may be copied into the TEXTAREA at the top of the browser page and, once submitted, will be geoparsed by CLAVIN-rest. The mappable locations found in this example include ‘Scotland’ and ‘Europe’.
Others, including ‘Westminster’, identified as a location by GATE Desktop (Figure 4-11, p150) in another of Alex Salmond’s Twitter tweets, have not been found.
Geoparsers have different success rates (Gritta et al., 2018) and GATEcloud.net, AlchemyAPI and CLAVIN-rest could all be fooled by sentence structure, a problem returned to in Section 5.4 (p221).
As a RESTful web service, CLAVIN-rest could also be called using curl, the Linux command to call URLs from the terminal. A shell script (Figure 4-13, p156) was developed to pass Universally Unique Identifiers and message content from OSN data (concatenating UUID and INTERACTION_CONTENT fields with the characters ‘|~|’, which did not appear anywhere else in message text) to CLAVIN-rest.
curl -s --data "$f2" --header "Content-Type: text/plain"
Figure 4-13 – Shell script written to call CLAVIN-rest from the command line
The standard output of CLAVIN-rest appends all GeoNames data to the input text, including multiple language-alternative gazetteer spellings (n=185 in the case of London, UK), creating extremely verbose and excessively large files (1.7GB in;
~103GB out). This can be controlled through the use of the geotagmin URL argument (Figure 4-13), which prevents output of multiple language-alternative spellings. File sizes were further minimised by piping output through grep to store only the UUIDs, and resolved locations in JSON, of text that could be geoparsed.
This resulted in a much smaller output file size of 487.13MB. The script could be run in a Linux terminal on the CentOS 7 virtual machine using the command:
./test_curl_line_at_a_time_minjson.sh > out.txt
All 8,196,380 OSN records were passed through CLAVIN-rest and 1,978,404 records (24.14%) containing resolvedLocationsMinimum, an array of GeoNames locations with Latitude and Longitude coordinates in JSON, and UUID, to join back to the input text and associated metadata, were imported into the Oracle 12c database. Results from this exercise, and a comparison of CLAVIN-rest and the other NLP-based NERs used in this research, are presented in Chapter 5 (p186).
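The delimiter handling and output filtering described in this subsection can be sketched in Python. The example is illustrative only: the ‘|~|’ separator and the resolvedLocationsMinimum key come from the text above, but the helper functions, the shape of the sample responses and the field names inside each location (`name`, `latitude`, `longitude`) are assumptions, not the documented CLAVIN-rest output format.

```python
import json

DELIMITER = "|~|"  # separator reported as absent from all message text

def make_line(uuid, content):
    """Concatenate identifier and message text for submission."""
    return f"{uuid}{DELIMITER}{content}"

def split_line(line):
    """Recover identifier and message text from a delimited line."""
    uuid, _, content = line.partition(DELIMITER)
    return uuid, content

def keep_geoparsed(pairs):
    """Yield (uuid, locations) only for responses with resolved locations.

    pairs is an iterable of (uuid, response_json_string); the response
    structure used here is a hypothetical simplification.
    """
    for uuid, response in pairs:
        locations = json.loads(response).get("resolvedLocationsMinimum", [])
        if locations:
            yield uuid, locations

line = make_line("u1", "Scotland in Europe")
responses = [
    ("u1", '{"resolvedLocationsMinimum": '
           '[{"name": "Scotland", "latitude": 56.0, "longitude": -4.0}]}'),
    ("u2", '{"resolvedLocationsMinimum": []}'),
]
kept = list(keep_geoparsed(responses))

# Share of records with at least one resolved location, from the reported counts
share = round(100 * 1_978_404 / 8_196_380, 2)
print(split_line(line), len(kept), share)
```

Carrying the identifier through the delimiter, rather than relying on message text, is what allows the filtered output to be joined back to the input records and their metadata.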
4.4.1.4 Others
Gazetteer search has played a sometimes confounding role in the development of GIS technology on the modern-day Web (G. Cheng & Du, 2008; Pradeepa &
Manjula, 2016; Shi & Barker, 2011) and in historical applications (Southall et al., 2011, 2009). The spelling of place names may change over time, many alternate spellings may be used, or places (e.g., Kaliningrad) may change their name altogether. Software may, or may not, be able to pick up on these subtleties, and few geoparsers come close to human levels of accuracy when identifying probable place names within text (Gritta et al., 2018).
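A minimal sketch of alternate-name lookup shows how a gazetteer can absorb some of these variations. It is a Python illustration under simplifying assumptions: the two-entry `GAZETTEER` dictionary and the helper functions are hypothetical, and the approach ignores the ambiguity that arises when one name refers to several places, which real geoparsers must resolve from context.

```python
import unicodedata

def normalise(name):
    """Case-fold and strip diacritics so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def build_index(gazetteer):
    """Map every normalised canonical or alternate name to its canonical form."""
    index = {}
    for canonical, alternates in gazetteer.items():
        for name in (canonical, *alternates):
            index[normalise(name)] = canonical
    return index

# Illustrative entries only: a language-alternative spelling and a former name
GAZETTEER = {
    "London": ["Londres"],
    "Kaliningrad": ["Königsberg"],
}
INDEX = build_index(GAZETTEER)

def resolve(name):
    """Return the canonical place name, or None if the name is unknown."""
    return INDEX.get(normalise(name))

print(resolve("londres"), resolve("KONIGSBERG"), resolve("Atlantis"))
```

A lookup of this kind handles known variants but cannot recognise names missing from the gazetteer, which is one reason coverage differs between geoparsers.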
In addition to the GATEcloud.net, AlchemyAPI and CLAVIN-rest NLP-based NERs described above, several other geoparsers were assessed. Unfortunately, while showing promise, these systems failed to deliver owing to setup, coding or software compilation problems.
• BALEEN – from the UK’s Defence Science and Technology Laboratory (2015) is another RESTful entity extraction framework designed to ‘extract
information from unstructured and semi-structured text.’ The software uses Ordnance Survey-derived gazetteers which might have improved geoparsing
results against the SCOT2014 dataset. Unfortunately, the available
downloadable version of Baleen would neither compile nor run.
• Edinburgh Geoparser – from the Language Technology Group (2014) at the University of Edinburgh (Alex, Byrne, Grover, & Tobin, 2014) has been widely used, and scored particularly highly in Gritta et al.'s (2018) review of five geoparsing systems. Version 1.1 (16/03/2016) was downloaded and installed on a Scientific Linux virtual machine. Packaged tests ran but the software would not run against the OSN data corpus examined here.
It is probable that more time spent with either, or both, of these software packages would eventually have yielded results. However, both distributions are open-source projects and, as such, the onus is on the user to attempt to solve installation or setup problems. Neither system offered dedicated support and one of them (the Edinburgh Geoparser) is now a ‘retired’ project. Results presented in Chapter 5 (p186) therefore rely upon NLP-based data augmentations produced using GATEcloud.net, AlchemyAPI and CLAVIN software.