Data analysis - Data measurements – Section 4.6 (p164) describes the development of a measurem

4 R ESEARCH M ETHODS 4.1 Introduction

5. Data measurements – Section 4.6 (p164) describes the development of a measurement and scoring system used here to categorise baseline levels of

4.5 Data analysis

All of the data feeding into the Oracle 12c RDBMS used in this research (Section 4.3.1.3, p145), from case study data acquisition (Section 4.2.5, p134) and data augmentation (Section 4.4.1, p147), must be queried, tabulated and analysed to produce results. A database management system has no practical utility if it can only store data; files and file systems do that. This section details the range of data query, tabulation and analysis (Section 4.5.1, p159), data visualisation (Section 4.5.2, p161) and statistical software (Section 4.5.3, p163) used to manipulate stored data. Measurements of geographicality resulting from these analytical methods are described in Section 4.6 (p164).

159 4.5.1 Data query, tabulation and analysis

In 1972 Codd set out to ‘define a collection of operations on relations [or a]

relational algebra’ which could be used to build a ‘query language’. He stated that

‘In a practical environment [this query language] would need to be augmented by a counting and summing capability, together with the capability of invoking any one of a finite set of library functions tailored to that environment’ (Codd, 1972, p1).

This work led directly to the creation of Structured Query Language (SQL) which has been used extensively in database management systems, whether relational or not, ever since.

The two case study data sets, with added DataSift source indicator (STREAM), sequential numeric identifier (STREAMID) and Universally Unique Identifier (UUID) fields held in the main Oracle 12c table INTERACTIONS, is comprised of 1,196,671,480 data points. Clearly it is impossible to analyse a >1 billion cell matrix without using sophisticated computer systems. Relational Database Management Systems (RDBMS) and Structured Query Language (SQL) have been developed to address large, complex, tasks of this type. As Wolfram (2006, p301) explains,

‘DBMSs are primarily used to house structured textual and numeric data that have been compartmentalized into records and fields. This compartmentalization facilitates access and retrieval of data contents. Through the use of the SQL data manipulation language, one is able to readily summarize and process content. The power associated with these capabilities makes relational DBMSs and, specifically, SQL well suited for the storage and manipulation of informetric data.’

SQL is particularly ‘well suited’ to generating descriptive counts and aggregations, using the COUNT… GROUP BY… syntax. In practice, many data investigations start by counting records, since ‘aggregation [offers] the ability to summarize

information’ (van Renesse, 2003, p87, author's italics). Initial data investigation in this research followed the ‘manageability’ practice described by Knobbe, Siebes, &

Marseille (2002), using standard SQL constructs such as COUNT, COUNT

160 DISTINCT, MIN, MAX, SUM, AVG and, significantly in the context of

heavily-skewed OSN data (Section 6.4.3, p255), MEDIAN.

Upon successful completion of ETL data import processes (Section 4.3.1.3, p145), which themselves relied upon many SQL queries to check data consistency along the way, the first query run against the newly-created database counted the number of records in the INTERACTIONS table (Appendix 11 listing 9, p480).

Table 4-5 – Count of Interactions by Stream

STREAM COUNT

US2012_GEO 146,424

US2012_NON_GEO 1,560,967 US2012_NON_GEO_HISP 11,276

SCOT2014 6,477,713

This simple statement was swiftly followed by a slightly more complicated SQL query designed to count the number of records by Stream (Appendix 11 listing 10, p480), yielding the result set shown in Table 4-5. On a laptop equipped with Solid State Disks (SSD)s, using a function-based bitmap index (Oracle, 2016b) on the STREAM field, the query ran in 2.104 seconds over 8,196,380 records. Many SQL queries of this type, some taking much longer to compute, or running within looping PL/SQL programmes (Feuerstein & Pribyl, 2005), have been issued as part of the investigatory process and are referenced throughout this thesis (Appendix 11, p479). Queries have been designed to assess sparsity, coordinate and

toponymic geographicality, temporality, spatiotemporality and skewness in the research data corpus of OSN interactions. Output has been visualised mainly through the use of Tableau and various GIS packages, together with Gephi graph analysis software. These software systems are described in the following section.

161 4.5.2 Data visualisation

Most of the maps and charts in this thesis, excluding the maps in Section 6.4.4 (p262) produced with QGIS (2018), have been generated using Tableau (2017b) Desktop Professional Edition versions 8.2 through 10.5 running under 64-bit

Microsoft Windows Server 2012 R2 or Windows 10 Professional. Tableau (2017a) is made freely available to academic users having grown out of a data-based

visualisation project (‘Polaris’) at Stanford University (Stolte, Tang, & Hanrahan, 2002). Presenting Polaris, the authors stated that:

In the last several years, large multidimensional databases have become common in a variety of applications, such as data warehousing and scientific computing. Analysis and exploration tasks place significant demands on the interfaces to these databases. Because of the size of the data sets, dense graphical representations are more effective for exploration than spreadsheets and charts. Furthermore, because of the exploratory nature of the analysis, it must be possible for the analysts to change visualizations rapidly as they pursue a cycle involving first

hypothesis and then experimentation.

(Stolte et al., 2002, p52)

Polaris was designed ‘to discover structure, find patterns, and derive causal relationships’ in large databases. The software featured tight integration between visual design and data query, achieved through the use of Pivot Tables and ‘n-dimensional data cubes [where] each dimension in [the] cube corresponds to one dimension in the relational schema’ (Stolte et al., 2002, p52). The design of Polaris exploited many of the graphing techniques (use of size, shape, colour etc.) first developed in Bertin's (1967) Semiology of Graphics, later codified in Wilkinson's (1999) computerised Grammar for Graphics. Now commercialised as Tableau, this system (VizQL, or Visual Query Language) allows the user to ‘[make] interactive

162 data visualization an integral part of understanding data’ (Tableau, 2017c) through the use of a ‘drag and drop’ interface which translates visual requirements (‘draw a map’, ‘plot a line graph’) into standard SQL queries, which can be executed against a large number of databases, including the Oracle 12c RDBMS used here.

Tableau is not a conventional Geographic Information System (GIS), but does have mapping capabilities. Point plotting of Latitude/Longitude data and geocoding, with display against an OpenStreetMap (2017) backdrop, has been available since version 8. Version 10, released in 2017, supports inclusion of third-party maps in Keyhole Markup Language (KML), ESRI Shapefile, MapInfo (Tables and MapInfo Interchange Format) and GeoJSON formats (Marten, 2017). The software also features ‘Paging’, the computerised animation of time-series data – including spatiotemporal data – which has been used to temporally visualise coordinate-geotagged OSN interactions. Even on a commodity SSD-equipped laptop, also running the Oracle 12c RDBMS, Tableau can rapidly produce graphical output from large (~8 million row) database tables or views. As the software has evolved, a wide range of connections to other database servers have been added, including major NoSQL software releases such as MapR Hadoop Hive and MongoDB mentioned earlier (Section 4.3.1.2, p140). This combination of features has made Tableau a particularly useful component of the exploratory spatiotemporal data analysis and visualisation methodology adopted in this research (Section 3.3, p102).

Aside from Tableau, the usual range of desktop computing applications (e.g., Microsoft Excel) have been used together with QGIS (2018) and MapInfo (Pitney Bowes, 2018) GISs. One other specialist scientific computing package, Gephi (2018a), has also been used. This application, which describes itself as ‘the leading visualization and exploration software for all kinds of graphs and networks’ has proven particularly useful when analysing or visualising social network graphs, e.g., Twitter mentions relationships (Figure 4-5, p130; Figure 6-21, p277). The software stems from academic research (Bastian et al., 2009) and encodes several algorithms

163 commonly used in graph network analysis (Jacomy, Venturini, Heymann, & Bastian, 2014; S. Martin et al., 2011) in a flexible and reasonably easy-to-use package.

4.5.3 Statistical tests

A series of 80 Oracle 12c database views, created in SQL (e.g., Appendix 11 listing 37, p491), output numbers of NLP-detectable toponymic mentions in OSN

interaction message text and linked/shared URL content for further analysis in R statistical computing software (The R Foundation, 2018). Paired (Welch’s) T-tests were used to measure statistical significance (Spector, 2018).

Welch’s T-test is a variant of Student’s T-test and is more reliable when the two samples for comparison exhibit unequal variances and unequal sample sizes. This is the case in the current research where numbers of NLP-detectable toponymic mentions vary substantially, according to interaction or user categorisation (e.g., coordinate-geotagged/geotagging or not) and by OSN source/subtype (i.e..,

Facebook post, Twitter tweet or retweet) within event. For example, at least 1 and at most 706 toponymic mentions per interaction were detected in non-coordinate-geotagged Facebook interactions by GATEcloud in the SCOT2014 data set (Table A12-3, p506). Within events there are also sizeable discrepancies in ‘like-for-like’

sample sizes, e.g., in the US2012 data set GATEcloud detected toponymic mentions in 125,758 Twitter tweets; 21,455 of which were coordinate-geotagged and 104,303 were not (Table A12-1, p504).

R scripts computed descriptive statistics (Appendix A12.1 listing 2, p494), T and P scores (Appendix A12.1 listing 3, p494) for numbers of NLP-detectable toponymic mentions in Facebook (FB), Twitter tweet (TW) and retweet (RT) message text or linked/shared URL content, whether coordinate-geotagged (GEO=Y) or not (GEO=N), for each of the three NLP/geoparser systems described in Section 4.4 (p147); TwitIE on GATEcloud (GT), AlchemyAPI (AL against message text; LI against links) and CLAVIN-rest (CL). These computations are repeated for the two

164 case study electoral events, US2012 and SCOT2014. Summary statistics are presented in Section 5.3 (p219) while detailed statistical results and commentary are presented in Appendix A12.2 (p502).

In document Geotagging matters?:the role of space and place in politicised social media discourse (Page 188-194)