
As already stated in the preface, the work in this thesis was conducted while working as a researcher on six different academic projects. The Equator project with UCL Computer Science is only mentioned as it contributed the papers I wrote on air quality monitoring. Then, working for the Centre for Advanced Spatial Analysis (CASA), the GeoVUE, GenESiS, NeISS, Talisman, and ANTS FutureICT projects make separate contributions. The web-based mapping work started with GeoVUE, continued into GenESiS, and provided the foundation for the project proposal that made the JISC NeISS grant successful, delivering the first Census maps project and SurveyMapper. The NCRM Talisman grant was written around geospatial data mining, real-time city data and visualisation, building on my data store mining project, which began as an idea for a PhD topic while working on GenESiS.

Real-time data completes the picture with the FutureICT funded project, ANTS, which I pitched as an idea at a funding meeting and worked on with three other researchers for two months, although all the code development is mine. A time-line of the development is shown in Table 1.1.

Table 1.1: Time-line of Work

2004 · · · •Equator project in UCL Computer Science. This is only mentioned for the papers on air quality which are referenced later.

2005 · · · •GeoVUE project starts in CASA.

2006 · · · •GMapCreator released for automatically building Google Maps websites from data in shapefiles.

2008 · · · •MapTube website released on 20th February at the Barbican Centre in London. The idea behind MapTube is to crowd-source geospatial data.

•GenESiS project starts in CASA.

•Mood Maps, volunteered geographic information project created by Prof. Andy Hudson-Smith and initially trialled on Radio Four’s iPM programme as a “Credit Crunch” question. This is the genesis of the mapping on demand technology.

•MapTubeD, MapTube Dynamic tile renderer. This was built to make the mood maps render immediately from fast changing data, but also has applications for automatic mapping on MapTube.

•Geometry Finder web service included as a spin-off from MapTubeD. This is a piece of infrastructure which allows automatic identification of the spatial context of data.

2010 · · · •Talisman project starts in CASA.

•Data Store Miner, first envisaged following the launch of the London Datastore in the same year. This takes the automatic mapping technology to a new level by mapping all the geospatial data on the website. Map comparisons with Census data follow on from this.

2011 · · · •GeoGL, virtual globe project for 3D rendering real-time data. Used for analysis of large amounts of real-time data not possible with the web based systems.

2012 · · · •ANTS, Adaptive Networks for complex Transport Systems project starts in CASA (duration 3 months). Provides real-time transport data to other systems.

•CityDBAPI, data downloading and query API for real-time data which is used by other systems. Handles transport data (via ANTS), cycle hire, weather and air quality data.

2016 · · · •MapTubeV, vector tiler for MapTube. Adds the ability to visualise ‘one-off’ model outputs.

•MapTubeExplorer, provides map comparisons between maps on MapTube where the geometry may not be identical. Built as an experimentation tool for map comparison techniques.

The audience for this work is split between web infrastructure for mapping and experimentation tools for researchers. The MapTube components are all web-based mapping, with this thesis presenting the theory behind the performance enhancements that make this work in practice. The end result is a system for casual users to find and explore geospatial data, and, hopefully, upload some of their own. The data store mining and MapTube Explorer projects are aimed at academic researchers as they address the problem of how to get more out of web-based maps than simply visualisations of data. These projects come from the idea of handling sets of geospatial data together as a block, where comparisons can be performed between maps to find similarities. Neither project is designed as a fully working package. While they do work on the examples shown here, their worth is in showing a methodology which other researchers can pick up and modify to mine different data stores and perform other types of map comparisons. The value in the code is the generalisation of the problem, so only a small percentage of the code needs to change to adapt it to a different problem.

Real-time data also falls into the infrastructure category due to the nature of the systems involved. Different logins are needed to access the services for different sources of data. The result presented in this thesis is a system similar to TransportAPI¹, two years before TransportAPI launched. In fact, the ANTS project and subsequent developments in this thesis put forward a more integrated approach to city scale data than just the transport system. This is aimed at researchers, providing simple access to data via an API which also feeds into other real-time visualisations of city data, for example, http://citydashboard.org, the iPad wall installed in the London Mayor’s office in 2012 and an exhibit at the London Transport Museum, along with various other one-off exhibitions and conferences. The main aim of the real-time part of this thesis is to make it easier for researchers to access the type of data needed for research, by collecting everything in one place.

Chapter 2

Literature Review

The following chapter presents a brief overview of the literature relevant to this thesis.

2.1 Presentation of Geospatial Data

The first use of the term “Choropleth” is by Wright in his 1938 article on population mapping [Wri38], although Tobler points out that the terms “Choropleth” and “Cartogram” were used interchangeably around that time [Tob04] and that the first use of choropleth maps is thought to be by Minard, although Wright was the first to use the name. The choropleth map uses coloured areas to represent data and is one of the most common forms of geographic data visualisation in use today. Good examples are the political maps of election results, where coloured areas represent the easily identifiable colours of the political parties who won the majority of votes.

Figure 2.1: Two choropleth maps of the UK general election: (a) UK General Election Result 2005, (b) UK General Election Result 2010. The 2015 and 2017 results could be added to this sequence.

Figure 2.1 shows the general election result for 2005 and 2010. Political parties in the UK have standardised colours, so the first problem with the choice of a representative set of map colours is eliminated. People viewing these maps should already be conditioned to accept that red is Labour and blue is Conservative, but this also highlights the main criticism of this type of choropleth. In 2005, red won a clear 50% majority, while in 2010, blue and yellow had to join together to form the required 50%. This is not immediately obvious from the map as the Parliamentary Constituency areas differ in size, giving more visual prominence to those with greater area. Techniques like hexagon maps [Hen10] can avoid this problem, but how well they work is highly dependent on the data being displayed. In the case where the map is used to display outliers on a map that is mainly white, this problem can be turned into an advantage. Also, when displaying Census data, where areas have already been normalised to all contain approximately the same number of people, this might not be an issue.

Openshaw first coined the term “Modifiable Areal Unit Problem”, or “MAUP”, in his article “The Modifiable Areal Unit Problem” [Ope84], even though the problem of zone aggregation in the correlations of U.S. Census statistics had been published as early as 1934 by Gehlke and Biehl [GB34]. Statistical effects caused by the choice of which electoral wards form the different Parliamentary Constituencies are used as an example by Openshaw, who comments on how the zonal aggregation choice in Camden can change the result. In his example, there are 520 Parliamentary Constituencies in 1983, but this has now increased to 650 in 2019, which is reflected in the two maps in figures 2.1a and 2.1b. Openshaw’s statement on page 7 of [Ope84], “the availability of fast super-computers opens up the possibility of seeking approximate numerical solutions”, and his comments on “Monte Carlo optimisation methods” in the conclusion both relate to his analysis of MAUP as a combinatorial problem. Hennessy and Patterson state that “...the highest-performance microprocessors of today outperform the supercomputer of less than 10 years ago” [HP11, p. 2], which needs to be taken in the context of the PDP370 that Openshaw was using in 1983 for his analysis. Current desktop computers are 3.5 decades removed from this, or more than 3 supercomputer generations ahead. The graph of processor speeds that the authors use to justify this claim points to a 10,000 times increase in speed between 1983 and 2006. This is significant in the context of this thesis as the aim is to provide a next generation of tools for geographers and spatial analysts to use. With MAUP, this amounts to a sensitivity analysis to guard against any zoning bias in the results. Either on the desktop, or in the cloud, the compute power now exists to perform this type of analysis. However, the number of zones of interest has also gone up, often with models of the whole of the U.K. containing many thousands of zones in an attempt to combat edge effects. This is an O(n²) problem due to the combinatorial nature of the zones, which suggests that compute speed is not keeping up with the requirements if it only increases linearly over time. A robust sensitivity analysis with large n is likely to require a degree of parallelism to compute in a reasonable time.
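The kind of zoning sensitivity analysis described above can be illustrated with a minimal Python sketch. It is not the method used in this thesis or by Openshaw: it simply draws repeated random zonings of synthetic base units and records how the correlation between two aggregated variables changes. The function names, the synthetic data and the decision to ignore zone contiguity are all assumptions made for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_zoning(n_units, n_zones, rng):
    """Assign each base unit to a zone at random, leaving no zone empty.

    Assumption: contiguity is ignored, which a real zoning system would enforce.
    """
    extras = [rng.randrange(n_zones) for _ in range(n_units - n_zones)]
    labels = list(range(n_zones)) + extras
    rng.shuffle(labels)
    return labels


def aggregate(values, labels, n_zones):
    """Sum the base-unit values within each zone."""
    totals = [0.0] * n_zones
    for value, zone in zip(values, labels):
        totals[zone] += value
    return totals


def zoning_sensitivity(x, y, n_zones, trials, seed=0):
    """Correlation between zone totals under repeated random zonings."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        labels = random_zoning(len(x), n_zones, rng)
        zoned_x = aggregate(x, labels, n_zones)
        zoned_y = aggregate(y, labels, n_zones)
        results.append(pearson(zoned_x, zoned_y))
    return results


# Synthetic base units (hypothetical data): y depends weakly on x plus noise.
data_rng = random.Random(42)
x = [data_rng.uniform(0, 100) for _ in range(500)]
y = [0.3 * xi + data_rng.gauss(0, 30) for xi in x]

correlations = zoning_sensitivity(x, y, n_zones=25, trials=200)
print(f"base-unit r = {pearson(x, y):.3f}")
print(f"zoned r ranges from {min(correlations):.3f} to {max(correlations):.3f}")
```

Each trial is independent of the others, which is why this type of analysis lends itself to parallel execution. On the scaling argument, if the cost really does grow as O(n²), a 10,000 times increase in processor speed only allows n to grow by a factor of 100 for the same run time.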

Presentation of map based data to humans is fraught with numerous difficulties linked to the perception of coloured data. In “ColorBrewer in Print: A Catalog of Color Schemes for Maps” [BHH03], Brewer studies how humans interpret data presented on maps by asking volunteers to answer map-based questions with a variety of colour schemes and under different lighting conditions. This paper forms the basis of the “ColorBrewer” set of colours which is built into numerous GIS packages, for example “GeoTools”, “QGIS” and “ArcGIS”.

Before drawing a choropleth map, though, the colour scale needs to be defined, in other words, how the data maps to a discrete set of colours.¹ In [BHH03], Brewer categorises the data according to three types: ‘sequential’, ‘diverging’ and ‘qualitative’. Sequential data is characterised by continuous real number values, for example a carbon monoxide sensor value that varies from 0 parts per million upwards. Diverging data is where there is a natural break point, for example, male:female population ratio where the break is at zero. Qualitative data is where the data fits a set of distinct classes where there is no ordering relationship, for example types of dwelling classified as detached, semi-detached or terraced.
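As a minimal sketch of how data maps to a discrete set of colours for each of the three types, the following Python looks up a class colour from a list of breaks. The palettes, break values and the `colour_for` function are illustrative assumptions and are not taken from [BHH03] or from the software described in this thesis.

```python
from bisect import bisect_right

# Illustrative five-class palettes (assumed hex values, not a prescribed scheme).
SEQUENTIAL = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]  # light to dark
DIVERGING = ["#ca0020", "#f4a582", "#f7f7f7", "#92c5de", "#0571b0"]   # neutral centre
QUALITATIVE = {                                                      # no ordering implied
    "detached": "#1b9e77",
    "semi-detached": "#d95f02",
    "terraced": "#7570b3",
}


def colour_for(value, breaks, palette):
    """Return the palette colour for value, given ascending interior breaks.

    n breaks split the data range into n + 1 classes, so the palette
    must contain exactly one more colour than there are breaks.
    """
    if len(palette) != len(breaks) + 1:
        raise ValueError("need exactly one more colour than breaks")
    # bisect_right places a value equal to a break in the upper class.
    return palette[bisect_right(breaks, value)]


# Sequential: hypothetical carbon monoxide breaks in parts per million.
co_breaks = [1.0, 2.0, 5.0, 10.0]
print(colour_for(0.4, co_breaks, SEQUENTIAL))

# Diverging: hypothetical breaks placed symmetrically about a break point of zero,
# so values close to zero fall in the neutral centre class.
change_breaks = [-0.10, -0.02, 0.02, 0.10]
print(colour_for(0.0, change_breaks, DIVERGING))

# Qualitative: a direct lookup, as the classes have no order.
print(QUALITATIVE["terraced"])
```

The only structural difference between the sequential and diverging cases in this sketch is where the breaks sit and which palette is used. The qualitative case needs no breaks at all.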

¹The colours in the scale could also be defined continuously, for example a linear transition between blue and red based on

In addition to the type of data being represented, the number of colours to use in the colour scale and the data breaks also need to be chosen. These are related in that there must be enough colours for the number of breaks chosen, but how the breaks are distributed over the range of the data must also be decided. Firstly, though, the question of how many breaks to use must be answered. In “Comparing continuity and compactness of choropleth map classes” [Cal18], Calka analyses six different methods of classification breaks, identifying the strengths and weaknesses of each. In particular, the inclusion of “Head-tail Breaks” for data that is “not normally distributed” is potentially interesting when working with scale-free network data. The remainder of the breaks classifications are: Equal Intervals, Quantile, Standard Deviation, Natural Breaks (Jenks) and Geometric Intervals. In addition to this, there is a review of “goodness of fit” measures, including Jenks’ own assessment criteria for his natural breaks classification first published in “Optimal Data Classification for Choropleth Maps” [Jen77]. The paper concludes with examples of population density in Poland, measuring the “Tabular Accuracy Index” (TAI) with reference to two new index methods, “Spatial Distance Index” (SDI) and “Spatial Contiguity Index” (SCI). The conclusion made here is that, “depending on the selection of class ranges in mapping population density, rural areas can be made prominent or the focus can be directed at urban areas with small, medium and large town and cities”.
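To make the difference between some of these classification methods concrete, the sketch below implements simplified versions of equal interval, quantile and head/tail breaks, together with the Tabular Accuracy Index as it is commonly defined (one minus the ratio of absolute deviations about the class means to absolute deviations about the overall mean). This is an illustration only and is not the code from [Cal18] or [Jen77]. The 40% head limit, the function names and the sample data are assumptions.

```python
from bisect import bisect_right
from statistics import mean


def equal_interval_breaks(values, n_classes):
    """Interior breaks that split [min, max] into n_classes equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * i for i in range(1, n_classes)]


def quantile_breaks(values, n_classes):
    """Interior breaks placing roughly equal numbers of values in each class."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_classes] for i in range(1, n_classes)]


def head_tail_breaks(values, head_limit=0.4):
    """Simplified head/tail breaks: split at the mean, then repeat on the head.

    Stops once the values above the mean are no longer a minority
    (head_limit is an assumed threshold).
    """
    breaks = []
    current = list(values)
    while len(current) > 1:
        m = mean(current)
        head = [v for v in current if v > m]
        if not head or len(head) / len(current) > head_limit:
            break
        breaks.append(m)
        current = head
    return breaks


def tabular_accuracy_index(values, breaks):
    """TAI: 1 - (deviation about class means) / (deviation about overall mean)."""
    classes = {}
    for v in values:
        classes.setdefault(bisect_right(breaks, v), []).append(v)
    within = sum(abs(v - mean(members))
                 for members in classes.values() for v in members)
    overall = mean(values)
    total = sum(abs(v - overall) for v in values)
    return 1 - within / total


# Hypothetical heavy-tailed data: many small values and a few very large ones.
data = [1] * 20 + [10] * 5 + [100] * 2 + [1000]

for name, breaks in [
    ("equal interval", equal_interval_breaks(data, 3)),
    ("quantile", quantile_breaks(data, 3)),
    ("head/tail", head_tail_breaks(data)),
]:
    print(f"{name}: breaks={breaks}, TAI={tabular_accuracy_index(data, breaks):.3f}")
```

A TAI closer to 1 indicates that the class means describe the data more closely. The index says nothing about the spatial arrangement of the classes, which is the gap the SDI and SCI measures are intended to address.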

In light of these issues with the visual perception of maps, neither the data store mining (section 4.4.2) nor the map comparison (section 5.2) actually visualises any maps. The only visualisation is as a first guess for humans, which can then be altered. All the map comparisons and data store mining presented in this thesis are visualisation agnostic. All the data is machine to machine, up to the point where a human sees a meta-visualisation of the map comparisons, e.g. a graph of linkage.