When Google Maps was first released in 2005, data visualisations were “mash-ups” of data from different sources overlaid on a map for visual comparison [Bat+10a]. Having progressed from this initial phase, the aim now is to make a more rigorous spatial comparison between different data sources. This inevitably leads to the question of how to compare data at different relative scales, for instance air quality data, where there are around 10 London sites, with TfL bus data, where there are 7,200 buses moving in London at the peak of rush hour. As the point sources of data are not coincident, some form of data interpolation appears to be needed.
In his 1968 paper, “A two-dimensional interpolation function for irregularly-spaced data”, Shepard shows examples of data interpolation taken from biology, meteorology and geography [She68]. Distance interpolation is of critical importance to geographic visualisation, being fundamental to heat maps, contours and height maps. For a general survey of alternatives to Shepard, see [Ami02], where triangulation methods for interpolation and radial basis functions are compared. An example of Inverse Distance Weighting can be found in [Ste+04], where the technique is used to visualise carbon monoxide data from GPS tracked sensors on a 3D map of London. Alternatively, the geostatistical method of “Kriging” [Cre93] can be used for data interpolation. Krige’s original paper [Kri51] was on mining and geology, while Wackernagel [Wac03] additionally cites examples of altitude mapping for temperature and modelling. In “Spatial Data Interpolation” [MM05], Mitas and Mitasova frame the spatial interpolation problem as, “...finding a function F(r) which passes through the given points”. They go on to say that,
“Because there exist an infinite number of functions which fulfil this requirement, additional conditions have to be imposed, defining the character of various interpolation techniques.” (Mitas, [MM05])
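To make the quoted requirement concrete, the interpolation condition and one possible choice of "additional conditions" can be written out as below. This is a sketch in assumed notation rather than the notation of [MM05] or [She68]: the sample values z_i, sample locations r_i, sample count N and power parameter p are symbols introduced here for illustration. The second expression is the commonly quoted form of Shepard's inverse distance weighting, with p = 2 a frequent choice.

```latex
% Exact interpolation condition for N samples z_i measured at locations r_i
F(\mathbf{r}_i) = z_i, \qquad i = 1, \dots, N

% Inverse distance weighting as one choice of additional condition (p > 0)
F(\mathbf{r}) = \frac{\sum_{i=1}^{N} w_i(\mathbf{r})\, z_i}{\sum_{i=1}^{N} w_i(\mathbf{r})},
\qquad w_i(\mathbf{r}) = \lVert \mathbf{r} - \mathbf{r}_i \rVert^{-p},
\quad \mathbf{r} \neq \mathbf{r}_i
```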
In the article, the authors state that “recent applications of geostatistics have de-emphasised the use of Kriging”, going on to suggest alternative distance interpolation functions that have gained popularity recently. Examples of Thiessen polygons and triangulation techniques for surfaces are outlined, finishing up with a group of techniques based on splines designed to maximise the smoothness of the interpolation. The question of discontinuities in the data is also discussed in relation to climate data. Weather data, as included later in this thesis, falls into this category as fronts are discontinuities in the pressure field¹⁴. Interpolating data across a weather front produces invalid results, much like interpolating footfall data through a building. For example, in [ACM16], a high resolution air temperature data set from sensors around Birmingham is compared to satellite information at a lower spatial scale to measure the urban heat island effect in the city. This is currently not possible to do at city scale due to the satellite resolution. Land use data is also included in the analysis, but the authors have to take into account the atmospheric stability using the Pasquill-Gifford index for the results to be valid. Their methodology is to only analyse days when the weather across the city is stable, in other words, there are no weather fronts moving across the study area. Selective data analysis would appear to be a safe methodology to adopt with this type of data.
¹⁴ For a definition of weather fronts, see Met Office fact sheet 10 - Air Masses and weather fronts, https://www.metoffice.gov.uk/
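As an illustration of how inverse distance weighting could bridge the difference in scale between a handful of fixed sensors and thousands of moving vehicles, the following minimal Python sketch estimates a sparse field at arbitrary query points. It is not the method of any of the cited papers: the function name, the coordinates, the readings and the default power of 2 are all hypothetical values chosen for the example, and no attempt is made to handle discontinuities such as weather fronts.

```python
import math


def idw_interpolate(sites, query, power=2.0):
    """Inverse distance weighted estimate of a field at `query`.

    sites: list of (x, y, value) tuples for the measurement sites.
    query: (x, y) location at which an estimate is wanted.
    power: distance exponent (an assumption; 2 is a common choice).
    """
    if not sites:
        raise ValueError("at least one measurement site is required")
    numerator = 0.0
    denominator = 0.0
    for x, y, value in sites:
        distance = math.hypot(query[0] - x, query[1] - y)
        if distance == 0.0:
            # The interpolant passes through the measured points exactly.
            return value
        weight = distance ** -power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical air quality sites: (x km, y km, reading in arbitrary units).
air_quality_sites = [(0.0, 0.0, 40.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0)]
# Hypothetical bus positions at which the sparse field is sampled.
bus_positions = [(0.0, 0.0), (5.0, 0.0), (2.0, 3.0)]
estimates = [idw_interpolate(air_quality_sites, bus) for bus in bus_positions]
```

Because every estimate is a weighted average of the measured values, it always lies between the smallest and largest readings. This is one reason the technique cannot represent a sharp discontinuity between neighbouring sites.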
A review of interpolation methods and types of data shows that the geostatistical and mathematical techniques for comparing data at different scales provide a solid foundation to work from. Where new research could add to the current body of knowledge is in the combinations of data from different disciplines. Extensive sources of data on real-time transport, weather, air quality and commuter behaviour are starting to emerge, but use of the data often requires domain expertise to acquire and interpret correctly. One avenue of research is to investigate whether this domain expertise can be coded into predictive models of the phenomena, which can then be used for interpolative comparisons and analysis to look for correlations between the different systems. The idea is for the expert’s domain specific knowledge to be coded into a model of space, rather than time. For this to work, though, there would need to be a model of accuracy, uncertainty or validity which fits with the interpolation methodology.
Chapter 3
Technical Background
This chapter forms the background research into the technologies required to build the MapTube website, http://www.maptube.org, which was created in 2008 for sharing and exploring maps online. The chapter investigates computer architectures for geospatial computing. The motivation behind this research is that the tiled map systems which sites like Google Maps and OpenStreetMap use to display cartography are analogous to the visualisation layer in a regular Geographic Information System (GIS), such as ArcGIS or QGIS¹. The first aim of MapTube was to make it easy for the general public to make maps from data, and so increase the amount of geospatial data in the public domain. The automatic mapping system is explained in Chapter 4, “Designing Systems”, firstly in the context of mapping volunteered geographic data when the underlying data can change 60 times a second. Then, having developed a technology that can make maps automatically from data, the idea of data mining Internet data stores is explored.
With 3,439 maps currently stored on MapTube², there is an immediate supply of geospatial data, so an obvious question to ask is, “how can we perform spatial analysis on all the maps to extract further knowledge?” This forms the bulk of Chapter 5, “Dynamic Visualisation”. Also, an Internet GIS system which can handle fast changing data is strategically well-placed to handle real-time data, a concept which forms Chapter 6. Before that, though, the remainder of this chapter explores the fundamental architectures and algorithms that are required to make this a reality.
¹ ArcGIS (https://www.arcgis.com) is a commercial GIS system from the Environmental Systems Research Institute (ESRI), while QGIS (http://www.qgis.org) is an open-source alternative.
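For readers unfamiliar with tiled map systems, the sketch below shows how a longitude and latitude can be mapped to a tile index at a given zoom level. It assumes the widely used Web Mercator "XYZ" tiling convention, in which zoom level z divides the world into 2^z by 2^z tiles. That convention is an assumption made for illustration and is not stated in the text above as the scheme used by MapTube; the function names are likewise hypothetical.

```python
import math


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Return the (x, y) tile containing a point, assuming XYZ Web Mercator tiles."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edges still fall inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_count(zoom):
    """Total number of tiles covering the world at a zoom level."""
    return 4 ** zoom


# Approximate coordinates of central London, used purely as an example.
london_tile = lonlat_to_tile(-0.1276, 51.5072, 10)
```

The number of tiles grows by a factor of four with each zoom level, which gives a sense of why caching strategies and server architecture matter for the performance of such systems.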
3.1 Hypothesis
Web-based visualisation of geospatial data has the potential to convey information to a wide audience, but it is currently bound by the limitations of the technology. Over-reliance on static data, which can be cached for performance reasons, is holding back the development of both real-time data visualisation and exploratory ‘one-off’ visualisation, for example the outputs of a user’s model which might only be viewed once before being run again with different parameters. The potential exists for new computer architectures and algorithms utilising the ‘infinite provisioning’ possible with cloud storage, plus the connectivity that the Internet gives to both users and sources of data.
Firstly, with data collection, the crowd-sourcing of geospatial information is now possible on a scale not seen before. The first hypothesis is that constructing a website, along with tools to make it easier for the general public to upload maps of data, will allow the collection of data that would not be possible otherwise. The second hypothesis is that the tools which make it easier to upload the data then become tools in themselves, as the computer effectively becomes the user, with the ability to discover its own data on the Internet and make its own maps. This can only come about through improved handling of geospatial data sets and the ability to interrogate Internet data stores. As an additional benefit, the automatic interpretation of geospatial data lends itself to the question of whether further knowledge about the structure of the relationships between different maps can be discovered. This is a more ‘fully connected’ approach to geospatial data than the ‘one data file one layer’ approach of the current desktop GIS programs. New algorithms of the type just proposed will enable the handling of thousands of maps simultaneously with a scalability that is beyond the current generation of analysis tools.
The final hypothesis is that algorithms for automatically handling static data can be adapted to handle real-time streams, interpreting the streamed data automatically and unlocking further information about what is happening in city-scale systems. It is expected that new techniques for handling the mixed use of real-time and static data will also be required to achieve this.
In the chapters that follow, the first step is to develop a theory behind web-based tiled mapping systems, which allows their performance to be quantified with different computer architectures. Next, the tools for automatic data collection are developed, using experience gained while working on web-based mapping systems for volunteered geographic data. These tools are then used with an Internet data store example, automatically mapping all 2,558 variables from the 2011 Census. Once a set of data has been collected, the focus then turns to analysis of the data and map comparison algorithms. Without these additional knowledge discovery tools, there is an argument that collecting data in this way results in nothing more than a large number of coloured maps. The value in the three hypotheses presented here is in asking questions about the relationships between the data. Finally, the real-time data problem is tackled, with the automatic modelling of transport data from the London Underground used as an example. In the penultimate chapter, “Data Exploration”, the previously developed tools are brought together to demonstrate their effectiveness at tackling applied problems.