
Knowledge discovery has already been touched on in section 2.6 in the context of automatic data store mining as there is overlap between knowledge discovery, data mining and machine learning. The Springer journal “Data Mining and Knowledge Discovery” contained a special issue on “Smart Cities”, acknowledging the increasing use of city data. In “Structural robustness and service reachability in urban settings” [AZB18], Abbar et al. look at the concept of urban resilience in the context of the Rockefeller Foundation’s “100 Resilient Cities Program” (100RC). From the data science point of view, they required road network data and the geographic distribution of services from every city in their study. Their analysis proceeds by building networks from the road data and calculating statistics for: vertices, edges, total length, average degree, together with “meshness” and “organic score” which are defined by Wang [Wan15]. The discussion of the results centres on an analysis of the road network graphs. He concludes with the remark,

“...a systematic study of these “robustness fingerprints” may lead to a classification of cities with regard to their fragility and service distribution imbalances-along the lines of Louf and Barthelemy [LB14], but taking an urban resilience angle.”

2.10. Knowledge Discovery

The work of Louf and Barthelemy [LB14], which is mentioned in the above quote, is another example of network graph analysis applied to the road networks of cities. The aim of both pieces of research is to discover regular patterns in the data, which links to the previous section 2.8 on comparing maps and correlations. Where map data was concerned, Muller also used a graph technique to look for his “associations in choropleth maps” [Mul75], but at a higher level than the road networks being analysed here. Both Wang and Louf make the point that their analysis is only made possible by newly available data sources, which make road network data accessible to them.
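The basic road network statistics listed above (vertices, edges, total length and average degree) can be illustrated with a minimal Python sketch. The junction coordinates and street list are hypothetical toy data, not taken from any of the cited studies, and the “meshness” and “organic score” measures are omitted because their definitions are not reproduced here.

```python
import math

# Hypothetical toy street network: junction id -> (x, y) position in metres.
JUNCTIONS = {
    "a": (0.0, 0.0),
    "b": (100.0, 0.0),
    "c": (100.0, 100.0),
    "d": (0.0, 100.0),
    "e": (200.0, 0.0),
}

# Street segments as undirected pairs of junction ids (also hypothetical).
STREETS = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("b", "e")]


def network_statistics(junctions, streets):
    """Return vertex count, edge count, total length and average degree."""
    total_length = sum(
        math.dist(junctions[u], junctions[v]) for u, v in streets
    )
    n_vertices = len(junctions)
    n_edges = len(streets)
    return {
        "vertices": n_vertices,
        "edges": n_edges,
        "total_length": total_length,
        # Every undirected edge contributes to the degree of two vertices.
        "average_degree": 2 * n_edges / n_vertices,
    }


stats = network_statistics(JUNCTIONS, STREETS)
print(stats)
```

Real road data would normally use geographic coordinates and a geodesic distance rather than the planar Euclidean distance assumed here.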

Cimini defines a network as:

“...a network represents the simplest yet extremely effective way to model a large class of technological, social, economic and biological systems, as a set of entities (nodes) and of interactions (links) among them.”

(Cimini et al. [Cim+19])

The knowledge derived from analysing networks of this kind lies in discovering patterns in the data, but the graph techniques presented here suffer from the problem of computational complexity. The networks can get big quickly: for example, Wang’s analysis of the city of Los Angeles had 525,246 vertices and 678,652 edges. While in a street network this is not too big a problem as the average degree is ≈ 2, in fully connected networks the average degree grows with the number of vertices, so the number of edges grows with vertices². This is exactly the type of problem being described in section 2.8, “Comparing Map Data and Correlations”, and section 2.6, “Automatic Data Store Mining”, for comparing all combinations of the maps.
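The difference in scale between a sparse street network and a fully connected one can be made concrete with a short calculation. The sketch below uses the Los Angeles vertex and edge counts quoted above; the function name is illustrative, and the only formula assumed is the standard count of unordered pairs, n(n - 1)/2, which is also the number of comparisons needed to compare every map with every other map.

```python
def complete_graph_edges(n_vertices):
    """Number of edges when every vertex is linked to every other vertex."""
    return n_vertices * (n_vertices - 1) // 2


# Figures quoted in the text for the Los Angeles street network.
LA_VERTICES = 525_246
LA_EDGES = 678_652

all_pairs = complete_graph_edges(LA_VERTICES)

print(f"street network edges:      {LA_EDGES:,}")
print(f"edges if fully connected:  {all_pairs:,}")
print(f"ratio:                     {all_pairs / LA_EDGES:,.0f}")

# The same formula gives the number of pairwise map comparisons,
# for example for a hypothetical store of 1,000 maps.
print(f"pairs among 1,000 maps:    {complete_graph_edges(1_000):,}")
```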

In “The Statistical Physics of Real-World Networks” [Cim+19], Cimini et al. make the point that:

“Notably, most of the networks observed in the real world fall within the domain of complex systems, as they exhibit strong and complicated interaction patterns, and feature collective emergent phenomena that do not follow trivially from the behaviours of the individual entities.”

The authors then go on to introduce scale-free networks [BA99], where the node degree follows a fat-tailed distribution or power law, so that a few hubs with very high degree exist. A detailed mathematical description of operations on network graphs is given by Dorogovtsev et al. in “Critical phenomena in complex networks” [DGM08] and also by Barabasi in his network science book [Bar16]. Small world networks [SK78], which inspired Milgram’s famous “six degrees of separation” [TM69], and Erdos-Renyi random networks [ER59] are also essential to the analysis of the structure of networks. Finally, on networks, Caldarelli and Catanzaro remark that “some collective behaviour... cannot be predicted by looking at the single elements forming the system” [CC18, pp2]. Although this remark refers to complex systems, the data being mined in section 2.6 is linked through relationships to place and so needs to be analysed in this context. However, given that a network formed of relationships between spatial data sets has not been constructed from a set of production rules like the other examples in this section, there is the possibility that a network-based analysis of this data could discover something new.
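The contrast between a random network and a scale-free network can be explored with a small simulation. The following is a minimal sketch using only the Python standard library: an Erdos-Renyi G(n, p) generator and a simple preferential attachment growth rule in the spirit of [BA99]. The function names, the network size and the seed are illustrative assumptions, and the growth rule is a simplified variant rather than the exact procedure of any cited paper.

```python
import random


def erdos_renyi(n, p, rng):
    """G(n, p): include each possible edge independently with probability p."""
    return [
        (u, v)
        for u in range(n)
        for v in range(u + 1, n)
        if rng.random() < p
    ]


def preferential_attachment(n, m, rng):
    """Grow a graph where each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    # Seed graph: a complete graph on the first m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each node appears once per incident edge, so a uniform draw from
    # this list is a degree-proportional draw over nodes.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((target, new))
            endpoints.extend((target, new))
    return edges


def degrees(n, edges):
    """Degree of every node, given an undirected edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


N, M = 1000, 2
rng = random.Random(42)
pa_edges = preferential_attachment(N, M, rng)
# Choose p so that both graphs have an expected average degree close to 2 * M.
er_edges = erdos_renyi(N, 2 * M / (N - 1), rng)

pa_deg = degrees(N, pa_edges)
er_deg = degrees(N, er_edges)
print("preferential attachment: max degree", max(pa_deg))
print("Erdos-Renyi:             max degree", max(er_deg))
```

With matched average degree, the preferential attachment graph would be expected to show a noticeably larger maximum degree, which is the hub behaviour described above.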

In addition to data mining spatial data and calculating statistical factors that can be presented graphically on a map, there is also the question of how to handle knowledge in the form of the semantic descriptions of what the maps contain. Up to this point only the data on the map has been considered and not the explanation of what the data represents. Without this the data is essentially meaningless; in fact, one user of the MapTube site used the phrase “colouriser” to refer to a set of “pretty coloured maps without any descriptions”. This raises a point about volunteered geographic information [GMH15], where users can upload their own data, but might describe the data in a less than rigorous way¹⁰. One avenue of research is to ask whether it is possible to identify an unknown data set using the context of where it sits in the current framework of knowledge (“blind identification”).

Two approaches can be taken to the descriptions of spatial data: free text natural language or a structured semantic description using a well-formed ontology. The idea of using an ontology where users fill out a form with information that describes their data was rejected early on in the development of the MapTube website due to being unworkable. As an example, take a field called “year”. Obvious examples are ‘2019’ or ‘1966’, but ‘Pre-Cambrian’ is a bit more difficult when describing archaeological or geological maps.

¹⁰ While MapTube forces the user to enter a short and long description of every map uploaded, without advanced natural language

The “GeoSPARQL” standard [Ope11] is a geographic query language for “RDF” data. “RDF” is an acronym for “Resource Description Framework”, which is a W3C standard for linked data stored as triples composed of: subject, predicate, object. The storage of data in this format allows complex queries about the relationships between data to be performed. The “NeoGeo” vocabulary (http://geovocab.org) exists as a draft specification, which is an attempt to bring together the disparate ontologies currently published by a number of different organisations who curate geospatial data. The Ordnance Survey is one of these organisations, publishing their “Spatial Relations Ontology”¹¹, which was designed by John Goodwin for querying relationships between geographic areas like “neighbour” and “within”. Ontologies also exist for postcodes, administrative areas and the 50K Map Gazetteer (e.g. City, Farm, Forest, Water etc.). In the U.S. the National Science Foundation (NSF) also funded a geospatial interoperability project with similar aims [Wie11].
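To make the triple structure concrete, the sketch below stores a few (subject, predicate, object) triples in a plain Python list and answers simple pattern queries over them. It is an illustration of the idea only: it is not GeoSPARQL and uses no RDF library. The area names are chosen for illustration, and the predicate names borrow the “within” and “neighbour” examples mentioned above without claiming to reproduce the Ordnance Survey ontology.

```python
# Hypothetical triples: (subject, predicate, object).
TRIPLES = [
    ("Camden", "within", "London"),
    ("Islington", "within", "London"),
    ("Camden", "neighbour", "Islington"),
    ("London", "within", "England"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def containing_areas(triples, area):
    """Follow 'within' links upward to list every area that contains `area`."""
    found = []
    frontier = [area]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(triples, subject=current, predicate="within"):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


# Which areas are directly within London?
in_london = [s for s, _, _ in match(TRIPLES, predicate="within", obj="London")]
print(in_london)

# Which areas contain Camden, directly or indirectly?
print(containing_areas(TRIPLES, "Camden"))
```

A query language such as SPARQL expresses the same kind of pattern declaratively, with variables in place of the `None` wildcards used here.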

For an ontology related to the data description, Hochmair examines efficient searching on Internet portals [Hoc05]. Here, the author singles out two standards covering the storage of geospatial data as being important:

• The Content Standard for Digital Geospatial Metadata (CSDGM) version 2 (FGDC-STD-001-1998)

• ISO/TC 211 19115-2003

Hochmair’s application uses the “Web Ontology Language” (OWL), which is a W3C standard [W3C05] for ontologies, allowing web crawlers to semantically index data in data stores. The body of the paper is an extension to previous work by Holscher in “Web Search Behaviour of Internet Experts and Newbies” [HS00], and Pirolli in “A user-tracing architecture for modeling interaction with the world wide web” [Pir+02], on more general searching of the world wide web. Notably, Hochmair uses natural language English dictionaries of nouns, verbs, adjectives and adverbs from WordNet [Mil95] to build the system that he refers to as “intelligent query expansion”. Similar corpora are used in modern sequence to sequence machine learning systems, although on a larger scale, for example in the natural language processing system built in “Smart IoT and Soft AI” [Mil+18] [Mil+] around the Google Dialogflow application¹².
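The general idea behind dictionary-based query expansion can be sketched in a few lines of Python. This is a toy illustration and not Hochmair’s system: the synonym table is a hypothetical hand-written stand-in for WordNet, and the map descriptions and function names are invented for the example.

```python
# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "river": {"stream", "watercourse"},
    "road": {"street", "highway"},
    "map": {"chart"},
}

# Hypothetical free-text map descriptions.
DESCRIPTIONS = [
    "Flood risk along each watercourse",
    "Highway traffic counts 2019",
    "Population density by ward",
]


def expand_query(query, synonyms):
    """Return the query terms together with any listed synonyms."""
    terms = query.lower().split()
    expanded = set(terms)
    for term in terms:
        expanded |= synonyms.get(term, set())
    return expanded


def search(query, descriptions, synonyms):
    """Return descriptions sharing at least one word with the expanded query."""
    terms = expand_query(query, synonyms)
    return [d for d in descriptions if terms & set(d.lower().split())]


# A plain keyword match on "river" finds nothing; the expanded query does.
print(search("river", DESCRIPTIONS, {}))
print(search("river", DESCRIPTIONS, SYNONYMS))
```

Passing an empty synonym table, as in the first call, reduces the search to plain keyword matching, which makes the effect of the expansion easy to see.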

The use of graph tools to build graphs from metadata repositories is the subject of [Ulr+18], where Ulrich et al. analyse health record data using a “Neo4J” database¹³.

To quote their justification:

¹¹ The Ordnance Survey list of ontologies is published at: http://data.ordnancesurvey.co.uk/ontology.
¹² Google Dialogflow: https://dialogflow.com.

“Summarised metadata itself is a highly connected object which gains its value from meaningful semantic connection to other data objects [EJX01].”

Although the quote is originally from another paper on making connections by attribute matching, the results show the authors extracting structure and new knowledge about the data from the metadata records. What is missing here, however, is a methodology for linking data where the attributes are natural language descriptions. Given the progress made in artificial intelligence for natural language processing in recent years, for example the sequence to sequence deep neural network models of Sutskever [SVL14], there is an increased ability for handling human-readable content without resorting to manual semantic labelling. In the sequence to sequence translation paper, Sutskever utilises gated recurrent units, in the form of the long short-term memory (LSTM) units of [HS97], to build a natural language translation system which is invariant to sentence length. These gated units enable the training of a recurrent neural network on the sequences of words. Without going into the detail of the training regime for the recurrent network, which is essential for sentence length invariance, it is sufficient to say that the network represents sentences in a “high dimensional concept space”. Sentences are effectively reduced to a vector in this space, where sentences meaning similar things are grouped together. The words forming the sentences are also coded as vectors using the Word2Vec methodology of Mikolov [Mik+13]. On the surface, this would appear to be a technique worth pursuing in the area of metadata labelling of maps.
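The idea that descriptions with similar meanings sit close together in a vector space can be illustrated with a deliberately simplified Python sketch. It is not a sequence to sequence model: each description is represented by the average of its word vectors, and descriptions are compared by cosine similarity. The three-dimensional word vectors are hypothetical hand-written values; real Word2Vec vectors are learned from a large corpus and have far more dimensions.

```python
import math

# Hypothetical toy word vectors, hand-written for illustration only.
WORD_VECTORS = {
    "population": (0.9, 0.1, 0.0),
    "residents": (0.8, 0.2, 0.0),
    "density": (0.7, 0.0, 0.1),
    "count": (0.6, 0.1, 0.1),
    "rainfall": (0.0, 0.1, 0.9),
    "annual": (0.1, 0.5, 0.5),
}


def sentence_vector(text):
    """Average the vectors of the known words in a description.

    This averaging is a simple stand-in for a learned sentence encoder.
    """
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError(f"no known words in {text!r}")
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = sentence_vector("population density")
similar = cosine_similarity(query, sentence_vector("residents count"))
different = cosine_similarity(query, sentence_vector("annual rainfall"))
print(f"population density vs residents count: {similar:.3f}")
print(f"population density vs annual rainfall: {different:.3f}")
```

In this toy setting the two population-related descriptions share no words, yet they score as far more similar than the unrelated pair, which is the property that would make such vectors useful for matching map descriptions.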