
2.4 Combining Real-time and Static Data

The combination of real-time data with static data can take a number of different forms. Firstly, the concept of “polystores” is gaining momentum in the data management and semantic knowledge representation literature, for example in the publications of the ACM Special Interest Group on Management of Data (SIGMOD)⁴.

Beyond polystores and data warehousing technology, real-time and static data can also be combined analytically. In “High resolution population estimates from telecommunications data” [Dou+15], Douglass et al. use real-time telecommunications data to derive population estimates, which are then verified using Census data. They also find that they can make estimates by age, gender and ethnicity, but only by using a model which is calibrated on the specific area of study. They warn against applying the same results to another area without any base data from which to build the model. Their conclusion states, “We envision an inter-census calibration using a very small scale stratified population count in key calibration regions”, making the point that the model is a way of augmenting the existing static population data.

While this example uses data in an offline sense, real-time streaming can often require the processing of data sets that grow rapidly over time. Architectures such as the Lambda Architecture and the Kappa Architecture, along with high-performance computing frameworks such as Apache Spark, are often used for this task because of the volume of real-time data. Central to the concept of the Lambda Architecture⁵ is the base transaction data, which is analysed using a batch layer, a speed layer and a serving layer. Nathan Marz is credited with coining the term, which is defined in his book, “Big Data: Principles and best practices of scalable realtime data systems” [MW15]. The general idea behind the architecture is that data is processed quickly and approximately on one path to provide immediate results while, on a parallel path, the exact result is computed using a different method that takes longer. The serving layer allows queries to be made on the basis of the best data available at the time.
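The division of work between the three layers can be illustrated with a toy, single-process sketch. It is not taken from [MW15]: the class name `LambdaSketch` and its methods are hypothetical, the "data" is simply a stream of keys to be counted, and the speed layer here keeps an exact incremental count for brevity, whereas a real speed layer would typically trade accuracy for latency.

```python
from collections import Counter


class LambdaSketch:
    """Toy Lambda-style pipeline that counts events per key (illustrative only)."""

    def __init__(self):
        self.master = []             # append-only base data, never modified
        self.batch_view = Counter()  # exact view, rebuilt by the slow batch job
        self.speed_view = Counter()  # covers only data seen since the last batch

    def ingest(self, key):
        """New event: store it in the base data and update the fast view."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the exact view from all base data."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()      # recent data is now covered by the batch view

    def query(self, key):
        """Serving layer: merge the batch view with the recent increments."""
        return self.batch_view[key] + self.speed_view[key]
```

In this sketch a query issued between batch runs is answered from the stale batch view plus whatever the speed layer has accumulated, which mirrors the idea of serving the best data available at the time.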

As datasets grow in size, algorithms that operate on data as a stream are becoming more popular. The basic idea follows the data stream learning principles described in Chapter 9 of [WEH11]. The following quote defines the methodology:

“One way of addressing massive datasets is to develop learning algorithms that treat data as a continuous stream. In the new paradigm of data stream mining, which has developed in the last decade, algorithms are developed that cope naturally with datasets that are many times the size of main memory - perhaps even infinitely large. The core assumption is that each instance can be inspected once only (or at most once) and then must be discarded to make room for subsequent instances.”

(from “Data Mining” [WEH11, p. 380])

⁴ SIGMOD: https://dl.acm.org.sigmod.

In other words, the memory footprint of the program remains constant as the data is fed through it. The power of this method lies in its applicability to very large datasets, where an approximately correct answer obtained with frugal compute resources is preferred over spending more time and effort on an answer which is only a few percent more accurate.
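As a minimal illustration of the constant-memory property, the sketch below computes a mean and variance in a single pass using Welford's online update. This example is not drawn from [WEH11]; the function name `running_mean_variance` is hypothetical, and the stream is assumed to be any Python iterable of numbers. Each value is inspected once and then discarded, and only three numbers are kept regardless of stream length.

```python
def running_mean_variance(stream):
    """Single-pass mean and population variance (Welford's online update)."""
    n = 0        # number of values seen so far
    mean = 0.0   # running mean
    m2 = 0.0     # running sum of squared deviations from the mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses the already-updated mean
    variance = m2 / n if n else 0.0
    return n, mean, variance
```

Because the function accepts a generator, the input never needs to be held in memory as a whole; for example, `running_mean_variance(float(line) for line in open(path))` would process a file one line at a time, where `path` is a placeholder for a file of numbers.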

The “Frugal Majority” algorithm [Boy91] shows this frugal paradigm in action when determining whether the stream contains a class with more than a 50% majority. The more obvious approach would be simply to count how many instances of each class there were, but this would have a memory footprint that increases with the number of classes in the data. While the majority class algorithm is a simple example, the “Count-Min” algorithm [CM04] is a probabilistic method related to the Bloom filter [Blo70]: both apply multiple hash functions to each data value, but Count-Min stores counters rather than single bits, so it can estimate how many times each value has occurred. The importance of this algorithm is that the technique can be used to find approximate quartile ranges on very large datasets using a single pass through the data and no sorting. The technique relies on having enough hash functions that a collision between two data values in one hash table is unlikely to be repeated across all hash functions simultaneously. The histogram of data values encountered is then calculated from the data in the hash tables.
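Both ideas can be sketched in a few lines of Python. The first function assumes that [Boy91] refers to the Boyer-Moore majority vote, which keeps one candidate and one counter; it returns the only possible majority class, and a second pass (or prior knowledge that a majority exists) is needed to confirm it. The `CountMinSketch` class is a simplified illustration rather than the construction from [CM04]: the default `width` and `depth`, and the use of salted BLAKE2b digests as the family of hash functions, are assumptions made for the example.

```python
import hashlib


def majority_candidate(stream):
    """Return the only value that could hold a >50% majority, using O(1) memory."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1   # adopt a new candidate
        elif item == candidate:
            count += 1                   # a vote for the candidate
        else:
            count -= 1                   # a vote against cancels one vote for
    return candidate


class CountMinSketch:
    """Approximate frequency counts in a fixed-size table (illustrative only)."""

    def __init__(self, width=272, depth=5):
        self.width = width               # counters per hash function
        self.depth = depth               # number of hash functions (rows)
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row; the row number acts as the salt.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over all rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

The memory used by the sketch is fixed at `width * depth` counters however many distinct values appear in the stream; widening the table reduces the expected overcount, while adding rows reduces the chance that every row collides at once.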

Finally, the combination of static and real-time data is being investigated by the Office for National Statistics (ONS) Big Data team, which is researching the use of mobile phone data⁶ for population estimates, urban planning, commuter flows, ethnicity and community, amongst other areas. While much of the work centres on mobile phone data, for example [Dou+15], discussed earlier, and “Redrawing the Map of Great Britain from a Network of Human Interactions” [Rat+10], which looks at regional boundaries, and “Poverty

⁶ The ONS report is available at: https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/


In document Geospatial Computing: Architectures and Algorithms for Mapping Applications (Page 47-49)