
5.2 Theoretical Foundations

5.2.2 Sampling

5.2.2.1 A Short Primer on Sampling

In statistics, sampling is loosely defined as the procedure of selecting a representative portion (a ‘miniature’) of a population in order to estimate an unknown population quantity, such as the ‘average’ or ‘count’ of a target variable. The population comprises all units in a specific study area, for example all persons in a city, where the target of sampling could be estimating the average age of those persons. Such estimators are normally associated with a variance that measures their accuracy [90].

Sampling is pivotal for most statistical studies for several reasons. First, observing an entire population may be infeasible, for instance measuring the heights of all people in a country. Second, processing a whole-population census is, more often than not, computationally challenging. Although the abundance of big data processing engines mitigates this in many cases, data often arrives in streams, where regularly updating results as new observations arrive is essential for correct time-dependent estimators. In those cases, we usually base our estimates on the observations that have arrived so far and extrapolate the results to future times. Finally, it is sometimes impractical to plot a visual summary of billions of observations, as when generating heat maps of a natural phenomenon.

Whether a sampling method is judged good or bad depends heavily on several factors, including the sampling design and the sample size. The sampling design is the procedure by which a sample of units or sites is selected. There is nevertheless a consensus that the sample should be a good representative of the population. Stated another way, a sample constitutes a scaled-down version (a ‘microcosm’) of the population that mirrors the traits and characteristics of the population it represents. There is no such thing as a “perfectly representative sample”, but if we can obtain a sample that is good enough to yield estimates of a characteristic with a known degree of accuracy or confidence, then it is safe to claim that the sample is representative.

One of the most recurrent problems that renders some sampling designs bad is selection bias, which, in simple terms, arises when the sampling method overlooks some parts of the population by design [90]. For example, when estimating the percentage of eligible voters in the United States who will potentially vote for the Democratic Party in an upcoming election cycle, selection bias may render the estimates invalid. It is also unavoidable that sampling causes sampling error, normally quantified by the standard error (SE), which stems from basing estimates on a sample rather than on the whole population [90]. Modeling uncertainty has strong ties with selecting a proper sampling design: a design that minimizes uncertainty figures, such as standard errors, is preferable to one with wider error intervals. In other words, a method is considered good as long as the values estimated from a sample stay close to the real values (i.e., those computed from the total population with no sampling) across an arbitrary number of repeated samples, and bad otherwise.
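To make the notion of a standard error concrete, the following is a minimal sketch, not taken from the thesis, that estimates a population mean from a simple random sample drawn without replacement and reports the textbook estimate of its standard error, including the finite population correction. The function name `srs_mean_with_se` and the synthetic `ages` data are illustrative assumptions.

```python
import math
import random
import statistics

def srs_mean_with_se(population, n, rng):
    """Estimate the population mean from a simple random sample (drawn
    without replacement) and return it with its estimated standard error."""
    N = len(population)
    sample = rng.sample(population, n)
    mean = statistics.fmean(sample)
    s2 = statistics.variance(sample)   # sample variance, n - 1 denominator
    fpc = 1 - n / N                    # finite population correction
    se = math.sqrt(fpc * s2 / n)
    return mean, se

# Synthetic, purely illustrative population of ages (an assumption).
rng = random.Random(42)
ages = [rng.randint(0, 90) for _ in range(10_000)]

small = srs_mean_with_se(ages, 100, rng)      # small sample
large = srs_mean_with_se(ages, 2_500, rng)    # larger sample, smaller SE
census = srs_mean_with_se(ages, len(ages), rng)  # whole population, SE of 0
```

Under these assumptions, the larger sample yields the smaller standard error, and a full census has no sampling error at all, which is the sense in which a design with smaller standard errors is preferable.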

Aiming at reducing bias while yielding low-variance estimates in a variety of scenarios, many sampling methods have been designed. The two most widely adopted are simple random sampling (SRS), a probability design (a.k.a. random sampling without replacement), and simple stratified sampling (SSS). The former normally proceeds by assigning an equal selection probability to each unit in the population, labeling each unit, and then selecting labels at random until the number of distinct units selected equals the sample size. This guarantees that all possible samples of that size have equal probabilities of being selected. The latter operates differently: it selects fractional portions of the units depending on the group they belong to. For example, when sampling students from schools, we may take 50% boys and 50% girls, where boys and girls are the strata. The distinction between the two designs lies in the fact that SSS may assign equal inclusion probabilities to the units of the same stratum, but these may differ from the probabilities of units in other strata, as each stratum is treated independently [91].
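The contrast between the two designs can be sketched as follows. This is an illustrative sketch only: the helper names `simple_random_sample` and `stratified_sample`, the synthetic `students` list, and the 700/300 split between boys and girls are assumptions made for the example, not details from the cited works.

```python
import random
from collections import Counter, defaultdict

def simple_random_sample(units, n, rng):
    """SRS without replacement: every subset of n distinct units is equally likely."""
    return rng.sample(units, n)

def stratified_sample(units, stratum_of, sizes, rng):
    """Draw an independent SRS of sizes[stratum] units inside each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for stratum, size in sizes.items():
        sample.extend(rng.sample(strata[stratum], size))
    return sample

# Hypothetical school population: 700 boys and 300 girls.
rng = random.Random(7)
students = [(i, "boy") for i in range(700)] + [(i, "girl") for i in range(700, 1000)]

srs = simple_random_sample(students, 100, rng)
sss = stratified_sample(students, lambda s: s[1], {"boy": 50, "girl": 50}, rng)

srs_counts = Counter(gender for _, gender in srs)   # split varies from draw to draw
sss_counts = Counter(gender for _, gender in sss)   # always 50 boys and 50 girls
```

In this sketch the stratified design fixes the composition of the sample at 50% boys and 50% girls, so a boy has inclusion probability 50/700 and a girl 50/300: equal within a stratum, different across strata. Under SRS every student has the same inclusion probability of 100/1000, but the gender split of the sample is left to chance.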

The advantages offered by stratification have encouraged us to consider a design that is based on stratified sampling but, at the same time, accounts for the patchy spatial distributions found in smart city and Industry 4.0 scenarios. In the next subsections, we provide a short primer that sheds light on spatial sampling, aiming at a better comprehension of the hybridization we have performed in our method, discussed in section 5.3.4.

5.2.2.2 Sampling

Deterministic solutions for data analytics problems do not cope well with huge, fast-arriving data streams, which are mostly geo-referenced, have complex data structures, and show oscillation in arrival rates and skewness [4]. In geo-statistics, however, approximations that yield plausible error-bounded statistical results are acceptable [92]. A well-selected representative sample can therefore be safely exploited for geostatistical analytics, such as approximating target study variables (e.g., ‘average’, ‘total’ and ‘proportion’). Moreover, observing all items of a population can be intractable, as when observing migrating birds that are unevenly distributed over a huge area [93].

5.2.2.3 Spatial Online Sampling Designs

Spatial sampling has a great advantage in many domains such as environmental monitoring [94]. It is formally expressed as a triple (𝜓, ℑ, ℜ), where ℜ is the embedding space (often a two- or three-dimensional space) from which samples are drawn, ℑ is the sampling frame (e.g., SRS, SSS) overlaying the survey area (i.e., the embedding space), and 𝜓 is the statistic employed for estimating a variable of interest (e.g., the ‘total’ or ‘mean’ of a parameter in the study area). The choices of ℑ and 𝜓 heavily affect the goodness of the spatial sampling design [94]. These configurations impose an uncertainty on the spatial sample estimation, and the common goal is to reach an unbiased estimation with the lowest possible variance. In spatial settings, this is normally achieved by being attuned to the characteristics of the spatial data, so that the sample is spatially representative and well spread out over the sampling space [95].

Preserving spatial co-locality through a sampling design is known to yield better estimates [96, 97], a principle that complies with Tobler's first law of geography, which simply states that nearby spatial objects are more related than those far apart [98]. One way of achieving this is to imagine the earth flattened out (i.e., a two-dimensional planar, irregular grid-like representation) and to sample proportional quantities from each subregion (i.e., cell or polygon), which is known to yield plausible statistical results with reduced estimation errors [94, 98].
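The idea of sampling proportional quantities from each subregion can be sketched as below. This is a simplified illustration under stated assumptions, not the method of this thesis: it uses a regular square grid in place of irregular polygons, treats points as planar (x, y) pairs, and keeps at least one point per occupied cell. The names `grid_cell` and `proportional_grid_sample` are hypothetical.

```python
import random
from collections import defaultdict

def grid_cell(point, cell_size):
    """Map a planar (x, y) point to the index of the square grid cell containing it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def proportional_grid_sample(points, cell_size, fraction, rng):
    """Treat each occupied grid cell as a stratum and draw the same fraction
    of points from every cell, keeping at least one point per cell."""
    cells = defaultdict(list)
    for p in points:
        cells[grid_cell(p, cell_size)].append(p)
    sample = {}
    for cell, members in cells.items():
        k = max(1, round(fraction * len(members)))   # assumption: never drop a cell
        sample[cell] = rng.sample(members, k)
    return sample

# Synthetic patchy data: one dense patch plus a sparse background.
rng = random.Random(11)
patch = [(rng.uniform(2, 3), rng.uniform(2, 3)) for _ in range(900)]
background = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(100)]
points = patch + background

sample = proportional_grid_sample(points, cell_size=1.0, fraction=0.1, rng=rng)
occupied = {grid_cell(p, 1.0) for p in points}
```

Because every occupied cell contributes to the sample in proportion to its size, the dense patch does not crowd out the sparse cells, and the sample stays spread over the survey area.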

Current SPEs, with their related spatial-aware extensions and plugins, focus on striking a weighted balance between a few QoS goals (e.g., low latency and high accuracy) by either overprovisioning resources (i.e., scaling in/out) or dropping (a.k.a. sampling or shedding) portions of the arriving data, thus losing a little accuracy for plausible latency gains. However, overprovisioned resources are not normally released after a spike, which conflicts with the target of high resource utilization. For sampling and other sketching methods, state-of-the-art SPEs exploit sampling schemes that essentially embrace randomness and are based mostly on SRS [90], which renders them not attuned to the spatial characteristics shared by objects in proximate locations. SRS does not serve the estimation quality QoS target in spatially patchy environments, where spatial objects are normally clumped into a few patches. Stated in other terms, SRS draws random counts that amount to unfair fractions of the cells of the survey area (analogous to strata in stratified sampling); even if it performs well at times, most of the time it does not. There is a consensus in geo-statistics that geographically near spatial objects have, more often than not, strong ties with the contexts of their surroundings (e.g., ecological, anthropogenic) [93, 99, 100]. All in all, selecting geographically spread-out samples is known to affect estimation quality. We dub samples drawn that way geospatially representative samples. In addition, although some related works focus on spatially representative sampling designs, they normally consider only static finite populations (as opposed to continuous infinite populations that always have superpopulations). Chief among the factors behind the shortage of spatially representative sampling designs for continuous populations is perhaps the limited computational capacity of systems at the time. However, current SPEs are a promising starting point for building online sampling designs.
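The uneven per-cell fractions that SRS produces on patchy data can be observed with a small simulation. The sketch below is illustrative only; the synthetic point set, the unit-square cells, and the helper names `unit_cell` and `per_cell_fractions` are assumptions and do not describe any particular SPE.

```python
import random
from collections import Counter

def unit_cell(point):
    """Index of the 1 x 1 grid cell that contains a planar (x, y) point."""
    x, y = point
    return (int(x // 1), int(y // 1))

def per_cell_fractions(points, sample, cell_of):
    """Fraction of each cell's population that ended up in the sample."""
    population_counts = Counter(cell_of(p) for p in points)
    sample_counts = Counter(cell_of(p) for p in sample)
    return {cell: sample_counts.get(cell, 0) / total
            for cell, total in population_counts.items()}

# Synthetic patchy population: 950 points in one cell, 50 scattered elsewhere.
rng = random.Random(3)
patch = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(950)]
sparse = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(50)]
points = patch + sparse

srs = rng.sample(points, 50)   # plain SRS, blind to location
fractions = per_cell_fractions(points, srs, unit_cell)
missed_cells = [cell for cell, f in fractions.items() if f == 0.0]
```

In this setup most of the sparsely populated cells receive no sample at all, while the few that are hit are sampled at a far higher fraction than the dense patch, which is the kind of imbalance a stratified, per-cell design avoids.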

In this thesis, we restrict ourselves to designing stratified-like spatial sampling methods that select well-spread-out, proportional spatial samples from irregular regions (a.k.a. polygons) in the sampling space. It should also be noted that we are constrained to selecting spatial samples in non-stationary, anisotropic online settings with temporal fluctuations in arrival rates and skewness, hence the term stream sampling (a.k.a. online sampling), which is discussed in the next subsection.

5.2.2.4 Stream Sampling (a.k.a. Online Sampling)

Stream sampling is subject to requirements that do not affect finite sampling designs. One important consideration is that samples are taken either on the fly, in record-at-a-time stream processing models, or from small batches (known as micro-batches), in micro-batch processing models. Another is that streaming systems normally apply exactly-once semantics, where tuples are not replayable. In addition, estimators should be designed so that they conform to the incrementalization semantics of the streaming model. For example, in time-based micro-batching window semantics, an ‘average’ of a variable of interest should be updated in every interval (i.e., batch interval, a portion of the time window), building incrementally on the preceding intervals. These challenges place many constraints on stream sampling designs that do not normally affect stationary sampling designs in the same way. To close those gaps, we have designed a spatial-aware online sampling method based on an SPE that supports a declarative SQL-like API. Our system, which we term SpatialSPE, is discussed in the next section.
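The incremental update of an ‘average’ across batch intervals can be sketched as follows. This is a minimal illustration assuming each micro-batch arrives as a list of numeric values; the class name `RunningMean` is hypothetical and not part of any SPE API. Only a count and a running total are kept, so no earlier tuple ever needs to be replayed.

```python
class RunningMean:
    """Incrementally maintained mean over a sequence of micro-batches."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        """Fold one micro-batch into the state and return the updated mean."""
        self.count += len(batch)
        self.total += sum(batch)
        return self.mean

    @property
    def mean(self):
        # Undefined (NaN) until at least one value has been seen.
        return self.total / self.count if self.count else float("nan")

# Three batch intervals arriving one after another.
estimator = RunningMean()
means = [estimator.update(batch) for batch in ([1, 2, 3], [4, 5], [6])]
# means == [2.0, 3.0, 3.5]
```

Each call builds on the state left by the preceding intervals, which mirrors the requirement that estimates be updated in every interval without revisiting earlier data.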

5.3 SpatialSPE: QoS-aware Approximate Spatial Data Stream Processing Engine

In document Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications (Page 104-108)