
5.5 Performance Evaluation and Results

5.5.5 Test Cases and Results

5.5.5.1 Testing scenarios

1) SQG1 test cases. We measured performance (i.e., the achievement of the estimation-quality QoS goal) by applying the following query, which belongs to SQG1: “what is the cumulative average of the trip distance travelled by taxicab trips within the first six months of 2016”.

2) SQG2 test cases. For top-N rankings, we apply an online spatial aggregation query, specifically the following: “what are the top 10 neighborhoods (or circular locations bounded by MBRs or geohashes) in New York City, USA, where taxicab trips originate”.

5.5.5.2 Results and discussion

5.5.5.2.1 SQG1 test case results

We use parameter configuration #1 for the tests in this subsection.

Figure 5.2 shows the differences between the online sampling schemes SAOS and SpSS-based SRS in terms of the “estimation quality” of an estimator for a target variable (such as the ‘average’ requested through the SQG1 test cases, as explained in section 5.5.5.1).

As is evident, SpSS-based SRS underperforms SAOS in terms of estimation quality (measured through RE and accuracy loss). Increasing the geohash size negatively affects the estimation quality. The results shown here are for the 68% confidence interval. The same pattern applies to the 95% and 99% confidence intervals. Refer to our paper for further results [101].

Note that, although the differences may seem trivial, the relative error captures an important aspect of estimation quality. To put it in perspective, reducing the error by a factor of 2 requires a sample that is larger by at least a factor of 4. This means that even a small fractional gain in those measures (i.e., accuracy loss and relative error) contributes significantly to meeting the accuracy QoS goals (i.e., higher estimation quality).
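The factor-of-4 relationship follows from the standard error of a sample mean shrinking with the square root of the sample size. A minimal sketch of that arithmetic is given below. It assumes simple random sampling without a finite population correction, and the values of `sigma` and `n` are purely illustrative and are not taken from our experiments.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean for n independent draws."""
    return sigma / math.sqrt(n)


def sample_size_for_error_reduction(n: int, factor: float) -> int:
    """Sample size needed to shrink the standard error by `factor`."""
    return math.ceil(n * factor ** 2)


# Hypothetical values, chosen only to illustrate the scaling.
sigma = 3.0      # assumed standard deviation of the target variable
n = 10_000       # assumed current sample size

se_now = standard_error(sigma, n)
n_needed = sample_size_for_error_reduction(n, 2.0)   # halve the error
se_half = standard_error(sigma, n_needed)

print(n_needed / n)      # ratio of sample sizes
print(se_now / se_half)  # ratio of standard errors
```

Under these assumptions, halving the standard error quadruples the required sample size, which is why a small reduction in relative error can translate into a large saving in tuples sampled.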

To take a more practical view of how this effect (even if it looks small in the figures) can negatively impact the estimation, figures 5.3 and 5.4 show, respectively, how with SpSS-based SRS the mean estimator misses the 68% confidence interval at some sampling rates, whereas with SAOS it stays within the boundaries of the same CI. The same trend occurs for the 95% and 99% confidence intervals; refer to our paper for further results [101].

Figure 5.2. Estimation accuracy of SAOS vs. SpSS-based SRS, for G1 queries. ‘loss’ in the legend is the accuracy loss calculated by applying equation (5.9), whereas ‘RE’ is the relative error calculated through equations (5.8) and (5.13) for SAOS and SpSS-based SRS, respectively


Figure 5.3. CI 68% SRS on mean estimator varying the sampling fraction. CI in the legend is the confidence interval

Figure 5.4. CI 68% SAOS on mean estimator varying the sampling fraction. CI in the legend is the confidence interval


To better understand why SAOS is better suited than SRS to geo-statistics, figure 5.5 shows the gain obtained by applying our method to SQG1 queries (calculated using the design effect measure [90]; refer to section 5.3.6).

Notice that we obtain a gain of up to 7% by applying SAOS instead of SpSS-based SRS. If these figures were the population variances, we would expect to need on average only (1000k)(0.93) = 930k observations in a SAOS sample to obtain the same estimation quality as an SRS of 1000k observations. For an arrival rate of 1 million tuples/second, this means processing 70 thousand fewer tuples, a significant reduction that saves precious online processing time in latency-sensitive SPEs, where even milliseconds can keep the system from coming to a halt.
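The arithmetic behind the 930k figure can be sketched as follows. The sketch assumes the common definition of the design effect as the ratio of the estimator variance under the design in question to the variance under SRS of the same size; the exact formulation in section 5.3.6 is not reproduced here, and the function names are hypothetical.

```python
def design_effect(var_design: float, var_srs: float) -> float:
    """Ratio of the estimator variance under a design to that under SRS."""
    return var_design / var_srs


def equivalent_sample_size(n_srs: int, deff: float) -> int:
    """Observations the design needs to match an SRS of n_srs observations."""
    return round(n_srs * deff)


# A 7% gain corresponds to a design effect of 0.93 (value taken from the text).
deff = 0.93
n_srs = 1_000_000

n_saos = equivalent_sample_size(n_srs, deff)
saved = n_srs - n_saos

print(n_saos)  # observations needed with SAOS
print(saved)   # tuples saved relative to SRS
```

A design effect below 1 means that the design is more efficient than SRS, so fewer observations are needed for the same variance.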

5.5.5.2.2 SQG2 test case results

Figure 5.6 depicts the performance of SAOS in comparison to the baseline (SpSS-based SRS) in terms of estimation quality for the spatial queries of SQG2. The Top-N ranking quality of SAOS outperforms that of SRS, although the two are almost on par for some sampling rates and geohash sizes.
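The ranking quality in Figure 5.6 is reported as Spearman's rho. As a rough illustration of how such a score compares a sampled Top-N ranking with the true one, the sketch below uses the textbook formula for rankings without ties. It assumes that both rankings contain the same set of cells; the cell identifiers are hypothetical and the function is not the implementation used in our experiments.

```python
def spearman_rho(true_order: list[str], est_order: list[str]) -> float:
    """Spearman's rho between two rankings of the same items (no ties)."""
    if sorted(true_order) != sorted(est_order):
        raise ValueError("both rankings must contain the same items")
    n = len(true_order)
    if n < 2:
        raise ValueError("need at least two ranked items")
    # Position of every item in the estimated ranking.
    est_rank = {key: pos for pos, key in enumerate(est_order)}
    # Sum of squared rank differences between the two rankings.
    d2 = sum((pos - est_rank[key]) ** 2 for pos, key in enumerate(true_order))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical geohash cells ranked by number of trip origins.
true_top = ["cell_a", "cell_b", "cell_c", "cell_d", "cell_e"]
sampled_top = ["cell_a", "cell_c", "cell_b", "cell_d", "cell_e"]

rho = spearman_rho(true_top, sampled_top)
print(rho)
```

A value of 1 means the sampled ranking reproduces the true order exactly, whereas lower values indicate that sampling has reordered some of the cells.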


Regarding time-based QoS goals (more specifically, throughput in this case), Figure 5.7 shows that SpSS-based SRS slightly underperforms SAOS. Despite being a simple approach, SRS performs worse in this case because, on average, the system needs to manage more key states between triggers when applying SRS. This is because SRS is not aware of the key distribution and selects tuples completely at random, which means sampling unnecessarily more keys at every trigger.

Figure 5.6. Spearman’s rho by applying SAOS vs. SpSS-based SRS. ‘rho 30’ (on the primary axis) means the rho value at geohash precision 30, whereas ‘rho 35’ (on the secondary axis) means the rho value at geohash precision 35

Figure 5.7. Throughput by running SAOS against SpSS-based SRS, with a streaming rate equal to 500k tuples/second. ‘key_states_updated’ (on the secondary axis) in the legend means the average number of keys updated between tumbling windows

SAOS and SpSS-based SRS behave in the same way when the data rate oscillates from 500K to 1000K to 2000K tuples/second, with SpSS-based SRS always underperforming. Refer to our paper for further results, specifically those showing the same latency and throughput trends under different settings (four worker nodes instead of two) [101]. All in all, SAOS is able to keep up with the pace at which data arrives (almost on par with it), thus achieving the latency quality goals.

Finally, we show the effect of incrementalization on the mean estimator. Figure 5.8 shows how both SAOS and SpSS-based SRS catch up with the true mean value after a total of one million tuples have arrived. The mean estimate of both approaches the true value stepwise. However, SAOS approaches it faster, which is explained by the smaller standard error that results from applying SAOS (as opposed to the value obtained by applying SRS), calculated incrementally. Notice also, however, that the difference in standard error (SE) between the two shrinks and vanishes as their estimates approach the true value.
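To make the idea of a stepwise mean with an incrementally maintained standard error concrete, the sketch below uses Welford's online algorithm. This is a generic technique substituted here for illustration: it is not the SAOS estimator, it treats the sampled tuples as a simple random sample, and the class name and input values are hypothetical.

```python
import math


class RunningMean:
    """Incrementally maintained mean and standard error (Welford's method)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one sampled value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def standard_error(self) -> float:
        """Standard error of the mean; undefined for fewer than two values."""
        if self.n < 2:
            return float("nan")
        variance = self._m2 / (self.n - 1)
        return math.sqrt(variance / self.n)


# Hypothetical trip distances arriving one tuple at a time.
estimator = RunningMean()
for distance in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    estimator.update(distance)
    print(estimator.n, estimator.mean, estimator.standard_error())
```

Each update costs constant time and memory, so the estimate and its standard error can be refreshed at every trigger without revisiting earlier tuples.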

Figure 5.8. The effect of incrementalization on the ‘average’ or ‘mean’ estimator. The sampling fraction is set to 40%. In the legend, ‘stepwise_mean’ (the primary axis on the left) is the ‘mean’ value as it changes with the total number of tuples that have arrived up to that point in time. SE (the secondary axis on the right) is the standard error.


In theory, too, SAOS is an appealing and compelling approach; a theoretical perspective on the advantages of SAOS is given in Appendix E.

5.6 Similar Works

The works in [99, 100, 104] apply various dimensionality reduction approaches, but they are computationally expensive and inapplicable in distributed online deployments. Also, the relevant literature focuses on achieving single QoS goals (for example, satisfying either high resource utilization or low latency) without seeking a balance between them.

Several works in the literature focus on spatial sampling. However, most of them are geared toward centralized and stationary settings, depending on High Performance Computing (HPC) deployments with disk-resident datasets. While this works for some scenarios, it has not usually been the case during the last decade, in which huge amounts of spatially-augmented data arrive very fast, sometimes with burst loads and unruly spikes (i.e., not amenable to discipline), thus leading to an interest in online spatial sampling. This is specifically challenging, given that adding spatial awareness normally imposes additional overheads on systems, due in part to the ‘curse of dimensionality’ of geospatial object representations.

Most relevantly, [105] designed a dimensionality reduction method for finite populations, dubbed generalized random-tessellation stratified (GRTS), that is based on mapping two-dimensional space into a lower-dimensional space, then creating a set of randomly ordered spatial addresses and applying systematic sampling in order to generate a well-balanced random sample. They depend on the fact that spatial objects that are proximate in the two-dimensional planar space tend to be proximate in the one-dimensional space after mapping. The sample is then selected using a systematic sampling scheme. This is analogous to random tessellation in a two-dimensional space. However, a well-spread sample is not necessarily a representative one, and the systematic component may under-represent some regions. In the same vein, [106] presents a sampling method that relies on dimensionality reduction, more specifically by utilizing space-filling curves. They order the survey units in such a way that consecutively numbered points represent a spatially well-balanced sample. Other works include [96], which incorporated a kriging estimator for real-time monitoring of environmental phenomena into a higher-level architectural pattern. The picture that emerges from the relevant literature, however, is that none of the foregoing studies is applicable in distributed deployments. Hence, they are not designed to achieve incrementally accurate results that improve dynamically over time (i.e., stepwise). In contrast, our system achieves spatial awareness in distributed settings. Also, SpatialSPE has introduced incrementalization over geo-referenced data streams using a declarative API, a target that is completely novel.