
• Multi-user query processing environment, with high throughput and low-latency queries.
• Resource-wise utilization, in order to reduce memory consumption without affecting the system’s performance.

4.3 Search Query Workload Estimative

We provide an estimate of the workload for a single spatial-temporal (ST) search query, processed in a distributed parallel fashion using space partitioning, based on the following observations:

4.3.1 Distributed ST-Search Cost

Given a list of user input queries $Q = [Q_1, Q_2, \ldots, Q_k]$, where $Q_i = (R_i, t_0^i, t_1^i)$, the cost $C^{i}_{st}$ of a single search query $Q_i$ for a given trajectory dataset $S$, spatial query region $R_i$, and time interval $[t_0^i, t_1^i]$ can be estimated from the total number of trajectory segments $n$ in $S$, i.e. $C^{i}_{st} = O(n)$, since we simply need to check for segments intersecting $R_i$ during $[t_0^i, t_1^i]$.
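The linear cost can be illustrated with a minimal Python sketch that visits every segment exactly once. The `Segment` layout, the rectangular query region, and the bounding-box test are assumptions made for illustration only; the text does not prescribe a data model or an exact intersection test.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One trajectory segment with two timestamped endpoints (hypothetical layout)."""
    traj_id: str
    x0: float
    y0: float
    t0: float
    x1: float
    y1: float
    t1: float


# Assumed query region: axis-aligned rectangle (min_x, min_y, max_x, max_y).
Region = Tuple[float, float, float, float]


def may_intersect(seg: Segment, region: Region, t_start: float, t_end: float) -> bool:
    """Conservative filter: bounding box overlaps the region and time spans overlap."""
    min_x, min_y, max_x, max_y = region
    overlaps_space = (
        min(seg.x0, seg.x1) <= max_x and max(seg.x0, seg.x1) >= min_x
        and min(seg.y0, seg.y1) <= max_y and max(seg.y0, seg.y1) >= min_y
    )
    overlaps_time = min(seg.t0, seg.t1) <= t_end and max(seg.t0, seg.t1) >= t_start
    return overlaps_space and overlaps_time


def st_search(segments: List[Segment], region: Region,
              t_start: float, t_end: float) -> List[Segment]:
    """Check each segment once, so the work grows linearly with len(segments)."""
    return [s for s in segments if may_intersect(s, region, t_start, t_end)]
```

The bounding-box check is only a filter: a segment whose box overlaps the region may still miss it, so an exact geometric test would follow in a full implementation.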

Figure 4.1: Example of Quadtree space partitioning for trajectories.

Since ST-Search is intrinsically parallelizable using space decomposition, we can greatly decrease the computational cost $C^{i}_{st}$ by partitioning the input dataset with a space partitioning method and then selecting only the partitions that contain candidate trajectories, that is, the partitions intersecting the query region $R_i$ during $[t_0^i, t_1^i]$. For instance, Figure 4.1 shows an example of space decomposition using a Quadtree with three query regions $R_1$, $R_2$, and $R_3$. To process $R_1$ we only need to consider data in the three spatial partitions that the query region intersects. Finally, we perform a precise search in each candidate partition in parallel. Equation (4.1) gives the estimated cost $C^{p_i}_{st}$ of a spatial-aware ST-Selection query processed in parallel with space partitioning,

$$C^{p_i}_{st} = C_{io} + \frac{n}{\arg\min(B, U)} + C_{pos} \qquad (4.1)$$

where $n$ is the number of data records in the candidate partitions, $B$ is the number of candidate partitions/blocks, and $U$ is the number of processing units available to process query $i$ (assuming all units have the same computational power).
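As a rough numerical illustration, the sketch below evaluates the parallel cost estimate, reading Equation (4.1) as the I/O cost, plus the n candidate records divided by the smaller of B and U, plus the post-processing cost. That reading, the function name `parallel_st_cost`, and the unit-free inputs are assumptions of this sketch, not part of the original text.

```python
def parallel_st_cost(n: int, B: int, U: int,
                     c_io: float = 0.0, c_pos: float = 0.0) -> float:
    """Estimate the parallel ST-Selection cost under the assumed reading of Eq. (4.1).

    n     -- number of records in the candidate partitions
    B     -- number of candidate partitions/blocks
    U     -- number of equally powerful processing units
    c_io  -- I/O cost of loading the candidate partitions (0 when in memory)
    c_pos -- post-processing cost (deduplication or merging)
    """
    if B <= 0 or U <= 0:
        raise ValueError("B and U must be positive")
    # Effective parallelism is bounded by both the partitions and the units.
    return c_io + n / min(B, U) + c_pos


# Three candidate partitions on eight units: only three units can work.
cost_few_partitions = parallel_st_cost(n=1200, B=3, U=8)
# Sixteen candidate partitions on eight units: all eight units are busy.
cost_many_partitions = parallel_st_cost(n=1200, B=16, U=8)
```

With the same number of candidate records, the estimate drops once there are at least as many candidate partitions as processing units, after which adding partitions no longer helps.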

$C_{io}$ is the I/O cost to load the candidate partitions from the file system, and it is relative to the total number of records to read in the candidate partitions. Notice that, if the data partitions are stored in main memory, then $C_{io} = 0$.

Furthermore, there is a post-processing step to merge segments from trajectories that have been split across multiple partitions; this step adds a cost $C_{pos}$ to the final workload. The cost $C_{pos}$ depends on how we handle boundary records, as well as on the partitioning granularity. For instance, in Figure 4.1 the query region $R_1$ intersects trajectories in multiple spatial partitions. If we assign boundary-crossing trajectories to all intersecting partitions (i.e. multiple assignment), then $C_{pos}$ is simply the cost of removing duplicated results. If, however, we split boundary trajectories into sub-trajectories according to their containing spatial partitions (i.e. single assignment), then $C_{pos}$ is the cost of merging the resulting sub-trajectories at the end of the processing. For both strategies $C_{pos}$ also depends on the partitioning granularity: in Figure 4.1, increasing the partitioning granularity would either increase the number of replications under multiple assignment or increase the number of splits under single assignment, thus increasing the post-processing cost. Overall, we can estimate $C_{pos}$ from either the number of replicated trajectories under the multiple assignment policy or the number of sub-trajectories to merge under single assignment.
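The two post-processing strategies can be sketched in Python as follows. The function names, the use of trajectory ids as results, and the `(x, y, t)` point layout are hypothetical choices for illustration; the text does not specify how results are represented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float, float]  # assumed layout: (x, y, timestamp)


def dedupe_results(partition_results: Iterable[Iterable[str]]) -> List[str]:
    """Multiple assignment: drop trajectory ids reported by more than one partition."""
    seen = set()
    unique: List[str] = []
    for result in partition_results:
        for traj_id in result:
            if traj_id not in seen:
                seen.add(traj_id)
                unique.append(traj_id)
    return unique


def merge_sub_trajectories(
    parts: Iterable[Tuple[str, List[Point]]]
) -> Dict[str, List[Point]]:
    """Single assignment: stitch sub-trajectories back together by trajectory id."""
    merged: Dict[str, List[Point]] = defaultdict(list)
    for traj_id, points in parts:
        merged[traj_id].extend(points)
    # Restore temporal order, since partitions may return pieces in any order.
    return {tid: sorted(pts, key=lambda p: p[2]) for tid, pts in merged.items()}
```

In this sketch the deduplication work grows with the number of replicated results, while the merge work grows with the number of sub-trajectories, which mirrors the two ways of estimating the post-processing cost described above.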

In Equation (4.1) we assume that $B$ is a set of disjoint and homogeneous spatial partitions. However, real-life trajectory and spatial datasets are not uniformly distributed; for instance, the density of data records in a city center is much larger than in the suburbs. In distributed parallel applications, a poorly partitioned dataset can lead to contention and increase communication and data transfer between the computing nodes. Furthermore, an unbalanced partitioning will increase the number of False Positives (FP), that is, records in the candidate partitions that are not part of the query result, hence increasing $C_{io}$ and the overall cost. Therefore, we must employ a partitioning strategy that takes the data distribution into account for better load balancing.
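To make the false-positive effect concrete, the following Python sketch counts the records that are read from candidate partitions but do not belong to the query result. The function name, the partition labels, and all record counts are made-up values for illustration only.

```python
from typing import Dict, Iterable


def false_positives(partition_sizes: Dict[str, int],
                    candidate_ids: Iterable[str],
                    result_size: int) -> int:
    """Records scanned in the candidate partitions that are not in the result."""
    scanned = sum(partition_sizes[p] for p in candidate_ids)
    if result_size > scanned:
        raise ValueError("the result cannot be larger than the scanned records")
    return scanned - result_size


# Hypothetical skewed layout: one dense partition covers the whole city center.
skewed = {"center": 9000, "north": 500, "south": 500}
fp_skewed = false_positives(skewed, ["center"], result_size=200)

# Hypothetical density-aware layout: the dense area is split into four parts.
balanced = {"center_a": 2250, "center_b": 2250, "center_c": 2250,
            "center_d": 2250, "north": 500, "south": 500}
fp_balanced = false_positives(balanced, ["center_a"], result_size=200)
```

Under these assumed numbers, the same query scans far fewer irrelevant records when the dense area is partitioned more finely.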

In addition, when partitioning the data space, we must account for boundary objects, since a trajectory can intersect more than one spatial partition. The simplest solution is to replicate boundary segments. However, for regions with a high density of boundary records, replication will negatively affect the computation cost by increasing the number of data records; moreover, for in-memory storage frameworks, such as Spark, replication will also increase memory usage. In Equation (4.2) we estimate $C^{d_i}_{st}$, the cost of the distributed ST-Selection query,

$C^{d_i}_{st} = C_{io} + (n + rn)$

In document A distributed in-memory database system for large-scale spatial-temporal trajectory data (Page 101-103)