Complexity Analysis - Utilizing query logs for data replication and placement in big data applications

In this section, we provide detailed complexity analyses of the recursive replicated declustering and multi-way replicated refinement phases of our algorithm.

2.7.1 Recursive replicated declustering phase

In the recursive replicated declustering phase, initial gain computations (Algorithm 2) take $O(2 \times \sum_{q \in Q} |q|) = O(\sum_{q \in Q} |q|)$ time for each two-way refinement pass. This is because each data item has two gains (either $g_m$ and $g_r$, or $g_{uA}$ and $g_{uB}$), and for each data item $d$ we check all the queries that request $d$ to identify the initial operation gains of $d$. After selecting the best operation to perform and the related data item, we perform an extract-max for the selected operation and delete the other operation gain associated with the data item. Thus, in each pass, at most $|D|$ extract-max and at most $|D|$ heap-delete operations are performed. When an operation is performed on a data item, the operation gains of its neighboring data items must be investigated for possible updates (Algorithms 3, 4, 5). In the implementation of gain update operations, we use increase-key or decrease-key operations in max-heaps. If an operation is invoked on every data item during an FM-like pass, every query $q$ incurs at most $|q|^2$ updates in total. Note that this $|q|^2$ upper bound is very loose: when an operation is performed on a data item $d$ requested by a query $q$, the gain values of the other data items requested by $q$ need to be updated only in a handful of conditions (determined by changes in the $\Delta$ values of $q$). Since we do not update the gains of data items already processed in a pass, the maximum number of updates decreases by one after each operation.
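The heap operations named above (extract-max, heap-delete, increase-key and decrease-key) can be illustrated with a small indexed binary max-heap. This is a minimal sketch and not the implementation used in this work: the class name `IndexedMaxHeap`, its method names, and the sample gain values are all hypothetical, and the only property it is meant to show is that each operation touches one root-to-leaf path, i.e. costs $O(\lg|D|)$ for a heap of $|D|$ gains.

```python
class IndexedMaxHeap:
    """Binary max-heap over (gain, item) pairs with a position index.

    Hypothetical illustration: the position index lets us find an item's
    slot in O(1), so update and delete cost O(lg n) like extract-max.
    """

    def __init__(self, gains):
        # gains: dict mapping a data item to its (assumed) operation gain.
        self._heap = [(g, item) for item, g in gains.items()]
        self._pos = {item: i for i, (_, item) in enumerate(self._heap)}
        # Bottom-up heap construction, O(n) overall.
        for i in range(len(self._heap) // 2 - 1, -1, -1):
            self._sift_down(i)

    def __len__(self):
        return len(self._heap)

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        self._pos[h[i][1]] = i
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] <= self._heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child][0] > self._heap[largest][0]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def update(self, item, new_gain):
        """Increase-key or decrease-key, depending on the new gain."""
        i = self._pos[item]
        old_gain = self._heap[i][0]
        self._heap[i] = (new_gain, item)
        if new_gain > old_gain:
            self._sift_up(i)
        else:
            self._sift_down(i)

    def delete(self, item):
        """Remove an arbitrary item (heap-delete)."""
        i = self._pos[item]
        last = len(self._heap) - 1
        if i != last:
            self._swap(i, last)
        self._heap.pop()
        del self._pos[item]
        if i < len(self._heap):
            # The element moved into slot i may violate the heap order
            # in either direction, so restore it both ways.
            self._sift_up(i)
            self._sift_down(i)

    def extract_max(self):
        gain, item = self._heap[0]
        self.delete(item)
        return item, gain


# Usage with made-up gains: one heap per operation type.
move_heap = IndexedMaxHeap({"d1": 3, "d2": 7, "d3": 5})
repl_heap = IndexedMaxHeap({"d1": 1, "d2": 2, "d3": 4})
move_heap.update("d1", 9)               # increase-key after a neighbor changed
item, gain = move_heap.extract_max()    # best move operation
repl_heap.delete(item)                  # drop the item's other operation gain
```

In this sketch the selected item is removed from one heap by extract-max and from the other by delete, which mirrors the at most $|D|$ extract-max and $|D|$ heap-delete operations counted per pass.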

When all operations are considered, in one two-way refinement pass at most $(|q| \times (|q| - 1)/2) \times 2 \approx |q|^2$ gain update operations will have to be performed on the data items of a query $q$ in the worst case (each data item has two gains, hence the multiplication by two). In total, this makes $O(\sum_{q \in Q} |q|^2)$ gain update operations. Since each update is an increase-key or decrease-key operation on a max-heap, the cost of gain updates in a two-way refinement pass is $O(\sum_{q \in Q} |q|^2 \times \lg|D|)$. The total cost of a two-way refinement pass is $O(|D| \times \lg|D| + \sum_{q \in Q} |q|^2 \times \lg|D|) \approx O(\sum_{q \in Q} |q|^2 \times \lg|D|)$ for practical purposes. We limit the number of passes to 10 for a recursive replicated declustering step, and in practice the number of passes rarely reaches 10.
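The per-query bound can be traced step by step. As a worked sketch, assume the data items of a query $q$ are operated on one after another and that the $i$-th operation can trigger updates only on the $|q| - i$ items of $q$ not yet processed in the pass (the index $i$ is introduced here purely for illustration):

```latex
\[
\sum_{i=1}^{|q|} \bigl(|q| - i\bigr) = \frac{|q|\,(|q| - 1)}{2},
\qquad
2 \times \frac{|q|\,(|q| - 1)}{2} = |q|^2 - |q| \le |q|^2 .
\]
```

The factor of two accounts for the two gains kept per data item. Summing the right-hand side over all queries and charging one $O(\lg|D|)$ heap update per gain change gives the gain-update term of the pass cost.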

The worst-case analysis above is valid for the first two-way replicated declustering, where there are no replications initially. We proceed with the analysis of the recursive replicated declustering by investigating the complexity at the levels of the recursion tree. We assume that the maximum allowable replication occurs after the first two-way replicated declustering step, which is a worst-case scenario. At each recursion level, the sum of the sizes of the sub-datasets will therefore be at most $|D|(1 + r)$. Under a balanced declustering assumption, at the $\ell$th recursion level, two-way declustering processes will be applied on $2^\ell$ sub-datasets, each of size $O(|D|(1 + r)/2^\ell)$. If we ignore the decrease in the query sizes due to the query splitting process, the aggregate cost of the $2^\ell$ two-way declustering processes at the $\ell$th recursion level will be $O(2^\ell \times \lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$, leading to an overall cost of $O(K \times \lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$. However, for practical purposes, considering the decrease in the query sizes due to query splitting, the aggregate cost of the $2^\ell$ two-way declustering processes at the $\ell$th recursion level will be approximately $O(\lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$, leading to an overall cost of $O(\lg K \times \lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$.
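The step from the per-level cost to the overall cost is a sum over the recursion levels. As a sketch, assume $K$ is a power of two, so that the recursion tree has levels $\ell = 0, \ldots, \lg K - 1$, and write $C = \lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2$ for the cost bound of one two-way declustering on the full query set (the symbol $C$ is introduced here only for brevity):

```latex
\[
\underbrace{\sum_{\ell=0}^{\lg K - 1} 2^{\ell}\, C}_{\text{query splitting ignored}}
  = (K - 1)\, C = O(K \times C),
\qquad
\underbrace{\sum_{\ell=0}^{\lg K - 1} C}_{\text{query splitting considered}}
  = \lg K \times C .
\]
```

In the first sum every one of the $2^\ell$ sub-problems at level $\ell$ is charged the full $C$. In the second, the split queries at a level are assumed to cost about $C$ in aggregate, so each level contributes $C$ once.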

2.7.2 Multi-way replicated refinement phase

The multi-way replicated refinement phase has a preprocessing step in which we compute the optimal schedules for all queries. This preprocessing takes $O(\sum_{q \in Q} (|q|^2 \times K))$ time.

The complexity analysis of each multi-way refinement pass is as follows. In the initial virtual leave gain computation, since we have a single gain for each data item, the complexity is similar to that of the initial gain computation of a two-way replicated declustering. The only difference stems from the getRequestedDisk(q, d) operation, which takes $O(|q|)$ time. Thus, calculating the virtual leave gains of all data items takes $O(\sum_{q \in Q} |q|^2)$ time. We build a heap from the virtual leave gains in $O(|D|)$ time.

We select the data item with the maximum virtual leave gain in $O(\lg|D|)$ time. Then we compute its actual move and replication gains for the remaining $K - 1$ disks and decide where to move or replicate it according to these actual gains. In total, actual gain calculations for all data items take $O(\sum_{q \in Q} |q| \times K)$ time. After performing the operation that has the highest actual gain, we update the virtual leave gains of neighboring data items as well as the item distributions of all related queries. Updating the virtual leave gain of a data item takes $O(\lg|D|)$ time, since we store virtual leave gains in a max-heap. Similar to the recursive replicated declustering phase, we may need to update the virtual leave gains at most $|q|^2$ times for each query $q$ in a multi-way refinement pass, so in the worst case the total cost of virtual leave gain updates is $O(\sum_{q \in Q} |q|^2 \times \lg|D|)$. Updating query item distributions takes a total of $O(\sum_{q \in Q} |q|)$ time. Thus, in a multi-way refinement pass, the total cost for updates is $O(\sum_{q \in Q} |q|^2 \times \lg|D|)$.

We limit the number of passes in multi-way refinement to 10. Thus, when we take the preprocessing, virtual leave gain initialization, actual gain computation, and virtual leave gain update stages into account, the total cost of multi-way refinement is $O(\sum_{q \in Q} |q|^2 \times K) + O(\sum_{q \in Q} |q|^2) + O(\sum_{q \in Q} |q| \times K) + O(\sum_{q \in Q} |q|^2 \times \lg|D|)$, which is equal to $O((K + \lg|D|) \times \sum_{q \in Q} |q|^2)$.
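To see why the four terms collapse into a single bound, it can help to evaluate them on a toy query log. The following is a minimal sketch with constants dropped. The function name `multiway_cost_terms`, its parameters, and the two-query example are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def multiway_cost_terms(queries, num_items, K):
    """Evaluate the four asymptotic cost terms (constants dropped).

    queries   : iterable of sets of data items (a toy query log)
    num_items : |D|, the number of data items
    K         : number of disks
    """
    sum_q = sum(len(q) for q in queries)          # sum of |q|
    sum_q2 = sum(len(q) ** 2 for q in queries)    # sum of |q|^2
    lg_d = math.log2(num_items)
    return {
        "preprocessing": sum_q2 * K,
        "virtual_gain_init": sum_q2,
        "actual_gain": sum_q * K,
        "virtual_gain_update": sum_q2 * lg_d,
        "bound": (K + lg_d) * sum_q2,
    }


# Toy example: two queries over four data items, four disks.
terms = multiway_cost_terms([{1, 2, 3}, {2, 4}], num_items=4, K=4)
stage_terms = [v for name, v in terms.items() if name != "bound"]
# Each stage term is individually dominated by the combined bound.
dominated = all(v <= terms["bound"] for v in stage_terms)
```

Each stage term is at most $(K + \lg|D|) \times \sum_{q \in Q} |q|^2$ whenever $|q| \ge 1$ and $K \ge 1$, so the sum of the four terms stays within a constant factor of that bound.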
