In this section, we provide detailed complexity analyses of the recursive replicated declustering and multi-way replicated refinement phases of our algorithm.
2.7.1 Recursive replicated declustering phase
In the recursive replicated declustering phase, the initial gain computations (Algorithm 2) take $O(2 \times \sum_{q \in Q} |q|) = O(\sum_{q \in Q} |q|)$ time for each two-way refinement pass. This is because each data item has two gains (either $g_m$ and $g_r$, or $g_{uA}$ and $g_{uB}$), and for each data item $d$ we check all the queries that request $d$ to identify the initial operation gains of $d$. After selecting the best operation and its data item, we perform an extract-max for the selected operation and delete the other operation gain of that data item. Thus, in each pass, at most $|D|$ extract-max and at most $|D|$ heap-delete operations are performed. When an operation is performed on a data item, the operation gains of its neighboring data items must be examined for possible updates (Algorithms 3, 4, 5). Gain updates are implemented with increase-key or decrease-key operations on max-heaps. If an operation is invoked on every data item during an FM-like pass, every query $q$ incurs at most $|q|^2$ updates in total. Note that this $|q|^2$ upper bound is very loose: when an operation is performed on a data item $d$ requested by a query $q$, the gain values of the other data items requested by $q$ need to be updated only under a handful of conditions (determined by changes in the $\Delta$ values of $q$). Since we do not update the gains of data items that have already been processed in the pass, the maximum number of updates decreases by one after each operation.
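The heap operations used in this analysis (extract-max, heap-delete, increase-key and decrease-key) can be illustrated with a position-indexed binary max-heap. The following is a minimal sketch, not the implementation used in this work: the class name `IndexedMaxHeap` and its methods are hypothetical, and a single heap is shown, whereas the algorithm keeps separate gains per operation type. Each key change or deletion touches at most one root-to-leaf path, which is where the $O(\lg|D|)$ factor comes from.

```python
class IndexedMaxHeap:
    """Binary max-heap with a position map (hypothetical helper, for illustration)."""

    def __init__(self, gains):
        # gains: dict mapping data item -> gain value
        self._heap = [(g, item) for item, g in gains.items()]
        self._pos = {item: i for i, (_, item) in enumerate(self._heap)}
        # Bottom-up construction takes O(n) time.
        for i in reversed(range(len(self._heap) // 2)):
            self._sift_down(i)

    def __len__(self):
        return len(self._heap)

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        self._pos[h[i][1]] = i
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] <= self._heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child][0] > self._heap[largest][0]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def update(self, item, new_gain):
        """Increase-key or decrease-key in O(lg n)."""
        i = self._pos[item]
        old_gain = self._heap[i][0]
        self._heap[i] = (new_gain, item)
        if new_gain > old_gain:
            self._sift_up(i)
        else:
            self._sift_down(i)

    def delete(self, item):
        """Heap-delete of an arbitrary item in O(lg n)."""
        i = self._pos.pop(item)
        last = self._heap.pop()
        if i < len(self._heap):
            # Move the former last entry into the hole and restore heap order.
            self._heap[i] = last
            self._pos[last[1]] = i
            self._sift_up(i)
            self._sift_down(i)

    def extract_max(self):
        """Remove and return (item, gain) with the largest gain in O(lg n)."""
        gain, item = self._heap[0]
        self.delete(item)
        return item, gain
```

For example, after `h = IndexedMaxHeap({"a": 3, "b": 7, "c": 5, "d": 1})`, calling `h.update("d", 10)` acts as an increase-key and makes `"d"` the next item returned by `h.extract_max()`.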
When all operations are considered, in one two-way refinement pass at most $(|q| \times (|q| - 1)/2) \times 2 \approx |q|^2$ gain update operations have to be performed on the data items of a query $q$ in the worst case (each data item has two gains, hence the multiplication by two). In total, this amounts to $O(\sum_{q \in Q} |q|^2)$ gain update operations. Since each update is an increase-key or decrease-key operation on a max-heap holding at most $|D|$ items, it costs $O(\lg|D|)$ time, so the cost of gain updates in a two-way refinement pass is $O(\sum_{q \in Q} |q|^2 \times \lg|D|)$. The total cost of a two-way refinement pass is $O(|D| \times \lg|D| + \sum_{q \in Q} |q|^2 \times \lg|D|) \approx O(\sum_{q \in Q} |q|^2 \times \lg|D|)$ for practical purposes. We limit the number of passes to 10 for a recursive replicated declustering step; in practice, the number of passes rarely reaches 10.
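The counting argument above can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, every heap operation is charged exactly $\lg|D|$ units, and the result is an operation count whose constants are irrelevant to the asymptotic bound.

```python
import math


def max_gain_updates(query_size):
    """Worst-case gain updates for one query in one two-way pass.

    The i-th processed item can only affect the items not yet processed,
    and each of those items has two gains.
    """
    return 2 * sum(query_size - 1 - i for i in range(query_size))


def pass_cost_bound(query_sizes, num_items):
    """Operation-count bound for one two-way refinement pass (illustrative)."""
    heap_op = math.log2(num_items)  # assumed cost of a single heap operation
    # At most |D| extract-max and |D| heap-delete operations per pass.
    extract_and_delete = 2 * num_items * heap_op
    # At most |q|(|q| - 1) key updates per query, each a heap operation.
    updates = sum(max_gain_updates(s) for s in query_sizes) * heap_op
    return extract_and_delete + updates
```

For a query of size 4, `max_gain_updates(4)` evaluates to $2 \times (3 + 2 + 1 + 0) = 12 = |q|(|q| - 1)$, which is below the $|q|^2 = 16$ bound used in the analysis.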
The worst-case analysis above is valid for the first two-way replicated declustering, where there are no replications initially. We proceed with the analysis of the recursive replicated declustering by investigating the complexity at each level of the recursion tree. We assume that the maximum allowable replication occurs after the first two-way replicated declustering step, which is a worst-case scenario. So at each recursion level the sum of the sizes of the sub-datasets will be at most $|D|(1 + r)$. Under a balanced declustering assumption, at the $\ell$th recursion level, two-way declustering processes will be applied on $2^\ell$ sub-datasets, each of size $O(|D|(1 + r)/2^\ell)$. If we ignore the decrease in the query sizes due to the query splitting process, the aggregate cost of the $2^\ell$ two-way declustering processes at the $\ell$th recursion level will be $O(2^\ell \times \lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$, leading to an overall cost of $O(K \times \lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$ when summed over the $\lg K$ recursion levels. However, for practical purposes, considering the decrease in the query sizes due to the query splitting, the aggregate cost of the $2^\ell$ two-way declustering processes at the $\ell$th recursion level will approximately be $O(\lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$, leading to an overall cost of $O(\lg K \times \lg(|D|(1 + r)) \times \sum_{q \in Q} |q|^2)$.
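The level-by-level bookkeeping behind these bounds can be sketched as follows. This is an illustration only: the function names are hypothetical, $K$ is assumed to be a power of two, the split is assumed to be perfectly balanced, and `replication_ratio` stands for $r$.

```python
def level_profile(num_items, replication_ratio, num_disks):
    """Return (level, number of sub-datasets, size of each) per recursion level.

    Assumes num_disks is a power of two and a perfectly balanced split.
    """
    total = num_items * (1 + replication_ratio)  # worst-case |D|(1 + r)
    num_levels = num_disks.bit_length() - 1      # lg K for a power of two
    return [(lvl, 2 ** lvl, total / 2 ** lvl) for lvl in range(num_levels)]


def worst_case_multiplier(num_disks):
    """Sum of 2^l over all recursion levels, which equals K - 1."""
    return sum(count for _, count, _ in level_profile(1, 0.0, num_disks))
```

With $|D| = 1000$, $r = 0.25$ and $K = 8$, the profile has three levels with 1, 2 and 4 sub-datasets of sizes 1250, 625 and 312.5, so every level covers $|D|(1 + r) = 1250$ items in total. The number of two-way declustering processes over all levels is $1 + 2 + 4 = K - 1$, which is the source of the factor $K$ in the bound that ignores query splitting, while the number of levels, $\lg K$, is the factor in the practical bound.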
2.7.2 Multi-way replicated refinement phase
The multi-way replicated refinement phase has a preprocessing step in which we compute the optimal schedules for all queries. This preprocessing takes $O(\sum_{q \in Q} |q|^2 \times K)$ time.
The complexity analysis of each multi-way refinement pass is as follows. In the initial virtual leave gain computation, since we have a single gain for each data item, the complexity is similar to that of the initial gain computation of a two-way replicated declustering. The only difference stems from the getRequestedDisk(q, d) operation, which takes $O(|q|)$ time. Thus, calculating the virtual gains of every item takes $O(\sum_{q \in Q} |q|^2)$ time. We build a heap from the virtual leave gains in $O(|D|)$ time.
We select the data item with the maximum virtual leave gain in $O(\lg|D|)$ time. Then we compute its actual move and replication gains for the remaining $K - 1$ disks and decide where to move or replicate it according to these actual gains. In total, actual gain calculations for all data items take $O(\sum_{q \in Q} |q| \times K)$ time. After performing the operation that has the highest actual gain, we update the virtual leave gains of neighboring data items as well as the item distributions of all related queries. Updating the virtual leave gain of a data item takes $O(\lg|D|)$ time, since we store virtual leave gains in a max-heap. Similar to the recursive replicated declustering phase, we may need to update the virtual leave gains at most $|q|^2$ times for each query $q$ in a multi-way refinement pass. In the worst case, the total cost of virtual leave gain updates is therefore $O(\sum_{q \in Q} |q|^2 \times \lg|D|)$.
Updating query item distributions takes a total of $O(\sum_{q \in Q} |q|)$ time. Thus, in a multi-way refinement pass, the total cost for updates is $O(\sum_{q \in Q} |q|^2 \times \lg|D|)$.
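The selection step of a multi-way refinement pass (build a heap of virtual leave gains, pick the item with the largest one, then evaluate the remaining $K - 1$ disks) can be sketched as below. This is a simplified illustration under several assumptions: `select_and_place` and its parameters are hypothetical names, `actual_gain` is a caller-supplied stand-in for the actual move and replication gain computation, and Python's `heapq` min-heap is used with negated gains to emulate a max-heap.

```python
import heapq


def select_and_place(virtual_gain, current_disk, num_disks, actual_gain):
    """Pick the item with the largest virtual leave gain and its best target disk.

    virtual_gain: dict item -> virtual leave gain
    current_disk: dict item -> disk currently holding the item
    actual_gain:  hypothetical callable (item, disk) -> actual gain
    """
    # Heap construction in O(|D|); gains are negated because heapq is a min-heap.
    heap = [(-gain, item) for item, gain in virtual_gain.items()]
    heapq.heapify(heap)

    # Selecting the maximum virtual leave gain costs O(lg |D|).
    _, item = heapq.heappop(heap)

    # Evaluate the actual gain on each of the remaining K - 1 disks.
    candidates = [k for k in range(num_disks) if k != current_disk[item]]
    best_disk = max(candidates, key=lambda k: actual_gain(item, k))
    return item, best_disk, actual_gain(item, best_disk)
```

The sketch covers a single selection; a full pass would repeat it and refresh the virtual leave gains of neighboring items after each operation, which is where the $O(\sum_{q \in Q} |q|^2 \times \lg|D|)$ update cost arises.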
We limit the number of passes in multi-way refinement to 10. Thus, when we take the preprocessing, virtual leave gain initialization, actual gain computation, and virtual leave gain update stages into account, the total cost of multi-way refinement is $O(\sum_{q \in Q} |q|^2 \times K) + O(\sum_{q \in Q} |q|^2) + O(\sum_{q \in Q} |q| \times K) + O(\sum_{q \in Q} |q|^2 \times \lg|D|)$, which is equal to $O((K + \lg|D|) \times \sum_{q \in Q} |q|^2)$.
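The simplification of the four terms can be spelled out as follows, using only the facts that $|q| \le |q|^2$ for every non-empty query and that the number of passes is bounded by a constant:

```latex
\begin{align*}
& \sum_{q \in Q} |q|^2 K \;+\; \sum_{q \in Q} |q|^2 \;+\; \sum_{q \in Q} |q| K \;+\; \sum_{q \in Q} |q|^2 \lg|D| \\
&\quad \le \bigl(2K + 1 + \lg|D|\bigr) \sum_{q \in Q} |q|^2
   && \text{since } |q| \le |q|^2 \text{ for } |q| \ge 1 \\
&\quad = O\Bigl( (K + \lg|D|) \sum_{q \in Q} |q|^2 \Bigr).
\end{align*}
```

The preprocessing term and the update term dominate; the initialization and actual gain terms are absorbed into them.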