GA-based Data Allocation Adjustment - An Efficient Holistic Data Distribution and Storage Solut

To maximize the reservation benefit, the data allocation and reservation schedule should achieve the ideal situation, in which all data Get/Put rates are no more than the reserved rates and there is no over-reservation. If the allocated Get/Put rates vary widely over time (i.e., the rates frequently exceed and drop below the reserved rates), then the reservation saving is small according to Equation (5.3) in the data allocation and reservation algorithm. For example, Figure 5.1(a) shows the Get rates of different data items in two datacenters ($dp_1$ and $dp_2$) in two billing periods ($t_1$ and $t_2$). We assume the reservation price ratio $\alpha_{dp_j} = 60\%$, $A_1 = 100$ Gets and $A_2 = 200$ Gets for both $dp_1$ and $dp_2$, and $p^g_{dp_1} = p^g_{dp_2} = \$1$. According to Equation (5.14), we calculate $f^g_{dp_1}(A_1) = f^g_{dp_2}(A_1) = 80$ and $f^g_{dp_1}(A_2) = f^g_{dp_2}(A_2) = 60$. Thus, $R^g_{dp_j} = A_1 = 100$ yields the maximum reservation benefit. After the data allocation and reservation scheduling, the reserved amounts in both $dp_1$ and $dp_2$ can therefore be much smaller than the actual usage (i.e., 100 < 200), which prevents a high reservation benefit. In Figure 5.1(b), the ideal data allocation and reservation schedule makes the reserved amount approximately equal to the actual usage and hence enlarges the reservation benefit and reduces the cost. To keep the Get/Put rates relatively stable and thereby maximize the reservation benefit, we propose a genetic algorithm (GA) [42]-based data allocation adjustment approach. It further improves the data allocation schedule to approximate the ideal situation, and it runs after a data allocation schedule has been calculated and before the reservation amount is determined.
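The arithmetic in this example can be checked with a short sketch. Equation (5.14) is not reproduced in this section, so the benefit formula below is an assumption: it is one formula that matches the quoted values of 80 and 60, taking the Get usage to be $A_1$ in $t_1$ and $A_2$ in $t_2$ and treating $\alpha$ as the ratio of the reserved price to the pay-as-you-go price. The function and variable names are illustrative only.

```python
from typing import Sequence


def reservation_benefit(reserved: float, usage: Sequence[float],
                        price: float = 1.0, alpha: float = 0.6) -> float:
    """Saving from reserving `reserved` Gets in every billing period.

    Assumed formula (not taken from Equation (5.14)): in each period the
    reservation covers min(usage, reserved) Gets that would otherwise cost
    `price` each, and the reservation itself costs alpha * price per Get.
    """
    return sum((min(a, reserved) - alpha * reserved) * price for a in usage)


usage = [100, 200]  # A1 in t1, A2 in t2 (assumed reading of Figure 5.1(a))
benefits = {r: reservation_benefit(r, usage) for r in usage}
best = max(benefits, key=benefits.get)  # reserved amount with the largest benefit
print(benefits, best)
```

Under this assumed formula, reserving 100 Gets gives the larger benefit even though usage reaches 200 in $t_2$, which is the gap that the adjustment approach aims to close.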

GA is a heuristic method that mimics the process of natural selection and is routinely used to generate useful solutions to optimization problems. In the GA-based data allocation adjustment approach, as shown in Figure 5.3, a data allocation schedule is formed by the entries $\langle d_l, \{dp_1, \ldots, dp_\beta\} \rangle$ of each data item requested by a customer datacenter, where $\{dp_1, \ldots, dp_\beta\}$ (denoted by $G_{d_l}$) is the set of datacenters that store $d_l$. The algorithm regards each data allocation schedule as a genome string. Using Algorithm 3, it generates the data allocation schedule with the lowest total cost (the global optimal schedule), and the schedules with the lowest Storage cost, lowest Get cost and lowest Put cost (the partial optimal schedules) by assuming all data items to be Storage-, Get- and Put-intensive, respectively.
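As an illustration of this representation, a genome string can be held as a mapping from each data item to its list of datacenters. The sketch below is a minimal one under stated assumptions: the type alias `Schedule`, the helper `is_well_formed`, and the datacenter and item names are hypothetical, and the check that every item is stored in exactly β distinct datacenters is read off the notation $\{dp_1, \ldots, dp_\beta\}$ rather than taken from Algorithm 3.

```python
from typing import Dict, List

# One genome is an entry <d_l, G_dl>; a genome string (a schedule) holds one
# genome per data item. The first datacenter in each list plays the role of dp_1.
Schedule = Dict[str, List[str]]


def is_well_formed(schedule: Schedule, beta: int) -> bool:
    """True if every data item is stored in exactly `beta` distinct datacenters."""
    return all(len(dcs) == beta and len(set(dcs)) == beta
               for dcs in schedule.values())


# Hypothetical seed schedules for two data items and beta = 2.
seeds: Dict[str, Schedule] = {
    "global":  {"d1": ["dp1", "dp3"], "d2": ["dp2", "dp3"]},
    "storage": {"d1": ["dp3", "dp4"], "d2": ["dp3", "dp4"]},
    "get":     {"d1": ["dp1", "dp2"], "d2": ["dp2", "dp1"]},
    "put":     {"d1": ["dp4", "dp1"], "d2": ["dp4", "dp2"]},
}
print(all(is_well_formed(s, beta=2) for s in seeds.values()))
```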

Figure 5.3: GA-based data allocation enhancement. [Figure placeholder: four genome strings of the form $\langle d_1, \{dp_1, \ldots, dp_\beta\} \rangle \; \langle d_2, \{dp_1, \ldots, dp_\beta\} \rangle \ldots \langle d_k, \{dp_1, \ldots, dp_\beta\} \rangle$, labeled Global optimal, Storage optimal, Get optimal and Put optimal. The global optimal schedule undergoes crossover with each of the three partial optimal schedules, followed by mutation.]

To generate the children of the next generation, the global optimal schedule sequentially conducts crossover with each partial optimal schedule with crossover probability $\theta$ (Figure 5.3). For each genome of a child's genome string, either the global optimal schedule (with probability $\theta$) or the partial optimal schedule (with probability $1-\theta$) propagates its genome to the child. To ensure schedule validity, after each crossover the genomes that do not meet all constraints in Section 5.1.2 are discarded. Since each genome remains the same, we do not need to check the constraints for Get and Put SLAs. However, the Get/Put capacity of each $dp_j$ may be exceeded, so we only need to check Constraint (5.11). To avoid being trapped in a sub-optimal result, genome mutation occurs after the crossover in each genome string with a certain probability, changing it to a new genome string. In the mutation of a genome, for each data item, the $dp_1$ in $G_{d_l}$, which serves Gets, and a randomly selected $dp_k$ in $G_{d_l}$ are replaced with qualified datacenters.
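The crossover and mutation steps described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the `within_capacity` callback standing in for Constraint (5.11), and the rule that a "qualified" datacenter is any candidate not already storing the item are all assumptions. The sketch also drops a whole child when the capacity check fails, and it leaves the decision of whether to mutate (the mutation probability) to the caller.

```python
import random
from typing import Callable, Dict, List, Optional

Schedule = Dict[str, List[str]]


def crossover(global_opt: Schedule, partial_opt: Schedule, theta: float,
              rng: random.Random,
              within_capacity: Callable[[Schedule], bool]) -> Optional[Schedule]:
    """Build one child; each genome comes from global_opt with probability theta."""
    child: Schedule = {}
    for item in global_opt:
        parent = global_opt if rng.random() < theta else partial_opt
        child[item] = list(parent[item])
    # Assumption: a child that violates the capacity check is dropped entirely.
    return child if within_capacity(child) else None


def mutate(schedule: Schedule, candidates: List[str],
           rng: random.Random) -> Schedule:
    """Replace dp_1 and one randomly chosen dp_k in every genome."""
    mutated: Schedule = {}
    for item, dcs in schedule.items():
        new = list(dcs)
        k = rng.randrange(len(new))  # index of the randomly selected dp_k
        for idx in sorted({0, k}):   # index 0 is dp_1, which serves Gets
            # Stand-in for "qualified": any candidate not already storing the item.
            pool = [dc for dc in candidates if dc not in new]
            if pool:
                new[idx] = rng.choice(pool)
        mutated[item] = new
    return mutated


rng = random.Random(0)
g = {"d1": ["dp1", "dp2"], "d2": ["dp2", "dp3"]}
p = {"d1": ["dp3", "dp4"], "d2": ["dp1", "dp4"]}
child = crossover(g, p, theta=0.5, rng=rng, within_capacity=lambda s: True)
mutant = mutate(child, ["dp1", "dp2", "dp3", "dp4", "dp5"], rng)
print(child, mutant)
```

Because every genome is copied unchanged from one parent, only the aggregate capacity check needs to run on the child, which mirrors the reasoning in the text.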

After a crossover and mutation, the global optimal schedule and the partial optimal schedules are updated accordingly. To produce the new global optimal schedule, we calculate each child schedule's total cost ($C_{sum}$) according to Equation (5.12); among the child schedules and the global optimal schedule, the one with the smallest $C_{sum}$ is selected as the new global optimal schedule. Similarly, we evaluate each schedule's cost according to Equations (5.1), (5.5) and (5.6) to generate the new Storage, Get and Put partial optimal schedules, respectively. To speed up convergence to the optimal solution, the population of the next generation ($N_g$) is inversely proportional to the improvement of the global optimal schedule in the next generation. That is,

$N_g = \min\{N, \; C_{sum} N / C'_{sum}\}$, where $N$ is a constant integer serving as the base population, and $C_{sum}$ and $C'_{sum}$ are the total costs of the global optimal solution of the current and next generations, respectively. The creation of generations terminates when the maximum number of consecutive generations without cost improvement or the maximum number of generations is reached.
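The selection and termination logic can be outlined as a loop. The sketch below is illustrative and rests on several assumptions: `breed` is a hypothetical callback that performs crossover and mutation and decides how many children to produce, `total_cost` and `partial_costs` stand in for Equations (5.12), (5.1), (5.5) and (5.6), and each partial optimal schedule is assumed to be chosen among the children and the current partial optimal schedule.

```python
from typing import Callable, Dict, List

Schedule = Dict[str, List[str]]
Cost = Callable[[Schedule], float]
Breeder = Callable[[Schedule, Dict[str, Schedule]], List[Schedule]]


def run_generations(global_opt: Schedule, partial_opts: Dict[str, Schedule],
                    breed: Breeder, total_cost: Cost,
                    partial_costs: Dict[str, Cost],
                    max_generations: int, max_stall: int) -> Schedule:
    """Iterate generations until the stall limit or the generation limit is hit."""
    stall = 0  # consecutive generations without cost improvement
    for _ in range(max_generations):
        children = breed(global_opt, partial_opts)
        # New global optimum: cheapest among the children and the current optimum.
        best = min([global_opt, *children], key=total_cost)
        improved = total_cost(best) < total_cost(global_opt)
        global_opt = best
        # Assumption: partial optima are refreshed the same way, per cost type.
        partial_opts = {
            name: min([partial_opts[name], *children], key=cost)
            for name, cost in partial_costs.items()
        }
        stall = 0 if improved else stall + 1
        if stall >= max_stall:
            break
    return global_opt
```

Passing the cost functions in as callbacks keeps the loop independent of how the Storage, Get and Put costs are actually computed.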

The GA-based adjustment is executed only once, at the beginning of the reservation period $T$, before the reservation amount is determined. Although it is time consuming, its computing time is negligible compared with the long reservation period (e.g., one year in Amazon DynamoDB [1]). After each billing period $t_k$ during $T$, ES3 only needs to redo the data allocation if the new allocation schedule leads to a lower cost under the reservation already determined for $T$.
