To maximize the reservation benefit, the data allocation and reservation schedule should achieve the ideal situation, in which no Get/Put rate exceeds its reserved rate and there is no over-reservation. If the allocated Get/Put rates vary widely over time (i.e., they frequently rise above and drop below the reserved rates), then the reservation saving is small according to Equation (5.3) in the data allocation and reservation algorithm. For example, Figure 5.1(a) shows the Get rates of different data items in two datacenters ($dp_1$ and $dp_2$) in two billing periods ($t_1$ and $t_2$). We assume the reservation price ratio $\alpha_{dp_j} = 60\%$, $A_1 = 100$ Gets and $A_2 = 200$ Gets for both $dp_1$ and $dp_2$, and $p^g_{dp_1} = p^g_{dp_2} = \$1$. According to Equation (5.14), we calculate $f^g_{dp_1}(A_1) = f^g_{dp_2}(A_1) = 80$ and $f^g_{dp_1}(A_2) = f^g_{dp_2}(A_2) = 60$. Hence, $R^g_{dp_j} = A_1 = 100$ yields the maximum reservation benefit. After the data allocation and reservation scheduling, the reserved amounts in both $dp_1$ and $dp_2$ can thus be much smaller than the actual usage (i.e., 100 < 200), which prevents a high reservation benefit. In Figure 5.1(b), the ideal data allocation and reservation schedule makes the reserved amount approximately equal to the actual usage and hence enlarges the reservation benefit and reduces the cost. To keep the Get/Put rates relatively stable and thereby maximize the reservation benefit, we propose a genetic algorithm (GA) [42]-based data allocation adjustment approach. It is applied after a data allocation schedule has been calculated and before the reservation amount is determined, and it further improves the schedule so that it approximately achieves the ideal situation.
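The arithmetic of this example can be checked with a minimal Python sketch. Equation (5.14) is not restated in this section, so the function `reservation_benefit` below is a hypothetical stand-in: it assumes that the benefit of reserving an amount is the on-demand price of the usage covered by the reservation minus the discounted payment for the whole reservation, summed over the billing periods. The per-period usages of 100 and 200 Gets are likewise an assumed reading of Figure 5.1(a). Under these assumptions the sketch reproduces the values 80 and 60 quoted above.

```python
def reservation_benefit(reserved, usages, unit_price, alpha):
    """Hypothetical stand-in for Equation (5.14).

    reserved   -- reserved number of Gets per billing period
    usages     -- actual number of Gets in each billing period
    unit_price -- on-demand price per Get
    alpha      -- reservation price ratio (reserved price / on-demand price)
    """
    benefit = 0.0
    for used in usages:
        # Only usage up to the reserved amount is covered by the reservation.
        covered = min(used, reserved)
        # Saving = on-demand cost avoided - payment for the full reservation.
        benefit += covered * unit_price - alpha * reserved * unit_price
    return benefit


# Assumed usage of one datacenter in billing periods t1 and t2.
usages = [100, 200]
# Candidate reservation amounts A1 and A2 from the example.
candidates = [100, 200]

benefits = {a: reservation_benefit(a, usages, unit_price=1.0, alpha=0.6)
            for a in candidates}
best_amount = max(benefits, key=benefits.get)

for amount, value in benefits.items():
    print(f"reserve {amount}: benefit {value:.0f}")
print("best reservation amount:", best_amount)
```

Reserving 200 Gets over-reserves in the period with only 100 Gets, which is why the smaller amount wins even though it leaves half of the peak usage unreserved.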
GA is a heuristic method that mimics the process of natural selection and is routinely used to generate useful solutions to optimization problems. In the GA-based data allocation adjustment approach, as shown in Figure 5.3, a data allocation schedule consists of one entry $\langle d_l, \{dp_1, \ldots, dp_\beta\} \rangle$ for each data item $d_l$ requested by a customer datacenter, where $\{dp_1, \ldots, dp_\beta\}$ (denoted by $G_{d_l}$) is the set of datacenters that store $d_l$. The algorithm regards each data allocation schedule as a genome string. Using Algorithm 3, it generates the data allocation schedule with the lowest total cost (the global optimal schedule) and the schedules with the lowest Storage cost, lowest Get cost, and lowest Put cost (the partial optimal schedules) by treating all data items as Storage-, Get-, and Put-intensive data, respectively.
[Figure 5.3 shows four genome strings, each of the form <d1,{dp1,…,dpβ}> <d2,{dp1',…,dpβ'}> … <dk,{dp1'',…,dpβ''}>: the global optimal, Storage optimal, Get optimal, and Put optimal schedules. The global optimal schedule undergoes crossover with each of the three partial optimal schedules, followed by mutation.]
Figure 5.3: GA-based data allocation enhancement.
To generate the children of the next generation, the global optimal schedule conducts crossover with each partial optimal schedule in turn, with crossover probability $\theta$ (Figure 5.3). For each genome in a child's genome string, either the global optimal schedule (with probability $\theta$) or the partial optimal schedule (with probability $1-\theta$) propagates its genome to the child. To ensure schedule validity, after each crossover the genomes that do not meet all constraints in Section 5.1.2 are discarded. Since each genome remains the same, we do not need to check the constraints for Get and Put SLAs. However, the Get/Put capacity of each $dp_j$ may be exceeded. Thus, we only need to check Constraint (5.11). To avoid being trapped in a sub-optimal result, mutation occurs after the crossover in each genome string with a certain probability, changing it into a new genome string. In the mutation of a genome string, for each data item, $dp_1$ in $G_{d_l}$, which serves Gets, and a randomly selected $dp_k$ in $G_{d_l}$ are replaced with qualified datacenters.
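The crossover and mutation steps described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used by the system: a schedule is modeled as a dictionary from a data item to its ordered datacenter list (the first entry serves Gets), the predicate `is_valid` is a hypothetical stand-in for the check of Constraint (5.11), the callback `qualified` is a hypothetical source of datacenters that may store an item, and `mu` is an assumed name for the mutation probability. For simplicity the sketch discards a whole child that fails the validity check.

```python
import random


def crossover(global_opt, partial_opt, theta, rng, is_valid):
    """Build one child; each genome is inherited from the global optimal
    schedule with probability theta, otherwise from the partial one."""
    child = {}
    for item in global_opt:
        source = global_opt if rng.random() < theta else partial_opt
        child[item] = list(source[item])  # copy so parents stay unchanged
    # is_valid is a hypothetical stand-in for Constraint (5.11).
    return child if is_valid(child) else None


def mutate(schedule, mu, rng, qualified):
    """With probability mu, replace dp_1 (which serves Gets) and one randomly
    selected dp_k of every data item with qualified datacenters."""
    if rng.random() >= mu:
        return schedule
    mutated = {}
    for item, datacenters in schedule.items():
        datacenters = list(datacenters)
        k = rng.randrange(len(datacenters))
        for position in {0, k}:  # a set, so position 0 is not replaced twice
            options = [dc for dc in qualified(item) if dc not in datacenters]
            if options:
                datacenters[position] = rng.choice(options)
        mutated[item] = datacenters
    return mutated


# Toy schedules with two data items and two replicas each (assumed data).
rng = random.Random(0)
global_opt = {"d1": ["dp1", "dp2"], "d2": ["dp3", "dp1"]}
get_opt = {"d1": ["dp2", "dp4"], "d2": ["dp1", "dp4"]}
all_dcs = ["dp1", "dp2", "dp3", "dp4", "dp5"]

child = crossover(global_opt, get_opt, theta=0.5, rng=rng,
                  is_valid=lambda schedule: True)
mutant = mutate(child, mu=1.0, rng=rng, qualified=lambda item: all_dcs)
print(child)
print(mutant)
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when comparing the cost of children across generations.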
After a crossover and mutation, the global optimal schedule and the partial optimal schedules are updated accordingly. To produce the new global optimal schedule, we calculate each child schedule's total cost ($C_{sum}$) according to Equation (5.12); among the child schedules and the current global optimal schedule, the one with the smallest $C_{sum}$ is selected as the new global optimal schedule. Similarly, we evaluate each schedule's cost according to Equations (5.1), (5.5) and (5.6) to generate the new Storage/Get/Put partial optimal schedules, respectively. To speed up the convergence to the optimal solution, the population of the next generation ($N_g$) is inversely proportional to the improvement of the global optimal schedule in the next generation. That is, $N_g = \min\{N, C_{sum} N / C'_{sum}\}$, where $N$ is a constant integer serving as the base population, and $C_{sum}$ and $C'_{sum}$ are the total costs of the global optimal solutions of the current and next generations, respectively. The creation of new generations terminates when the maximum number of consecutive generations without cost improvement or the maximum number of generations is reached.
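The update of the global optimal schedule and the two termination conditions can be sketched as the following Python loop. It is a simplified illustration under assumptions: `make_children` and `total_cost` are hypothetical callbacks standing in for the crossover/mutation step and for Equation (5.12), `max_generations` and `max_stall` are assumed names for the two limits, and the partial optimal schedules and the population size $N_g$ are left out for brevity.

```python
def run_generations(initial, make_children, total_cost,
                    max_generations, max_stall):
    """Keep the cheapest schedule seen so far and stop after max_generations
    generations or max_stall consecutive generations without improvement."""
    best = initial
    best_cost = total_cost(best)
    stall = 0
    generations = 0
    while generations < max_generations and stall < max_stall:
        # The current global optimal schedule competes with its children.
        candidates = make_children(best) + [best]
        challenger = min(candidates, key=total_cost)
        challenger_cost = total_cost(challenger)
        if challenger_cost < best_cost:
            best, best_cost, stall = challenger, challenger_cost, 0
        else:
            stall += 1  # no cost improvement in this generation
        generations += 1
    return best, best_cost, generations


# Toy problem (assumed): a "schedule" is an integer, its cost is the distance
# to 7, and the children of s are its two neighbours.
result = run_generations(
    initial=0,
    make_children=lambda s: [s - 1, s + 1],
    total_cost=lambda s: abs(s - 7),
    max_generations=100,
    max_stall=2,
)
print(result)
```

In the toy run the loop improves for seven generations and then stops after two further generations without improvement, well before the generation limit is reached.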
The GA-based adjustment is executed only once, at the start of reservation period $T$, before the reservation amount is determined. Though it is time consuming, its computing time is negligible compared to the long reservation period (e.g., one year in Amazon DynamoDB [1]). After each billing period $t_k$ during $T$, ES3 needs to update the data allocation only if the new allocation schedule leads to a lower cost under the reservation determined for $T$.