Evaluation - Towards auto-scaling in the cloud: online resource allocation techniques

In the evaluation we want to answer following questions. First, what is the impact of parallel learning with assumption on the learning speed? Second, how does VScaler perform resource assignment under dynamic workload? To answer these questions we divided the evaluation in two parts. Each part addresses corresponding question.

For the evaluation we build a test bed. The testbed for our experiments is hosted on quad- core Xeon 2.66GHz with 16 GB memory, 100 Mbps network and Ubuntu 12.04 running Xen 4.1. We use RUBiS [121] benchmark to evaluate VScaler. RUBiS is widely adopted interactive application benchmark. In the experiments PHP version of the benchmark is applied. It consists of web front-end and database back-end. We run Apache 2.2 as a web server and MySQL as a database server. The web-server and the database run inside VMs with 64-bit CentOS 6.3. Throughout all experiments only web server is scaled. The database VM is over-provisioned. We present multi-tier application provisioning in chapter 6.

5.5.1 CONVERGENCE SPEEDUP

To show learning speed-up provided by VScaler we evaluated two Q-learning algorithms. The first algorithm uses parallel learning with assumption. The second one is a standard Q-learning approach, where the agent after each observation updates only one state-action pair. To make clear comparison we used the same state-action formalism in parallel learning with assumption as for standard Q-learning approach. Therefore we exclude prediction mechanism. Instead each state was assigned fixed workload value. We also did not use special initialization policy.

0 100 200 300 400 0 1 2 3 4 5 6 7 8 9 10 Time (min) T ransitions lear ned VScaler RL Standard RL

Figure 3:Transitions learned

0 5 10

95th percentile response time (ms)

VScaler RL Standard RL

Figure 4:Application 95% response time

In each approach the agent learns the environment using a standard policy. It means that with a probability of ε = 0.05 the agent takes a random action. In the exploitation phase the agent takes an action that gives a higher utility. In the experiment we run a constant workload. 1000 clients were emulated using the RUBiS benchmark. Reconfiguration is performed every 10 seconds.

In figure3 we show the number of transitions learned. During the first 3 minutes VScaler RL learns 300 transitions, while Standard RL only 2 * 60/ 10 = 12 transitions. VScaler RL has a higher learning rate, but the most important part is the quality of resource allocation policy obtained by each approach. In figure4 presented response time delivered by the web server under the control of the evaluated learning models. We ran experiment for 40 minutes. For the experiment we set desired response time to be below 20 ms. Both learning approaches keep 95% of the response time below 20 ms.

The standard RL model has only information about actually visited transitions, while VScaler RL in addition to visited transitions knows about the impact of alternative transitions. The additional information allows VScaler RL to quickly converge to the state with minimal amount of required resources. In figure5we present the cost of resources in each state. We assume that in a flexible bundles resource model that allows to dynamically modify individual resource assigned to the VM. For the experiment we took prices from CloudSigma. The figure shows that VScaler RL needs 3 minutes to find the optimal VM size.

VScaler RL quickly adapts to the workload, while Standard RL needs more time to learn the environment. One has to notice that both approaches do not violate the SLA, but VScaler RL in comparison to Standard RL achieves the performance goal for the lower cost.

5.5.2 REAL WORLD SCENARIO

To evaluate VScaler performance in real cloud environment with dynamic resource demands variations we instrumented RUBiS client emulator to modulate request rate of RUBiS benchmark. The RUBiS client emulator reads clients request rate from trace file. The trace consists

0.04 0.06 0.08 0.10 0 5 10 15 20 25 30 35 40 Time (min) Cost ($) VScaler RL Standard RL

Figure 5:Average costs: Standard RL vs VScaler RL. The greater size of VM in terms of CPU and memory, the greater the cost

of per-minute workload intensity observed during WorldCup 98 [134]. We used 6 hour trace starting at 1998-05-10:03.00.

The goal of this experiment is to measure performance and resource utilization between peak and dynamic resource allocation schemes. Below we present description for each of the schemes. The peak one represents fixed bundle model. We configured VM template for the peak allocation scheme with 1536 MB of memory and 1 virtual CPU with cap value of 50. The template represents Amazon EC2 m1.small instance. To evaluate dynamic schemes we run VScaler in 4 different configurations. In the first two configurations we use fixed step allocation. From each state of the model only 3 action is allowed (add, nop and remove). Hence, CPU and RAM of the VM can be modified only by fixed value. For fixed step allocation we defined small and big step. The other two configurations modify resource with respect to current state capacity. We consider dynamic allocation scheme, because it is not trivial task to find right step size for resource allocation, when an application resource demand changes dynamically. Too big step size leads to over-provisioning, while small step causes resource saturation and leads to SLA violation. In our experiment we set SLA to 20 ms.

• peak allocation • dynamic allocation

– fixed step allocation

* small step (CPU step 2, memory step 64 MB) * big step (CPU step 5, memory step 128 MB) – flexible step allocation

* scale up 100%, scale down 100% * scale up 100%, scale down 20%

Figures 6 and 7 present the total amount CPU and RAM allocated to web server VM for the workload trace. Peak allocation has highest CPU and memory allocation. It demonstrates how many resources are wasted if web application runs in fixed instance type. The dynamic scheme configurations have significantly lower CPU and memory allocation. Configurations with fixed step allocate slightly less resources than configurations with flexible step. The reason

1 0.28 0.29 0.29 0.32 0.00 0.25 0.50 0.75 1.00 Nor maliz ed CPU usage

EC2 small instance VScaler dyn (100, 20) VScaler big step

VScaler small step VScaler dyn (100, 100)

Figure 6:Amount of allocated CPU power

1 0.36 0.38 0.39 0.42 0.00 0.25 0.50 0.75 1.00 Nor maliz ed RAM usage

EC2 small instance VScaler dyn (100, 20) VScaler big step

VScaler small step VScaler dyn (100, 100)

Figure 7:Amount of allocated memory

is that flexible step allocation has more options in scaling. Flexible step configuration with smaller scaling down factor has largest amount of allocated resources among dynamic scheme configuration. It is more conservative on releasing resources.

Figure 8shows 95% response time. Two configurations violate SLA requirements. As we can see fixed step scaling with smaller CPU allocation size violated SLA, while scheme with bigger CPU allocation step satisfied SLA. The result shows that it is difficult to find ’right’ step size when there is no knowledge about application running inside the VM. Another interesting observation is that one of dynamic step allocation schemes also violates SLA. Scheme with 100% scaling down performs aggressive capacity management. In exploration phase it can sharply drop VM capacity by removing a half of assigned resource. However, if after reconfiguration workload increases, then the application cannot provide desired response time. To improve quality of the model one can limit scale down action as we did for other configuration with flexible step. Alternatively, aggressive de-allocation actions during exploration phase can be prohibited.

The evaluation results show that fixed step allocation achieves good performance, but only if allocation step has a proper size. However, one needs to find ’right’ one. Configuration with flexible step allocation can dynamically decide how many resources to assign.

In document Towards auto-scaling in the cloud: online resource allocation techniques (Page 71-74)