CloneScale: Distributed Resource Scaling for Virtualized Cloud Systems.

(1)

ABSTRACT

SUBBIAH, SETHURAMAN. CloneScale: Distributed Resource Scaling for Virtualized Cloud Systems. (Under the direction of Dr. Xiaohui (Helen) Gu.)

Infrastructure-as-a-Service (IaaS) cloud systems often use the virtualization technology to

share resources among different users and provide isolation among uncooperative users. We can

use a resource cap to limit the resource consumption of one virtual machine (VM). However,

the resource requirements of an application are rarely static, varying with workload intensity

and mix changes. Many existing techniques use machine learning, control theory, or queuing

theory to determine the application resource demand on the fly. When the local resource cannot

satisfy a VM’s resource requirement, previous work has employed live VM migration to move

the VM to a host with sufficient resources. However, it is possible that not a single host can

satisfy the resource requirement of the resource-intensive VM. Under those circumstances, it is

desirable to perform distributed resource scaling that can aggregate the resources of multiple

hosts to serve one application.

In this thesis, we present CloneScale, a system that leverages VM cloning to dynamically

distribute the workload of one application among multiple hosts. The newly cloned VM will be

the replica of the parent VM including its memory and disk state, yet they act as two

indepen-dent VMs. VM cloning is chosen to reduce the time taken for the VM to resume processing.

CloneScale provides load balancing among all the cloned VMs with minimal application

knowl-edge. Besides resource depletion, progress rate and SLO violations can also be used to trigger

VM cloning. We have implemented a prototype of CloneScale over the Kernel-based Virtual

Machine (KVM) platform and shown that we can effectively clone an application VM and start

(2)

c

(3)

CloneScale: Distributed Resource Scaling for Virtualized Cloud Systems

by

Sethuraman Subbiah

A thesis submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Master of Science

Computer Science

Raleigh, North Carolina

2012

APPROVED BY:

Dr. Frank Mueller Dr. Xiaosong Ma

(4)

DEDICATION

(5)

BIOGRAPHY

Sethurman Subbiah was born in a town called Salem, Tamil Nadu, India and brought up in

a huge and beautiful city called Chennai. He recieved his Bachelor of Technology in

Informa-tion Technology from Sri Venkateswara College of Engineering (Affiliated to Anna University),

Chennai, India in May 2009. With the defense of this thesis, he will recieve the Master of

Sci-ence in Computer SciSci-ence from North Carolina State University in December 2011. He intends

to start his full-time employment at NetApp Inc., Sunnyvale after graduation. His research

(6)

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Xiaohui (Helen) Gu for her invaluable guidance in

performing research and writing this thesis. Furthermore, I would like to thank my advisor for

giving me such a wide exposure in this field. I thank Dr. Xiaosong Ma and Dr. Frank Mueller

for their kind consent to serve on my thesis committee. I am extremely thankful to all DANCE

research group members especially to Zhiming Shen, Yongmin Tan and Hiep Nguyen for their

advice and help during my research.

I would also like to thank all my friends for their help, motivation and encouragement

during my graduate study: Arun Natarajan, Anand Parthasarathi, Dinesh Radhakrishnan,

SriKrishna Gopu, Vivek Devarajan, Sudharshan Prasad, Vijay Shanmugasundaram, Koushik

Krishnakumar, Manikandan Sivanesan and Gopinath Chandrasekaran. Finally, my deepest

gratitude goes to my father Mr. S. Subbiah, my mother Dr. S. Meenakshi, my brother S. Vijay

(7)

TABLE OF CONTENTS

List of Figures . . . vi

Chapter 1 Introduction . . . 1

Chapter 2 Background. . . 4

2.1 Multi-Tenant Cloud Systems . . . 4

2.2 CloudScale : Elastic Resource Scaling for Multi-Tenant Cloud Systems . . . 5

2.2.1 Fast under-estimation correction . . . 5

2.2.2 Online adaptive padding . . . 5

2.2.3 Scaling conflict handling . . . 6

2.2.4 Predictive frequency/voltage scaling . . . 6

2.3 Apache Hadoop Mapreduce . . . 6

2.3.1 Hadoop Architecture . . . 7

2.3.2 HDFS Architecture . . . 7

2.4 RUBiS - Rice University Bidding System . . . 8

Chapter 3 System Design . . . 9

3.1 Overview . . . 9

3.2 VM cloning . . . 11

3.2.1 Hot Cloning . . . 11

3.2.2 Cold Cloning . . . 12

3.3 VM Reconfiguration . . . 13

3.3.1 IP Reconfiguration . . . 13

3.3.2 Post-Cloning Application Reconfiguration . . . 14

3.4 Prediction-Driven Cloning . . . 15

3.4.1 Prediction . . . 15

3.4.2 Predictive cloning . . . 16

3.4.3 Predictive Down-Scaling . . . 18

Chapter 4 Experimental Evaluation . . . 19

4.1 Experimental Setup . . . 19

4.2 Prediction . . . 20

4.3 Cloning . . . 22

Chapter 5 Related Work . . . 40

Chapter 6 Conclusion and Future work . . . 46

(8)

LIST OF FIGURES

Figure 3.1 CloneScale architecture Diagram . . . 10

Figure 4.1 Comparison between the predicted and actual CPU trace . . . 21

Figure 4.2 Prediction accuracy under different lead time . . . 21

Figure 4.3 Time series of SLO violation rate for different cloning schemes . . . 24

Figure 4.4 Impact of cloning on the cloned VM and co-allocated VM . . . 25

Figure 4.5 Comparison between time taken for cold cloning and hot cloning . . . 27

Figure 4.6 Comparison of average delay after hot and cold cloning of web server and MySQL database server . . . 28

Figure 4.7 Comparison between hot and cold cloning of a web server . . . 29

Figure 4.8 Comparison between hot and cold cloning of a MySQL database server . 31 Figure 4.9 Cold cloning warmup phase for MySQL database server . . . 32

Figure 4.10 Comparison between completion rate of hadoop map node without cloning and with hot cloning . . . 34

Figure 4.11 Comparison of hot and cold cloning of a hadoop map node . . . 35

(9)

Chapter 1

Introduction

Cloud computing has emerged as a very popular technology among startups and new developers

with its pay-as-you-go model. It provides a hassle free and cost-effective solution mainly because

of cloud elasticity. Cloud elasticity provides the capability to dynamically allocate and reclaim

resource in response to the changes in workload. A particular Facebook application became

very prominent and the user base increased from 25,000 to 250,000 in three days with signing-up

20,000 new users per hour at peak [2]. As the scale of applications and their hosting environment

expands, the complexity and overhead involved in managing the cloud environment increases.

It creates the need for automating dynamic VM resource allocation.

Amazon EC2 provides an auto scaling feature [1] where new VMs can be added or removed

dynamically based on the user defined policy. For example, adding a new node when the CPU

usage exceeds above 90% and removing a node when the CPU usage goes below 10%. It takes

several minutes to start a VM from the base image, followed by the time taken to set up the

environment and start the application. Since the VM is freshly booted, the application will

require a warm up time to populate the cache. All together, the time taken for the new VM to

work at its maximum efficiency is in minutes. Moreover, the user is burdened with the process

of setting up the environment, starting the application, and deciding the scaling policy.

(10)

signifi-cantly reduce the time taken for the new VM to start servicing requests at its peak efficiency.

We also support prediction-driven cloning, where the cloning operation can be started before

the VM resource demands increase beyond the resource available on that host. This helps to

reduce the application’s SLO violation time. Our system is intended to perform with least

human intervention and minimal application knowledge.

To design such a system, we need to address three challenges: i) performing efficient cloning,

ii) making the application inside the VM compatible with cloning, and iii) when to perform

cloning.

In order to perform efficient cloning, we present two variations of cloning mechanisms,

namely hot cloning and cold cloning. In hot cloning, we stop the parent VM temporarily, save

all its memory content to a state file, and use this state file to create a clone. Hot cloning can

propagate the memory state of the parent VM to the cloned VM. In cold cloning, we create

a new cloned VM from an existing template. This template is created before the application

starts processing any requests. Cold cloning can be used for applications whose memory and

disk state do not change a lot during runtime. The decision on which cloning mechanism to

use is made dynamically during runtime.

To make the application compatible with cloning, we present a generic user interface. The

application user is expected to provide minimal information about the application running inside

the VM and a script to re-configure the application after cloning. In order to decide when to

trigger cloning, we present a scaling policy for the user. We use a prediction model for every VM

to predict the future resource demands of that particular VM. When the predicted resource

demand exceeds the residual resources available on that host, we trigger cloning. Stateless

application does not suffer any loss of information when a node is terminated. Thus, when

we predict the resource usage of a VM running stateless application to go below a minimum

threshold, we terminate the unnecessary VM.

We have implemented a prototype of CloneScale on top of the Kernel-based Virtual Machine

(11)

hot cloning with full state propagation in tens of seconds. We have evaluated our work against

(12)

Chapter 2

Background

In this chapter, we introduce the background for the CloneScale work. Our current work aims

at dynamically distributing the workload of one application among multiple nodes using VM

cloning and load balancing techniques. Hadoop MapReduce and RUBiS are the two real world

applications we use to evaluate our work. This chapter aims at providing good background

knowledge on these important topics.

2.1 Multi-Tenant Cloud Systems

The pay-as-you-go model has been the driving force of cloud computing. Cloud users, called

tenants, can lease computing resources and pay only for what they use. Multi-tenant cloud

systems let multiple users share one common cloud infrastructure, and also provide various

features such as automatic scale-up and scale-down, and resource isolation among co-allocated

users. Multi-tenant cloud systems are appropriate for startups and new developers to kick start

their applications by providing a more cost-effective solution than buying physical computing

(13)

2.2 CloudScale :

Elastic Resource Scaling for Multi-Tenant

Cloud Systems

CloudScale performs prediction driven resource scaling in a multi-tenant cloud system. In order

to provide performance isolation, CloudScale aims at setting a suitable resource cap for VMs.

A low cap may lead to SLO violations and a generous cap may lead to wastage of resources.

Thus, Cloudscale monitors the resource usage of a domain, use prediction algorithms to predict

the future resource usage and set a suitable cap.

2.2.1 Fast under-estimation correction

Achieving 100 percent prediction accuracy is hardly possible. If the predicted CPU value is lower

than the real demand, the newly set CPU cap will not be sufficient for the application. Under

provisioning causes undesirable performance impact. Under provisioning is detected by the

resource pressure metric or SLO feedback from the application. However, the exact requirement

is unknown because the application cannot consume more resource than the resource cap. To

address this problem, CloudScale raises the resource cap exponentially until it overcomes under

provisioning.

2.2.2 Online adaptive padding

To avoid underestimation errors, CloudScale provides a mechanism called proactive padding

scheme. Based on our observations, the underestimation errors are often caused by bursts in

workload.

CloudScale employs burst based padding to accommodate bursts. The burst pattern is

extracted using signal-processing techniques such as FFT (Fast Fourier Transform). The

fre-quency of burst appearance (burst density) is calculated from the burst pattern. The padding

value is calculated based on the burst density.

(14)

Cloud-Scale calculates the padding value based on the degree of deviation of recent predictions from

the actual resource usage values. A sufficient padding will ensure less under prediction in the

future and also provides enough resources.

2.2.3 Scaling conflict handling

In a multi-tenant cloud system, a host may contain multiple VMs. Each VM will try to scale

up or down based on the predicted value. Thus, it is normal to encounter a scaling conflict

when the total resource demand exceeds the capacity. CloudScale aims at handling scaling

conflicts among co-allocated applications with minimal SLO impact. Long-term prediction

lends a helping hand in predicting the conflict early. A conflict prediction model is built to find

out the time of occurrence, intensity, and duration of conflict. A suitable solution is achieved

by comparing the SLO penalties of ignoring a few scale-up requirements by applications with

migrating VMs out of the box.

2.2.4 Predictive frequency/voltage scaling

Cloudscale couples VM CPU scaling and dynamic voltage and frequency scaling (DVFS) to

convert unused resources into energy savings. For example, if the resource demand prediction

model predicts the total CPU usage of the host as 50%, we can half the operational frequency

and double the caps of all VMs. Thus, we slow down the CPU to save energy without affecting

the application SLO.

2.3 Apache Hadoop Mapreduce

MapReduce is a programing paradigm used for processing huge amount of data in parallel.

Apache Hadoop is a MapReduce implementation in Java, heavily used in real world by Yahoo,

Facebook etc. A MapReduce program consists of two phases called Map phase and Reduce

phase. The input data is divided into chunks for processing by map tasks. The Map function

(15)

interme-diate results are produced, a split function (essentially a hash function) partitions the records

into disjoint buckets by applying the hash function to the keys of the records. These map

buckets are then written to the local disk. A number of map instances would be running on a

computing cluster. The second phase is known as the Reduce phase. Records with the same

hash value are sent to the same reducer. Each reducer processes the record assigned to it and

writes the record to the output file. For example, Wordcount is one of the Hadoop applications

we use. It counts the number of appearances of each word in a collection of text files.

2.3.1 Hadoop Architecture

Hadoop framework follows master/slave architecture. The master node consists of jobtracker

and namenode. The jobs are submitted to the jobtracker. The jobtracker manages the

sub-mitted jobs and schedules map and reduce tasks with the knowledge of the data location. The

tasktrackers at the slaves perform the computation.

2.3.2 HDFS Architecture

Hadoop distributed file system is a distributed file system designed based on the Google File

System. The features of HDFS include storing huge amount of data, providing reliability and

scalability.

The master holds the namenode, which maintains metadata of the entire filesystem.

Na-menode performs the bookkeeping while the original information is stored in datanodes. HDFS

provides fault tolerance through data replication. The user can configure the replication factor.

Different data nodes host the replicas. There can be only one master node at any instant.

Thus, master will be a single point of failure. Since the participation of namenode is less, the

(16)

2.4 RUBiS - Rice University Bidding System

We use RUBiS as an auction-site benchmarking tool. This benchmark helps us validate ourselves

with the real-world workloads. Several versions of RUBiS are implemented including PHP, Java

servlets and EJB version. Our experiments make use of the PHP version. The PHP version

consists of an Apache Webserver and a backend MySQL Database server. The client will be

able to establish a separate HTTP connection at the beginning of a session and close it at the

end. The client can perform various functionalities such as browse items, bid for items, buy

items and sell items. RUBiS provides two types of workloads: browsing, and bidding mix. The

browsing mix only includes read only interactions. In contrast, the bidding mix provides both

(17)

Chapter 3

System Design

In this chapter we present the overview of our approach, and then we introduce the two VM

cloning techniques, followed by the generic user interface that we provide to configure the

application after cloning, and finally we will provide information about the CloneScale policies.

3.1 Overview

CloneScale has two important components, the CloneScale master and CloneScale slave, and

is built on top of the KVM hypervisor. The CloneScale slave resides within each host in the

cloud system, providing VM manipulation and monitoring functions. The CloneScale master

has a global view of the entire system and performs cloning based on requests from CloneScale

slaves. Figure 3.1 shows the overall architecture of the CloneScale system.

CloneScale slave uses a monitoring tool based on libvirt to monitor the guest VMs resource

usage from the host OS. The monitored metrics include CPU consumption, memory allocated,

network traffic and disk I/O statistics. The resource usage time series are fed into an online

resource demand prediction model that uses a hybrid prediction model to predict the future

resource demands. Prediction helps us foresee the resource requirement and trigger cloning if

necessary.

(18)

Host n Host 2

Host 1

VM + Application

Application Reconfiguration

Module

Linux + KVM

CloneScale Slave Resource Monitor Prediction Module Clone Scale Policy CloneScale Master Cloning Module Clone Co-ordinator Resource usage time Series Predicted resource demands Clonning Request Cloning decision Perform Cloning

Figure 3.1: CloneScale architecture Diagram

than the resource available on the host. The CloneScale master then creates a clone on the

next available free host. By doing so, the application will be distributed among different hosts,

enabling additional capacity to handle the anticipated load increase. After cloning, CloneScale

modifies the newly cloned VM and performs some application specific changes to seamlessly

scale the application.

Stateful applications cannot be arbitrarily terminated, while stateless applications will not

suffer any loss of information. Thus, with additional information that the application is

state-less, we can provide a much more efficient solution. When the predicted resource demands are

less than the minimum threshold, we can turn off unnecessary VMs.

(19)

3.2 VM cloning

CloneScale is a system that leverages VM cloning to dynamically distribute the workload of one

application among multiple hosts with minimal SLO impact. In order to make cloning more

efficient, we provide two variants of cloning in this thesis: hot cloning and cold cloning.

3.2.1 Hot Cloning

To initiate cloning, the CloneScale slave sends a message to the CloneScale master, containing

the keyword “clone request”, the target VM’s domain name and IP address and the host IP. On

receiving a cloning request, the CloneScale master stops the VM temporarily and transfers its

memory contents to an external state file using the Virsh save command. Virsh is a management

user interface based on libvirt that can be used to create, pause, save, restore, list, and shutdown

domains. To extract the VM’s disk state, CloneScale leverages the QEMU Copy On Write

(QCOW2) image format feature to create an incremental QCOW2 image based on the parent

VM’s disk image. QCOW2 is the native image format of QEMU, and creating an instantaneous

incremental copy-on-write image is the major advantage of qcow2 image format. After creating

an incremental copy, QEMU will treat the base image as read-only, and will store all new writes

to the incremental QCOW2 image. Since the base image is changed to read-only format, it

cannot be further used to serve the parent VM. Thus while cloning, we create two incremental

QCOW2 images from the base image to restore the parent and the cloned VM. The incremental

images should always be linked with the base image, and therefore they must remain together

on a shared storage. For this purpose, we store all our images in a network-attached storage

(NAS). We can create an incremental image over an incremental image, but doing so in this

recursive manner will deteriorate the performance of the VM.

The CloneScale master restores the parent VM from the memory state file and an

incre-mental QCOW2 disk image using the Virsh restore interface. However, the Virsh save interface

saves the configuration xml inside the memory state file. The xml configuration in the state file

(20)

incremental disk image, we have to modify the xml configuration. The master uses the Virsh

save-image-dumpxml to obtain the xml saved in the memory state file header. The master

modifies the xml and restores the parent VM using the memory-state file and the modified xml.

The next available free host is chosen to host the newly cloned VM. Since the real requirement

is unknown, it is safe to choose a free host. The xml configuration file is modified to point

to a new incremental disk image. The cloned VM is restored with the memory-state file and

the modified xml file using the Virsh restore interface. The main advantage of hot cloning is

the copying of state from the parent VM to cloned VM. This avoids the initial warm-up time

required for the cloned VM to start performing at its peak efficiency.

3.2.2 Cold Cloning

In cold cloning, a new VM is cloned from a VM template instead of a running VM. We perform

cold cloning for VM hosting applications whose state can be determined in advance. The VM

template consists of a QCOW2 base image and an existing state file. CloneScale master creates

this template when the application was deployed in the cloud for the first time. The template

is created in a process similar to hot cloning mentioned in section 3.2.1 but this template is

created before the application starts processing any requests and it is stored in NAS. Thus,

we can neglect this downtime observed when calculating the efficiency of this mechanism, as it

does not affect the application’s SLO. To initiate cloning, the CloneScale slave sends a cloning

request as we mentioned in section 3.2.1. The CloneScale master responds to this cloning

request by choosing a suitable host and deploys a new VM at that host. To deploy a new VM,

we make an incremental QCOW2 image from the base image in the VM template and use the

Virsh restore interface to start a new VM with the help of the newly created incremental disk

image and exiting VM state image. Since we are creating the VM with the state of the OS

and the application before processing any requests, the newly created clone will only hold a

minimal state. However, on the other hand, the new VM can be created in a very short period

(21)

3.3 VM Reconfiguration

3.3.1 IP Reconfiguration

Since the newly cloned VM will be the exact copy of the parent VM, it will hold the same IP

address as its parent. Once the cloned VM is restored, a startup script performs the following

actions in sequence: i) Disable the VM network using IPtables, ii) Mount the cd in the standard

mount point /cd, iii) Execute the network configuration script available inside the cd, and iv)

Enable the VM network using IPtables.

To make sure this startup script is executed immediately after the VM restore, we SSH into

the VM and execute this startup script before initiating the cloning process. The startup script

has a small sleep period (five seconds in our experiments) before configuring the network. This

sleep period is to ensure that the network configuration starts after the VM cloning. We have to

note that the startup script will be executed in both the parent and cloned VMs, but we ensure

that no network changes are made in the parent by comparing the VM’s hostname with the

hostname provided through the cd and configuring the network only when they are different.

When the memory state was saved, there may be some on-going processes that may try to

communicate to a different machine with the old IP, which may cause some inconsistencies. For

this reason, we disable the network before the cloned VM gets a new IP address.

After restoring the cloned VM, the master has to pass a new MAC address into the VM.

However, the cloned VM cannot be accessed using the network. Hence, we created a virtual CD

ISO file containing the new mac address and hostname along with the network configuration

script and application post-configuration script. The virtual cd is added using the VM

config-uration xml before the cloned VM is restored. The network configconfig-uration script changes the

VM’s mac address internally, restarts the network service to obtain a new IP address using the

DHCP protocol and triggers the application post-configuration script. The functionality of the

network configuration script can be included within the startup script but following this

(22)

the network. An example to support this statement is available in the forthcoming section.

3.3.2 Post-Cloning Application Reconfiguration

After attaining a new IP address, the cloned VM is available to use. However, the application

running inside the VM will continue processing the requests that were started before cloning.

Since we clone a running VM, it is natural to expect the application to be processing requests

before cloning initiation. We provide a user interface which can be used to specify the required

information regarding the application. We provide a generalized user interface in the form of

an xml file. This xml file poses the following questions to the user: i) Can we clone this VM?,

ii) Is this a stateless application?, iii) What type of application is this?, iv) What is the path to

post-configuration script?, v) Approximate time taken for application reconfiguration, vi) Does

the application have its own load balancer?, vii) Which port has to be load balanced? and viii)

What will be the type of incoming requests to the load balancer ?

As soon as a new VM is created, CloneScale slave parses this XML file and populates the VM

settings file and creates one file per tier. The CloneScale master reads this file before performing

a cloning operation. The VM settings file will hold information that aids the master in deciding

whether cloning is viable, if yes, what type of cloning should be performed on that VM (hot

or cold). There are few applications that cannot be cloned, for example, we cannot clone a

hadoop master node. There can be only one master node in a Hadoop cluster. CloneScale also

provides a haproxy load balancer to every application. The user to work with his application can

configure this load balancer at the beginning. Besides the xml, we also require an application

post cloning configuration script. The application post cloning configuration script provided

by the user is expected to handle requests that were started by the parent VM before cloning,

making changes to the application which has modules that are bound to the host-name or IP

address of the VM and adding the cloned VM’s IP to the load balancer.

To the best of our knowledge, the most commonly used cloud applications such as web

(23)

balanced by adding the cloned VM’s IP to the load balancer for balancing a specific incoming

port, for example port 80 for apache web server and port 3306 for MySql database server. For

certain applications like Hadoop, it has its own load balancer called the job tracker. In Hadoop,

the VM hostname is considered as a unique identifier, thus the previously running task tracker

and data node processes will hold the host name of its parent. These processes has to be

stopped before the VM gets a new IP and host name, followed by starting the task tracker and

datanode with the new hostname. Once the task tracker is started, it automatically contacts

the job tracker and adds itself to the existing cluster. As mentioned in section 3.3.1, having a

network configuration script separated from the startup script gives us the flexibility to execute

the post-configuration script before or after obtaining a new IP address and host name. At

present, we have proved our CloneScale can work successfully with apache web server, MySQL

database server and apache Hadoop. The MySQL database that we use in our experiments

only caters read-only requests. Database writes and other workload patterns or applications

that alters the state of the application is not supported by CloneScale at present.

3.4 Prediction-Driven Cloning

3.4.1 Prediction

During VM cloning, the VM is stopped to transfer the VM memory, register values and cache

to a state file. The parent VM we are trying to clone has a downtime. Thus, VM cloning is not

a very cheap process. Using a reactive approach will trigger cloning when the VM is already

under provisioned. This will lead to huge SLO violations. CloneScale uses resource prediction

algorithm to proactively trigger cloning before the VM resource demands exceed the residual

resource on the host.

CloneScale uses two complementary resource prediction techniques, signatudriven

re-source prediction and state-driven rere-source prediction to trigger cloning. Initially, the

(24)

model uses the signature to predict the future resource requirement. Otherwise, the prediction

model uses discrete-time Markov chain to predict the future resource demands.

CloneScale slave resides on the host OS and monitors the resource usage of the guest VMs.

At present, CloneScale mostly performs scaling based on CPU usage. The metric used for

prediction should directly correlate to the measure of client’s service level objective (e.g. average

delay) and we believe CPU usage is one such metric. CloneScale can also use different metrics

based on the application needs such as application progress rate for Hadoop. The length of the

time we wish to predict is the sum of the time taken for cloning a VM and time taken for the

cloned VM to continue processing requests. This represents the time between clone initiations

and when the new VM will be useful. The training window sizetw differs between experiments,

but mostly we use a training window size between 100 and 2000 seconds. Later, the model is

retrained every eight seconds usingtw recent samples and a new prediction is made every eight

seconds. The time taken to clone a VM depends upon the memory size of the VM and does

not show much variation. We have already calculated the average time taken for hot and cold

cloning of VMs with different memory size. We use this average cloning time to define our look

ahead window size or how far we need to look into the future, which helps us clone the VM

before it gets overloaded.

3.4.2 Predictive cloning

CloneScale slave employs resource monitoring and prediction to foresee the resource demands

of a VM. We use prediction to trigger cloning before the system is underprovisioned. The look

ahead window dictates how long we would like to predict the future. The duration of a

look-ahead window is set as twice the time taken for cloning a VM. The first half of the look look-ahead

window denotes the cloning period and the second half of the look ahead window denotes the

period after cloning. Thus, we will be able to observe the system’s resource demand after the

cloning period and decide whether the newly cloned VM will aid in reducing the SLO

(25)

threshold, CloneScale slave triggers cloning. This maximum threshold is a configurable

param-eter, it can be configured based on the SLO needs. Setting a very high threshold gives very less

importance to SLO and effectively reduces the number of clone requests. On the other hand,

setting a very low threshold will trigger cloning too often, which will affect the application SLO

and also overprovisions resources. Thus, it is really important to choose the right threshold.

Once the cloning is triggered, the CloneScale master choses the next available free host as

the cloning destination and parses the settings file corresponding to the domain name. The

user’s information regarding the application is retrieved by the CloneScale slave and it creates

a settings file which helps the CloneScale master to identify whether: 1) the VM can be cloned,

2) which cloning mechanism should be used (hot or cold), 3) What is the application type? 4) Is

the VM using our load balancer?, 5) which port has to be load balanced. Since CloneScale slaves

reside on all hosts and are independent, there can be multiple scaling requests to the master.

The master should accept only one scaling request from a tier and ignore rest of the requests.

For example, let there be three webservers servicing requests. When the workload gets highly

intensive such that the three webservers will not be enough to satisfy all requests. In this case,

the CloneScale slaves running in the host OS will predict the total resource available on that

host is insufficient. All three hosts will trigger cloning. In this case, CloneScale master creates

just one clone based on the first received request and sends a message to the load balancer

to distribute load to the newly created clone. If four servers cannot handle the request again

another clone will be initiated. Creating a clone for each request will stop all three running

VMs simultaneously, which will make the application SLO suffer more.

Immediately after the parent VM is restored, it will experience few early spikes till the newly

cloned VM starts servicing requests. These spikes may invoke another cloning operation, which

is a false call but it stops the parent VM once again to save the memory state and create a new

clone leading to further SLO violations. Thus after the completion of a cloning operation, we

(26)

ip address, sleep time in the startup script and the approximate application reconfiguration

time. Any further cloning requests from the same tier during that locked period is ignored.

3.4.3 Predictive Down-Scaling

For stateless applications, besides cold cloning, CloneScale also performs distributed down-scale.

When more than 3/4th (a heuristic value) of the look-ahead window is below the minimum threshold, then down scaling is triggered. During down-scaling, IP address of the first VM to

trigger cloning is removed from the list of load sharing nodes at the load balancer. However, the

VM is not immediately turned off. We keep the VM at a standby state for a time period ofts

seconds. ts refers to the time taken to perform cold cloning. Any further scale up requirements

within the nexttsseconds can be satisfied immediately by informing the load balancer to include

(27)

Chapter 4

Experimental Evaluation

In this chapter we discuss various results we collected using our system.

4.1 Experimental Setup

The experiments were conducted on the Hybrid Green Cloud Computing (HGCC) Cluster at

North Carolina State University. Each HGCC node has an Intel Xeon 3440 2.56 Ghz quad-core

processor, 250 GB hard disk, 8GB Memory, 1Gbps network bandwidth, and runs CentOS 6.0

64 bit with QEMU-KVM 0.12.1. The HGCC cluster has a network-attached storage (NAS)

with size 6TB in RAID5 storage technology. 2 TB of our NAS is used for parity. All images are

maintained on the NAS and we leverage this advantage to create incremental QCOW2 images.

Our resource monitoring code was written based on libvirt 0.9.4 library. We use the virsh

management interface to create, stop, save, and restore guest VMs. The CloneScale master

runs on a single node (HGCC13 in our cluster), which has the same configuration as mentioned

above. The master has a global view of the entire system and performs cloning and down-scaling

on behalf of CloneScale slaves.

The guest VMs are usually configured with 1 VCPU core and 1 GB memory. Each guest

VM runs Centos 5.2 64 bit and is connected to the network using the bridge interface ”br0” in

(28)

CPU, memory, disk and network IO. The resource metrics are collected at different sampling

rates based on the type of cloning technique used (hot or cold cloning) and we currently perform

scaling based on the CPU metric alone. We used two different kinds of workloads to evaluate

our work, the RUBiS and Hadoop. Section 2.2 and 2.3 gives a brief overview about both

applications. In RUBiS, we let the workload generator to track the response time of the HTTP

requests it made. The SLO violation rate is the fraction of requests that has response time

greater than the pre-defined SLO threshold during each experiment. The SLO threshold varies

in our experiments based on the intensity of the workload used. In Hadoop, we are using the

word count on 282 text files as a workload to evaluate our system.

4.2 Prediction

During cloning, the VM will be temporarily suspended and the memory will be transferred to

a state file. This action takes tens of seconds and performing this action during peak usage

will provide unacceptable results. Thus, we try to use a prediction algorithm to predict the

future resource usage of the VM and perform cloning when the VM’s CPU usage is predicted

to exceed the maximum threshold. Figure 4.1 shows a sample real and predicted trace.

To measure the accuracy of our long term prediction algorithm, we collected a six hour CPU

trace of a RUBiS webserver and used this trace to test our prediction model. We feed this CPU

trace as input to our prediction algorithm to get a list of predictions. We had a fixed training

window of 100 samples and varied the sampling interval between one second and ten seconds.

We retrain the model every eight seconds and predict the next eight steps. We mark all the

time instances where cloning has to be triggered, then use our prediction algorithm to predict

when cloning has to be triggered with different look ahead window size. From the predictions,

we calculate: i) Number of true positives (Ntp) - the CPU usage of the VM is predicted to

exceed the maximum threshold correctly, ii) Number of false negatives (Nf n) - the CPU usage

of the VM exceeds the maximum threshold but it is not predicted, iii) Number of false positives

(29)

0 20 40 60 80 100 120

200 300 400 500 600 700

CPU (%)

Time (s)

Actual value Predicted value

Figure 4.1: Comparison between the predicted and actual CPU trace

0 20 40 60 80 100

0 8 16 24 32 40 48 56 64 72 80 88

Prediction accuracy (%)

Lead Time (s)

True positive rate False positive rate

(30)

happen, and iv) Number of True negatives (Ntn) - the CPU usage of the VM is predicted to

stay within the maximum threshold correctly. The true positive rate Tp and the false positive

rateFp are calculated using the standard formulaTp =Ntp/(Ntp+Nf n);Fp =Nf p/(Nf p+Ntn).

Figure 4.2 shows the true positive rate and the false positive rate of our long term prediction

algorithm under different lead time window sizes. The prediction model has a true positive

rate of 91.5% for a lead time of eight seconds and reduces when the lead time window size

increases. Since cloning is not a cheap process, false alarms will cause unnecessary cloning and

unnecessary SLO penalities. Improving the prediction accuracy is one of our main future works.

4.3 Cloning

From the previous experiments, it is evident that our prediction has high true positive rates and

it can be used to drive cloning. We further investigated this issue by comparing the reactive

approach against our proactive approach to trigger cloning. We have modified the RUBiS

workload generator to send UDP message when a SLO violation is recorded. In the reactive

approach, cloning is triggered when three continuous SLO violations are reported. Whereas

in a proactive approach, we use a moving window with eight recent CPU samples and a look

ahead window containing the predicted next eight steps. Here, cloning is triggered when any

five samples in either one of the windows exceed the maximum threshold. There are a few

special scenarios where the CPU usage remains very low for a long duration and the workload

rises almost instantly (e.g. consider a situation where a university releases exam results at 6.00

PM. Every student tries to access the same website to view his/her results). In such scenarios,

if we use only the look ahead window, the model will take few minutes for catching up to the

current trend. Whereas, using both the look ahead and moving window will identify the need

for cloning immediately. We use RUBiS system with one front-end webserver and two back-end

databases to perform this experiment. The client workload generator creates 400 threads to

browse the webserver and the workload generator tracks the delay observed by each thread. We

(31)

than the pre-defined threshold per second. Using this method, we plot the SLO violation trace

with time as x-axis and the SLO violation rate as y-axis.

Figure 4.3a shows the violation rate observed in the absence of CloneScale. The SLO

violations start from 3330 seconds. The initial small violations observed between seconds 3310

and 3320 is due to the moving average. Since, the observed raw trace was hard to read, we

calculated the average with a moving window of size 15 to smooth the curve. The violation

observed between seconds 3310 and 3320 is due to small violation observed for two seconds.

Figure 4.3b shows the effect of using a reactive based approach. After receiving three continuous

violations, cold cloning is triggered. Since webservers are stateless, cold cloning technique is

more apt for cloning a webserver than hot cloning. The small peak observed between 3360 and

3380 is due to the moving average that we calculate. In reality, we observe only a single very

high spike for two seconds caused by the load balancer. The load balancer should be restarted

after adding a newly cloned node for the change to take effect. Since it takes few seconds for

the newly cloned VM to start servicing requests and the system has already started observing

violations, we see an impact in performance. Figure 4.3c shows the performance of CloneScale

which uses prediction driven cloning. We use a prediction model to foresee the future resource

usage and make a cloning decision based on this prediction. Thus, before the system starts

observing violations we perform cloning. This helps in reducing the violations.

In Figure 4.4 we compare the impact of cloning on the cloned and co-allocated VMs. We

are trying to compare three different metrics among the cloned and co-allocated VM. The three

metrics we chose to compare are average delay, CPU usage and network usage of both the cloned

and co-allocated VMs. We believe these three metrics will help us understand the impact from

all three dimensions. Since we are writing the VM’s memory to a state file in NAS, we believe

network will be a very good metric to study the impact.

These experiments are performed with two VMs on a single host. Both VMs host a RUBiS

webserver belonging to different applications. We perform hot cloning and study the impact of

(32)

0 10 20 30 40 50 60

3250 3275 3300 3325 3350 3375 3400 3425 3450

SLO Violation Rate (%)

Time (s)

(a) No cloning (Total violation time: 112s)

0 10 20 30 40 50 60

3250 3275 3300 3325 3350 3375 3400 3425 3450

Time (s)

Cloning period

(b) Reactive cloning (Total violation time: 76s)

0 10 20 30 40 50 60

3250 3275 3300 3325 3350 3375 3400 3425 3450

Time (s) Cloning period

(c) Proactive cloning (Total violation time: 36s)

(33)

4500 500 550 600 650 20 40 60 80 100 Time (s) Delay (ms) Cloning period

(a) Delay of the co-allocated VM

4500 500 550 600 650

100 200 300 400 500 Time (s) Delay (ms) Cloning period

(b) Delay of the cloned VM

4500 500 550 600 650

20 40 60 80 100 Time (s) CPU Cloning period

(c) CPU usage of the co-allocated VM

5000 550 600 650

20 40 60 80 100 120 140 Time (s) CPU Cloning period

(d) CPU usage of the cloned VM

4500 500 550 600 650

2 4 6 8x 10

6 Time (s) Bandwidth (bytes) transmitted bytes received bytes Cloning period

(e) Network usage of the co-allocated VM

4500 500 550 600 650 1 2 3 4 5 6 7x 10

6 Time (s) Bandwidth (bytes) transmitted bytes received bytes Cloning period

(f) Network usage of the cloned VM

(34)

is transient and it does not affect the co-allocated VM. For this experiment, the co-allocated

VM’s disk image is present in the host local disk and the disk image of the VM to be cloned

is in the NAS. The VMs are not pinned to any core, thus the scheduler may schedule the VMs

in any one of the four cores. Figure 4.4a and 4.4b shows the delay observed by the cloned VM

and the delay observed by the co-allocated VM. During the cloning period between 551 and 600

seconds, the VM is stopped temporarily to transfer its memory contents to a statefile. Thus,

it does not process any requests causing a very high delay as seen in Figure 4.4b. In Figure

4.4c and 4.4e, we see a sharp decrease in CPU and both incoming and outgoing network at the

start of the cloning. But, it is very short lived and the application recovers quickly within the

next few seconds. In Figure 4.4d and 4.4f, we do not have any data during the cloning period

because the VM is stopped during this period. From all six figures, we can very well see an

impact of cloning on co-allocated VM but they are very minimal.

As we mentioned in section 4.1 prediction, the size of the look-ahead window is determined

by the type of cloning and memory size of the VM to be cloned. We conducted the following

experiment to compare the time taken to perform hot and cold cloning of VMs with different

memory sizes.

Figure 4.5 clearly states the implication of memory size on hot and cold cloning. Since we

are saving the entire VMs memory into a state file, the time taken to clone a VM with 4 GB

will be greater than the time taken to clone a VM with 2 GB memory. The Figure 4.5 also

clearly states that the cold cloning takes less than ten seconds. However, cold cloning cannot

provide improvement on all occasions, VMs with application carrying a state will benefit less

from cold cloning. The reconfiguration time in Figure 4.5 is the time taken for obtaining the

new IP address and the time taken to modify the application. The time taken to modify the

application is application specific, in this case we have used a web server. Web servers do

not need any modification. Thus the black bar indicates the time taken to reconfigure the IP

address of the cloned VM. In the following experiment, we will understand the advantages of

(35)

0 10 20 30 40 50 60 70

1 2 3 4

Time taken for cloning (s)

Memory (GB) Hot Cloning

Cold Cloning

Figure 4.5: Comparison between time taken for cold cloning and hot cloning

containing web server and a VM containing a database server. We compare the SLO delay time

series of both hot and cold cloning to understand the importance of both approaches.

Figure 4.6 shows the comparison between the average delay after hot cloning and cold

cloning for different applications. We chose a stateful application (MySQL database server)

and a stateless application (Webserver) to understand the advantage of both hot cloning and

cold cloning. We use only the average delay observed after cloning to eliminate the influence

of the cloning period. The experimental setup is similar to the RUBiS setup as mentioned in

section 4.1. The duration of the training period was 1800 seconds. The average delay observed

after hot cloning is better than cold cloning in both cases due to the warm cache. However,

the advantage of hot cloning is very less in a webserver due to its stateless nature. In MySQL,

we can see the real impact of hot cloning. MySQL database used in our experiment caches the

index of the entire table. We take advantage of the hot cloning to reduce the overall warmup

time of the MySQL application. In the following paragraphs, we will compare the hot cloning

(36)

0 200 400 600 800 1000 1200 1400 1600

Webserver MySQL

Delay (ms)

Hot Cloning

Cold Cloning

(37)

0 200 400 600 800 1000 1200 1400

0 500 1000 1500 2000 2500 3000

Delay (ms)

Time (s)

25000

Cloning

(a) Delay trace showing hot cloning of web server

0 200 400 600 800 1000 1200 1400

0 500 1000 1500 2000 2500 3000

Delay (ms)

Time (s)

7240

3800 2011

Cloning

(b) Delay trace showing cold cloning of web server

(38)

The input curve has a recurring pattern and as we can see in Figure 4.7a there is an

increase in average delay starting at 900 and lasts till 1500. When it tries to repeat itself at

2100, CloneScale predicts the increase in resource usage and triggers cloning (hot cloning in

this case). After a brief cloning period, we find that the SLO average delay is brought back to

normal. This clearly states the advantage of using CloneScale, which distributes the workload

among multiple nodes when the resource demands of a VM exceeds the resource available on

that physical host. Figure 4.7b shows the average delay trace of a web server performing cold

cloning. In order to compare the effect of hot and cold cloning, we used the same input trace

for both the experiments. Similarly near 2100, CloneScale predicts the increase in resource

demands and triggers cold cloning. In cold cloning the running VM is never stopped and a new

VM is started from a template, but still we observe a single huge peak. This is due to the load

balancer. After adding a new node to the haproxy load balancer we have to restart the load

balancer for the change to take effect. Even though, the restart takes only one or two seconds,

the system drops all the incoming packets during that period. We can overcome this issue, by

replacing the haproxy load balancer with another load balancer that supports dynamic addition

of nodes. CloneScale is independent of the load balancer and can work with any load balancer.

Finally, Figure 4.7 shows the average delay between 2140 and 2800 seconds. This average delay

observed after hot and cold cloning are roughly the same. Cold cloning observes a small warmup

time right after cloning, but when we consider the average delay between 2000 and 2800 seconds

(including the time taken for cloning), cold cloning does better than hot cloning due to its small

cloning phase. Since time taken for cold cloning is less and the difference in delay after cloning

is only slightly greater than hot cloning, it is more apt for cloning web server.

Figure 4.8 shows the comparison between hot and cold cloning of a database server. The

RUBiS setup includes three web servers and one database server. The three web servers are

provided with three cores each and adequate memory to avoid bottleneck in this tier. In order

to create a bottleneck in the database tier, we used a very high workload with 1000 worker

(39)

1000 1200 1400 1600 1800 2000 2200

0 500 1000 1500 2000 2500

Delay (ms)

Time (s) Cloning

16100 2300

(a) Delay trace showing hot cloning of database server

1000 1200 1400 1600 1800 2000 2200

0 500 1000 1500 2000 2500

Delay (ms)

Time (s) Cloning

2300 5600

(b) Delay trace showing cold cloning of database server

(40)

0 500 1000 1500 2000

1200 1400 1600 1800 2000 2200 2400

Delay (ms)

Time (s)

cold cloning hot cloning

(41)

sure the model can view the recurring pattern and make accurate predictions. In Figure 4.8a,

hot cloning is triggered at the 1850th second and lasts for a duration of 44 seconds. The figure

also shows the reduction in average response time after cloning. Figure 4.8b shows the impact

of cold cloning on the database tier. In cold cloning, we create a cloned VM from a template

and thus we will miss all the stateful application specific information available in the memory.

We used the pmap linux command to check the memory footprint of the MySQL process. We

found that the VM cloned using hot cloning has a memory foot print of 396MB and the memory

foot print of the VM cloned using cold cloning has a memory foot print of 164MB. Figure 4.9

shows the warmup phase observed in cold cloning. The delay trace of cold cloning is clearly

greater than the delay trace of hot cloning from 1150 seconds to 2100 seconds. After 2100

seconds, we find both curves are similar and this indicates the end of the warmup phase. The

warmup phase varies based on the application and its caching ability. We used MySQL myisam

database which stores the database index in the memory. From Figure 4.6 and Figure 4.9, we

understand the advantage of using hot cloning over cold cloning for stateful applications.

Figure 4.10 shows the progress of a Hadoop map node. This experiment was started with

two nodes, one master node and one slave node. The HDFS replication factor is set to 1 in

this experiment. The master node performs only reduce tasks and the slave node performs only

map tasks. Usually the same node performs both map and reduce tasks, we separated the VMs

performing map and reduce tasks to facilitate VM cloning. Wordcount on 282 text files was

used as the benchmark job to trace the progress rate. In Figure 4.10 , x axis denotes the time

in seconds and y axis denotes the progress in percentage. The progress rate is sampled every

five seconds using the Hadoop API. The progress rate of map node is nearly constant, around

0.72% every second. Whereas, the reduce node progresses 66% in the last few seconds and has

only 33% progress from the beginning, hence it is not an ideal candidate for cloning. Unlike

a web server or a database server, the CPU usage of a hadoop map node is mostly above 90

because it keeps performing map tasks as long as they are available. Thus, CloneScale cannot

(42)

0 20 40 60 80 100

0 20 40 60 80 100 120 140

Progress (%)

Time (s) Cloning

Without cloning Hot cloning

(43)

0 20 40 60 80 100

0 20 40 60 80 100 120

Progress (%)

Time (s) Cloning

Hot cloning Cold cloning

(44)

can use the Hadoop progress rate and estimated completion time to trigger cloning. In this

experiment, we trigger cloning manually to show that CloneScale can work with Hadoop. In

Figure 4.10, we manually trigger hot cloning at 120 seconds , During the cloning period (50s)

the Hadoop progress rate is zero. Once the cloning is completed, the map phase accelerates

as expected with the help of two mapper nodes. Figure 4.11 shows the comparison between

hot and cold cloning. The cold cloning is triggered at 120 seconds, since the running VM is

not stopped to save the state, the newly cloned VM joins the running VM which proves more

efficient as expected.

From our previous experiments we have identified that the power consumed by an idle

server is huge compared to the power used by the server under maximum CPU usage [24]. It is

better to consolidate and turn off or put the unused machines to sleep for saving power. Thus

it is very important to reduce the size of the application cluster when the cluster is not fully

utilized. CloneScale performs down-scaling to reduce the size of the tier dynamically when

the CPU usage is below a minimal threshold. This minimal threshold is configurable, usually

the minimal threshold is chosen less than half the maximum threshold to make sure there is

no fluctuation or no cloning right after down-scaling. This experiment shows the impact of

down-scaling by CloneScale compared to resource overprovisioning.

The same RUBiS setup was reproduced on the HGCC cluster. To illustrate down-scaling

we created an initial system with two RUBiS web servers and two RUBiS database servers.

The client generator generates workload that can be easily satisfied with just one node. After

the training phase, CloneScale slave triggers down-scaling and the web servers count is reduced

to one. The minimum threshold for this experiment was set at 30. When the CPU usage was

predicted to go below 30, down-scaling is triggered. The prediction model was trained with 100

samples with two second sampling interval. During down-scaling, the VM is first removed from

the load balancer list and the load balancer is restarted. After a minute, if the VM is not used,

it is destroyed.

(45)

0 20 40 60 80 100 120 140 160

0 100 200 300 400 500 600 700 800

Total CPU (%)

Time (s)

Down-Scaling Cloning With CloneScale

Without CloneScale

(a) Total web server tier CPU usage

0 50 100 150 200

0 100 200 300 400 500 600 700 800

Delay (ms) Time (s) 2800 520280 Down-Scaling Cloning With CloneScale Without CloneScale

(b) Delay trace - cold cloning

(46)

0 10 20 30 40 50 60 70 80 90 100

Over-Provisioning CloneScale

Average Delay (ms)

(47)

and its usage goes below the minimal threshold, a down-scaling request is sent to the master

node but the master node ignores this request because there is only one node left in that tier. At

510 seconds, when the predicted CPU usage exceeds the upper threshold (80% in this case), we

trigger cloning again. Since, we are working with the web server tier, we perform cold cloning.

In figure 4.12a, we have demonstrated both down-scaling and cloning and also compared it with

the over provisioned trace to show the change in trend. Figure 4.12b shows the delay tarace of

a VM. We can see that we have reduced the resource usage with a small increase in delay. The

spikes after the cloning phase are due to the warmup phase of cold cloning. To confirm it is the

warmup phase, we repeated the same expeirment with hot cloning. Figure 4.12c shows that

there is a brief cloning period with very high delay. After the cloning phase, we see an increase

in average delay of the CloneScale case compared to without CloneScale case. This is due to

the incremental QCOW2 images created during hot cloning. The without CloneScale trace is

collected from VMs that are created from base QCOW2 images. We did further experiments

to confirm the increase in average delay after cloning is due to the incremental QCOW2 image

and not the cloning itself. Figure 4.13 proves that the performance of CloneScale using the cold

cloning is comparable and only slightly greater than over-provisioning.

Prediction is not very accurate. If there is a prediction mistake, a new node has to be

cloned to satisfy the deficit caused by the recent down-scaling, which is costly. To overcome

this situation, on a down-scaling request, we remove the server’s IP from the load balancer but

leave the VM dangling for a minute before shutting down the VM. Having a VM as standby

(48)

Chapter 5

Related Work

In this chapter, we provide a brief survey of existing cloud scaling systems with more emphasis

on work related to VM cloning, migration and deployment.

In our previous work, CloudScale [24], we present an elastic resource scaling system driven

by prediction. Cloudscale employs online VM resource demand prediction as well as prediction

error handling in order to provide accurate resource to the VM. CloudScale does not address

adding or removing nodes to an existing application. In this thesis, we present CloneScale an

extension of CloudScale that distributes the workload among multiple nodes with the help of

VM cloning and load balancing to minimize the application SLO time.

In security, the Potemkin project [27] uses a dynamic VM creation technique to improve

honeypot scalability. On the arrival of network packets, a gateway router dynamically binds IP

addresses to physical honey farm servers. For each active IP address, Potemkin dynamically

creates a VM from a reference image. It implements copy-on-write optimization called delta

virtualization where the cloning operation maps all memory pages from the reference image.

The memory of reference image will be read only and all further write requests will be performed

locally. This helps in creating quick short-lived VMs but with one constraint that the newly

created VMs should share the same host with the reference image.

(49)

VM forking technique. In Snowflock, the application running inside the VM uses the fork API

to trigger cloning. While cloning, a VM descriptor containing the page tables, global descriptor

tables, metadata, VCPU register values and few memory pages is transferred from the parent

host to the clone host. The cloned VM is started with this VM descriptor and all initial read

requests by the child VMs are faulted and fetched from the parent VM. The newly cloned

VM will be in a virtual private network with parent and other child VMs. This design is

extremely effective to create quick clones with state but it needs to modify the VMM, host OS

and application running on the guest to use the API.

Kaleidoscope [4] uses snowflock to create swift transient short-lived VM clones for handling

bursts in workloads. As per snowflock implementation, the newly cloned VM will have large

number of page faults at the beginning, leading to a huge warm-up time. This design is not

suitable for handling bursts in workloads. Thus, kaleidoscope uses VM introspection techniques

to perform VM state coloring through which they identify the working set of the memory pages

and push the working set along with the VM descriptor to create a clone. The clone VM

created avoids the warm-up time and start servicing requests immediately. The kaleidoscope

architecture consists of the parent VM, multiple child VMs and also a gateway VM that

inter-faces the cluster to the outside world and performs load balancing. It monitors the number of

incoming client requests and spawns new worker clones when a defined threshold is exceeded.

Kaleidoscope discusses mainly about the cloning mechanism and gives more importance to

cre-ate short-lived VMs to handle bursts. Whereas, CloneScale concentrcre-ates more on the policy,

which determines when cloning should be initiated and what are conditions influencing the

decision. Kaleidoscope has addressed many shortcomings of Snowflock and tailor made the

implementation to support micro elasticity but it still needs to make changes in the VMM and

host OS. Whereas using CloneScale, we can perform VM cloning without making any changes

to the VMM or host OS. We also use a more intelligent scaling policy involving VM demand

prediction to trigger cloning before the application suffers from resource unavailability.

(50)

5, 3]. Chieu et al. [7] provide a solution to increase the scalability of webservers by monitoring

the number of incoming requests to each webserver and create a new webserver from a modified

image template when the number of requests increases above a threshold. Similarly when the

number of requests falls below a minimal threshold, unnecessary VMs are removed. Flurry

DB [19] uses Snowflock to create VM clones to dynamically scale the database tier. The load

balancer is designed to redirect read requests to any one clone and write requests to all clones.

This distribution technique makes sure all database are in a consistent state. During cloning,

the requests are queued in the load balancer and the queued requests are satisfied after cloning

completion. Harold et al [18] from Duke university presented an automated controller for

elastic storage system that uses control theory to determine the number of systems required

in the storage tier. Dolly [5] uses prediction techniques to predict the future workload and

performs cloning using VM snapshotting to scale the database tier. Dolly assumes there is

no shared storage between the servers, thus VM snapshots are transferred to the destination

server through the network. It takes into consideration the time taken for snapshotting and

replica resynchronization time during resource provisioning. CloneScale can perform cloning to

any application with minimal application knowledge and support from the user. We provide a

general user interface to the user for specifying the nature of the application and a script to

make application specific changes after cloning.

Few other related works such as Jump start cloud[30] and Rethink the Virtual Machine

Template [29] mainly concentrate on the quick deployment of VMs in the cloud environment.

Jump start cloud[30] discusses about deploying a VM from a template. The cloning mechanism

provided in this paper is similar to our cold cloning. It focuses only on VM deployment instead

of application scaling and does not address the problem of scaling policy. Rethink the Virtual

Machine Template [29] discusses about rapid deployment of stateful VMs. It avoids long VM

creation time by saving the VM state into a VM substrate, and then restore a new VM from the

substrate pool. Before saving a VM, it shrinks the VM memory size, detaches the devices, so

(51)

format and solves the issue associated with incremental qcow images by copying information

from the base image in background. This paper doesnt address the challenge of application

configuration and application scaling.

Live VM migration [16, 13, 12] has been the most famous solution provided to overcome

scaling conflicts among co-allocated VMs. ”Live migration of VMs” [16] provides a technique

to migrate a VM from one host to another without any downtime. In live migration, resources

are reserved on a destination host and VM memory pages are copied from source to destination.

Subsequent iterations copy only the recently dirtied pages. After transferring all the memory

pages, the VM is suspended at the source, its CPU state and remaining inconsistent memory

pages are transferred to the destination. After transferring the entire VM state successfully,

the original VM is discarded at the source and the destination activates the VM. The duration

of live migration depends on two important factors, the memory size and the memory access

rate. For a memory intensive VM, it takes multiple iterations to transfer the entire memory

content to the destination. Other existing solutions [13, 12] aim at providing post-copy based

live migration. Here, the VM is first transferred to the destination by copying its CPU state

and important memory pages. Further memory pages are fetched after each page fault. The

post-copy implementation helps in moving the VM out of a congested host before transferring

the memory pages but the VMM has to be modified to support post-copy migration. Previous

work, that suggests VM migrations as a solution to scaling conflict, assumes there are free hosts

available to migrate the VM. However, there are special scenarios where there are no better

physical hosts available or the VM requirement exceeds the resources available on that physical

host. CloneScale considers this scenario and aims at avoiding this limitation by distributing

the workload between multiple nodes.

Previous work has studied IO resource scaling in virtualized environment. Virtual IO

Sched-uler [23] provides a fair IO schedSched-uler by creating separate queues for each application. They

apply weight to each queue and schedule them according to Virtual IO scheduler (VIOS). The