Efficient Data Placement and Workflow Variability
Management in Cloud Computing
Thesis submitted in partial fulfillment of the requirements for the degree of
MS by Research in
Computer Science and Engineering
by
Nitesh Maheshwari 200807010
Search and Information Extraction Lab International Institute of Information Technology
Hyderabad - 500 032, INDIA February 2011
Copyright © Nitesh Maheshwari, 2011 All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Efficient Data Placement and Workflow Variability Management in Cloud Computing” by Nitesh Maheshwari, has been carried out under my supervision and is not submitted elsewhere for a degree.
To my parents and my teachers
Acknowledgments
I would like to thank my advisor Dr. Vasudeva Varma for his continuous and inspiring guidance during the course of my thesis, without which this thesis would not have been possible. His constant support motivated me to pursue research in Cloud Computing and later his guidance helped me during the complete tenure of my research and of writing this thesis.
Besides my advisor, I also owe gratitude to Mr. ReddyRaja AnnaReddy of Pramati Technologies, whose technical guidance during my research has been invaluable in shaping my thesis. My sincere thanks to Mr. Chidambaram Kollengode of Yahoo! for providing his insights on Hadoop and MapReduce. I would also like to thank Dr. Santanu Paul of TalentSprint for his guidance during my work on Software Product Lines.
I thank my parents, for their continuous encouragement during the course of my study. Their insistence on pursuing a research degree kept me motivated during the research work. I also want to thank my fiancée Aditi for always motivating me to deliver better.
I thank my fellow labmates: Radheshyam, Amit Sangroya, Sudip, Prashant and Manisha, for the stimulating discussions and for the sleepless nights we worked together before deadlines. I also want to thank the ever so helpful M. Babji for his tremendous enthusiasm, and zeal for helping lab students. I also thank all my friends: Abhijit, Akshat, Gururaj, Manav, Nihar, Sandeep, Sarat and Sudheer, for their companionship, which made my stay at IIIT-H enjoyable.
A special thanks to Jaideep Dhok for his unwavering support all through my work, and for being a great mentor, friend, labmate and coauthor.
Abstract
In utility computing, users access services which are delivered in a manner similar to metered traditional utilities such as water, gas and electricity. This model is advantageous as it does not involve the initial cost to acquire computing resources because computation, storage, network and other services are available as metered services which can be provisioned as per the customers’ demands. These services can be broadly classified as: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). We are thus approaching a model where everything shall be offered as a service - XaaS. These modern datacenters and clouds are distinguished by a utility pricing model where customers are charged based on their utilization of computational resources, storage and transfer of data.
With the recent emergence of cloud computing based services on the Internet, MapReduce has emerged as the paradigm of choice for developing large scale data intensive applications which are distributed in nature. MapReduce works best for embarrassingly parallel workloads, in which little or no effort is required to separate the problem into a number of parallel tasks which can run independently on a cluster of machines. This is often the case when a bigger problem can be divided into a number of smaller problems that can run in parallel and there exists no dependency between those parallel tasks. MapReduce is used by more than 100 organizations worldwide to perform tasks such as web crawling and indexing, social media monitoring, scientific data processing, data mining and machine learning. The Apache Hadoop framework is the leading open source implementation of the MapReduce model. It provides a distributed file system called HDFS, or Hadoop Distributed File System, that facilitates high throughput access to application data. The data stored on HDFS is divided into smaller chunks of configurable size and distributed across the cluster. HDFS creates multiple replicas of data to ensure data availability at all times. Hadoop follows a rack aware data placement policy and ensures data availability under situations ranging from a single node failure to a complete rack disconnect.
Hadoop is designed to work on cheap commodity hardware, is scalable and resilient against machine failures and works on clusters comprising a single node to thousands of nodes. It has also been adopted by a number of educational institutes for performing research where the budgets are even tighter. The cost to support such an infrastructure is an important factor for consideration while setting up the cluster. Power consumption of datacenters has become a key factor contributing to the costs incurred by a service provider. This power related cost includes investment, operating expenses, cooling costs and environmental impacts. Also, given the scale at which these applications are deployed, minimizing power consumption of these clusters can significantly cut down operational costs and reduce their carbon footprint - thereby increasing the utility from a provider’s point of view. For High Performance Computing systems, where the main focus is on improving the performance at any cost, these energy related costs have increased significantly, to a point where they can surpass the actual hardware acquisition costs.
The first problem that we address in this thesis is energy conservation for clusters of nodes that run MapReduce jobs. This problem becomes more important as there is no separate power controller in MapReduce frameworks such as Hadoop. We attempt to reduce the energy consumption of datacenters that run MapReduce jobs by reconfiguring the cluster. We propose an algorithm that dynamically reconfigures the cluster based on the current workload and turns cluster nodes on or off when the average cluster utilization rises above or falls below administrator specified thresholds, respectively. Our implementation creates a channel between the cluster’s power controller module and the underlying distributed file system to dynamically scale the number of nodes and adapt to the current service demands on the cluster.
We evaluate our algorithm on a variety of workloads and our results show that the proposed algorithm achieves substantial energy conservation, as compared to the default HDFS implementation. In our model, the amount of energy saved is proportional to the number of deactivated nodes. We implement the default rack aware replica placement policy followed by HDFS and incorporate the cluster reconfiguration decisions suggested by our algorithm to dynamically scale the number of nodes in the cluster. We demonstrate the scale up and scale down operations of our algorithm and their corresponding energy savings and observe that the cluster intelligently reconfigures itself based on the workload imposed, proving the effectiveness of our algorithm. As expected, the energy conserved in case of low workloads is considerably higher.
The second problem that we address in this thesis deals with the issues associated with achieving workflow variability in Software-as-a-Service (SaaS) applications. The concept of SaaS has gained momentum with the recent emergence of cloud based services on the Internet. SaaS is software in which a provider licenses an application to its customers in a pay-as-you-go model, and workflows are an inherent part of SaaS applications. Since SaaS applications are often multi-tenant applications, they help bring down the overall cost of development, but the cost of managing a SaaS application can grow over time due to the high degree of required configurability. A SaaS application can be deployed in the following ways: single instance, single configurable instance and multiple instances. The amount of commonality and variability thus depends on the deployment model used for the application. User interface, workflow, data and access control are the major points where variability can be explored in a SaaS application.
As a part of this thesis, we discuss the variability points that exist in a SaaS application and explore the use of a Software Product Lines (SPL) based approach to achieve workflow variability. Supporting workflow variability in the context of a SaaS application can significantly reduce the costs associated with its development and maintenance over time. A software product line is a set of applications sharing a common, managed set of features satisfying the specific needs of a particular market segment that are developed from a common set of core assets using a prescribed architecture. A product line strategy, if skillfully used, can produce many benefits such as improvements in time-to-market, reduced cost, increased productivity and better quality of service. We also discuss the types of variability models that are proposed in SPL, taking an Extract-Transform-Load (ETL) workflow application as a case study. ETL workflows represent an important part of data warehousing where the phases of the workflow are well defined but the required level of customization demands different implementations of this standard workflow. We believe that the example of an ETL application is a valid use case where, by making efficient use of variability management concepts, a range of SaaS applications can be offered in less time and at a reduced cost over a period of time.
Contents
1 Introduction
  1.1 MapReduce and HDFS
    1.1.1 MapReduce: Large Scale Data Processing
    1.1.2 HDFS: Distributed Data Storage
  1.2 SaaS Workflows and Software Product Lines
    1.2.1 Software Product Line Engineering
  1.3 Problem Definition and Scope
    1.3.1 Energy Efficient Data Placement and Cluster Reconfiguration
    1.3.2 Workflow Variability in SaaS Applications
  1.4 Organization of the thesis
2 Context: Energy Efficiency in MapReduce and Workflow Variability in SaaS
  2.1 Background: MapReduce and HDFS
    2.1.1 Need for a Power Controller in MapReduce
    2.1.2 HDFS Data Layout
  2.2 Related Work: Energy Efficiency in MapReduce
    2.2.1 Node-level Energy Conservation
    2.2.2 Dynamic Voltage Scaling (DVS)
    2.2.3 Virtualization
    2.2.4 Energy Efficiency in Hadoop
    2.2.5 Load Balancing and Configuration of Clusters
  2.3 Background: Software-as-a-Service (SaaS)
    2.3.1 Key Roles in SaaS
    2.3.2 Deployment Patterns in SaaS
    2.3.3 Need for Configuration in SaaS
  2.4 Related Work: Workflow Variability in SaaS
    2.4.2 Variability Modeling for Customization Support
    2.4.3 Variability Management using SPL
    2.4.4 Workflow Variability and Reuse
  2.5 Summary
3 Energy Efficient Data Placement and Cluster Reconfiguration
  3.1 Proposed Algorithm
    3.1.1 Replica Placement in Hadoop
    3.1.2 Approach
  3.2 Cluster Reconfiguration
    3.2.1 Scaling Up
    3.2.2 Scaling Down
    3.2.3 Avoiding Jitter Effect
  3.3 Cluster Rebalancing
  3.4 Evaluation and Results
    3.4.1 Simulation Model
    3.4.2 Cluster Reconfiguration
      3.4.2.1 Scaling Up
      3.4.2.2 Scaling Down
    3.4.3 Energy Savings
      3.4.3.1 Workload Imposed
      3.4.3.2 Scaling Up and the effect of λ
      3.4.3.3 Scaling Down and the effect of µ
  3.5 Summary
4 Workflow Variability in SaaS Applications
  4.1 Variability Management in SaaS Applications
    4.1.1 Variability
    4.1.2 Variability Management
    4.1.3 Variability Support in SaaS
  4.2 Supporting Workflow Variability in SaaS Applications
    4.2.1 Types of Variation Points
    4.2.2 Workflow Variability
      4.2.2.1 Example of Workflow Variability
    4.2.3 Variability Models in SPL
  4.3 Case Study: ETL Workflow
    4.3.1 ETL: Extract-Transform-Load
    4.3.2 Phases in an ETL cycle
    4.3.3 External and Internal Variability in ETL
5 Conclusions
  5.1 Efficient Data Placement and Cluster Reconfiguration
    5.1.1 Future Work
  5.2 Supporting Workflow Variability in SaaS
    5.2.1 Future Work
  5.3 Summary
List of Figures
1.1 The SPI Model: SaaS, PaaS and IaaS. Source [7].
1.2 Popular uses of MapReduce framework. Source [5].
1.3 Centralized architecture of Hadoop. Data storage and computation are co-located on worker nodes.
2.1 Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server consumes about half its maximum power while doing virtually no work. Source: [14].
2.2 Architecture of Hadoop Distributed File System (HDFS). The files to be stored on HDFS are split in 64 MB chunks and distributed across racks. The communication between the master node (NameNode) and the worker nodes (DataNodes) is shown by means of Heartbeat messages.
2.3 Key Roles in a SaaS Environment
3.1 A typical replication pipeline in Hadoop for a replication factor of 3.
3.2 Proposed Approach: Interaction of Power Controller module with the NameNode. The NameNode performs Rebalancing and Scaling operations once it receives the reconfiguration information from the Power Controller.
3.3 Algorithm for Scaling Up reconfiguration. Upon reaching the trigger condition, we perform intrarack transfer and interrack transfer to all the newly added nodes.
3.4 Algorithm for Scaling Down reconfiguration. Upon reaching the trigger condition, we perform intrarack transfer and interrack transfer from the nodes to be removed.
3.5 Examples of valid and invalid movement of data blocks while performing intrarack and interrack transfer.
3.6 Basic Scale Up and Scale Down operations for a cluster of 50 nodes spread across 5 racks, with each rack having 10 nodes.
3.7 Energy Savings for Scale Up and Scale Down operations for a cluster of 120 nodes spread across 8 racks, with each rack having 15 nodes.
4.1 Workflow variability: Workflow B does not invoke methods F2 and F4, instead invokes a new method F6 not used by workflow A
4.2 Conceptual view of Extract-Transform-Load
4.3 Phases in an ETL cycle
4.4 External Variability in ETL workflows
List of Tables
1.1 Components used to support variability at various SaaS layers: Data, Business, Service and Presentation layer.
2.1 Deployment Patterns in SaaS.
3.1 Data transferred (p) from θsrc to θdst
3.2 Simulation Parameters
4.1 Variability Support at different layers in a SaaS application.
Chapter 1
Introduction
In the last two decades, the Internet has seen a rapid advance in terms of connectivity, reach, reliability and bandwidth. With this, Cloud Computing has emerged as one of the most popular paradigms for delivering hosted services over the Internet, in which software and computing resources are deployed over the Internet and provided in the form of a service to the customers via a utility model. These services are broadly divided into three categories: Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS). These three models combined are often called the SPI model (Figure 1.1). SaaS is a software distribution model in which applications are deployed over the Internet and offered in the form of a license to the customers. PaaS is a paradigm for delivering middleware services like operating systems, database services, workflow services and integration systems. The IaaS layer provides virtualized access to the equipment used for supporting operations and includes hardware, storage, servers and networking components.
The three main tenets of cloud computing that differentiate it from traditional hosting services are instant availability of services, a pay-as-you-use model and massive scalability. That is, the service is sold on demand, users can have as much or as little of a service as they want at any given time, they pay only for the amount of service that they use, and the service is fully managed by the cloud provider. The consumer only needs a thin client (often a browser) to access the cloud, which is the Internet in most cases. This model has the advantage of low or no initial cost to acquire computing resources and is sometimes referred to as Utility Computing, where everything is offered as a utility.
SaaS: application hosted and accessible over the cloud (Ex: Google Apps, Salesforce (CRM)). PaaS: development in the cloud (Ex: Google App Engine, Force.com). IaaS: computing infrastructure accessible over the cloud (Ex: Google Storage, Amazon EC2 and S3).
Figure 1.1: The SPI Model: SaaS, PaaS and IaaS. Source [7].
Consider the case of a website for which the traffic is not uniform throughout the year and spikes for a defined period every year, for example, websites for sports events. If the owners of this website want to be able to serve all the traffic, including that at peak level, they can invest in buying more servers, a substantial percentage of which will be idle for the period when the website is not experiencing heavy traffic. Moving to the cloud is clearly a better alternative for the website owners as they can add resources when needed based on the traffic and pay according to a pay-as-you-use model. Although this is a more efficient option for the website owners, the same might not hold for a cloud provider. Cloud computing uses the concept of Service Level Agreements (SLA) to control the use and receipt of computing resources by cloud users. A cloud provider thus needs to meet the QoS (Quality of Service) requirements guaranteed to its users. Cloud Computing is at an early stage and for it to be efficient, the individual servers that make up the datacenter cloud will need to be used optimally.
A key component of the costs incurred by most service providers is the cost related to the power consumed by their datacenters. For the same reason, many enterprises today are focusing their attention on energy efficient computing, motivated by high operational costs for their large scale clusters and warehouses. This power related cost includes investment, operating expenses, cooling costs and environmental impacts. The significance of such a shift is well supported by the fact that the U.S. Environmental Protection Agency (EPA)¹ and the U.S. Department of Energy (DoE)² are planning to operate a national datacenter efficiency program to develop protocols, metrics and Energy Star specifications for the 11.8 million servers across the country [10]. The EPA datacenter report [9] mentions that the energy consumed by datacenters doubled between 2000 and 2006 and estimates another twofold increase from 2007 to 2011 if the servers are not used in an improved operational scenario. This ‘improved operation’ scenario includes energy-efficiency improvements beyond current trends that are operational in nature and require little or no new capital investment. It represents a scenario where the existing infrastructure can be made more efficient by means of innovative techniques for energy efficiency.
The techniques mentioned above will mostly be operational at the infrastructure level in the SPI stack of Figure 1.1. Moving up in the stack, Software-as-a-Service eliminates the need to install the application on the customer’s own resources and simplifies maintenance and support. It has continuously proven advantageous because of its characteristics such as pay-as-you-go, license sharing within an organization, online access and single point of management of the application. The application can be controlled, monitored and updated from this single point of management. SaaS applications are required to support multi-tenancy as it creates significant economies of scale for the application vendors. However, to attract a significant number of tenants, SaaS applications must be made configurable or customizable to fulfill the varying functional and quality requirements of individual tenants [61]. Also, to realize the true benefits of SaaS, systematic and effective processes and methods to support the development of SaaS services are needed [55].
To form a basis for the problems addressed by this thesis we first give a brief overview of the MapReduce model and SaaS workflows in the following sections.
¹ http://www.epa.gov/
²
1.1 MapReduce and HDFS
A substantial percentage of the datacenters discussed above run large scale data intensive applications, and MapReduce [28] has emerged as an important paradigm for building such applications. This growth is partly encouraged by the flood of data coming from numerous sources. For instance, Facebook³ hosts approximately 10 billion photos, taking up one petabyte of storage, with about 2 terabytes of photos being uploaded every day [71]. They serve over 15 billion photos per day with a peak traffic of about 300,000 images served per second. These numbers increase severalfold once the data generated automatically by machines is considered. These large volumes of data create a need for easy-to-program application frameworks and cheap, affordable infrastructure which can be used to store and process this huge amount of data. Apache Hadoop [3] is the most prominent among such frameworks. It provides reliable shared storage and an analysis system in the form of the Hadoop Distributed File System (HDFS) and MapReduce, respectively [80].
1.1.1 MapReduce: Large Scale Data Processing
MapReduce was introduced by Google⁴ in 2004 as a distributed data processing model and execution environment that runs on large clusters of commodity machines. MapReduce is a linearly scalable programming model and works by breaking the processing into two phases: the map phase (runs map tasks) and the reduce phase (runs reduce tasks), each of which defines a mapping from one set of key-value pairs to another. A MapReduce job is easy to program because the programmer only needs to provide the implementation of the map and reduce functions along with the data on which the job is to be run. It has been adopted by over 100 organizations worldwide to solve a wide array of problems ranging from building production search indexes, to image processing, to user sentiment analysis, to machine learning algorithms. Figure 1.2 presents popular uses of MapReduce as gathered from the Hadoop Powered By page [5] as found in January 2011.
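To make the two phases concrete, the following is a minimal single-process sketch of a word count job. It only mimics the programming model; it is not Hadoop's API, and the names `map_fn`, `reduce_fn` and `run_job` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_fn(_key, line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(records):
    """Run map, group intermediate pairs by key (the shuffle), then reduce."""
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in records)
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Input records are (key, value) pairs; here the key is a line number.
records = list(enumerate(["the quick brown fox", "the lazy dog", "The fox"]))
word_counts = run_job(records)
print(word_counts)
```

The programmer supplies only `map_fn` and `reduce_fn`; in a real framework the grouping step and the distribution of tasks across machines are handled by the runtime.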
³ http://www.facebook.com
⁴
Log Processing 16%, Web Crawling 12%, Data Mining 11%, Reporting 10%, Analytics 10%, Text Indexing/NLP 10%, Data Storage 9%, Machine Learning 8%, Image Processing 5%, Recommendation 5%, Scientific/Biomedical 4%, Other 14%
Figure 1.2: Popular uses of MapReduce framework. Source [5].
In a Hadoop cluster, the input data is split and distributed across a number of nodes and Hadoop tries to run the map task on a node where the input data resides in HDFS. This feature, known as data locality optimization, is at the heart of MapReduce because the designers of MapReduce identified network bandwidth as being one of the most precious resources in a datacenter environment. More discussion on this will follow in subsequent sections.
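The idea of data locality can be illustrated with a small sketch that prefers a free node already holding a replica of the input block. This is a deliberately simplified assumption, not Hadoop's scheduler, which weighs more factors; the function name and node names are hypothetical.

```python
def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already stores a replica of the block.

    block_replicas: set of node names holding the block (hypothetical layout).
    free_nodes: list of node names with a free map slot, in scheduler order.
    Returns (node, is_local); node is None when no node is free.
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, True        # data-local: no network transfer needed
    if free_nodes:
        return free_nodes[0], False  # fall back to reading the block remotely
    return None, False


print(pick_node({"n2", "n5"}, ["n1", "n2", "n3"]))
```

Only the fallback branch moves block data over the network, which is the cost the optimization tries to avoid.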
1.1.2 HDFS: Distributed Data Storage
Hadoop relies heavily on the performance and reliability of the underlying HDFS, which stands for Hadoop Distributed File System. HDFS is Hadoop’s flagship filesystem and runs in a distributed manner on large clusters of commodity machines. It is known to scale to petabytes; at Facebook, for example, the size of the data warehouse cluster grew from 15 PB in June 2010 to 30 PB in December 2010 [75, 17]. Hadoop divides the inputs to a MapReduce job into fixed-size blocks of data called data blocks and independently stores these data blocks in HDFS. Hadoop creates one map task for each data block, which runs the user-defined map function for each record in that data block. This block size is configurable for all files stored on HDFS, with the default value being 64 MB.
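As a quick worked example of the relationship between file size, block size and map tasks described above, the sketch below computes the number of data blocks for a file, assuming one map task per block and the 64 MB default. The function name is an illustrative assumption.

```python
import math

BLOCK_SIZE_MB = 64  # default HDFS block size stated in the text


def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of data blocks (and hence map tasks) for one input file."""
    if file_size_mb <= 0:
        return 0
    # The last block may be only partially filled, so round up.
    return math.ceil(file_size_mb / block_size_mb)


print(num_blocks(1024))  # a 1 GB file
print(num_blocks(100))   # a 100 MB file
```

Under these assumptions a 1 GB input yields 16 blocks, while a 100 MB input yields 2, the second of which is only partly full.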
Master: JobTracker (MapReduce layer) and NameNode (HDFS layer). Slaves: a TaskTracker (MapReduce layer) and a DataNode (HDFS layer) on each worker node.
Figure 1.3: Centralized architecture of Hadoop. Data storage and computation are co-located on worker nodes.
Both MapReduce and HDFS follow a centralized architecture with one master and multiple slaves. For MapReduce, the master node is called the JobTracker and the slaves are called TaskTrackers. Their analogues in HDFS are called the NameNode and DataNodes, respectively. The NameNode manages the filesystem namespace and metadata while the DataNodes store the actual physical data. Figure 1.3 gives a brief overview of a typical Hadoop cluster. The master communicates with the slaves by means of periodic messages called heartbeats.
1.2 SaaS Workflows and Software Product Lines
In the previous section, we covered a brief overview of MapReduce and HDFS. We now move to discuss SaaS workflows and present an overview of Software Product Line Engineering.
A SaaS application comprises the following layers - Data Layer, Business Layer, Service Layer and Presentation Layer (Table 1.1). Variability at the data layer is handled by providing multiple data models used by the application, and the access control for the data can be controlled on a per-tenant basis. The business layer handles the business rules of the application, and workflows are one of the major components to support variability in this layer. Variability at the service layer is provided through service interfaces and message types used for communication across modules. Finally, the user of a SaaS application interacts with the application through the presentation layer, and variability can be provided through a configurable user interface based on a user’s or organization’s preferences.
SaaS Layer            Components
Presentation Layer    User Interface
Service Layer         Service Interfaces, Message Types
Business Layer        Workflows
Data Layer            Data Access, Data Models
Table 1.1: Components used to support variability at various SaaS layers: Data, Business, Service and Presentation layer.
Most SaaS applications today need to support multi-tenancy as it creates significant economies of scale for the application vendors. Multi-tenancy is a principle where a single instance of the application runs on a server and serves multiple client organizations (tenants). Multi-tenant applications need to provide a high degree of customization to support each organization’s needs. For the same reason, a workflow process may involve constant change and update. A workflow is defined as a sequence of activities carried out to perform a particular task. Workflow designers need workflow management systems that can automate the process of creating multiple customized workflows from the set of available artifacts. A workflow management system, if systematically designed and architected, can result in increased efficiency and productivity by means of factors such as lowered cost, improved process control, increased hardware utilization, focus on business needs and ease of use. However, a wrong architectural choice might entail that multi-tenancy becomes a maintenance nightmare [16].
1.2.1 Software Product Line Engineering
Software Product Line (SPL) engineering refers to methods, tools and techniques for creating a collection of similar software systems from a shared set of software assets using a common means of production [11]. The Software Engineering Institute (SEI)⁵ at Carnegie Mellon University (CMU)⁶ is performing useful research in the field of SPL engineering and has published a variety of case studies that demonstrate how various organizations have adopted and succeeded by using SPL engineering. SPL engineering is all about reusing components, structures and artifacts as much as possible. It is because of this property of SPL that various product variants of the software can be derived from the basic product family fairly quickly.
Commonality and variability analysis is an essential activity in product line engineering and has attracted a lot of attention from the research community. Commonality is the reusable area and variability is the difference between products in the product family [50]. Reference [24] summarizes the state of the art in variability management research through an exhaustive and systematic review of the literature on variability management in SPL engineering. The process of choosing an entity from several possible variants in the product line is called derivation. Through careful management and engineering, an organization can achieve economic and marketplace benefits by moving to a product line approach.
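One simple way to picture commonality and variability is as set operations over the features used by each product in a family. The sketch below is an illustrative assumption, not a method taken from the cited literature; the product names and feature labels are hypothetical.

```python
def analyze(product_family):
    """Split the features of a product family into common and variable sets.

    product_family maps a product name to the set of features it uses.
    """
    feature_sets = list(product_family.values())
    if not feature_sets:
        return set(), set()
    commonality = set.intersection(*feature_sets)  # shared by every product
    variability = set.union(*feature_sets) - commonality
    return commonality, variability


# Hypothetical family of two workflow variants.
family = {
    "workflow_a": {"F1", "F2", "F3", "F4", "F5"},
    "workflow_b": {"F1", "F3", "F5", "F6"},
}
common, variable = analyze(family)
print(sorted(common), sorted(variable))
```

In this picture, derivation amounts to selecting, for each variable feature, which variant a given product includes, while the common features are reused unchanged.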
1.3 Problem Definition and Scope
Now that we have formed a basis and described the concepts related to the problems addressed by this thesis, we move on to discuss the problem definition and scope in this section.
In this thesis, we discuss and attempt to solve two problems that aim at maximizing the utility from a service provider’s point of view. As described in the beginning of this chapter, a cloud computing environment can be visualized as a stack of layers described by the SPI model - SaaS, PaaS and IaaS (Figure 1.1).
⁵ http://www.sei.cmu.edu/
⁶
The first problem that we address works at the infrastructure layer of the stack and deals with the energy consumption of datacenters that run MapReduce jobs. Our solution dynamically scales the resources of the cluster in accordance with the workload imposed on it to save energy and hence reduce the carbon footprint of the cluster. Given the scale at which these datacenters operate and consume energy, even a slight improvement can result in substantial savings for the service provider. The second problem that we discuss is that of achieving workflow variability in SaaS applications efficiently, and hence works at the software layer of the stack. We discuss and propose concepts from Software Product Line Engineering (SPLE) that can make use of the commonality and variability in SaaS workflow applications to reduce the development and operational cost of these applications.
1.3.1 Energy Efficient Data Placement and Cluster Reconfiguration
The MapReduce framework, by design, incorporates mechanisms to ensure reliability, availability, load balancing, fault tolerance, etc., but such mechanisms can have an impact on energy efficiency as even idle machines remain powered on to ensure data availability. The Server and Energy Efficiency Report [10] states that more than 15% of the servers are run without being used actively on a daily basis. In MapReduce, data is split and replicated across multiple nodes on the cluster and over time the distribution of this data stored on the cluster can become skewed. In this thesis, we address the issue of power conservation for clusters of nodes that run MapReduce jobs, as there is no separate power controller in MapReduce frameworks such as Hadoop [3]. Although we focus on one specific framework, our approach can also be applied to other application frameworks where the framework's power controller module can be connected to its foundational software.
Our key contribution is an algorithm that dynamically reconfigures the cluster based on the current workload on the cluster. It scales up the number of active nodes in the cluster when the average cluster utilization rises above a threshold specified by the cluster administrator. Similarly, the algorithm scales down the number of active nodes in the cluster when the average cluster utilization falls below another threshold specified by the administrator. By doing this, the nodes in the cluster that are underutilized can be turned off to save power. While doing this, the algorithm transfers the workload from these underutilized nodes to other active nodes in the cluster, so that the performance of the system is not compromised. We remodel the cluster's data placement policy and reconfigure the cluster depending on the workload experienced. The advantage of such a modification to the framework is that the applications developed for these clusters need not be fine-tuned to be energy aware. We focus only on the data aspect of this problem; although the computation part also forms an important part of the solution, it is not considered in this work.
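The scale-up and scale-down rule described in this section can be sketched as follows. This is only a minimal illustration under stated assumptions, not the algorithm presented in Chapter 3: the function name `reconfigure`, the one-node step size and the two default threshold values are hypothetical and chosen purely for the example.

```python
def reconfigure(active_nodes, total_nodes, avg_utilization,
                scale_up_threshold=0.8, scale_down_threshold=0.3):
    """Return the new number of active nodes.

    The thresholds stand in for the administrator-specified values;
    the defaults and the one-node step are assumptions of this sketch.
    """
    if avg_utilization > scale_up_threshold and active_nodes < total_nodes:
        return active_nodes + 1  # scale up: activate one more node
    if avg_utilization < scale_down_threshold and active_nodes > 1:
        return active_nodes - 1  # scale down: turn off one underutilized node
    return active_nodes          # utilization is between the thresholds


if __name__ == "__main__":
    for utilization in (0.9, 0.5, 0.1):
        print(utilization, reconfigure(10, 20, utilization))
```

Using two separate thresholds rather than one leaves a band of utilization values in which the cluster is left unchanged, which avoids switching nodes on and off for small fluctuations in load.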
To verify the efficacy of our algorithm, we simulated the Hadoop Distributed File System (HDFS) [4] using the GridSim toolkit [20]. We studied the behavior of our algorithm with the default Hadoop implementation as baseline and our results indicate an energy reduction of 33% under average workloads and up to 54% under low workloads.
1.3.2 Workflow Variability in SaaS Applications
Development of a range of SaaS applications may lead to an increase in time-to-market for these applications. It may also lead to an increase in the development and operating costs, including maintenance costs over a period of time. To address these challenges, various methods to identify and manage the variability in SaaS applications have been proposed [12, 13, 18, 27, 53, 61]. Variability management is concerned with the management of commonality and variability across a range of applications that share a common set of business goals. Major points of variability in SaaS applications are: user interface (UI), workflow, data, and access control. Although the issues related to UI, access control and data variability have been studied earlier [64], the concepts of workflow variability in SaaS applications have not yet been explored extensively. Therefore, in this thesis, we discuss the problem of workflow variability management in SaaS applications.
One of the major challenges with SaaS workflows is the high degree of configurability required. In older applications, often the only way to change a workflow was by modifying the application code, which in most cases would result in creating another copy of the application source and a separate deployment of the same. Clearly, rolling out new and customized versions of applications in this way takes a lot of time, which adds to customer frustration. In more recent times, with time-to-market and cost-of-solution becoming major business drivers for application providers, there is a need to create significantly new applications quickly. This also means that organizations should be able to create different kinds of business logic on a common application platform.
We use the concepts of variability management from Software Product Line Engineering (SPLE), which is a well studied research area, and propose their use to support workflow variability in SaaS. We do not focus on defining variability management in general as these issues have been thoroughly studied [70]. Rather, we define various workflow variability points that can exist in SaaS applications and suggest how the concepts of commonality and variability from SPLE can help achieve workflow variability for a family of SaaS applications efficiently. We illustrate our proposed approach using a case study of an ETL workflow application. ETL workflows are an important part of data warehousing, as they represent the mechanisms by which data actually gets loaded into the warehouse [72]. We believe that the example of ETL applications is a valid use case, where, by making efficient use of the variability management concepts, a range of SaaS applications can be offered in less time and at a reduced cost over a period of time.
1.4 Organization of the Thesis
In this chapter, we first presented a brief overview of the MapReduce framework and SaaS workflows. We then discussed the definition and scope of the problems addressed by this thesis. The rest of this thesis is organized as follows:
Chapter 2 begins by giving a background of MapReduce and HDFS. We present the need for a power controller in MapReduce and then proceed to compare our energy efficient algorithm with other proposed approaches in the literature. We then give a background of SaaS by describing key roles in SaaS, deployment patterns for SaaS applications and the need for configuration support in SaaS applications before moving on to present the related work in this field.
Chapter 3 presents our energy efficient algorithms for data placement and cluster reconfiguration, followed by the techniques we used for rebalancing the cluster. Finally, we conclude with a discussion on the methodology and simulation model used for evaluation of our algorithm and the results of evaluation.
Chapter 4 presents and discusses the use of variability management concepts from the field of Software Product Line Engineering to support workflow variability in SaaS based workflow applications. We argue that even though SaaS based models have been proven to be advantageous, there is a need to efficiently and effectively support variability in multi-tenant SaaS applications. We define the variability points that exist in a SaaS workflow application and conclude by analyzing the proposed approach on Extract-Transform-Load (ETL) applications.
Chapter 5, the last chapter of this thesis, summarizes our work and establishes the key lessons learned from it. We touch upon a number of topics for further research in this field, and conclude this thesis.
Chapter 2
Context: Energy Efficiency in MapReduce and Workflow Variability in SaaS
In this chapter, we give a background of MapReduce and HDFS (Hadoop Distributed File System) for the first problem that is addressed in this thesis. We present the need for a power controller module in the MapReduce framework and then describe the architecture followed by HDFS. We then discuss the significance of our first problem, introduced in Section 1.3.1, by reviewing the related work in this field. We then form a base for our second problem, introduced in Section 1.3.2, by giving a background of Software-as-a-Service (SaaS). We describe the key roles in SaaS, deployment patterns for SaaS applications and the need for configuration support in SaaS applications before moving on to present the related work in this field.
2.1 Background: MapReduce and HDFS
In this section, we provide an architectural overview of the data layout followed in the Hadoop Distributed File System (HDFS). Before that, we present the need for a power controller module in the MapReduce framework.
2.1.1 Need for a Power Controller in MapReduce
With the emergence of cloud computing in the past few years, MapReduce [28] has seen tremendous growth, especially for large-scale data intensive computing [30]. A large segment of this MapReduce workload is managed by Apache Hadoop [3], Amazon Elastic MapReduce [2] and Google's in-house MapReduce implementation. MapReduce was designed for deployment on clusters running inexpensive commodity hardware, and even idle nodes remain powered on to ensure data availability. Such inexpensive hardware is usually devoid of any hardware features to save power when its usage is not optimal. Also, there is as yet no separate power controller in MapReduce frameworks such as Hadoop. This makes the problem of energy efficiency in MapReduce an interesting area to explore.
[Plot: server power usage (percent of peak) and energy efficiency against utilization (percent); both axes run from 0 to 100.]
Figure 2.1: Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server consumes about half its maximum power while doing virtually no work. Source: [14].
Datacenters are known to be expensive to operate and they consume huge amounts of electric power [21]. Google's server utilization and energy consumption study [14] reports that the energy efficiency peaks at full utilization and drops significantly as the utilization level decreases (Figure 2.1). Moreover, the power consumption at zero utilization is still considerably high: even an idle server consumes about half its maximum power. We also observe that the energy efficiency of these servers lies in the range of 20-60% when operating under 20-50% utilization. Thus Figure 2.1 indicates that dynamically reconfiguring the cluster by scaling it, while running the active nodes at higher utilization, is the best decision from a power management point of view.
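The argument for running fewer nodes at higher utilization can be made concrete with a small calculation. The sketch below assumes a linear power model in which an idle server draws half of its peak power; the linear shape, the constant `IDLE_POWER_FRACTION` and the function names are assumptions of this example and only approximate the measured curve in Figure 2.1.

```python
IDLE_POWER_FRACTION = 0.5  # assumed: an idle server draws about half its peak power


def power_fraction(utilization):
    """Power drawn as a fraction of peak, under the assumed linear model."""
    return IDLE_POWER_FRACTION + (1.0 - IDLE_POWER_FRACTION) * utilization


def energy_efficiency(utilization):
    """Useful work per unit of power, normalized so that full utilization gives 1."""
    return utilization / power_fraction(utilization)


def cluster_power(total_load, active_nodes):
    """Cluster power, in units of one node's peak power, with load spread evenly."""
    per_node_utilization = total_load / active_nodes
    return active_nodes * power_fraction(per_node_utilization)


if __name__ == "__main__":
    # The same load (equivalent to 4 fully busy nodes) on 10 nodes and on 5 nodes.
    print(cluster_power(4.0, 10), cluster_power(4.0, 5))
```

Under this model the same load costs less power on five nodes at 80% utilization than on ten nodes at 40% utilization, because each node that is turned off no longer pays the idle cost.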
The MapReduce framework implemented by Hadoop relies heavily on the performance and reliability of the underlying HDFS. It divides applications into numerous smaller blocks of work and application data into smaller data blocks. It then allows parallel execution for applications in a fault tolerant way by creating redundant copies of both data and computation.
2.1.2 HDFS Data Layout
[Diagram labels: master (NameNode), worker nodes (DataNodes), racks, data files, data blocks, heartbeats.]
Figure 2.2: Architecture of Hadoop Distributed File System (HDFS). The files to be stored on HDFS are split into 64 MB chunks and distributed across racks. The communication between the master node (NameNode) and the worker nodes (DataNodes) is shown by means of Heartbeat messages.
The design of HDFS is based on Google File System (GFS) [38] and has a master-slave architecture, with the master called the NameNode and the slaves called DataNodes. The NameNode is responsible for storing the HDFS namespace and recording changes to the file system metadata. The DataNodes are spread across multiple racks and are responsible for storing the actual HDFS data in their local file systems as shown in Figure 2.2. The DataNodes are connected via a network and machines in the same rack are usually connected by a higher bandwidth link as compared to machines in different racks. We shall use the terms intrarack and interrack to refer to an operation within the same rack and across different racks, respectively.
Each file stored on HDFS is split into smaller chunks of size 64 MB called data blocks and these blocks are distributed across the cluster. Each block is replicated to ensure data availability in case of connectivity failure to individual nodes or even to complete racks. Each DataNode periodically sends a heartbeat message to the NameNode in order to inform the NameNode about its current state, including the block report of the blocks it stores. The NameNode assumes DataNodes without recent heartbeat messages to be dead and stops forwarding any more requests to them. The NameNode also re-replicates the data lost because of dead nodes, corrupt blocks, hard disk failures, etc. If some DataNodes are turned off without informing the NameNode, the cluster might enter what can be called a panic phase, where the NameNode tries to replicate the blocks earlier stored on these DataNodes. This process might end up generating heavy network traffic [77].
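The heartbeat mechanism and the re-replication it triggers can be sketched as follows. This is an illustrative model rather than Hadoop code: the function names, the data structures and the timeout value used in the example are hypothetical.

```python
def find_dead_datanodes(last_heartbeat, now, timeout_seconds):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, timestamp in last_heartbeat.items()
                  if now - timestamp > timeout_seconds)


def blocks_to_rereplicate(block_locations, dead_nodes, replication_factor):
    """Map each under-replicated block to the number of replicas it is missing."""
    dead = set(dead_nodes)
    missing = {}
    for block, nodes in block_locations.items():
        live_replicas = [node for node in nodes if node not in dead]
        if len(live_replicas) < replication_factor:
            missing[block] = replication_factor - len(live_replicas)
    return missing


if __name__ == "__main__":
    heartbeats = {"dn1": 100.0, "dn2": 40.0, "dn3": 95.0, "dn4": 98.0}
    locations = {"blk_a": ["dn1", "dn2", "dn3"], "blk_b": ["dn1", "dn3", "dn4"]}
    dead = find_dead_datanodes(heartbeats, now=100.0, timeout_seconds=30.0)
    print(dead, blocks_to_rereplicate(locations, dead, replication_factor=3))
```

The sketch also shows why silently powering off nodes is costly: every block that had a replica on a switched-off node becomes under-replicated at once, and copying all of them is the source of the heavy network traffic mentioned above.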
2.2 Related Work: Energy Efficiency in MapReduce
Lowering the energy usage of data centers is a challenging and complex issue because computing applications and data are growing so quickly that increasingly larger servers and disks are needed to process them fast enough within the required time period [19]. Fan et al. report that the opportunities for power and energy savings are significant at the cluster-level and hence the systems need to be energy efficient across their activity range [35].
By cluster-level techniques, we mean techniques that make use of the global state of the cluster and, in accordance with the framework or the software running on the cluster, employ methods that reduce the energy consumption of the machines. The basic idea of a cluster-wide technique is to aggregate the system load and then determine the minimal set of servers which could handle the load [46]. In contrast, local techniques have focused on reducing the power consumption of a single workstation. Most local techniques try to reduce the energy consumption at a component level, such as CPU, network, disk and storage, by means such as reducing the clock frequency, voltage or disk speed, or by saving on network interconnect components. In a cluster-wide technique, on the other hand, all the services deployed in a cluster actively participate in the power management actions.
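The basic idea of a cluster-wide technique, aggregating the load and finding the smallest set of servers that can carry it, can be sketched in a few lines. The example assumes identical servers and a load that can be divided freely among them; the function name and the `target_utilization` parameter are hypothetical and not taken from [46].

```python
import math


def minimal_active_servers(server_loads, capacity_per_server,
                           target_utilization=1.0):
    """Smallest number of identical servers able to carry the aggregate load.

    target_utilization caps how full each server may run; both it and the
    assumption of identical servers are simplifications of this sketch.
    """
    total_load = sum(server_loads)
    usable_capacity = capacity_per_server * target_utilization
    return max(1, math.ceil(total_load / usable_capacity))


if __name__ == "__main__":
    loads = [120, 80, 40, 60]  # e.g. requests per second seen by four servers
    print(minimal_active_servers(loads, capacity_per_server=200,
                                 target_utilization=0.75))
```

A local technique, by comparison, would leave all four servers running and only lower the power drawn by each one.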
Most businesses in the computing sectors are experiencing an explosion of data used and generated by their applications. Analysis of this data can help uncover useful patterns about user behavior. Most datacenters today run large scale data intensive applications that store and process huge amounts of data. A number of tools and frameworks have been developed to process data at this scale and MapReduce [28] has emerged as the most prominent paradigm among them. It was first developed at Google to process web scale data on inexpensive commodity hardware.
MapReduce framework allows programmers without any prior experience with parallel and distributed systems to easily utilize the resources of a large distributed system and is hence adopted by a huge number of programmers and organizations worldwide. At Google alone, hundreds of MapReduce jobs have been implemented and more than a thousand MapReduce jobs are executed on their clusters every day. Apache Hadoop [3], an open source implementation of the original model, is the most popular framework to develop MapReduce applications and has become the de facto tool for modeling large scale data processing applications.
Most datacenters are known to be expensive to operate and a key component of the costs incurred by service providers is the power related costs for their datacenters. Most enterprises today are thus focusing their attention on energy efficient computing, motivated by high operational costs for their large scale clusters and warehouses. MapReduce was designed for deployment on clusters running inexpensive commodity hardware and hence, even idle nodes remain powered on to ensure data availability. The lack of a separate power controller in MapReduce frameworks poses an interesting area of research to work on.
We now present the related work in this field, grouped under the following heads: node-level energy conservation techniques, dynamic voltage scaling (DVS) based techniques, virtualization based techniques, techniques for energy efficiency in Hadoop and load balancing techniques.
2.2.1 Node-level Energy Conservation
A substantial amount of earlier research has dealt with optimizing the energy efficiency of servers at a component level like processors, memories and disks [79, 69, 56, 42]. A lot of literature in this field focuses on solving the problem of energy efficiency using local techniques like Dynamic Voltage Scaling (DVS), request batching and multi-speed disks [34, 22, 41]. In our work, we employ the use of a cluster wide technique where the global state of the cluster is used to dynamically scale the cluster to handle the workload imposed on it.
Weiser et al. propose a method for reducing the energy used by the CPU by introducing a new metric for CPU energy performance, millions-of-instructions-per-joule (MIPJ) [79]. They examine a class of methods characterized by dynamic control of the system clock by the operating system scheduler and discuss that reducing the clock speed alone does not reduce the energy consumed by the CPU, since to do the same amount of work the system has to run longer. They consider several methods for varying the clock speed dynamically and examine the performance of these methods. They conclude that by adjusting the clock speed at a fine grained level, substantial energy can be saved with little impact on performance.
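The observation that slowing the clock alone saves no energy follows from a short calculation. The sketch below uses the common approximation that dynamic CPU power is proportional to capacitance times voltage squared times frequency; this model, the capacitance value and the function names are assumptions of the example rather than details taken from [79].

```python
def dynamic_energy_joules(cycles, frequency_hz, voltage, capacitance=1e-9):
    """Energy to execute a fixed number of cycles under the assumed power model."""
    power_watts = capacitance * voltage ** 2 * frequency_hz
    runtime_seconds = cycles / frequency_hz  # a slower clock means a longer run
    return power_watts * runtime_seconds


def mipj(instructions, energy_joules):
    """Millions of instructions per joule."""
    return (instructions / 1e6) / energy_joules


if __name__ == "__main__":
    work = 1e9  # cycles; one instruction per cycle is assumed for simplicity
    full_speed = dynamic_energy_joules(work, 1e9, voltage=1.0)
    half_clock = dynamic_energy_joules(work, 5e8, voltage=1.0)
    half_clock_low_voltage = dynamic_energy_joules(work, 5e8, voltage=0.5)
    for energy in (full_speed, half_clock, half_clock_low_voltage):
        print(energy, mipj(work, energy))
```

In this model the frequency cancels out of the energy, so halving the clock at the same voltage leaves the energy unchanged, whereas lowering the voltage along with the clock reduces it. This is why the techniques in Section 2.2.2 scale the voltage and not only the clock.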
Power aware page allocation policies have been suggested for improving energy efficiency of memories that have power management features such as support for different power modes. Lebeck et al. exploit these hardware features supported by memories and explore the interaction of page placement with these features [56]. They consider page allocation policies that can be employed by an informed operating system to complement the hardware power management strategies.
Energy savings for servers have received special attention at the disk level. Several methods such as request batching and multi-speed disks for servers have been proposed in the literature to optimize energy efficiency of disks. In their work, Helmbold et al. apply a machine learning based technique to spin down a disk in order to extend the battery life of a mobile device [42]. Carrera et al. employ multi-speed disks, and slow down each disk for lower energy consumption during periods of light load, to provide energy savings for network servers [22]. Gurumurthi et al. propose a similar approach and modulate disk speed dynamically to provide significant energy savings [41].
2.2.2 Dynamic Voltage Scaling (DVS)
The technique of Dynamic Voltage Scaling (DVS) has been employed to provide power-aware scheduling mechanisms to minimize the power consumption of servers [52, 57]. DVS is a power management technique where undervolting (decreasing the voltage) is done to conserve power and overvolting (increasing the voltage) is done to increase computing performance. Much recent research [48, 37, 44] has been done to provide power-aware cluster computing by using the DVS scheme. Hsu et al. apply a variation of DVS called Dynamic Voltage and Frequency Scaling (DVFS) by operating servers at various CPU voltage and frequency levels to reduce overall power consumption [45]. For DVS based techniques, the servers need to be DVS-enabled, as against our approach, where we propose an algorithm to save energy for clusters comprising inexpensive commodity hardware in which the servers might not have these special features.
Dynamic Voltage Scaling (DVS) scheduling algorithms can reduce power consumption by controlling appropriate voltage levels. Kim et al. proposed a power-aware scheduling algorithm for bag-of-tasks applications that must finish all the sub-tasks before a given deadline [52]. They provide DVS scheduling algorithms for both time-shared and space-shared resource sharing policies and report a reduction in power consumption as compared to static voltage schemes.
In a similar approach, Lee et al. propose a technique to schedule the jobs on high-performance computing systems with the goal of minimizing their completion times with special attention to energy consumption [57]. DVS can be used to operate the system at different voltage supply levels at the expense of sacrificing clock frequencies. Hence, there is a trade-off between the quality of schedules and energy consumption, and they balance these two performance goals by means of an objective function. However, unlike [52], they do not consider applications which are deadline-constrained.
Dynamic Voltage Scaling has also been combined with other techniques to reduce the energy consumption of servers. Elnozahy et al. use DVS along with request-batching mechanisms to reduce processor energy usage for web servers [34]. Both these techniques are complementary in the different kinds of support needed from the hardware and their efficiency over different ranges of workload intensities. They demonstrate that energy conservation policies using a feedback-driven framework achieve significant energy savings while maintaining system responsiveness at a desired level.
A variation of DVS, Dynamic Voltage and Frequency Scaling (DVFS), has also been proposed in the literature. Hsu et al. propose an automatically adapting power-aware algorithm that adapts its voltage and frequency settings to achieve power reduction and energy savings with minimal impact on performance [45]. They propose a β-adaptation algorithm that makes scheduling decisions at the beginning of fixed-length time intervals, making use of the existing alarm clock functionality found in operating systems. A user can specify a maximum amount of acceptable performance slowdown and the algorithm schedules CPU frequencies and voltages to ensure desired performance levels. Rangasamy et al. propose a compiler directed approach for applying DVFS to Multiple Clock Domain (MCD) processors by allowing individual processor chips to be partitioned into different clock domains, and each domain's frequency and voltage to be individually configured [69].
2.2.3 Virtualization
A Virtual Machine (VM) was originally defined by Popek and Goldberg as an efficient, isolated duplicate of a real machine [68]. Using virtualization, a computer program can be used to emulate and replace a real computer. The physical machine is called the host system and the virtual machines running as computer programs are called the guest systems. By doing this, a single computer can run several operating systems and their applications at the same time.
Berl et al. propose virtualization with cloud computing as a way forward to identifying the main sources of energy consumption [15]. Live migration and placement optimizations of virtual machines have been used in earlier works to provide a mechanism to achieve energy efficiency [59, 46, 62]. Power aware provisioning and scheduling of VMs have also been used with DVFS techniques to reduce the overall power consumption [51, 78]. Although virtualization helps reduce a datacenter's power consumption, the ease of provisioning virtualized servers can lead to uncontrolled growth and more unused servers, a phenomenon called virtual server sprawl [10].
In live VM migration, a VM is moved from one host server to another while continuously running without any noticeable effects from the end user's point of view. Liu et al. have proposed a new architecture, namely GreenCloud, that aims to reduce datacenter power consumption by enabling comprehensive online-monitoring, live VM migration and VM placement optimization [59]. Their architecture monitors a variety of system factors and performance measures including application workload, resource utilization and power consumption, and dynamically adapts to changing workload and resource utilization through live migration of VMs.
The green computing algorithm Magnet, proposed by Hu et al. in [46], turns off redundant nodes to save energy, provided that the performance of the cluster as a whole is guaranteed by the remaining nodes. Magnet keeps track of all active nodes and organizes them in order of decreasing workload, which makes it easy to consolidate jobs that are spread across lightly loaded nodes onto a subset of the currently active nodes. After this, systems which are in a non-intensive computing state are turned off to save energy, and bigger jobs are transferred to the free nodes when the system is in an intensive computing state to obtain performance gains. Milenkovic et al. apply a similar approach in [62] to minimize the global power consumption of a datacenter. They do so by automating a policy-driven mechanism that migrates VMs away from lightly loaded nodes and saves system power by shutting off vacated nodes with no active workload.
Virtualization techniques have also been used with local techniques such as DVFS to reduce power consumption. Laszewski et al. have proposed an approach that focuses on scheduling VMs in a compute cluster to reduce power consumption via DVFS and designed an efficient scheduling algorithm to allocate VMs in a DVFS-enabled cluster [78]. Kim et al. have proposed a different approach in which they model a real-time service request as a real-time VM request and provision VMs in the datacenters using DVFS schemes [51]. They provide several schemes for power-aware provisioning of real-time virtual machines for the purpose of maximizing profits of cloud computing datacenters.
Datacenters can certainly increase their profit by provisioning more virtual machines to users. In addition, reducing energy consumption also increases profit by reducing the cost for a cloud service. But the potential overhead caused by live migrations of VMs cannot be ignored, as it may have negative effects on cluster utilization, throughput and QoS issues. It has to be ensured that only used servers are migrated to a virtualized environment so that the growth of virtualized servers remains controlled [10].
2.2.4 Energy Efficiency in Hadoop
For MapReduce frameworks like Hadoop, Vasić et al. present a design for energy aware MapReduce and HDFS where they leverage the sleeping state of machines to save power [77]. They demonstrate that leveraging the sleep state may lead to unacceptably poor performance and low data availability if the distributed services are not aware of the power management's actions. They present an architecture for cluster services and propose a model for collaborative power management where a common control platform acts as a communication channel between the cluster power management and services running on the MapReduce cluster.
Leverich et al. present a modified design for Hadoop that allows scale-down of operational clusters [58]. They propose the notion of a covering subset: during block replication, at least one replica of each data block must be stored in the covering subset. This ensures data availability, even when all the nodes not in the covering subset are turned off. They show that it is possible to recast the data layout and task distribution of Hadoop to enable significant portions of a cluster to be powered down. However, covering subsets for files in their work have to be established and specified by users (or cluster administrators).
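The covering subset invariant can be illustrated with a small placement routine. The sketch below is not the policy of [58]: the round-robin choice of nodes, the function name `place_replicas` and the integer block identifiers are assumptions made only to show the invariant that one replica always lands inside the covering subset.

```python
def place_replicas(block_id, nodes, covering_subset, replication_factor=3):
    """Choose distinct replica nodes so that at least one is in the covering subset."""
    covering = [node for node in nodes if node in covering_subset]
    if not covering:
        raise ValueError("covering subset has no available node")
    # First replica: always inside the covering subset (round-robin by block id).
    first = covering[block_id % len(covering)]
    # Remaining replicas: any other nodes, rotated to spread blocks around.
    others = [node for node in nodes if node != first]
    start = block_id % len(others) if others else 0
    rotated = others[start:] + others[:start]
    return [first] + rotated[:replication_factor - 1]


if __name__ == "__main__":
    cluster = ["n0", "n1", "n2", "n3", "n4", "n5"]
    for block in range(4):
        print(block, place_replicas(block, cluster, covering_subset={"n0", "n1"}))
```

With such a placement, every node outside the covering subset can be powered down and each block still has at least one reachable replica, although fewer than the configured number of replicas remain online.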
An important difference between the two works mentioned above and our work is that the techniques employed in these works compromise on the replication factor of data blocks stored on the cluster. Our algorithm attempts to keep the cluster utilized to its maximum allowed potential and accordingly scale the number of nodes without compromising on the replication factor for the data blocks.
Chen et al. measure the energy consumption of MapReduce under workloads that stress different parts of the system, and analyze the performance and scalability behavior of MapReduce with respect to energy [25]. From these measurements, they conclude that well-configured system parameters and well-designed workloads can improve the energy efficiency of MapReduce substantially without significant modifications to the underlying MapReduce infrastructure. Based on their experiments, they also suggest that an HDFS replication factor of less than its default value of three would be less energy efficient.
2.2.5 Load Balancing and Configuration of Clusters
A number of groups have done research on load balancing and cluster configuration, but mostly without consideration for energy or power efficiency. The literature on load balancing is vast [39, 31, 49] and the goal of these systems is either to balance load across multiple machines or to harvest the cycles of idle machines. In contrast, our work focuses on load concentration.
Work by Pérez et al. in [65] presents a mathematical formalism to achieve dynamic reconfiguration with the use of storage groups for data-based clusters. Duy et al. address this problem with a machine learning based approach, using neural network predictors to predict future load demand based on historical demand [32]. According to the prediction, they turn off unused servers to minimize the number of running servers.
Our work is inspired by earlier works of Pinheiro et al., which address the problem of energy efficiency using cluster reconfiguration at the application level for clusters of nodes, and at the operating system level for standalone servers [66, 33]. They have proposed a cluster configuration and load distribution algorithm to develop systems that dynamically turn cluster nodes on, to be able to handle the load imposed on the system efficiently, and off, to save power under lighter load.
We address the same problem for a cluster of machines running a MapReduce framework such as Hadoop. Other earlier works assume that any given request might be served by a number of currently active servers, but this sort of policy is possible only in scenarios where each server is largely stateless. However, cluster applications can be heterogeneous and our algorithm does not make any assumptions about the statelessness of the servers.
2.3 Background: Software-as-a-Service (SaaS)
The last two sections presented the readers with a brief overview of MapReduce and a literature survey of existing works in the field of energy efficient computing. In this section, we present an overview of Software-as-a-Service (SaaS) before moving on to present the related work in the field of variability management in the next section.
The growth and reach of the Internet has triggered the advent of the Software-as-a-Service (SaaS) paradigm, in which software is offered to customers under a license. With SaaS, software is deployed over the Internet, eliminating the need to install the application on the customer's own resources. SaaS has continuously proven advantageous because of its characteristics such as pay-as-you-go, license sharing within the customer's organization, online access and single point of management of the application. SaaS is a software delivery model wherein a common code base is maintained, often in a multi-tenant instance. In this section, we cover the basics of SaaS, starting with the key roles in SaaS. We then describe the types of deployment patterns for SaaS applications before we go on to discuss the need for configuration and customization in SaaS.
2.3.1 Key Roles in SaaS
There are three key roles in a SaaS environment: SaaS customer, SaaS provider and SaaS application vendor [61]. The application vendor develops the application. The provider hosts the application on its infrastructure and makes it open for subscription over the Internet. Customers are the users of the application and subscribe for its use (Figure 2.3).
[Diagram: the SaaS application vendor develops, the SaaS provider hosts, and the SaaS customer uses the SaaS application.]
Figure 2.3: Key Roles in a SaaS Environment
SaaS is popular with small and medium sized enterprises because it is the responsibility of SaaS application vendors to manage the technology. SaaS customers benefit from SaaS as there is no need for an external delivery method to acquire an IT infrastructure for their needs. They can access the application from any location at any time using a network. SaaS providers rent one implementation of software to multiple customers, hence a new functionality or feature is available to all the customers when it is added.
2.3.2 Deployment Patterns in SaaS
Based on the need for customer-specific implementation and adaptation, a SaaS application can be deployed using one of the three basic patterns listed below [26, 54, 60, 61].
Single Instance :
In the Single Instance pattern, all the tenants use the same instance of the application, which means that the same instance is shared by all the customers. This pattern exploits the commonality between SaaS applications and provides the same workflow using the same code on the same infrastructure for all the tenants.
Single Configurable Instance :
As in the case of the Single Instance pattern, all tenants in the Single Configurable Instance pattern also use the same instance of the application. The difference here is that there is support for configuration on a per-tenant basis in this pattern. The instance is adapted during run-time whenever the application is invoked by the tenant. This pattern exploits the commonality between the SaaS applications and manages variability by means of metadata like configuration files.
Multiple Instance :
Unlike the previous two patterns, each tenant in the Multiple Instance pattern uses a different instance of the SaaS application. This pattern supports variability but requires separate code to be deployed for each tenant. Although this pattern allows for the most flexible adaptation to customer requirements, it has disadvantages in terms of cost of implementation, time-to-market and operating expenses.
Table 2.1 lists the differences between the three deployment patterns discussed above on the following three parameters: instance sharing, workflow variations and codebase. We see from the table that there is a need for customization in SaaS applications deployed using one of the two single instance patterns. We also see that the extent of configurability needed in a SaaS application depends on the customer’s requirements and that configuration support is needed for satisfying diverse customer requirements in a multi-tenant environment. To respect the privacy of customers’ data, the configuration data remains specific to a tenant (or a group of tenants) for an application.
                     Single Instance                Single Configurable Instance                 Multiple Instances
Instance sharing     Same instance for all tenants  Same configurable instance                   Different instance
Workflow variations  Same workflow for all tenants  Configurable workflow based on requirements  Different workflow
Codebase             Same codebase for all tenants  Same codebase with configuration support     Multiple installations with different codebases
Table 2.1: Deployment Patterns in SaaS.
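The Single Configurable Instance pattern can be illustrated with a minimal Python sketch. It assumes a shared default workflow and per-tenant metadata that is consulted each time the application is invoked. The tenant identifiers, step names and the `skip` / `insert_after` keys are hypothetical and chosen only for illustration; they are not taken from the systems discussed in this chapter.

```python
# One shared codebase and one shared default workflow for every tenant.
DEFAULT_WORKFLOW = ["receive_order", "check_credit", "ship_order"]

# Hypothetical per-tenant metadata (it could equally live in configuration files).
TENANT_CONFIG = {
    "tenant_a": {},  # no overrides: uses the default workflow
    "tenant_b": {"skip": ["check_credit"]},
    "tenant_c": {"insert_after": {"receive_order": "manager_approval"}},
}


def resolve_workflow(tenant_id):
    """Adapt the shared workflow at invocation time from tenant metadata."""
    config = TENANT_CONFIG.get(tenant_id, {})
    skipped = set(config.get("skip", []))
    extra = config.get("insert_after", {})
    steps = []
    for step in DEFAULT_WORKFLOW:
        if step in skipped:
            continue
        steps.append(step)
        if step in extra:
            # Tenant-specific step placed directly after an existing one.
            steps.append(extra[step])
    return steps
```

In this sketch every tenant runs the same code, and only the metadata differs. Under the Multiple Instance pattern each tenant would instead have its own deployed copy of the workflow code.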
2.3.3 Need for Configuration in SaaS
The benefit of the SaaS model comes from exploiting economies of scale on the provider side by serving multiple tenants using the same software infrastructure. The two single instance patterns discussed previously bring in the idea of multi-tenant aware applications. Such applications must therefore allow multiple tenants to customize individual parts of an application by exchanging them for their own custom implementations.
In order to serve a significant number of tenants, SaaS applications have to be made customizable so that the varying functional and quality requirements of individual tenants can be fulfilled. As a consequence, SaaS providers need to ensure that there are enough commonalities in the variants of their SaaS application so that economies of scale can be exploited. There is, thus, a need to allow tenant-specific configuration and customization in SaaS applications.
The terms configuration and customization can create some confusion as they have been misused at times. If a software artefact is configurable, the product is already deployed and runs as designed, and the behavior of the system can be changed by means such as editing configuration files or adjusting settings through a user interface. Customization, in contrast, requires changing existing source code or adding new source code to the system in order to change its behavior.
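The distinction between the two terms can be made concrete with a short Python sketch. The first half changes behavior through data alone (configuration). The second half adds new source code for one tenant and exchanges it for a default part (customization). All names, rates and the discount rule are hypothetical and serve only as an illustration.

```python
# Configuration: behavior changes through data only; no source code is touched.
SETTINGS = {"tenant_a": {"tax_rate": 0.05}, "tenant_b": {"tax_rate": 0.20}}


def invoice_total(tenant_id, amount):
    """Shared code path; the tenant's configured rate alters the result."""
    rate = SETTINGS[tenant_id]["tax_rate"]
    return round(amount * (1 + rate), 2)


# Customization: new source code is written and swapped in for a default part.
def default_discount(amount):
    """Default part shipped with the application: no discount."""
    return 0.0


def tenant_b_discount(amount):
    """Code written specifically for tenant_b (an illustrative rule)."""
    return 10.0 if amount >= 100 else 0.0


# Registry of tenant-supplied implementations that replace the default part.
DISCOUNT_IMPL = {"tenant_b": tenant_b_discount}


def discount_for(tenant_id, amount):
    """Use the tenant's custom implementation if one is registered."""
    impl = DISCOUNT_IMPL.get(tenant_id, default_discount)
    return impl(amount)
```

Changing `SETTINGS` needs no redeployment of code, whereas supporting a new discount rule means writing and deploying a new function.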
2.4 Related Work: Workflow Variability in SaaS
Variability management is a fundamental activity in software product line engineering. Variability is defined as the ability of a system, an asset, or a development environment to support the production of a set of artifacts that differ from each other in a preplanned fashion [12, 13, 53]. In their work, Kim et al. have presented types of variability supported in component-based development (CBD) [53]. They have defined key concepts such as variability, variation point, variants, types and scopes. They have also identified five types of variability that can exist in CBD: attribute, logic, workflow, persistency and interface variability. We make use of these workflow variability concepts discussed in [53] and apply them to SaaS workflow applications. The ideas of variability management have also been extensively studied in other works such as [18, 27, 70, 74, 76].
2.4.1 Multi-tenancy in SaaS Applications
Software-as-a-Service can be described as software that is deployed as a hosted service and is accessible over the Internet [26]. SaaS eliminates the need to install software on a customer’s own resources and has proven advantageous with benefits like pay-as-you-go, little or no investment in hardware, license sharing and network-based access. This model also brings significant savings for SaaS service providers, such as a single point of management, better resource utilization and support for a range of customers. This, however, requires SaaS applications to support multi-tenancy, since it creates significant economies of scale for the application vendors. To attract a significant number of tenants, SaaS applications must be made configurable or customizable to fulfill the varying functional and quality requirements of individual tenants [61]. Also, to realize the true benefits of SaaS, systematic and effective processes and methods to support the development of SaaS services are needed [55].
Chong et al. describe how economies of scale are achieved by deploying an application as a SaaS offering and discuss the challenges and benefits of developing and offering SaaS [26]. Their work goes on to describe the four-level SaaS Maturity Model, with the levels being: Ad Hoc (or Custom), Configurable, Configurable with multi-tenancy, and Scalable with configurability and multi-tenancy. The maturity model for a service should be decided based on the amount of isolation required in customers’ data. In another article, they explain three distinct approaches for creating data architectures while designing multi-tenant SaaS applications: separate database, separate schema and shared schema [36].
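The three data architecture approaches differ in where a tenant's data lives and how it is isolated. The Python sketch below shows one way to resolve a tenant's table under each approach. The naming scheme (`db_<tenant>`, `schema_<tenant>`, `shared_db`, `public`) and the `tenant_id` discriminator column are assumptions made for this illustration and are not prescribed by [36].

```python
def locate_table(approach, tenant_id, table):
    """Return (database, qualified_table, row_filter) for a tenant's table.

    row_filter is None when isolation is structural, or a dict naming the
    discriminator column and value when tenants share the same tables.
    """
    if approach == "separate_database":
        # Each tenant has its own database.
        return (f"db_{tenant_id}", f"public.{table}", None)
    if approach == "separate_schema":
        # One database; each tenant has its own schema (set of tables).
        return ("shared_db", f"schema_{tenant_id}.{table}", None)
    if approach == "shared_schema":
        # One database and one set of tables; rows carry a tenant identifier.
        return ("shared_db", f"public.{table}", {"tenant_id": tenant_id})
    raise ValueError(f"unknown approach: {approach}")
```

Moving from the first approach to the third increases sharing, and with it the economies of scale, at the price of weaker isolation between tenants' data.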
Although multi-tenancy results in substantial benefits for the service providers, work by Bezemer and Zaidman argues that a wrong architectural choice might prove costly [16]. They identify configuration and versioning of source code as major challenges because in a multi-tenant application all the users are served by the same installation (same codebase). They propose an architectural approach to introduce multi-tenancy into an application in a managed way. The work by Nitu addresses how to effectively and efficiently support configurability in SaaS software and proposes a SaaS-based architecture to support configurability [64]. However, this work does not address workflow variability in SaaS applications.
2.4.2 Variability Modeling for Customization Support
Various methods using variability modeling to support customization and deployment of multi-tenant aware SaaS applications have been proposed in the literature [60, 47, 23, 63]. In their work on customization process for SaaS applications, Mietzner and Leymann suggest that an application needs to provide a set of variability points that can be directly modified by the customers [60]. They describe the notion of a variability descriptor which is used to define variability points for the process layer of SaaS applications. They also suggest the use of customization processes by template filling to develop robust customization tools for SaaS applications.
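The idea of variability points filled in through a template can be sketched in a few lines of Python. This is only an illustration of the general technique using the standard `string.Template` class. It does not reproduce the variability descriptor format of [60], and the point names and allowed values are hypothetical.

```python
from string import Template

# A process-layer template with two variability points, written as $placeholders.
PROCESS_TEMPLATE = Template(
    "approve_if amount <= $approval_limit; notify $notify_channel"
)

# Descriptor: for each variability point, the alternatives a tenant may choose.
VARIABILITY_POINTS = {
    "approval_limit": {"1000", "5000"},
    "notify_channel": {"email", "sms"},
}


def customize(choices):
    """Fill the template after checking each choice against the descriptor."""
    for point, allowed in VARIABILITY_POINTS.items():
        if point not in choices:
            raise KeyError(f"unbound variability point: {point}")
        if choices[point] not in allowed:
            raise ValueError(f"{choices[point]!r} not allowed for {point}")
    return PROCESS_TEMPLATE.substitute(choices)
```

Validating the choices against the descriptor before substitution is what keeps tenant customizations within the preplanned set of variants.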
Nanjangud et al. propose a set of mechanisms called Variation-Oriented Engineering (VOE) as a comprehensive formal approach for modeling end-to-end variability in SOA-based (Service Oriented Architecture) solutions for the purpose of enhancing reusability [63]. The authors claim this to be the first fully end-to-end approach for modeling variability in SOA-based solutions. Their approach is based on the principle of modeling a solution via a software component’s static and variable parts and helps in semi-automatically generating variants of the solution to meet changing requirements.
Jegadeesan et al. use the principles of aspect-oriented software development to modularize the variability