The ability to manage large datasets. Dynamic Cloud Deployment of a MapReduce Architecture. Cloud Computing


Dynamic Cloud Deployment of a MapReduce Architecture

Steve Loughran, Jose M. Alcaraz Calero, Andrew Farrell, Johannes Kirschnick, and Julio Guijarro

Hewlett-Packard Laboratories

Cloud-based MapReduce services process large datasets in the cloud, significantly reducing users’ infrastructure requirements. Almost all of these services are cloud-vendor-specific and thus internally designed within their own cloud infrastructures, resulting in two important limitations. First, cloud vendors don’t let developers see and evaluate how the MapReduce architecture is managed internally. Second, users can’t build their own private cloud-infrastructure-based offerings or use different public cloud infrastructures for deploying MapReduce services. The authors’ proposed framework enables the dynamic deployment of a MapReduce service in virtual infrastructures from either public or private cloud providers.

The ability to manage large datasets is becoming more important in research and business environments. Researchers are demanding tools that can quickly process large amounts of data, while businesses require new solutions for data warehousing and processing business intelligence. The MapReduce programming model [1] addresses these problems in many scenarios. Probably the most successful example is the indexing system that produces the data structures Google uses for its Web search service [1].

The main challenge associated with processing large datasets is the infrastructure required, which can demand large investments up front to cope with the highest anticipated workload; actual platform usage must justify this investment. However, demand for the processing platform can fluctuate, creating periods of over- and underutilization. Thus, cloud computing can significantly reduce the infrastructure capital expenditure, providing new business models in which providers offer on-demand virtualized infrastructures in a pay-as-you-go fashion. In this new model, elastic infrastructures can dynamically adapt to the consumer’s data processing requirements.

Some cloud providers have begun offering services for large data processing based on the MapReduce programming model; a prominent example is Amazon Elastic MapReduce (EMR; http://aws.amazon.com/elasticmapreduce/). In examining these services, however, we discovered several limitations. First, cloud providers offer such services in a ready-to-use fashion (that is, as platform as a service) and don’t provide any details about implementation or how the services work internally (for instance, how MapReduce jobs are deployed, configured, and executed). This hampers developers’ ability to evaluate service efficiency and can make the development of tools for efficiently deploying large distributed systems more difficult. Second, clients can’t control the MapReduce software stack and its configuration, which can lead to optimization, performance, and compatibility problems. Third, these services are always vendor-specific, preventing clients from using multiple cloud providers or their own private cloud infrastructure, or even from offering their own elastic cloud-based MapReduce service. Note that the use of private clouds can be especially relevant when sensitive information is processed, which is an open issue for public clouds.

Our proposed architecture lets developers dynamically deploy a MapReduce service in virtual infrastructures, enabling them to employ different combinations of public and private clouds and exposing transparently how the service is managed. Our MapReduce service is customizable and deployed using SmartFrog [2], configuration management software that hides the complexity involved in service provisioning from users while letting them retain full control over the service’s individual aspects. We use the Hadoop MapReduce implementation to validate our architecture.

MapReduce Programming Model

The MapReduce programming model provides a framework for processing large datasets in parallel [1]. Figure 1 shows a high-level overview of this framework’s execution steps.

The framework splits the initial large dataset into processing chunks according to a predefined split function, which defines elemental processing units such as a database row, a column, or a line of a file. This function must generate <key, value> pairs, as in <position of the line in a file, content of the line>.

The framework assigns each split result to a worker, which performs the map phase. In this phase, the developer-specified map function produces intermediate <key, value> pairs from the input stream. Workers automatically group these tuples, sort them by key, and forward them to the reduce phase, in which a worker applies the reduce function to all values associated with a particular key. Consequently, each worker produces a partial output of the data, which is usually stored in an output file. Developers can optionally specify a merge function to join the outputs from two or more reducers.

Figure 1. MapReduce execution overview. Data flows across the different phases involved in the MapReduce programming model. Each phase executes in a highly distributed environment. The map phase establishes how to group data for processing, whereas the reduce phase processes such groups of data. (Diagram: in the split phase, a large input dataset is divided into <K1, V1> pairs; workers in the map phase emit <K2, V2> pairs; workers in the reduce phase write output files to the distributed file system.)

Developers need only specify the split, map, reduce, and merge functions; they can avoid dealing with aspects related to parallel programming. The framework schedules, monitors, and checkpoints individual jobs.
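The four developer-supplied functions can be illustrated with a minimal, single-process word-count sketch in Python. It is not Hadoop code and runs nothing in parallel; the function names, the CRC32-based partitioning, and the in-memory grouping are assumptions made only to show how split, map, reduce, and merge fit together.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def split_lines(text: str) -> Iterator[Tuple[int, str]]:
    """Split phase: emit <position of the line, content of the line> pairs."""
    position = 0
    for line in text.splitlines(keepends=True):
        yield position, line.rstrip("\r\n")
        position += len(line)


def map_words(_position: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map function: emit one intermediate <word, 1> pair per word."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce function: combine all values that share one key."""
    return word, sum(counts)


def run_job(text: str, reducers: int = 2) -> List[Dict[str, int]]:
    """Run split, map, group/sort, and reduce; return one output per reducer."""
    groups: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(reducers)]
    for position, line in split_lines(text):
        for key, value in map_words(position, line):
            # A stable hash routes every occurrence of a key to the same reducer.
            groups[zlib.crc32(key.encode("utf-8")) % reducers][key].append(value)
    # Each reducer processes its keys in sorted order and yields a partial output.
    return [dict(reduce_counts(key, group[key]) for key in sorted(group)) for group in groups]


def merge_outputs(outputs: List[Dict[str, int]]) -> Dict[str, int]:
    """Optional merge function: join the partial outputs of all reducers."""
    merged: Dict[str, int] = {}
    for partial in outputs:
        merged.update(partial)
    return merged
```

In a real deployment the framework, not the developer, performs the grouping, sorting, and distribution that `run_job` simulates here.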

Conceptually, our proposed architecture is a master-slave model, with many slave nodes (workers) processing data; a master node assigns, controls, and synchronizes all the slave nodes’ jobs. More information on the MapReduce programming model is available elsewhere [1].

Infrastructure as a Service

An infrastructure-as-a-service (IaaS) cloud provider offers computation and storage resources to third parties. IaaS providers act as resource brokers, providing access to their infrastructure and services in a pay-as-you-go model, and leveraging virtualization to enable secure resource sharing. Figure 2 shows a conceptual design of an IaaS with a set of typically offered core services.

The physical layer encompasses all computational resources found in one or more data centers; the virtualization layer lets users share these resources in a secure, isolated fashion; and the IaaS layer manages these virtual resources efficiently. An IaaS typically supports the following core services: user management, controlled access to infrastructure resources, resource usage accounting, and the ability to create virtual infrastructures on demand.

As Figure 2 shows, the IaaS API gives third parties access to infrastructure services via the Internet; our cloud service uses this API to automatically create the infrastructure needed for the MapReduce architecture.

Figure 2. Infrastructure as a service. This conceptual design supports a set of typically offered core services. (Diagram: the physical layer comprises nodes 1 to N on an internal network; the virtualization layer hosts VMs A to M on each node; the IaaS layer contains a farm controller with image management, user management, VM management, smart locator, autonomic capabilities, accounting, security, and visibility rules; the IaaS API connects it to the external network.)

Managing Large Datasets with MapReduce

Our proposed architecture lets developers dynamically deploy a MapReduce service in elastic virtual infrastructures. Figure 3 shows its main components: an elastic MapReduce service offering data processing for third-party applications through an API, and above that a Web-based GUI that gives users direct access.

The Web-based component is a user-friendly front end for the data processing services the elastic MapReduce service provides. The MapReduce service comprises several layers, which we describe next. The main monitoring and deployment components are flexible enough to be deployed in any type of machine, physical or virtual, potentially located either client-side or in a public cloud. Deployment in a public cloud leads to an offering similar to Amazon EMR, but uses a white-box approach, letting users see what’s happening underneath. Hosting it client-side enables deployment in multiple private and public clouds, all of which will have different connectivity requirements. Public clouds are more restrictive due to their low bandwidth, firewalled communications, data security, and so on.

Next, we assume a scenario in which our framework is deployed client-side, with intensive data processing done in a public cloud. Note, however, that the framework lets developers exclusively employ private clouds to process sensitive information.

MapReduce Job Management Layer

The MapReduce job management layer (see Figure 3) provides users with an access point to the service via an HTTP REST interface. This interface lets users create, configure, and execute new jobs and set parameters related to configuring each job.

The framework first captures the virtual infrastructure parameters for data processing: the number of master and slave nodes, the cloud provider, and the boot volume. Next, it captures input files and the output folder. Developers must upload the input files separately using the API if those files aren’t already present in the deployed infrastructure. Then, the framework captures the job configuration parameters: the selected split, map, reduce, and merge functions, as well as the cloud provider’s credentials, such as the username/password and authentication token. This is the only user interaction needed; the framework handles the rest of the process automatically.

Figure 3. Cloud service for data management using MapReduce. The architecture has a Web-based GUI that gives users direct access, on top of an elastic MapReduce service offering data processing as a service for third-party applications through an API. (Diagram labels: Web-based GUI; elastic MapReduce service with the MapReduce job management layer, the monitoring layer with a Hadoop management Web-based GUI and a Hadoop/X-Trace connector, the automatic deployment layer with SmartFrog, and the infrastructure abstraction layer with Open Nebula, HP Cells, and Amazon EC2 connectors; each connector uses its provider’s IaaS API to create, delete, configure, and monitor virtual machines.)

After capturing the input data, the framework executes the job, triggering the deployment of the virtual infrastructure using the automatic deployment layer. To track and monitor running MapReduce jobs, we can use services from the monitoring layer, described later. Finally, the job management layer lets developers define complex jobs as dataflows in which several jobs are chained together to achieve complex data processing flows. Developers can then submit such workflows to this layer for execution.
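The parameters that the job management layer captures can be pictured as a single request document. The following Python sketch is only an illustration: the field names, the JSON encoding, and the validation rules are assumptions, not the service’s actual REST schema, and credentials are deliberately left out of the payload.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class JobRequest:
    """Hypothetical job description mirroring the parameters listed above."""
    cloud_provider: str       # which connector to use, e.g. "hp-cells"
    master_nodes: int         # infrastructure parameters
    slave_nodes: int
    boot_volume: str
    input_files: List[str]    # data locations
    output_folder: str
    split_class: str          # job configuration parameters
    map_class: str
    reduce_class: str
    merge_class: str = ""     # the merge function is optional

    def validate(self) -> None:
        """Reject requests that cannot describe a runnable job."""
        if self.master_nodes < 1 or self.slave_nodes < 1:
            raise ValueError("at least one master and one slave node are required")
        if not self.input_files:
            raise ValueError("at least one input file is required")

    def to_json(self) -> str:
        """Serialize the validated request as a JSON document."""
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


request = JobRequest(
    cloud_provider="hp-cells",
    master_nodes=1,
    slave_nodes=2,
    boot_volume="ubuntu-smartfrog",
    input_files=["file://localhost/home/input.dat"],
    output_folder="file://localhost/home/output/",
    split_class="com.hp.cloud.hadoop.Splitter",
    map_class="com.hp.cloud.hadoop.Mapper",
    reduce_class="com.hp.cloud.hadoop.Reducer",
)
body = request.to_json()  # candidate body for an HTTP POST to the job endpoint
```

A chained workflow could then be expressed as an ordered list of such requests, with each job’s output folder serving as the next job’s input.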

Automatic Deployment Layer

This layer deploys the virtual infrastructure, installs and configures the MapReduce implementation, and executes MapReduce jobs. Users trigger this process by starting a new job. The deployment layer uses the job configuration parameters the job management layer has captured. Figure 4 shows an overview of the automated deployment steps.

First, the deployment layer creates the required number of virtual machines (VMs, as master and slave nodes) and starts them within the selected cloud provider. It then uses the infrastructure provider abstraction layer to abstract specific parameters of individual providers by presenting a common, high-level interface (step 1 in Figure 4).

VMs are driven by the boot volume attached to them. Most cloud providers use a preloaded boot volume that might contain a preconfigured MapReduce implementation for slave and master nodes, that is, a static provisioning of the MapReduce service. This approach doesn’t fit well with public cloud environments for several reasons. First, each image update must be manually downloaded, mounted, updated, and uploaded (at several Gbytes) across the Internet. This also makes keeping the MapReduce implementation up to date difficult. Second, users must manually select all nondefault, job-specific configuration parameters. Third, no automated management of the infrastructure and services is available. Finally, this approach lacks a runtime model that can provide current status on the infrastructure and services.

Figure 4. The automatic deployment process. The architecture creates a set of virtual machines (VMs) according to user-provided specifications. It automatically sets up a complete MapReduce architecture using a service catalog, then processes the data. (Diagram: numbered steps 1 to 5 connect the Web interface, the architecture, the service catalog, and the IaaS API to a P2P network of seven VMs, one master and six slaves, each running a SmartFrog daemon. VM: virtual machine; SF: SmartFrog daemon; IaaS: infrastructure as a service.)

In our architecture, all VMs share a boot volume. This boot volume consists of a clean OS installation with a configuration management tool (CMT) installed as a daemon, which launches automatically during the OS booting process (step 2 in Figure 4). The daemon dynamically provisions services at runtime, receiving software installation and configuration instructions and executing them on the local OS. This approach lets developers dynamically create the MapReduce architecture, thus overcoming the shortcomings of a preloaded boot volume.

Puppet [3], Chef [4], and CFEngine3 [5] are client-server CMT solutions originally designed for distributed environments. Although they effectively configure software artifacts, they don’t cope well with cloud environments in which developers must create virtual infrastructure to provision services. SmartFrog [2] and SLIM [6] were designed for cloud environments and use a peer-to-peer (P2P) and a client-server architecture, respectively.

We use SmartFrog for several reasons:

• It’s a pure P2P architecture with no central point of failure.

• It enables fault-tolerant deployment, which is critical to intensive data processing in virtual environments, where resources are out of the user’s control and could spontaneously disappear.

• It facilitates different communication protocols between its daemons that are suitable for both private and public cloud environments.

• It enables dynamic reconfiguration capabilities to change infrastructure and services at runtime.

• It keeps a model of the current deployment status that can help drive autoscaling based on observed metrics.

• It enables developers to use late-binding configuration information to configure services, that is, information that becomes available only after a process step has been reached.

The latter is especially important in cloud environments where users don’t usually have control over resource names such as IP addresses. Here, we use SmartFrog to inform the slave nodes of the master node’s IP address.
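Late binding can be pictured as a placeholder that is resolved only once the referenced value exists. The Python sketch below is a conceptual illustration, not SmartFrog’s implementation; the `Lazy` class, the `resolve` function, and the shape of the runtime model are all assumptions.

```python
from typing import Any, Dict


class Lazy:
    """Placeholder for a value that is known only after deployment."""

    def __init__(self, component: str, attribute: str) -> None:
        self.component = component
        self.attribute = attribute


def resolve(config: Dict[str, Any], runtime: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Replace every Lazy placeholder with the value found in the runtime model."""
    resolved: Dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, Lazy):
            try:
                resolved[key] = runtime[value.component][value.attribute]
            except KeyError:
                # The referenced component has not reached the required step yet.
                raise LookupError(
                    f"{value.component}.{value.attribute} is not available yet"
                ) from None
        else:
            resolved[key] = value
    return resolved


# The slave's configuration is written before the master's address exists.
slave_config = {"role": "slave", "master_ip": Lazy("masterNode", "ip")}

# Once the provider has assigned an address, the reference can be bound.
runtime_model = {"masterNode": {"ip": "10.0.0.1"}}
bound = resolve(slave_config, runtime_model)  # master_ip is now "10.0.0.1"
```

Resolving against an empty runtime model raises `LookupError`, which mirrors the idea that a slave cannot be configured before the master’s address is known.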

Once all the VMs are booted and the SmartFrog daemons are running, we must install and configure the MapReduce implementation. We use Hadoop, a well-known MapReduce Java implementation. To drive installation, the automatic deployment layer generates a configuration file using the configuration parameters for the given job. In this scenario, the deployment layer randomly chooses one VM to be the master, and the others become slaves. The master VM receives and processes the configuration file (step 3 in Figure 4) and acts as the master node for the Hadoop architecture.

The configuration file contains all the necessary information for installing and configuring the entire Hadoop framework and executing the MapReduce jobs. Figure 5 depicts a simplified version of a generated configuration file using the SmartFrog syntax.

The example file in Figure 5 defines a master node and two slave nodes. It also contains the dependency information as state dependencies for each component. In this example, the master node must run before the slave nodes can be deployed. Note that this example uses late-binding variables (using the reserved word LAZY).
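The state dependencies in the Figure 5 example imply a deployment order. As a rough illustration, the Python sketch below derives that order with the standard library’s `graphlib.TopologicalSorter` (Python 3.9 or later). It is not how SmartFrog evaluates dependencies; the dictionary encoding and the `deployment_waves` helper are assumptions, and only the component names come from the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def deployment_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group components into waves; each wave needs only earlier waves to be running."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as running
    return waves


# Each component maps to the components that must be running first.
dependencies = {
    "slaveNode1": {"masterNode"},
    "slaveNode2": {"masterNode"},
    "job": {"masterNode", "slaveNode1", "slaveNode2"},
}

waves = deployment_waves(dependencies)
# -> [["masterNode"], ["slaveNode1", "slaveNode2"], ["job"]]
```

Components in the same wave have no dependencies on one another, so they could be installed in parallel.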

The automated deployment layer submits this configuration file to the SmartFrog daemon in the master VM, which fragments it and propagates the fragments across the other SmartFrog daemons in the P2P network (step 4 in Figure 4). For the master node, SmartFrog processes the associated fragment by installing and starting the Hadoop master node components (NameNode and JobTracker). All the slave nodes proceed in a similar manner: on receiving a file fragment and processing it, SmartFrog installs the Hadoop slave components (DataNode and TaskTracker).

SmartFrog uses application repositories to retrieve the necessary application files. These repositories store the Hadoop packages and configuration templates that will be copied and executed in the VMs. These repositories can also be hosted within the cloud provider itself, minimizing the network delay to further reduce the time required for provisioning the VMs. A comprehensive description of how to deal with the automatic installation and configuration of applications in the cloud is available elsewhere [7].


Once Hadoop is running, developers upload the input data to Hadoop’s internal distributed file system (HDFS). After that, the deployment layer submits the MapReduce job to the Hadoop master node, triggering the job’s execution. Once the MapReduce job has finished, the deployment layer extracts its output files from Hadoop and copies them into the file system the service uses, making them accessible to the user via the REST interface.

Monitoring Layer

The monitoring layer tracks the MapReduce jobs and presents this information to users. It monitors the VMs and the running MapReduce components. Monitoring information can either be periodically pulled from the components or pushed by them. To do this, the master VM must run monitoring software suitable for the specific Hadoop implementation, which exposes job statistics. Different plugins can provide additional information depending on the software stack and cloud provider used. For example, the system for monitoring the infrastructure provider can yield OS metrics, whereas the X-Trace [8] monitoring software can provide specific Hadoop metrics.

Figure 5. SmartFrog configuration file example using a Hadoop MapReduce implementation. The configuration file contains all necessary information for installing and configuring the entire Hadoop framework and executing the MapReduce jobs.

    masterNode extends HadoopMasterNode { //Install "hadoop-master", "hadoop-nodename", "hadoop-dataname" packages
      sfProcessHost "10.0.0.1"; //VM IP
      slaveNodeList LAZY [slaveNode1, slaveNode2]; //List of slave nodes
      running true;
    }

    slaveNode1 extends HadoopSlaveNode { //Install "hadoop-slave" package
      sfProcessHost "10.0.0.101"; //VM IP
      masterNodeList LAZY [masterNode]; //List of master nodes
      running true;
    }

    slaveNode2 extends HadoopSlaveNode { //Install "hadoop-slave" package
      sfProcessHost "10.0.0.102"; //VM IP
      masterNodeList LAZY [masterNode]; //List of master nodes
      running true;
    }

    job extends MapReduceJob { //Job specifications
      masterNode LAZY masterNode;
      dataInput "file://localhost/home/input.dat";
      dataOutput "file://localhost/home/output/";
      source "file://localhost/home/wordcount.jar";
      MapClass "com.hp.cloud.hadoop.Mapper";
      ReduceClass "com.hp.cloud.hadoop.Reducer";
      SplitterClass "com.hp.cloud.hadoop.Splitter";
    }

    //Deployment dependencies for the MapReduce infrastructure
    dependsOn(slaveNode1, masterNode::running == true);
    dependsOn(slaveNode2, masterNode::running == true);
    dependsOn(job, masterNode::running == true && slaveNode1::running == true && slaveNode2::running == true);

Infrastructure Provider Abstraction Layer

The infrastructure provider abstraction layer (see Figure 3) helps homogenize the IaaS API the cloud infrastructure offers, presenting a high-level view across different public and private providers. Developers can add additional providers by supplying new connector implementations. These connectors map the IaaS API the provider offers to a homogeneous interface exposed to the upper layers.
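A connector of this kind can be sketched as an abstract interface with one implementation per provider. The Python below is an illustration under stated assumptions: the method names, the registry, and the in-memory mock are hypothetical and do not reproduce the released connectors or any provider’s real API.

```python
import itertools
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class CloudConnector(ABC):
    """Homogeneous interface that the upper layers program against."""

    @abstractmethod
    def create_vms(self, count: int, boot_volume: str) -> List[str]:
        """Start `count` VMs from the given boot volume and return their identifiers."""

    @abstractmethod
    def delete_vms(self, vm_ids: List[str]) -> None:
        """Release the given VMs."""


class MockConnector(CloudConnector):
    """In-memory stand-in for a provider, useful for testing without a cloud."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self.running: Dict[str, str] = {}

    def create_vms(self, count: int, boot_volume: str) -> List[str]:
        vm_ids = [f"mock-vm-{next(self._next_id)}" for _ in range(count)]
        for vm_id in vm_ids:
            self.running[vm_id] = boot_volume
        return vm_ids

    def delete_vms(self, vm_ids: List[str]) -> None:
        for vm_id in vm_ids:
            self.running.pop(vm_id, None)


# Adding a provider means registering one more CloudConnector subclass here.
CONNECTORS: Dict[str, Type[CloudConnector]] = {"mock": MockConnector}


def get_connector(provider: str) -> CloudConnector:
    """Look up the connector for a provider name chosen in the job request."""
    try:
        return CONNECTORS[provider]()
    except KeyError:
        raise ValueError(f"no connector registered for provider {provider!r}") from None
```

Because the deployment layer would depend only on `CloudConnector`, switching between a private and a public provider would change a single registry key rather than the deployment logic.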

Security Issues

We use existing encryption, authentication, and access control mechanisms to enable secure data processing. Although security in private clouds is less restrictive, using public cloud providers requires maximizing the security of the data exchange between our architecture and the hosting VMs. In such cases, the infrastructure provider connector handles this, using secure transport protocols such as Secure Shell (SSH), the SSH File Transfer Protocol, and HTTPS when possible. Moreover, the boot image might have a firewall and antivirus software installed, and the OS might be kept up to date.

How to securely process data in outsourced data centers is an open issue. We don’t offer a solution for this issue here, but we do try to minimize the length of time the data is stored in the cloud. Thus, data never persists in the boot disk. Input data for the MapReduce job is copied dynamically to the running VMs when necessary. Moreover, once the process has finished, the deployment layer copies the final results to local storage (assumed to be at the client side) and cleans the VM (including the MapReduce cache). We take this approach to maximize data privacy against unauthorized accesses (by both external users and the infrastructure provider).

Moreover, we don’t share the local file system used in our elastic MapReduce service between tenants. We isolate the file system by physically separating it for each tenant or providing an authorization system for controlling access to the files stored in the local file system, depending on the scenario.

Implementation

We implemented our architecture as a proof of concept and released it publicly under the GNU LGPL. (The elastic Hadoop service is available at http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/extras/hadoopcluster/.) Our implementation has two components: a RESTful Web service offering functionalities from the MapReduce job management layer and a front-end Web-based application. We can install this application in both servlet and portlet containers.

We implemented the elastic MapReduce service itself as a WAR artifact that developers can deploy in an existing Web application server. The boot volume comprises a clean installation of Ubuntu 9.04 and SmartFrog v3.17. We’ve added to the image additional components on which SmartFrog relies, such as apt-get for installing the Linux packages Hadoop requires. Furthermore, the SmartFrog Hadoop component is preinstalled, which provides the necessary classes used for mapping configuration parameters to the configuration file format Hadoop expects and controlling individual services (start, stop, and so on). To test our architecture, we developed several cloud connectors: the Hewlett-Packard private cloud HP Cells, VMware, and Mock, as well as Open Nebula and Amazon EC2 for public clouds.

Performance Statistics

To evaluate our framework’s performance, we conducted an experiment that aimed to measure the relationship between

• the time for creating the virtual infrastructure and for booting the OS (infrastructure creation time);

• the time for provisioning and starting the infrastructure with the MapReduce implementations (provisioning time); and

• the time for executing the MapReduce job, including data uploading and downloading (MapReduce execution time).

We also wanted to validate the architecture for automatically deploying Hadoop in the cloud, so we executed all these tests using a simple batch script that interacts with the REST interface, configures the jobs, submits them into the framework, and finally gathers the time information from the monitoring layer.


Although we can execute this sequence of batch jobs using workflow capabilities, we decided to execute each job in isolation, forcing the whole infrastructure to redeploy with every job. However, the framework could be optimized in production by exposing a do-not-undeploy parameter to users, which, if enabled, retains the deployed resources so the same user can reuse them for additional jobs, avoiding or minimizing the creation and provisioning times for subsequent executions.

The job was to sort 4 Gbytes of randomly generated records, spread across 10 files of 400 Mbytes each, generated using the Hadoop randomWriter sample application. Both randomWriter and the Terasort application are available in the latest Hadoop release. For comparison, the MapReduce job and its data are fixed, with only the number of workers varying. We created several virtual infrastructures ranging from 1 to 50 workers. For statistical relevance, we repeated each experiment 50 times; the values presented are averages over all test runs. We used HP Cells as the cloud provider, an exclusive private cloud used only for this experiment (no extra workload). This small cloud is composed of six physical blades with an Intel Core 2 Quad, 6 Gbytes of RAM, and a 500-Gbyte HDD, connected via gigabit LAN. Each blade is configured to manage up to eight VMs.

Figure 6 shows a clear reduction in the time spent executing the MapReduce job as we increase the number of available workers. We expected this result because the MapReduce framework is built to scale with available resources. The minimal fluctuation in the linear trend might be due to different VM placement in the physical infrastructure, which creates the need for additional network hops depending on whether VMs were deployed on the same machine, and to the sometimes unpredictable behavior of virtualization technologies in shared environments. The provisioning time is also constant and sometimes even achieves superlinearity due to SmartFrog’s ability to install and configure applications in parallel. On average, the overhead of service provisioning is around 10 percent of the infrastructure creation time, which keeps it within acceptable limits. The Hadoop execution and provisioning times are directly related to our proposed architecture, and validate its scalability.

Infrastructure creation time is almost constant up to 10 workers. After that, the time grows linearly (note that the x-axis steps aren’t constant) when the framework creates bigger virtual infrastructures. We can attribute this to our underlying cluster’s small size and thus to the limited number of resources available for the workers. Note that 30 workers means that each physical machine must support five VMs simultaneously along with the associated computational, disk access, and memory consumption overhead.

Figure 6. Scalability results. We can see a clear decrease in the time spent executing the MapReduce job as we increase the number of available workers. (Plot: execution time in seconds, from 0 to 2,000, against the number of workers, from 1 to 15 in steps of one and then 20, 30, 40, and 50; series: infrastructure creation time, MapReduce execution time, and provisioning time.)
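The figure of five VMs per physical machine for 30 workers follows from simple division over the six blades. The short Python sketch below reproduces that arithmetic; it assumes that worker VMs are spread evenly across blades and that only worker VMs are counted, neither of which the experiment description states explicitly.

```python
import math

BLADES = 6  # physical blades in the test cloud


def vms_per_blade(workers: int, blades: int = BLADES) -> int:
    """Largest number of worker VMs any blade hosts under an even spread."""
    return math.ceil(workers / blades)


# Worker counts taken from the upper part of the x-axis in Figure 6.
density = {workers: vms_per_blade(workers) for workers in (10, 20, 30, 40)}
# 10 workers -> 2 VMs per blade, 20 -> 4, 30 -> 5, 40 -> 7
```

Under these assumptions, the number of VMs sharing one blade’s CPU, disk, and memory more than doubles between 10 and 30 workers, which is consistent with the growth in infrastructure creation time described above.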

These results show that the virtual infrastructure creation time is significant with respect to the Hadoop execution time even in a private cloud, emphasizing the importance of letting users decide which cloud provider or private cloud to use if performance is a concern. Finally, real workloads might include sensitive data. Our architecture has successfully demonstrated that, in such cases, using a private cloud to process the data is acceptable because the data won’t cross security boundaries. We didn’t use a public cloud for this experiment because we wanted to control for external data fluctuations that, due to unpredictable overheads related to the physical machines, could yield different deployment times.

The MapReduce programming model has shown immense potential for processing large and unstructured datasets. In the future, we plan to improve our cloud service with algorithms that will automatically schedule tasks using the most suitable cloud provider to maximize performance while minimizing price. Moreover, we expect to provide autoscaling capabilities to the architecture according to certain business policies as well as deal with fault-tolerance capabilities. We plan to adapt the current architecture to the recently announced Hadoop NextGen architecture. We also hope to explore workflow languages for expressing advanced complex data processing jobs and how we can adapt these languages for the cloud environment.

Related Work in Cloud Services for MapReduce Data Management

Other researchers have proposed cloud services for data management. Robert Grossman and Yunhong Gu explain the design and implementation of a high-performance cloud specifically designed to archive, analyze, and mine large distributed datasets [1]. They describe the advantages of using cloud infrastructure for processing such datasets.

Kate Keahey and her colleagues present a cloud platform targeted at scientific and educational projects intended to facilitate experiments on Amazon Elastic Compute Cloud (EC2)-style cloud providers [2]. The authors remark that the use of Hadoop on their platform dominated during the project’s lifetime and attribute this fact to the science community’s growing interest and the advantages achieved by combining MapReduce with on-demand cloud infrastructure providers. In fact, Amazon Elastic MapReduce already offers this service. However, it’s a black-box, ready-to-use service that employs EC2 as a cloud provider and doesn’t describe its architecture or let users customize the Hadoop software stack. Moreover, it doesn’t provide tools for letting users become their own Elastic MapReduce providers, especially when processing sensitive data.

One proposal closely related to ours is an implementation of the MapReduce programming model on top of EC2 [3]. The authors focus on how to handle failure detection/recovery and conflict resolution as regards MapReduce nodes; control latency; and track jobs, statistics, and so on. However, they don’t detail how the MapReduce architecture is deployed, configured, and executed, which hampers other researchers’ ability to reproduce or validate this proposal.

Mesos is a platform for sharing commodity clusters among multiple diverse cluster computing frameworks, such as Hadoop and the Message Passing Interface (MPI) [4]. Sharing improves cluster utilization and avoids per-framework data replication. This work is complementary to the architecture we present in the main text and could be deployed together with Hadoop, enabling data locality by reading and computing data stored on the machine that holds it and offering similar advantages to our framework.

Moreover, other proposals such as Hadoop on Demand (HOD; http://hadoop.apache.org/common/docs/r0.17.0/hod.html) and the Cloudera Hadoop distribution (www.cloudera.com) offer client-side alternatives for deploying Hadoop on demand. However, none of these solutions cover the on-demand infrastructure creation in which Hadoop is installed as part of the service’s deployment process.

References

1. R. Grossman and Y. Gu, “Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, ACM, 2008, pp. 920–927.

2. K. Keahey et al., “Science Clouds: Early Experiences in Cloud Computing for Scientific Applications,” Proc. Int’l Conf. Cloud Computing and Its Applications, 2008; www.cca08.org/papers/Paper39-Kate-Keahey.pdf.

3. H. Liu and D. Orban, Cloud MapReduce: A MapReduce Implementation on Top of a Cloud Operating System, tech. report, Accenture Technology Labs, 2009.

4. B. Hindman et al., “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” Proc. 8th Usenix Symp. Networked Systems Design and Implementation, Usenix Assoc., 2011, pp. 1–14.


Acknowledgments

Thanks to Fundación Séneca for sponsoring Jose M. Alcaraz Calero under the postdoctoral grant 15714/PD/10.

References

1. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proc. 6th Symp. Operating Systems Design and Implementation, Usenix Assoc., 2004, pp. 137–149.

2. P. Goldsack et al., “The SmartFrog Configuration Management Framework,” ACM SIGOPS Operating Systems Rev., vol. 43, no. 1, 2009, pp. 16–25.

3. J. Turnbull, Pulling Strings with Puppet: Configuration Management Made Easy, Apress, 2009.

4. A. Jacob, “Infrastructure in the Cloud Era,” Proc. Int’l O’Reilly Velocity Web Performance and Operations Conf., O’Reilly, 2009; www.web2expo.com/webexsf2009/public/schedule/detail/7771.

5. M. Burgess, “Knowledge Management and Promises,” Scalability of Networks and Services, LNCS 5637, Springer, 2009, pp. 95–107.

6. J. Kirschnick et al., “Towards an Architecture for the Automated Provisioning of Cloud Services,” IEEE Communications Magazine, vol. 48, Dec. 2010, p. 12.

7. J. Kirschnick et al., “Towards a P2P Framework for Deploying Services in the Cloud,” Software: Practice and Experience, vol. 42, no. 1, 2012, pp. 395–408.

8. R. Fonseca et al., “X-Trace: A Pervasive Network Tracing Framework,” Proc. 4th Usenix Symp. Networked Systems Design & Implementation, Usenix Assoc., 2007, pp. 271–284.

Steve Loughran is a researcher at Hewlett-Packard Laboratories and is one of the Apache committers of the Hadoop project. His research areas include cloud computing and intensive data processing in distributed architectures. Loughran has a BSc (Hons) in computer science from the University of Edinburgh. Contact him at [email protected].

Jose M. Alcaraz Calero is a researcher at Hewlett-Packard Laboratories. His research areas include cloud computing, security, policy-based systems, and the Semantic Web. Alcaraz Calero has a PhD in computer science from the University of Murcia. He’s a member of IEEE and ACM. Contact him at [email protected].

Andrew Farrell is a researcher in the Cloud and Security Lab at Hewlett-Packard Laboratories. His research focuses on developing technologies for highly automated, secure, and dynamic instantiation and management of cloud computing infrastructure and services. Farrell has a PhD in computer science from Imperial College London. Contact him at [email protected].

Johannes Kirschnick is a researcher in the Cloud and Security Lab at Hewlett-Packard Laboratories in Bristol, UK. His research focuses on developing technologies for highly automated, secure, and dynamic instantiation and management of cloud computing infrastructure and services. Kirschnick has a BSc in computer science from the Technical University of Munich. Contact him at [email protected].

Julio Guijarro is a senior project manager at Hewlett-Packard Laboratories and one of the main authors of the SmartFrog project. His main research areas are in computing and automatic deployment in distributed architectures. Guijarro has a BSc in business administration from London Business School. Contact him at julio.guijarro@hp.com.

