Sort - Higher-level experiments with MapReduce applications

9.4 Higher-level experiments with MapReduce applications

9.4.3 Sort

Finally, we evaluate sort, a standard MapReduce application that sorts key-value pairs. The key is represented by the first 10 bytes of each record, while the value is the remaining 100 bytes. This application is read-intensive in the map phase and generates a write-intensive workload in the reduce phase. The access patterns exhibited by this application are thus concurrent reads from the same file and concurrent writes to different files.
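The record layout described above can be illustrated with a minimal, single-process sketch. It is not the Hadoop sort job itself: the function names `split_records` and `sort_records` are hypothetical, and the only details taken from the text are the 10-byte key and the 100-byte value.

```python
# Record layout taken from the text: a 10-byte key followed by a 100-byte value.
KEY_SIZE = 10
VALUE_SIZE = 100
RECORD_SIZE = KEY_SIZE + VALUE_SIZE


def split_records(data: bytes):
    """Split a byte buffer into fixed-size (key, value) pairs."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("buffer length is not a multiple of the record size")
    return [
        (data[off:off + KEY_SIZE], data[off + KEY_SIZE:off + RECORD_SIZE])
        for off in range(0, len(data), RECORD_SIZE)
    ]


def sort_records(data: bytes) -> bytes:
    """Return the records of `data` ordered by key (bytewise comparison)."""
    pairs = sorted(split_records(data), key=lambda kv: kv[0])
    return b"".join(key + value for key, value in pairs)
```

In the real application the mappers read the records and the reducers write the sorted partitions to separate output files; this sketch only shows how keys and values are delimited and compared.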

A full deployment of HDFS/BSFS was performed on all 270 available nodes followed by a deployment of the entities belonging to the Hadoop framework: the jobtracker, deployed on a dedicated node, and the tasktrackers, co-deployed with the datanodes/providers. The input file to be sorted by the application is stored in 64 MB chunks spread across the datanodes/providers. The Hadoop jobtracker assigns a mapper to process each chunk of the input file. The same input data was stored in multiple chunk configurations in order to be able to vary the number of mappers from 1 to 120. This corresponds to an input file whose size varies from 64 MB to 8 GB. For each of these input files, we measured the job completion time when HDFS and BSFS are respectively used as storage backends.
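The relation between input size and number of mappers can be checked with a short sketch, assuming one mapper per 64 MB chunk as stated above. The helper names `mappers_for` and `input_size_mb` are hypothetical. Note that 120 chunks of 64 MB amount to 7680 MB (7.5 GB), which the text presumably rounds to 8 GB.

```python
import math

CHUNK_SIZE_MB = 64  # chunk size stated in the text


def mappers_for(file_size_mb: int, chunk_size_mb: int = CHUNK_SIZE_MB) -> int:
    """One mapper per chunk; a partial trailing chunk still needs its own mapper."""
    return math.ceil(file_size_mb / chunk_size_mb)


def input_size_mb(num_mappers: int, chunk_size_mb: int = CHUNK_SIZE_MB) -> int:
    """Size of an input file made of `num_mappers` full chunks."""
    return num_mappers * chunk_size_mb
```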


Figure 9.5 displays the time needed by the application to complete, when increasing the size of the input file. When using BSFS as a storage backend, the Hadoop framework manages to finish the job faster than when using HDFS. These results are consistent with the ones delivered by the microbenchmarks. However, the impact of the average throughput when accessing a file in the file system is less visible in these results, as the job completion time includes not only file access time, but also the computation time and the I/O transfer time.

9.5 Conclusions

In this chapter we presented the BlobSeer-based File System (BSFS), a storage layer for Hadoop MapReduce that builds on BlobSeer to provide high performance and scalability for data-intensive applications. We demonstrated that it is possible to enhance Hadoop MapReduce by replacing the default storage layer, the Hadoop Distributed File System (HDFS), with BSFS. Thanks to this new BSFS layer, the sustained throughput of Hadoop is significantly improved in scenarios that exhibit highly concurrent accesses to shared files. We demonstrated this claim through extensive experiments, using both synthetic benchmarks and real MapReduce applications. The results obtained in the synthetic benchmarks show not only large throughput improvements under concurrency, but also superior scalability and load balancing. These theoretical benefits were put to the test by running real-life MapReduce applications that cover all possible access pattern combinations: read-intensive, write-intensive and mixed. In all three cases, the improvement over HDFS ranges from 11% to 30%.


Chapter 10

Efficient VM Image Deployment and Snapshotting in Clouds

Contents

10.1 Problem definition
10.2 Application model
10.2.1 Cloud infrastructure
10.2.2 Application state
10.2.3 Application access pattern
10.3 Our approach
10.3.1 Core principles
10.3.2 Applicability in the cloud: model
10.3.3 Zoom on mirroring
10.4 Implementation
10.5 Evaluation
10.5.1 Experimental setup
10.5.2 Scalability of multi-deployment under concurrency
10.5.3 Local access performance: read-your-writes access patterns
10.5.4 Multi-snapshotting performance
10.5.5 Benefits for real-life, distributed applications
10.6 Positioning of this contribution with respect to related work
10.7 Conclusions

In this chapter we leverage the object-versioning capabilities of BlobSeer to address two important challenges that arise in the context of IaaS cloud computing (presented in Section 2.3): (1) efficient deployment of VM images on many nodes simultaneously (multi-deployment); and (2) efficient concurrent snapshotting of many VM instances to persistent storage (multi-snapshotting). In the context of cloud computing, efficiency means not only fast execution time, but also low network traffic and storage space consumption, as the user pays for these resources in proportion to their consumption.

We propose a series of optimization techniques that aim at minimizing both execution time and resource consumption. While conventional approaches transfer the whole VM image contents between the persistent storage service and the computing nodes, we leverage object-versioning to build a lazy deployment scheme that transfers only the needed content on-demand, which greatly reduces execution time and resource consumption. The work presented in this chapter was published in [111].
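The idea of transferring only the needed content on demand can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism implemented in this chapter: the class `LazyImage`, its chunk size and the in-memory `remote` buffer standing in for the persistent storage service are all hypothetical.

```python
class LazyImage:
    """Local view of a remote image that fetches chunks only when first read."""

    def __init__(self, remote: bytes, chunk_size: int = 4):
        self._remote = remote        # stands in for the persistent storage service
        self._chunk_size = chunk_size
        self._local = {}             # chunk index -> bytes already transferred
        self.bytes_transferred = 0   # total volume fetched from the remote side

    def _chunk(self, index: int) -> bytes:
        # Fetch the chunk on first access only; later reads hit the local copy.
        if index not in self._local:
            start = index * self._chunk_size
            data = self._remote[start:start + self._chunk_size]
            self._local[index] = data
            self.bytes_transferred += len(data)
        return self._local[index]

    def read(self, offset: int, length: int) -> bytes:
        """Read `length` bytes at `offset`, fetching only the chunks touched."""
        end = min(offset + length, len(self._remote))
        if length <= 0 or offset < 0 or offset >= end:
            return b""
        first = offset // self._chunk_size
        last = (end - 1) // self._chunk_size
        buf = b"".join(self._chunk(i) for i in range(first, last + 1))
        skip = offset - first * self._chunk_size
        return buf[skip:skip + (end - offset)]
```

With this scheme, a boot process that touches only a small part of the image transfers only the corresponding chunks, at the cost of a remote fetch on each first access.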

10.1 Problem definition

The on-demand nature of IaaS is one of the key features that make it attractive as an alternative to buying and maintaining hardware, because users can rent virtual machines (VMs) instantly, without having to go through lengthy setup procedures. VMs are instantiated from a virtual machine image (simply referred to as image), a file that is stored persistently on the cloud and represents the initial state of the components of the virtual machine, most often the content of the virtual hard drive of the VM.

One of the commonly occurring patterns in the operation of IaaS is the need to instantiate a large number of VMs at the same time, starting from one or more images. For example, this pattern occurs when the user wants to deploy a virtual cluster that executes a distributed application, or a set of environments to support a workflow.

Once the application is running, a wide range of management tasks, such as checkpointing and live migration, are crucial on clouds. Many such management tasks can ultimately be reduced to snapshotting [154]. This essentially means capturing the state of the running VM inside the image, which is then transferred to persistent storage and later reused to restore the state of the VM, potentially on a different node than the one where it originally ran. Since the application consists of a large number of VMs that run at the same time, another important pattern that occurs in the operation of IaaS is concurrent VM snapshotting.

This chapter focuses on highlighting the benefits of BlobSeer for these two patterns. We call these two patterns the multi-deployment pattern and the multi-snapshotting pattern:

• The multi-deployment pattern occurs when multiple VM images (or a single VM image) are deployed on many nodes at the same time. In such a scenario, where massive concurrent accesses increase the pressure on the storage service where the images are located, it is desirable to avoid a full transfer of the image to the nodes that will host the VMs. At a minimum, when the image is booted, only the parts of the image that are actually accessed by the boot process need to be transferred. This saves the cost of moving the whole image and makes deployment fast, while reducing the risk of a bottleneck on the storage service where images are stored. However, such a “lazy” transfer makes the boot process longer, as some necessary parts of the image may not be available locally. We exploit this tradeoff to achieve a good balance between deployment and application execution.

• The multi-snapshotting pattern occurs when many images corresponding to deployed VM instances in a datacenter are persistently saved to a storage system at the same

In document BlobSeer: Towards efficient data storage management for large-scale, distributed systems (Page 114-118)