2.3 Cloud design and performance targets

2.3.3 Technology stacks

Cloud Computing frameworks combine different hardware and software technologies into technology stacks able to handle thousands of machines. This subsection highlights some of the hardware and software technology stacks developed by leading cloud providers.

Hardware technologies

In this subsection we describe how some of the cloud challenges are addressed, namely minimizing the operating cost per machine and improving the bandwidth between machines. The cost issue is managed by lowering energy consumption and improving automation (we refer the reader to [72] for an illuminating introduction to automation challenges in the cloud). Bandwidth is optimized through specialized hardware improvements and network topology refinements.

According to Jon Koomey (Consulting Professor at Stanford) on his blog, “Cloud Computing is (with few exceptions) significantly more energy efficient than using in-house data centers”. Yet, according to [59], energy consumption still accounts for more than 40% of the amortized cost of a data center. This energy consumption is optimized by paying special attention to the cooling system and by minimizing power distribution loss (which respectively account for 33% and 8% of the total energy consumption of data centers, according to [59]). While early data centers were built without paying much attention to those constraints, the most recently built data centers are located in places where power and cooling are cheap (for example on a river in Oregon for a Google data center, or in the Swedish town of Luleå for a Facebook data center).
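The figures quoted from [59] can be combined into a rough order of magnitude. The following is a minimal sketch, assuming (this is not stated in the source) that energy cost is proportional to energy consumed, and treating the "more than 40%" figure as a lower bound. The function and constant names are illustrative only.

```python
# Figures quoted above from [59]; "more than 40%" is used as a lower bound.
ENERGY_SHARE_OF_COST = 0.40     # energy as a fraction of amortized cost
COOLING_SHARE_OF_ENERGY = 0.33  # cooling as a fraction of energy consumption
DISTRIBUTION_LOSS_SHARE = 0.08  # power distribution loss, same basis


def share_of_amortized_cost(share_of_energy: float,
                            energy_share_of_cost: float = ENERGY_SHARE_OF_COST) -> float:
    """Fraction of the amortized cost attributable to one energy item.

    Assumption (not from the source): cost is proportional to energy consumed.
    """
    return share_of_energy * energy_share_of_cost


cooling = share_of_amortized_cost(COOLING_SHARE_OF_ENERGY)
distribution = share_of_amortized_cost(DISTRIBUTION_LOSS_SHARE)
print(f"cooling: at least {cooling:.1%} of amortized cost")
print(f"power distribution loss: at least {distribution:.1%} of amortized cost")
```

Under these assumptions, cooling alone would weigh on the order of 13% of the amortized cost, which suggests why the location of a data center matters so much.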

Since automation is a mandatory requirement of scale, a lot of work has been done to improve it in data centers. According to [59], a typical ratio of IT staff members to servers is 1:100 in an efficient firm and 1:1000 in an efficient data center. This is partially due to the technologies developed by hosting companies (like Rackspace or OVH) or backup companies (like Mozy, founded in 2005). Machine virtualization techniques (e.g. VMware or Microsoft Hyper-V) increase administrators’ productivity by letting them handle many more machines. In addition, the number of CPUs per machine and the number of cores per CPU have recently increased. As a result, administrators handle many more machines, and each machine holds more cores. Machines are replaced when the proportion of physically failing machines in a rack or in a container hits a given ratio. Other automation techniques can be found in [72] or [79].
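The staffing ratios and the threshold-based replacement policy mentioned above can be illustrated with a short sketch. The fleet size, the rack size and the 10% threshold below are hypothetical values chosen for illustration; the source does not give the actual ratio used by any provider.

```python
import math


def admins_needed(servers: int, servers_per_admin: int) -> int:
    """Staff required for a fleet at a given staff-to-server ratio."""
    return math.ceil(servers / servers_per_admin)


def should_replace(failed: int, total: int, threshold: float) -> bool:
    """True once the fraction of failed machines reaches the threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total >= threshold


fleet = 50_000  # hypothetical fleet size
print(admins_needed(fleet, 100))   # staff at a 1:100 ratio
print(admins_needed(fleet, 1000))  # staff at a 1:1000 ratio

# Hypothetical rack of 40 machines, replaced once 10% have failed.
print(should_replace(failed=3, total=40, threshold=0.10))
print(should_replace(failed=5, total=40, threshold=0.10))
```

Waiting for a threshold rather than repairing each machine individually is one way to keep the number of physical interventions per administrator low.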

A key factor in cloud bandwidth performance (whether within or between data centers) is the physical layout of the hardware in the data center, as well as the network topology between the nodes dedicated to data storage and computing. While a lot of research has been done on this subject, major cloud companies have not disclosed much about this major technological challenge or about their own implementations. According to [59], Microsoft plans to upgrade some of its data centers following the networking architecture studied in [18].

Software technologies

Several software technology stacks have been developed to make access to CPU and storage easier. They split into three abstraction levels: the storage level, the execution level, and the DSL (domain-specific language) level.

– The storage level is responsible for persisting the data in physical storage. It handles the issues of distributed storage management and data contention, and provides structured access to the data. It can be SQL or No-SQL storage, as described in Section 2.4. This storage can also be used as a means of communication between machines. Other types of storage are available in IaaS solutions, such as Amazon Elastic Block Store, which provides storage volumes that can be attached as a device to a running Amazon EC2 instance and that persist independently of the life of this instance.

– The execution level defines the general execution framework. It specifies how the machines are supposed to organize themselves, monitors the workers and restarts the failing machines. Depending on the type of Cloud Computing offer (IaaS, PaaS, etc.), the execution level can provide each computing device with anything a standard OS could achieve, or it can give a much more constrained execution framework where the execution flow must be designed in a specific way (the most famous example of such a constrained execution framework is MapReduce, described in Subsection 2.5.1).

– The DSL level is a higher-level execution framework than the execution level. It often provides a declarative language instead of the procedural approach provided by the execution level, thereby saving users from specifying how the work is supposed to be parallelized. It provides automated, under-the-hood parallelization, scheduling and inter-machine communication. Therefore, the DSL level helps users focus only on defining the result they want, and not on how they want it to be computed. DSL implementations are often built in a SQL-like manner, allowing people who are familiar with this syntax to run those queries in a similar way on an underlying No-SQL storage. Besides, since the DSL level provides a run-time framework to execute those queries, it spares users from compiling their software each time they modify or add a query. Historically, DSLs were added years after the introduction of the execution level, to give specialists who are not developers an easy way to use the execution level. Pig Latin, Sawzall and DryadLINQ are well-known examples of DSL implementations.
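The contrast between the procedural style of the execution level and the declarative style of the DSL level can be sketched on a single machine. The example below is only an illustration: it counts words first with hand-written map and reduce phases, then with a SQL query run through Python's standard `sqlite3` module. SQLite is used here as a stand-in for a SQL-like DSL and is not one of the distributed systems discussed in this section; the sample data and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

LINES = ["the cloud", "the stack", "cloud storage"]  # hypothetical input


# Execution-level style: the user writes the map and reduce phases by hand.
def map_phase(lines):
    """Emit a (word, 1) pair for every word of every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_phase(pairs):
    """Sum the emitted values per word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


procedural = reduce_phase(map_phase(LINES))

# DSL-level style: the user states the result; the engine decides how to
# compute it. The query is a plain string, so it can be changed at run time
# without recompiling the program.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(w,) for line in LINES for w in line.split()],
)
declarative = dict(
    conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word")
)
conn.close()

assert procedural == declarative
print(procedural)
```

In the first half the user decides how the data flows through the two phases; in the second half the query only states the grouping and the aggregate, and the planning is left to the engine.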

The storage level can be used independently of the two other levels as an outsourced storage service. The execution level can also be used independently of the storage level and the DSL level, provided no persistence is required. However, most cloud applications are built using both storage and execution level environments. The DSL level is an optional layer provided to ease the usage of the storage and execution levels. Theoretically, each project at a given level could be substituted for any other project at the same level, so we could for example plug and use any storage framework with any execution framework. Yet, in practice, some components are designed to work only with specific other components (as is the case, for example, for the original MapReduce framework, which requires some knowledge about where the data are stored). As a consequence, there are four main frameworks that provide a full technology stack holding the three abstraction levels.
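The theoretical substitutability of components at a given level can be sketched with a small interface. Everything below is hypothetical: the `Storage` contract, the two toy backends and the `run_job` function are illustrative names that do not correspond to any of the frameworks cited in this section. The sketch only shows that an execution routine written against a storage contract runs unchanged on either backend.

```python
from typing import Callable, Dict, Iterable, List, Protocol, Tuple


class Storage(Protocol):
    """Hypothetical storage-level contract: a minimal key-value store."""

    def put(self, key: str, value: int) -> None: ...
    def get(self, key: str) -> int: ...
    def keys(self) -> Iterable[str]: ...


class InMemoryStorage:
    """First backend: a plain dictionary."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def put(self, key: str, value: int) -> None:
        self._data[key] = value

    def get(self, key: str) -> int:
        return self._data[key]

    def keys(self) -> Iterable[str]:
        return list(self._data)


class LogStorage:
    """Second backend: an append-only log where the last write wins."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, int]] = []

    def put(self, key: str, value: int) -> None:
        self._log.append((key, value))

    def get(self, key: str) -> int:
        for k, v in reversed(self._log):
            if k == key:
                return v
        raise KeyError(key)

    def keys(self) -> Iterable[str]:
        return list(dict.fromkeys(k for k, _ in self._log))


def run_job(source: Storage, sink: Storage, task: Callable[[int], int]) -> None:
    """Hypothetical execution-level routine that only relies on the contract."""
    for key in list(source.keys()):
        sink.put(key, task(source.get(key)))


def demo(backend: Callable[[], Storage]) -> Dict[str, int]:
    """Run the same job on a given backend and return the results."""
    source, sink = backend(), backend()
    source.put("a", 1)
    source.put("b", 2)
    run_job(source, sink, lambda x: x * 10)
    return {k: sink.get(k) for k in sink.keys()}


print(demo(InMemoryStorage))
print(demo(LogStorage))
```

The practical limitation noted above appears as soon as the execution routine needs something the contract does not expose, such as the physical location of the data: the two levels then become coupled.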

The first framework was presented by Google and is the oldest one. On the storage level, it is composed of Google File System ([56]) and BigTable ([39]), the latter being built on top of Google File System. The execution level is represented by the original version of MapReduce (described in Subsection 2.5.1): Google MapReduce ([47]). After the release of MapReduce, Google engineered Sawzall ([97]), a dedicated language on the DSL level, built on top of MapReduce, that provides higher-level access to the storage and computation frameworks.

The second technology stack is the Apache Hadoop project ([112]), one of the most famous open-source frameworks for distributed computing. It is composed of several widespread subprojects. The storage level includes HDFS, Cassandra, etc. The execution level includes Hadoop MapReduce, but can also be hand-tailored with the help of distributed coordination tools such as Hadoop ZooKeeper. The DSL level includes Hive and Pig. Pig is a compiler that processes Pig Latin ([93]) code to produce sequences of MapReduce programs that can be run, for example, on Hadoop MapReduce. Pig Latin is a data processing language that is halfway between the high-level declarative style of SQL and the lower-level procedural style of MapReduce. Pig Latin is used, for example, at Yahoo! (see [93]) to reduce the development and execution time of data analysis queries.

The third technology stack is composed of Amazon Web Services (AWS) and AWS-compatible open-source components. Its most popular storage components are Amazon Simple Storage Service (Amazon S3), Amazon Simple Queue Service (Amazon SQS), Amazon SimpleDB, Amazon ElastiCache and Amazon Relational Database Service (Amazon RDS). The AWS execution level is mainly composed of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic MapReduce (which uses a hosted Hadoop framework running on top of Amazon EC2 and Amazon S3). AWS is one of the first large commercial Cloud Computing offers: it was initially launched in July 2002, and was valued as a 1 billion dollar business in 2011 (see [1]). In parallel, Eucalyptus provides an open-source stack that exports a user-facing interface compatible with the Amazon EC2 and S3 services.

The fourth framework has been developed by Microsoft. The storage level is embodied by distinct independent projects: Azure Storage, Cosmos and SQL Server. Cosmos is a Microsoft internal project that has been used by the Search teams and Bing. Azure Storage is a different project, initially built using a part of the Cosmos system (according to [36]). Azure Storage is meant to be the standard storage layer for applications run on Azure. The execution level is held by Dryad ([73]) or by Azure Compute, or by small frameworks such as Lokad-Cloud ([11]) (on top of Azure Compute) or Lokad-CQRS ([12]). The DSL level is represented by two different projects: DryadLINQ ([116]) and Scope ([38]). Figure 2.1 (on the next page) is taken from a blog post by Mihai Budiu. It summarizes the technology stacks of Google, Hadoop and Microsoft.