The Impact of Virtualization on High Performance Computing Clustering in the Cloud

Master Thesis Report

Submitted in

Fall 2013

In partial fulfillment of the requirements for the degree of

Master of Science in Software Engineering at the School of

Science and Engineering of Al Akhawayn University in Ifrane

By

Ouidad ACHAHBAR

Supervised by

Dr. Mohamed Riduan ABID

Ifrane, Morocco January, 2014

Acknowledgment

I would like to express my deepest and sincere gratitude to ALLAH for giving me guidance and strength to complete this work, and for having the chance to study and accomplish my master's degree with great support from my family, friends and professors. Thank you ALLAH. I would also like to deeply thank my supervisor Dr. Abid for trusting me to conduct this research, providing me with valuable feedback and overseeing my progress on a weekly basis. Thank you Dr. Abid for your motivation and support.

My gratitude also goes to Dr. Haitouf who provided me with valuable comments and shared with me his knowledge in cloud computing and distributed systems. Thank you Dr. Haitouf. I am most thankful to my dear parents, brothers, sisters, nephews and fiancé for their continuous support, encouragement and love. There are no words to express my gratitude to all of you.

Many thanks go to my very close friends: Nora El Bakraoui Alaoui, Inssaf El Boukari, Sara El Alaoui, Aida Tahditi, Jamila Barroug, Wafa Bouya and Chahrazad Touzani. Thank you for being always by my side; thank you for sharing enjoyable moments with me, and thank you for being my friends.

Last but not least, special acknowledgements go to all my professors for their support, respect and encouragement. Thank you Ms. Hanaa Talei, Ms. Asmaa Mourhir, Dr. Naeem Nizar Sheikh, Mr. Omar Iraqui, Dr. Violetta Cavalli Sforza, Dr. Kevin Smith and Dr. Harroud.

Abstract

The growing pervasiveness of Internet access is greatly increasing big data production. This, in turn, increases the demand for compute power to process massive data, making High Performance Computing (HPC) a highly solicited service.

Based on the paradigm of providing computing as a utility, the cloud offers user-friendly infrastructures for processing big data, e.g., High Performance Computing as a Service (HPCaaS). Still, HPCaaS performance is tightly coupled with the underlying virtualization technique, since the latter controls the creation of the virtual machine instances that carry out data processing jobs.

In this thesis, we characterize and evaluate the impact of machine virtualization on HPCaaS. We track HPC performance under different cloud virtualization platforms, namely KVM and VMware ESXi, and compare it to the performance of a physical computing cluster. The virtualized environment is deployed using Hadoop on top of OpenStack. The resulting HPCaaS runs MapReduce algorithms on benchmark big data samples using a granularity of 8 physical machines per cluster.

Running the selected benchmarks on the virtualized and physical clusters yielded different performance trends for each tested cluster. Overall, the analysis of the findings shows that the choice of virtualization technology can lead to significant performance improvements when running HPCaaS.

Abstract (translated from Arabic)

The continuing spread of Internet access and use is a major cause of the growing production of big data. This, in turn, increases the demand for high computing power to process these data, which has made High Performance Computing (HPC) a service of great interest. Based on the model of providing computing as a utility, cloud computing offers easy-to-use infrastructures for processing big data, for example High Performance Computing as a Service (HPCaaS). However, the performance of HPCaaS is closely tied to the virtualization technique, since the latter controls the creation of the virtual machines that carry out data processing jobs.

In this thesis, we characterize and evaluate the impact of virtualization on HPCaaS. We also track HPC performance on different cloud virtualization platforms and on a physical computing cluster made up of eight computers. We used OpenStack to build the HPCaaS and Hadoop to run MapReduce algorithms on big data.

From the results of this research, we observed a significant change in HPC performance as the data size, the type of cluster (physical or virtualized infrastructure) and the cluster size varied. Nevertheless, our conclusion shows that virtualization technology plays an important and considerable role in improving HPC performance.

Table of Contents

Acknowledgment
Abstract
Abstract in Arabic (ملخص)
Table of Contents
List of Figures
List of Tables
List of Appendices
List of Acronyms

PART I: THESIS OVERVIEW

Chapter 1: Introduction
1.1. Background
1.2. Motivation
1.3. Problem Statement
1.4. Research Question
1.5. Research Objective
1.6. Research Approach
1.7. Thesis Organization

PART II: THEORETICAL BASELINES

Chapter 2: Cloud Computing
3.1. Cloud Computing Definition
3.2. Cloud Computing Characteristics
3.3. Cloud Computing Service Models
3.4. Cloud Computing Deployment Models
3.5. Cloud Computing Benefits
3.6. Cloud Computing Providers

Chapter 3: Virtualization
4.1. Definition of Virtualization
4.2. History of Virtualization
4.3. Benefits of Virtualization
4.4. Virtualization Approaches
4.5. Virtual Machine Manager

Chapter 4: Big Data and High Performance Computing as a Service
5.1. Big Data
5.2. High Performance Computing as a Service (HPCaaS)

Chapter 5: Literature Review and Research Contribution
5.1. Related Work
5.2. Contribution

PART III: TECHNOLOGY ENABLERS

Chapter 6: Technology Enablers Selection
6.1. Cloud Platform Selection

Chapter 7: OpenStack
7.1. OpenStack Overview
7.2. OpenStack History
7.3. OpenStack Components
7.4. OpenStack Supported Hypervisors

Chapter 8: Hadoop
8.1. Hadoop Overview
8.2. Hadoop History
8.3. Hadoop Architecture
8.4. Hadoop Implementation
8.5. Hadoop Cluster Connectivity

PART IV: RESEARCH CONTRIBUTION

Chapter 9: Research Methodology
9.1. Research Approach
9.2. Research Steps

Chapter 10: Experimental Setup
10.1. Experimental Hardware
10.2. Experimental Software and Network
10.3. Clusters Architecture
10.4. Experimental Performance Benchmarks
10.5. Experimental Datasets Size
10.6. Experiment Execution

Chapter 11: Experimental Results
11.1. Hadoop Physical Cluster Results
11.2. Hadoop Virtualized Cluster - KVM Results
11.3. Hadoop Virtualized Cluster - VMware ESXi Results
11.4. Results Comparison

Chapter 12: Discussion
12.1. TeraSort
12.2. TestDFSIO
12.3. Conclusion

PART V: CONCLUSION

Chapter 13: Conclusion and Future Work

Bibliography

Appendix A: OpenStack with KVM Configuration
Appendix B: OpenStack with VMware ESXi Configuration
Appendix C: Hadoop Configuration
Appendix D: TeraSort and TestDFSIO Execution
Appendix E: Data Gathering for TeraSort

List of Figures

Figure 1: Thesis organization
Figure 2: NIST visual model of cloud computing definition
Figure 3: Services provided in cloud computing environment
Figure 4: Full virtualization architecture
Figure 5: Paravirtualization architecture
Figure 6: Hardware assisted virtualization architecture
Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor
Figure 8: Xen hypervisor architecture
Figure 9: KVM hypervisor architecture
Figure 10: VMware ESXi architecture
Figure 11: Data growth between 2008 and 2020
Figure 12: Active cloud community population
Figure 13: Active distributed systems population
Figure 14: OpenStack conceptual architecture
Figure 15: Nova subcomponents
Figure 16: Glance subcomponents
Figure 17: Keystone subcomponents
Figure 18: Swift subcomponents
Figure 19: Cinder subcomponents
Figure 20: Quantum subcomponents
Figure 21: Apache Hadoop subprojects
Figure 22: Hadoop architecture
Figure 23: HDFS and MapReduce representation
Figure 24: Word count MapReduce example
Figure 25: Research steps
Figure 26: Hadoop Physical Cluster
Figure 27: Hadoop Physical Cluster architecture
Figure 28: Hadoop virtualized cluster - KVM
Figure 29: Hadoop virtualized cluster – VMware ESXi (a)
Figure 30: Hadoop virtualized cluster – VMware ESXi (b)
Figure 31: Experimental execution
Figure 32: TeraSort performance on Hadoop Physical Cluster
Figure 33: TeraSort performance for 100 MB on Hadoop Physical Cluster
Figure 34: TeraSort performance for 1 GB on Hadoop Physical Cluster
Figure 35: TeraSort performance for 10 GB on Hadoop Physical Cluster
Figure 36: TeraSort performance for 30 GB on Hadoop Physical Cluster
Figure 37: TestDFSIO-Write performance on Hadoop Physical Cluster
Figure 38: TestDFSIO-Write performance for 1 GB on Hadoop Physical Cluster
Figure 39: TestDFSIO-Write performance for 100 MB on Hadoop Physical Cluster
Figure 40: TestDFSIO-Write performance for 10 GB on Hadoop Physical Cluster
Figure 41: TestDFSIO-Write performance for 100 GB on Hadoop Physical Cluster
Figure 42: TestDFSIO-Read performance on Hadoop Physical Cluster
Figure 43: TestDFSIO-Read performance for 100 MB on Hadoop Physical Cluster
Figure 44: TestDFSIO-Read performance for 1 GB on Hadoop Physical Cluster
Figure 45: TestDFSIO-Read performance for 10 GB on Hadoop Physical Cluster
Figure 46: TestDFSIO-Read performance for 100 GB on Hadoop Physical Cluster
Figure 48: TeraSort performance for 100 MB on Hadoop KVM Cluster
Figure 49: TeraSort performance for 1 GB on Hadoop KVM Cluster
Figure 50: TeraSort performance for 10 GB on Hadoop KVM Cluster
Figure 51: TeraSort performance for 30 GB on Hadoop KVM Cluster
Figure 52: TestDFSIO-Write performance on Hadoop KVM Cluster
Figure 53: TestDFSIO-Write performance for 100 MB on Hadoop KVM Cluster
Figure 54: TestDFSIO-Write performance for 1 GB on Hadoop KVM Cluster
Figure 55: TestDFSIO-Write performance for 10 GB on Hadoop KVM Cluster
Figure 56: TestDFSIO-Write performance for 100 GB on Hadoop KVM Cluster
Figure 57: TestDFSIO-Read performance on Hadoop KVM Cluster
Figure 58: TestDFSIO-Read performance for 100 MB on Hadoop KVM Cluster
Figure 59: TestDFSIO-Read performance for 1 GB on Hadoop KVM Cluster
Figure 60: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 61: TestDFSIO-Read performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 62: TeraSort performance on Hadoop VMware ESXi Cluster
Figure 63: TeraSort performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 64: TeraSort performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 65: TeraSort performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 66: TeraSort performance for 30 GB on Hadoop VMware ESXi Cluster
Figure 67: TestDFSIO-Write performance on Hadoop VMware ESXi Cluster
Figure 68: TestDFSIO-Write performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 69: TestDFSIO-Write performance for 1 GB on Hadoop VMware ESXi Cluster
Figure 70: TestDFSIO-Write performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 71: TestDFSIO-Write performance for 100 GB on Hadoop VMware ESXi Cluster
Figure 72: TestDFSIO-Read performance on Hadoop VMware ESXi Cluster
Figure 73: TestDFSIO-Read performance for 100 MB on Hadoop VMware ESXi Cluster
Figure 75: TestDFSIO-Read performance for 10 GB on Hadoop VMware ESXi Cluster
Figure 77: Average time for sorting 100 MB on HPhC, HVC with KVM and ...
Figure 78: Average time for sorting 1 GB on HPhC, HVC with KVM and VMware ESXi
Figure 79: Average time for sorting 10 GB on HPhC, HVC with KVM and ...
Figure 80: Average time for sorting 30 GB on HPhC, HVC with KVM and ...
Figure 81: Average time for writing 1 GB on HPhC, HVC with KVM and VMware ESXi
Figure 82: Average time for writing 10 GB on HPhC, HVC with KVM and VMware ESXi
Figure 83: Average time for writing 100 GB on HPhC, HVC with KVM
Figure 84: Average time for reading 1 GB on HPhC, HVC with KVM and VMware ESXi
Figure 85: Average time for reading 1 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 86: Average time for reading 10 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 87: Average time for reading 100 GB on HPhC, HVC with KVM and HVC VMware ESXi
Figure 88: Memory overhead when running 30 GB (started at 4.55PM) on 8 VMware ESXi VMs
Figure 89: System latency reaches its peak (at 12.28PM) when running 30 GB on 8 VMware ESXi VMs
Figure 90: OpenStack warning statistics about system resource usage

List of Tables

Table 1: A comparison of cloud deployment models
Table 2: Cloud IaaS selection
Table 3: Parallel and distributed platform selection
Table 4: OpenStack releases
Table 5: OpenStack projects
Table 6: Apache Hadoop subprojects
Table 7: Dell OptiPlex 755 computer features (used for Hadoop physical cluster)
Table 8: Dell PowerEdge server used for building OpenStack & Hadoop virtualized cluster
Table 9: OpenStack virtual machines’ features
Table 10: Experimental performance metrics
Table 11: Datasets size used for Hadoop benchmarks
Table 12: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 13: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 14: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop Physical Cluster
Table 15: Average time (in seconds) of running TeraSort on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Table 16: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Table 17: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop KVM Cluster
Table 18: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster
Table 19: Average time (in seconds) of running TestDFSIO-Write on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster
Table 20: Average time (in seconds) of running TestDFSIO-Read on different dataset sizes and different number of nodes - Hadoop VMware ESXi Cluster

List of Appendices

Appendix A: OpenStack with KVM Configuration
Appendix B: OpenStack with VMware ESXi Configuration
Appendix C: Hadoop Configuration
Appendix D: TeraSort and TestDFSIO Execution
Appendix E: Data Gathering for TeraSort

List of Acronyms

HPC High Performance Computing

HPCaaS High Performance Computing as a Service

VM Virtual Machine

VMM Virtual Machine Manager

EMC EMC Corporation (an American multinational corporation)

DCI Digital Communications Inc.

GFS Google File System

HDFS Hadoop Distributed File System

NDFS Nutch Distributed File System

DOE Department of Energy National Laboratories

NIST National Institute of Standards and Technology

SaaS Software as a Service

PaaS Platform as a Service

IaaS Infrastructure as a Service

NoSQL Not Only Structured Query Language

SNIA Storage Networking Industry Association

ACID Atomicity, Consistency, Isolation and Durability

AWS Amazon Web Services

HPhC Hadoop Physical Cluster

HVC Hadoop Virtualized Cluster

SSH Secure Shell

JSON JavaScript Object Notation

XML Extensible Markup Language

API Application Programming Interface

Amazon EC2 Amazon Elastic Compute Cloud

Amazon S3 Amazon Simple Storage Service

VLAN Virtual Local Area Network

Part I: Thesis Overview

This part presents the key points needed to understand the purpose of the present research. It introduces the research through its background, motivation, problem statement, research questions, objective and methodology.

Chapter 1: Introduction

In this chapter, we first present the background of the research, and then describe the motivation and the problem behind this study. After that, the research questions, objectives and methodology are stated. Finally, an outline of the thesis is given at the end of the chapter.

1.1. Background

Over the last decades, the demand for computing power has steadily increased as the data generated by social networks, web pages, sensors, online transactions, etc. keeps growing. A 2012 study by EMC, an American multinational corporation, estimated that from 2005 to 2020 data will grow by a factor of 300 (from 130 exabytes to 40,000 exabytes), so that digital data will roughly double every two years [1]. This growth of data constitutes the “Big Data” phenomenon.
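As a quick sanity check on these figures, the short sketch below derives the growth factor and the implied doubling time from the two endpoints quoted above (130 exabytes in 2005 and 40,000 exabytes in 2020). It assumes steady exponential growth over the whole period, which is a simplification made here for illustration and not a claim taken from the cited study; the variable names are illustrative.

```python
import math

# Endpoints quoted in the text (exabytes) and the period they span.
start_eb = 130.0
end_eb = 40_000.0
years = 2020 - 2005

# Overall growth factor between the two endpoints.
growth_factor = end_eb / start_eb

# Assumption: steady exponential growth, so the number of doublings
# is log2(growth factor) and the doubling time is years / doublings.
doublings = math.log2(growth_factor)
doubling_time_years = years / doublings

print(f"growth factor: {growth_factor:.0f}x")
print(f"implied doubling time: {doubling_time_years:.2f} years")
```

Under this assumption the growth factor comes out slightly above 300, and the implied doubling time is a little under two years, which is consistent with the "doubling every two years" summary.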

As Big Data grows in terms of volume, velocity and value, current technologies for storing, processing and analyzing data become inefficient and insufficient. A Gartner survey stated that data growth is considered the largest challenge for organizations [2]. To address this issue, High Performance Computing (HPC) has started to be widely integrated in managing and handling Big Data. In this case, HPC is used to process and analyze Big Data related to different problems, including scientific, engineering and business problems, that require high computation capabilities, high bandwidth and low-latency networks [3].

However, HPC still lacks toolsets that fit the current growth of data. New paradigms and storage tools were therefore integrated with HPC to deal with current data management challenges. These include providing computing as a utility (cloud computing) and introducing new parallel and distributed paradigms.

Cloud computing plays an important role as it provides organizations with the ability to analyze and store data economically and efficiently. Performing HPC in the cloud was introduced as data started to be migrated to and managed in the cloud. Digital Communications Inc. (DCI) stated that by 2020, a significant portion of digital data will be managed in the cloud, and even if a byte in the digital universe is not stored in the cloud, it will pass, at some point, through the cloud [4]. Performing HPC in the cloud is known as High Performance Computing as a Service (HPCaaS). In short, HPCaaS offers a high-performance, on-demand and scalable HPC environment that can handle the complexity and challenges related to Big Data [5].

One of the best known and most widely adopted parallel and distributed models is MapReduce, which was developed by Google to meet the growth of its web search indexing process [6]. MapReduce computations are performed with the support of a data storage system known as the Google File System (GFS). The success of both GFS and MapReduce inspired the development of Hadoop, a distributed and parallel system that implements MapReduce and the Hadoop Distributed File System (HDFS) [7]. Nowadays, Hadoop is widely adopted by big players in the market because of its scalability, reliability and low cost of implementation. Hadoop has therefore also been proposed as an underlying technology for HPC that distributes the work across an HPC cluster [8, 9].
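To make the MapReduce model concrete, the following is a minimal single-process sketch of a word count in plain Python. It only illustrates the map, shuffle and reduce phases; it is not Hadoop code and does not use any Hadoop API, and the function names and sample input are hypothetical, chosen here for illustration.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {word: sum(values) for word, values in grouped.items()}

def word_count(splits):
    """Run the three phases sequentially over all input splits."""
    pairs = []
    for split in splits:
        pairs.extend(map_phase(split))
    return reduce_phase(shuffle_phase(pairs))

# Hypothetical input: each string stands in for one input split.
splits = ["big data needs big clusters", "clusters process data"]
counts = word_count(splits)
print(counts)
```

In a real cluster, each split would be mapped on a different node and the shuffle would move data across the network; here everything runs sequentially so that the data flow is easy to follow.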

1.2. Motivation

Many solutions have been proposed and developed to improve the computation performance of Big Data. Some of them aim to improve algorithm efficiency, provide new distributed paradigms or develop powerful clustering environments. However, few of those solutions address the whole picture of integrating HPC with current emerging storage and processing technologies.

As stated before, some of the most popular technologies currently used in hosting and processing Big Data are cloud computing, HDFS and Hadoop MapReduce [10]. At present, the use of HPC in cloud computing is still limited. The first step in this direction was taken by the Department of Energy National Laboratories (DOE), which started exploring the use of cloud services for scientific computing [11]. In addition, in 2009, Yahoo Inc. launched partnerships with major universities in the United States to conduct more research on cloud computing, distributed systems and high computing applications.

HPCaaS still needs more investigation to decide on appropriate environments that fit high computing requirements. One aspect of HPCaaS that has not yet been investigated is the impact of different virtualization technologies on HPC in the cloud. The motivation of this research is therefore the need to evaluate HPCaaS performance using MapReduce and different virtualization techniques. This motivation is supported by a strong rationale: open-source MapReduce and cloud computing software is freely accessible.

1.3. Problem Statement

Cloud computing offers a set of services for processing Big Data; one of these services is HPCaaS. Still, HPCaaS performance is highly affected by the underlying virtualization techniques, which are considered the heart of cloud computing. The problem addressed in this research is therefore formulated as follows: “HPCaaS is still facing poor performance and still doesn’t fit Big Data requirements”.

1.4. Research Question

To address the problem statement, this thesis aims to answer the following research questions:

1. What is the performance of HPC on a Hadoop Physical Cluster (HPhC)?

2. Is it worth moving HPC to the cloud?

3. How do virtualization techniques affect HPCaaS performance?

4. Is there an optimal virtualization technique that can ensure good performance?

1.5. Research Objective

The purpose of the present research is to find solutions to the issues and questions raised in the previous sections. To that end, this research introduces a new architecture that can handle HPC complexity and increase its performance. The proposed architecture consists of building a Hadoop Virtualized Cluster (HVC) in a private cloud using OpenStack. The first goal of this research is to investigate the added value of adopting a virtualized cluster, and the second goal is to evaluate the impact of virtualization techniques on HPCaaS.

1.6. Research Approach

To evaluate HPCaaS over different virtualization technologies, we followed both qualitative and quantitative research methodologies. The qualitative approach was adopted to select the appropriate technology enablers used in building an architecture that addresses the issues raised in this study. The quantitative approach was adopted to conduct different experiments on three different clusters: Hadoop Physical Cluster (HPhC), Hadoop Virtualized Cluster using KVM (HVC-KVM) [12] and Hadoop Virtualized Cluster using VMware ESXi (HVC-VMware ESXi) [13]. Each experiment measures the performance of HPC.

1.7. Thesis Organization

The rest of this thesis is structured as follows (Figure 1):

• Part I covers chapter 1 (the current chapter), which introduces the present research.

• Part II covers chapters 2, 3, 4 and 5. Chapter 2 provides a basic understanding of cloud computing; chapter 3 introduces virtualization; chapter 4 presents the concepts of Big Data and HPCaaS, and chapter 5 lists some related work and clearly states our contribution.

• Part III covers chapters 6, 7 and 8. Chapter 6 explains the steps we followed in selecting the technology enablers of this research, and chapters 7 and 8 present OpenStack and Hadoop in detail, respectively.

• Part IV covers chapters 9, 10, 11 and 12. Chapter 9 presents the methodology adopted in conducting this research; chapter 10 describes the preparation of the environment for running the experiments; chapter 11 presents the results, and chapter 12 discusses the research findings.

• Part V covers chapter 13, which concludes the research findings and proposes some future work; this part also includes the bibliography and appendices of this study.

Part II: Theoretical Baselines

The objective of this part is to elaborate on the scientific concepts, theories and topics that serve as a foundation for understanding the whole picture of the present research. This part is structured as follows: chapter 2 gives basic background on cloud computing; chapter 3 introduces a technology closely related to cloud computing, namely virtualization; chapter 4 presents Big Data and HPCaaS, and chapter 5 situates this research by reviewing previous work on evaluating HPC.

Chapter 2: Cloud Computing

Cloud computing has become the current innovative and emerging trend in delivering IT services, attracting interest from both academia and industry. Using advanced technologies, cloud computing provides end users with a variety of services, from hardware-level services to application-level services. Cloud computing is understood as utility computing over the Internet: computing services have moved from local data centers to hosted services that are offered over the Internet and paid for on a pay-per-use basis [14]. This chapter provides an overview of the cloud computing concept. It gives a definition of cloud computing, defines cloud computing characteristics, describes cloud service and deployment models, discusses some cloud computing benefits, and finally lists some cloud computing providers.

3.1. Cloud Computing Definition

In the late 1960s, John McCarthy brought a new concept into the computer science field, predicting that technology would not only be provided as tangible products [14]. That is, computer resources would be provided as a service, like water and electricity. The concept was known as utility computing, and nowadays it is known as cloud computing.

In 2009, NIST (National Institute of Standards and Technology) [15] defined cloud computing as follows:

“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.”

The NIST definition highlights the effective use of cloud computing, in that shared resources require minimal management effort. It sets five characteristics that define cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. NIST classifies deployment models into private, public, community and hybrid clouds. More details about cloud characteristics, service models and deployment models are provided in the following subsections.

The NIST definition of cloud computing is summarized in Figure 2, which encapsulates cloud computing characteristics, service models and deployment models.

Figure 2: NIST visual model of cloud computing definition [14]

3.2. Cloud Computing Characteristics

NIST lists five essential characteristics that describe cloud computing [15]:

• On-demand self-service: end users can use and change computing capabilities as desired without requiring human interaction with each service provider.

• Broad network access: resources are accessed over the network using standard mechanisms.

• Resource pooling: the provider’s computing resources are pooled to serve multiple consumers; these resources are dynamically assigned and reassigned according to consumer demand. Examples of resources include storage, processing, memory and network bandwidth.

• Rapid elasticity: cloud providers can elastically scale resources in and out depending on current end users’ demand. Therefore, resources can be available for provisioning in any quantity at any time.

• Measured service: resource usage can be monitored, controlled and measured, which enables end users to pay on a pay-as-you-go basis.

In addition to these five, the following characteristics are commonly associated with cloud computing:

• Reliability: this feature is ensured by implementing and providing multiple redundant sites. With this feature, cloud computing is considered an ideal solution for disaster recovery and business-critical tasks.

• Customization: cloud computing allows customization of infrastructure and applications based on end users’ demand.

• Efficient resource utilization: this feature ensures that resources are delivered only for as long as they are needed.
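The rapid elasticity and measured service characteristics listed above can be illustrated with a small sketch. The code below sizes a pool of instances to follow a changing demand and then computes a pay-as-you-go bill from the metered usage. All names, the demand series, the per-instance capacity and the price are hypothetical values chosen for illustration; they do not come from any provider or from this thesis.

```python
import math

def instances_needed(demand, capacity_per_instance, min_instances=1):
    """Elasticity: smallest instance count that covers the demand."""
    return max(min_instances, math.ceil(demand / capacity_per_instance))

def metered_cost(instance_counts, hours_per_interval, price_per_instance_hour):
    """Measured service: bill only the instance-hours actually used."""
    instance_hours = sum(instance_counts) * hours_per_interval
    return instance_hours * price_per_instance_hour

# Hypothetical demand (requests per second) over four one-hour intervals.
hourly_demand = [120, 450, 900, 300]

# Assumption: each instance serves 200 requests per second.
fleet = [instances_needed(d, 200) for d in hourly_demand]

# Assumption: 0.10 currency units per instance-hour.
cost = metered_cost(fleet, hours_per_interval=1, price_per_instance_hour=0.10)

print(fleet)
print(round(cost, 2))
```

The pool grows while demand rises and shrinks again afterwards, so the bill reflects the instance-hours consumed rather than the peak capacity.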

3.3. Cloud Computing Service Models

Based on the NIST definition of cloud computing, cloud service models are classified as follows:

• Software as a Service (SaaS)

Software as a Service (SaaS) covers application software together with the operating system and computing resources it runs on. End users can view the SaaS model as a web-based application interface where services and complete software applications are delivered over the Internet. Some examples of SaaS applications are Google Docs, Microsoft Office Live, Salesforce Customer Relationship Management, etc.

• Platform as a Service (PaaS)

This service allows end users to create and deploy applications on the provider’s cloud infrastructure. In this case, end users do not manage or control the underlying cloud infrastructure, such as the network, servers, operating systems or storage. However, they do have control over the deployed applications, as they can design, model, develop and test them. Examples of PaaS are Google App Engine, Microsoft Azure, Salesforce, etc.

• Infrastructure as a Service (IaaS)

This service consists of a set of virtualized computing resources such as network bandwidth, storage capacity, memory and processing power. These resources can be used to deploy and run arbitrary software, which can include operating systems and applications. Examples of IaaS providers are Dropbox, Amazon Web Services, etc.

Figure 3: Services provided in cloud computing environment [16]

3.4.Cloud Computing Deployment Models

 Private Cloud

Private cloud computing is provisioned for exclusive use by an organization. The cloud in this case is owned, managed and operated by the organization, a third party, or both of them. The advantage of private cloud consists in providing high security since the cloud is accessed by trusted entities within the organization [15].

 Public Cloud

The cloud infrastructure is provisioned for general public use. It may be owned, managed, and operated by cloud service provider who offers services based on pay-per-use model. In contrast to private cloud, public cloud is known as untrustworthy environment [15].

 Community Cloud

The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from different organizations that share some goals (e.g., mission, security requirements, policy, and compliance considerations). In this case, the cloud may be owned, managed, and operated by one or more organizations in the community, a third party, or combination of them [15].

 Hybrid Cloud

This cloud is a combination of both private and public cloud computing environments. Hybrid cloud provides high flexibility and choices for organization; for instance, critical core activities of an organization can be run under the control of the private part of the hybrid cloud while other tasks may be outsourced to the public part [17].

Table 1: A comparison of cloud deployment models [17]

3.5.Cloud Computing Benefits

Nowadays, cloud is widely used because of the benefits it provides to end users. Some of the key benefits offered by the cloud include [17, 18]:

 Initial Cost Savings

Organizations or individuals can avoid the large initial investment needed to launch new hardware, products and services; the cloud computing platform offers the needed resources in terms of infrastructure, platform and applications.

 Scalability

Cloud computing ensures high scalability by scaling up resources as needed. Therefore, when usage increases, resources increase proportionally to respond to end users' demand.

 Availability

Cloud providers have the infrastructure and bandwidth to accommodate business requirements for high speed access, storage and systems.

 Reliability

Cloud computing implements redundant paths to support business continuity and disaster recovery.

 Maintenance

End users are not concerned with the resources maintenance since it is done by the cloud service provider.

3.6.Cloud Computing Providers

There are many providers who offer cloud services with different features and pricing. Some of them are listed as follows [16, 19]:

 Amazon Web Services

Amazon Web Services (AWS) [20] offers a number of cloud services for all business sizes. AWS employs advanced data privacy techniques to protect users' data, and it has obtained various security certifications and audits such as ISO 27001, FISMA Moderate and SAS 70 Type II. Some AWS services are: Elastic Compute Cloud, Simple Storage Service, SimpleDB (a non-relational data storage service that stores, processes and queries data sets in the cloud), etc.

 Google

Google [21] offers high accessibility and usability in its cloud services. Some of Google's services include: Google App Engine, Gmail, Google Docs, Google Analytics, Picasa (a tool used to exhibit products and upload their images to the cloud), etc.

 Microsoft

Microsoft [22] offers a famous cloud platform called Windows Azure which runs Windows applications. Some other services include: SQL Azure, Windows Azure Marketplace (an online market to buy and sell applications and data), etc.

 OpenStack

OpenStack [23] is an open source platform for public and private cloud computing that aims at ensuring scalability and flexibility. It was founded by Rackspace Hosting and NASA.

Some other organizations that invest in the cloud are: Dell, IBM, Oracle, HP, Salesforce, etc. [16].

Chapter 3: Virtualization

Cloud providers rely on many technologies and practices, including internet protocols for communication, virtual private cloud provisioning, load balancing and scalability, distributed processing, high performance computing technologies and virtualization [24]. This chapter focuses on virtualization technology, as it is considered the core of cloud computing. It describes in detail the history, benefits, types and the abstraction layer of virtualization.

4.1.Definition of Virtualization

Virtualization is a widely used term; it was introduced many years ago as a powerful technology in computer science. The definition of virtualization can change depending on which component of the computer system it is applied to. However, it is broadly defined as an abstraction layer between physical resources and their logical representation [25]. NIST has defined virtualization as [26]:

“The simulation of the software and/or hardware upon which other software runs. This simulated environment is called a virtual machine (VM). There are many forms of virtualization, distinguished primarily by computing architecture layer. For example, application virtualization provides a virtual implementation of the application programming interface (API) that a running application expects to use, allowing applications developed for one platform to run on another without modifying the application itself. The Java Virtual Machine (JVM) is an example of application virtualization; it acts as an intermediary between the Java application code and the operating system (OS). Another form of virtualization, known as operating system virtualization, provides a virtual implementation of the OS interface that can be used to run applications written for the same OS as the host, with each application in a separate VM container”.

Furthermore, virtualization is defined by SNIA (Storage Networking Industry Association) as follows [27]:

“The act of abstracting, hiding, or isolating the internal functions of a storage (sub) system or service from applications, host computers, or general network resources, for the purpose of enabling application and network-independent management of storage or data”.

From both definitions, we can say that virtualization is a methodology of dividing a physical machine into multiple execution environments that allow multiple tasks to run simultaneously. This is done by providing a software abstraction layer called the Virtual Machine Manager (VMM) or hypervisor. The VMM is therefore designed to hide the physical resources from the operating system. In this case, the VMM allows creating multiple guest operating systems (OS), each of which is run by a software unit called a virtual machine (VM) [28].

4.2.History of Virtualization

The roots of virtualization go back to the first virtualized IBM mainframes, which were designed in the 1960s and which allowed the company to run multiple applications and processes simultaneously. In fact, the main drivers behind introducing virtualization were the high cost of hardware and the need for running and isolating applications on the same hardware. During the 1970s, the adoption of virtualization technology increased sharply in many organizations because of its cost effectiveness. However, in the 1980s and 1990s, hardware prices dropped and multitasking operating systems emerged. As a result, there was less need to ensure high CPU utilization, and therefore less need for virtualization technology. Yet, in the late 1990s, virtualization technology was brought back to the market with the introduction of VMware Inc., which originated at Stanford University. Nowadays, virtualization is widely used to reduce management costs by replacing many under-utilized servers with a single server [29].

4.3.Benefits of Virtualization

There are several reasons that push many organizations to adopt virtualization technology; some of them are listed in [24, 29, 30] as follows:

 Server Consolidation

It condenses multiple servers into one physical server that hosts many virtual machines. This allows the physical server to run at a high rate of utilization, and at the same time it reduces hardware maintenance, power and cooling costs.

 Application Consolidation

Legacy applications might require newer hardware and/or operating systems. In this case, virtualization can be used to virtualize the new requirements.

 Sandboxing

Virtualization can provide secure and isolated environment by running virtual machines that can be used to run foreign or less-trusted applications.

It can provide the facility of having multiple simultaneous operating systems that can run different types of applications.

 Reducing Cost

Virtualization reduces deployment and configuration costs by requiring less hardware, less space and less staffing. Furthermore, virtualization reduces the cost of networking by requiring less wiring and fewer switches and hubs.

4.4.Virtualization Approaches

Virtualization can take different forms depending on which component of the computer system it is applied to [31]. In this section, we will shed light on three well-known virtualization techniques: Full Virtualization, Paravirtualization, and Hardware Assisted Virtualization.

4.4.1. Full Virtualization

In full virtualization, the guest OS is fully abstracted from the hardware level by adding a virtualization layer: the VMM or hypervisor. In this case, the guest OS is not aware that it is being virtualized, and it requires no modifications. This approach provides each VM with all services of the physical system, including virtual BIOS, virtual devices and virtualized memory management. To manage the communication between different layers, full virtualization combines binary translation and direct execution techniques (Figure 4). Binary translation is used to convert guest OS instructions into host instructions. On the other hand, application or user level instructions are directly executed on the processor to ensure high performance [32]. Microsoft Virtual Server is an example of full virtualization.
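
The split between binary translation and direct execution described above can be illustrated with a small sketch. It is only a toy model under stated assumptions: the function name `run_full_virtualization`, the `"kernel"`/`"user"` mode labels and the instruction names are hypothetical and do not come from [32] or from any real VMM.

```python
def run_full_virtualization(instruction_stream):
    """Route each (mode, instruction) pair as described in the text.

    Assumption: the mode labels "kernel" and "user" and the instruction
    names are illustrative placeholders, not a real instruction set.
    """
    trace = []
    for mode, instruction in instruction_stream:
        if mode == "kernel":
            # Guest OS code is rewritten by the VMM before it runs on the host.
            trace.append(("binary translation", "translated(" + instruction + ")"))
        elif mode == "user":
            # Application code runs unchanged on the processor.
            trace.append(("direct execution", instruction))
        else:
            raise ValueError("unknown mode: " + repr(mode))
    return trace
```

Calling the function with a mixed stream, for example `[("user", "add"), ("kernel", "write_cr3")]`, returns one trace entry per instruction showing which path it took.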

4.4.2. Paravirtualization

The fundamental issue with full virtualization is the emulation of devices within the hypervisor. This issue was addressed by developing the paravirtualization technique, which allows the guest OS to be aware that it is being virtualized and to have direct access to the underlying hardware. In paravirtualization, the actual guest code is modified to use a different interface that accesses the hardware directly or the virtual resources controlled by the hypervisor [32]. In more detail, paravirtualization changes the OS kernel to replace non-virtualizable instructions with hypercalls that communicate directly with the hypervisor. Thus, when a privileged command is to be executed on the guest OS, it is delivered to the hypervisor (instead of the OS) by using a hypercall; the hypervisor receives this hypercall and accesses the hardware to return the needed result (Figure 5). Xen is one of the systems that adopt paravirtualization technology.

Figure 5: Paravirtualization architecture [32]

The downside of paravirtualization is that the guest must be modified to integrate hypervisor awareness. This is a limitation as some operating systems do not allow such modifications (e.g. Windows 2000/XP), and even the ones that can be modified may need additional resources for maintenance/troubleshooting [32].
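
The hypercall flow described in this section can be sketched as follows. This is a simplified illustration and not Xen's actual interface: the classes `Hypervisor` and `ParavirtualizedGuest`, the operation name `page_table_base` and the guest identifier are all hypothetical.

```python
class Hypervisor:
    """Toy hypervisor that owns the 'hardware' state and serves hypercalls."""

    def __init__(self):
        # Assumption: a single hypothetical hardware register.
        self.hardware = {"page_table_base": 0}
        self.log = []

    def hypercall(self, guest_id, operation, value):
        # The hypervisor validates the request, then touches the hardware
        # on behalf of the guest and returns the result.
        if operation not in self.hardware:
            raise ValueError("unsupported hypercall: " + operation)
        self.hardware[operation] = value
        self.log.append((guest_id, operation, value))
        return self.hardware[operation]


class ParavirtualizedGuest:
    """Guest whose kernel was modified to call the hypervisor explicitly."""

    def __init__(self, guest_id, hypervisor):
        self.guest_id = guest_id
        self.hypervisor = hypervisor

    def set_page_table_base(self, address):
        # An unmodified kernel would issue a privileged instruction here;
        # the modified kernel issues a hypercall instead.
        return self.hypervisor.hypercall(self.guest_id, "page_table_base", address)
```

The sketch also makes the downside visible: the guest class has to know about the hypervisor, which corresponds to the kernel modification that paravirtualization requires.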

4.4.3. Hardware Assisted Virtualization

Hardware Assisted Virtualization allows the VMM to run directly on the hardware. In this case, the VMM controls the access of the guest OS to the hardware resources. As depicted in Figure 6, privileged and sensitive calls are sent directly to the hypervisor, removing the need for binary translation and paravirtualization. VMware ESX Server is one of the main competing VMMs that use this approach [29].

Figure 6: Hardware assisted virtualization architecture [32]

4.5.Virtual Machine Manager

As defined before, a hypervisor or VMM is the layer between the host operating system and a guest operating system, or the layer between the hardware and the guest operating systems. In [25], the author sets out three main features that a VMM needs to maintain. First, the VMM has to provide an environment that is identical to the original machine being virtualized. Second, programs running on a VM should show the same performance as on the original machine, or only a minor decrease. Finally, the VMM needs to control all system resources provided to VMs.

4.5.1. Hypervisor Types

Hypervisors are classified into Type 1 and Type 2. A Type 1 hypervisor runs directly on the system hardware; it therefore monitors the guest operating systems and allocates all the needed resources, including disk, memory, CPU and I/O peripherals. Having no intermediary between the Type 1 hypervisor and the physical layer leads to efficient performance in terms of hardware access and security level (Figure 7-a). On the other hand, a Type 2 hypervisor runs on a host operating system that provides virtualization services such as I/O and memory management (Figure 7-b). Having an intermediary layer between the hypervisor and the hardware makes the installation process easier than for a Type 1 hypervisor, since the operating system is in charge of hardware configuration such as networking and storage [33].

Figure 7: a) Type 1 hypervisor b) Type 2 hypervisor [33]

The differences between Type 1 and Type 2 hypervisors can lead to different performance results. The layer between the hardware and the hypervisor in Type 2 makes the performance less efficient than in Type 1. A sample scenario that illustrates this difference is when a virtual machine requires a hardware interaction (such as reading from disk); in this case, a Type 2 hypervisor first needs to pass the request to the host operating system, which then passes it to the hardware layer. Besides performance efficiency, the reliability of a Type 1 hypervisor is higher than that of a Type 2 hypervisor. For instance, a failure in the host operating system can directly affect the hosted guests in a Type 2 hypervisor; therefore, the availability of a Type 2 hypervisor is closely tied to the availability of the operating system. However, a Type 2 hypervisor has the advantage of fewer hardware and driver issues, as the host operating system is responsible for interfacing with the hardware [34].
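
The disk-read scenario above can be made concrete by listing the layers a request crosses. The following is a minimal sketch, assuming a purely illustrative function `io_request_path` and layer labels; it models only the ordering of layers, not real latencies.

```python
def io_request_path(hypervisor_type):
    """Return the layers a guest's hardware request passes through.

    Assumption: the layer names are descriptive labels chosen for
    illustration; they do not correspond to any real API.
    """
    if hypervisor_type not in (1, 2):
        raise ValueError("hypervisor_type must be 1 or 2")
    path = ["guest OS", "hypervisor"]
    if hypervisor_type == 2:
        # A Type 2 hypervisor hands the request to the host operating
        # system, which then talks to the hardware.
        path.append("host OS")
    path.append("hardware")
    return path
```

The Type 2 path has one more hop than the Type 1 path, which is the extra layer the text identifies as the source of the performance and availability differences.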

4.5.2. Examples of Hypervisors

a) Xen Hypervisor

The Xen hypervisor is a Type 1, or bare metal, hypervisor that is widely used for paravirtualization [35]. It is managed by a specific privileged guest (privileged VM) called Domain-0 (Dom0). Dom0 runs on the hypervisor, and it is responsible for managing all aspects of the other, unprivileged virtual machines, which are known as DomainU (DomU). Furthermore, Dom0 has direct access to the physical resources, which is not the case for DomU guests [36]. The overall architecture of the Xen hypervisor is shown in Figure 8.

Figure 8: Xen hypervisor architecture

Xen uses paravirtualization as well as full virtualization. In paravirtualization, DomU guests are referred to as DomU PV Guests, and they can be modified Linux operating systems, Solaris, FreeBSD, and other UNIX operating systems [37]. DomU PV Guests are aware that they are running in a virtualized environment, and they do not have direct access to the hardware resources. In this case, the guest operating system is modified to make special calls (hypercalls) to the hypervisor for privileged operations, instead of the regular system calls in a traditional unmodified operating system. On the other hand, in full virtualization, DomU guests are referred to as DomU HVM Guests and run any standard, unmodified operating system [37]. A DomU HVM Guest is not aware that it is sharing processing time on the hardware, and it is not aware of the presence of other virtual machines. In this case, DomU HVM requires processors that specifically support hardware virtualization extensions (Intel VT or AMD-V). Virtualization extensions allow many of the privileged kernel instructions (which in PV were converted to "hypercalls") to be handled by the hardware using the trap-and-emulate technique.

b) KVM Hypervisor

The KVM hypervisor provides a full virtualization solution based on the Linux operating system. It works by reusing the hardware assisted virtualization extensions that were already developed; KVM therefore requires the presence of Intel VT or AMD-V extensions on the host system. When KVM is loaded, it converts the kernel into a bare metal hypervisor. As a result, it takes full advantage of many components that are already present within the kernel, such as memory management and scheduling [38]. KVM is implemented using two main components. The first one is the loadable KVM module which, when installed in the Linux kernel, provides management of the virtualization hardware (Figure 9). The second component provides PC platform emulation, which is offered by a modified version of QEMU. QEMU is executed as a user-space process, coordinating with the kernel for guest operating system requests [39].
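
Since KVM depends on Intel VT or AMD-V, a host can be checked for these extensions before installation. On Linux, the CPU flags `vmx` (Intel VT) and `svm` (AMD-V) appear in the `flags` line of `/proc/cpuinfo`. The sketch below parses text in that format; the function name `virtualization_extension` is a hypothetical helper, not part of KVM.

```python
def virtualization_extension(cpuinfo_text):
    """Return "Intel VT", "AMD-V", or None from /proc/cpuinfo-style text.

    The flags `vmx` and `svm` are the Linux names for Intel VT and AMD-V.
    The function name and return labels are assumptions for illustration.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:
                return "Intel VT"
            if "svm" in flags:
                return "AMD-V"
    return None
```

On a Linux host, the function could be applied to the contents of `/proc/cpuinfo`, read with `open("/proc/cpuinfo").read()`. A result of `None` suggests that the extensions are absent or disabled in the firmware, in which case KVM cannot be used on that host.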

Figure 9: KVM hypervisor architecture

c) VMware ESXi Hypervisor

VMware was one of the first leading companies to contribute to virtualization technology. One of its virtualization products is VMware ESXi, which is installed directly on top of the physical machine [40]. VMware ESXi was introduced in 2007 to provide high levels of reliability and performance to companies of all sizes. The overall architecture of VMware ESXi is illustrated in Figure 10. The main component is the vmkernel, which contains all the necessary processes to manage VMs. It provides functionality similar to that found in other operating systems, such as process creation and control, signals, file system, and process threads. The vmkernel therefore supports running multiple virtual machines and provides core functionalities such as resource scheduling, I/O stacks and device drivers [24].

Chapter 4: Big Data and High Performance Computing as a Service

As big companies such as Google, Amazon, Facebook, LinkedIn and Twitter grow in terms of users and data generated, the capacity and computing power of current data tools become insufficient for processing, analyzing, managing, and storing data efficiently. IBM estimates that every day 2.5 quintillion bytes of data are created, and 90% of the data in the world today has been created in the last two years [41]. Besides, Oracle estimated that 2.5 zettabytes of data were generated in 2012, and that this amount will grow significantly every year (Figure 11) [42]. The increase in data size to many terabytes and petabytes is known as Big Data. To handle the complexity of Big Data, HPC is adopted to provide high computation capabilities, high bandwidth, and low-latency networking. This chapter provides an overview of the Big Data phenomenon and the HPCaaS concept.

Figure 11: Data growth between 2008 and 2020 [54]

5.1.Big Data

5.1.1. Big Data Definition

Big Data is defined as large and complex datasets that are generated from different sources including social media, online transactions, sensors, smart meters and administrative services [43]. Having all these sources, the size of Big Data goes beyond the ability of typical tools of storing, analyzing and processing data. Literature reviews on Big Data divided the concept into four dimensions: Volume, Velocity, Variety and Value [43].

 Volume: the size of data generated is very large, and it goes from terabytes to petabytes.

 Velocity: data grows continuously at an exponential rate.

 Variety: data are generated in different forms: structured data, semi-structured and unstructured data. These forms require new techniques that can handle data heterogeneity.

 Value: the challenge in Big Data is to identify what is valuable as to be able to capture, transform and extract data for analysis.

5.1.2. Big Data Technologies

With the Big Data phenomenon, there is an increasing demand for new technologies that can support the volume, velocity, variety and value of data. Some of these technologies are NoSQL, parallel and distributed paradigms, and new cloud computing trends that can support the four dimensions of Big Data.

NoSQL (Not Only Structured Query Language) represents the transition from relational databases to non-relational databases [44]. It is characterized by the ability to scale horizontally, the ability to replicate and partition data over many servers, and the ability to provide high performance operations. However, moving from relational to NoSQL systems has relaxed some of the ACID transactional properties (Atomicity, Consistency, Isolation and Durability) [45]. In this context, NoSQL properties are described by the CAP theorem [46], which states that developers must make trade-off decisions between consistency, availability and partition tolerance. Some examples of NoSQL tools are: Cassandra [47], HBase [48], MongoDB [49] and CouchDB [50].
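
The idea of partitioning and replicating data over many servers can be illustrated with a simple hash-based placement rule. This is a generic sketch and not the scheme of any specific NoSQL store named above; the function `replicas_for`, the server names and the default replication factor are assumptions made for illustration.

```python
import hashlib


def replicas_for(key, servers, replication_factor=2):
    """Pick the servers that store `key` using hash partitioning.

    Assumption: the primary server is chosen by hashing the key, and the
    remaining copies go to the next servers in list order.
    """
    if not 1 <= replication_factor <= len(servers):
        raise ValueError("replication_factor must be between 1 and len(servers)")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replication_factor)]
```

Adding servers to the list spreads keys over more machines (horizontal scaling), while a replication factor greater than one keeps each key available if a single server fails. Real systems typically use more elaborate schemes so that adding a server moves only a small share of the keys.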

Other supporting technologies for Big Data are parallel and distributed paradigms (e.g. Hadoop) and cloud computing services (e.g. OpenStack). These technologies are discussed in the upcoming chapters (Part III- Chapter 8, 9).

5.2. High Performance Computing as a Service (HPCaaS)

5.2.1. HPCaaS Overview

High Performance Computing (HPC) is used to process and analyze large and complex problems, including scientific, engineering and business problems that require high computation capabilities, high bandwidth, and low latency network [3]. HPC fits these requirements by implementing large physical clusters. However, traditional HPC faces a set of challenges that consist in peak demand, high capital, and high expertise to acquire and operate the HPC [51]. To deal with these issues, HPC experts have leveraged the benefits of new technology trends, including cloud technologies, parallel processing paradigms and large storage infrastructures. Merging HPC with these new technologies has produced a new HPC model, called HPC as a service (HPCaaS).

HPCaaS is an emerging computing model where end users have on-demand access to pre-existing technologies that provide a high performance and scalable HPC computing environment [52]. HPCaaS provides numerous benefits because of the better quality of services provided by the cloud technologies, and the better parallel processing and storage provided by, for example, the Hadoop Distributed File System and the MapReduce paradigm. Some HPCaaS benefits are stated in [51] as follows:

 High Scalability: resources scale up to ensure that users' demand is met in terms of processing large and complex datasets.

 Low Cost: End-users can eliminate the initial capital outlay, time and complexity to procure HPC.

 Low Latency: by implementing the placement group concept that ensures the execution and processing of data in the same rack or on the same server.

5.2.2. HPCaaS Providers

There are many HPCaaS providers in the market. An example is Penguin Computing [53], which has been a leader in designing and implementing high performance environments for over a decade. Nowadays, it provides HPCaaS with different options: on-demand, HPCaaS as private services and hybrid HPCaaS services. Amazon Web Services (AWS) [3] is also an active HPCaaS provider in the market; it provides simplified tools to perform HPC over the cloud. AWS allows end users to benefit from HPCaaS features with different pricing models: On-Demand, Reserved [54] or Spot Instances [55]. HPCaaS on AWS is currently used for Computer Aided Engineering, molecular modeling, genome analysis, and numerical modeling across many industries including Oil and Gas, Financial Services and Manufacturing [3]. Other leaders of HPCaaS in the market are Microsoft (Windows Azure HPC) [56] and Google (Google Compute Engine) [57].

Chapter 5: Literature Review and Research Contribution

In order to bridge the gap between the present research and previous studies, a review was conducted on the current state of HPC and virtualization. Therefore, this chapter situates the research in relation to previous research publications and states clearly the research contribution.

5.1. Related Work

There have been several studies that evaluated the performance of high performance computing in the cloud. Most of these studies used Amazon EC2 [20] as a cloud environment [58-63]. Besides, only a few studies have evaluated the performance of high performance computing using the combination of both new emerging distributed paradigms and cloud environment [64].

In [58], the authors evaluated HPC on three different cloud providers: Amazon EC2, GoGrid Cloud and IBM Cloud. For each cloud platform, they ran HPC on Linux virtual machines (VM), and they came to the conclusion that the tested public clouds do not seem to be optimized for running HPC applications. This was explained by the fact that public cloud platforms have slow network connections between virtual machines. Furthermore, the authors in [13] evaluated the performance of HPC applications in today's cloud environments (Amazon EC2) to understand the tradeoffs in migrating to the cloud. Overall results indicated that running HPC on the EC2 cloud platform limits performance and causes significant variability. Besides Amazon EC2, research done in [63] evaluated the performance-cost tradeoffs of running HPC applications on three different platforms. The first and second platforms consist of two physical clusters (Taub and Open Cirrus cluster), and the third platform consists of Eucalyptus. Running HPC on these platforms led the authors to conclude that the cloud is more cost-effective for low communication-intensive applications.

In order to understand the performance implications on HPC using virtualized resources and distributed paradigms, the authors in [64] performed an extensive analysis using Eucalyptus (16 nodes) and other technologies such as Hadoop [7], Dryad and DryadLINQ [65], and MapReduce [6]. The conclusion of this research suggested that most parallel applications can be handled fairly easily when using cloud technologies (Hadoop, MapReduce, and Dryad); however, scientific applications, which require complex communication patterns, still require more efficient runtime support.

Evaluating HPC without relating it to new cloud technologies was also performed using different virtualization technologies [66, 67, 68, 69]. In [66], the authors performed an analysis of virtualization techniques including VMware, Xen, and OpenVZ. Their findings showed that none of the techniques match the performance of the base system perfectly; yet, OpenVZ demonstrates high performance in both file system performance and industry-standard benchmarks. In [67], the authors compared the performance of KVM and VMware. Overall findings showed that VMware performs better than KVM. Still, in a few cases KVM gave better results than VMware. In [68], the authors conducted a quantitative analysis of two leading open source hypervisors, Xen and KVM. Their study evaluated the performance isolation, overall performance and scalability of virtual machines for each virtualization technology. In short, their findings showed that KVM has substantial problems with guests crashing (when increasing the number of guests); however, KVM still has better performance isolation than Xen. Finally, in [69] the authors extensively compared four hypervisors: Hyper-V, KVM, VMware, and Xen. Their results demonstrated that there is no perfect hypervisor.

5.2.Contribution

So far, only a few studies have compared different virtualization techniques and their impact on HPC in the cloud. The only study we found was done in [70], where the authors compared the performance of Xen, KVM and VirtualBox. Each virtualization technology was compared with bare-metal using a set of high performance benchmarking tools. The results of this research demonstrated that KVM is the best choice for HPC in the cloud because of its rich features and near-native performance.

The contribution of this present research will fill the literature gap by examining the impact of virtualization techniques on HPCaaS using OpenStack as a cloud platform and Hadoop as a distributed and parallel system.

Part III: Technology Enablers

This part explains the use of OpenStack and Hadoop as underlying technologies for this research. It starts by providing a qualitative study for selecting an appropriate cloud platform and distributed system; the second chapter of this part introduces OpenStack components in detail, and the third chapter presents Hadoop and its main aspects.

Chapter 6: Technology Enablers Selection

The architecture we adopted to evaluate the impact of virtualization on HPCaaS was built after conducting a qualitative study of available tools in the market. We targeted mainly open-source tools in order to select an appropriate cloud computing platform and distributed system. Hence, this chapter presents the analysis we followed in selecting the cloud platform and distributed system.

6.1.Cloud Platform Selection

To compare available open-source cloud platforms, we tried to choose the most popular ones. The selection of competing platforms was based on a study that compares the popularity of OpenStack, OpenNebula, Eucalyptus and CloudStack in 2013 [71]. As depicted in Figure 12, the study showed that OpenStack has the largest total population index, followed by Eucalyptus, CloudStack, and OpenNebula.

Figure 12: Active cloud community population [71]

Based on Figure 12, we chose to compare and study OpenStack, OpenNebula and Eucalyptus. To adopt one of these open-source cloud platforms, we used other studies that compare their performance and quality [72-75].

In [72], the authors compared some open and commercial cloud platforms. Concerning open platforms, they compared OpenNebula and Eucalyptus. To perform the comparison, they adopted a set of criteria, including storage, virtualization, network, management, security and vendor support. The results of the research showed that open-source and commercial solutions can have comparable features, and that OpenNebula is the most feature-complete cloud platform when it is compared with Eucalyptus.

[73] and [74] provide a comparison study of OpenStack and OpenNebula. In [73], authors compared the performance of both cloud platforms based on measuring the time when the cloud starts instantiating VMs and the time when they are ready to accept SSH connections. The findings of the research demonstrate that OpenStack is slightly better than OpenNebula due to smaller instantiation time. Moreover, the results showed that OpenStack is more suitable for high computing due to faster instantiation of large number of VMs. In [74], authors used qualitative and quantitative analysis to compare OpenStack and OpenNebula. For the qualitative analysis, they adopted some of the following criteria: security, virtualization supported, access, image support, resource selection, storage support, high-availability support and API support. Based on the results of the qualitative study, authors concluded that OpenStack would benefit in case of auto-scaling, while OpenNebula would benefit in case of persistent storage support. For the quantitative analysis, authors measured the deployment, network overhead and the clean-up time of VMs. The results of quantitative analysis showed that each platform can be used depending on user requirements and specifications.

In [75], authors provided a comparative study of four solutions: Eucalyptus, OpenStack, OpenNebula and CloudStack. To perform the comparison, authors adopted the following criteria: storage, network, security, hypervisor, scalable and installation code openness. In short, the results of this study [75] showed that OpenStack is the preferred cloud open source. Table 2 summarizes the preferred cloud IaaS in [72-75]. Based on this table, we decided to go for OpenStack as it is known for its flexibility and total openness.

6.2.Distributed and Parallel System Selection

To compare available distributed and parallel systems in the market, we opted again for the popularity index of those systems. The selection of competing systems was based on a study done in [76]. The study is summarized in Figure 13 which compares the popularity index of Hadoop, MongoDB, Cassandra, CouchDB, Redis, VoltDB, Neo4j, Riak and Infinispan. The study was done in 2012, and it demonstrates the total downloads between January 2011 and March 2012. Figure 13 depicts that Hadoop is the most popular distributed system, followed by MongoDB and Cassandra.

Figure 13: Active distributed systems population [76]

Based on Figure 13, we performed a qualitative analysis of both Hadoop and MongoDB in order to end up with one selected system for the present research.

MongoDB is a document-oriented database that uses a binary form of JSON, called Binary JSON (BSON), to store data as documents rather than in tables with columns and rows. To provide high redundancy and make data highly available, MongoDB offers replication across multiple servers. While data is synchronized between servers using replication, MongoDB also facilitates scaling out by supporting sharding, which partitions a collection and stores the different portions on different machines. MongoDB can be used with MapReduce to process data in parallel at each shard [62]. On the other hand, Hadoop is an open-source framework for distributed storage and processing that supports processing, analyzing and storing large data sets across large clusters using the MapReduce paradigm and HDFS [7]. More details about Hadoop are included in chapter 8.
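
The combination of sharding and per-shard MapReduce described above can be sketched in plain Python. This is not MongoDB's or Hadoop's API; the functions `shard_collection`, `map_shard` and `reduce_partials`, as well as the field names used in the usage note, are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter


def shard_collection(documents, shard_key, num_shards):
    """Partition documents into shards by an integer shard key."""
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[doc[shard_key] % num_shards].append(doc)
    return shards


def map_shard(shard, group_field, value_field):
    """Map step: compute partial sums per group within a single shard."""
    partial = Counter()
    for doc in shard:
        partial[doc[group_field]] += doc[value_field]
    return partial


def reduce_partials(partials):
    """Reduce step: merge the partial sums produced by all shards."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)
```

For a collection of documents with hypothetical fields `_id`, `city` and `amount`, the shards returned by `shard_collection(docs, "_id", 3)` can each be passed to `map_shard(shard, "city", "amount")`, and the results merged with `reduce_partials`. In a real deployment each map step would run on the machine holding the shard, so only the small partial results travel over the network.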