
As large companies such as Google, Amazon, Facebook, LinkedIn and Twitter grow in users and in the data they generate, the capacity and computing power of current data tools become insufficient, which leads to inefficient processing, analysis, management, and storage of data. IBM estimates that 2.5 quintillion bytes of data are created every day, and that 90% of the data in the world today has been created in the last two years [41]. In addition, Oracle estimated that 2.5 zettabytes of data were generated in 2012, and that this volume will grow significantly every year (Figure 11) [42]. The growth of datasets to many terabytes and petabytes is known as Big Data. To handle the complexity of Big Data, HPC is adopted to provide high computation capabilities, high bandwidth, and low-latency networking. This chapter provides an overview of the Big Data phenomenon and the HPCaaS concept.

Figure 11: Data growth between 2008 and 2020 [54]
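
To put the two estimates quoted above on a common scale, the following minimal sketch converts the IBM daily figure into a yearly volume. It assumes decimal (SI) units, a constant daily rate and a 365-day year; the function name and constants are illustrative and do not come from the cited sources.

```python
# Illustrative unit conversion; decimal (SI) prefixes are assumed.
EXABYTE = 10**18    # 1 EB = 10^18 bytes (one quintillion bytes)
ZETTABYTE = 10**21  # 1 ZB = 10^21 bytes


def annual_volume_zb(bytes_per_day: float, days: int = 365) -> float:
    """Return the yearly data volume in zettabytes for a constant daily rate."""
    return bytes_per_day * days / ZETTABYTE


ibm_daily_bytes = 2.5 * EXABYTE  # 2.5 quintillion bytes per day [41]
print(f"{annual_volume_zb(ibm_daily_bytes):.2f} ZB per year")
```

Under these assumptions the daily figure corresponds to a little under one zettabyte per year, which is the same order of magnitude as the 2.5 zettabytes reported for 2012 [42]. The two sources use different dates and methods, so an exact match is not expected.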

5.1. Big Data

5.1.1. Big Data Definition

Big Data is defined as large and complex datasets generated from different sources, including social media, online transactions, sensors, smart meters and administrative services [43]. With all these sources, the size of Big Data goes beyond the ability of typical tools to store, analyze and process data. The literature on Big Data divides the concept into four dimensions: Volume, Velocity, Variety and Value [43].

Volume: the size of the data generated is very large, ranging from terabytes to petabytes.

Velocity: data grows continuously at an exponential rate.

Variety: data are generated in different forms: structured, semi-structured and unstructured. These forms require new techniques that can handle data heterogeneity.

Value: the challenge in Big Data is to identify what is valuable, so that the relevant data can be captured, transformed and extracted for analysis.

5.1.2. Big Data Technologies

With the Big Data phenomenon, there is an increasing demand for new technologies that can support the volume, velocity, variety and value of data. These technologies include NoSQL, parallel and distributed paradigms, and new cloud computing trends.

NoSQL (Not Only Structured Query Language) denotes the transition from relational databases to non-relational databases [44]. It is characterized by the ability to scale horizontally, the ability to replicate and partition data over many servers, and the ability to provide high-performance operations. However, moving from relational to NoSQL systems gives up some of the ACID transactional properties (Atomicity, Consistency, Isolation and Durability) [45]. In this context, NoSQL properties are described by the CAP theorem [46], which states that developers must make trade-off decisions between consistency, availability and partition tolerance. Some examples of NoSQL tools are Cassandra [47], HBase [48], MongoDB [49] and CouchDB [50].
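
The partitioning and replication described above can be illustrated with a minimal sketch. It assumes simple hash-modulo placement with replicas on the following nodes; this is a teaching simplification, not the scheme used by any of the tools cited here, and production systems typically rely on more elaborate approaches such as consistent hashing. The function name, node names and keys are hypothetical.

```python
import hashlib


def replica_nodes(key: str, nodes: list[str], replication_factor: int = 2) -> list[str]:
    """Pick the nodes that store `key`: a primary chosen by hash, then its successors."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    # A stable hash keeps the placement identical across processes and runs.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    primary = int(digest, 16) % len(nodes)
    # Replicas go to the next nodes in list order, wrapping around at the end.
    return [nodes[(primary + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical server names
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", replica_nodes(key, nodes))
```

Adding a server to `nodes` spreads keys over more machines, which is the horizontal scaling mentioned above. With plain modulo placement, however, most keys move when the node count changes, which is one reason real systems use other schemes.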

Other supporting technologies for Big Data are parallel and distributed paradigms (e.g. Hadoop) and cloud computing services (e.g. OpenStack). These technologies are discussed in the upcoming chapters (Part III, Chapters 8 and 9).

5.2. High Performance Computing as a Service (HPCaaS)

5.2.1. HPCaaS Overview

High Performance Computing (HPC) is used to process and analyze large and complex problems, including scientific, engineering and business problems, that require high computation capabilities, high bandwidth, and low-latency networking [3]. HPC meets these requirements by implementing large physical clusters. However, traditional HPC faces a set of challenges: peak demand, high capital cost, and the high level of expertise needed to acquire and operate HPC systems [51]. To deal with these issues, HPC experts have leveraged new technology trends, including cloud technologies, parallel processing paradigms and large storage infrastructures. Merging HPC with these technologies has produced a new HPC model, called HPC as a Service (HPCaaS).

HPCaaS is an emerging computing model in which end users have on-demand access to pre-existing technologies that provide a high-performance and scalable HPC computing environment [52]. HPCaaS offers several benefits, owing to the quality of service provided by cloud technologies and to the parallel processing and storage provided by, for example, the Hadoop Distributed File System (HDFS) and the MapReduce paradigm. Some HPCaaS benefits are stated in [51] as follows:

High Scalability: resources scale up to match users' demand when processing large and complex datasets.

Low Cost: end users can eliminate the initial capital outlay, time and complexity needed to procure HPC.

Low Latency: the placement group concept ensures that data are executed and processed in the same rack or on the same server.
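
The MapReduce paradigm mentioned above can be illustrated with a minimal single-process sketch of a word count. It only shows the map, shuffle and reduce phases; in a framework such as Hadoop these phases would run in parallel on many nodes, which this sketch does not attempt. All function names and the toy input are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


lines = ["big data needs HPC", "HPC as a service", "big clusters"]  # toy input
print(word_count(lines))
```

Because each call to `map_phase` depends on a single line and each call to `reduce_phase` on a single key, both phases can be distributed across machines, which is what makes the paradigm suitable for large datasets.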

5.2.2. HPCaaS Providers

There are many HPCaaS providers in the market. One example is Penguin Computing [53], which has been a leader in designing and implementing high-performance environments for over a decade. It now provides HPCaaS with different options: on-demand, private HPCaaS services and hybrid HPCaaS services. Amazon Web Services (AWS) [3] is also an active HPCaaS provider; it offers simplified tools to perform HPC over the cloud. AWS allows end users to benefit from HPCaaS features with different pricing models: On-Demand, Reserved [54] or Spot Instances [55]. HPCaaS on AWS is currently used for computer-aided engineering, molecular modeling, genome analysis, and numerical modeling across many industries, including oil and gas, financial services and manufacturing [3]. Other leading HPCaaS providers are Microsoft (Windows Azure HPC) [56] and Google (Google Compute Engine) [57].

Chapter 5: Literature Review and Research
