Summary - Investigating elastic cloud based RDF processing

This chapter has provided a background on the Resource Description Framework (RDF) and RDF Schema (RDFS). The chapter also presented a review of the literature covering work on distributed RDFS reasoning and RDF data compression. Because RDF data on the Web has grown to billions of statements, the reviewed approaches have primarily focused on using distributed computing for RDF processing, but this is not without challenges. These challenges include data storage, workload distribution and the restriction to fixed computation resources, both in terms of the number of nodes and the specification of each node. Moreover, existing work on RDF dictionary encoding for large datasets generates large dictionaries, which results in slow encoding and decoding.

To conclude, existing work on large scale reasoning on RDF Web data has highlighted two important characteristics about this data: (1) the schema data represents only a small fraction of the overall data [70] and can fit in memory during the reasoning process, and (2) the schema data should be treated differently to increase performance [85]. This thesis utilises both of these characteristics to address the challenges described in this chapter. The following chapter (Chapter 3) reviews the literature concerning the cloud computing services that can be used to address the RDF processing challenges presented here.

The Cloud Computing Paradigm

Chapter 2 has introduced the Resource Description Framework (RDF) and highlighted some of the issues currently facing large scale RDF processing. It was shown that some of these issues are related to computing resources and storage constraints, hence the aim of this research is to address these RDF issues by utilising cloud computing (Figure 1.1). Cloud computing is a computing paradigm that runs on physical hardware and enables users to acquire computing resources on-demand without any upfront investments.

Cloud computing can provide substantial cost savings when compared to physical computer environments that consume energy and incur cost even when the resources are not being used. This is because the cost of running the underlying hardware is managed by the cloud providers and users only pay for the resources they use. Publicly available cloud computing services can be grouped under broad categories, even though the underlying implementation differs from one provider to another. This chapter provides a review of these services and introduces key concepts and definitions that are used to design and develop the CloudEx framework.

Section 3.1 first provides background and cloud computing definitions. Then, Section 3.2 presents an overview of public cloud deployments and common concepts shared amongst public cloud providers. Section 3.3 follows, explaining at an abstract level the main cloud services utilised in this research. Then, Section 3.4 provides an overview of the main Google Cloud Platform services used in the CloudEx prototype development and evaluation phases of the research. Subsequently, Section 3.5 categorises and provides a survey of related cloud computing literature and introduces the cloud-first frameworks concept. Finally, Section 3.6 concludes this chapter with a summary of the key concepts introduced herein.

3.1 Background

Advances in the virtualisation of computing resources have led to the emergence of the cloud computing paradigm. The National Institute of Standards and Technology (NIST) provides the following definition for cloud computing [23]:

“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

Cloud computing resources are deployed on top of physical hardware (servers, network switches, storage appliances, etc.) hosted in data centres. Because the underlying hardware is always on, it is possible to start and shut down these virtual computing resources at any time. The ability to instantly provision or release computing resources programmatically using an API is called elasticity [24, 25, 26]. Elasticity enables a wide range of features, one of which is the ability to script virtual infrastructure deployments. Another feature is the ability to dynamically scale computing resources up and down based on demand. Resources can be scaled horizontally by adding more computing resources or, to some extent, vertically by utilising computing resources with a higher specification.
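As an illustration of horizontal scaling based on demand, the following minimal Python sketch computes how many worker nodes a backlog of tasks would call for and which scaling action follows. It is not taken from CloudEx or from any cloud provider API: the function names, the tasks-per-node capacity and the node cap are hypothetical, and a real deployment would pass the resulting action to the provider's provisioning API.

```python
import math


def nodes_needed(pending_tasks: int, tasks_per_node: int, max_nodes: int) -> int:
    """Return the number of nodes to run for the current backlog.

    All three parameters are hypothetical illustration values, not
    settings of any real cloud service.
    """
    if pending_tasks <= 0:
        return 0  # nothing to do: release every node
    # Round up so that a partially filled node is still provisioned,
    # then cap at the maximum the user is prepared to pay for.
    return min(max_nodes, math.ceil(pending_tasks / tasks_per_node))


def scaling_action(current_nodes: int, target_nodes: int) -> str:
    """Describe the horizontal scaling step from current to target."""
    if target_nodes > current_nodes:
        return f"provision {target_nodes - current_nodes}"
    if target_nodes < current_nodes:
        return f"release {current_nodes - target_nodes}"
    return "no change"


if __name__ == "__main__":
    current = 2
    for backlog in (10, 100, 0):
        target = nodes_needed(backlog, tasks_per_node=4, max_nodes=8)
        print(backlog, scaling_action(current, target))
        current = target
```

Vertical scaling would instead keep the node count fixed and replace a node with one of a higher specification, which this sketch does not model.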

3.1.1 Cloud Service Models

Public cloud providers, most notably Amazon Web Services [27], Google Cloud Platform [28] and Microsoft Azure [29], utilise their data centres to commercially offer virtualised computing resources. Offered resources broadly fall into one of these service models:

• Infrastructure as a Service (IaaS) such as compute, storage and network resources.

• Platform as a Service (PaaS) such as databases, server runtimes and messaging.

• Software as a Service (SaaS) such as collaboration and productivity suites.

3.1.2 Cloud Benefits

Consumers can benefit from using cloud computing in many ways. Generally speaking, the benefits can be summarised as follows:

• On-demand services - Consumers can provision and scale computing resources automatically when needed without any human interaction with the cloud providers.

• Pay per use computing - Consumers only pay for what they use, which is highly cost efficient compared to a physical computing environment that is billed 24/7 even when the resources are not being fully utilised.

• No upfront infrastructure investment - With cloud Infrastructure as a Service (IaaS), consumers can instantly acquire the required infrastructure without any expensive upfront investment.
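The pay per use benefit listed above can be made concrete with a small arithmetic sketch. The hourly rate and usage figures below are hypothetical and chosen only for illustration; they are not prices from any provider, and real billing typically involves further factors such as billing granularity and storage or network charges. Costs are kept in integer cents to avoid floating point rounding.

```python
def always_on_cost(rate_cents_per_hour: int, hours_in_period: int) -> int:
    """Cost of an environment that is billed for every hour of the period."""
    return rate_cents_per_hour * hours_in_period


def pay_per_use_cost(rate_cents_per_hour: int, hours_used: int) -> int:
    """Cost when only the hours actually used are billed."""
    return rate_cents_per_hour * hours_used


# Hypothetical figures: 10 cents per hour, a 30-day month (720 hours),
# and a workload that needs the resources for 60 hours in that month.
RATE_CENTS = 10
HOURS_IN_MONTH = 30 * 24
HOURS_USED = 60

fixed = always_on_cost(RATE_CENTS, HOURS_IN_MONTH)   # 7200 cents
on_demand = pay_per_use_cost(RATE_CENTS, HOURS_USED)  # 600 cents
saving = fixed - on_demand                            # 6600 cents

if __name__ == "__main__":
    print(f"always on: {fixed} cents, pay per use: {on_demand} cents, "
          f"saving: {saving} cents")
```

Under these assumed figures the gap grows with the idle time: the lower the utilisation of the always-on environment, the larger the saving from paying only for the hours used.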
