Running the Framework - Investigating elastic cloud based RDF processing

5.6 Implementation

5.6.6 Running the Framework

The coordinator is the minimum component that needs to be running in order to use the CloudEx framework. The coordinator can be a VM deployed on the cloud provider's IaaS or a physical computer, as long as the CloudEx framework has the required credentials to authenticate with the cloud provider. Running the coordinator requires two components: the JVM and the CloudEx framework. Users can utilise and extend the CloudEx framework by importing the framework libraries into their own applications and frameworks.

5.7 Summary

This chapter has focused on the first contribution of this thesis, which answers the research question “Can a cloud-based system efficiently distribute and process tasks with different computational requirements?”. The chapter introduced CloudEx, a generic and open source cloud-first task execution framework that can be implemented on any cloud provider's IaaS. The algorithms presented for CloudEx address the issues with cloud-first frameworks highlighted in Chapter 3. This is achieved by providing generic, cloud provider independent mechanisms for executing tasks on cloud environments. Additional components that enable CloudEx to work with a particular cloud provider can be added, enabling users to easily move their applications from one cloud provider to another. Moreover, a workload partitioning approach based on the bin-packing algorithm was introduced to efficiently distribute the processing of tasks between a number of processors.

This chapter has outlined the various components of CloudEx, including the coordinator and processor virtual machines. An approach based on the bin-packing algorithm was presented for workload partitioning between the various processors. Additionally, CloudEx enables users to use a divide-and-conquer approach to divide their jobs into multiple tasks and specify how these tasks should be executed. Users can specify not only the number of required processors, but also their types in terms of CPU cores and memory. An implementation for the Google Cloud Platform was also presented. Chapter 6 builds on CloudEx and defines the architecture of ECARF, a triple store for RDF dataset processing and forward reasoning on the cloud. Chapter 7 evaluates this implementation in processing real-world problems such as RDF data processing.

ECARF, Processing RDF on the Cloud

The explosive growth of RDF datasets to billions of statements has resulted in solutions that utilise high specification hardware, which requires considerable upfront investment. Cloud computing, on the other hand, has motivated this research to develop solutions that can efficiently process large datasets with billions of statements without any upfront investment. In this regard, Chapter 5 introduced the first contribution of this thesis and outlined the architecture of CloudEx, a cloud-first task execution framework. This chapter continues the contribution of this thesis and describes the design of an Elastic Cost Aware Reasoning Framework (ECARF), a cloud-based RDF triple store implemented as CloudEx tasks. The algorithms described in this chapter address the issues with large RDF processing outlined in Chapter 2, namely dictionary encoding, data storage and workload partitioning. Additionally, these algorithms provide answers to the following research questions:

• Q2. How can an efficient dictionary that fits in-memory be created for encoding RDF URIs from strings to integers?

• Q3. How can cloud-based big data services such as Google BigQuery be used to perform RDFS rule-based reasoning?

Figure 6.1: ECARF High Level Activities (diagram: Loading and Analysing; Dictionary Encoding; Transforming, NTriples → CSV; Rule-based Reasoning; Querying (SPARQL); Managing, Create, Update, Delete; RDF datasets in Cloud Storage; BigQuery; scope of research)

Some of the high level activities that can be performed with ECARF are illustrated in Figure 6.1. The activities with a solid border, such as “Loading and Analysing”, “Dictionary Encoding”, “Transforming” and “Rule-based Reasoning”, are implemented in this thesis. Other activities with a dotted border, such as “Managing” and “Querying”, can be implemented to provide full triple store capability, but these are left as future work. The ECARF algorithms that load, process, encode and reason on large RDF datasets will enable applications to easily and efficiently utilise RDF technologies to provide features such as semantic search and content discovery.
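To make the “Dictionary Encoding” and “Transforming” activities more concrete, the following is a minimal sketch in Python and not the ECARF implementation. It assumes a simple in-memory dictionary that assigns consecutive integers to terms in order of first appearance, and a simplified N-Triples parser that expects one well-formed triple per line. The names Dictionary, ntriples_to_csv and TRIPLE are hypothetical and chosen for illustration only.

```python
import csv
import io
import re
from typing import Dict, Iterable

# Simplified N-Triples pattern (assumption): subject and predicate contain no
# whitespace, and the object is everything up to the terminating " .".
TRIPLE = re.compile(r"^\s*(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")


class Dictionary:
    """Hypothetical in-memory dictionary mapping RDF terms to integers."""

    def __init__(self) -> None:
        self.ids: Dict[str, int] = {}

    def encode(self, term: str) -> int:
        # Reuse the existing id, or assign the next unused integer (starting at 1).
        return self.ids.setdefault(term, len(self.ids) + 1)


def ntriples_to_csv(lines: Iterable[str], dictionary: Dictionary) -> str:
    """Encode each triple as three integers and return the rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(stripped)
        if match is None:
            continue  # skip lines this simplified parser cannot handle
        writer.writerow([dictionary.encode(term) for term in match.groups()])
    return out.getvalue()


if __name__ == "__main__":
    sample = [
        "<http://example.org/alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .",
        "<http://example.org/bob> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .",
    ]
    print(ntriples_to_csv(sample, Dictionary()), end="")
```

In this sketch, repeated terms such as the rdf:type predicate are stored once and referenced by the same integer in every row, which is the property that makes the encoded CSV smaller than the original N-Triples text. A dictionary built this way grows with the number of distinct terms, so it does not by itself address the in-memory constraint raised in Q2.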

The rest of this chapter is organised as follows. Section 6.1 provides an overview of the ECARF triple store architecture, followed by Section 6.3, which explains the distributed RDFS reasoning approach in detail. Then, Section 6.2 presents the dictionary encoding algorithms used by ECARF, followed by Section 6.4, which provides a detailed walkthrough of the architecture. Finally, Section 6.5 concludes this chapter with a summary of the key contributions of ECARF.

6.1 ECARF Overview

The ECARF high level architecture shown in Figure 6.2 is based on CloudEx's cloudex-core and cloudex-google components. The coordinator and processor virtual machines run the cloudex-core Coordinator and Processor components respectively. These components use the cloudex-core CloudService component to interact with other cloud services. ECARF defines a number of CloudEx jobs and user-defined tasks (Section 5.6.3) for processing RDF datasets. Furthermore, ECARF extends the cloudex-core CloudService component with additional API wrappers for Google BigQuery. Similar to the CloudEx implementation, the ECARF implementation is open source and is publicly available.

The ECARF architecture uses both Cloud Storage and BigQuery for shared storage and BigQuery for execution, hence eliminating the need for processors to communicate with each other. This embarrassingly parallel approach ensures each processor can work independently of other processors to avoid any overhead with data exchange between them. A CloudEx coordinator component is used for partitioning the workload between a number of processors using the CloudEx built-in Bin Packing partitioning described in Section 5.3. Additionally, the coordinator assigns tasks and workload items to the processors by using the metadata server as described in Section 5.2. The following sections provide a brief overview of each of the activities explained in this chapter.
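The partitioning step mentioned above can be illustrated with a short sketch. The code below is not the CloudEx implementation described in Section 5.3. It assumes the first-fit decreasing heuristic, one common way of approximating bin packing, and it assumes that each workload item is a file with a known size and that each processor has the same capacity. The function name partition_ffd and the file names are hypothetical.

```python
from typing import Dict, List


def partition_ffd(items: Dict[str, int], capacity: int) -> List[dict]:
    """Group work items into bins using first-fit decreasing.

    `items` maps an item name to its size; `capacity` is the assumed
    maximum load per processor. An item larger than `capacity` is
    placed in a bin of its own.
    """
    bins: List[dict] = []  # each bin: {"load": int, "items": [names]}
    # Consider the largest items first.
    for name, size in sorted(items.items(), key=lambda kv: kv[1], reverse=True):
        for current in bins:
            if current["load"] + size <= capacity:
                # First bin with enough remaining room takes the item.
                current["items"].append(name)
                current["load"] += size
                break
        else:
            # No existing bin can take the item, so open a new one.
            bins.append({"load": size, "items": [name]})
    return bins


if __name__ == "__main__":
    files = {"a.nt": 70, "b.nt": 50, "c.nt": 30, "d.nt": 20, "e.nt": 10}
    for index, current in enumerate(partition_ffd(files, capacity=100), start=1):
        print(f"processor-{index}: {current['items']} (load {current['load']})")
```

Each resulting bin corresponds to the set of items one processor would handle on its own, which is consistent with the embarrassingly parallel design in which processors never exchange data with each other. The number of bins produced gives an estimate of how many processors are needed for the chosen capacity.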

Figure 6.2: ECARF High Level Architecture (diagram: coordinator and processor virtual machines on Compute Engine, each running cloudex from the CloudEx VM image; jobs; the CloudService component with its required and provided interfaces; Cloud APIs, metadata server, authentication and authorisation; RDF dataset files in a Cloud Storage bucket; BigQuery tables; all within the Google Cloud Platform)
