High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

(1)

Beniamino Di Martino, Antonio Esposito and Andrea Barbato

Department of Industrial and Information Engineering, Second University of Naples

Aversa, Italy

(2)

Motivations

● High Performance Computing requires expensive machines

● Leverage the virtually unlimited pool of resources offered by the Cloud

● The “pay-as-you-go” service model reduces initial investments

● Clouds' elasticity reduces computing power waste

● Ease the porting of applications from on-premises environments to the Cloud

● Reuse existing sequential code

● A number of environments, techniques and languages already exist for the development of sequential programs

● Lack of shared programming interfaces can hamper the porting process

● Exploit the naturally distributed characteristics of Cloud solutions

(3)

Objective

● Realize the automatic transformation of a class of sequential algorithms into a corresponding parallel version

● Make the parallel version compatible with a target Cloud environment

● Apply two levels of parallelization

➢ 1st level → Use parallel skeletons to port to the Cloud

➢ 2nd level → Use GPU simulation

[Figure: workflow diagram with the labels Serial Code, Code Analyser, Parallel Code, Parallel Skeletons, Translator, MapReduce + GPGPU]

(4)

Employed technologies, 1/3

● There are patterns in parallel applications

● Those patterns can be generalized in Skeletons

● Applications are assembled as combinations of such patterns

● Functional point of view

● Skeletons are Higher-Order Functions

● Skeletons support a compositional semantics

● Applications become compositions of stateless functions

● Orchestration and synchronization of the parallel activities are implicitly defined and hidden from the programmer
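The view of skeletons as higher-order functions can be sketched in C++ as follows. This is only an illustration: the names `map_skeleton`, `reduce_skeleton` and `sum_of_squares` are hypothetical and are not taken from the tool presented here, and the sequential loops stand in for the orchestration that a real skeleton would parallelize behind the same interface.

```cpp
#include <vector>

// Hypothetical "map" skeleton: applies a stateless function f to every element.
// A real skeleton could distribute these calls; here they run sequentially.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out;
    out.reserve(in.size());
    for (const T& x : in) out.push_back(f(x));
    return out;
}

// Hypothetical "reduce" skeleton: folds all elements with a binary operation,
// starting from init.
template <typename T, typename F>
T reduce_skeleton(const std::vector<T>& in, T init, F op) {
    T acc = init;
    for (const T& x : in) acc = op(acc, x);
    return acc;
}

// An application written as a composition of the two skeletons: sum of squares.
inline int sum_of_squares(const std::vector<int>& v) {
    std::vector<int> squared = map_skeleton(v, [](int x) { return x * x; });
    return reduce_skeleton(squared, 0, [](int a, int b) { return a + b; });
}
```

The caller only supplies the two lambdas; how the work is scheduled stays inside the skeletons, which is the property the bullets above describe.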

(5)

Employed technologies, 2/3

MapReduce

● Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
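To make the programming model concrete, here is a minimal single-process sketch in C++ that counts words. It is not Hadoop code: the function names `map_line`, `reduce_counts` and `word_count` are hypothetical, and the grouping step that a real framework performs across a cluster is emulated with an in-memory `std::map`.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// Map phase: split one input line into words and emit (word, 1) for each.
std::vector<KeyValue> map_line(const std::string& line) {
    std::vector<KeyValue> out;
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.emplace_back(w, 1);
    return out;
}

// Reduce phase: sum all the values that share the same key.
int reduce_counts(const std::vector<int>& values) {
    int total = 0;
    for (int v : values) total += v;
    return total;
}

// Driver: map every line, group the pairs by key (the "shuffle"),
// then reduce each group independently.
std::map<std::string, int> word_count(const std::vector<std::string>& lines) {
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : lines)
        for (const auto& kv : map_line(line))
            groups[kv.first].push_back(kv.second);

    std::map<std::string, int> result;
    for (const auto& g : groups) result[g.first] = reduce_counts(g.second);
    return result;
}
```

Because each call to `map_line` and each call to `reduce_counts` depends only on its own input, a framework is free to run them on different machines.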

(6)

Employed technologies, 3/3

GPGPU

General-purpose computing on graphics processing units

● OpenCL is the currently dominant open general-purpose GPU computing language

● The dominant proprietary framework is Nvidia's CUDA

● Single Program, Multiple Data (SPMD)

● CUDA programming uses keywords provided as extensions to high-level programming languages such as C/C++

● A kernel is organized as a hierarchical structure in which threads are grouped into blocks, and blocks into a grid
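The grid/block/thread hierarchy can be illustrated without a GPU. The plain C++ sketch below walks sequentially over every (block, thread) pair of a two-dimensional grid and derives each thread's global coordinate with the usual CUDA expression `blockIdx * blockDim + threadIdx`. The `Dim2` type and the functions `global_index` and `visit_counts` are hypothetical helpers written for this example; on a real device the four loops disappear and each thread evaluates the expression for itself.

```cpp
#include <cstddef>
#include <vector>

struct Dim2 { int x; int y; };

// Global coordinate of a thread: blockIdx * blockDim + threadIdx on each axis.
inline Dim2 global_index(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    return { blockIdx.x * blockDim.x + threadIdx.x,
             blockIdx.y * blockDim.y + threadIdx.y };
}

// Visit every thread of the grid and count how many threads land on each cell
// of a rows x cols array. Threads falling outside the array are skipped,
// mirroring the bounds guard a kernel typically needs.
std::vector<int> visit_counts(Dim2 gridDim, Dim2 blockDim, int rows, int cols) {
    std::vector<int> counts(static_cast<std::size_t>(rows) * cols, 0);
    for (int by = 0; by < gridDim.y; ++by)
        for (int bx = 0; bx < gridDim.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx) {
                    Dim2 g = global_index({bx, by}, blockDim, {tx, ty});
                    if (g.y < rows && g.x < cols)
                        ++counts[static_cast<std::size_t>(g.y) * cols + g.x];
                }
    return counts;
}
```

With a 2 x 2 grid of 2 x 2 blocks and a 3 x 3 array, every cell is covered by exactly one thread and the surplus threads are filtered out by the guard.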

(7)

Analysis of the source code

● Analysis of the AST through the ROSE compiler

● Recognition of data structures

● Vectors, Matrices, Queues, Stacks, Lists...

● Recognition of computation algorithms

● Matrix multiplication

● The user is shown the Program Dependence Graph (PDG)

● Control and data dependency

● Each node carries an ID that can be used to trace the corresponding code line and the related control or data structure

(8)

1. Matrix multiplication

    for (int i=0; i<N; i++)
        for (int j=0; j<M; j++) {
            C[i][j] = 0;
            for (int k=0; k<P; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }

2. Algebraic expressions involving matrices and vectors

    for (int i=0; i<N; i++)
        for (int j=0; j<M; j++)
            ...

• c[i][j] = alfa * a[i][j] + beta * b[i][j];

• c[i][j] = alfa * a[i][j] + beta * b[i][j] + gamma*d[i][j] + ...

• c[i][j] = alfa*a[i][j]^2

• . . .

(9)

Selection of the Skeleton

● Skeleton selection

● Users can tune the dimensions of the sub-blocks into which the matrices will be divided

● If CUDA is selected, options to set the grid and block dimensions are available

● A preview of the data distribution is shown

(10)

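Matrix sub-block Multiplication

[Figure: an N x M matrix partitioned into n x m sub-blocks is multiplied by an M x P matrix partitioned into m x p sub-blocks; the partial block products are summed to give the result]

● Distribution of blocks will be handled by a Map function

● Calculations are executed by a Reduce function

➔ First round: execute sub-matrix multiplication

➔ Second round: sum the partial results for each sub-block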
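The two rounds can be sketched in plain C++ on in-memory, row-major matrices. This is not the generated MapReduce code: the names `round1` and `round2` and the `(I, J, K)` key layout are assumptions made for this example, the block distribution done by the Map function is collapsed into ordinary loops, and the block sizes are assumed to divide the matrix dimensions evenly.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

using Block = std::vector<float>;             // one row-major sub-block
using PartialKey = std::tuple<int, int, int>; // (I, J, K) block indices

// Round 1: for every block triple (I, J, K), multiply the n x m block A[I][K]
// by the m x p block B[K][J], producing one n x p partial result.
std::map<PartialKey, Block> round1(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int N, int M, int P, int n, int m, int p) {
    assert(N % n == 0 && M % m == 0 && P % p == 0); // assumption: even split
    std::map<PartialKey, Block> partials;
    for (int I = 0; I < N / n; ++I)
        for (int J = 0; J < P / p; ++J)
            for (int K = 0; K < M / m; ++K) {
                Block c(static_cast<std::size_t>(n) * p, 0.0f);
                for (int i = 0; i < n; ++i)
                    for (int j = 0; j < p; ++j)
                        for (int k = 0; k < m; ++k)
                            c[i * p + j] += A[(I * n + i) * M + (K * m + k)] *
                                            B[(K * m + k) * P + (J * p + j)];
                partials[std::make_tuple(I, J, K)] = c;
            }
    return partials;
}

// Round 2: sum the partial blocks that share (I, J) into the N x P result.
std::vector<float> round2(const std::map<PartialKey, Block>& partials,
                          int N, int P, int n, int p) {
    std::vector<float> C(static_cast<std::size_t>(N) * P, 0.0f);
    for (const auto& entry : partials) {
        const int I = std::get<0>(entry.first);
        const int J = std::get<1>(entry.first);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < p; ++j)
                C[(I * n + i) * P + (J * p + j)] += entry.second[i * p + j];
    }
    return C;
}
```

Each entry produced by `round1` is independent of the others, which is what allows the first round to be spread over many Reduce tasks; `round2` only needs the entries that share the same `(I, J)` pair.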

(11)

Matrix sub-block Multiplication: 1st round, Map Function

[Figure: matrices A and B split into numbered sub-blocks; the Map function emits the blocks as key-value pairs, with labels such as K0 = 0,0,0 and V0 = A,0,0]

(12)

1st round: Reduce Function

[Figure: sub-block multiplication performed by the Reduce function]

(13)

2nd round functions

Map Function

[Figure: partial result blocks emitted as key-value pairs, with labels such as K0 = 0,0 and K1 = 0,1]

Reduce Function

[Figure: partial result blocks summed into the final block]

Automatic code transformation techniques for the High Performance Cloud

Barbato - Simoniello

(14)

Code produced for the 1st round

(15)

Use of GPGPU

● GPGPU parallelization is applied to the Reduce function

● Used on the code produced in the second round

● Users can set the number of GPU threads

● Default value depends on matrices' dimensions

Added CUDA code is in charge of:

● Allocating data structures on the GPU

● Copying data onto the GPU

● Kernel execution

● Copying data back from the GPU
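The slides do not say how the default number of threads is derived from the matrix dimensions. One common convention, shown here purely as an assumption, is to use one thread per element of the result block and to round the grid size up so that the whole block is covered. The names `LaunchDims`, `ceil_div` and `default_launch_dims`, and the 16 x 16 block shape, are hypothetical.

```cpp
#include <cassert>

struct LaunchDims { int gridX, gridY, blockX, blockY; };

// Ceiling division: the smallest q such that q * b >= a (a >= 0, b > 0).
inline int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Hypothetical default: one thread per element of a rows x cols result block,
// with a fixed block shape and enough blocks to cover every element.
inline LaunchDims default_launch_dims(int rows, int cols,
                                      int blockX = 16, int blockY = 16) {
    return { ceil_div(cols, blockX), ceil_div(rows, blockY), blockX, blockY };
}
```

When the dimensions are not multiples of the block shape, the grid contains more threads than elements, so the kernel needs a bounds check before writing its result.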

(16)

Adding CUDA code

    // Device Allocation (A is n x m, B is m x p, C is n x p)
    float *A_d;
    float *B_d;
    float *C_d;
    cudaMalloc( (void**)&A_d, (n)*(m)*sizeof(float) );
    cudaMalloc( (void**)&B_d, (m)*(p)*sizeof(float) );
    cudaMalloc( (void**)&C_d, (n)*(p)*sizeof(float) );

    // Move data to device
    cudaMemcpy( A_d, A_h, (n)*(m)*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( B_d, B_h, (m)*(p)*sizeof(float), cudaMemcpyHostToDevice );

    // Launch the kernel
    dim3 dimBlock( DIM_BLOCK_X, DIM_BLOCK_Y );
    dim3 dimGrid( DIM_GRID_X, DIM_GRID_Y );
    multiply_matrix<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n, m, p);

    // Move data from device
    cudaMemcpy( C_h, C_d, (n)*(p)*sizeof(float), cudaMemcpyDeviceToHost );

    // Device De-allocation
    cudaFree( A_d );
    cudaFree( B_d );
    cudaFree( C_d );

    class MyReducerCUDA : public Reducer {
    public:
        MyReducerCUDA(TaskContext& context) { }

        void reduce(ReduceContext& context) {
            // Host buffers for the sub-blocks: A is n x m, B is m x p, C is n x p
            float *A_h = (float *) malloc((n)*(m)*sizeof(float));
            float *B_h = (float *) malloc((m)*(p)*sizeof(float));
            float *C_h = (float *) malloc((n)*(p)*sizeof(float));

            // Each input value is split into matrix name, row, column and value
            while ( context.nextValue() ) {
                string line = context.getInputValue();
                vector<string> indicesAndValue = splitString(line, ",");
                int i = toInt(indicesAndValue[1]);
                int j = toInt(indicesAndValue[2]);
                float value = toFloat(indicesAndValue[3]);
                if (indicesAndValue[0].compare("A") == 0)
                    A_h[i*m+j] = value;
                else
                    B_h[i*p+j] = value;
            }

            // C_h is expected to be filled at this point by the added CUDA code

            // Emit every element of the result block with its global row and column
            string key = context.getInputKey();
            vector<string> blockIndices = splitString(key, ",");
            for (int row=0; row<n; row++)
                for (int col=0; col<p; col++) {
                    int i = toInt(blockIndices[0])*n + row;
                    int j = toInt(blockIndices[2])*p + col;
                    string ii = toString(i);
                    string jj = toString(j);
                    string value = toString(C_h[row*p+col]);
                    context.emit(ii+","+jj+",", value);
                }

            // Release the host buffers
            free(A_h);
            free(B_h);
            free(C_h);
        }
    };

(24)

Conclusions and Future Work

● We are still at a preliminary stage

● Need skeletons for different computation algorithms

● Need to specialize skeletons for different programming paradigms

● Need skeletons for different Cloud platforms

● A performance evaluation of the produced code is missing

● The overhead of the recognition and transformation process has to be checked

● Large matrices are needed for the evaluation

● The time needed to transfer data to the Cloud has to be considered

● When GPU parallelization is used, the time needed to transfer data onto the GPU has to be considered as well
