High Performance Cloud:
a MapReduce and GPGPU Based
Hybrid Approach
Beniamino Di Martino, Antonio Esposito and Andrea Barbato
Department of Industrial and Information Engineering, Second University of Naples
Aversa, Italy
Motivations
● High Performance Computing requires expensive machines
● Leverage the virtually unlimited pool of resources offered by the Cloud
● The “pay-as-you-go” service model reduces initial investments
● Clouds' elasticity reduces computing power waste
● Ease the porting of applications from on-premises environments to the Cloud
● Reuse existing sequential code
● A number of environments, techniques and languages already exist for
the development of sequential programs
● The lack of shared programming interfaces can hamper the porting process
● Exploit the naturally distributed characteristics of Cloud solutions
Objective
● Realize the automatic transformation of a class of sequential algorithms
into a corresponding parallel version
● Make the parallel version compatible with a target Cloud environment
● Apply two levels of parallelization
➢ 1st level → Use parallel skeletons to port to the Cloud
➢ 2nd level → Use GPU simulation
[Figure: workflow diagram with the components Serial Code, Code Analyser, Parallel Skeletons, Translator and Parallel Code (MapReduce + GPGPU)]
Employed technologies\1
● There are patterns in parallel applications
● Those patterns can be generalized in Skeletons
● Applications are assembled as combinations of such patterns
● Functional point of view
● Skeletons are Higher-Order Functions
● Skeletons support compositional semantics
● Applications become compositions of stateless functions
● Orchestration and synchronization of the parallel activities are implicitly
defined and hidden from the programmer
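The functional view above can be illustrated with a minimal C++ sketch. The names `map_skeleton`, `reduce_skeleton` and `sum_of_squares` are hypothetical and do not come from the tool described in these slides. Both skeletons run sequentially here; the point is only that the caller composes stateless functions, while the iteration (and, in a real skeleton library, the parallel orchestration) stays inside the skeleton.

```cpp
#include <cassert>
#include <vector>

// Hypothetical "map" skeleton: applies a stateless function to every element.
// A parallel implementation could split this loop across workers without
// changing the caller's code.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f) {
    std::vector<T> out;
    out.reserve(in.size());
    for (const T& x : in) out.push_back(f(x));
    return out;
}

// Hypothetical "reduce" skeleton: folds the elements with a binary operator.
template <typename T, typename F>
T reduce_skeleton(const std::vector<T>& in, T init, F op) {
    T acc = init;
    for (const T& x : in) acc = op(acc, x);
    return acc;
}

// Composition of the two skeletons: sum of the squares of the elements.
inline double sum_of_squares(const std::vector<double>& v) {
    auto squared = map_skeleton(v, [](double x) { return x * x; });
    return reduce_skeleton(squared, 0.0,
                           [](double a, double b) { return a + b; });
}
```

Calling `sum_of_squares` on a vector never exposes a loop or a synchronization point to the caller, which is the property the skeleton approach relies on.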
Employed technologies\2
MapReduce
● Programming model and an associated implementation for processing
and generating large data sets with a parallel, distributed algorithm on a cluster
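As a rough illustration of the programming model, the following single-process C++ sketch runs the three conceptual steps in memory: map emits key/value pairs, the pairs are grouped by key, and reduce is called once per key. The driver `run_map_reduce` and the example job `word_count` are hypothetical names; a real deployment would distribute these steps over a cluster.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// Hypothetical single-process driver: map -> group by key -> reduce.
std::map<std::string, int> run_map_reduce(
    const std::vector<std::string>& records,
    const std::function<std::vector<KeyValue>(const std::string&)>& map_fn,
    const std::function<int(const std::string&, const std::vector<int>&)>& reduce_fn) {
    // Map phase plus grouping: collect all values emitted under each key.
    std::map<std::string, std::vector<int>> groups;
    for (const std::string& record : records)
        for (const KeyValue& kv : map_fn(record))
            groups[kv.first].push_back(kv.second);

    // Reduce phase: one call per distinct key.
    std::map<std::string, int> result;
    for (const auto& group : groups)
        result[group.first] = reduce_fn(group.first, group.second);
    return result;
}

// Example job: count the occurrences of each space-separated word.
std::map<std::string, int> word_count(const std::vector<std::string>& lines) {
    auto map_fn = [](const std::string& line) {
        std::vector<KeyValue> out;
        std::string word;
        for (char c : line + " ") {  // the trailing space flushes the last word
            if (c == ' ') {
                if (!word.empty()) out.emplace_back(word, 1);
                word.clear();
            } else {
                word += c;
            }
        }
        return out;
    };
    auto reduce_fn = [](const std::string&, const std::vector<int>& values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    };
    return run_map_reduce(lines, map_fn, reduce_fn);
}
```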
Employed technologies\3
GPGPU
General-purpose computing on graphics processing units
● OpenCL is the currently dominant open general-purpose GPU computing
language.
● The dominant proprietary framework is Nvidia's CUDA
● Single Program, Multiple Data (SPMD)
● CUDA programming uses keywords
provided as extensions to high-level programming languages like C/C++
● A kernel is organized as a hierarchical
structure in which threads are grouped into blocks, and blocks into a grid
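The grid/block/thread hierarchy can be mimicked on the CPU to show how each thread obtains its own coordinates. The sketch below is plain sequential C++, not CUDA: `Dim2`, `for_each_thread` and `scale_matrix` are hypothetical names, and the loops only emulate what a kernel launch over a 2D grid of 2D blocks would enumerate.

```cpp
#include <cassert>
#include <vector>

struct Dim2 { int x; int y; };  // hypothetical 2D stand-in for CUDA's dim3

// Sequentially visits every (block, thread) pair of a 2D grid and passes the
// resulting global coordinates to `body`, mimicking one kernel launch.
template <typename Body>
void for_each_thread(Dim2 grid, Dim2 block, Body body) {
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    int row = by * block.y + ty;  // blockIdx.y * blockDim.y + threadIdx.y
                    int col = bx * block.x + tx;  // blockIdx.x * blockDim.x + threadIdx.x
                    body(row, col);
                }
}

// SPMD-style "kernel": every thread runs the same code on its own element.
std::vector<int> scale_matrix(const std::vector<int>& in, int rows, int cols,
                              int factor, Dim2 block) {
    std::vector<int> out(in.size(), 0);
    // Round up so that the grid covers the whole matrix.
    Dim2 grid{(cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y};
    for_each_thread(grid, block, [&](int row, int col) {
        if (row < rows && col < cols)  // threads outside the matrix do nothing
            out[row * cols + col] = factor * in[row * cols + col];
    });
    return out;
}
```

The bounds check matters whenever the matrix dimensions are not multiples of the block dimensions, since the rounded-up grid then contains threads with no element to process.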
Analysis of the source code
● Analysis of the AST through ROSE compiler
● Recognition of data structures
● Vectors, Matrices, Queues, Stacks, Lists...
● Recognition of computation algorithms
● Matrix multiplication
● The user is shown the Program Dependence Graph (PDG)
● Control and data dependencies
● Each node reports an ID which can be used to
trace the code line and the control or data
structure corresponding to it.
1. Matrix multiplication
for (int i=0; i<N; i++)
  for (int j=0; j<M; j++) {
    C[i][j] = 0;
    for (int k=0; k<P; k++)
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
  }
2. Algebraic expressions involving matrices and vectors
for (int i=0; i<N; i++)
  for (int j=0; j<M; j++)
    ...
• c[i][j] = alfa * a[i][j] + beta * b[i][j];
• c[i][j] = alfa * a[i][j] + beta * b[i][j] + gamma * d[i][j] + ...
• c[i][j] = alfa*a[i][j]^2
• . . .
Selection of the Skeleton
● Skeleton selection
● Users can tune the dimensions of
the sub-blocks into which the matrices will be divided
● If CUDA is selected, options to determine grid and block
dimensions are available
● A preview of the data distribution is shown
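[Figure: preview of the data distribution for the product of an N×M matrix and an M×P matrix, divided into sub-blocks with dimensions n, m and p; the partial sub-block products are summed]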
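How the chosen sub-block dimensions translate into work can be estimated with a few integer divisions. The sketch below assumes that A is N×M, B is M×P, and that the sub-blocks are n×m and m×p; the names `BlockPlan`, `ceil_div` and `plan_blocks` are hypothetical and only illustrate the arithmetic.

```cpp
#include <cassert>

// Hypothetical summary of a block decomposition of C (N x P) = A (N x M) * B (M x P).
struct BlockPlan {
    int row_blocks;         // sub-blocks along N
    int inner_blocks;       // sub-blocks along M
    int col_blocks;         // sub-blocks along P
    long partial_products;  // sub-block multiplications to perform
};

// Integer division rounded up, so that a partial last block is still counted.
inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

inline BlockPlan plan_blocks(int N, int M, int P, int n, int m, int p) {
    BlockPlan plan;
    plan.row_blocks = ceil_div(N, n);
    plan.inner_blocks = ceil_div(M, m);
    plan.col_blocks = ceil_div(P, p);
    plan.partial_products = static_cast<long>(plan.row_blocks) *
                            plan.inner_blocks * plan.col_blocks;
    return plan;
}
```

Smaller sub-blocks increase the number of partial products, and therefore the number of independent tasks, at the cost of more key/value pairs to move around.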
Matrix sub-block Multiplication
● Distribution of blocks will be handled by a Map function
● Calculations are executed by a Reduce function
➔ First round: execute sub-matrix multiplication
➔ Second round: sum the partial results of the sub-block
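The two rounds can be sketched sequentially in C++. This is an illustration under stated assumptions, not the code generated by the tool: the matrix dimensions are assumed to be exact multiples of the sub-block dimensions, B is assumed non-empty, the key layout (ib, jb, kb) is chosen for the example, and `block_multiply` is a hypothetical name. Round 1 computes one partial product per key; round 2 sums the partial products that belong to the same block of the result.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Multiplies an N x M matrix by an M x P matrix using n x m and m x p
// sub-blocks. Assumes N % n == 0, M % m == 0 and P % p == 0.
Matrix block_multiply(const Matrix& A, const Matrix& B, int n, int m, int p) {
    const int N = static_cast<int>(A.size());
    const int M = static_cast<int>(B.size());
    const int P = static_cast<int>(B[0].size());

    // Round 1: one partial n x p product per key (ib, jb, kb).
    std::map<std::tuple<int, int, int>, Matrix> partial;
    for (int ib = 0; ib < N / n; ++ib)
        for (int jb = 0; jb < P / p; ++jb)
            for (int kb = 0; kb < M / m; ++kb) {
                Matrix block(n, std::vector<double>(p, 0.0));
                for (int i = 0; i < n; ++i)
                    for (int j = 0; j < p; ++j)
                        for (int k = 0; k < m; ++k)
                            block[i][j] += A[ib * n + i][kb * m + k] *
                                           B[kb * m + k][jb * p + j];
                partial[std::make_tuple(ib, jb, kb)] = block;
            }

    // Round 2: sum the partial products that share the block position (ib, jb).
    Matrix C(N, std::vector<double>(P, 0.0));
    for (const auto& entry : partial) {
        const int ib = std::get<0>(entry.first);
        const int jb = std::get<1>(entry.first);
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < p; ++j)
                C[ib * n + i][jb * p + j] += entry.second[i][j];
    }
    return C;
}
```

Every iteration of the round 1 loop nest is independent of the others, which is what makes the partial products suitable for distribution.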
[Figure: data flow of the two rounds. 1st round: the Map function distributes the numbered sub-blocks of A and B as key/value pairs (e.g. K0 = 0,0,0, V0 = A,0,0) and the Reduce function processes them. 2nd round: Map and Reduce functions sum the partial results.]
Automatic code transformation techniques for the High Performance Cloud
Barbato - Simoniello
Code produced for 1st round
Use of GPGPU
● GPGPU parallelization applied to Reduce function
● Used on the code produced in
the second round
● Users can set the number of GPU threads
● Default value depends on matrices' dimensions
Added CUDA code is in charge of:
● Allocating data structures on the GPU
● Copying data onto the GPU
● Kernel execution
● Copying data back from the GPU
// Device Allocation
float *A_d; float *B_d; float *C_d;
cudaMalloc( (void**)&A_d, (n)*(m)*sizeof(float) );
cudaMalloc( (void**)&B_d, (m)*(p)*sizeof(float) );
cudaMalloc( (void**)&C_d, (n)*(p)*sizeof(float) );
// Move data to device
cudaMemcpy( A_d, A_h, (n)*(m)*sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( B_d, B_h, (m)*(p)*sizeof(float), cudaMemcpyHostToDevice );
// Launch the kernel
dim3 dimBlock( DIM_BLOCK_X, DIM_BLOCK_Y );
dim3 dimGrid( DIM_GRID_X, DIM_GRID_Y );
multiply_matrix<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n, m, p);
// Move data from device
cudaMemcpy( C_h, C_d, (n)*(p)*sizeof(float), cudaMemcpyDeviceToHost );
// Device de-allocation
cudaFree( A_d ); cudaFree( B_d ); cudaFree( C_d );
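The body of the `multiply_matrix` kernel is not shown in these slides. To check the contents of `C_h` after the copy back from the device, a sequential reference with the same row-major layout (A is n×m, B is m×p, C is n×p) could be used. The functions `multiply_matrix_reference` and `max_abs_diff` below are hypothetical helpers written for this purpose, not part of the generated code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential reference: C (n x p) = A (n x m) * B (m x p), all row-major.
std::vector<float> multiply_matrix_reference(const std::vector<float>& A,
                                             const std::vector<float>& B,
                                             int n, int m, int p) {
    std::vector<float> C(static_cast<std::size_t>(n) * p, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < p; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < m; ++k)
                sum += A[i * m + k] * B[k * p + j];
            C[i * p + j] = sum;
        }
    return C;
}

// Largest absolute element-wise difference, for comparison against GPU output.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i)
        worst = std::fmax(worst, std::fabs(x[i] - y[i]));
    return worst;
}
```

Because floating-point sums on the GPU may be accumulated in a different order, a comparison against a small tolerance is usually more appropriate than exact equality.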
Adding CUDA code
class MyReducerCUDA : public Reducer {
public:
  MyReducerCUDA(TaskContext& context) { }

  void reduce(ReduceContext& context) {
    // Host buffers for the sub-blocks: A is n x m, B is m x p, C is n x p
    float *A_h = (float *) malloc((n)*(m)*sizeof(float));
    float *B_h = (float *) malloc((m)*(p)*sizeof(float));
    float *C_h = (float *) malloc((n)*(p)*sizeof(float));

    // Each input value has the form "matrix,row,col,value"
    while ( context.nextValue() ) {
      string line = context.getInputValue();
      vector<string> indicesAndValue = splitString(line, ",");
      int i = toInt(indicesAndValue[1]);
      int j = toInt(indicesAndValue[2]);
      float value = toFloat(indicesAndValue[3]);
      if (indicesAndValue[0].compare("A") == 0)
        A_h[i*m+j] = value;
      else
        B_h[i*p+j] = value;
    }

    // The CUDA code (device allocation, copies, kernel launch) goes here
    // and fills C_h

    // Emit every element of the result block under its global indices
    string key = context.getInputKey();
    vector<string> blockIndices = splitString(key, ",");
    for (int row=0; row<n; row++)
      for (int col=0; col<p; col++) {
        int i = toInt(blockIndices[0])*n + row;
        int j = toInt(blockIndices[2])*p + col;
        string ii = toString(i);
        string jj = toString(j);
        string value = toString(C_h[row*p+col]);
        context.emit(ii+","+jj+",", value);
      }

    // Release the host buffers
    free(A_h); free(B_h); free(C_h);
  }
};
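The record handling inside the reducer can be tried out without a Hadoop installation. The sketch below assumes the same "matrix,row,col,value" record format and row-major buffers as the reducer above; `split` and `store_record` are hypothetical stand-ins written with the standard library only, and unlike the slide code they reject malformed or out-of-range records.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits `text` on `sep`; hypothetical stand-in for the splitString helper.
std::vector<std::string> split(const std::string& text, char sep) {
    std::vector<std::string> parts;
    std::stringstream stream(text);
    std::string part;
    while (std::getline(stream, part, sep)) parts.push_back(part);
    return parts;
}

// Stores one "matrix,row,col,value" record into the matching row-major buffer.
// A has m columns and B has p columns. Returns false for records that name an
// unknown matrix or fall outside the buffer. Non-numeric fields make
// std::stoi / std::stof throw.
bool store_record(const std::string& line, int m, int p,
                  std::vector<float>& A, std::vector<float>& B) {
    const std::vector<std::string> f = split(line, ',');
    if (f.size() != 4 || (f[0] != "A" && f[0] != "B")) return false;
    const int i = std::stoi(f[1]);
    const int j = std::stoi(f[2]);
    const float value = std::stof(f[3]);
    std::vector<float>& target = (f[0] == "A") ? A : B;
    const int cols = (f[0] == "A") ? m : p;
    if (i < 0 || j < 0 || j >= cols) return false;
    const std::size_t index = static_cast<std::size_t>(i) * cols + j;
    if (index >= target.size()) return false;
    target[index] = value;
    return true;
}
```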
Conclusions and Future Work
● We are still at a preliminary stage
● Need skeletons for different computation algorithms
● Need to specialize skeletons for different programming paradigms
● Need skeletons for different Cloud platforms
● A performance evaluation of the produced code is missing
● Overhead of the recognition and transformation process has
to be checked
● Large matrices are needed for the evaluation
● Time needed to transfer data to the cloud has to be considered
● When GPU parallelization is used, time needed to transfer data onto it has to be considered as well