Co processing SPMD Computation on GPUs and CPUs cluster

(1)

Co-processing SPMD Computation

on GPUs and CPUs cluster

Cluster 2013, Indianapolis, 9/24/2013

Hui Li

Pervasive Technology Institute

Indiana University

(2)

Acknowledgement

• Co Authors:

Geoffrey Fox

Gregor von Laszewski

Arun Chauhan

• Ph.D:

Andrew Pangborn

Jerome Mitchell

Adnan Ozsoy

Jong Choi

2

(3)

Outline

• Introduction

–

Heterogeneous Computation

–

Programming models for GPU and CPU

• System Design and Implementation

–

Heterogeneous MapReduce API

–

Scheduling Model and Implementation

• Experiment Results

–

GEMV, C-means, and GMM

(4)

Hybrid Cluster with GPU and CPU

• Supercomputers with GPU and multicore CPU

• Accelerators used for scientific research

(5)

Heterogeneous Computation

• Why Heterogeneous?

• CPU are not capable for exascale computation.

• Energy and power are the first class constraint on exascale system

• There are more overlap between Top 500 and Green 500 machines

• How Heterogeneous?

• Not only use heterogeneous devices separately, but make use of them together as a system.

• Challenges for heterogeneous computing

• Programmability

• Efficiency

• Scalability

(6)

Programmability Challenges for

Heterogeneous Devices

• Programming Models on CPU

• SIMD, MIMD, fast for threading code,

• Low Latency, Modest parallelism,

• OpenMP, Pthreads

• Programming Models on GPU

• SIMT, fast for vector code

• High Throughput, Massive parallelism,

• CUDA, OpenCL

• Programming models on GPU and CPU.

• Either use a hybrid programming model, or use a

uniform programming model.

• Difficult to exploit all computing units simultaneously.

• OpenACC, Qilin, MAGMA

(7)

Code Samples

SPMD

for (int tid = 0;tid<num_threads;tid++){

if (pthread_create(NULL,NULL,RunPandaCPUMapThread, panda_cpu_task_info[tid])!=0) perror("Thread creation failed!\n");

}//for

for (int tid = 0;tid<num_threads;tid++){ void *exitstat;

if (pthread_join(d_g_state->panda_cpu_task[tid],&exitstat)!=0) perror("joining failed"); }//for

SIMD

void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) { for(int i=0; i<n; i+=4) {

//compute c[i], c[i+1], c[i+2], c[i+3] uint32×4_t a4 = vld1q_u32(a+i); uint32×4_t b4 = vld1q_u32(b+i); uint32×4_t c4 = vaddq_u32(a4,b4); vst1q_u32(c+i,c4);

} }

SIMT

__global__ void add(float *a, float *b, float *c) { int i = blockIdx.x * blockDim.x + threadIdx.x; a[i]=b[i]+c[i]; //no loop!

(8)

Heterogeneous MapReduce API

• MapReduce

• A successful paradigm consist of Map Reduce stages to run SPMD computation

• Hide the implementation and

optimization details of the underlying system from developers

• Heterogeneous MapReduce

• Consist GPU and CPU

implementation of MapReduce API

• Applications should deliver

different performance on different types of hardware resources

• Leave the programmer the

flexibility to optimize the code for

(9)

Heterogeneous MapReduce API

Type

Function

C/C++ Func

**void cpu_map(KEY key, VAL val, int keySize, …)**

**void cpu_reduce(KEY key, VAL val, …)**

**void cpu_combiner(KEY KEY, VAL_Arr val, …)**

**Int cpu_compare(KEY key1, VAL val1, .., KEY …)**

CUDA Device Func

**device void gpu_device_map(KEY *key, …)**

**device void gpu__device_reduce(KEY *key, …)**

**device void gpu_device_combiner(KEY *key, ..)**

**device Int gpu_device_compare(KEY *key, …)**

CUDA Host Func

**host void gpu_host_map(KEY *key, …)**

**host void gpu_host_reduce(KEY *key, …)**

**host void gpu_host_combiner(KEY *key, …)**

**host void gpu_host_compare(KEY *key, …)**

• Three different types of MapReduce interface, each of which consists

of map, reduce, combiner, and compare functions

• The runtime implementation use the C++ and CUDA as backend

(10)

Scheduling SPMD on GPU and CPU

• The runtime schedule C++/CUDA binary on GPU

and CPU devices simultaneously

• Difficulty

–

More complex than scheduling SPMD on

homogeneous resources in order to have good

efficiency

–

Workload balance, Task granularity

• Solution

–

Use roofline model to unify capability of GPU and

CPU

–

Extend roofline model so as to use multiple devices

simultaneously

(11)

Roofline Model (1)

• Performance is how well the kernel’s

characteristics map to an architecture’s

characteristics

• Roofline model can describe the type of SPMD

task and the capability of hardware device

• Roofline model can integrate in-core

performance, memory bandwidth into a

single readily understandable performance

figure

• Metric of interest:

–

Architecture’s characteristics: Peak Performance

of GPU and CPU, DRAM bandwidth, PCI-E

bandwidth, Network bandwidth, Disk bandwidth.

(12)

Roofline Model (2)

Arithmetic Intensity

O(1) _O(log(N)) O(N)

(13)

Extend Roofline Model to Use

Multiple Devices Simultaneously

• The peak performance is a continuous

additive interval function of arithmetic

density of app and a set of characteristics of

hardware devices

• Assign the proper workload to each device

according to calculated peak performance

of every hardware device

Low

Latency ThroughputHigh

Roofline Model

100

1000

Unified peak performance for CPU and GPU in flops for target

(14)

Equations to Calculate the Workload

Distribution between GPU and CPU

• Use the roofline model to calculate the workload distribution between GPU and CPU

• Variable

– Fc/Fg flop per second for target application on CPU/GPU, respectively – Ac/Ag flop per byte for target application on CPU/GPU, respectively – B_dram present the bandwidth of DRAM,

– B_pcie present the bandwidth of PCI-E,

– Pc/Pg present peak performance of CPU/GPU

• In Eq(1), when Tg_p ~= Tc_p, Tgc gets the minimal value

(15)

Equations to Calculate the Workload

Distribution between GPU and CPU

• Equation (6)&(7) calculate Fg and Fc value so as to calculate the P value

– Compute the value of Fg and Fc for three different situations

– The equality is established when data transfer rate is equal to

computation speed or when the computation speed reaches the peak performance, the roofline of performance

(16)

Synthesis Approach for Scheduling

SPMD on GPU and CPU Cluster

• Scheduling in heterogeneous environment is more

complex than that in homogeneous environment

• Two-level scheduling

–

First level schedules computation among nodes, the

scheduler is located on master node.

–

Second level schedules tasks among devices on same

node, the scheduler is located on worker node.

• Other scheduling strategies

–

Dynamic task scheduling: GPU and CPU daemons

dynamical poll tasks from worker node.

(17)

Two Level Scheduling Strategy

• First level scheduling:

– Assume the computation capability of all the nodes are identical.

– Split whole job into some tasks, where

#tasks = 2*(#nodes)

• Second level scheduling:

– Split tasks into sub-tasks on each node – Work load distribution among GPU and

CPU is calculated by using extended roofline model.

Worker Node #1

Group 2

Group 1

Worker Node #0

Group 2

(18)

Tasks Funnel: Use Task Daemon to

Schedule Tasks on GPU and CPU

• CPU Daemon

– Eliminate the cost of creating and terminating thread for CPU tasks

– #cpu_pthread = #cores -#gpu_pthread

• GPU Daemon

– GPU resource lacks of multi-task management in the OS level.

– Maintain the GPU context among different streams or kernels invocation.

– #gpu_pthread = #gpu

Worker Node

Slot 1 Task Tracker

(19)

Dynamic Task Scheduling and

Task Granularity

Worker Node Slot 3 Slot 2 Task Pool Slot 1

• There are the CPU task daemons and GPU task daemons on each worker node

• CPU task scheduling

– CPU daemon dynamical poll task from worker node.

– Task granularity: #task = 2*(#cores)

• GPU task scheduling

– GPU daemon dynamical poll task from worker node

– It is more complex to decide the task granularity for the GPUs.

• Overlap data transfer with computation

• Increase the arithmetic intensity so as to saturate the peak performance of GPU

(20)

Runtime Framework

(21)

C-means algorithm with CUDA

C-means algorithm with CUDA:

• Configure:

• Copy data from the CPU to GPU

• Map function:

• Calculate the distance matrix

• Calculate the membership matrix

• Update the centers kernel

• Reduce function:

• Aggregate the partial cluster centers and compute final cluster

centers.

• Compute the difference between the current cluster centers and

previous iteration.

• Main program:

• The iteration will stop when the difference is smaller than

predefined threshold or it will go to next iteration.

(22)

(23)

GMM algorithm with CUDA

•The expectation maximization using a mixture model approach takes the data set as sum of a mixture of multiple distinct events.

•Gaussians mixtures from probabilistic models composed of multiple distinct Gaussians distributions as clusters. Each cluster ‘m’ within a D dimensional data set can be characterized by the following

parameters_N

m: the number of samples in the cluster

πm: probability that a sample in data set belongs

to the cluster

μm : a D dimensional mean

Rm: a DxD spectral covariance matrix

(24)

Performance results on GPUs and

CPUs cluster

(1) DGEMV (2) C-means (3) GMM

weak scalability for GEMV, C-means, and GMM applications with up to 8 nodes on Delta. Y axis represent Gflops per node for each application. (1) GEMV, M=35000, N=10,000 per Node. (2) C-means, N=1000,000 per node , D=100, M=10. (3) GMM, N=100,000 per node, D=60, M=100. The red bard

(25)

Performance results on GPUs and

CPUs cluster

Apps GEMV C-means GMM Arithmetic

intensity 2 (M = 100)5*M (M=10,D=6011*M*D )

p calculated by

Equation (8) 97.3% 11.2% 11.2% p calculated by

app profiling 90.8% 11.9% 13.1%

Work Load Distribution among GPU and CPU for Three SPMD Applications using Proposed Model

Machine Name Future Grid

Delta BigRed2IU GPU Type C2070 K20

Flops 515Gflops 1.17Tflops Bandwidth 148GB/s 208GB/s Cores/GPU 448 Cores 2496 Cores

CPU Type Intel Xeon

5660 AMD Opteron6212 Cores/CPU 12 Cores 32 Cores

Flops 2.8 GHz 2.6 GHz

Hardware Configuration

• Hardware Configuration

– Metrics: bandwidth of DRAM, PCI-E bus, peak performance of GPU and CPU

– Uses nvcc 5.0 and gcc 4.4.6

• Workload Distribution

– The work load distribution proportions, p values, between GPU and CPU are calculated by using Equation (8).

– The error between p values calculated by using Equation (8) and the ones by application profiling is less than 10%

(26)

Contributions and Future Work

• Contributions:

–

We propose an analytic model that scheduling SPMD

tasks on heterogeneous cluster and we implement the

distributed parallel system based on this model

–

We extend roofline model so as to schedule

computation on heterogeneous resources

simultaneously

• Future Work

–

Extend the proposed analytical model by considering

the network bandwidth issue

(27)

27

Thank You!

(28)

Heat

Power restriction Transistor size

(29)

Parallel Programming Models on GPU

Designed to be used for writing software in a wide variety of application domains on GPU

Directive-based extensions to existing high-level languages with

user-controlled parallelism offload workload from CPU to accelerator, providing

portability across operating systems, CPU and GPU

Internal DSLs depends on a host language, whose syntax is both influenced and restricted by the host language.

(30)

Tasks Funnel: use device daemon to

manage GPU and CPU resources

• CPU Daemon

– Eliminate the cost of creating and terminating thread for CPU tasks

– #cpu_pthread = #cores -#gpu_pthread

• GPU Daemon

– GPU resource lacks of

multi-task management by the OS.

– Maintain the GPU context among different streams or kernels invocation.

– #gpu_pthread = #gpu

Worker Node

Slot 1 Task Tracker

Co processing SPMD Computation on GPUs and CPUs cluster