• No results found

Co processing SPMD Computation on GPUs and CPUs cluster

N/A
N/A
Protected

Academic year: 2020

Share "Co processing SPMD Computation on GPUs and CPUs cluster"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Co-processing SPMD Computation

on GPUs and CPUs cluster

Cluster 2013, Indianapolis, 9/24/2013

Hui Li

Pervasive Technology Institute

Indiana University

(2)

Acknowledgement

Co Authors:

Geoffrey Fox

Gregor von Laszewski

Arun Chauhan

Ph.D:

Andrew Pangborn

Jerome Mitchell

Adnan Ozsoy

Jong Choi

2

(3)

Outline

Introduction

Heterogeneous Computation

Programming models for GPU and CPU

System Design and Implementation

Heterogeneous MapReduce API

Scheduling Model and Implementation

Experiment Results

GEMV, C-means, and GMM

(4)

Hybrid Cluster with GPU and CPU

Supercomputers with GPU and multicore CPU

Accelerators used for scientific research

(5)

Heterogeneous Computation

Why Heterogeneous?

CPU are not capable for exascale computation.

Energy and power are the first class constraint on exascale system

There are more overlap between Top 500 and Green 500 machines

How Heterogeneous?

Not only use heterogeneous devices separately, but make use of them together as a system.

Challenges for heterogeneous computing

Programmability

Efficiency

Scalability

(6)

Programmability Challenges for

Heterogeneous Devices

Programming Models on CPU

SIMD, MIMD, fast for threading code,

Low Latency, Modest parallelism,

OpenMP, Pthreads

Programming Models on GPU

SIMT, fast for vector code

High Throughput, Massive parallelism,

CUDA, OpenCL

Programming models on GPU and CPU.

Either use a hybrid programming model, or use a

uniform programming model.

Difficult to exploit all computing units simultaneously.

OpenACC, Qilin, MAGMA

(7)

Code Samples

SPMD

for (int tid = 0;tid<num_threads;tid++){

if (pthread_create(NULL,NULL,RunPandaCPUMapThread, panda_cpu_task_info[tid])!=0) perror("Thread creation failed!\n");

}//for

for (int tid = 0;tid<num_threads;tid++){ void *exitstat;

if (pthread_join(d_g_state->panda_cpu_task[tid],&exitstat)!=0) perror("joining failed"); }//for

SIMD

void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) { for(int i=0; i<n; i+=4) {

//compute c[i], c[i+1], c[i+2], c[i+3] uint32×4_t a4 = vld1q_u32(a+i); uint32×4_t b4 = vld1q_u32(b+i); uint32×4_t c4 = vaddq_u32(a4,b4); vst1q_u32(c+i,c4);

} }

SIMT

__global__ void add(float *a, float *b, float *c) { int i = blockIdx.x * blockDim.x + threadIdx.x; a[i]=b[i]+c[i]; //no loop!

(8)

Heterogeneous MapReduce API

MapReduce

A successful paradigm consist of Map Reduce stages to run SPMD computation

Hide the implementation and

optimization details of the underlying system from developers

Heterogeneous MapReduce

Consist GPU and CPU

implementation of MapReduce API

Applications should deliver

different performance on different types of hardware resources

Leave the programmer the

flexibility to optimize the code for

(9)

Heterogeneous MapReduce API

Type

Function

C/C++ Func

void cpu_map(KEY *key, VAL *val, int keySize, …)

void cpu_reduce(KEY *key, VAL *val, …)

void cpu_combiner(KEY *KEY, VAL_Arr *val, …)

Int cpu_compare(KEY *key1, VAL *val1, .., KEY …)

CUDA Device Func

__device__ void gpu_device_map(KEY *key, …)

__device__ void gpu__device_reduce(KEY *key, …)

__device__ void gpu_device_combiner(KEY *key, ..)

__device__ Int gpu_device_compare(KEY *key, …)

CUDA Host Func

__host__ void gpu_host_map(KEY *key, …)

__host__ void gpu_host_reduce(KEY *key, …)

__host__ void gpu_host_combiner(KEY *key, …)

__host__ void gpu_host_compare(KEY *key, …)

Three different types of MapReduce interface, each of which consists

of map, reduce, combiner, and compare functions

The runtime implementation use the C++ and CUDA as backend

(10)

Scheduling SPMD on GPU and CPU

The runtime schedule C++/CUDA binary on GPU

and CPU devices simultaneously

Difficulty

More complex than scheduling SPMD on

homogeneous resources in order to have good

efficiency

Workload balance, Task granularity

Solution

Use roofline model to unify capability of GPU and

CPU

Extend roofline model so as to use multiple devices

simultaneously

(11)

Roofline Model (1)

Performance is how well the kernel’s

characteristics map to an architecture’s

characteristics

Roofline model can describe the type of SPMD

task and the capability of hardware device

Roofline model can integrate in-core

performance, memory bandwidth into a

single readily understandable performance

figure

Metric of interest:

Architecture’s characteristics: Peak Performance

of GPU and CPU, DRAM bandwidth, PCI-E

bandwidth, Network bandwidth, Disk bandwidth.

(12)

Roofline Model (2)

Arithmetic Intensity

O(1) O(log(N)) O(N)

(13)

Extend Roofline Model to Use

Multiple Devices Simultaneously

The peak performance is a continuous

additive interval function of arithmetic

density of app and a set of characteristics of

hardware devices

Assign the proper workload to each device

according to calculated peak performance

of every hardware device

Low

Latency ThroughputHigh

Roofline Model

100

1000

Unified peak performance for CPU and GPU in flops for target

(14)

Equations to Calculate the Workload

Distribution between GPU and CPU

Use the roofline model to calculate the workload distribution between GPU and CPU

Variable

Fc/Fg flop per second for target application on CPU/GPU, respectivelyAc/Ag flop per byte for target application on CPU/GPU, respectivelyB_dram present the bandwidth of DRAM,

B_pcie present the bandwidth of PCI-E,

Pc/Pg present peak performance of CPU/GPU

In Eq(1), when Tg_p ~= Tc_p, Tgc gets the minimal value

(15)

Equations to Calculate the Workload

Distribution between GPU and CPU

Equation (6)&(7) calculate Fg and Fc value so as to calculate the P value

Compute the value of Fg and Fc for three different situations

The equality is established when data transfer rate is equal to

computation speed or when the computation speed reaches the peak performance, the roofline of performance

(16)

Synthesis Approach for Scheduling

SPMD on GPU and CPU Cluster

Scheduling in heterogeneous environment is more

complex than that in homogeneous environment

Two-level scheduling

First level schedules computation among nodes, the

scheduler is located on master node.

Second level schedules tasks among devices on same

node, the scheduler is located on worker node.

Other scheduling strategies

Dynamic task scheduling: GPU and CPU daemons

dynamical poll tasks from worker node.

(17)

Two Level Scheduling Strategy

First level scheduling:

Assume the computation capability of all the nodes are identical.

Split whole job into some tasks, where

#tasks = 2*(#nodes)

Second level scheduling:

Split tasks into sub-tasks on each nodeWork load distribution among GPU and

CPU is calculated by using extended roofline model.

Worker Node #1

Group 2

Group 1

Worker Node #0

Group 2

(18)

Tasks Funnel: Use Task Daemon to

Schedule Tasks on GPU and CPU

CPU Daemon

Eliminate the cost of creating and terminating thread for CPU tasks

#cpu_pthread = #cores -#gpu_pthread

GPU Daemon

GPU resource lacks of multi-task management in the OS level.

Maintain the GPU context among different streams or kernels invocation.

#gpu_pthread = #gpu

Worker Node

Slot 1 Task Tracker

(19)

Dynamic Task Scheduling and

Task Granularity

Worker Node Slot 3 Slot 2 Task Pool Slot 1

There are the CPU task daemons and GPU task daemons on each worker node

CPU task scheduling

CPU daemon dynamical poll task from worker node.

Task granularity: #task = 2*(#cores)

GPU task scheduling

GPU daemon dynamical poll task from worker node

It is more complex to decide the task granularity for the GPUs.

Overlap data transfer with computation

Increase the arithmetic intensity so as to saturate the peak performance of GPU

(20)

Runtime Framework

(21)

C-means algorithm with CUDA

C-means algorithm with CUDA:

Configure:

Copy data from the CPU to GPU

Map function:

Calculate the distance matrix

Calculate the membership matrix

Update the centers kernel

Reduce function:

Aggregate the partial cluster centers and compute final cluster

centers.

Compute the difference between the current cluster centers and

previous iteration.

Main program:

The iteration will stop when the difference is smaller than

predefined threshold or it will go to next iteration.

(22)
(23)

GMM algorithm with CUDA

The expectation maximization using a mixture model approach takes the data set as sum of a mixture of multiple distinct events.

Gaussians mixtures from probabilistic models composed of multiple distinct Gaussians distributions as clusters. Each cluster ‘m’ within a D dimensional data set can be characterized by the following

parametersN

m: the number of samples in the cluster

πm: probability that a sample in data set belongs

to the cluster

μm : a D dimensional mean

Rm: a DxD spectral covariance matrix

(24)

Performance results on GPUs and

CPUs cluster

(1) DGEMV (2) C-means (3) GMM

weak scalability for GEMV, C-means, and GMM applications with up to 8 nodes on Delta. Y axis represent Gflops per node for each application. (1) GEMV, M=35000, N=10,000 per Node. (2) C-means, N=1000,000 per node , D=100, M=10. (3) GMM, N=100,000 per node, D=60, M=100. The red bard

(25)

Performance results on GPUs and

CPUs cluster

Apps GEMV C-means GMM Arithmetic

intensity 2 (M = 100)5*M (M=10,D=6011*M*D )

p calculated by

Equation (8) 97.3% 11.2% 11.2% p calculated by

app profiling 90.8% 11.9% 13.1%

Work Load Distribution among GPU and CPU for Three SPMD Applications using Proposed Model

Machine Name Future Grid

Delta BigRed2IU GPU Type C2070 K20

Flops 515Gflops 1.17Tflops Bandwidth 148GB/s 208GB/s Cores/GPU 448 Cores 2496 Cores

CPU Type Intel Xeon

5660 AMD Opteron6212 Cores/CPU 12 Cores 32 Cores

Flops 2.8 GHz 2.6 GHz

Hardware Configuration

Hardware Configuration

Metrics: bandwidth of DRAM, PCI-E bus, peak performance of GPU and CPU

Uses nvcc 5.0 and gcc 4.4.6

Workload Distribution

The work load distribution proportions, p values, between GPU and CPU are calculated by using Equation (8).

The error between p values calculated by using Equation (8) and the ones by application profiling is less than 10%

(26)

Contributions and Future Work

Contributions:

We propose an analytic model that scheduling SPMD

tasks on heterogeneous cluster and we implement the

distributed parallel system based on this model

We extend roofline model so as to schedule

computation on heterogeneous resources

simultaneously

Future Work

Extend the proposed analytical model by considering

the network bandwidth issue

(27)

27

Thank You!

(28)

Heat

Power restriction Transistor size

(29)

Parallel Programming Models on GPU

Designed to be used for writing software in a wide variety of application domains on GPU

Directive-based extensions to existing high-level languages with

user-controlled parallelism offload workload from CPU to accelerator, providing

portability across operating systems, CPU and GPU

Internal DSLs depends on a host language, whose syntax is both influenced and restricted by the host language.

(30)

Tasks Funnel: use device daemon to

manage GPU and CPU resources

CPU Daemon

Eliminate the cost of creating and terminating thread for CPU tasks

#cpu_pthread = #cores -#gpu_pthread

GPU Daemon

GPU resource lacks of

multi-task management by the OS.

Maintain the GPU context among different streams or kernels invocation.

#gpu_pthread = #gpu

Worker Node

Slot 1 Task Tracker

References

Related documents

The metal used in the minting of these coins is known as electrum, which is a natural pale yellow alloy of gold and silver.. Electrum was found in the Lydian mountains in

In fact, sequential kernels can be parallelized at the shared memory level using OpenMP: one example is once more the coarse solve of iterative solvers; another example is

Furthermore, in flagellar T3S systems, the T3S4 protein FliK has been identified, which switches the substrate specificity from early (hook components) to late (filament pro-

Most common include: Injection site reactions, skin rash, headache, nausea, diarrhea, sinus infection, abdominal pain, increased risk of minor infections.. Most serious

The Barracuda Web Application Firewall, if configured, will ignore the proxy IP and extract the actual client IP from the appropriate header to apply security policies as well as

Study of mobile adoption by older adults increases the understanding of older adults’ ability to continue working and learning due to the growth of mobile learning, the

This thesis aims to construct a census of Indian slaves brought to the Cape from 1658 to 1834—along the lines of Philip Curtin's aggregated census of the

In 2000 two Malaria Consultation Meetings attended by five SADC Health Ministers from Botswana, Mozambique, South Africa, Swaziland and Zimbabwe, were held in response to the