Co-processing SPMD Computation
on GPUs and CPUs cluster
Cluster 2013, Indianapolis, 9/24/2013
Hui Li
Pervasive Technology Institute
Indiana University
Acknowledgement
•
Co Authors:
Geoffrey Fox
Gregor von Laszewski
Arun Chauhan
•
Ph.D:
Andrew Pangborn
Jerome Mitchell
Adnan Ozsoy
Jong Choi
2
Outline
•
Introduction
–
Heterogeneous Computation
–
Programming models for GPU and CPU
•
System Design and Implementation
–
Heterogeneous MapReduce API
–
Scheduling Model and Implementation
•
Experiment Results
–
GEMV, C-means, and GMM
Hybrid Cluster with GPU and CPU
•
Supercomputers with GPU and multicore CPU
•
Accelerators used for scientific research
Heterogeneous Computation
• Why Heterogeneous?
• CPU are not capable for exascale computation.
• Energy and power are the first class constraint on exascale system
• There are more overlap between Top 500 and Green 500 machines
• How Heterogeneous?
• Not only use heterogeneous devices separately, but make use of them together as a system.
• Challenges for heterogeneous computing
• Programmability
• Efficiency
• Scalability
Programmability Challenges for
Heterogeneous Devices
•
Programming Models on CPU
•
SIMD, MIMD, fast for threading code,
•
Low Latency, Modest parallelism,
•
OpenMP, Pthreads
•
Programming Models on GPU
•
SIMT, fast for vector code
•
High Throughput, Massive parallelism,
•
CUDA, OpenCL
•
Programming models on GPU and CPU.
•
Either use a hybrid programming model, or use a
uniform programming model.
•
Difficult to exploit all computing units simultaneously.
•
OpenACC, Qilin, MAGMA
Code Samples
SPMD
for (int tid = 0;tid<num_threads;tid++){
if (pthread_create(NULL,NULL,RunPandaCPUMapThread, panda_cpu_task_info[tid])!=0) perror("Thread creation failed!\n");
}//for
for (int tid = 0;tid<num_threads;tid++){ void *exitstat;
if (pthread_join(d_g_state->panda_cpu_task[tid],&exitstat)!=0) perror("joining failed"); }//for
SIMD
void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) { for(int i=0; i<n; i+=4) {
//compute c[i], c[i+1], c[i+2], c[i+3] uint32×4_t a4 = vld1q_u32(a+i); uint32×4_t b4 = vld1q_u32(b+i); uint32×4_t c4 = vaddq_u32(a4,b4); vst1q_u32(c+i,c4);
} }
SIMT
__global__ void add(float *a, float *b, float *c) { int i = blockIdx.x * blockDim.x + threadIdx.x; a[i]=b[i]+c[i]; //no loop!
Heterogeneous MapReduce API
• MapReduce
• A successful paradigm consist of Map Reduce stages to run SPMD computation
• Hide the implementation and
optimization details of the underlying system from developers
• Heterogeneous MapReduce
• Consist GPU and CPU
implementation of MapReduce API
• Applications should deliver
different performance on different types of hardware resources
• Leave the programmer the
flexibility to optimize the code for
Heterogeneous MapReduce API
Type
Function
C/C++ Func
void cpu_map(KEY *key, VAL *val, int keySize, …)
void cpu_reduce(KEY *key, VAL *val, …)
void cpu_combiner(KEY *KEY, VAL_Arr *val, …)
Int cpu_compare(KEY *key1, VAL *val1, .., KEY …)
CUDA Device Func
__device__ void gpu_device_map(KEY *key, …)
__device__ void gpu__device_reduce(KEY *key, …)
__device__ void gpu_device_combiner(KEY *key, ..)
__device__ Int gpu_device_compare(KEY *key, …)
CUDA Host Func
__host__ void gpu_host_map(KEY *key, …)
__host__ void gpu_host_reduce(KEY *key, …)
__host__ void gpu_host_combiner(KEY *key, …)
__host__ void gpu_host_compare(KEY *key, …)
•
Three different types of MapReduce interface, each of which consists
of map, reduce, combiner, and compare functions
•
The runtime implementation use the C++ and CUDA as backend
Scheduling SPMD on GPU and CPU
•
The runtime schedule C++/CUDA binary on GPU
and CPU devices simultaneously
•
Difficulty
–
More complex than scheduling SPMD on
homogeneous resources in order to have good
efficiency
–
Workload balance, Task granularity
•
Solution
–
Use roofline model to unify capability of GPU and
CPU
–
Extend roofline model so as to use multiple devices
simultaneously
Roofline Model (1)
•
Performance is how well the kernel’s
characteristics map to an architecture’s
characteristics
•
Roofline model can describe the type of SPMD
task and the capability of hardware device
•
Roofline model can integrate in-core
performance, memory bandwidth into a
single readily understandable performance
figure
•
Metric of interest:
–
Architecture’s characteristics: Peak Performance
of GPU and CPU, DRAM bandwidth, PCI-E
bandwidth, Network bandwidth, Disk bandwidth.
Roofline Model (2)
Arithmetic Intensity
O(1) O(log(N)) O(N)
Extend Roofline Model to Use
Multiple Devices Simultaneously
•
The peak performance is a continuous
additive interval function of arithmetic
density of app and a set of characteristics of
hardware devices
•
Assign the proper workload to each device
according to calculated peak performance
of every hardware device
LowLatency ThroughputHigh
Roofline Model
100
1000
Unified peak performance for CPU and GPU in flops for target
Equations to Calculate the Workload
Distribution between GPU and CPU
• Use the roofline model to calculate the workload distribution between GPU and CPU
• Variable
– Fc/Fg flop per second for target application on CPU/GPU, respectively – Ac/Ag flop per byte for target application on CPU/GPU, respectively – B_dram present the bandwidth of DRAM,
– B_pcie present the bandwidth of PCI-E,
– Pc/Pg present peak performance of CPU/GPU
• In Eq(1), when Tg_p ~= Tc_p, Tgc gets the minimal value
Equations to Calculate the Workload
Distribution between GPU and CPU
• Equation (6)&(7) calculate Fg and Fc value so as to calculate the P value
– Compute the value of Fg and Fc for three different situations
– The equality is established when data transfer rate is equal to
computation speed or when the computation speed reaches the peak performance, the roofline of performance
Synthesis Approach for Scheduling
SPMD on GPU and CPU Cluster
•
Scheduling in heterogeneous environment is more
complex than that in homogeneous environment
•
Two-level scheduling
–
First level schedules computation among nodes, the
scheduler is located on master node.
–
Second level schedules tasks among devices on same
node, the scheduler is located on worker node.
•
Other scheduling strategies
–
Dynamic task scheduling: GPU and CPU daemons
dynamical poll tasks from worker node.
Two Level Scheduling Strategy
• First level scheduling:
– Assume the computation capability of all the nodes are identical.
– Split whole job into some tasks, where
#tasks = 2*(#nodes)
• Second level scheduling:
– Split tasks into sub-tasks on each node – Work load distribution among GPU and
CPU is calculated by using extended roofline model.
Worker Node #1
Group 2
Group 1
Worker Node #0
Group 2
Tasks Funnel: Use Task Daemon to
Schedule Tasks on GPU and CPU
•
CPU Daemon
– Eliminate the cost of creating and terminating thread for CPU tasks
– #cpu_pthread = #cores -#gpu_pthread
•
GPU Daemon
– GPU resource lacks of multi-task management in the OS level.
– Maintain the GPU context among different streams or kernels invocation.
– #gpu_pthread = #gpu
Worker Node
Slot 1 Task Tracker
Dynamic Task Scheduling and
Task Granularity
Worker Node Slot 3 Slot 2 Task Pool Slot 1• There are the CPU task daemons and GPU task daemons on each worker node
• CPU task scheduling
– CPU daemon dynamical poll task from worker node.
– Task granularity: #task = 2*(#cores)
• GPU task scheduling
– GPU daemon dynamical poll task from worker node
– It is more complex to decide the task granularity for the GPUs.
• Overlap data transfer with computation
• Increase the arithmetic intensity so as to saturate the peak performance of GPU
Runtime Framework
C-means algorithm with CUDA
C-means algorithm with CUDA:
•
Configure:
•
Copy data from the CPU to GPU
•
Map function:
•
Calculate the distance matrix
•
Calculate the membership matrix
•
Update the centers kernel
•
Reduce function:
•
Aggregate the partial cluster centers and compute final cluster
centers.
•
Compute the difference between the current cluster centers and
previous iteration.
•
Main program:
•
The iteration will stop when the difference is smaller than
predefined threshold or it will go to next iteration.
GMM algorithm with CUDA
•The expectation maximization using a mixture model approach takes the data set as sum of a mixture of multiple distinct events.
•Gaussians mixtures from probabilistic models composed of multiple distinct Gaussians distributions as clusters. Each cluster ‘m’ within a D dimensional data set can be characterized by the following
parametersN
m: the number of samples in the cluster
πm: probability that a sample in data set belongs
to the cluster
μm : a D dimensional mean
Rm: a DxD spectral covariance matrix
Performance results on GPUs and
CPUs cluster
(1) DGEMV (2) C-means (3) GMM
weak scalability for GEMV, C-means, and GMM applications with up to 8 nodes on Delta. Y axis represent Gflops per node for each application. (1) GEMV, M=35000, N=10,000 per Node. (2) C-means, N=1000,000 per node , D=100, M=10. (3) GMM, N=100,000 per node, D=60, M=100. The red bard
Performance results on GPUs and
CPUs cluster
Apps GEMV C-means GMM Arithmetic
intensity 2 (M = 100)5*M (M=10,D=6011*M*D )
p calculated by
Equation (8) 97.3% 11.2% 11.2% p calculated by
app profiling 90.8% 11.9% 13.1%
Work Load Distribution among GPU and CPU for Three SPMD Applications using Proposed Model
Machine Name Future Grid
Delta BigRed2IU GPU Type C2070 K20
Flops 515Gflops 1.17Tflops Bandwidth 148GB/s 208GB/s Cores/GPU 448 Cores 2496 Cores
CPU Type Intel Xeon
5660 AMD Opteron6212 Cores/CPU 12 Cores 32 Cores
Flops 2.8 GHz 2.6 GHz
Hardware Configuration
• Hardware Configuration
– Metrics: bandwidth of DRAM, PCI-E bus, peak performance of GPU and CPU
– Uses nvcc 5.0 and gcc 4.4.6
• Workload Distribution
– The work load distribution proportions, p values, between GPU and CPU are calculated by using Equation (8).
– The error between p values calculated by using Equation (8) and the ones by application profiling is less than 10%
Contributions and Future Work
•
Contributions:
–
We propose an analytic model that scheduling SPMD
tasks on heterogeneous cluster and we implement the
distributed parallel system based on this model
–
We extend roofline model so as to schedule
computation on heterogeneous resources
simultaneously
•
Future Work
–
Extend the proposed analytical model by considering
the network bandwidth issue
27
Thank You!
Heat
Power restriction Transistor size
Parallel Programming Models on GPU
Designed to be used for writing software in a wide variety of application domains on GPU
Directive-based extensions to existing high-level languages with
user-controlled parallelism offload workload from CPU to accelerator, providing
portability across operating systems, CPU and GPU
Internal DSLs depends on a host language, whose syntax is both influenced and restricted by the host language.
Tasks Funnel: use device daemon to
manage GPU and CPU resources
•
CPU Daemon
– Eliminate the cost of creating and terminating thread for CPU tasks
– #cpu_pthread = #cores -#gpu_pthread
•
GPU Daemon
– GPU resource lacks of
multi-task management by the OS.
– Maintain the GPU context among different streams or kernels invocation.
– #gpu_pthread = #gpu
Worker Node
Slot 1 Task Tracker