Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs
Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu
Outline
• Motivation
• Overview
• Applications
– KMeansClustering
– Multi-Dimensional Scaling
– PageRank
• Lessons learned
Iterative Statistical Applications
• Consist of iterative computation and communication steps
• Growing set of applications
– Clustering, data mining, machine learning & dimension reduction applications
– Driven by data deluge & emerging computation fields
[Figure: each iteration consists of compute, communication, and reduce/barrier steps]
Iterative Statistical Applications
• Data intensive
• Larger loop-invariant data
• Smaller loop-variant delta between iterations
– Result of an iteration
– Broadcast to all the workers of the next iteration
• High ratio of memory accesses to floating-point operations
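The compute / reduce / broadcast pattern described above can be sketched in a few lines. This is only an illustration on a toy problem (iteratively estimating a mean); the function names, the damping factor `step`, and the data are hypothetical and not taken from the slides.

```python
# Minimal sketch of the compute / reduce / broadcast pattern.
# All names (partial_residual, run_iterations) are hypothetical.

def partial_residual(partition, estimate):
    """Compute step: one worker scans its loop-invariant partition."""
    return sum(x - estimate for x in partition), len(partition)

def run_iterations(partitions, estimate, iterations, step=0.5):
    for _ in range(iterations):
        # Compute: every worker uses the same broadcast value.
        partials = [partial_residual(p, estimate) for p in partitions]
        # Reduce / barrier: combine the partial results.
        total = sum(r for r, _ in partials)
        count = sum(n for _, n in partials)
        # The small loop-variant value is the only thing that changes
        # and must be sent to the workers of the next iteration.
        estimate += step * total / count
    return estimate

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # loop-invariant data
result = run_iterations(partitions, estimate=0.0, iterations=30)
print(result)  # approaches the mean of all values
```

The partitions never change between iterations, which is why keeping them resident on the device pays off; only the scalar `estimate` crosses iteration boundaries.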
Motivation
• Important set of applications
• Increasing power and availability of GPGPU computing
• Cloud Computing
– Iterative MapReduce technologies
– GPGPU computing in clouds
Motivation
• A sample bioinformatics pipeline
Overview
• Three iterative statistical kernels implemented using OpenCL
– KMeans Clustering
– Multi-Dimensional Scaling
– PageRank
• Optimized by
– Reusing loop-invariant data
– Utilizing different memory levels
– Rearranging data storage layouts
OpenCL
• Cross platform, vendor neutral, open standard
– GPGPU, multi-core CPU, FPGA…
• Supports parallel programming in heterogeneous environments
• Compute kernels
– Based on C99
– Basic unit of executable code
• Work items
– Single element of the execution domain
– Grouped into work groups
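To make the relation between work items and work groups concrete, the following sketch emulates a one-dimensional index space on the host. It is not OpenCL code; `enumerate_work_items` is a hypothetical helper, and the sketch assumes a zero global offset and a global size that is a multiple of the work-group size.

```python
def enumerate_work_items(global_size, local_size):
    """Return (global_id, group_id, local_id) for a 1-D index space."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    items = []
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # Same relation a kernel would see between get_global_id(0),
            # get_group_id(0), get_local_size(0) and get_local_id(0).
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return items

items = enumerate_work_items(global_size=8, local_size=4)
for global_id, group_id, local_id in items:
    print(global_id, group_id, local_id)
```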
OpenCL Memory Hierarchy
[Figure: OpenCL memory hierarchy. Each compute unit (Compute Unit 1, Compute Unit 2) has its own local memory and contains work items (Work Item 1, Work Item 2), each with private memory. Global GPU memory and constant memory sit below the compute units.]
Environment
• NVIDIA Tesla C1060
– 240 scalar processors
– 4GB global memory
– 102 GB/sec peak memory bandwidth
– 16KB shared memory per 8 cores
– CUDA compute capability 1.3
– Peak Performance
• 933 GFLOPS Single with SF
• 622 GFLOPS Single MAD
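A quick back-of-the-envelope calculation from the figures above shows why memory traffic matters for these kernels. The sketch uses only the peak MAD rate and peak bandwidth listed on this slide, and assumes 4-byte single-precision values; the variable names are illustrative.

```python
PEAK_FLOPS_MAD = 622e9   # single-precision MAD peak, from the list above
PEAK_BANDWIDTH = 102e9   # bytes per second, from the list above
BYTES_PER_FLOAT = 4      # assumption: single-precision operands

# How many floats per second can global memory deliver at peak?
floats_per_second = PEAK_BANDWIDTH / BYTES_PER_FLOAT

# Flops that must be performed per float loaded to keep the ALUs busy.
flops_per_float = PEAK_FLOPS_MAD / floats_per_second
print(round(flops_per_float, 1))  # prints 24.4
```

A kernel that performs far fewer operations per value fetched from global memory is limited by bandwidth rather than arithmetic, which motivates the data reuse and local-memory optimizations that follow.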
KMeans Clustering
• Partition a given data set into disjoint clusters
• Each iteration
– Cluster assignment step
– Centroid update step
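The two steps of an iteration can be sketched on the host as follows. This is a plain sequential illustration, not the OpenCL kernel from the talk; the function names, the squared Euclidean distance, and the handling of empty clusters are assumptions of the sketch.

```python
def assign_clusters(points, centroids):
    """Cluster assignment step: index of the nearest centroid per point."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]

def update_centroids(points, labels, centroids):
    """Centroid update step: mean of the points assigned to each cluster."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p, k in zip(points, labels):
        counts[k] += 1
        for d in range(dims):
            sums[k][d] += p[d]
    # Assumption: an empty cluster keeps its previous centroid.
    return [[s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
            for k in range(len(centroids))]

def kmeans(points, centroids, iterations):
    for _ in range(iterations):
        labels = assign_clusters(points, centroids)
        centroids = update_centroids(points, labels, centroids)
    return centroids, assign_clusters(points, centroids)

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, labels = kmeans(points, [[0.0, 0.0], [10.0, 10.0]], iterations=5)
print(centroids, labels)
```

The data points are the loop-invariant input; only the small list of centroids changes from one iteration to the next.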
KMeansClustering Optimizations
• Naïve (with data re-using)
[Figure: GFLOPS vs. number of data points (1,000 to 100,000,000)]
KMeansClustering Optimizations
• Data points copied to local memory
[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); legend: Naïve (A)]
KMeansClustering Optimizations
• Cluster centroid points copied to local memory
[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); legend: Naïve (A), Data in Local Memory (B)]
KMeansClustering Optimizations
• Local memory data points in column major order
[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); legend: Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), C + Data Coalescing (D)]
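The effect of the storage order can be seen from the index arithmetic alone. The sketch below assumes that work item i processes data point i and that all work items read the same dimension at the same time; the helper names are hypothetical.

```python
def row_major_index(point, dim, num_points, num_dims):
    # Point-by-point storage: all dimensions of a point are adjacent.
    return point * num_dims + dim

def column_major_index(point, dim, num_points, num_dims):
    # Dimension-by-dimension storage: the same dimension of
    # consecutive points is adjacent.
    return dim * num_points + point

def addresses(index_fn, work_items, dim, num_points, num_dims):
    """Addresses touched when each work item reads `dim` of its own point."""
    return [index_fn(w, dim, num_points, num_dims) for w in range(work_items)]

row = addresses(row_major_index, work_items=4, dim=0, num_points=8, num_dims=3)
col = addresses(column_major_index, work_items=4, dim=0, num_points=8, num_dims=3)
print(row)  # [0, 3, 6, 9]: strided by the number of dimensions
print(col)  # [0, 1, 2, 3]: consecutive addresses
```

In column-major order neighbouring work items touch neighbouring addresses, which is the access pattern that coalesced loads rely on.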
KMeansClustering Performance
• Varying number of clusters (centroids)
[Figure: GFLOPS (0 to 140) vs. number of data points (1,000 to 100,000,000)]
KMeansClustering Performance
• Varying number of dimensions
[Figure: x-axis: number of data points (1,000 to 100,000,000)]
KMeansClustering Performance
• Increasing number of iterations
[Figure: two panels over number of data points (1,000 to 100,000,000); GFLOPS (0 to 140); legend: 5 Iterations, 10 Iterations, 15 Iterations, 20 Iterations]
KMeans Clustering Overhead
[Figure: x-axis: number of data points (1,000 to 100,000,000); axis scales: 1 to 100,000 (logarithmic) and 0% to 150%; legend: Double Compute, Regular (Single Compute), Compute Only]
Multi-Dimensional Scaling
• Maps a data set in a high-dimensional space to a data set in a lower-dimensional space
• Uses an N×N dissimilarity matrix as the input
– Output usually in 3D (N×3) or 2D (N×2) space
• Flops per work item: 8DN + 7N + 3D + 1
– D: target dimension
– N: number of data points
• SMACOF MDS algorithm
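For orientation, one SMACOF iteration can be sketched as a Guttman transform. This is a textbook-style illustration that assumes unit weights, so that the update is X_new = B(X) X / N; it is not the authors' kernel and does not attempt to match the flop count above. All function names are hypothetical.

```python
import math

def distances(X):
    """Pairwise Euclidean distances of the current low-dimensional points."""
    n = len(X)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def stress(delta, X):
    """Sum of squared differences between dissimilarities and distances."""
    d = distances(X)
    n = len(X)
    return sum((delta[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def smacof_step(delta, X):
    """One Guttman transform with unit weights: X_new = B(X) X / N."""
    n, dims = len(X), len(X[0])
    d = distances(X)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and d[i][j] > 0.0:
                B[i][j] = -delta[i][j] / d[i][j]
        # Diagonal entry makes every row of B sum to zero.
        B[i][i] = -sum(B[i][j] for j in range(n) if j != i)
    return [[sum(B[i][j] * X[j][k] for j in range(n)) / n for k in range(dims)]
            for i in range(n)]

delta = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]  # N x N input
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]                     # N x 2 start
before = stress(delta, X)
for _ in range(20):
    X = smacof_step(delta, X)
after = stress(delta, X)
print(before, after)
```

The dissimilarity matrix `delta` is the large loop-invariant input, while the N×D coordinate matrix `X` is the small result that changes every iteration.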
MDS Optimizations
• Re-using loop-invariant data
[Figure: speedup of caching vs. number of data points (N), 0 to 25,000]
MDS Optimizations
• Naïve (with loop-invariant data reuse)
[Figure: performance (GFLOPS, 0 to 70) vs. number of data points (N), 0 to 25,000; legend: Naïve]
MDS Performance
• Increasing number of iterations
[Figure: two panels over number of data points (N), 0 to 25,000; GPU speedup (0 to 180); legend: 10 Iterations, 25 Iterations, 50 Iterations, 100 Iterations]
MDS Overhead
[Figure: x-axis: number of data points (N), 64 to 25,064; axis scales: 1 to 100,000 (logarithmic) and 0% to 60%; legend: Double Compute, Regular (Single Compute), Compute Only Time]
PageRank
• Analyzes linkage information to measure the relative importance of each page
• Sparse matrix-vector multiplication
• Web graph
– Very sparse
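The iteration behind PageRank can be sketched with adjacency lists, which is equivalent to repeatedly multiplying a sparse link matrix by the rank vector. The damping factor of 0.85 and the even redistribution of rank from pages without out-links are common conventions assumed here, not details taken from the slides.

```python
def pagerank(out_links, iterations, damping=0.85):
    """out_links[i] lists the pages that page i links to."""
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for src, targets in enumerate(out_links):
            if targets:
                # Each page splits its rank evenly among its out-links.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for dst in range(n):
                    new_rank[dst] += damping * rank[src] / n
        rank = new_rank
    return rank

graph = [[1, 2], [2], [0]]  # page 0 -> 1, 2; page 1 -> 2; page 2 -> 0
ranks = pagerank(graph, iterations=50)
print(ranks)
```

The link structure is loop-invariant; only the rank vector is updated and redistributed in each iteration.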
Sparse Matrix Representations
ELLPACK
Compressed Sparse Row (CSR)
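The two formats can be compared on a small matrix. The sketch below builds both representations from a dense list of rows and multiplies each by a vector; the helper names are hypothetical, and the ELLPACK variant pads short rows with zero entries pointing at column 0, which is one common convention.

```python
def to_csr(dense):
    """CSR: non-zero values, their column indices, and row start offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]

def to_ell(dense):
    """ELLPACK: every row padded to the width of the longest row."""
    width = max(sum(1 for v in row if v != 0) for row in dense)
    values, col_idx = [], []
    for row in dense:
        entries = [(v, j) for j, v in enumerate(row) if v != 0]
        entries += [(0.0, 0)] * (width - len(entries))  # zero padding
        values.append([v for v, _ in entries])
        col_idx.append([j for _, j in entries])
    return values, col_idx

def ell_spmv(values, col_idx, x):
    return [sum(v * x[j] for v, j in zip(vals, cols))
            for vals, cols in zip(values, col_idx)]

dense = [[1, 0, 2], [0, 0, 3], [4, 5, 6]]
x = [1, 2, 3]
csr = to_csr(dense)
ell = to_ell(dense)
print(csr_spmv(*csr, x))
print(ell_spmv(*ell, x))
```

CSR stores exactly the non-zeros but gives each row a different length, while ELLPACK trades padding for a regular, fixed-width layout; which one suits a very sparse web graph depends on how uneven the row lengths are.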
PageRank implementations
[Figure: time (ms, 0 to 1,800) vs. number of iterations (10 to 150); legend: CPU only]
Lessons
• Reusing loop-invariant data
• Leveraging local memory
• Optimizing data layout
OpenCL experience
• Flexible programming environment
• Support for work-group-level synchronization primitives
• Lack of debugging support
• Lack of dynamic memory allocation
• More of a compilation target than a user programming environment
Future Work
• Extending kernels to distributed environments
• Comparing with CUDA implementations
• Exploring more aggressive CPU/GPU sharing
• Studying more application kernels
Acknowledgements
• This work was started as a class project for CSCI-B649: Parallel Architectures (spring 2010) at the IU School of Informatics and Computing.
• Thilina was supported by National Institutes of Health grant 5 RC2 HG005806-02.
KMeansClustering Optimizations
• Data in global memory coalesced
[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); legend: Naïve (A), Data in Local Memory (B)]