
(1)

Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs

Thilina Gunarathne, Bimalee Salpitikorala, Arun Chauhan, Geoffrey Fox

{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu

(2)

Outline

• Motivation

• Overview

• Applications

– KMeans Clustering

– Multi-Dimensional Scaling

– PageRank

• Lessons learned

(3)

Iterative Statistical Applications

• Consist of iterative computation and communication steps

• Growing set of applications

– Clustering, data mining, machine learning, and dimension reduction applications

– Driven by the data deluge and emerging computation fields

[Figure: iteration cycle of Compute, Communication, and Reduce/barrier steps]

(4)

Iterative Statistical Applications

• Data intensive

– Larger loop-invariant data

– Smaller loop-variant delta between iterations

• Result of an iteration

– Broadcast to all the workers of the next iteration

• High ratio of memory accesses to floating-point operations

[Figure: iteration cycle of Compute, Communication, and Reduce/barrier steps]

(5)

Motivation

Important set of applications

Increasing power and availability of GPGPU computing

Cloud Computing

Iterative MapReduce technologies

GPGPU computing in clouds

(6)

Motivation

A sample bioinformatics pipeline

(7)

Overview

• Three iterative statistical kernels implemented using OpenCL

– KMeans Clustering

– Multi-Dimensional Scaling

– PageRank

• Optimized by

– Reusing loop-invariant data

– Utilizing different memory levels

– Rearranging data storage layouts
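
The first optimization in the list above, reusing loop-invariant data, can be illustrated with a small host-side sketch. It is not the authors' code: `FakeDevice` and `run_iterations` are hypothetical names, and the "device" only counts how many elements the host copies to it, assuming a KMeans-like workload where the data points never change and only the centroids do.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU: it only counts elements copied from the host."""

    def __init__(self):
        self.buffers = {}
        self.elements_copied = 0

    def upload(self, name, data):
        self.buffers[name] = list(data)
        self.elements_copied += len(data)


def run_iterations(points, centroids, iterations, reuse_invariant):
    """Return the number of elements copied to the device over all iterations."""
    device = FakeDevice()
    if reuse_invariant:
        # Loop-invariant data: copy once, before the loop.
        device.upload("points", points)
    for _ in range(iterations):
        if not reuse_invariant:
            # Without reuse, the same points are copied again every iteration.
            device.upload("points", points)
        # Loop-variant data: must be refreshed every iteration.
        device.upload("centroids", centroids)
        # A real implementation would launch the kernel here.
    return device.elements_copied


points = [0.0] * 1000
centroids = [0.0] * 10
copied_with_reuse = run_iterations(points, centroids, 20, reuse_invariant=True)
copied_without_reuse = run_iterations(points, centroids, 20, reuse_invariant=False)
```

With 1,000 point values, 10 centroid values, and 20 iterations, the reuse variant copies 1,200 elements in total, while the variant without reuse copies 20,200.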

(8)

OpenCL

• Cross platform, vendor neutral, open standard

– GPGPU, multi-core CPU, FPGA…

• Supports parallel programming in heterogeneous environments

• Compute kernels

– Based on C99

– Basic unit of executable code

• Work items

– Single element of the execution domain

– Grouped into work groups
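
The relationship between work items and work groups can be sketched for a one-dimensional execution domain. This is a plain Python emulation rather than OpenCL code, and `enumerate_work_items` is a hypothetical helper; it assumes a zero global offset and a global size that is a multiple of the work-group size, in which case the global index of a work item is its group index times the group size plus its local index.

```python
def enumerate_work_items(global_size, local_size):
    """List (global_id, group_id, local_id) for a 1-D execution domain."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    items = []
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # Same relation a kernel sees between its global, group, and local ids.
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return items


work_items = enumerate_work_items(global_size=8, local_size=4)
```

Here eight work items are split into two work groups of four, so work item 5 is local item 1 of group 1.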

(9)

OpenCL Memory Hierarchy

[Figure: OpenCL memory hierarchy. Each work item has its own private memory; the work items in a compute unit share that unit's local memory; all compute units share the global GPU memory and the constant memory.]

(10)

Environment

• NVIDIA Tesla C1060

– 240 scalar processors

– 4GB global memory

– 102 GB/sec peak memory bandwidth

– 16KB shared memory per 8 cores

– CUDA compute capability 1.3

– Peak performance: 933 GFLOPS single precision with SF, 622 GFLOPS single precision MAD
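
The two peak figures can be reproduced with a short calculation. The processor count and memory bandwidth come from the slide; the 1.296 GHz processor clock is an assumption that the slide does not state, as is the reading of "MAD" as 2 flops per cycle and "with SF" as one additional multiply per cycle.

```python
SCALAR_PROCESSORS = 240        # from the slide
PEAK_BANDWIDTH_GB_S = 102      # from the slide
PROCESSOR_CLOCK_GHZ = 1.296    # assumption: not stated on the slide


def peak_gflops(flops_per_cycle):
    """Peak single-precision GFLOPS for a given number of flops per processor per cycle."""
    return SCALAR_PROCESSORS * PROCESSOR_CLOCK_GHZ * flops_per_cycle


mad_only = peak_gflops(2)      # one multiply-add counts as 2 flops
mad_plus_sf = peak_gflops(3)   # assumed: MAD plus one extra multiply

# Single-precision values (4 bytes each) that can be streamed per second, in billions.
floats_per_second_g = PEAK_BANDWIDTH_GB_S / 4
# Flops needed per value loaded to keep the processors busy at the MAD peak.
flops_per_float_loaded = mad_only / floats_per_second_g
```

Under these assumptions the MAD peak is about 622 GFLOPS and the peak with SF about 933 GFLOPS. The last figure, roughly 24 flops per value loaded from global memory, shows why kernels with a high ratio of memory accesses to floating-point operations fall well short of the peak.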

(11)

KMeans Clustering

Partition a given data set into disjoint clusters

Each iteration

Cluster assignment step

Centroid update step
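
The two steps of an iteration can be sketched sequentially in plain Python. This is a minimal illustration of the standard algorithm, not the authors' OpenCL kernel; the function names are hypothetical, and leaving the centroid of an empty cluster unchanged is an assumption.

```python
def assign_clusters(points, centroids):
    """Cluster assignment step: index of the nearest centroid for each point."""
    labels = []
    for point in points:
        best_index, best_distance = 0, float("inf")
        for index, centroid in enumerate(centroids):
            # Squared Euclidean distance is enough to find the nearest centroid.
            distance = sum((a - b) ** 2 for a, b in zip(point, centroid))
            if distance < best_distance:
                best_index, best_distance = index, distance
        labels.append(best_index)
    return labels


def update_centroids(points, labels, centroids):
    """Centroid update step: mean of the points assigned to each cluster."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for point, label in zip(points, labels):
        counts[label] += 1
        for k in range(dims):
            sums[label][k] += point[k]
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, iterations):
    """Run a fixed number of iterations; return final centroids and labels."""
    for _ in range(iterations):
        labels = assign_clusters(points, centroids)
        centroids = update_centroids(points, labels, centroids)
    return centroids, assign_clusters(points, centroids)


final_centroids, final_labels = kmeans(
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)],
    [(0.0, 0.0), (10.0, 10.0)],
    iterations=5,
)
```

The points are loop-invariant across iterations, while the centroids are the small loop-variant result that every worker needs at the start of the next iteration.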

(12)
(13)

KMeans Clustering Optimizations

• Naïve (with data reuse)

[Figure: GFLOPS vs. number of data points (1,000 to 100,000,000)]

(14)

KMeans Clustering Optimizations

• Data points copied to local memory

[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); series: Naïve (A)]

(15)

KMeans Clustering Optimizations

• Cluster centroid points copied to local memory

[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); series: Naïve (A), Data in Local Memory (B)]

(16)

KMeans Clustering Optimizations

• Local memory data points in column-major order

[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); series: Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), C + Data Coalescing (D)]
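
The effect of storing data points in column-major order can be seen from the addresses that neighbouring work items touch. The sketch below is an illustration with hypothetical helper names, assuming one work item per data point and all work items reading the same dimension at the same time.

```python
def row_major_index(point, dim, num_points, num_dims):
    """Offset of one coordinate when each point's coordinates are stored together."""
    return point * num_dims + dim


def column_major_index(point, dim, num_points, num_dims):
    """Offset of one coordinate when each dimension's values are stored together."""
    return dim * num_points + point


def to_column_major(row_major_data, num_points, num_dims):
    """Rearrange a flat row-major array into column-major order."""
    result = [0.0] * (num_points * num_dims)
    for point in range(num_points):
        for dim in range(num_dims):
            source = row_major_index(point, dim, num_points, num_dims)
            target = column_major_index(point, dim, num_points, num_dims)
            result[target] = row_major_data[source]
    return result


def addresses_touched(index_fn, dim, num_points, num_dims):
    """Offsets read when every work item (one per point) loads dimension `dim`."""
    return [index_fn(p, dim, num_points, num_dims) for p in range(num_points)]


# Three 2-D points: (1, 2), (3, 4), (5, 6).
column_major_data = to_column_major([1, 2, 3, 4, 5, 6], num_points=3, num_dims=2)
strided = addresses_touched(row_major_index, 1, 3, 2)
contiguous = addresses_touched(column_major_index, 1, 3, 2)
```

In row-major order the three work items read offsets 1, 3, and 5 (stride equal to the number of dimensions); in column-major order they read offsets 3, 4, and 5. Consecutive offsets are the access pattern that memory coalescing favours.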

(17)

KMeans Clustering Performance

• Varying number of clusters (centroids)

[Figure: GFLOPS (0 to 140) vs. number of data points (1,000 to 100,000,000)]

(18)

KMeans Clustering Performance

• Varying number of dimensions

[Figure: x-axis: number of data points (1,000 to 100,000,000)]

(19)

KMeans Clustering Performance

• Increasing number of iterations

[Figure, two panels over number of data points (1,000 to 100,000,000); recoverable axis label: GFLOPS (0 to 140); series: 5, 10, 15, and 20 iterations]

(20)

KMeans Clustering Overhead

[Figure: overhead vs. number of data points (1,000 to 100,000,000); log-scale axis (1 to 100,000) and percentage axis (0% to 150%); series: Double Compute, Regular (Single Compute), Compute Only]

(21)

Multi-Dimensional Scaling

• Maps a data set in a high-dimensional space to a data set in a lower-dimensional space

• Uses an N×N dissimilarity matrix as the input

– Output usually in 3D (N×3) or 2D (N×2) space

• Flops per work item: 8DN + 7N + 3D + 1

– D: target dimension

– N: number of data points

• SMACOF MDS algorithm
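
The flop count above and one SMACOF update can be sketched in plain Python. The update shown is the textbook unweighted Guttman transform, computed one output point at a time as one work item might; it is not the authors' kernel and is not claimed to match their flop count operation for operation. All function names are hypothetical.

```python
import math


def flops_per_work_item(n, d):
    """Flop count from the slide: 8DN + 7N + 3D + 1."""
    return 8 * d * n + 7 * n + 3 * d + 1


def pairwise_distances(x):
    n = len(x)
    return [[math.dist(x[i], x[j]) for j in range(n)] for i in range(n)]


def stress(delta, x):
    """Sum of squared differences between target dissimilarities and mapped distances."""
    dist = pairwise_distances(x)
    n = len(x)
    return sum((delta[i][j] - dist[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


def smacof_step(delta, x):
    """One unweighted SMACOF update: X_new = (1/N) * B(X) * X."""
    n, d = len(x), len(x[0])
    dist = pairwise_distances(x)
    new_x = [[0.0] * d for _ in range(n)]
    for i in range(n):  # each output point is independent of the others
        b_ii = 0.0
        for j in range(n):
            if i == j or dist[i][j] == 0.0:
                continue
            b_ij = -delta[i][j] / dist[i][j]
            b_ii -= b_ij
            for k in range(d):
                new_x[i][k] += b_ij * x[j][k]
        for k in range(d):
            new_x[i][k] = (new_x[i][k] + b_ii * x[i][k]) / n
    return new_x


# Target dissimilarities: distances between the corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
delta = pairwise_distances(square)
start = [[0.1, 0.0], [0.9, 0.2], [1.2, 1.0], [-0.1, 0.8]]
stress_before = stress(delta, start)
stress_after = stress(delta, smacof_step(delta, start))
example_flops = flops_per_work_item(n=1000, d=3)
```

For N = 1,000 and D = 3 the formula gives 31,010 flops per work item. The dissimilarity matrix `delta` is the loop-invariant input; only the N×D coordinates change between iterations.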

(22)

MDS Optimizations

• Re-using loop-invariant data

[Figure: speedup from caching vs. number of data points N (0 to 25,000)]

(23)

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Figure: performance (GFLOPS) vs. number of data points N (0 to 25,000)]

(24)

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Figure: performance (GFLOPS, 0 to 70) vs. number of data points N (0 to 25,000); series: Naïve]

(25)

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Figure: performance (GFLOPS, 0 to 70) vs. number of data points N (0 to 25,000); series: Naïve]

(26)

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Figure: performance (GFLOPS, 0 to 70) vs. number of data points N (0 to 25,000); series: Naïve]

(27)

MDS Performance

• Increasing number of iterations

[Figure, two panels over number of data points N (0 to 25,000); recoverable axis label: GPU speedup (0 to 180); series: 10, 25, 50, and 100 iterations]

(28)

MDS Overhead

[Figure: overhead vs. number of data points N (64 to 25,064); log-scale axis (1 to 100,000) and percentage axis (0% to 60%); series: Double Compute, Regular (Single Compute), Compute Only Time]

(29)

PageRank

• Analyzes linkage information to measure the relative importance of pages

• Sparse matrix-vector multiplication

• Web graph

– Very sparse
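
The iteration can be sketched as a repeated sparse matrix-vector product over an adjacency list. This is the textbook power iteration rather than the authors' kernel: the damping factor of 0.85, the uniform redistribution of rank from pages without out-links, and the function name `pagerank` are assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration over an adjacency list; out_links[i] lists the pages i links to."""
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        dangling = 0.0
        for source, targets in enumerate(out_links):
            if not targets:
                # Pages without out-links: collect their rank for uniform redistribution.
                dangling += rank[source]
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                # Only non-zero matrix entries are visited (sparse product).
                new_rank[target] += share
        extra = damping * dangling / n
        rank = [r + extra for r in new_rank]
    return rank


cycle_ranks = pagerank([[1], [2], [0]])
ranks = pagerank([[1, 2], [2], [0]])
```

The link structure is loop-invariant, and only the rank vector changes between iterations. In the second graph, page 2 is linked from both other pages and ends up with the highest rank.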

(30)

Sparse Matrix Representations

ELLPACK

Compressed Sparse Row (CSR)
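
The two representations can be compared on a small matrix. The sketch below builds both from a dense matrix and multiplies each by a vector; the helper names are hypothetical, and padding ELLPACK rows with zero values and column index 0 is one common convention, assumed here.

```python
def to_csr(dense):
    """CSR: non-zero values, their column indices, and the start offset of each row."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def to_ellpack(dense):
    """ELLPACK: every row padded to the length of the longest row."""
    width = max(sum(1 for v in row if v != 0) for row in dense)
    values, col_idx = [], []
    for row in dense:
        nonzeros = [(v, j) for j, v in enumerate(row) if v != 0]
        pad = width - len(nonzeros)
        values.append([v for v, _ in nonzeros] + [0.0] * pad)
        col_idx.append([j for _, j in nonzeros] + [0] * pad)
    return values, col_idx


def ellpack_spmv(values, col_idx, x):
    return [
        sum(v * x[j] for v, j in zip(row_values, row_cols))
        for row_values, row_cols in zip(values, col_idx)
    ]


dense = [
    [1, 0, 2, 0],
    [0, 0, 3, 0],
    [4, 5, 0, 6],
    [0, 0, 0, 0],
]
x = [1, 2, 3, 4]
csr = to_csr(dense)
ell = to_ellpack(dense)
y_csr = csr_spmv(*csr, x)
y_ell = ellpack_spmv(*ell, x)
```

CSR stores exactly the six non-zeros, whereas ELLPACK stores 4 rows × 3 slots because the longest row has three non-zeros. The regular ELLPACK shape is simpler to index, at the cost of padding when row lengths vary widely.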

(31)

PageRank implementations

[Figure: time (ms, 0 to 1,800) vs. number of iterations (10 to 150); series: CPU only]

(32)

Lessons

Reusing loop-invariant data

Leveraging local memory

Optimizing data layout

(33)

OpenCL experience

Flexible programming environment

Support for work-group-level synchronization primitives

Lack of debugging support

Lack of dynamic memory allocation

More of a compilation target than a user programming environment

(34)

Future Work

Extending kernels to distributed environments

Comparing with CUDA implementations

Exploring more aggressive CPU/GPU sharing

Studying more application kernels

(35)

Acknowledgements

This work was started as a class project for CSCI-B649: Parallel Architectures (spring 2010) at the IU School of Informatics and Computing.

Thilina was supported by National Institutes of Health grant 5 RC2 HG005806-02.

(36)
(37)
(38)
(39)

KMeans Clustering Optimizations

• Data in global memory coalesced

[Figure: GFLOPS (0 to 120) vs. number of data points (1,000 to 100,000,000); series: Naïve (A), Data in Local Memory (B)]
