Java GPU Computing. Maarten Steur & Arjan Lamers


(1)

Java GPU Computing

(2)

OpenCL overview

Simple example

Case study

Tips & tricks

Questions

(3)
(4)

Abbreviations

CPU, GPU, APU

Khronos: OpenCL, OpenGL

Nvidia: CUDA

(5)

GPU compared with CPU

Many simple cores

Lots of high-bandwidth memory

Intel Core i7

GeForce GT 650M

(6)

Programming model

Define a stream (flow)

(7)

Usage

Algorithm:

High concurrency

Partitionable

But:

Extra latency from loading data onto and off the GPU

(8)
(9)
(10)

Example (MacBook Pro)

Platform name: Apple
Platform profile: FULL_PROFILE
Platform spec version: OpenCL 1.2
Platform vendor: Apple

Device 16925696 HD Graphics 4000
Driver:1.2(Aug 17 2014 20:29:07)
Max work group size:512
Global mem size: 1073741824
Local mem size: 65536
Max clock freq: 1200
Max compute units: 16

Device 16918272 GeForce GT 650M
Driver:8.26.28 310.40.55b01
Max work group size:1024
Global mem size: 1073741824
Local mem size: 49152
Max clock freq: 900
Max compute units: 2

Device 4294967295 Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Driver:1.1
Max work group size:1024
Global mem size: 17179869184
Local mem size: 32768
Max clock freq: 2600
Max compute units: 8
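
The host code in this deck picks a device with getMaxFlopsDevice(Type.GPU). As a rough illustration of what such a pick can look like, here is a minimal sketch that ranks the three devices listed above by compute units times clock frequency. The Device class and the scoring rule are assumptions made for this example, not a description of the library's documented behaviour.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Ranks devices by a crude throughput estimate: compute units x clock frequency. */
final class DeviceRanking {

    /** Hypothetical value type holding only the fields this sketch needs. */
    static final class Device {
        final String name;
        final boolean gpu;
        final int computeUnits;
        final int clockMhz;

        Device(String name, boolean gpu, int computeUnits, int clockMhz) {
            this.name = name;
            this.gpu = gpu;
            this.computeUnits = computeUnits;
            this.clockMhz = clockMhz;
        }

        /** Assumed heuristic: more units and a higher clock mean more throughput. */
        long score() {
            return (long) computeUnits * clockMhz;
        }
    }

    /** Returns the highest-scoring device, optionally considering GPUs only. */
    static Device best(List<Device> devices, boolean gpuOnly) {
        return devices.stream()
                .filter(d -> !gpuOnly || d.gpu)
                .max(Comparator.comparingLong(Device::score))
                .orElseThrow(() -> new IllegalArgumentException("no matching device"));
    }

    public static void main(String[] args) {
        // Values copied from the device listing above.
        List<Device> devices = Arrays.asList(
                new Device("HD Graphics 4000", true, 16, 1200),
                new Device("GeForce GT 650M", true, 2, 900),
                new Device("Intel(R) Core(TM) i7-3720QM", false, 8, 2600));

        System.out.println(best(devices, true).name);  // HD Graphics 4000
        System.out.println(best(devices, false).name); // Intel(R) Core(TM) i7-3720QM
    }
}
```

With the reported numbers this score places the integrated HD Graphics 4000 above the GeForce GT 650M. A "compute unit" means something different on each vendor's hardware, so an estimate like this is best checked against a real benchmark.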

(11)
(12)

Application / Kernel

Write .cl files in a C variant (OpenCL C)

Kernels are the 'public' functions

Java Bytecode

Aparapi (OpenCL)

(13)
(14)

Parallel sort

kernel void sort(global const float* in, global float* out, int size) {
    int i = get_global_id(0); // current thread
    float id = in[i];
    int pos = 0;
    for (int j = 0; j < size; j++) {
        float jd = in[j];
        // in[j] < in[i] ? Ties are broken by index so equal values get distinct positions.
        bool smaller = (jd < id) || (jd == id && j < i);
        pos += (smaller) ? 1 : 0;
    }
    out[pos] = id;
}
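
To check the kernel's output, a plain Java version of the same rank sort is useful. The sketch below mirrors the kernel line by line, with a parallel stream standing in for the work-items. The class name is made up for this example, and the input is assumed to contain no NaN values.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** CPU reference for the rank-sort kernel: each element counts how many elements precede it. */
final class RankSortReference {

    static float[] rankSort(float[] in) {
        float[] out = new float[in.length];
        // One "work-item" per element, like get_global_id(0) in the kernel.
        IntStream.range(0, in.length).parallel().forEach(i -> {
            float id = in[i];
            int pos = 0;
            for (int j = 0; j < in.length; j++) {
                float jd = in[j];
                // Same comparison as the kernel; ties are broken by index.
                if (jd < id || (jd == id && j < i)) {
                    pos++;
                }
            }
            // Every i yields a distinct pos, so no two work-items write the same slot.
            out[pos] = id;
        });
        return out;
    }

    public static void main(String[] args) {
        float[] sorted = rankSort(new float[] {3f, 1f, 2f, 1f});
        System.out.println(Arrays.toString(sorted)); // [1.0, 1.0, 2.0, 3.0]
    }
}
```

Each of the n work-items performs n comparisons, so the total work is quadratic. The approach only pays off because all those comparisons are independent and can run concurrently.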

(15)

Java GPU Computing

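// Context over all devices, used here only to find the fastest GPU
CLContext globalContext = CLContext.create();
CLDevice device = globalContext.getMaxFlopsDevice(Type.GPU);

// Context and command queue for the chosen device
CLContext context = CLContext.create(device);
CLCommandQueue queue = device.createCommandQueue();

// Load the kernel source from the classpath.
// The slide is cut off after "MyTask.cl"); the build() call is assumed here,
// because a program has to be built before kernels can be created from it.
CLProgram program = context.createProgram(
        First8GpuComputing.class.getResourceAsStream("MyTask.cl")).build();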

(16)


(17)


(18)

Java GPU Computing

// Device buffers for input and output
CLBuffer<FloatBuffer> inBuffer = context.createFloatBuffer(input.length, READ_ONLY);
CLBuffer<FloatBuffer> outBuffer = context.createFloatBuffer(input.length, WRITE_ONLY);

// Copy the workload into the host side of the input buffer
mapToBuffer(inBuffer.getBuffer(), workLoad);

CLKernel kernel = program.createCLKernel("MyTask");
kernel.putArgs(inBuffer, outBuffer).putArg(workLoad.length);

queue.putWriteBuffer(inBuffer, false)                              // non-blocking upload
     .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)   // run the kernel
     .putReadBuffer(outBuffer, true);                              // blocking download
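
The slides do not show how globalWorkSize and localWorkSize are chosen. In OpenCL 1.x the global size has to be a multiple of the local size whenever a local size is given, so a common approach is to round the element count up. The helper below is a sketch of that rounding; the class and method names are invented for this example.

```java
/** Helper for choosing a global work size that the local work size divides evenly. */
final class WorkSizes {

    /** Rounds elementCount up to the next multiple of localWorkSize. */
    static int roundUp(int localWorkSize, int elementCount) {
        if (localWorkSize <= 0 || elementCount < 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int remainder = elementCount % localWorkSize;
        if (remainder == 0) {
            return elementCount;
        }
        // addExact guards against int overflow for very large element counts.
        return Math.addExact(elementCount, localWorkSize - remainder);
    }

    public static void main(String[] args) {
        System.out.println(roundUp(256, 1000)); // 1024
        System.out.println(roundUp(256, 1024)); // 1024
    }
}
```

Rounding up creates work-items beyond the end of the data. A kernel launched this way needs a bounds guard, in the style of the `if (testId < testCount)` check in the test kernel later in this deck, so that the padding work-items do nothing.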

(19)
(20)

Real-world case

Calculation tool supporting the Programmatische Aanpak Stikstof (PAS), the Dutch programmatic approach to nitrogen.

(21)
(22)
(23)

Tips & tricks

Managing CL code

getResourceAsStream()?

Java constants → #define
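
One way to read "Java constants → #define" is to generate a block of #define lines from Java values and put it in front of the kernel source before compiling, so that the constants are defined in one place only. The sketch below assumes that reading; the class name, constant names and kernel text are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Prepends #define lines generated from Java constants to OpenCL kernel source. */
final class KernelSource {

    static String withDefines(Map<String, Object> constants, String kernelSource) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : constants.entrySet()) {
            // Integer and Double values print as literals that OpenCL C accepts as-is.
            sb.append("#define ").append(e.getKey()).append(' ')
              .append(e.getValue()).append('\n');
        }
        return sb.append(kernelSource).toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the defines in insertion order.
        Map<String, Object> constants = new LinkedHashMap<>();
        constants.put("MAX_SOURCES", 64);   // hypothetical constant
        constants.put("CELL_SIZE", 2.5);    // hypothetical constant

        String source = withDefines(constants, "kernel void myTask() {}\n");
        System.out.print(source);
    }
}
```

Every generated line shifts the line numbers that the OpenCL compiler reports, so an error message has to be offset by the number of defines to find the matching line in the .cl file.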

(24)

Tips & tricks

Unit testing

Separate test kernels

Test cases in batches

kernel void testDifficultCalculation(
        const int testCount,
        global const double* distance,
        global double* results) {
    const int testId = get_global_id(0);
    // Guard: work-items beyond the last test case do nothing
    if (testId < testCount) {
        results[testId] = difficultCalculation(distance[testId]);
    }
}
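
On the Java side, batching means sending all test inputs in one array, launching the test kernel once, and comparing the whole result array afterwards. The sketch below shows that comparison step. BatchRunner is a hypothetical stand-in for "upload, run the test kernel, download", and the example substitutes Math.sqrt for the kernel so that it runs without a GPU.

```java
import java.util.Arrays;

/** Compares a whole batch of kernel results against expected values in one pass. */
final class BatchKernelTest {

    /** Hypothetical hook: uploads inputs, runs the test kernel once, returns the results. */
    interface BatchRunner {
        double[] run(double[] inputs);
    }

    /** Returns the index of the first result outside the tolerance, or -1 if all match. */
    static int firstMismatch(double[] inputs, double[] expected,
                             BatchRunner runner, double tolerance) {
        double[] actual = runner.run(inputs);
        if (actual.length != expected.length) {
            throw new IllegalStateException("result count differs from expected count");
        }
        for (int i = 0; i < expected.length; i++) {
            // Written as a negated <= so that a NaN result counts as a mismatch.
            if (!(Math.abs(actual[i] - expected[i]) <= tolerance)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pure-Java stand-in for the kernel launch.
        BatchRunner sqrtRunner = in -> Arrays.stream(in).map(Math::sqrt).toArray();

        double[] inputs = {0.0, 1.0, 4.0, 9.0};
        double[] expected = {0.0, 1.0, 2.0, 3.0};
        System.out.println(firstMismatch(inputs, expected, sqrtRunner, 1e-12)); // -1
    }
}
```

A tolerance is used instead of exact equality because a device may round floating-point operations differently from the JVM.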

(25)

Direct memory management

-XX:MaxDirectMemorySize=??M

ByteBuffer.allocateDirect(int capacity)

Max 2GB per buffer

Garbage collection comes too late

Triggered by heap collection

Release manually
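
The 2GB limit follows from the int capacity parameter. A small sketch of an allocation helper that checks this limit up front is shown below; the class and method names are invented for this example. Manual release is left out because java.nio offers no public method for it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Allocates direct (off-heap) float buffers with an explicit size check. */
final class DirectBuffers {

    static FloatBuffer allocateFloats(long count) {
        // allocateDirect takes an int, so the byte size must fit in 31 bits.
        if (count < 0 || count > Integer.MAX_VALUE / Float.BYTES) {
            throw new IllegalArgumentException("buffer too large: " + count + " floats");
        }
        int bytes = (int) (count * Float.BYTES);
        return ByteBuffer.allocateDirect(bytes)
                .order(ByteOrder.nativeOrder()) // native code expects native byte order
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer buffer = allocateFloats(1024);
        System.out.println(buffer.isDirect());  // true
        System.out.println(buffer.capacity());  // 1024
    }
}
```

Direct buffers count against -XX:MaxDirectMemorySize rather than the heap. Since a small heap object can hold a large native allocation, the heap may stay quiet long after direct memory has run out.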

(26)

GPU vs CPU

GPUs check less than CPUs

Division by zero

Out-of-bounds checks
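
For comparison, the snippet below shows the checks a Java developer is used to: the JVM throws on integer division by zero and on out-of-bounds array access. OpenCL C makes neither promise. An integer division by zero yields an unspecified value, and an out-of-bounds access is undefined behaviour. The class name is made up for this example.

```java
/** Demonstrates the runtime checks the JVM performs and OpenCL C does not. */
final class JvmChecks {

    /** Integer division: the JVM throws instead of returning garbage. */
    static String divide(int a, int b) {
        try {
            return String.valueOf(a / b);
        } catch (ArithmeticException e) {
            return "ArithmeticException";
        }
    }

    /** Array read: the JVM checks the index on every access. */
    static String read(float[] data, int index) {
        try {
            return String.valueOf(data[index]);
        } catch (ArrayIndexOutOfBoundsException e) {
            return "ArrayIndexOutOfBoundsException";
        }
    }

    public static void main(String[] args) {
        System.out.println(divide(1, 0));              // ArithmeticException
        System.out.println(read(new float[] {1f}, 5)); // ArrayIndexOutOfBoundsException
    }
}
```

Inside a kernel these conditions have to be ruled out by the code itself, for example by validating inputs on the host or by guarding every index.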

(27)

Portability

OpenCL is portable; its performance is not

Memory sizes differ

Memory latencies differ

Work group sizes differ

Compute devices differ

(28)

Finally

Float vs Double

Double the precision

Half the performance
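
The precision side of this trade-off is easy to see in plain Java. A float has a 24-bit significand, so above 2^24 it can no longer represent every integer, while a double still can. The class name below is made up for this example.

```java
/** Shows where float runs out of precision while double does not. */
final class FloatVsDouble {

    /** True if adding 1 to the value changes it when stored as a float. */
    static boolean incrementVisibleInFloat(int base) {
        float f = base;
        return f + 1f != f;
    }

    /** True if adding 1 to the value changes it when stored as a double. */
    static boolean incrementVisibleInDouble(int base) {
        double d = base;
        return d + 1d != d;
    }

    public static void main(String[] args) {
        int base = 1 << 24; // 16777216, the limit of exact integers in a float
        System.out.println(incrementVisibleInFloat(base));  // false
        System.out.println(incrementVisibleInDouble(base)); // true
    }
}
```

Whether the lost precision matters depends on the calculation. Long sums and differences of nearly equal values are typical places where float results drift.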

(29)
(30)

Conclusion

When to use it?

When performance is really needed

When the problem has high concurrency

(31)

Questions?

Setting up OpenCL test on Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Warming up OpenCL test
[thread 32003 also had an error][thread 33027 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV[thread 32515 also had an error] (0xb)[thread 32771 also had an error]
[thread 32259 also had an error]
 at pc=0x00000001250ded70, pid=99851, tid=29475
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# [thread 17415 also had an error]
C [cl_kernels+0x1d70] sort_wrapper+0x1b0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
