Java GPU Computing
● Overview of OpenCL
● Simple example
● Case study
● Tips & tricks
● Questions
Abbreviations
● CPU, GPU, APU
● Khronos: OpenCL, OpenGL
● Nvidia: CUDA
GPU compared to CPU
● Many simple cores
● Lots of high-bandwidth memory
● Intel Core i7 (CPU) and GeForce GT 650M (GPU)
Programming model
● Define a stream (flow)
Usage
● Algorithm:
  – High concurrency
  – Partitionable
● But:
  – Extra latency from loading data onto and off the GPU
Example (MacBook Pro)
Platform name: Apple
Platform profile: FULL_PROFILE
Platform spec version: OpenCL 1.2
Platform vendor: Apple
Device 16925696 HD Graphics 4000
  Driver:1.2(Aug 17 2014 20:29:07)
  Max work group size:512
  Global mem size: 1073741824
  Local mem size: 65536
  Max clock freq: 1200
  Max compute units: 16
Device 16918272 GeForce GT 650M
  Driver:8.26.28 310.40.55b01
  Max work group size:1024
  Global mem size: 1073741824
  Local mem size: 49152
  Max clock freq: 900
  Max compute units: 2
Device 4294967295 Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  Driver:1.1
  Max work group size:1024
  Global mem size: 17179869184
  Local mem size: 32768
  Max clock freq: 2600
  Max compute units: 8
Application / Kernel
● Write .cl files in a C variant (OpenCL C)
● Kernels are the 'public' functions
● Java bytecode
  – Aparapi (OpenCL)
Parallel sort
kernel void sort(global const float* in, global float* out, int size) {
    int i = get_global_id(0);  // current thread
    float id = in[i];
    int pos = 0;
    for (int j = 0; j < size; j++) {
        float jd = in[j];
        // in[j] < in[i] ? Ties are broken by index so equal values get distinct positions.
        bool smaller = (jd < id) || (jd == id && j < i);
        pos += (smaller) ? 1 : 0;
    }
    out[pos] = id;
}
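To make explicit what a single work-item of the sort kernel computes, here is a minimal CPU-side sketch in plain Java. The class and method names (`RankSortReference`, `rankSort`) are hypothetical and not part of the talk; the sketch assumes the comparison is on the element values with the index as tie-breaker, and that the input contains no NaN values. Each index `i` stands in for one `get_global_id(0)`.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class RankSortReference {

    /** Returns a sorted copy of {@code in} by writing every element to its rank. */
    static float[] rankSort(float[] in) {
        float[] out = new float[in.length];
        // Each index i plays the role of one work-item.
        IntStream.range(0, in.length).parallel().forEach(i -> {
            float id = in[i];
            int pos = 0;
            for (int j = 0; j < in.length; j++) {
                float jd = in[j];
                // Count the elements that must come before in[i].
                // Equal values are ordered by index, so ranks never collide.
                boolean smaller = (jd < id) || (jd == id && j < i);
                if (smaller) {
                    pos++;
                }
            }
            out[pos] = id; // every i writes to a distinct slot
        });
        return out;
    }

    public static void main(String[] args) {
        float[] sorted = rankSort(new float[] {3f, 1f, 2f, 1f});
        System.out.println(Arrays.toString(sorted));
    }
}
```

The algorithm does O(n²) comparisons in total, but every element's rank is independent of the others, which is what makes it a fit for many simple cores. A reference like this can also serve as the expected result when testing the kernel output.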
Java GPU Computing
CLContext globalContext = CLContext.create();
CLDevice device = globalContext.getMaxFlopsDevice(Type.GPU);
CLContext context = CLContext.create(device);
CLCommandQueue queue = device.createCommandQueue();
// Load the kernel source from the classpath and compile it before creating kernels.
CLProgram program = context.createProgram(
        First8GpuComputing.class.getResourceAsStream("MyTask.cl")).build();
CLBuffer<FloatBuffer> inBuffer = context.createFloatBuffer(input.length, READ_ONLY);
CLBuffer<FloatBuffer> outBuffer = context.createFloatBuffer(input.length, WRITE_ONLY);
mapToBuffer(inBuffer.getBuffer(), workLoad);
CLKernel kernel = program.createCLKernel("MyTask");
kernel.putArgs(inBuffer, outBuffer).putArg(workLoad.length);
queue.putWriteBuffer(inBuffer, false)
     .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
     .putReadBuffer(outBuffer, true);
Case study
● Calculation tool supporting the Programmatische Aanpak Stikstof (the Dutch Programmatic Approach to Nitrogen).
Tips & tricks
● Managing CL sources
  – getResourceAsStream()?
  – Java constants → #define
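One way to read the "Java constants → #define" tip is to generate a block of `#define` lines from Java values and prepend it to the kernel source before it is compiled, so that the constants live in one place. The sketch below is an assumption about how that could look; `KernelSourceDefines`, `withDefines`, and the constant names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KernelSourceDefines {

    /** Prepends one "#define NAME VALUE" line per entry to the OpenCL C source. */
    static String withDefines(Map<String, ?> constants, String kernelSource) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, ?> e : constants.entrySet()) {
            sb.append("#define ").append(e.getKey())
              .append(' ').append(e.getValue()).append('\n');
        }
        return sb.append(kernelSource).toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the defines in insertion order.
        Map<String, Object> constants = new LinkedHashMap<>();
        constants.put("MAX_DISTANCE", 5000); // hypothetical constant
        constants.put("BLOCK_SIZE", 64);     // hypothetical constant
        System.out.println(withDefines(constants, "kernel void myTask() { }\n"));
    }
}
```

The resulting string can be handed to the program compiler instead of the raw resource stream. The OpenCL compiler also accepts `-D name=value` build options, which achieves the same effect without touching the source text.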
● Unit testing
  – Separate test kernels
  – Test cases in batches
kernel void testDifficultCalculation(const int testCount,
                                     global const double* distance,
                                     global double* results) {
    const int testId = get_global_id(0);
    // Skip work-items beyond the number of test cases in this batch.
    if (testId < testCount) {
        results[testId] = difficultCalculation(distance[testId]);
    }
}
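The `testId < testCount` guard in the test kernel matters because, in OpenCL 1.x, a global work size passed together with an explicit local work size has to be a multiple of that local size, so the host typically launches more work-items than there are test cases. A minimal host-side sketch of that rounding, with hypothetical names (`WorkSizes`, `roundUp`):

```java
final class WorkSizes {

    /** Smallest multiple of {@code localWorkSize} that is >= {@code itemCount}. */
    static int roundUp(int localWorkSize, int itemCount) {
        if (localWorkSize <= 0 || itemCount < 0) {
            throw new IllegalArgumentException("localWorkSize must be > 0 and itemCount >= 0");
        }
        int remainder = itemCount % localWorkSize;
        return remainder == 0 ? itemCount : itemCount + localWorkSize - remainder;
    }

    public static void main(String[] args) {
        int testCount = 100;     // number of test cases in the batch
        int localWorkSize = 64;  // assumed work group size
        int globalWorkSize = roundUp(localWorkSize, testCount);
        // The work-items in [testCount, globalWorkSize) are the ones the kernel guard skips.
        System.out.println(globalWorkSize);
    }
}
```

The rounded value would be used as the global work size when enqueueing the kernel, with `testCount` passed as the first kernel argument.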
Direct memory management
● -XX:MaxDirectMemorySize=??M
● ByteBuffer.allocateDirect(int capacity)
  – Max 2 GB per buffer
● Garbage collection comes too late
  – Triggered by heap collection
  – Release manually
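The 2 GB limit follows from `allocateDirect` taking an `int` capacity in bytes. The sketch below shows one way to allocate an off-heap float buffer in native byte order while checking that limit up front; `DirectBuffers` and `allocateFloats` are hypothetical names, and the sketch does not cover manual release.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class DirectBuffers {

    /** Allocates an off-heap FloatBuffer in native byte order for {@code floatCount} floats. */
    static FloatBuffer allocateFloats(int floatCount) {
        // Compute the size in bytes as a long so the overflow check itself cannot overflow.
        long bytes = (long) floatCount * Float.BYTES;
        if (floatCount < 0 || bytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "Cannot allocate " + bytes + " bytes: allocateDirect takes an int capacity");
        }
        return ByteBuffer.allocateDirect((int) bytes)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer buffer = allocateFloats(1024);
        System.out.println(buffer.isDirect() + " " + buffer.capacity());
    }
}
```

Direct buffers live outside the Java heap, so their total size is bounded by `-XX:MaxDirectMemorySize` rather than by `-Xmx`. The native memory is only returned once the owning buffer object is collected, which is why a nearly idle heap can delay the release.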
GPU vs CPU
● GPUs perform fewer checks than CPUs
  – Division by zero
  – Out-of-bounds checks
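For contrast, this is what the same two mistakes look like in plain Java, where the runtime checks them and throws. In an OpenCL C kernel there is no such safety net: integer division by zero yields an unspecified value and an out-of-bounds access is undefined behaviour. The class and method names below are hypothetical and only illustrate the Java side.

```java
final class JavaChecks {

    /** Returns true if integer division by zero is reported as an exception. */
    static boolean divisionByZeroIsChecked() {
        int zero = 0;
        try {
            int unused = 1 / zero;
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    /** Returns true if writing one element past the end of an array is reported. */
    static boolean outOfBoundsIsChecked() {
        float[] data = new float[4];
        try {
            data[4] = 1f; // valid indices are 0..3
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(divisionByZeroIsChecked() + " " + outOfBoundsIsChecked());
    }
}
```

Because the kernel side will not raise these errors, index and divisor validation has to happen in the kernel itself (for example with a bounds guard) or on the host before the data is uploaded.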
Portability
● OpenCL is portable; its performance is not
  – Memory sizes differ
  – Memory latencies differ
  – Work group sizes differ
  – Compute devices differ
Finally
● Float vs double
  – Double the precision
  – Half the performance
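The precision side of this trade-off can be seen without a GPU. A `float` has a 24-bit significand, so it cannot represent every integer above 2^24, while a `double` (53-bit significand) represents every `int` exactly. The sketch below uses hypothetical names (`FloatVsDouble`, `floatKeeps`, `doubleKeeps`); the performance ratio from the slide is not measured here and depends on the device.

```java
final class FloatVsDouble {

    /** True if {@code value} survives a round trip through float unchanged. */
    static boolean floatKeeps(int value) {
        return (int) (float) value == value;
    }

    /** True if {@code value} survives a round trip through double unchanged. */
    static boolean doubleKeeps(int value) {
        return (int) (double) value == value;
    }

    public static void main(String[] args) {
        int value = 16_777_217; // 2^24 + 1
        System.out.println("float keeps " + value + ": " + floatKeeps(value));
        System.out.println("double keeps " + value + ": " + doubleKeeps(value));
    }
}
```

Whether the lost digits matter depends on the calculation, so it is worth checking kernel results in both precisions before settling on the faster one.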
Conclusion
● When to use it?
  – When performance is really needed
  – When the problem has high concurrency
Questions?
Setting up OpenCL test on Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Warming up OpenCL test
[thread 32003 also had an error][thread 33027 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV[thread 32515 also had an error] (0xb)[thread 32771 also had an error]
[thread 32259 also had an error]
at pc=0x00000001250ded70, pid=99851, tid=29475
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# [thread 17415 also had an error]
C [cl_kernels+0x1d70] sort_wrapper+0x1b0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#