Java in High-Performance Computing
Dawid Weiss
Carrot Search
Institute of Computing Science, Poznan University of Technology
Learn from the mistakes of others. You can’t live long
enough to make them all yourself.
Talk outline
• What is “High performance”?
• What is “Java”?
• Measuring performance (benchmarking).
• HPPC library.
Crosscutting: (un?)common pitfalls and performance killers. Some HotSpot internals.
Divide-and-conquer style algorithm
for (Example e : examples) {
  if (e.hasQuiz()) e.showQuiz(); else e.showCode();
  e.explain();
  e.deriveConclusions();
}
— PART I —
High Performance
Computing
High-performance computing (HPC) uses
supercomputers and computer clusters to solve
advanced computation problems.
Is Java faster than C/C++?
The short answer is: it depends.
It’s usually hard to make a fast program run faster.
It’s easy to make a slow program run even slower.
It’s easy to make fast
For now, HPC:
• limited allowed computation time,
• constrained resources (hardware, memory).
— PART II —
What is Java?
Example 1
public void testSum1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
  result = sum;
}

public void testSum2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum2(i, i);
  result = sum;
}

where the body of sum1 and sum2 sums arguments and returns the result and COUNT is significantly large...
VM                 sum1    sum2
sun-1.6.0-20       0.04    2.62
sun-1.6.0-16       0.04    3.20
sun-1.5.0-18       0.04    3.29
ibm-1.6.2          0.08    6.28
jrockit-27.5.0     0.18    0.16
harmony-r917296    0.17    0.35
VM                 sum1    sum2    sum3    sum4
sun-1.6.0-20       0.04    2.62    1.05    3.76
sun-1.6.0-16       0.04    3.20    1.39    4.99
sun-1.5.0-18       0.04    3.29    1.46    5.20
ibm-1.6.2          0.08    6.28    0.16    14.64
jrockit-27.5.0     0.18    0.16    1.16    3.18
harmony-r917296    0.17    0.35    9.18    22.49
int sum1(int a, int b) {
  return a + b;
}

Integer sum2(Integer a, Integer b) {
  return a + b;
}
↓
Integer sum2(Integer a, Integer b) {
  return Integer.valueOf(a.intValue() + b.intValue());
}

int sum3(int... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++)
    sum += args[i];
  return sum;
}

Integer sum4(Integer... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++) {
    sum += args[i];
  }
  return sum;
}
↓
Integer sum4(Integer [] args) {
  // ...
Conclusions
• Syntactic sugar may be costly.
• Primitive types are fast.
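The desugaring behind sum2 and sum4 can be written out by hand. The sketch below is only an illustration: the class name SugarSketch and the helper callSum4Desugared are hypothetical, and the expanded forms follow what the Java language rules require of autoboxing and varargs rather than any particular compiler's output.

```java
// Hypothetical sketch: boxing and varargs spelled out by hand.
final class SugarSketch {
    // As written on the slide: relies on automatic (un)boxing.
    static Integer sum2(Integer a, Integer b) {
        return a + b;
    }

    // The same method with the boxing conversions made explicit.
    static Integer sum2Desugared(Integer a, Integer b) {
        return Integer.valueOf(a.intValue() + b.intValue());
    }

    // As written on the slide: varargs plus unboxing of each element.
    static Integer sum4(Integer... args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++) {
            sum += args[i]; // implicit args[i].intValue()
        }
        return sum; // implicit Integer.valueOf(sum)
    }

    // What a call such as sum4(1, 2, 3) amounts to at the call site:
    // an array allocation and one boxing conversion per argument.
    static Integer callSum4Desugared() {
        Integer[] args = new Integer[] {
            Integer.valueOf(1), Integer.valueOf(2), Integer.valueOf(3)
        };
        return sum4(args);
    }

    public static void main(String[] args) {
        System.out.println(sum2(20, 22));
        System.out.println(sum2Desugared(20, 22));
        System.out.println(sum4(1, 2, 3));
        System.out.println(callSum4Desugared());
    }
}
```

Every call to sum4 therefore pays for an array and, outside the small-value cache of Integer.valueOf, for one object per argument, which is consistent with the sum4 column being the slowest on most VMs in the table.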
Example 2
private static boolean ready;

public static void startThread() {
  new Thread() {
    public void run() {
      try {
        sleep(2000);
      } catch (Exception e) { /* ignore */ }
      System.out.println("Marking loop exit.");
      ready = true;
    }
  }.start();
}

public static void main(String[] args) {
  startThread();
  System.out.println("Entering the loop...");
  while (!ready) {
    // Do nothing.
  }
  System.out.println("Done, I left the loop!");
}
while (!ready) {
  // Do nothing.
}
≡ ?
boolean r = ready;
while (!r) {
  // Do nothing.
}
C1:
• fast
• not (much) optimization
C2:
• slow(er) than C1
There are hundreds of JVM tuning/diagnostic switches.
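A running JVM can report which switches it was started with and, on HotSpot-based JDKs, the current value of a named flag. The sketch below is an assumption-laden illustration: the class name JvmSwitchesSketch is hypothetical, the flag name MaxHeapSize is just one example, and the com.sun.management API it uses is vendor-specific, so the lookup is written to return null when it is unavailable.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical sketch: inspecting JVM switches from inside the program.
final class JvmSwitchesSketch {
    // Switches passed on the command line of this JVM (may be empty).
    static List<String> startupSwitches() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Current value of a named VM flag, or null if the flag is unknown
    // or the HotSpot diagnostic bean is not available on this VM.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            VMOption option = bean.getVMOption(name);
            return option.getValue();
        } catch (RuntimeException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("Startup switches: " + startupSwitches());
        System.out.println("MaxHeapSize = " + flagValue("MaxHeapSize"));
    }
}
```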
Conclusions
• Bytecode is far from what is executed.
• A lot going on under the (VM) hood.
• Bad code may work, but will eventually crash.
• HotSpot-level optimizations are good.
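The slides do not show a repaired version of the loop in Example 2. One conventional repair, sketched here as an assumption rather than as the talk's own code, is to declare the flag volatile, so the read inside the loop may not be hoisted into a local the way the "boolean r = ready" rewrite suggests. The class name VolatileFlagSketch and the method spinUntilReady are hypothetical.

```java
// Hypothetical sketch: the Example 2 loop with a volatile flag.
final class VolatileFlagSketch {
    // volatile: every loop iteration must observe the latest write.
    private static volatile boolean ready;

    // Spins until another thread sets the flag; returns the spin count.
    static long spinUntilReady(final long delayMillis) {
        ready = false;
        Thread setter = new Thread() {
            @Override
            public void run() {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) { /* ignore */ }
                ready = true;
            }
        };
        setter.start();

        long spins = 0;
        while (!ready) {
            spins++;
        }

        try {
            setter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return spins;
    }

    public static void main(String[] args) {
        System.out.println("Entering the loop...");
        long spins = spinUntilReady(200);
        System.out.println("Done, I left the loop after " + spins + " iterations.");
    }
}
```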
Any other diversifying factors?
J2ME
• more VM vendors,
• hardware diversity,
Non-JVM target platforms
• Dalvik
• GWT
• IKVM
Conclusions
• There is no “single” Java performance model.
• Performance depends on the VM, environment, class library, hardware.
Example 3
public void testSum1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
  result = sum;
}

public void testSum1_2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
}
VM                 sum1    sum1_2
sun-1.6.0-20       0.04    0.00
sun-1.6.0-16       0.04    0.00
sun-1.5.0-18       0.04    0.00
ibm-1.6.2          0.08    0.01
jrockit-27.5.0     0.17    0.08
harmony-r917296    0.17    0.11
java -server -XX:+PrintOptoAssembly -XX:+PrintCompilation ...
- method holder: 'com/dawidweiss/geecon2010/Example03'
- access: 0xc1000001 public
- name: 'testSum1_2'
...
010   pushq rbp
      subq  rsp, #16    # Create frame
      nop               # nop for patch_verified_entry
016   addq  rsp, 16     # Destroy frame
      popq  rbp
      testl rax, [rip + #offset_to_poll_page]    # Safepoint: poll for GC
021   ret
Conclusions
• Benchmarks must be executed to provide feedback.
• HotSpot is smart and effective at removing
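Example 3 differs from Example 1 only in whether the loop's result is stored. A minimal sketch of that contrast follows; the class name DeadCodeGuardSketch and the count parameter are hypothetical, and whether a given VM actually removes the unobserved loop depends on the VM, as the table above shows.

```java
// Hypothetical sketch: keeping a benchmark loop observable.
final class DeadCodeGuardSketch {
    // Written by the benchmark so the computation has a visible effect.
    static volatile int result;

    static int sum1(int a, int b) {
        return a + b;
    }

    // The sum escapes into a field, so it has to be computed.
    static void testSum1(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += sum1(i, i);
        }
        result = sum;
    }

    // The sum is never read: an optimizing compiler is free to drop
    // the whole loop, leaving a method that does nothing measurable.
    static void testSum1_2(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += sum1(i, i);
        }
    }

    public static void main(String[] args) {
        testSum1(1000);
        testSum1_2(1000);
        System.out.println("result = " + result);
    }
}
```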
Example 4
@Test
public void testAdd1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++) {
    sum += add1(i);
  }
  guard = sum;
}

public int add1(int i) {
  return i + 1;
}
switch                              testAdd1
-XX:+Inlining -XX:+PrintInlining    0.04
-XX:-Inlining                       ?

switch                              testAdd1
-XX:+Inlining -XX:+PrintInlining    0.04
-XX:-Inlining                       0.45
Most Java calls are monomorphic.
HotSpot adjusts to megamorphic calls
Example 5
abstract class Superclass { abstract int call(); }
class Sub1 extends Superclass { int call() { return 1; } }
class Sub2 extends Superclass { int call() { return 2; } }
class Sub3 extends Superclass { int call() { return 3; } }

Superclass[] mixed = initWithRandomInstances(10000);
Superclass[] solid = initWithSub1Instances(10000);

@Test
public void testMonomorphic() {
  int sum = 0;
  int m = solid.length;
  for (int i = 0; i < COUNT; i++)
    sum += solid[i % m].call();
  guard = sum;
}

@Test
public void testMegamorphic() {
  int sum = 0;
  int m = mixed.length;
  for (int i = 0; i < COUNT; i++)
    sum += mixed[i % m].call();
  guard = sum;
}
VM                 monomorphic    megamorphic
sun-1.6.0-20       0.19           0.32
sun-1.6.0-16       0.19           0.34
sun-1.5.0-18       0.18           0.34
ibm-1.6.2          0.20           0.30
jrockit-27.5.0     0.22           0.29
harmony-r917296    0.27           0.32
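Example 5 leaves initWithRandomInstances and initWithSub1Instances undefined. The sketch below fills them in under stated assumptions (a fixed random seed, uniform choice among the three subclasses) so the two loops can be run; the wrapper class CallSiteSketch and the count parameter are hypothetical. The two loops are kept in separate methods on purpose: each then has its own call site, and the type profile of one does not pollute the other.

```java
import java.util.Random;

// Hypothetical, runnable variant of Example 5.
final class CallSiteSketch {
    abstract static class Superclass { abstract int call(); }
    static final class Sub1 extends Superclass { int call() { return 1; } }
    static final class Sub2 extends Superclass { int call() { return 2; } }
    static final class Sub3 extends Superclass { int call() { return 3; } }

    // Assumed helper: every element has the same receiver type.
    static Superclass[] initWithSub1Instances(int n) {
        Superclass[] result = new Superclass[n];
        for (int i = 0; i < n; i++) result[i] = new Sub1();
        return result;
    }

    // Assumed helper: receiver types are mixed at random (fixed seed).
    static Superclass[] initWithRandomInstances(int n) {
        Random rnd = new Random(42);
        Superclass[] result = new Superclass[n];
        for (int i = 0; i < n; i++) {
            switch (rnd.nextInt(3)) {
                case 0:  result[i] = new Sub1(); break;
                case 1:  result[i] = new Sub2(); break;
                default: result[i] = new Sub3(); break;
            }
        }
        return result;
    }

    // Call site that only ever sees Sub1.
    static int sumMonomorphic(Superclass[] solid, int count) {
        int sum = 0;
        int m = solid.length;
        for (int i = 0; i < count; i++) sum += solid[i % m].call();
        return sum;
    }

    // Call site that sees all three subclasses.
    static int sumMegamorphic(Superclass[] mixed, int count) {
        int sum = 0;
        int m = mixed.length;
        for (int i = 0; i < count; i++) sum += mixed[i % m].call();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumMonomorphic(initWithSub1Instances(10000), 1000000));
        System.out.println(sumMegamorphic(initWithRandomInstances(10000), 1000000));
    }
}
```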
Example 6
@Test
public void testBitCount1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += Integer.bitCount(i);
  guard = sum;
}

@Test
public void testBitCount2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += bitCount(i);
  guard = sum;
}

/* Copied from
 * {@link Integer#bitCount} */
static int bitCount(int i) {
  // HD, Figure 5-2
  i = i - ((i >>> 1) & 0x55555555);
  i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
  i = (i + (i >>> 4)) & 0x0f0f0f0f;
  i = i + (i >>> 8);
  i = i + (i >>> 16);
  return i & 0x3f;
}
VM               testBitCount1    testBitCount2
sun-1.6.0-20     0.43             0.43
sun-1.7.0-b80    0.43             0.43
(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

VM               testBitCount1    testBitCount2
sun-1.6.0-20     0.08             0.33
sun-1.7.0-b83    0.07             0.32
... -XX:+PrintInlining ...
...
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Example06.testBitCount1: [measured 10 out of 15 rounds]
round: 0.07 [+- 0.00], round.gc: 0.00 [+- 0.00]
...
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
Example06.testBitCount2: [measured 10 out of 15 rounds]
... -XX:+PrintOptoAssembly ...
{method}
- klass: {other class}
- method holder: com/dawidweiss/geecon2010/Example06
- name: testBitCount1
...
0c2   B13: # B12 B14 <- B8 B12 Loop: B13-B12 inner stride: ...
0c2   movl R10, RDX # spill
...
0e1   movl [rsp + #40], R11 # spill
0e6   popcnt R8, R8
...
0f5   addl R9, #7 # int
0f9   popcnt R11, R11
0fe   popcnt RCX, R9
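Both benchmarked methods in Example 6 compute the same value; only the machine code behind them can differ, since Integer.bitCount may be replaced by an intrinsic while the copied method is compiled as ordinary Java. A quick equivalence check, sketched under the assumption that the copy is exactly the one shown on the slide (the class name BitCountSketch and the helper agreesWithJdk are hypothetical):

```java
// Hypothetical sketch: the copied bitCount agrees with Integer.bitCount.
final class BitCountSketch {
    // Same bit-twiddling routine as on the slide (HD, Figure 5-2).
    static int bitCount(int i) {
        i = i - ((i >>> 1) & 0x55555555);
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
        i = (i + (i >>> 4)) & 0x0f0f0f0f;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        return i & 0x3f;
    }

    // Compares both implementations over the range [from, to).
    static boolean agreesWithJdk(int from, int to) {
        for (int i = from; i < to; i++) {
            if (bitCount(i) != Integer.bitCount(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreesWithJdk(-1000000, 1000000));
    }
}
```

Any timing gap between the two is therefore a property of the VM and the CPU, not of the algorithm being measured.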
Conclusions
• Benchmarks must be statistically sound.
  → averages, variance, min, max, warm-up phase
• Account for HotSpot optimisations.
• Account for hardware differences.
  → test-on-target
• Use domain data and real scenarios.
• Inspect suspicious output with debug JVM.
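The first conclusion above (warm-up rounds, then averages, variance, min and max over measured rounds) can be sketched in a few lines. This is a deliberately naive illustration, not the harness used for the numbers in this talk; the class name BenchmarkSketch and all method names are hypothetical.

```java
// Hypothetical sketch of a warm-up + measured-rounds benchmark loop.
final class BenchmarkSketch {
    static volatile int guard; // keeps the sample task's result observable

    // Runs the task warmupRounds times unmeasured, then returns the
    // wall-clock time in seconds of each of the measuredRounds runs.
    static double[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run();
        }
        double[] seconds = new double[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    // Population standard deviation of the samples.
    static double stdDev(double[] v) {
        double m = mean(v);
        double sq = 0;
        for (double x : v) sq += (x - m) * (x - m);
        return Math.sqrt(sq / v.length);
    }

    static double min(double[] v) {
        double r = v[0];
        for (double x : v) r = Math.min(r, x);
        return r;
    }

    static double max(double[] v) {
        double r = v[0];
        for (double x : v) r = Math.max(r, x);
        return r;
    }

    public static void main(String[] args) {
        double[] t = measure(new Runnable() {
            public void run() {
                int sum = 0;
                for (int i = 0; i < 10000000; i++) sum += Integer.bitCount(i);
                guard = sum;
            }
        }, 5, 10);
        System.out.println("mean=" + mean(t) + " stddev=" + stdDev(t)
            + " min=" + min(t) + " max=" + max(t));
    }
}
```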
HPPC
Motivation
• Primitive types: fast and memory-friendly.
• Optional assertions.
• Single-threaded. No fail-fast.
• Fast, fast, fast iterators, with no GC overhead.
• Open internals (explicit implementation).
Why not JCF?
public interface List<E> extends Collection<E> {
  boolean contains(Object o);   // [-] contract-enforced methods
  Iterator<E> iterator();       // [-] iterators over primitive types?
  Object[] toArray();           // [-] troublesome covariants
  // ...
}
Friendly Competition
• fastutil
• PCJ
• GNU Trove
• Apache Mahout (ported COLT)
• Apache Primitive Collections
All of these have pros and cons and deal with JCF compatibility somehow.
Iterators in fastutil or PCJ
interface IntIterator extends Iterator<Integer> {
  // Primitive-specific method
  int nextInt();
}
Iterators in HPPC
public final class IntCursor {
  public int index;
  public int value;
}

public class IntArrayList implements Iterable<IntCursor> {
  public Iterator<IntCursor> iterator() { ... }
  // ...
}
Iterating over list elements in HPPC
for (IntCursor c : list) {
  System.out.println(c.index + ": " + c.value);
}
...or
list.forEach(new IntProcedure() {
  public void apply(int value) {
    System.out.println(value);
  }
});
...or
final int[] buffer = list.buffer;
final int size = list.size();
for (int i = 0; i < size; i++) {
  System.out.println(i + ": " + buffer[i]);
}
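The cursor idea behind the first loop variant can be sketched independently of HPPC. The code below is not HPPC's implementation; CursorSketch, SketchCursor and SketchList are hypothetical names, and the point is only that the iterator hands back one reused, mutable cursor instead of boxing every element.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of cursor-based iteration over primitive ints.
final class CursorSketch {
    static final class SketchCursor {
        int index;
        int value;
    }

    static final class SketchList implements Iterable<SketchCursor> {
        int[] buffer = new int[4]; // open internals: directly accessible
        int size;

        void add(int value) {
            if (size == buffer.length) {
                buffer = Arrays.copyOf(buffer, size * 2);
            }
            buffer[size++] = value;
        }

        @Override
        public Iterator<SketchCursor> iterator() {
            return new Iterator<SketchCursor>() {
                // One cursor per iteration, refilled on each step.
                private final SketchCursor cursor = new SketchCursor();
                private int position;

                public boolean hasNext() {
                    return position < size;
                }

                public SketchCursor next() {
                    if (position >= size) throw new NoSuchElementException();
                    cursor.index = position;
                    cursor.value = buffer[position];
                    position++;
                    return cursor;
                }

                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    }

    public static void main(String[] args) {
        SketchList list = new SketchList();
        for (int i = 0; i < 5; i++) list.add(i * i);
        for (SketchCursor c : list) {
            System.out.println(c.index + ": " + c.value);
        }
    }
}
```

The trade-off is that a cursor is only valid until the next step of the loop; code that needs to keep a value has to copy it out.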
Open implementation is good.
/**
 * Applies a supplemental hash function to a given
 * hashCode, which defends against poor quality
 * hash functions. [...]
 */
static int hash(int h) {
  // This function ensures that hashCodes that differ only by
  // constant multiples at each bit position have a bounded
  // number of collisions (approximately 8 at default load factor).
  h ^= (h >>> 20) ^ (h >>> 12);
  return h ^ (h >>> 7) ^ (h >>> 4);
}
HPPC approach (example):
public class LongIntOpenHashMap implements LongIntMap {
  // ...
  public LongIntOpenHashMap(int initialCapacity, float loadFactor,
      LongHashFunction keyHashFunction, IntHashFunction valueHashFunction) {
    // ...
  }
}
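To make the idea of a caller-supplied hash function concrete, here is a small open-addressing map with linear probing. It is a sketch under stated assumptions, not HPPC code: OpenHashSketch, IntHash and IntIntMap are hypothetical names, the capacity must be a power of two, the load factor is fixed at 0.75, and putOrAdd is modeled on the call of the same name used in Example 7.

```java
// Hypothetical sketch: int-to-int open-addressing map, pluggable hash.
final class OpenHashSketch {
    interface IntHash {
        int hash(int key);
    }

    static final class IntIntMap {
        int[] keys;
        int[] values;
        boolean[] allocated;
        int assigned;
        private final IntHash hash;

        IntIntMap(int capacity, IntHash hash) {
            if (capacity < 2 || Integer.bitCount(capacity) != 1) {
                throw new IllegalArgumentException("capacity must be a power of two >= 2");
            }
            this.keys = new int[capacity];
            this.values = new int[capacity];
            this.allocated = new boolean[capacity];
            this.hash = hash;
        }

        // Stores putValue for a new key, otherwise adds addValue to the
        // existing value; returns the value now associated with the key.
        int putOrAdd(int key, int putValue, int addValue) {
            if ((assigned + 1) * 4 > keys.length * 3) grow();
            int slot = slotFor(key);
            if (allocated[slot]) {
                values[slot] += addValue;
            } else {
                allocated[slot] = true;
                keys[slot] = key;
                values[slot] = putValue;
                assigned++;
            }
            return values[slot];
        }

        int getOrDefault(int key, int defaultValue) {
            int slot = slotFor(key);
            return allocated[slot] ? values[slot] : defaultValue;
        }

        // Linear probing: the key's slot, or the first free slot after it.
        private int slotFor(int key) {
            int mask = keys.length - 1;
            int slot = hash.hash(key) & mask;
            while (allocated[slot] && keys[slot] != key) {
                slot = (slot + 1) & mask;
            }
            return slot;
        }

        // Doubles the arrays and reinserts every stored entry.
        private void grow() {
            int[] oldKeys = keys;
            int[] oldValues = values;
            boolean[] oldAllocated = allocated;
            keys = new int[oldKeys.length * 2];
            values = new int[oldKeys.length * 2];
            allocated = new boolean[oldKeys.length * 2];
            for (int i = 0; i < oldKeys.length; i++) {
                if (oldAllocated[i]) {
                    int slot = slotFor(oldKeys[i]);
                    allocated[slot] = true;
                    keys[slot] = oldKeys[i];
                    values[slot] = oldValues[i];
                }
            }
        }
    }

    public static void main(String[] args) {
        IntIntMap counts = new IntIntMap(8, new IntHash() {
            public int hash(int h) { // supplemental hash shown earlier
                h ^= (h >>> 20) ^ (h >>> 12);
                return h ^ (h >>> 7) ^ (h >>> 4);
            }
        });
        counts.putOrAdd(42, 1, 1);
        counts.putOrAdd(42, 1, 1);
        System.out.println(counts.getOrDefault(42, 0));
    }
}
```

Because the hash function is a constructor argument, a caller who knows the key distribution can swap in something cheaper (or stronger) without touching the map.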
Example 7
• HPPC:
final char [] CHARS = DATA;
final IntIntOpenHashMap counts = new IntIntOpenHashMap();
for (int i = 0; i < CHARS.length - 1; i++) {
  counts.putOrAdd((CHARS[i] << 16 | CHARS[i + 1]), 1, 1);
}
• JCF, boxed integer types.
final Integer currentCount = map.get(bigram);
map.put(bigram, currentCount == null ? 1 : currentCount + 1);
• JCF, with IntHolder (mutable value object).
• GNU Trove
map.adjustOrPutValue(bigram, 1, 1);
• fastutil, OpenHashMap and LinkedOpenHashMap
map.put(bigram, map.get(bigram) + 1);
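The two JCF variants listed above can be written out in full. The slide shows no code for the IntHolder variant, so the holder class below is an assumption about what such a mutable value object would look like; BigramCountSketch and its method names are likewise hypothetical. The bigram key packs two chars into one int, as in the HPPC snippet.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: bigram counting with plain java.util collections.
final class BigramCountSketch {
    // Assumed shape of a mutable value object.
    static final class IntHolder {
        int value;
        IntHolder(int value) { this.value = value; }
    }

    // Packs two adjacent chars into a single int key.
    static int bigramKey(char first, char second) {
        return (first << 16) | second;
    }

    // Boxed variant: each update unboxes, increments, and boxes again.
    static Map<Integer, Integer> countBoxed(char[] chars) {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        for (int i = 0; i < chars.length - 1; i++) {
            Integer bigram = bigramKey(chars[i], chars[i + 1]);
            Integer currentCount = map.get(bigram);
            map.put(bigram, currentCount == null ? 1 : currentCount + 1);
        }
        return map;
    }

    // Holder variant: one lookup per existing bigram, increment in place.
    static Map<Integer, IntHolder> countWithHolder(char[] chars) {
        Map<Integer, IntHolder> map = new HashMap<Integer, IntHolder>();
        for (int i = 0; i < chars.length - 1; i++) {
            Integer bigram = bigramKey(chars[i], chars[i + 1]);
            IntHolder holder = map.get(bigram);
            if (holder == null) {
                map.put(bigram, new IntHolder(1));
            } else {
                holder.value++;
            }
        }
        return map;
    }

    public static void main(String[] args) {
        char[] data = "abracadabra".toCharArray();
        System.out.println(countBoxed(data).get(bigramKey('a', 'b')));
        System.out.println(countWithHolder(data).get(bigramKey('a', 'b')).value);
    }
}
```

Both variants still box the key on every lookup; only a primitive-keyed map removes that cost as well.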