Java in High-Performance Computing

Dawid Weiss

Carrot Search

Institute of Computing Science, Poznan University of Technology

Learn from the mistakes of others. You can’t live long enough to make them all yourself.

Talk outline

What is “High performance”?

What is “Java”?

Measuring performance (benchmarking).

HPPC library.

Crosscutting: (un?)common pitfalls and performance killers. Some HotSpot internals.


Divide-and-conquer style algorithm

    for (Example e : examples) {
      if (e.hasQuiz()) e.showQuiz(); else e.showCode();
      e.explain();
      e.deriveConclusions();
    }

— PART I —

High-Performance Computing

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.

Is Java faster than C/C++?

The short answer is: it depends.

It’s usually hard to make a fast program run faster.

It’s easy to make a slow program run even slower.

It’s easy to make fast [...]


For now, HPC:

    limited allowed computation time,
    constrained resources (hardware, memory).


— PART II —

What is Java?

Example 1

    public void testSum1() {
      int sum = 0;
      for (int i = 0; i < COUNT; i++)
        sum += sum1(i, i);
      result = sum;
    }

    public void testSum2() {
      int sum = 0;
      for (int i = 0; i < COUNT; i++)
        sum += sum2(i, i);
      result = sum;
    }

where the bodies of sum1 and sum2 sum their arguments and return the result, and COUNT is significantly large...


    VM                 sum1    sum2
    sun-1.6.0-20       0.04    2.62
    sun-1.6.0-16       0.04    3.20
    sun-1.5.0-18       0.04    3.29
    ibm-1.6.2          0.08    6.28
    jrockit-27.5.0     0.18    0.16
    harmony-r917296    0.17    0.35


    VM                 sum1    sum2    sum3    sum4
    sun-1.6.0-20       0.04    2.62    1.05     3.76
    sun-1.6.0-16       0.04    3.20    1.39     4.99
    sun-1.5.0-18       0.04    3.29    1.46     5.20
    ibm-1.6.2          0.08    6.28    0.16    14.64
    jrockit-27.5.0     0.18    0.16    1.16     3.18
    harmony-r917296    0.17    0.35    9.18    22.49

    int sum1(int a, int b) {
      return a + b;
    }

    Integer sum2(Integer a, Integer b) {
      return a + b;
    }

    // The same sum2 with the autoboxing written out:
    Integer sum2(Integer a, Integer b) {
      return Integer.valueOf(
          a.intValue() + b.intValue());
    }

    int sum3(int... args) {
      int sum = 0;
      for (int i = 0; i < args.length; i++)
        sum += args[i];
      return sum;
    }

    Integer sum4(Integer... args) {
      int sum = 0;
      for (int i = 0; i < args.length; i++) {
        sum += args[i];
      }
      return sum;
    }

    Integer sum4(Integer [] args) {
      // ...
    }

Conclusions

Syntactic sugar may be costly.

Primitive types are fast.
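The variants compared above can be placed side by side in a small sketch. The class name `BoxingSketch` and the method name `sum2Desugared` are invented for illustration; the point is only that the boxed and varargs forms produce the same results as the primitive form while asking for extra work (boxing calls, an array allocation per varargs call) that the VM may or may not manage to optimize away.

```java
final class BoxingSketch {
    // Primitive version: no objects involved.
    static int sum1(int a, int b) {
        return a + b;
    }

    // Boxed version as it appears in source code.
    static Integer sum2(Integer a, Integer b) {
        return a + b;
    }

    // The same method with unboxing and boxing written out by hand.
    static Integer sum2Desugared(Integer a, Integer b) {
        return Integer.valueOf(a.intValue() + b.intValue());
    }

    // Varargs: javac wraps the arguments in a new int[] at every call site.
    static int sum3(int... args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++) sum += args[i];
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sum1(2, 3));
        System.out.println(sum2(2, 3));               // arguments are boxed here
        System.out.println(sum2Desugared(2, 3));
        System.out.println(sum3(2, 3));               // same as sum3(new int[] {2, 3})
    }
}
```

All four calls in `main` print the same sum; only the amount of hidden work differs.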


Example 2


    private static boolean ready;

    public static void startThread() {
      new Thread() {
        public void run() {
          try {
            sleep(2000);
          } catch (Exception e) {
            /* ignore */
          }
          System.out.println("Marking loop exit.");
          ready = true;
        }
      }.start();
    }

    public static void main(String[] args) {
      startThread();
      System.out.println("Entering the loop...");
      while (!ready) {
        // Do nothing.
      }
      System.out.println("Done, I left the loop!");
    }

    while (!ready) {
      // Do nothing.
    }

?

    boolean r = ready;
    while (!r) {
      // Do nothing.
    }
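One common way to rule out the hoisted read is to declare the flag `volatile`, which obliges every loop iteration to observe writes made by other threads. The sketch below is an assumption about how the example could be repaired, not the code shown in the talk; the class name `VolatileFlagSketch`, the `waitForFlag` helper, and the shortened delay are illustrative.

```java
final class VolatileFlagSketch {
    // volatile: the read in the loop may not be cached in a register or hoisted.
    private static volatile boolean ready;

    static void startThread(final long delayMillis) {
        Thread t = new Thread() {
            @Override
            public void run() {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("Marking loop exit.");
                ready = true;
            }
        };
        t.setDaemon(true);
        t.start();
    }

    // Spins until the other thread sets the flag, then returns its value.
    static boolean waitForFlag(long delayMillis) {
        ready = false;
        startThread(delayMillis);
        while (!ready) {
            // Busy-wait; each iteration re-reads the volatile field.
        }
        return ready;
    }

    public static void main(String[] args) {
        System.out.println("Entering the loop...");
        waitForFlag(200);
        System.out.println("Done, I left the loop!");
    }
}
```

Busy-waiting is kept only to mirror the original example; in production code a `CountDownLatch` or similar would usually be a better fit.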


C1:

    fast
    not (much) optimization

C2:

    slow(er) than C1

There are hundreds of JVM tuning/diagnostic switches.

Conclusions

Bytecode is far from what is executed.

A lot is going on under the (VM) hood.

Bad code may work, but will eventually crash.

HotSpot-level optimizations are good.


Any other diversifying factors?

J2ME

more VM vendors,

hardware diversity,


Non-JVM target platforms

Dalvik

GWT

IKVM


Conclusions

There is no “single” Java performance model.

Performance depends on the VM, environment, class library, hardware.

Example 3

    public void testSum1() {
      int sum = 0;
      for (int i = 0; i < COUNT; i++)
        sum += sum1(i, i);
      result = sum;
    }

    public void testSum1_2() {
      int sum = 0;
      for (int i = 0; i < COUNT; i++)
        sum += sum1(i, i);
    }

    VM                 sum1    sum1_2
    sun-1.6.0-20       0.04    0.00
    sun-1.6.0-16       0.04    0.00
    sun-1.5.0-18       0.04    0.00
    ibm-1.6.2          0.08    0.01
    jrockit-27.5.0     0.17    0.08
    harmony-r917296    0.17    0.11


    java -server -XX:+PrintOptoAssembly -XX:+PrintCompilation ...

    - method holder: 'com/dawidweiss/geecon2010/Example03'
    - access: 0xc1000001 public
    - name: 'testSum1_2'
    ...
    010   pushq rbp
          subq  rsp, #16    # Create frame
          nop               # nop for patch_verified_entry
    016   addq  rsp, 16     # Destroy frame
          popq  rbp
          testl rax, [rip + #offset_to_poll_page]    # Safepoint: poll for GC
    021   ret


Conclusions

Benchmarks must be executed to provide feedback.

HotSpot is smart and effective at removing [...]
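The difference between testSum1 and testSum1_2 comes down to whether the computed value is still observable after the loop. A minimal sketch of that pattern follows; the class name `DeadCodeSketch`, the method names, and the choice of a `volatile` static field as the sink are assumptions made for illustration.

```java
final class DeadCodeSketch {
    static final int COUNT = 1_000_000;

    // Sink for benchmark results: a value written here stays observable,
    // so the computation that produced it cannot simply be dropped.
    static volatile int guard;

    static int sum1(int a, int b) {
        return a + b;
    }

    // The result is never used, so the JIT is free to remove the whole loop.
    static void discarded() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++) sum += sum1(i, i);
    }

    // The result is published through the guard field.
    static void guarded() {
        int sum = 0;
        for (int i = 0; i < COUNT; i++) sum += sum1(i, i);
        guard = sum;
    }

    public static void main(String[] args) {
        discarded();
        guarded();
        System.out.println("guard = " + guard);
    }
}
```

Storing the result does not stop the compiler from simplifying the loop, but it does prevent the benchmark from timing an empty method.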

Example 4

    @Test
    public void testAdd1() {
      int sum = 0;
      for (int i = 0; i < COUNT; i++) {
        sum += add1(i);
      }
      guard = sum;
    }

    public int add1(int i) {
      return i + 1;
    }

    switch                              testAdd1
    -XX:+Inlining -XX:+PrintInlining    0.04
    -XX:-Inlining                       0.45

Most Java calls are monomorphic.

HotSpot adjusts to megamorphic calls

Example 5

    abstract class Superclass { abstract int call(); }
    class Sub1 extends Superclass { int call() { return 1; } }
    class Sub2 extends Superclass { int call() { return 2; } }
    class Sub3 extends Superclass { int call() { return 3; } }

    Superclass[] mixed = initWithRandomInstances(10000);
    Superclass[] solid = initWithSub1Instances(10000);

    @Test
    public void testMonomorphic() {
      int sum = 0;
      int m = solid.length;
      for (int i = 0; i < COUNT; i++)
        sum += solid[i % m].call();
      guard = sum;
    }

    @Test
    public void testMegamorphic() {
      int sum = 0;
      int m = mixed.length;
      for (int i = 0; i < COUNT; i++)
        sum += mixed[i % m].call();
      guard = sum;
    }

    VM                 monomorphic    megamorphic
    sun-1.6.0-20       0.19           0.32
    sun-1.6.0-16       0.19           0.34
    sun-1.5.0-18       0.18           0.34
    ibm-1.6.2          0.20           0.30
    jrockit-27.5.0     0.22           0.29
    harmony-r917296    0.27           0.32
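The two array initializers used in Example 5 are not shown on the slides. The sketch below fills them in with hypothetical implementations (a fixed-seed `Random` for the mixed array) so that the monomorphic and megamorphic call sites can be run outside a test framework; `CallSiteSketch` and `sumCalls` are illustrative names.

```java
import java.util.Random;

abstract class Superclass { abstract int call(); }
class Sub1 extends Superclass { int call() { return 1; } }
class Sub2 extends Superclass { int call() { return 2; } }
class Sub3 extends Superclass { int call() { return 3; } }

final class CallSiteSketch {
    // Hypothetical helper: every element has the same concrete type.
    static Superclass[] initWithSub1Instances(int n) {
        Superclass[] a = new Superclass[n];
        for (int i = 0; i < n; i++) a[i] = new Sub1();
        return a;
    }

    // Hypothetical helper: three concrete types mixed pseudo-randomly.
    static Superclass[] initWithRandomInstances(int n) {
        Random rnd = new Random(42);
        Superclass[] a = new Superclass[n];
        for (int i = 0; i < n; i++) {
            switch (rnd.nextInt(3)) {
                case 0:  a[i] = new Sub1(); break;
                case 1:  a[i] = new Sub2(); break;
                default: a[i] = new Sub3(); break;
            }
        }
        return a;
    }

    // A single virtual call site; its cost depends on how many
    // receiver types actually reach it at run time.
    static int sumCalls(Superclass[] objects, int count) {
        int sum = 0;
        int m = objects.length;
        for (int i = 0; i < count; i++) sum += objects[i % m].call();
        return sum;
    }

    public static void main(String[] args) {
        int count = 10_000_000;
        System.out.println(sumCalls(initWithSub1Instances(10000), count));
        System.out.println(sumCalls(initWithRandomInstances(10000), count));
    }
}
```

Both loops share the same `sumCalls` method here, so one call site sees all receiver types; for a fair timing comparison each variant would need its own loop, as in the original test methods.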

Example 6

    @Test
    public void testBitCount1() {
      int sum = 0;
      for (int i = 0; i < COUNT; i++)
        sum += Integer.bitCount(i);
      guard = sum;
    }

    @Test
    public void testBitCount2() {
      int sum = 0;
      for (int i = 0; i < COUNT; i++)
        sum += bitCount(i);
      guard = sum;
    }

    /* Copied from {@link Integer#bitCount} */
    static int bitCount(int i) {
      // HD, Figure 5-2
      i = i - ((i >>> 1) & 0x55555555);
      i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
      i = (i + (i >>> 4)) & 0x0f0f0f0f;
      i = i + (i >>> 8);
      i = i + (i >>> 16);
      return i & 0x3f;
    }

    VM               testBitCount1    testBitCount2
    sun-1.6.0-20     0.43             0.43
    sun-1.7.0-b80    0.43             0.43

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

    VM               testBitCount1    testBitCount2
    sun-1.6.0-20     0.08             0.33
    sun-1.7.0-b83    0.07             0.32


    ... -XX:+PrintInlining ...

    ...
    Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
    Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
    Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
    Example06.testBitCount1: [measured 10 out of 15 rounds]
    round: 0.07 [+- 0.00], round.gc: 0.00 [+- 0.00]
    ...
    @ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
    @ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
    @ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
    Example06.testBitCount2: [measured 10 out of 15 rounds]


    ... -XX:+PrintOptoAssembly ...

    {method}
     - klass: {other class}
     - method holder: com/dawidweiss/geecon2010/Example06
     - name: testBitCount1
    ...
    0c2 B13: # B12 B14 <- B8 B12 Loop: B13-B12 inner stride: ...
    0c2   movl R10, RDX # spill
    ...
    0e1   movl [rsp + #40], R11 # spill
    0e6   popcnt R8, R8
    ...
    0f5   addl R9, #7 # int
    0f9   popcnt R11, R11
    0fe   popcnt RCX, R9


Conclusions

Benchmarks must be statistically sound.

    averages, variance, min, max, warm-up phase

Account for HotSpot optimisations.

Account for hardware differences.

    test-on-target

Use domain data and real scenarios.

Inspect suspicious output with debug JVM.
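The first point (warm-up rounds followed by measured rounds, then summary statistics) can be sketched in a few lines. This is a hand-rolled illustration under stated assumptions, not the junit-benchmarks API: `RoundStatsSketch`, `measure`, and the statistics helpers are hypothetical names, and the standard deviation is the population form.

```java
final class RoundStatsSketch {
    // Sink for results so the measured work stays observable.
    static volatile long guard;

    // Runs the task warmupRounds times unmeasured, then measuredRounds
    // times with timing; returns the measured round times in seconds.
    static double[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) task.run();
        double[] seconds = new double[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    // Population standard deviation of the round times.
    static double stdDev(double[] v) {
        double m = mean(v);
        double squares = 0;
        for (double x : v) squares += (x - m) * (x - m);
        return Math.sqrt(squares / v.length);
    }

    static double min(double[] v) {
        double r = v[0];
        for (double x : v) r = Math.min(r, x);
        return r;
    }

    static double max(double[] v) {
        double r = v[0];
        for (double x : v) r = Math.max(r, x);
        return r;
    }

    public static void main(String[] args) {
        Runnable task = new Runnable() {
            public void run() {
                long sum = 0;
                for (int i = 0; i < 5_000_000; i++) sum += Integer.bitCount(i);
                guard = sum;
            }
        };
        double[] t = measure(task, 5, 10);
        System.out.printf("round: %.3f [+- %.3f], min: %.3f, max: %.3f%n",
            mean(t), stdDev(t), min(t), max(t));
    }
}
```

A large spread between min and max, or a standard deviation close to the mean, is a hint that the measurement is dominated by noise (GC, compilation, other processes) rather than by the code under test.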


HPPC


Motivation

Primitive types: fast and memory-friendly.

Optional assertions.

Single-threaded. No fail-fast.

Fast, fast, fast iterators, with no GC overhead.

Open internals (explicit implementation).

Why not JCF?

    public interface List<E> extends Collection<E> {
      boolean contains(Object o);   // [-] contract-enforced methods
      Iterator<E> iterator();       // [-] iterators over primitive types?
      Object[] toArray();           // [-] troublesome covariants
      // ...
    }

Friendly Competition

fastutil

PCJ

GNU Trove

Apache Mahout (ported COLT)

Apache Primitive Collections

All of these have pros and cons and deal with JCF compatibility somehow.

Iterators in fastutil or PCJ

    interface IntIterator extends Iterator<Integer> {
      // Primitive-specific method
      int nextInt();
    }

Iterators in HPPC

    public final class IntCursor {
      public int index;
      public int value;
    }

    public class IntArrayList implements Iterable<IntCursor> {
      public Iterator<IntCursor> iterator() { ... }
    }

Iterating over list elements in HPPC

    for (IntCursor c : list) {
      System.out.println(c.index + ": " + c.value);
    }

...or

    list.forEach(new IntProcedure() {
      public void apply(int value) {
        System.out.println(value);
      }
    });

...or

    final int[] buffer = list.buffer;
    final int size = list.size();
    for (int i = 0; i < size; i++) {
      System.out.println(i + ": " + buffer[i]);
    }
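To make the cursor idea concrete, here is a minimal, self-contained sketch of a primitive list whose iterator reuses one mutable cursor instead of boxing each element. It is not HPPC's implementation: `TinyIntList` is a hypothetical class written for this illustration, and only the `IntCursor` shape (public `index` and `value` fields) is taken from the slides.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class IntCursor {
    public int index;
    public int value;
}

// Hypothetical stand-in for a primitive int list with an open internal buffer.
final class TinyIntList implements Iterable<IntCursor> {
    public int[] buffer = new int[4];
    private int size;

    public void add(int v) {
        if (size == buffer.length) buffer = Arrays.copyOf(buffer, size * 2);
        buffer[size++] = v;
    }

    public int size() {
        return size;
    }

    @Override
    public Iterator<IntCursor> iterator() {
        return new Iterator<IntCursor>() {
            // One cursor per iterator; it is overwritten on each step,
            // so no object is allocated per element.
            private final IntCursor cursor = new IntCursor();
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < size;
            }

            @Override
            public IntCursor next() {
                if (position >= size) throw new NoSuchElementException();
                cursor.index = position;
                cursor.value = buffer[position++];
                return cursor;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        TinyIntList list = new TinyIntList();
        for (int i = 1; i <= 5; i++) list.add(i * 10);
        for (IntCursor c : list) {
            System.out.println(c.index + ": " + c.value);
        }
    }
}
```

Because the cursor is reused, callers must copy `index` and `value` out if they need them after the next step; that is the price of an allocation-free for-each loop.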


Open implementation is good.

    /**
     * Applies a supplemental hash function to a given
     * hashCode, which defends against poor quality
     * hash functions. [...]
     */
    static int hash(int h) {
      // This function ensures that hashCodes that differ only by
      // constant multiples at each bit position have a bounded
      // number of collisions (approximately 8 at default load factor).
      h ^= (h >>> 20) ^ (h >>> 12);
      return h ^ (h >>> 7) ^ (h >>> 4);
    }

HPPC approach (example):

    public class LongIntOpenHashMap implements LongIntMap {
      // ...
      public LongIntOpenHashMap(int initialCapacity, float loadFactor,
          LongHashFunction keyHashFunction,
          IntHashFunction valueHashFunction) {
        // ...
      }
    }

Example 7


HPPC:

    final char [] CHARS = DATA;
    final IntIntOpenHashMap counts = new IntIntOpenHashMap();
    for (int i = 0; i < CHARS.length - 1; i++) {
      counts.putOrAdd((CHARS[i] << 16 | CHARS[i + 1]), 1, 1);
    }

JCF, boxed integer types:

    final Integer currentCount = map.get(bigram);
    map.put(bigram, currentCount == null ? 1 : currentCount + 1);

JCF, with IntHolder (mutable value object).

GNU Trove:

    map.adjustOrPutValue(bigram, 1, 1);

fastutil, OpenHashMap and LinkedOpenHashMap:

    map.put(bigram, map.get(bigram) + 1);
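For reference, the boxed JCF variant can be fleshed out into a complete bigram counter. The sketch assumes the same key packing as the HPPC snippet (two 16-bit chars in one int); the class name `BigramCountSketch` and the methods `bigram` and `countBigrams` are illustrative and do not come from the talk.

```java
import java.util.HashMap;
import java.util.Map;

final class BigramCountSketch {
    // Packs two 16-bit chars into a single int key.
    static int bigram(char first, char second) {
        return (first << 16) | second;
    }

    // Counts adjacent character pairs using boxed keys and values.
    static Map<Integer, Integer> countBigrams(char[] chars) {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        for (int i = 0; i < chars.length - 1; i++) {
            Integer key = bigram(chars[i], chars[i + 1]);   // boxes the key
            Integer currentCount = map.get(key);
            // Unboxes the old count, adds one, and boxes the new count.
            map.put(key, currentCount == null ? 1 : currentCount + 1);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = countBigrams("abracadabra".toCharArray());
        System.out.println("distinct bigrams: " + counts.size());
        System.out.println("count of 'ab': " + counts.get(bigram('a', 'b')));
    }
}
```

Every update in this version performs a lookup, a second hash on `put`, and up to two boxing operations, which is the overhead the primitive-map variants above are designed to avoid.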


Is Java faster than C/C++?

The short answer is: it depends.


Example 8

The same algorithm for building a DFSA automaton accepting a set of strings. Input: 3 565 575 strings, 158M of text.

            gcc -O2    java 1.6.0_20-64
    real    63.850s    43.197s
    user    63.110s    46.370s
    sys      0.240s     0.840s


Performance checklist (sanity check)

Algorithms, algorithms, algorithms.

Proper data structures.

Spurious GC activity.

Memory barriers in tight loops.

CPU cache utilization.

HPPC and junit-benchmarks are at:
