Writing High Performance Java Code Which Runs as Fast as Fortran, C or C++

J. C. Schatzman a

ABSTRACT

Java software is often perceived to be slow compared to corresponding C/C++ or FORTRAN software. For some computationally demanding algorithms, straightforward implementations in Java may run 100-150 times or more slower than C++ or FORTRAN. In the past, problem areas have included floating point intensive algorithms such as FFTs and integer functions such as endian and alignment byte manipulations. However, current JVMs do well with model floating point code such as FFTs and linear algebra. Java vectors and lists are somewhat slow compared to their C++ equivalents; hashmaps can be very fast. Function calls in general, and getters and setters in particular, remain troublesome. Current JITs are extremely important for optimizing performance.

Keywords: Java performance, Java scientific computing, Java FFT, Java sorting

1. INTRODUCTION

Many inconsistent claims have appeared on the Internet about Java performance. According to some, Java is just as fast as other languages such as C/C++; according to others, Java performs poorly. Of course, we should note that computer languages are neither fast nor slow – only implementations can truly be associated with speeds. However, some languages lead naturally to efficient compilation and execution; others, particularly higher level languages, are likely to be perceived as “inefficient” because of innate characteristics of the language. For example, Java has historically been regarded as an interpreted language, rather than one compiled to native code. This fact leads us to suspect that Java may be innately slower than languages that traditionally are compiled to native code, such as Fortran, C or C++.

In practice, the Just In Time (JIT) compiler technology for Java alleviates this problem, but we are left with a suspicion that Java may be slow. Additional superficial “suspect” factors include:

1. Java objects carry overhead data such as thread synchronization information which C/C++ objects or data structures avoid. A purely memory-organizing data structure (C struct) does not exist in Java.

2. Java arrays of objects are actually and unavoidably arrays of references to objects. Because array-handling code traverses this additional indirection, memory used by a Java array of objects may be highly non-localized and therefore less efficiently handled by modern memory systems.

3. Object Oriented code, and Java code in particular, tends to involve much dynamic creation of objects. This is less efficient than using statically allocated memory, stack memory, or processor registers for storage of intermediate results.

4. Java array operations unavoidably incorporate subscript range checking. Multi-dimensional arrays apparently require checking of multiple subscripts.

5. The Java runtime invokes a garbage collector periodically; one presumes that this garbage collector requires resources such as CPU cycles that detract from application performance.

6. Today’s JITs are designed to be highly portable instead of taking advantage of all machine instructions on the particular hardware platform. C/C++ compilers available today can optimize code for particular processors or classes of processors.

7. JIT technology is immature compared to optimization technologies for FORTRAN or C.

In fact, many of these concerns have been addressed by the Java compiler and runtime communities. For example, the current JVMs exhibit greatly reduced effects of the garbage collector as compared to the technology of a year or two ago. Most users of Java applications with recent JVMs will not be aware of any drop in responsiveness when the garbage collector runs. This is a notable triumph of the technology.

a Correspondence: e-mail: James.Schatzman@futurelabusa.com; Telephone: (303) 873-9979.


JIT technology also appears to have made great strides in the last 12 to 18 months. For example, FFT code that some months ago ran far slower than its C++ or FORTRAN equivalents is now running almost as fast.

As Java technology matures, we find that performance issues are overall less severe, but they still lurk in a large number of small crevasses that trap the unwary developer. For example, in Schatzman and Donehower (2001) the authors show that byte manipulations in the serialization classes slow down certain Java operations by about a factor of 150. In this short paper I cannot hope to exhaustively analyze all performance issues. Instead, I examine a number of commonly used algorithms and data structures. I compare execution times for the basic algorithms executed as interpreted Java, JIT-compiled Java with the Client and Server JITs available as part of the Java Development Kit, C++, and FORTRAN.

2. NUMBER CRUNCHING ALGORITHMS

We would not expect that interpreted code would deliver good performance with computationally intensive algorithms such as FFTs or linear algebra. The question is then: how effective are current JITs at optimizing these kinds of numerical code?

Our first test is the FFT. Actually, I have implemented the simplest of all possible FFT algorithms, the original Cooley-Tukey radix-2 algorithm. I tested three variations of the basic code, which was modified from Claerbout (1976):

A. Complex-object arithmetic with new Complex objects created dynamically as they are computed.

B. Complex-object arithmetic with objects reused (no dynamic creation).

C. Real arithmetic with no dynamic creation.

In practice, these implementations are sub-optimal because the twiddle factors (trigonometric functions) are evaluated as part of the transform step; they should be computed in a separate one-time-only step. Also, it is well-known that higher radix FFTs and other FFT variants such as the Prime Factor Algorithm (PFA) can be faster than the radix-2 algorithm implemented here. Regardless, the tested algorithm is very simple and admits easy language translation.
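To illustrate the separate one-time-only step, the following is a minimal sketch of a precomputed twiddle table (illustrative code, not from the paper or its website; the indexing convention in the comments is my own assumption). The table holds w[t] = exp(i*sign1*2*pi*t/nn) for t = 0..nn/2-1; a butterfly at stage size k, index m, would read entry t = m*nn/(2*k).

public class Twiddles {
    static double[] twiddleRe, twiddleIm;

    // Compute the trigonometric twiddle factors once, before any transforms,
    // instead of calling Math.cos/Math.sin inside the transform loop.
    static void precomputeTwiddles(int nn, double sign1) {
        twiddleRe = new double[nn / 2];
        twiddleIm = new double[nn / 2];
        for (int t = 0; t < nn / 2; ++t) {
            double arg = 2.0 * Math.PI * sign1 * t / nn;
            twiddleRe[t] = Math.cos(arg);
            twiddleIm[t] = Math.sin(arg);
        }
    }
}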

The test consisted of timing the execution of a 1024-point complex-to-complex transform (64-bit). The results are tabulated in Table 1. Appendix A explains the testing procedures, which attempt to measure bias-free execution times accurate to about 0.25%.

Table 1. Tests of the three radix-2 FFT algorithm variants on a 1 GHz Pentium III processor. Execution times are given in milliseconds.

Code                          Variant A:        Variant B:        Variant C:
                              dynamic object    complex object    primitives
                              creation          reuse
Java – Interpreted            16.3 ms           8.91 ms           8.37 ms
Java – Client Hotspot JIT     4.95              0.827             0.779
Java – Server Hotspot JIT     4.59              0.732             0.798
C++                           6.26              0.611             0.611
FORTRAN                       –                 –                 2.02

We observe:

1. The Java-JIT implementation was only about 20% slower than the C++ implementation; it was much faster than the FORTRAN implementation.

2. Interpreted Java is slow, as much as 15 times slower than the corresponding C++ code in these tests.

3. Dynamic creation of objects slows down the software enormously in both Java and C++, as much as 10 times. It appears that Java handles this somewhat better than C++, to the point that the Java implementation ran faster than the C++ implementation, although this result may vary depending on the C++ heap manager.

4. This FORTRAN code did not perform very well compared to the other implementations. We should probably attribute this lackluster result to the neglect and poor state of current FORTRAN compiler products.


A crucial issue in cross-language comparisons is the translation. I have found it helpful in many cases to start with a Java implementation and then make the minimal changes necessary to make it work sensibly in C++ or FORTRAN (see Appendix A for more details). This becomes difficult in the case of C++ mainly in two areas:

1. Garbage collection – explicit deletions have to be added to C++ code to prevent memory leaks.

2. Arrays of objects – Java arrays of objects are actually arrays of object references. While corresponding C++ code can easily be implemented, the natural and usually more efficient data structure to use is a true array of objects. However, such an implementation does not permit references to be manipulated. Judgment is called for. For example, the Variant A C++ implementation used arrays of object references, so that the references could be directly manipulated, whereas Variants B and C used true arrays of objects, because no manipulation of the references is required.

The bottom line: Java/JIT performs FFTs very well. We might even rate this performance surprisingly good considering that Java involves dynamic subscript checking which C++ does not.

A final twist in the FFT picture is provided by the results of Table 2, corresponding to the same tests on a different platform: a 1.33 GHz Athlon/DDR processor.

Table 2. Tests of the three radix-2 FFT algorithm variants on a 1.33 GHz/DDR Athlon processor. Execution times are given in milliseconds.

Code                          Variant A:        Variant B:        Variant C:
                              dynamic object    complex object    primitives
                              creation          reuse
Java – Interpreted            14.4 ms           9.28 ms           7.81 ms
Java – Client Hotspot JIT     2.75              0.453             0.490
Java – Server Hotspot JIT     2.88              1.09              1.03
C++                           4.45              0.439             0.284
FORTRAN                       –                 –                 0.603

It is interesting to note that the Server Hotspot(TM) JIT does particularly badly on this platform with the pure-arithmetic Variants B and C. It would appear that the Server JIT’s dynamic memory management performance makes up for this poor arithmetic performance in Variant A. In any case, on this platform C++ does better than Java/JIT by about a factor of four in the worst case (compared to the Server JIT).

The fact that Java/JIT does not fare as well on the Athlon as on the Pentium III may be explained by the relative performance of floating point and integer operations on the two processors. Compared with its integer performance (which governs, for example, array subscript checking), the floating point performance of the Athlon far exceeds that of the Pentium III. Therefore the C++ implementation, which lacks subscript checking, benefits most from the increased speed of the Athlon's floating point operations.

A second test measured performance of matrix-vector multiplication – possibly the simplest operation in linear algebra.

A 1000x1000 matrix and 1000-element vector were populated with pseudo-random numbers. The algorithm was implemented in five slightly different ways, all with arrays:

A. normal implementation, 2-D arrays
B. normal implementation, 1-D arrays
C. vector processor strategy (no dot-products), 1-D arrays
D. reverse direction, 1-D arrays
E. reverse direction, 2-D arrays

Variant C is the typical implementation on a highly pipelined processor (such as the Cray-1, X-MP and Y-MP processors), because on those processors dot products are inefficient. Variants D and E were evaluated because of a belief among some developers that loops whose limits compare against zero should be faster than other limit values. A sketch of Variants A and B appears below; Table 3 shows the results.
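The full benchmark source is on futurelabusa.com; the following is a minimal sketch (method and variable names are my own) of what Variants A and B plausibly look like in Java:

public class MatVec {
    // Variant A: natural 2-D array implementation (row dot-products).
    static void multiply2d(double[][] m, double[] x, double[] y, int n) {
        for (int i = 0; i < n; ++i) {
            double sum = 0;
            for (int j = 0; j < n; ++j) sum += m[i][j] * x[j];
            y[i] = sum;
        }
    }

    // Variant B: the matrix collapsed to one dimension, row-major order.
    static void multiply1d(double[] m, double[] x, double[] y, int n) {
        for (int i = 0; i < n; ++i) {
            double sum = 0;
            int row = i * n;  // offset of row i in the flattened matrix
            for (int j = 0; j < n; ++j) sum += m[row + j] * x[j];
            y[i] = sum;
        }
    }
}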


Table 3. Tests of the matrix-vector product variants on a 1GHz Pentium III processor. Execution times are given in seconds.

Code                          Variant A:    Variant B:    Variant C:    Variant D:    Variant E:
                              2-D arrays    1-D arrays    vectorized    1-D reverse   2-D reverse
Java – Interpreted            0.208 sec     0.238 sec     0.298 sec     0.242 sec     0.214 sec
Java – Client Hotspot JIT     0.0368        0.0356        0.0511        0.0360        0.0362
Java – Server Hotspot JIT     0.0363        0.0437        0.0493        0.0380        0.0362
C++                           0.0238        0.0238        0.0450        0.0239        0.0238

We observe:

A. Java/JIT run times are about 50% greater than C++.

B. Reversing the direction of the loops had very little effect.

C. The vector processor strategy was distinctly unhelpful.

D. The Server Hotspot(TM) JIT again performed erratically and occasionally rather badly.

The matrix-vector product requires more address calculations and more load/store operations per floating point operation than the radix-2 FFT algorithm. Therefore, we would expect subscript checking to be more of a performance factor in this case; this argument is consistent with the results.

More of a puzzle is the difference between Variants A and B. If subscript checking were a significant factor, we would expect Variant B to be substantially faster than Variant A, but it was not. In fact, the Server Hotspot JIT version ran slower. I have no convincing explanation for these results. We can conclude, however, that in this case rewriting the code was not worthwhile. This contradicts claims by Moreira et al (2000), who found that collapsing arrays to one dimension was helpful.

Another test was done of the Quicksort algorithm – an algorithm that requires much computation, but computation limited to comparisons rather than floating point addition, subtraction, multiplication and division. To make the comparisons fair, exactly the same sequences of pseudo-random numbers were used in all tests. Table 4 shows the results for sorting 1,048,576 64-bit floating point numbers.

Table 4. Tests of the Quicksort algorithm, sorting 1,048,576 random 64-bit floating point numbers, on a 1GHz Pentium III processor and 1.33 GHz/DDR Athlon. Execution times are given in seconds.

Code                          PIII          Athlon
Java – Interpreted            5.83 sec      5.18 sec
Java – Client Hotspot JIT     0.897         0.541
Java – Server Hotspot JIT     0.921         0.574
C++                           0.800         0.402
FORTRAN                       0.820         0.572

For this test C++ beat Java/JIT by about 15-40%. Again, we see that the JIT is very important and that Server Hotspot JIT performs slightly worse than the Client Hotspot JIT.

3. DATA STRUCTURE ALGORITHMS

The simplest data structure is probably the sequential list, which is typically implemented as an array or a formal list container. In Java there are several ways commonly used to do this:

A. array
B. Vector
C. ArrayList
D. Collections.synchronizedList(ArrayList)
E. primitive wrapper for an underlying primitive array
F. object wrapper for an underlying object array

A recent journal paper on Java performance brought many e-mail messages from readers who felt that the Java Vector class is obsolete and slow, and that ArrayList should be used instead. Variants B and D are synchronized (thread-safe), so it is valuable to compare them. In C++ it is natural to use either an array, a wrapper as for Java, or the STL vector template class. A sketch of the variants appears below.
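For concreteness, the following minimal sketch (illustrative, not the benchmark itself, and written with modern generics for readability; the JDK 1.3-era code would have used raw container types) shows how Variants A through D are constructed and read sequentially; Variants E and F wrap a raw array behind an accessor as in Figure 3 of the appendix:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Vector;

public class ListVariantsDemo {
    public static void main(String[] args) {
        final int n = 1_000_000;
        double[] array = new double[n];                    // Variant A
        Vector<Double> vector = new Vector<>();            // Variant B (synchronized)
        ArrayList<Double> arrayList = new ArrayList<>();   // Variant C
        List<Double> syncList =
            Collections.synchronizedList(new ArrayList<Double>()); // Variant D
        for (int i = 0; i < n; ++i) {
            array[i] = i;
            vector.add((double) i);
            arrayList.add((double) i);
            syncList.add((double) i);
        }
        // Sequential read: the operation the benchmark times.
        double sum = 0;
        for (int i = 0; i < n; ++i) sum += array[i];
        for (int i = 0; i < n; ++i) sum += vector.get(i);
        System.out.println(sum);
    }
}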

This test was designed to evaluate data access performance. Performance for construction of the data structures, insertions, deletions, etc., may show completely different results. The test code generated very large data structures (hundreds of millions of 64-bit floating point numbers) and measured the time to read the elements sequentially. It is a bit of an exaggeration to call this “an algorithm”; nonetheless, this kind of operation is performed frequently.

Table 5. Tests of linear data access on a 1GHz Pentium III processor, accurate to better than 1%. Execution times are given in microseconds per element access (read).

Code                          Variant A:    Variant B:        Variant C:    Variant D:        Variant E:    Variant F:
                              array         Vector/           ArrayList     ArrayList         primitive     object
                                            vector<double>                  (synchronized)    wrapper       wrapper
Java – Interpreted            0.0517 µs     0.432 µs          0.463 µs      0.7408 µs         0.126 µs      0.644 µs
Java – Client Hotspot JIT     0.0112        0.109             0.111         0.134             0.0259        0.207
Java – Server Hotspot JIT     0.0210        0.102             0.0847        0.123             0.0208        0.194
C++                           0.0322        0.0204            –             –                 0.0322        –
FORTRAN                       0.0213        –                 –             –                 –             –

While doing the tests it became apparent that Java performance with the wrappers was dependent on whether or not the accessor was declared “final”. Table 6 shows the results.

Table 6. Dependency on “final” declaration of linear data access on a 1GHz Pentium III processor, accurate to better than 1%. Execution times are given in microseconds per element access (read).

Code                          Variant A:    Variant E:        Variant E:          Variant F:      Variant F:
                              array         final primitive   not final           final object    not final
                                            wrapper           primitive wrapper   wrapper         object wrapper
Java – Interpreted            0.0517 µs     0.126 µs          0.128 µs            0.644 µs        0.657 µs
Java – Client Hotspot JIT     0.0112        0.0259            0.0387              0.207           0.215
Java – Server Hotspot JIT     0.0210        0.0208            0.0259              0.194           0.200

Java execution times for Vector and ArrayList also depended on how the data structures were created. If the data structures were fully sized before being populated, execution times were actually slightly larger in most cases than if the default construction was employed. The times reported above are the former. This counter-intuitive result indicates that it is not helpful to pre-size these containers.
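For clarity, a minimal sketch of the two construction strategies being compared (illustrative code, not the benchmark itself):

import java.util.ArrayList;

public class SizingDemo {
    public static void main(String[] args) {
        final int n = 1_000_000;
        ArrayList<Double> preSized = new ArrayList<>(n);  // capacity hint given up front
        ArrayList<Double> grown = new ArrayList<>();      // default: grows by reallocation
        for (int i = 0; i < n; ++i) {
            preSized.add(0.0);
            grown.add(0.0);
        }
    }
}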

Some rather alarming observations are:

1. The Server Hotspot JIT performs very poorly compared to the Client Hotspot JIT in optimizing array access (compare Variant A results).

2. Neither JIT does very well at optimizing code that should be in-lined (compare Variants A and E), although the Server Hotspot JIT is rather better than the Client Hotspot JIT.


3. There is a huge penalty associated with the object wrappers for primitives (compare Variants E and F).

4. C++ arrays are not as efficient as the vector template!

The important recommendations for Java performance are, in decreasing order of importance:

1. Use arrays rather than containers, whenever possible. In this case, the Client Hotspot JIT delivers better performance.

2. Use primitives rather than objects, whenever possible.

3. If using the Client Hotspot JIT, declare methods to be “final” whenever possible.

4. Avoid using the synchronization wrapper with ArrayList – use Vector instead.

A second benchmark timed table lookup; random entries were searched for with get(key) in Map objects with String keys. In the Java case, I tested the three Map classes HashMap, TreeMap and Hashtable. I also tested two slightly different implementations – one a natural implementation and one in which all String objects were replaced by their “canonical representation” obtained using the String.intern() function. The principal effect of canonicalization is an improvement in the performance of the equals/hashCode methods.
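For concreteness, the following minimal sketch (hypothetical keys, not the paper's benchmark code) shows the canonicalization idea: intern() maps every equal String to one pooled instance, so a map lookup can succeed on a cheap reference comparison before falling back to equals():

import java.util.HashMap;
import java.util.Map;

public class InternDemo {
    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("alpha".intern(), 1);

        // A key assembled at run time; intern() canonicalizes it to the
        // same pooled instance used above.
        String prefix = "al";
        String lookup = (prefix + "pha").intern();
        System.out.println(table.get(lookup)); // prints 1
    }
}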

Table 7. Tests of the table lookup benchmark (microseconds per lookup); 1 GHz Pentium III processor.

Code                          HashMap     HashMap     TreeMap     TreeMap     Hashtable   Hashtable   STL map
                              natural     canonical   natural     canonical   natural     canonical
                              Strings     rep.        Strings     rep.        Strings     rep.
Java – Interpreted            6.76 µs     1.39 µs     34.7 µs     31.9 µs     6.58 µs     1.38 µs     –
Java – Client Hotspot JIT     2.05        0.578       6.52        5.04        2.07        0.606       –
Java – Server Hotspot JIT     2.03        0.572       6.24        4.71        2.02        0.614       –
C++                           –           –           –           –           –           –           3.13 µs

The Java/C++ comparison here is somewhat unfair because the Java benchmark used 32 16-bit Unicode characters as keys while the C++ benchmark used 32 8-bit ASCII characters as keys. Since these are the most commonly used string representations, the bias towards C++ was accepted. Standard STL maps are most similar in function to the Java TreeMap.

The end result is somewhat ambiguous: using the same kind of map, the C++ benchmark executes about twice as fast as the Java benchmark, but it also handles about half the data volume measured in bits. As a practical matter, when all that is desired is an associative array with no particular ordering properties, the properly implemented Java benchmarks run about five times faster than the C++ benchmark despite the Java character-size handicap.

We conclude:

1. The Java HashMap and Hashtable are very fast, especially when implemented with fast equals/hashCode methods.

2. The Java TreeMap is very slow compared to the other maps.

3. The canonical representations of the String objects make a huge improvement in HashMap and Hashtable performance.

4. In Java maps where the keys are objects other than String objects, it would be very beneficial to optimize the performance of the equals/hashCode methods for those objects in the case of HashMap and Hashtable.

4. CONCLUSIONS

Today’s Java platforms perform surprisingly well in many cases. For floating point intensive calculations such as FFTs we have seen that Java/JIT execution times are within 20% of C++ and FORTRAN on the Pentium III and somewhat worse, about 50% slower, on the Athlon processor. Most of the other tests showed similarly good performance. It is important to note that seemingly minor changes in coding (such as the choice of container class) can lead to changes in performance of an order of magnitude. It is important to understand the issues and to benchmark applications if high performance Java is to be achieved.

It is disappointing that Java platforms still fail to give good performance for the following operations:

1. containers (Vector/ArrayList) as compared to arrays
2. setters and getters (method calls in general)
3. array subscript checking

In Schatzman and Donehower (2001), we also observed severe performance issues related to byte manipulation. If these few limitations were overcome, it appears that Java performance would rival that of C++ or FORTRAN in a wide range of applications.

Finally, it appears that the current Server Hotspot (TM) JIT performs erratically and often performs much worse than the Client JIT. I recommend the Client JIT for most applications until these performance issues are resolved.

5. APPENDIX A

Tests were performed using Windows 2K, JDK 1.3.1, Java HotSpot(TM) Client and Server VM 1.3.1-b24, Microsoft Visual C++ 6.0 Service Pack 4, and GNU g77 version 0.5.19. The Java VM defaults were taken except for the memory size, which was set large enough to prevent thrashing or “out of memory” errors. C++ builds were done in Visual C++ with “Maximize Speed” optimization and the “Blend” target instruction set. FORTRAN builds were done with optimization level 3. Timings were done carefully to overcome the limitations of the Windows clock (10 ms resolution), to avoid paging, and to average over the effects of garbage collection, heap management and other irregular events. A typical test run required hours of CPU time for an individual benchmark. Repeated runs showed repeatability to better than 0.25% in all cases. It is difficult to verify that all sources of bias have been removed. For example, multiple runs can be performed with different size data structures, extrapolating to size zero. The results of these tests indicated that the biases are similarly small.
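The following minimal sketch (an assumed structure, not the actual harness) illustrates the basic approach of timing many repetitions so that the 10 ms clock granularity, garbage collection and other irregular events average out:

public class Timer {
    static double timePerCall(Runnable task, int reps) {
        task.run();                               // warm-up: give the JIT a chance to compile
        long start = System.currentTimeMillis();  // ~10 ms granularity on Windows 2K
        for (int i = 0; i < reps; ++i) {
            task.run();
        }
        long elapsed = System.currentTimeMillis() - start;
        return (double) elapsed / reps;           // milliseconds per call
    }
}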

Language translation can be a challenge when we wish to avoid introducing biases into the performance data. I found it helpful to start with Java code, which is usually the easiest to write. Converting to C++ roughly follows these steps:

1. make syntax changes: public/private, final, semicolons at the ends of class declarations
2. replace all occurrences of object specifiers a with &a
3. replace A a = new A() with A & a = *(new A())
4. replace System.out.println with corresponding cout structures
5. as necessary, replace arrays of objects with arrays of object references
6. replace package/import directives with include directives
7. make anonymous classes explicit
8. use second() entry point for timing

Converting to FORTRAN is simple enough in the benchmarks that use arrays as their only data structure (syntax changes; classes replaced by global functions; member data replaced by common blocks). However, there is no obvious translation of containers so I omitted those tests.

The FFT tests were done with what is probably the simplest radix-2 algorithm known. Figure 1 shows the fastest Java version tested. See the website futurelabusa.com for full benchmark code.


public static void fftB(int nn, Complex[] cx, double sign1) {
    final double pi = 3.1415926535897932384;
    double sc = Math.sqrt(1.0/nn);
    final int nnh = nn/2;
    Complex ctemp = new Complex();
    Complex csc = new Complex();
    int j = 0;
    // Bit-reversal permutation, combined with scaling by sc.
    for (int i = 0; i < nn; ++i) {
        if(i < j) {
            cx[j].mul(sc);
            cx[i].mul(sc);
            Complex.swap(cx[j], cx[i]);
        }
        else if(i == j) {
            cx[i].mul(sc);
        }
        int m = nnh;
        while (true) {
            if(j < m) break;
            j -= m;
            m /= 2;
            if(m < 1) break;
        }
        j += m;
    }
    // Radix-2 butterfly stages.
    int k = 1;
    while (true) {
        int istep = 2*k;
        for (int m = 0; m < k; ++m) {
            double arg = pi*sign1*m/(double)k;
            csc.set(Math.cos(arg), Math.sin(arg));
            for (int i = m; i < nn; i += istep) {
                Complex ck = cx[i+k];
                ctemp.mul(csc, ck);
                ck.sub(cx[i], ctemp);
                cx[i].add(ctemp);
            }
        }
        k = istep;
        if(k >= nn) break;
    }
}

Figure 1. Variant B, Java version, of the FFT benchmark algorithm.
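Figure 1 assumes a mutating Complex helper class that is not reproduced in the paper; the following minimal sketch is my own reconstruction of the methods fftB() calls (the actual class is on futurelabusa.com), showing one consistent implementation:

final class Complex {
    double re, im;

    void set(double re, double im) { this.re = re; this.im = im; }

    void mul(double s) { re *= s; im *= s; }      // this *= s (real scale)

    void mul(Complex a, Complex b) {              // this = a*b
        re = a.re*b.re - a.im*b.im;
        im = a.re*b.im + a.im*b.re;
    }

    void add(Complex a) { re += a.re; im += a.im; }  // this += a

    void sub(Complex a, Complex b) {              // this = a - b
        re = a.re - b.re;
        im = a.im - b.im;
    }

    static void swap(Complex a, Complex b) {      // exchange contents in place
        double tr = a.re, ti = a.im;
        a.re = b.re; a.im = b.im;
        b.re = tr; b.im = ti;
    }
}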

The Quicksort benchmark consisted of equivalent implementations of the variant of the algorithm given in Press et al (1989). The three implementations, Java, C, and FORTRAN, were virtually identical except for necessary syntax modifications. All three implementations used arrays for the principal tables. The three implementations used the same pseudo-random number generator with the same seed. All tests were performed in sets of 100 sorts with different data in each sort, the sample mean and variance being computed for each set. All tests then repeated the set computations 12 times, the means being averaged.

public final void qsort(double[] data, int n) {
    final int m = 7;
    final int nstack = 500;
    final double fm = 7875;
    final double fa = 211;
    final double fc = 1663;
    final double fmi = 1.0/fm;
    int[] istack = new int[nstack];
    int jstack = 0;
    int k = 1;
    int ir = n;
    double fx = 0;
    while(true) {
        if(ir-k < m) {
            // Straight insertion sort for small subfiles.
            for (int j = k+1; j <= ir; ++j) {
                double a = data[j-1];
                int i;
                for (i = j-1; i >= 1; --i) {
                    if(data[i-1] <= a) break;
                    data[i] = data[i-1];
                }
                data[i] = a;
            }
            if(jstack == 0) return;
            ir = istack[--jstack];
            k = istack[--jstack];
        } else {
            int i = k;
            int j = ir;
            // In-line linear congruential generator selects the partition element.
            fx = (fx*fa+fc);
            int r = (int) (fx/fm);
            fx -= r*fm;
            int iq = (int)(k + (ir-k+1)*(fx*fmi));
            double a = data[iq-1];
            data[iq-1] = data[k-1];
            while(true) {
                // Scan down from j for an element that belongs on the left.
                while(true) {
                    if(j > 0) {
                        if(a < data[j-1]) {
                            --j;
                            continue;
                        }
                    }
                    break;
                }
                if(j <= i) {
                    data[i-1] = a;
                    break;
                }
                data[i-1] = data[j-1];
                ++i;
                // Scan up from i for an element that belongs on the right.
                while(true) {
                    if(i <= j) {
                        if(a > data[i-1]) {
                            ++i;
                            continue;
                        }
                    }
                    break;
                }
                if(j <= i) {
                    data[j-1] = a;
                    i = j;
                    break;
                }
                data[j-1] = data[i-1];
                --j;
            }
            jstack += 2;
            if(jstack > nstack) {
                System.out.println("error: stack too small");
                return;
            }
            // Push the larger subfile on the stack; process the smaller one next.
            if(ir-i >= i-k) {
                istack[jstack-1] = ir;
                istack[jstack-2] = i+1;
                ir = i-1;
            } else {
                istack[jstack-1] = i-1;
                istack[jstack-2] = k;
                k = i+1;
            }
        }
    }
}

Figure 2. Java version of the Quicksort benchmark algorithm (modified from Press et al (1989)).

public class PrimitiveDoubleList {
    private double[] array;

    public PrimitiveDoubleList(final double[] data) {
        array = data;
    }

    public final double get(final int index) {
        return array[index];
    }
}

Figure 3. Java primitive wrapper class for double arrays. The C++ equivalent is very similar. Although it appears that the accessor get() should be easily in-lined, this does not appear to be done successfully by the JITs.

The Map tests were done with randomly generated strings of 32 Unicode characters (Java) or 32 ASCII characters (C++). Maps of 5-50,000 entries were then searched for the same set of random strings, but in random order. The execution times of large numbers of table lookups were then averaged.

See the website futurelabusa.com for full benchmark code, including all variants and languages.

ACKNOWLEDGEMENTS

I acknowledge the insights shared with me by Roy Donehower of TRW Systems, my co-author on an earlier journal article on Java performance. I also want to thank my employer, TRW Systems, for its support of advanced technologies, and Frank Mercado, the manager of the TRW Data Systems Office of Technology. I also recommend Larman and Guthrie (2000) and Wilson and Kesselman (2000).

REFERENCES

1. J. Claerbout, Fundamentals of Geophysical Data Processing, McGraw-Hill, New York, 1976.

2. C. Larman and R. Guthrie, Java 2 Performance and Idiom Guide, Prentice Hall PTR, Upper Saddle River, NJ, 2000.

3. J.E. Moreira, S.P. Midkiff, M. Gupta, P.V. Artigas, M. Snir, and R.D. Lawrence, "Java programming for high- performance numerical computing", IBM Systems Journal 39, No. 1, 21-56, 2000.

4. W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes – The Art of Scientific Computing, Cambridge UP, Cambridge, 1989.

5. J.C. Schatzman and R. Donehower, “High-Performance Java Software Development”, Java Report, 6:2, pp. 24-42, 2001.

6. S. Wilson and J. Kesselman, Java Platform Performance: Strategies and Tactics, Addison-Wesley, Reading, Mass., 2000.
