
Performance Evaluation


First, it should be mentioned that, due to the lack of proper debugger support, we used system calls to print debugging output onto the node's screen during development. These debugging outputs are guarded by a condition if (currentLevel <= DEBUGLEVEL). For the performance evaluation we set the constant DEBUGLEVEL to 0 so that no debugging information is printed. However, since the Baseline compiler does not perform optimizations such as dead-code removal, we still pay the overhead of a compare instruction for every debug statement inserted in the code. Removing all debug statements entirely should therefore make performance slightly better.
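As an illustration, such a guarded debug statement might look like the following sketch (DEBUGLEVEL and the condition are taken from the description above; the print call merely stands in for the system call actually used in the VM code):

    // Illustrative sketch of a guarded debug output, not the DJVM's actual helper.
    public class DebugSketch {
        static final int DEBUGLEVEL = 0;              // 0 disables all debug output

        static void debug(int currentLevel, String msg) {
            if (currentLevel <= DEBUGLEVEL) {         // this compare remains in the compiled code,
                System.err.println(msg);              // because the Baseline compiler performs
            }                                         // no dead-code removal
        }
    }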

In our performance test we measured the overhead of accessing different kinds of objects; Listing 6.1 shows the run() method of the worker thread used for the measurement.

Listing 6.1: Worker thread accessing objects

    public void run() {
        long[] elapsedTime = new long[2];
        for (int j = 0; j < elapsedTime.length; j++) {
            long start = System.nanoTime();
            for (int i = 0; i < objs.length; i++) {
                ++objs[i].id;
            }
            long end = System.nanoTime();
            elapsedTime[j] = end - start;
        }
        for (int j = 0; j < elapsedTime.length; j++) {
            System.out.println(elapsedTime[j]);
        }
    }

The first row in Table 6.1 shows the times for the unmodified JikesRVM when accessing 100, 1000 and 10000 objects, respectively. The access times for node-local objects in our DJVM are higher because of the additional software checks we added to the compiler. For node-local objects, a load instruction for the status header, a compare instruction that tests whether the object is shared, and finally a jump instruction are responsible for the overhead. Accesses to shared objects require an additional check of the invalid state (see Section 4.4.4) and are therefore slower than accesses to node-local objects. The faulting-in of a shared object is a costly operation that takes about 2 ms per object: a request message is sent to the home node, where the object data is copied into the response message; after receiving the reply, the requesting node allocates space for the object and deserializes the data. Once a shared object is cached, the access time is similar to that of shared objects located on their home node.

                                  100 objects   1000 objects   10000 objects
    original                          2.12 µs       15.93 µs       151.66 µs
    node-local                        7.03 µs       52.1 µs        554.51 µs
    shared home node                  7.45 µs       66.32 µs       584.79 µs
    shared non-home node (miss)        260 ms        2379 ms        23313 ms
    shared non-home node (cached)     8.45 µs       75.85 µs       701.79 µs

Table 6.1: Access on node-local and shared objects.
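To make the measured overhead concrete, the compiler-inserted checks described above correspond roughly to the following logic (a hedged sketch: in the real system these checks are emitted as inline machine instructions, and the names statusHeader, SHARED, INVALID and faultIn() are illustrative, not the DJVM's actual identifiers):

    // Hedged sketch of the compiler-inserted access checks described above.
    final class AccessCheckSketch {
        static final int SHARED  = 0x1;
        static final int INVALID = 0x2;

        static class Obj { int statusHeader; int id; }

        static int read(Obj o) {
            int status = o.statusHeader;          // load the status header
            if ((status & SHARED) != 0) {         // compare + jump: is the object shared?
                if ((status & INVALID) != 0) {    // shared objects: extra invalid-state check
                    faultIn(o);                   // miss: fetch the object from its home node (~2 ms)
                }
            }
            return o.id;                          // the actual field access
        }

        static void faultIn(Obj o) { /* request/response with the home node, then deserialize */ }
    }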

We also compared the execution time for allocating and starting a thread on the master node itself against distributing a thread to a remote node. The overhead is a result of the remote allocation and remote initialization messages that are sent to the remote node (see Section 5.5.1). In Table 6.2, we further show the time required for classloading on the master node and the time needed when a worker forwards the classloading to the master node. Finally, the last row in the table presents the execution time for a simple System.out.println() on the master node and on the worker node, which must copy the byte stream into a message and send it to the master node.

                          Master node   Worker node
    Thread allocation         0.46 ms      155.8 ms
    Classloading              3.38 ms      15 ms
    I/O redirection           0.11 ms      12.26 ms

Table 6.2: Overhead of thread allocation, classloading and I/O redirection.
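The I/O redirection row can be pictured with a small sketch: on a worker node, bytes written to System.out are copied into a message and sent to the master node. The stream class and the MessageSender interface below are hypothetical stand-ins for the DJVM's messaging layer, not its actual API.

    import java.io.IOException;
    import java.io.OutputStream;

    // Hedged sketch of worker-side I/O redirection as described above.
    class RedirectingOutputStream extends OutputStream {
        private final MessageSender sender;       // hypothetical messaging facade

        RedirectingOutputStream(MessageSender sender) {
            this.sender = sender;
        }

        @Override
        public void write(int b) throws IOException {
            write(new byte[] { (byte) b }, 0, 1);
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            byte[] payload = new byte[len];
            System.arraycopy(buf, off, payload, 0, len);
            sender.send(payload);                 // forward the byte stream to the master node
        }
    }

    interface MessageSender {
        void send(byte[] payload);                // hypothetical: wraps the bytes into a message
    }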

In the last performance test we ran a simple producer-consumer application. A writer thread synchronizes on an object that is located on the master node and writes into a field. It then sets a flag and notifies a reader thread, which reads the value and signals the writer to write new data. Listing 6.2 shows the run() method of the reader thread.

Listing 6.2: Reader thread

    public void run() {
        int[] values = new int[1000];
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            synchronized (obj) {
                while (!obj.isReadable()) {
                    try {
                        obj.wait();
                    } catch (InterruptedException e) {}
                }
                values[i] = obj.getCounter();
                obj.setWritable();
                obj.notify();
            }
        }
        long end = System.currentTimeMillis();
        long elapsed = end - start;
        System.out.println(elapsed);
    }
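The writer thread is not listed in the document; a run() method consistent with the protocol described above could look as follows (obj is the same synchronization object as in Listing 6.2, and the accessors isWritable(), incrementCounter() and setReadable() are assumptions mirroring the reader's accessors):

    public void run() {
        for (int i = 0; i < 1000; i++) {
            synchronized (obj) {
                while (!obj.isWritable()) {       // wait until the reader has consumed the value
                    try {
                        obj.wait();
                    } catch (InterruptedException e) {}
                }
                obj.incrementCounter();           // write new data into the field
                obj.setReadable();                // set the flag ...
                obj.notify();                     // ... and wake up the reader
            }
        }
    }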

We modified the load-balancing function of our scheduler to explicitly allocate a given thread on a particular node (a sketch of such a pinning rule follows the list below), so that we could test four different cases:

1. The reader and writer threads are both located on the master node.

2. The reader thread is located on the master node, while the writer thread resides on the worker node.

3. The reader thread is allocated on the worker node, the writer thread remains on the master node.

4. Both threads are allocated and started on the worker node.
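The modification itself is not shown here; a pinning rule along the following lines would suffice for the four cases above (all names are illustrative and do not reflect the DJVM's actual scheduler interface):

    // Hedged sketch of pinning the reader and writer threads to fixed nodes per test case.
    final class TestPlacementSketch {
        static final int MASTER_NODE = 0;
        static final int WORKER_NODE = 1;

        static int placeThread(String role, int testCase) {
            boolean readerOnWorker = (testCase == 3 || testCase == 4);
            boolean writerOnWorker = (testCase == 2 || testCase == 4);
            if ("reader".equals(role)) {
                return readerOnWorker ? WORKER_NODE : MASTER_NODE;
            }
            if ("writer".equals(role)) {
                return writerOnWorker ? WORKER_NODE : MASTER_NODE;
            }
            return MASTER_NODE;   // all other threads stay on the master for these tests
        }
    }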

Note that the synchronization object is located on the master node in all cases. Figure 6.1 shows the results of all four test cases. If both threads are allocated on the same node as the synchronization object, the execution time is less than 20 ms. However, when one thread is located on a worker node, the synchronization object must be invalidated on each acquire operation and updated before the next access. It becomes even more expensive if both threads are located on the worker node, which results in twice the number of updates and diff propagations. This example shows that a cache flush in the DSM is much more expensive than its local counterpart.

Conclusions and Future Work

In this last chapter we discuss the problems we encountered during our implementation of the distributed JVM. We also describe several optimizations to our Shared Object Space that improve performance, list some further work, and give a conclusion about our system at the end.

7.1 Problems

As mentioned in Section 5.3.5, the JikesRVM uses its compilers to generate native machine code from both the application bytecode and the VM code, and it uses its own runtime system to execute itself. On the one hand, significant performance improvements can be achieved through this close integration of the Java application and the VM object space. On the other hand, the boundary between application and VM objects is blurred, which causes difficulties, especially with respect to distributed shared memory.

To understand the problem more clearly, we first explain how the application and the VM actually interact. We have seen in Section 5.7 that the JikesRVM runtime system provides the Java application with the GNU Classpath Java library. When an application object is created, code from Classpath is executed. To interact with the VM, some library objects are used as adapters to VM objects. Consider Figure 7.1: if a Java application creates a java.lang.Thread object, a corresponding adapter object java.lang.VMThread is created. When the java.lang.Thread is finally started, it invokes a method on java.lang.VMThread, which in turn performs a call on the VM object org.jikesrvm.scheduler.VM_Thread.

Figure 7.1: Thread representation. A java.lang.Thread holds a field #vmThread of type java.lang.VMThread, which in turn holds a #vmdata reference to the VM-internal org.jikesrvm.scheduler.VM_Thread.
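The delegation chain of Figure 7.1 can be sketched schematically as follows (this is not the actual GNU Classpath/JikesRVM code; the class and method names merely mirror the figure):

    // Schematic sketch of the adapter chain in Figure 7.1; illustrative names only.
    final class ThreadAdapterSketch {
        static class VM_Thread {                    // stands in for org.jikesrvm.scheduler.VM_Thread
            void schedule() { /* enqueue on the node-local scheduler */ }
        }
        static class VMThread {                     // stands in for java.lang.VMThread (adapter)
            final VM_Thread vmdata = new VM_Thread();
            void start() { vmdata.schedule(); }     // adapter forwards to the VM object
        }
        static class LibraryThread {                // stands in for java.lang.Thread (Classpath)
            final VMThread vmThread = new VMThread();
            void start() { vmThread.start(); }      // an application call ends up in the VM
        }
    }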

The difficulties caused by the lack of a clear separation between VM and application objects are quite obvious. Some VM objects, such as org.jikesrvm.scheduler.VM_Thread, should always be considered local because they are used in the cluster node's local thread queue for scheduling; these objects should never become shared objects. However, if the application performs a call to a method defined in java.lang.Thread that is a non-home shared object, the corresponding method in the VM object will be invoked, resulting in a fault-in of the internal org.jikesrvm.scheduler.VM_Thread object from the home node. Problems arise for method calls such as join(), defined in java.lang.Thread: by executing this method, the calling thread waits until it is woken up when the other thread has terminated. In our example, however, the calling thread joins on a cached copy of the VM object that is only scheduled on its home node. The calling thread is never woken up, since the home node does not know that it has to wake up the joining thread on the other node. This issue also has a negative impact on the performance of our system: if a local VM object becomes a shared object, further synchronization operations on it result in unnecessary invalidations of the other cached non-home shared objects, so that they have to be updated. The cache-flush problem also appears when a static synchronized method belonging to the Java class library is called. The synchronization is done on the single instance of type Class, and because we defined all Class objects to be shared, this results in a cache flush of all non-home shared objects even when the static synchronized method was called within the VM (for example, using System.out.println(String s) within the VM code leads to a cache flush, since a static synchronized method is executed in the call chain).
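The join() scenario can be pictured with a short, hypothetical application fragment (this code is not from the thesis; it merely illustrates the situation described above):

    // Hypothetical application code illustrating the join() problem described above.
    public class JoinExample {
        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(new Runnable() {
                public void run() { /* distributed to a remote node by the DJVM */ }
            });
            worker.start();   // the Thread object (and its internal VM_Thread) becomes shared
            worker.join();    // the caller waits on a cached copy of the VM object; the home
                              // node never learns that it must wake this thread up again
        }
    }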

Fortunately, the internal org.jikesrvm.scheduler.VM_Thread object does not inherit from java.lang.Thread, whose instances are only started by the Java application (inside the VM, java.lang.Thread can of course be used, since the corresponding VM objects are created automatically). We therefore rely on the fact that every started java.lang.Thread object must be an application object. By introducing several exception cases based on the package name of a class, we try to prevent VM objects from becoming shared objects. This approach is not clean, because a java.lang.Integer object could be used either by the VM or by the application. Therefore, a real separation is encouraged.
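A package-name heuristic of this kind might be sketched as follows (the real check in the DJVM is not reproduced here; the prefix and method name are illustrative):

    // Hedged sketch of the package-name exception cases described above.
    final class SharingPolicySketch {
        static boolean mayBecomeShared(Object o) {
            String cls = o.getClass().getName();
            if (cls.startsWith("org.jikesrvm.")) {
                return false;          // VM-internal objects stay node-local
            }
            // Library classes such as java.lang.Integer remain ambiguous: instances may be
            // used by either the VM or the application, which is why this approach is not clean.
            return true;
        }
    }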

The issue of separating the heap for VM objects is known and several ideas have been discussed (see http://jira.codehaus.org/browse/RVM-399), but no real implementation has been done so far. In [29] the approach is based on name-space separation by the classloader. In particular, they make use of two initial classloaders: the bootstrap classloader, which loads the initial bootstrap code and the Java class libraries for the applications' use, and the VM classloader, which is used for loading the VM classes and all classes used by the VM, including the Java class library. By preventing the VM from passing the VM classloader object to the application code, the application cannot access any of the VM classes, so that the required isolation is achieved. By introducing an interface layer called the Mu layer between the VM and the application, the application is able to interact with the VM through so-called Mu objects.

It should be mentioned that JavaSplit pursues a similar approach: since its added runtime logic is also written in Java, it faces similar issues in separating VM and application objects. In JavaSplit, the whole class hierarchy is instrumented and then replicated in a sub-space of the class name space by adding the "javasplit" prefix [16]. The application uses this replica transparently, while the instrumentation code and the DSM operate on the original class hierarchy. As the latest version of the dJVM (see Section 3.3.4) also uses bytecode rewriting techniques, we assume that it instruments the application code in a similar way.

Since the JikesRVM only supports partial debugging with the GNU debugger (GDB) by inspecting object methods within the boot image, debugging methods that are compiled during execution was difficult. Especially in our distributed and multi-threaded environment, a proper debugging environment is desirable. A Google Summer of Code 2008 project has been announced to implement Java Debug Wire Protocol and Java Virtual Machine Tool Interface support.

