
4.2 Home-Based Lazy Release Consistency Model

4.2.1 Adaptation to Our DSM

Our DSM is based on HLRC. Since our DSM is object-based rather than page-based, every object is assigned to a home node, which is the node from where the object’s reference is initially exposed across node boundaries. When an object2 is requested from another node, a copy of the requested object is created on the non-home node. Instead of creating the twin as a separate object as proposed in [32] and [20], we place the twin’s data at the end of the requested copy, which saves us two words per shared object since we can omit the header for the twin.

We adapted our cache coherence protocol to be similar to the JMM (see Section 2.3.2). Upon the release of a lock, diffs are computed for all written non-home-node objects by comparing each field of the copy with its twin. The updates are then sent in a message to the corresponding home node, which finally applies them to the home-node object. This procedure corresponds exactly to the cache flush in the JMM. When a lock is acquired, the JMM states that all copies in the working memory of a thread must be invalidated [21]. In our DSM, this means that every non-home-node object must be invalidated such that further reads or writes to an invalid object cause the runtime system to fetch the object’s latest value from its home node.

1 In the following, these terms can be substituted with each other.
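
To make the release and acquire steps concrete, the following is a minimal sketch of this protocol. All names (CachedObject, DsmRuntime, sendDiffToHome) are illustrative rather than the actual implementation; fields are modelled as an int array for simplicity, and the twin is shown as a separate array even though our design appends the twin’s data to the copy itself.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a cached (non-home) copy of a shared object.
    final class CachedObject {
        long guid;           // globally unique ID (see Section 4.4.2)
        int homeNodeId;      // node holding the home copy
        int[] fields;        // current field values of the local copy
        int[] twin;          // snapshot taken when the copy was faulted in
        boolean valid;       // cleared on acquire, set again on fault-in

        // Diff = all fields whose value differs from the twin.
        Map<Integer, Integer> computeDiff() {
            Map<Integer, Integer> diff = new HashMap<>();
            for (int i = 0; i < fields.length; i++) {
                if (fields[i] != twin[i]) {
                    diff.put(i, fields[i]);
                }
            }
            return diff;
        }
    }

    final class DsmRuntime {
        // Lock release: flush diffs of written non-home copies to their home
        // nodes, the analogue of the JMM cache flush.
        void release(Iterable<CachedObject> writtenCopies) {
            for (CachedObject o : writtenCopies) {
                Map<Integer, Integer> diff = o.computeDiff();
                if (!diff.isEmpty()) {
                    sendDiffToHome(o.homeNodeId, o.guid, diff);
                    o.twin = o.fields.clone();   // copy and home are in sync again
                }
            }
        }

        // Lock acquire: invalidate all non-home copies so that the next access
        // faults the latest value in from the home node.
        void acquire(Iterable<CachedObject> nonHomeCopies) {
            for (CachedObject o : nonHomeCopies) {
                o.valid = false;
            }
        }

        private void sendDiffToHome(int nodeId, long guid, Map<Integer, Integer> diff) {
            // encode the diff into a message and hand it to the
            // communication subsystem (Section 4.3)
        }
    }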

4.3 Communication

Since our Shared Object Space is embedded into a cluster-based JVM, we have to make JikesRVM cluster-aware by implementing a communication subsystem that is responsible for sending and receiving messages between the nodes in the cluster. Our CommManager is an interface that hides the actual underlying communication mechanism. Currently, we provide two different communication channels.

• Raw sockets: Provide fast communication by encapsulating messages directly in Ethernet frames.

• TCP/IP sockets: Provide reliable communication between the cluster nodes.

Under the assumption that all cluster nodes are located within the same subnet of a local area network, we provide a communication substrate based on raw sockets. Because messages sent over these sockets are encoded directly in Ethernet frames, we avoid the overhead of passing through the TCP/IP stack of the operating system. As shown in a comparison in [6], raw sockets can even outperform UDP sockets, which, compared to TCP, do not provide reliability either. Broadcasting is another benefit of raw sockets since only one message3 must be sent to address all nodes in the cluster. However, since raw sockets lack reliability, we also provide a communication channel based on TCP/IP sockets4.

3 By using the broadcast MAC address FF:FF:FF:FF:FF:FF.
4 TCP is reliable because lost messages are retransmitted and messages are received in order.
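
The following is a minimal sketch of how such a channel abstraction could look behind the CommManager; the Channel interface and all method signatures are assumptions for illustration, not the actual implementation.

    import java.io.IOException;

    // A channel abstracts the underlying transport (raw Ethernet frames or TCP/IP).
    interface Channel {
        void send(int destNodeId, byte[] packet) throws IOException;
        void broadcast(byte[] packet) throws IOException; // raw sockets: one frame to FF:FF:FF:FF:FF:FF
        byte[] receive() throws IOException;              // blocks until the next packet arrives
    }

    // The CommManager hides which of the two channel implementations is in use;
    // the rest of the runtime only talks to this facade.
    final class CommManager {
        private final Channel channel;

        CommManager(Channel channel) { this.channel = channel; }

        void send(int destNodeId, byte[] packet) throws IOException {
            channel.send(destNodeId, packet);
        }

        void broadcast(byte[] packet) throws IOException {
            channel.broadcast(packet);
        }
    }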

Our cluster supports a static set of nodes. At startup, after booting the CommManager, the master node assumes the role of a coordinator and waits until all worker nodes have established a connection to the master node5. When all worker nodes have set up a connection, the master node sends out the unique node IDs that are assigned to each node together with the nodes’ MAC or IP addresses. When using TCP/IP sockets, the worker nodes additionally establish communication channels among each other so that any two nodes can send messages to and receive messages from each other.
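
A minimal sketch of this coordination step on the master node is given below, assuming TCP/IP sockets; all names are hypothetical and the sketch omits details such as announcing the master’s own address.

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.ArrayList;
    import java.util.List;

    final class MasterBootstrap {
        // Wait until all workers are connected, then send each worker its node ID
        // together with the number of nodes and the addresses of all workers.
        void coordinate(int expectedWorkers, int port) throws IOException {
            List<Socket> workers = new ArrayList<>();
            try (ServerSocket server = new ServerSocket(port)) {
                while (workers.size() < expectedWorkers) {
                    workers.add(server.accept());      // block until the next worker connects
                }
            }
            for (int i = 0; i < workers.size(); i++) {
                DataOutputStream out = new DataOutputStream(workers.get(i).getOutputStream());
                out.writeByte(i + 1);                  // unique node ID (0 is the master)
                out.writeByte(workers.size() + 1);     // total number of nodes
                for (Socket peer : workers) {          // worker addresses for later peer channels
                    out.writeUTF(peer.getInetAddress().getHostAddress());
                }
                out.flush();
            }
        }
    }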

4.3.1 Protocol

For the serialization of our messages, we avoid the standard object serialization provided by the Java API because it encapsulates meta information in the serialized object bytestream in order to reconstruct the object on the receiver side, which leads to unnecessary overhead in the communication traffic. Instead, we designed our own protocol consisting of a 15-byte header that is sufficient to reconstruct the message on the receiving node.

Field             Size
Message Type      1 byte
Message Subtype   1 byte
Node-ID           1 byte
Packet Number     2 bytes
Total Packets     2 bytes
Message-ID        4 bytes
Message Length    4 bytes

Table 4.1: Protocol header.

The Message Type field defines the type of the message, such as classloading, I/O, object synchronization, etc. The Message Subtype field specifies the exact kind of message, e.g. the ObjectLockAcquire subtype that is sent when a remote lock for an object should be acquired. The Node-ID contains the unique ID of the sending cluster node.

Since a message cannot exceed a length of 1500 bytes when using raw sockets, every message larger than this limit is split into several packets. The Packet Number identifies the current packet of a possibly split message, whereas the Total Packets field states the total number of packets the split message consists of. Finally, we have two 4-byte fields: the Message-ID, which is used for messages requiring an acknowledgment, and the Message Length, which defines the length of the message excluding the 15-byte header per packet.
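
A minimal sketch of encoding and decoding this header is shown below; the field names follow Table 4.1, while the class itself and the use of ByteBuffer are illustrative assumptions.

    import java.nio.ByteBuffer;

    // The 15-byte protocol header of Table 4.1.
    final class MessageHeader {
        static final int SIZE = 15;
        static final int MTU = 1500;     // raw-socket packet limit, header included

        byte type, subtype, nodeId;
        short packetNumber, totalPackets;
        int messageId, messageLength;    // length excludes the per-packet header

        void writeTo(ByteBuffer buf) {
            buf.put(type).put(subtype).put(nodeId);
            buf.putShort(packetNumber).putShort(totalPackets);
            buf.putInt(messageId).putInt(messageLength);
        }

        static MessageHeader readFrom(ByteBuffer buf) {
            MessageHeader h = new MessageHeader();
            h.type = buf.get();
            h.subtype = buf.get();
            h.nodeId = buf.get();
            h.packetNumber = buf.getShort();
            h.totalPackets = buf.getShort();
            h.messageId = buf.getInt();
            h.messageLength = buf.getInt();
            return h;
        }
    }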

When sending a message, it is encoded into a buffer that is obtained from a buffer pool set up by the CommManager to avoid an allocation upon each use. If the buffer pool becomes empty, new buffer objects are allocated to expand the pool. A filled buffer contains all necessary data to reconstruct the message object on the receiving node. By serializing the messages ourselves, we reduce the communication traffic for each transmitted message compared to the standard object serialization API in Java. For every socket, a MessageDispatcher thread waits in a loop to receive messages. This thread dispatches all incoming packets by first reading the 15 bytes of the protocol header and then assembling the packets into a new message object, which can then be processed by calling its process() method (more details are given in Section 5.1).
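
The following sketch illustrates such a dispatcher loop, reusing the Channel and MessageHeader types from the sketches above. It assumes that packet numbers start at 0 and omits the buffer pool; apart from the MessageDispatcher name and the idea of a process() method, everything is illustrative.

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    // Per-socket dispatcher: fragments belonging to the same Message-ID are
    // collected until all packets have arrived, then the message is processed.
    final class MessageDispatcher implements Runnable {
        private static final int PAYLOAD_PER_PACKET = MessageHeader.MTU - MessageHeader.SIZE;

        private final Channel channel;                              // see the CommManager sketch
        private final Map<Integer, ByteBuffer> bodies = new HashMap<>();
        private final Map<Integer, Integer> seenPackets = new HashMap<>();

        MessageDispatcher(Channel channel) { this.channel = channel; }

        public void run() {
            try {
                while (true) {
                    ByteBuffer packet = ByteBuffer.wrap(channel.receive());
                    MessageHeader h = MessageHeader.readFrom(packet);   // consume the 15-byte header
                    ByteBuffer body = bodies.computeIfAbsent(
                            h.messageId, id -> ByteBuffer.allocate(h.messageLength));
                    body.position(h.packetNumber * PAYLOAD_PER_PACKET);
                    body.put(packet);                                   // copy this fragment's payload
                    int seen = seenPackets.merge(h.messageId, 1, Integer::sum);
                    if (seen == h.totalPackets) {                       // all fragments have arrived
                        bodies.remove(h.messageId);
                        seenPackets.remove(h.messageId);
                        processMessage(h, body.array());
                    }
                }
            } catch (java.io.IOException e) {
                // the socket was closed, e.g. during shutdown
            }
        }

        private void processMessage(MessageHeader header, byte[] payload) {
            // construct the message object according to Message Type/Subtype
            // and invoke its process() method (see Section 5.1)
        }
    }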

4.4 Shared Objects

In our DJVM, the object heaps of all nodes contribute to a global object heap that we call the Shared Object Space. Before we explain the design of our Shared Object Space, we discuss some memory-related issues by defining shared objects, explaining their benefits, and showing how they are detected to become part of the Shared Object Space.

4.4.1 Definition

An object is said to be reachable if there is a reference to it on a Java thread’s stack or if there exists a path to it in a connectivity graph6. When an object is reached by only one thread, that object is considered a thread-local object, and since only one thread operates on it, synchronization is not an issue, whereas objects reached by several threads need to be synchronized (cf. Section 2.3.2). In a distributed JVM, an object can also be reached from Java threads that reside on a different node in the cluster, and therefore we distinguish between (based on [16, 17])

• node-local objects that are only reached by one or multiple threads within a single node and
• shared objects that are reached by at least two threads located on different nodes.

This means that when an object is exposed to a thread residing on a different node than the object itself, the object becomes a shared object.

6 Two objects are considered connected when one object contains a reference to the other. A connectivity graph is formed by the transitive relationships of connected objects.

4.4.2 Globally Unique ID

In a JVM running on a single node, every object in the heap can be reached by loading its memory address. In a distributed JVM, though, locating an object by a memory address is not sufficient since an object o1 at a specific address on node N1 is not necessarily located at the same address on another node N2. By definition, shared objects are required to be reachable from any node. For that purpose, every shared object is assigned a Globally Unique Identifier (GUID). Every node maintains a GUID table that maps an object’s local address to its GUID, which allows pointer translation across node boundaries.

A GUID consists of 64 bits, containing a Node-ID part and an Object-ID (OID) part. From the most significant to the least significant bit, the layout is: 1 unused bit, 2 bits for the Node-ID, and 61 bits for the Object-ID.

We leave the most significant bit unused; it can be used freely, for example for garbage collection. Since our DJVM is a prototype of a distributed runtime system designed for small clusters, we use only 2 bits for addressing at most 4 different nodes, leaving 61 bits for the OID. However, our design is flexible, so this can be changed by modifying a constant. As discussed in Section 4.2.1, every shared object is assigned to a home node. Therefore, the information encapsulated in a GUID suffices to locate an object on a particular node.
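
A minimal sketch of packing and unpacking a GUID with this bit layout is given below; the constants mirror the description, while the Guid class itself is illustrative.

    // 64-bit GUID: 1 unused bit | 2-bit Node-ID | 61-bit Object-ID.
    final class Guid {
        static final int NODE_ID_BITS = 2;               // at most 4 nodes in the prototype
        static final int OID_BITS = 61;
        static final long OID_MASK = (1L << OID_BITS) - 1;

        static long pack(int nodeId, long oid) {
            return ((long) nodeId << OID_BITS) | (oid & OID_MASK);
        }

        static int nodeId(long guid) {
            return (int) ((guid >>> OID_BITS) & ((1 << NODE_ID_BITS) - 1));
        }

        static long objectId(long guid) {
            return guid & OID_MASK;
        }
    }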

4.4.3 Detection

By the definition in Section 4.4.1, an object becomes a shared object if two threads located on different nodes reach the object. Because of Java’s strong type system, every object on the heap and every value on a thread’s stack is associated with a particular type that is either a reference type or a primitive type. Since the JikesRVM runtime system keeps track of all types, it is easy to identify which variables are reference types. By creating an object connectivity graph, we can detect at runtime whether a particular object is reachable from a thread’s stack. When a thread is scheduled on a different node in the cluster (cf. Section 4.6), all objects reachable from the thread’s stack become shared objects since they are reachable from a different node, too.

Marking all reachable objects as shared whenever an object becomes shared leads to several performance issues since objects with a large connectivity graph have to be traversed. We also encountered problems when objects that reference VM-internal objects were marked as shared. Because we cannot distinguish VM objects from application objects in the JikesRVM (see Section 7.1 for more details), we use a lazy detection scheme for shared objects, similar to the one described in [18], along with explicit exceptions that prevent certain objects from ever becoming shared. Sharing objects lazily also makes sense because the maintenance cost of a node-local object is lower and because it cannot be known in advance whether an object will ever be accessed by the thread that reaches it.

When a thread is distributed from a sender node to a receiver node, we mark only the objects directly reachable from the thread’s stack as shared objects and assign a GUID to each of them. When the thread is sent to the receiver node, the GUIDs belonging to the references of the thread object are transmitted along with it. On the receiving side, we reconstruct the thread object and look up the received GUIDs in the GUID table, which maps a GUID to a local memory address. If a GUID is found, the local memory address of the object belonging to the GUID is written into the corresponding object reference field. If the GUID is not found, the object has never been seen on this node before and the object reference field is replaced by an invalid reference. Our software DSM detects accesses to such dangling pointers, which triggers a request for the corresponding object from its home node, i.e. a copy of the object is faulted in from the home node.
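
The pointer translation on the receiving side can be sketched as follows; the class and the representation of addresses as long values are assumptions for illustration.

    import java.util.HashMap;
    import java.util.Map;

    // Known GUIDs resolve to local addresses; unknown GUIDs become invalid
    // ("dangling") references that later trigger a fault-in from the home node.
    final class GuidTable {
        static final long INVALID_REF = 0L;               // detected on access by the DSM

        private final Map<Long, Long> guidToLocalAddress = new HashMap<>();

        void register(long guid, long localAddress) {
            guidToLocalAddress.put(guid, localAddress);
        }

        // Translate one received reference field.
        long translate(long guid) {
            Long addr = guidToLocalAddress.get(guid);
            return addr != null ? addr : INVALID_REF;     // unknown object: install dangling pointer
        }
    }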

Consider the example shown in Figure 4.2. Thread T1 has a reference to the root object a. Thread T2 is created and a reference to object c is passed to it (Figure 4.2(a)). Since object c is reachable by thread T2, c is detected as a shared object when T2 is distributed and started on node 2 (Figure 4.2(b)). Note that objects a and b remain local objects since they are only reached by the local thread T1. Upon the first access to c by T2, object c is requested from node 1. Because object c has references to d and e, these become shared objects as well (Figure 4.2(c)). Node 2 receives the GUIDs of d and e and a copy of c for which dangling pointers are installed, representing remote references to d and e. When T2 accesses object d in a later step, d is faulted in from node 1 (Figure 4.2(d)).

Figure 4.2: (a) Initial object connectivity graph. (b) Distribution of T2. (c) Fault-in of object c. (d) Fault-in of object d.

4.4.4 States

As seen in Section 4.2.1, shared objects can be in different states, as shown in Figure 4.3. Non-home shared objects can become invalid after an acquire operation, requiring the object to be faulted in on further reads or writes. Since only modified non-home objects must be written back to their corresponding home node upon releasing a lock, we set a bit in the shared object’s header upon a write operation, which ensures that diffs are computed only for non-home shared objects that were actually changed. After the diffs have been propagated to the home nodes, the write bit is cleared.

Figure 4.3: Shared object states.
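
The concrete states of Figure 4.3 are not reproduced here; the following sketch only illustrates the write-bit handling just described, with assumed state and field names.

    // Write-bit bookkeeping in a shared object's header (illustrative).
    final class SharedObjectHeader {
        boolean homeNode;      // true on the object's home node
        boolean valid = true;  // non-home copies are invalidated on acquire
        boolean writeBit;      // set on the first write to a non-home copy

        void onWrite()   { writeBit = true; }
        void onAcquire() { if (!homeNode) valid = false; }

        // Called on release: only copies with the write bit set produce diffs.
        boolean needsDiff() { return !homeNode && writeBit; }
        void onDiffsPropagated() { writeBit = false; }
    }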

4.4.5 Conclusion

By combining the object heaps of several JVMs, we form a single virtual heap that consists of our so-called Shared Object Space and a local heap per node. Shared objects are moved into the Shared Object Space when they are reachable from at least two threads on two different nodes. Node-local objects remain in the local heap. This separation of the heap differs from other approaches described in Chapter 3, where usually all objects are allocated in a single global object heap. The lazy detection scheme helps us to gain performance when dealing with memory consistency issues. As specified in the JMM, a Java thread’s working memory is invalidated upon a lock and the cached copies must be written back to main memory upon an unlock. In our context this means that these operations are triggered when a node-local object is acquired or released. Furthermore, in our DJVM we have to consider distributed memory consistency since changes to cached copies must be written back to their home nodes: when a shared object is locked or unlocked, these operations are much more expensive than their local counterparts. By sharing only objects that are currently accessed by a remote thread, we reduce the communication traffic of the distributed cache coherence protocol (see Section 5.3.4).

4.5 Distributed Classloading

The JikesRVM runtime system keeps track of all loaded types. When a type is loaded by the classloader, so-called meta-objects such as VM_Class, VM_Array, VM_Field, VM_Method, etc., describing classes and interfaces, arrays, fields, and methods, respectively, are created. Therefore, whenever classes are loaded into the JVM, the type information maintained by each cluster node is altered. For our DJVM we make use of a centralized classloader (similar to [36]) that helps to achieve a consistent view of all loaded types on all nodes in the cluster by replicating these meta-objects to the worker nodes. Loading all classes on a centralized node simplifies the coordination, but it also has the disadvantage of creating a bottleneck. However, since classloading becomes less frequent over the runtime of a long-running application, we believe that this should not affect performance. Once classes have been loaded, they can be compiled and instantiated locally.

Since the JikesRVM is written in Java, a set of classes is needed for bootstrapping. These classes are written into the JikesRVM bootimage, are considered initialized, and do not need to be loaded again. When the JikesRVM is booted, some additional types are loaded by a special bootstrap classloader to set up the JikesRVM runtime system. After booting, the types of the Java application are loaded by the application classloader. In our DJVM, the cluster is set up during the boot time of the VM. Therefore, to achieve a consistent view on all cluster nodes, we must further divide the boot process into two phases:

1. Before the cluster is set up, all types are loaded that must be available before a node can join the cluster.
2. After the cluster is set up, some additional types needed for the VM runtime system are loaded.

To achieve a consistent view of the type information on all cluster nodes before the cluster is set up, we add all classes that are loaded before the cluster boots to the bootimage. Since only one thread is running during the boot process, we can guarantee consistent class definitions on all nodes at this point. After the cluster is set up, consistency is maintained by the centralized classloading mechanism described above.

Figure 4.4: Classloading states. A Java class file passes through load(), resolve(), instantiate(), and initialize() to reach the loaded, resolved, instantiated, initializing, and initialized states of its VM_Class object.

When a class that is not located in the bootimage is loaded, its Java .class file is read from the filesystem and a corresponding VM_Class object is created in the VM and put into the loaded state, as shown in Figure 4.4. Resolution is the process in which all (static) fields and methods of the class are resolved and linked against the VM_Class object (state resolved). After compiling all static and virtual methods, the object is considered instantiated. Initialization finally invokes the static class initializer, which is done only once per class. During initialization the VM_Class object is put into the initializing state; after the process has completed, the state changes to initialized.

Because the JikesRVM stores all static variables in a special table, all classloading phases except instantiation must be performed by the master node. The master node executes the respective classloading phase and replicates the resulting type information to the cluster nodes. The instantiation process can be done locally on any node since no entries are inserted into the static table. An exception is the initialization phase of classes that are considered local to each node, such as runtime-supporting classes (see Section 5.3.5). If such a class is initialized, the initialization is also done locally on each node since the class is intended for internal management purposes and is therefore not considered global7.
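
The rules above can be summarized in a small sketch; the enum and method names are illustrative and do not correspond to the actual JikesRVM code.

    // Which classloading phases run on this node rather than on the master?
    enum Phase { LOAD, RESOLVE, INSTANTIATE, INITIALIZE }

    final class DistributedClassLoading {
        private final boolean isMaster;

        DistributedClassLoading(boolean isMaster) { this.isMaster = isMaster; }

        // Instantiation touches no entries of the static table and may run anywhere;
        // all other phases run on the master, which replicates the resulting
        // meta-objects (VM_Class, VM_Field, ...) to the worker nodes. Node-local
        // runtime-support classes are additionally initialized on each node.
        boolean runsLocally(Phase phase, boolean nodeLocalClass) {
            if (phase == Phase.INSTANTIATE) return true;
            if (phase == Phase.INITIALIZE && nodeLocalClass) return true;
            return isMaster;
        }
    }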

4.6 Scheduling

To gain performance in a cluster-aware JVM, threads of the Java application are distributed automatically to the nodes in the cluster. In our DJVM we use a load balancing mechanism based on the Linux Load Average [5], a time-dependent average of the CPU utilization reported for the last one, five, and fifteen minutes. Every worker node periodically reports its Load Average of the last minute to the master node, i.e. after a certain time period a message containing the Load Average is sent to keep the CPU utilization information up to date. Load balancing is achieved by distributing Java application threads to the node in the cluster with the lowest load. Upon the start of an application thread, the master node is contacted to obtain the ID of the cluster node with the lowest Load Average. The thread is then copied to that cluster node, where it is finally started. The node, to which the thread
