Replication on Virtual Machines
Siggi Cherem
CS 717
November 23rd, 2004
Outline
1 Introduction
  The Java Virtual Machine
2 Napper, Alvisi, Vin - DSN 2003
  Introduction
  JVM as state machine
  Addressing non-determinism
  Implementation
  Experiments
3 Friedman, Kama - SRDS 2003
  Introduction
  Non-determinism
  Design and implementation
  Experimentation
JVM philosophy
Compile once, run everywhere
Java → bytecodes
bytecode = instruction set of Java Virtual Machine
One JVM for each architecture
High-level support
Memory management (garbage collection)
Multithreading support (monitors)
JVM to real machines
Internal components
  Dynamic class loader
  Interpreters vs. just-in-time (JIT) compilation
  Native methods (JNI)
Provided libraries
  Allocation and garbage collection
  User-level vs. native threads
Random technical details
A few characteristics
Compact bytecodes (202 instructions)
Types are preserved for safety, precise GC
Objects accessible through references
  Strong, soft, weak, phantom references
Objects can be shared
  Passed to new thread constructor
  Static fields
2 Napper, Alvisi, Vin - DSN 2003
General idea
Their work
Modify JVM to tolerate fail-stop failures.
Extends hypervisor-based fault-tolerance
Hypervisor model
Implement a virtual state machine over underlying hardware
Perform replica coordination in the hypervisor
State machine approach
Requirements
1 Determinism: defining replicas
2 Independence: implementing replicas
3 Choice replication: ensuring replication
4 Transparency: guaranteeing single output
A state machine
Read set → [ deterministic command ] → Write set
↓
output to environment
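As a rough illustration of the diagram above, a deterministic command can be modeled as a pure function from a read set to a write set plus an output, so two replicas fed the same read set produce identical results. The class and method names below are invented for this sketch and do not come from the paper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one state-machine step; all names are illustrative.
class StateMachineStep {
    // What a command produces: the variables it writes and what it emits.
    static final class Result {
        final Map<String, Integer> writeSet;
        final String output;
        Result(Map<String, Integer> writeSet, String output) {
            this.writeSet = writeSet;
            this.output = output;
        }
    }

    // A deterministic command: its result depends only on the read set.
    static Result increment(Map<String, Integer> readSet) {
        int x = readSet.get("x");
        Map<String, Integer> writeSet = new HashMap<>();
        writeSet.put("x", x + 1);
        return new Result(writeSet, "x=" + (x + 1));
    }
}
```

Running `increment` twice on the same read set gives equal write sets and outputs, which is the property replication relies on.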
A state machine for the JVM
Challenges
Non-determinism of commands
Replication of sequence of commands
Copying read-sets
Multithreading
Their approach: Bytecode execution engines (BEE)
A BEE is a state machine
JVM = set of BEEs, one for each application thread
Replication at BEE level
Sources of non-determinism
Some causes of non-determinism
  Asynchronous commands
  Non-deterministic commands
  Non-deterministic read sets
  Output to environment
Asynchronous commands
Definition
A command is asynchronous if it can appear anywhere in the BEE’s sequence of commands.
Examples
Hardware interrupts (not applicable to the JVM)
Asynchronous Java exceptions
  Fatal errors, e.g. no resources, deadlocks
  Killing another thread, i.e. Thread.stop()
Added restrictions
1 Fatal exceptions are not replicated
2 Threads must not call Thread.stop
Non-deterministic commands
Definition
A command is non-deterministic if its write set or output is not uniquely determined by the read-set values.
Example
Native calls: I/O, clock
Solution
Agreement between replicas on input environment and read set.
  Not possible! Input is outside the JVM's control
Backup must adopt the primary's write set
Restrict output on native methods
Non-deterministic commands
Added restrictions
3 Native methods produce deterministic output to environment
4 Native methods invoke other methods deterministically
Handling these conditions: splitting methods
void non_determ_read_write() {
    long r = read_clock();    /* non-deterministic input */
    printf("%ld\n", r);       /* output to the environment */
}
After splitting, the non-deterministic read and the deterministic write are separate methods:
long non_determ_read() {
    return read_clock();      /* only this part must be logged */
}
void determ_write(long r) {
    printf("%ld\n", r);       /* deterministic given r */
}
Non-deterministic read sets
Definition
A read set is non-deterministic if it contains a shared variable.
Examples
Invoking methods on shared objects
Storing objects in static references
Alternatives
Bookkeeping of shared data: an order of magnitude overhead
Lock acquisition ordering, needs data-race elimination
Exclusive access to all variables while thread is scheduled
Non-deterministic read sets
Added restrictions
5 Include one of the following
1 No data-races (protect any shared data with monitors)
2 Exclusive access to all shared variables
Example
class Example {
    static F shared = null;
    String toString() {
        if (null == shared) { shared = new F();
        synchronized_call();
        ...
Output to the environment
Definition
An output is idempotent if it is independent of the number of times a command is executed.
An output is testable when the environment can be tested for occurrence of output.
Examples
cd /home/siggi is idempotent, cd .. is not
cd .. is testable if pwd is available.
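The cd example can be sketched in a few lines. The class below is a made-up in-memory model of a working directory (it never touches the real file system), and `cdParentOnce` is a hypothetical helper showing how a test such as pwd lets a replayed, non-idempotent command take effect exactly once.

```java
// Hypothetical in-memory model of a shell's working directory.
class WorkingDir {
    private String cwd;

    WorkingDir(String start) { cwd = start; }

    // Idempotent: running it twice leaves the same state as running it once.
    void cdAbsolute(String target) { cwd = target; }

    // Not idempotent: every execution moves one level further up.
    void cdParent() {
        int slash = cwd.lastIndexOf('/');
        cwd = (slash <= 0) ? "/" : cwd.substring(0, slash);
    }

    // The "test": lets the environment be inspected for the command's effect.
    String pwd() { return cwd; }

    // Replay "cd .." only if pwd shows that it has not happened yet.
    void cdParentOnce(String dirBeforeCommand) {
        if (pwd().equals(dirBeforeCommand)) cdParent();
    }
}
```

Replaying `cdAbsolute("/home/siggi")` is harmless, while replaying `cdParent()` blindly would move the directory twice. With the check in `cdParentOnce`, a repeated call becomes a no-op.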
Definition
A state is volatile if it does not survive failure of the state machine.
Otherwise, it is stable.
Output to the environment
Solution
Only support idempotent and testable output.
Volatile data might be necessary for correct operation
Side effects handlers: replicate lost volatile state of the primary
Added restrictions
6 Output of native methods is idempotent or testable
7 Native methods annotated for volatile output
Implementation details
Extended JVM
New threads for
  Failure detection and backup initiation
  Transfer of logging information
Interact with other system threads: GC, finalization
Threading modification
Restriction (5.2) requires modifying multi-threading libraries
Sun’s JVM provides both native and green threads
Native threads are desired to run applications on SMP
Green threads are desired for portability
Implementation for non-deterministic commands
Initial work
Inspected and categorized all native methods by hand!
Found only 100 non-deterministic
Runtime support algorithm
Create table with (non-deterministic) method’s unique signature
Native call on primary triggers message to backup
Backup on recovery uses same values
Side effect handlers used for volatile state (e.g. file descriptor from an open command)
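The runtime support steps above might look roughly like the sketch below. It is a simplification under stated assumptions: a single in-memory queue stands in for both the per-signature table and the message channel to the backup, and the class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical sketch: the primary records each non-deterministic result and
// the backup reuses the recorded values instead of re-executing the call.
class NativeResultLog {
    private final Deque<Long> log = new ArrayDeque<>();

    // Primary: run the real (non-deterministic) call and log its result.
    long callOnPrimary(LongSupplier nativeCall) {
        long value = nativeCall.getAsLong();
        log.addLast(value); // stands in for the message sent to the backup
        return value;
    }

    // Backup during recovery: reuse logged values while any remain,
    // then fall back to executing the call for real.
    long callOnBackup(LongSupplier nativeCall) {
        Long logged = log.pollFirst();
        return (logged != null) ? logged : nativeCall.getAsLong();
    }
}
```

For example, if the primary calls `callOnPrimary(System::nanoTime)`, the backup's first `callOnBackup(...)` returns the same clock value rather than reading its own clock.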
Non-deterministic read sets: first approach
Replicated lock synchronization
Assumes (5.1), ensuring mutual exclusion
Defines lock acquisition record = (ti,lj,ti#,lj#)
  Locking thread (ti)
  Lock (lj)
  Relative order of lock acquisition
    thread acquire sequence number (ti#)
    lock acquire sequence number (lj#)
Primary creates (ti,lj,ti#,lj#)
Backup uses (ti,lj,ti#,lj#) to repeat ordering
Computing lock acquisition record
Defining record values
Not trivial
  Object address for lj: meaningless at replicas
  Order of events: might differ in primary and backup
Recursive definition for ti = (tp,k)
  tp is parent of ti
  ti is the k-th thread created w.r.t. its siblings
Use thread determinism for lj
  lj assigned the first time used
  Log map lj ↔ (ti,ti#)
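A minimal sketch of the recursive thread naming, assuming ids are rendered as dotted strings (a representation chosen only for this illustration). Because each thread creates its children in a deterministic order, the ids agree across replicas even when threads from different parents are created in a different global order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of replica-independent thread ids: a thread is named by
// its parent's id plus its creation index among that parent's children.
class ThreadIds {
    private final Map<String, Integer> childCount = new HashMap<>();

    // Returns the id (tp, k) of the next child created by the thread `parent`.
    String newChild(String parent) {
        int k = childCount.merge(parent, 1, Integer::sum); // k = 1, 2, 3, ...
        return parent + "." + k;
    }
}
```

On one replica the creation order might be main.1, main.2, main.1.1 and on another main.1, main.1.1, main.2. Both end up with the same three ids.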
Using lock acquisition record
Recovery algorithm
Case 1:
Backup thread ti tries to acquire lj, and the log contains r = (ti,lj,ti#,lj#)
⇒ Wait until we reach lj#
⇒ Remove r from log
Case 2:
Backup thread ti tries to acquire lj, and the log doesn’t contain (ti,lj,ti#,lj#)
⇒ Wait until log is empty (end of recovery protocol)
Using lock acquisition record
Recovery algorithm
Case 3:
Backup thread ti tries to acquire a lock with no id, and the log contains map lj ↔ (ti,ti#)
⇒ Assign lock primary’s lj
⇒ Remove map entry
Case 4:
Backup thread ti tries to acquire a lock with no id, and the log doesn’t contain map lj ↔ (ti,ti#)
⇒ Wait until either
  A thread ti’ assigns lj to the lock
  Log contains no more maps (assign fresh lj)
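Cases 1 and 2 can be approximated by a single decision function that says whether a backup thread may take a lock now or has to wait. This is only a sketch: it assumes lock ids are already assigned (so Cases 3 and 4 are not modeled), it returns a boolean instead of actually blocking, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Cases 1 and 2 on the backup.
class LockReplay {
    // One lock acquisition record (ti, lj, ti#, lj#) logged by the primary.
    static final class Rec {
        final String thread; final int lock; final int threadSeq; final int lockSeq;
        Rec(String thread, int lock, int threadSeq, int lockSeq) {
            this.thread = thread; this.lock = lock;
            this.threadSeq = threadSeq; this.lockSeq = lockSeq;
        }
    }

    final List<Rec> log = new ArrayList<>();
    private final Map<Integer, Integer> acquisitions = new HashMap<>(); // lj -> count so far

    // Returns true if the acquisition may proceed now, false if ti must wait.
    boolean tryAcquire(String thread, int lock, int threadSeq) {
        int next = acquisitions.getOrDefault(lock, 0) + 1;
        for (Iterator<Rec> it = log.iterator(); it.hasNext(); ) {
            Rec r = it.next();
            if (r.thread.equals(thread) && r.lock == lock && r.threadSeq == threadSeq) {
                if (r.lockSeq != next) return false; // Case 1: wait until lj# is reached
                it.remove();                         // Case 1: remove r from the log
                acquisitions.put(lock, next);
                return true;
            }
        }
        if (!log.isEmpty()) return false;            // Case 2: wait until the log is empty
        acquisitions.put(lock, next);                // recovery is over: proceed normally
        return true;
    }
}
```

A real implementation would park the thread and retry when the lock's sequence number advances. The boolean return value keeps the sketch single-threaded and easy to follow.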
Non-deterministic read sets: second approach
Replicated thread scheduling
Assumes (5.2), all shared data is protected
Defines thread scheduling record = (bn,pc,m,l#,tn)
  Code executed
    Current program counter (pc)
    Trace summary to get there (bn)
  Monitor uses (m)
  Thread was waiting on a lock (l#)
  Next scheduled thread (tn)
Log record on each context switch
Computing thread scheduling record
Defining program position
How many statements were executed?
Avoid counting each instruction
bn counts branches, jumps and invocations taken.
pc is program counter offset (not absolute address):
updated on every instruction!
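To see why the pair (bn, pc) pins down a program position, consider a toy three-instruction loop in which the same pc is visited on every iteration. The sketch below is purely illustrative (the loop, the class and the method names are made up): pc alone is ambiguous, while bn tells the iterations apart.

```java
// Hypothetical sketch: identify a point in a thread's execution by (bn, pc)
// rather than by a count of all executed instructions.
class ProgramPosition {
    long bn; // control transfers taken so far (branches, jumps, invocations)
    int pc;  // bytecode offset within the current method

    void step(int nextPc, boolean controlTransfer) {
        if (controlTransfer) bn++; // bn changes only on taken control transfers
        pc = nextPc;               // pc changes on every instruction
    }

    boolean reached(long targetBn, int targetPc) {
        return bn == targetBn && pc == targetPc;
    }

    // Executes a toy loop with offsets 0, 1, 2, where offset 2 jumps back to 0,
    // and returns how many instructions run before (targetBn, targetPc).
    static int instructionsUntil(long targetBn, int targetPc) {
        ProgramPosition p = new ProgramPosition();
        int executed = 0;
        while (!p.reached(targetBn, targetPc) && executed < 1000) {
            boolean jumpsBack = (p.pc == 2);
            p.step(jumpsBack ? 0 : p.pc + 1, jumpsBack);
            executed++;
        }
        return executed;
    }
}
```

Offset 1 in the first iteration is (0, 1) and is reached after one instruction. Offset 1 in the second iteration is (1, 1) and is reached after four.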
Computing thread scheduling record
Defining program position
What if preemption occurred inside a native method?
Can’t control preemption outside JVM
On recovery, preemption before native call?
Need to keep track of locks acquired
Locking done in JVM monitors
On recovery, preempt when m is reached
Interaction with system threads
Example
Heap shared with GC. GC not in Java!
Problems
ti acquires a lock at primary with no contention, but ti waits at backup
tn can enter at backup before it should!
⇒ User-level threads to force ti to stay
ti acquires a lock at backup with no contention, but ti waits at primary
⇒ Use m also to force rescheduling at backup
Replicated scheduling: final details
Wait and notifyAll()
Multiple threads awakened
Store the l# to preserve order in backup
Finishing recovery
Log becomes empty, last entry contains tn
Backup must schedule tn to reproduce interaction to environment.
Garbage Collection
Common problems
Soft/weak references
Primary and backup may diverge
⇒ Convert them to strong references
Finalizers should be no source of non-determinism
⇒ Replicate as before
Output to the environment
Side effects handlers
Store and recover volatile state
Ensure exactly-once semantics for output
Composed of 5 methods: register, test, log, receive, restore
Components of SE handlers
Method register
Provide method signature
Non-determinism flag
Output command flag
Arguments used for output
Method test
Used by backup
True if output command was successfully executed
Only defined for testable commands
Idempotent commands are replayed
Components of SE handlers
Method log
Used by primary after an output command
Saves arguments, return value and internal state
Produces a message with recovery information
Method receive
Used by backup to retrieve result of log
Can perform compaction of messages
Method restore
Used by backup only once to recover volatile state
Uses received messages
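One way to picture the five methods is as a Java interface. The signatures below are guesses made for illustration and are not the paper's actual API. `OffsetHandler` is a toy handler, also invented here, whose volatile state is a single file offset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical rendering of the five handler methods (signatures are assumed).
interface SideEffectHandler {
    // Describe the native method this handler covers.
    void register(String methodSignature, boolean nonDeterministic, boolean producesOutput);
    // Backup: true if the output command already took effect (testable output only).
    boolean test(Object[] args);
    // Primary: after an output command, build a message with recovery information.
    byte[] log(Object[] args, Object returnValue);
    // Backup: accept a message produced by log(); may compact older ones.
    void receive(byte[] message);
    // Backup: called once on recovery to rebuild the lost volatile state.
    void restore();
}

// Toy handler whose volatile state is a single file offset.
class OffsetHandler implements SideEffectHandler {
    private long latestOffset = -1; // newest offset received from the primary
    long restoredOffset = 0;        // volatile state rebuilt on the backup

    public void register(String sig, boolean nonDet, boolean output) { }

    // This toy output is treated as idempotent, so it is always replayed.
    public boolean test(Object[] args) { return false; }

    public byte[] log(Object[] args, Object returnValue) {
        return String.valueOf(returnValue).getBytes(StandardCharsets.UTF_8);
    }

    public void receive(byte[] message) {
        // Compaction: only the most recent offset matters.
        latestOffset = Long.parseLong(new String(message, StandardCharsets.UTF_8));
    }

    public void restore() {
        if (latestOffset >= 0) restoredOffset = latestOffset;
    }
}
```

If the primary logs offsets 10 and then 25, the backup keeps only 25, and a single `restore()` call brings its volatile state up to date.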
Experimental setting
Architecture and settings
Sun E5000 Servers. 15 400MHz UltraSPARC II CPUs.
2GB Mem. 100Mbps Ethernet
Primary and backup run on different machines
Log is kept at backup in volatile memory
Synchronization on each output (acks)
Interpreted mode, no JIT
3 scenarios: AL, TS, NoFT
Only green threads (native on SMP yield similar result)
Experimental setting
Benchmarks
SPEC JVM98 benchmarks
Results shown for 6 benchmarks
  compress: CPU intensive
  db: database, heavy on locking
  mtrt: the only multithreaded benchmark
Algorithms comparison
[Figure: running times under the two algorithms]
Overhead
[Figure: overhead of the lock acquisition algorithm]
Overhead
[Figure: overhead of the thread scheduling algorithm]
3 Friedman, Kama - SRDS 2003
Introduction
Another hypervisor
Built on top of Jikes RVM
Ignore native code
Support JIT
Jikes RVM
Almost all in Java
Yield points and time slices
Sources of non-determinism
Multithreading
Use deterministic scheduler (yield points)
Deterministic dequeuing
Data-races on SMPs
⇒ assume no data-races, enforce lock ordering
Design decisions
Frames
One frame lag between primary and backup
Synchronize with replicas before starting a new frame
Send all I/O results to replicas at start point
Send locks (on SMP) anywhere
Send non-deterministic read sets
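The frame idea can be sketched from the primary's side, under the assumption that a frame is a fixed number of context switches and that I/O results are batched per frame. The class, its fields and the in-memory list standing in for the network are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of frame-based replication: I/O results gathered during
// a frame are shipped together at the frame boundary.
class FramedPrimary {
    private final int switchesPerFrame;
    private int switches = 0;
    private List<String> currentFrame = new ArrayList<>();
    final List<List<String>> sentToBackup = new ArrayList<>(); // stands in for the network

    FramedPrimary(int switchesPerFrame) { this.switchesPerFrame = switchesPerFrame; }

    // Remember an I/O result obtained during the current frame.
    void recordIo(String result) { currentFrame.add(result); }

    // Called at every context switch; a frame ends after a fixed number of them.
    void onContextSwitch() {
        if (++switches == switchesPerFrame) {
            sentToBackup.add(currentFrame); // synchronize before starting a new frame
            currentFrame = new ArrayList<>();
            switches = 0;
        }
    }
}
```

Until a frame closes, the backup has not seen its I/O results, which is where the lag between primary and backup comes from.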
Frames example
Framing...
Implementation details
Replication engine
Additional module to Jikes RVM
Communication between primary and backup
Detection of failures
Election of new primary
Implementation details
Hurdles
JIT compilation
⇒ saving thread switch counter
Non-deterministic number of statements
⇒ also disable preemption
Garbage collection on SMP: cooperative threads
⇒ GC non-preemptive until all are done
Experimental setting
Benchmarks
Some SPEC JVM98 benchmarks and SciMark
scimark, compress, db, raytrace, mtrt
Variations on frame-size (number of context switches)
Compress
[Figure: compress benchmark]
Database
[Figure: db benchmark]
Raytrace
[Figure: raytrace benchmark]
Multithreaded Raytrace
[Figure: mtrt benchmark]
Replication Overhead
[Figure: replication overhead]
Final remarks
Summary
Common technique: hypervisor model
Restrictions to solve non-determinism
Support for SMPs
First paper main features
  SE handlers: native methods
Second paper main features
  Frames
    Lower synchronization
    Faster recovery