Replication on Virtual Machines

(1)

Siggi Cherem

CS 717

November 23rd, 2004

(2)

Outline

1 Introduction
    The Java Virtual Machine
2 Napper, Alvisi, Vin - DSN 2003
    Introduction
    JVM as state machine
    Addressing non-determinism
    Implementation
    Experiments
3 Friedman, Kama - SRDS 2003
    Introduction
    Non-determinism
    Design and implementation
    Experimentation

(3)

Outline

1 Introduction

(4)

JVM philosophy

Compile once, run everywhere
Java → bytecodes
    bytecode = instruction set of Java Virtual Machine
    One JVM for each architecture

High-level support

Memory management (garbage collection)
Multithreading support (monitors)

(5)

JVM to real machines

Internal components

Dynamic class loader
Interpreters vs. Just in Time compiling (JIT)
Native methods (JNI)

Provided libraries

Allocation and garbage collection
User-level vs. native threads

(6)

Random technical details

A few characteristics

Compact bytecodes (202 instructions)
Types are preserved for safety, precise GC
Objects accessible through references
    Strong, soft, weak, phantom references
Object can be shared
    Passed to new thread constructor
    Static fields

(7)

Outline

2 Napper, Alvisi, Vin - DSN 2003

(8)

General idea

Their work

Modify JVM to tolerate fail-stop failures.

Extends hypervisor-based fault-tolerance

Hypervisor model

Implement a virtual state machine over underlying hardware

Perform replica coordination in the hypervisor

(9)

State machine approach

Requirements

1 Determinism: defining replicas

2 Independence: implementing replicas

3 Choice replication: ensuring replication

4 Transparency: guaranteeing single output

A state machine

Read set → [ deterministic command ] → Write set

↓

output to environment
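The diagram above can be read as a small program. The sketch below is only an illustration, assuming a toy key/value state; CounterMachine and StateMachineDemo are hypothetical names, not taken from the paper. It shows why determinism matters: two replicas fed the same commands end in the same state and produce the same outputs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy state machine; names are illustrative only.
class CounterMachine {
    private final Map<String, Long> state = new HashMap<>();

    // Deterministic command: the write set (new value of key) and the
    // output depend only on the read set (old value of key) and the argument.
    long add(String key, long delta) {
        long old = state.getOrDefault(key, 0L); // read set
        long updated = old + delta;             // deterministic computation
        state.put(key, updated);                // write set
        return updated;                         // output to environment
    }

    Map<String, Long> snapshot() {
        return new HashMap<>(state);
    }
}

class StateMachineDemo {
    // Feeds the same command sequence to two replicas and compares them.
    static boolean replicasAgree() {
        CounterMachine primary = new CounterMachine();
        CounterMachine backup = new CounterMachine();
        String[] keys = {"a", "b", "a"};
        long[] deltas = {1, 5, 2};
        for (int i = 0; i < keys.length; i++) {
            long outPrimary = primary.add(keys[i], deltas[i]);
            long outBackup = backup.add(keys[i], deltas[i]);
            if (outPrimary != outBackup) {
                return false;
            }
        }
        return primary.snapshot().equals(backup.snapshot());
    }
}
```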

(10)

A state machine for the JVM

Challenges

Non-determinism of commands
Replication of sequence of commands
Copying read-sets
Multithreading

Their approach: Bytecode execution engines (BEE)

A BEE is a state machine
JVM = set of BEEs, one for each application thread
Replication at BEE level

(11)

Sources of non-determinism

Some causes of non-determinism

Asynchronous commands
Non-deterministic commands
Non-deterministic read sets
Output to environment

(12)

Asynchronous commands

Definition

A command is asynchronous if it can appear anywhere in the BEE’s sequence of commands.

Examples

Hardware interrupts, not for JVM
Asynchronous Java exceptions
    Fatal errors, e.g. no resources, deadlocks
    Killing another thread, i.e. Thread.stop()

Added restrictions

1 Fatal exceptions are not replicated

2 Threads must not call Thread.stop

(13)

Non-deterministic commands

Definition

A command is non-deterministic if its write set or output is not uniquely determined by the read-set values.

Example

Native calls: I/O, clock

Solution

Agreement between replicas on input environment and read-set.
    Not possible! Input is outside JVM’s control
    Backup must adopt primary write-set
Restrict output on native methods

(14)

Non-deterministic commands

Added restrictions

3 Native methods produce deterministic output to environment

4 Native methods invoke other methods deterministically

Handling these conditions: splitting methods

void non_determ_read_write() {
    long r = read_clock();
    printf("%ld\n", r);
}

(15)

Non-deterministic commands

Handling these conditions: splitting methods

long non_determ_read() {
    return read_clock();
}

void determ_write(long r) {
    printf("%ld\n", r);
}
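Once the method is split, only the read half needs replica coordination. The following sketch assumes a simple in-memory log shared between primary and backup; ReplayLog and SplitClock are hypothetical names, and the real system sends log messages over the network. The primary records each clock value, and the backup adopts the recorded values during recovery instead of reading its own clock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical names throughout; this is not the paper's implementation.
class ReplayLog {
    private final Deque<Long> values = new ArrayDeque<>();

    void record(long v) { values.addLast(v); }
    boolean isEmpty() { return values.isEmpty(); }
    long next() { return values.removeFirst(); }
}

class SplitClock {
    private final LongSupplier clock;  // stands in for the native clock read
    private final ReplayLog log;
    private final boolean recovering;  // true on a backup replaying the log

    SplitClock(LongSupplier clock, ReplayLog log, boolean recovering) {
        this.clock = clock;
        this.log = log;
        this.recovering = recovering;
    }

    // Non-deterministic half: the backup adopts the primary's logged value
    // while entries remain; otherwise the value is read and logged.
    long nonDetermRead() {
        if (recovering && !log.isEmpty()) {
            return log.next();
        }
        long r = clock.getAsLong();
        log.record(r);
        return r;
    }

    // Deterministic half: the result depends only on the argument.
    static String determWrite(long r) {
        return r + "\n";
    }
}
```

After the log is drained, the backup falls back to its own clock and starts logging, as a new primary would.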

(16)

Non-deterministic read sets

Definition

A read set is non-deterministic if it contains a shared variable.

Examples

Invoking methods on shared objects
Storing objects in static references

Alternatives

Bookkeeping of shared data: an order of magnitude overhead
Lock acquisition ordering: needs data-race elimination
Exclusive access to all variables while thread is scheduled

(17)

Non-deterministic read sets

Added restrictions

5 Include one of the following

1 No data-races (protect any shared data with monitors)

2 Exclusive access to all shared variables

Example

class Example {
    static F shared = null;

    String toString() {
        if (null == shared) { shared = new F();
        synchronized_call();
        ...

(18)

Output to the environment

Definition

An output is idempotent if it is independent of the number of times a command is executed.

An output is testable when the environment can be tested for occurrence of output.

Examples

cd /home/siggi is idempotent, cd .. is not
cd .. is testable if pwd is available.
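The cd example can be made concrete with a small simulation. This is a sketch under assumptions: WorkingDir models the environment as a single path string, and OutputRecovery.recoverCdParent is a hypothetical helper showing how a backup might use pwd as the test before repeating cd ..

```java
// Hypothetical simulation of the shell example; not a real shell API.
class WorkingDir {
    private String path;

    WorkingDir(String start) { path = start; }

    // Idempotent: the result is the same however often it is executed.
    void cdAbsolute(String target) { path = target; }

    // Not idempotent: every execution moves one more level up.
    void cdParent() {
        int cut = path.lastIndexOf('/');
        path = (cut <= 0) ? "/" : path.substring(0, cut);
    }

    // Lets the environment be tested for the occurrence of an output.
    String pwd() { return path; }
}

class OutputRecovery {
    // The backup repeats "cd .." only if the test shows that the primary
    // failed before the output took effect.
    static void recoverCdParent(WorkingDir env, String pathBeforeOutput) {
        if (env.pwd().equals(pathBeforeOutput)) {
            env.cdParent();
        }
    }
}
```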

Definition

A state is volatile if it does not survive failure of state machine.

Otherwise, it is stable.

(19)

Output to the environment

Solution

Only support idempotent and testable output.

Volatile data might be necessary for correct operation
Side effects handlers: replicate lost volatile state of the primary

Added restrictions

6 Output of native methods is idempotent or testable

7 Native methods annotated for volatile output

(20)

Implementation details

Extended JVM

New threads for
    Failure detection and backup initiation
    Transfer of logging information
Interact with other system threads: GC, finalization

Threading modification

Restriction (5.2) requires modifying multi-threading libraries
Sun’s JVM provides both native and green threads
Native threads are desired to run applications on SMP
Green threads are desired for portability

(21)

Implementation for non-deterministic commands

Initial work

Inspected and categorized all native methods by hand!

Found only 100 non-deterministic

Runtime support algorithm

Create table with (non-deterministic) method’s unique signature

Native call on primary triggers message to backup
Backup on recovery uses same values

Side effect handlers used for volatile state (e.g. file descriptor from an open command)

(22)

Non-deterministic read sets: first approach

Replicated lock synchronization

Assumes (5.1), ensuring mutual exclusion
Defines lock acquisition record = (t_i, l_j, t_i^#, l_j^#)
    Locking thread (t_i)
    Lock (l_j)
    Relative order of lock acquisition
        thread acquire sequence number (t_i^#)
        lock acquire sequence number (l_j^#)
Primary creates (t_i, l_j, t_i^#, l_j^#)
Backup uses (t_i, l_j, t_i^#, l_j^#) to repeat ordering

(23)

Computing lock acquisition record

Defining record values

Not trivial
    Object address for l_j: meaningless at replicas
    Order of events: might differ in primary and backup
Recursive definition for t_i = (t_p, k)
    t_p is the parent of t_i
    t_i is the k-th thread created among its siblings
Use thread determinism for l_j
    l_j assigned the first time used
    Log map l_j ↔ (t_i, t_i^#)
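The recursive thread identifier can be sketched in a few lines. ThreadId is a hypothetical class, and the dotted string form is an assumption made for readability. The point is that an id built from (parent, k) is the same at primary and backup regardless of how threads interleave, provided each thread creates its children in program order.

```java
// Hypothetical sketch of replica-independent thread ids, t_i = (t_p, k).
final class ThreadId {
    private final ThreadId parent;   // t_p, null for the root thread
    private final int k;             // creation index among siblings, from 1
    private int childrenCreated = 0; // how many children this thread has made

    private ThreadId(ThreadId parent, int k) {
        this.parent = parent;
        this.k = k;
    }

    static ThreadId root() {
        return new ThreadId(null, 0);
    }

    // Called by the parent thread when it creates a new thread.
    synchronized ThreadId newChild() {
        childrenCreated++;
        return new ThreadId(this, childrenCreated);
    }

    // Dotted path from the root, e.g. "0.1.2" (format is an assumption).
    @Override
    public String toString() {
        return parent == null ? "0" : parent + "." + k;
    }
}
```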

(24)

Using lock acquisition record

Recovery algorithm

Case 1: Backup thread t_i tries to acquire l_j, and Log contains r = (t_i, l_j, t_i^#, l_j^#)
    ⇒ Wait until we reach l_j^#
    ⇒ Remove r from log
Case 2: Backup thread t_i tries to acquire l_j, and Log doesn’t contain (t_i, l_j, t_i^#, l_j^#)
    ⇒ Wait until log is empty (end of recovery protocol)

(25)

Using lock acquisition record

Recovery algorithm

Case 3: Backup thread t_i tries to acquire a lock with no id, and Log contains map l_j ↔ (t_i, t_i^#)
    ⇒ Assign lock primary’s l_j
    ⇒ Remove map entry
Case 4: Backup thread t_i tries to acquire a lock with no id, and Log doesn’t contain map l_j ↔ (t_i, t_i^#)
    ⇒ Wait until one of:
        A thread t_i’ assigns l_j to the lock
        Log contains no more maps (assign fresh l_j)
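Cases 1 and 2 of the recovery algorithm can be sketched as a decision function. This is a simplified illustration, not the paper's code: LockReplay and its Decision values are hypothetical, ids are plain strings, lock sequence numbers are assumed to start at 1, and the function returns what the thread should do instead of actually blocking. Cases 3 and 4 (locks with no id yet) are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of replaying lock acquisition order on the backup.
class LockReplay {
    enum Decision { ACQUIRE, WAIT_FOR_TURN, WAIT_FOR_END_OF_RECOVERY }

    // (t_i, l_j, t_i^#) -> l_j^#, as logged by the primary
    private final Map<String, Long> log = new HashMap<>();
    // l_j -> number of acquisitions already replayed on the backup
    private final Map<String, Long> replayed = new HashMap<>();

    private static String key(String thread, String lock, long threadSeq) {
        return thread + "|" + lock + "|" + threadSeq;
    }

    void addRecord(String thread, String lock, long threadSeq, long lockSeq) {
        log.put(key(thread, lock, threadSeq), lockSeq);
    }

    Decision tryAcquire(String thread, String lock, long threadSeq) {
        String k = key(thread, lock, threadSeq);
        Long lockSeq = log.get(k);
        if (lockSeq == null) {
            // Case 2: no record, so wait until the log is empty.
            return log.isEmpty() ? Decision.ACQUIRE
                                 : Decision.WAIT_FOR_END_OF_RECOVERY;
        }
        long next = replayed.getOrDefault(lock, 0L) + 1;
        if (lockSeq != next) {
            // Case 1: record exists, but l_j^# has not been reached yet.
            return Decision.WAIT_FOR_TURN;
        }
        // Case 1: our turn; take the lock and remove the record.
        replayed.put(lock, next);
        log.remove(k);
        return Decision.ACQUIRE;
    }
}
```

A real implementation would block the calling thread and wake it when the lock's counter advances; returning a decision keeps the sketch deterministic and easy to check.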

(26)

Non-deterministic read sets: second approach

Replicated thread scheduling

Assumes (5.2), all shared data is protected

Defines thread scheduling record = (bn, pc, m, l^#, tn)
    Code executed
        Current program counter (pc)
        Trace summary to get there (bn)
    Monitor uses (m)
    Thread was waiting on a lock (l^#)
    Next scheduled thread (tn)

Log record on each context switch

(27)

Computing thread scheduling record

Defining program position

How many statements were executed?

Avoid counting each instruction

bn counts branches, jumps and invocations taken.

pc is program counter offset (not absolute address):

updated on every instruction!
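A small simulation shows why the pair (bn, pc) identifies a program position where pc alone does not. The sketch assumes a toy three-instruction loop; ProgramPosition and PositionDemo are hypothetical names. Inside a loop the same pc offset recurs on every iteration, but bn grows with each backward jump, so the pair is distinct on each pass.

```java
// Hypothetical sketch of the (bn, pc) program position.
class ProgramPosition {
    private long bn = 0; // branches, jumps and invocations taken so far
    private int pc = 0;  // offset of the current instruction in its method

    // Called for every executed instruction; bn changes only on transfers.
    void step(int newPc, boolean controlTransfer) {
        if (controlTransfer) {
            bn++;
        }
        pc = newPc;
    }

    boolean matches(long loggedBn, int loggedPc) {
        return bn == loggedBn && pc == loggedPc;
    }
}

class PositionDemo {
    // Runs a loop body with offsets 0, 1, 2 for the given number of
    // iterations. Returns how many instructions had executed when the logged
    // position was first reached, or -1 if it never was.
    static int stepsUntil(long loggedBn, int loggedPc, int iterations) {
        ProgramPosition pos = new ProgramPosition();
        int steps = 0;
        for (int it = 0; it < iterations; it++) {
            for (int offset = 0; offset < 3; offset++) {
                boolean jumpedBack = (offset == 0 && it > 0);
                pos.step(offset, jumpedBack);
                steps++;
                if (pos.matches(loggedBn, loggedPc)) {
                    return steps;
                }
            }
        }
        return -1;
    }
}
```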

(28)

Computing thread scheduling record

Defining program position

What if preemption occurred inside a native method?

Can’t control preemption outside JVM
On recovery, preemption before native call?
Need to keep track of locks acquired
Locking done in JVM monitors
On recovery, preempt when m is reached

(29)

Interaction with system threads

Example

Heap shared with GC. GC not in Java!

Problems

t_i acquires a lock at primary with no contention, but t_i waits at backup
    tn can enter at backup before it should!
    ⇒ User-level threads to force t_i to stay
t_i acquires a lock at backup with no contention, but t_i waits at primary
    ⇒ Use m also to force rescheduling at backup

(30)

Replicated scheduling: final details

Wait and notifyAll()

Multiple threads awakened

Store the l^# to preserve order in backup

Finishing recovery

Log becomes empty, last entry contains tn

Backup must schedule tn to reproduce interaction to environment.

(31)

Garbage Collection

Common problems

Soft/weak references

Primary and backup may diverge

⇒ Convert them to strong references

Finalizers should be no source of non-determinism

⇒ Replicate as before

(32)

Output to the environment

Side effects handlers

Store and recover volatile state

Ensure exactly-once semantics for output
Composed of 5 methods: register, test, log, receive, restore

(33)

Components of SE handlers

Method register

Provide method signature
Non-determinism flag
Output command flag
Arguments used for output

Method test

Used by backup
True if output command was successfully executed
Only defined for testable commands
Idempotent commands are replayed

(34)

Components of SE handlers

Method log

Used by primary after an output command
Saves arguments, return value and internal state
Produces a message with recovery information

Method receive

Used by backup to retrieve result of log
Can perform compaction of messages

Method restore

Used by backup only once to recover volatile state
Uses received messages
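The five methods could be grouped into one interface, as in the sketch below. The slides do not give the signatures, so every parameter and return type here is an assumption, and OpenFileHandler is a hypothetical handler that only simulates re-opening files by keeping a list of paths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of a side-effect handler; signatures are assumptions.
interface SideEffectHandler {
    void register();                           // announce the handled method
    boolean test(String[] args);               // backup: did the output occur?
    String log(String[] args, String result);  // primary: recovery message
    void receive(String message);              // backup: store a message
    void restore();                            // backup: rebuild volatile state
}

// Simulated handler for an open-file command (volatile state: open files).
class OpenFileHandler implements SideEffectHandler {
    final List<String> received = new ArrayList<>();
    final List<String> reopened = new ArrayList<>();
    boolean registered = false;

    public void register() {
        registered = true;
    }

    public boolean test(String[] args) {
        return reopened.contains(args[0]);
    }

    public String log(String[] args, String result) {
        return args[0]; // the path is enough to re-open the file later
    }

    public void receive(String message) {
        if (!received.contains(message)) { // compaction: drop duplicates
            received.add(message);
        }
    }

    public void restore() {
        for (String path : received) {
            if (!reopened.contains(path)) {
                reopened.add(path); // stands in for re-opening the file
            }
        }
        received.clear();
    }
}
```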

(35)

Experimental setting

Architecture and settings

Sun E5000 servers: 15 × 400 MHz UltraSPARC II CPUs, 2 GB memory, 100 Mbps Ethernet
Primary and backup run on different machines
Log is kept at backup in volatile memory
Synchronization on each output (acks)
Interpreted mode, no JIT

3 scenarios: AL, TS, NoFT

Only green threads (native on SMP yield similar result)

(36)

Experimental setting

Benchmarks

Spec JVM98 Benchmarks

Results shown for 6 benchmarks, among them
    compress: CPU intensive
    db: database, heavy on locking
    mtrt: the only multithreaded benchmark

(37)

Algorithms comparison

Running times under two algorithms

(38)

Overhead

Overhead of lock acquisition algorithm

(39)

Overhead

Overhead of thread scheduling algorithm

(40)

Outline

3 Friedman, Kama - SRDS 2003

(41)

Introduction

Another hypervisor

Build on top of Jikes RVM
Ignore native code

Support JIT

Jikes RVM

Almost all in Java

Yield points and time slices

(42)

Sources of non-determinism

Multithreading

Use deterministic scheduler (yield points)
Deterministic dequeuing

Data-races on SMPs

⇒ assume no data-races, enforce lock ordering

(43)

Design decisions

Frames

One frame lag between primary and backup

Synchronize with replicas before starting a new frame
Send all I/O results to replicas at start point
Send locks (on SMP) anywhere
Send non-deterministic read sets
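The one-frame lag can be pictured as a queue of finished frames between primary and backup. The sketch below is an assumption-laden simplification: FrameChannel is a hypothetical in-memory stand-in for the network, I/O results are plain strings, and locks and read sets are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of frame-based coordination; not the paper's code.
class FrameChannel {
    private final Deque<List<String>> completed = new ArrayDeque<>();
    private List<String> current = new ArrayList<>();

    // Primary: remember an I/O result produced during the current frame.
    void recordIo(String result) {
        current.add(result);
    }

    // Primary: frame boundary; hand the finished frame to the backup
    // before a new frame starts.
    void endFrame() {
        completed.addLast(current);
        current = new ArrayList<>();
    }

    // Backup: a frame may run only after the primary has finished it.
    boolean backupCanRun() {
        return !completed.isEmpty();
    }

    // Backup: I/O results to replay while executing its next frame.
    List<String> backupNextFrame() {
        return completed.removeFirst();
    }
}
```

Larger frames mean fewer synchronization points, while the backup has more work to redo after a failure.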

(44)

Frames example

Framing...

(45)

Implementation details

Replication engine

Additional module to Jikes RVM

Communication between primary and backup
Detection of failures

Election of new primary

(46)

Implementation details

Hurdles

JIT compilation

⇒ saving thread switch counter

Non-deterministic number of statements

⇒ also disable preemption

Garbage collection on SMP: cooperative threads

⇒ GC non-preemptive until all are done

(47)

Experimental setting

Benchmarks

Some Spec JVM98 Benchmarks and SciMark
scimark, compress, db, raytrace, mtrt

Variations on frame-size (number of context switches)

(48)

Compress

compress Benchmark

(49)

Database

db Benchmark

(50)

Raytrace

raytrace Benchmark

(51)

Multithreaded Raytrace

mtrt Benchmark

(52)

Replication Overhead

Overhead

(53)

Final remarks

Summary

Common technique: hypervisor model
Restrictions to solve non-determinism
Support for SMPs

First paper main features

SE Handlers: native methods

Second paper main features

Frames
Lower synchronization
Faster recovery