Microreboot RAS Model - Analysis Techniques

4.3 Analysis Techniques

4.3.1 Microreboot RAS Model

Recursive microreboots are a technique for improving overall system availability by reactively restarting failed components and rejuvenating functioning components to prevent degradation [21]. It is specifically targeted at recovering from failures such as crashes, deadlocks, infinite loops, livelocks and state corruption (memory leaks, dangling pointers, damaged heaps, etc.).

A microreboot (µRB) can be applied at different levels of a system: component level, subsystem level or whole-system level (see footnote 5). As a remediation technique, recursive microreboots target the minimal set of a system’s components for a restart and progressively restart larger subsets of components, up to and including the entire system. Microreboots share a number of properties with whole-system reboots that make them attractive as a remediation mechanism. They return the target of recovery (component, subsystem, system) to a well-understood state – its start state. Further, they provide a high-confidence way of reclaiming stale or leaked resources [21].

Footnote 3: Java 2 Platform, Enterprise Edition (J2EE) defines the standard for developing multi-tier enterprise applications [128].

Footnote 4: Additional measurement-based evaluations can be found in [23] and [22].

Footnote 5: The ability to precisely target and restart system elements at these various levels depends on a number of structural properties and design considerations of the system under consideration. See [21] for more details on design considerations for recursively restartable systems.

We chose recursive microreboot for our analysis example because it is an instance of a sophisticated remediation mechanism that exhibits a number of characteristics that make it interesting to study:

1. Layered recovery strategy – one layer for each level at which recovery can occur in the system.

2. Imperfect recovery between layers – failures can escalate to higher layers, e.g., if component-level reboots are unsuccessful then the failure “bubbles” up to the next higher layer (subsystem-level reboots) to be handled, and so on.

3. Problem mitigation rather than elimination – microreboots do not eliminate the underlying root cause of the problem; rather, they attempt to mitigate its effects. Over time, the same failures can resurface.

In [20], the authors evaluate the efficacy of microrebooting, comparing fine-grained microreboots to coarse-grained system reboots using their microrebootable J2EE application server, custom fault-injection tools and eBid, a version of the Rice University Bidding System (RUBiS) N-tier web-application, modified to be amenable to microreboots. RUBiS is a J2EE/Web-based auction system modeled after eBay.com.

The test system deployment in [20] consists of the following elements, which also correspond to the units of recovery. These recovery units are listed in order of fine-grained restarts to coarse-grained restarts:

• Enterprise Java Beans (EJBs) – these encapsulate the business logic of the eBid web-application. They may interact with other EJBs and/or backend databases in the processing of a client request.

• WAR (Web ARchive) file – contains the presentation tier of the web application: Java Server Pages (JSPs) and servlets. These invoke EJB methods and format the returned results for presentation to the client.

• eBid web-application – the collection of EJBs, JSPs and servlets.

• JVM/JBoss – the execution/hosting environment for the eBid web-application.

A Recovery Manager component added to the JBoss application server performs failure diagnosis and recovery guided by the simple recursive policy of “cheapest recovery first”. In response to the faults injected into eBid, the recovery manager progressively reboots larger sets of components: first EJBs, then eBid’s WAR, then the eBid web application, followed by the JVM/JBoss, and finally, if necessary, the operating system. To fully resolve some failures, microreboots may be followed by additional automated or manual actions, e.g., recovering persistent data may be done automatically (via transaction rollback) or may require manual reconstruction of the data in the database.
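The “cheapest recovery first” policy can be illustrated with a minimal Python sketch. This is not the Recovery Manager from [20]; the function names (recover, try_restart) and the example failure are hypothetical, and only the ordering of recovery levels is taken from the description above.

```python
from typing import Callable, Optional, Sequence

# Recovery levels ordered from cheapest to most expensive, as listed in the text.
LEVELS = ("EJB", "WAR", "eBid", "JVM/JBoss", "OS")


def recover(start_level: str,
            try_restart: Callable[[str], bool],
            levels: Sequence[str] = LEVELS) -> Optional[str]:
    """Restart at start_level and escalate one level at a time until a restart
    resolves the failure. Returns the level that succeeded, or None if every
    level failed (further automated or manual action is then needed)."""
    for level in levels[levels.index(start_level):]:
        if try_restart(level):
            return level
    return None


# Hypothetical failure that only a WAR-level (or coarser) restart can clear.
attempted = []


def restart_and_log(level: str) -> bool:
    attempted.append(level)
    return LEVELS.index(level) >= LEVELS.index("WAR")


resolved_at = recover("EJB", restart_and_log)
print(attempted, resolved_at)
```

In this sketch the EJB restart is tried first and fails, so the failure escalates to the WAR level, where it is resolved.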

Based on the description of the microrebootable application server in [20], we use the SHARPE [160] RAS modeling and analysis tool to generate a model (shown in Figure 4.4) that can be used to evaluate the efficacy of the application server and its recovery manager. The RAS model is an irreducible CTMC that consists of 6 states and 17 parameters (see Table 4.1).

Our RAS model captures a number of key elements of the operation of the application server’s recovery manager including: a) multiple layers of recovery and b) the possible escalation of failures to higher levels of recovery. Further, the use of an irreducible CTMC allows us to model the operation of the Recovery Manager as an infinitely running process where failures can re-occur.
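To make the structure of such a model concrete, the following Python sketch builds a six-state generator matrix and solves for its steady-state probabilities. It is not the SHARPE model itself. The transition structure is an assumption inferred from the state and parameter descriptions (S0 feeds S1, S2 and S4; each restart state either returns to S0 or falls through to the next state; S5 returns to S0), the µ parameters are treated as mean times so that rates are their reciprocals, and every numeric value is hypothetical rather than taken from [20].

```python
def build_generator(p):
    """Generator matrix Q for states S0..S5 (assumed transition structure)."""
    n = 6
    Q = [[0.0] * n for _ in range(n)]
    lam = p["lambda_failure"]
    # S0: a failure is first handled at the EJB, WAR or JVM/JBoss level.
    Q[0][1] = lam * p["p_ejb_rb"]
    Q[0][2] = lam * p["p_war_rb"]
    Q[0][4] = lam * p["p_jvm_jboss_rb"]
    # S1..S4: a restart either succeeds (back to S0) or falls through one level.
    for s, name in [(1, "ejb"), (2, "war"), (3, "ebid"), (4, "jvm_jboss")]:
        rate = 1.0 / p[f"mu_{name}_rb"]          # mean restart time -> rate
        Q[s][0] = rate * p[f"p_{name}_rb_success"]
        Q[s][s + 1] = rate * p[f"p_{name}_fallthru"]
    # S5: the operator fix always returns the system to S0.
    Q[5][0] = 1.0 / p["mu_operator_fix"]
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q


def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gaussian elimination."""
    n = len(Q)
    # Rows of the transpose are the balance equations; the last (redundant)
    # one is replaced by the normalisation condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi


# Hypothetical parameter values (times in seconds), for illustration only.
params = {"lambda_failure": 1 / 600,
          "p_ejb_rb": 0.80, "p_war_rb": 0.15, "p_jvm_jboss_rb": 0.05,
          "mu_ejb_rb": 0.5, "mu_war_rb": 1.0, "mu_ebid_rb": 8.0,
          "mu_jvm_jboss_rb": 20.0, "mu_operator_fix": 600.0}
for name in ("ejb", "war", "ebid", "jvm_jboss"):
    params[f"p_{name}_rb_success"] = 0.9
    params[f"p_{name}_fallthru"] = 0.1

Q = build_generator(params)
pi = steady_state(Q)
for state, prob in enumerate(pi):
    print(f"S{state}: {prob:.6f}")
```

Because every fall-through probability is positive and S5 returns to S0, each state is reachable from every other state, which is what makes the chain irreducible and gives it a unique steady-state distribution.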

This RAS model, plus fault-injection tools like Kheiron (Chapter 3) or the ones used in the experiments in [20], can be used to design, initiate and score fault-injection experiments that represent different failure scenarios for evaluating the efficacy of microreboots. Fault-injection tools can be used to control the rate of failure (λ_failure) and/or the proportions of failures that initially target a specific level of recovery (p_ejb_rb, p_war_rb, and p_jvm_jboss_rb). Varying these parameters allows us to study the behavior of the system under different fault-loads/failure mixes.

Figure 4.4: RAS model for a microrebootable application server

Parameters concerned with the success or failure of recovery at a specific level (p_ejb_rb_success, p_ejb_fallthru, p_war_rb_success, p_war_fallthru, p_ebid_rb_success, p_ebid_fallthru, p_jvm_jboss_rb_success, and p_jvm_jboss_fallthru) can be observed experimentally or varied in the model to reason about their expected impacts on system operation.
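One way to reason about the impact of the fall-through parameters is to compute the expected time to resolve a failure that enters recovery at a given level. The Python sketch below assumes that a restart at each level either succeeds or falls through to the next level, that the operator fix always resolves the failure, and that the µ parameters are mean times. The function name and all numeric values are hypothetical.

```python
def expected_recovery_time(start, times, fallthru):
    """Mean time from entering level `start` until the failure is resolved.

    times[i]    -- mean time spent at level i (EJB, WAR, eBid, JVM/JBoss, operator)
    fallthru[i] -- probability that level i fails and escalates to level i + 1
    The last level (operator fix) is assumed to always resolve the failure.
    """
    total, reach = 0.0, 1.0          # reach = probability of getting to level i
    for i in range(start, len(times)):
        total += reach * times[i]
        if i < len(fallthru):
            reach *= fallthru[i]
    return total


# Hypothetical values, in seconds: EJB, WAR, eBid, JVM/JBoss, operator fix.
times = [0.5, 1.0, 8.0, 20.0, 600.0]
fallthru = [0.1, 0.1, 0.1, 0.1]

from_ejb = expected_recovery_time(0, times, fallthru)
print(f"{from_ejb:.2f}")   # 0.5 + 0.1*1 + 0.01*8 + 0.001*20 + 0.0001*600 = 0.76
```

With these numbers, the rare escalations to the operator contribute 0.06 s of the 0.76 s total, which shows how even small fall-through probabilities at the lower levels can matter when the highest level is expensive.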

Parameters concerned with recovery times at a specific level (µ_ejb_rb, µ_war_rb, µ_ebid_rb, µ_jvm_jboss_rb, and µ_operator_fix) can be observed experimentally or varied in the model based on

Name                      Description
S0                        The initial state of the system
S1                        State where one or more EJBs are being restarted
S2                        State where the eBid WAR file is being restarted
S3                        State where the entire eBid application is being restarted
S4                        State where the JVM/JBoss application server is being restarted
S5                        State where an operator performs some action(s) to resolve an issue
λ_failure                 Rate at which faults are injected/failures induced
p_ejb_rb                  Proportion of failures that are initially handled by an EJB restart
p_war_rb                  Proportion of failures that are initially handled by a WAR restart
p_jvm_jboss_rb            Proportion of failures that are initially handled by a JVM/JBoss restart
p_ejb_rb_success          Proportion of failures successfully resolved by an EJB restart
p_ejb_fallthru            Proportion of failures that fall through to WAR restart level
p_war_rb_success          Proportion of failures successfully resolved by a WAR restart
p_war_fallthru            Proportion of failures that fall through to eBid restart level
p_ebid_rb_success         Proportion of failures successfully resolved by restarting eBid
p_ebid_fallthru           Proportion of failures that fall through to JVM/JBoss restart level
p_jvm_jboss_rb_success    Proportion of failures successfully resolved by a JVM/JBoss restart
p_jvm_jboss_fallthru      Proportion of failures that fall through to operator fix level
µ_ejb_rb                  EJB restart time
µ_war_rb                  WAR restart time
µ_ebid_rb                 eBid web-application restart time
µ_jvm_jboss_rb            JVM/JBoss restart time
µ_operator_fix            Time for an operator resolution

Table 4.1: RAS model parameters for a microrebootable application server

level increases.

Finally, labeling states associated with normal request processing or degraded request processing as UP states and states where no requests are processed as DOWN states allows us to capture different perspectives on what it means for the microrebootable application server to be considered “working”. By adjusting state-labels and varying the parameters of the RAS model, we can quantify various facets of reliability, availability and serviceability for the microrebootable application server.
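The effect of state labeling can be sketched in a few lines of Python: steady-state availability is the total probability of the states labeled UP, so relabeling changes the result without changing the model. The probabilities below are hypothetical placeholders, not results from the model. The choice to treat S1 and S2 as degraded-but-UP in the second labeling is an assumption made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(pi, up_states):
    """Steady-state availability: total probability of the states labeled UP."""
    return sum(pi[s] for s in up_states)


def downtime_minutes_per_year(avail):
    return (1.0 - avail) * MINUTES_PER_YEAR


# Hypothetical steady-state probabilities for S0..S5 (they sum to 1).
pi = {"S0": 0.990, "S1": 0.004, "S2": 0.002,
      "S3": 0.001, "S4": 0.002, "S5": 0.001}

# Strict view: only normal request processing counts as UP.
strict = availability(pi, {"S0"})
# Lenient view (assumption): EJB and WAR restarts merely degrade service.
lenient = availability(pi, {"S0", "S1", "S2"})

for label, a in (("strict", strict), ("lenient", lenient)):
    print(f"{label}: availability={a:.4f}, "
          f"downtime={downtime_minutes_per_year(a):.0f} min/year")
```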

In document The 7U Evaluation Method: Evaluating Software Systems via Runtime Fault-Injection and Reliability, Availability and Serviceability (RAS) Metrics and Models (Page 136-141)