4.3 Analysis Techniques
4.3.1 Microreboot RAS Model
Recursive microreboots are a technique for improving overall system availability by reactively restarting failed components and rejuvenating functioning components to prevent degradation [21]. The technique specifically targets recovery from failures such as crashes, deadlocks, infinite loops, livelocks and state corruption (memory leaks, dangling pointers, damaged heaps, etc.).
A microreboot (µRB) can be applied at different levels of a system: component level, subsystem level or whole-system level⁵. As a remediation technique, recursive microreboots first target the minimal set of a system's components for a restart and then progressively restart larger subsets of components, up to and including the entire system. Microreboots share a number of properties with whole-system reboots that make them attractive as a remediation mechanism. They return the target of recovery (component, subsystem, system) to a well-understood state: its start state. Further, they provide a high-confidence way of reclaiming stale or leaked resources [21].

³ Java 2 Platform, Enterprise Edition (J2EE) defines the standard for developing multi-tier enterprise applications [128].
⁴ Additional measurement-based evaluations can be found in [23] and [22].
⁵ The ability to precisely target and restart system elements at these various levels depends on a number of structural properties and design considerations of the system under consideration. See [21] for more details on design considerations for recursively restartable systems.
We chose recursive microreboot for our analysis example because it is an instance of a sophisticated remediation mechanism that exhibits a number of characteristics that make it interesting to study:
1. Layered recovery strategy – one layer for each level at which recovery can occur in the system.
2. Imperfect recovery between layers – failures can escalate to higher layers e.g. if component-level reboots are unsuccessful then the failure “bubbles” up to the next higher layer to be handled – subsystem-level reboots – and so on.
3. Problem mitigation rather than elimination – microreboots do not eliminate the underlying root cause of the problem; rather, they attempt to mitigate its effects. Over time, the same failures can resurface.
In [20], the authors evaluate the efficacy of microrebooting, comparing fine-grained microreboots to coarse-grained system reboots using their microrebootable J2EE application server, custom fault-injection tools and eBid, a version of the Rice University Bidding System (RUBiS) N-tier web-application, modified to be amenable to microreboots. RUBiS is a J2EE/Web-based auction system modeled after eBay.com.
The test system deployment in [20] consists of the following elements, which also correspond to the units of recovery. These recovery units are listed in order from the finest-grained restart to the coarsest-grained restart:
• Enterprise Java Beans (EJBs) – these encapsulate the business logic of the eBid web-application. They may interact with other EJBs and/or backend databases in the processing of a client request.
• Web ARchive (WAR) – contains the presentation tier of the web application: Java Server Pages (JSPs) and servlets. These invoke EJB methods and format the returned results for presentation to the client.
• eBid web-application – the collection of EJBs, JSPs and servlets.
• JVM/JBoss – the execution/hosting environment for the eBid web-application.

A Recovery Manager component added to the JBoss application server performs failure diagnosis and recovery guided by the simple recursive policy of “cheapest recovery first”. In response to the faults injected into eBid, the recovery manager progressively reboots larger sets of components: first EJBs, then eBid’s WAR, then the eBid web application, followed by the JVM/JBoss and, if necessary, the operating system. To fully resolve some failures, microreboots may need to be followed by additional automated or manual actions, e.g., recovering persistent data may be done automatically (via transaction rollback) or may require manual reconstruction of the data in the database.
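The “cheapest recovery first” policy described above can be sketched as a simple escalation loop. This is a minimal illustration, not the Recovery Manager from [20]: the function `recover` and the callback `try_restart` are hypothetical names introduced here, and failure diagnosis is reduced to a callback that reports whether a restart cleared the failure.

```python
from typing import Callable, List, Optional, Tuple

# Recovery levels in the order given in the text, cheapest first.
LEVELS = ["EJB", "WAR", "eBid application", "JVM/JBoss", "operating system"]


def recover(start_level: int,
            try_restart: Callable[[str], bool]) -> Tuple[Optional[str], List[str]]:
    """Escalate from LEVELS[start_level] until a restart resolves the failure.

    try_restart is a hypothetical callback: it restarts the named unit and
    returns True if the failure is no longer observed afterwards.
    Returns (level that resolved the failure or None, levels attempted).
    """
    attempted = []
    for level in LEVELS[start_level:]:
        attempted.append(level)
        if try_restart(level):
            return level, attempted
    # No automated restart helped; further (e.g., manual) action is needed.
    return None, attempted


# Usage: a failure that an EJB restart does not clear but a WAR restart does.
fixed_by, attempted = recover(0, lambda level: level != "EJB")
print(fixed_by, attempted)
```

Starting the loop at `start_level` mirrors the idea that diagnosis may send a failure directly to a coarser level, while the early return keeps each recovery as cheap as possible.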
Based on the description of the microrebootable application server in [20], we use the SHARPE [160] RAS modeling and analysis tool to generate a model (shown in Figure 4.4) that can be used to evaluate the efficacy of the application server and its recovery manager. The RAS model is an irreducible CTMC that consists of 6 states and 17 parameters (see Table 4.1).
Our RAS model captures a number of key elements of the operation of the application server’s recovery manager including: a) multiple layers of recovery and b) the possible escalation of failures to higher levels of recovery. Further, the use of an irreducible CTMC allows us to model the operation of the Recovery Manager as an infinitely running process where failures can re-occur.
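To make the structure of such a model concrete, the sketch below builds a generator matrix for a six-state CTMC and solves for its steady-state distribution in plain Python. It does not use SHARPE, and several things are assumptions made for illustration: the transition structure (S0 fans out to S1, S2 and S4; each restart state either returns to S0 or falls through to the next state; S5 returns to S0) is inferred from the parameter names in Table 4.1 rather than read from Figure 4.4, each fall-through probability is taken to be one minus the corresponding success probability, rates are taken as reciprocals of mean times, and all numeric values are placeholders, not measurements from [20].

```python
def generator(lam, p_init, p_success, restart_time, operator_fix_time):
    """Build the 6x6 generator matrix Q for states S0..S5 (assumed structure).

    lam               failure rate out of S0
    p_init            {1: p_ejb_rb, 2: p_war_rb, 4: p_jvm_jboss_rb}
    p_success         {level: probability that the restart at that level succeeds}
    restart_time      {level: mean restart time at that level}
    operator_fix_time mean time for an operator resolution (S5)
    """
    Q = [[0.0] * 6 for _ in range(6)]
    for level, p in p_init.items():
        Q[0][level] = lam * p                        # failure enters a recovery level
    for level in (1, 2, 3, 4):
        mu = 1.0 / restart_time[level]               # rate = 1 / mean time
        Q[level][0] = mu * p_success[level]          # restart succeeds
        Q[level][level + 1] = mu * (1.0 - p_success[level])  # falls through
    Q[5][0] = 1.0 / operator_fix_time
    for i in range(6):
        Q[i][i] = -sum(Q[i])                         # each row sums to zero
    return Q


def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    # Row i of A is the balance equation for state i (column i of Q),
    # augmented with a right-hand side of 0.
    A = [[Q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    # One balance equation is redundant; replace it with the normalisation.
    A[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Placeholder values (times in seconds); NOT taken from [20].
Q = generator(
    lam=1.0 / 3600.0,
    p_init={1: 0.80, 2: 0.15, 4: 0.05},
    p_success={1: 0.9, 2: 0.9, 3: 0.9, 4: 0.9},
    restart_time={1: 0.5, 2: 1.0, 3: 8.0, 4: 20.0},
    operator_fix_time=600.0,
)
pi = steady_state(Q)
print([round(x, 6) for x in pi])
```

Because every recovery state eventually leads back to S0, the chain is irreducible and the steady-state distribution exists and is unique, which is what allows the model to represent an indefinitely running system in which failures recur.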
This RAS model, plus fault-injection tools like Kheiron (Chapter 3) or the ones used in the experiments in [20], can be used to design, initiate and score fault-injection experiments that represent different failure scenarios for evaluating the efficacy of microreboots. Fault-injection tools can be used to control the rate of failure (λ_failure) and/or the proportions of failures that initially target a specific level of recovery (p_ejb_rb, p_war_rb, and p_jvm_jboss_rb). Varying these parameters allows us to study the behavior of the system under different fault-loads/failure mixes.

Figure 4.4: RAS model for a microrebootable application server
Parameters concerned with the success or failure of recovery at a specific level (p_ejb_rb_success, p_ejb_fallthru, p_war_rb_success, p_war_fallthru, p_ebid_rb_success, p_ebid_fallthru, p_jvm_jboss_rb_success, and p_jvm_jboss_fallthru) can be observed experimentally or varied in the model to reason about their expected impacts on system operation.
Parameters concerned with recovery times at a specific level (µ_ejb_rb, µ_war_rb, µ_ebid_rb, µ_jvm_jboss_rb, and µ_operator_fix) can be observed experimentally or varied in the model based on
level increases.

Symbol                   Description
S0                       The initial state of the system
S1                       State where one or more EJBs is being restarted
S2                       State where the eBid WAR file is being restarted
S3                       State where the entire eBid application is being restarted
S4                       State where the JVM/JBoss application server is being restarted
S5                       State where an operator performs some action(s) to resolve an issue
λ_failure                Rate at which faults are injected/failures induced
p_ejb_rb                 Proportion of failures that are initially handled by an EJB restart
p_war_rb                 Proportion of failures that are initially handled by a WAR restart
p_jvm_jboss_rb           Proportion of failures that are initially handled by a JVM/JBoss restart
p_ejb_rb_success         Proportion of failures successfully resolved by an EJB restart
p_ejb_fallthru           Proportion of failures that fall through to WAR restart level
p_war_rb_success         Proportion of failures successfully resolved by a WAR restart
p_war_fallthru           Proportion of failures that fall through to eBid restart level
p_ebid_rb_success        Proportion of failures successfully resolved by restarting eBid
p_ebid_fallthru          Proportion of failures that fall through to JVM/JBoss restart level
p_jvm_jboss_rb_success   Proportion of failures successfully resolved by a JVM/JBoss restart
p_jvm_jboss_fallthru     Proportion of failures that fall through to operator fix level
µ_ejb_rb                 EJB restart time
µ_war_rb                 WAR restart time
µ_ebid_rb                eBid web-application restart time
µ_jvm_jboss_rb           JVM/JBoss restart time
µ_operator_fix           Time for an operator resolution

Table 4.1: RAS model parameters for a microrebootable application server
Finally, labeling states associated with normal request processing or degraded request processing as UP states and states where no requests are processed as DOWN states allows us to capture different perspectives on what it means for the microrebootable application server to be considered “working”. By adjusting state-labels and varying the parameters of the RAS model, we can quantify various facets of reliability, availability and serviceability for the microrebootable application server.
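The effect of relabeling states can be illustrated with a short calculation. The sketch below is an illustration under stated assumptions, not the SHARPE analysis itself: it assumes the same transition structure inferred from Table 4.1 (S0 fans out to S1, S2 and S4; each restart either returns to S0 or falls through to the next level; S5 returns to S0), computes the long-run fraction of time in each state from the expected time spent per failure cycle, and then sums that fraction over whichever states are labeled UP. The function names, the choice of which states count as degraded-but-UP, and all numeric values are hypothetical placeholders.

```python
def state_probabilities(lam, p_ejb, p_war, p_jvm, fallthru, mean_time):
    """Long-run fraction of time in S0..S5 for the assumed model structure.

    fallthru[s]  probability that recovery in state s falls through to s + 1
    mean_time[s] mean time spent in recovery state s (s = 1..5)
    """
    # Expected number of visits to S1..S5 per failure (one cycle S0 -> ... -> S0).
    v1 = p_ejb
    v2 = p_war + v1 * fallthru[1]
    v3 = v2 * fallthru[2]
    v4 = p_jvm + v3 * fallthru[3]
    v5 = v4 * fallthru[4]
    # Expected time per cycle in each state; S0 lasts 1 / lam on average.
    time_in = [1.0 / lam]
    for s, visits in enumerate((v1, v2, v3, v4, v5), start=1):
        time_in.append(visits * mean_time[s])
    cycle = sum(time_in)
    return [t / cycle for t in time_in]


def availability(pi, up_states):
    """Steady-state availability: total probability of the states labeled UP."""
    return sum(pi[s] for s in up_states)


# Placeholder values (times in seconds); NOT taken from [20].
pi = state_probabilities(
    lam=1.0 / 3600.0, p_ejb=0.80, p_war=0.15, p_jvm=0.05,
    fallthru={1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1},
    mean_time={1: 0.5, 2: 1.0, 3: 8.0, 4: 20.0, 5: 600.0},
)

strict = availability(pi, {0})         # only normal operation counts as UP
lenient = availability(pi, {0, 1, 2})  # assumes EJB/WAR restarts still serve requests
print(f"strict={strict:.6f} lenient={lenient:.6f}")
```

The model parameters are identical in both cases; only the set of UP states changes, which is what lets the same RAS model express more than one definition of “working”.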