Failures classification and analysis of the Java Virtual Machine


Domenico Cotroneo (1,2), Salvatore Orlando (1), Stefano Russo (1,2)

(1) Dipartimento di Informatica e Sistemistica - Università degli Studi di Napoli Federico II, Via Claudio 21, 80125 - Naples, Italy

(2) Laboratorio ITEM - Consorzio Interuniversitario Nazionale per l’Informatica, Via Diocleziano 328 - 80124 Naples, Italy

{cotroneo, saorland, sterusso}@unina.it

Abstract - This paper presents a failure analysis of the Java Virtual Machine (JVM), with the aim of providing useful insights into the nature of reported failures and improving the understanding of its dependability aspects. Failure data are extracted, in the form of reports, from publicly available bug databases, to which developers and users of Java applications usually submit failures and bugs.

The results presented in this work clearly indicate that much more effort is still needed to improve the dependability of the JVM. In particular, the analysis revealed that i) built-in error detection mechanisms are characterized by low coverage; ii) it is not possible to claim that the JVM achieves the same level of dependability across different platforms, since a considerable number of failures depend on the Operating System or Hardware Platform (or both) on which the JVM was running; iii) developers have to face a trade-off between performance and reliability, since Just-in-time compilers and Garbage Collectors are responsible for more than 35% of reported failures; iv) the Java Virtual Machine is particularly susceptible to CPU-bound workloads. Finally, code fragments that activate faults in the Java Virtual Machine are injected into Java applications, and a monitoring infrastructure is set up to gain insights into the nature and causes of each failure. Preliminary results show that these faults can often be removed by changing the environment of the JVM.

Index Terms - Dependability, Java Virtual Machine, Failure Analysis, Failure Diagnosis

I. Introduction

Thanks to its high level of portability, its programming abstractions, and the improvements and extensions carried out in recent years, the Java platform has gained great popularity and is being adopted in a widening class of application scenarios. While in the past the major limitation of using Java as a development platform was a dramatic performance penalty, nowadays the benefits of Java can be achieved at an acceptable performance cost, as demonstrated by the benchmarks reported in [1]. Recently, Java has been used in real-world critical applications, such as on-line banking and stock exchange trading. Moreover, Java is starting to be adopted in even more critical scenarios, such as process and remote control systems. For instance, Java has been used to develop the ROver Sequence Editor (ROSE), a component of the Rover Sequencing and Visualization Program (RSVP), used to control the Spirit robot in the exploration of Mars [2].

Given such scenarios, we claim that the time is ripe for addressing the dependability issues of the JVM. Current implementations have no direct support for fault tolerance: Java applications either ignore fault tolerance mechanisms or achieve them through approaches that are outside the scope of the JVM itself. Although there is a growing need in the industrial and academic communities to improve the dependability of the Java Virtual Machine, only a few works in the existing literature have focused on this issue.

This work has been partially supported by the Italian Ministry for Education, University and Research (MIUR), within the framework of the FIRB Project “Middleware for advanced services over large-scale, wired-wireless distributed systems” (WEB-MINDS), and by the Regione Campania, within the framework of the “Centro di Competenza Regionale ICT”.

This paper represents a first step toward a dependability study of the Java Virtual Machine, which can be regarded as the “beating heart” of the Java Platform. The steps we aim to pursue are the following:

1. Since the JVM is a very complex system, classifying its failures and characterizing them and their occurrences is quite a hard task. Thus, we start by presenting a failure characterization of the Java Virtual Machine based on data gathered from standard bug databases, which constitute the only publicly available source of failure data for the JVM. Despite their qualitative nature, these reports should be considered well-founded, since they are submitted and evaluated by skilled people. In order to perform a significant analysis of failure data, careful filtering is required.

2. Analyze the extracted data, thus providing useful insights into i) the nature of the failures of the JVM, ii) the sources of the errors and iii) the relationships between workloads and failures in the JVM.

3. Reproduce the conditions which led to such failures, in order to analyze in detail the behavior of the JVM under faulty conditions, thus allowing us to define several significant profiles for an injection campaign aimed at discovering the failure modes of the JVM. The behavior of the virtual machine when the fault is injected is captured using a monitoring infrastructure developed with the Java Platform Profiling Architecture (JPPA) [3].

4. Once the results of the previous phases have been analyzed, we can define JVM workload profiles in order to conduct a comprehensive field-data measurement campaign.

This work is concerned with the first two of the steps outlined above, although we are able to give some preliminary results about the third. The results presented in this work clearly indicate that much more effort is still needed to improve the dependability of the JVM. In particular, some of our key findings, which are detailed in the paper, are described below:

1. Built-in error detection mechanisms need to be improved, since they were not able to detect a considerable share of failures (45.03%).

2. It is not possible to claim that the JVM keeps the same level of dependability across different platforms, since a large percentage of failures depended on the environment on which the JVM was running.

3. Developers have to face a trade-off between performance and reliability, since Garbage Collectors (which are often optimized for performance) and Just-in-time compilers are responsible, together, for more than 35% of failures.

4. The greatest part of failures (80%) occurred when the JVM was subjected to significant workloads.

The rest of the paper is organized as follows. Section II gives an overview of the Java Platform and discusses relevant previous work in the field of dependability assessment. Section III briefly introduces an architectural model of the JVM. Sections IV, V and VI deal with the first two steps of the dependability study outlined above: the first two describe the analyzed data sources and the criteria used for the analysis, whereas the third discusses the results obtained. Section VII presents a preliminary investigation of the behavior of the Java Virtual Machine when faults are injected. Finally, Section VIII concludes the paper and discusses directions for future work.

II. Background and Related Work

The Java platform was designed to be as portable as possible: Java technology components do not care what kind of computer, phone, TV or Operating System they run on. They can work on any kind of compatible device for which an implementation of the Java Virtual Machine is available. However, there are many differences in terms of available resources and computation capabilities among the various devices supporting the Java Platform: Java-enabled devices range from smart phones to multiprocessor servers. The analysis reported in this paper addresses dependability issues of JVMs fully compliant with the Java Virtual Machine specification [4], usually employed in J2EE (Java 2 Enterprise Edition) and J2SE (Java 2 Standard Edition) applications. Currently, we do not address issues related to the Virtual Machines employed in J2ME (Java 2 Micro Edition) applications.

Field failure data collection, measurement-based analysis and fault injection campaigns are the first steps of a process that leads to the dependability modeling of computer systems and to the development of mechanisms for detection and recovery. Several studies have addressed the characterization of the dependability of computer systems based on field failure data collection. In [5] Iyer and Tang performed a measurement-based analysis of field failure data for a DEC VAXCluster multicomputer system, proposing Markovian models for both failures and errors. Many other works deal with the measurement-based dependability analysis of COTS operating systems such as Windows [6], [7], [8], [9] and Linux [10], [11].

Moreover, several studies have evaluated system dependability through fault injection. Some of them address the Java Platform, such as [12], [13] and [14]. The first proposes a pattern-based fault injector, named JACA, which performs fault injection using reflection, whereas the latter proposes another fault injector designed for network applications, named FIONA, which injects faults into the DatagramSocket class using the JVM Tool Interface included in JPPA.

[...] error information gathered from event logs, automated failure reporting tools or failure reports provided by users or maintenance staff. The work proposed in this paper aims at performing a preliminary analysis of the dependability of the JVM, gathering information from the failure reports available in bug databases. Moreover, it analyzes the behavior of the JVM following the execution of faulty code, i.e., fragments of Java code which lead to a JVM failure (taken from the above-mentioned bug databases).

[Figure 1: block diagram. Labels: Java Applications; Java API (JDK); Vendor-specific Packages; JVM ISA; Execution Unit (JIT Compiler(s), Interpreter, Exception Handling, JNI); Memory Management Unit (Reference Handling, Garbage Collection, Finalization, Fast Allocation Mechanisms); System Services Unit (Thread Management, Timers, Class Loader, Management & Debugging); OS Virtualization Layer Unit; Host System ABI; User Host System ISA; Operating System; Host System ISA; Hardware.]

Fig. 1. Architecture of the Java Virtual Machine

III. An overview of the architecture of the JVM

The JVM is a virtual machine belonging to the High Level Language VM (HLL-VM) category [15]. An HLL-VM is a VM which adds support for cross-platform programming by providing a virtual Instruction Set Architecture (ISA) and by virtualizing the Application Binary Interface (ABI) and the ISA exposed by the underlying Operating System and Hardware, thus making applications written for the virtual machine platform-independent.

The virtual ISA of the JVM is a set of instructions called bytecodes; programs written in Java are compiled into bytecodes. The JVM is composed of four main components, depicted in figure 1:

• Execution Unit - It dispatches and executes operations, emulating a CPU. An operation can be a translated bytecode instruction, a compiled bytecode instruction or a native instruction. The Interpreter translates single bytecode instructions into native machine code, whereas the Just-In-Time (JIT) compiler translates entire methods into native code, applying some optimizations. Native instructions, instead, need no translation, since they are not bytecodes but native machine instructions; they are dynamically loaded, linked and executed through the Java Native Interface (JNI). Moreover, the Exception Handler handles exceptions thrown by both Java applications and the Virtual Machine. Exceptions thrown by applications are termed checked, while exceptions thrown by the VM are termed unchecked and are related to errors originating inside the virtual machine.

• OS Virtualization Layer Unit - It provides a platform-independent abstraction of the host system’s ABI. This abstraction layer provides a common gateway through which all JVM components access the host system’s resources.

• Memory Management Unit - It handles both the JVM heap area and the stack area, managing object allocation, reference handling, object finalization and garbage collection. Moreover, Fast Allocation Mechanisms are provided to allocate temporary memory areas for internal VM operations.

• System Services Unit - The components included in this unit offer services to Java applications. The Thread Management component handles thread creation and termination and implements the mechanisms for thread synchronization specified by the Java Virtual Machine Specification [4] and the Java Language Specification [16]. The Class Loader is in charge of dynamically loading and verifying Java class files (which contain bytecodes). The Timers component exposes functionalities to access system timers through the JVM. Finally, the Management and Debugging component includes functionalities for debugging Java applications and for managing the JVM.

IV. Data Sources and data extraction procedure

Bug databases are a precious source of information about the reliability and robustness of software systems: software faults are activated when buggy code is executed. Some kinds of bugs, namely Heisenbugs, Mandelbugs and Schroedinbugs, are particularly likely to elude all testing phases, since they usually disappear or alter their characteristics when they are investigated. The failure data presented in this paper are extracted from the Sun [17] and Jikes [18] bug databases. Other implementations, such as Kaffe and JRockit, had no public bug databases or very poor ones. These bug databases were periodically checked between June 2005 and October 2005.

Among thousands of submissions related to the whole Java Platform, 698 bug submissions related to JVM failures were selected and analyzed. This set was further refined by excluding submissions which met any of the following criteria:

• The bug has been marked as Fixed.

• The reported failure is related to a version of the JVM still under development or testing. Since our research aims to discover information about failures of operational JVMs, we dropped these submissions (e.g., submissions related to J2SE 6.0).

• The submission is elusive or does not contain enough information to characterize the failure.

• The failure report is related to a fault or an error in lower levels, such as the operating system or hardware. We are interested only in failures originating from software faults in the JVM itself.

• The failure is attributable to errors in upper levels, such as applications, middleware or application servers. Even if these reports are submitted as JVM-related bugs, their source is outside the JVM.
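The exclusion criteria listed above amount to a simple predicate over each submission. The following Java fragment is a minimal sketch of that filtering step; the `Submission` class, its field names and the sample identifiers are hypothetical and are not taken from the bug databases, where this screening was presumably carried out by reading each report.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical model of one bug-database submission; field names are illustrative only. */
final class Submission {
    final String id;
    final boolean markedFixed;          // criterion 1
    final boolean preReleaseJvm;        // criterion 2 (JVM under development or testing)
    final boolean insufficientInfo;     // criterion 3
    final boolean faultInOsOrHardware;  // criterion 4 (lower levels)
    final boolean faultInUpperLevels;   // criterion 5 (applications, middleware, servers)

    Submission(String id, boolean markedFixed, boolean preReleaseJvm,
               boolean insufficientInfo, boolean faultInOsOrHardware,
               boolean faultInUpperLevels) {
        this.id = id;
        this.markedFixed = markedFixed;
        this.preReleaseJvm = preReleaseJvm;
        this.insufficientInfo = insufficientInfo;
        this.faultInOsOrHardware = faultInOsOrHardware;
        this.faultInUpperLevels = faultInUpperLevels;
    }

    /** A submission is retained only if none of the five exclusion criteria applies. */
    boolean isRetained() {
        return !(markedFixed || preReleaseJvm || insufficientInfo
                || faultInOsOrHardware || faultInUpperLevels);
    }
}

final class SubmissionFilter {
    /** Returns the submissions that survive all exclusion criteria. */
    static List<Submission> retain(List<Submission> candidates) {
        return candidates.stream()
                .filter(Submission::isRetained)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Submission> candidates = List.of(
                new Submission("sample-1", false, false, false, false, false),
                new Submission("sample-2", true, false, false, false, false),
                new Submission("sample-3", false, false, false, true, false));
        for (Submission s : retain(candidates)) {
            System.out.println("retained: " + s.id);
        }
    }
}
```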

From the initially selected submissions, 147 (29 from the Jikes database, 118 from Sun) were retained; only 3 of these submissions were related to BohrBugs, which can be easily reproduced and located. These submissions reported 191 distinct failures. Each submission reports the environment on which the JVM was running, the configuration of the virtual machine (e.g., heap configuration, JIT compiler used) and stack traces. Many failure reports also contained a detailed description of the source of the failure (given by specialists in the evaluation section of the report itself) and information related to the frequency of the failure and its reproducibility.

V. Failure Classification Criteria

In this section we discuss criteria we adopted to classify extracted failure reports.

• Failure Manifestation - Reported failures were classified according to their manifestations (e.g., the message printed on the console). Five failure manifestation types were defined:

- VM Error Message - A Java Programming Language exception was thrown and reported to the user.

- OS Error Message - An Operating System level error message, such as SIGSEGV, was reported to the user.

- Hang/Deadlock - The JVM did not crash, but it stopped executing the application (or a part of it).

- Silent Crash - The JVM crashed silently, without printing any error message.

- Computation Error - The results obtained were different from the expected ones.

TABLE I
Categories of dependence on the environment

Platform-Ind.       The failure occurs independently of the environment
OS-Dep.             The failure occurs only on a specific Operating System
Platform-Dep.       The failure occurs only on a specific Hardware Platform
OS&Platform-Dep.    The failure occurs only on a specific Hardware Platform and Operating System

TABLE II
Workload levels

CPU BOUND       The Execution Unit is mainly stressed, many compilation tasks are performed and a lot of thread synchronization happens. Examples: application servers, transactional systems, parallel algorithms.
MEMORY BOUND    Almost all available heap space is allocated, and collections happen at a high frequency. Examples: scientific applications, database processing applications.
I/O BOUND       Input/output operations on the file system, databases and network connections are mainly executed. Examples: web servers, transactional systems.
COMMON          The application does not impose any particular workload. Examples: web browsers, e-mail clients, address books.

• Failure Source - By analyzing the information attached to failure reports we were able to pinpoint the component(s) of the JVM in which the source of the error was located. According to the architectural view described in section III, the following categories and subcategories were defined:

- Execution Unit - This category is further divided into the Shared Runtime, JIT Compilers, Interpreter and JNI subcategories.

- OS Virtualization Layer Unit.

- Memory Management Unit - This category is further divided into the Garbage Collector and Reference Handling subcategories.

- System Services Unit - This category is further divided into the Thread Management, Class Loader and Monitoring subcategories.

• Severity - A failure is defined as Catastrophic if it leads to the crash of the JVM, or as non-Catastrophic if the JVM keeps running despite the failure.

• Environment - Failure reports were classified according to their dependence on the environment on which the JVM was running. Four categories (described in table I) were defined.

• VM Activity - Past studies [19] showed that the average failure rate of a system is strongly correlated with the average workload on the system. Failure reports were classified according to the workload imposed on the JVM when the failure was reported. Table II shows the qualitative workload levels defined.

[Figure 2: two bar charts of the percentage of failures, each bar split into Catastrophic and Non-Catastrophic. (a) Categories: OS-Level, VM-Level, Silent Crash, Hang/Deadlock, Comp. error. (b) Categories: Out Of Memory, Stack Overflow, Runtime Exception, Assertion Failure, Internal Error, Others.]

Fig. 2. (a) Failure manifestations distribution (b) detailed view of VM-level failure manifestations. Computation errors were captured by comparing the “Expected Output” against the “Actual Output” in the failure report.

TABLE III
Timing categories

REGULAR    The failure occurs regularly whenever a particular sequence of operations is executed
STARTUP    The failure occurs at JVM startup
HOURLY     The failure occurs on average within an hour of JVM startup
DAILY      The failure occurs on average once a day
WEEKLY     The failure occurs on average once a week

• Failure Frequency - Failure reports were classified according to the frequency of failure occurrences. This information was extracted from JVM core dumps or from hints given by submitters. Table III describes the categories defined.
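The six classification criteria of this section can be read as the fields of a record attached to each failure report. The Java fragment below is a sketch of such a data model, given only to make the taxonomy concrete; all type and member names are hypothetical, and the `UNKNOWN` values are an assumption added to represent reports for which a criterion could not be determined.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative data model of the classification criteria; every identifier is hypothetical.
enum Manifestation { VM_ERROR_MESSAGE, OS_ERROR_MESSAGE, HANG_DEADLOCK, SILENT_CRASH, COMPUTATION_ERROR }
enum FailureSource { EXECUTION_UNIT, OS_VIRTUALIZATION_LAYER, MEMORY_MANAGEMENT_UNIT, SYSTEM_SERVICES_UNIT }
enum Severity { CATASTROPHIC, NON_CATASTROPHIC }
// UNKNOWN is an assumption: it marks reports where the criterion could not be determined.
enum EnvironmentDependency { PLATFORM_INDEPENDENT, OS_DEPENDENT, PLATFORM_DEPENDENT, OS_AND_PLATFORM_DEPENDENT, UNKNOWN }
enum Workload { CPU_BOUND, MEMORY_BOUND, IO_BOUND, COMMON, UNKNOWN }
enum Frequency { REGULAR, STARTUP, HOURLY, DAILY, WEEKLY, UNKNOWN }

final class ClassifiedFailure {
    final Manifestation manifestation;
    final Set<FailureSource> sources; // a failure may be located in more than one component
    final Severity severity;
    final EnvironmentDependency environment;
    final Workload workload;
    final Frequency frequency;

    /** The sources set must be non-empty. */
    ClassifiedFailure(Manifestation manifestation, Set<FailureSource> sources, Severity severity,
                      EnvironmentDependency environment, Workload workload, Frequency frequency) {
        this.manifestation = manifestation;
        this.sources = EnumSet.copyOf(sources);
        this.severity = severity;
        this.environment = environment;
        this.workload = workload;
        this.frequency = frequency;
    }

    /** True for the three workload levels of Table II other than COMMON. */
    boolean underSignificantWorkload() {
        return workload == Workload.CPU_BOUND
                || workload == Workload.MEMORY_BOUND
                || workload == Workload.IO_BOUND;
    }

    public static void main(String[] args) {
        ClassifiedFailure example = new ClassifiedFailure(
                Manifestation.OS_ERROR_MESSAGE,
                EnumSet.of(FailureSource.EXECUTION_UNIT, FailureSource.MEMORY_MANAGEMENT_UNIT),
                Severity.CATASTROPHIC,
                EnvironmentDependency.OS_DEPENDENT,
                Workload.CPU_BOUND,
                Frequency.DAILY);
        System.out.println("sources: " + example.sources
                + ", significant workload: " + example.underSignificantWorkload());
    }
}
```

Modeling the failure source as a set rather than a single value reflects the fact that one failure can involve several components.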

VI. Results

In this section we discuss the results obtained from the analysis of the selected failure reports. The first part analyzes failure manifestations and their relationship with the environment on which the JVM runs. The second part highlights the role of internal JVM components in the reported failures, whereas the third part sheds some light on the relationships between internal JVM components, failure frequency and the workloads imposed on the JVM itself. To this aim, out of the 147 extracted submissions (accounting for 191 failure reports), 108 failure reports (56.54%) were selected for frequency analysis, 114 failure reports (59.69%) for workload analysis and 101 submissions (68.71%) for environment dependency analysis.

A. Failure manifestation analysis

Figure 2-a depicts a bar chart of failure manifestations and their severity. The most recurrent manifestation is an OS-level message (45.03%), followed by VM-level messages (32.46%), hangs or deadlocks (11.52%), computation errors (5.76%) and silent crashes (5.24%). We found that almost all failures lead to a VM crash (86.06%). Only computation errors are always non-catastrophic. A quarter of hangs/deadlocks are non-catastrophic (the virtual machine is still able to run other tasks), whereas a small part of VM-level manifestations (13.09%) does not lead to a VM crash. Almost half of the failures manifested as OS-level messages (e.g., SIGBUS, SIGSEGV or ACCESS VIOLATION). This means that the built-in error detection mechanisms are not able to cover all the activities of the JVM: in many cases the JVM crashes without detecting any abnormal condition.
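As a consistency check, the manifestation shares quoted above can be converted back into failure counts, given the 191 failures reported in section IV. The Java fragment below is a small sketch of that arithmetic; the class and method names are hypothetical, and the counts it prints are implied by the rounded percentages rather than stated in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ManifestationCounts {
    /** Number of distinct failures reported in the selected submissions (section IV). */
    static final int TOTAL_FAILURES = 191;

    /** Manifestation shares, in percent, as quoted in the text. */
    static Map<String, Double> reportedShares() {
        Map<String, Double> shares = new LinkedHashMap<>();
        shares.put("OS-level message", 45.03);
        shares.put("VM-level message", 32.46);
        shares.put("Hang/Deadlock", 11.52);
        shares.put("Computation error", 5.76);
        shares.put("Silent crash", 5.24);
        return shares;
    }

    /** Nearest integer count implied by a percentage of TOTAL_FAILURES. */
    static long impliedCount(double percent) {
        return Math.round(percent * TOTAL_FAILURES / 100.0);
    }

    /** Sum of the implied counts over all manifestation types. */
    static long impliedTotal() {
        long total = 0;
        for (double percent : reportedShares().values()) {
            total += impliedCount(percent);
        }
        return total;
    }

    public static void main(String[] args) {
        reportedShares().forEach((name, percent) ->
                System.out.println(name + ": " + percent + "% -> " + impliedCount(percent) + " failures"));
        System.out.println("implied total: " + impliedTotal());
    }
}
```

The implied counts add up to the 191 failures of section IV, so the five manifestation types partition the whole failure set.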

VM-level manifestations appear when error detection mechanisms pinpoint faulty conditions. In these cases an unchecked exception can be thrown by the virtual machine, thus giving the application code a chance to handle the faulty condition. With respect to VM-level manifestations, figure 2-b depicts a bar chart of the various error messages reported and their severity. Among VM-level manifestations, the most recurrent is OutOfMemoryError (44.07%). InternalError (15.25%), RuntimeException and AssertionFailure (11.86% each), StackOverflow (6.78%) and other exceptions (10.17%), e.g. NullPointerException, are reported less frequently.

[Figure 3: stacked bar of failures by environment dependency. UNKNOWN: 27.59%; classified: 72.41%, of which PLATFORM Ind. 39.08% (53.97%), OS Dep. 23.56% (32.54%), PLATFORM Dep. 3.45% (4.76%), OS&PLATFORM Dep. 6.32% (8.73%).]

Fig. 3. Relationships between failures and environment. In the bar on the right, the value without parentheses represents the absolute percentage of environment-dependent (or independent) failures, whereas the value in parentheses represents the percentage relative to the number of failures whose OS dependency can be deduced from the submissions in the bug databases.

Even if applications could handle these conditions through the Java exception handling mechanism, we found that in most cases (with the exception of RuntimeException manifestations) the consequences were catastrophic. This indicates either that the state of the virtual machine had become so corrupted that no recovery action was possible, or that no recovery action was taken in the Java applications, since developers did not expect to face such extreme conditions.
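The remark that applications could in principle handle VM-level conditions can be illustrated with a short Java fragment. It is a minimal sketch, not taken from any of the analyzed reports: it provokes a `StackOverflowError` through unbounded recursion and catches it at the call site. The class and method names are hypothetical. Whether continuing after such an error is safe depends on the state in which the virtual machine and the application are left, which is exactly the difficulty noted above.

```java
final class VmErrorHandling {
    private static int depth = 0;

    /** Recurses without a base case until the thread stack is exhausted. */
    private static void recurse() {
        depth++;
        recurse();
    }

    /**
     * Triggers a StackOverflowError and reports whether the application
     * was able to catch it and continue.
     */
    static boolean recoversFromStackOverflow() {
        depth = 0;
        try {
            recurse();
            return false; // not expected to be reached
        } catch (StackOverflowError e) {
            // The stack has been unwound to this frame, so execution can continue here.
            return true;
        }
    }

    public static void main(String[] args) {
        boolean recovered = recoversFromStackOverflow();
        System.out.println("caught StackOverflowError: " + recovered
                + " (recursion depth reached: " + depth + ")");
    }
}
```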

To gain an understanding of the relationship between failures and the underlying environment, we analyzed the dependency of the reported failures on the Operating System and the Hardware Platform, as depicted in figure 3. In some cases we were not able to distinguish between environment-dependent and independent failures. In the remaining cases (more than 70%), we observed that 53.97% of the failures were platform-independent, i.e., the same application showed the same failure on different operating systems and hardware platforms. Even if only a small percentage of failures (4.76%) depended exclusively on the hardware platform, a more considerable percentage of failures were dependent on the Operating System (32.54%) or on both OS and Hardware (8.73%). These results indicate that there is a substantial dependency on the Operating System. Therefore, it is not possible to claim that Java applications keep the same level of dependability across different operating systems.
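The relative percentages quoted above follow from the absolute ones in figure 3 by renormalizing over the failures whose environment dependency could be determined. The worked computation below uses symbols introduced here for illustration only (they do not appear in the paper): p_abs is the absolute share of a category and p_unknown is the share of unclassified failures.

```latex
\[
p_{\mathrm{rel}} = \frac{p_{\mathrm{abs}}}{100\% - p_{\mathrm{unknown}}} \times 100\%,
\qquad p_{\mathrm{unknown}} = 27.59\%,
\qquad 100\% - p_{\mathrm{unknown}} = 72.41\%
\]
\[
\frac{39.08}{72.41} \approx 53.97\%, \quad
\frac{23.56}{72.41} \approx 32.54\%, \quad
\frac{3.45}{72.41} \approx 4.76\%, \quad
\frac{6.32}{72.41} \approx 8.73\%
\]
\[
39.08\% + 23.56\% + 3.45\% + 6.32\% = 72.41\%
\]
```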

To gain a more detailed view of the relationship between failures and operating systems, we analyzed the OS-dependent failures reported on Windows, Linux and Solaris. The results of this analysis are reported in table IV. They show that the dependency on the underlying operating system is more critical in Linux than in Windows and Solaris. However, the fourth column of table IV highlights that there is a large margin of uncertainty, since in many cases it was not possible to determine whether the failure is OS-dependent or OS-independent.

TABLE IV
Detailed view of OS-dependent failures

OS         OS-DEP %    OS-IND %    UNKNOWN %
Windows    27.03%      38.71%      32.43%
Linux      40.00%      52.46%      12.00%
Solaris    19.35%      43.86%      40.32%

[Figure 4: pie chart of failure sources. Legend: OS Virtualization Layer, Execution Unit, Memory Management Unit, System Services Unit. Slice values: 15.69%, 44.12%, 16.18%, 24.02%.]

Fig. 4. Failure sources

B. Failure sources analysis

By analyzing the stack traces and core dumps attached to bug submissions it is possible to gain useful insights into failure sources, pinpointing the JVM components in which errors were located. Often the source of a failure is located in more than one component: 22.47% of the reported failures were due to errors in more than one component.

The percentage of failures for each component of the JVM is depicted in figure 4. It is clearly visible that the greatest part of failures is due to the Execution Unit. Moreover, looking at the details of the subcomponents described in section III, which are reported in table V, it emerges that:

TABLE V

Unit                       Sub-component           Failures
OS Virtualization Layer                            18.07%
Execution Unit             Execution               19.33%
                           Optimizing JIT          15.13%
                           JNI                     3.78%
                           Base JIT                4.20%
                           Interpreter             2.10%
Memory Management Unit     GC                      16.81%
                           Ref Handler             5.04%
                           Other Memory-Related    1.26%
System Services            Thread Management       10.92%
                           Class Loader            2.52%
                           Monitoring              0.84%

[Figure 5: two bar charts of the percentage of failures per JVM component (OS Virtualization Layer, Execution Unit, Memory Management Unit, System Services Unit). (a) Frequency categories: REGULAR, STARTUP, HOURLY, DAILY, WEEKLY. (b) Workload categories: CPU BOUND, I/O BOUND, MEM BOUND, COMMON.]

Fig. 5. Frequency and workload classification of failures with respect to JVM components

- The greatest part of failures in the Memory Management Unit (72.73%) is due to the Garbage Collector.

- Runtime support operations and optimized just-in-time compilation tasks account for 77.36% of Execution Unit failures.

- The greatest part of failures in the System Services Unit (76.41%) is due to the Thread Management sub-component.

By analyzing these results it is possible to argue that:

- Runtime support operations, such as method invocation, stack frame allocation and deallocation or exception handling, seem to be the most critical dependability bottleneck in the JVM.

- The optimizing JIT compiler, even though it markedly improves the performance of Java applications, is one of the major sources of failures in the JVM; therefore Java developers have to cope with a trade-off between performance and reliability.

- The Garbage Collector still remains one of the most error-prone components in the JVM. In particular, low-pause or high-throughput garbage collectors seem to be critical for JVM reliability; therefore there is another trade-off between the performance of the collector and its reliability.

- The OS Virtualization Layer also has a deep impact on the dependability of the JVM. In particular, this component is responsible for 15.91% of Solaris failures, 14.29% of Windows failures and 24.21% of Linux failures.

These observations show that the JVM is a complex system characterized by several dependability bottlenecks. In particular, all performance-enabling components of the JVM represent a serious threat to JVM dependability. Furthermore, the interface between the virtual machine and the underlying environment is one of the most critical dependability bottlenecks for the JVM itself.

C. Relationships between failure frequency and workloads

We conclude the analysis with a discussion of the relationship between the frequency of failures, the workloads imposed on the virtual machine and the components of the virtual machine itself. Figure 5-a reports the percentage of errors with respect to JVM components for each frequency category.

“Regular” failures are the most recurrent ones (39.81%). Since regular failures are related to known issues in JVM implementations, Java developers can avoid them by adopting proper workarounds. Regular failures are mainly attributable to the Execution Unit (22.21%) and to the Memory Management Unit (12.96%). “Startup” (11.11%) and “Hourly” (10.19%) failures occur in the first stages of Java program execution. The OS Virtualization Layer and the Execution Unit are the main causes of hourly failures (6.48% and 2.78%, respectively), whereas each component plays an equivalent role in startup failures. Many non-regular failures show a daily or weekly frequency (19.44%).

It is worth noting that Execution Unit and System Services Unit failures increase when frequency decreases, whereas Memory Management Unit and OS Virtualization Layer Unit failures decrease when frequency decreases. This suggests the presence of software aging phenomena in these components (especially in the JIT compilers, Shared Runtime Support and Thread Management sub-components). Further investigations are required to gain more details about the dynamics of these phenomena, which, as stated in [20], represent a substantial source of failures in software systems.

TABLE VI
Relationships between failure frequencies and workload levels

           CPU BOUND    I/O BOUND    MEM BOUND    COMMON
STARTUP    20.00%       30.00%       10.00%       40.00%
HOURLY     50.00%       40.00%       10.00%       0.00%
DAILY      20.00%       55.00%       20.00%       5.00%
WEEKLY     40.00%       40.00%       5.00%        15.00%
REGULAR    15.15%       15.15%       27.27%       42.42%

Figure 5-b reports the percentage of errors with respect to JVM components for each workload level defined. The greatest percentage of failures occurred under CPU Bound workloads (32.00%), followed by I/O Bound workloads (26.40%). Fewer failures occurred under Memory Bound workloads (21.60%) or “Common” workloads (20.00%).

It is evident that the greatest part of failures (80%) occurs when significant workloads are imposed on the JVM; moreover, CPU Bound and I/O Bound applications seem to be more critical for the JVM than Memory Bound applications. These results also indicate that CPU Bound and I/O Bound applications, such as Web Servers, mainly stress the Execution Unit (50.98% CPU Bound; 38.46% I/O Bound) and the OS Virtualization Layer (25.49% CPU Bound; 32.69% I/O Bound). On the other hand, the largest percentages of failures under non-significant workloads are attributable to errors in the Memory Management Unit (33.03%) and in the System Services Unit (37.04%). Therefore, since the JVM suffers mainly under CPU Bound and I/O Bound applications, it is possible to argue that the development of strategies and mechanisms aimed at increasing the reliability of the virtual machine should first address these kinds of applications.

Table VI shows that Regular and Startup failures usually occur when non-significant workloads are applied, thus confirming that many failures are due to bugs in JVM implementations or to issues in the interface between the VM and the underlying environment. Moreover, non-regular failures occur when significant workloads are applied. For instance, weekly failures usually occur when CPU Bound or I/O Bound workloads are applied.

[Figure 6: diagram of the fault injection and monitoring setup. Labels: Java Applications; Test App; Faulty Class; System Classes; Java Virtual Machine; JVMTI Agent; Monitor MBeans; Local Log (events); Events; State information; Querying; JVM Events and State; Collected Data.]

Fig. 6. Fault injection mechanism and monitoring infrastructure

BUG ID | Failure manifestation | Code Description
4396719 | Silent Crash (Linux), ACCESS VIOLATION (Windows) | Iteratively allocates arrays of null objects.
5073365 | NullPointerException | Tries to change the priority of a thread after the thread has exited.
6343401 | Either Silent Crash, SIGSEGV or ACCESS VIOLATION | Executes several times a function copying an array of bytes into another array.

TABLE VII

Faulty code fragments executed to analyze the behavior of the Virtual Machine

VII. Injecting faults into the JVM

The failure classification reported in the previous section highlights the most error-prone components in the virtual machine. By analyzing extracted submissions, it is possible to define several fault injection profiles, executing the code which activates the bug or reproducing the conditions under which the failure has manifested. In this section we present an analysis of the behavior of the Java Virtual Machine when faults are injected through the execution of “faulty” code fragments.

The infrastructure used to inject faults and analyze the behavior of the virtual machine is reported in figure 6. The JVM is instrumented using the Java Platform Profiling Architecture [3], namely a JVM Tool Interface agent and several Java Management Extensions Beans (the Monitor MBeans component depicted in figure 6). The former implements callbacks to handle events raised by the Virtual Machine. These callbacks make use of the JVMTI API in order to retrieve information about the Virtual Machine’s state. The latter capture more details about the state of the virtual machine and collect information sent from the JVMTI agent.
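To give a flavor of the kind of state information a management bean can expose, the following minimal sketch queries the standard platform MXBeans of the java.lang.management package. It is only an assumption about what such a monitor might sample; the actual Monitor MBeans of figure 6 are not specified here, and the class name and the output layout are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

/** Hypothetical state sampler built on the standard platform MXBeans. */
class JvmStateSnapshot {

    /** Returns a small textual snapshot of heap, collector and thread state. */
    static String snapshot() {
        StringBuilder sb = new StringBuilder();

        // Heap occupancy as seen by the memory MXBean.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("MEMORY heapUsed=").append(heap.getUsed())
          .append(" heapCommitted=").append(heap.getCommitted()).append('\n');

        // One line per garbage collector registered in this JVM.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append("MEMORY collector=").append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime()).append('\n');
        }

        // Live and daemon thread counts.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        sb.append("THREAD MANAGEMENT live=").append(threads.getThreadCount())
          .append(" daemon=").append(threads.getDaemonThreadCount()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(snapshot());
    }
}
```

Sampling such a snapshot periodically, and alongside the events reported by a JVMTI agent, would yield a timeline comparable in spirit to the event logs discussed in this section.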

When the faulty code is executed, the monitoring infrastructure collects data about both the evolution of the state of the JVM and the failure caused by that fault. This infrastructure is a part of a more complex monitoring system discussed in [21].

Component | Timestamp | Event | Additional Information
RUNTIME CORE | 20051117141924070 | THREAD START | GC Daemon
RUNTIME | 20051117141924070 | CONTEXT SWITCH | 14 RMI Reaper - 15 GC Daemon
MEMORY | 20051117141924071 | GC START |
MEMORY | 20051117141924119 | GC FINISH |
THREAD MANAGEMENT | 20051117141924119 | MONITOR WAITED | Reference Handler Ljava/lang/ref/Reference$Lock; - NOTIFIED 17538308
THREAD MANAGEMENT | 20051117141924119 | MONITOR WAIT | Reference Handler Ljava/lang/ref/Reference$Lock; - 17538308 0
THREAD MANAGEMENT | 20051117141924119 | MONITOR WAITED | Finalizer Ljava/lang/ref/ReferenceQueue$Lock; - NOTIFIED 24212267
THREAD MANAGEMENT | 20051117141924119 | MONITOR WAIT | Finalizer Ljava/lang/ref/ReferenceQueue$Lock; - 24212267 0
THREAD MANAGEMENT | 20051117141924119 | MONITOR WAIT | GC Daemon Lsun/misc/GC$LatencyLock; - 12455463 60000
RUNTIME | 20051117141924119 | CONTEXT SWITCH | 15 GC Daemon - 12 Thread-2
...
CLASSLOADER | 20051117141924199 | LOAD | Ljava/util/TreeMap$KeyIterator; -
CLASSLOADER | 20051117141924199 | PREPARE | Ljava/util/TreeMap$PrivateEntryIterator; -
CLASSLOADER | 20051117141924199 | PREPARE | Ljava/util/TreeMap$KeyIterator; -
MEMORY | 20051117141927046 | GC START |
MEMORY | 20051117141927048 | GC FINISH |
MEMORY | 20051117141927379 | GC START |
MEMORY | 20051117141927380 | GC FINISH |
MEMORY | 20051117141927550 | GC START |
MEMORY | 20051117141927551 | GC FINISH |
...
MEMORY | 20051117141929066 | GC START |
MEMORY | 20051117141929068 | GC FINISH |
MEMORY | 20051117141929251 | GC START |
MEMORY | 20051117141929252 | GC FINISH |
MEMORY | 20051117141929407 | GC START |

(Figure annotations: “Collections in Faulty Conditions” marks the repeated GC START / GC FINISH pairs; “JVM Crash” marks the final GC START.)

Fig. 7. Crash during garbage collection

By analyzing collected data it is possible to identify the components that caused the failure, along with the errors activated in those components. Moreover, in order to discover which fault led to the activated errors, the same faulty code fragment is executed with different configurations of the virtual machine.

Three types of code fragments, summarized in table VII, are analyzed. These are extracted from the Sun Hotspot Bug Database. The first proves that the error in a component could be activated by a fault in another component, whereas the second points out that the thread management sub-component should not be considered very reliable. Finally, the third code fragment confirms the trade-off between performance and reliability regarding the optimizing JIT compiler.

A. Bug 4396719 - Error during garbage collection

This failure is catastrophic: the JVM crashes after a certain number of iterations. Analyzing the event log (depicted in figure 7), it is evident that the crash occurred during garbage collection (the last event logged was GC START, without a GC FINISH event). Therefore the failure has been activated by an error in the garbage collector. The first part of figure 7 shows events logged when a collection is performed in “normal” conditions: a daemon thread (GcDaemon) is activated, the collection is executed (GC START and GC FINISH event pair), and then the heap is freed (by the Finalizer thread). Instead, the second part of figure 7 reports events logged when the collection is performed in “faulty” conditions: the Garbage Collector is invoked several times consecutively (about 5 times per second) and no object is freed during these collections (the Finalizer thread is never activated).
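The reasoning applied to the event log (a GC START with no matching GC FINISH indicates a crash during collection) can be sketched as a small log check. The following is a minimal illustration, assuming a simplified whitespace-separated line layout with a 17-digit timestamp as in figure 7; the class and method names are hypothetical and are not part of the monitoring infrastructure.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical checker that looks for a collection that started but never finished. */
class GcLogCheck {

    /**
     * Returns the timestamp of the last GC START that is not followed by a GC FINISH,
     * or null if every logged collection completed.
     */
    static String unfinishedCollection(List<String> lines) {
        String open = null;
        for (String line : lines) {
            if (line.contains("GC START")) {
                open = timestampOf(line);   // a collection begins
            } else if (line.contains("GC FINISH")) {
                open = null;                // the pending collection completed
            }
        }
        return open;
    }

    /** Extracts the first 17-digit token of a line, assumed to be the timestamp. */
    static String timestampOf(String line) {
        for (String tok : line.trim().split("\\s+")) {
            if (tok.matches("\\d{17}")) {
                return tok;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Tail of the log shown in figure 7.
        List<String> tail = Arrays.asList(
                "MEMORY 20051117141929251 GC START",
                "MEMORY 20051117141929252 GC FINISH",
                "MEMORY 20051117141929407 GC START");
        System.out.println("unfinished collection at: " + unfinishedCollection(tail));
    }
}
```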

Although it could seem that this error is activated by a fault in the memory management unit, due to low memory conditions, we observed the failure at the same point even with greater heap sizes. Moreover, the error is activated both with the client and the server virtual machine. Nevertheless, when the initial heap size is increased, the error is activated later. This could mean that the fault activated by this code fragment has to be located in the mechanisms that manage heap resizing (the heap size dynamically grows or shrinks according to application requirements).
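Re-running the same fragment under several virtual machine configurations can be organized as a small matrix of launcher command lines. The sketch below only builds the commands, using the standard HotSpot launcher options -client, -server and -Xms; the main class name Bug4396719 and the heap sizes are hypothetical placeholders, and recent 64-bit HotSpot builds may ignore -client and run the server VM regardless.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that enumerates JVM configurations for one faulty fragment. */
class ConfigMatrix {

    /** Builds one launcher command per (VM mode, initial heap size) pair. */
    static List<List<String>> commands(String mainClass, List<String> vmModes, List<String> heapSizes) {
        List<List<String>> out = new ArrayList<>();
        for (String mode : vmModes) {
            for (String heap : heapSizes) {
                out.add(Arrays.asList("java", mode, "-Xms" + heap, mainClass));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> cmds = commands(
                "Bug4396719",                       // placeholder class name
                Arrays.asList("-client", "-server"),
                Arrays.asList("64m", "256m"));      // placeholder heap sizes
        for (List<String> cmd : cmds) {
            // Each command could be launched with new ProcessBuilder(cmd).inheritIO().start().
            System.out.println(String.join(" ", cmd));
        }
    }
}
```

Recording the exit status and the event log of each run would then allow the point of failure to be compared across configurations.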

B. Bug 5073365 - Error setting thread priority

This failure is non-catastrophic. A NullPointerException is thrown when trying to set the priority of a terminated thread. The error is clearly located in the Thread Management sub-component. The log reported in figure 8 shows that no finalization or garbage collection occurs between the thread termination (THREAD END event) and the exception (EXCEPTION event). Thus the Thread object is still reachable and alive. Moreover, none of the methods of the class Thread were JIT-compiled, thus the failure is not caused by errors during code optimization.


Component | Timestamp | Event | Additional Information
THREAD MANAGEMENT | 20051117144259222 | MONITOR WAIT | main LBug5073365; 22293109
RUNTIME CORE | 20051117144259222 | THREAD START | Thread-3
RUNTIME | 20051117144259223 | CONTEXT SWITCH | Thread-2 Thread-3
RUNTIME CORE | 20051117144259725 | THREAD END | Thread-3
THREAD MANAGEMENT | 20051117144259726 | MONITOR WAITED | main LBug5073365; NOTIFIED 22293109
RUNTIME | 20051117144259726 | EXCEPTION | main setPriority(I)V Ljava/lang/Thread; 28 java.lang.NullPointerException
RUNTIME | 20051117144259726 | CONTEXT SWITCH | Thread-3 main Ljava/lang/NullPointerException;

Fig. 8. An exception is thrown after thread termination

Failure details
Timestamp: 20051117153216617
Message: SIGSEGV
Current Thread: CompilerThread1 Java daemon in VM
Other details: opto: 32 Bug6343401.compressMsg([BI)I (153 bytes)

Fig. 9. Failure during optimizing JIT compilation

The fault that led to this failure is attributable to an erroneous update of the thread’s data structures upon its termination.
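The scenario described for this bug (changing the priority of a thread after it has exited) can be illustrated with the following minimal sketch. It is a reconstruction from the description in table VII, not the code attached to the bug report, and the class name is hypothetical. On the affected virtual machine a NullPointerException was observed; on JVMs where the defect has been fixed, we would expect the call to complete without an exception.

```java
/** Hypothetical reconstruction of the scenario: setPriority on a terminated thread. */
class SetPriorityAfterExit {

    /**
     * Starts a thread, waits for it to terminate, then changes its priority.
     * Returns "none" if no exception is raised, otherwise the exception class name.
     */
    static String probe() throws InterruptedException {
        Thread worker = new Thread(() -> { /* exits immediately */ });
        worker.start();
        worker.join();                       // the thread has exited at this point
        try {
            worker.setPriority(Thread.MAX_PRIORITY);
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getName();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("exception after termination: " + probe());
    }
}
```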

C. Bug 6343401 - Error in just-in-time compilation

This failure is catastrophic: the JVM crashes each time this code is executed. Analyzing collected data, it is not possible to find any anomaly in the behavior of the virtual machine. However, analyzing the failure in more detail (summarized in figure 9), it is evident that an error during just-in-time compilation led to a JVM crash. The error is no longer activated when the same code is executed with the “client” VM. This means that the compilation of the faulty method compressMsg (ref. fig. 9) activates a fault in the Optimizing JIT Compiler sub-component, which in turn fails the optimization of the above mentioned method, leading to the crash of the Java Virtual Machine.
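The descriptor ([BI)I reported in figure 9 denotes a method taking a byte array and an int and returning an int. The following sketch shows the general shape of such a workload: a byte-copying method invoked many times, so that a just-in-time compiler is likely to select it for compilation. The method body, the buffer size and the iteration count are assumptions made for illustration; this is not the code from the bug report and is not expected to reproduce the crash.

```java
/** Hypothetical stand-in for a hot byte-copying method with descriptor ([BI)I. */
class HotCopyLoop {

    /** Copies the first len bytes of msg into a new array and returns the number of bytes copied. */
    static int compressMsg(byte[] msg, int len) {
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = msg[i];
        }
        return out.length;
    }

    /** Calls the method repeatedly so that it becomes a candidate for JIT compilation. */
    static long run(int iterations) {
        byte[] msg = new byte[256];          // arbitrary buffer size
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += compressMsg(msg, msg.length);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("bytes copied: " + run(20000));
    }
}
```

On HotSpot, running such a program with the -XX:+PrintCompilation option lists the methods that were compiled, while -Xint disables compilation entirely; both can help to check whether a failure depends on the compiler.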

VIII. Conclusions and Future Work

This paper presented a failure analysis for the Java Virtual Machine. The results of the analysis indicated how failures are distributed with respect to failure manifestations, host system environment, internal JVM components, frequency and workloads. We showed that there is a non-negligible dependency of JVM reliability on the Operating System on which it runs, and that the Execution Unit is responsible for the greatest percentage of reported failures. Furthermore, even if a considerable number of failures are related to bugs in JVM implementations, there is a strict relationship between failures and workloads imposed on the JVM.

We then investigated the behavior of the virtual machine when faults are injected, allowing us to obtain more insight about its dependability issues.

Starting from the analysis presented in this paper, we are going to perform a fault injection campaign to investigate the behavior of the virtual machine when faults are injected into its components. Once enough knowledge about JVM failure modes is acquired, we will be able to conduct a comprehensive field-data measurement campaign aimed at performing a dependability assessment of the various implementations of the Java Virtual Machine.

References

[1] J.M. Bull, L.A. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for Scientific Applications. Proceedings of the joint ACM-ISCOPE conference on Java Grande, 2001.

[2] Frank Hartman and Scott Maxwell. Driving the Mars Rover. Linux Journal, (125):68–70, September 2004.

[3] Java Community Process (JCP). JSR-163: Java Platform Profiling Architecture (JPPA), 2004.

[4] T. Lindholm and F. Yellin. The Java(TM) Virtual Machine Specification. Sun Microsystems, 2nd edition, 1999.

[5] D. Tang and R.K. Iyer. Dependability measurement and modeling of a multicomputer system. IEEE Transactions on Computers, Volume 42(Issue 1):Pages 62–75, January 1993.

[6] A. Kalakech, K. Kanoun, Y. Crouzet, and J. Arlat. Benchmarking the dependability of Windows NT4, 2000 and XP. Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN04), June 2004.

[7] R.K. Iyer, Z. Kalbarczyk, and M. Kalyanakrishnam. Measurement-based analysis of networked system availability. Performance Evaluation Origins and Directions, Ed. G. Haring, Ch. Lindemann, M. Reiser.

[8] R.K. Iyer, Z. Kalbarczyk, and J. Xu. Networked Windows NT system field data analysis. 1999 Pacific Rim International Symposium on Dependable Computing (PRDC99), December 1999.

[9] C. Simanche, M. Kaaniche, and A. Saidane. Event log based dependability analysis of Windows NT and 2K systems. 2002 Pacific Rim International Symposium on Dependable Computing (PRDC02), December 2002.

[10] W. Gu, R.K. Iyer, Z. Kalbarczyk, and Z. Yang. Characterization of Linux kernel behavior under errors. 2003 International Conference on Dependable Systems and Networks (DSN03), June 2003.

[11] C. Simanche and M. Kaaniche. Measurement-based availability analysis of Unix systems in a distributed environment. 12th International Symposium on Software Reliability Engineering (ISSRE01), November 2001.

[12] E. Martins, M.F. Rubira, and N.G.M. Leme. A reflective fault injection tool based on patterns. Proceedings of the International Conference on Dependable Systems and Networks (DSN ’02), June 2002.

[13] R.L.O. Morales, E. Martins, and N.V. Mendes. Fault injection approach based on dependence analysis. Proceedings of the 29th Annual Computer Software and Applications Conference (COMPSAC ’05), 2005.

[14] G. Jacques-Silva, R.J. Debres, J. Gerchmann, and T. Silva Weber. Fiona: A fault injector for dependability evaluation of Java-based network applications. Proceedings of the 3rd IEEE International Symposium on Network Computing and Applications (NCA ’04), 2004.

[15] J.E. Smith and R. Nair. The architecture of virtual machines. IEEE Computer, Volume 38(Issue 5):Pages 32–38, May 2005.

[16] J. Gosling, B. Joy, G. Steele, and G. Bracha. The Java Language Specification. Sun Microsystems, 3rd edition, 2005.

[17] Hotspot Java Virtual Machine. http://java.sun.com/products/hotspot/.

[18] Jikes Research Virtual Machine. http://jikesrvm.sourceforge.net.

[19] R.K. Iyer, S.E. Burtner, and E.J. McCluskey. A statistical failure/load relationship: results of a multicomputer study. IEEE Transactions on Computers, Volume C-31:Pages 697–705, July 1982.

[20] K.S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova. Modeling and analysis of software aging and rejuvenation. Proceedings of the 33rd Annual Symposium on Simulation, Washington D.C., 2000.

[21] Salvatore Orlando. Dependability analysis of the Java Virtual Machine. Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN 05), Supplemental Volume, June 2005.