tesi di laurea magistrale
relatori
Ch.mo Prof. Stefano Russo Ch.mo Prof. Marcello Cinque
correlatori
Ch.mo Ing. Flavio Frattini
Ch.mo Dr. Santonu Sarkar, Infosys Ltd., Bangalore, INDIA
candidato
Agostino Savignano Matr. M63000176
Mining Invariant Relationships for Failure Analysis of
Batch Software Systems
This thesis is in part the result of a research work done during the
InStep internship experience with Infosys Ltd
multinational provider of business consulting, information technology, software engineering and outsourcing services.
160.000 employees, 8 billion revenues.
Next Gen Computing Labs, within the Infosys campus of Bangalore, India.
Context
A batch software system is a system where jobs run without end user interaction, or can be scheduled to run as resources permit. This class of systems shows a very repetitive behaviour. They are often used:
to process large amount of transaction data
for scientific computation
as there may be failures in the execution of jobs, malfunctions have to be detected and actions have to be taken accordingly; manual detection is inefficient.
A feasible approach to automatically and dynamically characterise the behaviour of a system, in order to detect execution problems, is represented by flow intensity invariants.
A simple example of
flow intensity invariant
A platform for the batch processing of files: each file goes through a stage of elaboration. The processing time is proportional to the size
Measuring the file size and the time spent in a stage, I(x) and I(y), (the
flow intensities), the equation
invariant relationship characterising the expected behaviour of the batch system.
If there is an execution problem (e.g., file processing hangs) the equation does not hold any more.
)
(
)
(
y
k
I
x
Contribution
An approach for the
automatic
discovery of a set of flow
intensity
invariant relationships
in batch software systems
,useful to characterise the
expected behaviour
of the
system.
A detection mechanism that helps the operations personnel in identifying deviations of the actual system behaviour from the correct expected one is also presented.
The application of a tool implementing the approach,
MinerVa
, in two real world batch systems:
A SaaS application by Infosys *
SCoPE, the Supercomputer of the Federico II University
* S. Sarkar, R. Ganesan, M. Cinque, F. Frattini, S. Russo, and A. Savignano, “Mining Invariants from SaaS Application Logs”,
accepted for publication in the proceedings of the European Dependable Computing Conference, May 2014, Newcastle upon Tyne, UK
Approach
Source of knowledge of the system behaviour: event logs
Objective
automatically mine a set of invariant relationships between
time series measurements from system logs (learning phase)
use such relationships for detecting deviations of the system behaviour (detection phase)
Approach Data Flow
Learning phase
Off-line
Past observations
RESULTS: a set of linear relationships,
representing a model of the system
Detection phase
Either on-line or off-line
Uses the model to
predict the behaviour of a system on new
Sampling: from Event Logs to Time Series
PROBLEM
How to retrieve time series from event logs
SOLUTION
With a certain
time interval:
retrieve the events in the application logs
count the events,
according to the different
flow intensities
the “
sampling time
” is determined using a
tupling
algorithm, applied on particular events, e.g., job arrivals or
completions.
Mining: from Time Series to Invariants
PROBLEM
How to discover relationships between time series
SOLUTION
For each possible pair of flow intensities:
Perform a linear regression using: AutoRegressive models with eXogenous inputs (ARX)
Recursive Least Squares (RLS) algorithm Test the goodness of the regression:
R2 coefficient of determination
All the pairwise relationships which have a R2 value greater than a
threshold are considered as linear :
Detection: from Invariants to Alerts
PROBLEM
How to detect a deviation from the expected behaviour
SOLUTION
Using the regressed coefficients to predict the expected behaviour For each regressed invariant relationship:
compute the expected output
compare the expected output with the actual output
If the difference between the expected and the actual output is greater than an adaptive threshold, raise an alert.
SaaS CPG application by Infosys
A platform provided as a service by means of virtualized machines that:
Receives transaction details data daily from several subscribers all around the world.
Stores the data and performs analysis on the stored data.
The work items consisting of several records are provided as a file in a specific format and with different frequencies by the various
SaaS CPG application by Infosys: Mining
9 months of observations 108.000 workitems
Tupling on job completions Sampling time: 10 min
Identified flow intensities: 33 Linear regressions performed: 528
Discovered invariant relationships: 11
Time spent in post-processing validation - Time spent in archiving (i1)
Time spent in transaction validation - # rows completed (i4)
# files arrived - # files completed (i7)
# files started - # files completed (i8)
# files arrived - # files started (i9)
# rows started - # rows completed(i10)
SaaS CPG application: Broken Invariants
Detected broken invariants with p.i. threshold correspond or include the SLA violations discussed with the operations personnel.
Maintenance staff confirmed the possibility of violations for invariants
i1-i11.
Month Samples Threshold i1 … i7 i8 i9 …
M1 4320 10% 4035 … 3865 3754 3182 … p.i. 5 … 14 13 4 … M2 4464 10% 4204 … 3988 3903 3050 … p.i. 2 … 17 18 6 … M4 4598 10% 4306 … 4208 4106 3422 … p.i. 0 … 8 12 4 ...
SCoPE: Mining
8 months of observations 1.104.139 jobs
Tupling on job arrivals Sampling time: 60 s
Regression performed on a subset of node processing the same kind of jobs
Identified flow intensities: 15 Linear regressions performed: 105
Discovered invariant relationships: 69
# jobs active - # jobs arrived (i5)
WORKLOAD_VMEMORY - # jobs started (i13)
# jobs active - # jobs started (i16)
# jobs started - # jobs arrived (i19)
virtual memory - # of cores (i40)
CPU% - # jobs active (i48)
memory - # jobs started (i56)
SCoPE: Broken Invariants
i5, i13, i16 are almost never broken
i21 is broken for more than the 50% of samples and
i40 is broken for almost each samples
i19, i36, i48, i56, i59, i63, i66 are broken for less than the 2% of
samples.
Month Samples i5 i13 i16 i19 i40 i48 i66
M1 44640 13 0 0 431 41443 3328 6442
M2 43200 3 0 0 555 42097 4026 6743
Conclusions and Future Work
It is possible to automatically
discover invariants in batch systems
They are useful to the operation personnel to
reduce the quantity of data to be analysed for understanding the system behaviour in case of execution problems
In some cases, to detect the execution problem itself (SLA violations)
Future avenues of research:
Use logs from different layers Improve the detection
mechanism:
Different adaptive threshold
Improve the mining process:
Perform a continuous invariant mining: forgetting factor RLS algorithm.
Invariant identification with
Prediction Error Method instead of the Least Squared Method
Model relationships between
multiple time series instead of only pairwise relationships