
2 SIFT: A Scalable Iterative-Unfolding Technique for Filtering Execution Traces

2.4 Analysis

2.4.1 Efficiency

Section 2.4.1.1 discusses the efficiency of the core algorithms, Section 2.4.1.2 the efficiency of the iterative-unfolding algorithms, and Section 2.4.1.3 special cases.

2.4.1.1 Core Algorithms Efficiency

The derivation of our algorithms' asymptotic behaviour may be found in [24]; only the final results are presented here. Consider the maximum number of operations required to compare a single trace t against a set of traces S (the algorithm is given in Figure 5). In the worst-case scenario, when all traces within S are close to t and cannot be filtered out, all iterations must be performed on the full set S:

\[
C[P_1(t,S)] \in \underbrace{O\Bigl(K\,|S|\,N^2 \max_{0<l\le K} L_l\Bigr)}_{\alpha} + \underbrace{O\bigl(|S|\,N^2 L_0^2\bigr)}_{\beta} = O\bigl(|S|\,N^2 L_0^2\bigr) \tag{2.5}
\]

where K is the maximum level of compression, N is the maximum possible number of processes in a trace, and L_l is the maximum length of a given fingerprint at compression level l. The comparison for l > 0 is performed using the measures of distance (2.1) and (2.3); comparison of uncompressed traces is performed at l = 0 (using diff [26]); see Section 2.3.2.1.2 for details. The term α arises from the comparison of fingerprints and the term β from the comparison of uncompressed traces. Note that, in practice, most traces will be filtered out at high levels of compression and that the length of fingerprints, by construction, should be small.
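The filtering behaviour behind this bound can be illustrated with a minimal sketch. Figure 5 is not reproduced in this section, so the code below is not the algorithm of that figure: it assumes single-process traces given as lists of event names, three hypothetical fingerprint levels (a set of event names, event counts, and the raw sequence), and stand-in distances in place of measures (2.1) and (2.3). The names `fingerprint`, `distance` and `filter_against` are illustrative only. The returned `work` list records how many traces are still compared at each level, which is the quantity that stays at |S| in the worst case.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical fingerprints (assumptions, not the paper's definitions):
# level 2 keeps only the set of event names, level 1 keeps event counts,
# level 0 is the uncompressed event sequence.
def fingerprint(trace, level):
    if level == 2:
        return frozenset(trace)
    if level == 1:
        return Counter(trace)
    return tuple(trace)

def distance(a, b, level):
    """Illustrative distances in [0, 1]; stand-ins for measures (2.1) and (2.3)."""
    fa, fb = fingerprint(a, level), fingerprint(b, level)
    if level == 2:  # Jaccard distance between the sets of event names
        union = fa | fb
        return 1.0 - len(fa & fb) / len(union) if union else 0.0
    if level == 1:  # normalised difference between the event counts
        total = sum((fa | fb).values())
        return sum(((fa - fb) + (fb - fa)).values()) / total if total else 0.0
    # level 0: diff-like comparison of the uncompressed sequences
    return 1.0 - SequenceMatcher(None, fa, fb, autojunk=False).ratio()

def filter_against(t, traces, thresholds):
    """Keep the traces that stay within thresholds[l] of t at every level,
    starting at the most compressed level and ending at level 0."""
    candidates = list(traces)
    work = []  # number of traces compared against t at each level
    for level in range(len(thresholds) - 1, -1, -1):
        work.append(len(candidates))
        candidates = [s for s in candidates
                      if distance(t, s, level) <= thresholds[level]]
    return candidates, work
```

When most traces differ from t already at the fingerprint level, `work` shrinks quickly and the expensive level-0 comparison runs on few traces; when nothing can be filtered out, every entry of `work` equals |S|, which corresponds to the worst case discussed above.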

The algorithm for comparison within a set of traces calls a recursive procedure “compare” (given in Figure 6). The running time, T, of the function “compare” can be represented as

\[
T(|S|) = \sum_{i=1}^{a} T\bigl(|S| / b_i\bigr) + \mathrm{compare}(P_S), \tag{2.6}
\]

where P_S is a set of properties of the traces in set S. The coefficients b_i (fraction of elements in S) and a (number of clusters) change at each iteration: they are obtained from the clustering procedure (called from “compare”) and depend on P_S.

The problem formulated in (2.6) is too general and, to the best of our knowledge, cannot be “unraveled” without knowing the distributions of the parameters b_i and a. The worst-case scenario is when a = 1 and b_i = 1, i.e., the members of the set S cannot be partitioned into subsets since they are too close to each other. In this case

\[
C[P_2(S)] \in O\Bigl(|S|^2 K N^2 \max_{0<l\le K} L_l\Bigr) + O\bigl(|S|^2 N^2 L_0^2\bigr) = O\bigl(|S|^2 N^2 L_0^2\bigr). \tag{2.7}
\]

In closing, the worst-case computational time for P1(t,S) grows linearly with the number of traces. The algorithm P2(S) is quadratic in the number of traces, the number of processes per trace and the length of traces.

Assume that compression level l is in the range [0, 1, … , N], where 0 is an uncompressed trace level and N is a fingerprint containing the least amount of information. The initial set of traces is given by trace set S. S(i) returns the i-th member of the set S. Array Td (of size N+1) contains the maximum measures of distance between traces for different compression levels. Maximum percentage [0,1] of non-matching processes is denoted by Tp. Global variable Sout stores similar clusters of traces.

//Transform 2-tuple distance measure into a scalar
Procedure condense_tuple(d, p, Tp)
    if p > Tp
        m ← ∞;
    else
        m ← d;
    return m;

//recursive comparison function (note that recursion can be
//“unraveled” by parsing the tree in breadth)
Procedure compare(trace_set, l, Td, Tp)
    //calculate distance matrix D between traces
    M ← cardinality(trace_set);
    for i=1 to M
        for j=i+1 to M //D is symmetric
            (d, p) ← compare_traces(trace_set(i), trace_set(j), l);
            D(i, j) ← condense_tuple(d, p, Tp);
    //Cluster traces using distance data in D
    clusters ← cluster(D, Td(l));
    if (l > 0)
        for each cluster in clusters
            compare(cluster, l − 1, Td, Tp);
    else //reached uncompressed level
        add trace_set to Sout;

//main procedure
Procedure P2(S)
    Set Td and Tp;
    compare(S, N, Td, Tp);
    return Sout;

Figure 6. Algorithm for comparing traces within a given set S.
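The procedure of Figure 6 can be sketched in runnable form as follows. This is an illustration under stated assumptions rather than our implementation: the figure does not specify `cluster` or `compare_traces`, so the sketch assumes a single-linkage threshold clustering and a toy `toy_compare_traces` in which a trace is a tuple of processes and each process is a string of one-letter events. It also assumes that, at the uncompressed level, each resulting cluster is stored, and it returns the result instead of using a global variable.

```python
import math
from difflib import SequenceMatcher

def condense_tuple(d, p, tp):
    """Collapse the (distance, non-matching fraction) tuple into one scalar."""
    return math.inf if p > tp else d

def cluster(dist, threshold):
    """Assumed clustering rule: connected components of the graph linking
    traces whose distance is at most `threshold` (single linkage)."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def compare(trace_set, level, td, tp, compare_traces, out):
    """Recursive comparison: cluster at `level`, then refine each cluster."""
    m = len(trace_set)
    dist = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):  # the distance matrix is symmetric
            d, p = compare_traces(trace_set[i], trace_set[j], level)
            dist[i][j] = dist[j][i] = condense_tuple(d, p, tp)
    for members in cluster(dist, td[level]):
        subset = [trace_set[k] for k in members]
        if level > 0:
            compare(subset, level - 1, td, tp, compare_traces, out)
        else:  # assumption: each level-0 cluster is stored
            out.append(subset)

def p2(traces, td, tp, compare_traces):
    """Main procedure; td[l] is the distance threshold at level l."""
    out = []
    compare(list(traces), len(td) - 1, td, tp, compare_traces, out)
    return out

def toy_compare_traces(a, b, level):
    """Hypothetical stand-in returning (d, p) for two traces."""
    if level > 0:  # fingerprint: set of event letters over all processes
        fa, fb = set("".join(a)), set("".join(b))
        union = fa | fb
        return (1.0 - len(fa & fb) / len(union) if union else 0.0), 0.0
    pairs = list(zip(a, b))  # level 0: processes are paired by position
    d = (sum(1.0 - SequenceMatcher(None, x, y, autojunk=False).ratio()
             for x, y in pairs) / len(pairs)) if pairs else 0.0
    p = abs(len(a) - len(b)) / max(len(a), len(b), 1)
    return d, p
```

A call such as `p2(traces, [0.1, 0.3], 0.5, toy_compare_traces)` uses one fingerprint level above the uncompressed level. Traces that end up alone are returned as single-element clusters; a caller interested only in groups of similar traces can discard them.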

2.4.1.2 Special Cases

We identify two cases in which direct comparison is more efficient than the iterative-unfolding approach. The first occurs when the traces of interest are very similar: the traces will not be filtered out at higher levels of compression, so direct comparison at the uncompressed level becomes necessary anyway. The second occurs when the traces are small (i.e., the length of the processes in the traces is comparable to the length of the fingerprints) and consist of only a few processes, so that the comparison times of the iterative-unfolding approach and of uncompressed comparison are roughly equivalent. Note that if the number of processes is large, our approach may still yield superior results by aggregating information from different processes into a single fingerprint. In all other cases, our approach is superior.

An additional case occurs when one needs to identify identical traces (e.g., to identify duplicate test cases). Two iterations are needed in this case. The first uses hashes of processes as fingerprints. The second verifies the result of the first iteration by analyzing the uncompressed traces (the Levenshtein distance [18] between processes should be equal to 0).
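The two iterations for duplicate detection can be sketched as follows. The sketch makes several assumptions that are not part of the text above: a trace is a list of processes, each process is a list of event names, processes are paired by position, and SHA-256 is used as the hash. Because a Levenshtein distance of 0 holds exactly when two sequences are equal, the verification step uses plain equality. The names `process_hash` and `find_duplicates` are illustrative.

```python
import hashlib
from collections import defaultdict

def process_hash(process):
    """Hash one process (a list of event names); SHA-256 is an assumption."""
    h = hashlib.sha256()
    for event in process:
        data = event.encode("utf-8")
        # A length prefix keeps ["ab", "c"] and ["a", "bc"] distinct.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

def find_duplicates(traces):
    """Return groups of indices of identical traces (groups of size > 1)."""
    # Iteration 1: bucket traces by their fingerprint of process hashes.
    buckets = defaultdict(list)
    for idx, trace in enumerate(traces):
        fp = tuple(process_hash(p) for p in trace)
        buckets[fp].append(idx)

    # Iteration 2: verify candidates on the uncompressed traces. Equality of
    # the event sequences is equivalent to a Levenshtein distance of 0.
    groups = []
    for indices in buckets.values():
        verified = []
        for idx in indices:
            for group in verified:
                if traces[group[0]] == traces[idx]:
                    group.append(idx)
                    break
            else:
                verified.append([idx])
        groups.extend(g for g in verified if len(g) > 1)
    return groups
```

The first pass costs one hash per process and groups candidates in linear time; the second pass touches the uncompressed data only for traces whose fingerprints already coincide.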