CHAPTER 4 Trace Segmentation
4.1 Trace Segmentation Approach
4.1.1 Trace Segmentation Problem
This section summarises the essential details of a previous trace segmentation approach (Asadi et al., 2010a,b), whose segmentation problem we reformulate as a dynamic programming (DP) problem. The five steps of the two approaches are therefore identical; the only difference is that trace segmentation was previously performed using a genetic algorithm (GA), whereas we describe the use of DP in Section 4.1.2.
Steps 1 and 2 – Program Instrumentation and Trace Collection. First, a program under study is instrumented using the instrumentor of MoDeC to collect traces of its execution under some scenarios. MoDeC is an external tool to extract and model sequence diagrams from Java programs (Ng et al., 2010), implemented using the Apache BCEL bytecode transformation library¹. The tool also allows parts of the traces to be labelled manually during executions of the instrumented programs, which we did to produce our oracle. In this dissertation, MoDeC is used only to collect and manually tag traces.
Step 3 – Pruning and Compacting Traces. Execution traces usually contain methods invoked in most scenarios, e.g., methods related to logging or Graphical User Interface (GUI) events. Such invocations are unlikely to relate to any particular concept, i.e., they are utility methods. We build the distribution of method invocation frequency and prune methods whose invocation frequency is greater than Q3 + 2 × IQR, where Q3 is the third quartile (75th percentile) of the invocation frequency distribution and IQR is the interquartile range, because these methods do not provide useful information when segmenting traces and locating concepts.
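The pruning rule can be illustrated with a minimal Python sketch. It assumes a trace is simply a list of method names; the function name `prune_utility_methods`, the parameter `k`, and the choice of the "inclusive" quartile convention are assumptions made for illustration, since the text does not state how the quartiles are estimated.

```python
from collections import Counter
from statistics import quantiles


def prune_utility_methods(trace, k=2.0):
    """Drop every invocation of a method whose frequency exceeds Q3 + k * IQR."""
    counts = Counter(trace)
    # quantiles(..., n=4) returns the three quartile cut points [Q1, Q2, Q3].
    # The "inclusive" interpolation method is an assumption of this sketch.
    q1, _, q3 = quantiles(list(counts.values()), n=4, method="inclusive")
    threshold = q3 + k * (q3 - q1)
    return [method for method in trace if counts[method] <= threshold]


# Hypothetical trace: one logging method dominates the invocation counts.
trace = ["log"] * 50 + ["a", "b", "b", "c", "c", "c", "d", "d", "e"]
pruned = prune_utility_methods(trace)
```

In this toy trace the invocation frequencies are 1, 1, 2, 2, 3, and 50, so only the hypothetical `log` method exceeds the threshold and is removed.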
Execution traces also contain repetitions of method calls, for example m1(); m1(); m1(); or m1(); m2(); m1(); m2();. Since a repetition does not define a new concept, we compact the traces using a Run Length Encoding (RLE) algorithm, which removes repetitions of sub-sequences of method invocations of arbitrary length and keeps a single occurrence of each. The examples would become m1() and m1(); m2(), respectively. We still apply the RLE compaction so that segments obtained with the DP approach can be compared with those obtained with the GA approach when segmenting the same traces.
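The compaction step can be sketched as follows. This is not the implementation used in the approach, only a simple illustration of collapsing immediately repeated sub-sequences while discarding the repetition counts; the function name `compact` and the parameter `max_len` are hypothetical.

```python
def compact(trace, max_len=None):
    """Collapse immediately repeated sub-sequences, keeping one occurrence."""
    out = list(trace)
    if max_len is None:
        max_len = len(out) // 2  # a repeated block cannot be longer than this
    changed = True
    while changed:  # repeat until a full pass removes nothing
        changed = False
        for width in range(1, max_len + 1):
            i = 0
            while i + 2 * width <= len(out):
                if out[i:i + width] == out[i + width:i + 2 * width]:
                    del out[i + width:i + 2 * width]  # drop the repeated copy
                    changed = True
                else:
                    i += 1
    return out


single = compact(["m1", "m1", "m1"])        # the first example from the text
pair = compact(["m1", "m2", "m1", "m2"])    # the second example from the text
```

The sketch favours clarity over speed; on very large traces a bounded `max_len` would likely be needed to keep the scan affordable.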
Step 4 – Textual Analysis of Method Source Code. Trace segmentation aims at grouping together consecutive method invocations that form conceptually cohesive groups. The conceptual cohesion among methods is computed using the Conceptual Cohesion metric defined by Marcus et al. (2008).
We first extract a set of terms from each method by tokenizing the method source code and comments and removing special characters, programming language keywords, and terms belonging to a stop-word list for the English language. We split compound identifiers written in Camel Case, e.g., getBook is split into get and book. Then, we perform stemming using a Porter stemmer (Porter, 1980). We then index the obtained terms using the TF-IDF indexing mechanism (Baeza-Yates et Ribeiro-Neto, 1999). We obtain a term–document matrix, where documents are all methods of all classes belonging to the program under study and terms are all the terms extracted (and split) from the method source code. Finally, we apply Latent Semantic Indexing (LSI) (Deerwester et al., 1990) to reduce the term–document matrix into a concept–document² matrix, choosing, as in previous works (Asadi et al., 2010a,b), an LSI subspace size equal to 50.
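The first stages of this textual analysis can be sketched in Python. The sketch covers only term extraction, Camel Case splitting, and TF-IDF weighting; Porter stemming and the LSI reduction are deliberately omitted. The stop-word and keyword lists are tiny illustrative stand-ins, the function names are hypothetical, and the TF-IDF variant (relative term frequency times log inverse document frequency) is an assumption, since several variants exist.

```python
import math
import re
from collections import Counter

# Tiny illustrative lists; real stop-word and keyword lists are much longer.
STOP_WORDS = {"the", "a", "of", "to", "and", "is", "in"}
JAVA_KEYWORDS = {"public", "private", "void", "return", "int", "new", "class"}


def extract_terms(source):
    """Tokenize method source text and split Camel Case identifiers."""
    terms = []
    for identifier in re.findall(r"[A-Za-z]+", source):
        # Insert a space at each lower-to-upper boundary: getBook -> get Book
        split = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", identifier)
        for part in split.split():
            term = part.lower()
            if term not in STOP_WORDS and term not in JAVA_KEYWORDS:
                terms.append(term)
    return terms


def tf_idf(documents):
    """Return one {term: weight} dict per document (each a list of terms)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


terms = extract_terms("public Book getBook() { return theBook; }")
weights = tf_idf([["get", "book"], ["add", "book"]])
```

A term that occurs in every document, such as `book` in the second call, receives a weight of zero under this variant, which is the intended effect of the inverse document frequency factor.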
Step 5 – Trace Splitting through Optimization Techniques. Execution traces are very large, so a trace segmentation must be searched for in a large search space. We therefore apply optimization techniques to segment the obtained trace. Applying an optimization technique requires a representation of the trace and of a trace segmentation, and a means to evaluate the quality of a trace segmentation, i.e., a fitness function. In the following paragraphs, we reuse previous notations and definitions (Asadi et al., 2010b) where possible, for the sake of simplicity.
The fitness function drives the optimization technique to produce a (near) optimal segmentation of a trace into segments likely to relate to some concepts. It relies on the software design principles of cohesion and coupling, already adopted in the past to identify modules in programs (Mitchell et Mancoridis, 2006), although we use conceptual (i.e., textual) cohesion and coupling measures (Marcus et al., 2008; Poshyvanyk et Marcus, 2006), rather than structural cohesion and coupling measures.
Segment cohesion (COH) is the average (textual) similarity between the source code of any pair of methods invoked in a given segment l (Marcus et al., 2008; Poshyvanyk et Marcus, 2006). It is computed using the formula in Equation 4.1, where begin(l) is the position of the first method invocation of the l-th segment and end(l) the position of the last method invocation in that segment. The similarity σ between methods m_i and m_j is computed using the cosine similarity measure over the LSI matrix from the previous step.
Segment coupling (COU) is the average similarity between a segment l and all other segments in the trace, computed using Equation 4.2, where N is the trace length. It represents, for a given segment, the average similarity between methods in that segment and those in different ones.
2. In LSI, “concept” refers to orthonormal dimensions of the LSI space, while in the rest of the dissertation “concept” means some abstraction relevant to developers.
Thus, we compute the quality of the segmentation of a trace split into K segments using the fitness function (fit) defined in Equation 4.3, which balances segment cohesion and their coupling with other segments in the split trace.
$$\mathrm{COH}_l = \frac{\sum_{i=\mathrm{begin}(l)}^{\mathrm{end}(l)-1} \sum_{j=i+1}^{\mathrm{end}(l)} \sigma(m_i, m_j)}{(\mathrm{end}(l) - \mathrm{begin}(l) + 1) \times (\mathrm{end}(l) - \mathrm{begin}(l)) / 2} \tag{4.1}$$
$$\mathrm{COU}_l = \frac{\sum_{i=\mathrm{begin}(l)}^{\mathrm{end}(l)} \sum_{j=1,\; j<\mathrm{begin}(l) \text{ or } j>\mathrm{end}(l)}^{N} \sigma(m_i, m_j)}{(N - (\mathrm{end}(l) - \mathrm{begin}(l) + 1)) \times (\mathrm{end}(l) - \mathrm{begin}(l) + 1)} \tag{4.2}$$
$$\mathit{fit}(\mathit{segmentation}) = \frac{1}{K} \times \sum_{i=1}^{K} \frac{\mathrm{COH}_i}{\mathrm{COU}_i + 1} \tag{4.3}$$
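To make Equations 4.1 to 4.3 concrete, the following Python sketch evaluates them on a precomputed similarity matrix. It assumes 0-based, inclusive segment bounds and a square matrix `sim` where `sim[i][j]` stands for σ(m_i, m_j); the function names are hypothetical, and returning 0 for a single-invocation segment (which has no method pairs) or for a segment covering the whole trace is an assumption of this sketch, as the equations leave those cases undefined.

```python
def cohesion(sim, begin, end):
    """Equation 4.1: mean similarity over all method pairs in a segment."""
    size = end - begin + 1
    pairs = size * (size - 1) / 2
    if pairs == 0:
        return 0.0  # assumption: a one-invocation segment has no pairs
    total = sum(sim[i][j]
                for i in range(begin, end)
                for j in range(i + 1, end + 1))
    return total / pairs


def coupling(sim, begin, end, n_total):
    """Equation 4.2: mean similarity between a segment and the rest of the trace."""
    size = end - begin + 1
    outside = n_total - size
    if outside == 0:
        return 0.0  # assumption: the segment covers the whole trace
    total = sum(sim[i][j]
                for i in range(begin, end + 1)
                for j in range(n_total)
                if j < begin or j > end)
    return total / (outside * size)


def fit(sim, segments):
    """Equation 4.3: average of COH / (COU + 1) over the K segments."""
    n_total = len(sim)
    return sum(cohesion(sim, b, e) / (coupling(sim, b, e, n_total) + 1)
               for b, e in segments) / len(segments)


# Toy trace of four invocations: 0 and 1 are similar, as are 2 and 3.
sim = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
good = fit(sim, [(0, 1), (2, 3)])  # split along the similarity blocks
bad = fit(sim, [(0, 0), (1, 3)])   # split that cuts through a block
```

On this toy matrix the segmentation that follows the similarity blocks scores higher than the one that cuts through a block, which is the behaviour the fitness function is meant to reward.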