

3.2.3 Constraint-Based Subgraph Mining

Constraint-based mining allows the user to formulate constraints describing the patterns she or he is interested in. The mining algorithms in turn may make use of these constraints by narrowing down their internal search space and thus speeding up the algorithm. In Section 2.3.3, we have presented the constraint classes anti-monotonicity, monotonicity and succinctness, as originally introduced by Ng et al. [NLHP98].

More recently, constraint-based graph mining has been proposed. Wang et al. [WZW+05] build on the constraint classes introduced in [NLHP98] and categorise various graph-based constraints into these classes. The authors then develop a framework to integrate the different constraint classes into a pattern-growth-based graph-mining algorithm. They use anti-monotone constraints to prune the search space and monotone constraints to speed up the evaluation of further constraints. Further, they use the succinctness property to reduce the size of the graph database. Wang et al. also propose a way to deal with some weight-based constraints. For the average-weight constraint, they propose omitting nodes and edges with outlier values from the graphs in the database. They do so to shrink the graphs and to avoid evaluating such ‘unfavourable’ elements. This can lead to incomplete result sets. Furthermore, according to the authors' evaluation with one artificial dataset, situations where such constraints lead to significant speedups are rare, and the authors make no statements regarding result quality.
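The roles of the two constraint classes in a pattern-growth search can be illustrated with a minimal sketch. It is not the framework of [WZW+05]: it enumerates connected edge sets of a single small graph, omits support counting, and all names (`mine`, `touches`, `at_most_two_edges`, `contains_node_b`) are hypothetical. An anti-monotone constraint that fails for a pattern fails for all its supergraphs, so the branch is cut; a monotone constraint that holds for a pattern holds for all its supergraphs, so it need not be evaluated again.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]
Pattern = FrozenSet[Edge]


def touches(pattern: Pattern, edge: Edge) -> bool:
    """True if the edge shares a node with the pattern (or the pattern is empty)."""
    nodes = {n for e in pattern for n in e}
    return not pattern or edge[0] in nodes or edge[1] in nodes


def mine(edges: List[Edge],
         anti_monotone: Callable[[Pattern], bool],
         monotone: Callable[[Pattern], bool]) -> Set[Pattern]:
    """Grow connected edge sets one edge at a time (illustrative only)."""
    results: Set[Pattern] = set()
    seen: Set[Pattern] = set()

    def grow(pattern: Pattern, mono_ok: bool) -> None:
        if pattern in seen:
            return
        seen.add(pattern)
        if pattern:
            # Anti-monotone: a violation carries over to every supergraph,
            # so the whole branch can be pruned.
            if not anti_monotone(pattern):
                return
            # Monotone: once satisfied, every supergraph satisfies it too,
            # so the check is skipped from here on.
            mono_ok = mono_ok or monotone(pattern)
            if mono_ok:
                results.add(pattern)
        for edge in edges:
            if edge not in pattern and touches(pattern, edge):
                grow(pattern | {edge}, mono_ok)

    grow(frozenset(), False)
    return results


def at_most_two_edges(p: Pattern) -> bool:   # anti-monotone
    return len(p) <= 2


def contains_node_b(p: Pattern) -> bool:     # monotone
    return any("b" in e for e in p)


EDGES = [("a", "b"), ("b", "c"), ("c", "d")]
found = mine(EDGES, at_most_two_edges, contains_node_b)
```

On this path graph, the search never expands the three-edge pattern further, and the single-edge pattern (c, d) is explored but not reported because it does not yet contain node b.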

In [ZYHY07], Zhu et al. extend [WZW+05] by refining the classes of constraints and integrating them into mining algorithms. However, they do not consider weights either.

Although the proposed techniques work well with monotone, anti-monotone or succinct constraints and their derivations, most weight-based constraints do not fall into these categories [WZW+05]. Nor are they convertible (see Section 2.3.3), even though such constraints might seem similar. The weights considered in convertible constraints stay the same for every item in all transactions, while weights in graphs can be different in every graph in D. Therefore, the established constraint-based mining schemes cannot use weight-based constraints for pruning while guaranteeing completeness.
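A small numerical sketch makes this concrete for a lower bound on the average edge weight. The graphs, weights and thresholds are invented for illustration, and the helper names are hypothetical. The same structural pattern can violate the constraint while its supergraph satisfies it (so it is not anti-monotone), or satisfy it while its supergraph violates it (so it is not monotone), and its weights differ from graph to graph.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def average_weight(pattern: List[Edge], weights: Dict[Edge, float]) -> float:
    """Average weight of the pattern's edges within one particular graph."""
    return sum(weights[e] for e in pattern) / len(pattern)


def satisfies(pattern: List[Edge], weights: Dict[Edge, float],
              min_avg: float) -> bool:
    """Weight-based constraint: average edge weight >= min_avg."""
    return average_weight(pattern, weights) >= min_avg


AB, BC = ("a", "b"), ("b", "c")

# Two graphs from a hypothetical database D with identical structure
# but different edge weights.
G1 = {AB: 1.0, BC: 10.0}
G2 = {AB: 8.0, BC: 9.0}

# Not anti-monotone: [AB] fails in G1, yet its supergraph [AB, BC] passes.
sub_fails = not satisfies([AB], G1, 5.0)
super_passes = satisfies([AB, BC], G1, 5.0)

# Not monotone: [BC] passes in G1, yet its supergraph [AB, BC] fails.
sub_passes = satisfies([BC], G1, 6.0)
super_fails = not satisfies([AB, BC], G1, 6.0)

# Weights are per graph: the same pattern [AB] passes in G2 but not in G1.
differs_per_graph = satisfies([AB], G2, 5.0) and not satisfies([AB], G1, 5.0)
```

Pruning the branch at [AB] in G1 would therefore discard [AB, BC], which is exactly the loss of completeness described above.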

Call-graph-based defect localisation naturally relies on call graphs. Such graphs are representations of programme executions. Raw call graphs typically become much too large for graph-mining algorithms, as programmes might be executed for a long period and frequently call other parts of the programme, which adds information to the graph. Therefore, it is essential to compress the graphs; we also call this process reduction. It is usually done by a lossy compression technique, which involves a trade-off between keeping as much information as possible and achieving strong compression. The literature has proposed a number of different call-graph representations [CLZ+09, DFLS06, LYY+05], standing for different degrees of reduction and different types and amounts of information encoded in the graphs. In this dissertation, we make further proposals for call-graph compressions and for encoding additional information by means of numerical annotations at the edges. To ease presentation, we discuss the related approaches from the literature (in particular the call-graph representations $R_{total}^{tmp}$, $R_{01m}^{ord}$ and $R_{total}^{block}$; $R_{total}$ and $R_{01m}^{unord}$ are simplified variants thereof) along with the new proposals ($R_{subtree}$) or variations ($R_{total}^{w}$, $R_{total}^{mult}$) in this dissertation. Besides the graph representations discussed in this chapter, we introduce further graph representations in Chapters 6 and 7. They focus on specific graph representations for call graphs at different levels of abstraction and on the incorporation of dataflow-related information, respectively.

In Section 4.1, we discuss call-graph representations at the method level. In Section 4.2, we briefly explain call graphs at levels of granularity other than the method level. In Section 4.3, we present call-graph representations for multithreaded programmes. In Section 4.4, we explain how we technically derive call graphs from Java programme executions. Section 4.5 summarises this chapter.

4.1 Call Graphs at the Method Level

We now discuss call-graph representations at the method level. All such representations are based on unreduced call graphs, sometimes also called call trees, as obtained from tracing programme executions (Section 4.4 gives some details on tracing):

Notation 4.1 (Unreduced call graphs)

Unreduced call graphs can be obtained by tracing a programme execution. They are rooted ordered trees. Nodes stand for methods and one edge stands for each method invocation. The order of the nodes is the temporal order in which the methods were executed.

Example 4.1: Figure 4.1(a) is an example of such a graph. Even if not depicted in the figure, the siblings in the graph are ordered by execution time from left to right. When we want to emphasise the temporal order, we express the order by increasing integers attached to the nodes. Figure 4.4(a) is the same graph featuring this representation.
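Notation 4.1 can be sketched as a small data structure. The following is an illustration under assumptions, not the tracing implementation of Section 4.4: the trace format (a list of `("enter", method)` and `("exit", method)` events) and the names `CallNode`, `build_call_tree` and `count_edges` are hypothetical. Numbering nodes by the time their method is entered is likewise only one plausible reading of the increasing integers mentioned in Example 4.1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CallNode:
    method: str                     # the method this node stands for
    order: int                      # temporal order (assumed: entry time)
    children: List["CallNode"] = field(default_factory=list)


def build_call_tree(trace: List[Tuple[str, str]]) -> CallNode:
    """Build a rooted ordered tree with one edge per method invocation."""
    root: Optional[CallNode] = None
    stack: List[CallNode] = []      # currently active invocations
    counter = 0
    for kind, method in trace:
        if kind == "enter":
            counter += 1
            node = CallNode(method, counter)
            if stack:
                # Appending keeps siblings in execution order, left to right.
                stack[-1].children.append(node)
            elif root is None:
                root = node
            else:
                raise ValueError("trace has more than one root invocation")
            stack.append(node)
        elif kind == "exit":
            if not stack or stack[-1].method != method:
                raise ValueError(f"unmatched exit of {method}")
            stack.pop()
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if root is None or stack:
        raise ValueError("trace is empty or incomplete")
    return root


def count_edges(node: CallNode) -> int:
    """Number of edges, i.e. method invocations below this node."""
    return sum(1 + count_edges(child) for child in node.children)


# Hypothetical execution: main calls a (which calls b), then main calls b.
TRACE = [("enter", "main"), ("enter", "a"), ("enter", "b"), ("exit", "b"),
         ("exit", "a"), ("enter", "b"), ("exit", "b"), ("exit", "main")]
tree = build_call_tree(TRACE)
```

Because every invocation creates its own node, method b appears twice in this tree. This duplication is what lets raw call graphs grow with execution length and motivates the reduction techniques that follow.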

In Section 4.1.1, we describe the total reduction scheme. In Section 4.1.2, we introduce various techniques for the reduction of iteratively executed structures. As some techniques make use of the temporal order of method calls during reduction, we describe these aspects in Section 4.1.3. We provide some ideas on the reduction of recursion in Section 4.1.4 and conclude with a brief comparison in Section 4.1.5.