Constraint-Based Subgraph Mining

3.2.3 Constraint-Based Subgraph Mining

Constraint-based mining allows the user to formulate constraints describing the patterns she or he is interested in. The mining algorithms can in turn exploit these constraints to narrow down their internal search space and thus run faster. In Section 2.3.3, we presented the constraint classes anti-monotonicity, monotonicity and succinctness, as originally introduced by Ng et al. [NLHP98].

More recently, constraint-based graph mining has been proposed. Wang et al. [WZW+05] build on the constraint classes introduced in [NLHP98] and categorise various graph-based constraints into these classes. The authors then develop a framework to integrate the different constraint classes into a pattern-growth-based graph-mining algorithm. They use anti-monotone constraints to prune the search space and monotone constraints to speed up the evaluation of further constraints. Further, they use the succinctness property to reduce the size of the graph database. Wang et al. also propose a way to deal with some weight-based constraints. For the average-weight constraint, they propose to omit nodes and edges with outlier values from the graphs in the database. They do so to shrink the graph size and to avoid the evaluation of such ‘unfavourable’ elements. This can lead to incomplete result sets. Furthermore, according to the authors' evaluation with one artificial dataset, situations where such constraints lead to significant speedups are rare, and the authors make no statements regarding result quality.
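The pruning role of anti-monotone constraints can be illustrated with a minimal sketch. It is not the framework of Wang et al.: the function name `grow_patterns`, the toy path graph and the size constraint are assumptions chosen for illustration, and the sketch enumerates connected edge sets of a single graph rather than frequent subgraphs of a database. The point is only that a pattern violating an anti-monotone constraint is never extended, because none of its extensions can satisfy the constraint.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]
Pattern = FrozenSet[Edge]


def grow_patterns(
    edges: Iterable[Edge],
    anti_monotone_ok: Callable[[Pattern], bool],
) -> Set[Pattern]:
    """Enumerate connected edge sets by pattern growth with pruning."""
    edge_list = list(edges)
    results: Set[Pattern] = set()
    visited: Set[Pattern] = set()
    # Start the growth from every single-edge pattern.
    stack = [frozenset([e]) for e in edge_list]
    while stack:
        pattern = stack.pop()
        if pattern in visited:
            continue
        visited.add(pattern)
        if not anti_monotone_ok(pattern):
            # Anti-monotonicity: every extension would violate the
            # constraint as well, so this branch is cut off here.
            continue
        results.add(pattern)
        nodes = {n for e in pattern for n in e}
        # Extend the pattern by one adjacent edge at a time.
        for e in edge_list:
            if e not in pattern and (e[0] in nodes or e[1] in nodes):
                stack.append(pattern | {e})
    return results


# Hypothetical toy graph: the path a - b - c - d.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
# "At most two edges" is anti-monotone: once violated, always violated.
small_patterns = grow_patterns(toy_edges, lambda p: len(p) <= 2)
```

With the toy path graph, the three-edge pattern is generated once, rejected, and never extended further.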

In [ZYHY07], Zhu et al. extend [WZW+05] by refining the classes of constraints and integrating them into mining algorithms. However, they do not consider weights either.

Although the proposed techniques work well with monotone, anti-monotone or succinct constraints and their derivations, most weight-based constraints do not fall into these categories [WZW+05]. Nor are they convertible (see Section 2.3.3), even though they might seem similar to convertible constraints. The weights considered in convertible constraints are the same for every item in all transactions, whereas weights in graphs can differ from graph to graph in D. Therefore, the established constraint-based mining schemes cannot use weight-based constraints for pruning while guaranteeing completeness.
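A small worked example may make both observations concrete. It is a sketch under assumptions: the toy database, the threshold of 4.0 and all function names are hypothetical and not taken from the cited work. It checks a minimum-average-weight constraint on a growing pattern and compares the weight-based edge order in two graphs.

```python
from statistics import mean

# Hypothetical toy database D: the same edges carry different weights
# in each graph.
database = {
    "g1": {("a", "b"): 1.0, ("b", "c"): 9.0, ("c", "d"): 0.0},
    "g2": {("a", "b"): 7.0, ("b", "c"): 2.0, ("c", "d"): 5.0},
}


def avg_weight(graph, pattern):
    """Average weight of the pattern's edges within one graph."""
    return mean(graph[edge] for edge in pattern)


def satisfies_min_avg(graph, pattern, threshold):
    """Constraint: the average edge weight is at least the threshold."""
    return avg_weight(graph, pattern) >= threshold


def edges_by_descending_weight(graph):
    """Edge order that a convertible-constraint scheme would rely on."""
    return sorted(graph, key=graph.get, reverse=True)


g1 = database["g1"]
p1 = [("a", "b")]         # average 1.0
p2 = p1 + [("b", "c")]    # average 5.0
p3 = p2 + [("c", "d")]    # average 10/3
checks = [satisfies_min_avg(g1, p, 4.0) for p in (p1, p2, p3)]
# checks == [False, True, False]

order_g1 = edges_by_descending_weight(database["g1"])
order_g2 = edges_by_descending_weight(database["g2"])
# order_g1 != order_g2: there is no single global edge order.
```

The pattern first violates the constraint, then satisfies it, then violates it again as it grows, so the constraint is neither anti-monotone nor monotone. The weight-based edge order also differs between the two graphs, which is why an ordering trick in the style of convertible constraints has no single item order to work with.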

Call-graph-based defect localisation naturally relies on call graphs. Such graphs are representations of programme executions. Raw call graphs typically become much too large for graph-mining algorithms, as programmes might be executed for a long period and frequently call other parts of the programme, which adds information to the graph. Therefore, it is essential to compress the graphs; we also call this process reduction. It is usually done by a lossy compression technique. This involves a trade-off between keeping as much information as possible and achieving a strong compression. The literature has proposed a number of different call-graph representations [CLZ+09, DFLS06, LYY+05], standing for different degrees of reduction and different types and amounts of information encoded in the graphs. In this dissertation, we make further proposals for call-graph compressions and for encoding additional information by means of numerical annotations at the edges. To ease presentation, we discuss the related approaches from the literature (in particular the call-graph representations $R^{tmp}_{total}$, $R^{ord}_{01m}$ and $R^{block}_{total}$; $R_{total}$ and $R^{unord}_{01m}$ are simplified variants thereof) along with the new proposals ($R_{subtree}$) or variations ($R^{w}_{total}$, $R^{mult}_{total}$) in this dissertation. Besides the graph representations discussed in this chapter, we introduce further graph representations in Chapters 6 and 7. They focus on specific graph representations for call graphs at different levels of abstraction and on the incorporation of dataflow-related information, respectively.

In Section 4.1, we discuss call-graph representations at the method level. In Section 4.2, we briefly explain call graphs at levels of granularity other than the method level. In Section 4.3, we present call-graph representations for multithreaded programmes. In Section 4.4, we explain how we technically derive call graphs from Java programme executions. Section 4.5 summarises this chapter.

4.1 Call Graphs at the Method Level

We now discuss call-graph representations at the method level. The basis for all such representations is the unreduced call graph, sometimes also called a call tree, as obtained from tracing programme executions (in Section 4.4 we give some details on tracing):

Notation 4.1 (Unreduced call graphs)

Unreduced call graphs can be obtained by tracing a programme execution. They are rooted ordered trees. Nodes stand for methods and one edge stands for each method invocation. The order of the nodes is the temporal order in which the methods were executed.
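As an illustration of this notation, the following minimal sketch builds such a rooted ordered tree from a trace. The trace format (a list of `("enter", method)` and `("exit", method)` events), the class `CallNode` and the function `build_call_tree` are assumptions made for this example; they do not describe the tracing technique of Section 4.4.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class CallNode:
    method: str
    order: int  # position in the temporal order of invocations
    children: List["CallNode"] = field(default_factory=list)


def build_call_tree(events: Iterable[Tuple[str, str]]) -> CallNode:
    """Build an unreduced call tree from enter/exit trace events."""
    root: Optional[CallNode] = None
    stack: List[CallNode] = []  # currently active invocations
    counter = 0
    for kind, method in events:
        if kind == "enter":
            counter += 1
            node = CallNode(method, counter)
            if stack:
                # One edge per invocation; appending keeps siblings
                # ordered by execution time.
                stack[-1].children.append(node)
            elif root is None:
                root = node
            else:
                raise ValueError("trace contains more than one root")
            stack.append(node)
        elif kind == "exit":
            if not stack or stack[-1].method != method:
                raise ValueError(f"unbalanced exit for {method!r}")
            stack.pop()
        else:
            raise ValueError(f"unknown event kind {kind!r}")
    if stack:
        raise ValueError("trace ended inside an active invocation")
    if root is None:
        raise ValueError("empty trace")
    return root


# Hypothetical trace: main calls a (which calls b), then calls a again.
trace = [
    ("enter", "main"), ("enter", "a"), ("enter", "b"), ("exit", "b"),
    ("exit", "a"), ("enter", "a"), ("exit", "a"), ("exit", "main"),
]
tree = build_call_tree(trace)
```

Note that the two invocations of `a` yield two separate nodes under `main`. Nothing is merged in the unreduced graph, which is why such trees grow with the length of the execution.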

Example 4.1: Figure 4.1(a) is an example of such a graph. Even if not depicted in the figure, the siblings in the graph are ordered by execution time from left to right. When we want to emphasise the temporal order, we express the order by increasing integers attached to the nodes. Figure 4.4(a) is the same graph featuring this representation.

In Section 4.1.1, we describe the total reduction scheme. In Section 4.1.2, we introduce various techniques for the reduction of iteratively executed structures. As some techniques make use of the temporal order of method calls during reduction, we describe these aspects in Section 4.1.3. We provide some ideas on the reduction of recursion in Section 4.1.4 and conclude with a brief comparison in Section 4.1.5.
