Classification Application - Mining Significant Graph Patterns by Scalable Leap Search

Chapter 5 Mining Significant Graph Patterns by Scalable Leap Search

5.8 Experiments

5.8.4 Classification Application

Having examined the efficiency and effectiveness of descending leap mining, in this subsection we are going to investigate the usage of graph patterns discovered by LEAP in graph classification. A primary problem in classification of graph data is the lack of feature vector representation. LEAP provides a scalable approach to find discriminative subgraph patterns and one can take them as features to classify, in a way faster than other subgraph-based classification approach [33, 15, 58]. As LEAP produces one discriminative pattern a time, we run LEAP iteratively on the training data until every training example can be represented by some discovered graph patterns. The data is then represented by the graph patterns and used for model learning. We measure the runtime of LEAP by mining a set of graph patterns sufficient for representing a training set. In the following experiments, we are going to see the advantage of our pattern-based classification framework over graph kernel method, not only in terms of scalability, but also accuracy. Studies of kernel-based approach aim at designing effective kernel functions to measure the similarity between graphs. We choose the state-of-the-art graph kernel – optimal assignment kernel (abbr. OA) [20] for comparison. Its implementation is provided by the authors. The runtime of OA is measured as the time to compute the kernel.

Since OA is unable to handle the large scale NCI bioassay datasets, we sample 5% of the active compounds and a comparable size of inactive compounds from each dataset to derive a compact balanced sample set. The classification accuracy is evaluated with 5-fold cross validation. On each training fold a model selection for the necessary parameters was performed by evaluating the parameters by an extra level of 5-fold cross validation. For both methods, we run the same implementation of support vector machine, LIBSVM [9], with parameter C selected from [2−5_{, 2}5_{]. We choose linear kernel for our graph pattern-based classifier.}

Table 5.3: AUC Comparison between OA and LEAP

Dataset OA LEAP OA (6x) LEAP (6x)

MCF-7 0.68 0.67 0.75 0.76 MOLT-4 0.65 0.66 0.69 0.72 NCI-H23 0.79 0.76 0.77 0.79 OVCAR-8 0.67 0.72 0.79 0.78 P388 0.79 0.82 0.81 0.84 PC-3 0.66 0.69 0.79 0.76 SF-295 0.75 0.72 0.79 0.77 SN12C 0.75 0.75 0.76 0.80 SW-620 0.70 0.74 0.76 0.76 UACC257 0.65 0.64 0.71 0.75 Yeast 0.64 0.71 0.64 0.71 Average 0.70 0.72 0.75 0.77

We compare the area under the ROC curve (AUC) achieved by LEAP and OA kernel. ROC curve shows the trade-off between true positive rate and false positive rate for a given classifier. A good classifier would produce a ROC curve as close to the left-top corner as possible. The area under the ROC curve (AUC) is a measure of the model accuracy, in the range of [0, 1]. A perfect model will have an area of 1.

As an additional test, we increase the training size to 6 times (6x) and run both LEAP and OA. While LEAP runs efficiently for both cases, OA can hardly scale to 6x training set. Table 5.3 shows AUC by OA and LEAP on the 1x training set as well as the 6x ones. As shown in Table 5.3, LEAP achieves comparable results with OA on 1x training set, with 0.02 AUC improvement over OA on average. Both methods achieve further improvement on 6x training set, while LEAP still outperforms OA on average. We also use precision as the measure and observe similar performance. This result demonstrates that, (1) LEAP could efficiently discover highly discriminative features which lead to satisfactory classification accuracy; and (2) Better classification performance could be achieved, given sufficient training examples and a computationally scalable method. Figure 5.16 shows the runtime in log scale by LEAP and OA kernel with 1x and 6x training set respectively. All runtime in the figure is averaged over 5 folds. As demonstrated in the figure, LEAP scales linearly with the data size, whereas the runtime of OA kernel increases quadratically with the data size. In practice, the scalability issue of OA actually limits its capability of achieving higher accuracy since it cannot handle large scale training sets.

MC MO NC OV P3 PC SF SN SW UA YE 1 10 100 1000 10000 100000 Data Sets

Run Time (sec)

LEAP OA LEAP (6x) OA (6x)

Figure 5.16: Runtime: OA versus LEAP

5.9 Summary

In this paper, we examined an increasingly important issue in frequent subgraph mining: the huge number of frequent subgraphs makes it impossible for experts to analyze returned patterns and blocks the use of graph patterns in several key application areas such as indexing and classification. A comprehensive study on general mining strategy was performed, which is able to mine the most significant graph patterns measured by different kinds of non-monotonic objective functions. We proposed a new mining framework, called LEAP(Descending Leap Mine), to exploit the correlation between structural similarity and significance similarity in a way that the optimality of significance could be calculated quickly by searching dissimilar graph patterns. Two novel mining concepts, structural leap search and frequency-descending mining, were proposed to find (near)-optimal patterns quickly by leaping in graph pattern space. The new mining method revealed that the widely adopted branch-and-bound search in data mining literature is indeed not fast enough, thus providing new insights for fast graph mining. Interestingly, the mining strategy included in LEAP can also be applied to searching other simpler structures such as itemsets, sequences and trees.

Chapter 6

Classification with Very Large

Feature Sets and Skewed Distribution

6.1 Introduction

In many real data mining applications, there are two commonly encountered challenges: (1) no initial feature vector representation (i.e., original forms of data is not represented in readily available feature vector), and (2) skewed prior class distribution. In this situation, we have to undergo an interleaving process of feature invention and model construction based on the invented features. However, this is a nontrivial task because of the following complicated factors. On one hand, we could potentially construct a huge feature set during the feature enumeration process.. For example, if frequent subgraphs are chosen as invented features in the graph classification context, it typically generates around 105 _{or even 10}6 _{features by frequent subgraph mining}

[65], due to the exponential combination between the graph vertices and edges. This large number could create all sorts of problems for traditional inductive learning algorithms, including efficiency, overfitting, accuracy, model complexity, comprehensibility. On the other hand, a very tricky issue with the invented features is that they may not even “cover” the examples uniformly – the partial feature coverage problem. In other words, some examples may have significantly fewer invented features representing them than others. If one is not careful with the invented features, it is easy to run into a situation where some examples are not covered well. This is particularly problematic for “skewed” distribution where one class dominates the prior class probability distribution (such as 99% negative and 1% positive), since it is likely to invent fewer features in the skewed positive class, thus, the positive examples are more likely to be under represented.

In this chapter, the focus is to discuss various issues involved in this mixed process of inventing features from raw data and effectively using these features to match the true distribution. Figure 6.1 shows the flow of the classification process. Given a set of raw data with skewed class distribution, sampling is applied first to get multiple re-balanced data samples. For each data sample, feature invention is applied to construct a set of candidate features. However, feature enumeration could potentially generate a huge number of features. As discussed, both the huge feature number and the partial coverage problem caused by the invented features pose non-trivial challenges for model construction. Therefore, we propose a solution, cascaded

5 10 × 2 10 × ₂ 10 × 2 10 × S OH O O ) ( x fmk ) ( 2 ) 1 ( _x f m− k+ ) ( 1 ) 1 ( x f m− k+ ∑ = = mk i i E x f mk x f 1 ) ( 1 ) ( 5 10 × 2 10 × _×₁₀2 2 10 × S OH O O ) ( 2 _x f k ) ( 2 _x fk+ ) ( 1_x fk+ 5 10 × 2 10 × 2 10 × ₂ 10 × S OH O O ) ( x fk ) ( 1 x f 2() x f

Figure 6.1: Balanced Data Ensemble and Cascaded Feature Ensemble

feature ensemble, which aims at reducing the feature number as well as increasing the feature coverage, by “progressively” selecting multiple disjoint sets of features and constructing multiple models based on that. To make the work easier understood and demonstrate solid applications, we adopt the problem contexts of graph data classification as well as transaction data classification, since they both possess the main challenges mentioned above. Other examples include protein docking, weather forecasting, mutual fund performance analysis, image retrieval, and video retrieval.

Example 1: Graph Classification

Graphs become increasingly important in modelling complicated social and natural structures such as genetic interactions and molecules. Graph classification is involved in a lot of applications including chemical infor- matics and drug design process. Formally, given a set of training graphs associated with labels {gi, yi}ni=1,

yi∈ {±1}, we are interested in learning a classifier that can predict the class of unclassified structures.

However, graph classification is a difficult research problem with two main obstacles: (1) a graph is in the form of complicated structures with a set of vertices and edges connecting these vertices, but not represented in a readily available feature vector format that most traditional machine learning algorithms are comfortable with; and (2) many real graph datasets are extremely skewed. For example, in the AIDS anti-viral screen dataset [1], the active class is only around 1%. In the NCI anti-cancer screen datasets [2], the active class is usually about 5%.

discriminative features for graph classification is an interesting and open problem by itself. It is very unlikely to build high performance classifiers based only on simple features including single atoms such as carbon (C), oxygen (O), and nitrogen (N), or their short edge links, such as C-C, C-O, or C-C-N in organic chemical structures. This is because such features are not discriminative: every molecule likely contains these simple features. In this study, we choose to use frequent substructures as graph features since they expose the intrinsic topological characteristic of the underlying graphs. In addition, the effectiveness of frequent substructures has been demonstrated by previous studies [15].

Although frequent substructures have some appealing characteristics, e.g., the ability to preserve structural information and the availability of efficient mining algorithms, there exists two tricky issues associated with the application of frequent substructures: (1) it usually generates a huge number of features by frequent substructure mining, due to the exponential combination between graph vertices and edges. Typically the number of frequent subgraphs is over 105_{or even 10}6_{with a minimum support of 3–5%; and (2) the frequent}

substructure features may not “cover” the training examples uniformly. Given a set of frequent subgraphs F and a graph g, if none (or a small number) of the graph patterns in F is a subgraph of g, then the feature vector representation of g contains almost all 0s. That is, g is barely covered by the frequent subgraph fea- tures. As a result, the topological property of such graphs is inadequately reflected, which makes it difficult to distinguish them.

The second challenge associated with the graph classification problem in real applications is the extremely skewed class distribution which is not specially handled by the existing graph classification methods. The skewed distribution causes two problems. First, if mining is done on the skewed dataset, frequent subgraphs prevalent in the negative class dominate the feature space while subgraphs unique in the positive class are overwhelmed, assuming the size of the negative class is much larger than that of the positive class. As a consequence, the positive examples will be insufficiently covered and under represented. Second, applying inductive learning methods to the skewed datasets leads to a model which is biased towards the negative class, since the goal of the classifier is to minimize the classification errors.

Example 2: Transaction Data Classification

Transaction data classification is another example which possesses the aforementioned two challenges. Each record of transaction data includes different number and types of items such as purchased merchandize, executed command, etc. and a class label. In most cases, the individual items by themselves may not be discriminative enough to separate the examples from different classes, while the combinations of multiple items, whether or not the order is considered, have much higher discriminative power. Previous studies

[39, 38, 59, 11] have shown that frequent itemsets are good candidates as discriminative features or rules for classification. However, a potential problem with frequent itemsets is that we could generate a rather large number of such itemsets (×104 _{to ×10}6 _{), especially on dense datasets. How to construct a compact set of}

highly discriminative and representative frequent itemsets as features is a challenging problem. In addition, a similar partial feature coverage problem also exists in the context of frequent itemsets. Given a set of frequent itemsets, if few of them appear in a transaction example, this example is insufficiently represented by the set of features. As for the skewed class distribution, it could be another challenge for transaction data in both feature invention and model construction.

In document Towards Accurate and Efficient Classification: A Discriminative and Frequent Pattern-Based Approach (Page 81-87)