
Subtree Mining: A Comparison of Algorithms on Real World Datasets

In document Taylor_unc_0153D_16421.pdf (Page 36-38)

With the notion of a web session tree in place, we now need a way to extract the structural properties of the tree for analysis. There are a few options, including tree edit distance (discussed in Chapter 4), but by far the most popular approach in the literature is subtree mining. Subtree mining is the practice of breaking large, complicated structures into more manageable substructures (i.e., subtrees) and studying their patterns. Among all substructures, frequent substructures, i.e., those occurring sufficiently often in the database, are of particular importance, as they open the door to advanced analyses such as search (Cohen, 2013), indexing (Zhao et al., 2007), and classification (Nguyen and Shimazu, 2011). Subtree mining has been applied in areas such as phylogenetic analysis in biology (Deepak et al., 2013), text mining (Subercaze et al., 2015), natural language processing (Nguyen and Shimazu, 2011), malware detection (Narouei et al., 2015), and robot task recognition (Gemignani et al., 2015). Although subtree mining algorithm research is relatively old, semi-structured datasets that represent trees and graphs are ubiquitous, and new algorithms based on graph and subtree mining are still being proposed (Bui et al., 2014; Hadzic et al., 2015; Narouei et al., 2015). There are even approaches that reduce graph mining problems to subtree mining problems in order to reduce the runtime complexity of mining (Gemignani et al., 2015). Subtree mining research is still relevant because there are few better alternatives for encoding structure when solving problems on semi-structured data. Mining frequent substructures nevertheless presents non-trivial challenges. The process often requires scanning the entire database over multiple iterations, which can be prohibitively expensive in large-scale settings. To address numerous practical challenges and limitations, several approaches have been recently proposed (Zaki, 2005; Jiménez et al., 2012; Zou et al., 2006b; Kutty et al., 2007; Xiao et al., 2003; Tatikonda et al., 2006; Chehreghani et al., 2011; Asai et al., 2002; Chi et al., 2003; Hido and Kawano, 2005; Wang et al., 2004) (see (Jiménez et al., 2010) for an excellent survey); however, to date, most evaluations of subtree mining algorithms use synthetic or small-scale real datasets, leaving open the question of how they perform on a variety of real-world datasets. My interest in subtree mining is motivated by the problem of discovering malicious subtree patterns in network traffic, but the existing literature does not provide insight as to whether subtree mining represents a viable solution for a real-world networking dataset.
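To make the notion of a frequent substructure concrete, the following is a minimal sketch under stated assumptions, not one of the algorithms evaluated in this chapter. It assumes the simplest subtree notion (bottom-up subtrees, i.e., a node together with all of its descendants), ordered children, and a hypothetical `(label, [children])` tuple representation. The support of a pattern is the number of trees in the database that contain it, and a pattern is frequent when its support reaches `min_support`.

```python
from collections import Counter

# Assumption: a tree is a (label, [children]) tuple with ordered children.

def bottom_up_subtrees(tree, out):
    """Add a canonical string for the subtree rooted at every node to `out`.

    Returns the encoding of the subtree rooted at `tree` itself.
    """
    label, children = tree
    enc = label + "(" + ",".join(bottom_up_subtrees(c, out) for c in children) + ")"
    out.add(enc)
    return enc

def frequent_bottom_up_subtrees(database, min_support):
    """Return {pattern: support} for patterns found in >= min_support trees."""
    support = Counter()
    for tree in database:
        seen = set()
        bottom_up_subtrees(tree, seen)
        support.update(seen)  # a tree counts at most once per pattern
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical three-tree database.
database = [
    ("a", [("b", []), ("c", [("d", [])])]),
    ("a", [("c", [("d", [])])]),
    ("e", [("b", []), ("c", [("d", [])])]),
]
frequent = frequent_bottom_up_subtrees(database, min_support=2)
# frequent == {"b()": 2, "d()": 3, "c(d())": 3}
```

Each tree is scanned once here because bottom-up subtrees can be enumerated directly. The more general subtree notions studied in the literature admit far more patterns per tree, which is where the repeated database scans mentioned above become costly.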

In what follows, I examine a recent line of inquiry on the problem of mining frequent subtrees in a database of rooted and labeled trees. Existing methods can be broadly classified into two categories: candidate generation and pattern growth. A candidate generation algorithm enumerates all possible subtree combinations and incrementally calculates the frequency count for each subtree using an indexing structure that stores the occurrences of frequent nodes in the database. A pattern growth algorithm follows the divide-and-conquer methodology and generates candidate subtrees by growing subtrees from the data itself.
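The difference between the two categories can be illustrated with a deliberately simplified sketch. It assumes that patterns are restricted to downward label paths (chains of parent-child edges), which is a much smaller pattern class than general subtrees, and that trees are hypothetical `(label, [children])` tuples. The function names and the per-tree path index are illustrative choices, not taken from any of the cited algorithms.

```python
from collections import defaultdict

# Assumptions: trees are (label, [children]) tuples; patterns are downward
# label paths. Real subtree miners handle far richer pattern classes.

def nodes(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def paths_from(node):
    """Yield all downward label paths that start at `node`."""
    label, children = node
    yield (label,)
    for child in children:
        for rest in paths_from(child):
            yield (label,) + rest

def candidate_generation(database, min_support):
    # Index: the set of downward paths present in each tree.
    index = [{p for n in nodes(t) for p in paths_from(n)} for t in database]

    def support(p):
        return sum(p in paths for paths in index)

    labels = {n[0] for t in database for n in nodes(t)}
    level = {(l,) for l in labels if support((l,)) >= min_support}
    frequent_labels = [p[0] for p in level]
    result = {}
    while level:
        result.update({p: support(p) for p in level})
        # Enumerate every extension, whether or not it occurs in the data.
        candidates = {p + (l,) for p in level for l in frequent_labels}
        level = {c for c in candidates if support(c) >= min_support}
    return result

def pattern_growth(database, min_support):
    result = {}

    def grow(pattern, occurrences):
        # occurrences: (tree_id, node) pairs at which the pattern ends.
        support = len({tid for tid, _ in occurrences})
        if support < min_support:
            return
        result[pattern] = support
        # Only extensions that actually follow an occurrence are considered.
        extensions = defaultdict(list)
        for tid, node in occurrences:
            for child in node[1]:
                extensions[child[0]].append((tid, child))
        for label, occ in extensions.items():
            grow(pattern + (label,), occ)

    starts = defaultdict(list)
    for tid, tree in enumerate(database):
        for n in nodes(tree):
            starts[n[0]].append((tid, n))
    for label, occ in starts.items():
        grow((label,), occ)
    return result

# Hypothetical three-tree database.
database = [
    ("a", [("b", []), ("c", [("d", [])])]),
    ("a", [("c", [("d", [])])]),
    ("e", [("b", []), ("c", [("d", [])])]),
]
assert candidate_generation(database, 2) == pattern_growth(database, 2)
```

Both functions return the same frequent patterns. They differ in how much work is spent on candidates that never occur: the first tests every combination of a frequent pattern with a frequent label against the index, while the second only follows edges that exist at recorded occurrences.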

Despite the substantial body of work, there is still a significant lack of understanding of the strengths and limitations of these algorithms in realistic settings, resulting in a set of widely held, yet questionable, conclusions. First, it is believed that pattern growth techniques are superior (Zou et al., 2006b; Kutty et al., 2007; Wang et al., 2004; Deepak et al., 2013); however, little evidence suggests a measurable advantage. Second, while the performance of subtree mining algorithms is influenced profoundly by multiple factors (e.g., tree size, degree, depth, label distribution), little is known about the intricate interplay between these factors because of the limitations of the evaluation datasets.

Motivated by this, I conduct the first large-scale comparative study on frequent subtree mining algorithms using a variety of synthetic and real datasets. The goal is to assess the performance of existing algorithms in realistic settings and ultimately inform better algorithm design by investigating their strengths and limitations. This chapter begins by studying the characteristics of the synthetic datasets (Zaki, 2005) used in the majority of studies and demonstrating their shortcomings when compared to real datasets. The work then proposes novel synthetic tree generators that provide great flexibility in setting multiple factors (e.g., tree size, depth, fanout, label distribution) and produce trees closely mimicking the characteristics of real datasets. Leveraging the generated synthetic datasets and seven large real datasets, I investigate the runtime performance of four representative subtree mining algorithms from the two main categories (candidate generation and pattern growth) under varying settings of these factors. The algorithms were chosen because they contain the core concepts of the two categories, are popular in the literature, and represent the current state of the art. I provide insights into the strengths and weaknesses of these algorithms, many of which challenge conventionally held beliefs.
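As a rough illustration of what a parameterized tree generator can look like, the sketch below draws random labeled trees with a bounded depth, a bounded fanout, and a weighted label distribution. It is a generic example under assumed parameter names (`max_depth`, `max_fanout`, `weights`) and is not the generator proposed in this chapter.

```python
import random

# Assumption: trees are (label, [children]) tuples, as in a toy setting.

def random_tree(rng, labels, weights, max_depth, max_fanout, depth=1):
    """Draw a random tree with at most `max_depth` levels and `max_fanout` children per node."""
    label = rng.choices(labels, weights=weights, k=1)[0]
    children = []
    if depth < max_depth:
        for _ in range(rng.randint(0, max_fanout)):
            children.append(
                random_tree(rng, labels, weights, max_depth, max_fanout, depth + 1)
            )
    return (label, children)

def depth_of(tree):
    """Number of levels in the tree (a single node has depth 1)."""
    return 1 + max((depth_of(c) for c in tree[1]), default=0)

def max_fanout_of(tree):
    """Largest number of children of any node in the tree."""
    return max([len(tree[1])] + [max_fanout_of(c) for c in tree[1]])

rng = random.Random(0)  # fixed seed so the dataset is reproducible
database = [
    random_tree(rng, ["a", "b", "c"], [0.6, 0.3, 0.1], max_depth=4, max_fanout=3)
    for _ in range(100)
]
```

Skewing `weights` changes how often labels repeat across trees, which in turn changes how many patterns reach a given support threshold. That is one reason label distribution matters when comparing miners.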

Besides regular frequent subtree mining, the chapter also considers closed frequent subtree mining. A frequent subtree is closed if none of its supertrees has the same support. The concept of a closed subtree is attractive because special pruning techniques can be applied to speed up mining and reduce the number of subtrees generated. The performance gained by leveraging closedness is measured.
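The definition of closedness can be checked directly on a small example. The sketch below assumes, for brevity, that patterns are downward label paths written as tuples, so that one pattern is a supertree of another exactly when it contains it as a contiguous run. The `frequent` dictionary is hypothetical input, and this post-hoc filter only illustrates the definition; it is not the pruning used by closed subtree miners, which avoid generating the non-closed patterns in the first place.

```python
def is_subpath(p, q):
    """True if path p occurs as a contiguous run inside path q."""
    return any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))

def closed_patterns(frequent):
    """Keep patterns with no strictly larger pattern of equal support.

    Checking only frequent patterns is enough: a supertree with the same
    support as a frequent pattern is itself frequent.
    """
    return {
        p: s
        for p, s in frequent.items()
        if not any(
            len(q) > len(p) and t == s and is_subpath(p, q)
            for q, t in frequent.items()
        )
    }

# Hypothetical frequent patterns with their supports.
frequent = {
    ("a",): 2, ("b",): 2, ("c",): 3, ("d",): 3,
    ("a", "c"): 2, ("c", "d"): 3, ("a", "c", "d"): 2,
}
closed = closed_patterns(frequent)
# closed == {("b",): 2, ("c", "d"): 3, ("a", "c", "d"): 2}
```

Seven frequent patterns collapse to three closed ones here, and the support of every dropped pattern can be recovered from a closed pattern that contains it.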

The remainder of the chapter begins with a description of the four subtree mining algorithms compared in this study. Section 3.4 details the methodology, while sections 3.5 and 3.6 describe the real-world and synthetic datasets utilized. Experimental results are provided in section 6.6, followed by discussion and lessons learned.
