
Introduction

CHAPTER 1. INTRODUCTION

The Generalised Interaction Mining (GIM) and Generalised Rule Mining (GRM) problems introduced in this thesis are to solve a wide range of interaction mining problems at the abstract level, and to do so very efficiently. This means the problems must be solved at a general level, requiring the development of frameworks and a consistent and efficient computational model that can capture diverse interaction mining problems. This is a challenging task since such problems have very different semantics governing the interactions, their structures and their interpretation. For example, frequent itemset mining, graph mining, finding correlation structures between variables, clustering, rule based classification, mining uncertain databases and mining relationships in social networks (to name a few) have very different problem definitions: the pattern definitions and semantics are different, what makes an interaction pattern interesting is different, and how the search should progress is different. The data is also very different; for example, real valued records, a set of time series, transaction databases, probabilistic databases, attribute value pairs produced by discretization, instances and adjacency matrices. Finally, solving interaction mining problems usually requires the simultaneous and interdependent development of new pattern semantics and specialist algorithms for mining the respective pattern. One may therefore conclude that it is not easy to develop a model abstract enough to capture this variation in interaction mining problems, while at the same time enabling the development of an equally abstract algorithm that also solves them efficiently (ideally, more efficiently than specialist algorithms). Doing so is very beneficial, however: it can separate the semantics of a problem from the algorithm used to mine it. This makes it easier to develop new methods by allowing the data miner to focus only on their problem's semantics and then plug them into a framework. Furthermore, by removing the burden of designing an efficient algorithm, this can make it easier for end users to design custom data mining methods.

Solving interaction mining problems at the abstract level, as well as applications of this to specific problems, is the primary focus of part II of this thesis.

1.1. RESEARCH PROBLEMS AND THESIS OVERVIEW

Chapter 3 introduces and solves the GIM problem. GIM¹ uses an efficient and intuitive computational model based purely on vectors and vector valued functions. The semantics of the interactions, their interestingness measures and the type of data considered are all flexible components. Intuitively, each interaction is represented by a vector in a space typically spanned by the samples in the database. The search progresses by performing functions on these vectors. By providing a layer of abstraction between a problem's semantics and the algorithm used to mine it, the computational model allows both to vary independently of each other. It also encourages an interesting geometric way of thinking about pattern mining problems in terms of vector operations, especially when an interestingness measure has a geometric interpretation. The GIM algorithm runs in linear time in the number of interesting interactions and uses little space. Chapter 3 also shows how GIM can be applied to a wide range of problems, including graph mining, counting based methods, itemset mining, clique mining, clustering, complex pattern mining, negative pattern mining and solving an optimisation problem.

¹ Note that the term GIM refers both to the problem, as well as the model, framework and algorithm proposed to solve interaction mining problems at the abstract level.
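To make the idea of separating semantics from search more concrete, here is a minimal sketch in Python. It is not the GIM algorithm from the thesis: the function names (`mine_interactions`, `combine`, `is_interesting`), the plain depth-first search and the toy data are all assumptions made for illustration. Each variable is a vector with one entry per sample, a pluggable function combines vectors into interaction vectors, and a pluggable test decides whether an interaction is kept and extended.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[int]
Interaction = Tuple[str, ...]


def mine_interactions(
    variables: Dict[str, Vector],
    combine: Callable[[Vector, Vector], Vector],
    is_interesting: Callable[[Vector], bool],
) -> Dict[Interaction, Vector]:
    """Depth-first search over interactions (tuples of variable names).

    Hypothetical helper: the search knows nothing about what the vectors
    mean; all semantics live in `combine` and `is_interesting`.
    """
    names = sorted(variables)
    found: Dict[Interaction, Vector] = {}

    def extend(prefix: Interaction, vec: Optional[Vector], start: int) -> None:
        for i in range(start, len(names)):
            name = names[i]
            if vec is None:
                new_vec = variables[name]
            else:
                new_vec = combine(vec, variables[name])
            # Assumption: the measure is anti-monotone, so an uninteresting
            # interaction cannot have interesting extensions.
            if not is_interesting(new_vec):
                continue
            interaction = prefix + (name,)
            found[interaction] = new_vec
            extend(interaction, new_vec, i + 1)

    extend((), None, 0)
    return found


# Toy semantics (assumed): entries are occurrence counts per sample, an
# interaction occurs as often as its rarest member, and an interaction is
# interesting if it occurs at least twice in total.
def elementwise_min(u: Vector, v: Vector) -> Vector:
    return [min(x, y) for x, y in zip(u, v)]


def occurs_at_least_twice(v: Vector) -> bool:
    return sum(v) >= 2


variables = {
    "a": [2, 1, 0, 3],
    "b": [1, 1, 2, 0],
    "c": [0, 2, 1, 1],
}
result = mine_interactions(variables, elementwise_min, occurs_at_least_twice)
```

Swapping in a different `combine` or `is_interesting` changes the problem being mined without touching the search code, which is the kind of independence the paragraph above describes.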

Chapter 4 presents a vectorised framework and novel algorithm called GLIMIT for solving itemset mining problems from a geometric perspective in a transposed transaction database. It is shown to outperform FP-Growth and Apriori on the frequent itemset mining task. An efficient method for generating association rules is also presented.
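The notion of a transposed transaction database can be illustrated with a short sketch. The code below is not GLIMIT itself; it only assumes that transposing means storing one vector per item, with one position per transaction, and it uses Python integers as bit vectors. The helper names `transpose` and `support` and the toy transactions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def transpose(transactions: List[Set[str]]) -> Dict[str, int]:
    """Map each item to a bitmask; bit t is set if transaction t contains the item."""
    item_vectors: Dict[str, int] = defaultdict(int)
    for t, transaction in enumerate(transactions):
        for item in transaction:
            item_vectors[item] |= 1 << t
    return dict(item_vectors)


def support(itemset: Iterable[str], item_vectors: Dict[str, int], n_transactions: int) -> int:
    """Count the transactions that contain every item in `itemset`."""
    mask = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        mask &= item_vectors.get(item, 0)  # keep transactions that also contain this item
    return bin(mask).count("1")


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
item_vectors = transpose(transactions)
bread_and_milk = support(["bread", "milk"], item_vectors, len(transactions))
```

In this representation the support of an itemset is obtained purely from operations on item vectors, without rescanning the original transactions.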

Chapter 5 considers the problem of mining complex co-location patterns between different types of objects in a real world spatial database. When applied to a large astronomy database, this mines relationships (including negative relationships and the effect of multiple occurrences) between different types of galaxies. Part of this problem can be solved efficiently with GIM or GLIMIT.

Chapter 6 introduces and solves the Generalised Rule Mining (GRM) problem. Rules are an important interaction pattern but existing approaches are limited to conjunctions of binary literals, fixed measures and counting based algorithms. Rules can be much more diverse, useful and interesting! The chapter redefines rule mining in terms of a vectorised computational model similar to that used in GIM. This abstraction is motivated through the introduction of three novel methods addressing problems including correlation based classification, finding interactions for improving regression models and finding probabilistic association rules in uncertain databases. Two of these methods are introduced in chapter 6 (Probabilistic Association Rule Mining (PARM) in uncertain databases and Conjunctive Correlation Rules (CCRules) for classification), while one is introduced in chapter 7.
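As a loose illustration of a rule measure that is not based on counting, the sketch below scores a conjunctive rule body by its correlation with a target vector. It is not the CCRules method from the thesis: the choice of Pearson correlation, the helper names `conjunction` and `pearson`, and the toy data are all assumptions.

```python
from math import sqrt
from typing import List, Sequence


def conjunction(vectors: Sequence[List[int]]) -> List[int]:
    """Elementwise AND of binary vectors (one entry per sample)."""
    return [int(all(bits)) for bits in zip(*vectors)]


def pearson(x: List[int], y: List[int]) -> float:
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector; treated as 0 here
    return cov / sqrt(var_x * var_y)


# Toy data (assumed): two binary attributes and a binary class label.
attributes = {
    "a": [1, 1, 1, 0, 0, 1],
    "b": [1, 1, 0, 1, 0, 1],
}
target = [1, 1, 0, 0, 0, 1]

body = conjunction([attributes["a"], attributes["b"]])  # rule body "a AND b"
score_body = pearson(body, target)
score_a_only = pearson(attributes["a"], target)
```

In this toy data the conjunction "a AND b" correlates more strongly with the target than "a" alone, which is the kind of interaction a correlation-based rule measure could reward.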

Since interactions between variables in a database are often unknown, to the detriment of further analysis, classification or mining tasks, chapter 7 proposes Correlated Multiplication Rules (CMRules). These capture interactions predictive of a dependent variable and are the first rules with multiplicative semantics. Furthermore, a feature selection and dimensionality reduction method is described whereby CMRules are used to generate composite features. One advantage of this is that it enables linear models to learn non-linear decision boundaries with respect to the original features. As described in detail below, part II has a strong link to the problems considered in
