In this dissertation, we studied pattern mining in the supervised setting, where the objective is to find patterns (defining subpopulations of the data) that are important for predicting the class labels. We have presented several methods for mining predictive patterns for both atemporal and temporal data. The main contributions of this dissertation are summarized below.
• We presented the minimal predictive patterns (MPP) framework for supervised pattern mining in static (atemporal) data. This framework applies a novel Bayesian score to evaluate the predictiveness of patterns. It also considers the structure of patterns to ensure that every pattern is predictive not only compared to the entire data, but also compared to the data matching any of its subpatterns. We showed that the MPP framework is able to explain and summarize the data using fewer patterns than the existing methods. We also showed that using MPPs as features can greatly improve the classification performance.
• We presented an efficient algorithm for mining MPPs, which integrates pattern evaluation with frequent pattern mining and applies supervised pruning strategies to speed up the mining. We showed that our algorithm is more efficient than standard frequent pattern mining algorithms.
• We also studied the problem of supervised pattern mining in multivariate temporal data. We presented a novel method for mining recent temporal patterns (RTP), which we argued is appropriate for event detection problems. We showed that the RTP framework is able to learn accurate event detection models for real-world clinical tasks. In addition, we showed that it is much more efficient and scalable than existing temporal pattern mining methods.
• We extended the MPP framework to the temporal domain and presented the minimal predictive recent temporal patterns (MPRTP). We showed that MPRTP is effective for selecting predictive and non-spurious RTPs.
There are, however, some limitations of pattern mining techniques that our proposed methods inherit:
• Pattern mining suffers when applied to high-dimensional data. When the dimensionality of the data is large, the space of patterns becomes very large, which in turn makes the mining computationally very expensive and increases the risk of false discoveries.
• Pattern mining requires a prior discretization of the data in order to convert numeric values to a finite number of categories. This may result in losing some of the predictive information in the numeric attributes. Besides, pattern mining treats these discretized categories as being independent and disregards their ordinal relations.
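The discretization limitation can be made concrete with a small sketch. The attribute name, cut points, and labels below are hypothetical and chosen only for illustration; they are not taken from the experiments in this dissertation. The sketch shows two effects: distinct numeric values collapse into the same item, and the resulting items are plain symbols with no ordering between them.

```python
from bisect import bisect_right

# Hypothetical cut points and category labels (assumptions for illustration).
CUTS = [50.0, 100.0]
LABELS = ["low", "normal", "high"]

def discretize(value, cuts=CUTS, labels=LABELS):
    """Map a numeric value to the label of the interval it falls in."""
    return labels[bisect_right(cuts, value)]

def to_item(name, value):
    """Encode a discretized attribute as an unordered item, as itemset miners do."""
    return f"{name}={discretize(value)}"

values = [49.9, 50.1, 99.0, 250.0]
items = [to_item("glucose", v) for v in values]

# 50.1 and 99.0 become the same item, so their difference is lost,
# while 49.9 and 50.1 (only 0.2 apart) become different items.
same_item = items[1] == items[2]
split_neighbors = items[0] != items[1]
```

Because the items are opaque strings, a miner that sees "glucose=low" and "glucose=high" has no way of knowing that "normal" lies between them unless that ordering is encoded separately.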
We now outline some related open questions and research opportunities.
• Mining Association Rules: This is an unsupervised pattern mining task which aims to extract interesting correlations, associations and causal relations between items in the data¹. Association rules are usually obtained by first applying a frequent pattern mining method and then generating rules that have coverage and confidence higher than user-specified thresholds [Han et al., 2006]. However, using an argument similar to the one in Section 3.4, we can see that this approach usually leads to many spurious association rules. For example, if the rule chips ⇒ salsa has high confidence, many spurious rules that extend it, such as chips ∧ banana ⇒ salsa, are expected to have high confidence as well.
The MPP framework we proposed for supervised pattern mining can also be used to filter out spurious association rules. That is, we can apply it as a postprocessing step to ensure that every association rule in the result offers a significant predictive advantage over all of its subrules.
¹ In contrast to our work, where we restrict the consequent of rules to be a class label (supervised), the consequent of rules for association rule mining can be any item (unsupervised).
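The chips and salsa example can be illustrated with a minimal sketch. The transactions are invented toy data, and the filter shown here is a deliberately simplified stand-in: it compares raw confidences with a fixed margin (an assumed parameter), whereas the MPP framework uses a Bayesian score to judge whether a rule is significantly better than its subrules.

```python
from itertools import combinations

# Toy transactions, invented for illustration only.
transactions = [
    {"chips", "salsa"},
    {"chips", "salsa", "banana"},
    {"chips", "salsa", "banana"},
    {"chips", "salsa"},
    {"chips", "banana"},
    {"banana"},
    {"salsa"},
    {"chips", "salsa", "milk"},
]

def support(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    covered = support(antecedent)
    return support(set(antecedent) | set(consequent)) / covered if covered else 0.0

def proper_subsets(items):
    """All proper subsets of items, including the empty antecedent."""
    items = sorted(items)
    for r in range(len(items)):
        for combo in combinations(items, r):
            yield frozenset(combo)

def improves_on_subrules(antecedent, consequent, margin=0.05):
    """Simplified check: confidence must beat every subrule by a fixed margin."""
    conf = confidence(antecedent, consequent)
    return all(conf > confidence(sub, consequent) + margin
               for sub in proper_subsets(antecedent))

conf_short = confidence({"chips"}, {"salsa"})            # 5/6
conf_long = confidence({"chips", "banana"}, {"salsa"})   # 2/3
keep_short = improves_on_subrules({"chips"}, {"salsa"})
keep_long = improves_on_subrules({"chips", "banana"}, {"salsa"})
```

With a minimum confidence of 0.6, both rules would be reported by a threshold-based miner. The subrule comparison keeps chips ⇒ salsa, which improves on the baseline frequency of salsa, and drops chips ∧ banana ⇒ salsa, which adds nothing over its subrule.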
• Comparing and Contrasting Datasets: Identifying and explaining the similarities and differences between two datasets can be very valuable. For example, suppose we have data about patients in two different intensive care units (ICUs), or within the same ICU during two different periods. If the two ICUs experience different outcomes (e.g., different mortality rates), we may wish to understand and gain insight into the reasons they differ². An important research problem is to extend our method to search for patterns that contribute most to the differences between datasets and to explain how they account for the differences.
• Detecting Patterns in Spatio-Temporal Data: The aim of this task is to find patterns that describe the temporal changes in the relations between spatially related objects. For example, assume we have a temporal sequence of medical images and an object detection algorithm. Assume we detected two neighboring objects A and B and defined their relations using the intensity gradient. It would be interesting to study patterns that describe how this relation changes over time. An example of such patterns is Intensity_gradient(A,B)=low precedes Intensity_gradient(A,B)=high.
• Mining Pattern Sets: Traditional pattern mining methods are based on the idea of evaluating the quality of individual patterns and choosing the top-quality ones. In this thesis, we proposed a method that considers the relations between patterns (the partial order defined on the lattice of patterns) when evaluating their quality. An alternative (and more general) approach is to cast pattern mining as an optimization problem. This can be done by specifying a function that evaluates the quality of an entire set of patterns and finding a set that optimizes (or satisfies constraints on) that function. An example of such a task is to find the smallest set of patterns that collectively cover at least 90% of the data and predict the class label with accuracy of at least 80%. This general formulation appears to be hard to solve. An interesting research direction is to investigate specific forms of quality functions that make the problem computationally more tractable.
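To make the pattern-set formulation concrete, the following sketch applies a simple greedy heuristic to the coverage and accuracy example above. It is not a method proposed in this dissertation and it does not guarantee the smallest set. The pattern names, toy data, and the convention that a record is predicted by the first selected pattern matching it are all assumptions made for illustration.

```python
def greedy_pattern_set(patterns, labels, min_coverage=0.9, min_accuracy=0.8):
    """Greedily pick patterns until the coverage constraint is met.

    patterns: dict mapping a name to (set of matched record indices, predicted label).
    Returns (chosen names, coverage, accuracy on covered records, feasible flag).
    """
    n = len(labels)
    chosen, covered, correct = [], set(), 0
    remaining = dict(patterns)

    def gain(name):
        # Net number of newly covered records that are predicted correctly.
        matched, pred = remaining[name]
        new = matched - covered
        hits = sum(1 for i in new if labels[i] == pred)
        return hits - (len(new) - hits)

    while remaining and len(covered) / n < min_coverage:
        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        if gain(best) <= 0:
            break  # no remaining pattern adds more correct than incorrect records
        matched, pred = remaining.pop(best)
        new = matched - covered
        correct += sum(1 for i in new if labels[i] == pred)
        covered |= new
        chosen.append(best)

    coverage = len(covered) / n
    accuracy = correct / len(covered) if covered else 0.0
    feasible = coverage >= min_coverage and accuracy >= min_accuracy
    return chosen, coverage, accuracy, feasible

# Toy data: ten records with binary class labels and five candidate patterns.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
patterns = {
    "p1": ({0, 1, 2, 3}, 1),
    "p2": ({4, 5, 6, 7, 8}, 0),
    "p3": ({3, 4}, 1),
    "p4": ({9}, 0),
    "p5": ({0, 1}, 1),
}
chosen, coverage, accuracy, feasible = greedy_pattern_set(patterns, labels)
```

On this toy input the heuristic selects p1 and then p2, which together cover nine of the ten records with eight correct predictions, so both constraints are satisfied. In general a greedy choice can return a larger set than necessary or miss a feasible one, which reflects the difficulty of the general formulation noted above.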
² For example, the higher mortality in hospital A compared to hospital B may be simply because patients in A were in worse condition than patients in B, not because of worse patient management.
APPENDIX
MATHEMATICAL DERIVATION AND COMPUTATIONAL COMPLEXITY OF THE BAYESIAN SCORE
This appendix explains in detail the mathematical derivations and the computational complexity of the Bayesian score described in Section 3.5.1.2. Section A.2 derives the closed-form solution for the marginal likelihood of model Mh: P(G|Mh). Section A.3 shows the four equivalent formulas for solving P(G|Mh). Section A.4 illustrates how to obtain the marginal likelihood of model Ml from the solution to the marginal likelihood of model Mh. Finally, Section A.5 analyzes the overall computational complexity of computing the Bayesian score.