In this chapter, the problem of mining frequent closed patterns in dynamic transaction datasets, a natural complement to the problem originally posed for static transaction datasets, is formally defined and studied. A new incremental mining algorithm is proposed to handle this problem by applying the incremental method to the CLOSET+ algorithm [120], which is one of the most efficient algorithms for mining frequent closed patterns in static transaction datasets. The purpose of the study in this chapter is to investigate how the incremental method can be used to extend existing mining algorithms so that they handle dynamic datasets efficiently. For this purpose, only the principal data structures and techniques of CLOSET+ are implemented. The proposed algorithm can serve as the core procedure and the platform for adopting the remaining, unimplemented techniques, although this is left to future work.
The proposed algorithm uses two tree-like data structures (the DB-tree and the two-level hash-indexed result tree) to compactly retain the dynamic transaction dataset and the previously discovered results. After each dataset update that adds new transactions to, or deletes existing transactions from, the original transaction dataset, the algorithm updates both structures by scanning the affected transactions twice. Hence, without scanning the entire updated dataset again, the proposed algorithm can incrementally mine the current frequent closed patterns in the updated dataset by considering only the changes caused by the update.
Several sets of experiments were conducted to explore the relative merits and limitations of the proposed algorithm compared with rerunning CLOSET+ on the entire updated dataset. The experimental results show that the proposed algorithm is more efficient than rerunning CLOSET+ on the updated dataset when the number of transactions changed by the update is relatively small compared with the number of unchanged transactions in the dataset.
Enhancing the scalability of the proposed algorithm and reducing the cost of reordering the DB-tree and the two-level hash-indexed result tree during the mining process are planned as future work, so that the proposed algorithm can also be applied efficiently in other situations.
In general, the incremental method is about trade-offs. It seeks a balance between the benefits of maintaining previously discovered information and the associated costs, in order to reduce the cost of discovering the information in the updated dataset. Thus, any algorithm adopting the incremental method attempts to find, use and maintain this balance to improve its performance. That is also the golden rule for designing and modifying incremental algorithms.
The performance of mining frequent patterns can also be affected by the characteristics of the target datasets. As an important characteristic of real datasets, the pattern support distribution is investigated in the following chapter, with the aim of designing efficient mining algorithms and tuning the performance of mining algorithms to their target datasets.
Chapter 3
Investigating Power-law Relationships in Pattern Support Distributions
It is common sense to take a method and try it. If it fails, admit it frankly and try another. But above all, try something.
— Franklin D. Roosevelt
A Pattern Support Distribution (PSD) is an important characteristic of a dataset. The support of a pattern indicates the frequency of its occurrence in a dataset, and the distribution of the number of patterns against their corresponding supports is known as the pattern support distribution of the dataset. Identifying the pattern support distributions of target datasets is very useful for many data mining tasks, such as mining frequent patterns.
Power-law relationships appear very often in the natural and man-made worlds, for example in the populations of cities [7]. Chuang et al. [28, 29] claim that power-law relationships and self-similarity phenomena also exist in the pattern support distributions of real datasets. In this chapter, we investigate whether this claim holds.
Identifying power-law relationships in the pattern support distributions of target datasets can provide a better understanding of, and more successful applications of, pattern support distributions. However, verifying whether power-law relationships really exist in pattern support distributions is difficult because of the large statistical fluctuations that co-exist with power-law relationships in real-world examples. Chuang et al. [28, 29] made their claim by observing visualizations of some instances. Such a claim should rest on statistical tests; their claim therefore remains unverified.
This chapter mainly focuses on how to use quantitative goodness-of-fit tests to examine whether power-law relationships exist in the pattern support distributions of real static transaction datasets. Some studies, such as [30], have proposed approaches to statistically verifying whether a power-law relationship really exists in a set of empirical data. These tests are limited to examining a single instance of all the possible datasets generated by an underlying process. That makes our problem more challenging, since we intend to carry out the verification over the whole population generated by an underlying process. By extending the approach suggested by Clauset et al. [30], this chapter proposes a new method that uses the bootstrap method and the universality of power-law relationships to verify quantitatively whether power-law relationships exist in the pattern support distributions of real transaction datasets. Based on a large number of statistical goodness-of-fit test results and the discussions in this chapter, a new and more appropriate claim is eventually given: the hypothesis that power-law relationships exist in the pattern support distributions of real retail transaction datasets cannot be ruled out at the level of basic distributions. This differs from the claim made by Chuang et al. [28, 29].
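As a rough illustration of the kind of computation such tests build on, the sketch below estimates a power-law exponent from a list of pattern supports and uses bootstrap resampling to show how much that estimate varies. It is not the method proposed in this chapter. The function names, the toy data and the choice of `x_min` are assumptions made for illustration, and the estimator is the commonly used closed-form approximation to the discrete maximum-likelihood estimate, which is known to be crude when `x_min` is small.

```python
import math
import random


def approx_alpha(supports, x_min):
    """Approximate MLE of the exponent of a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x / (x_min - 0.5))) over the values
    x >= x_min. This is an approximation, not an exact estimator.
    """
    tail = [x for x in supports if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


def bootstrap_alphas(supports, x_min, n_resamples=200, seed=0):
    """Re-estimate alpha on resamples drawn with replacement from the tail."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    tail = [x for x in supports if x >= x_min]
    return [
        approx_alpha(rng.choices(tail, k=len(tail)), x_min)
        for _ in range(n_resamples)
    ]


# Hypothetical data: one absolute support value per pattern.
toy_supports = [1, 1, 1, 1, 2, 2, 3, 5]
alpha_hat = approx_alpha(toy_supports, x_min=1)
alphas = bootstrap_alphas(toy_supports, x_min=1)
print(alpha_hat, min(alphas), max(alphas))
```

The spread of the bootstrap estimates gives a first impression of how much a fitted exponent can fluctuate from sample to sample, which is why a visual inspection of a single instance is weak evidence.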
This chapter is organized as follows. Section 3.1 introduces the background and formally defines the pattern support distribution of a transaction dataset. Section 3.2 gives an overview of power-law relationships, in particular their special properties, the fitting of discrete power-law distributions to a set of empirical data, and the verification of power-law relationships in such data. In Section 3.3, a qualitative appraisal is made by visually inspecting the partial (cumulative) pattern support distributions of five real transaction datasets (a power-law relationship appears as a roughly straight line in a log-log plot); this yields the hypothesis that power-law relationships exist in the pattern support distributions of real retail transaction datasets. The hypothesis is then tested and verified based on a large number of experimental results and discussions. The self-similarity phenomenon linked to power-law relationships in the pattern support distributions of real retail transaction datasets is also explored in that section. Section 3.4 summarizes the chapter.
3.1 Introduction
Some interesting work has been done on discovering deeper knowledge about the characteristics of real datasets. For instance, Ramesh et al. [99] studied the length distributions of frequent and maximal frequent patterns, which show the relationship between the number of patterns and their lengths. Ramesh et al. [99] also attempted to apply their study to generating more realistic synthetic datasets for algorithm benchmarking. Lhote et al. [68] used probabilistic techniques to investigate the relationship between the average number of frequent (closed) patterns and their supports. Besson et al. [15] sought the relationship between the number of patterns and the conjunction of maximal frequency, minimal frequency and size constraints, which can guide users in choosing initial parameter settings for substring pattern discovery.
One important characteristic of a dataset is the pattern support distribution. A pattern support distribution is a discrete distribution that comprises a set of points (x_i, y_i), where x_i is an absolute or relative support value and y_i is the number of patterns with the support value x_i. This chapter concentrates on the pattern support distribution (denoted as PSD(TD) or PSD) of a transaction dataset TD based on the absolute support of patterns. However, the ideas and methods applied to absolute support in this chapter can be applied to relative support too.
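The definition above can be made concrete with a small sketch that computes the points (x_i, y_i) for a toy transaction dataset. It assumes that every non-empty itemset occurring in at least one transaction counts as a pattern, and it enumerates patterns by brute force, which is exponential in transaction length and only workable for tiny examples. The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pattern_supports(transactions):
    """Absolute support of every non-empty itemset occurring in the data."""
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicate items
        for size in range(1, len(items) + 1):
            for pattern in combinations(items, size):
                support[pattern] += 1
    return support


def pattern_support_distribution(transactions):
    """Sorted (x_i, y_i) pairs: y_i patterns have absolute support x_i."""
    histogram = Counter(pattern_supports(transactions).values())
    return sorted(histogram.items())


# Hypothetical toy transaction dataset TD with four transactions.
toy_td = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
print(pattern_support_distribution(toy_td))
```

Here the patterns a and b have support 3, the patterns c, ab and ac have support 2, and the patterns bc and abc have support 1, so the distribution consists of the points (1, 2), (2, 3) and (3, 2).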
From the viewpoint of probability theory and statistics, a pattern support distribution can be read as the discrete probability distribution that gives the probability that the support of an arbitrary pattern in a target dataset equals a particular value¹. Pattern support distributions are useful in many data mining tasks, such as determining an appropriate minimum support for mining frequent patterns, synthetic data generation and frequency approximation over data streams [28, 29].
A power-law relationship generally indicates a special inverse relationship between two quantities²: it describes how one quantity decreases as the other increases. Power-law relationships have been observed in many natural and man-made phenomena, including city sizes, word frequencies and sightings of bird species in the United States, to name a few [3, 7, 83, 85, 86]. Their ubiquity and mathematical features make them of interest.
¹ Based on the definition in the Cambridge Dictionary of Statistics [35], in probability theory and statistics, a probability distribution identifies the probability of each value of a discrete random variable, as the binomial distribution does, or gives the probability that the value of a continuous random variable falls within a particular interval.
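In its usual form, the inverse relationship described above can be written as follows; the symbols C and α are notation assumed here for illustration, not taken from the text. Taking logarithms of both sides shows why such a relationship appears as a roughly straight line, with slope −α, in a log-log plot.

```latex
y = C\,x^{-\alpha}, \qquad C > 0,\ \alpha > 0
\quad\Longrightarrow\quad
\log y = \log C - \alpha \log x .
```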
Chuang et al. [28, 29] observed the pattern support distributions drawn from several real retail datasets and claimed that power-law relationships also exist in the pattern support distributions of real datasets. To the best of our knowledge, the work of Chuang et al. [28, 29] is the first and only work on power-law-based pattern support distributions so far. Identifying power-law relationships in pattern support distributions can provide better insight into real datasets and benefit mining applications. However, like most of the work on power-law relationships in the literature, [28, 29] do not provide any statistical tests to verify their claim.
In addition, the self-similarity phenomenon is often linked to power-law relationships. A self-similar object is exactly or approximately similar to a part of itself [78]. Chuang et al. [28, 29] claimed that power-law relationships also exist in the support distribution of patterns of a given length, and that this is a self-similarity phenomenon.
A power-law-based distribution is different from most other distributions, such as the normal distribution. In most other distributions, quantities are distributed around their average value; that is, the closer a value is to the mean, the more probable it is. Such distributions can be well described by their mean and standard deviation. However, a power-law-based distribution cannot be described by these simple measures. This unusual characteristic of a power-law-based distribution often implies that some kind of complex underlying process lies behind the distribution. It makes power-law-based distributions more interesting to investigate, especially since the complex underlying processes are unknown in most cases. For the same reason, to the best of our knowledge, work such as [30] only discusses how to verify whether a set of empirical data (i.e., a single instance of the possible datasets) follows a power-law relationship.
The rest of this chapter proposes a new method that uses statistical tests and special properties of power-law relationships to verify whether power-law relationships exist in pattern support distributions. The proposed method extends the verification of the power-law hypothesis from a single instance to the whole population of target datasets generated by an unknown process.