Multi-Objective Evolutionary Algorithms for Association Rule Mining

Chapter 2 Literature Review

2.3 Genetic Algorithm

2.3.4 Multi-Objective Evolutionary Algorithms for Association Rule Mining


An association rule is an implication between two item sets A and B, written A→B, that describes a dependency between the item sets in a data set. The problem of mining association rules has been considered by many researchers, and a large number of algorithms have been developed for extracting association rules from different types of data sets (Hipp et al. 2000; Han & Kamber 2006, pp. 227-254).

Most of the existing classical algorithms for mining association rules are based on a support-confidence framework (Agrawal & Srikant 1994; Hipp et al. 2000; Jesus et al. 2011). This framework consists of two sub-processes: finding all frequent item sets, and generating rules from those frequent item sets, based on a user-defined minimum support value and minimum confidence value, respectively. Several authors (Yan et al. 2009; Yan et al. 2005; Wakabi-Waiswa & Baryamureeba 2008; Jesus et al. 2011) have noted that these algorithms raise the following major challenges: 1) users need to specify appropriate threshold values for mining rules even though they may have no prior information about the data set; 2) association rule mining is an NP-hard problem, because searching for all frequent item sets that satisfy a minimum support value involves an exponential search space of size 2^n, where n is the number of items (Yan et al. 2009; Jesus et al. 2011); and 3) the framework generates a huge number of unnecessary rules from the frequent item sets, resulting in weak mining performance (Berzal et al. 2002; Martin et al. 2014).
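The two measures behind the support-confidence framework, and the size of the item set search space, can be illustrated with a minimal Python sketch. The toy transactions and the function names (`support`, `confidence`, `candidate_itemsets`) are assumptions made for illustration only; they are not taken from any of the cited algorithms.

```python
from itertools import combinations

# Hypothetical toy data set: each transaction is a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A); 0.0 if A never occurs."""
    supp_a = support(antecedent, transactions)
    if supp_a == 0:
        return 0.0
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / supp_a


def candidate_itemsets(items):
    """Enumerate every non-empty subset of `items` (2^n - 1 candidates)."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


if __name__ == "__main__":
    print(support({"bread", "milk"}, TRANSACTIONS))
    print(confidence({"bread"}, {"milk"}, TRANSACTIONS))
    print(len(list(candidate_itemsets({"bread", "butter", "milk"}))))
```

With only three items there are already seven non-empty candidate item sets; each additional item doubles the number of subsets, which is why exhaustive enumeration quickly becomes impractical.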

To avoid the use of minimum support and confidence thresholds, researchers use genetic algorithm based multi-objective approaches, because this allows a more complex value to be used as the fitness function for an individual (Jesus et al. 2011).

Recently, a large number of research papers have used evolutionary algorithms for mining association rules. These studies have found that evolutionary algorithms (EAs), particularly genetic algorithm (GA) based approaches, are efficient tools, especially when the search space is too large for deterministic search methods (Martin et al. 2014; Mukhopadhyay et al. 2014). Because of their inherently parallel structure, GA based methods are effective for automatically processing large amounts of data and discovering meaningful and significant information. In real world applications, data sets contain not only quantitative or numeric values but also categorical values. For this reason, several studies have proposed methods for mining Boolean association rules (BARs) from data sets with categorical values (Yan et al. 2009; Shenoy et al. 2003; Shenoy et al. 2005).

Ghosh and Nath (2004) consider the association rule mining task as a multi-objective problem instead of a single objective one. Different measures, such as support count, comprehensibility and interestingness, are used to improve the quality of a generated rule. Using these measures as the objectives of the association rule mining task, the study applies a Pareto based genetic algorithm to mine useful and interesting rules from a market basket database.
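The idea of Pareto based selection over several rule quality measures can be sketched as follows. This is a generic non-dominated filter written under stated assumptions, not the procedure of Ghosh and Nath: the rule labels and objective values are hypothetical, and all objectives are assumed to be maximised.

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b`.

    Assumes every objective is to be maximised: `a` must be at least as
    good as `b` everywhere and strictly better in at least one objective.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(scored_rules):
    """Keep the rules whose objective vectors no other rule dominates."""
    return [
        (rule, objs)
        for rule, objs in scored_rules
        if not any(dominates(other, objs) for _, other in scored_rules)
    ]


# Hypothetical scores: (support, comprehensibility, interestingness).
SCORED = [
    ("r1", (0.40, 0.70, 0.20)),
    ("r2", (0.30, 0.60, 0.10)),  # worse than r1 on every objective
    ("r3", (0.10, 0.90, 0.50)),  # trades support for the other two
    ("r4", (0.40, 0.70, 0.20)),  # identical to r1, so neither dominates
]

if __name__ == "__main__":
    for rule, objs in pareto_front(SCORED):
        print(rule, objs)
```

In this toy example only r2 is filtered out. The remaining rules represent different trade-offs among the objectives, so no single threshold on one measure is needed to choose between them.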

To mine interesting association rules, Wakabi-Waiswa and Baryamureeba (2008) propose a Pareto based multi-objective evolutionary algorithm. To improve the interestingness of an association rule, they use different measures such as J-measure, perplexity, comprehensibility, interestingness and predictive accuracy.

Yan, Zhang and Zhang (Yan et al. 2005; Yan et al. 2009) propose the ARMGA and EARMGA algorithms for identifying BARs using a genetic algorithm without specifying actual minimum support and confidence values. These studies show how hard it is for users to select suitable threshold values, since different databases require different support values to mine useful and interesting rules. Instead of using the support-confidence framework, these algorithms use a rule interest method based on Piatetsky-Shapiro (1991) to define the positive confidence of a rule. To encode each association rule, these algorithms follow a Michigan strategy based encoding technique. Experimental results show that a large number of high quality rules are generated because a weak fitness function is used. Because they rely on simple genetic operators such as mutation, these approaches miss some high quality rules that are generated in the intermediate generations of a population.
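The following minimal Python sketch illustrates the two ideas named above: a Michigan-style individual, in which one chromosome encodes exactly one rule, and the basic Piatetsky-Shapiro rule interest, supp(A ∪ B) - supp(A) × supp(B). It does not reproduce the actual chromosome layout or fitness function of ARMGA or EARMGA; the class `RuleChromosome`, the function names and the toy transactions are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical toy data set: each transaction is a set of items.
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]


@dataclass(frozen=True)
class RuleChromosome:
    """Michigan-style individual: one chromosome encodes one rule A -> B."""
    antecedent: frozenset
    consequent: frozenset


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_interest(rule, transactions):
    """Piatetsky-Shapiro rule interest: supp(A u B) - supp(A) * supp(B).

    Zero when A and B are statistically independent, positive when they
    occur together more often than independence would predict.
    """
    joint = support(rule.antecedent | rule.consequent, transactions)
    expected = (support(rule.antecedent, transactions)
                * support(rule.consequent, transactions))
    return joint - expected


if __name__ == "__main__":
    r1 = RuleChromosome(frozenset({"butter"}), frozenset({"bread"}))
    r2 = RuleChromosome(frozenset({"bread"}), frozenset({"milk"}))
    print(rule_interest(r1, TRANSACTIONS))  # positive association
    print(rule_interest(r2, TRANSACTIONS))  # negative association
```

Because each individual is a single rule, the population as a whole plays the role of the rule set, and a measure such as rule interest can rank individuals without a user-supplied support or confidence threshold.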

More recently, multi-objective association rules with genetic algorithm (ARMMGA) was proposed to reduce the large number of rules generated by ARMGA (Qodmanan et al. 2011). This approach presents new crossover and mutation operators to prevent the generation of invalid chromosomes in ARMGA, and it orders the chromosomes in the population by fitness value. Although this approach generates a smaller number of rules, some of them are misleading and trivial because a weak constraint is used. The fitness function is defined in such a way that it generates unnecessary rules.

In order to extract a set of high quality rules that are easy to understand and interesting, recent studies jointly optimize different measures (Alatas & Akin 2008a; Ghosh & Nath 2004; Martin et al. 2014). These approaches remove the drawbacks of single objective algorithms and mine high quality rules from data sets with quantitative or numerical values (Salleb-aouissi et al. 2013; Webb 2001; Martin et al. 2014).

Motivated by the features of multi-objective approaches, this thesis proposes two new GA based approaches, built on different design factors and data sets, that jointly optimize multiple objectives to discover a reduced set of high quality BARs. The main objectives in designing these approaches are to generate rules that are easy to understand and interesting, and that offer a good trade-off among the number of rules, support, confidence and other objectives of the data sets.

2.4 Initial Populations of an Evolutionary Algorithm for Association Rule Mining Problems

An initial population has a significant effect on the subsequent generations of a population. This section describes previous studies of a simple genetic algorithm based on a single seed, the effects of an initial population, and the dynamic diversity control mechanism in a genetic algorithm.

In document New evolutionary algorithms for mining interesting association rules (Page 56-59)