Chapter 3 Research Methodologies
3.4 New Evolutionary Algorithms for Mining Frequent Patterns
3.4.1 GeneticMax: A New Evolutionary Algorithm for Improving Level by Level Searching Method Named Apriori
A genetic algorithm (GA) is an adaptive heuristic search method that is applied to optimization problems. It is used as a general search approach with robustness and high scalability (Du et al. 2009; Yan et al. 2009). Because of this scalability, a new GA-based approach named GeneticMax is designed to decrease the time complexity of mining frequent patterns from a large data set.
Generating maximal frequent item sets (MFI) from a large data set is one of the most time-consuming tasks today. In this research, an evolutionary approach is presented for finding maximal frequent item sets from large data sets by using the principles of the genetic algorithm. The search strategy of the new approach uses a lexicographic tree that avoids level-by-level searching, which ultimately reduces the time required to mine maximal frequent item sets in a linear way.
The algorithm also represents each node of the lexicographic tree as a bitmap, and it identifies frequent item sets from the superset-subset relationship between nodes.
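The bitmap idea can be illustrated with a minimal sketch. It assumes one common layout, in which each item owns an integer whose bit t is set when transaction t contains that item; the source does not specify the exact layout used by GeneticMax, and the names `build_item_bitmaps` and `support` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def build_item_bitmaps(transactions: List[Set[str]]) -> Dict[str, int]:
    """Map each item to an int whose bit t is 1 when transaction t contains it.

    Assumed layout for illustration only; not taken from the source.
    """
    bitmaps: Dict[str, int] = {}
    for t, transaction in enumerate(transactions):
        for item in transaction:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << t)
    return bitmaps


def support(itemset: Iterable[str], bitmaps: Dict[str, int], n_transactions: int) -> int:
    """Support of an item set: AND the item bitmaps, then count the set bits."""
    mask = (1 << n_transactions) - 1  # start with every transaction selected
    for item in itemset:
        mask &= bitmaps.get(item, 0)  # an unknown item clears every bit
    return bin(mask).count("1")


# Toy data set with four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
bitmaps = build_item_bitmaps(transactions)
print(support({"a", "b"}, bitmaps, len(transactions)))  # 2
```

With this layout, the support of a node is obtained with bitwise operations on the bitmaps of its items rather than by rescanning the data set.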
The significant difference between Apriori and GeneticMax is that GeneticMax generates chromosomes randomly and, if a generated chromosome lies in the positive boundary area, prunes all the subsets of that chromosome. Through this technique, it avoids the cost of calculating the support value of every subset of the generated chromosomes. The Apriori algorithm, in contrast, calculates the support value of all the chromosomes in each level and prunes those chromosomes which do not satisfy a user-defined support value at that level. On this basis, the new approach is expected to be more efficient than the Apriori algorithm.
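For comparison, the level-by-level behaviour of Apriori described above can be sketched as follows. This is a minimal textbook-style version, not the implementation used in the experiments; the function name `apriori` and the `evaluated` counter, which records how many nodes had their support calculated, are introduced here for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise search; returns (frequent item sets with support, nodes counted)."""

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # level 1: all 1-item sets
    frequent, evaluated = {}, 0
    while level:
        survivors = []
        for candidate in level:
            evaluated += 1  # every node of the level is counted
            s = support(candidate)
            if s >= min_support:
                frequent[candidate] = s
                survivors.append(candidate)
        # Join step: merge two survivors into a set one item larger, and keep
        # it only if all of its subsets of the current size survived.
        survivor_set = set(survivors)
        next_level = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in survivor_set
                for sub in combinations(union, len(a))
            ):
                next_level.add(union)
        level = sorted(next_level, key=sorted)
    return frequent, evaluated


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
frequent, evaluated = apriori(transactions, min_support=2)
print(len(frequent), evaluated)  # 6 7
```

On this toy data set, Apriori counts the support of seven nodes to find six frequent item sets, of which only the three 2-item sets are maximal. Support calculations of this kind on subsets are what pruning at the positive boundary aims to avoid.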
The length of a frequent item set depends on the relationships among the items. The major advantage of this approach is that it performs a global search and its time complexity is less than that of other algorithms. Another advantage is that it generates frequent item sets independently of the size of the data sets.
This work differs from existing research (Kabir et al. 2014) in the following aspects:
1) Unlike Apriori, this approach uses a lexicographic tree (Agarwal et al. 2001) as a search space and does not need to enumerate frequent item sets level by level.
2) For comparative analysis, Apriori and this approach are applied to different real data sets as well as synthetic data sets, and the results are compared with those of the Apriori algorithm.
3) Unlike a Boolean-based approach (Salleb et al. 2002) and the FP-growth algorithm (Han et al. 2000), this approach does not need memory for loading a lexicographic tree, which avoids a large consumption of memory space. This technique dramatically reduces the time spent accessing a large data set to calculate the support values of unnecessary individuals while finding frequent item sets.
Although it was invented a long time ago, Apriori is still one of the best-known algorithms, and it performs better than other existing algorithms such as Eclat, Partition, and DIC, especially when the support value is set high (Hipp et al. 2000). A performance analysis of Apriori and other well-known algorithms is given by Hipp, Guntzer and Nakhaeizadeh (Hipp et al. 2000). For this reason, the Apriori algorithm is chosen for comparison with the newly designed approach.
Existing mining approaches spend CPU time (run time) calculating the support values of the nodes they examine. The efficiency of an algorithm therefore depends on how many frequent or infrequent item sets it considers to reach the final solution, i.e. the maximal frequent item sets. In this research, thorough experiments show how many nodes, i.e. item sets, are considered by the GA-based approach, and the results are compared with those of the Apriori algorithm for different support values and data sets.
3.4.2 Hybrid GeneticMax: Improving GeneticMax Algorithm by Introducing a New Algorithm Named Hybrid GeneticMax
The earlier method, GeneticMax, is improved and extended by another approach named Hybrid GeneticMax. The new approach embeds three main features:
1) it sorts out infrequent items from the 1-item sets,
2) it uses the superset-subset relationship at both the positive and negative boundaries of a lexicographic tree to prune invalid chromosomes, and
3) it uses a genetic algorithm, which provides a global search mechanism.
The purpose of sorting out infrequent items from the 1-item sets is that, if an item is infrequent, then all of its supersets are also infrequent. Through this technique, the approach dramatically reduces the search space in which it looks for the solution. The aim of the new approach is to converge to a solution as fast as possible, especially if the 1-item sets contain a reasonable number of infrequent items and the solution resides at a deep level of the lexicographic tree instead of near the root. Full experiments with the new approach are conducted on different data sets, and they demonstrate its ability to yield solutions rapidly by accessing the data sets for only a small number of nodes in the lexicographic tree.
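The first feature, sorting out infrequent items from the 1-item sets, can be sketched as a single counting pass followed by a projection of the data set onto the remaining items. This is a minimal illustration under that assumption; the names `frequent_items` and `project` are hypothetical and do not come from the source.

```python
from collections import Counter


def frequent_items(transactions, min_support):
    """Return the items whose own support reaches min_support (one pass)."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, count in counts.items() if count >= min_support}


def project(transactions, keep):
    """Drop every item outside `keep` from each transaction."""
    return [set(t) & keep for t in transactions]


transactions = [{"a", "b", "d"}, {"a", "b"}, {"a", "c"}, {"b", "e"}]
keep = frequent_items(transactions, min_support=2)
reduced = project(transactions, keep)
print(sorted(keep))  # ['a', 'b']
```

Here three of the five items are infrequent, so the lexicographic tree that remains to be searched shrinks from 2^5 to 2^2 item sets, the empty set included.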
To summarize the previous discussion, the Apriori algorithm tests all the nodes in each level of a lexicographic tree and prunes those nodes of a level which do not satisfy a user-defined support value. In GeneticMax, if an individual X generated at any level satisfies the user-defined support value, then all the subsets of X at any level are automatically pruned. The mechanism also works the other way around: if an individual Y generated at any level is infrequent, i.e. does not satisfy the user-defined support value, then all the supersets of Y at any level of the lexicographic tree are automatically pruned. Hybrid GeneticMax embeds all the features of the GeneticMax algorithm and includes a local search mechanism for finding infrequent items among the 1-item sets of a large data set.
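The two pruning directions can be sketched as follows. This is a deliberately simplified illustration: chromosomes are drawn uniformly at random, and the selection, crossover, and mutation operators of a genetic algorithm are omitted, so it is not the GeneticMax procedure itself. The names `classify` and `random_search` are hypothetical.

```python
import random


def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def classify(candidate, positive, negative):
    """Decide a candidate from the boundaries alone, without touching the data.

    Returns True (subset of a known frequent set), False (superset of a
    known infrequent set), or None when the support must be counted.
    """
    if any(candidate <= p for p in positive):
        return True
    if any(n <= candidate for n in negative):
        return False
    return None


def random_search(transactions, min_support, trials, seed=0):
    """Random chromosomes with two-sided pruning (sketch, no GA operators)."""
    rng = random.Random(seed)
    items = sorted({i for t in transactions for i in t})
    positive, negative, counted = [], [], 0
    for _ in range(trials):
        candidate = frozenset(i for i in items if rng.random() < 0.5)
        if not candidate or classify(candidate, positive, negative) is not None:
            continue  # empty or pruned: no access to the data set
        counted += 1
        if support(candidate, transactions) >= min_support:
            # keep only the largest frequent sets found so far
            positive = [p for p in positive if not p <= candidate] + [candidate]
        else:
            # keep only the smallest infrequent sets found so far
            negative = [n for n in negative if not candidate <= n] + [candidate]
    return positive, negative, counted
```

The sets kept in `positive` are only the largest frequent sets found so far, so they are not guaranteed to be maximal frequent item sets until the search has covered enough of the tree. The `counted` value corresponds to the number of nodes for which the data set is actually accessed.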