
2.2 Data Analytics

2.2.3 Algorithms used in MBA

2.2.3.1 Algorithms used in ARM

The growth of ARM may be seen as the catalyst for widespread research on ARM-based algorithms, which were used as part of computer-based software programs to solve various real-life data mining challenges [24]. The Apriori algorithm introduced in [7] is still recognised as the benchmark for ARM-based data mining, as it is both efficient and robust [140][160]. However, it has one major drawback: it is costly in both time and computer memory because of its breadth-first computational approach [68][76]. Consequently, there have been several attempts to enhance the efficiency of this algorithm, most notably the Eclat and FP-Growth algorithms proposed in [187] and [78] respectively. The fundamental concept of these algorithms is adherence to the downward closure property, also known as the monotonicity or Apriori principle [77], under which every subset of a frequent itemset is itself frequent. Unlike Apriori, the FP-Growth and Eclat algorithms perform a depth-first scan of the database to identify all frequent 1-itemsets, and then use this result and the downward closure property to generate larger frequent itemsets [77]. In the Eclat algorithm, this is achieved by recording the transaction identities (tids) and support for all frequent 1-itemsets, and then generating frequent 2-itemsets by intersecting the tid lists of the frequent 1-itemsets and comparing the support of each resulting set with minsup. Whilst the Eclat algorithm is faster than Apriori, as it scans the database only once, it is memory intensive because it initially requires a large part of the vertical database to fit into main memory [68]. This is particularly acute for large databases like those found in grocery retail. A further issue with the Eclat algorithm is that it compounds its memory demands by combining frequent 1-itemsets into larger candidate itemsets that may not actually occur in the database [68]. To address this memory-intensive initial search, the dEclat algorithm was proposed in [189] as an update to the Eclat algorithm.
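The tid-intersection step described above can be illustrated with a minimal sketch. It is not the implementation from [187]; the function names, the toy transactions and the use of an absolute count for minsup are all assumptions made for illustration.

```python
def vertical_layout(transactions):
    """Map each item to its tidset: the ids of the transactions containing it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(prefix, candidates, minsup, found):
    """Depth-first extension of `prefix` by intersecting tidsets."""
    for i, (item, tids) in enumerate(candidates):
        itemset = prefix + (item,)
        found[itemset] = len(tids)  # support = size of the tidset
        extensions = []
        for other, other_tids in candidates[i + 1:]:
            shared = tids & other_tids  # transactions containing both
            if len(shared) >= minsup:   # keep only frequent extensions
                extensions.append((other, shared))
        eclat(itemset, extensions, minsup, found)


def mine(transactions, minsup):
    """Return {itemset tuple: support} for all frequent itemsets."""
    tidsets = vertical_layout(transactions)
    frequent_1 = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= minsup),
        key=lambda pair: pair[0],
    )
    found = {}
    eclat((), frequent_1, minsup, found)
    return found


# Hypothetical toy basket data; minsup is an absolute transaction count.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
found = mine(transactions, minsup=2)
```

In this toy example each pair of items occurs in two transactions and is kept, whereas the three-item set occurs in only one transaction and is pruned. Note that every tidset is held in memory at once, which is the memory cost discussed above.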

The dEclat algorithm showed a significant improvement in memory usage when compared with the Eclat algorithm [189]. However, it was also shown to be more memory-intensive than Eclat for sparse transaction databases with low minimum supports, which is typically the case for grocery retail shopping transaction databases [24]. The dEclat algorithm calculates the support of a set of items by considering the transactions in which this set is not present. This is represented by a diffset, where the diffset for item X is given by d(X) = t − t_X, where t is the set of all transactions in the database and t_X is the subset of those transactions that contain X [189]. It follows that if the database is sparse and the minimum support is low, then |d(X)| > |t_X|, that is, the diffset is larger than the tid list it replaces, making dEclat more memory-intensive than the Eclat algorithm.
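The size comparison between d(X) and t_X can be checked on a small example. The sketch below follows the simplified diffset definition given in the text, taken relative to the whole database; the helper names and the sparse toy data are assumptions made for illustration.

```python
def tidset(transactions, item):
    """t_X: ids of the transactions that contain the item."""
    return {tid for tid, items in enumerate(transactions) if item in items}


def diffset(transactions, item):
    """d(X) = t - t_X: ids of the transactions that do not contain the item."""
    all_tids = set(range(len(transactions)))
    return all_tids - tidset(transactions, item)


# Hypothetical sparse database: each item appears in only a few transactions.
sparse = [
    {"saffron"}, {"milk"}, {"bread"}, {"milk"}, {"eggs"},
    {"saffron", "milk"}, {"tea"}, {"bread"}, {"eggs"}, {"tea"},
]

t_x = tidset(sparse, "saffron")
d_x = diffset(sparse, "saffron")

# The support is recoverable from the diffset alone.
support = len(sparse) - len(d_x)
```

Here the item occurs in 2 of 10 transactions, so the tid list holds 2 entries while the diffset holds 8. With a low minsup such rare items remain frequent, so the larger structure has to be stored.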

[...] than the Eclat algorithm, it can be significantly faster because of the novel way in which it exploits the downward closure property [68]. The FP-Growth algorithm commences similarly to Eclat by scanning the database for all frequent 1-itemsets, but goes a step further and ranks the results in descending order of frequency [77]. The database is then scanned, transaction by transaction, to build up a tree structure in which items that are frequent in multiple transactions (the frequent item antecedents) form the branches and lower-frequency consequents form the leaves. The support of an itemset, read from root to leaf, is then given by the support of the leaf, since by the downward closure property a branch always has at least the same support as its leaf. Database pruning is done concurrently, and a leaf will only propagate further if it is frequent [78].
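The tree-building phase described above can be sketched as follows. This covers only construction of the tree, not the subsequent mining of it, and it is not the implementation from [78]; the class and function names, the alphabetical tie-break and the toy data are assumptions made for illustration.

```python
from collections import Counter


class Node:
    """One tree node: an item, its count along this path, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, minsup):
    # First scan: count items and keep only the frequent ones.
    counts = Counter(item for items in transactions for item in set(items))
    frequent = {item: n for item, n in counts.items() if n >= minsup}

    # Second scan: insert each transaction, most frequent item first,
    # so that common prefixes share the same branch.
    root = Node(None)
    for items in transactions:
        ordered = sorted(
            (item for item in set(items) if item in frequent),
            key=lambda item: (-frequent[item], item),  # ties broken by name
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Hypothetical toy basket data; minsup is an absolute transaction count.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk"},
    {"bread", "tea"},
]
root, frequent = build_fp_tree(transactions, minsup=2)
milk = root.children["milk"]
```

In this example the infrequent item is dropped during the first scan, and every node's count is at least as large as the count of any node beneath it, which is the branch-and-leaf relationship noted above.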

The various attempts to enhance the compactness of the Apriori, Eclat and FP-Growth algorithms have yielded good results, particularly those algorithms that use the downward closure property to prune computations that generate subsets of already existing larger sets [68]. However, several of these modifications are based on the depth-first search principle and require the entire database to be read into main memory. The Genmax algorithm proposed in [71] enhances the speed and memory utilisation of frequent itemset mining by leveraging the Eclat algorithm to build an initial list of the largest frequent itemsets possible, commonly referred to as maximal frequent itemsets (MFIs), and thereafter prunes any subsequent search that results in itemsets that are subsets of the known MFIs or that are not frequent. The rationale behind Genmax is that once all MFIs are found, all frequent itemsets have also been found, because any subset of an MFI is also frequent. Similarly, the CHARM algorithm proposed in [190] uses the closed itemset property to rapidly prune subsets and co-occurring sets, reducing the mining exercise to only the closed frequent itemsets, which in practice are substantially fewer than the full list of frequent itemsets [68][190]. Since, by definition, a closed itemset has no proper superset with the same support, any subset that has the same support as the closed itemset is non-closed and can be pruned and represented by the closed itemset. Further, if a closed itemset has support equal to minsup, then it is also an MFI: each of its proper supersets has strictly lower support, so all supersets of this itemset can immediately be pruned as they are not frequent.
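The relationship between frequent, closed and maximal itemsets can be made concrete with a brute-force sketch. It deliberately does not implement Genmax or CHARM, which avoid this exhaustive enumeration; it only applies the definitions to a toy database. All names and data are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Exhaustively enumerate itemsets and keep those with support >= minsup."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            support = sum(1 for t in transactions if set(combo) <= t)
            if support >= minsup:
                result[frozenset(combo)] = support
    return result


def closed_itemsets(frequent):
    """Closed: no proper superset has the same support."""
    return {
        x: s for x, s in frequent.items()
        if not any(x < y and s == frequent[y] for y in frequent)
    }


def maximal_itemsets(frequent):
    """Maximal: no proper superset is frequent."""
    return {
        x: s for x, s in frequent.items()
        if not any(x < y for y in frequent)
    }


# Hypothetical toy data; minsup is an absolute transaction count.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}]
frequent = frequent_itemsets(transactions, minsup=2)
closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
```

For this data, seven itemsets are frequent, three of them are closed and one is maximal. The single maximal itemset is closed with support equal to minsup, matching the observation above, and the supports of all seven frequent itemsets can be recovered from the three closed ones.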

Another popular compression technique is non-derivable itemset (NDI) mining, proposed in [31]. However, despite several enhancements, this technique was, in general, found to be ineffective on sparse datasets [117]. NDI mining is based on finding all itemsets whose support cannot be determined from the supports of their subsets, defined as non-derivable itemsets, and using these to determine all frequent itemsets. By definition, all frequent 1-itemsets are frequent NDIs, hence the initial steps are similar to those of Eclat, FP-Growth and related algorithms. Attempts were made to combine closed itemset mining with NDI mining to generate closed NDIs [117]. Whilst this technique was found to reduce the number of NDIs required to generate all frequent itemsets, it was more complex and performed no better than other techniques on sparse databases [117].
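To give a feel for what "derivable" means, the sketch below covers only the simplest case, a 2-itemset {A, B}. It uses the standard inclusion-exclusion bounds on the support of a pair given the supports of its two items and the database size; when the lower and upper bounds coincide, the support of the pair is fully determined by its subsets. The function names and numbers are assumptions made for illustration, and the full technique in [31] applies analogous rules to larger itemsets.

```python
def pair_support_bounds(supp_a, supp_b, n_transactions):
    """Bounds on supp({A, B}) implied by supp(A), supp(B) and the database size."""
    lower = max(0, supp_a + supp_b - n_transactions)
    upper = min(supp_a, supp_b)
    return lower, upper


def is_derivable(supp_a, supp_b, n_transactions):
    """A pair is derivable when its bounds pin the support to a single value."""
    lower, upper = pair_support_bounds(supp_a, supp_b, n_transactions)
    return lower == upper


# Dense case: A occurs in every one of 10 transactions, B in 6 of them.
dense_bounds = pair_support_bounds(10, 6, 10)

# Sparse case: A occurs in 4 and B in 3 of 10 transactions.
sparse_bounds = pair_support_bounds(4, 3, 10)
```

In the dense case the bounds coincide, so the pair need not be counted. In the sparse case the bounds remain far apart and the pair must still be counted, which is consistent with the limited benefit on sparse datasets reported above.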

The issues around sparse databases and high main memory requirements were to some extent addressed by the RElim (Recursive Elimination) and SaM (Split and Merge) algorithms proposed in [23]. However, comparison tests with the Eclat and FP-Growth algorithms showed that, whilst both SaM and RElim have simpler structures, SaM can be slower on sparse databases because of the additional calculations in the “merge” step, whilst RElim was slower on dense databases [23][24]. SaM and RElim are based on sorting itemsets and transactions in ascending order of frequency using the prefix item. The frequency of the itemset is checked and, if it is frequent, it is stored in a separate list. The prefix item is then removed and the process recurs until all items have been removed. However, SaM and RElim default to a divide-and-conquer method, similar to Eclat, and can be memory intensive [24].