This section introduces DAC, a Distributed Associative Classifier, whose training phase is designed to be distributed in an in-memory cluster computing framework like Apache Spark.
In Section 2.3, we saw that bagging is an effective way to split the training of an associative classifier without losing predictive quality. This strategy alone has a limit, though: the amount of memory needed grows with the number of rules extracted, so it does not scale. We build upon the findings of Section 2.2, where the most scalable frequent itemset miners proved to be those that split the search space and use memory effectively, and we design an approach that limits the number of rules stored in memory and uses memory efficiently through ad-hoc data structures.
The section is organized as follows. Section 2.4.1 explains how DAC works, and Section 2.4.2 describes the experimental evaluation of DAC.
2.4.1 The proposed approach
Traditionally, the training phase of an associative classifier is a memory-intensive process, often executed out-of-core. In the vast majority of techniques, there is at least one point in time at which a very large set of itemsets or rules has been extracted and not yet pruned. This model cannot leverage the advantages of our reference architecture, an in-memory cluster computing framework like Apache Spark. In building a scalable associative classifier, we have been guided by the following two design principles: i) anticipating pruning before the actual extraction of the rules, and ii) moving from a large model that predicts with only the first matching rule toward a lightweight model that compensates for its smaller size by applying all the rules that match. These two principles aim at reducing the number of rules simultaneously present in main memory at any given time, allowing for an effective exploitation of the in-memory computing platform.
The baseline framework on which we build for the training of our Distributed Associative Classifier, namely DAC, is as follows.
1. The dataset is split into N partitions, each one sampled from the original dataset with a ratio α;
2. Within each partition, a rule extraction phase produces a model as a set of CARs. The CARs found are filtered by minimum support, minimum confidence and minimum χ², and optionally further pruned with a database coverage phase;
3. The generated N models are collected in an ensemble.
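The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Spark implementation: the function names, the sampling with replacement, and the `majority_label` stand-in for the rule extraction phase are all assumptions made for the example. The toy records are those of Table 2.6.

```python
import random
from collections import Counter

# Toy labeled transactions (items, class), taken from Table 2.6.
TOY = [("ABDE", "+"), ("BCE", "-"), ("ABDE", "+"),
       ("ABCE", "-"), ("ABCDE", "+"), ("BCD", "-")]

def sample_partitions(dataset, n_partitions, alpha, seed=0):
    """Step 1: draw n_partitions samples of round(alpha * |dataset|) records.

    Sampling with replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    size = max(1, round(alpha * len(dataset)))
    return [rng.choices(dataset, k=size) for _ in range(n_partitions)]

def majority_label(partition):
    """Hypothetical stand-in for CAR extraction: a trivial one-label model."""
    return Counter(label for _, label in partition).most_common(1)[0][0]

def train_ensemble(dataset, n_partitions, alpha, extract_model):
    """Steps 2-3: extract one model per partition and collect the ensemble."""
    partitions = sample_partitions(dataset, n_partitions, alpha)
    return [extract_model(p) for p in partitions]

ensemble = train_ensemble(TOY, n_partitions=4, alpha=0.5,
                          extract_model=majority_label)
```

Each call to `extract_model` only sees its own partition, which is what would allow the per-partition work to run on different workers of a cluster.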
Following our first design principle, we aimed at devising an extraction phase that reduces the work of the subsequent pruning to a minimum or, in the best case, to nothing. We have therefore adopted a greedy approach based on the Gini impurity of an item, keeping in mind the second design principle: we ultimately want a smaller model where several rules can collaborate for the prediction, instead of a single first match. This calls for shorter rules, which can more easily match new records and avoid over-fitting. In order to follow such a route without sacrificing predictive quality, we designed several solutions that will be presented in the next sections, namely: i) an FP-growth-like CAR extractor that produces only useful classification rules, in a greedy fashion, by exploiting the Gini impurity; ii) an added model consolidation phase for the generation of the ensemble that further reduces the size of the final model; iii) new voting strategies for the ensemble that exploit the aforementioned novelties.
CAP-growth
The FP-tree is an effective solution for frequent itemsets extraction, and is often adapted to the extraction of CARs [11]. Moreover, it adapts well to in-memory computing, as its construction needs only two scans of the dataset and, once built, the FP-tree stores in the main memory all the necessary information for frequent itemsets or CARs extraction.
However, there is a twofold motivation for designing an alternative to the FP-tree, as done in [32, 33], as the storage structure for the patterns that will build the final CARs. First, the FP-tree is designed to build all frequent patterns, which are a superset of what we look for when we build CARs. Second, being frequent does not always coincide with being useful, and using the standard FP-growth algorithm would generate an overwhelming number of rules, which would prevent lowering the support threshold to values where more useful information may dwell. Guided by these considerations, and keeping in mind the design principles outlined at the beginning of the section, we designed an FP-growth-like algorithm called CAP-growth, for Class Association Patterns growth.
CAP-growth stores the information that is useful for extracting CARs in a CAP-tree. Like an FP-tree, this structure compactly stores all the information needed to extract association rules while reading the dataset only twice. Unlike the FP-tree, a CAP-tree stores in each node extra information useful to extract only CARs, as is usually done in single-machine approaches [32, 33]. Moreover, the first phase of the CAP-tree’s construction sorts the frequent items by an Information Gain measure based on their Gini impurity, which will help the extraction of more useful rules in the CAP-growth phase.
The algorithm that builds a CAP-tree is detailed in Algorithm 2.1.

Algorithm 2.1: CAP-tree building
Input: a transaction DB labeled with classes - D
Input: a minimum support threshold - minsup
Output: a CAP-tree
 1  Scan D once. Collect L, the list of frequent items (support >= minsup).
    Sort L by decreasing IG and filter out items with IG ≤ 0.
 2  Create the root of a CAP-tree T and label it as null.
 3  for each labeled transaction t do
 4      select only the items in t that appear in L and sort them according to the order in L, obtaining t′
 5      call insert(t′, T)
 6  end
 7  Function insert(transaction t, node T)
 8      h = first item of t
 9      if T has a child T′ s.t. T′.id = h.id then
10          T′.freqs[t.class] += 1
11      else
12          create a new node T′
13          init T′.id = h.id and T′.freqs to an array of zeros
14          T′.freqs[t.class] += 1
15          T′.parent = T
16          update the header table
17      end
18      t′ = t \ h
19      if t′ is not empty then
20          insert(t′, T′)
21      end
tid   Transaction     Class
1     {A,B,D,E}       +
2     {B,C,E}         -
3     {A,B,D,E}       +
4     {A,B,C,E}       -
5     {A,B,C,D,E}     +
6     {B,C,D}         -
Table 2.6 An example transactional dataset, binary-labeled.

Given a minimum support threshold, which is used to recognize frequent itemsets, the algorithm scans the dataset twice. In the first pass (line 1), it builds a list L of frequent items, sorted by decreasing Information Gain (IG) and restricted to items with strictly positive IG. Since we are considering the item alone, we assume that the (1 − w_i) fraction of the dataset not containing the item keeps the class distribution of the whole dataset (same Gini). Hence, the Information Gain is computed as follows:
$$IG_i = Gini_D - \left[ w_i\,Gini_i + (1 - w_i)\,Gini_D \right] \qquad (2.1)$$
in which $Gini_D$ is the impurity of the global dataset, $Gini_i$ is the impurity of item $i$, and $w_i$ is the fraction of dataset $D$ containing item $i$.
Equation 2.1 simplifies as
$$IG_i = w_i \left( Gini_D - Gini_i \right) \qquad (2.2)$$
In this first pass, we can also obtain the frequency of the classes in the entire dataset, which is used in CAP-growth’s extraction phase. In the second pass (lines 2-6), we insert each transaction read into the CAP-tree (line 5), maintaining a header table that keeps track of the pointers to the nodes in the tree that store the frequent items, as in the original FP-tree (line 16). Before being inserted, the transaction is stripped of its infrequent items and reordered according to the order of L (decreasing IG) (line 4). The insertion updates the structure of the CAP-tree to keep track of the label of the transaction in an array of frequencies (lines 9-17). This allows the direct extraction of CARs and the computation of the IG and the confidence of the rules in CAP-growth.
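As an illustration of the two passes just described, the following Python sketch builds a CAP-tree for the toy dataset of Table 2.6. It is a simplified reading of Algorithm 2.1, not the authors' code: the `Node` class, the iterative (rather than recursive) insertion, the use of exact fractions, and the alphabetical tie-breaking among items with equal IG are assumptions of the example.

```python
from fractions import Fraction

CLASSES = ["+", "-"]
TOY = [("ABDE", "+"), ("BCE", "-"), ("ABDE", "+"),
       ("ABCE", "-"), ("ABCDE", "+"), ("BCD", "-")]

def gini(freqs):
    """Gini impurity of an array of class frequencies."""
    n = sum(freqs)
    return 1 - sum(Fraction(c, n) ** 2 for c in freqs)

class Node:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.freqs = [0] * len(CLASSES)   # one counter per class
        self.children = {}

def build_cap_tree(db, minsup):
    n = len(db)
    # First pass: class frequencies and per-item class frequencies.
    class_freqs = [sum(1 for _, c in db if c == k) for k in CLASSES]
    item_freqs = {}
    for items, c in db:
        for i in items:
            item_freqs.setdefault(i, [0] * len(CLASSES))[CLASSES.index(c)] += 1
    gini_d = gini(class_freqs)
    # IG of each frequent item (Equation 2.2); keep strictly positive IG only.
    ig = {i: Fraction(sum(f), n) * (gini_d - gini(f))
          for i, f in item_freqs.items() if Fraction(sum(f), n) >= minsup}
    order = sorted((i for i in ig if ig[i] > 0), key=lambda i: (-ig[i], i))
    # Second pass: insert each cleaned, reordered transaction.
    root, header = Node(), {i: [] for i in order}
    root.freqs = class_freqs
    for items, c in db:
        node = root
        for i in (i for i in order if i in items):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])   # header table update
            node = node.children[i]
            node.freqs[CLASSES.index(c)] += 1
    return root, header, order

root, header, order = build_cap_tree(TOY, minsup=Fraction(3, 10))
```

With this tie-breaking rule, the item order is A, C, D, E, item B is discarded, and the header table holds three nodes for item D.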
Figure 2.11 shows the CAP-tree built on the toy dataset in Table 2.6, with the minimum support threshold set to 0.3, that is, 2 records. Each node of the tree is labeled with the array of the class frequencies, positive and negative respectively. In this tree, item B has been pruned, since its IG is 0, and the remaining items are sorted and inserted by their IG, with item A being the first and the most useful for classification. The IG of all items of the toy dataset is shown for reference in Table 2.7, together with their weight w and Gini.

∅ [3,3]
    C [0,2]
        E [0,1]
        D [0,1]
    A [3,1]
        D [2,0]
            E [2,0]
        C [1,1]
            E [0,1]
            D [1,0]
                E [1,0]

Header table
Item   Freqs +   Freqs -
A      3         1
C      1         3
D      3         1
E      3         2
Fig. 2.11 A CAP-tree example built over the toy dataset with minimum support equal to 0.3

Item   w_i   Gini_i   IG_i
A      4/6   0.375    8.3%
B      1     0.5      0
C      4/6   0.375    8.3%
D      4/6   0.375    8.3%
E      5/6   0.48     1.7%
Table 2.7 IG, weight and Gini for the items in the toy dataset
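The values of Table 2.7 can be checked with a short Python sketch. The helper names are hypothetical, and exact fractions are used only to make the comparison between Equation 2.1 and Equation 2.2 exact; the data are those of Table 2.6.

```python
from fractions import Fraction

TOY = [("ABDE", "+"), ("BCE", "-"), ("ABDE", "+"),
       ("ABCE", "-"), ("ABCDE", "+"), ("BCD", "-")]

def gini(pos, neg):
    """Gini impurity of a binary class distribution."""
    n = pos + neg
    return 1 - Fraction(pos, n) ** 2 - Fraction(neg, n) ** 2

def item_stats(db, item):
    """Return (w_i, Gini_i, IG_i) for one item."""
    pos = sum(1 for items, c in db if item in items and c == "+")
    neg = sum(1 for items, c in db if item in items and c == "-")
    w = Fraction(pos + neg, len(db))
    gini_d = gini(sum(1 for _, c in db if c == "+"),
                  sum(1 for _, c in db if c == "-"))
    gini_i = gini(pos, neg)
    ig_full = gini_d - (w * gini_i + (1 - w) * gini_d)   # Equation 2.1
    ig_short = w * (gini_d - gini_i)                     # Equation 2.2
    assert ig_full == ig_short   # the two forms agree
    return w, gini_i, ig_short

for item in "ABCDE":
    w, g, ig = item_stats(TOY, item)
    print(item, w, float(g), f"{float(ig):.1%}")
```

For item A, for instance, this gives a weight of 2/3 (that is, 4/6), a Gini of 3/8 = 0.375 and an IG of 1/12, about 8.3%.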
CAP-growth extracts a set of CARs from the CAP-tree by descending the tree greedily. Since the frequent items are sorted by decreasing IG, the rules made of high-IG items are evaluated first. The rationale that guided the design of the algorithm is to avoid redundant rules, where possible, while keeping the length of the rules minimal.
The following example illustrates some ways in which redundancy affects CARs. In other approaches, this redundancy is often reduced after the extraction of CARs, as shown in Section 2.5. We provide this example so that the reader may later gain an intuition of where CAP-growth helps reduce redundancy before the extraction itself. In Figure 2.12 we see all the CARs in the set of association rules extracted with the standard FP-Growth, with minimum support set to 0.3 (2 rows or more) and minimum confidence 0.51, on the toy dataset in Table 2.6. Having 18 CARs for a dataset of 6 records is clearly redundant. A first, evident source of this redundancy is item B, which is present in all the records in the dataset. As a result, for any rule generated there is an identical rule with B appended, which does not contribute to the classification and lengthens the model. A similar situation happens with item E. Likewise, item C appears in many rules, all of which agree in classifying a record as negative: item C itself would be sufficient as the antecedent of the rule. The same holds for other rules as well.

EDA ⇒ +     BEDA ⇒ +
ED ⇒ +      BED ⇒ +
EC ⇒ -      BEC ⇒ -
EA ⇒ +      BEA ⇒ +
DA ⇒ +      BDA ⇒ +
A ⇒ +       BA ⇒ +
E ⇒ +       BE ⇒ +
C ⇒ -       BC ⇒ -
D ⇒ +       BD ⇒ +
Fig. 2.12 An example model with CARs for the dataset in Table 2.6
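The size of this redundant model can be reproduced with a brute-force enumeration. Note that the sketch below does not implement FP-Growth: it simply tries every itemset as antecedent, which is only viable on a toy dataset. It assumes that the support of a rule counts the records containing both the antecedent and the class, and the function name is hypothetical.

```python
from itertools import combinations

TOY = [("ABDE", "+"), ("BCE", "-"), ("ABDE", "+"),
       ("ABCE", "-"), ("ABCDE", "+"), ("BCD", "-")]

def brute_force_cars(db, min_count, minconf):
    """Enumerate every CAR satisfying the support and confidence thresholds."""
    items = sorted(set().union(*(set(t) for t, _ in db)))
    rules = []
    for k in range(1, len(items) + 1):
        for antecedent in combinations(items, k):
            # Labels of the records that contain the whole antecedent.
            covered = [c for t, c in db if set(antecedent) <= set(t)]
            for label in ("+", "-"):
                hits = covered.count(label)
                # hits >= min_count >= 1 guarantees that covered is not empty.
                if hits >= min_count and hits / len(covered) >= minconf:
                    rules.append(("".join(antecedent), label))
    return rules

cars = brute_force_cars(TOY, min_count=2, minconf=0.51)
with_b = [r for r in cars if "B" in r[0]]
print(len(cars), len(with_b))
```

On the toy dataset the enumeration finds 18 rules, half of which contain item B and merely duplicate a shorter rule.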
As previously stated, CAP-growth aims to avoid the redundancy of the example above. Algorithm 2.2 shows the pseudocode for CAP-growth. Similarly to Equation 2.2, we define the Information Gain for a node as
$$IG_T = w_T \left( Gini_{T.parent} - Gini_T \right) \qquad (2.3)$$
where $w_T$ is the ratio of transactions represented in node $T$ with regard to its parent node, and $Gini$ is computed on the frequencies of the labels stored in the node.
The algorithm is a recursive call to the function extract (line 6), which visits the CAP-tree in a depth-first fashion. The stopping criteria of this visit are:
1. a non-positive Information Gain for the current node. In this case, we do not generate any rule (line 9).
2. a Gini impurity equal to 0 for the current node. Since the Gini impurity is strictly decreasing along every path still being explored (nodes with a non-positive IG are discarded), the current node is the first pure node in the path from the root, i.e. we see only one label for it. We try to generate a rule (line 12).

Algorithm 2.2: CAP-growth
Input: a CAP-tree
Input: a minimum support threshold - minsup
Input: a minimum confidence threshold - minconf
Input: a minimum χ² threshold - minchi2
Output: a list of CARs
 1  rules = ∅
 2  for each child T of CAP-tree.root do
 3      rules += extract(T)
 4  end
 5  return rules
 6  Function extract(node T)
 7      rules = ∅
 8      if IG(T) <= 0 then    // non-positive Information Gain: do not generate any rule
 9          return ∅
10      end
11      if Gini(T) == 0 then    // pure node: try to generate a rule
12          return generateRule(T)
13      end
14      for each child T′ of T do
15          rules += extract(T′)
16      end
17      if rules is ∅ then    // none of the children has produced a rule: try to generate a rule
18          return generateRule(T)
19      end
20      return rules
21  Function generateRule(node T)
22      consequent = class with highest value in T.freqs[]
23      antecedent = set of items in the path from T to CAP-tree.root
24      tree = CAP-tree conditioned by the items in antecedent
25      freqs = tree.root.freqs
26      sup = freqs[consequent] / totCount
27      supAntecedent = freqs.sum / totCount
28      from sup, supAntecedent and the global frequencies of the classes computed in the first pass of Algorithm 2.1, compute support, confidence and χ² for the generated rule: antecedent ⇒ consequent
29      if sup < minsup or conf < minconf or χ² < minchi2 then
30          return ∅
31      end
32      return the rule antecedent ⇒ consequent

[Figure 2.13, four panels; each node shows its item, class frequencies and Gini impurity, and each panel reports the IG of the node being visited]
(a) ∅ [3,3] 0.5 → A [3,1] 0.375; IG: 8.3%
(b) ∅ [3,3] 0.5 → A [3,1] 0.375 → C [1,1] 0.5; IG: -6.25%
(c) ∅ [3,3] 0.5 → A [3,1] 0.375 → D [2,0] 0.0 (sibling C [1,1] 0.5); IG: 18.75%
(d) ∅ [3,3] 0.5 → C [0,2] 0.0 (sibling subtree A [3,1] 0.375 with D [2,0] 0.0 and C [1,1] 0.5); IG: 16.6%
Fig. 2.13 Example visit of the CAP-tree in Figure 2.11

[Figure 2.15, two panels]
(a) project(D): ∅ [3,1] with children C [0,1] and A [3,0]; A has child C [1,0]
(b) project(DA): ∅ [3,0]
Fig. 2.15 Example projection of the CAP-tree in Figure 2.11 to reconstruct the support of itemset {A,D}
Whenever none of the children of a node generates a rule, the node itself tries to generate a new rule (lines 14-19). This can occur, for example, when the children nodes do not see enough samples to satisfy the minimum support threshold, or when the current node is a leaf.
The function that generates a new rule (lines 21-32) first needs to collect the frequencies of the labels from all the nodes where the current pattern appears. As in the original FP-growth, this is done by projecting the CAP-tree recursively on all the items of the pattern, that is, all the nodes in the path to the root (line 24). At the end of the projection, the root node contains the array of class frequencies for the pattern (line 25). With it, we can compute the support, the confidence and the χ² of the rule we are trying to generate (lines 26-28). If any of the measures does not satisfy the minimum constraints, the rule is not generated (line 29).
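The computation of the three measures from the projected frequency array can be sketched as follows. The text does not spell out the χ² formula, so the sketch assumes the usual Pearson statistic on the 2×2 contingency table "antecedent present or absent" versus "class equal to the consequent or not"; the function name and argument layout are hypothetical.

```python
def rule_measures(freqs, consequent, class_totals):
    """Support, confidence and chi-square of the rule: antecedent => consequent.

    freqs        -- class frequencies of the records matching the antecedent
    consequent   -- index of the predicted class in freqs
    class_totals -- class frequencies of the whole dataset (first pass)
    """
    n = sum(class_totals)
    a = freqs[consequent]                  # antecedent and consequent
    b = sum(freqs) - a                     # antecedent, other classes
    c = class_totals[consequent] - a       # no antecedent, consequent
    d = n - a - b - c                      # no antecedent, other classes
    sup = a / n
    conf = a / (a + b)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # Assumed formula: Pearson chi-square for a 2x2 table (0 if degenerate).
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return sup, conf, chi2

# Frequencies [positive, negative] on the toy dataset of Table 2.6.
print(rule_measures([3, 0], 0, [3, 3]))   # pattern {A,D}, class +
print(rule_measures([1, 3], 1, [3, 3]))   # pattern {C}, class -
```

Under this assumption, a rule is kept only when all three returned values reach their respective minimum thresholds.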
In Figure 2.13 we see an example of the CAP-growth algorithm, run on the CAP-tree of Figure 2.11, with minimum support, confidence and χ² set to 0.3, 0.51 and 0, respectively. In the figure, each node is labeled with the array of class frequencies and the resulting Gini impurity. The root of the tree has a Gini impurity of 0.5. Its first child to be explored stores item A, with a Gini of 0.375 (Figure 2.13a). Since it has a positive IG and a non-zero Gini, we continue the descent to its children. The first to be explored describes the pattern A,C (Figure 2.13b). This node has a Gini index of 0.5, and thus a negative IG. This means that adding item C to the pattern only worsens the ability of A to predict a label. We therefore stop exploring this pattern and its descendants. The sibling node (Figure 2.13c), storing item D, is pure for the positive class: continuing the descent would only lengthen the rule without any improvement. We reconstruct the actual frequencies of itemset {A,D} to check whether the rule A,D ⇒ + is worth generating, and compute its support, confidence, and χ². First, we project the CAP-tree on item D. The header table stores the pointers to the three nodes that store this item. Only the parts of the tree that end in these three nodes are kept, and the frequency arrays of all the surviving nodes and of the header table are updated to reflect this change (Figure 2.15a). We now have a CAP-tree storing only the transactions that contain item D.
We project again, this time on item A. The header table points to a single node that stores this item, and its frequency array, updated in the previous step, is [3,0]. By projecting, the CAP-tree reduces to the root node alone, whose frequencies are also updated to [3,0] (Figure 2.15b). This is the frequency array for itemset {A,D}. Thus, the rule A,D ⇒ + has confidence 1 and support 0.5, and satisfies the minimum thresholds. Rule A ⇒ + is not generated, as one of the descendants of node A has already produced a rule. Finally, we move to the second child of the root, storing item C (Figure 2.13d). This is again a pure node. We collect the frequencies of item C by projection, as seen before, and get the array [1,3], which produces the rule C ⇒ - with support 0.5 and confidence 0.75. The final model is made of only two rules.
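The walk-through can be replayed with the following Python sketch of the depth-first visit. Several simplifications are assumptions of the example: the item order A, C, D, E is hard-coded from Table 2.7 instead of being computed, the χ² check is omitted because its threshold is 0 here, and the frequencies of a pattern are collected by summing the header-table nodes whose path to the root contains the whole antecedent, which gives the same counts as the recursive projection.

```python
from fractions import Fraction

TOY = [("ABDE", "+"), ("BCE", "-"), ("ABDE", "+"),
       ("ABCE", "-"), ("ABCDE", "+"), ("BCD", "-")]
ORDER = "ACDE"   # decreasing IG, item B dropped (assumed from Table 2.7)

class Node:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.freqs = [0, 0]          # [positive, negative] counts
        self.children = {}

def gini(freqs):
    n = sum(freqs)
    return 1 - sum(Fraction(c, n) ** 2 for c in freqs)

def build(db):
    root, header = Node(), {i: [] for i in ORDER}
    for items, label in db:
        k = "+-".index(label)
        root.freqs[k] += 1
        node = root
        for i in (i for i in ORDER if i in items):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.freqs[k] += 1
    return root, header

def path_items(node):
    items = []
    while node.item is not None:
        items.append(node.item)
        node = node.parent
    return items

def cap_growth(root, header, minsup, minconf):
    total = sum(root.freqs)

    def info_gain(t):                      # Equation 2.3
        w = Fraction(sum(t.freqs), sum(t.parent.freqs))
        return w * (gini(t.parent.freqs) - gini(t.freqs))

    def generate_rule(t):
        antecedent = set(path_items(t))
        freqs = [0, 0]
        for n in header[t.item]:           # equivalent to projecting the tree
            if antecedent <= set(path_items(n)):
                freqs = [a + b for a, b in zip(freqs, n.freqs)]
        k = t.freqs.index(max(t.freqs))    # consequent: majority class in T
        sup, conf = Fraction(freqs[k], total), Fraction(freqs[k], sum(freqs))
        if sup < minsup or conf < minconf:
            return []
        return [("".join(sorted(antecedent)), "+-"[k], sup, conf)]

    def extract(t):
        if info_gain(t) <= 0:              # non-positive IG: no rule
            return []
        if gini(t.freqs) == 0:             # pure node: try to generate a rule
            return generate_rule(t)
        rules = [r for c in t.children.values() for r in extract(c)]
        return rules or generate_rule(t)   # fall back on the node itself

    return [r for c in root.children.values() for r in extract(c)]

root, header = build(TOY)
rules = cap_growth(root, header, Fraction(3, 10), Fraction(51, 100))
```

The resulting list holds exactly two rules, AD ⇒ + (support 1/2, confidence 1) and C ⇒ - (support 1/2, confidence 3/4), in agreement with the example.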
It is worth comparing the strategy of CAP-growth with that of database coverage pruning [10]. Database coverage scans the extracted rules, already sorted by prediction quality, and keeps adding them to the model as long as they correctly predict at least one transaction not yet covered, until all the transactions have been covered at least once. Similarly, CAP-growth keeps adding rules that cover transactions not yet covered, since they are extracted in different branches of the CAP-tree, and does so without extracting the entire set of CARs that satisfy the minimum thresholds. The main difference between the two strategies is the moment at which the pruning is performed: database coverage acts at the end of the extraction, when all the rules have already been extracted, whereas CAP-growth anticipates the pruning in the extraction phase. The aim of both strategies is the same: generating the fewest and shortest rules possible, avoiding redundancy in the model.
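For comparison, a minimal sketch of database coverage is given below. It follows the description above under two assumptions: a transaction counts as covered as soon as a kept rule matches it, and the ranking of the input rules is simply given. The rule list is a hypothetical, hand-ranked subset of the CARs of Figure 2.12.

```python
TOY = [("ABDE", "+"), ("BCE", "-"), ("ABDE", "+"),
       ("ABCE", "-"), ("ABCDE", "+"), ("BCD", "-")]

def database_coverage(sorted_rules, db):
    """Keep the rules that correctly predict a not-yet-covered transaction."""
    uncovered = set(range(len(db)))
    model = []
    for antecedent, label in sorted_rules:
        if not uncovered:                  # every transaction is covered
            break
        matched = {i for i in uncovered if set(antecedent) <= set(db[i][0])}
        if any(db[i][1] == label for i in matched):
            model.append((antecedent, label))
            uncovered -= matched           # assumption: matched means covered
    return model

# Hypothetical ranking by prediction quality (best rules first).
ranked = [("AD", "+"), ("C", "-"), ("A", "+"), ("E", "+")]
model = database_coverage(ranked, TOY)
print(model)
```

With this ranking, the first two rules already cover all six records, so the pruned model keeps only AD ⇒ + and C ⇒ -; the pruning, however, runs only after all the candidate rules have been extracted and ranked.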
Model consolidation
CAP-growth generates a single model in each partition of the dataset, which is at the same time compact and useful. Still, with massively large datasets, the number of partitions needed to divide the workload sufficiently may be in the order of thousands, or more. Consequently, the number of single models in the ensemble explodes. This results in a larger model to store, which is harder for a human to read and examine, and which takes longer to apply when predicting new records.
To cope with these issues, we shrink the ensemble of the models to a unique