2.6 Techniques for Text Classification
2.6.3 Classification Association Rule Mining
Association Rule Mining (ARM) is a technique for extracting association rules from a transactional database, first introduced by Agrawal et al. [1]. The ARM problem is defined in terms of the following:
• DT is a transactional database;
• I = {a1, a2, ..., an} is a set of binary-valued database attributes called items;
• T = {t1, t2, ..., tm} is a set of database records called transactions;
• DT is described by T, where each ti ∈ T comprises a set of items I′ ⊆ I.
An association rule describes the co-occurrence relationship between two sets of items in DT and is expressed as X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The quality of an association rule is typically measured using the support and confidence framework. The support and confidence measures are defined as follows:
1. Support: The support of an itemset is used to determine whether the itemset is frequent. If the support value of an itemset is at least a pre-determined threshold σ, then the itemset is said to be frequent.
2. Confidence: The confidence value is used to determine how strongly X implies Y in an association rule of the form X ⇒ Y. A pre-determined threshold α is used to separate high-confidence association rules from low-confidence ones.
Equations 2.4 and 2.5 are used to compute the support and confidence values, respectively.
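\[
\mathrm{support}(X \cup Y) = \frac{\mathrm{count}(X \cup Y)}{|T|} \tag{2.4}
\]
\[
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \tag{2.5}
\]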
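As a small worked illustration of Equations 2.4 and 2.5, the following Python sketch computes both measures on a toy database. The transactions, item names and function names are hypothetical and chosen only for illustration; the sketch also assumes that the antecedent X occurs in at least one transaction, so that support(X) is non-zero.

```python
# Toy transactional database (hypothetical data, for illustration only).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset` (Eq. 2.4)."""
    itemset = frozenset(itemset)
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)


def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X) (Eq. 2.5); assumes support(X) > 0."""
    joint = frozenset(x) | frozenset(y)
    return support(joint, transactions) / support(x, transactions)


# {bread, milk} occurs in 2 of the 4 transactions.
s = support({"bread", "milk"}, transactions)
# bread occurs in 3 transactions, 2 of which also contain milk.
c = confidence({"bread"}, {"milk"}, transactions)
print(s, c)
```

Here the rule bread ⇒ milk has support 2/4 and confidence 2/3, so it would be retained only if σ ≤ 0.5 and α ≤ 2/3.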
Within the support and confidence framework, the support threshold σ is used to identify frequent itemsets. These frequent itemsets are then used to generate association rules, which are filtered using the confidence threshold α. The most frequently cited ARM algorithm is the Apriori algorithm, introduced by Agrawal and Srikant [2], which has subsequently formed the basis of other ARM algorithms. In the Apriori algorithm, frequent itemsets are identified iteratively using the “downward closure property” of itemsets: an itemset can be frequent only if all of its subsets are frequent, so an itemset is considered as a candidate in a given pass only if all of its subsets were identified as frequent in the previous pass. Algorithm 1 shows the Apriori algorithm for identifying frequent itemsets.
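The level-wise search described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of [2]: the function name `apriori`, the toy database and the use of relative support are assumptions made for the example, and candidate generation is done by a simple join of the previous level followed by a downward-closure prune.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Count each candidate and keep those meeting the support threshold.
        kept = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                kept[c] = s
        return kept

    # Pass 1: every individual item is a candidate 1-itemset.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([i]) for i in items})
    result = dict(current)
    k = 1
    while current:
        prev = set(current)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step (downward closure): every k-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k))
        }
        current = keep_frequent(candidates)
        result.update(current)
        k += 1
    return result


# Hypothetical toy database.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(toy_db, min_support=0.5)
```

With this toy database every single item and every pair is frequent, while {a, b, c} (support 2/5) is discarded in the third pass, at which point the loop terminates.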
Algorithm 1: Apriori algorithm for identifying frequent itemsets
input: Ik ∈ DT and minimum support threshold σ
output: S, a set of frequent itemsets
k ← 1;
S ← an empty set to hold frequent itemsets;
generate Ik ∈ DT;
while Ik ≠ ∅ do
    for all Ik ∈ DT do
        determine the support for Ik ∈ DT;
        if support for Ik ≥ σ then
            store Ik in S;
        else
            remove Ik;
        end
    end
    generate Ik+1 ∈ DT;
    k ← k + 1;
end
return S;

Classification Association Rule Mining (CARM) is the use of ARM algorithms to induce rules for use in classification tasks. ARM algorithms are employed to extract classification association rules from transactional databases with binary features. A classification association rule describes the association between a set of binary feature-value pairs and a class feature. Therefore, in terms of the association rule X ⇒ Y, X is some subset of binary feature-value pairs, while Y is the class feature.
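To make the form of a classification association rule concrete, the sketch below enumerates rules X ⇒ Y in which X is a small set of binary features and Y is a class label, keeping those that meet the support and confidence thresholds. It is a brute-force illustration only and does not correspond to CBA, CMAR, CPAR or TFPC; the records, the function name and the `max_len` cap on antecedent size are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical labelled records: (binary features present, class label).
records = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "party"}, "politics"),
    ({"vote", "team"}, "politics"),
]


def class_association_rules(records, min_support, min_confidence, max_len=2):
    """Return (antecedent, class, support, confidence) for each retained rule."""
    n = len(records)
    features = sorted({f for items, _ in records for f in items})
    classes = sorted({label for _, label in records})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(features, size):
            x = frozenset(antecedent)
            # Class labels of the records that contain the whole antecedent.
            covered = [label for items, label in records if x <= items]
            if not covered:
                continue
            for y in classes:
                both = covered.count(y)
                sup = both / n              # support(X ∪ Y)
                conf = both / len(covered)  # support(X ∪ Y) / support(X)
                if sup >= min_support and conf >= min_confidence:
                    rules.append((x, y, sup, conf))
    return rules


rules = class_association_rules(records, min_support=0.5, min_confidence=1.0)
for x, y, sup, conf in rules:
    print(sorted(x), "=>", y, sup, conf)
```

On this toy data only {goal} ⇒ sport and {vote} ⇒ politics survive; the feature "team" occurs in both classes and therefore yields no rule with confidence 1.0.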
The use of CARM techniques has been reported by a number of authors [58, 19, 21, 89, 97]. Popular techniques include: Classification Based on Associations (CBA) [60]; Classification based on Multiple Association Rules (CMAR) [58]; Classification based on Predictive Association Rules (CPAR) [96] and Total From Partial Classification (TFPC) [89].
The TFPC algorithm is used in this thesis as a comparison technique for the proposed IRL mechanism in the multi-class classification task, because it is a multi-class text classification system. It has one keyword selection strategy and four phrase selection strategies built in. The keyword selection strategy selects words whose contribution value, a measure of how discriminative a word is, exceeds a user-defined minimum threshold. The four phrase selection strategies, on the other hand, are based on the notions of noise words, significant words, ordinary words and stop marks, which are defined as follows:
• Noise words - Common and rare words which occur in the documents in the dataset.
• Significant words - Selected keywords which are used to differentiate between classes.
• Ordinary words - Other non-noise words which are not selected as significant words.
• Stop marks - Punctuation marks comprising { , . : ; ! ? }.
The four phrase selection strategies are then derived as follows:
1. DelSN contGO - These phrases are delimited by stop marks and/or noise words and are made up of sequences of one or more significant words and ordinary words.
2. DelSN contGW - These phrases are delimited by stop marks and/or noise words and are made up of sequences of one or more significant words and “wild card” words, which can be matched to any single word.
3. DelSO contGN - These phrases are delimited by stop marks and/or ordinary words and are made up of sequences of one or more significant words and noise words.
4. DelSO contGW - These phrases are delimited by stop marks and/or ordinary words and are made up of sequences of one or more significant words and “wild card” words, which can be matched to any single word.
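The four strategies differ only in which word category acts as a delimiter and in whether the non-significant content words are kept or replaced by wild cards. The following Python sketch illustrates this under stated assumptions and is not the TFPC implementation: the function name, its parameters, the "*" wild-card symbol and the example sentence are hypothetical, and the sketch assumes that a phrase is kept only if it contains at least one significant word.

```python
STOP_MARKS = {",", ".", ":", ";", "!", "?"}


def extract_phrases(tokens, significant, noise, delimiter="noise", wildcard=False):
    """Extract phrases from a token sequence.

    delimiter="noise"    -> stop marks and noise words delimit (DelSN ...)
    delimiter="ordinary" -> stop marks and ordinary words delimit (DelSO ...)
    wildcard=True        -> non-significant content words become "*" (... contGW)
    """
    phrases, current = [], []
    # A sentinel stop mark flushes the final phrase.
    for tok in list(tokens) + ["."]:
        if tok in STOP_MARKS:
            kind = "stop"
        elif tok in significant:
            kind = "significant"
        elif tok in noise:
            kind = "noise"
        else:
            kind = "ordinary"

        if kind in ("stop", delimiter):
            # Assumption: keep a phrase only if it has a significant word.
            if any(w in significant for w in current):
                phrases.append(tuple(current))
            current = []
        elif kind == "significant" or not wildcard:
            current.append(tok)
        else:
            current.append("*")
    return phrases


# Hypothetical example sentence and word categories.
tokens = "the striker scored a late goal , and the crowd cheered".split()
significant = {"striker", "goal"}
noise = {"the", "a", "and"}

sn_go = extract_phrases(tokens, significant, noise, "noise", wildcard=False)
sn_gw = extract_phrases(tokens, significant, noise, "noise", wildcard=True)
so_gn = extract_phrases(tokens, significant, noise, "ordinary", wildcard=False)
so_gw = extract_phrases(tokens, significant, noise, "ordinary", wildcard=True)
```

For this sentence, the first strategy yields the phrases (striker, scored) and (late, goal), whereas the third yields (the, striker) and (goal); the two wild-card variants produce the same phrases with the non-significant words replaced by "*".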