2.4 Text Classification
2.4.2 Relevant Text Classification Algorithms
There is no single “best” classification algorithm that can be applied effectively to every data mining problem. The reasons for this are unclear, but it is conjectured that they relate to the unique characteristics of individual data sets: (i) the type of data, (ii) the size of the data and (iii) the number and distribution of the classes amongst the records. Research conducted within the data mining community over the last few decades has resulted in many different techniques suited to specific conditions (such as small data sets, data sets comprised mostly of numeric records and data sets that feature a large number of classes). With respect to the work described in this thesis, a number of different classification techniques were considered: (i) Bayesian Classifiers (Naïve Bayes), (ii) Decision Trees (C4.5), (iii) Rule Learners (TFPC and RIPPER), (iv) k-nearest neighbour (KNN), and (v) Support Vector Machines (SMO and LibSVM). Each is discussed in more detail in the following subsections. The seven indicated algorithms (Naïve Bayes, C4.5, TFPC, RIPPER, KNN, SMO and LibSVM) were selected for a variety of reasons, as will become apparent in the following subsections.
2.4.2.1 Bayesian Classifiers
Bayesian classifiers are probabilistic classifiers based on Bayes’ theorem, which is named after Thomas Bayes, who proposed it. Bayes’ theorem is usually expressed as:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{2.11}$$
where A is a hypothesis, B is evidence, P(A|B) is the posterior probability of A conditioned on B, P(A) and P(B) are the prior probabilities of A and B respectively, and P(B|A) is the conditional probability of B given A (the likelihood).
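As a worked illustration of Bayes’ theorem, the following minimal Python sketch computes a posterior probability. All figures are hypothetical and chosen only to make the arithmetic easy to follow; the function name `posterior` is likewise an illustrative assumption.

```python
def posterior(prior_a: float, prob_b_given_a: float, prob_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if prob_b <= 0:
        raise ValueError("P(B) must be positive")
    return prob_b_given_a * prior_a / prob_b


# Hypothetical figures: 20% of documents belong to class "spam" (A);
# the word "offer" (B) occurs in 50% of spam documents and 10% of the rest.
p_spam = 0.2
p_offer_given_spam = 0.5
p_offer_given_other = 0.1

# P(B) follows from the law of total probability.
p_offer = p_offer_given_spam * p_spam + p_offer_given_other * (1 - p_spam)

p_spam_given_offer = posterior(p_spam, p_offer_given_spam, p_offer)
print(round(p_spam_given_offer, 3))
```

With these figures P(B) = 0.18, so observing the word raises the probability of the class from the prior of 0.2 to a posterior of 0.1 / 0.18, roughly 0.556.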
Of the many implementations of Bayes’ theorem that have been proposed, the simplest one (Naïve Bayes) is used in this thesis, mainly to serve as a benchmark against which to compare the other classification techniques. Naïve Bayes combines prior and conditional probabilities to calculate the probability of alternative classifications [8]. It is called “naïve” because it “naïvely” assumes that attributes are independent of each other, so that their probabilities can be multiplied. Despite this assumption, the Naïve Bayes algorithm has proved to be an effective form of classifier generator, especially when feature selection has been performed and only non-redundant (independent) attributes are left.
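The way prior and conditional probabilities are multiplied under the independence assumption can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in the thesis: the multinomial word-count model, the add-one smoothing, the log-probability arithmetic and the toy documents are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, class_counts, word_counts, vocab):
    """Return the class with the highest (log) prior times conditional product."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(class)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing (an assumption) avoids zero probabilities.
            p = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(p)  # independence: probabilities multiply
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
docs = [
    (["cheap", "offer", "now"], "spam"),
    (["limited", "offer"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["project", "meeting", "now"], "ham"),
]
model = train_naive_bayes(docs)
print(classify(["offer", "now"], *model))
print(classify(["meeting", "agenda"], *model))
```

Summing logarithms rather than multiplying raw probabilities is a common design choice that avoids numerical underflow on long documents; it does not change which class scores highest.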
2.4.2.2 Decision Trees
Decision trees have been used for classification purposes for many years. Their main advantage is simplicity: the decision tree classification process is easy to understand, interpret and explain. An additional advantage is that, if desired, rules can be easily generated from decision trees (see below). The decision tree algorithm adopted for the evaluation described in this thesis was C4.5. Proposed by Quinlan [82], C4.5 is the successor to the ID3 (Iterative Dichotomiser 3) algorithm [81] and has established itself as a benchmark algorithm throughout the data mining community.
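The remark that rules can be generated from decision trees can be illustrated by walking every root-to-leaf path of a tree. The Python sketch below does this for a small hand-written tree; the nested-dictionary representation, the attribute names and the tree itself are hypothetical and are not output of C4.5.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, class_label) rule for each leaf of the tree."""
    if not isinstance(node, dict):  # a leaf holds a class label
        return [(list(conditions), node)]
    rules = []
    attribute = node["attribute"]
    for value, child in node["branches"].items():
        # Extend the path with the test taken on this branch.
        rules.extend(tree_to_rules(child, conditions + ((attribute, value),)))
    return rules


# Hypothetical tree: internal nodes test an attribute, leaves are classes.
toy_tree = {
    "attribute": "contains_offer",
    "branches": {
        "yes": {
            "attribute": "known_sender",
            "branches": {"yes": "ham", "no": "spam"},
        },
        "no": "ham",
    },
}

rules = tree_to_rules(toy_tree)
for conds, label in rules:
    antecedent = " and ".join(f"{attr} = {val}" for attr, val in conds)
    print(f"if {antecedent} then {label}")
```

Each leaf yields exactly one rule whose antecedent is the conjunction of the tests along its path, so the toy tree above produces three rules.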
2.4.2.3 Rule Learners
As described in [41], classification rules have the general form: if <ANTECEDENT> then <CONCLUSION>; for example a simple “if a then b” or a more complex “if a and b then c”. A common mechanism for generating rule-based classifiers is Classification Association Rule Mining (CARM); another is rule induction. As explained in [107], association rules can be used to predict a class attribute; however, CARM usually results in a large number of Classification Association Rules (CARs). Support and confidence thresholds are used to limit the number of generated rules by keeping only the most relevant ones. Support is an indicator of the coverage of a rule, while confidence is an indicator of the accuracy of a rule. Bramer [8] defines support as “the proportion of the training set correctly predicted by the rule” and confidence as “the proportion of right-hand sides predicted by the rule that are correctly predicted”. In the context of the work described in this thesis the rule-based algorithms considered are TFPC (Total From Partial Classification) and RIPPER (Repeated Incremental Pruning to Produce Error Reduction).
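Support and confidence can be made concrete with a small calculation. The Python sketch below uses the usual count-based reading: support is the fraction of all records that satisfy both the antecedent and the conclusion, and confidence is the fraction of records satisfying the antecedent that also satisfy the conclusion. The function name and the toy records are hypothetical.

```python
def support_and_confidence(records, antecedent, conclusion):
    """records: list of item sets; antecedent: set of items; conclusion: one item."""
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if antecedent <= r and conclusion in r)
    support = n_both / len(records)  # coverage of the rule
    confidence = n_both / n_antecedent if n_antecedent else 0.0  # accuracy
    return support, confidence


# Hypothetical records; the rule under test is "if a and b then c".
records = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
]
support, confidence = support_and_confidence(records, {"a", "b"}, "c")
print(support, round(confidence, 3))
```

Here three records match the antecedent and two of them also contain c, giving a support of 2/5 and a confidence of 2/3. A rule would be kept only if both values meet the chosen thresholds.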
The TFPC algorithm was proposed by Coenen et al. [18] and is a CARM algorithm based on the Apriori-TFP (Total From Partial) Association Rule Mining (ARM) algorithm [17]. Apriori-TFP, in turn, was founded on the classic Apriori algorithm [3]. While Apriori-TFP generates association rules, TFPC generates classification association rules. The difference between TFPC and other CARM algorithms is that it does not follow the typical approach of first generating all the rules and then pruning them to generate a classifier; instead, TFPC comprises a single step in which all the rules are generated according to a process of identifying the frequent sets of attributes that can be used to generate CARs.
RIPPER is a rule induction algorithm proposed by Cohen [19] in which “classes are examined in increasing size and a set of rules for a class is generated using incremental reduced-error pruning” [107].
2.4.2.4 Nearest Neighbour Techniques
Nearest neighbour classification techniques use some form of similarity function to compare new instances with existing instances. Each instance is represented as a point in an n-dimensional space, and an unseen instance is classified according to the nearest classified instances. The distance between points is usually measured using Euclidean distance, but other metrics can be used (for example the Mahalanobis distance or the Chebyshev distance). The best-known nearest neighbour technique is the K-Nearest-Neighbour (KNN) algorithm, which classifies an unknown instance according to the classes of the K training instances closest to it in the feature space. Because each new instance must be compared with the stored training instances, KNN is more computationally expensive at classification time than many other methods; it is nevertheless used for evaluation purposes in this thesis because of its enduring popularity.
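The KNN procedure can be sketched as follows in Python, assuming Euclidean distance and a simple majority vote among the k nearest training instances. The value k = 3, the function names and the toy two-dimensional data are illustrative assumptions, not the settings used in the thesis.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(training, query, k=3):
    """training: list of (point, label). Majority vote among the k nearest."""
    # Every training instance is compared with the query, which is why
    # classification cost grows with the size of the training set.
    neighbours = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training instances in a 2-dimensional feature space.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_classify(training, (1.8, 1.4)))
print(knn_classify(training, (6.2, 6.1)))
```

Swapping `euclidean` for another metric (for example a Chebyshev distance) changes only the `key` function; the voting step stays the same.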
2.4.2.5 Support Vector Machines (SVM)
Support Vector Machines (SVMs) are a relatively recent addition to the range of available classification techniques. Nevertheless, SVMs have proved to be very effective in the context of text classification [112]. The SVM technique operates by separating the training instances of a binary classification problem in the instance space using a maximum-margin hyperplane: of the many hyperplanes that could separate the training instances, the one that yields the greatest separation between the training instances of the two classes. In mathematics, a hyperplane is typically defined as an (n-1)-dimensional subspace of an n-dimensional vector space. The hyperplane is determined by the instances of both classes that lie closest to the boundary that separates them (the support vectors). In the context of this thesis two SVM algorithms were used: (i) SMO (Sequential Minimal Optimization) and (ii) LibSVM (Library for Support Vector Machines).
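The idea of preferring, among several separating hyperplanes, the one with the largest margin can be illustrated numerically. The Python sketch below does not train an SVM; it only measures the margin of two hand-picked candidate hyperplanes of the form w·x + b = 0 on a hypothetical two-dimensional data set, with class labels encoded as +1 and -1.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, data):
    """Smallest distance of any instance to the hyperplane.

    The result is negative if some instance lies on the wrong side.
    """
    return min(label * signed_distance(w, b, x) for x, label in data)


# Hypothetical, linearly separable training instances.
data = [
    ((2.0, 2.0), 1), ((3.0, 3.0), 1),
    ((0.0, 0.0), -1), ((-1.0, 0.0), -1),
]

# Two hand-picked separating hyperplanes (w, b); neither is learned.
candidates = {
    "x + y = 2": ((1.0, 1.0), -2.0),
    "x = 1": ((1.0, 0.0), -1.0),
}
margins = {name: margin(w, b, data) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates separate the data, but the first leaves a margin of about 1.414 against 1.0 for the second. The instances (2, 2) and (0, 0), which lie at exactly that minimum distance, play the role of the instances near the boundary on which the hyperplane is based.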
SMO (Sequential Minimal Optimization) was proposed by Platt [80]. Like other SVM training algorithms, it divides a large QP (Quadratic Programming) problem into smaller QP problems. It differs from them in that it uses the smallest possible QP sub-problems, each involving only two Lagrange multipliers, which can be solved analytically; as a result it is much more efficient in terms of both computational cost and running time.
Chang and Lin [11] presented LibSVM (Library for Support Vector Machines), a library that includes SVM implementations for multiclass classification, regression and one-class problems, with a choice of kernels: linear, polynomial, radial-basis and sigmoid. The experiments described in this thesis used the classification implementation with the radial-basis kernel.
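The four kernel types named above can be written down as plain functions. The following Python sketch uses the standard textbook forms of these kernels; it is not LibSVM code, and the parameter names and default values (`gamma`, `coef0`, `degree`) are assumptions chosen for illustration rather than the settings used in the experiments.

```python
import math


def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    """Radial-basis kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = (1.0, 2.0), (3.0, 4.0)
k_same = rbf_kernel(u, u)       # identical points give the maximum value 1.0
k_far = rbf_kernel(u, v)        # decays towards 0 as points move apart
print(linear_kernel(u, v), k_same, k_far)
```

The radial-basis kernel depends only on the distance between two instances, so nearby instances are treated as similar regardless of where they lie in the instance space; `gamma` controls how quickly that similarity falls off.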