
CHAPTER 2 Automatic Text Classification: A Review

2.2 Components of an ATC System

2.2.3 Machine Learning algorithms for TC task

2.2.3.2 Support Vector Machine classifier

The Support Vector Machine (SVM) is a supervised learning algorithm that analyses data and recognises patterns. It is based on the structural risk minimisation principle [24] from computational learning theory. The SVM was introduced by Vapnik in 1995 for solving two-category pattern recognition problems [24]. It was adopted for the problem of TC by Joachims in 1998 [25] and has subsequently been used by others [26, 27].

The SVM is defined over a vector space in which the problem is to find a decision surface, or hyperplane, that separates the data points into two categories. As shown in Figure 2.3, finding the greatest separation between the two groups requires introducing a margin between the two categories.

Let D be a training set of documents where each document belongs to one of two categories. The SVM classifier builds a model that predicts whether a new document falls into one category or the other.

Figure 2.3 The separation hyperplane (figure labels: separation hyperplane, support vectors)

The SVM model represents documents as points in space, so that the documents of one category are separated from those of the other by a clear gap, known as the margin, which should be as wide as possible. A new document is assigned to a category according to the side of the gap on which it falls [7, 28, 29]. The SVM problem is therefore to find the decision surface that maximises the margin between the data points of the two categories in the training dataset (see Figure 2.4).

[Figure 2.4: two panels, "Small Margin" and "Large Margin"]


As can be seen in Figure 2.4, a good separation is achieved by the largest margin without causing misclassification of the data [7, 28, 30].

The separating hyperplane in this vector space can be written as:

$$W \cdot X + b = 0 \tag{2.12}$$

where $W$ is the weight vector of the optimal hyperplane, $b$ is the bias, and $W \cdot X$ is the dot product of the weight and input vectors. Thus, we will consider a decision function of the form:

$$D(X) = \operatorname{sign}(W \cdot X + b) \tag{2.13}$$

Only the sign of $W \cdot X + b$ affects the decision function, not its magnitude. In other words, the decision function is left invariant if we scale $W$ and $b$ by any positive quantity. Therefore, we can implicitly fix the scale by fixing the canonical hyperplanes:

$$W \cdot X + b = -1 \tag{2.14}$$

$$W \cdot X + b = 1 \tag{2.15}$$
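
The rescaling that leads to the canonical hyperplanes can be sketched as follows. This is an illustrative sketch only: the function name `canonicalise` and the sample points are assumptions made for the example. It divides W and b by the smallest value of |W·X + b| over the training points, so that the closest points end up on (2.14) or (2.15) while every prediction stays the same.

```python
def dot(w, x):
    """Dot product W . X of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def canonicalise(w, b, points):
    """Rescale (W, b) so that the closest point satisfies |W . X + b| = 1.

    Dividing by a positive quantity leaves sign(W . X + b) unchanged.
    """
    closest = min(abs(dot(w, x) + b) for x in points)
    if closest == 0:
        raise ValueError("a point lies on the hyperplane; cannot rescale")
    return [wi / closest for wi in w], b / closest


# Hypothetical separating hyperplane and training points.
points = [[3.0, 2.0], [4.0, 4.0], [1.0, 0.0], [0.0, 1.0]]
w_c, b_c = canonicalise([2.0, 2.0], -6.0, points)

print(w_c, b_c)
for x in points:
    # The closest points now give exactly +1 or -1.
    print(x, dot(w_c, x) + b_c)
```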

Let $X_1$ and $X_2$ be two support vectors, one on each side of the separating hyperplane, with $X_1$ on the canonical hyperplane (2.15) and $X_2$ on the canonical hyperplane (2.14). Subtracting the equation of one canonical hyperplane from the other gives:

$$(W \cdot X_1 + b) - (W \cdot X_2 + b) = W \cdot (X_1 - X_2) = 1 - (-1) = 2 \tag{2.16}$$

The margin is given by the projection of the vector $(X_1 - X_2)$ onto the unit normal to the separating hyperplane, i.e., $W/\|W\|$, as illustrated in Figure 2.5:

$$\mathit{Margin} = \frac{W}{\|W\|} \cdot (X_1 - X_2) \tag{2.17}$$

Figure 2.5 Separation hyperplane margin (figure labels: W·X + b < -1, W·X + b = 0, W·X + b = -1)

Dividing both sides of equation (2.16) by $\|W\|$ gives the width of the margin:

$$\frac{W}{\|W\|} \cdot (X_1 - X_2) = \frac{2}{\|W\|} = \frac{1}{0.5\,\|W\|} \tag{2.18}$$
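
The margin can be checked numerically with a small sketch. The weight vector and the two support vectors below are hypothetical values, chosen so that one point lies on each canonical hyperplane when b = -1.5. The sketch computes the margin both as 2/||W|| and as the projection of (X1 - X2) onto W/||W||, and the two values should agree.

```python
import math


def norm(w):
    """Euclidean length ||W|| of a vector."""
    return math.sqrt(sum(wi * wi for wi in w))


def margin(w):
    """Margin width 2 / ||W|| for a canonical hyperplane."""
    length = norm(w)
    if length == 0.0:
        raise ValueError("W must be non-zero")
    return 2.0 / length


def projected_gap(w, x1, x2):
    """Projection of (X1 - X2) onto the unit normal W / ||W||."""
    length = norm(w)
    if length == 0.0:
        raise ValueError("W must be non-zero")
    return sum(wi * (a - c) for wi, a, c in zip(w, x1, x2)) / length


# Hypothetical canonical hyperplane with b = -1.5.
W = [0.5, 0.5]
X1 = [3.0, 2.0]  # W.X1 + b = +1
X2 = [1.0, 0.0]  # W.X2 + b = -1

print(margin(W))                 # approximately 2.83
print(projected_gap(W, X1, X2))  # approximately 2.83
```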

Our goal is to maximise the margin, which by (2.18) is equivalent to minimising $\|W\|$. Because $\|W\|$ is non-negative, this is in turn equivalent to minimising the function:

$$\phi(W) = 0.5\,(W \cdot W) \tag{2.19}$$

subject to the constraint:

$$y_i\,[(W \cdot X_i) + b] \geq 1 \tag{2.20}$$

where $y_i \in \{-1, +1\}$ is the category label of training document $X_i$.
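
To make the optimisation problem concrete, the sketch below evaluates the objective 0.5(W·W) and checks the margin constraint for a few candidate hyperplanes. The data set, the candidate values of W and b, and the function names are assumptions made for illustration; the sketch does not solve the quadratic programme, it only shows what a solver has to compare.

```python
def dot(w, x):
    """Dot product W . X of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def objective(w):
    """Hard-margin objective 0.5 * (W . W)."""
    return 0.5 * dot(w, w)


def is_feasible(w, b, points, labels, tol=1e-9):
    """True if every point satisfies y_i * (W . X_i + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1.0 - tol
               for x, y in zip(points, labels))


# Hypothetical linearly separable training set with labels in {-1, +1}.
points = [[3.0, 2.0], [4.0, 4.0], [1.0, 0.0], [0.0, 1.0]]
labels = [1, 1, -1, -1]

candidates = [
    ([1.0, 1.0], -3.0),     # feasible, but a larger ||W|| (smaller margin)
    ([0.5, 0.5], -1.5),     # feasible, with a smaller ||W|| (larger margin)
    ([0.25, 0.25], -0.75),  # smaller ||W|| still, but violates the constraint
]
for w, b in candidates:
    print(w, b, objective(w), is_feasible(w, b, points, labels))
```

Among the feasible candidates, the one with the smallest objective value is preferred, which is the one with the widest margin.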

The constraint ensures that, as the margin grows, the separating hyperplane still separates the data points correctly. This is a constrained optimisation problem, which is solved using quadratic programming techniques [24, 27, 31].

In situations where a training set is not linearly separable, the standard approach is to allow the decision margin to make a few mistakes. Even when the data are linearly separable, we might prefer a solution that better separates the bulk of the data by making the margin bigger while ignoring a few noisy data points. We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement given in (2.20). To implement this, we introduce slack variables $\xi_i$: a non-zero value of $\xi_i$ allows the point $X_i$ to fall short of the margin requirement at a cost proportional to the value of $\xi_i$. The formulation of the SVM optimisation problem with slack variables is:

$$\phi(W) = 0.5\,(W \cdot W) + C \sum_{i=1}^{l} \xi_i \tag{2.21}$$

subject to the constraint:

$$y_i\,[(W \cdot X_i) + b] \geq 1 - \xi_i \tag{2.22}$$

The margin can be less than 1 for a point $X_i$ by setting $\xi_i > 0$, but one then pays a penalty of $C\xi_i$ in the minimisation for having done so. The sum $\sum_{i=1}^{l} \xi_i$ gives an upper bound on the number of training errors. The soft-margin SVM thus minimises the training error traded off against the margin. The parameter C is a regularisation term, which provides a way to control over-fitting: a large C penalises margin violations heavily, whereas a small C tolerates them in exchange for a wider margin.
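
The role of the slack variables can be illustrated with a short sketch. It assumes the usual non-negativity condition on the slack variables, which the text above does not state explicitly, so that the smallest slack satisfying (2.22) for a fixed W and b is max(0, 1 - y_i(W·X_i + b)). The data, including the one deliberately noisy point, and the function names are hypothetical.

```python
def dot(w, x):
    """Dot product W . X of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def slacks(w, b, points, labels):
    """Smallest non-negative slack per point that satisfies (2.22)."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b))
            for x, y in zip(points, labels)]


def soft_objective(w, b, points, labels, c):
    """Soft-margin objective 0.5 * (W . W) + C * sum of slacks, as in (2.21)."""
    return 0.5 * dot(w, w) + c * sum(slacks(w, b, points, labels))


def training_errors(w, b, points, labels):
    """Number of points on the wrong side of, or exactly on, the hyperplane."""
    return sum(1 for x, y in zip(points, labels) if y * (dot(w, x) + b) <= 0)


# Hypothetical data: the last point is a noisy example labelled -1
# that sits among the +1 points.
points = [[3.0, 2.0], [4.0, 4.0], [1.0, 0.0], [0.0, 1.0], [2.5, 2.5]]
labels = [1, 1, -1, -1, -1]
W, B = [0.5, 0.5], -1.5

xi = slacks(W, B, points, labels)
print(xi)                                        # only the noisy point has slack
print(sum(xi), training_errors(W, B, points, labels))
for c in (0.1, 1.0, 10.0):
    print(c, soft_objective(W, B, points, labels, c))
```

A misclassified point has y_i(W·X_i + b) ≤ 0 and therefore a slack of at least 1, which is why the sum of the slacks cannot be smaller than the number of training errors.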

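SVMs can be used for both linearly and nonlinearly separable data. In the nonlinear case, the SVM uses a nonlinear mapping $\phi$ to transform the original training data into a higher-dimensional space, in which it searches for a linear separating hyperplane. The mapping enters the problem only through a kernel function, and different kernel functions can be used with the SVM. The kernel function is defined as:

$$K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j) \tag{2.23}$$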

The most common kernel functions are [32, 33]:

• The Polynomial kernel:

$$K(x_i, x_j) = (\gamma\, x_i^T x_j)^d, \quad \gamma > 0 \tag{2.24}$$

• The Radial Basis Function (RBF) kernel:

$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0 \tag{2.25}$$

• The Sigmoid kernel:

$$K(x_i, x_j) = \tanh(\gamma\, x_i^T x_j + r) \tag{2.26}$$

Here, $\gamma$, $r$, and $d$ are kernel parameters.
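
The three kernels can be written directly from (2.24) to (2.26). The following is a minimal sketch in plain Python; the function names and the parameter values in the usage lines are illustrative assumptions, not recommended settings.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * c for a, c in zip(u, v))


def polynomial_kernel(x, z, gamma, d):
    """Polynomial kernel (gamma * x . z)^d, with gamma > 0."""
    return (gamma * dot(x, z)) ** d


def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2), with gamma > 0."""
    squared_distance = sum((a - c) ** 2 for a, c in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma, r):
    """Sigmoid kernel tanh(gamma * x . z + r)."""
    return math.tanh(gamma * dot(x, z) + r)


x = [1.0, 2.0]
z = [3.0, 4.0]
print(polynomial_kernel(x, z, gamma=1.0, d=2))
print(rbf_kernel(x, z, gamma=0.5))
print(sigmoid_kernel(x, z, gamma=0.01, r=0.0))
```

The RBF kernel depends only on the distance between the two points and equals 1 when they coincide, whereas the polynomial and sigmoid kernels depend on their dot product.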

SVMs are effective on high-dimensional data because the complexity of the trained classifier is characterised by the number of support vectors rather than the dimensionality of the data. The support vectors are the essential, or critical, training examples, as they lie closest to the decision boundary. If all other training examples were removed and the training repeated, the same separating hyperplane would be found. The number of support vectors can be used to compute an upper bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality. Thus, an SVM with a small number of support vectors can have good generalisation, even when the dimensionality of the data is high [7, 28, 30, 34].