Classification Model Building Training and Testing

CHAPTER 3: RESEARCH METHODOLOGY

3.4 Classification Model Building Training and Testing

For the sake of accuracy and efficiency, various classifiers are evaluated, including

FastText and six conventional machine learning algorithms [95,96]:

3.4.1 FastText

FastText (FT) was proposed by Facebook for word embeddings and text classification. In this study, FastText is used for text classification. FT produces accurate classification results that are comparable with the results produced by deep neural network classifiers. In addition, the processes of FT training and classification are very fast using a standard computer with a multicore processor. A FastText model can be trained using a bil- lion of words in just a few minutes and can classify about five hundred thousand sentences

in less than a minute [93].

FastText utilizes several techniques to enhance the efficiency. It is a linear–based model, scaled to very large data and large output space using a rank constraint and a fast loss approximation. It uses a hierarchal softmax function for a faster search. In addition, only partial information about the word order is utilized for prediction. Furthermore, FT

utilizes the technique of hashing for textual feature mapping [93].

3.4.2 Conventional Machine Learning Classifiers

For training and testing, several supervised classification methods were evaluated to

determine the one with better classification accuracy [95]. The evaluated conventional clas-

sifiers include Random Forest, Naïve Bayes, SVM, C4.5 Decision Tree, K–nearest neighbors classifier (KNN) using the Instance Based learning algorithm (IBK), and AdaBoost. The preprocessed labeled dataset was used to train and test the model of different classifiers using 10–fold cross validation as the experimental setting. The 10–fold cross validation is

a method to validate the studied/built model by iterating through the labeled data 10 times with different subsets of training and testing for each iteration.

3.4.2.1 Support Vector Machines (SVM)

SVM was proposed in 1998 by Vapnik [97]. It uses the features of the provided

training dataset to decide the classification boundary (hyperplane) that divides the space into regions, one for each class. SVM only chooses part of training samples that are close to the boundary to form the support vector instead of using the whole feature space in order to distinguish between the classes.

3.4.2.2 K–Nearest Neighbor (KNN)

KNN is a simple classifier. It uses the provided training set as an input vector to form different regions for different classes. Each sample in the training dataset is mapped to a point in the feature space. When a new unlabeled sample requires classification, KNN identifies the approximate distances between its associated point and k–neighbors in the space and then assigns the point to the class of the majority of the k–nearest neighbors.

The K value plays an important role in the performance of KNN classifiers [98]. A large

value may increase the classification accuracy, but it requires more computation time. In this study, we use an extended version of KNN that is implemented based on Instance–

Based learning algorithm (IBK) [99].

3.4.2.3 Random Forest (RF)

RF consists of several decision trees (n). Each tree is trained separately using a random feature subset of the training dataset T . The decision trees altogether make an accurate classifier. An unlabeled instance is classified by all the trees separately. Then, the final decision about the class of the unlabeled instance is decided by the majority voting

technique [100].

3.4.2.4 C4.5 Decision Tree

For this study, J48 that has been used is an implementation of the C4.5 decision tree

[101]. It consists of decision and leaf nodes. A decision node (internal node) is the node

that has a child and a leaf (terminal) node is the node that specifies the class value. The decision nodes are used to represent different attributes. When a decision node represents a discrete attribute, its child nodes indicate different possible values of the discrete attribute. A decision node for a continuous attribute has two child nodes. Each child indicates a certain range of the continuous attribute that is determined by using a threshold value. Lastly, the terminal nodes (leaves) represent the final value of the class labels. A decision tree is constructed and tested using a training dataset. Then, a new instance is classified by

the tree based on the values of the attribute (features) [101,102].

3.4.2.5 Naïve Bayes

Naïve Bayes is a supervised classifier that uses the conditional probability formula for classification. It is constructed using a training dataset with a prior knowledge about the relationships between the attributes (features), and a directed acyclic graph (DAG) that is used to represent conditionally independent features for a given class and their relationships. Each feature is represented by a node, and each relationship is represented by a link

in the graph. The links indicate the influences between different variables [103].

3.4.2.6 Ensemble classifier

An ensemble classifier is a combination of multiple classifiers that work together to enhance the accuracy of classification. Each classifier can be trained individually with

106]. In this study, we evaluate the AdaBoost ensemble classifier.

AdaBoost was introduced by Freund et al. [106]. Boosting is an enhanced version of

bagging. It consists of multiple base learners that are trained sequentially. The first learner is trained using a subset (bag) of random selected instances n from a training dataset T . The trained model is then tested using the training dataset T . During the testing process, weights are assigned to the examined instances with high weight values for the misclas- sified instances. Based on the weight values, the instances are picked for the next bag. Thus, instances with higher weights should be picked to train the next base learner in the sequence.

In document Efficient Text Classification with Linear Regression Using a Combination of Predictors for Flu Outbreak Detection (Page 70-73)