CHAPTER 3: RESEARCH METHODOLOGY
3.4 Classification Model Building Training and Testing
For the sake of accuracy and efficiency, various classifiers are evaluated, including
FastText and six conventional machine learning algorithms [95,96]:
3.4.1 FastText
FastText (FT) was proposed by Facebook for word embeddings and text classifica- tion. In this study, FastText is used for text classification. FT produces accurate clas- sification results that are comparable with the results produced by deep neural network classifiers. In addition, the processes of FT training and classification are very fast using a standard computer with a multicore processor. A FastText model can be trained using a bil- lion of words in just a few minutes and can classify about five hundred thousand sentences
in less than a minute [93].
FastText utilizes several techniques to enhance the efficiency. It is a linear–based model, scaled to very large data and large output space using a rank constraint and a fast loss approximation. It uses a hierarchal softmax function for a faster search. In addition, only partial information about the word order is utilized for prediction. Furthermore, FT
utilizes the technique of hashing for textual feature mapping [93].
3.4.2 Conventional Machine Learning Classifiers
For training and testing, several supervised classification methods were evaluated to
determine the one with better classification accuracy [95]. The evaluated conventional clas-
sifiers include Random Forest, Naïve Bayes, SVM, C4.5 Decision Tree, K–nearest neigh- bors classifier (KNN) using the Instance Based learning algorithm (IBK), and AdaBoost. The preprocessed labeled dataset was used to train and test the model of different classifiers using 10–fold cross validation as the experimental setting. The 10–fold cross validation is
a method to validate the studied/built model by iterating through the labeled data 10 times with different subsets of training and testing for each iteration.
3.4.2.1 Support Vector Machines (SVM)
SVM was proposed in 1998 by Vapnik [97]. It uses the features of the provided
training dataset to decide the classification boundary (hyperplane) that divides the space into regions, one for each class. SVM only chooses part of training samples that are close to the boundary to form the support vector instead of using the whole feature space in order to distinguish between the classes.
3.4.2.2 K–Nearest Neighbor (KNN)
KNN is a simple classifier. It uses the provided training set as an input vector to form different regions for different classes. Each sample in the training dataset is mapped to a point in the feature space. When a new unlabeled sample requires classification, KNN identifies the approximate distances between its associated point and k–neighbors in the space and then assigns the point to the class of the majority of the k–nearest neighbors.
The K value plays an important role in the performance of KNN classifiers [98]. A large
value may increase the classification accuracy, but it requires more computation time. In this study, we use an extended version of KNN that is implemented based on Instance–
Based learning algorithm (IBK) [99].
3.4.2.3 Random Forest (RF)
RF consists of several decision trees (n). Each tree is trained separately using a random feature subset of the training dataset T . The decision trees altogether make an accurate classifier. An unlabeled instance is classified by all the trees separately. Then, the final decision about the class of the unlabeled instance is decided by the majority voting
technique [100].
3.4.2.4 C4.5 Decision Tree
For this study, J48 that has been used is an implementation of the C4.5 decision tree
[101]. It consists of decision and leaf nodes. A decision node (internal node) is the node
that has a child and a leaf (terminal) node is the node that specifies the class value. The decision nodes are used to represent different attributes. When a decision node represents a discrete attribute, its child nodes indicate different possible values of the discrete attribute. A decision node for a continuous attribute has two child nodes. Each child indicates a certain range of the continuous attribute that is determined by using a threshold value. Lastly, the terminal nodes (leaves) represent the final value of the class labels. A decision tree is constructed and tested using a training dataset. Then, a new instance is classified by
the tree based on the values of the attribute (features) [101,102].
3.4.2.5 Naïve Bayes
Naïve Bayes is a supervised classifier that uses the conditional probability formula for classification. It is constructed using a training dataset with a prior knowledge about the relationships between the attributes (features), and a directed acyclic graph (DAG) that is used to represent conditionally independent features for a given class and their relation- ships. Each feature is represented by a node, and each relationship is represented by a link
in the graph. The links indicate the influences between different variables [103].
3.4.2.6 Ensemble classifier
An ensemble classifier is a combination of multiple classifiers that work together to enhance the accuracy of classification. Each classifier can be trained individually with
106]. In this study, we evaluate the AdaBoost ensemble classifier.
AdaBoost was introduced by Freund et al. [106]. Boosting is an enhanced version of
bagging. It consists of multiple base learners that are trained sequentially. The first learner is trained using a subset (bag) of random selected instances n from a training dataset T . The trained model is then tested using the training dataset T . During the testing process, weights are assigned to the examined instances with high weight values for the misclas- sified instances. Based on the weight values, the instances are picked for the next bag. Thus, instances with higher weights should be picked to train the next base learner in the sequence.