Chapter 4 Identifying Indonesian Cyberbullying Messages
4.5 Implementation of Machine Learning for Data Analysis
4.5.1 Learning the Model
Rapid Miner supports the machine learning process by building analysis models with an extensive range of leading algorithms (Akthar and Hahne, 2012). It also offers user-friendly integration of recent and well-established data mining techniques. An analysis in Rapid Miner Studio is built by dragging and dropping operators, setting their parameters, and linking the operators together. Rapid Miner Studio comprises over 1,500 operations covering all aspects of professional data analysis, ranging from data partitioning to market-based analysis and attribute generation (Akthar and Hahne, 2012). These operations provide all the tools required for the subsequent stages of data analysis. Other approaches are also available, such as text mining, web mining, automatic sentiment analysis of Internet discussion forums (sentiment analysis, opinion mining), and time series analysis and prediction (Klinkenberg, 2013).
This research identified naïve Bayes, decision tree and neural network techniques as the most useful algorithms for analysing cyberbullying messages, and applied them to the Indonesian cyberbullying messages using the functions of Rapid Miner. Learning the model required the processes explained below:
1. The training data set was retrieved from the repository using the same process as in Chapter 3, as explained in section 3.4.
2. The process document operator was used to clean the data; it contains operators such as tokenize, transform case, stop words, and stem dictionary. After dragging and dropping the process document operator into the main process view in Rapid Miner, several parameters were set: the word vector creation method, which also determines the output of this process, the prune method, and the attribute and weight that select the target data to be cleaned. The term frequency–inverse document frequency (TF-IDF) was chosen for the word vectors because a TF-IDF value increases with the number of times a word appears in a document, but is counterbalanced by the frequency of that word in the corpus, which compensates for words that occur frequently in general. TF-IDF is one of the well-known term-weighting schemes. The absolute prune method was selected, with a value of 2 for pruning below the absolute and 80,006 for pruning above the absolute. For the attribute and weight parameter, the text attribute was set as the target data for analysis and given the default weight of 1.0.
3. An analysis model was created with three classification techniques, naïve Bayes, decision tree and neural network, by dragging and dropping each technique into an X-validation operator in the main process view in Rapid Miner. Each X-validation operator contained one technique: the first contained naïve Bayes, the second the decision tree and the third the neural network. The X-validation operator executes a cross-validation process to evaluate the statistical performance of a learning operator, typically on unseen data sets. It is mainly applied to assess how accurately a model (learnt via a specific learning operator) will perform in practice. The X-validation operator is a nested operator that holds two sub-processes: a training sub-process and a testing sub-process. The training sub-process trains a model, and the trained model is then applied in the testing sub-process, where its performance is measured. The input data created by the process document operator is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing data set, which is the input of the testing sub-process, while the remaining k – 1 subsets are used as the training data set, which is the input of the training sub-process. The cross-validation process is then repeated k times so that each of the k subsets is used exactly once as the testing data. The k results from the k iterations can then be averaged, or otherwise combined, to produce a single estimate. The value of k can be adjusted with the number of validations parameter. Moreover, the learning process in this phase typically optimises the model so that it fits the training data as closely as possible.
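The TF-IDF weighting and absolute pruning described in step 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Rapid Miner implementation: the function name `tfidf_with_pruning`, the toy messages, and the plain textbook formula (term frequency multiplied by the logarithm of the inverse document frequency) are all assumptions, and Rapid Miner may normalise its word vectors differently. Pruning is assumed to operate on document frequency, so that a term is kept only if it occurs in at least 2 and at most 80,006 documents.

```python
import math
from collections import Counter


def tfidf_with_pruning(docs, prune_below=2, prune_above=80006):
    """Return (vocabulary, list of {term: weight} dicts), one dict per document."""
    # Lower-casing and whitespace splitting stand in for the
    # transform case and tokenize operators.
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Absolute pruning (assumed semantics): drop terms found in fewer than
    # prune_below documents or in more than prune_above documents.
    vocab = sorted(t for t, df in doc_freq.items()
                   if prune_below <= df <= prune_above)
    kept = set(vocab)

    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in kept)
        weights = {}
        for term, count in counts.items():
            tf = count / len(tokens)                 # frequency within this document
            idf = math.log(n_docs / doc_freq[term])  # larger for terms rare in the corpus
            weights[term] = tf * idf
        vectors.append(weights)
    return vocab, vectors


# Hypothetical toy messages, not taken from the thesis data set.
toy_docs = ["kamu baik sekali", "kamu pintar sekali", "kamu hebat"]
vocab, vectors = tfidf_with_pruning(toy_docs)
print(vocab)
print(vectors)
```

In this toy corpus the words that occur in only one message are pruned, and a word that occurs in every message receives a weight of zero, which is the counterbalancing effect of the corpus frequency.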
The three stages above were connected into a single process, which was then run. The three classifier techniques were placed in three X-validation operators to estimate the statistical performance of each learning model in terms of data prediction. Each operator then measured the accuracy, precision and recall of its model. The results are presented in Table 18.
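The cross-validation procedure carried out by each X-validation operator can be sketched as follows. This is a minimal sketch under stated assumptions, not Rapid Miner's implementation: the helper names (`k_fold_indices`, `cross_validate`, `train_majority`, `accuracy_of`) are hypothetical, the folds are taken as contiguous blocks without shuffling or stratification, and a trivial majority-class learner stands in for naïve Bayes, the decision tree or the neural network.

```python
from collections import Counter


def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if k < 2 or k > n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    # Fold sizes differ by at most one example.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(examples, labels, k, train_fn, score_fn):
    """Train on k - 1 folds, test on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [examples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)


def train_majority(train_examples, train_labels):
    """Placeholder learner: the 'model' is simply the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy_of(model, test_examples, test_labels):
    """Fraction of test labels equal to the placeholder model's single prediction."""
    return sum(1 for y in test_labels if y == model) / len(test_labels)


# Ten hypothetical examples split into three folds.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), "training examples;", "test fold:", test)
```

Every example appears in exactly one test fold, so each message is used once for testing and k – 1 times for training.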
Table 18 shows the model’s performance on the training set based on the estimation of accuracy, precision and recall. The three techniques performed differently in predicting the cyberbullying and non-cyberbullying classes.
Table 18 Performance Model Training Set using Naïve Bayes, Decision Tree (C4.5), and Neural Network

Class labels of data            Parameters     Naïve Bayes    Decision tree (C4.5)    Neural network
Prediction Cyberbullying        % Accuracy     100            99.97                   99.99
Prediction Cyberbullying        % Precision    100            100                     100
Prediction Cyberbullying        % Recall       100            99.96                   99.99
Prediction Non-cyberbullying    % Accuracy     100            100                     100
Prediction Non-cyberbullying    % Precision    100            99.85                   99.97
Prediction Non-cyberbullying    % Recall       100            100                     100
Although naïve Bayes is a basic classifier technique, in this model it achieved 100% accuracy, precision, and recall for both the cyberbullying and non-cyberbullying classes. This indicates that naïve Bayes performs well in relation to document retrieval; in this case, the naïve Bayes parameters were left at their default settings. The decision tree had an accuracy of 99.97% in the cyberbullying class and 100% in the non-cyberbullying class. Its precision and recall values differed between the two classes: in the cyberbullying class, precision was 100% and recall was 99.96%, whereas in the non-cyberbullying class, precision was 99.85% and recall was 100%. This is due to the setup of parameters having different values based on the weighting.
The performance of the neural network also differed between the cyberbullying and non-cyberbullying classes. Its accuracy was 99.99% in the cyberbullying class and 100% in the non-cyberbullying class. The precision and recall values of the two classes were also different, following the same pattern as the decision tree: precision was 100% for the cyberbullying class, compared to 99.97% for the non-cyberbullying class, while recall was 99.99% for the cyberbullying class and 100% for the non-cyberbullying class.
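The relationship between the per-class precision and recall values can be illustrated with a short calculation. The sketch below is an assumption-laden example rather than the evaluation used in this research: the function names and the eight toy labels are hypothetical, and the standard definitions are used (precision is the share of predictions for a class that are correct, and recall is the share of actual members of a class that are found).

```python
def precision_recall(actual, predicted, positive):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


# Hypothetical toy labels: one cyberbullying message is predicted
# as non-cyberbullying, and every other message is classified correctly.
actual = ["cyberbullying"] * 4 + ["non-cyberbullying"] * 4
predicted = ["cyberbullying"] * 3 + ["non-cyberbullying"] * 5

cb_precision, cb_recall = precision_recall(actual, predicted, "cyberbullying")
non_precision, non_recall = precision_recall(actual, predicted, "non-cyberbullying")
overall_accuracy = accuracy(actual, predicted)
print(cb_precision, cb_recall, non_precision, non_recall, overall_accuracy)
```

In a two-class problem, a cyberbullying message that is predicted as non-cyberbullying lowers the recall of the cyberbullying class and the precision of the non-cyberbullying class, while leaving the other two values at 100%. The decision tree and neural network figures in Table 18 show this same pattern.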