Results - Implementation of Machine Learning for Data Analysis

Chapter 4 Identifying Indonesian Cyberbullying Messages

4.5 Implementation of Machine Learning for Data Analysis

4.5.3 Results

This section describes the results of predicting whether Twitter data are instances of cyberbullying or non-cyberbullying. High accuracy was important because this research focused on identifying cyberbullying messages. Although cyberbullying messages have been analysed in previous research, the results presented here contribute to knowledge, specifically regarding the development of an analysis model using three classification techniques.

As mentioned in Table 18, the performance of the model on the training data set had to be considered before the model could be used for analysis. The model achieved a high level of accuracy, precision and recall, so there was high confidence in running it on the data set. After running the machine learning process in RapidMiner, the results were converted into tables and figures to present the data in more detail.


Figure 19 Label Attribute Data Class into Cyberbullying and Non-Cyberbullying

Figure 19 shows the label attribute for both the cyberbullying and non-cyberbullying classes. The analysis model successfully divided the data into these two classes. Figure 19 demonstrates that 23 insulting words were spread across both classes. This indicates that a message containing insulting words may still be non-cyberbullying, depending on its content and context. Figure 19 also illustrates that the posterior probability of the cyberbullying label was 0.812 and that of the non-cyberbullying label was 0.188. This suggests that the cyberbullying class was dominant in the data set compared to the non-cyberbullying class. In other words, most items in the training data set were considered to be instances of cyberbullying.
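If the two label probabilities are read as the share of training items in each class (an interpretation; the text reports them as posterior probabilities of the label), they can be obtained by simple counting. The following is a minimal sketch under that assumption. The label list is hypothetical and was chosen only so that the proportions match the reported 0.812 and 0.188; it is not the thesis data.

```python
from collections import Counter


def class_priors(labels):
    """Return the fraction of training items carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical training labels: 812 cyberbullying, 188 non-cyberbullying.
labels = ["cyberbullying"] * 812 + ["non-cyberbullying"] * 188
priors = class_priors(labels)
print(priors)
```

A class with a larger share starts with a higher probability before any insulting word in the message is taken into account.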

Based on the experiment results, the density of insulting words was also indicated. There were six insulting words with a high density estimate: babi (pig), bajingan (scoundrel), bangsat (bastard), goblok (stupid), iblis (devil), and monyet (monkey). It can be seen that the six insulting words had a major influence on the labelling and allocation of data to the cyberbullying class. The density graph for the term iblis, one of the six insulting words, is depicted in Figure 20.

Figure 20 demonstrates that the term iblis has a high density in both the cyberbullying and non-cyberbullying classes. It can be assumed that the six insulting words are those that most strongly suggest cyberbullying or, equally reasonably, that these words were simply used more often than the other, rarer insulting words that appeared in the messages.

Figure 20 Density of the Iblis term in Cyberbullying and Non-Cyberbullying Class

[Figure 20 plot; axis labels: Iblis, Density]
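The text does not state how the density in Figure 20 was estimated. A common choice in naïve Bayes implementations is to fit one Gaussian per class to the values of a numeric attribute. The sketch below assumes that approach; the per-message counts of the term iblis are hypothetical and serve only to illustrate the calculation.

```python
from statistics import NormalDist

# Hypothetical per-message counts of the term "iblis" (not thesis data).
iblis_counts = {
    "cyberbullying": [0, 1, 1, 2, 1],
    "non-cyberbullying": [0, 0, 1, 0, 1],
}

# Fit one Gaussian per class from the sample mean and standard deviation.
densities = {
    label: NormalDist.from_samples(counts)
    for label, counts in iblis_counts.items()
}

# Evaluate each class density for a message that contains the term once.
for label, dist in densities.items():
    print(label, round(dist.pdf(1), 3))
```

A term whose density is high in both classes, as described for iblis, does not separate the classes on its own; the classifier has to rely on the other attributes as well.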

Another parameter produced when the model was generated was a confidence measurement for both the cyberbullying and non-cyberbullying classes. Parker et al. (2005) proposed that one output of a classification system is a confidence evaluation. The classification confidence measurement, in this case, is understood to be the parameter of a correct label (Akthar and Hahne, 2012). A high value indicates that there is a high probability that the model is correct (Gelman et al., 2014). In this scenario, the confidence evaluation result was confirmed by the result from generating the entire model in terms of data set analysis. An example of a confidence measurement is presented in Table 19.

Table 19 Example of Calculated Confidence Measure for Labelling Correctness

| No. | Tweet | Confidence of cyberbullying | Confidence of non-cyberbullying | Prediction Class |
| --- | --- | --- | --- | --- |
| 1 | Avanya andaiy, avanya♥RT mohamadDFDM: "sarap!! addnan_ch: Oh anjing goblog fakyu siah monyet setan babi alas bangsat shit damn aaaaahhhh an | 1 | 1.23E-15 | cyberbullying |
| 2 | GILA GILA ANJING GANGERTI SM HENRY BANGSAT GUE GAKUAT WOY MSH GO NIH WOY | 1 | 8.60E-12 | cyberbullying |
| 3 | sarap!!addnan_ch: Oh anjing goblog fakyu siah monyet setan babi alas bangsat shit damn aaaaahhhh anjing anjing ava nya si eta ah anjing ah. Cageur?addnan_ch: Oh anjing goblog fakyu siah monyet setan babi alas bangsat shit damn aaaaahhhh anjing anjing ava nya si eta ah anjing ah | 1 | 1.23E-15 | cyberbullying |
| 4 | Bangsat deuh tolol =)) RT"syemaAT: Anjing lu setan" | 1 | 1.47E-15 | cyberbullying |

The confidence measurement for the whole classification system is simplified as a number between 0 and 1 (Parker et al., 2005). When the resulting value is closer to 1, the estimate that the class condition is true is high. In contrast, when the resulting value is closer to 0, the estimate that the class condition is false is high.
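The way the confidences were calculated by the tool is not described here. One common way to obtain values between 0 and 1 that sum to 1 is to normalise the per-class scores of the classifier. The sketch below assumes this approach; the function names and the log scores are hypothetical and are not taken from the thesis.

```python
import math


def confidences(log_scores):
    """Turn per-class log scores into confidences that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}


def predict(log_scores):
    """Return the class with the highest confidence, plus all confidences."""
    conf = confidences(log_scores)
    return max(conf, key=conf.get), conf


# Hypothetical log scores for one tweet (not taken from the thesis).
label, conf = predict({"cyberbullying": -12.0, "non-cyberbullying": -46.3})
print(label, conf)
```

With a gap this large between the two scores, the confidence of the losing class is of the order of 1E-15 and the confidence of the winning class is printed as 1 when rounded, which is the pattern seen in Table 19.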

The developed model divided the 152,843 data items into two classes: 122,842 items (80.37%) in the cyberbullying class and 30,001 items (19.63%) in the non-cyberbullying class. A close analysis using the three classification techniques produced labels in which far more data were predicted as cyberbullying than as non-cyberbullying. Figure 21 shows the distribution of the two classes.
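The reported percentages can be checked directly against the item counts. The short calculation below uses only the figures given above; the variable names are illustrative.

```python
total_items = 152_843
cyberbullying_items = 122_842
non_cyberbullying_items = 30_001

# The two class counts should add up to the full data set.
assert cyberbullying_items + non_cyberbullying_items == total_items

cyberbullying_share = 100 * cyberbullying_items / total_items
non_cyberbullying_share = 100 * non_cyberbullying_items / total_items
print(f"{cyberbullying_share:.2f}% {non_cyberbullying_share:.2f}%")
```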

Figure 21 Chart of the Result Prediction Cyberbullying and Non-Cyberbullying Data

[Figure 21 chart, "Portions of Data Distribution into Classes": cyberbullying 80.37%, non-cyberbullying 19.63%]

Figure 21 shows that most of the data were detected as cyberbullying messages and a small portion was identified as non-cyberbullying messages. Since the models used for analysing cyberbullying data discussed above performed with high accuracy, the result after testing the data set was that 80.37% of the total data fell in the cyberbullying class, while only 19.63% fell in the non-cyberbullying class. This small proportion may comprise expressions of exasperation towards oneself in various contexts, as in "Bangsat gua terus mendapatkan sial, keparat emang" (translation: Bastard, I always receive bad luck, what an a*s). Other cases are annoyed or negative expressions towards the writer's own pets, such as "Sial anjing gue gigit tangan gue" (translation: sh*t my dog just bit me), or positive reactions to their pets, for instance "Anjing gue ternyata lucu juga" (translation: My dog is actually hilarious too). Insulting words may also be used to convey what people are feeling at the moment, for example "AC keparat dinginnya" (translation: The air conditioner is f***ing cold). The 19.63% of non-cyberbullying data signifies that the insulting words used in these messages tend to be independent, with no relationship among the words and messages.

4.6 Conclusion

In this chapter, a method of detecting cyberbullying based on the number of insulting words was implemented using a classification model. The classification techniques employed were naïve Bayes, decision tree and neural network. With the assistance of RapidMiner, it was possible to apply the three classification techniques simultaneously.

The experiment results showed that several phases were necessary to achieve high accuracy in data prediction; specifically, the combined application of unsupervised and supervised learning techniques. Unsupervised learning, in the form of k-medoids, was used to divide the training data set into two clusters. The purpose of this was to group the data and label it as either cyberbullying or non-cyberbullying based on the data patterns that emerged, as described in the previous chapter. Moreover, to obtain highly accurate labelling of the training data set, k-medoids and the patterns of relationships between itemsets were applied to validate the results of the labelled classes.
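The clustering step can be illustrated with a small k-medoids sketch. This is a generic alternating version written for illustration, not the RapidMiner operator used in the experiment. The distance measure, the two numeric features per message and the starting medoids are all assumptions.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(points, medoid_indices, max_iter=100):
    """Alternate between assigning points to the nearest medoid and
    re-picking each cluster's medoid until the medoids stop changing.
    The starting medoids must have distinct coordinates."""
    medoids = list(medoid_indices)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: manhattan(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimises total distance in its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(manhattan(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            return clusters
        medoids = new_medoids
    return clusters


# Hypothetical messages described by two numeric features each.
points = [(0, 0), (1, 0), (0, 1), (5, 4), (6, 5), (5, 5)]
clusters = k_medoids(points, medoid_indices=[0, 1])
print(clusters)
```

Each medoid is an actual data item, so a cluster can be inspected through its representative message before the cluster is given a cyberbullying or non-cyberbullying label.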

Supervised learning techniques, namely naïve Bayes, decision tree and neural network, were connected synchronously and adopted to analyse the data. The process of data analysis required an examination of the performance of the analysis model. The main goal of this process was to achieve a high level of accuracy in data class prediction. Hence, precision and recall were also measured for the model. As a result, the model had a high level of accuracy, precision and recall, and there was therefore great confidence in using it to analyse messages from Twitter. Moreover, the analysis model classified 80.37% of the data as cyberbullying messages and 19.63% as non-cyberbullying messages.
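Accuracy, precision and recall follow from comparing predicted labels with the known labels of a labelled set. The sketch below uses the standard definitions, with cyberbullying treated as the positive class. The function name and the six example labels are hypothetical and are not results from this research.

```python
def classification_metrics(actual, predicted, positive="cyberbullying"):
    """Compute accuracy, precision and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision: share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: share of true positives that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels for six messages (not thesis data).
actual = ["cyberbullying"] * 4 + ["non-cyberbullying"] * 2
predicted = ["cyberbullying", "cyberbullying", "cyberbullying",
             "non-cyberbullying", "cyberbullying", "non-cyberbullying"]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when one class dominates, as here, because a model that labels everything as the majority class can still score a high accuracy.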

This research contributes to the development of the analysis model and the implementation of data mining techniques for the study of cyberbullying messages by applying unsupervised and supervised learning techniques. Furthermore, these results provide important data that can be used in further research on the development and application of data mining techniques, by building analysis models for social issues.

The next chapter presents the analysis and discussion of the results obtained through both association rules and classification techniques. The outcomes explored in this chapter, which determine whether messages are classified as cyberbullying or non-cyberbullying, will be used as the basis for constructing a framework and identifying cyberbullying messages from the social network. Furthermore, the experiment used for the development of the analysis model can serve to improve the further use of machine learning for data derived from social networks.


Chapter 5

In document Analysis of the Indonesian Cyberbullying through Data Mining: The Effective Identification of Cyberbullying through Characteristics of Messages (Page 173-180)