
5.3.1 Selecting the Active Learning Strategy

Having adopted the relative neutral zone from Section 4.1.1.2 for use in the dynamic active learning setting, the first step was to calculate the average distances of training examples from the SVM hyperplane. The average distance of positive training examples was 0.0097 and the average distance of negative training examples was -0.0169.³ In order to adapt the neutral zone, i.e., the average distances, to changes in the data stream, the formula for calculating adaptive average distances of positive and negative training examples, given in Equation 5.1, was applied after processing every batch from the Twitter data stream, provided that the processed batch contained some positive/negative training tweets (see Algorithm 5.1).

We examined the impact of the α parameter from Equation 5.1, which controls the influence of additional training tweets on the positive and negative average distances. First, we experimented with α = 0, meaning that the positive and negative average distances of training examples were constant throughout the whole experiment. In the second experiment, the value of α was set to 0.1, meaning that the examples from a processed batch had a 10% influence on the average distances. Tables 5.1 and 5.2 show the results for α = 0 and α = 0.1, respectively, presented in terms of average F-measure values (± std. deviation) over all batches from the tweet data stream. In both experiments we employed all the active learning strategies discussed in Section 5.2.3 and the strategy which did not employ active learning (marked as “No AL” in the tables), meaning that the sentiment classifier was not updated over time. In the experiments we varied the reliability threshold from 0 to 0.5, in steps of 0.1.
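Equation 5.1 is not reproduced in this section, so the following is only a minimal sketch of the adaptive update. It assumes an exponential-moving-average form, which is consistent with the description above (α = 0 keeps the averages constant, α = 0.1 gives the batch a 10% influence). The function name and the batch values are hypothetical.

```python
from typing import Sequence


def update_average_distance(
    current_avg: float,
    batch_distances: Sequence[float],
    alpha: float,
) -> float:
    """Blend the running average with the batch mean (assumed form of Equation 5.1)."""
    if not batch_distances:
        # No training tweets of this class in the batch: keep the average unchanged.
        return current_avg
    batch_avg = sum(batch_distances) / len(batch_distances)
    return (1.0 - alpha) * current_avg + alpha * batch_avg


# Initial average distances reported in the text.
avg_pos, avg_neg = 0.0097, -0.0169

# Hypothetical hyperplane distances of newly labeled tweets in one batch.
batch_pos = [0.004, 0.012]
batch_neg = [-0.020]

new_pos = update_average_distance(avg_pos, batch_pos, alpha=0.1)
new_neg = update_average_distance(avg_neg, batch_neg, alpha=0.1)

# With alpha = 0 the averages never move, whatever the batch contains.
frozen_pos = update_average_distance(avg_pos, batch_pos, alpha=0.0)
```

In this sketch an empty class in a batch simply skips the update, which mirrors the condition that the batch must contain some positive/negative training tweets.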

The results in Table 5.1, where α = 0, indicate that, in general, active learning improves the performance of the classifiers compared to the strategy which does not employ active learning. Moreover, for all the active learning strategies, the results show that, in terms of the F-measure, it is better to have a very small reliability threshold. It can also be observed that the F-measure values are very similar across the different active learning strategies for the same reliability threshold value. This could be a consequence of the unchanged average positive/negative distances of training tweets, even though new training tweets were used for periodically updating the sentiment model.
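To make the role of the reliability threshold concrete, the sketch below shows one possible reading of the relative neutral zone. The actual definition is given in Section 4.1.1.2 and is not reproduced here; the sketch assumes that reliability is the distance from the hyperplane divided by the average distance of the class on that side, capped at 1, and that a tweet is treated as neutral when its reliability is below the threshold. Both function names are hypothetical.

```python
def reliability(distance: float, avg_pos: float, avg_neg: float) -> float:
    """Assumed relative reliability: distance over the class average, capped at 1."""
    avg = avg_pos if distance >= 0 else avg_neg
    if avg == 0.0:
        return 0.0  # degenerate average: treat the prediction as unreliable
    return min(distance / avg, 1.0)


def classify(distance: float, avg_pos: float, avg_neg: float, threshold: float) -> str:
    """Label a tweet, returning 'neutral' when reliability is below the threshold."""
    if reliability(distance, avg_pos, avg_neg) < threshold:
        return "neutral"
    return "positive" if distance >= 0 else "negative"


# Average distances reported in the text.
AVG_POS, AVG_NEG = 0.0097, -0.0169

# Raising the threshold widens the neutral zone for the same tweet.
labels = [classify(0.003, AVG_POS, AVG_NEG, t) for t in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)]

# If the average positive distance drifts below zero, a tweet on the positive
# side gets a negative reliability and is labeled neutral at every threshold.
inverted = classify(0.003, -0.002, AVG_NEG, 0.0)
```

Under these assumptions, a larger threshold moves more tweets into the neutral class. The last line illustrates the sign inversion that the text later reports for the “AL closest to NZ” strategy.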

On the other hand, the results for α = 0.1, in Table 5.2, are more diverse and show larger differences between the F-measures of the active learning strategies, which leads to the conclusion that, by dynamically adjusting the neutral zone through the average positive/negative distances of training tweets, the approach becomes more sensitive to the query strategy. It is interesting to notice that the strategy which did not employ active learning was in general better than the Active random 100% strategy. This implies that a completely random choice of tweets for manual labeling is not a good one in this setting, since the randomly chosen tweets and their distances from the SVM hyperplane worsen the classifier performance. Recall that when the average distances of training tweets were unchanged, for α = 0 (see Table 5.1), the Active random 100% strategy was better than the strategy without active learning. Moreover, it can be noticed that the F-measure values in Table 5.2 for the Active learning combination strategies are the highest, compared both to the other values in Table 5.2 and to those in Table 5.1. Therefore, if the proper active learning strategy is applied, it is beneficial to dynamically update the neutral zone through the average positive/negative distances. For that reason, for the rest of the experiments, we focused mainly on the setting where the neutral zone was dynamically adjusted, i.e., where α = 0.1.

³ Note that the values for average distances of training examples are different from the ones in Section 4.3.2. This is a consequence of using a different SVM implementation in the active learning experiments, i.e., we used the Pegasos SVM implementation instead of SVMperf, which we used in the static setting. We found the Pegasos SVM better for use in the stream-based environment since it is faster, allows incremental learning, and is adapted for learning from large datasets. See the discussion in Section 5.2.1.

54 Chapter 5. Dynamic Predictive Twitter Sentiment Analysis

Table 5.1: Values of average F-measure ± std. deviation for different strategies, while changing the size of the reliability threshold for α = 0.

Reliability threshold   0             0.1           0.2           0.3           0.4           0.5

Select 10 of 100
AL closest to NZ        0.5512±0.12   0.5396±0.12   0.5282±0.11   0.5165±0.11   0.5018±0.11   0.4802±0.11
AL comb. 20% rand.      0.5512±0.12   0.5396±0.12   0.5281±0.11   0.5164±0.11   0.5017±0.11   0.4803±0.11
AL comb. 50% rand.      0.5513±0.12   0.5398±0.12   0.5283±0.11   0.5165±0.11   0.5016±0.11   0.4803±0.11
AL rand. 100%           0.5514±0.12   0.5399±0.12   0.5281±0.11   0.5169±0.11   0.5017±0.11   0.4804±0.11
No AL                   0.5500±0.12   0.5389±0.12   0.5277±0.11   0.5162±0.11   0.5004±0.11   0.4787±0.10

Select 10 of 50
AL closest to NZ        0.5466±0.14   0.5342±0.14   0.5221±0.14   0.5103±0.14   0.4956±0.13   0.4756±0.13
AL comb. 20% rand.      0.5466±0.14   0.5339±0.14   0.5220±0.14   0.5103±0.14   0.4957±0.13   0.4757±0.13
AL comb. 50% rand.      0.5465±0.14   0.5340±0.14   0.5219±0.14   0.5104±0.14   0.4957±0.13   0.4758±0.13
AL rand. 100%           0.5466±0.14   0.5341±0.14   0.5222±0.14   0.5109±0.14   0.4963±0.13   0.4762±0.13
No AL                   0.5444±0.14   0.5329±0.14   0.5213±0.14   0.5094±0.14   0.4938±0.13   0.4731±0.13

Table 5.2: Values of average F-measure ± std. deviation for different strategies, while changing the size of the reliability threshold for α = 0.1. Significance of differences in performance of the strategies can be observed in Figures 5.1, 5.2, and 5.3.

Reliability threshold   0             0.1           0.2           0.3           0.4           0.5

Select 10 of 100
AL closest to NZ        0.5512±0.12   0.5808±0.12*  0.5800±0.10*  0.5923±0.10*  0.5356±0.11*  0.5765±0.10*
AL comb. 20% rand.      0.5530±0.12   0.5463±0.12   0.5432±0.11   0.5375±0.11   0.5289±0.11   0.5102±0.11
AL comb. 50% rand.      0.5513±0.12   0.5415±0.12   0.5320±0.12   0.5246±0.11   0.5116±0.11   0.4831±0.11
AL random 100%          0.5514±0.12   0.5335±0.11   0.5164±0.11   0.4961±0.11   0.4638±0.11   0.4323±0.11
No AL                   0.5500±0.12   0.5389±0.12   0.5277±0.11   0.5162±0.11   0.5004±0.11   0.4787±0.10

Select 10 of 50
AL closest to NZ        0.5766±0.15*  0.5682±0.14   0.6349±0.12*  0.6348±0.11*  0.6250±0.11*  0.5114±0.14
AL comb. 20% rand.      0.5466±0.14   0.5398±0.14   0.5382±0.14   0.5328±0.14   0.5237±0.14   0.5172±0.14
AL comb. 50% rand.      0.5464±0.14   0.5359±0.14   0.5262±0.14   0.5173±0.14   0.4957±0.14   0.4690±0.13
AL random 100%          0.5466±0.14   0.5299±0.14   0.5153±0.14   0.4967±0.14   0.4751±0.13   0.4521±0.13
No AL                   0.5444±0.14   0.5329±0.14   0.5213±0.14   0.5094±0.14   0.4938±0.13   0.4731±0.13

* sample contains less than 50% of all data⁴
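The query strategies compared in Tables 5.1 and 5.2 are defined in Section 5.2.3, which is not reproduced here. As a rough illustration only, the sketch below assumes that a “combination p% random” strategy fills p% of the queried tweets at random and the rest with the tweets closest to the neutral zone, and it approximates that closeness by the absolute distance from the SVM hyperplane. The function name and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def select_queries(
    distances: Sequence[float],
    n_queries: int = 10,
    random_fraction: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return indices of the tweets in one batch to send for hand-labeling."""
    rng = rng or random.Random()
    n_queries = min(n_queries, len(distances))
    n_random = round(n_queries * random_fraction)
    n_closest = n_queries - n_random

    # Tweets nearest the hyperplane stand in for "closest to the neutral zone".
    by_closeness = sorted(range(len(distances)), key=lambda i: abs(distances[i]))
    closest = by_closeness[:n_closest]

    # Fill the remaining slots uniformly at random from the rest of the batch.
    random_picks = rng.sample(by_closeness[n_closest:], n_random)
    return closest + random_picks


# "Select 10 of 100" with 20% random: a hypothetical batch of 100 distances.
batch = [(i - 50.3) / 1000 for i in range(100)]
picked = select_queries(batch, n_queries=10, random_fraction=0.2, rng=random.Random(0))
```

Setting `random_fraction` to 1.0 corresponds to the fully random strategy and 0.0 to a pure closest-to-neutral-zone strategy under the same assumptions.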

The results of the Friedman test (M. Friedman, 1937, 1940) with the Iman-Davenport improvement (Iman & Davenport, 1980) and its corresponding post-hoc Nemenyi test (Nemenyi, 1963) for different active learning strategies are graphically represented using critical diagrams. Figure 5.1 shows the results of the analysis of the F-measures from Table 5.2. The diagram presents the mean ranks of the active learning settings, with the lowest (best) ranks on the right side. The critical distance, which connects the settings that are not significantly different, is shown at the top of the graph. From these results we can draw several conclusions. Overall, the best setting for active learning is to choose 10 tweets in each batch of 100 tweets and use the querying strategy “Active learning combination 20% random”. This setting is significantly better than “Select 10 of 50 Active Random 100%”, “Select 10 of 100 Active Random 100%” and “Select 10 of 50 no active learning”. Moreover, it seems that in general the active combination strategies are the best ones. Also, from the figure it can be observed that the “Select 10 of 100” batch selection is in general better than the “Select 10 of 50” batch selection, since most of the strategies on the right-hand side of the figure employ the “Select 10 of 100” batch selection.

⁴ The “AL closest to NZ” strategy for α = 0.1 proved to be unreliable, since many batches did not contain tweets classified as positive, leading to missing F-measure values for such batches. After examining this phenomenon, we found that it was a consequence of the average positive distance becoming negative: even though a tweet was positioned at a positive distance from the SVM hyperplane in the classification process, its classification reliability was computed as a negative number, and consequently the tweet was classified as neutral. The average F-measure values that were calculated on such an insufficient sample are marked with an asterisk in Table 5.2. As a consequence of this phenomenon, we did not use the “AL closest to NZ” strategy in the following experiments.

5.3. Experimental Results 55

[Figure 5.1: critical diagram. Rank axis: 8 to 1. Settings shown: 10_100_Active_Combination20, 10_50_Active_Combination20, 10_100_Active_Combination50, 10_100_NoActiveLearning, 10_50_Active_Combination50, 10_50_NoActiveLearning, 10_100_Active_Random100, 10_50_Active_Random100. Critical Distance = 4.28648]

Figure 5.1: Visualization of Nemenyi post-hoc tests for the active learning strategies on data from Table 5.2.
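The ranking behind Figure 5.1 can be approximately re-derived from Table 5.2. The sketch below computes mean ranks, the Friedman statistic and the Iman-Davenport statistic using the standard textbook formulas. It assumes that each of the six reliability thresholds is treated as one block (N = 6) and that the “AL closest to NZ” rows are excluded; the former is inferred from the reported critical distance rather than stated in the text. All function names are hypothetical.

```python
from typing import Dict, List

# Average F-measures from Table 5.2 (alpha = 0.1), one value per reliability
# threshold (0, 0.1, ..., 0.5), without the "AL closest to NZ" rows.
F_MEASURES: Dict[str, List[float]] = {
    "10_100_Active_Combination20": [0.5530, 0.5463, 0.5432, 0.5375, 0.5289, 0.5102],
    "10_100_Active_Combination50": [0.5513, 0.5415, 0.5320, 0.5246, 0.5116, 0.4831],
    "10_100_Active_Random100": [0.5514, 0.5335, 0.5164, 0.4961, 0.4638, 0.4323],
    "10_100_NoActiveLearning": [0.5500, 0.5389, 0.5277, 0.5162, 0.5004, 0.4787],
    "10_50_Active_Combination20": [0.5466, 0.5398, 0.5382, 0.5328, 0.5237, 0.5172],
    "10_50_Active_Combination50": [0.5464, 0.5359, 0.5262, 0.5173, 0.4957, 0.4690],
    "10_50_Active_Random100": [0.5466, 0.5299, 0.5153, 0.4967, 0.4751, 0.4521],
    "10_50_NoActiveLearning": [0.5444, 0.5329, 0.5213, 0.5094, 0.4938, 0.4731],
}


def average_ranks(scores: List[float]) -> List[float]:
    """Rank scores in descending order (1 = best); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def mean_ranks(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-block ranks of every setting over all blocks."""
    names = list(table)
    n_blocks = len(next(iter(table.values())))
    totals = dict.fromkeys(names, 0.0)
    for b in range(n_blocks):
        block_ranks = average_ranks([table[name][b] for name in names])
        for name, rank in zip(names, block_ranks):
            totals[name] += rank
    return {name: totals[name] / n_blocks for name in names}


def friedman_chi2(ranks: Dict[str, float], n_blocks: int) -> float:
    """Friedman chi-square statistic computed from mean ranks."""
    k = len(ranks)
    sum_sq = sum(r * r for r in ranks.values())
    return 12.0 * n_blocks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def iman_davenport(chi2: float, k: int, n_blocks: int) -> float:
    """Iman-Davenport F statistic with (k-1) and (k-1)(N-1) degrees of freedom."""
    return (n_blocks - 1) * chi2 / (n_blocks * (k - 1) - chi2)


ranks = mean_ranks(F_MEASURES)
chi2 = friedman_chi2(ranks, n_blocks=6)
f_stat = iman_davenport(chi2, k=len(ranks), n_blocks=6)
best_to_worst = sorted(ranks, key=ranks.get)
```

Under these assumptions, `best_to_worst` starts with `10_100_Active_Combination20`, in agreement with the conclusion drawn from Figure 5.1. The statistic `f_stat` would then be compared against the F distribution with the stated degrees of freedom.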

Next, we applied the Friedman test with the Iman-Davenport improvement and the corresponding post-hoc significance test to the F-measure values of individual batches for the two analyzed batch selection strategies for α = 0.1. Figure 5.2 shows the results of the test for the “Select 10 of 50” batch selection, and Figure 5.3 shows the results for the “Select 10 of 100” batch selection. From both figures it follows that the strategies with the active combination approach are better than the strategy without active learning and the active random strategy. In most cases the differences are also significant, as can be seen from the figures.

In Smailović et al. (2014) we performed an additional experiment where incremental active learning was performed from Baidu Twitter data only; that is, instead of learning from the smiley-labeled dataset, the active learning algorithm conducted the initial learning from 100 positive and 100 negative tweets chosen from the first 1,000 hand-labeled financial tweets from the Baidu dataset. Using this initial model, the algorithm selected a set of financial tweets from the first batch of data from the Baidu tweet data stream to query for their labels. Based on these hand-labeled financial tweets, the model was updated and the process was repeated for the next batch of Baidu tweets, until the end of the simulated data stream was reached. The results indicated that the classifier learned on such a small initial dataset, although hand-labeled and specific to the financial domain, was highly unstable. The sentiment classifier learned on this dataset classified all tweets at the beginning of the data stream as negative. Then, as a consequence of active learning and improving the classifier with new labeled tweets, the classifier improved and started to classify new tweets as positive or negative. This improvement lasted for several batches, after which the classifier classified all the incoming tweets as positive. This behavior indicated that the classifier was highly unstable, since incremental learning introduced significant changes into the model with every new labeled tweet. Although this experiment in Smailović et al. (2014) was performed using a different active learning setting, it showed that, in general, a sentiment classifier is unstable if it is trained on a small dataset, even if such a dataset is manually labeled and domain specific.

[Figure 5.2: critical diagram. Rank axis: 4 to 1. Settings shown: Active_Combination20, Active_Combination50, NoActiveLearning, Active_Random100. Critical Distance = 0.126812]

Figure 5.2: Visualization of Nemenyi post-hoc tests for the “Select 10 of 50” batch selection for α = 0.1.

[Figure 5.3: critical diagram. Rank axis: 4 to 1. Settings shown: Active_Combination20, Active_Combination50, NoActiveLearning, Active_Random100. Critical Distance = 0.179339]

Figure 5.3: Visualization of Nemenyi post-hoc tests for the “Select 10 of 100” batch selection for α = 0.1.
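The critical distances shown in the three diagrams can be cross-checked against the usual Nemenyi formula, CD = q_α · sqrt(k(k+1) / (6N)), where k is the number of compared settings and N the number of blocks. The sketch below assumes a significance level of 0.05 and uses the standard tabulated q values for that level. The values of N (6 thresholds, 1368 batches of 50 tweets, 684 batches of 100 tweets) are not stated in this section; they are inferred because they reproduce the reported distances.

```python
import math

# Standard critical values q_alpha for the two-tailed Nemenyi test at
# alpha = 0.05, indexed by the number of compared settings k.
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def critical_distance(k: int, n_blocks: int) -> float:
    """Nemenyi critical distance for k settings ranked over n_blocks blocks."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_blocks))


# N values below are inferred assumptions, not figures taken from the text.
cd_fig_5_1 = critical_distance(k=8, n_blocks=6)     # eight settings, six thresholds
cd_fig_5_2 = critical_distance(k=4, n_blocks=1368)  # "Select 10 of 50" batches
cd_fig_5_3 = critical_distance(k=4, n_blocks=684)   # "Select 10 of 100" batches
```

Two settings differ significantly when their mean ranks are further apart than the critical distance. This explains why the distance is large for Figure 5.1, which rests on only a handful of blocks, and small for the per-batch analyses in Figures 5.2 and 5.3.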

In document SENTIMENT ANALYSIS IN STREAMS OF MICROBLOGGING POSTS. Jasmina Smailović (Page 79-82)