Context and the Supervised Learning Method

3.4 Empirical Results

3.4.5 Context and the Supervised Learning Method

Supervised models that do not include context degrade as they become more general

To showcase the need for context in the supervised method, we studied individual words models trained at different levels of our domain hierarchy. We started at the bottom level, where there were thirteen narrow domains. For each domain, we trained two models: one using the most frequent words and another additionally using words from the Hu and Liu lexicon. We continued at level 4, where there were four domains (vacuums, cameras, hotels, and appliances). For each domain and feature set, we again trained a supervised model. We repeated this up to the top level, where there was one broad domain. For each hierarchy level and feature set, we evaluated the resulting supervised models on the test sets falling within their scope. For instance, at level 4, we evaluated a vacuums model on the four test sets corresponding to the individual vacuum categories.
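The evaluation protocol described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the toy hierarchy, the category names, the tiny datasets, and the majority-class learner (`train_majority`) are all hypothetical placeholders standing in for the actual domains and supervised models. Levels are 1-based depths, with level 1 as the root.

```python
from collections import Counter, defaultdict

def scopes_at_level(paths, level):
    """Group leaf categories by their ancestor domain at the given 1-based depth."""
    groups = defaultdict(list)
    for leaf, path in paths.items():
        groups[path[level - 1]].append(leaf)
    return dict(groups)

def train_majority(examples):
    """Placeholder learner (assumption): always predicts the most common training label."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda text: label

def error_rate(model, examples):
    """Fraction of examples the model labels incorrectly."""
    wrong = sum(1 for x, y in examples if model(x) != y)
    return wrong / len(examples)

def evaluate_level(paths, train_sets, test_sets, level, train=train_majority):
    """Train one model per domain at `level` on the pooled data of its leaves,
    then test it on every leaf test set that falls within its scope."""
    errors = {}
    for domain, leaves in scopes_at_level(paths, level).items():
        pooled = [ex for leaf in leaves for ex in train_sets[leaf]]
        model = train(pooled)
        for leaf in leaves:
            errors[leaf] = error_rate(model, test_sets[leaf])
    return errors

# Hypothetical three-level hierarchy: root -> domain -> leaf category.
paths = {
    "cam_a": ("all", "cameras", "cam_a"),
    "cam_b": ("all", "cameras", "cam_b"),
    "hotels": ("all", "hotels", "hotels"),
}
train_sets = {
    "cam_a": [("sharp photos", 1), ("great zoom", 1), ("blurry", 0)],
    "cam_b": [("poor battery", 0), ("noisy", 0), ("heavy", 0)],
    "hotels": [("clean room", 1), ("friendly staff", 1), ("nice view", 1)],
}
test_sets = {
    "cam_a": [("good lens", 1), ("bad flash", 0)],
    "cam_b": [("weak battery", 0), ("broken", 0)],
    "hotels": [("quiet room", 1)],
}

for level in (3, 2, 1):
    print(level, evaluate_level(paths, train_sets, test_sets, level))
```

Passing the learner as the `train` argument keeps the hierarchy traversal separate from the choice of model, so the same loop could be reused for each feature set.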


Figure 3.11: Context acquisition. Error of supervised models trained at the five hierarchy levels, from level 5 (individual categories) to level 1 (products and hotels combined).

level 5 to level 3, at which point it started to harm performance, up to level 1. The model using both the frequent words and the lexicon words showed a similar but less pronounced behavior (Figure 3.11). It therefore seems that, when moving from level 5 (where individual categories resided separately) to level 3 (where electronics, appliances, and hotels resided separately), these models benefited from the additional training data obtained by merging related product categories. However, once we started merging datasets with more obvious differences, their performance gradually degraded. This confirms that a supervised model that does not model context cannot be competitive on broad domains.

Context helps broad supervised models become as powerful as specialized ones

We studied how human-generated context impacts the supervised method when integrated using the sentiment score extension. We extended the supervised model based on frequent and lexicon words. At the middle level in our hierarchy, this model produced a minimum error of 7.71%. At the top level, however, its error increased to 8.38%. We used the sentiment score extension to complement this latter model with the combined context for all domains. This decreased the error to 7.75%. The improvement was statistically significant on one camera category, one appliance category, and hotels (Table 3.6). We also performed an intermediate experiment where we complemented the supervised model with only the individual words in the combined context. This actually harmed the performance of the supervised model, increasing the error to 8.67%. Further adding the longer word combinations brought statistically significant improvements on two camera categories, three appliance categories, and hotels (Table 3.7). This shows that the improvement recorded when complementing with the full context model was due to these longer features. Therefore, even though it was not integrated in the training process, human-generated context improved the general supervised model and made it perform as well as its specialized counterpart.

space extension. At the middle level, we extended the three individual words supervised models with the three separate context models for electronics, appliances, and hotels, respectively. This decreased the error from 7.71% to 7.35% (Figure 3.12). The improvement was significant on two appliance categories. We also tested the intermediate effect of using only the individual words in these context models, which gave an error rate of 7.68%. Further adding the longer word combinations brought statistically significant improvements on the same appliance categories.

At the top level, we repeated this procedure and extended the individual words supervised model with the combined context. This decreased the error from 8.38% to 7.31%. The improvement was significant on two vacuum categories, three appliance categories, and hotels (Table 3.8). When we extended the supervised model only with the individual words in the combined context, we recorded an error rate of 8.19%. Further adding the longer features brought improvements that were statistically significant on the same categories (Table 3.9).

Therefore, context improved the individual words supervised models at both levels of detail, and the error decrease was mostly due to the longer word combinations. At the top level, this improvement was greater than the one we obtained with the sentiment score extension. More importantly, unlike the individual words supervised models, which decreased in performance when they became broader, the supervised models that incorporated context performed comparably in both their specialized and general versions. This means that human-generated context helped the supervised method scale to a broad domain.
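As a quick arithmetic check on the comparison above, the following sketch recomputes the gap between the general and specialized models and the relative error reductions from the error rates reported in this section. Only the numbers come from the text; the helper names (`relative_reduction`, `generalization_gap`) and dictionary keys are illustrative assumptions.

```python
# Error rates (%) as reported in this section.
BASE = {"middle": 7.71, "top": 8.38}          # supervised models without context
WITH_CONTEXT = {"middle": 7.35, "top": 7.31}  # same models extended with context
SENTIMENT_SCORE_TOP = 7.75                    # top level, sentiment score extension

def relative_reduction(before, after):
    """Fraction of the original error that was removed."""
    return (before - after) / before

def generalization_gap(errors):
    """Top-level error minus middle-level error, in percentage points."""
    return errors["top"] - errors["middle"]

# Gap between the general and the specialized models, without and with context.
print(round(generalization_gap(BASE), 2))
print(round(generalization_gap(WITH_CONTEXT), 2))

# Relative error reduction at the top level for the two ways of adding context.
print(round(100 * relative_reduction(BASE["top"], WITH_CONTEXT["top"]), 2))
print(round(100 * relative_reduction(BASE["top"], SENTIMENT_SCORE_TOP), 2))
```

Without context the general model is 0.67 percentage points worse than the specialized ones, whereas with context the two differ by only 0.04 points, which is what "performed comparably" refers to.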

Bigrams also improve the supervised method. However, intersecting them with the human-generated context makes them more efficient and still helps the method scale

We also studied how human-generated context compares to bigrams. At the middle level, we extended the three individual words supervised models with the three human-generated context models for electronics, appliances, and hotels, respectively. Then, for each of the three domains, we replaced the human-generated features with bigrams. We used as many bigrams as there were longer word combinations in the corresponding human-generated context. Finally, for each domain, we intersected the bigrams with the corresponding human-generated context model. At the top level, we repeated the same steps.
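One way to picture the bigram selection and intersection steps is the sketch below. It rests on several assumptions not stated in the text: bigrams are ranked by raw frequency, the human-generated context is represented as word pairs, and a bigram counts as shared when both of its words form a context pair regardless of order. The function names and the toy reviews are hypothetical.

```python
from collections import Counter

def bigrams(text):
    """Adjacent word pairs of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def top_bigrams(texts, k):
    """The k most frequent bigrams across all texts (frequency ranking is an assumption)."""
    counts = Counter(bg for t in texts for bg in bigrams(t))
    return [bg for bg, _ in counts.most_common(k)]

def intersect_with_context(candidates, context_pairs):
    """Keep candidate bigrams whose two words also form a pair in the
    human-generated context; word order is ignored (assumption)."""
    ctx = {frozenset(p) for p in context_pairs}
    return [bg for bg in candidates if frozenset(bg) in ctx]

# Toy data, purely illustrative.
reviews = ["battery life is short", "short battery life", "battery life is great"]
context_pairs = [("life", "battery"), ("clean", "room")]

# Use as many bigrams as there are word combinations in the context.
selected = top_bigrams(reviews, k=len(context_pairs))
shared = intersect_with_context(selected, context_pairs)
print(selected)
print(shared)
```

Setting `k` to the number of context pairs mirrors the text's choice of using as many bigrams as there were longer word combinations, which keeps the two feature sets comparable in size.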

At the middle level, the human-generated context and bigrams gave errors of 7.35% and 7.06%, respectively. Intersecting the two types of context decreased the error to 6.27% (Figure 3.12). The improvement was significant on four vacuum categories and one appliance category. At the top level, the human-generated context and bigrams gave errors of 7.31% and 7.02%, respectively, whereas intersecting the two decreased the error to 6.25%. The improvement was significant on one vacuum category and three appliance categories (Table 3.10). Therefore, both the human-generated context and the bigrams had nearly constant error rates across the two levels. This means that bigrams also helped the supervised method scale without harming performance. However, in both the specific and the general setups, intersecting the two types of context proved to be more efficient. This

In document Acquiring Broad Commonsense Knowledge for Sentiment Analysis Using Human Computation (Page 70-73)