3.3 Human Computation Task
3.3.3 Tutorial
To ensure that workers understand the task and are qualified to do it, we create a tutorial that explains the basic principles. The tutorial uses both text instructions and interactive quizzes. It starts by explaining the concepts of phrase, context, and polarity. Each explanation is followed by a quiz asking workers to solve part of a normal game round. The first quizzes ask workers to identify a phrase with a given polarity (Figure 3.2). The subsequent quizzes invite workers to identify contexts that make a given phrase first positive and then negative (Figure 3.3). Next, the tutorial challenges workers to solve a couple more quizzes emulating the full rounds in the game. Finally, the tutorial explains the round-based structure of the game, the scoring mechanism, and the animal puzzles (Figure 3.4). Workers cannot complete the tutorial unless they correctly solve all the quizzes. We thus ensure that only those workers with a good understanding move on to play the game.
3.3.4 Quality Assurance
The tutorial implicitly influences quality before the game, by allowing only the workers who have gained a good understanding to proceed to the real task. The scoring mechanism controls quality during the game, by encouraging workers to submit useful answers. As an additional safety measure, we allow workers to submit at most 200 answers, to prevent them from doing sloppy work as a result of fatigue. Moreover, if, after fifty answers, their average score falls below a predefined threshold, we again stop workers from solving further rounds (we come back to choosing this threshold when we describe the scoring mechanism below). In both cases, workers are seamlessly notified that no more rounds are available to them. We also control quality after the game, by first filtering out the workers that did a bad job, then by removing the remaining bad answers.
Figure 3.3: Context acquisition. Tutorial quiz that asks workers to identify a context which makes a phrase positive
Scoring Mechanism
We use a scoring mechanism that encourages workers to submit useful answers. We judge the usefulness of an answer using two criteria: whether it has common sense and whether it brings new knowledge. We consider an answer commonsensical if it is consistent with a scoring model that aggregates the workers’ activity in the game up to that point: this means that the answer agrees with the common judgement of previous workers. We consider an answer novel if it has a great impact on the scoring model: this means that the answer is submitted early on and that it contains a phrase that requires a context for polarity clarification, along with such a disambiguating context. We compute score rewards by adding an agreement score to a novelty score. Even if our scoring model contains some initial mistakes, workers do not know where these occur. They thus need to consistently provide useful answers to maximize their score. As a result, any initial errors in the model should be corrected over time.
Figure 3.4: Context acquisition. Tutorial explanation of the game rounds, scoring, and puzzles
Our scoring model contains beliefs about the polarities of phrase and context combinations. This model attaches a Beta distribution [32] to each phrase and context pair. From this distribution, we can estimate the probabilities that, in that context, the phrase has positive and negative polarities, respectively. We initialize these Beta distributions using corpus statistics and several sentiment lexicons. To account for pairs containing phrases and non-empty contexts, we use the word pairs that co-occur in the sentences of our review corpus, and we attach a corresponding Beta distribution to each. We initialize a distribution based on the difference between a word pair’s frequencies in positive and negative documents, respectively. When the word in the pair playing the role of the phrase also appears in a sentiment lexicon, we complement the corpus frequencies with its polarity score. To account for pairs containing phrases and empty contexts, we use the individual words in our corpus, and we attach a Beta distribution to each. We initialize these distributions in a similar manner. We use a Bayesian update process to incorporate incoming answers into these Beta distributions. As a result, the positive and negative probabilities gradually approach the fractions of positive and negative answers, respectively.
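The Bayesian update described above can be illustrated with a minimal sketch. It assumes the standard conjugate update, in which each answer adds one unit of evidence to the matching Beta parameter; the class name `PolarityBelief`, its fields, and the initial pseudo-counts are hypothetical and stand in for the corpus- and lexicon-based initialization, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class PolarityBelief:
    """Beta(alpha, beta) belief that a (phrase, context) pair is positive.

    Hypothetical helper: alpha and beta act as pseudo-counts of positive
    and negative evidence, respectively.
    """

    alpha: float
    beta: float

    def p_positive(self) -> float:
        # Mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)

    def p_negative(self) -> float:
        return 1.0 - self.p_positive()

    def update(self, polarity: str) -> None:
        # Assumed conjugate update: one answer adds one unit of evidence.
        if polarity == "positive":
            self.alpha += 1.0
        elif polarity == "negative":
            self.beta += 1.0
        else:
            raise ValueError(f"unknown polarity: {polarity!r}")


# Illustrative prior only; real priors would come from corpus statistics.
belief = PolarityBelief(alpha=2.0, beta=2.0)
for _ in range(3):
    belief.update("negative")
```

With enough answers, the prior pseudo-counts are outweighed and the estimated probabilities move toward the observed answer fractions.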
For a new answer, we compute an agreement score that assesses if it is commonsensical. We set this agreement score highest if the answer agrees with the scoring model early in the game, since this improves the model’s confidence. We set the agreement score lowest if the answer contradicts the scoring model early in the game, since this damages the model’s confidence. Finally, we assign a low-to-medium value to the agreement score when the answer comes later in the game, because, at that point, it has a smaller impact on the scoring model’s confidence. To assess the model’s uncertainty in the polarity of a phrase and context, we compute the entropy over the pair’s corresponding polarity distribution. The answer decreases this entropy if it agrees with the model and vice versa. Moreover, the answer produces bigger changes in entropy earlier in the game. We thus obtain the agreement score by (piecewise) linearly mapping the updates in entropy to the agreement score interval.
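To make the entropy argument concrete, the sketch below computes the change in entropy caused by one answer. It assumes that the entropy is the binary Shannon entropy (in bits) of the positive and negative probabilities, and that an answer adds one unit to the matching Beta parameter; the function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a (p, 1 - p) polarity distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy_change(alpha: float, beta: float, polarity: str) -> float:
    """Entropy after the answer minus entropy before it.

    alpha and beta are the assumed Beta pseudo-counts for positive and
    negative evidence. A negative result means the model became more
    confident; a positive result means it became less confident.
    """
    before = binary_entropy(alpha / (alpha + beta))
    if polarity == "positive":
        alpha += 1.0
    else:
        beta += 1.0
    after = binary_entropy(alpha / (alpha + beta))
    return after - before


early_agree = entropy_change(3.0, 1.0, "positive")      # entropy drops
early_disagree = entropy_change(3.0, 1.0, "negative")   # entropy rises
late_agree = entropy_change(30.0, 10.0, "positive")     # smaller drop
```

In this sketch the same agreeing answer moves the entropy far less once the pair has accumulated forty units of evidence than when it has only four, which matches the early-versus-late behavior described above.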
To further reward the answer, we also compute a novelty bonus. This reflects whether the answer contains a phrase whose polarity can fluctuate, along with a non-empty context. Specifically, we focus on the phrases that are ambiguous by themselves, such as small. As an indicator of the phrase’s ambiguity, we use the scoring model’s uncertainty in its out-of-context polarity. If the context is empty, we set the novelty score to zero. Otherwise, we assess the phrase’s ambiguity by computing the entropy over the polarity distribution attached to it. We obtain the novelty score by (piecewise) linearly mapping this entropy to the novelty score interval. Note that, as designed, this novelty score does not generously reward the answers containing contexts that flip the polarities of unambiguous phrases, given that these should have a low entropy according to the scoring model. However, if a context genuinely flips the polarity of an unambiguous phrase, an answer should still be generously rewarded through the agreement score.
We reward the answer with a total score update summing the agreement score and the context novelty bonus. Because we do not want to encourage answers that are submitted only once, we give greater weight to agreement by setting the maximum agreement score higher than the maximum novelty score. In initial experiments, we choose values of forty and ten points, respectively. More specifically, in terms of the agreement score, we map the answers that increase entropy to the interval [0, 10] and those that decrease entropy to the interval [10, 40]. In addition, we notice that, based on the corpus statistics that initialize the scoring model, even the unambiguous sentiment words tend to have a relatively high entropy. Therefore, in terms of the novelty score, we heuristically set an entropy threshold of 0.9. We award this bonus to answers attaching contexts to words that meet this ambiguity threshold. In these cases, we simply multiply the entropy by the maximum novelty score of 10.
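One possible reading of these parameters is sketched below. The intervals [0, 10] and [10, 40], the entropy threshold of 0.9, and the maximum novelty score of 10 come from the text. The normalization constant `max_change`, the choice of a straight line within each interval, and all function names are assumptions made for illustration.

```python
MAX_AGREEMENT = 40.0      # maximum agreement score (from the text)
NEUTRAL_AGREEMENT = 10.0  # boundary between the two agreement intervals
MAX_NOVELTY = 10.0        # maximum novelty score (from the text)
ENTROPY_THRESHOLD = 0.9   # ambiguity threshold (from the text)


def agreement_score(delta_entropy: float, max_change: float = 1.0) -> float:
    """Piecewise-linear map from an entropy change to [0, 40].

    max_change is a hypothetical normalization: the largest entropy change
    that a single answer is expected to produce.
    """
    d = max(-max_change, min(max_change, delta_entropy)) / max_change
    if d <= 0.0:
        # Entropy decreased (agreement): map [-1, 0] onto [40, 10].
        return NEUTRAL_AGREEMENT + (MAX_AGREEMENT - NEUTRAL_AGREEMENT) * (-d)
    # Entropy increased (disagreement): map (0, 1] onto (10, 0].
    return NEUTRAL_AGREEMENT * (1.0 - d)


def novelty_score(phrase_entropy: float, context: str) -> float:
    """Bonus for attaching a context to a phrase that is ambiguous alone."""
    if not context or phrase_entropy < ENTROPY_THRESHOLD:
        return 0.0
    return phrase_entropy * MAX_NOVELTY


def total_score(delta_entropy: float, phrase_entropy: float, context: str,
                max_change: float = 1.0) -> float:
    return (agreement_score(delta_entropy, max_change)
            + novelty_score(phrase_entropy, context))
```

Under these assumptions, an answer that leaves the entropy unchanged earns exactly ten agreement points, so the two intervals join continuously.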
Given our choice of scoring parameters, Figures 3.5 and 3.6 show scenarios of how the two score components vary in time, as workers submit answers (both of these scenarios are sampled from one of our game launches, further described in Section 3.4). The first example is relatively simple, and shows how the score updates evolve as workers keep indicating that the word good is positive outside any context. Given that these answers do not contain a context, they do not receive the novelty bonus. In addition, we can see that, at the beginning, the agreement scores are very high (the first answer is rewarded with the maximum agreement score of forty), but as workers keep indicating the same polarity, the scoring model becomes more confident, so the agreement scores steadily decrease. The second example is more complex, in that it shows how the score updates evolve as workers indicate polarities for the phrase and context combination small room. Most of the time, workers agree that this expression is negative, which is why we see the same behavior in the evolution of the agreement scores. There is one exception, when a worker specifies a positive polarity, which is penalized with a low agreement score. In addition, for the most part, the scoring model is uncertain about the polarity of the phrase small outside any context, which is why these answers are also rewarded with the maximum novelty bonus. At some point, one worker indicates that this word is positive by itself, which slightly decreases the model’s uncertainty about the term. This is why, toward the end, we see the novelty bonus decreasing by one unit.
Figure 3.5: Context acquisition. A simple example of how the score updates vary in time as workers keep submitting the answer (good, , positive)
Figure 3.6: Context acquisition. A more complex example of how the score updates vary in time as workers mostly submit the answer (small, room, negative) with some exceptions stating the reverse polarity
Finally, as previously indicated, an additional quality assurance measure that we take during the game is to stop workers if, after at least fifty submitted answers, their average score falls below a predefined threshold. Given our choice of parameters for the scoring mechanism, we heuristically set this threshold to twenty. This is because we require that, on average, workers should submit answers that decrease the scoring model’s uncertainty (entropy), and hence earn an agreement score of at least ten. We also want answers that attach a context to an otherwise ambiguous phrase, which by itself has a high entropy according to the scoring model. This means that such an answer should earn a context novelty bonus of nine or ten points. Of course, the score threshold does not exclude answers that do not include a context component. However, because they do not get the novelty bonus, such answers should on average earn an agreement score of at least twenty. This means that answers without contexts are acceptable as long as they are submitted relatively early in the game (they are not repetitions).
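The two in-game stopping conditions can be summarized in a short sketch. The limits of 200 answers, fifty answers, and an average score of twenty come from the text; the function name `may_continue` and the representation of a worker's history as a list of per-answer scores are assumptions.

```python
from typing import Sequence

MAX_ANSWERS = 200             # hard cap on answers per worker
MIN_ANSWERS_BEFORE_CHECK = 50  # average is only checked from this point on
MIN_AVERAGE_SCORE = 20.0       # heuristic score threshold


def may_continue(scores: Sequence[float]) -> bool:
    """Return True if the worker may be offered another round.

    scores holds the total score (agreement plus novelty) earned by each
    answer the worker has submitted so far.
    """
    n = len(scores)
    if n >= MAX_ANSWERS:
        return False
    if n >= MIN_ANSWERS_BEFORE_CHECK and sum(scores) / n < MIN_AVERAGE_SCORE:
        return False
    return True
```

For example, a worker averaging fifteen points is still offered rounds after 49 answers but is stopped once the fiftieth answer is in.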
Worker Filtering
We filter out the answers of lazy workers by measuring their performance on gold-standard game rounds. We define several such rounds that we interleave with the regular rounds in the game. The gold rounds show simple sentences with only a few acceptable answers. For each worker, we establish the number of answers she submitted in gold rounds and how many of those answers belong to our set of acceptable, gold-standard answers. We consider a worker’s performance acceptable if she gave correct answers in a sufficiently large share of the gold rounds that she attempted to solve. We typically set this accuracy threshold to 80%.
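A minimal sketch of this filter follows. The 80% threshold is taken from the text; the data layout (round identifiers mapped to sets of acceptable answers), the function names, and the decision to reject workers who answered no gold rounds are assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

Answer = Tuple[str, str, str]  # (phrase, context, polarity)


def gold_accuracy(gold_submissions: Iterable[Tuple[str, Answer]],
                  acceptable: Dict[str, Set[Answer]]) -> float:
    """Fraction of a worker's gold-round answers that are acceptable.

    gold_submissions pairs a gold round identifier with the answer given.
    """
    submissions = list(gold_submissions)
    if not submissions:
        # Assumption: a worker with no gold answers cannot be validated.
        return 0.0
    correct = sum(1 for round_id, answer in submissions
                  if answer in acceptable.get(round_id, set()))
    return correct / len(submissions)


def is_reliable(gold_submissions: Iterable[Tuple[str, Answer]],
                acceptable: Dict[str, Set[Answer]],
                threshold: float = 0.8) -> bool:
    return gold_accuracy(gold_submissions, acceptable) >= threshold
```

Only the answers of workers for whom `is_reliable` holds would be passed on to the answer filtering step.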
Answer Filtering
After we eliminate the bad workers, we aggregate the remaining answers to obtain a human-generated context model. We consider each phrase and context that one or more workers combined in their answers, and we add it to the context model. To each pair, we attach the majority polarity that results from the workers’ activity. From the resulting context model, we remove the phrase and context combinations that lead to many classification errors. Using a corpus of training reviews, we first classify documents with the Hu and Liu lexicon alone. We then combine this lexicon with our context model using the sentiment model extension, and we reclassify the training reviews. For each review, three scenarios can happen: the context model fixes an error of the sentiment lexicon; the context model harms a correct classification of the lexicon; or it neither helps nor harms the output of the lexicon. For each phrase and context pair in our context model, we keep track of improvement and error counts, which we increment when the first or second scenarios occur, respectively. We use these counts to remove the bad elements that damage the performance of the Hu and Liu lexicon.
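The bookkeeping of improvement and error counts might look like the sketch below. It assumes that each training review is available as a tuple holding the true label, the lexicon-only prediction, the prediction of the lexicon combined with the context model, and the phrase and context pairs that matched in the review, and that a review's outcome is credited to every pair matched in it. The function name and the tuple layout are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (phrase, context)
Review = Tuple[str, str, str, List[Pair]]  # (gold, lexicon, combined, pairs)


def count_effects(reviews: Iterable[Review]) -> Tuple[Counter, Counter]:
    """Return (improvement_counts, error_counts) per phrase-context pair."""
    improvements: Counter = Counter()
    errors: Counter = Counter()
    for gold, lexicon_pred, combined_pred, pairs in reviews:
        if lexicon_pred != gold and combined_pred == gold:
            # The context model fixed an error of the lexicon.
            improvements.update(set(pairs))
        elif lexicon_pred == gold and combined_pred != gold:
            # The context model harmed a correct classification.
            errors.update(set(pairs))
        # Otherwise the context model neither helped nor harmed.
    return improvements, errors
```

Because every pair matched in a misclassified review is charged with the error, a single frequent bad pair can inflate the error counts of the pairs it co-occurs with.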
We delete the bad elements in four iterations. In each iteration, we classify documents and generate the improvement and error counts. We then choose a heuristic for pruning elements. In the first two iterations, we delete the elements with error counts above a predefined threshold. These elements tend to have incorrect polarities and are also very frequent in the corpus. Because they produce errors in many documents, they also add noise to the error counts of the elements they co-occur with. This is why we aim to remove them first. In the following two iterations, we remove the remaining elements that have high error counts as well as the elements whose error counts exceed the improvement counts. In our experiments, we choose the error count threshold in correlation with the size of the training corpus. Generally, this parameter ranges between 100 and 300.
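The four pruning iterations can be outlined as follows. This is a sketch under assumptions: `count_fn` is a hypothetical callback that reclassifies the training reviews with the current context model and returns the improvement and error counts, and "high error counts" in the last two iterations is read as exceeding the same threshold used in the first two.

```python
from typing import Callable, Mapping, Set, Tuple

Pair = Tuple[str, str]  # (phrase, context)
Counts = Tuple[Mapping[Pair, int], Mapping[Pair, int]]  # (improvements, errors)


def prune_context_model(model: Set[Pair],
                        count_fn: Callable[[Set[Pair]], Counts],
                        error_threshold: int,
                        iterations: int = 4) -> Set[Pair]:
    """Iteratively remove pairs that damage classification."""
    model = set(model)
    for i in range(iterations):
        # Reclassify with the current model to refresh the counts.
        improvements, errors = count_fn(model)
        bad = set()
        for pair in model:
            err = errors.get(pair, 0)
            imp = improvements.get(pair, 0)
            if err > error_threshold:
                bad.add(pair)
            elif i >= 2 and err > imp:
                # Last two iterations: also drop net-harmful pairs.
                bad.add(pair)
        model -= bad
    return model
```

Recomputing the counts at the start of each iteration matters here, since removing the frequent bad pairs first changes the error counts of the pairs that co-occurred with them.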