
Analysis of The Worker Experience

For all the experiments we ran, no significant differences were observed in the workload perceived by workers across the different settings. Figure 4.6 (top) shows the results of the NASA-TLX questionnaire for E1. We can observe that, as more relevant documents are present in the batch, frustration tends to decrease while perceived performance increases. Figure 4.6 (bottom) shows the results of the NASA-TLX questionnaire for E2. While the differences between the scores are not statistically significant, we can observe that the maximum perceived performance and minimum effort were observed for batch 3 (50% of relevant documents shown first, followed by 50% of non-relevant). Its more realistic counterpart (batch 5) shows the lowest level of frustration and, together with the results of Figure 4.3, corroborates the idea that it is a suitable candidate for a re-balancing technique to maximize performance without affecting the assessor’s perceived workload. Similar results are observed for E3, where batch length has no significant impact on perceived judgment complexity.

FIGURE 4.6: Perceived workload using the NASA-TLX assessment tool for each setting in Experiment 1 and Experiment 2.

The effort required to complete the HIT batch is not affected by the class balance or by the order of items presented to workers (Figure 4.6). This is a positive result: it allows us to re-order HITs in a batch without affecting the crowd worker experience.

4.7.2 The Effect of Document Position on Judgment Quality and Time

Since workers completed HITs in sequence, we also analysed the effect of HIT position on their performance, regardless of the class balance setting. This lets us answer questions such as: is the judgment accuracy for the first document in a batch different from the judgment accuracy for the document in the last position? Although the differences in judgment accuracy were not statistically significant, we noticed that documents presented first in a batch have the lowest accuracy, which suggests a possible learning effect as workers get into a new batch (Figure 4.7). This finding is consistent with previous work (Maddalena, Basaldella, and Innocenti, 2016).

FIGURE 4.7: Mean Accuracy, PPV, and NPV of the documents in the first, second, and last position in a batch over all the experiments.

On average, the first document being judged in a batch shows lower accuracy levels. More interestingly, documents in the first position of the batch show high precision (PPV) and low NPV values: when the first document is relevant, workers tend to be very accurate, whereas when it is non-relevant, workers make more mistakes. This further supports the ‘batch 5’ alternative in E2, that is, placing documents known to be relevant from editorial judgments in the first positions, which would help train workers on relevance. We also observed that the position of the document to be judged does not significantly affect the completion time for any of the batches.
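The per-position measures discussed above can be illustrated with a minimal Python sketch. It is not the analysis code used for the experiments: the function name `position_metrics` and the input layout (tuples of batch position, worker label, and gold label, with `True` meaning relevant) are assumptions made for illustration. It uses the standard definitions PPV = TP / (TP + FP) and NPV = TN / (TN + FN).

```python
from collections import defaultdict


def position_metrics(judgments):
    """Compute accuracy, PPV, and NPV per batch position.

    judgments: iterable of (position, worker_label, gold_label) tuples,
    where labels are booleans and True means "relevant".
    This input layout is hypothetical, chosen for the example.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for position, predicted, gold in judgments:
        if predicted and gold:
            key = "tp"
        elif predicted and not gold:
            key = "fp"
        elif not predicted and not gold:
            key = "tn"
        else:
            key = "fn"
        counts[position][key] += 1

    metrics = {}
    for position, c in counts.items():
        total = sum(c.values())
        judged_relevant = c["tp"] + c["fp"]
        judged_non_relevant = c["tn"] + c["fn"]
        metrics[position] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            # PPV and NPV are undefined when no judgment falls in that class.
            "ppv": c["tp"] / judged_relevant if judged_relevant else None,
            "npv": c["tn"] / judged_non_relevant if judged_non_relevant else None,
        }
    return metrics


# Toy data, invented for the example: four judgments at position 1, two at position 2.
toy_judgments = [
    (1, True, True), (1, True, False), (1, False, False), (1, False, True),
    (2, True, True), (2, False, False),
]
toy_metrics = position_metrics(toy_judgments)
```

Grouping by position first and computing the ratios afterwards keeps the undefined cases (no positive or no negative judgments at a position) explicit instead of raising a division error.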

4.7.3 Completion Time

We analysed the relationship between judgment quality and HIT completion time for Experiments 1, 2, and 3. For Experiment 1, we found that workers who spent between 3 and 5 minutes working on the experiment had low accuracy. For Experiment 2, the majority of the workers who spent between 500 and 1800 seconds on the experiment had an accuracy between 0.6 and 1.

Figure 4.8 shows the average completion time for all batches considered in E1 and E2 compared to PPV values. In E1, completion time shows no clear pattern with respect to balance and order settings. In E2, the fastest completion time was achieved in balanced batches (30-50%). Comparing time with judgment effectiveness, we see no strong correlation between PPV and average completion time. We conclude that, while introducing lower bounds on task completion time allows us to filter out workers who judge relevance randomly, completion time is in general not a sufficient indicator of judgment quality, a result in agreement with previous work (e.g., Cai, Iqbal, and Teevan, 2016).
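As a rough illustration of the two operations mentioned above (a lower bound on completion time, and a check of how strongly time correlates with quality), the following Python sketch may help. The record layout, the function names, and the 60-second threshold are all hypothetical and are not taken from the experiments; Pearson's r is used here only as one possible correlation measure.

```python
from math import sqrt


def filter_by_min_time(workers, min_seconds):
    """Keep only workers whose completion time reaches the lower bound."""
    return [w for w in workers if w["seconds"] >= min_seconds]


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Toy records, invented for the example.
toy_workers = [
    {"seconds": 30, "accuracy": 0.5},
    {"seconds": 600, "accuracy": 0.8},
    {"seconds": 900, "accuracy": 0.7},
    {"seconds": 1200, "accuracy": 0.9},
]
kept = filter_by_min_time(toy_workers, min_seconds=60)  # hypothetical threshold
r = pearson([w["seconds"] for w in kept], [w["accuracy"] for w in kept])
```

The filter removes only implausibly fast submissions; the correlation is then computed on the remaining workers, which mirrors the distinction drawn above between a time lower bound and time as a quality indicator.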

FIGURE 4.8: Median PPV vs. completion time for each batch in Experiment 1 (a) and Experiment 2 (b).

4.7.4 Effect on Agreement

Because different workers judged the same documents under the same order and balance conditions, we can also measure the effect of document order and class balance on assessor agreement across experimental settings.

FIGURE 4.9: (a) Krippendorff’s alpha for all batches in Experiment 1 (horizontal line for median value). (b) Krippendorff’s alpha for batches in Experiment 1 (blue) with different balance classes, and Experiment 2 (red) with different ordering of documents.

Figure 4.9-a shows Krippendorff’s alpha scores computed under different class balance conditions (E1). Inter-annotator agreement tends to be higher when fewer relevant documents are present in the batch of tasks (which is the most realistic setting). The lowest agreement levels are observed around the 50% balance level.

Figure 4.9-b shows average assessor agreement levels computed on documents appearing at the beginning or at the end of a batch. Worker agreement is higher when relevant documents are presented first and when few relevant documents are present in the batch (10% vs. 50%), consistent with Figure 4.9-a.
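For readers unfamiliar with the agreement measure, the following is a minimal Python sketch of Krippendorff's alpha for nominal labels, computed from the coincidence matrix. It is an illustration under stated assumptions, not the code used to produce Figure 4.9: the function name and the input layout (one list of worker labels per document) are hypothetical, and documents with fewer than two labels are skipped because they contribute no pairable values.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that different
    workers assigned to one document (hypothetical input layout).
    Returns None when alpha is undefined (e.g. only one label value occurs).
    """
    # Coincidence matrix: each ordered pair of labels from different workers
    # on the same document adds 1 / (m - 1), with m the number of labels.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label value and overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Sum of observed coincidences between different label values.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    # Corresponding sum expected by chance, given the marginals.
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return None
    return 1 - observed / expected


# Toy data, invented for the example: two workers, four documents.
alpha_perfect = krippendorff_alpha_nominal([[1, 1], [0, 0], [1, 1]])
alpha_mixed = krippendorff_alpha_nominal([[0, 0], [1, 1], [0, 1], [1, 0]])
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, which is the scale on which the batch-level scores in Figure 4.9 can be read.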