5.7 Crowdsourced Task: Document Relevance Assessment
5.7.2 Interface Design for Document Relevance Assessment
To enable workers to assess each of the 8,190 documents, we develop a common interface that we vary slightly for each of the five types of document that we have assessed. This interface comprises two components. First, an instruction block at the top describes how the worker is to complete the task. An illustration of the instructions can be found in Appendix B.4. Below the instruction block is a series of assessment blocks. Each assessment block represents a single assessment that the worker is to make. Figure 5.9 illustrates how documents from three of the five sources are displayed to the user as assessment blocks. From Figure 5.9, we see that in each interface the news-related query is displayed at the top. Below the query, the document title and snippet pair is displayed. For Figure 5.9 (a) and (b), representing blog and Digg posts respectively, these take the form of an actual title and document snippet from the original document, while for Twitter (Figure 5.9 (c)) the tweet text is displayed in the ‘title’ slot and the username and the number of retweets are displayed below it. Underneath the document representation, a common evaluation block is displayed, enabling the user to record their assessment for that document and query.

[Figure 5.9 panels: (a) Blog Post, (b) Digg, (c) Twitter]
5.7.3 Validation of Worker Assessments
As with previous crowdsourcing studies, we use gold judgement validation to identify workers who produce random or low-quality assessments. For each of the five types of document that we rank, we randomly select a small subset to assess manually, forming our gold judgement set. In particular, for newswire articles, we created 77 gold documents (4.9% of the total), while for blog posts we created 70 (3.7% of the total). For the Digg source we created 88 (4.5% of the total), while we assessed 91 for Twitter (4.6% of the total). Finally, for Wikipedia, we created 50 gold documents (5.1% of the total). These ‘gold’ documents are interspersed with the non-gold documents that each worker assesses. Workers who fail more than 20% of the gold documents are rejected from the evaluation unpaid.
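The rejection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the platform's actual implementation: the function name, the `(worker_id, doc_id, label)` tuple layout and the label strings are all assumptions made for the example. Only the 20% threshold comes from the text.

```python
from collections import defaultdict

# From the text: workers failing MORE than 20% of gold documents are rejected.
GOLD_FAILURE_THRESHOLD = 0.20


def find_rejected_workers(judgements, gold_labels, threshold=GOLD_FAILURE_THRESHOLD):
    """Return the set of workers whose gold failure rate exceeds the threshold.

    judgements:  iterable of (worker_id, doc_id, label) tuples (hypothetical layout).
    gold_labels: dict mapping each gold doc_id to its manually assigned label.
    """
    attempted = defaultdict(int)  # gold documents seen per worker
    failed = defaultdict(int)     # gold documents answered incorrectly per worker
    for worker_id, doc_id, label in judgements:
        if doc_id not in gold_labels:
            continue  # non-gold documents play no part in validation
        attempted[worker_id] += 1
        if label != gold_labels[doc_id]:
            failed[worker_id] += 1
    # Strictly greater than the threshold, matching "more than 20%".
    return {w for w, n in attempted.items() if failed[w] / n > threshold}
```

Under this sketch, a worker who fails exactly one gold document in five sits on the threshold and is not rejected, because the comparison is strict.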
5.7.4 Crowdsourcing Configuration
To crowdsource our relevance labels, we used the CrowdFlower microtask crowdsourcing platform on top of Amazon’s Mechanical Turk (MTurk). We have the documents from each of the five sources assessed separately. Documents from each source are grouped into sets of ten (including a single gold document), each set forming an MTurk HIT. We pay workers $0.01 (US dollars) per assessment made, hence $0.10 for each HIT. Each document is assessed by three workers. The total cost for crowdsourcing all 24,570 assignments (8,190 documents × 3 workers) was $446.53. We do not restrict workers by region; however, we do have three individual workers attempt each HIT. The final assessments produced are the majority vote across the three assessments.
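The majority-vote aggregation and the assignment count can be illustrated as follows. This is a sketch under stated assumptions: the label strings are hypothetical, and returning `None` when no label wins a strict majority is a choice made for the example, since the text does not say how a three-way split is resolved.

```python
from collections import Counter

DOCUMENTS = 8190           # from the text
WORKERS_PER_DOCUMENT = 3   # from the text

# 8,190 documents x 3 workers = 24,570 assignments, as reported.
ASSIGNMENTS = DOCUMENTS * WORKERS_PER_DOCUMENT


def majority_label(labels):
    """Return the label chosen by a strict majority of workers, else None.

    The None fallback for a split vote is an assumption of this sketch.
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None
```

For example, `majority_label(["relevant", "relevant", "not_relevant"])` yields `"relevant"`, while three different labels yield `None`.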
5.7.5 Assessment Quality
We evaluate the quality of the assessments produced by the workers in terms of inter-worker agreement across the three workers that attempted each HIT. A high level of agreement indicates that the final labels are of good quality (see Section 5.3.2). We report the agreement measure that CrowdFlower provides, namely observed agreement. Here, the probability of random agreement is the probability of all three workers selecting the same label by chance, i.e. 33%. Table 5.15 reports, on a per-source basis, the number of assessments accrued in addition to the average worker agreement across all 199 queries for each source.
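To make the notion of observed agreement concrete, the sketch below computes one common variant: the mean fraction of agreeing worker pairs per document. The exact formula that CrowdFlower uses is not given in the text, so the pairwise definition, the function names and the label strings here are assumptions. Under this pairwise definition, workers choosing uniformly at random among three labels would agree one third of the time.

```python
from itertools import combinations


def observed_agreement(labels_per_document):
    """Mean fraction of agreeing worker pairs per document (pairwise variant).

    labels_per_document: list of label lists, one list per document, each
    holding the labels given by the workers who assessed that document.
    """
    per_document = []
    for labels in labels_per_document:
        pairs = list(combinations(labels, 2))
        agreeing = sum(1 for a, b in pairs if a == b)
        per_document.append(agreeing / len(pairs))
    return sum(per_document) / len(per_document)


def uniform_chance_pairwise_agreement(num_labels):
    """Expected pairwise agreement if every worker picks a label uniformly at random."""
    return 1 / num_labels
```

With three workers per document there are three pairs, so a document scores 1 when all workers agree, 1/3 when exactly two agree, and 0 when all three differ.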
Corpus                       # Documents   # Gold   # Assessments   Agreement
BlekkoNewsSnippetsApr2012          1,553       77           5,245      82.62%
BlekkoBlogSnippetsApr2012          1,857       70           6,432      71.53%
DiggSnippetsApr2012                1,888       88           6,027      77.29%
TweetsApr2012                      1,990       91           7,722      73.43%
WikiUpdatesApr2012                   902       50           3,695      77.39%
Total                              8,190      376          29,121      76.07%

Table 5.15: Statistics of the document relevance assessments produced by MTurk workers.
From Table 5.15, we observe that for all five sources, agreement is higher than 71% across the three relevancy labels. This is a high level of agreement, which indicates that the resultant assessments are of good quality in comparison to random levels of agreement. The lowest agreement reported is 71.53% for the blog corpus, while the highest is 82.62% for newswire articles. The higher level of agreement when assessing newswire articles in comparison to tweets or blogs suggests that news articles are easier to assess as relevant or not to a query. Overall observed agreement across sources is 76.07%. Since this is much higher than random agreement, we conclude that the resultant relevance assessments are of sufficient quality to use later in Chapter 9.
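The text does not state how the overall figure is aggregated from the per-source values. As a hedged check, the sketch below weights each source's agreement by its number of documents, using the values from Table 5.15. This weighting is an assumption of the example, not a documented method, although it is consistent with the reported total when rounded to two decimal places. The dictionary and function names are illustrative.

```python
# (number of documents, agreement in %) per corpus, copied from Table 5.15.
TABLE_5_15 = {
    "BlekkoNewsSnippetsApr2012": (1553, 82.62),
    "BlekkoBlogSnippetsApr2012": (1857, 71.53),
    "DiggSnippetsApr2012": (1888, 77.29),
    "TweetsApr2012": (1990, 73.43),
    "WikiUpdatesApr2012": (902, 77.39),
}


def document_weighted_agreement(rows):
    """Average agreement across corpora, weighted by document count (assumed weighting)."""
    total_documents = sum(docs for docs, _ in rows.values())
    weighted_sum = sum(docs * agreement for docs, agreement in rows.values())
    return weighted_sum / total_documents


overall = round(document_weighted_agreement(TABLE_5_15), 2)
```

An unweighted mean of the five per-source values gives a slightly different figure, so the choice of weighting matters when reproducing the total.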
5.8 Conclusions
In this chapter, we described ten different datasets from time periods between 2006 and 2012 that we use in our subsequent experimental chapters to evaluate whether the use of user-generated content can aid in satisfying news-related queries submitted to universal Web search engines. In Section 5.2, we listed each of the ten datasets that we use later (see Table 5.2) divided into the four components of our proposed news search framework.
Out of the ten datasets, four contain document assessments that we developed using the medium of crowdsourcing. In Section 5.3, we defined what crowdsourcing is and why it is a valuable resource for relevance assessment, and detailed the validation strategies that we employ to improve the quality of the assessments produced. Agreement measures that were used to evaluate the quality of those assessments were also described.
In Sections 5.4, 5.5, 5.6 and 5.7, we described how we used crowdsourcing to generate assessments for each of the four datasets, totalling over 60,000 individual assessments and a total cost of $1,442.93 (US dollars). In particular, in each section, we described in detail how we developed effective new interfaces for crowdsourcing assessments and then validated the resulting assessments in real-time, enabling the rejection of poor quality work. Furthermore, we closed each of these four sections by evaluating the quality of the final assessments produced, showing through high levels of inter-worker agreement or accuracy against a ground truth that the resultant assessments were of sufficiently good quality.
In the next chapter of this thesis, we examine the Top Events Identification (TEI) component of our news search framework. We use the five datasets that we described in Section 5.2.1 (see Table 5.3), including the BlogTrack2011TopNews-Phase1 dataset that contains newswire article assessments that we crowdsourced (see Section 5.4), to evaluate whether important events can be identified accurately using live streams of blog posts or tweets.