Chapter 1: General Introduction
1.6 Crowdsourcing Platforms
We have discussed the large number of binary and ternary textures available for study (Maddess et al., 2004, Maddess et al., 2007, Victor and Conte, 2012). An efficient way of exploring the properties of these textures is to use a crowdsourcing service to carry out remote visual psychometric experiments (Paolacci et al., 2010). Crowdsourcing websites coordinate the supply and demand of tasks that require human input. In Chapter 3 of this thesis, we will validate the use of crowdsourcing platforms for such experiments. In Chapter 4, crowdsourcing will be used to carry out a large study using ternary textures.
Launched in 2005, Amazon Mechanical Turk (mTurk) has developed into the largest crowdsourcing platform (for reviews on using mTurk to conduct experiments, see Mason and Suri, 2012, Paolacci et al., 2010). It now has in excess of 200,000 registered Workers from over 100 countries (Ross et al., 2010, Pontin, 2007). MTurk provides the elements required to conduct research studies: a large, persistent pool of research subjects, an integrated payment system, and a streamlined process for study design, recruitment, and data collection (Paolacci et al., 2010). As a result, mTurk has developed into a labour market for tasks that range from surveys and language translations to psychometric experiments (Mason and Suri, 2012).
Anyone with internet access can register to use mTurk as a Requester or a Worker. Requesters create Human Intelligence Tasks (HITs). HITs expire after a predefined time and/or when the available pool has been exhausted. Workers are presented with a list of HITs, which they can browse and complete. Completed HITs are reviewed by the Requester and accepted or rejected. Accepted HITs trigger a payment from the Requester to the Worker’s account. Amazon charges Requesters 10% of the total pay issued (minimum $0.005 USD per HIT) (Mason and Suri, 2012). At the time of writing, the number of active HITs on mTurk was 403,336 (AWS, 2014).
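As a rough illustration of the fee structure described above, the following Python sketch estimates what a Requester would pay for a batch of HITs. The 10% rate and the $0.005 USD minimum are taken from the text; the function name `requester_cost` and the assumption that the minimum applies to each completed HIT separately are illustrative only and should be checked against current mTurk pricing.

```python
from decimal import Decimal

COMMISSION_RATE = Decimal("0.10")  # 10% of Worker pay, as quoted in the text
MINIMUM_FEE = Decimal("0.005")     # minimum fee per HIT in USD, as quoted in the text


def requester_cost(pay_per_hit, n_hits):
    """Return (total Worker pay, total fee, total cost) in USD.

    Assumption: the minimum fee is applied to each completed HIT separately.
    """
    pay = Decimal(str(pay_per_hit))
    # The fee is 10% of the pay, but never less than the per-HIT minimum.
    fee_per_hit = max(pay * COMMISSION_RATE, MINIMUM_FEE)
    total_pay = pay * n_hits
    total_fee = fee_per_hit * n_hits
    return total_pay, total_fee, total_pay + total_fee


if __name__ == "__main__":
    for pay in ("0.02", "0.50"):
        total_pay, total_fee, total = requester_cost(pay, 25)
        print(f"pay/HIT ${pay}: Worker pay ${total_pay}, fee ${total_fee}, total ${total}")
```

Under these assumptions the minimum fee dominates at very low pay levels: for a $0.02 HIT the fee is $0.005 rather than $0.002.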
MTurk has several advantages over traditional methods of subject recruitment and study implementation. It provides access to a large, established workforce, which facilitates rapid recruiting with very little effort (Ross et al., 2010, Paolacci et al., 2010). The service is available every day of the year and subject availability appears to remain quite stable over time,
with minor seasonal fluctuations (Ipeirotis, 2010). Its established technical infrastructure allows studies to be developed relatively quickly and easily (Mason and Suri, 2012). It also includes an integrated payment system, which eliminates third-party services, such as PayPal, that have been linked to lower response rates (Goritz et al., 2008). It allows for pre-screening and can also incorporate quality control measures, such as catch trials (easy questions which can be used to gauge subject attentiveness) (Paolacci et al., 2010, Kittur et al., 2008). Oppenheimer et al. have also experimented with ways of checking that Workers are following instructions and remaining engaged (Oppenheimer et al., 2009).
There have been reports of Workers using programs (bots) to automatically complete HITs, although this appears to be rare (McCreadie et al., 2010). Nonetheless, it is important to validate Worker performance. To that end, mTurk has a built-in reputation system so Requesters can block Workers whose rejection rate exceeds a given threshold (Paolacci et al., 2010, McCreadie et al., 2010). Qualification tasks may also be used to enforce practice trials and careful reading of experimental procedures (Heer et al., 2010).
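The two screening ideas discussed above, a rejection-rate threshold and catch trials, can be sketched as a simple filter. This is a minimal illustration and does not use the mTurk API; the `WorkerRecord` fields, the function names, and both threshold values are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerRecord:
    """Hypothetical per-Worker summary assembled by the Requester."""
    worker_id: str
    hits_submitted: int
    hits_rejected: int
    catch_correct: int
    catch_total: int


def rejection_rate(w):
    """Fraction of submitted HITs that were rejected (0.0 if none submitted)."""
    return w.hits_rejected / w.hits_submitted if w.hits_submitted else 0.0


def catch_accuracy(w):
    """Fraction of catch trials answered correctly (0.0 if none were shown)."""
    return w.catch_correct / w.catch_total if w.catch_total else 0.0


def passes_screening(w, max_rejection_rate=0.05, min_catch_accuracy=0.9):
    """Keep a Worker only if both criteria are met.

    Both thresholds are illustrative defaults, not values from the literature.
    """
    return (rejection_rate(w) <= max_rejection_rate
            and catch_accuracy(w) >= min_catch_accuracy)


if __name__ == "__main__":
    workers = [
        WorkerRecord("A", 100, 2, 9, 10),    # reliable and attentive
        WorkerRecord("B", 100, 20, 10, 10),  # high rejection rate
        WorkerRecord("C", 50, 0, 5, 10),     # fails the catch trials
    ]
    kept = [w.worker_id for w in workers if passes_screening(w)]
    print(kept)
```

In practice, the rejection-rate criterion would be applied before a Worker accepts a HIT, whereas catch-trial accuracy can only be assessed after the data have been collected.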
The majority of mTurk Workers are based in the USA, although there is considerable international diversity; the number of Workers in India in particular is increasing (Ross et al., 2010, Ipeirotis, 2010). The requirement for an Internet connection and English language proficiency restricts the majority of mTurk Workers to highly industrialized societies (Ross et al., 2010). MTurk Workers are highly educated, with 57% being educated to Bachelor’s degree level or above; a slight majority are female (55%) and
aged 18-30 years (62%) (Ross et al., 2010). MTurk Workers are slightly more demographically diverse than standard Internet samples and significantly more diverse than US college samples (Buhrmester et al., 2011, Gosling et al., 2004). Therefore, mTurk allows researchers to access subjects that would be difficult to access by other means (Paolacci et al., 2010).
One of the advantages offered by crowdsourcing platforms such as mTurk is their low cost. One study found that Workers have a reservation wage (the minimum pay rate for which they would complete a HIT) of $1.38 USD per hour (Ipeirotis, 2010, Mason and Suri, 2012, Buhrmester et al., 2011). Compensation rates appear to have less effect on data quality than on the rate of data collection (Buhrmester et al., 2011, Paolacci et al., 2010, Mason and Watts, 2009). This is consistent with the idea that Workers are not solely financially motivated, but also derive secondary benefits from mTurk, such as entertainment and a sense of being productive (Buhrmester et al., 2011, Ipeirotis, 2010, Paolacci et al., 2010, Ross et al., 2010). In addition to cost savings and reduced recruiting effort, crowdsourcing can scale to levels which would be prohibitive in a laboratory setting.
Given the low levels of compensation, it is interesting to consider what motivates mTurk Workers. The proportion of Workers who rely on mTurk for their primary income is quite low: 12% of US and 27% of Indian Workers (Ipeirotis, 2010). Nonetheless, only 12% of US and 10% of Indian Workers indicated that the money they derive from mTurk was "irrelevant" (Ross et al., 2010). Similarly, Paolacci et al. reported that although only 13.8% of the US Workers derived their primary income from mTurk, 61.4% reported that
earning additional money was an important motivating force (Paolacci et al., 2010).
Many Workers also use mTurk for entertainment (40.7%) and “killing time” (32.3%) (Ipeirotis, 2010, Paolacci et al., 2010). Buhrmester et al. also reported that Workers used mTurk because they found the tasks enjoyable (Buhrmester et al., 2011). The majority of US Workers (69%) reported that they thought of mTurk as a productive way to spend free time and make some extra money (Ipeirotis, 2010, Paolacci et al., 2010). This suggests that mTurk Workers may produce good quality data, despite receiving relatively low wages, as financial gain is not their sole motivation (Buhrmester et al., 2011, Ipeirotis, 2010, Paolacci et al., 2010, Ross et al., 2010). In this context, Crump et al. have suggested that tasks which have an entertainment component may be more successful at recruiting Workers (Crump et al., 2013).
Several authors have suggested that, on ethical grounds, the pay of mTurk Workers should be comparable to that of laboratory subjects (Mason and Suri, 2012, Crump et al., 2013). However, because of the entertainment value derived from mTurk (Paolacci et al., 2010, Buhrmester et al., 2011, Ipeirotis, 2010), and the low numbers of Workers who rely on mTurk for their primary income (Ross et al., 2010, Paolacci et al., 2010, Ipeirotis, 2010), the case for parity is not entirely clear.
An obvious concern with a novel platform such as mTurk is whether the data produced is of high quality. Workers are unsupervised and therefore may be less attentive than supervised subjects in a laboratory setting (Oppenheimer et al., 2009). The anonymity afforded to Workers may increase deceptive
responding (Skitka and Sargis, 2006) and rates of non-completion (Crump et al., 2013). Some financially motivated Workers may be more concerned with completing HITs rapidly than with the quality of their work (Mason and Suri, 2012). Therefore, it is important to consider how compensation levels affect Worker behaviour.
HIT participation rates are affected by the level of compensation and how long the HIT takes to complete (Buhrmester et al., 2011). Buhrmester et al. offered $0.02 USD for a 30 minute task, but still accumulated 25 completed HITs within 5 hours of posting. When the compensation was increased to $0.50 USD, they obtained the same number of completed HITs in less than 2 hours of posting (Buhrmester et al., 2011).
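To put these figures in perspective, the short Python sketch below converts the two offers into implied hourly pay and into a rate of data collection. It assumes that both offers were for the same 30-minute task, as the description above suggests; the function names are illustrative. Because the second batch was completed in "less than 2 hours", its collection rate is a lower bound.

```python
def hourly_rate(pay_usd, task_minutes):
    """Implied hourly pay in USD for a task of the given duration."""
    return pay_usd * 60.0 / task_minutes


def collection_rate(n_hits, hours):
    """Completed HITs per hour of posting."""
    return n_hits / hours


if __name__ == "__main__":
    # (pay per HIT in USD, task minutes, completed HITs, hours of posting)
    offers = [
        (0.02, 30, 25, 5.0),
        (0.50, 30, 25, 2.0),  # "less than 2 hours", so the rate is a lower bound
    ]
    for pay, minutes, n_hits, hours in offers:
        print(
            f"${pay:.2f} per HIT: "
            f"${hourly_rate(pay, minutes):.2f} per hour, "
            f"{collection_rate(n_hits, hours):.1f} HITs per hour"
        )
```

On these assumptions, both offers imply hourly pay below the $1.38 USD reservation wage cited earlier, yet the higher offer was completed at least 2.5 times faster.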
In a second study, Buhrmester et al. posted a set of personality questionnaires at three levels of compensation ($0.02, $0.10, and $0.50 USD). The alpha reliabilities for the data collected were within one hundredth of a point across all three compensation levels (Buhrmester et al., 2011). Mason and Watts observed a similar relationship (Mason and Watts, 2009). They altered the compensation levels of two mTurk tasks, whilst simultaneously measuring the HIT participation rate and the data quality. As the compensation level increased (from $0.01 to $0.10 USD), so did the number of HITs completed; however, there was no difference in the data quality (Mason and Watts, 2009). Taken together, these findings suggest that, even at the lowest levels, compensation has less effect on data quality than on the rate of data collection (Buhrmester et al., 2011, Paolacci et al., 2010, Mason and Watts, 2009). This may be explained by Workers deriving secondary
benefits from participating in mTurk studies (Buhrmester et al., 2011, Ipeirotis, 2010, Paolacci et al., 2010, Ross et al., 2010).
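For readers unfamiliar with the reliability measure mentioned above, the following Python sketch computes Cronbach's alpha for a small questionnaire data set. It assumes that "alpha reliability" refers to Cronbach's alpha, which is the usual meaning for questionnaire data; the function name and the example scores are made up for illustration and are not data from any of the cited studies.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per respondent,
    one numeric score per questionnaire item).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Transpose so that each tuple holds all respondents' scores for one item.
    items = list(zip(*responses))
    item_variance_sum = sum(variance(scores) for scores in items)
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Made-up scores: 4 respondents answering 3 items on a 1-5 scale.
    example = [
        [4, 5, 4],
        [2, 2, 3],
        [5, 4, 5],
        [1, 2, 1],
    ]
    print(round(cronbach_alpha(example), 3))
```

Comparing the value obtained at each compensation level, as Buhrmester et al. did, indicates whether the internal consistency of the responses changes with pay.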
With regard to data quality, several studies have carried out replications of laboratory studies using mTurk. Paolacci et al. replicated a series of classic judgment and decision-making experiments on mTurk at a cost of just $1.71 USD per hour per subject. Quantitatively, there were only very minor differences between the mTurk and laboratory data sets (Paolacci et al., 2010). Similarly, Horton et al. and Crump et al. replicated some classic behavioural psychology experiments using mTurk, including the Prisoner’s Dilemma, and found the results to be congruent with previously published laboratory data (Horton et al., 2011, Crump et al., 2013). In the study by Crump et al., it is interesting to note that the introduction of instructional checks had a greater effect on data quality than increasing the level of compensation (Crump et al., 2013).
At the time of writing, only three studies have used mTurk to administer visual psychometric studies (Heer et al., 2010, Cole et al., 2009, Freeman et al., 2013). Cole et al. recruited 550 mTurk Workers to evaluate three-dimensional line drawings and indicate surface normals. They gathered 275,000 observations and used them to rate rendering techniques. However, the data gathered were not compared to supervised laboratory data and collection statistics were not reported (Cole et al., 2009). Heer et al. carried out a series of graphical perception experiments (Heer et al., 2010), including a replication of a laboratory alpha contrast experiment (Stone and Bartram, 2008). Freeman et al. used mTurk to evaluate the discrimination of natural versus synthetic textures (Freeman et al., 2013). They gathered over "300
hours of behavioural data from thousands of human observers" (exact numbers were not reported). Each Worker was paid $0.40 USD for ~5 minutes of work ($4.80 USD per hour). Neither demographic data nor platform data were reported, and the authors did not report using catch trials or qualification tasks (Freeman et al., 2013). In this context, there is a clear need for studies which investigate the suitability of mTurk for performing visual psychometric testing.