Methods - Statistical Analysis and Design of Crowdsourcing Applications

3.2.1 Recruitment of Participants

We designed an MTurk HIT (our “task”) to appear as a nondescript survey task similar to many others that are now popular on MTurk. By making our survey appear like any other, we intended to recruit a population that is representative of the MTurk survey-taking population.

We entitled our task “Take a short 30-question survey — $0.11USD” and its

description was “Answer a few short questions for a survey”. The preview pane of the HIT only displayed, “Welcome to the short survey! In this survey you will answer 30 questions. You may only do one survey.” Our HIT was labeled with the keywords “survey”, “questionnaire”, “survey, “poll”, “opinion”, “study”, and “experiment” so that people specifically looking for survey tasks could easily find our task. However,

we also wanted to attract workers who were not specifically looking for survey tasks. Therefore, we posted batches of tasks at various times.

We recruited 784 MTurk workers from the United States.6 _{Each worker was only}

allowed to complete one survey task and took one of four different versions of the survey according to their randomized assignment. In total, 727 workers completed

the entire survey and each was paid $0.11USD.7

We posted in batches of 200 tasks at a time four times per day (at 10AM, 4PM, 10PM, and 4AM EST) as to not bias for early-riser or night-owl workers. Altogether, 18 bunches (of 200 HITs each) were posted between September 1 and September 5, 2010. All HITs expired six hours after creation as to not interfere with the subsequent batch. Note that if we had posted our tasks as one gargantuan batch and waited until completion (possibly a week or longer), we would have attracted a majority of

workers who were specifically looking for survey tasks (most likely searching for them

via keyword) rather than a more general sample of workers.8 The workers were given

a maximum of 45 minutes to complete the task.9

6_{This restriction reduces the influence of any confounding language-specific or cultural effects.} Generalizing our results to other countries and languages may be a fruitful future research direction. 7_{The workers worked 81.3 hours at an average wage of} _$_{0.98/hr and a total cash cost to the} experimenters of $87.97 (including Amazon’s fee of 10%). In this calculation, we ignore the time they spent on the feedback question and the bonuses we paid.

8_{This phenomenon is due to HITs rapidly losing prominence in the public listings and eventually} being relegated to obscurity where they may only be found by those searching via keyword (see Chilton et al., 2010). Also note that we save the time each HIT was created at and expires at. We use this information to check at what point in the HIT listing life-cycle the worker accepted the HIT.

9_{Even though the survey was short, we wanted to give ample time to be able to collect data on} task breaks. Note that few workers took advantage of the long time limit; 117 workers (16%) took more than 15 minutes and only 61 workers (8%) took more than 30 minutes.

3.2.2 Treatments

To test the satisficing-reducing effect of theKapcha, we randomized each participant

into one of four treatments which are described below, summarized in table 3.1, and

pictured in figures 3.1a–d.10

Increase Force Attract att-

perceived value slow ention to ind-

Treatment of survey down ividual words

Control

Exhortation _X

Timing control _X

Kapcha _X _X

Table 3.1: Overview of treatments and how they improve data quality

The first treatment was theControl where the questions were displayed in a similar

fashion as any other online survey.

Our Exhortation treatment presents survey questions in an identical way as the

Control treatment except that we try to increase the survey taker’s motivation by

reminding them in alarming red text at the bottom of each question page to “Please answer accurately. Your responses will be used for research.” Past research on survey design has shown that respondents are more likely to devote effort to completing surveys if they perceive them as valuable because they contribute to research (Krosnick, 1991).

Rather than using exhortation, ourTiming control andKapchatreatments induce

more careful survey taking by changing the incentives of a respondent. In short, we

10_Videos _llustrating _the _four _treatments _are _available _at http://danachandler.com/kapchastudy.html

lower the payoff to satisficing.11 _{When respondents can breeze through a survey and}

click one answer after another without delay, they may be tempted to satisfice — i.e., click the first answer that seems correct or any answer at random. If, however, survey respondents are forced to wait before proceeding to the next question, we hypothesize that they will use this time to think more carefully about how to answer.

Our Timing control treatment is identical to the Control treatment except that

the continue button is disabled and has a spinning graphic during a waiting period12

after which the continue button is enabled.

The Kapcha treatment goes one step further and, in addition to slowing down the

respondent for a time equal to the Timing control treatment, also draws additional

attention to the instructions and answer choices by “fading-in” the survey’s words at 250 words per minute.

The delay time for the questions in the Timing control treatment were calibrated

to be the same total fade-in time for theKapchaparticipant’s question. By controlling

for the timed delay, we were able to isolate the additional effect due to forcing the

respondent to pay attention to the words in the Kapcha.

Although this is the first research to our knowledge that studies waiting peri- ods and textual fade-ins, there is a long history of research on how various forms of survey implementation affect response. Two interesting examples include how self-administration lead respondents to answer sensitive questions more truthfully

11_{If the}_Exhortation_{treatment increased the rate at which people passed the trick question (which} it did not), we might have worried that this framing could bias the way survey takers answer questions, particularly socially sensitive ones, since it reminds the respondent that they are under scrutiny. In social psychology, over-reporting “positive” behaviors is known as the “social desirability bias” (DeMaio, 1984).

12_{We peg the waiting period to the time it takes an average person to read the number of words} in each question. Taylor (1965) finds that the average reading speed for college-level readers is 280 words per minute and 250 for twelfth-graders. We chose 250 words per minute.

and how questions that are accompanied by audio do the same among people who might not understand the text (especially among low-literacy respondents). Recently, Couper et al. (2009) has helped separate the effect of the self-administration and the

audio component.13

3.2.3 Custom Survey Task Design

As soon as the worker accepted the HIT, they were given a page with directions that explained the length of the survey and asked to begin when ready. Depending on the treatment, we also added an additional sentence or two to the instructions in order to

explain the particularities associated with each treatment. For ourExhortationgroup,

we emphasized the importance of giving accurate and honest answers. In ourTiming

control group, we told participants that the continue button would be disabled for a

short time so they would have more time to read and answer each question. For our

Kapcha group, we mentioned how words and answer choices would appear one at a

time.

After reading the directions, the worker began the survey task which consisted of 30 questions plus two optional questions eliciting feedback. Each question was presented individually so that the respondent must click submit before moving onto

the next question.14

Our first question, “question A”, is a hypothetical thought experiment (which we call the soda-pricing example) that “demonstrates how different expectations can change people’s willingness to pay for identical experiences” (Oppenheimer et al.,

13_{For an excellent, though slightly dated review of various survey presentation formats and the} issues they try to overcome, see Chapter 10.1 of Tourangeau et al. (2000)

14_{Note that most surveys on MTurk display all questions on one page. Presenting questions one} at a time, as we do in our study, probably serves to reduce satisficing. A future study would allow us to determine how the number of questions on each page affects satisficing.

2009). The question text is shown below. The subtle text manipulation which induces an effect according to Thaler (1985) is shown in brackets and will be denoted as the “run-down” vs. “fancy” treatments:

You are on the beach on a hot day. For the last hour you have been thinking about how much you would enjoy an ice cold can of soda. Your companion needs to make a phone call and offers to bring back a soda from the only nearby place where drinks are sold, which happens to be a [run-down grocery store / fancy resort]. Your companion asks how much you are willing to pay for the soda and will only buy it if it is below the price you state. How much are you willing to pay?

It has been shown repeatedly in the literature that people are willing to pay more when the beverage comes from a fancy resort. If the workers were reading the instructions carefully, we expect them to pay a higher price in the “fancy” treatment. The worker was then given “question B,” another hypothetical thought experiment (which we call the football attendance example), which demonstrates that people are susceptible to the sunk cost fallacy (for screenshots, see figure 3.1a–d). The question text is shown below. The subtle text manipulation which induces an effect according to Thaler (1985) is shown in brackets and will be denoted as the “paid” vs “free” treatments:

Imagine that your favorite football team is playing an important game. You have a ticket to the game that you have [paid handsomely for / received for free from a friend]. However, on the day of the game, it happens to be freezing cold. What do you do?

Their intention was gauged on a nine-point scale where 1 was labeled “definitely stay at home” and 9 was labeled “definitely go to the game”. It has been shown that

(a)Control treatment (b)Exhortation treatment

Figure 3.1: The participant’s screen during question B, the intent to go to a football game, shown for all four experimental treatments (in the “paid” treatment). The

Timing control and Kapcha treatments are both shown at 10 seconds since page

load.

people who read the treatment where they paid for the tickets are more likely to go to the game.

By randomizing the text changes independently of treatments, we were able com-

pare the strength of these two well-established psychological effects across the four

types of survey-presentation.

The participant then answered an “instructional manipulation check” (IMC) question. Once again, this is a trick question designed to gauge whether the participant reads and follows directions carefully. The instructions of the IMC question asks the participant which sports they like to play and provides many options. However, within the question’s directions (i.e. the “fine print”), we tell the respondent to ignore the question prompt and instead click on the title above (see Fig 3.2). If the

participant clicked the title, they “pass” the IMC; any other behavior is considered failure. We administer the IMC and consider it to be a proxy for satisficing behavior

in general15 _{which we were able to compare across the four treatments.}

Figure 3.2: A screenshot of theinstructional manipulation check in theControl treat-

ment.

After this point, there are no longer any differences in how we present survey

questions across the treatments. We turn off the Kapcha fade-in, theTiming control,

and the Exhortation message. This allows us to collect demographics and need for

cognition measures in a way that is comparable and independent of treatment.16

The next eight questions collect demographic information. We ask for birth year, gender, and level of education. We then ask a few questions about their general work habits on MTurk: “Why do you complete tasks in Mechanical Turk?”, “How much do

15_{Oppenheimer et al. (2009) could not detect a significant difference between the “fancy resort”} and the “run-down grocery store” conditions in question A using the full sample,but could detect a significant difference using data only from the participants who passed the IMC. This is an intuitive result; the psychological effect that occurs due to subtle word changes would only be detectable if a respondent carefully read the instructions.

you earn per week on Mechanical Turk?”, “How much time do you spend per week on

Mechanical Turk?”17_{, and “Do you generally multi-task while doing HITs?” To see}

if people who take surveys more often are any different, we also ask: “What percent of your time on MTurk do you spend answering surveys, polls, or questionnaires?”.

After the eight demographic questions, we administer an 18-question abbreviated version of the full “Need for Cognition” (NFC) scale (see Cacioppo et al., 1984). Re- spondents say how characteristic each of the statements are of their personality (e.g. “I find satisfaction in deliberating hard and for long hours”) on a five-point scale where 1 indicates “extremely uncharacteristic” and 5 indicates “extremely character-

istic”.18 _{In short, the NFC scale assesses how much an individual has a need to think}

meticulously and abstractly. This is of interest to our study because an individual’s NFC may itself affect the probability of satisficing during a survey. People who have a higher NFC are also more likely to fill out the survey diligently and appear with high scores. Those with low NFC will have scores that are attenuated toward the center (i.e., 3) because they are more likely to haphazardly guess.

For the 30th question, we ask the participant to rate how motivated they were to take this survey on a nine-point scale.

We then give two optional feedback questions. On both questions, we indicated

that responses would be given either a $0.01, $0.05 USD bonus, or no bonus (these

three levels were randomized in order to test the effect of bonus level on feedback quality). The amount of feedback is also used as a measure of respondent engagement. The first question inquired, “What did you like most about this survey? What did you like least about this survey? Is there anything you would recommend to

make it better?” The second feedback prompt was only relevant for the Kapcha or

17_{This question was used in Ipeirotis (2010), which presents result from a large MTurk survey} 18_{Approximately half of the questions are also reverse-coded so that noisy survey responses would} cancel themselves out and tend toward the center.

Timing control treatments. We asked, “Certain respondents had to wait for each survey question to complete before filling in answers. We are especially interested in knowing how this affected the way you took the survey.”

3.2.4 Other Data Collected

During our recruitment period, we posted HIT bunches (see section 3.2.1 for details) with an equal number of tasks in each of the four experimental treatments.

In addition to the participants’ responses, we recorded how long survey respondents spent on each part of the survey. This gave us some indication of how seriously people took our survey (i.e., read instructions, considered answers to questions). We also recorded when the participant’s task window was focused on our task or focused

on another window.19

In the future, we hope to collect much more detailed information on the user’s activity including timestamps of exact mouse position locations, mouse clicks, and keystrokes. Ultimately, it would be an asset to researchers to be able to “playback” the task by watching the worker’s mouse movements in a short video in order to gain

greater insight into how respondents answer surveys.20 _{This would help identify sat-}

isficing behavior in a way that would go undetected using other rigid rules, but would be obvious from watching a video (e.g., instantaneously and haphazardly clicking on random answer choices).

19_{Unfortunately, this variable was not compatible with all Internet browsers and was too noisy to} use in analysis.

In document Statistical Analysis and Design of Crowdsourcing Applications (Page 48-58)