Breaking Monotony with Meaning: Motivation in Crowdsourcing Markets
Dana Chandler a,∗, Adam Kapelner b

a Massachusetts Institute of Technology, 50 Memorial Drive, Cambridge, MA 02142
b The Wharton School of the University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 19104
Abstract
We conduct the first natural field experiment to explore the relationship between the "meaningfulness" of a task and worker effort. We employed about 2,500 workers from Amazon's Mechanical Turk (MTurk), an online labor market, to label medical images. Although given an identical task, we experimentally manipulated how the task was framed. Subjects in the meaningful treatment were told that they were labeling tumor cells in order to assist medical researchers; subjects in the zero-context condition (the control group) were not told the purpose of the task; and, in stark contrast, subjects in the shredded treatment were not given context and were additionally told that their work would be discarded. We found that when a task was framed more meaningfully, workers were more likely to participate. We also found that the meaningful treatment increased the quantity of output (with an insignificant change in quality) while the shredded treatment decreased the quality of output (with no change in quantity). We believe these results will generalize to other short-term labor markets. Our study also discusses MTurk as an exciting platform for running natural field experiments in economics.
Keywords: natural field experiment, worker motivation, crowdsourcing, online labor markets
Both authors contributed equally to this work. The authors wish to thank Professor Susan Holmes of Stanford University for comments and for allowing us to adapt the DistributeEyes software for our experiment (funded under NIH grant ...).
∗ Principal corresponding author
Email addresses: [email protected] (Dana Chandler), [email protected] (Adam Kapelner)
Preprint submitted to Elsevier, October 4, 2012

1. Introduction

Economists, philosophers, and social scientists have long recognized that non-pecuniary factors are powerful motivators that influence choice of occupation. For a multidisciplinary literature review on the role of meaning in the workplace, we recommend Rosso et al. (2010). Previous studies in this area have generally been based on ethnographies, observational studies, or laboratory experiments. For instance, Wrzesniewski et al. (1997) used ethnographies to classify work into jobs, careers, or callings. Using an observational study, Preston (1989) demonstrated that workers may accept lower wages in the non-profit sector in order to produce goods with social externalities. Finally, Ariely et al. (2008) showed that labor had to be both recognizable and purposeful to have meaning. In this paper, we limit our discussion to the role of meaning in economics, particularly through the lens of compensating differentials. We perform the first natural field experiment (Harrison and List, 2004) in a real-effort task that manipulates levels of meaningfulness. This method overcomes a number of shortcomings of the previous literature, including interview bias, omitted variable bias, and concerns of external validity beyond the laboratory.

We study whether employers can deliberately alter the perceived "meaningfulness" of a task in order to induce people to do more and higher quality work, and thereby work for a lower wage. We chose a task that would appear meaningful to many people if given the right context: helping cancer researchers mark tumor cells in medical images. Subjects in the meaningful treatment were told the purpose of their task is to "help researchers identify tumor cells;" subjects in our zero-context group were not given any reason for their work and the cells were instead referred to as mere "objects of interest;" and laborers in the shredded group were given zero context but were also explicitly told that their labelings would be discarded upon submission. Hence, the pay structure, task requirements, and working conditions were identical, but we added cues to alter the perceived meaningfulness of the task.

We recruited workers from the United States and India from Amazon's Mechanical Turk (MTurk), an online labor market where people around the world complete short, "one-off" tasks for pay. The MTurk environment is a spot market for labor characterized by relative anonymity and a lack of strong reputational mechanisms. As a result, it is well-suited for an experiment involving the meaningfulness of a task, since the variation we introduce regarding a task's meaningfulness is less affected by desires to exhibit pro-social behavior or an anticipation of future work (career concerns). We ensured that our task appeared like any other task in the marketplace and was comparable in terms of difficulty, duration, and wage.

Our study is representative of the kinds of natural field experiments for which MTurk is particularly suited. Section 2 explores MTurk's potential as a platform for field experimentation using the framework proposed in Levitt and List (2007, 2009).

We contribute to the literature on compensating wage differentials (Rosen, 1986) and the organizational behavior literature on the role of meaning in the workplace (Rosso et al., 2010).
Within economics, Stern (2004) provides quasi-experimental evidence on compensating differentials within the labor market for scientists by comparing wages for academic and private-sector job offers among recent Ph.D. graduates. He finds that "scientists pay to be scientists" and require higher wages in order to accept private-sector research jobs because of the reduced intellectual freedom and a reduced ability to interact with the scientific community and receive social recognition. Ariely et al. (2008) use a laboratory experiment with undergraduates to vary the meaningfulness of two separate tasks: (1) assembling Legos and (2) finding 10 instances of consecutive letters on a sheet of random letters. Our experiment augments experiment 1 in Ariely et al. (2008) by testing whether their results extend to the field. Additionally, we introduce a richer measure of task effort, namely task quality. Where our experiments are comparable, we find that our results parallel theirs.

We find that the main effect of making our task more meaningful is to induce a higher fraction of workers to complete our task, hereafter dubbed "induced to work." In the meaningful treatment, 80.6% of people labeled at least one image, compared with 76.2% in the zero-context and 72.3% in the shredded treatments.

After labeling their first image, workers were given the opportunity to label additional images at a declining piece rate. We also measure whether the treatments increase the quantity of images labeled. We classify participants as "high-output" workers if they label five or more images (an amount corresponding to roughly the top tercile of those who label) and we find that workers are approximately 23% more likely to be high-output workers in the meaningful group.

We introduce a measure of task quality by telling workers the importance of accurately labeling each cell by clicking as close to the center as possible. We first note that MTurk labor is high quality, with an average of 91% of cells found. The meaningful treatment had an ambiguous effect on quality, but the shredded condition in both countries lowered the proportion of cells found by about 7%.

By measuring both quantity and quality, we are able to observe how task effort is apportioned between these two "dimensions of effort." Do workers work "harder" or "longer" or both? We found an interesting result: the meaningful condition seems to increase quantity without a corresponding increase in quality, and the shredded treatment decreases quality without a corresponding decrease in quantity. Investigating whether this pattern generalizes to other domains may be a fruitful avenue for future research.

Finally, we calculate participants' average hourly wage based on how long they spent on the task. We find that subjects in the meaningful group work for $1.34 per hour, which is 6 cents less per hour than zero-context participants and 14 cents less per hour than shredded-condition participants.

We expect our findings to generalize to other short-term work environments such as temporary employment or piecework. In these environments, employers may not consider that non-pecuniary incentives of meaningfulness matter; we argue that these incentives do matter, and to a significant degree.

Section 2 provides background on MTurk and discusses its use as a platform for conducting economic field experiments. Section 3 describes our experimental design.
Section 4 presents our results and discussion, and Section 5 concludes. Appendix A provides full details on our experimental design, and Appendix B is a technical appendix for conducting experiments using the MTurk platform.
2. Mechanical Turk and its potential for field experimentation
Amazon's Mechanical Turk (MTurk) is the largest online, task-based labor market and is used by hundreds of thousands of people worldwide. Individuals and companies can post tasks (known as Human Intelligence Tasks, or "HITs") and have them completed by an on-demand labor force. Typical tasks include image labeling, audio transcription, and basic internet research. Academics also use MTurk to outsource low-skilled research tasks such as identifying linguistic patterns in text (Sprouse, 2011) and labeling medical images (Holmes and Kapelner, 2010). The image labeling system from the latter study, known as "DistributeEyes," was originally used by breast cancer researchers and was modified for our experiment.

Beyond simply using MTurk as a source of labor, academics have also begun using MTurk as a way to conduct online experiments. The remainder of this section highlights some of the ways this subject pool is used and places special emphasis on the suitability of the environment for natural field experiments in economics.
As Henrich et al. (2010) argue, many findings from social science are disproportionately based on what they call "W.E.I.R.D." subject pools (Western, Educated, Industrialized, Rich, and Democratic), and as a result it is inappropriate to believe the results generalize to larger populations. Since MTurk has users from around the world, it is also possible to conduct research across cultures. For example, Eriksson and Simpson (2010) use a cross-national sample from MTurk to test whether differential preferences for competitive environments are explained by females' stronger emotional reaction to losing, as hypothesized by Croson and Gneezy (2009).

It is natural to ask whether results from MTurk generalize to other populations. Paolacci et al. (2010) assuage these concerns by replicating three classic framing experiments on MTurk: the Asian Disease Problem, the Linda Problem, and the Physician Problem; Horton et al. (2011) provide additional replication evidence for experiments related to framing, social preferences, and priming. Berinsky et al. (2012) argue that the MTurk population has "attractive characteristics" because it approximates gold-standard probability samples of the US population. All three studies find that the direction and magnitude of the effects line up well compared with those found in the laboratory.

An advantage of MTurk relative to the laboratory is that the researcher can rapidly scale experiments, recruiting hundreds of subjects within only a few days and at substantially lower costs.

Apart from general usage by academics, the MTurk environment offers additional benefits for experimental economists and researchers conducting natural field experiments. We analyze the MTurk environment within the framework laid out in Levitt and List (2007, 2009).

In the ideal natural field experiment, "the environment is such that the subjects naturally undertake these tasks and [do not know] that they are participants in an experiment." Additionally, the experimenter must exert a high degree of control over the environment without attracting attention or causing participants to behave unnaturally. MTurk's power comes from the ability to construct customized and highly-tailored environments related to the question being studied. It is possible to collect very detailed measures of user behavior such as precise time spent on a webpage, mouse movements, and positions of clicks. In our experiment, we use such data to construct a precise quality measure.

MTurk is particularly well-suited to experimenter-as-employer designs (Gneezy and List, 2006) as a way to study worker incentives and the employment relationship without having to rely on the cooperation of private-sector firms; as Barankay (2010) remarks, "the experimenter [posing] as the firm [gives] substantial control about the protocol and thereby eliminates many project risks related to field experiments." For example, Barankay (2010) posted identical image labeling tasks and varied whether workers were given feedback on their relative performance (i.e., ranking) in order to study whether providing rank-order feedback led workers to return for a subsequent work opportunity. For a more detailed overview of how online labor markets can be used in experiments, see Horton et al. (2011).

Levitt and List (2007) enumerate possible complications that arise when experimental findings are extrapolated outside the lab: scrutiny, anonymity, stakes, selection, and artificial restrictions. We analyze each complication in the context of our experiment and in the context of experimentation using MTurk in general.
Scrutiny and anonymity. In the lab, experimenter effects can be powerful; subjects behave differently if they are aware their behavior is being watched. Relatedly, subjects frequently lack anonymity and believe their choices will be scrutinized after the experiment. On MTurk, interaction between workers and employers is almost non-existent; most tasks are completed without any communication, and workers are identifiable only by a numeric identifier. Consequently, we believe that MTurk experiments are less likely to be biased by these complications. (For scale: in our study we paid 2,471 subjects $789 in total for 701 hours of work, equating to 31 cents per observation; this includes 60 subjects whose data were not usable.)

Stakes. In the lab or field, it is essential to "account properly for the differences in stakes across settings" (Levitt and List, 2007). We believe that our results would generalize to other short-term work environments, but we would not expect them to be generalizable to long-term employment decisions such as occupational choice. Stakes must also be chosen adequately for the environment, and so we were careful to match wages to the market average.

Selection. Experiments fail to be generalizable when "participants in the study differ in systematic ways from the actors engaged in the targeted real-world setting." Within MTurk, it is unlikely that there is selection into our experiment, since our task was designed to appear similar to real tasks. The MTurk population also seems representative along a number of observable demographic characteristics (Berinsky et al., 2012); however, we acknowledge that there are potentially unobservable differences between our subject pool and the broader population. Still, we believe that MTurk subject behavior would generalize to workers' behavior in other short-term labor markets.
Artificial restrictions. Lab experiments place unusual and artificial restrictions on the actions available to subjects, and they examine only small, non-representative windows of time because the experimenter typically cannot retain subjects over long time horizons. In structuring our experiment, workers had substantial latitude in how they performed their task. In contrast with the lab, subjects could "show up" to our task whenever they wanted, leave at will, and were not time-constrained. Nevertheless, we acknowledge that while our experiment succeeded in matching short-term labor environments like MTurk, our results do not easily generalize to longer-term employment relationships.

Levitt and List (2009) highlight two limitations of field experiments vis-à-vis laboratory experiments: the need for cooperation with third parties and the difficulty of replication. MTurk does not suffer from these limitations. Work environments can be created by researchers without the need for a private-sector partner, whose interests may diverge substantially from those of the researcher. Further, MTurk experiments can be replicated simply by downloading source code and re-running the experiment. In many ways, this allows a push-button replication that is far better than that offered in the lab.
3. Experimental Design
In running our randomized natural field experiment, we posted our experimental task so that it would appear like any other task (image labeling tasks are among the most commonly performed tasks on MTurk). Subjects had no indication they were participating in an experiment. Moreover, since MTurk is a market where people ordinarily perform one-off tasks, our experiment could be listed inconspicuously.

We hired a total of 2,471 workers (1,318 from the US and 1,153 from India). Although we tried to recruit equally from both countries, there were fewer Indians in our sample since attrition in India was higher. We collected each worker's age and gender during a "colorblindness" test that we administered as part of the task. These and other summary statistics can be found in Table 1. By contracting workers from the US and India, we can also test whether workers from each country respond differentially to the meaningfulness of a task.

Our task was presented so that it appeared to be a one-time work opportunity (subjects were barred from doing the experiment more than once), and our design sought to maximize the amount of work we could extract during this short interaction. The first image labeling paid $0.10, the next paid $0.09, and so on, leveling off at $0.02 per image. This wage structure was also used in Ariely et al. (2008) and has the benefit of preventing people from working too long.
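To make this schedule concrete, here is a minimal Python sketch (with hypothetical function names; the actual experiment software is not reproduced here) of the declining piece rate described above:

    def wage_for_image(k):
        """Pay for the k-th labeling: $0.10 for the first, declining by
        $0.01 per image and leveling off at $0.02."""
        return max(0.10 - 0.01 * (k - 1), 0.02)

    def total_earnings(n_images):
        """Cumulative pay for labeling n_images images."""
        return sum(wage_for_image(k) for k in range(1, n_images + 1))

    # For example, labeling five images pays 0.10 + 0.09 + 0.08 + 0.07 + 0.06 = $0.40.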
Upon accepting our task, workers provided basic demographic information and passed a colorblindness test. Next, they were randomized into either the meaningful, the zero-context, or the shredded condition. Those in the shredded condition were shown a warning message stating that their labelings would not be recorded, and we gave them the option to leave. Then, all participants were required to watch an instructional video which they could not fast-forward. See Appendix A for the full script of the video as well as screenshots.

The video for the meaningful treatment began immediately with cues of meaning. We adopt a similar working definition of "meaningfulness" as used in Ariely et al. (2008): "Labor [or a task] is meaningful to the extent that (a) it is recognized and/or (b) has some point or purpose."

We varied the levels of meaningfulness by altering the degree of recognition and the detail used to explain the purpose of our task. In our meaningful group, we provided "recognition" by thanking the laborers for working on our task. We then explained the "purpose" of the task by creating a narrative explaining how researchers were inundated with more medical images than they could possibly label and that they needed the help of ordinary people. In contrast, the zero-context and shredded groups were not given recognition, told the purpose of the task, or thanked for participating; they were only given basic instructions. Analyzing the results from a post-manipulation check (see Section 4.4), we are confident that these cues of meaning induced the desired effect.

Both videos identically described the wage structure and the mechanics of how to label cells and properly use the task interface (including zooming in/out and deleting points, which are metrics we analyze). However, in the meaningful treatment, cells were referred to as "cancerous tumor cells" whereas in the zero-context and shredded treatments, they were referred to as nondescript "objects of interest." Except for this phrase change, both scripts were identical during the instructional sections of the videos. To emphasize these cues, workers in the meaningful group heard the words "tumor," "tumor cells," "cells," etc. 16 times before labeling their first image, and similar cues on the task interface reminded them of the purpose of the task as they labeled.

3.3. Task interface, incentive structure, and response variables
After the video, we administered a short multiple-choice quiz testing workers' comprehension of the task and user interface. In the shredded condition, we added a final question asking workers to again acknowledge that their work would not be recorded.

Upon passing the quiz, workers were directed to a task interface which displayed the image to be labeled and allowed users to mark cancerous tumor cells (or "objects of interest") by clicking (see Figure 1). The image shown was one of ten look-alike photoshopped images displayed in random order. We also provided the workers with controls (zoom functionality and the ability to delete points) whose proper use would allow them to produce high-quality labelings.
Figure 1: Main task portal for a subject in the meaningful treatment. Workers were asked to identify all tumors in the image. Each image had 90 cells and took 5 minutes on average. Our interface reminds the workers in 8 places that they are identifying tumor cells. The black circles around each point were not visible to participants; we display them to illustrate the size of a 10-pixel radius.
During the experiment, we measured three response variables: (1) induced to work, (2) quantity of image labelings, and (3) quality of image labelings.

Many subjects can, and do, stop performing a task even after agreeing to complete it. While submitting bad work on MTurk is penalized, workers can abandon a task with only nominal penalty. Hence, we measure attrition with the response variable induced to work. Workers were only counted as induced to work if they watched the video, passed the quiz, and completed one image labeling. Our experimental design deliberately encourages attrition by imposing an upfront and unpaid cost of watching a three-minute instructional video and passing a quiz before moving on to the actual task.

Workers were paid $0.10 for the first image labeling. They were then given an option to label another image for $0.09, and then another image for $0.08, and so on. At $0.02, we stopped decreasing the wage and the worker was allowed to label images at this pay rate indefinitely. After each image, the worker could either collect what they had earned thus far, or label more images. We used the quantity of image labelings as our second response variable.

In our instructional video, we emphasized the importance of marking the exact center of each cell. When a worker labeled a cell by clicking on the image, we measured that click location to the nearest pixel. Thus, we were able to detect if the click came "close" to the actual cell. Our third response variable, quality of image labelings, is the proportion of objects identified, based on whether a worker's click fell within a pixel radius of the object's true center. We discuss the radii we picked in the following section.

After workers chose to stop labeling images and collect their earnings, they were given a five-question post-manipulation check (PMC) survey which asked whether they thought the task (a) was enjoyable, (b) had purpose, (c) gave them a sense of accomplishment, (d) was meaningful, and (e) made their efforts recognized. Responses were collected on a five-point Likert scale. We also provided a text box to elicit free-response comments; about 24% of respondents left comments, with no difference across treatments.

Hypothesis 1.

We hypothesize that at equal wages, the meaningful treatment will have the highest proportion of workers induced to work and the shredded condition will have the lowest proportion. In the following section, we provide theoretical justification for this prediction.
Hypothesis 2.
As in Ariely et al. (2008), we hypothesize that the quantity of images labeled will be increasing in the level of meaningfulness.
Hypothesis 3.
In addition to quantity, we measure the quality of image labelings and hypothesize that it is increasing in the level of meaningfulness.
Hypothesis 4.

Based upon prior survey research on MTurk populations, we hypothesize that Indian workers are less responsive to meaning. Ipeirotis (2010) finds that Indians are more likely to have MTurk as a primary source of income (27% vs. 14% in the US). Likewise, people in the US are nearly twice as likely to report doing tasks because they are fun (41% vs. 20%). Therefore, one might expect financial motivations to be more important for Indian workers, although Horton et al. (2011) find that workers of both types are strongly motivated by money.

4. Experimental Results and Discussion

We ran the experiment on N = 2,471 subjects (1,318 from the United States and 1,153 from India). Table 1 shows summary statistics for our response variables (induced to work, number of images, and quality), demographic variables, and hourly wage.
                      Shredded      Zero Context   Meaningful    US only       India only
% Induced to Work     .723          .762           .806          .85           .666
# images (if >= 1)    5.94          —              —             —             —
N                     828           798            845           1318          1153
Coarse quality        .883 ± .21    .904 ± .18     .930 ± .14    .924 ± .15    .881 ± .21
Fine quality          .614 ± .22    .651 ± .21     .676 ± .18    .668 ± .19    .621 ± .26
PMC Meaning           3.44          —              —             —             —

Table 1: Summary statistics for response variables and demographics by treatment and country (mean ± standard deviation where available). The statistics for the quality metrics are computed by averaging each worker's average quality (only for workers who labeled one or more images). The statistics for the PMC meaning question only include workers who finished the task and the survey.
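The two-stage averaging in the quality rows of Table 1 (each worker's labelings are averaged first, then the mean and standard deviation are taken across those worker-level averages) can be sketched as follows, assuming a hypothetical mapping from worker id to that worker's per-image quality scores:

    from statistics import mean, stdev

    def quality_summary(worker_scores):
        """Mean and standard deviation, across workers, of each worker's
        average per-image quality (workers with at least one labeling)."""
        per_worker = [mean(scores) for scores in worker_scores.values()]
        return mean(per_worker), stdev(per_worker)

    # e.g., quality_summary({"w1": [0.85, 0.95], "w2": [0.95]}) -> (0.925, ~0.035)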
Broadly speaking, as the level of meaning increases, subjects are more likely to participate, they label more images, and they do so with higher quality. Across all treatments, US workers participate more often, label more images, and mark points with greater accuracy. Table 2 uses a heatmap to illustrate our main effect sizes and their significance levels by treatment, country, and response variable. Each cell indicates the size of a treatment effect relative to the control (i.e., the zero-context condition). Statistically significant positive effects are indicated using green fill, where darker green indicates higher levels of significance; statistically significant negative effects are indicated using red fill, where darker red indicates higher levels of significance. Black text without fill indicates effects that are marginally significant (p < .10).

Table 2: Direction and significance of treatment effects relative to the zero-context control, by country and response variable (induced to work, two or more labelings, five or more labelings, and quality). * p < .05, ** p < .01, *** p < .001.

4.1. Induced to work

We investigate how treatment and country affect whether or not subjects chose to do our task. Unlike in a laboratory environment, our subjects were workers in a relatively anonymous labor market and were not paid a "show-up fee." On MTurk, workers frequently start but do not finish tasks; attrition is therefore a practical concern for employers who hire from this market. In our experiment, on average, 25% of subjects began, but did not follow through by completing one full labeling.

Even in this difficult environment, we were able to increase participation among workers by roughly 4.6% by framing the task as more meaningful (see columns 1 and 2 of Table 3). The effect is robust to including various controls for age, gender, and time-of-day effects. As a subject in the meaningful treatment told us, "It's always nice to have [HITs] that take some thought and mean something to complete. Thank you for bringing them to MTurk." Another subject suggested the labelings could be put to "good use doing similar work with images, e.g. in dosimetry or pathology ... and it would free up medical professionals to do the heftier work." The shredded treatment discouraged workers and caused them to work 4.0% less often, but the effect was less significant (p = 0.057 without controls and p = 0.082 with controls). Thus, hypothesis 1 seems to be correct. Irrespective of treatment, subjects from India completed an image 18.5% less often.

Table 3: Regressions of induced to work (columns 1 and 2), two or more labelings (columns 3 and 4), and five or more labelings (columns 5 and 6) on treatment indicators. * p < .05, ** p < .01, *** p < .001. p-values are reported for partial F-tests for sets of different types of control variables.

4.2. Quantity of image labelings

Table 1 shows that the number of images increased with meaning. However, this result is conditional on being induced to work and is therefore contaminated with selection bias. We follow Angrist (2001) and handle selection by creating a dummy variable for "did two or more labelings" and a dummy for "did five or more labelings" and using them as responses (other cutoffs produced similar results).

We find mixed results regarding whether the level of meaningfulness affects the quantity of output. Being assigned to the meaningful treatment group did have a positive effect, but assignment to the shredded treatment did not result in a corresponding decrease in output.

Analyzing the outcome "two or more labelings," column 3 of Table 3 shows that the meaningful treatment induced 4.7% more subjects to label two or more images. For the outcome "five or more labelings" (an amount corresponding to the top tercile of quantity among people who were induced to work), the meaningful treatment was highly significant and induced 8.5% more subjects to label five or more images (p < .001 with and without controls), an increase of nearly 23 percent; the shredded treatment again has no effect.

Hypothesis 2 (quantity increases with meaningfulness) seems to be correct only when comparing the meaningful treatment to the zero-context treatment. An ambiguous effect of the shredded treatment on quantity is also reported by Ariely et al. (2008).

We did not find differential effects between the United States and India. In an unshown regression, we found that Americans were 9.5% more likely to label five or more images (p < .01) and Indians were 8.4% more likely to label five or more; the difference between the two countries' responses was insignificant.

4.3. Quality of image labelings

Quality was measured by the fraction of cells labeled within a distance of five pixels ("coarse quality") and two pixels ("fine quality") of their true centers. In presenting our results (see Table 4), we analyze the treatment effects using our fine quality measure. The coarse quality regression results were similar, but the fine quality measure had a much more dispersed distribution: the inter-quartile range of coarse quality overall was [93.3%, 97.2%], whereas the IQR of fine quality overall was [54.7%, 80.0%].
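To make this measure precise, here is a minimal Python sketch (hypothetical names; this is not the DistributeEyes production code) that computes the proportion of true cell centers having at least one click within a given pixel radius:

    import math

    def labeling_quality(true_centers, clicks, radius):
        """Proportion of cells with at least one click within `radius` pixels
        of the cell's true center; radius=5 yields coarse quality and
        radius=2 yields fine quality."""
        found = sum(
            1 for (cx, cy) in true_centers
            if any(math.hypot(x - cx, y - cy) <= radius for (x, y) in clicks)
        )
        return found / len(true_centers)

    # Each image contained the same 90 true centers, so for one labeling:
    # coarse = labeling_quality(centers, worker_clicks, radius=5)
    # fine = labeling_quality(centers, worker_clicks, radius=2)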
Our main result is that fine quality was 7.2% lower in the shredded treatment, but there was not a large corresponding increase in the meaningful treatment. This makes sense; if the workers knew their labelings were not going to be checked, there was no incentive to mark points carefully. This result was not different across countries (regression unshown). The meaningful treatment has a marginally significant effect only in the United States, where fine quality increased by 3.9% (p = 0.092 without controls and p = 0.044 with controls), but there was no effect in India. Thus, hypothesis 3 (quality increases with meaningfulness) seems to be correct only when comparing the shredded to the zero-context treatment, which is surprising.

Although Indian workers were less accurate than United States workers and had 5.3% lower quality (p < .001 and robust to controls), United States and Indian workers did not respond differentially to the shredded treatment. One caveat with our quality results is that we only observe quality for people who were induced to work and selected into our experiment (we have "attrition bias"). Attrition was 4% higher in the shredded treatment, and we presume that the people who opted out of labeling images would have labeled them with far worse quality had they remained in the experiment.

                     Both Countries          United States           India
Meaningful           0.007      0.014        0.039      0.039*       -0.031     -0.013
                     (0.017)    (0.014)      (0.023)    (0.019)      (0.025)    (0.021)
Shredded             -0.072***  -0.074***    -0.061*    -0.066**     -0.087**   -0.073**
                     (0.021)    (0.017)      (0.027)    (0.023)      (0.031)    (0.023)
India                -0.053***  -0.057***
                     (0.015)    (0.013)
Male                            0.053***                0.014                   0.100***
                                (0.013)                 (0.017)                 (0.021)
Labelings 6–10                  -0.018**                -0.024**                -0.016*
                                (0.006)                 (0.008)                 (0.008)
Labelings ≥ 11                  -0.140***               -0.116***               -0.148***
                                (0.017)                 (0.029)                 (0.020)
Constant             0.666***   0.645***     0.651***   0.625***     0.634***   0.588***
Controls (partial F-test p-values):
  Image                         0.00***                 0.00***                 0.00***
  Age                           0.10                    0.01**                  0.25
  Time of Day                   0.33                    0.29                    0.78
  Day of Week                   0.12                    0.46                    0.26

Table 4: Regressions of fine quality. For each sample (both countries, United States only, India only), the first column excludes and the second column includes control variables. Standard errors are in parentheses. * p < .05, ** p < .01, *** p < .001.

Experience matters. Once subjects had between 6 and 10 labelings under their belt, they were 1.8% less accurate (p < .01), and with 11 or more labelings, accuracy fell by 14.0% (p < .001). Finally, we found that some of the ten images were substantially harder to label accurately than others (a partial F-test for equality of the image fixed effects yields p < .001). Anecdotally, subjects from the shredded condition who submitted comments regarding the task were less likely to have expressed concerns about their accuracy; one subject from the meaningful group remarked that "[his] mouse was too sensitive to click accurately, even all the way zoomed in," but we found no such apologies or comments from people in the shredded group.

4.4. Post-manipulation check

In order to understand how our treatments affected the perceived meaningfulness of the task, we gave a post-manipulation check to all subjects who completed at least one image and did not abandon the task before payment. Ideally, we would have collected this information immediately after introducing the treatment condition; however, doing so would have compromised the credibility of our natural field experiment. This data should therefore be interpreted cautiously, given that subjects who completed the task and our survey are not representative of all subjects in our experiment.

We found that those in the meaningful treatment rated the task significantly higher in the post-manipulation check in both the United States and India. Using a five-point Likert scale, we asked workers to rate the perceived levels of meaningfulness, purpose, enjoyment, accomplishment, and recognition. In the meaningful treatment, subjective ratings were higher in all categories, but the self-rated levels of meaningfulness and purpose were the highest: the level of meaningfulness was 1.3 points higher in the US and 0.6 points higher in India, and the level of perceived purposefulness was 1.2 points higher in America and 0.5 points higher in India. In the United States, the level of accomplishment increased by only 0.8, and the levels of enjoyment and recognition increased by 0.3 and 0.5 respectively, with a marginal increase in India. As a US participant told us, "I felt it was a privilege to work on something so important and I would like to thank you for the opportunity."

We conclude that the meaningful frames accomplished their goal. Remarkably, those in the shredded treatment in either country did not report significantly lower ratings on any of the items in the post-manipulation check. Thus, the shredded treatment may not have had the desired effect.
5. Conclusion
Our experiment is the first that uses a natural field experiment in a real labor market to examine how a task's meaningfulness influences labor supply. Overall, we found that the greater the amount of meaning, the more likely a subject is to participate, the more output they produce, the higher quality output they produce, and the less compensation they require for their time. We also observe an interesting effect: high meaning increases quantity of output (with an insignificant increase in quality) and low meaning decreases quality of output (with no change in quantity). It is possible that the level of perceived meaning affects how workers substitute their efforts between task quantity and task quality. The effect sizes were found to be the same in the US and India.

Our finding has important implications for those who employ labor in any short-term capacity besides crowdsourcing, such as temp work or piecework. As the world begins to outsource more of its work to anonymous pools of labor, it is vital to understand the dynamics of this labor market and the degree to which non-pecuniary incentives matter. This study demonstrates that they do matter, and they matter to a significant degree.

This study also serves as an example of what MTurk offers economists: an excellent platform for high internal validity natural field experiments while evading the external validity problems that may occur in laboratory environments.

References

Angrist, J. D., 2001. Estimation of limited dependent variable models with dummy endogenous regressors. Journal of Business & Economic Statistics 19 (1), 2–28.

Ariely, D., Kamenica, E., Prelec, D., 2008. Man's search for meaning: The case of Legos. Journal of Economic Behavior & Organization 67 (3-4), 671–677.

Barankay, I., 2010. Rankings and social tournaments: Evidence from a field experiment. University of Pennsylvania mimeo.

Berinsky, A. J., Huber, G. A., Lenz, G. S., 2012. Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk. Political Analysis 20, 351–368.

Croson, R., Gneezy, U., 2009. Gender differences in preferences. Journal of Economic Literature 47 (2), 448–474.

Eriksson, K., Simpson, B., 2010. Emotional reactions to losing explain gender differences in entering a risky lottery. Judgment and Decision Making 5 (3), 159–163.

Gneezy, U., List, J. A., 2006. Putting behavioral economics to work: Testing for gift exchange in labor markets using field experiments. Econometrica 74 (5), 1365–1384.

Harrison, G. W., List, J. A., 2004. Field experiments. Journal of Economic Literature 42 (4), 1009–1055.

Henrich, J., Heine, S., Norenzayan, A., et al., 2010. The weirdest people in the world? Behavioral and Brain Sciences 33 (2-3), 61–83.

Holmes, S., Kapelner, A., 2010. DistributeEyes. Stanford University manuscript.

Horton, J. J., Chilton, L., 2010. The labor economics of paid crowdsourcing. Proceedings of the ACM Conference on Electronic Commerce, forthcoming.

Horton, J. J., Rand, D. G., Zeckhauser, R. J., 2011. The online laboratory: Conducting experiments in a real labor market. Experimental Economics 14, 399–425.

Ipeirotis, P., 2010. Demographics of Mechanical Turk. CeDER working paper CeDER-10-01, New York University, Stern School of Business.

Levitt, S., List, J., 2009. Field experiments in economics: The past, the present, and the future. European Economic Review 53 (1), 1–18.
Levitt, S. D., List, J. A., 2007. What do laboratory experiments measuring social preferences reveal about the real world? Journal of Economic Perspectives 21 (2), 153–174.

Paolacci, G., Chandler, J., Ipeirotis, P., 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5 (5), 411–419.

Preston, A. E., 1989. The nonprofit worker in a for-profit world. Journal of Labor Economics 7 (4), 438–463.

Rosen, S., 1986. The theory of equalizing differences. Handbook of Labor Economics 1, 641–692.

Rosso, B. D., Dekas, K. H., Wrzesniewski, A., 2010. On the meaning of work: A theoretical integration and review. Research in Organizational Behavior 30, 91–127.

Sprouse, J., 2011. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods 43 (1), 155–167.

Stern, S., 2004. Do scientists pay to be scientists? Management Science 50 (6), 835–853.

Wrzesniewski, A., McCauley, C., Rozin, P., Schwartz, B., 1997. Jobs, careers, and callings: People's relations to their work. Journal of Research in Personality 31 (1), 21–33.

Appendix A. Detailed Experimental Design
This section details the exact screens shown to users in the experimental groups. The worker begins by encountering the HIT on the MTurk platform (see Figure A.2).
Figure A.2: The HIT as initially encountered on MTurk. Note: we used an alias in order to appear as a non-corporate and non-institutional employer.
The worker can then click on the HIT and see the "preview screen," which describes the HIT with text (not shown). In retrospect, a flashy image enticing the worker into the HIT would most likely have increased throughput. If the worker chooses to accept, they are immediately directed to a multi-purpose page which hosts a colorblindness test, a demographic survey, and an audio test for functioning speakers (see Figure A.3). Although many tasks require workers to answer questions before working, we avoided asking too many survey-like questions to avoid appearing as an experiment.

At this point, the worker is randomized into one of the three treatments and transitioned to the "qualification test." The page displays an instructional video, varying by treatment, which they cannot fast-forward. Screenshots of the videos are shown in Figures A.4, A.5, and A.6.

We include the verbatim script for the videos below. Text that differs between treatments is typeset in square brackets separated by a slash: the text before the slash belongs to the meaningful treatment and the text following the slash belongs to both the zero-context and shredded treatments.
Thanks for participating in this task. [Your job will be to help identify tumor cells in images and we appreciate your help. / In this task, you'll look at images and find objects of interest.]

In this video tutorial, we'll explain [three / two] things: [First, why you're labeling the images, which is to help researchers identify tumorous cancer cells. Next, we'll show you how to identify those tumor cells. / First, we'll show you how to identify objects of interest in images.] [Finally, / Then,] we'll explain how after labeling your first image you'll have a chance to label some more.

(We thank Rob Cohen, who did an excellent job narrating both scripts.)

Figure A.3: The colorblindness test.

Figure A.4: Opening screen of the training video. (a) Zero-context / shredded treatments; (b) meaningful treatment.

Figure A.5: Examples of meaningful cues which are not present in the zero-context and shredded treatment instructional video.

Now we're ready to learn how to identify [tumor cells / objects of interest] in images. Some example pictures of the [tumor cells / objects of interest] you'll be identifying can be found at the bottom left. Each [tumor cell / object of interest] is blue and circular and surrounded by a red border.

Figure A.6: Describing the training process. (a) Zero-context / shredded treatments; (b) meaningful treatment.

When you begin each image, the magnification will be set to the lowest resolution. This gives you an overview of all points on the image, but you'll need to zoom in and out in order to make the most precise clicks in the center of the [tumor cells / objects of interest].

Let's scroll through the image and find some [tumor cells / objects of interest] to identify.

Here's a large cluster of [tumor cells / objects of interest]. To identify them, it is very important to click as closely to the center as possible on each [cell / object]. If I make a mistake and don't click in the center, I can undo the point by right-clicking.

Notice that this [cell / point] isn't entirely surrounded by red, [probably because the cell broke off]. Even though it's not entirely surrounded by red, we still want to identify it as a [tumor cell / object of interest].

In order to ensure that you've located all [tumor cells / objects of interest], you should use the thumbnail view in the top right. You can also use the magnification buttons to zoom out.

It looks like we missed a cluster of [tumor cells / objects of interest] at the bottom. Let's go identify those points.

Remember once again, that if you click on something that is not a [tumor cell / object of interest], you can unclick by right-clicking.

Using the scroll bars, we'll navigate to the other points ... and here's some more to the left ... Now that we think we've identified all points, let's zoom out to be sure and scroll around.

Before submitting, we should be sure of three things: (1) that we've identified all [tumor cells / objects of interest]; (2) that we've clicked in the center of each one; (3) that we haven't clicked on anything that's not a [tumor cell / object of interest]. Once we've done that, we're ready to submit.

Finally, after you complete your first image, you'll have an opportunity to label additional images as part of this HIT. The first images you label will pay more to compensate for training. After that, as part of this HIT you'll have the chance to identify as many additional images as you like, as long as you aren't taking more than 15 minutes per image. Although you can label unlimited images in this HIT, you won't be able to accept more HITs.
This is to give a variety of turkers an opportunity to identify the images. [Thank you for your time and effort. Advances in the field of cancer treatment and prevention rely on the selfless contributions of countless individuals such as yourself.]

Figure A.7: The quiz after watching the training video for the meaningful treatment. In the zero-context and shredded treatments, all mentions of "tumor cells" are replaced by "objects of interest." The shredded treatment has an additional question asking workers to acknowledge that they are working on a test system and their work will be discarded. Green indicates a correct response; red indicates an incorrect response.

Figure A.8: The training interface as seen by workers. (a) Meaningful treatment; (b) zero-context / shredded treatments. The meaningful interface reminds the subjects in 8 places that they are identifying tumor cells. The zero-context interface only says "objects of interest," and the shredded condition in addition has a message in red indicating that their points will not be saved (unshown). The circles around each point were not visible to participants; we display them to illustrate the size of a 10-pixel radius.

Participants were given 15 minutes to mark an image. Above the training window, we displayed a countdown timer that indicated the amount of time left. The participant's total earnings were also prominently displayed atop. On the very top, we provided a submit button that allowed the worker to submit results at any time.

Each image had the same 90 cells from various-sized clusters. The cell clusters were selected to be unambiguous examples of cells, thereby avoiding the difficulty of training workers on difficult-to-identify tumor cells. In each image, the same clusters were arranged and rotated haphazardly, then pasted on one of five different believable backgrounds using Adobe Photoshop. Those clusters were then further rotated to create a set of ten images. This setup guarantees that the difficulty was relatively the same image-to-image. Images were displayed in random order for each worker, repeating after each set of ten (repetition was not an issue since it was rare for a participant to label more than ten).

After the worker finishes labeling, they press submit and are led to an intermediate page which asks if they would like to label another image; the new wage is prominently displayed (see Figure A.9). In the meaningful treatment, we add one last cue of meaning: a stock photo of a researcher to emphasize the purpose of the task. In the shredded treatment, we append the text "NONE of your points will be saved because we are testing our system, but you will still be paid." If the worker wishes to continue, they are led to another labeling task; otherwise, they are directed to the post-manipulation check survey shown in Figure A.10.

Figure A.9: The landing page after a labeling task is completed. At this point, workers are asked if they'd like to label another image or quit and be paid what they've earned so far. (a) Zero-context / shredded treatments; (b) meaningful treatment.

The program ensures that the worker is being honest. We require them to find more than 20% of the cells (the workers were unaware that we were able to monitor their accuracy). If they are found cheating on three images, they are deemed fraudulent and not allowed to label more images. Since payment is automatic, this protects us from a worker depleting our research account. In practice, this happened rarely and was not correlated with treatment.
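This honesty check can be sketched as follows (a minimal illustration with hypothetical names, not the production code), where an image on which no more than 20% of the cells are found counts as a strike and three strikes bar the worker from further labeling:

    def is_fraudulent(fractions_found, threshold=0.20, max_strikes=3):
        """Return True if the worker should be barred from labeling more
        images. `fractions_found` holds, for each submitted image, the
        fraction of that image's 90 cells the worker found."""
        strikes = sum(1 for frac in fractions_found if frac <= threshold)
        return strikes >= max_strikes

    # e.g., is_fraudulent([0.95, 0.10, 0.15, 0.05]) returns True (three strikes)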
Appendix B. A Technical Guide to Running Field Experiments on Mechanical Turk
Institutional Review Board (IRB) Requirements
This study requires the use of deception in order to observe social preferences in a natural environment, and thus it is not exempted under category 2's survey procedures. The issue is that you cannot give the subjects an initial consent form indicating that they are part of an experiment.

Upon waiving the requirement of consent, the IRB will most likely require you to issue a debriefing statement to your subjects stating that they were part of an experiment, a blurb about the purpose of the experiment, and contact information for your institution's IRB. In order for the experiment to work properly, you can only issue the debriefing after data collection is completed. Otherwise, ...

Figure A.10: The survey a subject fills out upon completion of the task.