Breaking Monotony with Meaning: Motivation in Crowdsourcing Markets
Dana Chandler a,∗, Adam Kapelner b

a Massachusetts Institute of Technology, 50 Memorial Drive, Cambridge, MA 02142
b The Wharton School of the University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 19104
Abstract
We conduct the first natural field experiment to explore the relationship between the "meaningfulness" of a task and worker effort. We employed about 2,500 workers from Amazon's Mechanical Turk (MTurk), an online labor market, to label medical images. Although given an identical task, we experimentally manipulated how the task was framed. Subjects in the meaningful treatment were told that they were labeling tumor cells in order to assist medical researchers; subjects in the zero-context condition (the control group) were not told the purpose of the task; and, in stark contrast, subjects in the shredded treatment were not given context and were additionally told that their work would be discarded. We found that when a task was framed more meaningfully, workers were more likely to participate. We also found that the meaningful treatment increased the quantity of output (with an insignificant change in quality) while the shredded treatment decreased the quality of output (with no change in quantity). We believe these results will generalize to other short-term labor markets. Our study also discusses MTurk as an exciting platform for running natural field experiments in economics.
Keywords: natural field experiment, worker motivation, crowdsourcing, online labor markets
Both authors contributed equally to this work. The authors wish to thank Professor Susan Holmes of Stanford University for comments and for allowing us to adapt the DistributeEyes software for our experiment (funded under NIH grant ...).
∗ Principal corresponding author
Email addresses: [email protected] (Dana Chandler), [email protected] (Adam Kapelner)
Preprint submitted to Elsevier, October 4, 2012

1. Introduction

Economists, philosophers, and social scientists have long recognized that non-pecuniary factors are powerful motivators that influence choice of occupation. For a multidisciplinary literature review on the role of meaning in the workplace, we recommend Rosso et al. (2010). Previous studies in this area have generally been based on ethnographies, observational studies, or laboratory experiments. For instance, Wrzesniewski et al. (1997) used ethnographies to classify work into jobs, careers, or callings. Using an observational study, Preston (1989) demonstrated that workers may accept lower wages in the non-profit sector in order to produce goods with social externalities. Finally, Ariely et al. (2008) showed that labor had to be both recognizable and purposeful to have meaning. In this paper, we limit our discussion to the role of meaning in economics, particularly through the lens of compensating differentials. We perform the first natural field experiment (Harrison and List, 2004) in a real-effort task that manipulates levels of meaningfulness. This method overcomes a number of shortcomings of the previous literature, including interview bias, omitted variable bias, and concerns of external validity beyond the laboratory.

We study whether employers can deliberately alter the perceived "meaningfulness" of a task in order to induce people to do more and higher quality work, and thereby work for a lower wage. We chose a task that would appear meaningful to many people if given the right context: helping cancer researchers mark tumor cells in medical images. Subjects in the meaningful treatment were told the purpose of their task is to "help researchers identify tumor cells;" subjects in our zero-context group were not given any reason for their work and the cells were instead referred to as mere "objects of interest;" and laborers in the shredded group were given zero context but were also explicitly told that their labelings would be discarded upon submission. Hence, the pay structure, task requirements, and working conditions were identical, but we added cues to alter the perceived meaningfulness of the task.

We recruited workers from the United States and India from Amazon's Mechanical Turk (MTurk), an online labor market where people around the world complete short, "one-off" tasks for pay. The MTurk environment is a spot market for labor characterized by relative anonymity and a lack of strong reputational mechanisms. As a result, it is well-suited for an experiment involving the meaningfulness of a task, since the variation we introduce regarding a task's meaningfulness is less affected by desires to exhibit pro-social behavior or an anticipation of future work (career concerns). We ensured that our task appeared like any other task in the marketplace and was comparable in terms of difficulty, duration, and wage.

Our study is representative of the kinds of natural field experiments for which MTurk is particularly suited. Section 2 explores MTurk's potential as a platform for field experimentation using the framework proposed in Levitt and List (2007, 2009).

We contribute to the literature on compensating wage differentials (Rosen, 1986) and the organizational behavior literature on the role of meaning in the workplace (Rosso et al., 2010).
Within economics, Stern (2004) provides quasi-experimental evidence on compensating differentials within the labor market for scientists by comparing wages for academic and private-sector job offers among recent Ph.D. graduates. He finds that "scientists pay to be scientists" and require higher wages in order to accept private-sector research jobs because of the reduced intellectual freedom and a reduced ability to interact with the scientific community and receive social recognition. Ariely et al. (2008) use a laboratory experiment with undergraduates to vary the meaningfulness of two separate tasks: (1) assembling Legos and (2) finding 10 instances of consecutive letters on a sheet of random letters. Our experiment augments experiment 1 in Ariely et al. (2008) by testing whether their results extend to the field. Additionally, we introduce a richer measure of task effort, namely task quality. Where our experiments are comparable, we find that our results parallel theirs.

We find that the main effect of making our task more meaningful is to induce a higher fraction of workers to complete our task, hereafter dubbed "induced to work." In the meaningful treatment, 80.6% of people labeled at least one image, compared with 76.2% in the zero-context and 72.3% in the shredded treatments.

After labeling their first image, workers were given the opportunity to label additional images at a declining piece rate. We also measure whether the treatments increase the quantity of images labeled. We classify participants as "high-output" workers if they label five or more images (an amount corresponding to roughly the top tercile of those who label) and we find that workers are approximately 23% more likely to be high-output workers in the meaningful group.

We introduce a measure of task quality by telling workers the importance of accurately labeling each cell by clicking as close to the center as possible. We first note that MTurk labor is high quality, with an average of 91% of cells found. The meaningful treatment had an ambiguous effect on quality, but the shredded condition in both countries lowered the proportion of cells found by about 7%.

By measuring both quantity and quality, we are able to observe how task effort is apportioned between these two "dimensions of effort." Do workers work "harder" or "longer" or both? We found an interesting result: the meaningful condition seems to increase quantity without a corresponding increase in quality, and the shredded treatment decreases quality without a corresponding decrease in quantity. Investigating whether this pattern generalizes to other domains may be a fruitful avenue for future research.

Finally, we calculate participants' average hourly wage based on how long they spent on the task. We find that subjects in the meaningful group work for $1.34 per hour, which is 6 cents less per hour than zero-context participants and 14 cents less per hour than shredded-condition participants.

We expect our findings to generalize to other short-term work environments such as temporary employment or piecework. In these environments, employers may not consider that non-pecuniary incentives of meaningfulness matter; we argue that these incentives do matter, and to a significant degree.

Section 2 provides background on MTurk and discusses its use as a platform for conducting economic field experiments. Section 3 describes our experimental design.
Section 4 presents our results and discussion, and Section 5 concludes. Appendix A provides full details on our experimental design, and Appendix B is a technical appendix for conducting experiments using the MTurk platform.
2. Mechanical Turk and its potential for field experimentation
Amazon's Mechanical Turk (MTurk) is the largest online, task-based labor market and is used by hundreds of thousands of people worldwide. Individuals and companies can post tasks (known as Human Intelligence Tasks, or "HITs") and have them completed by an on-demand labor force. Typical tasks include image labeling, audio transcription, and basic internet research. Academics also use MTurk to outsource low-skilled research tasks such as identifying linguistic patterns in text (Sprouse, 2011) and labeling medical images (Holmes and Kapelner, 2010). The image labeling system from the latter study, known as "DistributeEyes," was originally used by breast cancer researchers and was modified for our experiment.

Beyond simply using MTurk as a source of labor, academics have also begun using MTurk as a way to conduct online experiments. The remainder of this section highlights some of the ways this subject pool is used and places special emphasis on the suitability of the environment for natural field experiments in economics.
As Henrich et al. (2010) argue, many findings from social science are disproportionately based on what they call "W.E.I.R.D." subject pools (Western, Educated, Industrialized, Rich, and Democratic), and as a result it is inappropriate to believe the results generalize to larger populations. Since MTurk has users from around the world, it is also possible to conduct research across cultures. For example, Eriksson and Simpson (2010) use a cross-national sample from MTurk to test whether differential preferences for competitive environments are explained by females' stronger emotional reaction to losing, as hypothesized by Croson and Gneezy (2009).

It is natural to ask whether results from MTurk generalize to other populations. Paolacci et al. (2010) assuage these concerns by replicating three classic framing experiments on MTurk: the Asian Disease Problem, the Linda Problem, and the Physician Problem; Horton et al. (2011) provide additional replication evidence for experiments related to framing, social preferences, and priming. Berinsky et al. (2012) argue that the MTurk population has "attractive characteristics" because it approximates gold-standard probability samples of the US population. All three studies find that the direction and magnitude of the effects line up well compared with those found in the laboratory.

An advantage of MTurk relative to the laboratory is that the researcher can rapidly scale experiments, recruiting hundreds of subjects within only a few days and at substantially lower costs.

Apart from general usage by academics, the MTurk environment offers additional benefits for experimental economists and researchers conducting natural field experiments. We analyze the MTurk environment within the framework laid out in Levitt and List (2007, 2009).

In the ideal natural field experiment, "the environment is such that the subjects naturally undertake these tasks and [do not know] that they are participants in an experiment." Additionally, the experimenter must exert a high degree of control over the environment without attracting attention or causing participants to behave unnaturally. MTurk's power comes from the ability to construct customized and highly-tailored environments related to the question being studied. It is possible to collect very detailed measures of user behavior such as precise time spent on a webpage, mouse movements, and positions of clicks. In our experiment, we use such data to construct a precise quality measure.

MTurk is particularly well-suited to experimenter-as-employer designs (Gneezy and List, 2006) as a way to study worker incentives and the employment relationship without having to rely on the cooperation of private-sector firms; as Barankay (2010) remarks, "the experimenter [posing] as the firm [gives] substantial control about the protocol and thereby eliminates many project risks related to field experiments." For example, Barankay (2010) posted identical image labeling tasks and varied whether workers were given feedback on their relative performance (i.e., ranking) in order to study whether providing rank-order feedback led workers to return for a subsequent work opportunity. For a more detailed overview of how online labor markets can be used in experiments, see Horton et al. (2011).

Levitt and List (2007) enumerate possible complications that arise when experimental findings are extrapolated outside the lab: scrutiny, anonymity, stakes, selection, and artificial restrictions. We analyze each complication in the context of our experiment and in the context of experimentation using MTurk in general.
Scrutiny and anonymity. In the lab, experimenter effects can be powerful; subjects behave differently if they are aware their behavior is being watched. Relatedly, subjects frequently lack anonymity and believe their choices will be scrutinized after the experiment. On MTurk, interaction between workers and employers is almost non-existent; most tasks are completed without any communication, and workers are identifiable only by a numeric identifier. Consequently, we believe that MTurk experiments are less likely to be biased by these complications. (For scale: in our study we paid 2,471 subjects $789 in total for 701 hours of work, equating to 31 cents per observation; this includes 60 subjects whose data were not usable.)

Stakes. In the lab or field, it is essential to "account properly for the differences in stakes across settings" (Levitt and List, 2007). We believe that our results would generalize to other short-term work environments, but we would not expect them to be generalizable to long-term employment decisions such as occupational choice. Stakes must also be chosen adequately for the environment, and so we were careful to match wages to the market average.

Selection. Experiments fail to be generalizable when "participants in the study differ in systematic ways from the actors engaged in the targeted real-world setting." Within MTurk, it is unlikely that there is selection into our experiment, since our task was designed to appear similar to real tasks. The MTurk population also seems representative along a number of observable demographic characteristics (Berinsky et al., 2012); however, we acknowledge that there are potentially unobservable differences between our subject pool and the broader population. Still, we believe that MTurk subject behavior would generalize to workers' behavior in other short-term labor markets.
Artificial restrictions. Lab experiments place unusual and artificial restrictions on the actions available to subjects, and they examine only small, non-representative windows of time because the experimenter typically cannot retain subjects over long time horizons. In structuring our experiment, workers had substantial latitude in how they performed their task. In contrast with the lab, subjects could "show up" to our task whenever they wanted, leave at will, and were not time-constrained. Nevertheless, we acknowledge that while our experiment succeeded in matching short-term labor environments like MTurk, our results do not easily generalize to longer-term employment relationships.

Levitt and List (2009) highlight two limitations of field experiments vis-à-vis laboratory experiments: the need for cooperation with third parties and the difficulty of replication. MTurk does not suffer from these limitations. Work environments can be created by researchers without the need for a private-sector partner, whose interests may diverge substantially from those of the researcher. Further, MTurk experiments can be replicated simply by downloading source code and re-running the experiment. In many ways, this allows a push-button replication that is far better than that offered in the lab.
3. Experimental Design
In running our randomized natural field experiment, we posted our experimental task so that it would appear like any other task (image labeling tasks are among the most commonly performed tasks on MTurk). Subjects had no indication they were participating in an experiment. Moreover, since MTurk is a market where people ordinarily perform one-off tasks, our experiment could be listed inconspicuously.

We hired a total of 2,471 workers (1,318 from the US and 1,153 from India). Although we tried to recruit equally from both countries, there were fewer Indians in our sample since attrition in India was higher. We collected each worker's age and gender during a "colorblindness" test that we administered as part of the task. These and other summary statistics can be found in Table 1. By contracting workers from the US and India, we can also test whether workers from each country respond differentially to the meaningfulness of a task.

Our task was presented so that it appeared to be a one-time work opportunity (subjects were barred from doing the experiment more than once), and our design sought to maximize the amount of work we could extract during this short interaction. The first image labeling paid $0.10, the next paid $0.09, and so on, leveling off at $0.02 per image. This wage structure was also used in Ariely et al. (2008) and has the benefit of preventing people from working too long.
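To make this schedule concrete, here is a minimal Python sketch (with hypothetical function names; the actual experiment software is not reproduced here) of the declining piece rate described above:

    def wage_for_image(k):
        """Pay for the k-th labeling: $0.10 for the first, declining by
        $0.01 per image and leveling off at $0.02."""
        return max(0.10 - 0.01 * (k - 1), 0.02)

    def total_earnings(n_images):
        """Cumulative pay for labeling n_images images."""
        return sum(wage_for_image(k) for k in range(1, n_images + 1))

    # For example, labeling five images pays 0.10 + 0.09 + 0.08 + 0.07 + 0.06 = $0.40.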
Upon accepting our task, workers provided basic demographic information and passed a colorblindness test. Next, they were randomized into either the meaningful, the zero-context, or the shredded condition. Those in the shredded condition were shown a warning message stating that their labelings would not be recorded, and we gave them the option to leave. Then, all participants were required to watch an instructional video which they could not fast-forward. See Appendix A for the full script of the video as well as screenshots.

The video for the meaningful treatment began immediately with cues of meaning. We adopt a similar working definition of "meaningfulness" as used in Ariely et al. (2008): "Labor [or a task] is meaningful to the extent that (a) it is recognized and/or (b) has some point or purpose."

We varied the levels of meaningfulness by altering the degree of recognition and the detail used to explain the purpose of our task. In our meaningful group, we provided "recognition" by thanking the laborers for working on our task. We then explained the "purpose" of the task by creating a narrative explaining how researchers were inundated with more medical images than they could possibly label and that they needed the help of ordinary people. In contrast, the zero-context and shredded groups were not given recognition, told the purpose of the task, or thanked for participating; they were only given basic instructions. Analyzing the results from a post-manipulation check (see Section 4.4), we are confident that these cues of meaning induced the desired effect.

Both videos identically described the wage structure and the mechanics of how to label cells and properly use the task interface (including zooming in/out and deleting points, which are metrics we analyze). However, in the meaningful treatment, cells were referred to as "cancerous tumor cells" whereas in the zero-context and shredded treatments, they were referred to as nondescript "objects of interest." Except for this phrase change, both scripts were identical during the instructional sections of the videos. To emphasize these cues, workers in the meaningful group heard the words "tumor," "tumor cells," "cells," etc. 16 times before labeling their first image, and similar cues on the task interface reminded them of the purpose of the task as they labeled.

3.3. Task interface, incentive structure, and response variables
After the video, we administered a short multiple-choice quiz testing workers' comprehension of the task and user interface. In the shredded condition, we added a final question asking workers to again acknowledge that their work would not be recorded.

Upon passing the quiz, workers were directed to a task interface which displayed the image to be labeled and allowed users to mark cancerous tumor cells (or "objects of interest") by clicking (see Figure 1). The image shown was one of ten look-alike photoshopped images displayed in random order. We also provided the workers with controls (zoom functionality and the ability to delete points) whose proper use would allow them to produce high-quality labelings.
Figure 1: Main task portal for a subject in the meaningful treatment. Workers were asked to identify all tumors in the image. Each image had 90 cells and took 5 minutes on average. Our interface reminds the workers in 8 places that they are identifying tumor cells. The black circles around each point were not visible to participants; we display them to illustrate the size of a 10-pixel radius.
During the experiment, we measured three response variables: (1) induced to work, (2) quantity of image labelings, and (3) quality of image labelings.

Many subjects can, and do, stop performing a task even after agreeing to complete it. While submitting bad work on MTurk is penalized, workers can abandon a task with only nominal penalty. Hence, we measure attrition with the response variable induced to work. Workers were only counted as induced to work if they watched the video, passed the quiz, and completed one image labeling. Our experimental design deliberately encourages attrition by imposing an upfront and unpaid cost of watching a three-minute instructional video and passing a quiz before moving on to the actual task.

Workers were paid $0.10 for the first image labeling. They were then given an option to label another image for $0.09, and then another image for $0.08, and so on. At $0.02, we stopped decreasing the wage and the worker was allowed to label images at this pay rate indefinitely. After each image, the worker could either collect what they had earned thus far, or label more images. We used the quantity of image labelings as our second response variable.

In our instructional video, we emphasized the importance of marking the exact center of each cell. When a worker labeled a cell by clicking on the image, we measured that click location to the nearest pixel. Thus, we were able to detect if the click came "close" to the actual cell. Our third response variable, quality of image labelings, is the proportion of objects identified, based on whether a worker's click fell within a pixel radius of the object's true center. We discuss the radii we picked in the following section.

After workers chose to stop labeling images and collect their earnings, they were given a five-question post-manipulation check (PMC) survey which asked whether they thought the task (a) was enjoyable, (b) had purpose, (c) gave them a sense of accomplishment, (d) was meaningful, and (e) made their efforts recognized. Responses were collected on a five-point Likert scale. We also provided a text box to elicit free-response comments; about 24% of respondents left comments, with no difference across treatments.

Hypothesis 1.

We hypothesize that at equal wages, the meaningful treatment will have the highest proportion of workers induced to work and the shredded condition will have the lowest proportion. In the following section, we provide theoretical justification for this prediction.
Hypothesis 2.
As in Ariely et al. (2008), we hypothesize that the quantity of images labeled will be increasing in the level of meaningfulness.
Hypothesis 3.
In addition to quantity, we measure the quality of image labelings and hypothesize that it is increasing in the level of meaningfulness.
Hypothesis 4.

Based upon prior survey research on MTurk populations, we hypothesize that Indian workers are less responsive to meaning. Ipeirotis (2010) finds that Indians are more likely to have MTurk as a primary source of income (27% vs. 14% in the US). Likewise, people in the US are nearly twice as likely to report doing tasks because they are fun (41% vs. 20%). Therefore, one might expect financial motivations to be more important for Indian workers, although Horton et al. (2011) find that workers of both types are strongly motivated by money.

4. Experimental Results and Discussion

We ran the experiment on N = 2,471 subjects (1,318 from the United States and 1,153 from India). Table 1 shows summary statistics for our response variables (induced to work, number of images, and quality), demographic variables, and hourly wage.
                      Shredded      Zero Context   Meaningful    US only       India only
% Induced to Work     .723          .762           .806          .85           .666
# images (if >= 1)    5.94          —              —             —             —
N                     828           798            845           1318          1153
Coarse quality        .883 ± .21    .904 ± .18     .930 ± .14    .924 ± .15    .881 ± .21
Fine quality          .614 ± .22    .651 ± .21     .676 ± .18    .668 ± .19    .621 ± .26
PMC Meaning           3.44          —              —             —             —

Table 1: Summary statistics for response variables and demographics by treatment and country (mean ± standard deviation where available). The statistics for the quality metrics are computed by averaging each worker's average quality (only for workers who labeled one or more images). The statistics for the PMC meaning question only include workers who finished the task and the survey.
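The two-stage averaging in the quality rows of Table 1 (each worker's labelings are averaged first, then the mean and standard deviation are taken across those worker-level averages) can be sketched as follows, assuming a hypothetical mapping from worker id to that worker's per-image quality scores:

    from statistics import mean, stdev

    def quality_summary(worker_scores):
        """Mean and standard deviation, across workers, of each worker's
        average per-image quality (workers with at least one labeling)."""
        per_worker = [mean(scores) for scores in worker_scores.values()]
        return mean(per_worker), stdev(per_worker)

    # e.g., quality_summary({"w1": [0.85, 0.95], "w2": [0.95]}) -> (0.925, ~0.035)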
Broadly speaking, as the level of meaning increases, subjects are more likely to participate, they label more images, and they do so with higher quality. Across all treatments, US workers participate more often, label more images, and mark points with greater accuracy. Table 2 uses a heatmap to illustrate our main effect sizes and their significance levels by treatment, country, and response variable. Each cell indicates the size of a treatment effect relative to the control (i.e., the zero-context condition). Statistically significant positive effects are indicated using green fill, where darker green indicates higher levels of significance; statistically significant negative effects are indicated using red fill, where darker red indicates higher levels of significance. Black text without fill indicates effects that are marginally significant (p < .10).

Table 2: Direction and significance of treatment effects relative to the zero-context control, by country and response variable (induced to work, two or more labelings, five or more labelings, and quality). * p < .05, ** p < .01, *** p < .001.

4.1. Induced to work

We investigate how treatment and country affect whether or not subjects chose to do our task. Unlike in a laboratory environment, our subjects were workers in a relatively anonymous labor market and were not paid a "show-up fee." On MTurk, workers frequently start but do not finish tasks; attrition is therefore a practical concern for employers who hire from this market. In our experiment, on average, 25% of subjects began, but did not follow through by completing one full labeling.

Even in this difficult environment, we were able to increase participation among workers by roughly 4.6% by framing the task as more meaningful (see columns 1 and 2 of Table 3). The effect is robust to including various controls for age, gender, and time-of-day effects. As a subject in the meaningful treatment told us, "It's always nice to have [HITs] that take some thought and mean something to complete. Thank you for bringing them to MTurk." Another subject suggested the labelings could be put to "good use doing similar work with images, e.g. in dosimetry or pathology ... and it would free up medical professionals to do the heftier work." The shredded treatment discouraged workers and caused them to work 4.0% less often, but the effect was less significant (p = 0.057 without controls and p = 0.082 with controls). Thus, hypothesis 1 seems to be correct. Irrespective of treatment, subjects from India completed an image 18.5% less often.

Table 3: Regressions of induced to work (columns 1 and 2), two or more labelings (columns 3 and 4), and five or more labelings (columns 5 and 6) on treatment indicators. * p < .05, ** p < .01, *** p < .001. p-values are reported for partial F-tests for sets of different types of control variables.

4.2. Quantity of image labelings

Table 1 shows that the number of images increased with meaning. However, this result is conditional on being induced to work and is therefore contaminated with selection bias. We follow Angrist (2001) and handle selection by creating a dummy variable for "did two or more labelings" and a dummy for "did five or more labelings" and using them as responses (other cutoffs produced similar results).

We find mixed results regarding whether the level of meaningfulness affects the quantity of output. Being assigned to the meaningful treatment group did have a positive effect, but assignment to the shredded treatment did not result in a corresponding decrease in output.

Analyzing the outcome "two or more labelings," column 3 of Table 3 shows that the meaningful treatment induced 4.7% more subjects to label two or more images. For the outcome "five or more labelings" (an amount corresponding to the top tercile of quantity among people who were induced to work), the meaningful treatment was highly significant and induced 8.5% more subjects to label five or more images (p < .001 with and without controls), an increase of nearly 23 percent; the shredded treatment again has no effect.

Hypothesis 2 (quantity increases with meaningfulness) seems to be correct only when comparing the meaningful treatment to the zero-context treatment. An ambiguous effect of the shredded treatment on quantity is also reported by Ariely et al. (2008).

We did not find differential effects between the United States and India. In an unshown regression, we found that Americans were 9.5% more likely to label five or more images (p < .01) and Indians were 8.4% more likely to label five or more; the difference between the two countries' responses was insignificant.

4.3. Quality of image labelings

Quality was measured by the fraction of cells labeled within a distance of five pixels ("coarse quality") and two pixels ("fine quality") of their true centers. In presenting our results (see Table 4), we analyze the treatment effects using our fine quality measure. The coarse quality regression results were similar, but the fine quality measure had a much more dispersed distribution: the inter-quartile range of coarse quality overall was [93.3%, 97.2%], whereas the IQR of fine quality overall was [54.7%, 80.0%].
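To make this measure precise, here is a minimal Python sketch (hypothetical names; this is not the DistributeEyes production code) that computes the proportion of true cell centers having at least one click within a given pixel radius:

    import math

    def labeling_quality(true_centers, clicks, radius):
        """Proportion of cells with at least one click within `radius` pixels
        of the cell's true center; radius=5 yields coarse quality and
        radius=2 yields fine quality."""
        found = sum(
            1 for (cx, cy) in true_centers
            if any(math.hypot(x - cx, y - cy) <= radius for (x, y) in clicks)
        )
        return found / len(true_centers)

    # Each image contained the same 90 true centers, so for one labeling:
    # coarse = labeling_quality(centers, worker_clicks, radius=5)
    # fine = labeling_quality(centers, worker_clicks, radius=2)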
Our main result is that fine quality was 7.2% lower in the shredded treatment, but there was not a large corresponding increase in the meaningful treatment. This makes sense; if the workers knew their labelings were not going to be checked, there was no incentive to mark points carefully. This result was not different across countries (regression unshown). The meaningful treatment has a marginally significant effect only in the United States, where fine quality increased by 3.9% (p = 0.092 without controls and p = 0.044 with controls), but there was no effect in India. Thus, hypothesis 3 (quality increases with meaningfulness) seems to be correct only when comparing the shredded to the zero-context treatment, which is surprising.

Although Indian workers were less accurate than United States workers and had 5.3% lower quality (p < .001 and robust to controls), United States and Indian workers did not respond differentially to the shredded treatment. One caveat with our quality results is that we only observe quality for people who were induced to work and selected into our experiment (we have "attrition bias"). Attrition was 4% higher in the shredded treatment, and we presume that the people who opted out of labeling images would have labeled them with far worse quality had they remained in the experiment.

                     Both Countries          United States           India
Meaningful           0.007      0.014        0.039      0.039*       -0.031     -0.013
                     (0.017)    (0.014)      (0.023)    (0.019)      (0.025)    (0.021)
Shredded             -0.072***  -0.074***    -0.061*    -0.066**     -0.087**   -0.073**
                     (0.021)    (0.017)      (0.027)    (0.023)      (0.031)    (0.023)
India                -0.053***  -0.057***
                     (0.015)    (0.013)
Male                            0.053***                0.014                   0.100***
                                (0.013)                 (0.017)                 (0.021)
Labelings 6–10                  -0.018**                -0.024**                -0.016*
                                (0.006)                 (0.008)                 (0.008)
Labelings ≥ 11                  -0.140***               -0.116***               -0.148***
                                (0.017)                 (0.029)                 (0.020)
Constant             0.666***   0.645***     0.651***   0.625***     0.634***   0.588***
Controls (partial F-test p-values):
  Image                         0.00***                 0.00***                 0.00***
  Age                           0.10                    0.01**                  0.25
  Time of Day                   0.33                    0.29                    0.78
  Day of Week                   0.12                    0.46                    0.26

Table 4: Regressions of fine quality. For each sample (both countries, United States only, India only), the first column excludes and the second column includes control variables. Standard errors are in parentheses. * p < .05, ** p < .01, *** p < .001.

Experience matters. Once subjects had between 6 and 10 labelings under their belt, they were 1.8% less accurate (p < .01), and with 11 or more labelings, accuracy fell by 14.0% (p < .001). Finally, we found that some of the ten images were substantially harder to label accurately than others (a partial F-test for equality of the image fixed effects yields p < .001). Anecdotally, subjects from the shredded condition who submitted comments regarding the task were less likely to have expressed concerns about their accuracy; one subject from the meaningful group remarked that "[his] mouse was too sensitive to click accurately, even all the way zoomed in," but we found no such apologies or comments from people in the shredded group.

4.4. Post-manipulation check

In order to understand how our treatments affected the perceived meaningfulness of the task, we gave a post-manipulation check to all subjects who completed at least one image and did not abandon the task before payment. Ideally, we would have collected this information immediately after introducing the treatment condition; however, doing so would have compromised the credibility of our natural field experiment. This data should therefore be interpreted cautiously, given that subjects who completed the task and our survey are not representative of all subjects in our experiment.

We found that those in the meaningful treatment rated the task significantly higher in the post-manipulation check in both the United States and India. Using a five-point Likert scale, we asked workers to rate the perceived levels of meaningfulness, purpose, enjoyment, accomplishment, and recognition. In the meaningful treatment, subjective ratings were higher in all categories, but the self-rated levels of meaningfulness and purpose were the highest: the level of meaningfulness was 1.3 points higher in the US and 0.6 points higher in India, and the level of perceived purposefulness was 1.2 points higher in America and 0.5 points higher in India. In the United States, the level of accomplishment increased by only 0.8, and the levels of enjoyment and recognition increased by 0.3 and 0.5 respectively, with a marginal increase in India. As a US participant told us, "I felt it was a privilege to work on something so important and I would like to thank you for the opportunity."

We conclude that the meaningful frames accomplished their goal. Remarkably, those in the shredded treatment in either country did not report significantly lower ratings on any of the items in the post-manipulation check. Thus, the shredded treatment may not have had the desired effect.
5. Conclusion
Our experiment is the first that uses a natural field experiment in a real labor market to examine how a task's meaningfulness influences labor supply. Overall, we found that the greater the amount of meaning, the more likely a subject is to participate, the more output they produce, the higher quality output they produce, and the less compensation they require for their time. We also observe an interesting effect: high meaning increases quantity of output (with an insignificant increase in quality) and low meaning decreases quality of output (with no change in quantity). It is possible that the level of perceived meaning affects how workers substitute their efforts between task quantity and task quality. The effect sizes were found to be the same in the US and India.

Our finding has important implications for those who employ labor in any short-term capacity besides crowdsourcing, such as temp work or piecework. As the world begins to outsource more of its work to anonymous pools of labor, it is vital to understand the dynamics of this labor market and the degree to which non-pecuniary incentives matter. This study demonstrates that they do matter, and they matter to a significant degree.

This study also serves as an example of what MTurk offers economists: an excellent platform for high internal validity natural field experiments while evading the external validity problems that may occur in laboratory environments.

References

Angrist, J. D., 2001. Estimation of limited dependent variable models with dummy endogenous regressors. Journal of Business & Economic Statistics 19 (1), 2–28.

Ariely, D., Kamenica, E., Prelec, D., 2008. Man's search for meaning: The case of Legos. Journal of Economic Behavior & Organization 67 (3-4), 671–677.

Barankay, I., 2010. Rankings and social tournaments: Evidence from a field experiment. University of Pennsylvania mimeo.

Berinsky, A. J., Huber, G. A., Lenz, G. S., 2012. Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk. Political Analysis 20, 351–368.

Croson, R., Gneezy, U., 2009. Gender differences in preferences. Journal of Economic Literature 47 (2), 448–474.

Eriksson, K., Simpson, B., 2010. Emotional reactions to losing explain gender differences in entering a risky lottery. Judgment and Decision Making 5 (3), 159–163.

Gneezy, U., List, J. A., 2006. Putting behavioral economics to work: Testing for gift exchange in labor markets using field experiments. Econometrica 74 (5), 1365–1384.

Harrison, G. W., List, J. A., 2004. Field experiments. Journal of Economic Literature 42 (4), 1009–1055.

Henrich, J., Heine, S., Norenzayan, A., et al., 2010. The weirdest people in the world? Behavioral and Brain Sciences 33 (2-3), 61–83.

Holmes, S., Kapelner, A., 2010. DistributeEyes. Stanford University manuscript.

Horton, J. J., Chilton, L., 2010. The labor economics of paid crowdsourcing. Proceedings of the ACM Conference on Electronic Commerce, forthcoming.

Horton, J. J., Rand, D. G., Zeckhauser, R. J., 2011. The online laboratory: Conducting experiments in a real labor market. Experimental Economics 14, 399–425.

Ipeirotis, P., 2010. Demographics of Mechanical Turk. CeDER working paper CeDER-10-01, New York University, Stern School of Business.

Levitt, S., List, J., 2009. Field experiments in economics: The past, the present, and the future. European Economic Review 53 (1), 1–18.
Levitt, S. D., List, J. A., 2007. What do laboratory experiments measuring social preferences reveal about the real world? Journal of Economic Perspectives 21 (2), 153–174.

Paolacci, G., Chandler, J., Ipeirotis, P., 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5 (5), 411–419.

Preston, A. E., 1989. The nonprofit worker in a for-profit world. Journal of Labor Economics 7 (4), 438–463.

Rosen, S., 1986. The theory of equalizing differences. Handbook of Labor Economics 1, 641–692.

Rosso, B. D., Dekas, K. H., Wrzesniewski, A., 2010. On the meaning of work: A theoretical integration and review. Research in Organizational Behavior 30, 91–127.

Sprouse, J., 2011. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods 43 (1), 155–167.

Stern, S., 2004. Do scientists pay to be scientists? Management Science 50 (6), 835–853.

Wrzesniewski, A., McCauley, C., Rozin, P., Schwartz, B., 1997. Jobs, careers, and callings: People's relations to their work. Journal of Research in Personality 31 (1), 21–33.

Appendix A. Detailed Experimental Design
This section details the exact screens shown to users in the experimental groups. The worker begins by encountering the HIT on the MTurk platform (see Figure A.2).
Figure A.2: The HIT as initially encountered on MTurk. Note: we used an alias in order to appear as a non-corporate and non-institutional employer.
The worker can then click on the HIT and see the "preview screen," which describes the HIT with text (not shown). In retrospect, a flashy image enticing the worker into the HIT would most likely have increased throughput. If the worker chooses to accept, they are immediately directed to a multi-purpose page which hosts a colorblindness test, a demographic survey, and an audio test for functioning speakers (see Figure A.3). Although many tasks require workers to answer questions before working, we avoided asking too many survey-like questions to avoid appearing as an experiment.

At this point, the worker is randomized into one of the three treatments and transitioned to the "qualification test." The page displays an instructional video, varying by treatment, which they cannot fast-forward. Screenshots of the videos are shown in Figures A.4, A.5, and A.6.

We include the verbatim script for the videos below. Text that differs between treatments is typeset in square brackets separated by a slash: the text before the slash belongs to the meaningful treatment and the text following the slash belongs to both the zero-context and shredded treatments.
Thanks for participating in this task. [Your job will be to help identify tumor cells in images and we appreciate your help. / In this task, you'll look at images and find objects of interest.]

In this video tutorial, we'll explain [three / two] things: [First, why you're labeling the images, which is to help researchers identify tumorous cancer cells. Next, we'll show you how to identify those tumor cells. / First, we'll show you how to identify objects of interest in images.] [Finally, / Then,] we'll explain how after labeling your first image you'll have a chance to label some more.

(We thank Rob Cohen, who did an excellent job narrating both scripts.)

Figure A.3: The colorblindness test.

Figure A.4: Opening screen of the training video. (a) Zero-context / shredded treatments; (b) meaningful treatment.

Figure A.5: Examples of meaningful cues which are not present in the zero-context and shredded treatment instructional video.

Now we're ready to learn how to identify [tumor cells / objects of interest] in images. Some example pictures of the [tumor cells / objects of interest] you'll be identifying can be found at the bottom left. Each [tumor cell / object of interest] is blue and circular and surrounded by a red border.

Figure A.6: Describing the training process. (a) Zero-context / shredded treatments; (b) meaningful treatment.

When you begin each image, the magnification will be set to the lowest resolution. This gives you an overview of all points on the image, but you'll need to zoom in and out in order to make the most precise clicks in the center of the [tumor cells / objects of interest].

Let's scroll through the image and find some [tumor cells / objects of interest] to identify.

Here's a large cluster of [tumor cells / objects of interest]. To identify them, it is very important to click as closely to the center as possible on each [cell / object]. If I make a mistake and don't click in the center, I can undo the point by right-clicking.

Notice that this [cell / point] isn't entirely surrounded by red, [probably because the cell broke off]. Even though it's not entirely surrounded by red, we still want to identify it as a [tumor cell / object of interest].

In order to ensure that you've located all [tumor cells / objects of interest], you should use the thumbnail view in the top right. You can also use the magnification buttons to zoom out.

It looks like we missed a cluster of [tumor cells / objects of interest] at the bottom. Let's go identify those points.

Remember once again, that if you click on something that is not a [tumor cell / object of interest], you can unclick by right-clicking.

Using the scroll bars, we'll navigate to the other points ... and here's some more to the left ... Now that we think we've identified all points, let's zoom out to be sure and scroll around.

Before submitting, we should be sure of three things: (1) that we've identified all [tumor cells / objects of interest]; (2) that we've clicked in the center of each one; (3) that we haven't clicked on anything that's not a [tumor cell / object of interest]. Once we've done that, we're ready to submit.

Finally, after you complete your first image, you'll have an opportunity to label additional images as part of this HIT. The first images you label will pay more to compensate for training. After that, as part of this HIT you'll have the chance to identify as many additional images as you like, as long as you aren't taking more than 15 minutes per image. Although you can label unlimited images in this HIT, you won't be able to accept more HITs.
This is to give a variety of turkers an opportunity to identify the images. [Thank you for your time and effort. Advances in the field of cancer treatment and prevention rely on the selfless contributions of countless individuals such as yourself.]

Figure A.7: The quiz after watching the training video for the meaningful treatment. In the zero-context and shredded treatments, all mentions of "tumor cells" are replaced by "objects of interest." The shredded treatment has an additional question asking workers to acknowledge that they are working on a test system and their work will be discarded. Green indicates a correct response; red indicates an incorrect response.

Figure A.8: The training interface as seen by workers. (a) Meaningful treatment; (b) zero-context / shredded treatments. The meaningful interface reminds the subjects in 8 places that they are identifying tumor cells. The zero-context interface only says "objects of interest," and the shredded condition in addition has a message in red indicating that their points will not be saved (unshown). The circles around each point were not visible to participants; we display them to illustrate the size of a 10-pixel radius.

Participants were given 15 minutes to mark an image. Above the training window, we displayed a countdown timer that indicated the amount of time left. The participant's total earnings were also prominently displayed atop. On the very top, we provided a submit button that allowed the worker to submit results at any time.

Each image had the same 90 cells from various-sized clusters. The cell clusters were selected to be unambiguous examples of cells, thereby avoiding the difficulty of training workers on difficult-to-identify tumor cells. In each image, the same clusters were arranged and rotated haphazardly, then pasted on one of five different believable backgrounds using Adobe Photoshop. Those clusters were then further rotated to create a set of ten images. This setup guarantees that the difficulty was relatively the same image-to-image. Images were displayed in random order for each worker, repeating after each set of ten (repetition was not an issue since it was rare for a participant to label more than ten).

After the worker finishes labeling, they press submit and are led to an intermediate page which asks if they would like to label another image; the new wage is prominently displayed (see Figure A.9). In the meaningful treatment, we add one last cue of meaning: a stock photo of a researcher to emphasize the purpose of the task. In the shredded treatment, we append the text "NONE of your points will be saved because we are testing our system, but you will still be paid." If the worker wishes to continue, they are led to another labeling task; otherwise, they are directed to the post-manipulation check survey shown in Figure A.10.

Figure A.9: The landing page after a labeling task is completed. At this point, workers are asked if they'd like to label another image or quit and be paid what they've earned so far. (a) Zero-context / shredded treatments; (b) meaningful treatment.

The program ensures that the worker is being honest. We require them to find more than 20% of the cells (the workers were unaware that we were able to monitor their accuracy). If they are found cheating on three images, they are deemed fraudulent and not allowed to label more images. Since payment is automatic, this protects us from a worker depleting our research account. In practice, this happened rarely and was not correlated with treatment.
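This honesty check can be sketched as follows (a minimal illustration with hypothetical names, not the production code), where an image on which no more than 20% of the cells are found counts as a strike and three strikes bar the worker from further labeling:

    def is_fraudulent(fractions_found, threshold=0.20, max_strikes=3):
        """Return True if the worker should be barred from labeling more
        images. `fractions_found` holds, for each submitted image, the
        fraction of that image's 90 cells the worker found."""
        strikes = sum(1 for frac in fractions_found if frac <= threshold)
        return strikes >= max_strikes

    # e.g., is_fraudulent([0.95, 0.10, 0.15, 0.05]) returns True (three strikes)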
Appendix B. A Technical Guide to Running Field Experiments on Mechanical Turk
Institutional Review Board (IRB) Requirements
This study requires the use of deception in order to observe social preferences in a natural environment, and thus it is not exempted under category 2's survey procedures. The issue is that you cannot give the subjects an initial consent form indicating that they are part of an experiment.

Upon waiving the requirement of consent, the IRB will most likely require you to issue a debriefing statement to your subjects stating that they were part of an experiment, a blurb about the purpose of the experiment, and contact information for your institution's IRB. In order for the experiment to work properly, you can only issue the debriefing after data collection is completed. Otherwise, ...

Figure A.10: The survey a subject fills out upon completion of the task.