MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk
Brandon Dang
Department of Computer Science, The University of Texas at Austin
[email protected]

Miles Hutson
Department of Computer Science, The University of Texas at Austin
[email protected]

Matthew Lease
School of Information, The University of Texas at Austin
[email protected]
Abstract
Internal HITs on Mechanical Turk can be programmatically restrictive, and as a result, many requesters turn to using external HITs as a more flexible alternative. However, creating such HITs can be redundant and time-consuming. We present MmmTurkey, a framework that enables researchers not only to quickly create and manage external HITs, but, more significantly, also to capture and record detailed worker behavioral data characterizing how each worker completes a given task.

Introduction

Mechanical Turk requesters post HITs (Human Intelligence Tasks) to be completed by workers. However, a challenge constantly faced by requesters is preserving the integrity of the data they collect. While most workers undertake tasks in good faith and are competent to deliver quality work, not all submitted work is high quality, posing a risk of corrupting a data set (Eickhoff and de Vries 2013). Additionally, reported automated "botters" performing tasks meant for human workers further risk compromising data quality (Difallah, Demartini, and Cudré-Mauroux 2012).

Just as search engines routinely capture user interactions in order to better understand their users and deliver higher-quality results, prior work has suggested that instrumenting worker interfaces can provide similar insights for understanding worker behavior and assessing work quality (Rzeszotarski and Kittur 2011; Kazai and Zitouni 2016), as well as for detecting potential fraud (Heymann and Garcia-Molina 2011). Unfortunately, no open source project yet exists to enable other researchers to similarly instrument their own task interfaces. MmmTurkey not only makes it easy to deploy external HITs on Mechanical Turk, but critically enables its users to capture and log the underlying worker interactions on the front-end, presenting the potential to improve data quality and to better understand the latent worker behaviors underlying observed work products (e.g., behavioral trace data characterizing top performers).
(Not to be confused with MmTurkey, https://github.com/longouyang/mmturkey.)

MmmTurkey endeavors to be easy to use and to extend. While MmmTurkey itself is built atop the popular Python web framework Django, adding to its capabilities requires no more than a basic understanding of Python and JavaScript. MmmTurkey seeks to give requesters the freedom and ability to create their own components, as well as the option to reuse the foundational work of others. The benefit of a common, modular platform is that when a requester creates a new component to fill some gap, others should be able to understand and use it with relative ease. Preserving the data export, admin creation process, and structure inherited from a framework with which users are already familiar eases reusability and maintenance. Components of the open source framework consist of:

• Tasks are HITs to be completed by workers. A task is comprised of steps and auditors and can be completed by any number of workers; data is recorded for each response. Figure 1 exemplifies the process of creating a new task in the online dashboard.

• Steps are parts of a task that workers must complete, and are the equivalent of a question. Steps are modular and can be assembled by the requester in any order within a task, including a randomized order. Multiple choice, multiple answer (checkbox), and textual response step types have been implemented and are provided for immediate use. Requesters can also create and add their own custom steps to a task if needed.

• Auditors, implemented in JavaScript and jQuery, surveil worker activity during a task and record user interaction with the webpage and browser. Many auditors are provided by default, including auditors that record the user's mouse movements, clicking interactions, and tab focus changes. Like steps, auditors are modular, and requesters choose which auditors to include when creating a task. Requesters may also create and add their own custom auditors.

For users satisfied with
MmmTurkey's out-of-the-box capabilities, no extra code is necessary. Tasks are created within a browser-based interface in which authorized users can selectively add steps and auditors, and otherwise manage these tasks (Figure 1). Data collected from a task is readily available for export as XML, which can be easily parsed to extract worker responses and auditor data (Figure 2).

Figure 1: Creating a new task.

For those wanting to add their own steps or auditors, the framework's administration, database, and export functionality have been abstracted away from the core information requirements of a component:

• what fields the new step or auditor will return when a user submits a response;
• what fields the new step or auditor needs to pass to a user-provided template;
• how to render a step on screen;
• how to collect the information for the fields of an auditor;
• where to locate the JavaScript and HTML template that load alongside tasks.

The administrative interface for the component, its database representation, and the logic to produce its XML export are all handled by MmmTurkey (unless a user decides to override these hooks), reducing the time and complexity of adding custom items to the framework.

<auditors>
  <clicks_total>
    <list_item>
      <model>survey.auditorclickstotaldata</model>
      <pk> </pk>
      <fields>
        <general_model> </general_model>
        <count> </count>
      </fields>
    </list_item>
  </clicks_total>
  ...
</auditors>

Figure 2: Example auditor data in exported XML.
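Exports in this format can be post-processed with standard XML tools. A minimal Python sketch using the standard library's ElementTree (the key and count values here are hypothetical, since the example export omits them):

```python
import xml.etree.ElementTree as ET

# Hypothetical export snippet following the structure shown in Figure 2;
# the pk, general_model, and count values are invented for illustration.
xml_export = """
<auditors>
  <clicks_total>
    <list_item>
      <model>survey.auditorclickstotaldata</model>
      <pk>1</pk>
      <fields>
        <general_model>42</general_model>
        <count>17</count>
      </fields>
    </list_item>
  </clicks_total>
</auditors>
"""

def total_clicks(xml_text):
    """Return the click count recorded by each clicks_total auditor entry."""
    root = ET.fromstring(xml_text)
    return [int(item.findtext("fields/count"))
            for item in root.iter("list_item")
            if item.findtext("model") == "survey.auditorclickstotaldata"]

print(total_clicks(xml_export))  # [17]
```

The same pattern extends to other auditors: each auditor's records appear under its own element, so a parser can select only the behavioral signals of interest.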
Task Fingerprinting. MmmTurkey's auditors are based on the concept of task fingerprinting (Rzeszotarski and Kittur 2011): an attempt to capture the process by which workers complete a specific task by using user event loggers. Information is recorded whenever a worker clicks, presses a key, or otherwise interacts with the webpage or browser. The collected data can then be used to analyze behavioral trends of "good" compared to "bad" workers and to filter out potentially low-quality responses.
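To illustrate the idea (MmmTurkey's actual auditors run as JavaScript in the worker's browser; the event format and feature choices below are a hypothetical simplification), a logged event stream can be reduced to fingerprint features such as dwell times between clicks:

```python
# Hypothetical event log: (timestamp in seconds, event type) pairs, as an
# auditor might record them while a worker completes a task.
events = [(0.0, "focus"), (1.2, "click"), (3.9, "keypress"),
          (4.1, "click"), (9.6, "click"), (10.0, "submit")]

def dwell_times(events, kind="click"):
    """Time gaps between consecutive events of the given type."""
    times = [t for t, e in events if e == kind]
    return [b - a for a, b in zip(times, times[1:])]

def fingerprint(events):
    """A few simple behavioral features in the spirit of task fingerprinting."""
    gaps = dwell_times(events)
    return {
        "total_time": events[-1][0] - events[0][0],
        "num_clicks": sum(1 for _, e in events if e == "click"),
        "mean_click_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }

print(fingerprint(events))
```

Features like these, aggregated per worker, are what downstream quality models consume.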
Gold Standard Behavior. Using task fingerprinting and a supervised classifier, an experiment was conducted to detect poor-performing workers based on their behavioral data as well as that of trained professional judges (Kazai and Zitouni 2016). Both normal workers and judges were given the same three tasks to complete, and 160 behavioral features were recorded per worker, including "dwell" times between specific events (e.g., mouse clicks) and the number of window resize events. Normal workers completed the tasks on a crowdsourcing platform, while judges used an in-house platform. The recorded behavior of the judges was used as the gold standard.
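As a toy illustration of using judge behavior as a gold standard (Kazai and Zitouni trained a supervised classifier over 160 features; the simple z-score screen and the feature values below are ours, for exposition only):

```python
from statistics import mean, pstdev

# Hypothetical behavioral features: (mean dwell time in seconds, resize count).
judge_features = [(4.8, 1), (5.2, 0), (5.0, 2)]   # gold-standard judges
worker = (0.9, 0)                                  # a worker to screen

def z_scores(sample, reference):
    """Per-feature z-scores of `sample` against the reference population."""
    scores = []
    for i, value in enumerate(sample):
        column = [row[i] for row in reference]
        sd = pstdev(column)
        scores.append((value - mean(column)) / sd if sd else 0.0)
    return scores

def looks_anomalous(sample, reference, threshold=3.0):
    """Flag a worker whose behavior deviates strongly from the gold standard."""
    return any(abs(z) > threshold for z in z_scores(sample, reference))

print(looks_anomalous(worker, judge_features))  # True: dwell time is far too short
```

A trained classifier can of course capture far subtler patterns, but even this crude screen conveys how gold-standard traces anchor the comparison.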
TurkPrime (turkprime.com) assists researchers in collecting data, particularly in the social and behavioral sciences (Litman, Robinson, and Abberbock 2016). Like MmmTurkey, TurkPrime is integrated with Mechanical Turk and enables users to easily create and manage external HITs through an online administrative dashboard. TurkPrime offers many useful features, including the ability to create worker groups, to include or exclude certain workers, and more options and control for managing HITs. Unlike MmmTurkey, however, TurkPrime does not provide any form of auditor functionality.
At its simplest, MmmTurkey is a tool for easily developing and managing external HITs on Amazon Mechanical Turk. Though other toolkits have been created that offer some similar services, MmmTurkey stands out as the first open source framework we are aware of to provide auditors: a unique and powerful feature. Rather than just collecting responses to tasks, these auditors can record a worker's interactions on the front-end, giving researchers and task designers a wealth of new data with which to study and better understand worker behaviors during task execution. MmmTurkey, with its modular architecture and core auditor feature, enables its users to collect comprehensive data without the trouble of having to code a HIT that has already been created before.
Acknowledgments. This work was made possible by the National Science Foundation's support for Research Experiences for Undergraduates (REU), under grant No. 1253413. The statements made herein are solely the responsibility of the authors. We thank our awesome crowd workers for their participation in powering our crowd-driven systems.
References

[Atterer, Wnuk, and Schmidt 2006] Atterer, R.; Wnuk, M.; and Schmidt, A. 2006. Knowing the user's every move: user activity tracking for website usability evaluation and implicit interaction. In Proceedings of the 15th International Conference on World Wide Web, 203–212. ACM.

[Bakshy, Eckles, and Bernstein 2014] Bakshy, E.; Eckles, D.; and Bernstein, M. S. 2014. Designing and deploying online field experiments. In Proceedings of the 23rd International Conference on World Wide Web, 283–292. ACM.

[Difallah, Demartini, and Cudré-Mauroux 2012] Difallah, D. E.; Demartini, G.; and Cudré-Mauroux, P. 2012. Mechanical cheat: Spamming schemes and adversarial techniques on crowdsourcing platforms. In CrowdSearch, 26–30.

[Eickhoff and de Vries 2013] Eickhoff, C., and de Vries, A. P. 2013. Increasing cheat robustness of crowdsourcing tasks. Information Retrieval.

[Heymann and Garcia-Molina 2011] Heymann, P., and Garcia-Molina, H. 2011. Turkalytics: analytics for human computation. In Proceedings of the 20th International Conference on World Wide Web, 477–486. ACM.

[Hong and Landay 2001] Hong, J. I., and Landay, J. A. 2001. WebQuilt: a framework for capturing and visualizing the web experience. In Proceedings of the 10th International Conference on World Wide Web, 717–724. ACM.

[Kazai and Zitouni 2016] Kazai, G., and Zitouni, I. 2016. Quality management in crowdsourcing using gold judges behavior. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 267–276. ACM.

[Litman, Robinson, and Abberbock 2016] Litman, L.; Robinson, J.; and Abberbock, T. 2016. TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods.

In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 57–66. ACM.

[Nebeling, Speicher, and Norrie 2013a] Nebeling, M.; Speicher, M.; and Norrie, M. C. 2013a. W3Touch: metrics-based web page adaptation for touch. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2311–2320. ACM.

[Nebeling, Speicher, and Norrie 2013b] Nebeling, M.; Speicher, M.; and Norrie, M. C. 2013b. CrowdStudy: General toolkit for crowdsourced evaluation of web interfaces. In Proceedings of the 5th ACM SIGCHI Symposium on Engineering Interactive Computing Systems, 255–264. ACM.

[Parkes et al. 2012] Parkes, D. C.; Mao, A.; Chen, Y.; Gajos, K. Z.; Procaccia, A.; and Zhang, H. 2012. TurkServer: Enabling synchronous and longitudinal online experiments. In Proceedings of the Fourth Workshop on Human Computation (HCOMP'12). AAAI Press.

[Rzeszotarski and Kittur 2011] Rzeszotarski, J. M., and Kittur, A. 2011. Instrumenting the crowd: using implicit behavioral measures to predict task performance. In