Hardhats and Bungaloos: Comparing Crowdsourced Design Feedback with Peer Design Feedback in the Classroom
Jonas Oppenlaender, [email protected], University of Oulu, Oulu, Finland
Elina Kuosmanen, [email protected], University of Oulu, Oulu, Finland
Andrés Lucero, [email protected], Aalto University, Espoo, Finland
Simo Hosio, [email protected], University of Oulu, Oulu, Finland
Figure 1: Selected interactive prototypes of mobile applications created by students in the design course.
ABSTRACT
Feedback is an important aspect of design education, and crowdsourcing has emerged as a convenient way to obtain feedback at scale. In this paper, we investigate how crowdsourced design feedback compares to peer design feedback within a design-oriented HCI class and across two metrics: perceived quality and perceived fairness. We also examine the perceived monetary value of crowdsourced feedback, which provides an interesting contrast to the typical requester-centric view of the value of labor on crowdsourcing platforms. Our results reveal that the students (N = 106) perceived the peer feedback as being of higher quality than the crowdsourced feedback across all analyzed criteria except valence, and that the students' monetary valuations of the crowdsourced feedback differed drastically from the pay that crowd workers typically receive. Our findings inform teachers in HCI and other researchers interested in crowd feedback systems on using crowds as a potential complement to peers.
CCS CONCEPTS
• Information systems → Crowdsourcing; • Human-centered computing → Empirical studies in HCI; Interface design prototyping; HCI design and evaluation methods.

KEYWORDS
crowdsourcing, design feedback, crowd feedback system, classroom study, peer review
ACM Reference Format:
Jonas Oppenlaender, Elina Kuosmanen, Andrés Lucero, and Simo Hosio. 2021. Hardhats and Bungaloos: Comparing Crowdsourced Design Feedback with Peer Design Feedback in the Classroom. In CHI Conference on Human Factors in Computing Systems (CHI '21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3411764.3445380
INTRODUCTION
Design is an important topic in HCI education. While prior work has established the feasibility of using crowdsourced feedback in design education [6, 17, 37], few deep investigations into the qualitative expectations and perceptions of students have been documented in the literature.
Our work sets out to replicate and extend prior findings [6, 17, 37], particularly concerning the perceptions of the students. We provide a detailed empirical investigation into how students of a design-oriented undergraduate HCI course perceived and experienced crowdsourced design feedback, and how this feedback compared to peer design feedback in the classroom. In our study, students (N = 106) received formative design feedback from two sources: their classroom peers and crowd workers on Amazon Mechanical Turk (MTurk).

Our study extends prior studies' findings with a comparison of the two sources of feedback across an extensive range of evaluation dimensions related to the perceived quality and felt experience of feedback. We found that the perceived quality of the feedback is shaped by the perceived effectiveness, perceived effort, and the modality of the feedback. Perceived fairness is shaped by the agreeableness, valence (i.e., the affective tone and "harshness" of the feedback), diligence, usefulness, and credibility of the feedback. Additionally, we explore how students formulate a monetary valuation of crowdsourced design feedback, which contrasts with the typical way of thinking about the monetary value of crowdsourced contributions as rewards paid by the requester. This, we argue, may have implications for crowdsourcing setups in which the teacher, as the requester of feedback (and thus the party who sets the price for the crowdsourced task), is not the receiver and beneficiary of the crowdsourced feedback.

The main contributions of our work are:
(1) An empirical case study on using the crowd from Amazon Mechanical Turk as a source of design feedback in a university-level HCI course.
(2) A detailed analysis of the perceived attributes of formative design feedback from two sources: peers and the crowd from Amazon Mechanical Turk.
(3) An exploration of how students formulate a monetary valuation of crowdsourced design feedback.
(4) A discussion of the implications and of how teachers might best leverage crowdsourced design feedback in their courses.

Overall, the students in our study found the formative design feedback from peers to be of higher quality across all of the analyzed criteria, except for valence. However, the diversity of people from different backgrounds was mentioned as a particularly positive aspect of the crowdsourced feedback, and approximately 25% of the participants preferred MTurk feedback over peer feedback. Concerning the monetary value of crowdsourced feedback, participant responses differed drastically from the pay that crowd workers typically receive on MTurk. Our findings are informative to researchers working on crowd feedback systems as well as to teachers of design-oriented HCI courses who wish to explore crowdsourced feedback as a way to conveniently expose students to feedback from outside the course itself.
RELATED WORK
The field of HCI has always had an interest in learning and education, as evidenced, for example, by the establishment of a new dedicated CHI subcommittee on "Learning, Education, and Families" and a Special Interest Group (SIG) at CHI '19 [30]. Our work touches on design feedback, feedback collection via crowdsourcing, and the application of crowdsourced feedback in the classroom.
Design Feedback. Feedback is an important component of design education. Feedback communicates a notion of a standard to the learner, and learners typically strive to minimize the feedback-standard discrepancy [21]. The overall goal of design education is to raise the learner's conception of the standard to that of the teacher [34]. One important mechanism for realizing this goal is design feedback.

Design feedback can be given at four different levels (product, process, self-regulation, and self [13]), and the feedback can be formative or summative. While summative feedback is geared towards grading results at the end of a course, formative feedback guides students in improving their work [34]. According to the theory of formative assessment [34], formative feedback is instrumental to developing expertise. Formative peer feedback on in-progress work may, for instance, improve course outcomes [23]. Effective feedback provides specific information about a learner's current performance together with explanations and concrete examples [1]. Successful feedback is specific, critical, and actionable [34, 44].

The term critique is sometimes used as a synonym for design feedback in the related literature (e.g., in [27]). And indeed, a critique is "the communication of a reasoned opinion about an artifact or a design" [9]. Traditionally, however, a (studio) critique refers to a formal, co-located feedback setting in which knowledgeable peers or experts provide feedback to student learners [4]. Critiques are a common practice in design education as a form of assessment. In a critique, co-located students discuss and evaluate their sketches, collages, or designs. Critiques are also practiced among design colleagues as a means to receive feedback.

In this work, we concentrate on product-based formative feedback for a design artifact. We follow a process-agnostic conceptualization of feedback as information provided by a feedback provider (a teacher, peer, or other kind of reviewer of a work) to the feedback receiver (in our case, a university-level student) [13].
Crowdsourced Design Feedback. Peer feedback has become a common pedagogical tool, especially in online teaching and when instructors are not always available [7, 24]. Crowdsourcing offers an even more scalable approach to feedback. Researchers have in the past looked into the trade-offs of these approaches. Nguyen et al. found that design feedback from anonymous sources was perceived more positively than feedback from peers or an authority [27]. Anonymous feedback providers may give feedback that contains more specific criticism and praise, and may thus be rated as more useful by the feedback receiver [17]. Anonymity may further help minimize power differences between the feedback provider and the feedback receiver [4]. To this end, crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), may provide fruitful ground for eliciting design feedback from a large and diverse group of anonymous people. On MTurk, requesters (the feedback receivers) publish short "Human Intelligence Tasks" (HITs) for anonymous workers (the feedback providers) to complete in exchange for a small monetary reward [16].

The microtask crowdsourcing model is, however, by its nature troubled with a number of issues that may negatively affect the provision of feedback. Survey satisficing may deflate the reliability and validity of collected data [11] and could negatively impact the quantity and quality of the collected feedback. The power differences between workers and requesters manifest in low payment to workers [12]. Workers may further be subject to numerous biases, such as the observer effect. On the other hand, feedback from peers may be positively biased [36], and students were shown to appreciate and prefer feedback from external and anonymous sources [6, 27].

Researchers have created a number of web-based systems to explore and investigate the potential of MTurk for collecting feedback on graphic designs.
Voyant [41] is a system for eliciting perception-based feedback on graphic designs from a non-expert crowd. The system aims to capture first impressions and how well a graphic design meets its stated goals and design guidelines. CrowdUI [29] is a system for eliciting and aggregating visual feedback on the design of a website. The system enables users to modify a website's user interface, and the user-generated design suggestions are presented to the feedback requester in aggregated form. CrowdCrit [25] gathers formative feedback on graphic designs from crowd workers and clients. Designers found the feedback provided by the system to be helpful and appreciated the specificity and level of detail of the feedback; feedback from crowd workers was, however, found to be "more generic." In Paragon [19], feedback providers complement written feedback on graphic designs with visual design examples. Feedback provided in this manner was found to be more novel, specific, and actionable. SIMPLEX [28] is a system to gather summative feedback on designs and artworks from a situated crowd using two public displays. ZIPT [5] is a system that, unlike the above systems, which elicit feedback on static graphic designs, uses virtualization technology to enable a crowd to conduct remote user tests on mobile applications. This setup is similar to the one used in our study, except that feedback providers in our case provided feedback on a high-fidelity web-based prototype, not a real mobile application.

Researchers have further investigated strategies for improving the quality of crowdsourced design feedback. Rubrics are an effective way to structure feedback and raise its perceived value, valence (i.e., affective tone), and specificity [44]. Scaffolds were shown to support students in reflecting on the feedback received [2]. Structured workflows were found to generate feedback that was more interpretative, diverse, and critical than free-form prompts [42]. Further, guiding questions formulated by the feedback receiver were found to be effective scaffolds for facilitating the exchange of feedback among peers [3]. Measures such as the above have been found to improve the quality of crowdsourced design feedback.

In contrast to the above quality measures, our study provides insights into the felt experience of evaluating feedback. The participants in our study were given the responses from Amazon Mechanical Turk as is, without any filtering of responses or other post-hoc measures for improving data quality. We deliberately chose to compare the "unfiltered" feedback from the two sources. We decided against filtering the crowdsourced data because we believe this is the fairest way of comparing the two feedback sources without favoring either one. This decision avoids introducing bias into the study and best reflects what can be expected from the two types of feedback providers. However, we did employ standard qualification measures widely used in academic studies on MTurk.
Crowd Feedback in the Classroom. Dow et al. established the feasibility of using crowds from MTurk for collecting design feedback in the classroom [6]. In Dow et al.'s studies, the crowd provided feedback along four key stages of the innovation process (needfinding, ideating, testing, and pitching). Students found feedback from crowd workers to be beneficial in all four stages, and the crowdsourced feedback helped students ground their design efforts in real-world opportunities. Most relevant to our study, testing early-stage storyboards with the online crowd was found to help students uncover issues and directions for future work.

Xu et al. used a crowd feedback system to provide 10 students in a visual design course with formative feedback from workers on MTurk [42]. The authors found that formative feedback from the crowd prompted students to change and improve their designs. Compared with the feedback from experts, however, the crowdsourced feedback was not found to agree on whether a design met its communicative goals.

In the work by Wauck et al., students created prototypes of user interfaces in a project-based design course. Students received feedback from their peers and three different online crowds: the students' own social network, online communities, and Amazon Mechanical Turk. The authors measured the quality, quantity, and valence of the feedback, and how it was acted upon. Student peers were found to provide feedback that was higher in perceived quality, more acted upon, and longer than feedback from online communities. Further, summative feedback from both peers and online communities was found to be more negative in valence than formative feedback at an early stage in the design project.

Our study is similar to the study by Wauck et al., both in its focus and study design, and it is therefore partially a replication of their work. Part of our contribution to HCI is the confirmation of the prior study's findings [14]. The study by Wauck et al. answers three research questions, only one of which concerns student perceptions, whereas our work focuses only on student perceptions but extends the prior study to more dimensions, both qualitatively and quantitatively. Our work therefore contributes a clear increment to prior studies of feedback in the classroom with an in-depth analysis of students' perceptions.

Prior literature found that student designers may attribute value to feedback along a number of different criteria, such as the feedback's quantity [25, 37, 44], specificity [2, 3, 17, 19, 25, 34, 44], criticality [2, 3, 34, 44], valence, affect, or sentiment [3, 27, 37, 42, 44], helpfulness [5, 22, 25, 41], fairness [44], and actionability [2, 3, 19, 34, 37, 44]. Students may, for instance, find specific and emotionally positive feedback useful, and students are likely to appreciate longer and more actionable feedback with clear justifications [44]. In our study, we provide a mixed-method investigation into these aspects of formative design feedback in the context of a university-level design course.
METHOD
Our aim was to investigate how students perceive and experience crowdsourced formative feedback in the classroom and how this feedback compares to peer feedback from the students' classmates. The study was conducted in the context of a design course at the University of Oulu, Finland.
Course Context. The study was conducted in a 9-week undergraduate course (521145A "Human Computer Interaction") between October and December 2019. The course followed a project-based learning approach. At the beginning of the course, students formed 42 groups of three members each around self-chosen topics. Each group of students was tasked to brainstorm, design, and prototype a user interface for a mobile application. Each group first sketched their idea with pen and paper and evaluated this paper prototype in a user study with their classmates. The groups then transferred their designs into a digital prototype using Adobe XD (a user interface design tool), taking into account the lessons learned from the user studies. The interactive prototypes were published online using Adobe XD's functionalities, which allowed us to invite both students and workers from a crowdsourcing platform to operate and review the interactive application prototypes. Each group received formative feedback from other students and crowd workers. After reviewing the formative feedback, the groups acted on the feedback received and submitted a revised version of the digital prototype for grading. Selected examples of the interactive designs created during the course are depicted in Figure 1.
Study Design. Our study design choices were shaped by practical and pedagogical considerations and informed by our study being a live, authentic teaching exercise rather than a controlled laboratory experiment on feedback quality. Our study therefore needs to be viewed as a case study of feedback with ecological validity in the classroom.

We conducted a within-subject study in the context of the HCI course. Each of the 42 groups of students was provided a zip file (the "feedback package") with formative feedback from two sources: nine of their peers and nine crowd workers. The number of feedback items was determined based on the group size: since each group consisted of three members, we asked each student to review the work of three other teams. Students were briefly introduced to the notion of crowdsourcing in the course lectures. The within-subject design was mandated by the university's teaching rules, to provide each group of students a similar learning experience and not to disadvantage some groups in the graded course. Students were individually asked to comment on the formative feedback in a survey (the "final questionnaire") at the end of the course. In the following sections, we describe the procedure for collecting feedback, how feedback was provided to the students, and the final questionnaire.
Collecting Feedback. Feedback was collected using the web-based survey instrument by Hosio et al. [15] (depicted in Figure 2). The interface included 14 questions, of which 13 were elicited with sliders. Both students and crowd workers (henceforth: the reviewers) received the same instructions and provided feedback with the same interface, with one minor difference: students were presented with a list of all groups and navigated to the group they were assigned to review, whereas for the crowd workers the sidebar with group names was hidden so as not to distract them. Instead, the task was set up so that a worker would only see the one design that was to be rated during that task. For the students, providing peer feedback was a fixed part of the coursework, and students provided feedback individually in their self-study hours outside the scheduled exercises.

Using the web-based interface, the reviewers rated the mobile application prototype on several criteria (each on a 7-point anchored Likert scale; see Figure 2). We elicited the novelty and usefulness of the application prototype as two important components of product-based creativity [33]. We further elicited practicality and a rating of the design. Each criterion was explained with a short question. For instance, novelty was explained with "Has the application been thought of by others?" Reviewers were asked to rate their success in following the set of tasks provided by the student group. Further, the interface included the 8-item version of the User Experience Questionnaire (UEQ-S) [35]. The UEQ-S measures several aspects of user experience on dichotomous scales, such as "Boring-Exciting," "Confusing-Clear," and "Complicated-Easy." In the last item, reviewers were asked to provide an exhaustive open-ended justification for how they scored the different criteria.
Feedback Package.
The collected feedback from the reviewers was bundled in a zip file (see Table 1). Each feedback package contained the raw data collected with the web-based survey instrument, a graphical summary of the Likert-scale responses (split between peer and crowd feedback), and two text files with the open-ended feedback from nine student reviewers and nine crowd workers, respectively. The feedback package was accompanied by detailed instructions for the students. A full example of a feedback package is available in the auxiliary material.
Final Questionnaire. Students individually inspected the feedback package at the end of the course (in their self-study hours outside of the scheduled exercises). Feedback was not given in a specific order, and the students could freely choose which one to consume first. The students then evaluated the contents of the feedback package in an online questionnaire. All related ethical procedures were followed as required by our university. Students were asked to consent to their data being used for the purpose of the academic study prior to completing the study. Students were informed that declining consent would not affect their course grades. Students were specifically instructed to fill out the final questionnaire individually, not in teams. As an incentive, 5 points counting towards the course credit were awarded to the students who calculated the three components of the UEQ-S (Overall, Pragmatic, and Hedonic Quality) from the raw data; a sketch of this calculation is shown below. The final questionnaire consisted of 18 items, including demographic questions, an evaluation of the formative feedback, and satisfaction with the course.
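As a concrete illustration of that calculation: the UEQ-S is conventionally scored by recoding each of the eight 7-point items from 1..7 to -3..+3, averaging items 1-4 into Pragmatic Quality, items 5-8 into Hedonic Quality, and all eight items into the Overall score [35]. The following minimal Python sketch follows that standard scheme; the function name and the example ratings are illustrative and not taken from the course's data files.

from statistics import mean

def ueq_s_components(ratings):
    """ratings: eight 7-point UEQ-S item scores (1..7), in questionnaire order."""
    if len(ratings) != 8:
        raise ValueError("UEQ-S has exactly 8 items")
    recoded = [r - 4 for r in ratings]  # recode 1..7 to -3..+3
    return {
        "Pragmatic Quality": mean(recoded[:4]),  # items 1-4
        "Hedonic Quality": mean(recoded[4:]),    # items 5-8
        "Overall": mean(recoded),                # all eight items
    }

# Example: one reviewer's ratings for a single prototype.
print(ueq_s_components([5, 6, 4, 5, 7, 6, 6, 5]))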
Figure 2: Web-based interface for feedback collection. A set of general instructions was provided to the reviewers (A). Each student group provided their own set of tasks to be completed in the mobile application (B). A click on a link (C) directed the reviewer to the interactive prototype of the mobile application hosted on Adobe XD resource servers. The reviewer was instructed to rate the mobile application along a number of dimensions (E). Students could access the groups they were assigned to review in the sidebar (D). A screenshot of the full interface is available in the auxiliary material.

Table 1: Feedback package with peer and crowdsourced feedback provided to the students.
File | Description
data.xlsx | Quantitative responses (raw data) collected with the survey instrument
summary.png | Graphical summary (bar chart) of the feedback, split between peer and MTurk feedback
summary_openended_class.txt | Open-ended feedback from student peers
summary_openended_mturk.txt | Open-ended feedback from workers on MTurk

The questionnaire items related to the evaluation of the feedback are listed in Tables 2 and 3. The full final questionnaire can be found in the auxiliary material.

The questionnaire elicited detailed justifications for the students' preference of feedback (see Table 3). We first inquired which of the two feedback sources the students preferred overall. This first open-ended item was given to capture students' reasoning in their own terms, without imposing a structure or limiting the response. The following three items were structured around three principles of good feedback: effectiveness, fairness, and actionability [34, 44]. Students were next asked to quantitatively rate the feedback on 7-point Likert scales (see Table 2). Effectiveness was measured with items related to effective feedback [1]: specificity, actionability, and explanations. Fairness and actionability had their own items, and in addition, students were asked to judge the overall valence of each feedback source, the relevance of the feedback to the application prototype, the overall quality of the feedback, and their satisfaction with the respective feedback source. Finally, students were asked to provide their personal estimate of the monetary value of the feedback provided by crowd workers ("Looking only at the feedback from online workers on Mechanical Turk for your own group, how much do you think this feedback is worth, in money (Euros)?"). Students were specifically instructed to estimate the cost of "all of the feedback from MTurk combined (but for your own group) – not for an individual feedback item among the many."
Crowd Workers. We invited crowd workers from Amazon Mechanical Turk to review the 42 mobile application prototypes created by the student groups. Each prototype was assigned to nine different crowd workers. Workers were recruited following best practices (HIT approval rate greater than 95% and number of HITs approved greater than 100); similar qualification criteria are widely used in academic studies [31]. In total, 387 HITs (with an attrition rate of 2.3%) were completed by 328 unique workers. Workers were paid US$1 per HIT completed.
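For teachers who wish to replicate this setup, the sketch below shows how such a review HIT could be posted programmatically with the AWS SDK for Python (boto3), including the two qualification requirements used in the study. The two QualificationTypeIds are MTurk's built-in system qualifications for approval rate and number of approved HITs; the title, timings, and external survey URL are placeholders, not the values used in our course.

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# The survey instrument is hosted externally and embedded in the HIT.
EXTERNAL_QUESTION = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/feedback?prototype=1</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

response = mturk.create_hit(
    Title="Review an interactive mobile application prototype",
    Description="Try out a student-built prototype and rate it on several criteria.",
    Keywords="feedback, design, usability",
    Reward="1.00",                        # US$1 per completed HIT, as in the study
    MaxAssignments=9,                     # nine workers per prototype
    AssignmentDurationInSeconds=45 * 60,  # placeholder timing
    LifetimeInSeconds=7 * 24 * 3600,      # placeholder timing
    Question=EXTERNAL_QUESTION,
    QualificationRequirements=[
        {   # HIT approval rate greater than 95%
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThan",
            "IntegerValues": [95],
        },
        {   # number of approved HITs greater than 100
            "QualificationTypeId": "00000000000000000040",
            "Comparator": "GreaterThan",
            "IntegerValues": [100],
        },
    ],
)
print("Created HIT:", response["HIT"]["HITId"])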
Participants. Out of the 112 students who submitted the final questionnaire, 106 students (P1–P106, aged from 18 to 58) consented to their responses being used in this research.

Analysis. We focus our mixed-method analysis on the students' evaluation of the feedback in the final questionnaire. We only briefly report on the feedback itself, because we want to foreground the subjective experience of formative design feedback in this paper.

We compared different criteria of how students perceived the feedback, as described in Section 3.2.2. Since the data did not follow a normal distribution (according to Shapiro-Wilk's test), we used paired Wilcoxon tests to evaluate the differences in the students' perception of the peer feedback and the crowdsourced feedback. Besides investigating the students' perspective on the feedback, we quantitatively analyzed and compared the open-ended feedback along three criteria contributing to the quality of the feedback: the length of the feedback (measured by the number of characters in the feedback item), the amount of noise in the feedback (measured by the presence of unrelated and nonsensical words), and the effort spent on providing the feedback (measured by the time taken to complete both the Likert-scale feedback and the open-ended feedback).

Qualitatively, the three open-ended items (see Table 3) were analyzed following the guidelines for content analysis [38]. One researcher first familiarized himself with the data and extracted verbatim terms from the first item of the questionnaire, in which students provided reasons for their overall preference of feedback. The researcher consolidated the extracted terms into codes, while also considering the responses to the other two open-ended questionnaire items about the effectiveness and fairness of the feedback (Table 3). Each code was categorized according to whether it was a positive or negative statement about peer or crowd feedback, respectively. Subsequently, the researcher iteratively and inductively grouped the codes into themes and sub-themes. Next, a codebook was developed with descriptions for each code. After discussing the codebook, the first and second authors of this paper and one additional student individually coded all responses to the first open-ended item of the questionnaire. Inter-rater reliability among the three raters, measured with Fleiss' Kappa [10], improved after reconciling differences in two discussions and adjusting the codebook.
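Before turning to the results, the following Python sketch illustrates the shape of this analysis pipeline: a Shapiro-Wilk normality check on the paired differences, a paired Wilcoxon test for one perception criterion, and Fleiss' Kappa over the three raters' codes. The toy arrays and category labels are illustrative only; they are not the study's data.

import numpy as np
from scipy.stats import shapiro, wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Paired per-student Likert ratings (1-7) of one criterion, e.g. specificity.
peer = np.array([6, 7, 5, 6, 6, 7, 4, 6, 5, 7])
crowd = np.array([3, 4, 4, 2, 5, 3, 3, 3, 2, 4])

# Shapiro-Wilk on the paired differences; a significant result argues
# for a non-parametric paired test.
_, p_normal = shapiro(peer - crowd)
if p_normal < 0.05:
    w, p = wilcoxon(peer, crowd)  # paired, non-parametric
    print(f"Wilcoxon: W = {w:.1f}, p = {p:.4f}")

# Fleiss' Kappa over the three raters' codes: one row per response,
# one column per rater, integer category labels.
codes = np.array([[0, 0, 0], [1, 1, 0], [2, 2, 2], [0, 0, 1], [1, 1, 1]])
counts, _ = aggregate_raters(codes)
print("Fleiss' Kappa:", fleiss_kappa(counts))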
RESULTS
We found significant differences in how the students perceived the feedback from peers and crowd workers on all criteria except valence (see Table 2). In the students' opinion, peer feedback was significantly more specific, more actionable, and contained more explanations than the crowdsourced feedback (each p < .001). Students also found the peer feedback more relevant to the mobile application prototype and of better quality than the crowdsourced feedback (each p < .001), and they were more satisfied with the peer feedback than with the crowdsourced feedback from MTurk (p < .001). However, no statistical difference between peer feedback and crowdsourced feedback was found in terms of valence.

We further examined the open-ended feedback for noise, that is, nonsensical words (such as "uehi," "asdf," and "adse"), only numbers, or other words not relevant to the task (such as "yes," "no," "this app," and "NICE"). We manually curated a list of such words and found that the amount of noise greatly differed between the two feedback sources. Only 0.3% (N = 1) of the open-ended peer feedback contained noise, compared to 12.2% (N = 47) of the feedback from MTurk.

Related to the amount of noise, the peer feedback contained significantly longer explanations than the crowdsourced feedback (peer feedback: M = 296 characters, SD = 287 characters, versus crowdsourced feedback: M = 117 characters; p < .001). The effort spent on the task pointed in the same direction: peer reviewers took significantly more time to complete the feedback than crowd workers (p < .001).

Preference for Feedback. We coded the open-ended questions to determine each student's preference for either peer or crowdsourced feedback (see Table 3). The majority of the students (N = 79; 74.5%) voiced a preference for peer feedback, but a notable minority (N = 26; 24.5%) preferred the crowdsourced feedback.
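A minimal sketch of the noise and length measures described above; the token list is a small illustrative subset of the manually curated list, and the example responses are invented.

NOISE_TOKENS = {"uehi", "asdf", "adse", "yes", "no", "nice"}

def is_noise(response: str) -> bool:
    """Flag a response consisting only of numbers or curated noise tokens."""
    tokens = response.lower().replace(".", " ").split()
    return not tokens or all(t.isdigit() or t in NOISE_TOKENS for t in tokens)

peer_fb = ["The onboarding flow is confusing and the back button is hard to find."]
crowd_fb = ["24", "god like app", "NICE"]

for source, items in (("peer", peer_fb), ("crowd", crowd_fb)):
    noisy = sum(is_noise(r) for r in items)
    mean_len = sum(len(r) for r in items) / len(items)
    print(f"{source}: {noisy}/{len(items)} noisy, mean length {mean_len:.0f} characters")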
79; 74.5%) voiced a preferencefor peer feedback, but a not entirely insignificant minority ( 𝑁 = omparing Crowdsourced and Peer Design Feedback CHI ’21, May 8–13, 2021, Yokohama, Japan Table 2: Likert-scale items in the final questionnaire given to students at the end of the course. The Likert plots reflect thepercentage of participants who agreed, disagreed, or were neutral to a given statement. For example, 62% of the studentsthought the formative feedback from MTurk workers was not specific, 7% neither agreed nor disagreed, and 31% agreed to thisstatement.
Feedback item PeerfeedbackM (SD) CrowdfeedbackM (SD) Wilcoxon rank sum
The feedback is specific 𝑊 = . 𝑊 = 𝑊 = 𝑊 = . 𝑊 = . 𝑊 = . 𝑊 = *** p < 0.001 the peer feedback more useful , more effective , more actionable , andof greater fairness . In contrast, students often noticed and com-plained about the low perceived effort in the crowdsourced feedback.We elaborate on each of these aspects in the following sections.We structure our qualitative findings around three main themesof experiencing feedback: feedback quality (with sub-themes effec-tiveness , effort , and modality ), feedback fairness (with sub-themes agreeableness , valence , diligence , usefulness , and credibility ), and the monetary valuation of the feedback. The overall perceptionof the effectiveness of the feedback (from both sources) ranged from “not that effective” (P3) to “highly effective” (P5).Peer feedback was overall perceived as more meaningful andcomprehensive. Peers provided more constructive and actionablesuggestions for improvement than the crowdsourced feedback. Sev-eral students lauded the peer feedback for “pinpointing design flaws” (P18) and specific elements to improve, such as buttons, font sizes,wording, and other “little things that bugged [the reviewers]” (P5).The open-ended crowdsourced feedback, on the other hand, wasseen as being more “general” (e.g., P7) and, as a consequence, less
Table 3: Qualitative items of the final questionnaire given to students at the end of the course, with coded preference for peer feedback and crowd feedback. For example, 79 students overall preferred the peer feedback, 26 the crowdsourced feedback, and one student could not decide.

Item | Type | Peer | MTurk | Undecided | n/a*
Which of the two feedback sources (in-class from other students or crowdsourcing from online workers) do you prefer, and why? | Open-ended | 79 | 26 | 1 | –
How effective did you find the feedback (from both sources combined) in helping you understand what you did well and how you could do better? | Open-ended | – | – | – | –
Which of the two feedback sources (in-class or crowdsourcing) do you think was more fair, and why? | Open-ended | 62 | 34 | 9 | 1
Which of the two feedback sources (in-class or crowdsourcing) contained items that you would be able to act on – that is, which feedback is more actionable? | Multiple-choice† | 84 | 22 | – | –

* Preference could not be determined from the response. † Options: "In-class feedback from other students" or "MTurk feedback from online workers."
Only a few students (N = 9) found both sources equally effective. P8, for instance, thought that "the class noticed the bad sides and the online workers noticed the good sides of the app." P21 looked for the fault in his own group's work, as he thought that "the design was not realistic and attractive for a general public, and some features were not that clear," but expressed optimism that "after [implementing the review] we think that the opinions would change."
Effort. Students noticed a difference in the effort put into writing the open-ended feedback. The peer feedback was perceived as being more elaborate. The length of the feedback affected the perception of usefulness, as students were able to find fewer takeaways in the crowdsourced feedback. Crowdsourced feedback was often perceived as nonsensical, primarily due to some of the responses being extremely short and unrelated to the task: "My group didn't learn anything useful from online workers open ended questions, it had answers like "24", "god like app" and "Useful app :)". All of our classmates answered with many rows of text" (P16).

Students recognized that their peers had put more thought and time into writing the feedback, whereas the "online workers just tried to speedrun the questions" (P91). Two students suspected the feedback by MTurk workers was provided by "bots and other ESLs who probably didn't even give it anything but a glance to get their pennies" (P3), and that "the slider selections were done randomly by these bots" (P1).
Modality. Students generally preferred the open-ended feedback over the raw data and the summary chart. The written justifications allowed the students to identify specific problems in their user interfaces and areas to improve. In contrast, the numerical raw data and the summary chart were perceived as less helpful, as exemplified by the comment from P6, who thought "the numerical scores weren't really helpful; too often a low score wasn't justified in the open-ended feedback, and it was hard to understand what we could improve."
The numerical feedback was only "useful for finding the most obvious flaws and getting an overall feeling of how the people reacted" (P27). Nevertheless, students appreciated the summary chart, as they "found it useful to have two different groups of people to evaluate our application [...] It forced us to think how we could make it more appealing to both groups" (P43). When evaluating the numerical feedback, students primarily looked for differences between the two sources of feedback in the summary chart, not commonalities. "Variety" was mentioned several times as a criterion to look for in the summary chart. P92, for instance, mentioned that "in the [Likert-scale] questions there was not enough variety to see which areas were good and which needed improvement."
Students liked to contrast the feedback from the two sources, as it allowed them to identify the weaknesses and strengths in their designs. For instance, P10 mentioned that "since both sources are not correlated, it was easy to identify the main design failures in the app and prioritize and solve them."
Fairness. Approximately two thirds of the students (62 students) found the peer feedback to be more fair than the crowdsourced feedback, compared to 34 students who thought the crowdsourced feedback was more fair (see Table 3). Students primarily perceived the feedback's fairness along five dimensions: (a) agreeableness, (b) valence, (c) diligence, (d) usefulness, and (e) credibility. The five dimensions were mentioned by approximately the same number of students.
Agreeableness. The students' perception of fairness was strongly determined by how agreeable they perceived the feedback to be. Sensible and well-justified feedback was easier to reconcile with their own view in this regard. Constructive and well-thought-out feedback was overall perceived to be more fair than feedback lacking these attributes. Peer feedback was generally perceived to possess more of the above attributes than crowdsourced feedback.

Responses from students who preferred crowdsourced design feedback in terms of fairness were often motivated by its lack of criticism, which made the crowdsourced feedback "more agreeable," such as the comment from P16, who found the crowdsourced feedback "much more positive and [workers] seemed to like our app. Fellow students were much more critical. I think our class tried to be too critical and find every little thing that was wrong." Fairness, in the minority group who thought crowdsourced feedback was more fair, was simply a matter of the crowd finding "less problems with the application" (P5).

Several students perceived feedback with a variety of different viewpoints as more fair. For these students, peer feedback contained a greater diversity of viewpoints and was thus judged to be more fair than crowdsourced feedback.
Valence. As for the contribution of the affective tone (i.e., the valence of the feedback) to the students' sense of fairness, students perceived the peer feedback as more critical, and much more harsh and negative in tone, than the crowdsourced feedback. The criticism and negativity divided the students. While the majority of students appreciated critical feedback and perceived it as fair, a minority perceived the critical peer feedback as less fair than the crowdsourced feedback.

Among the students who thought that critical peer feedback was fair, the crowdsourced feedback was often found to be overly positive and not critical enough to be useful. P1, for instance, found peer feedback more fair because "even the little feedback we got from [MTurk] was way too positive and approving and did not criticize the obvious problems the prototype had." Similarly, P8 stated that he preferred peer feedback because peers had "given more negative feedback compared to Mturk people and some of those comments helped lot to further improve and redesign our app."

Feedback from MTurk was generally perceived to be more approving and praiseful. However, not all students considered positive feedback to be correlated with fairness. Of the 16 students who mentioned the MTurk feedback being positive, six found the crowdsourced feedback fair, but 10 thought the crowdsourced feedback was unfair because of its overly positive praise. As for the reasons why peer feedback was more critical, P92 speculated that "working with our own designs during the course made us students more demanding towards each other's designs than the MTurk workers."
Diligence. A high quantity of feedback positively contributed to the students' sense of fairness. The feedback from MTurk lacked in this regard and was perceived as less fair than the peer feedback due to the often short replies. In the same vein, some students judged the MTurk feedback to be less fair because of its superficiality, lower clarity, and lower specificity, as exemplified by the comments from P85, who thought that "peer feedback was more fair because the some crowdsourced feedback was not clear," and P8, who thought that "Mturk people gave more general comments which were useful too, while peers were more specific on errors and mistakes – like spellings." Internal consistency and the absence of contradictions were mentioned by a few students as contributing positively to the fairness of the feedback.
Usefulness. Feedback with a high number of suggestions and helpful hints was perceived as more fair than feedback lacking these features. The effectiveness of the feedback and its perceived value also contributed to the students' sense of fairness. Peer feedback overall was perceived as more useful and fair than crowdsourced feedback. One contributing factor to this sense of fairness was the context of the feedback's inception: the peers were familiar with the design task and had all undergone the same training. The peer feedback was thus perceived as more useful, as exemplified by the statement from P28: "the class seemed to know the context better, and gave more relevant feedback. Much of the MTurk feedback was useless."

On the other hand, some students thought the crowdsourced feedback was more fair because the crowd had an outsider's perspective on the application. P88, for instance, thought that "crowdsourcing was fairer because it felt like there was no obligation for the testers to be nice." The external reviewers' impartial point of view also contributed to the students' perception of the credibility of the feedback, as detailed in the next section.
Credibility. The credibility of the feedback provider was an especially strong argument among the students who preferred the feedback from MTurk in terms of fairness. Among these students, the feedback from external workers was perceived to be more objective than the feedback from classmates. The "external and impartial point of view" (P19) was appreciated by several students and contributed to the sense of credibility. The workers were familiar neither with the course nor with the students, and were thus perceived as being less biased, which positively contributed to the trust in the workers' feedback.

The majority of students, however, perceived peer feedback as more credible than crowdsourced feedback. Among this majority, peer feedback was perceived as more serious and sincere, more realistic, and better scoped to the course. Feedback from MTurk, on the other hand, was described as containing "a lot of joke answers" (P15). Some of the distrust in the crowdsourced feedback was, however, motivated by the students being unfamiliar with the external source of feedback. P47, for example, commented that he preferred "feedback from other students because I am not familiar with MTurk, so it is difficult to estimate the reliability of it."
Monetary Valuation of the Crowdsourced Feedback. We asked each student to quantify the monetary value of the crowdsourced feedback for the group work (see Section 3.2.2). In this section, we provide an analysis of the responses to this item.
We extracted the monetary value from each response (where possible, using the mid-point of a range in seven of the 106 responses). Unrealistically high values (e.g., 5000 EUR) were removed using the upper 1.5 interquartile range (IQR) as a cut-off. After the removal of outliers, the monetary value of the crowdsourced feedback ranged from 0 EUR to 25 EUR, with a mean value of 6.50 EUR and an upper quartile of 10 EUR.
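The outlier rule is the conventional upper Tukey fence; a short sketch, with toy values rather than the actual responses:

import numpy as np

values = np.array([0, 1, 2, 5, 5, 6.5, 8, 10, 10, 20, 25, 5000.0])  # toy data

q1, q3 = np.percentile(values, [25, 75])
cutoff = q3 + 1.5 * (q3 - q1)      # upper 1.5 IQR fence
kept = values[values <= cutoff]    # drop unrealistically high valuations

print(f"cut-off: {cutoff:.2f} EUR, mean of kept values: {kept.mean():.2f} EUR")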
Twenty-nine of the 106 students provided written explanations for their choice of monetary value. In the remainder of this section, we provide an account of the 29 students' thoughts, with specific attention to how students voiced and reasoned about their valuation of the feedback from crowd workers. The overall sentiment about the feedback's value ranged from positive to extremely negative. Two main criteria for valuing the crowdsourced feedback emerged from the students' responses: perceived usefulness and professionalism (see Table 4). The valuation of the feedback fell into two camps: students who thought the feedback was completely worthless (we refer to them as "hardhats" due to their strict, matter-of-fact, engineer-like approach to estimating the feedback's value), and students who attributed some value to the feedback (the "bungaloos," due to their bona fide way of valuing the feedback, driven by conflated expectations).

Among the hardhats, the bulk of the comments mentioned not wanting to pay or not valuing the feedback at all. This was often justified by the low quality of the feedback. P13, for example, would not pay anything "because the feedback was extremely poor."
Other students in this camp expressed their sense of value in more drastic ways, often motivated by the low usefulness and low relevancy of the crowdsourced feedback for the design task. P48, for instance, thought that "Mturk should pay us to read that rubbish, there were only two useful feedbacks and for them maybe few euros." Similarly, P28 complained that "only one comment was actually relevant," and that "even if they paid you, this feedback would not be worth anything." While the potential of the crowdsourced feedback was recognized, some students were disappointed, thinking that the feedback was worth "absolutely nothing, it was a disgrace and a waste of money for at least most of our feedback. I've seen some of our classmates in other groups getting fair feedback and specific ones but ours was just...blatantly saying extremely disappointing" (P4). P98 expressed regret that the crowdsourced feedback was worth "sadly nothing because it literally gave us nothing to work with."
A perceived low specificity of the crowdsourced feedback was a common theme in the responses from both camps, as exemplified by the comment of P89 (hardhat), who thought that some workers "talked about the idea and not the design, and to top it all lot of the critics were not specific at all and so short we couldn't take anything from them (like 'GOOD')." A suspicion of low effort affected the students' valuation of the feedback, as highlighted by P99 (bungaloo), who thought that some workers "didn't even spend 1 minute on filling the feedback form so those deserve 1€ max. We got a very detailed review also though which was worth 3-5€ imo." Only one of the 29 students explicitly mentioned considering the minimum wage of workers on MTurk. This student (P12, bungaloo) placed the value of the feedback "somewhere between whatever the minimum fee on MTurk was (90 cent or so I don't remember exactly) to few euros at max (maybe 5-6 if generous)."
In the camp of bungaloos, the low amount of information in the feedback was recognized by several students, with also many complaints about the relevance and usefulness of the feedback. P26, for instance, complained that "some of the feedback had basically no content in them" and that "it can't be more than maybe 30€ because if it's more than that, the whole thing is a rip off." Similarly, P43 estimated the feedback from the nine workers to cost about 20 Euros, but was concerned whether the workers had "even evaluated the app or just thrown randomly points in the questionnaire."

Many students from the bungaloo camp exhibited a naïveté towards what an acceptable pay for the feedback would be, and towards how to value crowd work in general. For example, P88 noted that "the feedback helped streamline our app which I think is really valuable. A lot of the times small inconveniences can be a gamebreaker for products. Maybe 100€. I'm not quite shure [sic] how much feedback like that costs normally so its hard to put a concrete value to it." P102 admitted to "have no idea but I guess hundreds of euros, maybe closer to thousands." P65 also noted that "I don't have proper knowledge about how much such online workers charge but i think 20–30 euro would be fine."
As is evident in some of the quotes above, students from the second camp often had a negative opinion of the feedback but still considered paying a decent amount to the crowd workers. P96, for example, complained that "all of the open ended questions were not useful," but thought the feedback was still worth 5€. In the same vein, P32 mentioned the feedback was "really poor," but still considered paying "10 EUR per worker."

The primary contributor to this disaccord between perceived usefulness and monetary valuation was the students' mental image of the crowd worker as a trained professional with experience in usability and user experience (UX) testing. The suspected professionalism strongly affected the students' valuation of the crowdsourced feedback: "I would only pay for one of the workers, maybe around 5-20 Euros depending on the expertise of the worker in this field" (P9). The poor quality of the feedback did not dissuade the students from thinking that the workers deserved to be paid, often with extremely generous amounts: "As they are professionals, if I consider the feedback on my application, the cost be between 100-200 Euros" (P36). Similarly, P68 valued the feedback at "150 [EUR] max," even though "some [open-ended] feedback was only numbers like '3'."
DISCUSSION
The majority of students in our study perceived peer feedback as more useful, more detailed, more specific, more effective, more actionable, and of greater fairness than the crowdsourced feedback from MTurk. Further, the crowd workers spent less effort on writing their feedback, as evident in the high percentage of short answers. Peer feedback was more elaborate and contained less noise than the crowdsourced feedback.

In line with prior work [27], the qualitative analysis of the final questionnaire revealed that students perceived the affective tone (valence) of anonymous feedback to be slightly more positive compared to feedback from peers. Peer feedback was perceived to be much harsher and more critical in tone. This, however, did not negatively impact the students' sense of fairness. On the contrary, students appreciated critical feedback, as long as it was elaborate, specific, well justified, and useful. We found, however, a gap in how students perceived and valued the crowdsourced feedback, which may contribute to a false sense of achievement in students, as discussed in the following section.
Our data highlights that students' perceptions of the monetary value of MTurk work may not be accurate. But since students likely have no experience with crowdsourcing, why is the students' perception of feedback value relevant and interesting? Why does it matter whether they know how much to pay crowd workers?

From the students' individual perspective in the design task, the monetary value of feedback may indeed be irrelevant. From the perspectives of teaching, crowdsourcing research, and research funding, it is however exceedingly important to know that money spent on crowdsourcing is well spent as a monetary and pedagogical investment by the teaching organization.
Table 4: Students' criteria for estimating the monetary value of crowdsourced feedback from MTurk.

Criteria and sub-criteria | Description
Perceived Usefulness | The usefulness of the feedback and its appropriateness for completing the design task.
• Perceived Quality | The overall quality of the feedback.
• Perceived Relevance | The relevance of the formative feedback for the students' design task.
• Perceived Effort | The effort the worker put into completing the task.
• Perceived Helpfulness | The amount of novel insights provided by the feedback to improve the design.
Professionalism | The assumed expertise and skill set of the worker for conducting usability and user experience tests.
Knowing that students valued the feedback, and how they valued it, is therefore of particular importance to research and education institutions as well as to teachers.

Over the years, many researchers have argued for fair work conditions on crowdsourcing platforms (e.g., [18, 39]). On the worker side, much of the discussion focuses on the imbalance in the dominant power structures on crowdsourcing platforms and on meeting a minimum wage for workers. From the requester's perspective, the crowdsourcing literature primarily focuses on the design of effective quality control mechanisms. Our work highlights a third side: the feedback receiver's qualitative experience of feedback. The design space for crowdsourcing involving third parties has only recently begun emerging. Examples include PledgeWork [20], a system for volunteers to donate their income from crowd work to a third party (a charity), revenue sharing (e.g., [8]), and the review of subcontracting microwork by Morris et al. [26]. In the classroom with three parties (i.e., the teacher, crowd workers, and students), the requester of feedback (the teacher) may not be the receiver and beneficiary of the feedback (in this case, the students). This poses the question of how crowdsourced tasks for the collection of feedback should be priced. As the feedback mainly contributes value to the feedback receiver, the feedback receiver may be the right party to determine the feedback's value and hence its price. However, prior research on design feedback in the context of the classroom priced the crowdsourcing tasks without taking the feedback's value for students into consideration.

On the other hand, many students in our cohort were clearly unsure of how to value the crowdsourced feedback. Interestingly, these students still considered paying the workers a handsome amount of money for the feedback, even though the feedback was often of low usefulness. One reason for this gap in the perception and valuation of the feedback was that students conjured up an image of a professional and experienced online worker who is equipped with subject-specific expertise and an appropriate skill set. This conjured image, however, is in contrast to how MTurk is designed: as a marketplace for anonymous humans to complete short tasks irrespective of skills and expertise. Further, local salary standards obviously skew people's understanding of the monetary value of labor on MTurk.

All things considered, if teachers are to employ crowdsourced design feedback, they must educate their students about online work and its value. We contend that in the age of the gig economy, crowdsourcing, and especially its valuation, should be a fixed part of the Computer Science curriculum. Considering the strengths of employing crowdsourced feedback in the classroom, we concur with prior studies and argue that using crowdsourced feedback as a complement to traditional feedback mechanisms is useful for students.
Our work confirms the findings of prior studies that crowdsourced feedback is a good supplement to peer feedback [6, 37]. While the students in our study generally favored peer design feedback, they also discovered and acknowledged clear value in the crowdsourced design feedback: value that is impossible to obtain in the classroom setting alone. For instance, the diversity of the people providing the feedback was mentioned. In general, getting an outsider's perspective and feedback from diverse people from different cultures and backgrounds was valuable to the students. Crowdsourced design feedback was seen as a way to provide a reality check. Related to this, a few students stated that crowd workers did not "hold back with their feedback" (P67), whereas many perceived peer feedback as being "somehow biased" (e.g., P46, P100, and P105), even if they could not articulate this aspect in more detail. Further, the students reported that contrasting the feedback from the two different sources helped them identify the weaknesses and strengths in their designs, which allowed the students to set priorities and take appropriate action to address the weaknesses in their designs. More specifically, the students could distinguish between the two different groups of application users and think about how to improve the application for both groups. In the following section, we reflect on our own perspective as teachers of the HCI course.
The complementary value of crowdsourced feedback becomes a question of trade-offs between the added value to the learning versus the added burden on teachers and the monetary cost of the feedback. In our experience, there is certainly something of value in students seeing their designs being rated by other than the familiar faces in the class: it is exhilarating and introduces an element of excitement to teaching. From the teachers' standpoint, utilizing MTurk as a feedback source was a refreshing experiment and provided us a chance to expose the students to feedback from people who are not their "friendly peers." (Several of this article's authors were involved in organizing the study's course.) Further, and while we have no data to back this up, we hypothesize that the knowledge of one's design ending up online and being inspected by others than peers motivates the students to put in more effort.

Ultimately, teachers are the ones who choose whether or not to use crowdsourced feedback. In our case, for example, this will certainly not be the last time we employ crowd workers for additional feedback. Yet, several questions to consider remain. For instance, what is the correct way to determine the value of crowdsourced feedback? In the case of crowdsourcing platforms, the feedback is simply work that has a certain monetary value, dictated by the platforms' rules as well as the workers' and requesters' perceptions. The classroom as a context distorts this simple way of looking at value, however. In many institutions, teaching does not have any extra budget available, and the question of value is not just about money. It is also a question of the time and effort required to set it all up: collecting and distributing the feedback. We did this manually, and it certainly was not an insignificant amount of work. Not all teachers are equipped with the skills and knowledge required to crowdsource feedback, and learning such skills could be difficult. There is a dearth of tools that would ease the setup for non-technically savvy teachers, and given how different each case most likely is, it is challenging to even envision sufficiently generic and easy-to-use tools for crowdsourcing feedback that would work across all types of classes.

Further, with teaching one always has to consider the quality of feedback. Instructors supposedly are experts in the subjects they teach, and the students can thus trust the feedback and its quality. With crowds, there is always an uncontrolled element of uncertainty involved. Should we trust that the crowd feedback, collected and distributed as-is, is of sufficient quality, or should the instructors vet and double-check everything? Would this, then, lead to a loss of authenticity in the crowd feedback, even if authenticity is one of its perks? We believe that a fully transparent approach might work best here: collecting the feedback from crowds while explaining to the students exactly what they are in for and from what type of workforce.

We envision students taking a more active role in collaboratively working with crowds in the classroom, and we have already taken this approach in our own teaching. Our work may inspire the design of pedagogical tools and interventions (e.g., a set of rubrics), using the criteria that we outline in this paper as a framework. For example, we now encourage students to define their own set of rubrics and guiding questions for collecting feedback as part of their learning experience. To this end, we give recommendations for using crowdsourced feedback in the classroom in the next section.
RECOMMENDATIONS
In this section, we consolidate recommendations that are common knowledge in the crowdsourcing community and available in the crowdsourcing literature (e.g., [6]), but that may be novel for teachers in the HCI community who are less familiar with crowdsourcing and wish to use it as a source of feedback in the classroom.
We contend that teachers should educate their class about crowdsourcing (and crowd work in general), especially about the working conditions on crowdsourcing platforms and how such work is commonly priced. To effectively evaluate the crowdsourced feedback, students must be aware of the strengths and weaknesses of the crowdsourcing model. In the context of a course that provides crowdsourced feedback to students, educators should collect demographic data about the crowd workers’ educational and professional backgrounds. These demographic data may help students adjust their inflated expectations and arrive at a more realistic valuation of the crowdsourced feedback.
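One way to operationalize this is to append a short demographic questionnaire to the feedback task itself and share the aggregated answers with the class. The sketch below is our own illustration; the field names, prompts, and options are hypothetical, not the instrument used in this study.

```python
# Illustrative demographic items to append to a crowdsourced feedback
# task. All fields, prompts, and options are hypothetical examples.
DEMOGRAPHIC_ITEMS = [
    {
        "id": "education",
        "prompt": "What is your highest completed level of education?",
        "options": ["High school", "Bachelor's", "Master's", "Doctorate", "Other"],
    },
    {
        "id": "occupation",
        "prompt": "What is your current occupation?",
        "options": None,  # free-text response
    },
    {
        "id": "design_experience",
        "prompt": "How many years of design-related experience do you have?",
        "options": ["None", "Less than 1", "1-3", "4-10", "More than 10"],
    },
]

# Render the items as plain text, e.g., for pasting into a HIT template.
for item in DEMOGRAPHIC_ITEMS:
    print(item["prompt"])
    if item["options"]:
        print("  Options: " + " / ".join(item["options"]))
```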
According to feedback intervention theory, if feedback signals that an effort falls short of an expected standard, learners become motivated to increase their efforts to attain the standard [21]. To implement this mechanism successfully, the standard and the expectations must be clearly explained to the feedback provider. Crowd workers should be made aware that their feedback is given to students as formative, not summative, feedback. Feedback requesters should define clear success criteria for the crowd workers and provide a definition of what constitutes good feedback. Structuring feedback with rubrics, scaffolds, guiding questions, and other structured workflows may help increase the quality of the feedback in this regard [2, 3, 44]. In particular, requesters of crowdsourced feedback should pay attention to how the criteria that affect the students’ perception of quality and fairness shape the students’ experience of feedback. Critical and harsh feedback, for instance, was not necessarily perceived as unfair by the students.
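To make such structuring concrete, the sketch below renders a rubric of standards and guiding questions into the plain-text instructions shown to a feedback provider, stating up front that the feedback is formative. The criteria and prompts are illustrative assumptions, not the rubric used in our course.

```python
# An illustrative feedback rubric in the spirit of structured feedback
# workflows [2, 3, 44]. Criteria and questions are hypothetical.
RUBRIC = {
    "visual design": {
        "standard": "Layout, typography, and color support the app's purpose.",
        "guiding_questions": [
            "Which element draws your attention first, and why?",
            "Which screen feels most cluttered, and what would you remove?",
        ],
    },
    "navigation": {
        "standard": "A first-time user can reach every main feature unaided.",
        "guiding_questions": [
            "Where did you hesitate or get lost while exploring the prototype?",
            "What did you expect to happen that did not happen?",
        ],
    },
}


def render_task(rubric: dict) -> str:
    """Turn the rubric into plain-text task instructions that state the
    expected standard before asking the guiding questions."""
    lines = ["This is formative feedback for a student design, not a grade."]
    for criterion, item in rubric.items():
        lines.append(f"\n[{criterion}] Standard: {item['standard']}")
        lines.extend(f"  - {q}" for q in item["guiding_questions"])
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_task(RUBRIC))
```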
To effectively complement each other, both sources of feedback must contribute valuable insights. The typical qualification criteria employed in our study (i.e., a 95% past acceptance rate and 100 completed HITs) did not prove sufficient to motivate the MTurk workers to provide good feedback. The quality of the crowdsourced feedback can, however, be elevated by filtering responses after data collection, which is a common practice with crowdsourced data. In our study, we found that the open-ended feedback provided by the crowd workers markedly improved when short responses were removed; the relevance and usefulness of the feedback can be expected to improve if such responses are discarded.
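A minimal sketch of such post-collection filtering follows; the 25-word cutoff is an assumption chosen for illustration, not a threshold validated in our study.

```python
# Post-collection filtering of open-ended crowd feedback. In our study,
# the feedback markedly improved once short responses were removed; the
# exact cutoff below (25 words) is an illustrative assumption.
MIN_WORDS = 25


def filter_feedback(responses: list[str], min_words: int = MIN_WORDS) -> list[str]:
    """Keep only responses long enough to plausibly contain a
    substantive critique."""
    return [r for r in responses if len(r.split()) >= min_words]


responses = [
    "nice app",  # discarded: too short to be actionable
    "The onboarding screen buries the main call to action below three "
    "paragraphs of text; I would move the sign-up button above the fold "
    "and cut the copy to one sentence, because first-time users are "
    "unlikely to scroll before deciding whether to stay.",
]
print(filter_feedback(responses))  # only the substantive critique remains
```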
Another consideration is how feedback should be collected and provided to students. As is evident in a number of crowd feedback systems (e.g., [29, 41, 43]), visualizing and aggregating feedback supports the feedback receiver in making sense of it. Research on mechanisms for aggregating crowdsourced design feedback is only recently emerging (e.g., [43]). Our study found that students particularly valued diversity in the responses and appreciated the direct contrast between the feedback from the two sources.
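To illustrate what even lightweight aggregation could look like, the sketch below tags each comment with a rubric dimension via naive keyword matching and tallies the result. Crowd feedback systems such as Voyant [41] and Decipher [43] use far more sophisticated mechanisms; this is only a starting point.

```python
from collections import Counter

# Naive keyword-based tagging for illustration only; the dimensions and
# keywords are hypothetical and would need tuning per assignment.
DIMENSION_KEYWORDS = {
    "visual design": ["color", "font", "layout", "icon", "contrast"],
    "navigation": ["menu", "button", "screen", "back", "tab"],
}


def tag_dimension(comment: str) -> str:
    """Assign a comment to the first rubric dimension whose keywords
    it mentions, or to 'other'."""
    lowered = comment.lower()
    for dimension, keywords in DIMENSION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return dimension
    return "other"


def summarize(comments: list[str]) -> Counter:
    """Count how many comments address each dimension, giving the
    receiver an overview before reading the raw feedback."""
    return Counter(tag_dimension(c) for c in comments)


comments = [
    "The color contrast on the login screen is too low.",
    "I could not find the back button from the settings menu.",
    "Great idea overall!",
]
print(summarize(comments))
# Counter({'visual design': 1, 'navigation': 1, 'other': 1})
```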
LIMITATIONS
We acknowledge limitations in our study. First, the feedback receiver’s pedagogic literacy affects how feedback is evaluated [32]. Our findings may therefore be specific to this class, and we do not claim that they generalize to other cohorts of students or other study subjects.
Second, the subjective experience of feedback is influenced by a number of factors. For instance, the order in which feedback is received may affect its perception [27, 40]. Nguyen et al. found that framing feedback for a writing task with positive affective language had a positive effect on work quality [27]. In Wu and Bailey’s study, participants were more motivated and perceived the feedback most favorably when negative feedback was given after positive feedback [40]. In our study, we did not control the order of feedback; students explored the contents of the feedback package on their own terms. We argue that our study setup is not unrealistic, as it aligns well with the microtask crowdsourcing model found on MTurk and reflects how feedback could practically be provided in the classroom.
Third, a limitation of the investigation into the monetary valuation of crowdsourced design feedback is that, while the questionnaire item specifically asked the students to estimate the value of the crowdsourced feedback as a whole, some students may still have estimated the value per worker. However, the students who elaborated on their answers typically mentioned whether their estimate was per worker or for the whole of the crowdsourced feedback, as discussed in Section 5.3.
Fourth, knowledge of the feedback source may have influenced the students’ interpretation of the feedback. The choice to disclose the source was, however, imperative for teaching students the value of the two feedback sources, and we believe that revealing the source only at the end of the study would have created tension among the students. Further, peer feedback was not assigned a value in our study. Peer feedback in our context (i.e., teaching an HCI class) is a mandatory part of the course and a traditional element of teaching and learning. In our institution’s ethical stance, peer feedback cannot and should not be assigned a monetary value, and taxation law prohibits us from handing out money to students. We did not, however, see signs of insincerity in the students’ answers (after removing the few unrealistic responses; see Section 5.3.1). We acknowledge that alternative study designs could have been explored.
Last, we did not control for the students’ familiarity with crowdsourcing. Students received a general introduction to crowdsourcing in the lectures (but without going into detail about the work reality on crowdsourcing platforms). A few students had prior knowledge of the work conditions on crowdsourcing platforms and were thus able to accurately estimate the price of crowdsourced work. The majority of the students, however, were new to crowdsourcing, as evident in their responses to our survey.
CONCLUSION
In this work, we provided a detailed empirical study of how students perceive and value crowdsourced feedback, and of how crowdsourced feedback compares to peer feedback in the classroom. We found that students preferred peer feedback, as it was perceived as more useful, fair, effective, and actionable. Additionally, our investigation of the students’ monetary valuation of the crowdsourced feedback revealed quality, relevancy, effort, and helpfulness as important factors that shape the value of the feedback for students. We found clear evidence that some students were naïve toward the work conditions on crowdsourcing platforms and how such work is priced; the monetary valuation of the crowdsourced feedback by these students was strongly shaped by a mental image of the worker as a trained professional. Ultimately, we believe that crowdsourced design feedback in HCI teaching provides a great way to complement peer feedback, as long as student expectations are calibrated adequately.
ACKNOWLEDGMENTS
We thank the students of the HCI course and the workers from Amazon Mechanical Turk who contributed their feedback in this study. This research is partially enabled by the GenZ strategic profiling project at the University of Oulu, supported by the Academy of Finland (project number 318930).
REFERENCES
[1] Jennifer Carr. 2011. Providing Effective Feedback. 15–17 pages. https://ssrn.com/abstract=1919096
[2] Amy Cook, Steven Dow, and Jessica Hammer. 2020. Designing Interactive Scaffolds to Encourage Reflection on Peer Feedback. In Proceedings of the 2020 ACM Designing Interactive Systems Conference (DIS ’20). Association for Computing Machinery, New York, NY, USA, 1143–1153. https://doi.org/10.1145/3357236.3395480
[3] Amy Cook, Jessica Hammer, Salma Elsayed-Ali, and Steven Dow. 2019. How Guiding Questions Facilitate Feedback Exchange in Project-Based Learning. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300368
[4] Deanna P. Dannels and Kelly Norris Martin. 2008. Critiquing Critiques: A Genre Analysis of Feedback Across Novice to Expert Design Studios. Journal of Business and Technical Communication 22, 2 (2008), 135–159. https://doi.org/10.1177/1050651907311923
[5] Biplab Deka, Zifeng Huang, Chad Franzen, Jeffrey Nichols, Yang Li, and Ranjitha Kumar. 2017. ZIPT: Zero-Integration Performance Testing of Mobile App Designs. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST ’17). Association for Computing Machinery, New York, NY, USA, 727–736. https://doi.org/10.1145/3126594.3126647
[6] Steven Dow, Elizabeth Gerber, and Audris Wong. 2013. A Pilot Study of Using Crowds in the Classroom. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13). Association for Computing Machinery, New York, NY, USA, 227–236. https://doi.org/10.1145/2470654.2470686
[7] Josemaria Elizondo-Garcia, Christian Schunn, and Katherina Gallardo. 2019. Quality of Peer Feedback in Relation to Instructional Design: A Comparative Study in Energy and Sustainability MOOCs. International Journal of Instruction 12, 1 (2019), 1025–1040.
[8] Shaoyang Fan, Ujwal Gadiraju, Alessandro Checco, and Gianluca Demartini. 2020. CrowdCO-OP: Sharing Risks and Rewards in Crowdsourcing. Proc. ACM Hum.-Comput. Interact. 4, CSCW2, Article 132 (Oct. 2020), 24 pages. https://doi.org/10.1145/3415203
[9] Gerhard Fischer, Kumiyo Nakakoji, Jonathan Ostwald, Gerry Stahl, and Tamara Sumner. 1993. Embedding Computer-Based Critics in the Contexts of Design. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems (CHI ’93). Association for Computing Machinery, New York, NY, USA, 157–164. https://doi.org/10.1145/169059.169133
[10] Joseph L. Fleiss. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin 76, 5 (1971), 378–382.
[11] Tyler Hamby and Wyn Taylor. 2016. Survey Satisficing Inflates Reliability and Validity Measures: An Experimental Comparison of College and Amazon Mechanical Turk Samples. Educational and Psychological Measurement 76, 6 (2016), 912–932. https://doi.org/10.1177/0013164415627349
[12] Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, and Jeffrey P. Bigham. 2018. A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical Turk. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–14.
[13] John Hattie and Helen Timperley. 2007. The Power of Feedback. Review of Educational Research 77, 1 (2007), 81–112. https://doi.org/10.3102/003465430298487
[14] Kasper Hornbæk, Søren S. Sander, Javier Andrés Bargas-Avila, and Jakob Grue Simonsen. 2014. Is Once Enough? On the Extent and Content of Replications in Human-Computer Interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). Association for Computing Machinery, New York, NY, USA, 3523–3532. https://doi.org/10.1145/2556288.2557004
[15] Simo Hosio, Jorge Goncalves, Theodoros Anagnostopoulos, and Vassilis Kostakos. 2016. Leveraging Wisdom of the Crowd for Decision Support. In Proceedings of the 30th International BCS Human Computer Interaction Conference: Fusion! (HCI ’16). BCS Learning & Development Ltd., Swindon, UK, Article 38, 12 pages.
[16] Jeff Howe. 2006. The Rise of Crowdsourcing. Wired Magazine 14, 6 (2006), 1–4.
[17] Julie S. Hui, Amos Glenn, Rachel Jue, Elizabeth M. Gerber, and Steven P. Dow. 2015. Using Anonymity and Communal Efforts to Improve Quality of Crowdsourced Feedback. In Proceedings of the Third AAAI Conference on Human Computation and Crowdsourcing, Elizabeth Gerber and Panos Ipeirotis (Eds.). AAAI Press, Palo Alto, CA, USA, 72–82.
[18] Lilly C. Irani and M. Six Silberman. 2013. Turkopticon: Interrupting Worker Invisibility in Amazon Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13). Association for Computing Machinery, New York, NY, USA, 611–620. https://doi.org/10.1145/2470654.2470742
[19] Hyeonsu B. Kang, Gabriel Amoako, Neil Sengupta, and Steven P. Dow. 2018. Paragon: An Online Gallery for Enhancing Design Feedback with Visual Examples. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3173574.3174180
[20] Keiko Katsuragawa, Qi Shu, and Edward Lank. 2019. PledgeWork: Online Volunteering through Crowdwork. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–11. https://doi.org/10.1145/3290605.3300541
[21] Avraham N. Kluger and Angelo S. DeNisi. 1996. The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin 119, 2 (1996), 254–284.
[22] Markus Krause, Tom Garncarz, JiaoJiao Song, Elizabeth M. Gerber, Brian P. Bailey, and Steven P. Dow. 2017. Critique Style Guide: Improving Crowdsourced Design Feedback with a Natural Language Model. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). Association for Computing Machinery, New York, NY, USA, 4627–4639. https://doi.org/10.1145/3025453.3025883
[23] Chinmay E. Kulkarni, Michael S. Bernstein, and Scott R. Klemmer. 2015. PeerStudio: Rapid Peer Feedback Emphasizes Revision and Improves Performance. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale (L@S ’15). Association for Computing Machinery, New York, NY, USA, 75–84. https://doi.org/10.1145/2724660.2724670
[24] Amna Liaqat, Cosmin Munteanu, and Carrie Demmans Epp. 2020. Collaborating with Mature English Language Learners to Combine Peer and Automated Feedback: A User-Centered Approach to Designing Writing Support. International Journal of Artificial Intelligence in Education. A Festschrift in Honour of Jim Greer (2020), 1–42.
[25] Kurt Luther, Jari-Lee Tolentino, Wei Wu, Amy Pavel, Brian P. Bailey, Maneesh Agrawala, Björn Hartmann, and Steven P. Dow. 2015. Structuring, Aggregating, and Evaluating Crowdsourced Design Critique. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15). Association for Computing Machinery, New York, NY, USA, 473–485. https://doi.org/10.1145/2675133.2675283
[26] Meredith Ringel Morris, Jeffrey P. Bigham, Robin Brewer, Jonathan Bragg, Anand Kulkarni, Jessie Li, and Saiph Savage. 2017. Subcontracting Microwork. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). Association for Computing Machinery, New York, NY, USA, 1867–1876. https://doi.org/10.1145/3025453.3025687
[27] Thi Thao Duyen T. Nguyen, Thomas Garncarz, Felicia Ng, Laura A. Dabbish, and Steven P. Dow. 2017. Fruitful Feedback: Positive Affective Language and Source Anonymity Improve Critique Reception and Work Outcomes. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW ’17). Association for Computing Machinery, New York, NY, USA, 1024–1034. https://doi.org/10.1145/2998181.2998319
[28] Jonas Oppenlaender and Simo Hosio. 2019. Towards Eliciting Feedback for Artworks on Public Displays. In Proceedings of the 2019 ACM Conference on Creativity & Cognition (C&C ’19). Association for Computing Machinery, New York, NY, USA, 562–569. https://doi.org/10.1145/3325480.3326583
[29] Jonas Oppenlaender, Thanassis Tiropanis, and Simo Hosio. 2020. CrowdUI: Supporting Web Design with the Crowd. Proceedings of the ACM on Human-Computer Interaction 4, EICS, Article 76 (June 2020), 28 pages. https://doi.org/10.1145/3394978
[30] Viktoria Pammer-Schindler, Erik Harpstead, Benjamin Xie, Betsy DiSalvo, Ahmed Kharrufa, Petr Slovak, Amy Ogan, Joseph Jay Williams, and Michael J. Lee. 2020. Learning and Education in HCI: A Reflection on the SIG at CHI 2019. Interactions 27, 5 (Sept. 2020), 6–7. https://doi.org/10.1145/3411290
[31] Eyal Peer, Joachim Vosgerau, and Alessandro Acquisti. 2014. Reputation as a Sufficient Condition for Data Quality on Amazon Mechanical Turk. Behavior Research Methods 46, 4 (Dec. 2014), 1023–1031. https://doi.org/10.3758/s13428-013-0434-y
[32] Margaret Price, Karen Handley, Jill Millar, and Berry O’Donovan. 2010. Feedback: All that effort, but what is the effect? Assessment & Evaluation in Higher Education 35, 3 (2010), 277–289. https://doi.org/10.1080/02602930903541007
[33] Mark A. Runco and Garrett J. Jaeger. 2012. The Standard Definition of Creativity. Creativity Research Journal 24, 1 (2012), 92–96. https://doi.org/10.1080/10400419.2012.650092
[34] D. Royce Sadler. 1989. Formative Assessment and the Design of Instructional Systems. Instructional Science 18, 2 (June 1989), 119–144. https://doi.org/10.1007/BF00117714
[35] Martin Schrepp, Andreas Hinderks, and Jörg Thomaschewski. 2017. Design and Evaluation of a Short Version of the User Experience Questionnaire (UEQ-S). International Journal of Interactive Multimedia and Artificial Intelligence 4, 6 (2017), 103–108.
[36] Maryam Tohidi, William Buxton, Ronald Baecker, and Abigail Sellen. 2006. Getting the Right Design and the Design Right: Testing Many Is Better Than One. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’06). Association for Computing Machinery, New York, NY, USA, 1243–1252. https://doi.org/10.1145/1124772.1124960
[37] Helen Wauck, Yu-Chun (Grace) Yen, Wai-Tat Fu, Elizabeth Gerber, Steven P. Dow, and Brian P. Bailey. 2017. From in the Class or in the Wild? Peers Provide Better Design Feedback Than External Crowds. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). Association for Computing Machinery, New York, NY, USA, 5580–5591. https://doi.org/10.1145/3025453.3025477
[38] Robert Philip Weber. 1990. Basic Content Analysis (2nd ed.). Sage, Newbury Park, CA, USA. https://doi.org/10.4135/9781412983488
[39] Mark E. Whiting, Grant Hugh, and Michael S. Bernstein. 2019. Fair Work: Crowd Work Minimum Wage with One Line of Code. In Proceedings of the Seventh AAAI Conference on Human Computation and Crowdsourcing (HCOMP ’19). AAAI Press, Palo Alto, CA, USA, 197–206.
[40] Y. Wayne Wu and Brian P. Bailey. 2017. Bitter Sweet or Sweet Bitter? How Valence Order and Source Identity Influence Feedback Acceptance. In Proceedings of the 2017 ACM SIGCHI Conference on Creativity and Cognition (C&C ’17). Association for Computing Machinery, New York, NY, USA, 137–147. https://doi.org/10.1145/3059454.3059458
[41] Anbang Xu, Shih-Wen Huang, and Brian Bailey. 2014. Voyant: Generating Structured Feedback on Visual Designs Using a Crowd of Non-Experts. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’14). Association for Computing Machinery, New York, NY, USA, 1433–1444. https://doi.org/10.1145/2531602.2531604
[42] Anbang Xu, Huaming Rao, Steven P. Dow, and Brian P. Bailey. 2015. A Classroom Study of Using Crowd Feedback in the Iterative Design Process. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15). Association for Computing Machinery, New York, NY, USA, 1637–1648. https://doi.org/10.1145/2675133.2675140
[43] Yu-Chun Grace Yen, Joy O. Kim, and Brian P. Bailey. 2020. Decipher: An Interactive Visualization Tool for Interpreting Unstructured Design Feedback from Multiple Providers. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376380
[44] Alvin Yuan, Kurt Luther, Markus Krause, Sophie Isabel Vennix, Steven P. Dow, and Björn Hartmann. 2016. Almost an Expert: The Effects of Rubrics and Expertise on Perceived Value of Crowdsourced Design Critiques. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW ’16). Association for Computing Machinery, New York, NY, USA.