Interplay of Game Incentives, Player Profiles and Task Difficulty in Games with a Purpose

Abstract

How to take multiple factors into account when evaluating a Game with a Purpose? How is player behaviour or participation influenced by different incentives? How does player engagement impact their accuracy in solving tasks? In this paper, we present a detailed investigation of multiple factors affecting the evaluation of a GWAP and we show how they impact on the achieved results. We inform our study with the experimental assessment of a GWAP designed to solve a multinomial classification task.



Gloria Re Calegari and Irene Celino

Cefriel – Politecnico of Milano, Viale Sarca 226, 20126 Milano, Italy
{gloria.re,irene.celino}@cefriel.com

Games with a Purpose [1] are a well-known Human Computation approach [2] to encourage users to execute tasks with an entertaining reward. While several metrics are proposed in literature to evaluate the ability of GWAPs to achieve their intended purpose, there is a large number of factors that influence their success and effectiveness.

In order to fully understand the strengths as well as the weaknesses of a GWAP, we propose an approach that takes into account player characteristics (reliability, participation, behaviour and accuracy), game aspects (playing incentive, playing style and game nature) and features of the task to be solved (level of difficulty and variety). Our goal is to investigate the interplay between those different factors, by proposing a multi-faceted analysis framework that allows for a deep assessment and understanding of the efficacy of a GWAP to achieve its purpose. We apply the proposed framework to a specific GWAP to show the empirical results and the insights that can be drawn through our approach.

The original contributions of this paper are: (1) an extension of traditional GWAP metrics to take temporal evolution and incentive effects into account; (2) a comparison of engagement metrics and engagement profiles with non-gaming citizen science; and (3) the definition of GWAP-specific engagement profiles and their interplay with different factors (incentive, task difficulty and task variety).

The remainder of the paper is organized as follows: Section 2 illustrates the main related work; Section 3 gives details about the GWAP that we use to exemplify our approach; in the following sections, we propose different evaluation methods, by extending state-of-the-art metrics: global GWAP metrics and interplay with incentive are adopted in Section 4, Section 5 offers a comparison with citizen science user engagement profiles and Section 6 proposes new GWAP player profiling driven by measures of participation and accuracy; finally, Section 7 concludes the paper.

The basic metrics to evaluate GWAPs [1,2,3] are global indicators computed as means over the entire data; while effective in summarizing the behaviour of GWAP players, those are very simple measures that do not tell the entire story: an analysis of data distribution and temporal evolution is usually required to get a deeper understanding of a GWAP.

Some work exists on cross-feature analysis of GWAPs [4] and similarly on citizen science [5] and crowdsourcing [6]; our goal is to contribute to making such evaluation easier to replicate and reproduce.

Participation incentives are usually classified as intrinsic or extrinsic motivation [7]. Some comparative analysis of incentives exists for GWAPs [8], especially in contrast to different methods like micro-working [9,10,11] or machine learning [12]. The effect of competition and tangible rewards on participation and quality of results has also been explored, both in the context of GWAPs [13] and online citizen science campaigns [14], revealing the pros and cons of designing different motivation mechanisms.

Other metrics to evaluate GWAPs can be borrowed from studies of social community [15] and citizen science evolution [16]; in those cases, however, user participation’s “success” is measured through simple indicators like number of participants and contributions, while a deeper investigation is needed to assess the effectiveness of participation. Behavioural studies in HCI research have investigated volunteer characterization in citizen science, defining engagement metrics and profiles [17,18], which may or may not apply to GWAP players.

In the context of (paid) crowdsourcing, assessment is usually conducted in relation to micro-work platforms [19], in which important features are related to cost minimization [20,21], which is out of scope with respect to our work.

While Games with a Purpose are a well-known and widely adopted human computation method to involve users in task solution, a comprehensive assessment of their ability to address their “purpose” needs to take into account multiple factors affecting the game and the players. We therefore propose a multi-faceted analysis framework for GWAPs that includes game aspects, player characteristics and task features, with specific focus on the effect of game incentives on the overall GWAP efficacy.

The GWAP that we will use as running example is Night Knights, an online game for the multinomial classification of images. Pictures come from a massive public-domain dataset provided by NASA and they can be classified according to six different categories depending on their visual content. The classified images – in particular those labeled with three of the six categories – are then used in a subsequent scientific workflow in the field of astronomy and environmental sciences to measure light pollution effects (cf. [12]).

Fig. 1. Night Knights: the gameplay. (a) Classify an image (b) Agreement (c) Disagreement

The GWAP is inspired by the ESP game [3], because users play in random pairs according to an output-agreement mechanism [1]. The game adopts a repeated labeling approach [22] by asking different players to classify the same image; conversely, the same image is never given twice to the same player. Night Knights is built on top of our open source software framework for GWAPs [23].

The players visualize a picture and six buttons reporting the six possible categories (cf. Figure 1); the labeling task is therefore executed by clicking on the category that better fits the picture content. Each game round lasts one minute, during which players can classify as many images as they can (as detailed in the following, on average 15 pictures are played per round); each time the two players agree, they gain points and level up in the game leaderboard; some badges are also assigned in special conditions as additional game intrinsic incentives.

Players’ contributions are aggregated through an incremental truth inference algorithm [24] that (1) processes inputs as soon as they are provided, (2) weights players’ answers with a round-specific reliability measure [25] taking into account players’ answers on control tasks (for which the “true” solution is known), and (3) dynamically adjusts the number of required contributions. Our truth inference approach accounts for the very nature of GWAPs, in which usually there is no “deadline” for contributing, players’ varying attention can impact answer

quality and task difficulty needs a dynamic estimation of the required number of repeated labelings.

In this paper, we use the data collected through Night Knights. The game was released in February 2017 and then it was more extensively advertised for a related competition whose winner joined the 2017 Summer Expedition to observe the Solar Eclipse in USA. The competition lasted about one month, from mid June to mid July 2017, and was addressed to all EU University students. After the end of the competition, the game has still been available online, but without any additional advertising. Overall, the data we analyse was collected in 9 months, one month of competition and 4 months before and after it.

In the following experimental sections, we apply a set of assessment methods on this game data. On the one hand, we exemplify the analyses we propose for a thorough multi-faceted assessment of GWAPs; on the other hand, we provide concrete results from the evaluation of Night Knights, which are – at least partially – typical of GWAPs.

The main metrics adopted in literature [2] to evaluate GWAPs are: throughput, computed as the average number of solved tasks per unit of time; average life play or ALP, i.e. the average time spent by each user playing the game; and expected contribution or EC, measured as the average number of tasks solved by each player. A task is solved when player contributions, aggregated by the truth inference algorithm [26], output a “true” solution. Those indicators are global measures, as they are computed as mean values over the entire GWAP use. Hereafter, we extend this analysis by assessing the influence of different game incentives and the evolution over time of game-play and engagement.

In particular, we investigate how player participation and GWAP results change with and without an extrinsic motivation such as a tangible reward [7]. We analyse incentive effect in terms of both general statistics and specific metrics adopted in GWAP evaluation. We show that user participation can be highly influenced by the presence of an extrinsic motivation.
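As a minimal sketch of the three global metrics described above, the following Python function computes throughput, ALP and EC from aggregate data. The function name, its arguments and the choice of hours and minutes as units are illustrative assumptions, not part of the Night Knights codebase; EC is taken here simply as solved tasks divided by the number of players.

```python
def gwap_metrics(play_hours_per_player, solved_tasks):
    """Global GWAP metrics from per-player play time and a solved-task count.

    play_hours_per_player: list of total hours played, one entry per player
                           (hypothetical input format).
    solved_tasks: number of tasks for which a solution was reached.
    """
    n_players = len(play_hours_per_player)
    total_hours = sum(play_hours_per_player)
    if n_players == 0 or total_hours <= 0:
        raise ValueError("need at least one player with positive play time")

    throughput = solved_tasks / total_hours      # solved tasks per hour of play
    alp_minutes = 60.0 * total_hours / n_players  # average life play, minutes/user
    ec = solved_tasks / n_players                 # expected contribution, tasks/user
    return {"throughput": throughput, "alp_minutes": alp_minutes, "ec": ec}


if __name__ == "__main__":
    # Two players, 4 hours of play in total, 200 solved tasks.
    print(gwap_metrics([1.0, 3.0], 200))
```

Computing the same function separately on the data of each period (before, during, after) would give the kind of per-period comparison discussed in the text.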

In 9 months, Night Knights managed to engage about 650 users that played a substantial amount of time and classified almost 28,000 photos (cf. Table 1). Measuring the main metrics in the three periods (before, during and after the competition), we notice a significant increase of player participation during the competition, both in terms of given contributions and classified images (one order of magnitude higher with the additional incentive in both cases). This difference is clearly highlighted in Figure 2, which shows the temporal evolution of the number of images classified per day. The difference between throughput, ALP and EC in the competition and non-competition periods is statistically significant (t-test or Wilcoxon rank sum test at the 0.01 significance level). Also the play time significantly increases during the competition period, as demonstrated by the ALP metric, which reaches values over 65 minutes/player (cf. Table 1).

Data is available with a CC-BY license at http://ckan.stars4all.eu/.

Table 1. Experimental results in the three periods (before, during and after the introduction of the extrinsic motivation)

Before During After
time span (months)
classified images
contributions
users
285 174 174
total play time (hours)
65 471 29
throughput (tasks/hour)
69 212 113
ALP (mins/user)
EC (tasks/user)

Those results show that providing a tangible reward to players can make them contribute more efficiently, speeding up the classification process (higher throughput), engaging them for a longer time (higher ALP), and ensuring a larger contribution rate to the human computation task (higher EC). As a global result, more tasks get solved.

Fig. 2. Number of images classified per day in the three periods

Adding a tangible prize to a game does not seem to ensure lasting effects. In Night Knights, looking deeper in the before and after periods in Table 1, we do not notice substantial differences in terms of classification and participation rate. The metrics of the before period are slightly higher, probably due to the fact that more users tried the game, attracted by advertising campaigns (small peaks in Figure 2) and by the novelty of the game.

Given this similarity, in our analysis we think it worth distinguishing only between intrinsic motivation periods (e.g., Night Knights before and after periods together, when users play only to have fun) and extrinsic motivation periods (e.g., the during phase of Night Knights, with the tangible and valuable reward).


Defining contribution speed as the number of images played in each round, we check whether this metric is also influenced by a tangible reward.

As explained in Section 3, each round in Night Knights lasts one minute and each user is asked to classify one image at a time, so users have to be quick and classify as many images as possible to increase their score and be successful in the game. Given the image loading time, connection delays and waiting time for the other player’s answer, we estimate that in this case classifying each image takes at least 3–5 seconds, which means 12–20 photos per round.

As Figure 3 shows, in the extrinsic motivation period, the contribution speed follows a normal distribution centered around 15 photos/round, while, in the intrinsic motivation phase, the distribution is flat and most players played fewer than 10 images/round. This indicates that, during the competition, all players did their best to classify as many images as possible, reaching a median value of 15 that coincides with the estimated image classification time. On the other hand, in the intrinsic motivation period, people play the game in a more “relaxed” way, just to try and explore it, taking more time to answer.

Fig. 3. Distribution of the number of images played in each round. (a) Extrinsic motivation (b) Intrinsic motivation
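The estimate of 12–20 photos per round follows directly from the one-minute round length and the 3–5 seconds assumed per image. The short sketch below reproduces that arithmetic; the function name is illustrative.

```python
ROUND_SECONDS = 60  # each Night Knights round lasts one minute (from the text)


def images_per_round(seconds_per_image, round_seconds=ROUND_SECONDS):
    """Upper bound on images classified in one round at a fixed pace."""
    return round_seconds // seconds_per_image


if __name__ == "__main__":
    for pace in (3, 4, 5):
        print(pace, "s/image ->", images_per_round(pace), "images/round")
```

At 4 seconds per image, the middle of the estimated range, the bound is 15 images per round, which is the value around which the competition-period distribution is centered.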

As a first step to the assessment of player behaviour, we adopt the engagement metrics proposed by [17]: activity ratio, number of days a user plays a game divided by the total number of days the user remains linked to the game; daily devoted time, average time (e.g. in hours) a user plays the game in each active day; relative active duration, ratio of days during which a player remains linked to the game and the total number of days since the player joined the game until the day the game is over (this metric can be computed only if a “game end” is envisaged, which is not always the case in GWAPs); and variation in periodicity, standard deviation of the intervals between each pair of non-consecutive active days. Computing those metrics for each player and then applying clustering techniques leads to the identification of engagement profiles. Our goal is to assess if the profiles recognized in citizen science literature with respect to volunteer behaviour are also detected in GWAP player behaviour and if player profiles are affected by game incentives. Indeed, we expect player behaviour to differ from volunteer engagement.
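The four engagement metrics can be sketched for a single player as follows. This is an illustration under stated assumptions rather than the procedure of [17]: the input format (a dictionary from active date to hours played) is hypothetical, the "linked" period is taken as the span from the first to the last active day inclusive, and variation in periodicity is computed as the population standard deviation of the gaps, in days, between successive active days.

```python
from datetime import date
from statistics import mean, pstdev


def engagement_metrics(hours_by_day, game_end=None):
    """Engagement metrics for one player.

    hours_by_day: dict mapping each active date to hours played on that day.
    game_end: optional date on which the game (or contest) ends.
    """
    days = sorted(hours_by_day)
    linked_days = (days[-1] - days[0]).days + 1  # first to last active day

    activity_ratio = len(days) / linked_days
    daily_devoted_time = mean(hours_by_day[d] for d in days)

    # Only defined when a "game end" exists.
    relative_active_duration = None
    if game_end is not None:
        relative_active_duration = linked_days / ((game_end - days[0]).days + 1)

    # Gaps (in days) between successive active days; undefined with one active day.
    gaps = [(b - a).days for a, b in zip(days, days[1:])]
    variation_in_periodicity = pstdev(gaps) if gaps else None

    return {
        "activity_ratio": activity_ratio,
        "daily_devoted_time": daily_devoted_time,
        "relative_active_duration": relative_active_duration,
        "variation_in_periodicity": variation_in_periodicity,
    }


if __name__ == "__main__":
    played = {date(2017, 6, 15): 1.0, date(2017, 6, 16): 0.5, date(2017, 6, 19): 1.5}
    print(engagement_metrics(played, game_end=date(2017, 6, 24)))
```

The resulting per-player vectors are what a clustering step would take as input.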

The mean values (and in brackets standard deviation) of the four main engagement metrics defined by [17] are shown in Table 2. For Night Knights, we distinguish the global values and those measured during the competition only (extrinsic motivation period); for comparison, we also report the values for the citizen science initiatives illustrated in [17,18]. Daily devoted time for Night Knights is measured by approximation, multiplying the number of game rounds by the 1-minute round duration (the actual time is higher, because players also browse leaderboards, badges, played pictures, etc.); relative active duration is computed only during the competition time, where a “project finish time” is defined with the contest deadline.

Table 2. Engagement metrics (mean values and standard deviation in brackets): comparison of Night Knights (global values and competition-only metrics) with citizen science campaigns (MW: Milky Way, GZ: Galaxy Zoo, WI: Weather-it).

                        Night Knights            MW [17]         GZ [17]         WI [18]
                        (global; compet.)
Activity ratio          0.96 (0.17); (0.16)      0.40 (0.40)     0.33 (0.38)     0.32 (0.35)
Daily devoted time      (3.30)                   0.44 (0.54)     0.32 (0.40)     –
Rel. active duration    –; (0.35)                0.20 (0.30)     0.23 (0.29)     0.43 (0.44)
Var. in periodicity     (2.12)                   18.27 (43.3)    25.23 (49.2)    5.11 (5.36)

We observe that Night Knights players display quite a different behaviour with respect to volunteers: they show a 2-3 times higher activity ratio, and also consistently higher values for daily devoted time and relative active duration; this may mean that GWAP players tend to contribute in a more regular manner than volunteers. Focusing on the competition, those metrics also show a clear increase in engagement, with a significantly lower value of variation in periodicity, which suggests that the limited-time contest period stimulates players to access the game even more frequently and regularly.

Clustering players to identify engagement profiles does not give the same results as in the cited citizen science analyses [17,18]. Cross-validation between different methods (within groups sum of squares and Silhouette statistics) suggests an optimal clustering with 3 groups. Applying both agglomerative hierarchical clustering and K-means clustering yields similar and very unbalanced groupings, with one big cluster (around 90% of players) roughly corresponding to the hardworker profile (high activity ratio and low variation in periodicity); the remaining players are grouped in a small cluster that we can name “focused” hardworkers (similar to hardworkers but with higher daily devoted time) and another small cluster that does not clearly correspond to known profiles (low values of all metrics, but higher variation in periodicity). The spasmodic, persistent, lasting and moderate profiles defined in [17] are not observed. This can be interpreted as another difference between player and volunteer engagement, with game users either heavily playing and contributing, or simply trying out the game without being actually engaged.

If we also evaluate user engagement in terms of when players participated, i.e. for how long they played the game, from the first to the last played round, we discover that only few users played the game both in the intrinsic motivation and extrinsic motivation periods; in particular, only 13 users played both before and during the competition and only 17 users became aware of the existence of the game during the competition and went on playing it after its end.

In addition, by analysing the users’ total active time (difference between the last and the first time a user played the game), we discover that most of the users played for a very short amount of time; 75% of players used the game for less than 5 minutes and only 10% played for more than a day.

These statistics are not surprising, because they are strong indicators of the game nature, which is a so-called casual game. Casual games are usually designed to be played in short bursts of a few minutes and then set aside. By their very nature, casual games target the short free/leisure time between the myriad of everyday tasks, such as between work and domestic obligations or between attention and distraction [27]. Regarding the overall time spent playing mobile games, the literature shows that an average gamer spends every day approximately 24 minutes playing games on mobile devices, with heavy gamers spending about 1 hour/day and light gamers about 2 minutes/day [28].

Given that volunteer profiles in citizen science do not seem to suitably describe GWAP players, we focus our investigation on two additional main metrics, player accuracy and player participation, more closely related to human computation, and analyse their interplay with different factors, like game incentive, task difficulty and task variety. The goal is to uncover GWAP-specific user behaviours and to identify GWAP-specific player profiles.

Player accuracy is measured ex-post by counting how many tasks each user correctly solved over the total number of tasks he/she played with; in this context, “correct” refers to the final task solution computed by the truth inference algorithm. Accuracy takes values between 0 and 1 and corresponds to the worker precision or labeling quality metrics used in crowdsourcing literature (e.g. [26]). Player participation is measured as the total number of contributions given by each user in the game rounds he/she played. While there are of course alternative ways to measure participation (e.g., number of game rounds, total played time), we prefer to consider the number of contributions, since this indicator is more closely related to the “task” execution and the game purpose.
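The two metrics can be computed from a flat log of contributions, as in the sketch below. The log format (player, task, label), the `solutions` mapping and the function name are illustrative assumptions; in particular, contributions to tasks that have no final solution yet are counted for participation but skipped for accuracy, which is a simplification not stated in the text.

```python
from collections import defaultdict


def player_metrics(contributions, solutions):
    """Per-player participation and accuracy.

    contributions: iterable of (player, task, label) tuples.
    solutions: dict mapping task -> final label from truth inference.
    """
    given = defaultdict(int)    # all contributions per player
    judged = defaultdict(int)   # contributions on tasks with a final solution
    correct = defaultdict(int)  # contributions matching the final solution

    for player, task, label in contributions:
        given[player] += 1
        if task in solutions:
            judged[player] += 1
            correct[player] += int(label == solutions[task])

    return {
        player: {
            "participation": given[player],
            "accuracy": correct[player] / judged[player] if judged[player] else None,
        }
        for player in given
    }


if __name__ == "__main__":
    log = [("ann", "img1", "City"), ("ann", "img2", "Stars"),
           ("bob", "img1", "Black"), ("ann", "img3", "ISS")]
    print(player_metrics(log, {"img1": "City", "img2": "Aurora"}))
```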

Referring again to Night Knights data, we plot each user as a data point along participation and accuracy axes (cf. Figure 4). To divide players into groups, we applied clustering as in Section 5, but – at least in the case of Night Knights – the results put 98-99% of players in the same cluster, placing only “outliers” in the other clusters. Therefore, to define GWAP-specific profiles, we propose to simply set separation thresholds on the two axes dividing the space into quadrants; more specifically, we adopt the median as separation value, which is a commonly used measure and robust statistic. While this definition is arbitrary, it is also data-independent, thus the proposed approach can be adopted to analyse and compare different GWAPs without loss of generality.

Fig. 4. Players’ participation vs. accuracy and median values

The thresholds calculated on the Night Knights dataset are 12 contributions for the x-axis and 0.87 accuracy for the y-axis. The median value for participation roughly corresponds to the separation between those who played just a couple of game rounds and those who were more deeply engaged (cf. Section 4). The median accuracy value is quite high and this is a good sign about the GWAP efficacy to achieve its purpose; in other cases, when a specific minimum value of accuracy is required, the threshold choice could be driven by domain-specific considerations instead of being identified by the median.

By using this approach, the investigation space is divided into areas that represent different “behavioral” profiles as follows. Along the accuracy axis, we obtain two profiles: accurate players, i.e. players with an accuracy higher than the median, and the remaining inaccurate players (cf. Figure 5-a). Along the participation axis (cf. Figure 5-b), we define as casual players those who contribute less than the median, and as frequent players the most addicted and loyal contributors. Considering both dimensions, we define four profiles (cf. Figure 5-c):

– Beginners (bottom-left): this is the set of users that play the game for a short period of time, just for curiosity; this kind of player gives only a few contributions with low accuracy.
– Snipers (top-left): users that are very accurate in their contributions but contribute only a little. Ideally, they should be motivated to become champions, since their contributions are valuable.
– Champions (top-right): this is the most desirable category of players, since they have a high level of participation with very high accuracy.
– Trolls (bottom-right): this is the category of least desirable users, since they give a lot of inaccurate contributions; having a lot of Trolls in the game either makes the classification process longer, since it is harder to reach an agreement, or even leads to undesired results.

Fig. 5. Definition of GWAP-specific player profiles

Observing again Night Knights data, we can also quantitatively analyse the effect of game incentive on the profile composition (cf. Figure 6). With extrinsic motivation, most users (53%) acted as champions, and this share is much higher than in the total (32%). On the other hand, with the intrinsic motivation only, the presence of champions was lower, only 25%. This difference may indicate that the different incentives lead to different user behaviour; the presence of tangible rewards can engage users for a longer time and can motivate them to contribute with more effort and attention.

Fig. 6. Distribution of players between profiles, in total and with different incentives

With intrinsic motivation, the percentage of snipers was also higher than the average. The largest group of users in the intrinsic motivation period, however, was beginners (37%): probably this happened because they tried the game just for curiosity or to understand how the game works, without paying too much attention to the answers they gave. As expected, the number of beginners was very low with the extrinsic motivation, since players had a clear goal to play the game. Fortunately, the percentages of trolls were low in both periods. This means that the Night Knights game succeeded in avoiding too many spammers that could have made the classification process longer or more inaccurate.

While the above results are specific to Night Knights, the profile analysis can be applied to any other GWAP; indeed, examining the composition of a GWAP player population can reveal different behaviour and inform game re-design.

Finally, we would like to point out an insight that is not immediately evident in Figure 4: since the players on the right part of the plot are those who contributed more, if we sum the contributions from the four profiles, we obtain the figures in Table 3. In the case of our GWAP, therefore, the large majority of contributions comes from the most active and accurate players, which is reassuring with respect to the achievement of the game purpose.

Table 3. Distribution of contributions across player profiles

                      Beginners   Snipers   Champions   Trolls
Task contributions    0.7%        0.4%      95.9%       3.0%
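The median-based quadrant assignment and the per-profile share of contributions can be sketched as follows. The input format and function names are hypothetical; following the definitions in the text, a player is treated as accurate when accuracy is strictly above the median and as casual when participation is strictly below the median, so players exactly at the participation median count as frequent (an assumption about tie handling).

```python
from statistics import median


def assign_profiles(players):
    """players: dict mapping name -> (participation, accuracy)."""
    participation_median = median(part for part, _ in players.values())
    accuracy_median = median(acc for _, acc in players.values())

    profiles = {}
    for name, (part, acc) in players.items():
        frequent = part >= participation_median  # casual = below the median
        accurate = acc > accuracy_median         # accurate = above the median
        if frequent:
            profiles[name] = "Champion" if accurate else "Troll"
        else:
            profiles[name] = "Sniper" if accurate else "Beginner"
    return profiles


def contribution_share(players, profiles):
    """Fraction of all contributions coming from each profile."""
    total = sum(part for part, _ in players.values())
    share = {}
    for name, (part, _) in players.items():
        share[profiles[name]] = share.get(profiles[name], 0.0) + part / total
    return share


if __name__ == "__main__":
    toy = {"a": (2, 0.50), "b": (3, 0.95), "c": (40, 0.90), "d": (50, 0.60)}
    profiles = assign_profiles(toy)
    print(profiles)
    print(contribution_share(toy, profiles))
```

When a domain-specific minimum accuracy is required, the accuracy median could be replaced by a fixed threshold, as noted in the text.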

In the following, we analyse the interplay between player accuracy and player participation by taking into account additional factors. More specifically, we check if there is a statistically significant difference between the mean accuracy of casual and frequent players with respect to some control variables, namely the incentive type, the task difficulty and the task variety.

To answer this question, we check for mean difference in accuracy for casual and frequent players in the intrinsic and extrinsic motivation periods.

In Night Knights, the average accuracy of the frequent players is higher than that of casual players in both periods, as shown in the first two boxplots of Figure 7; this difference is also significant from a statistical point of view (p-value of the t-test less than 0.05). We also notice a mean accuracy increase of about 10% when a tangible reward is present (from 0.74 to 0.81 for casual and from 0.83 to 0.90 for frequent): since during the competition users were encouraged to play to win the prize, they paid more attention to the image classification, raising also the answers’ quality.

Fig. 7. Accuracy distribution of casual and frequent players with different incentives (a and b) and with different task difficulty (c and d). The difference between players’ profiles is statistically significant in all cases except for easy tasks.

This may indicate that in GWAPs frequent players contribute in a more accurate way than casual ones, and that extrinsic motivation has a positive impact on accuracy.

We define task difficulty as the number of different users needed to solve it (the higher the number, the harder the task); this is because our incremental truth inference algorithm (cf. Section 3) dynamically estimates the number of contributions required to solve a task. We split the images in two sets based on their difficulty and we check if this impacts player behaviour.

For Night Knights, we marked as “easy” the images that required only 4 contributions (the minimum number to reach an agreement according to our domain experts), and as “difficult” those that required more contributions. “Easy” images are 58% of all classified images, while the number of contributions required to classify “difficult” images ranges from 5 to 17.

As shown in the (c) and (d) boxplots in Figure 7, accuracy on “easy” images is almost the same between casual and frequent players (indeed, the difference in mean accuracies is not statistically significant). On the contrary, this difference is statistically significant for “difficult” images (mean accuracy is 0.84 for frequent players and 0.68 for casual players).

Those results suggest a learning effect in GWAPs: the more a user plays the game, the more he/she understands the task to be solved, thus increasing his/her accuracy and consequently also result quality.
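The difficulty split and the comparison of group means can be sketched as below. The threshold of 4 contributions comes from the text; everything else is an assumption made for illustration. In particular, the text only says "t-test", and Welch's unequal-variance statistic is swapped in here because it needs nothing beyond the Python standard library; turning the statistic into a p-value would additionally require the t distribution, which is not shown.

```python
from math import sqrt
from statistics import mean, variance

EASY_CONTRIBUTIONS = 4  # minimum contributions needed to solve a task (from the text)


def difficulty(contributions_needed):
    """Label a task by the number of contributions it took to solve it."""
    return "easy" if contributions_needed <= EASY_CONTRIBUTIONS else "difficult"


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


if __name__ == "__main__":
    # Made-up per-player accuracies on difficult tasks, for illustration only.
    frequent = [0.90, 0.80, 0.85]
    casual = [0.60, 0.70, 0.65]
    print(difficulty(4), difficulty(9), round(welch_t(frequent, casual), 3))
```

A large positive statistic on difficult tasks and one close to zero on easy tasks would correspond to the pattern described for Figure 7 (c) and (d).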

Since Night Knights aims to solve a multinomial classification task, we investigate whether there is any evident phenomenon related to the different image categories. Therefore, we compute again the accuracies of the two groups of casual and frequent players in classifying the 6 output classes. We summarize the mean accuracy values in Table 4.

Table 4. Mean accuracy of casual and frequent players with images of different categories. The difference is not statistically significant for any of the categories.

            Black   City   Stars   Aurora   ISS   None
Casual
Frequent

Applying the t-test to check if the mean accuracy is different for the two players’ profiles, we cannot reject the null hypothesis. This may mean that any player is equally able/unable to distinguish the different categories, independently of his/her level of participation; indeed, in our GWAP, there is no need for background- or domain-specific knowledge to play the game. This analysis can help in identifying the need for training or expert knowledge of GWAP players.

On the other hand, the mean accuracy values change a lot across different categories, spanning between 0.57 and 0.91. This is also explained by the different distribution of easy/difficult tasks across the variety of classes, as shown in Figure 8. Indeed, some categories are intrinsically more difficult to classify than others, but Table 4 shows that this complexity related to task variety is equally perceived by players with low and high levels of participation.

Fig. 8. Distribution of easy/difficult tasks across different image categories.

In this paper, we presented an investigation of the interplay of different factors in the evaluation of GWAP results. More specifically, we focused on the profiling of players according to different user metrics and we studied the influence of game incentive and task characteristics.

To inform our discussion, we described the results of such multi-dimensional analysis over the data collected by a GWAP for multinomial classification of images. While some of our considerations result from the quantitative analysis of a single game, and are not per se generalizable, we believe that the proposed approach is replicable to evaluate any other GWAP. We believe that such deeper analysis is an important (and sometimes neglected) investigation to understand players’ behaviour, to evaluate the impact of various factors on reliability and quality, and finally to assess the ability of GWAPs to achieve their intended purpose and its sustainability over time.

Finally, we would like to point out that, even when player participation is limited in time, a classification GWAP can be used to build a reasonably large training set to be used in traditional machine learning settings to train classifiers for larger-scale labeling. In our previous work, we showed that humans and machines indeed agree on image classification for the Night Knights dataset [12].

Acknowledgments

This work is partially supported by the STARS4ALL project (H2020-688135), co-funded by the European Commission. We thank all the Night Knights players who contributed to the classification task solution and allowed us to perform this work.

References

1. Von Ahn, L., Dabbish, L.: Designing games with a purpose. Communications of the ACM (8) (2008) 58–67
2. Law, E., Ahn, L.v.: Human computation. Synthesis Lectures on Artificial Intelligence and Machine Learning (3) (2011) 1–121
3. Von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: Proceedings of the SIGCHI conference on Human factors in computing systems, ACM (2004) 319–326
4. Singh, A., Ahsan, F., Blanchette, M., Waldispühl, J.: Lessons from an online massive genomics computer game. In: Proceedings of the Fifth Conference on Human Computation and Crowdsourcing (HCOMP 2017). (2017)
5. Sauermann, H., Franzoni, C.: Crowd science user contribution patterns and their implications. Proceedings of the National Academy of Sciences (3) (2015) 679–684
6. Yang, J., Redi, J., Demartini, G., Bozzon, A.: Modeling task complexity in crowdsourcing. In: Fourth AAAI Conference on Human Computation and Crowdsourcing. (2016)
7. Ryan, R.M., Deci, E.L.: Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary educational psychology (1) (2000) 54–67
8. Prestopnik, N., Crowston, K., Wang, J.: Gamers, citizen scientists, and data: Exploring participant contributions in two games with a purpose. Computers in Human Behavior (2017) 254–268
9. Thaler, S., Simperl, E., Wolger, S.: An experiment in comparing human-computation techniques. IEEE Internet Computing (2012) 52–58
10. Feyisetan, O., Simperl, E., Van Kleek, M., Shadbolt, N.: Improving paid microtasks through gamification and adaptive furtherance incentives. In: Proceedings of the 24th International Conference on World Wide Web. WWW ’15, Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee (2015) 333–343
11. Feyisetan, O., Simperl, E.: Social incentives in paid collaborative crowdsourcing. ACM Trans. Intell. Syst. Technol. (6) (2017) 73:1–73:31
12. Re Calegari, G., Nasi, G., Celino, I.: Human computation vs. machine learning: an experimental comparison for image classification. Human Computation Journal (1) (2018) 13–30
13. Siu, K., Zook, A., Riedl, M.O.: Collaboration versus competition: Design and evaluation of mechanics for games with a purpose. In: Proceedings of Foundations of Digital Games Conference. (2014)
14. Reeves, N., West, P., Simperl, E.: “A game without competition is hardly a game”: The impact of competitions on player activity in a human computation game. In: Proceedings of Human Computation Conference. (2018)
15. Reeves, N., Tinati, R., Zerr, S., Van Kleek, M., Simperl, E.: From crowd to community: A survey of online community features in citizen science projects. In: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW 2017. (2017) 2137–2152
16. Celino, I., Corcho, Ó., Hölker, F., Simperl, E.: Citizen science: Design and engagement (Dagstuhl seminar 17272). Dagstuhl Reports (7) (2017) 22–43
17. Ponciano, L., Brasileiro, F.: Finding volunteers’ engagement profiles in human computation for citizen science projects. Human Computation Journal (2) (2015) 247–266
18. Aristeidou, M., Scanlon, E., Sharples, M.: Profiles of engagement in online communities of citizen science participation. Computers in Human Behavior (2017) 246–256
19. Allahbakhsh, M., Benatallah, B., Ignjatovic, A., Motahari-Nezhad, H.R., Bertino, E., Dustdar, S.: Quality control in crowdsourcing systems: Issues and directions. IEEE Internet Computing (2) (2013) 76–81
20. Karger, D.R., Oh, S., Shah, D.: Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research (1) (2014) 1–24
21. Han, T., Sun, H., Song, Y., Wang, Z., Liu, X.: Budgeted task scheduling for crowdsourced knowledge acquisition. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM (2017) 1059–1068
22. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (2008) 614–622
23. Re Calegari, G., Fiano, A., Celino, I.: A Framework to build Games with a Purpose for Linked Data Refinement. In: Proceedings of the International Semantic Web Conference 2018, Resources Track. (2018)
24. Celino, I., Re Calegari, G.: An Incremental Truth Inference Approach to Aggregate Crowdsourcing Contributions in GWAPs. In: currently under revision. (2018)
25. Celino, I., Contessa, S., Corubolo, M., Dell’Aglio, D., Della Valle, E., Fumeo, S., Krüger, T.: Linking Smart Cities Datasets with Human Computation: the case of UrbanMatch. In: Proceedings of the 11th international conference on The Semantic Web, Springer-Verlag (2012) 34–49
26. Zheng, Y., Li, G., Li, Y., Shan, C., Cheng, R.: Truth inference in crowdsourcing: is the problem solved? Proceedings of the VLDB Endowment 10