Gamified Crowdsourcing for Idiom Corpora Construction
Gülşen Eryiğit*, Ali Şentaş, and Johanna Monti

NLP Research Group, Faculty of Computer & Informatics, Istanbul Technical University, Istanbul, Turkey
Department of Artificial Intelligence & Data Engineering, Istanbul Technical University, Istanbul, Turkey
UNIOR NLP Research Group, Department of Literary, Linguistic and Comparative Studies, University of Naples L'Orientale
*Corresponding author. Email: [email protected]
(Received 2021; revised xx xxx xxx; accepted xx xxx xxx)
Abstract
Learning idiomatic expressions is seen as one of the most challenging stages in second language learning because of their unpredictable meaning. A similar situation holds for their identification within natural language processing applications such as machine translation and parsing. The lack of high-quality usage samples exacerbates this challenge not only for humans but also for artificial intelligence systems. This article introduces a gamified crowdsourcing approach for collecting language learning materials for idiomatic expressions; a messaging bot is designed as an asynchronous multiplayer game for native speakers who compete with each other while providing idiomatic and nonidiomatic usage examples and rating other players' entries. As opposed to classical crowdprocessing annotation efforts in the field, for the first time in the literature, a crowdcreating & crowdrating approach is implemented and tested for idiom corpora construction. The approach is language independent and evaluated on two languages in comparison to traditional data preparation techniques in the field. The reaction of the crowd is monitored under different motivational means (namely, gamification affordances and monetary rewards). The results reveal that the proposed approach is powerful in collecting the targeted materials, and although being an explicit crowdsourcing approach, it is found entertaining and useful by the crowd. The approach has been shown to have the potential to speed up the construction of idiom corpora for different natural languages to be used as second language learning material, training data for supervised idiom identification systems, or samples for lexicographic studies.
Keywords: crowdsourcing, gamification, game with a purpose (GWAP), idiomatic expressions, language resources
1. Introduction
An idiom is usually defined as a group of words established by usage as having an idiosyncratic meaning not deducible from the meanings of the individual words forming it. It is possible to encounter cases where the idiom's constituents come together with or without forming that special meaning. This ambiguous situation poses a significant challenge for both foreign language learners and artificial intelligence (AI) systems, since it requires a deep semantic understanding of the language.

Idiomatic control has been seen as a measure of proficiency in a language, both for humans and for AI systems. The task is usually referred to as idiom identification or idiom recognition in natural language processing (NLP) studies and is defined as understanding/classifying the idiomatic (i.e., figurative) or nonidiomatic usage of a group of words (i.e., either with the literal meaning arising from their cooccurrence or by their separate usage). For the words {"hold", "one's", "tongue"}, two such usage examples are provided below:

"Out of sheer curiosity I held my tongue, and waited." (idiomatic, meaning "stop talking")
"One of the things that they teach in first aid classes is that a victim having a seizure can swallow his tongue, and you should hold his tongue down." (nonidiomatic)

Learning idiomatic expressions is seen as one of the most challenging stages in second language learning because of their unpredictable meaning. Several studies have discussed efficient ways of teaching idioms to second language (L2) learners (Vasiljevic 2015; Siyanova-Chanturia 2017), and obviously, both computers and humans need high-quality usage samples exemplifying idiom usage scenarios and patterns. When do some words occurring within the same sentence form a special meaning together? Can the components of an idiom undergo different morphological inflections? If so, is it possible to inflect them in any way, or do they have particular limitations? May other words intervene between the components of an idiom? If so, could these be of any word type, or are there limitations? Although it may be possible to deduce some rules defining some specific idioms (see the Appendix for an example), unfortunately, creating a knowledge base that provides such detailed definitions, or enough samples to deduce answers to these questions, is a very labor-intensive and expensive process, which can only be conducted by native speakers. Yet, these knowledge bases are crucial for foreign language learning, since there is not enough time or input for students to implicitly acquire idiom structures in the target language.

Due to the mentioned difficulties, there exist very few studies that introduce an idiom corpus (providing idiomatic and nonidiomatic examples), and these are available only for a couple of languages and a limited number of idioms: Birke and Sarkar (2006) and Cook et al. (2008) for 25 and 53 English idioms, respectively, and Hashimoto and Kawahara (2009) for 146 Japanese idioms. Similarly, high-coverage idiom lexicons either do not exist for every language or contain only a couple of idiomatic usage samples, which is insufficient to answer the above questions. Examples of use were considered must-have features of an idiom dictionary app in Caruso et al. (2019), which tested a dictionary mockup for the Italian language with Chinese students.
On the other hand, foreign language learning communities are trying to fill this resource gap by creating and joining online groups or forums to share idiom examples^a. Obviously, the necessity for idiom corpora applies to all natural languages, and we need an innovative mechanism to speed up the creation process of such corpora while ensuring the generally accepted quality standards of language resource creation.

Gamified crowdsourcing is a rapidly increasing trend, and researchers explore creative methods of use in different domains (Morschheuser et al. 2017; Morschheuser and Hamari 2019; Murillo-Zamorano et al. 2020). The use of gamified crowdsourcing for idiom corpora construction has the potential to provide solutions to the above-mentioned problems, as well as to the unbalanced distributions of idiomatic and nonidiomatic samples and the data scarcity problem encountered in traditional methods. This article proposes a gamified crowdsourcing approach for idiom corpora construction where the crowd actively takes a role in creating and annotating the language resource and rating annotations. The approach is experimented on two languages (Turkish and Italian) and evaluated in comparison to traditional data preparation techniques in the field. The results reveal that the approach is powerful in collecting the targeted materials, and although being an explicit crowdsourcing approach, it is found entertaining and useful by the crowd. The approach has been shown to have the potential to speed up the construction of idiom corpora for different natural languages, to be used as second language learning material, training data for supervised idiom identification systems, or samples for lexicographic studies.

The article is structured as follows: §2 provides the background and related work, §3 describes the game design, §4 provides analyses, and §5 the conclusion.

^a Some examples include https://t.me/Idiomsland for English idioms with 65K subscribers, https://t.me/Deutschpersich for German idioms with 3.4K subscribers, https://t.me/deyimler for Turkish idioms with 2.7K subscribers, and https://t.me/Learn_Idioms for French idioms with 2.5K subscribers. The last three are messaging groups providing idiom examples and their translations in Arabic.
2. Background & Related Work
Several studies investigate idioms from a cognitive science perspective. Kaschak and Saffran (2006) constructed artificial grammars that contained idiomatic and "core" (nonidiomatic) grammatical rules and examined learners' ability to learn the rules from the two types of constructions. The findings suggested that learning was impaired by idiomaticity, counter to the conclusion of Sprenger et al. (2006) that structural generalizations from idioms and nonidioms are similar in strength. Konopka and Bock (2009) investigate idiomatic and nonidiomatic English phrasal verbs and state that, despite differences in idiomaticity and structural flexibility, both types of phrasal verbs induced structural generalizations and differed little in their ability to do so.

We may examine the traditional approaches which focus on idiom annotation in two main parts: first, the studies focusing solely on idiom corpus construction, and second, the studies on general multiword expression (MWE) annotation, which also includes idioms. Both approaches have their own drawbacks, and the exploration of different data curation strategies in this area is crucial for any natural language, but especially for morphologically rich and low-resource languages (MRLs and LRLs).

The studies focusing solely on idiom corpus construction (Birke and Sarkar 2006; Cook et al. 2008; Hashimoto and Kawahara 2009) first retrieve sentences from a text source according to some keywords from the target group of words (i.e., the target idiom's constituents) and then annotate them as idiomatic or nonidiomatic samples. The retrieval process is not as straightforward as one might think, since the keywords should cover all possible inflected forms of the words in focus (e.g., the keyword "go" could not retrieve its inflected form "went"), especially for MRLs, where words may appear under hundreds of different surface forms. The solution to this may be lemmatization of the corpus and searching with lemmas, but this will not work in cases where the data source is pre-indexed and only available via a search engine interface, such as the internet. This first approach may also lead to unexpected results in the class distributions. For example, Hashimoto and Kawahara (2009) state that examples were annotated for each idiom, regardless of the proportion of idioms and literal phrases, until the total number of examples for each idiom reached 1,000, which is sometimes not reachable due to data unavailability.

Idioms are seen as a subcategory^b of multiword expressions (MWEs), which have been the subject of many initiatives in recent years, such as the Parseme EU COST Action, the MWE-LEX workshop series, and the ACL special interest group SIGLEX-MWE. Traditional methods for creating MWE corpora (Schneider et al. 2014; Vincze et al. 2011; Losnegaard et al. 2016; Savary et al. 2018) generally rely on manually annotating MWEs on previously collected text corpora (news articles most of the time, and sometimes books), this time without being retrieved with any specific keywords. However, the scarcity of MWEs (especially idioms) in text has presented obstacles to corpus-based studies and the NLP systems addressing these (Schneider et al. 2014). In this approach, only idiomatic examples are annotated. One may think that all the remaining sentences containing an idiom's components are nonidiomatic samples. However, in this approach, human annotators are prone to overlook especially those MWE components that are not juxtaposed within a sentence. Bontcheva et al. (2017) state that annotating one named entity type (another subcategory of MWEs) at a time as a crowdsourcing task is a better approach than trying to annotate all entity types at the same time. Similar to Bontcheva et al. (2017), our approach achieves the goal of collecting quality and creative samples by focusing the crowd's attention on a single idiom at a time. Crowdsourcing MWE annotations has been rarely studied (Kato et al. 2018; Fort et al. 2018, 2020), and these were crowdprocessing^c efforts.

^b In this article, differing from Constant et al. (2017), which lists subcategories of MWEs, we use the term "idiom" for all types of MWEs carrying an idiomatic meaning, including phrasal verbs in some languages.
^c "Crowdprocessing approaches rely on the crowd to perform large quantities of homogeneous tasks. Identical contributions are a quality attribute of the work's validity. The value is derived directly from each isolated contribution (non-emergent)" (Morschheuser et al. 2017).

Crowdsourcing (Howe 2006) is a technique used in many linguistic data collection tasks (Mitrović 2013). Crowdsourcing systems are categorized under four main categories: crowdprocessing, crowdsolving, crowdrating, and crowdcreating (Geiger and Schader 2014; Prpić et al. 2015; Morschheuser et al. 2017). While "crowdcreating solutions seek to create comprehensive (emergent) artifacts based on a variety of heterogeneous contributions", "crowdrating systems commonly seek to harness the so-called wisdom of crowds to perform collective assessments or predictions" (Morschheuser et al. 2017). The use of these two latter types of crowdsourcing together has a high potential to provide solutions to the above-mentioned problems of idiom corpora construction.

One platform researchers often use for crowdsourcing tasks is
Amazon Mechanical Turk (MTurk). Snow et al. (2008) used it for linguistic tasks such as word similarity, textual entailment, temporal ordering, and word sense disambiguation. Lawson et al. (2010) used MTurk to build an annotated NER corpus from emails. Akkaya et al. (2010) used the platform to gather word-sense disambiguation data. The platform proved especially cost-efficient in the highly human labor-intensive task of word-sense disambiguation (Akkaya et al. 2010; Rumshisky et al. 2012). Growing popularity also brought criticism of the platform (Fort et al. 2011). The MTurk platform uses monetary compensation as an incentive to complete the tasks. Another way of utilizing the crowd for microtasks is gamification, which, as an alternative to monetary compensation, utilizes game elements such as points, achievements, and leaderboards. Von Ahn (2006) pioneered these types of systems and called them Games with a Purpose (GWAPs). The ESP Game (Von Ahn and Dabbish 2004) can be considered one of the first GWAPs; it is designed as a game where users label images from the web while playing a Taboo™-like game against each other. The authors later developed another GWAP, Verbosity (Von Ahn et al. 2006), this time for collecting common-sense facts in a similar game setting. GWAPs were popularized in the NLP field by early initiatives such as Chklovski (2005), Phrase Detectives (Chamberlain et al. 2008),
JeuxDeMots (Artignan et al. 2009),
and Dr. Detective (Dumitrache et al. 2013).
RigorMortis (Fort et al. 2018, 2020) gamifies the traditional MWE annotation process described above.

Gamified crowdsourcing is a rapidly increasing trend, and researchers explore creative methods of use in different domains (Morschheuser et al. 2017; Morschheuser and Hamari 2019; Murillo-Zamorano et al. 2020). Morschheuser et al. (2017) introduce a conceptual framework of gamified crowdsourcing systems, according to which the motivation of the crowd may be provided by either gamification affordances (such as leaderboards, points, and badges) or additional incentives (such as monetary rewards and prizes). In our study, we examine both of these motivation channels and report their impact. According to Morschheuser et al. (2017), "one major challenge in motivating people to participate is to design a crowdsourcing system that promotes and enables the formation of positive motivations towards crowdsourcing work and fits the type of the activity." Our approach to gamified crowdsourcing for idiom corpora construction relies on crowdcreating and crowdrating. We value both the creativity and the systematic contributions of the crowd. As explained above, since it is not easy to retrieve samples from available resources, we expect our users to be creative in providing high-quality samples.

Morschheuser et al. (2018) state that users increasingly expect software to be not only useful but also enjoyable to use, and that gamified software requires an in-depth understanding of motivational psychology and multidisciplinary knowledge. In our case, these disciplines include language education, computational linguistics, natural language processing, and gamification. To shed light on the successful design of gamified software, the above-mentioned study divides the engineering of gamified software into seven main phases and mentions 13 design principles (from now on referred to as DPs) covering, among others, the design, evaluation, and monitoring phases. The following sections provide the details of our gamification approach by relating its stages to these main design phases and principles.
3. Game Design
The aim while designing the software was to create an enjoyable and cooperative environment that would motivate volunteers to help research studies. The game is designed to collect usage samples for idioms whose constituent words may also commonly be used in their literal meanings within a sentence. An iterative design process has been adopted. After the first ideation, design, and prototype implementation phases, the prototype was shared with the stakeholders (see §Acknowledgments), as stated in DP. Dodiom^e is designed as an asynchronous multiplayer game^f for native speakers who compete with each other while providing idiomatic and nonidiomatic usage examples and rating other players' entries. The game is an explicit crowdsourcing game, and players are informed from the very beginning that they are helping to create a public data source by playing this game^g.

The story of the game is based on a bird named Dodo (the persona of the bot) trying to learn a foreign language and having difficulty learning idioms in that language. Players try to teach Dodo the idioms in that language by providing examples. Dodiom has been developed as an open-source project (available on GitHub^h) with a focus on being easily adapted to different languages. All the interaction messages are localized and shown to the users in the related language; localizations are currently available for English, Italian, and Turkish.

Dodo learns a new idiom every day. The idioms to be played each day are selected by moderators according to their tendency to be used with their literal meaning. For each idiom, players have a predetermined time frame to submit their samples and reviews so that they can play at their own pace. Since the bot may send notifications via the messaging platform in use, the time frame is set to between 11 a.m. and 11 p.m.^i

When users connect to the bot for the first time, they are greeted by Dodo, who explains what the game is about and teaches them how to play in a step-by-step manner (Figure 1a). This pre-game tutorial and the simplicity of the game proved useful, as most of the players were able to learn the game in a matter of minutes and provided high-quality examples. All the game messages are studied very carefully to achieve this goal and to ensure that a crowd unfamiliar with AI or linguistics understands the task easily. Random tips about the game are also shared with the players right after they submit their examples. This approach is similar to video games, where tips about the game are shown to players on loading screens and/or menus.

Figure 1b shows the main menu, from where the players can access the various modes of the game.

Figure 1: Dodiom welcome and menu screens. (a) Dodo greeting the player, describing the game, and showing the next steps. (b) Main menu showing the currently available options.

^e A language-agnostic name has been given to the game so that it can be generalized to every language.
^f Asynchronous multiplayer games enable players to take their turns at a time that suits them; i.e., the users do not need to be in the game simultaneously.
^g In addition to the reporting and banning mechanism (DP).
^h https://github.com/Dodiom/dodiom
^i Different time frames have been tried in the iterative development cycle (DP).
Today's Idiom tells the player what that day's chosen idiom is; players can then submit usage examples for said idiom to earn more points.
Submit allows players to submit new usage examples. When clicked, Dodo asks the player to input an example sentence, and when the player sends one, the sentence is checked for whether it contains the words (i.e., the lemmas of the words) that appear in the idiom (a sketch of this check is given below). If so, Dodo then asks whether these words form an idiom in the given sentence or not. Players are awarded points each time other players like their examples, so they are incentivized to enter multiple high-quality submissions.
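The submission check just described boils down to lemma matching between the player's sentence and the day's idiom constituents. Below is a minimal sketch of such a check, using NLTK's English tokenizer and WordNet lemmatizer for illustration; the deployed bot plugs in language-specific lemmatizers (described later), and the helper names here are ours, not taken from the Dodiom code base:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)      # tokenizer models ("punkt_tab" on newer NLTK)
nltk.download("wordnet", quiet=True)    # lemmatizer data

_wn = WordNetLemmatizer()

def candidate_lemmas(token: str) -> set[str]:
    """All plausible lemmas of a surface token (noun and verb readings)."""
    t = token.lower()
    return {t, _wn.lemmatize(t, pos="n"), _wn.lemmatize(t, pos="v")}

def locate_idiom(sentence: str, idiom_lemmas: list[str]) -> list[int] | None:
    """Token positions matching each idiom constituent, or None if any
    constituent is missing (the submission is then rejected)."""
    tokens = nltk.word_tokenize(sentence)
    positions = []
    for lemma in idiom_lemmas:
        hit = next((i for i, tok in enumerate(tokens)
                    if lemma in candidate_lemmas(tok)), None)
        if hit is None:
            return None
        positions.append(hit)
    return positions

# "held" lemma-matches "hold", so the check passes and Dodo goes on to ask
# whether the usage is idiomatic or not.
print(locate_idiom("Out of sheer curiosity I held my tongue and waited.",
                   ["hold", "tongue"]))   # -> [5, 7]
```

The returned positions are also what later determine the sample's type (juxtaposed vs. separated constituents), as described in §3.2.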
Review allows players to review submissions sent by other players. Dodo shows the player other players' examples one at a time, together with their annotations (i.e., idiom or not), and asks for approval. Users are awarded points for each submission they review, so they are also incentivized to review. The exact scoring and incentivization system will be explained in §3.2. Figure 2a shows a simple interaction between Dodo and a user, where Dodo asks whether or not the words pulling and leg (automatically underlined by the system) in the sentence "Quit pulling my leg, will you" are used idiomatically. The user responds with acknowledgment or dislike, and then Dodo thanks the user for his/her contribution. Users can also report examples which do not fit the general guidelines (e.g., vulgar language, improper usage of the platform), so that these submissions can later be reviewed by moderators. The moderators can flag the submissions and ban users from the game depending on the submission. Submissions with fewer reviews are shown to the players first (i.e., before the samples that were reviewed previously), so that each submission can receive approximately the same number of reviews (sketched below).

Figure 2: Some interaction screens. (a) Review interaction. (b) Leaderboard interaction. (c) User's score and achievements.
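The "fewest reviews first" policy just described is easy to state precisely. The sketch below, with illustrative names rather than Dodiom's actual schema, picks the next submission for a reviewer so that review counts stay balanced:

```python
from dataclasses import dataclass, field

@dataclass
class Submission:
    sid: int
    author: int
    reviewer_ids: set[int] = field(default_factory=set)

def next_submission_for(reviewer: int, queue: list[Submission]) -> Submission | None:
    # Never show players their own entries or ones they have already reviewed.
    candidates = [s for s in queue
                  if s.author != reviewer and reviewer not in s.reviewer_ids]
    # Fewest reviews first, so every submission converges to a similar count.
    return min(candidates, key=lambda s: len(s.reviewer_ids), default=None)

queue = [Submission(1, author=7, reviewer_ids={2, 3}), Submission(2, author=8)]
print(next_submission_for(reviewer=5, queue=queue).sid)   # -> 2 (zero reviews so far)
```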
Help shows the help message, which is a more compact version of the pre-game tutorial.
Show Scoreboard displays the current state of the leaderboard, which is updated every time a player is awarded any points. As seen in Figure 2b, the scoreboard displays the first five players' and the current player's scores. The scoreboard is reset every day for each idiom. Additionally, 100 submissions are set as a soft target for the crowd, and a message stating the number of submissions remaining to reach this goal is shown below the scoreboard. The message no longer appears once the target is reached.
Achievements shows the score, level, and locked/unlocked achievements of the user. An example can be seen in Figure 2c, where many achievements are designed to gamify the process and reward players for specific actions, such as the Early Bird achievement for early submissions and Author for sending 10 submissions in a given day. Whenever an achievement is obtained, the user is notified with a message and an exciting figure (similar to the ones in Figure 3).
Dodiom uses both gamification affordances and additional incentives (Morschheuser et al. 2017) to motivate its crowd. Before deciding on the final design, we tested several scoring systems, with and without additional incentives. This section presents the detailed form of the final scoring system together with previous attempts, gamification affordances, and additional incentives.

Figure 3: Notification samples. (a) Score increase. (b) Falling back from the top 5.
The philosophy of the game is based on collecting valuable samples that illustrate different ways of use and make it possible to draw inferences defining how to use a specific idiom (such as the ones in the Appendix). The samples to be collected are categorized into four main types, given below. For the sake of game simplicity, this categorization is not explicitly described to the users but is only used for background evaluations.

• A-type samples: idiomatic samples in which the constituent words are used side by side (juxtaposed) (ex: "Please hold your tongue and wait."),
• B-type samples: idiomatic samples in which the constituent words are separated by some other words, which is a more common phenomenon in free-word-order languages^j (ex: "Please hold your breath and tongue and wait for the exciting announcements."),
• C-type samples: nonidiomatic samples in which the constituent words are used side by side (ex: "Use a sterile tongue depressor to hold the patient's tongue down."),
• D-type samples: nonidiomatic samples in which the constituent words are separated by some other words (ex: "Hold on to your mother tongue.").

Producing samples from some categories (e.g., B-types and C-types) may not be the most natural form of behavior. For Turkish, we experimented with different scorings to motivate our users to produce samples from the different categories. Before settling on the final form of the scoring system, two other systems were experimented with in the preliminary tests. These include having a fixed set of scores for each type (i.e., 30, 40, 20, and 10 for A-, B-, C-, and D-types, respectively). This scoring system caused players to enter submissions only of the B-type, to get the most out of each submission, and resulted in very few samples of the other types. To fix this major imbalance problem, in another trial, a decay parameter was added to lower the initial type scores whenever a new submission arrived. Unfortunately, this new system had little to no effect on remedying the imbalance and made the game harder to understand for players, who could not easily figure out the scoring system^k. This latter strategy was also expected to incentivize players to enter submissions early in the game, but it did not work out as planned.

Although B-type utterances are among the most challenging types for language learners and an important type that we want to collect samples of, they may not be as common as A-type utterances for some idioms and may be rare in languages with fixed word order. Similarly, producing C-type samples may be difficult for some idioms, and overdirecting the crowd to produce more examples of this type can result in unnatural sentences. Thus, game motivations should be chosen carefully.

We used scoring, notifications, and tips to increase the type variety in the dataset in a meaningful and natural way. The final scoring system used during the evaluations (presented in the next sections) is as follows. Each review is worth one point, unless it is done during the happy hour, when all reviews are worth two points. After each submission, a random tip is shown to the submitter, motivating him/her either to review others' entries or to submit B-type or C-type samples. The scores for each type are set to 10, the only difference being the B-type, which is set to 12. The system periodically checks the difference between the numbers of A-type and C-type samples, and when this exceeds 15 samples, it increases the scores of the idiomatic or nonidiomatic class. The score increase is announced to the crowd via a message (Figure 3a, stating either that Dodo needs more idiomatic samples or more nonidiomatic samples) and remains active until the difference falls below 5 samples (a sketch of this logic is given below). Although producing C-type samples may be difficult for some idioms, since the notification message calls for nonidiomatic samples in general, the crowd is expected to provide both C-type and D-type samples in accordance with the natural balance.

^j In free-word-order languages, syntactic information is mostly carried at the word level by affixes; thus, words may freely change their position within the sentence without affecting the meaning.
^k User feedback was collected via personal communication during the trial runs.
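To make the categorization and the balancing rule concrete, the sketch below classifies an accepted submission into the A–D types from the matched constituent positions (as returned by a matcher like the one sketched earlier) and applies the score adjustment: base scores of 10 (12 for B-type), a boost when A-type samples outnumber C-type samples by more than 15 (or vice versa), lifted once the gap falls below 5. The thresholds follow the text; the boost amount and all names are our assumptions, not Dodiom's code:

```python
class IdiomScorer:
    """Illustrative re-implementation of the balancing score described above."""
    BASE = {"A": 10, "B": 12, "C": 10, "D": 10}
    BOOST, START_GAP, STOP_GAP = 5, 15, 5     # boost size is our assumption

    def __init__(self):
        self.counts = {"A": 0, "B": 0, "C": 0, "D": 0}
        self.boosted = None                    # "idiomatic", "nonidiomatic", or None

    @staticmethod
    def sample_type(positions: list[int], idiomatic: bool) -> str:
        """A/B = idiomatic, C/D = nonidiomatic; A/C = juxtaposed, B/D = separated."""
        juxtaposed = all(b - a == 1 for a, b in zip(positions, positions[1:]))
        return ("A" if juxtaposed else "B") if idiomatic else ("C" if juxtaposed else "D")

    def score(self, positions: list[int], idiomatic: bool) -> int:
        stype = self.sample_type(positions, idiomatic)
        self.counts[stype] += 1
        gap = self.counts["A"] - self.counts["C"]   # the A-vs-C difference from the text
        if self.boosted is None and abs(gap) > self.START_GAP:
            # Too many of one class: boost the other one (and notify the crowd).
            self.boosted = "nonidiomatic" if gap > 0 else "idiomatic"
        elif self.boosted and abs(gap) < self.STOP_GAP:
            self.boosted = None                     # balance restored: base scores again
        bonus = self.BOOST if (self.boosted == "idiomatic" and stype in "AB"
                               or self.boosted == "nonidiomatic" and stype in "CD") else 0
        return self.BASE[stype] + bonus

scorer = IdiomScorer()
# "Please hold your tongue and wait.": the pattern {hold, one's, tongue} matches
# adjacent positions 1,2,3 ("your" fills the "one's" slot) -> A-type, score 10.
print(scorer.score([1, 2, 3], idiomatic=True))   # 10
# "Please hold your breath and tongue ...": "tongue" is separated -> B-type, 12.
print(scorer.score([1, 2, 5], idiomatic=True))   # 12
```

Note the assumption that slot words such as "one's" count as idiom constituents, in line with the {"hold", "one's", "tongue"} example in the Introduction; otherwise "hold your tongue" would wrongly be classified as separated.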
Push notifications are also used to increase player engagement. Several kinds of notifications are sent through the game, listed below. The messages are arranged so that an inactive user receives only a couple of notifications from the game each day; the first three items below are sent to every user, whereas the last three are sent only to the active users of that day.

(1) Every morning, Dodo sends a good-morning message when the game starts and tells the player that day's idiom.
(2) When a category score is changed, a notification is sent to all players (Figure 3a).
(3) A notification is sent to players when the review happy hour starts. This event is triggered manually by the moderators, and for one hour, reviews are worth double points. This notification also helps to reactivate low-speed play.
(4) When a player's submission is liked by other players, the author of the submission is notified and encouraged to check the scoreboard again. Only one message of this type is sent within a limited time window, to avoid sending too many messages consecutively.
(5) When a player becomes the leader of the scoreboard or enters the top five, he/she is congratulated.
(6) When a player loses his/her first position on the leaderboard, or loses his/her place in the top three or five, he/she is notified about it and encouraged to come back and send more submissions to take his/her place back (Figure 3b).

We have seen that player engagement increased dramatically when these types of notifications were added (this will be detailed in Section 4.4). As additional incentives, we also tested monetary rewards given to the best player of each day and investigated their impact: a 5 Euro online store gift card for Italian, and a 25 Turkish Lira online bookstore gift card for Turkish.

The game is designed as a Telegram bot to make use of Telegram's advanced features (e.g., multi-platform support), which allowed us to focus on the NLP back-end rather than building web-based or mobile versions of the game. The python-telegram-bot^m library is used to communicate with the Telegram servers and to implement the main messaging interface (a minimal skeleton is sketched below). A PostgreSQL database is used as the data back-end. The "Love Bird" Telegram sticker package has been used for the visualization of the selected persona, which can be changed according to the needs (e.g., with a local cultural character).

^m https://github.com/python-telegram-bot/python-telegram-bot/
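For readers unfamiliar with messaging-bot development, the skeleton below shows roughly how such a bot is wired with python-telegram-bot. It uses the library's current (v20+) async API, whereas the 2020-era Dodiom code targeted an earlier release; the handler bodies and messages are illustrative only:

```python
from telegram import Update
from telegram.ext import (Application, CommandHandler, ContextTypes,
                          MessageHandler, filters)

async def start(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Entry point of the pre-game tutorial (localized in the real game).
    await update.message.reply_text("Hi, I'm Dodo! Help me learn today's idiom.")

async def on_text(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # A free-text message is treated as a candidate usage example: run the
    # lemma check sketched earlier, then ask whether it is idiomatic or not.
    await update.message.reply_text("Thanks! Is your sentence idiomatic?")

def main() -> None:
    app = Application.builder().token("YOUR_BOT_TOKEN").build()
    app.add_handler(CommandHandler("start", start))
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, on_text))
    app.run_polling()   # long-poll the Telegram servers for updates

if __name__ == "__main__":
    main()
```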
For NLP-related tasks, NLTK (Loper and Bird 2002) is used for tokenization. Idioms are located in new submissions by tokenizing the submission and checking whether the lemma of each word matches that day's idiom constituents. If all idiom lemmas are found within the submission, the player is asked to choose whether the submission is an idiomatic or nonidiomatic sample. The positions of the lemmas determine the type of the submission within the system (i.e., one of the four types introduced in §3.2). NLTK is used for the lemmatization of English, Tint^o (Palmero Aprosio and Moretti 2016) for Italian, and Zeyrek^p for the lemmatization of Turkish^q.

The game is designed with localization in mind. The localization files are currently available in English, Italian, and Turkish. Adaptation to other languages requires: 1. translation of the localization files containing the game messages (currently 145 interaction messages in total), 2. a list of idioms, and 3. a lemmatizer for the target language (a sketch of such a per-language configuration is given below). We also foresee that some language-specific enhancements may be needed (such as the use of wildcard characters or words) in the definition of the idioms to be categorized under the different types. The game is deployed on Docker^r containers adjusted to the time zone of each country where the game is played.

^o A Stanza (Qi et al. 2020) based tool customized for the Italian language.
^p An NLTK-based lemmatizer customized for the Turkish language, https://zeyrek.readthedocs.io
^q Stanza was also tested for Turkish, but outputting only a single possible lemma for each word failed in many cases in this language.
^r https://docker.com/
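The three adaptation requirements can be captured in a small per-language configuration object; the sketch below is our illustration, and the container and field names are not Dodiom's actual API. The Turkish lemmatizer also shows why the interface returns a set of lemmas: Zeyrek can propose several analyses per surface form, which single-lemma tools miss (cf. footnote q):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LanguageConfig:
    code: str                              # e.g. "tr", "it", "en"
    messages: dict[str, str]               # the ~145 localized interaction messages
    idioms: list[list[str]]                # each idiom as its constituent lemmas
    lemmatize: Callable[[str], set[str]]   # surface token -> candidate lemmas

def turkish_config() -> LanguageConfig:
    import zeyrek   # the NLTK-based Turkish lemmatizer used by the game

    analyzer = zeyrek.MorphAnalyzer()

    def lemmatize(token: str) -> set[str]:
        # zeyrek returns [(word, [lemma, ...]), ...]; keep every candidate,
        # since Turkish surface forms are often morphologically ambiguous.
        return {lemma for _, lemmas in analyzer.lemmatize(token) for lemma in lemmas}

    return LanguageConfig(
        code="tr",
        messages={"greeting": "Merhaba! Ben Dodo."},        # from the tr locale file
        idioms=[["karşı", "çıkmak"], ["zaman", "öldürmek"]],
        lemmatize=lemmatize,
    )
```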
4. Analysis & Discussions
In accordance with DP, the game was deployed three times: first for preliminary testing with a limited number of users, and then for two consecutive 16-day periods open to the crowd, for Turkish and Italian separately. The first preliminary testing of the game was done in Turkish with nearly 10 people and yielded significant improvements in the game design. The Italian preliminary tests were conducted with around 100 people^s. The game was played between October 13 and December 17, 2020 for Turkish, and between November 8 and December 29, 2020 for Italian. From now on, the four latter periods (excluding the preliminary testing periods), for which we provide data analysis, will be referred to as TrP1 and TrP2 for Turkish, and ItP1 and ItP2 for Italian. While TrP1 and ItP1 are trials without monetary rewards, TrP2 and ItP2 are with monetary rewards.

The idioms to be played each day were selected by moderators according to their tendency to be used with their literal meaning. For ItP1 and ItP2, the selection procedure was random from an Italian idiom list, where in the latter period 4 idioms from ItP1 were replayed for comparison purposes. Similarly, for TrP2, the idioms were randomly selected from an online Turkish idiom list, again taking two idioms from TrP1 for comparison. For TrP1, the idioms were selected with the same selection strategy, but this time, instead of using an idiom list, the idioms from a previous annotation effort (the Parseme multilingual corpus of verbal multiword expressions; Savary et al. 2018; Ramisch et al. 2018) were listed according to their frequencies within the corpus and given to the moderators for the selection. Table 5 and Table 6, given in the Appendix, provide the idioms played each day together with daily submission and review statistics and some extra information to be detailed later.

Table 1: User statistics.

Statistic          Turkish   Italian
Total players      255       205
Played ≥ 7 days    24        19

^s Students of the third author and people contacted at the EU Researchers' Night in Italy.

For the actual play, the game was announced on LinkedIn and Twitter for both languages at the beginning of each play (viz., TrP1, TrP2, ItP1, ItP2). For Italian, announcements and daily posts were also shared via Facebook and Instagram. In total, there were
∼25K views for Turkish, and ∼12K views and ∼400 likes/reshares for Italian. As mentioned in the previous sections, players are informed from the very beginning that they are helping to create a public data source by playing this game. It should be noted that many people wanted to join this cooperative effort and shared the announcements from their own accounts, which improved the view counts. For both languages, the announcements of the second 16-day period, with monetary rewards, were also shared within the game itself. The Turkish crowd influencer (the first author of this article) is from the NLP and AI community, and the announcements mostly reached her NLP-focused network. On the other hand, the Italian crowd influencer (the last author of this article) is from the computational linguistics community, and the announcements mostly reached students and educators. There were 255 and 205 players who played the game in total over both periods for Turkish and Italian, respectively. Table 1 provides the detailed user statistics. As may be seen from this table, almost 10 per cent of the players played the game for more than 7 days. A survey was shared with the users at the end of TrP2 and ItP2. More than 10 per cent of the players filled in this survey.

Figure 4a shows the new player counts for each day. This graphic shows the users visiting the bot, whether or not they start playing. It can be seen that the player counts in the initial days are very high for almost all periods, due to the social media announcements. The new player counts in the following days are relatively low compared to the initial days, which is understandable. Still, it may be seen that the game continues to spread, except for ItP1. It is worth pointing out that the spread also applies to Turkish, although there were no daily announcements, contrary to Italian. Figure 4b provides the daily counts of players who either submitted or reviewed. It should be noted that the initial values in Figure 4a and Figure 4b differ from each other, since some players, although entering the game (contributing to the new player counts in Figure 4a), did not play it, or old players from previous periods continued to play the game. As Figure 4b shows, for TrP1, TrP2, and ItP2, there are more than 10 players playing the game each day (except the last day of TrP1). For ItP1, the number of daily players is under 10 for 9 days out of 16. Figure 4b shows a general decline in daily player counts for TrP1 and ItP1, whereas nearly 20 players played the game each day for TrP2 and ItP2.

Figure 4: Daily play statistics. (a) Daily new player count. (b) Daily player count.

The following constructs are selected for the analysis of the motivational and behavioral outcomes of the proposed gamification approach: system usage, engagement, loyalty, ease of use, enjoyment, attitude, motivation, and willingness to recommend (Morschheuser et al. 2017, 2019). These constructs are evaluated quantitatively and qualitatively via different operational means, i.e., survey results, bot usage statistics, and social media interactions.

During the four 16-day periods, we collected 5,978 submissions and 22,712 reviews for Turkish, and 6,728 submissions and 13,620 reviews for Italian in total. In this section, we present a data analysis by providing 1) submission and average review statistics in Figure 5, 2) daily review frequencies per submission in Figure 6, and 3) collected sample distributions in Figure 7, according to the sample categories introduced in §3.2.
The impact of the monetary reward can be observed in all figures, but the comparisons between periods with and without monetary reward are left to be discussed in §4.4 under the related constructs. In this section, although the analyses are provided for all four periods, the discussions are mostly carried out on TrP2 and ItP2, which yielded a more systematic data collection (see Figure 5a, daily submission counts).

Figure 5a shows that the soft target of 100 submissions per idiom is reached for both of the languages, most of the time by a large margin: 258 submissions on daily average for Turkish and 330 submissions for Italian. The average review counts are most of the time above 3 for Turkish idioms, with a mean of 3.7.

Figure 5: Daily statistics for submissions and reviews. (a) Daily submission counts. (b) Daily review count average per submission.

As an example, when we look in Figure 6d at the 3rd day of ItP2 (which received 803 samples with 0.8 reviews on average; Table 6), we may see that we still have more than 100 samples (specified with green colors) which received more than 2 reviews. On the other hand, the TrP2 results (Figure 6b) show that there are quite a lot of submissions that were reviewed by at least 3 people. Similarly, for TrP1 (Figure 6a) and ItP1 (Figure 6c), although the submission counts are lower, most of them were reviewed by at least 2 people.

The Appendix tables (Table 5 and Table 6) also provide the dislike percentages for each idiom in their last column. The daily averages are around 15.5 per cent. Two days (among them the 11th) in ItP2 were exceptional, and the dislike ratios were very high. On those days, there were players who entered very similar sentences with slight differences, and reviewers caught those and reported them. It was also found that these reported players repeatedly sent dislikes to other players' entries. The moderators had to ban them, and their submissions and reviews were excluded from the statistics. No such situation was encountered in TrP2, where the idiom with the highest dislike ratio appears on the 8th day, with 36 per cent. Although the data aggregation stage (Hung et al. 2013) is out of the scope of this study, it is worth mentioning that despite this ratio, we still obtained many fully liked examples (87 out of 374 submissions, liked by at least 2 people).

Figure 7 shows the type distributions (introduced in §3.2) of the collected samples. We see that the scoring, notifications, and tips used helped to achieve our goal of collecting samples of various types. The type ratios change from idiom to idiom according to their flexibility. When we investigate B-type samples for Italian, we observe pronouns, nouns, and mostly adverbs intervening between the idiom components. Italian B-type samples seem to be more prevalent than Turkish ones. This is due to the fact that possessiveness is generally represented with possessive suffixes in Turkish, so we do not see any B-type occurrences due to possessives. Possessive pronouns, if present, occur before the first component of the idiom within the sentence. For Turkish, we see that generally interrogative markers (enclitics), adverbs, and question words intervene between the idiom components. We see that some idioms can only take some specific question words, whereas others are more flexible.

As explained at the beginning, collecting samples with different morphological varieties was also one of our objectives, for the aforementioned reasons.
When we investigate the samples, we observe that the crowd produced morphologically quite diverse samples. For example, for the two Turkish idioms "karşı çıkmak" (to oppose) and "defterden silmek" (to forget someone; literally, to erase from the notebook), we observe 65 and 57 different surface forms in 167 and 97 idiomatic samples, respectively. For these idioms, the inflections were mostly on the verbs, but still, we observe that the first constituents of the idioms also appeared under different surface forms ("karşı" (opposite) in 4 different surface forms, inflected with different possessive suffixes and the dative case marker, and "defterden" (from the notebook) in 5 different surface forms, inflected with different possessive suffixes with no change on the ablative case marker). We also encounter some idioms where the first constituent only occurs under a single surface form (e.g., the idioms glossed as to stay strong or to be ready). The observations are in line with the initial expectations, and the data collected with the proposed gamification approach is undeniably valuable for building knowledge bases for idioms.

Figure 6: Daily review frequencies per submission. (a) TrP1. (b) TrP2. (c) ItP1. (d) ItP2.
Figure 7: Daily sample type distributions. (a) TrP1. (b) TrP2. (c) ItP1. (d) ItP2.

The Parseme multilingual corpus of verbal multiword expressions (Savary et al. 2018; Ramisch et al. 2018) contains 280K sentences from 20 different languages, including Italian and Turkish. As the name implies, this dataset includes many different types of verbal MWEs as well as verbal idioms; thus, it is very appropriate as the output of a classical annotation approach. During the preparation of this corpus, datasets retrieved mostly from newspaper text and Wikipedia articles were read (i.e., scanned) and annotated by human annotators according to well-defined guidelines. Since MWEs occur at random places in these texts, only the surrounding text fragments (longer than a single sentence) around the annotated MWEs were included in the corpus, instead of the entire scanned material (Savary et al. 2018). Due to the selected genres of the datasets, it is obvious that many idioms, especially the ones used in colloquial language, do not appear in this corpus. Additionally, annotations are error-prone, as stated in the previous sections. Table 2 provides the statistics for the Turkish and Italian parts of this corpus.

In order to compare the outputs of the classical annotation approach and the gamified construction approach, we select 4 idioms for each language (from Table 5 and Table 6) and manually check their annotations in the Parseme corpus. For Turkish, we select one idiom which is annotated the most ("yer almak" - to occur, 132 times) in the Parseme corpus, and one which appears very few times ("zaman öldürmek" - to waste time). For both languages, we then manually check the corpus sentences that contain the idioms' constituents but were left unannotated by the human annotators (i.e., false negatives (Fn)). As may be seen from Table 3, the counts of mistakenly omitted idiomatic samples (the last column) are quite high, although this dataset is reported to have been annotated by two independent research groups in two consecutive years: e.g., 16 idiomatic usage samples for the idiom "meydana gelmek" (to happen) were mistakenly omitted out of 25 unannotated sentences. Similar to the findings of Bontcheva et al. (2017) on named entity annotations, these results support our claim about the quality of the produced datasets when the crowd focuses on a single phenomenon at a time. Additionally, the proposed gamified approach (with a crowdrating mechanism) also provides multiple reviews on the crowdcreated dataset.

Table 3: Comparison with classical data annotation. (Id.: idiomatic samples; Nonid.: nonidiomatic samples; Rev.: average number of reviews per submission; Unann.: sentences containing the idiom's constituents left unannotated; Fn.: false negatives among them.)
                                      |     Dodiom          |     Parseme
Lang.    Idiom                        | Id.   Nonid.  Rev.  | Id.   Unann.  Fn.
Turkish  yer almak                    | 69    49      3.5   | 132   83      30
         meydana gelmek               | 103   92      3.6   | 29    25      16
         karşı çıkmak                 | 167   352     4.1   | 27    46      14
         zaman öldürmek               | 123   127     3.7   | 1     5       0
Italian  aprire gli occhi             | 143   132     1.0   | 2     0       0
         prendere con le pinze        | 409   394     0.8   | 1     0       0
         essere tra i piedi           | 152   32      2.2   | 0     3       0
         mandare a casa               | 102   137     2.4   | 2     2       0
When the idiomatic annotations in Parseme are investigated, it is seen that they are almost all A-type samples; B-type samples very rarely appear within the corpus, which could be another side effect of the selected input text genres.
In this section, we provide our analysis of the motivational and behavioral outcomes of the proposed gamification approach for idiom corpora construction. The survey results (provided in Table 4), bot usage statistics (provided in §4.2), and social media interactions are used in the evaluations. The investigated constructs are system usage, engagement, loyalty, ease of use, enjoyment, attitude, motivation, and willingness to recommend.

Table 4: Survey constructs, questions & results. Answer types: 5-point Likert scale (5PLS), predefined answer list (PL), PL including the "other" option with a free-text area (PLwO). The last two columns give the response counts per answer option for Turkish and Italian, respectively.

Q   Construct     Question (answer options)                                          Type       Turkish       Italian
1   demographic   What is your educational background?
                  {from 1: primary school to 5: PhD}                                 PL         0 0 9 12 4    0 2 6 20 3
2   demographic   What field do you work in?
                  {education, AI, computer tech., other}                             PLwO       0 12 8 5      6 2 2 21
3   demographic   How old are you? {<18, 18-25, 25-30, >30}                          PL         0 9 9 7       0 14 12 5
4   demographic   How did you hear about Dodiom?
                  {Linkedin, Twitter, a friend, other}                               PLwO       7 4 10 4      6 7 9 9
5   attitude      What's your opinion about Dodiom?                                  5PLS       0 0 0 10 15   0 2 1 15 12
6   motivation    Why did you play Dodiom, what was the main motivation for
                  you to play? {help Dodo, daily achievements, fun,
                  help NLP studies, other}                                           PLwO       0 4 0 20 1    1 4 4 21 1
7   motivation    The gift certificate was an important motivation for me to
                  play the game.                                                     5PLS       4 2 7 4 8     6 1 6 8 10
8   enjoyment     The leaderboard and racing components made the game more fun.      5PLS       0 1 2 5 17    3 2 5 7 14
9   engagement    Dodo's messages about my place in the rankings increased my
                  participation in the game.                                         5PLS       1 0 4 9 11    4 3 3 8 13
10  attitude      I liked the interface of the game and the ease of play; it
                  kept me playing the game.                                          5PLS       0 1 0 5 19    0 0 9 10 12
11  ease of use   I was able to learn the gameplay of the game without much
                  effort.                                                            5PLS       0 0 1 2 22    0 0 1 7 23
12  engagement    The frequency of Dodo's notifications was not disturbing.          5PLS       4 2 8 3 8     4 4 10 7 6
13  enjoyment     The theme and gameplay were fun; I enjoyed playing.                5PLS       0 0 1 8 16    0 1 4 11 15
14  loyalty       Dodo will take a break from learning soon. Do you want to
                  continue helping when it starts again? {yes, no, other}            PLwO       24 0 1        28 2 1
15  attitude      Which aspect of the game did you like the most?                    free-text  -             -
17  attitude      Was there anything you didn't like in the game, and if so,
                  what?                                                              free-text  -             -
18  loyalty       How many days did you play Dodiom? {1, 2-3, <7, >7}                PL         2 2 4 16      7 10 7 7
19  loyalty       How many samples did you send to Dodiom per day on average?
                  {<10, 10-20, >20}                                                  PL         3 7 6 9       12 6 7 5
20  -             Can you share any suggestions about the game?                      free-text  -             -
Table 4 summarizes the survey results in terms of response counts, provided in the last two columns for the Turkish and Italian games, respectively. In questions with 5-point Likert scale answers, the options go from 1: strongly disagree or disliked, to 5: strongly agree or liked. The first 4 questions of the survey are related to demographic information. The answers to question 2 (Q2 of Table 4) reveal that the respondents for the Turkish play are mostly AI and computer technology related people (21 out of 25 participants selected the related options, and 2 stated NLP under the other option), whereas for the Italian play they are from different backgrounds; 21 people out of 31 selected the other option, where only 2 of them stated NLP and computational linguistics, and the others gave answers like translation, student, administration, tourism, and sales. The difference between crowd types seems to also affect their behavior. In TrP2, we observe that the review ratios are higher than in ItP2, as stated in the previous section. On the other hand, ItP2 participants made more submissions. There were more young people in the Italian plays (Q3) than in the Turkish plays. This situation may be attributed to their eagerness to earn more points. We had many free-text comments (to be discussed below) related to the low scoring of the review process from both communities.

The overall system usage of the participants is provided in §4.2. Figure 4b and Figure 5 show player counts and their play rates. Although only 50 per cent of the answers to survey Q7, about the gift card motivation, say agree (4) or strongly agree (5), the graphics mentioned above reveal that the periods with additional incentives (i.e., gift card rewards; TrP2, ItP2) are more successful at fulfilling the expectations about loyalty than the periods without (TrP1, ItP1). Again related to the loyalty construct (Q18 and Q19), we see that more than half of the Turkish survey participants had been playing the game for more than 1 week at the time of filling out the survey (which was open for the last three days of TrP2) and were providing more than 10 samples each day. Since the Italian survey was open for a longer period of time (see Table 1), we see a more diverse distribution of the answers. Most of the participants also stated that they would like to continue playing the game (Q14).

A very high number of participants (20 out of 25, and 21 out of 31) stated that their motivation to play the game was to help NLP studies. 4 of them answered Q15 as: "-I felt that I'm helping a study.", "-The scientific goal", "-The ultimate aim", "-I liked it being the first application designed to provide input for Turkish NLP as far as I know. Apart from that, we are forcing our brains while entering in a sweet competition with the friends working in the field and contributing at the same time."
We see that the gamification elements and the additional incentive helped the players stay in the game with this motivation (Q8, Q13, enjoyment). In TrP2, we also observed that some game winners shared their achievements on social media (willingness to recommend) and found each other through the same channel. Setting goals more moral than monetary rewards, they combined the distributed bookstore gift cards and used them to send book gifts to poor children. Around 800 social media likes and shares were made in total (for both languages). More than half of the respondents chose the answer "from a friend" or "other" for Q4 ("How did you hear about Dodiom?") instead of the first two options covering Linkedin and Twitter. The "other" answers covered either the names of the influencers, or Facebook and Instagram for Italian. We may say that the spread of the game (introduced in §4.1) is not due to the social media influences alone, but that people let each other know about it, which could also be seen as an impact of their willingness to recommend the game.

Almost all of the users found the game easy to understand and play (Q11, ease of use). Almost all of them liked the game; only 3 out of 31 Italian participants scored under 4 (liked) on Q5 (attitude), and 9 of them scored neutral (3) on Q10. Only 1 Turkish participant was negative on this latter question about the interface. When we analyze the free-text answers to Q15, we see that 8 people out of 56 stated that they liked the game elements. Some answers are as follows: "-I loved Dodo gifts.", "-Gamification and story was good", "-it triggers a sense of competition", "-The icon of the application is very sympathetic, the messages are remarkable.", "-I liked the competition content and the ranking", "-gift voucher", "-interaction with other players". Three participants stated that they liked the game being a bot, with no need to download an application. Three of them mentioned that they liked playing the game: "-I liked that the game increases my creativeness. It makes me think. I'm having fun myself.", "-To see original sentences", "-... Besides these, a fun opportunity for mental gymnastics.", "-Learn new idioms", "-Linguistic aspect". 8 participants stated that they liked the uniqueness of the idea: "-The originality of the idea", "-the creativity", "-Efficiency and immediacy", "-The chosen procedure", "-the idea is very useful for increasing the resources for the identification of idiomatic expressions.", "-The idea of being interacting with someone else", "-Undoubtedly, as a Ph.D. student in the field of NLP, I liked that it reduces the difficulty of labeling data, makes it fun, and is capable of enabling other people to contribute whether they are from the field or not".

More than half of the participants were okay with the frequency of Dodo's instant messages, and most of them agreed about their usefulness in keeping them in the game (Q9 and Q12). 4 people out of the 56 participants in total complained about the frequency of the messages in their answers to Q16 ("-Slightly frequent notifications", "-Notifications can be sent less often.", "-Too many notifications").
As opposed to this, one participant said, "It's nice that when you put it aside, the reminders and notifications that encourage you to earn more points make me re-enter words during the day" as an answer to Q15.

Other answers to Q16 are as follows: "-I don't think it should allow the possibility of repeating the same sentences.", "-It can get repetitive, a mixed mode where automatically alternating between suggestions and evaluations with multiple expressions per day would have been more engaging", "-Error occurrence during voting has increased recently. Maybe it could be related to increased participation. However, there is no very critical issue.", "-Sometimes it froze". Regarding the last two comments, we have stated in the previous sections the need for optimization towards the end of the play with the increased workload, and the action taken. On the other hand, the first two comments are also very good indicators for future directions.

For Q19, we received 3 suggestions for the scoring system, 1 suggestion for automatic spelling correction, 2 suggestions for detailing dislikes, and 1 suggestion on the need to cancel/change an erroneous submission or review. Obviously, the users wanted to warn about spelling mistakes in the input sentences but hesitated to send a dislike because of them. That is why they suggested differentiating dislikes according to their reasons. The suggestions for scoring are as follows: "-More points can be given to the reviews", "-The low score for reviews causes the reviewing to lose importance, especially for those who play for leadership. Because while voting, you both get fewer points and, in a sense, you make your opponents earn points.", "-I would suggest that the score was assigned differently, that is, that the 10/15 points can be obtained when sending a suggestion (and not when others evaluate it positively). In this way, those who evaluate will have more incentives to positively evaluate the suggestions of others (without the fear of giving more points to others) (thus giving a point to those who evaluate and one to those who have been evaluated)". We see that in the last two comments, the players are afraid of making other players earn points.

As explained in the game design section above, reviews are worth 1 point, and sometimes 2 during happy hours triggered by the moderators to attract the attention of the players. Although open for discussion and changes in future trials, in the original design we did not want to give high points to reviews, since we believe that people should review with the responsibility of a cooperative effort to create a public dataset. Giving very high scores to reviews may lead to unexpected results. Other scenarios, together with cheating mechanisms (such as the detection of consecutive rapid likes/dislikes), may be considered in future work. As stated before, we added reporting and banning mechanisms to control cheating/gaming the system, in line with DP.

Figure 8: Histogram of interaction times in TrP2 and ItP2.
5. Conclusion
Idiom corpora are valuable resources for foreign language learning, natural language processing, and lexicographic studies. Unfortunately, they are rare and hard to construct. For the first time in the literature, this article introduced a gamified approach that uses crowdcreating and crowdrating techniques to speed up idiom corpora construction for different languages. The approach has been evaluated under different motivational strategies on two languages, producing the first idiom corpora for Turkish and Italian. The implementation, developed as a Telegram messaging bot, and the data collected for the two languages over a time span of 30 days are shared with researchers. Our detailed qualitative and quantitative analyses revealed that the outcomes of the research are appreciated by the crowd and found useful and enjoyable, and that the approach yielded the collection and assessment of valuable samples illustrating different ways of use, which is not easily achievable with traditional data annotation techniques. In addition to the gamification affordances, gift cards were found to be very effective in incentivizing the users to continue playing the game.
Our first short-term goal is to extend and play the game for languages other than the ones in this article, especially for languages with few lexical resources. We hope that the game, introduced as an open-source project, will speed up the development of idiom corpora and the research in the field.
Acknowledgments
The authors would like to offer their special thanks to Cihat Eryiğit for the discussions during the initial design of the game; to Fatih Bektaş, Branislava Šandrih, Josip Mihaljević, Martin Benjamin, Daler Rahimjonov, and Doruk Eryiğit for fruitful discussions during its implementation; to Federico Sangati for providing the code of a Telegram bot example (Plagio, which is used in another game (Araneta et al. 2020) for foreign language learners practicing phrasal verbs); to Martin Benjamin for helping with the English localization messages; to Inge Maria Cipriano for discussions concerning the game interactions and possible deceptive behaviours by players; and to all the anonymous volunteer players. The study was proposed as a task by the first author and took place at a crowdfest event (February 2020, Coimbra, Portugal) of the EU COST Action (CA16105) enetCollect, where the prototype was introduced and discussed with stakeholders. The authors would like to thank enetCollect for this opportunity, which initiated new collaborations and ideas.
References
Akkaya, C., Conrad, A., Wiebe, J., and Mihalcea, R. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 195–203.
Araneta, M. G., Eryiğit, G., König, A., Lee, J.-U., Luís, A., Lyding, V., Nicolas, L., Rodosthenous, C., and Sangati, F. Proceedings of the 9th NLP4CALL, 1.
Artignan, G., Hascoët, M., and Lafourcade, M. pp. 685–690. IEEE.
Birke, J. and Sarkar, A. 1.
Bontcheva, K., Derczynski, L., and Roberts, I. Handbook of Linguistic Annotation, pp. 875–892. Springer.
Caruso, V., Barbara, B., Monti, J., and Roberta, P. ELEX 2019: Smart Lexicography, pp. 374–396. Lexical Computing CZ s.r.o.
Chamberlain, J., Poesio, M., and Kruschwitz, U. Semantics in Text Processing. STEP 2008 Conference Proceedings, pp. 375–380.
Chklovski, T. Proceedings of the 3rd International Conference on Knowledge Capture, pp. 115–120.
Constant, M., Eryiğit, G., Monti, J., van der Plas, L., Ramisch, C., Rosner, M., and Todirascu, A. Computational Linguistics, 43(4):837–892.
Cook, P., Fazly, A., and Stevenson, S. Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pp. 19–22.
Dumitrache, A., Aroyo, L., Welty, C., Sips, R.-J., and Levas, A. Proceedings of the 1st International Conference on Crowdsourcing the Semantic Web, volume 1030, 1.
Fort, K., Adda, G., and Cohen, K. B. Computational Linguistics, 37(2):413–420.
Fort, K., Guillaume, B., Constant, M., Lefebvre, N., and Pilatte, Y.-A. Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pp. 207–213.
Fort, K., Guillaume, B., Pilatte, Y.-A., Constant, M., and Lefèbvre, N. Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pp. 4395–4401, Marseille, France. European Language Resources Association.
Geiger, D. and Schader, M. Decision Support Systems, 65:3–16. Crowdsourcing and Social Networks Analysis.
Hashimoto, C. and Kawahara, D. Language Resources and Evaluation, 43(4):355.
Howe, J. Wired Magazine, 14(6):1–4.
Hung, N. Q. V., Tam, N. T., Tran, L. N., and Aberer, K. International Conference on Web Information Systems Engineering, pp. 1–15. Springer.
Kaschak, M. P. and Saffran, J. R. Cognitive Science, 30(1):43–63.
Kato, A., Shindo, H., and Matsumoto, Y. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Klock, A. C. T., Gasparini, I., Pimenta, M. S., and Hamari, J. International Journal of Human-Computer Studies, 144:102495.
Konopka, A. E. and Bock, K. Cognitive Psychology, 58(1):68–101.
Lawson, N., Eustice, K., Perkowitz, M., and Yetisgen-Yildiz, M. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 71–79.
Loper, E. and Bird, S. arXiv preprint cs/0205028.
Losnegaard, G. S., Sangati, F., Escartín, C. P., Savary, A., Bargmann, S., and Monti, J. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 2299–2306, Portorož, Slovenia. European Language Resources Association (ELRA).
Mitrović, J. INFOtheca - Journal of Informatics & Librarianship, 14(1).
Morschheuser, B. and Hamari, J. Journal of Management Inquiry, 28(2):145–148.
Morschheuser, B., Hamari, J., Koivisto, J., and Maedche, A. International Journal of Human-Computer Studies, 106:26–43.
Morschheuser, B., Hamari, J., and Maedche, A. International Journal of Human-Computer Studies, 127:7–24. Strengthening gamification studies: critical challenges and new opportunities.
Morschheuser, B., Hassan, L., Werder, K., and Hamari, J. Information and Software Technology, 95:219–237.
Murillo-Zamorano, L. R., Ángel López Sánchez, J., and Bueno Muñoz, C. Thinking Skills and Creativity, 36:100645.
Palmero Aprosio, A. and Moretti, G. ArXiv e-prints.
Prpić, J., Shukla, P. P., Kietzmann, J. H., and McCarthy, I. P. Business Horizons, 58(1):77–85.
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
Ramisch, C., Cordeiro, S., Savary, A., Vincze, V., Mititelu, V., Bhatia, A., Buljan, M., Candito, M., Gantar, P., Giouli, V., et al.
Rumshisky, A., Botchan, N., Kushkuley, S., and Pustejovsky, J. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 4055–4059, Istanbul, Turkey. European Language Resources Association (ELRA).
Savary, A., Candito, M., Mititelu, V. B., Bejček, E., Cap, F., Čéplö, S., Cordeiro, S. R., Eryiğit, G., Giouli, V., van Gompel, M., HaCohen-Kerner, Y., Kovalevskaitė, J., Krek, S., Liebeskind, C., Monti, J., Escartín, C. P., van der Plas, L., QasemiZadeh, B., Ramisch, C., Sangati, F., Stoyanova, I., and Vincze, V. In Markantonatou, S., Ramisch, C., Savary, A., and Vincze, V., editors, Multiword Expressions at Length and in Depth: Extended Papers from the MWE 2017 Workshop, pp. 87–147. Language Science Press, Berlin.
Schneider, N., Onuffer, S., Kazour, N., Danchik, E., Mordowanec, M. T., Conrad, H., and Smith, N. A. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 455–461, Reykjavik, Iceland. European Language Resources Association (ELRA).
Siyanova-Chanturia, A. Language Teaching Research, 21(3):289–297.
Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Proceedings of the 2008 Conference on EMNLP, pp. 254–263.
Sprenger, S. A., Levelt, W. J., and Kempen, G. Journal of Memory and Language, 54(2):161–184.
Vasiljevic, Z. Mextesol Journal, 39(4):1–24.
Vincze, V., Nagy, I., and Berend, G. Proceedings of RANLP 2011, pp. 289–295, Hissar.
Von Ahn, L. Computer, 39(6):92–94.
Von Ahn, L. and Dabbish, L. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 319–326.
Von Ahn, L., Kedia, M., and Blum, M. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 75–78.

Appendix

As stated in the introduction, deducing well-defined rules to express an idiom is usually a challenging task. Below we provide an example idiom from a language other than English, with an explanation of its meaning and usage patterns in English (for a second language learner who speaks English). The example is a Turkish idiom, presented as "(birinin) başının etini yemek" in the dictionaries of this language.

Example idiom: ⟨eat someone's head's meat⟩, meaning "annoying someone by talking too much", as in "to nag at".

Rule 1: someone's may be replaced with one of the possessive pronouns (e.g., my, your, his) or any noun taking a possessive suffix -'s (i.e., the genitive suffix in the target language).
Rule 2: someone's may be omitted, since the target language is a pro-drop language and the word head also takes possessive suffixes, which also carry the person agreement information; thus someone is pragmatically or grammatically inferable.
Rule 3: eat may be inflected.
Rule 4: reflexive usage of someone ("eating one's own head's meat") is generally not welcome.

As one may notice, although it could be possible to define such rules, they are both hard to deduce (e.g., for teachers or lexicographers) and hard to understand (for language learners, whether humans or computers). Language learners will still need usage examples both to understand the usage patterns and to practice. Additionally, to be able to define such rules, even teachers or lexicographers should investigate many usage samples or come up with new ones.

Table 5: Idioms of TrP1 (first 16 rows) & TrP2 (last 16 rows), with columns Day, Idiom, Literal Meaning, Idiomatic Meaning, id. (idiomatic samples), nonid. (nonidiomatic samples), and d (dislikes).
Table 6: Idioms of ItP1 (first 16 rows) & ItP2 (last 16 rows), with columns Day, Idiom, Literal Meaning, Idiomatic Meaning, id. (idiomatic samples), nonid. (nonidiomatic samples), and d (dislikes).
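To illustrate how brittle such hand-written rules become when encoded directly, the following minimal sketch (our own illustration, not part of the released game; the pattern strings and the function name are invented) tries to match surface forms of the appendix example with simple regular expressions instead of a proper morphological analyzer:

```python
import re

# Toy surface-pattern encoding of the appendix rules for the Turkish idiom
# "(birinin) başının etini yemek" ("to nag at someone"). A real identification
# system would use a morphological analyzer and annotated corpus examples.

# Rules 1-2: "baş" (head) carries possessive + genitive suffixes encoding the
# (possibly dropped) possessor, e.g. başımın, başının, başınızın.
HEAD_GEN = r"baş\w*ın"
# Rule 3: "yemek" (to eat) may be inflected: yedi, yer, yiyordu, yeme, ...
EAT = r"y[ei]\w*"

IDIOM = re.compile(HEAD_GEN + r"\s+etini\s+" + EAT)

def contains_idiom_surface_form(sentence: str) -> bool:
    """Return True if a surface form of the idiom occurs (lowercase input
    assumed; Turkish-aware case folding is deliberately ignored here)."""
    return IDIOM.search(sentence.lower()) is not None

print(contains_idiom_surface_form("çocuk annesinin başının etini yedi"))  # True
print(contains_idiom_surface_form("akşama başında et yemek var"))         # False
```

Even this toy matcher only finds surface forms: it cannot distinguish the idiomatic use from a literal sentence about actually eating meat, which is exactly why annotated usage examples such as those collected by the game are needed.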