Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017
Stefan Heindorf, Martin Potthast, Gregor Engels, Benno Stein
Paderborn University
ABSTRACT
We report on the Wikidata vandalism detection task at the WSDM Cup 2017. The task received five submissions for which this paper describes their evaluation and a comparison to state-of-the-art baselines. Unlike previous work, we recast Wikidata vandalism detection as an online learning problem, requiring participant software to predict vandalism in near real-time. The best-performing approach achieves a ROC AUC of 0.947 at a PR AUC of 0.458. In particular, this task was organized as a software submission task: to maximize reproducibility as well as to foster future research and development on this task, the participants were asked to submit their working software to the TIRA experimentation platform along with the source code for open source release.
1. INTRODUCTION
Knowledge is increasingly gathered by the crowd. One of the most prominent examples in this regard is Wikidata, the knowledge base of the Wikimedia Foundation. Wikidata stores knowledge (better: facts) in structured form as subject-predicate-object statements that can be edited by anyone. Most of the volunteers' contributions to Wikidata are of high quality; however, there are, just like in Wikipedia, some "editors" who vandalize and damage the knowledge base. The impact of these few can be severe: since Wikidata is, to an increasing extent, integrated into information systems such as search engines and question-answering systems, the risk of spreading false information to all their users increases as well. It is obvious that this threat cannot be countered by human inspection alone: currently, Wikidata gets millions of contributions every month; the effort of reviewing them manually will exceed the resources of the community, especially as Wikidata further grows.

Encouraged by the success of algorithmic vandalism detection on Wikipedia, we started a comparable endeavor for Wikidata two years ago: we carefully compiled a corpus based on Wikidata's revision history [6] and went on by developing the first machine learning-based Wikidata vandalism detector [7]. Compared with the quality of the best vandalism detectors for Wikipedia, our results may be considered a first step toward a practical solution.

We are working on new detection approaches ourselves, but we see that progress can be made at a much faster pace if independent researchers work in parallel, generating a diversity of ideas. While research communities often form around problems of interest, this has not been the case for vandalism detection in knowledge bases, perhaps due to the novelty of the task. We hence took a proactive stance by organizing a shared task event as part of the WSDM Cup 2017 [8]. Shared tasks have proven successful as catalysts for forming communities on a number of occasions before, in particular for vandalism detection on Wikipedia: on the basis of two shared tasks, considerable interest from researchers worldwide resulted in dozens of approaches to date [14, 16].

The goal of our shared task at the WSDM Cup 2017 is to develop an effective vandalism detection model for Wikidata: given a Wikidata revision, the task is to compute a quality score denoting the likelihood of this revision being vandalism (or, similarly, damaging). The revisions had to be scored in near real-time as soon as they arrived, allowing for immediate action upon potential vandalism. Moreover, a model (detector) should hint at vandalism across a wide range of precision-recall points to enable use cases such as (1) fully automatic reversion of damaging edits at high precision, as well as (2) pre-filtering and ranking of revisions with respect to the importance of being reviewed at high recall. As our main evaluation metric we employ the area under the curve of the receiver operating characteristic.

Our contributions to the WSDM Cup 2017 are as follows:
• Compilation of the Wikidata Vandalism Corpus 2016, an updated version of our previous corpus, enlarged and retrofitted for the setup of the shared task.
• Survey of the participant approaches with regard to features and model variants.
• Comparison of the participant approaches to the state of the art under a number of settings beyond the main task.
• Analysis of the combined performance of all models (detectors) as an ensemble in order to estimate the achievable performance when integrating all approaches.
• Release of an open source repository of the entire evaluation framework of the shared task, as well as release of most of the participants' code bases by courtesy of the participants.

In what follows, Section 2 briefly reviews the aforementioned related work, Section 3 introduces the developed evaluation framework including the new edition of the Wikidata vandalism corpus, Section 4 surveys the submitted approaches, and Section 5 reports on their evaluation. Finally, Section 6 reflects on the lessons learned.
2. RELATED WORK
This section gives a comprehensive overview of the literature on vandalism detection. The two subsections detail the approaches regarding dataset construction and detection technology, respectively.
There are basically three strategies to construct vandalism corpora for wiki-style projects [9], namely, (1) based on independent manual review of edits, (2) based on exploiting community feedback about edits, and (3) based on comparing item states. As expected, there is a trade-off between corpus size, label quality, and annotation costs. Below, we review state-of-the-art approaches for constructing vandalism corpora under each strategy.

Annotation Based on Independent Manual Review.
The most reliable approach to construct a vandalism corpus is to manually review and annotate its edits. When facing millions of edits, however, the costs for a manual review become prohibitive, thus severely limiting corpus size. The largest manually annotated vandalism corpora to date are the PAN Wikipedia Vandalism Corpora 2010 and 2011, comprising a sample of 30,000 edits, each of which has been manually annotated via crowdsourcing using Amazon's Mechanical Turk [13]. About 7% of the edits have been found to be vandalism. This approach, however, is probably not suited to Wikidata: an average worker on Mechanical Turk is much less familiar with Wikidata, and the expected ratio of vandalism in a random sample of Wikidata edits is about 0.2% (compared with 7% in Wikipedia), so that a significantly higher number of edits would have to be reviewed in order to obtain a sensible number of vandalism cases for training a model.
Annotation Based on Community Feedback.
A more scalable approach to construct a vandalism corpus is to rely on feedback about edits provided by the community for annotations. However, not all edits made to Wikidata are currently reviewed by the community, thus limiting the recall in a sample of edits to the amount of vandalism that is actually caught, and not all edits that are rolled back are true vandalism. Nevertheless, for its simplicity, this approach was adopted to construct the Wikidata Vandalism Corpus (WDVC) 2015 [6] and 2016, the latter of which was employed as evaluation corpus at the WSDM Cup 2017. Both corpus versions are freely available for download. The corpus construction is straightforward: based on the portion of the Wikidata dump with manually created revisions, those revisions that have been reverted via Wikidata's rollback facility are labeled vandalism. The rollback facility is a special instrument to revert vandalism; it is accessible to privileged Wikidata editors only. This makes our corpus robust against manipulation by non-privileged and, in particular, anonymous Wikidata editors. As a result, we obtain a large-scale vandalism corpus comprising more than 82 million manual revisions that were made between October 2012 and June 2016. About 200,000 revisions have been rolled back in this time (and hence are labeled vandalism). By a manual analysis we got evidence that 86% of the revisions labeled vandalism are indeed vandalism as per Wikidata's definition [6]. Recently, vandalism corpora for Wikipedia have also been constructed based on community feedback. Tran and Christen [21, 22] and Tran et al. [23] label all revisions with certain keywords in the revision comment as vandalism, e.g., 'vandal' or 'rvv' (revert due to vandalism), based on Kittur et al.'s [10] approach to identify vandalism. These keywords do not work for Wikidata, since revision comments are almost always automatically generated and cannot be changed by editors.
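To illustrate the community-feedback labeling described above, the following is a minimal sketch in Python; the revision table, its column names, and the values are hypothetical stand-ins, not the actual WDVC-2016 schema.

    import pandas as pd

    # Hypothetical input: one row per manual (non-bot) revision, with a flag
    # indicating whether the revision was later undone via the rollback facility.
    revisions = pd.DataFrame({
        "revision_id": [101, 102, 103, 104],
        "item_id":     ["Q64", "Q64", "Q42", "Q42"],
        "user":        ["127.0.0.1", "Alice", "Bob", "Carol"],
        "rolled_back": [True, False, False, True],
    })

    # Community-feedback labeling: a revision counts as vandalism
    # iff it was reverted via the (privileged) rollback facility.
    revisions["vandalism"] = revisions["rolled_back"]
    print(revisions[["revision_id", "vandalism"]])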
Annotation Based on Item State.
An alternative (and still scalable) approach to build a vandalism corpus is to analyze recurring item states in order to identify so-called item "reverts" to previous states in the respective item history. In addition to community feedback, this approach also considers all other events that may have caused an item state to reappear, e.g., in case someone just fixes an error without noticing that the error was due to vandalism. As a consequence, a higher recall can be achieved whereas, however, a lower precision must be expected. Sarabadani et al. [19] adopted this approach, and, in order to increase precision, they suggested a set of rules to annotate an edit as vandalism if and only if (1) it is a manual edit, (2) it has been reverted, (3) it does not originate from a pre-defined privileged editor group (these groups comprise sysop, checkuser, flood, ipblock-exempt, oversight, property-creator, rollbacker, steward, translationadmin, and wikidata-staff), (4) it has not been propagated from a dependent Wikipedia project, and (5) it does not merge two Wikidata items. For unknown reasons, not all edits of Wikidata's history are annotated this way but only a subset of 500,000 in 2015, yielding altogether only about 20,000 vandalism edits. While the authors claim superiority of the design of their corpus over ours, their self-reported precision values are not convincing: while only 68% of the edits labeled vandalism are in fact vandalism (86% in our corpus), 92% of the edits are reported to be at least "damaging" to a greater or lesser extent. The authors have reviewed only 100 edits to substantiate these numbers (we have reviewed 1,000 edits), so that these numbers must be taken with a grain of salt.

Altogether, both corpora are suboptimal with regard to recall: within both corpora, about 1% of the edits are wrongly labeled non-vandalism, which currently amounts to an estimated 800,000 missed vandalism edits over Wikidata's entire history. A machine learning approach to vandalism detection must hence be especially robust against false negatives in the training dataset.

Tan et al. [20] compiled a dataset of low-quality triples in Freebase according to the following heuristics: additions that have been deleted within 4 weeks after their submission are considered low-quality, as are removals that have not been reinserted within 4 weeks. The manual investigation of 200 triples revealed a precision of about 89%. However, the usefulness of the Freebase dataset is restricted by the fact that Google has shut down Freebase; the Freebase data is currently being transferred to Wikidata [12].

The detection of vandalism and damaging edits in structured knowledge bases such as triple stores is a new research area. Hence, only three approaches have been published before the WSDM Cup 2017, which represent the state of the art [7, 19, 20]. All employ machine learning, using features derived from both an edit's content and its context. In what follows, we briefly review them.

The most effective approach to Wikidata vandalism detection, WDVD, was proposed by Heindorf et al. [7]. It implements 47 features, of which 27 encode an edit's content and 20 an edit's context. The content-based features cover character-level, word-level, and sentence-level features, all computed from the automatically generated revision comment. In addition, WDVD employs features to capture predicates and objects of subject-predicate-object triples. Context-based features include user reputation, user geolocation, item reputation, and the meta data of an edit. As classification algorithm, random forests along with multiple-instance learning are employed. Multiple-instance learning is applied to consecutive edits by the same user on the same item, so-called editing sessions. Typically, the edits of a session are closely related, so that multiple-instance learning has a significant positive effect on the classification performance.

The second-most effective approach has been deployed within Wikimedia's Objective Revision Evaluation Service, ORES [19]. ORES operationalizes 14 features (see Table 3), most of which were introduced with WDVD, since the WDVD developers shared a detailed feature list with Wikimedia. Meanwhile, certain features have been discarded from WDVD due to overfitting but are still found in the ORES system. Altogether, the effectiveness reported by Sarabadani et al. is significantly worse compared with WDVD [7].

Tan et al. [20] developed a classifier to detect low-quality contributions to Freebase. The only content-based features are the predicates of subject-predicate-object triples, which are used to predict low-quality contributions. Regarding context-based features, the developers employ user history features including the number of past edits (total, correct, incorrect) and the age of the user account. Also, user expertise is captured by representing users in a topic space based on their past contributions: a new contribution is mapped into the topic space and compared to the user's past revisions using the dot product, the cosine similarity, and the Jaccard index. None of the existing approaches has evaluated user expertise features for Wikidata so far.
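The user-expertise features of Tan et al. can be illustrated with a small sketch; the topic vectors below are toy values, and the exact aggregation used by Tan et al. may differ.

    import numpy as np

    def expertise_features(user_history_vec, contribution_vec):
        """Compare a new contribution (mapped into a topic space) with the
        user's past contributions aggregated in the same space."""
        dot = float(user_history_vec @ contribution_vec)
        norm = np.linalg.norm(user_history_vec) * np.linalg.norm(contribution_vec)
        cosine = dot / norm if norm > 0 else 0.0
        # Jaccard index over the sets of topics touched at all.
        a = set(np.flatnonzero(user_history_vec))
        b = set(np.flatnonzero(contribution_vec))
        jaccard = len(a & b) / len(a | b) if (a | b) else 0.0
        return dot, cosine, jaccard

    # Toy topic space with three topics.
    print(expertise_features(np.array([3.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])))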
The three approaches above build upon previous work on Wikipedia vandalism detection, where the first machine learning-based approach for this task was proposed by Potthast et al. [15]. It was based on 16 features, primarily focusing on content, detecting nonsense character sequences and vulgarity. In subsequent work, and as a result of two shared task competitions at PAN 2010 and PAN 2011 [14, 16], the original feature set was substantially extended; Adler et al. [2] integrated many of them. From the large set of approaches that have been proposed since then, those of Wang and McKeown [24] and Ramaswamy et al. [18] stick out: they use search engines to check the correctness of Wikipedia edits, achieving a better performance than previous approaches. On the downside, their approaches have only been evaluated on a small dataset and cannot easily be scaled up to hundreds of millions of edits. To improve the context-based analysis, it was proposed to measure user reputation [1] as well as spatio-temporal user information [25]. Again, Kumar et al. [11] stick out since they do not try to detect damaging edits but vandalizing users via their behavior. Many of these features have been transferred to Wikidata vandalism detection; however, more work will be necessary to achieve detection performance comparable to Wikipedia vandalism detectors.
3. EVALUATION FRAMEWORK
This section introduces the knowledge base Wikidata in brief, the Wikidata vandalism corpus derived from it, the evaluation platform, the employed performance measures, as well as the baselines.

Wikidata is organized around items. Each item describes a coherent concept from the real world, such as a person, a city, an event, etc. An item in turn can be divided into an item head and an item body. The item head consists of human-readable labels, descriptions, and aliases, provided for up to 375 supported language codes. The item body consists of structured statements, such as the date of birth of a person, as well as sitelinks to Wikipedia pages that cover the same topic as the item. Each time a user edits an item, a new revision is created within the item's revision history. We refer to consecutive revisions from the same user on the same item as an "editing session".
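The following is a simplified, illustrative view of this item structure (hypothetical values, not the actual Wikidata JSON schema):

    # Simplified sketch of a Wikidata item, split into the head (labels,
    # descriptions, aliases) and the body (statements and sitelinks).
    item = {
        "id": "Q64",
        "head": {
            "labels":       {"en": "Berlin", "de": "Berlin"},
            "descriptions": {"en": "capital of Germany"},
            "aliases":      {"en": ["Berlin, Germany"]},
        },
        "body": {
            "statements": [{"property": "P1082", "value": 3500000}],  # population
            "sitelinks":  {"enwiki": "Berlin"},
        },
    }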
For the shared task we built the Wikidata Vandalism Corpus 2016 (short: WDVC-2016), which is an updated version of the WDVC-2015 corpus [6]. The corpus consists of 82 million user-contributed revisions made between October 2012 and June 2016 (excluding revisions from bots), alongside 198,147 vandalism annotations on those revisions that have been reverted via the administrative rollback feature; the feature is employed at Wikidata with the purpose of reverting vandalism and similarly damaging contributions. Moreover, our corpus provides meta information that is not readily available from Wikidata, such as geolocation data of all anonymous edits as well as Wikidata revision tags originating from both the Wikidata Abuse Filter and semi-automatic editing tools. Table 1 gives an overview of the corpus: participants of the shared task were provided training data and validation data, while the test data was held back until after the final evaluation.

Table 1: Datasets for training, validation, and test in terms of time period covered, vandalism revisions, total revisions, sessions, items, and users. Numbers are given in thousands.

Dataset      From          To             Vand.   Rev.     Sessions   Items    Users
Training     Oct 1, 2012   Feb 29, 2016   176     65,010   36,552     12,401   471
Validation   Mar 1, 2016   Apr 30, 2016   11      7,225    3,827      3,116    43
Test         May 1, 2016   Jun 30, 2016   11      10,445   3,122      2,661    41
Table 2: The Wikidata Vandalism Corpus WDVC-2016 in terms of total unique users, items, sessions, and revisions with a breakdown by item part and by vandal(ism) status (Vand.). Numbers are given in thousands.

             Entire corpus                Item head                  Item body
             Total    Vand.  Regular      Total    Vand.  Regular    Total    Vand.  Regular
Revisions    82,680   198    82,482       16,602   100    16,502     59,699   92     59,606
Sessions     43,254   119    43,142       9,835    71     9,765      33,955   49     33,908
Items        14,049   85     14,049       6,268    54     6,254      12,744   39     12,741
Users        518      96     431          247      65     186        310      35     279

Table 2 gives an overview of the corpus in terms of content type (head vs. body) as well as revisions, sessions, items, and users. Figure 1 plots the development of the corpus over time. While the number of revisions per month is increasing (top), the number of vandalism revisions per month varies without a clear trend (bottom). We attribute the observed variations to the fast pace at which Wikidata is developed, both in terms of data acquisition and frontend development. For example, the drop in vandalism affecting item heads around April 2015 is probably related to the redesign of Wikidata's user interface around this time (see https://lists.wikimedia.org/pipermail/wikidata/2015-March/005703.html): with the new user interface it is less obvious how to edit labels, descriptions, and aliases, which might deter many drive-by vandals.

Figure 1: Overview of the Wikidata Vandalism Corpus 2016 by content type: number of all revisions by content type (top) and number of vandalism revisions by content type (bottom).

Our evaluation platform has been built to ensure (1) reproducibility of results, (2) blind evaluation, (3) ground truth protection, and (4) to implement a realistic scenario of vandalism detection where revisions are scored in near real-time as they arrive. Reproducibility is ensured by inviting participants to submit their software instead of just its run output, employing the evaluation-as-a-service platform TIRA [17]. TIRA implements a cloud-based evaluation system, where each participant gets assigned a virtual machine in which a working version of their approach is to be deployed. The virtual machines along with the deployed software are remote-controlled via TIRA's web interface. Once participants manage to run their software on the test dataset hosted on TIRA, the virtual machine can be archived for later re-evaluation. In addition, TIRA ensures that participants do not get direct access to the test datasets, giving rise to blind evaluation, where the to-be-evaluated software's authors have never observed the test datasets directly.

For the following reasons, the task of vandalism detection in wikis is intrinsically difficult to organize as a shared task: the ground truth is publicly available via Wikimedia's data dumps, and the
vandalism occurring at some point in the revision history of a Wikidata item will eventually be undone manually via the rollback facility, which we also use to create our ground truth. Handing out a test dataset spanning a long time frame would therefore effectively reveal the ground truth to participants. At the same time, in practice, it would be a rather unusual task to be handed a large set of consecutive revisions to be classified as vandalism or not, when in fact every revision should be reviewed as it arrives. We therefore opted for a streaming-based setup where the participant software was supposed to connect to a data server (https://github.com/wsdm-cup-2017/wsdmcup17-data-server), which initiates a stream of revisions in chronological order (by increasing revision IDs), but waits sending new ones unless the software has returned vandalism scores for those already sent. This way, a software cannot easily exploit information gleaned from revisions occurring after a to-be-classified revision. However, we did not expect synchronous classification, but allowed for a backpressure window of k = 16 revisions, so that revision i + k is sent as soon as the score for the i-th revision has been reported. This allows for concurrent processing of data and analysis, while preventing a deep look-ahead into the future of a revision. At k = 16, no more than 0.003% of the ground truth was revealed within the backpressure window (283 regular revisions and 76 vandalism revisions). However, as reported by participants, vandalism scores were based on previous revisions, while the backpressure window was unanimously used to gain computational speedup, and not to exploit ground truth or editing sessions.
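A minimal sketch of a participant-side client for such a setup is given below; it assumes a simple line-based exchange (one revision in, one tab-separated revision ID and score out) and a placeholder scoring function. The actual wire protocol is defined by the data server linked above.

    import socket

    def score(revision_line):
        # Placeholder: a real detector would featurize and classify here.
        return 0.0

    # The server keeps at most k = 16 unanswered revisions in flight and
    # blocks until scores for earlier revisions have been reported.
    with socket.create_connection(("localhost", 3334)) as conn:
        reader = conn.makefile("r", encoding="utf-8")
        writer = conn.makefile("w", encoding="utf-8")
        for line in reader:                       # one revision per line (assumed)
            revision_id = line.split("\t", 1)[0]
            writer.write("%s\t%f\n" % (revision_id, score(line)))
            writer.flush()                        # report promptly to keep the stream going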
We employ the same performance measures as in our previous work [7]: the area under curve of the receiver operating characteristic, ROC AUC, and the area under the precision-recall curve, PR AUC. Both ROC AUC and PR AUC cover a wide range of operating points, emphasizing different characteristics. Given the large class imbalance of vandalism to non-vandalism, ROC AUC emphasizes performance in the high recall range, while PR AUC emphasizes performance in the high precision range. I.e., ROC AUC is probably more meaningful for a semi-automatic operation of a vandalism detector, pre-filtering revisions likely to be vandalism and leaving the final judgment to human reviewers. PR AUC addresses scenarios where vandalism shall be reverted fully automatically without any human intervention. The winner of the WSDM Cup was determined based on ROC AUC.
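Both measures can be computed from a detector's scores and the rollback-based labels, e.g. with scikit-learn (toy data below):

    import numpy as np
    from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

    y_true  = np.array([0, 0, 0, 1, 0, 1, 0, 0])                    # rollback-based labels
    y_score = np.array([0.1, 0.2, 0.05, 0.9, 0.3, 0.4, 0.01, 0.2])  # detector scores

    roc_auc = roc_auc_score(y_true, y_score)                 # main WSDM Cup measure
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    pr_auc = auc(recall, precision)                          # area under the PR curve
    print("ROC_AUC = %.3f, PR_AUC = %.3f" % (roc_auc, pr_auc))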
For informational purposes, we also report the typical classifier performance measures for the operating point at the threshold 0.5: accuracy (Acc), precision (P), recall (R), and F-measure (F).

WDVD.
We employ the Wikidata Vandalism Detector, WDVD [7], as a (strong) state-of-the-art baseline. The underlying model consists of 47 features and employs multiple-instance learning on top of bagging and random forests. The model was trained on training data ranging from May 1, 2013, to April 30, 2016. We used the same hyperparameters for this model as reported in our previous work [7]: 16 random forests, each built on 1/16 of the training dataset, with the forests consisting of 8 trees, each having a maximal depth of 32, with two features per split and using the default Gini split criterion. In order to adjust WDVD to the new evaluation setup where revisions constantly arrive in a stream, we adjusted the multiple-instance learning to consider only those revisions of a session up until the current revision.
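The streaming-compatible multiple-instance step can be sketched as follows; this is a simplified illustration of the idea (a running aggregation over the revisions of an editing session seen so far), not WDVD's actual implementation, and the grouping ignores that sessions must consist of consecutive revisions.

    from collections import defaultdict

    session_scores = defaultdict(list)   # (user, item) -> per-revision scores so far

    def mi_score(user, item, single_revision_score):
        scores = session_scores[(user, item)]
        scores.append(single_revision_score)
        # Aggregate only over revisions of the session up to the current one,
        # e.g., by taking the running mean.
        return sum(scores) / len(scores)

    print(mi_score("Bob", "Q64", 0.2))
    print(mi_score("Bob", "Q64", 0.8))   # second revision of the same session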
FILTER.
As a second baseline, we employ so-called revision tags, which are created on Wikidata by two main mechanisms: (1) the Wikidata Abuse Filter automatically tags revisions according to a collection of human-generated rules; (2) revisions created by semi-automatic editing tools such as the Wikidata Game are tagged with the authentication method used by the semi-automatic editing tool. In general, tags assigned by the abuse filter are a strong signal for vandalism, while tags from semi-automatic editing tools are a signal for non-vandalism. We trained a random forest with scikit-learn's default hyperparameters on the training data from May 1, 2013, to April 30, 2016. The revision tags were provided to all participants as part of the meta data. This baseline has also been incorporated into WDVD as the feature revisionTags.
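A rough sketch of such a baseline is shown below: revision tags are one-hot encoded and fed to a random forest with scikit-learn's defaults. The tag names and labels are made up for illustration.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction import DictVectorizer

    train_tags   = [{"abusefilter-rule-42": 1}, {"wikidata-game": 1}, {}, {"abusefilter-rule-42": 1}]
    train_labels = [1, 0, 0, 1]            # 1 = vandalism (rolled back)

    vectorizer = DictVectorizer(sparse=False)
    X = vectorizer.fit_transform(train_tags)
    clf = RandomForestClassifier(random_state=0).fit(X, train_labels)
    print(clf.predict_proba(vectorizer.transform([{"wikidata-game": 1}]))[:, 1])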
ORES.
We reimplemented the ORES approach [19] developed for Wikidata vandalism detection by the Wikimedia Foundation and apply it to the Wikidata Vandalism Corpus 2016. (Compared with our previous paper [7], we employ an updated version of ORES that was recently published by Sarabadani et al. [19]; see https://github.com/wiki-ai/wb-vandalism/blob/sample_subsets/Makefile.) Essentially, this approach consists of a subset of WDVD's features plus some additional features that were previously found to overfit [7]. It uses a random forest and was trained on the training data ranging from May 1, 2013, to April 30, 2016. We use the original hyperparameters by Sarabadani et al.: 80 decision trees considering 'log2' features per split and using the 'entropy' criterion. While Sarabadani et al. [19] experimented with balancing the weights of the training examples, we do not do so for the ORES baseline, since it has no effect on performance in terms of ROC AUC and decreases performance in terms of PR AUC. The difference in performance of our reimplementation of ORES compared with Sarabadani et al. [19] is explained by the different datasets and evaluation metrics: while we split the dataset by time, Sarabadani et al. split the dataset randomly, causing revisions from the same editing session to appear both in the training as well as the test dataset.

META.
Given the vandalism scores returned by participant approaches and our baselines, the question arises what the detection performance would be if all these approaches were combined into one. To get an estimate of the possible performance, we employ a simple meta approach whose score for each to-be-classified revision corresponds to the mean of the scores of all 8 approaches. As it turns out, the meta approach slightly outperforms the other approaches.
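The meta approach reduces to averaging the score vectors, e.g.:

    import numpy as np

    # One row of (hypothetical) scores per approach for the same ordered
    # list of test revisions; the actual META approach averages 8 rows.
    scores_per_approach = np.array([
        [0.90, 0.10, 0.30],   # e.g., Buffaloberry
        [0.80, 0.05, 0.40],   # e.g., WDVD
        [0.70, 0.20, 0.10],   # e.g., FILTER
    ])
    meta_scores = scores_per_approach.mean(axis=0)
    print(meta_scores)        # one combined score per revision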
Table 3: Overview of feature groups, features, and their usage by WSDM Cup participants. Features that were newly introduced by participants and not previously used as part of WDVD are marked with an asterisk (*). Features computed in the same way as for WDVD as well as new features are marked with ✓, features computed similarly to WDVD are marked with (✓), features for which it is unclear from their respective paper whether they are included are marked with ?, and features not utilized are marked with –. Columns: Buffaloberry (Buf), Conkerberry (Con), Honeyberry (Hon), Loganberry (Log), Riberry (Rib), WDVD, FILTER (FIL), ORES.

Feature group / Feature         Buf   Con   Hon   Log   Rib   WDVD  FIL   ORES

Content features
Character features
lowerCaseRatio                  ✓     –     ✓     ✓     ✓     ✓     –     –
upperCaseRatio                  ✓     –     ✓     ✓     ✓     ✓     –     –
nonLatinRatio                   ✓     –     ✓     ✓     ✓     ✓     –     –
latinRatio                      ✓     –     ✓     ✓     ✓     ✓     –     –
alphanumericRatio               ✓     –     ✓     ✓     ✓     ✓     –     –
digitRatio                      ✓     –     ✓     ✓     ✓     ✓     –     –
punctuationRatio                ✓     –     ✓     ✓     ✓     ✓     –     –
whitespaceRatio                 ✓     –     ✓     ✓     ✓     ✓     –     –
longestCharacterSequence        ✓     –     ✓     ✓     ✓     ✓     –     –
asciiRatio                      –     –     –     ✓     ✓     ✓     –     –
bracketRatio                    ✓     –     –     ✓     ✓     ✓     –     –
misc features from WDVD         –     –     –     –     ✓     –     –     –
symbolRatio *                   ✓     –     –     –     –     –     –     –
mainAlphabet *                  ✓     –     –     –     –     –     –     –
Word features
languageWordRatio               ✓     –     ✓     ✓     ✓     ✓     –     –
containsLanguageWord            ✓     –     ✓     ✓     ✓     ✓     –     –
lowerCaseWordRatio              ✓     –     ✓     ✓     ✓     ✓     –     –
longestWord                     ✓     –     ✓     ✓     ✓     ✓     –     –
containsURL                     ✓     –     ✓     ✓     ✓     ✓     –     –
badWordRatio                    ✓     –     –     ✓     ✓     ✓     –     –
proportionOfQidAdded            –     –     –     ?     –     ✓     –     ✓
upperCaseWordRatio              ✓     –     ✓     ✓     ✓     ✓     –     –
proportionOfLinksAdded          –     –     –     ?     –     ✓     –     ✓
proportionOfLanguageAdded       –     –     –     –     –     –     –     ✓
misc features from WDVD         –     –     –     –     ✓     –     –     –
bagOfWords *                    –     ✓     –     –     –     –     –     –
Sentence features
commentTailLength               ✓     –     –     ✓     ✓     ✓     –     –
commentSitelinkSimilarity       (✓)   –     –     ✓     ✓     ✓     –     –
commentLabelSimilarity          (✓)   –     –     ✓     ✓     ✓     –     –
commentCommentSimilarity        –     –     –     ?     –     ✓     –     –
languageMatchProb *             ✓     –     –     –     –     –     –     –
hasIdentifierChanged            –     –     –     –     –     –     –     ✓
Statement features
propertyFrequency               –     (✓)   –     ✓     ✓     ✓     –     (✓)
itemValueFrequency              –     (✓)   –     ✓     ✓     ✓     –     –
literalValueFrequency           –     (✓)   –     ✓     ✓     ✓     –     –

Context features
User features
userCountry                     ✓     ✓     ✓     –     ✓     ✓     –     –
userTimeZone                    ✓     ✓     ✓     –     ✓     ✓     –     –
userCity                        ✓     ✓     ✓     –     ✓     ✓     –     –
userCounty                      ✓     ✓     ✓     –     ✓     ✓     –     –
userRegion                      ✓     ✓     ✓     –     ✓     ✓     –     –
cumUserUniqueItems              –     –     –     –     ✓     ✓     –     –
userContinent                   ✓     ✓     ✓     –     ✓     ✓     –     –
isRegisteredUser                ✓     ✓     ✓     ✓     ✓     ✓     –     ✓
userFrequency                   ✓     ✓     –     –     ✓     ✓     –     –
isPrivilegedUser                ✓     –     –     ✓     ✓     ✓     –     (✓)
misc features from WDVD         –     –     –     –     ✓     –     –     –
userIPSubnets *                 –     ✓     –     –     –     –     –     –
userVandalismFraction *         –     –     –     ✓     ✓     –     –     –
userVandalismCount *            –     –     –     ✓     –     –     –     –
userUniqueItems *               –     –     –     –     ✓     –     –     –
userAge                         –     –     –     –     –     –     –     ✓
Item features
logCumItemUniqueUsers           –     –     –     –     –     ✓     –     –
logItemFrequency                –     –     –     –     –     ✓     –     –
isHuman                         –     –     –     –     –     –     –     ✓
isLivingPerson                  –     –     –     –     –     –     –     ✓
misc features from WDVD         –     –     –     –     ✓     –     –     –
itemFrequency *                 ✓     (✓)   –     –     ✓     –     –     –
itemVandalismFraction *         –     –     –     ✓     ✓     –     –     –
itemVandalismCount *            –     –     –     ✓     –     –     –     –
itemUniqueUsers *               –     –     –     –     ✓     –     –     –
Revision features
revisionTags                    (✓)   ✓     ✓     ?     ✓     ✓     ✓     –
revisionLanguage                (✓)   ✓     ✓     ✓     ✓     ✓     –     –
revisionAction                  ✓     ✓     ✓     ✓     ✓     ✓     –     (✓)
commentLength                   –     –     ✓     ✓     ✓     ✓     –     –
isLatinLanguage                 –     –     ✓     ✓     ✓     ✓     –     –
revisionPrevAction              –     –     –     ?     –     ✓     –     –
revisionSubaction               ✓     ✓     ✓     ✓     ✓     ✓     –     (✓)
positionWithinSession           –     –     –     ?     –     ✓     –     –
numberOfIdentifiersChanged      –     –     –     –     –     –     –     ✓
misc features from WDVD         –     –     –     –     ✓     –     –     –
isMinorRevision *               ✓     –     –     –     ✓     –     –     –
changeCount *                   ✓     ✓     –     –     ✓     –     –     (✓)
superItem *                     ✓     –     –     –     ✓     –     –     –
revisionSize *                  ✓     –     –     –     ✓     –     –     –
hourOfDay *                     ✓     –     –     –     –     –     –     –
dayOfWeek *                     –     –     –     –     ✓     –     –     –
revisionPrevUser *              ✓     –     –     –     –     –     –     –
hashTag *                       –     ✓     (✓)   –     –     –     –     –
isSpecialRevision *             –     –     ✓     –     –     –     –     –

4. SURVEY OF SUBMISSIONS
This section surveys the features and learning algorithms employed by the participants. All of them chose to build their own models, which are eventually based on our WDVD approach [7], whose code base had been published to ensure reproducibility. On the one hand, the availability of this code base leveled the playing field among participants, since it enabled everyone to achieve at least state-of-the-art performance. On the other hand, the availability may have stifled creativity and innovation among participants, since all approaches follow a similar direction and no one investigated different classification paradigms. However, all participants attempted to improve over our approach (which was one of the baselines) by developing new features and experimenting with learning variants.
Table 3 gives a comprehensive overview of the features used in the submitted approaches. The feature set is complete; in particular, it unifies those features that are the same or closely related across participants. The table divides the features into two main groups, content features and context features. The content features in turn are subdivided regarding the granularity level at which characteristics are quantified, whereas the context features are subdivided regarding contextual entities in connection with a to-be-classified revision. Since the features have been extensively described in our previous work and the participant notebooks, we omit a detailed description here. Instead, the feature names have been chosen to convey their intended semantics and are in accordance with the corresponding implementations found in our code base. For in-depth information we refer to our paper covering WDVD [7], the FILTER baseline, our reimplementation of ORES, as well as the notebook papers submitted by the participants [3, 5, 26, 27, 28].

We would like to point out certain observations that can be gained from the overview: Buffaloberry [3] used many of the WDVD features but also contributed a number of additional features on top of that. Conkerberry [5] used an interesting bag-of-words model that basically consists of the feature values computed by many of the WDVD features, all taken as words. Loganberry [28] did not exploit the information we provided as meta data, such as geolocation. With two exceptions, Honeyberry [26] used almost exclusively WDVD features. Riberry [27] used, on top of the WDVD features, those that we previously found to overfit (denoted as "misc features from WDVD" in the table), which may explain their poor overall performance, corroborating our previous results.
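The idea behind Conkerberry's bag-of-words encoding can be sketched as follows (not their actual code): feature values are serialized into word-like tokens and fed to a linear model.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    def revision_to_tokens(revision):
        return " ".join("%s=%s" % (k, v) for k, v in sorted(revision.items()))

    train  = [{"revisionAction": "wbsetlabel", "isRegisteredUser": False},
              {"revisionAction": "wbsetclaim", "isRegisteredUser": True}]
    labels = [1, 0]

    vectorizer = CountVectorizer(token_pattern=r"\S+")
    X = vectorizer.fit_transform([revision_to_tokens(r) for r in train])
    model = LinearSVC().fit(X, labels)
    print(model.decision_function(vectorizer.transform([revision_to_tokens(train[0])])))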
Table 4 gives an overview of the employed learning algorithms and organizes them wrt. the achieved performance. The best-performing approach by Buffaloberry employs XGBoost and multiple-instance learning. The second-best approach by Conkerberry employs a linear SVM, encoding all WDVD features as a bag-of-words model. This results in an effectiveness comparable to the WDVD baseline in terms of ROC AUC, but not in terms of PR AUC. The third approach by Loganberry also employs XGBoost; however, in contrast to the first approach, no multiple-instance learning was conducted. The fourth approach, Honeyberry, created an ensemble of various algorithms following a stacking strategy. In contrast to the first approach, the authors put less emphasis on feature engineering. Their final submission contained a bug reducing the performance of their approach, which was fixed only after the submission deadline. The bugfix caused their performance to jump to a ROC AUC of 0.928, thus virtually achieving the third place in the competition. The fifth approach, Riberry, performed poorly, probably due to overfitting features. The baselines employ a parameter-optimized random forest.
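For illustration, a stacking ensemble in the spirit of Honeyberry could be set up as follows with scikit-learn; the concrete base learners, meta learner, and features are described in their notebook and will differ from this sketch.

    import numpy as np
    from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier, StackingClassifier)
    from sklearn.linear_model import LogisticRegression

    stack = StackingClassifier(
        estimators=[
            ("rf",  RandomForestClassifier(n_estimators=50, random_state=0)),
            ("et",  ExtraTreesClassifier(n_estimators=50, random_state=0)),
            ("gbt", GradientBoostingClassifier(random_state=0)),
        ],
        final_estimator=LogisticRegression(),   # meta learner on base predictions
    )

    # Toy data standing in for the revision feature matrix and labels.
    X = np.random.RandomState(0).rand(40, 5)
    y = (X[:, 0] > 0.5).astype(int)
    stack.fit(X, y)
    print(stack.predict_proba(X[:2])[:, 1])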
Table 4: Overview of the employed learning algorithms per submission. The rows are sorted wrt. the achieved evaluation scores, starting with the best.

Submission          XGBoost  Linear SVM  Logistic Regression  Random Forest  Extra Trees  GBT  Neural Networks  Multiple-Instance
META                ✓        ✓           ✓                    ✓              ✓            ✓    ✓                ✓
Buffaloberry        ✓        –           –                    –              –            –    –                ✓
Conkerberry         –        ✓           –                    –              –            –    –                –
WDVD (baseline)     –        –           –                    ✓              –            –    –                ✓
Honeyberry          –        –           ✓                    ✓              ✓            ✓    ✓                –
Loganberry          ✓        –           –                    –              –            –    –                –
Riberry             –        –           –                    ✓              –            ✓    –                –
ORES (baseline)     –        –           –                    ✓              –            –    –                –
FILTER (baseline)   –        –           –                    ✓              –            –    –                –
5. EVALUATION
This section presents an in-depth evaluation of the submitted approaches, including overall performance and performance achieved regarding different data subsets, such as head content vs. body content and registered users vs. unregistered users, as well as performance variation over time. Furthermore, we combine the five submitted approaches within an ensemble to get a first idea about the performance that could be achieved if these approaches were integrated.
The competition phase of the WSDM Cup 2017 officially ended on December 30, 2016, resulting in the following ranking:
1. Buffaloberry by Crescenzi et al. [3]
2. Conkerberry by Grigorev [5]
3. Loganberry by Zhu et al. [28]
4. Honeyberry by Yamazaki et al. [26]
5. Riberry by Yu et al. [27]
We congratulate the winners! Working versions of these approaches have been successfully deployed within TIRA and evaluated using our aforementioned evaluation framework; three participants also shared their code bases as open source (https://github.com/wsdm-cup-2017).

To provide a realistic overview of the state of the art at the time of writing, we report the results that were most recently achieved. This is particularly relevant for the authors of the Honeyberry vandalism detector, who found and fixed an error in their approach shortly after the deadline, moving their approach up one rank. Moreover, we include the performances achieved by our own approach as well as those of our two baselines and of the meta approach. The following ranking hence slightly differs from the official one.
Table 5 and Figure 2 give an overview of the evaluation results of the vandalism detection task at WSDM Cup 2017. The evaluation results shown in Table 5 are ordered by ROC AUC. The meta approach achieves the best result with a ROC AUC of 0.950, closely followed by the winning submission Buffaloberry with a ROC AUC of 0.947; the weakest result in terms of ROC AUC is that of the FILTER baseline, achieving 0.869. Note that in our previous work [7] we reported higher values for ROC AUC, which, however, were obtained with a previous version of the dataset; we discuss the apparent differences below.

Table 5: Evaluation results of the WSDM Cup 2017 on the test dataset. Performance values are reported in terms of accuracy (Acc), precision (P), recall (R), F-measure (F), area under the precision-recall curve (PR AUC), and area under curve of the receiver operating characteristic (ROC AUC), as well as with regard to the four data subsets.
                    Overall performance                            Item head        Item body        Registered user  Unregistered user
Approach            Acc     P      R      F      PR AUC  ROC AUC   PR AUC  ROC AUC  PR AUC  ROC AUC  PR AUC  ROC AUC  PR AUC  ROC AUC
META                0.9991  0.668  0.339  0.450  0.475   0.950     0.648   0.996    0.387   0.926    0.082   0.829    0.627   0.944
Buffaloberry        0.9991  0.682  0.264  0.380  0.458   0.947     0.634   0.997    0.364   0.921    0.053   0.820    0.613   0.938
Conkerberry         0.9990  0.675  0.099  0.173  0.352   0.937     0.512   0.989    0.281   0.911    0.004   0.789    0.538   0.915
WDVD (baseline)     0.9991  0.779  0.147  0.248  0.486   0.932     0.668   0.996    0.388   0.900    0.086   0.767    0.641   0.943
Honeyberry          0.7778  0.004  0.854  0.008  0.206   0.928     0.364   0.993    0.101   0.893    0.002   0.760    0.308   0.819
Loganberry          0.9285  0.011  0.767  0.022  0.337   0.920     0.429   0.961    0.289   0.892    0.020   0.758    0.487   0.895
Riberry             0.9950  0.103  0.483  0.170  0.174   0.894     0.328   0.932    0.113   0.878    0.002   0.771    0.378   0.795
ORES (baseline)     0.9990  0.577  0.199  0.296  0.347   0.884     0.448   0.973    0.298   0.836    0.026   0.627    0.481   0.897
FILTER (baseline)   0.9990  0.664  0.073  0.131  0.227   0.869     0.249   0.908    0.182   0.840    0.021   0.644    0.387   0.771

Figure 2: ROC curves (true positive rate over false positive rate) and precision-recall curves of all approaches.
The PR AUC scores are lower and more diverse than the ROC AUC scores, which is a consequence of the extreme class imbalance. The WDVD baseline outperforms all approaches in terms of PR AUC, including the meta classifier; the ORES baseline outperforms all but two participants. The ranking among the participants changes only slightly: Loganberry and Honeyberry switch places. The main reason for high PR AUC scores are features that can signal vandalism with a pretty high precision. For example, WDVD, Buffaloberry, Conkerberry, ORES, and Loganberry are all able to pick up on bad words (badWordRatio and bagOfWords) or contain detailed user information (userFrequency, userAge, userVandalismFraction, userVandalismCount), whereas FILTER and Honeyberry lack the respective features. The performance of Riberry can be explained by the inclusion of features that have previously been found to overfit [7].

While PR AUC and ROC AUC are computed on continuous scores, we also computed accuracy, precision, recall, and F-measure on binary scores at a threshold of 0.5. Honeyberry and Loganberry achieve the poorest precision but the highest recall. Conkerberry and FILTER achieve the poorest recall but high precision. The META approach and Buffaloberry manage the best trade-off in terms of the F-measure. WDVD achieves the highest precision at a non-negligible recall. Accuracy correlates with precision due to the high class imbalance. Recall that, since the winner of the competition was determined based on ROC AUC, the teams had only little incentive to optimize the other scores. For real-world applications of classifiers, it might be beneficial to calibrate the raw scores to represent more accurate probability estimates and to set the threshold depending on the use case, i.e., adjusting Acc, P, R, and the F-measure.
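Such a calibration step could, for instance, look as follows (a sketch with synthetic data, not part of any submission):

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)                                    # synthetic feature matrix
    y = (X[:, 0] + 0.1 * rng.randn(200) > 0.8).astype(int)

    # Calibrate the raw scores into probability estimates ...
    calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                        method="isotonic", cv=3).fit(X, y)
    proba = calibrated.predict_proba(X)[:, 1]

    # ... and pick a threshold per use case, e.g., favor precision for auto-revert.
    threshold = 0.9
    print((proba >= threshold).sum(), "revisions would be auto-flagged")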
Table 5 contrasts the approaches' performances on different content types, namely, item heads vs. item bodies. Regardless of the metric, all approaches perform significantly better on item heads. We explain this by the fact that vandalism on item heads typically happens at the lexical level (and hence can be detected more easily), e.g., by inserting bad words or wrong capitalization, whereas vandalism on item bodies typically happens at the semantic level, e.g., by inserting wrong facts. In particular, the character and word features focus on textual content as is found in the item head, but there are not many features for structural content. Future work might focus on transferring techniques that are used for the Google Knowledge Vault [4] to Wikidata, such as link prediction techniques to check the correctness of links between items, and Gaussian mixture models to check the correctness of attribute values.

Figure 3: Performance over time on the test dataset (ROC AUC per week of the year 2016, weeks starting on Sundays).
Table 5 also contrasts the approaches' performances regarding revisions originating from registered users vs. revisions from unregistered users. All approaches perform significantly better on revisions from unregistered users, which is in particular reflected by the PR AUC scores. Spot-checks suggest that vandalism by unregistered users tends to be rather obvious, whereas vandalism by registered users appears to be more sophisticated and elusive. Moreover, some reverted edits of registered users may also be considered honest mistakes, and telling apart honest mistakes from a (detected) vandalism attempt may be difficult.
Figure 3 shows the performance of the approaches on the test set over time. Over the first seven weeks (calendar weeks 19 to 25), the approaches' performances remain relatively stable. All approaches were only trained with data obtained up until calendar week 18. Since no drop in the performances is observed, no major changes in the kinds of vandalism seem to have happened in this time frame. However, in calendar week 26 a major performance drop can be observed. The outlier is caused by a single, highly reputable user using some automatic editing tool on June 19, 2016, to create 1,287 slightly incorrect edits (which were rollback-reverted later). Since only 11,043 edits were labeled vandalism in the entire test set of two months, these 1,287 edits within a short time period have a significant impact on the overall performance.

We were curious to learn about the impact of this data artifact and recomputed the detection performances, leaving out the series of erroneous edits from the user in question. Table 6 shows the overall performance the approaches would have achieved then: while the absolute performance increases, the ranking among the participants is not affected.

Probably one cannot anticipate such an artifact in the test data, but, with hindsight, we consider it a blessing rather than a curse: it points to the important question of how to deal with such cases in practice. Machine learning-based vandalism detectors become unreliable when the characteristics of the stream of revisions to be classified suddenly change; errors in both directions, false positives and false negatives, can be the consequence. Ideally, the developers of detectors envision possible exceptional circumstances and provide a kind of exception handling, e.g., flagging a user with suspicious behavior by default for review, regardless of whether the respective series of edits is considered damaging or not.
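The per-week analysis underlying Figure 3 can be sketched as follows, assuming a frame with a timestamp, the rollback-based label, and a detector's score per revision (toy values below):

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    df = pd.DataFrame({
        "timestamp": pd.date_range("2016-05-01", periods=14, freq="2D"),
        "vandalism": [0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0],
        "score":     [0.1, 0.8, 0.2, 0.1, 0.3, 0.7, 0.2, 0.6,
                      0.5, 0.3, 0.1, 0.2, 0.9, 0.1],
    })

    weekly_auc = df.groupby(df["timestamp"].dt.isocalendar().week).apply(
        lambda g: roc_auc_score(g["vandalism"], g["score"])
                  if g["vandalism"].nunique() == 2 else float("nan"))
    print(weekly_auc)   # one ROC_AUC value per calendar week (NaN if only one class)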
Table 6: Evaluation results of the WSDM Cup 2017 on the test dataset without the erroneous edits.

Approach            Acc     P      R      F      PR AUC  ROC AUC
META                0.9992  0.668  0.384  0.487  0.536   0.988
Buffaloberry        0.9992  0.682  0.298  0.415  0.517   0.988
WDVD (baseline)     0.9992  0.779  0.167  0.274  0.548   0.980
Conkerberry         0.9991  0.675  0.113  0.193  0.398   0.980
Honeyberry          0.7779  0.004  0.967  0.008  0.233   0.972
Loganberry          0.9286  0.011  0.867  0.022  0.383   0.939
FILTER (baseline)   0.9991  0.664  0.082  0.146  0.257   0.938
ORES (baseline)     0.9991  0.577  0.225  0.324  0.392   0.935
Riberry             0.9951  0.103  0.546  0.173  0.196   0.902
6. REFLECTIONS ON THE WSDM CUP
The WSDM Cup 2017 had two tasks for which a total of 140 teams registered, 95 of which ticked the box for participation in the vandalism detection task (multiple selections allowed). This is a rather high number compared with other shared task events. We attribute this success to the facts that the WSDM conference is an A-ranked conference, giving the WSDM Cup a high visibility, that the vandalism detection task was featured on Slashdot (https://developers.slashdot.org/story/16/09/10/1811237), and that we attracted sponsorship from Adobe, which allowed us to award cash prizes to the three winning participants of each task. However, only 35 registered participants actually engaged when being asked for their operating system preferences for their virtual machine on TIRA, 14 of which managed to produce at least one run, whereas the remainder never used their assigned virtual machines at all. In the end, five teams made a successful submission by running their software without errors on the test dataset.

Why did so many participants decide to drop out of this task? We believe that the comparably huge size of the dataset as well as difficulties in setting up their approach on our evaluation platform are part of the reason: each approach had to process gigabytes of data by implementing a client-server architecture, and all of that had to be deployed on a remote virtual machine. The requirement to submit working software, however, may not have been the main cause, since the retention rate of our companion task was much higher. Rather, the combination of dataset size, real-time client-server processing environment, and remote deployment is a likely cause. Note that the vandalism detection task itself demanded this scale of operations, since otherwise it would have been easy to cheat, which is particularly a problem when cash prizes are involved. Finally, the provided baseline systems were already competitive, so that the failure to improve upon them may have caused additional dropouts.

The WSDM Cup taught us an important lesson about the opportunities and limitations of shared tasks in general, and about evaluation platforms and rule enforcement in particular. On the one hand, competitions like ours are important to rally followers for a given task and to create standardized benchmarks. On the other hand, shared tasks are constrained to a relatively short period of time and create a competitive environment between teams. I.e., it becomes important to implement a good trade-off in the evaluation setup in order to prevent potential cheating and data leaks, while, at the same time, placing low hurdles on the submission procedure. A means toward this end might be standardized evaluation platforms that are widely used for a large number of shared tasks. While there are already platforms like TIRA or Kaggle, we are not aware of a widely used evaluation platform for time series data, serving teams with one test example after the other and providing a sandboxing mechanism for the submitted software to prevent data leaks. Moreover, there definitely is a trade-off between enforcing strict rules on the one side and scientific progress on the other. For example, only two teams had made a successful submission by the original deadline, while other teams were still struggling with running their approaches. In this case, we erred on the side of scientific progress by allowing additional submissions instead of enforcing overly strict rules, and gave all teams a short deadline extension of 8 days, accepting some discussion about the fairness of the extension.
7. CONCLUSION AND OUTLOOK
This paper gives an overview of the five vandalism detection approaches submitted to the WSDM Cup 2017. The approaches were evaluated on the new Wikidata Vandalism Corpus 2016, which has been specifically compiled for the competition. Under a semi-automatic detection scenario, where newly arriving revisions are ranked for manual review, the winning approach Buffaloberry by Crescenzi et al. [3] performs best. Under a fully automatic detection scenario, where the decision whether or not to revert a given revision is left to the classifier, the baseline approach WDVD by Heindorf et al. [7] still performs best. Combining all approaches within a meta classifier yields a small improvement; however, the feature set seems to be the performance-limiting factor.

All approaches build upon the WDVD baseline, proposing only few additional features. I.e., for the future it is interesting to develop and explore fundamentally different feature sets. E.g., building upon the work on knowledge graphs, technology for link prediction and value range prediction should be investigated. Building upon work on other user-generated content, psychologically motivated features capturing a user's personality and state of mind also appear promising.
Acknowledgments
We thank Adobe for sponsoring the WSDM Cup, and Wikimedia Germany for supporting our task. Our special thanks go to all participants for their devoted work.
References
[1] B. Adler, L. de Alfaro, and I. Pye. Detecting Wikipedia Vandalism using WikiTrust. In CLEF, pages 22–23, 2010.
[2] B. T. Adler, L. de Alfaro, S. M. Mola-Velasco, P. Rosso, and A. G. West. Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features. In CICLing, pages 277–288, 2011.
[3] R. Crescenzi, M. Fernandez, F. A. G. Calabria, P. Albani, D. Tauziet, A. Baravalle, and A. S. D'Ambrosio. A Production Oriented Approach for Vandalism Detection in Wikidata—The Buffaloberry Vandalism Detector at WSDM Cup 2017. In WSDM Cup, 2017.
[4] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion. In KDD, pages 601–610. ACM, 2014.
[5] A. Grigorev. Large-Scale Vandalism Detection with Linear Classifiers—The Conkerberry Vandalism Detector at WSDM Cup 2017. In WSDM Cup, 2017.
[6] S. Heindorf, M. Potthast, B. Stein, and G. Engels. Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis. In SIGIR, pages 831–834. ACM, 2015.
[7] S. Heindorf, M. Potthast, B. Stein, and G. Engels. Vandalism Detection in Wikidata. In CIKM, pages 327–336. ACM, 2016.
[8] S. Heindorf, M. Potthast, H. Bast, B. Buchhold, and E. Haussmann. WSDM Cup 2017: Vandalism Detection and Triple Scoring. In WSDM, pages 827–828. ACM, 2017.
[9] J. Kiesel, M. Potthast, M. Hagen, and B. Stein. Spatio-Temporal Analysis of Reverted Wikipedia Edits. In ICWSM, pages 122–131. AAAI Press, 2017.
[10] A. Kittur, B. Suh, B. A. Pendleton, and E. H. Chi. He Says, She Says: Conflict and Coordination in Wikipedia. In CHI, pages 453–462. ACM, 2007.
[11] S. Kumar, F. Spezzano, and V. S. Subrahmanian. VEWS: A Wikipedia Vandal Early Warning System. In KDD, pages 607–616. ACM, 2015.
[12] T. Pellissier Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. In WWW, pages 1419–1428, 2016.
[13] M. Potthast. Crowdsourcing a Wikipedia Vandalism Corpus. In SIGIR, pages 789–790, 2010.
[14] M. Potthast and T. Holfeld. Overview of the 2nd International Competition on Wikipedia Vandalism Detection. In CLEF, 2011.
[15] M. Potthast, B. Stein, and R. Gerling. Automatic Vandalism Detection in Wikipedia. In ECIR, pages 663–668, 2008.
[16] M. Potthast, B. Stein, and T. Holfeld. Overview of the 1st International Competition on Wikipedia Vandalism Detection. In CLEF, 2010.
[17] M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, and B. Stein. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In CLEF, pages 268–299. Springer, 2014.
[18] L. Ramaswamy, R. Tummalapenta, K. Li, and C. Pu. A Content-Context-Centric Approach for Detecting Vandalism in Wikipedia. In COLLABORATECOM, pages 115–122, 2013.
[19] A. Sarabadani, A. Halfaker, and D. Taraborelli. Building Automated Vandalism Detection Tools for Wikidata. In WWW Companion, pages 1647–1654, 2017.
[20] C. H. Tan, E. Agichtein, P. Ipeirotis, and E. Gabrilovich. Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation. In WSDM, pages 553–562. ACM, 2014.
[21] K. Tran and P. Christen. Cross Language Prediction of Vandalism on Wikipedia Using Article Views and Revisions. In PAKDD (2), volume 7819 of Lecture Notes in Computer Science, pages 268–279. Springer, 2013.
[22] K. Tran and P. Christen. Cross-Language Learning from Bots and Users to Detect Vandalism on Wikipedia. IEEE Trans. Knowl. Data Eng., 27(3):673–685, 2015.
[23] K. Tran, P. Christen, S. Sanner, and L. Xie. Context-Aware Detection of Sneaky Vandalism on Wikipedia Across Multiple Languages. In PAKDD (1), volume 9077 of Lecture Notes in Computer Science, pages 380–391. Springer, 2015.
[24] W. Y. Wang and K. R. McKeown. "Got You!": Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-semantic Modeling. In COLING, pages 1146–1154, 2010.
[25] A. G. West, S. Kannan, and I. Lee. Detecting Wikipedia Vandalism via Spatio-temporal Analysis of Revision Metadata. In EUROSEC, pages 22–28, 2010.
[26] T. Yamazaki, M. Sasaki, N. Murakami, T. Makabe, and H. Iwasawa. Ensemble Models for Detecting Wikidata Vandalism with Stacking—Team Honeyberry Vandalism Detector at WSDM Cup 2017. In WSDM Cup, 2017.
[27] T. Yu, Y. Zhao, X. Wang, Y. Xu, H. Shao, Y. Wang, X. Ma, and D. Dey. Vandalism Detection Midpoint Report—The Riberry Vandalism Detector at WSDM Cup 2017. University of Illinois at Urbana-Champaign Student Report, not published, 2017.
[28] Q. Zhu, H. Ng, L. Liu, Z. Ji, B. Jiang, J. Shen, and H. Gui. Wikidata Vandalism Detection—The Loganberry Vandalism Detector at WSDM Cup 2017. In WSDM Cup, 2017.