A large scale study of SVM based methods for abstract screening in systematic reviews
Tanay Kumar Saha^a,*, Mourad Ouzzani^b, Hossam M. Hammady^b, Ahmed K. Elmagarmid^b, Wajdi Dhifli^c, Mohammad Al Hasan^a

^a Indiana University Purdue University Indianapolis, Indianapolis, IN 46202, USA.
^b Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar.
^c University of Lille, Faculty of Pharmaceutical and Biological Sciences, EA2694, BioMathematics Lab, 3 rue du Docteur Laguesse, 59006 Lille, France.
Abstract
A major task in systematic reviews is abstract screening, i.e., excluding hundreds or thousands of irrelevant citations returned from one or several database searches based on titles and abstracts. Most of the earlier efforts on studying systematic review methods for abstract screening evaluate the existing technologies in isolation and based on findings reported in the published literature. In general, there is no attempt to discuss and understand how these technologies would be rolled out in an actual systematic review system. In this paper, we evaluate a collection of commonly used abstract screening methods over a wide spectrum of metrics on a large set of reviews collected from Rayyan, a comprehensive systematic review platform. We also provide an equivalence grouping of the existing methods through a solid statistical test. Furthermore, we propose a new ensemble algorithm for producing a 5-star rating for a citation based on its relevance in a systematic review.

In our comparison of the different methods, we observe that almost always there is a method that ranks first in the three prevalence groups. However, there is no single dominant method across all metrics: various methods perform well on different prevalence groups and for different metrics. Thus, a holistic "composite" strategy is a better choice in a real-life abstract screening system. We indeed observe that the proposed ensemble algorithm combines the best of the evaluated methods and outputs an improved prediction. It is also substantially more interpretable, thanks to its 5-star based rating.
Keywords:
Abstract Screening Platform, Systematic Review

* Corresponding author. This work was done while the author was at QCRI.
Email addresses: [email protected] (Tanay Kumar Saha), [email protected] (Mourad Ouzzani), [email protected] (Hossam Hammady), [email protected] (Ahmed K. Elmagarmid), [email protected] (Wajdi Dhifli), [email protected] (Mohammad Al Hasan)
1. Introduction
Randomized controlled trials (RCTs) constitute a key component of medical research and are by far the best way of achieving results that increase our knowledge about treatment effectiveness [7]. Because of the large number of RCTs that are being conducted and reported by different medical research groups, it is difficult for an individual to glean summary evidence from all the RCTs. The objective of systematic reviews is to overcome this difficulty by synthesizing the results from multiple RCTs.

Systematic reviews (SRs) involve multiple steps. First, systematic reviewers formulate a research question and search multiple biomedical databases. Second, they identify relevant RCTs based on abstracts and titles (abstract screening). Third, based on the full texts of a subset thereof, they assess their methodological qualities, extract different data elements, and synthesize them. Finally, they report the conclusions on the review question [50]. However, the precision of the search step of an SR process is usually low, and in most cases it returns a large number of citations, from hundreds to thousands. SR authors have to manually screen each of these citations, making such a task tedious and time-consuming. Therefore, a platform that can automate the abstract screening process is fundamental in expediting the systematic review process. To fulfill this need, we have built Rayyan (https://rayyan.qcri.org/), a web and mobile application for SR [38]. As of November 2017, Rayyan users exceeded 8,000 from over 100 countries, conducting hundreds of reviews totaling more than 11M citations.

A primary objective of Rayyan is to expedite the abstract screening process. The first step of the abstract screening process is to upload a set of citations obtained from searches in one or more databases. Once citations are processed and deduplicated, Rayyan provides an easy-to-use interface to the systematic reviewers so that they can label the citations as relevant or irrelevant. As soon as the systematic reviewer labels a certain amount of citations from both classes (relevant and irrelevant), Rayyan runs its 5-star rating algorithm and sorts the unlabeled citations based on their relevance. As the reviewer continues to label the citations, the system may trigger multiple runs of the 5-star rating algorithm. The process ends when all the citations are labeled. Therefore, one of the design requirements of the system is the ability to quickly give users feedback on the relevance of studies. Figure 1 shows a pictorial depiction of our system.

Earlier efforts on studying systematic review methods in the biomedical [39] and software engineering [37] domains evaluate the existing technologies for abstract screening from the findings presented in the published literature. As pointed out by [39], these evaluations do not make it clear how these technologies would be rolled out for use in an actual review system. In this study, we present our insights in designing methods (from both feature representation and algorithm design perspectives) for a light-weight abstract screening platform, Rayyan (see Section 3 for challenges and Section 5 for study limitations).

Figure 1: Pictorial depiction of the abstract screening system Rayyan.
A key difference of our evaluation, compared to existing studies such as [37, 39], is that our reviews come from an actual deployed systematic review system. In addition, many of the results presented in the literature are hard to generalize as they do not report performance over multiple datasets and over different prevalence groups. We thus collect a large set of reviews from the Rayyan platform and evaluate a large set of methods based on their practical utility to solve the class imbalance problem. Additionally, to give a clearer insight into the class imbalance problem, we also divide the datasets into three prevalence groups, perform well defined statistical tests, and present our findings over a large set of metrics. The results of our evaluation give insights on the best approach for abstract screening in a real-life environment. Besides the evaluation of the existing methods, we also propose an ensemble method for the task of abstract screening that presents the prediction results through a 5-star rating scale. We make our detailed results and related resources available online (https://tksaha.github.io/paper_resources/).

The primary aim of this study is to assess the performance of the widely used SVM-based methods on the abstract screening task, using a large set of reviews from a real abstract screening platform. This paper therefore addresses the following research questions from an abstract screening system design perspective:

1. What are the challenges of using different feature space representations?
2. What kinds of algorithms are suitable?
3. How do SVM-based methods perform on different metrics over a large set of reviews and different class imbalance ratios?
   • What aspects should be considered for designing a new algorithm for abstract screening?
   • Can an ensemble method improve performance?

The remainder of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we provide an overview of the SVM-based methods used in the evaluation and details of the proposed 5-star rating method.
2. Related Work
We organize the related works into the following four groups: 1) works on "abstract screening" methods; 2) methodologies for handling "data imbalance" in classification, as it is one of the main challenges in the abstract screening process; 3) methodologies for "active learning", a popular approach for addressing data imbalance and the user's labeling burden; and finally, 4) works on "linear review" from the legal domain, which bears some similarity with abstract screening in systematic reviews.
A large body of past research has focused on automating the abstract screening process [2, 9, 10, 11, 29, 34]. In terms of feature representation, most of the existing approaches [9, 11] use unigrams, bigrams, and MeSH (Medical Subject Headings) terms. An alternative to the MeSH terms is to extract LDA-based latent topics from the titles and abstracts and use them as features [34]. Other methods [9, 29] utilize external information as features, such as citations and co-citations. In terms of classification methods, SVM-based learning algorithms are commonly used [2, 9, 10]. According to a recent study [37], 13 different types of algorithms, including SVM, decision trees, naive Bayes, k-nearest neighbor, and neural networks, have been proposed in existing works on abstract screening. To the best of our knowledge, none of the existing works use the structured output version of SVM (SVM^perf [27]) with different loss functions, which we consider in this study. Note that SVM^perf's learning is faster than that of a transductive SVM [26]. For prediction, the former only keeps feature weights, which makes its prediction module very fast. On the other hand, a transductive SVM takes both labeled and unlabeled citations into account, so its learning is slower than SVM^perf. The work in [53] uses a distant supervision technique to extract PICO (Population/Problem, Intervention, Comparator, and Outcome) sentences from clinical trial reports. The distant supervision method trains a model using previously conducted reviews, and for the abstract screening task there is no such database for a particular review; thus, we do not use the distant supervision method. In this work, we evaluate various SVM-based classification methods with different loss functions over a set of abstract screening tasks.
Data imbalance in supervised classification is a well studied problem [1, 26, 27, 47, 48]. Two different kinds of approaches have been proposed to solve it. Methods of the first kind focus on designing suitable loss functions, such as KL divergence [17], quadratic mean [32], cost sensitive classification [16], mean average precision [55], random forest with meta-cost [8], and AUC [27]. Methods of the second kind generate synthetic data for artificially balancing the ratio of labeled relevant and irrelevant citations. Examples of such methods are borderline-SMOTE [22], safe-level-SMOTE [6], and over-sampling of the instances of the minority class along with under-sampling of the instances of the majority class. The authors in [35, 51, 52] use probability calibration techniques. In this paper, we evaluate the loss-centric approaches, ignoring the data centric and probability calibration based approaches. The reason for our choice is the fact that synthetic data generation is computationally expensive, which makes it very difficult to overcome extreme imbalance, a typical scenario for abstract screening. Also, for probability calibration we need a validation set, which is not always available in a typical abstract screening session. For the loss-centric approaches, we have modified SVM^perf as proposed in [17, 32] to incorporate the KL divergence and quadratic mean losses, because the default implementation does not provide these loss functions.
As systematic reviews are done in batches, the problem of abstract screening can be modeled as an instance of batch-mode active learning. Online learning retrains the classifier after adding every citation, whereas batch-mode active learning (BAL) does the same after adding batches of citations. However, BAL does not have the theoretical guarantees of online learning [23, 49]. The task of BAL is to select batches with informative samples (citations) that would help to learn a better classification model. There are two popular methods to select samples: (1) certainty based and (2) uncertainty based. In certainty-based methods, the "most certain" samples are selected to train the classifier, while in uncertainty-based methods the "most uncertain" ones are selected. For uncertainty sampling, a large number of uncertainty metrics have been proposed; examples include entropy [14], smallest margin [45], least confidence [13], committee disagreement [46], and version space reduction [36, 49]. Using "most uncertain" samples based on these metrics improves the quality of the classifier in finding the best separating hyperplane, and thus improves the accuracy in classifying new instances [34]. On the other hand, certainty-based methods have also been shown to be effective in carrying out active learning on imbalanced datasets, as demonstrated in [18]. In this paper, we present some interesting findings based on certainty-based sampling for the abstract screening task.
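To make the distinction between the two sampling strategies concrete, the following is a minimal sketch, assuming a linear classifier that exposes signed decision scores for the unlabeled pool; the function and variable names are ours and only illustrate the certainty versus uncertainty selection criteria, not any of the cited systems.

```python
import numpy as np

def select_batch(decision_scores, batch_size, strategy="uncertainty"):
    """Pick citations to label next from an unlabeled pool.

    decision_scores: signed distances to the separating hyperplane,
                     one per unlabeled citation (larger => more likely relevant).
    strategy: "uncertainty" picks the citations closest to the hyperplane
              (smallest-margin criterion); "certainty" picks the citations
              the model is most confident are relevant.
    Returns indices into the unlabeled pool.
    """
    scores = np.asarray(decision_scores, dtype=float)
    if strategy == "uncertainty":
        order = np.argsort(np.abs(scores))   # smallest absolute margin = most uncertain
    else:
        order = np.argsort(-scores)          # highest positive score = most certainly relevant
    return order[:batch_size]

# Example: a pool of 6 unlabeled citations with toy decision scores.
pool_scores = [2.3, -0.1, 0.4, -1.8, 1.1, 0.05]
print(select_batch(pool_scores, 2, "uncertainty"))  # indices of the 2 most uncertain citations
print(select_batch(pool_scores, 2, "certainty"))    # indices of the 2 most certain citations
```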
A task similar to abstract screening is Technology Assisted Review (TAR) [19, 20, 21, 24, 43, 44], which is popular among law firms. The main objective of both review systems is the same: to separate relevant results from irrelevant ones in a very imbalanced setup. TAR is used to prioritize attorneys' time to screen documents relevant to a lawsuit rather than those which are irrelevant. Generally, the number of documents is much larger (in the millions) in TAR than in systematic reviews, so active learning is very popular in the TAR domain. However, unlike systematic reviews, TAR researchers are interested in finding a good stabilization point at which the training of the classifier should be stopped. Moreover, in TAR, achieving at least 75% recall is considered acceptable, whereas in systematic reviews 95%-100% recall is desired. Linear review faces a similar set of challenges as abstract screening, and in many cases those challenges are overcome using similar solutions.
3. Method
For abstract screening, we have as input the title (T_i) and abstract (A_i) of a set of n citations, C = {C_{T_i,A_i}}, 1 ≤ i ≤ n. We represent each citation C_{T_i,A_i} as a tuple ⟨x_{C_i}, y_{C_i}⟩: x_{C_i} is a d-dimensional feature space representation of the citation and y_{C_i} ∈ {+1, −1} is a binary label denoting whether C_i is relevant or not for the given abstract screening task. For a labeled citation, y_{C_i} is known; for an unlabeled citation, it is not known. We use L and U to denote the set of labeled and unlabeled citations, respectively, F to represent the features of a set of citations, and h for the hypothesis or hyperplane learned by training on L. Some of our feature representation methods embed a word/term into a latent space; for a word w, we use l(w) for the latent space representation of this word.

Task Challenges
The desiderata for supporting abstract screening in an SR platform like Rayyan include the following: (1) feature extraction should be fast and the features should be readily computable from the available data, (2) the learning and prediction algorithms should be efficient, (3) the prediction algorithm should be able to handle extreme data imbalance so that it can overcome the shortcomings of low search precision, and (4) the prediction algorithm should solve an instance of a two-class ranking problem such that the relevant citations are ranked higher than the irrelevant ones. Based on these requirements, the choices for feature representations, prediction algorithms, and prediction metrics are substantially reduced, as we discuss in the following paragraphs.

For feature extraction, we focus on two classes of methods, namely uni-bigram and word2vec. Uni-bigram is a traditional feature extraction method for text data. Word2vec [33] is a recent distributed method which provides vector representations of words capturing semantic similarity between them. Both methods are fast and their feature representation is computed from the citation data, which are readily available. We avoided other feature extraction choices as they are either hard to compute or the information from which these feature values are computed is not readily available. For example, features like co-citations are hard to obtain. Another possible feature is the frequency of MeSH (Medical Subject Heading) terms. While the MeSH terms of a citation may be obtained from PubMed, this has some practical limitations, especially in an automated system like Rayyan. In particular, (i) PubMed does not necessarily cover all of the existing biomedical literature, and (ii) to obtain MeSH terms, one has to either provide the PMID (PubMed Article Identifier) of each reference, which is not always available, or use the title search API, which is an error-prone process since it could return more than one result. Thus, we avoided using this feature in our experiments. We also discarded the use of LDA (Latent Dirichlet Allocation) based feature generation and a matrix factorization based method [30], as an LDA based method is slow and there is no easy way to set some of its critical parameters such as the number of topics. Methods based on distributed representations (e.g., word2vec [33]) have been shown to perform well in practice compared to the matrix factorization based alternatives. Additionally, for distributed representation methods, Natural Language Processing (NLP) functions such as stemming (a heuristic method that chops word endings) and lemmatization (a morphological analysis method that finds the root form of a word) are usually not a requirement, as the methods automatically discover the relationship between inflected word forms (for example, require, requires, required) based on co-occurrence statistics and represent them closely in the vector space. For the Uni-Bi (unigram, bigram) feature representation, one can use lemmatization or stemming. However, one immediate problem with lemmatization is that it is actually quite difficult to perform accurately for very large untagged corpora, so often a simple stemming algorithm (such as the Porter Stemmer) is applied instead to remove the most common morphological and inflectional word endings from the corpus [5]. We nevertheless avoided such a preprocessing step, as it adds extra cost and, in our initial analysis, the performance difference was not statistically significant.

Since irrelevant citations are more common than relevant ones, abstract screening algorithms suffer from the data-imbalance issue. Among the binary classification methods, maximum-margin based methods (aka SVM) have become very popular in the abstract screening domain [39]. Therefore, we use SVM variants that are more likely to overcome the data-imbalance problem. Specifically, we use SVM^cost (cost-sensitive SVM) and SVM^perf (SVM with multivariate performance measures) with different loss functions for our evaluation.

Finally, various prediction metrics have been used in the abstract screening domain, which makes it difficult to compare models. A recent survey of current approaches in abstract screening [39] reported that among the 69 publications surveyed, 22 report performance using Recall, 18 using Precision, and 10 using AUC. The non-uniform usage of these metrics makes it difficult to draw conclusions about the performance of different methods. For example, if a particular work only reports Recall but not Precision, it is hard to understand its applicability. Among all the above metrics, the most common is the Area Under the ROC Curve (AUC), which can be computed from the ranking of the citations as provided by the classification model. Hence, to use AUC, the classification model for abstract screening needs to provide a ranked order of the citations. Among other ranking based metrics, AUPRC (area under the Precision-Recall curve) is also used extensively. Moreover, the authors of a recent work [40] suggested studying the variability of the reported metrics and advocated using a large number of repetitions (500 × 2) to ensure reproducibility. We also observed that most of the existing studies were performed on a small number of SRs, reported the evaluation with different sets of metrics, validated with a small number of repetitions, and mostly did not perform proper statistical significance tests which would provide a statistical guarantee of the superiority of a method over the others.

In the following, we describe the feature space representation, then the existing methods of automated abstract screening, and finally our proposed 5-star rating algorithm.
Feature Space Representation

We use two types of feature representation: (1) uni-bigram (Uni-Bi) and (2) word2vec (w2v). The Uni-Bi based feature representation uses the frequency of uni-grams and bi-grams in a document. Note that uni-grams and bi-grams are special cases of n-grams, the collection of all contiguous sequences of n words from a given text. Since the sum of the number of distinct words (uni-grams) and the number of distinct word pairs (bi-grams) over the document collection in a particular review task is generally large, the Uni-Bi feature representation is high-dimensional. Besides, it produces a sparse feature representation, i.e., a large number of entries in the feature matrix are zero. Such a high-dimensional and sparse feature representation is poor for training a classification model, so several alternative feature representations have been proposed to overcome this issue. Among them, word2vec [33] is a popular alternative. It adopts a distributed framework to represent the words in a dense d-dimensional latent feature space. The feature vector of each word in this latent space is learned by shallow neural network models, and the feature vector of a citation is then obtained by averaging the latent representations of all the words in that citation. It has been hypothesized that these condensed real-valued vector representations, learned from unlabeled data, outperform Uni-Bi based representations.

To learn the vector representation of words using word2vec [33] for our corpus, we train the model on all abstracts and titles available in Rayyan. We use Gensim [42] with the following parameters: the number of context words is 5, the minimum word count is 15, and the number of dimensions in the latent space is 500 (d = 500, chosen through validation). Thus, for each word in the set of all available words (w_i ∈ W), we learn a 500-dimensional latent space representation. After averaging the latent vectors of words, we obtain the latent vector of a citation, on which we apply two kinds of normalization: (1) instance normalization (row normalization) and (2) feature based normalization (column normalization). These normalizations give statistically significantly better results than using raw features for threshold agnostic metrics such as AUC and AUPRC. In instance normalization, we z-normalize the extracted features of each citation, x_{C_i}. For column normalization, we z-normalize each of the 500 dimensions. After both normalizations, we keep the feature values up to 2 decimal places to minimize the memory requirement of our system. Note that there exist some neural network based models such as sen2vec [31] which can learn the representation of a citation holistically instead of obtaining it by averaging the representations of the words in that citation. However, the sen2vec model is transductive in nature, i.e., for new citations, we need to execute a few passes over the trained model, which is time-consuming.

After learning the representation, for a limited number of words, we manually validate whether the latent space representation of a word captures the known semantic similarities of that word with various biomedical terms. Semantic similarities are computed using cosine similarity in the latent space following [33]. Our manual validation shows encouraging results. For example, the cosine similarity between "liver" and "cirrhosis" is 0.63, which is large considering the fact that the vectors have 500 dimensions. Also, the query "which words are related to cirrhosis in the same way breast and cancer are related" returns "liver" as one of the top-5 answers in the w2v feature representation.

Feature    Algorithm (Param.)         Id
Uni-Bi     SVM^perf (b=0, AUC)         1
           SVM^perf (b=1, AUC)         2
           SVM^perf (b=1, KLD)         3
           SVM^perf (b=1, QuadMean)    4
           SVM (Default)               5
           SVM^cost (J, b=0)           6
           SVM^cost (J, b=1)           7
           SVM Transduction           11
w2v row    SVM^perf (b=1, AUC)        21
           SVM^perf (b=1, KLD)        22
           SVM^perf (b=1, QuadMean)   23
           SVM^cost (J, b=0)          24
           SVM^cost (J, b=1)          25
w2v col    SVM^perf (b=1, AUC)        31
           SVM^perf (b=1, KLD)        32
           SVM^perf (b=1, QuadMean)   33
           SVM^cost (J, b=0)          34
           SVM^cost (J, b=1)          35

Table 1: Description of the algorithms used in the evaluation. We use the default regularization in all cases. J in SVM^cost is set to the ratio of the number of labeled irrelevant citations to the number of labeled relevant citations, |L−|/|L+|.
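For concreteness, the following sketch shows how such features could be produced, assuming scikit-learn and Gensim (version 4 or later) are available; the toy corpus, the helper function, and the variable names are ours and only illustrate the Uni-Bi extraction and the w2v averaging plus z-normalization described above, not Rayyan's actual implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

# Toy corpus standing in for the titles + abstracts of citations.
citations = [
    "randomized trial of statin therapy in liver cirrhosis",
    "observational study of breast cancer screening outcomes",
    "effect of exercise on blood pressure a randomized trial",
]

# (1) Uni-Bi features: raw frequencies of unigrams and bigrams (sparse, high-dimensional).
unibi = CountVectorizer(ngram_range=(1, 2)).fit_transform(citations)

# (2) w2v features: train word2vec, then average word vectors per citation.
#     The paper uses window=5, min_count=15, and 500 dimensions on the full Rayyan corpus;
#     min_count=1 is used here only so the toy corpus produces a vocabulary.
tokenized = [doc.split() for doc in citations]
w2v = Word2Vec(tokenized, vector_size=500, window=5, min_count=1, workers=2)

def citation_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([citation_vector(t, w2v) for t in tokenized])

# Row (instance) normalization: z-normalize each citation's vector.
X_row = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-12)
# Column (feature) normalization: z-normalize each latent dimension.
X_col = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
# Rounding to two decimals, as in the text, reduces the memory footprint.
X_row, X_col = np.round(X_row, 2), np.round(X_col, 2)
```

In the actual system the word2vec model is trained once over all Rayyan titles and abstracts, and the resulting citation vectors are what the classifiers in Table 1 consume.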
A recent study [37] reports that the Support Vector Machine (SVM) is the most used algorithm in abstract screening: it is used in 31% of the studies and in at least one experiment annually since 2006. Moreover, as discussed in Section 2, SVM-based methods are mostly used in both the data-imbalance and batch-mode active learning settings [18, 34]. We thus restrict our evaluation to existing SVM-based algorithms. SVM is a supervised classification algorithm which uses a set of labeled data instances and learns a maximum-margin separating hyperplane by solving a quadratic optimization problem. We should mention that the number of labeled citations can be very small at the start of a citation screening process and that the number of citations varies from a few hundred to a few thousand across reviews. Thus, we did not try any supervised deep-learning based technique in our evaluation.

We use three types of SVM methods: (1) inductive, (2) transductive, and (3) SVM^perf. Inductive SVM learns a hypothesis h induced by F(L). Transductive SVM [26] reduces the learning problem of finding an h ∈ H to a different learning problem where the goal is to find one equivalence class from the infinitely many induced by all the instances in F(U) and F(L). SVM^perf exploits the alternative structural formulation of the SVM optimization problem for conventional binary classification with error rate [28]. We use three different loss functions for the SVM^perf implementation: (i) AUC, (ii) Kullback-Leibler Divergence (KLD), and (iii) QuadMean Error [32]. Table 1 shows the different SVM-based algorithms used in our comparison along with their parameters and loss functions. We use the default regularization in all cases, and J in SVM^cost is set to the ratio of the number of labeled irrelevant citations to the number of labeled relevant citations, |L−|/|L+|. This ratio biases the hypothesis learner (the training algorithm) to penalize mistakes on the minority class (the relevant class) J times more than on the majority class (the negative class). In Table 1, we assign a distinct integer identifier to each of the algorithms. For example, the first row in Table 1 refers to a method identified by Id 1 that uses SVM^perf with b = 0 and AUC as the loss function. Comparison results among different methods (such as the results shown in Table 4) are presented compactly by referring to each method by its identifier instead of its name.

Making a comparison with the exact baseline algorithms proposed by various studies is not straightforward. 31% of the studies reported in [37] use SVM as their classifier; however, each one employs a different feature representation and a different SVM implementation. Thus, in this study we restrict our attention to a specific set of feature representations which are more suitable for our abstract screening platform, and use the linear kernel with different loss functions and the implementation provided by the author of each SVM algorithm.
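As an illustration of the cost-sensitive weighting described above, the following sketch computes J from a labeled set and trains a linear SVM that penalizes mistakes on the relevant class J times more; it uses scikit-learn's LinearSVC as a stand-in for the SVM^cost implementation used in the paper, so the synthetic data and variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy labeled set: +1 for relevant, -1 for irrelevant (heavily imbalanced).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))
y_train = np.array([1] * 10 + [-1] * 190)

# J = |L-| / |L+|: number of labeled irrelevant over labeled relevant citations.
J = np.sum(y_train == -1) / np.sum(y_train == 1)

# Errors on the minority (relevant) class cost J times more than on the majority class.
clf = LinearSVC(class_weight={1: J, -1: 1.0}, C=1.0)
clf.fit(X_train, y_train)

# Signed distances to the hyperplane are later used to rank the unlabeled citations.
scores = clf.decision_function(X_train)
```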
5-star Rating Algorithm

In our SR platform, we want to rank citations based on their graded relevance. The intuition is to help reviewers better manage their time. To this end, we rate the citations from 1 to 5 using our proposed RelRank algorithm. The citations with 5 stars are relevant with high probability and need more attention, whereas 1-starred documents are irrelevant with high probability and may need less attention. Among the 5 stars, we conceptualize 3 to 5 stars as relevant and 1 to 2 stars as irrelevant.

RelRank (Algorithm 1) uses an ensemble of SVM-based methods. We choose SVM^perf (b=1, AUC) with w2v features, SVM^cost (J, b=1) with w2v features, and SVM^cost (J, b=1) with Uni-Bi features because of their special characteristics observed in our initial evaluation of the methods on multiple datasets and over a range of metrics (described in Section 4.4). For instance, SVM^perf (b=1, AUC) outperforms the others on the AUC and Recall metrics. When evaluated on the F1 metric (the harmonic mean of Precision and Recall), SVM^cost (J, b=1) with w2v features produces the highest value. On the other hand, SVM^cost (J, b=1) with Uni-Bi features has the highest Precision. So, in RelRank, if all three methods agree on the relevance of a citation, the citation gets 5 stars; if two of them agree, it gets 4 stars. Within a particular star, citations are ranked based on their average ranks (described in Algorithm 2).

Algorithm 1: RelRank, a five-star rating algorithm using an ensemble of max-margin based methods
  Input:  L, labeled dataset; U, unlabeled dataset
  Output: scores S_i, 1 ≤ i ≤ |U|
  1: F_L, F_U ← GenerateFeature(L, U, feature = w2v row)
  2: h_1 ← Train(SVM^perf, F_L)
  3: h_2 ← Train(SVM^cost, F_L)
  4: S_{h_1} ← Predict(h_1, F_U)
  5: S_{h_2} ← Predict(h_2, F_U)
  6: F_L, F_U ← GenerateFeature(L, U, feature = Uni-Bi)
  7: h_3 ← Train(SVM^cost, F_L)
  8: S_{h_3} ← Predict(h_3, F_U)
  9: S ← GenerateCombinedScore(U, S_{h_1}, S_{h_2}, S_{h_3})
 10: return S
In RelRank, we first generate w2v row features for both sets L and U. Then, we train SVM^perf and SVM^cost on the w2v features to generate the first two hyperplanes, h_1 and h_2, respectively. We then compute scores for U based on the distances from h_1 and h_2. Using the Uni-Bi features with SVM^cost, we compute a third score S_{h_3} using the hyperplane h_3. Finally, we combine the three scores to generate a final score for U, which we formally present in Algorithm 2 (GenerateCombinedScore).

Algorithm 2: GenerateCombinedScore
  Input:  unlabeled dataset U; scores S_h[]; classifier thresholds T_h[]; separation threshold ST; max range MR
  Output: scores S_u, 1 ≤ u ≤ |U|
  1: n ← |U|
  2: FVOTE ← λ r . exp(−(|U| − r) / |U|)
  3: NORM ← λ s . range-normalize s into [1, MR]
  4: r_h ← GetRanksForASetOfElements(S_h)
  5: for u = 0 .. |U| − 1 do
  6:   nv ← S_h[0][u] > T_h[0] ? 1 : −1
  7:   rs ← r_h[0][u]
  8:   fv ← FVOTE(rs)
  9:   for v = 1 .. |h| − 1 do
 10:     rs_v ← r_h[v][u]
 11:     fv ← fv + FVOTE(rs_v)
 12:     rs ← rs + rs_v
 13:     nv ← S_h[v][u] > T_h[v] ? (nv > 0 ? nv + 1 : nv) : (nv < 0 ? nv − 1 : nv)
 14:   ns ← NORM(rs / |h|)
 15:   finv ← nv ≥ 1 ? fv : nv
 16:   S[u] ← finv * ST + ns
 17: return S

GenerateCombinedScore presents the already classified citations in a ranked manner based on the scores obtained by calculating the distance from the hyperplanes. The method takes the separation threshold (ST) and the max range (MR) as user-defined parameters. ST is used to define the separation between ratings; MR is used for range normalization of a rank score between 1 and MR. Step 1 gets the number of unlabeled citations. Steps 2 and 3 define FVOTE and NORM as lambda functions; both are used to transform ranks into a specified range, with FVOTE smoothing a rank into a bounded fractional value. In Step 4, we obtain ranks for each of the scores S_{h_i}, 1 ≤ i ≤ |h|; the higher the distance from the hyperplane for a particular citation, the higher its rank. Steps 5-16 calculate the final combined score. We iterate through all the scores in order. The very first hypothesis has the supreme power: if it predicts a particular citation as relevant, the citation is rated between 3 and 5 stars, and if it predicts it as irrelevant, the same citation gets 1 or 2 stars. In Step 6, we get the vote from the dominant classifier, i.e., if the score is greater than the classifier threshold then the vote count increases by 1. The rank score and the fractional vote are obtained in Steps 7 and 8. The number of votes (nv) only increases if both the dominant classifier and the current hypothesis vote the citation as relevant. On the other hand, the number of votes decreases if both of them agree on its irrelevancy. Finally, in Step 14, we get a normalized rank between 1 and MR. If the number of votes is greater than or equal to 1, the fractional vote is used as the final vote; otherwise, the negative votes are used unchanged. This ensures that even if a particular citation gets positive votes, it still needs a top rank to maintain its position in the rating. We use 1000 and 800 as the values of the parameters ST and MR, respectively.
We also use 0.0, 2000, and 2500 as the thresholds for RelRank (3-star), RelRank (4-star), and RelRank (5-star), respectively, as these thresholds performed the best during manual verification of the system's performance.
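A compact Python sketch of the score combination is given below. It follows the structure of Algorithm 2 under the assumptions stated in the comments (three score arrays, decision thresholds at 0, ST = 1000, MR = 800); the function names, the example values, and in particular the exact range-normalization mapping are ours and illustrative rather than Rayyan's production code.

```python
import numpy as np

def generate_combined_score(scores, thresholds, st=1000.0, mr=800.0):
    """Combine per-classifier decision scores into a single RelRank-style score.

    scores: list of 1-D arrays, one per classifier, scores[0] being the dominant one.
    thresholds: decision threshold per classifier (0.0 for all methods in the paper).
    """
    scores = [np.asarray(s, dtype=float) for s in scores]
    n = scores[0].size
    ranks = [np.argsort(np.argsort(s)) for s in scores]   # 0 = lowest score, n-1 = highest

    fvote = lambda r: np.exp(-(n - r) / n)                # smoothed fractional vote
    # One possible range normalization into [1, MR]; the exact mapping is an assumption.
    norm = lambda s: 1.0 + (s / max(n - 1, 1)) * (mr - 1.0)

    combined = np.zeros(n)
    for u in range(n):
        nv = 1 if scores[0][u] > thresholds[0] else -1    # dominant classifier's vote
        rs, fv = ranks[0][u], fvote(ranks[0][u])
        for v in range(1, len(scores)):
            rs += ranks[v][u]
            fv += fvote(ranks[v][u])
            if scores[v][u] > thresholds[v]:
                nv = nv + 1 if nv > 0 else nv             # agreement on relevance
            else:
                nv = nv - 1 if nv < 0 else nv             # agreement on irrelevance
        ns = norm(rs / len(scores))
        finv = fv if nv >= 1 else nv
        combined[u] = finv * st + ns
    return combined

# Stars are then assigned by thresholding the combined score, e.g. >= 2500 -> 5 stars,
# >= 2000 -> 4 stars, >= 0 -> 3 stars, and below 0 -> 1 or 2 stars.
```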
4. Experimental Results

In this section, we first present our experimental settings, then we describe our dataset and performance metrics, and finally present our experimental results.

Table 2: Dataset statistics, grouped by prevalence. All publicly available reviews start with C (Cohen) whereas reviews from our system start with P (Private). Rel. stands for relevant citations and Prev. for the prevalence of the dataset. For example, in dataset P18, the prevalence is 0.22%, as 5 (number of relevant citations) out of 2241 (total citations) is 0.0022.

We run all of our experiments on a computer with an Intel Xeon 2.6 GHz processor running the CentOS 6.7 operating system. For each dataset (described in the next section), we perform a 500 × 2 cross validation. In an n × k cross validation, we split the data into k blocks; each block in turn becomes the test block while the rest is used as training data, and this process is repeated n times. The split is carried out through stratification. So for 500 × 2, we split each dataset into two blocks, use each block once for training and once for testing, and repeat this process 500 times. According to [41], over the course of many iterations/repetitions (500 in our case) the average performance estimate for a given classifier may stabilize and produce "steady-state" classifier performance. It also produces substantially more repeatable results than using a small fixed number of iterations such as 5 or 10.

We now describe the parameter settings of our algorithms. For the inductive and transductive SVM methods, the default cost parameter, denoted as c, is computed as follows: the sum of the 2-norms of the feature vectors is divided by the number of instances in F(L) to generate b, and the fraction 1/(b · b) gives c. J in SVM^cost is set to the ratio |L−|/|L+| described in Section 3. Furthermore, we follow the recommendation for the error loss function from [27] to set the cost parameter for SVM^perf. The way we calculate the parameter values also avoids the need for a representative validation set, which is very hard to obtain in a systematic review platform. Note that for all the methods, we set the classification threshold to 0.

We use 61 reviews (datasets) for our experiments: 15 of them are publicly available from [9] and the other 46 reviews were collected from Rayyan. Together they represent 84K citations. In Table 2, we present our dataset statistics. All publicly available reviews start with C (Cohen) whereas reviews from Rayyan start with P (Private). In the table, we give three statistics about each review: the total number of citations (Total), the total number of relevant citations (Rel.), and the prevalence (the ratio of relevant to total citations, Prev(%)). The datasets of each group are sorted by their prevalence. For example, in the dataset P18, the prevalence is 0.22%, corresponding to the ratio of its 5 relevant citations out of 2241 citations. We divide the reviews into three prevalence groups depending on the ratio of included citations: (1) Low (starting at 0.22% prevalence), (2) Mid (starting at 5.79%), and (3) High (starting at 13.45%, up to about 40%). This division has two benefits: (i) it allows us to study the trend of the various performance metrics within each group, i.e., depending on the complexity of the task (the smaller the prevalence, the more complex the task), and (ii) it also makes our statistical tests robust against Type I error for data with frequent extreme values (the performance on some metrics varies wildly among different prevalence groups). We group reviews by the severity of the class imbalance (high, mid, low) while making sure that every group has a reasonable number of data samples (in our case, the number of datasets in each group is around 20).

Next, we describe our performance metrics. We use 11 metrics for the evaluation [39] (listed in Table 3). The first four measures (Recall, Precision, F-measure, and Accuracy) depend on a threshold and are widely used for evaluating automated abstract screening methods. In abstract screening, the highest cost is associated with false negatives (articles incorrectly identified as irrelevant), as these will not be manually reviewed and relevant evidence is omitted from the final decision (the systematic review). Therefore, Recall is expected to be very high in screening automation. AUC and AUPRC do not depend on thresholds and are common in binary classification problems with data imbalance. In an abstract screening platform, AUC and AUPRC are the second most important metrics, because a good ranking of documents makes it easier for users to screen them, i.e., they can mark a large batch of documents as relevant or irrelevant in a single operation. We also add two more metrics: (1) arithmetic mean error and (2) quadratic mean error. Both measure the loss in Recall in the relevant and irrelevant classes and thus are important for an abstract screening platform. Finally, for the active learning setting, we use Burden, Yield, and Utility, which respectively correspond to the fraction of the total number of citations that a human must screen, the fraction of relevant citations that are identified by a given screening approach, and a weighted sum of Yield and Burden. From an abstract screening platform perspective, we want to minimize Burden and maximize Yield.
Recall (Sensitivity): ratio of correctly predicted relevant citations to all relevant ones. Formula: TP / (TP + FN).
Precision: ratio of correctly identified relevant citations to all citations predicted as relevant. Formula: TP / (TP + FP).
F-Measure: combines Precision and Recall; for β = 1 it is their harmonic mean. Formula: (1 + β²) · Precision · Recall / (β² · Precision + Recall).
Accuracy: ratio of relevant and irrelevant citations predicted correctly to all citations. Formula: (TP + TN) / (TP + TN + FP + FN).
ROC (AUC): area under the curve obtained by graphing the true positive rate against the false positive rate; 1.0 is a perfect score and 0.5 is equivalent to a random ordering.
AUPRC: area under the Precision-Recall curve.
AM ER.: arithmetic mean of the loss in Recall of the relevant class (L_Rp = FN / (TP + FN)) and the irrelevant class (L_Rn = FP / (FP + TN)). Formula: (L_Rp + L_Rn) / 2.
QD ER.: quadratic mean (aka root mean square), which measures the magnitude of varying quantities; defined as the square root of the arithmetic mean of the squares of the loss in Recall of the relevant (L_Rp) and irrelevant (L_Rn) classes. Formula: sqrt((L_Rp² + L_Rn²) / 2).
Burden: the fraction of the total number of citations that a human must screen. Formula: (TP^L + TN^L + TP^U + FP^U) / N.
Yield: the fraction of relevant citations that are identified by a given screening approach. Formula: (TP^L + TP^U) / (TP^L + TP^U + FN^U).
Utility: a weighted sum of Yield and Burden, where the constant β represents the relative importance of Yield in comparison to Burden; following the suggestion from [54], we use β = 19 to give Yield 19 times more importance than Burden in our experimental evaluations. Formula: (β · Yield + (1 − Burden)) / (β + 1).

Table 3: The metrics used in our evaluation, with their definitions and formulas. TP, FP, TN and FN represent true positive, false positive, true negative and false negative counts, respectively; the superscripts L and U distinguish counts over the citations labeled by the reviewer from counts over the automatically classified (unlabeled) citations.
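As a quick reference for the less standard metrics in Table 3, the sketch below computes the AM/QD errors, Burden, Yield, and Utility from confusion-matrix counts; it is our own minimal illustration, with the labeled/unlabeled split of the counts passed in explicitly and the human-labeled portion assumed to be error-free.

```python
import math

def screening_metrics(tp_l, tn_l, tp_u, fp_u, fn_u, tn_u, fp_l=0, fn_l=0, beta=19):
    """Compute screening metrics from confusion counts.

    *_l: counts over citations labeled by the reviewer (assumed correct, so fp_l = fn_l = 0).
    *_u: counts over citations classified automatically.
    """
    tp, fp = tp_l + tp_u, fp_l + fp_u
    tn, fn = tn_l + tn_u, fn_l + fn_u
    n = tp + fp + tn + fn

    loss_pos = fn / (tp + fn)              # loss in recall on the relevant class
    loss_neg = fp / (fp + tn)              # loss in recall on the irrelevant class
    am_error = (loss_pos + loss_neg) / 2
    qd_error = math.sqrt((loss_pos ** 2 + loss_neg ** 2) / 2)

    burden = (tp_l + tn_l + tp_u + fp_u) / n          # fraction a human must screen
    yield_ = (tp_l + tp_u) / (tp_l + tp_u + fn_u)     # fraction of relevant citations found
    utility = (beta * yield_ + (1 - burden)) / (beta + 1)
    return am_error, qd_error, burden, yield_, utility

# Toy example: 5 relevant and 95 irrelevant citations in total.
print(screening_metrics(tp_l=2, tn_l=20, tp_u=2, fp_u=10, fn_u=1, tn_u=65))
```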
We have 18 different methods (Table 1) and 61 datasets. As we have partitioned the datasets into three prevalence groups, (1) low prevalence, (2) mid prevalence, and (3) high prevalence, we compare the methods based on their overall performance on the datasets of a particular prevalence group for a specific metric. Our goal is to generalize the findings to a larger population of datasets that could fall within one of our defined prevalence groups. To perform a statistical test, we follow Cohen et al. [12], who considered data and method as covariates to predict the performance metric, i.e., METRIC ∼ DATA + METHOD. To be more specific, the model resembles y = mx + c, where x is a variable and m, c are constants; in this model, DATA and METHOD mimic x and c, respectively. In our case, for a particular metric, we use the average over the 500 × 2 repetitions as the score of a method on a dataset, i.e., the multiple resampling from each dataset is used only to assess the performance score, not its variance, which is similar in spirit to [15]. The sources of variance are the differences in performance over (independent) datasets and not over (usually dependent) samples, so elevated Type I error is not an issue.

We fit the model (METRIC ∼ DATA + METHOD) using linear regression and perform a two-factor (DATA and METHOD) analysis of variance. This helps us identify whether there is any statistically significant difference among the methods, the datasets, and the interactions between the methods and the data. If the test shows a performance difference among the methods that is statistically significant, we compare pairs of methods using paired t-tests (post-hoc testing). Post-hoc testing with 18 methods leads to 18*17/2 possible comparisons, each of which needs correction for multiple testing. To avoid this large number of pairwise tests, we follow these steps:

1. Find the best method and compare it with all the remaining methods via paired t-tests, and then utilize the Linear Step Up (LSU) procedure (also known as the Benjamini-Hochberg procedure [3]) to control the False Discovery Rate (FDR). It has been shown in [15] that the Holm [25] and Hochberg [3] tests give practically equal results for post-hoc tests.
2. Separate the group of non-significant differences using LSU to control the FDR at the given alpha (α = 0.05):
   (a) Step 1: find the first k such that p(k) ≤ pos(k) ∗ α / m. Here, m is the total number of methods that are tested against the best method in a rank group and p(1), ..., p(k) are the corresponding ordered p-values, i.e., p(1) ≥ p(2) ≥ p(3) ≥ ... ≥ p(m); pos(k) is the position of a method in the ascending order of p-values. For example, pos(1) is m.
   (b) Step 2: if such a k exists, group the first k−1 methods associated with p(1), ..., p(k−1) into a single group with the best method, as they do not have any statistically significant difference with the best method.
3. The group of methods from Step 2 is isolated and the pairwise comparison is then repeated from Step 1 with the remaining methods. The process continues until no more methods are left for pairwise comparison.

The exact p-values are used inside the LSU procedure to control the FDR; we report only the minimum and maximum p-values used for rejection (i.e., for falling into a different group), as the number of p-values generated by the pairwise comparisons is enormous. For example, if Step 2 yields 4 methods that are not worse than the best method by the desired statistically significant margin, these methods and the best method form the first ranked group (rg = 1). Repeating the procedure yields successive ranked groups (rg) such that, within each group, the performance of all the methods is statistically similar to the performance of the best method within that group, and across two groups the best method of a superior group is better than all the methods in the inferior group by a statistically significant margin.

Table 4 shows the comparison results. Along the rows and the columns, we list respectively the 11 performance metrics and the three prevalence groups (low, mid, and high). The methods that perform the best for a given metric within a prevalence group are organized within sub-columns of ranked groups, created by the statistical test with α = 0.05 and labeled rg = 1, rg = 2, and rg = 3. Within a ranked group, the methods are listed by their identifiers (as shown in Table 1). For each rank group, metric, and prevalence range, we report a representative metric value in brackets. Our detailed results are available online (https://tksaha.github.io/paper_resources/). We also provide the p-values of all the metrics corresponding to Table 4 for our statistical test in Table 5. These statistics give some insight into how conservative the LSU method is in selecting two methods to put into the same group (see Section 4.3 for details). For example, for the Precision metric, any p-value greater than 0.3 may put two methods into the same group for some rank group; this is because of the LSU procedure described earlier in Section 4.3.
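The procedure above can be sketched in a few lines with statsmodels and SciPy; this is our own illustrative outline of the METRIC ∼ DATA + METHOD model, the post-hoc paired t-tests against the best method, and the Benjamini-Hochberg (LSU) grouping, with hypothetical data-frame column names.

```python
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests

def ranked_groups(df, alpha=0.05):
    """df columns (assumed): 'dataset', 'method', 'score', one averaged score per pair.

    Assumes a higher score is better; invert the sign for error metrics.
    """
    # Two-factor ANOVA: METRIC ~ DATA + METHOD.
    model = smf.ols("score ~ C(dataset) + C(method)", data=df).fit()
    print(anova_lm(model, typ=2))

    wide = df.pivot(index="dataset", columns="method", values="score")
    groups, remaining = [], sorted(df["method"].unique())
    while remaining:
        best = max(remaining, key=lambda m: wide[m].mean())
        others = [m for m in remaining if m != best]
        if not others:
            groups.append([best])
            break
        pvals = [stats.ttest_rel(wide[best], wide[m]).pvalue for m in others]
        # Benjamini-Hochberg (LSU): methods that are NOT rejected stay with the best method.
        reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
        group = [best] + [m for m, r in zip(others, reject) if not r]
        groups.append(group)
        remaining = [m for m in remaining if m not in group]
    return groups
```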
As a convenience to the reader, note that the methods that use the Uni-Bi, w2v row, and w2v col features have identifiers in the ranges 1-11, 21-25, and 31-35, respectively.

We first discuss the best methods for the metrics that depend on a threshold. Results for these metrics are shown in the first four rows and the last five rows of Table 4. We refer to the best methods by their names, with their identifiers in parentheses. For Precision, SVM^cost (7) outperforms the other methods in the low and mid prevalence groups, achieving around 40% and 58%, respectively; for the high prevalence group, it falls in the second tier and obtains a Precision value of 60%. SVM Default (5) also performs very well, with a second-tier rank in the low prevalence group and a first-tier rank in the mid and high prevalence groups; it achieves around 25%, 58%, and 74% Precision for the low, mid, and high prevalence groups, respectively. For Recall, w2v row with SVM^perf (AUC) (21) is in the top position in all three prevalence groups, obtaining around 96%, 97%, and 98% for the low, mid, and high prevalence groups, respectively. SVM^trans with Uni-Bi (11) and
SVM^cost with w2v row (25) are the top performers for the F1 metric. These methods achieve around 34%, 46%, and 58% in F1 for the low, mid, and high prevalence groups, respectively.

Table 4: Grouped results of all the methods for the metrics of Table 3 (rg stands for rank group), along with representative values in brackets. For ease of reading, we recall the methods described in Table 1. Method 1: Uni-Bi features with SVM^perf (b=0, AUC). Methods 2, 3, and 4: Uni-Bi features with SVM^perf (b=1) and the AUC, KLD, and QuadMean loss functions, respectively. Method 5: SVM (Default). Methods 6 and 7: Uni-Bi features with SVM^cost with b=0 and b=1, respectively. Method 11: SVM transduction. Methods 21, 22, and 23: w2v row features with SVM^perf (b=1) and the AUC, KLD, and QuadMean loss functions, respectively. Methods 24 and 25: w2v row features with SVM^cost with b=0 and b=1, respectively. Methods 31, 32, and 33: w2v col features with SVM^perf (b=1) and the AUC, KLD, and QuadMean loss functions, respectively. Methods 34 and 35: w2v col features with SVM^cost with b=0 and b=1, respectively.
Metric       Low prevalence (min., max.)   Mid prevalence (min., max.)   High prevalence (min., max.)
Precision    (3.68e-07, 0.03)              (1.17e-14, 0.03)              (3.63e-14, 0.03)
Recall       (7.53e-22, 0.04)              (1.24e-20, 0.007)             (2.09e-14, 0.02)
F-measure    (6.21e-09, 0.015)             (4.94e-12, 0.03)              (2.12e-11, 0.03)
Accuracy     (6.33e-22, 0.02)              (1.36e-21, 0.04)              (1.98e-14, 0.04)
AUC          (5.19e-09, 0.03)              (4.93e-11, 0.03)              (1.58e-11, 0.03)
AUPRC        (1.44e-09, 0.03)              (1.23e-10, 0.03)              (1.44e-09, 0.03)
AM ER.       (1.26e-15, 0.03)              (1.76e-15, 0.04)              (9.75e-12, 0.016)
QD ER.       (1.15e-20, 0.03)              (2.03e-16, 0.02)              (2.06e-12, 0.02)
Yield        (8.14e-22, 0.04)              (1.24e-20, 0.007)             (2.09e-14, 0.02)
Burden       (2.39e-26, 0.01)              (1.66e-25, 0.03)              (7.40e-16, 0.03)
Utility      (8.40e-22, 0.04)              (1.66e-20, 0.006)             (3.23e-14, 0.02)

Table 5: Minimum and maximum p-values of all the metrics corresponding to Table 4. These statistics give some insight into how conservative we are when putting two statistically indistinguishable methods into the same group (see Section 4.3 for details). For example, for the Precision metric, a p-value greater than 0.3 may put two methods into the same group for some rank group, even though we choose α = 0.05; this is because of the Linear Step Up (LSU) procedure described in Section 4.3.
For Accuracy, SVM^cost (b=1) with Uni-Bi (7) performs the best, followed by SVM^trans with Uni-Bi (11). SVM^cost (b=1) with Uni-Bi (7) achieves 97%, 91%, and 85% Accuracy in the low, mid, and high prevalence groups, respectively. This decreasing trend arises because the lower the prevalence, the easier it is for a classifier to achieve a high accuracy simply by predicting the dominating class for all instances. SVM^cost with w2v row normalized features (24 and 25) ranks first in the Arithmetic Mean (AM) and Quadratic Mean (QM) errors, respectively. Again, w2v row with SVM^perf (AUC) (21) ranks first in the Yield and Utility metrics, while SVM Default (5) is at the top of the list for the Burden metric. SVM^perf (AUC) (21) achieves around 98% Yield and 94% Utility for all the prevalence groups.

We now analyze the performance of the methods on the threshold-agnostic metrics. For AUC, w2v row with SVM^perf (AUC) (21) performs the best; SVM^cost (b=1) with w2v row normalized features (25) also performs well. Both methods achieve around 86% AUC in all three prevalence groups. For AUPRC, SVM^perf with (b=0, AUC) (1) and (b=1, AUC) (2) along with the Uni-Bi features are the top performing methods; they obtain 33%, 47%, and 62% in the low, mid, and high prevalence groups, respectively.
Figure 2: Active learning experiment with a variable test set. The results are shown for the three groups of prevalence values; prevalence is the ratio of the number of relevant citations to the total number of citations. The X-axis represents the percentage of the total citations that must be screened to retrieve all relevant citations for a particular review, and the Y-axis shows the number of reviews.
In this experiment, we use the SVM^perf (b=1, AUC) (21) method, as it is the top performing method in terms of the AUC metric in all three prevalence groups (Table 4). For training, we choose 5 relevant and 45 irrelevant citations uniformly at random from the entire set and then learn a hyperplane h. We calculate the distance (score) from h for each of the unlabeled instances and rank them based on this score. We choose the top-50 of the ranked citations to retrain the model along with the existing labeled citations. We repeat this experiment 500 times and take the average. The goal is to answer the following question: if a user labels the 50 top ranked citations per batch, what percentage of the total citations does the user have to screen to find all the relevant ones? Figure 2 shows the results of this experiment. As before, the results are shown for the three groups of prevalence values. As shown in Figure 2, for the low prevalence group, out of 20 reviews, 7 reviews (the first four bars of the top histogram) need 40% of the total citations to be screened to get all the relevant citations, 12 reviews need around 50% to 70% of the citations, and 1 review needs more than 90%. For the mid and high prevalence groups, 9 out of 20 and 11 out of 21 reviews, respectively, need around 80% to 90% of the citations to be screened to get all the relevant citations.
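The batch procedure above can be written as a short loop; the sketch below is our own illustration, again using scikit-learn's LinearSVC as a stand-in for SVM^perf, and assuming a feature matrix X and labels y (+1 relevant, -1 irrelevant, with at least 5 relevant and 45 irrelevant citations) are already built.

```python
import numpy as np
from sklearn.svm import LinearSVC

def screening_simulation(X, y, seed_pos=5, seed_neg=45, batch=50, rng=None):
    """Certainty-based screening loop: label the top-ranked batch, retrain, repeat.

    Returns the fraction of citations screened before every relevant one is found.
    """
    rng = rng or np.random.default_rng(0)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    labeled = set(rng.choice(pos, seed_pos, replace=False)) | \
              set(rng.choice(neg, seed_neg, replace=False))
    while True:
        found = sum(y[i] == 1 for i in labeled)
        if found == len(pos):                          # all relevant citations retrieved
            return len(labeled) / len(y)
        idx = sorted(labeled)
        clf = LinearSVC().fit(X[idx], y[idx])
        unlabeled = np.array([i for i in range(len(y)) if i not in labeled])
        scores = clf.decision_function(X[unlabeled])
        top = unlabeled[np.argsort(-scores)[:batch]]   # most certainly relevant first
        labeled.update(top.tolist())
```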
RelRank

Results in Section 4.4 show that there is not a single algorithm which wins on all the metrics and prevalence groups. Our aim here is to choose an ensemble of algorithms that captures as much Precision as possible while maintaining a reasonably good Recall as we go up in the starring threshold. For this, we choose SVM^cost (J, b=1) with Uni-Bi features [Method 7], SVM^perf (AUC, b=1) [Method 21], and SVM^cost (J, b=1) with w2v features [Method 25]. The choice of these algorithms is guided by the characteristics they exhibit on different metrics (see the next paragraph) and by their generalized performance over the different prevalence groups, measured through a well defined statistical test. From Table 4, we observe that Method 7 has the best performance for Precision in the low and mid prevalence groups and also performs reasonably well on the Burden metric. Both Methods 5 and 7 are candidates here; however, we choose Method 7 because the algorithm takes the data imbalance ratio into account (through its J parameter). Method 21 performs the best on many metrics such as Recall, AUC, AUPRC, Yield, and Utility. Method 25 also performs the best across many metrics (see Table 4), such as F-measure, AUC, AUPRC, AM ER., and QD ER.

Figure 3: Performance of SVM^cost (J, b=1) with Uni-Bi features [Method 7], SVM^perf (AUC, b=1) [Method 21], and SVM^cost (J, b=1) with w2v features [Method 25] on the Precision, Recall, AUC, and AUPRC metrics over the three prevalence groups.

Figure 3 shows the generalized performance of the three algorithms used in our ensemble method over the three prevalence groups in terms of Precision, Recall, AUC, and AUPRC. In AUPRC, all three methods show increasing performance as the prevalence increases. For Methods 21 and 25, AUC is consistent across the prevalence groups, while Method 7 has a rising trend in AUC as the prevalence increases. For Recall, Method 21 surpasses the other two methods by a wide margin; we therefore consider it the dominant classifier. The other two methods show an increasing Recall trend as the prevalence of the dataset increases. For Precision, Method 7 has the highest value, and its Precision increases with the prevalence of the dataset (see Figure 3(a)); the two other methods have much lower Precision than Method 7, although their Precision also increases with prevalence.

Table 6: Performance of RelRank at different starred thresholds. The lower the rank group, the better the performance. The three entries in the brackets indicate the rank group for the ⟨Low, Mid, High⟩ prevalence groups.

Table 7: Performance values of RelRank at different starred thresholds. For Precision, Recall, AUC, AUPRC, Burden, Utility, and Yield, the higher the value, the better the performance; for the AM and QM errors, the lower the better. The three entries in the brackets indicate the value of the metric for the ⟨low, mid, high⟩ prevalence groups.

For an SR platform, we need easy to consume predictions, i.e., the citations should be sorted based on their relevance (as discussed in the Introduction). The most important metrics in our case are AUC, AUPRC, Recall, and Utility. We evaluate our 5-star algorithm in different cumulative star groups. Table 6 shows the obtained results; we also present the average performance of each metric in the different prevalence groups and starring thresholds in Table 7. We observe that our 5-star algorithm performs similarly to Method 21 for the citations receiving 3 stars or above. For the Precision metric, the 5-star algorithm performs better than Method 21, as it falls in rank group 7 whereas Method 21 falls in group 9. This is because our algorithm is a combination of Method 21 with two other methods that have better Precision performance.
RelRank (3-star) to
RelRank (4-star) and then to
RelRank (5-star); the corresponding Precision values are 22.98%, 31.43%, and 41.62%, respectively. For the mid prevalence group, the increase is around 4-6%, and for the low prevalence group, it is around 2%. Similarly, for the Recall metric,
RelRank (3-star) is in the top group whereas
RelRank (4-star) and
RelRank (5-star) are in the second-best group. As Precision increases, Recall decreases as we go up in the rating. For the high prevalence group, Recall drops from 98.43% to 93.82% and then to 84.10%. For the low and mid prevalence groups, we see a similar drop of 3-7%. Interestingly, for both AUC and AUPRC,
RelRank falls in the top group (see Tables 4 and 7; this is not the case when the algorithms are used alone), which makes our ensemble algorithm more suitable for SR applications. Our algorithm achieves around 87% AUC in all prevalence groups and around 37%, 50%, and 64% AUPRC for the low, mid, and high prevalence groups, respectively. We see a similar behavior for Utility. Burden drops from around 90% to 78% and then to 68%, going from
RelRank (3-star) to
RelRank (4-star) and then to
RelRank (5-star) in all prevalence groups, as expected. However, Yield, which represents the fraction of relevant citations that are identified by a given screening approach (i.e., RelRank), does not drop much as we go up in the star rating. It only drops by around 2-4%, which is an important property of our algorithm: the rate at which RelRank finds relevant citations does not decrease rapidly. For the AM and QD errors, which take into account the loss in Recall in both the relevant and irrelevant classes, we see a drop of around 10-12% from
RelRank (3-star) to
RelRank (4-star) to
RelRank (5-star). This is due to the Precision/Recall trade-off. With RelRank (3-star), we achieve the highest Recall and the lowest Precision on the positive class, which indicates that we lose Recall on the irrelevant class. This is why the AM and QM errors are very high (the AM error is around 40% and the QM error around 50%) for RelRank (3-star).
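To make the cumulative star-group evaluation above concrete, the following sketch (in Python, with hypothetical data; the actual star assignments come from RelRank) computes Precision and Recall when every citation rated at or above a given star threshold is treated as predicted-relevant, mirroring the 3-star, 4-star, and 5-star groups of Tables 6 and 7.

```python
from typing import List, Tuple

def precision_recall_at_stars(stars: List[int], labels: List[int],
                              threshold: int) -> Tuple[float, float]:
    """Treat citations with >= `threshold` stars as predicted relevant and
    compute Precision/Recall against the reviewer labels (1 = relevant)."""
    tp = sum(1 for s, y in zip(stars, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(stars, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(stars, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical example: 5-star ratings and gold labels for six citations.
stars = [5, 4, 3, 3, 2, 1]
labels = [1, 1, 0, 1, 0, 0]
for t in (3, 4, 5):  # the cumulative star-groups used in Tables 6 and 7
    p, r = precision_recall_at_stars(stars, labels, t)
    print(f"{t}-star and above: precision={p:.2f}, recall={r:.2f}")
```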
5. Discussion
In our study of the different methods, we observed that almost always there is a method that ranks first in the three prevalence groups. However, there is no single dominant method across all metrics; various methods perform well on different prevalence groups and for different metrics. For instance, w2v row with SVM^perf (b=1, AUC) (Method 21) seems to be a good choice, outperforming the other methods in five metrics. The method achieves around 97% Recall and 87% AUC across the different prevalence groups and datasets. However, it is not present in any of the equivalence groups for a few other metrics. The method gives around 4% Precision in the low prevalence group, 11.75% in the mid prevalence group, and 23% in the high prevalence group. The same behavior is seen for the other metrics on which it performed poorly. This is because the method has high Recall but low Precision and, as a consequence, is penalized heavily by metrics such as AUPRC and Burden. This comprehensive study and the subsequent analysis on our abstract screening platform Rayyan suggest that a holistic "composite" strategy is a better choice in a real-life abstract screening system.

We proposed such a composite method, called
RelRank, in Section 3.4 and showed that it performs well across many metrics. This is not surprising, since each of the baseline methods has been designed to meet a certain objective, whereas abstract screening is a complicated process in which a "good" method should be able to optimize several metrics simultaneously, such as Recall, AUC, and Utility; this can be achieved by a composite strategy like RelRank. We also see a similar behavior in a manual verification over a set of seven reviews. As Table 6 shows, we fall short on some other metrics, namely those that directly depend on Precision. However, from our user surveys, we realized that if the ranking is very good, then a user can generally exclude many citations from the bottom without much effort, and the prediction makes more sense. Therefore, we put a strong emphasis on ranking in our combined algorithm.
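As a rough illustration of such a composite, ranking-oriented strategy, the sketch below combines the decision scores of three classifiers into a star rating by simple voting and uses the stars for ranking. The thresholding rule and score combination shown here are illustrative assumptions, not the exact RelRank procedure of Section 3.4.

```python
import numpy as np

def star_rating(scores_a, scores_b, scores_c, vote_threshold=0.0):
    """Combine three per-citation decision scores into a 1-5 star rating.

    Each classifier casts a 'relevant' vote when its decision score exceeds
    `vote_threshold`.  Citations with more votes get more stars; zero-vote
    citations with very low average scores are demoted to a single star.
    This is an illustrative voting scheme, not the exact RelRank rule.
    """
    scores = np.vstack([scores_a, scores_b, scores_c])   # shape (3, n)
    votes = (scores > vote_threshold).sum(axis=0)         # 0..3 votes
    mean_score = scores.mean(axis=0)
    stars = votes + 2                                      # 0..3 votes -> 2..5 stars
    stars[(votes == 0) & (mean_score < np.percentile(mean_score, 25))] = 1
    return stars

# Hypothetical decision scores from the three SVM variants for five citations.
a = np.array([ 1.2,  0.4, -0.3, -1.1, -2.0])
b = np.array([ 0.8,  0.1, -0.5, -0.9, -1.5])
c = np.array([ 0.9, -0.2,  0.2, -1.3, -1.8])
print(star_rating(a, b, c))   # e.g. [5 4 3 2 1]
```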
Figure 4: The pattern of decrease in the number of relevant citations left across screening iterations for a particular review.
We also analyzed the performance of an abstract screening system in an active learning setting. We observe that the three prevalence groups exhibit similar characteristics, i.e., a few of the relevant citations impose the burden of reviewing almost all of the citations before they are found. In a separate experiment (in addition to the experiments reported in the results section), we further analyzed the inclusion behavior of a particular random run of review 1. Figure 4 shows the results of this experiment. For a random run, around 58% (1500 out of 2544) of the documents have to be screened before the final relevant citation is found. However, looking into the details of the experiment, we see that the run finds all but this final citation after screening only 400 out of 2544 documents, which is around 15% of the total citations. The last citation surprisingly takes 22 more iterations. With the help of a domain expert, it would be interesting to see whether this final citation has outlying characteristics. We see a similar behavior for some other reviews. In our opinion, in addition to capturing relevance information, a holistic model for abstract screening may also need a component that detects and flags such outlying citations; we leave this as future work.
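The active learning behavior discussed above can be reproduced in outline with a small simulation: repeatedly train a linear SVM on the citations labeled so far, rank the remaining ones, label the next batch from the top, and track how many relevant citations are still undiscovered. The snippet below is a minimal sketch on synthetic data, with an assumed batch size and seed set; it is not the exact experimental protocol used for Figure 4.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic review: 2000 citations, ~5% relevant, 50 TF-IDF-like features.
X = rng.normal(size=(2000, 50))
y = (rng.random(2000) < 0.05).astype(int)
X[y == 1] += 0.8                      # make relevant citations roughly separable

labeled = list(rng.choice(2000, size=50, replace=False))   # seed labels
batch_size = 50
while len(labeled) < len(y):
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    if y[labeled].sum() in (0, len(labeled)):
        picked = unlabeled[:batch_size]          # need both classes to train
    else:
        clf = LinearSVC(C=1.0).fit(X[labeled], y[labeled])
        scores = clf.decision_function(X[unlabeled])
        picked = unlabeled[np.argsort(-scores)[:batch_size]]  # screen top-ranked
    labeled.extend(picked.tolist())
    remaining = y.sum() - y[labeled].sum()
    print(f"labeled={len(labeled):4d}  relevant left={remaining}")
    if remaining == 0:
        break
```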
Our study specifically targets studying and improving abstract screening in a practical systematic review platform like Rayyan, which serves thousands of users. Therefore, this study is quite different from the standalone evaluation of single abstract screening tasks, as the method for abstract screening has to generalize over thousands of reviews from thousands of users. For this reason, we have carefully avoided feature representation techniques such as co-citations and MeSH (Medical Subject Headings), which are not readily available, even though some previous works have shown better performance with those features. Furthermore, according to the study in [37], many methods other than Support Vector Machines (SVM) are used as classification algorithms for abstract screening. However, in this study, we restricted our attention to SVM-based methods only. Our proposed 5-star method also combines three of the SVM methods. One of the reasons for this choice, from the system perspective, is that SVM methods require only a small amount of memory to store models and hence allow for storage savings. For example, for
SVM^perf, we need to store only the feature importance values.
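As an illustration of this storage argument, a linear SVM can be served from nothing more than its weight vector and intercept. The sketch below uses scikit-learn's LinearSVC as a stand-in for SVM^perf (an assumption for illustration); scoring a citation then reduces to a dot product with the stored weights.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Train any linear SVM; only the weight vector and intercept need persisting.
X_train = np.random.rand(200, 1000)          # e.g. bag-of-words features
y_train = np.random.randint(0, 2, size=200)
model = LinearSVC().fit(X_train, y_train)

weights = model.coef_.ravel()                # one float per feature
intercept = float(model.intercept_[0])
np.savez_compressed("svm_weights.npz", w=weights, b=intercept)

# Scoring later needs only a dot product with the stored weights.
stored = np.load("svm_weights.npz")
scores = X_train @ stored["w"] + stored["b"] # ranking scores for citations
```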
6. Conclusion
In this paper, we studied the most popular classification methods employed in abstract screening for systematic reviews, focusing on the algorithms that best fit the constraints of a real-world system like Rayyan. We found that there is no single "winner" approach, i.e., various methods performed well on different prevalence groups and for different metrics. For instance, w2v row with SVM^perf (b=1, AUC) outperformed all the other studied methods in five metrics but not in a few others. We also observed that, in an active learning setting, a substantial portion of the included citations is discovered within a few iterations; however, one or two citations show outlying behavior, and the method requires more iterations to retrieve them. We also presented an ensemble method, RelRank, that combines three of the studied methods. Our approach converts their scores into a 5-star rating system through a voting mechanism and ranks citations efficiently based on their graded relevance.
References

[1] Aggarwal, C. C., 2014. Data classification: algorithms and applications. CRC Press.

[2] Ambert, K., 2010. A prospective evaluation of an automated classification system to support evidence-based medicine and systematic review 2010, 121.

[3] Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

[4] Benjamini, Y., Krieger, A. M., Yekutieli, D., 2006. Adaptive linear step-up procedures that control the false discovery rate. Biometrika 93 (3), 491–507.

[5] Bullinaria, J. A., Levy, J. P., Sep 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods 44 (3), 890–907.

[6] Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C., 2009. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Advances in Knowledge Discovery and Data Mining. pp. 475–482.

[7] Chalmers, T. C., Smith, H., Blackburn, B., Silverman, B., Schroeder, B., Reitman, D., Ambroz, A., 1981. A method for assessing the quality of a randomized control trial. Controlled Clinical Trials 2 (1), 31–49.

[8] Chen, C., Liaw, A., Breiman, L., 2004. Using random forest to learn imbalanced data. University of California, Berkeley.

[9] Cohen, A. M., 2008. Optimizing feature representation for automated systematic review work prioritization. In: AMIA Annual Symposium Proceedings. Vol. 2008. p. 121.

[10] Cohen, A. M., Ambert, K., McDonagh, M., 2012. Studying the potential impact of automated document classification on scheduling a systematic review update. BMC Medical Informatics and Decision Making 12 (1), 1.

[11] Cohen, A. M., Hersh, W. R., Peterson, K., Yen, P.-Y., 2006. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association 13 (2), 206–219.

[12] Cohen, A. M., M. S., 2008. RMEQ: A tool for computing equivalence groups in repeated measures studies. In: Linking Literature, BioLINK 2008 Workshop.

[13] Culotta, A., McCallum, A., 2005. Reducing labeling effort for structured prediction tasks. In: AAAI. pp. 746–751.

[14] Dagan, I., Engelson, S. P., 1995. Committee-based sampling for training probabilistic classifiers. In: Proceedings of the Twelfth International Conference on Machine Learning. pp. 150–157.

[15] Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan), 1–30.

[16] Elkan, C., 2001. The foundations of cost-sensitive learning. In: International Joint Conference on Artificial Intelligence. Vol. 17. pp. 973–978.

[17] Esuli, A., Sebastiani, F., 2015. Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data (TKDD) 9 (4), 27.

[18] Fu, J., Lee, S., 2011. Certainty-enhanced active learning for improving imbalanced data classification. In: Data Mining Workshops (ICDMW). IEEE, pp. 405–412.

[19] Gabriel, M., Paskach, C., Sharpe, D., 2013. The challenge and promise of predictive coding for privilege. In: ICAIL 2013 DESI V Workshop.

[20] Grossman, M. R., Cormack, G. V., 2013. The Grossman-Cormack glossary of technology-assisted review. Federal Courts Law Review 7 (1).

[21] Grossman, M. R., Cormack, G. V., 2010. Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review. Rich. JL & Tech. 17, 1.

[22] Han, H., Wang, W.-Y., Mao, B.-H., 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. pp. 878–887.

[23] Hanneke, S., 2014. Theory of active learning.

[24] Henry, D. W., 2015. Predictive coding: Explanation and analysis of judicial impact and acceptance compared to established e-commerce methodology.