A large scale study of SVM based methods for abstract screening in systematic reviews
Tanay Kumar Saha^a,*, Mourad Ouzzani^b, Hossam M. Hammady^b, Ahmed K. Elmagarmid^b, Wajdi Dhifli^c, Mohammad Al Hasan^a

^a Indiana University Purdue University Indianapolis, Indianapolis, IN 46202, USA.
^b Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar.
^c University of Lille, Faculty of Pharmaceutical and Biological Sciences, EA2694, BioMathematics Lab, 3 rue du Docteur Laguesse, 59006 Lille, France.
Abstract
A major task in systematic reviews is abstract screening, i.e., excluding hundreds or thousands of irrelevant citations returned from one or several database searches based on titles and abstracts. Most of the earlier efforts on studying systematic review methods for abstract screening evaluate the existing technologies in isolation and based on findings reported in the published literature. In general, there is no attempt to discuss and understand how these technologies would be rolled out in an actual systematic review system. In this paper, we evaluate a collection of commonly used abstract screening methods over a wide spectrum of metrics on a large set of reviews collected from Rayyan, a comprehensive systematic review platform. We also provide an equivalence grouping of the existing methods through a solid statistical test. Furthermore, we propose a new ensemble algorithm for producing a 5-star rating for a citation based on its relevance in a systematic review.

In our comparison of the different methods, we observe that almost always there is a method that ranks first in the three prevalence groups. However, there is no single dominant method across all metrics: various methods perform well on different prevalence groups and for different metrics. Thus, a holistic "composite" strategy is a better choice in a real-life abstract screening system. We indeed observe that the proposed ensemble algorithm combines the best of the evaluated methods and outputs an improved prediction. It is also substantially more interpretable, thanks to its 5-star based rating.
Keywords:
Abstract Screening Platform, Systematic Review

* Corresponding author. This work was done while the author was at QCRI.
Email addresses: [email protected] (Tanay Kumar Saha), [email protected] (Mourad Ouzzani), [email protected] (Hossam Hammady), [email protected] (Ahmed K. Elmagarmid), [email protected] (Wajdi Dhifli), [email protected] (Mohammad Al Hasan)
1. Introduction
Randomized controlled trials (RCTs) constitute a key component of medical research and are by far the best way of achieving results that increase our knowledge about treatment effectiveness [7]. Because of the large number of RCTs that are being conducted and reported by different medical research groups, it is difficult for an individual to glean summary evidence from all the RCTs. The objective of systematic reviews is to overcome this difficulty by synthesizing the results from multiple RCTs.

Systematic reviews (SRs) involve multiple steps. First, systematic reviewers formulate a research question and search multiple biomedical databases. Second, they identify relevant RCTs based on abstracts and titles (abstract screening). Third, based on the full texts of a subset thereof, they assess their methodological qualities, extract different data elements, and synthesize them. Finally, they report the conclusions on the review question [50]. However, the precision of the search step of an SR process is usually low, and in most cases it returns a large number of citations, from hundreds to thousands. SR authors have to manually screen each of these citations, making such a task tedious and time-consuming. Therefore, a platform that can automate the abstract screening process is fundamental in expediting the systematic review process. To fulfill this need, we have built Rayyan (https://rayyan.qcri.org/), a web and mobile application for SR [38]. As of November 2017, Rayyan users exceeded 8,000 from over 100 countries, conducting hundreds of reviews totaling more than 11M citations.

A primary objective of Rayyan is to expedite the abstract screening process. The first step of the abstract screening process is to upload a set of citations obtained from searches in one or more databases. Once citations are processed and deduplicated, Rayyan provides an easy-to-use interface to the systematic reviewers so that they can label the citations as relevant or irrelevant. As soon as the systematic reviewer labels a certain amount of citations from both classes (relevant and irrelevant), Rayyan runs its 5-star rating algorithm and sorts the unlabeled citations based on their relevance. As the reviewer continues to label the citations, the system may trigger multiple runs of the 5-star rating algorithm. The process ends when all the citations are labeled. Therefore, one of the design requirements of the system is the ability to quickly give users feedback on the relevance of studies. Figure 1 shows a pictorial depiction of our system.

Earlier efforts on studying systematic review methods in the biomedical [39] and software engineering [37] domains evaluate the existing technologies for abstract screening from the findings presented in the published literature. As pointed out by [39], these evaluations do not make it clear how these technologies would be rolled out for use in an actual review system. In this study, we present our insights in designing methods (from both feature representation and algorithm design perspectives) for a light-weight abstract screening platform, Rayyan (see Section 3 for challenges and Section 5 for study limitations).

Figure 1: Pictorial depiction of the abstract screening system Rayyan.
A key difference of our evaluation, compared to existing studies such as [37, 39], is that our reviews come from an actual deployed systematic review system. In addition, many of the results presented in the literature are hard to generalize as they do not report performance over multiple datasets and over different prevalence groups. We thus collect a large set of reviews from the Rayyan platform and evaluate a large set of methods based on their practical utility to solve the class imbalance problem. Additionally, to give a clearer insight into the class imbalance problem, we also divide the datasets into three prevalence groups, perform well defined statistical tests, and present our findings over a large set of metrics. The results of our evaluation give insights on the best approach for abstract screening in a real-life environment. Besides the evaluation of the existing methods, we also propose an ensemble method for the task of abstract screening that presents the prediction results through a 5-star rating scale. We make our detailed results and related resources available online (https://tksaha.github.io/paper_resources/).

The primary aim of this study is to assess the performance of the widely used SVM-based methods on the abstract screening task, using a large set of reviews from a real abstract screening platform. This paper therefore addresses the following research questions from an abstract screening system design perspective:

1. What are the challenges of using different feature space representations?
2. What kinds of algorithms are suitable?
3. How do SVM-based methods perform on different metrics over a large set of reviews and different class imbalance ratios?
   • What aspects should be considered for designing a new algorithm for abstract screening?
   • Can an ensemble method improve performance?

The remainder of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we provide an overview of the SVM-based methods used in the evaluation and details of the proposed 5-star rating method.
2. Related Work
We organize the related works into the following four groups: 1) works on "abstract screening" methods; 2) methodologies for handling "data imbalance" in classification, as it is one of the main challenges in the abstract screening process; 3) methodologies for "active learning", a popular approach for addressing data imbalance and the user's labeling burden; and finally, 4) works on "linear review" from the legal domain, which bears some similarity with abstract screening in systematic reviews.
A large body of past research has focused on automating the abstract screening process [2, 9, 10, 11, 29, 34]. In terms of feature representation, most of the existing approaches [9, 11] use unigrams, bigrams, and MeSH (Medical Subject Headings) terms. An alternative to the MeSH terms is to extract LDA-based latent topics from the titles and abstracts and use them as features [34]. Other methods [9, 29] utilize external information as features, such as citations and co-citations. In terms of classification methods, SVM-based learning algorithms are commonly used [2, 9, 10]. According to a recent study [37], 13 different types of algorithms, including SVM, decision trees, naive Bayes, k-nearest neighbor, and neural networks, have been proposed in existing works on abstract screening. To the best of our knowledge, none of the existing works use the structured output version of SVM (SVM^perf [27]) with different loss functions, which we consider in this study. Note that SVM^perf's learning is faster than that of a transductive SVM [26]. For prediction, the former only keeps feature weights, which makes its prediction module very fast. On the other hand, a transductive SVM takes both labeled and unlabeled citations into account, so its learning is slower than SVM^perf. The work in [53] uses a distant supervision technique to extract PICO (Population/Problem, Intervention, Comparator, and Outcome) sentences from clinical trial reports. The distant supervision method trains a model using previously conducted reviews, and for the abstract screening task there is no such database for a particular review; thus, we do not use the distant supervision method. In this work, we evaluate various SVM-based classification methods with different loss functions over a set of abstract screening tasks.
Data imbalance in supervised classification is a well studied problem [1, 26, 27, 47, 48]. Two different kinds of approaches have been proposed to solve it. Methods of the first kind focus on designing suitable loss functions, such as KL divergence [17], quadratic mean [32], cost sensitive classification [16], mean average precision [55], random forest with meta-cost [8], and AUC [27]. Methods of the second kind generate synthetic data for artificially balancing the ratio of labeled relevant and irrelevant citations. Examples of such methods are borderline-SMOTE [22], safe-level-SMOTE [6], and over-sampling of the instances of the minority class along with under-sampling of the instances of the majority class. The authors in [35, 51, 52] use probability calibration techniques. In this paper, we evaluate the loss-centric approaches, ignoring the data centric and probability calibration based approaches. The reason for our choice is the fact that synthetic data generation is computationally expensive, which makes it very difficult to overcome extreme imbalance, a typical scenario for abstract screening. Also, for probability calibration we need a validation set, which is not always available in a typical abstract screening session. For the loss-centric approaches, we have modified SVM^perf as proposed in [17, 32] to incorporate the KL divergence and quadratic mean losses, because the default implementation does not provide these loss functions.
As systematic reviews are done in batches, the problem of abstract screening can be modeled as an instance of batch-mode active learning. Online learning retrains the classifier after adding every citation, whereas batch-mode active learning (BAL) does the same after adding batches of citations. However, BAL does not have the theoretical guarantees of online learning [23, 49]. The task of BAL is to select batches with informative samples (citations) that would help to learn a better classification model. There are two popular methods to select samples: (1) certainty based and (2) uncertainty based. In certainty-based methods, the "most certain" samples are selected to train the classifier, while in uncertainty-based methods the "most uncertain" ones are selected. For uncertainty sampling, a large number of uncertainty metrics have been proposed; examples include entropy [14], smallest margin [45], least confidence [13], committee disagreement [46], and version space reduction [36, 49]. Using "most uncertain" samples based on these metrics improves the quality of the classifier in finding the best separating hyperplane, and thus improves the accuracy in classifying new instances [34]. On the other hand, certainty-based methods have also been shown to be effective in carrying out active learning on imbalanced datasets, as demonstrated in [18]. In this paper, we present some interesting findings based on certainty-based sampling for the abstract screening task.
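To make the distinction between the two sampling strategies concrete, the following is a minimal sketch, assuming a linear classifier that exposes signed decision scores for the unlabeled pool; the function and variable names are ours and only illustrate the certainty versus uncertainty selection criteria, not any of the cited systems.

```python
import numpy as np

def select_batch(decision_scores, batch_size, strategy="uncertainty"):
    """Pick citations to label next from an unlabeled pool.

    decision_scores: signed distances to the separating hyperplane,
                     one per unlabeled citation (larger => more likely relevant).
    strategy: "uncertainty" picks the citations closest to the hyperplane
              (smallest-margin criterion); "certainty" picks the citations
              the model is most confident are relevant.
    Returns indices into the unlabeled pool.
    """
    scores = np.asarray(decision_scores, dtype=float)
    if strategy == "uncertainty":
        order = np.argsort(np.abs(scores))   # smallest absolute margin = most uncertain
    else:
        order = np.argsort(-scores)          # highest positive score = most certainly relevant
    return order[:batch_size]

# Example: a pool of 6 unlabeled citations with toy decision scores.
pool_scores = [2.3, -0.1, 0.4, -1.8, 1.1, 0.05]
print(select_batch(pool_scores, 2, "uncertainty"))  # indices of the 2 most uncertain citations
print(select_batch(pool_scores, 2, "certainty"))    # indices of the 2 most certain citations
```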
A task similar to abstract screening is Technology Assisted Review (TAR) [19, 20, 21, 24, 43, 44], which is popular among law firms. The main objective of both review systems is the same: to separate relevant results from irrelevant ones in a very imbalanced setup. TAR is used to prioritize attorneys' time to screen documents relevant to a lawsuit rather than those which are irrelevant. Generally, the number of documents is much larger (in the millions) in TAR than in systematic reviews, so active learning is very popular in the TAR domain. However, unlike systematic reviews, TAR researchers are interested in finding a good stabilization point at which the training of the classifier should be stopped. Moreover, in TAR, achieving at least 75% recall is considered acceptable, whereas in systematic reviews 95%-100% recall is desired. Linear review faces a similar set of challenges as abstract screening, and in many cases those challenges are overcome using similar solutions.
3. Method
For abstract screening, we have as input the title (T_i) and abstract (A_i) of a set of n citations, C = {C_{T_i,A_i}}, 1 ≤ i ≤ n. We represent each citation C_{T_i,A_i} as a tuple ⟨x_{C_i}, y_{C_i}⟩: x_{C_i} is a d-dimensional feature space representation of the citation and y_{C_i} ∈ {+1, −1} is a binary label denoting whether C_i is relevant or not for the given abstract screening task. For a labeled citation, y_{C_i} is known; for an unlabeled citation, it is not known. We use L and U to denote the set of labeled and unlabeled citations, respectively, F to represent the features of a set of citations, and h for the hypothesis or hyperplane learned by training on L. Some of our feature representation methods embed a word/term into a latent space; for a word w, we use l(w) for the latent space representation of this word.

Task Challenges
The desiderata for supporting abstract screening in an SR platform like Rayyan include the following: (1) feature extraction should be fast and the features should be readily computable from the available data, (2) the learning and prediction algorithms should be efficient, (3) the prediction algorithm should be able to handle extreme data imbalance so that it can overcome the shortcomings of low search precision, and (4) the prediction algorithm should solve an instance of a two-class ranking problem such that the relevant citations are ranked higher than the irrelevant ones. Based on these requirements, the choices for feature representations, prediction algorithms, and prediction metrics are substantially reduced, as we discuss in the following paragraphs.

For feature extraction, we focus on two classes of methods, namely uni-bigram and word2vec. Uni-bigram is a traditional feature extraction method for text data. Word2vec [33] is a recent distributed method which provides vector representations of words capturing semantic similarity between them. Both methods are fast and their feature representation is computed from the citation data, which are readily available. We avoided other feature extraction choices as they are either hard to compute or the information from which these feature values are computed is not readily available. For example, features like co-citations are hard to obtain. Another possible feature is the frequency of MeSH (Medical Subject Heading) terms. While the MeSH terms of a citation may be obtained from PubMed, this has some practical limitations, especially in an automated system like Rayyan. In particular, (i) PubMed does not necessarily cover all of the existing biomedical literature, and (ii) to obtain MeSH terms, one has to either provide the PMID (PubMed Article Identifier) of each reference, which is not always available, or use the title search API, which is an error-prone process since it could return more than one result. Thus, we avoided using this feature in our experiments. We also discarded the use of LDA (Latent Dirichlet Allocation) based feature generation and a matrix factorization based method [30], as an LDA based method is slow and there is no easy way to set some of its critical parameters such as the number of topics. Methods based on distributed representations (e.g., word2vec [33]) have been shown to perform well in practice compared to the matrix factorization based alternatives. Additionally, for distributed representation methods, Natural Language Processing (NLP) functions such as stemming (a heuristic method that chops word endings) and lemmatization (a morphological analysis method that finds the root form of a word) are usually not a requirement, as the methods automatically discover the relationship between inflected word forms (for example, require, requires, required) based on co-occurrence statistics and represent them closely in the vector space. For the Uni-Bi (unigram, bigram) feature representation, one can use lemmatization or stemming. However, one immediate problem with lemmatization is that it is actually quite difficult to perform accurately for very large untagged corpora, so often a simple stemming algorithm (such as the Porter Stemmer) is applied instead to remove the most common morphological and inflectional word endings from the corpus [5]. We nevertheless avoided such a preprocessing step, as it adds extra cost and, in our initial analysis, the performance difference was not statistically significant.

Since irrelevant citations are more common than relevant ones, abstract screening algorithms suffer from the data-imbalance issue. Among the binary classification methods, maximum-margin based methods (aka SVM) have become very popular in the abstract screening domain [39]. Therefore, we use SVM variants that are more likely to overcome the data-imbalance problem. Specifically, we use SVM^cost (cost-sensitive SVM) and SVM^perf (SVM with multivariate performance measures) with different loss functions for our evaluation.

Finally, various prediction metrics have been used in the abstract screening domain, which makes it difficult to compare models. A recent survey of current approaches in abstract screening [39] reported that among the 69 publications surveyed, 22 report performance using Recall, 18 using Precision, and 10 using AUC. The non-uniform usage of these metrics makes it difficult to draw conclusions about the performance of different methods. For example, if a particular work only reports Recall but not Precision, it is hard to understand its applicability. Among all the above metrics, the most common is the Area Under the ROC Curve (AUC), which can be computed from the ranking of the citations as provided by the classification model. Hence, to use AUC, the classification model for abstract screening needs to provide a ranked order of the citations. Among other ranking based metrics, AUPRC (area under the Precision-Recall curve) is also used extensively. Moreover, the authors of a recent work [40] suggested studying the variability of the reported metrics and advocated using a large number of repetitions (500 × 2) to ensure reproducibility. We also observed that most of the existing studies were performed on a small number of SRs, reported the evaluation with different sets of metrics, validated with a small number of repetitions, and mostly did not perform proper statistical significance tests which would provide a statistical guarantee of the superiority of a method over the others.

In the following, we describe the feature space representation, then the existing methods of automated abstract screening, and finally our proposed 5-star rating algorithm.
Feature Space Representation

We use two types of feature representation: (1) uni-bigram (Uni-Bi) and (2) word2vec (w2v). The Uni-Bi based feature representation uses the frequency of uni-grams and bi-grams in a document. Note that uni-grams and bi-grams are special cases of n-grams, the collection of all contiguous sequences of n words from a given text. Since the sum of the number of distinct words (uni-grams) and the number of distinct word pairs (bi-grams) over the document collection in a particular review task is generally large, the Uni-Bi feature representation is high-dimensional. Besides, it produces a sparse feature representation, i.e., a large number of entries in the feature matrix are zero. Such a high-dimensional and sparse feature representation is poor for training a classification model, so several alternative feature representations have been proposed to overcome this issue. Among them, word2vec [33] is a popular alternative. It adopts a distributed framework to represent the words in a dense d-dimensional latent feature space. The feature vector of each word in this latent space is learned by shallow neural network models, and the feature vector of a citation is then obtained by averaging the latent representations of all the words in that citation. It has been hypothesized that these condensed real-valued vector representations, learned from unlabeled data, outperform Uni-Bi based representations.

To learn the vector representation of words using word2vec [33] for our corpus, we train the model on all abstracts and titles available in Rayyan. We use Gensim [42] with the following parameters: the number of context words is 5, the minimum word count is 15, and the number of dimensions in the latent space is 500 (d = 500, chosen through validation). Thus, for each word in the set of all available words (w_i ∈ W), we learn a 500-dimensional latent space representation. After averaging the latent vectors of words, we obtain the latent vector of a citation, on which we apply two kinds of normalization: (1) instance normalization (row normalization) and (2) feature based normalization (column normalization). These normalizations give statistically significantly better results than using raw features for threshold agnostic metrics such as AUC and AUPRC. In instance normalization, we z-normalize the extracted features of each citation, x_{C_i}. For column normalization, we z-normalize each of the 500 dimensions. After both normalizations, we keep the feature values up to 2 decimal places to minimize the memory requirement of our system. Note that there exist some neural network based models such as sen2vec [31] which can learn the representation of a citation holistically instead of obtaining it by averaging the representations of the words in that citation. However, the sen2vec model is transductive in nature, i.e., for new citations, we need to execute a few passes over the trained model, which is time-consuming.

After learning the representation, for a limited number of words, we manually validate whether the latent space representation of a word captures the known semantic similarities of that word with various biomedical terms. Semantic similarities are computed using cosine similarity in the latent space following [33]. Our manual validation shows encouraging results. For example, the cosine similarity between "liver" and "cirrhosis" is 0.63, which is large considering the fact that the vectors have 500 dimensions. Also, the query "which words are related to cirrhosis in the same way breast and cancer are related" returns "liver" as one of the top-5 answers in the w2v feature representation.

Feature    Algorithm (Param.)         Id
Uni-Bi     SVM^perf (b=0, AUC)         1
           SVM^perf (b=1, AUC)         2
           SVM^perf (b=1, KLD)         3
           SVM^perf (b=1, QuadMean)    4
           SVM (Default)               5
           SVM^cost (J, b=0)           6
           SVM^cost (J, b=1)           7
           SVM Transduction           11
w2v row    SVM^perf (b=1, AUC)        21
           SVM^perf (b=1, KLD)        22
           SVM^perf (b=1, QuadMean)   23
           SVM^cost (J, b=0)          24
           SVM^cost (J, b=1)          25
w2v col    SVM^perf (b=1, AUC)        31
           SVM^perf (b=1, KLD)        32
           SVM^perf (b=1, QuadMean)   33
           SVM^cost (J, b=0)          34
           SVM^cost (J, b=1)          35

Table 1: Description of the algorithms used in the evaluation. We use the default regularization in all cases. J in SVM^cost is set to the ratio of the number of labeled irrelevant citations to the number of labeled relevant citations, |L−|/|L+|.
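For concreteness, the following sketch shows how such features could be produced, assuming scikit-learn and Gensim (version 4 or later) are available; the toy corpus, the helper function, and the variable names are ours and only illustrate the Uni-Bi extraction and the w2v averaging plus z-normalization described above, not Rayyan's actual implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

# Toy corpus standing in for the titles + abstracts of citations.
citations = [
    "randomized trial of statin therapy in liver cirrhosis",
    "observational study of breast cancer screening outcomes",
    "effect of exercise on blood pressure a randomized trial",
]

# (1) Uni-Bi features: raw frequencies of unigrams and bigrams (sparse, high-dimensional).
unibi = CountVectorizer(ngram_range=(1, 2)).fit_transform(citations)

# (2) w2v features: train word2vec, then average word vectors per citation.
#     The paper uses window=5, min_count=15, and 500 dimensions on the full Rayyan corpus;
#     min_count=1 is used here only so the toy corpus produces a vocabulary.
tokenized = [doc.split() for doc in citations]
w2v = Word2Vec(tokenized, vector_size=500, window=5, min_count=1, workers=2)

def citation_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([citation_vector(t, w2v) for t in tokenized])

# Row (instance) normalization: z-normalize each citation's vector.
X_row = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-12)
# Column (feature) normalization: z-normalize each latent dimension.
X_col = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
# Rounding to two decimals, as in the text, reduces the memory footprint.
X_row, X_col = np.round(X_row, 2), np.round(X_col, 2)
```

In the actual system the word2vec model is trained once over all Rayyan titles and abstracts, and the resulting citation vectors are what the classifiers in Table 1 consume.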
A recent study [37] reports that the Support Vector Machine (SVM) is the most used algorithm in abstract screening: it is used in 31% of the studies and in at least one experiment annually since 2006. Moreover, as discussed in Section 2, SVM-based methods are mostly used in both the data-imbalance and batch-mode active learning settings [18, 34]. We thus restrict our evaluation to existing SVM-based algorithms. SVM is a supervised classification algorithm which uses a set of labeled data instances and learns a maximum-margin separating hyperplane by solving a quadratic optimization problem. We should mention that the number of labeled citations can be very small at the start of a citation screening process and that the number of citations varies from a few hundred to a few thousand across reviews. Thus, we did not try any supervised deep-learning based technique in our evaluation.

We use three types of SVM methods: (1) inductive, (2) transductive, and (3) SVM^perf. Inductive SVM learns a hypothesis h induced by F(L). Transductive SVM [26] reduces the learning problem of finding an h ∈ H to a different learning problem where the goal is to find one equivalence class from the infinitely many induced by all the instances in F(U) and F(L). SVM^perf exploits the alternative structural formulation of the SVM optimization problem for conventional binary classification with error rate [28]. We use three different loss functions for the SVM^perf implementation: (i) AUC, (ii) Kullback-Leibler Divergence (KLD), and (iii) QuadMean Error [32]. Table 1 shows the different SVM-based algorithms used in our comparison along with their parameters and loss functions. We use the default regularization in all cases, and J in SVM^cost is set to the ratio of the number of labeled irrelevant citations to the number of labeled relevant citations, |L−|/|L+|. This ratio biases the hypothesis learner (the training algorithm) to penalize mistakes on the minority class (the relevant class) J times more than on the majority class (the negative class). In Table 1, we assign a distinct integer identifier to each of the algorithms. For example, the first row in Table 1 refers to a method identified by Id 1 that uses SVM^perf with b = 0 and AUC as the loss function. Comparison results among different methods (such as the results shown in Table 4) are presented compactly by referring to each method by its identifier instead of its name.

Making a comparison with the exact baseline algorithms proposed by various studies is not straightforward. 31% of the studies reported in [37] use SVM as their classifier; however, each one employs a different feature representation and a different SVM implementation. Thus, in this study we restrict our attention to a specific set of feature representations which are more suitable for our abstract screening platform, and use the linear kernel with different loss functions and the implementation provided by the author of each SVM algorithm.
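As an illustration of the cost-sensitive weighting described above, the following sketch computes J from a labeled set and trains a linear SVM that penalizes mistakes on the relevant class J times more; it uses scikit-learn's LinearSVC as a stand-in for the SVM^cost implementation used in the paper, so the synthetic data and variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy labeled set: +1 for relevant, -1 for irrelevant (heavily imbalanced).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))
y_train = np.array([1] * 10 + [-1] * 190)

# J = |L-| / |L+|: number of labeled irrelevant over labeled relevant citations.
J = np.sum(y_train == -1) / np.sum(y_train == 1)

# Errors on the minority (relevant) class cost J times more than on the majority class.
clf = LinearSVC(class_weight={1: J, -1: 1.0}, C=1.0)
clf.fit(X_train, y_train)

# Signed distances to the hyperplane are later used to rank the unlabeled citations.
scores = clf.decision_function(X_train)
```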
5-star Rating Algorithm

In our SR platform, we want to rank citations based on their graded relevance. The intuition is to help reviewers better manage their time. To this end, we rate the citations from 1 to 5 using our proposed RelRank algorithm. The citations with 5 stars are relevant with high probability and need more attention, whereas 1-starred documents are irrelevant with high probability and may need less attention. Among the 5 stars, we conceptualize 3 to 5 stars as relevant and 1 to 2 stars as irrelevant.

RelRank (Algorithm 1) uses an ensemble of SVM-based methods. We choose SVM^perf (b=1, AUC) with w2v features, SVM^cost (J, b=1) with w2v features, and SVM^cost (J, b=1) with Uni-Bi features because of their special characteristics observed in our initial evaluation of the methods on multiple datasets and over a range of metrics (described in Section 4.4). For instance, SVM^perf (b=1, AUC) outperforms the others on the AUC and Recall metrics. When evaluated on the F1 metric (the harmonic mean of Precision and Recall), SVM^cost (J, b=1) with w2v features produces the highest value. On the other hand, SVM^cost (J, b=1) with Uni-Bi features has the highest Precision. So, in RelRank, if all three methods agree on the relevance of a citation, the citation gets 5 stars; if two of them agree, it gets 4 stars. Within a particular star, citations are ranked based on their average ranks (described in Algorithm 2).

Algorithm 1: RelRank, a five-star rating algorithm using an ensemble of max-margin based methods
  Input:  L, labeled dataset; U, unlabeled dataset
  Output: scores S_i, 1 ≤ i ≤ |U|
  1: F_L, F_U ← GenerateFeature(L, U, feature = w2v row)
  2: h_1 ← Train(SVM^perf, F_L)
  3: h_2 ← Train(SVM^cost, F_L)
  4: S_{h_1} ← Predict(h_1, F_U)
  5: S_{h_2} ← Predict(h_2, F_U)
  6: F_L, F_U ← GenerateFeature(L, U, feature = Uni-Bi)
  7: h_3 ← Train(SVM^cost, F_L)
  8: S_{h_3} ← Predict(h_3, F_U)
  9: S ← GenerateCombinedScore(U, S_{h_1}, S_{h_2}, S_{h_3})
 10: return S
In RelRank, we first generate w2v row features for both sets L and U. Then, we train SVM^perf and SVM^cost on the w2v features to generate the first two hyperplanes, h_1 and h_2, respectively. We then compute scores for U based on the distances from h_1 and h_2. Using the Uni-Bi features with SVM^cost, we compute a third score S_{h_3} using the hyperplane h_3. Finally, we combine the three scores to generate a final score for U, which we formally present in Algorithm 2 (GenerateCombinedScore).

Algorithm 2: GenerateCombinedScore
  Input:  unlabeled dataset U; scores S_h[]; classifier thresholds T_h[]; separation threshold ST; max range MR
  Output: scores S_u, 1 ≤ u ≤ |U|
  1: n ← |U|
  2: FVOTE ← λ r . exp(−(|U| − r) / |U|)
  3: NORM ← λ s . range-normalize s into [1, MR]
  4: r_h ← GetRanksForASetOfElements(S_h)
  5: for u = 0 .. |U| − 1 do
  6:   nv ← S_h[0][u] > T_h[0] ? 1 : −1
  7:   rs ← r_h[0][u]
  8:   fv ← FVOTE(rs)
  9:   for v = 1 .. |h| − 1 do
 10:     rs_v ← r_h[v][u]
 11:     fv ← fv + FVOTE(rs_v)
 12:     rs ← rs + rs_v
 13:     nv ← S_h[v][u] > T_h[v] ? (nv > 0 ? nv + 1 : nv) : (nv < 0 ? nv − 1 : nv)
 14:   ns ← NORM(rs / |h|)
 15:   finv ← nv ≥ 1 ? fv : nv
 16:   S[u] ← finv * ST + ns
 17: return S

GenerateCombinedScore presents the already classified citations in a ranked manner based on the scores obtained by calculating the distance from the hyperplanes. The method takes the separation threshold (ST) and the max range (MR) as user-defined parameters. ST is used to define the separation between ratings; MR is used for range normalization of a rank score between 1 and MR. Step 1 gets the number of unlabeled citations. Steps 2 and 3 define FVOTE and NORM as lambda functions; both are used to transform ranks into a specified range, with FVOTE smoothing a rank into a bounded fractional value. In Step 4, we obtain ranks for each of the scores S_{h_i}, 1 ≤ i ≤ |h|; the higher the distance from the hyperplane for a particular citation, the higher its rank. Steps 5-16 calculate the final combined score. We iterate through all the scores in order. The very first hypothesis has the supreme power: if it predicts a particular citation as relevant, the citation is rated between 3 and 5 stars, and if it predicts it as irrelevant, the same citation gets 1 or 2 stars. In Step 6, we get the vote from the dominant classifier, i.e., if the score is greater than the classifier threshold then the vote count increases by 1. The rank score and the fractional vote are obtained in Steps 7 and 8. The number of votes (nv) only increases if both the dominant classifier and the current hypothesis vote the citation as relevant. On the other hand, the number of votes decreases if both of them agree on its irrelevancy. Finally, in Step 14, we get a normalized rank between 1 and MR. If the number of votes is greater than or equal to 1, the fractional vote is used as the final vote; otherwise, the negative votes are used unchanged. This ensures that even if a particular citation gets positive votes, it still needs a top rank to maintain its position in the rating. We use 1000 and 800 as the values of the parameters ST and MR, respectively.
We also use 0.0, 2000, and 2500 as the thresholds for RelRank (3-star), RelRank (4-star), and RelRank (5-star), respectively, as these thresholds performed the best during manual verification of the system's performance.
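A compact Python sketch of the score combination is given below. It follows the structure of Algorithm 2 under the assumptions stated in the comments (three score arrays, decision thresholds at 0, ST = 1000, MR = 800); the function names, the example values, and in particular the exact range-normalization mapping are ours and illustrative rather than Rayyan's production code.

```python
import numpy as np

def generate_combined_score(scores, thresholds, st=1000.0, mr=800.0):
    """Combine per-classifier decision scores into a single RelRank-style score.

    scores: list of 1-D arrays, one per classifier, scores[0] being the dominant one.
    thresholds: decision threshold per classifier (0.0 for all methods in the paper).
    """
    scores = [np.asarray(s, dtype=float) for s in scores]
    n = scores[0].size
    ranks = [np.argsort(np.argsort(s)) for s in scores]   # 0 = lowest score, n-1 = highest

    fvote = lambda r: np.exp(-(n - r) / n)                # smoothed fractional vote
    # One possible range normalization into [1, MR]; the exact mapping is an assumption.
    norm = lambda s: 1.0 + (s / max(n - 1, 1)) * (mr - 1.0)

    combined = np.zeros(n)
    for u in range(n):
        nv = 1 if scores[0][u] > thresholds[0] else -1    # dominant classifier's vote
        rs, fv = ranks[0][u], fvote(ranks[0][u])
        for v in range(1, len(scores)):
            rs += ranks[v][u]
            fv += fvote(ranks[v][u])
            if scores[v][u] > thresholds[v]:
                nv = nv + 1 if nv > 0 else nv             # agreement on relevance
            else:
                nv = nv - 1 if nv < 0 else nv             # agreement on irrelevance
        ns = norm(rs / len(scores))
        finv = fv if nv >= 1 else nv
        combined[u] = finv * st + ns
    return combined

# Stars are then assigned by thresholding the combined score, e.g. >= 2500 -> 5 stars,
# >= 2000 -> 4 stars, >= 0 -> 3 stars, and below 0 -> 1 or 2 stars.
```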
4. Experimental Results

In this section, we first present our experimental settings, then we describe our dataset and performance metrics, and finally present our experimental results.

Table 2: Dataset statistics, grouped by prevalence. All publicly available reviews start with C (Cohen) whereas reviews from our system start with P (Private). Rel. stands for relevant citations and Prev. for the prevalence of the dataset. For example, in dataset P18, the prevalence is 0.22%, as 5 (number of relevant citations) out of 2241 (total citations) is 0.0022.

We run all of our experiments on a computer with an Intel Xeon 2.6 GHz processor running the CentOS 6.7 operating system. For each dataset (described in the next section), we perform a 500 × 2 cross validation. In an n × k cross validation, we split the data into k blocks; each block in turn becomes the test block while the rest is used as training data, and this process is repeated n times. The split is carried out through stratification. So for 500 × 2, we split each dataset into two blocks, use each block once for training and once for testing, and repeat this process 500 times. According to [41], over the course of many iterations/repetitions (500 in our case) the average performance estimate for a given classifier may stabilize and produce "steady-state" classifier performance. It also produces substantially more repeatable results than using a small fixed number of iterations such as 5 or 10.

We now describe the parameter settings of our algorithms. For the inductive and transductive SVM methods, the default cost parameter, denoted as c, is computed as follows: the sum of the 2-norms of the feature vectors is divided by the number of instances in F(L) to generate b, and the fraction 1/(b · b) gives c. J in SVM^cost is set to the ratio |L−|/|L+| described in Section 3. Furthermore, we follow the recommendation for the error loss function from [27] to set the cost parameter for SVM^perf. The way we calculate the parameter values also avoids the need for a representative validation set, which is very hard to obtain in a systematic review platform. Note that for all the methods, we set the classification threshold to 0.

We use 61 reviews (datasets) for our experiments: 15 of them are publicly available from [9] and the other 46 reviews were collected from Rayyan. Together they represent 84K citations. In Table 2, we present our dataset statistics. All publicly available reviews start with C (Cohen) whereas reviews from Rayyan start with P (Private). In the table, we give three statistics about each review: the total number of citations (Total), the total number of relevant citations (Rel.), and the prevalence (the ratio of relevant to total citations, Prev(%)). The datasets of each group are sorted by their prevalence. For example, in the dataset P18, the prevalence is 0.22%, corresponding to the ratio of its 5 relevant citations out of 2241 citations. We divide the reviews into three prevalence groups depending on the ratio of included citations: (1) Low (starting at 0.22% prevalence), (2) Mid (starting at 5.79%), and (3) High (starting at 13.45%, up to about 40%). This division has two benefits: (i) it allows us to study the trend of the various performance metrics within each group, i.e., depending on the complexity of the task (the smaller the prevalence, the more complex the task), and (ii) it also makes our statistical tests robust against Type I error for data with frequent extreme values (the performance on some metrics varies wildly among different prevalence groups). We group reviews by the severity of the class imbalance (high, mid, low) while making sure that every group has a reasonable number of data samples (in our case, the number of datasets in each group is around 20).

Next, we describe our performance metrics. We use 11 metrics for the evaluation [39] (listed in Table 3). The first four measures (Recall, Precision, F-measure, and Accuracy) depend on a threshold and are widely used for evaluating automated abstract screening methods. In abstract screening, the highest cost is associated with false negatives (articles incorrectly identified as irrelevant), as these will not be manually reviewed and relevant evidence is omitted from the final decision (the systematic review). Therefore, Recall is expected to be very high in screening automation. AUC and AUPRC do not depend on thresholds and are common in binary classification problems with data imbalance. In an abstract screening platform, AUC and AUPRC are the second most important metrics, because a good ranking of documents makes it easier for users to screen them, i.e., they can mark a large batch of documents as relevant or irrelevant in a single operation. We also add two more metrics: (1) arithmetic mean error and (2) quadratic mean error. Both measure the loss in Recall in the relevant and irrelevant classes and thus are important for an abstract screening platform. Finally, for the active learning setting, we use Burden, Yield, and Utility, which respectively correspond to the fraction of the total number of citations that a human must screen, the fraction of relevant citations that are identified by a given screening approach, and a weighted sum of Yield and Burden. From an abstract screening platform perspective, we want to minimize Burden and maximize Yield.
Recall (Sensitivity): ratio of correctly predicted relevant citations to all relevant ones. Formula: TP / (TP + FN).
Precision: ratio of correctly identified relevant citations to all citations predicted as relevant. Formula: TP / (TP + FP).
F-Measure: combines Precision and Recall; for β = 1 it is their harmonic mean. Formula: (1 + β²) · Precision · Recall / (β² · Precision + Recall).
Accuracy: ratio of relevant and irrelevant citations predicted correctly to all citations. Formula: (TP + TN) / (TP + TN + FP + FN).
ROC (AUC): area under the curve obtained by graphing the true positive rate against the false positive rate; 1.0 is a perfect score and 0.5 is equivalent to a random ordering.
AUPRC: area under the Precision-Recall curve.
AM ER.: arithmetic mean of the loss in Recall of the relevant class (L_Rp = FN / (TP + FN)) and the irrelevant class (L_Rn = FP / (FP + TN)). Formula: (L_Rp + L_Rn) / 2.
QD ER.: quadratic mean (aka root mean square), which measures the magnitude of varying quantities; defined as the square root of the arithmetic mean of the squares of the loss in Recall of the relevant (L_Rp) and irrelevant (L_Rn) classes. Formula: sqrt((L_Rp² + L_Rn²) / 2).
Burden: the fraction of the total number of citations that a human must screen. Formula: (TP^L + TN^L + TP^U + FP^U) / N.
Yield: the fraction of relevant citations that are identified by a given screening approach. Formula: (TP^L + TP^U) / (TP^L + TP^U + FN^U).
Utility: a weighted sum of Yield and Burden, where the constant β represents the relative importance of Yield in comparison to Burden; following the suggestion from [54], we use β = 19 to give Yield 19 times more importance than Burden in our experimental evaluations. Formula: (β · Yield + (1 − Burden)) / (β + 1).

Table 3: The metrics used in our evaluation, with their definitions and formulas. TP, FP, TN and FN represent true positive, false positive, true negative and false negative counts, respectively; the superscripts L and U distinguish counts over the citations labeled by the reviewer from counts over the automatically classified (unlabeled) citations.
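As a quick reference for the less standard metrics in Table 3, the sketch below computes the AM/QD errors, Burden, Yield, and Utility from confusion-matrix counts; it is our own minimal illustration, with the labeled/unlabeled split of the counts passed in explicitly and the human-labeled portion assumed to be error-free.

```python
import math

def screening_metrics(tp_l, tn_l, tp_u, fp_u, fn_u, tn_u, fp_l=0, fn_l=0, beta=19):
    """Compute screening metrics from confusion counts.

    *_l: counts over citations labeled by the reviewer (assumed correct, so fp_l = fn_l = 0).
    *_u: counts over citations classified automatically.
    """
    tp, fp = tp_l + tp_u, fp_l + fp_u
    tn, fn = tn_l + tn_u, fn_l + fn_u
    n = tp + fp + tn + fn

    loss_pos = fn / (tp + fn)              # loss in recall on the relevant class
    loss_neg = fp / (fp + tn)              # loss in recall on the irrelevant class
    am_error = (loss_pos + loss_neg) / 2
    qd_error = math.sqrt((loss_pos ** 2 + loss_neg ** 2) / 2)

    burden = (tp_l + tn_l + tp_u + fp_u) / n          # fraction a human must screen
    yield_ = (tp_l + tp_u) / (tp_l + tp_u + fn_u)     # fraction of relevant citations found
    utility = (beta * yield_ + (1 - burden)) / (beta + 1)
    return am_error, qd_error, burden, yield_, utility

# Toy example: 5 relevant and 95 irrelevant citations in total.
print(screening_metrics(tp_l=2, tn_l=20, tp_u=2, fp_u=10, fn_u=1, tn_u=65))
```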
We have 18 different methods (Table 1) and 61 datasets. As we have partitioned the datasets into three prevalence groups, (1) low prevalence, (2) mid prevalence, and (3) high prevalence, we compare the methods based on their overall performance on the datasets of a particular prevalence group for a specific metric. Our goal is to generalize the findings to a larger population of datasets that could fall within one of our defined prevalence groups. To perform a statistical test, we follow Cohen et al. [12], who considered data and method as covariates to predict the performance metric, i.e., METRIC ∼ DATA + METHOD. To be more specific, the model resembles y = mx + c, where x is a variable and m, c are constants; in this model, DATA and METHOD mimic x and c, respectively. In our case, for a particular metric, we use the average over the 500 × 2 repetitions as the score of a method on a dataset, i.e., the multiple resampling from each dataset is used only to assess the performance score, not its variance, which is similar in spirit to [15]. The sources of variance are the differences in performance over (independent) datasets and not over (usually dependent) samples, so elevated Type I error is not an issue.

We fit the model (METRIC ∼ DATA + METHOD) using linear regression and perform a two-factor (DATA and METHOD) analysis of variance. This helps us identify whether there is any statistically significant difference among the methods, the datasets, and the interactions between the methods and the data. If the test shows a performance difference among the methods that is statistically significant, we compare pairs of methods using paired t-tests (post-hoc testing). Post-hoc testing with 18 methods leads to 18*17/2 possible comparisons, each of which needs correction for multiple testing. To avoid this large number of pairwise tests, we follow these steps:

1. Find the best method and compare it with all the remaining methods via paired t-tests, and then utilize the Linear Step Up (LSU) procedure (also known as the Benjamini-Hochberg procedure [3]) to control the False Discovery Rate (FDR). It has been shown in [15] that the Holm [25] and Hochberg [3] tests give practically equal results for post-hoc tests.
2. Separate the group of non-significant differences using LSU to control the FDR at the given alpha (α = 0.05):
   (a) Step 1: find the first k such that p(k) ≤ pos(k) ∗ α / m. Here, m is the total number of methods that are tested against the best method in a rank group and p(1), ..., p(k) are the corresponding ordered p-values, i.e., p(1) ≥ p(2) ≥ p(3) ≥ ... ≥ p(m); pos(k) is the position of a method in the ascending order of p-values. For example, pos(1) is m.
   (b) Step 2: if such a k exists, group the first k−1 methods associated with p(1), ..., p(k−1) into a single group with the best method, as they do not have any statistically significant difference with the best method.
3. The group of methods from Step 2 is isolated and the pairwise comparison is then repeated from Step 1 with the remaining methods. The process continues until no more methods are left for pairwise comparison.

The exact p-values are used inside the LSU procedure to control the FDR; we report only the minimum and maximum p-values used for rejection (i.e., for falling into a different group), as the number of p-values generated by the pairwise comparisons is enormous. For example, if Step 2 yields 4 methods that are not worse than the best method by the desired statistically significant margin, these methods and the best method form the first ranked group (rg = 1). Repeating the procedure yields successive ranked groups (rg) such that, within each group, the performance of all the methods is statistically similar to the performance of the best method within that group, and across two groups the best method of a superior group is better than all the methods in the inferior group by a statistically significant margin.

Table 4 shows the comparison results. Along the rows and the columns, we list respectively the 11 performance metrics and the three prevalence groups (low, mid, and high). The methods that perform the best for a given metric within a prevalence group are organized within sub-columns of ranked groups, created by the statistical test with α = 0.05 and labeled rg = 1, rg = 2, and rg = 3. Within a ranked group, the methods are listed by their identifiers (as shown in Table 1). For each rank group, metric, and prevalence range, we report a representative metric value in brackets. Our detailed results are available online (https://tksaha.github.io/paper_resources/). We also provide the p-values of all the metrics corresponding to Table 4 for our statistical test in Table 5. These statistics give some insight into how conservative the LSU method is in selecting two methods to put into the same group (see Section 4.3 for details). For example, for the Precision metric, any p-value greater than 0.3 may put two methods into the same group for some rank group; this is because of the LSU procedure described earlier in Section 4.3.
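The procedure above can be sketched in a few lines with statsmodels and SciPy; this is our own illustrative outline of the METRIC ∼ DATA + METHOD model, the post-hoc paired t-tests against the best method, and the Benjamini-Hochberg (LSU) grouping, with hypothetical data-frame column names.

```python
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests

def ranked_groups(df, alpha=0.05):
    """df columns (assumed): 'dataset', 'method', 'score', one averaged score per pair.

    Assumes a higher score is better; invert the sign for error metrics.
    """
    # Two-factor ANOVA: METRIC ~ DATA + METHOD.
    model = smf.ols("score ~ C(dataset) + C(method)", data=df).fit()
    print(anova_lm(model, typ=2))

    wide = df.pivot(index="dataset", columns="method", values="score")
    groups, remaining = [], sorted(df["method"].unique())
    while remaining:
        best = max(remaining, key=lambda m: wide[m].mean())
        others = [m for m in remaining if m != best]
        if not others:
            groups.append([best])
            break
        pvals = [stats.ttest_rel(wide[best], wide[m]).pvalue for m in others]
        # Benjamini-Hochberg (LSU): methods that are NOT rejected stay with the best method.
        reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
        group = [best] + [m for m, r in zip(others, reject) if not r]
        groups.append(group)
        remaining = [m for m in remaining if m not in group]
    return groups
```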
As a convenience to the reader, note that the methods that use the Uni-Bi, w2v row, and w2v col features have identifiers in the ranges 1-11, 21-25, and 31-35, respectively.

We first discuss the best methods for the metrics that depend on a threshold. Results for these metrics are shown in the first four rows and the last five rows of Table 4. We refer to the best methods by their names, with their identifiers in parentheses. For Precision, SVM^cost (7) outperforms the other methods in the low and mid prevalence groups, achieving around 40% and 58%, respectively; for the high prevalence group, it falls in the second tier and obtains a Precision value of 60%. SVM Default (5) also performs very well, with a second-tier rank in the low prevalence group and a first-tier rank in the mid and high prevalence groups; it achieves around 25%, 58%, and 74% Precision for the low, mid, and high prevalence groups, respectively. For Recall, w2v row with SVM^perf (AUC) (21) is in the top position in all three prevalence groups, obtaining around 96%, 97%, and 98% for the low, mid, and high prevalence groups, respectively. SVM^trans with Uni-Bi (11) and
SVM^cost with w2v row (25) are the top performers for the F1 metric. These methods achieve around 34%, 46%, and 58% in F1 for the low, mid, and high prevalence groups, respectively.

Table 4: Grouped results of all the methods for the metrics of Table 3 (rg stands for rank group), along with representative values in brackets. For ease of reading, we recall the methods described in Table 1. Method 1: Uni-Bi features with SVM^perf (b=0, AUC). Methods 2, 3, and 4: Uni-Bi features with SVM^perf (b=1) and the AUC, KLD, and QuadMean loss functions, respectively. Method 5: SVM (Default). Methods 6 and 7: Uni-Bi features with SVM^cost with b=0 and b=1, respectively. Method 11: SVM transduction. Methods 21, 22, and 23: w2v row features with SVM^perf (b=1) and the AUC, KLD, and QuadMean loss functions, respectively. Methods 24 and 25: w2v row features with SVM^cost with b=0 and b=1, respectively. Methods 31, 32, and 33: w2v col features with SVM^perf (b=1) and the AUC, KLD, and QuadMean loss functions, respectively. Methods 34 and 35: w2v col features with SVM^cost with b=0 and b=1, respectively.
Metric       Low prevalence (min., max.)   Mid prevalence (min., max.)   High prevalence (min., max.)
Precision    (3.68e-07, 0.03)              (1.17e-14, 0.03)              (3.63e-14, 0.03)
Recall       (7.53e-22, 0.04)              (1.24e-20, 0.007)             (2.09e-14, 0.02)
F-measure    (6.21e-09, 0.015)             (4.94e-12, 0.03)              (2.12e-11, 0.03)
Accuracy     (6.33e-22, 0.02)              (1.36e-21, 0.04)              (1.98e-14, 0.04)
AUC          (5.19e-09, 0.03)              (4.93e-11, 0.03)              (1.58e-11, 0.03)
AUPRC        (1.44e-09, 0.03)              (1.23e-10, 0.03)              (1.44e-09, 0.03)
AM ER.       (1.26e-15, 0.03)              (1.76e-15, 0.04)              (9.75e-12, 0.016)
QD ER.       (1.15e-20, 0.03)              (2.03e-16, 0.02)              (2.06e-12, 0.02)
Yield        (8.14e-22, 0.04)              (1.24e-20, 0.007)             (2.09e-14, 0.02)
Burden       (2.39e-26, 0.01)              (1.66e-25, 0.03)              (7.40e-16, 0.03)
Utility      (8.40e-22, 0.04)              (1.66e-20, 0.006)             (3.23e-14, 0.02)

Table 5: Minimum and maximum p-values of all the metrics corresponding to Table 4. These statistics give some insight into how conservative we are when putting two statistically indistinguishable methods into the same group (see Section 4.3 for details). For example, for the Precision metric, a p-value greater than 0.3 may put two methods into the same group for some rank group, even though we choose α = 0.05; this is because of the Linear Step Up (LSU) procedure described in Section 4.3.
For Accuracy, SVM^cost (b=1) with Uni-Bi (7) performs the best, followed by SVM^trans with Uni-Bi (11). SVM^cost (b=1) with Uni-Bi (7) achieves 97%, 91%, and 85% Accuracy in the low, mid, and high prevalence groups, respectively. This decreasing trend arises because the lower the prevalence, the easier it is for a classifier to achieve a high accuracy simply by predicting the dominating class for all instances. SVM^cost with w2v row normalized features (24 and 25) ranks first in the Arithmetic Mean (AM) and Quadratic Mean (QM) errors, respectively. Again, w2v row with SVM^perf (AUC) (21) ranks first in the Yield and Utility metrics, while SVM Default (5) is at the top of the list for the Burden metric. SVM^perf (AUC) (21) achieves around 98% Yield and 94% Utility for all the prevalence groups.

We now analyze the performance of the methods on the threshold-agnostic metrics. For AUC, w2v row with SVM^perf (AUC) (21) performs the best; SVM^cost (b=1) with w2v row normalized features (25) also performs well. Both methods achieve around 86% AUC in all three prevalence groups. For AUPRC, SVM^perf with (b=0, AUC) (1) and (b=1, AUC) (2) along with the Uni-Bi features are the top performing methods; they obtain 33%, 47%, and 62% in the low, mid, and high prevalence groups, respectively.
Figure 2: Active learning experiment with a variable test set. The results are shown for the three groups of prevalence values; prevalence is the ratio of the number of relevant citations to the total number of citations. The X-axis represents the percentage of the total citations that must be screened to retrieve all relevant citations for a particular review, and the Y-axis shows the number of reviews.
In this experiment, we use the SVM^perf (b=1, AUC) (21) method, as it is the top performing method in terms of the AUC metric in all three prevalence groups (Table 4). For training, we choose 5 relevant and 45 irrelevant citations uniformly at random from the entire set and then learn a hyperplane h. We calculate the distance (score) from h for each of the unlabeled instances and rank them based on this score. We choose the top-50 of the ranked citations to retrain the model along with the existing labeled citations. We repeat this experiment 500 times and take the average. The goal is to answer the following question: if a user labels the 50 top ranked citations per batch, what percentage of the total citations does the user have to screen to find all the relevant ones? Figure 2 shows the results of this experiment. As before, the results are shown for the three groups of prevalence values. As shown in Figure 2, for the low prevalence group, out of 20 reviews, 7 reviews (the first four bars of the top histogram) need 40% of the total citations to be screened to get all the relevant citations, 12 reviews need around 50% to 70% of the citations, and 1 review needs more than 90%. For the mid and high prevalence groups, 9 out of 20 and 11 out of 21 reviews, respectively, need around 80% to 90% of the citations to be screened to get all the relevant citations.
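The batch procedure above can be written as a short loop; the sketch below is our own illustration, again using scikit-learn's LinearSVC as a stand-in for SVM^perf, and assuming a feature matrix X and labels y (+1 relevant, -1 irrelevant, with at least 5 relevant and 45 irrelevant citations) are already built.

```python
import numpy as np
from sklearn.svm import LinearSVC

def screening_simulation(X, y, seed_pos=5, seed_neg=45, batch=50, rng=None):
    """Certainty-based screening loop: label the top-ranked batch, retrain, repeat.

    Returns the fraction of citations screened before every relevant one is found.
    """
    rng = rng or np.random.default_rng(0)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    labeled = set(rng.choice(pos, seed_pos, replace=False)) | \
              set(rng.choice(neg, seed_neg, replace=False))
    while True:
        found = sum(y[i] == 1 for i in labeled)
        if found == len(pos):                          # all relevant citations retrieved
            return len(labeled) / len(y)
        idx = sorted(labeled)
        clf = LinearSVC().fit(X[idx], y[idx])
        unlabeled = np.array([i for i in range(len(y)) if i not in labeled])
        scores = clf.decision_function(X[unlabeled])
        top = unlabeled[np.argsort(-scores)[:batch]]   # most certainly relevant first
        labeled.update(top.tolist())
```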
RelRank

Results in Section 4.4 show that there is not a single algorithm which wins on all the metrics and prevalence groups. Our aim here is to choose an ensemble of algorithms that captures as much Precision as possible while maintaining a reasonably good Recall as we go up in the starring threshold. For this, we choose SVM^cost (J, b=1) with Uni-Bi features [Method 7], SVM^perf (AUC, b=1) [Method 21], and SVM^cost (J, b=1) with w2v features [Method 25]. The choice of these algorithms is guided by the characteristics they exhibit on different metrics (see the next paragraph) and by their generalized performance over the different prevalence groups, measured through a well defined statistical test. From Table 4, we observe that Method 7 has the best performance for Precision in the low and mid prevalence groups and also performs reasonably well on the Burden metric. Both Methods 5 and 7 are candidates here; however, we choose Method 7 because the algorithm takes the data imbalance ratio into account (through its J parameter). Method 21 performs the best on many metrics such as Recall, AUC, AUPRC, Yield, and Utility. Method 25 also performs the best across many metrics (see Table 4), such as F-measure, AUC, AUPRC, AM ER., and QD ER.

Figure 3: Performance of SVM^cost (J, b=1) with Uni-Bi features [Method 7], SVM^perf (AUC, b=1) [Method 21], and SVM^cost (J, b=1) with w2v features [Method 25] on the Precision, Recall, AUC, and AUPRC metrics over the three prevalence groups.

Figure 3 shows the generalized performance of the three algorithms used in our ensemble method over the three prevalence groups in terms of Precision, Recall, AUC, and AUPRC. In AUPRC, all three methods show increasing performance as the prevalence increases. For Methods 21 and 25, AUC is consistent across the prevalence groups, while Method 7 has a rising trend in AUC as the prevalence increases. For Recall, Method 21 surpasses the other two methods by a wide margin; we therefore consider it the dominant classifier. The other two methods show an increasing Recall trend as the prevalence of the dataset increases. For Precision, Method 7 has the highest value, and its Precision increases with the prevalence of the dataset (see Figure 3(a)); the two other methods have much lower Precision than Method 7, although their Precision also increases with prevalence.

Table 6: Performance of RelRank at different starred thresholds. The lower the rank group, the better the performance. The three entries in the brackets indicate the rank group for the ⟨Low, Mid, High⟩ prevalence groups.

Table 7: Performance values of RelRank at different starred thresholds. For Precision, Recall, AUC, AUPRC, Burden, Utility, and Yield, the higher the value, the better the performance; for the AM and QM errors, the lower the better. The three entries in the brackets indicate the value of the metric for the ⟨low, mid, high⟩ prevalence groups.

For an SR platform, we need easy to consume predictions, i.e., the citations should be sorted based on their relevance (as discussed in the Introduction). The most important metrics in our case are AUC, AUPRC, Recall, and Utility. We evaluate our 5-star algorithm in different cumulative star groups. Table 6 shows the obtained results; we also present the average performance of each metric in the different prevalence groups and starring thresholds in Table 7. We observe that our 5-star algorithm performs similarly to Method 21 for the citations receiving 3 stars or above. For the Precision metric, the 5-star algorithm performs better than Method 21, as it falls in rank group 7 whereas Method 21 falls in group 9. This is because our algorithm is a combination of Method 21 with two other methods that have better Precision performance.
RelRank (3-star) to
RelRank (4-star) and then to
RelRank (5-star); the corresponding Precision values are 22.98%, 31.43%, and 41.62%, respectively. For the mid prevalence group, the increase is around 4-6%, and for the low prevalence group, it is around 2%. Similarly, for the Recall metric,
RelRank (3-star) is in the top group whereas
RelRank (4-star) and
RelRank (5-star) are in the second-best group. As Precision increases, Recall decreases as we go up in the rating. For the high prevalence group, Recall drops from 98.43% to 93.82% and then to 84.10%. For the low and mid prevalence groups, we see a similar drop of 3-7%. Interestingly, for both AUC and AUPRC,
RelRank falls in the top group (see Tables 4 and 7; this is not the case when the algorithms are used alone), which makes our ensemble algorithm more suitable for SR applications. Our algorithm achieves around 87% AUC in all prevalence groups and around 37%, 50%, and 64% AUPRC for the low, mid, and high prevalence groups, respectively. We see a similar behavior for Utility. Burden drops from around 90% to 78% and then to 68%, going from
RelRank (3-star) to
RelRank (4-star) and then to
RelRank (5-star) in all prevalence groups, as expected. However, Yield, which represents the fraction of relevant citations that are identified by a given screening approach (i.e., RelRank), does not drop much as we go up in the star rating. It only drops by around 2-4%, which is an important property of our algorithm: the rate at which RelRank finds relevant citations does not decrease rapidly. For the AM and QD errors, which take into account the loss in Recall in both the relevant and irrelevant classes, we see a drop of around 10-12% from
RelRank (3-star) to
RelRank (4-star) to
RelRank (5-star). This is due to the Precision/Recall trade-off. With RelRank (3-star), we achieve the highest Recall and the lowest Precision on the positive class, which indicates that we lose Recall on the irrelevant class. This is why the AM and QM errors are very high (the AM error is around 40% and the QM error around 50%) for RelRank (3-star).
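To make the cumulative star-group evaluation above concrete, the following sketch (in Python, with hypothetical data; the actual star assignments come from RelRank) computes Precision and Recall when every citation rated at or above a given star threshold is treated as predicted-relevant, mirroring the 3-star, 4-star, and 5-star groups of Tables 6 and 7.

```python
from typing import List, Tuple

def precision_recall_at_stars(stars: List[int], labels: List[int],
                              threshold: int) -> Tuple[float, float]:
    """Treat citations with >= `threshold` stars as predicted relevant and
    compute Precision/Recall against the reviewer labels (1 = relevant)."""
    tp = sum(1 for s, y in zip(stars, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(stars, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(stars, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical example: 5-star ratings and gold labels for six citations.
stars = [5, 4, 3, 3, 2, 1]
labels = [1, 1, 0, 1, 0, 0]
for t in (3, 4, 5):  # the cumulative star-groups used in Tables 6 and 7
    p, r = precision_recall_at_stars(stars, labels, t)
    print(f"{t}-star and above: precision={p:.2f}, recall={r:.2f}")
```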
5. Discussion
In our study of the different methods, we observed that almost always there is a method that ranks first in the three prevalence groups. However, there is no single dominant method across all metrics; various methods perform well on different prevalence groups and for different metrics. For instance, w2v row with SVM^perf (b=1, AUC) (Method 21) seems to be a good choice, outperforming the other methods in five metrics. The method achieves around 97% Recall and 87% AUC across the different prevalence groups and datasets. However, it is not present in any of the equivalence groups for a few other metrics. The method gives around 4% Precision in the low prevalence group, 11.75% in the mid prevalence group, and 23% in the high prevalence group. The same behavior is seen for the other metrics on which it performed poorly. This is because the method has high Recall but low Precision and, as a consequence, is penalized heavily by metrics such as AUPRC and Burden. This comprehensive study and the subsequent analysis on our abstract screening platform Rayyan suggest that a holistic "composite" strategy is a better choice in a real-life abstract screening system.

We proposed such a composite method, called
RelRank, in Section 3.4 and showed that it performs well across many metrics. This is not surprising, since each of the baseline methods has been designed to meet a certain objective, whereas abstract screening is a complicated process in which a "good" method should be able to optimize several metrics simultaneously, such as Recall, AUC, and Utility; this can be achieved by a composite strategy like RelRank. We also see a similar behavior in a manual verification over a set of seven reviews. As Table 6 shows, we fall short on some other metrics, namely those that directly depend on Precision. However, from our user surveys, we realized that if the ranking is very good, then a user can generally exclude many citations from the bottom without much effort, and the prediction makes more sense. Therefore, we put a strong emphasis on ranking in our combined algorithm.
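As a rough illustration of such a composite, ranking-oriented strategy, the sketch below combines the decision scores of three classifiers into a star rating by simple voting and uses the stars for ranking. The thresholding rule and score combination shown here are illustrative assumptions, not the exact RelRank procedure of Section 3.4.

```python
import numpy as np

def star_rating(scores_a, scores_b, scores_c, vote_threshold=0.0):
    """Combine three per-citation decision scores into a 1-5 star rating.

    Each classifier casts a 'relevant' vote when its decision score exceeds
    `vote_threshold`.  Citations with more votes get more stars; zero-vote
    citations with very low average scores are demoted to a single star.
    This is an illustrative voting scheme, not the exact RelRank rule.
    """
    scores = np.vstack([scores_a, scores_b, scores_c])   # shape (3, n)
    votes = (scores > vote_threshold).sum(axis=0)         # 0..3 votes
    mean_score = scores.mean(axis=0)
    stars = votes + 2                                      # 0..3 votes -> 2..5 stars
    stars[(votes == 0) & (mean_score < np.percentile(mean_score, 25))] = 1
    return stars

# Hypothetical decision scores from the three SVM variants for five citations.
a = np.array([ 1.2,  0.4, -0.3, -1.1, -2.0])
b = np.array([ 0.8,  0.1, -0.5, -0.9, -1.5])
c = np.array([ 0.9, -0.2,  0.2, -1.3, -1.8])
print(star_rating(a, b, c))   # e.g. [5 4 3 2 1]
```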
Figure 4: The pattern of decrease in the number of relevant citations left across screening iterations for a particular review.
We also analyzed the performance of an abstract screening system in an active learning setting. We observe that the three prevalence groups exhibit similar characteristics, i.e., a few of the relevant citations impose the burden of reviewing almost all of the citations before they are found. In a separate experiment (in addition to the experiments reported in the results section), we further analyzed the inclusion behavior of a particular random run of review 1. Figure 4 shows the results of this experiment. For a random run, around 58% (1500 out of 2544) of the documents have to be screened before the final relevant citation is found. However, looking into the details of the experiment, we see that the run finds all but this final citation after screening only 400 out of 2544 documents, which is around 15% of the total citations. The last citation surprisingly takes 22 more iterations. With the help of a domain expert, it would be interesting to see whether this final citation has outlying characteristics. We see a similar behavior for some other reviews. In our opinion, in addition to capturing relevance information, a holistic model for abstract screening may also need a component that detects and flags such outlying citations; we leave this as future work.
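The active learning behavior discussed above can be reproduced in outline with a small simulation: repeatedly train a linear SVM on the citations labeled so far, rank the remaining ones, label the next batch from the top, and track how many relevant citations are still undiscovered. The snippet below is a minimal sketch on synthetic data, with an assumed batch size and seed set; it is not the exact experimental protocol used for Figure 4.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic review: 2000 citations, ~5% relevant, 50 TF-IDF-like features.
X = rng.normal(size=(2000, 50))
y = (rng.random(2000) < 0.05).astype(int)
X[y == 1] += 0.8                      # make relevant citations roughly separable

labeled = list(rng.choice(2000, size=50, replace=False))   # seed labels
batch_size = 50
while len(labeled) < len(y):
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    if y[labeled].sum() in (0, len(labeled)):
        picked = unlabeled[:batch_size]          # need both classes to train
    else:
        clf = LinearSVC(C=1.0).fit(X[labeled], y[labeled])
        scores = clf.decision_function(X[unlabeled])
        picked = unlabeled[np.argsort(-scores)[:batch_size]]  # screen top-ranked
    labeled.extend(picked.tolist())
    remaining = y.sum() - y[labeled].sum()
    print(f"labeled={len(labeled):4d}  relevant left={remaining}")
    if remaining == 0:
        break
```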
Our study specifically targets studying and improving abstract screening in a practical systematic review platform like Rayyan, which serves thousands of users. Therefore, this study is quite different from the standalone evaluation of single abstract screening tasks, as the method for abstract screening has to generalize over thousands of reviews from thousands of users. For this reason, we have carefully avoided feature representation techniques such as co-citations and MeSH (Medical Subject Headings), which are not readily available, even though some previous works have shown better performance with those features. Furthermore, according to the study in [37], many methods other than Support Vector Machines (SVM) are used as classification algorithms for abstract screening. However, in this study, we restricted our attention to SVM-based methods only. Our proposed 5-star method also combines three of the SVM methods. One of the reasons for this choice, from the system perspective, is that SVM methods require only a small amount of memory to store models and hence allow for storage savings. For example, for
SVM^perf, we need to store only the feature importance values.
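As an illustration of this storage argument, a linear SVM can be served from nothing more than its weight vector and intercept. The sketch below uses scikit-learn's LinearSVC as a stand-in for SVM^perf (an assumption for illustration); scoring a citation then reduces to a dot product with the stored weights.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Train any linear SVM; only the weight vector and intercept need persisting.
X_train = np.random.rand(200, 1000)          # e.g. bag-of-words features
y_train = np.random.randint(0, 2, size=200)
model = LinearSVC().fit(X_train, y_train)

weights = model.coef_.ravel()                # one float per feature
intercept = float(model.intercept_[0])
np.savez_compressed("svm_weights.npz", w=weights, b=intercept)

# Scoring later needs only a dot product with the stored weights.
stored = np.load("svm_weights.npz")
scores = X_train @ stored["w"] + stored["b"] # ranking scores for citations
```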
6. Conclusion
In this paper, we studied the most popular classification methods employed in abstract screening for systematic reviews, focusing on the algorithms that best fit the constraints of a real-world system like Rayyan. We found that there is no single "winner" approach, i.e., various methods performed well on different prevalence groups and for different metrics. For instance, w2v row with SVM^perf (b=1, AUC) outperformed all the other studied methods in five metrics but not in a few others. We also observed that, in an active learning setting, a substantial portion of the included citations is discovered within a few iterations; however, one or two citations show outlying behavior, and the method requires more iterations to retrieve them. We also presented an ensemble method, RelRank, that combines three of the studied methods. Our approach converts their scores into a 5-star rating system through a voting mechanism and ranks citations efficiently based on their graded relevance.
References

[1] Aggarwal, C. C., 2014. Data classification: algorithms and applications. CRC Press.

[2] Ambert, K., 2010. A prospective evaluation of an automated classification system to support evidence-based medicine and systematic review 2010, 121.

[3] Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

[4] Benjamini, Y., Krieger, A. M., Yekutieli, D., 2006. Adaptive linear step-up procedures that control the false discovery rate. Biometrika 93 (3), 491–507.

[5] Bullinaria, J. A., Levy, J. P., Sep 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods 44 (3), 890–907.

[6] Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C., 2009. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Advances in Knowledge Discovery and Data Mining. pp. 475–482.

[7] Chalmers, T. C., Smith, H., Blackburn, B., Silverman, B., Schroeder, B., Reitman, D., Ambroz, A., 1981. A method for assessing the quality of a randomized control trial. Controlled Clinical Trials 2 (1), 31–49.

[8] Chen, C., Liaw, A., Breiman, L., 2004. Using random forest to learn imbalanced data. University of California, Berkeley.

[9] Cohen, A. M., 2008. Optimizing feature representation for automated systematic review work prioritization. In: AMIA Annual Symposium Proceedings. Vol. 2008. p. 121.

[10] Cohen, A. M., Ambert, K., McDonagh, M., 2012. Studying the potential impact of automated document classification on scheduling a systematic review update. BMC Medical Informatics and Decision Making 12 (1), 1.

[11] Cohen, A. M., Hersh, W. R., Peterson, K., Yen, P.-Y., 2006. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association 13 (2), 206–219.

[12] Cohen, A. M., M. S., 2008. RMEQ: A tool for computing equivalence groups in repeated measures studies. In: Linking Literature, BioLINK 2008 Workshop.

[13] Culotta, A., McCallum, A., 2005. Reducing labeling effort for structured prediction tasks. In: AAAI. pp. 746–751.

[14] Dagan, I., Engelson, S. P., 1995. Committee-based sampling for training probabilistic classifiers. In: Proceedings of the Twelfth International Conference on Machine Learning. pp. 150–157.

[15] Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan), 1–30.

[16] Elkan, C., 2001. The foundations of cost-sensitive learning. In: International Joint Conference on Artificial Intelligence. Vol. 17. pp. 973–978.

[17] Esuli, A., Sebastiani, F., 2015. Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data (TKDD) 9 (4), 27.

[18] Fu, J., Lee, S., 2011. Certainty-enhanced active learning for improving imbalanced data classification. In: Data Mining Workshops (ICDMW). IEEE, pp. 405–412.

[19] Gabriel, M., Paskach, C., Sharpe, D., 2013. The challenge and promise of predictive coding for privilege. In: ICAIL 2013 DESI V Workshop.

[20] Grossman, M. R., Cormack, G. V., 2013. The Grossman-Cormack glossary of technology-assisted review. Federal Courts Law Review 7 (1).

[21] Grossman, M. R., Cormack, G. V., 2010. Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review. Rich. JL & Tech. 17, 1.

[22] Han, H., Wang, W.-Y., Mao, B.-H., 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. pp. 878–887.

[23] Hanneke, S., 2014. Theory of active learning.

[24] Henry, D. W., 2015. Predictive coding: Explanation and analysis of judicial impact and acceptance compared to established e-commerce methodology.