Arbitrariness of peer review: A Bayesian analysis of the NIPS experiment
Olivier François
Université Grenoble-Alpes, Centre National de la Recherche Scientifique, TIMC-IMAG UMR 5525, Grenoble, 38042, France.

Running Title: Arbitrariness of peer review
Keywords:
Peer review, Arbitrariness, NIPS experiment.

Corresponding Author:
Olivier François
Université Grenoble-Alpes,
TIMC-IMAG, UMR CNRS 5525,
Grenoble, 38042, France.
+334 56 52 00 25 (Phone)
+334 56 52 00 55 (Fax)
[email protected]

Abstract

The principle of peer review is central to the evaluation of research, by ensuring that only high-quality items are funded or published. But peer review has also received criticism, as the selection of reviewers may introduce biases in the system. In 2014, the organizers of the "Neural Information Processing Systems" conference conducted an experiment in which 10% of submitted manuscripts (166 items) went through the review process twice. Arbitrariness was measured as the conditional probability for an accepted submission to get rejected if examined by the second committee. This number was equal to 60%, for a total acceptance rate equal to 22.5%. A Bayesian analysis of the experiment led to an estimate of arbitrariness equal to 61%, with 95% credibility interval I = (0.…, 0.73).

Introduction
The principle of peer review is central to the evaluation of research proposals and research studies, by ensuring that only high-quality items are funded or published. Since its origin, the aim of peer review has been to filter out lack of novelty, flaws in research methodology or data, lack of reproducibility, falsification, plagiarism, and other forms of misconduct (Hames 2007; Vintzileos et al. 2010). But peer review has also been criticized on the grounds that it imposes a burden on research communities, that the selection of reviewers may introduce biases in the system, and that the reviewers' judgements may be subjective or arbitrary (Kassirer and Campion 1994; Hojat et al. 2003; Li and Agha 2015). Arbitrariness of peer review, which is the quality of accepting submitted items by chance or whim, and not by necessity or rationality, can be measured by the heterogeneity of evaluations among raters during the review process (Mutz et al. 2012; Marsh et al. 2008; Giraudeau et al. 2011).

In 2014, the organizers of the
Neural Information Processing Systems (NIPS) conference, Corinna Cortes and Neil Lawrence, decided to look at how fair the conference evaluation system was (Langford and Guzdial 2015). NIPS is one of the main theoretical computer science conferences, and its review process has an advanced format which includes double-blind review and the possibility of rebuttal for authors. Cortes and Lawrence ran the
NIPS experiment, in which
1/10 of the manuscripts (items) submitted to NIPS went through the review process twice. A total of n = 166 submissions were reviewed by two independent program committees, and the discrepancy of committee decisions was reported in a fully transparent way (Langford and Guzdial 2015). The NIPS organizers defined arbitrariness as the conditional probability, a, for an accepted submission to get rejected if examined by a second committee. From the NIPS experiment, the observed arbitrariness was equal to â = 60%. Since the total acceptance rate was equal to π̂ = 22.5%, this value was to be compared with the arbitrariness of a purely random decision process, a = 1 − π̂ = 77.5%. Here, we introduce models based on a hidden variable, x, corresponding to the probability that a submitted item meets the basic quality criteria. The models assume that items which do not meet those minimal quality criteria are almost surely rejected by both committees. We used the models to quantify uncertainty on the observed value of arbitrariness and on the hidden variable x, and computed model probabilities for several conditional acceptance rules.
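As a one-step check of the 77.5% benchmark: if a committee accepted submissions uniformly at random at rate π̂, its decision would be independent of the first committee's decision, so that

a_random = P(reject by committee 2 | accept by committee 1) = P(reject by committee 2) = 1 − π̂ = 1 − 0.225 = 0.775,

that is, 77.5%.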
Models

Let n be the total number of submissions during the NIPS experiment (n = 166), and let π̂ = k/n = 22.5% be the total acceptance rate at the NIPS conference. To explain an observed arbitrariness level, a, we introduce the Reject or Flip a Coin (RFC) model, which is based on two parameters, x and y. The first parameter, x, is a hidden variable representing the probability that an item meets basic quality criteria, such as novelty, clarity, absence of methodological flaws, reproducibility of results, and no form of misconduct. The second parameter, y, represents the conditional probability that an item meeting all quality criteria is accepted. Items that fail to meet all quality criteria are rejected with probability one.

From basic probability theory, the total acceptance rate in the RFC model is equal to π = xy, and arbitrariness is equal to a = 1 − y. Thus, model parameters are related through the following relationship

a = 1 − π/x ≤ 1 − π.

When the total acceptance rate is known, the arbitrariness level is maximal for x = 100%, and the maximum value is 1 − π. Arbitrariness is avoided when a = 0. In this case, the total acceptance rate corresponds to the rate of items meeting the quality criteria, and we have y = 1. With the NIPS experiment data, the moment estimates of x and y are equal to x̂ = π̂/(1 − â) = 56% and ŷ = 1 − â = 40%, respectively.

Assuming non-informative prior distributions for x and y, we used the Bayes formula to derive the posterior distribution of the model parameter (x, y). The posterior distribution can be described by the following equation

p(x, y | â, k, n) ∝ (xy)^k (1 − xy)^{n−k} y^{(1−â)k} (1 − y)^{âk}.

The moment estimates x̂ = 56% and ŷ = 40% correspond to the mode of the posterior distribution and to the maximum of the likelihood function. By integrating with respect to the variable x, we obtained the posterior distribution of y as follows

p(y | â, k, n) ∝ y^{(1−â)k} (1 − y)^{âk} ∫₀¹ (xy)^k (1 − xy)^{n−k} dx ∝ y^{(1−â)k−1} (1 − y)^{âk},

where the last step uses the change of variable u = xy, which turns the integral into y^{−1} ∫₀^y u^k (1 − u)^{n−k} du, an incomplete beta integral that is approximately constant for the values of y carrying the posterior mass. In other words, p(y | â, k, n) is a beta distribution with parameters (1 − â)k and âk + 1,

y | â, k, n ∼ beta((1 − â)k, âk + 1).

To provide an exact simulation algorithm for the posterior distribution of the model parameter (x, y), we computed the density of the conditional distribution of x given y, â, k, n. This conditional distribution can be represented as the distribution of the random variable x⋆/y, where x⋆ is drawn from a beta(k + 1, n − k + 1) distribution conditioned on being lower than y. Next, we used a basic rejection algorithm for sampling 100,000 replicates from the posterior distribution (Figure 1). A Bayesian Monte Carlo estimate of the arbitrariness parameter, a = 1 − y, was 61%, and its 95% credibility interval was equal to I = (0.…, 0.73). The estimate of the rate of items meeting all quality criteria, x, was 56%, and the 95% credibility interval was I = (0.…, 0.83) (Figure 2).
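The rejection algorithm described above can be implemented in a few lines. The following is a minimal Python sketch, not the original code of the study; in particular, the number of accepted submissions is taken to be k = 37, recovered from π̂ = k/166 ≈ 22.5% (an assumption, as the exact count is not quoted in the text).

    import numpy as np

    rng = np.random.default_rng(2014)
    n, k, a_hat = 166, 37, 0.60        # k = 37 is an assumption (22.5% of 166)

    def sample_rfc_posterior(n_samples=100_000):
        """Draw (x, y) pairs from the RFC posterior p(x, y | a_hat, k, n)."""
        xs, ys = [], []
        while len(xs) < n_samples:
            # After the change of variable u = x * y, the posterior is the
            # product of two independent beta densities restricted to u < y.
            y = rng.beta((1 - a_hat) * k, a_hat * k + 1)
            u = rng.beta(k + 1, n - k + 1)
            if u < y:                  # basic rejection step (ensures x < 1)
                xs.append(u / y)       # x = u / y, as described in the text
                ys.append(y)
        return np.array(xs), np.array(ys)

    x, y = sample_rfc_posterior()
    a = 1 - y                          # arbitrariness parameter
    print("estimate of a:", a.mean().round(2))
    print("95% credibility interval for a:", np.quantile(a, [0.025, 0.975]).round(2))
    print("estimate of x:", x.mean().round(2))
    print("95% credibility interval for x:", np.quantile(x, [0.025, 0.975]).round(2))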
To model the fact that some submissions are clearly accepted by any committee, we considered an extension of the RFC model in which both low-quality and high-quality items give rise to deterministic decisions. The new model is called the
Reject, Accept or Flip a Coin (RAFC) model. Low-quality items are rejected by committees with probability 1, whereas high-quality items are accepted with probability 1. The rate of low-quality items is 1 − x, and the rate of high-quality items is αx, where α is a known parameter with value between 0 and 1. Items that meet all quality criteria but that are not high-quality items are accepted with probability y. In the RAFC model, arbitrariness is parameterized as

a = (1 − α)xy(1 − y) / (αx + (1 − α)xy),

and the total acceptance rate π is equal to

π = αx + (1 − α)xy.

The RFC model is a particular instance of the RAFC model, obtained for α = 0. The question here was to evaluate for which values of α the RAFC model could provide a better fit to the data than the RFC model.

[Figure 1: x and y parameters in the RFC model.]

[Figure 2: Posterior density for the rate of items meeting all quality criteria, x, and for the arbitrariness parameter, a = 1 − y, in the RFC model.]

Assuming non-informative prior distributions for x and y in the RAFC model, the posterior distribution of (x, y) was described by the following formula

p(x, y | â, k, n) ∝ (αx + (1 − α)xy)^k (1 − αx − (1 − α)xy)^{n−k}
× ((1 − α)xy(1 − y) / (αx + (1 − α)xy))^{âk}
× (1 − (1 − α)xy(1 − y) / (αx + (1 − α)xy))^{(1−â)k}.

Taking α = 5%, the mode of the posterior distribution corresponded to the parameter values x̂ = 69% and ŷ = 29%. To sample from the posterior distribution and evaluate uncertainty on model parameters, we used an approximate Bayesian computation (ABC) approach based on 100,000 simulations (Csilléry et al. 2010). Using ABC and α = 5%, the Bayesian estimates of x and y were calculated as x̂ = 67% (I = (0.…, 0.…)) and ŷ = 32% (I = (0.…, 0.…)), and the 95% credibility interval for the arbitrariness level was I = (0.…, 0.…). In addition to α = 5%, five other values of α were tested (α = 0%, 2.5%, 10%, 20%, 50%), and the corresponding RAFC models were compared with the model using α = 5%. The comparison was achieved by using an ABC approach to evaluate posterior model probabilities. ABC model choice indicated that smaller values of α provided a better fit to the data than larger values. The RAFC model using α = 5% corresponded to the highest posterior probability (p = 26%). In a pairwise comparison with the RFC model, the RAFC model using α = 5% had a probability of 56%, and the Bayes factor was equal to BF = 1.31 (barely worth mentioning).
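The ABC step can be sketched as follows: simulate the two-committee experiment under parameter values drawn from the priors, and keep the draws whose simulated acceptance rate and arbitrariness fall close to the observed values. Below is a minimal rejection-ABC sketch in Python, under the same assumption k = 37; the uniform priors follow the text, while the tolerance and summary statistics are illustrative choices rather than those of the original study.

    import numpy as np

    rng = np.random.default_rng(2015)
    n, k, a_hat, alpha = 166, 37, 0.60, 0.05
    tol = 0.05                               # ABC tolerance (an assumption)

    def simulate(x, y):
        """Simulate two independent committee decisions for n items under RAFC."""
        # Item categories: 0 = low quality (always rejected), 1 = high quality
        # (always accepted), 2 = meets basic criteria only (accepted w.p. y).
        q = rng.choice(3, size=n, p=[1 - x, alpha * x, (1 - alpha) * x])
        acc1 = (q == 1) | ((q == 2) & (rng.random(n) < y))
        acc2 = (q == 1) | ((q == 2) & (rng.random(n) < y))
        if not acc1.any():
            return 0.0, 1.0
        return acc1.mean(), (acc1 & ~acc2).sum() / acc1.sum()

    kept = []
    for _ in range(100_000):
        x, y = rng.uniform(), rng.uniform()  # non-informative (uniform) priors
        pi_sim, a_sim = simulate(x, y)
        # keep draws whose simulated acceptance rate and arbitrariness are
        # both close to the observed summary statistics
        if abs(pi_sim - k / n) < tol and abs(a_sim - a_hat) < tol:
            kept.append((x, y))

    x_post, y_post = np.array(kept).T
    print("ABC estimate of x:", x_post.mean().round(2))
    print("ABC estimate of y:", y_post.mean().round(2))

Posterior model probabilities for different values of α can be approximated in the same spirit, by running this loop once per candidate α and comparing the numbers of accepted draws.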
Discussion

Peer review is not perfect, and levels of arbitrariness in the range (0.…, 0.73) supported the evidence for biases during the review process. In light of the RFC model interpretation, the results indicated that the burden on reviewers, which is one of the biggest costs in the peer review system, could be alleviated by restricting their role to checking whether basic quality criteria, such as novelty, absence of methodological flaws, and reproducibility of results, are met. This phase of the review process should end with an acceptance rate, x, within the interval (0.…, 0.83). Accepting the remaining items at random, with probability y = π/x, would then lead to the same acceptance rate and level of arbitrariness as in the NIPS experiment (with x̂ = 56%, this gives y = 0.225/0.56 ≈ 40% and a = 1 − y ≈ 60%, the observed values). It seems however unlikely that this apparently random process could be envisaged as an alternative to the original review process in future experiments.

One of the highest costs in the peer review system falls on the submitters themselves and on their funding agencies. Arbitrary decisions delay the publication or funding of research works that have merit. Those decisions can have a negative influence on junior researchers, who might be more strongly affected by arbitrary rejection than senior researchers (see Bourne 2005, "Rule 5: learn to live with rejection"). A positive aspect of the NIPS experiment is that its analysis provides a way to restrict arbitrariness in future instances of the peer review process. The estimate x̂ = 56% is a clear suggestion to push the total acceptance rate close to π = 56%, so that arbitrariness would be closer to zero.

Many critics claim that review processes are unnecessary and slow the communication of information. Initiatives such as preprint repositories have demonstrated the utility of open science (Sitek and Bertelmann 2014). Some multidisciplinary open access journals use publication criteria based on ethical standards and the rigor of the methodology and conclusions reported. Although surveys of peer review among fee-charging open access journals showed that the target of publishing "scientifically rigorous research" could be difficult to reach (Bohannon 2013), the lesson from the NIPS experiment is that accepting all scientifically rigorous research works would reduce arbitrariness to very small levels.

References

Bohannon J (2013). Who's afraid of peer review? Science 342(6154): 60-65.

Bourne PE (2005). Ten simple rules for getting published. PLoS Computational Biology 1(5): e57.

Csilléry K, Blum MGB, Gaggiotti OE, François O (2010). Approximate Bayesian computation (ABC) in practice. Trends in Ecology and Evolution 25(7): 410-418.

Gelman A, Carlin JB, Stern HS, Rubin DB (2014).
Bayesian Data Analysis. London: Chapman & Hall/CRC.

Giraudeau B, Leyrat C, Le Gouge A, Léger J, Caille A (2011). Peer review of grant applications: a simple method to identify proposals with discordant reviews. PLoS ONE 6: e27557.

Hames I (2007).