Rethinking the Funding Line at the Swiss National Science Foundation: Bayesian Ranking and Lottery
Rachel Heyard*, Manuela Ott, Georgia Salanti, and Matthias Egger

Data Team, Swiss National Science Foundation, Bern, Switzerland; Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland; Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK

*Corresponding author: Rachel Heyard, [email protected]
Abstract
Funding agencies rely on peer review and expert panels to select the research deserving funding. Peer review has limitations, including bias against risky proposals or interdisciplinary research. The inter-rater reliability between reviewers and panels is low, particularly for proposals near the funding line. Funding agencies are increasingly acknowledging the role of chance. The Swiss National Science Foundation (SNSF) introduced a lottery for proposals in the middle group of good but not excellent proposals. In this article, we introduce a Bayesian hierarchical model for the evaluation process. To rank the proposals, we estimate their expected ranks (ER), which incorporate both the magnitude and the uncertainty of the estimated differences between proposals. A provisional funding line is defined based on ER and budget. The ER and its credible interval are used to identify proposals of similar quality whose credible intervals overlap with the funding line. These proposals are entered into a lottery. We illustrate the approach for two SNSF grant schemes in career and project funding. We argue that the method could reduce bias in the evaluation process. R code, data and other materials for this article are available online.
Keywords: grant peer review, expected rank, posterior mean, Bayesian hierarchical model, research funding, modified lottery

Introduction

Public research funding is limited and highly competitive. Not every grant proposal can be funded, even if the idea is worthwhile and the research group highly qualified. Funding agencies face the challenge of selecting the proposals or researchers that merit support among all proposals submitted to a call for applications. Funders generally rely on expert peer review for determining which projects deserve to be funded (Harman, 1998). For example, in the UK, over 95% of medical research funding was allocated based on peer review (Guthrie et al., 2018). At the Swiss National Science Foundation (SNSF), external experts first assess the proposals, which are then reviewed and discussed by the responsible panel, taking into account the external peer review (Severin et al., 2020).

The evidence base on the effectiveness of peer review is limited (Guthrie et al., 2019), but it is clear that peer review of grant proposals has several limitations. Bias against highly innovative and risky proposals is well documented (Guthrie et al., 2018) and probably exacerbated by low success rates, potentially leading to "conservative, short-term thinking in applicants, reviewers, and funders" (Alberts et al., 2014). There is also evidence of bias against highly interdisciplinary projects. An analysis of the Australian Research Council's Discovery Programme showed that the greater the degree of interdisciplinarity, the lower the success rate (Bromham et al., 2016). The data on gender bias are mixed, but women tend to have lower publication rates and lower success rates for high-status research awards than men (Kaatz et al., 2014; van der Lee and Ellemers, 2015).

Several studies have shown considerable disagreement between individual peer reviewers assessing the same proposal, with kappa statistics typically well below 0.50 (Cicchetti, 1993; Cole et al., 1981; Fogelholm et al., 2012; Guthrie et al., 2019). A similar situation is observed at the level of evaluation panels. For example, a study of the Finnish Academy compared the assessments by two expert panels reviewing the same grant proposals (Fogelholm et al., 2012). The kappa for the consolidated panel score of the two panels after the discussion was 0.23. Interestingly, the same kappa was obtained when using the mean of the scores from the external reviewers, indicating that panel discussions did not improve the evaluation's consistency. The low inter-rater reliability means that a research proposal's funding decision will partly depend on the peer reviewers and the responsible panel, and therefore on chance.

The wisdom of a system for research funding decisions that depends partly on chance has been questioned over many years (Cole et al., 1981; Mayo et al., 2006). More recently, Fang and Casadevall (2016) argued that the current grant allocation system is "in essence a lottery without the benefits of being random" and that the role of chance should be explicitly acknowledged. They proposed a modified lottery where excellent research proposals are identified based on peer review and those funded within a given budget are selected at random.
The Health Research Council of New Zealand (Liu et al., 2020), the Volkswagen Foundation (2017) and recently the Austrian Research Fund (2020) are funders that have applied lotteries, with a focus on transformative research or unconventional research ideas. The SNSF introduced a modified lottery for its junior fellowship scheme, focusing on proposals near the funding line (Adam, 2019; Bieri et al., 2020).

The definition of the funding line is central in this context. Many funders rank proposals based on the review scores' simple averages to draw a funding line (Lucy et al., 2017). Averages are understandable to all stakeholders but not optimal for ranking. Proposals around the funding line will often have similar or even identical scores. Because the number of reviewers or panel members is limited, the statistical evidence that two adjacent scores above and below the funding line are different is often weak. Kaplan et al. (2008) have shown that unrealistic numbers of reviewers (100 reviewers or more) would be required to detect smaller differences between average scores reliably. The same point has been made in the context of the ranking of baseball players (Berger and Deely, 1988).

In this paper, we show how Bayesian ranking methods (Laird and Louis, 1989) can be used to define funding lines and identify applications to be entered into a modified lottery. We will describe the expected rank and other statistics, formulate recommendations on their use to support funding decisions, and finally, apply the method to two grant schemes of the SNSF. Mandated by the government, the SNSF is Switzerland's foremost funding agency, supporting scientific research in all disciplines.
Methodology
A Bayesian hierarchical model for the evaluation process
Let us assume a setting where n research proposals are submitted for funding. The proposals are graded for their quality using a score on a 6-point interval scale (outstanding = 6, excellent = 5, very good = 4, good = 3, mediocre = 2, poor = 1). A research proposal i is evaluated by up to m distinct evaluators / assessors, so that y_ij, with i ∈ {1, ..., n} and j ∈ {1, ..., m}, is the score given to proposal i by assessor j.

The observations y_ij are assumed to come from a normal distribution with mean µ, where µ is the mean score of all proposals. The parameters of interest for ranking are the true underlying proposal effects θ_i compared to that average. Then, the model is:

y_ij | θ_i, λ_ij ∼ N(µ + θ_i + λ_ij, σ²),    (1)
θ_i ∼ N(0, τ²),
λ_ij ∼ N(ν_j, τ_λ²).

In the above model, τ² can be interpreted as the total variability of the proposals around the common mean µ. To account for the tendency of assessors to be more or less strict in their scoring, we add the parameter λ_ij, which we assume follows a normal distribution with mean ν_j and variance τ_λ². The assessor-specific mean ν_j accounts for the different scoring habits of different assessors: a stricter assessor, compared to all other assessors, will have a negative ν_j; a more generous assessor will have a positive ν_j. The estimation of these parameters is discussed in the section on estimation and implementation below.
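To make the data-generating process of model (1) concrete, the following base-R sketch simulates scores from it; the sample sizes and parameter values are illustrative assumptions, not estimates from the case studies (tau, tau_lambda and sigma are standard deviations here).

    set.seed(1)
    n <- 20; m <- 8                        # proposals and assessors (illustrative)
    mu <- 4; tau <- 0.4; tau_lambda <- 0.3; sigma <- 0.5
    theta <- rnorm(n, 0, tau)              # true proposal effects
    nu    <- rnorm(m, 0, 0.5)              # assessor-specific means (variance 0.25)
    y <- matrix(NA, n, m)                  # y[i, j]: score of proposal i by assessor j
    for (i in 1:n) {
      for (j in 1:m) {
        lambda_ij <- rnorm(1, nu[j], tau_lambda)
        y[i, j]   <- rnorm(1, mu + theta[i] + lambda_ij, sigma)
      }
    }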
Ranking grant proposals

The key parameters upon which we will base the ranking are the true proposal effects θ_i in model (1). Ranking proposals based on the means of the posterior distribution of the θ_i's may be misleading, because it ignores the uncertainty reflected by the posterior variances of the θ_i's. Laird and Louis (1989) introduced the expected rank (ER_i) as the expectation of the rank of θ_i:

ER_i = E(rank(θ_i)) = Σ_{k=1}^{n} Pr(θ_i ≤ θ_k) = 1 + Σ_{k≠i} Pr(θ_i ≤ θ_k),

because Pr(θ_i ≤ θ_i) = 1. Note that ER incorporates both the magnitude and the uncertainty of the estimated difference between proposals. Scaling ER to a percentage (from 0% to 100%) facilitates interpretation (van Houwelingen et al., 2009):

PCER_i = 100 × (ER_i − 0.5) / n.

The percentile based on ER (PCER) is independent of the number of competing proposals and is interpreted as the probability that proposal i is of worse quality than a randomly selected competing proposal. A related quantity, the surface under the cumulative ranking (SUCRA) line, summarizes all rank probabilities (Salanti et al., 2011). It is based on the probability that proposal i is ranked in the m-th place, denoted by Pr(i = m). These probabilities can be plotted against the possible ranks m = 1, ..., n in a so-called rankogram. The cumulative probabilities cum_im of proposal i being among the m best proposals are summed up to form the SUCRA of proposal i:

SUCRA_i = ( Σ_{m=1}^{n−1} cum_im ) / (n − 1).

The higher the SUCRA value, the higher the likelihood of a proposal being in the top ranks. As SUCRA lies between 0 and 1, it can be interpreted as the fraction of competing proposals that proposal i 'beats' in the ranking. As shown in the appendix, SUCRA can be directly transformed into ER:

ER_i = n − (n − 1) · SUCRA_i.

van Houwelingen et al. (2009) and Lingsma et al. (2009) further introduced the rankability ρ, which can be interpreted as the part of the heterogeneity between the proposals that is due to true differences, as opposed to natural variation. In our more complex setting with additional reviewer effects, the rankability can be defined as

ρ = τ² / (τ² + τ_λ² + σ²).

The rankability lies in the interval [0, 1] and quantifies how appropriate it is to rank the proposals. However, in the quest of defining three groups (accepted and rejected groups as well as a random selection group), we do not require a high rankability, since a clear ranking of the proposals within each group is not necessary.
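Given posterior draws of the θ_i's (for example from the MCMC output described in the next section), all of these statistics can be computed directly. A minimal base-R sketch, where theta_draws is an assumed S × n matrix of posterior samples:

    # rank per posterior draw: rank 1 = largest theta (best proposal)
    ranks <- t(apply(-theta_draws, 1, rank, ties.method = "first"))
    n     <- ncol(theta_draws)
    ER    <- colMeans(ranks)                     # expected ranks
    PCER  <- 100 * (ER - 0.5) / n
    # rank probabilities Pr(i = m) and the SUCRA
    p_rank <- sapply(1:n, function(i) tabulate(ranks[, i], nbins = n) / nrow(ranks))
    cum    <- apply(p_rank, 2, cumsum)           # cum[m, i] = Pr(rank of i <= m)
    SUCRA  <- colSums(cum[1:(n - 1), , drop = FALSE]) / (n - 1)
    all.equal(ER, n - (n - 1) * SUCRA)           # identity shown in the appendix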
Estimation and implementation

We pursue a fully Bayesian approach in JAGS (Just Another Gibbs Sampler) using the {rjags} package in R (Plummer, 2019). Therefore, we need to define priors on the parameters described in model (1). Since, in our case studies, the scores y_ij all lie in the interval [1, 6], the variability of the θ_i's and λ_ij's is bounded. Apart from this upper bound, we do not have much prior information on the variability of the θ_i's and λ_ij's. Therefore, as suggested by Gelman (2006), we use a uniform prior on (0, 2] for τ and τ_λ. Similarly, we fix a uniform prior on (0, 2] for σ:

τ, τ_λ, σ ∼ U(0, 2),
ν_j ∼ N(0, 0.25).

The parameter summarizing the assessor behavior, ν_j, follows a normal distribution around 0 with a small variance (0.25), i.e. about 95% of the reviewers are assumed to have a reviewer-specific mean in [−1, 1]. We implemented the presented methodology in R (see the package ERforResearch available on GitHub and the online supplement).
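As a sketch of the estimation step with {rjags} (the JAGS model string itself is given in the appendix), the model can be compiled and sampled as follows; the data names (score, proposal, voter) and the iteration counts are illustrative assumptions:

    library(rjags)

    dat <- list(score = score, proposal = proposal, voter = voter,
                n_obs = length(score), n_proposals = max(proposal),
                n_voters = max(voter))
    jm <- jags.model(textConnection(model_string), data = dat, n.chains = 2)
    update(jm, n.iter = 5000)                       # burn-in
    post <- coda.samples(jm, variable.names = c("theta", "nu", "sd_theta"),
                         n.iter = 10000)
    draws <- as.matrix(post)
    theta_draws <- draws[, grep("^theta", colnames(draws))]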
Strategy for ranking proposals and drawing the funding line
To rank the competing proposals we can use the following steps:

(1) Rank the proposals based on the means of the posterior distribution of the θ_i's. This strategy incorporates the uncertainty and variation from the evaluation process, but ignores the uncertainty induced by the posterior mean's estimation.

(2) Use the expected ranks ER_i (or the equivalent PCER_i and SUCRA_i) to compare the proposals instead. This approach incorporates all sources of variation.

We start by plotting the ERs together with their 50% credible interval (CrI). A provisional funding line (FL) can be defined by simply funding the best-ranked x proposals, where x is generally determined by the available budget. The provisional funding line helps to see the bigger picture. Are there clusters of proposals? Is there a group of proposals with credible intervals of the ER that overlap with the funding line? The latter group should be considered for inclusion in a lottery group, in which the proposals that can still be funded are selected at random (note that if there is enough funding to fund all proposals in the random selection group, no random selection element is needed). On the other hand, a clear distance between the ERs of non-funded proposals and the funding line, meaning no overlap of their 50% CrI with the provisional FL, suggests that the proposals can easily be separated regarding their quality and no random selection is needed. The implementation of the model in JAGS is presented in the appendix of this paper.
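A minimal sketch of this strategy in R, assuming ER holds the expected ranks, lo and hi the bounds of the 50% CrI of each proposal's rank, and x the number of grants the budget allows:

    ord <- order(ER)                          # best (smallest) ER first
    FL  <- ER[ord[x]]                         # provisional funding line
    lottery <- which(lo <= FL & hi >= FL)     # 50% CrI overlaps the funding line
    fund    <- setdiff(ord[1:x], lottery)     # clearly above the line
    reject  <- setdiff(ord[(x + 1):length(ER)], lottery)
    n_slots <- x - length(fund)               # grants left for the random selection
    drawn   <- lottery[sample.int(length(lottery), n_slots)]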
Case studies
In this section, we will use the approach to simulate recommendations for two funding instruments: the Postdoc.Mobility fellowship for early career researchers and the SNSF Project Funding scheme for established investigators.
Postdoc.Mobility (PM) funding scheme
Junior researchers who recently defended their PhD and wish to pursue an academic career can apply for a fellowship, which will allow them to spend two years in a research group abroad. The proposals are evaluated by two referees who score them on a six-point scale (from 1, poor, to 6, outstanding). Based on these scores, the proposals are triaged into three groups: "fund", "discuss" (in panel) and "reject". The middle group's proposals are then discussed in one of five panels: Humanities; Social Sciences; Science, Technology, Engineering and Mathematics (STEM); Medicine; and Biology. The panel members score each of the proposals and agree on a funding line, with or without using a lottery for some proposals. We used the data from the February 2020 call. Due to the pandemic, the meeting was remote, and the scoring independent: panel members did not know how fellow members scored the proposals. Table 1 summarizes the number of proposals in the different "panel discussion" groups, together with the number of proposals that can still be funded and the size of the panel.

For the funding decision, the θ_i's are of primary interest. We calculated the distribution of the rank of the θ_i's, the ER. Figure 1 shows the different ways of ranking the proposals for all panels. The points on the left show the ranking based on the simple averages (fixed ranking). Next, indicated by the middle points, the proposals are ranked based on the posterior means of the θ_i's (posterior mean ranking). Finally, the points on the right show the expected rank. A provisional funding line, represented by the change of color, is defined by simply funding the x first-ranked proposals, where x can be retrieved from Table 1.

Figure 1: Junior fellowship proposals ranked based on simple averages (points on the left), the posterior means (middle) and the ER (points on the right), by evaluation panel. Note that rank 1 is the highest rank. The color indicates the proposals funded if the x best proposals based on the ER are funded.

Figure 2 plots the ERs of the same proposals together with their 50% credible intervals and the provisional funding line (= the ER of the last fundable proposal). This presentation facilitates identifying proposals that cluster around the funding line, i.e. the proposals that might be included in a lottery. The methodology then recommends the following decisions:

• Humanities: The four best ranked proposals are funded, the remaining seven are rejected; no random selection.
• Social Sciences: The six best ranked proposals are funded, the ten worst ranked proposals are rejected. One proposal is randomly selected for funding among the seventh and the eighth proposal.
• Biology: The eight best ranked proposals are funded, the ten worst ranked proposals are rejected; no random selection.
• Medicine: The five best ranked proposals are funded, the five worst ranked proposals are rejected. Two proposals are randomly selected for funding among the four proposals ranked sixth to ninth.
• STEM: The four best ranked proposals are funded, the ten worst ranked proposals are rejected. Two proposals are randomly selected for funding among the four proposals ranked fifth to eighth.

Figure 2: The expected rank as point estimates (posterior mean), together with 50% credible intervals (colored boxes). The dashed blue line is the provisional funding line, i.e. the ER of the last fundable proposal. The color code indicates the final group the proposal is in: accepted or rejected proposals, or random selection group.

Samples from the posterior distributions of all the parameters in the Bayesian hierarchical model can be extracted from the JAGS model. This allows the funders to better understand the evaluation process. As a reminder, the parameter ν_j summarises the behavior of panel member j. The more negative ν_j is, the stricter the scoring behavior of assessor j compared to the remaining panel members. This also means that the stricter grades from assessor j are corrected less, because they comply with their usual behavior. Figure 3 shows the posterior distributions of the ν_j's for the Social Sciences and Medicine panels (we present only these two panels because they are still small enough to allow interpretation; the illustration of the remaining panels can be found in the online supplement). In the Social Sciences, assessor 8 is a more critical panel member, whereas assessor 14 gives, on average, the highest scores. Also in the Medicine panel, the distributions for the different assessors are quite different. This illustrates how important it is to account for assessor effects.

Figure 3: Posterior distributions of the assessor-specific means ν_j in the Social Sciences and Medicine panels. Each of the 14 assessors voted on up to 18 proposals, unless they had a conflict of interest or other reasons to be absent during the vote.

Additionally, the posterior means of the variation of the proposal effects, τ, for each panel with 90% CrI can be extracted: 0.14 with [0.06, 0.32] for the Humanities, 0.13 with [0.07, 0.25] for the Social Sciences, 0.17 with [0.10, 0.31] for the Biology panel, 0.17 with [0.08, 0.34] for the Medicine panel and 0.17 with [0.09, 0.31] for the STEM panel. The same information can be retrieved for the variation of the assessor effect τ_λ: 0.1 with [0.01, 0.25] for the Humanities, 0.08 with [0, 0.18] for the Social Sciences, 0.07 with [0.01, 0.15] for the Biology panel, 0.08 with [0, 0.2] for the Medicine panel and 0.11 with [0, 0.29] for the STEM panel.

Further figures, representing the rankograms and SUCRAs, the PCER and the ranking using the posterior means of the θ_i's and their 50% CrI, can be found in the appendix and the online supplementary material (snsf-data.github.io/ERpaper-online-supplement).
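The posterior summaries above can be extracted directly from the MCMC output; a minimal sketch, assuming post and draws come from the fitting step sketched in the estimation section:

    nu_draws <- draws[, grep("^nu", colnames(draws)), drop = FALSE]
    # compare the scoring behavior of the panel members (cf. Figure 3)
    boxplot(as.data.frame(nu_draws), horizontal = TRUE,
            xlab = "assessor-specific mean")
    # posterior mean and 90% CrI of tau (called sd_theta in the appendix sketch)
    tau_draws <- draws[, "sd_theta"]
    c(mean = mean(tau_draws), quantile(tau_draws, c(0.05, 0.95)))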
Project Funding
Project funding is the SNSF's most prominent funding instrument. Project grants support blue-sky research of the applicant's choice. We analyzed the proposals submitted to the April 2020 call of the Mathematics, Natural and Engineering Sciences (MINT) division. Overall, the division evaluated 353 grant proposals. The evaluation was done in four panels of the same size (nine members) and a similar number of international and female members. Each panel member evaluated all proposals (unless they had a conflict of interest), and each panel defined its own funding line aiming at a similar (∼30%) success rate. Hence, an additional panel effect δ_k is included in the Bayesian hierarchical model:

y_ij | θ_i, λ_ij, δ_k ∼ N(µ + θ_i + λ_ij + δ_k, σ²),
θ_i ∼ N(0, τ²),
λ_ij ∼ N(ν_j, τ_λ²),
δ_k ∼ N(0, τ_δ²),

where k refers to the section (here k ∈ {1, ..., 4}). The mean of the normal distribution is set to 0 because we assume the sections to be comparable and therefore unbiased (this assumption can of course be relaxed).

Figure 4 (A) shows the resulting expected ranks: several clusters can be identified. Figure 4 (B) shows the same ERs ordered from the best-ranked proposal (bottom left) to the worst (top right), together with their 50% credible intervals. The provisional funding line is defined as the ER of the last fundable proposal: the 106th (30% of 353) best ranked, according to its ER. Zooming in on the provisional funding line shows the cluster of proposals with similar quality and credible intervals overlapping with the funding line. These proposals may be included in the modified lottery.

As for the PM evaluations, we can also estimate the variation of the proposal, assessor and panel effects: 1.13 with [0.99, 1.28] for τ, 0.16 with [0.06, 0.2] for τ_λ and 0.01 with [0, 0.1] for τ_δ. A more detailed analysis of this case study can be found in the online supplement (snsf-data.github.io/ERpaper-online-supplement).

Figure 4: (A) All the proposals submitted to the MINT division ranked based on the simple averages (points on the left), the posterior means of the proposal effects (middle) and the expected rank (right). The orange color indicates funded proposals to ensure a 30% success rate. (B) Proposals ordered from the best ER (bottom left, rank 1) to the worst (top right, rank 352), together with their 50% credible intervals. A provisional funding line to ensure a 30% success rate is drawn (blue dashed line). The proposals are arranged in three groups: funded (orange), random selection (green) and rejected (blue).

Discussion

Inspired by work on ranking baseball players (Berger and Deely, 1988), health care facilities (Lingsma et al., 2010) and treatment effects (Salanti et al., 2011), we developed a Bayesian hierarchical model to support decision making on proposals submitted to the SNSF. A provisional funding line is defined based on the expected rank (ER) and the available budget. The ER and its credible interval are then used to identify proposals with similar quality and credible intervals that overlap with the funding line. These proposals are entered into a lottery to select those to be funded. The approach acknowledges that there are proposals of similar quality and merit, which cannot all be funded. Previous studies suggested that peer review has difficulties discriminating between applications that are neither clearly competitive nor clearly non-competitive (Fang and Casadevall, 2016; Klaus and Alamo, 2018; Scheiner and Bouchie, 2013). Decisions on these proposals typically lead to lengthy panel discussions, with an increased risk of biased decision making.
The method proposed here avoids such discussions and thus may increase the efficiency of the process and reduce bias and costs.

The Bayesian model compares every assessor to every other panel member. The ER considers all uncertainty in the evaluation process that can be observed and quantified. It accommodates the fact that different assessors have different grading habits. To the best of our knowledge, this type of (partially) crossed (i.e. not nested) random-effects model with dependent proposal effects has so far not been used for ranking in combination with the expected rank, but it can easily be fitted with standard statistical software (Pinheiro and Bates, 2000). The model can also be adjusted for potential confounding variables, such as external peer reviewers' characteristics that influence their scores. A recent analysis of 38,250 peer review reports on 12,294 SNSF project grant applications across all disciplines showed that male reviewers, and reviewers from outside Switzerland, awarded higher scores than female reviewers and Swiss reviewers (Severin et al., 2020).

We agree with Goldstein and Spiegelhalter (1996), who argued that "no amount of fancy statistical footwork will overcome basic inadequacies in either the appropriateness or the integrity of the data collected". In an ideal world, all proposals would be evaluated by as many experts as it takes to ensure that meaningful differences between aggregated scores can be detected with confidence. Evaluations would be unbiased and describe nothing else but the quality of the proposals. Human nature and limited resources regarding time and funding sadly prevent this ideal situation from becoming a reality. The evaluation of grant proposals will always be subjective to some extent and affected by unconscious biases and chance. However, we are confident that the method presented here is an improvement over the commonly used approaches to ranking proposals and defining funding lines. Our approach should not be seen as a mechanistic cookbook approach to decision making but as a method that can provide decision support for proposals of similar or indistinguishable quality around the funding line. For example, judgement continues to be required to decide whether a modified lottery should be used or not.

We applied the approach to two instruments in career and project funding at the SNSF. Our case studies addressed the specific context of the SNSF and the two funding schemes, and results may not be generalizable to other instruments or funders. Further, we acknowledge that the team carrying out this study included several researchers affiliated with the SNSF. As the researchers' expectations might influence interpretation, critical comment and review of our approach from independent scholars and other funders will be particularly welcome.

We treated the ordinal scores (from 1 to 6) as continuous variables in our model, thus assuming that the distance between each set of subsequent scores is equal. This assumption might not always be appropriate, but it builds on the methods used currently (averages, i.e. normal linear models with fixed proposal-specific effects). Furthermore, the scores' target distribution as defined by the SNSF (and communicated to the evaluators and the applicants) is a discretized version of a normal distribution. Laird and Louis (1989) discussed the ER in a normal linear model scenario for student achievement.
In future work, we will explore the use of ordinal regression models to take the discrete nature of the scores into account. The choice of priors in Bayesian models is always disputable. Especially for the variance parameters, alternative prior distributions could be investigated, such as half-normal and half-Cauchy priors, whereas inverse-Gamma priors are generally not recommended (Gelman, 2006). Instead of a Gibbs sampler, the {brms} package in R (Bürkner, 2018) can be used, which makes the programming language Stan accessible with commonly used mixed-model syntax, similar to the {lme4} package by Bates et al. (2015); a minimal sketch is given at the end of this section. Another approach is to use the R code provided by Lingsma et al. (2010). They implemented a frequentist approach where the posterior means and variances of the proposal-specific random intercepts are approximated. However, if the proposal effects are a posteriori dependent, because the same assessors evaluate the same set of proposals, the Bayesian approach is easier to implement.

The use of a lottery to allocate research funding is controversial. At the SNSF the applicants are informed about the possible use of random selection, thus complying with the San Francisco Declaration on Research Assessment (DORA, 2019), which states that funders must be explicit about assessment criteria. Of note, in the context of the Explorer Grant scheme of the New Zealand Health Research Council, Liu et al. (2020) recently reported that most applicants agreed with random selection. So far, the SNSF has received no negative or positive reactions to the use of random selection from applicants (Bieri et al., 2020).

In conclusion, we propose that a Bayesian modelling approach to ranking proposals, combined with a modified lottery, can improve the evaluation of grant and fellowship proposals. More research on the limitations inherent in peer review and grant evaluation is needed. Funders should be creative when investigating the merit of different evaluation strategies (Severin and Egger, 2020). We encourage other funders to conduct studies and test evaluation approaches to improve the evidence base for rational and fair research funding.
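As a hedged illustration of the {brms} alternative mentioned above, a simplified variant of model (1) can be written in mixed-model syntax; note that this sketch folds the observation-level effect λ_ij into the residual, and that scores is a hypothetical data frame with columns score, proposal and assessor.

    library(brms)

    fit <- brm(score ~ 1 + (1 | proposal) + (1 | assessor),
               data = scores, chains = 2, iter = 4000,
               prior = c(set_prior("uniform(0, 2)", class = "sd", lb = 0, ub = 2),
                         set_prior("uniform(0, 2)", class = "sigma", lb = 0, ub = 2)))
    ranef(fit)$proposal        # posterior summaries of the proposal effects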
Supplemental Materials
An online, fully reproducible appendix is provided, which uses an R package with the implementation of the above-presented methodology (see snsf-data.github.io/ERpaper-online-supplement/).

Acknowledgment
We are grateful for helpful discussions with Hans van Houwelingen and Ewout Steyerberg on an earlier version of this manuscript.

Disclosure statement
R. Heyard and M. Ott are employed by the SNSF. M. Egger is the acting president of the Swiss National Research Council.
References
Adam, D. (2019). Science funders gamble on grant lotteries. Nature 575(7784), 574–575.

Alberts, B., M. W. Kirschner, S. Tilghman, and H. Varmus (2014). Rescuing US biomedical research from its systemic flaws. Proceedings of the National Academy of Sciences of the United States of America 111(16), 5773–5777.

Austrian Research Fund (2020). 1000 Ideas Programme.

Bates, D., M. Maechler, B. M. Bolker, and S. C. Walker (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1), 1–48.

Berger, J. and J. Deely (1988). A Bayesian approach to ranking and selection of related means with alternatives to analysis-of-variance methodology. Journal of the American Statistical Association 83(402), 364–373.

Bieri, M., K. Roser, R. Heyard, and M. Egger (2020). How to best evaluate applications for junior fellowships? Remote evaluation and face-to-face panel meetings compared. bioRxiv, 2020.11.26.400028.

Bromham, L., R. Dinnage, and X. Hua (2016). Interdisciplinary research has consistently lower funding success. Nature 534(7609), 684–687.

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. R Journal 10(1), 395–411.

Cicchetti, D. (1993). The reliability of peer review for manuscript and grant submissions: it's like déjà vu all over again (author's response). Behavioral and Brain Sciences 16(2), 401–403.

Cole, S., J. R. Cole, and G. A. Simon (1981). Chance and consensus in peer review. Science 214(4523), 881–886.

DORA (2019). San Francisco Declaration on Research Assessment.

Fang, F. C. and A. Casadevall (2016). Research funding: the case for a modified lottery. mBio 7(2), e00694-16.

Fogelholm, M., S. Leppinen, A. Auvinen, J. Raitanen, A. Nuutinen, and K. Väänänen (2012). Panel discussion does not improve reliability of peer review for medical research grant proposals. Journal of Clinical Epidemiology 65(1), 47–52.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis 1(3), 515–534.

Goldstein, H. and D. J. Spiegelhalter (1996). League tables and their limitations: statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society: Series A (Statistics in Society) 159(3), 385–443.

Guthrie, S., I. Ghiga, and S. Wooding (2018). What do we know about grant peer review in the health sciences? F1000Research 6, 1335.

Guthrie, S., D. Rodriguez Rincon, G. McInroy, B. Ioppolo, and S. Gunashekar (2019). Measuring bias, burden and conservatism in research funding processes. F1000Research 8, 851.

Harman, G. (1998). The management of quality assurance: a review of international practice. Higher Education Quarterly 52(4), 345–364.

Kaatz, A., B. Gutierrez, and M. Carnes (2014). Threats to objectivity in peer review: the case of gender. Trends in Pharmacological Sciences 35(8), 371–373.

Kaplan, D., N. Lacetera, and C. Kaplan (2008). Sample size and precision in NIH peer review. PLOS ONE 3(7), e2761.

Klaus, B. and D. d. Alamo (2018). Talent identification at the limits of peer review: an analysis of the EMBO Postdoctoral Fellowships selection process. bioRxiv, 481655.

Laird, N. M. and T. A. Louis (1989). Empirical Bayes ranking methods. Journal of Educational Statistics 14(1), 29–46.

Lingsma, H., M. Eijkemans, and E. Steyerberg (2009). Incorporating natural variation into IVF clinic league tables: the Expected Rank. BMC Medical Research Methodology 9(1), 53.

Lingsma, H., E. Steyerberg, M. Eijkemans, D. Dippel, W. Scholte Op Reimer, H. van Houwelingen, and The Netherlands Stroke Survey Investigators (2010). Comparing and ranking hospitals based on outcome: results from The Netherlands Stroke Survey. QJM: An International Journal of Medicine 103(2), 99–108.

Liu, M., V. Choy, P. Clarke, A. Barnett, T. Blakely, and L. Pomeroy (2020). The acceptability of using a lottery to allocate research funding: a survey of applicants. Research Integrity and Peer Review 5(1), 3.

Lucy, L., F. Jane E., B. Lance B., M. Nehal N., and L. Laura (2017). Peer review practices for evaluating biomedical research grants: a scientific statement from the American Heart Association. Circulation Research 121(4), e9–e19.

Mayo, N. E., J. Brophy, M. S. Goldberg, M. B. Klein, S. Miller, R. W. Platt, and J. Ritchie (2006). Peering at peer review revealed high degree of chance associated with funding of grant applications. Journal of Clinical Epidemiology 59(8), 842–848.

Pinheiro, J. C. and D. M. Bates (2000). Mixed-Effects Models in S and S-PLUS. New York, NY: Springer.

Plummer, M. (2019). Package 'rjags': Bayesian graphical models using MCMC.

Salanti, G., A. E. Ades, and J. P. A. Ioannidis (2011). Graphical methods and numerical summaries for presenting results from multiple-treatment meta-analysis: an overview and tutorial. Journal of Clinical Epidemiology 64(2), 163–171.

Scheiner, S. M. and L. M. Bouchie (2013). The predictive power of NSF reviewers and panels. Frontiers in Ecology and the Environment 11(8), 406–407.

Severin, A. and M. Egger (2020). Research on research funding: an imperative for science and society. British Journal of Sports Medicine, bjsports-2020-103340.

Severin, A., J. Martins, R. Heyard, F. Delavy, A. Jorstad, and M. Egger (2020). Gender and other potential biases in peer review: cross-sectional analysis of 38 250 external peer review reports. BMJ Open 10(8), e035058.

van der Lee, R. and N. Ellemers (2015). Gender contributes to personal research funding success in The Netherlands. Proceedings of the National Academy of Sciences of the United States of America 112(40), 12349–12353.

van Houwelingen, H., R. Brand, and T. Louis (2009). Empirical Bayes methods for monitoring health care quality. arXiv:2009.03058 [stat].

Volkswagen Foundation (2017). Experiment! – In search of bold research ideas.

Appendices
Analytical formulas
This section gives some insights on how the previously discussed quantities can be computed analytically rather than from MCMC samples. The probability of θ_i being smaller than θ_k, which is used for the ER, can be computed as follows:

Pr(θ_i ≤ θ_k) = Pr( [θ_i − θ_k − (θ̂_i − θ̂_k)] / sqrt( var(θ̂_i) + var(θ̂_k) − 2 cov(θ̂_i, θ̂_k) ) ≤ −(θ̂_i − θ̂_k) / sqrt( var(θ̂_i) + var(θ̂_k) − 2 cov(θ̂_i, θ̂_k) ) )
             = Φ( (θ̂_k − θ̂_i) / sqrt( var(θ̂_i) + var(θ̂_k) − 2 cov(θ̂_i, θ̂_k) ) ),

where Φ is the standard normal cumulative distribution function. Here, θ̂_i denotes the posterior expectation of θ_i, and var(θ̂_i) and cov(θ̂_i, θ̂_k) the corresponding posterior variance and covariance. If the θ_i's are a posteriori independent (which does in general not hold for model (1)), then cov(θ̂_i, θ̂_k) = 0 for i ≠ k.

There is also an analytical version of the posterior variance of the rank R_i, discussed in the appendix of Laird and Louis (1989), that can be used to compute confidence intervals rather than credible intervals. According to the latter authors, this posterior variance is given by:

var(R_i) = var( Σ_{k=1}^{n} I(θ_i ≤ θ_k) )
         = Σ_{k=1}^{n} var( I(θ_i ≤ θ_k) ) + 2 Σ_{k=1}^{n} Σ_{l>k} cov( I(θ_i ≤ θ_k), I(θ_i ≤ θ_l) )
         = Σ_{k=1}^{n} Pr(θ_i ≤ θ_k) · (1 − Pr(θ_i ≤ θ_k)) + 2 Σ_{k=1}^{n} Σ_{l>k} [ Pr(θ_i ≤ min(θ_k, θ_l)) − Pr(θ_i ≤ θ_k) · Pr(θ_i ≤ θ_l) ].
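A minimal R sketch of this normal approximation, assuming theta_draws is the matrix of posterior samples used earlier:

    theta_hat <- colMeans(theta_draws)        # posterior expectations
    V <- cov(theta_draws)                     # posterior (co)variances
    n <- length(theta_hat)
    P <- matrix(0, n, n)                      # P[i, k] = Pr(theta_i <= theta_k)
    for (i in 1:n) for (k in 1:n) {
      s <- sqrt(V[i, i] + V[k, k] - 2 * V[i, k])
      P[i, k] <- if (i == k) 1 else pnorm((theta_hat[k] - theta_hat[i]) / s)
    }
    ER_analytic <- rowSums(P)                 # analytical expected ranks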
Relationship between ER and SUCRA

In the following, the relationship between the SUCRA and the ER is derived. Note that the ER, i.e. the expectation of the rank, can be expressed in terms of the rank probabilities as follows: ER_i = Σ_{j=1}^{n} j · Pr(i = j). Then:

SUCRA_i = [1 / (n − 1)] Σ_{m=1}^{n−1} cum_im = [1 / (n − 1)] Σ_{m=1}^{n−1} Σ_{j=1}^{m} Pr(i = j),

so that

(n − 1) · SUCRA_i = Σ_{m=1}^{n} Σ_{j=1}^{m} Pr(i = j) − Σ_{j=1}^{n} Pr(i = j)
                  = Σ_{j=1}^{n} (n − j + 1) · Pr(i = j) − 1,   because Σ_{j=1}^{n} Pr(i = j) = 1,
                  = (n + 1) − Σ_{j=1}^{n} j · Pr(i = j) − 1
                  = n − ER_i.

Rearranging yields ER_i = n − (n − 1) · SUCRA_i.
Implementation of the Bayesian model in rjags

The following code sketches the definition of the model in R through the package rjags; the data layout (one row per score, with index vectors proposal and voter) is an illustrative assumption consistent with model (1) and the priors described in the estimation section. Note that, for the sampling in the project funding case study, we use 2 chains, 2 × burn-in and 4 × final iterations.

    model_string <- "model{
      for (r in 1:n_obs){
        # score of proposal proposal[r] given by assessor voter[r], model (1)
        score[r] ~ dnorm(mu + theta[proposal[r]] + lambda[r], prec_sigma)
        lambda[r] ~ dnorm(nu[voter[r]], prec_lambda)
      }
      for (i in 1:n_proposals){
        theta[i] ~ dnorm(0, prec_theta)
      }
      for (l in 1:n_voters){
        nu[l] ~ dnorm(0, 4)                 # precision 4, i.e. variance 0.25
      }
      # uniform priors on the standard deviations, converted to precisions
      sd_theta ~ dunif(0, 2)
      prec_theta <- pow(sd_theta, -2)
      sd_lambda ~ dunif(0, 2)
      prec_lambda <- pow(sd_lambda, -2)
      sd_sigma ~ dunif(0, 2)
      prec_sigma <- pow(sd_sigma, -2)
      mu ~ dnorm(0, 0.01)                   # vague prior on the overall mean
    }"

Credible intervals for the θ_i's

Figure 5 shows the θ_i's together with their 50% CrI; the higher the θ_i, the better the quality of the proposal. The division into the maximum of three groups (accepted, rejected and random selection) is the same using the proposal effects as with the expected rank.

Figure 5: The means of the posterior distributions of the θ_i's together with their 50% credible intervals for the PM fellowship proposals. The dashed blue line is the provisional funding line, i.e. the posterior mean of the θ_i of the last fundable proposal.