Openness and Reproducibility: Insights from a Model-Centric Approach
Bert Baumgaertner, Berna Devezer, Erkan O. Buzbas, Luis G. Nardin
Abstract
This paper investigates the conceptual relationship between openness and reproducibility using a model-centric approach, heavily informed by probability theory and statistics. We first clarify the concepts of reliability, auditability, replicability, and reproducibility, each of which denotes a potential scientific objective. Then we advance a conceptual analysis to delineate the relationship between open scientific practices and these objectives. Using the notion of an idealized experiment, we identify which components of an experiment need to be reported and which need to be repeated to achieve the relevant objective. The model-centric framework we propose aims to contribute precision and clarity to the discussions surrounding the so-called reproducibility crisis.
Keywords reproducibility · open science · replication · model-centric · reliability · confirmation

Bert Baumgaertner
Department of Politics and Philosophy, University of Idaho, Moscow, ID 84844-1104 USA
E-mail: [email protected]

Berna Devezer
Department of Business, University of Idaho, Moscow, ID 84844-1104 USA
E-mail: [email protected]

Erkan O. Buzbas
Department of Statistical Science, University of Idaho, Moscow, ID 84844-1104 USA
E-mail: [email protected]

Luis G. Nardin
Computer Science, National College of Ireland, Dublin, Ireland
E-mail: [email protected]

"I also know this, that, although I might teach only what is true, you must deny me faith in my words. So in order that I do not perorate, but not leaving with any faith in all my words, I transfer the authority to anyone who wishes, to come and show me if what I say concerning the findings of anatomies is really true. For I have already shown thousands of times the twin [organs] that intercede the spermatic cords from the outer horns to the inside of the uterus, be the animal a goat or an ox or a donkey or a horse, that either wooden sticks round and long, even three or four times thicker, or the ones called sword-like probes, I have positioned through the horns. And this must be shown by anyone [that follows the same experimental method] after I and my pupils have died."
Galen of Pergamon (c.130–210 AD)
In the last decade, scientists have discussed whether science is facing a reproducibility crisis (Baker, 2016) and numerous explanations have been given for the proliferation of irreproducible results (Heesen, 2018; Munafò et al., 2017; Spellman, 2015). Many of these explanations suggest that irreproducibility is the consequence of methodological and cultural practices that are erroneous at individual or system levels. Well-known examples of such questionable research practices include HARKing, p-hacking, and publication bias (Bishop, 2019; Spellman, 2015). HARKing (hypothesizing after results are known) involves presenting a post hoc hypothesis conditional on observing the data as an a priori hypothesis (Kerr, 1998; Munafò et al., 2017). P-hacking is a form of data dredging to find statistically significant results and is a misapplication of proper statistical methodology (Bruns and Ioannidis, 2016; Munafò et al., 2017). Publication bias involves omitting studies with statistically nonsignificant results from publications and is primarily attributed to flawed incentive structures in scientific publishing (Munafò et al., 2017; Open Science Collaboration, 2015). The common denominator across these three phenomena is a lack of transparency (of hypotheses, analyses, and studies, respectively) in research reporting.

As such, openness is sometimes touted as a remedy to the supposed crisis of irreproducibility (Collins and Tabak, 2014; Iqbal et al., 2016; Nosek et al., 2015). For example, the National Academies of Sciences, Engineering, and Medicine view data sharing as a prerequisite for reproducible results (National Academies of Sciences, Engineering, and Medicine, 2017). A number of tools are becoming increasingly available to make numerous aspects of science open (Ioannidis, 2014; Munafò et al., 2017; Nosek et al., 2015, 2018). As the thinking goes, the building blocks of a reproducible science are replication studies (Earp and Trafimow, 2015; LeBel and Peters, 2011; Schmidt, 2009), and transparency makes replication studies possible by making research materials and processes explicit and accessible. The link between transparency, replication studies, and reproducibility is not straightforward, however. For example, results can be reproduced from experiments that are not completely transparent (say, because the same materials in question are not accessible, or methods are unknown), and reproducibility can be difficult to achieve even when experiments are completely transparent and exact replication studies can be conducted.

We are not convinced by the picture that openness, which would correct questionable research practices, is the straightforward solution to the purported reproducibility crisis. Even if openness were the solution, it is not clear to us why it would be. Our problem is an intellectual one: if we lack a clear understanding of the relationship between transparency, replications, and reproducibility, it is not clear how or why openness is supposed to help. Before calling foul on scientists or scientific practice, it behooves us to better understand these relationships and set our expectations right for a reproducible science. That is the aim of this paper.

To set the stage, we first present a toy example and derive a number of relatively benign observations and conclusions regarding replications and reproducibility, drawing from probability theory and well-known facts in statistics (Section 2).
This motivates our conceptualization of an idealized experiment and definitions of reliability, auditability, replicability, and reproducibility (Section 3). We are then in a position to present our analysis of openness, which comes in two parts. The first part emphasizes the epistemic aspect of reproducibility, which illuminates a historical debate between Newton, Goethe, and others (Section 4). The second part provides a more detailed account of the components of an experiment that need to be shared (or don't) given some objective (Section 5).

Our analysis makes little to no mention of hypotheses. This is not an oversight. Our analysis is situated in what we call a model-centric view of science. From this perspective, scientific progress is made when old models are replaced by new ones, and experiments are performed in order to compare models. The idea that scientific meaning is carried by models, or that models otherwise play a central role, is not novel and has been discussed in both the scientific and philosophical literature (Glass and Hall, 2008; Taper and Lele, 2011; Weisberg, 2012). This approach is distinct from a hypothesis-centric approach and we contrast the two in the discussion section (Section 6).
Consider a large population of ravens where each raven is either black or white. Our goal is to estimate the true proportion of black ravens in the population, denoted by π, given a random sample of ravens from this population. We assume that there is no classification or measurement error. That is, each observed raven can be identified correctly as black or white. We also assume that the population is well-mixed and observations are independent, each with probability of being black equal to the true proportion of black ravens in the population.

This initial setup is generalizable and quite mundane from a statistical perspective. We could have considered any phenomenon that can be observed repeatedly, where a decision has to be made under uncertainty using observations. The probability calculus quantifies uncertainty, and statistical methods based on probabilistic models provide the machinery to perform inference about aspects of the mechanism generating these observations. Examples of these aspects include estimating a parameter or predicting future observations under an assumed probability model, or selecting between competing probability models.

With this in mind, we now consider the following two experiments that could be conducted within the paradigm of our toy example.
Experiment 1. A simple random sample of n ravens is collected. We observe b black ravens in the sample. The likelihood of the observed data conditional on π and n is given by the binomial probability model, and is equal to
\[
P(b \mid \pi, n) = \binom{n}{b} \pi^{b} (1-\pi)^{n-b}. \tag{1}
\]

Experiment 2. A simple random sample of ravens is collected until w white ravens are observed. We observe b black ravens such that b + w = n. The likelihood of the observed data conditional on π and w is given by the negative binomial probability model, and is equal to
\[
P(b \mid \pi, w) = \binom{n-1}{w-1} \pi^{b} (1-\pi)^{w}. \tag{2}
\]

This setup is not novel; it could be found in a standard statistics text. Nevertheless, several observations about the setup are important to make explicit, as they provide us with conclusions that help guide our thinking about reproducibility and openness.

Observation 1:
The probability models in Equations (1) and (2) are different because the parameter vectors (π, n) and (π, w) are different in these two models. Yet the population proportion of black ravens π is a parameter of both models and therefore can be estimated. Assume we estimate the proportion of black ravens by the well-known statistical method of maximum likelihood (ML). The ML estimate π̂_ML is the proportion of black ravens that maximizes the likelihood of the observed data. The estimates for π under model (1) and model (2) are both equal to b/n.
Conclusion 1: Some results of Experiment 1 can be reproduced by Experiment 2 even if the models in these experiments are different. In other words, the model in Experiment 2 is not required to be identical to the model in Experiment 1 to reproduce some of its results.
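To make this concrete, here is a minimal simulation sketch of our own (it is not part of the original analysis, and the values of π, n, and w below are hypothetical). It generates data under both sampling schemes and maximizes each likelihood over a grid; in both cases the maximizer coincides with b/n, the result that Experiment 2 can reproduce without adopting Experiment 1's model.

```python
# Sketch: the binomial likelihood of Experiment 1 and the negative binomial
# likelihood of Experiment 2 are both maximized at b/n. Values are hypothetical.
import numpy as np
from scipy.special import comb

rng = np.random.default_rng(1)
pi_true = 0.7  # hypothetical true proportion of black ravens

# Experiment 1: fix the sample size n and count the black ravens b1.
n1 = 50
b1 = rng.binomial(n1, pi_true)

# Experiment 2: sample ravens one by one until w white ravens are observed.
w = 15
b2, whites = 0, 0
while whites < w:
    if rng.random() < pi_true:
        b2 += 1
    else:
        whites += 1
n2 = b2 + w

# Maximize each likelihood (Equations (1) and (2)) over a grid of candidate pi values.
grid = np.linspace(0.001, 0.999, 9999)
lik1 = comb(n1, b1) * grid**b1 * (1 - grid)**(n1 - b1)   # binomial, Eq. (1)
lik2 = comb(n2 - 1, w - 1) * grid**b2 * (1 - grid)**w    # negative binomial, Eq. (2)

print("Experiment 1: b/n =", round(b1 / n1, 4), "grid MLE =", grid[np.argmax(lik1)])
print("Experiment 2: b/n =", round(b2 / n2, 4), "grid MLE =", grid[np.argmax(lik2)])
```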
Observation 2:
Conclusion 1 cannot be explained solely by the use of an identical method in both experiments, because ML is applied to realizations of different random variables in the two models. To see this, note that the maximum number of black ravens that can be observed in Experiment 1 is n, but in Experiment 2 it is the maximum number of black ravens in the population. If we use π̂_ML in Experiment 1 and π̂_MM in Experiment 2, where MM denotes the method of moments estimator, the estimates are still both equal to b/n, even though MM is based on a different principle than ML.
Conclusion 2: The method in Experiment 2 is not required to be identical to the method in Experiment 1 to reproduce some of its results.
Observation 3:
The stopping rule of data collection in Experiment 1 is different from the stopping rule in Experiment 2. Experiment 1 stops when n ravens are observed. Experiment 2 stops when w white ravens are observed. Thus, the data structures in Experiment 1 and Experiment 2 are different.
Conclusion 3: The data structure in Experiment 2 is not required to be identical to the data structure in Experiment 1 to reproduce some of its results.
Observation 4:
To assess whether the estimate b/n obtained in Experiment 1 is reproduced in Experiment 2, Experiment 2 must know the result of Experiment 1.
Conclusion 4:
Experiment 2 needs sufficient background information about the results of Experiment 1 to assess whether the results are reproduced. Thus, the background information used in Experiment 1 and Experiment 2 must be different from each other.
Observation 5:
Assume we repeat Experiment 1 a large number of times and we choose the estimator of π as
\[
\hat{\pi} = \frac{b}{n}\left(1 + I\{\text{1st raven is black}\}\right),
\]
where I{A} = 1 if A holds and 0 otherwise. This estimator is equal to its true value π in 100(1 − π)% of the experiments on average and equal to 2π in 100π% of the experiments on average.
Conclusion 5: True results are not always reproducible. For example, if the true population proportion of black ravens is 0.8 and our experiments return a smattering of values close to the truth but only one that is exactly 0.8, then only one result is true. Note that "true" here involves the idea of accuracy because the estimates are real numbers. Providing a range of values instead would sacrifice accuracy to ensure truth: in the extreme, if we say that the proportion of black ravens is between 0 and 1, we guarantee a true claim, but it is not informative. For ease of exposition, we continue to use "true" with the precision provided by real numbers and handle the idea of closeness in other ways.
Observation 6:
Assume we repeat Experiment 1 a large number of times and we choose the estimator of π as
\[
\hat{\pi} = \frac{1}{2}.
\]
There is no statistical reason for this estimator to be equal to its true value π since it does not use the observations as a representative sample from the population. However, it will always return the same value if we fix the sample size.
Conclusion 6: Perfectly reproducible results may not be true. Hence reproducibility is not sufficient for truth.
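A short Monte Carlo sketch of our own makes Conclusions 5 and 6 tangible (the true proportion 0.8, the sample size 50, and the number of repetitions are hypothetical choices, and the Observation 5 estimator is used as reconstructed above): the biased estimator recovers roughly π only when the first raven happens to be white and roughly 2π otherwise, while the constant estimator returns the same untrue value every time.

```python
# Sketch: "true results are not always reproducible" (Obs. 5) versus
# "perfectly reproducible results may not be true" (Obs. 6). Values hypothetical.
import numpy as np

rng = np.random.default_rng(7)
pi_true, n, reps = 0.8, 50, 100_000

first_black = rng.random(reps) < pi_true               # colour of the first raven in each repetition
rest_black = rng.binomial(n - 1, pi_true, size=reps)   # black ravens among the remaining n - 1
b = first_black + rest_black                           # total black ravens per repetition

# Observation 5: a biased estimator that inflates the estimate whenever the first raven is black.
est5 = (b / n) * (1 + first_black)
# Observation 6: an estimator that ignores the observations entirely.
est6 = np.full(reps, 0.5)

print("Obs 5, first raven white: mean estimate =", est5[~first_black].mean())  # close to pi_true
print("Obs 5, first raven black: mean estimate =", est5[first_black].mean())   # close to 2 * pi_true
print("Obs 6: every repetition returns", est6[0], "- perfectly reproducible, never true")
```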
A few remarks regarding the simplicity of our toy example are in order. Observation 5 presents a biased estimator that only sometimes equals the true parameter value. While this may or may not be a realistic choice of an estimator, our point in this observation is that the observed rate of reproducibility is a function of the methods that we apply to data to make inference and of the true rate of reproducibility. Theoretically speaking, we cannot always expect true results to be reproducible. Similarly, sampling error and model misspecification (Box, 1976; Dennis et al., 2019) present other potential reasons why true results may not always be reproducible. Observation 6, on the other hand, is in tension with actual scientific practice, which uses observations as a way for models to make "contact" with the world. Though contrived, this conclusion, as well as the others, carries over to relevant aspects of real science. For example, the estimator in Observation 6 could be a complicated algorithm, and an unintentional "error" in its code implementation could have the same effect. It is thus possible that two separate labs using the same code reproduce one another's results, not because they are accessing the truth, but because the same error is biasing the process to provide the same result. Of course, code verification and validation processes will help minimize such possibilities, but this only acknowledges the point: reproducibility may occur because of aspects of experiments or analyses and not because of "nature" (or whatever scientists are trying to make contact with). Ian Hacking has referred to this phenomenon as "the self-vindication of the laboratory sciences" (Hacking, 1992).

In brief, it is possible to reproduce the results of an experiment in a variety of ways, even in experiments that are not carbon copies of the original experiment. In order to provide a more detailed analysis of what does and does not need to be copied, and relatedly made open, we first provide some additional clarification of the idea of an experiment and define some key terms.

We start with the notion of an idealized experiment, which is central to many forms of scientific inquiry (Devezer et al., 2018). Given some background knowledge K on a natural phenomenon, a scientific theory makes a prediction, which is in principle testable using observables, the data D. A mechanism generating D is formulated under uncertainty. This mechanism is represented as a probability model M_θ parametrized by θ. The extent to which the parts of M_θ relevant to the prediction are confirmed by D is assessed by a fixed and known collection of methods S evaluated at D. We denote (M_θ, D, S, K) by ξ, an idealized experiment. Our conceptualization of the idealized experiment parallels Hacking's (1992) taxonomy of laboratory experiments: our K and M_θ would be subsumed under his ideas, S under things, and D under marks.

We further refine two components of ξ. First, we let D ≡ {D_v, D_s}, where D_v denotes the observed values and D_s denotes the structural aspects of the data, such as the sample size, number of variables, units of measurement for each variable, and metadata. Second, we let S ≡ {S_pre, S_post}, where S_pre denotes the procedures, instruments, experimental design, and tools used prior to and necessary for obtaining D_v, and S_post denotes the analytical tools and procedures applied to D once it is obtained. We define R_i as a result obtained by applying S_post to D. We denote the set of all results obtainable from an experiment as R ≡ {R_1, R_2, R_3, ...}. Figure 1 shows these elements in the context of our toy example of Section 2.

Fig. 1 Elements of three idealized experiments: experiment, replication experiment 1, and an alternative replication experiment 2. [Panels: Experiment (binomial, n = 3, result 2/3); Replication Experiment 1 (binomial, n = 3, result 1/3); Replication Experiment 2 (negative binomial, w = 1, result 1/2); all assume a simple random sample from a large population where each raven is black or white with constant probability, and the population proportion of black ravens is the quantity of interest.]

Using this notion of an idealized experiment, we adopt the following definitions. We will clarify why in some cases we define a term differently from the relevant literature.
Reliability: Propensity of a method S to produce consistent results given the same inputs or initial conditions. Conditional on the same observation in the sample space, M_θ, and K, a method S_pre is reliable if it consistently produces D. A method S_post is reliable if applying S_post to D consistently yields R. For the rest of this manuscript, we make the simplifying assumption that S is sufficiently reliable.

Auditability: The accessibility of all necessary information regarding the components of ξ so that S_post can be applied to D to obtain R independently of ξ. Auditing is a procedure of screening for certain errors, including human and instrumental, that may be introduced in the process of obtaining R. Examples are data entry and programming errors. If S_post is not reliable, auditability of ξ will not be affected, but the auditing process will also be less reliable because it may not consistently yield R.

Replicability: An experiment ξ is replicable if information about the components of ξ necessary to obtain some R_i is available and if these components can be duplicated, copied, or matched in an independent experiment ξ′. A replication experiment ξ′ generates D′ independent from D, conditional on the true data generating mechanism. We use R_i instead of R in this definition because ξ′ might only be interested in a subset of the results of ξ. A common interpretation of ξ′ would be (M_θ, D′, S, K), where the replication experiment differs from ξ only in data values D′_v while duplicating all other components of ξ. Our analysis in Section 5 diverges from this view of replication studies and brings a more fine-tuned understanding of which components of an experiment need to be duplicated or matched for replicability.
Reproducibility: The rate of R_i being reproduced. We say that R_i is reproduced by R′_i if M_θ and M′_θ are confirmed or disconfirmed in the same direction in a probabilistic confirmation sense, such that R_i and R′_i are deemed equivalent. For example, if R_i is an estimate of parameter θ, then D confirms θ if the probability of θ after observing the data, P(θ | M_θ, D, K), is greater than the probability of θ before observing the data, P(θ | M_θ, K). Here, R′_i reproduces R_i if P(θ | M′_θ, D′, K′) > P(θ | M′_θ, K′). In order to reproduce R_i for the right reasons, S must be sufficiently reliable.

Notice that our definition of reproducibility focuses on the end products of experiments, the results, and not the other components of experiments that bring those products about. This choice is fitting given the etymology of the term "reproduce". Our definition of replicability similarly respects its etymology: it incorporates the notion of repeating something, in our case the components of an experiment. While our use of these terms is consistent with, if more refined than, other work (Leonelli, 2018; Radder, 1992, 1996; Open Science Collaboration, 2015), there is considerable variation in how these terms are used in the scientific literature (Fidler and Wilcox, 2018; Penders et al., 2019; Stodden, 2011). We aim to sidestep potential confusion by laying out the definitions as we have and adhering to them for the remainder of this paper.

From our definitions, we conclude that auditability is not necessary for replicability or reproducibility. For example, to audit R_i, we need to examine D and implement S_post on D. To replicate ξ or to reproduce R_i, on the other hand, we do not need to know D. Moreover, auditability is not sufficient for reproducibility either. A replication experiment ξ′ includes new data D′ generated by the true data generating mechanism. Even if ξ is auditable, R_i may not be reproduced by R′_i; an example was shown in Observation 5 in Section 2.

Our definition of auditability closely tracks the idea of openness in science. However, whereas we have just stated that auditability is not necessary for reproducibility, the science reform movement, as described in the introduction, leads us to believe that openness is necessary for reproducibility. The tension we have built between auditability, openness, and reproducibility provides an opportunity to clarify their relationship. Doing so will ultimately lead us to a better understanding of reproducibility and the putative crisis. In the next section we work through a thought experiment to illustrate an important epistemic aspect of reproducibility that helps us enrich the concept of openness.

We expand our toy example of ravens with a thought experiment, "the reproducibility collaboratorium", to distinguish between two types of reproducibility: in-principle and epistemic. We argue that open science, which makes the necessary components of an experiment available for use by others (we further explain what we mean by "necessary components" in Section 5), is a logical necessity for epistemic reproducibility of research results.
We consider two collaboratoriums, a closed collaboratorium and an open collaboratorium, and imagine the following scenario common to both: each collaboratorium consists of Lab 1 and Lab 2, which conduct Experiment 1 and Experiment 1′, respectively. Experiment 1′ is conducted after Experiment 1, and ravens are sampled from one large population. All four labs assume identical models and data structure, and employ identical methods with the goal of estimating the population proportion of black ravens using their observations. Further, we assume that the number of black ravens observed in all four labs is the same.

In the closed collaboratorium, Lab 1 and Lab 2 are isolated from each other and there is no information flow from Lab 1 to Lab 2. Crucially, because of this lack of information flow, Experiment 1′ will match all the elements of Experiment 1 that are relevant to estimating the proportion of black ravens in the population by chance. Such a match is improbable since there are many reasonable ways of conducting an experiment. Nevertheless, since Experiment 1 and Experiment 1′ use identical models and methods and observed the same number of black ravens in the sample, they return the same estimate of the population proportion of black ravens. However, Experiment 1′ does not have any information pertaining to the results of Experiment 1, and thus Lab 2 is in a position neither to learn from the results of Experiment 1, nor to claim that it reproduced the result of Experiment 1. If an external observer were to observe the experiments conducted in both labs, they could first learn from the result of Experiment 1. Starting with an updated view about the proportion of black ravens provided by the result of Experiment 1, they could then use the number of ravens observed in Experiment 1′ to conclude that the result of Experiment 1 is indeed reproduced by Experiment 1′. When there is no information exchange between the labs, however, there is no meaningful epistemic interaction between Experiment 1 and Experiment 1′.

This closed collaboratorium example highlights two important points: (1) if there is no open science in the sense of information flow from one experiment (or lab) to the next, it is improbable (but still possible) for a replication experiment to take place, and (2) the result of Experiment 1 can be said to be reproduced by Experiment 1′ only if the result of Experiment 1 is available to Experiment 1′. In order to acknowledge these points, we say that a result can only be in-principle reproducible if there is no epistemic exchange between the labs which would accumulate evidence, except via some omniscient external observer.
Fig. 2
Closed collaboratorium: Experiment 1 starts with a prior view of 1/2. An identical view, model, and methods are assumed and the same data values are observed in Experiment 1′, but in the absence of an external observer the two results cannot be connected; thus reproducibility is only in principle and evidence does not accumulate in the absence of an external observer privy to both experiments. Open collaboratorium: Experiment 1 starts with a prior view of 1/2. Experiment 1′, a replication experiment, is informed of the result, model, and methods and observes the same data values. Starting with the updated view of 3/4, Experiment 1′ learns from Experiment 1 in a planned manner. The two results can be connected and thus reproducibility is epistemic.
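The confirmation reading of reproducibility from Section 3 can be spelled out with a conjugate Beta-binomial update. The sketch below is our own illustration and its numbers are hypothetical: a uniform Beta(1, 1) prior (view 1/2) updated on two observed black ravens gives a posterior mean of 3/4, in line with the views depicted in Figure 2, and the "result" is taken to be the claim that π exceeds 1/2. The check is simply whether the data raise the probability of that claim in the original experiment and in the replication in the same direction.

```python
# Sketch of the confirmation criterion behind our definition of reproducibility.
# Prior, claim, and data sets are hypothetical choices for this illustration.
from scipy.stats import beta

def confirmed(b, w, claim=0.5, a0=1.0, b0=1.0):
    """Does observing b black and w white ravens raise P(pi > claim)
    relative to a Beta(a0, b0) prior?"""
    prior = 1 - beta.cdf(claim, a0, b0)
    posterior = 1 - beta.cdf(claim, a0 + b, b0 + w)
    return posterior > prior, prior, posterior

# Original experiment: 2 black, 0 white (posterior mean 3/4).
orig_dir, orig_prior, orig_post = confirmed(b=2, w=0)
# Replication experiment: its own (hypothetical) data, say 3 black, 1 white.
repl_dir, repl_prior, repl_post = confirmed(b=3, w=1)

print("original:    P(pi > 1/2) before =", orig_prior, "after =", round(orig_post, 3))
print("replication: P(pi > 1/2) before =", repl_prior, "after =", round(repl_post, 3))
print("result reproduced (confirmed in the same direction):", orig_dir == repl_dir)
```

Under the definition above, R′_i reproduces R_i in this sketch because both experiments shift the probability of the claim in the same (positive) direction.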
In the open collaboratorium we assume that Experiment 1 has a view on the population proportion of black ravens prior to observing ravens. By contrast to the closed version, however, in the open collaboratorium Lab 1 reports all information relevant to estimating the proportion of black ravens to Lab 2, which incorporates this information to conduct Experiment 1′. Thus Experiment 1′ matches the elements of Experiment 1 by a kind of social learning. We assume this information is transmitted in the background knowledge of the experiment. Starting with an updated view about the proportion of black ravens in the population and conditional on the number of black ravens observed, Lab 2 could conclude that they have indeed reproduced the result of Experiment 1. Thus in the open collaboratorium there is an epistemic interaction between the two experiments which contributes to the progress of science through accumulation of evidence. In contrast to the closed collaboratorium, in the open collaboratorium replication experiments are not contingent on chance but can be routinely performed via social learning, which gives us the notion of epistemic reproducibility. We illustrate the two collaboratoriums in Figure 2.

The distinction between in-principle and epistemic reproducibility is relevant to understanding debates about reproducibility. Consider a historical example. Isaac Newton (1643–1727 AD) believed his experimentum crucis did not need to be replicated because it already presented conclusive proof of his theory. When Anthony Lucas (1633–1693 AD) failed to reproduce and even negated Newton's results in his replication experiments, Newton was angry: [it was not] "the number of experiments, but weight to be regarded; and where one will do, what need many?" (Newton, 1676). When faced with further criticism from his opponents, Newton refused to discuss Lucas's (unsuccessful) replications and invited Lucas to talk about his original experimentum crucis instead (Westfall and Devons, 1981, p. 275).

Newton's anti-replication attitude was heavily criticized by Johann Wolfgang von Goethe (1749–1832 AD), who believed that single experiments or even several of them would not be enough to prove any theory, and that it was a major task of scientists to design and conduct a series of contiguous experiments, each derived from the preceding one (Ribe, 1985). Goethe's insistence on replication experiments echoes a long and far-ranging history of advocates. Any coverage of this history here would be superficial and take us too far afield. Suffice it to say that this history is on the side of Goethe.

How can we make sense of Newton's anti-replication attitude against a large backdrop of advocates represented by Goethe? If Newton's dismissive attitude towards replication is framed by in-principle reproducibility, we can understand why he deemed the replication of experiments unnecessary. Newton must have already believed that his results were in-principle reproducible because of the underlying theory; hence, he did not deem any direct replication experiments necessary. This view would be consistent with his famous quote "hypotheses non fingo" (I feign no hypotheses), since he did not see his experimentum crucis so much as testing a scientific hypothesis as demonstrating an already proven theory. This view, however, is not widely shared by scientists conducting experiments under uncertainty today.
For most of empirical science today, scientists must exchange information and replicate each other's experiments in order to increase their confidence that new knowledge has been added (or, that they know that they have discovered some new fact) by way of reproducing results. It seems that Goethe and others were concerned with epistemic rather than in-principle reproducibility.

With epistemic reproducibility defined, we can take a closer look at the components (M_θ, D, S, K) defining ξ and investigate what, more specifically, needs to be open for epistemic reproducibility. We turn to this topic in the next section.

In recent years several tools have been developed to help facilitate openness in science (Collins and Tabak, 2014; Munafò et al., 2017; Nosek et al., 2015; Wagenmakers et al., 2012). In some cases the net is cast broadly, making as much information available as possible. In other cases intuition guides which components are relevant and need to be shared in the replication of an experiment. We are interested in using our model-centric framework to better understand what does and does not need to be made available and under what conditions.

The context of epistemic reproducibility is our starting point. In addition, our analysis will take into account our toy example and initial observations, as well as our thought experiment about collaboratoriums. Moreover, our analysis will be structured by our concept of an idealized experiment. That is, we confine ourselves to identifying the components of experiments as we have conceived of them, leaving open the possibility that there are other ways of describing scientific processes relevant to replication and reproducibility.

For our analysis, we can understand an experiment ξ given by (M_θ, D, S, K) as a function that takes D, generated from the true data generating model, as a random input and produces a result R as a random output. Thus, ξ is a random transformation from the space of data to the space of results under the assumptions specified by M_θ (the model), S (the methods), and K (background knowledge). Assuming that a model in an experiment captures the true data generating mechanism, we can investigate which components of the model, the method, and the data need to be transmitted to reproduce the results of an experiment in a replication experiment ξ′. We encapsulate the transmission of these components from ξ to ξ′ in K′. By definition, this makes K and K′ different, a version of an observation we already made in Sections 2 and 3.

Our results are grouped by components: i) (model) specific aspects of the model might have to be shared, but whether they do and which parts of a model depends on the objective; ii) (method) we specifically identify those aspects of S′_post that need to be shared; iii) (data) here we distinguish between the structural aspects of data and the observed values, and what has to be open depends on whether we are doing an exact replication or a reproducibility experiment.

5.1 What parts of model M_θ are needed for reproducibility?

Statistical theory shows us that for ξ′ to be able to reproduce all possible results of ξ, the specification of model M_θ up to the unknown quantities needs to be transmitted to the replication experiment, such that M_θ and M′_θ are identical models.
If an aspect of M_θ that has an inferential value is not transmitted to ξ′, that inferential value is lost, and the results relevant to that inferential value cannot be reproduced.

On the other hand, given an inferential objective to produce a specific result R_i, the aspects of M_θ that are irrelevant to that objective need not be transmitted to the replication experiment. This point is shown by Observation 1 and Conclusion 1 of our toy example in Section 2. When we consider estimating the population proportion of black ravens, the two models in our example are different from each other, but they have an identical parameter capturing the population proportion of black ravens and they both employ the number of black ravens observed in the sample in the same way for that particular objective.

If M_θ is not identical to M′_θ, then ξ and ξ′ differ from each other with respect to the assumed data generating mechanism. What matters is whether these differing aspects affect the results of S applied to D for estimating the population proportion of black ravens. That is, even though the model in a replication experiment differs from the original, what matters is that the models share the relevant parameters, in this case the proportion of black ravens.

We can demonstrate this formally. Equations (1) and (2) give the likelihood of observing b black ravens in a sample of size n under the binomial and negative binomial probability models, respectively, and the maximum likelihood estimate for the population proportion of black ravens under both models is b/n. The reason for this is that the binomial and negative binomial models are in the same likelihood equivalence class with respect to the objective of estimating the population proportion of black ravens: the maximum likelihood estimator can be derived by setting the derivative of the logarithm of the likelihood function with respect to π equal to zero and solving for π. For Equation (1) we get
\[
\frac{d}{d\pi}\bigl[\log P(b \mid \pi, n)\bigr] = \frac{d}{d\pi}\left[\log \binom{n}{b}\right] + \frac{d}{d\pi}\bigl[b \log \pi\bigr] + \frac{d}{d\pi}\bigl[w \log(1-\pi)\bigr], \tag{3}
\]
and for Equation (2) we get
\[
\frac{d}{d\pi}\bigl[\log P(b \mid \pi, w)\bigr] = \frac{d}{d\pi}\left[\log \binom{n-1}{w-1}\right] + \frac{d}{d\pi}\bigl[b \log \pi\bigr] + \frac{d}{d\pi}\bigl[w \log(1-\pi)\bigr]. \tag{4}
\]
The difference between these two equations is only in the first terms, which are equal to zero. We get π̂ = b/n as the unique solution in both models.

The first term in Equation (1) and Equation (2) determines the stopping rule of the experiments. In ξ, we stop the experiment when n ravens are observed and the last raven can be black or white. In ξ′ we stop the experiment when w white ravens are observed and the last observation must be a white raven. This difference between stopping rules means that: 1) S_pre is different from S′_pre; and 2) under our choice of S_post as the maximum likelihood estimator, the stopping rules in the two models are irrelevant for estimating the proportion of black ravens in the population.
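As a cross-check on this algebra, the differing first terms in Equations (3) and (4) do not involve π and drop out, so both score equations have the same unique root. A short symbolic sketch of our own (any computer algebra system would do; here we use SymPy) verifies this.

```python
# Symbolic check: the binomial and negative binomial log-likelihoods
# of Equations (1) and (2) are maximized at the same point, b/(b + w) = b/n.
import sympy as sp

pi, b, w = sp.symbols('pi b w', positive=True)
n = b + w

loglik_binom = sp.log(sp.binomial(n, b)) + b*sp.log(pi) + w*sp.log(1 - pi)           # from Eq. (1)
loglik_negbin = sp.log(sp.binomial(n - 1, w - 1)) + b*sp.log(pi) + w*sp.log(1 - pi)  # from Eq. (2)

# The first terms do not involve pi, so the score equations (3) and (4) coincide.
score_binom = sp.diff(loglik_binom, pi)
score_negbin = sp.diff(loglik_negbin, pi)

print(sp.simplify(score_binom - score_negbin))   # 0
print(sp.solve(sp.Eq(score_binom, 0), pi))       # [b/(b + w)], i.e. b/n
```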
We also need to distinguish between openness (and auditability) of M_θ and replicability of M_θ. In our binomial and negative binomial models, M′_θ is different from M_θ. However, these two models are compatible with respect to a certain inferential objective that allows for reproducing a specific R_i, namely estimating the proportion of black ravens in the population. To establish this compatibility, M_θ should be open to ξ′ but does not need to be replicable or replicated. Specifically, to choose a negative binomial probability model in ξ′ to reproduce the estimate of the proportion of black ravens in the population obtained in ξ, we need to know that ξ has used a binomial probability model, which ensures that ξ′ will use a probability model that has the same parameter, the proportion of black ravens, with exactly the same meaning as in ξ. Without M_θ, this compatibility cannot be established.

This point is clearly illustrated in a recent article (Silberzahn et al., 2018) in which the same data D was independently analyzed by twenty-nine research teams who were provided the data and a research question that puts a restriction on which R_i's would be relevant for the purposes of the project. The teams were not, however, provided an M_θ, S_post, or K. Teams ended up using a variety of models differing in their assumptions about the error variance and the number of covariates to analyze the same data set. The results differed widely with regard to reported effect sizes and hypothesis tests. So even when D was open, the lack of specification with regard to M_θ yielded largely inconsistent results.

Taking stock, our ravens example is deliberately simple to help in our analysis. State of the art models are often complex objects. If the assumed model and its assumptions are complex, it might not always be clear which class of models contains others, and a matching model for ξ may not even be available. Then, M_θ needs to be both auditable and replicable for reproducibility. This result is particularly important to communicate to scientists who primarily engage in routine null hypothesis significance testing procedures and may not be conventionally expected to transparently report their models.

5.2 What parts of a method are needed for reproducibility?

In this section, we focus on S_post (the analytical methods applied to data) but leave S_pre (the experimental procedures to generate data) unspecified. Studying S_pre is complicated because for a given model M_θ, the number of ways that an experiment can be designed is not well specified under statistical theory, and procedures and measurements to test the same research question can vary. Even in our simple ravens example, a raven can be observed for its color by an investigator using their eyes, but a blind investigator may opt for a mechanical pigment test. This experimental design issue is sometimes referred to as "hidden moderators" when explaining why results of replication experiments differ from original experiments (Baribault et al., 2018). In addition, the issues surrounding measurement error have been studied extensively, and measurement error might be a potential factor exacerbating irreproducibility (Loken and Gelman, 2017; Stanley and Spence, 2014). What we can say is that, at a minimum, auditability of S_pre appears to be essential for reproducibility.
Once all experimental procedures, design details, and instruments are reported, their exact replicability becomes less of an issue for ξ′, to the degree that measurement error can be explicitly modeled in case of any deviation from S_pre.

Turning our attention to the analytical methods applied to data, Observation 2 and Conclusion 2 in Section 2 show that S_post and S′_post do not have to be identical. Some statistical methods are mathematically equivalent even though their motivations are different. For example, the maximum likelihood estimator and the method of moments estimator are equivalent in estimating the population proportion of black ravens in our toy example. We can demonstrate this formally. Consider the binomial model specified by Equation (1). If S_post is the maximum likelihood estimator motivated by the likelihood principle, the standard procedure is to take the derivative of the log-likelihood given in Equation (3), set it equal to zero, and solve for π:
\[
\frac{d}{d\pi}\left[\log \binom{n}{b}\right] + \frac{d}{d\pi}\bigl[b \log \pi\bigr] + \frac{d}{d\pi}\bigl[w \log(1-\pi)\bigr] = 0,
\qquad
\frac{b}{\pi} - \frac{w}{1-\pi} = 0,
\]
so we have π̂_ML = b/(b + w), that is, π̂_ML = b/n. On the other hand, if S_post is the method of moments estimator, the motivation is to set the population mean equal to the sample mean and solve for π. The population mean in a binomial model with sample size n and probability of observing a black raven π is nπ, and the sample mean is the number of black ravens in the sample, b. So the method of moments estimator satisfies nπ̂_MM = b, hence π̂_MM = b/n, equivalent to π̂_ML.

Furthermore, other methods are mathematically equivalent even though their very interpretation of probability differs. For example, maximum likelihood estimates and the posterior mode in Bayesian inference under a uniform prior distribution of the parameters are equivalent regardless of the true data generating model. Conversely, there are also methods designed for a specific goal that do not produce identical R when applied to the same D. For example, Devezer et al. (2018) show that the choice between Akaike's Information Criterion and the Schwarz criterion might influence the reproducibility of results in a model comparison.

In addition, a statistical method used to draw inferences is often conditional on a fully specified statistical model up to a finite number of unknown parameters of that model. Consider the following situation. A scientist who designs ξ′ is given only the following information about ξ: "A population has only black and white ravens, and ravens are sampled to perform inference about the error on the population proportion of black ravens." This is an underspecified model that might easily lead to different methods of choice for S_post and S′_post, the estimators of the error of the population proportion of black ravens. The scientist might assume a large population and sample the ravens with replacement to build a binomial probability model for the data generating process. On the other hand, she might also assume a small population and sample the ravens without replacement to build a hypergeometric probability model. The errors of these estimates are different even though S_post and S′_post might be motivated by one principle, such as maximum likelihood. Further, these differences are likely to be exacerbated if the methods are motivated by different principles.
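To put numbers on the example just given, the sketch below (our own; the population size N, the sample size, and the observed count are hypothetical) compares the standard error of b/n under the binomial model with the one under the hypergeometric model, which carries the finite population correction: the point estimate is identical, but the reported error is not.

```python
# Sketch: two reasonable model choices give the same estimate b/n but
# different errors for that estimate. N, n, and b are hypothetical values.
import math

N = 200          # assumed population size (only needed for the hypergeometric model)
n, b = 50, 35    # sample size and observed black ravens
pi_hat = b / n   # same point estimate under both models

# Binomial model: sampling with replacement (or an effectively infinite population).
se_binom = math.sqrt(pi_hat * (1 - pi_hat) / n)

# Hypergeometric model: sampling without replacement from a population of size N;
# the variance shrinks by the finite population correction (N - n) / (N - 1).
se_hyper = math.sqrt(pi_hat * (1 - pi_hat) / n * (N - n) / (N - 1))

print(f"estimate = {pi_hat:.2f}, binomial SE = {se_binom:.3f}, hypergeometric SE = {se_hyper:.3f}")
```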
From these examples, we infer that S′_post either needs to be identical to S_post or should match it with regard to the desired inference (point estimation, interval estimation, hypothesis testing, prediction, model selection) and in a way that allows for reproducing a specific R_i. To match, S_post should be open or auditable to ξ′ but does not need to be replicable or replicated. In order to use the method of moments estimator to estimate the proportion of black ravens in a replication experiment, we need to know that ξ has used a maximum likelihood estimator. This way, it can be ensured that ξ′ will either use the same estimator as ξ or will match it.

5.3 What parts of data are needed for reproducibility?

In Section 3 we defined D ≡ {D_v, D_s}, where D_v denotes the observed values and D_s denotes the structural aspects of the data, such as the sample size, number of variables, units of measurement for each variable, and metadata.

If D ≡ {D_v, D_s} and D′ ≡ {D′_v, D′_s} are the data obtained in an experiment and a replication experiment respectively, D′ is often thought of as new data of the old kind, in the sense that the values D′_v are independent of the values D_v but the data structures D_s and D′_s are identical. Observation 3 and Conclusion 3 of our toy example provide a counterexample to this case, where D_s and D′_s can be different and the result R_i is still reproduced by R′_i, provided these differences do not affect how the method S evaluates D and D′ for the inferential objective.

While D_s does not need to be exactly duplicated in ξ′ for reproducibility of R_i, the parts of it relevant to obtaining R_i need to be open. Consider a situation where some ravens cannot be classified as black or white (perhaps due to S_pre not being sufficiently sensitive) and are recorded as missing data. In this case the number of missing data points is carried in D_s. If the estimate of the population proportion of black ravens is reported as b/n without D_s, ξ′ would not know whether the missing data points are treated as part of n or left out. Therefore, ξ′ cannot ensure whether it reproduced R_i. From this example, we infer that D′_s either needs to be identical to D_s or should match it with regard to the desired inference and in a way that allows for reproducing a specific R_i. For similar reasons, D_s also needs to be open for the purposes of auditability of R.

Data sharing is often viewed as a prerequisite for a reproducible science (National Academies of Sciences, Engineering, and Medicine, 2017; Hardwicke et al., 2018; Molloy, 2011; Stodden, 2011). Our analysis suggests this is potentially misconceived. Using the components of the idealized experiment and statistical theory, we have shown that to reproduce a result R_i of ξ := (M_θ, D, S, K), one needs to know the aspects of M_θ, D_s, and S relevant to obtaining R_i. Moreover, a replication experiment ξ′ need not copy these aspects of M_θ, D_s, and S to reproduce R_i. We also show that having open access to D_v has no bearing on designing and performing a replication experiment ξ′ or on the reproducibility of R_i: ξ′ aims to reproduce the result, not the data.
That said, openness of D_v is necessary for auditability of ξ. Auditability, replicability, and reproducibility are distinct concepts and they need to be assessed separately when evaluating individual experiments. While some level of open scientific practices is necessary to obtain reproducible results, we argue that open data are not a prerequisite.

There might be other benefits to open data, from auditability of results to enabling further research on the same data; however, the distinction we draw matters particularly in situations where there may be arguably valid concerns, such as ethics regarding data sharing (Borgman, 2012). We recommend that open data be evaluated on its own merits, which have been discussed extensively (Janssen et al., 2012), but not as a precursor of reproducibility.

Open practices in scientific inquiry have long been intuitively proposed as a key to solving the issues surrounding reproducibility of scientific results. However, a formal framework to validate this intuition has been missing and is needed for a clearer discussion of reproducibility. We have contributed to such a theoretical framework here. We finish with some discussion about how we see our project situated in the landscape of theories of science.

Within the last century, an important approach taken towards understanding reproducibility was motivated by the work of Karl Popper, particularly The Logic of Scientific Discovery and the notion of falsifiability. Popper (1959) states that "non-reproducible single occurrences are of no significance to science" (p. 86) and that they would not be useful in refuting theories. Here Popper appears to rely on some notion of reproducibility to establish his falsifiability criterion of science. At the very least, Popper would agree that in order to refute a theory or [scientific] hypothesis, the experimental result must indeed be a falsifying counterexample. There is, however, a tension between Popper's emphasis on falsifiability and his skepticism towards confirmation. To generate confidence that a counterexample is genuine and not an error or one-off case is to ensure that the result is reproducible. To do this, the original experiment should be replicated. The problem here is that this process of replication and reproducibility is a kind of confirmation, one that establishes or increases confidence that a counterexample is genuine. Only once we establish that we have a genuine counterexample can we then proceed in Popperian style towards the falsification of the hypothesis or theory in question.

Popper's falsifiability view has had considerable impact on 20th century science, and reproducibility has implicitly been accepted as part of scientific activity. Furthermore, Popper's view on falsifiability motivates what we call the hypothesis-centric approach to understanding reproducibility. On this approach, an experiment is performed in order to disconfirm a hypothesis, and thereby falsify a theory (Glass and Hall, 2008). One example of the hypothesis-centric approach in the contemporary literature on reproducibility is McElreath and Smaldino (2015).

There is an alternative. Whereas Popper places a great deal of emphasis on the concepts of theories and hypotheses, much of current science, particularly given the rise of Bayesianism, focuses instead on models and model comparison (Burnham and Anderson, 2003).
Since the exponential increase in computing power, it has become possible to perform the necessary computations and complete analyses that were previously practically impossible. An important consequence of this change is the ability to consider, compare, and contrast different models that explain or predict data. As a result, vague scientific hypotheses can be given more rigorous specifications in terms of statistical models, which in turn provides precision in testing.

Our work here is an example of this alternative, model-centric approach to studying reproducibility. An experiment is performed in order to compare models, and scientific progress is made as old models are replaced by new models. That is, whereas the hypothesis-centric approach typically assumes a model structure and tests a hypothesis, in a model-centric approach the whole model is considered and compared to other models. In this model-centric view, hypotheses are subsumed under models such that a hypothesis represents a specific statement about a model parameter.

In addition to being more general, a model-centric approach avoids certain challenges that a hypothesis-centric approach faces, in particular the underdetermination of theory by data and holism (Quine, 1976). Theories or hypotheses can never be tested in isolation because they come as a bundle and with a group of background assumptions. For example, given evidence that is inconsistent with some hypothesis, one option is to reject the hypothesis and the corresponding theory. Another option, however, is to place blame on the measuring instruments, the experimental procedures, or some background assumption. How does one decide which option to exercise? Scientists are generally adept at trouble-shooting issues and figuring out whether a particular instrument is malfunctioning, for example. They do this by holding fixed one set of assumptions (e.g., background theory and the reliability of the experimental setup) and checking the reliability of others (e.g., whether a particular measuring instrument is working as expected). And that is the point about holism: there is always some network of assumptions being held fixed when determining where to place the blame for negative results. Claiming to test a hypothesis in isolation is pretending that there is no network of assumptions, but there is, and consequently, a hypothesis cannot be tested in isolation.

The model-centric approach does a better job of making background assumptions explicit than the hypothesis-centric approach. Moreover, the model-centric approach is better suited to capture the scientific practice of model comparison, an integral part of the open, collaborative practices that have been proposed to solve issues surrounding reproducibility. Our analysis underscores the importance of transmitting model-specific information for reproducibility of results, a condition often readily satisfied in a model-centric framework. We believe that a hypothesis-centric approach is too impoverished to provide the necessary resources for a formal theory of such open practices.
Conclusion
We used our model-centric approach and formalization of reproducibility and related concepts to distinguish between reliability, auditability, replicability, and reproducibility. The relationship among them is not as straightforward as it may seem, and a nuanced understanding is warranted. For example, a perfectly auditable experiment does not necessarily lead to reproducible results, and an experiment that does not open its data does not necessarily yield irreproducible results. Nevertheless, irreproducible results sometimes raise suspicion, and discussions typically turn towards concerns regarding the transparency of research or the validity of findings. These discussions, however, use heuristic analogs of the concepts of reliability, auditability, replicability, and reproducibility. Such heuristics might not hold and can lead to erroneous inferences about research findings and researchers' practices. Relatedly, we have provided some details regarding which components need to be made open relative to some objective, and which do not. For example, while necessary for auditability of experiments, data sharing is not a prerequisite for reproducible results, as suggested by the National Academies of Sciences, Engineering, and Medicine, but other components of an experiment are. On the other hand, reporting model details, such as modeling assumptions, model structure, and parameters, becomes critical for improving reproducibility. Notably, even in recent recommendations for improving transparency in reporting via practices such as preregistration, models are typically left out while transparency of hypotheses, methods, and study design is emphasized (Nosek et al., 2018; van't Veer and Giner-Sorolla, 2016).

Our framework is useful in improving the accuracy of judgments made regarding replication and reproducibility. The literature on the replication crisis cites several putative causes of irreproducibility, including p-hacking, HARKing, and publication bias. The social epistemology literature contributes other underlying causes, such as the rush to publish results due to perverse reward structures within the publication system (Heesen, 2018). Our analysis shows that neither the elimination of questionable research practices nor the correction of scientific reward structures would necessarily lead to reproducible results, as there are other impediments to reproducibility logically preceding lack of rigor or transparency. For example, the rate of reproducibility is a parameter of the system and therefore is a function of the truth. Some level of irreproducibility will always remain as a component of the system, and we can only hope to attain a level of reproducibility within the bounds of model misspecification and sampling error. And even under the assumption of an ideal version of science that is free of methodological and cultural errors, employs reliable methods, and does not operate under model misspecification, we might still not be able to make a true discovery despite having improved the rate of reproducibility. We are optimistic that our analysis improves the level of precision in discussions surrounding the drivers of epistemic reproducibility.
References
Baker M (2016) 1,500 scientists lift the lid on reproducibility. Nature 533(7604):452–454, DOI 10.1038/533452a
Baribault B, Donkin C, Little DR, Trueblood JS, Oravecz Z, van Ravenzwaaij D, White CN, De Boeck P, Vandekerckhove J (2018) Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences 115(11):2607–2612, DOI 10.1073/pnas.1708285114
Bishop D (2019) Rein in the four horsemen of irreproducibility. Nature 568(7753):435
Borgman CL (2012) The conundrum of sharing research data. Journal of the American Society for Information Science and Technology 63(6):1059–1078, DOI 10.1002/asi.22634
Box GE (1976) Science and statistics. Journal of the American Statistical Association 71(356):791–799
Bruns SB, Ioannidis JPA (2016) P-curve and p-hacking in observational research. PLOS ONE 11(2):e0149144, DOI 10.1371/journal.pone.0149144
Burnham KP, Anderson DR (2003) Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media
Collins FS, Tabak LA (2014) Policy: NIH plans to enhance reproducibility. Nature News 505(7485):612
Dennis B, Ponciano JM, Taper ML, Lele SR (2019) Errors in statistical inference under model misspecification: evidence, hypothesis testing, and AIC. Frontiers in Ecology and Evolution 7:372
Devezer B, Nardin LG, Baumgaertner B, Buzbas E (2018) Discovery of truth is not implied by reproducibility but facilitated by innovation and epistemic diversity in a model-centric framework. ArXiv e-prints, URL https://arxiv.org/abs/1803.10118v2
Earp BD, Trafimow D (2015) Replication, falsification, and the crisis of confidence in social psychology. Frontiers in Psychology 6:621
Fidler F, Wilcox J (2018) Reproducibility of scientific results. In: Zalta EN (ed) The Stanford Encyclopedia of Philosophy, winter 2018 edn, Metaphysics Research Lab, Stanford University
Glass DJ, Hall N (2008) A brief history of the hypothesis. Cell 134(3):378–381
Hacking I (1992) The self-vindication of the laboratory sciences. In: Pickering A (ed) Science as Practice and Culture, University of Chicago Press, pp 29–64
Hardwicke TE, Mathur MB, MacDonald K, Nilsonne G, Banks GC, Kidwell MC, Hofelich Mohr A, Clayton E, Yoon EJ, Henry Tessler M, et al. (2018) Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. Royal Society Open Science 5(8):180448
Heesen R (2018) Why the reward structure of science makes reproducibility problems inevitable. The Journal of Philosophy 115(12):661–674
Ioannidis JPA (2014) How to make more published research true. PLoS Medicine 11(10):e1001747
Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis JP (2016) Reproducible research practices and transparency across the biomedical literature. PLoS Biology 14(1):e1002333
Janssen M, Charalabidis Y, Zuiderwijk A (2012) Benefits, adoption barriers and myths of open data and open government. Information Systems Management 29(4):258–268
Kerr NL (1998) HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review 2(3):196–217
LeBel EP, Peters KR (2011) Fearing the future of empirical psychology: Bem's (2011) evidence of psi as a case study of deficiencies in modal research practice. Review of General Psychology 15(4):371–379
Leonelli S (2018) Rethinking reproducibility as a criterion for research quality. In: Including a Symposium on Mary Morgan: Curiosity, Imagination, and Surprise, Emerald Publishing Limited, pp 129–146
Loken E, Gelman A (2017) Measurement error and the replication crisis. Science 355(6325):584–585, DOI 10.1126/science.aal3618
McElreath R, Smaldino PE (2015) Replication, communication, and the population dynamics of scientific discovery. PLOS ONE 10(8):e0136088, DOI 10.1371/journal.pone.0136088
Molloy JC (2011) The Open Knowledge Foundation: open data means better science. PLoS Biology 9(12):e1001195
Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, du Sert NP, Simonsohn U, Wagenmakers EJ, Ware JJ, Ioannidis JPA (2017) A manifesto for reproducible science. Nature Human Behaviour 1(1):0021, DOI 10.1038/s41562-016-0021
National Academies of Sciences, Engineering, and Medicine (2017) Fostering integrity in research. National Academies Press, Washington, D.C., DOI 10.17226/21896
Newton I (1676) Mr. Newton's answer to the precedent letter, sent to the publisher. Philosophical Transactions of the Royal Society 11:698–705
Nosek BA, Alter G, Banks GC, Borsboom D, Bowman SD, Breckler SJ, Buck S, Chambers CD, Chin G, Christensen G, et al. (2015) Promoting an open research culture. Science 348(6242):1422–1425
Nosek BA, Ebersole CR, DeHaven AC, Mellor DT (2018) The preregistration revolution. Proceedings of the National Academy of Sciences 115(11):2600–2606
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716, DOI 10.1126/science.aac4716
Penders B, Holbrook JB, de Rijcke S (2019) Rinse and repeat: Understanding the value of replication across different ways of knowing. Publications 7(3):52
Popper KR (1959) The logic of scientific discovery. University Press
Quine WvO (1976) Two dogmas of empiricism. In: Can Theories be Refuted?, Springer, pp 41–64, DOI 10.1007/978-94-010-1863-0