Incentive-Compatible Critical Values
Adam McCloskey, Pascal Michaillat
May 2020
Statistical hypothesis tests are a cornerstone of scientific research. The tests are informative when their size is properly controlled, so the frequency of rejecting true null hypotheses (type I error) stays below a prespecified nominal level. Publication bias exaggerates test sizes, however. Since scientists can typically only publish results that reject the null hypothesis, they have the incentive to continue conducting studies until attaining rejection. Such p-hacking takes many forms: from collecting additional data to examining multiple regression specifications, all in search of statistical significance. The process inflates test sizes above their nominal levels because the critical values used to determine rejection assume that test statistics are constructed from a single study—abstracting from p-hacking. This paper addresses the problem by constructing critical values that are compatible with scientists' behavior given their incentives. We assume that researchers conduct studies until finding a test statistic that exceeds the critical value, or until the benefit from conducting an extra study falls below the cost. We then solve for the incentive-compatible critical value (ICCV). When the ICCV is used to determine rejection, readers can be confident that size is controlled at the desired significance level, and that the researcher's response to the incentives delineated by the critical value is accounted for. Since they allow researchers to search for significance among multiple studies, ICCVs are larger than classical critical values. Yet, for a broad range of researcher behaviors and beliefs, ICCVs lie in a fairly narrow range.

McCloskey: University of Colorado–Boulder. Michaillat: Brown University. We thank Brian Cadena, Kenneth Chay, Pedro Dal Bo, Stefano DellaVigna, Miles Kimball, Emily Oster, Andriy Norets, Bobak Pakzad-Hurson, and Jesse Shapiro for helpful discussions and comments. This work was supported by the Institute for Advanced Study.

1. Introduction

Statistical hypothesis testing is a key tool for scientific investigation and discovery. It is used to evaluate existing theories and paradigms. In particular it allows for the identification of anomalies: instances where the theory does not accord well with empirical observations. It is also used to assess the effectiveness of new medical treatments, public policies, designs, processes, and other potential remedies to real-world problems. In that context it allows for the detection of novel statistically significant effects.

Scientists conduct many hypothesis tests but they are typically only able to publish those that reject the null hypothesis under investigation. This preference for "significant" results is clearly visible in scientific journals. First identified in psychology journals (Sterling 1959; Bozarth and Roberts 1972), it has since been observed across the social sciences (Christensen, Freese, and Miguel 2019, chap. 3), medical sciences (Begg and Berlin 1988; Song et al. 2000; Ioannidis and Trikalinos 2007; Dwan et al. 2008), biological sciences (Csada, James, and Espie 1996; Jennions and Moeller 2002), and many other disciplines (Head et al. 2015; Fanelli, Costas, and Ioannidis 2017). This preference may reflect the presumption that "significant" results are more valuable to scientific progress than "insignificant" results.

Publications, however, are the main output of scientific research, and the main marker of scientific productivity.
A scientist's publications determine her career path: whether she will receive tenure, whether she will be promoted further, and at each step what salary she will command (Gibson, Anderson, and Tressler 2014). Publications also determine a scientist's recognition and status in the profession and the resources made available to her such as research funds, grants, and fellowships. Such resources are integral to pursuing new projects and obtaining additional publications.

Since publishing yields significant rewards but requires statistically significant results, scientists are given the incentive to collect and analyze data until they are able to reject the null hypothesis they are investigating (Nosek, Spies, and Motyl 2012). Such behavior is well documented: researchers inflate reported t-statistics by choosing amongst specifications (Brodeur et al. 2016), they fail to write up experimental results that are not statistically significant (Franco, Malhotra, and Simonovits 2014), and they favor reporting statistically significant outcomes at the expense of insignificant outcomes (Dwan et al. 2008). And flexibility in data collection and analysis affords researchers many opportunities to obtain statistically significant results, even when the null hypothesis is true (Lovell 1983; Simmons, Nelson, and Simonsohn 2011).

Given this incentive, scientists' behavior invalidates the assumptions of classical statistics upon which standard hypothesis testing is based. One key assumption is that scientists report whatever they observe from a single data sample that is representative of an underlying population of interest. Thus, the use of standard critical values (CVs) based upon this assumption to determine statistical significance in scientific publications leads to the over-rejection of true null hypotheses. For instance, when using a standard CV of 1.96 for a two-sided t-test of nominal level 5%, a true null hypothesis can be rejected much more often than 5% of the time. This lack of size control is one of the most troubling consequences of publication bias, "the systematic error induced in a statistical inference by conditioning on the achievement of publication status" (Begg and Berlin 1988, p. 422).

This consequence of publication bias is problematic because the informativeness of hypothesis tests relies upon properly controlling the size of the test: to limit spurious findings of statistical significance, one must be confident that the frequency of rejecting true null hypotheses does not exceed a prespecified nominal level. When using data to test theory, if the size of a test is too large, too many apparent anomalies will be "discovered" and perfectly sound paradigms may become disfigured or abandoned. In applied science, if the size is too large, ineffective methods, remedies, or policies will be implemented in the real world, and the problems they were designed to alleviate may persist or even worsen.

To address the size distortions induced by researchers' incentives, we construct CVs that are compatible with these incentives. Our incentive-compatible CVs (ICCVs) therefore bound test size by the nominal significance level. To construct ICCVs, we model the strategic behavior of researchers. Researchers face costs and benefits from conducting studies, and these incentives determine how many studies researchers conduct.

Imagine for example that a researcher conducts two-sided t-tests at a nominal significance level of 5%.
The usual CV for such tests, based upon the large-sample standard normal distribution of a t-statistic under the null hypothesis, is 1.96. Now imagine that the researcher collects a random sample of data, performs the test, and fails to reject the null hypothesis because the observed absolute t-statistic is below 1.96. Given that such a result is unlikely to be published, if the relative cost is low enough, the researcher may collect another sample of data and analyze it in a further attempt to reject the null hypothesis. This implies that the CV of 1.96, based upon the large-sample distribution of a t-statistic from a single study in isolation, is generally not incentive-compatible: it is built upon the assumption that researchers collect one data sample, although it generally incentivizes them to collect more than one (so long as the benefits from publication are large enough).

Indeed, for a given CV, researchers have the incentive to continue collecting data until either their test statistic exceeds the CV, or the expected benefit from data collection falls below the collection cost. A CV that is incentive-compatible needs to take this behavior into account to bound test size by the nominal significance level. When the ICCV is used, researchers may still conduct several studies. Nevertheless, using information about the cost of conducting research, the rewards to publication, and the data-collection process, we can deduce the maximum number of studies that the researcher has the incentive to conduct at any given CV. From this, we can compute an upper bound on the distribution of the reported test statistic, and therefore make sure that the CV achieves the desired significance level. When using our ICCVs, readers can be confident that a published rejection of a true null hypothesis occurs no more frequently than the nominal level of the test.

Because researchers often conduct more than one study before submitting their results to a journal, our ICCVs are necessarily larger than standard CVs. In particular, when scientific rewards are higher or data-collection costs are lower, the ICCVs are larger. Using both theoretical and numerical results, we obtain ICCVs for a broad range of researcher behavior, research costs, publication rewards, and researcher prior knowledge; for two-sided tests with 5% nominal significance level, we find ICCVs between 1.96 and 3 instead of the standard 1.96. For fields with better estimates of the costs and benefits of research and of prior knowledge, it is possible to obtain more precise ICCVs; however, our results indicate that ICCVs are fairly insensitive to these inputs, yielding convenient rules of thumb for practical application.

Other methods to control test size: modeling researcher behavior.
The issue of publication bias is of course well known, and several methods have been developed to address its various consequences (Christensen, Freese, and Miguel 2019; Wasserstein, Schirm, and Lazar 2019). In this paper we focus on one important consequence of publication bias: test size distortions. Such distortions mean that nominal significance levels (say, 5%) understate the actual probability of rejecting a true null hypothesis, such that positive results may be due much more to chance—and much less to the null hypothesis being false—than nominally stated.

To address size distortions, we opt to model researcher behavior and take this behavior into account to undo the distortions. Our approach is related to other approaches in the literature that recognize that researcher behavior may invalidate classical assumptions underlying standard statistical analyses. In particular, Glaeser (2006) assumes that researchers do not report the result of one study, but the most extreme result from n > 1 studies. Techniques based on always-valid p-values could be applied in principle to control size in some contexts that we consider, namely "pooling" data; but these techniques also require that the number of studies conducted by the researcher is known (see Johari, Pekelis, and Walsh 2019; Howard et al. 2019).

Other methods to control test size: constraining researcher behavior.
A different approach to controlling test size consists of constraining researchers' behavior through a pre-analysis plan (PAP), so they indeed only conduct one study and report the results from that study (Miguel et al. 2014; Olken 2015; Coffman and Niederle 2015). With such constraints, researchers' behavior conforms to the assumptions of classical statistics and publication bias disappears. In exchange for abiding by a PAP, journals could promise to publish results irrespective of significance as long as the research design is of sufficient quality (Christensen, Freese, and Miguel 2019, pp. 110–112). Results-blind review would eliminate journals' current preference for significance and thus researchers' incentive to work within the boundaries of the PAP to obtain significance.

Christensen, Freese, and Miguel (2019, pp. 107–117) discuss the strengths and limitations of PAPs. An important limitation is that PAPs prevent scientists from engaging in exploratory analysis, although it is often a source of new ideas and scientific discoveries. With observational data, an additional problem is that it would be difficult to ensure that the PAP is written before the data are observed. Gelman and Loken (2014, p. 464) concur: "For most of our own research projects [a PAP] hardly seems possible: In our many applied research projects, we have learned so much by looking at the data. Our most important hypotheses could never have been formulated ahead of time. . . . In any case, as applied social science researchers we are often analyzing public data on education trends, elections, the economy, and public opinion that have already been studied by others many times before, and it would be close to meaningless to consider preregistration for data with which we are already so familiar. . . . The most valuable statistical analyses often arise only after an iterative process involving the data. Preregistration may be practical in some fields and for some types of problems, but it cannot realistically be a general solution."

Because our ICCVs account for p-hacking without constraining researcher behavior, they could be used in conjunction with PAPs to further ensure correct size.

Methods to correct other facets of publication bias.
The focus of this paper is on controlling the size of hypothesis tests, which is inflated by publication bias. Publication bias also takes other forms, which our methodology cannot address, and which require additional corrections. For example, our methodology does not address the fact that only significant results are published, and almost all non-significant results remain unpublished, or even unwritten. Such selectivity is a problem for meta-analyses of literatures: a large subset of all studies may never appear in print, biasing the evidence included in meta-analyses (Rosenthal 1979). This bias is different from ours; it would occur even if scientists were not strategic and did not respond to incentives: even then the sample of published studies would be truncated at the significance level. Numerous methods have been developed to address this bias, and they would continue to be useful in meta-analyses even if ICCVs replaced standard CVs.

See Begg and Berlin (1988, sec. 5.3), Lau, Ioannidis, and Schmid (1998), and Stanley (2005) for surveys of publication bias in meta-analysis and solutions to debias the meta-estimates. See Simonsohn, Nelson, and Simmons (2014) and Andrews and Kasy (2019) for recent procedures to tackle publication bias in meta-analysis.
2. Research and publication process
We start by modeling the process of research and publication. We assume that researchers in a scientific community are interested in inferring the true value of a parameter β ∈ ℝ. In particular, we focus on problems for which scientists wish to test a particular null hypothesis for the value that β takes. In economics, the parameter β is often a causal or treatment effect, a regression coefficient, or a parameter in a structural model.

For concreteness, we focus on the most common hypotheses used in economics: two-sided hypotheses

(1)  H_0: β = β_0 versus H_1: β ≠ β_0,

where H_0 and H_1 constitute the null and alternative hypotheses of interest. In the common application of statistical significance testing, β_0 takes the value of zero. The analysis here extends straightforwardly to one-sided hypotheses (appendix B).

After performing statistical tests of (1), researchers communicate their findings in journals. We assume that journal editors only wish to publish papers that reject the prevailing null hypothesis characterized by H_0. We characterize journals as wishing to publish "significant" or "non-null" results, as is common practice in economics and many other disciplines. Researchers gather data to form a t-statistic for testing (1) and report the value of this statistic to a journal. Common practice in empirical work is to then deem H_0 as rejected, or the result of the research as "statistically significant," if the absolute value of the statistic exceeds an appropriate CV. Researchers receive an expected payoff v > 0 from publishing a rejection of H_0. The value that the expected payoff v takes depends upon the set of journals the researcher intends to submit to and the research question itself. For example, a researcher receives a higher expected payoff for submitting their work to a more prestigious set of journals but a lower expected payoff if those journals do not find the research question interesting.

It is common practice to derive the CV that determines rejection of H_0 from the upper quantiles of a standard normal distribution as this is the approximate large-sample distribution of a t-statistic under standard assumptions when H_0 holds. One of these standard assumptions is that only a single model and dataset is used to form the t-statistic. However, this assumption is violated in most applied research. Typically, a researcher looks through multiple model specifications or sets of data to form statistics. Given a researcher's positive payoff from publishing their work, we work under the assumption that the researcher actually reports the maximum absolute value of a set of latent t-statistics, generated across different sets of data or model specifications, to maximize his chances of rejecting H_0. Formally, we model the researcher as constructing a sequence of n t-statistics X*_1, . . . , X*_n from n "studies" that are unobservable to the editor. As usual, we assume the latent t-statistics take the form

(2)  X*_i = (β̂_i − β_0) / se(β̂_i),

where β̂_i denotes the estimator of β in study i and se(β̂_i) denotes a consistent standard-error estimator for β̂_i. After conducting n studies, the researcher reports

X_n = max{|X*_1|, . . . , |X*_n|}
to the journal upon submission of his research article. By standard central limit theorem and standard-error consistency arguments, each of the latent t-statistics X*_i is approximately normally distributed with unit variance when H_0 holds. In particular, under H_0, X*_i ∼ N(0, 1). Unless n = 1, the probability that X_n exceeds the standard critical value z_{1−α/2} and thereby rejects H_0 when it is true is greater than the nominal level α. This over-rejection is the type of publication bias we aim to correct.

The researcher determines when to stop conducting additional studies based upon a marginal cost-benefit analysis. After conducting n − 1 studies, the researcher receives the expected payoff v > 0 if X_{n−1} exceeds the CV z used by the editor to determine a statistically significant rejection of H_0. If X_{n−1} < z, the researcher chooses to conduct the nth study if the expected marginal payoff of conducting this additional study exceeds the expected cost of doing so. Formally, we model the expected benefit of conducting the nth study as

v P(X_n > z | 𝐗*_{n−1}) 𝟙(X_{n−1} < z),

where P is the researcher's subjective probability measure that incorporates his beliefs on the true value of β, and 𝐗*_{n−1} = (X*_1, . . . , X*_{n−1}) collects the latent t-statistics from the first n − 1 studies. We may wish to allow the expected cost of a study to depend upon the specific nature of that study. The costs of studies can vary due to factors such as different costs of acquiring data or running experiments or differences in the opportunity costs of different studies. In the most general version of our model, we assume the researcher incurs an expected marginal cost of c(n) ≥ 0 from conducting the nth study. After n − 1 studies, the expected marginal profit from conducting the nth study is thus

(3)  E(π(X_n, X_{n−1}; z, v, c) | 𝐗*_{n−1}) = v P(X_n > z | 𝐗*_{n−1}) 𝟙(X_{n−1} < z) − c(n)
  = v P(max{|X*_n|, X_{n−1}} > z | 𝐗*_{n−1}) 𝟙(X_{n−1} < z) − c(n)
  = v P(|X*_n| > z | 𝐗*_{n−1}) 𝟙(X_{n−1} < z) − c(n).

However, the researcher has the option to conduct further studies after the nth study if and only if he chooses to conduct the nth study. Thus, conducting the nth study not only yields the direct payoff in (3) but also the indirect payoff of the option value to conduct further studies, leading to a dynamic decision problem. Nevertheless, in the following proposition we show that if the researcher pursues studies in the order of their direct expected profitability, pursuing the next study if and only if its direct expected profit (3) is nonnegative is the optimal strategy.

Conditioning on 𝐗*_{n−1} does not make sense when n = 1. As a notational convention, we set 𝐗*_0 = X_0 = 0.
Proposition 1.
Suppose v P(|X*_n| > z | 𝐗*_{n−1}) − c(n) ≥ v P(|X*_{n+j}| > z | 𝐗*_{n−1}) − c(n + j) for all j > 0. Then the researcher chooses to conduct the next study if and only if X_{n−1} < z and v P(|X*_n| > z | 𝐗*_{n−1}) − c(n) ≥ 0.

Since it is reasonable to believe that researchers pursue studies in order of their profitability, we model the researcher as conducting the nth study according to whether (3) is positive or negative. Taking the reader or editor's perspective, for a given CV z, we seek to characterize the maximum number of studies that are profitable for the researcher to pursue. That is, we seek to characterize the maximum n = N(z) for which E(π(X_n, X_{n−1}; z, v, c) | 𝐗*_{n−1}) > 0. The editor then knows that, when faced with the CV z, the researcher reports X_{N(z)}. If the editor chooses the CV threshold by looking at the upper α-quantile of X_{N(z)} under H_0, she can therefore be confident that when H_0 holds she will receive a submission with probability α. Since N(z) endogenously depends upon z, the editor should then wish to find the value of z at which the probability that X_{N(z)} exceeds z is no larger than α when H_0 is true. This guarantees that over repeated submissions of results in favor of H_1, less than 100 · α% of these results are false when H_0 is the truth.

Our goal is to characterize the value of the CV z = z* that, based upon the researcher's incentives, does not lead to over-rejection of a true H_0. That is, we seek to find the smallest value z* such that

(4)  P_{H_0}(X_{N(z*)} ≥ z*) ≤ α,

where P_{H_0} denotes the objective probability measure given that the null hypothesis holds. We define this z* as the incentive-compatible critical value since the researcher has no incentive to conduct additional studies that would result in rejecting a true H_0 with probability larger than the nominal level of α. From the point of view of the editor, the maximum N(z) is a random variable since it depends upon the latent t-statistics X*_i. Nevertheless, if we know the joint distribution of the sequence of latent t-statistics X*_i, we may obtain the distribution of N(z). We now state a general result ensuring that the ICCV z* is well-defined and providing a sufficient condition under which it yields tests with non-trivial power.

We seek to find the smallest z* satisfying (4) in order to maximize power subject to controlling size.
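To make the size-inflation mechanism concrete, the following sketch simulates the reported statistic X_N = max{|X*_1|, . . . , |X*_N|} under H_0 for a fixed number of latent studies N, and then finds the smallest CV satisfying (4) for that fixed N. It is a minimal illustration rather than the paper's algorithm: in the paper, N(z) is endogenous to z, whereas here N is held fixed at a hypothetical value.

```python
import numpy as np
from scipy.stats import norm

def size_at_cv(n_studies, z, n_sim=1_000_000, seed=0):
    """Rejection rate under H0 when the researcher reports the maximum of
    n_studies absolute N(0,1) t-statistics and the editor uses the CV z."""
    rng = np.random.default_rng(seed)
    x_n = np.abs(rng.standard_normal((n_sim, n_studies))).max(axis=1)
    return (x_n >= z).mean()

print(size_at_cv(1, 1.96))  # ~0.05: with a single study, size is controlled
print(size_at_cv(3, 1.96))  # ~0.14: with three studies, 1 - 0.95**3

# For a fixed number of studies N, the smallest CV satisfying (4) solves
# [Phi(z) - Phi(-z)]**N = 1 - alpha.
N, alpha = 3, 0.05          # N is hypothetical; in the paper N(z) is endogenous
z_star = norm.ppf((1 + (1 - alpha) ** (1 / N)) / 2)
print(z_star)               # ~2.39
```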
Table 1. Parameter values in simulations

Value       | Description                                            | Source
1.99        | Mean of t-statistic prior                              | Roth and Murnighan (1978); Murnighan and Roth (1983)
0.40        | Standard deviation of t-statistic prior                | Roth and Murnighan (1978); Murnighan and Roth (1983)
v = $5,000  | Benefit from publishing a significant result           | Gibson, Anderson, and Tressler (2014)
c = $933    | Cost of a study                                        | Dal Bo (2005)
ϵ = …       | Cost elasticity of successive studies, c(n) = c × n^ϵ  | …

Proposition 2.
Suppose v P(|X*_n| > z | 𝐗*_{n−1}) − c(n) ≥ v P(|X*_{n+j}| > z | 𝐗*_{n−1}) − c(n + j) for all j > 0 and c(1) > 0. The ICCV z* defined in (4): (i) exists; and (ii) yields a test with non-zero power if and only if P(|X*_1| > z*) ≥ c(1)/v.

We now study different settings that yield different cost functions, subjective probability measures, and joint distributions of latent studies, leading us to different conclusions on how z* ought to be chosen by the journal editor so that (4) holds.
3. Illustrative calibration
In the next sections we construct ICCVs under different assumptions about researchers and the research process. In these different cases we derive theoretical results and compute the ICCVs in Monte Carlo simulations. Here, as a preamble, we calibrate the parameters of our model (table 1). We use these calibrated values in the subsequent simulations. The calibration also illustrates how the model can be mapped to the actual research process.

This illustrative calibration is based on Dal Bo (2005). There are several advantages to using this paper for calibration. First, it is based on independent and (nearly) identically distributed laboratory experiments, making it straightforward to model the joint distribution of 𝐗*_n. Second, everything in the paper is well documented, making calibration of most model parameters straightforward. Third, the paper is based on a specific question, within a well delineated literature, making the calibration of reasonable subjective beliefs of the researcher straightforward.

Conceptually, the null hypothesis explored by Dal Bo is H_0: people do not account for the future in strategic decisions; the alternative hypothesis is H_1: people do account for the future in strategic decisions. The hypothesis is explored in eight experiments. Dal Bo compares the level of cooperation in repeated prisoner's dilemma games that have different probabilities of continuation. In some treatments, the prisoner's dilemma games are one-shot. The only equilibrium is no cooperation, so players are not expected to cooperate. In other treatments, the probability of continuation—which governs the probability of future interaction between players—is 1/2 or 3/4. In these treatments, while no cooperation remains an equilibrium outcome, cooperation equilibria also appear. Then, the likelihood of cooperation between players is expected to increase.

Formally, the statistical parameter of interest β is the difference in the cooperation probability between games in which theory predicts no cooperation and games in which theory predicts possible cooperation. The null and alternative hypotheses become H_0: β = 0 versus H_1: β ≠ 0. Here, we model the alternative hypothesis as two-sided, in agreement with convention for empirical work in economics. To test the alternative that the possibility of future interaction increases cooperation, it might be more appropriate to use the upper one-sided alternative H_1: β > 0 (the analysis extends to one-sided tests in appendix B).

We calibrate the researcher's prior beliefs about the mean of the t-statistic, which is equal to β/sd(β̂). Here, we take an empirical Bayes approach, eliciting Dal Bo's prior from previous estimates. Before Dal Bo's article, there were two articles on the same topic: Roth and Murnighan (1978) and Murnighan and Roth (1983). Dal Bo presumably used these two precursors to form his prior beliefs about the true value of β. In fact, Dal Bo (2005, table 1) reports the results from these studies. Although these studies report estimates of β, they do not report the associated t-statistics or standard errors se(β̂), which are estimates of sd(β̂). However, the matched-pairs design of the experiments in these studies allows us to bound se(β̂) from above and below in each study. We then simply take the midpoint of these bounds as the estimate of sd(β̂) used to calibrate the researcher's prior for β/sd(β̂). The details are provided in appendix C.

We obtained four previous values of t-statistics for testing H_0, β/sd(β̂), from the four treatments in the two articles.
Taking the sample-size-weighted average of these values, we obtain the mean of the prior belief on the mean of the t-statistic: E(β/sd(β̂)) = 1.99. Taking the (sample-size-weighted) sample variance of these values, we obtain the variance of the prior belief on the mean of the t-statistic: var(β/sd(β̂)) = 0.16, that is, a standard deviation of 0.40. Finally, we assume that Dal Bo's subjective prior follows a normal distribution. The weights account for the fact that one study has 121 subjects while the other has 252 subjects. From an empirical Bayes perspective, a normal prior can be justified when averaging results of previous studies via a central limit theorem.

The laboratory where the experiments were conducted was available at no cost, so the marginal cost of an experiment is the cost to pay the participants and the research assistant monitoring them. On average, there are 48.75 subjects in each experiment, and each subject earns $18.94, so paying the participants costs 48.75 × $18.94 = $923.3.
In addition, each experiment lasts long enough that the research assistant monitoring it must be paid about $10, so the marginal cost of an experiment is c = $923.3 + $10 = $933.3.

Dal Bo's article was published in the American Economic Review (AER). What is the value of such a publication? Using data from the University of California system, Gibson, Anderson, and Tressler (2014) find that the publication of 10 AER-quality pages increases academic earnings by 1.3%, which translates to a net present value over an entire career of about $28,466. Since the length of Dal Bo's article is 14 pages, its value is $28,466 × 1.4 = $39,852.
Journal of Political Economy . The valueof publishing in other journals can be inferred from their relative ranking compared to these topjournals (see Gibson, Anderson, and Tressler 2017, table A1). Of course obtaining a significantresult is not sufficient to publish in the top journals: each of them publishes less than 5% ofthe articles they receive (Card and DellaVigna 2013; Zheng and Kaiser 2016). Depending on thecorrelation structure between the decisions of these top journals, as well as the correlationswith decisions of lower-ranked journals, we could compute the expected net present value ofsubmitting a significant result. We approximate this value as v = $5 , c / v , actually determines the ICCV.In appendix D, we provide an alternative indirect method for eliciting the implied value ofthis ratio from the number of studies a researcher would typically conduct without rejecting H before moving on to a different research question. Again using data from Dal Bo (2005), we find arange of implied values consistent with the calibrated value of c / v = / , ≈ .
4. Baseline: an impossibility result
To build intuition, we begin with the simplest version of our model of research and publication. Consider a researcher who may conduct a sequence of independent studies to determine the value of β, where the estimator of β in study i, β̂_i, is constructed from the data gathered in study i only. These estimators are statistically identical and, appealing to a standard central limit theorem, approximately distributed as N(β, var(β̂)) in large samples. This case would arise for example when the studies are experiments, the sample size in each experiment is the same, and the observations in each experiment are drawn from the same underlying population. Since the researcher does not know the true value β, we treat it as a random variable from his point of view and write β̂_i | β ∼ N(β, var(β̂)), iid across i. In this setting, the latent t-statistics are independent and identically distributed, approximately following a normal distribution:

X*_i | θ ∼ N(θ, 1), iid across i, where θ = (β − β_0)/sd(β̂)

is a scaled version of the difference between the true value β and its hypothesized null value β_0, with sd(β̂) being the true population standard deviation of β̂. Note that in this setting, the null and alternative hypotheses (1) are equivalent to

(5)  H_0: θ = 0 versus H_1: θ ≠ 0.

Suppose the researcher incurs a constant cost of c(n + 1) = c ≥ 0 for each study and holds a subjective prior distribution on θ of F(θ). This prior distribution may arise, for example, as the result of previous studies conducted by other researchers or as the biases of the researcher. For now we assume that the researcher does not learn about θ in the process of conducting his studies so his subjective distribution on θ does not change as he conducts more studies. Specializing (3) to the current context, the expected marginal profit from conducting an additional study after conducting n − 1 studies is

(6)  E(π(X_n, X_{n−1}; z, v, c) | 𝐗*_{n−1}) = v ∫ P(|X*_n| > z | 𝐗*_{n−1}, θ) dF(θ) 𝟙(X_{n−1} < z) − c
  = v ∫ [1 − Φ(z − θ) + Φ(−z − θ)] dF(θ) 𝟙(X_{n−1} < z) − c.
At stage n − 1, the researcher engages in another study if and only if E(π(X_n, X_{n−1}; z, v, c) | 𝐗*_{n−1}) > 0. That is, he will conduct another study if and only if X_{n−1} < z and

(7)  ∫ [1 − Φ(z − θ) + Φ(−z − θ)] dF(θ) ≥ c/v.

Now we can make two observations:
Remark 1.
Depending upon the values c, v, z, and the prior F(θ), either the researcher conducts zero studies or continues to conduct studies indefinitely until X_n > z.

Remark 2.
A low marginal cost c, high payoff v, low CV z, or a prior distribution placing large weight on large values of θ leads to engagement in studies indefinitely, searching for significance.

Given remark 1, it is actually impossible for the journal editor to set the CV z to a value so the probability that the observed X_n exceeds z is strictly between zero and one under H_0. If z is set low enough that (7) holds, the researcher conducts studies until he rejects H_0. Conversely, if the editor sets z high enough so (7) is violated, the researcher never conducts a single study. In other words, depending upon the editor's choice of z, the resulting size of the test is either zero (and so is the power) or one. In both cases, the test becomes uninformative. Therefore, it is impossible to find an ICCV in this setting that yields an informative test. The threshold choice of CV z that dictates which of these two cases pertains is equal to about 3 for the calibrated model inputs from the previous section.

Since θ = (β − β_0)/sd(β̂), this prior can be elicited directly from a prior on the original parameter of interest β.

In the next three sections, we show that this impossibility result is a special case that relies on (i) a constant marginal cost, (ii) a researcher that does not learn, and (iii) studies that are independent of each other.
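Under a normal prior θ ∼ N(μ, σ²), the left-hand side of (7) has the closed form 1 − Φ((z − μ)/√(1 + σ²)) + Φ((−z − μ)/√(1 + σ²)), because the marginal distribution of a latent t-statistic is then N(μ, 1 + σ²). The sketch below uses this formula to locate the threshold CV at which (7) switches from holding to failing, a minimal illustration of the knife-edge just described, assuming the calibrated prior N(1.99, 0.40²) and c/v ≈ 0.19 from table 1.

```python
from scipy.optimize import brentq
from scipy.stats import norm

mu, sigma = 1.99, 0.40      # calibrated prior on theta (table 1)
c_over_v = 933 / 5_000      # calibrated cost-benefit ratio, ~0.19

def continuation_lhs(z):
    """Subjective probability that |X*| > z under the prior,
    i.e., the left-hand side of condition (7)."""
    s = (1 + sigma**2) ** 0.5
    return 1 - norm.cdf((z - mu) / s) + norm.cdf((-z - mu) / s)

# Threshold CV where (7) binds: below it the researcher samples until
# rejection, above it he never starts.
z_bar = brentq(lambda z: continuation_lhs(z) - c_over_v, 0, 10)
print(z_bar)  # ~3, matching the text
```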
5. Sampling increasingly costly data
We now modify the analysis of section 4 by assuming that the expected cost of conducting an additional study is increasing in the number of studies already run. This modification can incorporate situations in which it becomes increasingly costly to collect new data or run new experiments, or could simply capture increasing marginal opportunity costs on the part of the researcher. In terms of the marginal cost function introduced in section 2, we assume that c: ℝ₊ → ℝ₊ is continuous, strictly increasing, c(0) = 0, and lim_{n→∞} c(n) = ∞. These assumptions ensure that new studies eventually become overwhelmingly costly, so a researcher eventually stops conducting new studies pertaining to (1), and that it is costless not to conduct a study. The continuity assumption is without loss of generality since c will only be evaluated at integer values. We maintain all of the other assumptions of section 4, so an analogous analysis leads us to the conclusion that at stage n − 1, the researcher engages in the nth study if and only if X_{n−1} < z and

(8)  ∫ [1 − Φ(z − θ) + Φ(−z − θ)] dF(θ) ≥ c(n)/v.

For this problem, N(z) is thus equal to the maximum positive integer n for which X_{n−1} < z and the above condition holds under H_0. If

(9)  ∫ [1 − Φ(z − θ) + Φ(−z − θ)] dF(θ) < c(1)/v,

then N(z) = 0: given the researcher's prior F(θ), the significance threshold z is too large relative to c(1)/v for even a first study to be worthwhile. Since the marginal cost function is strictly increasing in this setting, we obtain the following result that breaks the impossibility result of the previous section.

Proposition 3.
For any CV z, the researcher eventually stops conducting studies even if he never attains rejection.

Since X_n is the maximum of the absolute values of n mean-zero iid normal variables under H_0, P_{H_0}(X_n < z) = [Φ(z) − Φ(−z)]^n for z ≥ 0. The ICCV is therefore the smallest z* such that

1 − [Φ(z*) − Φ(−z*)]^{N(z*)} ≤ α.

If the researcher's prior distribution F(θ) is conjugate with the normal distribution, the computation of N(z) simplifies because the integral in the continuation condition (8) can be expressed in closed form. The proposition below provides three examples of such computational simplification: a mass point prior, a uniform prior, and a normal prior.

Proposition 4.
In the increasing cost setting of this section, N(z) is equal to the maximum positive integer n for which X_{n−1} < z and

1. if dF(θ) = δ(θ − θ̃) dθ for some θ̃ ∈ ℝ, where δ denotes the Dirac delta function,

n ≤ c⁻¹( v [1 − Φ(z − θ̃) + Φ(−z − θ̃)] );

2. if dF(θ) = (b − a)⁻¹ 𝟙(a ≤ θ ≤ b) dθ for some a < b,

n ≤ c⁻¹( v [1 − (b − a)⁻¹ { φ(z − a) − φ(z − b) + (z − a)Φ(z − a) + (b − z)Φ(z − b) − φ(−z − a) + φ(−z − b) + (a + z)Φ(−z − a) − (b + z)Φ(−z − b) }] );

3. if dF(θ) = σ⁻¹ φ((θ − μ)/σ) dθ for some μ ∈ ℝ and σ > 0,

n ≤ c⁻¹( v [1 − Φ((z − μ)/√(1 + σ²)) + Φ((−z − μ)/√(1 + σ²))] ).

For each of these special cases, it would be reasonable for the editor to take an empirical Bayes approach and estimate the parameters entering the prior distributions using data from previous studies.

The analysis yields the following remarks:
Remark 3.
If the expected payoff from rejecting H_0, v, is larger, the ICCV z* is larger. This means that editors evaluating research about "important" questions should impose higher CVs. All else fixed, researchers have more incentive to search across many studies when the publication payoff is higher. A larger CV can be used by the editor to counteract the higher rate of false rejections induced by these high payoffs by reducing the incentive to conduct many studies.

Remark 4.
If the cost of running experiments, c(n), is larger, the ICCV z* is smaller. This means that editors evaluating research with high costs should impose smaller CVs (and vice versa). All else fixed, researchers have less incentive to conduct studies when they are costly. A smaller CV can be used by the editor to account for high research costs by increasing the incentive to conduct studies.

Remark 5.
The more probability mass a researcher's prior places on large values of θ, the larger the ICCV z* is. This means that editors evaluating studies for which researchers are more confident that H_0 is violated should impose a larger CV. When they are more confident that H_0 is false, researchers expect larger payoffs from continuing to conduct studies. All else equal, this gives researchers more incentive to continue conducting studies on (1). Hence, a larger CV can be used by the editor to counteract the corresponding higher rate of false rejections.

Remark 6.
In practice, the journal editor may not know the prior of the researcher. However, it is reasonable to estimate this prior from the outcomes of past studies, either parametrically or nonparametrically. If there are no past studies to gather data from, it is reasonable to simply use a diffuse prior.
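Propositions 3 and 4 suggest a direct recipe for computing the ICCV in this setting, sketched below for the normal-prior case: compute N(z) from the closed form in part 3 of proposition 4 with an increasing cost c(n) = c × n^ϵ, then scan for the smallest z with 1 − [Φ(z) − Φ(−z)]^{N(z)} ≤ α. This is a minimal sketch under the table 1 calibration, not the paper's simulation code; the cost elasticity ϵ = 1 is a hypothetical choice for illustration.

```python
import math
from scipy.stats import norm

mu, sigma = 1.99, 0.40        # calibrated prior on theta (table 1)
c_over_v = 933 / 5_000        # calibrated cost-benefit ratio
alpha = 0.05
eps = 1.0                     # hypothetical cost elasticity in c(n) = c * n**eps

def N_of_z(z):
    """Largest n satisfying the continuation condition (8) at CV z,
    using the normal-prior closed form of proposition 4, part 3."""
    s = math.sqrt(1 + sigma**2)
    p = 1 - norm.cdf((z - mu) / s) + norm.cdf((-z - mu) / s)
    if p < c_over_v:          # condition (9): not even one study is worthwhile
        return 0
    return math.floor((p / c_over_v) ** (1 / eps))

def size(z):
    """Probability that the reported X_N(z) rejects a true H0 (proposition 3)."""
    n = N_of_z(z)
    if n == 0:
        return 0.0            # the researcher never starts, so no rejection
    return 1 - (norm.cdf(z) - norm.cdf(-z)) ** n

z = 1.96                      # scan upward for the smallest size-controlling CV
while size(z) > alpha:
    z += 0.001
print(round(z, 2))            # ~2.24 with these hypothetical inputs
```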
The ICCVs with increasingly costly studies, obtained by Monte Carlo simulations, are illustrated in figure 1. A first insight from the figure is that the publication bias that we aim to correct—the size distortion of hypothesis tests—may be fairly large. The figure displays the size of a two-sided hypothesis test at the standard 5% CV of 1.96, for various priors (panel A), various cost-benefit ratios (panel C), and various cost elasticities (panel E). If researchers conducted only one study and performed the hypothesis test on it, as is assumed in classical statistics, the size would be 5%. But when researchers sample datasets, the size is much higher than 5%: as high as 24% in panel A, 23% in panel C, and 30% in panel E. So a significant result is much more likely to be due to luck than advertised by the nominal significance level of 5%.

A second insight from the figure is that the ICCVs required to bring the actual test size back to 5% are above the standard value of 1.96 but not by a tremendous amount. The figure displays the ICCV that delivers a size of 5% for two-sided tests, for various priors (panel B), various cost-benefit ratios (panel D), and various cost elasticities (panel F). Across parameterizations, the ICCV is above 1.96 but never above 2.5. Moreover, prior beliefs about the estimated parameter and incentives faced by researchers do not have a strong impact on the ICCV. Across a broad range of priors and cost-benefit ratios from publication, the ICCV lies between 1.96 and 2.5.
Figure 1.
ICCVs at 5% significance level when researchers sample increasingly costly data
This figure considers researchers who attempt to obtain significance in two-sided hypothesis tests by sampling iid datasets that are increasingly costly to collect: collecting the nth dataset costs c(n) = c × n^ϵ, where ϵ is the cost elasticity. All results are obtained from Monte Carlo simulations with the parameter values in table 1, except for those parameter values specified on the axes. Panels A, C, and E display the size of the hypothesis test at the standard CV of 1.96, for various priors (panel A), various cost-benefit ratios (panel C), and various cost elasticities (panel E). If researchers collected only one dataset and conducted the test on it, as is assumed in classical statistics, the size would be 5%. But when researchers sample datasets, the size is much higher than 5%, sometimes as high as 25%. Panels B, D, and F display the ICCV that delivers a size of 5%, for various priors (panel B), various cost-benefit ratios (panel D), and various cost elasticities (panel F). Across parameterizations, the ICCV falls between 1.96 and 2.6.

6. Learning while sampling data

We now depart from section 4 by allowing the researcher to learn about the true value of θ. To keep the analysis straightforward, we assume that as he conducts successive studies, the researcher updates his prior distribution F(θ) according to Bayes' rule, but we maintain all of the other assumptions of section 4. Learning about the true value of θ can induce the researcher to stop conducting additional studies, breaking the negative result of section 4.

After n − 1 studies, rather than using the prior F(θ) to compute the expected marginal profit from conducting an additional study according to (6), the researcher incorporates all of the information contained in the n − 1 previous studies through the posterior distribution F(θ | 𝐗*_{n−1}). Assuming that the researcher's prior distribution F(θ) admits a pdf f(θ), for n ≥ 1 the posterior pdf for θ takes the form

(10)  f(θ | 𝐗*_n) = f(𝐗*_n | θ) f(θ) / ∫ f(𝐗*_n | θ) f(θ) dθ,

where the likelihood f(𝐗*_n | θ) is the pdf of a N(θι_n, I_n) distribution evaluated at 𝐗*_n, with ι_n denoting an n-vector of ones. This is the joint pdf corresponding to the random vector 𝐗*_n when the value of θ is known. As a notational convention, for n = 1, F(θ | 𝐗*_0) is equal to F(θ). Maintaining all of the other assumptions in section 4, we arrive at the conclusion that at stage n − 1, the researcher engages in the nth study if and only if X_{n−1} < z and

(11)  ∫ [1 − Φ(z − θ) + Φ(−z − θ)] dF(θ | 𝐗*_{n−1}) ≥ c/v.

In contrast to the previous section, even if we know X_{n−1} < z, whether or not the researcher chooses to engage in another study depends upon the realization of the random vector 𝐗*_{n−1}. For example, if the researcher obtains a large draw for |X*_{n−1}| that is not quite large enough to cross the threshold z, his posterior for θ will be updated to shift probability mass toward larger values, and (11) can hold. On the other hand, a very small draw for |X*_{n−1}| can induce his posterior to shift probability mass toward very small values, so (11) can be violated. These facts break the impossibility result of section 4, so we can define N(z) to be the largest n for which X_{n−1} < z and (11) holds when the data are generated under H_0. As in proposition 3, we obtain a result that breaks the impossibility result of section 4.

Proposition 5.
Let θ* denote the true value of θ. For any CV z large enough such that 1 − Φ(z − θ*) + Φ(−z − θ*) < c/v, the researcher eventually stops conducting studies even if he never attains rejection.

Computing z* in this context is complicated by the fact that F(θ | 𝐗*_{n−1}) depends upon the realizations of prior studies, which are not observed by the journal editor. Nevertheless, since we know the distribution of 𝐗*_{n−1} under H_0, namely N(0, I_{n−1}), z* remains feasible to compute in the presence of learning by the researcher. In the following proposition, we examine the special case for which the researcher's prior on θ follows a N(μ, σ²) distribution, since this distribution results naturally from an empirical Bayes approach to prior construction (via a central limit theorem approximation) and because it provides analytically tractable results that are useful for conveying intuition.

Proposition 6.
In the learning while sampling setting of this section, if dF(θ) = σ⁻¹ φ((θ − μ)/σ) dθ for some μ ∈ ℝ and σ > 0, then N(z) is the largest positive integer n for which X_{n−1} < z and

1 − Φ( (z − μ_{n|n−1}) / √(1 + σ²_{n|n−1}) ) + Φ( (−z − μ_{n|n−1}) / √(1 + σ²_{n|n−1}) ) ≥ c/v,

where

μ_{n|n−1} = (σ² ∑_{i=1}^{n−1} X*_i + μ) / ((n − 1)σ² + 1)  and  σ²_{n|n−1} = σ² / ((n − 1)σ² + 1).

This proposition allows us to make a few observations:
Remark 7.
The mean of the posterior distribution,

μ_{n|n−1} = [(n − 1)σ² / ((n − 1)σ² + 1)] X̄*_{n−1} + [1 / ((n − 1)σ² + 1)] μ,

is a weighted average of the sample mean of previous latent studies X̄*_{n−1} ≡ (n − 1)⁻¹ ∑_{i=1}^{n−1} X*_i and the prior mean μ, where the former receives relatively more weight as the number of previous studies n − 1 increases. For example, in the simplest case of only one previous study and a prior variance equal to the variance of that previous study (σ² = 1), the posterior mean is equal to the arithmetic average of the realization of the previous study and the prior mean.

Remark 8.
As the number of previous studies n grows large, the posterior mean μ_{n+1|n} converges to the sample mean of the previous studies.

Remark 9.
As the number of previous studies n grows large, the posterior variance σ²_{n+1|n} = σ²/(nσ² + 1) shrinks toward zero. In conjunction with remark 8, this implies that for very large n the posterior distribution concentrates heavily around the sample mean of the previous studies, which by the law of large numbers converges to the true value θ. This is simply a consequence of a more general Bernstein–von Mises theorem or standard posterior contraction result from Bayesian statistics. Letting the true value of θ be denoted by θ*, this implies that the researcher must eventually stop conducting studies if 1 − Φ(z − θ*) + Φ(−z − θ*) < c/v. Since Φ is strictly increasing, lim_{x→∞} Φ(x) = 1, and c/v > 0, this implies that for any θ*, there is a z large enough that induces the researcher to eventually stop conducting studies.
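A Monte Carlo sketch of this learning process, under the calibrated normal prior, is below. For each simulated researcher it applies proposition 6: update the posterior mean and variance after each null-generated draw, continue while the continuation condition (11) holds, and record whether the reported statistic crosses the CV. It is a stylized illustration (with a hypothetical cap on the number of studies), not the paper's simulation code.

```python
import numpy as np
from scipy.stats import norm

mu0, sig0 = 1.99, 0.40        # calibrated prior on theta (table 1)
c_over_v = 933 / 5_000        # calibrated cost-benefit ratio
MAX_STUDIES = 100             # hypothetical safety cap for the simulation

def rejects(z, rng):
    """Simulate one researcher under H0 (theta = 0) who learns via Bayes' rule.
    Returns True if the reported max |t-statistic| exceeds the CV z."""
    draws = []
    for n in range(1, MAX_STUDIES + 1):
        # Posterior parameters after n - 1 studies (proposition 6).
        k = (n - 1) * sig0**2 + 1
        mu_post = (sig0**2 * sum(draws) + mu0) / k
        var_post = sig0**2 / k
        # Continuation condition (11) under the posterior.
        s = np.sqrt(1 + var_post)
        p = 1 - norm.cdf((z - mu_post) / s) + norm.cdf((-z - mu_post) / s)
        if p < c_over_v:
            break                      # expected benefit below cost: stop
        x = rng.standard_normal()      # latent t-statistic under H0
        draws.append(x)
        if abs(x) > z:
            break                      # significance attained: stop and submit
    return bool(draws) and max(abs(d) for d in draws) > z

rng = np.random.default_rng(0)
for z in (1.96, 2.5, 3.0):
    size = np.mean([rejects(z, rng) for _ in range(10_000)])
    print(z, round(size, 3))           # size falls with z; compare to the 5% target
```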
Figure 2.
ICCVs at 5% significance level when researchers learn while sampling data
The figure considers researchers who attempt to obtain significance in two-sided hypothesis tests by sampling iid datasets, and who update their beliefs about the distribution of the t-statistic after observing the value of the t-statistic in each new dataset. All results are obtained from Monte Carlo simulations with the parameter values in table 1, except for those parameter values specified on the axes. Panels A and C display the size of the hypothesis test at the standard CV of 1.96, for various priors (panel A) and various cost-benefit ratios (panel C). If researchers collected only one dataset and conducted the test on it, as is assumed in classical statistics, the size would be 5%. But when researchers sample datasets, the size is much higher than 5%, sometimes as high as 100%. Panels B and D display the ICCV that delivers a size of 5%, for various priors (panel B) and various cost-benefit ratios (panel D). Across parameterizations, the ICCV falls between 1.96 and 3.4.
7. Pooling data
Instead of allowing the researcher to learn about the true value of θ, we take a different departure from section 4 here and have the researcher accumulate data as he moves from one study to the next. To keep the analysis simple, suppose that the ith study involves collecting an additional T random variables with associated constant cost c(i) = c, and that the researcher's estimate of β in the ith study is equal to the sample mean of the iT independent and identically distributed random variables collected thus far. That is,

β̂_i = (iT)⁻¹ ∑_{j=1}^{iT} Y_j

for some sequence of independent and identically distributed random variables Y_1, . . . , Y_{iT}. In this setting, a standard large-sample approximation yields β̂_i | β ∼ N(β, var(Y_j)/(iT)). However, since they use some of the same underlying random variables in their construction, the different β̂_i's are no longer independent of one another. Instead, they are jointly normally distributed with

cov(β̂_i, β̂_k) = var(Y_j)/(kT) for i ≤ k.

In turn, the latent t-statistics are jointly normally distributed according to X*_i | θ ∼ N(√i θ, 1), where

θ = √T (β − β_0) / sd(Y_j)

and cov(X*_i, X*_k) = √(i/k) for i ≤ k.

Due to the correlation between the latent t-statistics as well as their differing means, specializing (3) to the current context changes the expression for the expected marginal profit from conducting an additional study from (6). In this setting,

X*_n | X*_{n−1} = x*_{n−1}, θ ∼ N( √n θ + √((n − 1)/n) (x*_{n−1} − √(n − 1) θ), 1/n ),

so that

(12)  E(π(X_n, X_{n−1}; z, v, c) | 𝐗*_{n−1}) = v ∫ P(|X*_n| > z | 𝐗*_{n−1}, θ) dF(θ) 𝟙(X_{n−1} < z) − c
  = v ∫ [1 − Φ(√n z − θ − √(n − 1) X*_{n−1}) + Φ(−√n z − θ − √(n − 1) X*_{n−1})] dF(θ) 𝟙(X_{n−1} < z) − c.

Hence, the researcher engages in the additional study if and only if X_{n−1} < z and

(13)  ∫ [1 − Φ(√n z − θ − √(n − 1) X*_{n−1}) + Φ(−√n z − θ − √(n − 1) X*_{n−1})] dF(θ) ≥ c/v.

Similarly to the case with learning, even if we know X_{n−1} < z, the researcher's choice to engage in another study depends upon the realization of the previous latent t-statistic X*_{n−1}. Though the mechanism through which the previous latent t-statistic determines the researcher's choice here is different from the case with learning, the qualitative effect is similar: large enough draws of X*_{n−1} cause (13) to hold while small enough draws can cause it to be violated. This effect arises because the researcher is accumulating data, rather than updating his prior, so the previous latent t-statistic contains a lot of information about the subsequent t-statistics. As in propositions 3 and 5, we obtain a result that breaks the impossibility result of section 4, this time in the context of pooling data.

Proposition 7.
Under H_0, for any CV z > 0, the researcher will eventually stop conducting studies even if he never attains rejection.

Since the impossibility result of section 4 is also broken in this context, we similarly define N(z) to be the largest n for which X_{n−1} < z and (13) holds when the data are generated under H_0. We make a few additional observations:

Remark 10. Results analogous to propositions 4 and 6 also apply here. Certain specifications of the researcher's prior allow us to analytically evaluate the integral on the left-hand side of (13) and simplify the expression for N(z).

Remark 11.
Since √(n − 1) X*_{n−1} = (n − 1)θ* + √(n − 1) Z, where θ* denotes the true value of θ and Z ∼ N(0, 1), the terms √n z and √(n − 1) X*_{n−1} dominate any finite value of θ in order of magnitude if the researcher has accumulated enough data (for n large enough). This implies that the effect of any prior distribution F(θ) that only places positive probability mass over finite values of θ (as is standard) on the decision of the researcher eventually disappears as the researcher continues to accumulate data. This feature of the prior distribution being eventually "washed out" from the decision problem is also apparent in the learning context (see remarks 8 and 9). However, this occurs in the current context because the researcher accumulates data, rather than throwing it away in each successive latent study. As more data accumulate, the researcher has more information on the value of future t-statistics.

Remark 12. If H_1 holds so θ* ≠ 0, then

√n z − θ − √(n − 1) X*_{n−1} = −(n − 1)θ* + O_p(√n)

diverges (in probability) for any finite θ as the researcher accumulates more data. This implies that for any standard prior distribution F(θ), the left-hand side of (13) converges to one (in probability), so if c ≤ v and the number of studies already conducted is large, there is a high probability that the researcher continues to conduct studies and accumulate data until he can reject H_0.

The ICCVs when researchers pool data, obtained by Monte Carlo simulations, are illustrated in figure 3. Here the size distortions we aim to correct are more moderate. Figure 3 displays the size of a two-sided t-test at the standard CV of 1.96 for various priors (panel A) and various cost-benefit ratios (panel C). When researchers pool datasets, the size is higher than 5%, indicating size distortion: as high as 20% in panel A and 10% in panel C. Hence, the size distortions when researchers pool data are less severe than when researchers sample data, as in figures 1 and 2. Since the size distortions are smaller, the ICCVs required to bring the actual test size back to 5% are above the standard value of 1.96, but not by much. The figure displays the ICCV that delivers a size of 5% for two-sided tests, for various priors (panel B) and various cost-benefit ratios (panel D). Across parameterizations, the ICCV is above 1.96, but never above 2.5.

Figure 3 also shows that the incentives faced by researchers barely affect the ICCV. Across a broad range of cost-benefit ratios from publication, the ICCV remains around 2.2 (panel D). Prior beliefs about the estimated parameter (especially the prior mean) have a stronger impact on the ICCV.
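The pooling case can also be simulated directly: under H_0, generate one long data stream, form the cumulative t-statistic X*_n, and apply the stopping rule (13). The sketch below does this for a point-mass prior at the calibrated prior mean, a simplification chosen so the integral in (13) collapses to two Φ terms; the per-study sample size T and the cap on the number of studies are hypothetical.

```python
import numpy as np
from scipy.stats import norm

theta_tilde = 1.99            # point-mass prior on theta (simplification)
c_over_v = 933 / 5_000        # calibrated cost-benefit ratio
T = 50                        # hypothetical observations added per study
MAX_STUDIES = 100             # hypothetical cap for the simulation

def rejects(z, rng):
    """One researcher pooling data under H0. The latent t-statistic after n
    studies is X*_n = sqrt(nT) * (sample mean of nT standardized draws),
    which has mean 0, unit variance, and correlation sqrt(i/k) across studies."""
    x_prev, cum_sum, x_max = 0.0, 0.0, 0.0
    for n in range(1, MAX_STUDIES + 1):
        # Continuation condition (13) with a point-mass prior at theta_tilde.
        m = np.sqrt(n) * z - theta_tilde - np.sqrt(n - 1) * x_prev
        p = 1 - norm.cdf(m) + norm.cdf(m - 2 * np.sqrt(n) * z)
        if p < c_over_v:
            break                                   # expected benefit below cost
        cum_sum += rng.standard_normal(T).sum()     # T new observations under H0
        x_prev = cum_sum / np.sqrt(n * T)           # cumulative t-statistic X*_n
        x_max = max(x_max, abs(x_prev))
        if x_max > z:
            break                                   # significance attained
    return x_max > z

rng = np.random.default_rng(0)
for z in (1.96, 2.2, 2.5):
    print(z, np.mean([rejects(z, rng) for _ in range(10_000)]))
```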
Figure 3.
ICCVs at 5% significance level when researchers pool data
The figure considers researchers who attempt to obtain significance in two-sided hypothesis tests by pooling iid datasets. All results are obtained from Monte Carlo simulations with the parameter values in table 1, except for those parameter values specified on the axes. Panels A and C display the size of the hypothesis test at the standard CV of 1.96, for various priors (panel A) and various cost-benefit ratios (panel C). If researchers collected only one dataset and conducted the test on it, as is assumed in classical statistics, the size would be 5%. But when researchers pool datasets, the size is higher than 5%, sometimes as high as 20%. Panels B and D display the ICCV that delivers a size of 5%, for various priors (panel B) and various cost-benefit ratios (panel D). Across parameterizations, the ICCV falls between 1.96 and 2.5.
8. General case
In the general case, we would like to allow for dependence across latent studies and potentially different means of the underlying sequence of latent t-statistics to incorporate observational studies with dependent data, cases for which the researcher adds or subtracts new data to a study, instrument selection when using two-stage least squares, regression specification for ordinary least squares, looking through studies of different precision (resulting in different standard errors), etc. In this more general formulation of the problem, the latent outcome of empirical study i ≥ 1 is the t-statistic X*_i, which is approximately distributed N(θ_i, 1) with cov(X*_i, X*_j) ≡ ω_ij, so for any n ≥ 1 we may write

𝐗*_n ∼ N(θ_n, Ω_n) with θ_n ≡ [θ_1, . . . , θ_n]′ and Ω_n ≡ [ω_ij]_{i,j=1}^n.

In particular, θ_n = 0 under H_0. We assume that θ_n is unknown to the researcher for all n ≥ 1, while Ω_n is known after n studies have been conducted. These assumptions approximate the large-sample joint distribution of a set of latent t-statistics for testing (1) in a general framework under which standard errors and correlations between estimators can be consistently estimated. We provide concrete examples of these quantities in different research settings below. With a slight abuse of notation, we denote the researcher's prior distribution on θ_n as F(θ_n) and also discuss below how a prior distribution on β translates to F(θ_n).

Sampling data.
Sampling data. In the case that each successive study uses an independent sample from the same population to estimate $\beta$, from the point of view of the researcher we have the following large-sample distributional approximation: $\hat\beta_i \mid \beta \overset{iid}{\sim} N(\beta, \mathrm{var}(\hat\beta_i))$, where $\mathrm{var}(\hat\beta_i) = \varsigma^2 / T_i$ for some $\varsigma^2$, with $T_i$ denoting the sample size of the $i$th study. In this case, $\mathrm{sd}(\hat\beta_i) \approx \varsigma / \sqrt{T_i}$ so, using the form of the latent t-statistic (2), $\theta_i = \sqrt{T_i}\,(\beta - \beta_0)/\varsigma$. Since the studies are independent, $\Omega_n = I_n$. Finally, suppose the researcher has a prior distribution on $\beta$. His prior on each individual $\theta_i$ is thus the same distribution shifted by $\beta_0$ and scaled by $\sqrt{T_i}/\varsigma$. From the point of view of the researcher, each $\theta_i$ is a shifted and scaled version of the same underlying random variable $\beta$, making $F(\theta_n)$ degenerate.
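As a small illustration of this setup, the snippet below generates the latent t-statistics for the independent-sampling case; the numerical values of beta, beta0, varsigma, and the sample sizes are hypothetical.

```python
import numpy as np

def latent_tstats_sampling(beta, beta0, varsigma, sample_sizes, rng):
    """Latent t-statistics when each study draws an independent sample:
    X*_i ~ N(theta_i, 1) with theta_i = sqrt(T_i) (beta - beta0) / varsigma,
    and Omega_n = I_n because the samples are independent."""
    T = np.asarray(sample_sizes, dtype=float)
    theta = np.sqrt(T) * (beta - beta0) / varsigma
    return theta + rng.standard_normal(T.size)

# Hypothetical values: true effect 0.2, null of 0, sd 1, five studies of 100 obs.
rng = np.random.default_rng(0)
print(latent_tstats_sampling(0.2, 0.0, 1.0, [100] * 5, rng))
```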
Pooling data. Suppose that each successive study simply adds additional data to the previous study to estimate $\beta$, where these additional data are independent and collected from the same underlying population. All of the analysis of the previous section applies to this case with the exception of $\Omega_n = I_n$. Instead, $\omega_{ij}$ is a decreasing function of $|i - j|$. In particular, assume that we can approximate $\hat\beta_i$ by a sample mean:
$$\hat\beta_i \approx \frac{1}{T_i}\sum_{t=1}^{T_i} Y_t$$
for some sequence of independent random variables $Y_1, \dots, Y_{T_i}$. This approximation can be used for instance for standard linear regression estimators. Since the researcher accumulates data when forming each estimate, further assume that $T_i > T_{i-1}$. In this case, we have $\mathrm{cov}(\hat\beta_i, \hat\beta_j) \approx \mathrm{var}(Y_t)/T_j$ for $i \leq j$, so $\omega_{ij} = \sqrt{T_i / T_j}$ for $i \leq j$.
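The closed form $\omega_{ij} = \sqrt{T_i/T_j}$ is easy to verify numerically. The sketch below builds the correlation matrix from hypothetical nested sample sizes and compares it with the empirical correlation of simulated nested sample means.

```python
import numpy as np

def pooling_correlation(sample_sizes):
    """Correlation matrix of the latent t-statistics when study j pools all
    data used in study i <= j: omega_ij = sqrt(T_i / T_j) for i <= j."""
    T = np.asarray(sample_sizes, dtype=float)
    return np.sqrt(np.minimum.outer(T, T) / np.maximum.outer(T, T))

# Simulation check with nested sample means of iid N(0,1) data.
rng = np.random.default_rng(0)
T = [50, 100, 200]
y = rng.standard_normal((20_000, max(T)))
betas = np.column_stack([y[:, :t].mean(axis=1) for t in T])
print(np.corrcoef(betas, rowvar=False).round(3))   # empirical correlations
print(pooling_correlation(T).round(3))             # sqrt(T_i / T_j)
```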
Regression specification. For this example, we assume that the researcher uses ordinary least squares in the standard linear regression model to estimate the effect of interest. In practice, a typical effect of interest corresponds to the population value of a regression coefficient. However, when the researcher uses different regression specifications across different latent studies, the effect of interest is no longer fixed across studies, so (1)–(2) do not apply in general. Nevertheless, a simple generalization of (1)–(2) can be used for this example. More specifically, suppose that in the $i$th study the researcher uses ordinary least squares to estimate a regression coefficient in a regression of $y_i$ on $w_i$ from a set of $T$ independent data points $(y_{i1}, \dots, y_{iT})$ and $(w_{i1}, \dots, w_{iT})$, so
$$\hat\beta_i = \frac{\sum_{t=1}^T w_{it} y_{it}}{\sum_{t=1}^T w_{it}^2}.$$
Here, $w_i$ represents the regressor of interest after it has been projected off of the space spanned by the covariates included in the $i$th regression model, allowing for both different specifications of the regressor of interest and covariates across studies. This framework also allows for different specifications of the dependent variable $y_i$ across latent studies. When using different regression specifications across different studies, the researcher implicitly sets the object of interest in the $i$th study equal to the population regression coefficient
$$\beta_i = \frac{E(w_{it} y_{it})}{E(w_{it}^2)},$$
where $E$ denotes the expectation operator with respect to the true objective probability measure. From the point of view of the researcher, standard assumptions yield the following large-sample distributional approximation: $\hat\beta_i \mid \beta_i \sim N(\beta_i, \mathrm{var}(\hat\beta_i))$, where $\mathrm{var}(\hat\beta_i) = \mathrm{var}(w_{it} y_{it}) / [T\, E(w_{it}^2)^2]$. When allowing for different specifications across studies, (1) must be modified to

(14)  $H_0: \beta_i = \beta_{0i}$ versus $H_1: \beta_i \neq \beta_{0i}$

and (2) must be correspondingly modified to
$$X^*_i = \frac{\hat\beta_i - \beta_{0i}}{\mathrm{se}(\hat\beta_i)}$$
so in large samples $X^*_i \mid \theta_i \sim N(\theta_i, 1)$, where
$$\theta_i = \frac{\sqrt{T}\,(\beta_i - \beta_{0i})}{\mathrm{sd}(w_{it} y_{it}) / E(w_{it}^2)}.$$
In addition,
$$\mathrm{cov}(\hat\beta_i, \hat\beta_j) \approx \frac{\mathrm{cov}(w_{it} y_{it}, w_{jt} y_{jt})}{T\, E(w_{it}^2)\, E(w_{jt}^2)}$$
so $\omega_{ij} = \mathrm{cov}(w_{it} y_{it}, w_{jt} y_{jt}) / [\mathrm{sd}(w_{it} y_{it})\, \mathrm{sd}(w_{jt} y_{jt})]$. In this example, the researcher's prior distribution on $(\beta_1, \dots, \beta_n)$ induces a prior distribution on $\theta_n$ via simple shifting and scaling. It may be reasonable in some examples to assume that the researcher has an identical prior distribution for all $\beta_i$'s, leading to a degenerate $F(\theta_n)$.
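A minimal sketch of the mapping from population moments to $(\theta_i, \omega_{ij})$ in this example; all moment values in the usage lines are hypothetical.

```python
import numpy as np

def regression_spec_params(E_w2, sd_wy, cov_wy, betas, beta0s, T):
    """theta_i and omega_ij when latent studies use different OLS specifications:
    theta_i  = sqrt(T) (beta_i - beta0_i) E(w_it^2) / sd(w_it y_it),
    omega_ij = cov(w_it y_it, w_jt y_jt) / [sd(w_it y_it) sd(w_jt y_jt)]."""
    E_w2, sd_wy = np.asarray(E_w2, float), np.asarray(sd_wy, float)
    theta = np.sqrt(T) * (np.asarray(betas) - np.asarray(beta0s)) * E_w2 / sd_wy
    omega = np.asarray(cov_wy, float) / np.outer(sd_wy, sd_wy)
    return theta, omega

# Hypothetical moments for two specifications of the same regression.
theta, omega = regression_spec_params(
    E_w2=[1.0, 0.8], sd_wy=[1.0, 0.9], cov_wy=[[1.0, 0.6], [0.6, 0.81]],
    betas=[0.30, 0.25], beta0s=[0.0, 0.0], T=400)
print(theta, omega, sep="\n")
```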
Instrument selection. By modifying some of the definitions in the previous example, we can also cover the case for which the researcher uses two-stage least squares in a standard linear regression model to estimate the effect of interest. Assuming that the instruments are both strong and valid, we can simply modify the definition of $w_i$ to equal the regressor of interest after (i) all regressors have been projected onto the space spanned by the instruments used in the $i$th study and then (ii) the resulting regressor of interest has been projected off of the space spanned by the covariates included in the $i$th regression model. In the special case for which the researcher uses the same dependent variable and covariates across all latent studies and only changes the set of instruments used across studies, if the regression model is correctly specified, (1)–(2) continue to hold since each $\beta_i$ will simply equal the true regression coefficient.

For the general case, we allow the researcher to learn about the true value of the parameter of interest in a potentially limited fashion. The researcher's subjective distribution over $\theta_n$ may correspond to his prior $F(\theta_n)$, a fully updated posterior $F(\theta_n \mid X^*_{n-1})$, or some other function of the data such as a convex combination of his prior and the posterior distribution. We will simply use $F(\theta_n; X^*_{n-1})$ to denote the researcher's subjective distribution after $n - 1$ studies, and write the expected profit from the $n$th study (3) as

(15)  $E\big(\pi(X_n, X_{n-1}; z, v, c) \mid X^*_{n-1}\big) = v \int P\big(|X^*_{n|n-1}| > z \mid \theta_n\big)\, dF(\theta_n; X^*_{n-1})\, 1(X_{n-1} < z) - c(n) = v \int \big[1 - \Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big) + \Phi\big((-z - \theta_{n|n-1})/\omega_{n|n-1}\big)\big]\, dF(\theta_n; X^*_{n-1})\, 1(X_{n-1} < z) - c(n),$

where $X^*_{n|n-1} \sim X^*_n \mid X^*_{n-1}$ such that $X^*_{n|n-1} \sim N(\theta_{n|n-1}, \omega^2_{n|n-1})$ with
$$\theta_{n|n-1} \equiv \theta_n + \Omega_{n,21}\,\Omega^{-1}_{n-1}\,[X^*_{n-1} - \theta_{n-1}], \qquad \omega^2_{n|n-1} \equiv 1 - \Omega_{n,21}\,\Omega^{-1}_{n-1}\,\Omega_{n,12}, \qquad \Omega_n \equiv \begin{bmatrix} \Omega_{n-1} & \Omega_{n,12} \\ \Omega_{n,21} & 1 \end{bmatrix}.$$
Finally, the ICCV $z^*$ is defined in (4) with $N(z)$ being equal to the largest value of $n$ such that $X_{n-1} < z$ and

(16)  $\int \big[1 - \Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big) + \Phi\big((-z - \theta_{n|n-1})/\omega_{n|n-1}\big)\big]\, dF(\theta_n; X^*_{n-1}) \geq c(n)/v.$

Since we know the distribution of $X^*_{n-1}$ under $H_0$, namely $N(0, \Omega_{n-1})$, $z^*$, when it exists, remains feasible to compute in general: since $\Omega_n$ is known for all $n \geq 1$, it is straightforward to compute $P_{H_0}(X_n > z)$ for any given $n$, $z$ by Monte Carlo simulation.
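The Monte Carlo computation mentioned here is straightforward to implement. A sketch, assuming $\Omega_n$ is positive definite so a Cholesky factor exists; the example matrix uses the pooling correlation $\omega_{ij} = \sqrt{i/j}$ derived in section 7.

```python
import numpy as np

def prob_reject_by_n(z, Omega_n, reps=200_000, seed=0):
    """P_H0(X_n > z) with X_n = max_i |X*_i| and X*_n ~ N(0, Omega_n),
    estimated by Monte Carlo; Omega_n must be positive definite."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.asarray(Omega_n))
    draws = rng.standard_normal((reps, L.shape[0])) @ L.T   # rows ~ N(0, Omega_n)
    return np.mean(np.abs(draws).max(axis=1) > z)

# Example: pooling correlation omega_ij = sqrt(i/j) with n = 5 latent studies.
i = np.arange(1.0, 6.0)
Omega = np.sqrt(np.minimum.outer(i, i) / np.maximum.outer(i, i))
print(prob_reject_by_n(1.96, Omega))
```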
General results with a normal prior. If the researcher's prior distribution is conjugate with the normal distribution, we can provide an analytical expression for his posterior distribution. In this subsection, we assume that the researcher's prior over $\theta_n$ follows a normal distribution. This special case is interesting because it allows us to characterize $N(z)$ and therefore the ICCV defined in (4) more explicitly. Moreover, a normal prior may be natural in many contexts. For example, if the researcher's prior is based upon averaging estimation results from previous studies, a central limit theorem argument leads to an approximate normal distribution for such an average.
Suppose that the researcher's prior on $\theta_n$ follows a $N(\mu_n, \Sigma_n)$ distribution for all $n \geq 1$. Then $F(\theta_n \mid X^*_{n-1})$ is equal to the distribution function of a multivariate normal distribution with mean
$$\mu_{n|n-1} = \left(\widetilde\Omega^-_n + \Sigma^{-1}_n\right)^{-1}\left(\widetilde\Omega^-_n X^*_n + \Sigma^{-1}_n \mu_n\right)$$
and covariance matrix $\Sigma_{n|n-1} \equiv \left(\widetilde\Omega^-_n + \Sigma^{-1}_n\right)^{-1}$, where
$$\widetilde\Omega^-_n = \begin{bmatrix} \Omega^{-1}_{n-1} & 0 \\ 0 & 0 \end{bmatrix}.$$

By the definition of $\widetilde\Omega^-_n$,
$$\widetilde\Omega^-_n X^*_n = \begin{bmatrix} \Omega^{-1}_{n-1} X^*_{n-1} \\ 0 \end{bmatrix}$$
so this distribution only depends upon the realization of the first $n - 1$ latent t-statistics, $X^*_{n-1}$. Partial updating can then be modeled as $F(\theta_n; X^*_{n-1}) = \alpha F(\theta_n) + (1 - \alpha) F(\theta_n \mid X^*_{n-1})$, where $\alpha \in [0, 1]$ is a parameter that indexes the sophistication of the researcher, with $\alpha = 1$ corresponding to a researcher who does not update his beliefs, $\alpha = 0$ corresponding to a researcher who fully updates his beliefs, and $\alpha \in (0, 1)$ corresponding to a partially sophisticated researcher who only partially updates his beliefs. The results of proposition 8 can also be applied to simplify the computation of $N(z)$ according to (16) in this case since, for example,
$$\int \Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big)\, dF(\theta_n; X^*_{n-1}) = \alpha \int \Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big)\, dF(\theta_n) + (1 - \alpha) \int \Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big)\, dF(\theta_n \mid X^*_{n-1}).$$
The integral inside the first term is the expected value of $\Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big)$ when $\theta_n$ follows a $N(\mu_n, \Sigma_n)$ distribution, and the integral inside the second term is the expected value of $\Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big)$ when $\theta_n$ follows a $N(\mu_{n|n-1}, \Sigma_{n|n-1})$ distribution.
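Proposition 8's posterior is a one-line computation in matrix form. A minimal sketch with a hypothetical function name and illustrative inputs:

```python
import numpy as np

def posterior_prop8(X_star_prev, Omega_prev, mu_n, Sigma_n):
    """Posterior mean and covariance of theta_n from proposition 8, given the
    first n-1 latent t-statistics, their covariance Omega_{n-1}, and a
    N(mu_n, Sigma_n) prior on theta_n."""
    n = mu_n.size
    Om_tilde = np.zeros((n, n))                      # \tilde Omega_n^-
    Om_tilde[:n - 1, :n - 1] = np.linalg.inv(Omega_prev)
    x_pad = np.append(X_star_prev, 0.0)              # last entry never enters
    Sigma_inv = np.linalg.inv(Sigma_n)
    Sigma_post = np.linalg.inv(Om_tilde + Sigma_inv)
    mu_post = Sigma_post @ (Om_tilde @ x_pad + Sigma_inv @ mu_n)
    return mu_post, Sigma_post

# Illustrative inputs: two observed t-statistics, independent, standard normal prior.
mu, Sig = posterior_prop8(np.array([1.5, 0.8]), np.eye(2), np.zeros(3), np.eye(3))
print(mu)   # shrinks [1.5, 0.8] toward the prior mean; last coordinate stays at 0
```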
9. Conclusion
Statistical hypothesis testing is a key tool for scientific investigation and discovery. It is used to evaluate existing paradigms, and to assess the effectiveness of new medical treatments, public policies, and other potential remedies to real-world problems.

[Figure 4: x-axis, critical value (1.96–2.9); y-axis, number of experiments (1–6); series: sampling data with learning, pooling data, sampling increasingly costly data; ICCVs marked.]

Figure 4.
Number of experiments for different CVs
The figure displays the average number of experiments conducted by researchers for different CVs (and in particular at the ICCV). The results for the case "sampling increasingly costly data" come from the simulations in figure 1. The results for the case "sampling data with learning" come from the simulations in figure 2. The results for the case "pooling data" come from the simulations in figure 3. In all the simulations, we set parameter values as in table 1.

To be informative, hypothesis testing relies on properly controlling size—the probability of rejecting a true null hypothesis. A major issue in modern science, however, is that test sizes in published test results are systematically distorted because scientists are given the incentive to continue to conduct studies until they are able to reject the null hypothesis they are investigating. This is because rejecting a null hypothesis is often considered more interesting from a scientific perspective, and therefore is often required for publication. As a result, a true null hypothesis is much more likely to be rejected than what is advertised in scientific publications.

To correct these size distortions, we construct CVs that are compatible with the incentives faced by scientists and therefore deliver the promised nominal test size. To construct ICCVs, we model the strategic behavior of researchers. Researchers face costs and benefits from collecting data, and these incentives determine how many studies are conducted. Once an ICCV is in place, researchers may conduct several studies to be able to publish their work; nevertheless, readers can be confident that true null hypotheses are not rejected more often than a prespecified nominal level. Our CVs are larger than standard ones. For example, for two-sided t-tests with a 5% significance level, we find that ICCVs are not 1.96 but between 1.96 and 3.4 in experimental settings across a wide range of researcher behavior. The exact values of the ICCVs depend on the costs of research, rewards from rejecting the null hypothesis, and the researcher's prior beliefs. Imposing the upper bound of 3.4 in this range will control size across all configurations we examined.

Our approach to correcting size distortions in hypothesis tests is to construct CVs that take into account researchers' behavior. Another approach is to constrain researchers' behavior by asking researchers to register their experiments and analysis plans in advance (Christensen, Freese, and Miguel 2019, part 3). These two strategies result in a very different research process. Under ICCVs, researchers pool about 2 studies on average, or sample between 2 and 3 studies, depending on the learning process and cost structure (figure 4). On the other hand, with preregistration, researchers should only conduct one study and report those results. Each approach may be more appropriate in different settings. Ours could be more appropriate for observational data and in the early stage of a research question, when scientific exploration plays a key role. The preregistration approach could be better suited to experimental data and later stages of a research question, when the research question is well understood and delineated, and it is important to obtain precise estimates of the parameters of interest. Moving forward, it would be useful to obtain more information on the parameters of our model to improve the calibration of the ICCVs, and possibly adapt them to specific research methodologies. For example, it would be useful to elicit costs of research methodologies and researchers' prior beliefs.
Indeed, ongoing work on researcher prior elicitation such as that summarized in DellaVigna, Pope, and Vivalt (2019) could prove useful for computing more refined ICCVs than the 1.96–3.4 range found here.

References
Andrews, Isaiah, and Maximilian Kasy. 2019. "Identification of and Correction for Publication Bias." American Economic Review 109 (8): 2766–2794.
Begg, Colin B., and Jesse A. Berlin. 1988. "Publication Bias: A Problem in Interpreting Medical Data." Journal of the Royal Statistical Society (Series A) 151 (3): 419–445.
Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E.J. Wagenmakers, Richard Berk, Kenneth A. Bollen, Bjorn Brembs, Lawrence Brown, Colin Camerer et al. 2018. "Redefine Statistical Significance." Nature Human Behaviour 2 (1): 6–10.
Bozarth, Jerold D., and Ralph R. Roberts. 1972. "Signifying Significant Significance." American Psychologist 27 (8): 774–775.
Brodeur, Abel, Mathias Le, Marc Sangnier, and Yanos Zylberberg. 2016. "Star Wars: The Empirics Strike Back." American Economic Journal: Applied Economics 8 (1): 1–32.
Card, David, and Stefano DellaVigna. 2013. "Nine Facts about Top Journals in Economics." Journal of Economic Literature 51 (1): 144–161.
Christensen, Garret, Jeremy Freese, and Edward Miguel. 2019. Transparent and Reproducible Social Science Research: How to Do Open Science. Oakland, CA: University of California Press.
Coffman, Lucas C., and Muriel Niederle. 2015. "Pre-Analysis Plans Have Limited Upside, Especially Where Replications Are Feasible." Journal of Economic Perspectives 29 (3): 81–98.
Csada, Ryan D., Paul C. James, and Richard H. M. Espie. 1996. "The 'File Drawer Problem' of Non-Significant Results: Does It Apply to Biological Research?" Oikos 76 (3): 591–593.
Dal Bo, Pedro. 2005. "Cooperation under the Shadow of the Future: Experimental Evidence from Infinitely Repeated Games." American Economic Review 95 (5): 1591–1604.
DellaVigna, Stefano, Devin Pope, and Eva Vivalt. 2019. "Predict Science to Improve Science." Science 366 (6464): 428–429.
Dwan, Kerry, Douglas G. Altman, Juan A. Arnaiz, Jill Bloom, An-Wen Chan, Eugenia Cronin et al. 2008. "Systematic Review of the Empirical Evidence of Study Publication Bias and Outcome Reporting Bias." PLoS ONE 3 (8): e3081.
Fanelli, Daniele, Rodrigo Costas, and John P.A. Ioannidis. 2017. "Meta-Assessment of Bias in Science." Proceedings of the National Academy of Sciences 114 (14): 3714–3719.
Franco, Annie, Neil Malhotra, and Gabor Simonovits. 2014. "Publication Bias in the Social Sciences: Unlocking the File Drawer." Science 345 (6203): 1502–1505.
Gelman, Andrew, and Eric Loken. 2014. "The Statistical Crisis in Science." American Scientist 102 (6): 460–465.
Gibson, John, David L. Anderson, and John Tressler. 2014. "Which Journal Rankings Best Explain Academic Salaries? Evidence from the University of California." Economic Inquiry 52 (4): 1322–1340.
Gibson, John, David L. Anderson, and John Tressler. 2017. "Citations or Journal Quality: Which Is Rewarded More in the Academic Labor Market?" Economic Inquiry 55 (4): 1945–1965.
Glaeser, Edward L. 2006. "Researcher Incentives and Empirical Methods." NBER Technical Working Paper 329.
Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. "The Extent and Consequences of P-Hacking in Science." PLoS Biology 13 (3): e1002106.
Howard, Steven R., Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. 2019. "Uniform, Nonparametric, Non-Asymptotic Confidence Sequences." arXiv:1810.08240.
Ioannidis, John P.A., and Thomas A. Trikalinos. 2007. "An Exploratory Test for an Excess of Significant Findings." Clinical Trials 4 (3): 245–253.
Jennions, Michael D., and Anders P. Moeller. 2002. "Publication Bias in Ecology and Evolution: An Empirical Assessment Using the 'Trim and Fill' Method." Biological Reviews 77 (2): 211–222.
Johari, Ramesh, Leo Pekelis, and David J. Walsh. 2019. "Always Valid Inference: Continuous Monitoring of A/B Tests." arXiv:1512.04922.
Lau, Joseph, John P.A. Ioannidis, and Christopher H. Schmid. 1998. "Summing Up Evidence: One Answer Is Not Always Enough." Lancet 351 (9096): 123–127.
Lovell, Michael C. 1983. "Data Mining." Review of Economics and Statistics 65 (1): 1–12.
McCrary, Justin, Garret Christensen, and Daniele Fanelli. 2016. "Conservative Tests under Satisficing Models of Publication Bias." PLoS ONE 11 (2): e0149590.
Miguel, Edward, Colin Camerer, Katherine Casey, Joshua Cohen, Kevin M. Esterling, Alan Gerber, Rachel Glennerster, Don P. Green, Macartan Humphreys, Guido Imbens et al. 2014. "Promoting Transparency in Social Science Research." Science 343 (6166): 30–31.
Murnighan, J. Keith, and Alvin E. Roth. 1983. "Expecting Continued Play in Prisoner's Dilemma Games: A Test of Several Models." Journal of Conflict Resolution 27 (2): 279–300.
Nosek, Brian A., Jeffrey R. Spies, and Matt Motyl. 2012. "Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability." Perspectives on Psychological Science 7 (6): 615–631.
Olken, Benjamin A. 2015. "Promises and Perils of Pre-Analysis Plans." Journal of Economic Perspectives 29 (3): 61–80.
Rosenthal, Robert. 1979. "The File Drawer Problem and Tolerance for Null Results." Psychological Bulletin 86 (3): 638–641.
Roth, Alvin E., and J. Keith Murnighan. 1978. "Equilibrium Behavior and Repeated Play of the Prisoner's Dilemma." Journal of Mathematical Psychology 17 (2): 189–198.
Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. 2011. "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science 22 (11): 1359–1366.
Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. "P-Curve: A Key to the File-Drawer." Journal of Experimental Psychology: General 143 (2): 534–547.
Song, F., A. J. Eastwood, S. Gilbody, L. Duley, and A. J. Sutton. 2000. "Publication and Related Biases: A Review." Health Technology Assessment 4 (10): 1–115.
Stanley, T. D. 2005. "Beyond Publication Bias." Journal of Economic Surveys 19 (3): 309–345.
Sterling, Theodore D. 1959. "Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance—Or Vice Versa." Journal of the American Statistical Association 54 (285): 30–34.
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. "Moving to a World Beyond 'p < 0.05'." American Statistician 73 (S1): 1–19.
Zheng, Yuqing, and Harry M. Kaiser. 2016. "Submission Demand in Core Economics Journals: A Panel Study." Economic Inquiry 54 (2): 1319–1338.
Appendix A. Proofs
A.1. Proof of proposition 1
After conducting $n - 1$ studies without a rejection, so that $X_{n-1} < z$, the researcher chooses to conduct the next study if and only if (i) the expected direct profit from the next study is positive or (ii) the expected direct profit from the next study is negative but the expected direct profit from later studies offsets the expected loss from conducting the next study. That is, if and only if (i) $v P(|X^*_n| > z \mid X^*_{n-1}) - c(n) \geq 0$ or (ii) $v P(|X^*_n| > z \mid X^*_{n-1}) - c(n) < 0$ but $\sum_{j \in J} \{v P(|X^*_{n+j}| > z \mid X^*_{n-1}) - c(n+j)\} \geq 0$ for some $J \subset \{1, 2, 3, \dots\}$. However, since $v P(|X^*_n| > z \mid X^*_{n-1}) - c(n) \geq v P(|X^*_{n+j}| > z \mid X^*_{n-1}) - c(n+j)$ for all $j > 0$, if $v P(|X^*_n| > z \mid X^*_{n-1}) - c(n) < 0$, then $v P(|X^*_{n+j}| > z \mid X^*_{n-1}) - c(n+j) < 0$ for all $j > 0$, so $\sum_{j \in J} \{v P(|X^*_{n+j}| > z \mid X^*_{n-1}) - c(n+j)\} < 0$ for any $J \subset \{1, 2, 3, \dots\}$ and case (ii) cannot occur.

A.2. Proof of proposition 2
To show (i), note that for any $v$ and $c(1)$, there exists $\bar z > 0$ such that $P(|X^*_n| > z \mid X^*_{n-1}) < c(1)/v$ for all $z \geq \bar z$. Thus by proposition 1, $N(z) = 0$ for all $z \geq \bar z$ and therefore $P_{H_0}(X_{N(z)} > z) = P_{H_0}(0 > z) = 0$ for all $z \geq \bar z$.

To show (ii), note that the test using $z^*$ yields non-zero power if and only if $P_{H_1}(X_{\tilde N(z^*)} \geq z^*) > 0$, where $P_{H_1}$ denotes the objective probability measure for some true value of $\theta$ under the alternative hypothesis. Since $X_n$ is a continuous random variable with support on the entire positive real line for all $n > 0$ and any value of $\theta$ consistent with $H_1$ (as well as $H_0$), this implies that the test using $z^*$ yields non-zero power if and only if $N(z^*) > 0$. Finally, $N(z^*) > 0$ if and only if $P(|X^*_1| > z^*) \geq c(1)/v$.

A.3. Proof of proposition 4
For each case, $N(z)$ is equal to the largest positive integer $n$ such that $X_{n-1} < z$ and (8) holds. In case (i),
$$\int [\Phi(z - \theta) - \Phi(-z - \theta)]\, dF(\theta) = \Phi(z - \tilde\theta) - \Phi(-z - \tilde\theta)$$
so (8) is equal to the condition given in (i). In case (ii),
$$\int [\Phi(z - \theta) - \Phi(-z - \theta)]\, dF(\theta) = (b - a)^{-1} \int_a^b [\Phi(z - \theta) - \Phi(-z - \theta)]\, d\theta = (b - a)^{-1} \big[\phi(z - a) - \phi(z - b) + (b - z)\Phi(z - b) + (z - a)\Phi(z - a) + \phi(-z - b) - \phi(-z - a) - (b + z)\Phi(-z - b) + (a + z)\Phi(-z - a)\big]$$
so (8) is equal to the condition given in (ii). In case (iii),
$$\int [\Phi(z - \theta) - \Phi(-z - \theta)]\, dF(\theta) = \sigma^{-1} \int_{-\infty}^{\infty} [\Phi(z - \theta) - \Phi(-z - \theta)]\, \phi((\theta - \mu)/\sigma)\, d\theta = \int_{-\infty}^{\infty} [\Phi(z - \mu - \sigma x) - \Phi(-z - \mu - \sigma x)]\, \phi(x)\, dx = \Phi\big((z - \mu)/\sqrt{1 + \sigma^2}\big) - \Phi\big((-z - \mu)/\sqrt{1 + \sigma^2}\big)$$
so (8) is equal to the condition given in (iii).
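The closed forms in cases (ii) and (iii) can be checked against numerical integration; the values of z, a, b, mu, and sigma below are arbitrary.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

z, a, b, mu, sigma = 1.96, -0.5, 1.5, 0.3, 0.8

# Case (ii): uniform prior on [a, b].
lhs = quad(lambda t: (norm.cdf(z - t) - norm.cdf(-z - t)) / (b - a), a, b)[0]
rhs = (norm.pdf(z - a) - norm.pdf(z - b) + (b - z) * norm.cdf(z - b)
       + (z - a) * norm.cdf(z - a) + norm.pdf(-z - b) - norm.pdf(-z - a)
       - (b + z) * norm.cdf(-z - b) + (a + z) * norm.cdf(-z - a)) / (b - a)
print(lhs, rhs)   # should agree

# Case (iii): N(mu, sigma^2) prior.
lhs = quad(lambda t: (norm.cdf(z - t) - norm.cdf(-z - t))
           * norm.pdf((t - mu) / sigma) / sigma, -np.inf, np.inf)[0]
rhs = (norm.cdf((z - mu) / np.sqrt(1 + sigma**2))
       - norm.cdf((-z - mu) / np.sqrt(1 + sigma**2)))
print(lhs, rhs)   # should agree
```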
A.4. Proof of proposition 5
Standard posterior contraction results imply that
$$\int [1 - \Phi(z - \theta) + \Phi(-z - \theta)]\, dF(\theta \mid X^*_{n-1}) \overset{p}{\longrightarrow} 1 - \Phi(z - \theta^*) + \Phi(-z - \theta^*)$$
as $n \to \infty$, so the researcher must eventually stop conducting studies even if he never attains rejection if $1 - \Phi(z - \theta^*) + \Phi(-z - \theta^*) < c/v$. Since $\Phi$ is strictly increasing with $\lim_{x \to -\infty} \Phi(x) = 0$ and $\lim_{x \to \infty} \Phi(x) = 1$, this implies the statement of the proposition.
A.5. Proof of proposition 6
Using our knowledge of the distribution of $X^*_n$, we have
$$f(X^*_n \mid \theta) = (2\pi)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n (X^*_i - \theta)^2\right), \qquad f(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(\theta - \mu)^2\right)$$
so that
$$\int f(X^*_n \mid \theta) f(\theta)\, d\theta = \frac{1}{\sqrt{(2\pi)^{n+1}}\,\sigma} \int \exp\left(-\frac{1}{2}\left[\sum_{i=1}^n (\theta - X^*_i)^2 + \frac{1}{\sigma^2}(\theta - \mu)^2\right]\right) d\theta = \frac{1}{\sqrt{(2\pi)^{n+1}}\,\sigma} \int \exp\left(-\frac{1}{2}\left(n + \frac{1}{\sigma^2}\right)(\theta - h)^2\right) d\theta\, \exp(k) = \frac{1}{\sqrt{(2\pi)^n (n\sigma^2 + 1)}} \exp(k),$$
where
$$h = \frac{\sigma^2 \sum_{i=1}^n X^*_i + \mu}{n\sigma^2 + 1}, \qquad k = -\frac{1}{2}\left(\sum_{i=1}^n X^{*2}_i + \frac{\mu^2}{\sigma^2}\right) + \frac{\left(\sum_{i=1}^n X^*_i + \mu/\sigma^2\right)^2}{2\left(n + 1/\sigma^2\right)}.$$
Thus,
$$f(\theta \mid X^*_n) = \frac{f(X^*_n \mid \theta) f(\theta)}{\int f(X^*_n \mid \theta) f(\theta)\, d\theta} = \frac{1}{\sqrt{2\pi\,\sigma^2_{n+1|n}}} \exp\left(-\frac{(\theta - \mu_{n+1|n})^2}{2\sigma^2_{n+1|n}}\right).$$
This means that $\theta \mid X^*_n$ has a normal distribution with mean $\mu_{n+1|n}$ and variance $\sigma^2_{n+1|n}$. The statement of the proposition then follows from analogous arguments to those made in the proof of proposition 4(iii).

A.6. Proof of proposition 7

If $H_0$ holds so $\theta^* = 0$,
$$P_{H_0}\left(1 - \Phi\left(\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right) + \Phi\left(-\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right) < \frac{c}{v}\right)$$
is positive for all $n$ and any finite $\theta$ and $c \leq v$. This implies that for any standard prior distribution $F(\theta)$, if $H_0$ is true, the probability that the researcher stops conducting additional studies is positive at every stage. Moreover,
$$P_{H_0}\left(1 - \Phi\left(\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right) + \Phi\left(-\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right) < \frac{c}{v}\right) \geq P_{H_0}\left(1 - \Phi\left(\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right) < \frac{c}{2v},\ \Phi\left(-\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right) < \frac{c}{2v}\right) = P_{H_0}\left(\frac{-\sqrt{n}\,z - \theta - \Phi^{-1}(c/2v)}{\sqrt{n-1}} < X^*_{n-1} < \frac{\sqrt{n}\,z - \theta - \Phi^{-1}(1 - c/2v)}{\sqrt{n-1}}\right) = \Phi\left(\frac{\sqrt{n}\,z - \theta - \Phi^{-1}(1 - c/2v)}{\sqrt{n-1}}\right) - \Phi\left(\frac{-\sqrt{n}\,z - \theta - \Phi^{-1}(c/2v)}{\sqrt{n-1}}\right) \to \Phi(z) - \Phi(-z) > 0$$
as $n \to \infty$, so even as the researcher accumulates more data, the probability that the researcher continues to conduct more studies remains bounded away from one. Thus, the statement of the proposition holds.
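The conjugate posterior derived in the proof of proposition 6 can be verified against a brute-force discretization; all numerical values below are arbitrary.

```python
import numpy as np

# Check of the proof of proposition 6: with X*_i | theta iid N(theta, 1) and a
# N(mu, sigma^2) prior, theta | X*_n is N(mu_{n+1|n}, sigma^2_{n+1|n}).
rng = np.random.default_rng(0)
mu, sigma, theta_true, n = 0.5, 1.2, 1.0, 8          # arbitrary values
x = theta_true + rng.standard_normal(n)

mu_post = (sigma**2 * x.sum() + mu) / (n * sigma**2 + 1)
var_post = sigma**2 / (n * sigma**2 + 1)

# Brute-force discretized posterior on a grid.
grid = np.linspace(-5, 5, 20_001)
logpost = -0.5 * ((x[:, None] - grid)**2).sum(0) - (grid - mu)**2 / (2 * sigma**2)
w = np.exp(logpost - logpost.max())
w /= w.sum()
m = (w * grid).sum()
print(mu_post, m)                           # posterior means agree
print(var_post, (w * (grid - m)**2).sum())  # posterior variances agree
```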
A.7. Proof of proposition 8

If the researcher's prior distribution $F(\theta_n)$ admits a pdf $f(\theta_n)$, for $n \geq 2$ the posterior pdf for $\theta_n$ takes the form

(A1)  $f(\theta_n \mid X^*_{n-1}) = \dfrac{f(X^*_{n-1} \mid \theta_n)\, f(\theta_n)}{\int f(X^*_{n-1} \mid \theta_n)\, f(\theta_n)\, d\theta_n},$

where the likelihood $f(X^*_{n-1} \mid \theta_n)$ is the pdf of a $N(\theta_{n-1}, \Omega_{n-1})$ distribution evaluated at $X^*_{n-1}$. As a notational convention, for $n = 1$, $f(\theta_n \mid X^*_{n-1})$ is equal to $f(\theta_n)$. Given that the researcher's prior on $\theta_n$ follows a $N(\mu_n, \Sigma_n)$ distribution for all $n \geq 1$,

(A2)  $f(X^*_{n-1} \mid \theta_n) = \left((2\pi)^{n-1} |\Omega_{n-1}|\right)^{-1/2} \exp\left(-\frac{1}{2}(X^*_{n-1} - \theta_{n-1})' \Omega^{-1}_{n-1} (X^*_{n-1} - \theta_{n-1})\right)$

(A3)  $f(\theta_n) = \left((2\pi)^n |\Sigma_n|\right)^{-1/2} \exp\left(-\frac{1}{2}(\theta_n - \mu_n)' \Sigma^{-1}_n (\theta_n - \mu_n)\right)$

so for
$$\widetilde\Omega^-_n = \begin{bmatrix} \Omega^{-1}_{n-1} & 0 \\ 0 & 0 \end{bmatrix}, \quad A = -\frac{1}{2}\left(\widetilde\Omega^-_n + \Sigma^{-1}_n\right), \quad b = \widetilde\Omega^-_n X^*_n + \Sigma^{-1}_n \mu_n, \quad c = -\frac{1}{2}\left(X^{*\prime}_n \widetilde\Omega^-_n X^*_n + \mu'_n \Sigma^{-1}_n \mu_n\right), \quad h = -A^{-1} b / 2, \quad k = c - b' A^{-1} b / 4,$$
we have
$$\int f(X^*_{n-1} \mid \theta_n) f(\theta_n)\, d\theta_n = \left((2\pi)^{2n-1} |\Omega_{n-1}| |\Sigma_n|\right)^{-1/2} \int \exp\left(-\frac{1}{2}\left[(\theta_n - X^*_n)' \widetilde\Omega^-_n (\theta_n - X^*_n) + (\theta_n - \mu_n)' \Sigma^{-1}_n (\theta_n - \mu_n)\right]\right) d\theta_n = \left((2\pi)^{2n-1} |\Omega_{n-1}| |\Sigma_n|\right)^{-1/2} \int \exp\left([\theta_n - h]' A [\theta_n - h]\right) d\theta_n\, \exp(k) = \left((2\pi)^{n-1} |\widetilde\Omega^-_n \Sigma_n + I_n| |\Omega_{n-1}|\right)^{-1/2} \exp(k).$$
Thus, (A1)–(A3) imply
$$f(\theta_n \mid X^*_{n-1}) = \left((2\pi)^n \left|\left[\widetilde\Omega^-_n + \Sigma^{-1}_n\right]^{-1}\right|\right)^{-1/2} \exp\left(-\frac{1}{2}(\theta_n - \mu_{n|n-1})' \left[\widetilde\Omega^-_n + \Sigma^{-1}_n\right] (\theta_n - \mu_{n|n-1})\right).$$
This implies that the statement of the proposition holds.
Appendix B. One-sided hypothesis testing
The main text focuses on two-sided hypothesis tests. This appendix presents analogous results for upper one-sided hypotheses

(A4)  $H_0: \beta = \beta_0$ versus $H_1: \beta > \beta_0$.

This analysis also covers lower one-sided hypotheses for which the alternative is instead $H_1: \beta < \beta_0$ by simply redefining the parameter of interest $\beta$ to be equal to its negative value. We state results briefly, omitting discussions and derivations since they are analogous to those for the two-sided problem.

For the one-sided hypotheses (A4), redefine $X_n = \max\{X^*_1, \dots, X^*_n\}$. Replace (3) with
$$E\big(\pi(X_n, X_{n-1}; z, v, c) \mid X^*_{n-1}\big) = v P\big(X_n > z \mid X^*_{n-1}\big)\, 1(X_{n-1} < z) - c(n) = v P\big(\max\{X^*_n, X_{n-1}\} > z \mid X^*_{n-1}\big)\, 1(X_{n-1} < z) - c(n) = v P\big(X^*_n > z \mid X^*_{n-1}\big)\, 1(X_{n-1} < z) - c(n).$$
Proposition 1 should be modified to the following.
Proposition A1.
Suppose $v P(X^*_n > z \mid X^*_{n-1}) - c(n) \geq v P(X^*_{n+j} > z \mid X^*_{n-1}) - c(n+j)$ for all $j > 0$. Then the researcher chooses to conduct the next study if and only if $X_{n-1} < z$ and $v P(X^*_n > z \mid X^*_{n-1}) - c(n) \geq 0$.

Proposition 2 should be modified to the following.
Proposition A2.
Suppose $v P(X^*_n > z \mid X^*_{n-1}) - c(n) \geq v P(X^*_{n+j} > z \mid X^*_{n-1}) - c(n+j)$ for all $j > 0$ and $c(1) > 0$. The ICCV $z^*$ defined in (4), (i) exists and (ii) yields a test with non-zero power if and only if $P(X^*_1 > z^*) \geq c(1)/v$.

Replace display (5) with $H_0: \theta = 0$ versus $H_1: \theta > 0$, display (6) with
$$E\big(\pi(X_n, X_{n-1}; z, v, c) \mid X^*_{n-1}\big) = v \int P\big(X^*_n > z \mid X^*_{n-1}, \theta\big)\, dF(\theta)\, 1(X_{n-1} < z) - c = v \int [1 - \Phi(z - \theta)]\, dF(\theta)\, 1(X_{n-1} < z) - c,$$
display (7) with $\int [1 - \Phi(z - \theta)]\, dF(\theta) \geq c/v$, display (8) with $\int [1 - \Phi(z - \theta)]\, dF(\theta) \geq c(n)/v$, and display (9) with $\int [1 - \Phi(z - \theta)]\, dF(\theta) < c(1)/v$. The statement immediately following proposition 3 should be replaced by the following: "Since $X_n$ is the maximum of $n$ mean-zero iid normal variables under $H_0$, $P_{H_0}(X_n < z) = \Phi(z)^n$ for $z \geq 0$, so the ICCV is a value $z^*$ such that $1 - \Phi(z^*)^{N(z^*)} \leq \alpha$."

Proposition 4 should be modified to the following:

Proposition A3.
In the increasing cost setting of this section, $N(z)$ is equal to the maximum positive integer $n$ for which $X_{n-1} < z$ and

1. if $dF(\theta) = \delta(\theta - \tilde\theta)\, d\theta$ for some $\tilde\theta \in \mathbb{R}$, where $\delta$ denotes the Dirac delta function,
$$n \leq c^{-1}\left(v\left[1 - \Phi(z - \tilde\theta)\right]\right);$$

2. if $dF(\theta) = (b - a)^{-1} 1(a \leq \theta \leq b)\, d\theta$ for some $a < b$,
$$n \leq c^{-1}\left(v\left[1 - (b - a)^{-1}\left\{\phi(z - a) - \phi(z - b) + (b - z)\Phi(z - b) + (z - a)\Phi(z - a)\right\}\right]\right);$$

3. if $dF(\theta) = \sigma^{-1} \phi((\theta - \mu)/\sigma)\, d\theta$ for some $\mu \in \mathbb{R}$ and $\sigma > 0$,
$$n \leq c^{-1}\left(v\left[1 - \Phi\left((z - \mu)/\sqrt{1 + \sigma^2}\right)\right]\right).$$

Replace display (11) with

(A5)  $\int [1 - \Phi(z - \theta)]\, dF(\theta \mid X^*_{n-1}) \geq c/v.$

In the paragraph preceding proposition 5, $|X^*_{n-1}|$ should be replaced with $X^*_{n-1}$. Proposition 5 should be modified to the following:

Proposition A4.
Let $\theta^*$ denote the true value of $\theta$. For any CV $z$ large enough such that $1 - \Phi(z - \theta^*) < c/v$, the researcher will eventually stop conducting studies even if he never attains rejection.

Proposition 6 should be modified to the following:
Proposition A5.
In the learning while sampling setting of this section, if $dF(\theta) = \sigma^{-1}\phi((\theta - \mu)/\sigma)\, d\theta$ for some $\mu \in \mathbb{R}$ and $\sigma > 0$, then $N(z)$ is the largest positive integer $n$ for which $X_{n-1} < z$ and
$$1 - \Phi\left(\frac{z - \mu_{n|n-1}}{\sqrt{1 + \sigma^2_{n|n-1}}}\right) \geq c/v,$$
where $\mu_{n|n-1} = \dfrac{\sigma^2 \sum_{i=1}^{n-1} X^*_i + \mu}{(n-1)\sigma^2 + 1}$ and $\sigma^2_{n|n-1} = \dfrac{\sigma^2}{(n-1)\sigma^2 + 1}$.
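A sketch of the one-sided analogue in code: it simulates a learning researcher who samples new iid datasets, stops according to proposition A5's condition, and records how often a true $H_0$ is rejected at CV $z$. Parameter values are illustrative, not the paper's calibration.

```python
import numpy as np
from scipy.stats import norm

def one_sided_size(z, mu, sigma, c_over_v, n_max=50, reps=20_000, seed=1):
    """Size of the one-sided test at CV z when a learning researcher samples
    new iid datasets and stops according to proposition A5, under H0: theta = 0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        x_sum = 0.0
        for n in range(1, n_max + 1):
            s2 = sigma**2 / ((n - 1) * sigma**2 + 1)                # sigma^2_{n|n-1}
            m = (sigma**2 * x_sum + mu) / ((n - 1) * sigma**2 + 1)  # mu_{n|n-1}
            if 1 - norm.cdf((z - m) / np.sqrt(1 + s2)) < c_over_v:
                break                       # expected benefit below cost: stop
            x = rng.standard_normal()       # X*_n under H0
            if x > z:
                rejections += 1
                break
            x_sum += x
    return rejections / reps

# Illustrative values (not the paper's calibration): prior N(0, 1), c/v = 0.05.
print(one_sided_size(1.645, mu=0.0, sigma=1.0, c_over_v=0.05))
```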
In remark 9, "$1 - \Phi(z - \theta^*) + \Phi(-z - \theta^*) < c/v$" should be replaced by "$1 - \Phi(z - \theta^*) < c/v$". Replace display (12) with
$$E\big(\pi(X_n, X_{n-1}; z, v, c) \mid X^*_{n-1}\big) = v \int P\big(X^*_n > z \mid X^*_{n-1}, \theta\big)\, dF(\theta)\, 1(X_{n-1} < z) - c = v \int \left[1 - \Phi\left(\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right)\right] dF(\theta)\, 1(X_{n-1} < z) - c,$$
display (13) with $\int \left[1 - \Phi\left(\sqrt{n}\,z - \theta - \sqrt{n-1}\,X^*_{n-1}\right)\right] dF(\theta) \geq c/v$, display (14) with $H_0: \beta_i = \beta_{0i}$ versus $H_1: \beta_i > \beta_{0i}$, display (15) with
$$E\big(\pi(X_n, X_{n-1}; z, v, c) \mid X^*_{n-1}\big) = v \int P\big(X^*_{n|n-1} > z \mid \theta_n\big)\, dF(\theta_n; X^*_{n-1})\, 1(X_{n-1} < z) - c(n) = v \int \left[1 - \Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big)\right] dF(\theta_n; X^*_{n-1})\, 1(X_{n-1} < z) - c(n),$$
and display (16) with $\int \left[1 - \Phi\big((z - \theta_{n|n-1})/\omega_{n|n-1}\big)\right] dF(\theta_n; X^*_{n-1}) \geq c(n)/v$.

Appendix C. Calibration of researcher's prior
As mentioned in section 3, the two previous articles to which we calibrate Dal Bo's prior, Roth and Murnighan (1978) and Murnighan and Roth (1983), do not provide standard error or t-statistic values, but only estimates of $\beta$. Nevertheless, the structure of the matched-pairs experimental design in each of these two papers allows us to compute bounds for the estimated standard error of $\hat\beta$ as follows. Roth and Murnighan (1978) and Murnighan and Roth (1983) report the number of subjects in their experiments who choose to cooperate under high and low probabilities of future interaction and compare the arithmetic averages of these two numbers to assess whether there is statistical evidence that the probability of cooperation is different under high vs low levels of future interaction. Between these two papers, the authors conduct four total experiments.

Formally, in each experiment, the authors report the sample size $n$, $\sum_{i=1}^n y_i$ and $\sum_{i=1}^n x_i$, where $y_i = 1$ if individual $i$ chooses to cooperate under low probability of future interaction and $y_i = 0$ otherwise, while $x_i = 1$ if individual $i$ chooses to cooperate under high probability of future interaction and $x_i = 0$ otherwise. The authors estimate $\beta$ as the difference in the sample averages of these values between the same set of individuals: $\hat\beta = n^{-1}\sum_{i=1}^n x_i - n^{-1}\sum_{i=1}^n y_i = n^{-1}\sum_{i=1}^n (x_i - y_i)$. Though the authors report that $\hat\beta$ is statistically different from zero in each experiment, they neither provide t-statistics nor standard errors. However, in a matched-pairs design, the standard error is calculated as $\mathrm{se}(\hat\beta) = n^{-1/2}\,\widehat{\mathrm{sd}}(x_i - y_i)$, where
$$\widehat{\mathrm{sd}}(x_i - y_i) = \sqrt{\frac{\sum_{i=1}^n (x_i - y_i - \hat\beta)^2}{n - 1}}.$$
Since
$$\sum_{i=1}^n (x_i - y_i - \hat\beta)^2 = \sum_{i=1}^n x_i^2 - 2\sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 - n\hat\beta^2 = \sum_{i=1}^n x_i + \sum_{i=1}^n y_i - n\hat\beta^2 - 2\sum_{i=1}^n x_i y_i$$
and
$$0 \leq \sum_{i=1}^n x_i y_i \leq \min\left\{\sum_{i=1}^n x_i,\ \sum_{i=1}^n y_i\right\},$$
we can bound $\widehat{\mathrm{sd}}(x_i - y_i)$ as follows: $\widehat{\mathrm{sd}}_{lb} \leq \widehat{\mathrm{sd}}(x_i - y_i) \leq \widehat{\mathrm{sd}}_{ub}$, where
$$\widehat{\mathrm{sd}}_{lb} = \sqrt{\frac{\sum_{i=1}^n x_i + \sum_{i=1}^n y_i - n\hat\beta^2 - 2\min\left\{\sum_{i=1}^n x_i,\ \sum_{i=1}^n y_i\right\}}{n - 1}}, \qquad \widehat{\mathrm{sd}}_{ub} = \sqrt{\frac{\sum_{i=1}^n x_i + \sum_{i=1}^n y_i - n\hat\beta^2}{n - 1}}.$$
These bounds translate into bounds on the $\widehat{\mathrm{sd}}(x_i - y_i)$ that the researcher would infer from these previous studies when forming his prior on the mean of the t-statistic $\beta/\mathrm{sd}(\hat\beta)$. More specifically, we calibrated the researcher's prior for $\beta/\mathrm{sd}(\hat\beta)$ as follows:

1. Calculate $\hat\sigma_j(x_i - y_i) = 0.5\,\widehat{\mathrm{sd}}_{lb,j} + 0.5\,\widehat{\mathrm{sd}}_{ub,j}$, where $\widehat{\mathrm{sd}}_{lb,j}$ and $\widehat{\mathrm{sd}}_{ub,j}$ are equal to the values of $\widehat{\mathrm{sd}}_{lb}$ and $\widehat{\mathrm{sd}}_{ub}$ in the $j$th study for each of the $j = 1, \dots, 4$ studies.

2. Calculate $t_j = \sqrt{n_j}\,\hat\beta_j / \hat\sigma_j(x_i - y_i)$, where $n_j$ is the sample size and $\hat\beta_j$ is the estimate of $\beta$ in study $j$. This serves as an approximation to the t-statistic obtained in each of the $j = 1, \dots, 4$ studies.

3. Calculate the mean of the researcher's prior for $\beta/\mathrm{sd}(\hat\beta)$ as $E(\beta/\mathrm{sd}(\hat\beta)) = \sum_{j=1}^4 w_j \sqrt{48.75/n_j}\, t_j$, where the weights $w_j = n_j / \sum_{i=1}^4 n_i$ correspond to the relative sample size of study $j$. The scaling by $\sqrt{48.75/n_j}$ inside of the sum is used to account for the fact that Dal Bo's average sample size is 48.75, rather than $n_j$.
4. Calculate the variance of the researcher's prior for $\beta/\mathrm{sd}(\hat\beta)$ as
$$\mathrm{var}(\beta/\mathrm{sd}(\hat\beta)) = \sum_{j=1}^4 w_j \left(\sqrt{48.75/n_j}\, t_j - \sum_{i=1}^4 w_i \sqrt{48.75/n_i}\, t_i\right)^2.$$
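Steps 1–4 translate directly into code. A sketch with hypothetical experiment counts (the actual counts from Roth and Murnighan 1978 and Murnighan and Roth 1983 are not reproduced here):

```python
import numpy as np

def prior_from_matched_pairs(sum_x, sum_y, n, target_n=48.75):
    """Steps 1-4: bound sd(x_i - y_i) in each matched-pairs experiment, take the
    midpoint, form approximate t-statistics, and return the mean and variance of
    the researcher's prior on beta / sd(beta_hat)."""
    sum_x, sum_y, n = (np.asarray(v, dtype=float) for v in (sum_x, sum_y, n))
    beta_hat = (sum_x - sum_y) / n
    ss = sum_x + sum_y - n * beta_hat**2
    sd_ub = np.sqrt(ss / (n - 1))
    sd_lb = np.sqrt((ss - 2 * np.minimum(sum_x, sum_y)) / (n - 1))
    sd_mid = 0.5 * sd_lb + 0.5 * sd_ub                 # step 1
    t = np.sqrt(n) * beta_hat / sd_mid                 # step 2
    w = n / n.sum()
    scaled = np.sqrt(target_n / n) * t                 # rescale to Dal Bo's design
    prior_mean = (w * scaled).sum()                    # step 3
    prior_var = (w * (scaled - prior_mean)**2).sum()   # step 4
    return prior_mean, prior_var

# Hypothetical cooperation counts for the four experiments.
print(prior_from_matched_pairs([30, 25, 28, 22], [18, 15, 20, 12], [40, 35, 38, 30]))
```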
Appendix D. Indirect elicitation of cost-benefit ratio

Suppose we knew the maximum number of studies $\bar n$ a researcher would typically conduct under $H_0$ without rejecting $H_0$ at a given CV $z$ before stopping and moving on to a different research project. Using proposition 1, we could then infer
$$v E_{H_0}\left[P\left(|X^*_{\bar n}| > z \mid X^*_{\bar n - 1}\right)\right] - c(\bar n) \geq 0 \quad \text{and} \quad v E_{H_0}\left[P\left(|X^*_{\bar n + 1}| > z \mid X^*_{\bar n}\right)\right] - c(\bar n + 1) < 0,$$
where the expectations are taken over the distribution of $X^*_{\bar n - 1}$ and $X^*_{\bar n}$ under the objective probability measure when $H_0$ is true. If marginal costs are constant, we could then bound the marginal cost-to-payoff ratio:
$$E_{H_0}\left[P\left(|X^*_{\bar n + 1}| > z \mid X^*_{\bar n}\right)\right] < \frac{c}{v} \leq E_{H_0}\left[P\left(|X^*_{\bar n}| > z \mid X^*_{\bar n - 1}\right)\right].$$
Moreover, if we were able to obtain this maximum number of studies across a range of CVs $z$, so $\bar n$ is a function of $z$, we could tighten these bounds:
$$\sup_z E_{H_0}\left[P\left(|X^*_{\bar n(z) + 1}| > z \mid X^*_{\bar n(z)}\right)\right] < \frac{c}{v} \leq \inf_z E_{H_0}\left[P\left(|X^*_{\bar n(z)}| > z \mid X^*_{\bar n(z) - 1}\right)\right].$$
In principle, this could be achieved by surveying researchers, although a mechanism eliciting truthful reporting would be crucial to such an analysis.

As a practical illustration, consider the pooling data paradigm of section 7 and set the prior of the researcher according to the calibration in section 3. In this case, since $\sqrt{\bar n - 1}\, X^*_{\bar n - 1} / \sqrt{1 + \sigma^2} \sim N\big(0, (\bar n - 1)/(1 + \sigma^2)\big)$ under $H_0$, we obtain
$$E_{H_0}\left[P\left(|X^*_{\bar n + 1}| > z \mid X^*_{\bar n}\right)\right] = E_{H_0}\left[1 - \Phi\left(\frac{\sqrt{\bar n + 1}\,z - \sqrt{\bar n}\,X^*_{\bar n} - \mu}{\sqrt{1 + \sigma^2}}\right) + \Phi\left(\frac{-\sqrt{\bar n + 1}\,z - \sqrt{\bar n}\,X^*_{\bar n} - \mu}{\sqrt{1 + \sigma^2}}\right)\right] = 1 - \Phi\left(\frac{\sqrt{\bar n + 1}\,z - \mu}{\sqrt{1 + \sigma^2 + \bar n}}\right) + \Phi\left(\frac{-\sqrt{\bar n + 1}\,z - \mu}{\sqrt{1 + \sigma^2 + \bar n}}\right),$$
where $\mu$ and $\sigma^2$ are the mean and variance of the researcher's prior distribution on $\beta/\mathrm{sd}(\hat\beta)$. Similarly,
$$E_{H_0}\left[P\left(|X^*_{\bar n}| > z \mid X^*_{\bar n - 1}\right)\right] = 1 - \Phi\left(\frac{\sqrt{\bar n}\,z - \mu}{\sqrt{\sigma^2 + \bar n}}\right) + \Phi\left(\frac{-\sqrt{\bar n}\,z - \mu}{\sqrt{\sigma^2 + \bar n}}\right).$$
Table A1 records these bounds for $c/v$ corresponding to different values of $\bar n$ for the standard CV of $z = 1.96$.

[Table A1. Marginal cost-benefit ratio bounds: columns $\bar n$, $E_{H_0}[P(|X^*_{\bar n + 1}| > z \mid X^*_{\bar n})]$, $E_{H_0}[P(|X^*_{\bar n}| > z \mid X^*_{\bar n - 1})]$.]

Dal Bo (2005) conducted four studies. If we assume that this is the number of studies he would typically conduct under a true $H_0$ (so $\bar n = 4$), the corresponding row of table A1 bounds $c/v$.
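The closed-form bounds above are simple to evaluate. A sketch that tabulates them for a few values of $\bar n$ at $z = 1.96$; the prior mean and variance are illustrative placeholders for the appendix C calibration, not its actual values.

```python
import numpy as np
from scipy.stats import norm

def cv_ratio_bounds(n_bar, z, mu, sigma):
    """Bounds on c/v from appendix D in the pooling paradigm: returns
    (E_H0[P(|X*_{n+1}| > z | X*_n)], E_H0[P(|X*_n| > z | X*_{n-1})])."""
    def expected_gain(k, denom):
        return (1 - norm.cdf((np.sqrt(k) * z - mu) / denom)
                + norm.cdf((-np.sqrt(k) * z - mu) / denom))
    lower = expected_gain(n_bar + 1, np.sqrt(1 + sigma**2 + n_bar))
    upper = expected_gain(n_bar, np.sqrt(sigma**2 + n_bar))
    return lower, upper

# Tabulate the bounds for a few n_bar at z = 1.96; mu and sigma are
# placeholders for the appendix C calibration, not its actual values.
for n_bar in range(1, 7):
    print(n_bar, cv_ratio_bounds(n_bar, 1.96, mu=0.6, sigma=1.0))
```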