The Journal of Physiology | 2021

The unknowable probability of replication

 

Abstract


I read the recent editorial by Simon Gandevia (2021) on publications, replication and statistics in The Journal of Physiology, and I was most interested in the discussion about P-values and replication, which formed one focus of the editorial. Although I agree with the message that findings with P-values above the a priori α-level should never be interpreted as trends, I have concerns about the statements on individual replication probabilities based on the P-value observed in an initial study, because it has previously been shown that the individual replication probability is unknown (Miller & Schwarz 2011).

In the figure caption, the author states that there is a 50% chance that an attempted replication experiment will achieve a P-value below a 5% α-level, assuming the P-value in the initial experiment was 0.05. The author's reasoning arises from a post hoc power computation, which assumes that the observed effect size in the sample equals the true effect size in the population. In this instance, because the P-value equals the α-level of the test (e.g. 0.05), the test statistic (z-score) equals the critical value of the test. Therefore, assuming the true mean of z is 1.96, the z-score has a 50% chance (50% post hoc power) of exceeding the critical value for significance (e.g. 1.96 in a two-tailed z-test) (Hoenig & Heisey 2001). Likewise, the chance of rejecting the null hypothesis (e.g. a z-score falling outside the interval −1.96 to 1.96), again assuming that the true mean of z is 1.96, is slightly higher than 50% (Hoenig & Heisey 2001).

Unfortunately, this line of reasoning breaks down in the real world because the true effect size in the population is unknown, which leaves the individual replication probability unknown (Miller & Schwarz 2011). This occurs because the true effect size is not tightly constrained by the effect size observed in the initial study, and because the effect sizes in subsequent studies vary around the true effect size (Miller & Schwarz 2011). Together, these two factors produce an uncomfortably wide range of plausible individual replication probabilities, and this high level of uncertainty cannot be avoided (Miller & Schwarz 2011). Additional statements by the author that follow the same line of reasoning (e.g. that there is a 66% chance that an attempted replication will achieve P < 0.05 if the initial P-value was 0.01, and that even low initial P-values will frequently fail the replication test) are therefore also incorrect.
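To make the argument concrete, the following is a minimal sketch (my own, not taken from the cited papers) of the post hoc power calculation criticised above, assuming a two-sided one-sample z-test with α = 0.05; the function names are illustrative. It reproduces the editorial's ~50% figure and then shows how strongly the actual replication probability depends on the unknown true effect size.

```python
# A minimal sketch (not from the cited papers) of the post hoc power
# reasoning criticised above. Assumes a two-sided one-sample z-test
# with alpha = 0.05; all function names are illustrative.
from scipy.stats import norm

ALPHA = 0.05
Z_CRIT = norm.ppf(1 - ALPHA / 2)  # critical value, ~1.96

def naive_replication_power(p_initial: float) -> float:
    """Post hoc power: treats the observed z-score as the true mean of z."""
    z_obs = norm.ppf(1 - p_initial / 2)  # observed test statistic
    # Probability that a new z-score falls outside (-1.96, 1.96)
    # if its true mean really were z_obs:
    return norm.sf(Z_CRIT - z_obs) + norm.cdf(-Z_CRIT - z_obs)

def replication_power(true_z: float) -> float:
    """Replication probability if the true standardised effect were true_z."""
    return norm.sf(Z_CRIT - true_z) + norm.cdf(-Z_CRIT - true_z)

print(round(naive_replication_power(0.05), 3))  # ~0.5, the editorial's figure

# But the true effect is not pinned down by a single study; plausible true
# effects near z_obs = 1.96 imply very different replication chances:
for true_z in (0.5, 1.0, 1.96, 3.0):
    print(true_z, round(replication_power(true_z), 2))  # 0.08, 0.17, 0.50, 0.85
```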
In high-powered and well-designed studies, assuming there is a true effect, a significant P-value should be observed about as often as the true power of the study (e.g. 90% of the time for studies with 90% power) (Cumming 2008; Lakens & Evers 2014). This is true regardless of whether the initial P-value is 0.001 or 0.049 (a small simulation sketch at the end of this letter illustrates the point). Consequently, as we can never know the true effect size in the population, and thus the probability of replication in the real world, we need high-powered and well-designed studies to advance scientific knowledge and avoid so-called replication crises in scientific fields.

To conclude, I would like to echo the author's call for convergent lines of experimental evidence together with direct replication. But I would add that, although we cannot accurately estimate how successful our replication attempts are likely to be, our ability to replicate significant findings will surely be bolstered by high-powered studies (e.g. at least 90% estimated power) that investigate effects which are likely to be true.
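The power argument referred to above can be checked with a small simulation (again my own sketch; the effect size and sample size are illustrative assumptions, not values from the editorial): when a true standardised effect of d = 0.5 exists and the sample size is chosen for 90% power, roughly 90% of independent studies reach P < 0.05, whatever P-value any single earlier study happened to return.

```python
# A minimal simulation sketch (my own; parameter values are illustrative,
# not from the editorial): with a real effect and 90% power, ~90% of
# independent studies reach P < 0.05.
import numpy as np
from scipy.stats import norm, ttest_1samp

rng = np.random.default_rng(1)
ALPHA, POWER = 0.05, 0.90
d = 0.5  # assumed true standardised effect size

# Sample size for ~90% power with a two-sided one-sample test
# (normal approximation to the t-test):
n = int(np.ceil(((norm.ppf(1 - ALPHA / 2) + norm.ppf(POWER)) / d) ** 2))

# Run 10,000 independent "replication" studies and count significant results:
pvals = np.array([ttest_1samp(rng.normal(d, 1.0, n), 0.0).pvalue
                  for _ in range(10_000)])
print(n, (pvals < ALPHA).mean())  # n = 43; proportion significant ~0.90
```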

Volume 599
DOI 10.1113/JP281472
Journal The Journal of Physiology
