I can see clearly now: reinterpreting statistical significance
Jonathan Dushoff, Morgan P. Kain, and Benjamin M. Bolker

Department of Biology, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada
Department of Mathematics and Statistics, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4L8, Canada

Corresponding author: Jonathan Dushoff
Email address: [email protected]
Abstract
Null hypothesis significance testing remains popular despite decades of concern about misuse and misinterpretation. We believe that much of the problem is due to language: significance testing has little to do with other meanings of the word “significance”. Despite the limitations of null-hypothesis tests, we argue here that they remain useful in many contexts as a guide to whether a certain effect can be seen clearly in that context (e.g. whether we can clearly see that a correlation or between-group difference is positive or negative). We therefore suggest that researchers describe the conclusions of null-hypothesis tests in terms of statistical “clarity” rather than statistical “significance”. This simple semantic change could substantially enhance clarity in statistical communication.
Key Words
Statistical philosophy; Statistical clarity; Hypothesis testing; p-value

Introduction
Statisticians and scientists have bemoaned the shortcomings of null hypothesis significance testing (NHST) for nearly a century (Cohen, 1994). Books and articles proposing the de-emphasis or abandonment of the p-value have been cited thousands of times (Cohen, 1994, Goodman, 1999, Wilkinson, 1999, Ziliak and McCloskey, 2008, Wasserstein and Lazar, 2016). These works plead for a focus on effect sizes and confidence intervals, and point out that null effects that truly have zero magnitude are unrealistic or impossible in most fields outside of the hard physical sciences (Meehl, 1990, Tukey, 1991, Cohen, 1994). Yet p-values without confidence intervals (or even effect sizes) and references to null effects still pervade the scientific literature at all levels, up to and including articles in high-impact journals.

In a meta-analysis of 356 studies, Bernardi et al. (2017) found that 72% of studies contained an ambiguous use of the term “significant”, 49% interpreted non-significant effects as zero effects, and 44% failed to report a comprehensible effect size. The misuse and misinterpretation of NHST is so frequent that there have been recent calls for drastically reducing (Szucs and Ioannidis, 2017) or abandoning (McShane et al., 2017) its use. Other prescriptions have included the complete abandonment of frequentist statistics (The, 2011), or the use of a stricter significance threshold (Benjamin et al., 2018); however, the former seems impractical, while the latter is unlikely to reduce the misuse and misinterpretation of p-values, or the publication bias imposed by any p-value threshold (Ridley et al., 2007).

Here, we argue that NHST remains useful, and that pervasive misuse can be reduced through a linguistic change: using the language of statistical “clarity” instead of statistical “significance”.

The null hypothesis is false

In most biological studies, the null hypothesis is known or believed to not be strictly true.
Even in cases where the null hypothesis is sensible (e.g., particle physics, Staley (2017)), NHST does not provide evidence that a difference is exactly zero. This being the case, it is worth asking how NHST has survived “if it is as idiotic as . . . long believed” (Ziliak and McCloskey, 2008, cited in Krämer (2011)).

The value of NHST can be seen in something like a permutation-based t-test (Good, 2000, Chapter 1): it provides a simple, robust framework to ask whether we can tell which mean is bigger. More generally, testing the null hypothesis is a proxy for asking whether we can clearly see a signal of how our data differ from it. In many cases, this comes down simply to whether we can be confident of the sign of a difference or a correlation coefficient (Robinson and Wainer, 2001). In other cases (e.g., a one-way ANOVA), it may not be simple to describe the difference, but NHST is still a reasonable, accepted way to evaluate whether an effect has been seen clearly.

The “idiocy”, if any, comes in the interpretive step. A statistical fact (“we have seen a difference between the groups”, which should immediately prompt the question “what have you learned about that difference?”) is interpreted as a scientific fact (“there is a ‘significant’ difference between the groups”), which is often seen as an end in itself: “we showed that the groups differ”.

The p-value is a property of the study
We often see sentences like, “X et al. showed that there is no significant effect of Y on Z”, with the implication that this effect can now be assumed to be absent (or unimportant). In fact, the sentence is erroneous even before we get to the implication: significance tests provide information about a data set – that is, about a study, not about the study system (Hoenig and Heisey, 2001). Indeed, a very small effect can lead to p < 0.05 when data are abundant (or noise is small); a very large one can lead to p > 0.05 when the sample is small or noisy.

The statement “X et al. showed that Y has a statistically significant effect on Z” is similarly misleading. Frequentist statistics effectively assume that the effect is present (or at least, admit that it can’t be disproven). The question is whether it is seen in a particular data set. The statement “X et al. were able to see the effect of Y on Z” is not only more accurate, but it appropriately implies that something is missing: What effect did they see?
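The role of NHST described above, as a simple check on whether we can tell which mean is bigger, can be sketched as a permutation test for a difference in group means. The code below is a minimal illustration of our own (the data and the function name are hypothetical), not a procedure taken from the cited sources:

```python
import numpy as np

rng = np.random.default_rng(42)

def permutation_p_value(x, y, n_perm=10_000):
    """Two-sided permutation test for a difference in group means.

    Shuffles the group labels and counts how often a label-shuffled
    difference is at least as large as the observed difference.
    """
    observed = abs(np.mean(x) - np.mean(y))
    pooled = np.concatenate([x, y])
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(np.mean(perm[:n_x]) - np.mean(perm[n_x:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids reporting p = 0

# Hypothetical example: a one-standard-deviation shift between groups
control = rng.normal(0.0, 1.0, size=50)
treatment = rng.normal(1.0, 1.0, size=50)
p = permutation_p_value(control, treatment)
print(f"p = {p:.4f}")  # a small p: the sign of the difference is clear here
```

Note that a small p-value here licenses only the modest claim that the sign of the difference is statistically clear; the effect size and a confidence interval are still needed to say what was seen.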
Statistical clarity
The language of “statistical clarity” could help researchers escape various logical traps while interpreting the results of NHST, allowing for the continued use of NHST as a simple, robust method of evaluating whether a data signal is clear (see Abelson (1997) for arguments for NHST). The use of “significance” to describe the results of hypothesis tests is deeply, and sometimes subtly, misleading, because it is at odds with other meanings of the word: the p-value is not an accurate gauge of whether a result is large in magnitude, biologically important, or relevant. “Clarity,” on the other hand, is an apt term for what NHST actually evaluates. Jones and Tukey (2000) and Robinson and Wainer (2001) suggest that researchers should report p > 0.05 using language such as “the direction of the differences among the treatments was undetermined”. This is a step in the right direction. Replacing “significance” with “clarity” takes this idea further, and has the promise to substantially improve statistical communication.

For example, the sentence “X et al. showed that the effect of Y on Z is statistically unclear” is noticeably awkward. It seems less like a statement about the study system, and suggests the more straightforward “did not find a statistically clear effect.” Similarly, “We did not find a clear difference in response between the control and sham groups” is both more colloquial and harder to transform into a misleading statement than “We did not find a significant difference . . . ”. Bernardi et al. (2017) complained that “. . . sociological and social significance are sacrificed on the altar of statistical significance”. Describing statistical tests in terms of clarity would allow “significant” to reclaim its common English definition and reduce conflation between statistical results and substantive significance.

Descriptions of statistical results using the language of clarity should begin with reference to the effect.
For example, “The difference between the control and treatment groups was not statistically clear.” Table 1 shows published examples of statements that misinterpret p-values in three different ways and demonstrates how to rephrase them in the language of clarity.

Accepting the null hypothesis (p > 0.05; no effect)

Published: “Toxins accumulate after acute exposure but have no effect on behaviour”
Rewritten: “Toxins accumulate after acute exposure but their effects on behaviour are statistically unclear”

Published: “There was no effect of elevated carbon dioxide on reproductive behaviors”
Rewritten: “The effect of elevated carbon dioxide on reproductive behaviors was statistically unclear”

Published: “The finding that species richness showed no significant relationship with the area of available habitat is surprising because richness is usually strongly influenced by landscape context”
Rewritten: “Although species richness is usually strongly influenced by landscape context, we were unable to find a statistically clear relationship in this study”

Inferring weak effects from large p-values (Wasserstein and Lazar, 2016)

Published: “. . . differences between treatment and control groups were nonsignificant, with P values of at least 0.3, and most in the range . ≤ P ≤ . . . .”
Rewritten: “. . . differences between treatment and control groups were not statistically clear (all P > 0.3)” [confidence intervals would also be valuable here!]

The difference between “clear” and “not clear” is not clear (Gelman and Stern, 2006)

Published: “This correlation was significant in males (ρ = . , P < . ) but not in females (ρ = . , NS).” [The authors later write as though they have demonstrated a difference between males and females]
Rewritten: “Although males and females show the same correlation coefficient (ρ = . ), the sign of the coefficient is statistically clear only in males . . .” [This phrasing may suggest to the authors that confidence intervals are called for.]

Published: “. . . risk of low BMD [bone mineral density] remained greater in HCV-coinfected women versus women with HIV alone (adjusted OR 2.99, 95% CI 1.33–6.74), but no association was found between HCV coinfection and low BMD in men (adjusted OR 1.26, 95% CI 0.75–2.10). . . . The precise mechanisms for the association between viral hepatitis and low BMD in HIV-infected women but not men remain unclear.”
Rewritten: “. . . risk of low BMD [bone mineral density] remained greater in HCV-coinfected women versus women with HIV alone (adjusted OR 2.99, 95% CI 1.33–6.74), but the association between HCV coinfection and low BMD in men was not statistically clear (adjusted OR 1.26, 95% CI 0.75–2.10). . . . Pursuing biological differences between women and men in the effect of HIV on BMD would be premature given these results.”
Table 1. Examples of misleading language in peer-reviewed papers (citations available by request), and revisions using our proposed language of clarity.

Conclusions
We believe that NHST is useful as a simple, robust way to ask whether an effect can be seen clearly in a particular data set (Robinson and Wainer, 2001), and that careful, clarity-based language can reduce misinterpretation and miscommunication.

We agree with Cohen (1994) and others (Goodman, 1999, Ziliak and McCloskey, 2008, Wasserstein and Lazar, 2016) that scientific communication and understanding will be improved by a shift away from p-values to effect sizes and confidence intervals. We argue that the use of “statistical clarity” reinforces the need for confidence intervals and effect sizes by making it clearer that bald statements about p-values are insufficient. The statement “The difference between our control and treatment groups was not statistically clear (p = . )” is noticeably incomplete; an effect size and confidence interval are required to complete the story.

Improving language will not by itself solve all of the known problems with current statistical practice. We echo previous statements in favor of “neglected factors” (prior and related evidence, plausibility of mechanism, study design and data quality, real-world benefits, novelty and other factors) (McShane et al., 2017) and reporting of a priori analysis of statistical power, to avoid emphasis on implausibly large effects given low statistical power (the “winner’s curse”: Gelman and Carlin, 2014, Szucs and Ioannidis, 2017, Bernardi et al., 2017). Additionally, we support the idea of writing a statistical journal that chronicles all steps in the analytical process (Kass et al., 2016), and of clearly delineating the boundary between inferences based on a priori hypotheses and discoveries from post hoc data exploration.

Whether or not our recommendations are broadly adopted by authors, reviewers, and editors, they can be useful for individual researchers who want to help themselves think clearly about NHST results.
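The incompleteness of a bare p-value statement can be made concrete with a small sketch, our own, that reports an effect size with an approximate 95% confidence interval (hypothetical measurements; normal approximation with unpooled variances):

```python
import math
import statistics

def mean_diff_with_ci(x, y, z=1.96):
    """Difference in means with an approximate 95% confidence interval
    (normal approximation with unpooled variances; illustration only)."""
    diff = statistics.fmean(x) - statistics.fmean(y)
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return diff, diff - z * se, diff + z * se

# Hypothetical measurements from a treatment and a control group
treatment = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4]
control = [3.9, 4.2, 3.5, 4.8, 4.0, 3.7]
diff, lo, hi = mean_diff_with_ci(treatment, control)
# The interval spans zero, so the sign of the effect is not statistically
# clear; but the interval also bounds how large the effect could plausibly be.
print(f"difference = {diff:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Reported this way, “not statistically clear” is immediately accompanied by the effect size and interval that complete the story.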
We have found that rephrasing NHST statements that we encounter (in the literature, or in seminar presentations) in terms of clarity has already helped us with both interpretation and communication.

Acknowledgments

We thank members of the Dushoff and Bolker labs for helpful comments on the first draft of the manuscript.
References
Jacob Cohen. The earth is round (p < .05). American Psychologist, 49(12):997, 1994.

Steven N Goodman. Toward evidence-based medical statistics. 1: The p value fallacy. Annals of Internal Medicine, 130(12):995–1004, 1999.

Leland Wilkinson. Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8):594, 1999.

Steve Ziliak and Deirdre Nansen McCloskey. The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press, 2008.

Ronald L Wasserstein and Nicole A Lazar. The ASA’s statement on p-values: context, process, and purpose, 2016.

Paul E Meehl. Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1):195–244, 1990.

John W Tukey. The philosophy of multiple comparisons. Statistical Science, pages 100–116, 1991.

Fabrizio Bernardi, Lela Chakhaia, and Liliya Leopold. ‘Sing me a song with social significance’: The (mis)use of statistical significance testing in European sociological research. European Sociological Review, 33(1):1–15, 2017.

Denes Szucs and John Ioannidis. When null hypothesis significance testing is unsuitable for research: a reassessment. Frontiers in Human Neuroscience, 11:390, 2017.

Blakeley B McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L Tackett. Abandon statistical significance. The American Statistician, 70, 2017.

Bertram The. Significance testing: are we ready yet to abandon its use? Current Medical Research and Opinion, 27(11):2087–2090, 2011.

Daniel J Benjamin, James O Berger, Magnus Johannesson, Brian A Nosek, E-J Wagenmakers, Richard Berk, Kenneth A Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, et al. Redefine statistical significance. Nature Human Behaviour, 2(1):6, 2018.

J Ridley, Niclas Kolm, RP Freckelton, and MJG Gage. An unexpected influence of widely used significance thresholds on the distribution of reported p-values. Journal of Evolutionary Biology, 20(3):1082–1089, 2007.

Kent W Staley. Pragmatic warrant for frequentist statistical practice: the case of high energy physics. Synthese, 194(2):355–376, 2017.

Walter Krämer. The cult of statistical significance – what economists should and should not do to make their data talk. Schmollers Jahrbuch, 131(3):455–468, 2011.

Phillip Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, 2000.

Daniel H Robinson and Howard Wainer. On the past and future of null hypothesis significance testing. ETS Research Report Series, 2001(2), 2001.

John M Hoenig and Dennis M Heisey. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1):19–24, 2001.

Robert P Abelson. On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8(1):12–15, 1997.

Lyle V Jones and John W Tukey. A sensible formulation of the significance test. Psychological Methods, 5(4):411, 2000.

Andrew Gelman and Hal Stern. The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4):328–331, 2006.

Andrew Gelman and John Carlin. Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6):641–651, 2014.

Robert E Kass, Brian S Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, and Nancy Reid. Ten simple rules for effective statistical practice. PLoS Computational Biology, 12(6):e1004961, 2016.