I can see clearly now: reinterpreting statistical significance
Jonathan Dushoff, Morgan P. Kain, and Benjamin M. Bolker

Department of Biology, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada
Department of Mathematics and Statistics, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4L8, Canada

Corresponding author: Jonathan Dushoff
Email address: [email protected]
Abstract
Null hypothesis significance testing remains popular despite decades of concern about misuse and misinterpretation. We believe that much of the problem is due to language: significance testing has little to do with other meanings of the word “significance”. Despite the limitations of null-hypothesis tests, we argue here that they remain useful in many contexts as a guide to whether a certain effect can be seen clearly in that context (e.g. whether we can clearly see that a correlation or between-group difference is positive or negative). We therefore suggest that researchers describe the conclusions of null-hypothesis tests in terms of statistical “clarity” rather than statistical “significance”. This simple semantic change could substantially enhance clarity in statistical communication.
Key Words
Statistical philosophy; Statistical clarity; Hypothesis testing; p-value

Introduction
Statisticians and scientists have bemoaned the shortcomings of null hypothesis significance testing (NHST) for nearly a century (Cohen, 1994). Books and articles proposing the de-emphasis or abandonment of the p-value have been cited thousands of times (Cohen, 1994, Goodman, 1999, Wilkinson, 1999, Ziliak and McCloskey, 2008, Wasserstein and Lazar, 2016). These works plead for a focus on effect sizes and confidence intervals, and point out that null effects that truly have zero magnitude are unrealistic or impossible in most fields outside of the hard physical sciences (Meehl, 1990, Tukey, 1991, Cohen, 1994). Yet p-values without confidence intervals (or even effect sizes) and references to null effects still pervade the scientific literature at all levels, up to and including articles in high-impact journals.

In a meta-analysis of 356 studies, Bernardi et al. (2017) found that 72% of studies contained an ambiguous use of the term “significant”, 49% interpreted non-significant effects as zero effects, and 44% failed to report a comprehensible effect size. The misuse and misinterpretation of NHST is so frequent that there have been recent calls for drastically reducing (Szucs and Ioannidis, 2017) or abandoning (McShane et al., 2017) its use. Other prescriptions have included the complete abandonment of frequentist statistics (The, 2011), or the use of a stricter significance threshold (Benjamin et al., 2018); however, the former seems impractical, while the latter is unlikely to reduce the misuse and misinterpretation of p-values, or the publication bias imposed by any p-value threshold (Ridley et al., 2007).

Here, we argue that NHST remains useful, and that pervasive misuse can be reduced through a linguistic change: using the language of statistical “clarity” instead of statistical “significance”.

The null hypothesis is false

In most biological studies, the null hypothesis is known or believed to not be strictly true.
Even in cases where the null hypothesis is sensible (e.g., particle physics, Staley (2017)), NHST does not provide evidence that a difference is exactly zero. This being the case, it is worth asking how NHST has survived “if it is as idiotic as . . . long believed” (Ziliak and McCloskey, 2008, cited in Krämer (2011)).

The value of NHST can be seen in something like a permutation-based t-test (Good, 2000, Chapter 1): it provides a simple, robust framework to ask whether we can tell which mean is bigger. More generally, testing the null hypothesis is a proxy for asking whether we can clearly see a signal of how our data differ from it. In many cases, this comes down simply to whether we can be confident of the sign of a difference or a correlation coefficient (Robinson and Wainer, 2001). In other cases (e.g., a one-way ANOVA), it may not be simple to describe the difference, but NHST is still a reasonable, accepted way to evaluate whether an effect has been seen clearly.

The “idiocy”, if any, comes in the interpretive step. A statistical fact (“we have seen a difference between the groups”, which should immediately prompt the question “what have you learned about that difference?”) is interpreted as a scientific fact (“there is a ‘significant’ difference between the groups”), which is often seen as an end in itself: “we showed that the groups differ”.

The p-value is a property of the study
We often see sentences like, “X et al. showed that there is no significant effect of Y on Z”, with the implication that this effect can now be assumed to be absent (or unimportant). In fact, the sentence is erroneous even before we get to the implication: significance tests provide information about a data set – that is, about a study, not about the study system (Hoenig and Heisey, 2001). Indeed, a very small effect can lead to p < 0.05 when data are abundant (or noise is small); a very large one can lead to p > 0.05 when the sample is small or noisy.

The statement “X et al. showed that Y has a statistically significant effect on Z” is similarly misleading. Frequentist statistics effectively assume that the effect is present (or at least, admit that it can’t be disproven). The question is whether it is seen in a particular data set. The statement “X et al. were able to see the effect of Y on Z” is not only more accurate, but it appropriately implies that something is missing: What effect did they see?
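The role of NHST described above, as a simple check on whether we can tell which mean is bigger, can be sketched as a permutation test for a difference in group means. The code below is a minimal illustration of our own (the data and the function name are hypothetical), not a procedure taken from the cited sources:

```python
import numpy as np

rng = np.random.default_rng(42)

def permutation_p_value(x, y, n_perm=10_000):
    """Two-sided permutation test for a difference in group means.

    Shuffles the group labels and counts how often a label-shuffled
    difference is at least as large as the observed difference.
    """
    observed = abs(np.mean(x) - np.mean(y))
    pooled = np.concatenate([x, y])
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(np.mean(perm[:n_x]) - np.mean(perm[n_x:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids reporting p = 0

# Hypothetical example: a one-standard-deviation shift between groups
control = rng.normal(0.0, 1.0, size=50)
treatment = rng.normal(1.0, 1.0, size=50)
p = permutation_p_value(control, treatment)
print(f"p = {p:.4f}")  # a small p: the sign of the difference is clear here
```

Note that a small p-value here licenses only the modest claim that the sign of the difference is statistically clear; the effect size and a confidence interval are still needed to say what was seen.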
Statistical clarity
The language of “statistical clarity” could help researchers escape various logical traps while interpreting the results of NHST, allowing for the continued use of NHST as a simple, robust method of evaluating whether a data signal is clear (see Abelson (1997) for arguments for NHST). The use of “significance” to describe the results of hypothesis tests is deeply, and sometimes subtly, misleading, because it is at odds with other meanings of the word: the p-value is not an accurate gauge of whether a result is large in magnitude, biologically important, or relevant. “Clarity,” on the other hand, is an apt term for what NHST actually evaluates. Jones and Tukey (2000) and Robinson and Wainer (2001) suggest that researchers should report p > 0.05 using language such as “the direction of the differences among the treatments was undetermined”. This is a step in the right direction. Replacing “significance” with “clarity” takes this idea further, and has the promise to substantially improve statistical communication.

For example, the sentence “X et al. showed that the effect of Y on Z is statistically unclear” is noticeably awkward. It seems less like a statement about the study system, and suggests the more straightforward “did not find a statistically clear effect.” Similarly, “We did not find a clear difference in response between the control and sham groups” is both more colloquial and harder to transform into a misleading statement than “We did not find a significant difference . . . ”. Bernardi et al. (2017) complained that “. . . sociological and social significance are sacrificed on the altar of statistical significance”. Describing statistical tests in terms of clarity would allow “significant” to reclaim its common English definition and reduce conflation between statistical results and substantive significance.

Descriptions of statistical results using the language of clarity should begin with reference to the effect.
For example, “The difference between the control and treatment groups was not statistically clear.” Table 1 shows published examples of statements that misinterpret p-values in three different ways and demonstrates how to rephrase them in the language of clarity.

Accepting the null hypothesis (p > 0.05; no effect)

Published: “Toxins accumulate after acute exposure but have no effect on behaviour”
Rewritten: “Toxins accumulate after acute exposure but their effects on behaviour are statistically unclear”

Published: “There was no effect of elevated carbon dioxide on reproductive behaviors”
Rewritten: “The effect of elevated carbon dioxide on reproductive behaviors was statistically unclear”

Published: “The finding that species richness showed no significant relationship with the area of available habitat is surprising because richness is usually strongly influenced by landscape context”
Rewritten: “Although species richness is usually strongly influenced by landscape context, we were unable to find a statistically clear relationship in this study”

Inferring weak effects from large p-values (Wasserstein and Lazar, 2016)

Published: “. . . differences between treatment and control groups were nonsignificant, with P values of at least 0.3, and most in the range . ≤ P ≤ . . . .”
Rewritten: “. . . differences between treatment and control groups were not statistically clear (all P > 0.3)” [confidence intervals would also be valuable here!]

The difference between “clear” and “not clear” is not clear (Gelman and Stern, 2006)

Published: “This correlation was significant in males (ρ = . , P < . ) but not in females (ρ = . , NS).” [The authors later write as though they have demonstrated a difference between males and females]
Rewritten: “Although males and females show the same correlation coefficient (ρ = . ), the sign of the coefficient is statistically clear only in males . . .” [This phrasing may suggest to the authors that confidence intervals are called for.]

Published: “. . . risk of low BMD [bone mineral density] remained greater in HCV-coinfected women versus women with HIV alone (adjusted OR 2.99, 95% CI 1.33–6.74), but no association was found between HCV coinfection and low BMD in men (adjusted OR 1.26, 95% CI 0.75–2.10). . . . The precise mechanisms for the association between viral hepatitis and low BMD in HIV-infected women but not men remain unclear.”
Rewritten: “. . . risk of low BMD [bone mineral density] remained greater in HCV-coinfected women versus women with HIV alone (adjusted OR 2.99, 95% CI 1.33–6.74), but the association between HCV coinfection and low BMD in men was not statistically clear (adjusted OR 1.26, 95% CI 0.75–2.10). . . . Pursuing biological differences between women and men in the effect of HIV on BMD would be premature given these results.”
Table 1. Examples of misleading language in peer-reviewed papers (citations available by request), and revisions using our proposed language of clarity.

Conclusions
We believe that NHST is useful as a simple, robust way to ask whether an effect can be seen clearly in a particular data set (Robinson and Wainer, 2001), and that careful, clarity-based language can reduce misinterpretation and miscommunication.

We agree with Cohen (1994) and others (Goodman, 1999, Ziliak and McCloskey, 2008, Wasserstein and Lazar, 2016) that scientific communication and understanding will be improved by a shift away from p-values to effect sizes and confidence intervals. We argue that the use of “statistical clarity” reinforces the need for confidence intervals and effect sizes by making it clearer that bald statements about p-values are insufficient. The statement “The difference between our control and treatment groups was not statistically clear (p = . )” is noticeably incomplete; an effect size and confidence interval are required to complete the story.

Improving language will not by itself solve all of the known problems with current statistical practice. We echo previous statements in favor of “neglected factors” (prior and related evidence, plausibility of mechanism, study design and data quality, real-world benefits, novelty and other factors) (McShane et al., 2017) and reporting of a priori analysis of statistical power, to avoid emphasis on implausibly large effects given low statistical power (the “winner’s curse”: Gelman and Carlin, 2014, Szucs and Ioannidis, 2017, Bernardi et al., 2017). Additionally, we support the idea of writing a statistical journal that chronicles all steps in the analytical process (Kass et al., 2016), and of clearly delineating the boundary between inferences based on a priori hypotheses and discoveries from post hoc data exploration.

Whether or not our recommendations are broadly adopted by authors, reviewers, and editors, they can be useful for individual researchers who want to help themselves think clearly about NHST results.
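The incompleteness of a bare p-value statement can be made concrete with a small sketch, our own, that reports an effect size with an approximate 95% confidence interval (hypothetical measurements; normal approximation with unpooled variances):

```python
import math
import statistics

def mean_diff_with_ci(x, y, z=1.96):
    """Difference in means with an approximate 95% confidence interval
    (normal approximation with unpooled variances; illustration only)."""
    diff = statistics.fmean(x) - statistics.fmean(y)
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return diff, diff - z * se, diff + z * se

# Hypothetical measurements from a treatment and a control group
treatment = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4]
control = [3.9, 4.2, 3.5, 4.8, 4.0, 3.7]
diff, lo, hi = mean_diff_with_ci(treatment, control)
# The interval spans zero, so the sign of the effect is not statistically
# clear; but the interval also bounds how large the effect could plausibly be.
print(f"difference = {diff:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Reported this way, “not statistically clear” is immediately accompanied by the effect size and interval that complete the story.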
We have found that rephrasing NHST statements that we encounter (in the literature, or in seminar presentations) in terms of clarity has already helped us with both interpretation and communication.

Acknowledgments

We thank members of the Dushoff and Bolker labs for helpful comments on the first draft of the manuscript.
References
Jacob Cohen. The earth is round (p < .05). American Psychologist, 49(12):997, 1994.

Steven N Goodman. Toward evidence-based medical statistics. 1: The p value fallacy. Annals of Internal Medicine, 130(12):995–1004, 1999.

Leland Wilkinson. Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8):594, 1999.

Steve Ziliak and Deirdre Nansen McCloskey. The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press, 2008.

Ronald L Wasserstein and Nicole A Lazar. The ASA’s statement on p-values: context, process, and purpose, 2016.

Paul E Meehl. Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1):195–244, 1990.

John W Tukey. The philosophy of multiple comparisons. Statistical Science, pages 100–116, 1991.

Fabrizio Bernardi, Lela Chakhaia, and Liliya Leopold. ‘Sing me a song with social significance’: The (mis)use of statistical significance testing in European sociological research. European Sociological Review, 33(1):1–15, 2017.

Denes Szucs and John Ioannidis. When null hypothesis significance testing is unsuitable for research: a reassessment. Frontiers in Human Neuroscience, 11:390, 2017.

Blakeley B McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L Tackett. Abandon statistical significance. The American Statistician, 70, 2017.

Bertram The. Significance testing: are we ready yet to abandon its use? Current Medical Research and Opinion, 27(11):2087–2090, 2011.

Daniel J Benjamin, James O Berger, Magnus Johannesson, Brian A Nosek, E-J Wagenmakers, Richard Berk, Kenneth A Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, et al. Redefine statistical significance. Nature Human Behaviour, 2(1):6, 2018.

J Ridley, Niclas Kolm, RP Freckelton, and MJG Gage. An unexpected influence of widely used significance thresholds on the distribution of reported p-values. Journal of Evolutionary Biology, 20(3):1082–1089, 2007.

Kent W Staley. Pragmatic warrant for frequentist statistical practice: the case of high energy physics. Synthese, 194(2):355–376, 2017.

Walter Krämer. The cult of statistical significance – what economists should and should not do to make their data talk. Schmollers Jahrbuch, 131(3):455–468, 2011.

Phillip Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, 2000.

Daniel H Robinson and Howard Wainer. On the past and future of null hypothesis significance testing. ETS Research Report Series, 2001(2), 2001.

John M Hoenig and Dennis M Heisey. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1):19–24, 2001.

Robert P Abelson. On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8(1):12–15, 1997.

Lyle V Jones and John W Tukey. A sensible formulation of the significance test. Psychological Methods, 5(4):411, 2000.

Andrew Gelman and Hal Stern. The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4):328–331, 2006.

Andrew Gelman and John Carlin. Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6):641–651, 2014.

Robert E Kass, Brian S Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, and Nancy Reid. Ten simple rules for effective statistical practice. PLoS Computational Biology, 12(6):e1004961, 2016.