A reckless guide to P-values: local evidence, global errors
Chapter 13 in the book Good Research Practice in Experimental Pharmacology, editors A. Bespalov, M. C. Michel, and T. Steckler, to be published by Springer. Open Access, Creative Commons 4.0.

Michael J. Lew
Department of Pharmacology and Therapeutics, University of Melbourne
October 7, 2019
This chapter demystifies P-values, hypothesis tests and significance tests, and introduces the concepts of local evidence and global error rates. The local evidence is embodied in this data and concerns the hypotheses of interest for this experiment, whereas the global error rate is a property of the statistical analysis and sampling procedure. It is shown using simple examples that local evidence and global error rates can be, and should be, considered together when making inferences. Power analysis for experimental design for hypothesis testing is explained, along with the more locally focussed expected P-values. Issues relating to multiple testing, HARKing, and P-hacking are explained, and it is shown that, in many situations, their effects on local evidence and global error rates are in conflict, a conflict that can always be overcome by a fresh dataset from replication of key experiments. Statistics is complicated, and so is science. There is no singular right way to do either, and universally acceptable compromises may not exist. Statistics offers a wide array of tools for assisting with scientific inference by calibrating uncertainty, but statistical inference is not a substitute for scientific inference. P-values are useful indices of evidence and deserve their place in the statistical toolbox of basic pharmacologists.
There is a widespread consensus that we are in the midst of a 'reproducibility crisis' and that inappropriate application of statistical methods facilitates, or even causes, irreproducibility [Ioannidis, 2005, Nuzzo, 2014, Colquhoun, 2014, George et al., 2017, Wagenmakers et al., 2018]. P-values are a "pervasive problem" [Wagenmakers, 2007] because they are misunderstood, misapplied, and answer a question that no-one asks [Royall, 1997, Halsey et al., 2015, Colquhoun, 2014]. They exaggerate evidence [Johnson, 2013, Benjamin et al., 2018] or they are irreconcilable with evidence [Berger and Sellke, 1987]. What's worse, 'P-hacking' amplifies their intrinsic shortcomings [Fraser et al., 2018]. The inescapable conclusion, it would seem, is that P-values should be eliminated by replacement with Bayes factors [Goodman, 2001, Wagenmakers, 2007] or confidence intervals [Cumming, 2008], or by simply doing without [Trafimow and Marks, 2015]. However, much of the blame for irreproducibility that is apportioned to P-values is based on pervasive and pernicious misunderstandings.

This chapter is an attempt to resolve those misunderstandings. Some might say it is a reckless attempt because history suggests that it is doomed to failure, and reckless also because it goes against much of the conventional wisdom regarding P-values and will therefore be seen by some as promoting inappropriate statistical practices. That's OK though, because the conventional wisdom regarding P-values is mistaken in important ways, and those mistakes fuel false suppositions regarding what practices are appropriate.
Statistics is complicated (even its grammatical form is complicated: "statistics" looks like a plural noun, but it is both plural when referring to values calculated from data and singular when referring to the discipline or approaches to data analysis), but it is usually presented simplistically in the statistics textbooks and courses studied by pharmacologists. Readers of those books and graduates of those courses should therefore be forgiven for wrongly assuming that statistics is a set of rules and recipes that must be applied in order to obtain a statistically valid, statistically significant result. The instructions say that you match the data to the recipe, turn the crank, and bingo: it's significant, or not. If you do it right then you might be rewarded with a star! No matter how explicable that simplistic view of statistics might be, it is far too limiting.

However, scientific inferences can be made more securely with statistics because it offers a rich set of tools for calibrating uncertainty. Some results are clear enough to pass the 'inter-ocular impact test' (in other words, results that hit you right between the eyes; in the Australian vernacular the inter-ocular impact test is the bloody obvious test), but statistical analysis is particularly helpful in the penumbral 'maybe zone' where the uncertainty is relatively evenly balanced—the zone where scientists are most likely to be swayed by biasses into over-interpretation of random deviations within the noise. The extra insight from a well-implemented statistical analysis can protect from the desire to find something notable, and thereby reduce the number of false claims made.

Most people need all the help they can get to prevent them making fools of themselves by claiming that their favourite theory is substantiated by observations that do nothing of the sort. —[Colquhoun, 1971, p. 1]

Improved utilisation of statistical approaches would indeed help to minimise the number of times that pharmacologists make fools of themselves by reducing the number of false positive results in pharmacological journals and, consequently, reduce the number of faulty leads that fail to translate into a therapeutic [Begley and Ellis, 2012]. However, even ideal application of the most appropriate statistical methods would not improve the replicability of published results quite as much as might be assumed, because not every result that fails to be replicated is a false positive and not every mistaken conclusion would be prevented by better statistical inferences.

Basic pharmacological studies are typically performed using biological models such as cell lines, tissue samples, or laboratory animals, and so even if the original results are not false positives a replication might fail when it is conducted using different models [Drucker, 2016]. Replications might also fail when the original results are critically dependent on unrecognised methodological details, or on reagents such as antibodies that have properties that can vary over time or between sources [Berglund et al., 2008, Baker and Dolgin, 2017, Voelkl et al., 2018]. It is those types of irreproducibility rather than false positives that are responsible for many failures of published leads to translate into clinical targets or therapeutics (see also Chapter 11). The distinction being made here is between false positive inferences, which lack 'internal validity', and replication failures that reflect a lack of 'external validity'.
Doing statistics is not enough, because statistics is not a set of rules for scientists to follow to make automated scientific inferences. To get from calibrated statistical inferences to reliable inferences about the real world, the statistical analyses have to be interpreted, thoughtfully and in full knowledge of the properties of the tool and the nature of the real world system being probed. Some researchers might be disconcerted by the fact that statistics cannot provide certainty, because they just want to be told whether their latest result is "real". No matter how attractive it might be to fob off onto statistics the responsibility for inferences, the answers that scientists seek cannot be answered by statistics alone.
P-values are not everything, and they are certainly not nothing. There are many, many useful procedures and tools in statistics that do not involve or provide P-values, but P-values are by far the most widely used inferential statistic in basic pharmacological research papers.

P-values are a practical success but a critical failure. Scientists the world over use them, but scarcely a statistician can be found to defend them. —[Senn, 2001, p. 193]

Not only are P-values rarely defended, they are frequently derided (e.g. Berger and Sellke [1987], Lecoutre et al. [2001], Goodman [2001], Wagenmakers [2007]). Even so, support for the continued use of P-values for at least some purposes with some caveats can be found (e.g. Nickerson [2000], Senn [2001], García-Pérez [2016], Krueger and Heck [2017]). One crucial caveat is that a clear distinction has to be drawn between the dichotomisation of P-values into 'significant' or 'not significant' (typically on the basis of a threshold set at 0.05) and the evidential meaning of the actual numerically specified P-value. The former comes from a hypothesis test and the latter from a significance test. Contrary to what many readers will think and have been taught, they are not the same things. It might be argued that the battle to retain a clear distinction between significance tests and hypothesis tests has long been lost, but I have to continue that battle here because that distinction is critical for understanding the uses and misuses of P-values. Detailed accounts can also be found elsewhere [Huberty, 1993, Senn, 2001, Hubbard et al., 2003, Lenhard, 2006, Hurlbert and Lombardi, 2009, Lew, 2012].

Hypothesis test and significance test
When comparing significance tests and hypothesis tests it is conventional to note that the former are 'Fisherian' (or, perhaps, "neoFisherian" [Hurlbert and Lombardi, 2009]) and the latter are 'Neyman–Pearsonian'. R.A. Fisher did not invent significance tests per se—Gossett published what became Student's t-test before Fisher's career had begun [Student, 1908] and even that is not the first example—but Fisher did effectively popularise their use with his book Statistical Methods for Research Workers (1925), and he is credited with (or blamed for!) the convention of P < 0.05 as a criterion for 'significance'. It is important to note that Fisher's 'significant' denoted something along the lines of worthy of further consideration or investigation, which is different to what is denoted by the same word applied to the results of a hypothesis test. Hypothesis tests came later, with the 1933 paper by Neyman & Pearson that set out the workings of dichotomising hypothesis tests and also introduced the ideas of "errors of the first kind" (false positive errors; type I errors) and "errors of the second kind" (false negative errors; type II errors) and a formalisation of the concept of statistical power.

It has been argued that because Fisher regularly described experimental results as 'significant' or 'not significant' he was treating P-values dichotomously and that he used a fixed threshold for that dichotomisation (e.g. [Lehmann, 2011, pp. 51–53]). However, Fisher meant the word 'significant' to denote only that a result is worthy of attention and follow up, and he quoted P-values as being less than 0.05, 0.02, and 0.01 because he was working from tables of critical values of test statistics rather than laboriously calculating exact P-values manually. He wrote about the issue on several occasions, for example this:

Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger, or weaker. —[Fisher, 1960, p. 25]

A Neyman–Pearsonian hypothesis test is more than a simple statistical calculation. It is a method that properly encompasses experimental planning and experimenter behaviour as well. Before an experiment is conducted, the experimenter chooses α, the size of the critical region in the distribution of the test statistic, on the basis of the acceptable false positive (i.e. type I) error rate and sets the sample size on the basis of an acceptable false negative (i.e. type II) error rate. In effect the sample size, power (the 'power' of the experiment is one minus the false negative error rate, but it is a function of the true effect size, as explained later), and α are traded off against each other to obtain an experimental design with the appropriate mix of cost and error rates. In order for the error rates of the procedure to be well calibrated, the sample size and α have to be set in advance of the experiment being performed, a detail that is often overlooked by pharmacologists. After the experiment has been run and the data are in hand, the mechanics of the test involve a determination of whether the observed value of the test statistic lies within a pre-determined critical region under the sampling distribution provided by a statistical model and the null hypothesis. When the observed value of the test statistic falls within the critical range the result is 'significant' and the analyst discards the null hypothesis. When the observed test statistic falls outside the critical range the result is 'not significant' and the null hypothesis is not discarded.

In current practice, dichotomisation of results into significant and not significant is most often made on the basis of the observed P-value being less than or greater than a conventional threshold of 0.05, so we have the familiar P < 0.05 standing in for α = 0.05. The one-to-one relationship between the test statistic being within the critical range and the P-value being less than α means that such practice is not intrinsically problematical, but using a P-value as an intermediate in a hypothesis test obscures the nature of the test and contributes to the conflation of significance tests and hypothesis tests.

The classical Neyman–Pearsonian hypothesis test is an acceptance procedure, or a decision theory procedure [Birnbaum, 1977, Hurlbert and Lombardi, 2009], that does not require, or provide, a P-value. Its output is a binary decision: either reject the null hypothesis or fail to reject the null hypothesis. In contrast, a Fisherian significance test yields a P-value that encodes the evidence in the data against the null hypothesis, but not, directly, a decision. The P-value is the probability of observing data as extreme as that observed, or more extreme, when the null hypothesis is true. That probability is generated or determined by a statistical model of some sort, and so we should really include the phrase 'according to the statistical model' in the definition. In the Fisherian tradition a P-value is interpreted evidentially: the smaller the P-value the stronger the evidence against the null hypothesis and the more implausible the null hypothesis is, according to the statistical model. No behavioural or inferential consequences attach to the observed P-value and no threshold need be applied because the P-value is a continuous index.

In practice the probabilistic nature of P-values has proved difficult to use because people tend to mistakenly assume that the P-value measures the probability of the null hypothesis or the probability of an erroneous decision—it seems that they prefer any probability that is more noteworthy or less of a mouthful than the probability according to a statistical model of observing data as extreme or more extreme when the null hypothesis is true. Happily, there are no ordinary uses of P-values that require them to be interpreted as probabilities. My advice is to forget that P-values can be defined as probabilities and instead look at them as indices of surprisingness or unusualness of data: the smaller the P-value the more surprising are the data compared to what the statistical model predicts when the null hypothesis is true.

Conflation of significance tests and hypothesis tests may be encouraged by their apparently equivalent outputs (significance and P-values), but the conflation is too often encouraged by textbook authors, even to the extent of presenting a hybrid approach containing features of both. The problem has deep roots: when Neyman & Pearson published their hypothesis test in 1933 it was immediately assumed that their test was an extension of Fisher's significance tests. Substantive differences in the philosophical and theoretical underpinnings of the two approaches were largely overlooked.
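To make the distinction concrete, here is a minimal R sketch using simulated data (the group means, the sample size and the α = 0.05 threshold are arbitrary illustrative choices, not values taken from this chapter): the same t-test output can be read as a Fisherian index of evidence or fed into a Neyman–Pearsonian decision.

set.seed(1)
control <- rnorm(5, mean = 20, sd = 2)   # hypothetical control group, n = 5
treated <- rnorm(5, mean = 23, sd = 2)   # hypothetical treated group, n = 5
p <- t.test(treated, control, var.equal = TRUE)$p.value

# Fisherian significance test: report the exact P-value as an index of the
# evidence against the null hypothesis (smaller = more surprising data).
p

# Neyman-Pearsonian hypothesis test: compare with a pre-specified alpha and
# make a dichotomous decision, discarding the exact value.
alpha <- 0.05
if (p < alpha) "discard the null hypothesis" else "do not discard the null hypothesis"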
The hybrid that results, commonly called null hypothesis significance testing (NHST), might be likened to the duck-billed platypus, which was thought to be a fake when it was first described—a composite creature consisting of parts of several animals—but the platypus is a real animal, rare but beautiful, and perfectly adapted to its ecological niche. The common NHST is assumed by its many users to be a proper statistical procedure but is, in fact, an ugly composite, maladapted for almost all analytic purposes.

(That the platypus was thought a fake is the conventional wisdom, but it may be an exaggeration. The first scientific description of the "duck-billed platypus" was done in England by Shaw & Nodder (1799), who wrote "Of all Mammalia yet known it seems the most extraordinary in its conformation; exhibiting the perfect resemblance of the beak of a Duck engrafted on the head of a quadruped. So accurate is the similitude that, at first view, it naturally excites the idea of some deceptive preparation by artificial means". If Shaw & Nodder really thought it a fake, they did not do so for long.)

No-one should be using NHST, but should we use hypothesis testing or significance testing? The answer should depend on what your analytical objectives are, but in practice it more often depends on who you ask. Not all advice is good advice, and not even the experts agree. Responses to the American Statistical Association's official statement on P-values provide a case in point. In response to the widespread expressions of concern over the misuse and misunderstanding of P-values, the ASA convened a group of experts to consider the issues and to collaborate on drafting an official statement on P-values [Wasserstein and Lazar, 2016]. Invited commentaries were published alongside the final statement, and even a brief reading of those commentaries will turn up misgivings and disagreements. Given that most of the commentaries were written
by participants in the expert group, such disquiet and dissent confirm the difficulty of this topic. It should also signal to readers that their practical familiarity with P-values does not ensure that they understand P-values.

The official ASA statement on P-values sets out six numbered principles concerning P-values and scientific inference:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the chance that the data were produced by random chance.
3. Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A P-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a P-value does not provide a good measure of evidence regarding a model or hypothesis.

Those principles are all sound—some derive directly from the definition of P-values and some are self-evidently good advice about the formation and reporting of scientific conclusions—but hypothesis tests and significance tests are not even mentioned in the statement and so it does not directly answer the question about whether we should use significance tests or hypothesis tests that I asked at the start of this section. Nevertheless, the statement offers a useful perspective and is not entirely neutral on the question. It urges against the use of a threshold in Principle 3, which says "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold." Without a threshold we cannot use a hypothesis test. Lest any reader think that the intention is that P-values should not be used, I point out that the explanatory note for that principle in the ASA document begins thus:

Practices that reduce data analysis or scientific inference to mechanical "bright-line" rules (such as "p < 0.05") for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. —[Wasserstein and Lazar, 2016]

The British Journal of Pharmacology, in contrast, instructs its authors to apply a fixed threshold for statistical significance:

When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups. —[Curtis et al., 2015]

Figure 1: P=0.04 is not very different from P=0.06. Pseudo-data devised to yield one-tailed P=0.06 (left) and P=0.04 (right) from a Student's t-test for independent samples, n = 5 per group. The y-axis is an arbitrarily scaled measure.

An updated version of the guidelines retains those instructions, but because it is a bad instruction I present three objections. The first is that routine use of an arbitrary P-value threshold for declaring a result significant ignores almost all of the evidential content of the P-value by forcing an all-or-none distinction between a P-value small enough and one not small enough. The arbitrariness of a threshold for significance is well known and flows from the fact that there is no natural cutoff point or inflection point in the scale of P-values. Anyone who is unconvinced that it matters should note that the evidence in a result of P=0.06 is not so different from that in a result of P=0.04 as to support an opposite conclusion (Figure 1).
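One way to see the point numerically (an illustrative check in R, not part of the original figure): for the design of Figure 1 (n = 5 per group, so 8 degrees of freedom for Student's t-test), the t statistics that correspond to one-tailed P-values of 0.06 and 0.04 differ only modestly.

qt(1 - 0.06, df = 8)   # t statistic giving one-tailed P = 0.06
qt(1 - 0.04, df = 8)   # t statistic giving one-tailed P = 0.04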
The second objection to the instruction to use a threshold of P < 0.05 is that exclusive focus on whether the result is above or below the threshold blinds analysts to information beyond the sample in question. If the statistical procedure says discard the null hypothesis (or don't discard it) then that statistical decision seems to override and make redundant any further considerations of evidence, theory, or scientific merit. That is quite dangerous, because all relevant material should be considered and integrated into scientific inferences.

The third objection refers to the strength of evidence needed to reach the threshold: the British Journal of Pharmacology instruction licenses claims on the basis of relatively weak evidence. The evidential disfavouring of the null hypothesis by a result near P=0.05 is modest (accepting P=0.05 as a sufficient reason to suppose that a treatment is effective is akin to accepting 50% as a passing grade: it's traditional in many settings, but it's far from reassuring). It would be possible to overcome this last objection by setting a lower threshold whenever an extraordinary claim is to be made (the notion that extraordinary claims require extraordinary evidence comes from the television series Cosmos, 1980, but may derive from Laplace (1812), who wrote "The weight of evidence for an extraordinary claim must be proportioned to its strangeness" [translated, the original is in French]), but the British Journal of Pharmacology instructions preclude such a choice by insisting that the same threshold be applied to all tests within the whole study. There has been a serious proposal that a lower default threshold of P < 0.005 be adopted [Benjamin et al., 2018].

Should the British Journal of Pharmacology enforce its guideline on the use of Neyman–Pearsonian hypothesis testing with a fixed threshold for statistical significance? Definitely not, and laboratory pharmacologists should usually avoid such tests because their nature is ill-suited to the reality of basic pharmacological studies. The shortcoming of hypothesis testing is that it offers an all-or-none outcome and it engenders a one-and-done response to an experiment. All-or-none in that the significant or not significant outcome is dichotomous. One-and-done because once a decision has been made to reject the null hypothesis there is little apparent reason to re-test that null hypothesis the same way, or differently. There is no mechanism within the classical Neyman–Pearsonian hypothesis testing framework for a result to be treated as provisional. That is not particularly problematical in the context of a classical randomised clinical trial (RCT) because an RCT is usually conducted only after preclinical studies have addressed the relevant biological questions. That allows the scientific aims of the study to be simple—they are designed to provide a definitive answer to the primary question. An all-or-none one-and-done hypothesis test is therefore appropriate for an RCT. (Clinical trials are sometimes aggregated in meta-analyses, but the substrate for meta-analytical combination is the observed effect sizes and sample sizes of the individual trials, not the dichotomised significant or not significant outcomes.) But the majority of basic pharmacological laboratory studies do not have much in common with an RCT because they consist of a series of interlinked and inter-related experiments contributing variously to the primary inference. For example, a basic pharmacological study will often include experiments that validate experimental methods and reagents, concentration–response curves for one or more drugs, positive and negative controls, and other experiments subsidiary to the main purpose of the study. The design of the 'headline' experiment (assuming there is one) and interpretation of its results is dependent
on the results of those subsidiary experiments, and even when there is a singular scientific hypothesis, it might be tested in several ways using observations within the study. It is the aggregate of all of the experiments that informs the scientific inferences. The all-or-none one-and-done outcome of a hypothesis test is less appropriate to basic research than it is to a clinical trial.

Pharmacological laboratory experiments also differ from RCTs in other ways that are relevant to the choice of statistical methodologies. Compared to an RCT, basic pharmacological research is very cheap, and the experiments can be completed very quickly, with the results available for analysis almost immediately. Those advantages mean that a pharmacologist might design some of the experiments within a study in response to results obtained in that same study (yes, that is also done in 'adaptive' clinical trials, but they are not the archetypical RCT that is the comparator here), and so a basic pharmacological study will often contain preliminary or exploratory research. Basic research and clinical trials also differ in the consequences of erroneous inference. A false positive in an RCT might prove very damaging by encouraging the adoption of an ineffective therapy, but in the much more preliminary world of basic pharmacological research a false positive result might have relatively little influence on the wider world. It could be argued that statistical protections against false positive outcomes that are appropriate in the realm of clinical trials can be inappropriate in the realm of basic research. This idea is illustrated in a later section of this chapter.

The multi-faceted nature of the basic pharmacological study means that statistical approaches yielding dichotomous yes or no outcomes are less relevant than they are to the archetypical RCT. The scientific conclusions drawn from basic pharmacological experiments should be based on thoughtful consideration of the entire suite of results in conjunction with any other relevant information, including both pre-existing evidence and theory. The dichotomous all-or-none, one-and-done hypothesis test is poorly adapted to the needs of basic pharmacological experiments, and is probably poorly adapted to the needs of most basic scientific studies. Scientific studies depend on a detailed evaluation of evidence but a hypothesis test does not fully support such an evaluation.

A way to understand the difference between the Fisherian significance test and the Neyman–Pearsonian hypothesis test is to recognise that the former supports 'local' inference, whereas the latter is designed to protect against 'global' long-run error. The P-value of a significance test is local because it is an index of the evidence in this data against this null hypothesis. In contrast, the hypothesis test decision regarding rejection of the null hypothesis is global because it is based on a parameter, α, which is set without reference to the observed data. The long run performance of the hypothesis test is a property of the procedure itself and is independent of any particular data, and so it is global. Local evidence; global errors. This is not an ahistoric imputation, because Neyman and
Pearson were clear about their preference for global error protection rather than local evidence and their objectives in devising hypothesis tests:

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.
But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. —Neyman and Pearson [1933]

The distinction between local and global properties or information is relatively little known, but Liu and Meng [2016] offer a much more technical and complete discussion of the local/global distinction, using the descriptors 'individualised' and 'relevant' for the local and 'robust' for the global. They demonstrate a trade-off between relevance and robustness that requires judgement on the part of the analyst. In short, the desirability of methods that have good long-run error properties is undeniable, but paying attention exclusively to the global blinds us to the local information that is relevant to inferences. The instructions of the British Journal of Pharmacology are inappropriate because they attend entirely to the global and because the dichotomising of each experimental result into significant and not significant hinders thoughtful inference. Many of the battles and controversies regarding statistical tests swirl around issues that might be clarified using the local versus global distinction, and so it will be referred to repeatedly in what follows.
In order to be able to safely interpret the local, evidential, meaning of a P-value, a pharmacologist should understand its scaling. Just like the EC50s with which pharmacologists are so familiar, P-values have a bounded scale, and just as is the case with EC50s it makes sense to scale P-values geometrically (or logarithmically). The non-linear relationship between P-values and an intuitive scaling of evidence against the null hypothesis can be gleaned from Figure 2. Of course, a geometric scaling of the evidential meaning of P-values implies that the descriptors of evidence should be similarly scaled, and so such a scale is proposed in Figure 3, with P-values around 0.05 being called 'trivial' in recognition of the relatively unimpressive evidence for a real difference between condition A and control in Figure 2.

Figure 2: What simple evidence looks like. Pseudo-data devised to yield one-tailed P-values from 0.05 to 0.0001 from a Student's t-test for independent samples, n = 5 per group. The left-most group of values is the control against which each of the other sets is compared, and the pseudo-datasets A, B, C, and D were generated by arithmetic adjustment of a single dataset to obtain the indicated P-values. The y-axis is an arbitrarily scaled measure.

Attentive readers will have noticed that the P-values in Figures 1, 2, and 3 are all one-tailed. The number of tails that published P-values have is inconsistent, is often unspecified, and the number of tails that a P-value should have is controversial (e.g. see Dubey [1991], Bland and Bland [1994], Kobayashi [1997], Freedman [2008], Lombardi and Hurlbert [2009], Ruxton and Neuhaeuser [2010]). Arguments about P-value tails are regularly confounded by differences between local and global considerations. The most compelling reasons to favour two tails relate to global error rates, which means that they apply only to P-values that are dichotomised into significant and not significant in a hypothesis test. Those arguments can safely be ignored when P-values are used as indices of evidence and I therefore recommend one-tailed P-values for general use in pharmacological experiments—as long as the P-values are interpreted as evidence and not as a surrogate for decision. (Either way, the number of tails should always be specified.)

The Neyman–Pearsonian hypothesis test is a decision procedure that, with a few assumptions, can be an optimal procedure. Optimal only in the restricted sense that the smallest sample gives the highest power to reject the null hypothesis when it is false, for any specified rate of false positive errors. To achieve that optimality the experimental sample size and α are selected prior to the experiment using a power analysis and with consideration of the costs of the two specified types of error and the benefits of potentially correct decisions. In other words, there is a loss function built into the design of experiments. However, outside of the clinical trials arena, few pharmacologists seem to design experiments in that way. For example, a study of 22 basic biomedical research papers published in Nature Medicine found that none of them included any mention of a power analysis for setting the sample size [Strasak et al., 2007], and a simple survey of the research papers in the most recent issue of
British Journal of Pharmacology (2018, issue 17 of volume 175) gives a similar picture, with power analyses specified in only one out of the 11 research papers that used P-values.

Figure 3: Evidential descriptors for P-values. Strength of evidence against the null hypothesis scales semi-geometrically with the smallness of the P-value. Note that the descriptors for strength of evidence are illustrative only, and it would be a mistake to assume, for example, that a P-value of 0.001 indicates moderately strong evidence against the null hypothesis in every circumstance.

The BJP papers included statements in their methods sections claiming compliance with the guidelines for experimental design and analysis, guidelines that include this as the first key point:

Experimental design should be subjected to a priori power analysis so as to ensure that the size of treatment and control groups is adequate [...] —[Curtis et al., 2015]

The most recent issue of
Journal of Pharmacology and Experimental Therapeutics (2018, issue 3 of volume 366) similarly contains no mention of power or sample size determination in any of its 9 research papers, although none of its authors had to pay lip service to guidelines requiring it.

In reality, power analyses are not always necessary or helpful. They have no clear role in the design of a preliminary or exploratory experiment that is concerned more with hypothesis generation than hypothesis testing, and a large fraction of the experiments published in basic pharmacological journals are exploratory or preliminary in nature. Nonetheless, they are described here in detail because experience suggests they are mysterious to many pharmacologists and they are very useful for planning confirmatory experiments.

For a simple test like Student's t-test a pre-experiment power analysis for determination of sample size is easily performed. The power of a Student's t-test is dependent on: (i) the predetermined acceptable false positive error rate, α (bigger α gives more power); (ii) the true effect size, which we will denote as δ (more power when δ is larger); (iii) the population standard deviation, σ (smaller σ gives more power); and (iv) the sample size (larger n for more power). The common approach to a power test is to specify an effect size of interest and
the minimum desired power, so say we wish to detect a true effect of δ = 3 in a system where we expect the standard deviation to be σ = 2. The free software called R has the function power.t.test() that gives this result:

> power.t.test(delta=3, sd=2, power=0.8, sig.level=0.05,
               alternative='one.sided', n=NULL)

     Two-sample t test power calculation

              n = 6.298691
          delta = 3
             sd = 2
      sig.level = 0.05
          power = 0.8
    alternative = one.sided

NOTE: n is number in *each* group

It is conventional to round the sample size up to the next integer so the sample size would be 7 per group.

While a single point power analysis like that is straightforward, it provides relatively little information compared to the information supplied by the analyst, and its output is specific to the particular effect size specified, an effect size that more often than not has to be 'guesstimated' instead of estimated because it is the unknown that is the object of study. A plot of power versus effect size is far more informative than the point value supplied by the conventional power test (Figure 4). Those graphical power functions show clearly the three-way relationship between sample size, effect size and the risk of a false negative outcome (i.e. one minus the power).

Figure 4: Power functions for α = 0.05 and α = 0.005. Power of one-sided Student's t-test for independent samples expressed as a function of standardised true effect size δ/σ for sample sizes (per group) from n = 3 to n = 40. Note that δ = µ1 − µ2 and σ are population parameters rather than sample estimates.

A common response to a disappointingly large P-value is to ask what the sample size should have been, and proceed to plug in the observed effect size and standard deviation and pull out a larger sample size—always larger—that might have given the desired small P-value. The interpretation is then that the result would have been significant but for the fact that the experiment was underpowered. That interpretation ignores the fact that the observed effect size might be an exaggeration, or the observed standard deviation might be an underestimation and the null hypothesis might be true! Such a procedure is generally inappropriate and dangerous [Hoenig and Heisey, 2001]. There is a one-to-one correspondence of observed P-value and post-experiment power and, no matter what the sample size, a larger than desired P-value always corresponds to a low power at the observed effect size, whether the null hypothesis is true or false. Power analyses are useful in the design of experiments, not for the interpretation of experimental results.

Power analyses are tied closely to dichotomising Neyman–Pearsonian hypothesis tests, even when expanded to provide full power functions as in Figure 4. However, there is an alternative more closely tied to Fisherian significance testing—an approach better aligned to the objectives of evidence gathering. That alternative is a plot of average expected P-values as functions of effect size and sample size [Sackrowitz and Samuel-Cahn, 1999, Bhattacharya and Habtzghi, 2002]. The median is more relevant than the mean, both because the distribution of expected P-values is very skewed and because the median value offers a convenient interpretation of there being a 50:50 bet that an observed P-value will be either side of it. An equivalent plot showing the 90th percentile of expected P-values gives another option for experiment sample size planning purposes (Figure 5).

Should the British Journal of Pharmacology enforce its power guideline?
In general no, but pharmacologists should use power curves or expected P-value curves for designing some of their experiments, and ought to say so when they do. Power analyses for sample size are very important for experiments that are intended to be definitive and decisive, and that's why sample size considerations are dealt with in detail when planning clinical trials. Even though the majority of experiments in basic pharmacological research papers are not like that, as discussed above, even preliminary experiments should be planned to a degree, and power curves and expected P-value curves are both useful in that role.
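For readers who want to draw curves like those in Figures 4 and 5 for their own designs, a minimal R sketch follows. It assumes a one-sided two-sample Student's t-test; power.t.test() gives the power curve directly, and the median expected P-value is estimated by simulation (the grid of effect sizes, the number of simulated experiments and the single sample size shown are arbitrary choices for illustration).

effect_sizes <- seq(0.1, 4, by = 0.1)   # standardised effect size, delta/sigma

# Power at each effect size for n = 5 per group and alpha = 0.05 (cf. Figure 4).
power_n5 <- sapply(effect_sizes, function(d)
  power.t.test(n = 5, delta = d, sd = 1, sig.level = 0.05,
               alternative = "one.sided")$power)

# Median expected one-tailed P-value at the same effect sizes (cf. Figure 5),
# estimated from 2000 simulated experiments per effect size.
set.seed(5)
median_p_n5 <- sapply(effect_sizes, function(d)
  median(replicate(2000,
    t.test(rnorm(5, mean = d), rnorm(5), var.equal = TRUE,
           alternative = "greater")$p.value)))

plot(effect_sizes, power_n5, type = "l",
     xlab = "Effect size (delta/sigma)", ylab = "Power, n = 5 per group")
plot(effect_sizes, median_p_n5, type = "l", log = "y",
     xlab = "Effect size (delta/sigma)", ylab = "Median expected P-value")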
Practical problems with P-values

The sections above deal with the most basic misconceptions regarding the nature of P-values, but critics of P-values usually focus on other important issues. In this section I will deal with the significance filter, multiple comparisons, and some forms of P-hacking, and I need to point out immediately that most of the issues are not specific to P-values even if some of them are enabled by the unfortunate dichotomisation of P-values into significant and not significant. In other words, the practical problems with P-values are largely the practical problems associated with the misuse of P-values and with sloppy statistical inference generally.
Figure 5: Expected P-value functions.
P-values expected from Student's t-test for independent samples expressed as a function of standardised true effect size δ/σ for sample sizes (per group) from n = 3 to n = 40. The graph on the left shows the median of expected P-values (i.e. the 50th percentile) and the graph on the right shows the 90th percentile. It can be expected that 50% of observed P-values will lie below the median lines and 90% will lie below the 90th percentile lines for corresponding sample sizes and effect sizes. The dashed lines indicate P=0.05 and 0.005.

It is natural to assume that the effect size observed in an experiment is a good estimate of the true effect size, and in general that can be true. However, there are common circumstances where the observed effect size consistently overestimates the true effect size, sometimes wildly so. The overestimation depends on the facts that experimental results exaggerating the true effect are more likely to be found statistically significant, and that we pay more attention to the significant results and are more likely to report them. The key to the effect is selective attention to a subset of results—the significant results—and so the process is appropriately called the significance filter.

If there is nothing untoward in the sampling mechanism (that is not a safe assumption, in particular because a haphazard sample is not a random sample: when was the last time that you used something like a random number generator for allocation of treatments?), sample means are unbiassed estimators of population means and sample-based standard deviations are nearly unbiassed estimators of population standard deviations (the variance is unbiassed but the non-linear square root transformation into the standard deviation damages that unbiassed-ness; standard deviations calculated from small samples are biassed toward underestimation of the true standard deviation, and if the true standard deviation is 1 the expected average observed standard deviation for samples of n = 5 is 0.92). Because of that we can assume that, on average, a sample mean and a sample standard deviation are reasonable estimates of the corresponding population parameters, but that assumption fails for results selected by the significance filter. The way it works is fairly easily described but, as usual, there are some complexities in its interpretation.

Say we are in the position to run an experiment 100 times with random samples of n = 5 from a single normally distributed population with mean µ = 1 and standard deviation σ = 1. We would expect that, on average, the sample means, x̄, would be scattered symmetrically around the true value of 1, and the sample-based standard deviations, s, would be scattered around the true value of 1, albeit slightly asymmetrically. A set of 100 simulations matching that scenario show exactly that result (see the left panel of Figure 6), with the median of x̄ being 0.97 and the median of s being 0.94, both of which are close to the expected values of exactly 1 and about 0.92, respectively. If we were to pay attention only to the results where the observed P-value was less than 0.05 (with the null hypothesis being that the population mean is 0), then we get a different picture because the values are very biassed (see the right panel of Figure 6).
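The simulation just described is easy to reproduce. Here is a minimal R sketch; the seed and the use of a two-sided one-sample t-test against a null mean of zero are my choices, so the medians it returns will differ a little from the values quoted in the text.

set.seed(4)
sims <- replicate(100, {
  x <- rnorm(5, mean = 1, sd = 1)      # one sample of n = 5
  c(mean = mean(x), sd = sd(x),
    p = t.test(x, mu = 0)$p.value)     # null hypothesis: population mean is 0
})
sims <- as.data.frame(t(sims))

# All 100 samples: means and standard deviations scatter around 1.
median(sims$mean)
median(sims$sd)

# The significance filter: look only at the 'significant' results.
sig <- sims[sims$p < 0.05, ]
median(sig$mean)              # biassed upward
median(sig$sd)                # biassed downward
median(sig$mean / sig$sd)     # standardised effect size; the true value is 1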
Among the 'significant' results the median sample mean is 1.2 and the median standard deviation is 0.78. The systematic bias of mean and standard deviation among 'significant' results in those simulations might not seem too bad, but it is conventional to scale the effect size as the standardised ratio x̄/s (that ratio is often called Cohen's d; pharmacologists should pay no attention to Cohen's specifications of small, medium and large effect sizes [Cohen, 1992] because they are much smaller than the effects commonly seen in basic pharmacological experiments), and the median of that ratio among the 'significant' results is fully 50% larger than the correct value. What's more, the biasses get worse with smaller samples, with smaller true effect sizes, and with lower P-value thresholds for 'significance'.

It is notable that even the results with the most extreme exaggeration of effect size in Figure 6—550%—would not be counted as an error within the Neyman–Pearsonian hypothesis testing framework! It would not lead to the false rejection of a true null or to an inappropriate failure to reject a false null and so it is neither a type I nor a type II error. But it is some type of error, a substantial error in estimation of the magnitude of the effect. The term type M error has been devised for exactly that kind of error [Gelman and Carlin, 2014]. A type M error might be underestimation as well as overestimation, but overestimation is the more common in theory [Lu et al., 2018] and in practice [Camerer et al., 2018].

The effect size exaggeration coming from the significance filter is not a result of sampling, or of significance testing, or of P-values. It is a result of paying extra attention to a subset of all results—the 'significant' subset.

Figure 6: The significance filter. The dots in the graphs are means and standard deviations of samples of n = 5 drawn from a normally distributed population with mean µ = 1 and standard deviation σ = 1. The left panel shows all 100 samples and the right panel shows only the results where P < 0.05.

The significance filter presents a peculiar difficulty. It leads to exaggeration on average, but any particular result may well be close to the correct size. An observed x̄ = 1.2 might be an exaggeration of a true µ = 1, it might be an underestimation of µ = 2, or it might be pretty close to a true µ = 1.2. There is no way to tell without knowing µ, and if µ were known then the experiment would probably not have been necessary in the first place. That means that the possibility of a type M error looms over any experimental result that is interesting because of a small P-value, and that is particularly true when the sample size is small. The only way to gain more confidence that a particular significant result closely approximates the true state of the world is to repeat the experiment—the second result would not have been run through the significance filter and so its results would not have a greater than average risk of exaggeration, and the overall inference can be informed by both results. Of course, experiments intended to repeat or replicate an interesting finding should take the possible exaggeration into account by being designed to have higher power than the original.

Multiple testing is the situation where the tension between global and local considerations is most stark. It is also the situation where the well-known jelly beans cartoon from XKCD.com is irresistible (Figure 7). The cartoon scenario is that jelly beans were suspected of causing acne, but a test found "no link between jelly beans and acne (P > 0.05)". Twenty individual colours of jelly beans were then tested and only the green jelly beans showed a 'significant' link with acne (P < 0.05). Any single test with a threshold of P <
0.05 is expected to yield a false positive one time in 20, on average, when the null is true. The more hypothesis tests there are, the higher the risk that one of them will yield a false positive result. The textbook response to multiple comparisons is to introduce 'corrections' that protect an overall maximum false positive error rate by adjusting the threshold according to the number of tests in the family, to give protection from inflation of the family-wise false positive error rate. The Bonferroni adjustment is the best-known method, and while there are several alternative 'corrections' that perform a little better, none of those is nearly as simple. A Bonferroni adjustment for the family of experiments in the cartoon would preserve an overall false positive error rate of 5% by setting a threshold for significance of 0.05/20 = 0.0025.
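As an illustration (the individual P-values are hypothetical, because the cartoon does not report them), the adjustment can be applied either to the threshold or, equivalently, to the P-values themselves with R's p.adjust():

# Hypothetical P-values for the 20 colour-specific tests: green at 0.003 and
# unremarkable values for the other 19 colours.
set.seed(2)
p_colours <- c(green = 0.003, runif(19, min = 0.05, max = 1))

alpha_family <- 0.05 / length(p_colours)               # adjusted threshold, 0.0025
p_bonf <- p.adjust(p_colours, method = "bonferroni")   # adjusted P-values

p_colours["green"] < 0.05           # 'significant' as a stand-alone significance test
p_colours["green"] < alpha_family   # not 'significant' after Bonferroni adjustment
p_bonf["green"]                     # 0.003 * 20 = 0.06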
It must be noted that such protection does not come for free, because adjustments for multiplicity invariably strip statistical power from the analysis.

We do not know whether the 'significant' link between green jelly beans and acne would survive a Bonferroni adjustment because the actual P-values were not supplied (which serves to illustrate one facet of the inadequacy of reporting 'P less thans' in place of actual P-values), but as an example, a P-value of 0.003, low enough to be quite encouraging as the result of a significance test, would be 'not significant' according to the Bonferroni adjustment. Such a result would present us with a serious dilemma, because the inference supported by the local evidence would be apparently contradicted by global error rate considerations. However, that contradiction is not what it seems, because the null hypothesis of the significance test P-value is a different null hypothesis from that tested by the Bonferroni-adjusted hypothesis test. The significance test null concerns only the green jelly beans, whereas the null hypothesis of the Bonferroni is an omnibus null hypothesis that says that the link between green jelly beans and acne is zero and the link between purple jelly beans and acne is zero and the link between brown jelly beans and acne is zero, and so on. The P-value null hypothesis is local and the omnibus null is global. The global null hypothesis might be appropriate before the evidence is available (i.e. for power calculations and experimental planning), but after the data are in hand the local null hypothesis concerning just the green jelly beans gains importance.

It is important to avoid being blinded to the local evidence by a non-significant global result. After all, the pattern of evidence in the cartoon is exactly what would be expected if the green colouring agent caused acne: green jelly beans are associated with acne but the other colours are not. (The failure to see an effect of the mixed jelly beans in the first test is easily explicable on the basis of the lower dose of green. You may also notice that the first test of jelly beans, without reference to colour, has been ignored in the adjustment above: there is no set rule for saying exactly which experiments constitute a family for the purposes of correction of multiplicity.) If the data from the trial of green jelly beans are to support a claim about acne then a fresh experiment is needed, and all of the original experiments should be reported alongside the new, as well as reasoned argument incorporating corroborating or rebutting information and theory.

The fact that a fresh experiment is necessary to allow a straightforward conclusion about the effect of the green jelly beans means that the experimental series shown in the cartoon is a preliminary, exploratory study. Preliminary or exploratory research is essential to scientific progress and can merit publication as long as it is reported completely and openly as preliminary. Too often scientists fall into the pattern of misrepresenting the processes that lead to their experimental results, perhaps under the mistaken assumption that science has to be hypothesis driven [Medawar, 1963, du Prel et al., 2009, Howitt and Wilson, 2014]. That misrepresentation may take the form of a suggestion, implied or stated, that the green jelly beans were the intended subject of the study, a behaviour described as
HARKing, for hypothesising after the results are known, or as cherry picking, where only the significant results are presented. The reason that HARKing is problematical is that hypotheses cannot be tested using the data that suggested the hypothesis in the first place, because those data always support that hypothesis (otherwise they would not be suggesting it!), and cherry picking introduces a false impression of the nature of the total evidence and allows the direct introduction of experimenter bias. Either way, focussing on just the unusual observations from a multitude is bad science. It takes little effort and few words to say that 20 colours were tested and only the green yielded a statistically significant effect, and a scientist can (should) then hypothesise that green jelly beans cause acne and test that hypothesis with new data.

P-hacking is where an experiment or its analysis is directed at obtaining a small enough P-value to claim significance instead of being directed at the clarification of a scientific issue or testing of a hypothesis. Deliberate P-hacking does happen, perhaps driven by the incentives built into the systems of academic reward and publication imperatives, but most P-hacking is accidental—honest researchers doing 'the wrong thing' through ignorance. P-hacking is not always as wrong as might be assumed, as the idea of P-hacking comes from paying attention exclusively to global considerations of error rates, and most particularly to false positive error rates. Those most stridently opposed to P-hacking will point to the increased risk of false positive errors, but rarely to the lowered risk of false negative errors. I will recklessly note that some categories of P-hacking look entirely unproblematical when viewed through the prism of local evidence. The local versus global distinction allows a more nuanced response to P-hacking.

Some P-hacking is outright fraud. Consider this example that has recently come to light:

One sticking point is that although the stickers increase apple selection by 71%, for some reason this is a p value of
.06. It seems to me it should be lower. Do you want to take a look at it and see what you think. If you can get the data, and it needs some tweeking, it would be good to get that one value below .05. —Email from Brian Wansink to David Just on Jan. 7, 2012. [Lee, 2018]

I do not expect that any readers would find P-hacking of that kind to be acceptable. However, the line between fraudulent P-hacking and the more innocent P-hacking through ignorance is hard to define, particularly so given the fact that some behaviours derided as P-hacking can be perfectly legitimate as part of a scientific research program. Consider this cherry picked list of responses to a P-value being greater than 0.05 that have been described as P-hacking [Motulsky, 2014] (there are nine responses specified in the original but I discuss only five: cherry picking!):

• Analyze only a subset of the data;
• Remove suspicious outliers;
• Adjust data (e.g. divide by body weight);
• Transform the data (i.e. logarithms);
• Repeat to increase sample size (n).

Before going any further I need to point out that Motulsky has a more realistic attitude to P-hacking than might be assumed from my treatment of his list. He writes: "If you use any form of P-hacking, label the conclusions as 'preliminary'." [Motulsky, 2014, p. 1019].

Analysis of only a subset of the data is illicit if the unanalysed portion is omitted in order to manipulate the P-value, but unproblematical if it is omitted for being irrelevant to the scientific question at hand. Removal of suspicious outliers is similar in being only sometimes inappropriate: it depends on what is meant by the term "outlier". If it indicates that a datum is a mistake such as a typographical or transcriptional error, then of course it should be removed (or corrected). If an outlier is the result of a technical failure of a particular run of the experiment then perhaps it should be removed, but the technical success or failure of an experimental run must not be judged by the influence of its data on the overall P-value. If the word outlier just denotes a datum that is further from the mean than the others in the dataset, then omit it at your peril! Omission of that type of outlier will reduce the variability in the data and give a lower P-value, but will markedly increase the risk of false positive results and it is, indeed, an illicit and damaging form of P-hacking.

Adjusting the data by standardisation is appropriate—desirable even—in some circumstances. For example, if a study concerns feeding or organ masses then standardising to body weight is probably a good idea. Such manipulation of data should be considered P-hacking only if an analyst finds a too large P-value in unstandardised data and then tries out various re-expressions of the data in search of a low P-value, and then reports the results as if that expression of the data was intended all along. The P-hackingness of log-transformation is similarly situationally dependent. Consider pharmacological EC50s or drug affinities: they are strictly bounded at zero and so their distributions are skewed. In fact the distributions are quite close to log-normal, and so log-transformation gives more power to parametric tests and it is common that significance testing of logEC50s gives lower P-values than significance testing of the untransformed EC50s. An experienced analyst will choose the log-transformation because it is known from empirical and theoretical considerations that the transformation makes the data better match the expectations of a parametric statistical analysis.
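A small illustration of the point about EC50s, using simulated data (the potencies, group sizes and shift are invented for the example): the log-transformation is chosen because EC50-like quantities are approximately log-normal, not because of the P-value it happens to produce.

set.seed(6)
ec50_control <- 10^rnorm(6, mean = -6.0, sd = 0.3)   # hypothetical molar EC50s, around 1 uM
ec50_treated <- 10^rnorm(6, mean = -5.5, sd = 0.3)   # hypothetical right-shifted EC50s

# Student's t-test on the raw (skewed) EC50s versus on log10 EC50s, which
# better match the assumptions of the parametric test.
t.test(ec50_treated, ec50_control, var.equal = TRUE)$p.value
t.test(log10(ec50_treated), log10(ec50_control), var.equal = TRUE)$p.value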
The last form of P-hacking in the list requires a good deal more consideration than the others because, well, statistics is complicated. That consideration is facilitated by a concrete scenario, a scenario that might seem surprisingly realistic to some readers. Say you run an experiment with n = 5 observations in each of two independent groups, one treated and one control, and obtain a P-value of 0.07 from Student's t-test. You might stop and integrate the very weak evidence against the null hypothesis into your inferential considerations, but you decide that more data will clarify the situation. Therefore you run some extra replicates of the experiment to obtain a total of n = 10 observations in each group (including the initial 5), and find that the P-value for the data in aggregate is 0.002. The risk of the 'significant' result being a false positive error is elevated because the data have had two chances to lead you to discard the null hypothesis. Conventional wisdom says that you have P-hacked. However, there is more to be considered before the experiment is discarded.

Conventional wisdom usually takes the global perspective. As mentioned above, it typically privileges false positive errors over any other consideration, and calls the procedure invalid. However, the extra data have added power to the experiment and lowered the expected P-value for any true effect size. From a local evidence point of view, increasing the sample increases the amount of evidence available for use in inference, which is a good thing. Is extending an experiment after the statistical analysis a good thing or a bad thing? The conventional answer is that it is a bad thing and so the conventional advice is don't do it! However, a better response might balance the bad effect of extending the experiment against the good. Consideration of the local and global aspects of statistical inference allows a much more nuanced answer. The procedure described would be perfectly acceptable for a preliminary experiment.

Technically, the two-stage procedure in that scenario allows optional stopping. The scenario is not explicit, but it can be discerned that the stopping rule was, in effect: run n = 5 and inspect the P-value; if it is small enough then stop and make inferences about the null hypothesis; if the P-value is not small enough for the stop but nonetheless small enough to represent some evidence against the null hypothesis, add an extra 5 observations to each group to give n = 10, stop, and analyse again. We do not know how low the interim P-value would have to be for the protocol to stop, and we do not know how high it could be and the extra data still be gathered, but no matter where those thresholds are set, such stopping rules yield false positive rates higher than the nominal critical value for stopping would suggest. Because of that, the conventional view (the global perspective, of course) is that the protocol is invalid, but it would be more accurate to say that such a protocol would be invalid unless the P-value or the threshold for a Neyman–Pearsonian dichotomous decision is adjusted, as would be done with a formal sequential test. It is interesting to note that the elevation of the false positive rate is not necessarily large: simulations of the scenario as specified show only a modest inflation above the nominal rate. After all, the data are exactly the same as if the experimenter had intended to obtain n = 10 from the beginning.
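The claim can be checked directly. Below is a minimal simulation sketch of the two-stage procedure under a true null hypothesis; the interim thresholds (stop and declare 'significance' below 0.05, extend the experiment when the interim P-value is between 0.05 and 0.15, otherwise stop) are assumptions made for illustration because the scenario leaves them unspecified:

# Minimal sketch of the two-stage scenario under a true null hypothesis.
# The interim thresholds are hypothetical: stop and declare 'significance'
# if the interim P < 0.05, extend to n = 10 per group if 0.05 <= P < 0.15,
# otherwise stop without rejecting the null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 20_000
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, 10)   # control; only the first 5 used at interim
    b = rng.normal(0.0, 1.0, 10)   # treated; the null is true (no difference)
    p_interim = stats.ttest_ind(a[:5], b[:5]).pvalue
    if p_interim < 0.05:
        false_positives += 1
    elif p_interim < 0.15:                      # weak evidence: add 5 more per group
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1

print(f"overall false positive rate ~ {false_positives / n_sims:.3f}")
# Above the nominal 0.05, but with these thresholds the inflation is modest.

With these assumed thresholds the estimated rate comes out above the nominal 0.05 but not dramatically so; other thresholds give other degrees of inflation.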
The optional stopping has changed the global properties of the statistical procedure but not the local evidence, which is contained in the actualised data.

You might be wondering how it is possible for the local evidence to be unaffected by a process that increases the global false positive error rate. The rationale is that the evidence is contained within the data but the error rate is a property of the procedure: evidence is local and error rates are global. Recall that false positive errors can only occur when the null hypothesis is true. If the null is true then the procedure has increased the risk of the data leading us to a false positive decision, but if the null is false then the procedure has decreased the risk of a false negative decision. Which of those has paid out in this case cannot be known because we do not know the truth of this local null hypothesis. It might be argued that an increase in the global risk of false positive decisions should outweigh the decreased risk of false negatives, but that is a value judgement that ought to take into account the particulars of the experiment in question, the role of that experiment in the overall study, and other contextual factors that are unspecified in the scenario and that vary from circumstance to circumstance.

So, what can be said about the result of that scenario? The result of P = 0.002 provides moderately strong evidence against the null hypothesis, but it was obtained from a procedure with sub-optimal false positive error characteristics. That sub-optimality should be accounted for in the inferences that are made from the evidence, but it is only confusing to say that it alters the evidence itself, because it is the data that contain the evidence and the sub-optimality did not change the data. Motulsky provides good advice on what to do when your experiment has involved optional stopping:

• For each figure or table, clearly state whether or not the sample size was chosen in advance, and whether every step used to process and analyze the data was planned as part of the experimental protocol.
• If you used any form of P-hacking, label the conclusions as "preliminary."

Given that basic pharmacological experiments are often relatively inexpensive and quickly completed, one can add to that list the option of also corroborating (or not) those results with a fresh experiment designed to have a larger sample size (remember the significance filter exaggeration machine) and performed according to the design. Once we move beyond the globalist mindset of one-and-done, such an option will seem obvious.
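If a corroborating experiment is to be run, its sample size should be chosen in advance. A minimal simulation sketch of that choice is given below; the standardised effect size of 0.8 is an assumption chosen purely for illustration, deliberately on the conservative side because the effect estimated from the 'significant' preliminary experiment is likely to be exaggerated by the significance filter:

# Minimal sketch (assumed numbers throughout): choosing n for a fresh,
# pre-specified corroborating experiment by simulating the power of a
# two-sample t-test for a plausible standardised effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def power(n, effect=0.8, alpha=0.05, n_sims=5_000):
    """Estimated power of a two-sample t-test for a given per-group n."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)        # control group
        b = rng.normal(effect, 1.0, n)     # treated group with the assumed effect
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in (10, 15, 20, 25, 30):
    print(f"n = {n:2d} per group: estimated power ~ {power(n):.2f}")

Dedicated power routines give the same kind of answer more quickly, but a simulation makes the logic explicit and extends easily to more realistic designs.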
I remind the reader that this chapter is written under the assumption that pharmacologists can be trusted to deal with the full complexity of statistics. That assumption gives me licence to discuss unfamiliar notions like the role of the statistical model in statistical analysis. All too often the statistical model is invisible to ordinary users of statistics, and that invisibility encourages thoughtless use of flawed and inappropriate models, thereby contributing to the misuse of inferential statistics like P-values.

A statistical model is what allows the formation of calibrated statistical inferences and non-trivial probabilistic statements in response to data. The model does that by assigning probabilities to potential arrangements of data. A statistical model can be thought of as a set of assumptions, although it might be more realistic to say that a chosen statistical model imposes a set of assumptions onto the experimenter.

I have often been struck by the extent to which most textbooks, on the flimsiest of evidence, will dismiss the substitution of assumptions for real knowledge as unimportant if it happens to be mathematically convenient to do so. Very few books seem to be frank about, or perhaps even aware of, how little the experimenter actually knows about the distribution of errors in his observations, and about facts that are assumed to be known for the purposes of statistical calculations. —[Colquhoun, 1971, p. v]

Statistical models can take a variety of forms [McCullagh, 2002], but the model for the familiar Student's t-test for independent samples is reasonably representative. That model consists of assumed distributions (normal) of two populations with means (µ1 and µ2) and standard deviations (σ1 and σ2), and a rule for obtaining samples (e.g. a randomly selected sample of n = 6 observations from each population). The ordinary Student's t-test assumes that σ1 = σ2, but the Welch–Satterthwaite variant relaxes that assumption. A specified value of the difference between means serves as the null hypothesis, so H0: µ1 − µ2 = δH. The test statistic is

t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_H}{s_p \sqrt{1/n_1 + 1/n_2}}

where x̄1 and x̄2 are the sample means and s_p is the pooled standard deviation. (Oh no! An equation! Don't worry, it's the only one, and, anyway, it is too late now to stop reading.) The explicit inclusion of a null hypothesis term in the equation for t is relatively rare, but it is useful because it shows that the null hypothesis is just a possible value of the difference between means. Most commonly the null hypothesis says that the difference between means is zero (it can be called a 'nil-null') and in that case the omission of δH from the equation makes no numerical difference.

Values of t calculated by that equation have a known distribution when µ1 − µ2 = δH, and that distribution is Student's t-distribution (technically the central Student's t-distribution; when δ ≠ δH it is a non-central t-distribution [Cumming and Finch, 2001]). Because the distribution is known it is possible to define acceptance regions for any level of α in a hypothesis test, and any observed t-value can be converted into a P-value in a significance test.

An important problem that a pharmacologist is likely to face when using a statistical model is that it's just a model. Scientific inferences are usually intended to communicate something about the real world, not the mini world of a statistical model, and the connection between a model-based probability of obtaining a test statistic value and the state of the real world is always indirect and often inscrutable.
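To make the model concrete, here is a minimal sketch (the data are invented for illustration) that computes the test statistic exactly as written above, with an explicit null difference, and converts it into a two-sided P-value using the central t-distribution:

# Minimal sketch of the model described above: two normal populations,
# a sample of n = 6 from each, an explicit null difference delta_H,
# and conversion of the observed t into a two-sided P-value.
import numpy as np
from scipy import stats

x1 = np.array([4.1, 5.3, 4.8, 5.9, 5.1, 4.6])   # invented data, group 1
x2 = np.array([3.2, 4.0, 3.7, 4.5, 3.9, 3.4])   # invented data, group 2
delta_H = 0.0                                   # the 'nil-null' hypothesis

n1, n2 = len(x1), len(x2)
sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
             / (n1 + n2 - 2))                   # pooled standard deviation
t = ((x1.mean() - x2.mean()) - delta_H) / (sp * np.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t), df)                  # two-sided P from the central t-distribution

print(f"t = {t:.3f}, P = {p:.4f}")
# stats.ttest_ind(x1, x2) gives the same answer when delta_H is zero.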
Consider the meaning conveyed by an observed P-value of 0.002. It indicates that the data are strange or unusual compared to the expectations of the statistical model when the parameter of interest is set to the value specified by the null hypothesis. The statistical model expects a P-value of, say, 0.002 to occur only 2 times out of a thousand on average when the null is true. If such a P-value is observed then one of these situations has arisen:

• a two in a thousand accident of random sampling has occurred;
• the null hypothesised parameter value is not close to the true value;
• the statistical model is flawed or inapplicable because one or more of the assumptions underlying its application are erroneous.

Typically only the first and second are considered, but the last is every bit as important, because when the statistical model is flawed or inapplicable the expectations of the model are not relevant to the real world system that spawned the data. Figure 8 shows the issue diagrammatically. When we use that statistical inference to inform inferences about the real world we are implicitly assuming: (i) that the real world system that generated the data is an analog of the population in the statistical model; (ii) that the way the data were obtained is well described by the sampling rule of the statistical model; and (iii) that the observed data are analogous to the random sample assumed in the statistical model. To the degree that those assumptions are erroneous there is degradation of the relevance of the model-based statistical inference to the real world inference that is desired.
Figure 8: Diagram of inference using a statistical model. The diagram links the real world system under investigation and the data sampled from it to the population and sample of the statistical model, with the model assumed to be equivalent to the real world system and the model-based inference assumed to be relevant to the desired real world inference.

Considerations of model applicability are often limited to the population distribution (is my data normal enough to use a Student's t-test?) but it is much more important to consider whether there is a definable population that is relevant to the inferential objectives and whether the experimental units ("subjects") approximate a random sample. Cell culture experiments are notorious for having ill-defined populations, and while experiments with animal tissues may have a definable population, the animals are typically delivered from an animal breeding or holding facility and are unlikely to be a random sample. Issues like those mean that the calibration of uncertainty offered by statistical methods might be more or less uncalibrated. For good inferential performance in the real world, there has to be a flexible and well-considered linking of model-based statistical inferences and scientific inferences concerning the real world.
A P-value tells you how well the data match the expectations of a statistical model when the null hypothesis is true. But, as we have seen, there are many considerations that have to be made before a low P-value can safely be taken to provide sufficient reason to say that the null hypothesis is false. What's more, inferences about the null hypothesis are not always useful. Royall argues that there are three fundamental inferential questions that should be considered when making scientific inferences [Royall, 1997] (here paraphrased and re-ordered):

1. What do these data say?
2. What should I believe now that I have these data?
3. What should I do or decide now that I have these data?

Those questions are distinct, but not entirely independent, and there is no single best way to answer any of them.

A P-value from a significance test is an answer to the first question. It communicates how strongly the data argue against the null hypothesis, with a smaller P-value being a more insistent shout of "I disagree!". However, the answer provided by a P-value is at best incomplete, because it is tied to a particular null hypothesis within a particular statistical model and because it captures and communicates only some of the information that might be relevant to scientific inference. The limitations of a P-value can be thought of as analogous to a black and white photograph that captures the essence of a scene but misses coloured detail that might be vital for a correct interpretation.

Likelihood functions provide more detail than P-values and so they can be superior to P-values as answers to the question of what the data say. However, they will be unfamiliar to most pharmacologists and they are not immune to problems relating to the relevance of the statistical model and the peculiarities of experimental protocol. As this chapter is about P-values, we will not consider likelihoods any further, and those who, correctly, see that they might offer utility can read Royall's book [Royall, 1997]. (Royall [1997] and other proponents of likelihood-based inference, e.g. Berger and Wolpert [1988], make a contrary argument based on the likelihood principle and the irrelevance of the sampling rule, but those arguments may fall down when viewed with the local versus global distinction in mind. Happily, those issues are beyond the scope of this chapter.)
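Purely as a visual aid to what 'more detail' means, here is a minimal sketch (invented data, and a normal approximation used only for simplicity) of a likelihood function for the difference between two group means; unlike the single number given by a P-value, the curve shows which values of the difference are well or poorly supported by the data:

# Minimal sketch (normal approximation, invented data): a relative
# likelihood function for the difference between two group means.
import numpy as np
from scipy import stats

x1 = np.array([4.1, 5.3, 4.8, 5.9, 5.1, 4.6])
x2 = np.array([3.2, 4.0, 3.7, 4.5, 3.9, 3.4])

diff = x1.mean() - x2.mean()
se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))

deltas = np.linspace(diff - 4 * se, diff + 4 * se, 201)
likelihood = stats.norm.pdf(diff, loc=deltas, scale=se)
likelihood = likelihood / likelihood.max()          # scale so the maximum is 1

for d, l in zip(deltas[::40], likelihood[::40]):
    print(f"delta = {d:5.2f}   relative likelihood = {l:.3f}")
# The relative likelihood near delta = 0 shows how well the null value is
# supported compared with the best-supported value (the observed difference).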
The second of Royall's questions, What should I believe now that I have these data?, requires integration of the evidence of the data with what was believed prior to the evidence being available. A formal statistical combination of the evidence with prior beliefs can be done using Bayesian methods, but they are rarely used for the analysis of basic pharmacological experiments and are outside the scope of this chapter about P-values. Considerations of belief can be assisted by P-values because when the data argue strongly against the null hypothesis one should be less inclined to believe it true, but it is important to realise that P-values do not in any way measure or communicate belief.

The Neyman–Pearsonian hypothesis test framework was devised specifically to answer the third question: it is a decision theoretic framework. Of course, it is a good decision procedure only when α is specified prior to the data being available, and when a loss function informs the experimental design. And it is only useful when there is a singular decision to be made regarding a null hypothesis, as can be the case in acceptance sampling and in some randomised clinical trials. A singular decision regarding a null hypothesis is rarely a sufficient inference from the collection of experiments and observations that typically make up a basic pharmacological study, and so hypothesis tests should not be a default analytical tool (and the hybrid NHST should not be used in any circumstance).

Readers might feel that this section has failed to provide a clear method for making inferences about any of the three questions, and they would be correct. Statistics is a set of tools to help with inferences and not a set of inferential recipes, and scientific inferences concerning the real world have to be made by scientists.
References
M. Baker and E. Dolgin. Reproducibility project yields muddy results. Nature, 541(7637):269–270, 2017.
C. G. Begley and L. M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531–533, Mar. 2012.
D. J. Benjamin, J. O. Berger, M. Johannesson, B. A. Nosek, E. J. Wagenmakers, R. Berk, K. A. Bollen, B. Brembs, L. Brown, C. Camerer, D. Cesarini, C. D. Chambers, M. Clyde, T. D. Cook, P. De Boeck, Z. Dienes, A. Dreber, K. Easwaran, C. Efferson, E. Fehr, F. Fidler, A. P. Field, M. Forster, E. I. George, R. Gonzalez, S. Goodman, E. Green, D. P. Green, A. G. Greenwald, J. D. Hadfield, L. V. Hedges, L. Held, T.-H. Ho, H. Hoijtink, D. J. Hruschka, K. Imai, G. Imbens, J. P. A. Ioannidis, M. Jeon, J. H. Jones, M. Kirchler, D. Laibson, J. List, R. Little, A. Lupia, E. Machery, S. E. Maxwell, M. McCarthy, D. A. Moore, S. L. Morgan, M. Munafó, S. Nakagawa, B. Nyhan, T. H. Parker, L. Pericchi, M. Perugini, J. Rouder, J. Rousseau, V. Savalei, F. D. Schönbrodt, T. Sellke, B. Sinclair, D. Tingley, T. Van Zandt, S. Vazire, D. J. Watts, C. Winship, R. L. Wolpert, Y. Xie, C. Young, J. Zinman, and V. E. Johnson. Redefine statistical significance. Nature Human Behaviour, pages 1–5, Jan. 2018.
J. Berger and T. Sellke. Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association, pages 112–122, 1987.
J. O. Berger and R. L. Wolpert. The Likelihood Principle. Lecture Notes–Monograph Series. IMS, 1988.
L. Berglund, E. Björling, P. Oksvold, L. Fagerberg, A. Asplund, C. A.-K. Szigyarto, A. Persson, J. Ottosson, H. Wernérus, P. Nilsson, E. Lundberg, A. Sivertsson, S. Navani, K. Wester, C. Kampf, S. Hober, F. Pontén, and M. Uhlén. A genecentric Human Protein Atlas for expression profiles based on antibodies. Molecular & Cellular Proteomics: MCP, 7(10):2019–2027, Oct. 2008.
B. Bhattacharya and D. Habtzghi. Median of the p value under the alternative hypothesis. The American Statistician, 56(3):202–206, Aug. 2002.
A. Birnbaum. The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; With a Criticism of the Lindley-Savage Argument for Bayesian Theory. Synthese, 36(1):19–49, Sept. 1977.
J. M. Bland and D. G. Altman. Statistics Notes: One and two sided tests of significance. BMJ, 309(6949):248, July 1994.
C. F. Camerer, A. Dreber, F. Holzmeister, T.-H. Ho, J. Huber, M. Johannesson, M. Kirchler, G. Nave, B. A. Nosek, T. Pfeiffer, A. Altmejd, N. Buttrick, T. Chan, Y. Chen, E. Forsell, A. Gampa, E. Heikensten, L. Hummer, T. Imai, S. Isaksson, D. Manfredi, J. Rose, E.-J. Wagenmakers, and H. Wu. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, pages 1–10, Sept. 2018.
J. Cohen. A power primer. Psychological Bulletin, 112(1):155–159, July 1992.
D. Colquhoun. Lectures on Biostatistics. Oxford University Press, 1971.
D. Colquhoun. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3):140216, Nov. 2014.
M. Cowles. Statistics in psychology: an historical perspective. Lawrence Erlbaum Associates, Inc., 1989.
G. Cumming. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3(4):286–300, 2008.
G. Cumming and S. Finch. A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61(4):532–574, 2001.
M. Curtis, R. Bond, D. Spina, A. Ahluwalia, S. Alexander, M. Giembycz, A. Gilchrist, D. Hoyer, P. Insel, A. Izzo, A. Lawrence, D. MacEwan, L. Moon, S. Wonnacott, A. Weston, and J. McGrath. Experimental design and analysis and their reporting: new guidance for publication in BJP. British Journal of Pharmacology, 172(2):3461–3471, 2015.
D. J. Drucker. Never Waste a Good Crisis: Confronting Reproducibility in Translational Research. Cell Metabolism, 24(3):348–360, Sept. 2016.
J.-B. du Prel, G. Hommel, B. Röhrig, and M. Blettner. Confidence interval or p-value?: Part 4 of a series on evaluation of scientific publications. Deutsches Ärzteblatt International, 106(19):335–339, May 2009.
S. D. Dubey. Some thoughts on the one-sided and two-sided tests. Journal of Biopharmaceutical Statistics, 1(1):139–150, 1991.
R. Fisher. Statistical Methods for Research Workers. Oliver & Boyd, 1925.
R. Fisher. Design of experiments. Hafner, New York, 1960.
H. Fraser, T. Parker, S. Nakagawa, A. Barnett, and F. Fidler. Questionable research practices in ecology and evolution. PLoS ONE, 13(7):e0200303, 2018.
L. S. Freedman. An analysis of the controversy over classical one-sided tests. Clinical Trials, 5(6):635–640, 2008.
M. A. García-Pérez. Thou Shalt Not Bear False Witness Against Null Hypothesis Significance Testing. Educational and Psychological Measurement, 77(4):631–662, Oct. 2016.
A. Gelman and J. Carlin. Beyond Power Calculations. Perspectives on Psychological Science, 9(6):641–651, Nov. 2014.
C. H. George, S. C. Stanford, S. Alexander, G. Cirino, J. R. Docherty, M. A. Giembycz, D. Hoyer, P. A. Insel, A. A. Izzo, Y. Ji, D. J. MacEwan, C. G. Sobey, S. Wonnacott, and A. Ahluwalia. Updating the guidelines for data transparency in the British Journal of Pharmacology - data sharing and the use of scatter plots instead of bar charts. British Journal of Pharmacology, 174(17):2801–2804, Aug. 2017.
G. Gigerenzer. We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences, 1998.
S. N. Goodman. Of P-Values and Bayes: A Modest Proposal. Epidemiology, 12(3):295–297, May 2001.
S. N. Goodman and R. Royall. Evidence and scientific research. American Journal of Public Health, 78(12):1568–1574, 1988.
P. F. Halpin and H. J. Stam. Inductive inference or inductive behavior: Fisher and Neyman-Pearson approaches to statistical testing in psychological research (1940-1960). The American Journal of Psychology, 119(4):625–653, 2006.
L. Halsey, D. Curran-Everett, S. Vowler, and G. Drummond. The fickle p value generates irreproducible results. Nature Methods, 12(3):179–185, 2015.
J. Hoenig and D. Heisey. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 2001.
S. M. Howitt and A. N. Wilson. Revisiting "Is the scientific paper a fraud?": The way textbooks and scientific research articles are being used to teach undergraduate students could convey a misleading image of scientific research. EMBO Reports, 15(5):481–484, May 2014.
R. Hubbard, M. Bayarri, K. Berk, and M. Carlton. Confusion over Measures of Evidence (p's) versus Errors (α's) in Classical Statistical Testing. The American Statistician, 57(3), Aug. 2003.
C. J. Huberty. Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks. The Journal of Experimental Education, pages 317–333, 1993.
S. Hurlbert and C. Lombardi. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5):311–349, 2009.
J. P. A. Ioannidis. Why Most Published Research Findings Are False. PLoS Medicine, 2(8):e124, Aug. 2005.
V. E. Johnson. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48):19313–19317, 2013.
K. Kobayashi. A comparison of one- and two-sided tests for judging significant differences in quantitative data obtained in toxicological bioassay of laboratory animals. Journal of Occupational Health, 39(1):29–35, 1997.
J. I. Krueger and P. R. Heck. The Heuristic Value of p in Inductive Statistical Inference. Frontiers in Psychology, 8:108–16, June 2017.
P. Laplace. Théorie analytique des probabilités. 1812.
B. Lecoutre, M.-P. Lecoutre, and J. Poitevineau. Uses, Abuses and Misuses of Significance Tests in the Scientific Community: Won't the Bayesian Choice Be Unavoidable? International Statistical Review / Revue Internationale de Statistique, 69(3):399–417, Dec. 2001.
S. M. Lee. BuzzFeed News: Here's how Cornell scientist Brian Wansink turned shoddy data into viral studies about how we eat, February 2018. URL.
E. Lehmann. Fisher, Neyman, and the creation of classical statistics. Springer, 2011.
J. Lenhard. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. Br. J. Philos. Sci., 57(1):69–91, March 2006. ISSN 0007-0882. doi: 10.1093/bjps/axi152.
M. J. Lew. Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P. British Journal of Pharmacology, 166(5):1559–1567, June 2012.
K. Liu and X.-L. Meng. There is individualized treatment. Why not individualized inference? Annual Review of Statistics and Its Application, 3(1):79–111, 2016. doi: 10.1146/annurev-statistics-010814-020310. URL https://doi.org/10.1146/annurev-statistics-010814-020310.
C. Lombardi and S. Hurlbert. Misprescription and misuse of one-tailed tests. Austral Ecology, May 2009.
J. Lu, Y. Qiu, and A. Deng. A note on Type S & M errors in hypothesis testing. British Journal of Mathematical and Statistical Psychology, online version of record before inclusion in an issue, 2018.
P. McCullagh. What is a statistical model? The Annals of Statistics, 30(5):1125–1310, 2002.
P. Medawar. Is the scientific paper a fraud? Listener, 70:377–378, 1963.
H. J. Motulsky. Common misconceptions about data analysis and statistics. Naunyn-Schmiedeberg's Archives of Pharmacology, 387(11):1017–1023, 2014.
J. Neyman and E. Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289–337, 1933.
R. S. Nickerson. Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2):241–301, June 2000.
R. Nuzzo. Statistical errors: P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Nature, 506:150–152, 2014.
R. Royall. Statistical evidence: a likelihood paradigm, volume 71 of Monographs on Statistics and Applied Probability. Chapman & Hall, 1997.
G. D. Ruxton and M. Neuhaeuser. When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution, 1(2):114–117, 2010.
H. Sackrowitz and E. Samuel-Cahn. P Values as Random Variables - Expected P Values. American Statistician, pages 326–331, Nov. 1999.
S. Senn. Two cheers for P-values? Journal of Epidemiology and Biostatistics, 6(2):193–204, Dec. 2001.
G. Shaw and F. Nodder. The naturalist's miscellany: or coloured figures of natural objects; drawn and described immediately from nature. 1789.
A. Strasak, Q. Zaman, G. Marinell, and K. Pfeiffer. The Use of Statistics in Medical Research: A Comparison of The New England Journal of Medicine and Nature Medicine. The American Statistician, 61(1):47–55, 2007.
Student. The Probable Error of a Mean. Biometrika, 6(1):1–25, Mar. 1908.
B. Thompson. The nature of statistical evidence, volume 189 of Lecture Notes in Statistics. Springer, 2007.
D. Trafimow and M. Marks. Editorial. Basic and Applied Social Psychology, 37(1):1–2, 2015. doi: 10.1080/01973533.2015.1012991. URL https://doi.org/10.1080/01973533.2015.1012991.
J. W. Tukey. The Philosophy of Multiple Comparisons. Statistical Science, 6(1):100–116, Feb. 1991.
B. Voelkl, L. Vogt, E. S. Sena, and H. Würbel. Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLOS Biology, 16(2):e2003693–13, Feb. 2018.
E.-J. Wagenmakers. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804, Oct. 2007.
E.-J. Wagenmakers, M. Marsman, T. Jamil, A. Ly, J. Verhagen, J. Love, R. Selker, Q. F. Gronau, M. Šmíra, S. Epskamp, D. Matzke, J. N. Rouder, and R. D. Morey. Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. pages 1–23, Mar. 2018.
R. L. Wasserstein and N. A. Lazar. The ASA's Statement on p-Values: Context, Process, and Purpose.