Dealing with multiple testing: To adjust or not to adjust
Yudi Pawitan* and Arvid Sjölander
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet
September 2020

*Corresponding email: [email protected]

Abstract:
Multiple testing problems arise naturally in scientific studies because of the need to capture or convey more information with more variables. The literature is enormous, but the emphasis is primarily methodological, providing numerous methods with their mathematical justification and practical implementation. Our aim is to highlight the logical issues involved in the application of multiple testing adjustment.
Let's start by saying that dealing with multiple comparisons is tricky, perhaps not mathematically, but logically for sure. The wrapping of this problem should be labeled with at least two warnings, one from Oscar Wilde: 'the truth is rarely pure and never simple', and the other from Einstein: 'as far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality'. Einstein's statement was made in a lecture on geometry in relation to experience, partly highlighting his insight that Euclidean geometry fails to explain reality, so any branch of mathematics less intuitively obvious in its relation to reality than geometry has an even larger gap to close.

Multiple comparisons arise naturally in most scientific studies because of the need to capture or convey more information with more variables. But there are many steps taken during a scientific process that also have multiplicity implications, such as deciding the appropriate scale (linear or log), the variables to be included in the analysis, subgroup analyses, etc.

First, the easy part: mathematically precise statements can be derived assuming the null hypotheses are true; in effect we are dealing with random numbers with no signal. At the 5% significance level, for every 100 tests we shall get 5 spuriously significant tests on average. Alternatively, assuming independence and still under the null, it is virtually certain – probability = 1 − (1 − 0.05)^100 = 0.994 – that the most significant result among the 100 tests will have a raw P-value ≤ 0.05. This means that if we are naïve we would be easily misled by false positives, hence the usual warning about data dredging.

In practice, to account for multiplicity, one can simply adjust the P-values by multiplying each by the number of tests (say M), or dividing the nominal significance level by M. This makes it harder to declare significance. Using this so-called Bonferroni correction, we can guarantee that
if we only reject hypotheses with adjusted P-value ≤ 0.05, say, then the probability of making any false rejection among all tests is ≤ 0.05. To prove this, suppose we have M null hypotheses, of which a subset M₀ are true; let F be the number of false rejections. Then

Prob(F > 0) = Prob(adjusted P-value ≤ 0.05 for at least one of the M₀ true hypotheses)
            = Prob(raw P-value ≤ 0.05/M for at least one of the M₀ true hypotheses)
            ≤ M₀ × 0.05/M ≤ 0.05.

Due to its simplicity, this procedure is commonly used in practice. If we want to protect ourselves against false positives, it seems reasonable to apply this procedure. The literature on multiple testing is enormous and there are numerous alternative methods, many as improvements of the Bonferroni correction, indicating that multiple-testing correction is taken seriously, and when applied, it is mathematically straightforward.
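As a quick numerical check of both claims (the near-certain spurious significance without adjustment, and the family-wise guarantee under the Bonferroni correction), here is a minimal simulation sketch in Python; the code is our own illustration, not part of any cited study:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
M, n_sim, alpha = 100, 10_000, 0.05

# Under the universal null, the M raw P-values are uniform on (0, 1)
p = rng.uniform(size=(n_sim, M))
p_min = p.min(axis=1)

# Unadjusted: the most significant of the 100 null tests is almost
# always below 0.05, matching 1 - (1 - 0.05)**100 = 0.994
print((p_min <= alpha).mean())

# Bonferroni: reject only when M * P <= alpha; the probability of
# any false rejection (the family-wise error rate) stays <= alpha
print((M * p_min <= alpha).mean())
```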
Example 1: Cancer drug study.
The following table shows the P-values in a study to compare a metastatic cancer drug vs placebo for 10 patient characteristics. The best raw P-value is 0.007 for the Karnofsky index (let's keep it mysterious for now), but when corrected using the Bonferroni method, nothing is significant at the 5% level. The last column actually gives an estimated number of false positives when we consider the variable and those above it as significant, so we do not truncate it at 1. We will use this quantity later.

     Variable                       P-value   10 × P-value
  1  Karnofsky index                0.007     0.07
  2  Body weight                    0.013     0.13
  3  Tricep skin-fold               0.091     0.91
  4  Hemoglobin concentration       0.236     2.36
  5  Erythr. sedimentation rate     0.350     3.50
  6  Albumin in serum               0.525     5.25
  7  Creatinine in serum            0.535     5.35
  8  Bilirubin in serum             0.662     6.62
  9  S-alkaline phosphatase         0.823     8.23
 10  Alanine aminotransferase       0.908     9.08

Table 1: P-values in a study to compare a metastatic cancer drug vs placebo for 10 patient characteristics. The third column is the Bonferroni-adjusted P-value that is (on purpose) not truncated at 1.
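For concreteness, the third column of Table 1 can be reproduced in a few lines of Python; this is only an illustrative sketch using the P-values read off the table:

```python
# Raw P-values from Table 1, ordered from most to least significant
pvals = [0.007, 0.013, 0.091, 0.236, 0.350,
         0.525, 0.535, 0.662, 0.823, 0.908]
M = len(pvals)  # 10 tests

# Bonferroni-adjusted P-values, deliberately not truncated at 1;
# under the null, the k-th value is also the expected number of false
# positives if the top k variables were declared significant
adjusted = [round(M * p, 2) for p in pvals]
print(adjusted)  # [0.07, 0.13, 0.91, 2.36, 3.5, 5.25, 5.35, 6.62, 8.23, 9.08]
```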
Yet the statistical concerns about false positives are not shared universally among scientists. The clearest objection was formulated by Rothman (1990), who declared 'No Adjustments are Needed for Multiple Comparisons'. Being the first Editor of the journal Epidemiology and the author of a widely used textbook, Rothman is one of the most influential and highly cited epidemiologists, so it is worth understanding his arguments. Briefly, two non-mathematical presumptions are needed for the application of multiplicity adjustments:

• the 'universal null hypothesis', covering all hypotheses under consideration, is a reasonable state of nature, so that chance does cause many unexpected findings, and

• no one would want to investigate further something that could be caused by chance.

He argued that in a scientific process these presumptions are not true, that 'chance' is not a scientific explanation, so scientists should 'grasp at every opportunity' to understand unusual findings, and that the possibility of being misled is part of the trial-and-error process of science. It is still the case today that statistical results in the epidemiology and medical literature are rarely adjusted for multiple comparisons, with notable exceptions in clinical trials and high-throughput molecular studies; see more below. In clinical trials, multiplicity issues arise in, for example, the choice of hypotheses to be tested, the sidedness of the test, interim analyses during the trial, the main analysis plan, subgroup analyses after the trial, etc. The US Food and Drug Administration's Guidance for Industry: E9 Statistical Principles for Clinical Trials gives detailed recommendations on how these multiplicity issues should be handled.
Example 1: Cancer drug study (continued).
In the cancer drug study, suppose we know nothing about the variables prior to the study, so for us all these variables assume equal status. Then it would be naïve to take the raw P-values seriously, thus forcing us to accept the adjustment and the overall null result. But suppose, prior to collecting the data, because this is a study on end-stage metastatic patients, we declared that the Karnofsky index, a generalized measure of functional performance, was of primary interest, while the other variables were of secondary interest. Then no adjustment would be necessary for the Karnofsky index. Thus, as accepted in clinical trials, in the assessment of evidence our intention matters. This does not feel controversial. We note that, for justifying non-adjustment, the primary interest in the index does not require prior knowledge or data or anything sensible; in principle we only need to declare it in advance.

Now suppose we accept we know nothing prior to the study, hence the multiplicity adjustment, but another research group that is working on the Karnofsky index contacts us to share the data. We agree to give them that variable only. Now, for them, it seems reasonable that no adjustment is needed and to conclude that the Karnofsky index is statistically significant. So, here we have a seeming paradox that, with the same data, two research groups can claim different evidence: one group cannot claim significance, but the other can. But what if we contact them? Now it seems the adjustment must be used, since we could have contacted any group working with the most significant variable. This means the mechanism of contact becomes an issue, but in practice the appearance of the 'second group' on the scene can of course be completely haphazard, e.g. via a chance encounter at a party, a friend of a friend, etc. In this social contact, how do we keep track of who brings up the topic first? It would be impossible to formalize such a process.
Single test
Surprisingly, the logical issue associated with the application of multiplicity adjustment arises even when we only perform a single test (Berger and Berry, 1987). Suppose a client comes to a statistician with a study (say with sample size n = 100), and the statistician performs a single test and obtains, say, z = 2.1, which is significant at level α = 0.05. Yes, but wait ... what did he plan to do if the result was not significant? Suppose, as with other scientists, he planned to collect more data. So, his actual procedure is a sequential test as follows:

1. Collect n = 100 samples and test if |z₁| > c; if significant, stop.

2. Otherwise, collect 100 more samples, and test if the overall |z₂| > c.

To get an overall significance level α = 0.05 for this procedure, we must use c = 2.18, such that

Prob(|z₁| > c) + Prob(|z₁| ≤ c and |z₂| > c) = 0.05,

where z₁ is based on the first 100 samples and z₂ on all 200. So, the observed z = 2.1 is no longer significant.
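The threshold c = 2.18 can be verified numerically. The following Monte Carlo sketch is our own illustration (not from Berger and Berry); it uses the fact that, under the null, the overall statistic from 200 samples equals (z₁ + z′)/√2, where z′ is the independent statistic from the second batch of 100 samples:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n_sim = 1_000_000

# Under the null: z1 from the first 100 samples, z_new from the
# second batch of 100; the overall statistic pools all 200 samples
z1 = rng.standard_normal(n_sim)
z_new = rng.standard_normal(n_sim)
z2 = (z1 + z_new) / np.sqrt(2)

def overall_alpha(c):
    # Reject at stage 1 if |z1| > c, or at stage 2 if |z1| <= c and |z2| > c
    reject = (np.abs(z1) > c) | ((np.abs(z1) <= c) & (np.abs(z2) > c))
    return reject.mean()

print(overall_alpha(1.96))  # ~0.083: the naive cutoff inflates the error rate
print(overall_alpha(2.18))  # ~0.050: the calibrated sequential cutoff
```

Note that keeping the naive single-test cutoff 1.96 at both stages would inflate the overall type-I error rate to about 8%.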
These examples highlight a generic logical question: what collection of tests do we want to apply the adjustment to? To cover all tests relating to the same biological process? All tests in a single paper? If the latter, then theoretically we can avoid multiplicity correction by splitting the results into separate papers. The primary problem is that we are using the same statistic (the P-value) both as a measure of evidence in a specific dataset (a statistical distance between the hypothesis and the data) and as a measure of uncertainty (decision-making error rates) over hypothetical repetitions of the study.

In the former, adjustment for multiplicity is not an issue, but in the latter it is. However, the latter requires a precise setup of how the study repetitions are to be done, and here intention matters, for example, in deciding which tests are to be included or prioritized. The examples show that any test can legitimately belong to distinct collections with distinct repetition studies that depend on the perspective of the experimenters. (This is not strange, e.g. a person may belong to distinct clubs with conflicting rules.) In the drug example, different groups of researchers are imagining different hypothetical experiments involving the Karnofsky index. In the single-test example, on seeing the observed data, the statistician's first reaction is to imagine hypothetical repetitions involving one test, but the scientist's intention involves hypothetical repetitions with two tests. Whose perspective is correct?

In view of this logical difficulty, how are we to react to potentially conflicting conclusions from unadjusted and adjusted analyses? Actually, we are unwittingly put into this corner by an implicit demand that we make a decision. When a study is only performed once, i.e. the hypothetical repetitions remain hypothetical, we are in a never-ending unsolvable logical puzzle on how to decide. How do we break this puzzle?

In clinical trials, for the key hypothesis, we break it simply by decree (e.g. following the FDA Guidelines), in essence limiting or avoiding the issue by stating the hypothesis and analysis methods in advance. Science does not follow such strict rules, but we need validation studies to confirm interesting discoveries. Before further confirmation, discoveries are considered provisional, so, in contrast to clinical trials, it is not necessary to make a decision about the true state of nature. However, we also note that, to confirm a discovery, it is not necessary to perform exactly the same experiment as before (i.e. the hypothetical repetitions). For example, a discovery in an observational study in humans will be substantially more credible if validated biologically in mice, and vice versa.

Eventually, in treating a study as a screening tool to identify interesting discoveries, we have to go back to the basic trade-off between type-I (false positive) and type-II (false negative) errors: protecting against one will increase the other. Strict adherence to multiple-testing adjustment protects against inflation of type-I errors but increases type-II errors, but who decides which error is more important?
The legendary investor George Soros reportedly said, "It's not whether you're right or wrong [that is important], but how much money you make when you're right and how much you lose when you're wrong." It does not seem to make sense to feel strongly against one type of error vs the other independently of the context.

Different areas of science may treat multiplicity adjustment differently, perhaps depending on the relative costs of the errors, their history/experience with the errors and the abundance of potential leads for discoveries. For example, molecular epidemiology went through the lamented candidate-gene approach to complex diseases, roughly from the 1980s up to the early 2000s (Hirschhorn et al., 2002; Chabris et al., 2012), where few findings were replicated.
Example 2: Publication bias and Winner’s curse.
The human genome is a rich source of variables/genes. For many complex phenotypes, suppose there is no real effect or, more likely, the individual-gene effects are so tiny that the power of fundable studies is too small to detect anything. Say there are 100 research groups investigating different genes and phenotypes; each group is essentially generating random numbers. At the 5% level, there will be 5 lucky groups with significant results: these are much more likely to get published, and then fail to replicate. If these are really 100 distinct groups, what can we do about this problem? No system of publication now can communicate so many negative findings to balance the false positives, so the problem seems to be an inevitable price of the scientific process.

When high-throughput molecular studies, particularly the genome-wide association studies (GWAS), came onto the scene, a single study/paper routinely performs millions of tests of single nucleotide polymorphisms (SNPs), and the genome-wide significance threshold 5 × 10⁻⁸, based on the Bonferroni correction for one million independent SNPs (0.05/10⁶ = 5 × 10⁻⁸), became the accepted method of dealing with the huge multiplicity problem. One may argue correctly that we would be missing a lot of signals (type-II errors), but since the field had seen a lot of wasted effort at replicating false leads during the candidate-gene era, the consensus on the use of the Bonferroni correction seems unchallenged.

The problem in molecular studies in the genomic era is that there are simply too many potential leads, so one needs a method to limit them. Note, however, that the P-values for each phenotype are usually adjusted separately, e.g. if there are 1M SNPs, then the 5 × 10⁻⁸ threshold is applied for each phenotype regardless of the number of phenotypes, so the multiplicity adjustment is not followed consistently.
Flexible multiplicity adjustment

Multiple testing ideas are useful during an exploratory phase of a study. Goeman and Solari (2011) identified three sensible requirements for a multiple correction procedure:

• not too strict: it should allow the possibility of false rejections;

• post-hoc: it should allow choices after seeing the data;

• flexible: it should allow us to pursue whatever results we like, not just the significant ones.

Instead of focusing on the probability of making any false rejections, which is too strict, we can instead estimate and provide confidence lower bounds for the number of true discoveries (= correct rejections). To emphasize its post-hoc feature, they called their procedure 'cherry picking'. Recent developments, for example in false discovery rate (FDR) estimation, are also in line with these requirements (e.g. Lee et al., 2012). The purpose is more to set realistic expectations for further studies rather than to make final decisions about the true state of nature.

Example 1: Cancer drug study (continued).

Suppose this study was performed on end-stage cachexic patients, who are characterized by severe wasting/loss of body mass and functional impairment. The top 3 variables are then of special interest, but, crucially, this interest can be decided post-hoc (after seeing the data). From the third column (10 × P-value), the estimated number of false positives among the top 3 is 0.91, so the estimated number of true discoveries is 3 − 0.91 ≈ 2.
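A minimal sketch of this arithmetic in Python, using the numbers from Table 1 (the variable names are ours):

```python
# Untruncated Bonferroni-adjusted P-values of the top 3 variables (Table 1)
adjusted_top3 = [0.07, 0.13, 0.91]

# The largest of these estimates the number of false positives incurred
# by declaring all three variables significant, under the null
est_false_positives = adjusted_top3[-1]   # 0.91

# Estimated number of true discoveries among the top 3
print(3 - est_false_positives)            # 2.09, i.e. about 2
```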
To conclude, we do not at all mean to sound skeptical of the use of multiplicity adjustment, especially when put in a flexible false discovery rate framework. However, we do want to emphasize the nuances when adjustment is applied to real studies, and the constant need to consider the type-I and type-II errors, whose relative costs are highly context specific. We highlight some logical problems with its formal use when assessing scientific evidence, thus partly explaining the lack of universal acceptance. In areas of science where potential leads for discoveries are not abundant and the true discoveries are rare but fundamental, formal multiplicity adjustment is not the norm. On the other hand, there are areas that must pay close attention to multiplicity adjustment, e.g. clinical trials, where type-I errors can have enormous human or financial costs, or high-throughput molecular studies, with their flood of false positives when we use traditional significance levels.

Nature does not reveal its truths easily. Anyone in search of those ultimate truths will tolerate certain false positives along the way. Healthy skepticism is a necessary default attitude in the error-prone but self-correcting process of science. But there is a legitimate concern about false positives when a scientific result is disseminated to the public. The scientists and the public may have different relative costs of the type-I and type-II errors: the scientists may be more concerned with missing a discovery (type-II), but the public does not like to be misled by false positives. Also, the public may not have the patience or the kind of skepticism that accepts results as provisional. They are like observers who do not have sufficient background knowledge – and attention span – to absorb conflicting accounts of an event far away. So, conflicting findings, say, about the effects of electromagnetic fields or coffee or cancer screening, etc., will confuse the public. The public's position is similar to the one we mentioned above, where we put ourselves unnecessarily into a corner by thinking that we have to decide one way or the other. The public wants to know 'the truth' one way or the other, but the scientific answer comes in confusing bits. The scientists themselves can live with the uncertainty of a provisional result or with a variegated truth about effects. It is not obvious how to reconcile the scientific acceptance of complex uncertainties and the public's need for a simple message.
Acknowledgement
An earlier version of this manuscript was published in Qvintensen (Nr 2, 2014), a bulletin of the Swedish Statistical Society. We are grateful to the Editor Dan Hedlin, who arranged the publication and kindly helped in editing the manuscript. This is an accompanying manuscript to Defending the P-value, which is also available on arXiv.
References
Berger, J.O. and Berry, D.A. (1987). The relevance of stopping rules in statistical inference. In Statistical Decision Theory and Related Topics IV (S. Gupta and J. Berger, Eds). New York: Springer-Verlag.

Chabris, C.F., et al. (2012). Most reported genetic associations with general intelligence are probably false positives. Psychological Science, 23(11), 1314-1323.

Goeman, J.J. and Solari, A. (2011). Multiple testing for exploratory research. Statistical Science, 26(4), 584-597.

Hirschhorn, J.N., et al. (2002). A comprehensive review of genetic association studies. Genetics in Medicine, 4(2), 45-61.

Lee, et al. (2012). Statistics in Medicine, 31(11-12), 1177-1189.

Pawitan, Y. (2001). In All Likelihood. Oxford: Oxford University Press.

Rothman, K.J. (1990). No Adjustments are Needed for Multiple Comparisons. Epidemiology, 1(1), 43-46.