A comprehensive error rate for multiple testing
Djalel Eddine Meskaldji, Dimitri Van De Ville, Jean-Philippe Thiran, Stephan Morgenthaler
FSB/MATHAA, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Institute of Bioengineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland; LTS5/STI, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.

November 12, 2018

Corresponding author: djalel.meskaldji@epfl.ch
Abstract
The higher criticism of a family of tests starts with the individual uncorrected p-values of each test. It then requires a procedure for deciding whether the collection of p-values indicates the presence of a real effect and, if possible, selects the ones that deserve closer scrutiny. This paper investigates procedures in which the ordered p-values are compared to an arbitrary positive and non-decreasing threshold sequence.
When testing the existence of m potential effects, it is customary to compute an individual p-value p_i for each null hypothesis H_i, i = 1, ..., m. A true effect tends to have a p-value close to zero, while a null effect has a randomly assigned p-value between 0 and 1. Let m = m_0 + m_1 be the decomposition into the number of null effects and true effects, respectively. The plot of the ordered p-values against the rank provides a simple visual check of whether any effect is present at all. The resulting graph of p_(r) vs. r for r = 1, ..., m carves out a non-decreasing sequence of points in the rectangle [1, m] x [0, 1] and should, if no real effect exists, stay close to the diagonal from the lower left corner to the upper right corner of the rectangle. Procedures of higher criticism can be based on a second curve or threshold sequence t_r, which stays close to the lower edge of the rectangle. If all p-values are above this curve, none of the tests rejects. If any of the p-values are below the threshold curve, there are several possible procedures. We will concentrate on the maximum p-value below the curve. All the tests with p-value smaller than or equal to this maximum then reject. Figure 1 illustrates this general idea. If such a procedure is applied, each effect we test is either rejected or not rejected. Table 1 introduces the contingencies that occur.

hypotheses       not rejected    rejected    Total
True                 TN             FP        m_0
False                FN             TP        m_1
Total               m - R            R         m

Table 1: General outcome of separating m tests into non-significant and significant ones. TP, FP denote the number of true, false rejections (true, false positives), respectively. R is the total number of rejections. TN, FN denote the number of true, false non-rejections (true, false negatives), respectively.

If we take the classical meaning of "statistical significance at level α" and reject all tests with p_(r) ≤ t_r = α, or equivalently p_i ≤ α, the expected number of false rejections is equal to m_0 α. If m = 1000 and α = 0.05, we expect around 50 false rejections under the assumption that m_1 is moderate. In Figure 1, the horizontal threshold has α = 0.15 and we expect around 7.5 false rejections. The 9 rejections we actually found are close enough to 7.5 to suspect that most or even all of the rejections are false positives.

Even though this calculation makes it clear that performing many tests in parallel challenges the classical sense of "statistically significant", a large number of claims are still published in the scientific literature without a proper control (Benjamini, 2010). To make things worse, the above calculation neglects previously published or unpublished studies that have tested the same or related hypotheses, thus further increasing the risk of spurious findings.

There is an obvious solution to the dilemma; we can simply replace t_r = α by t_r^Bonf = α/m. If we do this, the expected number of false rejections is less than or equal to m_0 α/m ≤ α. Bonferroni (1936) shows that in many circumstances this test also controls another aspect, namely the probability of making one or more false rejections. This stringent error control is one of the recommended solutions to the dilemma of multiplicity, but can only be applied for modest values of m, because otherwise the ability to find the real effects of interest vanishes.

More accommodating procedures have been investigated. Holm (1979), for example, showed that the curved threshold t_r^Bonf ≤ t_r^Holm = α/(m + 1 − r), lying along a convex function, provides the same protection. (To be precise, Holm's procedure is a step-down procedure, that is, it rejects H_(r) as long as p_(r) ≤ t_r for r = 1, 2, 3, ... and stops rejecting when for the first time the inequality fails.) Simes (1986) went even further by considering the proportional choice t_r^Holm ≤ t_r^Simes = rα/m. With the false discovery proportion defined as FDP = FP / max(R, 1), the threshold t_r^Simes ensures that E(FDP) ≤ α (BH procedure).

Figure 1: The plot shows for m = 50 tests the ordered p-values plotted against the rank. Besides the diagonal, three other curves are shown, which have appeared in the literature as thresholds. They are the constant threshold t_r = 0.15 (solid line), the proportional threshold t_r = 0.003 r (dotted line) and the curved threshold t_r = 0.15/(m + 1 − r) (dashed curve). With the constant, 9 tests reject; with the proportional, 1 test rejects; and with the curved, 2 tests reject. In addition a linear threshold t_r = 0.05 + Br is shown.

In all these examples, the values of the thresholds lie between α/m and α. This is reasonable since these two bounds represent the cases of "severe correction" and "no correction". In this paper, we explore what happens if the threshold t_r satisfies α/m ≤ t_r ≤ α and is non-decreasing, for example, the linear threshold t_r^linear = A + Br, where A, B ≥ 0 and A + B > 0.

If we are willing to contemplate new error metrics, the question "what happens with arbitrary thresholds?" has an easy answer. Let the number of rejections be R, the largest rank such that p_(R) ≤ t_R (see Fig. 1), and zero if all the p-values are larger than the threshold. We will show that with this threshold,

(1/m) E( 1_{R>0} FP / t_R ) ≤ 1,   (1)

which means that a ratio of the number of false positives divided by a positive and non-decreasing function of the total number of rejections is controlled. In order to relate this inequality to familiar results, we consider thresholds of the form

t_r = s_r α/m,   (2)

where 0 ≤ s_r ≤ m is a non-decreasing sequence, which we will call the shape of the threshold, and α > 0. The larger the α, the larger the threshold and the larger the number of rejections. With this notation, inequality (1) becomes

E( 1_{R>0} FP / s_R ) ≤ α.   (3)

For the thresholds we have discussed in Section 1, inequality (3) yields the results in Table 2.

Inequality (3) gives us a clearer understanding of procedures of higher criticism based on thresholding p-values of families of tests. The threshold is meant to capture the p-values belonging to true effects from above. In Fig. 1, such p-values cluster near the bottom left corner, that is, they are small and have low rank. At least this is the situation we have in mind, even though this will not happen if the true effects are not large enough.
If we move to the right along the curve of the ordered p-values, the ones belonging to true effects become rare and the p-value sequence cuts across the threshold.

name         threshold t_r     shape s_r          control
Bonferroni   α/m               1                  E(1_{R>0} FP) = E(FP) ≤ α
Holm         α/(m + 1 − r)     m/(m + 1 − r)      E(1_{R>0} ((m + 1 − R)/m) FP) ≤ α
BH           rα/m              r                  E(1_{R>0} FP/R) ≤ α
Linear       A + Br            m(A + Br)/α        E(1_{R>0} αFP/(m(A + BR))) ≤ α

Table 2: The control exerted by using the thresholds discussed in Section 1. The Bonferroni and BH cases are well known; the other two are new. The Holm threshold controls the expected number of erroneous rejections, but multiplied by a weight smaller than or equal to 1 that decreases linearly as R increases. The linear and proportional thresholds also control such a weighted expectation, with decreasing weights depending on R.

The proportional thresholds t_r = rα/m are ideal in situations where the number of true effects is small and the effects are just barely detectable. In other situations, the number R of detected effects can be large or moderately large and the proportional thresholds will then find a sizable number of false positives and lose their effectiveness. This was our motivation for studying other thresholds. We are, however, interested in threshold shapes s_r that grow linearly in r for small values, but sub-linearly for larger values of r. The prototype is the trimmed threshold s_r^trunc = min(τ, r) for 1 ≤ τ ≤ m.

Definition 2.1. Consider an arbitrary non-decreasing threshold t_r = s_r α/m (r = 1, ..., m) with non-decreasing shape function 0 ≤ s_r ≤ m. The matched error metric is the penalized false discovery proportion

pFDP = FP/s_R, if R ≥ 1, and pFDP = 0, otherwise.

We call the expectation of the pFDP the penalized false discovery rate, pFDR = E(pFDP).

In an older version of the manuscript (Meskaldji et al., 2011), we show that with particular cases of the shape function, one can recover a large number of type I error rates that have been introduced in the literature.

Our results are valid under independence, positive dependence and general dependence, as well as for the control via weighted procedures (Benjamini and Hochberg, 1997; Genovese et al., 2006). In a weighted procedure, an a priori non-negative weight w_i is assigned to each null hypothesis H_i, and the raw p-values p_i are replaced by the weighted p-values q_i = p_i/w_i if w_i > 0 and q_i = 1 if w_i = 0. A hypothesis with a big weight is more easily rejected than a hypothesis with a small weight. The weights have to be standardized so that Σ_{i=1}^m w_i = m, which includes w_i ≡ 1.

(Here, step-up procedures are those that reject all null hypotheses up to the last up-crossing of the threshold sequence.)
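The step-up rule behind Definition 2.1 (reject up to the last up-crossing of the ordered p-values over the threshold sequence) is simple to state in code. The following Python sketch is ours, not from the paper; the function and variable names are illustrative.

```python
import numpy as np

def step_up(p, t):
    """Step-up rule: R is the largest rank r with p_(r) <= t_r (the last
    up-crossing); all hypotheses whose p-value is at most p_(R) reject.
    Returns R = 0 and no cutoff if every ordered p-value lies above t."""
    p_sorted = np.sort(np.asarray(p, dtype=float))
    below = np.nonzero(p_sorted <= t)[0]          # 0-based ranks with p_(r) <= t_r
    R = int(below[-1]) + 1 if below.size else 0   # last up-crossing, 1-based
    return R, (p_sorted[R - 1] if R > 0 else None)

# Threshold sequences t_r = s_r * alpha / m for the shapes of Section 1:
m, alpha, tau = 50, 0.15, 5
r = np.arange(1, m + 1)
t_bonf  = np.full(m, alpha / m)              # shape s_r = 1 (Bonferroni)
t_holm  = alpha / (m + 1 - r)                # shape s_r = m/(m+1-r) (Holm-type)
t_bh    = r * alpha / m                      # shape s_r = r (proportional/BH)
t_trunc = np.minimum(tau, r) * alpha / m     # trimmed shape s_r = min(tau, r)
```

With these sequences, `step_up` returns both the number of rejections R and the cutoff p-value p_(R), from which quantities such as FP and the pFDP can be tallied in a simulation.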
The shape s_r itself acts as a weight, albeit one that applies to the rank of the hypothesis as determined by its p-value. If we re-write p_(r) ≤ s_r α/m as p_(r) r/s_r ≤ rα/m, we see that the method acts as a weighted BH procedure with weight proportional to w_i ∝ s_{rank of hypothesis i} / rank of hypothesis i. Since these weights depend on the p-values, the condition Σ_{r=1}^m s_r/r = m does not guarantee the control of the FDR.

Other studies have proposed threshold sequences different from the horizontal or the proportional, among them Genovese and Wasserman (2002, p. 508 and p. 513), where it is shown that asymptotically the proportional one is optimal in some sense, Finner et al. (2009), who derive an alternative threshold sequence based on asymptotic arguments, and Roquain and Villers (2011), who investigate the operating characteristics of the FDR under an increasing threshold sequence. But none of these papers makes the connection to a generalization of the error metric. Metrics related to ours were introduced in van der Laan et al. (2004) and described in Dudoit and van der Laan (2008, p. 238 and ch. 6, 7). These references consider transformations that involve both FP and R, while we concentrate on transforming R alone.

The freedom offered by using any threshold sequence is practical and useful. It gives users the opportunity to choose on a very fine scale between tests that merely provide a screening of a large number of potential effects and, on the other hand, tests that exert a strict control on the number of false discoveries, but have difficulties in finding moderately large true effects. When applying a procedure that exerts FDR control, we can expect to have FP ≈ (m_0/m) α R, which grows with the number of rejections, whereas a procedure that controls the expected number of false positives guarantees E(FP) ≤ m_0 α/m ≤ α, independently of R.
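The weighted variant described above (raw p-values replaced by q_i = p_i/w_i, with q_i = 1 when w_i = 0) fits the same step-up machinery. A minimal Python sketch, ours and purely illustrative, including a check that every rejected hypothesis i indeed satisfies p_i ≤ (α/m) w_i s_R:

```python
import numpy as np

def weighted_step_up(p, w, s, alpha):
    """Step-up with threshold t_r = s_r*alpha/m applied to the weighted
    p-values q_i = p_i/w_i (q_i = 1 when w_i = 0). Returns the rejection
    set (0-based indices) and the number of rejections R."""
    p, w, s = map(np.asarray, (p, w, s))
    m = p.size
    q = np.where(w > 0, p / np.where(w > 0, w, 1.0), 1.0)
    order = np.argsort(q)
    below = np.nonzero(q[order] <= s * alpha / m)[0]
    R = int(below[-1]) + 1 if below.size else 0
    return set(order[:R].tolist()), R

def check_rejections(p, w, s, alpha):
    """Verify that each rejected i satisfies p_i <= (alpha/m) * w_i * s_R."""
    rejected, R = weighted_step_up(p, w, s, alpha)
    if R == 0:
        return True
    m = len(p)
    bound = alpha / m * np.asarray(w) * np.asarray(s)[R - 1]
    return all(p[i] <= bound[i] for i in rejected)
```

A hypothesis with weight 0 receives q_i = 1 and is therefore never rejected, matching the convention in the text.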
Since none of the error metrics is superior in all aspects, users may wish to have several distinct controls achieved by one procedure. On general principles, an ideal procedure is as powerful as possible when it comes to detecting true effects and, at the same time, controls the expected number of false positives. The BH procedure does achieve this for smallish R, but at the risk of large values of FP when R becomes large. To avoid this, one can add strong control of E(FP) at a relaxed level τα (τ ≥ 1), as Hommel and Hoffmann (1987) guarantees. In the context of penalized false discovery procedures, the truncated linear threshold, t_r^trunc = s_r^trunc α/m, with shape function s_r^trunc = min(τ, r), provides a solution. The corresponding step-up procedure with threshold sequence t_r^trunc controls the FDR to be less than α and E(FP) to be less than τα. We might call this type of control the truncated FDR. This is an appropriate choice if the number m of hypotheses being tested is very large. Applications include functional magnetic resonance images in the neurosciences as well as many genomic or other "omic" studies (Meskaldji et al., 2015, 2013).

Note that the truncated FDR can be written as

E( FP/s_R^trunc ) = E( FP/min(τ, R) ) = E( max[ FP/R, FP/τ ] ) ≥ max[ FDR, E(FP)/τ ],

so that control of the truncated FDR at level α implies both FDR ≤ α and E(FP) ≤ τα.

A similar criterion (with smooth transition) is obtained by minimizing the expected number of false positives while simultaneously controlling the false discovery rate. This leads to a family of mixed error rates indexed by 0 < ε < 1,

MER_{ε,τ} = ε E(FP/τ) + (1 − ε) E(FP/R),

where ε tunes the transition between the two controls. To derive a step-up procedure that controls the MER, we use Propositions 4.1 and 4.2. We have

MER_{ε,τ} = E[ FP ( εR/τ + (1 − ε) ) / R ],

which we recognize as a pFDR with shape sequence

s_r^{ε,τ} = r / ( εr/τ + (1 − ε) ).

This sequence rises roughly linearly for small r and then flattens out to approach the horizontal asymptote τ/ε. Control at level α, when m hypotheses are tested, is obtained by the thresholds

t_r^{ε,τ} = s_r^{ε,τ} α/m = [ r / ( εr/τ + (1 − ε) ) ] α/m.

To complete this section, we present some simulations that demonstrate the advantage of using the truncated FDR.
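Before turning to the simulations, the two shape families introduced above, the trimmed shape min(τ, r) and the smooth MER shape s_r^{ε,τ} = r/(εr/τ + (1 − ε)), can be sketched as follows (our own illustration; names are ours):

```python
import numpy as np

def shape_trunc(r, tau):
    """Trimmed shape s_r = min(tau, r): proportional (BH-like) behavior
    for r <= tau, then constant, i.e. relaxed Bonferroni-like beyond."""
    return np.minimum(tau, r)

def shape_mer(r, eps, tau):
    """Smooth-transition shape s_r = r / (eps*r/tau + (1 - eps)) of the
    mixed error rate: roughly linear rise for small r, horizontal
    asymptote tau/eps for large r."""
    return r / (eps * r / tau + (1.0 - eps))

r = np.arange(1, 10001)
s = shape_mer(r, eps=0.5, tau=20.0)   # asymptote tau/eps = 40
```

Plotting `s` against `r` reproduces the qualitative behavior described in the text: a near-linear start that saturates at τ/ε.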
We compare three multiple testing procedures: the Bonferroni procedure, the most stringent; the BH procedure, the least stringent; and the truncated linear procedure, which bridges the gap between these two extremes. We restrict our simulations to the case of shifted Gaussians, that is, if the null hypothesis is true, the test statistic follows a Gaussian law with expectation 0 and variance 1, while false hypotheses are Gaussians with expectation ∆ > 0 and variance 1. A good procedure makes TP large while keeping FP small. These two types of rejections are opposites of each other, but asymmetrical opposites. Among multiple testing procedures that satisfy a given error control, the ones that reject a maximal number of hypotheses are preferred. This is often the only measure of optimality that is investigated. We consider instead a simple monetary utility function with expected value

G_1 = E(TP) − λ E(FP),   (4)

in which each true discovery leads to a gain of 1 and each false discovery to a loss of λ ≥ 1. This approach might be unfamiliar to statisticians, who are used to maximizing power under control of the false rejections, but it can be shown that the two are closely related in the Neyman-Pearson context. Our criterion allows a mixture of different error rates and will pick the one best adapted to λ. In the philosophy of multiple testing, λ > 1, because the subsequent investigation of any discovery is expensive and being on the wrong track may be a costly mistake.

It is stated in Benjamini and Hochberg (1995) that the BH procedure is optimal in the sense that, under FDR control, it maximizes the number of rejections. In situations where m_1 and the effect are large, the number of true rejections is expected to be large. To maximize the number of rejections, counting both true and false, the BH procedure can then be expected to add rejections of null hypotheses even though the corresponding p-values are no longer small, so that we in fact reject a sizable proportion of false ones.

Another monetary utility function, designed to counteract such behavior, has expected value

G_2 = E(TP) − λ E(1 + 2 + ... + FP) = E(TP) − (λ/2) ( E(FP) + E(FP²) ),   (5)

where the loss increases with each additional false discovery. This gain function does not only consider the expected number of false positives, but also its variance.

We compare the gain functions of different procedures to that of the BH procedure, as a function of the price λ. Figures 2, 3, 4 and 5 show the difference between the gain functions of the Bonferroni procedure and of the truncated linear procedure with different values of τ (5, 20 and 100) on one hand, and the BH gain function on the other, for various choices of the number of tests m, of the number of false null hypotheses m_1 and of the effect ∆.

Figure 2: Behavior of the gains G_1 and G_2 of different procedures compared to BH, as a function of the price λ, for m = 1000, m_1 = 100 and ∆ = 1, 2, 3. The curves show the simulated gains minus the gain of the BH procedure (zero line): Bonferroni (continuous), truncated linear with τ = 5 (small dashes), τ = 20 (small-big dashes) and τ = 100 (large dashes).

Figure 3: As Figure 2, for m = 1000 and m_1 = 50.

Figure 4: As Figure 2, for m = 10000 and m_1 = 1000.

Figure 5: As Figure 2, for m = 10000 and m_1 = 500.

Any curve above zero indicates that the BH procedure is worse than the corresponding competing procedure, and vice versa for curves below zero. For both gains, the truncated linear procedures compensate a considerable part of the power lost by the Bonferroni procedure without committing a large number of false positives. Both the truncated linear and the Bonferroni procedure are much more stable in terms of control of false positives. The BH procedure seems to be appropriate only when the price λ of a false discovery is small.
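The simulation scheme just described (shifted-Gaussian test statistics, step-up rejection, gain G_1 = E(TP) − λE(FP)) can be sketched in a few lines of Python. This is our own minimal re-implementation under the stated model, not the authors' code; the function names, the one-sided p-values and the specific parameter values are illustrative.

```python
import numpy as np
from math import erf, sqrt

# P(N(0,1) > z), vectorized over an array of test statistics
_sf = np.vectorize(lambda z: 0.5 * (1.0 - erf(z / sqrt(2.0))))

def gain(m, m1, delta, alpha, lam, threshold, reps=50, seed=0):
    """Monte Carlo estimate of G1 = E(TP) - lam*E(FP) for a step-up
    procedure with threshold sequence threshold(r, m, alpha)."""
    rng = np.random.default_rng(seed)
    t = threshold(np.arange(1, m + 1), m, alpha)
    tp = fp = 0
    for _ in range(reps):
        z = rng.standard_normal(m)
        z[:m1] += delta                        # true effects: N(delta, 1)
        p = _sf(z)                             # one-sided p-values
        order = np.argsort(p)
        below = np.nonzero(p[order] <= t)[0]
        R = int(below[-1]) + 1 if below.size else 0
        rej = order[:R]
        tp += np.count_nonzero(rej < m1)       # true positives
        fp += np.count_nonzero(rej >= m1)      # false positives
    return (tp - lam * fp) / reps

bonferroni = lambda r, m, a: np.full(m, a / m)
bh         = lambda r, m, a: r * a / m
truncated  = lambda tau: (lambda r, m, a: np.minimum(tau, r) * a / m)
```

For instance, `gain(1000, 100, 2.0, 0.05, 5.0, truncated(20))` estimates the gain of the truncated linear procedure with τ = 20 in the setting of Figure 2.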
In some cases, the BH procedure is optimal only when λ = 1, which corresponds to the case where a false rejection and a false acceptance of a null hypothesis are considered to be of equal harm. As expected, the shortcomings of the BH procedure are very pronounced for large m. In the sparse case (m_1 = 500), the BH procedure is not much more powerful than the other procedures when the raw effect is weak (∆ = 1). When the effect becomes moderate (∆ = 2) or strong (∆ = 3), the truncated procedures behave better than both the BH and the Bonferroni procedure for reasonable values of the price λ. When the number of alternatives increases (m_1 = 1000 or 2000), the performance of the truncated methods remains good. However, the BH procedure shows a marked performance instability, especially for m_1 = 2000 and ∆ = 3. The simulations suggest the use of the truncated linear procedure with a value of τ situated between 20% and 50% of √m.

Gene expression levels of m = 35512 genes were measured in two tissues of male mice who were fed either a normal diet or a high-fat diet (Hasenfuss et al., 2014). We first consider an initial exploration of the gene-by-gene comparison of the two tissues (strong difference). Two-sample t-tests were performed to compare the expression levels in the muscle versus the liver. The results are given as two-sided p-values, one for each gene. The left-hand panel of Figure 6 shows that a large majority of the genes exhibit a significant differential effect between the two tissues. The Bonferroni procedure with α = 0.05 finds 21020 differentially expressed genes, whereas the BH procedure with α = 0.05 identifies 28436 such genes. Under Bonferroni we stop reporting positives early, in order to avoid the declaration of false discoveries. Under BH, assuming that m_1 = 29000 (and hence m_0 = 6512), one would expect approximately 0.05 × 28436 × (6512/35512) ≈ 261 falsely identified genes, that is, a false positive rate of 0.05 × (6512/35512) ≈ 0.009.

It is evident that these two procedures cannot be compared directly. If we were to match the approximate expected number of false discoveries, the Bonferroni procedure would need α = 1422, rather than α = 0.05, and this procedure detects 28437 significances, that is, the two threshold sequences meet the ordered p-values almost at the same point. The left-hand panel of Fig. 6 shows these two threshold sequences. The truncated procedure with τ = 1422, in which we decide to apply FDR control up to a certain level and then switch to Bonferroni control, would, of course, give exactly the same result. This illustrates a fact pointed out by Genovese and Wasserman (2002), namely that the value where the threshold sequence crosses the ordered p-values is the only aspect of interest when investigating asymptotic error rates of the threshold sequence.

Commenting on this first example, we can say that when many null hypotheses are rejected, the BH procedure is a questionable choice for performing multiple testing. By definition of the control metric, a large number of rejections means that a large number of false rejections are tolerated. The Bonferroni procedure with α = 0.05 tolerates none or at most a few false discoveries, but can only achieve this feat by limiting the reported number of significances. Among the first 21000 positives the fraction of false ones is negligible. The BH procedure with the same α = 0.05 has roughly an additional 7500 rejections. Among these, we estimate that 261/7500 ≈ 3.5% are false. The local rate of false discoveries increases when we only consider the last 7500 rejections.

As a second example, the right-hand panel of Figure 6 shows the (rounded) p-values one obtains when comparing the two diets in the liver. In this case, we only take the data from one tissue and thus have only half the previous number of observations. Only a few of the 35512 genes are differentially expressed, that is, the effect is weaker than in the first example, and because of this, we only show the smallest 2000 p-values. The superposed curves show the BH threshold sequence with α = 0.05, the Bonferroni threshold with α = 0.05 and truncated threshold sequences with τ = 71 (τ ≈ 38% of √m) and τ = 710 (τ ≈ 377% of √m). The FDR control results in 1853 discoveries, whereas the Bonferroni procedure discovers 353 genes. The truncated thresholds result in 737 and 1386 rejections, respectively. This second example shows, in a case where the effect is moderate, the advantage of using the truncated threshold over the Bonferroni procedure. Up to a certain number of discoveries, we accept the FDR control and then switch to a relaxed Bonferroni. If the effect becomes sparse and a truncated threshold is used, the up-crossing of the ordered p-values may happen in the linearly increasing part of the threshold sequence. In this case, the truncated threshold behaves similarly to the FDR procedure and is able to detect sparse signals, whereas the Bonferroni procedure will miss them. At the same time, it offers better control.

Figure 6: Ordered p-values corresponding to the gene expression levels of m = 35512 genes in two tissues of male mice, some fed a healthy diet, some a high-fat diet. In the left-hand panel the p-values are obtained from a gene-by-gene comparison between muscle and liver, both diets confounded. The right-hand panel shows the p-values obtained when comparing the two diets in liver cells.

These examples illustrate how to use a truncated linear procedure in order to increase the power considerably, but without incurring a punishingly high expected number of false rejections. The optimal choice of the τ parameter depends on the price that the researcher is willing to pay when committing a false discovery and on the (unknown) effect sizes.

The theory developed in Blanchard and Roquain (2008) can be adapted to our error rate. Three of their lemmas form the basis, together with the concept of self-consistency. Any step-up procedure is such that the rejection set satisfies R = {i ∈ I = {1, ..., m} : p_i ≤ T}, where T is an upper bound which usually depends on the p-values p_1, ..., p_m.
To prove control, we need to bound

pFDR(T) = E[ (FP/s_R) 1_{R>0} ] = Σ_{i ∈ I_0} E[ (1_{p_i ≤ T}/s_R) 1_{R>0} ],   (6)

where I_0 = {i ∈ I : H_i is true}. The family of threshold sequences Γ : I × R⁺ → R⁺ of the form

Γ(i, r) = (α/m) w_i s_r,

with i denoting the null hypothesis and r the rank of its p-value, covers all cases of interest to us. It takes account of the weights {w_i, i ∈ I} and the shape s_r. Such a threshold sequence leads to a step-up procedure whose rejection set R satisfies

R ⊆ {i ∈ I : p_i ≤ Γ(i, R)},

where R denotes the number of rejections. This holds for all R and is called the self-consistency condition.

pFDR under independence

Proposition 4.1.
Assume that the p-values p_1, ..., p_m are independent random variables and that non-negative weights w_1, ..., w_m are given. It follows that the step-up procedure with threshold sequence t_r = s_r α/m (see Def. 2.1) applied to q_1 = p_1/w_1, ..., q_m = p_m/w_m (q_i = 1 if w_i = 0), satisfies

pFDR = E(FP/s_R) ≤ (α/m) Σ_{i=1}^m w_i.

Proof 4.1.
Denote by p_{−i} the set of p-values {p_j : j ≠ i}. From (6) and using the self-consistency, we have

pFDR(R) = Σ_{i ∈ I_0} E[ 1_{p_i/w_i ≤ s_R α/m} / s_R ]
        ≤ Σ_{i ∈ I_0} E[ E[ 1_{p_i ≤ (α/m) w_i s_R(p)} / s_R(p) | p_{−i} ] ]
        ≤ (α/m) Σ_{i ∈ I_0} w_i,

where we used the fact that, given p_{−i}, the p-value p_i of a true null hypothesis is stochastically lower bounded by a uniform distribution in the independent case. By case (i) of Lemma 3.2 of Blanchard and Roquain (2008), we obtain the result if we set U = p_i, g(U) = s_{R(p_{−i}, U)} and c = (α/m) w_i. Note that g(U) is non-increasing in U, since s_r is non-decreasing and R is non-increasing in U.
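Proposition 4.1 can be checked by simulation. Under the global null (all hypotheses true, independent uniform p-values, weights w_i ≡ 1), every rejection is false, so the realized pFDP equals R/s_R whenever R > 0, and its Monte Carlo average should stay below α. This is our own sanity check, not part of the paper:

```python
import numpy as np

def pfdp_once(rng, m, alpha, shape):
    """One global-null replication: m independent U(0,1) p-values, step-up
    with t_r = shape(r)*alpha/m. Under the global null FP = R, so the
    realized pFDP is R/s_R if R > 0 and 0 otherwise."""
    p = np.sort(rng.uniform(size=m))
    r = np.arange(1, m + 1)
    below = np.nonzero(p <= shape(r) * alpha / m)[0]
    if below.size == 0:
        return 0.0
    R = int(below[-1]) + 1
    return R / float(shape(np.array([R]))[0])

rng = np.random.default_rng(1)
m, alpha, tau = 100, 0.2, 10
shape = lambda r: np.minimum(tau, r)          # trimmed shape
est = float(np.mean([pfdp_once(rng, m, alpha, shape) for _ in range(4000)]))
# Proposition 4.1 (with w_i = 1) bounds the true pFDR of this procedure by alpha.
```

The estimate `est` should fall below α up to Monte Carlo error; larger replication counts tighten the check.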
Assume that the p-values p_1, ..., p_m are positively dependent (positively regression dependent on a subset, as in Benjamini and Yekutieli (2001)) random variables and that non-negative weights w_1, ..., w_m are given. It follows that the step-up procedure with threshold sequence t_r = s_r α/m (see Def. 2.1) applied to q_1 = p_1/w_1, ..., q_m = p_m/w_m (q_i = 1 if w_i = 0), satisfies

pFDR = E(FP/s_R) ≤ (α/m) Σ_{i=1}^m w_i,

where R denotes the number of rejections.

Proof 4.2.
By case (ii) of Lemma 3.2 of Blanchard and Roquain (2008), with U = p_i, V = s_R and c = (α/m) w_i, the result follows.

pFDR under any dependence structure

Under arbitrary dependence of the p-values, an additional correction to the thresholds has to be performed. Let ν be a probability distribution on (0, ∞) and define for r > 0

ξ(r) = ∫_0^r u dν(u).   (7)

We call the function ξ the reshaping function. It was introduced in Blanchard and Roquain (2008), who called ξ(r) a shape function, because they studied only the case of proportional thresholds.

Proposition 4.3.
Let p_1, ..., p_m be the p-values computed for the m null hypotheses and let non-negative weights w_1, ..., w_m be given. Let ξ(r) be as described above. It follows that the step-up procedure with threshold sequence t_r = ξ(s_r) α/m applied to q_1 = p_1/w_1, ..., q_m = p_m/w_m (q_i = 1 if w_i = 0), satisfies

pFDR = E(FP/s_R) ≤ (α/m) Σ_{i=1}^m w_i.

Proof 4.3.
By case (iii) of Lemma 3.2 of Blanchard and Roquain (2008), with U = p_i, V = s_R and c = (α/m) w_i, the result follows.

The reshaping function ξ(r) = r/(1 + 1/2 + ... + 1/m) is obtained for the probability measure ν that puts point mass proportional to 1/r on r = 1, ..., m. The corresponding threshold sequence for the general dependence case (Proposition 4.3) is

t_r = s_r α / ( m (1 + 1/2 + ... + 1/m) ).   (8)

The corresponding step-up testing procedure is a penalized version of the procedure proposed in Benjamini and Yekutieli (2001). If we instead take ν as the probability measure with point mass proportional to 1/1, 1/2, ... with support {1, 2, ..., τ}, the reshaping function truncates, because it is equal to

ξ(r) = r ( Σ_{i=1}^τ 1/i )^{−1}

for r ≤ τ and ξ(r) = ξ(τ) for r > τ. The corresponding threshold sequence is

s_r^trunc α / ( m Σ_{i=1}^τ 1/i ),

which exceeds (8) for r ≤ ⌊ τ Σ_{i=1}^m (1/i) / Σ_{i=1}^τ (1/i) ⌋. This test thus increases the power in some situations even though it exerts a provably stricter control (see Figure 7). A particular case is τ = 1, where ξ(r) ≡ 1 and the threshold sequence is α/m, that is, the Bonferroni procedure.

Blanchard and Roquain (2008) propose other reshaping functions for the FDR case and comment on their use.
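The two reshaping functions discussed above are straightforward to compute. The following Python sketch (ours) also verifies the crossover claim, namely that the distribution-free truncated threshold exceeds the sequence (8) exactly up to r = ⌊τ Σ_{i=1}^m (1/i) / Σ_{i=1}^τ (1/i)⌋:

```python
import numpy as np

def harmonic(k):
    """H_k = 1 + 1/2 + ... + 1/k."""
    return float(np.sum(1.0 / np.arange(1, k + 1)))

m, tau, alpha = 1000, 20, 0.05
r = np.arange(1, m + 1)

# Reshaping xi(r) = r / H_m from mass ~ 1/i on {1,...,m}, applied to the
# proportional shape: threshold (8), the Benjamini-Yekutieli-type correction.
t_by = (r / harmonic(m)) * alpha / m

# Reshaping from mass ~ 1/i on {1,...,tau}: xi(r) = r/H_tau for r <= tau,
# constant beyond; applied to the trimmed shape s_r = min(tau, r).
t_trunc_free = (np.minimum(tau, r) / harmonic(tau)) * alpha / m

# Rank up to which the truncated distribution-free threshold dominates (8):
crossover = int(tau * harmonic(m) / harmonic(tau))
```

With m = 1000 and τ = 20 the crossover occurs at r = 41, so the truncated sequence is the more powerful of the two over its whole linear range and beyond, despite its stricter control.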
In addition to the freedom of choice of the error rate, the duality that exists between the error rate and the threshold sequence t_r = s_r α/m, together with the reshaping function ξ(r), brings new perspectives to the field of multiple testing procedures and reveals interesting new results. In this section, we give some examples of corollaries and applications of our results.
Figure 7:
Examples of threshold sequences for different values of m and τ: the BH procedure (continuous line) and the truncated linear procedure (big/small dashes), both for independent and positively dependent tests; the modified BH thresholds as described in Benjamini and Yekutieli (2001) (dashes) and the modified truncated linear thresholds as described in the Results section (points), both for the distribution-free case.

Relation between error rates based on expectation or on tail probabilities

Define the penalized false exceedance rate (pFER) as

pFER_q = P( FP / s_R > q ).

By Markov's inequality,

E( FP / (q s_R) ) ≤ α  implies  pFER_q = P( FP / s_R > q ) ≤ α,

from which it follows that the threshold sequence q s_r α/m provides conservative control of this error rate, under independence (Theorem 4.1) or positive dependence (Theorem 4.2). Under a general dependency structure, we replace q s_R by ξ(q s_R) (see Lemma 4.3). In an older version of this manuscript (Meskaldji et al., 2011), we proposed a step-down procedure that tightly controls the pFER under different assumptions.

An interesting choice is q = 0.5, in which case pFER_q becomes the median of FP/s_R, and we note that in order to achieve control we simply have to halve α. Consequently, if we wish to control the median of the FDP at level α, we can use the BH procedure with α/2.

Hochberg's procedure is a step-up procedure with Holm's threshold sequence t_r = α/(m − r + 1). If we factor out α/m, the shape is seen to be s_r = m/(m − r + 1). Our results show that Hochberg's procedure controls

E( FP (m − R + 1) / m ) ≈ E(FP)  (for large m),

both under independence and positive dependence.

Another example of this type is the asymptotic control of the optimal rejection curve (ORC) introduced by Finner et al. (2009), which asymptotically controls the FDR. We give here a simple proof of that control. The ORC threshold sequence is t_r = rα/(m − r(1 − α)), which equals 1 when r = m and thus forces R = m. With a slight modification it can be used in practice. When we factor out α/m we obtain s_r = rm/(m − r(1 − α)), and we can conclude that the modified threshold controls the expectation

E( (FP/R) (1 − (R/m)(1 − α)) ) → FDR  as m → ∞.

This result is valid under independence and positive dependence.

Assumption weakening
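The median-FDP recipe above (pFER with q = 0.5, hence BH at a halved level) is easy to state in code. This is a sketch with hypothetical function names; only the standard BH step-up rule is assumed.

```python
import numpy as np

def bh_rejections(pvals, alpha):
    """Benjamini-Hochberg step-up: R = max{r : p_(r) <= r * alpha / m}."""
    m = len(pvals)
    p_sorted = np.sort(np.asarray(pvals))
    below = np.nonzero(p_sorted <= np.arange(1, m + 1) * alpha / m)[0]
    return 0 if below.size == 0 else int(below[-1]) + 1

def median_fdp_rejections(pvals, alpha):
    """Median-FDP control via the pFER with q = 1/2:
    run BH at the halved level alpha / 2."""
    return bh_rejections(pvals, alpha / 2)
```

Halving the level can only shrink the rejection set, which is the price paid for moving from control in expectation to control of the median of the FDP.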
For a step-up procedure associated with a specific threshold sequence, one can find different error rates controlled by this procedure under different assumptions on the dependence structure of the p-values. The proportional threshold t_r = rα/m, for example, is obtained from ξ(s_r) α/m when ξ(r) is the inverse of the shape sequence s_r, that is, ξ(s_r) = r. This shows that the BH procedure controls the quantity

E( FP / s_R ) = E( FP / ξ^{-1}(R) )

at level α under any distribution of the p-values and for an arbitrary choice of ξ(r) or the shape sequence s_r. We are thus able to determine the error rate that is under control if the independence or positive dependence assumption is violated. This result is obvious for a linear ξ(r) such as ξ(r) = r / Σ_{i=1}^{m} (1/i), in which case

ξ^{-1}(r) = r Σ_{i=1}^{m} (1/i) = s_r.

The BH procedure, no matter the dependence structure of the p-values, controls

E( FP / ( R Σ_{i=1}^{m} 1/i ) ) ≤ α,

which immediately leads to

E( FP / R ) ≤ α Σ_{i=1}^{m} (1/i) ≈ α log(m).

Our results thus inter-relate different error rates and conditions on the dependence structure of p-values through the relation

t_r = ξ(s_r) α/m ≡ s̃_r α/m,  for r = 1, ..., m.

Acknowledgments
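The distribution-free guarantee above (BH at level α still bounds E(FP/R) by α times the harmonic sum, roughly α log m) can be computed directly. A minimal sketch; the function names are illustrative, not from the original text.

```python
import math

def harmonic(m):
    """H_m = sum_{i=1}^m 1/i, approximately log(m) + 0.5772 for large m."""
    return sum(1.0 / i for i in range(1, m + 1))

def distribution_free_fdr_bound(alpha, m):
    """Bound on E(FP/R) that BH at level alpha guarantees under
    arbitrary dependence of the p-values: alpha * H_m ~ alpha * log(m)."""
    return alpha * harmonic(m)
```

Equivalently, running BH at the deflated level α/H_m (the Benjamini-Yekutieli correction) restores the nominal level α under arbitrary dependence.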
This work was supported by the Swiss National Science Foundation [144467, PP00P2-146318]. A part of this work was done while the first author was a Ph.D. student in the Signal Processing Laboratory (LTS5), École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland (Meskaldji, 2013). The data is from the Nestlé Chair in Energy Metabolism (Prof. J. Auwerx) and was collected by a graduate student (E. G. Williams). The authors would like to thank Etienne Roquain for interesting comments and suggestions.

References
Benjamini, Y. (2010). Simultaneous and selective inference: Current successes and future challenges. Biometrical Journal 52(6, SI), 708–721.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B. Methodological 57(1), 289–300.

Benjamini, Y. and Y. Hochberg (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics 24(3), 407–418.

Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29(4), 1165–1188.

Blanchard, G. and E. Roquain (2008). Two simple sufficient conditions for FDR control. Electronic Journal of Statistics 2, 963–992.

Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pub. del R Ist. Sup. di Sci. Eco. e Com. di Fir. 8, 3–62.

Dudoit, S. and M. J. van der Laan (2008). Multiple testing procedures with applications to genomics. Springer Series in Statistics. New York: Springer.

Finner, H., T. Dickhaus, and M. Roters (2009). On the false discovery rate and an asymptotically optimal rejection curve. The Annals of Statistics 37, 596–618.

Genovese, C. and L. Wasserman (2002). Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64, 499–517.

Genovese, C. R., K. Roeder, and L. Wasserman (2006). False discovery control with p-value weighting. Biometrika 93(3), 509–524.

Hasenfuss, S. C., L. Bakiri, M. K. Thomsen, E. G. Williams, J. Auwerx, and E. F. Wagner (2014). Regulation of steatohepatitis and PPAR signaling by distinct AP-1 dimers. Cell Metabolism 19(1), 84–95.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2), 65–70.

Hommel, G. and T. Hoffmann (1987). Controlled uncertainty. In P. Bauer, G. Hommel, and E. Sonnemann (Eds.), Multiple Hypotheses Testing, pp. 154–161. Springer, Heidelberg.

Meskaldji, D.-E. (2013). Multiple Comparison Procedures for Large Correlated Data with Application to Brain Connectivity Analysis. Ph.D. thesis, École Polytechnique Fédérale de Lausanne.

Meskaldji, D.-E., E. Fischi-Gomez, A. Griffa, P. Hagmann, S. Morgenthaler, and J.-P. Thiran (2013). Comparing connectomes across subjects and populations at different scales. NeuroImage 80, 416–425.

Meskaldji, D.-E., J.-P. Thiran, and S. Morgenthaler (2011). A comprehensive error rate for multiple testing. ArXiv e-prints 1112.4519.

Meskaldji, D.-E., L. Vasung, D. Romascano, J.-P. Thiran, P. Hagmann, S. Morgenthaler, and D. Van De Ville (2015). Improved statistical evaluation of group differences in connectomes by screening–filtering strategy with application to study maturation of brain connections between childhood and adolescence. NeuroImage 108, 251–264.

Roquain, E. and F. Villers (2011). Exact calculations for false discovery proportion with application to least favorable configurations. The Annals of Statistics 39, 584–612.

Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73(3), 751–754.

van der Laan, M. J., S. Dudoit, and K. S. Pollard (2004). Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives.