AdaPT: An interactive procedure for multiple testing with side information
Lihua Lei and William Fithian
Department of Statistics, University of California, Berkeley, USA
E-mail: {lihua.lei, wfithian}@berkeley.edu

Summary. We consider the problem of multiple hypothesis testing with generic side information: for each hypothesis H_i we observe both a p-value p_i and some predictor x_i encoding contextual information about the hypothesis. For large-scale problems, adaptively focusing power on the more promising hypotheses (those more likely to yield discoveries) can lead to much more powerful multiple testing procedures. We propose a general iterative framework for this problem, called the Adaptive p-value Thresholding (AdaPT) procedure, which adaptively estimates a Bayes-optimal p-value rejection threshold and controls the false discovery rate (FDR) in finite samples. At each iteration of the procedure, the analyst proposes a rejection threshold and observes partially censored p-values, estimates the false discovery proportion (FDP) below the threshold, and proposes another threshold, until the estimated FDP is below α. Our procedure is adaptive in an unusually strong sense, permitting the analyst to use any statistical or machine learning method she chooses to estimate the optimal threshold, and to switch between different models at each iteration as information accrues. We demonstrate the favorable performance of AdaPT by comparing it to state-of-the-art methods in five real applications and two simulation studies.
Keywords: multiple testing, false discovery rate, p-value weighting, selective inference, adaptive inference, martingales
1. Introduction
In classical statistics we assume that the question to be answered, and the analysis to be used in answering the question, are both fixed in advance of collecting the data. Many modern applications, however, involve extremely complex data sets that may be collected without any specific hypothesis in mind. Indeed, very often the express goal is to explore the data in search of insights we may not have expected to find. A central challenge in modern statistics is to provide scientists with methods that are flexible enough to allow for exploration, but that nevertheless provide statistical guarantees for the conclusions that are eventually reported.

Selective inference methods blend exploratory and confirmatory analysis by allowing a search over the space of potentially interesting questions, while still guaranteeing control of an appropriate Type I error rate such as a conditional error rate (e.g., Yekutieli, 2012; Lee et al., 2016; Fithian et al., 2014), familywise error rate (e.g., Tukey, 1994; Berk et al., 2013), or false discovery rate (e.g., Benjamini and Hochberg, 1995; Barber and Candès, 2015). However, most selective inference methods require that the selection algorithm be specified in advance, forcing a choice between either ignoring any difficult-to-formalize domain knowledge or sacrificing statistical validity guarantees.

Interactive data analysis methods relax the requirement of a pre-defined selection algorithm. Instead, they provide for an interactive analysis protocol between the analyst and the data, guaranteeing statistical validity as long as the protocol is followed.
The two central questions in interactive data analysis are "what did the analyst know and when did she know it?" Previous methods for interactive data analysis involve randomization (Dwork et al., 2015; Tian et al., 2018) to control the analyst's access to the data at the time she decides what questions to ask.

This paper proposes an iterative, interactive method for multiple testing in the presence of side information about the hypotheses. We restrict the analyst's knowledge by partially censoring all p-values smaller than a currently-proposed rejection threshold, and guarantee finite-sample FDR control by applying a version of the optional-stopping argument pioneered by Storey et al. (2004) and extended in Barber and Candès (2015); G'Sell et al. (2016); Li and Barber (2016a); Lei and Fithian (2016); Barber and Candès (2016).

In many areas of modern applied statistics, from genetics and neuroimaging to online advertising and finance, researchers routinely test thousands or millions of hypotheses at a time. For large-scale testing problems, perhaps the most celebrated multiple testing procedure of the modern era is the Benjamini–Hochberg (BH) procedure (Benjamini and Hochberg, 1995). Given n hypotheses and a p-value for each one, the BH procedure returns a list of rejections or "discoveries." If R is the number of total rejections and V is the number of false rejections (rejections of true null hypotheses), the BH procedure controls the false discovery rate (FDR), defined as

FDR = E[ V / max{R, 1} ],   (1)

at a user-specified target level α. The random variable V / max{R, 1} is called the false discovery proportion (FDP).

The BH procedure is nearly optimal when the null hypotheses are exchangeable a priori, and nearly all true. In other settings, however, the power can be improved, sometimes dramatically, by applying prior knowledge or by learning from the data.
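As a baseline for the adaptive procedures discussed next, the BH procedure itself can be sketched in a few lines: sort the p-values and reject the k smallest, where k is the largest index with p_(k) ≤ αk/n. This is a standard implementation, not code from the paper:

```python
import numpy as np

def bh_procedure(pvals, alpha=0.1):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is
    the largest index with p_(k) <= alpha * k / n."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    below = sorted_p <= alpha * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0]) + 1  # largest index passing the BH line
    return order[:k]                      # indices of rejected hypotheses

rejected = bh_procedure([0.001, 0.009, 0.04, 0.2, 0.7], alpha=0.1)
print(rejected)  # -> [0 1 2]
```

Note that the third p-value (0.04) exceeds its own threshold 0.1·3/5 = 0.06 pointwise would be false: 0.04 ≤ 0.06, so all three small p-values are rejected together, illustrating the "step-up" character of BH.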
For example, adaptive FDR-controlling procedures can gain in power by estimating the overall proportion of true nulls (Storey, 2002), applying priors to increase power using p-value weights (Benjamini and Hochberg, 1997; Genovese et al., 2006; Dobriban et al., 2015; Dobriban, 2016), grouping similar null hypotheses and estimating the true null proportion within each group (Hu et al., 2012), or exploiting a prior ordering to focus power on more "promising" hypotheses near the top of the ordering (Barber and Candès, 2015; G'Sell et al., 2016; Li and Barber, 2016a; Lei and Fithian, 2016).

In most large-scale testing problems, the null hypotheses do not comprise an undifferentiated list; rather, each hypothesis is associated with rich contextual information that could potentially help to inform our testing procedures. For example, Li and Barber (2016a) test for differential expression of 22,283 genes between a treatment and control condition for a breast cancer drug, with side information in the form of an ordering of genes from most to least "promising" using auxiliary data collected at larger dosages. Multiple testing procedures that exploit the ordering can reject hundreds of hypotheses while the BH procedure (which does not exploit the ordering) rejects none.

More generally, prior information could arise in more complex ways. For example, consider testing for association of 400,000 single-nucleotide polymorphisms (SNPs) with each of 40 related diseases. If gene-regulatory relationships are known, then we might expect SNPs near related genes to be associated (or not) with related diseases, but without knowing ahead of time which gene-disease pairs are promising. In a similar vein, Fortney et al. (2015) used prior knowledge of each SNP's associations with age-related diseases to focus their search for SNPs associated with longevity, leading to novel discoveries. Inspired by examples like this, Ignatiadis et al.
(2016) and Li and Barber (2016b) have recently proposed a more general problem setting where, for each hypothesis H_i, i ∈ [n], we observe not only a p-value p_i ∈ [0, 1] but also a predictor x_i lying in some generic space X. Unlike p_i, x_i carries only indirect information about the hypothesis: it is meant to capture some side information that might bear on H_i's likelihood to be false, or on the power of p_i under the alternative, but the nature of this relationship is not fully known ahead of time and must be learned from the data.

In other situations, the "predictor" information could simply represent a measure of sample size or overall signal for testing the i-th hypothesis, which could be informative about the power of the i-th test to distinguish the alternative from the null. For example, if each p_i concerns a test for association between the i-th SNP and a disease, then the overall prevalence of that SNP (in the combined treatment and control groups) can be used as prior information. Or, if p_i arises from a two-sample t-test, we could use the pooled variance, the sample variance ignoring the group labels, as prior information; see e.g. (Bourgon et al., 2010; Ignatiadis et al., 2016).

Fig. 1: Illustration of one step of the AdaPT procedure with a univariate predictor. (a) AdaPT (intermediate stage): A_t = 4 and R_t = 11 are the numbers of blue and red points respectively, leading to \widehat{FDP} = (1 + 4)/11 ≈ 0.45. If \widehat{FDP} ≤ α, we stop and reject the red points; otherwise we choose a new threshold s_{t+1} ⪯ s_t and continue. (b) AdaPT (analyst's view): information available to the analyst when choosing s_{t+1}(x) (A_t and R_t are also known). Each red and blue point is reflected across p = 0.5, leaving the analyst to impute which are the true p-values and which are the mirror images.

This paper presents a new framework for FDR control with generic side information, which we call adaptive p-value thresholding, or AdaPT for short. Our method proceeds iteratively: at each step t = 0, 1, ..., the analyst proposes a rejection threshold s_t(x) and computes an estimator \widehat{FDP}_t of the false discovery proportion for this threshold. If \widehat{FDP}_t ≤ α, she stops and rejects every H_i for which p_i ≤ s_t(x_i). Otherwise, she proposes a more stringent threshold s_{t+1} ⪯ s_t and moves on to the next iteration, where the notation a ⪯ b means a(x) ≤ b(x) for all x ∈ X. The estimator \widehat{FDP}_t is computed by comparing the number R_t of rejections to the number A_t of p-values for which p_i ≥ 1 − s_t(x_i):

R_t = |{i : p_i ≤ s_t(x_i)}|,   A_t = |{i : p_i ≥ 1 − s_t(x_i)}|,   and   \widehat{FDP}_t = (1 + A_t)/(R_t ∨ 1).

The estimate
\widehat{FDP}_t is also used by Lei and Fithian (2016) and Arias-Castro and Chen (2016). Figure 1a illustrates the way s_t(x) and 1 − s_t(x) partition the data into three regions; A_t is the number of points in the upper blue region and R_t is the number in the lower red region.

At each step t, the analyst can choose the next threshold s_{t+1}(x) however she chooses, with only two constraints. First, s_{t+1} ⪯ s_t as stated before. Second, the large and small p-values (the ones contributing to A_t and R_t) are partially masked. Specifically, at step t the analyst is allowed to observe A_t and R_t, as well as the entire sequence (x_i, \tilde{p}_{t,i})_{i=1}^n, where

\tilde{p}_{t,i} = p_i if s_t(x_i) < p_i < 1 − s_t(x_i), and \tilde{p}_{t,i} = {p_i, 1 − p_i} otherwise.   (2)

Thus, if p_i = 0.01 ≤ s_t(x_i) then at step t the analyst knows only that p_i is either 0.01 or 0.99, but if s_{t+1}(x_i) < 0.01 then p_i is revealed at step t + 1 as 0.01. Figure 1b illustrates what the analyst can see: each red and blue point from Figure 1a is shown along with its mirror image reflected across the midline p = 0.5.
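To make the masking concrete, the following sketch (illustrative, not code from the paper) computes A_t, R_t, \widehat{FDP}_t, and the partially masked p-values for a given threshold vector s_t(x_i):

```python
import numpy as np

def adapt_step(pvals, s):
    """One AdaPT bookkeeping step: counts and partially masked p-values.

    pvals: array of p-values; s: the threshold s_t(x_i) evaluated at each x_i.
    """
    pvals, s = np.asarray(pvals), np.asarray(s)
    R = int(np.sum(pvals <= s))        # rejections at this threshold
    A = int(np.sum(pvals >= 1 - s))    # mirror counts
    fdp_hat = (1 + A) / max(R, 1)
    # Masked view: p-values in (s, 1-s) are revealed; the rest are only
    # known up to the unordered pair {p, 1-p}.
    masked = [p if s_i < p < 1 - s_i else {p, round(1 - p, 10)}
              for p, s_i in zip(pvals, s)]
    return R, A, fdp_hat, masked

R, A, fdp_hat, masked = adapt_step([0.01, 0.30, 0.55, 0.98], [0.05] * 4)
print(R, A, fdp_hat)  # -> 1 1 2.0
```

Here p_1 = 0.01 appears to the analyst only as the pair {0.01, 0.99}, while p_2 = 0.30 and p_3 = 0.55 lie in the revealed band and are shown unmasked.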
We show in Section 3 that, in a generic two-groups empirical Bayes model, an ideal choice for s_t(x) would be a level surface of the local false discovery rate (fdr), as a function of x and p:

fdr(p | x) = P(H_i is null | p_i = p, x_i = x).

Formally, fdr(p | x) is unidentifiable from the data but, under reasonable assumptions, we can use a good proxy based on the conditional density of the p-value given the covariate, f(p | x) (note however that our method controls FDR without any empirical Bayes assumptions).

In each step information is gradually revealed to the analyst as the threshold shrinks and more p-values are unmasked. Our procedure is adaptive in an unusually strong sense: provided that the two constraints are met, the analyst may apply any method she wants to select s_{t+1}(x), consulting her own hunches or the intuition of domain experts, and can even switch between different methods as information accrues. Moreover, the analyst is under no obligation to describe, or even to fully understand, her update rule for choosing s_{t+1}(x). In this sense, we say our method is fully interactive: the analyst's behavior is arbitrary as long as she abides by a certain protocol for interacting with the algorithm.

While the partial masking of p-values obscures just enough information from the analyst to control the FDR, in many cases it does not seriously impact the ability of the analyst to learn the optimal threshold surface s(x). This is because, by the time the algorithm is close to stopping, the vast majority of p-values have already been revealed, and many of the ones that remain masked are so minuscule as to leave little doubt about whether p_i is large or small.
As we show in numerous simulation and real data experiments in Section 5, the fdr estimates based on masked data typically converge to the full-data estimates well before the algorithm stops.

The AdaPT procedure controls FDR at level α in finite samples provided that the null p-values are uniform, or mirror-conservative as defined in Section 2.1, and independent conditional on the non-null p-values. The proof relies on a pairwise exchangeability argument similar to the argument in Barber and Candès (2015).

Algorithm 1 summarizes the AdaPT procedure, using the generic sub-routine update to represent whatever process the analyst uses to select s_{t+1}(x). Note that s_{t+1}(x) is a random function that is measurable with respect to F_t. Sections 3–4 discuss recommendations for a good update routine. It is worth mentioning that AdaPT reduces to the Barber–Candès method, inspired by Barber and Candès (2015) and proposed by Arias-Castro and Chen (2016), when s_t(x) is a constant function for every t.

Algorithm 1 AdaPT
Input: predictors and p-values (x_i, p_i)_{i ∈ [n]}, initialization s_0, target FDR level α
Procedure:
for t = 0, 1, . . . do
    \widehat{FDP}_t ← (1 + A_t)/(R_t ∨ 1)
    if \widehat{FDP}_t ≤ α then
        Reject {H_i : p_i ≤ s_t(x_i)}; Return s_t
    end if
    s_{t+1} ← update((x_i, \tilde{p}_{t,i})_{i ∈ [n]}, A_t, R_t, s_t)
end for

In recent work Ignatiadis et al. (2016) propose a different method, independent hypothesis weighting (IHW), for multiple testing with side information. They first bin the predictors into groups g_1, ..., g_K, and then apply the weighted-BH procedure at level α with piecewise-constant weights; i.e., if x_i ∈ g_k, then w_i = w(g_k). The weights w(g_1), ..., w(g_K) are chosen to maximize the number of rejections. This proposal is similar in spirit to the AdaPT procedure since it attempts to find optimal weights, but it is a bit more limited: first, binning the data may be difficult if the predictor space X is multivariate or more complex; and second, their method is only guaranteed to control FDR asymptotically, as the number of bins stays fixed and the number of hypotheses in each bin grows to infinity. As a result, we must trust that n is large enough to support however many bins we have chosen to use. By contrast, AdaPT can use any machine-learning method to estimate \hat{f}(p | x), and we can "overfit away" without fear of compromising finite-sample FDR control (though overfitting can of course reduce our power if our fdr estimates are too noisy). Another method is proposed by Du et al. (2014) for the case where the covariate is an auxiliary univariate p-value derived from prior information. However, similar to Ignatiadis et al.
(2016), it only controls FDR asymptotically, under the fairly strong conditions that the p-values are symmetrically distributed under the null and suitably bounded under the alternative.

Perhaps the procedure most closely related to ours is the structure-adaptive BH algorithm, or SABHA (Li and Barber, 2016b). SABHA first censors the p-values below a fixed level τ, replacing each p_i with p_i 1{p_i > τ}. Using these, they can estimate π(x), defined as P(H_i is non-null | x_i = x), as a function of x, then apply the weighted BH procedure of Genovese et al. (2006) with weights \hat{\pi}(x_i)^{-1}, at a corrected FDR level \tilde{α} = Cα (where C < 1). We notice that this type of censoring is also employed in a variant of IHW (Ignatiadis and Huber, 2017), which guarantees FDR control in finite samples.

As the first procedure to provably control the finite-sample FDR using generic feature information, SABHA represents a major step forward. However, AdaPT has several important advantages. First, even if \hat{\pi}(x) estimates π(x) consistently, the weights π(x)^{-1} are not Bayes optimal, as we show in Section 3; by contrast, our method estimates a Bayes-optimal threshold. Second, the correction factor C makes the method conservative and restricts the available estimators \hat{\pi}^{-1} to those with provably low Rademacher complexity. Third, AdaPT can use more information for learning: in later stages we will typically have s_t(x_i) ≪ 0.5, so the partially masked p-values \tilde{p}_{t,i} may be much more informative than p_i 1{p_i > τ}, especially since our goal is to estimate f(p | x) for small values of p.

Finally, we remark that there is a literature on very different approaches for incorporating covariates into multiple testing problems; see e.g. Lewinger et al. (2007); Ferkingstad et al. (2008); Lawyer et al. (2009); Zablocki et al. (2014).
Unlike our method (and IHW and SABHA), these approaches hinge on the correct specification of the model and might lose their statistical guarantees if the proposed model deviates from the ground truth. By contrast, our method (and IHW and SABHA) relies only on the validity of the p-values (see the assumptions of Theorem 1 in the next section) and guarantees FDR control even when employing a misspecified model.

Section 2 defines the AdaPT procedure more formally and gives our main result: if the null p-values are independent and mirror-conservative (defined below), AdaPT controls FDR at level α in finite samples. Section 3 explains why selection of s_{t+1}(x) will typically operate by first estimating the conditional density f(p | x) as a function of x, and Section 4 gives practical suggestions for update rules. Section 5 illustrates the AdaPT procedure's power on five real datasets and two simulated datasets, and Section 6 concludes. The programs to replicate all our experiments can be obtained from https://github.com/lihualei71/adaptPaper/. Our R package adaptMT can be found at https://github.com/lihualei71/adaptMT/.
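Algorithm 1, specialized to a constant threshold s_t(x) ≡ s_t, can be sketched in a few lines. This special case is the Barber–Candès procedure mentioned above; the `update` rule here simply steps down to the next candidate constant threshold. An illustrative sketch, not the paper's implementation:

```python
import numpy as np

def adapt_constant(pvals, alpha=0.1, grid=None):
    """AdaPT with constant thresholds s_t(x) = s_t (the Barber-Candes
    special case).  The update rule steps down through a grid of
    candidate thresholds until the FDP estimate falls below alpha."""
    pvals = np.asarray(pvals)
    if grid is None:
        # candidate constant thresholds, from most to least permissive
        grid = np.sort(np.unique(np.minimum(pvals, 1 - pvals)))[::-1]
    for s in grid:
        R = np.sum(pvals <= s)
        A = np.sum(pvals >= 1 - s)
        if (1 + A) / max(R, 1) <= alpha:
            return np.nonzero(pvals <= s)[0]  # rejection set
    return np.array([], dtype=int)

print(adapt_constant(np.array([0.001, 0.002, 0.003, 0.8]), alpha=0.4))  # -> [0 1 2]
```

The full AdaPT procedure replaces the fixed grid with an arbitrary analyst-chosen `update` acting on the masked data (x_i, \tilde{p}_{t,i}), which is what allows side information to enter.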
2. The AdaPT procedure
Let [n] denote the set {1, ..., n}. For each hypothesis H_i, i ∈ [n], we observe x_i ∈ X and p_i ∈ [0, 1]. Let H ⊆ [n] denote the set of true null hypotheses. We will assume throughout that (p_i)_{i ∈ H} are mutually independent, and independent of (p_i)_{i ∉ H} (see Section 6 for a discussion of how we might relax the independence assumption). Finally, for each i ∈ H, we assume that p_i is either uniform or mirror-conservative in a sense we will define shortly.

Let F_t for t = 0, 1, ... represent the filtration generated by all information available to the user at step t:

F_t = σ((x_i, \tilde{p}_{t,i})_{i=1}^n, A_t, R_t).

We similarly define an initial σ-field with all p-values masked,

F_{-1} = σ((x_i, {p_i, 1 − p_i})_{i=1}^n).

The p-value masking is equivalent to requiring that s_{t+1} ∈ F_t. (For simplicity we have implicitly ruled out the possibility that the analyst uses a randomized rule to update the threshold, but this restriction could be easily removed.) The two constraints s_{t+1} ⪯ s_t and s_{t+1} ∈ F_t ensure that (F_t)_{t = −1, 0, 1, ...} is a filtration; i.e., the information in F_t only grows from t to t + 1:

Lemma 1.
For all t ≥ −1, F_t ⊆ F_{t+1}.

Proof.
We use induction on u to show that F_u ⊆ F_t for any u ≤ t. The conclusion is trivial for u = −1, since {p_i, 1 − p_i} is always computable from \tilde{p}_{t,i} (masked p-values can always be computed from masked or unmasked ones).

For u ≥ 0, note that, by the inductive assumption, s_u ∈ F_{u−1} ⊆ F_t. As a result, we can compute \tilde{p}_{u,i}, which depends only on \tilde{p}_{t,i} and s_u(x_i). Furthermore,

R_u = R_t + |{i : \tilde{p}_{t,i} ∈ (s_t(x_i), s_u(x_i)]}|,   A_u = A_t + |{i : \tilde{p}_{t,i} ∈ [1 − s_u(x_i), 1 − s_t(x_i))}|,

completing the proof.

To avoid trivialities we assume that the analyst always reveals at least one censored p-value in each step of the algorithm, since there is no reason ever to update the threshold surface in a way that reveals no new information. Thus, the stopping time \hat{t} ≤ n almost surely.

In many common settings, null p-values are conservative but not necessarily exactly uniform. For example, p-values from permutation tests are discrete, and p-values for composite null hypotheses are often conservative if the true value of the parameter lies in the interior of the null.

Our method does not require uniformity, but the standard definition of conservatism, namely that P_{H_i}(p_i ≤ a) ≤ a for all 0 ≤ a ≤ 1, is not enough to guarantee FDR control. Instead, we say that a p-value p_i is mirror-conservative if

P_{H_i}(p_i ∈ [a_1, a_2]) ≤ P_{H_i}(p_i ∈ [1 − a_2, 1 − a_1]),   for all 0 ≤ a_1 ≤ a_2 ≤ 0.5.   (3)

If p_i is discrete, (3) means p_i = 1 − a is at least as likely as p_i = a for a ≤ 0.5; if p_i has a continuous density, it means the density is at least as large at 1 − a as at a. Mirror-conservatism is not a consequence of conservatism, nor is conservatism a consequence of mirror-conservatism (take p_i = B with B ∼ Bernoulli(0.5), which is mirror-conservative but not conservative). Any null distribution with an increasing density is evidently both conservative and mirror-conservative.

Permutation p-values are mirror-conservative, as are p-values for one-sided tests of univariate parameters with monotone likelihood ratio (with discrete p-values randomized to be uniform at the boundary between the null and alternative). See Appendix B.1 for proofs of these claims.

We are now prepared to prove our main result: the AdaPT procedure controls FDR in finite samples. The proof relies on a similar optional stopping argument as the one presented in Lei and Fithian (2016) and Barber and Candès (2016) (themselves modifications of arguments in Storey et al. (2004) and Barber and Candès (2015)). Let V_t and U_t denote the numbers of null p_i ≤ s_t(x_i) and null p_i ≥ 1 − s_t(x_i), respectively. If the null p-values are uniform then, no matter how we choose s_t(x) at each step, we will always have V_t ≈ U_t and

\widehat{FDP}_t > U_t/(R_t ∨ 1) ≈ V_t/(R_t ∨ 1) = FDP_t.

Lemma 2.
Suppose that, conditionally on the σ-field G_{-1}, the random variables b_1, ..., b_n are independent Bernoulli random variables with P(b_i = 1 | G_{-1}) = ρ_i ≥ ρ > 0, almost surely. Also suppose that [n] ⊇ C_0 ⊇ C_1 ⊇ ···, with each subset C_{t+1} measurable with respect to

G_t = σ(G_{-1}, C_t, (b_i)_{i ∉ C_t}, Σ_{i ∈ C_t} b_i).

If \hat{t} is an almost-surely finite stopping time with respect to the filtration (G_t)_{t ≥ 0}, then

E[ (1 + |C_{\hat{t}}|) / (1 + Σ_{i ∈ C_{\hat{t}}} b_i) | G_{-1} ] ≤ ρ^{-1}.

Our Lemma 2 generalizes Lemma 1 in Barber and Candès (2016) and uses a very similar technical argument. The proof is given in the appendix. Using Lemma 2, we can give our main result:
Theorem 1.
Assume that the null p-values are independent of each other and of the non-null p-values, and that the null p-values are uniform or mirror-conservative. Then the AdaPT procedure controls the FDR at level α, conditional on F_{-1} and also marginally.

Proof.
Let \hat{t} denote the step at which we stop and reject. Then

FDP_{\hat{t}} = V_{\hat{t}}/(R_{\hat{t}} ∨ 1) = [(1 + U_{\hat{t}})/(R_{\hat{t}} ∨ 1)] · [V_{\hat{t}}/(1 + U_{\hat{t}})] ≤ α · V_{\hat{t}}/(1 + U_{\hat{t}}),

where the last step follows from the stopping condition that \widehat{FDP}_{\hat{t}} ≤ α, and the fact that U_t ≤ A_t. We will finish the proof by establishing that E[V_{\hat{t}}/(1 + U_{\hat{t}})] ≤ 1, using Lemma 2.

Let m_i = min{p_i, 1 − p_i} and b_i = 1{p_i ≥ 0.5}, so p_i = b_i(1 − m_i) + (1 − b_i)m_i. Then knowing b_i and m_i is equivalent to knowing p_i. Let C_t = {i ∈ H : p_i ∉ (s_t(x_i), 1 − s_t(x_i))}, representing the null p-values that are not visible to the analyst at time t. Then,

U_t = Σ_{i ∈ C_t} b_i,   and   V_t = Σ_{i ∈ C_t} (1 − b_i) = |C_t| − U_t.

Further, define the σ-fields

G_{-1} = σ((x_i, m_i)_{i=1}^n, (b_i)_{i ∉ H}),   and   G_t = σ(G_{-1}, C_t, (b_i)_{i ∉ C_t}, U_t).

The assumptions of independence and mirror-conservatism guarantee P(b_i = 1 | G_{-1}) ≥ 0.5 for i ∈ H, with the b_i conditionally independent.

Next, note that F_t ⊆ G_t because p_i ∈ G_t for each p_i ∈ (s_t(x_i), 1 − s_t(x_i)),

A_t = U_t + |{i ∉ H : p_i ≥ 1 − s_t(x_i)}| ∈ G_t,

and R_t ∈ G_t by a similar argument. It follows that \hat{t} = min{t : \widehat{FDP}_t ≤ α} is a stopping time with respect to G_t; furthermore, C_{t+1} ∈ F_t ⊆ G_t by assumption.

As a result, conditional on G_{-1}, we can apply Lemma 2 to obtain

E[FDP | G_{-1}] ≤ α E[ V_{\hat{t}}/(1 + U_{\hat{t}}) | G_{-1} ] = α E[ (1 + |C_{\hat{t}}|)/(1 + U_{\hat{t}}) − 1 | G_{-1} ] ≤ α(2 − 1) = α.

Note that F_{-1} ⊂ G_{-1}. The proof is completed by applying the tower property of conditional expectation.

The main technical point of departure for our method is that the optional stopping argument is not merely a technical device to prove FDR control for a fixed algorithm like the BH, Storey-BH, or Knockoff+ procedures. Instead, we push the optional-stopping argument to its limit, allowing the analyst to interact with the data in a much more flexible and adaptive way. Sections 6.2–6.3 further investigate the connection to knockoffs.
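The heuristic behind the proof, that for uniform nulls \widehat{FDP}_t over-estimates the realized FDP on average, can be checked quickly by simulation. An illustrative sketch with uniform nulls, an assumed Beta(0.1, 1) alternative, and a fixed constant threshold (all choices hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, n_rep = 2000, 0.05, 500
gaps = []
for _ in range(n_rep):
    null = rng.random(n)                    # uniform null p-values
    alt = rng.beta(0.1, 1.0, size=n // 4)   # non-nulls concentrated near zero
    p = np.concatenate([null, alt])
    is_null = np.concatenate([np.ones(n, bool), np.zeros(n // 4, bool)])
    R = np.sum(p <= s)
    A = np.sum(p >= 1 - s)
    fdp_hat = (1 + A) / max(R, 1)           # the AdaPT estimate
    fdp = np.sum(is_null & (p <= s)) / max(R, 1)  # realized FDP (needs oracle labels)
    gaps.append(fdp_hat - fdp)
print(round(float(np.mean(gaps)), 4))  # positive: fdp_hat over-estimates FDP on average
```

The gap is small because A_t also picks up a few non-null p-values near 1, which is exactly the (mild) conservatism the "+1" and the mirror counts introduce.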
3. A guideline for choosing thresholding rules
Although the AdaPT procedure controls FDR no matter how we update the threshold, its power depends on the quality of the updates. This section concerns the question of what thresholds we would choose if we had perfect knowledge of the data-generating distribution, with Section 4 discussing suggestions for learning optimal thresholds from the data. To establish a guideline for the threshold update, we consider a conditional two-groups model as the working model. As we will see, under mild conditions, the Bayes-optimal rejection thresholds are the level surfaces of the local false discovery rate (fdr), defined as the probability that a hypothesis is null conditional on its p-value. The local FDR was first discussed by Efron et al. (2001); see also Efron (2007). A similar result is obtained by Storey (2007) under a different framework.

To begin, we assume a two-groups model conditional on the predictors x_i. Letting H_i = 0 if the i-th null is true and H_i = 1 otherwise, we assume:

H_i | x_i ∼ Bernoulli(π_1(x_i)),
p_i | H_i, x_i ∼ f_0(p | x_i) if H_i = 0, and f_1(p | x_i) if H_i = 1.

In addition, we assume that (x_i, H_i, p_i) are independent for i ∈ [n]. Unless otherwise stated we will assume for simplicity that both f_0 and f_1 are continuous densities, with f_0(p | x) ≡ 1 (null p-values are uniform) and f_1(p | x) non-increasing in p (smaller p-values imply stronger evidence against the null). Furthermore, define the conditional mixture density

f(p | x) = (1 − π_1(x)) f_0(p | x) + π_1(x) f_1(p | x) = 1 − π_1(x) + π_1(x) f_1(p | x),

and the conditional local false discovery rate

fdr(p | x) = P(H_i is null | x_i = x, p_i = p) = (1 − π_1(x)) / f(p | x).

Note that we never observe H_i directly. Thus, while f is identifiable from the data, π_1 and f_1 are not: for example, π_1 = 0.5, f_1(p | x) = 2(1 − p) and π_1 = 1, f_1(p | x) = 1.5 − p result in exactly the same mixture density.
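As a concrete instance of this working model, the sketch below uses the hypothetical choices π_1(x) = 0.3/(1 + e^{-x}) and f_1(p | x) = 2(1 − p) (neither is from the paper), simulates one draw of (x_i, H_i, p_i), and evaluates the resulting local fdr:

```python
import numpy as np

rng = np.random.default_rng(1)

def pi1(x):        # non-null probability, higher for "promising" x (assumed form)
    return 0.3 / (1.0 + np.exp(-x))

def f1(p):         # alternative density f_1(p|x) = 2(1 - p) (assumed form)
    return 2.0 * (1.0 - p)

def fdr(p, x):     # local fdr under the two-groups model with f_0 = 1
    mix = 1.0 - pi1(x) + pi1(x) * f1(p)
    return (1.0 - pi1(x)) / mix

# simulate one draw of (x_i, H_i, p_i); for f1 the inverse CDF is 1 - sqrt(1 - u)
n = 5
x = rng.normal(size=n)
H = rng.random(n) < pi1(x)
p = np.where(H, 1.0 - np.sqrt(rng.random(n)), rng.random(n))
print(round(fdr(0.01, 2.0), 3))  # -> 0.584: small p and promising x give low fdr
```

Note that fdr(p | x) here depends on x only through π_1(x); with a richer f_1(p | x) the level surfaces would bend in both coordinates.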
Unless f_1(p | x) is known a priori, we can make the conservative identifying assumption that

1 − π_1(x) = inf_{p ∈ [0,1]} f(p | x) = f(1 | x),

attributing as many observations as possible to the null hypothesis. This approximation is very good when fdr(1 | x) ≈
1, which is reasonable in many settings. Thus, any estimate \hat{f} of the mixture density translates to a conservative estimate \widehat{fdr}(p | x) = \hat{f}(1 | x) / \hat{f}(p | x).

Let ν be a probability measure on X and define a random variable X ∼ ν. Similar to Sun et al. (2015), for any thresholding rule s(x), we define the global FDR as

FDR(s; ν) = P(H = 0 | H is rejected) = P(H = 0 | P ≤ s(X)),

where H and P are a hypothesis and p-value distributed according to the two-groups model. The power is defined in a similar fashion as

Pow(s; ν) = P(H is rejected | H = 1) = P(P ≤ s(X) | H = 1).

Sun et al. (2015) formulate a compound decision-theoretic framework by defining a Bayesian-type loss function. Instead, we propose a Neyman–Pearson-type framework, i.e.

max_s Pow(s; ν)   s.t.   FDR(s; ν) ≤ α.   (4)

Next, define

Q_0(s) = P(P ≤ s(X), H = 0) = ∫_X F_0(s(x) | x)(1 − π_1(x)) ν(dx),
Q_1(s) = P(P ≤ s(X), H = 1) = ∫_X F_1(s(x) | x) π_1(x) ν(dx),

where F_0 and F_1 are the cumulative distribution functions under the null and alternative. We can simplify (4) as

max_s Q_1(s)/P(H = 1)   s.t.   Q_0(s)/(Q_0(s) + Q_1(s)) ≤ α   (5)

⟺ min_s −Q_1(s)   s.t.   −αQ_1(s) + (1 − α)Q_0(s) ≤ 0

⟺ min_s ∫_X −F_1(s(x) | x) π_1(x) ν(dx)
   s.t. ∫_X { −α F_1(s(x) | x) π_1(x) + (1 − α) F_0(s(x) | x)(1 − π_1(x)) } ν(dx) ≤ 0.   (7)

The corresponding Lagrangian function can be written as

L(s; λ) = ∫_X { −(1 + λα) F_1(s(x) | x) π_1(x) + λ(1 − α) F_0(s(x) | x)(1 − π_1(x)) } ν(dx).   (8)

Let s* be the optimum; then the Karush–Kuhn–Tucker (KKT) conditions (under regularity conditions) imply that

(1 + λα) f_1(s*(x) | x) π_1(x) = λ(1 − α) f_0(s*(x) | x)(1 − π_1(x))
⟹ fdr(s*(x) | x) = (1 + λα)/(1 + λ).
(9)

In other words, the optimal thresholding rules are level surfaces of the local FDR. Theorem 2 formalizes the above derivation by clarifying the regularity conditions.

Theorem 2.
Assume that

(a) f_1(p | x_i) is continuously non-increasing and f_0(p | x_i) is continuously non-decreasing and uniformly bounded away from ∞;

(b) ν is a discrete measure supported on {x_1, ..., x_n} with ν({x_i : fdr(0 | x_i) < α, f_1(0 | x_i) > 1}) > 0.

Then (4) has at least one solution, and all solutions are level surfaces of fdr(p | x).

In practice, any conservative null distribution (stochastically dominated by U([0, 1])) and any non-increasing f_1 are also valid, since smaller p-values imply stronger evidence against the null. In condition (b), the assumption on the support is reasonable since we treat {x_i : i ∈ [n]} as fixed and hence only the quantities associated with these values are of interest. We believe it can be relaxed to more general measures but will not discuss this here due to the technical complications. In contrast, the second requirement is necessary since it implies the feasibility of the problem. If the local FDR is above α almost everywhere, no thresholding rule is able to control FDR at level α. As mentioned above, we can set s_t as the level surfaces of \widehat{fdr}(p | x) = \hat{f}(1 | x)/\hat{f}(p | x) given some estimator \hat{f}(p | x). The next section discusses estimation of \hat{f}(p | x).
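To see what a level-surface threshold looks like numerically, the sketch below assumes the hypothetical forms π_1(x) = 0.3/(1 + e^{-x}) and f_1(p) = 2(1 − p) (not from the paper), and solves fdr(s(x) | x) = c by bisection. Because f_1 is decreasing, fdr(p | x) is increasing in p, so bisection is valid; the threshold is more permissive where π_1(x) is larger, and is zero where even fdr(0 | x) exceeds c:

```python
import numpy as np

def pi1(x):                      # assumed non-null probability
    return 0.3 / (1.0 + np.exp(-x))

def fdr(p, x):                   # local fdr for f_0 = 1, f_1(p) = 2(1 - p)
    mix = 1.0 - pi1(x) + pi1(x) * 2.0 * (1.0 - p)
    return (1.0 - pi1(x)) / mix

def threshold(x, c, lo=0.0, hi=0.5, iters=60):
    """Solve fdr(s(x)|x) = c by bisection; fdr is increasing in p here.
    Thresholds are capped at 0.5, consistent with the masking scheme."""
    if fdr(lo, x) > c:           # infeasible at this x: reject nothing
        return 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fdr(mid, x) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# the level surface is more permissive where pi1(x) is larger
print(threshold(-2.0, 0.8) < threshold(2.0, 0.8))  # -> True
```

In AdaPT one would replace the oracle fdr with the estimate \widehat{fdr}(p | x) = \hat{f}(1 | x)/\hat{f}(p | x) and pick the level c at each step so the threshold shrinks.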
4. Implementation
Having shown that level surfaces of the local FDR are optimal under the two-groups model, we now turn to estimation of fdr(p | x), which boils down to estimation of the conditional density f(p | x). This section discusses a flexible framework for conditional density estimation that can perform favorably when no domain-specific expertise can be brought to bear.

More generally, we should model the data using as much domain-specific expertise as possible. We emphasize once more that, no matter how misspecified our model is, no matter how misguided our priors are (if we use a Bayesian method), no matter how we select a model or tuning parameter, or how much that selection biases our resulting estimate of the local FDR, the AdaPT procedure nevertheless controls the global FDR. Thus, there is every reason to be relatively aggressive in choosing a modeling strategy.

Generically, we can model the conditional density by a parametric family where we assume null p-values are uniformly distributed, i.e. f_0(p | x_i) ≡
1, and each non-null p-value has a density inthe following exponential family, indexed by a univariate parameter η i : f ( p | x i ) = h ( p ; η i ) (cid:44) e η i g ( p ) − B ( η i ) . (10)Note that η i and g ( p ) can be vectors but we focus on the scalar case for simplicity. Let y i = g ( p i ) , µ i = B (cid:48) ( η i ) . (11)Using the standard argument, (10) implies that E η i [ y i ] = E η i [ g ( p i )] = B (cid:48) ( η i ) = µ i , (12)where E η i denotes the expectation under h ( · ; η i ). If g is not almost-everywhere constant, then B (cid:48)(cid:48) ( η ) = Var η ( y i ) > B (cid:48) is bijective. Then there is a one-to-one mapping from µ i to η i ,denoted by η i = η ( µ i ) as convention. In fact, η ( · ) = ( B (cid:48) ) − ( · ). Then (10) can be reparametrizedusing µ i , h ( p ; µ i ) = e η ( µ i ) g ( p ) − A ( µ i ) , (13)where A ( · ) = B ( η ( · )) and we abuse the notation h ( p ; · ). As we will see, it is more convenientto use the mean parametrization (13).Given (13), it is left to model π i (cid:44) π ( x i ) and µ i (or η i equivalently). In this article weconsider the following generalized linear model where φ π ( x ) , φ µ ( x ) denote two featurization and ζ denotes a link function: H i | x i ∼ Bernoulli( π i ) , with log π i − π i = θ (cid:48) φ π ( x i ) , and p i | x i , H i ∼ (cid:26) h ( p ; µ i ) if H i = 11 if H i = 0 , with ζ ( µ i ) = β (cid:48) φ µ ( x i ) . (14)In particular, ζ ( · ) = η ( · ) gives the canonical link function. For instance, when g ( p ) = − log p, η ( µ ) = − µ + 1 and A ( µ ) = log µ , f ( p | x ) = π i h ( p ; µ i ) + (1 − π i ) = π i · µ i p µi − + (1 − π i ) . (15)This yields a beta-mixture model on the conditional density, which has been considered inliterature, e.g. Parker and Rothenberg (1988); Allison et al. (2002); Pounds and Morris (2003);Markitsis and Lai (2010). 
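As a quick sanity check, the mixture (15) is a proper density on $(0, 1]$ for any $\pi_1 \in [0,1]$ and $\mu > 1$. The short Python sketch below (function names and numerical values are ours, not from the paper's implementation) verifies this numerically with a midpoint rule:

```python
import numpy as np

def h(p, mu):
    # Non-null p-value density from (13) with g(p) = -log p:
    # h(p; mu) = (1/mu) * p**(1/mu - 1); mu > 1 concentrates mass near p = 0.
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def f_mix(p, pi1, mu):
    # Two-groups conditional density (15): pi1 * h(p; mu) + (1 - pi1) * 1.
    return pi1 * h(p, mu) + 1.0 - pi1

# Midpoint-rule check that f_mix integrates to 1 over (0, 1].
grid = (np.arange(10**6) + 0.5) / 10**6
total = np.mean(f_mix(grid, 0.3, 2.0))
print(total)  # ~ 1.0 (the integrand has an integrable singularity at 0)
```

The same `h` function reappears throughout the procedure, in the E-step posteriors and in the local FDR estimates of Section 4.3.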
The fully observed log-likelihood for the model (14) is

$$\ell(\theta, \beta; p, H, x) = \sum_{i=1}^n \Big\{ H_i \theta' \phi_\pi(x_i) - \log\big(1 + e^{\theta' \phi_\pi(x_i)}\big) \Big\} + \sum_{i=1}^n H_i \Big\{ y_i \cdot \eta \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) - A \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) \Big\}. \qquad (16)$$

Because some values of $y_i$ and all values of $H_i$ are unknown, we use the expectation-maximization (EM) algorithm to maximize the partially observed log-likelihood. To simplify estimation, we proceed as though $A_t$ and $R_t$ were missing, so that the $(y_i, H_i)$ pairs are mutually independent given the predictors. That is, at step $t$ of the AdaPT procedure we attempt to maximize the likelihood of the data $D_t = (x_i, \tilde p_{t,i})_{i \in [n]}$, treating $s_t$ as fixed.

Recall that $b_i = I(p_i \geq 0.5)$. The missing data are the pairs $(b_i, H_i)$, with each pair conditionally independent given $D_t$, and with probabilities that can be computed efficiently for any values of $\theta$ and $\beta$. Let $r = 0, 1, \dots$ index the stages of the EM algorithm (recall that $t$ is fixed for the duration of the EM algorithm). In the E-step we compute the expected log-likelihood $\mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[\ell(\theta, \beta; y, H, x) \mid D_t]$, which amounts to computing the quantities

$$\widehat{H}_i^{(r)} = \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[H_i \mid D_t], \quad \text{and} \qquad (17)$$
$$\hat{y}_i^{(r)} = \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[y_i H_i \mid D_t] \,/\, \widehat{H}_i^{(r)} = \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[y_i \mid D_t, H_i = 1], \qquad (18)$$

where $\hat\theta^{(r)}$ and $\hat\beta^{(r)}$ denote the current coefficient estimates. We derive exact formulae for (17) and (18) in Appendix A.1. In the M-step, we set

$$(\hat\theta^{(r)}, \hat\beta^{(r)}) = \arg\max_{\theta, \beta}\ \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[\ell(\theta, \beta; y, H, x) \mid D_t]$$
$$= \arg\max_{\theta, \beta} \sum_{i=1}^n \Big\{ \widehat{H}_i^{(r)} \theta' \phi_\pi(x_i) - \log\big(1 + e^{\theta' \phi_\pi(x_i)}\big) \Big\} + \sum_{i=1}^n \widehat{H}_i^{(r)} \Big( \hat{y}_i^{(r)} \cdot \eta \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) - A \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) \Big). \qquad (19)$$

The optimization above splits into two separate problems: a logistic regression with predictors $\phi_\pi(x_i)$ and fractional responses $\widehat{H}_i^{(r)}$, and a GLM with predictors $\phi_\mu(x_i)$, responses $\hat{y}_i^{(r)}$, and weights $\widehat{H}_i^{(r)}$. Each of these GLM problems can be solved efficiently using the glm function in R (e.g. Dobson and Barnett, 2008). For $r = 0$, we can initialize $\hat\theta^{(0)}$ and $\hat\beta^{(0)}$ by a simple method detailed in Appendix A.2. Algorithm 2 formalizes the EM algorithm using R pseudocode. The family argument for estimating $\hat\beta^{(r)}$ depends on the form of the exponential family (13); for example, (19) yields a Gamma GLM in the beta-mixture model (15).

The GLM (14) provides the starting point for an extremely flexible and extensible modeling framework. More generally, we could replace the fitting procedure in the M-step by a penalized GLM (glmnet package), a generalized additive model (gam or mgcv package), or generalized boosting regression (gbm package). Furthermore, noting that

$$\pi_1(x) = \mathbb{E}[H \mid x], \qquad \mu(x) = \mathbb{E}[y \mid x, H = 1],$$

one can even fit these quantities directly using any nonparametric method that targets a conditional mean, such as random forests or neural networks.
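For concreteness, one EM iteration under the beta-mixture (15) can be sketched in Python: the E-step computes (17)-(18) in closed form by averaging over both elements of each masked pair, and the logistic block of the M-step is solved by IRLS as a stand-in for R's glm. This is our own illustrative sketch, not the paper's implementation; the Gamma block of the M-step is omitted, and all simulated quantities are made up:

```python
import numpy as np

def h(p, mu):
    # Non-null beta-mixture density h(p; mu) = (1/mu) p^(1/mu - 1) from (15).
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def e_step(p_min, masked, pi1, mu):
    """Closed-form E-step (17)-(18): posterior P(H_i = 1) and E[y_i | H_i = 1].

    p_min  : min(p_i, 1 - p_i); masked : True where only {p_i, 1 - p_i} is seen.
    For masked hypotheses the posterior averages over both candidate p-values;
    under the null, each candidate has density 1."""
    ha, hb = h(p_min, mu), h(1.0 - p_min, mu)
    H_m = pi1 * (ha + hb) / (pi1 * (ha + hb) + 2.0 * (1.0 - pi1))
    y_m = (ha * -np.log(p_min) + hb * -np.log(1.0 - p_min)) / (ha + hb)
    H_r = pi1 * ha / (pi1 * ha + 1.0 - pi1)   # revealed: usual two-groups posterior
    y_r = -np.log(p_min)                      # (sketch: assumes revealed p < 1/2)
    return np.where(masked, H_m, H_r), np.where(masked, y_m, y_r)

def logistic_irls(X, Hhat, n_iter=25):
    """M-step, first block of (19): logistic regression with fractional
    responses Hhat in (0, 1), fit by iteratively reweighted least squares."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ theta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = prob * (1.0 - prob) + 1e-10       # IRLS weights
        z = eta + (Hhat - prob) / w           # working response
        theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return theta

# One demo iteration on synthetic data, with (pi1, mu) fixed at plausible values.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
p = np.where(rng.random(500) < 1 / (1 + np.exp(-x)),  # non-null likelier at large x
             rng.random(500) ** 2.0,                  # alternative: h(p; mu = 2)
             rng.random(500))                         # null: uniform
p_min = np.minimum(p, 1 - p)
masked = p_min < 0.45                                 # AdaPT-style masking at s = 0.45
H_hat, y_hat = e_step(p_min, masked, pi1=0.5, mu=2.0)
theta = logistic_irls(np.column_stack([np.ones_like(x), x]), H_hat)
print(theta[1] > 0)  # True: the fitted slope picks up that non-nulls sit at large x
```

In the real procedure these two fits would alternate with fresh E-steps for $m$ iterations, exactly as Algorithm 2 below prescribes with R's glm.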
Algorithm 2
EM algorithm to estimate $\pi_1(\cdot)$ and $\mu(\cdot)$ based on $D_t = (x_i, \tilde p_{t,i})_{i \in [n]}$.

Input: data $D_t$, number of iterations $m$, initialization $\hat\theta^{(0)}, \hat\beta^{(0)}$;
for $r = 1, 2, \dots, m$ do
  (E-step):
    $\widehat{H}_i^{(r)} \leftarrow \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[H_i \mid D_t]$, $i \in [n]$;
    $\hat{y}_i^{(r)} \leftarrow \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[y_i H_i \mid D_t] / \widehat{H}_i^{(r)}$, $i \in [n]$;
  (M-step):
    $\hat\theta^{(r)} \leftarrow$ glm$(\widehat{H}^{(r)} \sim \phi_\pi(x)$, family = binomial$)$;
    $\hat\beta^{(r)} \leftarrow$ glm$(\hat{y}^{(r)} \sim \phi_\mu(x)$, family = $\dots$ (link = $\zeta$), weights = $\widehat{H}^{(r)})$;
end for
Output: $\hat\pi_1(x) = \big(1 + e^{-\phi_\pi(x)' \hat\theta^{(m)}}\big)^{-1}$, $\hat\mu(x) = \zeta^{-1}\big(\phi_\mu(x)' \hat\beta^{(m)}\big)$.

Suppose we are given a finite set of candidate featurizations $\{(\phi_{\pi,j}(x), \phi_{\mu,j}(x)) : j = 1, \dots, M\}$. For instance, for univariate $x$, $\phi_{\pi,j}(x)$ and $\phi_{\mu,j}(x)$ could be spline bases with varying numbers of equi-spaced knots; for multivariate $x$, they could be subsets of the covariates contained in $x$. At step $t$, one is permitted to fit a model for each featurization, using arbitrary methods (e.g., GLM, penalized GLM, etc.), based on $((\phi_{\pi,j}(x_i), \phi_{\mu,j}(x_i), \tilde p_{t,i}))_{i=1}^n$. Let $\hat\pi_1^{(j)} = (\hat\pi_{11}^{(j)}, \dots, \hat\pi_{1n}^{(j)})$ and $\hat\mu^{(j)} = (\hat\mu_1^{(j)}, \dots, \hat\mu_n^{(j)})$ denote the resulting fitted values. The full log-likelihood for the GLM (14) based on $(\phi_{\pi,j}(x), \phi_{\mu,j}(x))$, assuming $H_i$ were known, can be written as

$$\ell_j(\pi_1, \mu) = \sum_{i=1}^n \big( H_i \log \pi_{1i}^{(j)} + (1 - H_i) \log(1 - \pi_{1i}^{(j)}) \big) + \sum_{i=1}^n H_i \log h(p_i; \mu_i^{(j)}).$$

Though $\ell_j$ is not computable, we can replace it by

$$\tilde{\ell}_j \triangleq \mathbb{E}_{\hat\pi_1^{(j)}, \hat\mu^{(j)}}[\ell_j(\pi_1, \mu)].$$

This is precisely the objective of the M-step, and hence comes directly from the EM algorithm. Based on $\{\tilde{\ell}_j\}_{j=1}^M$, we can use any information criterion for featurization selection.
Our implementation uses BIC by default, defined as

$$\mathrm{BIC}_j = \log n \cdot (\mathrm{df}_{\pi,j} + \mathrm{df}_{\mu,j}) - 2\tilde{\ell}_j,$$

where $\mathrm{df}_{\pi,j}$ (resp. $\mathrm{df}_{\mu,j}$) is the degrees of freedom of $\phi_{\pi,j}$ (resp. $\phi_{\mu,j}$). For instance, $\mathrm{df}_{\pi,j}$ is the number of knots plus 1 (for the intercept) when $\phi_{\pi,j}$ is a spline basis, and the number of selected covariates plus 1 (for the intercept) when $\phi_{\pi,j}$ is a sparse subset of $x$.

Alternatively, the user can apply cross-validation to select the featurization. Specifically, at step $t$ the data are divided into $K$ folds. For the $k$-th fold, the expected log-likelihood $\tilde{\ell}_{jk}$ is computed by taking the $k$-th fold as the holdout set and fitting the parameters on the other folds. The selection is then based on $\tilde{\ell}_j = \sum_{k=1}^K \tilde{\ell}_{jk}$.

We emphasize that any of the above selection procedures can be performed at any intermediate step of AdaPT. If the featurization selection can be computed efficiently, we suggest applying it at every step; otherwise we suggest performing it only at the first step, in which $s_1(x) = s_0(x)$, and keeping the selected featurization for all later steps.

Theorem 2 suggests that our updated threshold $s_{t+1}$ should approximate a level surface of $\widehat{\mathrm{fdr}}(p \mid x)$. For the model (14), level surfaces of the local FDR are given by

$$c = \frac{\hat f(1 \mid x)}{\hat f(s(x) \mid x)} = \frac{\pi_1(x) h(1; \mu(x)) + 1 - \pi_1(x)}{\pi_1(x) h(s(x); \mu(x)) + 1 - \pi_1(x)}. \qquad (20)$$

For various widely used exponential families of the form (13), $h(p; \mu)$ is decreasing in $p$, in which case

$$s(x; c) = h^{-1}\left( \frac{h(1; \mu(x))}{c} + \frac{1 - \pi_1(x)}{\pi_1(x)} \cdot \frac{1 - c}{c};\ \mu(x) \right). \qquad (21)$$

Given a chosen local FDR level $c$, we can evolve $s_t$ by

$$s_{t+1}(x) = \min\{ s_t(x),\ s(x; c) \}, \qquad (22)$$

where the minimum is taken to meet the requirement that $s_{t+1}(x) \leq s_t(x)$. Note that a higher level surface (larger $c$) will typically give a higher $\widehat{\mathrm{FDP}}_t$, and vice versa.
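Under the beta-mixture (15), $h(p; \mu) = (1/\mu) p^{1/\mu - 1}$ is decreasing for $\mu > 1$ and invertible in closed form, so (21) can be evaluated directly. A Python sketch (helper names and numerical values are ours):

```python
import numpy as np

def h(p, mu):
    # Beta-mixture non-null density from (15): h(p; mu) = (1/mu) p^(1/mu - 1).
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def h_inv(v, mu):
    # Closed-form inverse of p -> h(p; mu) for mu > 1, where h is decreasing.
    return (mu * v) ** (mu / (1.0 - mu))

def threshold(pi1, mu, c):
    # Eq. (21): the p-value threshold lying on the level-c surface of the local FDR.
    return h_inv(h(1.0, mu) / c + (1.0 - pi1) / pi1 * (1.0 - c) / c, mu)

# Round trip: the local FDR f(1|x)/f(s|x) at s = threshold(pi1, mu, c) equals c.
pi1, mu, c = 0.6, 3.0, 0.2
s = threshold(pi1, mu, c)
lfdr_at_s = (pi1 * h(1.0, mu) + 1 - pi1) / (pi1 * h(s, mu) + 1 - pi1)
print(round(s, 4), round(lfdr_at_s, 6))  # 0.0213 0.2
```

The round trip confirms that (21) is the exact inverse of (20) for this family, so thresholding at $s(x; c)$ rejects precisely the p-values with estimated local FDR below $c$.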
Unless computational efficiency is at a premium, it is better to force the procedure to be patient, since more information accrues after each update and the learning steps become more accurate. In other words, we should choose $c$ large enough that $s_{t+1}(x)$ deviates from $s_t(x)$ only slightly.

In this article we propose a simple procedure to achieve this: choose $c$ so that exactly one partially masked p-value is revealed by the $s_{t+1}(x)$ defined in (22). This choice of $c$ can be computed as follows.

(a) Estimate the local FDR for each $p'_{t,i}$ as

$$\widehat{\mathrm{lfdr}}_{t,i} = \frac{\hat f(1 \mid x_i)}{\hat f(p'_{t,i} \mid x_i)} = \frac{\hat\pi_i \cdot h(1; \hat\mu_i) + 1 - \hat\pi_i}{\hat\pi_i \cdot h(p'_{t,i}; \hat\mu_i) + 1 - \hat\pi_i}, \qquad (23)$$

where $p'_{t,i}$ is the minimum element of $\tilde p_{t,i}$ (i.e., $\tilde p_{t,i} = p'_{t,i}$ for revealed p-values and $\tilde p_{t,i} = \{p'_{t,i}, 1 - p'_{t,i}\}$ for masked p-values).

(b) Set $c$ to the largest value of $\widehat{\mathrm{lfdr}}_{t,i}$ among all partially masked p-values. (Strictly speaking, $c$ should be slightly smaller than $\max_i \widehat{\mathrm{lfdr}}_{t,i}$; in our implementation we subtract a negligible constant from it.)

As a consequence, this choice of $c$ is measurable with respect to $\mathcal{F}_t$ and hence a permissible operation in AdaPT.

Initial thresholds.
As shown in Algorithm 1, AdaPT starts from some curve $s_0(x)$ and then slowly updates it. If the hypotheses are not ordered, we can simply set $s_0(x) \equiv s_0$ with $s_0 \leq 0.5$. A larger $s_0$ is conceptually preferable, since the procedure is then more patient. We found that $s_0 = 0.45$ is a consistently good choice.
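Returning to the threshold update, the single-reveal choice of $c$ in steps (a) and (b) above can be checked numerically for the beta-mixture. In the Python sketch below, the fitted values $\hat\pi_i$, $\hat\mu_i$ are synthetic placeholders (not from any real fit):

```python
import numpy as np

def h(p, mu):
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def h_inv(v, mu):
    # Inverse of p -> h(p; mu), valid for mu > 1 where h is decreasing.
    return (mu * v) ** (mu / (1.0 - mu))

def lfdr(p, pi1, mu):
    # Estimated local FDR, eq. (23).
    return (pi1 * h(1.0, mu) + 1.0 - pi1) / (pi1 * h(p, mu) + 1.0 - pi1)

def threshold(pi1, mu, c):
    # Level-c surface of the local FDR, eq. (21).
    return h_inv(h(1.0, mu) / c + (1.0 - pi1) / pi1 * (1.0 - c) / c, mu)

rng = np.random.default_rng(1)
n = 200
p_min = rng.uniform(0.0, 0.45, n)       # p'_{t,i}; here every hypothesis starts masked
pi1 = rng.uniform(0.2, 0.8, n)          # fitted pi-hat_i (made up for the sketch)
mu = rng.uniform(1.5, 4.0, n)           # fitted mu-hat_i (made up for the sketch)
s_t = np.full(n, 0.45)                  # current threshold s_t(x_i)

c = lfdr(p_min, pi1, mu).max() - 1e-10           # step (b): just below the maximum
s_next = np.minimum(s_t, threshold(pi1, mu, c))  # eq. (22)
print(int(np.sum(p_min > s_next)))  # 1: exactly one p-value is revealed
```

Because $\widehat{\mathrm{lfdr}}_{t,i}$ is increasing in $p'_{t,i}$, dropping the threshold just below the largest masked local FDR unmasks precisely that one hypothesis, which is what makes the update maximally patient.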
Computational efficiency.
The model update (Algorithm 2) is the most computationally costly component. To save computation, we recommend not updating the model at every step; in our implementation, the default is to update the model only once every $\lceil n/K \rceil$ steps, for a fixed constant $K$.

$q$-values. Rather than specifying $\alpha$ in advance, some researchers might prefer to see a list of discoveries for each of a range of $\alpha$ values. Rather than returning a single list for a single $\alpha$, we can alternatively run the algorithm once and output $q$-values for every hypothesis (Storey, 2002; Storey and Tibshirani, 2003), defined as the minimum value of $\alpha$ for which the hypothesis would be rejected.

Let $\hat{t}_\alpha = \min\{t : \widehat{\mathrm{FDP}}_t \leq \alpha\}$ and $t^*_i = \min\{t : s_t(x_i) < p_i < 1 - s_t(x_i)\}$, the time at which $p_i$ is revealed. We then see that

$$H_i \text{ rejected at level } \alpha \iff p_i \leq s_{\hat{t}_\alpha}(x_i) \iff \hat{t}_\alpha < t^*_i \iff \min_{t < t^*_i} \widehat{\mathrm{FDP}}_t \leq \alpha,$$

so the $q$-value of $H_i$ is $\min_{t < t^*_i} \widehat{\mathrm{FDP}}_t$.

5. Experiments

To illustrate the power of the AdaPT procedure, we apply it to the GEOquery gene-dosage data (Davis and Meltzer, 2007), which has been analyzed repeatedly as a benchmark for ordered testing procedures (Li and Barber, 2016a; Lei and Fithian, 2016; Li and Barber, 2016b). We use Algorithm 2 with a beta-mixture model (15) for the E-step (see Appendix A.1.1 for details) and a Gamma GLM with the canonical link function for the M-step. This dataset consists of gene expression measurements for $n = 22283$ genes, in response to estrogen treatments in breast cancer cells for five groups of patients, with different dosage levels and 5 trials in each. The task is to identify the genes responding to a low dosage. The p-value $p_i$ for gene $i$ is obtained from a one-sided permutation test which evaluates evidence for a change in gene expression level between the control group (placebo) and the low-dose group.
The p-values $\{p_i : i \in [n]\}$ are then ordered according to permutation $t$-statistics comparing the control and low-dose data, pooled, against data from a higher dosage (with genes that appear to have a strong response at higher dosages placed earlier in the list). We consider two orderings: first, a stronger (more informative) ordering based on a comparison to the highest dosage; and second, a weaker (less informative) ordering based on a comparison to a medium dosage. Let $\sigma_S(i)$ and $\sigma_W(i)$ denote, respectively, the permutations of $i \in [n]$ given by the stronger and weaker orderings. Further details on these two orderings can be found in Li and Barber (2016a) and Li and Barber (2016b). We write the p-values, thus reordered, as $p^S_i = p_{\sigma_S(i)}$ and $p^W_i = p_{\sigma_W(i)}$. Once the data are reordered, we can apply either a method that ignores the ordering altogether, an ordered testing procedure, or a testing procedure that uses generic side information, using the index of the reordered p-values as a univariate predictor.

We compare AdaPT against twelve other methods:

(a) SeqStep with parameter $C = 2$ (Barber and Candès, 2015);
(b) ForwardStop (G'Sell et al., 2016);
(c) the accumulation test with the HingeExp function and parameter $C = 2$ (Li and Barber, 2016a);
(d) Adaptive SeqStep with $s = q$ and $\lambda = 1 - q$ (Lei and Fithian, 2016);
(e) the BH procedure (Benjamini and Hochberg, 1995);
(f) Storey's BH procedure with threshold $\lambda = 0.5$ (Storey, 2002);
(g) the Barber-Candès procedure (Barber and Candès, 2015);
(h) SABHA with step-function weights, with $\tau = 0.5$, $\epsilon = 0.1$ and weights restricted to $\{\epsilon, 1\}$ (see Section 4.1 of Li and Barber (2016b));
(i) SABHA with ordered weights, with $\tau = 0.5$, $\epsilon = 0.1$ and weights restricted to $[\epsilon, 1]$ (see Section 4.1 of Li and Barber (2016b));
(j) Independent Hypothesis Weighting (IHW) with the numbers of bins and folds set to their defaults (Ignatiadis et al., 2016);
(k) an oracle version of IHW with the number of bins determined by maximizing the number of rejections;
(l) an oracle version of Independent Filtering (IF) with the cutoff determined by maximizing the number of rejections (Bourgon et al., 2010).
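For reference, baseline (e), the BH step-up procedure, fits in a few lines. The sketch below is ours, not the implementation used in the experiments:

```python
import numpy as np

def bh(pvals, alpha):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where
    k = max{i : p_(i) <= alpha * i / n}, and k = 0 if no such i exists."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    below = np.nonzero(np.sort(p) <= alpha * np.arange(1, n + 1) / n)[0]
    k = 0 if below.size == 0 else below[-1] + 1
    return np.sort(order[:k])          # indices of rejected hypotheses

print(bh([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9], 0.05))
# -> [0 1]: only the two smallest p-values clear their step-up thresholds
```

The adaptive and weighted baselines (f)-(l) all modify either the thresholds or the p-values entering this step-up comparison.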
Note that the last two methods do not guarantee FDR control, because the optimal parameter is selected; and both versions of SABHA control FDR at a slightly inflated level (Lemma 1 of Li and Barber (2016b)) when the target level is $\alpha$. Despite the potential anti-conservativeness of these methods, we do not apply any correction, in order to compare their best possible performance to AdaPT. Figure 2 shows the number of discoveries at different target FDR levels; we show the range of $\alpha$ from 0.01 to 0.30. For AdaPT, spline bases are used for $\pi_1(x)$ and $\mu(x)$, selected from among all spline bases with 6-15 equi-quantile knots via the BIC criterion at the initial step and kept the same afterwards; see Section 4.2.

Fig. 2: Number of discoveries, in the gene/drug response dataset, by each method (SeqStep, HingeExp, ForwardStop, Adaptive SeqStep, BH, Storey-BH, Barber-Candès, SABHA (step), SABHA (ordered), IHW, IHW (oracle), IF (oracle), and AdaPT) at a range of target FDR levels $\alpha$ from 0.01 to 0.30. Each panel plots the results for one ordering, ranging from random to highly informative.

Fig. 3: Results for the gene/drug response data with the moderately informative ordering of p-values $\{p^W_i\}$, with $\alpha = 0.05$ (left) and $\alpha = 0.1$ (right): (top) the rejection thresholds $s(x)$; (bottom) contour plots of the estimated local FDR.

The right two panels of Figure 2 correspond to the weaker and the stronger orderings, and show that AdaPT significantly outperforms all other methods at all target FDR levels.
One might wonder whether the power gain is driven by overfitting. To check this, we also apply AdaPT, as well as all other methods, to the same set of p-values with a random ordering. We repeat this using 100 random seeds and report the average number of rejections in the left panel of Figure 2. In this case, the number of rejections drops dramatically, and the power is almost the same as that of the Barber-Candès method, the non-adaptive version of AdaPT. This provides strong evidence against overfitting.

To illustrate how AdaPT exploits the covariate to improve power, we plot the thresholding rules and estimated signal strength for the p-values with the moderately informative ordering and with the highly informative ordering in Figures 3 and 4, respectively. The bottom panels show that the evidence for being non-null has an obvious decreasing trend when the ordering is used. Moreover, the highly informative ordering indeed sorts the p-values better than the moderately informative one: for the former the thresholding rule is fairly monotone, while for the latter it has a small bump.

Fig. 4: Results for the gene/drug response data with the highly informative ordering of p-values $\{p^S_i\}$, with $\alpha = 0.05$ (left) and $\alpha = 0.1$ (right): (top) the rejection thresholds $s(x)$; (bottom) contour plots of the estimated local FDR.

Finally, we measure the information loss caused by partial masking. We first estimate the local FDR using the set of unmasked p-values and the covariates, denoted by $\mathrm{lfdr}^*(x)$; this can be regarded as the best possible estimate given the algorithm. Let $\mathrm{lfdr}_t(x)$ denote the estimate of the local FDR at step $t$ (based on partially masked p-values).
We then measure the information loss by the correlation between $\{\mathrm{lfdr}^*(x_i)\}_{i=1}^n$ and $\{\mathrm{lfdr}_t(x_i)\}_{i=1}^n$. The results are shown in Figure 5, where the x-axis corresponds to the target FDR, in reverse order from 0.3 to 0.01, and the y-axis to the correlation at the step where $\widehat{\mathrm{FDP}}$ first drops below the target FDR. As expected from the discussion in Subsection 1.3, the information loss is quite small, and even negligible once the target FDR drops into the "practical" regime (e.g. below 0.1), where the correlation between $\{\mathrm{lfdr}^*(x_i)\}_{i=1}^n$ and $\{\mathrm{lfdr}_t(x_i)\}_{i=1}^n$ is almost 1. The pattern is even more pronounced in the other data examples in the next subsection. This provides strong evidence that AdaPT allows efficient data exploration with comparatively little information loss.

In summary, these plots show the strong data adaptivity of AdaPT, which can learn the local structure of the data while controlling FDR. Moreover, it provides a quantitative way, via the estimated signal strength, to evaluate the quality of an ordering, which is the major concern in ordered testing problems (Li and Barber, 2016a; Lei and Fithian, 2016; Li and Barber, 2016b).

Fig. 5: Correlation of $\{\mathrm{lfdr}^*(x_i)\}_{i=1}^n$ and $\{\mathrm{lfdr}_t(x_i)\}_{i=1}^n$ for the gene/drug-response dosage dataset under the original, moderately informative, and highly informative orderings. The x-axis corresponds to the target FDR, in reverse order from 0.3 to 0.01, and the y-axis to the correlation at the step where $\widehat{\mathrm{FDP}}$ first drops below the target FDR.

• Example 1: a two-dimensional case. We generate the covariates $x_i$ from an equi-spaced $50 \times 50$ grid in a square region, and p-values i.i.d. from a one-sided normal test, i.e.
$$p_i = 1 - \Phi(z_i), \quad \text{with } z_i \sim N(\mu, 1), \qquad (24)$$

where $\Phi$ is the CDF of $N(0, 1)$. For $i \in \mathcal{H}$ we set $\mu = 0$, and for $i \notin \mathcal{H}$ we set $\mu = 2$. Figure 6 below shows the three types of $\mathcal{H}$ that we conduct tests on.

Fig. 6: The panels display the underlying ground truth for the three cases in Example 1 ("Circle in the middle", "Circle in the corner", and "Thin ellipse"). Each point represents a hypothesis (2500 in total), with gray points being nulls and black points being non-nulls.

In this case, it is not clear how to apply non-adaptive ordered testing procedures or Independent Filtering. Thus we compare AdaPT only with Storey's BH method, the Barber-Candès method, IHW using the default automatic parameter-tuning procedure, and SABHA using two-dimensional low-total-variation weights (see Section 4.3 of Li and Barber (2016b)). For AdaPT, we fit two-dimensional generalized additive models in the M-step, using the R package mgcv with the knots selected automatically in every step by the GCV criterion. For each procedure and a given level $\alpha$, let $\mathcal{R}_\alpha$ be the set of rejected hypotheses with a target FDR level $\alpha$. Then we calculate the FDP and the power as

$$\mathrm{FDP}(\alpha) = \frac{|\mathcal{R}_\alpha \cap \mathcal{H}|}{|\mathcal{R}_\alpha|}, \qquad \mathrm{power}(\alpha) = \frac{|\mathcal{R}_\alpha \cap \mathcal{H}^c|}{|\mathcal{H}^c|}. \qquad (25)$$

We repeat the above procedure on 100 freshly simulated datasets and report the averages of $\mathrm{FDP}(\alpha)$ and $\mathrm{power}(\alpha)$ as measures of FDR and power. The results are shown in Figure 7. It is clearly seen that AdaPT controls FDR, as the other methods do, while achieving significantly higher power.

To see why AdaPT gains power, we plot the estimated local FDR in Figure 8 for the first case, at the initial step, the step where $\widehat{\mathrm{FDP}}$ first falls below 0.3, and the step where $\widehat{\mathrm{FDP}}$ first falls below 0.1.
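The data-generating process (24) and the metrics (25) are straightforward to reproduce. Below is a Python sketch for the "circle in the middle" case, with illustrative grid extent and circle radius (those exact values are ours, not the paper's), and with BH as a stand-in analysis since AdaPT's full interactive loop is beyond a snippet:

```python
import numpy as np
from math import erf

# Ground truth: 50 x 50 grid of covariates; non-nulls (mu = 2) inside a circle
# ("circle in the middle"; grid extent and radius here are illustrative choices).
gx, gy = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
x = np.column_stack([gx.ravel(), gy.ravel()])
nonnull = (x ** 2).sum(axis=1) < 0.3 ** 2
mu = np.where(nonnull, 2.0, 0.0)

# One-sided normal p-values, eq. (24): p_i = 1 - Phi(z_i), z_i ~ N(mu, 1).
rng = np.random.default_rng(0)
z = rng.normal(mu, 1.0)
p = np.array([1.0 - 0.5 * (1.0 + erf(zi / np.sqrt(2.0))) for zi in z])

def bh(pvals, alpha):
    # BH step-up, used here only as a stand-in analysis for the sketch.
    n = len(pvals)
    below = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, n + 1) / n)[0]
    k = 0 if below.size == 0 else below[-1] + 1
    return np.argsort(pvals)[:k]

def fdp_power(rejected, nonnull):
    # Eq. (25): H is the null set, so false discoveries are rejected nulls.
    R = len(rejected)
    fdp = 0.0 if R == 0 else float(np.sum(~nonnull[rejected])) / R
    power = float(np.sum(nonnull[rejected])) / nonnull.sum()
    return fdp, power

rej = bh(p, alpha=0.1)
fdp, power = fdp_power(rej, nonnull)
print(f"{len(rej)} rejections, FDP = {fdp:.2f}, power = {power:.2f}")
```

Averaging `fdp` and `power` over 100 fresh draws reproduces the evaluation protocol used for Figure 7.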
As in the real-data examples, the fitted local FDR identifies the non-nulls quite accurately even at the initial step, where most p-values are partially masked. The estimates become very stable and informative after reaching the practical regime of $\alpha$'s.

Fig. 7: FDR (top row) and power (bottom row) for the three cases of Example 1 ("Circle in the middle", "Circle in the corner", "Thin ellipse") over a grid of target FDR levels $\alpha$, for BH, Storey-BH, Barber-Candès, SABHA, IHW, and AdaPT.
Fig. 8: Estimated local FDR in the first case of Example 1, at the initial step (left) and at the target FDR levels α = 0.3 (middle) and α = 0.1 (right).

• Example 2: a 100-dimensional case

We generate x_i ∈ R^d with d = 100 and {x_ij : i ∈ [n], j ∈ [d]} i.i.d. ~ U([0, 1]). Then we generate p-values from the varying-coefficient two-groups beta-mixture model (15), with π_i and µ_i specified by a logistic model and a truncated linear model, respectively:

log( π_i / (1 − π_i) ) = θ_0 + x_i^T θ,   µ_i = max{ x_i^T β, 1 },   β, θ ∈ R^d.

We choose θ and β as highly sparse vectors with only two non-zero entries each: θ = (3, 3, 0, ..., 0)^T, β = (2, 2, 0, ..., 0)^T, and θ_0 is chosen so that (1/n) Σ_{i=1}^n π_i = 0.3. In this case, E(−log p_i) = µ_i under the alternative. Figure 9 shows the histograms of the π_i's and µ_i's.

Fig. 9: Distributions of π_i's and µ_i's in Example 2.

In this case, it is not clear how to apply non-adaptive ordered testing procedures, Independent Filtering, or adaptive procedures like IHW and SABHA. Thus we compare AdaPT only with the BH method, Storey's BH method, and the Barber-Candès method. For AdaPT, we fit L1-regularized GLMs in the M-step (see Appendix A for details), using the R package glmnet with the penalty level selected automatically at every step by cross-validation. We also run AdaPT by fitting an "oracle" GLM in the M-steps in which only the first two covariates are involved.

As in Example 1, we estimate the FDR and the power using 100 replications. The results are plotted in Figure 10. Both versions of AdaPT control the FDR, like the other methods, while achieving higher power. Not surprisingly, compared to AdaPT with L1-regularized GLMs, AdaPT with "oracle" GLMs has higher power. Nevertheless, this example shows the ability of AdaPT to improve power by squeezing information out of a large set of noisy features.

Fig. 10: FDR and power over a grid of target FDR levels α in Example 2. Methods compared: BH, Storey-BH, Barber-Candès, AdaPT, AdaPT (oracle).

In this subsection, we examine the performance of AdaPT on four more real datasets, which have been analyzed in other papers on adaptive FDR control, e.g. Bourgon et al. (2010) and Ignatiadis et al. (2016). In all cases, we start with a brief introduction to the dataset and then show the number of rejections (as in Figure 2), the path of information loss (as in Figure 5), and the threshold curve and level curves of the estimated local FDR at target FDR 0.1 (as in Figures 3 and 4).
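Returning briefly to Example 2: its data-generating process can be sketched in a few lines. This is an illustrative simulation only — the intercept θ0 below is a hand-picked value (the paper tunes it so that the π_i average to 0.3), and the sample size is arbitrary; the truncation level 1 in µ_i is an assumption consistent with the beta-mixture alternative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5000, 100

# Covariates x_ij ~ U([0, 1]); theta and beta each have two non-zero entries.
x = rng.uniform(size=(n, d))
theta = np.zeros(d); theta[:2] = 3.0
beta = np.zeros(d); beta[:2] = 2.0
theta0 = -3.8  # illustrative; the paper chooses it so that mean(pi) = 0.3

pi = 1.0 / (1.0 + np.exp(-(theta0 + x @ theta)))  # logistic model for pi_i
mu = np.maximum(x @ beta, 1.0)                     # truncated linear mu_i

# Two-groups beta-mixture: H_i ~ Bern(pi_i); p_i is uniform under the null,
# and p_i = U**mu_i under the alternative, so that E(-log p_i) = mu_i.
H = rng.uniform(size=n) < pi
p = np.where(H, rng.uniform(size=n) ** mu, rng.uniform(size=n))
```

Note that p = U^µ has density (1/µ) p^(1/µ − 1), the beta-mixture alternative used throughout, and −log p is then Exponential with mean µ.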
We use the same settings for AdaPT as in the gene dosage dataset: performing model selection at the initial step, with the candidate featurizations given by all combinations of spline bases with 6 to 15 equi-quantile knots for π(x) and µ(x), and fixing the selected model in subsequent updates.

• Bottomly data

This is an RNA-Seq dataset aimed at detecting differential expression between two mouse strains, C57BL/6J (B6) and DBA/2J (D2), collected by Bottomly et al. (2011), available in the ReCount repository (Frazee et al., 2011), and analyzed by Ignatiadis et al. (2016) using IHW. It consists of gene expression measurements for n = 13932 genes. Following Ignatiadis et al. (2016), we analyze the data using the DESeq2 package (Love et al., 2014) and use the logarithm of the normalized count (averaged across all samples, plus 1) as the univariate covariate for each gene. The results are plotted in Figure 11. AdaPT produces significantly more discoveries than all other methods, and the information loss is almost negligible (with correlation consistently above 0.985). Furthermore, we observe the same pattern as before: AdaPT prioritizes the genes with higher normalized mean counts.

Fig. 11: Results for the Bottomly dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1. Methods compared: SeqStep, HingeExp, ForwardStop, Adaptive SeqStep, BH, Storey-BH, Barber-Candès, SABHA (step), SABHA (ordered), IHW, IHW (oracle), IF (oracle), AdaPT.
• Airway data

This is an RNA-Seq dataset aimed at identifying differentially expressed genes in airway smooth muscle cell lines in response to dexamethasone, collected by Himes et al. (2014) and available in the R package airway. It is analyzed in the vignette of the IHW package using the IHW method (Ignatiadis et al., 2016). As in the vignette and the previous example, we analyze the data using the DESeq2 package (Love et al., 2014) and use the logarithm of the normalized count as the univariate covariate for each gene. The results are plotted in Figure 12. Again, AdaPT produces significantly more discoveries than all other methods.

Fig. 12: Results for the Airway dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1.

• Pasilla data

This is also an RNA-Seq dataset, aimed at detecting genes that are differentially expressed between the normal and Pasilla-knockdown conditions, collected by Brooks et al. (2011) and available in the R package pasilla (Huber and Reyes, 2016). It is analyzed in the vignette of the genefilter package (Gentleman et al., 2016) using the independent filtering method of Bourgon et al. (2010). As in the vignette, we analyze the data using the DESeq package (Anders and Huber, 2010) and use the logarithm of the normalized count as the univariate covariate for each gene. The results are plotted in Figure 13.
We arrive at the same conclusion: AdaPT is more powerful than all other methods.

Fig. 13: Results for the Pasilla dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1.

• Yeast proteins data

This is a proteomics dataset, collected by Dephoure and Gygi (2012) and available in the R package IHWpaper, that provides temporal abundance profiles for 2666 yeast proteins from a quantitative mass-spectrometry (SILAC) experiment. The goal is to identify differential protein abundance in yeast cells treated with rapamycin versus DMSO. It is analyzed in Ignatiadis et al. (2016) using the IHW method. As in Dephoure and Gygi (2012) and Ignatiadis et al. (2016), we compute p-values using Welch's t-test and use as the univariate covariate the logarithm of the total number of peptides quantified across all samples for each gene. The results are plotted in Figure 14. In this case, AdaPT performs similarly to the Barber-Candès method and Storey's BH method, but it still outperforms all other methods. Furthermore, AdaPT learns the monotone pattern of the local FDR, which matches the heuristic.
Fig. 14: Results for the yeast proteins dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1.

6. Discussion

We have proposed the AdaPT procedure, a general iterative framework for multiple testing with side information. Using partially masked p-values, we estimate a family of optimal and increasingly stringent rejection thresholds, which are level surfaces of the local FDR. We then monitor an estimator of the FDP to decide which threshold to use, updating our estimates as we unmask more p-values and gain more information.

Our method is interactive in that it allows the analyst to use an arbitrary method for estimating the local FDR, and to consult her intuition to change models at any iteration, even after observing most of the data. No matter what the analyst does or how badly she overfits the data, FDR is still controlled at the advertised level (though power could be adversely affected by overfitting). We show in various experiments that AdaPT can give consistent and significant power improvements over current state-of-the-art methods.

AdaPT without thresholds

Although we state AdaPT as a procedure that interactively updates a covariate-dependent threshold curve, the thresholds are not essential. In fact, Algorithm 1 can be modified as follows in the absence of s(x).

Algorithm 3 AdaPT without thresholds
Input: predictors and p-values (x_i, p_i)_{i ∈ [n]}, target FDR level α.
Procedure:
1: Initialize R̂_0 = [n]
2: for t = 0, 1, . . . do
3:   R_t ← {i ∈ R̂_t : p_i ≤ 1/2};  A_t ← {i ∈ R̂_t : p_i > 1/2}
4:   FDP̂_t ← (1 + |A_t|) / (|R_t| ∨ 1)
5:   if FDP̂_t ≤ α then reject R_t and halt end if
6:   R̂_{t+1} ← update((x_i, p̃_{t,i})_{i ∈ [n]}, R̂_t)
7: end for

Rephrasing Algorithm 3: we start by partially masking all p-values, yielding a "candidate rejection set" R̂_0 = [n], and then apply an arbitrary method to update R̂_t directly. The FDP estimator (line 4) is defined in essentially the same way as in Algorithm 1. It is easy to see that Algorithm 1 is a special case of Algorithm 3. Perhaps strikingly, the proof of FDR control carries through to this more general case without any modification.

It is not hard to see that our implementation in Section 4 can be reformulated in a simpler, more direct way: at each step, we estimate the local FDR for each partially masked p-value and peel off the δ-proportion of them with the highest estimated local FDR. In principle, we can use any "score" that measures how promising each hypothesis is, or how likely each hypothesis is to be non-null. A simple workflow based on Algorithm 3 is to peel off the hypotheses with the least favorable scores and proceed with scores refitted using the newly revealed p-values. Heuristically, the most statistically meaningful score is the local FDR, which is directly tied to our purpose, but allowing other scores makes the AdaPT framework more general and flexible. For instance, we recently exploited this idea to develop a general framework for controlling FDR under structural constraints; we refer readers to Lei et al. (2017) for more ideas in this vein.

It would also be interesting to attempt to relax our restriction that the p-values must be independent. In the absence of some modification, our AdaPT procedure does not control FDR in finite samples for dependent p-values.
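The peeling workflow based on Algorithm 3 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the score here is simply the masked p-value min(p_i, 1 − p_i), a crude stand-in for an estimated local FDR, and the peel fraction δ is an arbitrary choice.

```python
import numpy as np

def adapt_without_thresholds(p, alpha, delta=0.05):
    """Minimal sketch of the threshold-free AdaPT loop (Algorithm 3).

    The candidate set starts at [n]. At each step we compute
    FDPhat_t = (1 + |A_t|) / (|R_t| v 1), where R_t / A_t are the candidates
    with p_i <= 1/2 / p_i > 1/2, and reject R_t as soon as FDPhat_t <= alpha.
    Otherwise we peel off a delta-fraction of the candidates, scored here by
    the masked p-value alone.
    """
    candidates = np.arange(len(p))                    # Rhat_0 = [n]
    while len(candidates) > 0:
        below = candidates[p[candidates] <= 0.5]      # R_t
        n_above = int(np.sum(p[candidates] > 0.5))    # |A_t|
        fdp_hat = (1 + n_above) / max(len(below), 1)
        if fdp_hat <= alpha:
            return below                              # reject R_t and halt
        # Update step: uses only the masked p-values, so p_i and 1 - p_i
        # remain indistinguishable to the analyst at this point.
        masked = np.minimum(p[candidates], 1 - p[candidates])
        k = max(1, int(delta * len(candidates)))
        candidates = candidates[np.argsort(masked)[:-k]]  # drop k least promising
    return np.array([], dtype=int)

rng = np.random.default_rng(0)
p = np.concatenate([rng.beta(0.1, 1.0, size=100),  # planted non-nulls
                    rng.uniform(size=900)])         # nulls
rejected = adapt_without_thresholds(p, alpha=0.1)
```

In a full implementation the score would instead be the estimated local FDR, refitted from the revealed p-values at each step.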
In particular, there is a danger of "overfitting" to local random effects shared by nearby hypotheses: to the AdaPT procedure, such random effects look like signal to discover.

It could be interesting to pursue a hybrid method using ideas from AdaPT and the Knockoff+ procedure in the case where the p-values arise from regression coefficients or other multivariate Gaussian test statistics. Suppose that we observe a feature matrix X ∈ R^{n×d} and a response vector y ~ N_n(Xβ, σ² I_n), and we wish to test the hypotheses H_j: β_j = 0 for j = 1, ..., d. The key step in Barber and Candès (2015) is to compute another matrix X̃ ∈ R^{n×d} with X̃'X̃ = X'X and X̃'X = X'X − D, for some diagonal matrix D ∈ R^{d×d} with positive entries; this can be done provided that n ≥ 2d and X has full column rank.

If we define v = X'y and ṽ = X̃'y, then we have

( v + ṽ ; v − ṽ ) ~ N_2d( ( (2X'X − D)β ; Dβ ),  σ² [ 4X'X − 2D, 0 ; 0, 2D ] ).

As a result, ((v_j, ṽ_j))_{j ∈ H_0} are independent exchangeable pairs, conditional on (v_j)_{j ∉ H_0}. Let F_{−1} = σ(({v_j, ṽ_j})_{j=1}^d). The knockoff filter directly uses these exchangeable pairs by constructing knockoff statistics w(X, y) ∈ R^d. The sufficiency and antisymmetry conditions together imply that each |w_j| is F_{−1}-measurable and that, conditional on F_{−1}, b_j = (1 − sgn(w_j))/2 is a mirror-conservative "binary p-value": that is, (b_j)_{j ∈ H_0} are i.i.d. Bern(1/2), independently of F_{−1} and (b_j)_{j ∉ H_0}. Using |w_j| as a "predictor" (along with any other predictors for feature j that we might have at hand) and b_j as the p-value, the AdaPT procedure is immediately applicable. Note that min{b_j, 1 − b_j} = 0 for every j; hence, at each step it matters only whether the rejection threshold surface is above zero or not.
If q_t is the t-th smallest value of (|w_j|)_{j=1}^d, the Knockoff+ filter corresponds to using the thresholds s_t(|w_j|) = 0.5 · 1{|w_j| ≥ q_t}. More generally, we can use AdaPT and interactively change the threshold we use.

If σ is known, we can proceed more directly by constructing z-statistics and two-tailed p-values:

z_j = (v_j − ṽ_j) / (σ √(2 d_j)) ~ N( √(d_j / 2) β_j / σ, 1 );   p_j = 2 min{ Φ(z_j), 1 − Φ(z_j) },

where d_j is the j-th diagonal entry of D. In that case (p_j)_{j ∈ H_0} are i.i.d. uniform p-values conditional on (p_j)_{j ∉ H_0} and v + ṽ (not on F_{−1} above). Once again, we can immediately apply AdaPT using v + ṽ as a "predictor." While it is not fully clear a priori just how we should use v + ṽ as a predictor, this represents an interesting avenue for future work.

Focusing on the case of an orthogonal design further illuminates the relationship between AdaPT and the Knockoff+ procedure. Suppose that X ∈ R^{n×d} has orthonormal columns and that n ≥ 2d. In that case, Barber and Candès (2015) suggest using a knockoff matrix X̃ of d more orthonormal columns which are also orthogonal to the columns of X. Then X'y ~ N_d(β, σ² I_d) while X̃'y ~ N_d(0, σ² I_d), independently.

In this case, using the LASSO, forward stepwise regression, or virtually any other model selection path procedure on the design matrix [X X̃] is identical to selecting variables in decreasing order of |X_j'y| and |X̃_j'y|; or equivalently, in increasing order of the two-tailed p-values p_j = 2(1 − Φ(|X_j'y|/σ)) and p*_j = 2(1 − Φ(|X̃_j'y|/σ)) (this is true whether or not σ is known). As a result, if we operationalize the Knockoff+ procedure using e.g. the LASSO, we would reject hypotheses H_j for which min{p_j, p*_j} is small and p_j < p*_j.
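The orthogonal-design comparison above can be made concrete with a small self-contained sketch. The dimensions, effect sizes, and cutoff below are arbitrary illustrative choices; X and X̃ are built as 2d jointly orthonormal columns, so here X'X = X̃'X̃ = I_d and X̃'X = 0.

```python
import math
import numpy as np

def Phi(z):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

rng = np.random.default_rng(1)
n, d, sigma = 200, 50, 1.0

# Orthogonal-design special case: QR of a Gaussian matrix gives 2d
# orthonormal columns; the first d form X, the rest form the knockoffs.
Q, _ = np.linalg.qr(rng.standard_normal((n, 2 * d)))
X, Xtil = Q[:, :d], Q[:, d:]

beta = np.zeros(d)
beta[:10] = 5.0                                   # ten non-null coefficients
y = X @ beta + sigma * rng.standard_normal(n)

v, vtil = X.T @ y, Xtil.T @ y                     # v ~ N(beta, I), vtil ~ N(0, I)
p = 2 * (1 - Phi(np.abs(v) / sigma))              # two-tailed p-values
p_star = 2 * (1 - Phi(np.abs(vtil) / sigma))      # independent knockoff p-values

# Knockoff+-style evidence uses the i.i.d. pair (p_j, p*_j) ...
knockoff_flag = (np.minimum(p, p_star) < 0.01) & (p < p_star)
# ... while AdaPT's mirror construction uses the exchangeable pair (p_j, 1 - p_j).
adapt_flag = (np.minimum(p, 1 - p) < 0.01) & (p < 1 - p)
```

For the null coordinates, p*_j is an independent uniform draw while 1 − p_j is a deterministic mirror of p_j; both pairs single out small p_j in the same way, which is the parallel the text draws.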
By contrast, if we were to implement AdaPT with a constant threshold at each step, we would reject hypotheses H_j for which min{p_j, 1 − p_j} is small and p_j < 1 − p_j. Hence, the pairwise exchangeability of (p_j, 1 − p_j) plays the same role as the i.i.d. pair (p_j, p*_j) in knockoffs.

The two most salient differences between AdaPT and Knockoff+ in this case are:

(a) AdaPT allows for iterative interaction between the analyst and the data, allowing the analyst to update her local FDR estimates as information accrues. By contrast, the knockoff filter as described in Barber and Candès (2015) does not allow for such interaction (though it could, and this is a potentially interesting avenue for extending knockoffs).

(b) Unlike Knockoff+, AdaPT introduces no extra randomness into the problem. This is because AdaPT uses the pairwise exchangeability of p_i with the "mirror image" p-value 1 − p_i instead of the independent "knockoff" p-value p*_i ~ U[0, 1]. Provided the analyst uses a deterministic update subroutine, the AdaPT result is a deterministic function of the original data.

In addition to returning a list of rejections that is guaranteed to control the global FDR, most implementations of AdaPT will also return estimates, for each rejected hypothesis, of the local FDR,

fdr̂(p_i | x_i) = P̂( H_i is null | x_i, p_i ).

If we have reasonably high confidence in the model used to produce these estimates, they may provide the best summary of the evidence against the individual hypothesis H_i. By contrast, the significance level for the global FDR only summarizes the strength of evidence against the entire list of rejections, taken as a whole. Indeed, it is possible to construct pathological examples where fdr(p_i | x_i) = 1 for some of the rejected H_i, despite controlling FDR at some level α ≪ 1.
Even apart from such perversities, it will typically be the case that fdr̂(p_i | x_i) > α for many of the rejected hypotheses.

Despite their more favorable interpretation, however, the local FDR estimates produced by AdaPT rely on much stronger assumptions than the global FDR control guarantee — namely, that the two-groups model, as well as our specifications for π(x) and f(p | x), must be correct. Instead of using the parametric estimates fdr̂(p_i | x_i), we could estimate the local FDR in a moving window of w steps of the AdaPT algorithm:

fdp̂_{t,w} = (A_t − A_{t+w}) / ( 1 ∨ (R_t − R_{t+w}) ),   or   fdp̂⁺_{t,w} = (1 + A_t − A_{t+w}) / ( 1 ∨ (R_t − R_{t+w}) ).

Note that if we take an infinitely large window, we obtain fdp̂⁺_{t,∞} = FDP̂_t; thus, these estimators adaptively estimate the false discovery proportion for p-values revealed in the next w steps of the algorithm, in much the same way that FDP̂_t estimates the false discovery proportion for all remaining p-values. It would be interesting to investigate, in future work, what error-control guarantees we might be able to derive by using these estimators.

Acknowledgments

The authors thank Jim Pitman, Ruth Heller, Aaditya Ramdas, and Stefan Wager for helpful discussions.

References

Allison, D. B., G. L. Gadbury, M. Heo, J. R. Fernández, C.-K. Lee, T. A. Prolla, and R. Weindruch (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis 39(1), 1–20.
Anders, S. and W. Huber (2010). Differential expression analysis for sequence count data. Genome Biology 11(10), R106.
Arias-Castro, E. and S. Chen (2016). Distribution-free multiple testing. arXiv preprint arXiv:1604.07520.
Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5), 2055–2085.
Barber, R. F. and E. J.
Candès (2016). A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 289–300.
Benjamini, Y. and Y. Hochberg (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics 24(3), 407–418.
Berk, R., L. Brown, A. Buja, K. Zhang, and L. Zhao (2013). Valid post-selection inference. The Annals of Statistics 41(2), 802–837.
Bottomly, D., N. A. Walter, J. E. Hunter, P. Darakjian, S. Kawane, K. J. Buck, R. P. Searles, M. Mooney, S. K. McWeeney, and R. Hitzemann (2011). Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-seq and microarrays. PLoS ONE 6(3), e17820.
Bourgon, R., R. Gentleman, and W. Huber (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107(21), 9546–9551.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.
Brooks, A. N., L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley (2011). Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research 21(2), 193–202.
Davis, S. and P. S. Meltzer (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and Bioconductor. Bioinformatics 23(14), 1846–1847.
Dephoure, N. and S. P. Gygi (2012). Hyperplexing: a method for higher-order multiplexed quantitative proteomics provides a map of the dynamic response to rapamycin in yeast. Science Signaling 5(217), rs2.
Dobriban, E. (2016). A general convex framework for multiple testing with prior information. arXiv preprint arXiv:1603.05334.
Dobriban, E., K. Fortney, S. K. Kim, and A. B. Owen (2015). Optimal multiple testing under a Gaussian prior on the effect sizes. Biometrika 102(4), 753–766.
Dobson, A. J. and A.
Barnett (2008). An Introduction to Generalized Linear Models. CRC Press.
Du, L., C. Zhang, et al. (2014). Single-index modulated multiple testing. The Annals of Statistics 42(4), 1262–1311.
Dwork, C., V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth (2015). Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 117–126. ACM.
Efron, B. (2007). Size, power and false discovery rates. The Annals of Statistics, 1351–1377.
Efron, B., R. Tibshirani, J. D. Storey, and V. Tusher (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96(456), 1151–1160.
Ferkingstad, E., A. Frigessi, H. Rue, G. Thorleifsson, and A. Kong (2008). Unsupervised empirical Bayesian multiple testing with external covariates. The Annals of Applied Statistics, 714–735.
Fithian, W., D. Sun, and J. Taylor (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
Fortney, K., E. Dobriban, P. Garagnani, C. Pirazzini, D. Monti, D. Mari, G. Atzmon, N. Barzilai, C. Franceschi, A. B. Owen, et al. (2015). Genome-wide scan informed by age-related disease identifies loci for exceptional human longevity. PLoS Genetics 11(12), e1005728.
Frazee, A. C., B. Langmead, and J. T. Leek (2011). ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12(1), 449.
Genovese, C. R., K. Roeder, and L. Wasserman (2006). False discovery control with p-value weighting. Biometrika 93(3), 509–524.
Gentleman, R., V. Carey, W. Huber, and F. Hahne (2016). genefilter: methods for filtering genes from high-throughput experiments. R package version 1.54.2.
Geyer, C. J. and G. D. Meeden (2005). Fuzzy and randomized confidence intervals and p-values. Statistical Science, 358–366.
G'Sell, M. G., S. Wager, A. Chouldechova, and R. Tibshirani (2016).
Sequential selection procedures and false discovery rate control. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(2), 423–444.
Hemerik, J. and J. Goeman (2014). Exact testing with random permutations. arXiv preprint arXiv:1411.7565.
Himes, B. E., X. Jiang, P. Wagner, R. Hu, Q. Wang, B. Klanderman, R. M. Whitaker, Q. Duan, J. Lasky-Su, C. Nikolos, et al. (2014). RNA-seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. PLoS ONE 9(6), e99625.
Hoeffding, W. (1952). The large-sample power of tests based on permutations of observations. The Annals of Mathematical Statistics, 169–192.
Hu, J. X., H. Zhao, and H. H. Zhou (2012). False discovery rate control with groups. Journal of the American Statistical Association 105(491), 1215–1227.
Huber, W. and A. Reyes (2016). pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011. R package version 0.12.0.
Ignatiadis, N. and W. Huber (2017). Covariate-powered weighted multiple testing with false discovery rate control. arXiv preprint arXiv:1701.05179.
Ignatiadis, N., B. Klaus, J. B. Zaugg, and W. Huber (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods 13(7), 577–580.
Lawyer, G., E. Ferkingstad, R. Nesvåg, K. Varnäs, and I. Agartz (2009). Local and covariate-modulated false discovery rates applied in neuroimaging. NeuroImage 47(1), 213–219.
Lee, J. D., D. L. Sun, Y. Sun, and J. E. Taylor (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics 44(3), 907–927.
Lehmann, E. and J. P. Romano (2005). Testing Statistical Hypotheses. New York: Springer.
Lei, L. and W. Fithian (2016). Power of ordered hypothesis testing. In ICML.
Lei, L., A. Ramdas, and W.
Fithian (2017). STAR: A general interactive framework for FDR control under structural constraints. arXiv preprint arXiv:1710.02776.
Lewinger, J. P., D. V. Conti, J. W. Baurley, T. J. Triche, and D. C. Thomas (2007). Hierarchical Bayes prioritization of marker associations from a genome-wide association scan for further investigation. Genetic Epidemiology 31(8), 871–882.
Li, A. and R. F. Barber (2016a). Accumulation tests for FDR control in ordered hypothesis testing. Journal of the American Statistical Association 112 (just-accepted), 1–38.
Li, A. and R. F. Barber (2016b). Multiple testing with the structure adaptive Benjamini-Hochberg algorithm. arXiv preprint arXiv:1606.07926.
Love, M. I., S. Anders, and W. Huber (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12), 550.
Markitsis, A. and Y. Lai (2010). A censored beta mixture model for the estimation of the proportion of non-differentially expressed genes. Bioinformatics 26(5), 640–646.
Parker, R. and R. Rothenberg (1988). Identifying important results from multiple statistical tests. Statistics in Medicine 7(10), 1031–1043.
Pounds, S. and S. W. Morris (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19(10), 1236–1242.
Slater, M. (1950). Lagrange multipliers revisited. Technical report, Cowles Commission Discussion Paper, Mathematics.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3), 479–498.
Storey, J. D. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(3), 347–368.
Storey, J. D., J. E. Taylor, and D. Siegmund (2004).
Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(1), 187–205.

Storey, J. D. and R. Tibshirani (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100(16), 9440–9445.

Sun, W., B. J. Reich, T. Tony Cai, M. Guindani, and A. Schwartzman (2015). False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(1), 59–83.

Tian, X., J. Taylor, et al. (2018). Selective inference with a randomized response. The Annals of Statistics 46(2), 679–710.

Tukey, J. W. (1994). The Collected Works of John W. Tukey: Multiple Comparisons, 1948–1983, Volume 8. Chapman & Hall/CRC.

Yekutieli, D. (2012). Adjusted Bayesian inference for selected parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74(3), 515–541.

Zablocki, R. W., A. J. Schork, R. A. Levine, O. A. Andreassen, A. M. Dale, and W. K. Thompson (2014). Covariate-modulated local false discovery rate for genome-wide association studies. Bioinformatics 30(15), 2098–2104.

A. EM algorithm details

A.1. Derivation of E-step
To fill in the details of the algorithm, we are left to calculate the imputed values $\hat{H}_i^{(r)}$ and $\hat{y}_i^{(r)}$ given the parameters $\theta$ and $\beta$. Define $\pi_i$ and $\mu_i$ by
$$\pi_i = \left(1 + e^{-\phi_\pi(x_i)'\theta}\right)^{-1}, \qquad \mu_i = \zeta^{-1}\left(\beta'\phi_\mu(x_i)\right).$$
We distinguish two cases: for revealed p-values, $\tilde{p}_{t,i}$ is a singleton; for masked p-values, $\tilde{p}_{t,i}$ is a two-element set. For clarity, we let $p'_{t,i}$ denote the minimum element of $\tilde{p}_{t,i}$, i.e. $\tilde{p}_{t,i} = \{p'_{t,i}\}$ in the former case and $\tilde{p}_{t,i} = \{p'_{t,i}, 1 - p'_{t,i}\}$ in the latter case.
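In code, the partial censoring just described can be sketched as follows. This is an illustrative sketch only (the function name and interface are ours), with the revealed region taken to be $[s, 1-s]$ as described in Section A.2:

```python
def partially_censor(p, s):
    """Return the partially censored value ptilde_{t,i}: the singleton {p}
    if p lies in the revealed region [s, 1 - s], and the unordered pair
    {p, 1 - p} otherwise (the p-value is masked)."""
    if s <= p <= 1 - s:
        return {p}
    return {p, 1 - p}

# min(ptilde) is the p'_{t,i} that appears in the E-step formulas below
print(sorted(partially_censor(0.5, 0.25)))    # [0.5] -- revealed
print(sorted(partially_censor(0.125, 0.25)))  # [0.125, 0.875] -- masked
```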
• For revealed p-values,
$$\hat{H}_i^{(r)} = P(H_i = 1 \mid p_i = \tilde{p}_{t,i}) = \frac{\pi_i \cdot h(\tilde{p}_{t,i}; \mu_i)}{\pi_i \cdot h(\tilde{p}_{t,i}; \mu_i) + 1 - \pi_i}, \quad (26)$$
and
$$\hat{y}_i^{(r)} = E(y_i H_i \mid \tilde{p}_{t,i}) / \hat{H}_i^{(r)} = y_i. \quad (27)$$

• For masked p-values,
$$P(p_i = p'_{t,i} \mid \tilde{p}_{t,i}, H_i = 1) = \frac{h(p'_{t,i}; \mu_i)}{h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i)} \quad (28)$$
and
$$P(p_i = 1 - p'_{t,i} \mid \tilde{p}_{t,i}, H_i = 1) = \frac{h(1 - p'_{t,i}; \mu_i)}{h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i)}. \quad (29)$$
By Bayes' formula, we can also derive the conditional distribution of $H_i$ given $\tilde{p}_{t,i}$:
$$\hat{H}_i^{(r)} = P(H_i = 1 \mid \tilde{p}_{t,i}) = \frac{\pi_i \left( h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i) \right)}{\pi_i \left( h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i) \right) + 2(1 - \pi_i)}. \quad (30)$$
As a consequence of (28)–(30),
$$\hat{y}_i^{(r)} = \frac{1}{\hat{H}_i^{(r)}} \left\{ g(p'_{t,i}) \cdot P(H_i = 1, p_i = p'_{t,i} \mid \tilde{p}_{t,i}) + g(1 - p'_{t,i}) \cdot P(H_i = 1, p_i = 1 - p'_{t,i} \mid \tilde{p}_{t,i}) \right\}$$
$$= \frac{1}{\hat{H}_i^{(r)}} \cdot \frac{\pi_i \left( h(p'_{t,i}; \mu_i)\, g(p'_{t,i}) + h(1 - p'_{t,i}; \mu_i)\, g(1 - p'_{t,i}) \right)}{\pi_i \left( h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i) \right) + 2(1 - \pi_i)} = \frac{h(p'_{t,i}; \mu_i)\, g(p'_{t,i}) + h(1 - p'_{t,i}; \mu_i)\, g(1 - p'_{t,i})}{h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i)}. \quad (31)$$

A.1.1. Beta-mixture model
Consider the beta-mixture model (15), where $h(p; \mu) = \mu^{-1} p^{1/\mu - 1}$.
Plugging it into our general results, we obtain that

• for revealed p-values,
$$\hat{H}_i^{(r)} = \frac{\pi_i \cdot \mu_i^{-1} \tilde{p}_{t,i}^{1/\mu_i - 1}}{\pi_i \cdot \mu_i^{-1} \tilde{p}_{t,i}^{1/\mu_i - 1} + 1 - \pi_i}, \qquad \hat{y}_i^{(r)} = y_i; \quad (32)$$

• for masked p-values,
$$\hat{H}_i^{(r)} = \frac{\pi_i \cdot \mu_i^{-1} \left( (p'_{t,i})^{1/\mu_i - 1} + (1 - p'_{t,i})^{1/\mu_i - 1} \right)}{\pi_i \cdot \mu_i^{-1} \left( (p'_{t,i})^{1/\mu_i - 1} + (1 - p'_{t,i})^{1/\mu_i - 1} \right) + 2(1 - \pi_i)}, \qquad \hat{y}_i^{(r)} = \frac{(p'_{t,i})^{1/\mu_i - 1}(-\log p'_{t,i}) + (1 - p'_{t,i})^{1/\mu_i - 1}(-\log(1 - p'_{t,i}))}{(p'_{t,i})^{1/\mu_i - 1} + (1 - p'_{t,i})^{1/\mu_i - 1}}. \quad (33)$$

A.1.2. Gaussian-mixture model
Suppose the p-values are derived from one-sided z-tests. Since the p-value is a transformation of a normal random variable, the most natural transformation is $g(p) = \Phi^{-1}(1 - p)$. By (13),
$$h(p; \mu) = \exp\left\{ \mu \cdot g(p) - \frac{\mu^2}{2} \right\}.$$
Plugging it into our general results, we obtain that

• for revealed p-values,
$$\hat{H}_i^{(r)} = P(H_i = 1 \mid p_i = \tilde{p}_{t,i}) = \frac{\pi_i \cdot e^{\mu_i y_i}}{\pi_i \cdot e^{\mu_i y_i} + (1 - \pi_i)\, e^{\mu_i^2/2}}, \qquad \hat{y}_i^{(r)} = y_i; \quad (34)$$

• for masked p-values,
$$\hat{H}_i^{(r)} = \frac{\pi_i \cdot \cosh(\mu_i y_i)}{\pi_i \cdot \cosh(\mu_i y_i) + (1 - \pi_i)\, e^{\mu_i^2/2}}, \qquad \hat{y}_i^{(r)} = |y_i| \cdot \tanh(\mu_i |y_i|). \quad (35)$$

A.2. Initialization
Another important issue is the initialization. The formulae for $\hat{H}_i^{(r)}$ and $\hat{y}_i^{(r)}$ require estimates of $\pi_i$ and $\mu_i$. At step 0, when no information has accrued, we propose a simple method based on imputing a random guess, as follows. First we obtain an initial guess of $\pi_i$. Let $J_i = I(\tilde{p}_{t,i} \text{ contains two elements})$; then we observe that
$$1 - E J_i = P(p_i \in [s(x_i), 1 - s(x_i)]) \geq (1 - \pi_i)(1 - 2s(x_i)) \implies \pi_i \geq E\left(1 - \frac{1 - J_i}{1 - 2s(x_i)}\right). \quad (36)$$
Let
$$\tilde{J}_i = 1 - \frac{1 - J_i}{1 - 2s(x_i)} \quad (37)$$
and fit a logistic regression of $\tilde{J}_i$ on the covariates $\phi_\pi(x_i)$, with fitted values denoted by $\hat{\pi}_i$; then truncate $\hat{\pi}_i$ at 0 and 1 to obtain an initial guess of $\pi_i$, i.e.
$$\pi_i^{(0)} = (\hat{\pi}_i \vee 0) \wedge 1. \quad (38)$$
By (36), $\pi_i^{(0)}$ is a conservative estimate of $\pi_i$.
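A minimal sketch of the Gaussian-mixture E-step imputation ((34)–(35) above); the function name and interface are ours, for illustration only:

```python
import math
from statistics import NormalDist

def gaussian_estep(p_prime, masked, pi_i, mu_i):
    """E-step imputation under the Gaussian-mixture model, eqs. (34)-(35).
    p_prime is the minimum element p'_{t,i} of ptilde_{t,i}; the z-value is
    y = g(p) = Phi^{-1}(1 - p), so that g(1 - p') = -g(p')."""
    y = NormalDist().inv_cdf(1 - p_prime)
    denom_null = (1 - pi_i) * math.exp(mu_i ** 2 / 2)
    if not masked:  # revealed p-value, eq. (34)
        num = pi_i * math.exp(mu_i * y)
        return num / (num + denom_null), y
    # masked p-value, eq. (35): only |y_i| is identified
    num = pi_i * math.cosh(mu_i * y)
    return num / (num + denom_null), abs(y) * math.tanh(mu_i * abs(y))

H_hat, y_hat = gaussian_estep(0.05, True, 0.3, 2.0)
print(round(H_hat, 3), round(y_hat, 3))
```

A quick consistency check: the masked-case output agrees with the generic formulas (30)–(31) evaluated at $h(p;\mu) = \exp\{\mu g(p) - \mu^2/2\}$.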
This is preferred to an anti-conservative estimate, since the latter might cause over-fitting.

Next we obtain an initial guess of $\mu(x_i)$ by imputing the $p_i$'s. If $p_i \in [s(x_i), 1 - s(x_i)]$, then $\tilde{p}_{t,i} = p_i$ and we can use it directly. Otherwise, we only know that $p_i \in \tilde{p}_{t,i} = \{p'_{t,i}, 1 - p'_{t,i}\}$. If $p_i$ is null, then it should be uniform on $\{p'_{t,i}, 1 - p'_{t,i}\}$; if $p_i$ is non-null, then it is more likely to be $p'_{t,i}$, since $p'_{t,i} < 1 - p'_{t,i}$. Thus, we impute $p_i$ by $p'_{t,i}$, and fit an unweighted GLM of $g(p'_{t,i})$ on the covariates $\phi_\mu(x_i)$ with inverse link to obtain an initial guess of $\mu_i$.

A.3. Other issues
In Algorithm 2, we fit $\mu(x)$ using a weighted GLM, corresponding to the M-step. However, the weighted step is sensitive to the weights $\hat{H}^{(r)}$ derived from the E-step, which rely on the assumption that null p-values are exactly uniformly distributed on $[0, 1]$. For this reason, we modify the weighted step in the M-steps for fitting $\mu(x)$ into an unweighted step, and find that this choice leads to consistently good performance in all experiments shown in Section 5.

B. Technical Proofs

Proof (Proof of Theorem 2). Assume $f(p \mid x_i) \leq M$. Let $\eta_i = \nu(\{x_i\})$ and $s = (s_1, \ldots, s_n) \triangleq (s(x_1), \ldots, s(x_n))$. Then the objective function of (7),
$$\int_{\mathcal{X}} -F_1(s(x) \mid x)\,\pi(x)\,\nu(dx) = -\sum_{i=1}^n \eta_i F_1(s_i \mid x_i)\,\pi(x_i),$$
is a convex function of $s$ by condition (i), and the constraint function
$$g(s) \triangleq \int_{\mathcal{X}} \left\{ -\alpha F_1(s(x) \mid x)\,\pi(x) + (1 - \alpha) F_0(s(x) \mid x)(1 - \pi(x)) \right\} \nu(dx) = \sum_{i=1}^n \eta_i \left( -\alpha F_1(s_i \mid x_i)\,\pi(x_i) + (1 - \alpha) F_0(s_i \mid x_i)(1 - \pi(x_i)) \right)$$
is also a convex function of $s$ by condition (i). To establish the necessity of the KKT condition, it is left to prove Slater's condition (Slater 1950; Boyd and Vandenberghe 2004, Chap. 5), i.e.
there exist $\bar{s}$ and $\delta > 0$ such that $g(s) \leq 0$ for all $s \in B(\bar{s}, \delta)$, and $g(\bar{s}) < 0$. By condition (ii), WLOG we assume $\mathrm{fdr}(0 \mid x_1) < \alpha$ with $\nu(\{x_1\}) = \eta_1 > 0$. Fix any $\epsilon > 0$ and define $\omega(\epsilon)$ as the maximum modulus of continuity of $f_0(p \mid x_1)$ and $f_1(p \mid x_1)$ at the point $0$, i.e.
$$\omega(\epsilon) = \max\left\{ \sup_{p \leq \epsilon} |f_0(p \mid x_1) - f_0(0 \mid x_1)|,\ \sup_{p \leq \epsilon} |f_1(p \mid x_1) - f_1(0 \mid x_1)| \right\}.$$
By assumption (i), we know that $\lim_{\epsilon \to 0} \omega(\epsilon) = 0$. Let $\bar{s} \in \mathbb{R}^n$ with
$$\bar{s}_1 = 2\epsilon, \qquad \bar{s}_2 = \cdots = \bar{s}_n = \epsilon \cdot (2M)^{-1} \eta_1 \Delta,$$
where
$$\Delta = f(0 \mid x_1)\left( \alpha(1 - \mathrm{fdr}(0 \mid x_1)) - (1 - \alpha)\,\mathrm{fdr}(0 \mid x_1) \right) = \alpha\,\pi(x_1) f_1(0 \mid x_1) - (1 - \alpha)(1 - \pi(x_1)) f_0(0 \mid x_1) > 0.$$
Let $\delta = \min\{\bar{s}_2, \epsilon\}$. We will show that for any $s \in B(\bar{s}, \delta)$, $g(s) < 0$. In fact, for any $s \in B(\bar{s}, \delta)$, we have
$$s_1 \in [\epsilon, 3\epsilon], \qquad s_i \leq \epsilon \cdot M^{-1} \eta_1 \Delta \quad (i \geq 2).$$
Recall that $M$ is an upper bound for the null density. WLOG, we assume that $M \geq \eta_1 \Delta$, in which case $s_i \leq \epsilon$ for all $i \geq 2$. Note that $g(0) = 0$. By the mean-value theorem, there exist $\tilde{s}_1 \in [0, 3\epsilon]$ and $\tilde{s}_2, \ldots, \tilde{s}_n \in [0, \epsilon]$ such that
$$g(s) = s_1 \eta_1 \left( -\alpha f_1(\tilde{s}_1 \mid x_1)\pi(x_1) + (1 - \alpha) f_0(\tilde{s}_1 \mid x_1)(1 - \pi(x_1)) \right) + \sum_{i=2}^n s_i \eta_i \left( -\alpha f_1(\tilde{s}_i \mid x_i)\pi(x_i) + (1 - \alpha) f_0(\tilde{s}_i \mid x_i)(1 - \pi(x_i)) \right)$$
$$\leq s_1 \eta_1 \left( -\alpha f_1(0 \mid x_1)\pi(x_1) + (1 - \alpha) f_0(0 \mid x_1)(1 - \pi(x_1)) \right) + s_1 \eta_1 \omega(3\epsilon) + \sum_{i=2}^n s_i \eta_i \left( -\alpha f_1(\tilde{s}_i \mid x_i)\pi(x_i) + (1 - \alpha) f_0(\tilde{s}_i \mid x_i)(1 - \pi(x_i)) \right)$$
$$= -s_1 \eta_1 \Delta + s_1 \eta_1 \omega(3\epsilon) + \sum_{i=2}^n s_i \eta_i \left( -\alpha f_1(\tilde{s}_i \mid x_i)\pi(x_i) + (1 - \alpha) f_0(\tilde{s}_i \mid x_i)(1 - \pi(x_i)) \right)$$
$$\leq -s_1 \eta_1 \Delta + s_1 \eta_1 \omega(3\epsilon) + \sum_{i=2}^n s_i \eta_i \cdot M \leq -\epsilon \eta_1 \Delta + s_1 \eta_1 \omega(3\epsilon) + \epsilon \eta_1 \Delta \sum_{i=2}^n \eta_i$$
$$\leq -\epsilon \eta_1 \Delta + O(\epsilon\,\omega(3\epsilon)) + \epsilon \eta_1 \Delta (1 - \eta_1) = -\epsilon \eta_1^2 \Delta + o(\epsilon).$$
Thus, for sufficiently small $\epsilon$, $g(s) < 0$ for all $s \in B(\bar{s}, \delta)$, and hence Slater's condition is satisfied.

Proof (Proof of Lemma 2). We assume $\rho < 1$. Construct a random set $\mathcal{A} \subseteq [n]$ with $P(i \in \mathcal{A} \mid \mathcal{G}_{-1}) = \frac{1 - \rho_i}{1 - \rho}$, conditionally independently across $i \in [n]$, and construct conditionally i.i.d. Bernoulli variables $q_1, \ldots, q_n$, independent of $\mathcal{A}$, with $P(q_i = 1 \mid \mathcal{G}_{-1}) = \rho$. Then we can define
$$\tilde{b}_i = q_i\, \mathbb{1}\{i \in \mathcal{A}\} + \mathbb{1}\{i \notin \mathcal{A}\}, \quad (39)$$
which by construction gives $P(\tilde{b}_i = 1 \mid \mathcal{G}_{-1}) = \rho_i$ almost surely. Furthermore, noticing that
$$P(\tilde{b}_i = 0, \tilde{b}_j = 0 \mid \mathcal{G}_{-1}) = P(i \in \mathcal{A}, j \in \mathcal{A}, q_i = 0, q_j = 0 \mid \mathcal{G}_{-1}) = P(i \in \mathcal{A}, q_i = 0 \mid \mathcal{G}_{-1})\, P(j \in \mathcal{A}, q_j = 0 \mid \mathcal{G}_{-1}) = P(\tilde{b}_i = 0 \mid \mathcal{G}_{-1})\, P(\tilde{b}_j = 0 \mid \mathcal{G}_{-1}),$$
we conclude that the $\tilde{b}_i$ are conditionally independent given $\mathcal{G}_{-1}$. As a consequence, given $\mathcal{G}_{-1}$,
$$(\tilde{b}_1, \ldots, \tilde{b}_n) \stackrel{d}{=} (b_1, \ldots, b_n).$$
In the following proof, we will use (39) to represent the $b_i$'s.

To ensure that $\mathcal{C}_t$ decreases by at most a single element in each step, we introduce intermediate steps: for integers $t \geq 0$ and $1 \leq i \leq n$, define
$$\mathcal{C}_{t + i/n} = \mathcal{C}_{t+1} \cup \{ j \in \mathcal{C}_t : j \leq n - i \}.$$
Next, define the augmented filtration
$$\mathcal{G}_t^{\mathcal{A}} = \sigma\left( \mathcal{G}_{-1}, \mathcal{A}, \mathcal{C}_t, (b_i)_{i \notin \mathcal{C}_t \cap \mathcal{A}}, \sum_{i \in \mathcal{C}_t \cap \mathcal{A}} b_i \right) \supseteq \mathcal{G}_t,$$
for both integer and fractional values of $t$. Note that $\mathcal{C}_{t+1/n}$ is measurable with respect to $\sigma(\mathcal{C}_t)$. In addition, we define
$$U_t^{\mathcal{A}} = \sum_{i \in \mathcal{C}_t \cap \mathcal{A}} b_i, \qquad V_t^{\mathcal{A}} = \sum_{i \in \mathcal{C}_t \cap \mathcal{A}} (1 - b_i), \qquad Z_t^{\mathcal{A}} = \frac{1 + |\mathcal{C}_t \cap \mathcal{A}|}{1 + U_t^{\mathcal{A}}}.$$
Recalling the definition of $U_t$ (Section 2.1) and of $b_i$ (defined above in (39)), for any $t$,
$$\frac{1 + |\mathcal{C}_t|}{1 + U_t} = \frac{1 + |\mathcal{C}_t \cap \mathcal{A}| + |\mathcal{C}_t \cap \mathcal{A}^c|}{1 + U_t^{\mathcal{A}} + |\mathcal{C}_t \cap \mathcal{A}^c|} \leq \frac{1 + |\mathcal{C}_t \cap \mathcal{A}|}{1 + U_t^{\mathcal{A}}} = Z_t^{\mathcal{A}},$$
where the equality uses $b_i = 1$ for $i \notin \mathcal{A}$, and the inequality holds since $1 + |\mathcal{C}_t \cap \mathcal{A}| \geq 1 + U_t^{\mathcal{A}}$.
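As a quick numerical sanity check on the construction (39): marginalizing over $\mathcal{A}$ should recover $P(\tilde{b}_i = 1) = \rho_i$. A sketch of this arithmetic (assuming $\rho \leq \rho_i < 1$, so the probabilities involved are valid; the function name is ours):

```python
def btilde_prob_one(rho_i, rho):
    """Marginal P(btilde_i = 1) under construction (39):
    btilde_i = 1 whenever i lies outside A, and equals q_i ~ Bernoulli(rho)
    when i lies in A, with P(i in A) = (1 - rho_i) / (1 - rho)."""
    p_in_A = (1 - rho_i) / (1 - rho)   # requires rho <= rho_i < 1
    return (1 - p_in_A) + p_in_A * rho

print(btilde_prob_one(0.5, 0.2))  # ~0.5, recovering rho_i as claimed
```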
Finally, we observe that $(b_i)_{i \in \mathcal{C}_t \cap \mathcal{A}} = (q_i)_{i \in \mathcal{C}_t \cap \mathcal{A}}$ are exchangeable given $\mathcal{G}_t^{\mathcal{A}}$, with the random vector distributed uniformly over configurations summing to $U_t^{\mathcal{A}}$. There are three cases:

(i) if $\mathcal{C}_{t+1/n} \cap \mathcal{A} = \mathcal{C}_t \cap \mathcal{A}$, then $E[Z_{t+1/n}^{\mathcal{A}} \mid \mathcal{G}_t^{\mathcal{A}}] = Z_t^{\mathcal{A}}$;

(ii) if $U_t^{\mathcal{A}} = 0$ but $\mathcal{C}_{t+1/n} \cap \mathcal{A} \subsetneq \mathcal{C}_t \cap \mathcal{A}$, then
$$Z_{t+1/n}^{\mathcal{A}} = 1 + |\mathcal{C}_{t+1/n} \cap \mathcal{A}| = |\mathcal{C}_t \cap \mathcal{A}| = Z_t^{\mathcal{A}} - 1 \leq Z_t^{\mathcal{A}};$$

(iii) otherwise, $U_t^{\mathcal{A}} > 0$ and $\mathcal{C}_t \setminus \mathcal{C}_{t+1/n} = \{j\}$ for some $j \in \mathcal{A}$. The exchangeability of $b_i = q_i$ implies that
$$P(b_j = 1 \mid \mathcal{G}_t^{\mathcal{A}}) = \frac{U_t^{\mathcal{A}}}{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}}.$$
Then
$$E[Z_{t+1/n}^{\mathcal{A}} \mid \mathcal{G}_t^{\mathcal{A}}] = \frac{V_t^{\mathcal{A}}}{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}} \cdot \frac{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}}{1 + U_t^{\mathcal{A}}} + \frac{U_t^{\mathcal{A}}}{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}} \cdot \frac{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}}{U_t^{\mathcal{A}}} = \frac{V_t^{\mathcal{A}}}{1 + U_t^{\mathcal{A}}} + 1 = Z_t^{\mathcal{A}}.$$

In all three cases, the conditional expectation of $Z_{t+1/n}^{\mathcal{A}}$ is no larger than $Z_t^{\mathcal{A}}$; thus, $Z_t^{\mathcal{A}}$ is a supermartingale with respect to the filtration $\mathcal{G}_t^{\mathcal{A}}$. Because $\hat{t}$ is also a stopping time with respect to the filtration $(\mathcal{G}_t^{\mathcal{A}})_{t = 0, 1/n, 2/n, \ldots}$ (but one which can only take integer values), for any $\mathcal{A} \subseteq [n]$ we have
$$E\left[ \frac{1 + |\mathcal{C}_{\hat{t}}|}{1 + \sum_{i \in \mathcal{C}_{\hat{t}}} b_i} \,\Big|\, \mathcal{G}_{-1}, \mathcal{A} \right] \leq E\left[ Z_{\hat{t}}^{\mathcal{A}} \mid \mathcal{G}_{-1}, \mathcal{A} \right] \leq E\left[ Z_0^{\mathcal{A}} \mid \mathcal{G}_{-1}, \mathcal{A} \right]. \quad (40)$$
Let $m_{\mathcal{A}} = |\mathcal{C}_0 \cap \mathcal{A}|$. By the representation (39), given $(\mathcal{G}_{-1}, \mathcal{A})$ we have $U_0^{\mathcal{A}} = \sum_{i \in \mathcal{C}_0 \cap \mathcal{A}} q_i \sim \mathrm{Binomial}(m_{\mathcal{A}}, \rho)$, so that
$$E\left[ Z_0^{\mathcal{A}} \mid \mathcal{G}_{-1}, \mathcal{A} \right] = \sum_{k=0}^{m_{\mathcal{A}}} \frac{1 + m_{\mathcal{A}}}{1 + k} \binom{m_{\mathcal{A}}}{k} \rho^k (1 - \rho)^{m_{\mathcal{A}} - k} = \sum_{k=0}^{m_{\mathcal{A}}} \binom{m_{\mathcal{A}} + 1}{k + 1} \rho^k (1 - \rho)^{m_{\mathcal{A}} - k} = \rho^{-1} \sum_{k=0}^{m_{\mathcal{A}}} \binom{m_{\mathcal{A}} + 1}{k + 1} \rho^{k+1} (1 - \rho)^{(m_{\mathcal{A}} + 1) - (k + 1)} = \rho^{-1} \left( 1 - (1 - \rho)^{m_{\mathcal{A}} + 1} \right) \leq \rho^{-1}.$$
Marginalizing over $\mathcal{A}$ in (40), we obtain the result.

B.1. Mirror-conservatism
In this subsection we provide two important examples that produce mirror-conservative p-values. The first example is the permutation test (e.g. Hoeffding, 1952). Typically we assume that under the null hypothesis the test statistic $T(X)$, where $X$ is shorthand for the observed data, is invariant in distribution under a finite group of transformations $\mathcal{G}$, i.e.
$$T(X) \stackrel{d}{=} T(gX), \quad \forall g \in \mathcal{G}.$$
When $|\mathcal{G}|$ is small, one can compute a discrete p-value as $R/|\mathcal{G}|$, where $R$ is the rank of $T(X)$ in the set $\{T(gX) : g \in \mathcal{G}\}$. When $|\mathcal{G}|$ is large, Hemerik and Goeman (2014) propose sampling a subset $\mathcal{G}' = \{g_1, g_2, \ldots, g_m\}$, with $g_1 = \mathrm{Id}$ and $g_2, \ldots, g_m$ a simple random sample (without replacement) from $\mathcal{G}$, and calculating the p-value based on $\mathcal{G}'$. In both cases, it can be proved that under the null hypothesis the p-value is uniformly distributed on the equi-spaced grid $\{\frac{1}{m}, \frac{2}{m}, \ldots, \frac{m}{m}\}$, where $m$ is the number of replicates. Then for any $0 < a_1 \leq a_2 < \frac{1}{2}$,
$$P(p \in [a_1, a_2]) = \frac{\left( \lfloor m a_2 \rfloor - \lceil m a_1 \rceil + 1 \right)_+}{m},$$
where $(u)_+$ denotes $\max\{u, 0\}$, and
$$P(p \in [1 - a_2, 1 - a_1]) = \frac{\left( \lfloor m(1 - a_1) \rfloor - \lceil m(1 - a_2) \rceil + 1 \right)_+}{m} = \frac{\left( \lfloor m a_2 \rfloor - \lceil m a_1 \rceil + 1 \right)_+}{m}.$$
As a result, we conclude that $P(p \in [a_1, a_2]) \leq P(p \in [1 - a_2, 1 - a_1])$, and hence $p$ is mirror-conservative.

The second example is the one-sided test for distributions with monotone likelihood ratio, which is ubiquitous in practice. Specifically, let $\theta$ be the univariate parameter of interest and $p_\theta(x)$ a family of densities with respect to some carrier measure $\mu$. The family $p_\theta(x)$ is said to have monotone likelihood ratio with respect to some real-valued function $T(x)$ if for any $\theta < \theta'$, $p_\theta \not\equiv p_{\theta'}$ and the ratio $p_{\theta'}(x)/p_\theta(x)$ is a nondecreasing function of $T(x)$.
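The grid computation for the permutation p-value (the first example above) can be checked by brute force; a small sketch, with interval endpoints of our own choosing:

```python
def grid_prob(m, lo, hi):
    """P(p in [lo, hi]) when p is uniform on the grid {1/m, 2/m, ..., m/m}."""
    return sum(lo <= k / m <= hi for k in range(1, m + 1)) / m

# mirror-conservatism: mass in [a1, a2] never exceeds mass in [1 - a2, 1 - a1]
for m in (7, 10, 13):
    for a1, a2 in [(0.0, 0.08), (0.03, 0.24), (0.11, 0.47)]:
        assert grid_prob(m, a1, a2) <= grid_prob(m, 1 - a2, 1 - a1)
```

The inequality can be strict: for instance, with $m = 7$ the interval $[0, 0.08]$ contains no grid point while its mirror $[0.92, 1]$ contains $p = 1$.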
For testing $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$, it is well known that there exists a Uniformly Most Powerful (UMP) test (Lehmann and Romano, 2005), with the following decision function:
$$\phi(x) = \begin{cases} 1 & (T(x) > C) \\ \gamma & (T(x) = C) \\ 0 & (T(x) < C) \end{cases}$$
where $(\gamma, C)$ is the solution of
$$P_{\theta_0}(T(X) > C) + \gamma\, P_{\theta_0}(T(X) = C) = \alpha.$$
Write $T(X)$ as $T$ and $P_{\theta_0}(T(X) \geq t)$ as $G(t)$ for short. Then the induced p-value can be written as
$$p = G(T^+) + U\left( G(T) - G(T^+) \right), \qquad U \sim U([0, 1]), \quad (41)$$
where $G(t^+) = \lim_{s \downarrow t} G(s)$. The quantity (41) is termed a fuzzy p-value by Geyer and Meeden (2005).

Proposition 1. Let $p_\theta(x)$ be a family of densities (w.r.t. the carrier measure $\mu$) that has monotone likelihood ratio w.r.t. $T(x)$. Then the p-value defined in (41) is mirror-conservative.

Proof. Since $p_\theta(x)$ has monotone likelihood ratio, for each $\theta \leq \theta_0$ there exists a non-increasing function $g_\theta(t)$ such that
$$\frac{p_\theta(x)}{p_{\theta_0}(x)} = g_\theta(T(x)).$$
Let $\nu$ be the measure such that for any event $A \subseteq \mathbb{R}$,
$$\nu(A) = \int I(T(x) \in A) \cdot p_{\theta_0}(x)\, \mu(dx).$$
Then for any event $A \subseteq \mathbb{R}$,
$$P_\theta(T(X) \in A) = \int g_\theta(T(x)) \cdot I(T(x) \in A) \cdot p_{\theta_0}(x)\, d\mu = \int_A g_\theta(t)\, \nu(dt). \quad (42)$$
The second equality follows from the standard approximation argument in measure theory: it holds for indicator functions $g_\theta(t) = I(t \in A')$, extends to simple step functions, and passes to the limit. Let $\omega$ be the product measure of $\nu$ and the Lebesgue measure on $[0, 1]$. Then for any event $B \subseteq \mathbb{R} \times [0, 1]$,
$$P_\theta\left( (T(X), U) \in B \right) = \int_B g_\theta(t)\, \omega(dt, du). \quad (43)$$
Note that $g_{\theta_0}(t) \equiv 1$, so
$$\omega(B) = P_{\theta_0}\left( (T(X), U) \in B \right). \quad (44)$$
Let $H(\cdot, \cdot)$ be the transformation such that $p = H(T, U)$. Then for any $z$,
$$\{(t, u) : G(t) < z\} \subseteq H^{-1}([0, z)) \subseteq H^{-1}([0, z]) \subseteq \{(t, u) : G(t) \leq z\}.$$
As a result, for any $0 < z_1 < z_2 < 1$,
$$H^{-1}([0, z_1]) \subseteq \{(t, u) : G(t) \leq z_1\}, \qquad H^{-1}([z_2, 1]) \subseteq \{(t, u) : G(t) \geq z_2\},$$
and hence, since $G$ is non-increasing, there exists $t(z_1, z_2)$ such that
$$t_2 \leq t(z_1, z_2) \leq t_1, \qquad \forall\, (t_1, u_1) \in H^{-1}([0, z_1]),\ (t_2, u_2) \in H^{-1}([z_2, 1]). \quad (45)$$
Given $0 \leq a_1 \leq a_2 < \frac{1}{2}$, let $A_1 = [a_1, a_2]$ and $A_2 = [1 - a_2, 1 - a_1]$. Then (45) and (43), together with the monotonicity of $g_\theta$, imply that
$$P_\theta(p \in A_1) = \int_{H^{-1}(A_1)} g_\theta(t)\, \omega(dt, du) \leq \omega(H^{-1}(A_1)) \cdot g_\theta(t(a_2, 1 - a_2)),$$
and
$$P_\theta(p \in A_2) = \int_{H^{-1}(A_2)} g_\theta(t)\, \omega(dt, du) \geq \omega(H^{-1}(A_2)) \cdot g_\theta(t(a_2, 1 - a_2)).$$
Recalling (44), we obtain that
$$\frac{P_\theta(p \in A_1)}{P_{\theta_0}(p \in A_1)} \leq \frac{P_\theta(p \in A_2)}{P_{\theta_0}(p \in A_2)}. \quad (46)$$
It is left to prove that $P_{\theta_0}(p \in A_1) = P_{\theta_0}(p \in A_2)$. In fact, we can prove that
$$p \sim U([0, 1]) \quad \text{when } \theta = \theta_0. \quad (47)$$
Fix any $z \in (0, 1)$ and let $G^{-1}(z) = \sup\{t : G(t) \geq z\}$. For clarity we write $u$ for $G^{-1}(z)$. Now we prove (47) in two cases:

• If $u$ is a continuity point of $G$, i.e. $G(u) = G(u^+)$: since $G$ is left-continuous, we must have $G(u) = G(u^+) = z$. Then
$$P_{\theta_0}(p \leq z) = P_{\theta_0}(T(X) \geq u) = G(u) = z.$$

• If $u$ is an atom of $G$, i.e. $G(u) > G(u^+)$: by definition, $G(u) \geq z$ and $z \geq G(u^+)$. Then
$$P_{\theta_0}(p \leq z) = P_{\theta_0}(T(X) > u) + P_{\theta_0}(T(X) = u) \cdot \frac{z - G(u^+)}{G(u) - G(u^+)} = G(u^+) + \left( G(u) - G(u^+) \right) \cdot \frac{z - G(u^+)}{G(u) - G(u^+)} = z.$$

This proves (47), which implies $P_{\theta_0}(p \in A_1) = a_2 - a_1 = P_{\theta_0}(p \in A_2)$. Combined with (46), we conclude that for any $\theta \leq \theta_0$, $P_\theta(p \in A_1) \leq P_\theta(p \in A_2)$, i.e. the p-value is mirror-conservative.
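As a numerical illustration of (47), take $T \sim \mathrm{Binomial}(5, 1/2)$ under $\theta_0$ (an example distribution of our choosing) and compute $P_{\theta_0}(p \leq z)$ for the fuzzy p-value (41) exactly:

```python
from math import comb

n = 5  # T ~ Binomial(n, 1/2) under theta_0
pmf = {t: comb(n, t) / 2 ** n for t in range(n + 1)}
G = {t: sum(pmf[s] for s in range(t, n + 1)) for t in range(n + 1)}  # G(t) = P(T >= t)
G_plus = {t: G[t] - pmf[t] for t in range(n + 1)}                    # G(t+) = P(T > t)

def cdf_fuzzy_p(z):
    """P(p <= z) under theta_0, where p = G(T+) + U * (G(T) - G(T+)), U ~ U[0, 1]."""
    total = 0.0
    for t in range(n + 1):
        frac = (z - G_plus[t]) / (G[t] - G_plus[t])  # G(t) - G(t+) = pmf[t] > 0
        total += pmf[t] * min(max(frac, 0.0), 1.0)
    return total

for z in (0.1, 0.37, 0.5, 0.9):
    assert abs(cdf_fuzzy_p(z) - z) < 1e-12  # exactly uniform, as claimed in (47)
```

Despite $T$ being discrete, the extra randomization $U$ makes the fuzzy p-value exactly uniform under $\theta_0$, which is what makes the equality $P_{\theta_0}(p \in A_1) = P_{\theta_0}(p \in A_2)$ in the proof possible.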