AdaPT: An interactive procedure for multiple testing with side information
Lihua Lei and William Fithian
Department of Statistics, University of California, Berkeley, USA
E-mail: {lihua.lei, wfithian}@berkeley.edu

Summary. We consider the problem of multiple hypothesis testing with generic side information: for each hypothesis H_i we observe both a p-value p_i and some predictor x_i encoding contextual information about the hypothesis. For large-scale problems, adaptively focusing power on the more promising hypotheses (those more likely to yield discoveries) can lead to much more powerful multiple testing procedures. We propose a general iterative framework for this problem, called the Adaptive p-value Thresholding (AdaPT) procedure, which adaptively estimates a Bayes-optimal p-value rejection threshold and controls the false discovery rate (FDR) in finite samples. At each iteration of the procedure, the analyst proposes a rejection threshold and observes partially censored p-values, estimates the false discovery proportion (FDP) below the threshold, and proposes another threshold, until the estimated FDP is below α. Our procedure is adaptive in an unusually strong sense, permitting the analyst to use any statistical or machine learning method she chooses to estimate the optimal threshold, and to switch between different models at each iteration as information accrues. We demonstrate the favorable performance of AdaPT by comparing it to state-of-the-art methods in five real applications and two simulation studies.
Keywords: multiple testing, false discovery rate, p-value weighting, selective inference, adaptive inference, martingales
1. Introduction
In classical statistics we assume that the question to be answered, and the analysis to be used in answering the question, are both fixed in advance of collecting the data. Many modern applications, however, involve extremely complex data sets that may be collected without any specific hypothesis in mind. Indeed, very often the express goal is to explore the data in search of insights we may not have expected to find. A central challenge in modern statistics is to provide scientists with methods that are flexible enough to allow for exploration, but that nevertheless provide statistical guarantees for the conclusions that are eventually reported.

Selective inference methods blend exploratory and confirmatory analysis by allowing a search over the space of potentially interesting questions, while still guaranteeing control of an appropriate Type I error rate such as a conditional error rate (e.g., Yekutieli, 2012; Lee et al., 2016; Fithian et al., 2014), familywise error rate (e.g., Tukey, 1994; Berk et al., 2013), or false discovery rate (e.g., Benjamini and Hochberg, 1995; Barber and Candès, 2015). However, most selective inference methods require that the selection algorithm be specified in advance, forcing a choice between either ignoring any difficult-to-formalize domain knowledge or sacrificing statistical validity guarantees.

Interactive data analysis methods relax the requirement of a pre-defined selection algorithm. Instead, they provide for an interactive analysis protocol between the analyst and the data, guaranteeing statistical validity as long as the protocol is followed.
The two central questions in interactive data analysis are "what did the analyst know and when did she know it?" Previous methods for interactive data analysis involve randomization (Dwork et al., 2015; Tian et al., 2018) to control the analyst's access to the data at the time she decides what questions to ask.

This paper proposes an iterative, interactive method for multiple testing in the presence of side information about the hypotheses. We restrict the analyst's knowledge by partially censoring all p-values smaller than a currently-proposed rejection threshold, and guarantee finite-sample FDR control by applying a version of the optional-stopping argument pioneered by Storey et al. (2004) and extended in Barber and Candès (2015); G'Sell et al. (2016); Li and Barber (2016a); Lei and Fithian (2016); Barber and Candès (2016).

In many areas of modern applied statistics, from genetics and neuroimaging to online advertising and finance, researchers routinely test thousands or millions of hypotheses at a time. For large-scale testing problems, perhaps the most celebrated multiple testing procedure of the modern era is the Benjamini–Hochberg (BH) procedure (Benjamini and Hochberg, 1995). Given n hypotheses and a p-value for each one, the BH procedure returns a list of rejections or "discoveries." If R is the number of total rejections and V is the number of false rejections (rejections of true null hypotheses), the BH procedure controls the false discovery rate (FDR), defined as

FDR = E[ V / max{R, 1} ],   (1)

at a user-specified target level α. The random variable V / max{R, 1} is called the false discovery proportion (FDP).

The BH procedure is nearly optimal when the null hypotheses are exchangeable a priori, and nearly all true. In other settings, however, the power can be improved, sometimes dramatically, by applying prior knowledge or by learning from the data.
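As a baseline for the adaptive procedures discussed next, the BH procedure itself can be sketched in a few lines: sort the p-values and reject the k smallest, where k is the largest index with p_(k) ≤ αk/n. This is a standard implementation, not code from the paper:

```python
import numpy as np

def bh_procedure(pvals, alpha=0.1):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is
    the largest index with p_(k) <= alpha * k / n."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    below = sorted_p <= alpha * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0]) + 1  # largest index passing the BH line
    return order[:k]                      # indices of rejected hypotheses

rejected = bh_procedure([0.001, 0.009, 0.04, 0.2, 0.7], alpha=0.1)
print(rejected)  # -> [0 1 2]
```

Note that the third p-value (0.04) exceeds its own threshold 0.1·3/5 = 0.06 pointwise would be false: 0.04 ≤ 0.06, so all three small p-values are rejected together, illustrating the "step-up" character of BH.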
For example, adaptive FDR-controlling procedures can gain in power by estimating the overall proportion of true nulls (Storey, 2002), applying priors to increase power using p-value weights (Benjamini and Hochberg, 1997; Genovese et al., 2006; Dobriban et al., 2015; Dobriban, 2016), grouping similar null hypotheses and estimating the true null proportion within each group (Hu et al., 2012), or exploiting a prior ordering to focus power on more "promising" hypotheses near the top of the ordering (Barber and Candès, 2015; G'Sell et al., 2016; Li and Barber, 2016a; Lei and Fithian, 2016).

In most large-scale testing problems, the null hypotheses do not comprise an undifferentiated list; rather, each hypothesis is associated with rich contextual information that could potentially help to inform our testing procedures. For example, Li and Barber (2016a) test for differential expression of 22,283 genes between a treatment and control condition for a breast cancer drug, with side information in the form of an ordering of genes from most to least "promising" using auxiliary data collected at larger dosages. Multiple testing procedures that exploit the ordering can reject hundreds of hypotheses while the BH procedure (which does not exploit the ordering) rejects none.

More generally, prior information could arise in more complex ways. For example, consider testing for association of 400,000 single-nucleotide polymorphisms (SNPs) with each of 40 related diseases. If gene-regulatory relationships are known, then we might expect SNPs near related genes to be associated (or not) with related diseases, but without knowing ahead of time which gene-disease pairs are promising. In a similar vein, Fortney et al. (2015) used prior knowledge of each SNP's associations with age-related diseases to focus their search for SNPs associated with longevity, leading to novel discoveries. Inspired by examples like this, Ignatiadis et al.
(2016) and Li and Barber (2016b) have recently proposed a more general problem setting where, for each hypothesis H_i, i ∈ [n], we observe not only a p-value p_i ∈ [0, 1] but also a predictor x_i lying in some generic space X. Unlike p_i, x_i carries only indirect information about the hypothesis: it is meant to capture some side information that might bear on H_i's likelihood to be false, or on the power of p_i under the alternative, but the nature of this relationship is not fully known ahead of time and must be learned from the data.

In other situations, the "predictor" information could simply represent a measure of sample size or overall signal for testing the i-th hypothesis, which could be informative about the power of the i-th test to distinguish the alternative from the null. For example, if each p_i concerns a test for association between the i-th SNP and a disease, then the overall prevalence of that SNP (in the combined treatment and control groups) can be used as prior information. Or, if p_i arises from a two-sample t-test, we could use the pooled variance, the sample variance ignoring the group labels, as prior information; see e.g. (Bourgon et al., 2010; Ignatiadis et al., 2016).

Fig. 1: Illustration of one step of the AdaPT procedure with a univariate predictor. (a) AdaPT (intermediate stage): A_t = 4 and R_t = 11 are the numbers of blue and red points respectively, leading to \widehat{FDP} = (1 + 4)/11 ≈ 0.45. If \widehat{FDP} ≤ α, we stop and reject the red points; otherwise we choose a new threshold s_{t+1} ⪯ s_t and continue. (b) AdaPT (analyst's view): information available to the analyst when choosing s_{t+1}(x) (A_t and R_t are also known). Each red and blue point is reflected across p = 0.5, leaving the analyst to impute which are the true p-values and which are the mirror images.

This paper presents a new framework for FDR control with generic side information, which we call adaptive p-value thresholding, or AdaPT for short. Our method proceeds iteratively: at each step t = 0, 1, ..., the analyst proposes a rejection threshold s_t(x) and computes an estimator \widehat{FDP}_t of the false discovery proportion for this threshold. If \widehat{FDP}_t ≤ α, she stops and rejects every H_i for which p_i ≤ s_t(x_i). Otherwise, she proposes a more stringent threshold s_{t+1} ⪯ s_t and moves on to the next iteration, where the notation a ⪯ b means a(x) ≤ b(x) for all x ∈ X. The estimator \widehat{FDP}_t is computed by comparing the number R_t of rejections to the number A_t of p-values for which p_i ≥ 1 − s_t(x_i):

R_t = |{i : p_i ≤ s_t(x_i)}|,   A_t = |{i : p_i ≥ 1 − s_t(x_i)}|,   and   \widehat{FDP}_t = (1 + A_t)/(R_t ∨ 1).

The estimate
\widehat{FDP}_t is also used by Lei and Fithian (2016) and Arias-Castro and Chen (2016). Figure 1a illustrates the way s_t(x) and 1 − s_t(x) partition the data into three regions; A_t is the number of points in the upper blue region and R_t is the number in the lower red region.

At each step t, the analyst can choose the next threshold s_{t+1}(x) however she chooses, with only two constraints. First, s_{t+1} ⪯ s_t as stated before. Second, the large and small p-values (the ones contributing to A_t and R_t) are partially masked. Specifically, at step t the analyst is allowed to observe A_t and R_t, as well as the entire sequence (x_i, \tilde{p}_{t,i})_{i=1}^n, where

\tilde{p}_{t,i} = p_i if s_t(x_i) < p_i < 1 − s_t(x_i), and \tilde{p}_{t,i} = {p_i, 1 − p_i} otherwise.   (2)

Thus, if p_i = 0.01 ≤ s_t(x_i) then at step t the analyst knows only that p_i is either 0.01 or 0.99, but if s_{t+1}(x_i) < 0.01 then p_i is revealed at step t + 1 as 0.01. Figure 1b illustrates what the analyst can see: each red and blue point from Figure 1a is shown along with its mirror image reflected across the midline p = 0.5.
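To make the masking concrete, the following sketch (illustrative, not code from the paper) computes A_t, R_t, \widehat{FDP}_t, and the partially masked p-values for a given threshold vector s_t(x_i):

```python
import numpy as np

def adapt_step(pvals, s):
    """One AdaPT bookkeeping step: counts and partially masked p-values.

    pvals: array of p-values; s: the threshold s_t(x_i) evaluated at each x_i.
    """
    pvals, s = np.asarray(pvals), np.asarray(s)
    R = int(np.sum(pvals <= s))        # rejections at this threshold
    A = int(np.sum(pvals >= 1 - s))    # mirror counts
    fdp_hat = (1 + A) / max(R, 1)
    # Masked view: p-values in (s, 1-s) are revealed; the rest are only
    # known up to the unordered pair {p, 1-p}.
    masked = [p if s_i < p < 1 - s_i else {p, round(1 - p, 10)}
              for p, s_i in zip(pvals, s)]
    return R, A, fdp_hat, masked

R, A, fdp_hat, masked = adapt_step([0.01, 0.30, 0.55, 0.98], [0.05] * 4)
print(R, A, fdp_hat)  # -> 1 1 2.0
```

Here p_1 = 0.01 appears to the analyst only as the pair {0.01, 0.99}, while p_2 = 0.30 and p_3 = 0.55 lie in the revealed band and are shown unmasked.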
We show in Section 3 that, in a generic two-groups empirical Bayes model, an ideal choice for s_t(x) would be a level surface of the local false discovery rate (fdr), as a function of x and p:

fdr(p | x) = P(H_i is null | p_i = p, x_i = x).

Formally, fdr(p | x) is unidentifiable from the data but, under reasonable assumptions, we can use a good proxy based on the conditional density of the p-value given the covariate, f(p | x) (note however that our method controls FDR without any empirical Bayes assumptions).

In each step information is gradually revealed to the analyst as the threshold shrinks and more p-values are unmasked. Our procedure is adaptive in an unusually strong sense: provided that the two constraints are met, the analyst may apply any method she wants to select s_{t+1}(x), consulting her own hunches or the intuition of domain experts, and can even switch between different methods as information accrues. Moreover, the analyst is under no obligation to describe, or even to fully understand, her update rule for choosing s_{t+1}(x). In this sense, we say our method is fully interactive: the analyst's behavior is arbitrary as long as she abides by a certain protocol for interacting with the algorithm.

While the partial masking of p-values obscures just enough information from the analyst to control the FDR, in many cases it does not seriously impact the ability of the analyst to learn the optimal threshold surface s(x). This is because, by the time the algorithm is close to stopping, the vast majority of p-values have already been revealed, and many of the ones that remain masked are so minuscule as to leave little doubt about whether p_i is large or small.
As we show in numerous simulation and real data experiments in Section 5, the fdr estimates based on masked data typically converge to the full-data estimates well before the algorithm stops.

The AdaPT procedure controls FDR at level α in finite samples provided that the null p-values are uniform, or mirror-conservative as defined in Section 2.1, and independent conditional on the non-null p-values. The proof relies on a pairwise exchangeability argument similar to the argument in Barber and Candès (2015).

Algorithm 1 summarizes the AdaPT procedure, using the generic sub-routine update to represent whatever process the analyst uses to select s_{t+1}(x). Note that s_{t+1}(x) is a random function that is measurable with respect to F_t. Sections 3–4 discuss recommendations for a good update routine. It is worth mentioning that AdaPT reduces to the Barber–Candès method, inspired by Barber and Candès (2015) and proposed by Arias-Castro and Chen (2016), when s_t(x) is a constant function for every t.

Algorithm 1 AdaPT
Input: predictors and p-values (x_i, p_i)_{i ∈ [n]}, initialization s_0, target FDR level α
Procedure:
for t = 0, 1, . . . do
    \widehat{FDP}_t ← (1 + A_t)/(R_t ∨ 1)
    if \widehat{FDP}_t ≤ α then
        Reject {H_i : p_i ≤ s_t(x_i)}; Return s_t
    end if
    s_{t+1} ← update((x_i, \tilde{p}_{t,i})_{i ∈ [n]}, A_t, R_t, s_t)
end for

In recent work Ignatiadis et al. (2016) propose a different method, independent hypothesis weighting (IHW), for multiple testing with side information. They first bin the predictors into groups g_1, ..., g_K, and then apply the weighted-BH procedure at level α with piecewise-constant weights; i.e., if x_i ∈ g_k, then w_i = w(g_k). The weights w(g_1), ..., w(g_K) are chosen to maximize the number of rejections. This proposal is similar in spirit to the AdaPT procedure since it attempts to find optimal weights, but it is a bit more limited: first, binning the data may be difficult if the predictor space X is multivariate or more complex; and second, their method is only guaranteed to control FDR asymptotically, as the number of bins stays fixed and the number of hypotheses in each bin grows to infinity. As a result, we must trust that n is large enough to support however many bins we have chosen to use. By contrast, AdaPT can use any machine-learning method to estimate \hat{f}(p | x), and we can "overfit away" without fear of compromising finite-sample FDR control (though overfitting can of course reduce our power if our fdr estimates are too noisy). Another method is proposed by Du et al. (2014) for the case where the covariate is an auxiliary univariate p-value derived from prior information. However, similar to Ignatiadis et al.
(2016), it only controls FDR asymptotically, under the fairly strong conditions that the p-values are symmetrically distributed under the null and suitably bounded under the alternative.

Perhaps the procedure most closely related to ours is the structure-adaptive BH algorithm, or SABHA (Li and Barber, 2016b). SABHA first censors the p-values below a fixed level τ, replacing each p_i with p_i 1{p_i > τ}. Using these, they can estimate π(x), defined as P(H_i is non-null | x_i = x), as a function of x, then apply the weighted BH procedure of Genovese et al. (2006) with weights \hat{\pi}(x_i)^{-1}, at a corrected FDR level \tilde{α} = Cα (where C < 1). We notice that this type of censoring is also employed in a variant of IHW (Ignatiadis and Huber, 2017), which guarantees FDR control in finite samples.

As the first procedure to provably control the finite-sample FDR using generic feature information, SABHA represents a major step forward. However, AdaPT has several important advantages. First, even if \hat{\pi}(x) estimates π(x) consistently, the weights π(x)^{-1} are not Bayes optimal, as we show in Section 3; by contrast, our method estimates a Bayes-optimal threshold. Second, the correction factor C makes the method conservative and restricts the available estimators \hat{\pi}^{-1} to those with provably low Rademacher complexity. Third, AdaPT can use more information for learning: in later stages we will typically have s_t(x_i) ≪ 0.5, so the partially masked p-values \tilde{p}_{t,i} may be much more informative than p_i 1{p_i > τ}, especially since our goal is to estimate f(p | x) for small values of p.

Finally, we remark that there is a literature on very different approaches for incorporating covariates into multiple testing problems; see e.g. Lewinger et al. (2007); Ferkingstad et al. (2008); Lawyer et al. (2009); Zablocki et al. (2014).
Unlike our method (and IHW and SABHA), these approaches hinge on the correct specification of the model and might lose their statistical guarantees if the proposed model deviates from the ground truth. By contrast, our method (and IHW and SABHA) relies only on the validity of the p-values (see the assumptions of Theorem 1 in the next section) and guarantees FDR control even when employing a misspecified model.

Section 2 defines the AdaPT procedure more formally and gives our main result: if the null p-values are independent and mirror-conservative (defined below), AdaPT controls FDR at level α in finite samples. Section 3 explains why selection of s_{t+1}(x) will typically operate by first estimating the conditional density f(p | x) as a function of x, and Section 4 gives practical suggestions for update rules. Section 5 illustrates the AdaPT procedure's power on five real datasets and two simulated datasets, and Section 6 concludes. The programs to replicate all our experiments can be obtained from https://github.com/lihualei71/adaptPaper/. Our R package adaptMT can be found at https://github.com/lihualei71/adaptMT/.
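Algorithm 1, specialized to a constant threshold s_t(x) ≡ s_t, can be sketched in a few lines. This special case is the Barber–Candès procedure mentioned above; the `update` rule here simply steps down to the next candidate constant threshold. An illustrative sketch, not the paper's implementation:

```python
import numpy as np

def adapt_constant(pvals, alpha=0.1, grid=None):
    """AdaPT with constant thresholds s_t(x) = s_t (the Barber-Candes
    special case).  The update rule steps down through a grid of
    candidate thresholds until the FDP estimate falls below alpha."""
    pvals = np.asarray(pvals)
    if grid is None:
        # candidate constant thresholds, from most to least permissive
        grid = np.sort(np.unique(np.minimum(pvals, 1 - pvals)))[::-1]
    for s in grid:
        R = np.sum(pvals <= s)
        A = np.sum(pvals >= 1 - s)
        if (1 + A) / max(R, 1) <= alpha:
            return np.nonzero(pvals <= s)[0]  # rejection set
    return np.array([], dtype=int)

print(adapt_constant(np.array([0.001, 0.002, 0.003, 0.8]), alpha=0.4))  # -> [0 1 2]
```

The full AdaPT procedure replaces the fixed grid with an arbitrary analyst-chosen `update` acting on the masked data (x_i, \tilde{p}_{t,i}), which is what allows side information to enter.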
2. The AdaPT procedure
Let [n] denote the set {1, ..., n}. For each hypothesis H_i, i ∈ [n], we observe x_i ∈ X and p_i ∈ [0, 1]. Let H ⊆ [n] denote the set of true null hypotheses. We will assume throughout that (p_i)_{i ∈ H} are mutually independent, and independent of (p_i)_{i ∉ H} (see Section 6 for a discussion of how we might relax the independence assumption). Finally, for each i ∈ H, we assume that p_i is either uniform or mirror-conservative in a sense we will define shortly.

Let F_t for t = 0, 1, ... represent the filtration generated by all information available to the user at step t:

F_t = σ((x_i, \tilde{p}_{t,i})_{i=1}^n, A_t, R_t).

We similarly define an initial σ-field with all p-values masked,

F_{-1} = σ((x_i, {p_i, 1 − p_i})_{i=1}^n).

The p-value masking is equivalent to requiring that s_{t+1} ∈ F_t. (For simplicity we have implicitly ruled out the possibility that the analyst uses a randomized rule to update the threshold, but this restriction could be easily removed.) The two constraints s_{t+1} ⪯ s_t and s_{t+1} ∈ F_t ensure that (F_t)_{t = −1, 0, 1, ...} is a filtration; i.e., the information in F_t only grows from t to t + 1:

Lemma 1.
For all t ≥ −1, F_t ⊆ F_{t+1}.

Proof.
We use induction on u to show that F_u ⊆ F_t for any u ≤ t. The conclusion is trivial for u = −1, since {p_i, 1 − p_i} is always computable from \tilde{p}_{t,i} (masked p-values can always be computed from masked or unmasked ones).

For u ≥ 0, note that, by the inductive assumption, s_u ∈ F_{u−1} ⊆ F_t. As a result, we can compute \tilde{p}_{u,i}, which depends only on \tilde{p}_{t,i} and s_u(x_i). Furthermore,

R_u = R_t + |{i : \tilde{p}_{t,i} ∈ (s_t(x_i), s_u(x_i)]}|,   A_u = A_t + |{i : \tilde{p}_{t,i} ∈ [1 − s_u(x_i), 1 − s_t(x_i))}|,

completing the proof.

To avoid trivialities we assume that the analyst always reveals at least one censored p-value in each step of the algorithm, since there is no reason ever to update the threshold surface in a way that reveals no new information. Thus, the stopping time \hat{t} ≤ n almost surely.

In many common settings, null p-values are conservative but not necessarily exactly uniform. For example, p-values from permutation tests are discrete, and p-values for composite null hypotheses are often conservative if the true value of the parameter lies in the interior of the null.

Our method does not require uniformity, but the standard definition of conservatism, namely that P_{H_i}(p_i ≤ a) ≤ a for all 0 ≤ a ≤ 1, is not enough to guarantee FDR control. Instead, we say that a p-value p_i is mirror-conservative if

P_{H_i}(p_i ∈ [a_1, a_2]) ≤ P_{H_i}(p_i ∈ [1 − a_2, 1 − a_1]),   for all 0 ≤ a_1 ≤ a_2 ≤ 0.5.   (3)

If p_i is discrete, (3) means p_i = 1 − a is at least as likely as p_i = a for a ≤ 0.5; if p_i has a continuous density, it means the density is at least as large at 1 − a as at a. Mirror-conservatism is not a consequence of conservatism, nor is conservatism a consequence of mirror-conservatism (take p_i = B with B ∼ Bernoulli(0.5), which is mirror-conservative but not conservative). Any null distribution with an increasing density is evidently both conservative and mirror-conservative.

Permutation p-values are mirror-conservative, as are p-values for one-sided tests of univariate parameters with monotone likelihood ratio (with discrete p-values randomized to be uniform at the boundary between the null and alternative). See Appendix B.1 for proofs of these claims.

We are now prepared to prove our main result: the AdaPT procedure controls FDR in finite samples. The proof relies on a similar optional stopping argument as the one presented in Lei and Fithian (2016) and Barber and Candès (2016) (themselves modifications of arguments in Storey et al. (2004) and Barber and Candès (2015)). Let V_t and U_t denote the numbers of null p_i ≤ s_t(x_i) and null p_i ≥ 1 − s_t(x_i), respectively. If the null p-values are uniform then, no matter how we choose s_t(x) at each step, we will always have V_t ≈ U_t and

\widehat{FDP}_t > U_t/(R_t ∨ 1) ≈ V_t/(R_t ∨ 1) = FDP_t.

Lemma 2.
Suppose that, conditionally on the σ-field G_{-1}, the random variables b_1, ..., b_n are independent Bernoulli random variables with P(b_i = 1 | G_{-1}) = ρ_i ≥ ρ > 0, almost surely. Also suppose that [n] ⊇ C_0 ⊇ C_1 ⊇ ···, with each subset C_{t+1} measurable with respect to

G_t = σ(G_{-1}, C_t, (b_i)_{i ∉ C_t}, Σ_{i ∈ C_t} b_i).

If \hat{t} is an almost-surely finite stopping time with respect to the filtration (G_t)_{t ≥ 0}, then

E[ (1 + |C_{\hat{t}}|) / (1 + Σ_{i ∈ C_{\hat{t}}} b_i) | G_{-1} ] ≤ ρ^{-1}.

Our Lemma 2 generalizes Lemma 1 in Barber and Candès (2016) and uses a very similar technical argument. The proof is given in the appendix. Using Lemma 2, we can give our main result:
Theorem 1.
Assume that the null p-values are independent of each other and of the non-null p-values, and that the null p-values are uniform or mirror-conservative. Then the AdaPT procedure controls the FDR at level α, conditional on F_{-1} and also marginally.

Proof.
Let \hat{t} denote the step at which we stop and reject. Then

FDP_{\hat{t}} = V_{\hat{t}}/(R_{\hat{t}} ∨ 1) = [(1 + U_{\hat{t}})/(R_{\hat{t}} ∨ 1)] · [V_{\hat{t}}/(1 + U_{\hat{t}})] ≤ α · V_{\hat{t}}/(1 + U_{\hat{t}}),

where the last step follows from the stopping condition that \widehat{FDP}_{\hat{t}} ≤ α, and the fact that U_t ≤ A_t. We will finish the proof by establishing that E[V_{\hat{t}}/(1 + U_{\hat{t}})] ≤ 1, using Lemma 2.

Let m_i = min{p_i, 1 − p_i} and b_i = 1{p_i ≥ 0.5}, so p_i = b_i(1 − m_i) + (1 − b_i)m_i. Then knowing b_i and m_i is equivalent to knowing p_i. Let C_t = {i ∈ H : p_i ∉ (s_t(x_i), 1 − s_t(x_i))}, representing the null p-values that are not visible to the analyst at time t. Then,

U_t = Σ_{i ∈ C_t} b_i,   and   V_t = Σ_{i ∈ C_t} (1 − b_i) = |C_t| − U_t.

Further, define the σ-fields

G_{-1} = σ((x_i, m_i)_{i=1}^n, (b_i)_{i ∉ H}),   and   G_t = σ(G_{-1}, C_t, (b_i)_{i ∉ C_t}, U_t).

The assumptions of independence and mirror-conservatism guarantee P(b_i = 1 | G_{-1}) ≥ 0.5 for i ∈ H, with the b_i conditionally independent.

Next, note that F_t ⊆ G_t because p_i ∈ G_t for each p_i ∈ (s_t(x_i), 1 − s_t(x_i)),

A_t = U_t + |{i ∉ H : p_i ≥ 1 − s_t(x_i)}| ∈ G_t,

and R_t ∈ G_t by a similar argument. It follows that \hat{t} = min{t : \widehat{FDP}_t ≤ α} is a stopping time with respect to G_t; furthermore, C_{t+1} ∈ F_t ⊆ G_t by assumption.

As a result, conditional on G_{-1}, we can apply Lemma 2 to obtain

E[FDP | G_{-1}] ≤ α E[ V_{\hat{t}}/(1 + U_{\hat{t}}) | G_{-1} ] = α E[ (1 + |C_{\hat{t}}|)/(1 + U_{\hat{t}}) − 1 | G_{-1} ] ≤ α(2 − 1) = α.

Note that F_{-1} ⊂ G_{-1}. The proof is completed by applying the tower property of conditional expectation.

The main technical point of departure for our method is that the optional stopping argument is not merely a technical device to prove FDR control for a fixed algorithm like the BH, Storey-BH, or Knockoff+ procedures. Instead, we push the optional-stopping argument to its limit, allowing the analyst to interact with the data in a much more flexible and adaptive way. Sections 6.2–6.3 further investigate the connection to knockoffs.
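The heuristic behind the proof, that for uniform nulls \widehat{FDP}_t over-estimates the realized FDP on average, can be checked quickly by simulation. An illustrative sketch with uniform nulls, an assumed Beta(0.1, 1) alternative, and a fixed constant threshold (all choices hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, n_rep = 2000, 0.05, 500
gaps = []
for _ in range(n_rep):
    null = rng.random(n)                    # uniform null p-values
    alt = rng.beta(0.1, 1.0, size=n // 4)   # non-nulls concentrated near zero
    p = np.concatenate([null, alt])
    is_null = np.concatenate([np.ones(n, bool), np.zeros(n // 4, bool)])
    R = np.sum(p <= s)
    A = np.sum(p >= 1 - s)
    fdp_hat = (1 + A) / max(R, 1)           # the AdaPT estimate
    fdp = np.sum(is_null & (p <= s)) / max(R, 1)  # realized FDP (needs oracle labels)
    gaps.append(fdp_hat - fdp)
print(round(float(np.mean(gaps)), 4))  # positive: fdp_hat over-estimates FDP on average
```

The gap is small because A_t also picks up a few non-null p-values near 1, which is exactly the (mild) conservatism the "+1" and the mirror counts introduce.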
3. A guideline for choosing thresholding rules
Although the AdaPT procedure controls FDR no matter how we update the threshold, its power depends on the quality of the updates. This section concerns the question of what thresholds we would choose if we had perfect knowledge of the data-generating distribution, with Section 4 discussing suggestions for learning optimal thresholds from the data. To establish a guideline for the threshold update, we consider a conditional two-groups model as the working model. As we will see, under mild conditions, the Bayes-optimal rejection thresholds are the level surfaces of the local false discovery rate (fdr), defined as the probability that a hypothesis is null conditional on its p-value. The local FDR was first discussed by Efron et al. (2001); see also Efron (2007). A similar result is obtained by Storey (2007) under a different framework.

To begin, we assume a two-groups model conditional on the predictors x_i. Letting H_i = 0 if the i-th null is true and H_i = 1 otherwise, we assume:

H_i | x_i ∼ Bernoulli(π_1(x_i)),
p_i | H_i, x_i ∼ f_0(p | x_i) if H_i = 0, and f_1(p | x_i) if H_i = 1.

In addition, we assume that (x_i, H_i, p_i) are independent for i ∈ [n]. Unless otherwise stated we will assume for simplicity that both f_0 and f_1 are continuous densities, with f_0(p | x) ≡ 1 (null p-values are uniform) and f_1(p | x) non-increasing in p (smaller p-values imply stronger evidence against the null). Furthermore, define the conditional mixture density

f(p | x) = (1 − π_1(x)) f_0(p | x) + π_1(x) f_1(p | x) = 1 − π_1(x) + π_1(x) f_1(p | x),

and the conditional local false discovery rate

fdr(p | x) = P(H_i is null | x_i = x, p_i = p) = (1 − π_1(x)) / f(p | x).

Note that we never observe H_i directly. Thus, while f is identifiable from the data, π_1 and f_1 are not: for example, π_1 = 0.5, f_1(p | x) = 2(1 − p) and π_1 = 1, f_1(p | x) = 1.5 − p result in exactly the same mixture density.
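As a concrete instance of this working model, the sketch below uses the hypothetical choices π_1(x) = 0.3/(1 + e^{-x}) and f_1(p | x) = 2(1 − p) (neither is from the paper), simulates one draw of (x_i, H_i, p_i), and evaluates the resulting local fdr:

```python
import numpy as np

rng = np.random.default_rng(1)

def pi1(x):        # non-null probability, higher for "promising" x (assumed form)
    return 0.3 / (1.0 + np.exp(-x))

def f1(p):         # alternative density f_1(p|x) = 2(1 - p) (assumed form)
    return 2.0 * (1.0 - p)

def fdr(p, x):     # local fdr under the two-groups model with f_0 = 1
    mix = 1.0 - pi1(x) + pi1(x) * f1(p)
    return (1.0 - pi1(x)) / mix

# simulate one draw of (x_i, H_i, p_i); for f1 the inverse CDF is 1 - sqrt(1 - u)
n = 5
x = rng.normal(size=n)
H = rng.random(n) < pi1(x)
p = np.where(H, 1.0 - np.sqrt(rng.random(n)), rng.random(n))
print(round(fdr(0.01, 2.0), 3))  # -> 0.584: small p and promising x give low fdr
```

Note that fdr(p | x) here depends on x only through π_1(x); with a richer f_1(p | x) the level surfaces would bend in both coordinates.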
Unless f_1(p | x) is known a priori, we can make the conservative identifying assumption that

1 − π_1(x) = inf_{p ∈ [0,1]} f(p | x) = f(1 | x),

attributing as many observations as possible to the null hypothesis. This approximation is very good when fdr(1 | x) ≈
1, which is reasonable in many settings. Thus, any estimate \hat{f} of the mixture density translates to a conservative estimate \widehat{fdr}(p | x) = \hat{f}(1 | x) / \hat{f}(p | x).

Let ν be a probability measure on X and define a random variable X ∼ ν. Similar to Sun et al. (2015), for any thresholding rule s(x), we define the global FDR as

FDR(s; ν) = P(H = 0 | H is rejected) = P(H = 0 | P ≤ s(X)),

where H and P are a hypothesis and p-value distributed according to the two-groups model. The power is defined in a similar fashion as

Pow(s; ν) = P(H is rejected | H = 1) = P(P ≤ s(X) | H = 1).

Sun et al. (2015) formulate a compound decision-theoretic framework by defining a Bayesian-type loss function. Instead, we propose a Neyman–Pearson-type framework, i.e.

max_s Pow(s; ν)   s.t.   FDR(s; ν) ≤ α.   (4)

Next, define

Q_0(s) = P(P ≤ s(X), H = 0) = ∫_X F_0(s(x) | x)(1 − π_1(x)) ν(dx),
Q_1(s) = P(P ≤ s(X), H = 1) = ∫_X F_1(s(x) | x) π_1(x) ν(dx),

where F_0 and F_1 are the cumulative distribution functions under the null and alternative. We can simplify (4) as

max_s Q_1(s)/P(H = 1)   s.t.   Q_0(s)/(Q_0(s) + Q_1(s)) ≤ α   (5)

⟺ min_s −Q_1(s)   s.t.   −αQ_1(s) + (1 − α)Q_0(s) ≤ 0

⟺ min_s ∫_X −F_1(s(x) | x) π_1(x) ν(dx)
   s.t. ∫_X { −α F_1(s(x) | x) π_1(x) + (1 − α) F_0(s(x) | x)(1 − π_1(x)) } ν(dx) ≤ 0.   (7)

The corresponding Lagrangian function can be written as

L(s; λ) = ∫_X { −(1 + λα) F_1(s(x) | x) π_1(x) + λ(1 − α) F_0(s(x) | x)(1 − π_1(x)) } ν(dx).   (8)

Let s* be the optimum; then the Karush–Kuhn–Tucker (KKT) conditions (under regularity conditions) imply that

(1 + λα) f_1(s*(x) | x) π_1(x) = λ(1 − α) f_0(s*(x) | x)(1 − π_1(x))
⟹ fdr(s*(x) | x) = (1 + λα)/(1 + λ).
(9)

In other words, the optimal thresholding rules are level surfaces of the local FDR. Theorem 2 formalizes the above derivation by clarifying the regularity conditions.

Theorem 2.
Assume that

(a) f_1(p | x_i) is continuously non-increasing and f_0(p | x_i) is continuously non-decreasing and uniformly bounded away from ∞;

(b) ν is a discrete measure supported on {x_1, ..., x_n} with ν({x_i : fdr(0 | x_i) < α, f_1(0 | x_i) > 1}) > 0.

Then (4) has at least one solution, and all solutions are level surfaces of fdr(p | x).

In practice, any conservative null distribution (stochastically dominated by U([0, 1])) and any non-increasing f_1 are also valid, since smaller p-values imply stronger evidence against the null. In condition (b), the assumption on the support is reasonable since we treat {x_i : i ∈ [n]} as fixed and hence only the quantities associated with these values are of interest. We believe it can be relaxed to more general measures but will not discuss this here due to the technical complications. In contrast, the second requirement is necessary since it implies the feasibility of the problem. If the local FDR is above α almost everywhere, no thresholding rule is able to control FDR at level α. As mentioned above, we can set s_t as the level surfaces of \widehat{fdr}(p | x) = \hat{f}(1 | x)/\hat{f}(p | x) given some estimator \hat{f}(p | x). The next section discusses estimation of \hat{f}(p | x).
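To see what a level-surface threshold looks like numerically, the sketch below assumes the hypothetical forms π_1(x) = 0.3/(1 + e^{-x}) and f_1(p) = 2(1 − p) (not from the paper), and solves fdr(s(x) | x) = c by bisection. Because f_1 is decreasing, fdr(p | x) is increasing in p, so bisection is valid; the threshold is more permissive where π_1(x) is larger, and is zero where even fdr(0 | x) exceeds c:

```python
import numpy as np

def pi1(x):                      # assumed non-null probability
    return 0.3 / (1.0 + np.exp(-x))

def fdr(p, x):                   # local fdr for f_0 = 1, f_1(p) = 2(1 - p)
    mix = 1.0 - pi1(x) + pi1(x) * 2.0 * (1.0 - p)
    return (1.0 - pi1(x)) / mix

def threshold(x, c, lo=0.0, hi=0.5, iters=60):
    """Solve fdr(s(x)|x) = c by bisection; fdr is increasing in p here.
    Thresholds are capped at 0.5, consistent with the masking scheme."""
    if fdr(lo, x) > c:           # infeasible at this x: reject nothing
        return 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fdr(mid, x) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# the level surface is more permissive where pi1(x) is larger
print(threshold(-2.0, 0.8) < threshold(2.0, 0.8))  # -> True
```

In AdaPT one would replace the oracle fdr with the estimate \widehat{fdr}(p | x) = \hat{f}(1 | x)/\hat{f}(p | x) and pick the level c at each step so the threshold shrinks.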
4. Implementation
Having shown that level surfaces of the local FDR are optimal under the two-groups model, we now turn to estimation of fdr(p | x), which boils down to estimation of the conditional density f(p | x). This section discusses a flexible framework for conditional density estimation that can perform favorably when no domain-specific expertise can be brought to bear.

More generally, we should model the data using as much domain-specific expertise as possible. We emphasize once more that, no matter how misspecified our model is, no matter how misguided our priors are (if we use a Bayesian method), no matter how we select a model or tuning parameter, or how much that selection biases our resulting estimate of the local FDR, the AdaPT procedure nevertheless controls the global FDR. Thus, there is every reason to be relatively aggressive in choosing a modeling strategy.

Generically, we can model the conditional density by a parametric family where we assume null p-values are uniformly distributed, i.e. f_0(p | x_i) ≡
1, and each non-null p-value has a density inthe following exponential family, indexed by a univariate parameter η i : f ( p | x i ) = h ( p ; η i ) (cid:44) e η i g ( p ) − B ( η i ) . (10)Note that η i and g ( p ) can be vectors but we focus on the scalar case for simplicity. Let y i = g ( p i ) , µ i = B (cid:48) ( η i ) . (11)Using the standard argument, (10) implies that E η i [ y i ] = E η i [ g ( p i )] = B (cid:48) ( η i ) = µ i , (12)where E η i denotes the expectation under h ( · ; η i ). If g is not almost-everywhere constant, then B (cid:48)(cid:48) ( η ) = Var η ( y i ) > B (cid:48) is bijective. Then there is a one-to-one mapping from µ i to η i ,denoted by η i = η ( µ i ) as convention. In fact, η ( · ) = ( B (cid:48) ) − ( · ). Then (10) can be reparametrizedusing µ i , h ( p ; µ i ) = e η ( µ i ) g ( p ) − A ( µ i ) , (13)where A ( · ) = B ( η ( · )) and we abuse the notation h ( p ; · ). As we will see, it is more convenientto use the mean parametrization (13).Given (13), it is left to model π i (cid:44) π ( x i ) and µ i (or η i equivalently). In this article weconsider the following generalized linear model where φ π ( x ) , φ µ ( x ) denote two featurization and ζ denotes a link function: H i | x i ∼ Bernoulli( π i ) , with log π i − π i = θ (cid:48) φ π ( x i ) , and p i | x i , H i ∼ (cid:26) h ( p ; µ i ) if H i = 11 if H i = 0 , with ζ ( µ i ) = β (cid:48) φ µ ( x i ) . (14)In particular, ζ ( · ) = η ( · ) gives the canonical link function. For instance, when g ( p ) = − log p, η ( µ ) = − µ + 1 and A ( µ ) = log µ , f ( p | x ) = π i h ( p ; µ i ) + (1 − π i ) = π i · µ i p µi − + (1 − π i ) . (15)This yields a beta-mixture model on the conditional density, which has been considered inliterature, e.g. Parker and Rothenberg (1988); Allison et al. (2002); Pounds and Morris (2003);Markitsis and Lai (2010). 
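As a quick sanity check, the mixture (15) is a proper density on $(0, 1]$ for any $\pi_1 \in [0,1]$ and $\mu > 1$. The short Python sketch below (function names and numerical values are ours, not from the paper's implementation) verifies this numerically with a midpoint rule:

```python
import numpy as np

def h(p, mu):
    # Non-null p-value density from (13) with g(p) = -log p:
    # h(p; mu) = (1/mu) * p**(1/mu - 1); mu > 1 concentrates mass near p = 0.
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def f_mix(p, pi1, mu):
    # Two-groups conditional density (15): pi1 * h(p; mu) + (1 - pi1) * 1.
    return pi1 * h(p, mu) + 1.0 - pi1

# Midpoint-rule check that f_mix integrates to 1 over (0, 1].
grid = (np.arange(10**6) + 0.5) / 10**6
total = np.mean(f_mix(grid, 0.3, 2.0))
print(total)  # ~ 1.0 (the integrand has an integrable singularity at 0)
```

The same `h` function reappears throughout the procedure, in the E-step posteriors and in the local FDR estimates of Section 4.3.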
The fully observed log-likelihood for the model (14) is

$$\ell(\theta, \beta; p, H, x) = \sum_{i=1}^n \Big\{ H_i \theta' \phi_\pi(x_i) - \log\big(1 + e^{\theta' \phi_\pi(x_i)}\big) \Big\} + \sum_{i=1}^n H_i \Big\{ y_i \cdot \eta \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) - A \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) \Big\}. \qquad (16)$$

Because some values of $y_i$ and all values of $H_i$ are unknown, we use the expectation-maximization (EM) algorithm to maximize the partially observed log-likelihood. To simplify estimation, we proceed as though $A_t$ and $R_t$ were missing, so that the $(y_i, H_i)$ pairs are mutually independent given the predictors. That is, at step $t$ of the AdaPT procedure we attempt to maximize the likelihood of the data $D_t = (x_i, \tilde p_{t,i})_{i \in [n]}$, treating $s_t$ as fixed.

Recall that $b_i = I(p_i \geq 0.5)$. The missing data are the pairs $(b_i, H_i)$, with each pair conditionally independent given $D_t$, and with probabilities that can be computed efficiently for any values of $\theta$ and $\beta$. Let $r = 0, 1, \dots$ index the stages of the EM algorithm (recall that $t$ is fixed for the duration of the EM algorithm). In the E-step we compute the expected log-likelihood $\mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[\ell(\theta, \beta; y, H, x) \mid D_t]$, which amounts to computing the quantities

$$\widehat{H}_i^{(r)} = \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[H_i \mid D_t], \quad \text{and} \qquad (17)$$
$$\hat{y}_i^{(r)} = \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[y_i H_i \mid D_t] \,/\, \widehat{H}_i^{(r)} = \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[y_i \mid D_t, H_i = 1], \qquad (18)$$

where $\hat\theta^{(r)}$ and $\hat\beta^{(r)}$ denote the current coefficient estimates. We derive exact formulae for (17) and (18) in Appendix A.1. In the M-step, we set

$$(\hat\theta^{(r)}, \hat\beta^{(r)}) = \arg\max_{\theta, \beta}\ \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[\ell(\theta, \beta; y, H, x) \mid D_t]$$
$$= \arg\max_{\theta, \beta} \sum_{i=1}^n \Big\{ \widehat{H}_i^{(r)} \theta' \phi_\pi(x_i) - \log\big(1 + e^{\theta' \phi_\pi(x_i)}\big) \Big\} + \sum_{i=1}^n \widehat{H}_i^{(r)} \Big( \hat{y}_i^{(r)} \cdot \eta \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) - A \circ \zeta^{-1}(\beta' \phi_\mu(x_i)) \Big). \qquad (19)$$

The optimization above splits into two separate problems: a logistic regression with predictors $\phi_\pi(x_i)$ and fractional responses $\widehat{H}_i^{(r)}$, and a GLM with predictors $\phi_\mu(x_i)$, responses $\hat{y}_i^{(r)}$, and weights $\widehat{H}_i^{(r)}$. Each of these GLM problems can be solved efficiently using the glm function in R (e.g. Dobson and Barnett, 2008). For $r = 0$, we can initialize $\hat\theta^{(0)}$ and $\hat\beta^{(0)}$ by a simple method detailed in Appendix A.2. Algorithm 2 formalizes the EM algorithm using R pseudocode. The family argument for estimating $\hat\beta^{(r)}$ depends on the form of the exponential family (13); for example, (19) yields a Gamma GLM in the beta-mixture model (15).

The GLM (14) provides the starting point for an extremely flexible and extensible modeling framework. More generally, we could replace the fitting procedure in the M-step by a penalized GLM (glmnet package), a generalized additive model (gam or mgcv package), or generalized boosting regression (gbm package). Furthermore, noting that

$$\pi_1(x) = \mathbb{E}[H \mid x], \qquad \mu(x) = \mathbb{E}[y \mid x, H = 1],$$

one can even fit these quantities directly using any nonparametric method that targets a conditional mean, such as random forests or neural networks.
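For concreteness, one EM iteration under the beta-mixture (15) can be sketched in Python: the E-step computes (17)-(18) in closed form by averaging over both elements of each masked pair, and the logistic block of the M-step is solved by IRLS as a stand-in for R's glm. This is our own illustrative sketch, not the paper's implementation; the Gamma block of the M-step is omitted, and all simulated quantities are made up:

```python
import numpy as np

def h(p, mu):
    # Non-null beta-mixture density h(p; mu) = (1/mu) p^(1/mu - 1) from (15).
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def e_step(p_min, masked, pi1, mu):
    """Closed-form E-step (17)-(18): posterior P(H_i = 1) and E[y_i | H_i = 1].

    p_min  : min(p_i, 1 - p_i); masked : True where only {p_i, 1 - p_i} is seen.
    For masked hypotheses the posterior averages over both candidate p-values;
    under the null, each candidate has density 1."""
    ha, hb = h(p_min, mu), h(1.0 - p_min, mu)
    H_m = pi1 * (ha + hb) / (pi1 * (ha + hb) + 2.0 * (1.0 - pi1))
    y_m = (ha * -np.log(p_min) + hb * -np.log(1.0 - p_min)) / (ha + hb)
    H_r = pi1 * ha / (pi1 * ha + 1.0 - pi1)   # revealed: usual two-groups posterior
    y_r = -np.log(p_min)                      # (sketch: assumes revealed p < 1/2)
    return np.where(masked, H_m, H_r), np.where(masked, y_m, y_r)

def logistic_irls(X, Hhat, n_iter=25):
    """M-step, first block of (19): logistic regression with fractional
    responses Hhat in (0, 1), fit by iteratively reweighted least squares."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ theta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = prob * (1.0 - prob) + 1e-10       # IRLS weights
        z = eta + (Hhat - prob) / w           # working response
        theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return theta

# One demo iteration on synthetic data, with (pi1, mu) fixed at plausible values.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
p = np.where(rng.random(500) < 1 / (1 + np.exp(-x)),  # non-null likelier at large x
             rng.random(500) ** 2.0,                  # alternative: h(p; mu = 2)
             rng.random(500))                         # null: uniform
p_min = np.minimum(p, 1 - p)
masked = p_min < 0.45                                 # AdaPT-style masking at s = 0.45
H_hat, y_hat = e_step(p_min, masked, pi1=0.5, mu=2.0)
theta = logistic_irls(np.column_stack([np.ones_like(x), x]), H_hat)
print(theta[1] > 0)  # True: the fitted slope picks up that non-nulls sit at large x
```

In the real procedure these two fits would alternate with fresh E-steps for $m$ iterations, exactly as Algorithm 2 below prescribes with R's glm.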
Algorithm 2
EM algorithm to estimate $\pi_1(\cdot)$ and $\mu(\cdot)$ based on $D_t = (x_i, \tilde p_{t,i})_{i \in [n]}$.

Input: data $D_t$, number of iterations $m$, initialization $\hat\theta^{(0)}, \hat\beta^{(0)}$;
for $r = 1, 2, \dots, m$ do
  (E-step):
    $\widehat{H}_i^{(r)} \leftarrow \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[H_i \mid D_t]$, $i \in [n]$;
    $\hat{y}_i^{(r)} \leftarrow \mathbb{E}_{\hat\theta^{(r-1)}, \hat\beta^{(r-1)}}[y_i H_i \mid D_t] / \widehat{H}_i^{(r)}$, $i \in [n]$;
  (M-step):
    $\hat\theta^{(r)} \leftarrow$ glm$(\widehat{H}^{(r)} \sim \phi_\pi(x)$, family = binomial$)$;
    $\hat\beta^{(r)} \leftarrow$ glm$(\hat{y}^{(r)} \sim \phi_\mu(x)$, family = $\dots$ (link = $\zeta$), weights = $\widehat{H}^{(r)})$;
end for
Output: $\hat\pi_1(x) = \big(1 + e^{-\phi_\pi(x)' \hat\theta^{(m)}}\big)^{-1}$, $\hat\mu(x) = \zeta^{-1}\big(\phi_\mu(x)' \hat\beta^{(m)}\big)$.

Suppose we are given a finite set of candidate featurizations $\{(\phi_{\pi,j}(x), \phi_{\mu,j}(x)) : j = 1, \dots, M\}$. For instance, for univariate $x$, $\phi_{\pi,j}(x)$ and $\phi_{\mu,j}(x)$ could be spline bases with varying numbers of equi-spaced knots; for multivariate $x$, they could be subsets of the covariates contained in $x$. At step $t$, one is permitted to fit a model for each featurization, using arbitrary methods (e.g., GLM, penalized GLM, etc.), based on $((\phi_{\pi,j}(x_i), \phi_{\mu,j}(x_i), \tilde p_{t,i}))_{i=1}^n$. Let $\hat\pi_1^{(j)} = (\hat\pi_{11}^{(j)}, \dots, \hat\pi_{1n}^{(j)})$ and $\hat\mu^{(j)} = (\hat\mu_1^{(j)}, \dots, \hat\mu_n^{(j)})$ denote the resulting fitted values. The full log-likelihood for the GLM (14) based on $(\phi_{\pi,j}(x), \phi_{\mu,j}(x))$, assuming $H_i$ were known, can be written as

$$\ell_j(\pi_1, \mu) = \sum_{i=1}^n \big( H_i \log \pi_{1i}^{(j)} + (1 - H_i) \log(1 - \pi_{1i}^{(j)}) \big) + \sum_{i=1}^n H_i \log h(p_i; \mu_i^{(j)}).$$

Though $\ell_j$ is not computable, we can replace it by

$$\tilde{\ell}_j \triangleq \mathbb{E}_{\hat\pi_1^{(j)}, \hat\mu^{(j)}}[\ell_j(\pi_1, \mu)].$$

This is precisely the objective of the M-step, and hence comes directly from the EM algorithm. Based on $\{\tilde{\ell}_j\}_{j=1}^M$, we can use any information criterion for featurization selection.
Our implementation uses BIC by default, defined as

$$\mathrm{BIC}_j = \log n \cdot (\mathrm{df}_{\pi,j} + \mathrm{df}_{\mu,j}) - 2\tilde{\ell}_j,$$

where $\mathrm{df}_{\pi,j}$ (resp. $\mathrm{df}_{\mu,j}$) is the degrees of freedom of $\phi_{\pi,j}$ (resp. $\phi_{\mu,j}$). For instance, $\mathrm{df}_{\pi,j}$ is the number of knots plus 1 (for the intercept) when $\phi_{\pi,j}$ is a spline basis, and the number of selected covariates plus 1 (for the intercept) when $\phi_{\pi,j}$ is a sparse subset of $x$.

Alternatively, the user can apply cross-validation to select the featurization. Specifically, at step $t$ the data are divided into $K$ folds. For the $k$-th fold, the expected log-likelihood $\tilde{\ell}_{jk}$ is computed by taking the $k$-th fold as the holdout set and fitting the parameters on the other folds. The selection is then based on $\tilde{\ell}_j = \sum_{k=1}^K \tilde{\ell}_{jk}$.

We emphasize that any of the above selection procedures can be performed at any intermediate step of AdaPT. If the featurization selection can be computed efficiently, we suggest applying it at every step; otherwise we suggest performing it only at the first step, in which $s_1(x) = s_0(x)$, and keeping the selected featurization for all later steps.

Theorem 2 suggests that our updated threshold $s_{t+1}$ should approximate a level surface of $\widehat{\mathrm{fdr}}(p \mid x)$. For the model (14), level surfaces of the local FDR are given by

$$c = \frac{\hat f(1 \mid x)}{\hat f(s(x) \mid x)} = \frac{\pi_1(x) h(1; \mu(x)) + 1 - \pi_1(x)}{\pi_1(x) h(s(x); \mu(x)) + 1 - \pi_1(x)}. \qquad (20)$$

For various widely used exponential families of the form (13), $h(p; \mu)$ is decreasing in $p$, in which case

$$s(x; c) = h^{-1}\left( \frac{h(1; \mu(x))}{c} + \frac{1 - \pi_1(x)}{\pi_1(x)} \cdot \frac{1 - c}{c};\ \mu(x) \right). \qquad (21)$$

Given a chosen local FDR level $c$, we can evolve $s_t$ by

$$s_{t+1}(x) = \min\{ s_t(x),\ s(x; c) \}, \qquad (22)$$

where the minimum is taken to meet the requirement that $s_{t+1}(x) \leq s_t(x)$. Note that a higher level surface (larger $c$) will typically give a higher $\widehat{\mathrm{FDP}}_t$, and vice versa.
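Under the beta-mixture (15), $h(p; \mu) = (1/\mu) p^{1/\mu - 1}$ is decreasing for $\mu > 1$ and invertible in closed form, so (21) can be evaluated directly. A Python sketch (helper names and numerical values are ours):

```python
import numpy as np

def h(p, mu):
    # Beta-mixture non-null density from (15): h(p; mu) = (1/mu) p^(1/mu - 1).
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def h_inv(v, mu):
    # Closed-form inverse of p -> h(p; mu) for mu > 1, where h is decreasing.
    return (mu * v) ** (mu / (1.0 - mu))

def threshold(pi1, mu, c):
    # Eq. (21): the p-value threshold lying on the level-c surface of the local FDR.
    return h_inv(h(1.0, mu) / c + (1.0 - pi1) / pi1 * (1.0 - c) / c, mu)

# Round trip: the local FDR f(1|x)/f(s|x) at s = threshold(pi1, mu, c) equals c.
pi1, mu, c = 0.6, 3.0, 0.2
s = threshold(pi1, mu, c)
lfdr_at_s = (pi1 * h(1.0, mu) + 1 - pi1) / (pi1 * h(s, mu) + 1 - pi1)
print(round(s, 4), round(lfdr_at_s, 6))  # 0.0213 0.2
```

The round trip confirms that (21) is the exact inverse of (20) for this family, so thresholding at $s(x; c)$ rejects precisely the p-values with estimated local FDR below $c$.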
Unless computational efficiency is at a premium, it is better to force the procedure to be patient, since more information accrues after each update and the learning steps become more accurate. In other words, we should choose $c$ large enough that $s_{t+1}(x)$ deviates from $s_t(x)$ only slightly.

In this article we propose a simple procedure to achieve this: choose $c$ so that exactly one partially masked p-value is revealed by the $s_{t+1}(x)$ defined in (22). This choice of $c$ can be computed as follows.

(a) Estimate the local FDR for each $p'_{t,i}$ as

$$\widehat{\mathrm{lfdr}}_{t,i} = \frac{\hat f(1 \mid x_i)}{\hat f(p'_{t,i} \mid x_i)} = \frac{\hat\pi_i \cdot h(1; \hat\mu_i) + 1 - \hat\pi_i}{\hat\pi_i \cdot h(p'_{t,i}; \hat\mu_i) + 1 - \hat\pi_i}, \qquad (23)$$

where $p'_{t,i}$ is the minimum element of $\tilde p_{t,i}$ (i.e., $\tilde p_{t,i} = p'_{t,i}$ for revealed p-values and $\tilde p_{t,i} = \{p'_{t,i}, 1 - p'_{t,i}\}$ for masked p-values).

(b) Set $c$ to the largest value of $\widehat{\mathrm{lfdr}}_{t,i}$ among all partially masked p-values. (Strictly speaking, $c$ should be slightly smaller than $\max_i \widehat{\mathrm{lfdr}}_{t,i}$; in our implementation we subtract a negligible constant from it.)

As a consequence, this choice of $c$ is measurable with respect to $\mathcal{F}_t$ and hence a permissible operation in AdaPT.

Initial thresholds.
As shown in Algorithm 1, AdaPT starts from some curve $s_0(x)$ and then slowly updates it. If the hypotheses are not ordered, we can simply set $s_0(x) \equiv s_0$ with $s_0 \leq 0.5$. A larger $s_0$ is conceptually preferable, since the procedure is then more patient. We found that $s_0 = 0.45$ is a consistently good choice.
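Returning to the threshold update, the single-reveal choice of $c$ in steps (a) and (b) above can be checked numerically for the beta-mixture. In the Python sketch below, the fitted values $\hat\pi_i$, $\hat\mu_i$ are synthetic placeholders (not from any real fit):

```python
import numpy as np

def h(p, mu):
    return (1.0 / mu) * p ** (1.0 / mu - 1.0)

def h_inv(v, mu):
    # Inverse of p -> h(p; mu), valid for mu > 1 where h is decreasing.
    return (mu * v) ** (mu / (1.0 - mu))

def lfdr(p, pi1, mu):
    # Estimated local FDR, eq. (23).
    return (pi1 * h(1.0, mu) + 1.0 - pi1) / (pi1 * h(p, mu) + 1.0 - pi1)

def threshold(pi1, mu, c):
    # Level-c surface of the local FDR, eq. (21).
    return h_inv(h(1.0, mu) / c + (1.0 - pi1) / pi1 * (1.0 - c) / c, mu)

rng = np.random.default_rng(1)
n = 200
p_min = rng.uniform(0.0, 0.45, n)       # p'_{t,i}; here every hypothesis starts masked
pi1 = rng.uniform(0.2, 0.8, n)          # fitted pi-hat_i (made up for the sketch)
mu = rng.uniform(1.5, 4.0, n)           # fitted mu-hat_i (made up for the sketch)
s_t = np.full(n, 0.45)                  # current threshold s_t(x_i)

c = lfdr(p_min, pi1, mu).max() - 1e-10           # step (b): just below the maximum
s_next = np.minimum(s_t, threshold(pi1, mu, c))  # eq. (22)
print(int(np.sum(p_min > s_next)))  # 1: exactly one p-value is revealed
```

Because $\widehat{\mathrm{lfdr}}_{t,i}$ is increasing in $p'_{t,i}$, dropping the threshold just below the largest masked local FDR unmasks precisely that one hypothesis, which is what makes the update maximally patient.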
Computational efficiency.
The model update (Algorithm 2) is the most computationally costly component. To save computation, we recommend not updating the model at every step; in our implementation, the default is to update the model only once every $\lceil n/K \rceil$ steps, for a fixed constant $K$.

$q$-values. Rather than specifying $\alpha$ in advance, some researchers might prefer to see a list of discoveries for each of a range of $\alpha$ values. Rather than returning a single list for a single $\alpha$, we can alternatively run the algorithm once and output $q$-values for every hypothesis (Storey, 2002; Storey and Tibshirani, 2003), defined as the minimum value of $\alpha$ for which the hypothesis would be rejected.

Let $\hat{t}_\alpha = \min\{t : \widehat{\mathrm{FDP}}_t \leq \alpha\}$ and $t^*_i = \min\{t : s_t(x_i) < p_i < 1 - s_t(x_i)\}$, the time at which $p_i$ is revealed. We then see that

$$H_i \text{ rejected at level } \alpha \iff p_i \leq s_{\hat{t}_\alpha}(x_i) \iff \hat{t}_\alpha < t^*_i \iff \min_{t < t^*_i} \widehat{\mathrm{FDP}}_t \leq \alpha,$$

so the $q$-value of $H_i$ is $\min_{t < t^*_i} \widehat{\mathrm{FDP}}_t$.

5. Experiments

To illustrate the power of the AdaPT procedure, we apply it to the GEOquery gene-dosage data (Davis and Meltzer, 2007), which has been analyzed repeatedly as a benchmark for ordered testing procedures (Li and Barber, 2016a; Lei and Fithian, 2016; Li and Barber, 2016b). We use Algorithm 2 with a beta-mixture model (15) for the E-step (see Appendix A.1.1 for details) and a Gamma GLM with the canonical link function for the M-step. This dataset consists of gene expression measurements for $n = 22283$ genes, in response to estrogen treatments in breast cancer cells for five groups of patients, with different dosage levels and 5 trials in each. The task is to identify the genes responding to a low dosage. The p-value $p_i$ for gene $i$ is obtained from a one-sided permutation test which evaluates evidence for a change in gene expression level between the control group (placebo) and the low-dose group.
The p-values $\{p_i : i \in [n]\}$ are then ordered according to permutation $t$-statistics comparing the control and low-dose data, pooled, against data from a higher dosage (with genes that appear to have a strong response at higher dosages placed earlier in the list). We consider two orderings: first, a stronger (more informative) ordering based on a comparison to the highest dosage; and second, a weaker (less informative) ordering based on a comparison to a medium dosage. Let $\sigma_S(i)$ and $\sigma_W(i)$ denote, respectively, the permutations of $i \in [n]$ given by the stronger and weaker orderings. Further details on these two orderings can be found in Li and Barber (2016a) and Li and Barber (2016b). We write the p-values, thus reordered, as $p^S_i = p_{\sigma_S(i)}$ and $p^W_i = p_{\sigma_W(i)}$. Once the data are reordered, we can apply either a method that ignores the ordering altogether, an ordered testing procedure, or a testing procedure that uses generic side information, using the index of the reordered p-values as a univariate predictor.

We compare AdaPT against twelve other methods:

(a) SeqStep with parameter $C = 2$ (Barber and Candès, 2015);
(b) ForwardStop (G'Sell et al., 2016);
(c) the accumulation test with the HingeExp function and parameter $C = 2$ (Li and Barber, 2016a);
(d) Adaptive SeqStep with $s = q$ and $\lambda = 1 - q$ (Lei and Fithian, 2016);
(e) the BH procedure (Benjamini and Hochberg, 1995);
(f) Storey's BH procedure with threshold $\lambda = 0.5$ (Storey, 2002);
(g) the Barber-Candès procedure (Barber and Candès, 2015);
(h) SABHA with step-function weights, with $\tau = 0.5$, $\epsilon = 0.1$ and weights restricted to $\{\epsilon, 1\}$ (see Section 4.1 of Li and Barber (2016b));
(i) SABHA with ordered weights, with $\tau = 0.5$, $\epsilon = 0.1$ and weights restricted to $[\epsilon, 1]$ (see Section 4.1 of Li and Barber (2016b));
(j) Independent Hypothesis Weighting (IHW) with the numbers of bins and folds set to their defaults (Ignatiadis et al., 2016);
(k) an oracle version of IHW with the number of bins determined by maximizing the number of rejections;
(l) an oracle version of Independent Filtering (IF) with the cutoff determined by maximizing the number of rejections (Bourgon et al., 2010).
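For reference, baseline (e), the BH step-up procedure, fits in a few lines. The sketch below is ours, not the implementation used in the experiments:

```python
import numpy as np

def bh(pvals, alpha):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where
    k = max{i : p_(i) <= alpha * i / n}, and k = 0 if no such i exists."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    below = np.nonzero(np.sort(p) <= alpha * np.arange(1, n + 1) / n)[0]
    k = 0 if below.size == 0 else below[-1] + 1
    return np.sort(order[:k])          # indices of rejected hypotheses

print(bh([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9], 0.05))
# -> [0 1]: only the two smallest p-values clear their step-up thresholds
```

The adaptive and weighted baselines (f)-(l) all modify either the thresholds or the p-values entering this step-up comparison.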
Note that the last two methods do not guarantee FDR control, because the optimal parameter is selected; and both versions of SABHA control FDR at a slightly inflated level (Lemma 1 of Li and Barber (2016b)) when the target level is $\alpha$. Despite the potential anti-conservativeness of these methods, we do not apply any correction, in order to compare their best possible performance to AdaPT. Figure 2 shows the number of discoveries at different target FDR levels; we show the range of $\alpha$ from 0.01 to 0.30. For AdaPT, spline bases are used for $\pi_1(x)$ and $\mu(x)$, selected from among all spline bases with 6-15 equi-quantile knots via the BIC criterion at the initial step and kept the same afterwards; see Section 4.2.

Fig. 2: Number of discoveries, in the gene/drug response dataset, by each method (SeqStep, HingeExp, ForwardStop, Adaptive SeqStep, BH, Storey-BH, Barber-Candès, SABHA (step), SABHA (ordered), IHW, IHW (oracle), IF (oracle), and AdaPT) at a range of target FDR levels $\alpha$ from 0.01 to 0.30. Each panel plots the results for one ordering, ranging from random to highly informative.

Fig. 3: Results for the gene/drug response data with the moderately informative ordering of p-values $\{p^W_i\}$, with $\alpha = 0.05$ (left) and $\alpha = 0.1$ (right): (top) the rejection thresholds $s(x)$; (bottom) contour plots of the estimated local FDR.

The right two panels of Figure 2 correspond to the weaker and the stronger orderings, and show that AdaPT significantly outperforms all other methods at all target FDR levels.
One might wonder whether the power gain is driven by overfitting. To check this, we also apply AdaPT, as well as all other methods, to the same set of p-values with a random ordering. We repeat this using 100 random seeds and report the average number of rejections in the left panel of Figure 2. In this case, the number of rejections drops dramatically, and the power is almost the same as that of the Barber-Candès method, the non-adaptive version of AdaPT. This provides strong evidence against overfitting.

To illustrate how AdaPT exploits the covariate to improve power, we plot the thresholding rules and estimated signal strength for the p-values with the moderately informative ordering and with the highly informative ordering in Figures 3 and 4, respectively. The bottom panels show that the evidence for being non-null has an obvious decreasing trend when the ordering is used. Moreover, the highly informative ordering indeed sorts the p-values better than the moderately informative one: for the former the thresholding rule is fairly monotone, while for the latter it has a small bump.

Fig. 4: Results for the gene/drug response data with the highly informative ordering of p-values $\{p^S_i\}$, with $\alpha = 0.05$ (left) and $\alpha = 0.1$ (right): (top) the rejection thresholds $s(x)$; (bottom) contour plots of the estimated local FDR.

Finally, we measure the information loss caused by partial masking. We first estimate the local FDR using the set of unmasked p-values and the covariates, denoted by $\mathrm{lfdr}^*(x)$; this can be regarded as the best possible estimate given the algorithm. Let $\mathrm{lfdr}_t(x)$ denote the estimate of the local FDR at step $t$ (based on partially masked p-values).
We then measure the information loss by the correlation between $\{\mathrm{lfdr}^*(x_i)\}_{i=1}^n$ and $\{\mathrm{lfdr}_t(x_i)\}_{i=1}^n$. The results are shown in Figure 5, where the x-axis corresponds to the target FDR, in reverse order from 0.3 to 0.01, and the y-axis to the correlation at the step where $\widehat{\mathrm{FDP}}$ first drops below the target FDR. As expected from the discussion in Subsection 1.3, the information loss is quite small, and even negligible once the target FDR drops into the "practical" regime (e.g. below 0.1), where the correlation between $\{\mathrm{lfdr}^*(x_i)\}_{i=1}^n$ and $\{\mathrm{lfdr}_t(x_i)\}_{i=1}^n$ is almost 1. The pattern is even more pronounced in the other data examples in the next subsection. This provides strong evidence that AdaPT allows efficient data exploration with comparatively little information loss.

In summary, these plots show the strong data adaptivity of AdaPT, which can learn the local structure of the data while controlling FDR. Moreover, it provides a quantitative way, via the estimated signal strength, to evaluate the quality of an ordering, which is the major concern in ordered testing problems (Li and Barber, 2016a; Lei and Fithian, 2016; Li and Barber, 2016b).

Fig. 5: Correlation of $\{\mathrm{lfdr}^*(x_i)\}_{i=1}^n$ and $\{\mathrm{lfdr}_t(x_i)\}_{i=1}^n$ for the gene/drug-response dosage dataset under the original, moderately informative, and highly informative orderings. The x-axis corresponds to the target FDR, in reverse order from 0.3 to 0.01, and the y-axis to the correlation at the step where $\widehat{\mathrm{FDP}}$ first drops below the target FDR.

• Example 1: a two-dimensional case. We generate the covariates $x_i$ from an equi-spaced $50 \times 50$ grid in a square region, and p-values i.i.d. from a one-sided normal test, i.e.
$$p_i = 1 - \Phi(z_i), \quad \text{with } z_i \sim N(\mu, 1), \qquad (24)$$

where $\Phi$ is the CDF of $N(0, 1)$. For $i \in \mathcal{H}$ we set $\mu = 0$, and for $i \notin \mathcal{H}$ we set $\mu = 2$. Figure 6 below shows the three types of $\mathcal{H}$ that we conduct tests on.

Fig. 6: The panels display the underlying ground truth for the three cases in Example 1 ("Circle in the middle", "Circle in the corner", and "Thin ellipse"). Each point represents a hypothesis (2500 in total), with gray points being nulls and black points being non-nulls.

In this case, it is not clear how to apply non-adaptive ordered testing procedures or Independent Filtering. Thus we compare AdaPT only with Storey's BH method, the Barber-Candès method, IHW using the default automatic parameter-tuning procedure, and SABHA using two-dimensional low-total-variation weights (see Section 4.3 of Li and Barber (2016b)). For AdaPT, we fit two-dimensional generalized additive models in the M-step, using the R package mgcv with the knots selected automatically in every step by the GCV criterion. For each procedure and a given level $\alpha$, let $\mathcal{R}_\alpha$ be the set of rejected hypotheses with a target FDR level $\alpha$. Then we calculate the FDP and the power as

$$\mathrm{FDP}(\alpha) = \frac{|\mathcal{R}_\alpha \cap \mathcal{H}|}{|\mathcal{R}_\alpha|}, \qquad \mathrm{power}(\alpha) = \frac{|\mathcal{R}_\alpha \cap \mathcal{H}^c|}{|\mathcal{H}^c|}. \qquad (25)$$

We repeat the above procedure on 100 freshly simulated datasets and report the averages of $\mathrm{FDP}(\alpha)$ and $\mathrm{power}(\alpha)$ as measures of FDR and power. The results are shown in Figure 7. It is clearly seen that AdaPT controls FDR, as the other methods do, while achieving significantly higher power.

To see why AdaPT gains power, we plot the estimated local FDR in Figure 8 for the first case, at the initial step, the step where $\widehat{\mathrm{FDP}}$ first falls below 0.3, and the step where $\widehat{\mathrm{FDP}}$ first falls below 0.1.
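The data-generating process (24) and the metrics (25) are straightforward to reproduce. Below is a Python sketch for the "circle in the middle" case, with illustrative grid extent and circle radius (those exact values are ours, not the paper's), and with BH as a stand-in analysis since AdaPT's full interactive loop is beyond a snippet:

```python
import numpy as np
from math import erf

# Ground truth: 50 x 50 grid of covariates; non-nulls (mu = 2) inside a circle
# ("circle in the middle"; grid extent and radius here are illustrative choices).
gx, gy = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
x = np.column_stack([gx.ravel(), gy.ravel()])
nonnull = (x ** 2).sum(axis=1) < 0.3 ** 2
mu = np.where(nonnull, 2.0, 0.0)

# One-sided normal p-values, eq. (24): p_i = 1 - Phi(z_i), z_i ~ N(mu, 1).
rng = np.random.default_rng(0)
z = rng.normal(mu, 1.0)
p = np.array([1.0 - 0.5 * (1.0 + erf(zi / np.sqrt(2.0))) for zi in z])

def bh(pvals, alpha):
    # BH step-up, used here only as a stand-in analysis for the sketch.
    n = len(pvals)
    below = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, n + 1) / n)[0]
    k = 0 if below.size == 0 else below[-1] + 1
    return np.argsort(pvals)[:k]

def fdp_power(rejected, nonnull):
    # Eq. (25): H is the null set, so false discoveries are rejected nulls.
    R = len(rejected)
    fdp = 0.0 if R == 0 else float(np.sum(~nonnull[rejected])) / R
    power = float(np.sum(nonnull[rejected])) / nonnull.sum()
    return fdp, power

rej = bh(p, alpha=0.1)
fdp, power = fdp_power(rej, nonnull)
print(f"{len(rej)} rejections, FDP = {fdp:.2f}, power = {power:.2f}")
```

Averaging `fdp` and `power` over 100 fresh draws reproduces the evaluation protocol used for Figure 7.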
As in the real-data examples, the fitted local FDR identifies the non-nulls quite accurately even at the initial step, where most p-values are partially masked. The estimates become very stable and informative after reaching the practical regime of $\alpha$'s.

Fig. 7: FDR (top row) and power (bottom row) for the three cases of Example 1 ("Circle in the middle", "Circle in the corner", "Thin ellipse") over a grid of target FDR levels $\alpha$, for BH, Storey-BH, Barber-Candès, SABHA, IHW, and AdaPT.
Fig. 8: Estimated local FDR in the first case of Example 1, at the initial step (left) and at the target FDR levels α = 0.3 (middle) and α = 0.1 (right).

• Example 2: a 100-dimensional case

We generate x_i ∈ R^d with d = 100 and {x_ij : i ∈ [n], j ∈ [d]} i.i.d. ~ U([0, 1]). Then we generate p-values from the varying-coefficient two-groups beta-mixture model (15), with π_i and µ_i specified by a logistic model and a truncated linear model, respectively:

log( π_i / (1 − π_i) ) = θ_0 + x_i^T θ,   µ_i = max{ x_i^T β, 1 },   β, θ ∈ R^d.

We choose θ and β as highly sparse vectors with only two non-zero entries each: θ = (3, 3, 0, ..., 0)^T, β = (2, 2, 0, ..., 0)^T, and θ_0 is chosen so that (1/n) Σ_{i=1}^n π_i = 0.3. In this case, E(−log p_i) = µ_i under the alternative. Figure 9 shows the histograms of the π_i's and µ_i's.

Fig. 9: Distributions of π_i's and µ_i's in Example 2.

In this case, it is not clear how to apply non-adaptive ordered testing procedures, Independent Filtering, or adaptive procedures like IHW and SABHA. Thus we compare AdaPT only with the BH method, Storey's BH method, and the Barber-Candès method. For AdaPT, we fit L1-regularized GLMs in the M-step (see Appendix A for details), using the R package glmnet with the penalty level selected automatically at every step by cross-validation. We also run AdaPT by fitting an "oracle" GLM in the M-steps in which only the first two covariates are involved.

As in Example 1, we estimate the FDR and the power using 100 replications. The results are plotted in Figure 10. Both versions of AdaPT control the FDR, like the other methods, while achieving higher power. Not surprisingly, compared to AdaPT with L1-regularized GLMs, AdaPT with "oracle" GLMs has higher power. Nevertheless, this example shows the ability of AdaPT to improve power by squeezing information out of a large set of noisy features.

Fig. 10: FDR and power over a grid of target FDR levels α in Example 2. Methods compared: BH, Storey-BH, Barber-Candès, AdaPT, AdaPT (oracle).

In this subsection, we examine the performance of AdaPT on four more real datasets, which have been analyzed in other papers on adaptive FDR control, e.g. Bourgon et al. (2010) and Ignatiadis et al. (2016). In all cases, we start with a brief introduction to the dataset and then show the number of rejections (as in Figure 2), the path of information loss (as in Figure 5), and the threshold curve and level curves of the estimated local FDR at target FDR 0.1 (as in Figures 3 and 4).
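Returning briefly to Example 2: its data-generating process can be sketched in a few lines. This is an illustrative simulation only — the intercept θ0 below is a hand-picked value (the paper tunes it so that the π_i average to 0.3), and the sample size is arbitrary; the truncation level 1 in µ_i is an assumption consistent with the beta-mixture alternative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5000, 100

# Covariates x_ij ~ U([0, 1]); theta and beta each have two non-zero entries.
x = rng.uniform(size=(n, d))
theta = np.zeros(d); theta[:2] = 3.0
beta = np.zeros(d); beta[:2] = 2.0
theta0 = -3.8  # illustrative; the paper chooses it so that mean(pi) = 0.3

pi = 1.0 / (1.0 + np.exp(-(theta0 + x @ theta)))  # logistic model for pi_i
mu = np.maximum(x @ beta, 1.0)                     # truncated linear mu_i

# Two-groups beta-mixture: H_i ~ Bern(pi_i); p_i is uniform under the null,
# and p_i = U**mu_i under the alternative, so that E(-log p_i) = mu_i.
H = rng.uniform(size=n) < pi
p = np.where(H, rng.uniform(size=n) ** mu, rng.uniform(size=n))
```

Note that p = U^µ has density (1/µ) p^(1/µ − 1), the beta-mixture alternative used throughout, and −log p is then Exponential with mean µ.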
We use the same settings for AdaPT as in the gene dosage dataset: performing model selection at the initial step, with the candidate featurizations given by all combinations of spline bases with 6 to 15 equi-quantile knots for π(x) and µ(x), and fixing the selected model in subsequent updates.

• Bottomly data

This is an RNA-Seq dataset aimed at detecting differential expression between two mouse strains, C57BL/6J (B6) and DBA/2J (D2), collected by Bottomly et al. (2011), available in the ReCount repository (Frazee et al., 2011), and analyzed by Ignatiadis et al. (2016) using IHW. It consists of gene expression measurements for n = 13932 genes. Following Ignatiadis et al. (2016), we analyze the data using the DESeq2 package (Love et al., 2014) and use the logarithm of the normalized count (averaged across all samples, plus 1) as the univariate covariate for each gene. The results are plotted in Figure 11. AdaPT produces significantly more discoveries than all other methods, and the information loss is almost negligible (with correlation consistently above 0.985). Furthermore, we observe the same pattern as before: AdaPT prioritizes the genes with higher normalized mean counts.

Fig. 11: Results for the Bottomly dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1. Methods compared: SeqStep, HingeExp, ForwardStop, Adaptive SeqStep, BH, Storey-BH, Barber-Candès, SABHA (step), SABHA (ordered), IHW, IHW (oracle), IF (oracle), AdaPT.
• Airway data

This is an RNA-Seq dataset aimed at identifying differentially expressed genes in airway smooth muscle cell lines in response to dexamethasone, collected by Himes et al. (2014) and available in the R package airway. It is analyzed in the vignette of the IHW package using the IHW method (Ignatiadis et al., 2016). As in the vignette and the previous example, we analyze the data using the DESeq2 package (Love et al., 2014) and use the logarithm of the normalized count as the univariate covariate for each gene. The results are plotted in Figure 12. Again, AdaPT produces significantly more discoveries than all other methods.

Fig. 12: Results for the Airway dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1.

• Pasilla data

This is also an RNA-Seq dataset, aimed at detecting genes that are differentially expressed between the normal and Pasilla-knockdown conditions, collected by Brooks et al. (2011) and available in the R package pasilla (Huber and Reyes, 2016). It is analyzed in the vignette of the genefilter package (Gentleman et al., 2016) using the independent filtering method of Bourgon et al. (2010). As in the vignette, we analyze the data using the DESeq package (Anders and Huber, 2010) and use the logarithm of the normalized count as the univariate covariate for each gene. The results are plotted in Figure 13.
We arrive at the same conclusion: AdaPT is more powerful than all other methods.

Fig. 13: Results for the Pasilla dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1.

• Yeast proteins data

This is a proteomics dataset, collected by Dephoure and Gygi (2012) and available in the R package IHWpaper, that provides temporal abundance profiles for 2666 yeast proteins from a quantitative mass-spectrometry (SILAC) experiment. The goal is to identify differential protein abundance in yeast cells treated with rapamycin versus DMSO. It is analyzed in Ignatiadis et al. (2016) using the IHW method. As in Dephoure and Gygi (2012) and Ignatiadis et al. (2016), we compute p-values using Welch's t-test and use as the univariate covariate the logarithm of the total number of peptides quantified across all samples for each gene. The results are plotted in Figure 14. In this case, AdaPT performs similarly to the Barber-Candès method and Storey's BH method, but it still outperforms all other methods. Furthermore, AdaPT learns the monotone pattern of the local FDR, which matches the heuristic.
Fig. 14: Results for the yeast proteins dataset: (left) number of rejections; (middle) path of information loss; (right) threshold curve and level curves of estimated local FDR at α = 0.1.

6. Discussion

We have proposed the AdaPT procedure, a general iterative framework for multiple testing with side information. Using partially masked p-values, we estimate a family of optimal and increasingly stringent rejection thresholds, which are level surfaces of the local FDR. We then monitor an estimator of the FDP to decide which threshold to use, updating our estimates as we unmask more p-values and gain more information.

Our method is interactive in that it allows the analyst to use an arbitrary method for estimating the local FDR, and to consult her intuition to change models at any iteration, even after observing most of the data. No matter what the analyst does or how badly she overfits the data, FDR is still controlled at the advertised level (though power could be adversely affected by overfitting). We show in various experiments that AdaPT can give consistent and significant power improvements over current state-of-the-art methods.

AdaPT without thresholds

Although we state AdaPT as a procedure that interactively updates a covariate-dependent threshold curve, the thresholds are not essential. In fact, Algorithm 1 can be modified as follows in the absence of s(x).

Algorithm 3 AdaPT without thresholds
Input: predictors and p-values (x_i, p_i)_{i ∈ [n]}, target FDR level α.
Procedure:
1: Initialize R̂_0 = [n]
2: for t = 0, 1, . . . do
3:   R_t ← {i ∈ R̂_t : p_i ≤ 1/2};  A_t ← {i ∈ R̂_t : p_i > 1/2}
4:   FDP̂_t ← (1 + |A_t|) / (|R_t| ∨ 1)
5:   if FDP̂_t ≤ α then reject R_t and halt end if
6:   R̂_{t+1} ← update((x_i, p̃_{t,i})_{i ∈ [n]}, R̂_t)
7: end for

Rephrasing Algorithm 3: we start by partially masking all p-values, yielding a "candidate rejection set" R̂_0 = [n], and then apply an arbitrary method to update R̂_t directly. The FDP estimator (line 4) is defined in essentially the same way as in Algorithm 1. It is easy to see that Algorithm 1 is a special case of Algorithm 3. Perhaps strikingly, the proof of FDR control carries through to this more general case without any modification.

It is not hard to see that our implementation in Section 4 can be reformulated in a simpler, more direct way: at each step, we estimate the local FDR for each partially masked p-value and peel off the δ-proportion of them with the highest estimated local FDR. In principle, we can use any "score" that measures how promising each hypothesis is, or how likely each hypothesis is to be non-null. A simple workflow based on Algorithm 3 is to peel off the hypotheses with the least favorable scores and proceed with scores refitted using the newly revealed p-values. Heuristically, the most statistically meaningful score is the local FDR, which is directly tied to our purpose, but allowing other scores makes the AdaPT framework more general and flexible. For instance, we recently exploited this idea to develop a general framework for controlling FDR under structural constraints; we refer readers to Lei et al. (2017) for more ideas in this vein.

It would also be interesting to attempt to relax our restriction that the p-values must be independent. In the absence of some modification, our AdaPT procedure does not control FDR in finite samples for dependent p-values.
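The peeling workflow based on Algorithm 3 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the score here is simply the masked p-value min(p_i, 1 − p_i), a crude stand-in for an estimated local FDR, and the peel fraction δ is an arbitrary choice.

```python
import numpy as np

def adapt_without_thresholds(p, alpha, delta=0.05):
    """Minimal sketch of the threshold-free AdaPT loop (Algorithm 3).

    The candidate set starts at [n]. At each step we compute
    FDPhat_t = (1 + |A_t|) / (|R_t| v 1), where R_t / A_t are the candidates
    with p_i <= 1/2 / p_i > 1/2, and reject R_t as soon as FDPhat_t <= alpha.
    Otherwise we peel off a delta-fraction of the candidates, scored here by
    the masked p-value alone.
    """
    candidates = np.arange(len(p))                    # Rhat_0 = [n]
    while len(candidates) > 0:
        below = candidates[p[candidates] <= 0.5]      # R_t
        n_above = int(np.sum(p[candidates] > 0.5))    # |A_t|
        fdp_hat = (1 + n_above) / max(len(below), 1)
        if fdp_hat <= alpha:
            return below                              # reject R_t and halt
        # Update step: uses only the masked p-values, so p_i and 1 - p_i
        # remain indistinguishable to the analyst at this point.
        masked = np.minimum(p[candidates], 1 - p[candidates])
        k = max(1, int(delta * len(candidates)))
        candidates = candidates[np.argsort(masked)[:-k]]  # drop k least promising
    return np.array([], dtype=int)

rng = np.random.default_rng(0)
p = np.concatenate([rng.beta(0.1, 1.0, size=100),  # planted non-nulls
                    rng.uniform(size=900)])         # nulls
rejected = adapt_without_thresholds(p, alpha=0.1)
```

In a full implementation the score would instead be the estimated local FDR, refitted from the revealed p-values at each step.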
In particular, there is a danger of "overfitting" to local random effects shared by nearby hypotheses: to the AdaPT procedure, such random effects look like signal to discover.

It could be interesting to pursue a hybrid method using ideas from AdaPT and the Knockoff+ procedure in the case where the p-values arise from regression coefficients or other multivariate Gaussian test statistics. Suppose that we observe a feature matrix X ∈ R^{n×d} and a response vector y ~ N_n(Xβ, σ² I_n), and we wish to test the hypotheses H_j: β_j = 0 for j = 1, ..., d. The key step in Barber and Candès (2015) is to compute another matrix X̃ ∈ R^{n×d} with X̃'X̃ = X'X and X̃'X = X'X − D, for some diagonal matrix D ∈ R^{d×d} with positive entries; this can be done provided that n ≥ 2d and X has full column rank.

If we define v = X'y and ṽ = X̃'y, then we have

( v + ṽ ; v − ṽ ) ~ N_2d( ( (2X'X − D)β ; Dβ ),  σ² [ 4X'X − 2D, 0 ; 0, 2D ] ).

As a result, ((v_j, ṽ_j))_{j ∈ H_0} are independent exchangeable pairs, conditional on (v_j)_{j ∉ H_0}. Let F_{−1} = σ(({v_j, ṽ_j})_{j=1}^d). The knockoff filter directly uses these exchangeable pairs by constructing knockoff statistics w(X, y) ∈ R^d. The sufficiency and antisymmetry conditions together imply that each |w_j| is F_{−1}-measurable and that, conditional on F_{−1}, b_j = (1 − sgn(w_j))/2 is a mirror-conservative "binary p-value": that is, (b_j)_{j ∈ H_0} are i.i.d. Bern(1/2), independently of F_{−1} and (b_j)_{j ∉ H_0}. Using |w_j| as a "predictor" (along with any other predictors for feature j that we might have at hand) and b_j as the p-value, the AdaPT procedure is immediately applicable. Note that min{b_j, 1 − b_j} = 0 for every j; hence, at each step it matters only whether the rejection threshold surface is above zero or not.
If q_t is the t-th smallest value of (|w_j|)_{j=1}^d, the Knockoff+ filter corresponds to using the thresholds s_t(|w_j|) = 0.5 · 1{|w_j| ≥ q_t}. More generally, we can use AdaPT and interactively change the threshold we use.

If σ is known, we can proceed more directly by constructing z-statistics and two-tailed p-values:

z_j = (v_j − ṽ_j) / (σ √(2 d_j)) ~ N( √(d_j / 2) β_j / σ, 1 );   p_j = 2 min{ Φ(z_j), 1 − Φ(z_j) },

where d_j is the j-th diagonal entry of D. In that case (p_j)_{j ∈ H_0} are i.i.d. uniform p-values conditional on (p_j)_{j ∉ H_0} and v + ṽ (not on F_{−1} above). Once again, we can immediately apply AdaPT using v + ṽ as a "predictor." While it is not fully clear a priori just how we should use v + ṽ as a predictor, this represents an interesting avenue for future work.

Focusing on the case of an orthogonal design further illuminates the relationship between AdaPT and the Knockoff+ procedure. Suppose that X ∈ R^{n×d} has orthonormal columns and that n ≥ 2d. In that case, Barber and Candès (2015) suggest using a knockoff matrix X̃ of d more orthonormal columns which are also orthogonal to the columns of X. Then X'y ~ N_d(β, σ² I_d) while X̃'y ~ N_d(0, σ² I_d), independently.

In this case, using the LASSO, forward stepwise regression, or virtually any other model selection path procedure on the design matrix [X X̃] is identical to selecting variables in decreasing order of |X_j'y| and |X̃_j'y|; or equivalently, in increasing order of the two-tailed p-values p_j = 2(1 − Φ(|X_j'y|/σ)) and p*_j = 2(1 − Φ(|X̃_j'y|/σ)) (this is true whether or not σ is known). As a result, if we operationalize the Knockoff+ procedure using e.g. the LASSO, we would reject hypotheses H_j for which min{p_j, p*_j} is small and p_j < p*_j.
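The orthogonal-design comparison above can be made concrete with a small self-contained sketch. The dimensions, effect sizes, and cutoff below are arbitrary illustrative choices; X and X̃ are built as 2d jointly orthonormal columns, so here X'X = X̃'X̃ = I_d and X̃'X = 0.

```python
import math
import numpy as np

def Phi(z):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

rng = np.random.default_rng(1)
n, d, sigma = 200, 50, 1.0

# Orthogonal-design special case: QR of a Gaussian matrix gives 2d
# orthonormal columns; the first d form X, the rest form the knockoffs.
Q, _ = np.linalg.qr(rng.standard_normal((n, 2 * d)))
X, Xtil = Q[:, :d], Q[:, d:]

beta = np.zeros(d)
beta[:10] = 5.0                                   # ten non-null coefficients
y = X @ beta + sigma * rng.standard_normal(n)

v, vtil = X.T @ y, Xtil.T @ y                     # v ~ N(beta, I), vtil ~ N(0, I)
p = 2 * (1 - Phi(np.abs(v) / sigma))              # two-tailed p-values
p_star = 2 * (1 - Phi(np.abs(vtil) / sigma))      # independent knockoff p-values

# Knockoff+-style evidence uses the i.i.d. pair (p_j, p*_j) ...
knockoff_flag = (np.minimum(p, p_star) < 0.01) & (p < p_star)
# ... while AdaPT's mirror construction uses the exchangeable pair (p_j, 1 - p_j).
adapt_flag = (np.minimum(p, 1 - p) < 0.01) & (p < 1 - p)
```

For the null coordinates, p*_j is an independent uniform draw while 1 − p_j is a deterministic mirror of p_j; both pairs single out small p_j in the same way, which is the parallel the text draws.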
By contrast, if we were to implement AdaPT with a constant threshold at each step, we would reject hypotheses H_j for which min{p_j, 1 − p_j} is small and p_j < 1 − p_j. Hence, the pairwise exchangeability of (p_j, 1 − p_j) plays the same role as the i.i.d. pair (p_j, p*_j) in knockoffs.

The two most salient differences between AdaPT and Knockoff+ in this case are:

(a) AdaPT allows for iterative interaction between the analyst and the data, allowing the analyst to update her local FDR estimates as information accrues. By contrast, the knockoff filter as described in Barber and Candès (2015) does not allow for such interaction (though it could, and this is a potentially interesting avenue for extending knockoffs).

(b) Unlike Knockoff+, AdaPT introduces no extra randomness into the problem. This is because AdaPT uses the pairwise exchangeability of p_i with the "mirror image" p-value 1 − p_i instead of the independent "knockoff" p-value p*_i ~ U[0, 1]. Provided the analyst uses a deterministic update subroutine, the AdaPT result is a deterministic function of the original data.

In addition to returning a list of rejections that is guaranteed to control the global FDR, most implementations of AdaPT will also return estimates, for each rejected hypothesis, of the local FDR,

fdr̂(p_i | x_i) = P̂( H_i is null | x_i, p_i ).

If we have reasonably high confidence in the model used to produce these estimates, they may provide the best summary of the evidence against the individual hypothesis H_i. By contrast, the significance level for the global FDR only summarizes the strength of evidence against the entire list of rejections, taken as a whole. Indeed, it is possible to construct pathological examples where fdr(p_i | x_i) = 1 for some of the rejected H_i, despite controlling FDR at some level α ≪ 1.
Even apart from such perversities, it will typically be the case that fdr̂(p_i | x_i) > α for many of the rejected hypotheses.

Despite their more favorable interpretation, however, the local FDR estimates produced by AdaPT rely on much stronger assumptions than the global FDR control guarantee — namely, that the two-groups model, as well as our specifications for π(x) and f(p | x), must be correct. Instead of using the parametric estimates fdr̂(p_i | x_i), we could estimate the local FDR in a moving window of w steps of the AdaPT algorithm:

fdp̂_{t,w} = (A_t − A_{t+w}) / ( 1 ∨ (R_t − R_{t+w}) ),   or   fdp̂⁺_{t,w} = (1 + A_t − A_{t+w}) / ( 1 ∨ (R_t − R_{t+w}) ).

Note that if we take an infinitely large window, we obtain fdp̂⁺_{t,∞} = FDP̂_t; thus, these estimators adaptively estimate the false discovery proportion for p-values revealed in the next w steps of the algorithm, in much the same way that FDP̂_t estimates the false discovery proportion for all remaining p-values. It would be interesting to investigate, in future work, what error-control guarantees we might be able to derive by using these estimators.

Acknowledgments

The authors thank Jim Pitman, Ruth Heller, Aaditya Ramdas, and Stefan Wager for helpful discussions.

References

Allison, D. B., G. L. Gadbury, M. Heo, J. R. Fernández, C.-K. Lee, T. A. Prolla, and R. Weindruch (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis 39(1), 1–20.
Anders, S. and W. Huber (2010). Differential expression analysis for sequence count data. Genome Biology 11(10), R106.
Arias-Castro, E. and S. Chen (2016). Distribution-free multiple testing. arXiv preprint arXiv:1604.07520.
Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5), 2055–2085.
Barber, R. F. and E. J.
Candès (2016). A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 289–300.
Benjamini, Y. and Y. Hochberg (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics 24(3), 407–418.
Berk, R., L. Brown, A. Buja, K. Zhang, and L. Zhao (2013). Valid post-selection inference. The Annals of Statistics 41(2), 802–837.
Bottomly, D., N. A. Walter, J. E. Hunter, P. Darakjian, S. Kawane, K. J. Buck, R. P. Searles, M. Mooney, S. K. McWeeney, and R. Hitzemann (2011). Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-seq and microarrays. PLoS ONE 6(3), e17820.
Bourgon, R., R. Gentleman, and W. Huber (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107(21), 9546–9551.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.
Brooks, A. N., L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley (2011). Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research 21(2), 193–202.
Davis, S. and P. S. Meltzer (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and Bioconductor. Bioinformatics 23(14), 1846–1847.
Dephoure, N. and S. P. Gygi (2012). Hyperplexing: a method for higher-order multiplexed quantitative proteomics provides a map of the dynamic response to rapamycin in yeast. Science Signaling 5(217), rs2.
Dobriban, E. (2016). A general convex framework for multiple testing with prior information. arXiv preprint arXiv:1603.05334.
Dobriban, E., K. Fortney, S. K. Kim, and A. B. Owen (2015). Optimal multiple testing under a Gaussian prior on the effect sizes. Biometrika 102(4), 753–766.
Dobson, A. J. and A.
Barnett (2008). An Introduction to Generalized Linear Models. CRC Press.
Du, L., C. Zhang, et al. (2014). Single-index modulated multiple testing. The Annals of Statistics 42(4), 1262–1311.
Dwork, C., V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth (2015). Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 117–126. ACM.
Efron, B. (2007). Size, power and false discovery rates. The Annals of Statistics, 1351–1377.
Efron, B., R. Tibshirani, J. D. Storey, and V. Tusher (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96(456), 1151–1160.
Ferkingstad, E., A. Frigessi, H. Rue, G. Thorleifsson, and A. Kong (2008). Unsupervised empirical Bayesian multiple testing with external covariates. The Annals of Applied Statistics, 714–735.
Fithian, W., D. Sun, and J. Taylor (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
Fortney, K., E. Dobriban, P. Garagnani, C. Pirazzini, D. Monti, D. Mari, G. Atzmon, N. Barzilai, C. Franceschi, A. B. Owen, et al. (2015). Genome-wide scan informed by age-related disease identifies loci for exceptional human longevity. PLoS Genetics 11(12), e1005728.
Frazee, A. C., B. Langmead, and J. T. Leek (2011). ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12(1), 449.
Genovese, C. R., K. Roeder, and L. Wasserman (2006). False discovery control with p-value weighting. Biometrika 93(3), 509–524.
Gentleman, R., V. Carey, W. Huber, and F. Hahne (2016). genefilter: methods for filtering genes from high-throughput experiments. R package version 1.54.2.
Geyer, C. J. and G. D. Meeden (2005). Fuzzy and randomized confidence intervals and p-values. Statistical Science, 358–366.
G'Sell, M. G., S. Wager, A. Chouldechova, and R. Tibshirani (2016).
Sequential selection procedures and false discovery rate control. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(2), 423–444.
Hemerik, J. and J. Goeman (2014). Exact testing with random permutations. arXiv preprint arXiv:1411.7565.
Himes, B. E., X. Jiang, P. Wagner, R. Hu, Q. Wang, B. Klanderman, R. M. Whitaker, Q. Duan, J. Lasky-Su, C. Nikolos, et al. (2014). RNA-seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. PLoS ONE 9(6), e99625.
Hoeffding, W. (1952). The large-sample power of tests based on permutations of observations. The Annals of Mathematical Statistics, 169–192.
Hu, J. X., H. Zhao, and H. H. Zhou (2012). False discovery rate control with groups. Journal of the American Statistical Association 105(491), 1215–1227.
Huber, W. and A. Reyes (2016). pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011. R package version 0.12.0.
Ignatiadis, N. and W. Huber (2017). Covariate-powered weighted multiple testing with false discovery rate control. arXiv preprint arXiv:1701.05179.
Ignatiadis, N., B. Klaus, J. B. Zaugg, and W. Huber (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods 13(7), 577–580.
Lawyer, G., E. Ferkingstad, R. Nesvåg, K. Varnäs, and I. Agartz (2009). Local and covariate-modulated false discovery rates applied in neuroimaging. NeuroImage 47(1), 213–219.
Lee, J. D., D. L. Sun, Y. Sun, and J. E. Taylor (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics 44(3), 907–927.
Lehmann, E. and J. P. Romano (2005). Testing Statistical Hypotheses. New York: Springer.
Lei, L. and W. Fithian (2016). Power of ordered hypothesis testing. In ICML.
Lei, L., A. Ramdas, and W.
Fithian (2017). STAR: A general interactive framework for FDR control under structural constraints. arXiv preprint arXiv:1710.02776.
Lewinger, J. P., D. V. Conti, J. W. Baurley, T. J. Triche, and D. C. Thomas (2007). Hierarchical Bayes prioritization of marker associations from a genome-wide association scan for further investigation. Genetic Epidemiology 31(8), 871–882.
Li, A. and R. F. Barber (2016a). Accumulation tests for FDR control in ordered hypothesis testing. Journal of the American Statistical Association 112 (just-accepted), 1–38.
Li, A. and R. F. Barber (2016b). Multiple testing with the structure adaptive Benjamini-Hochberg algorithm. arXiv preprint arXiv:1606.07926.
Love, M. I., S. Anders, and W. Huber (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12), 550.
Markitsis, A. and Y. Lai (2010). A censored beta mixture model for the estimation of the proportion of non-differentially expressed genes. Bioinformatics 26(5), 640–646.
Parker, R. and R. Rothenberg (1988). Identifying important results from multiple statistical tests. Statistics in Medicine 7(10), 1031–1043.
Pounds, S. and S. W. Morris (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19(10), 1236–1242.
Slater, M. (1950). Lagrange multipliers revisited. Technical report, Cowles Commission Discussion Paper, Mathematics.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3), 479–498.
Storey, J. D. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(3), 347–368.
Storey, J. D., J. E. Taylor, and D. Siegmund (2004).
Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(1), 187–205.

Storey, J. D. and R. Tibshirani (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100(16), 9440–9445.

Sun, W., B. J. Reich, T. Tony Cai, M. Guindani, and A. Schwartzman (2015). False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(1), 59–83.

Tian, X., J. Taylor, et al. (2018). Selective inference with a randomized response. The Annals of Statistics 46(2), 679–710.

Tukey, J. W. (1994). The Collected Works of John W. Tukey: Multiple Comparisons, 1948–1983, Volume 8. Chapman & Hall/CRC.

Yekutieli, D. (2012). Adjusted Bayesian inference for selected parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74(3), 515–541.

Zablocki, R. W., A. J. Schork, R. A. Levine, O. A. Andreassen, A. M. Dale, and W. K. Thompson (2014). Covariate-modulated local false discovery rate for genome-wide association studies. Bioinformatics 30(15), 2098–2104.

A. EM algorithm details

A.1. Derivation of E-step
To fill in the details of the algorithm, we are left to calculate the imputed values $\hat{H}_i^{(r)}$ and $\hat{y}_i^{(r)}$ given the parameters $\theta$ and $\beta$. Define $\pi_i$ and $\mu_i$ by
$$\pi_i = \left(1 + e^{-\phi_\pi(x_i)'\theta}\right)^{-1}, \qquad \mu_i = \zeta^{-1}\left(\beta'\phi_\mu(x_i)\right).$$
We distinguish two cases: for revealed p-values, $\tilde{p}_{t,i}$ is a singleton; for masked p-values, $\tilde{p}_{t,i}$ is a two-element set. For clarity, we let $p'_{t,i}$ denote the minimum element of $\tilde{p}_{t,i}$, i.e. $\tilde{p}_{t,i} = \{p'_{t,i}\}$ in the former case and $\tilde{p}_{t,i} = \{p'_{t,i}, 1 - p'_{t,i}\}$ in the latter case.
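In code, the partial censoring just described can be sketched as follows. This is an illustrative sketch only (the function name and interface are ours), with the revealed region taken to be $[s, 1-s]$ as described in Section A.2:

```python
def partially_censor(p, s):
    """Return the partially censored value ptilde_{t,i}: the singleton {p}
    if p lies in the revealed region [s, 1 - s], and the unordered pair
    {p, 1 - p} otherwise (the p-value is masked)."""
    if s <= p <= 1 - s:
        return {p}
    return {p, 1 - p}

# min(ptilde) is the p'_{t,i} that appears in the E-step formulas below
print(sorted(partially_censor(0.5, 0.25)))    # [0.5] -- revealed
print(sorted(partially_censor(0.125, 0.25)))  # [0.125, 0.875] -- masked
```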
• For revealed p-values,
$$\hat{H}_i^{(r)} = P(H_i = 1 \mid p_i = \tilde{p}_{t,i}) = \frac{\pi_i \cdot h(\tilde{p}_{t,i}; \mu_i)}{\pi_i \cdot h(\tilde{p}_{t,i}; \mu_i) + 1 - \pi_i}, \quad (26)$$
and
$$\hat{y}_i^{(r)} = E(y_i H_i \mid \tilde{p}_{t,i}) / \hat{H}_i^{(r)} = y_i. \quad (27)$$

• For masked p-values,
$$P(p_i = p'_{t,i} \mid \tilde{p}_{t,i}, H_i = 1) = \frac{h(p'_{t,i}; \mu_i)}{h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i)} \quad (28)$$
and
$$P(p_i = 1 - p'_{t,i} \mid \tilde{p}_{t,i}, H_i = 1) = \frac{h(1 - p'_{t,i}; \mu_i)}{h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i)}. \quad (29)$$
By Bayes' formula, we can also derive the conditional distribution of $H_i$ given $\tilde{p}_{t,i}$:
$$\hat{H}_i^{(r)} = P(H_i = 1 \mid \tilde{p}_{t,i}) = \frac{\pi_i \left( h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i) \right)}{\pi_i \left( h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i) \right) + 2(1 - \pi_i)}. \quad (30)$$
As a consequence of (28)–(30),
$$\hat{y}_i^{(r)} = \frac{1}{\hat{H}_i^{(r)}} \left\{ g(p'_{t,i}) \cdot P(H_i = 1, p_i = p'_{t,i} \mid \tilde{p}_{t,i}) + g(1 - p'_{t,i}) \cdot P(H_i = 1, p_i = 1 - p'_{t,i} \mid \tilde{p}_{t,i}) \right\}$$
$$= \frac{1}{\hat{H}_i^{(r)}} \cdot \frac{\pi_i \left( h(p'_{t,i}; \mu_i)\, g(p'_{t,i}) + h(1 - p'_{t,i}; \mu_i)\, g(1 - p'_{t,i}) \right)}{\pi_i \left( h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i) \right) + 2(1 - \pi_i)} = \frac{h(p'_{t,i}; \mu_i)\, g(p'_{t,i}) + h(1 - p'_{t,i}; \mu_i)\, g(1 - p'_{t,i})}{h(p'_{t,i}; \mu_i) + h(1 - p'_{t,i}; \mu_i)}. \quad (31)$$

A.1.1. Beta-mixture model
Consider the beta-mixture model (15), where $h(p; \mu) = \mu^{-1} p^{1/\mu - 1}$.
Plugging it into our general results, we obtain that

• for revealed p-values,
$$\hat{H}_i^{(r)} = \frac{\pi_i \cdot \mu_i^{-1} \tilde{p}_{t,i}^{1/\mu_i - 1}}{\pi_i \cdot \mu_i^{-1} \tilde{p}_{t,i}^{1/\mu_i - 1} + 1 - \pi_i}, \qquad \hat{y}_i^{(r)} = y_i; \quad (32)$$

• for masked p-values,
$$\hat{H}_i^{(r)} = \frac{\pi_i \cdot \mu_i^{-1} \left( (p'_{t,i})^{1/\mu_i - 1} + (1 - p'_{t,i})^{1/\mu_i - 1} \right)}{\pi_i \cdot \mu_i^{-1} \left( (p'_{t,i})^{1/\mu_i - 1} + (1 - p'_{t,i})^{1/\mu_i - 1} \right) + 2(1 - \pi_i)}, \qquad \hat{y}_i^{(r)} = \frac{(p'_{t,i})^{1/\mu_i - 1}(-\log p'_{t,i}) + (1 - p'_{t,i})^{1/\mu_i - 1}(-\log(1 - p'_{t,i}))}{(p'_{t,i})^{1/\mu_i - 1} + (1 - p'_{t,i})^{1/\mu_i - 1}}. \quad (33)$$

A.1.2. Gaussian-mixture model
Suppose the p-values are derived from one-sided z-tests. Since the p-value is a transformation of a normal random variable, the most natural transformation is $g(p) = \Phi^{-1}(1 - p)$. By (13),
$$h(p; \mu) = \exp\left\{ \mu \cdot g(p) - \frac{\mu^2}{2} \right\}.$$
Plugging it into our general results, we obtain that

• for revealed p-values,
$$\hat{H}_i^{(r)} = P(H_i = 1 \mid p_i = \tilde{p}_{t,i}) = \frac{\pi_i \cdot e^{\mu_i y_i}}{\pi_i \cdot e^{\mu_i y_i} + (1 - \pi_i)\, e^{\mu_i^2/2}}, \qquad \hat{y}_i^{(r)} = y_i; \quad (34)$$

• for masked p-values,
$$\hat{H}_i^{(r)} = \frac{\pi_i \cdot \cosh(\mu_i y_i)}{\pi_i \cdot \cosh(\mu_i y_i) + (1 - \pi_i)\, e^{\mu_i^2/2}}, \qquad \hat{y}_i^{(r)} = |y_i| \cdot \tanh(\mu_i |y_i|). \quad (35)$$

A.2. Initialization
Another important issue is the initialization. The formulae for $\hat{H}_i^{(r)}$ and $\hat{y}_i^{(r)}$ require estimates of $\pi_i$ and $\mu_i$. At step 0, when no information has accrued, we propose a simple method based on imputing a random guess, as follows. First we obtain an initial guess of $\pi_i$. Let $J_i = I(\tilde{p}_{t,i} \text{ contains two elements})$; then we observe that
$$1 - E J_i = P(p_i \in [s(x_i), 1 - s(x_i)]) \geq (1 - \pi_i)(1 - 2s(x_i)) \implies \pi_i \geq E\left(1 - \frac{1 - J_i}{1 - 2s(x_i)}\right). \quad (36)$$
Let
$$\tilde{J}_i = 1 - \frac{1 - J_i}{1 - 2s(x_i)} \quad (37)$$
and fit a logistic regression of $\tilde{J}_i$ on the covariates $\phi_\pi(x_i)$, with fitted values denoted by $\hat{\pi}_i$; then truncate $\hat{\pi}_i$ at 0 and 1 to obtain an initial guess of $\pi_i$, i.e.
$$\pi_i^{(0)} = (\hat{\pi}_i \vee 0) \wedge 1. \quad (38)$$
By (36), $\pi_i^{(0)}$ is a conservative estimate of $\pi_i$.
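A minimal sketch of the Gaussian-mixture E-step imputation ((34)–(35) above); the function name and interface are ours, for illustration only:

```python
import math
from statistics import NormalDist

def gaussian_estep(p_prime, masked, pi_i, mu_i):
    """E-step imputation under the Gaussian-mixture model, eqs. (34)-(35).
    p_prime is the minimum element p'_{t,i} of ptilde_{t,i}; the z-value is
    y = g(p) = Phi^{-1}(1 - p), so that g(1 - p') = -g(p')."""
    y = NormalDist().inv_cdf(1 - p_prime)
    denom_null = (1 - pi_i) * math.exp(mu_i ** 2 / 2)
    if not masked:  # revealed p-value, eq. (34)
        num = pi_i * math.exp(mu_i * y)
        return num / (num + denom_null), y
    # masked p-value, eq. (35): only |y_i| is identified
    num = pi_i * math.cosh(mu_i * y)
    return num / (num + denom_null), abs(y) * math.tanh(mu_i * abs(y))

H_hat, y_hat = gaussian_estep(0.05, True, 0.3, 2.0)
print(round(H_hat, 3), round(y_hat, 3))
```

A quick consistency check: the masked-case output agrees with the generic formulas (30)–(31) evaluated at $h(p;\mu) = \exp\{\mu g(p) - \mu^2/2\}$.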
This is preferred to an anti-conservative estimate, since the latter might cause over-fitting.

Next we obtain an initial guess of $\mu(x_i)$ by imputing the $p_i$'s. If $p_i \in [s(x_i), 1 - s(x_i)]$, then $\tilde{p}_{t,i} = p_i$ and we can use it directly. Otherwise, we only know that $p_i \in \tilde{p}_{t,i} = \{p'_{t,i}, 1 - p'_{t,i}\}$. If $p_i$ is null, then it should be uniform on $\{p'_{t,i}, 1 - p'_{t,i}\}$; if $p_i$ is non-null, then it is more likely to be $p'_{t,i}$, since $p'_{t,i} < 1 - p'_{t,i}$. Thus, we impute $p_i$ by $p'_{t,i}$, and fit an unweighted GLM of $g(p'_{t,i})$ on the covariates $\phi_\mu(x_i)$ with inverse link to obtain an initial guess of $\mu_i$.

A.3. Other issues
In Algorithm 2, we fit $\mu(x)$ using a weighted GLM, corresponding to the M-step. However, the weighted step is sensitive to the weights $\hat{H}^{(r)}$ derived from the E-step, which rely on the assumption that null p-values are exactly uniformly distributed on $[0, 1]$. For this reason, we modify the weighted step in the M-steps for fitting $\mu(x)$ into an unweighted step, and find that this choice leads to consistently good performance in all experiments shown in Section 5.

B. Technical Proofs

Proof (Proof of Theorem 2). Assume $f(p \mid x_i) \leq M$. Let $\eta_i = \nu(\{x_i\})$ and $s = (s_1, \ldots, s_n) \triangleq (s(x_1), \ldots, s(x_n))$. Then the objective function of (7),
$$\int_{\mathcal{X}} -F_1(s(x) \mid x)\,\pi(x)\,\nu(dx) = -\sum_{i=1}^n \eta_i F_1(s_i \mid x_i)\,\pi(x_i),$$
is a convex function of $s$ by condition (i), and the constraint function
$$g(s) \triangleq \int_{\mathcal{X}} \left\{ -\alpha F_1(s(x) \mid x)\,\pi(x) + (1 - \alpha) F_0(s(x) \mid x)(1 - \pi(x)) \right\} \nu(dx) = \sum_{i=1}^n \eta_i \left( -\alpha F_1(s_i \mid x_i)\,\pi(x_i) + (1 - \alpha) F_0(s_i \mid x_i)(1 - \pi(x_i)) \right)$$
is also a convex function of $s$ by condition (i). To establish the necessity of the KKT condition, it is left to prove Slater's condition (Slater 1950; Boyd and Vandenberghe 2004, Chap. 5), i.e.
there exist $\bar{s}$ and $\delta > 0$ such that $g(s) \leq 0$ for all $s \in B(\bar{s}, \delta)$, and $g(\bar{s}) < 0$. By condition (ii), WLOG we assume $\mathrm{fdr}(0 \mid x_1) < \alpha$ with $\nu(\{x_1\}) = \eta_1 > 0$. Fix any $\epsilon > 0$ and define $\omega(\epsilon)$ as the maximum modulus of continuity of $f_0(p \mid x_1)$ and $f_1(p \mid x_1)$ at the point $0$, i.e.
$$\omega(\epsilon) = \max\left\{ \sup_{p \leq \epsilon} |f_0(p \mid x_1) - f_0(0 \mid x_1)|,\ \sup_{p \leq \epsilon} |f_1(p \mid x_1) - f_1(0 \mid x_1)| \right\}.$$
By assumption (i), we know that $\lim_{\epsilon \to 0} \omega(\epsilon) = 0$. Let $\bar{s} \in \mathbb{R}^n$ with
$$\bar{s}_1 = 2\epsilon, \qquad \bar{s}_2 = \cdots = \bar{s}_n = \epsilon \cdot (2M)^{-1} \eta_1 \Delta,$$
where
$$\Delta = f(0 \mid x_1)\left( \alpha(1 - \mathrm{fdr}(0 \mid x_1)) - (1 - \alpha)\,\mathrm{fdr}(0 \mid x_1) \right) = \alpha\,\pi(x_1) f_1(0 \mid x_1) - (1 - \alpha)(1 - \pi(x_1)) f_0(0 \mid x_1) > 0.$$
Let $\delta = \min\{\bar{s}_2, \epsilon\}$. We will show that for any $s \in B(\bar{s}, \delta)$, $g(s) < 0$. In fact, for any $s \in B(\bar{s}, \delta)$, we have
$$s_1 \in [\epsilon, 3\epsilon], \qquad s_i \leq \epsilon \cdot M^{-1} \eta_1 \Delta \quad (i \geq 2).$$
Recall that $M$ is an upper bound for the null density. WLOG, we assume that $M \geq \eta_1 \Delta$, in which case $s_i \leq \epsilon$ for all $i \geq 2$. Note that $g(0) = 0$. By the mean-value theorem, there exist $\tilde{s}_1 \in [0, 3\epsilon]$ and $\tilde{s}_2, \ldots, \tilde{s}_n \in [0, \epsilon]$ such that
$$g(s) = s_1 \eta_1 \left( -\alpha f_1(\tilde{s}_1 \mid x_1)\pi(x_1) + (1 - \alpha) f_0(\tilde{s}_1 \mid x_1)(1 - \pi(x_1)) \right) + \sum_{i=2}^n s_i \eta_i \left( -\alpha f_1(\tilde{s}_i \mid x_i)\pi(x_i) + (1 - \alpha) f_0(\tilde{s}_i \mid x_i)(1 - \pi(x_i)) \right)$$
$$\leq s_1 \eta_1 \left( -\alpha f_1(0 \mid x_1)\pi(x_1) + (1 - \alpha) f_0(0 \mid x_1)(1 - \pi(x_1)) \right) + s_1 \eta_1 \omega(3\epsilon) + \sum_{i=2}^n s_i \eta_i \left( -\alpha f_1(\tilde{s}_i \mid x_i)\pi(x_i) + (1 - \alpha) f_0(\tilde{s}_i \mid x_i)(1 - \pi(x_i)) \right)$$
$$= -s_1 \eta_1 \Delta + s_1 \eta_1 \omega(3\epsilon) + \sum_{i=2}^n s_i \eta_i \left( -\alpha f_1(\tilde{s}_i \mid x_i)\pi(x_i) + (1 - \alpha) f_0(\tilde{s}_i \mid x_i)(1 - \pi(x_i)) \right)$$
$$\leq -s_1 \eta_1 \Delta + s_1 \eta_1 \omega(3\epsilon) + \sum_{i=2}^n s_i \eta_i \cdot M \leq -\epsilon \eta_1 \Delta + s_1 \eta_1 \omega(3\epsilon) + \epsilon \eta_1 \Delta \sum_{i=2}^n \eta_i$$
$$\leq -\epsilon \eta_1 \Delta + O(\epsilon\,\omega(3\epsilon)) + \epsilon \eta_1 \Delta (1 - \eta_1) = -\epsilon \eta_1^2 \Delta + o(\epsilon).$$
Thus, for sufficiently small $\epsilon$, $g(s) < 0$ for all $s \in B(\bar{s}, \delta)$, and hence Slater's condition is satisfied.

Proof (Proof of Lemma 2). We assume $\rho < 1$. Construct a random set $\mathcal{A} \subseteq [n]$ with $P(i \in \mathcal{A} \mid \mathcal{G}_{-1}) = \frac{1 - \rho_i}{1 - \rho}$, conditionally independently across $i \in [n]$, and construct conditionally i.i.d. Bernoulli variables $q_1, \ldots, q_n$, independent of $\mathcal{A}$, with $P(q_i = 1 \mid \mathcal{G}_{-1}) = \rho$. Then we can define
$$\tilde{b}_i = q_i\, \mathbb{1}\{i \in \mathcal{A}\} + \mathbb{1}\{i \notin \mathcal{A}\}, \quad (39)$$
which by construction gives $P(\tilde{b}_i = 1 \mid \mathcal{G}_{-1}) = \rho_i$ almost surely. Furthermore, noticing that
$$P(\tilde{b}_i = 0, \tilde{b}_j = 0 \mid \mathcal{G}_{-1}) = P(i \in \mathcal{A}, j \in \mathcal{A}, q_i = 0, q_j = 0 \mid \mathcal{G}_{-1}) = P(i \in \mathcal{A}, q_i = 0 \mid \mathcal{G}_{-1})\, P(j \in \mathcal{A}, q_j = 0 \mid \mathcal{G}_{-1}) = P(\tilde{b}_i = 0 \mid \mathcal{G}_{-1})\, P(\tilde{b}_j = 0 \mid \mathcal{G}_{-1}),$$
we conclude that the $\tilde{b}_i$ are conditionally independent given $\mathcal{G}_{-1}$. As a consequence, given $\mathcal{G}_{-1}$,
$$(\tilde{b}_1, \ldots, \tilde{b}_n) \stackrel{d}{=} (b_1, \ldots, b_n).$$
In the following proof, we will use (39) to represent the $b_i$'s.

To ensure that $\mathcal{C}_t$ decreases by at most a single element in each step, we introduce intermediate steps: for integers $t \geq 0$ and $1 \leq i \leq n$, define
$$\mathcal{C}_{t + i/n} = \mathcal{C}_{t+1} \cup \{ j \in \mathcal{C}_t : j \leq n - i \}.$$
Next, define the augmented filtration
$$\mathcal{G}_t^{\mathcal{A}} = \sigma\left( \mathcal{G}_{-1}, \mathcal{A}, \mathcal{C}_t, (b_i)_{i \notin \mathcal{C}_t \cap \mathcal{A}}, \sum_{i \in \mathcal{C}_t \cap \mathcal{A}} b_i \right) \supseteq \mathcal{G}_t,$$
for both integer and fractional values of $t$. Note that $\mathcal{C}_{t+1/n}$ is measurable with respect to $\sigma(\mathcal{C}_t)$. In addition, we define
$$U_t^{\mathcal{A}} = \sum_{i \in \mathcal{C}_t \cap \mathcal{A}} b_i, \qquad V_t^{\mathcal{A}} = \sum_{i \in \mathcal{C}_t \cap \mathcal{A}} (1 - b_i), \qquad Z_t^{\mathcal{A}} = \frac{1 + |\mathcal{C}_t \cap \mathcal{A}|}{1 + U_t^{\mathcal{A}}}.$$
Recalling the definition of $U_t$ (Section 2.1) and of $b_i$ (defined above in (39)), for any $t$,
$$\frac{1 + |\mathcal{C}_t|}{1 + U_t} = \frac{1 + |\mathcal{C}_t \cap \mathcal{A}| + |\mathcal{C}_t \cap \mathcal{A}^c|}{1 + U_t^{\mathcal{A}} + |\mathcal{C}_t \cap \mathcal{A}^c|} \leq \frac{1 + |\mathcal{C}_t \cap \mathcal{A}|}{1 + U_t^{\mathcal{A}}} = Z_t^{\mathcal{A}},$$
where the equality uses $b_i = 1$ for $i \notin \mathcal{A}$, and the inequality holds since $1 + |\mathcal{C}_t \cap \mathcal{A}| \geq 1 + U_t^{\mathcal{A}}$.
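As a quick numerical sanity check on the construction (39): marginalizing over $\mathcal{A}$ should recover $P(\tilde{b}_i = 1) = \rho_i$. A sketch of this arithmetic (assuming $\rho \leq \rho_i < 1$, so the probabilities involved are valid; the function name is ours):

```python
def btilde_prob_one(rho_i, rho):
    """Marginal P(btilde_i = 1) under construction (39):
    btilde_i = 1 whenever i lies outside A, and equals q_i ~ Bernoulli(rho)
    when i lies in A, with P(i in A) = (1 - rho_i) / (1 - rho)."""
    p_in_A = (1 - rho_i) / (1 - rho)   # requires rho <= rho_i < 1
    return (1 - p_in_A) + p_in_A * rho

print(btilde_prob_one(0.5, 0.2))  # ~0.5, recovering rho_i as claimed
```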
Finally, we observe that $(b_i)_{i \in \mathcal{C}_t \cap \mathcal{A}} = (q_i)_{i \in \mathcal{C}_t \cap \mathcal{A}}$ are exchangeable given $\mathcal{G}_t^{\mathcal{A}}$, with the random vector distributed uniformly over configurations summing to $U_t^{\mathcal{A}}$. There are three cases:

(i) if $\mathcal{C}_{t+1/n} \cap \mathcal{A} = \mathcal{C}_t \cap \mathcal{A}$, then $E[Z_{t+1/n}^{\mathcal{A}} \mid \mathcal{G}_t^{\mathcal{A}}] = Z_t^{\mathcal{A}}$;

(ii) if $U_t^{\mathcal{A}} = 0$ but $\mathcal{C}_{t+1/n} \cap \mathcal{A} \subsetneq \mathcal{C}_t \cap \mathcal{A}$, then
$$Z_{t+1/n}^{\mathcal{A}} = 1 + |\mathcal{C}_{t+1/n} \cap \mathcal{A}| = |\mathcal{C}_t \cap \mathcal{A}| = Z_t^{\mathcal{A}} - 1 \leq Z_t^{\mathcal{A}};$$

(iii) otherwise, $U_t^{\mathcal{A}} > 0$ and $\mathcal{C}_t \setminus \mathcal{C}_{t+1/n} = \{j\}$ for some $j \in \mathcal{A}$. The exchangeability of $b_i = q_i$ implies that
$$P(b_j = 1 \mid \mathcal{G}_t^{\mathcal{A}}) = \frac{U_t^{\mathcal{A}}}{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}}.$$
Then
$$E[Z_{t+1/n}^{\mathcal{A}} \mid \mathcal{G}_t^{\mathcal{A}}] = \frac{V_t^{\mathcal{A}}}{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}} \cdot \frac{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}}{1 + U_t^{\mathcal{A}}} + \frac{U_t^{\mathcal{A}}}{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}} \cdot \frac{U_t^{\mathcal{A}} + V_t^{\mathcal{A}}}{U_t^{\mathcal{A}}} = \frac{V_t^{\mathcal{A}}}{1 + U_t^{\mathcal{A}}} + 1 = Z_t^{\mathcal{A}}.$$

In all three cases, the conditional expectation of $Z_{t+1/n}^{\mathcal{A}}$ is no larger than $Z_t^{\mathcal{A}}$; thus, $Z_t^{\mathcal{A}}$ is a supermartingale with respect to the filtration $\mathcal{G}_t^{\mathcal{A}}$. Because $\hat{t}$ is also a stopping time with respect to the filtration $(\mathcal{G}_t^{\mathcal{A}})_{t = 0, 1/n, 2/n, \ldots}$ (but one which can only take integer values), for any $\mathcal{A} \subseteq [n]$ we have
$$E\left[ \frac{1 + |\mathcal{C}_{\hat{t}}|}{1 + \sum_{i \in \mathcal{C}_{\hat{t}}} b_i} \,\Big|\, \mathcal{G}_{-1}, \mathcal{A} \right] \leq E\left[ Z_{\hat{t}}^{\mathcal{A}} \mid \mathcal{G}_{-1}, \mathcal{A} \right] \leq E\left[ Z_0^{\mathcal{A}} \mid \mathcal{G}_{-1}, \mathcal{A} \right]. \quad (40)$$
Let $m_{\mathcal{A}} = |\mathcal{C}_0 \cap \mathcal{A}|$. By the representation (39), given $(\mathcal{G}_{-1}, \mathcal{A})$ we have $U_0^{\mathcal{A}} = \sum_{i \in \mathcal{C}_0 \cap \mathcal{A}} q_i \sim \mathrm{Binomial}(m_{\mathcal{A}}, \rho)$, so that
$$E\left[ Z_0^{\mathcal{A}} \mid \mathcal{G}_{-1}, \mathcal{A} \right] = \sum_{k=0}^{m_{\mathcal{A}}} \frac{1 + m_{\mathcal{A}}}{1 + k} \binom{m_{\mathcal{A}}}{k} \rho^k (1 - \rho)^{m_{\mathcal{A}} - k} = \sum_{k=0}^{m_{\mathcal{A}}} \binom{m_{\mathcal{A}} + 1}{k + 1} \rho^k (1 - \rho)^{m_{\mathcal{A}} - k} = \rho^{-1} \sum_{k=0}^{m_{\mathcal{A}}} \binom{m_{\mathcal{A}} + 1}{k + 1} \rho^{k+1} (1 - \rho)^{(m_{\mathcal{A}} + 1) - (k + 1)} = \rho^{-1} \left( 1 - (1 - \rho)^{m_{\mathcal{A}} + 1} \right) \leq \rho^{-1}.$$
Marginalizing over $\mathcal{A}$ in (40), we obtain the result.

B.1. Mirror-conservatism
In this subsection we provide two important examples that produce mirror-conservative p-values. The first example is the permutation test (e.g. Hoeffding, 1952). Typically we assume that under the null hypothesis the test statistic $T(X)$, where $X$ is shorthand for the observed data, is invariant in distribution under a finite group of transformations $\mathcal{G}$, i.e.
$$T(X) \stackrel{d}{=} T(gX), \quad \forall g \in \mathcal{G}.$$
When $|\mathcal{G}|$ is small, one can compute a discrete p-value as $R/|\mathcal{G}|$, where $R$ is the rank of $T(X)$ in the set $\{T(gX) : g \in \mathcal{G}\}$. When $|\mathcal{G}|$ is large, Hemerik and Goeman (2014) propose sampling a subset $\mathcal{G}' = \{g_1, g_2, \ldots, g_m\}$, with $g_1 = \mathrm{Id}$ and $g_2, \ldots, g_m$ a simple random sample (without replacement) from $\mathcal{G}$, and calculating the p-value based on $\mathcal{G}'$. In both cases, it can be proved that under the null hypothesis the p-value is uniformly distributed on the equi-spaced grid $\{\frac{1}{m}, \frac{2}{m}, \ldots, \frac{m}{m}\}$, where $m$ is the number of replicates. Then for any $0 < a_1 \leq a_2 < \frac{1}{2}$,
$$P(p \in [a_1, a_2]) = \frac{\left( \lfloor m a_2 \rfloor - \lceil m a_1 \rceil + 1 \right)_+}{m},$$
where $(u)_+$ denotes $\max\{u, 0\}$, and
$$P(p \in [1 - a_2, 1 - a_1]) = \frac{\left( \lfloor m(1 - a_1) \rfloor - \lceil m(1 - a_2) \rceil + 1 \right)_+}{m} = \frac{\left( \lfloor m a_2 \rfloor - \lceil m a_1 \rceil + 1 \right)_+}{m}.$$
As a result, we conclude that $P(p \in [a_1, a_2]) \leq P(p \in [1 - a_2, 1 - a_1])$, and hence $p$ is mirror-conservative.

The second example is the one-sided test for distributions with monotone likelihood ratio, which is ubiquitous in practice. Specifically, let $\theta$ be the univariate parameter of interest and $p_\theta(x)$ a family of densities with respect to some carrier measure $\mu$. The family $p_\theta(x)$ is said to have monotone likelihood ratio with respect to some real-valued function $T(x)$ if for any $\theta < \theta'$, $p_\theta \not\equiv p_{\theta'}$ and the ratio $p_{\theta'}(x)/p_\theta(x)$ is a nondecreasing function of $T(x)$.
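The grid computation for the permutation p-value (the first example above) can be checked by brute force; a small sketch, with interval endpoints of our own choosing:

```python
def grid_prob(m, lo, hi):
    """P(p in [lo, hi]) when p is uniform on the grid {1/m, 2/m, ..., m/m}."""
    return sum(lo <= k / m <= hi for k in range(1, m + 1)) / m

# mirror-conservatism: mass in [a1, a2] never exceeds mass in [1 - a2, 1 - a1]
for m in (7, 10, 13):
    for a1, a2 in [(0.0, 0.08), (0.03, 0.24), (0.11, 0.47)]:
        assert grid_prob(m, a1, a2) <= grid_prob(m, 1 - a2, 1 - a1)
```

The inequality can be strict: for instance, with $m = 7$ the interval $[0, 0.08]$ contains no grid point while its mirror $[0.92, 1]$ contains $p = 1$.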
For testing $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$, it is well known that there exists a Uniformly Most Powerful (UMP) test (Lehmann and Romano, 2005), with the following decision function:
$$\phi(x) = \begin{cases} 1 & (T(x) > C) \\ \gamma & (T(x) = C) \\ 0 & (T(x) < C) \end{cases}$$
where $(\gamma, C)$ is the solution of
$$P_{\theta_0}(T(X) > C) + \gamma\, P_{\theta_0}(T(X) = C) = \alpha.$$
Write $T(X)$ as $T$ and $P_{\theta_0}(T(X) \geq t)$ as $G(t)$ for short. Then the induced p-value can be written as
$$p = G(T^+) + U\left( G(T) - G(T^+) \right), \qquad U \sim U([0, 1]), \quad (41)$$
where $G(t^+) = \lim_{s \downarrow t} G(s)$. The quantity (41) is termed a fuzzy p-value by Geyer and Meeden (2005).

Proposition 1. Let $p_\theta(x)$ be a family of densities (w.r.t. the carrier measure $\mu$) that has monotone likelihood ratio w.r.t. $T(x)$. Then the p-value defined in (41) is mirror-conservative.

Proof. Since $p_\theta(x)$ has monotone likelihood ratio, for each $\theta \leq \theta_0$ there exists a non-increasing function $g_\theta(t)$ such that
$$\frac{p_\theta(x)}{p_{\theta_0}(x)} = g_\theta(T(x)).$$
Let $\nu$ be the measure such that for any event $A \subseteq \mathbb{R}$,
$$\nu(A) = \int I(T(x) \in A) \cdot p_{\theta_0}(x)\, \mu(dx).$$
Then for any event $A \subseteq \mathbb{R}$,
$$P_\theta(T(X) \in A) = \int g_\theta(T(x)) \cdot I(T(x) \in A) \cdot p_{\theta_0}(x)\, d\mu = \int_A g_\theta(t)\, \nu(dt). \quad (42)$$
The second equality follows from the standard approximation argument in measure theory: it holds for indicator functions $g_\theta(t) = I(t \in A')$, extends to simple step functions, and passes to the limit. Let $\omega$ be the product measure of $\nu$ and the Lebesgue measure on $[0, 1]$. Then for any event $B \subseteq \mathbb{R} \times [0, 1]$,
$$P_\theta\left( (T(X), U) \in B \right) = \int_B g_\theta(t)\, \omega(dt, du). \quad (43)$$
Note that $g_{\theta_0}(t) \equiv 1$, so
$$\omega(B) = P_{\theta_0}\left( (T(X), U) \in B \right). \quad (44)$$
Let $H(\cdot, \cdot)$ be the transformation such that $p = H(T, U)$. Then for any $z$,
$$\{(t, u) : G(t) < z\} \subseteq H^{-1}([0, z)) \subseteq H^{-1}([0, z]) \subseteq \{(t, u) : G(t) \leq z\}.$$
As a result, for any $0 < z_1 < z_2 < 1$,
$$H^{-1}([0, z_1]) \subseteq \{(t, u) : G(t) \leq z_1\}, \qquad H^{-1}([z_2, 1]) \subseteq \{(t, u) : G(t) \geq z_2\},$$
and hence, since $G$ is non-increasing, there exists $t(z_1, z_2)$ such that
$$t_2 \leq t(z_1, z_2) \leq t_1, \qquad \forall\, (t_1, u_1) \in H^{-1}([0, z_1]),\ (t_2, u_2) \in H^{-1}([z_2, 1]). \quad (45)$$
Given $0 \leq a_1 \leq a_2 < \frac{1}{2}$, let $A_1 = [a_1, a_2]$ and $A_2 = [1 - a_2, 1 - a_1]$. Then (45) and (43), together with the monotonicity of $g_\theta$, imply that
$$P_\theta(p \in A_1) = \int_{H^{-1}(A_1)} g_\theta(t)\, \omega(dt, du) \leq \omega(H^{-1}(A_1)) \cdot g_\theta(t(a_2, 1 - a_2)),$$
and
$$P_\theta(p \in A_2) = \int_{H^{-1}(A_2)} g_\theta(t)\, \omega(dt, du) \geq \omega(H^{-1}(A_2)) \cdot g_\theta(t(a_2, 1 - a_2)).$$
Recalling (44), we obtain that
$$\frac{P_\theta(p \in A_1)}{P_{\theta_0}(p \in A_1)} \leq \frac{P_\theta(p \in A_2)}{P_{\theta_0}(p \in A_2)}. \quad (46)$$
It is left to prove that $P_{\theta_0}(p \in A_1) = P_{\theta_0}(p \in A_2)$. In fact, we can prove that
$$p \sim U([0, 1]) \quad \text{when } \theta = \theta_0. \quad (47)$$
Fix any $z \in (0, 1)$ and let $G^{-1}(z) = \sup\{t : G(t) \geq z\}$. For clarity we write $u$ for $G^{-1}(z)$. Now we prove (47) in two cases:

• If $u$ is a continuity point of $G$, i.e. $G(u) = G(u^+)$: since $G$ is left-continuous, we must have $G(u) = G(u^+) = z$. Then
$$P_{\theta_0}(p \leq z) = P_{\theta_0}(T(X) \geq u) = G(u) = z.$$

• If $u$ is an atom of $G$, i.e. $G(u) > G(u^+)$: by definition, $G(u) \geq z$ and $z \geq G(u^+)$. Then
$$P_{\theta_0}(p \leq z) = P_{\theta_0}(T(X) > u) + P_{\theta_0}(T(X) = u) \cdot \frac{z - G(u^+)}{G(u) - G(u^+)} = G(u^+) + \left( G(u) - G(u^+) \right) \cdot \frac{z - G(u^+)}{G(u) - G(u^+)} = z.$$

This proves (47), which implies $P_{\theta_0}(p \in A_1) = a_2 - a_1 = P_{\theta_0}(p \in A_2)$. Combined with (46), we conclude that for any $\theta \leq \theta_0$, $P_\theta(p \in A_1) \leq P_\theta(p \in A_2)$, i.e. the p-value is mirror-conservative.
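As a numerical illustration of (47), take $T \sim \mathrm{Binomial}(5, 1/2)$ under $\theta_0$ (an example distribution of our choosing) and compute $P_{\theta_0}(p \leq z)$ for the fuzzy p-value (41) exactly:

```python
from math import comb

n = 5  # T ~ Binomial(n, 1/2) under theta_0
pmf = {t: comb(n, t) / 2 ** n for t in range(n + 1)}
G = {t: sum(pmf[s] for s in range(t, n + 1)) for t in range(n + 1)}  # G(t) = P(T >= t)
G_plus = {t: G[t] - pmf[t] for t in range(n + 1)}                    # G(t+) = P(T > t)

def cdf_fuzzy_p(z):
    """P(p <= z) under theta_0, where p = G(T+) + U * (G(T) - G(T+)), U ~ U[0, 1]."""
    total = 0.0
    for t in range(n + 1):
        frac = (z - G_plus[t]) / (G[t] - G_plus[t])  # G(t) - G(t+) = pmf[t] > 0
        total += pmf[t] * min(max(frac, 0.0), 1.0)
    return total

for z in (0.1, 0.37, 0.5, 0.9):
    assert abs(cdf_fuzzy_p(z) - z) < 1e-12  # exactly uniform, as claimed in (47)
```

Despite $T$ being discrete, the extra randomization $U$ makes the fuzzy p-value exactly uniform under $\theta_0$, which is what makes the equality $P_{\theta_0}(p \in A_1) = P_{\theta_0}(p \in A_2)$ in the proof possible.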