Interactive identification of individuals with positive treatment effect while controlling false discoveries
Boyan Duan, Larry Wasserman, Aaditya Ramdas
{boyand, larry, aramdas}@stat.cmu.edu
Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213
February 23, 2021
Abstract
Out of the participants in a randomized experiment with anticipated heterogeneous treatment effects, is it possible to identify which ones have a positive treatment effect, even though each has only taken either treatment or control but not both? While subgroup analysis has received attention, claims about individual participants are more challenging. We frame the problem in terms of multiple hypothesis testing: we think of each individual as a null hypothesis (the potential outcomes are equal, for example) and aim to identify individuals for whom the null is false (the treatment potential outcome stochastically dominates the control, for example). We develop a novel algorithm that identifies such a subset, with nonasymptotic control of the false discovery rate (FDR). Our algorithm allows for interaction: a human data scientist (or a computer program acting on the human's behalf) may adaptively guide the algorithm in a data-dependent manner to gain high identification power. We also propose several extensions: (a) relaxing the null to nonpositive effects, (b) moving from unpaired to paired samples, and (c) subgroup identification. We demonstrate via numerical experiments and theoretical analysis that the proposed method has valid FDR control in finite samples and reasonably high identification power.
Subgroup identification has been a major topic in the clinical trial community and the causal inference literature (see Lipkovich et al. (2017); Powers et al. (2018); Loh et al. (2019) and references therein). Typically, the treatment effect in the investigated population varies by the subject's gender, age, and other covariates. Identifying subjects with positive effects can help guide follow-up research and provide medication guidance. However, most existing methods do not have an error control guarantee at the level of the individual: it is possible that most subjects in the identified subgroup do not have positive effects. For example, an identified subgroup could be defined as "female subjects with age less than 40", but only 10% of them with age between 18 and 20 may truly have a positive treatment effect.

We propose to upper bound the proportion of falsely identified subjects (whose potential outcomes under treatment and control are equal, for example) via the language of hypothesis testing. As formalized in the next section, we interpret the problem of identifying subjects with positive effects as one of multiple hypothesis testing, where each subject corresponds to one null hypothesis, and the proportion of false identifications corresponds to a standard error metric, the false discovery rate (FDR). In this context, we propose algorithms with two appealing properties. First, they achieve a finite-sample guarantee on FDR control. Second, our proposed algorithms identify subjects with positive effects through an interactive procedure following a particular protocol: an analyst is allowed to look at an initially "masked" dataset (that is progressively "unmasked" over rounds of interaction), and combines available covariates with prior knowledge via flexible Bayesian (or black-box machine learning) working models to improve power.
In summary, she can combine the strengths of (automated) statistical modeling and (human-guided) scientific knowledge, all while avoiding selection bias and guaranteeing valid FDR control.

1 Problem setup

Consider a sample with $n$ subjects. Each subject $i$ has potential control outcome $Y_i^C$, potential treated outcome $Y_i^T$, and treatment indicator $A_i$, for $i \in [n] \equiv \{1, 2, \ldots, n\}$. Our results allow the potential outcomes to be viewed either as random variables or as fixed. The treatment effect of subject $i$ is defined as $Y_i^T - Y_i^C$, and the observed outcome is $Y_i = Y_i^C (1 - A_i) + Y_i^T A_i$ under the standard causal assumption of consistency ($Y_i = Y_i^T$ when $A_i = 1$ and $Y_i = Y_i^C$ when $A_i = 0$). Person $i$'s covariate is denoted as $X_i$. This paper focuses on completely randomized experiments without interference:

(i) conditional on covariates, treatment assignments are independent coin flips:
$$\mathbb{P}\left[(A_1, \ldots, A_n) = (a_1, \ldots, a_n) \mid X_1, \ldots, X_n\right] = \prod_{i=1}^n \mathbb{P}(A_i = a_i) = (1/2)^n, \quad (1)$$
for any $(a_1, \ldots, a_n) \in \{0, 1\}^n$;

(ii) conditional on covariates, the outcome of one subject $Y_i$ is independent of the assignment $A_j$ of another subject, for any $i \neq j$:
$$Y_i \perp A_j \mid \{X_1, \ldots, X_n\} \quad \text{for } i \neq j, \quad (2)$$
which is implied by (1) when the potential outcomes are viewed as fixed values.

We do not assume the observed data $(Y_i, A_i, X_i)$ are identically distributed. We consider heterogeneous effects in the sense that the distribution of $Y_i^T - Y_i^C$ varies, and aim at identifying those individuals with a positive treatment effect. (If the covariates $X_i$ are not informative about the heterogeneity in $Y_i^T - Y_i^C$, our identification power could be low, and this is to be expected.)
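As a concrete illustration of this setup, the following sketch (hypothetical code, not from the paper; all names are our own) simulates potential outcomes, fair-coin treatment assignments as in assumption (1), and observed outcomes under consistency:

```python
import numpy as np

# Toy simulation of the problem setup (illustrative; names are ours).
rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(size=(n, 2))               # covariates X_i
U = rng.normal(size=n)                     # zero-mean noise
Y_control = X @ np.array([1.0, -1.0]) + U  # potential control outcomes Y_i^C
effect = 2.0 * (X[:, 0] > 0.5)             # heterogeneous effect Y_i^T - Y_i^C
Y_treated = Y_control + effect             # potential treated outcomes Y_i^T

# Assumption (1): assignments are independent fair coin flips.
A = rng.integers(0, 2, size=n)

# Consistency: Y_i = Y_i^C (1 - A_i) + Y_i^T A_i.
Y_obs = Y_control * (1 - A) + Y_treated * A
```

Only `(Y_obs, A, X)` would be observed in practice; the two potential-outcome arrays play the role of the unknown ground truth.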
To formalize and frame the problem in terms of multiple hypothesis testing, we first define the null hypothesis for subject $i$ as having zero treatment effect:
$$H_{0i}^{\text{zero}}: (Y_i^T \mid X_i) \stackrel{d}{=} (Y_i^C \mid X_i), \quad (3)$$
or equivalently, $H_{0i}^{\text{zero}}: (Y_i \mid A_i = 1, X_i) \stackrel{d}{=} (Y_i \mid A_i = 0, X_i)$. Alternatively, we can treat the potential outcomes and covariates as fixed, and frame the null hypothesis as
$$H_{0i}^{\text{zero}}: Y_i^T = Y_i^C. \quad (4)$$
A last, hybrid, version (e.g., Howard and Pimentel (2020)) is to treat the two potential outcomes as random with joint distribution $(Y_i^T, Y_i^C) \mid X_i \sim P_i$, and the null posits
$$H_{0i}^{\text{zero}}: Y_i^T = Y_i^C \ \text{almost surely-}P_i, \quad (5)$$
meaning that $P_i$ is supported on $\{(x, y) : x = y\}$. Our work handles any interpretation.

In later sections, we describe an extension where we relax the null to those with a nonpositive effect, defined by stochastic dominance $(Y_i^T \mid X_i) \preceq (Y_i^C \mid X_i)$, meaning that $\mathbb{P}(Y_i^T \leq y \mid X_i) \geq \mathbb{P}(Y_i^C \leq y \mid X_i)$ for all $y$, or simply $Y_i^T \leq Y_i^C$ if the potential outcomes are fixed.

Our algorithms control the error of falsely identifying subjects whose null hypothesis is true (i.e., having zero effect), and aim at correctly identifying subjects with positive effects. Let $\succ$ denote strict stochastic dominance, as above. We say a subject has a positive effect if
$$(Y_i^T \mid X_i) \succ (Y_i^C \mid X_i). \quad (6)$$
When treating the potential outcomes and covariates as fixed, we simply write $Y_i^T > Y_i^C$.

The output of our proposed algorithms is a set of identified subjects, denoted as $\mathcal{R}$, with a guarantee that the expected proportion of falsely identified subjects is upper bounded. Specifically, denote the set of subjects that are true nulls as $\mathcal{H} := \{i \in [n] : H_{0i}^{\text{zero}} \text{ is true}\}$. Then the number of false identifications is $|\mathcal{R} \cap \mathcal{H}|$. The expected proportion of false identifications is a standard error metric, the false discovery rate (FDR):
$$\text{FDR} := \mathbb{E}\left[\frac{|\mathcal{R} \cap \mathcal{H}|}{\max\{|\mathcal{R}|, 1\}}\right]. \quad (7)$$
Given $\alpha \in (0, 1)$, we want $\text{FDR} \leq \alpha$, and reasonably high power, which is defined as the expected proportion of correctly identified subjects:
$$\text{power} := \mathbb{E}\left[\frac{|\mathcal{R} \cap \text{Pos}|}{\max\{|\text{Pos}|, 1\}}\right],$$
where $\text{Pos} := \{i : (Y_i^T \mid X_i) \succ (Y_i^C \mid X_i)\}$ or $\text{Pos} := \{i : Y_i^T > Y_i^C\}$ is the set of subjects with positive effects.

We note that our problem setup is not exactly the same as in most work on subgroup identification, such as Foster et al. (2011); Zhao et al. (2012); Imai and Ratkovic (2013). There, the identified subgroups are usually defined by functions of covariates, rather than as a subset of the investigated subjects as in our paper. While defining the subgroup by a function of covariates makes it easy to generalize findings from the investigated sample to a larger population, it does not seem straightforward to nonasymptotically control the error of false identifications under the former definition, which is a major distinction between previous studies and our work. Most existing work does not have an error control guarantee (see the overview in Lipkovich et al. (2017), Table XV), except for a few papers discussing error control at the level of subgroups, as opposed to the level of individuals as in our paper. The difference between FDR control at a subgroup level and at an individual level is detailed below.
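For a single realization, the two metrics just defined reduce to simple set ratios; the following toy sketch (made-up sets; FDR and power are expectations of these ratios over the randomness of the experiment) makes them concrete:

```python
# Toy computation of the two error metrics for one realization.
def fdp(R, H):
    # false discovery proportion: |R ∩ H| / max(|R|, 1)
    return len(R & H) / max(len(R), 1)

def true_positive_prop(R, Pos):
    # proportion of correctly identified subjects: |R ∩ Pos| / max(|Pos|, 1)
    return len(R & Pos) / max(len(Pos), 1)

R = {1, 2, 3, 4, 5}        # identified subjects (illustrative)
H = {4, 5, 6, 7}           # true nulls (unknown in practice)
Pos = {1, 2, 3, 8, 9, 10}  # subjects with positive effects (unknown)
```

Here `fdp(R, H)` is 2/5 and `true_positive_prop(R, Pos)` is 3/6; averaging such ratios over repeated experiments gives the FDR and the power.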
Subgroup FDR control.
Karmakar et al. (2018); Gu and Shen (2018); Xie et al. (2018) discuss FDR control at a subgroup level, where the latter two have little discussion on incorporating continuous covariates and require parametric assumptions on the outcomes. Thus, we follow the setup in Karmakar et al. (2018) to compare FDR control at a subgroup level (in their paper) and at an individual level (in our paper). Let the subgroups be non-overlapping sets $\{\mathcal{G}_1, \ldots, \mathcal{G}_G\}$. The null hypothesis for a subgroup $\mathcal{G}_g$ is defined as $H_g$: $H_{0i}^{\text{zero}}$ is true for all $i \in \mathcal{G}_g$, or equivalently, $H_g: \mathcal{G}_g \subseteq \mathcal{H}$ (recall $\mathcal{H}$ is the set of subjects with zero effect). Let $D_g$ be the 0/1 indicator of whether subgroup $\mathcal{G}_g$ is identified or not. The FDR at a subgroup level is defined as the expected proportion of falsely identified subgroups:
$$\text{FDR}_{\text{subgroup}} := \mathbb{E}\left[\frac{|\{g \in [G] : \mathcal{G}_g \subseteq \mathcal{H}, D_g = 1\}|}{\max\{|\{g \in [G] : D_g = 1\}|, 1\}}\right], \quad (8)$$
which collapses to the FDR at an individual level as defined in (7) when each subgroup has exactly one subject. Although our interactive procedure is designed for FDR control at an individual level, we propose extensions to FDR control at a subgroup level in Appendix A.1. As a brief summary, Karmakar et al. (2018) propose to control $\text{FDR}_{\text{subgroup}}$ by constructing a $p$-value for each subgroup and applying the classical BH method (Benjamini and Hochberg, 1995). While their method has many orthogonal benefits (e.g., handling observational studies), it is not trivially applicable to control FDR at an individual level, because their $p$-values would be too coarse to be informative for a subgroup containing a single subject.

Other related error control at a subgroup level. Cai et al.
(2011) and Athey and Imbens (2016) develop confidence intervals for the average treatment effect within subgroups, where the former assumes the size of each subgroup to be large, and the latter requires a separate sample for inference. These intervals can potentially be used to generate a $p$-value for each subgroup and control FDR at a subgroup level via standard multiple testing procedures, but no explicit discussion is provided. Lipkovich et al. (2011), Lipkovich and Dmitrienko (2014), Sivaganesan et al. (2011) and Berger et al. (2014) propose methods with control of a different error metric, the global type-I error: the probability of identifying any subgroup when no subject has a nonzero treatment effect (i.e., $H_{0i}^{\text{zero}}$ is true for all subjects). Our FDR control guarantee implies valid global type-I error control, and FDR control is more informative about the correctness of the identified subgroups/subjects when there exist subjects having nonzero effects.

As discussed, it appears to be new and practically interesting to provide FDR control guarantees at an individual level. Another merit of our proposed method is that it allows a human analyst and an algorithm to interact, in order to better accomplish the goal. Interactive testing is a recent idea that emerged in response to the growing practical need of allowing human interaction in the process of data analysis. In practice, analysts tend to try several methods or models on the same dataset until the results are satisfying, but this violates the validity of standard testing methods (e.g., invalid FDR control).
In our context of identifying positive effects, the appealing advantages of an interactive test include that (a) an analyst is allowed to use (partial) data, together with prior knowledge, to design a strategy for selecting subjects potentially having positive effects, and (b) it is a multi-step iterative procedure during which the analyst can monitor the performance of the current strategy and make adjustments at any step (at the cost of not altering earlier steps). Despite the analyst's flexibility to design and alter the algorithm using (partial) data, our proposed procedure always maintains valid FDR control. We name our proposed algorithm I³ (I-cube), for interactive identification of individual treatment effects.
Figure 1.
A schematic of the I³ algorithm. All treatment assignments are initially kept hidden: only $(Y_i, X_i)_{i \in [n]}$ are revealed to the analyst, while all $\{A_i\}$ remain 'masked'. The initial candidate rejection set is $\mathcal{R}_0 = [n]$ (thus no subject is excluded initially and $i^* = \emptyset$). The false discovery proportion $\widehat{\text{FDR}}$ of the current candidate set $\mathcal{R}_t$ is estimated by the algorithm (dashed lines), and reported to the analyst. If $\widehat{\text{FDR}}(\mathcal{R}_t) > \alpha$, the analyst chooses a subject $i^*_t$ to remove from the proposed rejection set, $\mathcal{R}_t = \mathcal{R}_{t-1} \setminus \{i^*_t\}$, whose assignment $A_{i^*_t}$ is then 'unmasked' (revealed). Importantly, using any available prior information, covariates and working model, the analyst can choose subject $i^*_t$ and shrink $\mathcal{R}_t$ in any manner. This process continues until $\widehat{\text{FDR}}(\mathcal{R}_t) \leq \alpha$ (or $\mathcal{R}_t = \emptyset$).

The algorithm initially hides the treatment assignments $\{A_i\}_{i=1}^n$ from the analyst, and then alternates between two steps, selection and error control, until a simple stopping criterion introduced later is reached.

1. Selection. Consider a set of candidate subjects to be identified as having a positive effect (whose nulls are to be rejected), denoted as the rejection set $\mathcal{R}_t$ for iteration $t$. We start with all subjects included, $\mathcal{R}_0 = [n]$. At each iteration, the analyst excludes probable nulls (i.e., subjects that are unlikely to have positive effects) from the previous $\mathcal{R}_{t-1}$, using all the available information: outcomes $Y_i$ and covariates $X_i$ for all subjects $i \in [n]$, the $A_i$ progressively unmasked in the error control step, and possible prior information. Note that our method does not automatically use prior information and the revealed data. The analyst is free to use any black-box prediction algorithm that uses the available information, and to rank the subjects, possibly by an estimated probability of having a positive treatment effect. This step is where a human is allowed to incorporate her subjective choices.

2. Error control (and unmasking).
The algorithm uses the complete data $\{Y_i, A_i, X_i\}$ to estimate the FDR of the current candidate rejection set, $\widehat{\text{FDR}}(\mathcal{R}_t)$, as feedback to the analyst. If the estimated FDR is above the target level, $\widehat{\text{FDR}}(\mathcal{R}_t) > \alpha$, the analyst goes back to the selection step, along with additional information: the excluded subjects ($i \notin \mathcal{R}_t$) have their $A_i$ unmasked (revealed), which could improve her understanding of the data and guide her choices in the next selection step.

The algorithms we propose in the main paper build on and modify the above procedure to achieve reasonably high power and develop various extensions. An illustration of the identifications made by the Crossfit-I³ (our central algorithm) is in Figure 2.

[Figure 2 panels: (a) subjects in the treated group and control group, separated by two shapes and colors; (b) blue round dots represent subjects with a true positive effect (unknown ground truth); (c) green round dots represent subjects identified by the Crossfit-I³.]

Figure 2.
An illustrative example with 1000 subjects, each with two covariates that are uniform in $[0, 1]$. The Crossfit-I³ identifies most subjects with positive effects, although about half of them did not receive treatment.

Outline.
The rest of the paper is organized as follows. In Section 2, we describe an interactive algorithm wrapped in a cross-fitting framework, which identifies subjects with positive effects under FDR control. We evaluate our proposed algorithm numerically in Section 3, and provide a theoretical power analysis in simple settings in Section 4. We point out several extensions of the proposed algorithm in Sections 5 and 6. Section 7 concludes the paper with a discussion of the potential of our proposed interactive procedures.
2 An interactive algorithm with FDR control
To enable us to effectively infer the treatment effect, we use the following working model:
$$Y_i^C = f(X_i) + U_i \quad \text{and} \quad Y_i^T = \Delta(X_i) + f(X_i) + U_i, \quad (9)$$
where $U_i$ is zero-mean noise (unexplained variance) that is independent of $A_i$. When working with such a model, we effectively want to identify subjects with a positive treatment effect $\Delta(X_i)$. Importantly, model (9) need not be correctly specified or accurately reflect reality in order for the algorithms in this paper to have valid FDR control (but the more ill-specified or inaccurate the model is, the more the power may be hurt).

To identify subjects with positive effects, we first introduce an estimator of the treatment effect $\Delta(X_i)$ following the working model (9). Denote the expected outcome given the covariates as $m(X_i) := \mathbb{E}(Y_i \mid X_i)$, and let $\widehat{m}(X_i)$ be an arbitrary estimator of $m(X_i)$ using the outcomes and covariates $\{Y_i, X_i\}_{i=1}^n$. Define the residual as $E_i := Y_i - \widehat{m}(X_i)$; an estimator of $\Delta(X_i)$ is
$$\widehat{\Delta}_i := 4(A_i - 1/2) \cdot E_i, \quad (10)$$
which, under randomized experiments, is equivalent to the nonparametric estimator of the conditional average treatment effect $\mathbb{E}(Y_i^T \mid X_i) - \mathbb{E}(Y_i^C \mid X_i)$ in several recent papers (Nie and Wager, 2020; Kennedy, 2020), and can be traced back to the semiparametrics literature with Robinson (1988). A critical property of $\widehat{\Delta}_i$ that ensures FDR control is that
$$\mathbb{P}\left(\widehat{\Delta}_i > 0 \,\middle|\, \{Y_j, X_j, E_j\}_{j=1}^n\right) \leq 1/2, \quad (11)$$
under $H_{0i}^{\text{zero}}$ (for any definition in (3), (4), or (5)), because $H_{0i}^{\text{zero}}$ implies $A_i \perp\!\!\!\perp \{Y_i, X_i\}$ and $\mathbb{P}(A_i - 1/2 > 0) = 1/2$. Recall that in Figure 1, the treatment assignments $A_i$ are hidden from the analyst in the selection step, which is reflected in (11) by $A_i$ being omitted from the conditioning. The above property indicates that the estimated effect $\widehat{\Delta}_i$ is no more likely to be positive than negative if the selected subject has zero effect, regardless of how the analyst decides which subject to select. Therefore, the sign of $\widehat{\Delta}_i$ can be used to estimate the number of false identifications and achieve FDR control, which we elaborate next.

This section presents the I³ with valid FDR control; we introduce a modification based on cross-fitting that improves identification power in the next section. The I³ proceeds by progressively shrinking a candidate rejection set $\mathcal{R}_t$ at iteration $t$,
$$[n] = \mathcal{R}_0 \supseteq \mathcal{R}_1 \supseteq \ldots \supseteq \mathcal{R}_n = \emptyset,$$
where recall $[n]$ denotes the set of all subjects. We assume without loss of generality that one subject is excluded in each step. Denote the subject excluded at iteration $t$ as $i^*_t$. The choice of $i^*_t$ can use the information available to the analyst before iteration $t$, formally defined as a filtration (sequence of nested $\sigma$-fields):
$$\mathcal{F}_{t-1} = \sigma\left(\{Y_j, X_j\}_{j \in \mathcal{R}_{t-1}}, \{Y_j, A_j, X_j\}_{j \notin \mathcal{R}_{t-1}}, \sum_{j \in \mathcal{R}_{t-1}} \mathbb{1}\{\widehat{\Delta}_j > 0\}\right), \quad (12)$$
where we unmask (reveal) the treatment assignments $A_j$ for subjects excluded from $\mathcal{R}_{t-1}$, and the sum $\sum_{j \in \mathcal{R}_{t-1}} \mathbb{1}\{\widehat{\Delta}_j > 0\}$ is mainly used for FDR estimation as we describe later. The above available information also includes the residuals $\{E_j\}_{j=1}^n$ defined above equation (10). (Note that property (11) uses the fact that the outcome estimator $\widehat{m}(X_i)$ is independent of $A_i$, so it is important that the estimation of $\widehat{m}$ does not use the assignments $\{A_i\}_{i=1}^n$; this should not affect the estimation much, because $m(X_i) \equiv \mathbb{E}(Y_i \mid X_i)$ is not a function of $A_i$.)

To control FDR, the number of false identifications is estimated via property (11).
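Property (11) can be checked empirically in a small all-null simulation (an illustrative sketch of ours, with an assumed pooled-mean fit for $\widehat{m}$): because $\widehat{m}$ never sees the assignments, the sign of $\widehat{\Delta}_i$ behaves like a fair coin under the null.

```python
import numpy as np

# Empirical check of property (11) in an assumed all-null simulation:
# outcomes ignore treatment, and m_hat is fit from (Y, X) only (here a
# pooled mean), so the sign of delta_hat_i is a fair coin.
rng = np.random.default_rng(1)
n, reps = 2000, 200
pos_fracs = []
for _ in range(reps):
    Y = rng.normal(size=n)           # zero-effect nulls for all subjects
    A = rng.integers(0, 2, size=n)   # coin-flip assignments
    m_hat = Y.mean()                 # estimator of E(Y | X); never sees A
    E = Y - m_hat                    # residuals E_i
    delta_hat = 4 * (A - 0.5) * E    # estimator (10)
    pos_fracs.append(np.mean(delta_hat > 0))
avg = float(np.mean(pos_fracs))      # close to 1/2, matching property (11)
```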
The idea is to partition the candidate rejection set $\mathcal{R}_t$ into $\mathcal{R}_t^+$ and $\mathcal{R}_t^-$ by the sign of $\widehat{\Delta}_i$:
$$\mathcal{R}_t^- := \{i \in \mathcal{R}_t : \widehat{\Delta}_i \leq 0\}, \qquad \mathcal{R}_t^+ := \{i \in \mathcal{R}_t : \widehat{\Delta}_i > 0\}.$$
Notice that our proposed procedure only identifies the subjects whose estimated effect is positive, i.e., those in $\mathcal{R}_t^+$. Thus, the FDR is $\mathbb{E}\left[\frac{|\mathcal{R}_t^+ \cap \mathcal{H}|}{\max\{|\mathcal{R}_t^+|, 1\}}\right]$ by definition, where recall $\mathcal{H}$ is the set of true nulls. Intuitively, the number of false identifications $|\mathcal{R}_t^+ \cap \mathcal{H}|$ can be approximately upper bounded by $|\mathcal{R}_t^- \cap \mathcal{H}|$, since the number of positive signs should be no larger than the number of negative signs among the falsely identified nulls, according to property (11). The set of true nulls $\mathcal{H}$ is unknown, so we use $|\mathcal{R}_t^-|$ to upper bound $|\mathcal{R}_t^- \cap \mathcal{H}|$, and propose an estimator of FDR for the candidate rejection set $\mathcal{R}_t$:
$$\widehat{\text{FDR}}(\mathcal{R}_t) = \frac{|\mathcal{R}_t^-| + 1}{\max\{|\mathcal{R}_t^+|, 1\}}. \quad (13)$$
Overall, the I³ shrinks $\mathcal{R}_t$ until time $\tau := \inf\{t : \widehat{\text{FDR}}(\mathcal{R}_t) \leq \alpha\}$ and identifies only the subjects in $\mathcal{R}_\tau^+$, as summarized in Algorithm 1. We state the FDR control of the I³ in Theorem 1; the proof can be found in Appendix B.1.

Theorem 1. In a randomized experiment with assumptions (1) and (2), and for any analyst that updates their working model(s) at any iteration $t$ using the information in $\mathcal{F}_{t-1}$, the set $\mathcal{R}_\tau^+$ rejected by the I³ algorithm has FDR controlled at level $\alpha$, meaning that
$$\mathbb{E}\left[\frac{|\mathcal{R}_\tau^+ \cap \mathcal{H}|}{\max\{|\mathcal{R}_\tau^+|, 1\}}\right] \leq \alpha,$$
for any definition of the null hypothesis (3), (4) or (5). For the last definition, FDR control also holds conditional on the covariates and potential outcomes.

Algorithm 1
The I³ (interactive identification of individual treatment effects) procedure.

Initial state:
Explorer (E) knows the covariates and outcomes $\{X_i, Y_i\}_{i=1}^n$.
Oracle (O) knows the treatment assignments $\{A_i\}_{i=1}^n$.
The target FDR level $\alpha$ is public knowledge.

Initial exchange: Both players initialize $\mathcal{R}_0 = [n]$ and set $t = 1$.
1. E builds a prediction model $\widehat{m}$ from $X_i$ to $Y_i$.
2. E informs O about the residuals $E_i \equiv Y_i - \widehat{m}(X_i)$.
3. O estimates the treatment effect as $\widehat{\Delta}_i \equiv 4(A_i - 1/2) E_i$.
4. O then divides $\mathcal{R}_t$ into $\mathcal{R}_t^- := \{i \in \mathcal{R}_t : \widehat{\Delta}_i \leq 0\}$ and $\mathcal{R}_t^+ := \{i \in \mathcal{R}_t : \widehat{\Delta}_i > 0\}$.
5. O reveals only $|\mathcal{R}_t^+|$ to E (who infers $|\mathcal{R}_t^-|$).

Repeated interaction:
6. E checks if $\widehat{\text{FDR}}(\mathcal{R}_t) \equiv \frac{|\mathcal{R}_t^-| + 1}{\max\{|\mathcal{R}_t^+|, 1\}} \leq \alpha$.
7. If yes, E sets $\tau = t$, reports $\mathcal{R}_\tau^+$ and exits.
8. Else, E picks any $i^*_t \in \mathcal{R}_{t-1}$ using everything E currently knows. (E tries to pick an $i^*_t$ that E thinks is null, i.e., E hopes that $\widehat{\Delta}_{i^*_t} \leq 0$.)
9. O reveals $A_{i^*_t}$ to E, who also infers $\widehat{\Delta}_{i^*_t}$ and its sign.
10. E updates $\mathcal{R}_{t+1} = \mathcal{R}_t \setminus \{i^*_t\}$, and also $|\mathcal{R}_{t+1}^+|$ and $|\mathcal{R}_{t+1}^-|$.
11. Increment $t$ and go back to Step 6.

Consider a simple case where model (9) is accurate for every subject, with a constant treatment effect $\Delta(X_i) = \delta > 0$. If $\delta$ is larger than the maximum noise, we have $\mathcal{R}_0^+ = [n]$, and the algorithm can stop at the very first step, identifying all subjects. At the other extreme, if the effect $\delta$ is too small, the algorithm may return an empty set, and this makes sense: while small average treatment effects can be learned using a large population, larger treatment effects are needed for individual-level identification.

Related work. Testing procedures that allow human interaction were first proposed by Lei and Fithian (2018) and Lei et al. (2020) for the problem of FDR control in multiple testing, followed by several works for other error metrics (Duan et al., 2020a). These papers focus on generic multiple testing problems, which operate on the $p$-values and ignore the process of generating $p$-values from data. In contrast, Duan et al. (2020b) apply the idea of interactive testing to observed data, and our paper relates most closely to that work. Both works propose tests for treatment effects; the difference is that Duan et al. (2020b) test whether any subject has a nonzero effect with type-I error control, whereas our proposed algorithm aims at identifying subjects having positive effects with FDR control. While the former may appear in an exploratory analysis to see whether the treatment has any effect on any person, the latter is useful to characterize the population where the treatment has an effect.

We end the section with a remark. In step 8 of Algorithm 1, we hope to exclude subjects that are unlikely to have positive effects, based on the revealed data in $\mathcal{F}_{t-1}$. In other words, we should guess the sign of $\widehat{\Delta}_i$, which depends on both the revealed data $\{Y_i, X_i\}$ and the hidden assignment $A_i$. However, notice that at the first iteration we may learn/guess the opposite signs for all the subjects: when all assignments $\{A_i\}_{i=1}^n$ are hidden at $t = 1$, the likelihood of $\{A_i\}_{i=1}^n$ being the true values (leading to all correct signs for $\widehat{\Delta}_i$) is the same as the likelihood of all opposite values (leading to all opposite signs for $\widehat{\Delta}_i$), no matter what working model we use. Consequently, the subjects with large positive effects could be guessed as having large negative effects, causing them to be excluded from the rejection set. To improve power, we propose to wrap the I³ in a cross-fitting framework, as described in the next section.
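For intuition, the shrinking loop of Algorithm 1 with one fully automated selection rule (exclude the candidate with the smallest $|\widehat{\Delta}_i|$ first, which is allowed since $|\widehat{\Delta}_i| = 2|E_i|$ reveals nothing about $A_i$) can be sketched as follows; this is our illustrative code, not the authors' implementation:

```python
import numpy as np

# Minimal automated sketch of the I^3 shrinking loop (illustrative only).
# Candidates are excluded in increasing order of |delta_hat_i|, an allowed
# rule because |delta_hat_i| = 2|E_i| does not depend on A_i, until the
# FDR estimate (13) drops below alpha.
def i_cube_automated(delta_hat, alpha):
    order = list(np.argsort(np.abs(delta_hat)))   # exclusion order
    R = set(range(len(delta_hat)))                # candidate set R_t
    while True:
        n_pos = sum(delta_hat[j] > 0 for j in R)  # |R_t^+|
        n_neg = len(R) - n_pos                    # |R_t^-|
        if (n_neg + 1) / max(n_pos, 1) <= alpha:  # estimator (13)
            return {j for j in R if delta_hat[j] > 0}   # reject R_t^+
        if not order:
            return set()
        R.discard(order.pop(0))                   # exclude i*_t; A_{i*_t} is unmasked

# Example: strong positives survive; weak or negative estimates are excluded.
demo = np.array([5.0] * 8 + [-0.1, 0.1])
rejected = i_cube_automated(demo, alpha=0.2)
```

In this toy run, excluding the single small negative estimate brings the FDR estimate from 2/9 down to 1/9, so the nine positive estimates are rejected.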
The Crossfit-I³

Cross-fitting refers to the idea of splitting the sample into two halves. We perform the I³ on each half separately, so that for each half, the complete data (including the assignments) of the other half is revealed to the analyst to help infer the sign of the treatment effect, addressing the issue of learning the opposite signs and improving the identification power.

Specifically, split the subjects randomly into two sets of equal size, denoted as I and II, where I $\cup$ II $= [n]$. The I³ (Algorithm 1) is implemented on each set separately: at the start of the I³ on set I, the analyst has access to the complete data for all subjects in set II, and tries to identify subjects with positive effects in set I with FDR control at level $\alpha/2$; similarly for the I³ on set II. Mathematically, let the candidate rejection set when implementing the I³ on set I be $\mathcal{R}_t(\text{I})$, where the initial set is $\mathcal{R}_0(\text{I}) = \text{I}$. The available information at iteration $t$ is defined as:
$$\mathcal{F}_{t-1}(\text{I}) = \sigma\left(\{Y_i, X_i\}_{i \in \mathcal{R}_{t-1}(\text{I})}, \{Y_j, A_j, X_j\}_{j \notin \mathcal{R}_{t-1}(\text{I})}, \sum_{i \in \mathcal{R}_{t-1}(\text{I})} \mathbb{1}\{\widehat{\Delta}_i > 0\}\right), \quad (14)$$
which includes the complete data $\{Y_j, A_j, X_j\}$ for $j \in \text{II}$ at any iteration $t \geq 1$. Similarly, we define $\mathcal{R}_t(\text{II})$ and $\mathcal{F}_{t-1}(\text{II})$ for the I³ implemented on set II. The final rejection set is the union of the rejections in I and II (see Algorithm 2). We call this algorithm the Crossfit-I³. As long as the I³ runs on the two sets do not exchange information, Algorithm 2 has valid FDR control (see the proof in Appendix B.2).

Theorem 2. Under assumptions (1) and (2) of randomized experiments, the set $\mathcal{R}_\tau^+$ rejected by the Crossfit-I³ has FDR controlled at level $\alpha$ for any of the null hypotheses (3), (4) or (5). For the last case, FDR control also holds conditional on the covariates and potential outcomes.

(For notational clarity, we use $i$ to denote candidate subjects $i \in \mathcal{R}_t(\text{I})$ and $j$ for non-candidate subjects $j \notin \mathcal{R}_t(\text{I})$, while $k$ is used to index all subjects $k \in [n]$.)

Algorithm 2
The Crossfit-I³.
Input:
Covariates, outcomes, treatment assignments $\{Y_i, A_i, X_i\}_{i=1}^n$, and target level $\alpha$.
Procedure:
1. Randomly split the sample into two subsets of equal size, denoted as I and II.
2. Implement Algorithm 1 at level $\alpha/2$, where E initially knows $\{Y_k, X_k\}_{k=1}^n \cup \{A_j\}_{j \in \text{II}}$ and sets $\mathcal{R}_0(\text{I}) = \text{I}$, getting a rejection set $\mathcal{R}_\tau^+(\text{I}) \subseteq \text{I}$.
3. Implement Algorithm 1 at level $\alpha/2$, where E initially knows $\{Y_k, X_k\}_{k=1}^n \cup \{A_j\}_{j \in \text{I}}$ and sets $\mathcal{R}_0(\text{II}) = \text{II}$, getting a rejection set $\mathcal{R}_\tau^+(\text{II}) \subseteq \text{II}$.
4. Combine the two rejection sets into the final rejection set, $\mathcal{R}_\tau^+ = \mathcal{R}_\tau^+(\text{I}) \cup \mathcal{R}_\tau^+(\text{II})$.

In addition to addressing the issue of learning the opposite sign of $\widehat{\Delta}_i$ in the original I³, another benefit of the cross-fitting framework is that, with the complete data revealed for at least half of the sample, the analyst does not have to infer missing data (the assignment $A_i$), which would probably require some parametric probabilistic modeling and the EM algorithm. Instead, because the assignments are revealed for subjects not in the candidate rejection set (at least half of the sample), their signs of $\widehat{\Delta}_j$ can be correctly calculated and used as "training data". The analyst can then employ a black-box prediction model, such as a random forest, to predict the signs of $\widehat{\Delta}_i$ for the subjects whose assignments are masked (hidden). As an example, we propose the following automated strategy to select a subject at step 8 of Algorithm 1.

Algorithm 3
An automated heuristic to select $i^*_t$ in the Crossfit-I³.
Input: Current rejection set $\mathcal{R}_{t-1}(\text{I})$, and available information for selection $\mathcal{F}_{t-1}(\text{I})$.
Procedure:
1. Train a random forest classifier where the label is $\text{sign}(\widehat{\Delta}_j)$ and the predictors are $Y_j$, $X_j$ and the residuals $E_j$, using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(\text{I})$.
2. Estimate the probability of $\widehat{\Delta}_i$ being positive as $\widehat{p}(i, t)$ for subjects $i \in \mathcal{R}_{t-1}(\text{I})$.
3. Select $i^*_t = \arg\min\{\widehat{p}(i, t) : i \in \mathcal{R}_{t-1}(\text{I})\}$.

We remark that in practice, the analyst can interactively change the prediction model, for example exploring parametric models to see which fits the data better. In principle, the analyst can perform any exploratory analysis on the data in $\mathcal{F}_{t-1}(\text{I})$ to decide on a heuristic or score for selecting subject $i^*_t$; the FDR control remains valid as long as she does not use the assignments $A_i$ of candidate subjects $i \in \mathcal{R}_{t-1}(\text{I})$. For computational efficiency, we usually update the prediction models (or their parameters) once every 100 iterations (say).

To summarize, the Crossfit-I³ described in Algorithm 2 involves two rounds of the I³ (Algorithm 1), where step 8 of selecting a subject is allowed to involve human interaction; alternatively, step 8 can be an automated heuristic as presented in Algorithm 3. Recall the illustrative example in Figure 2, where we implement the Crossfit-I³ with the above automated selection strategy. Each subject is recorded with two covariates in $[0, 1]$, and the Crossfit-I³ correctly identifies most of the subjects with positive effects (see Figure 2c). Next, we demonstrate through repeated numerical experiments and theoretical analysis that the Crossfit-I³ has reasonably high power.

3 Numerical experiments

To assess our proposed procedure, we first describe a baseline method, which calculates a $p$-value for each subject under the assumption of linear models, and applies the classical BH method (Benjamini and Hochberg, 1995). We call this method the linear-BH procedure.

3.1 A baseline: the BH procedure under linear assumptions

For the treated group and the control group, we separately learn a linear model to predict $Y_i$ from $X_i$, denoted as $\widehat{l}_T$ and $\widehat{l}_C$.
By imputing the unobserved potential outcomes, we get estimators of the potential outcomes $\widetilde{Y}_i^T = Y_i \mathbb{1}\{A_i = 1\} + \widehat{l}_T(X_i) \mathbb{1}\{A_i = 0\}$ and $\widetilde{Y}_i^C = \widehat{l}_C(X_i) \mathbb{1}\{A_i = 1\} + Y_i \mathbb{1}\{A_i = 0\}$, and the treatment effect for subject $i$ can be estimated as $\widehat{\Delta}_i^{BH} := \widetilde{Y}_i^T - \widetilde{Y}_i^C$. If the potential outcomes are linear functions of the covariates with standard Gaussian noise (which we refer to as the linear assumption), the estimated treatment effect asymptotically follows a Gaussian distribution. For each subject $i \in [n]$, we calculate a $p$-value for the zero-effect null (3) as
$$P_i = 1 - \Phi\left(\widehat{\Delta}_i^{BH} \Big/ \sqrt{\widehat{\text{Var}}(\widehat{\Delta}_i^{BH})}\right), \quad (15)$$
where the estimated variance is $\widehat{\text{Var}}(\widehat{\Delta}_i^{BH}) = \widehat{\text{Var}}(\widetilde{Y}_i^T) + \widehat{\text{Var}}(\widetilde{Y}_i^C)$, and $\Phi$ denotes the CDF of a standard Gaussian. To identify subjects having positive effects with FDR control, we apply the BH procedure to the above $p$-values. Notice that the error control does not hold when the linear assumption is violated (see Appendix B.4 for the formal FDR control guarantee).

We run a simulation with 500 subjects ($n = 500$). Each subject is recorded with two binary attributes (e.g., female/male and senior/junior) and one continuous attribute (e.g., body weight), denoted as a vector $X_i = (X_i(1), X_i(2), X_i(3)) \in \{0, 1\}^2 \times \mathbb{R}$. Among the $n$ subjects, the binary attributes are marginally balanced, and the subpopulation with $X_i(1) = 1$ and $X_i(2) = 1$ is of size 30. The continuous attribute is independent of the binary ones and follows a standard Gaussian distribution.

The outcomes are simulated as a function of the covariates $X_i$ and the assignment $A_i$ following the generating model (9). Recall that we previously used model (9) as a working model, which is not required to be correctly specified.
Here, we generate the data from such a model in the simulation for a clean evaluation of the considered methods. We specify the noise U_i as a standard Gaussian, the expected control outcome as f(X_i) = 5(X_i(1) + X_i(2) + X_i(3)), and the treatment effect as

∆(X_i) = S_∆ · [5 X_i(3) 1{X_i(3) > 0} − X_i(1)/2], (16)

where S_∆ > 0 controls the scale of the treatment effect. More experiments can be found in Appendix D.1.
Figure 3.
FDR (left) and power (right) of the Crossfit-I³ compared with the linear-BH procedure, with the treatment effect specified as model (16) and the scale S_∆ varied over six levels. The FDR control level is 0.2, marked by a horizontal line in the error-control plots. For all plots in this paper, the FDR and power are averaged over 500 repetitions. The linear-BH procedure does not have valid FDR control because the treatment effect is nonlinear, whereas the Crossfit-I³ controls FDR and can achieve high power. R code to fully reproduce all plots in the paper is available at https://github.com/duanby/I-cube.

As shown in Figure 3, the Crossfit-I³ controls FDR at the target level, as expected. At the same time, the Crossfit-I³ appears to have power comparable to the linear-BH procedure for correctly identifying subjects with true positive effects.

In addition to the numerical experiments, we provide a theoretical power analysis in some simple cases to understand the advantages and limitations of our proposed Crossfit-I³.

First, consider the case without covariates. Our analysis is inspired by the work of Arias-Castro and Chen (2017) and Rabinovich et al. (2020), who study the power of methods with FDR control under a sparse Gaussian sequence model. Let there be n hypotheses, each associated with a test statistic V_i for i = 1, ..., n. They consider a class of methods called threshold procedures, whose final rejection set R has the form R = {i : V_i ≥ τ(V_1, ..., V_n)} for some threshold τ(V_1, ..., V_n); they discuss two types of thresholds (see Appendix C for details of their results). An example of a threshold procedure is the BH procedure. Our proposed I³ can also be simplified to a threshold procedure when using an automated selection strategy at step 8 of Algorithm 1: at each iteration, we exclude the subject with the smallest absolute estimated treatment effect |∆̂_i| (note that this strategy satisfies our requirement of not using the assignments, since |∆̂_i| = |4(A_i − 1/2)(Y_i − m̂(X_i))| = 2|Y_i − m̂(X_i)|). The resulting (simplified and automated) I³ is a threshold procedure with V_i = ∆̂_i.
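This simplified exclusion rule is a one-liner in code; a minimal sketch, where the function name and the data layout are ours:

```python
def exclude_next_simplified(candidates, delta_hat):
    # Exclude the candidate with the smallest |delta_hat_i|; since
    # |delta_hat_i| = 2|Y_i - m_hat(X_i)|, the rule reveals nothing
    # about the treatment assignments of the candidates.
    return min(candidates, key=lambda i: abs(delta_hat[i]))
```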
Note that our original interactive procedure is highly flexible, which makes its power analysis less tractable, so we discuss the power of the Crossfit-I³ with the above simplified selection strategy.

To contextualize our power analysis, we paraphrase one of the results of Arias-Castro and Chen (2017) and Rabinovich et al. (2020). Assume the test statistics V_i ∼ N(µ_i, 1) are independent, with µ_i = 0 under the null and µ_i = µ > 0 under the alternative. There are n₁ non-nulls among the n hypotheses, and the sparsity of the non-nulls is parameterized by β ∈ (0, 1) such that n₁/n = n^(−β). Let the signal µ grow with n as µ = √(2r log n), where the signal strength is encoded by r ∈ (0, 1). The asymptotic power is characterized by the signal strength r and the sparsity β, which are also the critical parameters characterizing the power in our context, as we state later. These authors effectively prove that for any fixed FDR level α ∈ (0, 1), no threshold procedure can have nontrivial power if r < β, but there exist threshold procedures with asymptotic power one if r > β.

Our analysis differs from theirs in the non-null distribution of the test statistics. Given n subjects, suppose the potential outcomes for subject i are distributed as Y_i^C ∼ N(0, 1) and Y_i^T ∼ N(µ_i, 1), where µ_i = 0 if subject i is null and µ_i = µ > 0 if subject i is non-null. Thus, the observed outcome of a null subject is N(0, 1), while that of a non-null subject is a mixture of N(µ, 1) and N(0, 1), depending on whether the subject is treated.

Theorem 3.
Given a fixed FDR level α ∈ (0, 1), let the number of subjects n go to infinity. When there is no covariate, the automated Crossfit-I³ and the linear-BH procedure have the same asymptotic power: if r < β, their power goes to zero; if r > β, their power goes to 1/2. Further, restricted to the treated subjects, their power goes to one.

Remark 1.
The power of both methods cannot converge to a value larger than 1/2 because, without covariates, we cannot differentiate between subjects with zero effect (whose outcome follows a standard Gaussian whether treated or not) and untreated subjects with positive effects (whose outcome also follows a standard Gaussian). The proportion of untreated subjects among those with positive effects is 1/2 because of the assumed randomization.

The above theorem discusses the case where there are no covariates to help guess which untreated subjects have positive effects. Next, we consider the case with an "ideal" covariate X_i: its value indicates whether a subject is non-null (has a positive effect), i.e., X_i = 1{µ_i > 0}. Here, we design the selection strategy (for step 8 of Algorithm 1) as a function of the covariates, because we expect subjects with similar covariates to have similar treatment effects. Specifically, for the I³ implemented on set I, we learn a prediction of ∆̂_j from X_j using the non-candidate subjects j ∈ I^c: Pred(x) = (1/|I^c|) ∑_{j∈I^c} ∆̂_j 1{X_j = x}, where x ∈ {0, 1}. Then, among the candidate subjects i ∈ I, we exclude the ones whose Pred(X_i) is lower. Because this strategy pools information across subjects with the same covariate value, all non-null subjects (i.e., those with X_i = 1) are excluded after the nulls with probability tending to one, regardless of whether they are treated; hence we achieve power one.

Theorem 4.
Given a fixed FDR level α ∈ (0, 1), let the number of subjects n go to infinity. With the covariate X_i = 1{µ_i > 0}, the power of the automated Crossfit-I³ converges to one for any fixed r ∈ (0, 1) and β ∈ (0, 1). In contrast, the power of the linear-BH procedure goes to zero if r < β. (When r > β, the power of both methods converges to one.)

Here is a short informal argument for why our power goes to one. Since the nulls can be excluded before the non-nulls, we focus on the test statistics of the non-nulls. Let →_d denote convergence in distribution. The estimated effect satisfies ∆̂_i →_d N(µ, 1) for each non-null (since, in the notation of Algorithm 1, m̂(X_i = 1) converges to µ/2, so E_i →_d N(µ/2, 1) for those with A_i = 1 and E_i →_d N(−µ/2, 1) for those with A_i = 0). Hence, at the time t₀ right after all the nulls are excluded (when all the non-nulls remain in R_{t₀}), the proportion of positive estimated effects |R⁺_{t₀}|/|R_{t₀}| converges to Φ(µ), where Φ denotes the CDF of a standard Gaussian. We can stop before t₀ and identify the subjects in R⁺_{t₀} if FDR̂(R_{t₀}), as a function of |R⁺_{t₀}|/|R_{t₀}|, is less than α, which holds when Φ(µ) is large enough. The power goes to one because µ grows to infinity for any fixed r ∈ (0, 1): for large n, we stop before t₀, and the proportion of rejected non-nulls |R⁺_{t₀}|/|R_{t₀}| (which converges to Φ(µ) as argued above) also goes to one. In short, the power guarantee does not depend on the sparsity β because the designed selection strategy incorporates the covariates.

We note that our theoretical power analysis covers two extreme cases: one with no covariate to assist the testing procedure (Theorem 3), and one with a single "ideal" covariate that equals the indicator of being non-null (Theorem 4). The numerical experiments in Section 3 consider more practical settings, where the analyst is provided with a mixture of covariates informative about the heterogeneous effect (X_i(1) and X_i(3) in our example) and some uninformative ones; still, the Crossfit-I³ tends to have reasonably high power. In the following sections, we present extensions of the Crossfit-I³ in various directions.

The Crossfit-I³ controls the false identifications of subjects with zero treatment effect, as defined in the null hypothesis (3), (4) or (5). In this section, we develop a modification that additionally controls the error of falsely identifying subjects with nonpositive treatment effects, by defining a different null hypothesis.

Problem setup.
We define the null hypothesis for subject i as nonpositive effect:

H_0i^nonpositive : (Y_i^T | X_i) ⪯ (Y_i^C | X_i), (17)

or equivalently, H_0i^nonpositive : (Y_i | A_i = 1, X_i) ⪯ (Y_i | A_i = 0, X_i). As before, our algorithm applies to two alternative definitions of the null hypothesis. When treating the potential outcomes and covariates as fixed, the null hypothesis is

H_0i^nonpositive : Y_i^T ≤ Y_i^C, (18)

and in the hybrid version where the potential outcomes are random with joint distribution (Y_i^T, Y_i^C) | X_i ∼ P_i, the null posits

H_0i^nonpositive : Y_i^T ≤ Y_i^C almost surely under P_i. (19)

Note that the nonpositive-effect null is less strict than the zero-effect null. Thus, an algorithm with FDR control for H_0i^nonpositive must have valid FDR control for H_0i^zero, but the reverse need not be true. Indeed, we observe in numerical experiments (Figure 4b) that the Crossfit-I³ does not control FDR for the nonpositive-effect null. This section presents a variant of the Crossfit-I³ that controls false identifications of nonpositive effects, which may be more practical when interpreting the identified subjects. For example, when controlling FDR for the nonpositive-effect null at level α = 0.2, we can claim that the expected proportion of subjects with positive effects among those identified is no less than 80%.
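For intuition about how the procedures estimate the FDR from masked signs, here is a minimal sketch; the function names are ours, and we assume the running estimate takes the same (count of nonpositive signs + 1)/max(count of positive signs, 1) form as the subgroup estimator (28) in Appendix A.1. Under a true null, the sign of the estimated effect behaves like a fair coin given the unmasked data, so the nonpositive signs serve as a proxy for the number of nulls hiding among the positive ones.

```python
def delta_hat(a, y, m_hat_x):
    # Estimated effect 4*(A_i - 1/2)*(Y_i - m_hat(X_i)); its sign is a
    # fair coin under a zero-effect null.
    return 4.0 * (a - 0.5) * (y - m_hat_x)

def fdr_estimate(delta_hats):
    # (#nonpositive signs + 1) / max(#positive signs, 1) over candidates
    pos = sum(1 for d in delta_hats if d > 0)
    neg = len(delta_hats) - pos
    return (neg + 1) / max(pos, 1)
```

The procedure keeps excluding candidates until this estimate drops below α and then reports the candidates with positive signs.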
An interactive procedure with FDR control of nonpositive effects.
Recall that the FDR control of the Crossfit-I³ is based on property (11): when the null hypothesis is true for subject i, we have P(∆̂_i > 0 | {Y_j, X_j, E_j}_{j=1}^n) ≤ 1/2. This statement no longer holds when the null hypothesis is defined as H_0i^nonpositive in (17). Fortunately, the issue can be fixed by making the conditioning in (11) coarser and removing the outcomes:

P(∆̂_i > 0 | {X_j}_{j=1}^n) ≤ 1/2,

which is reflected in the interactive procedure as a reduction of the information available for selecting subject i*_t (at step 8 of Algorithm 1): we additionally mask (hide) the outcomes Y_i of the candidate subjects i ∈ R_{t-1}(I) when implementing the I³ on set I. We call the resulting interactive algorithm MaY-I³, as it masks the outcomes Y.

Specifically, the MaY-I³ modifies the Crossfit-I³ by defining the information available for selecting subjects, when implementing Algorithm 1 on set I, as

F_{t-1}^{−Y}(I) = σ({X_i}_{i∈R_{t-1}(I)}, {Y_j, A_j, X_j}_{j∉R_{t-1}(I)}, ∑_{i∈R_{t-1}(I)} 1{∆̂_i > 0}). (20)

To calculate ∆̂_i at t = 0, when Y_i is masked for all i ∈ I, let m̂_{−I}(X_i) be an estimator of E(Y_i | X_i) learned using the data {Y_j, X_j}_{j∉I} from the non-candidate subjects, and let the residuals be E_i^{−I} := Y_i − m̂_{−I}(X_i). Define ∆̂_i^{−I} := 4(A_i − 1/2) · E_i^{−I}; similar to property (11) for the zero-effect null, we have

P(∆̂_i^{−I} > 0 | {X_j}_{j∈I} ∪ {Y_j, X_j, E_j^{−I}}_{j∉I}) ≤ 1/2, (21)

under H_0i^nonpositive, leading to valid FDR control for nonpositive effects. Overall, the MaY-I³ follows Algorithm 2, with the estimated treatment effect ∆̂_i replaced by ∆̂_i^{−I} and the available information F_{t-1}(I) replaced by F_{t-1}^{−Y}(I). See Appendix B.3 for the proof of FDR control.

Theorem 5.
Under assumptions (1) and (2) of randomized experiments, the MaY-I³ has valid FDR control at level α for the nonpositive-effect null hypothesis under any of the definitions (17), (18) or (19). For the last definition, FDR control also holds conditionally on the covariates and potential outcomes.

Similar to Algorithm 3 for the Crossfit-I³, we can design an automated algorithm for the MaY-I³ to select a subject at step 8 of Algorithm 1, except that the available information F_{t-1}^{−Y}(I) no longer includes the outcomes of the candidate subjects. We defer the details of the automated selection strategy to Appendix A.2.

Numerical experiments.
We compare the Crossfit-I³ and the MaY-I³ using the same experiment as in Section 3.2. In terms of error control, both the Crossfit-I³ and the MaY-I³ control FDR for the zero-effect null at the target level (Figure 4a). When the null is defined as having a nonpositive effect, the Crossfit-I³ can violate the error control (Figure 4b), whereas the MaY-I³ preserves valid FDR control. In terms of power, the Crossfit-I³ has slightly higher power, since the analyst can select subjects using the information in F_{t-1}(I) defined in (12), which is richer than F_{t-1}^{−Y}(I) in (20) for the MaY-I³.

[Figure 4 shows three panels against the scale of the treatment effect, comparing the Crossfit-I³ and the MaY-I³: (a) FDR for the zero-effect null (3); (b) FDR for the nonpositive-effect null (17); (c) power of identifying subjects with positive effects.]

Figure 4.
Performance of the two interactive methods, Crossfit-I³ and MaY-I³, with the treatment effect specified as model (16) and the scale S_∆ varied over six levels. The MaY-I³ controls FDR for a more relaxed null (nonpositive effects) than the Crossfit-I³, while the Crossfit-I³ has slightly higher power than the MaY-I³.

To summarize, the error control of the MaY-I³ is stricter than that of the Crossfit-I³, controlling false identifications of both zero and negative effects, while its power is slightly lower. We recommend the Crossfit-I³ if one is only concerned with the error of falsely identifying subjects with zero effect; alternatively, we recommend the MaY-I³ when it is desired to control the error of falsely identifying subjects with nonpositive effects.

Problem setup.
Our discussion has so far focused on the case where the samples are not paired; the proposed algorithms extend to the paired-sample setting. Suppose there are n pairs of subjects. For pair i ∈ [n] and j = 1, 2, let Y_ij denote the outcomes, A_ij the treatment-assignment indicators, and X_ij the covariates. We consider randomized experiments without interference, and assume that

(i) conditional on the covariates, the within-pair treatment assignments are independent fair coin flips: P[(A_11, ..., A_n1) = (a_11, ..., a_n1) | X_11, ..., X_n2] = ∏_{i=1}^n P(A_i1 = a_i1) = (1/2)^n, and A_i1 + A_i2 = 1 for all i ∈ [n];

(ii) conditional on the covariates, the outcome of one subject Y_{i1 j1} is independent of the treatment assignment of another subject A_{i2 j2} for any (i1, j1) ≠ (i2, j2).

As before, we can develop interactive algorithms for two types of error control (only the definitions treating the potential outcomes as random variables are presented, but the FDR control still applies to all versions of the null):

H_0i^(zero, paired) : (Y_ij^T | X_ij) =_d (Y_ij^C | X_ij) for both j = 1, 2; (22)

H_0i^(nonpositive, paired) : (Y_ij^T | X_ij) ⪯ (Y_ij^C | X_ij) for both j = 1, 2. (23)

Here, we present the extension of the Crossfit-I³ for FDR control of zero effects, and defer the extension of the MaY-I³ for FDR control of nonpositive effects to Appendix A.3.

Interactive algorithms for paired samples. With the pairing information, the treatment effect can be estimated without involving m̂ as in (10):

∆̂_i^paired := (A_i1 − A_i2)(Y_i1 − Y_i2), (24)

as used by Rosenbaum (2002) and Howard and Pimentel (2020), among others. This estimate satisfies the critical property that guarantees FDR control: for a null pair i of two subjects with zero effects as in (22), we have

P(∆̂_i^paired > 0 | {Y_j1, Y_j2, X_j1, X_j2}_{j=1}^n) ≤ 1/2. (25)

Thus, the Crossfit-I³ (Algorithm 2) with ∆̂_i replaced by ∆̂_i^paired has valid FDR control for the zero-effect null (22), where the analyst excludes pairs using the available information: {Y_i1, Y_i2, X_i1, X_i2} for candidate pairs i ∈ R_{t-1}(I); {Y_j1, Y_j2, A_j1, A_j2, X_j1, X_j2} for non-candidate pairs j ∉ R_{t-1}(I); and the sum ∑_{i∈R_{t-1}(I)} 1{∆̂_i^paired > 0} for FDR estimation. An automated strategy to exclude pair i*_t (at step 8 of Algorithm 1) under paired samples is the same as Algorithm 3, with ∆̂_i replaced by ∆̂_i^paired.

Numerical experiments.
We compare the power of the interactive procedures with and without the pairing information, using the same experiments as before. When the subjects within each pair have the same covariate values, the power under paired samples is higher than when treating them as unpaired (see Figure 5a), because the noise in the observed outcomes that comes from the potential control outcomes can be removed by taking the difference in outcomes within each pair.

The advantage of the paired-sample procedures becomes less evident when the subjects within a pair do not match exactly. We simulate unmatched pairs by introducing a parameter ε such that, for each pair i, the covariates of the two subjects satisfy P(X_{i1}(1) ≠ X_{i2}(1)) = ε, P(X_{i1}(2) ≠ X_{i2}(2)) = ε, and X_{i1}(3) = X_{i2}(3) + U(0, 2ε), where U(0, 2ε) is uniformly distributed between 0 and 2ε; a larger ε leads to a larger degree of mismatch. As ε increases, the power of the procedures using the pairing information decreases (see Figure 5b), because the estimated treatment effect ∆̂_i^paired becomes less accurate under mismatch. We further investigate this power decrease in Appendix D.2.

[Figure 5 shows power against (a) the scale of the treatment effect for exact pairs, and (b) the degree of mismatch when subjects within a pair do not match exactly, comparing pair-Crossfit-I³, pair-MaY-I³, unpair-Crossfit-I³, and unpair-MaY-I³.]

Figure 5.
Power under paired samples with the treatment effect specified by model (16), when our proposed algorithms (Crossfit-I³ and MaY-I³) utilize the pairing information; it is higher than when treating all subjects as unpaired. The advantage is less evident when the subjects within each pair are not exactly matched on their covariate values.

Summary
We discuss the problem of identifying subjects with positive effects. Most existing methods identify subgroups with positive treatment effects, and they cannot upper bound the proportion of falsely identified subjects within an identified subgroup. In contrast, we propose the Crossfit-I³ with finite-sample FDR control (i.e., among the identified subjects, the expected proportion of subjects with zero effect is no larger than α). One advantage of the Crossfit-I³ is that it allows human interaction: an analyst (or an algorithm) can incorporate various types of prior knowledge and covariates using any working model, and can adjust the model at any step, potentially improving the identification power. Despite this flexibility, the Crossfit-I³ achieves valid FDR control. Notably, because the Crossfit-I³ incorporates covariates, it can identify subjects with positive effects, including those not treated.

Our proposed interactive procedure was extended in various directions: from FDR control of zero effects to FDR control of nonpositive effects, from unpaired samples to paired samples, and from FDR control at an individual level to FDR control at a subgroup level (details deferred to the Appendix). We have recently begun to extend the proposed methods from randomized experiments to observational data, where the probability of receiving treatment can be heterogeneous and unknown; it seems that several compromises must be made, and finite-sample FDR control may no longer be feasible.

The error control of our interactive procedures is based on independence between the data used for FDR control and the data revealed for interaction, such as property (11) for the zero-effect null and (21) for the nonpositive-effect null. The idea of interactive testing can be generalized to many other problems, as long as we can construct two parts of the data that are (conditionally) independent.
Importantly, our interactive procedures, built on the idea of "masking and unmasking," should be contrasted with data-splitting approaches. We remark that no test, interactive or otherwise, can be run twice from scratch (with a tweak made the second time to boost power) after the entire data has been examined; this amounts to p-hacking. We view our interactive tests as one step towards enabling experts (scientists and statisticians) to work together with statistical models and machine learning algorithms to discover scientific insights with rigorous guarantees.

A More extensions of the I³

A.1 FDR control at a subgroup level
Our proposed interactive methods control FDR at the individual level, upper bounding the proportion of falsely identified subjects. In this section, we show that the idea of interactive testing can be extended to control FDR at the subgroup level, where we aim to identify multiple subgroups with positive effects while upper bounding the proportion of falsely identified subgroups. Recall that FDR control at a subgroup level is studied by Karmakar et al. (2018), as reviewed in Section 1.2 of the main paper.
Problem setup.
Let there be G non-overlapping subgroups G_g for g ∈ [G] ≡ {1, ..., G}. The null hypothesis for each subgroup is defined as zero effect for all subjects within:

H_g : H_0i^zero is true for all i ∈ G_g, (26)

or equivalently, H_g : G_g ⊆ H (recall that H is the set of true null subjects). Let D_g be the decision function taking value 1 if H_g is rejected and 0 otherwise; the FDR at a subgroup level is defined as

FDR_subgroup := E[ |{g ∈ [G] : G_g ⊆ H, D_g = 1}| / max{|{g ∈ [G] : D_g = 1}|, 1} ].

As with the algorithms at the individual level, the algorithms we propose at the subgroup level apply to samples that are paired or unpaired. For notational simplicity, we use {Y_i, A_i, X_i} to denote the observed data for subject i when the samples are unpaired, and for pair i when the samples are paired (where Y_i = {Y_i1, Y_i2}, and similarly for A_i and X_i).

An interactive algorithm to identify subgroups. We first follow the same steps as Karmakar et al. (2018) to define the subgroups and to generate a p-value for each subgroup. Specifically, the subgroups G_g for g ∈ [G] are defined using the outcomes and covariates {Y_j, X_j}_{j=1}^n (by an arbitrary algorithm or strategy, such as grouping subjects with the same covariates). For each subgroup G_g, we compute a p-value P_g by the classical Wilcoxon test (or by a permutation test, which obtains the null distribution by permuting the treatment assignments {A_i}_{i=1}^n).

The interactive procedure we propose differs from Karmakar et al. (2018) in how it processes the p-values of the subgroups. We adopt the work of Lei et al. (2020), which proposes an interactive procedure with FDR control for generic multiple testing problems. The key property that allows human interaction while guaranteeing valid FDR control is similar to that in the I³: the independence between the information used for selection and that used for FDR control.
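As a concrete illustration of this independence for a subgroup p-value (the decomposition stated next in the text), a null p-value P_g ~ U(0, 1) splits into a revealed magnitude and a masked sign that are independent of each other; a minimal sketch, with our own function names:

```python
def decompose_pvalue(p):
    # revealed part: p_tilde = min(p, 1 - p); masked part: the sign bit
    # p_star = +1 if p < 1/2 else -1 (independent of p_tilde for uniform p)
    return min(p, 1.0 - p), (1 if p < 0.5 else -1)

def fdr_estimate_subgroup(p_stars):
    # (|R_t^-| + 1) / max(|R_t^+|, 1) over the candidate subgroups
    pos = sum(1 for s in p_stars if s == 1)
    neg = sum(1 for s in p_stars if s == -1)
    return (neg + 1) / max(pos, 1)
```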
Here, with the p-values of the subgroups, the two independent parts are

P̃_g := min{P_g, 1 − P_g}, which is revealed to the analyst for selection, and

P*_g := 2 · 1{P_g < 1/2} − 1, which is masked (hidden) and used for FDR control.

Note that for a null subgroup with a uniform p-value, (P̃_g, P*_g) are independent, and we have

P(P*_g = 1 | P̃_g, [Y_i, X_i]_{i∈G_g}) ≤ 1/2, (27)

because the p-value obtained by permuting the assignments is uniform conditional on the outcomes and covariates. We remark that the above property is similar to properties (11) and (21) in the main paper, which lead to valid FDR control at the individual level.

Algorithm 4
An interactive procedure for subgroup identification.
Initial state:
Explorer (E) knows the covariates and outcomes {Y_i, X_i}_{i=1}^n.
Oracle (O) knows the treatment assignments {A_i}_{i=1}^n.
The target FDR level α is public knowledge.

Initial exchange:
Set t = 1.
1. E defines subgroups {G_g}_{g=1}^G using {Y_i, X_i}_{i=1}^n.
2. Both players initialize R_1 = [G], and E informs O about the subgroup division.
3. O computes the p-value P_g for each subgroup, and decomposes each p-value as P̃_g := min{P_g, 1 − P_g} and P*_g := 2 · 1{P_g < 1/2} − 1.
4. O partitions R_t into R⁻_t := {g ∈ R_t : P*_g = −1} and R⁺_t := {g ∈ R_t : P*_g = +1}.
5. O reveals {P̃_g}_{g=1}^G, |R⁻_t| and |R⁺_t| to E.

Repeated interaction:
6. E checks whether FDR̂_subgroup(R_t) ≡ (|R⁻_t| + 1)/max{|R⁺_t|, 1} ≤ α.
7. If yes, E sets τ = t, reports R⁺_τ and exits.
8. Else, E picks some g*_t ∈ R_t using everything E currently knows. (E tries to pick a g*_t that they think is null; E hopes that P*_{g*_t} = −1.)
9. O reveals the assignments {A_i}_{i∈G_{g*_t}} to E, who can then also infer P_{g*_t}.
10. E updates R_{t+1} = R_t \ {g*_t}, and also |R⁺_{t+1}| and |R⁻_{t+1}|.
11. Increment t and go back to step 6.

Similar to the proposed methods at the individual level, the interactive procedure for subgroups progressively excludes subgroups and recursively estimates the FDR. Let the candidate rejection set R_t be a set of selected subgroups, starting from all subgroups, R_1 = [G]. We interactively shrink R_t using the available information

F_{t-1}^subgroup = σ({P̃_g, [Y_i, X_i]_{i∈G_g}}_{g∈R_{t-1}}, {P_g, [Y_j, A_j, X_j]_{j∈G_g}}_{g∉R_{t-1}}, ∑_{g∈R_{t-1}} P*_g),

which masks the full p-value P_g and the treatment assignments A_i for the candidate subgroups in R_{t-1}; the sum ∑_{g∈R_{t-1}} P*_g is provided mainly for FDR estimation. Similar to our previously proposed interactive procedures, the FDR estimator is defined as

FDR̂_subgroup(R_t) = (|R⁻_t| + 1)/max{|R⁺_t|, 1}, (28)

with R⁺_t = {g ∈ R_t : P*_g = 1} and R⁻_t = {g ∈ R_t : P*_g = −1}. The algorithm shrinks R_t until time τ := inf{t : FDR̂_subgroup(R_t) ≤ α}, and identifies only the subgroups in R⁺_τ, as summarized in Algorithm 4. Details of strategies to select subgroups based on the revealed p-values and covariates can be found in Lei et al. (2020). As a comparison, Karmakar et al. (2018) use the same set of p-values {P_g}_{g∈[G]} and control FDR by the classical BH procedure.

Numerical experiments.
We compare the performance of our proposed interactive procedure for subgroup identification with the method proposed by Karmakar et al. (2018), following an experiment in their paper. Suppose each subject is recorded with two discrete covariates X_i = {X_i(1), X_i(2)}, where X_i(1) ∈ {1, ..., 40} takes 40 levels with equal probability, and X_i(2) is binary with equal probability (for example, X_i(1) could encode the city subject i lives in, and X_i(2) the gender). The treatment effect ∆(X_i) is a constant δ if X_i(1) is even, and we vary δ over six levels. We conduct the above experiment in two cases: unpaired samples (n = 2000) with independent covariates, and paired samples (n = 1000) whose covariate values are the same for the subjects within each pair.

[Figure 6 shows power against the scale of the treatment effect for the BH and interactive procedures in three settings: (a) unpaired samples; (b) paired samples; (c) only a few subgroups are non-nulls.]
Figure 6.
Performance of methods to identify subgroups with positive effects: the BH procedure and the interactive procedure (for 80 subgroups defined by the distinct values of the covariates). We vary the scale of the treatment effect under unpaired and paired samples. In both cases, the interactive procedure can have higher power than the BH procedure. When the number of non-null subgroups is too small (fewer than 20), the BH procedure can have higher power. The error bars mark two standard deviations from the center.
Recall that the subgroups can be defined using covariates and outcomes. Here, since the covariates are discrete, we define the subgroups by the distinct values of (X_i(1), X_i(2)), resulting in 80 subgroups. The interactive procedure tends to have higher power than the BH procedure (see Figures 6a and 6b) because the exclusion process focuses on the subgroups that are more likely to be non-null, and the covariates are utilized together with the p-values to guide the algorithm. Meanwhile, the BH procedure does not account for covariates once the p-values are calculated. Nonetheless, the interactive procedure can have lower power when the total number of truly non-null subgroups is small. We simulate the case where a subject has a positive effect δ only if X_i(1) is a multiple of 4; the number of true non-nulls is then small, so |R⁺| is small (for example, with FDR control at α = 0.2, we need to shrink R_t until |R⁻| + 1 ≤ α|R⁺|, while |R⁺| is at most around 20).

A side note is that we define the subgroups by the distinct values of the covariates, whereas Karmakar et al. (2018) suggest forming subgroups by regressing the outcomes on the covariates using a tree algorithm. In their experiments and in several numerical experiments we tried, the number of subgroups defined by the tree algorithm is usually less than ten. However, we think FDR control is less meaningful when the total number of subgroups is small. To justify this comment, note that an algorithm with valid FDR control at level α can make zero rejections with probability 1 − α and reject all subgroups with probability α, which can happen when the total number of subgroups is small. In contrast, with a large number of subgroups, a reasonable algorithm is unlikely to jump between the extremes of making zero rejections and rejecting all subgroups; thus, controlling FDR indeed certifies that the proportion of false identifications is low for the evaluated algorithm.

Explanation of the higher power achieved by the interactive procedure.
Although the interactive procedure and the BH procedure use the same set of subgroups and corresponding p-values, the interactive procedure has two properties that can improve its power over the BH procedure: (a) it excludes likely-null subgroups, so it is less sensitive to a large number of nulls, whereas the BH procedure considers all subgroups at once; (b) the interactive procedure additionally uses the covariates. We can evaluate the effect of these two properties separately by implementing two versions of Algorithm 4 that differ in the strategy for selecting subgroups in step 8. Specifically, the adaptive procedure selects the subgroup whose revealed (partial) p-value P̃_g is the smallest (not using the covariates), while the interactive procedure selects the subgroup by an estimated probability of P*_g being positive (using the revealed P̃_g, the covariates, and the outcomes).

[Figure 7 shows power against the scale of the treatment effect for the BH, adaptive, and interactive procedures in two settings: (a) effect as a discrete function of the covariates; (b) effect as a simpler function of the covariates.]
Figure 7.
Power of three methods for subgroup identification: the BH procedure proposed by Karmakar et al. (2018), the adaptive procedure, and the interactive procedure, under different types of treatment effect (we define 80 subgroups by the discrete values of the covariates). Our proposed interactive procedure tends to have higher power than the BH procedure because (1) it excludes likely nulls (shown by the higher power of the adaptive procedure relative to the BH procedure in both plots); and (2) it additionally uses the covariates (shown when the treatment effect can be well learned as a function of the covariates, in the right plot).
To see whether both properties of Algorithm 4 contribute to the improvement in power over the BH procedure, we test the methods under two simulation settings. Recall that the previous experiment defines a positive treatment effect when the discrete covariate X_i(1) ∈ {1, ..., 40} is even. Here, we add another case where the treatment effect is positive when X_i(1) ≤ 20, so that the density of subgroups with positive effects is the same as before, but the treatment effect is a simpler function of the covariates. In the latter case, we would expect the interactive procedure to learn this function of the covariates rather accurately, and to have higher power than the adaptive procedure, which does not use the covariates; this is confirmed in Figure 7b. In the former setting, where the treatment effect is not a smooth function of the covariates and is hard to learn, the adaptive and interactive procedures have similar power (Figure 7a). Still, they have higher power than the BH procedure because they exclude likely-null subgroups.

A.2 An automated algorithm with FDR control on nonpositive effects
We have proposed the MaY-I to guarantee valid FDR control for the nonpositive-effect null in (17), by reducing the information available for selecting subjects from $\mathcal{F}_{t-1}(I)$ for the Crossfit-I to $\mathcal{F}^{-Y}_{t-1}(I)$ (recall that the latter no longer includes the outcomes of candidate subjects). Here, we present an automated algorithm that selects a subject for the MaY-I. One naive strategy is to follow Algorithm 3 in the main paper, which is designed for the Crossfit-I, with the outcomes removed from the predictors; however, this appears to result in less accurate prediction of the effect signs, and in turn rather low power (numerical results are in the next paragraph). Here, we take a different approach and predict the treatment effect instead of its sign, because the treatment effect might be better predicted as a function of the covariates (without outcomes) than a binary sign, especially when the treatment effect is indeed a smooth and simple function of the covariates. Specifically, we first estimate the treatment effect for the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$ using a well-studied doubly-robust estimator (see Kennedy (2020) and references therein):

$$\Delta^{\mathrm{DR}}_j = 4(A_j - 1/2)\cdot\big(Y_j - \widehat{\mu}_{A_j}(X_j)\big) + \widehat{\mu}_1(X_j) - \widehat{\mu}_0(X_j), \qquad (29)$$

where $(\widehat{\mu}_0, \widehat{\mu}_1)$ are random forests trained to predict the outcomes for the control and treated groups, respectively. Using the provided covariates $X_i$, we can predict $\Delta^{\mathrm{DR}}_i$ for the candidate subjects $i \in \mathcal{R}_{t-1}(I)$. The subject with the smallest predicted $\Delta^{\mathrm{DR}}_i$ is then excluded. This automated strategy is described in Algorithm 5.

Algorithm 5. An automated heuristic to select $i^*_t$ in the MaY-I.
Input: current rejection set $\mathcal{R}_{t-1}(I)$, and available information for selection $\mathcal{F}^{-Y}_{t-1}(I)$;
Procedure:
1. Estimate the treatment effect for non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$ as $\Delta^{\mathrm{DR}}_j$ in (29);
2. Train a random forest where the label is the estimated effect $\Delta^{\mathrm{DR}}_j$ and the predictors are the covariates $X_j$, using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
3. Predict $\Delta^{\mathrm{DR}}_i$ for the candidate subjects $i \in \mathcal{R}_{t-1}(I)$ via the above random forest, denoted $\widehat{\Delta}^{\mathrm{DR}}_i$;
4. Select $i^*_t$ as $\operatorname{argmin}\{\widehat{\Delta}^{\mathrm{DR}}_i : i \in \mathcal{R}_{t-1}(I)\}$.

To summarize, we have presented two types of strategies for selecting subjects: the Crossfit-I chooses the subject with the smallest predicted probability of a positive $\widehat{\Delta}_i$ (see Algorithm 3 in the main paper), which we denote the min-prob strategy; and the MaY-I chooses the subject with the smallest predicted effect $\Delta^{\mathrm{DR}}_j$ (see Algorithm 5), which we denote the min-effect strategy. Note that the proposed interactive algorithms can use an arbitrary strategy as long as the information available for selection is appropriately restricted. That is, the Crossfit-I can use the min-effect strategy, and the MaY-I can use the min-prob strategy (after removing the outcomes from the predictors, as we elaborate in the next paragraph). However, we observe in numerical experiments that each interactive procedure has higher power when using its original strategy (see Figure 8).

Before detailing the experimental results, we first describe the min-prob strategy for the MaY-I, where the available information $\mathcal{F}^{-Y}_{t-1}(I)$ does not include the outcomes of candidate subjects. As in the min-prob strategy of Algorithm 3 in the main paper, we would like to use the outcome $Y_i$ and the residual $E_i = Y_i - \widehat{m}(X_i)$ as predictors of the sign of the treatment effect for candidate subjects $i \in \mathcal{R}_{t-1}(I)$, but $Y_i$ and $E_i$ for the candidate subjects are not available in $\mathcal{F}^{-Y}_{t-1}(I)$.
To address this, we propose Algorithm 6, whose steps 1-2 first estimate $Y_i$ and $E_i$ from the covariates; steps 3-5 then mirror Algorithm 3 to obtain the probability of a positive treatment effect.

Algorithm 6. The min-prob strategy to select $i^*_t$ in the MaY-I.
Input: current rejection set $\mathcal{R}_{t-1}(I)$, and available information for selection $\mathcal{F}^{-Y}_{t-1}(I)$;
Procedure:
1. Predict the outcome $Y_k$ of each subject $k \in [n]$ from the covariates, denoted $\widehat{Y}^{-I}(X_k)$, where $\widehat{Y}^{-I}$ is learned using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
2. Predict the residual $E_k = Y_k - \widehat{m}(X_k)$ of each subject $k \in [n]$ from the covariates, denoted $\widehat{E}^{-I}(X_k)$, where $\widehat{E}^{-I}$ is learned using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
3. Train a random forest classifier where the label is $\operatorname{sign}(\widehat{\Delta}_j)$ and the predictors are $\big(\widehat{Y}^{-I}(X_j), X_j, \widehat{E}^{-I}(X_j)\big)$, using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
4. Predict the probability of $\widehat{\Delta}_i$ being positive, denoted $\widehat{p}(i, t)$, for the candidate subjects $i \in \mathcal{R}_{t-1}(I)$;
5. Select $i^*_t = \operatorname{argmin}\{\widehat{p}(i, t) : i \in \mathcal{R}_{t-1}(I)\}$.
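As a concrete sketch of the two selection rules, the following assumes scikit-learn random forests; the bookkeeping of which quantities are revealed at step $t$ is simplified, and all array names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def select_min_effect(X, Y, A, candidates):
    """Algorithm 5 (min-effect): exclude the candidate with the smallest
    predicted doubly-robust effect; all models are fit on non-candidates."""
    n = len(Y)
    noncand = np.setdiff1d(np.arange(n), candidates)
    mu = {}
    for a in (0, 1):                              # outcome model per arm
        idx = noncand[A[noncand] == a]
        mu[a] = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[idx], Y[idx])
    mu_obs = np.where(A == 1, mu[1].predict(X), mu[0].predict(X))
    # doubly-robust estimate (29), known propensity 1/2
    delta_dr = 4 * (A - 0.5) * (Y - mu_obs) + mu[1].predict(X) - mu[0].predict(X)
    # regress DR estimates on covariates (non-candidates), predict candidates
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[noncand], delta_dr[noncand])
    return candidates[np.argmin(rf.predict(X[candidates]))]

def select_min_prob(X, Y, A, delta_hat, m_hat, candidates):
    """Algorithm 6 (min-prob): exclude the candidate with the smallest
    predicted probability of a positive effect, replacing the unavailable
    Y_i and E_i of candidates with covariate-based predictions."""
    n = len(Y)
    noncand = np.setdiff1d(np.arange(n), candidates)
    y_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[noncand], Y[noncand])
    e_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(
        X[noncand], (Y - m_hat)[noncand])
    feats = np.column_stack([y_model.predict(X), X, e_model.predict(X)])
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(feats[noncand], (delta_hat[noncand] > 0).astype(int))
    prob_pos = clf.predict_proba(feats[candidates])[:, 1]
    return candidates[np.argmin(prob_pos)]
```

In the sketch, `select_min_effect` corresponds to Algorithm 5 and `select_min_prob` to Algorithm 6; a real implementation would refit these models as the candidate set shrinks at each round of interaction.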
Figure 8. Power of the Crossfit-I and MaY-I under the two strategies for selecting subjects, the min-prob strategy and the min-effect strategy, with the treatment effect defined in (16) of the main paper as the scale $S_\Delta$ varies. The Crossfit-I tends to have higher power when using the min-prob strategy, and the MaY-I tends to have higher power when using the min-effect strategy.

The Crossfit-I has higher power with the min-prob strategy than with the min-effect strategy because the former additionally uses the outcome as a predictor. For the MaY-I, the min-effect strategy leads to higher power because the estimated treatment effect $\Delta^{\mathrm{DR}}_j$ in (29) can provide reliable evidence of which subjects have a positive effect. With the min-prob strategy, it is harder to learn an accurate prediction via Algorithm 6, where two of the predictors, $\widehat{Y}^{-I}(X_j)$ and $\widehat{E}^{-I}(X_j)$, are themselves obtained by estimation, increasing the modeling complexity. Therefore, we present the Crossfit-I and MaY-I with the min-prob and min-effect strategies, respectively, as preferred in the numerical experiments. Nonetheless, we remark that our proposed interactive frameworks for the Crossfit-I and MaY-I allow arbitrary strategies for selecting subjects, and an analyst can design her own strategy based on her domain knowledge.

A.3 FDR control of nonpositive effects for paired samples
Recall the nonpositive-effect null under paired samples:

$$H^{(\text{nonpositive, paired})}_{0i}: \ (Y_{ij} \mid A_{ij}=1, X_{ij}) \preceq (Y_{ij} \mid A_{ij}=0, X_{ij}) \ \text{ for both } j = 1, 2,$$

and we observe that

$$P\big(\widehat{\Delta}^{\text{paired}}_i > 0 \mid \{X_{j1}, X_{j2}\}_{j=1}^n\big) \le 1/2, \qquad (30)$$

where $\widehat{\Delta}^{\text{paired}}_i$ is defined in (24) of the main paper. Thus, the MaY-I with $\widehat{\Delta}_i$ replaced by $\widehat{\Delta}^{\text{paired}}_i$ has valid FDR control for the nonpositive-effect null, where the analyst progressively excludes pairs using the available information:

$$\mathcal{F}^{-Y,\text{ paired}}_{t-1} = \sigma\Big(\{X_{i1}, X_{i2}\}_{i \in \mathcal{R}_{t-1}}, \ \{Y_{j1}, Y_{j2}, A_{j1}, A_{j2}, X_{j1}, X_{j2}\}_{j \notin \mathcal{R}_{t-1}}, \ \textstyle\sum_{i \in \mathcal{R}_{t-1}(I)} \mathbf{1}\{\widehat{\Delta}^{\text{paired}}_i > 0\}\Big).$$

We can also implement an automated version of the MaY-I in which the selection of the excluded pair follows a procedure similar to Algorithm 5. The difference is that in step 1, we estimate the treatment effect for non-candidate pairs $j \notin \mathcal{R}_{t-1}(I)$ directly as $\widehat{\Delta}^{\text{paired}}_j \equiv (A_{j1} - A_{j2})(Y_{j1} - Y_{j2})$ instead of $\Delta^{\mathrm{DR}}_j$, to avoid estimating the outcomes via $(\widehat{\mu}_0, \widehat{\mu}_1)$.

B Proof of error controls
The proofs are based on an optional stopping argument, as a variant of the ones presented in Lei and Fithian (2018), Lei et al. (2020), Li and Barber (2017), and Barber and Candès (2015).

Lemma 1 (Lemma 2 of Lei and Fithian (2018)). Suppose that, conditionally on the $\sigma$-field $\mathcal{G}_{-1}$, the variables $b_1, \dots, b_n$ are independent Bernoulli random variables with $P(b_i = 1 \mid \mathcal{G}_{-1}) = \rho_i \ge \rho > 0$, almost surely. Let $(\mathcal{G}_t)_{t=0}^{\infty}$ be a filtration with $\mathcal{G}_0 \subset \mathcal{G}_1 \subset \dots$, and suppose that $[n] \supseteq \mathcal{C}_0 \supseteq \mathcal{C}_1 \supseteq \dots$, with each subset $\mathcal{C}_{t+1}$ measurable with respect to $\mathcal{G}_t$. If we have

$$\mathcal{G}_t = \sigma\Big(\mathcal{G}_{-1}, \ \mathcal{C}_t, \ (b_i)_{i \notin \mathcal{C}_t}, \ \textstyle\sum_{i \in \mathcal{C}_t} b_i\Big), \qquad (31)$$

and $\tau$ is an almost-surely finite stopping time with respect to the filtration $(\mathcal{G}_t)_{t \ge 0}$, then

$$E\left[\frac{1 + |\mathcal{C}_\tau|}{1 + \sum_{i \in \mathcal{C}_\tau} b_i} \,\Big|\, \mathcal{G}_{-1}\right] \le \rho^{-1}.$$

B.1 Proof of theorem 1
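Before the formal argument, the claimed control can be sanity-checked by simulation. The sketch below runs the simplified automated version of the procedure (cf. Section 4 of the main paper) under the global null; all parameters are illustrative, and Theorem 1 bounds the expected FDP by $\alpha$:

```python
import numpy as np

def interactive_reject(delta_hat, alpha):
    """Simplified automated I: starting from all subjects, repeatedly
    exclude the smallest |delta_hat|; stop as soon as
    FDRhat = (#{delta_hat <= 0} + 1) / max(#{delta_hat > 0}, 1) <= alpha,
    and reject the remaining subjects with positive delta_hat."""
    delta_hat = np.asarray(delta_hat, dtype=float)
    keep_order = np.argsort(np.abs(delta_hat))[::-1]  # large |.| kept longest
    for t in range(len(delta_hat) + 1):
        remaining = keep_order[: len(delta_hat) - t]
        pos = remaining[delta_hat[remaining] > 0]
        fdr_hat = (len(remaining) - len(pos) + 1) / max(len(pos), 1)
        if fdr_hat <= alpha:
            return pos
    return np.array([], dtype=int)

# global null: A ~ Bernoulli(1/2) independent of a symmetric residual E,
# so delta_hat = 4(A - 1/2)E has a random sign given |delta_hat|, and
# every rejection is false (FDP = 1 whenever the rejection set is nonempty)
rng = np.random.default_rng(0)
alpha, n, reps = 0.2, 200, 200
fdps = [
    float(len(interactive_reject(
        4 * (rng.integers(0, 2, n) - 0.5) * rng.normal(size=n), alpha)) > 0)
    for _ in range(reps)
]
print(round(np.mean(fdps), 3))  # Theorem 1 bounds the expectation by alpha
```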
Proof. We show that the I controls FDR by Lemma 1, with $b_i := \mathbf{1}\{4(A_i - 1/2)\cdot E_i \le 0\}$, $\mathcal{G}_{-1} := \sigma\big(\{Y_j, X_j\}_{j=1}^n\big)$, and $\mathcal{C}_t := \mathcal{R}_t \cap \mathcal{H}_0$ for $t = 0, 1, \dots$. The assumptions of Lemma 1 are satisfied: (a) $P(b_i = 1 \mid \mathcal{G}_{-1}) \ge 1/2$ for every null $i \in \mathcal{H}_0$, because $A_i$ is independent of $\mathcal{G}_{-1}$:

$$P\big(4(A_i - 1/2)\cdot E_i \le 0 \mid \mathcal{G}_{-1}\big) = P(A_i = 1)\,P(E_i \le 0 \mid \mathcal{G}_{-1}) + P(A_i = 0)\,P(E_i \ge 0 \mid \mathcal{G}_{-1}) = \tfrac{1}{2}\big[P(E_i \le 0 \mid \mathcal{G}_{-1}) + P(E_i \ge 0 \mid \mathcal{G}_{-1})\big] \ge \tfrac{1}{2};$$

(b) the filtration in our algorithm satisfies $\mathcal{F}_t \subseteq \mathcal{G}_t$, so the time of stopping the algorithm, $\widehat{t} := \min\{t : \widehat{\mathrm{FDR}}(\mathcal{R}_t) \le \alpha\}$, is a stopping time with respect to $(\mathcal{G}_t)$; and (c) $\mathcal{C}_{t+1}$ is measurable with respect to $\mathcal{G}_t$. Thus, by Lemma 1 with $\rho = 1/2$, we have

$$E\left[\frac{1 + |\mathcal{R}_{\widehat{t}} \cap \mathcal{H}_0|}{1 + |\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0|} \,\Big|\, \mathcal{G}_{-1}\right] \le 2.$$

By definition, the FDR conditional on $\mathcal{G}_{-1}$ at the stopping time $\widehat{t}$ is

$$E\left[\frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{\max\{|\mathcal{R}^+_{\widehat{t}}|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] = E\left[\frac{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1}{\max\{|\mathcal{R}^+_{\widehat{t}}|, 1\}} \cdot \frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1} \,\Big|\, \mathcal{G}_{-1}\right] \le E\left[\widehat{\mathrm{FDR}}(\mathcal{R}_{\widehat{t}}) \cdot \frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1} \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha\, E\left[\frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1} \,\Big|\, \mathcal{G}_{-1}\right] = \alpha\, E\left[\frac{1 + |\mathcal{R}_{\widehat{t}} \cap \mathcal{H}_0|}{1 + |\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0|} - 1 \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha(2 - 1) = \alpha,$$

where the last equality uses $|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0| = |\mathcal{R}_{\widehat{t}} \cap \mathcal{H}_0| - |\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0|$; the proof completes by applying the tower property of conditional expectation.

Notice that when the potential outcomes are treated as fixed, the same proof applies to the null defined as $Y^T_j = Y^C_j$, because the independence between $A_i$ and $\mathcal{G}_{-1}$ still holds for the nulls. For the hybrid version of the null (5) in the main paper, the above proof applies with $\mathcal{G}_{-1} := \sigma\big(\{Y_j, Y^T_j, Y^C_j, X_j\}_{j=1}^n\big)$. Thus, the FDR is controlled at level $\alpha$ conditional on the potential outcomes and covariates $\{Y^T_j, Y^C_j, X_j\}_{j=1}^n$.

B.2 Proof of theorem 2
Proof. Let the set of false rejections in $\mathcal{R}(I)$ be $\mathcal{V}(I)$. The FDR of the I implemented on set $I$ is controlled at level $\alpha/2$,

$$E\left[\frac{|\mathcal{V}(I)|}{\max\{|\mathcal{R}(I)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha/2,$$

following the error control of the I in Section B.1, where the initial candidate rejection set is $\mathcal{R}_0 = I$, and thus $\mathcal{C}_0 = I \cap \mathcal{H}_0$. Similarly, the FDR of the I implemented on set $II$ is also controlled at level $\alpha/2$. Therefore, the FDR of the combined set $\mathcal{R}(I) \cup \mathcal{R}(II)$ is controlled at level $\alpha$, as claimed:

$$E\left[\frac{|\mathcal{V}(I) \cup \mathcal{V}(II)|}{\max\{|\mathcal{R}(I) \cup \mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le E\left[\frac{|\mathcal{V}(I)|}{\max\{|\mathcal{R}(I) \cup \mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] + E\left[\frac{|\mathcal{V}(II)|}{\max\{|\mathcal{R}(I) \cup \mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le E\left[\frac{|\mathcal{V}(I)|}{\max\{|\mathcal{R}(I)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] + E\left[\frac{|\mathcal{V}(II)|}{\max\{|\mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha,$$

and the proof completes for the null (3) in the main paper after applying the tower property of conditional expectation. The FDR control also applies to the other two definitions of the null, (4) and (5) in the main paper, following the same arguments as at the end of Section B.1.

B.3 Proof of theorem 5
Proof. We prove that the FDR control holds for the I implemented on $I$; the same conclusion applies to $II$, so the overall FDR control is guaranteed following the proof of Theorem 2 in Section B.2. We first present the proof when the potential outcomes are viewed as random variables. Define $\mathcal{G}_{-1} := \sigma\big(\{X_i\}_{i=1}^n, \{Y_i, A_i\}_{i \notin I}\big)$ and $\mathcal{G}'_t = \sigma\big(\mathcal{G}_{-1}, \mathcal{C}_t, (Y_i, A_i)_{i \notin \mathcal{C}_t}, \sum_{i \in \mathcal{C}_t} b_i\big)$, which contains more information than $\mathcal{G}_t$ as defined in (31). We claim that Lemma 1 holds when we replace $\mathcal{G}_t$ by $\mathcal{G}'_t$, because the distribution of $b_i$ conditional on $\mathcal{G}_t$ is the same as conditional on $\mathcal{G}'_t$ for any $t = 0, \dots, n$. Similar to the proof of Theorem 1 in Section B.1, we check that the assumptions of Lemma 1 are satisfied: (a) the filtration in our algorithm satisfies $\mathcal{F}_t \subseteq \mathcal{G}'_t$, so the time of stopping the algorithm, $\widehat{t} := \min\{t : \widehat{\mathrm{FDR}}(\mathcal{R}_t) \le \alpha\}$, is a stopping time with respect to $(\mathcal{G}'_t)$; (b) $\mathcal{C}_{t+1}$ is measurable with respect to $\mathcal{G}'_t$; and (c) for subjects with nonpositive effect, $i \in \mathcal{H}^{\text{nonpositive}}_0$,

$$P\big(4(A_i - 1/2)\cdot E^{-I}_i \le 0 \mid \mathcal{G}_{-1}\big) \ge 1/2. \qquad (32)$$

To see that the last assumption holds, notice that

$$P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) \le 0 \mid \mathcal{G}_{-1}\big) = P\big(Y^C_i \ge \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 0) + P\big(Y^T_i \le \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 1),$$

and

$$P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) > 0 \mid \mathcal{G}_{-1}\big) = P\big(Y^C_i < \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 0) + P\big(Y^T_i > \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 1).$$

For any potential outcomes of the nulls such that $(Y^T_i \mid X_i) \preceq (Y^C_i \mid X_i)$, it holds that

$$P(Y^C_i \ge D \mid X_i) \ge P(Y^T_i > D \mid X_i), \quad \text{and} \quad P(Y^T_i \le D \mid X_i) \ge P(Y^C_i < D \mid X_i),$$

for any constant $D$, so

$$P\big(Y^C_i \ge \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) \ge P\big(Y^T_i > \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big), \quad \text{and} \quad P\big(Y^T_i \le \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) \ge P\big(Y^C_i < \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big),$$

because $\widehat{m}^{-I}(X_i)$ is fixed given $\mathcal{G}_{-1}$. Because $P(A_i = 1) = 1/2$, it follows that

$$P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) \le 0 \mid \mathcal{G}_{-1}\big) \ge P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) > 0 \mid \mathcal{G}_{-1}\big),$$

which proves Claim (32), and in turn the FDR control of the MaY-I.

When the potential outcomes are treated as fixed, the above proof applies to the null defined as $Y^T_i \le Y^C_i$ in (18) of the main paper, in which case $P(Y^C_i \ge D \mid \mathcal{G}_{-1})$ is zero or one and the above arguments still hold. For the hybrid version of the null (19) in the main paper, the above proof applies with $\mathcal{G}_{-1} := \sigma\big(\{Y^T_i, Y^C_i, X_i\}_{i=1}^n, \{Y_i, A_i\}_{i \notin I}\big)$. Thus, the FDR is controlled at level $\alpha$ conditional on the potential outcomes and covariates $\{Y^T_j, Y^C_j, X_j\}_{j=1}^n$.

B.4 Error control guarantee for the linear-BH procedure
Theorem 6. Suppose the outcomes follow a linear model: $Y_i = l_\Delta(X_i)A_i + l_f(X_i) + U_i$, where $l_\Delta$ and $l_f$ denote linear functions and $U_i$ is standard Gaussian noise. Then the linear-BH procedure controls the FDR for the nonpositive-effect null in (17) of the main paper asymptotically, as the sample size $n$ goes to infinity.

Note that the error control need not hold when the linearity assumption is violated. For example, if the expected treatment effect $E(Y^T_i - Y^C_i \mid X_i)$ is a nonlinear function of the covariates, the estimated treatment effect $\widehat{\Delta}^{\mathrm{BH}}_i$ is not consistent; in turn, for the null subjects with zero effect, the $p$-values are not valid (i.e., not stochastically equal to or larger than uniform). Hence, the linear-BH procedure does not guarantee the desired FDR control, as we show in the numerical experiments in Section 3 of the main paper.
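For concreteness, a minimal sketch of the linear-BH procedure (per-arm ordinary least squares in NumPy; the one-sided Gaussian $p$-value follows the variance calculation in the proof below, while the exact form of (15) is in the main paper; all names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def linear_bh(X, Y, A, alpha):
    """Linear-BH sketch: fit OLS in each arm, impute the counterfactual
    outcome, form one-sided p-values for H0: effect <= 0, then run BH."""
    n, d = X.shape
    Z = np.column_stack([np.ones(n), X])             # design with intercept
    fits = {}
    for a in (0, 1):
        idx = np.nonzero(A == a)[0]
        beta, *_ = np.linalg.lstsq(Z[idx], Y[idx], rcond=None)
        resid = Y[idx] - Z[idx] @ beta
        sigma2 = resid @ resid / (len(idx) - d - 1)  # noise variance estimate
        cov = sigma2 * np.linalg.inv(Z[idx].T @ Z[idx])
        fits[a] = (beta, sigma2, cov)
    delta = np.empty(n)                              # estimated effects
    var = np.empty(n)                                # and their variances
    for i in range(n):
        a = int(A[i])
        beta_o, _, cov_o = fits[1 - a]
        imputed = Z[i] @ beta_o                      # counterfactual outcome
        delta[i] = (Y[i] - imputed) if a == 1 else (imputed - Y[i])
        var[i] = fits[a][1] + Z[i] @ cov_o @ Z[i]    # own noise + imputation
    pvals = norm.sf(delta / np.sqrt(var))            # one-sided p-values
    order = np.argsort(pvals)                        # Benjamini-Hochberg step
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, n + 1) / n)[0]
    k = below.max() + 1 if below.size else 0
    return np.sort(order[:k])
```

As the remark above notes, this baseline is only valid when the linear model is correctly specified.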
Proof. For simplicity, we treat all the covariates as fixed values and collect them into the covariate matrix $X_a = (X_i : A_i = a)^T$ for $a \in \{T, C\}$, where we temporarily write $A_i = T$ to denote the treated case $A_i = 1$. Under the linearity assumption, the estimated outcome $\widehat{l}_a(X_i)$ asymptotically follows a Gaussian distribution whose expected value is $l_\Delta(X_i)\mathbf{1}\{a = T\} + l_f(X_i)$. Its variance can be estimated as

$$\widehat{\mathrm{Var}}\big(\widehat{l}_a(X_i)\big) = \widehat{\sigma}^2_a \cdot X_i^T (X_a^T X_a)^{-1} X_i,$$

where the variance from the noise is estimated as

$$\widehat{\sigma}^2_a = \sum_{A_i = a} \big(Y_i - \widehat{l}_a(X_i)\big)^2 \Big/ \Big(\sum_i \mathbf{1}\{A_i = a\} - d - 1\Big),$$

and $d$ is the number of covariates. Note that the observed outcome also follows a Gaussian distribution, $N\big(l_\Delta(X_i)\mathbf{1}\{a = T\} + l_f(X_i), \sigma^2_a\big)$. Note that in each estimated effect $\widehat{\Delta}^{\mathrm{BH}}_i$, the observed outcome $\widetilde{Y}^a_i$ is independent of the estimated potential outcome $\widetilde{Y}^{\bar{a}}_i$, where $\bar{a}$ is the complement of $a$: $\{a\} \cup \{\bar{a}\} = \{T, C\}$. Thus, the estimated effect asymptotically follows a Gaussian distribution whose expected value is $l_\Delta(X_i)$ (nonpositive under the null) and whose variance is $\mathrm{Var}(\widehat{\Delta}^{\mathrm{BH}}_i) = \mathrm{Var}(\widetilde{Y}^T_i) + \mathrm{Var}(\widetilde{Y}^C_i)$, where an estimate is

$$\widehat{\mathrm{Var}}(\widetilde{Y}^a_i) = \widehat{\sigma}^2_a\,\mathbf{1}\{A_i = a\} + \widehat{\mathrm{Var}}\big(\widehat{l}_a(X_i)\big)\,\mathbf{1}\{A_i \ne a\}.$$

Therefore, the resulting $p$-value $P_i$, as defined in (15) of the main paper, is asymptotically valid (uniform or stochastically larger) if subject $i$ is a null, and hence the BH procedure leads to asymptotic FDR control (Fan et al., 2007).

C Proof of power analysis
Our proof of the power analysis mainly uses the results of Arias-Castro and Chen (2017), who consider a setup with $n$ hypotheses, each associated with a test statistic $V_i$, $i \in [n]$. Assume the test statistics are independent with survival function $P(V_i \ge x) = \Psi_i(x)$, which equals $\Psi(x - \mu_i)$, where $\mu_i = 0$ under the null and $\mu_i > 0$ otherwise. The distribution is called asymptotically generalized Gaussian (AGG) if

$$\lim_{x \to \infty} x^{-\gamma} \log \Psi(x) = -1/\gamma, \qquad (33)$$

with a constant $\gamma > 0$. For example, a normal distribution is AGG with $\gamma = 2$. They discuss a class of multiple testing methods called threshold procedures: the final rejection set $\mathcal{R}$ is of the form

$$\mathcal{R} = \{i : V_i \ge \tau(V_1, \dots, V_n)\}, \qquad (34)$$

for some threshold $\tau(V_1, \dots, V_n)$, and they separately study two types of thresholds: the BH procedure (Benjamini and Hochberg, 1995) with threshold

$$\tau_{\mathrm{BH}} = V_{(\iota_{\mathrm{BH}})}, \qquad \iota_{\mathrm{BH}} := \max\{i : V_{(i)} \ge \Psi^{-1}(i\alpha/n)\}, \qquad (35)$$

where $V_{(1)} \ge \dots \ge V_{(n)}$ are the order statistics; and the Barber-Candès (BC) procedure (Barber and Candès, 2015) with a threshold on the absolute values of the $V_i$:

$$\tau_{\mathrm{BC}} = \inf\{\nu \in |V| : \widehat{\mathrm{FDP}}(\nu) \le \alpha\}, \qquad (36)$$

where $|V| := \{|V_i| : i \in [n]\}$ is the set of sample absolute values and

$$\widehat{\mathrm{FDP}}(\nu) := \frac{|\{i : V_i \le -\nu\}| + 1}{\max\{|\{i : V_i \ge \nu\}|, 1\}},$$

and the final rejection set consists of the subjects with positive $V_i$ no smaller than $\tau_{\mathrm{BC}}$. The stopping rule of the BC procedure is similar to that of our proposed algorithms, as detailed next.

Recall from Section 4 of the main paper that we consider a simplified automated version of the I that excludes the subject with the smallest absolute value of the estimated treatment effect $|\widehat{\Delta}_i|$. Thus, the automated I is a BC procedure whose test statistic of interest is $V_i = \widehat{\Delta}_i = 4(A_i - 1/2)(Y_i - \widehat{m}_n(X_i))$. Following the above notation, and letting $\Phi$ be the CDF of the standard Gaussian, we denote the survival function for the nulls by

$$\Psi^{\text{null}}_n(x) = \tfrac{1}{2}\big(1 - \Phi(x + \widehat{m}_n)\big) + \tfrac{1}{2}\big(1 - \Phi(x - \widehat{m}_n)\big),$$

which is a mixture of two Gaussians, where $\widehat{m}_n = \frac{1}{n}\sum_{i=1}^n Y_i$ and $\widehat{m}_n \overset{a.s.}{\to} 0$; and the survival function for the non-nulls by

$$\Psi^{\text{non-null}}_n(x) = \tfrac{1}{2}\big(1 - \Phi(x + \widehat{m}_n - \mu)\big) + \tfrac{1}{2}\big(1 - \Phi(x - \widehat{m}_n)\big).$$

Note that our setting is slightly different from the discussion in Arias-Castro and Chen (2017), because the non-nulls differ from the nulls by a shift in one of the two Gaussian components (rather than a shift in the overall survival function $\Psi^{\text{null}}_n$). Similar to the AGG characterization in (33), both survival functions $\Psi^{\text{null}}_n$ and $\Psi^{\text{non-null}}_n$ asymptotically satisfy the tail property that for any $x_n \to \infty$ as $n \to \infty$,

$$\lim_{n \to \infty} x_n^{-\gamma} \log \Psi_n(x_n) = -1/\gamma, \qquad (37)$$

with probability one and $\gamma = 2$, which we later refer to as asymptotic AGG. The conclusions in our paper then essentially follow the proofs in Arias-Castro and Chen (2017), with the test statistics $V_i$ specified as the estimated treatment effects $\widehat{\Delta}_i$.

C.1 Proof of theorem 3
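Before the proof, the BC-type stopping rule (36), which the automated I instantiates with $V_i = \widehat{\Delta}_i$, can be computed by scanning the candidate thresholds $|V_i|$; a minimal sketch:

```python
import numpy as np

def bc_rejections(V, alpha):
    """Barber-Candes-type procedure: scan thresholds nu over |V_i| and
    reject {i : V_i >= nu} at the smallest nu with FDPhat(nu) <= alpha,
    where FDPhat(nu) = (#{V_i <= -nu} + 1) / max(#{V_i >= nu}, 1)."""
    V = np.asarray(V, dtype=float)
    for nu in np.sort(np.abs(V)):                  # candidate thresholds
        fdp_hat = ((V <= -nu).sum() + 1) / max((V >= nu).sum(), 1)
        if fdp_hat <= alpha:
            return np.nonzero(V >= nu)[0]
    return np.array([], dtype=int)

# toy example: five strong positive statistics among symmetric noise
V = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 0.3, -0.4, 0.2])
print(bc_rejections(V, alpha=0.2))  # -> [0 1 2 3 4]
```

The "+1" in the numerator is what yields finite-sample (rather than only asymptotic) control in the BC analysis.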
We first present the proof for the power of the automated Crossfit-I; the power of the linear-BH procedure is proven similarly, as shown later.

Proof.

Zero power when $r < \beta$. The argument for zero power applies to any threshold procedure as defined in (34): $\mathcal{R} = \{i : \widehat{\Delta}_i \ge d\}$ for some $d \in \mathbb{R}$. Following the proof of Theorem 1 in Arias-Castro and Chen (2017), we argue that the FDR control cannot be satisfied for any $\alpha \in (0, 1)$ unless $d$ is large enough that $d > \mu + \delta_n$ with $\delta_n = \log\log n$; but in that case, most non-nulls cannot be included in the rejection set, and thus the power goes to zero.

First, we claim that when $d \le \mu + \delta_n$, the false discovery proportion (FDP) goes to one in probability. By the proof of Theorem 1 in Arias-Castro and Chen (2017), the FDP goes to one in probability if $(n - n_1)\Psi^{\text{null}}_n(\mu + \delta_n)/n_1 \to \infty$ with probability one, where $n_1$ is the number of non-nulls. Their proof also verifies that $(n - n_1)\Psi^{\text{null}}_n(\mu + \delta_n)/n_1 \to \infty$, because $\Psi^{\text{null}}_n$ satisfies property (37) with probability one.

Next, we show that when $d > \mu + \delta_n$, the power goes to zero. Notice that power can be equivalently defined as $E(1 - \mathrm{FNR})$, where the FNR (false negative rate) is the proportion of non-nulls not identified. Again by the proof of Theorem 1 in Arias-Castro and Chen (2017), the FNR converges to one in probability if $\Psi^{\text{non-null}}_n(\mu + \delta_n)$ goes to zero with probability one, which is true because $\delta_n \to \infty$.

Combining the above two arguments, we conclude that for any threshold procedure whose rejection set is of the form (34), the power goes to zero for any FDR level $\alpha \in (0, 1)$ when $r < \beta$.

Note that the above proof assumes that the test statistics $V_i = \widehat{\Delta}_i$ are mutually independent. For simplicity, we design the Crossfit-I so that $\widehat{m}_n$ for the I implemented on $I$ is computed using the data in $II$, to ensure this mutual independence. Thus, the above proof applies to the I implemented on each half, $I$ and $II$. The overall power behaves the same asymptotically, since $I$ and $II$ result from a random split of all subjects $[n]$. For all cases hereafter, we prove the power claim for the I implemented on $I$ conditional on the data in $II$, and the same claim holds for the overall power as reasoned above.

Half power when $r > \beta$. We first prove that the limit inferior of the power is at least $1/2$, and then that the limit superior is at most $1/2$, mainly using the proof of Theorem 3 in Arias-Castro and Chen (2017). They consider a sequence of thresholds $d^*_n = (\gamma r^* \log n)^{1/\gamma}$ for some $r^* \in (\beta, r \wedge 1)$. The estimated FDR at $d^*_n$ is less than any $\alpha \in (0, 1)$ for large $n$; that is, $\widehat{\mathrm{FDR}}(d^*_n) \le \alpha$. This can be verified by the proof of Theorem 3 in Arias-Castro and Chen (2017), where the survival function of $\widehat{\Delta}_i$ is $G(d^*_n) = (1 - \epsilon)\Psi^{\text{null}}_n(d^*_n) + \epsilon\,\Psi^{\text{non-null}}_n(d^*_n)$ with $\epsilon = n^{-\beta}$, together with the facts that $\Psi^{\text{non-null}}_n(d^*_n) \to 1/2$ and $\Psi^{\text{null}}_n(d^*_n) = n^{-r^*(1 + o(1))}$ (by property (37)) with probability one. It follows that the true stopping threshold $\tau_n$ satisfies $\tau_n \le d^*_n$. Also, by Lemma 1 in Arias-Castro and Chen (2017), the proportion of correctly identified non-nulls at threshold $d$ is $\frac{1}{n_1}\sum_{i \notin \mathcal{H}_0} \mathbf{1}\{\widehat{\Delta}_i \ge d\} = \Psi^{\text{non-null}}_n(d) + o_P(1)$, where $\Psi^{\text{non-null}}_n(d)$ decreases in $d$ and converges to $1/2$ at $d = d^*_n$. Since the true stopping threshold is no larger than $d^*_n$, the limit inferior of the power is at least $1/2$.

To prove that the limit superior is at most $1/2$, fix any $d \in (0, \infty)$; we claim that the actual stopping threshold satisfies $\tau_n \ge d$ for large $n$, because the FDR estimator at any fixed threshold goes to one, following similar arguments as in the proof of Theorem 3 in Arias-Castro and Chen (2017). Specifically,

$$\widehat{\mathrm{FDR}}(d) \equiv \frac{|\{i \in [n] : \widehat{\Delta}_i \le -d\}| + 1}{\max\{|\{i \in [n] : \widehat{\Delta}_i \ge d\}|, 1\}} = \frac{1 + n\big(1 - \widehat{G}_n(-d)\big)}{\max\{n\widehat{G}_n(d), 1\}},$$

where $\widehat{G}_n(d) = \frac{1}{n}\sum_{i \in [n]} \mathbf{1}(\widehat{\Delta}_i \ge d)$ denotes the empirical survival function. Using the facts that $G_n(d) = (1 - \epsilon)\Psi^{\text{null}}_n(d) + \epsilon\,\Psi^{\text{non-null}}_n(d) \to 1 - \Phi(d)$ and $G_n(-d) \to \Phi(d)$ almost surely, we observe that $\widehat{\mathrm{FDR}}(d) \to (1 - \Phi(d))/(1 - \Phi(d)) = 1 > \alpha$, so indeed $\tau_n \ge d$ for large $n$. The proportion of identified non-nulls at a threshold no smaller than $d$ is at most $\Psi^{\text{non-null}}_n(d) + o_P(1)$ (recall the previous paragraph), where $\Psi^{\text{non-null}}_n(d) \to 1/2 + \tfrac{1}{2}(1 - \Phi(d))$. Thus, the power for large $n$ is smaller than $1/2 + \tfrac{1}{2}(1 - \Phi(d))$ for every $d \in (0, \infty)$; in other words, the limit superior of the power is at most $\inf_{d \in (0, \infty)} \big[1/2 + \tfrac{1}{2}(1 - \Phi(d))\big] = 1/2$.

Combining with the limit inferior of $1/2$, we conclude that the power converges to $1/2$. In fact, the above proof implies that the power of identifying non-null subjects that are treated is one (notice that $\Psi^{\text{non-null, treated}}_n(d^*_n) = 1 - \Phi(d^*_n + \widehat{m}_n - \mu) \to 1$).
The power of the linear-BH procedure when there is nocovariates can be proved following similar steps as above, and using intermediate results of Theorem 2in Arias-Castro and Chen (2017). In their notation, the linear-BH procedure uses V i = (cid:98) ∆ BH i as the teststatistics, and we separately discuss power among the treated group and the control group in ensurethe independence among V i . The survival functions for the nulls and non-nulls in the treated groupare Ψ null n ( x ) = 1 − Φ (cid:32) x (cid:112) /n C (cid:33) and Ψ non-null n ( x ) = 1 − Φ (cid:32) x − µ (cid:112) /n C (cid:33) , where n C is the number of untreated subjects. For the control group, the survival functions areΨ null n ( x ) = 1 − Φ (cid:32) x + (cid:98) Y T (cid:112) /n T (cid:33) and Ψ non-null n ( x ) = 1 − Φ (cid:32) x + (cid:98) Y T (cid:112) /n T (cid:33) , where (cid:98) Y T = (cid:80) A i =1 Y i a.s. →
0, and n T is the number of treated subjects. Since the above distributionsconverge to a Gaussian, these survival functions satisfy the AGG property asymptotically as definedin (37). Proof.
First, we claim that the power goes to zero when r < β , following the proof in Section C.1 forany threshold procedure (separately for the treated group conditional on control group). Then, weprove that power converges to 1 / r > β : the power among untreated subjects is asymptoticallyzero because the survival functions for the nulls and non-nulls are the same; the power among treatedsubjects is asymptotically one following the proof of Theorem 2 in Arias-Castro and Chen (2017) asdetailed next.Again consider the sequence of thresholds d ∗ n = ( γr ∗ log n ) /γ for some r ∗ ∈ ( β, r ∧ d ∗ n is less than any α ∈ (0 ,
1) for large n , or mathematically (cid:91) FDR( d ∗ n ) ≤ α .It can be verified by the proof of Theorem 2 in Arias-Castro and Chen (2017) where G ( d ∗ n ) = (1 − (cid:15) )Ψ null n ( d ∗ n )+ (cid:15) Ψ non-null n ( d ∗ n ) with (cid:15) = n − β , and the fact that Ψ non-null n ( d ∗ n ) → null n ( d ∗ n ) → n − r ∗ (by property (37)) with probability one. It follows that the true stopping threshold τ n satisfies τ n ≤ d ∗ n . Also, by Lemma 1 in Arias-Castro and Chen (2017), we have that the proportionof correctly identified non-nulls at threshold d ∗ n is n (cid:80) i/ ∈H { (cid:98) ∆ BH i ≥ d } = Ψ non-null n ( d ∗ n ) + o P (1),where Ψ non-null n ( d ) for the treated group decreases in d and converges to 1 when d = d ∗ n . Recall thatthe true stopping threshold is no larger than d ∗ n , so the limit inferior of the power among the treatedsubjects is at least 1. Therefore, the overall power converges to 1 / .2 Proof of theorem 4 We first consider the power when all subjects are non-nulls ( β = 0). Lemma 2.
Given any fixed FDR control level $\alpha \in (0, 1)$, let the number of subjects $n$ go to infinity. When all subjects are non-nulls, the stopping time $\tau = 0$ with probability tending to one if $\mu > \Phi^{-1}\big(\tfrac{1}{1+\alpha}\big)$, and in this case the power converges to $\Phi(\mu)$. E.g., when $\alpha = 0.2$, the asymptotic power of the automated I is larger than $0.84$ if $\mu \ge 1$ (since $\Phi^{-1}(1/1.2) \approx 0.97 < 1$ and $\Phi(1) \approx 0.84$).

Proof. The stopping time $\tau = 0$ if and only if the FDR control is satisfied when all the subjects are included: $\widehat{\mathrm{FDR}}_n(\mathcal{R}_0) = \frac{|\mathcal{R}^-_0| + 1}{\max\{|\mathcal{R}^+_0|, 1\}} \le \alpha$, or equivalently, $\frac{|\mathcal{R}^+_0|}{n} \ge \frac{1 + 1/n}{1 + \alpha}$. Notice that the proportion of positive $\widehat{\Delta}_i$ converges to a constant, $|\mathcal{R}^+_0|/n \overset{a.s.}{\to} \Phi(\mu)$, because in this setting each non-null $\widehat{\Delta}_i$ follows a Gaussian distribution with mean $\mu$ and unit variance. Thus, if $\Phi(\mu) > \frac{1}{1+\alpha}$, then for any small enough $\epsilon \in (0, 1)$ there exists $N$ such that for all $n \ge N$: (a) $\big|\,|\mathcal{R}^+_0|/n - \Phi(\mu)\,\big| < \epsilon$ with probability at least $1 - \epsilon$; and (b) $\widehat{\mathrm{FDR}}_n(\mathcal{R}_0) \le \alpha$ (hence $\tau = 0$) with probability at least $1 - \epsilon$. (Notice that the threshold $N$ can be chosen to not depend on $\mu$, which is useful in the next proof.) In such a case, the power is no less than $(1 - \epsilon)(\Phi(\mu) - \epsilon)$ when $n \ge N$, and the power is no larger than $\Phi(\mu) + \epsilon$; so the power converges to $\Phi(\mu)$. The proof completes once we notice that the condition $\Phi(\mu) > \frac{1}{1+\alpha}$ is equivalent to $\mu > \Phi^{-1}\big(\tfrac{1}{1+\alpha}\big)$.

Proof of Theorem 4.
Power of the Crossfit-I. Recall that the I implemented on $I$ excludes subjects based on the averaged estimated effect on $II$: $\mathrm{Pred}(x) = \mathrm{avg}\{\widehat{\Delta}_i : i \in II, X_i = x\}$, which converges to $\mu$ almost surely when $x = 1$ (the non-nulls), and to $0$ almost surely when $x = 0$ (the nulls). Thus, with probability tending to one, no non-null in $I$ is excluded before all the nulls in $I$ are excluded, for any fixed $\mu > 0$. Combined with Lemma 2, we have that if $\mu > \Phi^{-1}\big(\tfrac{1}{1+\alpha}\big)$, then for any $\epsilon \in (0, 1)$ there exists $N(\epsilon, \alpha)$ such that for all $n \ge N$, the power of the I is higher than $\Phi(\mu) - \epsilon$. Also, the limit of the power increases to one for any $r > 0$ (as $\mu$ increases): there exists $N'(\epsilon)$ such that for all $n \ge N'(\epsilon)$, $\Phi(\mu) \ge 1 - \epsilon$. Therefore, for any $\epsilon \in (0, \alpha)$, for all $n \ge \max\{N'(\epsilon), N(\epsilon, \alpha)\}$, the power of the I implemented on $I$ is no less than $1 - 2\epsilon$; this completes the proof.

Power of the linear-BH procedure. As before, we separately argue that the power for the treated group and the control group converges to zero when $r < \beta$, and converges to one when $r > \beta$. For a subject in the treated group with $X_i = x$, where $x \in \{0, 1\}$, the estimated effect is Gaussian: $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(0,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 0)\big)$ for the nulls and $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(\mu,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 0)\big)$ for the non-nulls. For a subject in the control group with $X_i = x$, where $x \in \{0, 1\}$, the estimated effect is Gaussian: $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(0,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 1)\big)$ for the nulls and $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(\mu,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 1)\big)$ for the non-nulls. The power of the linear-BH procedure then results directly from Theorem 2 in Arias-Castro and Chen (2017), because in both the treated and the control group: (a) the linear-BH procedure is the BH procedure with the random variables of interest $V_i = \widehat{\Delta}^{\mathrm{BH}}_i$; (b) the $\widehat{\Delta}^{\mathrm{BH}}_i$ of non-nulls and nulls differ by a shift $\mu$; and (c) the survival function of $\widehat{\Delta}^{\mathrm{BH}}_i$ is asymptotically AGG (recall the definition in (37)), since it converges to a Gaussian distribution.

D Additional numerical experiments
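Throughout this section, we report three quantities per method: the FDP against the zero-effect null, the FDP against the nonpositive-effect null, and the power of identifying positive effects. As a sketch of how we interpret these metrics (an illustrative helper, with `delta` the vector of true effects and `R` a rejection set of indices):

```python
import numpy as np

def metrics(delta, R):
    """FDP for the zero-effect null, FDP for the nonpositive-effect null,
    and power for identifying positive effects."""
    delta = np.asarray(delta, dtype=float)
    R = np.asarray(R, dtype=int)
    n_rej = max(len(R), 1)
    fdp_zero = np.sum(delta[R] == 0) / n_rej      # nulls: exactly zero effect
    fdp_nonpos = np.sum(delta[R] <= 0) / n_rej    # nulls: nonpositive effect
    power_pos = np.sum(delta[R] > 0) / max(np.sum(delta > 0), 1)
    return fdp_zero, fdp_nonpos, power_pos

delta = np.array([2.0, 1.5, 0.0, -1.0, 0.0, 3.0])
print([round(float(v), 3) for v in metrics(delta, [0, 1, 2])])  # -> [0.333, 0.333, 0.667]
```

Averaging these FDPs over simulation repetitions gives the empirical FDR curves shown in the figures below.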
D.1 Identification of individual positive effect
We have seen the numerical results of the proposed methods in the main paper where the treatmenteffect is defined in (16) with sparse and strong positive effect, and dense and weak negative effect.This section presents three more examples of the treatment effect.28 inear effect.
Let the treatment effect be

∆(X_i) = S_∆ · [2 X_i(1) X_i(2) + 2 X_i(3)],   (38)

where S_∆ > 0. In this case, all subjects have treatment effects, and the effect size varies linearly with the covariates (including an interaction term). Thus, the linear-BH procedure has valid error control, as shown in Figure 9a (unlike in the other cases, where the treatment effect is a nonlinear function of the covariates).
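For concreteness, a minimal sketch of generating this effect follows. The covariate distribution below (binary X(1), X(2) and standard Gaussian X(3)) and the value of S_∆ are assumptions made for illustration, not specifications from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, scale = 10000, 2.0  # sample size and scale S_Delta: placeholder choices

# Assumed covariate distribution (an assumption for illustration only):
# X(1), X(2) ~ Bernoulli(1/2) and X(3) ~ N(0, 1).
X1 = rng.integers(0, 2, n)
X2 = rng.integers(0, 2, n)
X3 = rng.normal(size=n)

# Model (38): Delta(X_i) = S_Delta * [2 X_i(1) X_i(2) + 2 X_i(3)].
delta = scale * (2 * X1 * X2 + 2 * X3)

# Almost every subject has a nonzero effect; its sign and size vary
# linearly with the covariates, including the X(1)*X(2) interaction.
prop_positive = float(np.mean(delta > 0))
```

Because the effect is linear in the covariates, a linear working model for the estimated effects is well specified here, which is why the linear-BH procedure controls FDR in this setting.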
Sparse and strong effect that is positive.
Let the treatment effect be

∆(X_i) = S_∆ · [5 X_i(3) · 1{X_i(3) > 1}],   (39)

where S_∆ > 0. Here, only the subjects with X_i(3) > 1 have a (positive) treatment effect.

Sparse and strong effect in both directions.
Let the treatment effect be

∆(X_i) = S_∆ · [5 X_i(3) · 1{|X_i(3)| > 1}],   (40)

where S_∆ > 0. Here, the subjects with X_i(3) > 1 have a positive effect and the subjects with X_i(3) < −1 have a negative effect.

[Figure 9 panels, by row: (a) dense two-sided effect (linear) as in model (38); (b) sparse and strong effect that is positive (nonlinear) as in model (39); (c) sparse and strong effect in both directions (nonlinear) as in model (40); the columns plot FDR, FDR_neg, and power_pos against the scale of treatment effect.]

Figure 9. FDR for the zero-effect null (3) in the main paper (first column), FDR for the nonpositive-effect null (17) in the main paper (second column), and power (third column) of three methods (linear-BH, Crossfit-I, and MaY-I) under three types of treatment effect, when varying the scale of treatment effect S_∆. When the linear assumption holds, as in the first row, the linear-BH procedure has valid FDR control and high power, but its FDR is large when the treatment effect is a nonlinear function of the covariates, as in the latter two rows. In contrast, Crossfit-I and MaY-I have valid FDR control for their respective target null hypotheses.

D.2 Paired samples
We have shown in Section 6 of the main paper that the interactive algorithms can be applied to paired samples, and have discussed their power in two cases where the samples within each pair either match exactly or mismatch to some degree. A side observation is that the power of the algorithms not using the pairing information changes slightly as the degree of mismatch varies. Here, we increase the degree of mismatch to show this change more clearly.

[Figure 10: power_pos against the degree of mismatch for pair-Crossfit-I, pair-MaY-I, unpair-Crossfit-I, and unpair-MaY-I.]

Figure 10.
Power of identifying subjects with positive effects by the proposed algorithms (Crossfit-I and MaY-I) with or without pairing information, when the scale of treatment effect is fixed at 2 and the degree of mismatch ε varies. The power of the algorithms without pairing information first increases and then decreases as ε becomes larger.

We extend the definition of mismatch from ε ∈ (0, 1) to larger ε: P(X̃_i(1) ≠ X_i(1)) = min{ε, 1} and P(X̃_i(2) ≠ X_i(2)) = min{ε, 1} and X̃_i(3) = X_i(3) + U(0, 2ε), where U(0, 2ε) is uniformly distributed between 0 and 2ε; a larger ε leads to a larger degree of mismatch. As ε increases, the power under the unpaired samples first increases (see Figure 10). This is because the treatment effect is positive when X_i(3) > 1, which holds for only about 15% of subjects when there is no mismatch; thus, the pattern of the treatment effect is not easy to learn. In contrast, when there is a positive shift on X̃_i(3), as designed in the mismatch setting above, more subjects have positive effects, so the algorithm can more easily learn the effect pattern, which increases the power. The power can slightly decrease when the degree of mismatch is too large.

Acknowledgements

We thank Edward Kennedy for his comments on the problem setup.
References
Arias-Castro, E. and S. Chen (2017). Distribution-free multiple testing. Electronic Journal of Statistics 11(1), 1983–2001.

Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113(27), 7353–7360.

Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5), 2055–2085.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), 289–300.

Berger, J. O., X. Wang, and L. Shen (2014). A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics 24(1), 110–129.

Cai, T., L. Tian, P. H. Wong, and L. Wei (2011). Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 12(2), 270–282.

Duan, B., A. Ramdas, S. Balakrishnan, and L. Wasserman (2020). Interactive martingale tests for the global null. Electronic Journal of Statistics 14(2), 4489–4551.

Duan, B., A. Ramdas, and L. Wasserman (2020a). Familywise error rate control by interactive unmasking. In Proceedings of the 37th International Conference on Machine Learning, Volume 119, pp. 2720–2729. PMLR.

Duan, B., A. Ramdas, and L. Wasserman (2020b). Which Wilcoxon should we use? An interactive rank test and other alternatives. arXiv preprint arXiv:2009.05892.

Fan, J., P. Hall, and Q. Yao (2007). To how many simultaneous hypothesis tests can normal, Student's t or bootstrap calibration be applied? Journal of the American Statistical Association 102(480), 1282–1288.

Foster, J. C., J. M. Taylor, and S. J. Ruberg (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine 30(24), 2867–2880.

Gu, J. and S. Shen (2018). Oracle and adaptive false discovery rate controlling methods for one-sided testing: theory and application in treatment effect evaluation. The Econometrics Journal 21(1), 11–35.

Howard, S. R. and S. D. Pimentel (2020). The uniform general signed rank test and its design sensitivity. Biometrika, asaa072.

Imai, K. and M. Ratkovic (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics 7(1), 443–470.

Karmakar, B., R. Heller, and D. S. Small (2018). False discovery rate control for effect modification in observational studies. Electronic Journal of Statistics 12(2), 3232–3253.

Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497.

Lei, L. and W. Fithian (2018). AdaPT: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(4), 649–679.

Lei, L., A. Ramdas, and W. Fithian (2020). A general interactive framework for false discovery rate control under structural constraints. Biometrika, asaa064.

Li, A. and R. F. Barber (2017). Accumulation tests for FDR control in ordered hypothesis testing. Journal of the American Statistical Association 112(518), 837–849.

Lipkovich, I. and A. Dmitrienko (2014). Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. Journal of Biopharmaceutical Statistics 24(1), 130–153.

Lipkovich, I., A. Dmitrienko, and R. B. D'Agostino Sr (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine 36(1), 136–196.

Lipkovich, I., A. Dmitrienko, J. Denne, and G. Enas (2011). Subgroup identification based on differential effect search—a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine 30(21), 2601–2621.

Loh, W.-Y., L. Cao, and P. Zhou (2019). Subgroup identification for precision medicine: A comparative review of 13 methods. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9(5), e1326.

Nie, X. and S. Wager (2020). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, asaa076.

Powers, S., J. Qian, K. Jung, A. Schuler, N. H. Shah, T. Hastie, and R. Tibshirani (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine 37(11), 1767–1787.

Rabinovich, M., A. Ramdas, M. I. Jordan, and M. J. Wainwright (2020). Optimal rates and trade-offs in multiple testing. Statistica Sinica 30, 741–762.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society 56(4), 931–954.

Rosenbaum, P. R. (2002). Covariance adjustment in randomized experiments and observational studies. Statistical Science 17(3), 286–327.

Sivaganesan, S., P. W. Laud, and P. Müller (2011). A Bayesian subgroup analysis with a zero-enriched Polya Urn scheme. Statistics in Medicine 30(4), 312–323.

Xie, Y., N. Chen, and X. Shi (2018). False discovery rate controlled heterogeneous treatment effect detection for online controlled experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107(499), 1106–1118.