Interactive identification of individuals with positive treatment effect while controlling false discoveries
Boyan Duan, Larry Wasserman, Aaditya Ramdas
{boyand, larry, aramdas}@stat.cmu.edu
Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213
February 23, 2021
Abstract
Out of the participants in a randomized experiment with anticipated heterogeneous treatment effects, is it possible to identify which ones have a positive treatment effect, even though each has only taken either treatment or control but not both? While subgroup analysis has received attention, claims about individual participants are more challenging. We frame the problem in terms of multiple hypothesis testing: we think of each individual as a null hypothesis (the potential outcomes are equal, for example) and aim to identify individuals for whom the null is false (the treatment potential outcome stochastically dominates the control, for example). We develop a novel algorithm that identifies such a subset, with nonasymptotic control of the false discovery rate (FDR). Our algorithm allows for interaction: a human data scientist (or a computer program acting on the human's behalf) may adaptively guide the algorithm in a data-dependent manner to gain high identification power. We also propose several extensions: (a) relaxing the null to nonpositive effects, (b) moving from unpaired to paired samples, and (c) subgroup identification. We demonstrate via numerical experiments and theoretical analysis that the proposed method has valid FDR control in finite samples and reasonably high identification power.
Subgroup identification has been a major topic in the clinical trial community and the causal inference literature (see Lipkovich et al. (2017); Powers et al. (2018); Loh et al. (2019) and references therein). Typically, the treatment effect in the investigated population varies by the subject's gender, age, and other covariates. Identifying subjects with positive effects can help guide follow-up research and provide medication guidance. However, most existing methods do not have an error control guarantee at the level of the individual: it is possible that most subjects in the identified subgroup do not have positive effects. For example, an identified subgroup could be defined as "female subjects with age less than 40", but only 10% of them with age between 18 and 20 may truly have a positive treatment effect.

We propose to upper bound the proportion of falsely identified subjects (whose potential outcomes under treatment and control are equal, for example) via the language of hypothesis testing. As formalized in the next section, we interpret the problem of identifying subjects with positive effects as one of multiple hypothesis testing, where each subject corresponds to one null hypothesis, and the proportion of false identifications corresponds to a standard error metric, the false discovery rate (FDR). In this context, we propose algorithms with two appealing properties. First, they achieve a finite-sample guarantee on FDR control. Second, our proposed algorithms identify subjects with positive effects through an interactive procedure following a particular protocol: an analyst is allowed to look at an initially "masked" dataset (that is progressively "unmasked" over rounds of interaction), and combines available covariates with prior knowledge via flexible Bayesian (or black-box machine learning) working models to improve power.
In summary, she can combine the strengths of (automated) statistical modeling and (human-guided) scientific knowledge, all while avoiding selection bias and guaranteeing valid FDR control.

1 Problem setup

Consider a sample with $n$ subjects. Each subject $i$ has potential control outcome $Y_i^C$, potential treated outcome $Y_i^T$, and treatment indicator $A_i$, for $i \in [n] \equiv \{1, 2, \ldots, n\}$. Our results allow the potential outcomes to be viewed either as random variables or as fixed. The treatment effect of subject $i$ is defined as $Y_i^T - Y_i^C$, and the observed outcome is $Y_i = Y_i^C (1 - A_i) + Y_i^T A_i$ under the standard causal assumption of consistency ($Y_i = Y_i^T$ when $A_i = 1$ and $Y_i = Y_i^C$ when $A_i = 0$). Person $i$'s covariate is denoted as $X_i$. This paper focuses on completely randomized experiments without interference:

(i) conditional on covariates, treatment assignments are independent coin flips:
$$\mathbb{P}\left[(A_1, \ldots, A_n) = (a_1, \ldots, a_n) \mid X_1, \ldots, X_n\right] = \prod_{i=1}^n \mathbb{P}(A_i = a_i) = (1/2)^n, \quad (1)$$
for any $(a_1, \ldots, a_n) \in \{0, 1\}^n$;

(ii) conditional on covariates, the outcome of one subject $Y_i$ is independent of the assignment $A_j$ of another subject, for any $i \neq j$:
$$Y_i \perp A_j \mid \{X_1, \ldots, X_n\} \quad \text{for } i \neq j, \quad (2)$$
which is implied by (1) when the potential outcomes are viewed as fixed values.

We do not assume the observed data $(Y_i, A_i, X_i)$ are identically distributed. We consider heterogeneous effects in the sense that the distribution of $Y_i^T - Y_i^C$ varies, and aim at identifying those individuals with a positive treatment effect. (If the covariates $X_i$ are not informative about the heterogeneity in $Y_i^T - Y_i^C$, our identification power could be low, and this is to be expected.)
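As a concrete illustration of this setup, the following sketch (hypothetical code, not from the paper; all names are our own) simulates potential outcomes, fair-coin treatment assignments as in assumption (1), and observed outcomes under consistency:

```python
import numpy as np

# Toy simulation of the problem setup (illustrative; names are ours).
rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(size=(n, 2))               # covariates X_i
U = rng.normal(size=n)                     # zero-mean noise
Y_control = X @ np.array([1.0, -1.0]) + U  # potential control outcomes Y_i^C
effect = 2.0 * (X[:, 0] > 0.5)             # heterogeneous effect Y_i^T - Y_i^C
Y_treated = Y_control + effect             # potential treated outcomes Y_i^T

# Assumption (1): assignments are independent fair coin flips.
A = rng.integers(0, 2, size=n)

# Consistency: Y_i = Y_i^C (1 - A_i) + Y_i^T A_i.
Y_obs = Y_control * (1 - A) + Y_treated * A
```

Only `(Y_obs, A, X)` would be observed in practice; the two potential-outcome arrays play the role of the unknown ground truth.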
To formalize and frame the problem in terms of multiple hypothesis testing, we first define the null hypothesis for subject $i$ as having zero treatment effect:
$$H_{0i}^{\text{zero}}: (Y_i^T \mid X_i) \stackrel{d}{=} (Y_i^C \mid X_i), \quad (3)$$
or equivalently, $H_{0i}^{\text{zero}}: (Y_i \mid A_i = 1, X_i) \stackrel{d}{=} (Y_i \mid A_i = 0, X_i)$. Alternatively, we can treat the potential outcomes and covariates as fixed, and frame the null hypothesis as
$$H_{0i}^{\text{zero}}: Y_i^T = Y_i^C. \quad (4)$$
A last, hybrid, version (e.g., Howard and Pimentel (2020)) is to treat the two potential outcomes as random with joint distribution $(Y_i^T, Y_i^C) \mid X_i \sim P_i$, and the null posits
$$H_{0i}^{\text{zero}}: Y_i^T = Y_i^C \ \text{almost surely-}P_i, \quad (5)$$
meaning that $P_i$ is supported on $\{(x, y) : x = y\}$. Our work handles any interpretation.

In later sections, we describe an extension where we relax the null to those with a nonpositive effect, defined by stochastic dominance $(Y_i^T \mid X_i) \preceq (Y_i^C \mid X_i)$, meaning that $\mathbb{P}(Y_i^T \leq y \mid X_i) \geq \mathbb{P}(Y_i^C \leq y \mid X_i)$ for all $y$, or simply $Y_i^T \leq Y_i^C$ if the potential outcomes are fixed.

Our algorithms control the error of falsely identifying subjects whose null hypothesis is true (i.e., having zero effect), and aim at correctly identifying subjects with positive effects. Let $\succ$ denote strict stochastic dominance, as above. We say a subject has a positive effect if
$$(Y_i^T \mid X_i) \succ (Y_i^C \mid X_i). \quad (6)$$
When treating the potential outcomes and covariates as fixed, we simply write $Y_i^T > Y_i^C$.

The output of our proposed algorithms is a set of identified subjects, denoted as $\mathcal{R}$, with a guarantee that the expected proportion of falsely identified subjects is upper bounded. Specifically, denote the set of subjects that are true nulls as $\mathcal{H} := \{i \in [n] : H_{0i}^{\text{zero}} \text{ is true}\}$. Then the number of false identifications is $|\mathcal{R} \cap \mathcal{H}|$. The expected proportion of false identifications is a standard error metric, the false discovery rate (FDR):
$$\text{FDR} := \mathbb{E}\left[\frac{|\mathcal{R} \cap \mathcal{H}|}{\max\{|\mathcal{R}|, 1\}}\right]. \quad (7)$$
Given $\alpha \in (0, 1)$, we want $\text{FDR} \leq \alpha$, and reasonably high power, which is defined as the expected proportion of correctly identified subjects:
$$\text{power} := \mathbb{E}\left[\frac{|\mathcal{R} \cap \text{Pos}|}{\max\{|\text{Pos}|, 1\}}\right],$$
where $\text{Pos} := \{i : (Y_i^T \mid X_i) \succ (Y_i^C \mid X_i)\}$ or $\text{Pos} := \{i : Y_i^T > Y_i^C\}$ is the set of subjects with positive effects.

We note that our problem setup is not exactly the same as in most work on subgroup identification, such as Foster et al. (2011); Zhao et al. (2012); Imai and Ratkovic (2013). There, the identified subgroups are usually defined by functions of covariates, rather than as a subset of the investigated subjects as in our paper. While defining the subgroup by a function of covariates makes it easy to generalize findings from the investigated sample to a larger population, it does not seem straightforward to nonasymptotically control the error of false identifications under the former definition, which is a major distinction between previous studies and our work. Most existing work does not have an error control guarantee (see the overview in Lipkovich et al. (2017), Table XV), except for a few papers discussing error control at the level of subgroups, as opposed to the level of individuals as in our paper. The difference between FDR control at a subgroup level and at an individual level is detailed below.
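For a single realization, the two metrics just defined reduce to simple set ratios; the following toy sketch (made-up sets; FDR and power are expectations of these ratios over the randomness of the experiment) makes them concrete:

```python
# Toy computation of the two error metrics for one realization.
def fdp(R, H):
    # false discovery proportion: |R ∩ H| / max(|R|, 1)
    return len(R & H) / max(len(R), 1)

def true_positive_prop(R, Pos):
    # proportion of correctly identified subjects: |R ∩ Pos| / max(|Pos|, 1)
    return len(R & Pos) / max(len(Pos), 1)

R = {1, 2, 3, 4, 5}        # identified subjects (illustrative)
H = {4, 5, 6, 7}           # true nulls (unknown in practice)
Pos = {1, 2, 3, 8, 9, 10}  # subjects with positive effects (unknown)
```

Here `fdp(R, H)` is 2/5 and `true_positive_prop(R, Pos)` is 3/6; averaging such ratios over repeated experiments gives the FDR and the power.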
Subgroup FDR control.
Karmakar et al. (2018); Gu and Shen (2018); Xie et al. (2018) discuss FDR control at a subgroup level, where the latter two have little discussion on incorporating continuous covariates and require parametric assumptions on the outcomes. Thus, we follow the setup in Karmakar et al. (2018) to compare FDR control at a subgroup level (in their paper) and at an individual level (in our paper). Let the subgroups be non-overlapping sets $\{\mathcal{G}_1, \ldots, \mathcal{G}_G\}$. The null hypothesis for a subgroup $\mathcal{G}_g$ is defined as $H_g$: $H_{0i}^{\text{zero}}$ is true for all $i \in \mathcal{G}_g$, or equivalently, $H_g: \mathcal{G}_g \subseteq \mathcal{H}$ (recall $\mathcal{H}$ is the set of subjects with zero effect). Let $D_g$ be the 0/1 indicator of whether subgroup $\mathcal{G}_g$ is identified or not. The FDR at a subgroup level is defined as the expected proportion of falsely identified subgroups:
$$\text{FDR}_{\text{subgroup}} := \mathbb{E}\left[\frac{|\{g \in [G] : \mathcal{G}_g \subseteq \mathcal{H}, D_g = 1\}|}{\max\{|\{g \in [G] : D_g = 1\}|, 1\}}\right], \quad (8)$$
which collapses to the FDR at an individual level as defined in (7) when each subgroup has exactly one subject. Although our interactive procedure is designed for FDR control at an individual level, we propose extensions to FDR control at a subgroup level in Appendix A.1. As a brief summary, Karmakar et al. (2018) propose to control $\text{FDR}_{\text{subgroup}}$ by constructing a $p$-value for each subgroup and applying the classical BH method (Benjamini and Hochberg, 1995). While their method has many orthogonal benefits (e.g., handling observational studies), it is not trivially applicable to control FDR at an individual level, because their $p$-values would be too coarse to be informative for a subgroup containing a single subject.

Other related error control at a subgroup level. Cai et al.
(2011) and Athey and Imbens (2016) develop confidence intervals for the average treatment effect within subgroups, where the former assumes the size of each subgroup to be large, and the latter requires a separate sample for inference. These intervals can potentially be used to generate a $p$-value for each subgroup and control FDR at a subgroup level via standard multiple testing procedures, but no explicit discussion is provided. Lipkovich et al. (2011), Lipkovich and Dmitrienko (2014), Sivaganesan et al. (2011) and Berger et al. (2014) propose methods with control of a different error metric, the global type-I error: the probability of identifying any subgroup when no subject has a nonzero treatment effect (i.e., $H_{0i}^{\text{zero}}$ is true for all subjects). Our FDR control guarantee implies valid global type-I error control, and FDR control is more informative about the correctness of the identified subgroups/subjects when there exist subjects having nonzero effects.

As discussed, it appears to be new and practically interesting to provide FDR control guarantees at an individual level. Another merit of our proposed method is that it allows a human analyst and an algorithm to interact, in order to better accomplish the goal. Interactive testing is a recent idea that emerged in response to the growing practical need of allowing human interaction in the process of data analysis. In practice, analysts tend to try several methods or models on the same dataset until the results are satisfying, but this violates the validity of standard testing methods (e.g., invalid FDR control).
In our context of identifying positive effects, the appealing advantages of an interactive test include that (a) an analyst is allowed to use (partial) data, together with prior knowledge, to design a strategy for selecting subjects potentially having positive effects, and (b) it is a multi-step iterative procedure during which the analyst can monitor the performance of the current strategy and make adjustments at any step (at the cost of not altering earlier steps). Despite the analyst's flexibility to design and alter the algorithm using (partial) data, our proposed procedure always maintains valid FDR control. We name our proposed algorithm I³ (I-cube), for interactive identification of individual treatment effects.
Figure 1.
A schematic of the I³ algorithm. All treatment assignments are initially kept hidden: only $(Y_i, X_i)_{i \in [n]}$ are revealed to the analyst, while all $\{A_i\}$ remain 'masked'. The initial candidate rejection set is $\mathcal{R}_0 = [n]$ (thus no subject is excluded initially and $i^* = \emptyset$). The false discovery proportion $\widehat{\text{FDR}}$ of the current candidate set $\mathcal{R}_t$ is estimated by the algorithm (dashed lines), and reported to the analyst. If $\widehat{\text{FDR}}(\mathcal{R}_t) > \alpha$, the analyst chooses a subject $i^*_t$ to remove from the proposed rejection set, $\mathcal{R}_t = \mathcal{R}_{t-1} \setminus \{i^*_t\}$, whose assignment $A_{i^*_t}$ is then 'unmasked' (revealed). Importantly, using any available prior information, covariates and working model, the analyst can choose subject $i^*_t$ and shrink $\mathcal{R}_t$ in any manner. This process continues until $\widehat{\text{FDR}}(\mathcal{R}_t) \leq \alpha$ (or $\mathcal{R}_t = \emptyset$).

The algorithm initially hides the treatment assignments $\{A_i\}_{i=1}^n$ from the analyst, and then alternates between two steps, selection and error control, until a simple stopping criterion introduced later is reached.

1. Selection. Consider a set of candidate subjects to be identified as having a positive effect (whose nulls are to be rejected), denoted as the rejection set $\mathcal{R}_t$ for iteration $t$. We start with all subjects included, $\mathcal{R}_0 = [n]$. At each iteration, the analyst excludes probable nulls (i.e., subjects that are unlikely to have positive effects) from the previous $\mathcal{R}_{t-1}$, using all the available information: outcomes $Y_i$ and covariates $X_i$ for all subjects $i \in [n]$, the $A_i$ progressively unmasked in the error control step, and possible prior information. Note that our method does not automatically use prior information and the revealed data. The analyst is free to use any black-box prediction algorithm that uses the available information, and to rank the subjects, possibly by an estimated probability of having a positive treatment effect. This step is where a human is allowed to incorporate her subjective choices.

2. Error control (and unmasking).
The algorithm uses the complete data $\{Y_i, A_i, X_i\}$ to estimate the FDR of the current candidate rejection set, $\widehat{\text{FDR}}(\mathcal{R}_t)$, as feedback to the analyst. If the estimated FDR is above the target level, $\widehat{\text{FDR}}(\mathcal{R}_t) > \alpha$, the analyst goes back to the selection step, along with additional information: the excluded subjects ($i \notin \mathcal{R}_t$) have their $A_i$ unmasked (revealed), which could improve her understanding of the data and guide her choices in the next selection step.

The algorithms we propose in the main paper build on and modify the above procedure to achieve reasonably high power and develop various extensions. An illustration of the identifications made by the Crossfit-I³ (our central algorithm) is in Figure 2.

[Figure 2 panels: (a) subjects in the treated group and control group, separated by two shapes and colors; (b) blue round dots represent subjects with a true positive effect (unknown ground truth); (c) green round dots represent subjects identified by the Crossfit-I³.]

Figure 2.
An illustrative example with 1000 subjects, each with two covariates that are uniform in $[0, 1]$. The Crossfit-I³ identifies most subjects with positive effects, although about half of them did not receive treatment.

Outline.
The rest of the paper is organized as follows. In Section 2, we describe an interactive algorithm wrapped in a cross-fitting framework, which identifies subjects with positive effects under FDR control. We evaluate our proposed algorithm numerically in Section 3, and provide a theoretical power analysis in simple settings in Section 4. We point out several extensions of the proposed algorithm in Sections 5 and 6. Section 7 concludes the paper with a discussion of the potential of our proposed interactive procedures.
2 An interactive algorithm with FDR control
To enable us to effectively infer the treatment effect, we use the following working model:
$$Y_i^C = f(X_i) + U_i \quad \text{and} \quad Y_i^T = \Delta(X_i) + f(X_i) + U_i, \quad (9)$$
where $U_i$ is zero-mean noise (unexplained variance) that is independent of $A_i$. When working with such a model, we effectively want to identify subjects with a positive treatment effect $\Delta(X_i)$. Importantly, model (9) need not be correctly specified or accurately reflect reality in order for the algorithms in this paper to have valid FDR control (but the more ill-specified or inaccurate the model is, the more the power may be hurt).

To identify subjects with positive effects, we first introduce an estimator of the treatment effect $\Delta(X_i)$ following the working model (9). Denote the expected outcome given the covariates as $m(X_i) := \mathbb{E}(Y_i \mid X_i)$, and let $\widehat{m}(X_i)$ be an arbitrary estimator of $m(X_i)$ using the outcomes and covariates $\{Y_i, X_i\}_{i=1}^n$. Define the residual as $E_i := Y_i - \widehat{m}(X_i)$; an estimator of $\Delta(X_i)$ is
$$\widehat{\Delta}_i := 4(A_i - 1/2) \cdot E_i, \quad (10)$$
which, under randomized experiments, is equivalent to the nonparametric estimator of the conditional average treatment effect $\mathbb{E}(Y_i^T \mid X_i) - \mathbb{E}(Y_i^C \mid X_i)$ in several recent papers (Nie and Wager, 2020; Kennedy, 2020), and can be traced back to the semiparametrics literature with Robinson (1988). A critical property of $\widehat{\Delta}_i$ that ensures FDR control is that
$$\mathbb{P}\left(\widehat{\Delta}_i > 0 \,\middle|\, \{Y_j, X_j, E_j\}_{j=1}^n\right) \leq 1/2, \quad (11)$$
under $H_{0i}^{\text{zero}}$ (for any definition in (3), (4), or (5)), because $H_{0i}^{\text{zero}}$ implies $A_i \perp\!\!\!\perp \{Y_i, X_i\}$ and $\mathbb{P}(A_i - 1/2 > 0) = 1/2$. Recall that in Figure 1, the treatment assignments $A_i$ are hidden from the analyst in the selection step, which is reflected in (11) by $A_i$ being omitted from the conditioning. The above property indicates that the estimated effect $\widehat{\Delta}_i$ is no more likely to be positive than negative if the selected subject has zero effect, regardless of how the analyst decides which subject to select. Therefore, the sign of $\widehat{\Delta}_i$ can be used to estimate the number of false identifications and achieve FDR control, which we elaborate next.

This section presents the I³ with valid FDR control; we introduce a modification based on cross-fitting that improves identification power in the next section. The I³ proceeds by progressively shrinking a candidate rejection set $\mathcal{R}_t$ at iteration $t$,
$$[n] = \mathcal{R}_0 \supseteq \mathcal{R}_1 \supseteq \ldots \supseteq \mathcal{R}_n = \emptyset,$$
where recall $[n]$ denotes the set of all subjects. We assume without loss of generality that one subject is excluded in each step. Denote the subject excluded at iteration $t$ as $i^*_t$. The choice of $i^*_t$ can use the information available to the analyst before iteration $t$, formally defined as a filtration (sequence of nested $\sigma$-fields):
$$\mathcal{F}_{t-1} = \sigma\left(\{Y_j, X_j\}_{j \in \mathcal{R}_{t-1}}, \{Y_j, A_j, X_j\}_{j \notin \mathcal{R}_{t-1}}, \sum_{j \in \mathcal{R}_{t-1}} \mathbb{1}\{\widehat{\Delta}_j > 0\}\right), \quad (12)$$
where we unmask (reveal) the treatment assignments $A_j$ for subjects excluded from $\mathcal{R}_{t-1}$, and the sum $\sum_{j \in \mathcal{R}_{t-1}} \mathbb{1}\{\widehat{\Delta}_j > 0\}$ is mainly used for FDR estimation as we describe later. The above available information also includes the residuals $\{E_j\}_{j=1}^n$ defined above equation (10). (Note that property (11) uses the fact that the outcome estimator $\widehat{m}(X_i)$ is independent of $A_i$, so it is important that the estimation of $\widehat{m}$ does not use the assignments $\{A_i\}_{i=1}^n$; this should not affect the estimation much, because $m(X_i) \equiv \mathbb{E}(Y_i \mid X_i)$ is not a function of $A_i$.)

To control FDR, the number of false identifications is estimated via property (11).
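Property (11) can be checked empirically in a small all-null simulation (an illustrative sketch of ours, with an assumed pooled-mean fit for $\widehat{m}$): because $\widehat{m}$ never sees the assignments, the sign of $\widehat{\Delta}_i$ behaves like a fair coin under the null.

```python
import numpy as np

# Empirical check of property (11) in an assumed all-null simulation:
# outcomes ignore treatment, and m_hat is fit from (Y, X) only (here a
# pooled mean), so the sign of delta_hat_i is a fair coin.
rng = np.random.default_rng(1)
n, reps = 2000, 200
pos_fracs = []
for _ in range(reps):
    Y = rng.normal(size=n)           # zero-effect nulls for all subjects
    A = rng.integers(0, 2, size=n)   # coin-flip assignments
    m_hat = Y.mean()                 # estimator of E(Y | X); never sees A
    E = Y - m_hat                    # residuals E_i
    delta_hat = 4 * (A - 0.5) * E    # estimator (10)
    pos_fracs.append(np.mean(delta_hat > 0))
avg = float(np.mean(pos_fracs))      # close to 1/2, matching property (11)
```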
The idea is to partition the candidate rejection set $\mathcal{R}_t$ into $\mathcal{R}_t^+$ and $\mathcal{R}_t^-$ by the sign of $\widehat{\Delta}_i$:
$$\mathcal{R}_t^- := \{i \in \mathcal{R}_t : \widehat{\Delta}_i \leq 0\}, \qquad \mathcal{R}_t^+ := \{i \in \mathcal{R}_t : \widehat{\Delta}_i > 0\}.$$
Notice that our proposed procedure only identifies the subjects whose estimated effect is positive, i.e., those in $\mathcal{R}_t^+$. Thus, the FDR is $\mathbb{E}\left[\frac{|\mathcal{R}_t^+ \cap \mathcal{H}|}{\max\{|\mathcal{R}_t^+|, 1\}}\right]$ by definition, where recall $\mathcal{H}$ is the set of true nulls. Intuitively, the number of false identifications $|\mathcal{R}_t^+ \cap \mathcal{H}|$ can be approximately upper bounded by $|\mathcal{R}_t^- \cap \mathcal{H}|$, since the number of positive signs should be no larger than the number of negative signs among the falsely identified nulls, according to property (11). The set of true nulls $\mathcal{H}$ is unknown, so we use $|\mathcal{R}_t^-|$ to upper bound $|\mathcal{R}_t^- \cap \mathcal{H}|$, and propose an estimator of FDR for the candidate rejection set $\mathcal{R}_t$:
$$\widehat{\text{FDR}}(\mathcal{R}_t) = \frac{|\mathcal{R}_t^-| + 1}{\max\{|\mathcal{R}_t^+|, 1\}}. \quad (13)$$
Overall, the I³ shrinks $\mathcal{R}_t$ until time $\tau := \inf\{t : \widehat{\text{FDR}}(\mathcal{R}_t) \leq \alpha\}$ and identifies only the subjects in $\mathcal{R}_\tau^+$, as summarized in Algorithm 1. We state the FDR control of the I³ in Theorem 1; the proof can be found in Appendix B.1.

Theorem 1. In a randomized experiment with assumptions (1) and (2), and for any analyst that updates their working model(s) at any iteration $t$ using the information in $\mathcal{F}_{t-1}$, the set $\mathcal{R}_\tau^+$ rejected by the I³ algorithm has FDR controlled at level $\alpha$, meaning that
$$\mathbb{E}\left[\frac{|\mathcal{R}_\tau^+ \cap \mathcal{H}|}{\max\{|\mathcal{R}_\tau^+|, 1\}}\right] \leq \alpha,$$
for any definition of the null hypothesis (3), (4) or (5). For the last definition, FDR control also holds conditional on the covariates and potential outcomes.

Algorithm 1
The I³ (interactive identification of individual treatment effects) procedure.

Initial state:
Explorer (E) knows the covariates and outcomes $\{X_i, Y_i\}_{i=1}^n$.
Oracle (O) knows the treatment assignments $\{A_i\}_{i=1}^n$.
The target FDR level $\alpha$ is public knowledge.

Initial exchange: Both players initialize $\mathcal{R}_0 = [n]$ and set $t = 1$.
1. E builds a prediction model $\widehat{m}$ from $X_i$ to $Y_i$.
2. E informs O about the residuals $E_i \equiv Y_i - \widehat{m}(X_i)$.
3. O estimates the treatment effect as $\widehat{\Delta}_i \equiv 4(A_i - 1/2) E_i$.
4. O then divides $\mathcal{R}_t$ into $\mathcal{R}_t^- := \{i \in \mathcal{R}_t : \widehat{\Delta}_i \leq 0\}$ and $\mathcal{R}_t^+ := \{i \in \mathcal{R}_t : \widehat{\Delta}_i > 0\}$.
5. O reveals only $|\mathcal{R}_t^+|$ to E (who infers $|\mathcal{R}_t^-|$).

Repeated interaction:
6. E checks if $\widehat{\text{FDR}}(\mathcal{R}_t) \equiv \frac{|\mathcal{R}_t^-| + 1}{\max\{|\mathcal{R}_t^+|, 1\}} \leq \alpha$.
7. If yes, E sets $\tau = t$, reports $\mathcal{R}_\tau^+$ and exits.
8. Else, E picks any $i^*_t \in \mathcal{R}_{t-1}$ using everything E currently knows. (E tries to pick an $i^*_t$ that E thinks is null, i.e., E hopes that $\widehat{\Delta}_{i^*_t} \leq 0$.)
9. O reveals $A_{i^*_t}$ to E, who also infers $\widehat{\Delta}_{i^*_t}$ and its sign.
10. E updates $\mathcal{R}_{t+1} = \mathcal{R}_t \setminus \{i^*_t\}$, and also $|\mathcal{R}_{t+1}^+|$ and $|\mathcal{R}_{t+1}^-|$.
11. Increment $t$ and go back to Step 6.

Consider a simple case where model (9) is accurate for every subject, with a constant treatment effect $\Delta(X_i) = \delta > 0$. If $\delta$ is larger than the maximum noise, we have $\mathcal{R}_0^+ = [n]$, and the algorithm can stop at the very first step, identifying all subjects. At the other extreme, if the effect $\delta$ is too small, the algorithm may return an empty set, and this makes sense: while small average treatment effects can be learned using a large population, larger treatment effects are needed for individual-level identification.

Related work. Testing procedures that allow human interaction were first proposed by Lei and Fithian (2018) and Lei et al. (2020) for the problem of FDR control in multiple testing, followed by several works for other error metrics (Duan et al., 2020a). These papers focus on generic multiple testing problems, which operate on the $p$-values and ignore the process of generating $p$-values from data. In contrast, Duan et al. (2020b) apply the idea of interactive testing to observed data, and our paper relates most closely to that work. Both works propose tests for treatment effects; the difference is that Duan et al. (2020b) test whether any subject has a nonzero effect with type-I error control, whereas our proposed algorithm aims at identifying subjects having positive effects with FDR control. While the former may appear in an exploratory analysis to see whether the treatment has any effect on any person, the latter is useful to characterize the population where the treatment has an effect.

We end the section with a remark. In step 8 of Algorithm 1, we hope to exclude subjects that are unlikely to have positive effects, based on the revealed data in $\mathcal{F}_{t-1}$. In other words, we should guess the sign of $\widehat{\Delta}_i$, which depends on both the revealed data $\{Y_i, X_i\}$ and the hidden assignment $A_i$. However, notice that at the first iteration we may learn/guess the opposite signs for all the subjects: when all assignments $\{A_i\}_{i=1}^n$ are hidden at $t = 1$, the likelihood of $\{A_i\}_{i=1}^n$ being the true values (leading to all correct signs for $\widehat{\Delta}_i$) is the same as the likelihood of all opposite values (leading to all opposite signs for $\widehat{\Delta}_i$), no matter what working model we use. Consequently, the subjects with large positive effects could be guessed as having large negative effects, causing them to be excluded from the rejection set. To improve power, we propose to wrap the I³ in a cross-fitting framework, as described in the next section.
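For intuition, the shrinking loop of Algorithm 1 with one fully automated selection rule (exclude the candidate with the smallest $|\widehat{\Delta}_i|$ first, which is allowed since $|\widehat{\Delta}_i| = 2|E_i|$ reveals nothing about $A_i$) can be sketched as follows; this is our illustrative code, not the authors' implementation:

```python
import numpy as np

# Minimal automated sketch of the I^3 shrinking loop (illustrative only).
# Candidates are excluded in increasing order of |delta_hat_i|, an allowed
# rule because |delta_hat_i| = 2|E_i| does not depend on A_i, until the
# FDR estimate (13) drops below alpha.
def i_cube_automated(delta_hat, alpha):
    order = list(np.argsort(np.abs(delta_hat)))   # exclusion order
    R = set(range(len(delta_hat)))                # candidate set R_t
    while True:
        n_pos = sum(delta_hat[j] > 0 for j in R)  # |R_t^+|
        n_neg = len(R) - n_pos                    # |R_t^-|
        if (n_neg + 1) / max(n_pos, 1) <= alpha:  # estimator (13)
            return {j for j in R if delta_hat[j] > 0}   # reject R_t^+
        if not order:
            return set()
        R.discard(order.pop(0))                   # exclude i*_t; A_{i*_t} is unmasked

# Example: strong positives survive; weak or negative estimates are excluded.
demo = np.array([5.0] * 8 + [-0.1, 0.1])
rejected = i_cube_automated(demo, alpha=0.2)
```

In this toy run, excluding the single small negative estimate brings the FDR estimate from 2/9 down to 1/9, so the nine positive estimates are rejected.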
The Crossfit-I³

Cross-fitting refers to the idea of splitting the sample into two halves. We perform the I³ on each half separately, so that for each half, the complete data (including the assignments) of the other half is revealed to the analyst to help infer the sign of the treatment effect, addressing the issue of learning the opposite signs and improving the identification power.

Specifically, split the subjects randomly into two sets of equal size, denoted as I and II, where I $\cup$ II $= [n]$. The I³ (Algorithm 1) is implemented on each set separately: at the start of the I³ on set I, the analyst has access to the complete data for all subjects in set II, and tries to identify subjects with positive effects in set I with FDR control at level $\alpha/2$; similarly for the I³ on set II. Mathematically, let the candidate rejection set when implementing the I³ on set I be $\mathcal{R}_t(\text{I})$, where the initial set is $\mathcal{R}_0(\text{I}) = \text{I}$. The available information at iteration $t$ is defined as:
$$\mathcal{F}_{t-1}(\text{I}) = \sigma\left(\{Y_i, X_i\}_{i \in \mathcal{R}_{t-1}(\text{I})}, \{Y_j, A_j, X_j\}_{j \notin \mathcal{R}_{t-1}(\text{I})}, \sum_{i \in \mathcal{R}_{t-1}(\text{I})} \mathbb{1}\{\widehat{\Delta}_i > 0\}\right), \quad (14)$$
which includes the complete data $\{Y_j, A_j, X_j\}$ for $j \in \text{II}$ at any iteration $t \geq 1$. Similarly, we define $\mathcal{R}_t(\text{II})$ and $\mathcal{F}_{t-1}(\text{II})$ for the I³ implemented on set II. The final rejection set is the union of the rejections in I and II (see Algorithm 2). We call this algorithm the Crossfit-I³. As long as the I³ runs on the two sets do not exchange information, Algorithm 2 has valid FDR control (see the proof in Appendix B.2).

Theorem 2. Under assumptions (1) and (2) of randomized experiments, the set $\mathcal{R}_\tau^+$ rejected by the Crossfit-I³ has FDR controlled at level $\alpha$ for any of the null hypotheses (3), (4) or (5). For the last case, FDR control also holds conditional on the covariates and potential outcomes.

(For notational clarity, we use $i$ to denote candidate subjects $i \in \mathcal{R}_t(\text{I})$ and $j$ for non-candidate subjects $j \notin \mathcal{R}_t(\text{I})$, while $k$ is used to index all subjects $k \in [n]$.)

Algorithm 2
The Crossfit-I³.
Input:
Covariates, outcomes, treatment assignments $\{Y_i, A_i, X_i\}_{i=1}^n$, and target level $\alpha$.
Procedure:
1. Randomly split the sample into two subsets of equal size, denoted as I and II.
2. Implement Algorithm 1 at level $\alpha/2$, where E initially knows $\{Y_k, X_k\}_{k=1}^n \cup \{A_j\}_{j \in \text{II}}$ and sets $\mathcal{R}_0(\text{I}) = \text{I}$, getting a rejection set $\mathcal{R}_\tau^+(\text{I}) \subseteq \text{I}$.
3. Implement Algorithm 1 at level $\alpha/2$, where E initially knows $\{Y_k, X_k\}_{k=1}^n \cup \{A_j\}_{j \in \text{I}}$ and sets $\mathcal{R}_0(\text{II}) = \text{II}$, getting a rejection set $\mathcal{R}_\tau^+(\text{II}) \subseteq \text{II}$.
4. Combine the two rejection sets into the final rejection set, $\mathcal{R}_\tau^+ = \mathcal{R}_\tau^+(\text{I}) \cup \mathcal{R}_\tau^+(\text{II})$.

In addition to addressing the issue of learning the opposite sign of $\widehat{\Delta}_i$ in the original I³, another benefit of the cross-fitting framework is that, with the complete data revealed for at least half of the sample, the analyst does not have to infer missing data (the assignment $A_i$), which would probably require some parametric probabilistic modeling and the EM algorithm. Instead, because the assignments are revealed for subjects not in the candidate rejection set (at least half of the sample), their signs of $\widehat{\Delta}_j$ can be correctly calculated and used as "training data". The analyst can then employ a black-box prediction model, such as a random forest, to predict the signs of $\widehat{\Delta}_i$ for the subjects whose assignments are masked (hidden). As an example, we propose the following automated strategy to select a subject at step 8 of Algorithm 1.

Algorithm 3
An automated heuristic to select $i^*_t$ in the Crossfit-I³.
Input: Current rejection set $\mathcal{R}_{t-1}(\text{I})$, and available information for selection $\mathcal{F}_{t-1}(\text{I})$.
Procedure:
1. Train a random forest classifier where the label is $\text{sign}(\widehat{\Delta}_j)$ and the predictors are $Y_j$, $X_j$ and the residuals $E_j$, using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(\text{I})$.
2. Estimate the probability of $\widehat{\Delta}_i$ being positive as $\widehat{p}(i, t)$ for subjects $i \in \mathcal{R}_{t-1}(\text{I})$.
3. Select $i^*_t = \arg\min\{\widehat{p}(i, t) : i \in \mathcal{R}_{t-1}(\text{I})\}$.

We remark that in practice, the analyst can interactively change the prediction model, for example exploring parametric models to see which fits the data better. In principle, the analyst can perform any exploratory analysis on the data in $\mathcal{F}_{t-1}(\text{I})$ to decide on a heuristic or score for selecting subject $i^*_t$; the FDR control remains valid as long as she does not use the assignments $A_i$ of candidate subjects $i \in \mathcal{R}_{t-1}(\text{I})$. For computational efficiency, we usually update the prediction models (or their parameters) once every 100 iterations (say).

To summarize, the Crossfit-I³ described in Algorithm 2 involves two rounds of the I³ (Algorithm 1), where step 8 of selecting a subject is allowed to involve human interaction; alternatively, step 8 can be an automated heuristic as presented in Algorithm 3. Recall the illustrative example in Figure 2, where we implement the Crossfit-I³ with the above automated selection strategy. Each subject is recorded with two covariates in $[0, 1]$, and the Crossfit-I³ correctly identifies most of the subjects with positive effects (see Figure 2c). Next, we demonstrate through repeated numerical experiments and theoretical analysis that the Crossfit-I³ has reasonably high power.

3 Numerical experiments

To assess our proposed procedure, we first describe a baseline method, which calculates a $p$-value for each subject under the assumption of linear models, and applies the classical BH method (Benjamini and Hochberg, 1995). We call this method the linear-BH procedure.

3.1 A baseline: the BH procedure under linear assumptions

For the treated group and the control group, we separately learn a linear model to predict $Y_i$ from $X_i$, denoted as $\widehat{l}_T$ and $\widehat{l}_C$.
By imputing the unobserved potential outcomes, we get estimators of the potential outcomes $\widetilde{Y}_i^T = Y_i \mathbb{1}\{A_i = 1\} + \widehat{l}_T(X_i) \mathbb{1}\{A_i = 0\}$ and $\widetilde{Y}_i^C = \widehat{l}_C(X_i) \mathbb{1}\{A_i = 1\} + Y_i \mathbb{1}\{A_i = 0\}$, and the treatment effect for subject $i$ can be estimated as $\widehat{\Delta}_i^{BH} := \widetilde{Y}_i^T - \widetilde{Y}_i^C$. If the potential outcomes are linear functions of the covariates with standard Gaussian noise (which we refer to as the linear assumption), the estimated treatment effect asymptotically follows a Gaussian distribution. For each subject $i \in [n]$, we calculate a $p$-value for the zero-effect null (3) as
$$P_i = 1 - \Phi\left(\widehat{\Delta}_i^{BH} \Big/ \sqrt{\widehat{\text{Var}}(\widehat{\Delta}_i^{BH})}\right), \quad (15)$$
where the estimated variance is $\widehat{\text{Var}}(\widehat{\Delta}_i^{BH}) = \widehat{\text{Var}}(\widetilde{Y}_i^T) + \widehat{\text{Var}}(\widetilde{Y}_i^C)$, and $\Phi$ denotes the CDF of a standard Gaussian. To identify subjects having positive effects with FDR control, we apply the BH procedure to the above $p$-values. Notice that the error control does not hold when the linear assumption is violated (see Appendix B.4 for the formal FDR control guarantee).

We run a simulation with 500 subjects ($n = 500$). Each subject is recorded with two binary attributes (e.g., female/male and senior/junior) and one continuous attribute (e.g., body weight), denoted as a vector $X_i = (X_i(1), X_i(2), X_i(3)) \in \{0, 1\}^2 \times \mathbb{R}$. Among the $n$ subjects, the binary attributes are marginally balanced, and the subpopulation with $X_i(1) = 1$ and $X_i(2) = 1$ is of size 30. The continuous attribute is independent of the binary ones and follows a standard Gaussian distribution.

The outcomes are simulated as a function of the covariates $X_i$ and the assignment $A_i$ following the generating model (9). Recall that we previously used model (9) as a working model, which is not required to be correctly specified.
Here, we generate the data from such a model in the simulation for a clean evaluation of the considered methods. We specify the noise U_i as a standard Gaussian, the expected control outcome as f(X_i) = 5(X_i(1) + X_i(2) + X_i(3)), and the treatment effect as

∆(X_i) = S_∆ · [5 X_i(3) 1{X_i(3) > 0} − X_i(1)/2], (16)

where S_∆ > 0 controls the scale of the treatment effect. More experiments can be found in Appendix D.1.
Figure 3.
FDR (left) and power (right) of the Crossfit-I³ compared with the linear-BH procedure, with the treatment effect specified as model (16) and the scale S_∆ varied over six levels. The FDR control level is 0.2, marked by a horizontal line in the error-control plots. For all plots in this paper, the FDR and power are averaged over 500 repetitions. The linear-BH procedure does not have valid FDR control because the treatment effect is nonlinear, whereas the Crossfit-I³ controls FDR and can achieve high power. R code to fully reproduce all plots in the paper is available at https://github.com/duanby/I-cube.

As shown in Figure 3, the Crossfit-I³ controls FDR at the target level, as expected. At the same time, the Crossfit-I³ appears to have power comparable to the linear-BH procedure for correctly identifying subjects with true positive effects.

In addition to the numerical experiments, we provide a theoretical power analysis in some simple cases to understand the advantages and limitations of our proposed Crossfit-I³.

First, consider the case without covariates. Our analysis is inspired by the work of Arias-Castro and Chen (2017) and Rabinovich et al. (2020), who study the power of methods with FDR control under a sparse Gaussian sequence model. Let there be n hypotheses, each associated with a test statistic V_i for i = 1, ..., n. They consider a class of methods called threshold procedures, whose final rejection set R has the form R = {i : V_i ≥ τ(V_1, ..., V_n)} for some threshold τ(V_1, ..., V_n); they discuss two types of thresholds (see Appendix C for details of their results). An example of a threshold procedure is the BH procedure. Our proposed I³ can also be simplified to a threshold procedure when using an automated selection strategy at step 8 of Algorithm 1: at each iteration, we exclude the subject with the smallest absolute estimated treatment effect |∆̂_i| (note that this strategy satisfies our requirement of not using the assignments, since |∆̂_i| = |4(A_i − 1/2)(Y_i − m̂(X_i))| = 2|Y_i − m̂(X_i)|). The resulting (simplified and automated) I³ is a threshold procedure with V_i = ∆̂_i.
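This simplified exclusion rule is a one-liner in code; a minimal sketch, where the function name and the data layout are ours:

```python
def exclude_next_simplified(candidates, delta_hat):
    # Exclude the candidate with the smallest |delta_hat_i|; since
    # |delta_hat_i| = 2|Y_i - m_hat(X_i)|, the rule reveals nothing
    # about the treatment assignments of the candidates.
    return min(candidates, key=lambda i: abs(delta_hat[i]))
```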
Note that our original interactive procedure is highly flexible, which makes its power analysis less tractable, so we discuss the power of the Crossfit-I³ with the above simplified selection strategy.

To contextualize our power analysis, we paraphrase one of the results of Arias-Castro and Chen (2017) and Rabinovich et al. (2020). Assume the test statistics V_i ∼ N(µ_i, 1) are independent, with µ_i = 0 under the null and µ_i = µ > 0 under the alternative. There are n₁ non-nulls among the n hypotheses, and the sparsity of the non-nulls is parameterized by β ∈ (0, 1) such that n₁/n = n^(−β). Let the signal µ grow with n as µ = √(2r log n), where the signal strength is encoded by r ∈ (0, 1). The asymptotic power is characterized by the signal strength r and the sparsity β, which are also the critical parameters characterizing the power in our context, as we state later. These authors effectively prove that for any fixed FDR level α ∈ (0, 1), no threshold procedure can have nontrivial power if r < β, but there exist threshold procedures with asymptotic power one if r > β.

Our analysis differs from theirs in the non-null distribution of the test statistics. Given n subjects, suppose the potential outcomes for subject i are distributed as Y_i^C ∼ N(0, 1) and Y_i^T ∼ N(µ_i, 1), where µ_i = 0 if subject i is null and µ_i = µ > 0 if subject i is non-null. Thus, the observed outcome of a null subject is N(0, 1), while that of a non-null subject is a mixture of N(µ, 1) and N(0, 1), depending on whether the subject is treated.

Theorem 3.
Given a fixed FDR level α ∈ (0, 1), let the number of subjects n go to infinity. When there is no covariate, the automated Crossfit-I³ and the linear-BH procedure have the same asymptotic power: if r < β, their power goes to zero; if r > β, their power goes to 1/2. Further, restricted to the treated subjects, their power goes to one.

Remark 1.
The power of both methods cannot converge to a value larger than 1/2 because, without covariates, we cannot differentiate between subjects with zero effect (whose outcome follows a standard Gaussian whether treated or not) and untreated subjects with positive effects (whose outcome also follows a standard Gaussian). The proportion of untreated subjects among those with positive effects is 1/2 because of the assumed randomization.

The above theorem discusses the case where there are no covariates to help guess which untreated subjects have positive effects. Next, we consider the case with an "ideal" covariate X_i: its value indicates whether a subject is non-null (has a positive effect), i.e., X_i = 1{µ_i > 0}. Here, we design the selection strategy (for step 8 of Algorithm 1) as a function of the covariates, because we expect subjects with similar covariates to have similar treatment effects. Specifically, for the I³ implemented on set I, we learn a prediction of ∆̂_j from X_j using the non-candidate subjects j ∈ I^c: Pred(x) = (1/|I^c|) ∑_{j∈I^c} ∆̂_j 1{X_j = x}, where x ∈ {0, 1}. Then, among the candidate subjects i ∈ I, we exclude the ones whose Pred(X_i) is lower. Because this strategy pools information across subjects with the same covariate value, all non-null subjects (i.e., those with X_i = 1) are excluded after the nulls with probability tending to one, regardless of whether they are treated; hence we achieve power one.

Theorem 4.
Given a fixed FDR level α ∈ (0, 1), let the number of subjects n go to infinity. With the covariate X_i = 1{µ_i > 0}, the power of the automated Crossfit-I³ converges to one for any fixed r ∈ (0, 1) and β ∈ (0, 1). In contrast, the power of the linear-BH procedure goes to zero if r < β. (When r > β, the power of both methods converges to one.)

Here is a short informal argument for why our power goes to one. Since the nulls can be excluded before the non-nulls, we focus on the test statistics of the non-nulls. Let →_d denote convergence in distribution. The estimated effect satisfies ∆̂_i →_d N(µ, 1) for each non-null (since, in the notation of Algorithm 1, m̂(X_i = 1) converges to µ/2, so E_i →_d N(µ/2, 1) for those with A_i = 1 and E_i →_d N(−µ/2, 1) for those with A_i = 0). Hence, at the time t₀ right after all the nulls are excluded (when all the non-nulls remain in R_{t₀}), the proportion of positive estimated effects |R⁺_{t₀}|/|R_{t₀}| converges to Φ(µ), where Φ denotes the CDF of a standard Gaussian. We can stop before t₀ and identify the subjects in R⁺_{t₀} if FDR̂(R_{t₀}), as a function of |R⁺_{t₀}|/|R_{t₀}|, is less than α, which holds when Φ(µ) is large enough. The power goes to one because µ grows to infinity for any fixed r ∈ (0, 1): for large n, we stop before t₀, and the proportion of rejected non-nulls |R⁺_{t₀}|/|R_{t₀}| (which converges to Φ(µ) as argued above) also goes to one. In short, the power guarantee does not depend on the sparsity β because the designed selection strategy incorporates the covariates.

We note that our theoretical power analysis covers two extreme cases: one with no covariate to assist the testing procedure (Theorem 3), and one with a single "ideal" covariate that equals the indicator of being non-null (Theorem 4). The numerical experiments in Section 3 consider more practical settings, where the analyst is provided with a mixture of covariates informative about the heterogeneous effect (X_i(1) and X_i(3) in our example) and some uninformative ones; still, the Crossfit-I³ tends to have reasonably high power. In the following sections, we present extensions of the Crossfit-I³ in various directions.

The Crossfit-I³ controls the false identifications of subjects with zero treatment effect, as defined in the null hypothesis (3), (4) or (5). In this section, we develop a modification that additionally controls the error of falsely identifying subjects with nonpositive treatment effects, by defining a different null hypothesis.

Problem setup.
We define the null hypothesis for subject i as nonpositive effect:

H_0i^nonpositive : (Y_i^T | X_i) ⪯ (Y_i^C | X_i), (17)

or equivalently, H_0i^nonpositive : (Y_i | A_i = 1, X_i) ⪯ (Y_i | A_i = 0, X_i). As before, our algorithm applies to two alternative definitions of the null hypothesis. When treating the potential outcomes and covariates as fixed, the null hypothesis is

H_0i^nonpositive : Y_i^T ≤ Y_i^C, (18)

and in the hybrid version where the potential outcomes are random with joint distribution (Y_i^T, Y_i^C) | X_i ∼ P_i, the null posits

H_0i^nonpositive : Y_i^T ≤ Y_i^C almost surely under P_i. (19)

Note that the nonpositive-effect null is less strict than the zero-effect null. Thus, an algorithm with FDR control for H_0i^nonpositive must have valid FDR control for H_0i^zero, but the reverse need not be true. Indeed, we observe in numerical experiments (Figure 4b) that the Crossfit-I³ does not control FDR for the nonpositive-effect null. This section presents a variant of the Crossfit-I³ that controls false identifications of nonpositive effects, which may be more practical when interpreting the identified subjects. For example, when controlling FDR for the nonpositive-effect null at level α = 0.2, we can claim that the expected proportion of subjects with positive effects among those identified is no less than 80%.
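For intuition about how the procedures estimate the FDR from masked signs, here is a minimal sketch; the function names are ours, and we assume the running estimate takes the same (count of nonpositive signs + 1)/max(count of positive signs, 1) form as the subgroup estimator (28) in Appendix A.1. Under a true null, the sign of the estimated effect behaves like a fair coin given the unmasked data, so the nonpositive signs serve as a proxy for the number of nulls hiding among the positive ones.

```python
def delta_hat(a, y, m_hat_x):
    # Estimated effect 4*(A_i - 1/2)*(Y_i - m_hat(X_i)); its sign is a
    # fair coin under a zero-effect null.
    return 4.0 * (a - 0.5) * (y - m_hat_x)

def fdr_estimate(delta_hats):
    # (#nonpositive signs + 1) / max(#positive signs, 1) over candidates
    pos = sum(1 for d in delta_hats if d > 0)
    neg = len(delta_hats) - pos
    return (neg + 1) / max(pos, 1)
```

The procedure keeps excluding candidates until this estimate drops below α and then reports the candidates with positive signs.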
An interactive procedure with FDR control of nonpositive effects.
Recall that the FDR control of the Crossfit-I³ is based on property (11): when the null hypothesis is true for subject i, we have P(∆̂_i > 0 | {Y_j, X_j, E_j}_{j=1}^n) ≤ 1/2. This statement no longer holds when the null hypothesis is defined as H_0i^nonpositive in (17). Fortunately, the issue can be fixed by making the conditioning in (11) coarser and removing the outcomes:

P(∆̂_i > 0 | {X_j}_{j=1}^n) ≤ 1/2,

which is reflected in the interactive procedure as a reduction of the information available for selecting subject i*_t (at step 8 of Algorithm 1): we additionally mask (hide) the outcomes Y_i of the candidate subjects i ∈ R_{t-1}(I) when implementing the I³ on set I. We call the resulting interactive algorithm MaY-I³, as it masks the outcomes Y.

Specifically, the MaY-I³ modifies the Crossfit-I³ by defining the information available for selecting subjects, when implementing Algorithm 1 on set I, as

F_{t-1}^{−Y}(I) = σ({X_i}_{i∈R_{t-1}(I)}, {Y_j, A_j, X_j}_{j∉R_{t-1}(I)}, ∑_{i∈R_{t-1}(I)} 1{∆̂_i > 0}). (20)

To calculate ∆̂_i at t = 0, when Y_i is masked for all i ∈ I, let m̂_{−I}(X_i) be an estimator of E(Y_i | X_i) learned using the data {Y_j, X_j}_{j∉I} from the non-candidate subjects, and let the residuals be E_i^{−I} := Y_i − m̂_{−I}(X_i). Define ∆̂_i^{−I} := 4(A_i − 1/2) · E_i^{−I}; similar to property (11) for the zero-effect null, we have

P(∆̂_i^{−I} > 0 | {X_j}_{j∈I} ∪ {Y_j, X_j, E_j^{−I}}_{j∉I}) ≤ 1/2, (21)

under H_0i^nonpositive, leading to valid FDR control for nonpositive effects. Overall, the MaY-I³ follows Algorithm 2, with the estimated treatment effect ∆̂_i replaced by ∆̂_i^{−I} and the available information F_{t-1}(I) replaced by F_{t-1}^{−Y}(I). See Appendix B.3 for the proof of FDR control.

Theorem 5.
Under assumptions (1) and (2) of randomized experiments, the MaY-I³ has valid FDR control at level α for the nonpositive-effect null hypothesis under any of the definitions (17), (18) or (19). For the last definition, FDR control also holds conditionally on the covariates and potential outcomes.

Similar to Algorithm 3 for the Crossfit-I³, we can design an automated algorithm for the MaY-I³ to select a subject at step 8 of Algorithm 1, except that the available information F_{t-1}^{−Y}(I) no longer includes the outcomes of the candidate subjects. We defer the details of the automated selection strategy to Appendix A.2.

Numerical experiments.
We compare the Crossfit-I³ and the MaY-I³ using the same experiment as in Section 3.2. In terms of error control, both the Crossfit-I³ and the MaY-I³ control FDR for the zero-effect null at the target level (Figure 4a). When the null is defined as having a nonpositive effect, the Crossfit-I³ can violate the error control (Figure 4b), whereas the MaY-I³ preserves valid FDR control. In terms of power, the Crossfit-I³ has slightly higher power, since the analyst can select subjects using the information in F_{t-1}(I) defined in (12), which is richer than F_{t-1}^{−Y}(I) in (20) for the MaY-I³.

[Figure 4 shows three panels against the scale of the treatment effect, comparing the Crossfit-I³ and the MaY-I³: (a) FDR for the zero-effect null (3); (b) FDR for the nonpositive-effect null (17); (c) power of identifying subjects with positive effects.]

Figure 4.
Performance of the two interactive methods, Crossfit-I³ and MaY-I³, with the treatment effect specified as model (16) and the scale S_∆ varied over six levels. The MaY-I³ controls FDR for a more relaxed null (nonpositive effects) than the Crossfit-I³, while the Crossfit-I³ has slightly higher power than the MaY-I³.

To summarize, the error control of the MaY-I³ is stricter than that of the Crossfit-I³, controlling false identifications of both zero and negative effects, while its power is slightly lower. We recommend the Crossfit-I³ if one is only concerned with the error of falsely identifying subjects with zero effect; alternatively, we recommend the MaY-I³ when it is desired to control the error of falsely identifying subjects with nonpositive effects.

Problem setup.
Our discussion has so far focused on the case where the samples are not paired; the proposed algorithms extend to the paired-sample setting. Suppose there are n pairs of subjects. For pair i ∈ [n] and j = 1, 2, let Y_ij denote the outcomes, A_ij the treatment-assignment indicators, and X_ij the covariates. We consider randomized experiments without interference, and assume that

(i) conditional on the covariates, the within-pair treatment assignments are independent fair coin flips: P[(A_11, ..., A_n1) = (a_11, ..., a_n1) | X_11, ..., X_n2] = ∏_{i=1}^n P(A_i1 = a_i1) = (1/2)^n, and A_i1 + A_i2 = 1 for all i ∈ [n];

(ii) conditional on the covariates, the outcome of one subject Y_{i1 j1} is independent of the treatment assignment of another subject A_{i2 j2} for any (i1, j1) ≠ (i2, j2).

As before, we can develop interactive algorithms for two types of error control (only the definitions treating the potential outcomes as random variables are presented, but the FDR control still applies to all versions of the null):

H_0i^(zero, paired) : (Y_ij^T | X_ij) =_d (Y_ij^C | X_ij) for both j = 1, 2; (22)

H_0i^(nonpositive, paired) : (Y_ij^T | X_ij) ⪯ (Y_ij^C | X_ij) for both j = 1, 2. (23)

Here, we present the extension of the Crossfit-I³ for FDR control of zero effects, and defer the extension of the MaY-I³ for FDR control of nonpositive effects to Appendix A.3.

Interactive algorithms for paired samples. With the pairing information, the treatment effect can be estimated without involving m̂ as in (10):

∆̂_i^paired := (A_i1 − A_i2)(Y_i1 − Y_i2), (24)

as used by Rosenbaum (2002) and Howard and Pimentel (2020), among others. This estimate satisfies the critical property that guarantees FDR control: for a null pair i of two subjects with zero effects as in (22), we have

P(∆̂_i^paired > 0 | {Y_j1, Y_j2, X_j1, X_j2}_{j=1}^n) ≤ 1/2. (25)

Thus, the Crossfit-I³ (Algorithm 2) with ∆̂_i replaced by ∆̂_i^paired has valid FDR control for the zero-effect null (22), where the analyst excludes pairs using the available information: {Y_i1, Y_i2, X_i1, X_i2} for candidate pairs i ∈ R_{t-1}(I); {Y_j1, Y_j2, A_j1, A_j2, X_j1, X_j2} for non-candidate pairs j ∉ R_{t-1}(I); and the sum ∑_{i∈R_{t-1}(I)} 1{∆̂_i^paired > 0} for FDR estimation. An automated strategy to exclude pair i*_t (at step 8 of Algorithm 1) under paired samples is the same as Algorithm 3, with ∆̂_i replaced by ∆̂_i^paired.

Numerical experiments.
We compare the power of the interactive procedures with and without the pairing information, using the same experiments as before. When the subjects within each pair have the same covariate values, the power under paired samples is higher than when treating them as unpaired (see Figure 5a), because the noise in the observed outcomes that comes from the potential control outcomes can be removed by taking the difference in outcomes within each pair.

The advantage of the paired-sample procedures becomes less evident when the subjects within a pair do not match exactly. We simulate unmatched pairs by introducing a parameter ε such that, for each pair i, the covariates of the two subjects satisfy P(X_{i1}(1) ≠ X_{i2}(1)) = ε, P(X_{i1}(2) ≠ X_{i2}(2)) = ε, and X_{i1}(3) = X_{i2}(3) + U(0, 2ε), where U(0, 2ε) is uniformly distributed between 0 and 2ε; a larger ε leads to a larger degree of mismatch. As ε increases, the power of the procedures using the pairing information decreases (see Figure 5b), because the estimated treatment effect ∆̂_i^paired becomes less accurate under mismatch. We further investigate this power decrease in Appendix D.2.

[Figure 5 shows power against (a) the scale of the treatment effect for exact pairs, and (b) the degree of mismatch when subjects within a pair do not match exactly, comparing pair-Crossfit-I³, pair-MaY-I³, unpair-Crossfit-I³, and unpair-MaY-I³.]

Figure 5.
Power under paired samples with the treatment effect specified by model (16), when our proposed algorithms (Crossfit-I³ and MaY-I³) utilize the pairing information; it is higher than when treating all subjects as unpaired. The advantage is less evident when the subjects within each pair are not exactly matched on their covariate values.

Summary
We discuss the problem of identifying subjects with positive effects. Most existing methods identify subgroups with positive treatment effects, and they cannot upper bound the proportion of falsely identified subjects within an identified subgroup. In contrast, we propose the Crossfit-I³ with finite-sample FDR control (i.e., among the identified subjects, the expected proportion of subjects with zero effect is no larger than α). One advantage of the Crossfit-I³ is that it allows human interaction: an analyst (or an algorithm) can incorporate various types of prior knowledge and covariates using any working model, and can adjust the model at any step, potentially improving the identification power. Despite this flexibility, the Crossfit-I³ achieves valid FDR control. Notably, because the Crossfit-I³ incorporates covariates, it can identify subjects with positive effects, including those not treated.

Our proposed interactive procedure was extended in various directions: from FDR control of zero effects to FDR control of nonpositive effects, from unpaired samples to paired samples, and from FDR control at an individual level to FDR control at a subgroup level (details deferred to the Appendix). We have recently begun to extend the proposed methods from randomized experiments to observational data, where the probability of receiving treatment can be heterogeneous and unknown; it seems that several compromises must be made, and finite-sample FDR control may no longer be feasible.

The error control of our interactive procedures is based on independence between the data used for FDR control and the data revealed for interaction, such as property (11) for the zero-effect null and (21) for the nonpositive-effect null. The idea of interactive testing can be generalized to many other problems, as long as we can construct two parts of the data that are (conditionally) independent.
Importantly, our interactive procedures, built on the idea of "masking and unmasking," should be contrasted with data-splitting approaches. We remark that no test, interactive or otherwise, can be run twice from scratch (with a tweak made the second time to boost power) after the entire data has been examined; this amounts to p-hacking. We view our interactive tests as one step towards enabling experts (scientists and statisticians) to work together with statistical models and machine learning algorithms to discover scientific insights with rigorous guarantees.

A More extensions of the I³

A.1 FDR control at a subgroup level
Our proposed interactive methods control FDR at the individual level, upper bounding the proportion of falsely identified subjects. In this section, we show that the idea of interactive testing can be extended to control FDR at the subgroup level, where we aim to identify multiple subgroups with positive effects while upper bounding the proportion of falsely identified subgroups. Recall that FDR control at a subgroup level is studied by Karmakar et al. (2018), as reviewed in Section 1.2 of the main paper.
Problem setup.
Let there be G non-overlapping subgroups G_g for g ∈ [G] ≡ {1, ..., G}. The null hypothesis for each subgroup is defined as zero effect for all subjects within:

H_g : H_0i^zero is true for all i ∈ G_g, (26)

or equivalently, H_g : G_g ⊆ H (recall that H is the set of true null subjects). Let D_g be the decision function taking value 1 if H_g is rejected and 0 otherwise; the FDR at a subgroup level is defined as

FDR_subgroup := E[ |{g ∈ [G] : G_g ⊆ H, D_g = 1}| / max{|{g ∈ [G] : D_g = 1}|, 1} ].

As with the algorithms at the individual level, the algorithms we propose at the subgroup level apply to samples that are paired or unpaired. For notational simplicity, we use {Y_i, A_i, X_i} to denote the observed data for subject i when the samples are unpaired, and for pair i when the samples are paired (where Y_i = {Y_i1, Y_i2}, and similarly for A_i and X_i).

An interactive algorithm to identify subgroups. We first follow the same steps as Karmakar et al. (2018) to define the subgroups and to generate a p-value for each subgroup. Specifically, the subgroups G_g for g ∈ [G] are defined using the outcomes and covariates {Y_j, X_j}_{j=1}^n (by an arbitrary algorithm or strategy, such as grouping subjects with the same covariates). For each subgroup G_g, we compute a p-value P_g by the classical Wilcoxon test (or by a permutation test, which obtains the null distribution by permuting the treatment assignments {A_i}_{i=1}^n).

The interactive procedure we propose differs from Karmakar et al. (2018) in how it processes the p-values of the subgroups. We adopt the work of Lei et al. (2020), which proposes an interactive procedure with FDR control for generic multiple testing problems. The key property that allows human interaction while guaranteeing valid FDR control is similar to that in the I³: the independence between the information used for selection and that used for FDR control.
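As a concrete illustration of this independence for a subgroup p-value (the decomposition stated next in the text), a null p-value P_g ~ U(0, 1) splits into a revealed magnitude and a masked sign that are independent of each other; a minimal sketch, with our own function names:

```python
def decompose_pvalue(p):
    # revealed part: p_tilde = min(p, 1 - p); masked part: the sign bit
    # p_star = +1 if p < 1/2 else -1 (independent of p_tilde for uniform p)
    return min(p, 1.0 - p), (1 if p < 0.5 else -1)

def fdr_estimate_subgroup(p_stars):
    # (|R_t^-| + 1) / max(|R_t^+|, 1) over the candidate subgroups
    pos = sum(1 for s in p_stars if s == 1)
    neg = sum(1 for s in p_stars if s == -1)
    return (neg + 1) / max(pos, 1)
```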
Here, with the p-values of the subgroups, the two independent parts are

P̃_g := min{P_g, 1 − P_g}, which is revealed to the analyst for selection, and

P*_g := 2 · 1{P_g < 1/2} − 1, which is masked (hidden) and used for FDR control.

Note that for a null subgroup with a uniform p-value, (P̃_g, P*_g) are independent, and we have

P(P*_g = 1 | P̃_g, [Y_i, X_i]_{i∈G_g}) ≤ 1/2, (27)

because the p-value obtained by permuting the assignments is uniform conditional on the outcomes and covariates. We remark that the above property is similar to properties (11) and (21) in the main paper, which lead to valid FDR control at the individual level.

Algorithm 4
An interactive procedure for subgroup identification.
Initial state:
Explorer (E) knows the covariates and outcomes {Y_i, X_i}_{i=1}^n.
Oracle (O) knows the treatment assignments {A_i}_{i=1}^n.
The target FDR level α is public knowledge.

Initial exchange:
Set t = 1.
1. E defines subgroups {G_g}_{g=1}^G using {Y_i, X_i}_{i=1}^n.
2. Both players initialize R_1 = [G], and E informs O about the subgroup division.
3. O computes the p-value P_g for each subgroup, and decomposes each p-value as P̃_g := min{P_g, 1 − P_g} and P*_g := 2 · 1{P_g < 1/2} − 1.
4. O partitions R_t into R⁻_t := {g ∈ R_t : P*_g = −1} and R⁺_t := {g ∈ R_t : P*_g = +1}.
5. O reveals {P̃_g}_{g=1}^G, |R⁻_t| and |R⁺_t| to E.

Repeated interaction:
6. E checks whether FDR̂_subgroup(R_t) ≡ (|R⁻_t| + 1)/max{|R⁺_t|, 1} ≤ α.
7. If yes, E sets τ = t, reports R⁺_τ and exits.
8. Else, E picks some g*_t ∈ R_t using everything E currently knows. (E tries to pick a g*_t that they think is null; E hopes that P*_{g*_t} = −1.)
9. O reveals the assignments {A_i}_{i∈G_{g*_t}} to E, who can then also infer P_{g*_t}.
10. E updates R_{t+1} = R_t \ {g*_t}, and also |R⁺_{t+1}| and |R⁻_{t+1}|.
11. Increment t and go back to step 6.

Similar to the proposed methods at the individual level, the interactive procedure for subgroups progressively excludes subgroups and recursively estimates the FDR. Let the candidate rejection set R_t be a set of selected subgroups, starting from all subgroups, R_1 = [G]. We interactively shrink R_t using the available information

F_{t-1}^subgroup = σ({P̃_g, [Y_i, X_i]_{i∈G_g}}_{g∈R_{t-1}}, {P_g, [Y_j, A_j, X_j]_{j∈G_g}}_{g∉R_{t-1}}, ∑_{g∈R_{t-1}} P*_g),

which masks the full p-value P_g and the treatment assignments A_i for the candidate subgroups in R_{t-1}; the sum ∑_{g∈R_{t-1}} P*_g is provided mainly for FDR estimation. Similar to our previously proposed interactive procedures, the FDR estimator is defined as

FDR̂_subgroup(R_t) = (|R⁻_t| + 1)/max{|R⁺_t|, 1}, (28)

with R⁺_t = {g ∈ R_t : P*_g = 1} and R⁻_t = {g ∈ R_t : P*_g = −1}. The algorithm shrinks R_t until time τ := inf{t : FDR̂_subgroup(R_t) ≤ α}, and identifies only the subgroups in R⁺_τ, as summarized in Algorithm 4. Details of strategies to select subgroups based on the revealed p-values and covariates can be found in Lei et al. (2020). As a comparison, Karmakar et al. (2018) use the same set of p-values {P_g}_{g∈[G]} and control FDR by the classical BH procedure.

Numerical experiments.
We compare the performance of our proposed interactive procedure for subgroup identification with the method proposed by Karmakar et al. (2018), following an experiment in their paper. Suppose each subject is recorded with two discrete covariates X_i = {X_i(1), X_i(2)}, where X_i(1) ∈ {1, ..., 40} takes 40 levels with equal probability, and X_i(2) is binary with equal probability (for example, X_i(1) could encode the city subject i lives in, and X_i(2) the gender). The treatment effect ∆(X_i) is a constant δ if X_i(1) is even, and we vary δ over six levels. We conduct the above experiment in two cases: unpaired samples (n = 2000) with independent covariates, and paired samples (n = 1000) whose covariate values are the same for the subjects within each pair.

[Figure 6 shows power against the scale of the treatment effect for the BH and interactive procedures in three settings: (a) unpaired samples; (b) paired samples; (c) only a few subgroups are non-nulls.]
Figure 6.
Performance of methods to identify subgroups with positive effects: the BH procedure and the interactive procedure (for 80 subgroups defined by the distinct values of the covariates). We vary the scale of the treatment effect under unpaired and paired samples. In both cases, the interactive procedure can have higher power than the BH procedure. When the number of non-null subgroups is too small (fewer than 20), the BH procedure can have higher power. The error bars mark two standard deviations from the center.
Recall that the subgroups can be defined using covariates and outcomes. Here, since the covariates are discrete, we define the subgroups by the distinct values of (X_i(1), X_i(2)), resulting in 80 subgroups. The interactive procedure tends to have higher power than the BH procedure (see Figures 6a and 6b) because the exclusion process focuses on the subgroups that are more likely to be non-null, and the covariates are utilized together with the p-values to guide the algorithm. Meanwhile, the BH procedure does not account for covariates once the p-values are calculated. Nonetheless, the interactive procedure can have lower power when the total number of truly non-null subgroups is small. We simulate the case where a subject has a positive effect δ only if X_i(1) is a multiple of 4; the number of true non-nulls is then small, so |R⁺| is small (for example, with FDR control at α = 0.2, we need to shrink R_t until |R⁻| + 1 ≤ α|R⁺|, while |R⁺| is at most around 20).

A side note is that we define the subgroups by the distinct values of the covariates, whereas Karmakar et al. (2018) suggest forming subgroups by regressing the outcomes on the covariates using a tree algorithm. In their experiments and in several numerical experiments we tried, the number of subgroups defined by the tree algorithm is usually less than ten. However, we think FDR control is less meaningful when the total number of subgroups is small. To justify this comment, note that an algorithm with valid FDR control at level α can make zero rejections with probability 1 − α and reject all subgroups with probability α, which can happen when the total number of subgroups is small. In contrast, with a large number of subgroups, a reasonable algorithm is unlikely to jump between the extremes of making zero rejections and rejecting all subgroups; thus, controlling FDR indeed certifies that the proportion of false identifications is low for the evaluated algorithm.

Explanation of the higher power achieved by the interactive procedure.
Although the interactive procedure and the BH procedure use the same set of subgroups and corresponding p-values, the interactive procedure has two properties that can improve its power over the BH procedure: (a) it excludes likely-null subgroups, so it is less sensitive to a large number of nulls, whereas the BH procedure considers all subgroups at once; (b) the interactive procedure additionally uses the covariates. We can evaluate the effect of these two properties separately by implementing two versions of Algorithm 4 that differ in the strategy for selecting subgroups in step 8. Specifically, the adaptive procedure selects the subgroup whose revealed (partial) p-value P̃_g is the smallest (not using the covariates), while the interactive procedure selects the subgroup by an estimated probability of P*_g being positive (using the revealed P̃_g, the covariates, and the outcomes).

[Figure 7 shows power against the scale of the treatment effect for the BH, adaptive, and interactive procedures in two settings: (a) effect as a discrete function of the covariates; (b) effect as a simpler function of the covariates.]
Figure 7.
Power of three methods for subgroup identification: the BH procedure proposed by Karmakar et al. (2018), the adaptive procedure, and the interactive procedure, under different types of treatment effect (we define 80 subgroups by the discrete values of the covariates). Our proposed interactive procedure tends to have higher power than the BH procedure because (1) it excludes likely nulls (shown by the higher power of the adaptive procedure relative to the BH procedure in both plots); and (2) it additionally uses the covariates (shown when the treatment effect can be well learned as a function of the covariates, in the right plot).
To see whether both properties of Algorithm 4 contribute to the improvement in power over the BH procedure, we test the methods under two simulation settings. Recall that the previous experiment defines a positive treatment effect when the discrete covariate X_i(1) ∈ {1, ..., 40} is even. Here, we add another case where the treatment effect is positive when X_i(1) ≤ 20, so that the density of subgroups with positive effects is the same as before, but the treatment effect is a simpler function of the covariates. In the latter case, we would expect the interactive procedure to learn this function of the covariates rather accurately, and to have higher power than the adaptive procedure, which does not use the covariates; this is confirmed in Figure 7b. In the former setting, where the treatment effect is not a smooth function of the covariates and is hard to learn, the adaptive and interactive procedures have similar power (Figure 7a). Still, they have higher power than the BH procedure because they exclude likely-null subgroups.

A.2 An automated algorithm with FDR control on nonpositive effects
We have proposed the MaY-I to guarantee valid FDR control for the nonpositive-effect null in (17), by reducing the information available for selecting subjects from $\mathcal{F}_{t-1}(I)$ for the Crossfit-I to $\mathcal{F}^{-Y}_{t-1}(I)$ (recall that the latter no longer includes the outcomes of candidate subjects). Here, we present an automated algorithm that selects a subject for the MaY-I. One naive strategy is to follow Algorithm 3 in the main paper, which is designed for the Crossfit-I, with the outcomes removed from the predictors; however, this appears to result in less accurate prediction of the effect signs, and in turn rather low power (numerical results are in the next paragraph). Here, we take a different approach and predict the treatment effect instead of its sign, because the treatment effect might be better predicted as a function of the covariates (without outcomes) than a binary sign, especially when the treatment effect is indeed a smooth and simple function of the covariates. Specifically, we first estimate the treatment effect for the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$ using a well-studied doubly-robust estimator (see Kennedy (2020) and references therein):

$$\Delta^{\mathrm{DR}}_j = 4(A_j - 1/2)\cdot\big(Y_j - \widehat{\mu}_{A_j}(X_j)\big) + \widehat{\mu}_1(X_j) - \widehat{\mu}_0(X_j), \qquad (29)$$

where $(\widehat{\mu}_0, \widehat{\mu}_1)$ are random forests trained to predict the outcomes for the control and treated groups, respectively. Using the provided covariates $X_i$, we can predict $\Delta^{\mathrm{DR}}_i$ for the candidate subjects $i \in \mathcal{R}_{t-1}(I)$. The subject with the smallest predicted $\Delta^{\mathrm{DR}}_i$ is then excluded. This automated strategy is described in Algorithm 5.

Algorithm 5. An automated heuristic to select $i^*_t$ in the MaY-I.
Input: current rejection set $\mathcal{R}_{t-1}(I)$, and available information for selection $\mathcal{F}^{-Y}_{t-1}(I)$;
Procedure:
1. Estimate the treatment effect for non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$ as $\Delta^{\mathrm{DR}}_j$ in (29);
2. Train a random forest where the label is the estimated effect $\Delta^{\mathrm{DR}}_j$ and the predictors are the covariates $X_j$, using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
3. Predict $\Delta^{\mathrm{DR}}_i$ for the candidate subjects $i \in \mathcal{R}_{t-1}(I)$ via the above random forest, denoted $\widehat{\Delta}^{\mathrm{DR}}_i$;
4. Select $i^*_t$ as $\operatorname{argmin}\{\widehat{\Delta}^{\mathrm{DR}}_i : i \in \mathcal{R}_{t-1}(I)\}$.

To summarize, we have presented two types of strategies for selecting subjects: the Crossfit-I chooses the subject with the smallest predicted probability of a positive $\widehat{\Delta}_i$ (see Algorithm 3 in the main paper), which we denote the min-prob strategy; and the MaY-I chooses the subject with the smallest predicted effect $\Delta^{\mathrm{DR}}_j$ (see Algorithm 5), which we denote the min-effect strategy. Note that the proposed interactive algorithms can use an arbitrary strategy as long as the information available for selection is appropriately restricted. That is, the Crossfit-I can use the min-effect strategy, and the MaY-I can use the min-prob strategy (after removing the outcomes from the predictors, as we elaborate in the next paragraph). However, we observe in numerical experiments that each interactive procedure has higher power when using its original strategy (see Figure 8).

Before detailing the experimental results, we first describe the min-prob strategy for the MaY-I, where the available information $\mathcal{F}^{-Y}_{t-1}(I)$ does not include the outcomes of candidate subjects. As in the min-prob strategy of Algorithm 3 in the main paper, we would like to use the outcome $Y_i$ and the residual $E_i = Y_i - \widehat{m}(X_i)$ as predictors of the sign of the treatment effect for candidate subjects $i \in \mathcal{R}_{t-1}(I)$, but $Y_i$ and $E_i$ for the candidate subjects are not available in $\mathcal{F}^{-Y}_{t-1}(I)$.
To address this, we propose Algorithm 6, whose steps 1-2 first estimate $Y_i$ and $E_i$ from the covariates; steps 3-5 then mirror Algorithm 3 to obtain the probability of a positive treatment effect.

Algorithm 6. The min-prob strategy to select $i^*_t$ in the MaY-I.
Input: current rejection set $\mathcal{R}_{t-1}(I)$, and available information for selection $\mathcal{F}^{-Y}_{t-1}(I)$;
Procedure:
1. Predict the outcome $Y_k$ of each subject $k \in [n]$ from the covariates, denoted $\widehat{Y}^{-I}(X_k)$, where $\widehat{Y}^{-I}$ is learned using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
2. Predict the residual $E_k = Y_k - \widehat{m}(X_k)$ of each subject $k \in [n]$ from the covariates, denoted $\widehat{E}^{-I}(X_k)$, where $\widehat{E}^{-I}$ is learned using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
3. Train a random forest classifier where the label is $\operatorname{sign}(\widehat{\Delta}_j)$ and the predictors are $\big(\widehat{Y}^{-I}(X_j), X_j, \widehat{E}^{-I}(X_j)\big)$, using the non-candidate subjects $j \notin \mathcal{R}_{t-1}(I)$;
4. Predict the probability of $\widehat{\Delta}_i$ being positive, denoted $\widehat{p}(i, t)$, for the candidate subjects $i \in \mathcal{R}_{t-1}(I)$;
5. Select $i^*_t = \operatorname{argmin}\{\widehat{p}(i, t) : i \in \mathcal{R}_{t-1}(I)\}$.
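As a concrete sketch of the two selection rules, the following assumes scikit-learn random forests; the bookkeeping of which quantities are revealed at step $t$ is simplified, and all array names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def select_min_effect(X, Y, A, candidates):
    """Algorithm 5 (min-effect): exclude the candidate with the smallest
    predicted doubly-robust effect; all models are fit on non-candidates."""
    n = len(Y)
    noncand = np.setdiff1d(np.arange(n), candidates)
    mu = {}
    for a in (0, 1):                              # outcome model per arm
        idx = noncand[A[noncand] == a]
        mu[a] = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[idx], Y[idx])
    mu_obs = np.where(A == 1, mu[1].predict(X), mu[0].predict(X))
    # doubly-robust estimate (29), known propensity 1/2
    delta_dr = 4 * (A - 0.5) * (Y - mu_obs) + mu[1].predict(X) - mu[0].predict(X)
    # regress DR estimates on covariates (non-candidates), predict candidates
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[noncand], delta_dr[noncand])
    return candidates[np.argmin(rf.predict(X[candidates]))]

def select_min_prob(X, Y, A, delta_hat, m_hat, candidates):
    """Algorithm 6 (min-prob): exclude the candidate with the smallest
    predicted probability of a positive effect, replacing the unavailable
    Y_i and E_i of candidates with covariate-based predictions."""
    n = len(Y)
    noncand = np.setdiff1d(np.arange(n), candidates)
    y_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[noncand], Y[noncand])
    e_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(
        X[noncand], (Y - m_hat)[noncand])
    feats = np.column_stack([y_model.predict(X), X, e_model.predict(X)])
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(feats[noncand], (delta_hat[noncand] > 0).astype(int))
    prob_pos = clf.predict_proba(feats[candidates])[:, 1]
    return candidates[np.argmin(prob_pos)]
```

In the sketch, `select_min_effect` corresponds to Algorithm 5 and `select_min_prob` to Algorithm 6; a real implementation would refit these models as the candidate set shrinks at each round of interaction.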
Figure 8. Power of the Crossfit-I and MaY-I under the two strategies for selecting subjects, the min-prob strategy and the min-effect strategy, with the treatment effect defined in (16) of the main paper as the scale $S_\Delta$ varies. The Crossfit-I tends to have higher power when using the min-prob strategy, and the MaY-I tends to have higher power when using the min-effect strategy.

The Crossfit-I has higher power with the min-prob strategy than with the min-effect strategy because the former additionally uses the outcome as a predictor. For the MaY-I, the min-effect strategy leads to higher power because the estimated treatment effect $\Delta^{\mathrm{DR}}_j$ in (29) can provide reliable evidence of which subjects have a positive effect. With the min-prob strategy, it is harder to learn an accurate prediction via Algorithm 6, where two of the predictors, $\widehat{Y}^{-I}(X_j)$ and $\widehat{E}^{-I}(X_j)$, are themselves obtained by estimation, increasing the modeling complexity. Therefore, we present the Crossfit-I and MaY-I with the min-prob and min-effect strategies, respectively, as preferred in the numerical experiments. Nonetheless, we remark that our proposed interactive frameworks for the Crossfit-I and MaY-I allow arbitrary strategies for selecting subjects, and an analyst can design her own strategy based on her domain knowledge.

A.3 FDR control of nonpositive effects for paired samples
Recall the nonpositive-effect null under paired samples:

$$H^{(\text{nonpositive, paired})}_{0i}: \ (Y_{ij} \mid A_{ij}=1, X_{ij}) \preceq (Y_{ij} \mid A_{ij}=0, X_{ij}) \ \text{ for both } j = 1, 2,$$

and we observe that

$$P\big(\widehat{\Delta}^{\text{paired}}_i > 0 \mid \{X_{j1}, X_{j2}\}_{j=1}^n\big) \le 1/2, \qquad (30)$$

where $\widehat{\Delta}^{\text{paired}}_i$ is defined in (24) of the main paper. Thus, the MaY-I with $\widehat{\Delta}_i$ replaced by $\widehat{\Delta}^{\text{paired}}_i$ has valid FDR control for the nonpositive-effect null, where the analyst progressively excludes pairs using the available information:

$$\mathcal{F}^{-Y,\text{ paired}}_{t-1} = \sigma\Big(\{X_{i1}, X_{i2}\}_{i \in \mathcal{R}_{t-1}}, \ \{Y_{j1}, Y_{j2}, A_{j1}, A_{j2}, X_{j1}, X_{j2}\}_{j \notin \mathcal{R}_{t-1}}, \ \textstyle\sum_{i \in \mathcal{R}_{t-1}(I)} \mathbf{1}\{\widehat{\Delta}^{\text{paired}}_i > 0\}\Big).$$

We can also implement an automated version of the MaY-I in which the selection of the excluded pair follows a procedure similar to Algorithm 5. The difference is that in step 1, we estimate the treatment effect for non-candidate pairs $j \notin \mathcal{R}_{t-1}(I)$ directly as $\widehat{\Delta}^{\text{paired}}_j \equiv (A_{j1} - A_{j2})(Y_{j1} - Y_{j2})$ instead of $\Delta^{\mathrm{DR}}_j$, to avoid estimating the outcomes via $(\widehat{\mu}_0, \widehat{\mu}_1)$.

B Proof of error controls
The proofs are based on an optional stopping argument, as a variant of the ones presented in Lei and Fithian (2018), Lei et al. (2020), Li and Barber (2017), and Barber and Candès (2015).

Lemma 1 (Lemma 2 of Lei and Fithian (2018)). Suppose that, conditionally on the $\sigma$-field $\mathcal{G}_{-1}$, the variables $b_1, \dots, b_n$ are independent Bernoulli random variables with $P(b_i = 1 \mid \mathcal{G}_{-1}) = \rho_i \ge \rho > 0$, almost surely. Let $(\mathcal{G}_t)_{t=0}^{\infty}$ be a filtration with $\mathcal{G}_0 \subset \mathcal{G}_1 \subset \dots$, and suppose that $[n] \supseteq \mathcal{C}_0 \supseteq \mathcal{C}_1 \supseteq \dots$, with each subset $\mathcal{C}_{t+1}$ measurable with respect to $\mathcal{G}_t$. If we have

$$\mathcal{G}_t = \sigma\Big(\mathcal{G}_{-1}, \ \mathcal{C}_t, \ (b_i)_{i \notin \mathcal{C}_t}, \ \textstyle\sum_{i \in \mathcal{C}_t} b_i\Big), \qquad (31)$$

and $\tau$ is an almost-surely finite stopping time with respect to the filtration $(\mathcal{G}_t)_{t \ge 0}$, then

$$E\left[\frac{1 + |\mathcal{C}_\tau|}{1 + \sum_{i \in \mathcal{C}_\tau} b_i} \,\Big|\, \mathcal{G}_{-1}\right] \le \rho^{-1}.$$

B.1 Proof of theorem 1
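Before the formal argument, the claimed control can be sanity-checked by simulation. The sketch below runs the simplified automated version of the procedure (cf. Section 4 of the main paper) under the global null; all parameters are illustrative, and Theorem 1 bounds the expected FDP by $\alpha$:

```python
import numpy as np

def interactive_reject(delta_hat, alpha):
    """Simplified automated I: starting from all subjects, repeatedly
    exclude the smallest |delta_hat|; stop as soon as
    FDRhat = (#{delta_hat <= 0} + 1) / max(#{delta_hat > 0}, 1) <= alpha,
    and reject the remaining subjects with positive delta_hat."""
    delta_hat = np.asarray(delta_hat, dtype=float)
    keep_order = np.argsort(np.abs(delta_hat))[::-1]  # large |.| kept longest
    for t in range(len(delta_hat) + 1):
        remaining = keep_order[: len(delta_hat) - t]
        pos = remaining[delta_hat[remaining] > 0]
        fdr_hat = (len(remaining) - len(pos) + 1) / max(len(pos), 1)
        if fdr_hat <= alpha:
            return pos
    return np.array([], dtype=int)

# global null: A ~ Bernoulli(1/2) independent of a symmetric residual E,
# so delta_hat = 4(A - 1/2)E has a random sign given |delta_hat|, and
# every rejection is false (FDP = 1 whenever the rejection set is nonempty)
rng = np.random.default_rng(0)
alpha, n, reps = 0.2, 200, 200
fdps = [
    float(len(interactive_reject(
        4 * (rng.integers(0, 2, n) - 0.5) * rng.normal(size=n), alpha)) > 0)
    for _ in range(reps)
]
print(round(np.mean(fdps), 3))  # Theorem 1 bounds the expectation by alpha
```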
Proof. We show that the I controls FDR by Lemma 1, with $b_i := \mathbf{1}\{4(A_i - 1/2)\cdot E_i \le 0\}$, $\mathcal{G}_{-1} := \sigma\big(\{Y_j, X_j\}_{j=1}^n\big)$, and $\mathcal{C}_t := \mathcal{R}_t \cap \mathcal{H}_0$ for $t = 0, 1, \dots$. The assumptions of Lemma 1 are satisfied: (a) $P(b_i = 1 \mid \mathcal{G}_{-1}) \ge 1/2$ for every null $i \in \mathcal{H}_0$, because $A_i$ is independent of $\mathcal{G}_{-1}$:

$$P\big(4(A_i - 1/2)\cdot E_i \le 0 \mid \mathcal{G}_{-1}\big) = P(A_i = 1)\,P(E_i \le 0 \mid \mathcal{G}_{-1}) + P(A_i = 0)\,P(E_i \ge 0 \mid \mathcal{G}_{-1}) = \tfrac{1}{2}\big[P(E_i \le 0 \mid \mathcal{G}_{-1}) + P(E_i \ge 0 \mid \mathcal{G}_{-1})\big] \ge \tfrac{1}{2};$$

(b) the filtration in our algorithm satisfies $\mathcal{F}_t \subseteq \mathcal{G}_t$, so the time of stopping the algorithm, $\widehat{t} := \min\{t : \widehat{\mathrm{FDR}}(\mathcal{R}_t) \le \alpha\}$, is a stopping time with respect to $(\mathcal{G}_t)$; and (c) $\mathcal{C}_{t+1}$ is measurable with respect to $\mathcal{G}_t$. Thus, by Lemma 1 with $\rho = 1/2$, we have

$$E\left[\frac{1 + |\mathcal{R}_{\widehat{t}} \cap \mathcal{H}_0|}{1 + |\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0|} \,\Big|\, \mathcal{G}_{-1}\right] \le 2.$$

By definition, the FDR conditional on $\mathcal{G}_{-1}$ at the stopping time $\widehat{t}$ is

$$E\left[\frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{\max\{|\mathcal{R}^+_{\widehat{t}}|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] = E\left[\frac{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1}{\max\{|\mathcal{R}^+_{\widehat{t}}|, 1\}} \cdot \frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1} \,\Big|\, \mathcal{G}_{-1}\right] \le E\left[\widehat{\mathrm{FDR}}(\mathcal{R}_{\widehat{t}}) \cdot \frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1} \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha\, E\left[\frac{|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0|}{|\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0| + 1} \,\Big|\, \mathcal{G}_{-1}\right] = \alpha\, E\left[\frac{1 + |\mathcal{R}_{\widehat{t}} \cap \mathcal{H}_0|}{1 + |\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0|} - 1 \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha(2 - 1) = \alpha,$$

where the last equality uses $|\mathcal{R}^+_{\widehat{t}} \cap \mathcal{H}_0| = |\mathcal{R}_{\widehat{t}} \cap \mathcal{H}_0| - |\mathcal{R}^-_{\widehat{t}} \cap \mathcal{H}_0|$; the proof completes by applying the tower property of conditional expectation.

Notice that when the potential outcomes are treated as fixed, the same proof applies to the null defined as $Y^T_j = Y^C_j$, because the independence between $A_i$ and $\mathcal{G}_{-1}$ still holds for the nulls. For the hybrid version of the null (5) in the main paper, the above proof applies with $\mathcal{G}_{-1} := \sigma\big(\{Y_j, Y^T_j, Y^C_j, X_j\}_{j=1}^n\big)$. Thus, the FDR is controlled at level $\alpha$ conditional on the potential outcomes and covariates $\{Y^T_j, Y^C_j, X_j\}_{j=1}^n$.

B.2 Proof of theorem 2
Proof. Let the set of false rejections in $\mathcal{R}(I)$ be $\mathcal{V}(I)$. The FDR of the I implemented on set $I$ is controlled at level $\alpha/2$,

$$E\left[\frac{|\mathcal{V}(I)|}{\max\{|\mathcal{R}(I)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha/2,$$

following the error control of the I in Section B.1, where the initial candidate rejection set is $\mathcal{R}_0 = I$, and thus $\mathcal{C}_0 = I \cap \mathcal{H}_0$. Similarly, the FDR of the I implemented on set $II$ is also controlled at level $\alpha/2$. Therefore, the FDR of the combined set $\mathcal{R}(I) \cup \mathcal{R}(II)$ is controlled at level $\alpha$, as claimed:

$$E\left[\frac{|\mathcal{V}(I) \cup \mathcal{V}(II)|}{\max\{|\mathcal{R}(I) \cup \mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le E\left[\frac{|\mathcal{V}(I)|}{\max\{|\mathcal{R}(I) \cup \mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] + E\left[\frac{|\mathcal{V}(II)|}{\max\{|\mathcal{R}(I) \cup \mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le E\left[\frac{|\mathcal{V}(I)|}{\max\{|\mathcal{R}(I)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] + E\left[\frac{|\mathcal{V}(II)|}{\max\{|\mathcal{R}(II)|, 1\}} \,\Big|\, \mathcal{G}_{-1}\right] \le \alpha,$$

and the proof completes for the null (3) in the main paper after applying the tower property of conditional expectation. The FDR control also applies to the other two definitions of the null, (4) and (5) in the main paper, following the same arguments as at the end of Section B.1.

B.3 Proof of theorem 5
Proof. We prove that the FDR control holds for the I implemented on $I$; the same conclusion applies to $II$, so the overall FDR control is guaranteed following the proof of Theorem 2 in Section B.2. We first present the proof when the potential outcomes are viewed as random variables. Define $\mathcal{G}_{-1} := \sigma\big(\{X_i\}_{i=1}^n, \{Y_i, A_i\}_{i \notin I}\big)$ and $\mathcal{G}'_t = \sigma\big(\mathcal{G}_{-1}, \mathcal{C}_t, (Y_i, A_i)_{i \notin \mathcal{C}_t}, \sum_{i \in \mathcal{C}_t} b_i\big)$, which contains more information than $\mathcal{G}_t$ as defined in (31). We claim that Lemma 1 holds when we replace $\mathcal{G}_t$ by $\mathcal{G}'_t$, because the distribution of $b_i$ conditional on $\mathcal{G}_t$ is the same as conditional on $\mathcal{G}'_t$ for any $t = 0, \dots, n$. Similar to the proof of Theorem 1 in Section B.1, we check that the assumptions of Lemma 1 are satisfied: (a) the filtration in our algorithm satisfies $\mathcal{F}_t \subseteq \mathcal{G}'_t$, so the time of stopping the algorithm, $\widehat{t} := \min\{t : \widehat{\mathrm{FDR}}(\mathcal{R}_t) \le \alpha\}$, is a stopping time with respect to $(\mathcal{G}'_t)$; (b) $\mathcal{C}_{t+1}$ is measurable with respect to $\mathcal{G}'_t$; and (c) for subjects with nonpositive effect, $i \in \mathcal{H}^{\text{nonpositive}}_0$,

$$P\big(4(A_i - 1/2)\cdot E^{-I}_i \le 0 \mid \mathcal{G}_{-1}\big) \ge 1/2. \qquad (32)$$

To see that the last assumption holds, notice that

$$P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) \le 0 \mid \mathcal{G}_{-1}\big) = P\big(Y^C_i \ge \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 0) + P\big(Y^T_i \le \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 1),$$

and

$$P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) > 0 \mid \mathcal{G}_{-1}\big) = P\big(Y^C_i < \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 0) + P\big(Y^T_i > \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) P(A_i = 1).$$

For any potential outcomes of the nulls such that $(Y^T_i \mid X_i) \preceq (Y^C_i \mid X_i)$, it holds that

$$P(Y^C_i \ge D \mid X_i) \ge P(Y^T_i > D \mid X_i), \quad \text{and} \quad P(Y^T_i \le D \mid X_i) \ge P(Y^C_i < D \mid X_i),$$

for any constant $D$, so

$$P\big(Y^C_i \ge \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) \ge P\big(Y^T_i > \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big), \quad \text{and} \quad P\big(Y^T_i \le \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big) \ge P\big(Y^C_i < \widehat{m}^{-I}(X_i) \mid \mathcal{G}_{-1}\big),$$

because $\widehat{m}^{-I}(X_i)$ is fixed given $\mathcal{G}_{-1}$. Because $P(A_i = 1) = 1/2$, it follows that

$$P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) \le 0 \mid \mathcal{G}_{-1}\big) \ge P\big(4(A_i - 1/2)\cdot(Y_i - \widehat{m}^{-I}(X_i)) > 0 \mid \mathcal{G}_{-1}\big),$$

which proves Claim (32), and in turn the FDR control of the MaY-I.

When the potential outcomes are treated as fixed, the above proof applies to the null defined as $Y^T_i \le Y^C_i$ in (18) of the main paper, in which case $P(Y^C_i \ge D \mid \mathcal{G}_{-1})$ is zero or one and the above arguments still hold. For the hybrid version of the null (19) in the main paper, the above proof applies with $\mathcal{G}_{-1} := \sigma\big(\{Y^T_i, Y^C_i, X_i\}_{i=1}^n, \{Y_i, A_i\}_{i \notin I}\big)$. Thus, the FDR is controlled at level $\alpha$ conditional on the potential outcomes and covariates $\{Y^T_j, Y^C_j, X_j\}_{j=1}^n$.

B.4 Error control guarantee for the linear-BH procedure
Theorem 6. Suppose the outcomes follow a linear model: $Y_i = l_\Delta(X_i)A_i + l_f(X_i) + U_i$, where $l_\Delta$ and $l_f$ denote linear functions and $U_i$ is standard Gaussian noise. Then the linear-BH procedure controls the FDR for the nonpositive-effect null in (17) of the main paper asymptotically, as the sample size $n$ goes to infinity.

Note that the error control need not hold when the linearity assumption is violated. For example, if the expected treatment effect $E(Y^T_i - Y^C_i \mid X_i)$ is a nonlinear function of the covariates, the estimated treatment effect $\widehat{\Delta}^{\mathrm{BH}}_i$ is not consistent; in turn, for the null subjects with zero effect, the $p$-values are not valid (i.e., not stochastically equal to or larger than uniform). Hence, the linear-BH procedure does not guarantee the desired FDR control, as we show in the numerical experiments in Section 3 of the main paper.
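For concreteness, a minimal sketch of the linear-BH procedure (per-arm ordinary least squares in NumPy; the one-sided Gaussian $p$-value follows the variance calculation in the proof below, while the exact form of (15) is in the main paper; all names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def linear_bh(X, Y, A, alpha):
    """Linear-BH sketch: fit OLS in each arm, impute the counterfactual
    outcome, form one-sided p-values for H0: effect <= 0, then run BH."""
    n, d = X.shape
    Z = np.column_stack([np.ones(n), X])             # design with intercept
    fits = {}
    for a in (0, 1):
        idx = np.nonzero(A == a)[0]
        beta, *_ = np.linalg.lstsq(Z[idx], Y[idx], rcond=None)
        resid = Y[idx] - Z[idx] @ beta
        sigma2 = resid @ resid / (len(idx) - d - 1)  # noise variance estimate
        cov = sigma2 * np.linalg.inv(Z[idx].T @ Z[idx])
        fits[a] = (beta, sigma2, cov)
    delta = np.empty(n)                              # estimated effects
    var = np.empty(n)                                # and their variances
    for i in range(n):
        a = int(A[i])
        beta_o, _, cov_o = fits[1 - a]
        imputed = Z[i] @ beta_o                      # counterfactual outcome
        delta[i] = (Y[i] - imputed) if a == 1 else (imputed - Y[i])
        var[i] = fits[a][1] + Z[i] @ cov_o @ Z[i]    # own noise + imputation
    pvals = norm.sf(delta / np.sqrt(var))            # one-sided p-values
    order = np.argsort(pvals)                        # Benjamini-Hochberg step
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, n + 1) / n)[0]
    k = below.max() + 1 if below.size else 0
    return np.sort(order[:k])
```

As the remark above notes, this baseline is only valid when the linear model is correctly specified.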
Proof. For simplicity, we treat all the covariates as fixed values and collect them into the covariate matrix $X_a = (X_i : A_i = a)^T$ for $a \in \{T, C\}$, where we temporarily write $A_i = T$ to denote the treated case $A_i = 1$. Under the linearity assumption, the estimated outcome $\widehat{l}_a(X_i)$ asymptotically follows a Gaussian distribution whose expected value is $l_\Delta(X_i)\mathbf{1}\{a = T\} + l_f(X_i)$. Its variance can be estimated as

$$\widehat{\mathrm{Var}}\big(\widehat{l}_a(X_i)\big) = \widehat{\sigma}^2_a \cdot X_i^T (X_a^T X_a)^{-1} X_i,$$

where the variance from the noise is estimated as

$$\widehat{\sigma}^2_a = \sum_{A_i = a} \big(Y_i - \widehat{l}_a(X_i)\big)^2 \Big/ \Big(\sum_i \mathbf{1}\{A_i = a\} - d - 1\Big),$$

and $d$ is the number of covariates. Note that the observed outcome also follows a Gaussian distribution, $N\big(l_\Delta(X_i)\mathbf{1}\{a = T\} + l_f(X_i), \sigma^2_a\big)$. Note that in each estimated effect $\widehat{\Delta}^{\mathrm{BH}}_i$, the observed outcome $\widetilde{Y}^a_i$ is independent of the estimated potential outcome $\widetilde{Y}^{\bar{a}}_i$, where $\bar{a}$ is the complement of $a$: $\{a\} \cup \{\bar{a}\} = \{T, C\}$. Thus, the estimated effect asymptotically follows a Gaussian distribution whose expected value is $l_\Delta(X_i)$ (nonpositive under the null) and whose variance is $\mathrm{Var}(\widehat{\Delta}^{\mathrm{BH}}_i) = \mathrm{Var}(\widetilde{Y}^T_i) + \mathrm{Var}(\widetilde{Y}^C_i)$, where an estimate is

$$\widehat{\mathrm{Var}}(\widetilde{Y}^a_i) = \widehat{\sigma}^2_a\,\mathbf{1}\{A_i = a\} + \widehat{\mathrm{Var}}\big(\widehat{l}_a(X_i)\big)\,\mathbf{1}\{A_i \ne a\}.$$

Therefore, the resulting $p$-value $P_i$, as defined in (15) of the main paper, is asymptotically valid (uniform or stochastically larger) if subject $i$ is a null, and hence the BH procedure leads to asymptotic FDR control (Fan et al., 2007).

C Proof of power analysis
Our proof of the power analysis mainly uses the results of Arias-Castro and Chen (2017), who consider a setup with $n$ hypotheses, each associated with a test statistic $V_i$, $i \in [n]$. Assume the test statistics are independent with survival function $P(V_i \ge x) = \Psi_i(x)$, which equals $\Psi(x - \mu_i)$, where $\mu_i = 0$ under the null and $\mu_i > 0$ otherwise. The distribution is called asymptotically generalized Gaussian (AGG) if

$$\lim_{x \to \infty} x^{-\gamma} \log \Psi(x) = -1/\gamma, \qquad (33)$$

with a constant $\gamma > 0$. For example, a normal distribution is AGG with $\gamma = 2$. They discuss a class of multiple testing methods called threshold procedures: the final rejection set $\mathcal{R}$ is of the form

$$\mathcal{R} = \{i : V_i \ge \tau(V_1, \dots, V_n)\}, \qquad (34)$$

for some threshold $\tau(V_1, \dots, V_n)$, and they separately study two types of thresholds: the BH procedure (Benjamini and Hochberg, 1995) with threshold

$$\tau_{\mathrm{BH}} = V_{(\iota_{\mathrm{BH}})}, \qquad \iota_{\mathrm{BH}} := \max\{i : V_{(i)} \ge \Psi^{-1}(i\alpha/n)\}, \qquad (35)$$

where $V_{(1)} \ge \dots \ge V_{(n)}$ are the order statistics; and the Barber-Candès (BC) procedure (Barber and Candès, 2015) with a threshold on the absolute values of the $V_i$:

$$\tau_{\mathrm{BC}} = \inf\{\nu \in |V| : \widehat{\mathrm{FDP}}(\nu) \le \alpha\}, \qquad (36)$$

where $|V| := \{|V_i| : i \in [n]\}$ is the set of sample absolute values and

$$\widehat{\mathrm{FDP}}(\nu) := \frac{|\{i : V_i \le -\nu\}| + 1}{\max\{|\{i : V_i \ge \nu\}|, 1\}},$$

and the final rejection set consists of the subjects with positive $V_i$ no smaller than $\tau_{\mathrm{BC}}$. The stopping rule of the BC procedure is similar to that of our proposed algorithms, as detailed next.

Recall from Section 4 of the main paper that we consider a simplified automated version of the I that excludes the subject with the smallest absolute value of the estimated treatment effect $|\widehat{\Delta}_i|$. Thus, the automated I is a BC procedure whose test statistic of interest is $V_i = \widehat{\Delta}_i = 4(A_i - 1/2)(Y_i - \widehat{m}_n(X_i))$. Following the above notation, and letting $\Phi$ be the CDF of the standard Gaussian, we denote the survival function for the nulls by

$$\Psi^{\text{null}}_n(x) = \tfrac{1}{2}\big(1 - \Phi(x + \widehat{m}_n)\big) + \tfrac{1}{2}\big(1 - \Phi(x - \widehat{m}_n)\big),$$

which is a mixture of two Gaussians, where $\widehat{m}_n = \frac{1}{n}\sum_{i=1}^n Y_i$ and $\widehat{m}_n \overset{a.s.}{\to} 0$; and the survival function for the non-nulls by

$$\Psi^{\text{non-null}}_n(x) = \tfrac{1}{2}\big(1 - \Phi(x + \widehat{m}_n - \mu)\big) + \tfrac{1}{2}\big(1 - \Phi(x - \widehat{m}_n)\big).$$

Note that our setting is slightly different from the discussion in Arias-Castro and Chen (2017), because the non-nulls differ from the nulls by a shift in one of the two Gaussian components (rather than a shift in the overall survival function $\Psi^{\text{null}}_n$). Similar to the AGG characterization in (33), both survival functions $\Psi^{\text{null}}_n$ and $\Psi^{\text{non-null}}_n$ asymptotically satisfy the tail property that for any $x_n \to \infty$ as $n \to \infty$,

$$\lim_{n \to \infty} x_n^{-\gamma} \log \Psi_n(x_n) = -1/\gamma, \qquad (37)$$

with probability one and $\gamma = 2$, which we later refer to as asymptotic AGG. The conclusions in our paper then essentially follow the proofs in Arias-Castro and Chen (2017), with the test statistics $V_i$ specified as the estimated treatment effects $\widehat{\Delta}_i$.

C.1 Proof of theorem 3
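Before the proof, the BC-type stopping rule (36), which the automated I instantiates with $V_i = \widehat{\Delta}_i$, can be computed by scanning the candidate thresholds $|V_i|$; a minimal sketch:

```python
import numpy as np

def bc_rejections(V, alpha):
    """Barber-Candes-type procedure: scan thresholds nu over |V_i| and
    reject {i : V_i >= nu} at the smallest nu with FDPhat(nu) <= alpha,
    where FDPhat(nu) = (#{V_i <= -nu} + 1) / max(#{V_i >= nu}, 1)."""
    V = np.asarray(V, dtype=float)
    for nu in np.sort(np.abs(V)):                  # candidate thresholds
        fdp_hat = ((V <= -nu).sum() + 1) / max((V >= nu).sum(), 1)
        if fdp_hat <= alpha:
            return np.nonzero(V >= nu)[0]
    return np.array([], dtype=int)

# toy example: five strong positive statistics among symmetric noise
V = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 0.3, -0.4, 0.2])
print(bc_rejections(V, alpha=0.2))  # -> [0 1 2 3 4]
```

The "+1" in the numerator is what yields finite-sample (rather than only asymptotic) control in the BC analysis.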
We first present the proof for the power of the automated Crossfit-I; the power of the linear-BH procedure is proven similarly, as shown later.

Proof.

Zero power when $r < \beta$. The argument for zero power applies to any threshold procedure as defined in (34): $\mathcal{R} = \{i : \widehat{\Delta}_i \ge d\}$ for some $d \in \mathbb{R}$. Following the proof of Theorem 1 in Arias-Castro and Chen (2017), we argue that the FDR control cannot be satisfied for any $\alpha \in (0, 1)$ unless $d$ is large enough that $d > \mu + \delta_n$ with $\delta_n = \log\log n$; but in that case, most non-nulls cannot be included in the rejection set, and thus the power goes to zero.

First, we claim that when $d \le \mu + \delta_n$, the false discovery proportion (FDP) goes to one in probability. By the proof of Theorem 1 in Arias-Castro and Chen (2017), the FDP goes to one in probability if $(n - n_1)\Psi^{\text{null}}_n(\mu + \delta_n)/n_1 \to \infty$ with probability one, where $n_1$ is the number of non-nulls. Their proof also verifies that $(n - n_1)\Psi^{\text{null}}_n(\mu + \delta_n)/n_1 \to \infty$, because $\Psi^{\text{null}}_n$ satisfies property (37) with probability one.

Next, we show that when $d > \mu + \delta_n$, the power goes to zero. Notice that power can be equivalently defined as $E(1 - \mathrm{FNR})$, where the FNR (false negative rate) is the proportion of non-nulls not identified. Again by the proof of Theorem 1 in Arias-Castro and Chen (2017), the FNR converges to one in probability if $\Psi^{\text{non-null}}_n(\mu + \delta_n)$ goes to zero with probability one, which is true because $\delta_n \to \infty$.

Combining the above two arguments, we conclude that for any threshold procedure whose rejection set is of the form (34), the power goes to zero for any FDR level $\alpha \in (0, 1)$ when $r < \beta$.

Note that the above proof assumes that the test statistics $V_i = \widehat{\Delta}_i$ are mutually independent. For simplicity, we design the Crossfit-I so that $\widehat{m}_n$ for the I implemented on $I$ is computed using the data in $II$, to ensure this mutual independence. Thus, the above proof applies to the I implemented on each half, $I$ and $II$. The overall power behaves the same asymptotically, since $I$ and $II$ result from a random split of all subjects $[n]$. For all cases hereafter, we prove the power claim for the I implemented on $I$ conditional on the data in $II$, and the same claim holds for the overall power as reasoned above.

Half power when $r > \beta$. We first prove that the limit inferior of the power is at least $1/2$, and then that the limit superior is at most $1/2$, mainly using the proof of Theorem 3 in Arias-Castro and Chen (2017). They consider a sequence of thresholds $d^*_n = (\gamma r^* \log n)^{1/\gamma}$ for some $r^* \in (\beta, r \wedge 1)$. The estimated FDR at $d^*_n$ is less than any $\alpha \in (0, 1)$ for large $n$; that is, $\widehat{\mathrm{FDR}}(d^*_n) \le \alpha$. This can be verified by the proof of Theorem 3 in Arias-Castro and Chen (2017), where the survival function of $\widehat{\Delta}_i$ is $G(d^*_n) = (1 - \epsilon)\Psi^{\text{null}}_n(d^*_n) + \epsilon\,\Psi^{\text{non-null}}_n(d^*_n)$ with $\epsilon = n^{-\beta}$, together with the facts that $\Psi^{\text{non-null}}_n(d^*_n) \to 1/2$ and $\Psi^{\text{null}}_n(d^*_n) = n^{-r^*(1 + o(1))}$ (by property (37)) with probability one. It follows that the true stopping threshold $\tau_n$ satisfies $\tau_n \le d^*_n$. Also, by Lemma 1 in Arias-Castro and Chen (2017), the proportion of correctly identified non-nulls at threshold $d$ is $\frac{1}{n_1}\sum_{i \notin \mathcal{H}_0} \mathbf{1}\{\widehat{\Delta}_i \ge d\} = \Psi^{\text{non-null}}_n(d) + o_P(1)$, where $\Psi^{\text{non-null}}_n(d)$ decreases in $d$ and converges to $1/2$ at $d = d^*_n$. Since the true stopping threshold is no larger than $d^*_n$, the limit inferior of the power is at least $1/2$.

To prove that the limit superior is at most $1/2$, fix any $d \in (0, \infty)$; we claim that the actual stopping threshold satisfies $\tau_n \ge d$ for large $n$, because the FDR estimator at any fixed threshold goes to one, following similar arguments as in the proof of Theorem 3 in Arias-Castro and Chen (2017). Specifically,

$$\widehat{\mathrm{FDR}}(d) \equiv \frac{|\{i \in [n] : \widehat{\Delta}_i \le -d\}| + 1}{\max\{|\{i \in [n] : \widehat{\Delta}_i \ge d\}|, 1\}} = \frac{1 + n\big(1 - \widehat{G}_n(-d)\big)}{\max\{n\widehat{G}_n(d), 1\}},$$

where $\widehat{G}_n(d) = \frac{1}{n}\sum_{i \in [n]} \mathbf{1}(\widehat{\Delta}_i \ge d)$ denotes the empirical survival function. Using the facts that $G_n(d) = (1 - \epsilon)\Psi^{\text{null}}_n(d) + \epsilon\,\Psi^{\text{non-null}}_n(d) \to 1 - \Phi(d)$ and $G_n(-d) \to \Phi(d)$ almost surely, we observe that $\widehat{\mathrm{FDR}}(d) \to (1 - \Phi(d))/(1 - \Phi(d)) = 1 > \alpha$, so indeed $\tau_n \ge d$ for large $n$. The proportion of identified non-nulls at a threshold no smaller than $d$ is at most $\Psi^{\text{non-null}}_n(d) + o_P(1)$ (recall the previous paragraph), where $\Psi^{\text{non-null}}_n(d) \to 1/2 + \tfrac{1}{2}(1 - \Phi(d))$. Thus, the power for large $n$ is smaller than $1/2 + \tfrac{1}{2}(1 - \Phi(d))$ for every $d \in (0, \infty)$; in other words, the limit superior of the power is at most $\inf_{d \in (0, \infty)} \big[1/2 + \tfrac{1}{2}(1 - \Phi(d))\big] = 1/2$.

Combining with the limit inferior of $1/2$, we conclude that the power converges to $1/2$. In fact, the above proof implies that the power of identifying non-null subjects that are treated is one (notice that $\Psi^{\text{non-null, treated}}_n(d^*_n) = 1 - \Phi(d^*_n + \widehat{m}_n - \mu) \to 1$).
The power of the linear-BH procedure when there is nocovariates can be proved following similar steps as above, and using intermediate results of Theorem 2in Arias-Castro and Chen (2017). In their notation, the linear-BH procedure uses V i = (cid:98) ∆ BH i as the teststatistics, and we separately discuss power among the treated group and the control group in ensurethe independence among V i . The survival functions for the nulls and non-nulls in the treated groupare Ψ null n ( x ) = 1 − Φ (cid:32) x (cid:112) /n C (cid:33) and Ψ non-null n ( x ) = 1 − Φ (cid:32) x − µ (cid:112) /n C (cid:33) , where n C is the number of untreated subjects. For the control group, the survival functions areΨ null n ( x ) = 1 − Φ (cid:32) x + (cid:98) Y T (cid:112) /n T (cid:33) and Ψ non-null n ( x ) = 1 − Φ (cid:32) x + (cid:98) Y T (cid:112) /n T (cid:33) , where (cid:98) Y T = (cid:80) A i =1 Y i a.s. →
0, and n T is the number of treated subjects. Since the above distributionsconverge to a Gaussian, these survival functions satisfy the AGG property asymptotically as definedin (37). Proof.
First, we claim that the power goes to zero when r < β , following the proof in Section C.1 forany threshold procedure (separately for the treated group conditional on control group). Then, weprove that power converges to 1 / r > β : the power among untreated subjects is asymptoticallyzero because the survival functions for the nulls and non-nulls are the same; the power among treatedsubjects is asymptotically one following the proof of Theorem 2 in Arias-Castro and Chen (2017) asdetailed next.Again consider the sequence of thresholds d ∗ n = ( γr ∗ log n ) /γ for some r ∗ ∈ ( β, r ∧ d ∗ n is less than any α ∈ (0 ,
1) for large n , or mathematically (cid:91) FDR( d ∗ n ) ≤ α .It can be verified by the proof of Theorem 2 in Arias-Castro and Chen (2017) where G ( d ∗ n ) = (1 − (cid:15) )Ψ null n ( d ∗ n )+ (cid:15) Ψ non-null n ( d ∗ n ) with (cid:15) = n − β , and the fact that Ψ non-null n ( d ∗ n ) → null n ( d ∗ n ) → n − r ∗ (by property (37)) with probability one. It follows that the true stopping threshold τ n satisfies τ n ≤ d ∗ n . Also, by Lemma 1 in Arias-Castro and Chen (2017), we have that the proportionof correctly identified non-nulls at threshold d ∗ n is n (cid:80) i/ ∈H { (cid:98) ∆ BH i ≥ d } = Ψ non-null n ( d ∗ n ) + o P (1),where Ψ non-null n ( d ) for the treated group decreases in d and converges to 1 when d = d ∗ n . Recall thatthe true stopping threshold is no larger than d ∗ n , so the limit inferior of the power among the treatedsubjects is at least 1. Therefore, the overall power converges to 1 / .2 Proof of theorem 4 We first consider the power when all subjects are non-nulls ( β = 0). Lemma 2.
Given any fixed FDR control level $\alpha \in (0, 1)$, let the number of subjects $n$ go to infinity. When all subjects are non-nulls, the stopping time $\tau = 0$ with probability tending to one if $\mu > \Phi^{-1}\big(\tfrac{1}{1+\alpha}\big)$, and in this case the power converges to $\Phi(\mu)$. E.g., when $\alpha = 0.2$, the asymptotic power of the automated I is larger than $0.84$ if $\mu \ge 1$ (since $\Phi^{-1}(1/1.2) \approx 0.97 < 1$ and $\Phi(1) \approx 0.84$).

Proof. The stopping time $\tau = 0$ if and only if the FDR control is satisfied when all the subjects are included: $\widehat{\mathrm{FDR}}_n(\mathcal{R}_0) = \frac{|\mathcal{R}^-_0| + 1}{\max\{|\mathcal{R}^+_0|, 1\}} \le \alpha$, or equivalently, $\frac{|\mathcal{R}^+_0|}{n} \ge \frac{1 + 1/n}{1 + \alpha}$. Notice that the proportion of positive $\widehat{\Delta}_i$ converges to a constant, $|\mathcal{R}^+_0|/n \overset{a.s.}{\to} \Phi(\mu)$, because in this setting each non-null $\widehat{\Delta}_i$ follows a Gaussian distribution with mean $\mu$ and unit variance. Thus, if $\Phi(\mu) > \frac{1}{1+\alpha}$, then for any small enough $\epsilon \in (0, 1)$ there exists $N$ such that for all $n \ge N$: (a) $\big|\,|\mathcal{R}^+_0|/n - \Phi(\mu)\,\big| < \epsilon$ with probability at least $1 - \epsilon$; and (b) $\widehat{\mathrm{FDR}}_n(\mathcal{R}_0) \le \alpha$ (hence $\tau = 0$) with probability at least $1 - \epsilon$. (Notice that the threshold $N$ can be chosen to not depend on $\mu$, which is useful in the next proof.) In such a case, the power is no less than $(1 - \epsilon)(\Phi(\mu) - \epsilon)$ when $n \ge N$, and the power is no larger than $\Phi(\mu) + \epsilon$; so the power converges to $\Phi(\mu)$. The proof completes once we notice that the condition $\Phi(\mu) > \frac{1}{1+\alpha}$ is equivalent to $\mu > \Phi^{-1}\big(\tfrac{1}{1+\alpha}\big)$.

Proof of Theorem 4.
Power of the Crossfit-I. Recall that the I implemented on $I$ excludes subjects based on the averaged estimated effect on $II$: $\mathrm{Pred}(x) = \mathrm{avg}\{\widehat{\Delta}_i : i \in II, X_i = x\}$, which converges to $\mu$ almost surely when $x = 1$ (the non-nulls), and to $0$ almost surely when $x = 0$ (the nulls). Thus, with probability tending to one, no non-null in $I$ is excluded before all the nulls in $I$ are excluded, for any fixed $\mu > 0$. Combined with Lemma 2, we have that if $\mu > \Phi^{-1}\big(\tfrac{1}{1+\alpha}\big)$, then for any $\epsilon \in (0, 1)$ there exists $N(\epsilon, \alpha)$ such that for all $n \ge N$, the power of the I is higher than $\Phi(\mu) - \epsilon$. Also, the limit of the power increases to one for any $r > 0$ (as $\mu$ increases): there exists $N'(\epsilon)$ such that for all $n \ge N'(\epsilon)$, $\Phi(\mu) \ge 1 - \epsilon$. Therefore, for any $\epsilon \in (0, \alpha)$, for all $n \ge \max\{N'(\epsilon), N(\epsilon, \alpha)\}$, the power of the I implemented on $I$ is no less than $1 - 2\epsilon$; this completes the proof.

Power of the linear-BH procedure. As before, we separately argue that the power for the treated group and the control group converges to zero when $r < \beta$, and converges to one when $r > \beta$. For a subject in the treated group with $X_i = x$, where $x \in \{0, 1\}$, the estimated effect is Gaussian: $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(0,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 0)\big)$ for the nulls and $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(\mu,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 0)\big)$ for the non-nulls. For a subject in the control group with $X_i = x$, where $x \in \{0, 1\}$, the estimated effect is Gaussian: $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(0,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 1)\big)$ for the nulls and $\widehat{\Delta}^{\mathrm{BH}}_i \sim N\big(\mu,\, 1 + 1/\sum_i \mathbf{1}(X_i = x, A_i = 1)\big)$ for the non-nulls. The power of the linear-BH procedure then results directly from Theorem 2 in Arias-Castro and Chen (2017), because in both the treated and the control group: (a) the linear-BH procedure is the BH procedure with the random variables of interest $V_i = \widehat{\Delta}^{\mathrm{BH}}_i$; (b) the $\widehat{\Delta}^{\mathrm{BH}}_i$ of non-nulls and nulls differ by a shift $\mu$; and (c) the survival function of $\widehat{\Delta}^{\mathrm{BH}}_i$ is asymptotically AGG (recall the definition in (37)), since it converges to a Gaussian distribution.

D Additional numerical experiments
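Throughout this section, we report three quantities per method: the FDP against the zero-effect null, the FDP against the nonpositive-effect null, and the power of identifying positive effects. As a sketch of how we interpret these metrics (an illustrative helper, with `delta` the vector of true effects and `R` a rejection set of indices):

```python
import numpy as np

def metrics(delta, R):
    """FDP for the zero-effect null, FDP for the nonpositive-effect null,
    and power for identifying positive effects."""
    delta = np.asarray(delta, dtype=float)
    R = np.asarray(R, dtype=int)
    n_rej = max(len(R), 1)
    fdp_zero = np.sum(delta[R] == 0) / n_rej      # nulls: exactly zero effect
    fdp_nonpos = np.sum(delta[R] <= 0) / n_rej    # nulls: nonpositive effect
    power_pos = np.sum(delta[R] > 0) / max(np.sum(delta > 0), 1)
    return fdp_zero, fdp_nonpos, power_pos

delta = np.array([2.0, 1.5, 0.0, -1.0, 0.0, 3.0])
print([round(float(v), 3) for v in metrics(delta, [0, 1, 2])])  # -> [0.333, 0.333, 0.667]
```

Averaging these FDPs over simulation repetitions gives the empirical FDR curves shown in the figures below.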
D.1 Identification of individual positive effect
We have seen the numerical results of the proposed methods in the main paper where the treatmenteffect is defined in (16) with sparse and strong positive effect, and dense and weak negative effect.This section presents three more examples of the treatment effect.28 inear effect.
Let the treatment effect be

∆(X_i) = S_∆ · [2 X_i(1) X_i(2) + 2 X_i(3)],   (38)

where S_∆ > 0. In this case, all subjects have treatment effects, and the effect size varies linearly with the covariates (including an interaction term). Thus, the linear-BH procedure has valid error control, as shown in Figure 9a (unlike in the other cases, where the treatment effect is a nonlinear function of the covariates).
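For concreteness, a minimal sketch of generating this effect follows. The covariate distribution below (binary X(1), X(2) and standard Gaussian X(3)) and the value of S_∆ are assumptions made for illustration, not specifications from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, scale = 10000, 2.0  # sample size and scale S_Delta: placeholder choices

# Assumed covariate distribution (an assumption for illustration only):
# X(1), X(2) ~ Bernoulli(1/2) and X(3) ~ N(0, 1).
X1 = rng.integers(0, 2, n)
X2 = rng.integers(0, 2, n)
X3 = rng.normal(size=n)

# Model (38): Delta(X_i) = S_Delta * [2 X_i(1) X_i(2) + 2 X_i(3)].
delta = scale * (2 * X1 * X2 + 2 * X3)

# Almost every subject has a nonzero effect; its sign and size vary
# linearly with the covariates, including the X(1)*X(2) interaction.
prop_positive = float(np.mean(delta > 0))
```

Because the effect is linear in the covariates, a linear working model for the estimated effects is well specified here, which is why the linear-BH procedure controls FDR in this setting.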
Sparse and strong effect that is positive.
Let the treatment effect be

∆(X_i) = S_∆ · [5 X_i(3) · 1{X_i(3) > 1}],   (39)

where S_∆ > 0. Here, only the subjects with X_i(3) > 1 have a (positive) treatment effect.

Sparse and strong effect in both directions.
Let the treatment effect be

∆(X_i) = S_∆ · [5 X_i(3) · 1{|X_i(3)| > 1}],   (40)

where S_∆ > 0. Here, the subjects with X_i(3) > 1 have a positive effect and the subjects with X_i(3) < −1 have a negative effect.

[Figure 9 panels, by row: (a) dense two-sided effect (linear) as in model (38); (b) sparse and strong effect that is positive (nonlinear) as in model (39); (c) sparse and strong effect in both directions (nonlinear) as in model (40); the columns plot FDR, FDR_neg, and power_pos against the scale of treatment effect.]

Figure 9. FDR for the zero-effect null (3) in the main paper (first column), FDR for the nonpositive-effect null (17) in the main paper (second column), and power (third column) of three methods (linear-BH, Crossfit-I, and MaY-I) under three types of treatment effect, when varying the scale of treatment effect S_∆. When the linear assumption holds, as in the first row, the linear-BH procedure has valid FDR control and high power, but its FDR is large when the treatment effect is a nonlinear function of the covariates, as in the latter two rows. In contrast, Crossfit-I and MaY-I have valid FDR control for their respective target null hypotheses.

D.2 Paired samples
We have shown in Section 6 of the main paper that the interactive algorithms can be applied to paired samples, and have discussed their power in two cases where the samples within each pair either match exactly or mismatch to some degree. A side observation is that the power of the algorithms not using the pairing information changes slightly as the degree of mismatch varies. Here, we increase the degree of mismatch to show this change more clearly.

[Figure 10: power_pos against the degree of mismatch for pair-Crossfit-I, pair-MaY-I, unpair-Crossfit-I, and unpair-MaY-I.]

Figure 10.
Power of identifying subjects with positive effects by the proposed algorithms (Crossfit-I and MaY-I) with or without pairing information, when the scale of treatment effect is fixed at 2 and the degree of mismatch ε varies. The power of the algorithms without pairing information first increases and then decreases as ε becomes larger.

We extend the definition of mismatch from ε ∈ (0, 1) to larger ε: P(X̃_i(1) ≠ X_i(1)) = min{ε, 1} and P(X̃_i(2) ≠ X_i(2)) = min{ε, 1} and X̃_i(3) = X_i(3) + U(0, 2ε), where U(0, 2ε) is uniformly distributed between 0 and 2ε; a larger ε leads to a larger degree of mismatch. As ε increases, the power under the unpaired samples first increases (see Figure 10). This is because the treatment effect is positive when X_i(3) > 1, which holds for only about 15% of subjects when there is no mismatch; thus, the pattern of the treatment effect is not easy to learn. In contrast, when there is a positive shift on X̃_i(3), as designed in the mismatch setting above, more subjects have positive effects, so the algorithm can more easily learn the effect pattern, which increases the power. The power can slightly decrease when the degree of mismatch is too large.

Acknowledgements

We thank Edward Kennedy for his comments on the problem setup.
References
Arias-Castro, E. and S. Chen (2017). Distribution-free multiple testing. Electronic Journal of Statistics 11(1), 1983–2001.

Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113(27), 7353–7360.

Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5), 2055–2085.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), 289–300.

Berger, J. O., X. Wang, and L. Shen (2014). A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics 24(1), 110–129.

Cai, T., L. Tian, P. H. Wong, and L. Wei (2011). Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 12(2), 270–282.

Duan, B., A. Ramdas, S. Balakrishnan, and L. Wasserman (2020). Interactive martingale tests for the global null. Electronic Journal of Statistics 14(2), 4489–4551.

Duan, B., A. Ramdas, and L. Wasserman (2020a). Familywise error rate control by interactive unmasking. In Proceedings of the 37th International Conference on Machine Learning, Volume 119, pp. 2720–2729. PMLR.

Duan, B., A. Ramdas, and L. Wasserman (2020b). Which Wilcoxon should we use? An interactive rank test and other alternatives. arXiv preprint arXiv:2009.05892.

Fan, J., P. Hall, and Q. Yao (2007). To how many simultaneous hypothesis tests can normal, Student's t or bootstrap calibration be applied? Journal of the American Statistical Association 102(480), 1282–1288.

Foster, J. C., J. M. Taylor, and S. J. Ruberg (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine 30(24), 2867–2880.

Gu, J. and S. Shen (2018). Oracle and adaptive false discovery rate controlling methods for one-sided testing: theory and application in treatment effect evaluation. The Econometrics Journal 21(1), 11–35.

Howard, S. R. and S. D. Pimentel (2020). The uniform general signed rank test and its design sensitivity. Biometrika, asaa072.

Imai, K. and M. Ratkovic (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics 7(1), 443–470.

Karmakar, B., R. Heller, and D. S. Small (2018). False discovery rate control for effect modification in observational studies. Electronic Journal of Statistics 12(2), 3232–3253.

Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497.

Lei, L. and W. Fithian (2018). AdaPT: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(4), 649–679.

Lei, L., A. Ramdas, and W. Fithian (2020). A general interactive framework for false discovery rate control under structural constraints. Biometrika, asaa064.

Li, A. and R. F. Barber (2017). Accumulation tests for FDR control in ordered hypothesis testing. Journal of the American Statistical Association 112(518), 837–849.

Lipkovich, I. and A. Dmitrienko (2014). Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. Journal of Biopharmaceutical Statistics 24(1), 130–153.

Lipkovich, I., A. Dmitrienko, and R. B. D'Agostino Sr (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine 36(1), 136–196.

Lipkovich, I., A. Dmitrienko, J. Denne, and G. Enas (2011). Subgroup identification based on differential effect search—a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine 30(21), 2601–2621.

Loh, W.-Y., L. Cao, and P. Zhou (2019). Subgroup identification for precision medicine: A comparative review of 13 methods. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9(5), e1326.

Nie, X. and S. Wager (2020). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, asaa076.

Powers, S., J. Qian, K. Jung, A. Schuler, N. H. Shah, T. Hastie, and R. Tibshirani (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine 37(11), 1767–1787.

Rabinovich, M., A. Ramdas, M. I. Jordan, and M. J. Wainwright (2020). Optimal rates and trade-offs in multiple testing. Statistica Sinica 30, 741–762.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society 56(4), 931–954.

Rosenbaum, P. R. (2002). Covariance adjustment in randomized experiments and observational studies. Statistical Science 17(3), 286–327.

Sivaganesan, S., P. W. Laud, and P. Müller (2011). A Bayesian subgroup analysis with a zero-enriched Polya Urn scheme. Statistics in Medicine 30(4), 312–323.

Xie, Y., N. Chen, and X. Shi (2018). False discovery rate controlled heterogeneous treatment effect detection for online controlled experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107(499), 1106–1118.