Sharp Inference on Selected Subgroups in Observational Studies
Xinzhou Guo, Linqing Wei, Chong Wu, Jingshen Wang∗

Harvard-MIT Center for Regulatory Science; Division of Biostatistics, UC Berkeley; Department of Statistics, Florida State University
Abstract
In modern drug development, the broader availability of high-dimensional observational data provides opportunities for scientists to explore subgroup heterogeneity, especially when randomized clinical trials are unavailable due to cost and ethical constraints. However, the common practice of naively searching for the subgroup with a high treatment level is often misleading due to the "subgroup selection bias." More importantly, the nature of high-dimensional observational data further exacerbates the challenge of accurately estimating subgroup treatment effects. To resolve these issues, we provide new inferential tools based on resampling to assess the replicability of post-hoc identified subgroups from observational studies. Through careful theoretical justification and extensive simulations, we show that our proposed approach delivers asymptotically sharp confidence intervals and debiased estimates for the selected subgroup treatment effects in the presence of high-dimensional covariates. We further demonstrate the merit of the proposed methods by analyzing the UK Biobank data. The R package "debiased.subgroup" implementing the proposed procedures is available on GitHub.

Keywords: Bootstrap; Precision Medicine; Debiased Inference.
1 Introduction

Subgroup analysis, broadly speaking, aims to uncover and confirm treatment effect heterogeneity within a population, and it has been frequently applied to randomized clinical trials (RCTs) for drug development (Hébert et al., 2002; Alosh et al., 2017). In recent years, the combination of the increasing availability of observational data and advancements in statistical methods and computing capacity has stimulated researchers' interest in using observational data for subgroup analysis. This has spawned the hope for deeper findings and more targeted recommendations or interventions in a wide variety of areas, from education and health care to marketing. Although RCTs remain the gold standard for assessing the efficacy of clinical interventions, observational data offer broader opportunities to explore subgroup heterogeneity when we cannot afford RCTs due to cost and ethical constraints.

While observational studies can be useful for identifying subgroups with favorable treatment efficacy or unfavorable adverse effects, we must account for the impact of subgroup selection on any subsequent evaluation of the subgroup. We have been frequently reminded of this by the failure of follow-up trials to confirm a seemingly promising subgroup (see Kubota et al., 2014, for example), and by the discussions in domain science journals about the appeals and pitfalls of subgroup analysis (e.g., Petticrew et al., 2012). When subgroups are identified post hoc from RCTs, it has been recognized in the literature (Cook et al., 2014; Bornkamp et al., 2017; Guo and He, 2019) that the subgroup selection bias, if not accounted for, is likely to lead to overtreatment and false discoveries.

∗ Correspondence: [email protected]. The research of Jingshen Wang was supported in part by the National Science Foundation (DMS 2015325).
When subgroups are identified from observational studies, the problem becomes even more challenging, as subgroup treatment effects need to be estimated upon adjusting for possibly high-dimensional confounders. To address these issues, in this paper, we provide new inferential tools to help assess the replicability of post-hoc identified subgroups from observational studies, without having to resort to simultaneous inference methods that are often too conservative to start with.

From a statistical methodological standpoint, our proposed approach delivers asymptotically sharp confidence intervals as well as debiased estimates for the selected subgroup treatment effects in the presence of high-dimensional covariates. We break down our methodological contributions as follows:

First, we allow well-understood debiased regression parameter estimates to be used as the building blocks for the subgroup treatment effects from observational studies, thus enabling adjustments for high-dimensional covariates in the study. In particular, we investigate how the debiased Lasso and repeated data splitting can be used in subgroup analysis. We find that the former works well at handling a large number of subgroup-specific parameters, whereas the latter achieves better statistical efficiency at sparse models, especially when the subgroup assignments are highly correlated with other covariates in the model.

Second, we propose new bootstrap-calibrated procedures for quantifying the selected subgroup treatment effect (Sections 3.1 and 3.2) and provide a theoretical guarantee of their asymptotic validity in high dimensions (Section 4). Most importantly, the resulting confidence intervals are asymptotically sharp in the sense that they approximate the nominal coverage probability without overshooting it (Theorem 1 and Corollary 1).
While the existing statistical literature (Zhang and Cheng, 2017; Dezeure et al., 2017) has argued for the use of multiplicity adjustment to ensure post-selection inference validity in high dimensions, most recommended methods tend to sacrifice power because they aim to protect the family-wise error rate over all possible subgroups. The proposed method automatically accounts for the subgroup selection bias as well as increasingly complex correlation structures among the subgroups, leading to targeted (instead of simultaneous) statistical validity in subgroup analysis.

Our paper is closely related to the subgroup analysis literature, which is often divided into exploratory subgroup analysis, focusing on subgroup identification (Shen and He, 2015; Su et al., 2009), and confirmatory subgroup analysis, aiming to validate subgroups identified at an earlier stage (Jenkins et al., 2011; Friede et al., 2012). These classical methods are usually designed for randomized trials and are not applicable to observational studies. While some recent literature on subgroup analysis (Fan et al., 2017; Yang et al., 2020; Izem et al., 2020) accounts for pre-treatment information, its generalization to high-dimensional observational studies may require substantial modification. Beyond subgroup analysis, some methods for analyzing heterogeneous treatment effects (Imai et al., 2013; Wager and Athey, 2017) might also be applicable to subgroup analysis in observational studies. Different from our goal of simultaneously identifying and estimating subgroup treatment effects, these works mainly focus on subgroup identification or on making inference for a pre-defined subgroup.

Our methodology also contributes to the high-dimensional inference and post-selection inference literature, simply because accurate point estimation of treatment effects is usually not available without regularization (e.g., Tibshirani, 1996; Fan and Li, 2001) in high dimensions.
Selective inference (Lee et al., 2016; Tian et al., 2018) constructs exact confidence intervals for the selected regression coefficients based on Lasso, conditional on the selected model, and this conditioning tends to yield conservative confidence intervals overall. While recent developments in debiased inference (Zhang and Zhang, 2014; Van de Geer et al., 2014; Belloni et al., 2014) remove the Lasso regularization bias by using an estimate of the inverse population covariance matrix, they do not address the issue of subgroup selection bias.

Our framework is also connected to the multiple comparison literature. Although several attempts have been made to address multiple comparison issues in subgroup analysis (Hall and Miller, 2010; Fuentes et al., 2018; Stallard et al., 2008), those procedures are either conservative or poorly grounded. Guo and He (2020) propose an asymptotically sharp procedure to adjust for the subgroup selection bias, but it is not designed for high-dimensional observational studies. In addition, most false discovery controls that are developed for independent tests cannot be well justified for multiple analyses based on the same data set. When they do protect false positive rates, they tend to be conservative too. In our proposal, we handle the dependence among tests by using bootstrap-based calibration.
In classical subgroup analysis in RCTs, a subgroup is usually defined as a subpopulation of patients determined by their baseline characteristics, and the corresponding subgroup treatment effect is the effect of a certain medical treatment for this given subpopulation. As observational studies often allow us to investigate multiple treatment strategies at the same time, subgroups can be more broadly defined. In this paper, other than simply viewing subgroups in the classical RCT setting, we also consider subgroups that are defined as subpopulations of patients that receive different treatments. We refer to the variables that capture the differential subgroup treatment effects as treatment effect heterogeneity variables.

Suppose we have a random sample of $n$ i.i.d. observations $\{(Y_i, Z_i, X_i)\}_{i=1}^n$, where $Y_i$ is the response variable, $Z_i$ is a $p_1$-dimensional vector of treatment effect heterogeneity variables (see Remark 1 for its construction in different scenarios), and $X_i$ is a $p_2$-dimensional vector of covariates for the $i$-th subject. Without loss of generality, we assume that a larger value of $\beta_j$ means a better treatment effect. We work under the high-dimensional setting where $p_1 + p_2 \gg n$. We assume that
$$Y_i = Z_i'\beta + X_i'\gamma + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $\beta = (\beta_1, \ldots, \beta_{p_1})'$ captures the subgroup treatment effects, $\gamma$ is a sparse vector of coefficients, and the $\varepsilon_i$'s are independent random errors with $E(\varepsilon_i \mid Z_i, X_i) = 0$. Define $[p_1] = \{1, \ldots, p_1\}$. In practice, scientists often hope to identify the subgroup with the highest treatment level.
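For concreteness, data from model (1) can be sketched in a few lines; the dimensions and coefficient values below are hypothetical, chosen only to illustrate the $p_1 + p_2 \gg n$ regime:

```python
# Hypothetical draw from model (1): Y_i = Z_i' beta + X_i' gamma + eps_i,
# with a handful of subgroup indicators Z and many covariates X with a sparse gamma.
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 100, 5, 500                      # p1 + p2 >> n
Z = rng.binomial(1, 0.5, size=(n, p1))       # treatment effect heterogeneity variables
X = rng.standard_normal((n, p2))             # high-dimensional covariates
beta = np.array([0.5, 0.4, 0.3, 0.2, 0.1])   # subgroup treatment effects
gamma = np.zeros(p2)
gamma[0] = 1.0                               # gamma is sparse
Y = Z @ beta + X @ gamma + rng.standard_normal(n)
```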
Our goal is to find a consistent point estimate and a confidence interval with asymptotically sharp coverage probability for:

(i) the selected subgroup treatment effect $\beta_{\widehat{s}}$, where $\widehat{s} = \arg\max_{j\in[p_1]}\widehat{\beta}_j$;
(ii) the maximal subgroup treatment effect $\beta_{\max} = \max_{j\in[p_1]}\beta_j$,

where $\widehat{\beta}_j$ is an estimate of $\beta_j$ (to be specified in Sections 3.1 and 3.2). While one may debate which perspective is practically more relevant in subgroup analysis, our proposed inferential procedure works for both quantities. We will start with inference on $\beta_{\max}$ and show in Section 4 (Corollary 1) that the same procedure works for making inference on $\beta_{\widehat{s}}$.

To fully appreciate the challenges of making inference on $\beta_{\max}$ in high dimensions, we now provide an illustrative example. The take-away point for practitioners is clear: even if $\beta$ can be estimated accurately, methods without appropriate adjustments for random selection cannot produce valid inference on $\beta_{\max}$ (or $\beta_{\widehat{s}}$).

Example 1 (Selection bias and regularization bias in estimating $\beta_{\max}$). In practice, $\beta_{\max}$ is frequently estimated in a two-step procedure: one first obtains an estimate $\widehat{\beta}$ and then estimates $\beta_{\max}$ by the maximum $\max\{\widehat{\beta}_1, \ldots, \widehat{\beta}_{p_1}\}$. Here, we use two widely adopted procedures to estimate $\beta$ in high dimensions: (1) Lasso, which refers to simply using the estimates $(\widehat{\beta}_{\text{Lasso}}', \widehat{\gamma}_{\text{Lasso}}')'$ from the $\ell_1$-penalized regression program without any adjustments; and (2) Refitted Lasso, which is obtained by refitting the linear model by ordinary least squares (OLS) based on the covariates in the support set of $(\widehat{\beta}_{\text{Lasso}}', \widehat{\gamma}_{\text{Lasso}}')'$.
As a benchmark, we also report the performance of the oracle estimator $(\widehat{\beta}_{\text{Oracle}}', \widehat{\gamma}_{\text{Oracle}}')'$, which pretends the true support set of $\gamma$ is known and estimates the coefficients by OLS. We generate Monte Carlo samples following the setup in Model (1). We generate $Z_{i1} = \mathbb{1}\{W_{i1} > 0.5\}$, $Z_{i2} = \mathbb{1}\{W_{i2} > 0.5\}$, $X_{ij} = W_{ij}$ for $3 \le j \le p$, and $W_i \sim N(0, \Sigma)$, where $\Sigma = (\Sigma_{jk})_{j,k=1}^p$ and $\Sigma_{jk} = 0.5^{|j-k|}$, for $i = 1, \ldots, n$. We set the sample size $n = 100$ and the dimension $p = 500$, and set the coefficients $\beta = (0.5, 0.6)'$ and $\gamma = (1, 0, \ldots, 0)' \in \mathbb{R}^{p-2}$. In Figure 1, we report the root-$n$ scaled bias along with its standard error bands based on 10,000 Monte Carlo samples.

[Figure 1: Root-$n$ scaled bias along with its Monte Carlo standard error bands for Example 1.]

From the results in Figure 1, we observe that all three estimators are biased for $\beta_{\max}$. Although the oracle estimator $\widehat{\beta}_{\text{Oracle}}$ is an unbiased estimator of $\beta$, its maximum is usually not centered around $\beta_{\max}$. In fact, following the explicit evidence given in Nadarajah and Kotz (2008), this simple plug-in estimate is usually biased upward. The magnitude of this bias also crucially depends on the unknown parameters $\beta$ and the covariance structure of $\widehat{\beta}_{\text{Oracle}}$. Therefore, any inference procedure based on the naive plug-in estimate of $\beta_{\max}$ cannot be valid. This discrepancy between the naive plug-in estimate and the true maximum effect is referred to as the "subgroup selection bias" in the literature (Zöllner and Pritchard, 2007; Cook et al., 2014; Bornkamp et al., 2017; Guo and He, 2019). The Lasso and the refitted estimators are typically biased for $\beta$ due to regularization (see Wang et al., 2019, for detailed discussion), and they cannot avoid the selection bias issue either.
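The upward bias of the plug-in maximum can be reproduced in a few lines. The sketch below is a hypothetical simplification of Example 1, not its exact setup: each coordinate is estimated without bias, so any remaining bias in the maximum is pure selection bias.

```python
# Hypothetical illustration: unbiased coordinate-wise estimates, yet the
# plug-in maximum max_j beta_hat_j overestimates beta_max.
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([0.5, 0.4, 0.3])     # true effects; beta_max = 0.5
n, n_mc, sigma = 100, 20000, 1.0

# beta_hat_j ~ N(beta_j, sigma^2 / n): an unbiased "oracle"-style estimator.
beta_hat = beta + sigma / np.sqrt(n) * rng.standard_normal((n_mc, beta.size))
bias = beta_hat.max(axis=1).mean() - beta.max()
print(f"selection bias of the plug-in maximum: {bias:.4f}")   # positive
```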
The inferential framework proposed in the following sections simultaneously adjusts for the selection bias and the penalization bias, and it produces a bias-reduced estimate as well as an asymptotically sharp confidence interval for $\beta_{\max}$.

Remark 1 (Construction of $Z_i$). When subgroups are defined as in RCTs, $Z_i$ is composed of the interaction terms between the treatment indicator variable and the subgroup indicator variables. Under the Neyman-Rubin causal model (Neyman, 1923; Rubin, 1974), we show in the Supplementary Materials (Section C) that such a construction allows us to interpret $\beta$ as the treatment effects of the considered subgroups. For now, we require that the considered subgroups do not overlap, and we show in Section 5.2 how our framework can be naturally extended to overlapping subgroups. When subgroups are defined as subpopulations of patients with different treatments, $Z_i$ consists of $p_1$ indicator variables, each of which represents a different treatment, and $\beta$ represents the corresponding treatment effects. Here, we allow each individual to receive multiple treatments. A similar model setup has been considered in Imai et al. (2013); Wang and Ware (2013); Lipkovich et al. (2017).

Notation. Denote by $p = p_1 + p_2$ the dimension of the vector $(Z_i', X_i')'$. Define the covariate matrix $X = (X_1, \ldots, X_n)' \in \mathbb{R}^{n\times p_2}$ and the subgroup design matrix $Z = (Z_1, \ldots, Z_n)' \in \mathbb{R}^{n\times p_1}$. Denote the maximum of any vector $a$ by $a_{\max} = \max_j a_j$, and denote the collection of integers from 1 to $p_1$ by $[p_1]$. Suppose $M$ is a subset of $\{1, \ldots, p_2\}$. Then for any $p_2$-dimensional vector $a$, $a_M$ is defined to be the sub-vector of $a$ indexed by $M$, and $X_M = \{X_{\cdot j}, j \in M\}$, where $X_{\cdot j}$ is the $j$-th column of $X$. Define $I_p$ to be the $p$-dimensional identity matrix.
Define $I_M$ and $I_Z$ to be index matrices, whose dimensions are context-specific and which satisfy $(Z_i', X_i')I_M' = (Z_i', X_{i,M}')$ and $I_Z(Z_i', X_{i,M}')' = Z_i$, respectively. Define the sample covariance matrix $\widehat{\Sigma}_M = n^{-1}\sum_{i=1}^n (Z_i', X_{i,M}')'(Z_i', X_{i,M}')$ and the projection matrix $P_M = X_M(X_M'X_M)^{-1}X_M'$. Lastly, $Q_Y(\tau)$ denotes the $\tau$-th quantile of a random variable $Y$.

In this section, we propose two bootstrap-based inferential frameworks for $\beta_{\max}$. We allow two well-understood debiased estimators to be used as building blocks: one built on the debiased Lasso (Zhang and Zhang, 2014; Van de Geer et al., 2014; Belloni et al., 2014) for large $p_1$ (Section 3.1), and the other built on the repeated data splitting of Wang et al. (2019) for fixed $p_1$ (Section 3.2). While both procedures have appealing statistical properties, they have different strengths. On the one hand, the procedure based on the debiased Lasso allows $p_1$ to increase with $n$ and does not require the non-zero coefficients in $\gamma$ to be rather large; it therefore equips researchers with a flexible choice of $Z_i$. On the other hand, the procedure based on repeated data splitting (R-Split) takes advantage of a sparse model. When subgroups are highly correlated with the covariates, it often provides a more efficient estimate of the target parameter. As this section focuses on the methodology of each method, we defer their comparison to Section 4.3.

3.1 Bootstrap assisted debiased Lasso adjustment for large $p_1$

When $p_1$ is large and increases with the sample size $n$, we propose the following procedure to simultaneously address the selection bias and the penalization bias:

Step 1.
Construct the debiased estimator $\widehat{b}$ of $\beta$ through (2).

Step 2.
For $j \in [p_1]$, calculate the calibration term
$$\widehat{c}_j(r) = (1 - n^{r-0.5})(\widehat{\beta}_{\max} - \widehat{\beta}_{j,\text{Lasso}}),$$
where $\widehat{\beta}_{\max} = \max_{j\in[p_1]}\widehat{\beta}_{j,\text{Lasso}}$ and $r$ is a positive tuning parameter between 0 and 0.5, whose order is motivated by our theoretical analysis (see Section 4). In practice, we choose $r$ via cross-validation (see Section 5.1 for implementation).

Step 3.
For $b = 1, \ldots, B$ do:
1. Generate a consistent bootstrap replicate of $\widehat{b}$, denoted by $\widehat{b}^*$.
2. Recalibrate the bootstrap statistic via $T^*_b = \max_{j\in[p_1]}(\widehat{b}^*_j + \widehat{c}_j(r)) - \widehat{\beta}_{\max}$.

Step 4.
The level-$\alpha$ one-sided confidence interval for $\beta_{\max}$ is $[\widehat{b}_{\max} - Q_{T^*_b}(\alpha), +\infty)$, and a bias-reduced estimate for $\beta_{\max}$ is $\widehat{b}_{\max} - B^{-1}\sum_{b=1}^B T^*_b$.

To cast some insight into the proposed framework, we comment on its methodological details from three perspectives.

First, to address the penalization bias, we estimate $\beta$ via the de-sparsified Lasso procedure (Zhang and Zhang, 2014; Van de Geer et al., 2014):
$$\widehat{b}_j = \widehat{\beta}_{j,\text{Lasso}} + \widetilde{Z}_{\cdot j}'\big(Y - Z\widehat{\beta}_{\text{Lasso}} - X\widehat{\gamma}_{\text{Lasso}}\big), \quad j = 1, \ldots, p_1, \qquad (2)$$
where $\widetilde{Z}_{\cdot j}$ can be viewed as a standardized residual of $Z_{\cdot j}$ after regressing it on the remaining columns of $(Z, X)$; it equals $\widetilde{Z}_{\cdot j} = V_j/(V_j'Z_{\cdot j})$ with $V_j = Z_{\cdot j} - (Z, X)_{-j}\widehat{\zeta}_j$ and
$$\widehat{\zeta}_j = \arg\min\Big\{\|Z_{\cdot j} - (Z, X)_{-j}\zeta_j\|_2^2/n + \lambda_{n,j}\|\zeta_j\|_1 : \zeta_j \in \mathbb{R}^{p-1}\Big\}. \qquad (3)$$

Second, in Step 3, our procedure relies on a valid bootstrap approximation of the debiased estimate $\widehat{b}$. In high dimensions, we adopt the wild bootstrap approach proposed in Dezeure et al. (2017):
$$Y^*_i = Z_i'\widehat{\beta}_{\text{Lasso}} + X_i'\widehat{\gamma}_{\text{Lasso}} + \varepsilon^*_i, \quad i = 1, \ldots, n, \qquad (4)$$
where $\varepsilon^*_i = u_i\widehat{\varepsilon}_i$, $\widehat{\varepsilon}_i$ is the residual from the Lasso fit on the original sample, and the $u_i$'s are i.i.d., independent of the data, with $E[u_i] = 0$, $E[u_i^2] = 1$, and $E[u_i^4] < \infty$. The bootstrap estimate $\widehat{b}^* = (\widehat{b}^*_1, \ldots, \widehat{b}^*_{p_1})'$ is defined as the debiased Lasso estimate of $\beta$ based on the bootstrap sample $\{(Y^*_i, Z_i, X_i)\}_{i=1}^n$. We note that in (4), instead of using the debiased estimate $\widehat{b}$ to generate the bootstrap sample, Dezeure et al. (2017) adopt $\widehat{\beta}_{\text{Lasso}}$ to ensure the bootstrap sample is indeed generated from a sparse model.
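As a concrete sketch of (4), with hypothetical variable names, and Rademacher weights as one common multiplier choice satisfying the stated moment conditions:

```python
# A minimal sketch of the wild bootstrap in (4), assuming the Lasso fits
# beta_lasso, gamma_lasso and the residuals eps_hat are already available.
# Rademacher multipliers satisfy E[u] = 0, E[u^2] = 1, E[u^4] < infinity.
import numpy as np

def wild_bootstrap_response(Z, X, beta_lasso, gamma_lasso, eps_hat, rng):
    """Generate one bootstrap response vector Y* from (4)."""
    u = rng.choice([-1.0, 1.0], size=eps_hat.shape)  # multipliers u_i
    return Z @ beta_lasso + X @ gamma_lasso + u * eps_hat
```

Because the design $(Z, X)$ is reused across bootstrap samples, each replicate only requires re-running the debiased Lasso on $(Y^*, Z, X)$.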
In addition, this wild bootstrap procedure can incorporate heteroscedastic errors without much increase in computational cost, as the design matrix remains unchanged across bootstrap samples.

Under certain regularity conditions, Dezeure et al. (2017) have shown that the bootstrap procedure in (4) is consistent in the sense that, conditional on the data, the asymptotic distribution of $\sqrt{n}(\widehat{b}^* - \widehat{\beta}_{\text{Lasso}})$ is the same as the limiting distribution of the debiased Lasso estimator $\sqrt{n}(\widehat{b} - \beta)$. However, just as $\widehat{b}_{\max}$ is not centered around $\beta_{\max}$, $\widehat{b}^*_{\max} = \max_{j\in[p_1]}\widehat{b}^*_j$ is not centered around $\widehat{\beta}_{\max} = \max_{j\in[p_1]}\widehat{\beta}_{j,\text{Lasso}}$, and the usual high-dimensional bootstrap procedures do not estimate the selection bias in $\widehat{b}_{\max}$ correctly. As a result, inference based directly on $\widehat{b}^*_{\max}$ cannot be valid for $\beta_{\max}$, and an appropriate adjustment to the bootstrap procedure is needed.

Third, Step 3 constitutes the core of our bootstrap calibration procedure, which addresses the selection bias. There, we propose to modify $\widehat{b}^*_{\max}$ as follows:
$$\widehat{b}^*_{\text{modified};\max} = \max_{j\in[p_1]}\big(\widehat{b}^*_j + \widehat{c}_j(r)\big), \qquad (5)$$
where $\widehat{c}_j(r) = (1 - n^{r-0.5})(\widehat{\beta}_{\max} - \widehat{\beta}_{j,\text{Lasso}})$, and $r$ is a positive tuning parameter whose theoretical order is characterized in Assumptions 7 and 8. In practice, we provide a simple cross-validation procedure for choosing $r$ adaptively from the data (Section 5.1). In $\widehat{b}^*_{\text{modified};\max}$, we adjust each debiased estimate $\widehat{b}^*_j$ by the amount $\widehat{c}_j(r)$, which measures the distance between $\widehat{\beta}_{j,\text{Lasso}}$ and the observed maximal regression coefficient $\widehat{\beta}_{\max}$ from the Lasso. The adjustment is large if $\widehat{\beta}_{j,\text{Lasso}}$ is small, and small if $\widehat{\beta}_{j,\text{Lasso}}$ is large.
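Putting Steps 2-4 together, the calibration itself is inexpensive once the debiased estimate $\widehat{b}$, the Lasso estimate $\widehat{\beta}_{\text{Lasso}}$, and the bootstrap replicates $\widehat{b}^*$ are available. The sketch below uses array names of our choosing, with `alpha` the nominal level as in Step 4; computing the replicates (e.g., via a debiased-Lasso implementation such as the hdi package, combined with the wild bootstrap in (4)) is the expensive part.

```python
# Sketch of Steps 2-4: bootstrap calibration for the maximal effect.
# Inputs: b_hat (p1,), beta_lasso (p1,), b_star (B, p1) bootstrap replicates.
import numpy as np

def calibrated_max_inference(b_hat, beta_lasso, b_star, n, r, alpha=0.95):
    beta_max_lasso = beta_lasso.max()
    # Step 2: calibration terms c_j(r) = (1 - n^(r-0.5)) (beta_max - beta_j,Lasso)
    c = (1.0 - n ** (r - 0.5)) * (beta_max_lasso - beta_lasso)
    # Step 3: recalibrated bootstrap statistics T*_b
    T_star = (b_star + c).max(axis=1) - beta_max_lasso
    # Step 4: bias-reduced point estimate b_max - mean(T*) and
    # one-sided interval [b_max - Q_{T*}(alpha), +inf)
    b_max = b_hat.max()
    return b_max - T_star.mean(), b_max - np.quantile(T_star, alpha)
```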
By adding the correction term $\widehat{c}_j(r)$, we show in Section 4 that the asymptotic distributions of $\sqrt{n}(\widehat{b}^*_{\text{modified};\max} - \widehat{\beta}_{\max})$ and $\sqrt{n}(\widehat{b}_{\max} - \beta_{\max})$ are equivalent, meaning that the bootstrap modification successfully adjusts for the regularization bias and the selection bias simultaneously.

3.2 Bootstrap assisted R-Split adjustment for fixed $p_1$

When $\beta$ is a low-dimensional parameter of interest, penalizing $\beta$ may not be necessary. In fact, in this case, inference on $\beta$ is frequently carried out after a sufficiently small model is selected (Belloni et al., 2013, 2014). Although any reasonable model selection procedure can be adopted, we only penalize $\gamma$, since $\beta$ is the target parameter and should always be kept in the selected model. In this case, given a properly chosen data-dependent model $\widehat{M}$, the refitted OLS estimator $\widehat{\beta}_{\text{OLS}}$,
$$(\widehat{\beta}_{\text{OLS}}', \widehat{\gamma}_{\text{OLS}}')' = \arg\min_{\beta\in\mathbb{R}^{p_1},\, \gamma\in\mathbb{R}^{|\widehat{M}|}} \Big\{ n^{-1}\sum_{i=1}^n \big(Y_i - Z_i'\beta - X_{i,\widehat{M}}'\gamma\big)^2 \Big\},$$
is a popular choice in practice. However, because the random model $\widehat{M}$ enters the estimation process, $\widehat{\beta}_{\text{OLS}}$ is usually biased unless a perfect model is selected. As this is not the main focus of this article, we only briefly discuss the reason for this issue here and refer interested readers to Wang et al. (2019) for more examples and simulation studies. Following a similar derivation in Wang et al. (2019), we decompose $\widehat{\beta}_{\text{OLS}}$ as
$$\sqrt{n}(\widehat{\beta}_{\text{OLS}} - \beta) = \underbrace{I_Z(\widehat{\Sigma}_{\widehat{M}})^{-1}\cdot \frac{1}{\sqrt{n}}\sum_{i=1}^n (Z_i', X_{i,\widehat{M}}')'\varepsilon_i}_{=:\, b_{n,1}\ \text{(over-fitting bias)}} + \underbrace{\big(Z'(I - P_{\widehat{M}})Z/n\big)^{-1} Z'(I - P_{\widehat{M}})X\gamma/\sqrt{n}}_{=:\, b_{n,2}\ \text{(under-fitting bias)}}.$$
Due to the correlation between $\varepsilon_i$ and the data-dependent model $\widehat{M}$, we typically have $E(\varepsilon_i \mid X_{i,\widehat{M}}) \neq 0$, meaning that the first term $b_{n,1}$ is roughly the mean of $n$ random variables that do not have mean zero. The second term $b_{n,2}$ captures the impact of the variables left out of $\widehat{M}$ on estimating $\beta$, and it vanishes whenever the selected model $\widehat{M}$ covers the support set of $\gamma$. To simultaneously control the over- and under-fitting biases induced by regularization procedures used in model selection, we adopt the repeated data splitting approach (R-Split) proposed by Wang et al. (2019) to estimate $\beta$. A detailed description of R-Split is provided in Steps 1 and 2 below. Under certain regularity conditions, the R-Split estimator, denoted by $\widetilde{b}$, removes the over- and under-fitting biases, and it converges to a normal distribution centered around $\beta$ at a root-$n$ rate.

Note that the model selection procedure adopted in Step 1(b) can be any easily accessible procedure, but ideally, we hope this selection procedure by-passes model selection mistakes with high probability to avoid the risk of under-fitting. Following Theorem 1 of Wang et al. (2019), the smoothed estimator $\widetilde{b}$ satisfies the linear expansion
$$\widetilde{b} - \beta = \Gamma_n \cdot n^{-1}\sum_{i=1}^n (Z_i', X_i')'\varepsilon_i + o_p(1/\sqrt{n}),$$
where $\Gamma_n$ is a $p_1 \times p$ matrix that is independent of $\varepsilon$, and $\Gamma_n$ can be well approximated by its sample analogue $\widetilde{\Gamma}_n$ defined in Step 2 below. After generating the residuals $(\varepsilon^*_1, \ldots, \varepsilon^*_n)$ following (4), this linear expansion allows us to quickly generate bootstrap replicates of $\widetilde{b}$ without implementing a double bootstrap:
$$\widetilde{b}^* = \widetilde{b} + \widetilde{\Gamma}_n \cdot n^{-1}\sum_{i=1}^n (Z_i', X_i')'\varepsilon^*_i. \qquad (6)$$
We provide the theoretical justification of this bootstrap procedure in Supplementary Material Section B.

Similar to (5) in the previous section, with a valid bootstrap procedure for replicating $\widetilde{b}$ in hand, we again modify the bootstrap statistics by adding $\widetilde{c}_j(r) = (1 - n^{r-0.5})(\widetilde{b}_{\max} - \widetilde{b}_j)$ to obtain a consistent approximation of the distribution of $\widetilde{b}_{\max}$. A detailed description of the proposed procedure is summarized below.

Step 1.
For $b = 1, \ldots, B$ do:
(a) Randomly split the data $\{(Y_i, X_i, Z_i)\}_{i=1}^n$ into group $T_1$ of size $n_1$ and group $T_2$ of size $n_2 = n - n_1$, and let $v_{bi} = \mathbb{1}(i \in T_2)$ for $i = 1, \ldots, n$.
(b) Select a model $\widehat{M}_b$ to predict $Y$ based on $T_1$.
(c) Refit the model with the data in $T_2$ to get
$$(\widetilde{b}_b', \widetilde{\gamma}_b')' = \arg\min \sum_{j\in T_2} \big(Y_j - Z_j'\beta - X_{j,\widehat{M}_b}'\gamma\big)^2.$$
The R-Split estimate is obtained by averaging over the $\widetilde{b}_b$'s:
$$\widetilde{b} = \frac{1}{B}\sum_{b=1}^B \widetilde{b}_b.$$

Step 2.
For $j \in [p_1]$, calculate
$$\widetilde{\Gamma}_n = \frac{1}{B}\sum_{b=1}^B I_Z\Big(n_2^{-1}\sum_{i=1}^n v_{bi}(Z_i', X_{i,\widehat{M}_b}')'(Z_i', X_{i,\widehat{M}_b}')\Big)^{-1} I_{\widehat{M}_b}, \qquad \widetilde{c}_j(r) = (1 - n^{r-0.5})(\widetilde{b}_{\max} - \widetilde{b}_j),$$
where $r$ is a positive tuning parameter between 0 and 0.5.

Step 3.
For $b = 1, \ldots, B$ do:
1. Generate a bootstrap replicate $\widetilde{b}^*$ from (6).
2. Recalibrate the bootstrap statistic via $T^*_b = \max_{j\in[p_1]}(\widetilde{b}^*_j + \widetilde{c}_j(r)) - \widetilde{b}_{\max}$.

Step 4.
The level-$\alpha$ one-sided confidence interval for $\beta_{\max}$ is $[\widetilde{b}_{\max} - Q_{T^*_b}(\alpha), +\infty)$, and a bias-reduced estimate for $\beta_{\max}$ is $\widetilde{b}_{\max} - B^{-1}\sum_{b=1}^B T^*_b$.

In this section, we discuss the theoretical properties of the bootstrap assisted debiased Lasso adjustment in detail. Define the set of indices attaining the maximal regression coefficient as $H = \{j \in [p_1] : \beta_j = \max_{k\in[p_1]}\beta_k\}$. To establish the asymptotic validity of the bootstrap assisted debiased Lasso procedure, under a fixed design, we work under the following assumptions:

Assumption 1. The Lasso estimates satisfy $\|\widehat{\beta}_{\text{Lasso}} - \beta\|_1 + \|\widehat{\gamma}_{\text{Lasso}} - \gamma\|_1 = o_p\big(1/\sqrt{\log(p)\log(1+p_1)}\big)$.

Assumption 2.
The tuning parameter for the Lasso satisfies $\lambda_n \asymp \sqrt{\log(p)/n}$.

Assumption 3.
The noise variables $\varepsilon_1, \ldots, \varepsilon_n$ are mutually independent and satisfy (i) $E(\varepsilon_i) = 0$; (ii) $\sum_{i=1}^n E\varepsilon_i^2/n = \sigma_\varepsilon^2$ and $\sigma_i^2 = E\varepsilon_i^2$ is bounded away from zero; and (iii) $E|\varepsilon_i|^3$ is bounded.

Assumption 4.
The bootstrapped Lasso estimates satisfy $\|\widehat{\beta}^*_{\text{Lasso}} - \widehat{\beta}_{\text{Lasso}}\|_1 + \|\widehat{\gamma}^*_{\text{Lasso}} - \widehat{\gamma}_{\text{Lasso}}\|_1 = o_p\big(1/\sqrt{\log(p)\log(1+p_1)}\big)$.

Assumption 5.
The treatment variables $Z_i$ and the covariates $X_i$ are bounded, for $i = 1, \ldots, n$.

Assumption 6.
For the de-sparsified Lasso step, we require (i) $\|V_j\|_2^2/n$ is bounded away from zero; (ii) $\|V_j\|_\infty = o(\|V_j\|_2)$ for $j \in [p_1]$; and (iii) $\max_{j\in[p_1]}\|V_j\|_\infty$ is bounded above and $\log(p) = o(n^{1/7})$.

Assumption 7.
The tuning parameter and the Lasso estimate satisfy $n^r\|\widehat{\beta}_{\text{Lasso}} - \beta\|_\infty = o_p\big(1/\sqrt{\log(1+p_1)}\big)$.

Assumption 8.
The tuning parameter satisfies $\liminf_{n\to\infty}\big(n^r(\max_{j\in[p_1]}\beta_j - \max_{j\notin H}\beta_j) - \log(p_1)\big) = +\infty$.

Assumptions 1-6 guarantee a valid (bootstrap) debiased Lasso procedure. Dezeure et al. (2017) require similar conditions and identify sufficient conditions for Assumptions 1, 2, 4 and 6. We refer interested readers to Dezeure et al. (2017) for a comprehensive discussion of Assumptions 1-6. Assumptions 7-8 are required for a valid bias correction for $\widehat{b}_{\max}$. Intuitively, they require the tuning parameter to be neither too large (Assumption 7) nor too small (Assumption 8), so that the modified bootstrap can correct the penalization bias and the selection bias simultaneously. In fact, these two assumptions are immediate consequences of Assumptions 1-6 under certain conditions: if $n^r = O(\sqrt{\log(p)})$, then Assumption 1 directly implies Assumption 7; if $\max_{j\in[p_1]}\beta_j - \max_{j\notin H}\beta_j$ is a constant and $r \geq 1/7$, then Assumption 6 implies Assumption 8. Lastly, when $p_1$ is fixed and $r \in (0, 0.5)$, Assumptions 7 and 8 are satisfied automatically.

Theorem 1.
Under Assumptions 1-8, the modified bootstrap maximal treatment effect estimator $\widehat{b}^*_{\text{modified};\max}$ satisfies
$$\sup_{c\in\mathbb{R}}\big|P\big(\sqrt{n}(\widehat{b}_{\max} - \beta_{\max}) \le c\big) - P^*\big(\sqrt{n}(\widehat{b}^*_{\text{modified};\max} - \widehat{\beta}_{\max}) \le c\big)\big| = o_p(1).$$

Beyond $\beta_{\max}$, the proposed confidence interval also serves as an asymptotically sharp prediction interval for the selected treatment effect, i.e., $\beta_{\widehat{s}}$. We formalize this notion in the following corollary; its proof can be found in Supplementary Material Section A.3.

Corollary 1 (Selected subgroup with the maximal treatment effect). Under Assumptions 1-8, we have
$$\sup_{c\in\mathbb{R}}\big|P\big(\sqrt{n}(\widehat{b}_{\max} - \beta_{\widehat{s}}) \le c\big) - P^*\big(\sqrt{n}(\widehat{b}^*_{\text{modified};\max} - \widehat{\beta}_{\max}) \le c\big)\big| = o_p(1).$$

In this section, we provide the theoretical properties of the bootstrap assisted R-Split adjustment. As we work under the same assumptions as those discussed in Wang et al. (2019), we only state the theoretical conclusion to avoid redundancy.
Theorem 2.
Under the assumptions given in Wang et al. (2019), when $p$ is a fixed number, the modified bootstrap maximal treatment effect estimator $\widetilde{b}^*_{\text{modified};\max} = \max_{j\in[p_1]}(\widetilde{b}^*_j + \widetilde{c}_j(r))$ satisfies
$$\sup_{c\in\mathbb{R}}\big|P\big(\sqrt{n}(\widetilde{b}_{\max} - \beta_{\max}) \le c\big) - P^*\big(\sqrt{n}(\widetilde{b}^*_{\text{modified};\max} - \widetilde{b}_{\max}) \le c\big)\big| = o_p(1),$$
and
$$\sup_{c\in\mathbb{R}}\big|P\big(\sqrt{n}(\widetilde{b}_{\max} - \beta_{\widehat{s}}) \le c\big) - P^*\big(\sqrt{n}(\widetilde{b}^*_{\text{modified};\max} - \widetilde{b}_{\max}) \le c\big)\big| = o_p(1).$$

While both procedures provided in Sections 3.1 and 3.2 share similar statistical guarantees, R-Split has some advantages when $p_1$ is small. As discussed in Wang et al. (2019), when $Z_i$ and $X_i$ are correlated, R-Split provides more efficient point estimates than the debiased Lasso method at sparse models. In this case, we expect the proposed method built on R-Split to yield shorter confidence intervals and hence more power to detect subgroup treatment effect heterogeneity. Moreover, R-Split does not penalize the target parameter $\beta$, making it more transparent than the debiased Lasso method.

To verify this heuristic claim, we provide a simple simulation study similar to the two-stage model setups considered in Wang et al. (2019) and Belloni et al. (2014). We generate random samples from the model $Y_i = 0.5\,Z_i'\beta + X_i'\gamma + \varepsilon_i$, $i = 1, \ldots, n$, where the $\varepsilon_i$'s are i.i.d. standard normal random variables. As for the covariates in the model, we consider two cases. (1) When $Z_i$ is a vector of binary random variables, we generate $Z_i$ and $X_i$ from $Z_{ij} = X_{i,2j-1} + X_{i,2j} + \nu_{ij}$, $j = 1, \ldots, p_1$, where the $\nu_{ij}$'s are i.i.d. Bernoulli random variables, the $(X_{i,2j-1}, X_{i,2j})'$'s are i.i.d. Bernoulli random vectors satisfying $X_{i,2j-1} + X_{i,2j} + \nu_{ij} \in \{0, 1\}$ for all $j = 1, \ldots, p_1$, and the remaining covariates $(X_{i,2p_1+1}, \ldots, X_{i,p_2})'$ are i.i.d. multivariate normal random vectors with covariance matrix $I_{p_2-2p_1}$. (2) When $Z_i = (Z_{i1}, \ldots, Z_{ip_1})'$ is a vector of continuous random variables, we generate $Z_i$ and $X_i$ from $Z_{ij} = 0.5\,X_{i,j+3} + 0.5\sqrt{2}\,X_{i,j+4} + \nu_{ij}$, $j = 1, \ldots, p_1$, where $\nu_i \sim N(0, 0.5\cdot I_{p_1})$ and $X_i \sim N(0, I_{p_2})$.

We fix the tuning parameter $r$ and vary $\lambda$, the tuning parameter used to obtain $\widehat{\beta}_{\text{Lasso}}$. Here, $\lambda_{\text{.1se}}$ and $\lambda_{\text{.min}}$ are the default tuning parameters provided by the R package glmnet, one of which is the tuning parameter adopted by the hdi package. To fully illustrate the impact of the tuning parameter on the debiased Lasso assisted bootstrap calibration, we also include a larger choice of the tuning parameter. Results are reported in Figure 2, where the null case corresponds to $\beta_{\max} = 0$. When the covariates $X_i$ are highly correlated with $Z_i$, the R-Split assisted bootstrap calibration is more powerful than the one based on the debiased Lasso. While the performance of the debiased Lasso assisted bootstrap calibration is sensitive to the choice of the tuning parameter $\lambda$, the R-Split based approach is rather robust. These findings are in line with our theoretical analysis.

[Figure 2: Power comparison for R-Split and the debiased Lasso assisted bootstrap calibrations, evaluated over 1,000 Monte Carlo samples.]

5.1 Implementation and the choice of the tuning parameter $r$

We implement the debiased Lasso-based method in Section 3.1 using the R package hdi. The R-Split assisted bootstrap method in Section 3.2 is implemented following the recommendation in Wang et al. (2019): we select the tuning parameter (and hence the model) in Lasso by cross-validation with a constraint on the maximum and minimum model sizes. We generate 200 bootstrap replications for both methods and set $B = 1000$ for R-Split, following the recommendation in Wang et al. (2019).

As for the tuning parameter $r$, we propose a data-adaptive cross-validated algorithm to select $r$ (Algorithm 1).
Here, the idea is to choose r to minimize the mean squared error between the proposed bias-reduced estimate b̂_{max,reduced} (respectively b̃_{max,reduced}) and β_max. To make this possible without knowing the true value of β_max, we provide an approximation of the mean squared error that can be computed from the data (Line 8). The justification of this cross-validation method for fixed p can be found in Guo and He (2020). When p increases, we calibrate the selected tuning parameter via a simple adjustment: we inflate r_cv by a factor proportional to √p. Our simulation results suggest this calibration yields confidence intervals with near-optimal coverage probabilities in finite samples. We leave the theoretical justification of such a choice to future research. In practice, we suggest implementing Algorithm 1 via three-fold cross-validation with a candidate set R consisting of a grid of equally spaced fractions in (0, 1].

Algorithm 1 Cross-validated choice of the tuning parameter r
1: Suppose R = {r_1, . . . , r_m} is a set of candidate tuning parameters with r_1 < · · · < r_m and m a finite integer. Randomly partition the data into v equal-sized subsamples;
2: for l = 1, . . . , m do
3:   for j = 1, . . . , v do
4:     Basic setup: use the j-th subsample as the reference data and the rest as the training data;
5:     Bias-reduced estimator: use the training data to obtain the bias-reduced estimator of the maximum treatment effect, b̂_{max,reduced,j}(r_l), with r_l as the tuning parameter;
6:     for i = 1, . . . , k do
7:       Calculations on the reference data: use the reference data to calculate the debiased Lasso estimate of the i-th covariate of interest, b̂_{i,j}, and its standard error σ̂_{i,j};
8:       Evaluation of accuracy: calculate h_{i,j}(r_l) = (b̂_{max,reduced,j}(r_l) − b̂_{i,j})² − σ̂²_{i,j};
9:     end for
10:   end for
11: end for
12: The tuning parameter is chosen as arg min_{r_l} { min_{i∈[k]} [ Σ_{j=1}^{v} h_{i,j}(r_l)/v ] }.
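To make the cross-validation loop in Algorithm 1 concrete, below is a minimal Python sketch. The estimators here are simple stand-ins, not the paper's debiased Lasso or bias-reduced estimators: ordinary least squares plays the role of the per-covariate estimates on the reference data, and `reduced_max` is a hypothetical bias-reduced maximum that shrinks the naive maximum by r standard errors; the data-generating step is likewise made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(Y, Z):
    """OLS coefficients and standard errors (stand-in for the debiased estimates)."""
    XtX_inv = np.linalg.inv(Z.T @ Z)
    beta = XtX_inv @ Z.T @ Y
    resid = Y - Z @ beta
    sigma2 = resid @ resid / (len(Y) - Z.shape[1])
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))

def reduced_max(Y, Z, r):
    """Hypothetical bias-reduced maximum-effect estimator, tuned by r."""
    beta, se = ols(Y, Z)
    j = int(np.argmax(beta))
    return beta[j] - r * se[j]

def cross_validate_r(Y, Z, candidates, v=3):
    """Pick r minimizing the data-based MSE proxy h = (b_red - b_i)^2 - se_i^2."""
    n, k = len(Y), Z.shape[1]
    folds = np.array_split(rng.permutation(n), v)
    scores = []
    for r in candidates:
        h = np.zeros((k, v))
        for j, ref in enumerate(folds):
            train = np.setdiff1d(np.arange(n), ref)
            b_red = reduced_max(Y[train], Z[train], r)   # training data (Line 5)
            b_ref, se_ref = ols(Y[ref], Z[ref])          # reference data (Line 7)
            h[:, j] = (b_red - b_ref) ** 2 - se_ref ** 2  # accuracy proxy (Line 8)
        scores.append(h.mean(axis=1).min())  # min over covariates of fold average
    return candidates[int(np.argmin(scores))]

# Made-up data: one truly nonzero subgroup effect among k = 4.
n, k = 600, 4
Z = rng.standard_normal((n, k))
Y = Z @ np.array([0.0, 0.0, 0.0, 1.0]) + rng.standard_normal(n)
r_cv = cross_validate_r(Y, Z, candidates=[i / 8 for i in range(1, 9)])
```

The returned value is simply the candidate r_l minimizing the fold-averaged accuracy proxy, as in the final line of Algorithm 1.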
When a subgroup is defined as a subpopulation of patients determined by their baseline characteristics, following the derivation in the Supplementary Material, Model (1) requires each subject to fall into only one of the candidate subgroups; in other words, we require the considered subgroups to be non-overlapping so that the validity of the proposed bootstrap calibration procedure is guaranteed. Overlapping subgroups obviously induce a complex correlation structure and pose difficulties in modeling, estimating, and making inference on the (selected) subgroup treatment effects. In this section, we provide a natural extension of our procedure to accommodate overlapping subgroups. Our procedure entails the following steps:
Step 1.
Separate the K original (possibly overlapping) subgroups S_1, . . . , S_K into non-overlapping subgroups S*_1, . . . , S*_p. Let θ_j denote the subgroup treatment effect of the j-th constructed non-overlapping subgroup, for j = 1, . . . , p, and let Z_i denote the interaction term between the indicators of the constructed non-overlapping subgroups and the treatment indicator variable;

Step 2.
Estimate the non-overlapping subgroup treatment effects by either the debiased Lasso (θ̂) or R-Split (θ̃), and generate their corresponding bootstrap replications (θ̂* and θ̃*) via Model (1). For the debiased Lasso procedure, we also calculate the Lasso estimator (θ̂_Lasso);

Step 3.
Define a matrix A ∈ R^{K×p} with

A_kj = P( individual i belongs to subgroup S*_j | individual i belongs to subgroup S_k ),

for k = 1, . . . , K and j = 1, . . . , p. In practice, A is either known to us given population demographic information or can be estimated externally with some prior knowledge.

Step 4.
Estimate the original subgroup treatment effects β by applying a linear transformation: for the debiased Lasso procedure, we estimate the original subgroup treatment effects via β̂_Lasso = A·θ̂_Lasso and b̂ = A·θ̂, and construct their bootstrap replicates via b̂* = A·θ̂*. Similarly, for R-Split, we obtain b̃ = A·θ̃ and b̃* = A·θ̃*;

Step 5.
Proceed with the proposed bootstrap calibration procedure discussed in Section 3 based on these reconstructed subgroup treatment effects.

To further illustrate the proposed algorithm, we provide some explanation via a simple toy example. Suppose we consider four candidate subgroups: (1) the male group S_1, (2) the female group S_2, (3) the young adult group S_3 (18 to 35 years old), and (4) the senior group S_4 (older than 65 years old). Since these four subgroups obviously overlap, the coefficient of Z_i in Model (1) cannot capture the subgroup treatment effect if we naively construct Z_i as the interaction term between the indicators of these four candidate subgroups and the treatment indicator variable. Step 1 of the above algorithm says we need to separate these overlapping subgroups into non-overlapping subgroups: (1) the young male adult group S*_1, (2) the senior male group S*_2, (3) the young female adult group S*_3, and (4) the senior female group S*_4. Then in Step 2, we estimate the non-overlapping subgroup treatment effects via Model (1). In Steps 3 and 4, we reconstruct the original subgroup treatment effects of interest by applying a simple linear transformation A ∈ R^{4×4}. There, the matrix A encodes the conditional proportions, and its component A_kj represents the proportion of individuals in S_k who belong to S*_j. Take the male group for example: A_11 equals the proportion of young male adults in the male group, A_12 equals the proportion of senior males in the male group, and A_13 and A_14 equal zero as there is no male in the young female adult group or the senior female group. These proportions are available to us if we have access to population demographic information (such as that from the Census Bureau). Given these reconstructed subgroup treatment effect estimates via the linear transformation, we then proceed with the proposed bootstrap calibration procedure as usual.
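The linear reconstruction in Steps 3 and 4 for this toy example can be sketched in a few lines of Python. The demographic proportions and cell-level effects below are made up for illustration, and we assume for simplicity that the study population contains only young (18 to 35) and senior (older than 65) adults, so that the four non-overlapping cells completely cover each original subgroup:

```python
import numpy as np

# Cells: S1* young male, S2* senior male, S3* young female, S4* senior female.
# Row k of A holds P(cell | original subgroup S_k); the zero pattern matches
# the toy example (e.g., no male belongs to the female cells).
A = np.array([
    [0.60, 0.40, 0.00, 0.00],  # S1 male
    [0.00, 0.00, 0.55, 0.45],  # S2 female
    [0.70, 0.00, 0.30, 0.00],  # S3 young adult
    [0.00, 0.20, 0.00, 0.80],  # S4 senior
])
theta = np.array([1.0, 0.5, 2.0, 0.0])  # cell-level treatment effects (made up)
beta = A @ theta                        # overlapping-subgroup effects, beta = A theta
# e.g. the male-group effect is 0.6 * 1.0 + 0.4 * 0.5 = 0.8
```

The same matrix multiplication is applied to the point estimates and to each bootstrap replicate, so the calibration in Step 5 proceeds exactly as in the non-overlapping case.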
We provide a detailed implementation of the proposed algorithm in the Supplementary Material (Section D.1).

Step 4 is the key element of the above algorithm, because it allows us to estimate the original subgroup treatment effects of interest by applying a simple linear transformation. As shown in the Supplementary Material (Section D.3), under Assumption 9, which requires the separation to be complete, we have β = Aθ, meaning that the original true subgroup treatment effects of interest β are connected with the non-overlapping subgroup treatment effects θ via the linear transformation. We further argue in the Supplementary Material (Section D.2) that the complete separation required in Assumption 9 always exists. As this section focuses on the practical implementation of our proposal, we leave the theoretical justification of the proposed algorithm to the Supplementary Material (Section D.4).

Assumption 9.
The separation of S_1, . . . , S_K into non-overlapping subgroups S*_1, . . . , S*_p is complete in the sense that, for any j = 1, . . . , p and any k = 1, . . . , K, S*_j ∩ S_k = S*_j or ∅.

In this section, we consider various simulation designs to demonstrate the merit of our proposal. The main takeaway from the simulation study is as follows: when p is large, the debiased Lasso assisted bootstrap calibration is more robust than the one based on R-Split; when many of the covariates are correlated with Z_i or p is rather small, R-Split assisted bootstrap calibration is preferable to practitioners given its higher detection power.

We generate data from the model Y_i = c·Z_i′β + X_i′γ + ε_i, i = 1, . . . , n, for a fixed constant c > 0, where n = 600 and the ε_i's are i.i.d. N(0, 1) random variables. We consider two cases for β: (1) the heterogeneous case, with β = (0, . . . , 0, β_max)′ ∈ R^p, meaning that the treatment effects differ across subgroups; and (2) the spurious heterogeneous case, with β = (0, . . . , 0)′ ∈ R^p, meaning that there is no significant subgroup in the population. We set γ to be a sparse vector in R^p, and we leave the non-sparse case with γ_j = 1/j, j = 1, . . . , p, to the Supplementary Material (Section E). In all considered simulation designs, we set p ∈ {2, 6, 20} with n = 600. When Z_i is a vector of binary random variables, we generate Z_i and X_i from

Z_ij ∼ Bernoulli( exp(X_{i,2j−1} + X_{i,2j}) / (1 + exp(X_{i,2j−1} + X_{i,2j})) ), j = 1, . . . , p,

where X_i ∼ N(0, Σ) with Σ_ij = ρ^{|i−j|} for a constant ρ ∈ (0, 1). When Z_i = (Z_i1, . . . , Z_ip)′ is a vector of continuous random variables, we generate Z_i and X_i from Z_ij = c_1·X_{i,j+3} + c_2·X_{i,j+4} + ν_ij, j = 1, . . . , p, for fixed constants c_1 and c_2, where ν_i ∼ N(0, I_p) and X_i ∼ N(0, I_p).

We compare the finite sample performance of the proposed R-Split and debiased Lasso assisted bootstrap calibrations with two benchmark methods: (1) a naive method with no adjustment, which directly reports the maximal estimated coefficient as the point estimate; and (2) the simultaneous method discussed in Dezeure et al. (2017) and Fuentes et al. (2018). For the R-Split estimator, following the recommendation in Wang et al. (2019), we choose the model size via cross-validation with a minimum model size equal to 5. As shown in Section 4.3, the debiased Lasso based bootstrap calibration is sensitive to the choice of λ. We report the results when it has the highest detection power: we choose the larger tuning parameter in the continuous case and the default λ in the binary case.
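The selection bias suffered by the no-adjustment benchmark under spurious heterogeneity can be reproduced in a few lines of Python. This toy Monte Carlo is not the paper's exact design (the dimensions are made up and the high-dimensional X_i is omitted), but it shows the mechanism:

```python
import numpy as np

# With beta = 0, every subgroup estimate is pure noise, yet reporting the
# maximum of the p estimated coefficients as the "best subgroup's effect"
# produces a systematically positive, root-n scaled bias.
rng = np.random.default_rng(1)
n, p, reps = 600, 20, 300
naive_max = []
for _ in range(reps):
    Z = rng.standard_normal((n, p))
    Y = rng.standard_normal(n)                 # beta = 0: no subgroup effect
    beta_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]
    naive_max.append(beta_hat.max())           # the "best-looking" subgroup
scaled_bias = np.sqrt(n) * np.mean(naive_max)  # root-n scaled Monte Carlo bias
```

Here scaled_bias is close to the expected maximum of p standard normals (roughly 1.9 for p = 20), which mirrors the sign and the growth in p of the "No adjustment" bias entries under spurious heterogeneity in Table 1.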
We report the coverage probability and the √n-scaled Monte Carlo bias, along with their standard errors, based on 500 Monte Carlo samples in Table 1.

From the results in Table 1, we observe that under spurious heterogeneity, regardless of whether Z_i is binary or continuous, the method with no adjustment is clearly biased and undercovered, especially when p is rather large. Because the simultaneous method provides conservative confidence intervals, its testing power can be compromised in practice (also see the real data analysis in the next section). In contrast, the proposed bootstrap-assisted debiased Lasso and R-Split methods perform well regardless of the underlying data generating process.

When comparing the proposed method based on R-Split with the one based on the debiased Lasso, we observe that the procedure based on the debiased Lasso is more robust when p is large. We conjecture that the reason is twofold. First, a large p increases the chance of obtaining a singular refitting covariance matrix; as a result, R-Split can be numerically unstable if some splits do not produce feasible refitted OLS estimates. Second, the R-Split estimator may have an intractable asymptotic distribution when p increases with n. In contrast, the debiased Lasso estimators are uniformly normally distributed for all components of β, suggesting that the resulting confidence intervals and tests based on the debiased Lasso are not sensitive to the change of p, as demonstrated in Section 4.1.

In precision medicine for treating high blood pressure, lifestyle change is frequently recommended by doctors to prevent mild hypertension and to reduce the dose levels of drugs needed to control hypertension (Whelton et al., 2002). Such lifestyle factors include, but are not limited to, sodium intake, smoking, dietary patterns, alcohol use, and physical inactivity.
In addition, blood pressure is also highly heritable, and to date, more than 600 risk loci have been identified (Evangelou et al., 2018). So far, there has been no unified answer as to (1) which genetic-risk subgroup responds most positively to a particular lifestyle change (e.g., smoking cessation status), and (2) when subgroups are defined by different lifestyle factors, which subgroup has the best blood pressure control. In response to these questions, the identified risk loci need to be adjusted for in the model-building process to avoid potential confounding issues. Our framework, which incorporates lifestyle factors as well as these high-dimensional risk loci as covariates, provides natural solutions to the questions raised above.

To demonstrate the effectiveness of our proposed method in controlling the selection and regularization bias, and to cast some insight into the effects of lifestyle change on lowering systolic blood pressure, we perform cross-sectional observational studies from UK Biobank resources.

Table 1: Simulation results for Section 6
                  Debiased Lasso                                  Repeated Data Splitting
                  Boot-Calibrated  No adjustment  Simultaneous    Boot-Calibrated  No adjustment  Simultaneous

Z_i's are binary random variables, β = (0, . . . , 0, β_max)′ ∈ R^p (heterogeneity)
p = 2    Cover    0.95(0.01)       0.92(0.02)     0.98(0.01)      0.96(0.01)       0.95(0.01)     0.97(0.01)
         √n Bias  -0.09(0.00)      0.22(0.00)     -1.61(0.00)     -0.08(0.00)      -0.09(0.00)    -1.02(0.00)
p = 6    Cover    0.96(0.01)       0.94(0.02)     0.99(0.00)      0.95(0.01)       0.94(0.01)     0.99(0.00)
         √n Bias  -0.07(0.00)      0.19(0.00)     -2.35(0.00)     -0.10(0.00)      -0.15(0.00)    -1.91(0.00)
p = 20   Cover    0.96(0.01)       0.96(0.01)     0.99(0.00)      0.93(0.01)       0.93(0.01)     0.99(0.00)
         √n Bias  -0.04(0.00)      0.13(0.00)     -3.16(0.00)     -0.12(0.00)      -0.20(0.00)    -2.69(0.00)

Z_i's are binary random variables, β = (0, . . . , 0)′ ∈ R^p (spurious heterogeneity)
p = 2    Cover    0.96(0.01)       0.79(0.03)     0.98(0.01)      0.95(0.01)       0.87(0.02)     0.95(0.02)
         √n Bias  -0.02(0.00)      1.18(0.00)     -0.15(0.00)     0.04(0.00)       0.91(0.00)     0.10(0.00)
p = 6    Cover    0.96(0.01)       0.69(0.03)     0.97(0.01)      0.93(0.01)       0.77(0.03)     0.97(0.01)
         √n Bias  -0.03(0.00)      1.91(0.00)     -0.25(0.00)     0.14(0.00)       1.69(0.00)     0.15(0.00)
p = 20   Cover    0.94(0.01)       0.30(0.03)     0.97(0.01)      0.91(0.01)       0.43(0.03)     0.98(0.01)
         √n Bias  -0.06(0.00)      2.72(0.00)     -0.30(0.00)     0.17(0.00)       2.52(0.00)     0.18(0.00)

Z_i's are continuous random variables, β = (0, . . . , 0, β_max)′ ∈ R^p (heterogeneity)
p = 2    Cover    0.96(0.01)       0.90(0.01)     0.98(0.01)      0.96(0.01)       0.94(0.01)     0.98(0.00)
         √n Bias  -0.07(0.00)      0.12(0.00)     -0.43(0.00)     -0.06(0.00)      -0.09(0.00)    -0.28(0.00)
p = 6    Cover    0.96(0.01)       0.91(0.01)     0.99(0.00)      0.96(0.01)       0.93(0.01)     0.99(0.00)
         √n Bias  -0.05(0.00)      0.09(0.00)     -0.67(0.00)     -0.08(0.00)      -0.11(0.00)    -0.72(0.00)
p = 20   Cover    0.96(0.01)       0.94(0.01)     0.99(0.00)      0.94(0.01)       0.92(0.01)     0.99(0.00)
         √n Bias  -0.02(0.00)      0.07(0.00)     -0.96(0.00)     -0.09(0.00)      -0.12(0.00)    -1.03(0.00)

Z_i's are continuous random variables, β = (0, . . . , 0)′ ∈ R^p (spurious heterogeneity)
p = 2    Cover    0.94(0.01)       0.80(0.01)     0.95(0.01)      0.96(0.01)       0.89(0.02)     0.97(0.02)
         √n Bias  0.06(0.00)       1.06(0.00)     0.02(0.00)      0.01(0.00)       0.24(0.00)     -0.16(0.00)
p = 6    Cover    0.95(0.01)       0.84(0.02)     0.96(0.01)      0.95(0.01)       0.77(0.03)     0.97(0.01)
         √n Bias  0.04(0.00)       0.77(0.00)     0.05(0.00)      0.03(0.00)       0.72(0.00)     -0.27(0.00)
p = 20   Cover    0.95(0.01)       0.91(0.02)     0.97(0.01)      0.93(0.01)       0.70(0.03)     0.98(0.01)
         √n Bias  0.02(0.00)       0.16(0.00)     0.08(0.00)      0.08(0.00)       1.10(0.00)     -0.38(0.00)
Note: “Cover” is the empirical coverage of the 95% lower bound for β_max, and “√n Bias” captures the root-n scaled Monte Carlo bias for estimating β_max.

The lifestyle changes we consider include smoking cessation status, three physical activity modifications, alcohol intake, and 12 diet changes. The UK Biobank study is a long-term observational cohort study that recruited about 500,000 participants. The genetic variants included in our analysis are screened at a 0.05 threshold and have passed standard quality control steps. Several additional covariates, including lifestyle variables, BMI, age, gender, the top 40 principal components of the genotype matrix, and genotype array, are adjusted for as well. The response is systolic blood pressure. Our empirical comparisons are provided in Table 2. As the simultaneous method and the method with no adjustment provide similar results, we only present their results based on the debiased Lasso adjustment.

For the two-subgroup comparison, we consider low- and high-genetic-risk subgroups, for whom the treatment (or lifestyle change) is smoking cessation. While the result from the naive method indicates that blood pressure control among high-genetic-risk individuals is particularly reactive to smoking, the proposed methods suggest such a discovery may not be credible, as the lower confidence bounds are both below zero. While both smoking and blood pressure are major risk factors for cardiovascular disease and some spurious relationships have been reported (Groppelli et al., 1992), the current belief is that smoking has no independent effect on blood pressure control (Primatesta et al., 2001) and that smoking cessation does not lower blood pressure (Williams et al., 2018). Our results not only align with the current scientific belief, but also highlight the necessity of bias correction to avoid spurious conclusions drawn from the naive method.

For the multiple-subgroup comparison, we compare the subgroup treatment effects of 17 lifestyle changes with a focus on high-genetic-risk individuals. From Table 2, we observe that the proposed method and the naive method reach the same conclusion: the identified best treatment, moderate physical activity, shows a significant effect in controlling high blood pressure.
Moderate physical activity in our analysis is defined as having 10 or more minutes of moderate physical activities, such as carrying light loads or cycling at a normal pace, on two or more days per week. In fact, this is a well-grounded conclusion (Williams et al., 2018). Current guidelines recommend regular physical activity for preventing hypertension (Williams et al., 2018), and several randomized controlled trials have further confirmed the favorable effects of physical activity on reducing blood pressure (Cornelissen and Smart, 2013). While the effect of physical activity seems small, at the population level, a systolic blood pressure decrease of 1 mm Hg decreases the risk of stroke by 5% (Lawes et al., 2004). In both cases, because the simultaneous method is typically overly conservative and tends to lose power due to widened confidence intervals, it is unable to identify any significant effect.

Table 2: Results for real data analysis

                                Debiased   R-Split   No adjustment   Simultaneous
Two-subgroup comparison
  Lower bound                     -2.29     -1.20         1.27          -3.78
  Point estimate                  -0.13      1.17         3.26          -0.82
Multiple-treatment comparison
  Lower bound                      0.15      0.03         1.96          -0.38
  Point estimate                   1.75      1.51         4.44           1.26
Note: The unit of the above results is mm Hg. “Lower bound” is the lower bound of the 95%-coverage confidence interval.
In the presence of high-dimensional covariates from observational data, when a seemingly promising subgroup has been identified from the data, naive estimation and inference for the identified subgroup that ignores the selection process can lead to biased and overly optimistic conclusions. The salient point of the present paper is that appropriate statistical analysis of the post-hoc identified subgroup treatment effect must take the selection process into account. We propose two bootstrap-assisted debiasing procedures to make valid inferences on the maximal or the selected subgroup treatment effect. The two proposed methods are not only easy to implement but also simultaneously control the regularization bias and the selection bias, and the resulting statistical inference is asymptotically sharp. They have different strengths: the R-Split assisted bootstrap calibration has higher detection power when the number of subgroups is small, while the debiased Lasso assisted bootstrap calibration tends to be more robust when we investigate many subgroups. Since the proposed approaches have complementary strengths, it is worth exploring their merits further and combining them. We leave this to future research.
Software and reproducibility

R code for the proposed procedures can be found in the package “debiased.subgroup”, which is publicly available at https://github.com/WaverlyWei/debiased.subgroup. Simulation examples can be reproduced by running the examples in the R package.

Acknowledgement
We thank the individuals involved in the UK Biobank for their participation and the research teams for their work on collecting, processing, and sharing these datasets. This research has been conducted using the UK Biobank Resource (application number 48240), subject to a data transfer agreement. We are also grateful for helpful discussions with Peng Ding, Avi Feller, Xuming He, Xinwei Ma, Gongjun Xu, and the participants of the UC Berkeley Causal Inference Reading Group.
References
Alosh, M., Huque, M. F., Bretz, F., and D’Agostino Sr, R. B. (2017). Tutorial on statistical considerations on subgroup analysis in confirmatory clinical trials. Statistics in Medicine, 36(8):1334–1360.

Belloni, A., Chernozhukov, V., et al. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.

Bornkamp, B., Ohlssen, D., Magnusson, B. P., and Schmidli, H. (2017). Model averaging for treatment effect estimation in subgroups. Pharmaceutical Statistics, 16(2):133–142.

Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C., McMahon, A., Morales, J., Mountjoy, E., Sollis, E., et al. (2019). The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Research, 47(D1):D1005–D1012.

Cook, D., Brown, D., Alexander, R., March, R., Morgan, P., Satterthwaite, G., and Pangalos, M. N. (2014). Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nature Reviews Drug Discovery, 13(6):419–431.

Cornelissen, V. A. and Smart, N. A. (2013). Exercise training for blood pressure: a systematic review and meta-analysis. Journal of the American Heart Association, 2(1):e004473.

Dezeure, R., Bühlmann, P., and Zhang, C.-H. (2017). High-dimensional simultaneous inference with the bootstrap. TEST, 26(4):685–719.

Evangelou, E., Warren, H. R., Mosen-Ansorena, D., Mifsud, B., Pazoki, R., Gao, H., Ntritsos, G., Dimou, N., Cabrera, C. P., Karaman, I., et al. (2018). Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nature Genetics, 50(10):1412–1425.

Fan, A., Song, R., and Lu, W. (2017). Change-plane analysis for subgroup detection and sample size calculation. Journal of the American Statistical Association, 112(518):769–778.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Friede, T., Parsons, N., and Stallard, N. (2012). A conditional error function approach for subgroup selection in adaptive clinical trials. Statistics in Medicine, 31(30):4309–4320.

Fuentes, C., Casella, G., Wells, M. T., et al. (2018). Confidence intervals for the means of the selected populations. Electronic Journal of Statistics, 12(1):58–79.

Groppelli, A., Giorgi, D., Omboni, S., Parati, G., and Mancia, G. (1992). Persistent blood pressure increase induced by heavy smoking. Journal of Hypertension, 10(5):495–499.

Guo, X. and He, X. (2019). Inference on the best selected subgroup with a case study of MONET1 trial. Preprint.

Guo, X. and He, X. (2020). Inference on selected subgroups in clinical trials. Journal of the American Statistical Association, (just-accepted):1–18.

Hall, P. and Miller, H. (2010). Bootstrap confidence intervals and hypothesis tests for extrema of parameters. Biometrika, 97(4):881–892.

Hébert, P. C., Cook, D. J., Wells, G., and Marshall, J. (2002). The design of randomized clinical trials in critically ill patients. Chest, 121(4):1290–1300.

Imai, K., Ratkovic, M., et al. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470.

Izem, R., Liao, J., Hu, M., Wei, Y., Akhtar, S., Wernecke, M., MaCurdy, T. E., Kelman, J., and Graham, D. J. (2020). Comparison of propensity score methods for pre-specified subgroup analysis with survival data. Journal of Biopharmaceutical Statistics, pages 1–18.

Jenkins, M., Stone, A., and Jennison, C. (2011). An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharmaceutical Statistics, 10(4):347–356.

Kubota, K., Ichinose, Y., Scagliotti, G., Spigel, D., Kim, J., Shinkai, T., Takeda, K., Kim, S.-W., Hsia, T.-C., Li, R., et al. (2014). Phase III study (MONET1) of motesanib plus carboplatin/paclitaxel in patients with advanced nonsquamous non-small-cell lung cancer (NSCLC): Asian subgroup analysis. Annals of Oncology, 25(2):529–536.

Lawes, C. M., Bennett, D. A., Feigin, V. L., and Rodgers, A. (2004). Blood pressure and stroke: an overview of published reviews. Stroke, 35(3):776–785.

Lee, J. D., Sun, D. L., Sun, Y., Taylor, J. E., et al. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927.

Lipkovich, I., Dmitrienko, A., and B D’Agostino Sr, R. (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine, 36(1):136–196.

Nadarajah, S. and Kotz, S. (2008). Exact distribution of the max/min of two Gaussian random variables. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(2):210–212.

Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480). Annals of Agricultural Sciences, 10:1–51.

Petticrew, M., Tugwell, P., Kristjansson, E., Oliver, S., Ueffing, E., and Welch, V. (2012). Damned if you do, damned if you don’t: subgroup analysis and equity. J Epidemiol Community Health, 66(1):95–98.

Primatesta, P., Falaschetti, E., Gupta, S., Marmot, M. G., and Poulter, N. R. (2001). Association between smoking and blood pressure: evidence from the Health Survey for England. Hypertension, 37(2):187–193.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Shen, J. and He, X. (2015). Inference for subgroup analysis with a structured logistic-normal mixture model. Journal of the American Statistical Association, 110(509):303–312.

Stallard, N., Todd, S., and Whitehead, J. (2008). Estimation following selection of the largest of two normal means. Journal of Statistical Planning and Inference, 138(6):1629–1638.

Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., and Li, B. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(Feb):141–158.

Tian, X., Taylor, J., et al. (2018). Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.

Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted).

Wang, J., He, X., and Xu, G. (2019). Debiased inference on treatment effect in a high dimensional model. Journal of the American Statistical Association, (just-accepted):1–000.

Wang, R. and Ware, J. H. (2013). Detecting moderator effects using subgroup analyses. Prevention Science, 14(2):111–120.

Whelton, P. K., He, J., Appel, L. J., Cutler, J. A., Havas, S., Kotchen, T. A., Roccella, E. J., Stout, R., Vallbona, C., Winston, M. C., et al. (2002). Primary prevention of hypertension: clinical and public health advisory from the National High Blood Pressure Education Program. JAMA, 288(15):1882–1888.

Williams, B., Mancia, G., Spiering, W., Agabiti Rosei, E., Azizi, M., Burnier, M., Clement, D. L., Coca, A., De Simone, G., Dominiczak, A., et al. (2018). 2018 ESC/ESH guidelines for the management of arterial hypertension: The task force for the management of arterial hypertension of the European Society of Cardiology (ESC) and the European Society of Hypertension (ESH). European Heart Journal, 39(33):3021–3104.

Yang, S., Lorenzi, E., Papadogeorgou, G., Wojdyla, D. M., Li, F., and Thomas, L. E. (2020). Propensity score weighting for causal subgroup analysis. arXiv preprint arXiv:2010.02121.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.

Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768.

Zöllner, S. and Pritchard, J. K. (2007). Overcoming the winner’s curse: estimating penetrance parameters from case-control data.