Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data
Zhiqiang Tan January 31, 2018
Abstract.
Consider the problem of estimating average treatment effects when a large number of covariates are used to adjust for possible confounding through outcome regression and propensity score models. The conventional approach of model building and fitting iteratively can be difficult to implement, depending on ad hoc choices of what variables are included. In addition, uncertainty from the iterative process of model selection is complicated and often ignored in subsequent inference about treatment effects. We develop new methods and theory to obtain not only doubly robust point estimators for average treatment effects, which remain consistent if either the propensity score model or the outcome regression model is correctly specified, but also model-assisted confidence intervals, which are valid when the propensity score model is correctly specified but the outcome regression model may be misspecified. With a linear outcome model, the confidence intervals are doubly robust, that is, also valid when the outcome model is correctly specified but the propensity score model may be misspecified. Our methods involve regularized calibrated estimators with Lasso penalties, but carefully chosen loss functions, for fitting propensity score and outcome regression models. We provide high-dimensional analysis to establish the desired properties of our methods under comparable conditions to previous results, which give valid confidence intervals when both the propensity score and outcome regression models are correctly specified. We present a simulation study and an empirical application which confirm the advantages of the proposed methods compared with related methods based on regularized maximum likelihood estimation.
Key words and phrases.
Calibrated estimation; Causal inference; Doubly robust estimation; Inverse probability weighting; Lasso penalty; Model misspecification; Propensity score; Regularized M-estimation.

Department of Statistics & Biostatistics, Rutgers University. Address: 110 Frelinghuysen Road, Piscataway, NJ 08854. E-mail: [email protected]. The research was supported in part by PCORI grant ME-1511-32740. The author thanks Cun-Hui Zhang and Zijian Guo for helpful discussions.
Introduction
Drawing inferences about effects of treatments or interventions is constantly desired from observational studies in social and medical sciences, when randomized experiments are either infeasible or difficult due to practical constraints. This subject, broadly known as causal inference in statistics, is often based on the framework of potential outcomes (Neyman 1923; Rubin 1974). For observational studies, causal inference inevitably involves statistical modeling and estimation of population properties and associations from empirical data (e.g., Tsiatis 2006). In particular, as the main problem to be tackled in the paper, estimation of average treatment effects typically requires building and fitting outcome regression or propensity score models (e.g., Tan 2007). The fitted outcome regression functions or propensity scores can then be used in various estimators for the average treatment effects, notably inverse probability weighted (IPW) estimators or augmented IPW estimators (Robins et al. 1994).

For building and fitting outcome regression or propensity score models, it is possible to follow the usual process of model specification, fitting, and checking in a cyclic manner (e.g., McCullagh & Nelder 1989). In fact, a conventional approach for propensity score estimation as demonstrated in Rosenbaum & Rubin (1984) involves fitting a propensity score model (often logistic regression) by maximum likelihood, checking covariate balance, and then modifying and refitting the propensity score model until reasonable balance is achieved. However, this approach can be work intensive and difficult to implement, depending on ad hoc choices of what variables are included and whether nonlinear terms or interactions are used, among others.
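The balance-checking step in this conventional cycle can be sketched as follows; this is a minimal illustration of ours (the helper name and usage are not from the paper), computing a standardized mean difference of a covariate between treatment groups, optionally with inverse probability weights:

```python
import numpy as np

def std_diff(x, t, w=None):
    """Standardized mean difference of covariate x between treated (t=1)
    and control (t=0) groups; w gives optional observation weights."""
    w = np.ones_like(x, dtype=float) if w is None else np.asarray(w, dtype=float)
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    # pooled standard deviation of the two groups as the scale
    s = np.sqrt((np.var(x[t == 1]) + np.var(x[t == 0])) / 2)
    return (m1 - m0) / s
```

In practice one would refit the propensity score model until all covariates show small standardized differences; a cutoff such as 0.1 is a common rule of thumb, not a prescription from the paper.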
The situation can be especially challenging when there are a large number of potentially confounding variables (or covariates) that need to be adjusted for in outcome regression or propensity score models. In addition, another statistical issue is that uncertainty from the iterative process of model selection is complicated and often ignored in subsequent inference (that is, confidence intervals or hypothesis testing) about treatment effects.

In this article, we develop new methods and theory for fitting logistic propensity score models and generalized linear outcome models and then using the fitted values in augmented IPW estimators to estimate average treatment effects, in high-dimensional settings where the number of covariates p is close to or even greater than the sample size n. There are two main elements in our approach. First, we employ regularized estimation with a Lasso penalty (Tibshirani 1996) when fitting the outcome regression and propensity score models to deal with the large number of covariates, under a sparsity assumption that only a small but unknown subset (relative to the sample size) of covariates are associated with nonzero coefficients in the propensity score and outcome regression models.
Second, we carefully choose the loss functions for regularized estimation, different from least squares or maximum likelihood, such that the resulting augmented IPW estimator and Wald-type confidence intervals possess the following property (G1) and at least one of (G2)–(G3) under suitable conditions:

(G1) The point estimator is doubly robust, that is, remains consistent if either the propensity score model or the outcome regression model is correctly specified.

(G2) The confidence intervals are valid if the propensity score model is correctly specified but the outcome regression model may be misspecified.

(G3) The confidence intervals are valid if the outcome regression model is correctly specified but the propensity score model may be misspecified.

If either property (G2) or (G3) is satisfied, then the confidence intervals are said to be model-assisted, borrowing the terminology from the survey literature (Sarndal et al. 1992). If both properties (G2)–(G3) are satisfied, then the confidence intervals are doubly robust.

Combining the two foregoing elements leads to a regularized calibrated estimator, denoted by $\hat\gamma_{\rm RCAL}$, for the coefficients in the propensity score model and a regularized weighted likelihood estimator, denoted by $\hat\alpha_{\rm RWL}$, for the coefficients in the outcome model within the treated subjects. See the loss functions in (11) and (13) or (37). The regularized calibrated estimator $\hat\gamma_{\rm RCAL}$ has recently been proposed in Tan (2017) as an alternative to the regularized maximum likelihood estimator for fitting logistic propensity score models, regardless of outcome regression models. As shown in Tan (2017), minimization of the underlying expected calibration loss implies reduction of not only the expected likelihood loss for logistic regression but also a measure of relative errors of limiting propensity scores that controls the mean squared errors of IPW estimators, when the propensity score model may be misspecified.
In a complementary manner, our work here shows that $\hat\gamma_{\rm RCAL}$ can be used in conjunction with $\hat\alpha_{\rm RWL}$ to yield an augmented IPW estimator with valid confidence intervals if the propensity score model is correctly specified but the outcome regression model may be misspecified.

We provide high-dimensional analysis of the regularized weighted likelihood estimator $\hat\alpha_{\rm RWL}$ and the resulting augmented IPW estimator with possible model misspecification, while building on related results about $\hat\gamma_{\rm RCAL}$ in Tan (2017). In fact, a new strategy of inverting a quadratic inequality is developed to tackle the technical issue that the weighted likelihood loss for $\hat\alpha_{\rm RWL}$ is defined depending on the estimator $\hat\gamma_{\rm RCAL}$. As a result, we obtain the convergence of $\hat\alpha_{\rm RWL}$ to a target value in the $L_1$ norm at the rate $(|S_\gamma| + |S_\alpha|)\{\log(p)/n\}^{1/2}$ and in the symmetrized weighted Bregman divergence at the rate $(|S_\gamma| + |S_\alpha|)\log(p)/n$ under comparable conditions to those for high-dimensional analysis of standard Lasso estimators (e.g., Buhlmann & van de Geer 2011), where $|S_\gamma|$ denotes the number of nonzero coefficients in the propensity score model and $|S_\alpha|$ denotes that in the outcome model. Furthermore, we establish an asymptotic expansion of the augmented IPW estimator based on $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$, and show that property (G1) is achieved provided $(|S_\gamma| + |S_\alpha|)(\log p)^{1/2} = o(n^{1/2})$ and property (G2) is achieved provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n^{1/2})$ with a nonlinear outcome model. With a linear outcome model, we obtain stronger results: property (G1) is achieved provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n)$ and both (G2) and (G3) are achieved provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n^{1/2})$. These sparsity conditions are as weak as in previous works (Belloni et al. 2014; van de Geer et al. 2014).

Related works.
We compare and connect our work with related works in several areas. Non-penalized calibrated estimation for propensity score models has been studied, sometimes independently (re)derived, in causal inference, missing-data problems, and survey sampling (e.g., Folsom 1991; Tan 2010; Graham et al. 2012; Hainmueller 2012; Imai & Ratkovic 2014; Kim & Haziza 2014; Vermeulen & Vansteelandt 2015; Chan et al. 2016). The non-penalized version of the estimator $\hat\alpha_{\rm RWL}$ for outcome regression models has also been proposed in Kim & Haziza (2014) and Vermeulen & Vansteelandt (2015), where one of the motivations is to circumvent the need to account for variation of such estimators of nuisance parameters and hence simplify the computation of confidence intervals based on augmented IPW estimators. Our work generalizes these ideas to achieve statistical advantages in high-dimensional settings, where model-assisted or doubly robust confidence intervals would not be obtained without using regularized calibrated estimation. See Section 3.4 for further discussion.

For high-dimensional causal inference, Belloni et al. (2014) and Farrell (2015) employed augmented IPW estimators based on regularized maximum likelihood estimators in outcome regression and propensity score models, and obtained Wald-type confidence intervals that are valid when both the outcome regression and propensity score models are correctly specified, provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n^{1/2})$. Our main contribution is therefore to provide model-assisted or doubly robust confidence intervals using differently configured augmented IPW estimators for treatment effects. As a secondary difference, Belloni et al. (2014) and Farrell (2015) used post-Lasso estimators, that is, refitting outcome regression and propensity score models only including the variables selected from Lasso estimation. In contrast, our estimators $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$ are directly Lasso penalized M-estimators.

Another related work is Athey et al.
(2016), where valid confidence intervals are obtained for sample treatment effects such as $n_1^{-1}\sum_{i: T_i=1}\{m^*_1(X_i) - m^*_0(X_i)\}$, with $n_1$ the number of treated subjects, if a linear outcome model is correctly specified. No propensity score model is explicitly used.

Our work is also connected to the literature of confidence intervals and hypothesis testing for single or low-dimensional coefficients in high-dimensional regression models (Zhang & Zhang 2014; van de Geer et al. 2014; Javanmard & Montanari 2014). Model-assisted inference does not seem to be addressed in these works, but can potentially be developed.

Suppose that the observed data consist of independent and identically distributed observations $\{(Y_i, T_i, X_i): i = 1, \dots, n\}$ of $(Y, T, X)$, where $Y$ is an outcome variable, $T$ is a treatment variable taking values 0 or 1, and $X$ is a vector of measured covariates. In the potential outcomes framework for causal inference (Neyman 1923; Rubin 1974), let $(Y^0, Y^1)$ be potential outcomes that would be observed under treatment 0 or 1 respectively. By consistency, assume that $Y$ is either $Y^0$ if $T = 0$ or $Y^1$ if $T = 1$, that is, $Y = (1-T) Y^0 + T Y^1$. There are two causal parameters commonly of interest: the average treatment effect (ATE), defined as $E(Y^1 - Y^0) = \mu^1 - \mu^0$ with $\mu^t = E(Y^t)$, and the average treatment effect on the treated (ATT), defined as $E(Y^1 - Y^0 \mid T=1) = \nu^1 - \nu^0$ with $\nu^t = E(Y^t \mid T=1)$ for $t = 0,$
1. For concreteness, we mainly discuss estimation of $\mu^1$ until Section 3.4, where estimation of ATE and ATT is discussed.

Estimation of ATE is fundamentally a missing-data problem: only one potential outcome, $Y^0_i$ or $Y^1_i$, is observed and the other one is missing for each subject $i$. For identification of $(\mu^0, \mu^1)$ and ATE, we make the following two assumptions throughout:

(i) Unconfoundedness: $T \perp Y^0 \mid X$ and $T \perp Y^1 \mid X$, that is, $T$ and $Y^0$ and, respectively, $T$ and $Y^1$ are conditionally independent given $X$ (Rubin 1976);

(ii) Overlap: $0 < \pi^*(x) < 1$, where $\pi^*(x) = P(T=1 \mid X=x)$ is called the propensity score (PS) (Rosenbaum & Rubin 1983).

Under these assumptions, $(\mu^0, \mu^1)$ and ATE are often estimated by imposing additional modeling (or dimension-reduction) assumptions on the outcome regression function $m^*_t(X) = E(Y \mid T=t, X)$ or the propensity score $\pi^*(X) = P(T=1 \mid X)$.

Consider a conditional mean model for outcome regression (OR),

E(Y \mid T=1, X) = m_1(X; \alpha) = \psi\{\alpha^{\rm T} g(X)\},   (1)

where $\psi(\cdot)$ is an inverse link function, assumed to be increasing, $g(x) = \{1, g_1(x), \dots, g_d(x)\}^{\rm T}$ is a vector of known functions, and $\alpha = (\alpha_0, \alpha_1, \dots, \alpha_d)^{\rm T}$ is a vector of unknown parameters. For example, model (1) can be deduced from a generalized linear model with a canonical link (McCullagh & Nelder 1989). Then the average negative log-(quasi-)likelihood function can be written (after dropping any dispersion parameter) as

\ell_{\rm ML}(\alpha) = \tilde E\big( T \big[ -Y \alpha^{\rm T} g(X) + \Psi\{\alpha^{\rm T} g(X)\} \big] \big),   (2)

where $\Psi(u) = \int_0^u \psi(u')\,{\rm d}u'$, which is convex in $u$. Throughout, $\tilde E(\cdot)$ denotes the sample average. With high-dimensional data, a regularized maximum likelihood estimator, $\hat\alpha_{\rm RML}$, can be defined by minimizing the loss $\ell_{\rm ML}(\alpha)$ with the Lasso penalty (Tibshirani 1996),

\ell_{\rm RML}(\alpha) = \ell_{\rm ML}(\alpha) + \lambda \|\alpha_{1:d}\|_1,   (3)

where $\|\alpha_{1:d}\|_1 = \sum_{j=1}^d |\alpha_j|$ is the $L_1$ norm of $\alpha_{1:d} = (\alpha_1, \dots$
, $\alpha_d)^{\rm T}$ excluding $\alpha_0$, and $\lambda \ge 0$ is a tuning parameter. The resulting estimator of $\mu^1$ is then

\hat\mu^1_{\rm OR} = \tilde E\{\hat m_{\rm RML}(X)\} = \frac{1}{n} \sum_{i=1}^n \hat m_{\rm RML}(X_i),

where $\hat m_{\rm RML}(X) = m_1(X; \hat\alpha_{\rm RML})$, the fitted outcome regression function. Various theoretical results have been obtained on Lasso penalized estimation in sparse, high-dimensional regression (e.g., Buhlmann & van de Geer 2011; Huang & Zhang 2012; Negahban et al. 2012). Such results can be easily adapted to $\hat\alpha_{\rm RML}$, with the data restricted to $\{(Y_i, X_i): T_i = 1, i = 1, \dots, n\}$. If model (1) is correctly specified, then it can be shown under suitable conditions that $\|\hat\alpha_{\rm RML} - \alpha^*\|_1 = O_p(\|\alpha^*\|_0 \{\log(d)/n\}^{1/2})$ and $\hat\mu^1_{\rm OR} = \mu^1 + O_p(\{\|\alpha^*\|_0 \log(d)/n\}^{1/2})$, where $\alpha^*$ is the true value for model (1) such that $m^*_1(X) = m_1(X; \alpha^*)$.

Alternatively, consider a propensity score (PS) model

P(T=1 \mid X) = \pi(X; \gamma) = \Pi\{\gamma^{\rm T} f(X)\},   (4)

where $\Pi(\cdot)$ is an inverse link function, $f(x) = \{1, f_1(x), \dots, f_p(x)\}^{\rm T}$ is a vector of known functions, and $\gamma = (\gamma_0, \gamma_1, \dots, \gamma_p)^{\rm T}$ is a vector of unknown parameters. For concreteness, assume that model (4) is logistic regression with $\pi(X; \gamma) = [1 + \exp\{-\gamma^{\rm T} f(X)\}]^{-1}$, and hence the average negative log-likelihood function is

\ell_{\rm ML}(\gamma) = \tilde E\big[ \log\{1 + {\rm e}^{\gamma^{\rm T} f(X)}\} - T \gamma^{\rm T} f(X) \big].   (5)

To handle high-dimensional data, a Lasso penalized maximum likelihood estimator, $\hat\gamma_{\rm RML}$, is defined by minimizing the objective function

\ell_{\rm RML}(\gamma) = \ell_{\rm ML}(\gamma) + \lambda \|\gamma_{1:p}\|_1,   (6)

where $\|\gamma_{1:p}\|_1 = \sum_{j=1}^p |\gamma_j|$ is the $L_1$ norm of $\gamma_{1:p} = (\gamma_1, \dots, \gamma_p)^{\rm T}$ excluding $\gamma_0$, and $\lambda \ge 0$ is a tuning parameter. The fitted propensity score, $\hat\pi_{\rm RML}(X) = \pi(X; \hat\gamma_{\rm RML})$, can be used in various manners to estimate $(\mu^0, \mu^1)$ and ATE, including matching, stratification, and weighting. In particular, a (ratio) inverse probability weighted (IPW) estimator for $\mu^1$ is

\hat\mu^1_{\rm rIPW}(\hat\pi_{\rm RML}) = \tilde E\left\{ \frac{T Y}{\hat\pi_{\rm RML}(X)} \right\} \Big/ \tilde E\left\{ \frac{T}{\hat\pi_{\rm RML}(X)} \right\}.

From previous works (Buhlmann & van de Geer 2011; Huang & Zhang 2012; Negahban et al.
2012), if model (4) is correctly specified, then it can be shown under suitable conditions that $\|\hat\gamma_{\rm RML} - \gamma^*\|_1 = O_p(\|\gamma^*\|_0 \{\log(p)/n\}^{1/2})$ and $\hat\mu^1_{\rm rIPW}(\hat\pi_{\rm RML}) = \mu^1 + O_p(\{\|\gamma^*\|_0 \log(p)/n\}^{1/2})$, where $\gamma^*$ is the true value for model (4) such that $\pi^*(X) = \pi(X; \gamma^*)$.

To attain consistency for $\mu^1$, the estimator $\hat\mu^1_{\rm OR}$ or $\hat\mu^1_{\rm rIPW}(\hat\pi_{\rm RML})$ relies on correct specification of OR model (1) or PS model (4) respectively. In contrast, there are doubly robust estimators depending on both OR and PS models in the augmented IPW form (Robins et al. 1994)

\hat\mu^1(\hat m_1, \hat\pi) = \tilde E\{\varphi(Y, T, X; \hat m_1, \hat\pi)\},

where $\hat m_1(X)$ and $\hat\pi(X)$ are fitted values of $m^*_1(X)$ and $\pi^*(X)$ respectively and

\varphi(Y, T, X; \hat m_1, \hat\pi) = \frac{T Y}{\hat\pi(X)} - \left\{ \frac{T}{\hat\pi(X)} - 1 \right\} \hat m_1(X).   (7)

See Kang & Schafer (2007) and Tan (2010) for reviews in low-dimensional settings. Recently, interesting results in high-dimensional settings have been obtained by Belloni et al. (2014) and Farrell (2015) on the estimator $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$, using the fitted values $\hat m_{\rm RML}(X)$ and $\hat\pi_{\rm RML}(X)$ from Lasso penalized estimation or similar methods. These results are mainly of two types. The first type shows double robustness: $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ remains consistent if either OR model (1) or PS model (4) is correctly specified. The second type establishes valid confidence intervals: $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ admits the usual influence function,

\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML}) = \tilde E\{\varphi(Y, T, X; m^*_1, \pi^*)\} + o_p(n^{-1/2}),

if both OR model (1) and PS model (4) are correctly specified. In general, the latter result requires a stronger sparsity condition than in consistency results only. For example, it is assumed that $\{\|\alpha^*\|_0 + \|\gamma^*\|_0\} \log(p) = o(n^{1/2})$ in Belloni et al. (2014).
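As a concrete illustration of the augmented IPW form (7) and the associated Wald-type interval, the following minimal numpy sketch (our own illustration; the function names and inputs are not from the paper) computes the point estimate and a confidence interval from fitted values $\hat m_1(X_i)$ and $\hat\pi(X_i)$:

```python
import numpy as np
from scipy.stats import norm

def aipw_phi(y, t, m_hat, pi_hat):
    """Influence-function term (7): TY/pi(X) - {T/pi(X) - 1} m1(X)."""
    return t * y / pi_hat - (t / pi_hat - 1.0) * m_hat

def aipw_interval(y, t, m_hat, pi_hat, level=0.95):
    """Point estimate of mu^1 and a Wald-type interval based on the
    sample variance of the estimated influence function."""
    vals = aipw_phi(y, t, m_hat, pi_hat)
    n = len(vals)
    mu = vals.mean()
    se = np.sqrt(np.mean((vals - mu) ** 2) / n)
    z = norm.ppf(1 - (1 - level) / 2)
    return mu, (mu - z * se, mu + z * se)
```

Here `m_hat` and `pi_hat` would come from any pair of fitted outcome regression and propensity score models; double robustness concerns the point estimate, while validity of the interval depends on which models are correctly specified.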
An important limitation of the existing works discussed in Section 2 is that valid confidence intervals based on $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ are obtained only under the assumption that both OR model (1) and PS model (4) are correctly specified, even though the point estimator $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ is doubly robust, that is, remains consistent if either OR model (1) or PS model (4) is correctly specified. To fill this gap, we develop new point estimators and confidence intervals for $\mu^1$, depending on a propensity score model and an outcome regression model, such that property (G1) and at least one of (G2)–(G3) are attained as described in Section 1. Obtaining model-assisted or doubly robust confidence intervals presents a considerable improvement over existing theory and methods in Belloni et al. (2014) and Farrell (2015).

To illustrate the main ideas, consider a logistic propensity score model (4) and a linear outcome regression model,

E(Y \mid T=1, X) = m_1(X; \alpha) = \alpha^{\rm T} f(X),   (8)

that is, model (1) with the identity link and the vector of covariate functions $g(X)$ taken to be the same as $f(X)$ in model (4). This condition can be satisfied, possibly after enlarging model (1) or (4), to reach the same dimension. Our point estimator of $\mu^1$ is

\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) = \tilde E\{\varphi(Y, T, X; \hat m_{\rm RWL}, \hat\pi_{\rm RCAL})\},   (9)

where $\varphi(\cdot)$ is defined in (7), $\hat\pi_{\rm RCAL}(X) = \pi(X; \hat\gamma_{\rm RCAL})$, $\hat m_{\rm RWL}(X) = m_1(X; \hat\alpha_{\rm RWL})$, and $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$ are defined as follows.
The estimator $\hat\gamma_{\rm RCAL}$ is a regularized calibrated estimator of $\gamma$ from Tan (2017), defined as a minimizer of the Lasso penalized objective function

\ell_{\rm RCAL}(\gamma) = \ell_{\rm CAL}(\gamma) + \lambda \|\gamma_{1:p}\|_1,   (10)

where $\ell_{\rm CAL}(\gamma)$ is the calibration loss,

\ell_{\rm CAL}(\gamma) = \tilde E\big\{ T {\rm e}^{-\gamma^{\rm T} f(X)} + (1-T) \gamma^{\rm T} f(X) \big\},   (11)

$\|\gamma_{1:p}\|_1$ is the $L_1$ norm of $\gamma_{1:p}$, and $\lambda \ge 0$ is a tuning parameter. The estimator $\hat\alpha_{\rm RWL}$ is a regularized weighted least-squares estimator of $\alpha$, defined as a minimizer of

\ell_{\rm RWL}(\alpha; \hat\gamma_{\rm RCAL}) = \ell_{\rm WL}(\alpha; \hat\gamma_{\rm RCAL}) + \lambda \|\alpha_{1:p}\|_1,   (12)

where $\ell_{\rm WL}(\alpha; \hat\gamma_{\rm RCAL})$ is the weighted least-squares loss,

\ell_{\rm WL}(\alpha; \hat\gamma_{\rm RCAL}) = \tilde E\left[ T \, \frac{1 - \hat\pi_{\rm RCAL}(X)}{\hat\pi_{\rm RCAL}(X)} \{Y - \alpha^{\rm T} f(X)\}^2 \right] \Big/ 2,   (13)

$\|\alpha_{1:p}\|_1$ is the $L_1$ norm of $\alpha_{1:p}$, and $\lambda \ge 0$ is a tuning parameter. Minimization of (13) involves the weight $\{1 - \hat\pi_{\rm RCAL}(X_i)\}/\hat\pi_{\rm RCAL}(X_i)$, which differs slightly from the commonly used inverse propensity score weight $1/\hat\pi_{\rm RCAL}(X_i)$.

There are simple and interesting interpretations of the preceding estimators. By the Karush–Kuhn–Tucker condition for minimizing (10), the fitted propensity score $\hat\pi_{\rm RCAL}(X)$ satisfies

\frac{1}{n} \sum_{i=1}^n \frac{T_i}{\hat\pi_{\rm RCAL}(X_i)} = 1,   (14)

\frac{1}{n} \left| \sum_{i=1}^n \frac{T_i f_j(X_i)}{\hat\pi_{\rm RCAL}(X_i)} - \sum_{i=1}^n f_j(X_i) \right| \le \lambda, \quad j = 1, \dots, p,   (15)

where equality holds in (15) for any $j$ such that the $j$th estimate $(\hat\gamma_{\rm RCAL})_j$ is nonzero. Eq. (14) shows that the inverse probability weights, $1/\hat\pi_{\rm RCAL}(X_i)$ with $T_i = 1$, sum to the sample size $n$, whereas Eq. (15) implies that the weighted average of each covariate $f_j(X_i)$ over the treated group may differ from the overall average of $f_j(X_i)$ by no more than $\lambda$. In fact, the calibration loss $\ell_{\rm CAL}(\gamma)$ in (11) is derived such that its gradient gives the left hand side of (15) without taking absolute values, as shown in Eq. (23).
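To make the interpretation of (14)–(15) concrete in the unpenalized case $\lambda = 0$, the following sketch (our own simulated example, not code from the paper) minimizes the calibration loss (11) with an intercept and checks that the calibration equations hold as exact equalities up to optimizer tolerance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
f = np.column_stack([np.ones(n), x])                   # f(X) = (1, X)
t = rng.binomial(1, 1/(1 + np.exp(-(0.3 + 0.5*x))))    # logistic treatment

def cal_loss(gamma):
    # calibration loss (11): sample mean of T e^{-gamma'f} + (1-T) gamma'f
    h = f @ gamma
    return np.mean(t*np.exp(-h) + (1 - t)*h)

gamma_hat = minimize(cal_loss, np.zeros(2), method="BFGS").x
pi_hat = 1/(1 + np.exp(-(f @ gamma_hat)))

# with lambda = 0, the KKT conditions (14)-(15) become exact equations:
assert abs(np.mean(t/pi_hat) - 1) < 1e-4               # weights sum to n
assert abs(np.mean(t*x/pi_hat) - np.mean(x)) < 1e-4    # covariate balance
```

With a positive Lasso penalty, the equalities relax to the box constraints of width $\lambda$ shown in (15).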
The Lasso penalty is used to induce the box constraints on the gradient of $\ell_{\rm CAL}(\gamma)$, instead of setting the gradient to 0.

By the Karush–Kuhn–Tucker condition for minimizing (12), the fitted outcome regression function $\hat m_{\rm RWL}(X)$ satisfies

\frac{1}{n} \sum_{i=1}^n T_i \, \frac{1 - \hat\pi_{\rm RCAL}(X_i)}{\hat\pi_{\rm RCAL}(X_i)} \{Y_i - \hat m_{\rm RWL}(X_i)\} = 0,   (16)

\frac{1}{n} \left| \sum_{i=1}^n T_i \, \frac{1 - \hat\pi_{\rm RCAL}(X_i)}{\hat\pi_{\rm RCAL}(X_i)} \{Y_i - \hat m_{\rm RWL}(X_i)\} f_j(X_i) \right| \le \lambda, \quad j = 1, \dots, p,   (17)

where equality holds in (17) for any $j$ such that the $j$th estimate $(\hat\alpha_{\rm RWL})_j$ is nonzero. Eq. (16) implies, by simple calculation, that the estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ can be recast as

\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) = \tilde E\left[ \hat m_{\rm RWL}(X) + \frac{T}{\hat\pi_{\rm RCAL}(X)} \{Y - \hat m_{\rm RWL}(X)\} \right] = \tilde E\{T Y + (1-T) \hat m_{\rm RWL}(X)\},   (18)

which takes the form of linear prediction estimators known in the survey literature (e.g., Sarndal et al. 1992): $\tilde E\{T Y + (1-T) \hat m_1(X)\}$ for some fitted outcome regression function $\hat m_1(X)$. As a consequence, $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ always falls within the range of the observed outcomes $\{Y_i: T_i = 1, i = 1, \dots, n\}$ and the predicted values $\{\hat m_{\rm RWL}(X_i): T_i = 0, i = 1, \dots, n\}$. This boundedness property is not satisfied by the estimator $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$.

We provide a high-dimensional analysis of the estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ in Section 3.2, allowing possible model misspecification. Our main result shows that under suitable conditions, the estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ admits the asymptotic expansion

\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) = \tilde E\{\varphi(Y, T, X; \bar m_{\rm WL}, \bar\pi_{\rm CAL})\} + o_p(n^{-1/2}),   (19)

where $\bar\pi_{\rm CAL}(X) = \pi(X; \bar\gamma_{\rm CAL})$, $\bar m_{\rm WL}(X) = m_1(X; \bar\alpha_{\rm WL})$, and $\bar\gamma_{\rm CAL}$ and $\bar\alpha_{\rm WL}$ are defined as follows.
In the presence of possible model misspecification, the target value $\bar\gamma_{\rm CAL}$ is defined as a minimizer of the expected calibration loss

E\{\ell_{\rm CAL}(\gamma)\} = E\big\{ T {\rm e}^{-\gamma^{\rm T} f(X)} + (1-T) \gamma^{\rm T} f(X) \big\}.

If model (4) is correctly specified, then $\bar\pi_{\rm CAL}(X) = \pi^*(X)$. Otherwise, $\bar\pi_{\rm CAL}(X)$ may differ from $\pi^*(X)$. The target value $\bar\alpha_{\rm WL}$ is defined as a minimizer of the expected loss

E\{\ell_{\rm WL}(\alpha; \bar\gamma_{\rm CAL})\} = E\left[ T \, \frac{1 - \bar\pi_{\rm CAL}(X)}{\bar\pi_{\rm CAL}(X)} \{Y - \alpha^{\rm T} f(X)\}^2 \right] \Big/ 2.

If model (8) is correctly specified, then $\bar m_{\rm WL}(X) = m^*_1(X)$. But $\bar m_{\rm WL}(X)$ may in general differ from $m^*_1(X)$. For concreteness, the following result can be deduced from Theorems 3 and 4. Suppose that the Lasso tuning parameter is specified as $\lambda = A_1^\dagger \{\log(p)/n\}^{1/2}$ for $\hat\gamma_{\rm RCAL}$ and $\lambda = A_2^\dagger \{\log(p)/n\}^{1/2}$ for $\hat\alpha_{\rm RWL}$, with some constants $A_1^\dagger$ and $A_2^\dagger$. Denote $S_\gamma = \{0\} \cup \{j: \bar\gamma_{{\rm CAL},j} \neq 0, j = 1, \dots, p\}$ and $S_\alpha = \{0\} \cup \{j: \bar\alpha_{{\rm WL},j} \neq 0, j = 1, \dots, p\}$.

Proposition 1.
Suppose that Assumptions 1 and 2 hold as in Section 3.2, and $(|S_\gamma| + |S_\alpha|) \log(p) = o(n^{1/2})$. For sufficiently large constants $A_1^\dagger$ and $A_2^\dagger$, if either logistic PS model (4) or linear OR model (8) is correctly specified, then the following results hold:

(i) $n^{1/2} \{\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) - \mu^1\} \to_{\mathcal D} N(0, V)$, where $V = {\rm var}\{\varphi(Y, T, X; \bar m_{\rm WL}, \bar\pi_{\rm CAL})\}$;

(ii) a consistent estimator of $V$ is

\hat V = \tilde E\big[ \big\{ \varphi(Y, T, X; \hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) - \hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) \big\}^2 \big];

(iii) an asymptotic $(1-c)$ confidence interval for $\mu^1$ is $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) \pm z_{c/2} \sqrt{\hat V / n}$, where $z_{c/2}$ is the $(1 - c/2)$ quantile of $N(0,1)$.

That is, a doubly robust confidence interval for $\mu^1$ is obtained.

We highlight some basic ideas underlying the construction of the estimators $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$ as well as the proof of the asymptotic expansion (19) for $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$. For an estimator $\hat\gamma$ in model (4), suppose that $\hat\gamma$ converges in probability to a limit $\bar\gamma$. Denote $\hat\pi(X) = \pi(X; \hat\gamma)$ and $\bar\pi(X) = \pi(X; \bar\gamma)$. Similarly, for an estimator $\hat\alpha$ in model (1), suppose that $\hat\alpha$ converges in probability to a limit $\bar\alpha$. Denote $\hat m(X) = \hat\alpha^{\rm T} f(X)$ and $\bar m(X) = \bar\alpha^{\rm T} f(X)$. Consider the following decomposition of $\hat\mu^1(\hat m, \hat\pi)$ by direct calculation:

\hat\mu^1(\hat m, \hat\pi) = \hat\mu^1(\bar m, \bar\pi) + \tilde E\left[ \{\hat m(X) - \bar m(X)\} \left\{ 1 - \frac{T}{\hat\pi(X)} \right\} \right] + \tilde E\left[ T \{Y - \bar m(X)\} \left\{ \frac{1}{\hat\pi(X)} - \frac{1}{\bar\pi(X)} \right\} \right].   (20)

Eq. (20) can also be obtained from a Taylor expansion with $(\hat\alpha, \hat\gamma)$ about $(\bar\alpha, \bar\gamma)$. For linear OR model (8), the second term of the decomposition reduces to

(\hat\alpha - \bar\alpha)^{\rm T} \times \tilde E\left[ \left\{ 1 - \frac{T}{\hat\pi(X)} \right\} f(X) \right].
(21)

For logistic PS model (4) with $\partial\pi(X; \gamma)/\partial\gamma = \pi(X; \gamma)\{1 - \pi(X; \gamma)\} f(X)$, the third term of the decomposition can be approximated via a Taylor expansion by

-(\hat\gamma - \bar\gamma)^{\rm T} \times \tilde E\left[ T \, \frac{1 - \bar\pi(X)}{\bar\pi(X)} \{Y - \bar m(X)\} f(X) \right].   (22)

Suppose that $\hat\gamma$ and $\hat\alpha$ are Lasso penalized M-estimators such that under suitable conditions, $\|\hat\gamma - \bar\gamma\|_1 = O_p(\{\log(p)/n\}^{1/2})$ and $\|\hat\alpha - \bar\alpha\|_1 = O_p(\{\log(p)/n\}^{1/2})$, where for simplicity the dependency on the sparsity sizes of $\bar\gamma$ and $\bar\alpha$ is suppressed. The loss functions $\ell_{\rm CAL}(\gamma)$ and $\ell_{\rm WL}(\alpha; \gamma)$ in (11) and (13) are constructed such that

\frac{\partial \ell_{\rm CAL}(\gamma)}{\partial\gamma} = \tilde E\left[ \left\{ 1 - \frac{T}{\pi(X; \gamma)} \right\} f(X) \right],   (23)

\frac{\partial \ell_{\rm WL}(\alpha; \gamma)}{\partial\alpha} = -\tilde E\left[ T \, \frac{1 - \pi(X; \gamma)}{\pi(X; \gamma)} \{Y - \alpha^{\rm T} f(X)\} f(X) \right].   (24)

Then the second factors in (21) and (22) can be of order $O_p(\{\log(p)/n\}^{1/2})$ in the supremum norms, as reflected in conditions (14)–(15) and (16)–(17). Consequently, the products (21) and (22) can be of order $O_p(\log(p)/n)$, which becomes $o_p(n^{-1/2})$, and hence (19) holds provided $\log(p) = o(n^{1/2})$, up to a constant depending on the sparsity sizes of $\bar\gamma$ and $\bar\alpha$.

The estimator $\hat\gamma_{\rm RCAL}$ is called a regularized calibrated estimator of $\gamma$ (Tan 2017), because in the extreme case of $\lambda = 0$, Eqs. (14)–(15) reduce to calibration equations, which can be traced to Folsom (1991) in the survey literature. Although such equations are intuitively appealing, the preceding discussion shows that $\hat\gamma_{\rm RCAL}$ can also be derived to reduce the variation associated with estimation of $\alpha$ from linear OR model (8) for the estimator $\hat\mu^1(\hat m, \hat\pi)$, when PS model (4) may be misspecified. Similarly, $\hat\alpha_{\rm RWL}$ is constructed to reduce the variation associated with estimation of $\gamma$ from logistic PS model (4) for the estimator $\hat\mu^1(\hat m, \hat\pi)$, when OR model (8) may be misspecified.
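The gradient identity (23) can be verified numerically; the sketch below (our own simulated example, not code from the paper) compares the analytic gradient $\tilde E[\{1 - T/\pi(X; \gamma)\} f(X)]$ with central finite differences of $\ell_{\rm CAL}$, and the same check applies to (24) for the weighted least-squares loss:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
f = np.column_stack([np.ones(n), x])
t = rng.binomial(1, 1/(1 + np.exp(-x)))

def cal_loss(g):
    h = f @ g
    return np.mean(t*np.exp(-h) + (1 - t)*h)

g0 = np.array([0.1, -0.2])
pi0 = 1/(1 + np.exp(-(f @ g0)))
analytic = np.mean((1 - t/pi0)[:, None]*f, axis=0)    # right side of (23)
eps = 1e-6
numeric = np.array([(cal_loss(g0 + eps*e) - cal_loss(g0 - eps*e))/(2*eps)
                    for e in np.eye(2)])
assert np.allclose(analytic, numeric, atol=1e-5)
```

Setting this gradient to zero (or within a box of width $\lambda$) recovers exactly the calibration conditions (14)–(15).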
By extending the meaning of calibrated estimation, we call $\hat\alpha_{\rm RWL}$ a regularized calibrated estimator of $\alpha$ against model (4), as well as $\hat\gamma_{\rm RCAL}$ a regularized calibrated estimator of $\gamma$ against model (8), when used to define $\hat\mu^1(\hat m, \hat\pi)$.

While the preceding discussion outlines our basic reasoning, there are several technical issues we need to address in high-dimensional analysis, including how to handle the dependency of the estimator $\hat\alpha_{\rm RWL}$ on $\hat\gamma_{\rm RCAL}$, and what condition is required on the sparsity sizes of $\bar\gamma$ and $\bar\alpha$. In addition, we develop appropriate methods and theory in the situation where a generalized linear model (1), not just linear model (8), is used for outcome regression.

In this section, we assume that linear outcome model (8) is used together with logistic propensity score model (4), and develop theoretical results for the proposed estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$, leading to Proposition 1 among others, in high-dimensional settings.

First we describe relevant results from Tan (2017) about the behavior of the regularized calibrated estimator $\hat\gamma_{\rm RCAL}$ in model (4). The tuning parameter $\lambda$ used in (10) for defining $\hat\gamma_{\rm RCAL}$ is specified as $\lambda = A_1 \lambda_0$, with a constant $A_1 > 1$ and

\lambda_0 = C_1 \sqrt{\log\{(1+p)/\epsilon\}/n},

where $C_1 > 0$ is a constant depending on $(C_0, B_0)$ from Assumption 1 below and $0 < \epsilon < 1$. Taking $\epsilon = 1/(1+p)$ gives $\lambda_0 = C_1 \sqrt{2\log(1+p)/n}$, a familiar rate in high-dimensional analysis.

With possible model misspecification, the target value $\bar\gamma_{\rm CAL}$ is defined as a minimizer of the expected calibration loss $E\{\ell_{\rm CAL}(\gamma)\}$ as in Section 3.1. From a functional perspective, we write $\ell_{\rm CAL}(\gamma) = \kappa_{\rm CAL}(\gamma^{\rm T} f)$, where for a function $h(x)$,

\kappa_{\rm CAL}(h) = \tilde E\big[ T {\rm e}^{-h(X)} + (1-T) h(X) \big].

As $\kappa_{\rm CAL}(h)$ is easily shown to be convex in $h$, the Bregman divergence associated with $\kappa_{\rm CAL}$ is defined such that for two functions $h(x)$ and $h'(x)$,

D_{\rm CAL}(h', h) = \kappa_{\rm CAL}(h') - \kappa_{\rm CAL}(h) - \langle \nabla\kappa_{\rm CAL}(h), h' - h \rangle,

where $h$ is identified as a vector $(h_1, \dots$
, $h_n)$ with $h_i = h(X_i)$, and $\nabla\kappa_{\rm CAL}(h)$ denotes the gradient of $\kappa_{\rm CAL}(h)$ with respect to $(h_1, \dots, h_n)$. The following result (Theorem 1) is restated from Tan (2017, Corollary 2), where the convergence of $\hat\gamma_{\rm RCAL}$ to $\bar\gamma_{\rm CAL}$ is obtained in the $L_1$ norm $\|\hat\gamma_{\rm RCAL} - \bar\gamma_{\rm CAL}\|_1$ and the symmetrized Bregman divergence

D^\dagger_{\rm CAL}(\hat h_{\rm RCAL}, \bar h_{\rm CAL}) = D_{\rm CAL}(\hat h_{\rm RCAL}, \bar h_{\rm CAL}) + D_{\rm CAL}(\bar h_{\rm CAL}, \hat h_{\rm RCAL}),

where $\hat h_{\rm RCAL}(X) = \hat\gamma_{\rm RCAL}^{\rm T} f(X)$ and $\bar h_{\rm CAL}(X) = \bar\gamma_{\rm CAL}^{\rm T} f(X)$. See Lemma 7 in the Supplementary Material for an explicit expression of $D^\dagger_{\rm CAL}$.

For a matrix $\Sigma$ with row indices $\{0, 1, \dots, k\}$, a compatibility condition (Buhlmann & van de Geer 2011) is said to hold with a subset $S \subset \{0, 1, \dots, k\}$ and constants $\nu_0 > 0$ and $\xi_0 > 1$ if

\nu_0^2 \Big( \sum_{j \in S} |b_j| \Big)^2 \le |S| (b^{\rm T} \Sigma b)

for any vector $b = (b_0, b_1, \dots, b_k)^{\rm T} \in {\mathbb R}^{1+k}$ satisfying

\sum_{j \notin S} |b_j| \le \xi_0 \sum_{j \in S} |b_j|.   (25)

Throughout, $|S|$ denotes the size of a set $S$. By the Cauchy–Schwartz inequality, this compatibility condition is implied by (hence weaker than) a restricted eigenvalue condition (Bickel et al. 2009) such that $\nu_0^2 \sum_{j \in S} b_j^2 \le b^{\rm T} \Sigma b$ for any $b \in {\mathbb R}^{1+k}$ satisfying (25).

Assumption 1.
Suppose that the following conditions are satisfied:

(i) $\max_{j=0,1,\dots,p} |f_j(X)| \le C_0$ almost surely for a constant $C_0 \ge 1$;

(ii) $\bar h_{\rm CAL}(X) \ge B_0$ almost surely for a constant $B_0 \in {\mathbb R}$, that is, $\pi(X; \bar\gamma_{\rm CAL})$ is bounded from below by $(1 + {\rm e}^{-B_0})^{-1}$;

(iii) the compatibility condition holds for $\Sigma_\gamma$ with the subset $S_\gamma = \{0\} \cup \{j: \bar\gamma_{{\rm CAL},j} \neq 0, j = 1, \dots, p\}$ and some constants $\nu_0 > 0$ and $\xi_0 > 1$, where $\Sigma_\gamma = E[T w(X; \bar\gamma_{\rm CAL}) f(X) f^{\rm T}(X)]$ is the Hessian of $E\{\ell_{\rm CAL}(\gamma)\}$ at $\gamma = \bar\gamma_{\rm CAL}$ and $w(X; \gamma) = {\rm e}^{-\gamma^{\rm T} f(X)}$;

(iv) $|S_\gamma| \lambda_0 \le \eta_0$ for a sufficiently small constant $\eta_0 > 0$, depending only on $(A_1, C_0, \xi_0, \nu_0)$.

Theorem 1 (Tan 2017). Suppose that Assumption 1 holds. Then for $A_1 > (\xi_0 + 1)/(\xi_0 - 1)$, we have with probability at least $1 - 2\epsilon$,

D^\dagger_{\rm CAL}(\hat h_{\rm RCAL}, \bar h_{\rm CAL}) + (A_1 - 1) \lambda_0 \|\hat\gamma_{\rm RCAL} - \bar\gamma_{\rm CAL}\|_1 \le M_0 |S_\gamma| \lambda_0^2,   (26)

where $M_0 > 0$ is a constant depending only on $(A_1, C_0, B_0, \xi_0, \nu_0, \eta_0)$.

Remark 1.
We provide comments about the conditions involved. First, Assumption 1(iii) can be justified from a compatibility condition for the Gram matrix E{f(X)fᵀ(X)} in conjunction with additional conditions such as, for some constant τ₀ > 0,

bᵀ E{f(X)fᵀ(X)} b ≤ (bᵀ Σ_γ b)/τ₀, ∀ b ∈ ℝ^{1+p}. (27)

For example, (27) holds provided that π*(X) is bounded from below by a positive constant and π(X; γ̄_CAL) is bounded away from 1. But it is also possible that Assumption 1(iii) is satisfied even if (27) does not hold for any τ₀ > 0. Therefore, Assumption 1 requires that π(X; γ̄_CAL) is bounded away from 0, but may not be bounded away from 1. Second, Assumption 1(iv) can be relaxed to only require that |S_γ| λ₀² is sufficiently small, albeit under stronger conditions, for example, the variables f₁(X), ..., f_p(X) are jointly (not just marginally) sub-gaussian (Huang & Zhang 2012; Negahban et al. 2012). On the other hand, Assumption 1(iv) is already weaker than the sparsity condition, |S_γ| log(p) = o(n^{1/2}), which is needed for obtaining valid confidence intervals for µ from existing works (Belloni et al. 2014) and our later results. Remark 2.
For the Hessian Σ_γ, the weight w(X; γ̄_CAL) with γ̄_CAL replaced by γ̂_RCAL is identical to that used in the weighted least-squares loss (13) to define α̂_RWL, that is, w(X; γ̂_RCAL) = {1 − π̂_RCAL(X)}/π̂_RCAL(X). The Hessian of ℓ_CAL(γ) at γ̄_CAL is also the same as the Hessian of ℓ_WL(α; γ̄_CAL) in α. As later discussed in Section 3.4, this coincidence is a consequence of the construction of the loss functions ℓ_CAL(γ) and ℓ_WL(α; γ) in (11) and (13).

Now we turn to the regularized weighted least-squares estimator α̂_RWL. We develop a new strategy of inverting a quadratic inequality to address the dependency of α̂_RWL on γ̂_RCAL and establish convergence of α̂_RWL under similar conditions as needed for Lasso penalized unweighted least-squares estimators in high-dimensional settings. The error bound obtained, however, depends on the sparsity size |S_γ| and various constants in Assumption 1.

For theoretical analysis, the tuning parameter λ used in (12) for defining α̂_RWL is specified as λ = A₁λ₁, with a constant A₁ > 1 and

λ₁ = max[λ₀, e^{−B₀} C₀ √(D₀² + D₁²) √(log{(1 + p)/ε}/n)],

where 0 < ε < 1, (C₀, B₀) are from Assumption 1, and (D₀, D₁) are from Assumption 2 below. With possible model misspecification, the target value ᾱ_WL is defined as a minimizer of the expected loss E{ℓ_WL(α; γ̄_CAL)} as in Section 3.1. The following result gives the convergence of α̂_RWL to ᾱ_WL in the L₁ norm ‖α̂_RWL − ᾱ_WL‖₁ and the weighted (in-sample) prediction error defined as

Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) = Ẽ[T w(X; γ̄_CAL){m̂_RWL(X) − m̄_WL(X)}²], (28)

where m̂_RWL(X) = α̂ᵀ_RWL f(X) and m̄_WL(X) = ᾱᵀ_WL f(X). In fact, Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) is the symmetrized Bregman divergence between m̂_RWL(X) and m̄_WL(X) associated with the loss κ_WL(h; γ̄_CAL) = Ẽ[T w(X; γ̄_CAL){Y − h(X)}²]/2. See Section 3.3 for further discussion.
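With a linear outcome model, minimizing ℓ_WL(α; γ̂_RCAL) + λ‖α_{1:p}‖₁ is an ordinary Lasso problem once the nonnegative weights T_i w(X_i; γ̂_RCAL) are absorbed into the design. A minimal numerical sketch, not the authors' implementation: a plain proximal-gradient (ISTA) solver on simulated data, with the intercept penalized for simplicity and an arbitrary positive weight vector standing in for T w(X; γ̂_RCAL):

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso(F, y, w, lam, n_iter=5000):
    """Minimize (1/n) sum_i w_i (y_i - F_i @ alpha)^2 / 2 + lam * ||alpha||_1
    by proximal gradient descent (ISTA)."""
    n, d = F.shape
    sw = np.sqrt(w)
    Fw, yw = F * sw[:, None], y * sw        # absorb the weights into the design
    L = np.linalg.norm(Fw, 2) ** 2 / n      # Lipschitz constant of the smooth part
    alpha = np.zeros(d)
    for _ in range(n_iter):
        grad = Fw.T @ (Fw @ alpha - yw) / n
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

rng = np.random.default_rng(0)
n, d = 400, 20
F = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # f(X) with intercept
alpha_true = np.zeros(d)
alpha_true[:3] = [1.0, 2.0, -1.5]
y = F @ alpha_true + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)           # stand-in for T * w(X; gamma_hat)
alpha_hat = weighted_lasso(F, y, w, lam=0.05)
```

The active-set algorithm referenced in Section 4 serves the same purpose far more efficiently; ISTA is shown here only because it takes a few lines.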
Assumption 2.
Suppose that the following conditions are satisfied: (i) Y − m̄_WL(X) is uniformly sub-gaussian given X: D₀² E(exp[{Y − m̄_WL(X)}²/D₀²] − 1 | X) ≤ D₁² for some positive constants (D₀, D₁); (ii) the compatibility condition holds for Σ_γ with the subset S_α = {0} ∪ {j : ᾱ_WL,j ≠ 0, j = 1, ..., p} and some constants ν₁ > 0 and ξ₁ > 1; (iii) (1 + ξ₁)² ν₁^{−2} |S_α| λ₁ ≤ η₁ for a constant 0 < η₁ < 1. Theorem 2.
Suppose that linear outcome model (8) is used, A₀ > (ξ₀ + 1)/(ξ₀ − 1), A₁ > (ξ₁ + 1)/(ξ₁ − 1), and Assumptions 1 and 2 hold. If log{(1 + p)/ε}/n ≤ 1, then we have with probability at least 1 − ε,

Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) + e^{−η₁}(A₁ − 1)λ₁ ‖α̂_RWL − ᾱ_WL‖₁ ≤ e^{η₁} ξ₂^{−1} (M₁ |S_γ| λ₀²) + e^{η₁} ξ₃ (ν₂^{−2} |S_α| λ₁²), (29)

where ξ₂ = 1 − A₁/{(ξ₁ + 1)(A₁ − 1)} ∈ (0, 1), ξ₃ = (ξ₁ + 1)(A₁ − 1), and ν₂² = ν₁²(1 − η₁)/2, depending only on (A₁, ξ₁, ν₁, η₁), and M₁ is a constant depending only on (A₀, C₀, B₀, ξ₀, ν₀, η₀) in Theorem 1 and (D₀, D₁).

Remark 3. Assumption 2(ii) concerns the same matrix Σ_γ as in Assumption 1(iii), but with the sparsity subset S_α from ᾱ_WL instead of S_γ from γ̄_CAL. The matrix Σ_γ is also the Hessian of the expected loss E{ℓ_WL(α; γ̄_CAL)} at α = ᾱ_WL, for reasons mentioned in Remark 2. Assumptions 2(ii)–(iii) are combined to derive a compatibility condition for the sample matrix Σ̃_γ = Ẽ[T w(X; γ̄_CAL) f(X) fᵀ(X)]. Assumption 2(iii) can be relaxed such that |S_α| λ₁ is sufficiently small under further side conditions, but it is already weaker than the sparsity condition, |S_α| log(p) = o(n^{1/2}), later needed for valid confidence intervals for µ. Essentially, the conditions in Assumption 2 are comparable to those for high-dimensional analysis of standard Lasso estimators (Bickel et al. 2009; Buhlmann & van de Geer 2011). Remark 4.
One of the key steps in our proof is to upper-bound the product

(α̂_RWL − ᾱ_WL)ᵀ Ẽ[T w(X; γ̂_RCAL){Y − m̄_WL(X)} f(X)]. (30)

If γ̂_RCAL were replaced by γ̄_CAL, then it is standard to use the following bound,

(α̂_RWL − ᾱ_WL)ᵀ Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X)] (31)
≤ ‖α̂_RWL − ᾱ_WL‖₁ × ‖Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X)]‖_∞.

To handle the dependency on γ̂_RCAL, our strategy is to derive an upper bound on the difference between (30) and (31), depending on Q_WL(m̂_RWL, m̄_WL; γ̄_CAL), which we seek to control. Carrying this bound leads to a quadratic inequality in Q_WL(m̂_RWL, m̄_WL; γ̄_CAL), which can be inverted to obtain an explicit bound on Q_WL(m̂_RWL, m̄_WL; γ̄_CAL). The resulting error bound (29) is of order (|S_γ| + |S_α|) log(p)/n, much sharper than what we could obtain using other approaches, for example, directly bounding ‖Ẽ[T w(X; γ̂_RCAL){Y − m̄_WL(X)} f(X)]‖_∞.

Finally, we study the proposed estimator µ̂(m̂_RWL, π̂_RCAL) for µ, depending on the regularized estimators γ̂_RCAL and α̂_RWL from logistic propensity score model (4) and linear outcome regression model (8). The following result gives an error bound for µ̂(m̂_RWL, π̂_RCAL), allowing that both models (4) and (8) may be misspecified.
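The inversion step can be made explicit. Schematically, with generic constants a, b ≥ 0 standing in for the quantities arising in the proof (not the paper's exact constants), if Q = Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) satisfies a self-bounding inequality Q ≤ a + b Q^{1/2}, then completing the square yields an explicit bound:

```latex
% Schematic inversion of a quadratic inequality (illustrative constants a, b >= 0):
% Q <= a + b Q^{1/2} implies (Q^{1/2} - b/2)^2 <= a + b^2/4, hence
% Q^{1/2} <= b/2 + (a + b^2/4)^{1/2}, and by (x + y)^2 <= 2x^2 + 2y^2,
% Q <= 2a + b^2.
\[
Q \le a + b\,Q^{1/2}
\;\Longrightarrow\;
Q^{1/2} \le \frac{b}{2} + \Bigl(a + \frac{b^2}{4}\Bigr)^{1/2}
\;\Longrightarrow\;
Q \le 2a + b^2 .
\]
```

Applied with a of order (|S_γ| + |S_α|) log(p)/n and b of the same order in the square root, this is how the stated rate for Q_WL is recovered after inversion.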
Theorem 3.
Under the conditions of Theorem 2, if log{(1 + p)/ε}/n ≤ 1, then we have with probability at least 1 − ε,

|µ̂(m̂_RWL, π̂_RCAL) − µ̂(m̄_WL, π̄_CAL)| ≤ M₂ |S_γ| λ₀² + M₃ |S_γ| λ₀λ₁ + M₄ |S_α| λ₁², (32)

where M₂, M₃, and M₄ are positive constants depending only on the constants in Theorems 1–2, and M₅ is a constant such that the right-hand side of (29) in Theorem 2 is upper-bounded by e^{η₁} M₅ (|S_γ| λ₀λ₁ + |S_α| λ₁²).

Theorem 3 shows that µ̂(m̂_RWL, π̂_RCAL) is doubly robust for µ provided (|S_γ| + |S_α|) λ₁² = o(1), that is, (|S_γ| + |S_α|) log(p) = o(n). In addition, Theorem 3 gives the n^{−1/2} asymptotic expansion (19) provided n^{1/2}(|S_γ| + |S_α|) λ₁² = o(1), that is, (|S_γ| + |S_α|) log(p) = o(n^{1/2}).

To obtain valid confidence intervals for µ via the Slutsky theorem, the following result gives the convergence of the variance estimator V̂ to V, as defined in Proposition 1, allowing that both models (4) and (8) may be misspecified. For notational simplicity, denote φ̂ = φ(T, Y, X; m̂_RWL, π̂_RCAL) and φ̂_c = φ̂ − µ̂(m̂_RWL, π̂_RCAL) such that V̂ = Ẽ(φ̂_c²). Similarly, denote φ̄ = φ(T, Y, X; m̄_WL, π̄_CAL) and φ̄_c = φ̄ − µ̂(m̄_WL, π̄_CAL) such that V = E(φ̄_c²). Theorem 4.
Under the conditions of Theorem 2, if log{(1 + p)/ε}/n ≤ 1, then we have with probability at least 1 − ε,

|Ẽ(φ̂_c² − φ̄_c²)| ≤ M₆ {Ẽ(φ̄_c²)}^{1/2} (|S_γ| λ₀ + |S_α| λ₁) + M₆ (|S_γ| λ₀ + |S_α| λ₁)², (33)

where M₆ is a positive constant depending only on (A₀, C₀, B₀, ξ₀, ν₀, η₀) in Theorem 1 and (A₁, D₀, D₁, ξ₁, ν₁, η₁) in Theorem 2. If, in addition, condition (27) holds, then we have with probability at least 1 − ε,

|Ẽ(φ̂_c² − φ̄_c²)| ≤ M₇ {Ẽ(φ̄_c²)}^{1/2} (|S_γ| λ₀λ₁ + |S_α| λ₁²)^{1/2} + M₇ (|S_γ| λ₀λ₁ + |S_α| λ₁²), (34)

where M₇ is a positive constant, depending on τ₀ from (27) as well as (A₀, C₀, B₀, ξ₀, ν₀, η₀) and (A₁, D₀, D₁, ξ₁, ν₁, η₁). Remark 5.
Theorem 4 provides two rates of convergence for V̂ under different conditions. Inequality (33) shows that V̂ is a consistent estimator of V, that is, V̂ − V = o_p(1), provided (|S_γ| + |S_α|)(log p)^{1/2} = o(n^{1/2}). Technically, consistency of V̂ is sufficient for applying the Slutsky theorem to establish confidence intervals for µ in Proposition 1(iii). With the additional condition (27), inequality (34) shows that V̂ achieves the parametric rate of convergence, V̂ − V = o_p(n^{−1/2}), provided (|S_γ| + |S_α|) log(p) = o(n^{1/2}). Remark 6.
Combining Theorems 3–4 directly leads to Proposition 1, which gives doubly robust confidence intervals for µ. In addition, a broader interpretation can also be accommodated. All the results, Theorems 1–4, are developed to remain valid in the presence of misspecification of models (4) and (8), similarly as in classical theory of estimation with misspecified models (e.g., White 1982; Manski 1988). If both models (4) and (8) may be misspecified, then µ̂(m̂_RWL, π̂_RCAL) ± z_{c/2} (V̂/n)^{1/2} is an asymptotic (1 − c) confidence interval for the target value µ̄ = E(φ̄), which in general differs from the true value µ. By comparison, the standard estimator µ̂(m̂_RML, π̂_RML) can be shown to converge to a target value, different from µ as well as µ̄ in the presence of model misspecification. But it seems difficult to obtain valid confidence intervals based on µ̂(m̂_RML, π̂_RML) under similar conditions as in our results, because (21) and (22) are then O_p({log(p)/n}^{1/2}) if either model (4) or (8) is misspecified.

In this section, we turn to the situation where a generalized linear model is used for outcome regression together with a logistic propensity score model, and develop appropriate methods and theory for obtaining confidence intervals for µ in high-dimensional settings. A technical complication compared with the situation of a linear outcome model in Section 3.2 is that the reasoning outlined through (20)–(24) for deriving doubly robust confidence intervals for µ does not directly hold with a non-linear outcome model, where the second term of (20) does not in general reduce to the simple product in (21). There are, however, different approaches that can be used to derive model-assisted confidence intervals, that is, satisfying either property (G2) or (G3) described in Section 3.1.
For concreteness, we focus on a PS based, OR assisted approach to obtain confidence intervals with property (G2), that is, being valid if the propensity score model used is correctly specified but the outcome regression model may be misspecified. See Section 3.4 for further discussion of related issues.

Consider a logistic propensity score model (4) and a generalized linear outcome model,

E(Y | T = 1, X) = m(X; α) = ψ{αᵀf(X)}, (35)

that is, model (1) with the vector of covariate functions g(X) taken to be the same as f(X) in model (4). This choice of covariate functions can be more justified than in the setting of Section 3.2, because OR model (35) plays an assisting role when confidence intervals for µ are concerned. Our point estimator of µ is µ̂(m̂_RWL, π̂_RCAL) as defined in (9), where π̂_RCAL(X) = π(X; γ̂_RCAL) and m̂_RWL(X) = m(X; α̂_RWL). The estimator γ̂_RCAL is a regularized calibrated estimator of γ from Tan (2017) as in Section 3.2. But α̂_RWL is a regularized weighted likelihood estimator of α, defined as a minimizer of

ℓ_RWL(α; γ̂_RCAL) = ℓ_WL(α; γ̂_RCAL) + λ‖α_{1:p}‖₁, (36)

where ℓ_WL(α; γ̂_RCAL) is the weighted likelihood loss as follows, with w(X; γ) = {1 − π(X; γ)}/π(X; γ) = e^{−γᵀf(X)} for logistic model (4),

ℓ_WL(α; γ̂_RCAL) = Ẽ(T w(X; γ̂_RCAL)[−Y αᵀf(X) + Ψ{αᵀf(X)}]), (37)

‖α_{1:p}‖₁ is the L₁ norm of α_{1:p}, and λ ≥ 0. The estimator α̂_RWL used in Section 3.2 is recovered in the special case of the identity link, ψ(u) = u and Ψ(u) = u²/2. In addition, the Karush–Kuhn–Tucker condition for minimizing (36) remains the same as in (16)–(17), and hence the estimator µ̂(m̂_RWL, π̂_RCAL) can be put in the prediction form (18), which ensures the boundedness property that µ̂(m̂_RWL, π̂_RCAL) always falls in the range of the observed outcomes Y_i in the treated group (T_i = 1) and the predicted values m̂_RWL(X_i) in the untreated group (T_i = 0).

With possible model misspecification, the target value ᾱ_WL is defined as a minimizer of the expected loss E{ℓ_WL(α; γ̄_CAL)}. From a functional perspective, we write ℓ_WL(α; γ) = κ_WL(αᵀf; γ), where for a function h(x) which may not be of the form αᵀf,

κ_WL(h; γ) = Ẽ(T w(X; γ)[−Y h(X) + Ψ{h(X)}]).

As κ_WL(h; γ) is convex in h by the convexity of Ψ(), the Bregman divergence associated with κ_WL(h; γ) is defined as

D_WL(h′, h; γ) = κ_WL(h′; γ) − κ_WL(h; γ) − ⟨∇κ_WL(h; γ), h′ − h⟩,

where ∇κ_WL(h; γ) denotes the gradient of κ_WL(h; γ) with respect to (h₁, ..., h_n) with h_i = h(X_i). The symmetrized Bregman divergence is easily shown to be

D†_WL(h′, h; γ) = D_WL(h′, h; γ) + D_WL(h, h′; γ) = Ẽ(T w(X; γ)[ψ{h′(X)} − ψ{h(X)}]{h′(X) − h(X)}). (38)

The following result establishes the convergence of α̂_RWL to ᾱ_WL in the L₁ norm ‖α̂_RWL − ᾱ_WL‖₁ and the symmetrized Bregman divergence D†_WL(ĥ_RWL, h̄_WL; γ̄_CAL), where ĥ_RWL(X) = α̂ᵀ_RWL f(X) and h̄_WL(X) = ᾱᵀ_WL f(X). In the case of the identity link, ψ(u) = u, the symmetrized Bregman divergence D†_WL(ĥ_RWL, h̄_WL; γ̄_CAL) becomes Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) in (28). Inequality (39) also reduces to (29) in Theorem 2 with the choices C₂ = 1 and C₃ = η₄ = η₅ = 0. Assumption 3.
Assume that ψ () is differentiable and denote ψ ( u ) = d ψ ( u ) / d u . Supposethat the following conditions are satisfied: i) ψ { ¯ h WL ( X ) } ≤ C almost surely for a constant C > ;(ii) ψ { ¯ h WL ( X ) } ≥ C almost surely for a constant C > ;(iii) ψ ( u ) ≤ ψ ( u ′ )e C | u − u ′ | for any ( u, u ′ ) , where C ≥ is a constant.(iv) C C ( A − − ξ ν − C − | S α | λ ≤ η for a constant ≤ η < and C C e η ( A − − ξ − C − ( M | S γ | λ ) ≤ η for a constant ≤ η < , where ( η , ν , ξ , ξ , M ) areas in Theorem 2. Theorem 5.
Suppose that Assumptions 1, 2, and 3(ii)–(iv) hold. If log { (1 + p ) /ǫ } /n ≤ ,then for A > ( ξ + 1) / ( ξ − and A > ( ξ + 1) / ( ξ − , we have with probability at least − ǫ , D † WL ( ˆ m RWL , ¯ m WL ) + e η ( A − λ k ˆ α RWL − ¯ α WL k ≤ e η ξ − (cid:0) M | S γ | λ (cid:1) + e η ξ (cid:0) ν − | S α | λ (cid:1) , (39) where ξ = ξ (1 − η ) / C / , ν = ν / (1 − η ) / C / , and ( η , ν , ξ , ξ , M ) are as inTheorem 2. Remark 7.
We discuss the conditions involved in Theorem 5. Assumption 3(i) is not needed, but will be used in later results. Assumption 3(iii), adapted from Huang & Zhang (2012), is used along with Assumption 1(i) to bound the curvature of D†_WL(h′, h; γ̄_CAL) and then with Assumption 3(iv) to achieve a localized analysis when handling a non-quadratic loss function. Assumption 3(ii) is used for two distinct purposes. First, it is combined with Assumptions 2(ii)–(iii) to yield a compatibility condition for Σ̃_α = Ẽ[T w(X; γ̄_CAL) ψ₁{h̄_WL(X)} f(X) fᵀ(X)], which is the sample version of the Hessian of the expected loss E{ℓ_WL(α; γ̄_CAL)} at α = ᾱ_WL, that is, Σ_α = E[T w(X; γ̄_CAL) ψ₁{h̄_WL(X)} f(X) fᵀ(X)]. Second, Assumption 3(ii) is also used in deriving a quadratic inequality to be inverted in our strategy to deal with the dependency of α̂_RWL on γ̂_RCAL as mentioned in Remark 4. As seen from the proofs in the Supplementary Material, similar results as in Theorem 5 can be obtained with Assumption 3(ii) replaced by the weaker condition that for some constant τ₁ > 0,

bᵀ Σ_γ b ≤ (bᵀ Σ_α b)/τ₁, ∀ b ∈ ℝ^{1+p},

provided that the condition on A₁ and Assumption 3(iv) are modified accordingly, depending on τ₁. This extension is not pursued here for simplicity.

Now we study the proposed estimator µ̂(m̂_RWL, π̂_RCAL) for µ, with the regularized estimators γ̂_RCAL and α̂_RWL obtained using logistic propensity score model (4) and generalized linear outcome model (35). Theorem 6 gives an error bound for µ̂(m̂_RWL, π̂_RCAL), allowing that both models (4) and (35) may be misspecified, but depending on additional terms in the presence of misspecification of model (4). Denote h(X; α) = αᵀf(X) and for r ≥ 0,

Λ₁(r) = sup_{j=0,1,...,p; ‖α−ᾱ_WL‖₁≤r} |E[ψ₁{h(X; α)} f_j(X) {T/π̄_CAL(X) − 1}]|.
As a special case, the quantity Λ₁(0) is defined as

Λ₁ = sup_{j=0,1,...,p} |E[ψ₁{h̄_WL(X)} f_j(X) {T/π̄_CAL(X) − 1}]|.

By the definition of γ̄_CAL, it holds that E[{T/π̄_CAL(X) − 1} f_j(X)] = 0 for j = 0, 1, ..., p whether or not model (4) is correctly specified. But Λ₁(r) is in general either zero or positive respectively if model (4) is correctly specified or misspecified, except in the case of linear outcome model (8) where Λ₁(r) is automatically zero because ψ₁() is constant. Theorem 6.
Suppose that Assumptions 1, 2, and 3 hold. If log { (1 + p ) /ǫ } /n ≤ , then for A > ( ξ + 1) / ( ξ − and A > ( ξ + 1) / ( ξ − , we have with probability at least − ǫ , (cid:12)(cid:12) ˆ µ ( ˆ m RWL , ˆ π RCAL ) − ˆ µ ( ¯ m WL , ¯ π CAL ) (cid:12)(cid:12) ≤ M | S γ | λ + M | S γ | λ λ + M | S α | λ λ + η Λ ( η ) , (40) where M , M , and M are positive constants, depending only on ( A , C , B , ξ , ν , η ) , ( A , D , D , ξ , ν , η ) , and ( C , C , C , η , η ) , η = ( A − − M ( | S γ | λ + | S α | λ ) , and M isa constant such that the right hand side of (39) is upper-bounded by e η M ( | S γ | λ λ + | S α | λ ) .If, in addition, condition (27) holds, then we have with probability at least − ǫ , (cid:12)(cid:12) ˆ µ ( ˆ m RWL , ˆ π RCAL ) − ˆ µ ( ¯ m WL , ¯ π CAL ) (cid:12)(cid:12) ≤ M | S γ | λ + M | S γ | λ λ + M | S α | λ λ + η Λ , (41) where M , M , and M are positive constants, also depending on τ from (27). Remark 8.
Two different error bounds are obtained in Theorem 6. Because Λ₁(r) ≥ Λ₁ for any r ≥ 0, the error bound (41) is tighter than (40), but with the additional condition (27), which requires that the generalized eigenvalues of Σ_γ relative to the Gram matrix E{f(X)fᵀ(X)} are bounded away from 0. In either case, the result shows that µ̂(m̂_RWL, π̂_RCAL) is doubly robust for µ provided (|S_γ| + |S_α|)λ₁ = o(1), that is, (|S_γ| + |S_α|)(log p)^{1/2} = o(n^{1/2}). In addition, the error bounds imply that µ̂(m̂_RWL, π̂_RCAL) admits the n^{−1/2} asymptotic expansion (19) provided (|S_γ| + |S_α|) log(p) = o(n^{1/2}), when PS model (4) is correctly specified but OR model (35) may be misspecified, because the term involving Λ₁(r) or Λ₁ vanishes as discussed above. Unfortunately, expansion (19) may fail when PS model (4) is misspecified.

Similarly as Theorem 4, the following result establishes the convergence of V̂ to V as defined in Proposition 1, allowing that both models (4) and (35) may be misspecified. Theorem 7.
Under the conditions of Theorem 6, if log { (1 + p ) /ǫ } /n ≤ , then we have withprobability at least − ǫ , (cid:12)(cid:12)(cid:12) ˜ E (cid:0) ˆ ϕ c − ¯ ϕ c (cid:1)(cid:12)(cid:12)(cid:12) ≤ M { ˜ E ( ¯ ϕ c ) } / { ( η ) } ( | S γ | λ + | S α | λ )+ M { ( η ) } ( | S γ | λ + | S α | λ ) , (42) where M is a positive constant depending only on ( A , C , B , ξ , ν , η ) , ( A , D , D , ξ , ν , η ) ,and ( C , C , C , η , η ) . If, in addition, condition (27) holds, then we have with probability atleast − ǫ , (cid:12)(cid:12)(cid:12) ˜ E (cid:0) ˆ ϕ c − ¯ ϕ c (cid:1)(cid:12)(cid:12)(cid:12) ≤ M { ˜ E ( ¯ ϕ c ) } / n ( | S γ | λ λ + | S α | λ ) / + Λ ( | S γ | λ + | S α | λ ) o + M (cid:8) ( | S γ | λ λ + | S α | λ ) + Λ ( | S γ | λ + | S α | λ ) (cid:9) , (43) where M is a positive constant, similar to M but also depending on τ from (27). Remark 9.
Two different rates of convergence are obtained for V̂ in Theorem 7. Similarly as discussed in Remark 5, if (|S_γ| + |S_α|)(log p)^{1/2} = o(n^{1/2}), then inequality (42) implies the consistency of V̂ for V, which is sufficient for applying the Slutsky theorem to establish confidence intervals for µ. With the additional condition (27), inequality (43) gives a faster rate of convergence for V̂, which is of order n^{−1/2} provided (|S_γ| + |S_α|) log(p) = o(n^{1/2}). Combining Theorems 6–7 leads to the following result. Proposition 2.
Suppose that Assumptions 1, 2, and 3 hold, and (|S_γ| + |S_α|) log(p) = o(n^{1/2}). For sufficiently large constants A₀ and A₁, if logistic PS model (4) is correctly specified but OR model (35) may be misspecified, then (i)–(iii) in Proposition 1 hold. That is, a PS based, OR assisted confidence interval for µ is obtained.

Remark 10. The conclusion of Proposition 2 remains valid if PS model (4) is misspecified but only locally such that Λ₁(η) = O({log(p)/n}^{1/2}) or Λ₁ = O({log(p)/n}^{1/2}), in the case of the error bound (40) or (41). Therefore, µ̂(m̂_RWL, π̂_RCAL) ± z_{c/2} (V̂/n)^{1/2} can be interpreted as an asymptotic (1 − c) confidence interval for the target value µ̄ = E(φ̄) if model (4) is at most locally misspecified but model (35) may be arbitrarily misspecified. It is an interesting open problem to find broadly valid confidence intervals in the presence of model misspecification similarly as discussed in Remark 6 when a linear outcome model is used. Estimation of ATE.
Our theory and methods are presented mainly for estimation of µ¹, but they can be directly extended to estimating µ⁰ and hence ATE, that is, µ¹ − µ⁰. Consider a logistic propensity score model (4) and a generalized linear outcome model,

E(Y | T = 0, X) = m⁰(X; α⁰) = ψ{α⁰ᵀf(X)}, (44)

where f(X) is the same vector of covariate functions as in the model (4) and α⁰ is a vector of unknown parameters. Our point estimator of ATE is µ̂(m̂_RWL, π̂_RCAL) − µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL), and that of µ⁰ is

µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = Ẽ{φ(Y, 1−T, X; m̂⁰_RWL, 1−π̂⁰_RCAL)},

where φ() is defined in (7), π̂⁰_RCAL(X) = π(X; γ̂⁰_RCAL), m̂⁰_RWL(X) = m⁰(X; α̂⁰_RWL), and γ̂⁰_RCAL and α̂⁰_RWL are defined as follows. The estimator γ̂⁰_RCAL is defined similarly as γ̂_RCAL, but with the loss function ℓ_CAL(γ) in (11) replaced by

ℓ⁰_CAL(γ) = Ẽ{(1−T) e^{γᵀf(X)} − T γᵀf(X)},

that is, T and γ in ℓ_CAL(γ) are replaced by 1−T and −γ. The estimator α̂⁰_RWL is defined similarly as α̂_RWL, but with the loss function ℓ_WL(·; γ̂_RCAL) in (37) replaced by

ℓ⁰_WL(α⁰; γ̂⁰_RCAL) = Ẽ((1−T) w⁰(X; γ̂⁰_RCAL)[−Y α⁰ᵀf(X) + Ψ{α⁰ᵀf(X)}]),

where w⁰(X; γ) = π(X; γ)/{1 − π(X; γ)} = e^{γᵀf(X)}. Under similar conditions as in Propositions 1 and 2, the estimator µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) admits the asymptotic expansion

µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = Ẽ{φ(Y, 1−T, X; m̄⁰_WL, 1−π̄⁰_CAL)} + o_p(n^{−1/2}), (45)

where π̄⁰_CAL(X) = π(X; γ̄⁰_CAL), m̄⁰_WL(X) = m⁰(X; ᾱ⁰_WL), and γ̄⁰_CAL and ᾱ⁰_WL are the target values defined similarly as γ̄_CAL and ᾱ_WL.
Then Wald confidence intervals for µ⁰ and ATE can be derived from (19) and (45) similarly as in Propositions 1 and 2 and shown to be either doubly robust in the case of linear outcome models, or valid if PS model (4) is correctly specified but OR models (35) and (44) may be misspecified for nonlinear ψ().

An unusual aspect of our approach is that two different estimators of the propensity score are used when estimating µ¹ and µ⁰. On one hand, the estimators γ̂_RCAL and γ̂⁰_RCAL are both consistent, and hence there is no self-contradiction at least asymptotically, when PS model (4) is correctly specified. On the other hand, if model (4) is misspecified, the two estimators may in general have different asymptotic limits, which can be an advantage from the following perspective. By definition, the augmented IPW estimators of µ¹ and µ⁰ are obtained, depending on fitted propensity scores within the treated group and untreated group separately, that is, {π(X_i; γ¹) : T_i = 1} and {π(X_i; γ⁰) : T_i = 0}. In the presence of model misspecification, allowing different γ¹ and γ⁰ can be helpful in finding suitable approximations of the two sets of propensity scores, without being constrained by the then-false assumption that they are determined by the same coefficient vector γ¹ = γ⁰. Estimation of ATT.
There is a simple extension of our approach to estimation of ATT, that is, ν¹ − ν⁰ as defined in Section 2. The parameter ν¹ = E(Y¹ | T = 1) can be directly estimated by Ẽ(TY)/Ẽ(T). For ν⁰ = E(Y⁰ | T = 1), our point estimator is

ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = Ẽ{φ_ν(Y, T, X; m̂⁰_RWL, π̂⁰_RCAL)}/Ẽ(T),

where π̂⁰_RCAL(X) and m̂⁰_RWL(X) are the same fitted values as used in the estimator µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) for µ⁰, and φ_ν(·; m̂⁰, π̂⁰) is defined as

φ_ν(Y, T, X; m̂⁰, π̂⁰) = [(1−T) π̂⁰(X)/{1 − π̂⁰(X)}] Y − [(1−T)/{1 − π̂⁰(X)} − 1] m̂⁰(X).

The function φ_ν(·; m̂⁰, π̂⁰) can be derived by substituting the fitted values (m̂⁰, π̂⁰) for the true values (m⁰*, π*) in the efficient influence function of ν⁰ under a nonparametric model (Hahn 1998). In addition, the estimator Ẽ{φ_ν(Y, T, X; m̂⁰, π̂⁰)} is also doubly robust: it remains consistent for E(TY⁰) if either m̂⁰ = m⁰* or π̂⁰ = π*. In fact, by straightforward calculation, the function φ_ν() is related to φ() in (7) through the simple identity:

φ_ν(Y, T, X; m̂⁰, π̂⁰) = φ(Y, 1−T, X; m̂⁰, 1−π̂⁰) − (1−T)Y. (46)

As a result, ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) can be equivalently obtained as

ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = [µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) − Ẽ{(1−T)Y}]/Ẽ(T) = Ẽ{T m̂⁰_RWL(X)}/Ẽ(T),

where the second step follows from a similar equation for µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) as (18). Moreover, it can be shown using Eq. (46) that under similar conditions as in Propositions 1 and 2, the estimator ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) admits the asymptotic expansion

ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) − ν⁰ = Ẽ{φ_ν(Y, T, X; m̄⁰_WL, π̄⁰_CAL) − T ν⁰}/Ẽ(T) + o_p(n^{−1/2}),

similarly as (45) for µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL).
From this expansion, Wald confidence intervals for ν and ATT can be derived and shown to be either doubly robust with linear OR model (44) orvalid at least when PS model (4) is correctly specified. Construction of loss functions.
We provide additional comments about the construction ofloss functions for γ and α and alternative approaches when using nonlinear outcome models.For a linear outcome model (8) as in Section 3.1, the loss functions ℓ CAL ( γ ) and ℓ WL ( α ; γ ) arederived such that their gradients satisfy (23)–(24), which are in turn obtained as the coefficientsfor ˆ α − ¯ α and ˆ γ − ¯ γ in the first-order terms (21)–(22) from the Taylor expansion (20) ofˆ µ ( ˆ m , ˆ π ). Combining the two steps, Eqs. (23)–(24) amount to choosing ∂ℓ CAL ( γ ) ∂γ = ∂∂α ˜ E (cid:2) ϕ { Y, T, X ; m ( · ; α ) , π ( · ; γ ) } (cid:3) , (47) ∂ℓ WL ( α ; γ ) ∂α = ∂∂γ ˜ E (cid:2) ϕ { Y, T, X ; m ( · ; α ) , π ( · ; γ ) } (cid:3) . (48)We say that the loss function ℓ CAL ( γ ) for γ in model (4) is calibrated against model (8),whereas ℓ WL ( α ; γ ) for α in model (8) is calibrated against model (4). The estimators ˆ γ RCAL and ˆ α RWL are called regularized calibrated estimators of γ and α respectively. The pair ofequations (47)–(48) also underlie the coincidence of the Hessian of ℓ CAL ( γ ) at ¯ γ CAL and that of ℓ WL ( α ; ¯ γ CAL ) in α with a linear outcome model, as mentioned in Remark 2.Previously, an augmented IPW estimator ˆ µ ( ˆ m , ˆ π ) for µ was proposed in low-dimensionalsettings by Kim & Haziza (2014) and Vermeulen & Vansteelandt (2015), where ( ˆ α , ˆ γ ) are non-penalized, defined by directly setting the right-hand sides of (47)–(48) to zero. One of theirmotivations is to enable simple calculation of confidence intervals without the need of correctingfor estimation of ( α , γ ). 
Our work generalizes these previous estimators to high-dimensional settings, where the motivation for using (α̂_RWL, γ̂_RCAL), instead of (α̂_RML, γ̂_RML), is mainly statistical: to reduce the variation caused by estimation of (α, γ) from O_p({log(p)/n}^{1/2}) to o_p(n^{−1/2}) for the estimator µ̂(m̂_RWL, π̂_RCAL), so that valid confidence intervals for µ can be obtained even in the presence of model misspecification.

For a possibly nonlinear outcome model (35), the augmented IPW estimator of µ in Kim & Haziza (2014) and Vermeulen & Vansteelandt (2015) is also defined as described above. However, the gradients from the right-hand sides of (47)–(48) become

∂/∂α Ẽ[φ{Y, T, X; m(·; α), π(·; γ)}] = Ẽ[{1 − T/π(X; γ)} ψ₁{αᵀf(X)} f(X)], (49)

∂/∂γ Ẽ[φ{Y, T, X; m(·; α), π(·; γ)}] = −Ẽ[T {1 − π(X; γ)}/π(X; γ) {Y − m(X; α)} f(X)], (50)

where ψ₁() denotes the derivative of ψ(). The pair of equations obtained by setting (49)–(50) to zero are intrinsically coupled, unless outcome model (35) is linear and hence the dependency of (49) on α vanishes. This complication, although mainly computational in low-dimensional settings, presents a statistical as well as computational obstacle to developing doubly robust confidence intervals with regularized estimation in high-dimensional settings.

The development in Section 3.3 involves using (23) instead of (49) but retaining (50), which leads to the loss functions ℓ_CAL(γ) in (11) and ℓ_WL(α; γ) in (37). The resulting confidence intervals are PS based, OR assisted, that is, being valid if PS model (4) is correctly specified but OR model (35) may be misspecified.
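The calibration relation (47) can be checked numerically in the linear-outcome case: the gradient of the calibration loss ℓ_CAL(γ) coincides with the derivative in α of the augmented IPW functional Ẽ[φ]. A small self-contained check on simulated data, with φ and ℓ_CAL written out from (7) and (11) as reconstructed above (the data and coefficient values are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 4
F = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # f(X) with intercept
T = rng.binomial(1, 0.5, n).astype(float)
Y = rng.normal(size=n)
alpha = rng.normal(size=d)       # arbitrary coefficients of the linear OR model
gamma = 0.3 * rng.normal(size=d)

def mu_hat(a, g):
    """AIPW functional: E_tilde{ TY/pi - (T/pi - 1) m }, logistic pi, linear m."""
    pi = 1.0 / (1.0 + np.exp(-F @ g))
    m = F @ a
    return np.mean(T * Y / pi - (T / pi - 1.0) * m)

# Gradient of the calibration loss ell_CAL(g) = E_tilde{ T e^{-g'f} + (1-T) g'f }
u = F @ gamma
grad_cal = F.T @ ((1 - T) - T * np.exp(-u)) / n

# d/d(alpha) of mu_hat by finite differences (exact up to roundoff: linear in alpha)
eps = 1e-6
grad_mu = np.array([(mu_hat(alpha + eps * np.eye(d)[j], gamma) - mu_hat(alpha, gamma)) / eps
                    for j in range(d)])

print(np.allclose(grad_cal, grad_mu, atol=1e-6))  # prints True
```

The agreement holds for any γ, not just at a solution, which is exactly what makes the first-order term (21) vanish when γ̂_RCAL solves the penalized calibration problem.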
Alternatively, it is possible to develop an OR based, PS assisted approach using the regularized maximum likelihood estimator α̂_RML in conjunction with a regularized estimator of γ based on a weighted calibration loss,

ℓ_WL(γ; α̂_RML) = Ẽ[ψ₁{α̂ᵀ_RML f(X)} {T e^{−γᵀf(X)} + (1−T) γᵀf(X)}]. (51)

The gradient of (51) in γ is (49), with α = α̂_RML. Similar results can be established as in Section 3.3, to provide valid confidence intervals for µ if OR model (35) is correctly specified but PS model (4) may be misspecified. This work can be pursued elsewhere.

We present a simulation study with the design of Kang & Schafer (2007) modified and extended to high-dimensional, sparse settings. It is of interest to empirically compare µ̂(m̂_RML, π̂_RML) and µ̂(m̂_RWL, π̂_RCAL) and their associated confidence intervals.

In our implementation, the penalized loss function (3) or (6) for computing α̂_RML or γ̂_RML, or (10), (12), or (36) for computing α̂_RWL or γ̂_RCAL, is minimized for a fixed tuning parameter λ, using algorithms similar to those in Friedman et al. (2010), but with the coordinate descent method replaced by an active set method as in Osborne et al. (2000) for solving each Lasso penalized least-squares problem. In addition, the penalized loss (10) for computing γ̂_RCAL is minimized using the algorithm in Tan (2017), where a nontrivial Fisher scoring step is involved for quadratic approximation. The tuning parameter λ is determined using 5-fold cross validation based on the corresponding loss function as follows. For k = 1, . . . ,
5, let I_k be a random subsample of size n/5 from {1, 2, . . . , n}. For a loss function ℓ(γ), either ℓ_ML(γ) in (5) or ℓ_CAL(γ) in (11), denote by ℓ(γ; I) the loss function obtained when the sample average Ẽ() is computed over only the subsample I. The 5-fold cross-validation criterion is defined as

CV₅(λ) = (1/5) Σ_{k=1}^{5} ℓ(γ̂_λ^{(k)}; I_k),

where γ̂_λ^{(k)} is a minimizer of the penalized loss ℓ(γ; I_k^c) + λ‖γ_{1:p}‖₁ over the subsample I_k^c of size 4n/
5, i.e., the complement of I_k. Then λ is selected by minimizing CV₅(λ) over the discrete set {λ*/2^j : j = 0, 1, . . .}, where for π̂ = Ẽ(T), the value λ* is computed as either

λ* = max_{j=1,...,p} |Ẽ{(T − π̂) f_j(X)}|

when the likelihood loss (5) is used, or

λ* = max_{j=1,...,p} |Ẽ{(T/π̂ − 1) f_j(X)}|

when the calibration loss (11) is used. It can be shown that in either case, the penalized loss ℓ(γ) + λ‖γ_{1:p}‖₁ over the original sample has a minimum at γ_{1:p} = 0 for all λ ≥ λ*.

For computing α̂_RML or α̂_RWL, cross validation is conducted similarly as above, using the loss function ℓ_ML(α) in (2) or ℓ_WL(α; γ̂_RCAL) in (37). In the latter case, γ̂_RCAL is determined separately and then fixed during cross validation for computing α̂_RWL.

Let X = (X₁, . . . , X_p) be independent variables, where each X_j is N(0,
1) truncated to the interval (−2.5, 2.
5) and then standardized to have mean 0 and variance 1. In addition, let X† = (X†₁, . . . , X†_p), where X†_j = X_j for j = 5, . . . , p, and X†₁, X†₂, X†₃, and X†₄ are standardized versions of exp(0.5X₁), 10 + {1 + exp(X₁)}^{−1}X₂, (0.04X₁X₃ + 0.6)³, and (X₂ + X₄ + 20)², to have means 0 and variances 1. The truncation of X_j prevents propensity scores arbitrarily close to 0, and ensures that the mapping between X and X† is strictly one-to-one. See the Supplementary Material for the calculation used to perform the standardization and for scatter plots of (X†₁, . . . , X†₄). Consider the following data-generating configurations.

(C1) Generate T given X from a Bernoulli distribution with

P(T = 1|X) = [1 + exp{−(X†₁ − 0.5X†₂ + 0.25X†₃ + 0.125X†₄)}]^{−1},

and, independently, generate Y given X from a Normal distribution with variance 1 and mean either (“Linear outcome configuration 1”)

E(Y|X) = X†₁ + 0.5X†₂ + 0.25X†₃ + 0.125X†₄,

or (“Linear outcome configuration 2”)

E(Y|X) = 0.25X†₁ + 0.25X†₂ + 0.25X†₃ + 0.25X†₄.

The main difference between the two outcome configurations is that X†₁ is both the most important variable influencing the propensity score and the most important variable influencing the outcome regression function in the first configuration.

(C2) Generate T given X as in (C1), but, independently, generate Y given X from a Normal distribution with variance 1 and mean either (“Linear outcome configuration 1”)

E(Y|X) = X₁ + 0.5X₂ + 0.25X₃ + 0.125X₄,

or (“Linear outcome configuration 2”)

E(Y|X) = 0.25X₁ + 0.25X₂ + 0.25X₃ + 0.25X₄.

As X and X† are monotone transformations of each other, the variable X†₁ remains roughly both the most important variable influencing the propensity score and that influencing the outcome regression function in the first configuration.

(C3) Generate Y given X as in (C1), but, independently, generate T given X from a Bernoulli distribution with

P(T = 1|X) = [1 + exp{−(X₁ − 0.5X₂ + 0.25X₃ + 0.125X₄)}]^{−1}.
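A data-generating process of this type can be sketched as follows. The function names and the regression coefficients in the sketch are illustrative (the paper's exact coefficient values may differ), and the standardization of the transformed covariates is done empirically here rather than analytically as in the Supplementary Material:

```python
import math
import random

random.seed(1)

A = 2.5  # truncation point for the covariates

def trunc_normal():
    # rejection sampling from N(0, 1) truncated to (-A, A)
    while True:
        z = random.gauss(0, 1)
        if -A < z < A:
            return z

def gen_sample(n=800, p=10):
    raw = []
    for _ in range(n):
        x = [trunc_normal() for _ in range(p)]
        # nonlinear transforms of the first four covariates, in the
        # style of the Kang & Schafer (2007) design
        raw.append((x, [math.exp(0.5 * x[0]),
                        10 + x[1] / (1 + math.exp(x[0])),
                        (0.04 * x[0] * x[2] + 0.6) ** 3,
                        (x[1] + x[3] + 20) ** 2]))
    # empirical standardization of the transformed variables
    mean = [sum(r[1][j] for r in raw) / n for j in range(4)]
    sd = [math.sqrt(sum((r[1][j] - mean[j]) ** 2 for r in raw) / n)
          for j in range(4)]
    sample = []
    for x, v in raw:
        xt = list(x)                      # X†_j = X_j for j >= 5
        for j in range(4):
            xt[j] = (v[j] - mean[j]) / sd[j]
        # illustrative coefficients, not the paper's exact values
        lin = xt[0] - 0.5 * xt[1] + 0.25 * xt[2] + 0.125 * xt[3]
        t = 1 if random.random() < 1 / (1 + math.exp(-lin)) else 0
        y = (xt[0] + 0.5 * xt[1] + 0.25 * xt[2] + 0.125 * xt[3]
             + random.gauss(0, 1))
        sample.append((y, t, xt))
    return sample

sample = gen_sample()
print(len(sample), sum(t for _, t, _ in sample))
```

Fitting the models on X† versus on the untransformed X then produces the correctly specified and "nearly correct" misspecified scenarios described next.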
Table 1: Summary of results with linear outcome models (n = 800, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .071     .071       .072     .072       .077     .072†
 √EVar           .083     .083       .081     .080       .083     .083
 Cov90           .790     .822∗      .850     .848       .856     .837
 Cov95           .859     .891∗      .910     .915       .925     .912
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .063     .063       .062     .064       .069     .063†
 √EVar           .072     .073       .070     .069       .074     .074
 Cov90           .782     .826∗      .786     .865∗      .855     .838
 Cov95           .858     .885       .866     .918∗      .926     .901

Note: RML.RML denotes µ̂(m̂_RML, π̂_RML) and RCAL.RWL denotes µ̂(m̂_RWL, π̂_RCAL). Bias and √Var are respectively the Monte Carlo bias and standard deviation of the point estimates, √EVar is the square root of the mean of the variance estimates, and Cov90 or Cov95 is the coverage proportion of the 90% or 95% confidence intervals, based on 1000 repeated simulations. † indicates a case where the Monte Carlo variance from the competing method is at least 10% higher. ∗ indicates a coverage proportion at least 0.03 higher than that from the competing method.

As in Section 2, the observed data consist of independent and identically distributed observations {(T_iY_i, T_i, X_i) : i = 1, . . . , n}. Consider logistic propensity score model (4) and linear outcome model (8), both with f_j(X) = X†_j for j = 1, . . . , p. Then the two models can be classified as follows, depending on the data configuration above:

(C1) PS and OR models both correctly specified;
(C2) PS model correctly specified, but OR model misspecified;
(C3) PS model misspecified, but OR model correctly specified.

As demonstrated in Kang & Schafer (2007) for p = 4, the PS model (4) in scenario (C3), although misspecified, appears adequate as examined by conventional techniques for logistic regression. Similarly, the OR model (8) in the misspecified case (C2) can also be shown to be "nearly correct" by standard techniques for linear regression.
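Both point estimators compared in Table 1 share the augmented IPW form µ̂(m̂, π̂) = Ẽ[ϕ{Y, T, X; m̂, π̂}], and the variance estimates underlying √EVar are based on the sample variance of the estimated influence function. A minimal sketch, with generic plug-in models `m_hat` and `pi_hat` (the names are ours; the paper's intervals use the specific RCAL/RWL fits):

```python
import math

def aipw(data, m_hat, pi_hat, z=1.96):
    """Augmented IPW estimate of mu = E(Y^1) with a Wald-type interval.

    data: list of (y, t, x), with y meaningful only when t = 1;
    m_hat, pi_hat: fitted outcome-regression and propensity-score models.
    """
    n = len(data)
    phi = [m_hat(x) + t / pi_hat(x) * (y - m_hat(x)) for y, t, x in data]
    mu = sum(phi) / n
    var = sum((v - mu) ** 2 for v in phi) / n  # influence-function variance
    half = z * math.sqrt(var / n)
    return mu, (mu - half, mu + half)

# toy check: with m_hat = 0 and pi_hat = 1/2, treated units contribute 2y
toy = [(1.0, 1, None)] * 4 + [(0.0, 0, None)] * 4
mu, ci = aipw(toy, lambda x: 0.0, lambda x: 0.5)
print(mu, ci)  # mu = (4 * 2 + 4 * 0) / 8 = 1.0
```

The theoretical results in Section 3.3 are what justify treating (m̂_RWL, π̂_RCAL) as if fixed in this variance calculation even in high dimensions.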
On the other hand, neither

Figure 1: QQ plots of the t-statistics against standard normal with linear outcome models (n = 800, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

the PS model (4) in the correctly specified case (C1) or (C2) nor the OR model (8) in the correctly specified case (C1) or (C3) is used in Kang & Schafer (2007), where a correct PS and misspecified OR model (or a misspecified PS and correct OR model) involve two completely different sets of regressors. This aspect of the Kang–Schafer design needs to be modified in our study, where the same vector of regressors f(X) is used in models (4) and (8).

We conducted 1000 repeated simulations, each with the sample size n = 400 or 800 and the number of regressors p = 100 or 200. For n = 800 and p = 200, Table 1 summarizes the results for µ̂(m̂_RML, π̂_RML) and µ̂(m̂_RWL, π̂_RCAL) and their associated confidence intervals, and Figure 1 presents the QQ plots of the corresponding t-statistics. See the Supplementary Material for similar results obtained with other values of (n, p).

These results demonstrate several advantages of the proposed method. Compared with µ̂(m̂_RML, π̂_RML), the estimator µ̂(m̂_RWL, π̂_RCAL) has consistently smaller biases

Table 2: Summary of results with logistic outcome models (n = 800, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .023     .024       .023     .024       .024     .023
 √EVar           .026     .026       .025     .026       .027     .027
 Cov90           .814     .868∗      .841     .872∗      .845     .859
 Cov95           .876     .920∗      .916     .928       .914     .912
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      .—
 √Var            .024     .025       .024     .026       .026     .025
 √EVar           .026     .026       .026     .027       .027     .027
 Cov90           .849     .876       .864     .879       .870     .865
 Cov95           .909     .936       .925     .933       .931     .927
Note: See the footnote of Table 1. For scenario (C5) ("mis OR"), the true value µ is not analytically available but is calculated using Monte Carlo integration, as shown in the Supplementary Material.

in absolute values, and either similar or noticeably smaller variances, for example, in the case of a misspecified PS model and correct OR model. The coverage proportions of confidence intervals based on µ̂(m̂_RWL, π̂_RCAL) are similar to or noticeably higher than those based on µ̂(m̂_RML, π̂_RML), although both coverage proportions fall below the nominal probabilities to various degrees. From the QQ plots, the t-statistics based on µ̂(m̂_RWL, π̂_RCAL) also appear to be more closely aligned with the standard normal than those based on µ̂(m̂_RML, π̂_RML).

For simulations with binary outcomes, let X and X† be as in Section 4.1. Consider the following data-generating configurations, in parallel to (C1)–(C3).

(C4) Generate T given X as in (C1) and, independently, generate Y given X from a Bernoulli distribution with probability ("Logistic outcome configuration 1")

P(Y = 1|X) = [1 + exp{−(X†₁ + 0.5X†₂ + 0.25X†₃ + 0.125X†₄)}]^{−1},

or ("Logistic outcome configuration 2")

Figure 2: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 800, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

P(Y = 1|X) = [1 + exp{−(0.25X†₁ + 0.25X†₂ + 0.25X†₃ + 0.25X†₄)}]^{−1}.

(C5) Generate T given X as in (C1), and, independently, generate Y given X from a Bernoulli distribution with probability ("Logistic outcome configuration 1")

P(Y = 1|X) = [1 + exp{−(X₁ + 0.5X₂ + 0.25X₃ + 0.
125X₄)}]^{−1},

or ("Logistic outcome configuration 2")

P(Y = 1|X) = [1 + exp{−(0.25X₁ + 0.25X₂ + 0.25X₃ + 0.25X₄)}]^{−1}.

(C6) Generate Y given X as in (C4), and, independently, generate T given X as in (C3).

Consider logistic propensity score model (4) and logistic outcome model (35), both with f_j(X) = X†_j for j = 1, . . . , p. Then the two models are correctly specified in scenario (C4), only PS model (4) is correctly specified in scenario (C5), and only OR model (35) is correctly specified in scenario (C6), similarly as in Section 4.1.

For n = 800 and p = 200, Table 2 and Figure 2 present the results from 1000 repeated simulations for µ̂(m̂_RML, π̂_RML) and µ̂(m̂_RWL, π̂_RCAL) and their associated confidence intervals. Similar conclusions can be drawn as from Table 1 and Figure 1. It is interesting that the coverage proportions of confidence intervals based on µ̂(m̂_RWL, π̂_RCAL) are noticeably higher (and closer to the nominal probabilities) than those based on µ̂(m̂_RML, π̂_RML) in the case where both PS and OR models are correctly specified. This difference can also be seen from the QQ plots. The confidence intervals from both methods appear to yield reasonable coverage proportions when the PS model is misspecified but the OR model is correctly specified, even though these results are not necessarily predicted by asymptotic theory. See the Supplementary Material for additional results from simulations with other values of (n, p).

We provide an empirical application to a medical study in Connors et al. (1996) on the effects of right heart catheterization (RHC). The study included n = 5735 critically ill patients admitted to the intensive care units of 5 medical centers. For each patient, the data consist of treatment status T (= 1 if RHC was used within 24 hours of admission and 0 otherwise), health outcome Y (survival time up to 30 days), and a list of 75 covariates X specified by medical specialists in critical care.
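The regressor dictionary used in this application — all main effects plus two-way interactions, with overly sparse columns dropped — can be sketched as follows. The function `build_features` and its `min_nonzero` argument are our names, not the paper's:

```python
from itertools import combinations

def build_features(rows, min_nonzero):
    """Return named columns: main effects of the covariates plus all
    two-way interactions, dropping any column with fewer than
    min_nonzero nonzero values."""
    p = len(rows[0])
    cols = [("x%d" % j, [r[j] for r in rows]) for j in range(p)]
    cols += [("x%d:x%d" % (j, k), [r[j] * r[k] for r in rows])
             for j, k in combinations(range(p), 2)]
    return [(name, col) for name, col in cols
            if sum(1 for v in col if v != 0) >= min_nonzero]

# tiny example: x1 and the x0:x1 interaction are too sparse to keep
rows = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print([name for name, _ in build_features(rows, min_nonzero=2)])
```

For the RHC data, the analogous filter described below keeps terms with at least 46 nonzero values, giving p = 1855 columns after standardization.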
For previous analyses, propensity score and outcome regression models were employed either with main effects only (Hirano & Imbens 2002; Vermeulen & Vansteelandt 2015) or with interaction terms manually added (Tan 2006). To explore dependency beyond main effects, we consider a logistic propensity score model (4) and a logistic outcome model (35) for 30-day survival status 1{Y > 30}, with the vector f(X) including all main effects and two-way interactions of X except those with fewer than 46 nonzero values (i.e., 0.8% of the sample size 5735). The dimension of f(X) is p = 1855, excluding the constant. All variables in f(X) are standardized to have sample means 0 and variances 1. We apply the augmented IPW estimators of the form µ̂(m̂_RWL, π̂_RCAL) using regularized calibrated (RCAL) estimation and the corresponding estimators of the form µ̂(m̂_RML, π̂_RML) using regularized maximum likelihood (RML) estimation, similarly as in the simulation study. The Lasso tuning parameter λ is selected by cross validation over a discrete set {λ*/2^{j/4} : j = 0, 1, . . .}, where λ* is the value leading to a zero solution γ₁ = · · · = γ_p = 0. We also compute the (ratio) IPW estimators, such as µ̂_rIPW, along with nominal standard errors obtained by ignoring the data-dependency of the fitted propensity scores.

Table 3: Estimates of 30-day survival probabilities and ATE

        IPW                         Augmented IPW
        RML          RCAL           RML.RML      RCAL.RWL
 µ̂₁    0.— ± .026   0.— ± .023    0.— ± .021   0.— ± .—
 µ̂₀    0.— ± .017   0.— ± .017    0.— ± .016   0.— ± .—
 ATE    −0.— ± .—    −0.— ± .—     −0.— ± .—    −0.— ± .—

Note: Estimate ± 2× standard error, including nominal standard errors for IPW.

Figure 3: Boxplots of inverse probability weights within the treated (left) and untreated (middle) groups, each normalized to sum to the sample size n, and QQ plots with a 45-degree line of the standardized sample influence functions based on ϕ(Y, T, X; ·) in (7) for ATE (right). (Panel labels: RML, RCAL; RML.RML, RCAL.RWL.)

Table 3 shows various estimates of survival probabilities and ATE. The IPW estimates from RCAL estimation of propensity scores have noticeably smaller nominal standard errors than those from RML estimation, for example, with relative efficiency (0.026/0.023)² = 1.
28 for estimation of µ₁. This improvement can also be seen from Figure 3, where the RCAL inverse probability weights are much less variable than the RML weights. See Tan (2017) for additional results on covariate balance and parameter sparsity from RML and RCAL estimation of propensity scores. The augmented IPW estimates and confidence intervals from RCAL and RML estimation are similar to each other. However, the validity of the RML confidence intervals depends on both PS and OR models being correctly specified, whereas that of the RCAL confidence intervals holds even when the OR model is misspecified. While assessment of this difference is difficult with real data, Figure 3 shows that the sample influence functions for ATE using RCAL estimation appear to be more normally distributed, especially in the tails, than those using RML estimation. Finally, the augmented IPW estimates here are smaller in absolute value, and also have smaller standard errors, than previous estimates based on main-effect models, about −0.— ± 2 × .
015 (Vermeulen & Vansteelandt 2015). The reduction in standard errors might be explained by the well-known property that an augmented IPW estimator has a smaller asymptotic variance when obtained using a larger (correctly specified) propensity score model.
References
Athey, S., Imbens, G.W., and Wager, S. (2016) "Approximate residual balancing: De-biased inference of average treatment effects in high dimensions," arXiv:1604.07125.

Belloni, A., Chernozhukov, V., Fernandez-Val, I., and Hansen, C. (2017) "Program evaluation and causal inference with high-dimensional data," Econometrica, 85, 233–298.

Bickel, P., Ritov, Y., and Tsybakov, A.B. (2009) "Simultaneous analysis of Lasso and Dantzig selector," Annals of Statistics, 37, 1705–1732.

Buhlmann, P. and van de Geer, S. (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications, New York: Springer.

Chan, K.C.G., Yam, S.C.P., and Zhang, Z. (2016) "Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting," Journal of the Royal Statistical Society, Ser. B, 78, 673–700.

Connors, A.F., Speroff, T., Dawson, N.V., et al. (1996) "The effectiveness of right heart catheterization in the initial care of critically ill patients," Journal of the American Medical Association, 276, 889–897.

Farrell, M.H. (2015) "Robust inference on average treatment effects with possibly more covariates than observations," Journal of Econometrics, 189, 1–23.

Folsom, R.E. (1991) "Exponential and logistic weight adjustments for sampling and nonresponse error reduction," Proceedings of the American Statistical Association, Social Statistics Section, 197–202.

Friedman, J., Hastie, T., and Tibshirani, R. (2010) "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, 33, 1–22.

Graham, B.S., de Xavier Pinto, C.C., and Egel, D. (2012) "Inverse probability tilting for moment condition models with missing data," Review of Economic Studies, 79, 1053–1079.

Hahn, J. (1998) "On the role of the propensity score in efficient semiparametric estimation of average treatment effects," Econometrica, 66, 315–331.

Hainmueller, J. (2012) "Entropy balancing for causal effects: Multivariate reweighting method to produce balanced samples in observational studies," Political Analysis, 20, 25–46.

Hirano, K. and Imbens, G.W. (2002) "Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization," Health Services and Outcomes Research Methodology, 2, 259–278.

Huang, J. and Zhang, C.-H. (2012) "Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications," Journal of Machine Learning Research, 13, 1839–1864.

Imai, K. and Ratkovic, M. (2014) "Covariate balancing propensity score," Journal of the Royal Statistical Society, Ser. B, 76, 243–263.

Javanmard, A. and Montanari, A. (2014) "Confidence intervals and hypothesis testing for high-dimensional regression," Journal of Machine Learning Research, 15, 2869–2909.

Kang, J.D.Y. and Schafer, J.L. (2007) "Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data" (with discussion), Statistical Science, 22, 523–539.

Kim, J.K. and Haziza, D. (2014) "Doubly robust inference with missing data in survey sampling," Statistica Sinica, 24, 375–394.

Manski, C.F. (1988) Analog Estimation Methods in Econometrics, New York: Chapman & Hall.

McCullagh, P. and Nelder, J. (1989) Generalized Linear Models (2nd edition), New York: Chapman & Hall.

Negahban, S.N., Ravikumar, P., Wainwright, M.J., and Yu, B. (2012) "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers," Statistical Science, 27, 538–557.

Neyman, J. (1923) "On the application of probability theory to agricultural experiments: Essay on principles, Section 9," translated in Statistical Science, 1990, 5, 465–480.

Osborne, M., Presnell, B., and Turlach, B. (2000) "A new approach to variable selection in least squares problems," IMA Journal of Numerical Analysis, 20, 389–404.

Robins, J.M., Rotnitzky, A., and Zhao, L.P. (1994) "Estimation of regression coefficients when some regressors are not always observed," Journal of the American Statistical Association, 89, 846–866.

Rosenbaum, P.R. and Rubin, D.B. (1983) "The central role of the propensity score in observational studies for causal effects," Biometrika, 70, 41–55.

Rosenbaum, P.R. and Rubin, D.B. (1984) "Reducing bias in observational studies using subclassification on the propensity score," Journal of the American Statistical Association, 79, 516–524.

Rubin, D.B. (1976) "Inference and missing data," Biometrika, 63, 581–590.

Sarndal, C.E., Swensson, B., and Wretman, J.H. (1992) Model Assisted Survey Sampling, New York: Springer.

Tan, Z. (2006) "A distributional approach for causal inference using propensity scores," Journal of the American Statistical Association, 101, 1619–1637.

Tan, Z. (2007) "Comment: Understanding OR, PS, and DR," Statistical Science, 22, 560–568.

Tan, Z. (2010) "Bounded, efficient, and doubly robust estimation with inverse weighting," Biometrika, 97, 661–682.

Tan, Z. (2017) "Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data," arXiv:1710.08074.

Tibshirani, R. (1996) "Regression shrinkage and selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Tsiatis, A.A. (2006) Semiparametric Theory and Missing Data, New York: Springer.

van de Geer, S., Buhlmann, P., Ritov, Y., and Dezeure, R. (2014) "On asymptotically optimal confidence regions and tests for high-dimensional models," Annals of Statistics, 42, 1166–1202.

Vermeulen, K. and Vansteelandt, S. (2015) "Bias-reduced doubly robust estimation," Journal of the American Statistical Association, 110, 1024–1036.

White, H. (1982) "Maximum likelihood estimation of misspecified models," Econometrica, 50, 1–25.

Zhang, C.-H. and Zhang, S.S. (2014) "Confidence intervals for low-dimensional parameters with high-dimensional data," Journal of the Royal Statistical Society, Ser. B, 76, 217–242.

Supplementary Material for "Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data"
Zhiqiang Tan

The Supplementary Material contains Appendices I–II.
I Additional results for simulation study
I.1 Results for simulation setup
Denote by φ() the probability density function and Φ() the cumulative distribution function of N(0, 1). For a = 2.5, let Z be N(0, 1) truncated to the interval (−a, a), with density function φ(z)/c for z ∈ (−a, a) and 0 otherwise, where c = Φ(a) − Φ(−a) = 2Φ(a) − 1. Then E(Z) = 0 and var(Z) = 1 − 2aφ(a)/c, denoted as b².

Let (X₁, . . . , X₄) = (Z₁, . . . , Z₄)/b, where (Z₁, . . . , Z₄) are independent variables, each from N(0, 1) truncated to (−a, a), so that each X_j has mean 0 and variance 1. The variables (X†₁, . . . , X†₄) are determined by standardization from (X₁, . . . , X₄) using the following results, where t = 1/(2b) and s = 1/b.

• E(e^{0.5X₁}) = e^{t²/2}{Φ(a − t) − Φ(−a − t)}/c, and var(e^{0.5X₁}) = e^{s²/2}{Φ(a − s) − Φ(−a − s)}/c − {E(e^{0.5X₁})}².

• E[{1 + exp(X₁)}^{−1}X₂] = 0, and var[{1 + exp(X₁)}^{−1}X₂] = c^{−1} ∫_{−a}^{a} {1 + exp(z/b)}^{−2} φ(z) dz, evaluated by numerical integration.

• E{(0.04X₁X₃ + 0.6)³} = 3(0.04)²(0.6) + (0.6)³, and E{(0.04X₁X₃ + 0.6)⁶} = m₆²(0.04)⁶ + 15m₄²(0.04)⁴(0.6)² + 15(0.04)²(0.6)⁴ + (0.6)⁶, where m₄ = b^{−4}c^{−1} ∫_{−a}^{a} z⁴φ(z) dz = b^{−4}c^{−1}{3c − 2a(a² + 3)φ(a)} and m₆ = b^{−6}c^{−1} ∫_{−a}^{a} z⁶φ(z) dz = b^{−6}c^{−1}{15c − 2a(a⁴ + 5a² + 15)φ(a)}.

• E{(X₂ + X₄ + 20)²} = 2 + 20², and E{(X₂ + X₄ + 20)⁴} = (2m₄ + 6) + 12(20²) + 20⁴.

For binary outcomes in scenarios (C4) and (C6), the true value µ = E{m*(X)} is estimated by Monte Carlo integration, using 100 repeated samples of (X₁, . . . , X₄), each of size 10⁶. The resulting estimates of µ are 0.— and 0.— for the two outcome configurations.

I.2 Additional simulation results
Figure S1 shows the scatter plots of the variables (X†₁, X†₂, X†₃, X†₄), which are correlated with each other as would be found in real data.

Tables S1–S3 and Figures S2–S4 present additional simulation results from Section 4.1 with linear outcome models, similarly as Table 1 and Figure 1 but for different values of (n, p). Tables S4–S6 and Figures S5–S7 present additional simulation results from Section 4.2 with logistic outcome models, similarly as Table 2 and Figure 2 but for different values of (n, p). Similar conclusions can be drawn as discussed in Sections 4.1–4.2.

Figure S1: Scatter plots of (X†₁, X†₂, X†₃, X†₄) from a sample of size n = 800. (Panels labeled var 1 to var 4.)

Table S1: Summary of results with linear outcome models (n = 400, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .097     .097       .097     .099       .105     .099†
 √EVar           .108     .110       .111     .112       .109     .109
 Cov90           .787     .829∗      .844     .853       .848     .845
 Cov95           .862     .883       .915     .916       .916     .920
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .086     .087       .085     .088       .093     .088†
 √EVar           .094     .096       .092     .093       .096     .096
 Cov90           .799     .833∗      .828     .859∗      .864     .860
 Cov95           .879     .898       .896     .926∗      .927     .919

Note: See the footnote of Table 1.
Table S2: Summary of results with linear outcome models (n = 800, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .071     .071       .073     .072       .078     .072†
 √EVar           .079     .080       .084     .083       .081     .080
 Cov90           .829     .836       .845     .852       .889     .881
 Cov95           .889     .901       .905     .909       .938     .929
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .064     .063       .064     .064       .070     .064†
 √EVar           .071     .072       .072     .070       .072     .071
 Cov90           .814     .830       .782     .850∗      .896     .875
 Cov95           .880     .893       .862     .912∗      .941     .924

Note: See the footnote of Table 1.

Table S3: Summary of results with linear outcome models (n = 400, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .095     .096       .096     .099       .102     .098
 √EVar           .109     .110       .112     .113       .118     .116
 Cov90           .770     .820∗      .819     .834       .823     .829
 Cov95           .845     .884∗      .895     .903       .893     .896
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .085     .086       .084     .087       .091     .087
 √EVar           .092     .094       .094     .095       .103     .103
 Cov90           .788     .833∗      .814     .853∗      .842     .836
 Cov95           .877     .905       .884     .913       .904     .896

Note: See the footnote of Table 1.
Table S4: Summary of results with logistic outcome models (n = 400, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .032     .033       .032     .034       .033     .033
 √EVar           .038     .038       .038     .038       .037     .037
 Cov90           .776     .835∗      .804     .845∗      .834     .852
 Cov95           .864     .911∗      .877     .901       .899     .904
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      .—
 √Var            .033     .034       .034     .035       .035     .034
 √EVar           .035     .036       .035     .035       .038     .038
 Cov90           .852     .876       .876     .883       .863     .864
 Cov95           .910     .931       .931     .942       .921     .924
Note: See the footnote of Table 2.

Table S5: Summary of results with logistic outcome models (n = 800, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .023     .024       .024     .025       .024     .024
 √EVar           .026     .026       .026     .027       .026     .026
 Cov90           .816     .868∗      .851     .868       .877     .870
 Cov95           .896     .924       .924     .929       .928     .925
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        .—       .—
 √Var            .024     .025       .025     .026       .027     .025
 √EVar           .026     .026       .027     .028       .028     .027
 Cov90           .842     .870       .841     .867       .881     .862
 Cov95           .913     .926       .911     .923       .940     .925
Note: See the footnote of Table 2.
Table S6: Summary of results with logistic outcome models (n = 400, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .032     .033       .032     .034       .033     .033
 √EVar           .037     .037       .039     .038       .036     .036
 Cov90           .754     .826∗      .773     .827∗      .834     .866
 Cov95           .833     .907∗      .852     .897∗      .898     .917
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      .—
 √Var            .032     .034       .033     .035       .034     .034
 √EVar           .035     .036       .037     .037       .037     .037
 Cov90           .858     .884       .848     .864       .858     .857
 Cov95           .915     .936       .904     .932       .916     .926
Note: See the footnote of Table 2.
Figure S2: QQ plots of the t-statistics against standard normal with linear outcome models (n = 400, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S3: QQ plots of the t-statistics against standard normal with linear outcome models (n = 800, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S4: QQ plots of the t-statistics against standard normal with linear outcome models (n = 400, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S5: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 400, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S6: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 800, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×).
For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S7: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 400, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

II Technical details
II.1 Inside Theorem 1
The following result (ii) is taken from Tan (2017), Lemma 1(ii), and result (i) can be shown similarly using Lemma 14 in Section II.8 and the union bound.
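Before the formal statements, the scaling these sup-norm events deliver can be illustrated numerically: for bounded (hence sub-gaussian) mean-zero variables, Hoeffding's inequality combined with a union bound over the 1 + p terms controls the maximum absolute sample mean at the order √(log(p)/n). A quick sketch of this scaling (an illustration only, not part of the proofs):

```python
import math
import random

random.seed(2)

def max_abs_mean(n, p):
    # sup_j |E~(Z_j)| over p independent samples of mean-zero
    # Rademacher variables, each bounded by 1
    best = 0.0
    for _ in range(p):
        s = sum(random.choice((-1.0, 1.0)) for _ in range(n)) / n
        best = max(best, abs(s))
    return best

# the realized maximum tracks sqrt(log(p) / n) as n and p vary
for n, p in [(400, 100), (400, 1000), (1600, 1000)]:
    print(n, p, round(max_abs_mean(n, p), 3),
          round(math.sqrt(math.log(p) / n), 3))
```

This is the mechanism behind the choices of λ₀ of order √(log(p)/n) in the lemmas below.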
Lemma 1. (i) Denote by Ω₁ the event that

sup_{j=0,1,...,p} |Ẽ[{−T e^{−h̄_CAL(X)} + (1 − T)} f_j(X)]| ≤ λ₀.

Under Assumption 1(i)–(ii), if λ₀ ≥ √2(e^{−B₀} + 1)C₀ √(log{(1 + p)/ǫ}/n), then P(Ω₁) ≥ 1 − ǫ.

(ii) Denote by Ω₂ the event that

sup_{j,k=0,1,...,p} |(Σ̃_γ)_{jk} − (Σ_γ)_{jk}| ≤ λ₀.   (S1)

Under Assumption 1(i)–(ii), if λ₀ ≥ 4e^{−B₀}C₀² √(log{(1 + p)²/ǫ}/n), then P(Ω₂) ≥ 1 − ǫ.

Take λ₀ = C₁ √(log{(1 + p)/ǫ}/n) with C₁ = max{√2(e^{−B₀} + 1)C₀, 4e^{−B₀}C₀²}. Then under the conditions of Theorem 1, inequality (26) holds in the event Ω₁ ∩ Ω₂, with probability at least 1 − 2ǫ, by the proof of Tan (2017, Corollary 2).

II.2 Probability lemmas
Lemma 2.
Denote by Ω₃ the event that

sup_{j=0,1,...,p} |Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f_j(X)]| ≤ λ₀.   (S2)

Under Assumptions 1(i)–(ii) and 2(i), if λ₀ ≥ e^{−B₀}C₀ √(D₀² + D₁²) √(log{(1 + p)/ǫ}/n), then P(Ω₃) ≥ 1 − ǫ.

Proof.
Let Z_j = T w(X; γ̄_CAL){Y − m̄_WL(X)} f_j(X) for j = 0, . . . , p. Then E(Z_j) = 0 by the definition of ᾱ_WL. Under Assumption 1(i)–(ii), |Z_j| ≤ e^{−B₀}C₀ |T{Y − m̄_WL(X)}|. By Assumption 2(i), the variables (Z₀, Z₁, . . . , Z_p) are uniformly sub-gaussian: max_{j=0,1,...,p} D₀′² E{exp(Z_j²/D₀′²) − 1} ≤ D₁′², with D₀′ = e^{−B₀}C₀D₀ and D₁′ = e^{−B₀}C₀D₁. Therefore, the result holds by Lemma 15 in Section II.8 and the union bound. □

Denote Σ_α = E[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X) fᵀ(X)], and Σ̃_α = Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X) fᵀ(X)], the sample version of Σ_α.

Lemma 3. Denote by Ω₄ the event that

sup_{j,k=0,1,...,p} |(Σ̃_α)_{jk} − (Σ_α)_{jk}| ≤ (D₀² + D₀D₁) λ₀.   (S3)

Under Assumptions 1(i)–(ii) and 2(i), if (D₀² + D₀D₁) λ₀ ≥ e^{−B₀}C₀² [D₀² log{(1 + p)²/ǫ}/n + D₀D₁ √(log{(1 + p)²/ǫ}/n)], then P(Ω₄) ≥ 1 − ǫ.

Proof.
For any j, k = 0, 1, . . . , p, the variable T w(X; γ̄_CAL){Y − m̄_WL(X)} f_j(X) f_k(X) is the product of w(X; γ̄_CAL) f_j(X) f_k(X) and T{Y − m̄_WL(X)}, where |w(X; γ̄_CAL) f_j(X) f_k(X)| ≤ e^{−B₀}C₀² by Assumptions 1(i)–(ii) and T{Y − m̄_WL(X)} is sub-gaussian by Assumption 2(i). Applying Lemmas 16 and 18 in Section II.8 yields

P{|(Σ̃_α)_{jk} − (Σ_α)_{jk}| > e^{−B₀}C₀²(D₀² t + 2D₀D₁√t)} ≤ ǫ/(1 + p)²,

for j, k = 0, 1, . . . , p, where t = log{(1 + p)²/ǫ}/n. The result then follows from the union bound. □

Denote Σ_α⁺ = E[T w(X; γ̄_CAL)|Y − m̄_WL(X)| f(X) fᵀ(X)], and Σ̃_α⁺ = Ẽ[T w(X; γ̄_CAL)|Y − m̄_WL(X)| f(X) fᵀ(X)], the sample version of Σ_α⁺.

Lemma 4.
Denote by Ω_α3 the event that
sup_{j,k=0,1,...,p} | (Σ̃_α1)_{jk} − (Σ_α1)_{jk} | ≤ √( 8 (D₀² + D₁²) ) λ,   (S4)
Under Assumptions 1(i)–(ii) and 2(i), if λ ≥ e^{−B₀} C₀² √( log{(1+p)²/ε} / n ), then P(Ω_α3) ≥ 1 − ε.

Proof.
The variables T w(X; γ̄_CAL) |Y − m̄_WL(X)| f_j(X) f_k(X) for j, k = 0, 1, ..., p are uniformly sub-gaussian, because |w(X; γ̄_CAL) f_j(X) f_k(X)| ≤ e^{−B₀} C₀² by Assumptions 1(i)–(ii) and T|Y − m̄_WL(X)| is sub-gaussian by Assumption 2(i). Applying Lemma 15 yields
P{ |(Σ̃_α1)_{jk} − (Σ_α1)_{jk}| > t } ≤ ε/(1+p)²,
for j, k = 0, 1, ..., p, where t = e^{−B₀} C₀² √( 8 (D₀² + D₁²) ) √( log{(1+p)²/ε} / n ). The result then follows from the union bound. □

Denote Σ_f = E[ f(X) fᵀ(X) ] and Σ̃_f = Ẽ[ f(X) fᵀ(X) ], the sample version of Σ_f.

Lemma 5. Denote by Ω_f the event that
sup_{j,k=0,1,...,p} | (Σ̃_f)_{jk} − (Σ_f)_{jk} | ≤ e^{B₀} λ,   (S5)
Under Assumption 1(i), if λ ≥ 2 e^{−B₀} C₀² √( log{(1+p)²/ε} / n ), then P(Ω_f) ≥ 1 − ε.

Proof.
This result follows directly from Lemma 14 and the union bound, with |f_j(X) f_k(X)| ≤ C₀² and hence |f_j(X) f_k(X) − (Σ_f)_{jk}| ≤ 2C₀². □

II.3 Proof of Theorems 2 and 5
Throughout this section, suppose that Assumption 1 holds. The proof of Theorem 5 is completed by combining Lemmas 2–3 and 6–12. Theorem 2 is a special case of Theorem 5, where Assumptions 3(ii)–(iv) are satisfied with C₃ = 1 and C₄ = η₃ = η₄ = 0.

Lemma 6.
For any coefficient vector α and h(X) = αᵀ f(X), we have
D†_WL(ĥ_RWL, h; γ̂_RCAL) + λ ‖α̂_RWL,1:p‖₁ ≤ (α̂_RWL − α)ᵀ Ẽ[ T w(X; γ̂_RCAL) {Y − m(X; α)} f(X) ] + λ ‖α_{1:p}‖₁.   (S6)

Proof. For any u ∈ (0, 1), the definition of α̂_RWL implies
ℓ_RWL(α̂_RWL; γ̂_RCAL) + λ ‖α̂_RWL,1:p‖₁ ≤ ℓ_RWL{(1 − u) α̂_RWL + u α; γ̂_RCAL} + λ ‖(1 − u) α̂_RWL,1:p + u α_{1:p}‖₁,
which, by the convexity of ‖·‖₁, gives
ℓ_RWL(α̂_RWL; γ̂_RCAL) − ℓ_RWL{(1 − u) α̂_RWL + u α; γ̂_RCAL} + λ u ‖α̂_RWL,1:p‖₁ ≤ λ u ‖α_{1:p}‖₁.
Dividing both sides of the preceding inequality by u and letting u → 0+ yields
−Ẽ[ T w(X; γ̂_RCAL) {Y − m̂_RWL(X)} {ĥ_RWL(X) − h(X)} ] + λ ‖α̂_RWL,1:p‖₁ ≤ λ ‖α_{1:p}‖₁,
which leads to (S6) after simple rearrangement using (38). □

Lemma 7.
In the event Ω ∩ Ω , we have ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ e η M | S γ | λ , (S7)15 nd for any function h ( X ) , D † WL (ˆ h RWL , h ; ˆ γ RCAL ) ≥ e − η D † WL (ˆ h RWL , h ; ¯ γ CAL ) , (S8) where η = ( A − − M η C . Proof.
By direct calculation from the definition of D CAL (), we find D CAL (ˆ h RCAL , ¯ h CAL ) = − ˜ E h T n e − ˆ h ( X ) − e − ¯ h ( X ) o { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i = ˜ E h T e − u (ˆ γ − ¯ γ ) T f ( X ) w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i for some u ∈ (0 , − ˆ h ( X ) − e − ¯ h ( X ) = − e − u ˆ h ( X ) − (1 − u )¯ h ( X ) (ˆ γ RCAL − ¯ γ CAL ) T f ( X ) . (S9)In the event Ω ∩ Ω that (26) holds, we have k ˆ γ RCAL − ¯ γ CAL k ≤ ( A − − M | S γ | λ ≤ ( A − − M η , (S10)by Assumption 1(iv), | S γ | λ ≤ η , and hence M | S γ | λ ≥ D CAL (ˆ h RCAL , ¯ h CAL ) ≥ e − η ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i , which gives the desired inequality (S7). In addition, we write D † WL (ˆ h RWL , h ; ˆ γ RCAL )= ˜ E (cid:16) T w ( X ; ˆ γ RCAL ) h ψ { ˆ h RWL ( X ) } − ψ { h ( X ) } i { ˆ h RWL ( X ) − h ( X ) } (cid:17) = ˜ E (cid:16) T e − (ˆ γ − ¯ γ ) T f ( X ) w ( X ; ¯ γ CAL ) h ψ { ˆ h RWL ( X ) } − ψ { h ( X ) } i { ˆ m RWL ( X ) − h ( X ) } (cid:17) , which, in the event Ω ∩ Ω , yields inequality (S8) by (S10) and Assumption 1(i). (cid:3) For two functions h ( x ) and h ′ ( x ), denote Q WL ( h ′ , h ; γ ) = ˜ E (cid:2) T w ( X ; γ ) { h ′ ( X ) − h ( X ) } (cid:3) . Lemma 8.
Take α = ¯ α WL and h ( X ) = ¯ α WL f ( X ) . Suppose that Assumption 2(i) holds. Thenin the event Ω ∩ Ω ∩ Ω , (S6) implies e − η D † WL (ˆ h RWL , h ; ¯ γ CAL ) + λ k ˆ α RWL , p k ≤ ( ˆ α RWL − α ) T ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } f ( X ) (cid:3) + λ k α p k + e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , h ; ¯ γ CAL ) } / , where M = ( D + D )(e η M + η ) + ( D + D D ) η , and η = ( A − − M η . roof. Consider the following decomposition,( ˆ α RWL − α ) T ˜ E (cid:2) T w ( X ; ˆ γ RCAL ) { Y − m ( X ; α ) } f ( X ) (cid:3) = ( ˆ α RWL − α ) T ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } f ( X ) (cid:3) + ˜ E h T n e − ˆ h ( X ) − e − ¯ h ( X ) o { Y − m ( X ; α ) }{ ˆ h RWL ( X ) − h ( X ) } i , (S11)denoted as ∆ + ∆ . By the mean value equation (S9) and the Cauchy–Schwartz inequality,the second term ∆ can be bounded from above as∆ ≤ e C k ˆ γ − ¯ γ k ˜ E / h T w ( X ; ¯ γ CAL ) { ˆ h RWL ( X ) − h ( X ) } i × ˜ E / h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i . (S12)We upper-bound the third term on the right hand side in several steps. First, in the event Ω ,we have by inequality (S3),( ˜ E − E ) h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ ( D + D D ) λ k ˆ γ RCAL − ¯ γ CAL k , where, by some abuse of notation, ( ˜ E − E )( Z ) denotes n − P ni =1 { Z i − E ( Z ) } for a variable Z that is a function of ( T, Y, X ). Second, by Assumption 2(i) and Lemma 17, E [ { Y − m ( X ; α ) } | X ] ≤ D + D and hence E h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ ( D + D ) E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i . Third, in the event Ω , we have by inequality (S1),( E − ˜ E ) h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ λ k ˆ γ RCAL − ¯ γ CAL k . 
Combining the preceding inequalities, we have in the event Ω ∩ Ω ,˜ E h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ ( D + D D ) λ k ˆ γ RCAL − ¯ γ CAL k + ( D + D ) n λ k ˆ γ RCAL − ¯ γ CAL k + ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } io . (S13)The desired result follows by collecting inequalities (S11)–(S13) and applying (S7), (S8) and(S10) in the event Ω ∩ Ω . (cid:3) emma 9. Denote b = ˆ α RWL − ¯ α WL . Suppose that Assumption 2(i) holds. In the event Ω ∩ Ω ∩ Ω ∩ Ω , we have e − η D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) + ( A − λ k b k ≤ e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) } / + 2 A λ X j ∈ S α | b j | . (S14) Proof.
In the event Ω , we have b T ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { Y − ¯ m WL ( X ) } f ( X ) (cid:3) ≤ λ k b k . From this bound and Lemma 8 with α = ¯ α WL , we have in the event Ω ∩ Ω ∩ Ω ∩ Ω ,e − η D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) + A λ k ˆ α RWL , p k ≤ λ k b k + A λ k ¯ α WL , p k + e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) } / . Applying to the preceding inequality the identity | ˆ α RWL ,j | = | ˆ α RWL ,j − ¯ α WL ,j | for j S α andthe triangle inequality | ˆ α RWL ,j | ≥ | ¯ α WL ,j | − | ˆ α RWL ,j − ¯ α WL ,j | , j ∈ S α \{ } , and rearranging the result givese − η D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) + ( A − λ k b p k ≤ λ | b | + 2 A λ X j ∈ S α \{ } | b j | + e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) } / . The conclusion follows by adding ( A − λ | b | to both sides above. (cid:3) Denote ˜Σ α = ˜ E [ T w ( X ; ¯ γ CAL ) ψ { ¯ h WL ( X ) } f ( X ) f T ( X )]. Lemma 10.
Suppose that Assumption 3(iii) holds. Then for any h = α f and h ′ = α ′ T f , D † WL ( h, h ′ ; ¯ γ CAL ) ≥ − e − C k b k C k b k (cid:16) b T ˜Σ α b (cid:17) , where b = α ′ − α and C = C C . Throughout, set (1 − e − c ) /c = 1 for c = 0 . Proof.
Set γ = ¯ γ CAL . By direct calculation, we have D † WL ( h, h ′ ; γ ) = ˜ E (cid:0) T w ( X ; γ ) (cid:2) ψ { h ′ ( X ) } − ψ { h ( X ) } (cid:3) (cid:8) h ′ ( X ) − h ( X ) (cid:9)(cid:1) = ˜ E (cid:20) T w ( X ; γ ) (cid:18)Z ψ (cid:2) h ( X ) + u (cid:8) h ′ ( X ) − h ( X ) (cid:9)(cid:3) d u (cid:19) (cid:8) h ′ ( X ) − h ( X ) (cid:9) (cid:21) .
18y Assumption 3(iii) and the fact that | h ′ ( X ) − h ( X ) | ≤ { sup j =0 , ,...,p | f j ( X ) |} k α ′ − α k ≤ C k α ′ − α k by Assumption 1(i), it follows that D † WL ( h, h ′ ; γ ) ≥ ˜ E (cid:20) T w ( X ; γ ) (cid:18)Z ψ { h ( X ) } e − C u | h ′ ( X ) − h ( X ) | d u (cid:19) (cid:8) h ′ ( X ) − h ( X ) (cid:9) (cid:21) ≥ ˜ E h T w ( X ; γ ) ψ { h ( X ) } (cid:8) h ′ ( X ) − h ( X ) (cid:9) i (cid:18)Z e − C u k α ′ − α k d u (cid:19) , which gives the desired result because R e − cu d u = (1 − e − c ) /c for c ≥ (cid:3) Lemma 11.
Suppose that Assumption 2(iii) holds. In the event Ω_γ2, Assumption 2(ii) implies a compatibility condition for Σ̃_γ: for any vector b = (b₀, b₁, ..., b_p)ᵀ ∈ ℝ^{1+p} such that Σ_{j∉S_α} |b_j| ≤ ξ Σ_{j∈S_α} |b_j|, we have
(1 − η) ν² ( Σ_{j∈S_α} |b_j| )² ≤ |S_α| ( bᵀ Σ̃_γ b ).   (S15)

Proof. In the event Ω_γ2, we have |bᵀ(Σ̃_γ − Σ_γ)b| ≤ λ ‖b‖₁² by (S1). Then Assumption 2(ii) implies that for any vector b = (b₀, b₁, ..., b_p)ᵀ satisfying Σ_{j∉S_α} |b_j| ≤ ξ Σ_{j∈S_α} |b_j|,
ν² ‖b_{S_α}‖₁² ≤ |S_α| (bᵀ Σ_γ b) ≤ |S_α| ( bᵀ Σ̃_γ b + λ ‖b‖₁² ) ≤ |S_α| (bᵀ Σ̃_γ b) + |S_α| λ (1 + ξ)² ‖b_{S_α}‖₁²,
where ‖b_{S_α}‖₁ = Σ_{j∈S_α} |b_j|. The last inequality uses ‖b‖₁ ≤ (1 + ξ) ‖b_{S_α}‖₁. Then (S15) follows because (1 + ξ)² ν^{−2} |S_α| λ ≤ η (< 1) by Assumption 2(iii). □
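The transfer argument above is purely deterministic once the entrywise error of the sample matrix is controlled. A numerical illustration with the population matrix taken as the identity (so its compatibility constant is ν = 1 for every support); the values of n, p, xi and the support S are illustrative choices, not quantities from the paper:

```python
import numpy as np

# Compatibility transfer: if nu^2 ||b_S||_1^2 <= |S| b' Sigma b holds for the
# population Sigma, and |Sigma_hat - Sigma|_infty <= lam, then for b in the
# cone sum_{j not in S}|b_j| <= xi * sum_{j in S}|b_j| one gets
# (nu^2 - (1+xi)^2 |S| lam) ||b_S||_1^2 <= |S| b' Sigma_hat b.
rng = np.random.default_rng(1)
n, p, xi = 20000, 30, 1.0
S = np.arange(5)                               # support set, |S| = 5
off = np.setdiff1d(np.arange(p), S)

X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n                        # sample version of Sigma = I
lam = np.abs(Sigma_hat - np.eye(p)).max()      # entrywise error

for _ in range(50):
    b = rng.standard_normal(p)
    # force b into the cone condition
    b[off] *= min(1.0, xi * np.abs(b[S]).sum() / np.abs(b[off]).sum())
    lhs = (1.0 - (1 + xi) ** 2 * len(S) * lam) * np.abs(b[S]).sum() ** 2
    rhs = len(S) * (b @ Sigma_hat @ b)
    assert lhs <= rhs + 1e-9                   # transferred compatibility bound
```

With n large enough that (1 + ξ)² |S| λ < ν², the transferred constant stays positive, mirroring the role of the condition (1 + ξ)² ν^{−2} |S_α| λ ≤ η < 1.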
Lemma 12.
Suppose that Assumptions 2 and 3 hold, and A > (ξ + 1)/(ξ − 1). In the event Ω_γ1 ∩ Ω_γ2 ∩ Ω_α1 ∩ Ω_α2, inequality (29) holds as in Theorem 2.

Proof.
Denote b = ˆ α RWL − ¯ α WL , D † WL = D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ), Q WL = Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ),and D ‡ WL = e − η D † WL + ( A − λ k b k . In the event Ω ∩ Ω ∩ Ω ∩ Ω , inequality (S14) from Lemma 9 with Assumption 2(i) leads totwo possible cases: either ξ D ‡ WL ≤ e η (cid:0) M | S γ | λ (cid:1) / ( Q WL ) / , (S16)or (1 − ξ ) D ‡ WL ≤ A λ P j ∈ S α | b j | , that is, D ‡ WL ≤ ( ξ + 1)( A − λ X j ∈ S α | b j | = ξ λ X j ∈ S α | b j | , (S17)19here ξ = 1 − A / { ( ξ + 1)( A − } ∈ (0 ,
1] because A > ( ξ + 1) / ( ξ −
1) and ξ =( ξ + 1)( A − P j S α | b j | ≤ ξ P j ∈ S α | b j | , which, by Lemma 11 and Assumption 2(ii)–(iii), implies (S15), that is, X j ∈ S α | b j | ≤ (1 − η ) − / ν − | S α | / (cid:16) b T ˜Σ γ b (cid:17) / . (S18)By Assumption 3(ii) and Lemma 10 with Assumption 3(iii), we have D † WL ≥ − e − C k b k C k b k (cid:16) b T ˜Σ α b (cid:17) ≥ − e − C k b k C k b k C (cid:16) b T ˜Σ γ b (cid:17) . (S19)Combining (S17), (S18), and (S19) and using D † WL ≤ e η D ‡ WL yields D ‡ WL ≤ e η ξ (1 − η ) − ν − C − | S α | λ C k b k − e − C k b k . (S20)But ( A − λ k b k ≤ D ‡ WL . Inequality (S20) along with Assumption 3(iv) implies that 1 − e − C k b k ≤ C ( A − − ξ (1 − η ) − ν − C − | S α | λ ≤ η ( < C k b k ≤ − log(1 − η ) and hence 1 − e − C k b k C k b k = Z e − C k b k u d u ≥ e − C k b k ≥ − η . From this bound, inequality (S20) then leads to D ‡ WL ≤ e η ξ ν − | S α | λ .If (S16) holds, then simple manipulation using D † WL ≤ e η D ‡ WL and (S19) together with Q WL = b T ˜Σ γ b gives D ‡ WL ≤ e η ξ − C − (cid:0) M | S γ | λ (cid:1) C k b k − e − C k b k . (S21)Similarly as above, using ( A − λ k b k ≤ D ‡ WL and inequality (S21) along with Assump-tion 3(iv), we find 1 − e − C k b k ≤ C e η ( A − − ξ − C − ( M | S γ | λ ) ≤ η ( < C k b k ≤ − log(1 − η ) and hence1 − e − C k b k C k b k = Z e − C k b k u d u ≥ e − C k b k ≥ − η . From this bound, inequality (S21) then leads to D ‡ WL ≤ e η ξ − ( M | S γ | λ ). Therefore, (39)holds through (S16) and (S17) in the event Ω ∩ Ω ∩ Ω ∩ Ω . (cid:3) I.4 Proof of Theorem 3
Denote ˆ ϕ = ϕ ( T, Y, X ; ˆ m RWL , ˆ π RCAL ) and ¯ ϕ = ϕ ( T, Y, X ; ¯ m WL , ¯ π CAL ). Thenˆ µ ( ˆ m RWL , ˆ π RCAL ) = ¯ µ ( ¯ m WL , ¯ π CAL ) + ˜ E ( ˆ ϕ − ¯ ϕ ) . Consider the following decomposition,ˆ ϕ − ¯ ϕ = { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27) + T { Y − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27) + { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) T ¯ π CAL ( X ) − T ˆ π RCAL ( X ) (cid:27) , (S22)denoted as δ + δ + δ .We show that in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω , inequality (32) holds as in Theorem 3.The decomposition (20) for ˆ µ ( ˆ m RWL , ˆ π RCAL ) amounts toˆ µ ( ˆ m RWL , ˆ π RCAL ) = ¯ µ ( ¯ m WL , ¯ π CAL ) + ∆ + ∆ , where ∆ = ˜ E ( δ + δ ) = ( ˆ α RWL − ¯ α WL ) T ˜ E (cid:20)(cid:26) − T ˆ π RCAL ( X ) (cid:27) f ( X ) (cid:21) , ∆ = ˜ E ( δ ) = ˜ E (cid:20) T { Y − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27)(cid:21) . In the event Ω ∩ Ω ∩ Ω ∩ Ω , we have | ∆ | ≤ ( A − − M ( | S γ | λ + | S α | λ ) × A λ , (S23)by inequality (29) and the Karush–Kuhn–Tucker conditions (14)–(15). Moreover, a Taylorexpansion for ∆ yields for some u ∈ (0 , = − (ˆ γ RCAL − ¯ γ CAL ) T ˜ E h T { Y − ¯ m WL ( X ) } e − ¯ h ( X ) f ( X ) i + (ˆ γ RCAL − ¯ γ CAL ) T ˜ E h T { Y − ¯ m WL ( X ) } e − u ˆ h ( X ) − (1 − u )¯ h ( X ) f ( X ) f T ( X ) i (ˆ γ RCAL − ¯ γ CAL ) / , denoted as ∆ + ∆ . In the event (Ω ∩ Ω ) ∩ Ω , we have | ∆ | ≤ ( A − − M | S γ | λ × λ , (S24)by inequalities (26) and (S2). The term ∆ can be bounded as | ∆ | ≤ e k ˆ γ − ¯ γ k C ˜ E h T w ( X ; ¯ γ CAL ) | Y − ¯ m WL ( X ) |{ ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i / . (S25)21n the event Ω ∩ Ω , we have˜ E h T w ( X ; ¯ γ CAL ) | Y − ¯ m WL ( X ) |{ ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ q D + D λ k ˆ γ RCAL − ¯ γ CAL k + q D + D n λ k ˆ γ RCAL − ¯ γ CAL k + ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } io , (S26)by inequalities (S1) and (S4) and similar steps as in the proof of (S13). 
Then (32) follows by collecting inequalities (S23)–(S26) and applying (S7) and (S10) in the event Ω_γ1 ∩ Ω_γ2.

II.5 Proof of Theorem 4
Using a − b = 2( a − b ) b + ( a − b ) and the Cauchy–Schwartz inequality, we find (cid:12)(cid:12)(cid:12) ˜ E (cid:0) ˆ ϕ c − ¯ ϕ c (cid:1)(cid:12)(cid:12)(cid:12) ≤ E / (cid:0) ¯ ϕ c (cid:1) ˜ E / (cid:8) ( ˆ ϕ c − ¯ ϕ c ) (cid:9) + ˜ E (cid:8) ( ˆ ϕ c − ¯ ϕ c ) (cid:9) . (S27)Using ˆ ϕ c = ˆ ϕ − ˆ µ ( ˆ m RWL , ˆ π RCAL ) and ¯ ϕ c = ¯ ϕ − ˆ µ ( ¯ m WL , ¯ π CAL ), we find˜ E { ( ˆ ϕ c − ¯ ϕ c ) } ≤ E { ( ˆ ϕ − ¯ ϕ ) } + 2 (cid:12)(cid:12) ˆ µ ( ˆ m RWL , ˆ π RCAL ) − ˆ µ ( ¯ m WL , ¯ π CAL ) (cid:12)(cid:12) . (S28)To control ˜ E { ( ˆ ϕ − ¯ ϕ ) } , we use the decomposition (S22), denoted as δ + δ + δ .First, by the mean value equation (S9) and Assumption 1(i)–(ii), we have˜ E ( δ ) = ˜ E " T { Y − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27) ≤ e − B +2 k ˆ γ − ¯ γ k C ˜ E h T w ( X ; ¯ γ CAL ) { Y − ¯ m WL ( X ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i . (S29)Second, writing { ˆ π RCAL ( X ) } − − { ¯ π CAL ( X ) } − = e − ¯ h ( X ) { e − ˆ h ( X )+¯ h ( X ) − } and usingAssumption 1(i)–(ii), we have˜ E ( δ ) = ˜ E " T { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27) ≤ e − B (cid:16) k ˆ γ − ¯ γ k C (cid:17) ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:3) . (S30)Third, using Assumption 1(i)–(ii), we also have˜ E ( δ ) = ˜ E " { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27) ≤ (1 + e − B ) ˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i (S31) ≤ (1 + e − B ) C k ˆ α RWL − ¯ α WL k . (S32)22nequality (33) follows by collecting inequalities (S27)–(S32) and applying (29), (32), (S10),and (S13) in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω . If condition (27) holds, then we have in theevent Ω ∩ Ω ,˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i ≤ e B λ k ˆ α RWL − ¯ α WL k + τ − n λ k ˆ α RWL − ¯ α WL k + ˜ E h T w ( X ; ¯ α WL ) { ˆ h RWL ( X ) − ¯ h WL ( X ) } io , (S33)by inequalities (S1) and (S5) and similar steps as in the proof of (S13). 
Inequality (34) follows, similarly as (33), by combining inequalities (S27)–(S31) and (S33).

II.6 Proof of Theorem 6
We use the decomposition (S22) and handle δ , δ , and δ separately. The term ˜ E ( δ ) can bebounded by (S24)–(S26) as in the proof of Theorem 3. By the mean value equation (S9) andthe Cauchy–Schwartz inequality, ˜ E ( δ ) can be bounded as (cid:12)(cid:12)(cid:12) ˜ E ( δ ) (cid:12)(cid:12)(cid:12) ≤ e C k ˆ γ − ¯ γ k ˜ E / h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i × ˜ E / (cid:2) T w ( X ; ¯ γ CAL ) { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:3) . (S34)Similarly as in Lemma 10 but arguing in the reverse direction by Assumptions 1(i) and 3(iii),we find ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:3) ≤ e C k ˆ α − ¯ α k × ˜ E h T w ( X ; ¯ γ CAL ) ψ { ¯ h WL ( X ) }{ ˆ m RWL ( X ) − ¯ m WL ( X ) }{ ˆ h RWL ( X ) − ¯ h WL ( X ) } i ≤ e C k ˆ α − ¯ α k C D † WL ( ˆ m RWL , ¯ m WL ; ¯ γ CAL ) , (S35)where the second inequality follows from Assumption 3(i). In the following, we derive twodifferent bounds on ˜ E ( δ ), leading to (40) and (41) respectively.First, suppose that condition (27) holds. Consider the following decomposition˜ E ( δ ) = ˜ E (cid:20) ψ { ¯ h WL ( X ) }{ ˆ h RWL ( X ) − ¯ h WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21) + ˜ E (cid:20) ˜ ψ ( X ) { ˆ h RWL ( X ) − ¯ h WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21) , (S36)denoted as ∆ + ∆ , where˜ ψ ( X ) = Z (cid:16) ψ [¯ h WL ( X ) + u { ˆ h RWL ( X ) − ¯ h WL ( X ) } ] − ψ { ¯ h WL ( X ) } (cid:17) d u. the event thatsup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12) ( ˜ E − E ) (cid:20) ψ { ¯ h WL ( X ) } f j ( X ) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C λ . Then P (Ω ) ≥ − ǫ similarly as in Lemma 1(i). In the event Ω , we have | ∆ | ≤ k ˆ α RWL − ¯ α WL k sup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12) ˜ E (cid:20) ψ { ¯ h WL ( X ) } f j ( X ) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≤ k ˆ α RWL − ¯ α WL k (Λ + 2 C λ ) . 
(S37)To bound ∆ , we have by Assumption 3(iii), | ˜ ψ ( X ) | ≤ ψ { ¯ h WL ( X ) } (cid:16) e C | ˆ h ( X ) − ¯ h ( X ) | − (cid:17) ≤ ψ { ¯ h WL ( X ) } C | ˆ h RWL ( X ) − ¯ h WL ( X ) | e C | ˆ h ( X ) − ¯ h ( X ) | , (S38)where the second inequality follows because (e c − /c = R e uc d u ≤ e c for c ≥
0. As a result,we find from Assumptions 1(i) and 3(i), | ∆ | ≤ (1 + e − B ) C C e C k ˆ α − ¯ α k ˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i . (S39)By condition (27), ˜ E [ { ˆ h RWL ( X ) − ¯ h WL ( X ) } ] can be bounded as (S33) in the event Ω ∩ Ω .Then (41) follows by collecting inequalities (S24)–(S26) and (S34)–(S39) and applying (39),(S7), and (S10) in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω .Now suppose that (27) may not hold. Denote h ( X ; α ) = α f ( X ). Then ˜ E ( δ ) can bedecomposed as˜ E ( δ ) = ( ˜ E − E ) (cid:18)h ψ { ˆ h RWL ( X ) } − ψ { ¯ h WL ( X ) } i (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:19) + E (cid:18)h ψ { ˆ h RWL ( X ) } − ψ { ¯ h WL ( X ) } i (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:19) , denoted as ∆ + ∆ . In the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω , we have k ˆ α RWL − ¯ α WL k ≤ η from(39) and hence by the mean value theorem, | ∆ | ≤ η sup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12) E (cid:20) ψ { h ( X ; ˜ α ) } f j ( X ) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≤ η Λ ( η ) , (S40)where ˜ α lies between ˆ α RWL and ¯ α WL . Moreover, in the event (Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω ) ∩ Ω ,applying Lemma 13 below yields | ∆ | ≤ C (1 + C e C η ) η λ . (S41)Then (41) follows by combining (S40)–(S41) and other aforementioned inequalities.24 emma 13. For r ≥ , denote by Ω the event that sup k α − ¯ α WL k ≤ r (cid:12)(cid:12)(cid:12)(cid:12) ( ˜ E − E ) (cid:18)(cid:2) ψ { h ( X ; α ) } − ψ { ¯ h WL ( X ) } (cid:3) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C (1 + C e C r ) rλ . Under Assumptions 1(i)–(ii), 3(i) and 3(iii), if λ ≥ √ − B + 1) C p log { (1 + p ) /ǫ } /n , then P (Ω ) ≥ − ǫ . Proof.
Denote g ( T, X ; α ) = (cid:2) ψ { h ( X ; α ) } − ψ { ¯ h WL ( X ) } (cid:3) (cid:26) − T ¯ π CAL ( X ) (cid:27) . For k α − ¯ α WL k ≤ r , similar manipulation as in (S36) and (S38) using Assumptions 1(i), 3(i)and 3(iii) yields (cid:12)(cid:12) ψ { h ( X ; α ) } − ψ { ¯ h WL ( X ) } (cid:12)(cid:12) ≤ ψ { ¯ h WL ( X ) }| h ( X ; α ) − ¯ h WL ( X ) | + ψ { ¯ h WL ( X ) } C | h ( X ; α ) − ¯ h WL ( X ) | e C | h ( X ; α ) − ¯ h ( X ) | ≤ C (1 + C e C r ) | h ( X ; α ) − ¯ h WL ( X ) | , (S42)that is, ψ () satisfies a Lipschitz condition. By the symmetrization and contraction theorems(e.g., Buhlmann & van de Geer 2011, Theorems 14.3 and 14.4), we have E " sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12) ( ˜ E − E ) { g ( T, X ; α ) } (cid:12)(cid:12)(cid:12) ≤ E sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 σ i g ( T i , X i ; α ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C (1 + C e C r ) × E sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 σ i { h ( X i ; α ) − ¯ h WL ( X i ) } (cid:26) − T i ¯ π CAL ( X i ) (cid:27)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C (1 + C e C r ) r × E sup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 σ i f j ( X i ) (cid:26) − T i ¯ π CAL ( X i ) (cid:27)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , where ( σ , . . . , σ n ) are independent Rademacher variables with P ( σ i = 1) = P ( σ i = −
1) = 1 / i . By Hoeffding’s moment inequality (Buhlmann & van de Geer 2011, Lemma 14.14),we find from the preceding inequality E " sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12) ( ˜ E − E ) { g ( T, X ; α ) } (cid:12)(cid:12)(cid:12) ≤ C (1 + C e C r ) r × C (e B + 1) r p ) n , by Assumption 1(i)–(ii). For k α − ¯ α WL k ≤ r , inequality (S42) also shows that | g ( T i , X i ; α ) | ≤ C (1 + C e C r ) C (e B + 1) r . By Massart’s inequality (Buhlmann & van de Geer 2001, Theo-25em 14.2), we have with probability at least 1 − ǫ ,sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12) ( ˜ E − E ) { g ( T, X ; α ) } (cid:12)(cid:12)(cid:12) ≤ C (e B + 1) C (1 + C e C r ) r ( r p ) n + r { / (2 ǫ ) } n ) ≤ C (e B + 1) C (1 + C e C r ) r r
√( 16 log{(1+p)/ε} / n ), where the second inequality uses √a + √b ≤ √{2(a + b)}. □

II.7 Proof of Theorem 7
The proof is similar to that of Theorem 4. First, (S29) for ˜ E ( δ ) remains valid. Second,combining (S30) and (S35) yields˜ E ( δ ) ≤ e − B (cid:16) k ˆ γ − ¯ γ k C (cid:17) e C k ˆ α − ¯ α k C D † WL (ˆ h RWL , ¯ h WL ; ¯ α CAL ) . Third, similarly as in (S32) and (S35), we have˜ E ( δ ) = ˜ E " { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27) ≤ (1 + e − B ) e C k ˆ α − ¯ α k C ˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i (S43) ≤ (1 + e − B ) e C k ˆ α − ¯ α k C C k ˆ α RWL − ¯ α WL k . Inequality (42) follows by collecting the aforementioned inequalities and applying (39), (40),(S10), and (S13) in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω . If condition (27) holds, then in the eventΩ ∩ Ω , combining (S33) and (S19) and using (1 − e − c ) /c ≥ e − c for c ≥ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i ≤ e B λ k ˆ α RWL − ¯ α WL k + τ − n λ k ˆ α RWL − ¯ α WL k + e C k ˆ α − ¯ α k C − D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) o . (S44)Inequality (43) follows by combining (S43)–(S44) and other aforementioned inequalities exceptthat inequality (40) is replaced by (41). 26 I.8 Technical tools
For completeness, we state the following concentration inequalities, which can be obtained from Bühlmann & van de Geer (2011), Lemmas 14.11, 14.16, and 14.9.
Lemma 14. Let (Y₁, ..., Y_n) be independent variables such that E(Y_i) = 0 for i = 1, ..., n and max_{i=1,...,n} |Y_i| ≤ c₀ for some constant c₀. Then for any t > 0,
P( | n^{−1} Σ_{i=1}^n Y_i | > t ) ≤ 2 exp( −n t² / (2 c₀²) ).

Lemma 15.
Let (Y₁, ..., Y_n) be independent variables such that E(Y_i) = 0 for i = 1, ..., n and (Y₁, ..., Y_n) are uniformly sub-gaussian: max_{i=1,...,n} c₀² E{exp(Y_i²/c₀²) − 1} ≤ c₁² for some constants (c₀, c₁). Then for any t > 0,
P( | n^{−1} Σ_{i=1}^n Y_i | > t ) ≤ 2 exp{ −n t² / ( 8 (c₀² + c₁²) ) }.

Lemma 16.
Let (Y₁, ..., Y_n) be independent variables such that E(Y_i) = 0 for i = 1, ..., n and
n^{−1} Σ_{i=1}^n E(|Y_i|^k) ≤ (k!/2) c₁^{k−2} c₂², k = 2, 3, ...,
for some constants (c₁, c₂). Then for any t > 0,
P( | n^{−1} Σ_{i=1}^n Y_i | > c₁ t + c₂ √(2t) ) ≤ 2 exp(−n t).
The following results about sub-gaussian variables can be deduced from Bühlmann & van de Geer (2011), Lemmas 14.3 and 14.5.
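A quick Monte Carlo sanity check of the Bernstein-type tail bound in Lemma 16. The standard normal satisfies the moment condition with c₁ = c₂ = 1 (its absolute moments obey E|Y|^k ≤ k!/2 for k ≥ 2); the values of n, t, reps below are illustrative choices:

```python
import numpy as np

# For i.i.d. standard normal Y_i with c1 = c2 = 1, the Bernstein-type bound
# reads P(|n^{-1} sum Y_i| > c1*t + c2*sqrt(2t)) <= 2*exp(-n*t).
rng = np.random.default_rng(3)
n, t, reps = 100, 0.03, 4000
thresh = 1.0 * t + 1.0 * np.sqrt(2 * t)        # c1*t + c2*sqrt(2t)
bound = 2 * np.exp(-n * t)                     # about 0.0996
hits = sum(abs(rng.standard_normal(n).mean()) > thresh for _ in range(reps))
print(hits / reps, "<=", bound)
```

The empirical exceedance frequency falls well below the bound, as expected: for Gaussian averages the bound is loose, and it only becomes tight for heavier-tailed variables near the boundary of the moment condition.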
Lemma 17.
Suppose that Y is sub-gaussian: c₀² E{exp(Y²/c₀²) − 1} ≤ c₁² for some constants (c₀, c₁). Then
E(|Y|^k) ≤ Γ(k/2 + 1) (c₀² + c₁²) c₀^{k−2}, k = 2, 3, ....

Lemma 18.
Suppose that X is bounded: |X| ≤ c₂ for a constant c₂, and Y is sub-gaussian: c₀² E{exp(Y²/c₀²) − 1} ≤ c₁² for some constants (c₀, c₁). Then Z = XY² satisfies
E{ |Z − E(Z)|^k } ≤ (k!/2) c₃^{k−2} c₄², k = 2, 3, ...,
for c₃ = 2 c₂ c₀² and c₄² = 2 c₂² c₀² c₁².
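A numeric illustration of a Bernstein-type moment bound for such products of a bounded variable and the square of a sub-gaussian variable. Take X ≡ 1 and Y ~ N(0,1), so Z = Y²; the sub-gaussian constants c₀² = 4 and c₁² = 4(√2 − 1) hold exactly for the standard normal (since E exp(Y²/4) = √2), while the form c₃ = 2c₂c₀², c₄² = 2c₂²c₀²c₁² of the derived constants is an assumption made for this illustration:

```python
import math

# Check E{(Y^2 - 1)^k} <= (k!/2) c3^(k-2) c4^2 for even k, with Y ~ N(0,1).
c0sq, c1sq, c2 = 4.0, 4.0 * (math.sqrt(2.0) - 1.0), 1.0
c3, c4sq = 2.0 * c2 * c0sq, 2.0 * c2 ** 2 * c0sq * c1sq

def dfact(m):
    # double factorial (2m-1)!! = E(Y^(2m)) for Y ~ N(0,1), with (-1)!! = 1
    out = 1
    for i in range(1, 2 * m, 2):
        out *= i
    return out

def central_moment(k):
    # E{(Y^2 - 1)^k} via the binomial expansion and E(Y^(2j)) = (2j-1)!!
    return sum(math.comb(k, j) * (-1) ** (k - j) * dfact(j) for j in range(k + 1))

for k in (2, 4, 6, 8):      # even k, so the central moment equals E|Z - E(Z)|^k
    assert central_moment(k) <= math.factorial(k) / 2 * c3 ** (k - 2) * c4sq
```

The factorial growth on the right-hand side is exactly the moment condition required by Lemma 16, which is how Lemmas 16 and 18 combine in the proof of Lemma 3.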