Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data
Zhiqiang Tan January 31, 2018
Abstract.
Consider the problem of estimating average treatment effects when a large number of covariates are used to adjust for possible confounding through outcome regression and propensity score models. The conventional approach of model building and fitting iteratively can be difficult to implement, depending on ad hoc choices of what variables are included. In addition, uncertainty from the iterative process of model selection is complicated and often ignored in subsequent inference about treatment effects. We develop new methods and theory to obtain not only doubly robust point estimators for average treatment effects, which remain consistent if either the propensity score model or the outcome regression model is correctly specified, but also model-assisted confidence intervals, which are valid when the propensity score model is correctly specified but the outcome regression model may be misspecified. With a linear outcome model, the confidence intervals are doubly robust, that is, also valid when the outcome model is correctly specified but the propensity score model may be misspecified. Our methods involve regularized calibrated estimators with Lasso penalties, but carefully chosen loss functions, for fitting propensity score and outcome regression models. We provide high-dimensional analysis to establish the desired properties of our methods under comparable conditions to previous results, which give valid confidence intervals when both the propensity score and outcome regression models are correctly specified. We present a simulation study and an empirical application which confirm the advantages of the proposed methods compared with related methods based on regularized maximum likelihood estimation.
Key words and phrases.
Calibrated estimation; Causal inference; Doubly robust estimation; Inverse probability weighting; Lasso penalty; Model misspecification; Propensity score; Regularized M-estimation.

Department of Statistics & Biostatistics, Rutgers University. Address: 110 Frelinghuysen Road, Piscataway, NJ 08854. E-mail: [email protected]. The research was supported in part by PCORI grant ME-1511-32740. The author thanks Cun-Hui Zhang and Zijian Guo for helpful discussions.
Introduction
Drawing inferences about effects of treatments or interventions is constantly desired from observational studies in social and medical sciences, when randomized experiments are either infeasible or difficult due to practical constraints. This subject, broadly known as causal inference in statistics, is often based on the framework of potential outcomes (Neyman 1923; Rubin 1974). For observational studies, causal inference inevitably involves statistical modeling and estimation of population properties and associations from empirical data (e.g., Tsiatis 2006). In particular, as the main problem to be tackled in the paper, estimation of average treatment effects typically requires building and fitting outcome regression or propensity score models (e.g., Tan 2007). The fitted outcome regression functions or propensity scores can then be used in various estimators for the average treatment effects, notably inverse probability weighted (IPW) estimators or augmented IPW estimators (Robins et al. 1994).

For building and fitting outcome regression or propensity score models, it is possible to follow the usual process of model specification, fitting, and checking in a cyclic manner (e.g., McCullagh & Nelder 1989). In fact, a conventional approach for propensity score estimation as demonstrated in Rosenbaum & Rubin (1984) involves fitting a propensity score model (often logistic regression) by maximum likelihood, checking covariate balance, and then modifying and refitting the propensity score model until reasonable balance is achieved. However, this approach can be work intensive and difficult to implement, depending on ad hoc choices of what variables are included and whether nonlinear terms or interactions are used, among others.
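The balance-checking step in this conventional cycle can be sketched as follows; this is a minimal illustration of ours (the helper name and usage are not from the paper), computing a standardized mean difference of a covariate between treatment groups, optionally with inverse probability weights:

```python
import numpy as np

def std_diff(x, t, w=None):
    """Standardized mean difference of covariate x between treated (t=1)
    and control (t=0) groups; w gives optional observation weights."""
    w = np.ones_like(x, dtype=float) if w is None else np.asarray(w, dtype=float)
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    # pooled standard deviation of the two groups as the scale
    s = np.sqrt((np.var(x[t == 1]) + np.var(x[t == 0])) / 2)
    return (m1 - m0) / s
```

In practice one would refit the propensity score model until all covariates show small standardized differences; a cutoff such as 0.1 is a common rule of thumb, not a prescription from the paper.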
The situation can be especially challenging when there are a large number of potentially confounding variables (or covariates) that need to be adjusted for in outcome regression or propensity score models. In addition, another statistical issue is that uncertainty from the iterative process of model selection is complicated and often ignored in subsequent inference (that is, confidence intervals or hypothesis testing) about treatment effects.

In this article, we develop new methods and theory for fitting logistic propensity score models and generalized linear outcome models and then using the fitted values in augmented IPW estimators to estimate average treatment effects, in high-dimensional settings where the number of covariates p is close to or even greater than the sample size n. There are two main elements in our approach. First, we employ regularized estimation with a Lasso penalty (Tibshirani 1996) when fitting the outcome regression and propensity score models to deal with the large number of covariates, under a sparsity assumption that only a small but unknown subset (relative to the sample size) of covariates are associated with nonzero coefficients in the propensity score and outcome regression models.
Second, we carefully choose the loss functions for regularized estimation, different from least squares or maximum likelihood, such that the resulting augmented IPW estimator and Wald-type confidence intervals possess the following property (G1) and at least one of (G2)–(G3) under suitable conditions:

(G1) The point estimator is doubly robust, that is, remains consistent if either the propensity score model or the outcome regression model is correctly specified.

(G2) The confidence intervals are valid if the propensity score model is correctly specified but the outcome regression model may be misspecified.

(G3) The confidence intervals are valid if the outcome regression model is correctly specified but the propensity score model may be misspecified.

If either property (G2) or (G3) is satisfied, then the confidence intervals are said to be model-assisted, borrowing the terminology from the survey literature (Sarndal et al. 1992). If both properties (G2)–(G3) are satisfied, then the confidence intervals are doubly robust.

Combining the two foregoing elements leads to a regularized calibrated estimator, denoted by $\hat\gamma_{\rm RCAL}$, for the coefficients in the propensity score model and a regularized weighted likelihood estimator, denoted by $\hat\alpha_{\rm RWL}$, for the coefficients in the outcome model within the treated subjects. See the loss functions in (11) and (13) or (37). The regularized calibrated estimator $\hat\gamma_{\rm RCAL}$ has recently been proposed in Tan (2017) as an alternative to the regularized maximum likelihood estimator for fitting logistic propensity score models, regardless of outcome regression models. As shown in Tan (2017), minimization of the underlying expected calibration loss implies reduction of not only the expected likelihood loss for logistic regression but also a measure of relative errors of limiting propensity scores that controls the mean squared errors of IPW estimators, when the propensity score model may be misspecified.
In a complementary manner, our work here shows that $\hat\gamma_{\rm RCAL}$ can be used in conjunction with $\hat\alpha_{\rm RWL}$ to yield an augmented IPW estimator with valid confidence intervals if the propensity score model is correctly specified but the outcome regression model may be misspecified.

We provide high-dimensional analysis of the regularized weighted likelihood estimator $\hat\alpha_{\rm RWL}$ and the resulting augmented IPW estimator with possible model misspecification, while building on related results about $\hat\gamma_{\rm RCAL}$ in Tan (2017). In fact, a new strategy of inverting a quadratic inequality is developed to tackle the technical issue that the weighted likelihood loss for $\hat\alpha_{\rm RWL}$ is defined depending on the estimator $\hat\gamma_{\rm RCAL}$. As a result, we obtain the convergence of $\hat\alpha_{\rm RWL}$ to a target value in the $L_1$ norm at the rate $(|S_\gamma| + |S_\alpha|)\{\log(p)/n\}^{1/2}$ and in the symmetrized weighted Bregman divergence at the rate $(|S_\gamma| + |S_\alpha|)\log(p)/n$ under comparable conditions to those for high-dimensional analysis of standard Lasso estimators (e.g., Buhlmann & van de Geer 2011), where $|S_\gamma|$ denotes the number of nonzero coefficients in the propensity score model and $|S_\alpha|$ denotes that in the outcome model. Furthermore, we establish an asymptotic expansion of the augmented IPW estimator based on $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$, and show that property (G1) is achieved provided $(|S_\gamma| + |S_\alpha|)(\log p)^{1/2} = o(n^{1/2})$ and property (G2) is achieved provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n^{1/2})$ with a nonlinear outcome model. With a linear outcome model, we obtain stronger results: property (G1) is achieved provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n)$ and both (G2) and (G3) are achieved provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n^{1/2})$. These sparsity conditions are as weak as in previous works (Belloni et al. 2014; van de Geer et al. 2014).

Related works.
We compare and connect our work with related works in several areas. Non-penalized calibrated estimation for propensity score models has been studied, sometimes independently (re)derived, in causal inference, missing-data problems, and survey sampling (e.g., Folsom 1991; Tan 2010; Graham et al. 2012; Hainmueller 2012; Imai & Ratkovic 2014; Kim & Haziza 2014; Vermeulen & Vansteelandt 2015; Chan et al. 2016). The non-penalized version of the estimator $\hat\alpha_{\rm RWL}$ for outcome regression models has also been proposed in Kim & Haziza (2014) and Vermeulen & Vansteelandt (2015), where one of the motivations is to circumvent the need to account for variation of such estimators of nuisance parameters and hence simplify the computation of confidence intervals based on augmented IPW estimators. Our work generalizes these ideas to achieve statistical advantages in high-dimensional settings, where model-assisted or doubly robust confidence intervals would not be obtained without using regularized calibrated estimation. See Section 3.4 for further discussion.

For high-dimensional causal inference, Belloni et al. (2014) and Farrell (2015) employed augmented IPW estimators based on regularized maximum likelihood estimators in outcome regression and propensity score models, and obtained Wald-type confidence intervals that are valid when both the outcome regression and propensity score models are correctly specified, provided $(|S_\gamma| + |S_\alpha|)\log(p) = o(n^{1/2})$. Our main contribution is therefore to provide model-assisted or doubly robust confidence intervals using differently configured augmented IPW estimators for treatment effects. As a secondary difference, Belloni et al. (2014) and Farrell (2015) used post-Lasso estimators, that is, refitting outcome regression and propensity score models only including the variables selected from Lasso estimation. In contrast, our estimators $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$ are directly Lasso penalized M-estimators.

Another related work is Athey et al.
(2016), where valid confidence intervals are obtained for sample treatment effects such as $n_1^{-1}\sum_{i: T_i=1}\{m^*_1(X_i) - m^*_0(X_i)\}$, with $n_1$ the number of treated subjects, if a linear outcome model is correctly specified. No propensity score model is explicitly used.

Our work is also connected to the literature of confidence intervals and hypothesis testing for single or low-dimensional coefficients in high-dimensional regression models (Zhang & Zhang 2014; van de Geer et al. 2014; Javanmard & Montanari 2014). Model-assisted inference does not seem to be addressed in these works, but can potentially be developed.

Suppose that the observed data consist of independent and identically distributed observations $\{(Y_i, T_i, X_i): i = 1, \dots, n\}$ of $(Y, T, X)$, where $Y$ is an outcome variable, $T$ is a treatment variable taking values 0 or 1, and $X$ is a vector of measured covariates. In the potential outcomes framework for causal inference (Neyman 1923; Rubin 1974), let $(Y^0, Y^1)$ be potential outcomes that would be observed under treatment 0 or 1 respectively. By consistency, assume that $Y$ is either $Y^0$ if $T = 0$ or $Y^1$ if $T = 1$, that is, $Y = (1-T) Y^0 + T Y^1$. There are two causal parameters commonly of interest: the average treatment effect (ATE), defined as $E(Y^1 - Y^0) = \mu^1 - \mu^0$ with $\mu^t = E(Y^t)$, and the average treatment effect on the treated (ATT), defined as $E(Y^1 - Y^0 \mid T=1) = \nu^1 - \nu^0$ with $\nu^t = E(Y^t \mid T=1)$ for $t = 0,$
1. For concreteness, we mainly discuss estimation of $\mu^1$ until Section 3.4, where estimation of ATE and ATT is discussed.

Estimation of ATE is fundamentally a missing-data problem: only one potential outcome, $Y^0_i$ or $Y^1_i$, is observed and the other one is missing for each subject $i$. For identification of $(\mu^0, \mu^1)$ and ATE, we make the following two assumptions throughout:

(i) Unconfoundedness: $T \perp Y^0 \mid X$ and $T \perp Y^1 \mid X$, that is, $T$ and $Y^0$ and, respectively, $T$ and $Y^1$ are conditionally independent given $X$ (Rubin 1976);

(ii) Overlap: $0 < \pi^*(x) < 1$, where $\pi^*(x) = P(T=1 \mid X=x)$ is called the propensity score (PS) (Rosenbaum & Rubin 1983).

Under these assumptions, $(\mu^0, \mu^1)$ and ATE are often estimated by imposing additional modeling (or dimension-reduction) assumptions on the outcome regression function $m^*_t(X) = E(Y \mid T=t, X)$ or the propensity score $\pi^*(X) = P(T=1 \mid X)$.

Consider a conditional mean model for outcome regression (OR),

E(Y \mid T=1, X) = m_1(X; \alpha) = \psi\{\alpha^{\rm T} g(X)\},   (1)

where $\psi(\cdot)$ is an inverse link function, assumed to be increasing, $g(x) = \{1, g_1(x), \dots, g_d(x)\}^{\rm T}$ is a vector of known functions, and $\alpha = (\alpha_0, \alpha_1, \dots, \alpha_d)^{\rm T}$ is a vector of unknown parameters. For example, model (1) can be deduced from a generalized linear model with a canonical link (McCullagh & Nelder 1989). Then the average negative log-(quasi-)likelihood function can be written (after dropping any dispersion parameter) as

\ell_{\rm ML}(\alpha) = \tilde E\big( T \big[ -Y \alpha^{\rm T} g(X) + \Psi\{\alpha^{\rm T} g(X)\} \big] \big),   (2)

where $\Psi(u) = \int_0^u \psi(u')\,{\rm d}u'$, which is convex in $u$. Throughout, $\tilde E(\cdot)$ denotes the sample average. With high-dimensional data, a regularized maximum likelihood estimator, $\hat\alpha_{\rm RML}$, can be defined by minimizing the loss $\ell_{\rm ML}(\alpha)$ with the Lasso penalty (Tibshirani 1996),

\ell_{\rm RML}(\alpha) = \ell_{\rm ML}(\alpha) + \lambda \|\alpha_{1:d}\|_1,   (3)

where $\|\alpha_{1:d}\|_1 = \sum_{j=1}^d |\alpha_j|$ is the $L_1$ norm of $\alpha_{1:d} = (\alpha_1, \dots$
, $\alpha_d)^{\rm T}$ excluding $\alpha_0$, and $\lambda \ge 0$ is a tuning parameter. The resulting estimator of $\mu^1$ is then

\hat\mu^1_{\rm OR} = \tilde E\{\hat m_{\rm RML}(X)\} = \frac{1}{n} \sum_{i=1}^n \hat m_{\rm RML}(X_i),

where $\hat m_{\rm RML}(X) = m_1(X; \hat\alpha_{\rm RML})$, the fitted outcome regression function. Various theoretical results have been obtained on Lasso penalized estimation in sparse, high-dimensional regression (e.g., Buhlmann & van de Geer 2011; Huang & Zhang 2012; Negahban et al. 2012). Such results can be easily adapted to $\hat\alpha_{\rm RML}$, with the data restricted to $\{(Y_i, X_i): T_i = 1, i = 1, \dots, n\}$. If model (1) is correctly specified, then it can be shown under suitable conditions that $\|\hat\alpha_{\rm RML} - \alpha^*\|_1 = O_p(\|\alpha^*\|_0 \{\log(d)/n\}^{1/2})$ and $\hat\mu^1_{\rm OR} = \mu^1 + O_p(\{\|\alpha^*\|_0 \log(d)/n\}^{1/2})$, where $\alpha^*$ is the true value for model (1) such that $m^*_1(X) = m_1(X; \alpha^*)$.

Alternatively, consider a propensity score (PS) model

P(T=1 \mid X) = \pi(X; \gamma) = \Pi\{\gamma^{\rm T} f(X)\},   (4)

where $\Pi(\cdot)$ is an inverse link function, $f(x) = \{1, f_1(x), \dots, f_p(x)\}^{\rm T}$ is a vector of known functions, and $\gamma = (\gamma_0, \gamma_1, \dots, \gamma_p)^{\rm T}$ is a vector of unknown parameters. For concreteness, assume that model (4) is logistic regression with $\pi(X; \gamma) = [1 + \exp\{-\gamma^{\rm T} f(X)\}]^{-1}$, and hence the average negative log-likelihood function is

\ell_{\rm ML}(\gamma) = \tilde E\big[ \log\{1 + {\rm e}^{\gamma^{\rm T} f(X)}\} - T \gamma^{\rm T} f(X) \big].   (5)

To handle high-dimensional data, a Lasso penalized maximum likelihood estimator, $\hat\gamma_{\rm RML}$, is defined by minimizing the objective function

\ell_{\rm RML}(\gamma) = \ell_{\rm ML}(\gamma) + \lambda \|\gamma_{1:p}\|_1,   (6)

where $\|\gamma_{1:p}\|_1 = \sum_{j=1}^p |\gamma_j|$ is the $L_1$ norm of $\gamma_{1:p} = (\gamma_1, \dots, \gamma_p)^{\rm T}$ excluding $\gamma_0$, and $\lambda \ge 0$ is a tuning parameter. The fitted propensity score, $\hat\pi_{\rm RML}(X) = \pi(X; \hat\gamma_{\rm RML})$, can be used in various manners to estimate $(\mu^0, \mu^1)$ and ATE, including matching, stratification, and weighting. In particular, a (ratio) inverse probability weighted (IPW) estimator for $\mu^1$ is

\hat\mu^1_{\rm rIPW}(\hat\pi_{\rm RML}) = \tilde E\left\{ \frac{T Y}{\hat\pi_{\rm RML}(X)} \right\} \Big/ \tilde E\left\{ \frac{T}{\hat\pi_{\rm RML}(X)} \right\}.

From previous works (Buhlmann & van de Geer 2011; Huang & Zhang 2012; Negahban et al.
2012), if model (4) is correctly specified, then it can be shown under suitable conditions that $\|\hat\gamma_{\rm RML} - \gamma^*\|_1 = O_p(\|\gamma^*\|_0 \{\log(p)/n\}^{1/2})$ and $\hat\mu^1_{\rm rIPW}(\hat\pi_{\rm RML}) = \mu^1 + O_p(\{\|\gamma^*\|_0 \log(p)/n\}^{1/2})$, where $\gamma^*$ is the true value for model (4) such that $\pi^*(X) = \pi(X; \gamma^*)$.

To attain consistency for $\mu^1$, the estimator $\hat\mu^1_{\rm OR}$ or $\hat\mu^1_{\rm rIPW}(\hat\pi_{\rm RML})$ relies on correct specification of OR model (1) or PS model (4) respectively. In contrast, there are doubly robust estimators depending on both OR and PS models in the augmented IPW form (Robins et al. 1994)

\hat\mu^1(\hat m_1, \hat\pi) = \tilde E\{\varphi(Y, T, X; \hat m_1, \hat\pi)\},

where $\hat m_1(X)$ and $\hat\pi(X)$ are fitted values of $m^*_1(X)$ and $\pi^*(X)$ respectively and

\varphi(Y, T, X; \hat m_1, \hat\pi) = \frac{T Y}{\hat\pi(X)} - \left\{ \frac{T}{\hat\pi(X)} - 1 \right\} \hat m_1(X).   (7)

See Kang & Schafer (2007) and Tan (2010) for reviews in low-dimensional settings. Recently, interesting results in high-dimensional settings have been obtained by Belloni et al. (2014) and Farrell (2015) on the estimator $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$, using the fitted values $\hat m_{\rm RML}(X)$ and $\hat\pi_{\rm RML}(X)$ from Lasso penalized estimation or similar methods. These results are mainly of two types. The first type shows double robustness: $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ remains consistent if either OR model (1) or PS model (4) is correctly specified. The second type establishes valid confidence intervals: $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ admits the usual influence function,

\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML}) = \tilde E\{\varphi(Y, T, X; m^*_1, \pi^*)\} + o_p(n^{-1/2}),

if both OR model (1) and PS model (4) are correctly specified. In general, the latter result requires a stronger sparsity condition than in consistency results only. For example, it is assumed that $\{\|\alpha^*\|_0 + \|\gamma^*\|_0\} \log(p) = o(n^{1/2})$ in Belloni et al. (2014).
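As a concrete illustration of the augmented IPW form (7) and the associated Wald-type interval, the following minimal numpy sketch (our own illustration; the function names and inputs are not from the paper) computes the point estimate and a confidence interval from fitted values $\hat m_1(X_i)$ and $\hat\pi(X_i)$:

```python
import numpy as np
from scipy.stats import norm

def aipw_phi(y, t, m_hat, pi_hat):
    """Influence-function term (7): TY/pi(X) - {T/pi(X) - 1} m1(X)."""
    return t * y / pi_hat - (t / pi_hat - 1.0) * m_hat

def aipw_interval(y, t, m_hat, pi_hat, level=0.95):
    """Point estimate of mu^1 and a Wald-type interval based on the
    sample variance of the estimated influence function."""
    vals = aipw_phi(y, t, m_hat, pi_hat)
    n = len(vals)
    mu = vals.mean()
    se = np.sqrt(np.mean((vals - mu) ** 2) / n)
    z = norm.ppf(1 - (1 - level) / 2)
    return mu, (mu - z * se, mu + z * se)
```

Here `m_hat` and `pi_hat` would come from any pair of fitted outcome regression and propensity score models; double robustness concerns the point estimate, while validity of the interval depends on which models are correctly specified.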
An important limitation of the existing works discussed in Section 2 is that valid confidence intervals based on $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ are obtained only under the assumption that both OR model (1) and PS model (4) are correctly specified, even though the point estimator $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$ is doubly robust, that is, remains consistent if either OR model (1) or PS model (4) is correctly specified. To fill this gap, we develop new point estimators and confidence intervals for $\mu^1$, depending on a propensity score model and an outcome regression model, such that property (G1) and at least one of (G2)–(G3) are attained as described in Section 1. Obtaining model-assisted or doubly robust confidence intervals presents a considerable improvement over existing theory and methods in Belloni et al. (2014) and Farrell (2015).

To illustrate the main ideas, consider a logistic propensity score model (4) and a linear outcome regression model,

E(Y \mid T=1, X) = m_1(X; \alpha) = \alpha^{\rm T} f(X),   (8)

that is, model (1) with the identity link and the vector of covariate functions $g(X)$ taken to be the same as $f(X)$ in model (4). This condition can be satisfied, possibly after enlarging model (1) or (4), to reach the same dimension. Our point estimator of $\mu^1$ is

\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) = \tilde E\{\varphi(Y, T, X; \hat m_{\rm RWL}, \hat\pi_{\rm RCAL})\},   (9)

where $\varphi(\cdot)$ is defined in (7), $\hat\pi_{\rm RCAL}(X) = \pi(X; \hat\gamma_{\rm RCAL})$, $\hat m_{\rm RWL}(X) = m_1(X; \hat\alpha_{\rm RWL})$, and $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$ are defined as follows.
The estimator $\hat\gamma_{\rm RCAL}$ is a regularized calibrated estimator of $\gamma$ from Tan (2017), defined as a minimizer of the Lasso penalized objective function

\ell_{\rm RCAL}(\gamma) = \ell_{\rm CAL}(\gamma) + \lambda \|\gamma_{1:p}\|_1,   (10)

where $\ell_{\rm CAL}(\gamma)$ is the calibration loss,

\ell_{\rm CAL}(\gamma) = \tilde E\big\{ T {\rm e}^{-\gamma^{\rm T} f(X)} + (1-T) \gamma^{\rm T} f(X) \big\},   (11)

$\|\gamma_{1:p}\|_1$ is the $L_1$ norm of $\gamma_{1:p}$, and $\lambda \ge 0$ is a tuning parameter. The estimator $\hat\alpha_{\rm RWL}$ is a regularized weighted least-squares estimator of $\alpha$, defined as a minimizer of

\ell_{\rm RWL}(\alpha; \hat\gamma_{\rm RCAL}) = \ell_{\rm WL}(\alpha; \hat\gamma_{\rm RCAL}) + \lambda \|\alpha_{1:p}\|_1,   (12)

where $\ell_{\rm WL}(\alpha; \hat\gamma_{\rm RCAL})$ is the weighted least-squares loss,

\ell_{\rm WL}(\alpha; \hat\gamma_{\rm RCAL}) = \tilde E\left[ T \, \frac{1 - \hat\pi_{\rm RCAL}(X)}{\hat\pi_{\rm RCAL}(X)} \{Y - \alpha^{\rm T} f(X)\}^2 \right] \Big/ 2,   (13)

$\|\alpha_{1:p}\|_1$ is the $L_1$ norm of $\alpha_{1:p}$, and $\lambda \ge 0$ is a tuning parameter. Minimization of (13) involves the weight $\{1 - \hat\pi_{\rm RCAL}(X_i)\}/\hat\pi_{\rm RCAL}(X_i)$, which differs slightly from the commonly used inverse propensity score weight $1/\hat\pi_{\rm RCAL}(X_i)$.

There are simple and interesting interpretations of the preceding estimators. By the Karush–Kuhn–Tucker condition for minimizing (10), the fitted propensity score $\hat\pi_{\rm RCAL}(X)$ satisfies

\frac{1}{n} \sum_{i=1}^n \frac{T_i}{\hat\pi_{\rm RCAL}(X_i)} = 1,   (14)

\frac{1}{n} \left| \sum_{i=1}^n \frac{T_i f_j(X_i)}{\hat\pi_{\rm RCAL}(X_i)} - \sum_{i=1}^n f_j(X_i) \right| \le \lambda, \quad j = 1, \dots, p,   (15)

where equality holds in (15) for any $j$ such that the $j$th estimate $(\hat\gamma_{\rm RCAL})_j$ is nonzero. Eq. (14) shows that the inverse probability weights, $1/\hat\pi_{\rm RCAL}(X_i)$ with $T_i = 1$, sum to the sample size $n$, whereas Eq. (15) implies that the weighted average of each covariate $f_j(X_i)$ over the treated group may differ from the overall average of $f_j(X_i)$ by no more than $\lambda$. In fact, the calibration loss $\ell_{\rm CAL}(\gamma)$ in (11) is derived such that its gradient gives the left hand side of (15) without taking absolute values, as shown in Eq. (23).
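To make the interpretation of (14)–(15) concrete in the unpenalized case $\lambda = 0$, the following sketch (our own simulated example, not code from the paper) minimizes the calibration loss (11) with an intercept and checks that the calibration equations hold as exact equalities up to optimizer tolerance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
f = np.column_stack([np.ones(n), x])                   # f(X) = (1, X)
t = rng.binomial(1, 1/(1 + np.exp(-(0.3 + 0.5*x))))    # logistic treatment

def cal_loss(gamma):
    # calibration loss (11): sample mean of T e^{-gamma'f} + (1-T) gamma'f
    h = f @ gamma
    return np.mean(t*np.exp(-h) + (1 - t)*h)

gamma_hat = minimize(cal_loss, np.zeros(2), method="BFGS").x
pi_hat = 1/(1 + np.exp(-(f @ gamma_hat)))

# with lambda = 0, the KKT conditions (14)-(15) become exact equations:
assert abs(np.mean(t/pi_hat) - 1) < 1e-4               # weights sum to n
assert abs(np.mean(t*x/pi_hat) - np.mean(x)) < 1e-4    # covariate balance
```

With a positive Lasso penalty, the equalities relax to the box constraints of width $\lambda$ shown in (15).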
The Lasso penalty is used to induce the box constraints on the gradient of $\ell_{\rm CAL}(\gamma)$, instead of setting the gradient to 0.

By the Karush–Kuhn–Tucker condition for minimizing (12), the fitted outcome regression function $\hat m_{\rm RWL}(X)$ satisfies

\frac{1}{n} \sum_{i=1}^n T_i \, \frac{1 - \hat\pi_{\rm RCAL}(X_i)}{\hat\pi_{\rm RCAL}(X_i)} \{Y_i - \hat m_{\rm RWL}(X_i)\} = 0,   (16)

\frac{1}{n} \left| \sum_{i=1}^n T_i \, \frac{1 - \hat\pi_{\rm RCAL}(X_i)}{\hat\pi_{\rm RCAL}(X_i)} \{Y_i - \hat m_{\rm RWL}(X_i)\} f_j(X_i) \right| \le \lambda, \quad j = 1, \dots, p,   (17)

where equality holds in (17) for any $j$ such that the $j$th estimate $(\hat\alpha_{\rm RWL})_j$ is nonzero. Eq. (16) implies, by simple calculation, that the estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ can be recast as

\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) = \tilde E\left[ \hat m_{\rm RWL}(X) + \frac{T}{\hat\pi_{\rm RCAL}(X)} \{Y - \hat m_{\rm RWL}(X)\} \right] = \tilde E\{T Y + (1-T) \hat m_{\rm RWL}(X)\},   (18)

which takes the form of linear prediction estimators known in the survey literature (e.g., Sarndal et al. 1992): $\tilde E\{T Y + (1-T) \hat m_1(X)\}$ for some fitted outcome regression function $\hat m_1(X)$. As a consequence, $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ always falls within the range of the observed outcomes $\{Y_i: T_i = 1, i = 1, \dots, n\}$ and the predicted values $\{\hat m_{\rm RWL}(X_i): T_i = 0, i = 1, \dots, n\}$. This boundedness property is not satisfied by the estimator $\hat\mu^1(\hat m_{\rm RML}, \hat\pi_{\rm RML})$.

We provide a high-dimensional analysis of the estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ in Section 3.2, allowing possible model misspecification. Our main result shows that under suitable conditions, the estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$ admits the asymptotic expansion

\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) = \tilde E\{\varphi(Y, T, X; \bar m_{\rm WL}, \bar\pi_{\rm CAL})\} + o_p(n^{-1/2}),   (19)

where $\bar\pi_{\rm CAL}(X) = \pi(X; \bar\gamma_{\rm CAL})$, $\bar m_{\rm WL}(X) = m_1(X; \bar\alpha_{\rm WL})$, and $\bar\gamma_{\rm CAL}$ and $\bar\alpha_{\rm WL}$ are defined as follows.
In the presence of possible model misspecification, the target value $\bar\gamma_{\rm CAL}$ is defined as a minimizer of the expected calibration loss

E\{\ell_{\rm CAL}(\gamma)\} = E\big\{ T {\rm e}^{-\gamma^{\rm T} f(X)} + (1-T) \gamma^{\rm T} f(X) \big\}.

If model (4) is correctly specified, then $\bar\pi_{\rm CAL}(X) = \pi^*(X)$. Otherwise, $\bar\pi_{\rm CAL}(X)$ may differ from $\pi^*(X)$. The target value $\bar\alpha_{\rm WL}$ is defined as a minimizer of the expected loss

E\{\ell_{\rm WL}(\alpha; \bar\gamma_{\rm CAL})\} = E\left[ T \, \frac{1 - \bar\pi_{\rm CAL}(X)}{\bar\pi_{\rm CAL}(X)} \{Y - \alpha^{\rm T} f(X)\}^2 \right] \Big/ 2.

If model (8) is correctly specified, then $\bar m_{\rm WL}(X) = m^*_1(X)$. But $\bar m_{\rm WL}(X)$ may in general differ from $m^*_1(X)$. For concreteness, the following result can be deduced from Theorems 3 and 4. Suppose that the Lasso tuning parameter is specified as $\lambda = A_1^\dagger \{\log(p)/n\}^{1/2}$ for $\hat\gamma_{\rm RCAL}$ and $\lambda = A_2^\dagger \{\log(p)/n\}^{1/2}$ for $\hat\alpha_{\rm RWL}$, with some constants $A_1^\dagger$ and $A_2^\dagger$. Denote $S_\gamma = \{0\} \cup \{j: \bar\gamma_{{\rm CAL},j} \neq 0, j = 1, \dots, p\}$ and $S_\alpha = \{0\} \cup \{j: \bar\alpha_{{\rm WL},j} \neq 0, j = 1, \dots, p\}$.

Proposition 1.
Suppose that Assumptions 1 and 2 hold as in Section 3.2, and $(|S_\gamma| + |S_\alpha|) \log(p) = o(n^{1/2})$. For sufficiently large constants $A_1^\dagger$ and $A_2^\dagger$, if either logistic PS model (4) or linear OR model (8) is correctly specified, then the following results hold:

(i) $n^{1/2} \{\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) - \mu^1\} \to_{\mathcal D} N(0, V)$, where $V = {\rm var}\{\varphi(Y, T, X; \bar m_{\rm WL}, \bar\pi_{\rm CAL})\}$;

(ii) a consistent estimator of $V$ is

\hat V = \tilde E\big[ \big\{ \varphi(Y, T, X; \hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) - \hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) \big\}^2 \big];

(iii) an asymptotic $(1-c)$ confidence interval for $\mu^1$ is $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL}) \pm z_{c/2} \sqrt{\hat V / n}$, where $z_{c/2}$ is the $(1 - c/2)$ quantile of $N(0,1)$.

That is, a doubly robust confidence interval for $\mu^1$ is obtained.

We highlight some basic ideas underlying the construction of the estimators $\hat\gamma_{\rm RCAL}$ and $\hat\alpha_{\rm RWL}$ as well as the proof of the asymptotic expansion (19) for $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$. For an estimator $\hat\gamma$ in model (4), suppose that $\hat\gamma$ converges in probability to a limit $\bar\gamma$. Denote $\hat\pi(X) = \pi(X; \hat\gamma)$ and $\bar\pi(X) = \pi(X; \bar\gamma)$. Similarly, for an estimator $\hat\alpha$ in model (1), suppose that $\hat\alpha$ converges in probability to a limit $\bar\alpha$. Denote $\hat m(X) = \hat\alpha^{\rm T} f(X)$ and $\bar m(X) = \bar\alpha^{\rm T} f(X)$. Consider the following decomposition of $\hat\mu^1(\hat m, \hat\pi)$ by direct calculation:

\hat\mu^1(\hat m, \hat\pi) = \hat\mu^1(\bar m, \bar\pi) + \tilde E\left[ \{\hat m(X) - \bar m(X)\} \left\{ 1 - \frac{T}{\hat\pi(X)} \right\} \right] + \tilde E\left[ T \{Y - \bar m(X)\} \left\{ \frac{1}{\hat\pi(X)} - \frac{1}{\bar\pi(X)} \right\} \right].   (20)

Eq. (20) can also be obtained from a Taylor expansion with $(\hat\alpha, \hat\gamma)$ about $(\bar\alpha, \bar\gamma)$. For linear OR model (8), the second term of the decomposition reduces to

(\hat\alpha - \bar\alpha)^{\rm T} \times \tilde E\left[ \left\{ 1 - \frac{T}{\hat\pi(X)} \right\} f(X) \right].
(21)

For logistic PS model (4) with $\partial\pi(X; \gamma)/\partial\gamma = \pi(X; \gamma)\{1 - \pi(X; \gamma)\} f(X)$, the third term of the decomposition can be approximated via a Taylor expansion by

-(\hat\gamma - \bar\gamma)^{\rm T} \times \tilde E\left[ T \, \frac{1 - \bar\pi(X)}{\bar\pi(X)} \{Y - \bar m(X)\} f(X) \right].   (22)

Suppose that $\hat\gamma$ and $\hat\alpha$ are Lasso penalized M-estimators such that under suitable conditions, $\|\hat\gamma - \bar\gamma\|_1 = O_p(\{\log(p)/n\}^{1/2})$ and $\|\hat\alpha - \bar\alpha\|_1 = O_p(\{\log(p)/n\}^{1/2})$, where for simplicity the dependency on the sparsity sizes of $\bar\gamma$ and $\bar\alpha$ is suppressed. The loss functions $\ell_{\rm CAL}(\gamma)$ and $\ell_{\rm WL}(\alpha; \gamma)$ in (11) and (13) are constructed such that

\frac{\partial \ell_{\rm CAL}(\gamma)}{\partial\gamma} = \tilde E\left[ \left\{ 1 - \frac{T}{\pi(X; \gamma)} \right\} f(X) \right],   (23)

\frac{\partial \ell_{\rm WL}(\alpha; \gamma)}{\partial\alpha} = -\tilde E\left[ T \, \frac{1 - \pi(X; \gamma)}{\pi(X; \gamma)} \{Y - \alpha^{\rm T} f(X)\} f(X) \right].   (24)

Then the second factors in (21) and (22) can be of order $O_p(\{\log(p)/n\}^{1/2})$ in the supremum norms, as reflected in conditions (14)–(15) and (16)–(17). Consequently, the products (21) and (22) can be of order $O_p(\log(p)/n)$, which becomes $o_p(n^{-1/2})$, and hence (19) holds provided $\log(p) = o(n^{1/2})$, up to a constant depending on the sparsity sizes of $\bar\gamma$ and $\bar\alpha$.

The estimator $\hat\gamma_{\rm RCAL}$ is called a regularized calibrated estimator of $\gamma$ (Tan 2017), because in the extreme case of $\lambda = 0$, Eqs. (14)–(15) reduce to calibration equations, which can be traced to Folsom (1991) in the survey literature. Although such equations are intuitively appealing, the preceding discussion shows that $\hat\gamma_{\rm RCAL}$ can also be derived to reduce the variation associated with estimation of $\alpha$ from linear OR model (8) for the estimator $\hat\mu^1(\hat m, \hat\pi)$, when PS model (4) may be misspecified. Similarly, $\hat\alpha_{\rm RWL}$ is constructed to reduce the variation associated with estimation of $\gamma$ from logistic PS model (4) for the estimator $\hat\mu^1(\hat m, \hat\pi)$, when OR model (8) may be misspecified.
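The gradient identity (23) can be verified numerically; the sketch below (our own simulated example, not code from the paper) compares the analytic gradient $\tilde E[\{1 - T/\pi(X; \gamma)\} f(X)]$ with central finite differences of $\ell_{\rm CAL}$, and the same check applies to (24) for the weighted least-squares loss:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
f = np.column_stack([np.ones(n), x])
t = rng.binomial(1, 1/(1 + np.exp(-x)))

def cal_loss(g):
    h = f @ g
    return np.mean(t*np.exp(-h) + (1 - t)*h)

g0 = np.array([0.1, -0.2])
pi0 = 1/(1 + np.exp(-(f @ g0)))
analytic = np.mean((1 - t/pi0)[:, None]*f, axis=0)    # right side of (23)
eps = 1e-6
numeric = np.array([(cal_loss(g0 + eps*e) - cal_loss(g0 - eps*e))/(2*eps)
                    for e in np.eye(2)])
assert np.allclose(analytic, numeric, atol=1e-5)
```

Setting this gradient to zero (or within a box of width $\lambda$) recovers exactly the calibration conditions (14)–(15).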
By extending the meaning of calibrated estimation, we call $\hat\alpha_{\rm RWL}$ a regularized calibrated estimator of $\alpha$ against model (4), as well as $\hat\gamma_{\rm RCAL}$ a regularized calibrated estimator of $\gamma$ against model (8), when used to define $\hat\mu^1(\hat m, \hat\pi)$.

While the preceding discussion outlines our basic reasoning, there are several technical issues we need to address in high-dimensional analysis, including how to handle the dependency of the estimator $\hat\alpha_{\rm RWL}$ on $\hat\gamma_{\rm RCAL}$, and what condition is required on the sparsity sizes of $\bar\gamma$ and $\bar\alpha$. In addition, we develop appropriate methods and theory in the situation where a generalized linear model (1), not just linear model (8), is used for outcome regression.

In this section, we assume that linear outcome model (8) is used together with logistic propensity score model (4), and develop theoretical results for the proposed estimator $\hat\mu^1(\hat m_{\rm RWL}, \hat\pi_{\rm RCAL})$, leading to Proposition 1 among others, in high-dimensional settings.

First we describe relevant results from Tan (2017) about the behavior of the regularized calibrated estimator $\hat\gamma_{\rm RCAL}$ in model (4). The tuning parameter $\lambda$ used in (10) for defining $\hat\gamma_{\rm RCAL}$ is specified as $\lambda = A_1 \lambda_0$, with a constant $A_1 > 1$ and

\lambda_0 = C_1 \sqrt{\log\{(1+p)/\epsilon\}/n},

where $C_1 > 0$ is a constant depending on $(C_0, B_0)$ from Assumption 1 below and $0 < \epsilon < 1$. Taking $\epsilon = 1/(1+p)$ gives $\lambda_0 = C_1 \sqrt{2\log(1+p)/n}$, a familiar rate in high-dimensional analysis.

With possible model misspecification, the target value $\bar\gamma_{\rm CAL}$ is defined as a minimizer of the expected calibration loss $E\{\ell_{\rm CAL}(\gamma)\}$ as in Section 3.1. From a functional perspective, we write $\ell_{\rm CAL}(\gamma) = \kappa_{\rm CAL}(\gamma^{\rm T} f)$, where for a function $h(x)$,

\kappa_{\rm CAL}(h) = \tilde E\big[ T {\rm e}^{-h(X)} + (1-T) h(X) \big].

As $\kappa_{\rm CAL}(h)$ is easily shown to be convex in $h$, the Bregman divergence associated with $\kappa_{\rm CAL}$ is defined such that for two functions $h(x)$ and $h'(x)$,

D_{\rm CAL}(h', h) = \kappa_{\rm CAL}(h') - \kappa_{\rm CAL}(h) - \langle \nabla\kappa_{\rm CAL}(h), h' - h \rangle,

where $h$ is identified as a vector $(h_1, \dots$
, $h_n)$ with $h_i = h(X_i)$, and $\nabla\kappa_{\rm CAL}(h)$ denotes the gradient of $\kappa_{\rm CAL}(h)$ with respect to $(h_1, \dots, h_n)$. The following result (Theorem 1) is restated from Tan (2017, Corollary 2), where the convergence of $\hat\gamma_{\rm RCAL}$ to $\bar\gamma_{\rm CAL}$ is obtained in the $L_1$ norm $\|\hat\gamma_{\rm RCAL} - \bar\gamma_{\rm CAL}\|_1$ and the symmetrized Bregman divergence

D^\dagger_{\rm CAL}(\hat h_{\rm RCAL}, \bar h_{\rm CAL}) = D_{\rm CAL}(\hat h_{\rm RCAL}, \bar h_{\rm CAL}) + D_{\rm CAL}(\bar h_{\rm CAL}, \hat h_{\rm RCAL}),

where $\hat h_{\rm RCAL}(X) = \hat\gamma_{\rm RCAL}^{\rm T} f(X)$ and $\bar h_{\rm CAL}(X) = \bar\gamma_{\rm CAL}^{\rm T} f(X)$. See Lemma 7 in the Supplementary Material for an explicit expression of $D^\dagger_{\rm CAL}$.

For a matrix $\Sigma$ with row indices $\{0, 1, \dots, k\}$, a compatibility condition (Buhlmann & van de Geer 2011) is said to hold with a subset $S \subset \{0, 1, \dots, k\}$ and constants $\nu_0 > 0$ and $\xi_0 > 1$ if

\nu_0^2 \Big( \sum_{j \in S} |b_j| \Big)^2 \le |S| (b^{\rm T} \Sigma b)

for any vector $b = (b_0, b_1, \dots, b_k)^{\rm T} \in {\mathbb R}^{1+k}$ satisfying

\sum_{j \notin S} |b_j| \le \xi_0 \sum_{j \in S} |b_j|.   (25)

Throughout, $|S|$ denotes the size of a set $S$. By the Cauchy–Schwartz inequality, this compatibility condition is implied by (hence weaker than) a restricted eigenvalue condition (Bickel et al. 2009) such that $\nu_0^2 \sum_{j \in S} b_j^2 \le b^{\rm T} \Sigma b$ for any $b \in {\mathbb R}^{1+k}$ satisfying (25).

Assumption 1.
Suppose that the following conditions are satisfied:

(i) $\max_{j=0,1,\dots,p} |f_j(X)| \le C_0$ almost surely for a constant $C_0 \ge 1$;

(ii) $\bar h_{\rm CAL}(X) \ge B_0$ almost surely for a constant $B_0 \in {\mathbb R}$, that is, $\pi(X; \bar\gamma_{\rm CAL})$ is bounded from below by $(1 + {\rm e}^{-B_0})^{-1}$;

(iii) the compatibility condition holds for $\Sigma_\gamma$ with the subset $S_\gamma = \{0\} \cup \{j: \bar\gamma_{{\rm CAL},j} \neq 0, j = 1, \dots, p\}$ and some constants $\nu_0 > 0$ and $\xi_0 > 1$, where $\Sigma_\gamma = E[T w(X; \bar\gamma_{\rm CAL}) f(X) f^{\rm T}(X)]$ is the Hessian of $E\{\ell_{\rm CAL}(\gamma)\}$ at $\gamma = \bar\gamma_{\rm CAL}$ and $w(X; \gamma) = {\rm e}^{-\gamma^{\rm T} f(X)}$;

(iv) $|S_\gamma| \lambda_0 \le \eta_0$ for a sufficiently small constant $\eta_0 > 0$, depending only on $(A_1, C_0, \xi_0, \nu_0)$.

Theorem 1 (Tan 2017). Suppose that Assumption 1 holds. Then for $A_1 > (\xi_0 + 1)/(\xi_0 - 1)$, we have with probability at least $1 - 2\epsilon$,

D^\dagger_{\rm CAL}(\hat h_{\rm RCAL}, \bar h_{\rm CAL}) + (A_1 - 1) \lambda_0 \|\hat\gamma_{\rm RCAL} - \bar\gamma_{\rm CAL}\|_1 \le M_0 |S_\gamma| \lambda_0^2,   (26)

where $M_0 > 0$ is a constant depending only on $(A_1, C_0, B_0, \xi_0, \nu_0, \eta_0)$.

Remark 1.
We provide comments about the conditions involved. First, Assumption 1(iii) can be justified from a compatibility condition for the Gram matrix E{f(X)fᵀ(X)} in conjunction with additional conditions such as, for some constant τ₀ > 0,

bᵀ E{f(X)fᵀ(X)} b ≤ (bᵀ Σ_γ b)/τ₀, ∀ b ∈ ℝ^{1+p}. (27)

For example, (27) holds provided that π*(X) is bounded from below by a positive constant and π(X; γ̄_CAL) is bounded away from 1. But it is also possible that Assumption 1(iii) is satisfied even if (27) does not hold for any τ₀ > 0. Therefore, Assumption 1 requires that π(X; γ̄_CAL) is bounded away from 0, but may not be bounded away from 1. Second, Assumption 1(iv) can be relaxed to only require that |S_γ| λ₀² is sufficiently small, albeit under stronger conditions, for example, the variables f₁(X), ..., f_p(X) are jointly (not just marginally) sub-gaussian (Huang & Zhang 2012; Negahban et al. 2012). On the other hand, Assumption 1(iv) is already weaker than the sparsity condition, |S_γ| log(p) = o(n^{1/2}), which is needed for obtaining valid confidence intervals for µ from existing works (Belloni et al. 2014) and our later results. Remark 2.
For the Hessian Σ_γ, the weight w(X; γ̄_CAL) with γ̄_CAL replaced by γ̂_RCAL is identical to that used in the weighted least-squares loss (13) to define α̂_RWL, that is, w(X; γ̂_RCAL) = {1 − π̂_RCAL(X)}/π̂_RCAL(X). The Hessian of ℓ_CAL(γ) at γ̄_CAL is also the same as the Hessian of ℓ_WL(α; γ̄_CAL) in α. As later discussed in Section 3.4, this coincidence is a consequence of the construction of the loss functions ℓ_CAL(γ) and ℓ_WL(α; γ) in (11) and (13).

Now we turn to the regularized weighted least-squares estimator α̂_RWL. We develop a new strategy of inverting a quadratic inequality to address the dependency of α̂_RWL on γ̂_RCAL and establish convergence of α̂_RWL under similar conditions as needed for Lasso penalized unweighted least-squares estimators in high-dimensional settings. The error bound obtained, however, depends on the sparsity size |S_γ| and various constants in Assumption 1.

For theoretical analysis, the tuning parameter λ used in (12) for defining α̂_RWL is specified as λ = A₁λ₁, with a constant A₁ > 1 and

λ₁ = max[λ₀, e^{−B₀} C₀ √(D₀² + D₁²) √(log{(1 + p)/ε}/n)],

where 0 < ε < 1, (C₀, B₀) are from Assumption 1, and (D₀, D₁) are from Assumption 2 below. With possible model misspecification, the target value ᾱ_WL is defined as a minimizer of the expected loss E{ℓ_WL(α; γ̄_CAL)} as in Section 3.1. The following result gives the convergence of α̂_RWL to ᾱ_WL in the L₁ norm ‖α̂_RWL − ᾱ_WL‖₁ and the weighted (in-sample) prediction error defined as

Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) = Ẽ[T w(X; γ̄_CAL){m̂_RWL(X) − m̄_WL(X)}²], (28)

where m̂_RWL(X) = α̂ᵀ_RWL f(X) and m̄_WL(X) = ᾱᵀ_WL f(X). In fact, Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) is the symmetrized Bregman divergence between m̂_RWL(X) and m̄_WL(X) associated with the loss κ_WL(h; γ̄_CAL) = Ẽ[T w(X; γ̄_CAL){Y − h(X)}²]/2. See Section 3.3 for further discussion.
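With a linear outcome model, minimizing ℓ_WL(α; γ̂_RCAL) + λ‖α_{1:p}‖₁ is an ordinary Lasso problem once the nonnegative weights T_i w(X_i; γ̂_RCAL) are absorbed into the design. A minimal numerical sketch, not the authors' implementation: a plain proximal-gradient (ISTA) solver on simulated data, with the intercept penalized for simplicity and an arbitrary positive weight vector standing in for T w(X; γ̂_RCAL):

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso(F, y, w, lam, n_iter=5000):
    """Minimize (1/n) sum_i w_i (y_i - F_i @ alpha)^2 / 2 + lam * ||alpha||_1
    by proximal gradient descent (ISTA)."""
    n, d = F.shape
    sw = np.sqrt(w)
    Fw, yw = F * sw[:, None], y * sw        # absorb the weights into the design
    L = np.linalg.norm(Fw, 2) ** 2 / n      # Lipschitz constant of the smooth part
    alpha = np.zeros(d)
    for _ in range(n_iter):
        grad = Fw.T @ (Fw @ alpha - yw) / n
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

rng = np.random.default_rng(0)
n, d = 400, 20
F = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # f(X) with intercept
alpha_true = np.zeros(d)
alpha_true[:3] = [1.0, 2.0, -1.5]
y = F @ alpha_true + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)           # stand-in for T * w(X; gamma_hat)
alpha_hat = weighted_lasso(F, y, w, lam=0.05)
```

The active-set algorithm referenced in Section 4 serves the same purpose far more efficiently; ISTA is shown here only because it takes a few lines.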
Assumption 2.
Suppose that the following conditions are satisfied: (i) Y − m̄_WL(X) is uniformly sub-gaussian given X: D₀² E(exp[{Y − m̄_WL(X)}²/D₀²] − 1 | X) ≤ D₁² for some positive constants (D₀, D₁); (ii) the compatibility condition holds for Σ_γ with the subset S_α = {0} ∪ {j : ᾱ_WL,j ≠ 0, j = 1, ..., p} and some constants ν₁ > 0 and ξ₁ > 1; (iii) (1 + ξ₁)² ν₁^{−2} |S_α| λ₁ ≤ η₁ for a constant 0 < η₁ < 1. Theorem 2.
Suppose that linear outcome model (8) is used, A₀ > (ξ₀ + 1)/(ξ₀ − 1), A₁ > (ξ₁ + 1)/(ξ₁ − 1), and Assumptions 1 and 2 hold. If log{(1 + p)/ε}/n ≤ 1, then we have with probability at least 1 − ε,

Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) + e^{−η₁}(A₁ − 1)λ₁ ‖α̂_RWL − ᾱ_WL‖₁ ≤ e^{η₁} ξ₂^{−1} (M₁ |S_γ| λ₀²) + e^{η₁} ξ₃ (ν₂^{−2} |S_α| λ₁²), (29)

where ξ₂ = 1 − A₁/{(ξ₁ + 1)(A₁ − 1)} ∈ (0, 1), ξ₃ = (ξ₁ + 1)(A₁ − 1), and ν₂² = ν₁²(1 − η₁)/2, depending only on (A₁, ξ₁, ν₁, η₁), and M₁ is a constant depending only on (A₀, C₀, B₀, ξ₀, ν₀, η₀) in Theorem 1 and (D₀, D₁).

Remark 3. Assumption 2(ii) concerns the same matrix Σ_γ as in Assumption 1(iii), but with the sparsity subset S_α from ᾱ_WL instead of S_γ from γ̄_CAL. The matrix Σ_γ is also the Hessian of the expected loss E{ℓ_WL(α; γ̄_CAL)} at α = ᾱ_WL, for reasons mentioned in Remark 2. Assumptions 2(ii)–(iii) are combined to derive a compatibility condition for the sample matrix Σ̃_γ = Ẽ[T w(X; γ̄_CAL) f(X) fᵀ(X)]. Assumption 2(iii) can be relaxed such that |S_α| λ₁ is sufficiently small under further side conditions, but it is already weaker than the sparsity condition, |S_α| log(p) = o(n^{1/2}), later needed for valid confidence intervals for µ. Essentially, the conditions in Assumption 2 are comparable to those for high-dimensional analysis of standard Lasso estimators (Bickel et al. 2009; Buhlmann & van de Geer 2011). Remark 4.
One of the key steps in our proof is to upper-bound the product

(α̂_RWL − ᾱ_WL)ᵀ Ẽ[T w(X; γ̂_RCAL){Y − m̄_WL(X)} f(X)]. (30)

If γ̂_RCAL were replaced by γ̄_CAL, then it is standard to use the following bound,

(α̂_RWL − ᾱ_WL)ᵀ Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X)] (31)
≤ ‖α̂_RWL − ᾱ_WL‖₁ × ‖Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X)]‖_∞.

To handle the dependency on γ̂_RCAL, our strategy is to derive an upper bound on the difference between (30) and (31), depending on Q_WL(m̂_RWL, m̄_WL; γ̄_CAL), which we seek to control. Carrying this bound leads to a quadratic inequality in Q_WL(m̂_RWL, m̄_WL; γ̄_CAL), which can be inverted to obtain an explicit bound on Q_WL(m̂_RWL, m̄_WL; γ̄_CAL). The resulting error bound (29) is of order (|S_γ| + |S_α|) log(p)/n, much sharper than what we could obtain using other approaches, for example, directly bounding ‖Ẽ[T w(X; γ̂_RCAL){Y − m̄_WL(X)} f(X)]‖_∞.

Finally, we study the proposed estimator µ̂(m̂_RWL, π̂_RCAL) for µ, depending on the regularized estimators γ̂_RCAL and α̂_RWL from logistic propensity score model (4) and linear outcome regression model (8). The following result gives an error bound for µ̂(m̂_RWL, π̂_RCAL), allowing that both models (4) and (8) may be misspecified.
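The inversion step can be made explicit. Schematically, with generic constants a, b ≥ 0 standing in for the quantities arising in the proof (not the paper's exact constants), if Q = Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) satisfies a self-bounding inequality Q ≤ a + b Q^{1/2}, then completing the square yields an explicit bound:

```latex
% Schematic inversion of a quadratic inequality (illustrative constants a, b >= 0):
% Q <= a + b Q^{1/2} implies (Q^{1/2} - b/2)^2 <= a + b^2/4, hence
% Q^{1/2} <= b/2 + (a + b^2/4)^{1/2}, and by (x + y)^2 <= 2x^2 + 2y^2,
% Q <= 2a + b^2.
\[
Q \le a + b\,Q^{1/2}
\;\Longrightarrow\;
Q^{1/2} \le \frac{b}{2} + \Bigl(a + \frac{b^2}{4}\Bigr)^{1/2}
\;\Longrightarrow\;
Q \le 2a + b^2 .
\]
```

Applied with a of order (|S_γ| + |S_α|) log(p)/n and b of the same order in the square root, this is how the stated rate for Q_WL is recovered after inversion.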
Theorem 3.
Under the conditions of Theorem 2, if log{(1 + p)/ε}/n ≤ 1, then we have with probability at least 1 − ε,

|µ̂(m̂_RWL, π̂_RCAL) − µ̂(m̄_WL, π̄_CAL)| ≤ M₂ |S_γ| λ₀² + M₃ |S_γ| λ₀λ₁ + M₄ |S_α| λ₁², (32)

where M₂, M₃, and M₄ are positive constants depending only on the constants in Theorems 1–2, and M₅ is a constant such that the right-hand side of (29) in Theorem 2 is upper-bounded by e^{η₁} M₅ (|S_γ| λ₀λ₁ + |S_α| λ₁²).

Theorem 3 shows that µ̂(m̂_RWL, π̂_RCAL) is doubly robust for µ provided (|S_γ| + |S_α|) λ₁² = o(1), that is, (|S_γ| + |S_α|) log(p) = o(n). In addition, Theorem 3 gives the n^{−1/2} asymptotic expansion (19) provided n^{1/2}(|S_γ| + |S_α|) λ₁² = o(1), that is, (|S_γ| + |S_α|) log(p) = o(n^{1/2}).

To obtain valid confidence intervals for µ via the Slutsky theorem, the following result gives the convergence of the variance estimator V̂ to V, as defined in Proposition 1, allowing that both models (4) and (8) may be misspecified. For notational simplicity, denote φ̂ = φ(T, Y, X; m̂_RWL, π̂_RCAL) and φ̂_c = φ̂ − µ̂(m̂_RWL, π̂_RCAL) such that V̂ = Ẽ(φ̂_c²). Similarly, denote φ̄ = φ(T, Y, X; m̄_WL, π̄_CAL) and φ̄_c = φ̄ − µ̂(m̄_WL, π̄_CAL) such that V = E(φ̄_c²). Theorem 4.
Under the conditions of Theorem 2, if log{(1 + p)/ε}/n ≤ 1, then we have with probability at least 1 − ε,

|Ẽ(φ̂_c² − φ̄_c²)| ≤ M₆ {Ẽ(φ̄_c²)}^{1/2} (|S_γ| λ₀ + |S_α| λ₁) + M₆ (|S_γ| λ₀ + |S_α| λ₁)², (33)

where M₆ is a positive constant depending only on (A₀, C₀, B₀, ξ₀, ν₀, η₀) in Theorem 1 and (A₁, D₀, D₁, ξ₁, ν₁, η₁) in Theorem 2. If, in addition, condition (27) holds, then we have with probability at least 1 − ε,

|Ẽ(φ̂_c² − φ̄_c²)| ≤ M₇ {Ẽ(φ̄_c²)}^{1/2} (|S_γ| λ₀λ₁ + |S_α| λ₁²)^{1/2} + M₇ (|S_γ| λ₀λ₁ + |S_α| λ₁²), (34)

where M₇ is a positive constant, depending on τ₀ from (27) as well as (A₀, C₀, B₀, ξ₀, ν₀, η₀) and (A₁, D₀, D₁, ξ₁, ν₁, η₁). Remark 5.
Theorem 4 provides two rates of convergence for V̂ under different conditions. Inequality (33) shows that V̂ is a consistent estimator of V, that is, V̂ − V = o_p(1), provided (|S_γ| + |S_α|)(log p)^{1/2} = o(n^{1/2}). Technically, consistency of V̂ is sufficient for applying the Slutsky theorem to establish confidence intervals for µ in Proposition 1(iii). With the additional condition (27), inequality (34) shows that V̂ achieves the parametric rate of convergence, V̂ − V = o_p(n^{−1/2}), provided (|S_γ| + |S_α|) log(p) = o(n^{1/2}). Remark 6.
Combining Theorems 3–4 directly leads to Proposition 1, which gives doubly robust confidence intervals for µ. In addition, a broader interpretation can also be accommodated. All the results, Theorems 1–4, are developed to remain valid in the presence of misspecification of models (4) and (8), similarly as in classical theory of estimation with misspecified models (e.g., White 1982; Manski 1988). If both models (4) and (8) may be misspecified, then µ̂(m̂_RWL, π̂_RCAL) ± z_{c/2} (V̂/n)^{1/2} is an asymptotic (1 − c) confidence interval for the target value µ̄ = E(φ̄), which in general differs from the true value µ. By comparison, the standard estimator µ̂(m̂_RML, π̂_RML) can be shown to converge to a target value, different from µ as well as µ̄ in the presence of model misspecification. But it seems difficult to obtain valid confidence intervals based on µ̂(m̂_RML, π̂_RML) under similar conditions as in our results, because (21) and (22) are then O_p({log(p)/n}^{1/2}) if either model (4) or (8) is misspecified.

In this section, we turn to the situation where a generalized linear model is used for outcome regression together with a logistic propensity score model, and develop appropriate methods and theory for obtaining confidence intervals for µ in high-dimensional settings. A technical complication compared with the situation of a linear outcome model in Section 3.2 is that the reasoning outlined through (20)–(24) for deriving doubly robust confidence intervals for µ does not directly hold with a non-linear outcome model, where the second term of (20) does not in general reduce to the simple product in (21). There are, however, different approaches that can be used to derive model-assisted confidence intervals, that is, satisfying either property (G2) or (G3) described in Section 3.1.
For concreteness, we focus on a PS based, OR assisted approach to obtain confidence intervals with property (G2), that is, being valid if the propensity score model used is correctly specified but the outcome regression model may be misspecified. See Section 3.4 for further discussion of related issues.

Consider a logistic propensity score model (4) and a generalized linear outcome model,

E(Y | T = 1, X) = m(X; α) = ψ{αᵀf(X)}, (35)

that is, model (1) with the vector of covariate functions g(X) taken to be the same as f(X) in model (4). This choice of covariate functions can be more justified than in the setting of Section 3.2, because OR model (35) plays an assisting role when confidence intervals for µ are concerned. Our point estimator of µ is µ̂(m̂_RWL, π̂_RCAL) as defined in (9), where π̂_RCAL(X) = π(X; γ̂_RCAL) and m̂_RWL(X) = m(X; α̂_RWL). The estimator γ̂_RCAL is a regularized calibrated estimator of γ from Tan (2017) as in Section 3.2. But α̂_RWL is a regularized weighted likelihood estimator of α, defined as a minimizer of

ℓ_RWL(α; γ̂_RCAL) = ℓ_WL(α; γ̂_RCAL) + λ‖α_{1:p}‖₁, (36)

where ℓ_WL(α; γ̂_RCAL) is the weighted likelihood loss as follows, with w(X; γ) = {1 − π(X; γ)}/π(X; γ) = e^{−γᵀf(X)} for logistic model (4),

ℓ_WL(α; γ̂_RCAL) = Ẽ(T w(X; γ̂_RCAL)[−Y αᵀf(X) + Ψ{αᵀf(X)}]), (37)

‖α_{1:p}‖₁ is the L₁ norm of α_{1:p}, and λ ≥ 0. The estimator α̂_RWL used in Section 3.2 is recovered in the special case of the identity link, ψ(u) = u and Ψ(u) = u²/2. In addition, the Karush–Kuhn–Tucker condition for minimizing (36) remains the same as in (16)–(17), and hence the estimator µ̂(m̂_RWL, π̂_RCAL) can be put in the prediction form (18), which ensures the boundedness property that µ̂(m̂_RWL, π̂_RCAL) always falls in the range of the observed outcomes Y_i in the treated group (T_i = 1) and the predicted values m̂_RWL(X_i) in the untreated group (T_i = 0).

With possible model misspecification, the target value ᾱ_WL is defined as a minimizer of the expected loss E{ℓ_WL(α; γ̄_CAL)}. From a functional perspective, we write ℓ_WL(α; γ) = κ_WL(αᵀf; γ), where for a function h(x) which may not be of the form αᵀf,

κ_WL(h; γ) = Ẽ(T w(X; γ)[−Y h(X) + Ψ{h(X)}]).

As κ_WL(h; γ) is convex in h by the convexity of Ψ(), the Bregman divergence associated with κ_WL(h; γ) is defined as

D_WL(h′, h; γ) = κ_WL(h′; γ) − κ_WL(h; γ) − ⟨∇κ_WL(h; γ), h′ − h⟩,

where ∇κ_WL(h; γ) denotes the gradient of κ_WL(h; γ) with respect to (h₁, ..., h_n) with h_i = h(X_i). The symmetrized Bregman divergence is easily shown to be

D†_WL(h′, h; γ) = D_WL(h′, h; γ) + D_WL(h, h′; γ) = Ẽ(T w(X; γ)[ψ{h′(X)} − ψ{h(X)}]{h′(X) − h(X)}). (38)

The following result establishes the convergence of α̂_RWL to ᾱ_WL in the L₁ norm ‖α̂_RWL − ᾱ_WL‖₁ and the symmetrized Bregman divergence D†_WL(ĥ_RWL, h̄_WL; γ̄_CAL), where ĥ_RWL(X) = α̂ᵀ_RWL f(X) and h̄_WL(X) = ᾱᵀ_WL f(X). In the case of the identity link, ψ(u) = u, the symmetrized Bregman divergence D†_WL(ĥ_RWL, h̄_WL; γ̄_CAL) becomes Q_WL(m̂_RWL, m̄_WL; γ̄_CAL) in (28). Inequality (39) also reduces to (29) in Theorem 2 with the choices C₂ = 1 and C₃ = η₄ = η₅ = 0. Assumption 3.
Assume that ψ () is differentiable and denote ψ ( u ) = d ψ ( u ) / d u . Supposethat the following conditions are satisfied: i) ψ { ¯ h WL ( X ) } ≤ C almost surely for a constant C > ;(ii) ψ { ¯ h WL ( X ) } ≥ C almost surely for a constant C > ;(iii) ψ ( u ) ≤ ψ ( u ′ )e C | u − u ′ | for any ( u, u ′ ) , where C ≥ is a constant.(iv) C C ( A − − ξ ν − C − | S α | λ ≤ η for a constant ≤ η < and C C e η ( A − − ξ − C − ( M | S γ | λ ) ≤ η for a constant ≤ η < , where ( η , ν , ξ , ξ , M ) areas in Theorem 2. Theorem 5.
Suppose that Assumptions 1, 2, and 3(ii)–(iv) hold. If log { (1 + p ) /ǫ } /n ≤ ,then for A > ( ξ + 1) / ( ξ − and A > ( ξ + 1) / ( ξ − , we have with probability at least − ǫ , D † WL ( ˆ m RWL , ¯ m WL ) + e η ( A − λ k ˆ α RWL − ¯ α WL k ≤ e η ξ − (cid:0) M | S γ | λ (cid:1) + e η ξ (cid:0) ν − | S α | λ (cid:1) , (39) where ξ = ξ (1 − η ) / C / , ν = ν / (1 − η ) / C / , and ( η , ν , ξ , ξ , M ) are as inTheorem 2. Remark 7.
We discuss the conditions involved in Theorem 5. Assumption 3(i) is not needed, but will be used in later results. Assumption 3(iii), adapted from Huang & Zhang (2012), is used along with Assumption 1(i) to bound the curvature of D†_WL(h′, h; γ̄_CAL) and then with Assumption 3(iv) to achieve a localized analysis when handling a non-quadratic loss function. Assumption 3(ii) is used for two distinct purposes. First, it is combined with Assumptions 2(ii)–(iii) to yield a compatibility condition for Σ̃_α = Ẽ[T w(X; γ̄_CAL) ψ₁{h̄_WL(X)} f(X) fᵀ(X)], which is the sample version of the Hessian of the expected loss E{ℓ_WL(α; γ̄_CAL)} at α = ᾱ_WL, that is, Σ_α = E[T w(X; γ̄_CAL) ψ₁{h̄_WL(X)} f(X) fᵀ(X)]. Second, Assumption 3(ii) is also used in deriving a quadratic inequality to be inverted in our strategy to deal with the dependency of α̂_RWL on γ̂_RCAL as mentioned in Remark 4. As seen from the proofs in the Supplementary Material, similar results as in Theorem 5 can be obtained with Assumption 3(ii) replaced by the weaker condition that for some constant τ₁ > 0,

bᵀ Σ_γ b ≤ (bᵀ Σ_α b)/τ₁, ∀ b ∈ ℝ^{1+p},

provided that the condition on A₁ and Assumption 3(iv) are modified accordingly, depending on τ₁. This extension is not pursued here for simplicity.

Now we study the proposed estimator µ̂(m̂_RWL, π̂_RCAL) for µ, with the regularized estimators γ̂_RCAL and α̂_RWL obtained using logistic propensity score model (4) and generalized linear outcome model (35). Theorem 6 gives an error bound for µ̂(m̂_RWL, π̂_RCAL), allowing that both models (4) and (35) may be misspecified, but depending on additional terms in the presence of misspecification of model (4). Denote h(X; α) = αᵀf(X) and for r ≥ 0,

Λ₁(r) = sup_{j=0,1,...,p; ‖α−ᾱ_WL‖₁≤r} |E[ψ₁{h(X; α)} f_j(X) {T/π̄_CAL(X) − 1}]|.
As a special case, the quantity Λ₁(0) is defined as

Λ₁ = sup_{j=0,1,...,p} |E[ψ₁{h̄_WL(X)} f_j(X) {T/π̄_CAL(X) − 1}]|.

By the definition of γ̄_CAL, it holds that E[{T/π̄_CAL(X) − 1} f_j(X)] = 0 for j = 0, 1, ..., p whether or not model (4) is correctly specified. But Λ₁(r) is in general either zero or positive respectively if model (4) is correctly specified or misspecified, except in the case of linear outcome model (8) where Λ₁(r) is automatically zero because ψ₁() is constant. Theorem 6.
Suppose that Assumptions 1, 2, and 3 hold. If log { (1 + p ) /ǫ } /n ≤ , then for A > ( ξ + 1) / ( ξ − and A > ( ξ + 1) / ( ξ − , we have with probability at least − ǫ , (cid:12)(cid:12) ˆ µ ( ˆ m RWL , ˆ π RCAL ) − ˆ µ ( ¯ m WL , ¯ π CAL ) (cid:12)(cid:12) ≤ M | S γ | λ + M | S γ | λ λ + M | S α | λ λ + η Λ ( η ) , (40) where M , M , and M are positive constants, depending only on ( A , C , B , ξ , ν , η ) , ( A , D , D , ξ , ν , η ) , and ( C , C , C , η , η ) , η = ( A − − M ( | S γ | λ + | S α | λ ) , and M isa constant such that the right hand side of (39) is upper-bounded by e η M ( | S γ | λ λ + | S α | λ ) .If, in addition, condition (27) holds, then we have with probability at least − ǫ , (cid:12)(cid:12) ˆ µ ( ˆ m RWL , ˆ π RCAL ) − ˆ µ ( ¯ m WL , ¯ π CAL ) (cid:12)(cid:12) ≤ M | S γ | λ + M | S γ | λ λ + M | S α | λ λ + η Λ , (41) where M , M , and M are positive constants, also depending on τ from (27). Remark 8.
Two different error bounds are obtained in Theorem 6. Because Λ₁(r) ≥ Λ₁ for any r ≥ 0, the error bound (41) is tighter than (40), but with the additional condition (27), which requires that the generalized eigenvalues of Σ_γ relative to the Gram matrix E{f(X)fᵀ(X)} are bounded away from 0. In either case, the result shows that µ̂(m̂_RWL, π̂_RCAL) is doubly robust for µ provided (|S_γ| + |S_α|)λ₁ = o(1), that is, (|S_γ| + |S_α|)(log p)^{1/2} = o(n^{1/2}). In addition, the error bounds imply that µ̂(m̂_RWL, π̂_RCAL) admits the n^{−1/2} asymptotic expansion (19) provided (|S_γ| + |S_α|) log(p) = o(n^{1/2}), when PS model (4) is correctly specified but OR model (35) may be misspecified, because the term involving Λ₁(r) or Λ₁ vanishes as discussed above. Unfortunately, expansion (19) may fail when PS model (4) is misspecified.

Similarly as Theorem 4, the following result establishes the convergence of V̂ to V as defined in Proposition 1, allowing that both models (4) and (35) may be misspecified. Theorem 7.
Under the conditions of Theorem 6, if log { (1 + p ) /ǫ } /n ≤ , then we have withprobability at least − ǫ , (cid:12)(cid:12)(cid:12) ˜ E (cid:0) ˆ ϕ c − ¯ ϕ c (cid:1)(cid:12)(cid:12)(cid:12) ≤ M { ˜ E ( ¯ ϕ c ) } / { ( η ) } ( | S γ | λ + | S α | λ )+ M { ( η ) } ( | S γ | λ + | S α | λ ) , (42) where M is a positive constant depending only on ( A , C , B , ξ , ν , η ) , ( A , D , D , ξ , ν , η ) ,and ( C , C , C , η , η ) . If, in addition, condition (27) holds, then we have with probability atleast − ǫ , (cid:12)(cid:12)(cid:12) ˜ E (cid:0) ˆ ϕ c − ¯ ϕ c (cid:1)(cid:12)(cid:12)(cid:12) ≤ M { ˜ E ( ¯ ϕ c ) } / n ( | S γ | λ λ + | S α | λ ) / + Λ ( | S γ | λ + | S α | λ ) o + M (cid:8) ( | S γ | λ λ + | S α | λ ) + Λ ( | S γ | λ + | S α | λ ) (cid:9) , (43) where M is a positive constant, similar to M but also depending on τ from (27). Remark 9.
Two different rates of convergence are obtained for V̂ in Theorem 7. Similarly as discussed in Remark 5, if (|S_γ| + |S_α|)(log p)^{1/2} = o(n^{1/2}), then inequality (42) implies the consistency of V̂ for V, which is sufficient for applying the Slutsky theorem to establish confidence intervals for µ. With the additional condition (27), inequality (43) gives a faster rate of convergence for V̂, which is of order n^{−1/2} provided (|S_γ| + |S_α|) log(p) = o(n^{1/2}). Combining Theorems 6–7 leads to the following result. Proposition 2.
Suppose that Assumptions 1, 2, and 3 hold, and (|S_γ| + |S_α|) log(p) = o(n^{1/2}). For sufficiently large constants A₀ and A₁, if logistic PS model (4) is correctly specified but OR model (35) may be misspecified, then (i)–(iii) in Proposition 1 hold. That is, a PS based, OR assisted confidence interval for µ is obtained.

Remark 10. The conclusion of Proposition 2 remains valid if PS model (4) is misspecified but only locally such that Λ₁(η) = O({log(p)/n}^{1/2}) or Λ₁ = O({log(p)/n}^{1/2}), in the case of the error bound (40) or (41). Therefore, µ̂(m̂_RWL, π̂_RCAL) ± z_{c/2} (V̂/n)^{1/2} can be interpreted as an asymptotic (1 − c) confidence interval for the target value µ̄ = E(φ̄) if model (4) is at most locally misspecified but model (35) may be arbitrarily misspecified. It is an interesting open problem to find broadly valid confidence intervals in the presence of model misspecification similarly as discussed in Remark 6 when a linear outcome model is used. Estimation of ATE.
Our theory and methods are presented mainly for estimation of µ¹, but they can be directly extended to estimating µ⁰ and hence ATE, that is, µ¹ − µ⁰. Consider a logistic propensity score model (4) and a generalized linear outcome model,

E(Y | T = 0, X) = m⁰(X; α⁰) = ψ{α⁰ᵀf(X)}, (44)

where f(X) is the same vector of covariate functions as in the model (4) and α⁰ is a vector of unknown parameters. Our point estimator of ATE is µ̂(m̂_RWL, π̂_RCAL) − µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL), and that of µ⁰ is

µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = Ẽ{φ(Y, 1−T, X; m̂⁰_RWL, 1−π̂⁰_RCAL)},

where φ() is defined in (7), π̂⁰_RCAL(X) = π(X; γ̂⁰_RCAL), m̂⁰_RWL(X) = m⁰(X; α̂⁰_RWL), and γ̂⁰_RCAL and α̂⁰_RWL are defined as follows. The estimator γ̂⁰_RCAL is defined similarly as γ̂_RCAL, but with the loss function ℓ_CAL(γ) in (11) replaced by

ℓ⁰_CAL(γ) = Ẽ{(1−T) e^{γᵀf(X)} − T γᵀf(X)},

that is, T and γ in ℓ_CAL(γ) are replaced by 1−T and −γ. The estimator α̂⁰_RWL is defined similarly as α̂_RWL, but with the loss function ℓ_WL(·; γ̂_RCAL) in (37) replaced by

ℓ⁰_WL(α⁰; γ̂⁰_RCAL) = Ẽ((1−T) w⁰(X; γ̂⁰_RCAL)[−Y α⁰ᵀf(X) + Ψ{α⁰ᵀf(X)}]),

where w⁰(X; γ) = π(X; γ)/{1 − π(X; γ)} = e^{γᵀf(X)}. Under similar conditions as in Propositions 1 and 2, the estimator µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) admits the asymptotic expansion

µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = Ẽ{φ(Y, 1−T, X; m̄⁰_WL, 1−π̄⁰_CAL)} + o_p(n^{−1/2}), (45)

where π̄⁰_CAL(X) = π(X; γ̄⁰_CAL), m̄⁰_WL(X) = m⁰(X; ᾱ⁰_WL), and γ̄⁰_CAL and ᾱ⁰_WL are the target values defined similarly as γ̄_CAL and ᾱ_WL.
Then Wald confidence intervals for µ⁰ and ATE can be derived from (19) and (45) similarly as in Propositions 1 and 2 and shown to be either doubly robust in the case of linear outcome models, or valid if PS model (4) is correctly specified but OR models (35) and (44) may be misspecified for nonlinear ψ().

An unusual aspect of our approach is that two different estimators of the propensity score are used when estimating µ¹ and µ⁰. On one hand, the estimators γ̂_RCAL and γ̂⁰_RCAL are both consistent, and hence there is no self-contradiction at least asymptotically, when PS model (4) is correctly specified. On the other hand, if model (4) is misspecified, the two estimators may in general have different asymptotic limits, which can be an advantage from the following perspective. By definition, the augmented IPW estimators of µ¹ and µ⁰ are obtained, depending on fitted propensity scores within the treated group and untreated group separately, that is, {π(X_i; γ¹) : T_i = 1} and {π(X_i; γ⁰) : T_i = 0}. In the presence of model misspecification, allowing different γ¹ and γ⁰ can be helpful in finding suitable approximations of the two sets of propensity scores, without being constrained by the then-false assumption that they are determined by the same coefficient vector γ¹ = γ⁰. Estimation of ATT.
There is a simple extension of our approach to estimation of ATT, that is, ν¹ − ν⁰ as defined in Section 2. The parameter ν¹ = E(Y¹ | T = 1) can be directly estimated by Ẽ(TY)/Ẽ(T). For ν⁰ = E(Y⁰ | T = 1), our point estimator is

ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = Ẽ{φ_ν(Y, T, X; m̂⁰_RWL, π̂⁰_RCAL)}/Ẽ(T),

where π̂⁰_RCAL(X) and m̂⁰_RWL(X) are the same fitted values as used in the estimator µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) for µ⁰, and φ_ν(·; m̂⁰, π̂⁰) is defined as

φ_ν(Y, T, X; m̂⁰, π̂⁰) = [(1−T) π̂⁰(X)/{1 − π̂⁰(X)}] Y − [(1−T)/{1 − π̂⁰(X)} − 1] m̂⁰(X).

The function φ_ν(·; m̂⁰, π̂⁰) can be derived by substituting the fitted values (m̂⁰, π̂⁰) for the true values (m⁰*, π*) in the efficient influence function of ν⁰ under a nonparametric model (Hahn 1998). In addition, the estimator Ẽ{φ_ν(Y, T, X; m̂⁰, π̂⁰)} is also doubly robust: it remains consistent for E(TY⁰) if either m̂⁰ = m⁰* or π̂⁰ = π*. In fact, by straightforward calculation, the function φ_ν() is related to φ() in (7) through the simple identity:

φ_ν(Y, T, X; m̂⁰, π̂⁰) = φ(Y, 1−T, X; m̂⁰, 1−π̂⁰) − (1−T)Y. (46)

As a result, ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) can be equivalently obtained as

ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) = [µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) − Ẽ{(1−T)Y}]/Ẽ(T) = Ẽ{T m̂⁰_RWL(X)}/Ẽ(T),

where the second step follows from a similar equation for µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL) as (18). Moreover, it can be shown using Eq. (46) that under similar conditions as in Propositions 1 and 2, the estimator ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) admits the asymptotic expansion

ν̂⁰(m̂⁰_RWL, π̂⁰_RCAL) − ν⁰ = Ẽ{φ_ν(Y, T, X; m̄⁰_WL, π̄⁰_CAL) − T ν⁰}/Ẽ(T) + o_p(n^{−1/2}),

similarly as (45) for µ̂⁰(m̂⁰_RWL, π̂⁰_RCAL).
From this expansion, Wald confidence intervals for ν and ATT can be derived and shown to be either doubly robust with linear OR model (44) orvalid at least when PS model (4) is correctly specified. Construction of loss functions.
We provide additional comments about the construction ofloss functions for γ and α and alternative approaches when using nonlinear outcome models.For a linear outcome model (8) as in Section 3.1, the loss functions ℓ CAL ( γ ) and ℓ WL ( α ; γ ) arederived such that their gradients satisfy (23)–(24), which are in turn obtained as the coefficientsfor ˆ α − ¯ α and ˆ γ − ¯ γ in the first-order terms (21)–(22) from the Taylor expansion (20) ofˆ µ ( ˆ m , ˆ π ). Combining the two steps, Eqs. (23)–(24) amount to choosing ∂ℓ CAL ( γ ) ∂γ = ∂∂α ˜ E (cid:2) ϕ { Y, T, X ; m ( · ; α ) , π ( · ; γ ) } (cid:3) , (47) ∂ℓ WL ( α ; γ ) ∂α = ∂∂γ ˜ E (cid:2) ϕ { Y, T, X ; m ( · ; α ) , π ( · ; γ ) } (cid:3) . (48)We say that the loss function ℓ CAL ( γ ) for γ in model (4) is calibrated against model (8),whereas ℓ WL ( α ; γ ) for α in model (8) is calibrated against model (4). The estimators ˆ γ RCAL and ˆ α RWL are called regularized calibrated estimators of γ and α respectively. The pair ofequations (47)–(48) also underlie the coincidence of the Hessian of ℓ CAL ( γ ) at ¯ γ CAL and that of ℓ WL ( α ; ¯ γ CAL ) in α with a linear outcome model, as mentioned in Remark 2.Previously, an augmented IPW estimator ˆ µ ( ˆ m , ˆ π ) for µ was proposed in low-dimensionalsettings by Kim & Haziza (2014) and Vermeulen & Vansteelandt (2015), where ( ˆ α , ˆ γ ) are non-penalized, defined by directly setting the right-hand sides of (47)–(48) to zero. One of theirmotivations is to enable simple calculation of confidence intervals without the need of correctingfor estimation of ( α , γ ). 
Our work generalizes these previous estimators to high-dimensional settings, where the motivation for using (α̂_RWL, γ̂_RCAL), instead of (α̂_RML, γ̂_RML), is mainly statistical: to reduce the variation caused by estimation of (α, γ) from O_p({log(p)/n}^{1/2}) to o_p(n^{−1/2}) for the estimator µ̂(m̂_RWL, π̂_RCAL), so that valid confidence intervals for µ can be obtained even in the presence of model misspecification.

For a possibly nonlinear outcome model (35), the augmented IPW estimator of µ in Kim & Haziza (2014) and Vermeulen & Vansteelandt (2015) is also defined as described above. However, the gradients from the right-hand sides of (47)–(48) become

∂/∂α Ẽ[φ{Y, T, X; m(·; α), π(·; γ)}] = Ẽ[{1 − T/π(X; γ)} ψ₁{αᵀf(X)} f(X)], (49)

∂/∂γ Ẽ[φ{Y, T, X; m(·; α), π(·; γ)}] = −Ẽ[T {1 − π(X; γ)}/π(X; γ) {Y − m(X; α)} f(X)], (50)

where ψ₁() denotes the derivative of ψ(). The pair of equations obtained by setting (49)–(50) to zero are intrinsically coupled, unless outcome model (35) is linear and hence the dependency of (49) on α vanishes. This complication, although mainly computational in low-dimensional settings, presents a statistical as well as computational obstacle to developing doubly robust confidence intervals with regularized estimation in high-dimensional settings.

The development in Section 3.3 involves using (23) instead of (49) but retaining (50), which leads to the loss functions ℓ_CAL(γ) in (11) and ℓ_WL(α; γ) in (37). The resulting confidence intervals are PS based, OR assisted, that is, being valid if PS model (4) is correctly specified but OR model (35) may be misspecified.
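The calibration relation (47) can be checked numerically in the linear-outcome case: the gradient of the calibration loss ℓ_CAL(γ) coincides with the derivative in α of the augmented IPW functional Ẽ[φ]. A small self-contained check on simulated data, with φ and ℓ_CAL written out from (7) and (11) as reconstructed above (the data and coefficient values are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 4
F = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # f(X) with intercept
T = rng.binomial(1, 0.5, n).astype(float)
Y = rng.normal(size=n)
alpha = rng.normal(size=d)       # arbitrary coefficients of the linear OR model
gamma = 0.3 * rng.normal(size=d)

def mu_hat(a, g):
    """AIPW functional: E_tilde{ TY/pi - (T/pi - 1) m }, logistic pi, linear m."""
    pi = 1.0 / (1.0 + np.exp(-F @ g))
    m = F @ a
    return np.mean(T * Y / pi - (T / pi - 1.0) * m)

# Gradient of the calibration loss ell_CAL(g) = E_tilde{ T e^{-g'f} + (1-T) g'f }
u = F @ gamma
grad_cal = F.T @ ((1 - T) - T * np.exp(-u)) / n

# d/d(alpha) of mu_hat by finite differences (exact up to roundoff: linear in alpha)
eps = 1e-6
grad_mu = np.array([(mu_hat(alpha + eps * np.eye(d)[j], gamma) - mu_hat(alpha, gamma)) / eps
                    for j in range(d)])

print(np.allclose(grad_cal, grad_mu, atol=1e-6))  # prints True
```

The agreement holds for any γ, not just at a solution, which is exactly what makes the first-order term (21) vanish when γ̂_RCAL solves the penalized calibration problem.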
Alternatively, it is possible to develop an OR based, PS assisted approach using the regularized maximum likelihood estimator α̂_RML in conjunction with a regularized estimator of γ based on a weighted calibration loss,

ℓ_WL(γ; α̂_RML) = Ẽ[ψ₁{α̂ᵀ_RML f(X)} {T e^{−γᵀf(X)} + (1−T) γᵀf(X)}]. (51)

The gradient of (51) in γ is (49), with α = α̂_RML. Similar results can be established as in Section 3.3, to provide valid confidence intervals for µ if OR model (35) is correctly specified but PS model (4) may be misspecified. This work can be pursued elsewhere.

We present a simulation study with the design of Kang & Schafer (2007) modified and extended to high-dimensional, sparse settings. It is of interest to empirically compare µ̂(m̂_RML, π̂_RML) and µ̂(m̂_RWL, π̂_RCAL) and their associated confidence intervals.

In our implementation, the penalized loss function (3) or (6) for computing α̂_RML or γ̂_RML, or (10), (12), or (36) for computing α̂_RWL or γ̂_RCAL, is minimized for a fixed tuning parameter λ, using algorithms similar to those in Friedman et al. (2010), but with the coordinate descent method replaced by an active set method as in Osborne et al. (2000) for solving each Lasso penalized least-squares problem. In addition, the penalized loss (10) for computing γ̂_RCAL is minimized using the algorithm in Tan (2017), where a nontrivial Fisher scoring step is involved for quadratic approximation. The tuning parameter λ is determined using 5-fold cross validation based on the corresponding loss function as follows. For k = 1, . . . ,
5, let I_k be a random subsample of size n/5 from {1, 2, . . . , n}. For a loss function ℓ(γ), either ℓ_ML(γ) in (5) or ℓ_CAL(γ) in (11), denote by ℓ(γ; I) the loss function obtained when the sample average Ẽ() is computed over only the subsample I. The 5-fold cross-validation criterion is defined as

CV₅(λ) = (1/5) Σ_{k=1}^{5} ℓ(γ̂_λ^{(k)}; I_k),

where γ̂_λ^{(k)} is a minimizer of the penalized loss ℓ(γ; I_k^c) + λ‖γ_{1:p}‖₁ over the subsample I_k^c of size 4n/
5, i.e., the complement of I_k. Then λ is selected by minimizing CV₅(λ) over the discrete set {λ*/2^j : j = 0, 1, . . .}, where for π̂ = Ẽ(T), the value λ* is computed as either

λ* = max_{j=1,...,p} |Ẽ{(T − π̂) f_j(X)}|

when the likelihood loss (5) is used, or

λ* = max_{j=1,...,p} |Ẽ{(T/π̂ − 1) f_j(X)}|

when the calibration loss (11) is used. It can be shown that in either case, the penalized loss ℓ(γ) + λ‖γ_{1:p}‖₁ over the original sample has a minimum at γ_{1:p} = 0 for all λ ≥ λ*.

For computing α̂_RML or α̂_RWL, cross validation is conducted similarly as above, using the loss function ℓ_ML(α) in (2) or ℓ_WL(α; γ̂_RCAL) in (37). In the latter case, γ̂_RCAL is determined separately and then fixed during cross validation for computing α̂_RWL.

Let X = (X₁, . . . , X_p) be independent variables, where each X_j is N(0,
1) truncated to the interval (−2.5, 2.
5) and then standardized to have mean 0 and variance 1. In addition, let X† = (X†₁, . . . , X†_p), where X†_j = X_j for j = 5, . . . , p, and X†₁, X†₂, X†₃, and X†₄ are standardized versions of exp(0.5X₁), 10 + {1 + exp(X₁)}^{−1}X₂, (0.04X₁X₃ + 0.6)³, and (X₂ + X₄ + 20)², to have means 0 and variances 1. The truncation of X_j prevents propensity scores arbitrarily close to 0, and ensures that the mapping between X and X† is strictly one-to-one. See the Supplementary Material for the calculation used to perform the standardization and for scatter plots of (X†₁, . . . , X†₄). Consider the following data-generating configurations.

(C1) Generate T given X from a Bernoulli distribution with

P(T = 1|X) = [1 + exp{−(X†₁ − 0.5X†₂ + 0.25X†₃ + 0.125X†₄)}]^{−1},

and, independently, generate Y given X from a Normal distribution with variance 1 and mean either (“Linear outcome configuration 1”)

E(Y|X) = X†₁ + 0.5X†₂ + 0.25X†₃ + 0.125X†₄,

or (“Linear outcome configuration 2”)

E(Y|X) = 0.25X†₁ + 0.25X†₂ + 0.25X†₃ + 0.25X†₄.

The main difference between the two outcome configurations is that X†₁ is both the most important variable influencing the propensity score and the most important variable influencing the outcome regression function in the first configuration.

(C2) Generate T given X as in (C1), but, independently, generate Y given X from a Normal distribution with variance 1 and mean either (“Linear outcome configuration 1”)

E(Y|X) = X₁ + 0.5X₂ + 0.25X₃ + 0.125X₄,

or (“Linear outcome configuration 2”)

E(Y|X) = 0.25X₁ + 0.25X₂ + 0.25X₃ + 0.25X₄.

As X and X† are monotone transformations of each other, the variable X†₁ remains roughly both the most important variable influencing the propensity score and that influencing the outcome regression function in the first configuration.

(C3) Generate Y given X as in (C1), but, independently, generate T given X from a Bernoulli distribution with

P(T = 1|X) = [1 + exp{−(X₁ − 0.5X₂ + 0.25X₃ + 0.125X₄)}]^{−1}.
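A data-generating process of this type can be sketched as follows. The function names and the regression coefficients in the sketch are illustrative (the paper's exact coefficient values may differ), and the standardization of the transformed covariates is done empirically here rather than analytically as in the Supplementary Material:

```python
import math
import random

random.seed(1)

A = 2.5  # truncation point for the covariates

def trunc_normal():
    # rejection sampling from N(0, 1) truncated to (-A, A)
    while True:
        z = random.gauss(0, 1)
        if -A < z < A:
            return z

def gen_sample(n=800, p=10):
    raw = []
    for _ in range(n):
        x = [trunc_normal() for _ in range(p)]
        # nonlinear transforms of the first four covariates, in the
        # style of the Kang & Schafer (2007) design
        raw.append((x, [math.exp(0.5 * x[0]),
                        10 + x[1] / (1 + math.exp(x[0])),
                        (0.04 * x[0] * x[2] + 0.6) ** 3,
                        (x[1] + x[3] + 20) ** 2]))
    # empirical standardization of the transformed variables
    mean = [sum(r[1][j] for r in raw) / n for j in range(4)]
    sd = [math.sqrt(sum((r[1][j] - mean[j]) ** 2 for r in raw) / n)
          for j in range(4)]
    sample = []
    for x, v in raw:
        xt = list(x)                      # X†_j = X_j for j >= 5
        for j in range(4):
            xt[j] = (v[j] - mean[j]) / sd[j]
        # illustrative coefficients, not the paper's exact values
        lin = xt[0] - 0.5 * xt[1] + 0.25 * xt[2] + 0.125 * xt[3]
        t = 1 if random.random() < 1 / (1 + math.exp(-lin)) else 0
        y = (xt[0] + 0.5 * xt[1] + 0.25 * xt[2] + 0.125 * xt[3]
             + random.gauss(0, 1))
        sample.append((y, t, xt))
    return sample

sample = gen_sample()
print(len(sample), sum(t for _, t, _ in sample))
```

Fitting the models on X† versus on the untransformed X then produces the correctly specified and "nearly correct" misspecified scenarios described next.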
Table 1: Summary of results with linear outcome models (n = 800, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .071     .071       .072     .072       .077     .072†
 √EVar           .083     .083       .081     .080       .083     .083
 Cov90           .790     .822∗      .850     .848       .856     .837
 Cov95           .859     .891∗      .910     .915       .925     .912
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .063     .063       .062     .064       .069     .063†
 √EVar           .072     .073       .070     .069       .074     .074
 Cov90           .782     .826∗      .786     .865∗      .855     .838
 Cov95           .858     .885       .866     .918∗      .926     .901

Note: RML.RML denotes µ̂(m̂_RML, π̂_RML) and RCAL.RWL denotes µ̂(m̂_RWL, π̂_RCAL). Bias and √Var are respectively the Monte Carlo bias and standard deviation of the point estimates, √EVar is the square root of the mean of the variance estimates, and Cov90 or Cov95 is the coverage proportion of the 90% or 95% confidence intervals, based on 1000 repeated simulations. † indicates a case where the Monte Carlo variance from the competing method is at least 10% higher. ∗ indicates a coverage proportion at least 0.03 higher than that from the competing method.

As in Section 2, the observed data consist of independent and identically distributed observations {(T_iY_i, T_i, X_i) : i = 1, . . . , n}. Consider logistic propensity score model (4) and linear outcome model (8), both with f_j(X) = X†_j for j = 1, . . . , p. Then the two models can be classified as follows, depending on the data configuration above:

(C1) PS and OR models both correctly specified;
(C2) PS model correctly specified, but OR model misspecified;
(C3) PS model misspecified, but OR model correctly specified.

As demonstrated in Kang & Schafer (2007) for p = 4, the PS model (4) in scenario (C3), although misspecified, appears adequate as examined by conventional techniques for logistic regression. Similarly, the OR model (8) in the misspecified case (C2) can also be shown to be "nearly correct" by standard techniques for linear regression.
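Both point estimators compared in Table 1 share the augmented IPW form µ̂(m̂, π̂) = Ẽ[ϕ{Y, T, X; m̂, π̂}], and the variance estimates underlying √EVar are based on the sample variance of the estimated influence function. A minimal sketch, with generic plug-in models `m_hat` and `pi_hat` (the names are ours; the paper's intervals use the specific RCAL/RWL fits):

```python
import math

def aipw(data, m_hat, pi_hat, z=1.96):
    """Augmented IPW estimate of mu = E(Y^1) with a Wald-type interval.

    data: list of (y, t, x), with y meaningful only when t = 1;
    m_hat, pi_hat: fitted outcome-regression and propensity-score models.
    """
    n = len(data)
    phi = [m_hat(x) + t / pi_hat(x) * (y - m_hat(x)) for y, t, x in data]
    mu = sum(phi) / n
    var = sum((v - mu) ** 2 for v in phi) / n  # influence-function variance
    half = z * math.sqrt(var / n)
    return mu, (mu - half, mu + half)

# toy check: with m_hat = 0 and pi_hat = 1/2, treated units contribute 2y
toy = [(1.0, 1, None)] * 4 + [(0.0, 0, None)] * 4
mu, ci = aipw(toy, lambda x: 0.0, lambda x: 0.5)
print(mu, ci)  # mu = (4 * 2 + 4 * 0) / 8 = 1.0
```

The theoretical results in Section 3.3 are what justify treating (m̂_RWL, π̂_RCAL) as if fixed in this variance calculation even in high dimensions.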
On the other hand, neither

Figure 1: QQ plots of the t-statistics against standard normal with linear outcome models (n = 800, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

the PS model (4) in the correctly specified case (C1) or (C2) nor the OR model (8) in the correctly specified case (C1) or (C3) is used in Kang & Schafer (2007), where a correct PS and misspecified OR model (or a misspecified PS and correct OR model) involve two completely different sets of regressors. This aspect of the Kang–Schafer design needs to be modified in our study, where the same vector of regressors f(X) is used in models (4) and (8).

We conducted 1000 repeated simulations, each with the sample size n = 400 or 800 and the number of regressors p = 100 or 200. For n = 800 and p = 200, Table 1 summarizes the results for µ̂(m̂_RML, π̂_RML) and µ̂(m̂_RWL, π̂_RCAL) and their associated confidence intervals, and Figure 1 presents the QQ plots of the corresponding t-statistics. See the Supplementary Material for similar results obtained with other values of (n, p).

These results demonstrate several advantages of the proposed method. Compared with µ̂(m̂_RML, π̂_RML), the estimator µ̂(m̂_RWL, π̂_RCAL) has consistently smaller biases

Table 2: Summary of results with logistic outcome models (n = 800, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .023     .024       .023     .024       .024     .023
 √EVar           .026     .026       .025     .026       .027     .027
 Cov90           .814     .868∗      .841     .872∗      .845     .859
 Cov95           .876     .920∗      .916     .928       .914     .912
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      .—
 √Var            .024     .025       .024     .026       .026     .025
 √EVar           .026     .026       .026     .027       .027     .027
 Cov90           .849     .876       .864     .879       .870     .865
 Cov95           .909     .936       .925     .933       .931     .927
Note: See the footnote of Table 1. For scenario (C5) ("mis OR"), the true value µ is not analytically available but is calculated using Monte Carlo integration, as shown in the Supplementary Material.

in absolute values, and either similar or noticeably smaller variances, for example, in the case of a misspecified PS model and correct OR model. The coverage proportions of confidence intervals based on µ̂(m̂_RWL, π̂_RCAL) are similar to or noticeably higher than those based on µ̂(m̂_RML, π̂_RML), although both coverage proportions fall below the nominal probabilities to various degrees. From the QQ plots, the t-statistics based on µ̂(m̂_RWL, π̂_RCAL) also appear to be more closely aligned with the standard normal than those based on µ̂(m̂_RML, π̂_RML).

For simulations with binary outcomes, let X and X† be as in Section 4.1. Consider the following data-generating configurations, in parallel to (C1)–(C3).

(C4) Generate T given X as in (C1) and, independently, generate Y given X from a Bernoulli distribution with probability ("Logistic outcome configuration 1")

P(Y = 1|X) = [1 + exp{−(X†₁ + 0.5X†₂ + 0.25X†₃ + 0.125X†₄)}]^{−1},

or ("Logistic outcome configuration 2")

Figure 2: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 800, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

P(Y = 1|X) = [1 + exp{−(0.25X†₁ + 0.25X†₂ + 0.25X†₃ + 0.25X†₄)}]^{−1}.

(C5) Generate T given X as in (C1), and, independently, generate Y given X from a Bernoulli distribution with probability ("Logistic outcome configuration 1")

P(Y = 1|X) = [1 + exp{−(X₁ + 0.5X₂ + 0.25X₃ + 0.
125X₄)}]^{−1},

or ("Logistic outcome configuration 2")

P(Y = 1|X) = [1 + exp{−(0.25X₁ + 0.25X₂ + 0.25X₃ + 0.25X₄)}]^{−1}.

(C6) Generate Y given X as in (C4), and, independently, generate T given X as in (C3).

Consider logistic propensity score model (4) and logistic outcome model (35), both with f_j(X) = X†_j for j = 1, . . . , p. Then the two models are correctly specified in scenario (C4), only PS model (4) is correctly specified in scenario (C5), and only OR model (35) is correctly specified in scenario (C6), similarly as in Section 4.1.

For n = 800 and p = 200, Table 2 and Figure 2 present the results from 1000 repeated simulations for µ̂(m̂_RML, π̂_RML) and µ̂(m̂_RWL, π̂_RCAL) and their associated confidence intervals. Similar conclusions can be drawn as from Table 1 and Figure 1. It is interesting that the coverage proportions of confidence intervals based on µ̂(m̂_RWL, π̂_RCAL) are noticeably higher (and closer to the nominal probabilities) than those based on µ̂(m̂_RML, π̂_RML) in the case where both PS and OR models are correctly specified. This difference can also be seen from the QQ plots. The confidence intervals from both methods appear to yield reasonable coverage proportions when the PS model is misspecified but the OR model is correctly specified, even though these results are not necessarily predicted by asymptotic theory. See the Supplementary Material for additional results from simulations with other values of (n, p).

We provide an empirical application to a medical study in Connors et al. (1996) on the effects of right heart catheterization (RHC). The study included n = 5735 critically ill patients admitted to the intensive care units of 5 medical centers. For each patient, the data consist of treatment status T (= 1 if RHC was used within 24 hours of admission and 0 otherwise), health outcome Y (survival time up to 30 days), and a list of 75 covariates X specified by medical specialists in critical care.
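The regressor dictionary used in this application — all main effects plus two-way interactions, with overly sparse columns dropped — can be sketched as follows. The function `build_features` and its `min_nonzero` argument are our names, not the paper's:

```python
from itertools import combinations

def build_features(rows, min_nonzero):
    """Return named columns: main effects of the covariates plus all
    two-way interactions, dropping any column with fewer than
    min_nonzero nonzero values."""
    p = len(rows[0])
    cols = [("x%d" % j, [r[j] for r in rows]) for j in range(p)]
    cols += [("x%d:x%d" % (j, k), [r[j] * r[k] for r in rows])
             for j, k in combinations(range(p), 2)]
    return [(name, col) for name, col in cols
            if sum(1 for v in col if v != 0) >= min_nonzero]

# tiny example: x1 and the x0:x1 interaction are too sparse to keep
rows = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print([name for name, _ in build_features(rows, min_nonzero=2)])
```

For the RHC data, the analogous filter described below keeps terms with at least 46 nonzero values, giving p = 1855 columns after standardization.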
For previous analyses, propensity score and outcome regression models were employed either with main effects only (Hirano & Imbens 2002; Vermeulen & Vansteelandt 2015) or with interaction terms manually added (Tan 2006). To explore dependency beyond main effects, we consider a logistic propensity score model (4) and a logistic outcome model (35) for 30-day survival status 1{Y > 30}, with the vector f(X) including all main effects and two-way interactions of X except those with fewer than 46 nonzero values (i.e., 0.8% of the sample size 5735). The dimension of f(X) is p = 1855, excluding the constant. All variables in f(X) are standardized to have sample means 0 and variances 1. We apply the augmented IPW estimators of the form µ̂(m̂_RWL, π̂_RCAL) using regularized calibrated (RCAL) estimation and the corresponding estimators of the form µ̂(m̂_RML, π̂_RML) using regularized maximum likelihood (RML) estimation, similarly as in the simulation study. The Lasso tuning parameter λ is selected by cross validation over a discrete set {λ*/2^{j/4} : j = 0, 1, . . .}, where λ* is the value leading to a zero solution γ₁ = · · · = γ_p = 0. We also compute the (ratio) IPW estimators, such as µ̂_rIPW, along with nominal standard errors obtained by ignoring the data-dependency of the fitted propensity scores.

Table 3: Estimates of 30-day survival probabilities and ATE

        IPW                         Augmented IPW
        RML          RCAL           RML.RML      RCAL.RWL
 µ̂₁    0.— ± .026   0.— ± .023    0.— ± .021   0.— ± .—
 µ̂₀    0.— ± .017   0.— ± .017    0.— ± .016   0.— ± .—
 ATE    −0.— ± .—    −0.— ± .—     −0.— ± .—    −0.— ± .—

Note: Estimate ± 2× standard error, including nominal standard errors for IPW.

Figure 3: Boxplots of inverse probability weights within the treated (left) and untreated (middle) groups, each normalized to sum to the sample size n, and QQ plots with a 45-degree line of the standardized sample influence functions based on ϕ(Y, T, X; ·) in (7) for ATE (right). (Panel labels: RML, RCAL; RML.RML, RCAL.RWL.)

Table 3 shows various estimates of survival probabilities and ATE. The IPW estimates from RCAL estimation of propensity scores have noticeably smaller nominal standard errors than those from RML estimation, for example, with relative efficiency (0.026/0.023)² = 1.
28 for estimation of µ₁. This improvement can also be seen from Figure 3, where the RCAL inverse probability weights are much less variable than the RML weights. See Tan (2017) for additional results on covariate balance and parameter sparsity from RML and RCAL estimation of propensity scores. The augmented IPW estimates and confidence intervals from RCAL and RML estimation are similar to each other. However, the validity of the RML confidence intervals depends on both PS and OR models being correctly specified, whereas that of the RCAL confidence intervals holds even when the OR model is misspecified. While assessment of this difference is difficult with real data, Figure 3 shows that the sample influence functions for ATE using RCAL estimation appear to be more normally distributed, especially in the tails, than those using RML estimation. Finally, the augmented IPW estimates here are smaller in absolute value, and also have smaller standard errors, than previous estimates based on main-effect models, about −0.— ± 2 × .
015 (Vermeulen & Vansteelandt 2015). The reduction in standard errors might be explained by the well-known property that an augmented IPW estimator has a smaller asymptotic variance when obtained using a larger (correctly specified) propensity score model.
References
Athey, S., Imbens, G.W., and Wager, S. (2016) "Approximate residual balancing: De-biased inference of average treatment effects in high dimensions," arXiv:1604.07125.

Belloni, A., Chernozhukov, V., Fernandez-Val, I., and Hansen, C. (2017) "Program evaluation and causal inference with high-dimensional data," Econometrica, 85, 233–298.

Bickel, P., Ritov, Y., and Tsybakov, A.B. (2009) "Simultaneous analysis of Lasso and Dantzig selector," Annals of Statistics, 37, 1705–1732.

Buhlmann, P. and van de Geer, S. (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications, New York: Springer.

Chan, K.C.G., Yam, S.C.P., and Zhang, Z. (2016) "Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting," Journal of the Royal Statistical Society, Ser. B, 78, 673–700.

Connors, A.F., Speroff, T., Dawson, N.V., et al. (1996) "The effectiveness of right heart catheterization in the initial care of critically ill patients," Journal of the American Medical Association, 276, 889–897.

Farrell, M.H. (2015) "Robust inference on average treatment effects with possibly more covariates than observations," Journal of Econometrics, 189, 1–23.

Folsom, R.E. (1991) "Exponential and logistic weight adjustments for sampling and nonresponse error reduction," Proceedings of the American Statistical Association, Social Statistics Section, 197–202.

Friedman, J., Hastie, T., and Tibshirani, R. (2010) "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, 33, 1–22.

Graham, B.S., de Xavier Pinto, C.C., and Egel, D. (2012) "Inverse probability tilting for moment condition models with missing data," Review of Economic Studies, 79, 1053–1079.

Hahn, J. (1998) "On the role of the propensity score in efficient semiparametric estimation of average treatment effects," Econometrica, 66, 315–331.

Hainmueller, J. (2012) "Entropy balancing for causal effects: Multivariate reweighting method to produce balanced samples in observational studies," Political Analysis, 20, 25–46.

Hirano, K. and Imbens, G.W. (2002) "Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization," Health Services and Outcomes Research Methodology, 2, 259–278.

Huang, J. and Zhang, C.-H. (2012) "Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications," Journal of Machine Learning Research, 13, 1839–1864.

Imai, K. and Ratkovic, M. (2014) "Covariate balancing propensity score," Journal of the Royal Statistical Society, Ser. B, 76, 243–263.

Javanmard, A. and Montanari, A. (2014) "Confidence intervals and hypothesis testing for high-dimensional regression," Journal of Machine Learning Research, 15, 2869–2909.

Kang, J.D.Y. and Schafer, J.L. (2007) "Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data" (with discussion), Statistical Science, 22, 523–539.

Kim, J.K. and Haziza, D. (2014) "Doubly robust inference with missing data in survey sampling," Statistica Sinica, 24, 375–394.

Manski, C.F. (1988) Analog Estimation Methods in Econometrics, New York: Chapman & Hall.

McCullagh, P. and Nelder, J. (1989) Generalized Linear Models (2nd edition), New York: Chapman & Hall.

Negahban, S.N., Ravikumar, P., Wainwright, M.J., and Yu, B. (2012) "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers," Statistical Science, 27, 538–557.

Neyman, J. (1923) "On the application of probability theory to agricultural experiments: Essay on principles, Section 9," translated in Statistical Science, 1990, 5, 465–480.

Osborne, M., Presnell, B., and Turlach, B. (2000) "A new approach to variable selection in least squares problems," IMA Journal of Numerical Analysis, 20, 389–404.

Robins, J.M., Rotnitzky, A., and Zhao, L.P. (1994) "Estimation of regression coefficients when some regressors are not always observed," Journal of the American Statistical Association, 89, 846–866.

Rosenbaum, P.R. and Rubin, D.B. (1983) "The central role of the propensity score in observational studies for causal effects," Biometrika, 70, 41–55.

Rosenbaum, P.R. and Rubin, D.B. (1984) "Reducing bias in observational studies using subclassification on the propensity score," Journal of the American Statistical Association, 79, 516–524.

Rubin, D.B. (1976) "Inference and missing data," Biometrika, 63, 581–590.

Sarndal, C.E., Swensson, B., and Wretman, J.H. (1992) Model Assisted Survey Sampling, New York: Springer.

Tan, Z. (2006) "A distributional approach for causal inference using propensity scores," Journal of the American Statistical Association, 101, 1619–1637.

Tan, Z. (2007) "Comment: Understanding OR, PS, and DR," Statistical Science, 22, 560–568.

Tan, Z. (2010) "Bounded, efficient, and doubly robust estimation with inverse weighting," Biometrika, 97, 661–682.

Tan, Z. (2017) "Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data," arXiv:1710.08074.

Tibshirani, R. (1996) "Regression shrinkage and selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Tsiatis, A.A. (2006) Semiparametric Theory and Missing Data, New York: Springer.

van de Geer, S., Buhlmann, P., Ritov, Y., and Dezeure, R. (2014) "On asymptotically optimal confidence regions and tests for high-dimensional models," Annals of Statistics, 42, 1166–1202.

Vermeulen, K. and Vansteelandt, S. (2015) "Bias-reduced doubly robust estimation," Journal of the American Statistical Association, 110, 1024–1036.

White, H. (1982) "Maximum likelihood estimation of misspecified models," Econometrica, 50, 1–25.

Zhang, C.-H. and Zhang, S.S. (2014) "Confidence intervals for low-dimensional parameters with high-dimensional data," Journal of the Royal Statistical Society, Ser. B, 76, 217–242.

Supplementary Material for "Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data"
Zhiqiang Tan

The Supplementary Material contains Appendices I–II.
I Additional results for simulation study
I.1 Results for simulation setup
Denote by φ() the probability density function and Φ() the cumulative distribution function of N(0, 1). For a = 2.5, let Z be N(0, 1) truncated to the interval (−a, a), with density function φ(z)/c for z ∈ (−a, a) and 0 otherwise, where c = Φ(a) − Φ(−a) = 2Φ(a) − 1. Then E(Z) = 0 and var(Z) = 1 − 2aφ(a)/c, denoted as b².

Let (X₁, . . . , X₄) = (Z₁, . . . , Z₄)/b, where (Z₁, . . . , Z₄) are independent variables, each from N(0, 1) truncated to (−a, a), so that each X_j has mean 0 and variance 1. The variables (X†₁, . . . , X†₄) are determined by standardization from (X₁, . . . , X₄) using the following results, where t = 1/(2b) and s = 1/b.

• E(e^{0.5X₁}) = e^{t²/2}{Φ(a − t) − Φ(−a − t)}/c, and var(e^{0.5X₁}) = e^{s²/2}{Φ(a − s) − Φ(−a − s)}/c − {E(e^{0.5X₁})}².

• E[{1 + exp(X₁)}^{−1}X₂] = 0, and var[{1 + exp(X₁)}^{−1}X₂] = c^{−1} ∫_{−a}^{a} {1 + exp(z/b)}^{−2} φ(z) dz, evaluated by numerical integration.

• E{(0.04X₁X₃ + 0.6)³} = 3(0.04)²(0.6) + (0.6)³, and E{(0.04X₁X₃ + 0.6)⁶} = m₆²(0.04)⁶ + 15m₄²(0.04)⁴(0.6)² + 15(0.04)²(0.6)⁴ + (0.6)⁶, where m₄ = b^{−4}c^{−1} ∫_{−a}^{a} z⁴φ(z) dz = b^{−4}c^{−1}{3c − 2a(a² + 3)φ(a)} and m₆ = b^{−6}c^{−1} ∫_{−a}^{a} z⁶φ(z) dz = b^{−6}c^{−1}{15c − 2a(a⁴ + 5a² + 15)φ(a)}.

• E{(X₂ + X₄ + 20)²} = 2 + 20², and E{(X₂ + X₄ + 20)⁴} = (2m₄ + 6) + 12(20²) + 20⁴.

For binary outcomes in scenarios (C4) and (C6), the true value µ = E{m*(X)} is estimated by Monte Carlo integration, using 100 repeated samples of (X₁, . . . , X₄), each of size 10⁶. The resulting estimates of µ are 0.— and 0.— for the two outcome configurations.

I.2 Additional simulation results
Figure S1 shows the scatter plots of the variables (X†₁, X†₂, X†₃, X†₄), which are correlated with each other as would be found in real data.

Tables S1–S3 and Figures S2–S4 present additional simulation results from Section 4.1 with linear outcome models, similarly as Table 1 and Figure 1 but for different values of (n, p). Tables S4–S6 and Figures S5–S7 present additional simulation results from Section 4.2 with logistic outcome models, similarly as Table 2 and Figure 2 but for different values of (n, p). Similar conclusions can be drawn as discussed in Sections 4.1–4.2.

Figure S1: Scatter plots of (X†₁, X†₂, X†₃, X†₄) from a sample of size n = 800. (Panels labeled var 1 to var 4.)

Table S1: Summary of results with linear outcome models (n = 400, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .097     .097       .097     .099       .105     .099†
 √EVar           .108     .110       .111     .112       .109     .109
 Cov90           .787     .829∗      .844     .853       .848     .845
 Cov95           .862     .883       .915     .916       .916     .920
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .086     .087       .085     .088       .093     .088†
 √EVar           .094     .096       .092     .093       .096     .096
 Cov90           .799     .833∗      .828     .859∗      .864     .860
 Cov95           .879     .898       .896     .926∗      .927     .919

Note: See the footnote of Table 1.
Table S2: Summary of results with linear outcome models (n = 800, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .071     .071       .073     .072       .078     .072†
 √EVar           .079     .080       .084     .083       .081     .080
 Cov90           .829     .836       .845     .852       .889     .881
 Cov95           .889     .901       .905     .909       .938     .929
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .064     .063       .064     .064       .070     .064†
 √EVar           .071     .072       .072     .070       .072     .071
 Cov90           .814     .830       .782     .850∗      .896     .875
 Cov95           .880     .893       .862     .912∗      .941     .924

Note: See the footnote of Table 1.

Table S3: Summary of results with linear outcome models (n = 400, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Linear outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .095     .096       .096     .099       .102     .098
 √EVar           .109     .110       .112     .113       .118     .116
 Cov90           .770     .820∗      .819     .834       .823     .829
 Cov95           .845     .884∗      .895     .903       .893     .896
Linear outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .085     .086       .084     .087       .091     .087
 √EVar           .092     .094       .094     .095       .103     .103
 Cov90           .788     .833∗      .814     .853∗      .842     .836
 Cov95           .877     .905       .884     .913       .904     .896

Note: See the footnote of Table 1.
Table S4: Summary of results with logistic outcome models (n = 400, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .032     .033       .032     .034       .033     .033
 √EVar           .038     .038       .038     .038       .037     .037
 Cov90           .776     .835∗      .804     .845∗      .834     .852
 Cov95           .864     .911∗      .877     .901       .899     .904
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      .—
 √Var            .033     .034       .034     .035       .035     .034
 √EVar           .035     .036       .035     .035       .038     .038
 Cov90           .852     .876       .876     .883       .863     .864
 Cov95           .910     .931       .931     .942       .921     .924
Note: See the footnote of Table 2.

Table S5: Summary of results with logistic outcome models (n = 800, p = 100)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .023     .024       .024     .025       .024     .024
 √EVar           .026     .026       .026     .027       .026     .026
 Cov90           .816     .868∗      .851     .868       .877     .870
 Cov95           .896     .924       .924     .929       .928     .925
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        .—       .—
 √Var            .024     .025       .025     .026       .027     .025
 √EVar           .026     .026       .027     .028       .028     .027
 Cov90           .842     .870       .841     .867       .881     .862
 Cov95           .913     .926       .911     .923       .940     .925
Note: See the footnote of Table 2.
Table S6: Summary of results with logistic outcome models (n = 400, p = 200)

                 cor PS, cor OR      cor PS, mis OR      mis PS, cor OR
                 RML.RML  RCAL.RWL   RML.RML  RCAL.RWL   RML.RML  RCAL.RWL
Logistic outcome configuration 1
 Bias            −.—      −.—        −.—      −.—        −.—      −.—
 √Var            .032     .033       .032     .034       .033     .033
 √EVar           .037     .037       .039     .038       .036     .036
 Cov90           .754     .826∗      .773     .827∗      .834     .866
 Cov95           .833     .907∗      .852     .897∗      .898     .917
Logistic outcome configuration 2
 Bias            −.—      −.—        −.—      −.—        −.—      .—
 √Var            .032     .034       .033     .035       .034     .034
 √EVar           .035     .036       .037     .037       .037     .037
 Cov90           .858     .884       .848     .864       .858     .857
 Cov95           .915     .936       .904     .932       .916     .926
Note: See the footnote of Table 2.
Figure S2: QQ plots of the t-statistics against standard normal with linear outcome models (n = 400, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S3: QQ plots of the t-statistics against standard normal with linear outcome models (n = 800, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S4: QQ plots of the t-statistics against standard normal with linear outcome models (n = 400, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S5: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 400, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S6: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 800, p = 100), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×).
For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

Figure S7: QQ plots of the t-statistics against standard normal with logistic outcome models (n = 400, p = 200), based on the estimators µ̂(m̂_RML, π̂_RML) (◦) and µ̂(m̂_RWL, π̂_RCAL) (×). For readability, only a subset of 100 order statistics are shown as points on the QQ lines. (Panels: cor PS, cor OR; cor PS, mis OR; mis PS, cor OR.)

II Technical details
II.1 Inside Theorem 1
The following result (ii) is taken from Tan (2017), Lemma 1(ii), and result (i) can be shown similarly using Lemma 14 in Section II.8 and the union bound.
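Before the formal statements, the scaling these sup-norm events deliver can be illustrated numerically: for bounded (hence sub-gaussian) mean-zero variables, Hoeffding's inequality combined with a union bound over the 1 + p terms controls the maximum absolute sample mean at the order √(log(p)/n). A quick sketch of this scaling (an illustration only, not part of the proofs):

```python
import math
import random

random.seed(2)

def max_abs_mean(n, p):
    # sup_j |E~(Z_j)| over p independent samples of mean-zero
    # Rademacher variables, each bounded by 1
    best = 0.0
    for _ in range(p):
        s = sum(random.choice((-1.0, 1.0)) for _ in range(n)) / n
        best = max(best, abs(s))
    return best

# the realized maximum tracks sqrt(log(p) / n) as n and p vary
for n, p in [(400, 100), (400, 1000), (1600, 1000)]:
    print(n, p, round(max_abs_mean(n, p), 3),
          round(math.sqrt(math.log(p) / n), 3))
```

This is the mechanism behind the choices of λ₀ of order √(log(p)/n) in the lemmas below.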
Lemma 1. (i) Denote by Ω₁ the event that

sup_{j=0,1,...,p} |Ẽ[{−T e^{−h̄_CAL(X)} + (1 − T)} f_j(X)]| ≤ λ₀.

Under Assumption 1(i)–(ii), if λ₀ ≥ √2(e^{−B₀} + 1)C₀ √(log{(1 + p)/ǫ}/n), then P(Ω₁) ≥ 1 − ǫ.

(ii) Denote by Ω₂ the event that

sup_{j,k=0,1,...,p} |(Σ̃_γ)_{jk} − (Σ_γ)_{jk}| ≤ λ₀.   (S1)

Under Assumption 1(i)–(ii), if λ₀ ≥ 4e^{−B₀}C₀² √(log{(1 + p)²/ǫ}/n), then P(Ω₂) ≥ 1 − ǫ.

Take λ₀ = C₁ √(log{(1 + p)/ǫ}/n) with C₁ = max{√2(e^{−B₀} + 1)C₀, 4e^{−B₀}C₀²}. Then under the conditions of Theorem 1, inequality (26) holds in the event Ω₁ ∩ Ω₂, with probability at least 1 − 2ǫ, by the proof of Tan (2017, Corollary 2).

II.2 Probability lemmas
Lemma 2.
Denote by Ω₃ the event that

sup_{j=0,1,...,p} |Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f_j(X)]| ≤ λ₀.   (S2)

Under Assumptions 1(i)–(ii) and 2(i), if λ₀ ≥ e^{−B₀}C₀ √(D₀² + D₁²) √(log{(1 + p)/ǫ}/n), then P(Ω₃) ≥ 1 − ǫ.

Proof.
Let Z_j = T w(X; γ̄_CAL){Y − m̄_WL(X)} f_j(X) for j = 0, . . . , p. Then E(Z_j) = 0 by the definition of ᾱ_WL. Under Assumption 1(i)–(ii), |Z_j| ≤ e^{−B₀}C₀ |T{Y − m̄_WL(X)}|. By Assumption 2(i), the variables (Z₀, Z₁, . . . , Z_p) are uniformly sub-gaussian: max_{j=0,1,...,p} D₀′² E{exp(Z_j²/D₀′²) − 1} ≤ D₁′², with D₀′ = e^{−B₀}C₀D₀ and D₁′ = e^{−B₀}C₀D₁. Therefore, the result holds by Lemma 15 in Section II.8 and the union bound. □

Denote Σ_α = E[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X) fᵀ(X)], and Σ̃_α = Ẽ[T w(X; γ̄_CAL){Y − m̄_WL(X)} f(X) fᵀ(X)], the sample version of Σ_α.

Lemma 3. Denote by Ω₄ the event that

sup_{j,k=0,1,...,p} |(Σ̃_α)_{jk} − (Σ_α)_{jk}| ≤ (D₀² + D₀D₁) λ₀.   (S3)

Under Assumptions 1(i)–(ii) and 2(i), if (D₀² + D₀D₁) λ₀ ≥ e^{−B₀}C₀² [D₀² log{(1 + p)²/ǫ}/n + D₀D₁ √(log{(1 + p)²/ǫ}/n)], then P(Ω₄) ≥ 1 − ǫ.

Proof.
For any j, k = 0, 1, . . . , p, the variable T w(X; γ̄_CAL){Y − m̄_WL(X)} f_j(X) f_k(X) is the product of w(X; γ̄_CAL) f_j(X) f_k(X) and T{Y − m̄_WL(X)}, where |w(X; γ̄_CAL) f_j(X) f_k(X)| ≤ e^{−B₀}C₀² by Assumptions 1(i)–(ii) and T{Y − m̄_WL(X)} is sub-gaussian by Assumption 2(i). Applying Lemmas 16 and 18 in Section II.8 yields

P{|(Σ̃_α)_{jk} − (Σ_α)_{jk}| > e^{−B₀}C₀²(D₀² t + 2D₀D₁√t)} ≤ ǫ/(1 + p)²,

for j, k = 0, 1, . . . , p, where t = log{(1 + p)²/ǫ}/n. The result then follows from the union bound. □

Denote Σ_α⁺ = E[T w(X; γ̄_CAL)|Y − m̄_WL(X)| f(X) fᵀ(X)], and Σ̃_α⁺ = Ẽ[T w(X; γ̄_CAL)|Y − m̄_WL(X)| f(X) fᵀ(X)], the sample version of Σ_α⁺.

Lemma 4.
Denote by Ω_α3 the event that
sup_{j,k=0,1,...,p} | (Σ̃_α1)_{jk} − (Σ_α1)_{jk} | ≤ √( 8 (D₀² + D₁²) ) λ,   (S4)
Under Assumptions 1(i)–(ii) and 2(i), if λ ≥ e^{−B₀} C₀² √( log{(1+p)²/ε} / n ), then P(Ω_α3) ≥ 1 − ε.

Proof.
The variables T w(X; γ̄_CAL) |Y − m̄_WL(X)| f_j(X) f_k(X) for j, k = 0, 1, ..., p are uniformly sub-gaussian, because |w(X; γ̄_CAL) f_j(X) f_k(X)| ≤ e^{−B₀} C₀² by Assumptions 1(i)–(ii) and T|Y − m̄_WL(X)| is sub-gaussian by Assumption 2(i). Applying Lemma 15 yields
P{ |(Σ̃_α1)_{jk} − (Σ_α1)_{jk}| > t } ≤ ε/(1+p)²,
for j, k = 0, 1, ..., p, where t = e^{−B₀} C₀² √( 8 (D₀² + D₁²) ) √( log{(1+p)²/ε} / n ). The result then follows from the union bound. □

Denote Σ_f = E[ f(X) fᵀ(X) ] and Σ̃_f = Ẽ[ f(X) fᵀ(X) ], the sample version of Σ_f.

Lemma 5. Denote by Ω_f the event that
sup_{j,k=0,1,...,p} | (Σ̃_f)_{jk} − (Σ_f)_{jk} | ≤ e^{B₀} λ,   (S5)
Under Assumption 1(i), if λ ≥ 2 e^{−B₀} C₀² √( log{(1+p)²/ε} / n ), then P(Ω_f) ≥ 1 − ε.

Proof.
This result follows directly from Lemma 14 and the union bound, with |f_j(X) f_k(X)| ≤ C₀² and hence |f_j(X) f_k(X) − (Σ_f)_{jk}| ≤ 2C₀². □

II.3 Proof of Theorems 2 and 5
Throughout this section, suppose that Assumption 1 holds. The proof of Theorem 5 is completed by combining Lemmas 2–3 and 6–12. Theorem 2 is a special case of Theorem 5, where Assumptions 3(ii)–(iv) are satisfied with C₃ = 1 and C₄ = η₃ = η₄ = 0.

Lemma 6.
For any coefficient vector α and h(X) = αᵀ f(X), we have
D†_WL(ĥ_RWL, h; γ̂_RCAL) + λ ‖α̂_RWL,1:p‖₁ ≤ (α̂_RWL − α)ᵀ Ẽ[ T w(X; γ̂_RCAL) {Y − m(X; α)} f(X) ] + λ ‖α_{1:p}‖₁.   (S6)

Proof. For any u ∈ (0, 1), the definition of α̂_RWL implies
ℓ_RWL(α̂_RWL; γ̂_RCAL) + λ ‖α̂_RWL,1:p‖₁ ≤ ℓ_RWL{(1 − u) α̂_RWL + u α; γ̂_RCAL} + λ ‖(1 − u) α̂_RWL,1:p + u α_{1:p}‖₁,
which, by the convexity of ‖·‖₁, gives
ℓ_RWL(α̂_RWL; γ̂_RCAL) − ℓ_RWL{(1 − u) α̂_RWL + u α; γ̂_RCAL} + λ u ‖α̂_RWL,1:p‖₁ ≤ λ u ‖α_{1:p}‖₁.
Dividing both sides of the preceding inequality by u and letting u → 0+ yields
−Ẽ[ T w(X; γ̂_RCAL) {Y − m̂_RWL(X)} {ĥ_RWL(X) − h(X)} ] + λ ‖α̂_RWL,1:p‖₁ ≤ λ ‖α_{1:p}‖₁,
which leads to (S6) after simple rearrangement using (38). □

Lemma 7.
In the event Ω ∩ Ω , we have ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ e η M | S γ | λ , (S7)15 nd for any function h ( X ) , D † WL (ˆ h RWL , h ; ˆ γ RCAL ) ≥ e − η D † WL (ˆ h RWL , h ; ¯ γ CAL ) , (S8) where η = ( A − − M η C . Proof.
By direct calculation from the definition of D CAL (), we find D CAL (ˆ h RCAL , ¯ h CAL ) = − ˜ E h T n e − ˆ h ( X ) − e − ¯ h ( X ) o { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i = ˜ E h T e − u (ˆ γ − ¯ γ ) T f ( X ) w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i for some u ∈ (0 , − ˆ h ( X ) − e − ¯ h ( X ) = − e − u ˆ h ( X ) − (1 − u )¯ h ( X ) (ˆ γ RCAL − ¯ γ CAL ) T f ( X ) . (S9)In the event Ω ∩ Ω that (26) holds, we have k ˆ γ RCAL − ¯ γ CAL k ≤ ( A − − M | S γ | λ ≤ ( A − − M η , (S10)by Assumption 1(iv), | S γ | λ ≤ η , and hence M | S γ | λ ≥ D CAL (ˆ h RCAL , ¯ h CAL ) ≥ e − η ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i , which gives the desired inequality (S7). In addition, we write D † WL (ˆ h RWL , h ; ˆ γ RCAL )= ˜ E (cid:16) T w ( X ; ˆ γ RCAL ) h ψ { ˆ h RWL ( X ) } − ψ { h ( X ) } i { ˆ h RWL ( X ) − h ( X ) } (cid:17) = ˜ E (cid:16) T e − (ˆ γ − ¯ γ ) T f ( X ) w ( X ; ¯ γ CAL ) h ψ { ˆ h RWL ( X ) } − ψ { h ( X ) } i { ˆ m RWL ( X ) − h ( X ) } (cid:17) , which, in the event Ω ∩ Ω , yields inequality (S8) by (S10) and Assumption 1(i). (cid:3) For two functions h ( x ) and h ′ ( x ), denote Q WL ( h ′ , h ; γ ) = ˜ E (cid:2) T w ( X ; γ ) { h ′ ( X ) − h ( X ) } (cid:3) . Lemma 8.
Take α = ¯ α WL and h ( X ) = ¯ α WL f ( X ) . Suppose that Assumption 2(i) holds. Thenin the event Ω ∩ Ω ∩ Ω , (S6) implies e − η D † WL (ˆ h RWL , h ; ¯ γ CAL ) + λ k ˆ α RWL , p k ≤ ( ˆ α RWL − α ) T ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } f ( X ) (cid:3) + λ k α p k + e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , h ; ¯ γ CAL ) } / , where M = ( D + D )(e η M + η ) + ( D + D D ) η , and η = ( A − − M η . roof. Consider the following decomposition,( ˆ α RWL − α ) T ˜ E (cid:2) T w ( X ; ˆ γ RCAL ) { Y − m ( X ; α ) } f ( X ) (cid:3) = ( ˆ α RWL − α ) T ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } f ( X ) (cid:3) + ˜ E h T n e − ˆ h ( X ) − e − ¯ h ( X ) o { Y − m ( X ; α ) }{ ˆ h RWL ( X ) − h ( X ) } i , (S11)denoted as ∆ + ∆ . By the mean value equation (S9) and the Cauchy–Schwartz inequality,the second term ∆ can be bounded from above as∆ ≤ e C k ˆ γ − ¯ γ k ˜ E / h T w ( X ; ¯ γ CAL ) { ˆ h RWL ( X ) − h ( X ) } i × ˜ E / h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i . (S12)We upper-bound the third term on the right hand side in several steps. First, in the event Ω ,we have by inequality (S3),( ˜ E − E ) h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ ( D + D D ) λ k ˆ γ RCAL − ¯ γ CAL k , where, by some abuse of notation, ( ˜ E − E )( Z ) denotes n − P ni =1 { Z i − E ( Z ) } for a variable Z that is a function of ( T, Y, X ). Second, by Assumption 2(i) and Lemma 17, E [ { Y − m ( X ; α ) } | X ] ≤ D + D and hence E h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ ( D + D ) E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i . Third, in the event Ω , we have by inequality (S1),( E − ˜ E ) h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ λ k ˆ γ RCAL − ¯ γ CAL k . 
Combining the preceding inequalities, we have in the event Ω ∩ Ω ,˜ E h T w ( X ; ¯ γ CAL ) { Y − m ( X ; α ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ ( D + D D ) λ k ˆ γ RCAL − ¯ γ CAL k + ( D + D ) n λ k ˆ γ RCAL − ¯ γ CAL k + ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } io . (S13)The desired result follows by collecting inequalities (S11)–(S13) and applying (S7), (S8) and(S10) in the event Ω ∩ Ω . (cid:3) emma 9. Denote b = ˆ α RWL − ¯ α WL . Suppose that Assumption 2(i) holds. In the event Ω ∩ Ω ∩ Ω ∩ Ω , we have e − η D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) + ( A − λ k b k ≤ e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) } / + 2 A λ X j ∈ S α | b j | . (S14) Proof.
In the event Ω , we have b T ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { Y − ¯ m WL ( X ) } f ( X ) (cid:3) ≤ λ k b k . From this bound and Lemma 8 with α = ¯ α WL , we have in the event Ω ∩ Ω ∩ Ω ∩ Ω ,e − η D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) + A λ k ˆ α RWL , p k ≤ λ k b k + A λ k ¯ α WL , p k + e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) } / . Applying to the preceding inequality the identity | ˆ α RWL ,j | = | ˆ α RWL ,j − ¯ α WL ,j | for j S α andthe triangle inequality | ˆ α RWL ,j | ≥ | ¯ α WL ,j | − | ˆ α RWL ,j − ¯ α WL ,j | , j ∈ S α \{ } , and rearranging the result givese − η D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) + ( A − λ k b p k ≤ λ | b | + 2 A λ X j ∈ S α \{ } | b j | + e η (cid:0) M | S γ | λ (cid:1) / { Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) } / . The conclusion follows by adding ( A − λ | b | to both sides above. (cid:3) Denote ˜Σ α = ˜ E [ T w ( X ; ¯ γ CAL ) ψ { ¯ h WL ( X ) } f ( X ) f T ( X )]. Lemma 10.
Suppose that Assumption 3(iii) holds. Then for any h = α f and h ′ = α ′ T f , D † WL ( h, h ′ ; ¯ γ CAL ) ≥ − e − C k b k C k b k (cid:16) b T ˜Σ α b (cid:17) , where b = α ′ − α and C = C C . Throughout, set (1 − e − c ) /c = 1 for c = 0 . Proof.
Set γ = ¯ γ CAL . By direct calculation, we have D † WL ( h, h ′ ; γ ) = ˜ E (cid:0) T w ( X ; γ ) (cid:2) ψ { h ′ ( X ) } − ψ { h ( X ) } (cid:3) (cid:8) h ′ ( X ) − h ( X ) (cid:9)(cid:1) = ˜ E (cid:20) T w ( X ; γ ) (cid:18)Z ψ (cid:2) h ( X ) + u (cid:8) h ′ ( X ) − h ( X ) (cid:9)(cid:3) d u (cid:19) (cid:8) h ′ ( X ) − h ( X ) (cid:9) (cid:21) .
18y Assumption 3(iii) and the fact that | h ′ ( X ) − h ( X ) | ≤ { sup j =0 , ,...,p | f j ( X ) |} k α ′ − α k ≤ C k α ′ − α k by Assumption 1(i), it follows that D † WL ( h, h ′ ; γ ) ≥ ˜ E (cid:20) T w ( X ; γ ) (cid:18)Z ψ { h ( X ) } e − C u | h ′ ( X ) − h ( X ) | d u (cid:19) (cid:8) h ′ ( X ) − h ( X ) (cid:9) (cid:21) ≥ ˜ E h T w ( X ; γ ) ψ { h ( X ) } (cid:8) h ′ ( X ) − h ( X ) (cid:9) i (cid:18)Z e − C u k α ′ − α k d u (cid:19) , which gives the desired result because R e − cu d u = (1 − e − c ) /c for c ≥ (cid:3) Lemma 11.
Suppose that Assumption 2(iii) holds. In the event Ω_γ2, Assumption 2(ii) implies a compatibility condition for Σ̃_γ: for any vector b = (b₀, b₁, ..., b_p)ᵀ ∈ ℝ^{1+p} such that Σ_{j∉S_α} |b_j| ≤ ξ Σ_{j∈S_α} |b_j|, we have
(1 − η) ν² ( Σ_{j∈S_α} |b_j| )² ≤ |S_α| ( bᵀ Σ̃_γ b ).   (S15)

Proof. In the event Ω_γ2, we have |bᵀ(Σ̃_γ − Σ_γ)b| ≤ λ ‖b‖₁² by (S1). Then Assumption 2(ii) implies that for any vector b = (b₀, b₁, ..., b_p)ᵀ satisfying Σ_{j∉S_α} |b_j| ≤ ξ Σ_{j∈S_α} |b_j|,
ν² ‖b_{S_α}‖₁² ≤ |S_α| (bᵀ Σ_γ b) ≤ |S_α| ( bᵀ Σ̃_γ b + λ ‖b‖₁² ) ≤ |S_α| (bᵀ Σ̃_γ b) + |S_α| λ (1 + ξ)² ‖b_{S_α}‖₁²,
where ‖b_{S_α}‖₁ = Σ_{j∈S_α} |b_j|. The last inequality uses ‖b‖₁ ≤ (1 + ξ) ‖b_{S_α}‖₁. Then (S15) follows because (1 + ξ)² ν^{−2} |S_α| λ ≤ η (< 1) by Assumption 2(iii). □
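The transfer argument above is purely deterministic once the entrywise error of the sample matrix is controlled. A numerical illustration with the population matrix taken as the identity (so its compatibility constant is ν = 1 for every support); the values of n, p, xi and the support S are illustrative choices, not quantities from the paper:

```python
import numpy as np

# Compatibility transfer: if nu^2 ||b_S||_1^2 <= |S| b' Sigma b holds for the
# population Sigma, and |Sigma_hat - Sigma|_infty <= lam, then for b in the
# cone sum_{j not in S}|b_j| <= xi * sum_{j in S}|b_j| one gets
# (nu^2 - (1+xi)^2 |S| lam) ||b_S||_1^2 <= |S| b' Sigma_hat b.
rng = np.random.default_rng(1)
n, p, xi = 20000, 30, 1.0
S = np.arange(5)                               # support set, |S| = 5
off = np.setdiff1d(np.arange(p), S)

X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n                        # sample version of Sigma = I
lam = np.abs(Sigma_hat - np.eye(p)).max()      # entrywise error

for _ in range(50):
    b = rng.standard_normal(p)
    # force b into the cone condition
    b[off] *= min(1.0, xi * np.abs(b[S]).sum() / np.abs(b[off]).sum())
    lhs = (1.0 - (1 + xi) ** 2 * len(S) * lam) * np.abs(b[S]).sum() ** 2
    rhs = len(S) * (b @ Sigma_hat @ b)
    assert lhs <= rhs + 1e-9                   # transferred compatibility bound
```

With n large enough that (1 + ξ)² |S| λ < ν², the transferred constant stays positive, mirroring the role of the condition (1 + ξ)² ν^{−2} |S_α| λ ≤ η < 1.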
Lemma 12.
Suppose that Assumptions 2 and 3 hold, and A > (ξ + 1)/(ξ − 1). In the event Ω_γ1 ∩ Ω_γ2 ∩ Ω_α1 ∩ Ω_α2, inequality (29) holds as in Theorem 2.

Proof.
Denote b = ˆ α RWL − ¯ α WL , D † WL = D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ), Q WL = Q WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ),and D ‡ WL = e − η D † WL + ( A − λ k b k . In the event Ω ∩ Ω ∩ Ω ∩ Ω , inequality (S14) from Lemma 9 with Assumption 2(i) leads totwo possible cases: either ξ D ‡ WL ≤ e η (cid:0) M | S γ | λ (cid:1) / ( Q WL ) / , (S16)or (1 − ξ ) D ‡ WL ≤ A λ P j ∈ S α | b j | , that is, D ‡ WL ≤ ( ξ + 1)( A − λ X j ∈ S α | b j | = ξ λ X j ∈ S α | b j | , (S17)19here ξ = 1 − A / { ( ξ + 1)( A − } ∈ (0 ,
1] because A > ( ξ + 1) / ( ξ −
1) and ξ =( ξ + 1)( A − P j S α | b j | ≤ ξ P j ∈ S α | b j | , which, by Lemma 11 and Assumption 2(ii)–(iii), implies (S15), that is, X j ∈ S α | b j | ≤ (1 − η ) − / ν − | S α | / (cid:16) b T ˜Σ γ b (cid:17) / . (S18)By Assumption 3(ii) and Lemma 10 with Assumption 3(iii), we have D † WL ≥ − e − C k b k C k b k (cid:16) b T ˜Σ α b (cid:17) ≥ − e − C k b k C k b k C (cid:16) b T ˜Σ γ b (cid:17) . (S19)Combining (S17), (S18), and (S19) and using D † WL ≤ e η D ‡ WL yields D ‡ WL ≤ e η ξ (1 − η ) − ν − C − | S α | λ C k b k − e − C k b k . (S20)But ( A − λ k b k ≤ D ‡ WL . Inequality (S20) along with Assumption 3(iv) implies that 1 − e − C k b k ≤ C ( A − − ξ (1 − η ) − ν − C − | S α | λ ≤ η ( < C k b k ≤ − log(1 − η ) and hence 1 − e − C k b k C k b k = Z e − C k b k u d u ≥ e − C k b k ≥ − η . From this bound, inequality (S20) then leads to D ‡ WL ≤ e η ξ ν − | S α | λ .If (S16) holds, then simple manipulation using D † WL ≤ e η D ‡ WL and (S19) together with Q WL = b T ˜Σ γ b gives D ‡ WL ≤ e η ξ − C − (cid:0) M | S γ | λ (cid:1) C k b k − e − C k b k . (S21)Similarly as above, using ( A − λ k b k ≤ D ‡ WL and inequality (S21) along with Assump-tion 3(iv), we find 1 − e − C k b k ≤ C e η ( A − − ξ − C − ( M | S γ | λ ) ≤ η ( < C k b k ≤ − log(1 − η ) and hence1 − e − C k b k C k b k = Z e − C k b k u d u ≥ e − C k b k ≥ − η . From this bound, inequality (S21) then leads to D ‡ WL ≤ e η ξ − ( M | S γ | λ ). Therefore, (39)holds through (S16) and (S17) in the event Ω ∩ Ω ∩ Ω ∩ Ω . (cid:3) I.4 Proof of Theorem 3
Denote ˆ ϕ = ϕ ( T, Y, X ; ˆ m RWL , ˆ π RCAL ) and ¯ ϕ = ϕ ( T, Y, X ; ¯ m WL , ¯ π CAL ). Thenˆ µ ( ˆ m RWL , ˆ π RCAL ) = ¯ µ ( ¯ m WL , ¯ π CAL ) + ˜ E ( ˆ ϕ − ¯ ϕ ) . Consider the following decomposition,ˆ ϕ − ¯ ϕ = { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27) + T { Y − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27) + { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) T ¯ π CAL ( X ) − T ˆ π RCAL ( X ) (cid:27) , (S22)denoted as δ + δ + δ .We show that in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω , inequality (32) holds as in Theorem 3.The decomposition (20) for ˆ µ ( ˆ m RWL , ˆ π RCAL ) amounts toˆ µ ( ˆ m RWL , ˆ π RCAL ) = ¯ µ ( ¯ m WL , ¯ π CAL ) + ∆ + ∆ , where ∆ = ˜ E ( δ + δ ) = ( ˆ α RWL − ¯ α WL ) T ˜ E (cid:20)(cid:26) − T ˆ π RCAL ( X ) (cid:27) f ( X ) (cid:21) , ∆ = ˜ E ( δ ) = ˜ E (cid:20) T { Y − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27)(cid:21) . In the event Ω ∩ Ω ∩ Ω ∩ Ω , we have | ∆ | ≤ ( A − − M ( | S γ | λ + | S α | λ ) × A λ , (S23)by inequality (29) and the Karush–Kuhn–Tucker conditions (14)–(15). Moreover, a Taylorexpansion for ∆ yields for some u ∈ (0 , = − (ˆ γ RCAL − ¯ γ CAL ) T ˜ E h T { Y − ¯ m WL ( X ) } e − ¯ h ( X ) f ( X ) i + (ˆ γ RCAL − ¯ γ CAL ) T ˜ E h T { Y − ¯ m WL ( X ) } e − u ˆ h ( X ) − (1 − u )¯ h ( X ) f ( X ) f T ( X ) i (ˆ γ RCAL − ¯ γ CAL ) / , denoted as ∆ + ∆ . In the event (Ω ∩ Ω ) ∩ Ω , we have | ∆ | ≤ ( A − − M | S γ | λ × λ , (S24)by inequalities (26) and (S2). The term ∆ can be bounded as | ∆ | ≤ e k ˆ γ − ¯ γ k C ˜ E h T w ( X ; ¯ γ CAL ) | Y − ¯ m WL ( X ) |{ ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i / . (S25)21n the event Ω ∩ Ω , we have˜ E h T w ( X ; ¯ γ CAL ) | Y − ¯ m WL ( X ) |{ ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i ≤ q D + D λ k ˆ γ RCAL − ¯ γ CAL k + q D + D n λ k ˆ γ RCAL − ¯ γ CAL k + ˜ E h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } io , (S26)by inequalities (S1) and (S4) and similar steps as in the proof of (S13). 
Then (32) follows by collecting inequalities (S23)–(S26) and applying (S7) and (S10) in the event Ω_γ1 ∩ Ω_γ2.

II.5 Proof of Theorem 4
Using a − b = 2( a − b ) b + ( a − b ) and the Cauchy–Schwartz inequality, we find (cid:12)(cid:12)(cid:12) ˜ E (cid:0) ˆ ϕ c − ¯ ϕ c (cid:1)(cid:12)(cid:12)(cid:12) ≤ E / (cid:0) ¯ ϕ c (cid:1) ˜ E / (cid:8) ( ˆ ϕ c − ¯ ϕ c ) (cid:9) + ˜ E (cid:8) ( ˆ ϕ c − ¯ ϕ c ) (cid:9) . (S27)Using ˆ ϕ c = ˆ ϕ − ˆ µ ( ˆ m RWL , ˆ π RCAL ) and ¯ ϕ c = ¯ ϕ − ˆ µ ( ¯ m WL , ¯ π CAL ), we find˜ E { ( ˆ ϕ c − ¯ ϕ c ) } ≤ E { ( ˆ ϕ − ¯ ϕ ) } + 2 (cid:12)(cid:12) ˆ µ ( ˆ m RWL , ˆ π RCAL ) − ˆ µ ( ¯ m WL , ¯ π CAL ) (cid:12)(cid:12) . (S28)To control ˜ E { ( ˆ ϕ − ¯ ϕ ) } , we use the decomposition (S22), denoted as δ + δ + δ .First, by the mean value equation (S9) and Assumption 1(i)–(ii), we have˜ E ( δ ) = ˜ E " T { Y − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27) ≤ e − B +2 k ˆ γ − ¯ γ k C ˜ E h T w ( X ; ¯ γ CAL ) { Y − ¯ m WL ( X ) } { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i . (S29)Second, writing { ˆ π RCAL ( X ) } − − { ¯ π CAL ( X ) } − = e − ¯ h ( X ) { e − ˆ h ( X )+¯ h ( X ) − } and usingAssumption 1(i)–(ii), we have˜ E ( δ ) = ˜ E " T { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) π RCAL ( X ) − π CAL ( X ) (cid:27) ≤ e − B (cid:16) k ˆ γ − ¯ γ k C (cid:17) ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:3) . (S30)Third, using Assumption 1(i)–(ii), we also have˜ E ( δ ) = ˜ E " { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27) ≤ (1 + e − B ) ˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i (S31) ≤ (1 + e − B ) C k ˆ α RWL − ¯ α WL k . (S32)22nequality (33) follows by collecting inequalities (S27)–(S32) and applying (29), (32), (S10),and (S13) in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω . If condition (27) holds, then we have in theevent Ω ∩ Ω ,˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i ≤ e B λ k ˆ α RWL − ¯ α WL k + τ − n λ k ˆ α RWL − ¯ α WL k + ˜ E h T w ( X ; ¯ α WL ) { ˆ h RWL ( X ) − ¯ h WL ( X ) } io , (S33)by inequalities (S1) and (S5) and similar steps as in the proof of (S13). 
Inequality (34) follows, similarly as (33), by combining inequalities (S27)–(S31) and (S33).

II.6 Proof of Theorem 6
We use the decomposition (S22) and handle δ , δ , and δ separately. The term ˜ E ( δ ) can bebounded by (S24)–(S26) as in the proof of Theorem 3. By the mean value equation (S9) andthe Cauchy–Schwartz inequality, ˜ E ( δ ) can be bounded as (cid:12)(cid:12)(cid:12) ˜ E ( δ ) (cid:12)(cid:12)(cid:12) ≤ e C k ˆ γ − ¯ γ k ˜ E / h T w ( X ; ¯ γ CAL ) { ˆ h RCAL ( X ) − ¯ h CAL ( X ) } i × ˜ E / (cid:2) T w ( X ; ¯ γ CAL ) { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:3) . (S34)Similarly as in Lemma 10 but arguing in the reverse direction by Assumptions 1(i) and 3(iii),we find ˜ E (cid:2) T w ( X ; ¯ γ CAL ) { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:3) ≤ e C k ˆ α − ¯ α k × ˜ E h T w ( X ; ¯ γ CAL ) ψ { ¯ h WL ( X ) }{ ˆ m RWL ( X ) − ¯ m WL ( X ) }{ ˆ h RWL ( X ) − ¯ h WL ( X ) } i ≤ e C k ˆ α − ¯ α k C D † WL ( ˆ m RWL , ¯ m WL ; ¯ γ CAL ) , (S35)where the second inequality follows from Assumption 3(i). In the following, we derive twodifferent bounds on ˜ E ( δ ), leading to (40) and (41) respectively.First, suppose that condition (27) holds. Consider the following decomposition˜ E ( δ ) = ˜ E (cid:20) ψ { ¯ h WL ( X ) }{ ˆ h RWL ( X ) − ¯ h WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21) + ˜ E (cid:20) ˜ ψ ( X ) { ˆ h RWL ( X ) − ¯ h WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21) , (S36)denoted as ∆ + ∆ , where˜ ψ ( X ) = Z (cid:16) ψ [¯ h WL ( X ) + u { ˆ h RWL ( X ) − ¯ h WL ( X ) } ] − ψ { ¯ h WL ( X ) } (cid:17) d u. the event thatsup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12) ( ˜ E − E ) (cid:20) ψ { ¯ h WL ( X ) } f j ( X ) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C λ . Then P (Ω ) ≥ − ǫ similarly as in Lemma 1(i). In the event Ω , we have | ∆ | ≤ k ˆ α RWL − ¯ α WL k sup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12) ˜ E (cid:20) ψ { ¯ h WL ( X ) } f j ( X ) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≤ k ˆ α RWL − ¯ α WL k (Λ + 2 C λ ) . 
(S37)To bound ∆ , we have by Assumption 3(iii), | ˜ ψ ( X ) | ≤ ψ { ¯ h WL ( X ) } (cid:16) e C | ˆ h ( X ) − ¯ h ( X ) | − (cid:17) ≤ ψ { ¯ h WL ( X ) } C | ˆ h RWL ( X ) − ¯ h WL ( X ) | e C | ˆ h ( X ) − ¯ h ( X ) | , (S38)where the second inequality follows because (e c − /c = R e uc d u ≤ e c for c ≥
0. As a result,we find from Assumptions 1(i) and 3(i), | ∆ | ≤ (1 + e − B ) C C e C k ˆ α − ¯ α k ˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i . (S39)By condition (27), ˜ E [ { ˆ h RWL ( X ) − ¯ h WL ( X ) } ] can be bounded as (S33) in the event Ω ∩ Ω .Then (41) follows by collecting inequalities (S24)–(S26) and (S34)–(S39) and applying (39),(S7), and (S10) in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω .Now suppose that (27) may not hold. Denote h ( X ; α ) = α f ( X ). Then ˜ E ( δ ) can bedecomposed as˜ E ( δ ) = ( ˜ E − E ) (cid:18)h ψ { ˆ h RWL ( X ) } − ψ { ¯ h WL ( X ) } i (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:19) + E (cid:18)h ψ { ˆ h RWL ( X ) } − ψ { ¯ h WL ( X ) } i (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:19) , denoted as ∆ + ∆ . In the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω , we have k ˆ α RWL − ¯ α WL k ≤ η from(39) and hence by the mean value theorem, | ∆ | ≤ η sup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12) E (cid:20) ψ { h ( X ; ˜ α ) } f j ( X ) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≤ η Λ ( η ) , (S40)where ˜ α lies between ˆ α RWL and ¯ α WL . Moreover, in the event (Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω ) ∩ Ω ,applying Lemma 13 below yields | ∆ | ≤ C (1 + C e C η ) η λ . (S41)Then (41) follows by combining (S40)–(S41) and other aforementioned inequalities.24 emma 13. For r ≥ , denote by Ω the event that sup k α − ¯ α WL k ≤ r (cid:12)(cid:12)(cid:12)(cid:12) ( ˜ E − E ) (cid:18)(cid:2) ψ { h ( X ; α ) } − ψ { ¯ h WL ( X ) } (cid:3) (cid:26) − T ¯ π CAL ( X ) (cid:27)(cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ C (1 + C e C r ) rλ . Under Assumptions 1(i)–(ii), 3(i) and 3(iii), if λ ≥ √ − B + 1) C p log { (1 + p ) /ǫ } /n , then P (Ω ) ≥ − ǫ . Proof.
Denote g ( T, X ; α ) = (cid:2) ψ { h ( X ; α ) } − ψ { ¯ h WL ( X ) } (cid:3) (cid:26) − T ¯ π CAL ( X ) (cid:27) . For k α − ¯ α WL k ≤ r , similar manipulation as in (S36) and (S38) using Assumptions 1(i), 3(i)and 3(iii) yields (cid:12)(cid:12) ψ { h ( X ; α ) } − ψ { ¯ h WL ( X ) } (cid:12)(cid:12) ≤ ψ { ¯ h WL ( X ) }| h ( X ; α ) − ¯ h WL ( X ) | + ψ { ¯ h WL ( X ) } C | h ( X ; α ) − ¯ h WL ( X ) | e C | h ( X ; α ) − ¯ h ( X ) | ≤ C (1 + C e C r ) | h ( X ; α ) − ¯ h WL ( X ) | , (S42)that is, ψ () satisfies a Lipschitz condition. By the symmetrization and contraction theorems(e.g., Buhlmann & van de Geer 2011, Theorems 14.3 and 14.4), we have E " sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12) ( ˜ E − E ) { g ( T, X ; α ) } (cid:12)(cid:12)(cid:12) ≤ E sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 σ i g ( T i , X i ; α ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C (1 + C e C r ) × E sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 σ i { h ( X i ; α ) − ¯ h WL ( X i ) } (cid:26) − T i ¯ π CAL ( X i ) (cid:27)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C (1 + C e C r ) r × E sup j =0 , ,...,p (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 σ i f j ( X i ) (cid:26) − T i ¯ π CAL ( X i ) (cid:27)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , where ( σ , . . . , σ n ) are independent Rademacher variables with P ( σ i = 1) = P ( σ i = −
1) = 1 / i . By Hoeffding’s moment inequality (Buhlmann & van de Geer 2011, Lemma 14.14),we find from the preceding inequality E " sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12) ( ˜ E − E ) { g ( T, X ; α ) } (cid:12)(cid:12)(cid:12) ≤ C (1 + C e C r ) r × C (e B + 1) r p ) n , by Assumption 1(i)–(ii). For k α − ¯ α WL k ≤ r , inequality (S42) also shows that | g ( T i , X i ; α ) | ≤ C (1 + C e C r ) C (e B + 1) r . By Massart’s inequality (Buhlmann & van de Geer 2001, Theo-25em 14.2), we have with probability at least 1 − ǫ ,sup k α − ¯ α k ≤ r (cid:12)(cid:12)(cid:12) ( ˜ E − E ) { g ( T, X ; α ) } (cid:12)(cid:12)(cid:12) ≤ C (e B + 1) C (1 + C e C r ) r ( r p ) n + r { / (2 ǫ ) } n ) ≤ C (e B + 1) C (1 + C e C r ) r r
√( 16 log{(1+p)/ε} / n ), where the second inequality uses √a + √b ≤ √{2(a + b)}. □

II.7 Proof of Theorem 7
The proof is similar to that of Theorem 4. First, (S29) for ˜ E ( δ ) remains valid. Second,combining (S30) and (S35) yields˜ E ( δ ) ≤ e − B (cid:16) k ˆ γ − ¯ γ k C (cid:17) e C k ˆ α − ¯ α k C D † WL (ˆ h RWL , ¯ h WL ; ¯ α CAL ) . Third, similarly as in (S32) and (S35), we have˜ E ( δ ) = ˜ E " { ˆ m RWL ( X ) − ¯ m WL ( X ) } (cid:26) − T ¯ π CAL ( X ) (cid:27) ≤ (1 + e − B ) e C k ˆ α − ¯ α k C ˜ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i (S43) ≤ (1 + e − B ) e C k ˆ α − ¯ α k C C k ˆ α RWL − ¯ α WL k . Inequality (42) follows by collecting the aforementioned inequalities and applying (39), (40),(S10), and (S13) in the event Ω ∩ Ω ∩ Ω ∩ Ω ∩ Ω . If condition (27) holds, then in the eventΩ ∩ Ω , combining (S33) and (S19) and using (1 − e − c ) /c ≥ e − c for c ≥ E h { ˆ h RWL ( X ) − ¯ h WL ( X ) } i ≤ e B λ k ˆ α RWL − ¯ α WL k + τ − n λ k ˆ α RWL − ¯ α WL k + e C k ˆ α − ¯ α k C − D † WL (ˆ h RWL , ¯ h WL ; ¯ γ CAL ) o . (S44)Inequality (43) follows by combining (S43)–(S44) and other aforementioned inequalities exceptthat inequality (40) is replaced by (41). 26 I.8 Technical tools
For completeness, we state the following concentration inequalities, which can be obtained from Bühlmann & van de Geer (2011), Lemmas 14.11, 14.16, and 14.9.
Lemma 14. Let (Y₁, ..., Y_n) be independent variables such that E(Y_i) = 0 for i = 1, ..., n and max_{i=1,...,n} |Y_i| ≤ c₀ for some constant c₀. Then for any t > 0,
P( | n^{−1} Σ_{i=1}^n Y_i | > t ) ≤ 2 exp( −n t² / (2 c₀²) ).

Lemma 15.
Let (Y₁, ..., Y_n) be independent variables such that E(Y_i) = 0 for i = 1, ..., n and (Y₁, ..., Y_n) are uniformly sub-gaussian: max_{i=1,...,n} c₀² E{exp(Y_i²/c₀²) − 1} ≤ c₁² for some constants (c₀, c₁). Then for any t > 0,
P( | n^{−1} Σ_{i=1}^n Y_i | > t ) ≤ 2 exp{ −n t² / ( 8 (c₀² + c₁²) ) }.

Lemma 16.
Let (Y₁, ..., Y_n) be independent variables such that E(Y_i) = 0 for i = 1, ..., n and
n^{−1} Σ_{i=1}^n E(|Y_i|^k) ≤ (k!/2) c₁^{k−2} c₂², k = 2, 3, ...,
for some constants (c₁, c₂). Then for any t > 0,
P( | n^{−1} Σ_{i=1}^n Y_i | > c₁ t + c₂ √(2t) ) ≤ 2 exp(−n t).
The following results about sub-gaussian variables can be deduced from Bühlmann & van de Geer (2011), Lemmas 14.3 and 14.5.
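A quick Monte Carlo sanity check of the Bernstein-type tail bound in Lemma 16. The standard normal satisfies the moment condition with c₁ = c₂ = 1 (its absolute moments obey E|Y|^k ≤ k!/2 for k ≥ 2); the values of n, t, reps below are illustrative choices:

```python
import numpy as np

# For i.i.d. standard normal Y_i with c1 = c2 = 1, the Bernstein-type bound
# reads P(|n^{-1} sum Y_i| > c1*t + c2*sqrt(2t)) <= 2*exp(-n*t).
rng = np.random.default_rng(3)
n, t, reps = 100, 0.03, 4000
thresh = 1.0 * t + 1.0 * np.sqrt(2 * t)        # c1*t + c2*sqrt(2t)
bound = 2 * np.exp(-n * t)                     # about 0.0996
hits = sum(abs(rng.standard_normal(n).mean()) > thresh for _ in range(reps))
print(hits / reps, "<=", bound)
```

The empirical exceedance frequency falls well below the bound, as expected: for Gaussian averages the bound is loose, and it only becomes tight for heavier-tailed variables near the boundary of the moment condition.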
Lemma 17.
Suppose that Y is sub-gaussian: c₀² E{exp(Y²/c₀²) − 1} ≤ c₁² for some constants (c₀, c₁). Then
E(|Y|^k) ≤ Γ(k/2 + 1) (c₀² + c₁²) c₀^{k−2}, k = 2, 3, ....

Lemma 18.
Suppose that X is bounded: |X| ≤ c₂ for a constant c₂, and Y is sub-gaussian: c₀² E{exp(Y²/c₀²) − 1} ≤ c₁² for some constants (c₀, c₁). Then Z = XY² satisfies
E{ |Z − E(Z)|^k } ≤ (k!/2) c₃^{k−2} c₄², k = 2, 3, ...,
for c₃ = 2 c₂ c₀² and c₄² = 2 c₂² c₀² c₁².
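A numeric illustration of a Bernstein-type moment bound for such products of a bounded variable and the square of a sub-gaussian variable. Take X ≡ 1 and Y ~ N(0,1), so Z = Y²; the sub-gaussian constants c₀² = 4 and c₁² = 4(√2 − 1) hold exactly for the standard normal (since E exp(Y²/4) = √2), while the form c₃ = 2c₂c₀², c₄² = 2c₂²c₀²c₁² of the derived constants is an assumption made for this illustration:

```python
import math

# Check E{(Y^2 - 1)^k} <= (k!/2) c3^(k-2) c4^2 for even k, with Y ~ N(0,1).
c0sq, c1sq, c2 = 4.0, 4.0 * (math.sqrt(2.0) - 1.0), 1.0
c3, c4sq = 2.0 * c2 * c0sq, 2.0 * c2 ** 2 * c0sq * c1sq

def dfact(m):
    # double factorial (2m-1)!! = E(Y^(2m)) for Y ~ N(0,1), with (-1)!! = 1
    out = 1
    for i in range(1, 2 * m, 2):
        out *= i
    return out

def central_moment(k):
    # E{(Y^2 - 1)^k} via the binomial expansion and E(Y^(2j)) = (2j-1)!!
    return sum(math.comb(k, j) * (-1) ** (k - j) * dfact(j) for j in range(k + 1))

for k in (2, 4, 6, 8):      # even k, so the central moment equals E|Z - E(Z)|^k
    assert central_moment(k) <= math.factorial(k) / 2 * c3 ** (k - 2) * c4sq
```

The factorial growth on the right-hand side is exactly the moment condition required by Lemma 16, which is how Lemmas 16 and 18 combine in the proof of Lemma 3.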