High-dimensional Model-assisted Inference for Local Average Treatment Effects with Instrumental Variables
Baoluo Sun & Zhiqiang Tan September 22, 2020
Abstract.
Consider the problem of estimating the local average treatment effect with an instrumental variable, where the instrument unconfoundedness holds after adjusting for a set of measured covariates. Several unknown functions of the covariates need to be estimated through regression models, such as instrument propensity score and treatment and outcome regression models. We develop a computationally tractable method in high-dimensional settings where the numbers of regression terms are close to or larger than the sample size. Our method exploits regularized calibrated estimation, which involves Lasso penalties but carefully chosen loss functions for estimating coefficient vectors in these regression models, and then employs a doubly robust estimator for the treatment parameter through augmented inverse probability weighting. We provide rigorous theoretical analysis to show that the resulting Wald confidence intervals are valid for the treatment parameter under suitable sparsity conditions if the instrument propensity score model is correctly specified, but the treatment and outcome regression models may be misspecified. For existing high-dimensional methods, valid confidence intervals are obtained for the treatment parameter if all three models are correctly specified. We evaluate the proposed methods via extensive simulation studies and an empirical application to estimate the returns to education.
Key words and phrases.
Calibrated estimation; Causal inference; Complier average causal effect; Doubly robust estimation; Instrumental variable; Lasso penalty; Model misspecification; Propensity score; Regularized M-estimation.

Baoluo Sun is Assistant Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore, SG 117546 (Email: [email protected]), and Zhiqiang Tan is Professor, Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA (Email: [email protected]).
Introduction
A major difficulty in drawing causal inference from observational studies is the possible existence of unobserved background variables that are related to both the treatment status and outcome of interest, which is usually referred to as unmeasured confounding. In such settings, biased estimates of the causal effects may be obtained by comparing observed outcomes between treated and untreated individuals, even after adjusting for measured covariates. To tackle unmeasured confounding, instrumental variable (IV) methods have been widely used for estimating causal effects. While conventional IV methods are rooted in econometrics (Wright 1928), there are modern IV approaches which formulate structural assumptions required to be satisfied by IVs and provide nonparametric identification results for certain causal contrasts in terms of potential outcomes (Angrist et al. 1996; Robins 1994). Two basic IV assumptions are called instrument unconfoundedness and exclusion restriction. Intuitively, a valid IV serves as an exogenous experimental handle, the turning of which may change each individual's treatment status and, through and only through this effect, also change the observed outcome. Then under a monotonicity assumption and other technical conditions, the local average treatment effect (LATE), defined as the average treatment effect among individuals whose treatment status would be manipulated through the change of the IV, is identified from observed data nonparametrically (Angrist et al. 1996).

We consider the problem of estimating population LATEs provided that the IV assumptions hold after conditioning on a set of measured covariates. While a completely randomized IV is conceptually easy to interpret, the conditional version of instrument unconfoundedness is more plausible in allowing an IV to be randomized within different values of measured covariates. In general, it is helpful to think of estimation of LATEs in two stages.
First, regression models are built and fitted for certain unknown functions of the covariates, such as the instrument propensity score and the treatment and outcome regression functions in Tan (2006a) or regression functions in Frolich (2007), Uysal (2011), and Ogburn et al. (2015). In the second stage, the fitted functions are substituted into various estimators, related to the identification formulas of LATEs. For the regression tasks in the first stage, a conventional approach involves an iterative process of model diagnosis, modification, and refitting until some criterion is satisfied, for example, by inspection of residual plots for outcome regression or covariate balance for propensity score models. This approach depends on ad hoc choices of how regression terms are added or dropped in model building. Moreover, uncertainty in the iterative process is complicated and often ignored in subsequent inference (i.e., confidence intervals or hypothesis testing) about LATEs.

In this article, we develop a new method, extending the doubly robust method in Tan (2006a) for estimating population LATEs to high-dimensional settings where the numbers of regression terms are close to or larger than the sample size in the first-stage estimation described above. The instrument and treatment are assumed to be binary. Three regression models are involved: an instrument propensity score model for the conditional probability of the instrument being 1 given the covariates, a treatment regression model for the conditional probability of the treatment being 1 given the instrument and covariates, and an outcome regression model for the conditional mean of the observed outcome given the treatment, instrument and covariates.
The regression terms in all three models are pre-specified, for example, as main effects from a large number of covariates or additional interaction or nonlinear terms from even a moderate number of covariates.

Our method uses the doubly robust estimator of the LATE in Tan (2006a), in the form of a ratio of two augmented inverse probability weighted (AIPW) estimators. To tackle high-dimensional data, however, our method employs regularized calibrated estimation for estimating coefficients sequentially in the instrument propensity score and treatment and outcome regression models. The Lasso penalties (Tibshirani 1996) are introduced to achieve adequate estimation with a large number of regression terms under sparsity conditions where only a small but unknown subset of regression terms are associated with nonzero coefficients. The loss functions are carefully chosen for regularized estimation, different from least squares or maximum likelihood, by leveraging regularized calibrated estimation in Tan (2020b) for estimating average treatment effects (ATEs) under treatment unconfoundedness in high-dimensional data. In fact, our estimators for the coefficient vectors in the instrument propensity score and treatment regression models are directly transferred from Tan (2020b). Moreover, our estimator for the coefficient vector in the outcome regression model is new as a regularized weighted likelihood estimator, with a pseudo-response depending on both treatment status and observed outcome. This differs sharply from maximum quasi-likelihood estimation where the response depends on observed outcome only.

We provide rigorous high-dimensional analysis of the regularized calibrated estimators and the resulting AIPW estimator for the LATE. We establish sufficiently fast convergence rates for the regularized calibrated estimators, in spite of their sequential construction with data-dependent weights and mean functions.
Moreover, we show that under suitable sparsity conditions, the proposed estimator of LATE achieves a desired asymptotic expansion in the usual order $O_p(n^{-1/2})$ with a sample size $n$, and then valid Wald confidence intervals can be obtained, provided that the instrument propensity score model is correctly specified, but the treatment and outcome regression models may be misspecified. Following the survey literature (Sarndal et al. 2003), our confidence intervals for the LATE are said to be instrument propensity score model based, and treatment and outcome regression models assisted. It should be stressed that our method is aimed to be computationally tractable for practical use, with the sequential construction of the regularized calibrated estimators. In principle, doubly robust confidence intervals for LATEs can also be obtained, which remain valid if either the instrument propensity score model or the treatment and outcome regression models are correctly specified. But there are several analytical and computational issues which need to be properly addressed in this direction (see Remarks 6 and 7).

Related work.
There is an extensive literature on IV and related methods for causal inference (e.g., Baiocchi et al. 2014; Imbens 2014). For space limitation, we discuss closely related work only. Under the IV monotonicity assumption, parametric and semiparametric methods for estimating conditional LATEs given the full vector of covariates include Little & Yau (1998), Hirano et al. (2000), and Abadie (2003). For estimating population or subpopulation LATEs, doubly robust methods are proposed in Tan (2006a), Uysal (2011), and Ogburn et al. (2015), whereas nonparametric smoothing based methods are studied in Frolich (2007). Alternatively, there are various methods using IVs based on homogeneity assumptions for estimating certain average treatment effects on the treated (Robins 1994; Vansteelandt & Goetghebeur 2003; Tan 2010a). Doubly robust estimation of ATEs using IVs is studied in Okui et al. (2012) under a partially linear model and in Wang & Tchetgen Tchetgen (2018) under suitable identification assumptions.

The foregoing IV methods are developed in low-dimensional settings without regularized estimation. With high-dimensional data, Chernozhukov et al. (2018) proposed debiased methods for estimating various treatment parameters, using regularized likelihood-based estimation. In particular, for estimating population LATEs under the monotonicity assumption, their method exploits the doubly robust estimating function in Tan (2006a) similarly as our method, but employs regularized likelihood estimation for fitting three regression models as in Uysal (2011). The Wald confidence intervals for LATEs are shown to be valid under similar sparsity conditions as ours, provided that all the three regression models are correctly specified (or with negligible biases). Our main contribution is therefore to provide model-assisted confidence intervals for LATEs using differently configured regularized estimation.
See Remark 4 for further discussion.

Similar methods to non-regularized calibrated estimation are proposed in Kim & Haziza (2014) for estimating ATEs under treatment unconfoundedness and in Vermeulen & Vansteelandt (2015) for general doubly robust estimation. In low-dimensional settings, such methods lead to computationally simpler variance estimation and confidence intervals than likelihood-based estimation for nuisance parameters, where valid confidence intervals can still be derived using usual influence functions (see Remark 5). Similarly as in Tan (2020b), we exploit these ideas in high-dimensional settings to obtain model-assisted confidence intervals for LATEs, which would not be feasible if using regularized likelihood-based estimation as in Chernozhukov et al. (2018).

There is a growing literature on confidence intervals and hypothesis testing in high-dimensional settings. Examples include debiased Lasso in generalized linear models (Zhang & Zhang 2014; van de Geer et al. 2014), and double robustness related methods (Belloni et al. 2014; Farrell 2015; Avagyan & Vansteelandt 2017; Smucler et al. 2019; Bradic et al. 2019; Ning et al. 2020), in addition to Chernozhukov et al. (2018) and Tan (2020b). Sample splitting and cross fitting are used in some of these methods, but not pursued here.
Setup

Suppose that $(O_1, \ldots, O_n)$ are independent and identically distributed observations of $O = (Y, D, Z, X)$, where $Y$ is an outcome variable, $D$ is a binary treatment variable encoding the presence ($D = 1$) or absence of treatment ($D = 0$), $Z$ is a binary instrument variable, and $X$ is a vector of measured covariates. We use the potential outcomes notation (Neyman 1990; Rubin 1974) to define quantities of causal interest. For $z, d \in \{0, 1\}$, let $D(z)$ denote the potential treatment status that would be observed if $Z$ were set to level $z$, and let $Y(d, z)$ denote the potential outcome that would be observed if the treatment and instrument $D$ and $Z$ were set to the levels $d$ and $z$ respectively. Following Angrist et al. (1996), the population can be divided into four strata: the compliers with $D(1) > D(0)$, the always-takers with $D(1) = D(0) = 1$, the never-takers with $D(1) = D(0) = 0$, and the defiers with $D(1) < D(0)$.

Angrist et al. (1996) formalized an IV approach with a set of structural assumptions for identification of the local average treatment effect (LATE), also known as the average treatment effect among the compliers. Throughout, we make the following conditional versions of the IV assumptions (Abadie 2003; Tan 2006a; Frolich 2007):

(a) Instrument unconfoundedness: $(Y(d, z), D(z)) \perp Z \mid X$ for $d, z \in \{0, 1\}$, where $\perp$ denotes independence.

(b) Exclusion restriction: $Y(d, 1) = Y(d, 0)$, henceforth denoted as $Y(d)$, for $d \in \{0, 1\}$.

(c) Monotonicity: $D(1) \geq D(0)$ with probability 1.

(d) Instrument overlap: $0 < \pi^*(X) < 1$ with probability 1, where $\pi^*(X) = P(Z = 1 \mid X)$ is called the instrument propensity score (Tan 2006a).

(e) Instrumentation: $P(D(1) = 1) \neq P(D(0) = 1)$.

(f) Consistency: $Y = DY(1) + (1 - D)Y(0)$ and $D = ZD(1) + (1 - Z)D(0)$.

Assumption (a) states that the instrument $Z$ is essentially randomized within levels of the covariate $X$. Assumptions (a) and (b) together imply an independence condition, $(Y(d), D(z)) \perp Z \mid X$ for $d, z \in \{0, 1\}$, which can be technically used in place of Assumptions (a) and (b). Assumption (c) excludes the existence of defiers in the population. Vytlacil (2002) and Tan (2006a) showed that the independence and monotonicity assumptions are equivalent to the assumptions of a nonparametric latent index model:

(i) $D(z) = 1\{\eta(z, X) \geq U\}$ for a function $\eta$ and a random variable $U$, where $1(\cdot)$ is the indicator function.

(ii) $(Y(d), U) \perp Z \mid X$ and $U \perp (Z, X)$.

As a result, $U$ can be transformed to be uniformly distributed on $[0, 1]$, and hence $\eta(z, x)$ equals the treatment propensity score $P(D = 1 \mid Z = z, X = x)$. The preceding representation is helpful for understanding the data-generating process, which is used in our simulation studies.

Assumption (d) ensures that every unit within levels of $X$ has a positive probability of receiving each instrument level $z \in \{0, 1\}$. Assumption (e) requires a non-null causal effect of $Z$ on $D$, in accordance with the concept of $Z$ being an experimental handle; turning of the handle $Z$ needs to change treatment status $D$. Assumption (f) relates the potential outcomes and treatments to the observed data, under no interference and well-defined intervention conditions.

Under Assumptions (a)–(f), the LATE conditionally on $X = x$, defined as $\mathrm{LATE}(x) = E(Y(1) - Y(0) \mid D(1) > D(0), X = x)$, can be identified as (Angrist et al.
1996)

$$\mathrm{LATE}(x) = \frac{E(Y \mid Z = 1, X = x) - E(Y \mid Z = 0, X = x)}{E(D \mid Z = 1, X = x) - E(D \mid Z = 0, X = x)}.$$

For high-dimensional $X$, $\mathrm{LATE}(x)$ is difficult to interpret, depending on all covariates in $X$. Moreover, estimation of $\mathrm{LATE}(x)$ can be sensitive to modeling assumptions on the conditional expectations above. Hence it is of interest to consider the population LATE (or in short LATE), defined as $\mathrm{LATE} = E(Y(1) - Y(0) \mid D(1) > D(0))$. As shown by Tan (2006a) and Frolich (2007), LATE can be identified under Assumptions (a)–(f) in two distinct ways:

$$\mathrm{LATE} = \frac{E\{E(Y \mid Z = 1, X) - E(Y \mid Z = 0, X)\}}{E\{E(D \mid Z = 1, X) - E(D \mid Z = 0, X)\}}, \quad (1)$$

depending on the regression functions $E(Y \mid Z = z, X)$ and $E(D \mid Z = z, X)$ for $z \in \{0, 1\}$, or

$$\mathrm{LATE} = \frac{E\{\frac{Z}{\pi^*(X)} Y\} - E\{\frac{1-Z}{1-\pi^*(X)} Y\}}{E\{\frac{Z}{\pi^*(X)} D\} - E\{\frac{1-Z}{1-\pi^*(X)} D\}}, \quad (2)$$

depending on the instrument propensity score $\pi^*(X) = P(Z = 1 \mid X)$. Both (1) and (2) are in the form of a ratio of the difference in outcome $Y$ over that in treatment $D$.

A further identification exploited later in our approach is that the individual expectations $\theta_d = E(Y(d) \mid D(1) > D(0))$ for $d \in \{0, 1\}$, not just the difference $\mathrm{LATE} = \theta_1 - \theta_0$, can also be identified. In fact, $\theta_1$ is identified under Assumptions (a)–(f) as

$$\theta_1 = \frac{E\{E(DY \mid Z = 1, X) - E(DY \mid Z = 0, X)\}}{E\{E(D \mid Z = 1, X) - E(D \mid Z = 0, X)\}}, \quad (3)$$

or equivalently as

$$\theta_1 = \frac{E\{\frac{Z}{\pi^*(X)} DY\} - E\{\frac{1-Z}{1-\pi^*(X)} DY\}}{E\{\frac{Z}{\pi^*(X)} D\} - E\{\frac{1-Z}{1-\pi^*(X)} D\}}. \quad (4)$$

Similarly, $\theta_0$ is identified as (3) or (4) with $D$ replaced by $1 - D$. The difference of the corresponding identification equations for $\theta_1$ and $\theta_0$ leads back to (1) or (2). As shown in Tan (2006a), both (3) and (4) can be derived from the following expression of $\theta_1$:

$$\theta_1 = \frac{E\{D(1)Y(1)\} - E\{D(0)Y(1)\}}{E\{D(1)\} - E\{D(0)\}}, \quad (5)$$

which is a ratio of two differences, depending on potential outcomes and treatments.
Because $Z$ is an experimental handle with $(D, Y)$ as "outcomes" under Assumption (a) (instrument unconfoundedness), each expectation in the numerator and denominator of (5) can be identified through outcome regression averaging or inverse probability weighting, so that (3) or (4) are obtained. These results are parallel to related identification results under the assumption of treatment unconfoundedness. See Tan (2006b, 2010b) and references therein.

For estimating $(\theta_1, \theta_0)$ and LATE from sample data, additional modeling assumptions are required to estimate unknown functions in the identification equations (1)–(2) or (3)–(4). There are at least two distinct approaches, depending on models for the instrument propensity score $\pi^*(x) = P(Z = 1 \mid X = x)$ or treatment and outcome regression functions $m^*_z(x) = P(D = 1 \mid Z = z, X = x)$ and $m^*_{dz}(x) = E(Y \mid D = d, Z = z, X = x)$ for $d, z \in \{0, 1\}$ (Tan 2006a). For simplicity, estimation of $\theta_1$ is discussed, whereas that of $\theta_0$ can be similarly handled. Throughout, $\tilde{E}(\cdot)$ denotes a sample average such that $\tilde{E}\{b(O)\} = n^{-1}\sum_{i=1}^n b(O_i)$ for a function $b(O)$.

Remark 1 (On modeling choices). Consideration of models for $m^*_z(x)$ and $m^*_{1z}(x)$ is conveniently aligned with our interest in estimating both $(\theta_1, \theta_0)$ and LATE, through identification equations (3)–(4). As illustrated in Tan (2006a, Section 5), separate estimates of $\theta_1$ and $\theta_0$ can be informative in applications. The conditional expectation $E(DY \mid Z = z, X)$ in (3) is decomposed as $P(D = 1 \mid Z = z, X)\, E(Y \mid D = 1, Z = z, X) = m^*_z(x)\, m^*_{1z}(x)$. Both models for $m^*_z(x)$ and $m^*_{1z}(x)$ can be specified using appropriate link functions, as in (9)–(10) below. If estimation of LATE is solely of interest through identification equations (1)–(2), then modeling assumptions can be introduced on $m^*_z(x) = E(D \mid Z = z, X)$ and $E(Y \mid Z = z, X)$ (Frolich 2007; Uysal 2011).
In this case, our methods and theory developed later can be similarly extended.

First, consider an instrument propensity score model

$$P(Z = 1 \mid X = x) = \pi(x; \gamma) = \Pi\{\gamma^T f(x)\}, \quad (6)$$

where $\Pi(\cdot)$ is an inverse link function, $f(x) = \{1, f_1(x), \ldots, f_p(x)\}^T$ is a vector of known functions such as $(1, x^T)^T$, and $\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_p)^T$ is a vector of unknown parameters. For concreteness, assume that logistic regression is used such that $\pi(x; \gamma) = [1 + \exp\{-\gamma^T f(x)\}]^{-1}$. By (4), the inverse probability weighted (IPW) estimator of $\theta_1$ is

$$\hat\theta_{1,\mathrm{IPW}}(\hat\pi) = \frac{\tilde{E}\{\frac{Z}{\hat\pi(X)} DY\} - \tilde{E}\{\frac{1-Z}{1-\hat\pi(X)} DY\}}{\tilde{E}\{\frac{Z}{\hat\pi(X)} D\} - \tilde{E}\{\frac{1-Z}{1-\hat\pi(X)} D\}}, \quad (7)$$

where $\hat\pi(X) = \pi(X; \hat\gamma)$ is a fitted instrument propensity score. For low-dimensional $X$, $\hat\gamma$ is customarily the maximum likelihood estimator of $\gamma$. In high-dimensional settings, $\hat\gamma$ can be a Lasso penalized maximum likelihood estimator $\hat\gamma_{\mathrm{RML}}$, defined as a minimizer of $L_{\mathrm{RML}}(\gamma) = L_{\mathrm{ML}}(\gamma) + \lambda\|\gamma_{1:p}\|_1$, where $\|\cdot\|_1$ denotes the $L_1$ norm, $\gamma_{1:p} = (\gamma_1, \ldots, \gamma_p)^T$, $\lambda \geq 0$ is a tuning parameter, and $L_{\mathrm{ML}}(\gamma)$ is the average negative log-likelihood

$$L_{\mathrm{ML}}(\gamma) = \tilde{E}\left[-Z\gamma^T f(X) + \log\{1 + \mathrm{e}^{\gamma^T f(X)}\}\right]. \quad (8)$$

Alternatively, for $z \in \{0, 1\}$, consider treatment and outcome regression models, which can both be called "outcome regression" with $(D, Y)$ as "outcomes":

$$P(D = 1 \mid Z = z, X = x) = m_z(x; \alpha_z) = \psi_D\{\alpha_z^T g(x)\}, \quad (9)$$

$$E(Y \mid D = 1, Z = z, X = x) = m_{1z}(x; \alpha_{1z}) = \psi_Y\{\alpha_{1z}^T h(x)\}, \quad (10)$$

where $\psi_D(\cdot)$ and $\psi_Y(\cdot)$ are inverse link functions, assumed to be increasing with $\psi_D(-\infty) = 0$ and $\psi_D(\infty) = 1$, $g(x) = \{1, g_1(x), \ldots, g_q(x)\}^T$ and $h(x) = \{1, h_1(x), \ldots, h_{q_1}(x)\}^T$ are two vectors of known functions, and $\alpha_z$ and $\alpha_{1z}$ are two vectors of unknown parameters of dimensions $1 + q$ and $1 + q_1$ respectively.
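As a concrete illustration of the setup so far, the following minimal simulation sketch generates data from the latent index representation (i)–(ii) and evaluates the IPW identification formula (4), and its analogue for $\theta_0$, with the true instrument propensity score plugged in. All coefficient values are hypothetical, chosen only so that the true LATE equals 1; this is not the simulation design of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process following the latent index model:
# D(z) = 1{eta(z, X) >= U}, with U ~ Uniform(0, 1) independent of (Z, X).
expit = lambda u: 1.0 / (1.0 + np.exp(-u))
X = rng.normal(size=n)
pi_star = expit(0.3 + 0.5 * X)                       # instrument propensity score P(Z=1|X)
Z = rng.binomial(1, pi_star)
U = rng.uniform(size=n)
eta = lambda z, x: expit(-0.5 + 1.5 * z + 0.4 * x)   # increasing in z => monotonicity
D1 = (eta(1, X) >= U).astype(float)
D0 = (eta(0, X) >= U).astype(float)
D = Z * D1 + (1 - Z) * D0                            # consistency, Assumption (f)
Y1 = 1.0 + X + rng.normal(size=n)                    # Y(1) - Y(0) = 1 + noise, so LATE = 1
Y0 = X + rng.normal(size=n)
Y = D * Y1 + (1 - D) * Y0

# IPW identification (4) for theta_1, and the analogue with D replaced by 1 - D for theta_0.
w1, w0 = Z / pi_star, (1 - Z) / (1 - pi_star)
den = np.mean(w1 * D) - np.mean(w0 * D)              # estimates E{D(1)} - E{D(0)}
theta1_ipw = (np.mean(w1 * D * Y) - np.mean(w0 * D * Y)) / den
theta0_ipw = (np.mean(w1 * (1 - D) * Y) - np.mean(w0 * (1 - D) * Y)) / (
    np.mean(w1 * (1 - D)) - np.mean(w0 * (1 - D)))
late_ipw = theta1_ipw - theta0_ipw                   # close to 1 for large n
```

With the true $\pi^*$ plugged in, `late_ipw` converges to the true LATE of 1 here; in practice $\pi^*$ is unknown and must be estimated, which is the subject of the remainder of the paper.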
By (3), the outcome-regression based estimator of $\theta_1$ is

$$\hat\theta_{1,\mathrm{OR}}(\hat{m}_\bullet, \hat{m}_{1\bullet}) = \frac{\tilde{E}\{\hat{m}_1(X)\hat{m}_{11}(X)\} - \tilde{E}\{\hat{m}_0(X)\hat{m}_{10}(X)\}}{\tilde{E}\{\hat{m}_1(X)\} - \tilde{E}\{\hat{m}_0(X)\}}, \quad (11)$$

where $\hat{m}_\bullet = (\hat{m}_0, \hat{m}_1)$, $\hat{m}_{1\bullet} = (\hat{m}_{10}, \hat{m}_{11})$, and, for $z \in \{0, 1\}$, $\hat{m}_z(X) = m_z(X; \hat\alpha_z)$ is a fitted treatment regression function and $\hat{m}_{1z}(X) = m_{1z}(X; \hat\alpha_{1z})$ is a fitted outcome regression function. For low-dimensional $X$, $\hat\alpha_z$ and $\hat\alpha_{1z}$ are customarily maximum quasi-likelihood estimators of $\alpha_z$ and $\alpha_{1z}$ or their variants. In high-dimensional settings, $\hat\alpha_z$ and $\hat\alpha_{1z}$ can be regularized estimators. For concreteness, let $\hat\alpha_{z,\mathrm{RML}}$ be a Lasso penalized quasi-likelihood estimator of $\alpha_z$ which is a minimizer of $L_{\mathrm{RML}}(\alpha_z) = L_{\mathrm{ML}}(\alpha_z) + \lambda\|(\alpha_z)_{1:q}\|_1$, where $(\alpha_z)_{1:q}$ is $\alpha_z$ excluding the intercept, $\lambda \geq 0$ is a tuning parameter, and

$$L_{\mathrm{ML}}(\alpha_z) = \tilde{E}\left(1\{Z = z\}\left[-D\alpha_z^T g(X) + \Psi_D\{\alpha_z^T g(X)\}\right]\right), \quad (12)$$

where $\Psi_D(u) = \int_0^u \psi_D(\tilde{u})\,\mathrm{d}\tilde{u}$. Let $\hat\alpha_{1z,\mathrm{RML}}$ be a Lasso penalized quasi-likelihood estimator of $\alpha_{1z}$ which is a minimizer of $L_{\mathrm{RML}}(\alpha_{1z}) = L_{\mathrm{ML}}(\alpha_{1z}) + \lambda\|(\alpha_{1z})_{1:q_1}\|_1$, where $(\alpha_{1z})_{1:q_1}$ is $\alpha_{1z}$ excluding the intercept, $\lambda \geq 0$ is a tuning parameter, and

$$L_{\mathrm{ML}}(\alpha_{1z}) = \tilde{E}\left(1\{Z = z\} D\left[-Y\alpha_{1z}^T h(X) + \Psi_Y\{\alpha_{1z}^T h(X)\}\right]\right), \quad (13)$$

where $\Psi_Y(u) = \int_0^u \psi_Y(\tilde{u})\,\mathrm{d}\tilde{u}$. The loss function (12) or (13) is the average negative log-quasi-likelihood in the case where model (9) or (10) corresponds to a generalized linear model with a canonical link (McCullagh and Nelder 1989).

Remark 2 (On outcome regression). We comment on specification of outcome regression model (10). By the IV assumptions, $E(Y \mid D = 1, Z = z, X = x) = E(Y \mid Z = z, D(z) = 1, X = x) = E(Y(1) \mid U \leq m^*_z(x), X = x)$ depends on $(z, x)$ only through $m^*_z(x)$ and $x$.
On one hand, this relationship can be incorporated in model (10), by including in $h(x)$ various functions of $x$ and $\hat{m}_z(x)$, as well as their interactions, for example, $(1, x^T, \hat{m}_z, x^T\hat{m}_z)^T$. On the other hand, such specification of $h(x)$, depending on an estimator $\hat\alpha_z$, introduces additional variation which needs to be taken account of in theoretical analysis. For simplicity, this complication is not addressed in our theoretical results later. In fact, our method is shown to yield valid inference when model (10) may be misspecified, and hence $h(x)$ can be chosen independently of $\hat{m}_z(x)$. See Remarks 6 and 7 for related discussions on choices of model-assisted inference.

Consistency of the estimator $\hat\theta_{1,\mathrm{IPW}}(\hat\pi)$ relies on correct specification of model (6), whereas consistency of $\hat\theta_{1,\mathrm{OR}}(\hat{m}_\bullet, \hat{m}_{1\bullet})$ relies on correct specification of models (9)–(10). The weighting and regression approaches can be combined to obtain doubly robust estimators through augmented IPW estimation (Tan 2006a), in a similar manner as in the setting of treatment unconfoundedness (Robins et al. 1994; Tan 2007). The expectations $E\{D(1)\}$ and $E\{D(0)\}$ in (5) can be estimated by $\tilde{E}\{\varphi_{D(1)}(O; \hat\pi, \hat{m}_1)\}$ and $\tilde{E}\{\varphi_{D(0)}(O; \hat\pi, \hat{m}_0)\}$ respectively, where

$$\varphi_{D(1)}(O; \hat\pi, \hat{m}_1) = \frac{Z}{\hat\pi(X)} D - \left\{\frac{Z}{\hat\pi(X)} - 1\right\}\hat{m}_1(X), \quad (14)$$

$$\varphi_{D(0)}(O; \hat\pi, \hat{m}_0) = \frac{1-Z}{1-\hat\pi(X)} D - \left\{\frac{1-Z}{1-\hat\pi(X)} - 1\right\}\hat{m}_0(X). \quad (15)$$

The expectations $E\{D(1)Y(1)\}$ and $E\{D(0)Y(1)\}$ in (5) can be estimated by $\tilde{E}\{\varphi_{D(1)Y(1)}(O; \hat\pi, \hat{m}_1, \hat{m}_{11})\}$ and $\tilde{E}\{\varphi_{D(0)Y(1)}(O; \hat\pi, \hat{m}_0, \hat{m}_{10})\}$ respectively, where

$$\varphi_{D(1)Y(1)}(O; \hat\pi, \hat{m}_1, \hat{m}_{11}) = \frac{Z}{\hat\pi(X)} DY - \left\{\frac{Z}{\hat\pi(X)} - 1\right\}\hat{m}_1(X)\hat{m}_{11}(X), \quad (16)$$

$$\varphi_{D(0)Y(1)}(O; \hat\pi, \hat{m}_0, \hat{m}_{10}) = \frac{1-Z}{1-\hat\pi(X)} DY - \left\{\frac{1-Z}{1-\hat\pi(X)} - 1\right\}\hat{m}_0(X)\hat{m}_{10}(X).$$
(17)

By (5), the resulting doubly robust estimator of $\theta_1$ is

$$\hat\theta_1(\hat\pi, \hat{m}_\bullet, \hat{m}_{1\bullet}) = \frac{\tilde{E}\{\varphi_{D(1)Y(1)}(O; \hat\pi, \hat{m}_1, \hat{m}_{11}) - \varphi_{D(0)Y(1)}(O; \hat\pi, \hat{m}_0, \hat{m}_{10})\}}{\tilde{E}\{\varphi_{D(1)}(O; \hat\pi, \hat{m}_1) - \varphi_{D(0)}(O; \hat\pi, \hat{m}_0)\}}, \quad (18)$$

where $\hat{m}_\bullet = (\hat{m}_0, \hat{m}_1)$ and $\hat{m}_{1\bullet} = (\hat{m}_{10}, \hat{m}_{11})$. Consistency of $\hat\theta_1(\hat\pi, \hat{m}_\bullet, \hat{m}_{1\bullet})$ can be achieved if either model (6) or models (9)–(10) are correctly specified.

There is potentially a further advantage of doubly robust estimators in high-dimensional settings. In this case, the estimator $\hat\theta_{1,\mathrm{IPW}}(\hat\pi)$ or $\hat\theta_{1,\mathrm{OR}}(\hat{m}_\bullet, \hat{m}_{1\bullet})$ in general converges at a slower rate than $O_p(n^{-1/2})$ to the true value $\theta_1$ under correctly specified model (6) or models (9)–(10) respectively. Denote $\hat\pi_{\mathrm{RML}}(X) = \pi(X; \hat\gamma_{\mathrm{RML}})$, $\hat{m}_{z,\mathrm{RML}}(X) = m_z(X; \hat\alpha_{z,\mathrm{RML}})$, and $\hat{m}_{1z,\mathrm{RML}}(X) = m_{1z}(X; \hat\alpha_{1z,\mathrm{RML}})$, obtained from Lasso penalized likelihood estimation. By related results in Chernozhukov et al. (2018, Section 5.2), it can be shown that if both models (6) and (9)–(10) are correctly specified, then under suitable sparsity conditions, $\hat\theta_{1,\mathrm{RML}} = \hat\theta_1(\hat\pi_{\mathrm{RML}}, \hat{m}_{\bullet,\mathrm{RML}}, \hat{m}_{1\bullet,\mathrm{RML}})$ converges to $\theta_1$ at rate $O_p(n^{-1/2})$ and admits the asymptotic expansion

$$\hat\theta_{1,\mathrm{RML}} = \frac{\tilde{E}\{\varphi_{D(1)Y(1)}(O; \pi^*, m^*_1, m^*_{11}) - \varphi_{D(0)Y(1)}(O; \pi^*, m^*_0, m^*_{10})\}}{\tilde{E}\{\varphi_{D(1)}(O; \pi^*, m^*_1) - \varphi_{D(0)}(O; \pi^*, m^*_0)\}} + o_p(n^{-1/2}), \quad (19)$$

where $\pi^*(X) = \pi(X; \gamma^*)$, $m^*_z(X) = m_z(X; \alpha^*_z)$, and $m^*_{1z}(X) = m_{1z}(X; \alpha^*_{1z})$, with $(\gamma^*, \alpha^*_z, \alpha^*_{1z})$ the true values in models (6) and (9)–(10). From this expansion, valid Wald confidence intervals based on $\hat\theta_{1,\mathrm{RML}}$ can be obtained for $\theta_1$.

Methods and theory
To focus on main ideas, we describe our new method for estimating $\theta_1$. Estimation of $\theta_0$ and LATE is discussed later in this section. Similarly as in Section 2.2, consider logistic regression model (6), for estimating the instrument propensity score $\pi^*(x) = P(Z = 1 \mid X = x)$, and models (9)–(10) for estimating treatment and outcome regression functions $m^*_z(x) = P(D = 1 \mid Z = z, X = x)$ and $m^*_{1z}(x) = E(Y \mid D = 1, Z = z, X = x)$ respectively for $z \in \{0, 1\}$. For technical reasons (see Section 3.2), we require that the "regressor" vector $f(x)$ in model (6) is a subvector of $g(x)$ and $h(x)$ in models (9)–(10) (hence $p \leq q$ and $p \leq q_1$). This condition can be satisfied possibly after enlarging models (9)–(10) to accommodate $f(x)$.

A class of doubly robust estimators of $\theta_1$, slightly more flexible than (18), is

$$\hat\theta_1(\hat\pi_\bullet, \hat{m}_\bullet, \hat{m}_{1\bullet}) = \frac{\tilde{E}\{\tau_{DY}(O; \hat\pi_\bullet, \hat{m}_\bullet, \hat{m}_{1\bullet})\}}{\tilde{E}\{\tau_D(O; \hat\pi_\bullet, \hat{m}_\bullet)\}},$$

where $\hat\pi_\bullet = (\hat\pi_0, \hat\pi_1)$ with $\hat\pi_0$ and $\hat\pi_1$ two possibly different versions of fitted values for $\pi^*$, $\hat{m}_\bullet = (\hat{m}_0, \hat{m}_1)$ and $\hat{m}_{1\bullet} = (\hat{m}_{10}, \hat{m}_{11})$ with $(\hat{m}_z, \hat{m}_{1z})$ fitted values for $(m^*_z, m^*_{1z})$ respectively for $z \in \{0, 1\}$, and, with $\varphi_{D(z)}$ and $\varphi_{D(z)Y(1)}$ defined as in (14)–(17),

$$\tau_D(O; \hat\pi_\bullet, \hat{m}_\bullet) = \varphi_{D(1)}(O; \hat\pi_1, \hat{m}_1) - \varphi_{D(0)}(O; \hat\pi_0, \hat{m}_0),$$

$$\tau_{DY}(O; \hat\pi_\bullet, \hat{m}_\bullet, \hat{m}_{1\bullet}) = \varphi_{D(1)Y(1)}(O; \hat\pi_1, \hat{m}_1, \hat{m}_{11}) - \varphi_{D(0)Y(1)}(O; \hat\pi_0, \hat{m}_0, \hat{m}_{10}).$$
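To make the construction concrete, the estimating functions in (14)–(17) and the ratio estimator above can be written as a short sketch. Function and variable names here are ours, not from the paper's software, and the example below only illustrates the double robustness property with the true propensity score supplied.

```python
import numpy as np

def phi_D(Z, D, pi_hat, m_hat, z):
    # Augmented IPW estimating function for E{D(z)}, as in (14)-(15).
    ipw = Z / pi_hat if z == 1 else (1 - Z) / (1 - pi_hat)
    return ipw * D - (ipw - 1) * m_hat

def phi_DY(Z, D, Y, pi_hat, m_hat, m1_hat, z):
    # Augmented IPW estimating function for E{D(z)Y(1)}, as in (16)-(17).
    ipw = Z / pi_hat if z == 1 else (1 - Z) / (1 - pi_hat)
    return ipw * D * Y - (ipw - 1) * m_hat * m1_hat

def theta1_dr(Z, D, Y, pi0, pi1, m0, m1, m10, m11):
    # Ratio of sample averages of tau_DY and tau_D, allowing two possibly
    # different fitted instrument propensity scores (pi0, pi1).
    tau_DY = (phi_DY(Z, D, Y, pi1, m1, m11, 1)
              - phi_DY(Z, D, Y, pi0, m0, m10, 0))
    tau_D = phi_D(Z, D, pi1, m1, 1) - phi_D(Z, D, pi0, m0, 0)
    return np.mean(tau_DY) / np.mean(tau_D)
```

For instance, with the true instrument propensity score supplied as both `pi0` and `pi1`, the estimator remains consistent even when `m0, m1, m10, m11` come from grossly misspecified treatment and outcome models (say, constants), illustrating the double robustness of (18).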
Our point estimator of $\theta_1$ is $\hat\theta_{1,\mathrm{RCAL}} = \hat\theta_1(\hat\pi_{\bullet,\mathrm{RCAL}}, \hat{m}_{\bullet,\mathrm{RWL}}, \hat{m}_{1\bullet,\mathrm{RWL}})$, where, for $z \in \{0, 1\}$, $\hat\pi_{z,\mathrm{RCAL}}(X) = \pi(X; \hat\gamma_{z,\mathrm{RCAL}})$, $\hat{m}_{z,\mathrm{RWL}}(X) = m_z(X; \hat\alpha_{z,\mathrm{RWL}})$, and $\hat{m}_{1z,\mathrm{RWL}}(X) = m_{1z}(X; \hat\alpha_{1z,\mathrm{RWL}})$ are fitted values, and $(\hat\gamma_{z,\mathrm{RCAL}}, \hat\alpha_{z,\mathrm{RWL}}, \hat\alpha_{1z,\mathrm{RWL}})$ are estimators of $(\gamma, \alpha_z, \alpha_{1z})$ defined as follows.

For logistic regression model (6), the estimator $\hat\gamma_{z,\mathrm{RCAL}}$ is a regularized calibrated estimator of $\gamma$ (Tan 2020a), defined as a minimizer of the Lasso penalized objective function

$$L_{z,\mathrm{RCAL}}(\gamma) = L_{z,\mathrm{CAL}}(\gamma) + \lambda\|\gamma_{1:p}\|_1, \quad z \in \{0, 1\}, \quad (20)$$

with the calibration loss functions

$$L_{0,\mathrm{CAL}}(\gamma) = \tilde{E}\{(1 - Z)\mathrm{e}^{\gamma^T f(X)} - Z\gamma^T f(X)\}, \quad (21)$$

$$L_{1,\mathrm{CAL}}(\gamma) = \tilde{E}\{Z\mathrm{e}^{-\gamma^T f(X)} + (1 - Z)\gamma^T f(X)\}. \quad (22)$$

Minimization of (20) can be implemented using the R package RCAL (Tan 2020a). For treatment regression model (9), $\hat\alpha_{z,\mathrm{RWL}}$ is a regularized weighted likelihood estimator of $\alpha_z$, defined as a minimizer of the Lasso penalized objective function

$$L_{z,\mathrm{RWL}}(\alpha_z; \hat\gamma_{z,\mathrm{RCAL}}) = L_{z,\mathrm{WL}}(\alpha_z; \hat\gamma_{z,\mathrm{RCAL}}) + \lambda\|(\alpha_z)_{1:q}\|_1, \quad (23)$$

with the weighted (quasi-)likelihood loss function

$$L_{z,\mathrm{WL}}(\alpha_z; \hat\gamma_z) = \tilde{E}\left(1\{Z = z\}\, w_z(X; \hat\gamma_z)\left[-D\alpha_z^T g(X) + \Psi_D\{\alpha_z^T g(X)\}\right]\right), \quad (24)$$

where the weight function is $w_z(x; \gamma) = [\{1 - \pi(x; \gamma)\}/\pi(x; \gamma)]^{2z-1}$ for $z \in \{0, 1\}$. For outcome regression model (10), $\hat\alpha_{1z,\mathrm{RWL}}$ is a regularized calibrated estimator of $\alpha_{1z}$, defined as a minimizer of the Lasso penalized objective function

$$L_{1z,\mathrm{RWL}}(\alpha_{1z}; \hat\gamma_{z,\mathrm{RCAL}}, \hat\alpha_{z,\mathrm{RWL}}) = L_{1z,\mathrm{WL}}(\alpha_{1z}; \hat\gamma_{z,\mathrm{RCAL}}, \hat\alpha_{z,\mathrm{RWL}}) + \lambda\|(\alpha_{1z})_{1:q_1}\|_1, \quad (25)$$

with the loss function

$$L_{1z,\mathrm{WL}}(\alpha_{1z}; \hat\gamma_z, \hat\alpha_z) = \tilde{E}\left(1\{Z = z\}\, w_z(X; \hat\gamma_z)\left[-DY\alpha_{1z}^T h(X) + m_z(X; \hat\alpha_z)\Psi_Y\{\alpha_{1z}^T h(X)\}\right]\right), \quad (26)$$

where the weight function $w_z(x; \gamma)$ is the same as above.
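For intuition, minimizing the $z = 1$ objective in (20), i.e., the calibration loss (22) plus the Lasso penalty, can be sketched by a simple proximal gradient loop. This is only an illustrative sketch under arbitrary step sizes and iteration counts; the paper's computation relies on the R packages RCAL and glmnet.

```python
import numpy as np

def rcal_ips_z1(Z, F, lam, step=0.1, iters=5000):
    """Proximal-gradient (ISTA) sketch for the z = 1 objective in (20):
    mean(Z * exp(-F @ g) + (1 - Z) * (F @ g)) + lam * ||g[1:]||_1,
    where F has a leading column of ones and the intercept g[0] is unpenalized."""
    n = len(Z)
    g = np.zeros(F.shape[1])
    for _ in range(iters):
        u = F @ g
        grad = F.T @ (-Z * np.exp(-u) + (1 - Z)) / n   # gradient of the smooth loss
        g = g - step * grad
        # soft-threshold all coefficients except the intercept
        g[1:] = np.sign(g[1:]) * np.maximum(np.abs(g[1:]) - step * lam, 0.0)
    return g
```

At a minimizer, the unpenalized-intercept score equation gives $\tilde E\{Z\mathrm{e}^{-\hat\gamma^T f(X)}\} = \tilde E(1 - Z)$, which is exactly the calibration property that the inverse probability weights $Z_i/\hat\pi(X_i)$ average to 1, discussed below.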
Interestingly, (26) can be equivalently expressed as a weighted (quasi-)likelihood loss

$$L_{1z,\mathrm{WL}}(\alpha_{1z}; \hat\gamma_z, \hat\alpha_z) = \tilde{E}\left(1\{Z = z\}\,\tilde{w}_z(X; \hat\gamma_z, \hat\alpha_z)\left[-\tilde{Y}_z\alpha_{1z}^T h(X) + \Psi_Y\{\alpha_{1z}^T h(X)\}\right]\right), \quad (27)$$

with the pseudo-response $\tilde{Y}_z = YD/m_z(X; \hat\alpha_z)$ and weight $\tilde{w}_z(X; \hat\gamma_z, \hat\alpha_z) = w_z(X; \hat\gamma_z)\, m_z(X; \hat\alpha_z)$. Hence existing software for Lasso penalized weighted estimation such as glmnet (Friedman et al. 2010) and RCAL can be employed to minimize (25) as well as (23). The loss (26) or (27) differs sharply from that of the likelihood loss (13), in terms of the residuals implied.

Compared with regularized likelihood estimation in Section 2.2, our method involves using a different set of estimators $(\hat\gamma_{z,\mathrm{RCAL}}, \hat\alpha_{z,\mathrm{RWL}}, \hat\alpha_{1z,\mathrm{RWL}})$, which are called regularized calibrated estimators. Similarly as in Tan (2020b), these estimators are derived to allow model-assisted, asymptotic confidence intervals for $\theta_1$ based on $\hat\theta_{1,\mathrm{RCAL}}$. See Proposition 1 for a summary and Section 3.2 for further discussion. We also point out several interesting properties algebraically associated with our estimators. First, by the Karush–Kuhn–Tucker (KKT) condition for minimizing (20), the fitted instrument propensity score $\hat\pi_{1,\mathrm{RCAL}}(X)$ satisfies

$$\frac{1}{n}\sum_{i=1}^n \frac{Z_i}{\hat\pi_{1,\mathrm{RCAL}}(X_i)} = 1, \quad (28)$$

$$\frac{1}{n}\left|\sum_{i=1}^n \frac{Z_i f_j(X_i)}{\hat\pi_{1,\mathrm{RCAL}}(X_i)} - \sum_{i=1}^n f_j(X_i)\right| \leq \lambda, \quad j = 1, \ldots, p, \quad (29)$$

where equality holds in (29) for any $j$ such that the $j$th estimate $(\hat\gamma_{1,\mathrm{RCAL}})_j$ is nonzero. These equations also hold with $Z_i$ replaced by $1 - Z_i$ and $\hat\pi_{1,\mathrm{RCAL}}$ replaced by $1 - \hat\pi_{0,\mathrm{RCAL}}$. Eq. (28) shows that the inverse probability weights, $1/\hat\pi_{1,\mathrm{RCAL}}(X_i)$ with $Z_i = 1$, sum to the sample size $n$, whereas Eq. (29) implies that the weighted average of each covariate $f_j(X_i)$ over the instrument group $\{Z_i = 1\}$ may differ from the overall average of $f_j(X_i)$ by no more than $\lambda$. Such differences are of interest in showing how a weighted instrument group resembles the overall sample. In contrast, similar results are not available when using the regularized likelihood estimator $\hat\gamma_{\mathrm{RML}}$. Moreover, Tan (2020a) studied a comparison between calibrated and maximum likelihood estimation in logistic regression. Minimization of the calibration loss (21) or (22) achieves a stronger control of relative errors of fitted propensity scores than that of the likelihood loss (8).

By the KKT condition associated with the intercept in $\alpha_1$ for minimizing (23), the fitted treatment regression function $\hat{m}_{1,\mathrm{RWL}}(X)$ satisfies

$$\frac{1}{n}\sum_{i=1}^n Z_i\,\frac{1 - \hat\pi_{1,\mathrm{RCAL}}(X_i)}{\hat\pi_{1,\mathrm{RCAL}}(X_i)}\{D_i - \hat{m}_{1,\mathrm{RWL}}(X_i)\} = 0. \quad (30)$$

A similar equation holds with $Z_i$ replaced by $1 - Z_i$, $\hat\pi_{1,\mathrm{RCAL}}$ by $1 - \hat\pi_{0,\mathrm{RCAL}}$, and $\hat{m}_{1,\mathrm{RWL}}$ by $\hat{m}_{0,\mathrm{RWL}}$. As a result of (30), our augmented IPW estimator for $E\{D(1)\}$, defined as $\hat{E}_{\mathrm{RCAL}}\{D(1)\} = \tilde{E}\{\varphi_{D(1)}(O; \hat\pi_{1,\mathrm{RCAL}}, \hat{m}_{1,\mathrm{RWL}})\}$, can be simplified to

$$\tilde{E}\left[\hat{m}_{1,\mathrm{RWL}}(X) + \frac{Z}{\hat\pi_{1,\mathrm{RCAL}}(X)}\{D - \hat{m}_{1,\mathrm{RWL}}(X)\}\right] = \tilde{E}\{ZD + (1 - Z)\hat{m}_{1,\mathrm{RWL}}(X)\}.$$

Hence $\hat{E}_{\mathrm{RCAL}}\{D(1)\}$ always falls within the range of the binary treatment values $\{D_i : Z_i = 1, i = 1, \ldots, n\}$ and the predicted values $\{\hat{m}_{1,\mathrm{RWL}}(X_i) : Z_i = 0, i = 1, \ldots, n\}$, which are by definition in the interval $[0, 1]$.
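The simplification following (30) is easy to verify numerically. The sketch below fits the $z = 1$ treatment regression by the weighted likelihood loss (24) without a penalty (a simplification for illustration; names and coefficients are ours), and checks that the intercept score equation collapses the augmented IPW estimator of $E\{D(1)\}$ to the simple form $\tilde E\{ZD + (1 - Z)\hat m_1(X)\}$:

```python
import numpy as np

expit = lambda u: 1.0 / (1.0 + np.exp(-u))

def fit_wl_m1(Z, D, F, pi_hat, step=0.05, iters=20000):
    # Gradient descent on the z = 1 weighted likelihood loss (24), no penalty:
    # logistic regression with weights w_1 = (1 - pi_hat)/pi_hat on the subsample {Z = 1}.
    w = Z * (1 - pi_hat) / pi_hat
    a = np.zeros(F.shape[1])
    for _ in range(iters):
        a += step * F.T @ (w * (D - expit(F @ a))) / len(D)
    return expit(F @ a)

def aipw_ED1(Z, D, pi_hat, m1_hat):
    # Sample average of phi_{D(1)} in (14), written in augmentation form.
    return np.mean(m1_hat + Z / pi_hat * (D - m1_hat))
```

Once the weighted score equation holds at the fitted `m1_hat`, `aipw_ED1(Z, D, pi_hat, m1_hat)` coincides with `np.mean(Z*D + (1-Z)*m1_hat)` up to optimization error, so the estimate of $E\{D(1)\}$ is automatically bounded between 0 and 1.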
This boundedness property is not satisfied by the usual estimator Ê_{RML}{D(1)} = Ẽ{φ_{D1}(O; π̂_{RML}, m̂_{D1,RML})}, but is desirable for stabilizing the behavior of augmented IPW estimators, especially when used in the denominator of (18).

By the KKT condition associated with the intercept in α₁₁ for minimizing (25), the fitted functions m̂_{D1,RWL}(X) and m̂_{Y11,RWL}(X) jointly satisfy
n⁻¹ Σ_{i=1}^n Z_i {1 − π̂_{1,RCAL}(X_i)}/π̂_{1,RCAL}(X_i) · {D_i Y_i − m̂_{D1,RWL}(X_i) m̂_{Y11,RWL}(X_i)} = 0. (31)
A similar equation holds with Z_i replaced by 1−Z_i, π̂_{1,RCAL} by 1−π̂_{0,RCAL}, and (m̂_{D1,RWL}, m̂_{Y11,RWL}) by (m̂_{D0,RWL}, m̂_{Y10,RWL}). By (31), our augmented IPW estimator for E{D(1)Y(1)}, defined as Ê_{RCAL}{D(1)Y(1)} = Ẽ{φ_{D1Y1}(O; π̂_{1,RCAL}, m̂_{D1,RWL}, m̂_{Y11,RWL})}, can be simplified to
Ẽ[ (m̂_D m̂_Y)_{RWL} + {Z/π̂_{1,RCAL}(X)}{DY − (m̂_D m̂_Y)_{RWL}} ] = Ẽ{ ZDY + (1−Z)(m̂_D m̂_Y)_{RWL} },
where (m̂_D m̂_Y)_{RWL} = m̂_{D1,RWL}(X) m̂_{Y11,RWL}(X). As a consequence, Ê_{RCAL}{D(1)Y(1)} always falls within the range of the observed values {D_i Y_i : Z_i = 1, i = 1, ..., n} and the predicted values {m̂_{D1,RWL}(X_i) m̂_{Y11,RWL}(X_i) : Z_i = 0, i = 1, ..., n}.

We present a high-dimensional analysis of the proposed estimator θ̂_{1,RCAL} in Section 3.2, provided that the instrument propensity score model (6) is correctly specified but the treatment and outcome models (9)–(10) may be misspecified. Our main result shows that under suitable conditions, θ̂_{1,RCAL} is consistent for θ₁ and admits the asymptotic expansion
θ̂_{1,RCAL} = Ẽ{τ_{DY}(O; π̄_•, m̄_{D•}, m̄_{Y1•})} / Ẽ{τ_D(O; π̄_•, m̄_{D•})} + o_p(n^{−1/2}), (32)
where π̄_• = (π̄₀, π̄₁), m̄_{D•} = (m̄_{D0}, m̄_{D1}), m̄_{Y1•} = (m̄_{Y10}, m̄_{Y11}), and, for z ∈ {0, 1}, π̄_z(X) = π(X; γ̄_z), m̄_{Dz}(X) = m_{Dz}(X; ᾱ_z), and m̄_{Y1z}(X) = m_{Y1z}(X; ᾱ_{1z}), with (γ̄_z, ᾱ_z, ᾱ_{1z}) defined as follows.
The target value γ̄_z is defined as a minimizer of the expected loss
E{L_{z,CAL}(γ_z)} = E[ 1{Z=z} e^{(1−2z)γ_zᵀf(X)} + 1{Z=1−z}(2z−1)γ_zᵀf(X) ], z ∈ {0, 1}.
Because model (6) is correctly specified, the target values γ̄₀ and γ̄₁ are identical to the true value γ*, so that π̄₀(X) = π̄₁(X) = π*(X) (Tan 2020a). With possible misspecification of model (9), the target value ᾱ_z is defined as a minimizer of the expected loss
E{L_{z,WL}(α_z; γ̄_z)} = E( 1{Z=z} w_z(X; γ̄_z)[ −Dα_zᵀg(X) + Ψ_D{α_zᵀg(X)} ] ). (33)
If model (9) is correctly specified, then ᾱ_z coincides with the true value α_z* such that m̄_{Dz}(X) = m*_{Dz}(X). Otherwise, m̄_{Dz}(X) may differ from m*_{Dz}(X). Similarly, with possible misspecification of model (10), the target value ᾱ_{1z} is defined as a minimizer of the expected loss
E{L_{z,WL}(α_{1z}; γ̄_z, ᾱ_z)} = E( 1{Z=z} w_z(X; γ̄_z)[ −DYα_{1z}ᵀh(X) + m_{Dz}(X; ᾱ_z)Ψ_Y{α_{1z}ᵀh(X)} ] ). (34)
If model (10) is correctly specified, then ᾱ_{1z} coincides with the true value α_{1z}* such that m̄_{Y1z}(X) = m*_{Y1z}(X). Otherwise, m̄_{Y1z}(X) may in general differ from m*_{Y1z}(X). Suppose that the Lasso tuning parameters are specified as λ = A†_{z0}{log(ep)/n}^{1/2} for γ̂_{z,RCAL} in (20), λ = A†_{z1}{log(eq₁)/n}^{1/2} for α̂_{z,RWL} in (23), and λ = A†_{z2}{log(eq₂)/n}^{1/2} for α̂_{1z,RWL} in (25), where (A†_{z0}, A†_{z1}, A†_{z2}) are sufficiently large constants for z ∈ {0, 1}. For a vector b = (b₀, b₁, ..., b_k)ᵀ, denote S_b = {0} ∪ {j : b_j ≠ 0, j = 1, ..., k} and the size of the set S_b as |S_b|.

Proposition 1.
Suppose that Assumptions 1–3 hold as in Section 3.2, their corresponding versions with Z replaced by 1−Z and (γ̄₁, ᾱ₁, ᾱ₁₁) replaced by (−γ̄₀, ᾱ₀, ᾱ₁₀), and Σ_{z=0,1}{|S_{γ̄_z}| + |S_{ᾱ_z}| + |S_{ᾱ_{1z}}|} log{e(q₁∨q₂)} = o(n^{1/2}). If model (6) is correctly specified, then θ̂_{1,RCAL} satisfies the asymptotic expansion (32). Furthermore, the following results hold.
(i) n^{1/2}(θ̂_{1,RCAL} − θ₁) →_D N(0, V₁), where V₁ = var{τ_{DY}(O; π̄_•, m̄_{D•}, m̄_{Y1•}) − θ₁ τ_D(O; π̄_•, m̄_{D•})} / E²{τ_D(O; π̄_•, m̄_{D•})};
(ii) a consistent estimator of V₁ is V̂₁ = Ẽ[{τ_{DY}(O; π̂_•, m̂_{D•}, m̂_{Y1•}) − θ̂_{1,RCAL} τ_D(O; π̂_•, m̂_{D•})}²] / Ẽ²{τ_D(O; π̂_•, m̂_{D•})}, where (π̂_•, m̂_{D•}, m̂_{Y1•}) = (π̂_{•,RCAL}, m̂_{D•,RWL}, m̂_{Y1•,RWL});
(iii) an asymptotic (1−c) confidence interval for θ₁ is θ̂_{1,RCAL} ± z_{c/2}√(V̂₁/n), where z_{c/2} is the (1−c/2) quantile of N(0, 1).

Finally, we describe how our method can be applied to estimate θ₀ and LATE, denoted as θ = θ₁ − θ₀. In addition to models (6) and (9)–(10), consider the following outcome regression model in the untreated population for z ∈ {0, 1}:
E(Y | Z=z, D=0, X=x) = m_{Y0z}(x; α_{0z}) = ψ_Y{α_{0z}ᵀh(x)}, (35)
where h(x) = {1, h₁(x), ..., h_{q₂}(x)}ᵀ is the vector of known functions as in model (10) and α_{0z} is a vector of unknown parameters of dimension 1+q₂. For augmented IPW estimation of E{(1−D(0))Y(0)} and E{(1−D(1))Y(1)}, define
φ_{(1−D)Y,0}(O; π̂₀, m̂_{D0}, m̂_{Y00}) = {(1−Z)/(1−π̂₀(X))}(1−D)Y − [{(1−Z)/(1−π̂₀(X))} − 1]{1 − m̂_{D0}(X)} m̂_{Y00}(X),
φ_{(1−D)Y,1}(O; π̂₁, m̂_{D1}, m̂_{Y01}) = {Z/π̂₁(X)}(1−D)Y − [{Z/π̂₁(X)} − 1]{1 − m̂_{D1}(X)} m̂_{Y01}(X),
where m̂_{Y0z}(X) = m_{Y0z}(X; α̂_{0z}) is a fitted regression function.
Then a doubly robust estimator of θ₀, similar to that of θ₁ in (18), is
θ̂₀(π̂_•, m̂_{D•}, m̂_{Y0•}) = Ẽ{τ_{(1−D)Y}(O; π̂_•, m̂_{D•}, m̂_{Y0•})} / Ẽ{τ_D(O; π̂_•, m̂_{D•})}, (36)
where m̂_{Y0•} = (m̂_{Y00}, m̂_{Y01}), τ_D(O; π̂_•, m̂_{D•}) is as in (18), and τ_{(1−D)Y}(O; π̂_•, m̂_{D•}, m̂_{Y0•}) = φ_{(1−D)Y,0}(O; π̂₀, m̂_{D0}, m̂_{Y00}) − φ_{(1−D)Y,1}(O; π̂₁, m̂_{D1}, m̂_{Y01}). Our point estimator of θ₀ is θ̂_{0,RCAL} = θ̂₀(π̂_{•,RCAL}, m̂_{D•,RWL}, m̂_{Y0•,RWL}), and that of LATE, θ = θ₁ − θ₀, is θ̂_{RCAL} = θ̂_{1,RCAL} − θ̂_{0,RCAL}, where π̂_{z,RCAL}(X) and m̂_{Dz,RWL}(X) remain the same as before, and m̂_{Y0z,RWL}(X) = m_{Y0z}(X; α̂_{0z,RWL}) with α̂_{0z,RWL} defined as follows. For z ∈ {0, 1}, let α̂_{0z,RWL} be a minimizer of the Lasso-penalized objective function L_{z,RWL}(α_{0z}; γ̂_z, α̂_z) = L_{z,WL}(α_{0z}; γ̂_z, α̂_z) + λ‖(α_{0z})_{1:q₂}‖₁, with the weighted (quasi-)likelihood loss
L_{z,WL}(α_{0z}; γ̂_z, α̂_z) = Ẽ( 1{Z=z} w_{0z}(X; γ̂_z, α̂_z)[ −Ỹ_{0z} α_{0z}ᵀh(X) + Ψ_Y{α_{0z}ᵀh(X)} ] ),
where Ỹ_{0z} = Y(1−D)/{1 − m_{Dz}(X; α̂_z)} and w_{0z}(X; γ̂_z, α̂_z) = w_z(X; γ̂_z){1 − m_{Dz}(X; α̂_z)}. Under similar conditions as in Proposition 1, θ̂_{0,RCAL} admits an asymptotic expansion in the form of (32), and Wald confidence intervals for θ₀ and LATE can be derived accordingly. In particular, an asymptotic (1−c) confidence interval for LATE is θ̂_{RCAL} ± z_{c/2}√(V̂/n), where
V̂ = Ẽ[{τ_{DY}(O; π̂_•, m̂_{D•}, m̂_{Y1•}) − τ_{(1−D)Y}(O; π̂_•, m̂_{D•}, m̂_{Y0•}) − θ̂_{RCAL} τ_D(O; π̂_•, m̂_{D•})}²] / Ẽ²{τ_D(O; π̂_•, m̂_{D•})},
where (π̂_•, m̂_{D•}, m̂_{Y1•}, m̂_{Y0•}) = (π̂_{•,RCAL}, m̂_{D•,RWL}, m̂_{Y1•,RWL}, m̂_{Y0•,RWL}).

Remark 3 (On completely randomized instruments). Our method is directly applicable in the special case where the instrument is assumed to be completely randomized, independently of observed covariates.
The instrument propensity score model (6) with an intercept is valid, because P ( Z = 1 | X ) is a constant. With flexible treatment and outcome models (9)–(10), the proposed es-timator ˆ θ RCAL based on augmented IPW estimation is expected to achieve smaller variances than thesimple Wald estimator. Such efficiency gains are analogous to related results on regression adjust-ment in completely randomized experiments (with full compliance) in low- and high-dimensionalsettings (e.g., Davidian et al. 2005; Bloniarz et al. 2016; Wager et al. 2016).
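As a toy illustration of the augmented IPW construction behind the ratio estimators above, the sketch below forms the φ terms for z = 1 and z = 0 and takes their ratio; with the fitted regressions set to zero it collapses to the pure IPW form, the degenerate special case relevant to Remark 3's comparison with the simple Wald estimator. All names and inputs are illustrative, not the paper's code.

```python
# Sketch of a ratio-type AIPW estimator: tau_D and tau_DY are differences of
# z=1 and z=0 augmented IPW terms; the ratio estimates theta_1. Setting the
# fitted regressions to zero reduces the estimator to pure IPW.
import numpy as np

def phi(Z, pz, target, m):
    # Generic AIPW term: m + (Z / pz) * (target - m) for instrument arm Z
    return m + (Z / pz) * (target - m)

def theta1_hat(Z, D, Y, pi1, m1, m0, my1, my0):
    tau_d = phi(Z, pi1, D, m1) - phi(1 - Z, 1 - pi1, D, m0)
    tau_dy = phi(Z, pi1, D * Y, my1) - phi(1 - Z, 1 - pi1, D * Y, my0)
    return np.mean(tau_dy) / np.mean(tau_d)

Z = np.array([1, 1, 0, 0])
D = np.array([1, 1, 1, 0])
Y = np.array([2.0, 4.0, 1.0, 3.0])
pi1 = np.full(4, 0.5)           # constant fitted instrument propensity score
zero = np.zeros(4)              # zero regressions: pure-IPW special case
est = theta1_hat(Z, D, Y, pi1, zero, zero, zero, zero)
print(est)  # 5.0
```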
We develop the theoretical analysis leading to Proposition 1 for the proposed estimator θ̂_{1,RCAL} in high-dimensional settings, provided model (6) is correctly specified but models (9)–(10) may be misspecified. Similar analysis can be obtained for θ̂_{0,RCAL} and θ̂_{RCAL}. Before presenting formal results, we discuss heuristically how the asymptotic expansion (32) can be achieved through the use of the regularized calibrated estimators (γ̂_{z,RCAL}, α̂_{z,RWL}, α̂_{1z,RWL}) for z ∈ {0, 1}. For notational brevity, these estimators are denoted as (γ̂_z, α̂_z, α̂_{1z}) unless otherwise noted.

There are two main steps in our analysis. First, the estimators (γ̂_z, α̂_z, α̂_{1z}) can be shown to converge in probability to (γ̄_z, ᾱ_z, ᾱ_{1z}) with the L₁-norm error bounds
‖γ̂_z − γ̄_z‖₁ = O_p(1) |S_{γ̄_z}| {log(ep)/n}^{1/2}, (37)
‖α̂_z − ᾱ_z‖₁ = O_p(1) {|S_{γ̄_z}| + |S_{ᾱ_z}|} {log(eq₁)/n}^{1/2}, (38)
‖α̂_{1z} − ᾱ_{1z}‖₁ = O_p(1) {|S_{γ̄_z}| + |S_{ᾱ_z}| + |S_{ᾱ_{1z}}|} {log(e(q₁∨q₂))/n}^{1/2}, (39)
where (γ̄_z, ᾱ_z, ᾱ_{1z}) are the target values defined as minimizers of the corresponding expected loss functions in Section 3.1. For simplicity, we do not discuss prediction L₂-norm error bounds for (γ̂_z, α̂_z, α̂_{1z}), which are also involved in our rigorous analysis later.
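The expansions developed in this section are what justify the Wald intervals of Proposition 1. As a minimal numeric sketch of that final step, the code below computes a ratio estimator, its influence-function variance estimate, and the resulting Wald interval; the inputs stand for evaluated τ_DY and τ_D terms and are purely illustrative.

```python
# Hedged sketch of Proposition 1(ii)-(iii): ratio estimator theta_hat,
# variance estimate V_hat from the influence terms tau_DY - theta*tau_D,
# and the Wald interval theta_hat +/- z_{c/2} * sqrt(V_hat / n).
import numpy as np
from statistics import NormalDist

def wald_ci(tau_dy, tau_d, c=0.05):
    n = len(tau_d)
    theta = np.mean(tau_dy) / np.mean(tau_d)       # ratio estimator
    resid = tau_dy - theta * tau_d                 # influence terms
    V = np.mean(resid ** 2) / np.mean(tau_d) ** 2  # variance estimate
    z = NormalDist().inv_cdf(1 - c / 2)            # standard normal quantile
    half = z * np.sqrt(V / n)
    return theta, V, (theta - half, theta + half)

theta, V, ci = wald_ci(np.array([1.0, 2.0, 3.0, 4.0]),
                       np.array([1.0, 1.0, 1.0, 1.0]))
print(theta, V, ci)
```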
While these results build on existing high-dimensional, sparse analysis of Lasso-penalized M-estimators (Buhlmann & van de Geer 2011; Huang & Zhang 2012; Tan 2020a), additional arguments are needed to carefully handle the dependency of α̂_z on γ̂_z and that of α̂_{1z} on (γ̂_z, α̂_z), hence the presence of |S_{γ̄_z}| in the bound for α̂_z, and of |S_{γ̄_z}| and |S_{ᾱ_z}| in that for α̂_{1z}.

Second, for z ∈ {0, 1}, the augmented IPW estimators of E{D(z)} and E{D(z)Y(z)} involved in θ̂_{1,RCAL} can be shown to admit the following asymptotic expansions:
Ẽ{φ_{Dz}(O; π̂_z, m̂_{Dz})} = Ẽ{φ_{Dz}(O; π̄_z, m̄_{Dz})} + o_p(n^{−1/2}), (40)
Ẽ{φ_{DzYz}(O; π̂_z, m̂_{Dz}, m̂_{Y1z})} = Ẽ{φ_{DzYz}(O; π̄_z, m̄_{Dz}, m̄_{Y1z})} + o_p(n^{−1/2}), (41)
where (π̄_z, m̄_{Dz}, m̄_{Y1z}) are defined as in Section 3.1. From (40)–(41), the expansion (32) for θ̂_{1,RCAL} then follows by the delta method. To show (40) for z = 1, consider a Taylor expansion
Ẽ{φ_{D1}(O; π̂₁, m̂_{D1})} = Ẽ{φ_{D1}(O; π̄₁, m̄_{D1})} + (γ̂₁ − γ̄₁)ᵀδ_{γ̄₁} + (α̂₁ − ᾱ₁)ᵀδ_{ᾱ₁} + o_p(n^{−1/2}), (42)
where the remainder is taken to be o_p(n^{−1/2}) under suitable conditions, and
δ_{γ̄₁} = ∂/∂γ Ẽ{φ_{D1}(O; π, m_{D1})}|_{(γ,α)=(γ̄₁,ᾱ₁)} = −Ẽ{ Z(1−π̄₁)/π̄₁ (D − m̄_{D1}) f },
δ_{ᾱ₁} = ∂/∂α Ẽ{φ_{D1}(O; π, m_{D1})}|_{(γ,α)=(γ̄₁,ᾱ₁)} = Ẽ{ (1 − Z/π̄₁) m̄′_{D1} g }.
Here m̄′_{D1}(x) = ψ′_D{ᾱ₁ᵀg(x)} and ψ′_D denotes the derivative of ψ_D. A crucial point is that the expectations of the gradients δ_{γ̄₁} and δ_{ᾱ₁} reduce to 0,
−E{ Z(1−π̄₁)/π̄₁ (D − m̄_{D1}) f } = 0, (43)
E{ (1 − Z/π̄₁) m̄′_{D1} g } = 0, (44)
provided that model (6) is correctly specified but model (9) may be misspecified.
In fact, under correctly specified model (6), π̄₁ coincides with π* (Tan 2020a) and hence condition (44) holds. Moreover, condition (43) follows from the gradient equation for ᾱ₁ as a minimizer of the expected loss (33) for z = 1: because f is a subvector of g, the gradient of (33) at ᾱ₁,
∂/∂α E( Zw₁(X; γ̄₁)[ −Dαᵀg(X) + Ψ_D{αᵀg(X)} ] )|_{α=ᾱ₁} = −E[ Z(1−π̄₁)/π̄₁ {D − ψ_D(ᾱ₁ᵀg)} g ] = 0,
matches the expectation of δ_{γ̄₁} in (43) with f replaced by g. From conditions (43)–(44), the sum of the two terms (γ̂₁ − γ̄₁)ᵀδ_{γ̄₁} and (α̂₁ − ᾱ₁)ᵀδ_{ᾱ₁} can be shown to be of order {|S_{γ̄₁}| + |S_{ᾱ₁}|} log(eq₁)/n, which becomes o_p(n^{−1/2}), and hence (42) leads to (40) for z = 1, provided that {|S_{γ̄₁}| + |S_{ᾱ₁}|} log(eq₁) = o(n^{1/2}).

Similarly, to show (41) for z = 1, consider a Taylor expansion
Ẽ{φ_{D1Y1}(O; π̂₁, m̂_{D1}, m̂_{Y11})} = Ẽ{φ_{D1Y1}(O; π̄₁, m̄_{D1}, m̄_{Y11})} + (γ̂₁ − γ̄₁)ᵀΔ_{γ̄₁} + (α̂₁ − ᾱ₁)ᵀΔ_{ᾱ₁} + (α̂₁₁ − ᾱ₁₁)ᵀΔ_{ᾱ₁₁} + o_p(n^{−1/2}), (45)
where the remainder is taken to be o_p(n^{−1/2}) under suitable conditions, and
Δ_{γ̄₁} = −Ẽ{ Z(1−π̄₁)/π̄₁ (DY − m̄_{D1}m̄_{Y11}) f }, Δ_{ᾱ₁} = Ẽ{ (1 − Z/π̄₁) m̄′_{D1} m̄_{Y11} g }, Δ_{ᾱ₁₁} = Ẽ{ (1 − Z/π̄₁) m̄_{D1} m̄′_{Y11} h }.
Here m̄′_{Y11}(x) = ψ′_Y{ᾱ₁₁ᵀh(x)} and ψ′_Y denotes the derivative of ψ_Y. Under correctly specified model (6), π̄₁ coincides with π* (Tan 2020a) and hence the expectations of the gradients Δ_{ᾱ₁} and Δ_{ᾱ₁₁} reduce to 0,
E{ (1 − Z/π̄₁) m̄′_{D1} m̄_{Y11} g } = 0, E{ (1 − Z/π̄₁) m̄_{D1} m̄′_{Y11} h } = 0.
(46) Moreover, whether model (10) is correctly specified or not, the expectation of the gradient Δ_{γ̄₁} reduces to 0,
E{ Z(1−π̄₁)/π̄₁ (DY − m̄_{D1}m̄_{Y11}) f } = 0, (47)
because f is a subvector of h, and ᾱ₁₁ is defined as a minimizer of the expected loss (34) for z = 1 such that the following gradient is 0 at ᾱ₁₁:
∂/∂α E( Zw₁(X; γ̄₁)[ −DYαᵀh(X) + m_{D1}(X; ᾱ₁)Ψ_Y{αᵀh(X)} ] )|_{α=ᾱ₁₁} = −E[ Z(1−π̄₁)/π̄₁ {DY − m̄_{D1}ψ_Y(ᾱ₁₁ᵀh)} h ].
From these mean-zero conditions on the gradients, the sum of the three terms (γ̂₁ − γ̄₁)ᵀΔ_{γ̄₁}, (α̂₁ − ᾱ₁)ᵀΔ_{ᾱ₁}, and (α̂₁₁ − ᾱ₁₁)ᵀΔ_{ᾱ₁₁} can be shown to be of order {|S_{γ̄₁}| + |S_{ᾱ₁}| + |S_{ᾱ₁₁}|} log{e(q₁∨q₂)}/n, which becomes o_p(n^{−1/2}), and hence (45) leads to (41) for z = 1 provided that {|S_{γ̄₁}| + |S_{ᾱ₁}| + |S_{ᾱ₁₁}|} log{e(q₁∨q₂)} = o(n^{1/2}), where q₁∨q₂ = max(q₁, q₂).

Remark 4 (On likelihood vs calibrated estimation). We compare calibrated estimation with the usual likelihood-based estimation. From the preceding discussion, the mean-zero conditions (43)–(44) and (46)–(47) are crucial for the desired expansions (40)–(41) to hold. For example, if the expectations of δ_{γ̄₁} and δ_{ᾱ₁} were nonzero, then the two terms (γ̂₁ − γ̄₁)ᵀδ_{γ̄₁} and (α̂₁ − ᾱ₁)ᵀδ_{ᾱ₁} in (42) would be of order {log(p)/n}^{1/2} and {log(q₁)/n}^{1/2} in high-dimensional settings, as seen from (37)–(38). These mean-zero conditions can be satisfied in different manners. If models (6) and (9)–(10) are correctly specified, then (43)–(44) and (46)–(47) directly hold, with the target values (γ̄₁, ᾱ₁, ᾱ₁₁) identical to the true values (γ*, α₁*, α₁₁*). This reasoning is applicable with (γ̂_z, α̂_z, α̂_{1z}) replaced by the regularized likelihood estimators (γ̂_{RML}, α̂_{z,RML}, α̂_{1z,RML}), and would lead to the asymptotic expansion (19) for θ̂_{1,RML}, as studied in Chernozhukov et al. (2018).
In contrast, for our method, while conditions (44) and (46) are satisfied by relying on model (6) being correctly specified, conditions (43) and (47) are achieved under possible misspecification of models (9)–(10), by carefully choosing ("calibrating") the loss functions for the estimators α̂_{z,RWL} and α̂_{1z,RWL}. Effectively, the loss functions (24) and (26) for α̂_{z,RWL} and α̂_{1z,RWL} are derived by integrating, with respect to α_z and α_{1z} respectively, the gradients of Ẽ{φ_{Dz}(O; π(·;γ), m_{Dz}(·;α_z))} and Ẽ{φ_{DzYz}(O; π(·;γ), m_{Dz}(·;α_z), m_{Y1z}(·;α_{1z}))} in γ, in the case of f = g = h. See Tan (2020b, Section 3.2) for a related discussion.

Remark 5 (On low- vs high-dimensional estimation). Our preceding discussion is mainly concerned with high-dimensional settings where the numbers of regressors p, q₁, and q₂ are close to or larger than the sample size n. The asymptotic expansions (40)–(41) from calibrated estimation are desirable in facilitating the construction of confidence intervals, because the first-order terms such as (γ̂₁ − γ̄₁)ᵀδ_{γ̄₁} and (α̂₁ − ᾱ₁)ᵀδ_{ᾱ₁} are made negligible up to order n^{−1/2}. Otherwise, these first-order terms would be at least of order {log(p)/n}^{1/2} and difficult to quantify. For completeness, it should also be noted that for previous methods studied in low-dimensional settings (Tan 2006a; Ogburn et al. 2015), valid confidence intervals can be obtained from the more general asymptotic expansions (42) and (45) along with the usual influence functions for likelihood-based or similar estimators (γ̂₁, α̂₁, α̂₁₁), where the first-order terms are of order n^{−1/2}.

Remark 6 (On choices of model-assisted inference). Our method allows model-assisted inference, relying on correct specification of the instrument propensity score model (6) but not the treatment and outcome regression models (9)–(10).
Similar ideas can be employed to develop model-assisted inference relying on correct specification of models (9)–(10) but not model (6), or doubly robust inference relying on correct specification of either model (6) or models (9)–(10). In each case, additional modifications would be needed, with increasing analytical and computational complexity. For example, model (6) needs to be properly expanded such that regularized estimation of γ could be designed to satisfy all three mean-zero conditions in (43) and (46). In contrast, by handling model-assisted inference based on model (6), our method is developed in a practically convenient manner, involving sequential estimation in the three models (6), (9), and (10).

Remark 7 (On asymmetry between propensity score and outcome regression). Our choice of inference based on the instrument propensity score model (6) is also related to a fundamental asymmetry between propensity score and outcome regression approaches (Tan 2007). As reflected by the estimator (11), the fitted treatment and outcome regression functions m̂_{Dz}(X) and m̂_{Y1z}(X) are desired to be valid for all observed covariates X, but such functions for those values of X with P(Z=z|X) close to 0 are effectively determined by extrapolation according to models (9)–(10), which can be built and checked from only the truncated data {(D_i, Y_i, X_i) : Z_i = z, i = 1, ..., n}. In contrast, model (6) can be readily learned from {(Z_i, X_i) : i = 1, ..., n} without data truncation.

Remark 8 (On calibrated estimation for propensity scores). By a careful examination of the outline above, our theoretical analysis also remains valid with γ̂_z replaced by the regularized likelihood estimator γ̂_{RML}. In particular, the mean-zero conditions (44) and (47) would still be satisfied under correctly specified model (6). Nevertheless, we prefer the regularized calibrated estimator γ̂_{z,RCAL} for two additional reasons.
One is the informative form of the KKT condition (29). The other is the advantage of calibrated estimation in controlling the relative errors of fitted propensity scores for inverse probability weighting, regardless of outcome regression, as studied in Tan (2020a).

In the remainder of this section, we present formal results underlying Proposition 1. While convergence of the regularized calibrated estimators (γ̂_z, α̂_z) and the AIPW estimator for E{D(z)} can be obtained directly from Tan (2020b), our analysis needs to carefully tackle convergence of α̂_{1z} and the AIPW estimator for E{D(z)Y(z)}. The situation is more complicated than in Tan (2020b), as well as the earlier literature on Lasso-penalized M-estimation (Buhlmann & van de Geer 2011; Huang & Zhang 2012), mainly because the loss function for α̂_{1z}, i.e., L_{z,WL}(α_{1z}; γ̂_z, α̂_z) from (26), involves not only the data-dependent weight w_z(X; γ̂_z) but also the data-dependent mean function m_{Dz}(X; α̂_z). We extend a technical strategy in Tan (2020b) to control such dependency and establish the desired convergence of α̂_{1z} under conditions similar to those required in unweighted Lasso-penalized M-estimation in high-dimensional settings. The error bounds obtained, however, depend on the sparsity sizes of the target values γ̄_z and ᾱ_z, in addition to that of ᾱ_{1z}.

First, we summarize the results which can be deduced directly from Tan (2020b) about (γ̂₁, α̂₁) and the AIPW estimator Ẽ{φ_{D1}(O; π̂₁, m̂_{D1})} for E{D(1)}. Suppose that the Lasso tuning parameters are specified as λ = A₀λ₀ for γ̂₁ in (20) and λ = A₁λ₁ for α̂₁ in (23), where λ₀ = {log((1+p)/ε)/n}^{1/2} and λ₁ = {log((1+q₁)/ε)/n}^{1/2} (≥ λ₀). For a matrix Σ with row indices {0, 1, ..., k}, a compatibility condition (Buhlmann & van de Geer 2011) is said to hold with a subset S ⊆ {0, 1, ..., k} and constants ν > 0 and µ > 1 if ν²(Σ_{j∈S}|b_j|)² ≤ |S|(bᵀΣb) for any vector b = (b₀, b₁, ..., b_k)ᵀ satisfying Σ_{j∉S}|b_j| ≤ µΣ_{j∈S}|b_j|.

Assumption 1.
Suppose that the following conditions are satisfied.
(i) max_{j=0,1,...,p} |f_j(X)| ≤ C_f almost surely for a constant C_f ≥ 1.
(ii) γ̄₁ᵀf(X) is bounded from below by a constant almost surely.
(iii) A compatibility condition holds for Σ_f with the subset S_{γ̄₁} = {0} ∪ {j : (γ̄₁)_j ≠ 0, j = 1, ..., p} and some constants ν₀ > 0 and µ₀ > 1, where Σ_f = E{Zw₁(X; γ̄₁)f(X)fᵀ(X)}.
(iv) |S_{γ̄₁}|λ₀ is sufficiently small.

Assumption 2.
Suppose that the following conditions are satisfied.
(i) max_{j=0,1,...,q₁} |g_j(X)| ≤ C_g almost surely for a constant C_g ≥ 1.
(ii) ᾱ₁ᵀg(X) is bounded in absolute value by a constant almost surely.
(iii) ψ′_D(u) ≤ ψ′_D(ũ)e^{C|u−ũ|} for any (u, ũ), where C > 0 is a constant.
(iv) A compatibility condition holds for Σ_g with the subset S_{ᾱ₁} = {0} ∪ {j : (ᾱ₁)_j ≠ 0, j = 1, ..., q₁} and some constants ν₁ > 0 and µ₁ > 1, where Σ_g = E{Zw₁(X; γ̄₁)g(X)gᵀ(X)}.
(v) |S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁ is sufficiently small.

Theorem 1 (Tan 2020b). Suppose that Assumptions 1–2 hold and λ₁ ≤ 1. For sufficiently large constants A₀ and A₁, we have with probability at least 1 − c₀ε,
‖γ̂₁ − γ̄₁‖₁ ≤ M₀|S_{γ̄₁}|λ₀, (γ̂₁ − γ̄₁)ᵀΣ̃_f(γ̂₁ − γ̄₁) ≤ M₀|S_{γ̄₁}|λ₀², (48)
‖α̂₁ − ᾱ₁‖₁ ≤ M₁(|S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁), (α̂₁ − ᾱ₁)ᵀΣ̃_g(α̂₁ − ᾱ₁) ≤ M₁(|S_{γ̄₁}|λ₀² + |S_{ᾱ₁}|λ₁²), (49)
where c₀, M₀, and M₁ are positive constants, and Σ̃_f and Σ̃_g are the sample versions of Σ_f and Σ_g, i.e., Σ̃_f = Ẽ{Zw₁(X; γ̄₁)f(X)fᵀ(X)} and Σ̃_g = Ẽ{Zw₁(X; γ̄₁)g(X)gᵀ(X)}. Moreover, if model (6) is correctly specified, then we also have with probability at least 1 − c₀ε,
| Ẽ{φ_{D1}(O; π̂₁, m̂_{D1})} − Ẽ{φ_{D1}(O; π̄₁, m̄_{D1})} | ≤ M₂(|S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁)λ₁, (50)
| Ẽ[ {φ_{D1}(O; π̂₁, m̂_{D1}) − φ_{D1}(O; π̄₁, m̄_{D1})}² ] | ≤ M₃(|S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁)², (51)
where M₂ and M₃ are positive constants.

Inequalities (48)–(49) lead directly to the desired convergence (37)–(38) for (γ̂₁, α̂₁). Moreover, inequality (50) yields the asymptotic expansion (40) for the AIPW estimator Ê{D(1)} = Ẽ{φ_{D1}(O; π̂₁, m̂_{D1})}, provided {|S_{γ̄₁}| + |S_{ᾱ₁}|} log(eq₁) = o(n^{1/2}).
Inequality (51) can be used to show that the sample variance Ẽ([φ_{D1}(O; π̂₁, m̂_{D1}) − Ê{D(1)}]²) is a consistent estimator of var{φ_{D1}(O; π̄₁, m̄_{D1})}, provided {|S_{γ̄₁}| + |S_{ᾱ₁}|} log^{1/2}(eq₁) = o(n^{1/2}), which is satisfied under {|S_{γ̄₁}| + |S_{ᾱ₁}|} log(eq₁) = o(n^{1/2}). While consistent variance estimation is sufficient for justifying Wald confidence intervals for E{D(1)} by the Slutsky theorem, the convergence rate in (51) can be improved under additional conditions; see Tan (2020b, Theorem 4).

Next, we discuss the theoretical analysis of α̂₁₁, with the Lasso tuning parameter λ = A₂λ₂ in (25), where λ₂ = {log((1+q₂)/ε)/n}^{1/2} (≥ λ₀). As the loss L_{1,WL}(α₁₁; γ̂₁, α̂₁) is convex in α₁₁, the corresponding Bregman divergence is defined as
D_{z,WL}(α′_{1z}, α_{1z}; γ̂_z, α̂_z) = L_{z,WL}(α′_{1z}; γ̂_z, α̂_z) − L_{z,WL}(α_{1z}; γ̂_z, α̂_z) − (α′_{1z} − α_{1z})ᵀ (∂L_{z,WL}/∂α_{1z})(α_{1z}; γ̂_z, α̂_z).
The symmetrized Bregman divergence is easily shown to be
D†_{z,WL}(α′_{1z}, α_{1z}; γ̂_z, α̂_z) = D_{z,WL}(α′_{1z}, α_{1z}; γ̂_z, α̂_z) + D_{z,WL}(α_{1z}, α′_{1z}; γ̂_z, α̂_z)
= (α′_{1z} − α_{1z})ᵀ Ẽ[ 1{Z=z} w_z(X; γ̂_z) m_{Dz}(X; α̂_z) {ψ_Y(α′_{1z}ᵀh) − ψ_Y(α_{1z}ᵀh)} h ]. (52)
After the statement of the required assumptions, Theorem 2 establishes the convergence of α̂₁₁ to the target value ᾱ₁₁ in both the L₁ norm ‖α̂₁₁ − ᾱ₁₁‖₁ and the symmetrized Bregman divergence.

Assumption 3.
Suppose that the following conditions are satisfied.
(i) max_{j=0,1,...,q₂} |h_j(X)| ≤ C_h almost surely for a constant C_h ≥ 1.
(ii) DY − m_{D1}(X; ᾱ₁)ψ_Y(ᾱ₁₁ᵀh) is uniformly sub-gaussian given X with parameters (σ₀, σ₁).
(iii) ᾱ₁₁ᵀh(X) is bounded in absolute value by a constant almost surely.
(iv) ψ′_Y(u) ≤ ψ′_Y(ũ)e^{C|u−ũ|} for any (u, ũ), where C > 0 is a constant.
(v) A compatibility condition holds for Σ_h with the subset S_{ᾱ₁₁} = {0} ∪ {j : (ᾱ₁₁)_j ≠ 0, j = 1, ..., q₂} and some constants ν₂ > 0 and µ₂ > 1, where Σ_h = E{Zw₁(X; γ̄₁)h(X)hᵀ(X)}.
(vi) |S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁ + |S_{ᾱ₁₁}|λ₂ is sufficiently small: |S_{γ̄₁}|λ₀, |S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁, and |S_{ᾱ₁₁}|λ₂ are no greater than certain constants (̺₀, ̺₁, ̺₂), determined from the constants in Assumptions 1–3 and from Lemmas 1–2, 9, and 11 in the Supplement, such that the derived quantities used in the proof of Theorem 2 remain strictly below 1.

Theorem 2.
In the setting of Theorem 1, suppose that Assumption 3 also holds. Then for λ = A₂λ₂ and A₂ > B₂(µ₂+1)/(µ₂−1), we have with probability at least 1 − (c₀+8)ε,
D†_{1,WL}(α̂₁₁, ᾱ₁₁; γ̄₁, ᾱ₁) + (A₂ − B₂)λ₂‖α̂₁₁ − ᾱ₁₁‖₁ ≤ C_g⁻¹C_h⁻¹{ µ′(M₄+M₅)|S_{γ̄₁}|λ₀² + 2µ′M₆|S_{ᾱ₁}|λ₁² + µ″ν₂⁻²|S_{ᾱ₁₁}|λ₂² }, (53)
where the constants (µ′, µ″, ν₂, B₂, M₄, M₅, M₆, C_g, C_h) are defined through Assumption 3 and the Supplement. From the proof, an upper bound can also be obtained, with probability at least 1 − (8+c₀)ε, on the weighted prediction L₂ norm (in the scale of linear predictors),
Ẽ{Zw₁(X; γ̄₁)(α̂₁₁ᵀh − ᾱ₁₁ᵀh)²} = (α̂₁₁ − ᾱ₁₁)ᵀΣ̃_h(α̂₁₁ − ᾱ₁₁) ≤ C_g⁻¹C_h⁻¹(1 − ̺)⁻¹ D†_{1,WL}(α̂₁₁, ᾱ₁₁; γ̄₁, ᾱ₁), (54)
where ̺ < 1 is from Assumption 3(vi) and Σ̃_h is the sample version of Σ_h, i.e., Σ̃_h = Ẽ{Zw₁(X; γ̄₁)h(X)hᵀ(X)}. For notational simplicity in the subsequent discussion, let M₇ be a constant such that the right-hand side of (53), after rescaling by the constants in (53)–(54), is upper bounded by M₇(|S_{γ̄₁}|λ₀² + |S_{ᾱ₁}|λ₁² + |S_{ᾱ₁₁}|λ₂²). Then we have with probability at least 1 − (8+c₀)ε,
‖α̂₁₁ − ᾱ₁₁‖₁ ≤ M₇(|S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁ + |S_{ᾱ₁₁}|λ₂), (α̂₁₁ − ᾱ₁₁)ᵀΣ̃_h(α̂₁₁ − ᾱ₁₁) ≤ M₇(|S_{γ̄₁}|λ₀² + |S_{ᾱ₁}|λ₁² + |S_{ᾱ₁₁}|λ₂²).
Given this convergence of (γ̂₁, α̂₁, α̂₁₁), Theorem 3 provides an error bound for the AIPW estimator for E{D(1)Y(1)}.

Theorem 3.
In the setting of Theorem 2, if model (6) is correctly specified, then we have with probability at least 1 − (14+c₀)ε,
| Ẽ{φ_{D1Y1}(O; π̂₁, m̂_{D1}, m̂_{Y11})} − Ẽ{φ_{D1Y1}(O; π̄₁, m̄_{D1}, m̄_{Y11})} | ≤ M₈|S_{γ̄₁}|λ₀(λ₁∨λ₂) + M₉|S_{ᾱ₁}|λ₁(λ₁∨λ₂) + M₁₀|S_{ᾱ₁₁}|λ₂², (55)
where (M₈, M₉, M₁₀) are positive constants from Lemma 14. In addition, we have with probability at least 1 − (8+c₀)ε,
| Ẽ[ {φ_{D1Y1}(O; π̂₁, m̂_{D1}, m̂_{Y11}) − φ_{D1Y1}(O; π̄₁, m̄_{D1}, m̄_{Y11})}² ] | ≤ M₁₁(|S_{γ̄₁}|λ₀² + |S_{ᾱ₁}|λ₁² + |S_{ᾱ₁₁}|λ₂²) + M₁₂(|S_{γ̄₁}|λ₀ + |S_{ᾱ₁}|λ₁ + |S_{ᾱ₁₁}|λ₂)², (56)
where (M₁₁, M₁₂) are positive constants from Lemma 15.

Inequality (55) yields the asymptotic expansion (41) for the AIPW estimator Ê{D(1)Y(1)} = Ẽ{φ_{D1Y1}(O; π̂₁, m̂_{D1}, m̂_{Y11})}, provided {|S_{γ̄₁}| + |S_{ᾱ₁}| + |S_{ᾱ₁₁}|} log{e(q₁∨q₂)} = o(n^{1/2}). Inequality (56) can be used to show that the sample variance Ẽ([φ_{D1Y1}(O; π̂₁, m̂_{D1}, m̂_{Y11}) − Ê{D(1)Y(1)}]²) is a consistent estimator of var{φ_{D1Y1}(O; π̄₁, m̄_{D1}, m̄_{Y11})}, provided {|S_{γ̄₁}| + |S_{ᾱ₁}| + |S_{ᾱ₁₁}|} log^{1/2}{e(q₁∨q₂)} = o(n^{1/2}), which is satisfied under {|S_{γ̄₁}| + |S_{ᾱ₁}| + |S_{ᾱ₁₁}|} log{e(q₁∨q₂)} = o(n^{1/2}).

Our theoretical analysis above deals with convergence of (γ̂₁, α̂₁, α̂₁₁) and the asymptotic expansions of the AIPW estimators for E{D(1)} and E{D(1)Y(1)}, i.e., (40) and (41) with z = 1. Similar results can be obtained as in Theorems 1–3 with Z replaced by 1−Z and (γ̂₁, α̂₁, α̂₁₁) replaced by (−γ̂₀, α̂₀, α̂₁₀), provided that Assumptions 1–3 are modified accordingly. Combining the results for z = 1 and z = 0 leads to Proposition 1 by standard arguments.
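As noted in Section 3.1, losses of the form (26)–(27) can be minimized with off-the-shelf software for weighted Lasso estimation. The sketch below illustrates the pseudo-response/weight reformulation for the identity-link (linear outcome) case, using scikit-learn's weighted `Lasso` rather than the R packages referenced in the paper; all fitted quantities and tuning values are illustrative placeholders, not the paper's estimators.

```python
# Minimal identity-link sketch of the weighted-Lasso reformulation (27):
# transform to pseudo-response Y*D/m_hat and weight w_cal*m_hat on {Z=1},
# then call an off-the-shelf weighted Lasso. Names/values are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, q = 500, 10
X = rng.normal(size=(n, q))          # regressors h(X)
Z = rng.binomial(1, 0.5, size=n)     # instrument
D = rng.binomial(1, 0.6, size=n)     # treatment
Y = X[:, 0] + rng.normal(size=n)     # outcome

pi_hat = np.full(n, Z.mean())        # stand-in fitted instrument propensity
m_hat = np.full(n, 0.6)              # stand-in fitted treatment regression

sub = Z == 1                                   # the loss averages over {Z=1}
w_cal = (1 - pi_hat[sub]) / pi_hat[sub]        # calibration weight w_1(X)
w = w_cal * m_hat[sub]                         # combined weight in (27)
y_tilde = (Y[sub] * D[sub]) / m_hat[sub]       # pseudo-response

lasso = Lasso(alpha=0.05)
lasso.fit(X[sub], y_tilde, sample_weight=w)    # weighted Lasso fit
print(lasso.coef_.shape)
```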
We present simulation studies to compare the pointwise properties of θ̂_{1,RML}, based on regularized likelihood estimation without or with post-Lasso refitting, and θ̂_{1,RCAL}, based on regularized calibrated estimation, together with the coverage properties of the associated confidence intervals. In addition, motivated by Remark 3, we compare these methods with the IPW (i.e., Wald) estimator θ̂_{1,IPW}(π̂), with π̂ = Ẽ(Z), in the setting where the instrument is assumed to be completely randomized.

4.1 Implementation details

Both the regularized likelihood and calibrated methods are implemented using the R package
RCAL (Tan 2020a). The penalized versions of the loss functions (8), (12) and (13) for computing γ̂_{RML}, α̂_{z,RML} and α̂_{1z,RML}, or the penalized loss functions (20), (23) and (25) for computing γ̂_{z,RCAL}, α̂_{z,RWL} and α̂_{1z,RWL}, are minimized for fixed tuning parameters λ using algorithms similar to those in Friedman et al. (2010), but with the coordinate descent method replaced by an active-set method as in Osborne et al. (2000) for solving each Lasso-penalized least-squares problem. All variables in f(X), g(X) and h(X) are standardized to have sample means 0 and variances 1.

We determine the value of the Lasso tuning parameter λ using 5-fold cross-validation based on the corresponding loss function. Let (I_k)_{k=1}^5 be a 5-fold random partition of the observation indices {1, 2, ..., n}. For a loss function L(γ), either the average negative log-likelihood L_{ML}(γ) in (8) or the calibration loss L_{z,CAL}(γ) in (21)–(22) for z = 0, 1, denote by L(γ; I) the loss function obtained when the sample average Ẽ(·) is computed over only the subsample indexed by I. The 5-fold cross-validation criterion is defined as CV₅(λ) = (1/5)Σ_{k=1}^5 L(γ̂^{(k)}_λ; I_k), where γ̂^{(k)}_λ is a minimizer of the penalized loss L(γ; I_k^c) + λ‖γ_{1:p}‖₁ over the subsample of size 4n/5 indexed by I_k^c, the complement of I_k. Then λ is selected by minimizing CV₅(λ) over a discrete set of candidate values decreasing geometrically from λ*, where for π̂ = Ẽ(Z), the value λ* is computed as either λ* = max_{j=1,...,p} |Ẽ{(Z − π̂)f_j(X)}| when the likelihood loss (8) is used, or λ* = max_{j=1,...,p} |Ẽ{[(1−Z)/(1−π̂) − 1]f_j(X)}| or max_{j=1,...,p} |Ẽ{(Z/π̂ − 1)f_j(X)}| when the calibration loss (21) or (22) is used, respectively.
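The cross-validation scheme above can be sketched as follows for the calibration loss (22) (the z = 1 case). The proximal-gradient fit below is a simple stand-in for the active-set solver in the RCAL package, and the λ grid, step size, and data are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of 5-fold cross-validation for the Lasso tuning parameter,
# using the z=1 calibration loss Z*exp(-gamma'f) + (1-Z)*gamma'f and a
# proximal-gradient solver with soft-thresholding (intercept unpenalized).
import numpy as np

def cal_loss(gamma, Z, F):
    lin = F @ gamma
    return np.mean(Z * np.exp(-lin) + (1 - Z) * lin)

def fit_lasso_cal(Z, F, lam, steps=2000, lr=0.05):
    gamma = np.zeros(F.shape[1])
    for _ in range(steps):                       # proximal gradient descent
        lin = F @ gamma
        grad = F.T @ (-Z * np.exp(-lin) + (1 - Z)) / len(Z)
        gamma -= lr * grad
        gamma[1:] = np.sign(gamma[1:]) * np.maximum(np.abs(gamma[1:]) - lr * lam, 0)
    return gamma

rng = np.random.default_rng(2)
n, p = 300, 5
F = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + f(X)
Z = rng.binomial(1, 0.5, size=n)

folds = np.arange(n) % 5
grid = [0.4, 0.2, 0.1, 0.05]                    # illustrative lambda grid
cv = []
for lam in grid:
    scores = [cal_loss(fit_lasso_cal(Z[folds != k], F[folds != k], lam),
                       Z[folds == k], F[folds == k]) for k in range(5)]
    cv.append(np.mean(scores))
lam_best = grid[int(np.argmin(cv))]
print(lam_best)
```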
It can be shown that, in each case, the penalized loss L(γ) + λ‖γ_{1:p}‖₁ over the original sample of size n has a minimum at γ_{1:p} = 0 for all λ ≥ λ*. The computation of (α̂_{z,RML}, α̂_{1z,RML}) or (α̂_{z,RWL}, α̂_{1z,RWL}) proceeds similarly as above. In the latter case, cross-validation based on L_{z,WL}(α_z; γ̂_z) is performed with γ̂_z held at the fixed value γ̂_{z,RCAL} obtained in the prior step, and cross-validation based on L_{z,WL}(α_{1z}; γ̂_z, α̂_z) is performed with (γ̂_z, α̂_z) held at the fixed values (γ̂_{z,RCAL}, α̂_{z,RWL}) from the prior steps.

Let X = (X₁, ..., X_p) consist of independent variables where each X_j is N(0, 1) truncated to the interval (−2.5, 2.5) and then standardized to have mean 0 and variance 1. Consider the transformed variables W₁ = exp(0.5X₁), W₂ = 10 + {1 + exp(X₁)}⁻¹X₂, W₃ = (0.04X₁X₃ + 0.6)³ and W₄ = (X₂ + X₄ + 20)². Let X† = (X†₁, ..., X†_p), where X†_j = {W_j − E(W_j)}/√Var(W_j) for j = 1, ..., 4, and X†_j = X_j for 5 ≤ j ≤ p. This setup follows that in the preprint of Tan (2020b) and ensures a strict one-to-one mapping between X and X†. Figure S1 in the Supplement shows the scatter plots, from a simulated data sample, of the variables (X†₁, X†₂, X†₃, X†₄), which are correlated with each other as would be found in real data. Consider the following data-generating configurations:

(C1) Generate Z given X from a Bernoulli distribution with P(Z = 1 | X) = {1 + exp(−X†₁ + 0.5X†₂ − 0.25X†₃ − 0.1X†₄)}⁻¹. Then, independently, generate U from a standard Logistic distribution, D = 1(1 − .Z + 0.X†₁ + X†₂ + 0.X†₃ − .X†₄ ≥ U) and Y(1) from a Normal distribution with variance 1 and mean E{Y(1) | Z, X, U} = 0.X†₁ + X†₂ + X†₃ + X†₄ + 2U.

(C2) Generate (Z, U) as in (C1), but generate D = 1(1 − .Z + 0.X₁ + X₂ + 0.X₃ − .X₄ ≥ U) and Y(1) from a Normal distribution with variance 1 and mean E{Y(1) | Z, X, U} = 0.
X + X + X + X + 2 U .(C3) Generate Z given X from a Bernoulli distribution with P ( Z = 1 | X ) = { − X +0 . X − . X − . X ) } − , then generate ( U, D, Y (1)) as in (C1).Set Y = Y (1) if D = 1 . The observed data consist of independent and identically distributedcopies { ( Y i D i , D i , Z i , X i ) : i = 1 , ..., n } . Consider the following model specifications:(M1) Logistic instrument propensity score model (6), logistic D -outcome model (9) and linear Y -outcome model (10) with f j ( X ) = g j ( X ) = h j ( X ) = X † j for j = 1 , ..., p .(M2) Logistic instrument propensity score model (6) and logistic D -outcome model (9) with f j ( X ) = g j ( X ) = X † j for j = 1 , ..., p , and linear Y -outcome model (10) with h j ( X ) = X † j , j = 1 , ..., p , h p +1 ( X ) = ˆ m D ( X ) , h p +1+ k ( X ) = { ˆ m D ( X ) − ξ k } + , h p +4 ( X ) = ˆ m D ( X ) ,and h p +4+ k ( X ) = { ˆ m D ( X ) − ξ k } + , ≤ k ≤ , where ξ zk is the k th quartile of the fittedvalues ˆ m Dz ( X ) and c + = max(0 , c ) .For (M2), linear spline bases ( ˆ m D , ˆ m D ) are included as additional functions in h ( X ) . As dis-cussed in Remark 2, the dependence of m ∗ Ydz ( X ) on m ∗ Dz ( X ) is in general unknown. A simplestrategy is then to incorporate splines bases in ( ˆ m D , ˆ m D ) to enlarge Y -outcome model.27able 1: Summary of results for estimation of θ with a conditionally randomized instrument. (C1) cor IPS, more cor OR (C2) cor IPS, less cor OR (C3) mis IPS, more cor ORRCAL RML RML2 RCAL RML RML2 RCAL RML RML2(M1) n = 800 , p = 400 Bias − . − . − . − . − . − . . − . . √ Var .433 .435 .896 .518 .521 1.231 .429 .434 .804 √ EVar .418 .400 2.533 .510 .486 11.410 .418 .422 .957Cov90 .854 .811 .868 .889 .830 .848 .886 .876 .885Cov95 .908 .889 .930 .935 .897 .898 .932 .933 .939(M2) n = 800 , p = 400 Bias − . − . − . − . − . − . . − . . 
√ Var .432 .438 .752 .522 .521 1.301 .427 .433 .784 √ EVar .415 .401 1.290 .509 .486 11.804 .415 .422 .790Cov90 .848 .803 .872 .884 .832 .851 .884 .877 .883Cov95 .909 .882 .928 .933 .895 .894 .937 .929 .941(M1) n = 800 , p = 1000 Bias − . − . − . − . − . − . . − . . √ Var .428 .424 .632 .518 .521 .749 .438 .451 .961 √ EVar .411 .393 .600 .493 .477 .742 .411 .413 .816Cov90 .837 .808 .832 .879 .831 .833 .882 .864 .857Cov95 .900 .869 .901 .933 .888 .899 .945 .924 .920(M2) n = 800 , p = 1000 Bias − . − . − . − . − . − . . − . . √ Var .430 .424 .632 .515 .520 .759 .441 .450 .714 √ EVar .407 .393 .605 .491 .476 .744 .407 .412 .677Cov90 .820 .803 .831 .874 .824 .828 .875 .862 .858Cov95 .880 .868 .900 .929 .888 .894 .939 .928 .917
Note: RCAL denotes θ̂_{1,RCAL}, RML denotes θ̂_{1,RML}, and RML2 denotes the variant where the nuisance parameters are estimated by refitting the models with only the variables selected from the corresponding Lasso estimation. Bias and √Var are the Monte Carlo bias and standard deviation of the point estimates, √EVar is the square root of the mean of the variance estimates, and Cov90 or Cov95 is the coverage proportion of the 90% or 95% confidence intervals, based on 1000 repeated simulations. The true values of θ_1 under (C1)–(C3) are calculated using Monte Carlo integration with 100 repeated samples, each of a large size.

The D-model is correctly specified in configurations (C1) and (C3), but misspecified in (C2), while the Y-model in either (M1) or (M2) is misspecified in all configurations (C1)–(C3), but it can be regarded as being "closer" to the truth in (C1) and (C3) than in (C2) due to using X† instead of X as regressors. Therefore the models in both (M1) and (M2) can be classified as follows in configurations (C1)–(C3):

(C1) IPS model correctly specified, OR models "more correctly" specified;
(C2) IPS model correctly specified, OR models "less correctly" specified;
(C3) IPS model misspecified, OR models "more correctly" specified.

Similarly as in Kang & Schafer (2007) for p = 4, the OR D- and Y-models in case (C2) and the IPS model in (C3) appear adequate by standard diagnostic techniques. See Figures S2–S4 in the Supplement for scatterplots of Y against X†_j within {D = 1}, boxplots of X†_j within {D = 0} and {D = 1}, as well as boxplots of X†_j within {Z = 0} and {Z = 1} for j = 1, ..., 4.

For n = 800 and p = 400 or 1000, Table 1 summarizes the results based on 1000 repeated simulations. The methods RCAL and RML perform similarly to each other in terms of absolute bias, variance and coverage in (C1) and (C3), but RCAL yields noticeably smaller absolute biases and better coverage than RML and RML2 in (C2).
The post-Lasso refitting method RML2 appears to achieve coverage closer to the nominal probabilities in (C1), but yields substantially higher variances in all three cases (C1)–(C3). These properties can also be seen from the QQ-plots of the estimates and t-statistics in Figures S5–S8 in the Supplement. The performance of each of the three methods is similar under model specifications (M1) and (M2). Hence in the settings studied, there is little benefit in adding the spline terms in the outcome Y-model.

We generate data under the following configurations with a completely randomized instrument:

(C4) Generate Z from a Bernoulli distribution with P(Z = 1) = 0.5, and, independently, generate U from a standard Logistic distribution. Then, generate D = 1(1 − 0.5 Z + 0.5 X†_1 + X†_2 + 0.5 X†_3 − 0.5 X†_4 ≥ U) and Y(1) from a Normal distribution with variance 1 and mean E{Y(1) | Z, X, U} = 0.5 X†_1 + X†_2 + X†_3 + X†_4 + 2U.

(C5) Generate (Z, U) as in (C4), but generate D = 1(1 − 0.5 Z + 0.5 X_1 + X_2 + 0.5 X_3 − 0.5 X_4 ≥ U) and Y(1) from a Normal distribution with variance 1 and mean E{Y(1) | Z, X, U} = 0.5 X_1 + X_2 + X_3 + X_4 + 2U.

For n = 800 and p = 1000, Table 2 summarizes the results based on 1000 repeated simulations. See the Supplement for the results with p = 400. The methods RCAL, RML and IPW yield small biases and adequate coverage proportions in (C4) and (C5). The refitting method RML2 also yields small biases, but coverage proportions noticeably below the nominal probabilities. The average interval lengths, proportional to √EVar, from RCAL and RML are comparable, and are shorter than those of IPW in both (C4) and (C5). Such efficiency gains are comparable to those reported in previous simulation studies dealing with the average treatment effect, for example the reduction in interval lengths of Lasso-adjusted methods relative to the unadjusted difference-of-means estimator in Bloniarz et al. (2016).
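The efficiency gain from covariate adjustment under a completely randomized instrument can be illustrated with a toy Monte Carlo sketch. The design below (independent compliance, residualize-then-Wald adjustment, all coefficients) is an illustrative assumption for the sketch, not the paper's data-generating process or its RCAL/RML estimators.

```python
import numpy as np

rng = np.random.default_rng(3)

def one_sample(n=500, tau=1.0):
    # toy design: compliance C independent of X, so the LATE equals tau
    X = rng.standard_normal((n, 4))
    Z = rng.binomial(1, 0.5, n)            # completely randomized instrument
    C = rng.binomial(1, 0.6, n)            # compliers
    D = Z * C
    Y = X @ np.array([2.0, 2.0, 2.0, 2.0]) + tau * D + rng.standard_normal(n)

    def wald(y, d):
        # ratio of instrument-group differences in means
        return (y[Z == 1].mean() - y[Z == 0].mean()) / (
            d[Z == 1].mean() - d[Z == 0].mean())

    est_raw = wald(Y, D)                   # unadjusted Wald estimator
    # covariate-adjusted: residualize Y and D on X by least squares first
    bY, *_ = np.linalg.lstsq(X, Y - Y.mean(), rcond=None)
    bD, *_ = np.linalg.lstsq(X, D - D.mean(), rcond=None)
    est_adj = wald(Y - X @ bY, D - X @ bD)
    return est_raw, est_adj

ests = np.array([one_sample() for _ in range(300)])
var_raw, var_adj = ests.var(axis=0)
```

In this design both estimators are centered near the true LATE, while the adjusted estimator has a much smaller Monte Carlo variance, mirroring the interval-length comparison in the text.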
In instrumental variable analysis, such variance reduction is particularly helpful because the Wald estimator usually suffers from large standard errors.

Table 2: Summary of results for estimation of θ_1 with a completely randomized instrument.

              (C4) cor IPS, more cor OR          (C5) cor IPS, less cor OR
              IPW     RCAL    RML     RML2       IPW     RCAL    RML     RML2
(M1) n = 800, p = 1000
Bias          .019    −.      −.      −.         .046    .015    .020    −.
√Var          .439    .409    .406    .424       .544    .521    .522    .545
√EVar         .444    .410    .401    .378       .529    .501    .498    .473
Cov90         .903    .904    .902    .843       .890    .889    .886    .829
Cov95         .952    .956    .951    .925       .937    .942    .937    .908
(M2) n = 800, p = 1000
Bias          —       −.      .       −.         —       .       .021    −.
√Var          —       .411    .405    .424       —       .520    .521    .545
√EVar         —       .407    .401    .379       —       .500    .498    .474
Cov90         —       .903    .907    .843       —       .889    .890    .836
Cov95         —       .949    .948    .922       —       .942    .939    .901
Note: See the footnote of Table 1. IPW denotes θ̂_{1,IPW}(π̂), with asymptotic variance estimated by accounting for the variation of π̂ = Ẽ(Z).

Effect of education on earnings
The causal relationship between education and earnings has been of considerable interest in economics. Card (1995) proposed proximity to college as an instrument for completed education. The argument is that proximity to college could be taken as being randomized conditionally on observed covariates, and its influence on earnings could be only through that on the schooling decision. Consider the analytic sample in Card (1995) from the National Longitudinal Survey (NLS) of Young Men, which comprises 3,010 men with valid education and wage responses in the 1976 interview. Similarly as in Tan (2006a), we define the treatment as education after high school, i.e. D = 1(years of schooling > 12), the instrument Z as a binary indicator for proximity to a 4-year college, and the outcome Y as a surrogate outcome constructed for the log of hourly earnings at age 30. The raw vector of covariates X includes a race indicator, indicators for nine regions of residence and for residence in an SMSA in 1966, mother's and father's years of schooling (momed and daded respectively) and indicators for their missing values, indicators for living with both natural parents, with one natural parent and one step parent, and with mother only at age 14, and the Knowledge of World of Work score (kww) in 1966 and a missing indicator. We use mean imputation for the missing values, and standardize all continuous variables to have sample means 0 and variances 1.

We reanalyze the NLS data to estimate the LATE of education beyond high school on log hourly earnings, using more flexible, higher dimensional models than previously allowed. We apply (θ̂_{1,RCAL}, θ̂_{0,RCAL}) based on regularized calibrated estimation (RCAL) and (θ̂_{1,RML}, θ̂_{0,RML}) based on regularized likelihood estimation (RML) as well as the post-Lasso variant (RML2).
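The preprocessing steps above (missing indicators, mean imputation, standardization, the treatment definition, and the linear spline bases in kww at quantile knots) can be sketched as follows. The data here are synthetic stand-ins, not Card's NLS sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# synthetic stand-ins for NLS variables
momed = rng.normal(12.0, 2.0, n)                       # mother's years of schooling
momed[rng.choice(n, 20, replace=False)] = np.nan       # some values missing
kww = rng.normal(0.0, 1.0, n)                          # Knowledge of World of Work score
years = rng.integers(8, 19, n)                         # years of schooling

# missing indicator, mean imputation, then standardization
momed_mis = np.isnan(momed).astype(float)
momed_imp = np.where(np.isnan(momed), np.nanmean(momed), momed)
momed_std = (momed_imp - momed_imp.mean()) / momed_imp.std()

# treatment: education beyond high school
D = (years > 12).astype(int)

# linear spline basis in kww with knots at the i/(k+1)-quantiles, k = 3
k = 3
knots = np.quantile(kww, np.arange(1, k + 1) / (k + 1))
kww_spline = np.maximum(kww[:, None] - knots[None, :], 0.0)   # (c)_+ = max(0, c)
```

The columns of `kww_spline`, together with the standardized covariates and their interactions with the indicators, form the regressor vectors f(X) = g(X) described next.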
The specification for f(X) = g(X) consists of all the indicator variables mentioned above, momed, daded, linear spline bases in kww as well as interactions between the spline terms and all the indicator variables. The vector h(X) augments f(X) and g(X) by adding linear spline terms in each fitted treatment regression m̂_{Dz}, z ∈ {0, 1}. We vary the model complexity by considering the number of knots k ∈ {3, 9, 15}, with knots placed at the i/(k + 1)-quantiles for i = 1, ..., k. The tuning parameter λ is determined using 5-fold cross validation based on the corresponding penalized loss functions, as described in Section 4. As an anchor specification, we also consider main-effect models with f(X) = g(X) = (1, X^T)^T and h(X) = (1, X^T, m̂_{D1}, m̂_{D0})^T, whereby the nuisance parameters are estimated using non-penalized likelihood or calibration estimation.

Table 3: Estimates of the effect of education beyond high school on log earnings.

              RCAL            RML             RML2
Non-penalized main effects (p = q1 = 19, q2 = 21)
θ1            6. ± .236       6. ± .          —
θ0            6. ± .297       6. ± .          —
LATE          0. ± .365       0. ± .          —
Linear spline with 3 knots (p = q1 = 114, q2 = 122)
θ1            6. ± .334       6. ± .323       6. ± .
θ0            6. ± .382       6. ± .365       6. ± .
LATE          0. ± .522       0. ± .493       0. ± .
Linear spline with 9 knots (p = q1 = 213, q2 = 233)
θ1            6. ± .310       6. ± .315       6. ± .
θ0            6. ± .359       6. ± .359       6. ± .
LATE          0. ± .495       0. ± .480       0. ± .
Linear spline with 15 knots (p = q1 = 312, q2 = 344)
θ1            6. ± .310       6. ± .310       6. ± .
θ0            6. ± .360       6. ± .359       6. ± .
LATE          0. ± .498       0. ± .477       0. ± .

Note: Estimate ± 1.96 × standard error. As defined in Section 2.2, p, q1, or q2 is the number of regressors in the IPS model, the outcome D-model, or the outcome Y-model.

Table 3 shows the estimates of (θ1, θ0) and the LATE of education beyond high school on log hourly earnings. Regularized estimation from RCAL, RML and RML2 yields similar point estimates; the differences are small compared with the standard errors. The RCAL and RML estimates have noticeably smaller standard errors than RML2. Interestingly, for splines with 15 knots, the LATE is estimated from RCAL with a 95% confidence interval which excludes 0, whereas those from RML and RML2 include 0.

While the validity of confidence intervals is difficult to assess using real data, Figure S9 in the Supplement shows the standardized sample influence functions for estimation of the LATE. The curves from RCAL appear to be more normally distributed than those from RML or RML2, especially in the tails. In addition, Figures S10–S12 in the Supplement present the standardized calibration differences for all the variables f_j(X), j = 1, ..., p, similarly as in Tan (2020a). Compared with RML and RML2, our method RCAL consistently yields smaller maximum absolute standardized differences and involves fewer nonzero estimates of γ_j in the IPS models.

We develop a computationally tractable method and appropriate theoretical analysis to obtain model-assisted confidence intervals for population LATEs in high-dimensional settings. There are various interesting topics which warrant further investigation. Both the instrument and treatment are assumed to be binary here. It is desirable to extend our method to handle multi-valued instruments and treatments and to estimate treatment effects under other identification assumptions. Another methodological question is whether doubly robust confidence intervals can be derived in a computationally and theoretically satisfactory manner for practical use.
References
Abadie, A. (2003) Semiparametric instrumental variable estimation of treatment response models, Journal of Econometrics, 113, 231–263.

Angrist, J.D., Imbens, G.W. and Rubin, D.B. (1996) Identification of causal effects using instrumental variables, Journal of the American Statistical Association, 91, 444–455.

Avagyan, V. and Vansteelandt, S. (2017) Honest data-adaptive inference for the average treatment effect under model misspecification using penalised bias-reduced double-robust estimation, arXiv preprint, arXiv:1708.03787.

Bloniarz, A., Liu, H., Zhang, C., Sekhon, J.S. and Yu, B. (2016) Lasso adjustments of treatment effect estimates in randomized experiments, Proceedings of the National Academy of Sciences, 113, 7383–7390.

Baiocchi, M., Cheng, J. and Small, D.S. (2014) Instrumental variable methods for causal inference, Statistics in Medicine, 33, 2297–2340.

Belloni, A., Chernozhukov, V. and Hansen, C. (2014) Inference on treatment effects after selection among high-dimensional controls, The Review of Economic Studies, 81, 608–650.

Bradic, J., Wager, S. and Zhu, Y. (2019) Sparsity double robust inference of average treatment effects, arXiv preprint, arXiv:1905.00744.

Bühlmann, P. and van de Geer, S. (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications, New York: Springer.

Card, D. (1995) Using geographic variation in college proximity to estimate the return to schooling, in Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, 201–222, Toronto: University of Toronto Press.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J.M. (2018) Double/debiased machine learning for treatment and structural parameters, The Econometrics Journal, 21, C1–C68.

Davidian, M., Tsiatis, A.A. and Leon, S. (2005) Semiparametric estimation of treatment effect in a pretest–posttest study with missing data, Statistical Science, 20, 261–301.

Farrell, M. (2015) Robust inference on average treatment effects with possibly more covariates than observations, Journal of Econometrics, 189, 1–23.

Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software, 33, 1–22.

Frölich, M. (2007) Nonparametric IV estimation of local average treatment effects with covariates, Journal of Econometrics, 139, 35–75.

Hirano, K., Imbens, G.W., Rubin, D.B. and Zhou, X.-H. (2000) Assessing the effect of an influenza vaccine in an encouragement design, Biostatistics, 1, 69–88.

Huang, J. and Zhang, C.-H. (2012) Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications, Journal of Machine Learning Research, 13, 1839–1864.

Imbens, G.W. (2014) Instrumental variables: An econometrician's perspective, Statistical Science, 29, 323–358.

Kang, J.D.Y. and Schafer, J.L. (2007) Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data, Statistical Science, 22, 523–539.

Kim, J.K. and Haziza, D. (2014) Doubly robust inference with missing data in survey sampling, Statistica Sinica, 24, 375–394.

Little, R.J.A. and Yau, L. (1998) Statistical techniques for analyzing data from prevention trials: Treatment of no-shows using Rubin's causal model, Psychological Methods, 3, 147–159.

McCullagh, P. and Nelder, J. (1989) Generalized Linear Models (2nd edition), New York: Chapman & Hall.

Neyman, J. (1990) On the application of probability theory to agricultural experiments. Essay on principles. Section 9, Statistical Science, 5, 465–472.

Ning, Y., Peng, S. and Imai, K. (2020) Robust estimation of causal effects via high-dimensional covariate balancing propensity score, Biometrika, to appear.

Ogburn, E.L., Rotnitzky, A. and Robins, J.M. (2015) Doubly robust estimation of the local average treatment effect curve, Journal of the Royal Statistical Society, Ser. B, 77, 373–396.

Okui, R., Small, D.S., Tan, Z. and Robins, J.M. (2012) Doubly robust instrumental variable regression, Statistica Sinica, 22, 173–205.

Osborne, M., Presnell, B. and Turlach, B. (2000) A new approach to variable selection in least squares problems, IMA Journal of Numerical Analysis, 20, 389–404.

Robins, J.M. (1994) Correcting for non-compliance in randomized trials using structural nested mean models, Communications in Statistics – Theory and Methods, 23, 2379–2412.

Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994) Estimation of regression coefficients when some regressors are not always observed, Journal of the American Statistical Association, 89, 846–866.

Rubin, D.B. (1974) Estimating causal effects of treatments in randomized and nonrandomized studies, Journal of Educational Psychology, 66, 688–701.

Särndal, C.E., Swensson, B. and Wretman, J. (2003) Model Assisted Survey Sampling, New York: Springer Science & Business Media.

Smucler, E., Rotnitzky, A. and Robins, J.M. (2019) A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts, arXiv preprint, arXiv:1904.03737.

Tan, Z. (2006a) Regression and weighting methods for causal inference using instrumental variables, Journal of the American Statistical Association, 101, 1607–1618.

Tan, Z. (2006b) A distributional approach for causal inference using propensity scores, Journal of the American Statistical Association, 101, 1619–1637.

Tan, Z. (2007) Comment: Understanding OR, PS and DR, Statistical Science, 22, 560–568.

Tan, Z. (2010a) Marginal and nested structural models using instrumental variables, Journal of the American Statistical Association, 105, 157–169.

Tan, Z. (2010b) Bounded, efficient, and doubly robust estimation with inverse weighting, Biometrika, 97, 661–682.

Tan, Z. (2020a) Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data, Biometrika, 107, 137–158.

Tan, Z. (2020b) Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data, Annals of Statistics, 48, 811–837.

Tibshirani, R. (1996) Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

Uysal, S.D. (2011) Doubly robust IV estimation of the local average treatment effects, unpublished manuscript.

van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014) On asymptotically optimal confidence regions and tests for high-dimensional models, Annals of Statistics, 42, 1166–1202.

Vansteelandt, S. and Goetghebeur, E. (2003) Causal inference with generalized structural mean models, Journal of the Royal Statistical Society, Ser. B, 65, 817–835.

Vermeulen, K. and Vansteelandt, S. (2015) Bias-reduced doubly robust estimation, Journal of the American Statistical Association, 110, 1024–1036.

Vytlacil, E. (2002) Independence, monotonicity, and latent index models: An equivalence result, Econometrica, 70, 331–341.

Wager, S., Du, W., Taylor, J. and Tibshirani, R.J. (2016) High-dimensional regression adjustments in randomized experiments, Proceedings of the National Academy of Sciences, 113, 12673–12678.

Wang, L. and Tchetgen Tchetgen, E. (2019) Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables, Journal of the Royal Statistical Society, Ser. B, 80, 531–550.

Wright, P.G. (1928) The Tariff on Animal and Vegetable Oils, New York: Macmillan.

Zhang, C.-H. and Zhang, S.S. (2014) Confidence intervals for low-dimensional parameters with high-dimensional data, Journal of the Royal Statistical Society, Ser. B, 76, 217–242.

Supplementary Material for "High-dimensional Model-assisted Inference for Local Average Treatment Effects with Instrumental Variables"
Baoluo Sun & Zhiqiang Tan
I Additional results in simulation studies
Figure S1:
Scatter plots of (X†_1, X†_2, X†_3, X†_4) with marginal histograms from a simulated sample of size n = 800.

Figure S2: Scatterplots of Y against (X†_1, X†_2, X†_3, X†_4) within {D = 1}, boxplots of (X†_1, X†_2, X†_3, X†_4) within {D = 0} and {D = 1}, as well as boxplots of (X†_1, X†_2, X†_3, X†_4) within {Z = 0} and {Z = 1} from a sample of size n = 800 in case (C1).

Figure S3: Scatterplots of Y against (X†_1, X†_2, X†_3, X†_4) within {D = 1}, boxplots of (X†_1, X†_2, X†_3, X†_4) within {D = 0} and {D = 1}, as well as boxplots of (X†_1, X†_2, X†_3, X†_4) within {Z = 0} and {Z = 1} from a sample of size n = 800 in case (C2).

Figure S4: Scatterplots of Y against (X†_1, X†_2, X†_3, X†_4) within {D = 1}, boxplots of (X†_1, X†_2, X†_3, X†_4) within {D = 0} and {D = 1}, as well as boxplots of (X†_1, X†_2, X†_3, X†_4) within {Z = 0} and {Z = 1} from a sample of size n = 800 in case (C3).

Figure S5: QQ-plots of the estimates (first row) and t-statistics (second row) against the standard normal (n = 800, p = 400), based on θ̂_{1,RML} (◦), the post-Lasso variant (△) and θ̂_{1,RCAL} (×) under specification (M1). For readability, only a subset of 101 order statistics are shown as points on the QQ lines.
Figure S6: QQ-plots of the estimates (first row) and t-statistics (second row) against the standard normal (n = 800, p = 1000), based on θ̂_{1,RML} (◦), the post-Lasso variant (△) and θ̂_{1,RCAL} (×) under specification (M1). For readability, only a subset of 101 order statistics are shown as points on the QQ lines.

Figure S7: QQ-plots of the estimates (first row) and t-statistics (second row) against the standard normal (n = 800, p = 400), based on θ̂_{1,RML} (◦), the post-Lasso variant (△) and θ̂_{1,RCAL} (×) under specification (M2). For readability, only a subset of 101 order statistics are shown as points on the QQ lines.

Figure S8: QQ-plots of the estimates (first row) and t-statistics (second row) against the standard normal (n = 800, p = 1000), based on θ̂_{1,RML} (◦), the post-Lasso variant (△) and θ̂_{1,RCAL} (×) under specification (M2). For readability, only a subset of 101 order statistics are shown as points on the QQ lines.

Table S1: Summary of results for estimation of θ_1 with a completely randomized instrument.
              (C4) cor IPS, more cor OR          (C5) cor IPS, less cor OR
              IPW     RCAL    RML     RML2       IPW     RCAL    RML     RML2
(M1) n = 800, p = 400
Bias          .019    −.      −.      −.         .046    .023    .029    .040
√Var          .439    .424    .417    .435       .544    .508    .508    .526
√EVar         .444    .408    .400    .378       .529    .504    .500    .529
Cov90         .903    .891    .885    .841       .890    .903    .900    .889
Cov95         .952    .946    .942    .906       .937    .943    .945    .936
(M2) n = 800, p = 400
Bias          —       −.      −.      −.         —       .023    .030    .004
√Var          —       .423    .416    .432       —       .506    .508    .514
√EVar         —       .407    .400    .378       —       .504    .500    .476
Cov90         —       .889    .891    .844       —       .903    .903    .879
Cov95         —       .942    .940    .911       —       .946    .947    .931
(M1) n = 800, p = 1000
Bias          —       −.      −.      −.         —       .015    .020    −.
√Var          —       .409    .406    .424       —       .521    .522    .545
√EVar         —       .410    .401    .378       —       .501    .498    .473
Cov90         —       .904    .902    .843       —       .889    .886    .829
Cov95         —       .956    .951    .925       —       .942    .937    .908
(M2) n = 800, p = 1000
Bias          —       −.      .       −.         —       .       .021    −.
√Var          —       .411    .405    .424       —       .520    .521    .545
√EVar         —       .407    .401    .379       —       .500    .498    .474
Cov90         —       .903    .907    .843       —       .889    .890    .836
Cov95         —       .949    .948    .922       —       .942    .939    .901
Note: See the footnote of Table 2. For completeness, the results with n = 800 and p = 1000 are also included.

II Additional results in empirical application
Figure S9:
Boxplots of the weights 1/(1 − π̂) or 1/π̂ within the Z = 0 and Z = 1 groups (1st and 2nd columns respectively), each normalized to sum to the sample size n = 3010, as well as QQ-plots with a 45-degree line of the standardized sample influence functions for the non-penalized main-effects model (top row) and the linear spline specifications with 3, 9 and 15 knots (2nd to 4th rows respectively).
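Figures S10–S12 below report standardized calibration differences, defined for a function h(X) and an estimated propensity score π̂(X) in the Z = 1 group as CAL(π̂, h) = Ẽ[{Z/π̂(X) − 1}h(X)] / [Ẽ{Z/π̂(X)} Ṽ^{1/2}(h(X))]. A minimal numpy sketch of this computation, applied column-wise to a regressor matrix, is below; the simulated Z and f and the variance convention (`np.std` with default `ddof`) are illustrative assumptions.

```python
import numpy as np

def std_calibration_diff(Z, pi_hat, f):
    """Standardized calibration differences CAL(pi_hat, f_j) in the Z = 1 group,
    computed column-wise for an n x p regressor matrix f."""
    w = Z / pi_hat                                # inverse probability weights
    num = ((w - 1)[:, None] * f).mean(axis=0)     # E~[{Z/pi_hat - 1} f_j]
    den = w.mean() * f.std(axis=0)                # E~{Z/pi_hat} * V~^{1/2}(f_j)
    return num / den

rng = np.random.default_rng(2)
n, p = 2000, 6
f = rng.standard_normal((n, p))
Z = rng.binomial(1, 0.5, n)
# constant pi_hat = E~(Z) reproduces the "Raw" differences in the figures
cal = std_calibration_diff(Z, np.full(n, Z.mean()), f)
```

With Z generated independently of f, the raw differences are close to zero; a poorly fitted π̂(X) would yield large absolute values for some f_j.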
The standardized calibration difference in the Z = 1 group for a function h(X) using an estimated π̂(X) is

CAL(π̂, h) = Ẽ[{Z/π̂(X) − 1} h(X)] / [Ẽ{Z/π̂(X)} Ṽ^{1/2}(h(X))],

where Ṽ(·) denotes the sample variance.

Figure S10: Standardized calibration differences CAL(π̂; f_j) over the index j ∈ {1, ..., p} under the linear spline specification with 3 knots for the estimators π̂ = Ẽ(Z), π̂_RCAL, π̂_RML and the post-Lasso variant π̂_RML2 respectively (top 2 rows), with λ selected from cross validation. Two horizontal lines are placed at the maximum absolute standardized differences. Marks (x) are plotted at the indices j corresponding to 30 nonzero estimates of γ_j for π̂_RCAL and 31 nonzero estimates of γ_j for π̂_RML and π̂_RML2. Scatter plots of the fitted propensity scores {π̂_RCAL(X_i), π̂_RML(X_i)} and {π̂_RCAL(X_i), π̂_RML2(X_i)} respectively (bottom row), in the group of individuals who stay near a 4-year college, i.e. {Z_i = 1, i = 1, ..., n}.

Figure S11: Standardized calibration differences CAL(π̂; f_j) over the index j ∈ {1, ..., p} under the linear spline specification with 9 knots for the estimators π̂ = Ẽ(Z), π̂_RCAL, π̂_RML and the post-Lasso variant π̂_RML2 respectively (top 2 rows), with λ selected from cross validation. Two horizontal lines are placed at the maximum absolute standardized differences. Marks (x) are plotted at the indices j corresponding to 21 nonzero estimates of γ_j for π̂_RCAL and 37 nonzero estimates of γ_j for π̂_RML and π̂_RML2.
Scatter plots of the fitted propensity scores {π̂_RCAL(X_i), π̂_RML(X_i)} and {π̂_RCAL(X_i), π̂_RML2(X_i)} respectively (bottom row), in the group of individuals who stay near a 4-year college, i.e. {Z_i = 1, i = 1, ..., n}.

Figure S12: Standardized calibration differences CAL(π̂; f_j) over the index j ∈ {1, ..., p} under the linear spline specification with 15 knots for the estimators π̂ = Ẽ(Z), π̂_RCAL, π̂_RML and the post-Lasso variant π̂_RML2 respectively (top 2 rows), with λ selected from cross validation. Two horizontal lines are placed at the maximum absolute standardized differences. Marks (x) are plotted at the indices j corresponding to 20 nonzero estimates of γ_j for π̂_RCAL and 34 nonzero estimates of γ_j for π̂_RML and π̂_RML2. Scatter plots of the fitted propensity scores {π̂_RCAL(X_i), π̂_RML(X_i)} and {π̂_RCAL(X_i), π̂_RML2(X_i)} respectively (bottom row), in the group of individuals who stay near a 4-year college, i.e. {Z_i = 1, i = 1, ..., n}.

III Technical details
III.1 Probability lemmas
Denote by Ω the event that (48)–(49) hold. Then P(Ω) ≥ 1 − cǫ under the conditions of Theorem 1. The following Lemmas 1–4 are used in the proof of Theorem 2 (Section III.2).

Lemma 1.
Under Assumptions 1(ii) and 3(i)–(ii), there exists a positive constant B_1, depending on (C_f, C_h, σ_0, σ_1), such that P(Ω_1) ≥ 1 − ǫ, where Ω_1 denotes the event

sup_{j=0,1,...,r} | Ẽ[ Zw(X; γ̄){DY − m(X; ᾱ)ψ_Y(ᾱ^T h)} h_j(X) ] | ≤ B_1 λ.

Proof.
This can be shown similarly as Lemma 2 in the Supplement of Tan (2020b). □
Recall Σ_h = E[Zw(X; γ̄)h(X)h^T(X)] and Σ̃_h = Ẽ[Zw(X; γ̄)h(X)h^T(X)].

Lemma 2.
Under Assumptions 1(ii) and 3(i), there exists a positive constant B_h, depending on (C_f, C_h), such that P(Ω_h) ≥ 1 − ǫ, where Ω_h denotes the event

sup_{j,k=0,1,...,r} | (Σ̃_h)_{jk} − (Σ_h)_{jk} | ≤ B_h λ.

Proof.
This can be shown similarly as Lemma 1(ii) in the Supplement of Tan (2020b). □
Recall Σ_f = E[Zw(X; γ̄)f(X)f^T(X)] and Σ̃_f = Ẽ[Zw(X; γ̄)f(X)f^T(X)].

Lemma 3.
Under Assumptions 1(i)–(ii), there exists a positive constant B_f, depending on C_f, such that P(Ω_f) ≥ 1 − ǫ, where Ω_f denotes the event

sup_{j,k=0,1,...,p} | (Σ̃_f)_{jk} − (Σ_f)_{jk} | ≤ B_f λ.

Proof.
This is Lemma 1(ii) in the Supplement of Tan (2020b). □
Denote

Σ_{2f} = E[ Zw(X; γ̄){DY − m(X; ᾱ)ψ_Y(ᾱ^T h)}^2 f(X)f^T(X) ],
Σ̃_{2f} = Ẽ[ Zw(X; γ̄){DY − m(X; ᾱ)ψ_Y(ᾱ^T h)}^2 f(X)f^T(X) ].

Lemma 4. Under Assumptions 1(i)–(ii) and 3(ii), there exists a positive constant B_{2f}, depending on (C_f, σ_0, σ_1), such that if λ ≤ 1, then P(Ω_{2f}) ≥ 1 − ǫ, where Ω_{2f} denotes the event

sup_{j,k=0,1,...,p} | (Σ̃_{2f})_{jk} − (Σ_{2f})_{jk} | ≤ B_{2f} λ.

Proof.
This can be shown similarly as Lemma 3 in the Supplement of Tan (2020b). □
In the event Ω_f ∩ Ω_{2f}, we have for any vector b ∈ R^{1+p},

| (Ẽ − E){ Zw(X; γ̄)(b^T f)^2 } | ≤ B_f λ ‖b‖_1^2,
| (Ẽ − E)[ Zw(X; γ̄){DY − m(X; ᾱ)ψ_Y(ᾱ^T h)}^2 (b^T f)^2 ] | ≤ B_{2f} λ ‖b‖_1^2.

By Assumption 3(ii), E[{DY − m(X; ᾱ)ψ_Y(ᾱ^T h)}^2 | X] ≤ σ_0 + σ_1 and hence

E[ Zw(X; γ̄){DY − m(X; ᾱ)ψ_Y(ᾱ^T h)}^2 (b^T f)^2 ] ≤ (σ_0 + σ_1) E{ Zw(X; γ̄)(b^T f)^2 }.

Combining the preceding three inequalities shows that in the event Ω_f ∩ Ω_{2f},

Ẽ[ Zw(X; γ̄){DY − m(X; ᾱ)ψ_Y(ᾱ^T h)}^2 (b^T f)^2 ] ≤ (σ_0 + σ_1) Ẽ{ Zw(X; γ̄)(b^T f)^2 } + {B_{2f} + (σ_0 + σ_1)B_f} λ ‖b‖_1^2. (S1)

The following Lemmas 5–7 are used in the proof of Theorem 3 (Section III.3). Denote

Σ_{3f} = E[ Zw(X; γ̄)|DY − m(X; ᾱ)ψ_Y(ᾱ^T h)| f(X)f^T(X) ],
Σ̃_{3f} = Ẽ[ Zw(X; γ̄)|DY − m(X; ᾱ)ψ_Y(ᾱ^T h)| f(X)f^T(X) ].

Lemma 5.
Under Assumptions 1(i)–(ii) and 3(ii), there exists a positive constant B_f, depending on (C_f, C_f, σ, σ), such that P(Ω_f) ≥ 1 − ε, where Ω_f denotes the event

sup_{j,k = 0,1,...,p} |(Σ̃_f)_{jk} − (Σ_f)_{jk}| ≤ B_f λ.

Proof.
This can be shown similarly as Lemma 4 in the Supplement of Tan (2020b). □
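Several bounds below, for example (S5) and (S21), apply the Cauchy–Schwarz inequality with the nonnegative weights Zw(X; γ̄). A quick numerical check of this weighted form (illustrative only; the arrays are arbitrary stand-ins for the quantities in the proofs):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    w = rng.exponential(size=50)   # nonnegative weights, playing the role of Z*w(X; gamma_bar)
    a = rng.normal(size=50)        # e.g. the differences gamma_hat^T f - gamma_bar^T f
    b = rng.normal(size=50)        # e.g. the differences alpha_hat^T h - alpha_bar^T h
    lhs = np.mean(w * a * b) ** 2
    rhs = np.mean(w * a * a) * np.mean(w * b * b)
    assert lhs <= rhs + 1e-12      # (E~[W a b])^2 <= E~[W a^2] * E~[W b^2]
print("weighted Cauchy-Schwarz holds on 1000 random draws")
```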
Similarly as (S1), we have in the event Ω_f ∩ Ω_f, for any vector b ∈ ℝ^{1+p},

Ẽ{Z w(X; γ̄) |DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)| (bᵀf)²} ≤ (σ + σ)^{1/2} Ẽ{Z w(X; γ̄)(bᵀf)²} + {B_f + (σ + σ)^{1/2} B_f} λ ‖b‖₁². (S2)

Lemma 6. Suppose that Assumptions 1(ii), 2(i)–(iii), and 3(iii) hold. Then P(Ω_r) ≥ 1 − ε for any r ≥ 0, where Ω_r denotes the event

sup_{‖α₁ − ᾱ₁‖₁ ≤ r} |Ẽ[{1 − Z/π*(X)} m̄(X){ψ_D(α₁ᵀg) − ψ_D(ᾱ₁ᵀg)}]| ≤ 4(e^{−C_f} + 1) C_g C̃_h ψ′_D(0) e^{C_g(C_g + C_g r)} r λ.

Here C̃_h = max{|ψ_Y(−C_h)|, |ψ_Y(C_h)|}.

Proof.
Using (S7), this can be shown similarly as Lemma 13 in the Supplement of Tan (2020b). □
Lemma 7.
Suppose that Assumptions 1(ii) and 3(i), (iii), (iv) hold. Then P(Ω_r) ≥ 1 − ε for any r ≥ 0, where Ω_r denotes the event

sup_{α ∈ ℝ^q: ‖α₁ − ᾱ₁‖₁ ≤ r, ‖α − ᾱ‖₁ ≤ r} |Ẽ[{1 − Z/π*(X)} ψ_D(α₁ᵀg){ψ_Y(αᵀh) − ψ_Y(ᾱᵀh)}]| ≤ 4(e^{−C_f} + 1) C_h ψ′_Y(0) e^{C_h(C_h + C_h r)} r λ.

Proof.
Using (S20), this can be shown similarly as Lemma 13 in the Supplement of Tan (2020b). □
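The proofs below repeatedly linearize the weight function via the mean value theorem, as in (S4). The elementary exponential inequality behind that step can be checked numerically (illustrative only; a and b play the roles of γ̂ᵀf and γ̄ᵀf):

```python
import numpy as np

# check |exp(-a) - exp(-b)| <= exp(-b) * exp(|a - b|) * |a - b| on a grid,
# the elementary bound behind the mean value step in (S4)
grid = np.linspace(-3.0, 3.0, 61)
for a in grid:
    for b in grid:
        lhs = abs(np.exp(-a) - np.exp(-b))
        rhs = np.exp(-b) * np.exp(abs(a - b)) * abs(a - b)
        assert lhs <= rhs + 1e-12
print("mean value bound verified on a 61 x 61 grid")
```

The inequality follows from |e^x − 1| ≤ |x| e^{|x|} after factoring out e^{−b}, which is exactly how (S4) is applied with |γ̂ᵀf − γ̄ᵀf| ≤ C_f‖γ̂ − γ̄‖₁.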
III.2 Proof of Theorem 2
We split the proof of Theorem 2 into a series of lemmas. The first one is the usual basic inequality for α̂, except that it depends on the first-step estimators (γ̂, α̂₁).

Lemma 8.
For any coefficient vector α, we have

D†_WL(α̂, α; γ̂, α̂₁) + λ ‖(α̂)_{1:q}‖₁ ≤ (α̂ − α)ᵀ Ẽ[Z w(X; γ̂){DY − m₁(X; α̂₁) ψ_Y(αᵀh)} h] + λ ‖(α)_{1:q}‖₁. (S3)

Proof.
This can be shown similarly as Lemma 6 in the Supplement of Tan (2020b). □
The second lemma deals with the dependency on (γ̂, α̂₁) in the upper bound from the basic inequality (S3). Denote Q(α̂, ᾱ; γ̄) = Ẽ[Z w(X; γ̄)(α̂ᵀh − ᾱᵀh)²].

Lemma 9. In the event Ω ∩ Ω_f ∩ Ω_f, we have

(α̂ − ᾱ)ᵀ Ẽ[Z w(X; γ̂){DY − m₁(X; α̂₁) ψ_Y(ᾱᵀh)} h]
≤ (α̂ − ᾱ)ᵀ Ẽ[Z w(X; γ̄){DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)} h] + {(M |S_γ̄| λ²)^{1/2} + M^{1/2}(|S_γ̄| λ² + |S_ᾱ₁| λ²)^{1/2}} {Q(α̂, ᾱ; γ̄)}^{1/2},

where M = e^{C_f M ϱ}[(σ + σ) M + {B_f + (σ + σ) B_f} M ϱ] and, with C̃_h = max{|ψ_Y(−C_h)|, |ψ_Y(C_h)|}, M = ψ′_D(0) C̃_h e^{C_f M ϱ + 2 C_g(C_g + C_g M ϱ)} M.

Proof.
Consider the decomposition

(α̂ − ᾱ)ᵀ Ẽ[Z w(X; γ̂){DY − m₁(X; α̂₁) ψ_Y(ᾱᵀh)} h] = (α̂ − ᾱ)ᵀ Ẽ[Z w(X; γ̄){DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)} h] + Δ + Δ,

where

Δ = (α̂ − ᾱ)ᵀ Ẽ[Z{w(X; γ̂) − w(X; γ̄)}{DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)} h],
Δ = (α̂ − ᾱ)ᵀ Ẽ[Z w(X; γ̂){m₁(X; α̂₁) − m₁(X; ᾱ₁)} ψ_Y(ᾱᵀh) h].

To handle the first term, we have by the mean value theorem and Assumption 1(i),

|w(X; γ̂) − w(X; γ̄)| = |e^{−γ̂ᵀf} − e^{−γ̄ᵀf}| ≤ e^{−γ̄ᵀf} e^{|γ̂ᵀf − γ̄ᵀf|} |γ̂ᵀf − γ̄ᵀf| ≤ e^{C_f ‖γ̂ − γ̄‖₁} w(X; γ̄) |γ̂ᵀf − γ̄ᵀf|. (S4)

By the Cauchy–Schwarz inequality, we have in the event Ω ∩ Ω_f ∩ Ω_f,

|Δ| ≤ e^{C_f ‖γ̂ − γ̄‖₁} Ẽ^{1/2}[Z w(X; γ̄){DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)}²(γ̂ᵀf − γ̄ᵀf)²] × Ẽ^{1/2}[Z w(X; γ̄)(α̂ᵀh − ᾱᵀh)²]
≤ (M |S_γ̄| λ²)^{1/2} Ẽ^{1/2}[Z w(X; γ̄)(α̂ᵀh − ᾱᵀh)²], (S5)

where M = e^{C_f M ϱ}[(σ + σ) M + {B_f + (σ + σ) B_f} M ϱ]. The second step follows because

Ẽ[Z w(X; γ̄){DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)}²(γ̂ᵀf − γ̄ᵀf)²] ≤ (σ + σ) M |S_γ̄| λ² + {B_f + (σ + σ) B_f} M ϱ |S_γ̄| λ²

in the event Ω ∩ Ω_f ∩ Ω_f by (48), (S1) with b = γ̂ − γ̄, and Assumption 3(vi). To handle the second term, we have w(X; γ̂) ≤ e^{C_f ‖γ̂ − γ̄‖₁} w(X; γ̄) by the mean value theorem and Assumption 1(i). Moreover, by Assumptions 2(i)–(iii),

|m₁(X; α̂₁) − m₁(X; ᾱ₁)| = |ψ_D(α̂₁ᵀg) − ψ_D(ᾱ₁ᵀg)| = |(α̂₁ − ᾱ₁)ᵀg| ∫₀¹ ψ′_D(ᾱ₁ᵀg + u(α̂₁ − ᾱ₁)ᵀg) du
≤ |(α̂₁ − ᾱ₁)ᵀg| ψ′_D(ᾱ₁ᵀg) ∫₀¹ e^{C_g u |(α̂₁ − ᾱ₁)ᵀg|} du (S6)
≤ |(α̂₁ − ᾱ₁)ᵀg| ψ′_D(0) e^{C_g(C_g + C_g ‖α̂₁ − ᾱ₁‖₁)}. (S7)

By the Cauchy–Schwarz inequality and Assumption 3(ii), we have in the event Ω,

|Δ| ≤ ψ′_D(0) C̃_h e^{C_f ‖γ̂ − γ̄‖₁ + C_g(C_g + C_g ‖α̂₁ − ᾱ₁‖₁)} × Ẽ^{1/2}[Z w(X; γ̄)(α̂₁ᵀg − ᾱ₁ᵀg)²] Ẽ^{1/2}[Z w(X; γ̄)(α̂ᵀh − ᾱᵀh)²]
≤ M^{1/2}(|S_γ̄| λ² + |S_ᾱ₁| λ²)^{1/2} Ẽ^{1/2}[Z w(X; γ̄)(α̂ᵀh − ᾱᵀh)²], (S8)

where C̃_h = max{|ψ_Y(−C_h)|, |ψ_Y(C_h)|} and M = ψ′_D(0) C̃_h e^{C_f M ϱ + 2 C_g(C_g + C_g M ϱ)} M. The second step follows by (49) and Assumption 3(vi). Combining (S5)–(S8) yields the desired inequality. □

The third lemma derives an implication of the basic inequality (S3) using the triangle inequality for the L₁ norm, while incorporating the bound from Lemma 9.

Lemma 10.
Denote b = α̂ − ᾱ. In the event Ω ∩ Ω ∩ Ω_f ∩ Ω_f, we have

D†_WL(α̂, ᾱ; γ̂, α̂₁) + (A − B) λ ‖b‖₁ ≤ {(M |S_γ̄| λ²)^{1/2} + M^{1/2}(|S_γ̄| λ² + |S_ᾱ₁| λ²)^{1/2}} {Q(α̂, ᾱ; γ̄)}^{1/2} + 2 A λ Σ_{j ∈ S_ᾱ} |b_j|. (S9)

Proof.
In the event Ω, we have

bᵀ Ẽ[Z w(X; γ̄){DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)} h] ≤ B λ ‖b‖₁.

Then by Lemma 8 with α = ᾱ and Lemma 9, we have in the event Ω ∩ Ω ∩ Ω_f ∩ Ω_f,

D†_WL(α̂, ᾱ; γ̂, α̂₁) + A λ ‖(α̂)_{1:q}‖₁ ≤ B λ ‖b‖₁ + A λ ‖(ᾱ)_{1:q}‖₁ + {(M |S_γ̄| λ²)^{1/2} + M^{1/2}(|S_γ̄| λ² + |S_ᾱ₁| λ²)^{1/2}} {Q(α̂, ᾱ; γ̄)}^{1/2}.

Using the identity |(α̂)_j| = |(α̂ − ᾱ)_j| for j ∉ S_ᾱ and the triangle inequality |(α̂)_j| ≥ |(ᾱ)_j| − |(α̂ − ᾱ)_j| for j ∈ S_ᾱ \ {0}, and rearranging the result, yields (S9). □

The following lemma provides a desired bound relating the Bregman divergence D†_WL(α, ᾱ; γ̂, α̂₁) with the quadratic function (α − ᾱ)ᵀ Σ̃_h (α − ᾱ).

Lemma 11.
In the event Ω, we have for any vector b ∈ ℝ^q,

D†_WL(α, ᾱ; γ̂, α̂₁) ≥ C_g C_h {(1 − e^{−C_h C_h ‖b‖₁})/(C_h C_h ‖b‖₁)} (bᵀ Σ̃_h b),

where b = α − ᾱ, C_g = e^{−C_f M ϱ} ψ_D(−C_g)(1 − ϱ), and C_h = ψ′_Y(0) e^{−C_h C_h}. Throughout, set (1 − e^{−c})/c = 1 for c = 0.

Proof. Direct calculation yields

D†_WL(α, ᾱ; γ̂, α̂₁) = (α − ᾱ)ᵀ Ẽ[Z w(X; γ̂) m₁(X; α̂₁){ψ_Y(αᵀh) − ψ_Y(ᾱᵀh)} h]
= Ẽ[Z w(X; γ̂) m₁(X; α̂₁) {∫₀¹ ψ′_Y(ᾱᵀh + u(α − ᾱ)ᵀh) du} (αᵀh − ᾱᵀh)²].

By the mean value theorem and Assumption 1(i), w(X; γ̂) ≥ e^{−C_f ‖γ̂ − γ̄‖₁} w(X; γ̄). By inequality (S6) and Assumptions 2(i)–(iii), we have

m₁(X; α̂₁) ≥ m₁(X; ᾱ₁){1 − C_g ‖α̂₁ − ᾱ₁‖₁ C_g e^{C_g C_g ‖α̂₁ − ᾱ₁‖₁}} ≥ ψ_D(−C_g){1 − C_g ‖α̂₁ − ᾱ₁‖₁ C_g e^{C_g C_g ‖α̂₁ − ᾱ₁‖₁}}.

The first step follows because, with ψ_D(−∞) = 0,

ψ_D(η) = ∫_{−∞}^{η} ψ′_D(t) dt ≥ ψ′_D(η) ∫_{−∞}^{η} e^{−C_g(η − t)} dt = ψ′_D(η)/C_g,

and hence ψ′_D(η) ≤ C_g ψ_D(η) for any η ∈ ℝ. Then by (48)–(49) and Assumption 3(vi), we obtain in the event Ω,

D†_WL(α, ᾱ; γ̂, α̂₁) ≥ C_g Ẽ[Z w(X; γ̄) {∫₀¹ ψ′_Y(ᾱᵀh + u(α − ᾱ)ᵀh) du} (αᵀh − ᾱᵀh)²],

where C_g = e^{−C_f M ϱ} ψ_D(−C_g)(1 − ϱ), with ϱ = C_g M ϱ C_g e^{C_g C_g M ϱ} < 1. Furthermore, by Assumption 3(iii)–(iv), we have

D†_WL(α, ᾱ; γ̂, α̂₁) ≥ C_g Ẽ[Z w(X; γ̄) ψ′_Y(ᾱᵀh) {∫₀¹ e^{−C_h u |(α − ᾱ)ᵀh|} du} (αᵀh − ᾱᵀh)²]
≥ C_g C_h {∫₀¹ e^{−C_h C_h ‖α − ᾱ‖₁ u} du} Ẽ{Z w(X; γ̄)(αᵀh − ᾱᵀh)²},

where C_h = ψ′_Y(0) e^{−C_h C_h}. The desired result follows because ∫₀¹ e^{−cu} du = (1 − e^{−c})/c for c ≥ 0. □
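The closing step of Lemma 11, reused in (S15) and (S17), relies on the identity ∫₀¹ e^{−cu} du = (1 − e^{−c})/c and the lower bound (1 − e^{−c})/c ≥ e^{−c}. A numerical confirmation (illustrative only; midpoint quadrature with hypothetical values of c):

```python
import numpy as np

du = 1e-5
mid = np.arange(0.0, 1.0, du) + du / 2    # midpoints of the subintervals of [0, 1]
for c in [1e-6, 0.1, 0.5, 1.0, 2.0, 5.0]:
    quad = np.sum(np.exp(-c * mid)) * du  # numerical integral of exp(-c*u) over [0, 1]
    closed = (1.0 - np.exp(-c)) / c       # closed form used at the end of Lemma 11
    assert abs(quad - closed) < 1e-8      # the identity holds
    assert closed >= np.exp(-c)           # the lower bound used in (S15) and (S17)
print("integral identity and lower bound verified")
```

The lower bound holds because the integrand e^{−cu} is at least e^{−c} on [0, 1], which is exactly how the ratio is controlled once C_h C_h‖b‖₁ ≤ −log(1 − ϱ).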
In the event Ω_h, Assumption 3(v) implies that for any vector b ∈ ℝ^q such that Σ_{j ∉ S_ᾱ} |b_j| ≤ μ Σ_{j ∈ S_ᾱ} |b_j|, we have

ν² (Σ_{j ∈ S_ᾱ} |b_j|)² ≤ |S_ᾱ| (bᵀ Σ̃_h b),

where ν = ν{1 − ν^{−2}(1 + μ)² ϱ B_h}^{1/2} = ν(1 − ϱ)^{1/2}.

Proof. In the event Ω_h, we have |bᵀ(Σ̃_h − Σ_h) b| ≤ B_h λ ‖b‖₁² from Lemma 2. Then Assumption 3(v) implies that for any b = (b₀, b₁, ..., b_q)ᵀ satisfying Σ_{j ∉ S_ᾱ} |b_j| ≤ μ Σ_{j ∈ S_ᾱ} |b_j|,

ν² ‖b_{S_ᾱ}‖₁² ≤ |S_ᾱ| (bᵀ Σ_h b) ≤ |S_ᾱ| (bᵀ Σ̃_h b + B_h λ ‖b‖₁²) ≤ |S_ᾱ| (bᵀ Σ̃_h b) + B_h |S_ᾱ| λ (1 + μ)² ‖b_{S_ᾱ}‖₁²,

where ‖b_{S_ᾱ}‖₁ = Σ_{j ∈ S_ᾱ} |b_j|. The last inequality uses ‖b‖₁ ≤ (1 + μ) ‖b_{S_ᾱ}‖₁. The desired result follows because |S_ᾱ| λ ≤ ϱ and ϱ = ν^{−2} B_h (1 + μ)² ϱ < 1 by Assumption 3(vi). □

The final lemma completes the proof of Theorem 2, because P(Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f) ≥ 1 − (c + 8) ε by probability Lemmas 1–4 in Section III.1.

Lemma 13.
For A > B(μ + 1)/(μ − 1), inequality (53) holds in the event Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f:

D†_WL(α̂, ᾱ; γ̄, ᾱ₁) + (A − B) λ ‖α̂ − ᾱ‖₁ ≤ C_g^{−1} C_h^{−1} {2 μ^{−2}(M + M) |S_γ̄| λ² + 2 μ^{−2} M |S_ᾱ₁| λ² + μ² ν^{−2} |S_ᾱ| λ²}.

Proof. Denote b = α̂ − ᾱ, D†_WL = D†_WL(α̂, ᾱ; γ̄, ᾱ₁), Q = Q(α̂, ᾱ; γ̄), and D‡_WL = D†_WL + (A − B) λ ‖b‖₁. In the event Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f, inequality (S9) from Lemma 10 leads to two possible cases: either

μ D‡_WL ≤ {(M |S_γ̄| λ²)^{1/2} + M^{1/2}(|S_γ̄| λ² + |S_ᾱ₁| λ²)^{1/2}} Q^{1/2}, (S10)

or (1 − μ) D‡_WL ≤ 2 A λ Σ_{j ∈ S_ᾱ} |b_j|, that is,

D‡_WL ≤ (μ + 1)(A − B) λ Σ_{j ∈ S_ᾱ} |b_j| = μ λ Σ_{j ∈ S_ᾱ} |b_j|, (S11)

where μ = 1 − 2A/{(μ + 1)(A − B)} ∈ (0, 1) because A > B(μ + 1)/(μ − 1), and μ = (μ + 1)(A − B). We deal with the two cases separately as follows.

If (S11) holds, then Σ_{j ∉ S_ᾱ} |b_j| ≤ μ Σ_{j ∈ S_ᾱ} |b_j|, which, by Lemma 12, implies

Σ_{j ∈ S_ᾱ} |b_j| ≤ ν^{−1} |S_ᾱ|^{1/2} (bᵀ Σ̃_h b)^{1/2}. (S12)

By Lemma 11, we have

D†_WL ≥ C_g C_h {(1 − e^{−C_h C_h ‖b‖₁})/(C_h C_h ‖b‖₁)} (bᵀ Σ̃_h b). (S13)

Combining (S11), (S12), and (S13) and using D†_WL ≤ D‡_WL yields

D‡_WL ≤ μ² ν^{−2} C_g^{−1} C_h^{−1} |S_ᾱ| λ² {(C_h C_h ‖b‖₁)/(1 − e^{−C_h C_h ‖b‖₁})}. (S14)

But (A − B) λ ‖b‖₁ ≤ D‡_WL. Inequality (S14) along with Assumption 3(vi) implies that

1 − e^{−C_h C_h ‖b‖₁} ≤ C_h C_h (A − B)^{−1} μ² ν^{−2} C_g^{−1} C_h^{−1} |S_ᾱ| λ ≤ ϱ (< 1).

As a result, C_h C_h ‖b‖₁ ≤ −log(1 − ϱ) and hence

(1 − e^{−C_h C_h ‖b‖₁})/(C_h C_h ‖b‖₁) = ∫₀¹ e^{−C_h C_h ‖b‖₁ u} du ≥ e^{−C_h C_h ‖b‖₁} ≥ 1 − ϱ. (S15)

From this bound, inequality (S14) then leads to D‡_WL ≤ μ² ν^{−2} C_g^{−1} C_h^{−1} |S_ᾱ| λ².

If (S10) holds, then simple manipulation using D†_WL ≤ D‡_WL and (S13) together with Q = bᵀ Σ̃_h b gives

D‡_WL ≤ μ^{−2} C_g^{−1} C_h^{−1} {2(M + M) |S_γ̄| λ² + 2 M |S_ᾱ₁| λ²} {(C_h C_h ‖b‖₁)/(1 − e^{−C_h C_h ‖b‖₁})}. (S16)

Similarly as above, using (A − B) λ ‖b‖₁ ≤ D‡_WL and inequality (S16) along with Assumption 3(vi), we find

1 − e^{−C_h C_h ‖b‖₁} ≤ C_h C_h (A − B)^{−1} μ^{−2} C_g^{−1} C_h^{−1} {2(M + M) |S_γ̄| λ + 2 M |S_ᾱ₁| λ} ≤ ϱ (< 1).

As a result, C_h C_h ‖b‖₁ ≤ −log(1 − ϱ) and hence

(1 − e^{−C_h C_h ‖b‖₁})/(C_h C_h ‖b‖₁) = ∫₀¹ e^{−C_h C_h ‖b‖₁ u} du ≥ e^{−C_h C_h ‖b‖₁} ≥ 1 − ϱ. (S17)

From this bound, inequality (S16) then leads to D‡_WL ≤ μ^{−2} C_g^{−1} C_h^{−1} {2(M + M) |S_γ̄| λ² + 2 M |S_ᾱ₁| λ²}. Therefore, (53) holds through (S10) and (S11) in the event Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f. □

From the preceding proof, inequality (54) can also be deduced in the event Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f. In fact, return to the two possible cases of (S10) or (S11). If (S11) holds, then we have, by (S13) and (S15), bᵀ Σ̃_h b ≤ C_g^{−1} C_h^{−1} (1 − ϱ)^{−1} D†_WL. If (S10) holds, then we have, by (S13) and (S17), bᵀ Σ̃_h b ≤ C_g^{−1} C_h^{−1} (1 − ϱ)^{−1} D†_WL. Hence bᵀ Σ̃_h b ≤ C_g^{−1} C_h^{−1} {1 − (ϱ ∨ ϱ)}^{−1} D†_WL.

III.3 Proof of Theorem 3
We split the proof into two lemmas. The first one deals with the convergence of the AIPW estimator Ẽ{φ_DY(O; π̂, m̂₁, m̂)} in (55). Recall from (16) that

φ_DY(O; π̂, m̂₁, m̂) = {Z/π̂(X)} DY − {Z/π̂(X) − 1} m̂₁(X) m̂(X).

Lemma 14.
In the setting of Theorem 2, take r = M(|S_γ̄| λ + |S_ᾱ₁| λ) and r = M(|S_γ̄| λ + |S_ᾱ₁| λ + |S_ᾱ| λ). Then we have in the event (Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f) ∩ Ω_f ∩ Ω_r ∩ Ω_r,

|Ẽ{φ_DY(O; π̂, m̂₁, m̂)} − Ẽ{φ_DY(O; π̄, m̄₁, m̄)}| ≤ M |S_γ̄| λ(λ ∨ λ) + M |S_ᾱ₁| λ(λ ∨ λ) + M |S_ᾱ| λ²,

where M = (B + 1)M + M + (M + M)^{1/2} M + M, M = (M + M)^{1/2} M + M, and M = M^{1/2} M, depending on (M, M, M, M, M) in the proof.

Proof.
Consider the following decomposition:

φ_DY(O; π̂, m̂₁, m̂) = φ_DY(O; π̄, m̄₁, m̄) + δ + δ + δ, (S18)

where

δ = {m̂₁(X) m̂(X) − m̄₁(X) m̄(X)}{1 − Z/π̄(X)},
δ = {DY − m̄₁(X) m̄(X)}{Z/π̂(X) − Z/π̄(X)},
δ = {m̂₁(X) m̂(X) − m̄₁(X) m̄(X)}{Z/π̄(X) − Z/π̂(X)}.

Denote Δ = Ẽ(δ), Δ = Ẽ(δ), and Δ = Ẽ(δ). Then

Ẽ{φ_DY(O; π̂, m̂₁, m̂)} = Ẽ{φ_DY(O; π̄, m̄₁, m̄)} + Δ + Δ + Δ.

We handle the three terms separately. First, a Taylor expansion for the Δ term with {Z/π̂(X) − Z/π̄(X)} yields, for some u ∈ (0, 1),

Δ = −(γ̂ − γ̄)ᵀ Ẽ[Z{DY − m̄₁(X) m̄(X)} e^{−γ̄ᵀf} f] + (γ̂ − γ̄)ᵀ Ẽ[Z{DY − m̄₁(X) m̄(X)} e^{−γ̄ᵀf − u(γ̂ − γ̄)ᵀf} f fᵀ](γ̂ − γ̄)/2,

denoted as Δ + Δ. Because f is a subvector of h, we have in the event Ω ∩ Ω,

|Δ| ≤ ‖Ẽ[Z{DY − m̄₁(X) m̄(X)} e^{−γ̄ᵀf} f]‖_∞ ‖γ̂ − γ̄‖₁ ≤ B M |S_γ̄| λ λ.

Moreover, we have in the event Ω ∩ Ω_f ∩ Ω_f,

|Δ| ≤ e^{C_f ‖γ̂ − γ̄‖₁} Ẽ[Z |DY − m̄₁(X) m̄(X)| e^{−γ̄ᵀf} (γ̂ᵀf − γ̄ᵀf)²]/2 ≤ M |S_γ̄| λ²,

where M = e^{C_f M ϱ}[(σ + σ)^{1/2} M + {B_f + (σ + σ)^{1/2} B_f} M ϱ]/2. The second step follows because, by (48) and (S2) with b = γ̂ − γ̄,

Ẽ[Z w(X; γ̄) |DY − m₁(X; ᾱ₁) ψ_Y(ᾱᵀh)| (γ̂ᵀf − γ̄ᵀf)²] ≤ (σ + σ)^{1/2} M |S_γ̄| λ² + {B_f + (σ + σ)^{1/2} B_f} M ϱ |S_γ̄| λ².

Combining the preceding inequalities yields

|Δ| ≤ |Δ| + |Δ| ≤ B M |S_γ̄| λ λ + M |S_γ̄| λ². (S19)

Second, the Δ term with {Z/π̄(X) − Z/π̂(X)} can be decomposed as

Δ = Ẽ[m̄(X){m̂₁(X) − m̄₁(X)}{Z/π̄(X) − Z/π̂(X)}] + Ẽ[m̂₁(X){m̂(X) − m̄(X)}{Z/π̄(X) − Z/π̂(X)}],

denoted as Δ + Δ.
Similarly as (S7) for |m̂₁(X) − m̄₁(X)|, we have

|m̂(X) − m̄(X)| ≤ |(α̂ − ᾱ)ᵀh| ψ′_Y(0) e^{C_h(C_h + C_h ‖α̂ − ᾱ‖₁)}. (S20)

Combining this inequality with (S4) and (S7) and applying the Cauchy–Schwarz inequality shows that in the event Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f,

|Δ| ≤ e^{C_f ‖γ̂ − γ̄‖₁} Ẽ^{1/2}{Z w(X; γ̄)(γ̂ᵀf − γ̄ᵀf)²} × C̃_h Ẽ^{1/2}{Z w(X; γ̄)(m̂₁ − m̄₁)²} ≤ (M |S_γ̄| λ²)^{1/2} M^{1/2}(|S_γ̄| λ² + |S_ᾱ₁| λ²)^{1/2}, (S21)

and, with m̂₁(X) ∈ (0, 1),

|Δ| ≤ e^{C_f ‖γ̂ − γ̄‖₁} Ẽ^{1/2}{Z w(X; γ̄)(γ̂ᵀf − γ̄ᵀf)²} Ẽ^{1/2}{Z w(X; γ̄)(m̂ − m̄)²} ≤ (M |S_γ̄| λ²)^{1/2} M^{1/2}(|S_γ̄| λ² + |S_ᾱ₁| λ² + |S_ᾱ| λ²)^{1/2}, (S22)

where C̃_h = max{|ψ_Y(−C_h)|, |ψ_Y(C_h)|} and M = ψ′_D(0) C̃_h e^{C_f M ϱ + 2 C_g(C_g + C_g M ϱ)} M as in Lemma 9, and M = ψ′_Y(0) e^{C_f M ϱ + 2 C_h(C_h + C_h M(ϱ + ϱ))} M. From these two inequalities, we obtain

|Δ| ≤ (2M + M + M) |S_γ̄| λ²/2 + (M + M) |S_ᾱ₁| λ²/2 + M |S_ᾱ| λ²/2. (S23)

Finally, the Δ term with {1 − Z/π̄(X)} can be decomposed as

Δ = Ẽ[m̄(X){m̂₁(X) − m̄₁(X)}{1 − Z/π̄(X)}] + Ẽ[m̂₁(X){m̂(X) − m̄(X)}{1 − Z/π̄(X)}],

denoted as Δ + Δ. Recall that π̄(X) ≡ π*(X) because model (6) is correctly specified. Take r = M(|S_γ̄| λ + |S_ᾱ₁| λ) in Lemma 6. Then in the event Ω ∩ Ω_r, we have ‖α̂₁ − ᾱ₁‖₁ ≤ r and hence

|Δ| ≤ 4(e^{−C_f} + 1) C_g max{|ψ_Y(−C_h)|, |ψ_Y(C_h)|} ψ′_D(0) e^{C_g(C_g + C_g r)} r λ ≤ M(|S_γ̄| λ + |S_ᾱ₁| λ) λ,

where M = 4(e^{−C_f} + 1) C_g C̃_h ψ′_D(0) e^{C_g(C_g + C_g M ϱ)} M. Take r = M(|S_γ̄| λ + |S_ᾱ₁| λ + |S_ᾱ| λ) in Lemma 7.
Then in the event (Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f) ∩ Ω_r, we have ‖α̂ − ᾱ‖₁ ≤ r and hence

|Δ| ≤ 4(e^{−C_f} + 1) C_h ψ′_Y(0) e^{C_h(C_h + C_h r)} r λ ≤ M(|S_γ̄| λ + |S_ᾱ₁| λ + |S_ᾱ| λ) λ,

where M = 4(e^{−C_f} + 1) C_h ψ′_Y(0) e^{C_h(C_h + C_h M(ϱ + ϱ))} M. Hence

|Δ| ≤ M(|S_γ̄| λ + |S_ᾱ₁| λ) λ + M(|S_γ̄| λ + |S_ᾱ₁| λ + |S_ᾱ| λ) λ. (S24)

Combining (S19), (S23), and (S24) yields the desired result. □

The second lemma deals with the convergence of the mean squared difference between φ_DY(O; π̂, m̂₁, m̂) and φ_DY(O; π̄, m̄₁, m̄) in (56) for variance estimation.

Lemma 15.
In the setting of Theorem 2, we have in the event Ω ∩ Ω ∩ Ω_h ∩ Ω_f ∩ Ω_f,

|Ẽ[{φ_DY(O; π̂, m̂₁, m̂) − φ_DY(O; π̄, m̄₁, m̄)}²]| ≤ M(|S_γ̄| λ² + |S_ᾱ₁| λ² + |S_ᾱ| λ²) + M(|S_γ̄| λ² + |S_ᾱ₁| λ² + |S_ᾱ| λ²),

where M = 3e^{−C_f} M + 6e^{−C_f}(1 + e^{C_f M ϱ})²(M + M) and M = 6(1 + e^{−C_f})²(M + M), depending on (M, M, M, M, M) in the proof.

Proof.
Return to the decomposition (S18). Then

Ẽ[{φ_DY(O; π̂, m̂₁, m̂) − φ_DY(O; π̄, m̄₁, m̄)}²] = Ẽ{(δ + δ + δ)²} ≤ 3Ẽ(δ²) + 3Ẽ(δ²) + 3Ẽ(δ²).

We handle the three terms separately. First, we have by (S4) and Assumption 1(i)–(ii),

Ẽ(δ²) = Ẽ[Z{DY − m̄₁(X) m̄(X)}²{1/π̂(X) − 1/π̄(X)}²] ≤ e^{−C_f + 2C_f ‖γ̂ − γ̄‖₁} Ẽ[Z w(X; γ̄){DY − m̄₁(X) m̄(X)}²(γ̂ᵀf − γ̄ᵀf)²] ≤ e^{−C_f} M |S_γ̄| λ², (S25)

where M = e^{C_f M ϱ}[(σ + σ) M + {B_f + (σ + σ) B_f} M ϱ] as in Lemma 9. Second, writing 1/π̂(X) − 1/π̄(X) = e^{−γ̄ᵀf}(e^{−γ̂ᵀf + γ̄ᵀf} − 1) and using Assumption 1(i)–(ii), we have

Ẽ(δ²) = Ẽ[{m̂₁(X) m̂(X) − m̄₁(X) m̄(X)}²{Z/π̄(X) − Z/π̂(X)}²]
≤ e^{−C_f}(1 + e^{C_f ‖γ̂ − γ̄‖₁})² Ẽ[Z w(X; γ̄){m̂₁(X) m̂(X) − m̄₁(X) m̄(X)}²]
≤ 2e^{−C_f}(1 + e^{C_f ‖γ̂ − γ̄‖₁})² [Ẽ{Z w(X; γ̄) m̄²(m̂₁ − m̄₁)²} + Ẽ{Z w(X; γ̄) m̂₁²(m̂ − m̄)²}].

Then we obtain similarly as in (S21)–(S22),

Ẽ(δ²) ≤ 2e^{−C_f}(1 + e^{C_f M ϱ})² × {M(|S_γ̄| λ² + |S_ᾱ₁| λ²) + M(|S_γ̄| λ² + |S_ᾱ₁| λ² + |S_ᾱ| λ²)}, (S26)

where M and M are as in Lemma 14. Finally, using Assumption 1(i)–(ii), we also have

Ẽ(δ²) = Ẽ[{m̂₁(X) m̂(X) − m̄₁(X) m̄(X)}²{1 − Z/π̄(X)}²] ≤ (1 + e^{−C_f})² Ẽ[{m̂₁(X) m̂(X) − m̄₁(X) m̄(X)}²] ≤ 2(1 + e^{−C_f})² [Ẽ{m̄²(m̂₁ − m̄₁)²} + Ẽ{m̂₁²(m̂ − m̄)²}].

Then we obtain similarly as in (S21)–(S22) but using the bounds |α̂₁ᵀg − ᾱ₁ᵀg| ≤ C_g ‖α̂₁ − ᾱ₁‖₁ and |α̂ᵀh − ᾱᵀh| ≤ C_h ‖α̂ − ᾱ‖₁,

Ẽ(δ²) ≤ 2(1 + e^{−C_f})² × {M(|S_γ̄| λ² + |S_ᾱ₁| λ²) + M(|S_γ̄| λ² + |S_ᾱ₁| λ² + |S_ᾱ| λ²)}, (S27)

where M = C_g ψ′_D(0) C̃_h e^{C_g(C_g + C_g M ϱ)} M and M = C_h ψ′_Y(0) e^{C_h(C_h + C_h M(ϱ + ϱ))} M. Combining (S25)–(S27) yields the desired result. □

References
Tan, Z. (2020a). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika, 107, 137–158.

Tan, Z. (2020b). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Annals of Statistics, 48, 811–837.