Doubly Robust Semiparametric Difference-in-Differences Estimators with High-Dimensional Data
Yang Ning, Sida Peng, Jing Tao

Abstract:
This paper proposes a doubly robust two-stage semiparametric difference-in-differences estimator for estimating heterogeneous treatment effects with high-dimensional data. Our new estimator is robust to model misspecification and allows for, but does not require, many more regressors than observations. The first stage allows a general set of machine learning methods to be used to estimate the propensity score. In the second stage, we derive the rates of convergence for both the parametric parameter and the unknown function under a partially linear specification for the outcome equation. We also provide bias correction procedures to allow for valid inference for the heterogeneous treatment effects. We evaluate the finite sample performance with extensive simulation studies. Additionally, a real data analysis on the effect of the Fair Minimum Wage Act on the unemployment rate is performed as an illustration of our method. An R package for implementing the proposed method is available on GitHub.

Keywords: Difference-in-differences; High-dimensional data; Machine learning; Partially linear models; Two-stage regression.
JEL codes: C13, C14, C31

Department of Statistical Science, Cornell University, [email protected]
Microsoft Research, [email protected]
Department of Economics, University of Washington, [email protected]
https://github.com/psdsam/HDdiffindiff

1. Introduction

This paper proposes a doubly robust two-stage semiparametric difference-in-differences estimator for estimating heterogeneous treatment effects conditional on high-dimensional covariates. The difference-in-differences (DiD) design has been widely adopted in policy evaluation, from academia to industry, when a real experiment is expensive or infeasible. When a policy/feature only affects a fraction of the population, the DiD design can identify the average treatment effect on the treated (ATT) based on observational data.
It is based on the simple idea of comparing the difference in pre- and post-treatment outcomes between individuals who are affected and those who are not affected by the policy/feature of interest.

A key identification assumption for the classical DiD design is the parallel trend assumption. It requires that the outcome variables for treated and non-treated individuals would have followed parallel paths over time in the absence of treatment. However, this assumption ignores the potential selection problem due to individual heterogeneity. For example, a company might want to evaluate the effect of an email marketing campaign (advertisement through email with an embedded promo link). A researcher can compare the customers' conversion rate (whether a purchase was made) before and after the campaign for a group of treated customers (who clicked the link) and a group of non-treated customers (who did not click the link). If existing customers are more likely to click the link and also more likely to purchase again even without the campaign intervention, the classical DiD estimator will have a positive bias and exaggerate the effect of the campaign. To account for such cases, Abadie (2005) proposed a two-stage semiparametric estimator with the so-called conditional parallel trend assumption. In this framework, a propensity score is estimated in the first stage to explicitly account for any observed confounders that may affect both the treatment take-up and the outcome growth trend.

While the semiparametric DiD (semi-DiD) estimator is widely used by researchers in academia and industry, three major challenges arise in practice. First, the semi-DiD estimator becomes difficult to implement when there are too many covariates. Following the previous example, researchers may also observe customers' browsing history and may suspect that customers who visited certain (unknown) websites are more likely to click the link while also more likely to make a purchase.
However, the semi-DiD estimator cannot be implemented if the number of attributes exceeds the number of observations. Therefore, researchers may be forced to select covariates based on their intuition or insights, which may lead to further biases (Belloni et al. (2017)). Even when the number of observations is larger than the number of covariates, the semi-DiD estimator may still contain a large bias when too many covariates are included (Cattaneo, Jansson and Ma (2019)). Second, the semi-DiD estimator is sensitive to the choice of specification for propensity score estimation. This is a similar problem to the one faced by the inverse propensity score weighted (IPW) estimator, and it becomes a more subtle problem if machine learning methods (e.g. random forest, neural network, etc.) are used to predict the propensity score. Third, the conditional or heterogeneous treatment effect on the treated is often of interest to practitioners. While the semi-DiD framework provides a way to estimate conditional or heterogeneous ATT with a low-dimensional vector of covariates, it is not clear how this framework can be extended to the high-dimensional case (e.g. how to develop estimation and inference methods and theory).

In this paper, we propose a new estimator to solve the above three problems. Our doubly robust DiD (Dr-DiD) estimator is robust to model misspecification under high-dimensional covariates. We show that the desired rate of convergence of our estimator can be achieved as long as either the propensity score function or the outcome equation can be approximated asymptotically at a moderate rate. Thus, a general set of machine learning methods can be used in our framework. Although the diff-in-diff design yields an ATT estimator, we show that the semi-DiD estimator can be extended to an augmented inverse propensity score weighted (AIPW) estimator.
We show that the extended AIPW form still preserves the double robustness property under the parallel trend assumption.

To further incorporate high-dimensional covariates and heterogeneous treatment effects, we consider a partially linear specification for the potential outcome. The partially linear form is composed of a nonparametric specification over a set of low-dimensional covariates as well as a linear parametric specification over a set of high-dimensional covariates, which provides a flexible functional form to model the potential outcome. This is a very useful specification in real-world applications. For example, researchers may be interested in the nonlinear relationship between the outcome variable and a set of covariates while also facing a large number of indicator variables such as age, gender and region.

We derive the rate of convergence for our estimator as well as a de-biasing procedure for inference. We show that the high-dimensional linear part of the estimator can achieve the oracle rate of convergence, while the nonparametric part maintains the nonparametric rate of convergence. With bias correction, the high-dimensional linear part can achieve normality at the $\sqrt{n}$-rate, while the nonparametric part can achieve normality at the nonparametric rate. Finally, we demonstrate the finite sample performance of our estimator in a simulation study and apply our estimator to study the effect of the Fair Minimum Wage Act on the unemployment rate using the data collected by Callaway and Li (2020). We show that the heterogeneity in the effect of this policy can be explained by variations in demographics. More specifically, counties with larger population and higher median income are more likely to suffer from an increase in the unemployment rate. These findings coincide with canonical economic theory on the unemployment rate.
For example, a higher median income level implies a higher substitution cost for workers currently at minimum wage and thus leads to an increase in the unemployment rate when the minimum wage rises. On the other hand, regions with larger population sizes have more labor supply, and thus a minimum wage raise can also lead to a labor surplus.

In summary, the main contributions of this work are as follows: first, we propose a doubly robust approach to estimate heterogeneous ATT conditional on covariates for DiD models that allows either the propensity scores or the model for ATT to be misspecified. Second, we propose a regularized two-stage estimation procedure for DiD models that allows (i) suitable machine learning tools to estimate the first-stage propensity scores and (ii) high-dimensional covariates and a nonparametric specification for the heterogeneous ATT in the second stage. Third, we provide a novel approach to simultaneously correct the biases due to both stages and provide a novel statistical inference procedure based on the de-biased estimator. Finally, as a useful byproduct, we derive novel estimation and inference methods for a partially linear model for both the high-dimensional parametric parameter and the nonparametric function.

This paper is related to the vast literature on robust estimation and inference for treatment effects models; see, for example, Robins, Rotnitzky and Zhao (1994), Tan (2006), Chen et al. (2008), Graham, de Xavier Pinto and Egel (2012), Okui et al. (2012), Farrell (2015), Vermeulen and Vansteelandt (2015), Ogburn, Rotnitzky and Robins (2015), Belloni et al. (2017), Lee, Okui and Whang (2017), Chernozhukov et al. (2018b), Słoczyński and Wooldridge (2018), Kennedy, Lorch and Small (2019) and Tan (2020), among many others. Our work is particularly closely related to a recent work independently developed by Sant'Anna and Zhao (2020). Both are based on the seminal framework proposed in Abadie (2005).
Our estimator complements theirs as we focus on estimation and inference for heterogeneous ATT conditional on covariates in a high-dimensional setting, while Sant'Anna and Zhao (2020) focus on efficient estimation of ATT when the dimension of covariates is fixed and much smaller than the sample size.

This paper also contributes to the literature by connecting the widely used DiD estimator with the machine learning/high-dimensional statistics literature. The DiD estimator has been an active research field in the economics literature, e.g. Card and Krueger (1994), Abadie (2005), Athey and Imbens (2006), Imai, Kim and Wang (2019), Athey and Imbens (2019), Callaway and Sant'Anna (2019), among others. Our paper proposes a specific DiD estimator so that high-dimensional/machine learning tools can be applied. This paper also contributes to a set of works that apply machine learning tools to causal inference. This includes Chernozhukov et al. (2016), Belloni et al. (2017), Semenova and Chernozhukov (2017), Chernozhukov et al. (2018b), Syrgkanis et al. (2019), Fan et al. (2020), Tan (2020), etc. Our work is distinguished from this literature in two ways. First, we propose a doubly robust diff-in-diff estimator in the high-dimensional/machine learning setting, which has not been studied. Second, to the best of our knowledge, the doubly robust estimators in these papers use various high-dimensional sets of covariates and machine learning methods to deal with selection into treatment. However, the ultimate parameter of interest in the second stage involves only a low-dimensional subset of the covariates, so traditional nonparametric estimators apply. By contrast, our parameter of interest in the second stage contains both high-dimensional covariates in the parametric part and an unknown function, which brings substantial challenges for estimation and inference. We construct a new Neyman orthogonal moment condition (Chernozhukov et al.
(2016)) and propose de-biased estimators for both the parametric parameters and the nonparametric function in the second stage to construct valid confidence intervals.

Moreover, as useful by-products, we provide an inference method for a partially linear model for both the parametric parameter and the nonparametric function when the linear part contains high-dimensional covariates. Thus, this work is related to recent discussions in Müller and Van de Geer (2015), Ma and Huang (2016), Yu et al. (2019), Zhu, Yu and Cheng (2019), among others. Our paper departs from the existing papers in three aspects. First, our partially linear form is in the second-stage outcome equation, so the estimation and inference results have to take the first-stage estimators into consideration, while the existing papers focus on a one-stage regression problem. Second, the above papers propose estimators with penalized estimation in functional space. As is pointed out in Shen (1997), this approach often leads to undesirable properties of the estimates, such as inconsistency and roughness. Moreover, such an optimization procedure is difficult to implement in practice. Therefore, we consider the extension to sieve estimation by approximating the nonparametric function with sieves, so that we carry out optimization within a dense subset of the infinite-dimensional space, which is finite-dimensional and therefore easy to work with. Finally, the existing literature is concerned with asymptotic theories and inference procedures for the parametric parameters only. The nonparametric function is profiled out as an infinite-dimensional nuisance parameter. This paper considers the joint asymptotic theory and inference methods when the parameters of interest are not only the parametric parameter but also the nonparametric function. To the best of our knowledge, these results are new to the literature. We show that the parametric parameter converges to a normal distribution at a $\sqrt{n}$-rate.
The parametric estimator achieves the semiparametric efficiency bound when the error term is homoskedastic, while a functional of the nonparametric function converges to a normal distribution at a nonparametric rate. We observe that the marginal asymptotic variance for the nonparametric component is, in general, different from those derived without the high-dimensional parametric parameter, i.e., Newey (1997), Belloni et al. (2015) and Chen and Christensen (2015). This result may be of independent interest to the readers.

The paper is organized as follows. The estimator is proposed in Section 2. The rate of convergence of the estimator and the inference theory are developed in Sections 3 and 4, respectively. Section 5 presents extensive simulation results to evaluate the finite sample performance. An empirical study on the effect of the Fair Minimum Wage Act on the unemployment rate is presented in Section 6. Section 7 concludes. We defer the proofs to the Appendices.
Notation. For a vector $x = (x_1, \ldots, x_d)^\top \in \mathbb{R}^d$ and $1 \le q \le \infty$, let $\|x\|_q = \big(\sum_{i=1}^d |x_i|^q\big)^{1/q}$, $\|x\|_\infty = \max_{1 \le i \le d} |x_i|$ and $\|x\|_0 = |\mathrm{supp}(x)|$, where $\mathrm{supp}(x) = \{j : x_j \neq 0\}$ and $|a|$ is the cardinality of a set $a$. For a symmetric matrix $A$, let $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$ be the maximum and minimum eigenvalues of $A$. For a matrix $B = [B_{jk}]$, let $\|B\|_{\max} = \max_{jk} |B_{jk}|$, $\|B\|_1 = \sum_{jk} |B_{jk}|$, $\|B\|_2 = \sqrt{\Lambda_{\max}(B^\top B)}$ and $\|B\|_{\ell_\infty} = \max_j \sum_k |B_{jk}|$. For any function $f : \mathcal{Z} \to \mathbb{R}$, let $\|f\|_\infty = \sup_{z \in \mathcal{Z}} |f(z)|$, $\|f\|_{P,2} = \sqrt{\mathbb{E} f^2(z)}$ and $\|f\|_n = \sqrt{n^{-1} \sum_{i=1}^n f^2(Z_i)}$. We denote by $I_d$ the $d \times d$ identity matrix. For a set $S \subseteq \{1, \ldots, d\}$, let $x_S = \{x_j : j \in S\}$ and let $S^c$ be the complement of $S$. Let $S$ be the set of all non-zero components of $\beta_0$ and $s = |S|$. We use $\nabla_S f(x)$ to denote the gradient of $f(x)$ with respect to $x_S$. Given $a, b \in \mathbb{R}$, let $a \vee b$ and $a \wedge b$ denote the maximum and minimum of $a$ and $b$. For two positive sequences $a_n$ and $b_n$, let $a_n \asymp b_n$ denote $C \le a_n / b_n \le C'$ for some $C, C' > 0$; let $a_n \lesssim b_n$ denote $a_n \le C b_n$ for some constant $C > 0$. Also, we write $a_n = O(b_n)$ if $|a_n| \le C |b_n|$. We write $X_n \to_p a$ for some constant $a$ if a sequence of random variables $X_n$ converges in probability to $a$. Similarly, if $X_n$ converges weakly to $X$, we write $X_n \rightsquigarrow X$ for some random variable $X$. For notational simplicity, we use $C$, $C'$ and $C''$ to denote generic constants, whose values can change from line to line. Let $\mathbb{E}_n f = n^{-1} \sum_{i=1}^n f(X_i)$ and $\mathbb{G}_n f = \mathbb{E}_n f - \mathbb{E} f$.

A random variable $X$ is called sub-exponential if there exists some positive constant $K_1$ such that $P(|X| > t) \le \exp(1 - t/K_1)$ for all $t \ge 0$. The sub-exponential norm of $X$ is defined as $\|X\|_{\psi_1} = \sup_{q \ge 1} q^{-1} (\mathbb{E}|X|^q)^{1/q}$. Similarly, a random variable $X$ is called sub-Gaussian if there exists some positive constant $K_2$ such that $P(|X| > t) \le \exp(1 - t^2/K_2^2)$ for all $t \ge 0$, and the sub-Gaussian norm of $X$ is defined as $\|X\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (\mathbb{E}|X|^q)^{1/q}$.
2. Doubly Robust DiD Estimator
Denote by $Y_0(i,t)$ the potential outcome of individual $i$ at time $t$ when not treated and by $Y_1(i,t)$ the potential outcome of individual $i$ at time $t$ when treated. We cannot observe both $Y_0(i,t)$ and $Y_1(i,t)$ for the same individual, but we observe the realized outcome for individual $i$ at time $t$ as
$$Y(i,t) = D_i Y_1(i,t) + (1 - D_i) Y_0(i,t),$$
where $D_i$ is the treatment status at time $t = 1$. For some observed covariates $W_i = (X_i, Z_i)$, we want to learn the heterogeneous treatment effect on the treated conditional on the covariates $W_i$, namely
$$ATT(W_i) = \tau(W_i) := \mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1]. \quad (2.1)$$
Our parameter of interest is different from the ATT (e.g. Sant'Anna and Zhao (2020)), which is defined as
$$ATT = \mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid D_i = 1]. \quad (2.2)$$
While it is useful to know (2.2), a doubly robust estimator for (2.1) conditional on the covariates can also be relevant and important in empirical applications when the parameter of interest is the heterogeneous treatment effect conditional on the covariates.

As pointed out in Abadie (2005), the conventional DiD estimator is based on the strong assumption that outcomes for treated and non-treated groups or individuals would have followed parallel paths over time in the absence of treatment. That assumption can easily be violated when differences in observed characteristics create non-parallel outcome dynamics between treated and non-treated populations. Abadie (2005) generalizes this assumption by allowing the parallel trend assumption to hold after conditioning on the covariates as follows:

Assumption 1. $\mathbb{E}[Y_0(i,1) - Y_0(i,0) \mid W_i, D_i = 1] = \mathbb{E}[Y_0(i,1) - Y_0(i,0) \mid W_i, D_i = 0]$.

In addition, a full support assumption guarantees the existence of the propensity score function.
Assumption 2.
With probability approaching 1, there exists a constant $c > 0$ such that $c < P(D_i = 1 \mid W_i) < 1 - c$.

Under Assumptions 1 and 2, the Abadie (2005) estimand can be written as
$$\mathbb{E}\left[\frac{D_i - P(D_i = 1 \mid W_i)}{P(D_i = 1 \mid W_i)\, P(D_i = 0 \mid W_i)}\,(Y(i,1) - Y(i,0)) \,\Big|\, W_i\right]. \quad (2.3)$$
Defining $\Delta Y_i := Y(i,1) - Y(i,0)$, we have
$$\mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1] = \mathbb{E}\left[\frac{D_i - P(D_i = 1 \mid W_i)}{P(D_i = 1 \mid W_i)\, P(D_i = 0 \mid W_i)}\,\Delta Y_i \,\Big|\, W_i\right] = \mathbb{E}\left[\frac{D_i \Delta Y_i}{P(D_i = 1 \mid W_i)} \,\Big|\, W_i\right] - \mathbb{E}\left[\frac{(1 - D_i)\Delta Y_i}{P(D_i = 0 \mid W_i)} \,\Big|\, W_i\right]. \quad (2.4)$$
It is easy to see that Equation (2.4) is in the form of the Horvitz-Thompson estimator (Horvitz and Thompson (1952)). As a natural extension of the IPW-form estimator, we study whether a doubly robust form exists in the DiD setting, and this leads to our parameters of interest as follows. Define
$$\Delta Y_i^1 := Y_1(i,1) - Y_1(i,0), \quad \Phi_1(W_i) := \mathbb{E}[\Delta Y_i^1 \mid W_i, D_i = 1], \qquad \Delta Y_i^0 := Y_0(i,1) - Y_0(i,0), \quad \Phi_0(W_i) := \mathbb{E}[\Delta Y_i^0 \mid W_i, D_i = 0].$$
With
$$\rho_i = \frac{D_i - P(D_i = 1 \mid W_i)}{P(D_i = 1 \mid W_i)\, P(D_i = 0 \mid W_i)},$$
our doubly robust estimand is defined as
$$\tau(W_i) = \mathbb{E}\big[\rho_i \big(\Delta Y_i - P(D_i = 0 \mid W_i)\Phi_1(W_i) - P(D_i = 1 \mid W_i)\Phi_0(W_i)\big) \mid W_i\big], \quad (2.5)$$
where $P(D_i = 0 \mid W_i)$, $P(D_i = 1 \mid W_i)$, $\Phi_1(W_i)$ and $\Phi_0(W_i)$ are nuisance functions to be estimated in the first stage.

Lemma 1.
Under Assumptions 1 and 2, (i) the estimand defined in Equation (2.5) is doubly robust in the sense that
$$\mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1] = \mathbb{E}\big[\rho_i \big(\Delta Y_i - P(D_i = 0 \mid W_i)\Phi_1(W_i) - P(D_i = 1 \mid W_i)\Phi_0(W_i)\big) \mid W_i\big]$$
holds provided that one of the two conditions (a) or (b) holds, even if both do not hold simultaneously: (a) the specifications of $\Phi_0(W_i)$ and $\Phi_1(W_i)$ are correct; (b) the specification of $P(D_i = 1 \mid W_i)$ is correct.

(ii) Let $\alpha = (\Phi_0(\cdot), \Phi_1(\cdot), \pi(\cdot))$, $\pi(W) = P(D = 1 \mid W)$ and
$$\Upsilon(W; \alpha) = \rho\,[\Delta Y - (1 - \pi(W))\Phi_1(W) - \pi(W)\Phi_0(W)].$$
Then the moment condition $\mathbb{E}[\Upsilon(W_i; \alpha_0) - \tau(W_i) \mid W_i = w] = 0$ holds and the following Neyman orthogonality condition holds:
$$\partial_r\, \mathbb{E}[\Upsilon(W_i; \alpha_0 + r(\alpha - \alpha_0)) - \tau(W_i) \mid W_i = w]\,\big|_{r=0} = 0.$$

Lemma 1 shows that, under Assumptions 1 and 2, the estimand for $\tau(W_i)$ remains valid even when either the regression models $\Phi_0(\cdot)$ and $\Phi_1(\cdot)$ are misspecified or the propensity score $P(D_i = 1 \mid W_i)$ is misspecified.

To model $\tau(\cdot)$, we consider a class of flexible high-dimensional partially linear models such that
$$\mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1] = X_i^\top \beta_0 + f_0(Z_i), \quad (2.6)$$
where the linear part contains the parametric Euclidean vector $\beta_0 \in \mathcal{B} \subseteq \mathbb{R}^p$ with $p > n$ allowed, and the nonparametric part contains an unknown function $f_0(\cdot) : \mathcal{Z} \to \mathbb{R}$, where $\mathcal{Z}$ is a compact subset of $\mathbb{R}^{d_z}$. We will assume that the unknown function belongs to a smooth function class defined in Section 3.

Compared with the definition in equation (11) of Abadie (2005), we define our estimand in equation (2.6) in a partially linear form rather than approximating it with a best linear predictor.
The semiparametric structure is slightly stronger, as equation (11) in Abadie (2005) is satisfied if we plug in equation (2.6) and allow $g(X_k, \theta)$ to admit a partially linear specification. On the other hand, the partially linear specification in (2.6) provides a flexible functional form while still allowing us to maintain the Neyman orthogonality condition when designing the estimator with high-dimensional covariates. Theoretical properties of the semiparametric partially linear model when the dimension of $X$ is fixed and smaller than $n$ have been thoroughly discussed in the econometrics literature (Engle et al. (1986), Robinson (1988), Ahn and Powell (1993), Donald and Newey (1994), Linton (1995), Fan and Li (1999), to mention only a few; see Li and Racine (2007) for a review). We complement the literature by providing new estimation and inference methods and theory when $X$ is high-dimensional.

As a result of Lemma 1 and equation (2.6), if Assumptions 1 and 2 hold, we have
$$(\beta_0, f_0) = \arg\min_{(\beta \in \mathcal{B},\, f \in \mathcal{F})} \mathbb{E}\Big[\big\{X_i^\top \beta + f(Z_i) - \rho_i\big(\Delta Y_i - (1 - \pi(W_i))\Phi_1(W_i) - \pi(W_i)\Phi_0(W_i)\big)\big\}^2\Big]. \quad (2.7)$$
We are going to construct a two-step estimator of $(\beta_0, f_0)$ based on the sample analogue of (2.7), where the first step estimates $\rho_i$ and the second step estimates $(\beta_0, f_0)$. We allow the propensity score, and hence $\rho_i$, to be estimated by any suitable machine learning method as long as certain conditions in Section 3 are satisfied.
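The double robustness in Lemma 1 can be checked numerically. The following sketch uses a toy data-generating process chosen purely for illustration (scalar $W$, a linear propensity score and simple outcome-growth models; none of these choices come from the paper): it compares the sample analogue of the estimand in (2.5), averaged over $W_i$, under (a) a correct propensity score with misspecified outcome models and (b) a misspecified propensity score with correct outcome models, against plain IPW with the same wrong propensity score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP (not from the paper): scalar W, true propensity
# pi(w) = 0.3 + 0.4 w, growth models Phi1(w) = 1 + w and Phi0(w) = w^2.
W = rng.uniform(0, 1, n)
pi = 0.3 + 0.4 * W
D = rng.binomial(1, pi)
Phi1, Phi0 = 1 + W, W**2
dY = np.where(D == 1, Phi1, Phi0) + 0.5 * rng.standard_normal(n)

true_att = 1 + 0.5 - 1/3  # E[Phi1(W) - Phi0(W)] = 7/6 under this DGP

def dr_att(pi_hat, phi1_hat, phi0_hat):
    """Sample analogue of the doubly robust estimand (2.5), averaged over W_i."""
    rho = (D - pi_hat) / (pi_hat * (1 - pi_hat))
    return np.mean(rho * (dY - (1 - pi_hat) * phi1_hat - pi_hat * phi0_hat))

# (a) correct propensity score, badly misspecified outcome models (set to 0):
#     reduces to the IPW form (2.4), which is still consistent.
est_a = dr_att(pi, 0.0, 0.0)
# (b) misspecified propensity score (constant 0.5), correct outcome models.
est_b = dr_att(np.full(n, 0.5), Phi1, Phi0)
# Plain IPW with the same wrong propensity score is biased.
est_ipw_wrong = dr_att(np.full(n, 0.5), 0.0, 0.0)

print(est_a, est_b, est_ipw_wrong, true_att)
```

Both doubly robust variants recover the average of $\tau(W_i)$ up to simulation noise, while the misspecified IPW estimate does not, which is exactly the content of conditions (a) and (b) in Lemma 1.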
3. Estimation
Let $\hat\pi(\cdot)$, $\hat\Phi_0(\cdot)$ and $\hat\Phi_1(\cdot)$ be nonparametric or machine learning estimators of $\pi(W_i)$, $\Phi_0(W_i)$ and $\Phi_1(W_i)$, respectively. We propose the following two-stage estimator:
$$(\hat\beta, \hat f) = \arg\min_{\beta \in \mathcal{B},\, f_n \in \mathcal{F}_n} \mathbb{E}_n\Big[\big(X_i^\top \beta + f_n(Z_i) - \hat\rho_i\big(\Delta Y_i - (1 - \hat\pi_i)\hat\Phi_1(W_i) - \hat\pi_i \hat\Phi_0(W_i)\big)\big)^2\Big] + \lambda \|\beta\|_1, \quad (3.1)$$
where $f_n(\cdot) = \psi^{k_n}(\cdot)^\top \gamma_n$ is a sieve approximation of the unknown function $f_0(\cdot) \in \mathcal{F}$ with
$$f_0(Z_i) = \sum_{j=1}^{k_n} \psi_j(Z_i)\gamma_{j,n} + r_n(Z_i) := f_n(Z_i) + r_{ni},$$
where $r_{ni} := r_n(Z_i)$, $i = 1, \ldots, n$, is a sieve approximation error that depends on the smoothness of $f_0$ and the sample size $n$.

For $\alpha > 0$ and a vector $\tau = (\tau_1, \ldots, \tau_{d_z})$ of $d_z$ integers, define the differential operator $D^\tau = \partial^{\tau.}/\partial z_1^{\tau_1} \cdots \partial z_{d_z}^{\tau_{d_z}}$, where $\tau. = \sum_{l=1}^{d_z} \tau_l$. Let $\underline{\alpha}$ be the largest integer strictly smaller than $\alpha$. For a function $g : \mathcal{Z} \to \mathbb{R}$, let
$$\|g\|_{\infty,\alpha} = \max_{\tau. \le \underline{\alpha}} \sup_z |D^\tau g(z)| + \max_{\tau. = \underline{\alpha}} \sup_{z \neq z'} \frac{|D^\tau g(z) - D^\tau g(z')|}{\|z - z'\|^{\alpha - \underline{\alpha}}}. \quad (3.2)$$
Let $C^\alpha_M(\mathcal{Z})$ be the set of all continuous functions $g : \mathcal{Z} \to \mathbb{R}$ with $\|g\|_{\infty,\alpha} \le M$. We assume that $\mathcal{F} \subseteq C^\alpha_M(\mathcal{Z})$. Let $\psi^{k_n}(Z_i) = (\psi_1(Z_i), \ldots, \psi_{k_n}(Z_i))^\top$ be a $k_n \times 1$ vector of basis functions and let $\mathcal{F}_n$ represent the space of sieve functions. Define the projection of $X_{ij}$, $i = 1, \ldots, n$, onto $\mathcal{F}_n$ as
$$\Pi_n(X_{ij} \mid Z) = \arg\min_{h^* \in \mathcal{F}_n} \|X_{ij} - h^*\|_n = \psi^{k_n}(Z)\big(\psi^{k_n}(Z)^\top \psi^{k_n}(Z)\big)^{-1}\psi^{k_n}(Z)^\top X_{ij}, \quad j = 1, \ldots, p,$$
and $\Pi_{n, X_i \mid Z_i} := (\Pi_n(X_{i1} \mid Z_i), \ldots, \Pi_n(X_{ip} \mid Z_i))^\top$, $i = 1, \ldots, n$.
Next, define
$$\hat S_i := \hat\rho(W_i)\big(\Delta Y_i - (1 - \hat\pi(W_i))\hat\Phi_1(W_i) - \hat\pi(W_i)\hat\Phi_0(W_i)\big), \qquad S_i := \rho(W_i)\big(\Delta Y_i - (1 - \pi(W_i))\Phi_1(W_i) - \pi(W_i)\Phi_0(W_i)\big),$$
$\Pi_{n, X \mid Z} := P_Z X$, $P_Z := \Psi_n(\Psi_n^\top \Psi_n)^{-1}\Psi_n^\top$ and $\tilde X := X - \Pi_{n, X \mid Z}$, where $\Psi_n := \Psi_n(Z) = (\psi^{k_n}(Z_1)^\top, \ldots, \psi^{k_n}(Z_n)^\top)^\top$ is an $n \times k_n$ matrix. Let $\bar X = X - \Pi_{X \mid Z} = (X_1^\top - \Pi^\top_{X_1 \mid Z_1}, \ldots, X_n^\top - \Pi^\top_{X_n \mid Z_n})^\top$, where $\Pi_{X_i \mid Z_i} = \mathbb{E}[X_i \mid Z_i]$. Define $\eta_n = S - X\beta_0 - \Psi_n \gamma_n = \epsilon + r_n$, where $\epsilon = (\epsilon_1, \ldots, \epsilon_i, \ldots, \epsilon_n)^\top$,
$$\epsilon_i = \frac{D_i}{\pi_i}\epsilon_{1i} - \frac{1 - D_i}{1 - \pi_i}\epsilon_{0i}, \qquad \epsilon_{1i} = \Delta Y_i^1 - \Phi_1(W_i), \qquad \epsilon_{0i} = \Delta Y_i^0 - \Phi_0(W_i),$$
and $r_n = (r_{n1}, \ldots, r_{ni}, \ldots, r_{nn})^\top$. We have the following decomposition:
$$\|X\beta + f_n\|_n^2 = \|\tilde X\beta\|_n^2 + \|\Pi_{n, X \mid Z}\beta + f_n\|_n^2.$$

Assumption 3. (i) The data are i.i.d. from the distribution of $(Y_0(i,1), Y_1(i,1), D_i, W_i)$ conditional on $t = 1$, while conditional on $t = 0$, the data are i.i.d. from the distribution of $(Y_0(i,0), Y_1(i,0), D_i, W_i)$; (ii) $\mathcal{W}$ is compact with nonempty interior; (iii) $(\beta_0, f_0) \in \mathcal{B} \times \mathcal{F} \subseteq \mathbb{R}^p \times C^\alpha_M(\mathcal{Z})$ is the only $(\beta, f)$ that satisfies (2.6), where $\alpha \ge d_z/2$; (iv) $\mathbb{E}[f(Z) \mid X]$ does not belong to the linear span of $X$.

Assumption 4. (i) The error terms $\epsilon_{0i} \in \mathbb{R}$ and $\epsilon_{1i} \in \mathbb{R}$ are independently distributed with $\mathbb{E}[\epsilon_{0i} \mid W_i] = 0$ and $\mathbb{E}[\epsilon_{1i} \mid W_i] = 0$; (ii) $\max_{1 \le i \le n} \sup_{w \in \mathcal{W}} \mathbb{E}[|\epsilon_i|^{r_\epsilon} \mid W_i = w] \le C_\epsilon$ for some constants $r_\epsilon$ and $C_\epsilon$.

In Assumption 3, (i) is Assumption 3.3 in Abadie (2005). We follow the same sampling scheme to consider repeated cross sections; (ii) can be relaxed if we add a continuous nonnegative weight function to the definition of $\|\cdot\|_{\infty,\alpha}$ in (3.2) (Freyberger and Masten, 2019); and (iii) and (iv) are standard identification conditions for a partially linear model. Assumption 4 allows the error terms to be non-identically distributed and conditionally heteroskedastic. One can replace it by the stronger sub-Gaussian assumption often used in the literature.

As is standard in the literature for high-dimensional data, we introduce the restricted eigenvalue condition
$$\Lambda_X(s) := \min_{\delta \in \mathbb{R}^p \setminus \{0\},\ \|\delta_{S^c}\|_1 \le \sqrt{s}\|\delta_S\|_2} \frac{\delta^\top \mathbb{E}[\bar X_i \bar X_i^\top]\delta}{\|\delta_S\|_2^2} > 0.$$
Let $\Sigma_{\bar X} = \mathbb{E}[\bar X_i \bar X_i^\top]$, $\Sigma_{\tilde X} = \mathbb{E}[\tilde X_i \tilde X_i^\top]$ and $\Sigma_\Pi = \mathbb{E}[\Pi_{X_i \mid Z_i}\Pi^\top_{X_i \mid Z_i}]$.

Assumption 5. (i) For each $i = 1, \ldots, n$, the covariate vector $X_i = x$ is sub-Gaussian in the sense that for any vector $v \in \mathbb{R}^p$, $v^\top x$ is sub-Gaussian with $\sup_{v \in \mathbb{R}^p : \|v\|_2 = 1} \|v^\top x\|_{\psi_2} \le K_X$; (ii) there exists a constant $C_{\bar x} > 0$ such that $\Lambda_X(s) > C_{\bar x}$; (iii) there exist constants $C_{\Sigma_{\bar X}} > 0$ and $C_{\Sigma_\Pi} > 0$ such that $C_{\Sigma_{\bar X}} < \Lambda_{\min}(\Sigma_{\bar X}) \le \Lambda_{\max}(\Sigma_{\bar X}) < 1/C_{\Sigma_{\bar X}}$ and $C_{\Sigma_\Pi} < \Lambda_{\min}(\Sigma_\Pi) \le \Lambda_{\max}(\Sigma_\Pi) < 1/C_{\Sigma_\Pi}$.

Assumption 5 is standard in the literature: (i) can be relaxed by replacing it with uniform moment conditions as discussed in Caner and Kock (2018); (ii) and (iii) restrict the eigenvalues. In particular, (ii) is a restricted eigenvalue condition.

Assumption 6.
There are finite constants $c_{k_n}$ and $\ell_{k_n}$ such that for each $f \in \mathcal{F}$ and for each $n$ and $k_n$, we have
$$\|r_n\|_{P,2} = \sqrt{\int_{z \in \mathcal{Z}} r_n^2(z)\, dF(z)} \le c_{k_n}, \qquad \|r_n\|_\infty = \sup_{z \in \mathcal{Z}} |r_n(z)| \le \ell_{k_n} c_{k_n}.$$

Assumption 7. (i) The density of $Z_i$ is bounded and bounded away from zero. For every $k_n$, there exists a constant $C_z > 0$, which does not depend on $k_n$, such that $\lambda_{\min}\big(\mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top]\big) > C_z$; (ii) there is a sequence of constants $\xi(k_n)$ satisfying $\sup_z \|\psi^{k_n}(z)\| \le \xi(k_n)$, $\xi^2(k_n)\log k_n / n = o_p(1)$, and $k_n \xi^2(k_n)\log p / n = O_p(1)$; (iii) $\|\mathbb{E}[\tilde X_i \tilde X_i^\top - \bar X_i \bar X_i^\top]\|_{\max} = O(\sqrt{\log p / n})$.

Assumption 6 is Assumption A.3 in Belloni et al. (2015). Note that $\mathcal{F}$ is a set of functions $f$ in $C^\alpha_M(\mathcal{Z})$; thus $\|f\|_{\infty,\alpha}$ is bounded from above uniformly over all $f \in \mathcal{F}$. Then, for instance, $c_{k_n} = O(k_n^{-\alpha/d_z})$ for polynomial series and $c_{k_n} = O(k_n^{-(\alpha \wedge \alpha_0)/d_z})$ for splines of order $\alpha_0$. Assumption 7(i) and the first two conditions in (ii) are also standard in the literature. Given this assumption, it is without loss of generality to normalize $\mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top] = I_{k_n}$. The condition $k_n \xi^2(k_n)\log p / n = O_p(1)$ in Assumption 7(ii) is new. It is a mild condition on the relationship between $p$ and $k_n$. Assumption 7(iii) is a smoothness condition on the approximation error of the projection $\Pi_{n, X \mid Z}$ to $\Pi_{X \mid Z}$.

Assumption 8. (i) $\sup_w |1/\hat\pi(w) - 1/\pi(w)| = O_p(1)$; (ii) $\sup_w |\hat\Phi_0(w) - \Phi_0(w)| = O_p(1)$; (iii) $\sup_w |\hat\Phi_1(w) - \Phi_1(w)| = O_p(1)$; (iv) $\mathbb{E}_n\big[(1/\hat\pi(W_i) - 1/\pi(W_i))^2(\hat\Phi_0(W_i) - \Phi_0(W_i))^2\big] = O_p(\log p / n \vee k_n \log k_n / n)$; (v) $\mathbb{E}_n\big[(1/\hat\pi(W_i) - 1/\pi(W_i))^2(\hat\Phi_1(W_i) - \Phi_1(W_i))^2\big] = O_p(\log p / n \vee k_n \log k_n / n)$.

Assumption 8 imposes moderate conditions on the first-stage approximations of the nuisance functions $\pi(W_i)$, $\Phi_0(W_i)$ and $\Phi_1(W_i)$.
Only the interaction terms between $\pi(W_i)$ and $\Phi_0(W_i)$, and between $\pi(W_i)$ and $\Phi_1(W_i)$, are required to converge at the mild rate $O_p(\log p / n \vee k_n \log k_n / n)$. This demonstrates the double robustness property of our estimator: when either $\pi(W_i)$ or $(\Phi_0(W_i), \Phi_1(W_i))$ is correctly specified, the desired rate of convergence in Theorem 1 can be achieved. As pointed out in Chernozhukov et al. (2018b), the benefit of using sample splitting is that it makes the entropy condition very weak, allowing machine learning methods (e.g. random forests, boosted trees, deep neural nets, and their aggregated and hybrid versions) to be applied to estimate the functions $\hat\pi(W_i)$, $\hat\Phi_0(W_i)$ and $\hat\Phi_1(W_i)$. One can provide more primitive conditions to verify these rates for each given machine learning method chosen.

Assumption 9.
We choose $\lambda$, $k_n$ and $R$ satisfying the following: (i) $\lambda \gtrsim \sqrt{\log p / n}$; (ii) $2\lambda s / \Lambda_X(s) \lesssim R \lesssim \lambda$; and (iii) $R = \min\big(\ell_{k_n} c_{k_n}\sqrt{k_n/n},\ \xi(k_n)c_{k_n}/\sqrt{n}\big) + \sqrt{k_n/n}$.

Theorem 1.
Suppose that Assumptions 1-9 hold. Then, with probability approaching 1,
$$\|\hat\beta - \beta_0\|_1 = O_p(\lambda s) \qquad \text{and} \qquad \|\hat f - f_0\|_{P,2} = O_p(R).$$

Theorem 1 establishes the rate of convergence of our estimator. For the parametric estimator, similarly to the case of high-dimensional linear regression (e.g., Theorem 6.1 in Bühlmann and van de Geer (2011)), the convergence rate depends on the rate of the tuning parameter $\lambda$ and the level of sparsity $s$. When $\lambda = O(\sqrt{\log p / n})$, we have $\|\hat\beta - \beta_0\|_1 = O_p(s\sqrt{\log p / n})$, which is the same rate as in lasso regression for high-dimensional linear models without unknown functions. For the nonparametric estimator, the convergence rate is the same as the one obtained in nonparametric regression models (e.g., Theorem 4.1 in Belloni et al. (2015)), which depends on the number of basis functions $k_n$ and the approximation error. Unlike the results in the literature on the semiparametric partially linear model when the dimension of $X$ is much smaller than the sample size $n$, the convergence rate of the parametric estimator is slower than $\sqrt{n}$ due to the high dimensionality of the model. This makes the inference problem challenging. As we will show in Section 4, the asymptotic variance of the nonparametric estimator contains a projection term that reflects the effect of the high-dimensional parametric estimation.
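The second-stage structure behind (3.1) and Theorem 1 can be sketched as follows. This is a simplified toy implementation under assumptions we introduce for illustration: the pseudo-outcome $S$ is generated directly from a partially linear model (rather than built from first-stage estimates of $\hat\rho_i$, $\hat\pi$, $\hat\Phi_0$, $\hat\Phi_1$), the sieve is a plain polynomial series, and `lasso_cd` is a generic coordinate-descent lasso, not the paper's implementation. The sieve part is projected out with $P_Z$, the lasso is run on the residualized design $\tilde X$, and the sieve coefficients are recovered from the remaining residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, kn = 400, 50, 6

# Hypothetical design: sparse beta_0, smooth f_0(z) = sin(2*pi*z).
Z = rng.uniform(0, 1, n)
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:3] = [1.0, -1.0, 0.5]
f_true = np.sin(2 * np.pi * Z)
# S stands in for the pseudo-outcome rho_hat*(...) appearing in (3.1).
S = X @ beta_true + f_true + 0.3 * rng.standard_normal(n)

# Sieve basis Psi_n for f: polynomial series 1, z, ..., z^(kn-1).
Psi = np.vander(Z, kn, increasing=True)          # n x k_n
P_Z = Psi @ np.linalg.solve(Psi.T @ Psi, Psi.T)  # projection onto the sieve space
X_t, S_t = X - P_Z @ X, S - P_Z @ S              # residualize out the sieve part

def lasso_cd(A, y, lam, n_iter=500):
    """Coordinate-descent lasso for (1/2n)||y - A b||^2 + lam ||b||_1."""
    n_, p_ = A.shape
    b = np.zeros(p_)
    col_sq = (A**2).sum(axis=0) / n_
    for _ in range(n_iter):
        for j in range(p_):
            r_j = y - A @ b + A[:, j] * b[j]     # partial residual excluding j
            z = A[:, j] @ r_j / n_
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return b

lam = 0.5 * np.sqrt(np.log(p) / n)   # lambda of order sqrt(log p / n), cf. Assumption 9
beta_hat = lasso_cd(X_t, S_t, lam)
# Recover the sieve coefficients from the residual S - X beta_hat.
gamma_hat = np.linalg.solve(Psi.T @ Psi, Psi.T @ (S - X @ beta_hat))
f_hat = Psi @ gamma_hat

print(np.abs(beta_hat - beta_true).sum())        # l1 error of beta_hat
print(np.sqrt(np.mean((f_hat - f_true)**2)))     # empirical L2 error of f_hat
```

The two printed quantities mirror the two rates in Theorem 1: an $\ell_1$ error for the parametric part driven by $\lambda s$, and an $L_2$ error for the sieve part driven by $k_n$ and the approximation error.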
4. Asymptotic Inference
In many applications, practitioners are interested not only in the estimation of the treatment effect but also in the uncertainty quantification of the estimated treatment effect. The latter provides the confidence in the treatment effect estimates and is a routine procedure in most causal inference problems. While the inferential properties of high-dimensional linear/generalized linear models have been extensively investigated in the recent literature (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Van de Geer et al., 2014; Belloni, Chernozhukov and Wei, 2016; Ning and Liu, 2017; Ning et al., 2017; Cai and Guo, 2017; Neykov et al., 2018; Gold, Lederer and Tao, 2020), asymptotic inference under the DiD design has not been studied, especially under the partially linear model specification. In this section, we consider how to construct confidence intervals for the parametric component β₀ and the nonparametric component f(z) for given z ∈ Z.

Consider the inference problem for a linear combination of β₀, say ξᵀβ₀, for a known vector ξ ∈ R^p. For instance, if we take ξ to be the unit basis vector e_j = (0, ..., 0, 1, 0, ..., 0) with the jth position being 1 and 0 otherwise, then the linear functional reduces to ξᵀβ₀ = (β₀)_j, the jth component of the regression coefficient. Similarly, if we are interested in the prediction for a given test sample X = x and Z = z, then the parameter of interest becomes xᵀβ₀ + f(z). Thus, the inference problem can be decomposed into two parts: inference on xᵀβ₀ and inference on f(z). The former is again a linear combination of β₀ with ξ = x; inference on f(z) will be studied later in this section. To construct confidence intervals for ξᵀβ₀, we extend the de-biasing approach to the DiD design under the partially linear model specification. Given the Lasso estimator β̂, we propose the following de-biased Lasso estimator:

T̂ = ξᵀβ̂ + ŵᵀ E_n{ ( ρ̂_i ( ΔY_i − (1 − π̂_i)Φ̂₁(W_i) − π̂_i Φ̂₀(W_i) ) − X_iᵀβ̂ − f̂(Z_i) ) X̃_i },  (4.1)

where

ŵ = arg min ‖w‖₁  s.t.  ‖ξ − Σ̂_X̃ w‖_∞ ≤ λ′,  (4.2)

with Σ̂_X̃ = n⁻¹ Σ_{i=1}^n X̃_i X̃_iᵀ and λ′ a tuning parameter. We will show that ŵ is a consistent estimator of w₀ = Σ_X̃⁻¹ ξ. Let s_w = |{k : w₀ₖ ≠ 0}| denote the number of non-zero elements of w₀.

Assumption 10.
(i) E_n[(1/π̂(W_i) − 1/π(W_i))²] = o_p(1/(k_n log k_n) ∨ 1/(s_w log p)) and sup_w |1/π̂(w) − 1/π(w)| = o_p(1);
(ii) E_n[(Φ̂₀(W_i) − Φ₀(W_i))²] = o_p(1/(k_n log k_n) ∨ 1/(s_w log p)) and sup_w |Φ̂₀(w) − Φ₀(w)| = o_p(1);
(iii) E_n[(Φ̂₁(W_i) − Φ₁(W_i))²] = o_p(1/(k_n log k_n) ∨ 1/(s_w log p)) and sup_w |Φ̂₁(w) − Φ₁(w)| = o_p(1);
(iv) E_n[{(1/π̂(W_i) − 1/π(W_i)) (Φ̂₀(W_i) − Φ₀(W_i))}²] = o_p(1/(s_w n) ∨ 1/(k_n n));
(v) E_n[{(1/π̂(W_i) − 1/π(W_i)) (Φ̂₁(W_i) − Φ₁(W_i))}²] = o_p(1/(s_w n) ∨ 1/(k_n n)).

Assumption 10 is a stronger version of Assumption 8, which is required for establishing asymptotic normality.

Let σ²_i = E[ε²_i | X_i] and V_β = Σ_X̄⁻¹ Ω_β Σ_X̄⁻¹, where Σ_X̄ = E[X̄_i X̄_iᵀ] and Ω_β := E[σ²_i X̄_i X̄_iᵀ]. Let V̂_β = ŵᵀ Ω̂_β ŵ with Ω̂_β := E_n[σ̂²_i X̃_i X̃_iᵀ].

Assumption 11.
We have: (i) n^{−1/2}( s_w (log p)^{1/2} ∨ s_w log p ) = o_p(1) and s_w max_{1≤j≤p, 1≤i≤n} |X̃_{ij} r_{ni}| = o(n^{−1/2}); (ii) s_w E_n[ε_i(X̃_i − X̄_i)] = o_p(n^{−1/2}); (iii) the smallest eigenvalue of Ω_β, denoted λ_min(Ω_β), is bounded away from 0, and the largest eigenvalue, denoted λ_max(Ω_β), is bounded from above.

Theorem 2.
Suppose that Assumptions 1-7, 9, 10 and 11 hold. Let λ′ ≳ √(log p/n). Then

√n (T̂ − ξᵀβ₀) →_d N(0, ξᵀV_β ξ).

Furthermore, if

log(np) ( s₀ log p/n + √(ξ²(k_n) k_n/n) + ℓ_{k_n} c_{k_n} ) = o(1),

then V̂_β →_p ξᵀV_β ξ.

Theorem 2 implies that we can construct an asymptotic (1 − α) confidence interval for ξᵀβ₀ as

( T̂ − z_{1−α/2}(V̂_β/n)^{1/2}, T̂ + z_{1−α/2}(V̂_β/n)^{1/2} ),

where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution. When E[ε²_i | X_i] = σ², the asymptotic variance achieves the semiparametric efficiency bound V_β = σ² E[X̃_i X̃_iᵀ]⁻¹.

In the following, we extend the de-biasing approach to construct confidence intervals for f(z) for any given z ∈ Z, where we assume d_z is much smaller than n to avoid the curse of dimensionality in nonparametric estimation. Recall that f(z) can be approximated in the sieve space by ψ_{k_n}(z)ᵀγ_n. To construct a confidence interval for f(z), it suffices to apply the de-biasing approach to the parameter γ_n, treating the parameter β₀ as a high-dimensional nuisance parameter. To this end, we first derive the score function for γ_n as

U_{γ_n}(β, M) = E_n{ ( ρ̂_i ( ΔY_i − (1 − π̂_i)Φ̂₁(W_i) − π̂_i Φ̂₀(W_i) ) − X_iᵀβ − ψ_{k_n}(Z_i)ᵀγ_n ) ( ψ_{k_n}(Z_i) − M X_i ) },

where M = E(ψ_{k_n}(Z_i) X_iᵀ){E[X_i X_iᵀ]}⁻¹ is a k_n × p matrix. One key property of the score function is that U_{γ_n}(β, M) is insensitive to the unknown high-dimensional nuisance parameters β and M.
In fact, we will show below that U_{γ_n}(β̂, M̂) = U_{γ_n}(β₀, M) + o_p(n^{−1/2}) for suitable estimators β̂ and M̂ to be defined later. Given this score function U_{γ_n}(β, M), we can define the one-step updated de-biased estimator as f̄(z) := ψ_{k_n}(z)ᵀγ̄_n, where

γ̄_n := γ̂_n + Σ̂_f⁻¹ E_n{ ( ρ̂_i ( ΔY_i − (1 − π̂_i)Φ̂₁(W_i) − π̂_i Φ̂₀(W_i) ) − X_iᵀβ̂ − f̂(Z_i) ) ( ψ_{k_n}(Z_i) − M̂ X_i ) },

where Σ̂_f = E_n{ (ψ_{k_n}(Z_i) − M̂X_i) ψ_{k_n}(Z_i)ᵀ } and M̂ = [M̂₁, ..., M̂_j, ..., M̂_{k_n}]ᵀ with

M̂_j = arg min ‖m‖₁  s.t.  ‖mᵀ E_n(X_i X_iᵀ) − E_n(ψ_{k_n,j}(Z_i) X_iᵀ)‖_∞ ≤ λ″.  (4.3)

Let s_m = |{ k : { (E[ψ_{k_n}(Z_i) X_iᵀ]) (E[X_i X_iᵀ])⁻¹ }ₖ ≠ 0 }| denote the number of non-zero elements of (E[ψ_{k_n}(Z_i) X_iᵀ])(E[X_i X_iᵀ])⁻¹.

Let σ²_z = ψ_{k_n}(z)ᵀ V_f ψ_{k_n}(z) with V_f = Σ_f⁻¹ Ω_f Σ_f⁻¹, Σ_f = E[(ψ_{k_n}(Z_i) − MX_i) ψ_{k_n}(Z_i)ᵀ], and Ω_f = E[σ²_i ψ_{k_n}(Z_i) ψ_{k_n}(Z_i)ᵀ] − M E[σ²_i X_i X_iᵀ] Mᵀ. We define the sample analogues similarly: σ̂²_z = ψ_{k_n}(z)ᵀ V̂_f ψ_{k_n}(z) with V̂_f = Σ̂_f⁻¹ Ω̂_f Σ̂_f⁻¹ and Ω̂_f = E_n[σ̂²_i ψ_{k_n}(Z_i) ψ_{k_n}(Z_i)ᵀ] − M̂ E_n[σ̂²_i X_i X_iᵀ] M̂ᵀ.

Assumption 12.
We have: (i) s_m √(k_n log p/n) = o(1), √n σ_z⁻¹ ‖E_n[r_{ni}(ψ_{k_n}(Z_i) − MX_i)]‖ = o(1), s_m s₀ log p/√n = o(1), and √n σ_z⁻¹ ψ_{k_n}(z)ᵀ Σ_f⁻¹ E_n{ r_{ni}(ψ_{k_n}(Z_i) − MX_i) } = o_p(1); (ii) the smallest eigenvalues of Σ_f and Ω_f, denoted λ_min(Σ_f) and λ_min(Ω_f) respectively, are bounded away from 0, and the largest eigenvalues, denoted λ_max(Σ_f) and λ_max(Ω_f) respectively, are bounded from above.

Theorem 3.
Suppose that Assumptions 1-7 and 9-12 hold, and that √n σ_z⁻¹ max_{1≤i≤n} |r_{ni}| = o(1). Let λ″ ≳ √(log p/n). Then

√n σ_z⁻¹ ( f̄(z) − f(z) ) →_d N(0, 1).

Furthermore, if

√(log(np)) ( n^{1/r_ε} + c_{k_n} ℓ_{k_n} ) ( s₀ √(log p/n) + √(ξ²(k_n) k_n/n) + c_{k_n} ℓ_{k_n} ) = o(1),

then σ̂²_z →_p σ²_z.

Theorem 3 provides the asymptotic theory needed to construct confidence intervals for f(z) at any z. Unlike standard results in the nonparametric literature, we need to construct a de-biased estimator f̄(z) that corrects the bias caused by estimating the high-dimensional parametric component of the partially linear model. Since the Lasso estimator of the parametric linear part converges at a rate slower than √n, the asymptotic variance of f̄(z) contains a projection term that reflects the effect of the parametric component on the nonparametric component. We can construct an asymptotic (1 − α) confidence interval for f(z) as

[ f̄(z) − z_{1−α/2} σ̂_z/√n, f̄(z) + z_{1−α/2} σ̂_z/√n ],

where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution.
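In the low-dimensional regime p ≪ n, the program (4.2) with λ′ → 0 is solved exactly by ŵ = Σ̂_X̃⁻¹ξ, and the de-biasing of this section reduces to a classical one-step correction with a sandwich variance. A numpy sketch of that simplified case, using a plain linear model in place of the full DiD moment (our own illustration, not the paper's procedure):

```python
import numpy as np
from statistics import NormalDist

def debiased_ci(X, y, beta_hat, xi, alpha=0.1):
    """One-step de-biased estimate of xi' beta with a (1 - alpha) CI,
    using w_hat = Sigma_hat^{-1} xi (the lambda' -> 0 limit of (4.2))."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    w_hat = np.linalg.solve(Sigma_hat, xi)
    resid = y - X @ beta_hat
    t_hat = xi @ beta_hat + w_hat @ (X.T @ resid) / n            # de-biased estimate
    omega = (X * resid[:, None] ** 2).T @ X / n                  # E_n[resid^2 x x']
    v_hat = w_hat @ omega @ w_hat                                # sandwich variance
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * (v_hat / n) ** 0.5
    return t_hat, (t_hat - half, t_hat + half)
```

Even when `beta_hat` is deliberately biased (e.g., shrunk toward zero), the correction term restores √n-consistency of the scalar functional ξᵀβ₀, which is exactly the mechanism behind Theorem 2.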
5. Simulation
We compare the finite sample performance of the doubly robust estimator proposed in this paper with the semiparametric DiD estimator of Abadie (2005) when the latter is applicable. We consider two data generating processes. In the first setting (DGP1), we let Y₁(i, 0) and Y₀(i, 0) follow standard normal distributions, and define Y₁(i, 1) and Y₀(i, 1) as follows:

Y₁(i, 1) = Y₁(i, 0) + X_iᵀβ₁ + f(Z_i) + ε_{1i},
Y₀(i, 1) = Y₀(i, 0) + X_iᵀβ₀ + ε_{0i},

where X_i and Z_i are generated from independent standard normal distributions. The errors ε_{1i} and ε_{0i} are independently generated from standard normal distributions. In the second setting (DGP2), we define

Y(i, 0) = ε̃_i · ( Z_i/√2 + X_{i1}/√2 ),

where ε̃_i is generated from a standard normal distribution and X_i ∼ N(0, Σ) with Σ_{jk} = ρ^{|j−k|}. This allows both heteroskedasticity in the error term and correlation among the regressors. In both DGP1 and DGP2, we set β_{1,i} = 2/i and β_{0,i} = 1/i for i ≤ 15, and f(Z_i) = exp(Z_i). The treatment assignment probability follows a logistic model,

P(T_i = 1) = 1 − (1 + exp(X_iᵀθ))⁻¹,

where θ_i = 1/i for i ≤ 10. In both settings, we use an 8th-degree trigonometric polynomial basis for the nonparametric estimation.

Tables 1 and 2 summarize the results for the two settings. We report the average bias, average standard error, average mean squared error, and the average coverage and average length of 90% confidence intervals, separately for the linear coefficients and the nonparametric coefficients. To compare with the parametric part, we report the coverage for a linear combination of the nonparametric coefficients; divided by its standard error, it also converges to a standard normal under the conditions of Theorem 3. The "Dr-DiD" columns report the results for the doubly robust diff-in-diff estimator, and the "Semi-DiD" columns report the results for the Abadie (2005) estimator. We present results with n varying over 200, 500, and 1000 and the dimension of the linear specification p varying over 10, 50, 500, and 1000. Notice that the Abadie (2005) estimator is infeasible when n ≤ p, so we omit the "Semi-DiD" results when p = 500 or 1000 and report them only when n is relatively large compared to p (e.g., n = 200 and p = 50). Although the Semi-DiD estimator can still be computed when n = 1000 and p = 500, we choose not to report this result because of its large variance.

As shown in both tables, the Dr-DiD estimator has smaller standard errors, RMSE, and confidence interval lengths in both the linear and nonparametric specifications. When p = 50, the Semi-DiD estimator becomes too conservative and produces larger standard errors. The Dr-DiD estimator is also more robust than the Semi-DiD estimator when switching from homoskedastic to heteroskedastic errors. More importantly, our experiments show that, in finite samples, the Dr-DiD estimator delivers reasonable estimates in high-dimensional settings.

6. Empirical Application

We use the proposed method to study the effect of increasing the minimum wage on unemployment rates at the county level.
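As a concrete reference for the first simulation design (DGP1) described in Section 5, a minimal data-generating sketch; this is our own reconstruction, and in particular the 0/1 subscript assignments and error labels are our reading of the design:

```python
import numpy as np

def generate_dgp1(n, p, seed=0):
    """Simulate DGP1: standard normal baselines, partially linear outcome
    changes with f(Z) = exp(Z), and logistic treatment assignment."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    Z = rng.normal(size=n)
    j = np.arange(1, p + 1)
    beta1 = np.where(j <= 15, 2.0 / j, 0.0)     # treated-arm coefficients
    beta0 = np.where(j <= 15, 1.0 / j, 0.0)     # control-arm coefficients
    theta = np.where(j <= 10, 1.0 / j, 0.0)     # propensity coefficients
    pi = 1.0 - 1.0 / (1.0 + np.exp(X @ theta))  # P(T_i = 1 | X_i)
    D = (rng.random(n) < pi).astype(float)
    dY1 = X @ beta1 + np.exp(Z) + rng.normal(size=n)  # Y1(i,1) - Y1(i,0)
    dY0 = X @ beta0 + rng.normal(size=n)              # Y0(i,1) - Y0(i,0)
    dY = D * dY1 + (1.0 - D) * dY0                    # observed change
    return X, Z, D, dY, pi
```

The returned observed change `dY` and treatment indicator `D` are the inputs a DiD estimator would see; the true propensity `pi` is returned only for diagnostics.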
We use the same dataset collected by Callaway and Li (2020), which contains county-level unemployment rates from 2005 to 2007, before the Fair Minimum Wage Act was enacted in all states on May 25, 2007. Eleven states increased their minimum wage by the first quarter of 2007, while the other states did not increase their minimum wage until the federal minimum wage increased in July 2007. We exploit this variation in the adoption of minimum wage policy across states to evaluate its impact on county-level unemployment rates. Callaway and Li (2020) consider identification and estimation of the quantile treatment effect on the treated under a distributional extension of the mean difference-in-differences assumption with fixed-dimensional covariates. Differing from Callaway and Li (2020), this work focuses on studying the impact of covariates on the heterogeneous ATT in this DiD design. Our proposed method allows us to weaken the parallel trend assumption to the conditional parallel trend assumption by conditioning on a large number of potential confounders. For example, states with smaller populations may have higher variation in unemployment rates and thus follow different trends than states with larger populations. Our method also allows us to derive the marginal effect of a specific covariate of interest; as a result, customized policy recommendations can be designed based on these results.

Figure 1 plots the simple difference in the 2005-to-2007 change in the unemployment rate by median income (panel a) and population (panel b). We separate the counties in the control and treated states by red and blue colors. The solid lines are the local means for the control and treated groups. There is a general decrease in the unemployment rate from 2005 to 2007 across all counties, as the change in the unemployment rate is centered below 0.
The difference between the red and blue lines is the standard DiD estimate under the unconditional parallel trend assumption. The decrease in the unemployment rate for the treated counties is lower than for the control counties in the low-income region, while little difference between treated and control counties is observed in the high-income region. On the other hand, the decrease in the unemployment rate for the treated counties is consistently lower than for the control counties regardless of population size.

Figure 2 compares the semiparametric diff-in-diff estimator (Semi-DiD: blue) with the doubly robust diff-in-diff estimator (Dr-DiD) for the marginal effect f. (New Hampshire and Pennsylvania are dropped for the same reason as in Callaway and Li (2020).) Both methods use a 4th-degree trigonometric polynomial basis to approximate f and have partially linear forms. The Semi-DiD estimator controls only for the underlying covariate of interest (Z_i), while the Dr-DiD estimator controls not only for the underlying covariate of interest (Z_i) but also for 703 covariates (X_i) in a linear additive form. These covariates include 38 county-level characteristics as well as all interactions between them. It is also worth pointing out that, when computing the marginal effect for median income, population is used as a confounder in the linear part, and vice versa when computing the marginal effect for population. The dashed lines are 95% confidence intervals.

We find that both estimators show that regions with high median income levels and larger population sizes may suffer an increase in the unemployment rate due to the minimum wage policy, while no significant effect of the policy is detected in regions with lower median income levels and smaller population sizes. These findings coincide with canonical economic theory on the unemployment rate. For example, a higher median income level implies a higher substitution cost for a worker currently at the minimum wage and thus leads to an increase in the unemployment rate when the minimum wage rises.
On the other hand, a region with a larger population has a larger labor supply, and thus a rise in the minimum wage can lead to a labor surplus. The difference in the general direction of the results in Figure 1 and Figure 2 indicates the potential severity of confounding in this design. Next, while both the Semi-DiD and Dr-DiD estimators show no significant impact of the policy in regions with low median income and sparse population, the Dr-DiD estimator shows a larger and more significant impact than the Semi-DiD estimator in regions with higher median income and denser population. This is due to controlling for additional covariates, which further alleviates the concern of confoundedness and reduces the uncertainty in the model, yielding more accurate estimates.
7. Discussion
In this paper, we propose a new doubly robust two-stage difference-in-differences estimator that allows for, but does not require, the number of potential confounding covariates to be greater than the number of observations. Our estimator is robust to model misspecification, and a general set of machine learning tools can be used in our estimation procedure to estimate the propensity score. The outcome equation is modeled in a flexible partially linear form. The rate of convergence is derived for the new estimator, and a novel de-biasing procedure is proposed for inference, which allows the user to construct confidence intervals for the heterogeneous treatment effects. A simulation study shows promising finite sample performance of our estimator under different data generating processes. Our method is applied to study the effect of the Fair Minimum Wage Act on local unemployment rates and shows that heterogeneous effects can arise due to differences in demographics. Moreover, an R package implementing the proposed method is available on Github. More work remains to be done. For example, it would be interesting to consider a similar estimation and inference strategy for panel data, or to develop estimators for the quantile treatment effect on the treated with high-dimensional covariates. We leave these topics for future studies.

Fig 1: Difference in 2005 to 2007 Unemployment Rate. (a) Median Income; (b) Population.

Fig 2: Comparison of the Dr-DiD estimator (with 703 covariates) and the Semi-DiD (Abadie, 2005) estimator. (a) Median Income; (b) Population.
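Before turning to the proofs, the double robustness property (Lemma 1) can be checked numerically: with either a correct propensity score or correct outcome-change functions, the sample moment recovers E[Φ₁(W) − Φ₀(W)]. A Monte Carlo sketch with hypothetical functional forms, entirely our own illustration (including a cross-fitted logistic first stage standing in for the general ML step): here ΔY = g(W) + D·τ(W) + ε with g(W) = W and τ(W) = 1 + 0.5W, so Φ₁ = g + τ, Φ₀ = g, and the target is E[τ(W)] = 1.

```python
import numpy as np

def dr_moment(dY, D, pi, phi1, phi0):
    """Sample analogue of E[rho * (dY - (1 - pi) Phi1 - pi Phi0)], with
    rho = (D - pi) / (pi (1 - pi)); identifies E[Phi1(W) - Phi0(W)]."""
    rho = (D - pi) / (pi * (1.0 - pi))
    return np.mean(rho * (dY - (1.0 - pi) * phi1 - pi * phi0))

def crossfit_propensity(W, D, n_folds=2, iters=30, seed=0):
    """Out-of-fold logistic propensity estimates (sample splitting);
    a simple stand-in for the paper's general first-stage ML step."""
    n = len(D)
    Wmat = np.column_stack([np.ones(n), W])
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    pi_hat = np.empty(n)
    for k, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        beta = np.zeros(Wmat.shape[1])
        for _ in range(iters):  # Newton's method for logistic regression
            mu = 1.0 / (1.0 + np.exp(-(Wmat[train] @ beta)))
            H = Wmat[train].T @ (Wmat[train] * (mu * (1.0 - mu))[:, None])
            beta += np.linalg.solve(H + 1e-8 * np.eye(len(beta)),
                                    Wmat[train].T @ (D[train] - mu))
        pi_hat[test] = 1.0 / (1.0 + np.exp(-(Wmat[test] @ beta)))
    return pi_hat

rng = np.random.default_rng(0)
n = 100_000
W = rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-W))
D = (rng.random(n) < pi_true).astype(float)
dY = W + D * (1.0 + 0.5 * W) + rng.normal(size=n)

phi1, phi0 = W + 1.0 + 0.5 * W, W      # correct outcome-change models
zero = np.zeros(n)                     # deliberately wrong outcome models
half = np.full(n, 0.5)                 # deliberately wrong propensity

m_pi_only = dr_moment(dY, D, crossfit_propensity(W, D), zero, zero)
m_phi_only = dr_moment(dY, D, half, phi1, phi0)
```

Both `m_pi_only` and `m_phi_only` should be close to E[τ(W)] = 1, even though one nuisance component is deliberately misspecified in each case.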
Appendix A: Proofs
A.1. Proof of Lemma 1
Proof.
For Part (i): let Φ₁(W_i; θ₁) and Φ₀(W_i; θ₀) be postulated models for Φ₁(W_i) and Φ₀(W_i), and let π(W_i; ϑ) be a postulated model for the true propensity score P(D_i = 1 | W_i). Since ρ = (D_i − P(D_i = 1 | W_i)) / {P(D_i = 1 | W_i) P(D_i = 0 | W_i)}, we have

E[ ρ ( ΔY(i) − P(D_i = 0 | W_i)Φ₁(W_i) − P(D_i = 1 | W_i)Φ₀(W_i) ) | W_i ]
= E[ (D_i − P(D_i = 1 | W_i)) / {P(D_i = 1 | W_i) P(D_i = 0 | W_i)} · ΔY(i) | W_i ]
 − E[ (D_i − P(D_i = 1 | W_i)) / {P(D_i = 1 | W_i) P(D_i = 0 | W_i)} · ( P(D_i = 0 | W_i)Φ₁(W_i) + P(D_i = 1 | W_i)Φ₀(W_i) ) | W_i ]
= { E[ D_i ΔY(i) / P(D_i = 1 | W_i) | W_i ] − E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i) | W_i ] }   (Part L1.1)
 − { E[ (1 − D_i) ΔY(i) / P(D_i = 0 | W_i) | W_i ] + E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 0 | W_i) · Φ₀(W_i) | W_i ] }.   (Part L1.2)

First we consider Part L1.1:

E[ D_i ΔY(i) / P(D_i = 1 | W_i) | W_i ] − E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i) | W_i ]
= E[ D_i (Y(i,1) − Y(i,0)) / P(D_i = 1 | W_i) − (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i) | W_i ]
= E[ Φ₁(W_i) + D_i / P(D_i = 1 | W_i) · ( (Y(i,1) − Y(i,0)) − Φ₁(W_i) ) | W_i ]
= Φ₁(W_i) + E[ D_i / P(D_i = 1 | W_i) · ( (Y(i,1) − Y(i,0)) − Φ₁(W_i) ) | W_i ].
Notice that when π(W_i; ϑ) is misspecified but Φ₁(W_i; θ₁) is correctly specified, so that Φ₁(W_i) = Φ₁(W_i; θ₁), we have

E[ D_i / π(W_i; ϑ) · ( (Y(i,1) − Y(i,0)) − Φ₁(W_i) ) | W_i ]
= E[ (Y(i,1) − Y(i,0)) − Φ₁(W_i) | W_i, D_i = 1 ] · P(D_i = 1 | W_i) / π(W_i; ϑ) = 0.

When Φ₁(W_i; θ₁) is misspecified but the propensity score π(W_i; ϑ) is correctly specified, so that π(W_i) = π(W_i; ϑ), we have

E[ D_i ΔY(i) / P(D_i = 1 | W_i) | W_i ] − E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i; θ₁) | W_i ]
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 1 ] − E[ D_i − P(D_i = 1 | W_i) | W_i ] · Φ₁(W_i; θ₁) / P(D_i = 1 | W_i)
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 1 ].

We next consider Part L1.2. When π(W_i; ϑ) is misspecified but Φ₀(W_i; θ₀) is correctly specified, so that Φ₀(W_i) = Φ₀(W_i; θ₀), Part L1.2 equals

E[ (1 − D_i) ΔY(i) / (1 − π(W_i; ϑ)) | W_i ] + E[ (D_i − π(W_i; ϑ)) / (1 − π(W_i; ϑ)) · Φ₀(W_i) | W_i ]
= E[ (1 − D_i)(Y(i,1) − Y(i,0)) / (1 − π(W_i; ϑ)) − ( (1 − D_i) − (1 − π(W_i; ϑ)) ) / (1 − π(W_i; ϑ)) · Φ₀(W_i) | W_i ]
= Φ₀(W_i) + E[ (1 − D_i) / (1 − π(W_i; ϑ)) · ( (Y(i,1) − Y(i,0)) − Φ₀(W_i) ) | W_i ]
= Φ₀(W_i).
And when Φ₀(W_i; θ₀) is misspecified but the propensity score π(W_i; ϑ) is correctly specified,

E[ (1 − D_i) ΔY(i) / P(D_i = 0 | W_i) | W_i ] + E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 0 | W_i) · Φ₀(W_i; θ₀) | W_i ]
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 0 ] + E[ D_i − P(D_i = 1 | W_i) | W_i ] · Φ₀(W_i; θ₀) / P(D_i = 0 | W_i)
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 0 ]
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 1 ].   (By Assumption 1)

The result in Part (ii) follows from Part (i) and direct calculation.

A.2. Proof of Theorem 1
Let B(s₀, p) be the set of p-dimensional vectors with at most s₀ non-zero coordinates. Let Q̂_z = n⁻¹ Σ_{i=1}^n ψ_{k_n}(Z_i) ψ_{k_n}(Z_i)ᵀ. Recall the definition S_i = X_iᵀβ₀ + ψ_{k_n}(Z_i)ᵀγ_n + r_{ni} + ε_i. Let v_n(f̂, β̂) = v_n(Z, f̂) + v_n(X, β̂), where v_n(Z, f̂) = Ψ_n(γ̂_n − γ_n) and v_n(X, β̂) = X(β̂ − β₀). Let ṽ_n(Z, f̂) = Ψ_n(γ̂_n − γ_n) + Π_{n,X|Z}(β̂ − β₀) and ṽ_n(X, β̂) = X̃(β̂ − β₀). Then v_n(f̂, β̂) = ṽ_n(Z, f̂) + ṽ_n(X, β̂). Define ι_{ni} = r_{ni} + ε_{ni} and let ι_n be the vector with ι_{ni} in the ith position. Letting ι*_n = P_Z ι_n, we have ι*_nᵀ ṽ_n(Z, f̂) = ι_nᵀ ṽ_n(Z, f̂) for ι_n = ε + r_n. Define the following norm and sets:

τ(β, f, R) = R⁻¹ λ ‖β‖₁ + ‖Xβ + f‖_{P,2},
M₁(R) = { f : ‖f − f_n‖_{P,2} ≤ R, f ∈ F },
M(R) = { (β, f) : τ(β, f, R) ≤ R, β ∈ B(s₀, p), f ∈ M₁(R) },
T₁ = { sup_{M(R)} | ‖Xᵀβ + f‖²_n − ‖Xᵀβ + f‖²_{P,2} | ≤ R²/96 },
T₂ = { ‖ι*_n‖²_n ≤ R²/384 },
T₃ = { |ι_nᵀ ṽ_n(X, β)/n| ≤ (λ/8) ‖β − β₀‖₁ for all (β, f) ∈ M(R) },
T₄ = { Λ_{X,n}(s₀) ≥ Λ_X(s₀)/2 },
T₅ = { |(Ŝ − S)ᵀ v_n(X, β)/n| ≤ (λ/8) ‖β − β₀‖₁ for all (β, f) ∈ M(R) },
T₆ = { √k_n · ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ≤ R/192 },
T₇ = { |(Ŝ − S)ᵀ ṽ_n(X, β)/n| ≤ (λ/8) ‖β − β₀‖₁ for all (β, f) ∈ M(R) },

and

Λ_{X,n}(s₀) = min_{ δ ∈ R^p \ {0}, ‖δ_{S₀ᶜ}‖₁ ≤ 3√s₀ ‖δ_{S₀}‖₂ } { δᵀ E_n[X̃_i X̃_iᵀ] δ }^{1/2} / ‖δ_{S₀}‖₂.

Proof.
Let (β̂, f̂) be the solution to the minimization problem in Equation (3.1). Define t := 4R / (4R + ‖f̂ − f_n‖_{P,2}) and γ̃_n := t γ̂_n + (1 − t) γ_n, so that f̃ = f̃(z) = ψ_{k_n}(z)ᵀ γ̃_n. By convexity,

‖Ŝ − X̃β̂ − Π_{n,X|Z}β̂ − Ψ_n γ̃_n‖²_n + λ‖β̂‖₁ ≤ ‖Ŝ − X̃β₀ − Π_{n,X|Z}β₀ − Ψ_n γ_n‖²_n + λ‖β₀‖₁.

By the definition of S and ι and by Lemma 2,

‖X̃(β̂ − β₀)‖²_n + λ‖β̂‖₁ ≤ 2 |ι_nᵀ X̃(β̂ − β₀)/n| + 2 |(Ŝ − S)ᵀ X̃(β̂ − β₀)/n| + 2 |(γ̃_n − γ_n)ᵀ Ψ_nᵀ X̃(β̂ − β₀)/n| + 2 |(β̂ − β₀)ᵀ Π_{n,X|Z}ᵀ X̃(β̂ − β₀)/n| + λ‖β₀‖₁.  (A.1)

Notice that (γ̃_n − γ_n)ᵀ Ψ_nᵀ X̃(β̂ − β₀) = 0 and (β̂ − β₀)ᵀ Π_{n,X|Z}ᵀ X̃(β̂ − β₀) = 0.
On the set T₃ ∩ T₇,

E_n[{X̃_iᵀ(β₀ − β̂)}²] + λ‖β̂‖₁ ≤ (λ/2)‖β̂ − β₀‖₁ + λ‖β₀‖₁.

Subtracting λ‖β̂_{S₀}‖₁ from and adding (λ/2)‖β̂_{S₀} − β₀,_{S₀}‖₁ to both sides, and using Assumption 5(ii), we obtain on the set T₄ that

2 E_n[{X̃_iᵀ(β₀ − β̂)}²] + λ‖β̂ − β₀‖₁ ≤ 4λ‖β̂_{S₀} − β₀,_{S₀}‖₁ ≤ 4λ√s₀ ‖β̂_{S₀} − β₀,_{S₀}‖₂ ≤ (8λ√s₀/Λ_X(s₀)) E_n[{X̃_iᵀ(β₀ − β̂)}²]^{1/2} ≤ E_n[{X̃_iᵀ(β₀ − β̂)}²] + 16λ²s₀/Λ²_X(s₀).

As a result, from Lemmas 5, 6, and 7, with probability approaching one, ‖β̂ − β₀‖₁ ≲ λs₀/Λ²_X(s₀) and E_n[{X̃_iᵀ(β₀ − β̂)}²] ≲ λ²s₀/Λ²_X(s₀).

Lemmas 2, 3, 4, 5 and 7 imply that ‖X(β₀ − β̂) + (f_n − f̃)‖_{P,2} ≤ R with probability approaching one. By orthogonal decomposition, this implies that ‖X̃(β₀ − β̂)‖_{P,2} ≤ R and ‖Π_{n,X|Z}(β₀ − β̂) + f_n − f̃‖_{P,2} ≤ R with probability approaching one. Then ‖Π_{n,X|Z}(β₀ − β̂)‖_{P,2} ≲ R by Assumptions 5(iii) and 7(iii), so we have

‖f̃ − f_n‖_{P,2} ≤ ‖Π_{n,X|Z}(β₀ − β̂) + f_n − f̃‖_{P,2} + ‖Π_{n,X|Z}(β₀ − β̂)‖_{P,2} ≲ R,

which, combined with Assumption 9(ii), yields ‖f̂ − f_n‖_{P,2} ≲ R.

Lemma 2.
Suppose that Assumptions 1-9 are satisfied. For f̃ as defined in the proof of Theorem 1, τ(β̂ − β₀, f̃ − f_n, R) ≤ R on the event T₁ ∩ T₂ ∩ T₃ ∩ T₅ ∩ T₆.

Proof. Define

t̃ = R / (R + τ(β̂ − β₀, f̃ − f_n, R)).

Let β̃ = t̃ β̂ + (1 − t̃) β₀ and write f̆ := t̃ f̃ + (1 − t̃) f_n. The definition of f̆ implies γ̆_n = t̃ γ̃_n + (1 − t̃) γ_n and f̆ ∈ M₁(R), since τ(β̃ − β₀, f̆ − f_n, R) = t̃ τ(β̂ − β₀, f̃ − f_n, R) ≤ R. Thus (β̃ − β₀, f̆ − f_n) ∈ M(R). To show τ(β̂ − β₀, f̃ − f_n, R) ≤ R, it is then sufficient to show τ(β̃ − β₀, f̆ − f_n, R) ≤ R/2. From the definition of (β̂, f̂) and convexity,

E_n[{Ŝ_i − X_iᵀβ̃ − ψ_{k_n}(Z_i)ᵀγ̆_n}²] + λ‖β̃‖₁ ≤ E_n[{Ŝ_i − X_iᵀβ₀ − ψ_{k_n}(Z_i)ᵀγ_n}²] + λ‖β₀‖₁.

Then, by the definition of Ŝ and ι_n,

‖Ψ_n(γ̆_n − γ_n) + X(β̃ − β₀)‖²_n + λ‖β̃‖₁ ≤ 2 (ι_n + (Ŝ − S))ᵀ { Ψ_n(γ̆_n − γ_n) + X(β̃ − β₀) } / n + λ‖β₀‖₁.  (A.2)

First notice that

|(Ŝ − S)ᵀ Ψ_n(γ̆_n − γ_n)/n| ≤ ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ‖γ̆_n − γ_n‖₁ ≤ √k_n ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ‖γ̆_n − γ_n‖₂ ≤ { √k_n / Λ_{min}^{1/2}(Q̂_z) } ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ‖Ψ_n(γ̆_n − γ_n)‖_n,

where Λ_{min}(Q̂_z), the minimum eigenvalue of E(Ψ_nᵀΨ_n/n), is bounded away from 0 by Assumption 7. On the sets T₅ and T₆, and for f̆ ∈ M₁(R),

2 |(Ŝ − S)ᵀ v_n(f̆, β̃)/n| ≤ (λ/4)‖β̃ − β₀‖₁ + R²/96.  (A.3)

By the Cauchy-Schwarz inequality,

2 |ι*_nᵀ ṽ_n(Z, f̆)/n| ≤ 2 ‖ι*_n‖_n ‖ṽ_n(Z, f̆)‖_n ≤ 2 ‖ι*_n‖²_n + (1/2)‖ṽ_n(Z, f̆)‖²_n.

Therefore,

2 ι_nᵀ Ψ_n(γ̆_n − γ_n)/n + 2 ι_nᵀ Π_{n,X|Z}(β̃ − β₀)/n + 2 ι_nᵀ X̃(β̃ − β₀)/n = 2 ι*_nᵀ ṽ_n(Z, f̆)/n + 2 ι_nᵀ ṽ_n(X, β̃)/n ≤ 2 ‖ι*_n‖²_n + (1/2)‖ṽ_n(Z, f̆)‖²_n + 2 ι_nᵀ ṽ_n(X, β̃)/n.  (A.4)

Then (A.2), (A.3), and (A.4), together with the orthogonal decomposition ‖v_n(f̆, β̃)‖²_n = ‖ṽ_n(X, β̃)‖²_n + ‖ṽ_n(Z, f̆)‖²_n ≥ ‖ṽ_n(Z, f̆)‖²_n, imply that

‖v_n(f̆, β̃)‖²_n + 2λ‖β̃‖₁ ≤ 4 ‖ι*_n‖²_n + 4 ι_nᵀ ṽ_n(X, β̃)/n + (λ/2)‖β̃ − β₀‖₁ + R²/48 + 2λ‖β₀‖₁.  (A.5)

On the event T₃, |ι_nᵀ ṽ_n(X, β̃)/n| ≤ (λ/8)‖β̃ − β₀‖₁, so (A.5) becomes

‖v_n(f̆, β̃)‖²_n + 2λ‖β̃‖₁ ≤ 4 ‖ι*_n‖²_n + λ‖β̃ − β₀‖₁ + R²/48 + 2λ‖β₀‖₁.  (A.6)

Because ‖β̃ − β₀‖₁ = ‖β̃_{S₀} − β₀,_{S₀}‖₁ + ‖β̃_{S₀ᶜ}‖₁ and ‖β̃‖₁ = ‖β̃_{S₀}‖₁ + ‖β̃_{S₀ᶜ}‖₁ ≥ ‖β₀,_{S₀}‖₁ − ‖β̃_{S₀} − β₀,_{S₀}‖₁ + ‖β̃_{S₀ᶜ}‖₁, (A.6) further implies, after cancelling 2λ‖β₀,_{S₀}‖₁ = 2λ‖β₀‖₁ and using |‖v_n(f̆, β̃)‖²_n − ‖v_n(f̆, β̃)‖²_{P,2}| ≤ R²/96 on T₁, that

‖v_n(f̆, β̃)‖²_{P,2} + λ‖β̃_{S₀ᶜ}‖₁ ≤ 3λ‖β̃_{S₀} − β₀,_{S₀}‖₁ + 4 ‖ι*_n‖²_n + R²/32.  (A.7)

By Assumption 5(ii), and taking 144λ²s₀/Λ²_X(s₀) ≤ R²,

3λ‖β̃_{S₀} − β₀,_{S₀}‖₁ ≤ 3λ√s₀ ‖β̃_{S₀} − β₀,_{S₀}‖₂ ≤ 3λ√s₀ ‖X̃(β̃ − β₀)‖_{P,2}/Λ_X(s₀) ≤ 9λ²s₀/Λ²_X(s₀) + ‖X̃(β̃ − β₀)‖²_{P,2}/4 ≤ R²/16 + ‖v_n(f̆, β̃)‖²_{P,2}/4,  (A.8)

where the third inequality follows because 3ab ≤ 9a² + b²/4. Adding λ‖β̃_{S₀} − β₀,_{S₀}‖₁ to both sides of (A.7), and using 4ab ≤ 8a² + b²/2 together with 8λ²s₀/Λ²_X(s₀) ≤ R²/18, yields

‖v_n(f̆, β̃)‖²_{P,2} + λ‖β̃ − β₀‖₁ ≤ 4λ‖β̃_{S₀} − β₀,_{S₀}‖₁ + 4 ‖ι*_n‖²_n + R²/32 ≤ R²/18 + ‖v_n(f̆, β̃)‖²_{P,2}/2 + 4 ‖ι*_n‖²_n + R²/32.

On the set T₂, 4‖ι*_n‖²_n ≤ R²/96, so that

λ‖β̃ − β₀‖₁ ≤ R²/18 + R²/96 + R²/32 ≤ R²/10, i.e., ‖β̃ − β₀‖₁ ≤ R²/(10λ).

On the other hand, substituting (A.8) into (A.7) and combining with 4‖ι*_n‖²_n ≤ R²/96 yields

(3/4)‖v_n(f̆, β̃)‖²_{P,2} ≤ R²/16 + R²/96 + R²/32, so that ‖v_n(f̆, β̃)‖_{P,2} = ‖f̆ − f_n + X(β̃ − β₀)‖_{P,2} ≤ 2R/5.

As a result,

τ(β̃ − β₀, f̆ − f_n, R) = R⁻¹λ‖β̃ − β₀‖₁ + ‖f̆ − f_n + X(β̃ − β₀)‖_{P,2} ≤ R/10 + 2R/5 = R/2.

Lemma 3.
Lemma 3. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p \wedge 1/k_n)$.

Proof. First notice that $\tau(\beta, f_n, R_1) \le R_1$ implies that for $(\beta, f_n) \in \mathcal{M}(R_1)$, $\|\tilde X\beta + f_n\|_{P,2} \le R_1$ and $\|\beta\|_1 \le R_1^2/\lambda$, which further implies $\|\bar X\beta\|_{P,2} \le R_1$ and $\|\Pi_{X|Z}^\top\beta + f_n\|_{P,2} \le R_1$. By Assumption 5, we then have
\[
\|X\beta\|_{P,2} \le \|\bar X\beta\|_{P,2} + \|\Pi_{X|Z}^\top\beta\|_{P,2}
\le R_1 + \frac{\Lambda_{\max}(\Sigma_\Pi)}{\Lambda_{\min}(\Sigma_{\bar X})}\|\bar X\beta\|_{P,2}
\le \Big(1 + \frac{\Lambda_{\max}(\Sigma_\Pi)}{\Lambda_{\min}(\Sigma_{\bar X})}\Big) R_1 := R_2 ,
\]
and
\[
\|f_n\|_{P,2} \le \|f_n + \Pi_{X|Z}^\top\beta\|_{P,2} + \|\Pi_{X|Z}^\top\beta\|_{P,2}
\le \Big(1 + \frac{\Lambda_{\max}(\Sigma_\Pi)}{\Lambda_{\min}(\Sigma_{\bar X})}\Big) R_1 = R_2 .
\tag{A.9}
\]
For any $(\beta, f_n) \in \mathcal{M}(R_1)$, consider the decomposition
\[
\big| \|X\beta + f_n\|_n^2 - \|X\beta + f_n\|_{P,2}^2 \big|
\le \underbrace{\big|(\mathbb{E}_n - \mathbb{E})\|X_i^\top\beta\|^2\big|}_{(A)}
+ \underbrace{\big|(\mathbb{E}_n - \mathbb{E})\|\psi^{k_n}(Z_i)^\top\gamma_n\|^2\big|}_{(B)}
+ 2\underbrace{\big|(\mathbb{E}_n - \mathbb{E})[\beta^\top X_i\,\psi^{k_n}(Z_i)^\top\gamma_n]\big|}_{(C)} .
\]
For Term (A), let $\Sigma_X = \mathbb{E}[X_i X_i^\top]$. For all $(\beta, f_n) \in \mathcal{M}(R_1)$,
\[
(A) = \Big| \frac{1}{n}\sum_{i=1}^n \beta^\top X_i X_i^\top\beta - \beta^\top\Sigma_X\beta \Big|
\le \big\|(\mathbb{E}_n - \mathbb{E}) X_i X_i^\top\big\|_\infty \|\beta\|_1^2 .
\]
With probability at least $1 - 1/p$, $\|(\mathbb{E}_n - \mathbb{E}) X_i X_i^\top\|_\infty \lesssim \sqrt{\log p/n}$ because the $X_i$'s are sub-Gaussian. Thus, from Assumption 9, with probability at least $1 - 1/p$,
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})\|X_i^\top\beta\|^2\big|
\le C\sqrt{\frac{\log p}{n}}\,\frac{R_1^4}{\lambda^2} \le CR_1^2 .
\]
For Term (B), from Assumption 7, $Q_Z := \mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top]$ can be normalized to $I_{k_n}$. Then
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})f_n^2\big|
= \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \Big| \frac{1}{n}\sum_{i=1}^n \gamma_n^\top\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top\gamma_n - \gamma_n^\top\gamma_n \Big|
\le \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big\|(\mathbb{E}_n - \mathbb{E})\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top\big\|\,\|\gamma_n\|^2 ,
\]
where $\|\gamma_n\| = \|f_n\|_{P,2}$ is bounded by $R_2$ following (A.9). Moreover, by a Bernstein-type inequality for random matrices (Theorem 6.1 in Tropp (2012); see also Theorem 4.3 in van de Geer (2014), Theorem 4.1 in Chen and Christensen (2015), or Lemma 6.2 in Belloni et al. (2015)), we have
\[
P\Bigg( \big\|(\mathbb{E}_n - \mathbb{E})\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top\big\|
> C\Bigg( \xi(k_n)\sqrt{\frac{\log k_n + t}{n}} + \xi^2(k_n)\,\frac{\log k_n + t}{n} \Bigg) \Bigg) \le \exp(-t).
\]
By taking $t = \log k_n$, with probability at least $1 - O(1/k_n)$,
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})f_n^2\big| \le CR_1^2
\]
by Assumption 7(ii). For Term (C), we have
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})[\beta^\top X_i f_n(Z_i)]\big|
= \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \Big| \frac{1}{n}\sum_{i=1}^n \beta^\top X_i\psi^{k_n}(Z_i)^\top\gamma_n - \mathbb{E}\big[\beta^\top X_i\psi^{k_n}(Z_i)^\top\gamma_n\big] \Big|
\]
\[
\le \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \|\beta\|_1\|\gamma_n\|_1\big\|(\mathbb{E}_n - \mathbb{E})X_i\psi^{k_n}(Z_i)^\top\big\|_\infty
\le \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \|\beta\|_1\sqrt{k_n}\,\|\gamma_n\|\,\big\|(\mathbb{E}_n - \mathbb{E})X_i\psi^{k_n}(Z_i)^\top\big\|_\infty .
\]
Note that $|X_{ij}\psi^{k_n}_m(Z_i)| \le \xi(k_n)|X_{ij}|$ for $j = 1,\dots,p$ and $m = 1,\dots,k_n$. Thus, Lemma 14.15 in B\"uhlmann and van de Geer (2011) implies that, given $X$,
\[
P\Bigg( \max_{1\le j\le p}\max_{1\le m\le k_n} \big|(\mathbb{E}_n - \mathbb{E})X_{ij}\psi^{k_n}_m(Z_i)\big|
\ge \max_{1\le j\le p}\sqrt{\frac{\xi^2(k_n)\sum_{i=1}^n X_{ij}^2}{n^2}}\,
\sqrt{2\Big(t + \frac{2\log 2p}{n}\Big)} \Bigg) \le \exp(-nt).
\]
Because $X$ is sub-Gaussian, letting $t = \log(2p)/n$ gives
\[
P\Bigg( \max_{1\le j\le p}\max_{1\le m\le k_n} \big|(\mathbb{E}_n - \mathbb{E})X_{ij}\psi^{k_n}_m(Z_i)\big|
\ge \sqrt{K_X}\,\xi(k_n)\sqrt{\frac{2\log 2p}{n}} \Bigg) \le p^{-1}.
\]
Because for $(\beta, f_n) \in \mathcal{M}(R_1)$, $\|\beta\|_1 \le R_1^2/\lambda$ and $\|\gamma_n\| = \|f_n\|_{P,2} \le R_2$, it follows that $\|\beta\|_1\|\gamma_n\| \lesssim R_1^2 R_2/\lambda \lesssim R_2$ by Assumption 9(ii). Then, combining with Assumption 7(iii), we have that with probability at least $1 - O(1/p)$,
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})[\beta^\top X_i f_n(Z_i)]\big| \lesssim R_1^2 .
\]
The conclusion follows from Assumption 5.
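The sup-norm concentration driving Terms (A)–(C) can be illustrated numerically. A minimal sketch in our own toy setup (i.i.d. standard Gaussian design, not the paper's assumptions): the entrywise deviation of the sample second-moment matrix from its population value scales like $\sqrt{\log p/n}$.

```python
import numpy as np

# Toy check (not the paper's code): for i.i.d. N(0, I_p) regressors, the
# entrywise deviation ||(E_n - E)[X_i X_i^T]||_inf is within a modest
# constant multiple of sqrt(log p / n), the rate invoked for Term (A).
rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
emp = X.T @ X / n                       # E_n[X_i X_i^T]
dev = np.max(np.abs(emp - np.eye(p)))   # sup-norm deviation; E[X_i X_i^T] = I_p
rate = np.sqrt(np.log(p) / n)
print(dev, rate)
```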
Lemma 4. Suppose that Assumptions 1–9 are satisfied. For a sequence $\kappa_n \to \infty$ as $n \to \infty$ such that $\kappa_n$ does not depend on $p$ or $k_n$, the set $T$ has probability at least $1 - O(1/\kappa_n + 1/k_n)$.

Proof. We have
\[
\|\iota_n^*\|_n^2 \le 2\epsilon^\top P_Z\epsilon/n + 2 r_n^\top P_Z r_n/n .
\]
For the first term, note that
\[
\epsilon^\top P_Z\epsilon/n
= \big\|(\Psi_n^\top\Psi_n/n)^{-1/2}\Psi_n^\top\epsilon/n\big\|^2
= \big\|\widehat Q_z^{-1/2}\,\mathbb{E}_n[\psi^{k_n}(Z_i)\epsilon_i]\big\|^2
\le \|\widehat Q_z^{-1/2}\|^2\,\big\|\mathbb{E}_n[\psi^{k_n}(Z_i)\epsilon_i]\big\|^2 ,
\]
where all eigenvalues of $\widehat Q_z$ are bounded away from zero with probability at least $1 - 1/k_n$ by the matrix Bernstein inequality. Moreover,
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i)\epsilon_i]\|^2\big]
= \mathbb{E}\big[\epsilon_i^2\,\psi^{k_n}(Z_i)^\top\psi^{k_n}(Z_i)/n\big]
\lesssim \mathbb{E}\big[\psi^{k_n}(Z_i)^\top\psi^{k_n}(Z_i)/n\big]
= \mathrm{tr}\big(\mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top/n]\big)
= \mathrm{tr}(I_{k_n}/n) = k_n/n .
\]
For the second term, $r_n^\top P_Z r_n/n = \|\widehat Q_z^{-1/2}\,\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2$, and
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big]
= \frac{1}{n}\sum_{k=1}^{k_n}\mathbb{E}[\psi_k(Z_i)^2 r_{ni}^2]
\le \Big(\frac{\ell_{k_n} c_{k_n}}{\sqrt n}\Big)^2\mathbb{E}\big[\|\psi^{k_n}(Z_i)\|^2\big]
= \frac{\ell_{k_n}^2 c_{k_n}^2 k_n}{n}
\]
or
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big]
\le \frac{1}{n}\mathbb{E}[\xi^2(k_n) r_{ni}^2]
\le \frac{\xi^2(k_n) c_{k_n}^2}{n},
\]
so we have
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big]
\le \min\Bigg( \frac{\ell_{k_n}^2 c_{k_n}^2 k_n}{n},\ \frac{\xi^2(k_n) c_{k_n}^2}{n} \Bigg).
\]
From the triangle inequality, there exists a constant $C$ such that $\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big] \le CR_1^2/\kappa_n$, and by the Markov inequality,
\[
P\Big( \big\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\big\|^2 > R_1^2 \Big) \le C/\kappa_n .
\]
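The approximation-error magnitude $c_{k_n}$ appearing in the bounds above can be illustrated with a small sieve regression. This is our own toy setup (polynomial basis and target function are our choices, not from the paper): the empirical $L_2$ residual $r_{ni}$ of projecting a smooth function on a $k$-term sieve shrinks as $k$ grows.

```python
import numpy as np

# Toy illustration (our assumptions): the L2 residual of the best k-term
# sieve approximation of a smooth f decays in k, which is what the rate
# c_{k_n} in Lemma 4 quantifies.
rng = np.random.default_rng(6)
n = 5000
Z = rng.uniform(-1.0, 1.0, n)
f = np.exp(Z)                                        # smooth target f(Z_i)

def l2_residual(k):
    Psi = np.vander(Z, k, increasing=True)           # sieve basis psi^k(Z_i)
    gamma, *_ = np.linalg.lstsq(Psi, f, rcond=None)  # least-squares projection
    return np.sqrt(np.mean((f - Psi @ gamma) ** 2))  # empirical L2 error

res = [l2_residual(k) for k in (2, 4, 8)]
print(res)
```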
Lemma 5. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p + 1/k_n)$.

Proof. By the definition of $\iota_n$,
\[
|\iota_n^\top\tilde v_n(X,\tilde\beta)/n|
= |(\epsilon + r_n)^\top\tilde X(\tilde\beta - \beta_0)/n|
\le \|(\epsilon + r_n)^\top\tilde X/n\|_\infty\|\tilde\beta - \beta_0\|_1 .
\]
By Lemma 6.2 in Belloni et al. (2015), all eigenvalues of $\widehat Q_Z$ are bounded away from zero under Assumption 7 on the set $T$ with probability at least $1 - 1/k_n$. Notice that $\tilde X_i = (I - P_Z)X_i$. As $X_i$ is sub-Gaussian, its moment generating function satisfies
\[
\mathbb{E}\big[\exp\big(s(I - P_Z)X_{ij}\big)\big]
\le \mathbb{E}\big[\exp\big(s(1 - \Lambda_{\min}(\widehat Q_Z))X_{ij}\big)\big]
\le \exp\big(K_X^2 s^2(1 - \Lambda_{\min}(\widehat Q_Z))^2/2\big).
\]
Thus $\tilde X_{ij}$ is also sub-Gaussian, so Assumption 4 implies that $\max_{1\le i\le n}\max_{1\le j\le p}\mathbb{E}[(\tilde X_{ij}\epsilon_i)^2]$ is bounded from above. From Lemmas E.1 and E.2 of Chernozhukov, Chetverikov and Kato (2017),
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n \epsilon_i\tilde X_{ij} \Big| > C\sqrt{\frac{t + \log p}{n}} \Bigg) \le \exp(-t).
\]
Because $\lambda \gtrsim C\sqrt{\log p/n}$, we have with probability at least $1 - 1/p$,
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n \epsilon_i\tilde X_{ij} \Big| \le \lambda/16 .
\]
For a constant $K_{\tilde X} \ge \mathbb{E}\|\tilde X_j\|_n$,
\[
P\Big( \max_{1\le j\le p}\|\tilde X_j\|_n \ge K_{\tilde X} \Big)
\le P\Bigg( \max_{1\le j\le p}\|\tilde X_j\|_n \ge \mathbb{E}\|\tilde X_j\|_n + K_{\tilde X}\sqrt{\frac{\log p}{n}} \Bigg)
\le (2p)^{-1}.
\]
Next, note that $\max_{1\le i\le n}\max_{1\le j\le p}|r_{ni}\tilde X_{ij}| \le \max_{1\le i\le n}\max_{1\le j\le p}\ell_{k_n}c_{k_n}|\tilde X_{ij}|$, so by Lemma 14.15 of B\"uhlmann and van de Geer (2011),
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r_{ni}\tilde X_{ij} \Big|
\ge \max_{1\le j\le p}\sqrt{\ell_{k_n}^2 c_{k_n}^2\sum_{i=1}^n\tilde X_{ij}^2/n}\,\sqrt{t + 2\log p/n} \Bigg) \le \exp(-nt).
\]
Taking $t = \log p/n$ yields that with probability at least $1 - (2p)^{-1} - 1/p$,
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r_{ni}\tilde X_{ij} \Big|
\le \sqrt{K_{\tilde X}}\,\ell_{k_n}c_{k_n}\sqrt{\log p/n}.
\]
Thus, when $\lambda \gtrsim C\sqrt{\log p/n}$ and because $\ell_{k_n}c_{k_n} = o(1)$, we have
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n (\epsilon_i + r_{ni})\tilde X_{ij} \Big| \le \lambda/8,
\]
which further implies that $|\iota_n^\top\tilde v_n(X,\tilde\beta)/n| \le \frac{\lambda}{8}\|\tilde\beta - \beta_0\|_1$.
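The penalty-level calculation above — $\lambda \gtrsim \sqrt{\log p/n}$ dominating the maximal score $\max_j |n^{-1}\sum_i \epsilon_i\tilde X_{ij}|$ — can be checked in a toy Gaussian design (our setup, not the paper's data-generating process):

```python
import numpy as np

# Toy check (Gaussian design and noise are our assumptions): the maximal
# score max_j |n^{-1} sum_i eps_i X_ij| concentrates at the sqrt(log p / n)
# scale that motivates the choice of lambda in Lemma 5.
rng = np.random.default_rng(1)
n, p = 5000, 200
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)
max_score = np.max(np.abs(X.T @ eps / n))
rate = np.sqrt(np.log(p) / n)
print(max_score, rate)
```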
Lemma 6. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p)$.

Proof. For $\widehat\Sigma_{\tilde X} = \mathbb{E}_n[\tilde X_i\tilde X_i^\top]$ and $\Sigma_{\tilde X} = \mathbb{E}[\tilde X_i\tilde X_i^\top]$, we have
\[
\frac{\delta^\top\widehat\Sigma_{\tilde X}\delta}{\|\delta_S\|^2}
= \frac{\delta^\top\Sigma_{\tilde X}\delta + \delta^\top(\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X})\delta}{\|\delta_S\|^2}
\ge \Lambda_X^2(s) - \Bigg| \frac{\|\delta\|_1^2\,\|\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X}\|_\infty}{\|\delta_S\|^2} \Bigg|
\ge \Lambda_X^2(s) - Cs\|\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X}\|_\infty ,
\]
where the first inequality follows from the definition of $\Lambda_X(s)$ and the second follows from $\|\delta\|_1^2 \le Cs\|\delta_S\|^2$. Since, with probability at least $1 - 1/p$,
\[
\|\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X}\|_\infty
\le \|(\mathbb{E}_n - \mathbb{E})\tilde X_i\tilde X_i^\top\|_\infty
+ \|\mathbb{E}[\tilde X_i\tilde X_i^\top - \bar X_i\bar X_i^\top]\|_\infty
= O_p\big(\sqrt{\log p/n}\big),
\]
where the equality follows from the Bernstein inequality applied as in Lemma 5 and from Assumption 7(iii). Because $s\sqrt{\log p/n} = o(1)$, for large enough $n$ we have $\Lambda_{X,n}^2(s) \ge \Lambda_X^2(s)/2$ with probability at least $1 - 1/p$.

Lemma 7. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p)$ and $P(T) = 1 - O(1/p)$.

Proof. The proofs of the two statements are similar, and thus we focus on the first set $T$ below.
From the definition of $S_i$,
\[
\widehat S_i - S_i
= \widehat\rho_i\big(\Delta Y_i - (1-\widehat\pi_i)\widehat\Phi_0(W_i) - \widehat\pi_i\widehat\Phi_1(W_i)\big)
- \rho_i\big(\Delta Y_i - (1-\pi_i)\Phi_0(W_i) - \pi_i\Phi_1(W_i)\big)
\]
\[
= \underbrace{D_i(\Delta Y_i - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)}_{E_1}
+ \underbrace{(1-D_i)(\Delta Y_i - \Phi_0(W_i))\Big(\frac{1}{1-\widehat\pi_i} - \frac{1}{1-\pi_i}\Big)}_{E_2}
+ \underbrace{(\widehat\Phi_0(W_i) - \Phi_0(W_i))\Big(1 - \frac{D_i}{\pi_i}\Big)}_{E_3}
\]
\[
- \underbrace{(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(1 - \frac{1-D_i}{1-\pi_i}\Big)}_{E_4}
+ \underbrace{D_i(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)}_{E_5}
- \underbrace{(1-D_i)(\widehat\Phi_0(W_i) - \Phi_0(W_i))\Big(\frac{1}{1-\pi_i} - \frac{1}{1-\widehat\pi_i}\Big)}_{E_6}.
\]
Here $\widehat\Phi_0(\cdot)$, $\widehat\Phi_1(\cdot)$ and $\widehat\pi(\cdot)$ are the first-stage estimates, and $\widehat S_i$ is evaluated with $\widehat\beta$ and $\widehat f$. Similarly to Lemma 5, for $(\beta, f) \in \mathcal{M}(R_1)$,
\[
|(\widehat S - S)^\top\tilde v_n(X,\beta)/n|
= |(\widehat S - S)^\top\tilde X(\beta - \beta_0)/n|
\le \|(\widehat S - S)^\top\tilde X/n\|_\infty\|\beta - \beta_0\|_1 .
\]
For each component of $\tilde X_i$,
\[
\mathbb{E}_n\Big[\tilde X_{ij}D_i(\Delta Y_i - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\Big]
= \mathbb{E}_n\Big[\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\tilde X_{ij}D_i\epsilon_{1,i}\Big].
\]
Since $\epsilon_{1,i}$ is independent of $(1/\widehat\pi_i - 1/\pi_i)$ due to sample splitting, and by Assumption 4, we have $\mathbb{E}[(1/\widehat\pi_i - 1/\pi_i)\tilde X_{ij}D_i\epsilon_{1,i}] = 0$. Define $r^\pi_{ni} = 1/\widehat\pi_i - 1/\pi_i$. Lemma 14.15 in B\"uhlmann and van de Geer (2011) implies that for a constant $C_\pi$,
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\tilde X_{ij}D_i\epsilon_{1,i} \Big|
\ge C_\pi\max_{1\le i\le n}|r^\pi_{ni}|\max_{1\le j\le p}
\sqrt{\frac{1}{n}\sum_{i=1}^n\tilde X_{ij}^2D_i\epsilon_{1,i}^2}\,
\sqrt{\frac{t + 2\log p}{n}} \Bigg) \le \exp(-nt),
\]
where with probability bigger than $1 - 1/p$, for a sufficiently large constant $C$,
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n \big(\tilde X_{ij}^2D_i\epsilon_{1,i}^2 - \mathbb{E}[\tilde X_{ij}^2D_i\epsilon_{1,i}^2]\big) \Big| \le C\sqrt{\frac{\log p}{n}}.
\]
Hence
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\tilde X_{ij}D_i\epsilon_{1,i} \Big|
\lesssim \max_{1\le i\le n}|r^\pi_{ni}|\sqrt{\frac{\log p}{n}}
\]
with probability approaching one. From Assumption 8, $\max_{1\le i\le n}|1/\widehat\pi_i - 1/\pi_i| = O_p(1)$. Thus, taking $t = \log p$, with probability at least $1 - 1/p$, $\|\mathbb{E}_n[E_1\tilde X_i]\|_\infty \lesssim \sqrt{\log p/n}$.

Next define $r^\Phi_{ni} = \widehat\Phi_0(W_i) - \Phi_0(W_i)$. From sample splitting,
\[
\mathbb{E}\Big[ r^\Phi_{ni}\tilde X_{ij}\Big(1 - \frac{D_i}{\pi_i}\Big) \Big]
= \mathbb{E}\Big[ r^\Phi_{ni}\tilde X_{ij}\Big(1 - \frac{\mathbb{E}(D_i \mid X_i, Z_i, S_i)}{\pi_i}\Big) \Big] = 0.
\]
With a constant $C_\Phi$, we can then apply the same bound:
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r^\Phi_{ni}\tilde X_{ij}\Big(1 - \frac{D_i}{\pi_i}\Big) \Big|
\ge C_\Phi\max_{1\le i\le n}|r^\Phi_{ni}|\max_{1\le j\le p}
\sqrt{\frac{1}{n}\sum_{i=1}^n\tilde X_{ij}^2\Big(1 - \frac{D_i}{\pi_i}\Big)^2}\,
\sqrt{\frac{t + 2\log p}{n}} \Bigg) \le \exp(-nt).
\]
From Assumption 8, $\|\mathbb{E}_n[E_3\tilde X_i]\|_\infty = O_p(\sqrt{\log p/n})$. Lastly, consider
\[
\mathbb{E}\Big| \mathbb{E}_n\Big[ \tilde X_{ij}D_i(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big) \Big] \Big|
\le \mathbb{E}_n\Bigg[ \Big|(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\Big| \Bigg]\cdot\mathbb{E}_n\big[|\tilde X_{ij}D_i|\big]
= O_p(\log p/n),
\]
where the last equality follows as $\mathbb{E}_n[|\tilde X_{ij}D_i|] = O_p(1)$. Similar results can be derived for $E_2$, $E_4$ and $E_6$.
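The score $S_i$ whose estimation error is being controlled in Lemma 7 can be made concrete with a small simulation. This sketch is our reading of the definition recalled later in the proof of Theorem 2, $S_i = \rho_i(\Delta Y_i - (1-\pi_i)\Phi_0(W_i) - \pi_i\Phi_1(W_i))$, with $\rho_i = (D_i - \pi_i)/\{\pi_i(1-\pi_i)\}$ an assumed inverse-propensity weight (not verbatim from the paper): under a homogeneous effect $\tau$, the score averages to $\tau$.

```python
import numpy as np

# Hedged sketch: doubly robust transformed outcome with known propensity
# scores; the rho_i form is an assumption for illustration.
rng = np.random.default_rng(2)
n = 1000
pi = rng.uniform(0.2, 0.8, n)          # propensity scores pi_i
D = rng.binomial(1, pi)                # treatment indicator D_i
Phi0 = rng.standard_normal(n)          # control pseudo-outcome Phi_0(W_i)
tau = 1.0                              # homogeneous treatment effect
Phi1 = Phi0 + tau                      # treated pseudo-outcome Phi_1(W_i)
dY = np.where(D == 1, Phi1, Phi0) + 0.1 * rng.standard_normal(n)
rho = (D - pi) / (pi * (1 - pi))       # IPW weight (assumed form)
S = rho * (dY - (1 - pi) * Phi0 - pi * Phi1)
print(S.mean())                        # close to tau = 1
```

A short calculation shows $\mathbb{E}[S_i \mid \pi_i] = \tau$ under either branch of $D_i$, which is the double-robustness property the paper exploits.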
Lemma 8. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p)$.

Proof. Similarly to Lemma 7, consider the interaction of each component of $\widehat S_i - S_i$ with $\psi^{k_n}_j(Z_i)$ for $j = 1,\dots,k_n$:
\[
\mathbb{E}_n\Big[ \psi^{k_n}_j(Z_i)D_i(\Delta Y_i - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big) \Big]
= \mathbb{E}_n\Big[ \Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i} \Big].
\]
By sample splitting and Assumption 4, we have $\mathbb{E}[(1/\widehat\pi_i - 1/\pi_i)\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i}] = 0$. Again, Lemma 14.15 in B\"uhlmann and van de Geer (2011) implies that for a constant $C_\pi$,
\[
P\Bigg( \max_{1\le j\le k_n} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i} \Big|
\ge C_\pi\max_{1\le i\le n}|r^\pi_{ni}|\max_{1\le j\le k_n}
\sqrt{\frac{1}{n}\sum_{i=1}^n\psi^{k_n}_j(Z_i)^2D_i\epsilon_{1,i}^2}\,
\sqrt{\frac{t + 2\log k_n}{n}} \Bigg) \le \exp(-nt),
\]
where
\[
\max_{1\le j\le k_n} \Big| \frac{1}{n}\sum_{i=1}^n \big(\psi^{k_n}_j(Z_i)^2D_i\epsilon_{1,i}^2 - \mathbb{E}[\psi^{k_n}_j(Z_i)^2D_i\epsilon_{1,i}^2]\big) \Big|
= O_p\Bigg( \sqrt{\frac{\log k_n}{n}} \Bigg).
\]
So we have
\[
\max_{1\le j\le k_n} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i} \Big|
\lesssim \max_{1\le i\le n}|r^\pi_{ni}|\sqrt{\frac{\log k_n}{n}}
\]
with probability approaching one. From Assumption 8, $\max_{1\le i\le n}|1/\widehat\pi_i - 1/\pi_i| = O_p(1)$. Thus, taking $t = \log p$, with probability at least $1 - 1/p$, $\sqrt{k_n}\,\|\mathbb{E}_n[E_1\psi^{k_n}(Z_i)]\|_\infty \lesssim \sqrt{k_n\log k_n/n}$. The terms involving $E_2$–$E_6$ can be bounded by $\sqrt{k_n\log k_n/n}$ similarly, as shown in Lemma 7 with Assumption 8, and thus we omit their proof.

A.3. Proof of Theorem 2
Proof. Consider the following decomposition: for $\widehat\Sigma_{\tilde X} := \mathbb{E}_n[\tilde X_i\tilde X_i^\top]$,
\[
\widehat T - \xi^\top\beta_0
= \xi^\top(\widehat\beta - \beta_0) - \widehat w^\top\mathbb{E}_n\big[(\widehat S_i - X_i^\top\widehat\beta - \widehat f(Z_i))\tilde X_i\big]
\]
\[
= \xi^\top(\widehat\beta - \beta_0)
- \widehat w^\top\mathbb{E}_n\big[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i\big]
- \widehat w^\top\mathbb{E}_n\big[(\widehat S_i - S_i)\tilde X_i\big]
+ \widehat w^\top\mathbb{E}_n\big[\big(X_i^\top(\widehat\beta - \beta_0) - \widehat f(Z_i) + f(Z_i)\big)\tilde X_i\big]
\]
\[
= -\widehat w^\top\mathbb{E}_n\big[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i\big]
- \widehat w^\top\mathbb{E}_n\big[(\widehat S_i - S_i)\tilde X_i\big]
+ (\widehat\Sigma_{\tilde X}\widehat w + \xi)^\top(\widehat\beta - \beta_0)
+ \widehat w^\top\mathbb{E}_n\big[\big(\Pi_{n,X_i|Z_i}^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big)\tilde X_i\big]
:= -I_1 - I_2 + I_3 + I_4 .
\]
Recall that $\widehat S_i = \widehat\rho_i(\Delta Y_i - (1-\widehat\pi_i)\widehat\Phi_0(W_i) - \widehat\pi_i\widehat\Phi_1(W_i))$ and $S_i = \rho_i(\Delta Y_i - (1-\pi_i)\Phi_0(W_i) - \pi_i\Phi_1(W_i))$. We now analyze the four terms in the last expression one by one.

For the first term,
\[
I_1 - w^\top\mathbb{E}_n[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i]
= (\widehat w - w)^\top\mathbb{E}_n[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i]
\le \|\widehat w - w\|_1\big\|\mathbb{E}_n[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i]\big\|_\infty
= O_p\big(s_w\sqrt{\log p/n}\big)\times O_p\big(\sqrt{\log p/n}\big)
= O_p\big(s_w\log p/n\big),
\]
where $\|\widehat w - w\|_1 = O_p(s_w\sqrt{\log p/n})$ because $\widehat w$ is a Dantzig selector as defined in Theorem 7.1 of Bickel, Ritov and Tsybakov (2009), and $\|\mathbb{E}_n[\epsilon_i\tilde X_i]\|_\infty = O_p(\sqrt{\log p/n})$ by Lemmas E.1 and E.2 of Chernozhukov, Chetverikov and Kato (2017). Then Assumption 11(i) guarantees that the remainder term in $I_1$ is $o_p(n^{-1/2})$.

For the second term,
\[
I_2 \le \|\widehat w\|_1\big\|\mathbb{E}_n[(\widehat S_i - S_i)\tilde X_i]\big\|_\infty
\le \|w\|_1\big\|\mathbb{E}_n[(\widehat S_i - S_i)\tilde X_i]\big\|_\infty ,
\]
where $\|w\|_1 \le s_w$ and $\|\widehat w\|_1 \le \|w\|_1$ by the definition of $\widehat w$. Then Lemma 9 implies that $s_w\|\mathbb{E}_n[(\widehat S_i - S_i)\tilde X_i]\|_\infty = o_p(n^{-1/2})$, so $\sqrt n\,I_2 = o_p(1)$.

For the third term,
\[
I_3 \le \|\widehat\Sigma_{\tilde X}\widehat w + \xi\|_\infty\|\widehat\beta - \beta_0\|_1
\le \lambda'\|\widehat\beta - \beta_0\|_1
= O_p\Big(\frac{s\log p}{n}\Big),
\]
where the last step follows from Theorem 1 and $\lambda' = O_p(\sqrt{\log p/n})$, so $\sqrt n\,I_3 = o_p(1)$ because $s\log p/\sqrt n = o(1)$.

For the last term $I_4$, by construction $\mathbb{E}_n[\Pi_{n,X_i|Z_i}^\top\tilde X_i] = 0$ and $\mathbb{E}_n[\{\widehat f(Z_i) - f_n(Z_i)\}\tilde X_i] = 0$. Moreover, Assumption 11(i) implies that $s_w\max_{1\le j\le p,\,1\le i\le n}|\tilde X_{ij}r_{ni}| = o(n^{-1/2})$, so we have
\[
\sqrt n\,I_4 \le \sqrt n\,\|w\|_1\|\mathbb{E}_n[r_{ni}\tilde X_i]\|_\infty = o_p(1).
\]
Combining the above results for $I_1$–$I_4$ and Assumption 11(ii), we obtain
\[
\widehat T - \xi^\top\beta_0
= -w^\top\mathbb{E}_n[\epsilon_i\tilde X_i] + o_p(n^{-1/2})
= -w^\top\mathbb{E}_n[\epsilon_i\bar X_i] + o_p(n^{-1/2}).
\]
Next, for $\Omega_\beta = \mathbb{E}[\sigma_i^2\bar X_i\bar X_i^\top]$, note that $w = \Sigma_{\bar X}^{-1}\xi$ and $\xi^\top V_\beta\xi = w^\top\Omega_\beta w$, so that
\[
\mathbb{E}\Bigg[ (w^\top\Omega_\beta w)^{-1/2}\frac{1}{\sqrt n}\sum_{i=1}^n w^\top\bar X_i\epsilon_i \Bigg] = 0
\]
because $\mathbb{E}[\epsilon_i\bar X_i] = 0$, and
\[
\mathrm{Var}\Bigg[ (w^\top\Omega_\beta w)^{-1/2}\frac{1}{\sqrt n}\sum_{i=1}^n w^\top\bar X_i\epsilon_i \Bigg] = 1.
\]
We want to verify Lyapunov's condition for the CLT. In particular, we wish to show that
\[
\frac{1}{(w^\top\Omega_\beta w)^{r_\epsilon/2}}\sum_{i=1}^n \mathbb{E}\big[ |w^\top\bar X_i\epsilon_i/\sqrt n|^{r_\epsilon} \big] = o_p(1).
\tag{A.10}
\]
First, because $\|w\|_1 \le s_w$, for $r_\epsilon > 2$,
\[
\sum_{i=1}^n \mathbb{E}\big[ |w^\top\bar X_i\epsilon_i/\sqrt n|^{r_\epsilon} \big]
\le \sum_{i=1}^n \mathbb{E}\big[ \|w\|_1^{r_\epsilon}\|\bar X_i\epsilon_i/\sqrt n\|_\infty^{r_\epsilon} \big]
\le \sum_{i=1}^n \Big(\frac{s_w}{\sqrt n}\Big)^{r_\epsilon}\max_{1\le k\le p}\mathbb{E}\big[ |\bar X_{ik}\epsilon_i|^{r_\epsilon} \big]
= O_p\Bigg( \frac{s_w^{r_\epsilon}}{n^{r_\epsilon/2 - 1}} \Bigg),
\]
where $\max_{1\le k\le p}\mathbb{E}[|\bar X_{ik}\epsilon_i|^{r_\epsilon}] < \infty$ by the Cauchy–Schwarz inequality. Moreover, because
\[
(\xi^\top V_\beta\xi)^{r_\epsilon/2}
\ge \big[\|\xi\|^2\Lambda_{\min}(\Omega_\beta)\Lambda_{\min}(\Sigma_{\bar X}^{-1})^2\big]^{r_\epsilon/2} > 0,
\]
Lyapunov's condition (A.10) holds, so
\[
(w^\top\Omega_\beta w)^{-1/2}\frac{1}{\sqrt n}\sum_{i=1}^n w^\top\bar X_i\epsilon_i \to_d N(0,1),
\]
which implies that
\[
\sqrt n(\widehat T - \xi^\top\beta_0) \to_d N(0, w^\top\Omega_\beta w) = N(0, \xi^\top V_\beta\xi).
\]
Finally, we show that $\widehat V_\beta \to_p V_\beta$. Let $\widehat\Omega_\beta = \frac{1}{n}\sum_{i=1}^n\widehat\epsilon_i^2\tilde X_i\tilde X_i^\top$ and $\tilde\Omega_\beta = \frac{1}{n}\sum_{i=1}^n(\epsilon_i + r_{ni})^2\tilde X_i\tilde X_i^\top$. Note that
\[
|\widehat w^\top\widehat\Omega_\beta\widehat w - \widehat w^\top\tilde\Omega_\beta\widehat w| \le \|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty\|\widehat w\|_1^2, \qquad
|\widehat w^\top\tilde\Omega_\beta\widehat w - \widehat w^\top\Omega_\beta\widehat w| \le \|\tilde\Omega_\beta - \Omega_\beta\|_\infty\|\widehat w\|_1^2,
\]
and $|\widehat w^\top\Omega_\beta\widehat w - w^\top\Omega_\beta w| \le \|\Omega_\beta\|_\infty\|\widehat w - w\|_1$. Thus,
\[
|\widehat V_\beta - V_\beta| = |\widehat w^\top\widehat\Omega_\beta\widehat w - w^\top\Omega_\beta w|
\le \|\widehat w\|_1^2\big( \|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty + \|\tilde\Omega_\beta - \Omega_\beta\|_\infty \big)
+ \|\widehat w - w\|_1\|\Omega_\beta\|_\infty
\le s_w^2\big( \|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty + \|\tilde\Omega_\beta - \Omega_\beta\|_\infty \big) + o_p(1).
\tag{A.11}
\]
We next bound $\|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty$. First note that Lemmas E.1 and E.2 of Chernozhukov, Chetverikov and Kato (2017) imply that
\[
\max_{1\le j\le p}\Big| \frac{1}{n}\sum_{i=1}^n \big(\tilde X_{ij}^2\epsilon_i^2 - \mathbb{E}[\tilde X_{ij}^2\epsilon_i^2]\big) \Big| = O_p\Big(\sqrt{\frac{\log p}{n}}\Big),
\]
so
\[
s_w^2\|\tilde\Omega_\beta - \Omega_\beta\|_\infty = o_p(1)
\tag{A.12}
\]
because of Assumption 11(i). Moreover, because
\[
\|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty
= \Big\| \frac{1}{n}\sum_{i=1}^n \big(\widehat\epsilon_i^2 - (\epsilon_i + r_{ni})^2\big)\tilde X_i\tilde X_i^\top \Big\|_\infty
\le 2\max_{1\le j,k\le p}\Big| \frac{1}{n}\sum_{i=1}^n \tilde X_{ij}\tilde X_{ik}\epsilon_{ni}\big(X_i^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big) \Big|
+ \max_{1\le j,k\le p}\Big| \frac{1}{n}\sum_{i=1}^n \tilde X_{ij}\tilde X_{ik}\big(X_i^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big)^2 \Big|,
\tag{A.13}
\]
where the first term on the RHS of (A.13) is bounded by
\[
2\max_{1\le i\le n}\max_{1\le j,k\le p}|\tilde X_{ij}\tilde X_{ik}|\Bigg(
\Big\|\frac{1}{n}\sum_{i=1}^n(\epsilon_i + r_{ni})X_i\Big\|_\infty\|\widehat\beta - \beta_0\|_1
+ \sup_z|\widehat f(z) - f(z)|\sqrt{\frac{1}{n}\sum_{i=1}^n\epsilon_{ni}^2} \Bigg)
= O_p\Bigg( \log(np)\Bigg( \frac{s\log p}{n} + \sqrt{\frac{\xi^2(k_n)k_n}{n}} + \ell_{k_n}c_{k_n} \Bigg) \Bigg).
\]
Moreover, the second term is bounded by
\[
\max_{1\le i\le n}\max_{1\le j,k\le p}|\tilde X_{ij}\tilde X_{ik}|\,
\frac{1}{n}\sum_{i=1}^n\big(X_i^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big)^2
= O_p\Bigg( \log(np)\Bigg( \frac{s\log p}{n} + \sqrt{\frac{k_n}{n}} + \ell_{k_n}c_{k_n} \Bigg) \Bigg),
\]
where we use the sub-Gaussian property to obtain that $\max_{1\le i\le n}\max_{1\le j,k\le p}|X_{ij}X_{ik}| = O_p((\log(np))^2)$. Thus, with the additional assumption in Theorem 2, we have
\[
s_w^2\|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty = o_p(1).
\tag{A.14}
\]
Then $|\widehat V_\beta - V_\beta| = o_p(1)$ follows from (A.11), (A.12), (A.14), and Assumption 11.
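The bias-correction mechanism driving Theorem 2 can be seen in a stripped-down linear model. In this sketch (our simplifications: no projection on $Z$, $w$ solved exactly rather than via the Dantzig selector, a one-pass soft-threshold standing in for the Lasso, and the sign convention of the textbook debiased Lasso rather than the paper's $S_i$-based one), the one-step correction removes the shrinkage bias of the pilot estimate of $\xi^\top\beta_0$:

```python
import numpy as np

# Hedged sketch of one-step debiasing:
#   T_hat = xi'beta_hat + w' E_n[(y_i - x_i'beta_hat) x_i],  w = (E_n[x x'])^{-1} xi.
# A simplified analogue of the estimator analyzed in Theorem 2, not the
# paper's implementation.
rng = np.random.default_rng(3)
n, p, s = 500, 20, 3
beta0 = np.zeros(p)
beta0[:s] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta0 + rng.standard_normal(n)

lam = 2 * np.sqrt(np.log(p) / n)
beta_hat = X.T @ y / n                                                # marginal pilot fit
beta_hat = np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - lam, 0)  # soft-threshold

xi = np.zeros(p)
xi[0] = 1.0
w = np.linalg.solve(X.T @ X / n, xi)   # exact stand-in for the Dantzig-selector w
T_hat = xi @ beta_hat + w @ (X.T @ (y - X @ beta_hat) / n)
print(T_hat)                           # close to xi'beta0 = 1
```

Because $w$ solves the moment equation exactly here, the estimation error of the pilot cancels and $\widehat T - \xi^\top\beta_0$ reduces to the centered average $w^\top\mathbb{E}_n[x_i\epsilon_i]$, mirroring the asymptotic linearization in the proof.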
A.4. Proof of Theorem 3

Proof. We consider the following decomposition:
\[
\bar f(z) - f_n(z)
= \psi^{k_n}(z)^\top(\widehat\gamma_n - \gamma_n)
- \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ \big(\widehat S_i - X_i^\top\widehat\beta - \psi^{k_n}(Z_i)^\top\widehat\gamma_n\big)(\psi^{k_n}(Z_i) - \widehat MX_i) \big]
\]
\[
= -\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ \big(S_i - X_i^\top\beta_0 - \psi^{k_n}(Z_i)^\top\gamma_n\big)(\psi^{k_n}(Z_i) - \widehat MX_i) \big]
- \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (\widehat S_i - S_i)(\psi^{k_n}(Z_i) - \widehat MX_i) \big]
\]
\[
- \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (\psi^{k_n}(Z_i) - \widehat MX_i)X_i^\top \big](\widehat\beta - \beta_0)
- \psi^{k_n}(z)^\top\big( \widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (\psi^{k_n}(Z_i) - \widehat MX_i)\psi^{k_n}(Z_i)^\top \big] - I_{k_n} \big)(\widehat\gamma_n - \gamma_n)
:= II_1 + II_2 + II_3 + II_4 .
\tag{A.15}
\]
The first term $II_1$ can be further expanded as
\[
-\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(MX_i - \widehat MX_i) \big]
\tag{A.16}
\]
\[
-\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(\psi^{k_n}(Z_i) - MX_i) \big].
\tag{A.17}
\]
We first consider the term (A.16):
\[
\big\| \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n[(\widehat M - M)X_i(r_{ni} + \epsilon_i)] \big\|
\le \|\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\|\sqrt{k_n}\,\big\|\mathbb{E}_n[(\widehat M - M)X_i(r_{ni} + \epsilon_i)]\big\|_\infty
\le \|\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\|\sqrt{k_n}\,\|\widehat M - M\|_\infty\|\mathbb{E}_n[X_i(r_{ni} + \epsilon_i)]\|_\infty
\]
\[
= O_p\Big( \sqrt{k_n}\,s_m\sqrt{(\log k_n + \log p)/n} \Big)\times O_p\Big( \sqrt{\log p/n} \Big),
\]
where $\|\widehat M - M\|_\infty = O_p(s_m\sqrt{(\log k_n + \log p)/n})$ from Lemma 10, $\max_{1\le j\le p,\,1\le i\le n}|X_{ij}r_{ni}| = o_p(n^{-1/2})$ by Assumption 11(i), and $\|\mathbb{E}_n[X_i\epsilon_i]\|_\infty = O_p(\sqrt{\log p/n})$, so the above term is $o_p(n^{-1/2})$ because of Assumption 12(i). Also note that
\[
(A.17) = -\psi^{k_n}(z)^\top\big( \widehat\Sigma_f^{-1} - \Sigma_f^{-1} \big)\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(\psi^{k_n}(Z_i) - MX_i) \big]
- \psi^{k_n}(z)^\top\Sigma_f^{-1}\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(\psi^{k_n}(Z_i) - MX_i) \big],
\]
where
\[
\|\widehat\Sigma_f - \Sigma_f\|
\le \big\| (\mathbb{E}_n - \mathbb{E})[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top] \big\|
\tag{A.18}
\]
\[
+ \big\| \widehat M\mathbb{E}_n[X_iX_i^\top]\widehat M^\top - M\mathbb{E}[X_iX_i^\top]M^\top \big\|.
\tag{A.19}
\]
The first term (A.18) is bounded by
\[
\big\|(\mathbb{E}_n - \mathbb{E})[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top]\big\| = O_p\Big( \sqrt{\xi^2(k_n)\log k_n/n} \Big)
\tag{A.20}
\]
following Lemma 6.2 in Belloni et al. (2015). The second term (A.19) is bounded by
\[
\big\| \widehat M\mathbb{E}_n[X_iX_i^\top]\widehat M^\top - M\mathbb{E}[X_iX_i^\top]M^\top \big\|
\le \sqrt{k_n}\,\big\| \widehat M\mathbb{E}_n[X_iX_i^\top]\widehat M^\top - M\mathbb{E}[X_iX_i^\top]M^\top \big\|_\infty
\]
\[
\le \sqrt{k_n}\,\|\widehat M - M\|_\infty\|M\|_\infty\|\mathbb{E}_n[X_iX_i^\top]\|_\infty
+ \sqrt{k_n}\,\|M\|_\infty^2\,\|(\mathbb{E}_n - \mathbb{E})[X_iX_i^\top]\|_\infty
= O_p\big( s_m\sqrt{k_n\log p/n} \big),
\tag{A.21}
\]
where the first and second inequalities follow from direct calculation and the last equality follows because $\lambda'' = O_p(\sqrt{\log p/n})$ and $\|M\|_\infty \le s_m$. Moreover, because $X_{ij}X_{ik} - \mathbb{E}[X_{ij}X_{ik}]$ is sub-exponential, the Bernstein inequality implies that for some constant $K_X$,
\[
P\Big( \max_{1\le j,k\le p}|(\mathbb{E}_n - \mathbb{E})X_{ij}X_{ik}| > K_X\sqrt{\log 2p/n} \Big) \le 1/p.
\]
Thus, (A.20) and (A.21) imply that
\[
\|\widehat\Sigma_f - \Sigma_f\| = O_p\Big( \sqrt{\xi^2(k_n)\log k_n/n} + s_m\sqrt{k_n\log p/n} \Big) = o_p(1),
\tag{A.22}
\]
where the second equality follows from Assumption 12(i).
Because σ − z ψ k n ( z ) (cid:62) Σ − f G n (cid:15) i ( ψ k n ( Z i ) − M X i ) = O p (1)and σ − z ψ k n ( z ) (cid:62) Σ − f G n r ni ( ψ k n ( Z i ) − M X i ) = O p ( (cid:96) k n c k n (cid:112) k n ) , with (A.22), we have √ nσ − z II = −√ nσ − z ψ k n ( z ) (cid:62) Σ − f E n (cid:110) ( (cid:15) i + r ni ) (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:111) + o p (1) . Similar to the argument in Theorem 2 of Newey (1997), with Assumption 4 and 6, theLindbergh-Feller central limit theorem gives us −√ nσ − z ψ k n ( z ) (cid:62) Σ − f E n (cid:110) (cid:15) i (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:111) → d N (0 , . and Assumption 12(i) implies that −√ nσ − z ψ k n ( z ) (cid:62) Σ − f E n (cid:110) r ni (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:111) = o p (1) . Next we consider the term II . Note that √ nσ − / z ψ k n ( z ) (cid:62) Σ − f E n (cid:104) ( ˆ S i − S i )( ˆ M − M ) X i (cid:105) ≤ (cid:112) k n n (cid:107) σ − / z ψ k n ( z ) (cid:62) Σ − f (cid:107) (cid:107) E n (cid:104) ( ˆ S i − S i ) X i (cid:105) (cid:107) ∞ (cid:107) ˆ M − M (cid:107) = O p ( s m (cid:112) k n (cid:112) log p/n ) = o p (1)and further from Lemma 9. √ nσ − / z ψ k n ( z ) (cid:62) Σ − f E n (cid:104)(cid:16) ˆ S i − S i (cid:17) (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:105) = o p (1)40ext, consider equation II . Consider the following decomposition: (cid:12)(cid:12)(cid:12) σ − / z ψ k n ( z ) (cid:62) ˆΣ − f E n (cid:104) ( ψ k n ( Z i ) − ˆ M X i ) X (cid:62) i (cid:105) ( ˆ β − β ) (cid:12)(cid:12)(cid:12) ≤ (cid:107) − σ − / z ψ k n ( z ) (cid:62) ˆΣ − f (cid:107) · (cid:112) k n (cid:13)(cid:13)(cid:13) E n (cid:104) ( ψ k n ( Z i ) − ˆ M X i ) X (cid:62) i (cid:105) ( ˆ β − β ) (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:107) − σ − / z ψ k n ( z ) (cid:62) ˆΣ − f (cid:107) · (cid:112) k n (cid:13)(cid:13)(cid:13) E n (cid:104) ( ψ k n ( Z i ) − ˆ M X i ) X (cid:62) i (cid:105)(cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13) ( ˆ β − β ) (cid:13)(cid:13)(cid:13) . 
From the definition of ˆ M j , (cid:13)(cid:13)(cid:13) E n (cid:110) ( ψ K n j ( Z i ) − ˆ M (cid:62) j X i ) X (cid:62) i (cid:111)(cid:13)(cid:13)(cid:13) ∞ ≤ λ (cid:48)(cid:48) for all j ; and from The-orem 1, (cid:107) ( ˆ β − β ) (cid:107) = O p ( s (cid:112) log p/n ), thus by choosing λ (cid:48)(cid:48) = O ( (cid:112) log p/n ), II = o p (1)because √ ns m s log p/n = o p (1) by Assumption 12(i).Finally the last term II is 0, since ˆΣ − f E n (cid:110) ( ψ k n ( Z i ) − ˆ M X i ) ψ k n ( Z i ) (cid:62) (cid:111) = I k n .When √ nσ − l k n c k n = o p (1), we have √ nσ − z ( ˜ f ( z ) − f ( z )) = √ nσ − z ( ˜ f ( z ) − f n ( z )) + o p (1) → d N (0 , . Next, we show the consistency of the variance term. Similar to Theorem 4.6 in Belloniet al. (2015), we have (cid:107) ( E n − E )[ σ i ( Z i , X i ) ψ k n ( Z i ) ψ k n ( Z i ) (cid:62) ] (cid:107) = O p (cid:18)(cid:113) ξ ( k n ) log k n /n (cid:19) . For V f = Σ − f Ω f Σ − f and σ z = ψ k n ( z ) (cid:62) V f ψ k n ( z ), (cid:107) ˆ V f − V f (cid:107) (cid:46) (cid:107) ( ˆΣ − f − Σ − f ) ˆΩ f ˆΣ − f (cid:107) + (cid:107) Σ − f ( ˆΩ f − Ω f )Σ − f (cid:107) + (cid:107) Σ − f Ω f ( ˆΣ − f − Σ − f ) (cid:107) . Note that for ˜Ω f = E n (cid:2) σ i ψ k n ( Z i ) ψ k n ( Z i ) (cid:62) (cid:3) − ˆ M E n (cid:2) σ i X i X (cid:62) i (cid:3) ˆ M , (cid:107) ˆΩ f − Ω f (cid:107) ≤ (cid:107) ˆΩ f − ˜Ω f (cid:107) + (cid:107) ˜Ω f − Ω f (cid:107) . 
To bound $\|\hat{\Omega}_f-\tilde{\Omega}_f\|$, note that
\[
\begin{aligned}
\Big\|\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|
&\le\max_{1\le i\le n}\Big|X_i^{\top}(\hat{\beta}-\beta)+\hat{f}(Z_i)-f_n(Z_i)\Big|^2\,\Big\|\mathbb{E}_n\big[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|\\
&\quad+2\max_{1\le i\le n}|\epsilon_i+r_{ni}|\max_{1\le i\le n}\Big|X_i^{\top}(\hat{\beta}-\beta)+\hat{f}(Z_i)-f_n(Z_i)\Big|\,\Big\|\mathbb{E}_n\big[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|\\
&\lesssim_p\|\hat{Q}_z\|\,O_p\Big(\sqrt{\log np}\,\big(n^{1/r_\epsilon}+c_{k_n}\ell_{k_n}\big)\Big(s\sqrt{\log p/n}+\sqrt{\xi^2(k_n)k_n/n}+c_{k_n}\ell_{k_n}\Big)\Big)=o_p(1),
\end{aligned}
\]
where the second inequality follows from the results in Theorem 1 and the last equality follows from the condition in Theorem 3. Similarly,
\[
\begin{aligned}
\Big\|\hat{M}\,\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)X_iX_i^{\top}\big]\hat{M}^{\top}\Big\|
&\le\sqrt{k_n}\,\Big\|\hat{M}\,\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)X_iX_i^{\top}\big]\hat{M}^{\top}\Big\|_\infty\\
&\le\sqrt{k_n}\,\|\hat{M}\|_1^2\,\Big\|\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)X_iX_i^{\top}\big]\Big\|_\infty
=O_p\bigg(\sqrt{k_n}\,s_m^2\bigg(s\sqrt{\frac{\xi^2(k_n)\log p}{n}}+\ell_{k_n}c_{k_n}\bigg)\bigg),
\end{aligned}
\]
so we have $\|\hat{\Omega}_f-\tilde{\Omega}_f\|=o_p(1)$. To show $\|\tilde{\Omega}_f-\Omega_f\|=o_p(1)$, note that
\[
\Big\|(\mathbb{E}_n-\mathbb{E})\big[\sigma_i^2\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|=O_p\bigg((1+\ell_{k_n}c_{k_n})\sqrt{\frac{\xi^2(k_n)\log k_n}{n}}\bigg)
\]
by Theorem 4.6 in Belloni et al. (2015). Because
\[
\max_{1\le j\le p}\bigg|\frac{1}{n}\sum_{i=1}^n\big(X_{ij}^2\sigma_i^2-\mathbb{E}\big[X_{ij}^2\sigma_i^2\big]\big)\bigg|=O_p\bigg(\sqrt{\frac{\log p}{n}}\bigg),
\]
then
\[
\begin{aligned}
\Big\|\hat{M}\,\mathbb{E}_n\big[\sigma_i^2X_iX_i^{\top}\big]\hat{M}^{\top}-M\,\mathbb{E}\big[\sigma_i^2X_iX_i^{\top}\big]M^{\top}\Big\|
&\le\sqrt{k_n}\,\|\hat{M}-M\|_1\,\|M\|_1\,\big\|\mathbb{E}_n\big[\sigma_i^2X_iX_i^{\top}\big]\big\|_\infty+\|M\|_1^2\,\big\|(\mathbb{E}_n-\mathbb{E})\big[\sigma_i^2X_iX_i^{\top}\big]\big\|_\infty\\
&=O_p\big(\sqrt{k_n}\,s_m\sqrt{\log p/n}\big)=o_p(1).
\end{aligned}
\]
The conclusion follows from the triangle inequality.
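The plug-in sandwich variance $\hat{V}_f = \hat{\Sigma}_f^{-1}\hat{\Omega}_f\hat{\Sigma}_f^{-1}$ whose consistency is established above can be illustrated numerically. The sketch below is a simplified toy, not the paper's estimator: it runs a pure series regression of $Y$ on a polynomial basis of $Z$, ignoring the high-dimensional $X$-part and the two-stage structure; the function name `series_estimate` and the basis choice are assumptions for illustration only.

```python
import numpy as np

def series_estimate(z, Z, Y, k_n=5):
    """Toy series regression of Y on a degree-(k_n - 1) polynomial basis of Z,
    with a plug-in sandwich variance for the pointwise estimate f_hat(z)."""
    n = len(Z)
    psi = np.vander(Z, k_n, increasing=True)          # basis psi^{k_n}(Z_i), n x k_n
    Sigma = psi.T @ psi / n                           # Sigma_f = E_n[psi psi']
    Sigma_inv = np.linalg.inv(Sigma)
    b = Sigma_inv @ (psi.T @ Y / n)                   # series coefficients
    resid = Y - psi @ b
    Omega = (psi * resid[:, None] ** 2).T @ psi / n   # Omega_f = E_n[sigma_i^2 psi psi']
    V = Sigma_inv @ Omega @ Sigma_inv                 # sandwich V_f
    psi_z = np.vander(np.array([z]), k_n, increasing=True)[0]
    f_hat = psi_z @ b
    se = np.sqrt(psi_z @ V @ psi_z / n)               # pointwise standard error at z
    return f_hat, se

# Toy data: f(z) = sin(2z) with homoskedastic noise.
rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, 2000)
Y = np.sin(2 * Z) + rng.normal(scale=0.1, size=2000)
f_hat, se = series_estimate(0.5, Z, Y)
```

A pointwise 90% confidence band at $z$ is then `f_hat ± 1.645 * se`, mirroring the normal limit derived above.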
Lemma 9.
Suppose that the conditions in Theorems 2 and 3 are satisfied. Then
\[
s_w\Big\|\mathbb{E}_n\big[(\hat{S}_i-S_i)\tilde{X}_i\big]\Big\|_\infty=o_p(1/\sqrt{n}),\qquad
\Big\|\mathbb{E}_n\big[(\hat{S}_i-S_i)X_i\big]\Big\|_\infty=o_p(1/\sqrt{n}),\qquad
\sqrt{k_n}\,\Big\|\mathbb{E}_n\big[(\hat{S}_i-S_i)(\psi^{k_n}(Z_i)-\hat{M}X_i)\big]\Big\|_\infty=o_p(1/\sqrt{n}).
\]

Proof.
Similar to the proof of Lemma 7, we consider the interaction of the functions $F_1=\tilde{X}_i$, $F_2=X_i$ and $F_3=\psi^{k_n}(Z_i)-\hat{M}X_i$ with each component of $\hat{S}_i-S_i$. By the Bernstein inequality and the union bound,
\[
P\bigg(\max_{1\le j\le p}\bigg|\frac{1}{n}\sum_{i=1}^n r_{\pi ni}F_{kj}D_i\epsilon_i\bigg|\ge C_\pi\|r_{\pi ni}\|_n\sqrt{(t+\log p)/n}\bigg)\le\exp(-t)
\]
for $k=1,2,3$. When $k=1$, taking $t=\log p$ and using Assumption 10, with probability at least $1-1/p$, $s_w\|\mathbb{E}_n(F_1\cdot(\hat{S}_i-S_i))\|_\infty=o_p(1/\sqrt{n})$. When $k=2$, we can apply a similar argument, and with probability at least $1-1/p$, $\|\mathbb{E}_n(F_2\cdot(\hat{S}_i-S_i))\|_\infty=o_p(1/\sqrt{n})$. And again, when $k=3$, with probability at least $1-1/k_n$, $\sqrt{k_n}\|\mathbb{E}_n(F_3\cdot(\hat{S}_i-S_i))\|_\infty=o_p(1/\sqrt{n})$. The same logic as in Lemma 7 leads to the rest of the terms, which we omit here.

Lemma 10. Let $M_j^{\top}=\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}$, and suppose that the conditions in Theorem 3 are satisfied. Then
\[
\|M-\hat{M}\|_1=O_p\big(s_m\sqrt{(\log k_n+\log p)/n}\big).
\]

Proof.
Since $\psi_{k_n j}(Z_i)=M_j^{\top}X_i+\big(\psi_{k_n j}(Z_i)-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_i\big)$, we define
\[
\upsilon_i:=\psi_{k_n j}(Z_i)-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_i,
\]
so that
\[
\upsilon_iX_i^{\top}=\psi_{k_n j}(Z_i)X_i^{\top}-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_iX_i^{\top}
=\big(\psi_{k_n j}(Z_i)X_i^{\top}-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\big)+\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\big(I_{p\times p}-\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_iX_i^{\top}\big).
\]
Thus $\mathbb{E}(\upsilon_iX_i^{\top})=0$, and the event $\mathcal{B}=\{\|\mathbb{E}_n(\upsilon_iX_i^{\top})\|_\infty<\lambda''\}$ has probability at least $1-\exp(-cn\lambda''^2)$. Equation (4.3) is thus a linear Dantzig selector as defined in Theorem 7.1 of Bickel, Ritov and Tsybakov (2009). Therefore
\[
P\big(\|M-\hat{M}\|_1>\lambda''\big)\le\sum_{j=1}^{k_n}P\big(\|M_j-\hat{M}_j\|_1>\lambda''\big)\le\exp\big(\log k_n-cn\lambda''^2\big).
\]
By choosing $\lambda''\gtrsim\sqrt{(\log p+\log k_n)/n}$, with probability at least $1-\exp(-c\log k_n-c\log p)$,
\[
\|M-\hat{M}\|_1=O_p\big(s_m\sqrt{(\log k_n+\log p)/n}\big).
\]

References
Abadie, A. (2005). Semiparametric difference-in-differences estimators. Review of Economic Studies.

Ahn, H. and Powell, J. L. (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics.

Athey, S. and Imbens, G. W. (2006). Identification and inference in nonlinear difference-in-differences models. Econometrica.

Athey, S. and Imbens, G. W. (2019). Design-based analysis in difference-in-differences settings with staggered adoption.

Belloni, A., Chernozhukov, V. and Wei, Y. (2016). Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics.

Belloni, A., Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Some new asymptotic theory for least squares series: pointwise and uniform results. Journal of Econometrics.

Belloni, A., Chernozhukov, V., Fernández-Val, I. and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Cai, T. T. and Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: minimax rates and adaptivity. The Annals of Statistics.

Callaway, B. and Li, T. (2020). Quantile treatment effects in difference in differences models with panel data. Quantitative Economics.

Callaway, B. and Sant'Anna, P. H. C. (2019). Difference-in-differences with multiple time periods and an application on the minimum wage and employment.

Caner, M. and Kock, A. B. (2018). Asymptotically honest confidence regions for high dimensional parameters by the desparsified conservative lasso. Journal of Econometrics.

Card, D. and Krueger, A. B. (1994). Minimum wages and employment: a case study of the fast-food industry in New Jersey and Pennsylvania. The American Economic Review.

Cattaneo, M. D., Jansson, M. and Ma, X. (2019). Two-step estimation and inference with possibly many included covariates. Review of Economic Studies.

Chen, X. and Christensen, T. M. (2015). Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. Journal of Econometrics.

Chen, X., Hong, H., Tarozzi, A. et al. (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics.

Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K. and Robins, J. M. (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, C1-C68.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018b). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal.

Donald, S. G. and Newey, W. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis.

Engle, R. F., Granger, C. W., Rice, J. and Weiss, A. (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association.

Fan, Y. and Li, Q. (1999). Root-n-consistent estimation of partially linear time series models. Journal of Nonparametric Statistics.

Fan, Q., Hsu, Y.-C., Lieli, R. P. and Zhang, Y. (2020). Estimation of conditional average treatment effects with high-dimensional data. Journal of Business & Economic Statistics, forthcoming.

Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics.

Freyberger, J. and Masten, M. A. (2019). A practical guide to compact infinite dimensional parameter spaces. Econometric Reviews.

Gold, D., Lederer, J. and Tao, J. (2020). Inference for high-dimensional instrumental variables regression. Journal of Econometrics.

Graham, B. S., de Xavier Pinto, C. C. and Egel, D. (2012). Inverse probability tilting for moment condition models with missing data. The Review of Economic Studies.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association.

Imai, K., Kim, I. S. and Wang, E. (2019). Matching methods for causal inference with time-series cross-sectional data.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research.

Kennedy, E. H., Lorch, S. and Small, D. S. (2019). Robust causal inference with continuous instruments using the local instrumental variable curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Lee, S., Okui, R. and Whang, Y.-J. (2017). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics.

Li, Q. and Racine, J. S. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press.

Linton, O. (1995). Second order approximation in the partially linear regression model. Econometrica: Journal of the Econometric Society.

Ma, C. and Huang, J. (2016). Asymptotic properties of Lasso in high-dimensional partially linear models. Science China Mathematics.

Müller, P. and van de Geer, S. (2015). The partial linear model in high dimensions. Scandinavian Journal of Statistics.

Newey, W. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics.

Neykov, M., Ning, Y., Liu, J. S., Liu, H. et al. (2018). A unified theory of confidence regions and testing for high-dimensional estimating equations. Statistical Science.

Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics.

Ning, Y., Zhao, T., Liu, H. et al. (2017). A likelihood ratio framework for high-dimensional semiparametric regression. The Annals of Statistics.

Ogburn, E. L., Rotnitzky, A. and Robins, J. M. (2015). Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Okui, R., Small, D. S., Tan, Z. and Robins, J. M. (2012). Doubly robust instrumental variable regression. Statistica Sinica.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society.

Sant'Anna, P. and Zhao, J. (2020). Doubly robust difference-in-differences estimators. Working paper.

Semenova, V. and Chernozhukov, V. (2017). Estimation and inference about conditional average treatment effect and other structural functions. arXiv preprint arXiv:1702.06240.

Shen, X. (1997). On methods of sieves and penalization. The Annals of Statistics.

Słoczyński, T. and Wooldridge, J. M. (2018). A general double robustness result for estimating average treatment effects. Econometric Theory.

Syrgkanis, V., Lei, V., Oprescu, M., Hei, M., Battocchi, K. and Lewis, G. (2019). Machine learning estimation of heterogeneous treatment effects with instruments. In NeurIPS 2019.

Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association.

Tan, Z. (2020). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Annals of Statistics, forthcoming.

Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics.

van de Geer, S. (2014). On the uniform convergence of empirical norms and inner products, with application to causal inference. Electronic Journal of Statistics.

van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics.

Vermeulen, K. and Vansteelandt, S. (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association.

Yu, Z., Levine, M., Cheng, G. et al. (2019). Minimax optimal estimation in partially linear additive models under high dimension. Bernoulli.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Zhu, Y., Yu, Z. and Cheng, G. (2019). High dimensional inference in partially linear models. Proceedings of Machine Learning Research, PMLR.

Table 1
Simulation: homogeneous error with independent confounders

                     p = 10              p = 50              p = 500            p = 1000
                Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD
n =
  linear
    Bias       -0.0476    0.0746   -0.0279   -0.1177    0.0008      -      -0.0045      -
    Std Err     0.5241    1.2641    0.4533    4.3207    0.3732      -       0.4030      -
    RMSE        0.2816    1.7240    0.2116   21.9360    0.1410      -       0.1677      -
    Coverage    0.8300    0.9260    0.8680    0.9710    0.8833      -       0.8900      -
    CI length   1.4018    4.1499    1.2448   49.9842    1.0778      -       1.0800      -
  nonparametric
    Bias       -0.0165    0.0128    0.0365   -0.1181    0.0260      -       0.0102      -
    Std Err     1.0612    1.3773    0.8825    8.7693    0.8072      -       0.9131      -
    RMSE        1.1701    1.9408    0.7846   80.8639    0.6548      -       0.8426      -
    Coverage    0.8200    0.9056    0.8525    0.9712    0.7988      -       0.8375      -
    CI length   2.4596    5.2435    2.3056   49.2043    2.1269      -       2.1916      -
n =
  linear
    Bias       -0.0258    0.0031   -0.0171   -0.0011   -0.0016      -      -0.0011      -
    Std Err     0.2812    0.4972    0.2268    0.5771    0.2086      -       0.1983      -
    RMSE        0.0828    0.2574    0.0536    0.3369    0.0441      -       0.0398      -
    Coverage    0.8570    0.8815    0.8656    0.9635    0.8857      -       0.8916      -
    CI length   0.7628    1.4949    0.6723    3.9223    0.6488      -       0.6261      -
  nonparametric
    Bias        0.0023   -0.0133    0.0028    0.0108    0.0031      -       0.0032      -
    Std Err     0.3953    0.6337    0.3872    0.7954    0.3715      -       0.3558      -
    RMSE        0.1564    0.4020    0.1501    0.6348    0.1384      -       0.1277      -
    Coverage    0.8888    0.8644    0.8688    0.9606    0.8588      -       0.8725      -
    CI length   1.1900    1.9907    1.1284    5.6116    1.1040      -       1.1037      -
n =
  linear
    Bias       -0.0091    0.0044   -0.0068    0.0094   -0.0014      -      -0.0008      -
    Std Err     0.1842    0.3126    0.1485    0.3347    0.1401      -       0.1399      -
    RMSE        0.0342    0.0992    0.0226    0.1132    0.0199      -       0.0198      -
    Coverage    0.8640    0.8720    0.8848    0.9027    0.8910      -       0.8946      -
    CI length   0.5432    0.9229    0.4683    1.2354    0.4466      -       0.4482      -
  nonparametric
    Bias        0.0047   -0.0251   -0.0187   -0.0121   -0.0069      -      -0.0083      -
    Std Err     0.2695    0.3950    0.2362    0.4562    0.2268      -       0.2446      -
    RMSE        0.0728    0.1570    0.0566    0.2087    0.0515      -       0.0599      -
    Coverage    0.8750    0.8725    0.8812    0.9075    0.8825      -       0.8712      -
    CI length   0.8223    1.2382    0.7472    1.7548    0.7196      -       0.7247      -

This table compares the doubly robust diff-in-diff estimator (denoted Dr-DiD) with the original semiparametric diff-in-diff estimator (denoted Semi-DiD) proposed in Abadie (2005). p represents the dimension for the linear specification. The nonparametric part is specified by an exponential function and is approximated by an 8th-degree trigonometric polynomial basis in both methods. The nominal coverage is at 90%.
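The reported metrics can be reproduced mechanically from Monte Carlo replications. The sketch below uses a toy data-generating process and estimator (a sample mean, not the paper's Dr-DiD design) purely to show how bias, standard error, RMSE, coverage at the 90% nominal level, and CI length are tabulated; `mc_metrics` is an illustrative helper, not part of the paper's R package.

```python
import numpy as np

def mc_metrics(estimates, std_errs, truth, z=1.645):
    """Summarize Monte Carlo replications the way Tables 1-2 do:
    bias, std err, RMSE, coverage of the nominal 90% CI, and average CI length."""
    err = estimates - truth
    covered = np.abs(err) <= z * std_errs          # truth inside the normal-based CI
    return {
        "Bias": err.mean(),
        "Std Err": estimates.std(ddof=1),
        "RMSE": np.sqrt((err ** 2).mean()),
        "Coverage": covered.mean(),
        "CI length": (2 * z * std_errs).mean(),
    }

# Toy replications: sample mean of N(truth, 1) with n = 100 per replication.
rng = np.random.default_rng(0)
truth, n, reps = 1.0, 100, 2000
draws = rng.normal(truth, 1.0, size=(reps, n))
est = draws.mean(axis=1)
se = draws.std(axis=1, ddof=1) / np.sqrt(n)
metrics = mc_metrics(est, se, truth)
```

With a correctly specified estimator the empirical coverage should sit near the 0.90 nominal level, which is the benchmark against which the Dr-DiD and Semi-DiD columns are judged.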
Table 2
Simulation: heterogeneous error with correlated confounders

                     p = 10              p = 50              p = 500            p = 1000
                Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD
n =
  linear
    Bias       -0.0213    0.0159    0.0122   -0.1277    0.0048      -       0.0027      -
    Std Err     0.5929    2.2577    0.5333   11.3165    0.4583      -       0.4649      -
    RMSE        0.3643    5.7073    0.2912  156.7863    0.2129      -       0.2255      -
    Coverage    0.8390    0.9260    0.8608    0.9360    0.8881      -       0.8912      -
    CI length   1.5717    8.8340    1.4061   12.8266    1.2764      -       1.2174      -
  nonparametric
    Bias        0.0327    0.0407   -0.0200    0.0026   -0.0391      -       0.0183      -
    Std Err     0.9525    2.7935    1.0049   12.2120    0.9813      -       0.9562      -
    RMSE        0.9288    8.0217    1.0136  176.1709    0.9779      -       0.9171      -
    Coverage    0.8275    0.9269    0.8200    0.9450    0.8275      -       0.8088      -
    CI length   2.4823   10.4079    2.4977   12.4647    2.4471      -       2.3827      -
n =
  linear
    Bias        0.0148   -0.0486    0.0072    0.0198    0.0006      -       0.0001      -
    Std Err     0.3691    1.7750    0.2966    2.5093    0.2305      -       0.2285      -
    RMSE        0.1368    3.8087    0.0888    6.7667    0.0536      -       0.0526      -
    Coverage    0.8350    0.8550    0.8798    0.9761    0.8901      -       0.8943      -
    CI length   0.9454    3.2110    0.8478   15.5631    0.7177      -       0.7059      -
  nonparametric
    Bias       -0.0068    0.0255   -0.0061    0.1500    0.0127      -      -0.0156      -
    Std Err     0.4750    1.8659    0.4719    2.7917    0.4122      -       0.3913      -
    RMSE        0.2278    3.7041    0.2237    7.9921    0.1695      -       0.1541      -
    Coverage    0.8800    0.8569    0.8487    0.9681    0.8225      -       0.8562      -
    CI length   1.3348    3.5635    1.2741   18.3474    1.1226      -       1.1594      -
n =
  linear
    Bias       -0.0018   -0.0000    0.0048   -0.0007    0.0004      -       0.0006      -
    Std Err     0.2733    0.6733    0.2071    1.1275    0.1850      -       0.1706      -
    RMSE        0.0750    0.4579    0.0433    1.3780    0.0345      -       0.0294      -
    Coverage    0.8410    0.8650    0.8720    0.9099    0.8943      -       0.8960      -
    CI length   0.7548    1.9303    0.6045    3.2445    0.5709      -       0.5378      -
  nonparametric
    Bias       -0.0116    0.0129    0.0114   -0.1022    0.0023      -       0.0081      -
    Std Err     0.3303    0.7637    0.2771    1.3032    0.2771      -       0.2642      -
    RMSE        0.1092    0.5839    0.0781    1.8072    0.0765      -       0.0703      -
    Coverage    0.8600    0.8750    0.8762    0.8981    0.8588      -       0.8488      -
    CI length   0.9666    2.1754    0.8233    3.6871    0.8006      -       0.7682      -

This table compares the doubly robust diff-in-diff estimator (denoted Dr-DiD) with the original semiparametric diff-in-diff estimator (denoted Semi-DiD) proposed in Abadie (2005). p represents the dimension for the linear specification. The nonparametric part is specified by an exponential function and is approximated by an 8th-degree trigonometric polynomial basis in both methods. The nominal coverage is at 90%.