Doubly Robust Semiparametric Difference-in-Differences Estimators with High-Dimensional Data
Yang Ning, Sida Peng, Jing Tao

Abstract:
This paper proposes a doubly robust two-stage semiparametric difference-in-differences estimator for estimating heterogeneous treatment effects with high-dimensional data. Our new estimator is robust to model misspecification and allows for, but does not require, many more regressors than observations. The first stage allows a general set of machine learning methods to be used to estimate the propensity score. In the second stage, we derive the rates of convergence for both the parametric parameter and the unknown function under a partially linear specification for the outcome equation. We also provide bias correction procedures to allow for valid inference for the heterogeneous treatment effects. We evaluate the finite sample performance with extensive simulation studies. Additionally, a real data analysis on the effect of the Fair Minimum Wage Act on the unemployment rate is performed as an illustration of our method. An R package for implementing the proposed method is available on GitHub.

Keywords: Difference-in-differences; High-dimensional data; Machine learning; Partially linear models; Two-stage regression.
JEL codes: C13, C14, C31

Department of Statistical Science, Cornell University, [email protected]
Microsoft Research, [email protected]
Department of Economics, University of Washington, [email protected]
https://github.com/psdsam/HDdiffindiff

1. Introduction

This paper proposes a doubly robust two-stage semiparametric difference-in-differences estimator for estimating heterogeneous treatment effects conditional on high-dimensional covariates. The difference-in-differences (DiD) design has been widely adopted in policy evaluation, from academia to industry, when a real experiment is expensive or infeasible. When a policy/feature only affects a fraction of the population, the DiD design can identify the average treatment effect on the treated (ATT) based on observational data.
It is based on the simple idea of comparing the difference in pre- and post-treatment outcomes between individuals who are affected and those who are not affected by the policy/feature of interest.

A key identification assumption for the classical DiD design is the parallel trend assumption. It requires that the outcome variables for treated and non-treated individuals would have followed parallel paths over time in the absence of treatment. However, this assumption ignores the potential selection problem due to individual heterogeneity. For example, a company might want to evaluate the effect of an email marketing campaign (advertisement through email with an embedded promo link). A researcher can compare the customers' conversion rate (whether a purchase was made) before and after the campaign for a group of treated customers (who clicked the link) and a group of non-treated customers (who did not click the link). If existing customers are more likely to click the link and also more likely to purchase again even without the campaign intervention, the classical DiD estimator will have a positive bias and exaggerate the effect of the campaign. To account for such cases, Abadie (2005) proposed a two-stage semiparametric estimator with the so-called conditional parallel trend assumption. In this framework, a propensity score is estimated in the first stage to explicitly account for any observed confounders that may affect both the treatment take-up and the outcome growth trend.

While the semiparametric DiD (semi-DiD) estimator is widely used by researchers in academia and industry, three major challenges arise in practice. First, the semi-DiD estimator becomes difficult to implement when there are too many covariates. Following the previous example, researchers may also observe customers' browsing history and may suspect that customers who visited certain (unknown) websites are more likely to click the link while also more likely to make a purchase.
However, the semi-DiD estimator cannot be implemented if the number of attributes exceeds the number of observations. Therefore, researchers may be forced to select covariates based on their intuition or insights, which may lead to further biases (Belloni et al. (2017)). Even when the number of observations is larger than the number of covariates, the semi-DiD estimator may still contain a large bias when too many covariates are included (Cattaneo, Jansson and Ma (2019)). Second, the semi-DiD estimator is sensitive to the choice of specification for propensity score estimation. This is a similar problem to the one faced by the inverse propensity score weighted (IPW) estimator, and it becomes a more subtle problem if machine learning methods (e.g. random forest, neural network, etc.) are used to predict the propensity score. Third, the conditional or heterogeneous treatment effect on the treated is often of interest to practitioners. While the semi-DiD framework provides a way to estimate conditional or heterogeneous ATT with a low-dimensional vector of covariates, it is not clear how this framework can be extended to the high-dimensional case (e.g. how to develop estimation and inference methods and theory).

In this paper, we propose a new estimator to solve the above three problems. Our doubly robust DiD (Dr-DiD) estimator is robust to model misspecification under high-dimensional covariates. We show that the desired rate of convergence of our estimator can be achieved as long as either the propensity score function or the outcome equation can be approximated asymptotically at a moderate rate. Thus, a general set of machine learning methods can be used in our framework. Although the diff-in-diff design yields an ATT estimator, we show that the semi-DiD estimator can be extended to an augmented inverse propensity score weighted (AIPW) estimator.
We show that the extended AIPW form still preserves the double robustness property under the parallel trend assumption.

To further incorporate high-dimensional covariates and heterogeneous treatment effects, we consider a partially linear specification for the potential outcome. The partially linear form is composed of a nonparametric specification over a set of low-dimensional covariates as well as a linear parametric specification over a set of high-dimensional covariates, which provides a flexible functional form to model the potential outcome. This is a very useful specification in real-world applications. For example, researchers may be interested in the nonlinear relationship between the outcome variable and a set of covariates while also facing a large number of indicator variables such as age, gender and region.

We derive the rate of convergence for our estimator as well as a de-biasing procedure for inference. We show that the high-dimensional linear part of the estimator can achieve the oracle rate of convergence, while the nonparametric part maintains the nonparametric rate of convergence. With bias correction, the high-dimensional linear part can achieve normality at the $\sqrt{n}$-rate, while the nonparametric part can achieve normality at the nonparametric rate. Finally, we demonstrate the finite sample performance of our estimator in a simulation study and apply our estimator to study the effect of the Fair Minimum Wage Act on the unemployment rate using the data collected by Callaway and Li (2020). We show that the heterogeneity in the effect of this policy can be explained by variations in demographics. More specifically, counties with larger population and higher median income are more likely to suffer from an increase in the unemployment rate. These findings coincide with canonical economic theory on the unemployment rate.
For example, a higher median income level implies a higher substitution cost for workers currently at minimum wage and thus leads to an increase in the unemployment rate when the minimum wage rises. On the other hand, regions with larger population sizes have more labor supply, and thus a minimum wage raise can also lead to a labor surplus.

In summary, the main contributions of this work are as follows: first, we propose a doubly robust approach to estimate heterogeneous ATT conditional on covariates for DiD models that allows either the propensity scores or the model for ATT to be misspecified. Second, we propose a regularized two-stage estimation procedure for DiD models that allows (i) suitable machine learning tools to estimate the first-stage propensity scores and (ii) high-dimensional covariates and a nonparametric specification for the heterogeneous ATT in the second stage. Third, we provide a novel approach to simultaneously correct the biases due to both stages and provide a novel statistical inference procedure based on the de-biased estimator. Finally, as a useful byproduct, we derive novel estimation and inference methods for a partially linear model for both the high-dimensional parametric parameter and the nonparametric function.

This paper is related to the vast literature on robust estimation and inference for treatment effects models; see, for example, Robins, Rotnitzky and Zhao (1994), Tan (2006), Chen et al. (2008), Graham, de Xavier Pinto and Egel (2012), Okui et al. (2012), Farrell (2015), Vermeulen and Vansteelandt (2015), Ogburn, Rotnitzky and Robins (2015), Belloni et al. (2017), Lee, Okui and Whang (2017), Chernozhukov et al. (2018b), Słoczyński and Wooldridge (2018), Kennedy, Lorch and Small (2019) and Tan (2020), among many others. Our work is particularly closely related to a recent work independently developed by Sant'Anna and Zhao (2020). Both are based on the seminal framework proposed in Abadie (2005).
Our estimator complements theirs as we focus on estimation and inference for heterogeneous ATT conditional on covariates in a high-dimensional setting, while Sant'Anna and Zhao (2020) focus on efficient estimation of ATT when the dimension of covariates is fixed and much smaller than the sample size.

This paper also contributes to the literature by connecting the widely used DiD estimator with the machine learning/high-dimensional statistics literature. The DiD estimator has been an active research field in the economics literature, e.g. Card and Krueger (1994), Abadie (2005), Athey and Imbens (2006), Imai, Kim and Wang (2019), Athey and Imbens (2019), Callaway and Sant'Anna (2019), among others. Our paper proposes a specific DiD estimator so that high-dimensional/machine learning tools can be applied. This paper also contributes to a set of works that apply machine learning tools to causal inference. This includes Chernozhukov et al. (2016), Belloni et al. (2017), Semenova and Chernozhukov (2017), Chernozhukov et al. (2018b), Syrgkanis et al. (2019), Fan et al. (2020), Tan (2020), etc. Our work is distinguished from this literature in two ways. First, we propose a doubly robust diff-in-diff estimator in the high-dimensional/machine learning setting, which has not been studied. Second, to the best of our knowledge, the doubly robust estimators in these papers use various high-dimensional sets of covariates and machine learning methods to deal with selection into treatment. However, the ultimate parameter of interest in the second stage involves only a low-dimensional subset of the covariates, so traditional nonparametric estimators apply. By contrast, our parameter of interest in the second stage contains both high-dimensional covariates in the parametric part and an unknown function, which brings substantial challenges for estimation and inference. We construct a new Neyman orthogonal moment condition (Chernozhukov et al.
(2016)) and propose de-biased estimators for both the parametric parameters and the nonparametric function in the second stage to construct valid confidence intervals.

Moreover, as useful by-products, we provide an inference method for a partially linear model for both the parametric parameter and the nonparametric function when the linear part contains high-dimensional covariates. Thus, this work is related to recent discussions in Müller and Van de Geer (2015), Ma and Huang (2016), Yu et al. (2019), Zhu, Yu and Cheng (2019), among others. Our paper departs from the existing papers in three aspects. First, our partially linear form is in the second-stage outcome equation, so the estimation and inference results have to take the first-stage estimators into consideration, while the existing papers focus on a one-stage regression problem. Second, the above papers propose estimators with penalized estimation in functional space. As is pointed out in Shen (1997), this approach often leads to undesirable properties of the estimates, such as inconsistency and roughness. Moreover, such an optimization procedure is difficult to implement in practice. Therefore, we consider the extension to sieve estimation by approximating the nonparametric function with sieves, so that we carry out optimization within a dense subset of the infinite-dimensional space, which is finite-dimensional and therefore easy to work with. Finally, the existing literature is concerned with asymptotic theories and inference procedures for the parametric parameters only. The nonparametric function is profiled out as an infinite-dimensional nuisance parameter. This paper considers the joint asymptotic theory and inference methods when the parameters of interest are not only the parametric parameter but also the nonparametric function. To the best of our knowledge, these results are new to the literature. We show that the parametric parameter converges to a normal distribution at a $\sqrt{n}$-rate.
The parametric estimator achieves the semiparametric efficiency bound when the error term is homoskedastic, while a functional of the nonparametric function converges to a normal distribution at a nonparametric rate. We observe that the marginal asymptotic variance for the nonparametric component is, in general, different from those derived without the high-dimensional parametric parameter, i.e., Newey (1997), Belloni et al. (2015) and Chen and Christensen (2015). This result may be of independent interest to the readers.

The paper is organized as follows. The estimator is proposed in Section 2. The rate of convergence of the estimator and the inference theory are developed in Sections 3 and 4, respectively. Section 5 presents extensive simulation results to evaluate the finite sample performance. An empirical study on the effect of the Fair Minimum Wage Act on the unemployment rate is presented in Section 6. Section 7 concludes. We defer the proofs to the Appendices.
Notation. For a vector $x = (x_1, \ldots, x_d)^\top \in \mathbb{R}^d$ and $1 \le q \le \infty$, let $\|x\|_q = \big(\sum_{i=1}^d |x_i|^q\big)^{1/q}$, $\|x\|_\infty = \max_{1 \le i \le d} |x_i|$ and $\|x\|_0 = |\mathrm{supp}(x)|$, where $\mathrm{supp}(x) = \{j : x_j \neq 0\}$ and $|a|$ is the cardinality of a set $a$. For a symmetric matrix $A$, let $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$ be the maximum and minimum eigenvalues of $A$. For a matrix $B = [B_{jk}]$, let $\|B\|_{\max} = \max_{jk} |B_{jk}|$, $\|B\|_1 = \sum_{jk} |B_{jk}|$, $\|B\|_2 = \sqrt{\Lambda_{\max}(B^\top B)}$ and $\|B\|_{\ell_\infty} = \max_j \sum_k |B_{jk}|$. For any function $f : \mathcal{Z} \to \mathbb{R}$, let $\|f\|_\infty = \sup_{z \in \mathcal{Z}} |f(z)|$, $\|f\|_{P,2} = \sqrt{\mathbb{E} f^2(z)}$ and $\|f\|_n = \sqrt{n^{-1} \sum_{i=1}^n f^2(Z_i)}$. We denote by $I_d$ the $d \times d$ identity matrix. For a set $S \subseteq \{1, \ldots, d\}$, let $x_S = \{x_j : j \in S\}$ and let $S^c$ be the complement of $S$. Let $S$ be the set of all non-zero components of $\beta_0$ and $s = |S|$. We use $\nabla_S f(x)$ to denote the gradient of $f(x)$ with respect to $x_S$. Given $a, b \in \mathbb{R}$, let $a \vee b$ and $a \wedge b$ denote the maximum and minimum of $a$ and $b$. For two positive sequences $a_n$ and $b_n$, let $a_n \asymp b_n$ denote $C \le a_n / b_n \le C'$ for some $C, C' > 0$; let $a_n \lesssim b_n$ denote $a_n \le C b_n$ for some constant $C > 0$. Also, we write $a_n = O(b_n)$ if $|a_n| \le C |b_n|$. We write $X_n \to_p a$ for some constant $a$ if a sequence of random variables $X_n$ converges in probability to $a$. Similarly, if $X_n$ converges weakly to $X$, we write $X_n \rightsquigarrow X$ for some random variable $X$. For notational simplicity, we use $C$, $C'$ and $C''$ to denote generic constants, whose values can change from line to line. Let $\mathbb{E}_n f = n^{-1} \sum_{i=1}^n f(X_i)$ and $\mathbb{G}_n f = \mathbb{E}_n f - \mathbb{E} f$.

A random variable $X$ is called sub-exponential if there exists some positive constant $K_1$ such that $P(|X| > t) \le \exp(1 - t/K_1)$ for all $t \ge 0$. The sub-exponential norm of $X$ is defined as $\|X\|_{\psi_1} = \sup_{q \ge 1} q^{-1} (\mathbb{E}|X|^q)^{1/q}$. Similarly, a random variable $X$ is called sub-Gaussian if there exists some positive constant $K_2$ such that $P(|X| > t) \le \exp(1 - t^2/K_2^2)$ for all $t \ge 0$, and the sub-Gaussian norm of $X$ is defined as $\|X\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (\mathbb{E}|X|^q)^{1/q}$.
2. Doubly Robust DiD Estimator
Denote by $Y_0(i,t)$ the potential outcome of individual $i$ at time $t$ when not treated and by $Y_1(i,t)$ the potential outcome of individual $i$ at time $t$ when treated. We cannot observe both $Y_0(i,t)$ and $Y_1(i,t)$ for the same individual, but we observe the realized outcome for individual $i$ at time $t$ as
$$Y(i,t) = D_i Y_1(i,t) + (1 - D_i) Y_0(i,t),$$
where $D_i$ is the treatment status at time $t = 1$. For some observed covariates $W_i = (X_i, Z_i)$, we want to learn the heterogeneous treatment effect on the treated conditional on the covariates $W_i$, namely
$$ATT(W_i) = \tau(W_i) := \mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1]. \quad (2.1)$$
Our parameter of interest is different from the ATT (e.g. Sant'Anna and Zhao (2020)), which is defined as
$$ATT = \mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid D_i = 1]. \quad (2.2)$$
While it is useful to know (2.2), a doubly robust estimator for (2.1) conditional on the covariates can also be relevant and important in empirical applications when the parameter of interest is the heterogeneous treatment effect conditional on the covariates.

As pointed out in Abadie (2005), the conventional DiD estimator is based on the strong assumption that outcomes for treated and non-treated groups or individuals would have followed parallel paths over time in the absence of treatment. That assumption can easily be violated when differences in observed characteristics create non-parallel outcome dynamics between treated and non-treated populations. Abadie (2005) generalizes this assumption by allowing the parallel trend assumption to hold after conditioning on the covariates as follows:

Assumption 1. $\mathbb{E}[Y_0(i,1) - Y_0(i,0) \mid W_i, D_i = 1] = \mathbb{E}[Y_0(i,1) - Y_0(i,0) \mid W_i, D_i = 0]$.

In addition, a full support assumption guarantees the existence of the propensity score function.
Assumption 2.
With probability approaching 1, there exists a constant $c > 0$ such that $c < P(D_i = 1 \mid W_i) < 1 - c$.

Under Assumptions 1 and 2, the Abadie (2005) estimand can be written as
$$\mathbb{E}\left[\frac{D_i - P(D_i = 1 \mid W_i)}{P(D_i = 1 \mid W_i)\, P(D_i = 0 \mid W_i)}\,(Y(i,1) - Y(i,0)) \,\Big|\, W_i\right]. \quad (2.3)$$
Defining $\Delta Y_i := Y(i,1) - Y(i,0)$, we have
$$\mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1] = \mathbb{E}\left[\frac{D_i - P(D_i = 1 \mid W_i)}{P(D_i = 1 \mid W_i)\, P(D_i = 0 \mid W_i)}\,\Delta Y_i \,\Big|\, W_i\right] = \mathbb{E}\left[\frac{D_i \Delta Y_i}{P(D_i = 1 \mid W_i)} \,\Big|\, W_i\right] - \mathbb{E}\left[\frac{(1 - D_i)\Delta Y_i}{P(D_i = 0 \mid W_i)} \,\Big|\, W_i\right]. \quad (2.4)$$
It is easy to see that Equation (2.4) is in the form of the Horvitz-Thompson estimator (Horvitz and Thompson (1952)). As a natural extension of the IPW-form estimator, we study whether a doubly robust form exists in the DiD setting, and this leads to our parameters of interest as follows. Define
$$\Delta Y_i^1 := Y_1(i,1) - Y_1(i,0), \quad \Phi_1(W_i) := \mathbb{E}[\Delta Y_i^1 \mid W_i, D_i = 1], \qquad \Delta Y_i^0 := Y_0(i,1) - Y_0(i,0), \quad \Phi_0(W_i) := \mathbb{E}[\Delta Y_i^0 \mid W_i, D_i = 0].$$
With
$$\rho_i = \frac{D_i - P(D_i = 1 \mid W_i)}{P(D_i = 1 \mid W_i)\, P(D_i = 0 \mid W_i)},$$
our doubly robust estimand is defined as
$$\tau(W_i) = \mathbb{E}\big[\rho_i \big(\Delta Y_i - P(D_i = 0 \mid W_i)\Phi_1(W_i) - P(D_i = 1 \mid W_i)\Phi_0(W_i)\big) \mid W_i\big], \quad (2.5)$$
where $P(D_i = 0 \mid W_i)$, $P(D_i = 1 \mid W_i)$, $\Phi_1(W_i)$ and $\Phi_0(W_i)$ are nuisance functions to be estimated in the first stage.

Lemma 1.
Under Assumptions 1 and 2, (i) the estimand defined in Equation (2.5) is doubly robust in the sense that
$$\mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1] = \mathbb{E}\big[\rho_i \big(\Delta Y_i - P(D_i = 0 \mid W_i)\Phi_1(W_i) - P(D_i = 1 \mid W_i)\Phi_0(W_i)\big) \mid W_i\big]$$
holds provided that one of the two conditions (a) or (b) holds, even if both do not hold simultaneously: (a) the specifications of $\Phi_0(W_i)$ and $\Phi_1(W_i)$ are correct; (b) the specification of $P(D_i = 1 \mid W_i)$ is correct.

(ii) Let $\alpha = (\Phi_0(\cdot), \Phi_1(\cdot), \pi(\cdot))$, $\pi(W) = P(D = 1 \mid W)$ and
$$\Upsilon(W; \alpha) = \rho\,[\Delta Y - (1 - \pi(W))\Phi_1(W) - \pi(W)\Phi_0(W)].$$
Then the moment condition $\mathbb{E}[\Upsilon(W_i; \alpha_0) - \tau(W_i) \mid W_i = w] = 0$ holds and the following Neyman orthogonality condition holds:
$$\partial_r\, \mathbb{E}[\Upsilon(W_i; \alpha_0 + r(\alpha - \alpha_0)) - \tau(W_i) \mid W_i = w]\,\big|_{r=0} = 0.$$

Lemma 1 shows that, under Assumptions 1 and 2, the estimand for $\tau(W_i)$ remains valid even when either the regression models $\Phi_0(\cdot)$ and $\Phi_1(\cdot)$ are misspecified or the propensity score $P(D_i = 1 \mid W_i)$ is misspecified.

To model $\tau(\cdot)$, we consider a class of flexible high-dimensional partially linear models such that
$$\mathbb{E}[Y_1(i,1) - Y_0(i,1) \mid W_i, D_i = 1] = X_i^\top \beta_0 + f_0(Z_i), \quad (2.6)$$
where the linear part contains the parametric Euclidean vector $\beta_0 \in \mathcal{B} \subseteq \mathbb{R}^p$ with $p > n$ allowed, and the nonparametric part contains an unknown function $f_0(\cdot) : \mathcal{Z} \to \mathbb{R}$, where $\mathcal{Z}$ is a compact subset of $\mathbb{R}^{d_z}$. We will assume that the unknown function belongs to a smooth function class defined in Section 3.

Compared with the definition in equation (11) of Abadie (2005), we define our estimand in equation (2.6) in a partially linear form rather than approximating it with a best linear predictor.
The semiparametric structure is slightly stronger, as equation (11) in Abadie (2005) is satisfied if we plug in equation (2.6) and allow $g(X_k, \theta)$ to admit a partially linear specification. On the other hand, the partially linear specification in (2.6) provides a flexible functional form while still allowing us to maintain the Neyman orthogonality condition when designing the estimator with high-dimensional covariates. Theoretical properties of the semiparametric partially linear model when the dimension of $X$ is fixed and smaller than $n$ have been thoroughly discussed in the econometrics literature (Engle et al. (1986), Robinson (1988), Ahn and Powell (1993), Donald and Newey (1994), Linton (1995), Fan and Li (1999), to mention only a few; see Li and Racine (2007) for a review). We complement the literature by providing new estimation and inference methods and theory when $X$ is high-dimensional.

As a result of Lemma 1 and equation (2.6), if Assumptions 1 and 2 hold, we have
$$(\beta_0, f_0) = \arg\min_{(\beta \in \mathcal{B},\, f \in \mathcal{F})} \mathbb{E}\Big[\big\{X_i^\top \beta + f(Z_i) - \rho_i\big(\Delta Y_i - (1 - \pi(W_i))\Phi_1(W_i) - \pi(W_i)\Phi_0(W_i)\big)\big\}^2\Big]. \quad (2.7)$$
We are going to construct a two-step estimator of $(\beta_0, f_0)$ based on the sample analogue of (2.7), where the first step estimates $\rho_i$ and the second step estimates $(\beta_0, f_0)$. We allow the propensity score, and hence $\rho_i$, to be estimated by any suitable machine learning method as long as certain conditions in Section 3 are satisfied.
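The double robustness in Lemma 1 can be checked numerically. The following sketch uses a toy data-generating process chosen purely for illustration (scalar $W$, a linear propensity score and simple outcome-growth models; none of these choices come from the paper): it compares the sample analogue of the estimand in (2.5), averaged over $W_i$, under (a) a correct propensity score with misspecified outcome models and (b) a misspecified propensity score with correct outcome models, against plain IPW with the same wrong propensity score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP (not from the paper): scalar W, true propensity
# pi(w) = 0.3 + 0.4 w, growth models Phi1(w) = 1 + w and Phi0(w) = w^2.
W = rng.uniform(0, 1, n)
pi = 0.3 + 0.4 * W
D = rng.binomial(1, pi)
Phi1, Phi0 = 1 + W, W**2
dY = np.where(D == 1, Phi1, Phi0) + 0.5 * rng.standard_normal(n)

true_att = 1 + 0.5 - 1/3  # E[Phi1(W) - Phi0(W)] = 7/6 under this DGP

def dr_att(pi_hat, phi1_hat, phi0_hat):
    """Sample analogue of the doubly robust estimand (2.5), averaged over W_i."""
    rho = (D - pi_hat) / (pi_hat * (1 - pi_hat))
    return np.mean(rho * (dY - (1 - pi_hat) * phi1_hat - pi_hat * phi0_hat))

# (a) correct propensity score, badly misspecified outcome models (set to 0):
#     reduces to the IPW form (2.4), which is still consistent.
est_a = dr_att(pi, 0.0, 0.0)
# (b) misspecified propensity score (constant 0.5), correct outcome models.
est_b = dr_att(np.full(n, 0.5), Phi1, Phi0)
# Plain IPW with the same wrong propensity score is biased.
est_ipw_wrong = dr_att(np.full(n, 0.5), 0.0, 0.0)

print(est_a, est_b, est_ipw_wrong, true_att)
```

Both doubly robust variants recover the average of $\tau(W_i)$ up to simulation noise, while the misspecified IPW estimate does not, which is exactly the content of conditions (a) and (b) in Lemma 1.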
3. Estimation
Let $\hat\pi(\cdot)$, $\hat\Phi_0(\cdot)$ and $\hat\Phi_1(\cdot)$ be nonparametric or machine learning estimators of $\pi(W_i)$, $\Phi_0(W_i)$ and $\Phi_1(W_i)$, respectively. We propose the following two-stage estimator:
$$(\hat\beta, \hat f) = \arg\min_{\beta \in \mathcal{B},\, f_n \in \mathcal{F}_n} \mathbb{E}_n\Big[\big(X_i^\top \beta + f_n(Z_i) - \hat\rho_i\big(\Delta Y_i - (1 - \hat\pi_i)\hat\Phi_1(W_i) - \hat\pi_i \hat\Phi_0(W_i)\big)\big)^2\Big] + \lambda \|\beta\|_1, \quad (3.1)$$
where $f_n(\cdot) = \psi^{k_n}(\cdot)^\top \gamma_n$ is a sieve approximation of the unknown function $f_0(\cdot) \in \mathcal{F}$ with
$$f_0(Z_i) = \sum_{j=1}^{k_n} \psi_j(Z_i)\gamma_{j,n} + r_n(Z_i) := f_n(Z_i) + r_{ni},$$
where $r_{ni} := r_n(Z_i)$, $i = 1, \ldots, n$, is a sieve approximation error that depends on the smoothness of $f_0$ and the sample size $n$.

For $\alpha > 0$ and a vector $\tau = (\tau_1, \ldots, \tau_{d_z})$ of $d_z$ integers, define the differential operator $D^\tau = \partial^{\tau.}/\partial z_1^{\tau_1} \cdots \partial z_{d_z}^{\tau_{d_z}}$, where $\tau. = \sum_{l=1}^{d_z} \tau_l$. Let $\underline{\alpha}$ be the largest integer strictly smaller than $\alpha$. For a function $g : \mathcal{Z} \to \mathbb{R}$, let
$$\|g\|_{\infty,\alpha} = \max_{\tau. \le \underline{\alpha}} \sup_z |D^\tau g(z)| + \max_{\tau. = \underline{\alpha}} \sup_{z \neq z'} \frac{|D^\tau g(z) - D^\tau g(z')|}{\|z - z'\|^{\alpha - \underline{\alpha}}}. \quad (3.2)$$
Let $C^\alpha_M(\mathcal{Z})$ be the set of all continuous functions $g : \mathcal{Z} \to \mathbb{R}$ with $\|g\|_{\infty,\alpha} \le M$. We assume that $\mathcal{F} \subseteq C^\alpha_M(\mathcal{Z})$. Let $\psi^{k_n}(Z_i) = (\psi_1(Z_i), \ldots, \psi_{k_n}(Z_i))^\top$ be a $k_n \times 1$ vector of basis functions and let $\mathcal{F}_n$ represent the space of sieve functions. Define the projection of $X_{ij}$, $i = 1, \ldots, n$, onto $\mathcal{F}_n$ as
$$\Pi_n(X_{ij} \mid Z) = \arg\min_{h^* \in \mathcal{F}_n} \|X_{ij} - h^*\|_n = \psi^{k_n}(Z)\big(\psi^{k_n}(Z)^\top \psi^{k_n}(Z)\big)^{-1}\psi^{k_n}(Z)^\top X_{ij}, \quad j = 1, \ldots, p,$$
and $\Pi_{n, X_i \mid Z_i} := (\Pi_n(X_{i1} \mid Z_i), \ldots, \Pi_n(X_{ip} \mid Z_i))^\top$, $i = 1, \ldots, n$.
Next, define
$$\hat S_i := \hat\rho(W_i)\big(\Delta Y_i - (1 - \hat\pi(W_i))\hat\Phi_1(W_i) - \hat\pi(W_i)\hat\Phi_0(W_i)\big), \qquad S_i := \rho(W_i)\big(\Delta Y_i - (1 - \pi(W_i))\Phi_1(W_i) - \pi(W_i)\Phi_0(W_i)\big),$$
$\Pi_{n, X \mid Z} := P_Z X$, $P_Z := \Psi_n(\Psi_n^\top \Psi_n)^{-1}\Psi_n^\top$ and $\tilde X := X - \Pi_{n, X \mid Z}$, where $\Psi_n := \Psi_n(Z) = (\psi^{k_n}(Z_1)^\top, \ldots, \psi^{k_n}(Z_n)^\top)^\top$ is an $n \times k_n$ matrix. Let $\bar X = X - \Pi_{X \mid Z} = (X_1^\top - \Pi^\top_{X_1 \mid Z_1}, \ldots, X_n^\top - \Pi^\top_{X_n \mid Z_n})^\top$, where $\Pi_{X_i \mid Z_i} = \mathbb{E}[X_i \mid Z_i]$. Define $\eta_n = S - X\beta_0 - \Psi_n \gamma_n = \epsilon + r_n$, where $\epsilon = (\epsilon_1, \ldots, \epsilon_i, \ldots, \epsilon_n)^\top$,
$$\epsilon_i = \frac{D_i}{\pi_i}\epsilon_{1i} - \frac{1 - D_i}{1 - \pi_i}\epsilon_{0i}, \qquad \epsilon_{1i} = \Delta Y_i^1 - \Phi_1(W_i), \qquad \epsilon_{0i} = \Delta Y_i^0 - \Phi_0(W_i),$$
and $r_n = (r_{n1}, \ldots, r_{ni}, \ldots, r_{nn})^\top$. We have the following decomposition:
$$\|X\beta + f_n\|_n^2 = \|\tilde X\beta\|_n^2 + \|\Pi_{n, X \mid Z}\beta + f_n\|_n^2.$$

Assumption 3. (i) The data are i.i.d. from the distribution of $(Y_0(i,1), Y_1(i,1), D_i, W_i)$ conditional on $t = 1$, while conditional on $t = 0$, the data are i.i.d. from the distribution of $(Y_0(i,0), Y_1(i,0), D_i, W_i)$; (ii) $\mathcal{W}$ is compact with nonempty interior; (iii) $(\beta_0, f_0) \in \mathcal{B} \times \mathcal{F} \subseteq \mathbb{R}^p \times C^\alpha_M(\mathcal{Z})$ is the only $(\beta, f)$ that satisfies (2.6), where $\alpha \ge d_z/2$; (iv) $\mathbb{E}[f(Z) \mid X]$ does not belong to the linear span of $X$.

Assumption 4. (i) The error terms $\epsilon_{0i} \in \mathbb{R}$ and $\epsilon_{1i} \in \mathbb{R}$ are independently distributed with $\mathbb{E}[\epsilon_{0i} \mid W_i] = 0$ and $\mathbb{E}[\epsilon_{1i} \mid W_i] = 0$; (ii) $\max_{1 \le i \le n} \sup_{w \in \mathcal{W}} \mathbb{E}[|\epsilon_i|^{r_\epsilon} \mid W_i = w] \le C_\epsilon$ for some constants $r_\epsilon$ and $C_\epsilon$.

In Assumption 3, (i) is Assumption 3.3 in Abadie (2005). We follow the same sampling scheme to consider repeated cross sections; (ii) can be relaxed if we add a continuous nonnegative weight function to the definition of $\|\cdot\|_{\infty,\alpha}$ in (3.2) (Freyberger and Masten, 2019); and (iii) and (iv) are standard identification conditions for a partially linear model. Assumption 4 allows the error terms to be non-identically distributed and conditionally heteroskedastic. One can replace it by the stronger sub-Gaussian assumption often used in the literature.

As is standard in the literature for high-dimensional data, we introduce the restricted eigenvalue condition
$$\Lambda_X(s) := \min_{\delta \in \mathbb{R}^p \setminus \{0\},\ \|\delta_{S^c}\|_1 \le \sqrt{s}\|\delta_S\|_2} \frac{\delta^\top \mathbb{E}[\bar X_i \bar X_i^\top]\delta}{\|\delta_S\|_2^2} > 0.$$
Let $\Sigma_{\bar X} = \mathbb{E}[\bar X_i \bar X_i^\top]$, $\Sigma_{\tilde X} = \mathbb{E}[\tilde X_i \tilde X_i^\top]$ and $\Sigma_\Pi = \mathbb{E}[\Pi_{X_i \mid Z_i}\Pi^\top_{X_i \mid Z_i}]$.

Assumption 5. (i) For each $i = 1, \ldots, n$, the covariate vector $X_i = x$ is sub-Gaussian in the sense that for any vector $v \in \mathbb{R}^p$, $v^\top x$ is sub-Gaussian with $\sup_{v \in \mathbb{R}^p : \|v\|_2 = 1} \|v^\top x\|_{\psi_2} \le K_X$; (ii) there exists a constant $C_{\bar x} > 0$ such that $\Lambda_X(s) > C_{\bar x}$; (iii) there exist constants $C_{\Sigma_{\bar X}} > 0$ and $C_{\Sigma_\Pi} > 0$ such that $C_{\Sigma_{\bar X}} < \Lambda_{\min}(\Sigma_{\bar X}) \le \Lambda_{\max}(\Sigma_{\bar X}) < 1/C_{\Sigma_{\bar X}}$ and $C_{\Sigma_\Pi} < \Lambda_{\min}(\Sigma_\Pi) \le \Lambda_{\max}(\Sigma_\Pi) < 1/C_{\Sigma_\Pi}$.

Assumption 5 is standard in the literature: (i) can be relaxed by replacing it with uniform moment conditions as discussed in Caner and Kock (2018); (ii) and (iii) restrict the eigenvalues. In particular, (ii) is a restricted eigenvalue condition.

Assumption 6.
There are finite constants $c_{k_n}$ and $\ell_{k_n}$ such that for each $f \in \mathcal{F}$ and for each $n$ and $k_n$, we have
$$\|r_n\|_{P,2} = \sqrt{\int_{z \in \mathcal{Z}} r_n^2(z)\, dF(z)} \le c_{k_n}, \qquad \|r_n\|_\infty = \sup_{z \in \mathcal{Z}} |r_n(z)| \le \ell_{k_n} c_{k_n}.$$

Assumption 7. (i) The density of $Z_i$ is bounded and bounded away from zero. For every $k_n$, there exists a constant $C_z > 0$, which does not depend on $k_n$, such that $\lambda_{\min}\big(\mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top]\big) > C_z$; (ii) there is a sequence of constants $\xi(k_n)$ satisfying $\sup_z \|\psi^{k_n}(z)\| \le \xi(k_n)$, $\xi^2(k_n)\log k_n / n = o_p(1)$, and $k_n \xi^2(k_n)\log p / n = O_p(1)$; (iii) $\|\mathbb{E}[\tilde X_i \tilde X_i^\top - \bar X_i \bar X_i^\top]\|_{\max} = O(\sqrt{\log p / n})$.

Assumption 6 is Assumption A.3 in Belloni et al. (2015). Note that $\mathcal{F}$ is a set of functions $f$ in $C^\alpha_M(\mathcal{Z})$; thus $\|f\|_{\infty,\alpha}$ is bounded from above uniformly over all $f \in \mathcal{F}$. Then, for instance, $c_{k_n} = O(k_n^{-\alpha/d_z})$ for polynomial series and $c_{k_n} = O(k_n^{-(\alpha \wedge \alpha_0)/d_z})$ for splines of order $\alpha_0$. Assumption 7(i) and the first two conditions in (ii) are also standard in the literature. Given this assumption, it is without loss of generality to normalize $\mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top] = I_{k_n}$. The condition $k_n \xi^2(k_n)\log p / n = O_p(1)$ in Assumption 7(ii) is new. It is a mild condition on the relationship between $p$ and $k_n$. Assumption 7(iii) is a smoothness condition on the approximation error of the projection $\Pi_{n, X \mid Z}$ to $\Pi_{X \mid Z}$.

Assumption 8. (i) $\sup_w |1/\hat\pi(w) - 1/\pi(w)| = O_p(1)$; (ii) $\sup_w |\hat\Phi_0(w) - \Phi_0(w)| = O_p(1)$; (iii) $\sup_w |\hat\Phi_1(w) - \Phi_1(w)| = O_p(1)$; (iv) $\mathbb{E}_n\big[(1/\hat\pi(W_i) - 1/\pi(W_i))^2(\hat\Phi_0(W_i) - \Phi_0(W_i))^2\big] = O_p(\log p / n \vee k_n \log k_n / n)$; (v) $\mathbb{E}_n\big[(1/\hat\pi(W_i) - 1/\pi(W_i))^2(\hat\Phi_1(W_i) - \Phi_1(W_i))^2\big] = O_p(\log p / n \vee k_n \log k_n / n)$.

Assumption 8 imposes moderate conditions on the first-stage approximations of the nuisance functions $\pi(W_i)$, $\Phi_0(W_i)$ and $\Phi_1(W_i)$.
Only the interaction terms between $\pi(W_i)$ and $\Phi_0(W_i)$, and between $\pi(W_i)$ and $\Phi_1(W_i)$, are required to converge at the mild rate $O_p(\log p / n \vee k_n \log k_n / n)$. This demonstrates the double robustness property of our estimator: when either $\pi(W_i)$ or $(\Phi_0(W_i), \Phi_1(W_i))$ is correctly specified, the desired rate of convergence in Theorem 1 can be achieved. As pointed out in Chernozhukov et al. (2018b), the benefit of using sample splitting is that it makes the entropy condition very weak, allowing machine learning methods (e.g. random forests, boosted trees, deep neural nets, and their aggregated and hybrid versions) to be applied to estimate the functions $\hat\pi(W_i)$, $\hat\Phi_0(W_i)$ and $\hat\Phi_1(W_i)$. One can provide more primitive conditions to verify these rates for each given machine learning method chosen.

Assumption 9.
We choose $\lambda$, $k_n$ and $R$ satisfying the following: (i) $\lambda \gtrsim \sqrt{\log p / n}$; (ii) $2\lambda s / \Lambda_X(s) \lesssim R \lesssim \lambda$; and (iii) $R = \min\big(\ell_{k_n} c_{k_n}\sqrt{k_n/n},\ \xi(k_n)c_{k_n}/\sqrt{n}\big) + \sqrt{k_n/n}$.

Theorem 1.
Suppose that Assumptions 1-9 hold. Then, with probability approaching 1,
$$\|\hat\beta - \beta_0\|_1 = O_p(\lambda s) \qquad \text{and} \qquad \|\hat f - f_0\|_{P,2} = O_p(R).$$

Theorem 1 establishes the rate of convergence of our estimator. For the parametric estimator, similarly to the case of high-dimensional linear regression (e.g., Theorem 6.1 in Bühlmann and van de Geer (2011)), the convergence rate depends on the rate of the tuning parameter $\lambda$ and the level of sparsity $s$. When $\lambda = O(\sqrt{\log p / n})$, we have $\|\hat\beta - \beta_0\|_1 = O_p(s\sqrt{\log p / n})$, which is the same rate as in lasso regression for high-dimensional linear models without unknown functions. For the nonparametric estimator, the convergence rate is the same as the one obtained in nonparametric regression models (e.g., Theorem 4.1 in Belloni et al. (2015)), which depends on the number of basis functions $k_n$ and the approximation error. Unlike the results in the literature on the semiparametric partially linear model when the dimension of $X$ is much smaller than the sample size $n$, the convergence rate of the parametric estimator is slower than $\sqrt{n}$ due to the high dimensionality of the model. This makes the inference problem challenging. As we will show in Section 4, the asymptotic variance of the nonparametric estimator contains a projection term that reflects the effect of the high-dimensional parametric estimation.
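The second-stage structure behind (3.1) and Theorem 1 can be sketched as follows. This is a simplified toy implementation under assumptions we introduce for illustration: the pseudo-outcome $S$ is generated directly from a partially linear model (rather than built from first-stage estimates of $\hat\rho_i$, $\hat\pi$, $\hat\Phi_0$, $\hat\Phi_1$), the sieve is a plain polynomial series, and `lasso_cd` is a generic coordinate-descent lasso, not the paper's implementation. The sieve part is projected out with $P_Z$, the lasso is run on the residualized design $\tilde X$, and the sieve coefficients are recovered from the remaining residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, kn = 400, 50, 6

# Hypothetical design: sparse beta_0, smooth f_0(z) = sin(2*pi*z).
Z = rng.uniform(0, 1, n)
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:3] = [1.0, -1.0, 0.5]
f_true = np.sin(2 * np.pi * Z)
# S stands in for the pseudo-outcome rho_hat*(...) appearing in (3.1).
S = X @ beta_true + f_true + 0.3 * rng.standard_normal(n)

# Sieve basis Psi_n for f: polynomial series 1, z, ..., z^(kn-1).
Psi = np.vander(Z, kn, increasing=True)          # n x k_n
P_Z = Psi @ np.linalg.solve(Psi.T @ Psi, Psi.T)  # projection onto the sieve space
X_t, S_t = X - P_Z @ X, S - P_Z @ S              # residualize out the sieve part

def lasso_cd(A, y, lam, n_iter=500):
    """Coordinate-descent lasso for (1/2n)||y - A b||^2 + lam ||b||_1."""
    n_, p_ = A.shape
    b = np.zeros(p_)
    col_sq = (A**2).sum(axis=0) / n_
    for _ in range(n_iter):
        for j in range(p_):
            r_j = y - A @ b + A[:, j] * b[j]     # partial residual excluding j
            z = A[:, j] @ r_j / n_
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return b

lam = 0.5 * np.sqrt(np.log(p) / n)   # lambda of order sqrt(log p / n), cf. Assumption 9
beta_hat = lasso_cd(X_t, S_t, lam)
# Recover the sieve coefficients from the residual S - X beta_hat.
gamma_hat = np.linalg.solve(Psi.T @ Psi, Psi.T @ (S - X @ beta_hat))
f_hat = Psi @ gamma_hat

print(np.abs(beta_hat - beta_true).sum())        # l1 error of beta_hat
print(np.sqrt(np.mean((f_hat - f_true)**2)))     # empirical L2 error of f_hat
```

The two printed quantities mirror the two rates in Theorem 1: an $\ell_1$ error for the parametric part driven by $\lambda s$, and an $L_2$ error for the sieve part driven by $k_n$ and the approximation error.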
4. Asymptotic Inference
In many applications, practitioners are interested not only in the estimation of the treatment effect but also in the uncertainty quantification of the estimated treatment effect. The latter provides the confidence in the treatment effect estimates and is a routine procedure in most causal inference problems. While the inferential properties of high-dimensional linear/generalized linear models have been extensively investigated in the recent literature (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Van de Geer et al., 2014; Belloni, Chernozhukov and Wei, 2016; Ning and Liu, 2017; Ning et al., 2017; Cai and Guo, 2017; Neykov et al., 2018; Gold, Lederer and Tao, 2020), asymptotic inference under the DiD design has not been studied, especially under the partially linear model specification. In this section, we consider how to construct confidence intervals for the parametric component β₀ and the nonparametric component f(z) for given z ∈ Z.

Consider the inference problem for a linear combination of β₀, say ξᵀβ₀, for a known vector ξ ∈ R^p. For instance, if we take ξ to be the unit basis vector e_j = (0, ..., 0, 1, 0, ..., 0) with the jth position being 1 and 0 otherwise, then the linear functional reduces to ξᵀβ₀ = (β₀)_j, the jth component of the regression coefficient. Similarly, if we are interested in the prediction for a given test sample X = x and Z = z, then the parameter of interest becomes xᵀβ₀ + f(z). Thus, the inference problem can be decomposed into two parts: inference on xᵀβ₀ and inference on f(z). The former is again a linear combination of β₀ with ξ = x; inference on f(z) will be studied later in this section. To construct confidence intervals for ξᵀβ₀, we extend the de-biasing approach to the DiD design under the partially linear model specification. Given the Lasso estimator β̂, we propose the following de-biased Lasso estimator:

T̂ = ξᵀβ̂ + ŵᵀ E_n{ ( ρ̂_i ( ΔY_i − (1 − π̂_i)Φ̂₁(W_i) − π̂_i Φ̂₀(W_i) ) − X_iᵀβ̂ − f̂(Z_i) ) X̃_i },  (4.1)

where

ŵ = arg min ‖w‖₁  s.t.  ‖ξ − Σ̂_X̃ w‖_∞ ≤ λ′,  (4.2)

with Σ̂_X̃ = n⁻¹ Σ_{i=1}^n X̃_i X̃_iᵀ and λ′ a tuning parameter. We will show that ŵ is a consistent estimator of w₀ = Σ_X̃⁻¹ ξ. Let s_w = |{k : w₀ₖ ≠ 0}| denote the number of non-zero elements of w₀.

Assumption 10.
(i) E_n[(1/π̂(W_i) − 1/π(W_i))²] = o_p(1/(k_n log k_n) ∨ 1/(s_w log p)) and sup_w |1/π̂(w) − 1/π(w)| = o_p(1);
(ii) E_n[(Φ̂₀(W_i) − Φ₀(W_i))²] = o_p(1/(k_n log k_n) ∨ 1/(s_w log p)) and sup_w |Φ̂₀(w) − Φ₀(w)| = o_p(1);
(iii) E_n[(Φ̂₁(W_i) − Φ₁(W_i))²] = o_p(1/(k_n log k_n) ∨ 1/(s_w log p)) and sup_w |Φ̂₁(w) − Φ₁(w)| = o_p(1);
(iv) E_n[{(1/π̂(W_i) − 1/π(W_i)) (Φ̂₀(W_i) − Φ₀(W_i))}²] = o_p(1/(s_w n) ∨ 1/(k_n n));
(v) E_n[{(1/π̂(W_i) − 1/π(W_i)) (Φ̂₁(W_i) − Φ₁(W_i))}²] = o_p(1/(s_w n) ∨ 1/(k_n n)).

Assumption 10 is a stronger version of Assumption 8, which is required for establishing asymptotic normality.

Let σ²_i = E[ε²_i | X_i] and V_β = Σ_X̄⁻¹ Ω_β Σ_X̄⁻¹, where Σ_X̄ = E[X̄_i X̄_iᵀ] and Ω_β := E[σ²_i X̄_i X̄_iᵀ]. Let V̂_β = ŵᵀ Ω̂_β ŵ with Ω̂_β := E_n[σ̂²_i X̃_i X̃_iᵀ].

Assumption 11.
We have: (i) n^{−1/2}( s_w (log p)^{1/2} ∨ s_w log p ) = o_p(1) and s_w max_{1≤j≤p, 1≤i≤n} |X̃_{ij} r_{ni}| = o(n^{−1/2}); (ii) s_w E_n[ε_i(X̃_i − X̄_i)] = o_p(n^{−1/2}); (iii) the smallest eigenvalue of Ω_β, denoted λ_min(Ω_β), is bounded away from 0, and the largest eigenvalue, denoted λ_max(Ω_β), is bounded from above.

Theorem 2.
Suppose that Assumptions 1-7, 9, 10 and 11 hold. Let λ′ ≳ √(log p/n). Then

√n (T̂ − ξᵀβ₀) →_d N(0, ξᵀV_β ξ).

Furthermore, if

log(np) ( s₀ log p/n + √(ξ²(k_n) k_n/n) + ℓ_{k_n} c_{k_n} ) = o(1),

then V̂_β →_p ξᵀV_β ξ.

Theorem 2 implies that we can construct an asymptotic (1 − α) confidence interval for ξᵀβ₀ as

( T̂ − z_{1−α/2}(V̂_β/n)^{1/2}, T̂ + z_{1−α/2}(V̂_β/n)^{1/2} ),

where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution. When E[ε²_i | X_i] = σ², the asymptotic variance achieves the semiparametric efficiency bound V_β = σ² E[X̃_i X̃_iᵀ]⁻¹.

In the following, we extend the de-biasing approach to construct confidence intervals for f(z) for any given z ∈ Z, where we assume d_z is much smaller than n to avoid the curse of dimensionality in nonparametric estimation. Recall that f(z) can be approximated in the sieve space by ψ_{k_n}(z)ᵀγ_n. To construct a confidence interval for f(z), it suffices to apply the de-biasing approach to the parameter γ_n, treating the parameter β₀ as a high-dimensional nuisance parameter. To this end, we first derive the score function for γ_n as

U_{γ_n}(β, M) = E_n{ ( ρ̂_i ( ΔY_i − (1 − π̂_i)Φ̂₁(W_i) − π̂_i Φ̂₀(W_i) ) − X_iᵀβ − ψ_{k_n}(Z_i)ᵀγ_n ) ( ψ_{k_n}(Z_i) − M X_i ) },

where M = E(ψ_{k_n}(Z_i) X_iᵀ){E[X_i X_iᵀ]}⁻¹ is a k_n × p matrix. One key property of the score function is that U_{γ_n}(β, M) is insensitive to the unknown high-dimensional nuisance parameters β and M.
In fact, we will show below that U_{γ_n}(β̂, M̂) = U_{γ_n}(β₀, M) + o_p(n^{−1/2}) for suitable estimators β̂ and M̂ to be defined later. Given this score function U_{γ_n}(β, M), we can define the one-step updated de-biased estimator as f̄(z) := ψ_{k_n}(z)ᵀγ̄_n, where

γ̄_n := γ̂_n + Σ̂_f⁻¹ E_n{ ( ρ̂_i ( ΔY_i − (1 − π̂_i)Φ̂₁(W_i) − π̂_i Φ̂₀(W_i) ) − X_iᵀβ̂ − f̂(Z_i) ) ( ψ_{k_n}(Z_i) − M̂ X_i ) },

where Σ̂_f = E_n{ (ψ_{k_n}(Z_i) − M̂X_i) ψ_{k_n}(Z_i)ᵀ } and M̂ = [M̂₁, ..., M̂_j, ..., M̂_{k_n}]ᵀ with

M̂_j = arg min ‖m‖₁  s.t.  ‖mᵀ E_n(X_i X_iᵀ) − E_n(ψ_{k_n,j}(Z_i) X_iᵀ)‖_∞ ≤ λ″.  (4.3)

Let s_m = |{ k : { (E[ψ_{k_n}(Z_i) X_iᵀ]) (E[X_i X_iᵀ])⁻¹ }ₖ ≠ 0 }| denote the number of non-zero elements of (E[ψ_{k_n}(Z_i) X_iᵀ])(E[X_i X_iᵀ])⁻¹.

Let σ²_z = ψ_{k_n}(z)ᵀ V_f ψ_{k_n}(z) with V_f = Σ_f⁻¹ Ω_f Σ_f⁻¹, Σ_f = E[(ψ_{k_n}(Z_i) − MX_i) ψ_{k_n}(Z_i)ᵀ], and Ω_f = E[σ²_i ψ_{k_n}(Z_i) ψ_{k_n}(Z_i)ᵀ] − M E[σ²_i X_i X_iᵀ] Mᵀ. We define the sample analogues similarly: σ̂²_z = ψ_{k_n}(z)ᵀ V̂_f ψ_{k_n}(z) with V̂_f = Σ̂_f⁻¹ Ω̂_f Σ̂_f⁻¹ and Ω̂_f = E_n[σ̂²_i ψ_{k_n}(Z_i) ψ_{k_n}(Z_i)ᵀ] − M̂ E_n[σ̂²_i X_i X_iᵀ] M̂ᵀ.

Assumption 12.
We have: (i) s_m √(k_n log p/n) = o(1), √n σ_z⁻¹ ‖E_n[r_{ni}(ψ_{k_n}(Z_i) − MX_i)]‖ = o(1), s_m s₀ log p/√n = o(1), and √n σ_z⁻¹ ψ_{k_n}(z)ᵀ Σ_f⁻¹ E_n{ r_{ni}(ψ_{k_n}(Z_i) − MX_i) } = o_p(1); (ii) the smallest eigenvalues of Σ_f and Ω_f, denoted λ_min(Σ_f) and λ_min(Ω_f) respectively, are bounded away from 0, and the largest eigenvalues, denoted λ_max(Σ_f) and λ_max(Ω_f) respectively, are bounded from above.

Theorem 3.
Suppose that Assumptions 1-7 and 9-12 hold, and that √n σ_z⁻¹ max_{1≤i≤n} |r_{ni}| = o(1). Let λ″ ≳ √(log p/n). Then

√n σ_z⁻¹ ( f̄(z) − f(z) ) →_d N(0, 1).

Furthermore, if

√(log(np)) ( n^{1/r_ε} + c_{k_n} ℓ_{k_n} ) ( s₀ √(log p/n) + √(ξ²(k_n) k_n/n) + c_{k_n} ℓ_{k_n} ) = o(1),

then σ̂²_z →_p σ²_z.

Theorem 3 provides the asymptotic theory needed to construct confidence intervals for f(z) at any z. Unlike standard results in the nonparametric literature, we need to construct a de-biased estimator f̄(z) that corrects the bias caused by estimating the high-dimensional parametric component of the partially linear model. Since the Lasso estimator of the parametric linear part converges at a rate slower than √n, the asymptotic variance of f̄(z) contains a projection term that reflects the effect of the parametric component on the nonparametric component. We can construct an asymptotic (1 − α) confidence interval for f(z) as

[ f̄(z) − z_{1−α/2} σ̂_z/√n, f̄(z) + z_{1−α/2} σ̂_z/√n ],

where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution.
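In the low-dimensional regime p ≪ n, the program (4.2) with λ′ → 0 is solved exactly by ŵ = Σ̂_X̃⁻¹ξ, and the de-biasing of this section reduces to a classical one-step correction with a sandwich variance. A numpy sketch of that simplified case, using a plain linear model in place of the full DiD moment (our own illustration, not the paper's procedure):

```python
import numpy as np
from statistics import NormalDist

def debiased_ci(X, y, beta_hat, xi, alpha=0.1):
    """One-step de-biased estimate of xi' beta with a (1 - alpha) CI,
    using w_hat = Sigma_hat^{-1} xi (the lambda' -> 0 limit of (4.2))."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    w_hat = np.linalg.solve(Sigma_hat, xi)
    resid = y - X @ beta_hat
    t_hat = xi @ beta_hat + w_hat @ (X.T @ resid) / n            # de-biased estimate
    omega = (X * resid[:, None] ** 2).T @ X / n                  # E_n[resid^2 x x']
    v_hat = w_hat @ omega @ w_hat                                # sandwich variance
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * (v_hat / n) ** 0.5
    return t_hat, (t_hat - half, t_hat + half)
```

Even when `beta_hat` is deliberately biased (e.g., shrunk toward zero), the correction term restores √n-consistency of the scalar functional ξᵀβ₀, which is exactly the mechanism behind Theorem 2.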
5. Simulation
We compare the finite sample performance of the doubly robust estimator proposed in this paper with the semiparametric DiD estimator of Abadie (2005) when the latter is applicable. We consider two data generating processes. In the first setting (DGP1), we let Y₁(i, 0) and Y₀(i, 0) follow standard normal distributions, and define Y₁(i, 1) and Y₀(i, 1) as follows:

Y₁(i, 1) = Y₁(i, 0) + X_iᵀβ₁ + f(Z_i) + ε_{1i},
Y₀(i, 1) = Y₀(i, 0) + X_iᵀβ₀ + ε_{0i},

where X_i and Z_i are generated from independent standard normal distributions. The errors ε_{1i} and ε_{0i} are independently generated from standard normal distributions. In the second setting (DGP2), we define

Y(i, 0) = ε̃_i · ( Z_i/√2 + X_{i1}/√2 ),

where ε̃_i is generated from a standard normal distribution and X_i ∼ N(0, Σ) with Σ_{jk} = ρ^{|j−k|}. This allows both heteroskedasticity in the error term and correlation among the regressors. In both DGP1 and DGP2, we set β_{1,i} = 2/i and β_{0,i} = 1/i for i ≤ 15, and f(Z_i) = exp(Z_i). The treatment assignment probability follows a logistic model,

P(T_i = 1) = 1 − (1 + exp(X_iᵀθ))⁻¹,

where θ_i = 1/i for i ≤ 10. In both settings, we use an 8th-degree trigonometric polynomial basis for the nonparametric estimation.

Tables 1 and 2 summarize the results for the two settings. We report the average bias, average standard error, average mean squared error, and the average coverage and average length of 90% confidence intervals, separately for the linear coefficients and the nonparametric coefficients. To compare with the parametric part, we report the coverage for a linear combination of the nonparametric coefficients; divided by its standard error, it also converges to a standard normal under the conditions of Theorem 3. The "Dr-DiD" columns report the results for the doubly robust diff-in-diff estimator, and the "Semi-DiD" columns report the results for the Abadie (2005) estimator. We present results with n varying over 200, 500, and 1000 and the dimension of the linear specification p varying over 10, 50, 500, and 1000. Notice that the Abadie (2005) estimator is infeasible when n ≤ p, so we omit the "Semi-DiD" results when p = 500 or 1000 and report them only when n is relatively large compared to p (e.g., n = 200 and p = 50). Although the Semi-DiD estimator can still be computed when n = 1000 and p = 500, we choose not to report this result because of its large variance.

As shown in both tables, the Dr-DiD estimator has smaller standard errors, RMSE, and confidence interval lengths in both the linear and nonparametric specifications. When p = 50, the Semi-DiD estimator becomes too conservative and produces larger standard errors. The Dr-DiD estimator is also more robust than the Semi-DiD estimator when switching from homoskedastic to heteroskedastic errors. More importantly, our experiments show that, in finite samples, the Dr-DiD estimator delivers reasonable estimates in high-dimensional settings.

6. Empirical Application

We use the proposed method to study the effect of increasing the minimum wage on unemployment rates at the county level.
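As a concrete reference for the first simulation design (DGP1) described in Section 5, a minimal data-generating sketch; this is our own reconstruction, and in particular the 0/1 subscript assignments and error labels are our reading of the design:

```python
import numpy as np

def generate_dgp1(n, p, seed=0):
    """Simulate DGP1: standard normal baselines, partially linear outcome
    changes with f(Z) = exp(Z), and logistic treatment assignment."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    Z = rng.normal(size=n)
    j = np.arange(1, p + 1)
    beta1 = np.where(j <= 15, 2.0 / j, 0.0)     # treated-arm coefficients
    beta0 = np.where(j <= 15, 1.0 / j, 0.0)     # control-arm coefficients
    theta = np.where(j <= 10, 1.0 / j, 0.0)     # propensity coefficients
    pi = 1.0 - 1.0 / (1.0 + np.exp(X @ theta))  # P(T_i = 1 | X_i)
    D = (rng.random(n) < pi).astype(float)
    dY1 = X @ beta1 + np.exp(Z) + rng.normal(size=n)  # Y1(i,1) - Y1(i,0)
    dY0 = X @ beta0 + rng.normal(size=n)              # Y0(i,1) - Y0(i,0)
    dY = D * dY1 + (1.0 - D) * dY0                    # observed change
    return X, Z, D, dY, pi
```

The returned observed change `dY` and treatment indicator `D` are the inputs a DiD estimator would see; the true propensity `pi` is returned only for diagnostics.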
We use the same dataset collected by Callaway and Li (2020), which contains county-level unemployment rates from 2005 to 2007, before the Fair Minimum Wage Act was enacted in all states on May 25, 2007. Eleven states increased their minimum wage by the first quarter of 2007, while the other states did not increase their minimum wage until the federal minimum wage increased in July 2007. We exploit this variation in the adoption of minimum wage policy across states to evaluate its impact on county-level unemployment rates. Callaway and Li (2020) consider identification and estimation of the quantile treatment effect on the treated under a distributional extension of the mean difference-in-differences assumption with fixed-dimensional covariates. Differing from Callaway and Li (2020), this work focuses on studying the impact of covariates on the heterogeneous ATT in this DiD design. Our proposed method allows us to weaken the parallel trend assumption to the conditional parallel trend assumption by conditioning on a large number of potential confounders. For example, states with smaller populations may have higher variation in unemployment rates and thus follow different trends than states with larger populations. Our method also allows us to derive the marginal effect of a specific covariate of interest; as a result, customized policy recommendations can be designed based on these results.

Figure 1 plots the simple difference in the 2005-to-2007 change in the unemployment rate by median income (panel a) and population (panel b). We separate the counties in the control and treated states by red and blue colors. The solid lines are the local means for the control and treated groups. There is a general decrease in the unemployment rate from 2005 to 2007 across all counties, as the change in the unemployment rate is centered below 0.
The difference between the red and blue lines is the standard DiD estimate under the unconditional parallel trend assumption. The decrease in the unemployment rate for the treated counties is lower than for the control counties in the low-income region, while little difference between treated and control counties is observed in the high-income region. On the other hand, the decrease in the unemployment rate for the treated counties is consistently lower than for the control counties regardless of population size.

Figure 2 compares the semiparametric diff-in-diff estimator (Semi-DiD: blue) with the doubly robust diff-in-diff estimator (Dr-DiD) for the marginal effect f. (New Hampshire and Pennsylvania are dropped for the same reason as in Callaway and Li (2020).) Both methods use a 4th-degree trigonometric polynomial basis to approximate f and have partially linear forms. The Semi-DiD estimator controls only for the underlying covariate of interest (Z_i), while the Dr-DiD estimator controls not only for the underlying covariate of interest (Z_i) but also for 703 covariates (X_i) in a linear additive form. These covariates include 38 county-level characteristics as well as all interactions between them. It is also worth pointing out that, when computing the marginal effect for median income, population is used as a confounder in the linear part, and vice versa when computing the marginal effect for population. The dashed lines are 95% confidence intervals.

We find that both estimators show that regions with high median income levels and larger population sizes may suffer an increase in the unemployment rate due to the minimum wage policy, while no significant effect of the policy is detected in regions with lower median income levels and smaller population sizes. These findings coincide with canonical economic theory on the unemployment rate. For example, a higher median income level implies a higher substitution cost for a worker currently at the minimum wage and thus leads to an increase in the unemployment rate when the minimum wage rises.
On the other hand, a region with a larger population has a larger labor supply, and thus a rise in the minimum wage can lead to a labor surplus. The difference in the general direction of the results in Figure 1 and Figure 2 indicates the potential severity of confounding in this design. Next, while both the Semi-DiD and Dr-DiD estimators show no significant impact of the policy in regions with low median income and sparse population, the Dr-DiD estimator shows a larger and more significant impact than the Semi-DiD estimator in regions with higher median income and denser population. This is due to controlling for additional covariates, which further alleviates the concern of confoundedness and reduces the uncertainty in the model, yielding more accurate estimates.
7. Discussion
In this paper, we propose a new doubly robust two-stage difference-in-differences estimator that allows for, but does not require, the number of potential confounding covariates to be greater than the number of observations. Our estimator is robust to model misspecification, and a general set of machine learning tools can be used in our estimation procedure to estimate the propensity score. The outcome equation is modeled in a flexible partially linear form. The rate of convergence is derived for the new estimator, and a novel de-biasing procedure is proposed for inference, which allows the user to construct confidence intervals for the heterogeneous treatment effects. A simulation study shows promising finite sample performance of our estimator under different data generating processes. Our method is applied to study the effect of the Fair Minimum Wage Act on local unemployment rates and shows that heterogeneous effects can arise due to differences in demographics. Moreover, an R package implementing the proposed method is available on Github. More work remains to be done. For example, it would be interesting to consider a similar estimation and inference strategy for panel data, or to develop estimators for the quantile treatment effect on the treated with high-dimensional covariates. We leave these topics for future studies.

Fig 1: Difference in 2005 to 2007 Unemployment Rate. (a) Median Income; (b) Population.

Fig 2: Comparison of the Dr-DiD estimator (with 703 covariates) and the Semi-DiD (Abadie, 2005) estimator. (a) Median Income; (b) Population.
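Before turning to the proofs, the double robustness property (Lemma 1) can be checked numerically: with either a correct propensity score or correct outcome-change functions, the sample moment recovers E[Φ₁(W) − Φ₀(W)]. A Monte Carlo sketch with hypothetical functional forms, entirely our own illustration (including a cross-fitted logistic first stage standing in for the general ML step): here ΔY = g(W) + D·τ(W) + ε with g(W) = W and τ(W) = 1 + 0.5W, so Φ₁ = g + τ, Φ₀ = g, and the target is E[τ(W)] = 1.

```python
import numpy as np

def dr_moment(dY, D, pi, phi1, phi0):
    """Sample analogue of E[rho * (dY - (1 - pi) Phi1 - pi Phi0)], with
    rho = (D - pi) / (pi (1 - pi)); identifies E[Phi1(W) - Phi0(W)]."""
    rho = (D - pi) / (pi * (1.0 - pi))
    return np.mean(rho * (dY - (1.0 - pi) * phi1 - pi * phi0))

def crossfit_propensity(W, D, n_folds=2, iters=30, seed=0):
    """Out-of-fold logistic propensity estimates (sample splitting);
    a simple stand-in for the paper's general first-stage ML step."""
    n = len(D)
    Wmat = np.column_stack([np.ones(n), W])
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    pi_hat = np.empty(n)
    for k, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        beta = np.zeros(Wmat.shape[1])
        for _ in range(iters):  # Newton's method for logistic regression
            mu = 1.0 / (1.0 + np.exp(-(Wmat[train] @ beta)))
            H = Wmat[train].T @ (Wmat[train] * (mu * (1.0 - mu))[:, None])
            beta += np.linalg.solve(H + 1e-8 * np.eye(len(beta)),
                                    Wmat[train].T @ (D[train] - mu))
        pi_hat[test] = 1.0 / (1.0 + np.exp(-(Wmat[test] @ beta)))
    return pi_hat

rng = np.random.default_rng(0)
n = 100_000
W = rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-W))
D = (rng.random(n) < pi_true).astype(float)
dY = W + D * (1.0 + 0.5 * W) + rng.normal(size=n)

phi1, phi0 = W + 1.0 + 0.5 * W, W      # correct outcome-change models
zero = np.zeros(n)                     # deliberately wrong outcome models
half = np.full(n, 0.5)                 # deliberately wrong propensity

m_pi_only = dr_moment(dY, D, crossfit_propensity(W, D), zero, zero)
m_phi_only = dr_moment(dY, D, half, phi1, phi0)
```

Both `m_pi_only` and `m_phi_only` should be close to E[τ(W)] = 1, even though one nuisance component is deliberately misspecified in each case.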
Appendix A: Proofs
A.1. Proof of Lemma 1
Proof.
For Part (i): let Φ₁(W_i; θ₁) and Φ₀(W_i; θ₀) be postulated models for Φ₁(W_i) and Φ₀(W_i), and let π(W_i; ϑ) be a postulated model for the true propensity score P(D_i = 1 | W_i). Since ρ = (D_i − P(D_i = 1 | W_i)) / {P(D_i = 1 | W_i) P(D_i = 0 | W_i)}, we have

E[ ρ ( ΔY(i) − P(D_i = 0 | W_i)Φ₁(W_i) − P(D_i = 1 | W_i)Φ₀(W_i) ) | W_i ]
= E[ (D_i − P(D_i = 1 | W_i)) / {P(D_i = 1 | W_i) P(D_i = 0 | W_i)} · ΔY(i) | W_i ]
 − E[ (D_i − P(D_i = 1 | W_i)) / {P(D_i = 1 | W_i) P(D_i = 0 | W_i)} · ( P(D_i = 0 | W_i)Φ₁(W_i) + P(D_i = 1 | W_i)Φ₀(W_i) ) | W_i ]
= { E[ D_i ΔY(i) / P(D_i = 1 | W_i) | W_i ] − E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i) | W_i ] }   (Part L1.1)
 − { E[ (1 − D_i) ΔY(i) / P(D_i = 0 | W_i) | W_i ] + E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 0 | W_i) · Φ₀(W_i) | W_i ] }.   (Part L1.2)

First we consider Part L1.1:

E[ D_i ΔY(i) / P(D_i = 1 | W_i) | W_i ] − E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i) | W_i ]
= E[ D_i (Y(i,1) − Y(i,0)) / P(D_i = 1 | W_i) − (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i) | W_i ]
= E[ Φ₁(W_i) + D_i / P(D_i = 1 | W_i) · ( (Y(i,1) − Y(i,0)) − Φ₁(W_i) ) | W_i ]
= Φ₁(W_i) + E[ D_i / P(D_i = 1 | W_i) · ( (Y(i,1) − Y(i,0)) − Φ₁(W_i) ) | W_i ].
Notice that when π(W_i; ϑ) is misspecified but Φ₁(W_i; θ₁) is correctly specified, so that Φ₁(W_i) = Φ₁(W_i; θ₁), we have

E[ D_i / π(W_i; ϑ) · ( (Y(i,1) − Y(i,0)) − Φ₁(W_i) ) | W_i ]
= E[ (Y(i,1) − Y(i,0)) − Φ₁(W_i) | W_i, D_i = 1 ] · P(D_i = 1 | W_i) / π(W_i; ϑ) = 0.

When Φ₁(W_i; θ₁) is misspecified but the propensity score π(W_i; ϑ) is correctly specified, so that π(W_i) = π(W_i; ϑ), we have

E[ D_i ΔY(i) / P(D_i = 1 | W_i) | W_i ] − E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 1 | W_i) · Φ₁(W_i; θ₁) | W_i ]
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 1 ] − E[ D_i − P(D_i = 1 | W_i) | W_i ] · Φ₁(W_i; θ₁) / P(D_i = 1 | W_i)
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 1 ].

We next consider Part L1.2. When π(W_i; ϑ) is misspecified but Φ₀(W_i; θ₀) is correctly specified, so that Φ₀(W_i) = Φ₀(W_i; θ₀), Part L1.2 equals

E[ (1 − D_i) ΔY(i) / (1 − π(W_i; ϑ)) | W_i ] + E[ (D_i − π(W_i; ϑ)) / (1 − π(W_i; ϑ)) · Φ₀(W_i) | W_i ]
= E[ (1 − D_i)(Y(i,1) − Y(i,0)) / (1 − π(W_i; ϑ)) − ( (1 − D_i) − (1 − π(W_i; ϑ)) ) / (1 − π(W_i; ϑ)) · Φ₀(W_i) | W_i ]
= Φ₀(W_i) + E[ (1 − D_i) / (1 − π(W_i; ϑ)) · ( (Y(i,1) − Y(i,0)) − Φ₀(W_i) ) | W_i ]
= Φ₀(W_i).
And when Φ₀(W_i; θ₀) is misspecified but the propensity score π(W_i; ϑ) is correctly specified,

E[ (1 − D_i) ΔY(i) / P(D_i = 0 | W_i) | W_i ] + E[ (D_i − P(D_i = 1 | W_i)) / P(D_i = 0 | W_i) · Φ₀(W_i; θ₀) | W_i ]
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 0 ] + E[ D_i − P(D_i = 1 | W_i) | W_i ] · Φ₀(W_i; θ₀) / P(D_i = 0 | W_i)
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 0 ]
= E[ Y(i,1) − Y(i,0) | W_i, D_i = 1 ].   (By Assumption 1)

The result in Part (ii) follows from Part (i) and direct calculation.

A.2. Proof of Theorem 1
Let B(s₀, p) be the set of p-dimensional vectors with at most s₀ non-zero coordinates. Let Q̂_z = n⁻¹ Σ_{i=1}^n ψ_{k_n}(Z_i) ψ_{k_n}(Z_i)ᵀ. Recall the definition S_i = X_iᵀβ₀ + ψ_{k_n}(Z_i)ᵀγ_n + r_{ni} + ε_i. Let v_n(f̂, β̂) = v_n(Z, f̂) + v_n(X, β̂), where v_n(Z, f̂) = Ψ_n(γ̂_n − γ_n) and v_n(X, β̂) = X(β̂ − β₀). Let ṽ_n(Z, f̂) = Ψ_n(γ̂_n − γ_n) + Π_{n,X|Z}(β̂ − β₀) and ṽ_n(X, β̂) = X̃(β̂ − β₀). Then v_n(f̂, β̂) = ṽ_n(Z, f̂) + ṽ_n(X, β̂). Define ι_{ni} = r_{ni} + ε_{ni} and let ι_n be the vector with ι_{ni} in the ith position. Letting ι*_n = P_Z ι_n, we have ι*_nᵀ ṽ_n(Z, f̂) = ι_nᵀ ṽ_n(Z, f̂) for ι_n = ε + r_n. Define the following norm and sets:

τ(β, f, R) = R⁻¹ λ ‖β‖₁ + ‖Xβ + f‖_{P,2},
M₁(R) = { f : ‖f − f_n‖_{P,2} ≤ R, f ∈ F },
M(R) = { (β, f) : τ(β, f, R) ≤ R, β ∈ B(s₀, p), f ∈ M₁(R) },
T₁ = { sup_{M(R)} | ‖Xᵀβ + f‖²_n − ‖Xᵀβ + f‖²_{P,2} | ≤ R²/96 },
T₂ = { ‖ι*_n‖²_n ≤ R²/384 },
T₃ = { |ι_nᵀ ṽ_n(X, β)/n| ≤ (λ/8) ‖β − β₀‖₁ for all (β, f) ∈ M(R) },
T₄ = { Λ_{X,n}(s₀) ≥ Λ_X(s₀)/2 },
T₅ = { |(Ŝ − S)ᵀ v_n(X, β)/n| ≤ (λ/8) ‖β − β₀‖₁ for all (β, f) ∈ M(R) },
T₆ = { √k_n · ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ≤ R/192 },
T₇ = { |(Ŝ − S)ᵀ ṽ_n(X, β)/n| ≤ (λ/8) ‖β − β₀‖₁ for all (β, f) ∈ M(R) },

and

Λ_{X,n}(s₀) = min_{ δ ∈ R^p \ {0}, ‖δ_{S₀ᶜ}‖₁ ≤ 3√s₀ ‖δ_{S₀}‖₂ } { δᵀ E_n[X̃_i X̃_iᵀ] δ }^{1/2} / ‖δ_{S₀}‖₂.

Proof.
Let (β̂, f̂) be the solution to the minimization problem in Equation (3.1). Define t := 4R / (4R + ‖f̂ − f_n‖_{P,2}) and γ̃_n := t γ̂_n + (1 − t) γ_n, so that f̃ = f̃(z) = ψ_{k_n}(z)ᵀ γ̃_n. By convexity,

‖Ŝ − X̃β̂ − Π_{n,X|Z}β̂ − Ψ_n γ̃_n‖²_n + λ‖β̂‖₁ ≤ ‖Ŝ − X̃β₀ − Π_{n,X|Z}β₀ − Ψ_n γ_n‖²_n + λ‖β₀‖₁.

By the definition of S and ι and by Lemma 2,

‖X̃(β̂ − β₀)‖²_n + λ‖β̂‖₁ ≤ 2 |ι_nᵀ X̃(β̂ − β₀)/n| + 2 |(Ŝ − S)ᵀ X̃(β̂ − β₀)/n| + 2 |(γ̃_n − γ_n)ᵀ Ψ_nᵀ X̃(β̂ − β₀)/n| + 2 |(β̂ − β₀)ᵀ Π_{n,X|Z}ᵀ X̃(β̂ − β₀)/n| + λ‖β₀‖₁.  (A.1)

Notice that (γ̃_n − γ_n)ᵀ Ψ_nᵀ X̃(β̂ − β₀) = 0 and (β̂ − β₀)ᵀ Π_{n,X|Z}ᵀ X̃(β̂ − β₀) = 0.
On the set T₃ ∩ T₇,

E_n[{X̃_iᵀ(β₀ − β̂)}²] + λ‖β̂‖₁ ≤ (λ/2)‖β̂ − β₀‖₁ + λ‖β₀‖₁.

Subtracting λ‖β̂_{S₀}‖₁ from and adding (λ/2)‖β̂_{S₀} − β₀,_{S₀}‖₁ to both sides, and using Assumption 5(ii), we obtain on the set T₄ that

2 E_n[{X̃_iᵀ(β₀ − β̂)}²] + λ‖β̂ − β₀‖₁ ≤ 4λ‖β̂_{S₀} − β₀,_{S₀}‖₁ ≤ 4λ√s₀ ‖β̂_{S₀} − β₀,_{S₀}‖₂ ≤ (8λ√s₀/Λ_X(s₀)) E_n[{X̃_iᵀ(β₀ − β̂)}²]^{1/2} ≤ E_n[{X̃_iᵀ(β₀ − β̂)}²] + 16λ²s₀/Λ²_X(s₀).

As a result, from Lemmas 5, 6, and 7, with probability approaching one, ‖β̂ − β₀‖₁ ≲ λs₀/Λ²_X(s₀) and E_n[{X̃_iᵀ(β₀ − β̂)}²] ≲ λ²s₀/Λ²_X(s₀).

Lemmas 2, 3, 4, 5 and 7 imply that ‖X(β₀ − β̂) + (f_n − f̃)‖_{P,2} ≤ R with probability approaching one. By orthogonal decomposition, this implies that ‖X̃(β₀ − β̂)‖_{P,2} ≤ R and ‖Π_{n,X|Z}(β₀ − β̂) + f_n − f̃‖_{P,2} ≤ R with probability approaching one. Then ‖Π_{n,X|Z}(β₀ − β̂)‖_{P,2} ≲ R by Assumptions 5(iii) and 7(iii), so we have

‖f̃ − f_n‖_{P,2} ≤ ‖Π_{n,X|Z}(β₀ − β̂) + f_n − f̃‖_{P,2} + ‖Π_{n,X|Z}(β₀ − β̂)‖_{P,2} ≲ R,

which, combined with Assumption 9(ii), yields ‖f̂ − f_n‖_{P,2} ≲ R.

Lemma 2.
Suppose that Assumptions 1-9 are satisfied. For f̃ as defined in the proof of Theorem 1, τ(β̂ − β₀, f̃ − f_n, R) ≤ R on the event T₁ ∩ T₂ ∩ T₃ ∩ T₅ ∩ T₆.

Proof. Define

t̃ = R / (R + τ(β̂ − β₀, f̃ − f_n, R)).

Let β̃ = t̃ β̂ + (1 − t̃) β₀ and write f̆ := t̃ f̃ + (1 − t̃) f_n. The definition of f̆ implies γ̆_n = t̃ γ̃_n + (1 − t̃) γ_n and f̆ ∈ M₁(R), since τ(β̃ − β₀, f̆ − f_n, R) = t̃ τ(β̂ − β₀, f̃ − f_n, R) ≤ R. Thus (β̃ − β₀, f̆ − f_n) ∈ M(R). To show τ(β̂ − β₀, f̃ − f_n, R) ≤ R, it is then sufficient to show τ(β̃ − β₀, f̆ − f_n, R) ≤ R/2. From the definition of (β̂, f̂) and convexity,

E_n[{Ŝ_i − X_iᵀβ̃ − ψ_{k_n}(Z_i)ᵀγ̆_n}²] + λ‖β̃‖₁ ≤ E_n[{Ŝ_i − X_iᵀβ₀ − ψ_{k_n}(Z_i)ᵀγ_n}²] + λ‖β₀‖₁.

Then, by the definition of Ŝ and ι_n,

‖Ψ_n(γ̆_n − γ_n) + X(β̃ − β₀)‖²_n + λ‖β̃‖₁ ≤ 2 (ι_n + (Ŝ − S))ᵀ { Ψ_n(γ̆_n − γ_n) + X(β̃ − β₀) } / n + λ‖β₀‖₁.  (A.2)

First notice that

|(Ŝ − S)ᵀ Ψ_n(γ̆_n − γ_n)/n| ≤ ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ‖γ̆_n − γ_n‖₁ ≤ √k_n ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ‖γ̆_n − γ_n‖₂ ≤ { √k_n / Λ_{min}^{1/2}(Q̂_z) } ‖(Ŝ − S)ᵀ Ψ_n/n‖_∞ ‖Ψ_n(γ̆_n − γ_n)‖_n,

where Λ_{min}(Q̂_z), the minimum eigenvalue of E(Ψ_nᵀΨ_n/n), is bounded away from 0 by Assumption 7. On the sets T₅ and T₆, and for f̆ ∈ M₁(R),

2 |(Ŝ − S)ᵀ v_n(f̆, β̃)/n| ≤ (λ/4)‖β̃ − β₀‖₁ + R²/96.  (A.3)

By the Cauchy-Schwarz inequality,

2 |ι*_nᵀ ṽ_n(Z, f̆)/n| ≤ 2 ‖ι*_n‖_n ‖ṽ_n(Z, f̆)‖_n ≤ 2 ‖ι*_n‖²_n + (1/2)‖ṽ_n(Z, f̆)‖²_n.

Therefore,

2 ι_nᵀ Ψ_n(γ̆_n − γ_n)/n + 2 ι_nᵀ Π_{n,X|Z}(β̃ − β₀)/n + 2 ι_nᵀ X̃(β̃ − β₀)/n = 2 ι*_nᵀ ṽ_n(Z, f̆)/n + 2 ι_nᵀ ṽ_n(X, β̃)/n ≤ 2 ‖ι*_n‖²_n + (1/2)‖ṽ_n(Z, f̆)‖²_n + 2 ι_nᵀ ṽ_n(X, β̃)/n.  (A.4)

Then (A.2), (A.3), and (A.4), together with the orthogonal decomposition ‖v_n(f̆, β̃)‖²_n = ‖ṽ_n(X, β̃)‖²_n + ‖ṽ_n(Z, f̆)‖²_n ≥ ‖ṽ_n(Z, f̆)‖²_n, imply that

‖v_n(f̆, β̃)‖²_n + 2λ‖β̃‖₁ ≤ 4 ‖ι*_n‖²_n + 4 ι_nᵀ ṽ_n(X, β̃)/n + (λ/2)‖β̃ − β₀‖₁ + R²/48 + 2λ‖β₀‖₁.  (A.5)

On the event T₃, |ι_nᵀ ṽ_n(X, β̃)/n| ≤ (λ/8)‖β̃ − β₀‖₁, so (A.5) becomes

‖v_n(f̆, β̃)‖²_n + 2λ‖β̃‖₁ ≤ 4 ‖ι*_n‖²_n + λ‖β̃ − β₀‖₁ + R²/48 + 2λ‖β₀‖₁.  (A.6)

Because ‖β̃ − β₀‖₁ = ‖β̃_{S₀} − β₀,_{S₀}‖₁ + ‖β̃_{S₀ᶜ}‖₁ and ‖β̃‖₁ = ‖β̃_{S₀}‖₁ + ‖β̃_{S₀ᶜ}‖₁ ≥ ‖β₀,_{S₀}‖₁ − ‖β̃_{S₀} − β₀,_{S₀}‖₁ + ‖β̃_{S₀ᶜ}‖₁, (A.6) further implies, after cancelling 2λ‖β₀,_{S₀}‖₁ = 2λ‖β₀‖₁ and using |‖v_n(f̆, β̃)‖²_n − ‖v_n(f̆, β̃)‖²_{P,2}| ≤ R²/96 on T₁, that

‖v_n(f̆, β̃)‖²_{P,2} + λ‖β̃_{S₀ᶜ}‖₁ ≤ 3λ‖β̃_{S₀} − β₀,_{S₀}‖₁ + 4 ‖ι*_n‖²_n + R²/32.  (A.7)

By Assumption 5(ii), and taking 144λ²s₀/Λ²_X(s₀) ≤ R²,

3λ‖β̃_{S₀} − β₀,_{S₀}‖₁ ≤ 3λ√s₀ ‖β̃_{S₀} − β₀,_{S₀}‖₂ ≤ 3λ√s₀ ‖X̃(β̃ − β₀)‖_{P,2}/Λ_X(s₀) ≤ 9λ²s₀/Λ²_X(s₀) + ‖X̃(β̃ − β₀)‖²_{P,2}/4 ≤ R²/16 + ‖v_n(f̆, β̃)‖²_{P,2}/4,  (A.8)

where the third inequality follows because 3ab ≤ 9a² + b²/4. Adding λ‖β̃_{S₀} − β₀,_{S₀}‖₁ to both sides of (A.7), and using 4ab ≤ 8a² + b²/2 together with 8λ²s₀/Λ²_X(s₀) ≤ R²/18, yields

‖v_n(f̆, β̃)‖²_{P,2} + λ‖β̃ − β₀‖₁ ≤ 4λ‖β̃_{S₀} − β₀,_{S₀}‖₁ + 4 ‖ι*_n‖²_n + R²/32 ≤ R²/18 + ‖v_n(f̆, β̃)‖²_{P,2}/2 + 4 ‖ι*_n‖²_n + R²/32.

On the set T₂, 4‖ι*_n‖²_n ≤ R²/96, so that

λ‖β̃ − β₀‖₁ ≤ R²/18 + R²/96 + R²/32 ≤ R²/10, i.e., ‖β̃ − β₀‖₁ ≤ R²/(10λ).

On the other hand, substituting (A.8) into (A.7) and combining with 4‖ι*_n‖²_n ≤ R²/96 yields

(3/4)‖v_n(f̆, β̃)‖²_{P,2} ≤ R²/16 + R²/96 + R²/32, so that ‖v_n(f̆, β̃)‖_{P,2} = ‖f̆ − f_n + X(β̃ − β₀)‖_{P,2} ≤ 2R/5.

As a result,

τ(β̃ − β₀, f̆ − f_n, R) = R⁻¹λ‖β̃ − β₀‖₁ + ‖f̆ − f_n + X(β̃ − β₀)‖_{P,2} ≤ R/10 + 2R/5 = R/2.

Lemma 3.
Lemma 3. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p \wedge 1/k_n)$.

Proof. First notice that $\tau(\beta, f_n, R_1) \le R_1$ implies that for $(\beta, f_n) \in \mathcal{M}(R_1)$, $\|\tilde X\beta + f_n\|_{P,2} \le R_1$ and $\|\beta\|_1 \le R_1^2/\lambda$, which further implies $\|\bar X\beta\|_{P,2} \le R_1$ and $\|\Pi_{X|Z}^\top\beta + f_n\|_{P,2} \le R_1$. By Assumption 5, we then have
\[
\|X\beta\|_{P,2} \le \|\bar X\beta\|_{P,2} + \|\Pi_{X|Z}^\top\beta\|_{P,2}
\le R_1 + \frac{\Lambda_{\max}(\Sigma_\Pi)}{\Lambda_{\min}(\Sigma_{\bar X})}\|\bar X\beta\|_{P,2}
\le \Big(1 + \frac{\Lambda_{\max}(\Sigma_\Pi)}{\Lambda_{\min}(\Sigma_{\bar X})}\Big) R_1 := R_2 ,
\]
and
\[
\|f_n\|_{P,2} \le \|f_n + \Pi_{X|Z}^\top\beta\|_{P,2} + \|\Pi_{X|Z}^\top\beta\|_{P,2}
\le \Big(1 + \frac{\Lambda_{\max}(\Sigma_\Pi)}{\Lambda_{\min}(\Sigma_{\bar X})}\Big) R_1 = R_2 .
\tag{A.9}
\]
For any $(\beta, f_n) \in \mathcal{M}(R_1)$, consider the decomposition
\[
\big| \|X\beta + f_n\|_n^2 - \|X\beta + f_n\|_{P,2}^2 \big|
\le \underbrace{\big|(\mathbb{E}_n - \mathbb{E})\|X_i^\top\beta\|^2\big|}_{(A)}
+ \underbrace{\big|(\mathbb{E}_n - \mathbb{E})\|\psi^{k_n}(Z_i)^\top\gamma_n\|^2\big|}_{(B)}
+ 2\underbrace{\big|(\mathbb{E}_n - \mathbb{E})[\beta^\top X_i\,\psi^{k_n}(Z_i)^\top\gamma_n]\big|}_{(C)} .
\]
For Term (A), let $\Sigma_X = \mathbb{E}[X_i X_i^\top]$. For all $(\beta, f_n) \in \mathcal{M}(R_1)$,
\[
(A) = \Big| \frac{1}{n}\sum_{i=1}^n \beta^\top X_i X_i^\top\beta - \beta^\top\Sigma_X\beta \Big|
\le \big\|(\mathbb{E}_n - \mathbb{E}) X_i X_i^\top\big\|_\infty \|\beta\|_1^2 .
\]
With probability at least $1 - 1/p$, $\|(\mathbb{E}_n - \mathbb{E}) X_i X_i^\top\|_\infty \lesssim \sqrt{\log p/n}$ because the $X_i$'s are sub-Gaussian. Thus, from Assumption 9, with probability at least $1 - 1/p$,
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})\|X_i^\top\beta\|^2\big|
\le C\sqrt{\frac{\log p}{n}}\,\frac{R_1^4}{\lambda^2} \le CR_1^2 .
\]
For Term (B), from Assumption 7, $Q_Z := \mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top]$ can be normalized to $I_{k_n}$. Then
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})f_n^2\big|
= \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \Big| \frac{1}{n}\sum_{i=1}^n \gamma_n^\top\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top\gamma_n - \gamma_n^\top\gamma_n \Big|
\le \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big\|(\mathbb{E}_n - \mathbb{E})\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top\big\|\,\|\gamma_n\|^2 ,
\]
where $\|\gamma_n\| = \|f_n\|_{P,2}$ is bounded by $R_2$ following (A.9). Moreover, by a Bernstein-type inequality for random matrices (Theorem 6.1 in Tropp (2012); see also Theorem 4.3 in van de Geer (2014), Theorem 4.1 in Chen and Christensen (2015), or Lemma 6.2 in Belloni et al. (2015)), we have
\[
P\Bigg( \big\|(\mathbb{E}_n - \mathbb{E})\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top\big\|
> C\Bigg( \xi(k_n)\sqrt{\frac{\log k_n + t}{n}} + \xi^2(k_n)\,\frac{\log k_n + t}{n} \Bigg) \Bigg) \le \exp(-t).
\]
By taking $t = \log k_n$, with probability at least $1 - O(1/k_n)$,
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})f_n^2\big| \le CR_1^2
\]
by Assumption 7(ii). For Term (C), we have
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})[\beta^\top X_i f_n(Z_i)]\big|
= \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \Big| \frac{1}{n}\sum_{i=1}^n \beta^\top X_i\psi^{k_n}(Z_i)^\top\gamma_n - \mathbb{E}\big[\beta^\top X_i\psi^{k_n}(Z_i)^\top\gamma_n\big] \Big|
\]
\[
\le \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \|\beta\|_1\|\gamma_n\|_1\big\|(\mathbb{E}_n - \mathbb{E})X_i\psi^{k_n}(Z_i)^\top\big\|_\infty
\le \sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \|\beta\|_1\sqrt{k_n}\,\|\gamma_n\|\,\big\|(\mathbb{E}_n - \mathbb{E})X_i\psi^{k_n}(Z_i)^\top\big\|_\infty .
\]
Note that $|X_{ij}\psi^{k_n}_m(Z_i)| \le \xi(k_n)|X_{ij}|$ for $j = 1,\dots,p$ and $m = 1,\dots,k_n$. Thus, Lemma 14.15 in B\"uhlmann and van de Geer (2011) implies that, given $X$,
\[
P\Bigg( \max_{1\le j\le p}\max_{1\le m\le k_n} \big|(\mathbb{E}_n - \mathbb{E})X_{ij}\psi^{k_n}_m(Z_i)\big|
\ge \max_{1\le j\le p}\sqrt{\frac{\xi^2(k_n)\sum_{i=1}^n X_{ij}^2}{n^2}}\,
\sqrt{2\Big(t + \frac{2\log 2p}{n}\Big)} \Bigg) \le \exp(-nt).
\]
Because $X$ is sub-Gaussian, letting $t = \log(2p)/n$ gives
\[
P\Bigg( \max_{1\le j\le p}\max_{1\le m\le k_n} \big|(\mathbb{E}_n - \mathbb{E})X_{ij}\psi^{k_n}_m(Z_i)\big|
\ge \sqrt{K_X}\,\xi(k_n)\sqrt{\frac{2\log 2p}{n}} \Bigg) \le p^{-1}.
\]
Because for $(\beta, f_n) \in \mathcal{M}(R_1)$, $\|\beta\|_1 \le R_1^2/\lambda$ and $\|\gamma_n\| = \|f_n\|_{P,2} \le R_2$, it follows that $\|\beta\|_1\|\gamma_n\| \lesssim R_1^2 R_2/\lambda \lesssim R_2$ by Assumption 9(ii). Then, combining with Assumption 7(iii), we have that with probability at least $1 - O(1/p)$,
\[
\sup_{(\beta,f_n)\in\mathcal{M}(R_1)} \big|(\mathbb{E}_n - \mathbb{E})[\beta^\top X_i f_n(Z_i)]\big| \lesssim R_1^2 .
\]
The conclusion follows from Assumption 5.
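The sup-norm concentration driving Terms (A)–(C) can be illustrated numerically. A minimal sketch in our own toy setup (i.i.d. standard Gaussian design, not the paper's assumptions): the entrywise deviation of the sample second-moment matrix from its population value scales like $\sqrt{\log p/n}$.

```python
import numpy as np

# Toy check (not the paper's code): for i.i.d. N(0, I_p) regressors, the
# entrywise deviation ||(E_n - E)[X_i X_i^T]||_inf is within a modest
# constant multiple of sqrt(log p / n), the rate invoked for Term (A).
rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
emp = X.T @ X / n                       # E_n[X_i X_i^T]
dev = np.max(np.abs(emp - np.eye(p)))   # sup-norm deviation; E[X_i X_i^T] = I_p
rate = np.sqrt(np.log(p) / n)
print(dev, rate)
```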
Lemma 4. Suppose that Assumptions 1–9 are satisfied. For a sequence $\kappa_n \to \infty$ as $n \to \infty$ such that $\kappa_n$ does not depend on $p$ or $k_n$, the set $T$ has probability at least $1 - O(1/\kappa_n + 1/k_n)$.

Proof. We have
\[
\|\iota_n^*\|_n^2 \le 2\epsilon^\top P_Z\epsilon/n + 2 r_n^\top P_Z r_n/n .
\]
For the first term, note that
\[
\epsilon^\top P_Z\epsilon/n
= \big\|(\Psi_n^\top\Psi_n/n)^{-1/2}\Psi_n^\top\epsilon/n\big\|^2
= \big\|\widehat Q_z^{-1/2}\,\mathbb{E}_n[\psi^{k_n}(Z_i)\epsilon_i]\big\|^2
\le \|\widehat Q_z^{-1/2}\|^2\,\big\|\mathbb{E}_n[\psi^{k_n}(Z_i)\epsilon_i]\big\|^2 ,
\]
where all eigenvalues of $\widehat Q_z$ are bounded away from zero with probability at least $1 - 1/k_n$ by the matrix Bernstein inequality. Moreover,
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i)\epsilon_i]\|^2\big]
= \mathbb{E}\big[\epsilon_i^2\,\psi^{k_n}(Z_i)^\top\psi^{k_n}(Z_i)/n\big]
\lesssim \mathbb{E}\big[\psi^{k_n}(Z_i)^\top\psi^{k_n}(Z_i)/n\big]
= \mathrm{tr}\big(\mathbb{E}[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top/n]\big)
= \mathrm{tr}(I_{k_n}/n) = k_n/n .
\]
For the second term, $r_n^\top P_Z r_n/n = \|\widehat Q_z^{-1/2}\,\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2$, and
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big]
= \frac{1}{n}\sum_{k=1}^{k_n}\mathbb{E}[\psi_k(Z_i)^2 r_{ni}^2]
\le \Big(\frac{\ell_{k_n} c_{k_n}}{\sqrt n}\Big)^2\mathbb{E}\big[\|\psi^{k_n}(Z_i)\|^2\big]
= \frac{\ell_{k_n}^2 c_{k_n}^2 k_n}{n}
\]
or
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big]
\le \frac{1}{n}\mathbb{E}[\xi^2(k_n) r_{ni}^2]
\le \frac{\xi^2(k_n) c_{k_n}^2}{n},
\]
so we have
\[
\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big]
\le \min\Bigg( \frac{\ell_{k_n}^2 c_{k_n}^2 k_n}{n},\ \frac{\xi^2(k_n) c_{k_n}^2}{n} \Bigg).
\]
From the triangle inequality, there exists a constant $C$ such that $\mathbb{E}\big[\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\|^2\big] \le CR_1^2/\kappa_n$, and by the Markov inequality,
\[
P\Big( \big\|\mathbb{E}_n[\psi^{k_n}(Z_i) r_{ni}]\big\|^2 > R_1^2 \Big) \le C/\kappa_n .
\]
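The approximation-error magnitude $c_{k_n}$ appearing in the bounds above can be illustrated with a small sieve regression. This is our own toy setup (polynomial basis and target function are our choices, not from the paper): the empirical $L_2$ residual $r_{ni}$ of projecting a smooth function on a $k$-term sieve shrinks as $k$ grows.

```python
import numpy as np

# Toy illustration (our assumptions): the L2 residual of the best k-term
# sieve approximation of a smooth f decays in k, which is what the rate
# c_{k_n} in Lemma 4 quantifies.
rng = np.random.default_rng(6)
n = 5000
Z = rng.uniform(-1.0, 1.0, n)
f = np.exp(Z)                                        # smooth target f(Z_i)

def l2_residual(k):
    Psi = np.vander(Z, k, increasing=True)           # sieve basis psi^k(Z_i)
    gamma, *_ = np.linalg.lstsq(Psi, f, rcond=None)  # least-squares projection
    return np.sqrt(np.mean((f - Psi @ gamma) ** 2))  # empirical L2 error

res = [l2_residual(k) for k in (2, 4, 8)]
print(res)
```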
Lemma 5. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p + 1/k_n)$.

Proof. By the definition of $\iota_n$,
\[
|\iota_n^\top\tilde v_n(X,\tilde\beta)/n|
= |(\epsilon + r_n)^\top\tilde X(\tilde\beta - \beta_0)/n|
\le \|(\epsilon + r_n)^\top\tilde X/n\|_\infty\|\tilde\beta - \beta_0\|_1 .
\]
By Lemma 6.2 in Belloni et al. (2015), all eigenvalues of $\widehat Q_Z$ are bounded away from zero under Assumption 7 on the set $T$ with probability at least $1 - 1/k_n$. Notice that $\tilde X_i = (I - P_Z)X_i$. As $X_i$ is sub-Gaussian, its moment generating function satisfies
\[
\mathbb{E}\big[\exp\big(s(I - P_Z)X_{ij}\big)\big]
\le \mathbb{E}\big[\exp\big(s(1 - \Lambda_{\min}(\widehat Q_Z))X_{ij}\big)\big]
\le \exp\big(K_X^2 s^2(1 - \Lambda_{\min}(\widehat Q_Z))^2/2\big).
\]
Thus $\tilde X_{ij}$ is also sub-Gaussian, so Assumption 4 implies that $\max_{1\le i\le n}\max_{1\le j\le p}\mathbb{E}[(\tilde X_{ij}\epsilon_i)^2]$ is bounded from above. From Lemmas E.1 and E.2 of Chernozhukov, Chetverikov and Kato (2017),
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n \epsilon_i\tilde X_{ij} \Big| > C\sqrt{\frac{t + \log p}{n}} \Bigg) \le \exp(-t).
\]
Because $\lambda \gtrsim C\sqrt{\log p/n}$, we have with probability at least $1 - 1/p$,
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n \epsilon_i\tilde X_{ij} \Big| \le \lambda/16 .
\]
For a constant $K_{\tilde X} \ge \mathbb{E}\|\tilde X_j\|_n$,
\[
P\Big( \max_{1\le j\le p}\|\tilde X_j\|_n \ge K_{\tilde X} \Big)
\le P\Bigg( \max_{1\le j\le p}\|\tilde X_j\|_n \ge \mathbb{E}\|\tilde X_j\|_n + K_{\tilde X}\sqrt{\frac{\log p}{n}} \Bigg)
\le (2p)^{-1}.
\]
Next, note that $\max_{1\le i\le n}\max_{1\le j\le p}|r_{ni}\tilde X_{ij}| \le \max_{1\le i\le n}\max_{1\le j\le p}\ell_{k_n}c_{k_n}|\tilde X_{ij}|$, so by Lemma 14.15 of B\"uhlmann and van de Geer (2011),
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r_{ni}\tilde X_{ij} \Big|
\ge \max_{1\le j\le p}\sqrt{\ell_{k_n}^2 c_{k_n}^2\sum_{i=1}^n\tilde X_{ij}^2/n}\,\sqrt{t + 2\log p/n} \Bigg) \le \exp(-nt).
\]
Taking $t = \log p/n$ yields that with probability at least $1 - (2p)^{-1} - 1/p$,
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r_{ni}\tilde X_{ij} \Big|
\le \sqrt{K_{\tilde X}}\,\ell_{k_n}c_{k_n}\sqrt{\log p/n}.
\]
Thus, when $\lambda \gtrsim C\sqrt{\log p/n}$ and because $\ell_{k_n}c_{k_n} = o(1)$, we have
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n (\epsilon_i + r_{ni})\tilde X_{ij} \Big| \le \lambda/8,
\]
which further implies that $|\iota_n^\top\tilde v_n(X,\tilde\beta)/n| \le \frac{\lambda}{8}\|\tilde\beta - \beta_0\|_1$.
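The penalty-level calculation above — $\lambda \gtrsim \sqrt{\log p/n}$ dominating the maximal score $\max_j |n^{-1}\sum_i \epsilon_i\tilde X_{ij}|$ — can be checked in a toy Gaussian design (our setup, not the paper's data-generating process):

```python
import numpy as np

# Toy check (Gaussian design and noise are our assumptions): the maximal
# score max_j |n^{-1} sum_i eps_i X_ij| concentrates at the sqrt(log p / n)
# scale that motivates the choice of lambda in Lemma 5.
rng = np.random.default_rng(1)
n, p = 5000, 200
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)
max_score = np.max(np.abs(X.T @ eps / n))
rate = np.sqrt(np.log(p) / n)
print(max_score, rate)
```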
Lemma 6. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p)$.

Proof. For $\widehat\Sigma_{\tilde X} = \mathbb{E}_n[\tilde X_i\tilde X_i^\top]$ and $\Sigma_{\tilde X} = \mathbb{E}[\tilde X_i\tilde X_i^\top]$, we have
\[
\frac{\delta^\top\widehat\Sigma_{\tilde X}\delta}{\|\delta_S\|^2}
= \frac{\delta^\top\Sigma_{\tilde X}\delta + \delta^\top(\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X})\delta}{\|\delta_S\|^2}
\ge \Lambda_X^2(s) - \Bigg| \frac{\|\delta\|_1^2\,\|\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X}\|_\infty}{\|\delta_S\|^2} \Bigg|
\ge \Lambda_X^2(s) - Cs\|\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X}\|_\infty ,
\]
where the first inequality follows from the definition of $\Lambda_X(s)$ and the second follows from $\|\delta\|_1^2 \le Cs\|\delta_S\|^2$. Since, with probability at least $1 - 1/p$,
\[
\|\widehat\Sigma_{\tilde X} - \Sigma_{\tilde X}\|_\infty
\le \|(\mathbb{E}_n - \mathbb{E})\tilde X_i\tilde X_i^\top\|_\infty
+ \|\mathbb{E}[\tilde X_i\tilde X_i^\top - \bar X_i\bar X_i^\top]\|_\infty
= O_p\big(\sqrt{\log p/n}\big),
\]
where the equality follows from the Bernstein inequality applied as in Lemma 5 and from Assumption 7(iii). Because $s\sqrt{\log p/n} = o(1)$, for large enough $n$ we have $\Lambda_{X,n}^2(s) \ge \Lambda_X^2(s)/2$ with probability at least $1 - 1/p$.

Lemma 7. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p)$ and $P(T) = 1 - O(1/p)$.

Proof. The proofs of the two statements are similar, and thus we focus on the first set $T$ below.
From the definition of $S_i$,
\[
\widehat S_i - S_i
= \widehat\rho_i\big(\Delta Y_i - (1-\widehat\pi_i)\widehat\Phi_0(W_i) - \widehat\pi_i\widehat\Phi_1(W_i)\big)
- \rho_i\big(\Delta Y_i - (1-\pi_i)\Phi_0(W_i) - \pi_i\Phi_1(W_i)\big)
\]
\[
= \underbrace{D_i(\Delta Y_i - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)}_{E_1}
+ \underbrace{(1-D_i)(\Delta Y_i - \Phi_0(W_i))\Big(\frac{1}{1-\widehat\pi_i} - \frac{1}{1-\pi_i}\Big)}_{E_2}
+ \underbrace{(\widehat\Phi_0(W_i) - \Phi_0(W_i))\Big(1 - \frac{D_i}{\pi_i}\Big)}_{E_3}
\]
\[
- \underbrace{(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(1 - \frac{1-D_i}{1-\pi_i}\Big)}_{E_4}
+ \underbrace{D_i(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)}_{E_5}
- \underbrace{(1-D_i)(\widehat\Phi_0(W_i) - \Phi_0(W_i))\Big(\frac{1}{1-\pi_i} - \frac{1}{1-\widehat\pi_i}\Big)}_{E_6}.
\]
Here $\widehat\Phi_0(\cdot)$, $\widehat\Phi_1(\cdot)$ and $\widehat\pi(\cdot)$ are the first-stage estimates, and $\widehat S_i$ is evaluated with $\widehat\beta$ and $\widehat f$. Similarly to Lemma 5, for $(\beta, f) \in \mathcal{M}(R_1)$,
\[
|(\widehat S - S)^\top\tilde v_n(X,\beta)/n|
= |(\widehat S - S)^\top\tilde X(\beta - \beta_0)/n|
\le \|(\widehat S - S)^\top\tilde X/n\|_\infty\|\beta - \beta_0\|_1 .
\]
For each component of $\tilde X_i$,
\[
\mathbb{E}_n\Big[\tilde X_{ij}D_i(\Delta Y_i - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\Big]
= \mathbb{E}_n\Big[\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\tilde X_{ij}D_i\epsilon_{1,i}\Big].
\]
Since $\epsilon_{1,i}$ is independent of $(1/\widehat\pi_i - 1/\pi_i)$ due to sample splitting, and by Assumption 4, we have $\mathbb{E}[(1/\widehat\pi_i - 1/\pi_i)\tilde X_{ij}D_i\epsilon_{1,i}] = 0$. Define $r^\pi_{ni} = 1/\widehat\pi_i - 1/\pi_i$. Lemma 14.15 in B\"uhlmann and van de Geer (2011) implies that for a constant $C_\pi$,
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\tilde X_{ij}D_i\epsilon_{1,i} \Big|
\ge C_\pi\max_{1\le i\le n}|r^\pi_{ni}|\max_{1\le j\le p}
\sqrt{\frac{1}{n}\sum_{i=1}^n\tilde X_{ij}^2D_i\epsilon_{1,i}^2}\,
\sqrt{\frac{t + 2\log p}{n}} \Bigg) \le \exp(-nt),
\]
where with probability bigger than $1 - 1/p$, for a sufficiently large constant $C$,
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n \big(\tilde X_{ij}^2D_i\epsilon_{1,i}^2 - \mathbb{E}[\tilde X_{ij}^2D_i\epsilon_{1,i}^2]\big) \Big| \le C\sqrt{\frac{\log p}{n}}.
\]
Hence
\[
\max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\tilde X_{ij}D_i\epsilon_{1,i} \Big|
\lesssim \max_{1\le i\le n}|r^\pi_{ni}|\sqrt{\frac{\log p}{n}}
\]
with probability approaching one. From Assumption 8, $\max_{1\le i\le n}|1/\widehat\pi_i - 1/\pi_i| = O_p(1)$. Thus, taking $t = \log p$, with probability at least $1 - 1/p$, $\|\mathbb{E}_n[E_1\tilde X_i]\|_\infty \lesssim \sqrt{\log p/n}$.

Next define $r^\Phi_{ni} = \widehat\Phi_0(W_i) - \Phi_0(W_i)$. From sample splitting,
\[
\mathbb{E}\Big[ r^\Phi_{ni}\tilde X_{ij}\Big(1 - \frac{D_i}{\pi_i}\Big) \Big]
= \mathbb{E}\Big[ r^\Phi_{ni}\tilde X_{ij}\Big(1 - \frac{\mathbb{E}(D_i \mid X_i, Z_i, S_i)}{\pi_i}\Big) \Big] = 0.
\]
With a constant $C_\Phi$, we can then apply the same bound:
\[
P\Bigg( \max_{1\le j\le p} \Big| \frac{1}{n}\sum_{i=1}^n r^\Phi_{ni}\tilde X_{ij}\Big(1 - \frac{D_i}{\pi_i}\Big) \Big|
\ge C_\Phi\max_{1\le i\le n}|r^\Phi_{ni}|\max_{1\le j\le p}
\sqrt{\frac{1}{n}\sum_{i=1}^n\tilde X_{ij}^2\Big(1 - \frac{D_i}{\pi_i}\Big)^2}\,
\sqrt{\frac{t + 2\log p}{n}} \Bigg) \le \exp(-nt).
\]
From Assumption 8, $\|\mathbb{E}_n[E_3\tilde X_i]\|_\infty = O_p(\sqrt{\log p/n})$. Lastly, consider
\[
\mathbb{E}\Big| \mathbb{E}_n\Big[ \tilde X_{ij}D_i(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big) \Big] \Big|
\le \mathbb{E}_n\Bigg[ \Big|(\widehat\Phi_1(W_i) - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\Big| \Bigg]\cdot\mathbb{E}_n\big[|\tilde X_{ij}D_i|\big]
= O_p(\log p/n),
\]
where the last equality follows as $\mathbb{E}_n[|\tilde X_{ij}D_i|] = O_p(1)$. Similar results can be derived for $E_2$, $E_4$ and $E_6$.
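The score $S_i$ whose estimation error is being controlled in Lemma 7 can be made concrete with a small simulation. This sketch is our reading of the definition recalled later in the proof of Theorem 2, $S_i = \rho_i(\Delta Y_i - (1-\pi_i)\Phi_0(W_i) - \pi_i\Phi_1(W_i))$, with $\rho_i = (D_i - \pi_i)/\{\pi_i(1-\pi_i)\}$ an assumed inverse-propensity weight (not verbatim from the paper): under a homogeneous effect $\tau$, the score averages to $\tau$.

```python
import numpy as np

# Hedged sketch: doubly robust transformed outcome with known propensity
# scores; the rho_i form is an assumption for illustration.
rng = np.random.default_rng(2)
n = 1000
pi = rng.uniform(0.2, 0.8, n)          # propensity scores pi_i
D = rng.binomial(1, pi)                # treatment indicator D_i
Phi0 = rng.standard_normal(n)          # control pseudo-outcome Phi_0(W_i)
tau = 1.0                              # homogeneous treatment effect
Phi1 = Phi0 + tau                      # treated pseudo-outcome Phi_1(W_i)
dY = np.where(D == 1, Phi1, Phi0) + 0.1 * rng.standard_normal(n)
rho = (D - pi) / (pi * (1 - pi))       # IPW weight (assumed form)
S = rho * (dY - (1 - pi) * Phi0 - pi * Phi1)
print(S.mean())                        # close to tau = 1
```

A short calculation shows $\mathbb{E}[S_i \mid \pi_i] = \tau$ under either branch of $D_i$, which is the double-robustness property the paper exploits.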
Lemma 8. Suppose that Assumptions 1–9 are satisfied. Then $P(T) = 1 - O(1/p)$.

Proof. Similarly to Lemma 7, consider the interaction of each component of $\widehat S_i - S_i$ with $\psi^{k_n}_j(Z_i)$ for $j = 1,\dots,k_n$:
\[
\mathbb{E}_n\Big[ \psi^{k_n}_j(Z_i)D_i(\Delta Y_i - \Phi_1(W_i))\Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big) \Big]
= \mathbb{E}_n\Big[ \Big(\frac{1}{\widehat\pi_i} - \frac{1}{\pi_i}\Big)\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i} \Big].
\]
By sample splitting and Assumption 4, we have $\mathbb{E}[(1/\widehat\pi_i - 1/\pi_i)\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i}] = 0$. Again, Lemma 14.15 in B\"uhlmann and van de Geer (2011) implies that for a constant $C_\pi$,
\[
P\Bigg( \max_{1\le j\le k_n} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i} \Big|
\ge C_\pi\max_{1\le i\le n}|r^\pi_{ni}|\max_{1\le j\le k_n}
\sqrt{\frac{1}{n}\sum_{i=1}^n\psi^{k_n}_j(Z_i)^2D_i\epsilon_{1,i}^2}\,
\sqrt{\frac{t + 2\log k_n}{n}} \Bigg) \le \exp(-nt),
\]
where
\[
\max_{1\le j\le k_n} \Big| \frac{1}{n}\sum_{i=1}^n \big(\psi^{k_n}_j(Z_i)^2D_i\epsilon_{1,i}^2 - \mathbb{E}[\psi^{k_n}_j(Z_i)^2D_i\epsilon_{1,i}^2]\big) \Big|
= O_p\Bigg( \sqrt{\frac{\log k_n}{n}} \Bigg).
\]
So we have
\[
\max_{1\le j\le k_n} \Big| \frac{1}{n}\sum_{i=1}^n r^\pi_{ni}\psi^{k_n}_j(Z_i)D_i\epsilon_{1,i} \Big|
\lesssim \max_{1\le i\le n}|r^\pi_{ni}|\sqrt{\frac{\log k_n}{n}}
\]
with probability approaching one. From Assumption 8, $\max_{1\le i\le n}|1/\widehat\pi_i - 1/\pi_i| = O_p(1)$. Thus, taking $t = \log p$, with probability at least $1 - 1/p$, $\sqrt{k_n}\,\|\mathbb{E}_n[E_1\psi^{k_n}(Z_i)]\|_\infty \lesssim \sqrt{k_n\log k_n/n}$. The terms involving $E_2$–$E_6$ can be bounded by $\sqrt{k_n\log k_n/n}$ similarly, as shown in Lemma 7 with Assumption 8, and thus we omit their proof.

A.3. Proof of Theorem 2
Proof. Consider the following decomposition: for $\widehat\Sigma_{\tilde X} := \mathbb{E}_n[\tilde X_i\tilde X_i^\top]$,
\[
\widehat T - \xi^\top\beta_0
= \xi^\top(\widehat\beta - \beta_0) - \widehat w^\top\mathbb{E}_n\big[(\widehat S_i - X_i^\top\widehat\beta - \widehat f(Z_i))\tilde X_i\big]
\]
\[
= \xi^\top(\widehat\beta - \beta_0)
- \widehat w^\top\mathbb{E}_n\big[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i\big]
- \widehat w^\top\mathbb{E}_n\big[(\widehat S_i - S_i)\tilde X_i\big]
+ \widehat w^\top\mathbb{E}_n\big[\big(X_i^\top(\widehat\beta - \beta_0) - \widehat f(Z_i) + f(Z_i)\big)\tilde X_i\big]
\]
\[
= -\widehat w^\top\mathbb{E}_n\big[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i\big]
- \widehat w^\top\mathbb{E}_n\big[(\widehat S_i - S_i)\tilde X_i\big]
+ (\widehat\Sigma_{\tilde X}\widehat w + \xi)^\top(\widehat\beta - \beta_0)
+ \widehat w^\top\mathbb{E}_n\big[\big(\Pi_{n,X_i|Z_i}^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big)\tilde X_i\big]
:= -I_1 - I_2 + I_3 + I_4 .
\]
Recall that $\widehat S_i = \widehat\rho_i(\Delta Y_i - (1-\widehat\pi_i)\widehat\Phi_0(W_i) - \widehat\pi_i\widehat\Phi_1(W_i))$ and $S_i = \rho_i(\Delta Y_i - (1-\pi_i)\Phi_0(W_i) - \pi_i\Phi_1(W_i))$. We now analyze the four terms in the last expression one by one.

For the first term,
\[
I_1 - w^\top\mathbb{E}_n[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i]
= (\widehat w - w)^\top\mathbb{E}_n[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i]
\le \|\widehat w - w\|_1\big\|\mathbb{E}_n[(S_i - X_i^\top\beta_0 - f(Z_i))\tilde X_i]\big\|_\infty
= O_p\big(s_w\sqrt{\log p/n}\big)\times O_p\big(\sqrt{\log p/n}\big)
= O_p\big(s_w\log p/n\big),
\]
where $\|\widehat w - w\|_1 = O_p(s_w\sqrt{\log p/n})$ because $\widehat w$ is a Dantzig selector as defined in Theorem 7.1 of Bickel, Ritov and Tsybakov (2009), and $\|\mathbb{E}_n[\epsilon_i\tilde X_i]\|_\infty = O_p(\sqrt{\log p/n})$ by Lemmas E.1 and E.2 of Chernozhukov, Chetverikov and Kato (2017). Then Assumption 11(i) guarantees that the remainder term in $I_1$ is $o_p(n^{-1/2})$.

For the second term,
\[
I_2 \le \|\widehat w\|_1\big\|\mathbb{E}_n[(\widehat S_i - S_i)\tilde X_i]\big\|_\infty
\le \|w\|_1\big\|\mathbb{E}_n[(\widehat S_i - S_i)\tilde X_i]\big\|_\infty ,
\]
where $\|w\|_1 \le s_w$ and $\|\widehat w\|_1 \le \|w\|_1$ by the definition of $\widehat w$. Then Lemma 9 implies that $s_w\|\mathbb{E}_n[(\widehat S_i - S_i)\tilde X_i]\|_\infty = o_p(n^{-1/2})$, so $\sqrt n\,I_2 = o_p(1)$.

For the third term,
\[
I_3 \le \|\widehat\Sigma_{\tilde X}\widehat w + \xi\|_\infty\|\widehat\beta - \beta_0\|_1
\le \lambda'\|\widehat\beta - \beta_0\|_1
= O_p\Big(\frac{s\log p}{n}\Big),
\]
where the last step follows from Theorem 1 and $\lambda' = O_p(\sqrt{\log p/n})$, so $\sqrt n\,I_3 = o_p(1)$ because $s\log p/\sqrt n = o(1)$.

For the last term $I_4$, by construction $\mathbb{E}_n[\Pi_{n,X_i|Z_i}^\top\tilde X_i] = 0$ and $\mathbb{E}_n[\{\widehat f(Z_i) - f_n(Z_i)\}\tilde X_i] = 0$. Moreover, Assumption 11(i) implies that $s_w\max_{1\le j\le p,\,1\le i\le n}|\tilde X_{ij}r_{ni}| = o(n^{-1/2})$, so we have
\[
\sqrt n\,I_4 \le \sqrt n\,\|w\|_1\|\mathbb{E}_n[r_{ni}\tilde X_i]\|_\infty = o_p(1).
\]
Combining the above results for $I_1$–$I_4$ and Assumption 11(ii), we obtain
\[
\widehat T - \xi^\top\beta_0
= -w^\top\mathbb{E}_n[\epsilon_i\tilde X_i] + o_p(n^{-1/2})
= -w^\top\mathbb{E}_n[\epsilon_i\bar X_i] + o_p(n^{-1/2}).
\]
Next, for $\Omega_\beta = \mathbb{E}[\sigma_i^2\bar X_i\bar X_i^\top]$, note that $w = \Sigma_{\bar X}^{-1}\xi$ and $\xi^\top V_\beta\xi = w^\top\Omega_\beta w$, so that
\[
\mathbb{E}\Bigg[ (w^\top\Omega_\beta w)^{-1/2}\frac{1}{\sqrt n}\sum_{i=1}^n w^\top\bar X_i\epsilon_i \Bigg] = 0
\]
because $\mathbb{E}[\epsilon_i\bar X_i] = 0$, and
\[
\mathrm{Var}\Bigg[ (w^\top\Omega_\beta w)^{-1/2}\frac{1}{\sqrt n}\sum_{i=1}^n w^\top\bar X_i\epsilon_i \Bigg] = 1.
\]
We want to verify Lyapunov's condition for the CLT. In particular, we wish to show that
\[
\frac{1}{(w^\top\Omega_\beta w)^{r_\epsilon/2}}\sum_{i=1}^n \mathbb{E}\big[ |w^\top\bar X_i\epsilon_i/\sqrt n|^{r_\epsilon} \big] = o_p(1).
\tag{A.10}
\]
First, because $\|w\|_1 \le s_w$, for $r_\epsilon > 2$,
\[
\sum_{i=1}^n \mathbb{E}\big[ |w^\top\bar X_i\epsilon_i/\sqrt n|^{r_\epsilon} \big]
\le \sum_{i=1}^n \mathbb{E}\big[ \|w\|_1^{r_\epsilon}\|\bar X_i\epsilon_i/\sqrt n\|_\infty^{r_\epsilon} \big]
\le \sum_{i=1}^n \Big(\frac{s_w}{\sqrt n}\Big)^{r_\epsilon}\max_{1\le k\le p}\mathbb{E}\big[ |\bar X_{ik}\epsilon_i|^{r_\epsilon} \big]
= O_p\Bigg( \frac{s_w^{r_\epsilon}}{n^{r_\epsilon/2 - 1}} \Bigg),
\]
where $\max_{1\le k\le p}\mathbb{E}[|\bar X_{ik}\epsilon_i|^{r_\epsilon}] < \infty$ by the Cauchy–Schwarz inequality. Moreover, because
\[
(\xi^\top V_\beta\xi)^{r_\epsilon/2}
\ge \big[\|\xi\|^2\Lambda_{\min}(\Omega_\beta)\Lambda_{\min}(\Sigma_{\bar X}^{-1})^2\big]^{r_\epsilon/2} > 0,
\]
Lyapunov's condition (A.10) holds, so
\[
(w^\top\Omega_\beta w)^{-1/2}\frac{1}{\sqrt n}\sum_{i=1}^n w^\top\bar X_i\epsilon_i \to_d N(0,1),
\]
which implies that
\[
\sqrt n(\widehat T - \xi^\top\beta_0) \to_d N(0, w^\top\Omega_\beta w) = N(0, \xi^\top V_\beta\xi).
\]
Finally, we show that $\widehat V_\beta \to_p V_\beta$. Let $\widehat\Omega_\beta = \frac{1}{n}\sum_{i=1}^n\widehat\epsilon_i^2\tilde X_i\tilde X_i^\top$ and $\tilde\Omega_\beta = \frac{1}{n}\sum_{i=1}^n(\epsilon_i + r_{ni})^2\tilde X_i\tilde X_i^\top$. Note that
\[
|\widehat w^\top\widehat\Omega_\beta\widehat w - \widehat w^\top\tilde\Omega_\beta\widehat w| \le \|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty\|\widehat w\|_1^2, \qquad
|\widehat w^\top\tilde\Omega_\beta\widehat w - \widehat w^\top\Omega_\beta\widehat w| \le \|\tilde\Omega_\beta - \Omega_\beta\|_\infty\|\widehat w\|_1^2,
\]
and $|\widehat w^\top\Omega_\beta\widehat w - w^\top\Omega_\beta w| \le \|\Omega_\beta\|_\infty\|\widehat w - w\|_1$. Thus,
\[
|\widehat V_\beta - V_\beta| = |\widehat w^\top\widehat\Omega_\beta\widehat w - w^\top\Omega_\beta w|
\le \|\widehat w\|_1^2\big( \|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty + \|\tilde\Omega_\beta - \Omega_\beta\|_\infty \big)
+ \|\widehat w - w\|_1\|\Omega_\beta\|_\infty
\le s_w^2\big( \|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty + \|\tilde\Omega_\beta - \Omega_\beta\|_\infty \big) + o_p(1).
\tag{A.11}
\]
We next bound $\|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty$. First note that Lemmas E.1 and E.2 of Chernozhukov, Chetverikov and Kato (2017) imply that
\[
\max_{1\le j\le p}\Big| \frac{1}{n}\sum_{i=1}^n \big(\tilde X_{ij}^2\epsilon_i^2 - \mathbb{E}[\tilde X_{ij}^2\epsilon_i^2]\big) \Big| = O_p\Big(\sqrt{\frac{\log p}{n}}\Big),
\]
so
\[
s_w^2\|\tilde\Omega_\beta - \Omega_\beta\|_\infty = o_p(1)
\tag{A.12}
\]
because of Assumption 11(i). Moreover, because
\[
\|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty
= \Big\| \frac{1}{n}\sum_{i=1}^n \big(\widehat\epsilon_i^2 - (\epsilon_i + r_{ni})^2\big)\tilde X_i\tilde X_i^\top \Big\|_\infty
\le 2\max_{1\le j,k\le p}\Big| \frac{1}{n}\sum_{i=1}^n \tilde X_{ij}\tilde X_{ik}\epsilon_{ni}\big(X_i^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big) \Big|
+ \max_{1\le j,k\le p}\Big| \frac{1}{n}\sum_{i=1}^n \tilde X_{ij}\tilde X_{ik}\big(X_i^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big)^2 \Big|,
\tag{A.13}
\]
where the first term on the RHS of (A.13) is bounded by
\[
2\max_{1\le i\le n}\max_{1\le j,k\le p}|\tilde X_{ij}\tilde X_{ik}|\Bigg(
\Big\|\frac{1}{n}\sum_{i=1}^n(\epsilon_i + r_{ni})X_i\Big\|_\infty\|\widehat\beta - \beta_0\|_1
+ \sup_z|\widehat f(z) - f(z)|\sqrt{\frac{1}{n}\sum_{i=1}^n\epsilon_{ni}^2} \Bigg)
= O_p\Bigg( \log(np)\Bigg( \frac{s\log p}{n} + \sqrt{\frac{\xi^2(k_n)k_n}{n}} + \ell_{k_n}c_{k_n} \Bigg) \Bigg).
\]
Moreover, the second term is bounded by
\[
\max_{1\le i\le n}\max_{1\le j,k\le p}|\tilde X_{ij}\tilde X_{ik}|\,
\frac{1}{n}\sum_{i=1}^n\big(X_i^\top(\widehat\beta - \beta_0) + \widehat f(Z_i) - f(Z_i)\big)^2
= O_p\Bigg( \log(np)\Bigg( \frac{s\log p}{n} + \sqrt{\frac{k_n}{n}} + \ell_{k_n}c_{k_n} \Bigg) \Bigg),
\]
where we use the sub-Gaussian property to obtain that $\max_{1\le i\le n}\max_{1\le j,k\le p}|X_{ij}X_{ik}| = O_p((\log(np))^2)$. Thus, with the additional assumption in Theorem 2, we have
\[
s_w^2\|\widehat\Omega_\beta - \tilde\Omega_\beta\|_\infty = o_p(1).
\tag{A.14}
\]
Then $|\widehat V_\beta - V_\beta| = o_p(1)$ follows from (A.11), (A.12), (A.14), and Assumption 11.
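The bias-correction mechanism driving Theorem 2 can be seen in a stripped-down linear model. In this sketch (our simplifications: no projection on $Z$, $w$ solved exactly rather than via the Dantzig selector, a one-pass soft-threshold standing in for the Lasso, and the sign convention of the textbook debiased Lasso rather than the paper's $S_i$-based one), the one-step correction removes the shrinkage bias of the pilot estimate of $\xi^\top\beta_0$:

```python
import numpy as np

# Hedged sketch of one-step debiasing:
#   T_hat = xi'beta_hat + w' E_n[(y_i - x_i'beta_hat) x_i],  w = (E_n[x x'])^{-1} xi.
# A simplified analogue of the estimator analyzed in Theorem 2, not the
# paper's implementation.
rng = np.random.default_rng(3)
n, p, s = 500, 20, 3
beta0 = np.zeros(p)
beta0[:s] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta0 + rng.standard_normal(n)

lam = 2 * np.sqrt(np.log(p) / n)
beta_hat = X.T @ y / n                                                # marginal pilot fit
beta_hat = np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - lam, 0)  # soft-threshold

xi = np.zeros(p)
xi[0] = 1.0
w = np.linalg.solve(X.T @ X / n, xi)   # exact stand-in for the Dantzig-selector w
T_hat = xi @ beta_hat + w @ (X.T @ (y - X @ beta_hat) / n)
print(T_hat)                           # close to xi'beta0 = 1
```

Because $w$ solves the moment equation exactly here, the estimation error of the pilot cancels and $\widehat T - \xi^\top\beta_0$ reduces to the centered average $w^\top\mathbb{E}_n[x_i\epsilon_i]$, mirroring the asymptotic linearization in the proof.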
A.4. Proof of Theorem 3

Proof. We consider the following decomposition:
\[
\bar f(z) - f_n(z)
= \psi^{k_n}(z)^\top(\widehat\gamma_n - \gamma_n)
- \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ \big(\widehat S_i - X_i^\top\widehat\beta - \psi^{k_n}(Z_i)^\top\widehat\gamma_n\big)(\psi^{k_n}(Z_i) - \widehat MX_i) \big]
\]
\[
= -\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ \big(S_i - X_i^\top\beta_0 - \psi^{k_n}(Z_i)^\top\gamma_n\big)(\psi^{k_n}(Z_i) - \widehat MX_i) \big]
- \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (\widehat S_i - S_i)(\psi^{k_n}(Z_i) - \widehat MX_i) \big]
\]
\[
- \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (\psi^{k_n}(Z_i) - \widehat MX_i)X_i^\top \big](\widehat\beta - \beta_0)
- \psi^{k_n}(z)^\top\big( \widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (\psi^{k_n}(Z_i) - \widehat MX_i)\psi^{k_n}(Z_i)^\top \big] - I_{k_n} \big)(\widehat\gamma_n - \gamma_n)
:= II_1 + II_2 + II_3 + II_4 .
\tag{A.15}
\]
The first term $II_1$ can be further expanded as
\[
-\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(MX_i - \widehat MX_i) \big]
\tag{A.16}
\]
\[
-\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(\psi^{k_n}(Z_i) - MX_i) \big].
\tag{A.17}
\]
We first consider the term (A.16):
\[
\big\| \psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\mathbb{E}_n[(\widehat M - M)X_i(r_{ni} + \epsilon_i)] \big\|
\le \|\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\|\sqrt{k_n}\,\big\|\mathbb{E}_n[(\widehat M - M)X_i(r_{ni} + \epsilon_i)]\big\|_\infty
\le \|\psi^{k_n}(z)^\top\widehat\Sigma_f^{-1}\|\sqrt{k_n}\,\|\widehat M - M\|_\infty\|\mathbb{E}_n[X_i(r_{ni} + \epsilon_i)]\|_\infty
\]
\[
= O_p\Big( \sqrt{k_n}\,s_m\sqrt{(\log k_n + \log p)/n} \Big)\times O_p\Big( \sqrt{\log p/n} \Big),
\]
where $\|\widehat M - M\|_\infty = O_p(s_m\sqrt{(\log k_n + \log p)/n})$ from Lemma 10, $\max_{1\le j\le p,\,1\le i\le n}|X_{ij}r_{ni}| = o_p(n^{-1/2})$ by Assumption 11(i), and $\|\mathbb{E}_n[X_i\epsilon_i]\|_\infty = O_p(\sqrt{\log p/n})$, so the above term is $o_p(n^{-1/2})$ because of Assumption 12(i). Also note that
\[
(A.17) = -\psi^{k_n}(z)^\top\big( \widehat\Sigma_f^{-1} - \Sigma_f^{-1} \big)\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(\psi^{k_n}(Z_i) - MX_i) \big]
- \psi^{k_n}(z)^\top\Sigma_f^{-1}\mathbb{E}_n\big[ (r_{ni} + \epsilon_i)(\psi^{k_n}(Z_i) - MX_i) \big],
\]
where
\[
\|\widehat\Sigma_f - \Sigma_f\|
\le \big\| (\mathbb{E}_n - \mathbb{E})[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top] \big\|
\tag{A.18}
\]
\[
+ \big\| \widehat M\mathbb{E}_n[X_iX_i^\top]\widehat M^\top - M\mathbb{E}[X_iX_i^\top]M^\top \big\|.
\tag{A.19}
\]
The first term (A.18) is bounded by
\[
\big\|(\mathbb{E}_n - \mathbb{E})[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^\top]\big\| = O_p\Big( \sqrt{\xi^2(k_n)\log k_n/n} \Big)
\tag{A.20}
\]
following Lemma 6.2 in Belloni et al. (2015). The second term (A.19) is bounded by
\[
\big\| \widehat M\mathbb{E}_n[X_iX_i^\top]\widehat M^\top - M\mathbb{E}[X_iX_i^\top]M^\top \big\|
\le \sqrt{k_n}\,\big\| \widehat M\mathbb{E}_n[X_iX_i^\top]\widehat M^\top - M\mathbb{E}[X_iX_i^\top]M^\top \big\|_\infty
\]
\[
\le \sqrt{k_n}\,\|\widehat M - M\|_\infty\|M\|_\infty\|\mathbb{E}_n[X_iX_i^\top]\|_\infty
+ \sqrt{k_n}\,\|M\|_\infty^2\,\|(\mathbb{E}_n - \mathbb{E})[X_iX_i^\top]\|_\infty
= O_p\big( s_m\sqrt{k_n\log p/n} \big),
\tag{A.21}
\]
where the first and second inequalities follow from direct calculation and the last equality follows because $\lambda'' = O_p(\sqrt{\log p/n})$ and $\|M\|_\infty \le s_m$. Moreover, because $X_{ij}X_{ik} - \mathbb{E}[X_{ij}X_{ik}]$ is sub-exponential, the Bernstein inequality implies that for some constant $K_X$,
\[
P\Big( \max_{1\le j,k\le p}|(\mathbb{E}_n - \mathbb{E})X_{ij}X_{ik}| > K_X\sqrt{\log 2p/n} \Big) \le 1/p.
\]
Thus, (A.20) and (A.21) imply that
\[
\|\widehat\Sigma_f - \Sigma_f\| = O_p\Big( \sqrt{\xi^2(k_n)\log k_n/n} + s_m\sqrt{k_n\log p/n} \Big) = o_p(1),
\tag{A.22}
\]
where the second equality follows from Assumption 12(i).
Because σ − z ψ k n ( z ) (cid:62) Σ − f G n (cid:15) i ( ψ k n ( Z i ) − M X i ) = O p (1)and σ − z ψ k n ( z ) (cid:62) Σ − f G n r ni ( ψ k n ( Z i ) − M X i ) = O p ( (cid:96) k n c k n (cid:112) k n ) , with (A.22), we have √ nσ − z II = −√ nσ − z ψ k n ( z ) (cid:62) Σ − f E n (cid:110) ( (cid:15) i + r ni ) (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:111) + o p (1) . Similar to the argument in Theorem 2 of Newey (1997), with Assumption 4 and 6, theLindbergh-Feller central limit theorem gives us −√ nσ − z ψ k n ( z ) (cid:62) Σ − f E n (cid:110) (cid:15) i (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:111) → d N (0 , . and Assumption 12(i) implies that −√ nσ − z ψ k n ( z ) (cid:62) Σ − f E n (cid:110) r ni (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:111) = o p (1) . Next we consider the term II . Note that √ nσ − / z ψ k n ( z ) (cid:62) Σ − f E n (cid:104) ( ˆ S i − S i )( ˆ M − M ) X i (cid:105) ≤ (cid:112) k n n (cid:107) σ − / z ψ k n ( z ) (cid:62) Σ − f (cid:107) (cid:107) E n (cid:104) ( ˆ S i − S i ) X i (cid:105) (cid:107) ∞ (cid:107) ˆ M − M (cid:107) = O p ( s m (cid:112) k n (cid:112) log p/n ) = o p (1)and further from Lemma 9. √ nσ − / z ψ k n ( z ) (cid:62) Σ − f E n (cid:104)(cid:16) ˆ S i − S i (cid:17) (cid:16) ψ k n ( Z i ) − M X i (cid:17)(cid:105) = o p (1)40ext, consider equation II . Consider the following decomposition: (cid:12)(cid:12)(cid:12) σ − / z ψ k n ( z ) (cid:62) ˆΣ − f E n (cid:104) ( ψ k n ( Z i ) − ˆ M X i ) X (cid:62) i (cid:105) ( ˆ β − β ) (cid:12)(cid:12)(cid:12) ≤ (cid:107) − σ − / z ψ k n ( z ) (cid:62) ˆΣ − f (cid:107) · (cid:112) k n (cid:13)(cid:13)(cid:13) E n (cid:104) ( ψ k n ( Z i ) − ˆ M X i ) X (cid:62) i (cid:105) ( ˆ β − β ) (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:107) − σ − / z ψ k n ( z ) (cid:62) ˆΣ − f (cid:107) · (cid:112) k n (cid:13)(cid:13)(cid:13) E n (cid:104) ( ψ k n ( Z i ) − ˆ M X i ) X (cid:62) i (cid:105)(cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13) ( ˆ β − β ) (cid:13)(cid:13)(cid:13) . 
From the definition of ˆ M j , (cid:13)(cid:13)(cid:13) E n (cid:110) ( ψ K n j ( Z i ) − ˆ M (cid:62) j X i ) X (cid:62) i (cid:111)(cid:13)(cid:13)(cid:13) ∞ ≤ λ (cid:48)(cid:48) for all j ; and from The-orem 1, (cid:107) ( ˆ β − β ) (cid:107) = O p ( s (cid:112) log p/n ), thus by choosing λ (cid:48)(cid:48) = O ( (cid:112) log p/n ), II = o p (1)because √ ns m s log p/n = o p (1) by Assumption 12(i).Finally the last term II is 0, since ˆΣ − f E n (cid:110) ( ψ k n ( Z i ) − ˆ M X i ) ψ k n ( Z i ) (cid:62) (cid:111) = I k n .When √ nσ − l k n c k n = o p (1), we have √ nσ − z ( ˜ f ( z ) − f ( z )) = √ nσ − z ( ˜ f ( z ) − f n ( z )) + o p (1) → d N (0 , . Next, we show the consistency of the variance term. Similar to Theorem 4.6 in Belloniet al. (2015), we have (cid:107) ( E n − E )[ σ i ( Z i , X i ) ψ k n ( Z i ) ψ k n ( Z i ) (cid:62) ] (cid:107) = O p (cid:18)(cid:113) ξ ( k n ) log k n /n (cid:19) . For V f = Σ − f Ω f Σ − f and σ z = ψ k n ( z ) (cid:62) V f ψ k n ( z ), (cid:107) ˆ V f − V f (cid:107) (cid:46) (cid:107) ( ˆΣ − f − Σ − f ) ˆΩ f ˆΣ − f (cid:107) + (cid:107) Σ − f ( ˆΩ f − Ω f )Σ − f (cid:107) + (cid:107) Σ − f Ω f ( ˆΣ − f − Σ − f ) (cid:107) . Note that for ˜Ω f = E n (cid:2) σ i ψ k n ( Z i ) ψ k n ( Z i ) (cid:62) (cid:3) − ˆ M E n (cid:2) σ i X i X (cid:62) i (cid:3) ˆ M , (cid:107) ˆΩ f − Ω f (cid:107) ≤ (cid:107) ˆΩ f − ˜Ω f (cid:107) + (cid:107) ˜Ω f − Ω f (cid:107) . 
To bound $\|\hat{\Omega}_f-\tilde{\Omega}_f\|$, note that
\[
\begin{aligned}
\Big\|\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|
&\le\max_{1\le i\le n}\Big|X_i^{\top}(\hat{\beta}-\beta)+\hat{f}(Z_i)-f_n(Z_i)\Big|^2\,\Big\|\mathbb{E}_n\big[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|\\
&\quad+2\max_{1\le i\le n}|\epsilon_i+r_{ni}|\max_{1\le i\le n}\Big|X_i^{\top}(\hat{\beta}-\beta)+\hat{f}(Z_i)-f_n(Z_i)\Big|\,\Big\|\mathbb{E}_n\big[\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|\\
&\lesssim_p\|\hat{Q}_z\|\,O_p\Big(\sqrt{\log np}\,\big(n^{1/r_\epsilon}+c_{k_n}\ell_{k_n}\big)\Big(s\sqrt{\log p/n}+\sqrt{\xi^2(k_n)k_n/n}+c_{k_n}\ell_{k_n}\Big)\Big)=o_p(1),
\end{aligned}
\]
where the second inequality follows from the results in Theorem 1 and the last equality follows from the condition in Theorem 3. Similarly,
\[
\begin{aligned}
\Big\|\hat{M}\,\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)X_iX_i^{\top}\big]\hat{M}^{\top}\Big\|
&\le\sqrt{k_n}\,\Big\|\hat{M}\,\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)X_iX_i^{\top}\big]\hat{M}^{\top}\Big\|_\infty\\
&\le\sqrt{k_n}\,\|\hat{M}\|_1^2\,\Big\|\mathbb{E}_n\big[(\hat{\sigma}_i^2-\sigma_i^2)X_iX_i^{\top}\big]\Big\|_\infty
=O_p\bigg(\sqrt{k_n}\,s_m^2\bigg(s\sqrt{\frac{\xi^2(k_n)\log p}{n}}+\ell_{k_n}c_{k_n}\bigg)\bigg),
\end{aligned}
\]
so we have $\|\hat{\Omega}_f-\tilde{\Omega}_f\|=o_p(1)$. To show $\|\tilde{\Omega}_f-\Omega_f\|=o_p(1)$, note that
\[
\Big\|(\mathbb{E}_n-\mathbb{E})\big[\sigma_i^2\psi^{k_n}(Z_i)\psi^{k_n}(Z_i)^{\top}\big]\Big\|=O_p\bigg((1+\ell_{k_n}c_{k_n})\sqrt{\frac{\xi^2(k_n)\log k_n}{n}}\bigg)
\]
by Theorem 4.6 in Belloni et al. (2015). Because
\[
\max_{1\le j\le p}\bigg|\frac{1}{n}\sum_{i=1}^n\big(X_{ij}^2\sigma_i^2-\mathbb{E}\big[X_{ij}^2\sigma_i^2\big]\big)\bigg|=O_p\bigg(\sqrt{\frac{\log p}{n}}\bigg),
\]
then
\[
\begin{aligned}
\Big\|\hat{M}\,\mathbb{E}_n\big[\sigma_i^2X_iX_i^{\top}\big]\hat{M}^{\top}-M\,\mathbb{E}\big[\sigma_i^2X_iX_i^{\top}\big]M^{\top}\Big\|
&\le\sqrt{k_n}\,\|\hat{M}-M\|_1\,\|M\|_1\,\big\|\mathbb{E}_n\big[\sigma_i^2X_iX_i^{\top}\big]\big\|_\infty+\|M\|_1^2\,\big\|(\mathbb{E}_n-\mathbb{E})\big[\sigma_i^2X_iX_i^{\top}\big]\big\|_\infty\\
&=O_p\big(\sqrt{k_n}\,s_m\sqrt{\log p/n}\big)=o_p(1).
\end{aligned}
\]
The conclusion follows from the triangle inequality.
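The plug-in sandwich variance $\hat{V}_f = \hat{\Sigma}_f^{-1}\hat{\Omega}_f\hat{\Sigma}_f^{-1}$ whose consistency is established above can be illustrated numerically. The sketch below is a simplified toy, not the paper's estimator: it runs a pure series regression of $Y$ on a polynomial basis of $Z$, ignoring the high-dimensional $X$-part and the two-stage structure; the function name `series_estimate` and the basis choice are assumptions for illustration only.

```python
import numpy as np

def series_estimate(z, Z, Y, k_n=5):
    """Toy series regression of Y on a degree-(k_n - 1) polynomial basis of Z,
    with a plug-in sandwich variance for the pointwise estimate f_hat(z)."""
    n = len(Z)
    psi = np.vander(Z, k_n, increasing=True)          # basis psi^{k_n}(Z_i), n x k_n
    Sigma = psi.T @ psi / n                           # Sigma_f = E_n[psi psi']
    Sigma_inv = np.linalg.inv(Sigma)
    b = Sigma_inv @ (psi.T @ Y / n)                   # series coefficients
    resid = Y - psi @ b
    Omega = (psi * resid[:, None] ** 2).T @ psi / n   # Omega_f = E_n[sigma_i^2 psi psi']
    V = Sigma_inv @ Omega @ Sigma_inv                 # sandwich V_f
    psi_z = np.vander(np.array([z]), k_n, increasing=True)[0]
    f_hat = psi_z @ b
    se = np.sqrt(psi_z @ V @ psi_z / n)               # pointwise standard error at z
    return f_hat, se

# Toy data: f(z) = sin(2z) with homoskedastic noise.
rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, 2000)
Y = np.sin(2 * Z) + rng.normal(scale=0.1, size=2000)
f_hat, se = series_estimate(0.5, Z, Y)
```

A pointwise 90% confidence band at $z$ is then `f_hat ± 1.645 * se`, mirroring the normal limit derived above.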
Lemma 9.
Suppose that the conditions in Theorems 2 and 3 are satisfied. Then
\[
s_w\Big\|\mathbb{E}_n\big[(\hat{S}_i-S_i)\tilde{X}_i\big]\Big\|_\infty=o_p(1/\sqrt{n}),\qquad
\Big\|\mathbb{E}_n\big[(\hat{S}_i-S_i)X_i\big]\Big\|_\infty=o_p(1/\sqrt{n}),\qquad
\sqrt{k_n}\,\Big\|\mathbb{E}_n\big[(\hat{S}_i-S_i)(\psi^{k_n}(Z_i)-\hat{M}X_i)\big]\Big\|_\infty=o_p(1/\sqrt{n}).
\]

Proof.
Similar to the proof of Lemma 7, we consider the interaction of the functions $F_1=\tilde{X}_i$, $F_2=X_i$ and $F_3=\psi^{k_n}(Z_i)-\hat{M}X_i$ with each component of $\hat{S}_i-S_i$. By the Bernstein inequality and the union bound,
\[
P\bigg(\max_{1\le j\le p}\bigg|\frac{1}{n}\sum_{i=1}^n r_{\pi ni}F_{kj}D_i\epsilon_i\bigg|\ge C_\pi\|r_{\pi ni}\|_n\sqrt{(t+\log p)/n}\bigg)\le\exp(-t)
\]
for $k=1,2,3$. When $k=1$, taking $t=\log p$ and using Assumption 10, with probability at least $1-1/p$, $s_w\|\mathbb{E}_n(F_1\cdot(\hat{S}_i-S_i))\|_\infty=o_p(1/\sqrt{n})$. When $k=2$, we can apply a similar argument, and with probability at least $1-1/p$, $\|\mathbb{E}_n(F_2\cdot(\hat{S}_i-S_i))\|_\infty=o_p(1/\sqrt{n})$. And again, when $k=3$, with probability at least $1-1/k_n$, $\sqrt{k_n}\|\mathbb{E}_n(F_3\cdot(\hat{S}_i-S_i))\|_\infty=o_p(1/\sqrt{n})$. The same logic as in Lemma 7 leads to the rest of the terms, which we omit here.

Lemma 10. Let $M_j^{\top}=\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}$, and suppose that the conditions in Theorem 3 are satisfied. Then
\[
\|M-\hat{M}\|_1=O_p\big(s_m\sqrt{(\log k_n+\log p)/n}\big).
\]

Proof.
Since $\psi_{k_n j}(Z_i)=M_j^{\top}X_i+\big(\psi_{k_n j}(Z_i)-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_i\big)$, we define
\[
\upsilon_i:=\psi_{k_n j}(Z_i)-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_i,
\]
so that
\[
\upsilon_iX_i^{\top}=\psi_{k_n j}(Z_i)X_i^{\top}-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_iX_i^{\top}
=\big(\psi_{k_n j}(Z_i)X_i^{\top}-\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\big)+\mathbb{E}(\psi_{k_n j}(Z_i)X_i^{\top})\big(I_{p\times p}-\{\mathbb{E}[X_iX_i^{\top}]\}^{-1}X_iX_i^{\top}\big).
\]
Thus $\mathbb{E}(\upsilon_iX_i^{\top})=0$, and the event $\mathcal{B}=\{\|\mathbb{E}_n(\upsilon_iX_i^{\top})\|_\infty<\lambda''\}$ has probability at least $1-\exp(-cn\lambda''^2)$. Equation (4.3) is thus a linear Dantzig selector as defined in Theorem 7.1 of Bickel, Ritov and Tsybakov (2009). Therefore
\[
P\big(\|M-\hat{M}\|_1>\lambda''\big)\le\sum_{j=1}^{k_n}P\big(\|M_j-\hat{M}_j\|_1>\lambda''\big)\le\exp\big(\log k_n-cn\lambda''^2\big).
\]
By choosing $\lambda''\gtrsim\sqrt{(\log p+\log k_n)/n}$, with probability at least $1-\exp(-c\log k_n-c\log p)$,
\[
\|M-\hat{M}\|_1=O_p\big(s_m\sqrt{(\log k_n+\log p)/n}\big).
\]

References
Abadie, A. (2005). Semiparametric difference-in-differences estimators. Review of Economic Studies.

Ahn, H. and Powell, J. L. (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics.

Athey, S. and Imbens, G. W. (2006). Identification and inference in nonlinear difference-in-differences models. Econometrica.

Athey, S. and Imbens, G. W. (2019). Design-based analysis in difference-in-differences settings with staggered adoption.

Belloni, A., Chernozhukov, V. and Wei, Y. (2016). Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics.

Belloni, A., Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Some new asymptotic theory for least squares series: pointwise and uniform results. Journal of Econometrics.

Belloni, A., Chernozhukov, V., Fernández-Val, I. and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Cai, T. T. and Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: minimax rates and adaptivity. The Annals of Statistics.

Callaway, B. and Li, T. (2020). Quantile treatment effects in difference in differences models with panel data. Quantitative Economics.

Callaway, B. and Sant'Anna, P. H. C. (2019). Difference-in-differences with multiple time periods and an application on the minimum wage and employment.

Caner, M. and Kock, A. B. (2018). Asymptotically honest confidence regions for high dimensional parameters by the desparsified conservative lasso. Journal of Econometrics.

Card, D. and Krueger, A. B. (1994). Minimum wages and employment: a case study of the fast-food industry in New Jersey and Pennsylvania. The American Economic Review.

Cattaneo, M. D., Jansson, M. and Ma, X. (2019). Two-step estimation and inference with possibly many included covariates. Review of Economic Studies.

Chen, X. and Christensen, T. M. (2015). Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. Journal of Econometrics.

Chen, X., Hong, H., Tarozzi, A. et al. (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics.

Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K. and Robins, J. M. (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, C1-C68.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018b). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal.

Donald, S. G. and Newey, W. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis.

Engle, R. F., Granger, C. W., Rice, J. and Weiss, A. (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association.

Fan, Y. and Li, Q. (1999). Root-n-consistent estimation of partially linear time series models. Journal of Nonparametric Statistics.

Fan, Q., Hsu, Y.-C., Lieli, R. P. and Zhang, Y. (2020). Estimation of conditional average treatment effects with high-dimensional data. Journal of Business & Economic Statistics, forthcoming.

Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics.

Freyberger, J. and Masten, M. A. (2019). A practical guide to compact infinite dimensional parameter spaces. Econometric Reviews.

Gold, D., Lederer, J. and Tao, J. (2020). Inference for high-dimensional instrumental variables regression. Journal of Econometrics.

Graham, B. S., de Xavier Pinto, C. C. and Egel, D. (2012). Inverse probability tilting for moment condition models with missing data. The Review of Economic Studies.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association.

Imai, K., Kim, I. S. and Wang, E. (2019). Matching methods for causal inference with time-series cross-sectional data.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research.

Kennedy, E. H., Lorch, S. and Small, D. S. (2019). Robust causal inference with continuous instruments using the local instrumental variable curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Lee, S., Okui, R. and Whang, Y.-J. (2017). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics.

Li, Q. and Racine, J. S. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press.

Linton, O. (1995). Second order approximation in the partially linear regression model. Econometrica: Journal of the Econometric Society.

Ma, C. and Huang, J. (2016). Asymptotic properties of Lasso in high-dimensional partially linear models. Science China Mathematics.

Müller, P. and van de Geer, S. (2015). The partial linear model in high dimensions. Scandinavian Journal of Statistics.

Newey, W. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics.

Neykov, M., Ning, Y., Liu, J. S., Liu, H. et al. (2018). A unified theory of confidence regions and testing for high-dimensional estimating equations. Statistical Science.

Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics.

Ning, Y., Zhao, T., Liu, H. et al. (2017). A likelihood ratio framework for high-dimensional semiparametric regression. The Annals of Statistics.

Ogburn, E. L., Rotnitzky, A. and Robins, J. M. (2015). Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Okui, R., Small, D. S., Tan, Z. and Robins, J. M. (2012). Doubly robust instrumental variable regression. Statistica Sinica.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society.

Sant'Anna, P. and Zhao, J. (2020). Doubly robust difference-in-differences estimators. Working paper.

Semenova, V. and Chernozhukov, V. (2017). Estimation and inference about conditional average treatment effect and other structural functions. arXiv preprint arXiv:1702.06240.

Shen, X. (1997). On methods of sieves and penalization. The Annals of Statistics.

Słoczyński, T. and Wooldridge, J. M. (2018). A general double robustness result for estimating average treatment effects. Econometric Theory.

Syrgkanis, V., Lei, V., Oprescu, M., Hei, M., Battocchi, K. and Lewis, G. (2019). Machine learning estimation of heterogeneous treatment effects with instruments. In NeurIPS 2019.

Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association.

Tan, Z. (2020). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Annals of Statistics, forthcoming.

Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics.

van de Geer, S. (2014). On the uniform convergence of empirical norms and inner products, with application to causal inference. Electronic Journal of Statistics.

van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics.

Vermeulen, K. and Vansteelandt, S. (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association.

Yu, Z., Levine, M., Cheng, G. et al. (2019). Minimax optimal estimation in partially linear additive models under high dimension. Bernoulli.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Zhu, Y., Yu, Z. and Cheng, G. (2019). High dimensional inference in partially linear models. Proceedings of Machine Learning Research, PMLR.

Table 1
Simulation: homogeneous error with independent confounders

                     p = 10              p = 50              p = 500            p = 1000
                Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD
n =
  linear
    Bias       -0.0476    0.0746   -0.0279   -0.1177    0.0008      -      -0.0045      -
    Std Err     0.5241    1.2641    0.4533    4.3207    0.3732      -       0.4030      -
    RMSE        0.2816    1.7240    0.2116   21.9360    0.1410      -       0.1677      -
    Coverage    0.8300    0.9260    0.8680    0.9710    0.8833      -       0.8900      -
    CI length   1.4018    4.1499    1.2448   49.9842    1.0778      -       1.0800      -
  nonparametric
    Bias       -0.0165    0.0128    0.0365   -0.1181    0.0260      -       0.0102      -
    Std Err     1.0612    1.3773    0.8825    8.7693    0.8072      -       0.9131      -
    RMSE        1.1701    1.9408    0.7846   80.8639    0.6548      -       0.8426      -
    Coverage    0.8200    0.9056    0.8525    0.9712    0.7988      -       0.8375      -
    CI length   2.4596    5.2435    2.3056   49.2043    2.1269      -       2.1916      -
n =
  linear
    Bias       -0.0258    0.0031   -0.0171   -0.0011   -0.0016      -      -0.0011      -
    Std Err     0.2812    0.4972    0.2268    0.5771    0.2086      -       0.1983      -
    RMSE        0.0828    0.2574    0.0536    0.3369    0.0441      -       0.0398      -
    Coverage    0.8570    0.8815    0.8656    0.9635    0.8857      -       0.8916      -
    CI length   0.7628    1.4949    0.6723    3.9223    0.6488      -       0.6261      -
  nonparametric
    Bias        0.0023   -0.0133    0.0028    0.0108    0.0031      -       0.0032      -
    Std Err     0.3953    0.6337    0.3872    0.7954    0.3715      -       0.3558      -
    RMSE        0.1564    0.4020    0.1501    0.6348    0.1384      -       0.1277      -
    Coverage    0.8888    0.8644    0.8688    0.9606    0.8588      -       0.8725      -
    CI length   1.1900    1.9907    1.1284    5.6116    1.1040      -       1.1037      -
n =
  linear
    Bias       -0.0091    0.0044   -0.0068    0.0094   -0.0014      -      -0.0008      -
    Std Err     0.1842    0.3126    0.1485    0.3347    0.1401      -       0.1399      -
    RMSE        0.0342    0.0992    0.0226    0.1132    0.0199      -       0.0198      -
    Coverage    0.8640    0.8720    0.8848    0.9027    0.8910      -       0.8946      -
    CI length   0.5432    0.9229    0.4683    1.2354    0.4466      -       0.4482      -
  nonparametric
    Bias        0.0047   -0.0251   -0.0187   -0.0121   -0.0069      -      -0.0083      -
    Std Err     0.2695    0.3950    0.2362    0.4562    0.2268      -       0.2446      -
    RMSE        0.0728    0.1570    0.0566    0.2087    0.0515      -       0.0599      -
    Coverage    0.8750    0.8725    0.8812    0.9075    0.8825      -       0.8712      -
    CI length   0.8223    1.2382    0.7472    1.7548    0.7196      -       0.7247      -

This table compares the doubly robust diff-in-diff estimator (denoted Dr-DiD) with the original semiparametric diff-in-diff estimator (denoted Semi-DiD) proposed in Abadie (2005). p represents the dimension for the linear specification. The nonparametric part is specified by an exponential function and is approximated by an 8th-degree trigonometric polynomial basis in both methods. The nominal coverage is at 90%.
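The reported metrics can be reproduced mechanically from Monte Carlo replications. The sketch below uses a toy data-generating process and estimator (a sample mean, not the paper's Dr-DiD design) purely to show how bias, standard error, RMSE, coverage at the 90% nominal level, and CI length are tabulated; `mc_metrics` is an illustrative helper, not part of the paper's R package.

```python
import numpy as np

def mc_metrics(estimates, std_errs, truth, z=1.645):
    """Summarize Monte Carlo replications the way Tables 1-2 do:
    bias, std err, RMSE, coverage of the nominal 90% CI, and average CI length."""
    err = estimates - truth
    covered = np.abs(err) <= z * std_errs          # truth inside the normal-based CI
    return {
        "Bias": err.mean(),
        "Std Err": estimates.std(ddof=1),
        "RMSE": np.sqrt((err ** 2).mean()),
        "Coverage": covered.mean(),
        "CI length": (2 * z * std_errs).mean(),
    }

# Toy replications: sample mean of N(truth, 1) with n = 100 per replication.
rng = np.random.default_rng(0)
truth, n, reps = 1.0, 100, 2000
draws = rng.normal(truth, 1.0, size=(reps, n))
est = draws.mean(axis=1)
se = draws.std(axis=1, ddof=1) / np.sqrt(n)
metrics = mc_metrics(est, se, truth)
```

With a correctly specified estimator the empirical coverage should sit near the 0.90 nominal level, which is the benchmark against which the Dr-DiD and Semi-DiD columns are judged.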
Table 2
Simulation: heterogeneous error with correlated confounders

                     p = 10              p = 50              p = 500            p = 1000
                Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD   Dr-DiD   Semi-DiD
n =
  linear
    Bias       -0.0213    0.0159    0.0122   -0.1277    0.0048      -       0.0027      -
    Std Err     0.5929    2.2577    0.5333   11.3165    0.4583      -       0.4649      -
    RMSE        0.3643    5.7073    0.2912  156.7863    0.2129      -       0.2255      -
    Coverage    0.8390    0.9260    0.8608    0.9360    0.8881      -       0.8912      -
    CI length   1.5717    8.8340    1.4061   12.8266    1.2764      -       1.2174      -
  nonparametric
    Bias        0.0327    0.0407   -0.0200    0.0026   -0.0391      -       0.0183      -
    Std Err     0.9525    2.7935    1.0049   12.2120    0.9813      -       0.9562      -
    RMSE        0.9288    8.0217    1.0136  176.1709    0.9779      -       0.9171      -
    Coverage    0.8275    0.9269    0.8200    0.9450    0.8275      -       0.8088      -
    CI length   2.4823   10.4079    2.4977   12.4647    2.4471      -       2.3827      -
n =
  linear
    Bias        0.0148   -0.0486    0.0072    0.0198    0.0006      -       0.0001      -
    Std Err     0.3691    1.7750    0.2966    2.5093    0.2305      -       0.2285      -
    RMSE        0.1368    3.8087    0.0888    6.7667    0.0536      -       0.0526      -
    Coverage    0.8350    0.8550    0.8798    0.9761    0.8901      -       0.8943      -
    CI length   0.9454    3.2110    0.8478   15.5631    0.7177      -       0.7059      -
  nonparametric
    Bias       -0.0068    0.0255   -0.0061    0.1500    0.0127      -      -0.0156      -
    Std Err     0.4750    1.8659    0.4719    2.7917    0.4122      -       0.3913      -
    RMSE        0.2278    3.7041    0.2237    7.9921    0.1695      -       0.1541      -
    Coverage    0.8800    0.8569    0.8487    0.9681    0.8225      -       0.8562      -
    CI length   1.3348    3.5635    1.2741   18.3474    1.1226      -       1.1594      -
n =
  linear
    Bias       -0.0018   -0.0000    0.0048   -0.0007    0.0004      -       0.0006      -
    Std Err     0.2733    0.6733    0.2071    1.1275    0.1850      -       0.1706      -
    RMSE        0.0750    0.4579    0.0433    1.3780    0.0345      -       0.0294      -
    Coverage    0.8410    0.8650    0.8720    0.9099    0.8943      -       0.8960      -
    CI length   0.7548    1.9303    0.6045    3.2445    0.5709      -       0.5378      -
  nonparametric
    Bias       -0.0116    0.0129    0.0114   -0.1022    0.0023      -       0.0081      -
    Std Err     0.3303    0.7637    0.2771    1.3032    0.2771      -       0.2642      -
    RMSE        0.1092    0.5839    0.0781    1.8072    0.0765      -       0.0703      -
    Coverage    0.8600    0.8750    0.8762    0.8981    0.8588      -       0.8488      -
    CI length   0.9666    2.1754    0.8233    3.6871    0.8006      -       0.7682      -

This table compares the doubly robust diff-in-diff estimator (denoted Dr-DiD) with the original semiparametric diff-in-diff estimator (denoted Semi-DiD) proposed in Abadie (2005). p represents the dimension for the linear specification. The nonparametric part is specified by an exponential function and is approximated by an 8th-degree trigonometric polynomial basis in both methods. The nominal coverage is at 90%.