An alternative to synthetic control for models with many covariates under sparsity
Marianne Bléhaut, Xavier D'Haultfoeuille, Jérémy L'Hour, Alexandre B. Tsybakov
May 26, 2020

Acknowledgments: We thank participants at the 2016 North American and European Meetings of the Econometric Society, the 2017 IAAE Meeting and CREST internal seminars for their useful comments and discussions. We acknowledge funding from Investissements d'Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047).
Author contacts: CREST-ENSAE; [email protected], [email protected], [email protected], [email protected].
Abstract
The synthetic control method is an econometric tool to evaluate causal effects when only one unit is treated. While initially aimed at evaluating the effect of large-scale macroeconomic changes with very few available control units, it has increasingly been used in place of more well-known microeconometric tools in a broad range of applications, but its properties in this context are unknown. This paper introduces an alternative to the synthetic control method, which is developed both in the usual asymptotic framework and in the high-dimensional scenario. We propose an estimator of the average treatment effect that is doubly robust, consistent and asymptotically normal. It is also immunized against first-step selection mistakes. We illustrate these properties using Monte Carlo simulations and applications to both standard and potentially high-dimensional settings, and offer a comparison with the synthetic control method.
Keywords: treatment effect, synthetic control, covariate balancing, high-dimension.
The synthetic control method (Abadie and Gardeazabal, 2003; Abadie et al., 2010, 2015) is one of the most recent additions to the empiricist's toolbox, gaining popularity not only in economics, but also in political science, medicine, etc. It provides a sound methodology in many settings where only long aggregate panel data is available to the researcher. The method was specifically developed for contexts in which a single sizeable unit such as a country, a state or a city undergoes a large-scale policy change (referred to as the treatment or intervention hereafter), while only a moderate number of control units (the donor pool) is available to construct a counterfactual through a synthetic unit. This unit is defined as the convex combination of units from the donor pool that best resembles the treated unit before the intervention. The treatment effect is then estimated from the difference in outcomes between the treated unit and its synthetic unit after the intervention takes place. In contexts such as those described above, the synthetic unit possesses several appealing properties (for more details, see the recent survey of Abadie, 2019). First, it does not lead to extrapolation outside the support of the data: because the weights are non-negative, the counterfactual never takes a value outside of the convex hull defined by the donor pool. Second, one can simply assess its fit, making it easy to judge the quality of the counterfactual. Third, the synthetic unit is sparse: the number of control units receiving a non-zero weight is at most equal to the dimension of the matching variable plus one.

The method still has some limitations, in particular when applied to micro data, for which it was not initially intended. In such cases, the number of untreated units n is typically greater than the dimension p of the variables X used to construct the synthetic unit. Then, as soon as the treated unit falls into the convex hull defined by the donor pool, the synthetic control solution is not uniquely defined (see in particular Abadie and L'Hour, 2019). Second, and still related to the fact that the method was not developed for micro data, there is, to the best of our knowledge, no asymptotic theory available for synthetic control yet. This means in particular that inference cannot be conducted in a standard way. A third issue is related to variable selection. The standard synthetic control method, as advocated in Abadie et al. (2010), not only minimizes under constraints the norm ‖·‖_V, defined for a vector a of dimension p and a diagonal positive-definite matrix V as ‖a‖_V = √(aᵀVa), between the characteristics of the treated unit and those of its synthetic unit, but also employs a bi-level optimization program over the weighting matrix V so as to obtain the best possible pre-treatment fit. The diagonal elements of V are interpreted as a measure of the predictive power of each characteristic for the outcome (see, e.g., Abadie et al., 2010; Abadie, 2019).
This approach has been criticized for being unstable and yielding unreproducible results, see in particular Klößner et al. (2018).

We consider an alternative to the synthetic control that addresses these issues. Specifically, we consider a parametric form for the synthetic control weights, Wᵢ = h(Xᵢᵀβ₀), where we estimate the unknown parameter β₀. This approach warrants the uniqueness of the solution in low-dimensional cases where p < n. With micro data, it may thus be seen as a particular solution of the synthetic control method. We show that the average treatment effect on the treated (ATT) can be estimated with a two-step GMM estimator, where β₀ is computed in a first step so that the reweighted control group matches some features of the treated units. A key result is the double robustness of the estimator, as defined by Bang and Robins (2005). Specifically, we show that misspecifications in the synthetic control weights do not prevent valid inference if the outcome regression function is linear for the control group.

We then turn to the high-dimensional case where p is large, possibly greater than n. This case actually corresponds to the initial set-up of the synthetic control method, and is therefore crucial to take into consideration. We depart from the synthetic control method by introducing an ℓ₁ penalization term in the minimization program used to estimate β₀. We thus perform variable selection in a similar way as the Lasso, but differently from the synthetic control method, which relies on the aforementioned optimization over V (leading to overweighting the variables that are good predictors of the outcome and underweighting the others).

We also study the asymptotic properties of our estimator. Building on double robustness, we construct an estimator that is immunized against first-step selection mistakes in the sense defined, for example, by Chernozhukov et al. (2015b) or Chernozhukov et al. (2018). This construction requires an extra step, which models the outcome regression function and provides a bias correction, a theme that has also been developed in Ben-Michael et al. (2018), Abadie and L'Hour (2019) and Arkhangelsky et al. (2019). We show that both in the low- and high-dimensional case, the estimator is consistent and asymptotically normal. Consequently, we develop inference based on the asymptotic approximation, which can be used in place of permutation tests when randomization of the treatment is not warranted.

Apart from its close connection with the synthetic control method, the present paper is related to the literature on treatment effect evaluation through propensity score weighting and covariate balancing. Several efforts have been made to include balance between covariates as an explicit objective for estimation, with or without relation to the propensity score (e.g. Hainmueller, 2012; Graham et al., 2012). Our paper is in particular related to that of Imai and Ratkovic (2014), who integrate propensity score estimation and covariate balancing in the same framework. We extend their paper by considering the case of high-dimensional covariates. Note that the covariate balancing idea is related to calibration estimation in survey sampling, see in particular Deville and Särndal (1992).

The paper also partakes in the econometric literature addressing variable selection, and more generally the use of machine learning tools, when estimating a treatment effect, especially but not exclusively in a high-dimensional framework.
The lack of uniformity of inference after a selection step has been raised in a series of papers by Leeb and Pötscher (2005, 2008a,b), echoing earlier papers by Leamer (1983), who put into question the credibility of many empirical policy evaluation results. One recent solution proposed to circumvent this post-selection conundrum is the use of double-selection procedures (Belloni and Chernozhukov, 2013; Farrell, 2015; Chernozhukov et al., 2015a; Chernozhukov et al., 2018). For example, Belloni et al. (2014) highlight the dangers of selecting controls exclusively on their relation to the outcome and propose a three-step procedure that helps selecting more controls and guards against omitted variable biases much more than a simple "post-single-selection" estimator, as is usually done by selecting covariates based on either their relation with the outcome or with the treatment variable, but rarely both. Farrell (2015) extends the main approach of Belloni et al. (2014) by allowing for heterogeneous treatment effects, proposing an estimator that is robust to model selection mistakes in either the propensity score or the outcome regression. However, Farrell (2015) proves the root-n consistency of his estimator only when both models hold true (see Assumption 3 and Theorem 3 therein). Our paper is also related to the work of Athey et al. (2018), who consider treatment effect estimation under the assumption of a linear conditional expectation for the outcome equation. As we do, they also estimate balancing weights to correct for the bias arising in this high-dimensional setting, but because of their linearity assumption, they do not need to estimate a propensity score. Their method is then somewhat simpler than ours, but it does not enjoy the double-robustness property of ours.

Finally, a recent work by Bradic et al. (2019), parallel to ours, suggests a Lasso-type procedure assuming a logistic propensity score and linear specifications for both treated and untreated items. Their method is similar but still slightly different from ours. Also, the main focus in Bradic et al. (2019) is to prove asymptotic normality under possibly weaker conditions on the sparsity levels of the parameters. Namely, they allow for sparsity up to o(√n/log(p)) or, in some cases, up to o(n/log(p)), where n is the sample size, when the eigenvalues of the population Gram matrix do not depend on n. Similar results can be developed for our method at the expense of much additional technical effort. We have opted not to pursue this direction since it only helps to include relatively non-sparse models that are not of interest for the applications we have in mind.

The paper is organized as follows. Section 2 introduces the set-up and the identification strategy behind our estimator. Section 3 presents the estimator both in the low- and high-dimensional case and studies its asymptotic properties. Section 4 examines the finite sample properties of our estimator through a Monte Carlo experiment. Section 5 revisits LaLonde (1986)'s dataset to compare our procedure with other high-dimensional econometric tools, and the effect of the large-scale tobacco control program of Abadie et al. (2010) for a comparison with synthetic control. Section 6 concludes. All proofs are in the Appendix. [Our results have been presented as early as 2016 at the North American and European Summer Meetings of the Econometric Society, and again in 2017 during the IAAE Meeting in Sapporo; see Bléhaut et al. (2017).]

Covariate Balancing Weights and Double Robustness
We are interested in the effect of a binary treatment, coded by D = 1 for the treated and D = 0 for the non-treated. We let Y(0) and Y(1) denote the potential outcomes under no treatment and under the treatment, respectively. The observed outcome is then Y = DY(1) + (1 − D)Y(0). We also observe a random vector X ∈ Rᵖ of pre-treatment characteristics. The quantity of interest is the average treatment effect on the treated (ATT), defined as:
θ₀ = E[Y(1) − Y(0) | D = 1].
Here and in what follows, we assume that the random variables are such that all the considered expectations are finite. Since no individual is observed in both treatment states, identification of the counterfactual E[Y(0) | D = 1] is achieved through the following two ubiquitous conditions.

Assumption 2.1 (Nested Support) P[D = 1 | X] < 1 almost surely and 0 < P[D = 1] < 1.

Assumption 2.2 (Mean Independence) E[Y(0) | X, D = 1] = E[Y(0) | X, D = 0].

Assumption 2.1, a version of the usual common support condition, requires that there exist control units for any possible value of the covariates in the population. Since the ATT is the parameter of interest, we never reconstruct a counterfactual for control units, so the condition P[D = 1 | X] > 0 is not needed. Assumption 2.2 is implied by, and thus weaker than, the usual conditional independence condition (Y(0), Y(1)) ⊥⊥ D | X.

In policy evaluation settings, the counterfactual is usually identified and estimated as a weighted average of non-treated unit outcomes:
θ₀ = E[Y(1) | D = 1] − E[W Y(0) | D = 0],   (2.1)
where W is a random variable called the weight. Popular choices for the weight are the following:
1. Linear regression: W = E[D Xᵀ] E[(1 − D) X Xᵀ]⁻¹ X, also referred to as the Oaxaca-Blinder weight (Kline, 2011);
2. Propensity score: W = P[D = 1 | X] / (1 − P[D = 1 | X]);
3. Matching: W = P(D = 1) f_{X|D=1}(X) / [P(D = 0) f_{X|D=0}(X)], assuming here that the conditional densities f_{X|D=1} and f_{X|D=0} exist (of course, this weight coincides with P[D = 1 | X]/(1 − P[D = 1 | X]), but the methods of estimation of W differ in the two cases);
4. Synthetic controls: see Abadie et al. (2010).

In this paper, we propose another choice of weight W, which can be seen as a parametric alternative to the synthetic control. An advantage is that it is well-defined whether or not the number of untreated observations n is greater than the dimension p of X, whereas the synthetic control estimator is not uniquely defined when n > p. Formally, we look for weights W that:
(i) satisfy a balancing condition as in the synthetic control method;
(ii) are positive;
(iii) depend only on the covariates;
(iv) can be used whether n > p or n ≤ p (high-dimensional regime).
Satisfying a balancing condition means that
E[D X] = E[W (1 − D) X].   (2.2)
Up to a proportionality constant, this is equivalent to E[X | D = 1] = E[W X | D = 0]. In words, W balances the first moment of the observed covariates between the treated and the control group. The definition of the observable covariates X is left to the econometrician and can include transformations of the original covariates so as to match more features of their distribution. The idea behind such weights relies on the principle of "covariate balancing" as in, e.g., Imai and Ratkovic (2014). The following lemma shows that, under Assumption 2.1, weights satisfying the balancing condition always exist.
Lemma 2.1 If Assumption 2.1 holds, the propensity score weight W = P[D = 1 | X] / (1 − P[D = 1 | X]) satisfies the balancing condition (2.2).

The proof of this lemma is straightforward: plug the expression of W into Equation (2.2) and use the law of iterated expectations.
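In detail (this display is our addition and simply spells out that argument), since W is a function of X and E[1 − D | X] = 1 − P[D = 1 | X],

E[W(1 − D)X] = E[ (P[D = 1 | X] / (1 − P[D = 1 | X])) · E[1 − D | X] · X ] = E[ P[D = 1 | X] X ] = E[D X],

which is exactly the balancing condition (2.2).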
Note that W is not the unique solution of (2.2). The linear regression weight W = E[D Xᵀ] E[(1 − D) X Xᵀ]⁻¹ X also satisfies the balancing condition, but it can be negative and its use is problematic in the high-dimensional regime.

Lemma 2.1 would suggest solving a binary choice model to obtain estimators of P[D = 1 | X] and of the weight W as a first step, and then plugging W into (2.1) to estimate θ₀. However, an inconsistent estimator of the propensity score leads to an inconsistent estimator of θ₀ and does not guarantee that the corresponding weight achieves covariate balancing. Finally, estimation of the propensity score can be problematic when there are very few treated units. For these reasons, we consider another approach where estimation is based directly on the balancing equations:
E[(D − (1 − D) W) X] = 0.   (2.3)
An important advantage of this approach over the usual one based on propensity score estimation through maximum likelihood is its double robustness (for a definition, see, e.g., Bang and Robins, 2005). Indeed, let W denote the weights identified by (2.3) under a misspecified model on the propensity score. It turns out that if the balancing equations (2.3) hold for W, the estimated treatment effect will still be consistent provided that E[Y(0) | X] is linear in X. The formal result is given in Theorem 2.1 below.

Theorem 2.1 (Double Robustness) Let Assumptions 2.1-2.2 hold and let w: Rᵖ → (0, +∞) be a measurable function such that E[w(X)|Y|] < ∞ and E[w(X)‖X‖] < ∞, where ‖·‖ denotes the Euclidean norm. Assume the balancing condition
E[(D − (1 − D) w(X)) X] = 0.   (2.4)
Then, for any µ ∈ Rᵖ, the ATT θ₀ can be expressed as
θ₀ = (1/P(D = 1)) E[(D − (1 − D) w(X)) (Y − Xᵀµ)]   (2.5)
in each of the following two cases:
1. E[Y(0) | X] = Xᵀµ₀ for some µ₀ ∈ Rᵖ;
2. P[D = 1 | X] = w(X)/(1 + w(X)).

In (2.5), the effect of X is taken out from Y in a linear fashion, while the effect of X on D is taken out by re-weighting the control group to obtain the same mean for X. Theorem 2.1 shows that an estimator based on (2.5) enjoys the double robustness property. Theorem 2.1 is similar to the result of Kline (2011) for the Oaxaca-Blinder estimator, which is obtained under the assumption that, in the well-specified case, the propensity score follows specifically a log-logistic model. Theorem 2.1 is more general: it can be applied under parametric modeling of W as well as in nonparametric settings.

In this paper, we consider a parametric model for w(X). Namely, we assume that P[D = 1 | X] = G(Xᵀβ₀) for some unknown β₀ ∈ Rᵖ and some known strictly increasing cumulative distribution function G. Then w(X) = h(Xᵀβ₀) with h = G/(1 − G), and β₀ is identified by the balancing condition
E[(D − (1 − D) h(Xᵀβ₀)) X] = 0.   (2.6)
Clearly, h is a positive strictly increasing function, which implies that its primitive H is strictly convex. A classical example is to take G as the c.d.f. of the logistic distribution, in which case h(u) = H(u) = exp(u) for u ∈ R. The strict convexity of H implies that β₀ is the unique solution of a strictly convex program:
β₀ = arg min_{β ∈ Rᵖ} E[(1 − D) H(Xᵀβ) − D Xᵀβ].   (2.7)
This program is well-defined whether or not P[D = 1 | X] = G(Xᵀβ₀). Note also that definitions (2.6) and (2.7) are equivalent provided that E[h(Xᵀβ)‖X‖] < ∞ for β in a vicinity of β₀. Indeed, it follows from the dominated convergence theorem that, under this assumption and because any convex function is locally Lipschitz, differentiation under the expectation sign is legitimate in (2.7).
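To make this last point explicit (the display below is our addition), differentiating the criterion of (2.7) under the expectation sign gives

∇_β E[(1 − D) H(Xᵀβ) − D Xᵀβ] = E[(1 − D) h(Xᵀβ) X − D X] = −E[(D − (1 − D) h(Xᵀβ)) X],

so the first-order condition of the convex program (2.7) is precisely the balancing condition (2.6) evaluated at β = β₀.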
We are now ready to state the main identification theorem justifying the ATT estimation methods developed below. It is a straightforward corollary of Theorem 2.1.

Theorem 2.2 (Parametric Double Robustness) Let Assumptions 2.1-2.2 hold. Assume that β₀ ∈ Rᵖ and a positive strictly increasing function h are such that E[h(Xᵀβ₀)‖X‖] < ∞, E[h(Xᵀβ₀)|Y|] < ∞ and condition (2.6) holds. Then, for any µ ∈ Rᵖ, the ATT θ₀ satisfies
θ₀ = (1/P(D = 1)) E[(D − (1 − D) h(Xᵀβ₀)) (Y − Xᵀµ)]   (2.8)
in each of the two following cases.
1. There exists µ₀ ∈ Rᵖ such that E[Y(0) | X] = Xᵀµ₀.
2. P[D = 1 | X] = G(Xᵀβ₀) with G = h/(1 + h).

At this stage, the parameter µ in Equation (2.8) does not play any role and can, for example, be zero. However, we will see below that in the high-dimensional regime, choosing µ carefully is crucial to obtain an "immunized" estimator of θ₀ that enjoys the desirable asymptotic properties.

We now assume to have a sample (Dᵢ, Xᵢ, Yᵢ), i = 1, ..., n, of i.i.d. random variables with the same distribution as (D, X, Y). Consider first an asymptotic regime where the dimension p of the covariates is fixed, while the sample size n tends to infinity. We call it the low-dimensional regime. Define an estimator of β₀ via the empirical counterpart of (2.7):
β̂_ld ∈ arg min_{β ∈ Rᵖ} (1/n) Σᵢ₌₁ⁿ [(1 − Dᵢ) H(Xᵢᵀβ) − Dᵢ Xᵢᵀβ].   (3.1)
Next, we plug β̂_ld into the empirical counterpart of (2.8) to obtain the following estimator of θ₀:
θ̃_ld := (1/Σᵢ Dᵢ) Σᵢ₌₁ⁿ [Dᵢ − (1 − Dᵢ) h(Xᵢᵀβ̂_ld)] Yᵢ.
If X includes the intercept, θ̃_ld satisfies the desirable property of location invariance, namely it does not change if we replace all Yᵢ by Yᵢ + c, for any c ∈ R.

Set Z := (D, X, Y), Zᵢ := (Dᵢ, Xᵢ, Yᵢ) and introduce the function
g(Z, θ, (β, µ)) := [D − (1 − D) h(Xᵀβ)][Y − Xᵀµ] − Dθ.
Then the estimator θ̃_ld satisfies
(1/n) Σᵢ₌₁ⁿ g(Zᵢ, θ̃_ld, (β̂_ld, 0)) = 0.   (3.2)
This estimator is a two-step GMM estimator. It is consistent and asymptotically normal under mild regularity conditions, with asymptotic variance E[g(Z, θ₀, (β₀, µ₀))²]/E(D)², where
µ₀ = E[h′(Xᵀβ₀) X Xᵀ | D = 0]⁻¹ E[h′(Xᵀβ₀) X Y | D = 0].   (3.3)
This can be shown by standard techniques (see, e.g., Section 6 in Newey and McFadden, 1994). Notice that, since h′(Xᵀβ₀) > 0, µ₀ is the coefficient vector of the weighted population regression of Y on X for the control group. This observation is useful for the derivation of the "immunized" estimator in the high-dimensional case, to which we now turn.
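For illustration, the following Python snippet (our own sketch, not code from the paper) computes β̂_ld and θ̃_ld with the logistic choice H(u) = h(u) = exp(u); the function name att_low_dim and the array layout are our conventions:

```python
import numpy as np
from scipy.optimize import minimize

def att_low_dim(Y, D, X):
    """Two-step estimator of the low-dimensional regime, with H(u) = h(u) = exp(u).

    Y, D, X are numpy arrays; D is a 0/1 vector and X should contain an intercept column.
    """
    n, p = X.shape

    # Step 1: balancing estimate of beta, the sample analogue of (2.7), i.e. program (3.1).
    def objective(beta):
        index = X @ beta
        return np.mean((1 - D) * np.exp(index) - D * index)

    def gradient(beta):
        index = X @ beta
        return X.T @ ((1 - D) * np.exp(index) - D) / n

    beta_hat = minimize(objective, np.zeros(p), jac=gradient, method="BFGS").x

    # Step 2: plug the weights h(X'beta_hat) into the sample analogue of (2.8) with mu = 0.
    w = np.exp(X @ beta_hat)
    theta_hat = np.sum((D - (1 - D) * w) * Y) / D.sum()
    return theta_hat, beta_hat
```

Standard errors can then be obtained from the asymptotic variance formula above, with µ₀ replaced by the sample analogue of the weighted least-squares coefficient in (3.3) computed on the control group.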
We now consider that p may grow with n, with possibly p ≫ n. This can be of interest in several situations. First, in macroeconomic problems, n is actually small and p may easily be of comparable size. For example, in the tobacco control program application of Abadie et al. (2010), the control group size is limited because the observational unit is the state, but many pre-treatment outcomes are included among the covariates. Section 5.2 revisits this example. Second, the researcher may want to consider a flexible form for the weights by including transformations of the covariates. For instance, one may want to interact categorical variables with other covariates or consider, e.g., different powers of continuous variables so as to allow for flexible non-linear effects. See Section 5.1 for an application considering such transformations. Third, one may want to balance not only the first moments of the distribution of the covariates but also the second moments, the covariances, the third moments and so on, to make the distributions more similar between the treated and the control group. In this case, high-dimensional settings seem to be of interest as well.

In the high-dimensional regime, the GMM estimator in (3.1) is, in general, not consistent. We therefore propose an alternative Lasso-type method by adding to (3.1) an ℓ₁ penalization term:
β̂ ∈ arg min_{β ∈ Rᵖ} ( (1/n) Σᵢ₌₁ⁿ [(1 − Dᵢ) H(Xᵢᵀβ) − Dᵢ Xᵢᵀβ] + λ Σⱼ₌₁ᵖ ψⱼ |βⱼ| ).   (3.4)
Here, λ > 0 and {ψⱼ}ⱼ₌₁,...,ₚ are covariate-specific penalty loadings set so as to grant good asymptotic properties. The penalty loadings can be adjusted using the algorithm presented in Appendix A.

This type of penalization offers several advantages. First, the program (3.4) has almost surely a unique solution when the entries of X have a continuous distribution, cf. Lemma 5 in Tibshirani (2013), which cannot be granted for its non-penalized version (3.1). Second, it yields a sparse solution, in the sense that some entries of the vector of estimated coefficients are set exactly to zero if the penalty is large enough, which is not the case for estimators based on ℓ₂ penalization. The ℓ₀-penalized estimator shares the same sparsity property but is very costly to compute, whereas (3.4) can be easily solved by computationally efficient methods, see, e.g., Hastie et al. (2009). The use of covariate-specific penalty loadings goes back to Bickel et al. (2009); the particular choice of penalty loadings that we consider below is inspired by Belloni et al. (2012). A drawback of penalizing by the ℓ₁-norm is that it induces a bias in the estimation of the coefficients. But this is not an issue here, since we are ultimately interested in estimating θ₀ rather than β₀: the solution β̂ of (3.4) only plays the role of a pilot estimator.
The estimator β̂ is consistent as n tends to infinity under assumptions analogous to those used in Bunea et al. (2007) and Bickel et al. (2009) for the Lasso with quadratic loss; see Theorem 3.2 below. As in the low-dimensional case (cf. Section 3.1), one is then tempted to consider the plug-in estimator of the ATT based on Equation (2.8) with µ = 0:
θ̃ = (1/Σᵢ Dᵢ) Σᵢ₌₁ⁿ [Dᵢ − (1 − Dᵢ) h(Xᵢᵀβ̂)] Yᵢ.   (3.5)
We refer to this estimator as the naive plug-in estimator. However, as mentioned above, the Lasso estimator β̂ of the nuisance parameter β₀ is not asymptotically unbiased. In the high-dimensional regime where p grows with n, naive plug-in estimators suffer from a regularization bias and may not be asymptotically normal with zero mean, as illustrated for example in Belloni et al. (2014) and Chernozhukov et al. (2015b, 2018). Therefore, following the general approach of Chernozhukov et al. (2015b, 2018), we develop an immunized estimator that, at the first order, is insensitive to β̂. We show that this estimator is asymptotically normal with mean zero and an asymptotic variance that does not depend on the properties of the pilot estimator β̂. The idea is to choose the parameter µ in (2.8) such that the expected gradient of the estimating function g(Z, θ, (β, µ)) with respect to β is zero when taken at (θ₀, β₀). This holds for µ = µ₀, where µ₀ satisfies
E[(1 − D) h′(Xᵀβ₀)(Y − Xᵀµ₀) X] = 0.   (3.6)
Notice that if the corresponding matrix is invertible we recover the low-dimensional solution (3.3). (The assumptions under which we prove the results below guarantee that µ₀ defined here is unique. The extension to the case of multiple solutions can be worked out as well; it is technically more involved but, in our opinion, does not add much to the understanding of the problem.) Clearly, µ₀ depends on unknown quantities and we need to estimate it. To this end, observe that Equation (3.6) corresponds to the first-order condition of a weighted least-squares problem, namely
µ₀ = arg min_{µ ∈ Rᵖ} E[(1 − D) h′(Xᵀβ₀)(Y − Xᵀµ)²].   (3.7)
Since X is high-dimensional, we cannot estimate µ₀ via the empirical counterpart of (3.7). Instead, we consider a Lasso-type estimator
µ̂ ∈ arg min_{µ ∈ Rᵖ} ( (1/n) Σᵢ₌₁ⁿ (1 − Dᵢ) h′(Xᵢᵀβ̂)(Yᵢ − Xᵢᵀµ)² + λ′ Σⱼ₌₁ᵖ ψ′ⱼ |µⱼ| ).   (3.8)
Here, similarly to (3.4), λ′ > 0 and {ψ′ⱼ}ⱼ₌₁,...,ₚ are covariate-specific penalty loadings. Importantly, by estimating µ₀ we do not introduce, at least asymptotically, an additional source of variability since, by construction, the gradient of the moment condition (2.8) (if we consider it as a function of β rather than of β₀) with respect to (β, µ) vanishes at the point (β₀, µ₀).

Finally, the immunized ATT estimator is defined as
θ̂ := (1/Σᵢ Dᵢ) Σᵢ₌₁ⁿ (Dᵢ − (1 − Dᵢ) h(Xᵢᵀβ̂)) (Yᵢ − Xᵢᵀµ̂).
Intuitively, the immunized procedure corrects the naive plug-in estimator in the case where the balancing program has "missed" a covariate that is very important to predict the outcome:
θ̂ = θ̃ − [ (1/n₁) Σ_{i: Dᵢ=1} Xᵢ − (1/n₁) Σ_{i: Dᵢ=0} h(Xᵢᵀβ̂) Xᵢ ]ᵀ µ̂,
where n₁ is the number of treated observations. This has a flavor of the Frisch-Waugh-Lovell partialling-out procedure for model selection, as observed by Belloni et al. (2014) and further developed in Chernozhukov et al. (2015b).

To summarize, the estimation procedure in the high-dimensional regime consists of the three following steps; each step is computationally simple, as it requires at most the minimization of a convex function (a schematic implementation is given after the list).
1. (Balancing step.) For a given penalty level λ and positive covariate-specific penalty loadings {ψⱼ}ⱼ₌₁ᵖ, compute β̂ defined by
β̂ ∈ arg min_{β ∈ Rᵖ} ( (1/n) Σᵢ₌₁ⁿ [(1 − Dᵢ) H(Xᵢᵀβ) − Dᵢ Xᵢᵀβ] + λ Σⱼ₌₁ᵖ ψⱼ |βⱼ| ).   (3.9)
2. (Immunization step.) For a given penalty level λ′ and covariate-specific penalty loadings {ψ′ⱼ}ⱼ₌₁ᵖ, and using the β̂ obtained in the previous step, compute µ̂ defined by
µ̂ ∈ arg min_{µ ∈ Rᵖ} ( (1/n) Σᵢ₌₁ⁿ (1 − Dᵢ) h′(Xᵢᵀβ̂)(Yᵢ − Xᵢᵀµ)² + λ′ Σⱼ₌₁ᵖ ψ′ⱼ |µⱼ| ).   (3.10)
3. (ATT estimation.) Estimate the ATT with the immunized estimator
θ̂ = (1/Σᵢ Dᵢ) Σᵢ₌₁ⁿ [Dᵢ − (1 − Dᵢ) h(Xᵢᵀβ̂)] (Yᵢ − Xᵢᵀµ̂).   (3.11)
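The following Python sketch (our illustration, not code from the paper) runs these three steps with the logistic choice H(u) = h(u) = h′(u) = exp(u), delegating both ℓ₁-penalized convex programs to the CVXPY modeling library; lam, psi, lam_p and psi_p stand for λ, {ψⱼ}, λ′ and {ψ′ⱼ}:

```python
import numpy as np
import cvxpy as cp

def immunized_att(Y, D, X, lam, psi, lam_p, psi_p):
    """Balancing step (3.9), immunization step (3.10) and immunized ATT (3.11)."""
    n, p = X.shape

    # Step 1 (balancing): l1-penalized convex program with H(u) = exp(u).
    b = cp.Variable(p)
    bal_loss = cp.sum(cp.multiply(1 - D, cp.exp(X @ b)) - cp.multiply(D, X @ b)) / n
    cp.Problem(cp.Minimize(bal_loss + lam * cp.norm1(cp.multiply(psi, b)))).solve()

    # Step 2 (immunization): weighted Lasso with weights (1 - D) * h'(X'beta_hat).
    w = (1 - D) * np.exp(X @ b.value)
    m = cp.Variable(p)
    imm_loss = cp.sum(cp.multiply(w, cp.square(Y - X @ m))) / n
    cp.Problem(cp.Minimize(imm_loss + lam_p * cp.norm1(cp.multiply(psi_p, m)))).solve()

    # Step 3 (ATT): plug the two pilot estimates into the immunized moment.
    resid = Y - X @ m.value
    theta_hat = np.sum((D - (1 - D) * np.exp(X @ b.value)) * resid) / D.sum()
    return theta_hat
```

The penalty levels and loadings can be set as in Assumption 3.6 below, or refined iteratively as in Appendix A.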
The current framework poses several challenges to achieving asymptotically valid inference. First, X can be high-dimensional, since we allow for p ≫ n provided that sparsity conditions are met (see Assumption 3.1 below). Second, the ATT estimation is affected by the estimation of the nuisance parameters (β₀, µ₀), and we wish to neutralize their influence. Finally, the ℓ₁-penalized estimators we use for β₀ and µ₀ are not conventional. The estimator of β₀ relies on a non-standard loss function and, to our knowledge, the properties of β̂ that we need are not available in the literature, cf., e.g., van de Geer (2016) and the references therein. The estimator of µ₀ is close to the usual Lasso except for the weights, which depend on β̂. In general, a discrepancy in the weights can induce an extra bias, so it is not granted that such an estimator achieves properties close to those of the Lasso. We show below that this holds true under our assumptions. Our proof techniques may be of interest for other problems of a similar type.

Let η = (β, µ) denote the vector of the two nuisance parameters and recall that Z = (D, X, Y). In what follows, we write for brevity g(Z, θ, η) instead of g(Z, θ, (β, µ)). In particular, for the value η₀ := (β₀, µ₀) we have
E[g(Z, θ₀, η₀)] = 0.   (3.12)
Hereafter, the notation a ≲ b means that a ≤ cb for some constant c > 0 independent of n. We denote by Φ and Φ⁻¹ the cumulative distribution function and the quantile function of a standard normal random variable, respectively. We use the symbol Eₙ(·) to denote the empirical average, that is, Eₙ(a) = n⁻¹ Σᵢ₌₁ⁿ aᵢ for a = (a₁, ..., aₙ). Finally, for a vector δ = (δ₁, ..., δₚ) ∈ Rᵖ and a subset S ⊆ {1, ..., p}, we consider the restricted vector δ_S = (δⱼ I(j ∈ S))ⱼ₌₁ᵖ, where I(·) denotes the indicator function, and we set ‖δ‖₀ := Card{1 ≤ j ≤ p : δⱼ ≠ 0}, ‖δ‖₁ := Σⱼ₌₁ᵖ |δⱼ|, ‖δ‖₂ := (Σⱼ₌₁ᵖ δⱼ²)^{1/2} and ‖δ‖∞ := maxⱼ₌₁,...,ₚ |δⱼ|.

We now state the assumptions used to prove the asymptotic results.

Assumption 3.1 (Sparsity restrictions) The nuisance parameters are sparse in the following sense: ‖β₀‖₀ ≤ s_β and ‖µ₀‖₀ ≤ s_µ for some integers s_β, s_µ ∈ [1, p].
Assumption 3.2 (Conditions on the function h) The function h is increasing, twice continuously differentiable on R, and
(i) the second derivative h′′ is Lipschitz on any compact subset of R,
(ii) either inf_{u ∈ R} h′(u) ≥ c₁ or ‖β₀‖₁ ≤ c₂, where c₁ > 0 and c₂ > 0 are constants independent of n.

We also need some conditions on the distribution of the data. The random vectors Zᵢ = (Dᵢ, Xᵢ, Yᵢ) are assumed to be i.i.d. copies of Z = (D, X, Y) with D ∈ {0, 1}, X ∈ Rᵖ and Y ∈ R. Throughout the paper, we allow the distribution of Z to depend on n, so that in fact we deal with a triangular array of random vectors. This dependence on n is needed to state the asymptotic results rigorously, since we consider a setting where the dimension p = p(n) is a function of n. Thus, in what follows, Z is indexed by n but for brevity we typically suppress this dependence in the notation. On the other hand, all constants denoted by c (with various indices) and K appearing below are independent of n.

Assumption 3.3 (Conditions on the distribution of the data)
The random vectors Zᵢ are i.i.d. copies of Z = (D, X, Y), with D ∈ {0, 1}, X ∈ Rᵖ and Y ∈ R, satisfying (2.6), (3.6), (3.12) and the following conditions:
(i) There exist constants K > 0 and c > 0 such that max{‖X‖∞, |Xᵀβ₀|, |Y − Xᵀµ₀|} ≤ K (a.s.) and c < P(D = 1) < 1 − c.
(ii) Non-degeneracy conditions. There exists a constant c > 0 such that, for all j = 1, ..., p,
min{ E[(Y − Xᵀµ₀ − θ₀)² | D = 1], E[h(Xᵀβ₀)²(Y − Xᵀµ₀)² | D = 0], E[(Xᵀeⱼ)² | D = 1], E[h(Xᵀβ₀)²(Xᵀeⱼ)² | D = 0], E[h′(Xᵀβ₀)²(Y − Xᵀµ₀)²(Xᵀeⱼ)² | D = 0] } ≥ c,
where eⱼ denotes the jth canonical basis vector in Rᵖ.

Assumption 3.4 (Condition on the population Gram matrix of the control group) The population Gram matrix of the control group
Σ₀ := E[(1 − D) X Xᵀ] is such that
min_{v ∈ Rᵖ, v ≠ 0} (vᵀ Σ₀ v)/‖v‖₂² ≥ κ²_{Σ₀},   (3.13)
where the minimal eigenvalue κ²_{Σ₀} is a positive number.

Note that, in view of Assumption 3.3, κ²_{Σ₀} is uniformly bounded: κ²_{Σ₀} ≤ e₁ᵀ Σ₀ e₁ ≤ K².

Assumption 3.5 (Dimension restrictions)
The integers p and s = max(s_β, s_µ) ∈ [1, p/2] and the value κ_{Σ₀} > 0 are functions of n satisfying the following growth conditions:
(i) s² log(p)/(κ⁴_{Σ₀} √n) → 0 as n → ∞,
(ii) s/κ²_{Σ₀} = o(p) as n → ∞,
(iii) log(p) = o(n^{1/3}) as n → ∞.

Finally, we define the penalty loadings used for the estimation of the nuisance parameters. The gradients of the estimating function with respect to the nuisance parameters are
∇_µ g(Z, θ, η) = −[D − (1 − D) h(Xᵀβ)] X,
∇_β g(Z, θ, η) = −(1 − D) h′(Xᵀβ)[Y − Xᵀµ] X.
For each i = 1, ..., n, we define the random vector Uᵢ ∈ R^{2p} with entries corresponding to these gradients:
Uᵢ,ⱼ := −[Dᵢ − (1 − Dᵢ) h(Xᵢᵀβ₀)] Xᵢ,ⱼ if 1 ≤ j ≤ p,
Uᵢ,ⱼ := −(1 − Dᵢ) h′(Xᵢᵀβ₀)[Yᵢ − Xᵢᵀµ₀] Xᵢ,ⱼ₋ₚ if p + 1 ≤ j ≤ 2p,
where Xᵢ,ⱼ is the jth entry of Xᵢ.

Assumption 3.6 (Penalty Loadings)
Let c > 1 and γ ∈ (0, 1) be such that log(1/γ) ≲ log(p) and γ = o(1) as n → ∞. The penalty level and loadings for the estimation of β₀ satisfy
λ := c Φ⁻¹(1 − γ/(2p))/√n,
ψⱼ,max ≥ ψⱼ ≥ ( n⁻¹ Σᵢ₌₁ⁿ Uᵢ,ⱼ² )^{1/2} for j = 1, ..., p.
The penalty level and loadings for the estimation of µ₀ satisfy
λ′ := 2c Φ⁻¹(1 − γ/(2p))/√n,
ψ′ⱼ,max ≥ ψ′ⱼ ≥ ( n⁻¹ Σᵢ₌₁ⁿ Uᵢ,ⱼ₊ₚ² )^{1/2} for j = 1, ..., p.
Here, the upper bounds on the loadings are
ψⱼ,max := ( n⁻¹ Σᵢ₌₁ⁿ max_{|u| ≤ K} [(1 − Dᵢ) h(u) − Dᵢ]² Xᵢ,ⱼ² )^{1/2},
ψ′ⱼ,max := ( n⁻¹ Σᵢ₌₁ⁿ (1 − Dᵢ) max_{|u| ≤ K} ( h′(u)[Yᵢ − u] )² Xᵢ,ⱼ² )^{1/2},
which are feasible majorants of (n⁻¹ Σᵢ Uᵢ,ⱼ²)^{1/2} under our assumptions.

The values (n⁻¹ Σᵢ Uᵢ,ⱼ²)^{1/2} depend on the unknown parameters. Thus, we cannot choose ψⱼ and ψ′ⱼ equal to these values, but we can take them equal to the upper bounds, ψⱼ = ψⱼ,max and ψ′ⱼ = ψ′ⱼ,max. A more flexible, iterative approach to choosing feasible loadings is discussed in Appendix A.

The following theorem constitutes the main asymptotic result of the paper.

Theorem 3.1 (Asymptotic Normality of the Immunized Estimator)
Let Assumptions 3.1-3.6 hold. Then the immunized estimator θ̂ defined in Equation (3.11) satisfies
σ̂⁻¹ √n (θ̂ − θ₀) →d N(0, 1) as n → ∞,
where
σ̂² := ( (1/n) Σᵢ₌₁ⁿ g(Zᵢ, θ̂, η̂)² ) ( (1/n) Σᵢ₌₁ⁿ Dᵢ )⁻²
is a consistent estimator of the asymptotic variance, and →d denotes convergence in distribution.

The proof is given in Appendix B.2. An important point underlying the root-n convergence and asymptotic normality of θ̂ is the fact that the expected gradient of g with respect to η is zero at η₀. In granting this property, we follow the general methodology of estimation in the presence of high-dimensional nuisance parameters developed in Belloni et al. (2012, 2014, 2017) and Chernozhukov et al. (2015b), among other papers by the same authors. The second important ingredient of the proof is to ensure that the estimator η̂ converges fast enough to the nuisance parameter η₀. Its rate of convergence is given in the following theorem.

Theorem 3.2 (Nuisance Parameter Estimation) Under Assumptions 3.1-3.6, we have, with probability tending to 1 as n → ∞,
‖β̂ − β₀‖₁ ≲ (s_β/κ²_{Σ₀}) √(log(p)/n),   (3.14)
‖µ̂ − µ₀‖₁ ≲ (s/κ²_{Σ₀}) √(log(p)/n).   (3.15)

The proof is given in Appendix B.3. Note that the rate in (3.15) depends not only on the sparsity index s_µ of µ₀ but on the maximum s = max(s_β, s_µ). This is natural, since one should account for the accuracy of the preliminary estimator β̂ used to obtain µ̂. Inspection of the proof shows that Theorem 3.2 remains valid under weaker assumptions; namely, we can replace Assumption 3.5 on s, p, κ_{Σ₀} by condition (C.2). We also note that Assumption 3.5(i) in Theorem 3.1 can be modified to (s/√n) log(p) → 0 if κ_{Σ₀} is a constant independent of n, as is assumed, for example, in Bradic et al. (2019). Such a modification would require a substantially more involved proof but only improves upon considering relatively non-sparse cases. This does not seem of much added value for using the methods in practice, where the sparsity index is typically small. Moreover, in the high-dimensional scenario we find it more important to specify the dependence of the growth conditions on κ_{Σ₀}.
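For concreteness, here is a minimal sketch (our own addition) of the resulting confidence interval, where g_hat collects the values g(Zᵢ, θ̂, η̂) = [Dᵢ − (1 − Dᵢ) h(Xᵢᵀβ̂)](Yᵢ − Xᵢᵀµ̂) − Dᵢθ̂:

```python
import numpy as np
from scipy.stats import norm

def att_ci(theta_hat, g_hat, D, level=0.95):
    """Confidence interval from Theorem 3.1: sigma_hat^2 = mean(g_hat^2) / mean(D)^2."""
    n = len(D)
    sigma_hat = np.sqrt(np.mean(g_hat ** 2)) / np.mean(D)
    half = norm.ppf(0.5 + level / 2) * sigma_hat / np.sqrt(n)
    return theta_hat - half, theta_hat + half
```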
Monte Carlo Simulations

The aim of this experiment is two-fold: illustrate the better properties of the immunized estimator over the naive plug-in, and compare it with other estimators. In particular, we compare it with a similar estimator proposed by Farrell (2015). We consider the following DGP. The p covariates are distributed as X ∼ N(0, Σ), where the (i, j) element of the variance-covariance matrix satisfies Σᵢ,ⱼ = 0.5^{|i−j|}. The treatment equation follows a logit model, Pr(D = 1 | X) = Λ(Xᵀγ) with Λ(u) = 1/(1 + exp(−u)). The potential outcomes satisfy
Y(d) = exp(Xᵀµ₀) + d ζ Xᵀγ + ε,
where ε is normally distributed with mean zero conditional on (D, X). We assume the following form for the jth entries of γ and µ₀:
γⱼ = ργ (−1)ʲ/j² if j ≤ 10, and γⱼ = 0 otherwise;
µ₀,ⱼ = ρµ (−1)ʲ/j² if j ≤ 10, µ₀,ⱼ = ρµ (−1)^{j+1}/(p − j + 1)² if j ≥ p − 9, and µ₀,ⱼ = 0 otherwise.
We are thus in a strictly sparse setting for both equations, where only ten covariates play a role in the treatment assignment and twenty in the outcome. Figure 1 depicts the precise pattern of the corresponding coefficients for p = 30. The constants ργ and ρµ fix the signal-to-noise ratio: ργ sets the R² of the latent index in the treatment equation and ρµ sets the R² of the equation for Y(0). Finally, the constant ζ is chosen so that the variance of the individual treatment effect ζ Xᵀγ equals one fifth of the variance of Y(0). In this set-up, the ATT satisfies E[Y(1) − Y(0) | D = 1] = ζ E[Z Λ(Z)]/E[Λ(Z)], with Z ∼ N(0, γᵀΣγ). We compute it using Monte Carlo simulations.
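To fix ideas, the following snippet (our own sketch) simulates one draw from this design; the constants rho_gamma, rho_mu, zeta and sigma_eps are set to illustrative values rather than to the calibration used in the paper:

```python
import numpy as np

def simulate_dgp(n, p, rho_gamma=1.0, rho_mu=1.0, zeta=1.0, sigma_eps=1.0, seed=None):
    """One sample from the Monte Carlo design (illustrative calibration, p >= 20)."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, p + 1)

    # Sparse coefficients: ten covariates drive the treatment, twenty drive the outcome.
    gamma = np.where(j <= 10, rho_gamma * (-1.0) ** j / j ** 2, 0.0)
    mu = np.where(j <= 10, rho_mu * (-1.0) ** j / j ** 2, 0.0)
    tail = j >= p - 9
    mu[tail] = rho_mu * (-1.0) ** (j[tail] + 1) / (p - j[tail] + 1) ** 2

    # Toeplitz covariance Sigma_{ij} = 0.5^{|i-j|} and Gaussian covariates.
    Sigma = 0.5 ** np.abs(np.subtract.outer(j, j))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    # Logit treatment assignment and potential outcomes.
    D = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ gamma)))
    Y = np.exp(X @ mu) + D * zeta * (X @ gamma) + rng.normal(0.0, sigma_eps, size=n)
    return Y, D, X
```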
[Figure 1 about here: the coefficient values of γ (crosses) and µ₀ (circles) are plotted against the index j for p = 30.]
Notes: In this example, ργ = ρµ = 1. The central region of the graph represents the coefficients of γ and µ₀ associated with variables that play a role neither in the treatment equation nor in the outcome equation. The left region shows the coefficients associated with variables that are important for both equations. In the right region, only the entries of µ₀ differ from zero, meaning that these variables determine the outcome equation but not the selection equation.
Figure 1: Sparsity patterns of γ (crosses) and µ₀ (circles).

We consider several estimators of the ATT. The first is the naive plug-in estimator defined in (3.5). Next, we consider our proposed estimator defined in (3.11), with H(x) = h(x) = exp(x). We also consider the estimator proposed by Farrell (2015). This estimator is also defined by (3.11), but β̂ and µ̂ therein are obtained by a Logit Lasso and an unweighted Lasso regression, respectively. For these three estimators, the penalty loadings of the first-step estimators are set as in Appendix A. The last estimator, called the oracle hereafter, is our low-dimensional estimator defined by (3.2), where in the first step we only include the ten covariates affecting the treatment. For all estimators, we construct 95% confidence intervals on the ATT using the normal approximation and asymptotic variance estimators. We estimate the asymptotic variance of the naive plug-in estimator as if we were in a low-dimensional setting; this means that this estimator would have an asymptotic coverage of 0.95 if p remained fixed.

In our DGP, the variables Xⱼ with j ≥ p − 9 affect the outcome but not the treatment. We consider several values of n and p that approximate a relatively high-dimensional setting. For every couple (n, p), we report in Table 1 the root mean squared error (RMSE), the bias and the coverage rate of the confidence intervals associated with each estimator.

Our estimator performs well in all settings, with a correct coverage rate and often the lowest RMSE over all estimators. The oracle always has a coverage rate close to 0.95 and a bias very close to 0, as one could expect, but it does not always exhibit the lowest RMSE. This is because, intuitively, the immunized estimator and that of Farrell (2015) trade off variance against some bias in their first steps, sometimes resulting in a slightly lower RMSE for the final estimator. Note that the bias of the final estimator, though asymptotically negligible, results in a slight undercoverage of the confidence intervals. Yet this undercoverage remains moderate even in the least favorable configuration with n = 500 and p = 1,000, and it does not completely vanish even when p = 50 and n = 2,000, for which p/n is quite small.
Finally, the estimator of Farrell (2015) exhibits performance similar to that of the immunized estimator, though it displays a slightly larger RMSE and a smaller coverage rate with this particular DGP.

Table 1: Monte Carlo simulations

                      n = 500                n = 1,000              n = 2,000
                 RMSE   Bias     CR     RMSE   Bias     CR     RMSE   Bias     CR
p = 50
 Naive plug-in  0.312   0.264  0.620   0.228   0.195  0.601   0.169   0.144  0.608
 Immunized      0.186   0.102  0.872   0.121   0.059  0.907   0.084   0.030  0.907
 Farrell        0.197   0.117  0.857   0.132   0.075  0.885   0.093   0.046  0.887
 Oracle         0.202  -0.017  0.929   0.143  -0.025  0.938   0.105  -0.030  0.935
p = 200
 Naive plug-in  0.318   0.274  0.587   0.238   0.208  0.557   0.179   0.156  0.544
 Immunized      0.185   0.108  0.883   0.125   0.066  0.897   0.085   0.037  0.911
 Farrell        0.195   0.121  0.870   0.135   0.081  0.873   0.095   0.052  0.882
 Oracle         0.194  -0.023  0.936   0.141  -0.029  0.946   0.108  -0.034  0.934
p = 500
 Naive plug-in  0.339   0.297  0.532   0.247   0.216  0.522   0.185   0.165  0.494
 Immunized      0.202   0.128  0.841   0.128   0.070  0.881   0.087   0.044  0.912
 Farrell        0.212   0.141  0.810   0.140   0.086  0.854   0.098   0.059  0.866
 Oracle         0.205  -0.019  0.927   0.146  -0.033  0.931   0.105  -0.032  0.938

Notes: RMSE and CR stand respectively for root mean squared error and coverage rate. The nominal coverage rate is 0.95. The results are based on 10,000 simulations for each (n, p). The naive plug-in and immunized estimators are defined in (3.5) and (3.11), respectively. "Farrell" is the estimator considered by Farrell (2015). "Oracle" is defined by (3.2), where in the first step we only include the ten covariates affecting the treatment.

Empirical Applications
We revisit LaLonde (1986), who examines the ability of econometric methods to recover the causal effect of employment programs. This dataset was first built to assess the impact of the National Supported Work (NSW) program. The NSW was a transitional, subsidized work experience program targeted towards people with longstanding employment problems: ex-offenders, former drug addicts, women who were long-term recipients of welfare benefits, and school dropouts. The quantity of interest is the average effect of the program for the participants on 1978 yearly earnings. The treated group gathers people who were randomly assigned to this program from the population at risk (with a sample size of n = 185). Two control groups are available. The first one comes from the Panel Study of Income Dynamics (PSID) and is not experimental (sample size n = 2,490). The second one consists of the individuals randomized out of the program (sample size n = 260) and is therefore directly comparable to the treated group; it provides us with a benchmark for the ATT. Hereafter, we use the group of participants and the PSID sample to compute our estimator and compare it with other competitors and the experimental benchmark. [For more discussion on the NSW program and the controversy regarding econometric estimates of causal effects based on nonexperimental data, see LaLonde (1986) and the subsequent contributions by Dehejia and Wahba (1999, 2002) and Smith and Todd (2005).]

To allow for a flexible specification, we follow Farrell (2015) by taking the raw covariates of the dataset (age, education, black, hispanic, married, dummy variable of no degree, income in 1974, income in 1975, dummy variables of no earnings in 1974 and in 1975), two-by-two interactions between the continuous and dummy variables, two-by-two interactions between the dummy variables, and powers up to degree 5 of the continuous variables. Continuous variables are linearly rescaled to [0, 1].
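A schematic version of this construction (our illustration; the column names are assumptions about how the raw variables are stored in a pandas DataFrame) is:

```python
import pandas as pd

def build_design(df: pd.DataFrame):
    """Raw covariates, two-by-two interactions and powers up to degree 5, rescaled to [0, 1]."""
    continuous = ["age", "education", "income74", "income75"]
    dummies = ["black", "hispanic", "married", "nodegree", "noearn74", "noearn75"]

    Z = df[continuous + dummies].astype(float).copy()
    for c in continuous:                                   # linear rescaling to [0, 1]
        Z[c] = (Z[c] - Z[c].min()) / (Z[c].max() - Z[c].min())

    out = Z.copy()
    for c in continuous:                                   # continuous x dummy interactions
        for d in dummies:
            out[f"{c}_x_{d}"] = Z[c] * Z[d]
    for i, d1 in enumerate(dummies):                       # dummy x dummy interactions
        for d2 in dummies[i + 1:]:
            out[f"{d1}_x_{d2}"] = Z[d1] * Z[d2]
    for c in continuous:                                   # powers 2 to 5 of continuous variables
        for k in range(2, 6):
            out[f"{c}_pow{k}"] = Z[c] ** k
    out.insert(0, "intercept", 1.0)
    return out.to_numpy()
```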
Table 2 reports the results for the experimental benchmark, the naive plug-in and the immunized estimator. The immunized estimator is close to the experimental benchmark, contrary to the naive plug-in. Notably, the experimental benchmark and the immunized estimator are the only ones, out of five estimators, that display a significant effect at the 5% level. [Farrell (2015)'s estimate differs from that displayed in Farrell's paper because, contrary to him, we have not automatically included education, the dummy of no degree and the 1974 income in the set of theory pre-selected covariates. When doing so, the results are slightly better but not qualitatively different for this estimator. We chose not to do so as it would bias the comparison with other estimators, which do not include a set of pre-selected variables.]

Table 2: Estimated effect of the NSW program on 1978 earnings

                            Experimental    Naive             Immunized
                            benchmark       plug-in           estimator
                            (1)             (2)               (3)
Point estimate              1,794.34        401.89            1,608.99
Standard error              (671.00)        (746.07)          (705.38)
95% confidence interval     [519; 3,046]    [-1,060; 1,864]   [226; 2,991]
Notes: For details on the estimators, see the text. Standard errors and confidence intervals are based on the asymptotic distribution.
Proposition 99 is one of the first and most ambitious large-scale tobacco control programs, implemented in 1989 in California. It includes a vast array of measures, including an increase in cigarette taxation of 25 cents per pack and a significant effort in prevention and education. In particular, the tax revenues generated by Proposition 99 were used to fund anti-smoking campaigns. Abadie et al. (2010) analyze the impact of the law on tobacco consumption in California. Since this program was only enforced in California, it is a nice example where the synthetic control method applies whereas more standard public policy evaluation tools cannot be used. It is possible to reproduce a synthetic California by reweighting other states so as to imitate California's behavior.

For this purpose, Abadie et al. (2010) consider the following covariates: retail price of cigarettes, state log income per capita, percentage of the population between 15 and 24, and per capita beer consumption (all 1980-1988 averages). Cigarette consumption for each of the years 1970 to 1975, as well as 1980 and 1988, is also included. Using the same variables, we conduct the same analysis with our estimator. Figure 2 displays the estimated effect of Proposition 99 using the immunized estimator.

[Figure 2 about here; y-axis: cigarette consumption (packs per capita).]
Notes: the shaded area represents the post-treatment period. The 95% confidence interval is based on the asymptotic approximation.
Figure 2: The effect of Proposition 99 on per capita tobacco consumption.

We find almost no effect of the policy over the pre-treatment period, giving credibility to the counterfactual employed. A steady decline takes place after 1988, and in the long run, tobacco consumption is estimated to have decreased by about 30 packs per capita per year in California as a consequence of the policy. The variance is larger towards the end of the period because the covariates are measured in the pre-treatment period and they become less relevant as predictors. Note also that, by construction, including the 1970 to 1975, 1980 and 1988 cigarette consumptions among the covariates yields an almost perfect fit at these dates because of the immunization procedure. The fit is not exact, however, because of the shrinkage induced by the ℓ₁ penalization.

Figure 3 displays a comparison between the immunized estimator and the synthetic control method. The dashed green line is the synthetic control counterfactual. It does not match exactly the plot of Abadie et al. (2010), in which the weights given to each predictor are optimized to best fit the outcome over the whole pre-treatment period. Instead, the green curve in Figure 3 optimizes the predictor weights using only the years 1970 through 1975, 1980 and 1988. This strategy provides a fairer comparison with our estimator, which does not use California's per capita tobacco consumption outside those dates to optimize the fit. In such a case, the years from 1976 to 1987, excluding 1980, can be used as a sort of placebo test.

[Figure 3 about here; y-axis: cigarette consumption (packs per capita); series shown: California, Other States, Immunized (Lasso) with .95 confidence bands, Synthetic Control.]
Notes: The solid black line is California tobacco consumption as in the data. The dotted purple line is a simple average of other U.S. states. The dashed red line is the immunized estimator as presented in the paper, along with the 95% confidence bands. The dashed green line is synthetic California.
Figure 3: Cigarette consumption in California, actual and counterfactual.

Both our estimator and the synthetic control are credible counterfactuals, as they are able to closely match California's pre-treatment tobacco consumption. They offer a sizable improvement over a simple average of the other U.S. states, which did not implement any comparable tobacco control program. Furthermore, even if our estimator gives a result relatively similar to the synthetic control, it displays a smoother pattern, especially towards the end of the 1980s. The estimated treatment effect appears to be larger with the immunized estimate than with the synthetic control. However, it is hard to conclude that this difference is significant because, absent any asymptotic theory for the synthetic control estimator, it is unclear how one could test the difference between the two. In fact, the availability of standard asymptotic approximations for confidence intervals is an advantage of our method.
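To make the mapping from this aggregate panel to our estimator concrete, here is a schematic per-year loop (our own sketch; the names outcomes, treated, X_pre and the estimator callable, e.g. the immunized routine sketched earlier, are our conventions):

```python
def proposition99_effects(outcomes, treated, X_pre, estimator, post_years):
    """Year-by-year effects and counterfactual for a single treated unit (California).

    outcomes[t]: vector of per capita consumption across states in year t;
    treated: 0/1 vector flagging California; X_pre: pre-treatment covariates
    (1980-1988 averages and the 1970-1975, 1980 and 1988 consumptions);
    estimator(Y, D, X): any ATT routine, e.g. the immunized estimator.
    """
    effects, counterfactual = {}, {}
    for t in post_years:
        Y = outcomes[t]
        theta = estimator(Y, treated, X_pre)
        effects[t] = theta
        counterfactual[t] = Y[treated == 1].item() - theta   # synthetic California in year t
    return effects, counterfactual
```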
Conclusion

In this paper, we propose an estimator that makes a link between the synthetic control method, typically used with aggregated data and n smaller than or of the same order as p, and treatment effect methods used for micro data, for which n ≫ p. Our method accommodates both settings. In the low-dimensional regime, it pins down one of the solutions of the synthetic control problem, which admits an infinity of solutions. In the high-dimensional regime, the estimator is a regularized and immunized version of the low-dimensional one and thus differs from the synthetic control estimator. The simulations and applications suggest that it works well in practice.

In our study, we have focused on specific procedures based on ℓ₁ penalization and proved that they achieve good asymptotic behavior in a possibly high-dimensional regime under sparsity restrictions. Other types of estimators could be explored using these ideas. For example, in the high-dimensional regime, our strategy can be used with the whole spectrum of sparsity-related penalization techniques, such as the group Lasso, fused Lasso, adaptive Lasso and Slope, among others.

References
Abadie, A. (2019), 'Using synthetic controls: Feasibility, data requirements, and methodological aspects', Journal of Economic Literature, forthcoming.
Abadie, A., Diamond, A. and Hainmueller, J. (2010), 'Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program', Journal of the American Statistical Association 105(490), 493–505.
Abadie, A., Diamond, A. and Hainmueller, J. (2015), 'Comparative politics and the synthetic control method', American Journal of Political Science 59(2), 495–510.
Abadie, A. and Gardeazabal, J. (2003), 'The economic costs of conflict: A case study of the Basque Country', American Economic Review 93(1), 113–132.
Abadie, A. and L'Hour, J. (2019), 'A penalized synthetic control estimator for disaggregated data', Working Paper.
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W. and Wager, S. (2019), 'Synthetic difference in differences', Working Paper wp2019_907, CEMFI.
Athey, S., Imbens, G. W. and Wager, S. (2018), 'Approximate residual balancing: De-biased inference of average treatment effects in high dimensions', arXiv:1604.07125.
Bang, H. and Robins, J. M. (2005), 'Doubly robust estimation in missing data and causal inference models', Biometrics 61(4), 962–973.
Bellec, P. C., Lecué, G. and Tsybakov, A. B. (2018), 'Slope meets Lasso: Improved oracle bounds and optimality', Annals of Statistics 46(6B), 3603–3642.
Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012), 'Sparse models and methods for optimal instruments with an application to eminent domain', Econometrica 80(6), 2369–2429.
Belloni, A. and Chernozhukov, V. (2013), 'Least squares after model selection in high-dimensional sparse models', Bernoulli 19(2), 521–547.
Belloni, A., Chernozhukov, V., Fernández-Val, I. and Hansen, C. (2017), 'Program evaluation and causal inference with high-dimensional data', Econometrica 85(1), 233–298.
Belloni, A., Chernozhukov, V. and Hansen, C. (2014), 'Inference on treatment effects after selection among high-dimensional controls', The Review of Economic Studies 81(2), 608–650.
Ben-Michael, E., Feller, A. and Rothstein, J. (2018), 'The augmented synthetic control method', arXiv:1811.04170.
Bickel, P., Ritov, Y. and Tsybakov, A. B. (2009), 'Simultaneous analysis of Lasso and Dantzig selector', Annals of Statistics 37(4), 1705–1732.
Bléhaut, M., D'Haultfoeuille, X., L'Hour, J. and Tsybakov, A. B. (2017), 'A parametric generalization of the synthetic control method, with high dimension', Unpublished, 2017 IAAE Meeting presentation.
URL: https://editorialexpress.com/conference/IAAE2017/program/IAAE2017.html
Bradic, J., Wager, S. and Zhu, Y. (2019), 'Sparsity double robust inference of average treatment effects', arXiv preprint arXiv:1905.00744.
Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007), 'Sparsity oracle inequalities for the Lasso', Electronic Journal of Statistics 1, 169–194.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), 'Double/debiased machine learning for treatment and structural parameters', The Econometrics Journal 21(1), C1–C68.
Chernozhukov, V., Hansen, C. and Spindler, M. (2015a), 'Post-selection and post-regularization inference in linear models with many controls and instruments', American Economic Review 105(5), 486–490.
Chernozhukov, V., Hansen, C. and Spindler, M. (2015b), 'Valid post-selection and post-regularization inference: An elementary, general approach', Annual Review of Economics 7(1), 649–688.
de la Peña, V. H., Lai, T. L. and Shao, Q.-M. (2009), Self-Normalized Processes: Limit Theory and Statistical Applications, 1st edn, Springer-Verlag Berlin Heidelberg.
Dehejia, R. H. and Wahba, S. (1999), 'Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs', Journal of the American Statistical Association 94(448), 1053–1062.
Dehejia, R. H. and Wahba, S. (2002), 'Propensity score-matching methods for nonexperimental causal studies', The Review of Economics and Statistics 84(1), 151–161.
Deville, J.-C. and Särndal, C.-E. (1992), 'Calibration estimators in survey sampling', Journal of the American Statistical Association 87(418), 376–382.
Farrell, M. H. (2015), 'Robust inference on average treatment effects with possibly more covariates than observations', Journal of Econometrics 189(1), 1–23.
Graham, B. S., Pinto, C. C. D. X. and Egel, D. (2012), 'Inverse probability tilting for moment condition models with missing data', Review of Economic Studies 79(3), 1053–1079.
Hainmueller, J. (2012), 'Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies', Political Analysis 20(1), 25–46.
Hastie, T., Tibshirani, R. and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn, Springer.
Imai, K. and Ratkovic, M. (2014), 'Covariate balancing propensity score', Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1), 243–263.
Kline, P. (2011), 'Oaxaca-Blinder as a reweighting estimator', American Economic Review 101(3), 532–537.
Klößner, S., Kaul, A., Pfeifer, G. and Schieler, M. (2018), 'Comparative politics and the synthetic control method revisited: A note on Abadie et al. (2015)', Swiss Journal of Economics and Statistics 154(1), 11.
LaLonde, R. J. (1986), 'Evaluating the econometric evaluations of training programs with experimental data', American Economic Review 76(4), 604–620.
Leamer, E. E. (1983), 'Let's take the con out of econometrics', The American Economic Review 73(1), 31–43.
Leeb, H. and Pötscher, B. M. (2005), 'Model selection and inference: Facts and fiction', Econometric Theory 21(1), 21–59.
Leeb, H. and Pötscher, B. M. (2008a), 'Recent developments in model selection and related areas', Econometric Theory 24, 319–322.
Leeb, H. and Pötscher, B. M. (2008b), 'Sparse estimators and the oracle property, or the return of Hodges' estimator', Journal of Econometrics 142(1), 201–211.
Newey, W. K. and McFadden, D. (1994), 'Large sample estimation and hypothesis testing', in Handbook of Econometrics, Vol. 4, Elsevier, pp. 2111–2245.
Rudelson, M. (2020), Personal communication.
Rudelson, M. and Zhou, S. (2013), 'Reconstruction from anisotropic random measurements', IEEE Transactions on Information Theory 59(6), 3434–3447.
Smith, J. and Todd, P. (2005), 'Does matching overcome LaLonde's critique of nonexperimental estimators?', Journal of Econometrics 125(1-2), 305–353.
Tibshirani, R. J. (2013), 'The lasso problem and uniqueness', Electronic Journal of Statistics 7, 1456–1490.
van de Geer, S. A. (2016), Estimating and Testing under Sparsity, Springer.
A Algorithm for Feasible Penalty Loadings
Consider the ideal penalty loadings for estimation of $\beta_0$ defined as
\[
\lambda := c\,\Phi^{-1}(1-\gamma/(2p))/\sqrt{n}, \qquad
\bar\psi_j := \sqrt{\frac{1}{n}\sum_{i=1}^n \big[(1-D_i)h(X_i^T\beta_0)-D_i\big]^2 X_{i,j}^2}, \quad j=1,\dots,p,
\]
and the ideal penalty loadings for estimation of $\mu_0$:
\[
\lambda' := 2c\,\Phi^{-1}(1-\gamma/(2p))/\sqrt{n}, \qquad
\bar\psi'_j := \sqrt{\frac{1}{n}\sum_{i=1}^n (1-D_i)\,h'(X_i^T\beta_0)^2\big[Y_i - X_i^T\mu_0\big]^2 X_{i,j}^2}, \quad j=1,\dots,p.
\]
Here $c>1$ and $\gamma\in(0,1)$ are constants, and $\beta_0$ and $\mu_0$ are the true coefficients. We follow Belloni et al. (2014) and set $\gamma=0.05$ and $c=1.1$.

We estimate the ideal penalty loadings $\{\bar\psi_j\}_{j=1}^p$ of the balancing step using the following algorithm. Set a small constant $\epsilon>0$ and a maximum number of iterations $\bar k$.

1. Start by using a preliminary estimate $\beta^{(0)}$ of $\beta_0$. For example, take $\beta^{(0)}$ with the entry corresponding to the intercept equal to $\log(\sum_{i=1}^n D_i/\sum_{i=1}^n(1-D_i))$ and all other entries equal to zero. Then, for all $j=1,\dots,p$, set
\[
\tilde\psi_j^{(0)} = \sqrt{\frac{1}{n}\sum_{i=1}^n \big[(1-D_i)h(X_i^T\beta^{(0)})-D_i\big]^2 X_{i,j}^2}.
\]
At step $k\ge 1$, set $\tilde\psi_j^{(k)} = \sqrt{n^{-1}\sum_{i=1}^n \big[(1-D_i)h(X_i^T\hat\beta^{(k-1)})-D_i\big]^2 X_{i,j}^2}$, $j=1,\dots,p$, where $\hat\beta^{(0)}:=\beta^{(0)}$.

2. Estimate the model by the penalized balancing equation (3.4), using the penalty level $\lambda$ and the penalty loadings found previously, to obtain $\hat\beta^{(k)}$.

3. Stop if $\max_{j=1,\dots,p}|\tilde\psi_j^{(k)}-\tilde\psi_j^{(k-1)}|\le\epsilon$ or if $k>\bar k$. Set $k=k+1$ and go to step 1 otherwise.

Asymptotic validity of this approach is established analogously to (Belloni et al., 2012, Lemma 11). Estimation of the penalty loadings $\bar\psi'_j$ on the immunization step follows a similar procedure, where we replace $\beta_0$ in the formula for $\bar\psi'_j$ by its estimator obtained on the balancing step.
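For concreteness, the iteration above can be sketched in a few lines of Python. The sketch assumes, purely for illustration, the exponential link $h(u)=\exp(u)$ (so that $H(u)=\exp(u)$) and uses a simple proximal-gradient routine in place of the penalized balancing problem (3.4); the function names below are hypothetical and not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def loadings(beta, X, D, h=np.exp):
    # psi_j = sqrt( n^{-1} sum_i [(1-D_i) h(X_i'beta) - D_i]^2 X_{ij}^2 )
    resid = (1 - D) * h(X @ beta) - D
    return np.sqrt(np.mean((resid[:, None] * X) ** 2, axis=0))

def fit_penalized_balancing(X, D, lam, psi, beta_init, h=np.exp,
                            n_iter=5000, step=1e-3):
    """Proximal-gradient (ISTA) sketch of the weighted-l1 balancing problem:
    min_b  n^{-1} sum_i [(1-D_i) H(X_i'b) - D_i X_i'b] + lam * sum_j psi_j |b_j|."""
    n = X.shape[0]
    beta = beta_init.copy()
    for _ in range(n_iter):
        grad = X.T @ ((1 - D) * h(X @ beta) - D) / n        # gradient of the smooth part
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * psi, 0.0)  # soft-thresholding
    return beta

def feasible_loadings(X, D, gamma=0.05, c=1.1, eps=1e-4, k_max=15):
    """Iterate between loadings and the balancing fit until the loadings stabilise."""
    n, p = X.shape
    lam = c * norm.ppf(1 - gamma / (2 * p)) / np.sqrt(n)
    beta = np.zeros(p)
    beta[0] = np.log(D.sum() / (1 - D).sum())   # intercept-only start (X[:, 0] assumed to be 1)
    psi = loadings(beta, X, D)
    for _ in range(k_max):
        beta = fit_penalized_balancing(X, D, lam, psi, beta)
        psi_new = loadings(beta, X, D)
        if np.max(np.abs(psi_new - psi)) <= eps:
            psi = psi_new
            break
        psi = psi_new
    return psi, beta, lam
```

The fixed step size is only meant to keep the sketch short; in practice a backtracking line search, and leaving the intercept unpenalized, would be safer choices.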
B Proofs

B.1 Proof of Theorem 2.1
First, note that we have $E[(1-D)w(X)X]=E[DX]$. As a result, for any $\mu\in\mathbb{R}^p$,
\[
E\big[(D-(1-D)w(X))(Y-X^T\mu)\big] = E\big[(D-(1-D)w(X))Y\big].
\]
Since $(1-D)Y=(1-D)Y(0)$ and $DY=DY(1)$, the value $\theta_0$ satisfies the moment condition (2.8) if and only if
\[
E[w(X)Y(1-D)] = E[DY(0)].
\]
By the Mean Independence assumption, $E(D|X)E(Y(0)|X)=E(DY(0)|X)$. Thus,
\[
E[w(X)Y(1-D)] = E\big[E\big(w(X)Y(0)(1-D)\,|\,X\big)\big] = E\big[w(X)(1-E(D|X))E(Y(0)|X)\big]. \tag{B.1}
\]
We consider the two cases of the theorem separately.

1. In the linear case $E(Y(0)|X)=X^T\mu_0$ we have
\[
E[w(X)Y(1-D)] = E\big[w(X)(1-D)X^T\mu_0\big] = E\big[DX^T\mu_0\big] = E[D\,E(Y(0)|X)] = E[DY(0)].
\]
The first equality here is due to (B.1). The second equality follows from the fact that $E[(1-D)w(X)X]=E[DX]$. The last equality uses the Mean Independence assumption.

2. Suppose now that the propensity score satisfies $P(D=1|X)=w(X)/(1+w(X))$. In this case, using (B.1) we have
\[
E[w(X)Y(1-D)] = E[w(X)(1-P(D=1|X))E(Y(0)|X)] = E[P(D=1|X)E(Y(0)|X)] = E[E(D|X)E(Y(0)|X)] = E[DY(0)],
\]
where the last equality follows from the Mean Independence assumption. $\Box$

B.2 Proof of Theorem 3.1

Denote the observed data by $Z_i=(Y_i,D_i,X_i)$, and by $\pi$ the probability of being treated: $\pi:=P(D=1)$. The estimating moment function for $\theta$ is
\[
g(Z,\theta,\eta) := [D-(1-D)h(X^T\beta)][Y-X^T\mu] - D\theta,
\]
where $\eta:=(\beta^T,\mu^T)^T$ collects the nuisance parameters. Recall that we define $(\theta_0,\eta_0)$ as the values satisfying $E\,g(Z,\theta_0,\eta_0)=0$. All these quantities depend on the sample size $n$, but for the sake of brevity we suppress this dependency in the notation except for the cases when we need it explicitly.

By the Taylor expansion and the linearity of the estimating function $g$ in $\theta$, there exists $t\in(0,1)$ such that
\[
\mathbb{E}_n[g(Z,\hat\theta,\hat\eta)] = \mathbb{E}_n[g(Z,\theta_0,\hat\eta)] + \hat\pi(\theta_0-\hat\theta)
= \hat\pi(\theta_0-\hat\theta) + \mathbb{E}_n[g(Z,\theta_0,\eta_0)] + (\hat\eta-\eta_0)^T\mathbb{E}_n[\nabla_\eta g(Z,\theta_0,\eta_0)] + \frac12(\hat\eta-\eta_0)^T\mathbb{E}_n\big[\nabla^2_{\eta\eta} g(Z,\theta_0,\tilde\eta)\big](\hat\eta-\eta_0),
\]
where $\tilde\eta := t\eta_0+(1-t)\hat\eta$ and $\hat\pi = n^{-1}\sum_{i=1}^n D_i$. The immunized estimator satisfies $\mathbb{E}_n[g(Z,\hat\theta,\hat\eta)]=0$. Thus, we obtain
\[
\hat\pi\sqrt{n}(\hat\theta-\theta_0) = \underbrace{\sqrt{n}\,\mathbb{E}_n[g(Z,\theta_0,\eta_0)]}_{:=I_1}
+ \underbrace{\sqrt{n}\,(\hat\eta-\eta_0)^T\mathbb{E}_n[\nabla_\eta g(Z,\theta_0,\eta_0)]}_{:=I_2}
+ \underbrace{2^{-1}\sqrt{n}\,(\hat\eta-\eta_0)^T\mathbb{E}_n\big[\nabla^2_{\eta\eta} g(Z,\theta_0,\tilde\eta)\big](\hat\eta-\eta_0)}_{:=I_3}.
\]
Now, to prove Theorem 3.1 we proceed as follows. First, we show that $I_1$ converges in distribution to a zero mean normal random variable with variance $E\,g^2(Z,\theta_0,\eta_0)$ (Step 1 below), while $I_2$ and $I_3$ tend to zero in probability (Steps 2 and 3). This and the fact that $\hat\pi\to\pi$ (a.s.) imply that $\sqrt{n}(\hat\theta-\theta_0)$ is asymptotically normal with some positive variance, which in turn implies that $\hat\theta-\theta_0\to 0$ in probability. Finally, using this property we prove that $\mathbb{E}_n\big[g^2(Z_i,\hat\theta,\hat\eta)\big]-E\,g^2(Z,\theta_0,\eta_0)\to 0$ in probability (Step 4). Combining Steps 1 to 4 and using the Slutsky lemma leads to the result of the theorem. Thus, to complete the proof of the theorem it remains to establish Steps 1 to 4.
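Before turning to Steps 1–4, the following minimal sketch shows how the immunized estimator and the plug-in standard error that these steps justify can be computed, given first-step estimates $\hat\beta$ and $\hat\mu$. The exponential link $h(u)=\exp(u)$ is assumed purely for illustration, and the function name is hypothetical, not from the paper.

```python
import numpy as np

def immunized_att(Y, D, X, beta_hat, mu_hat, h=np.exp):
    """Solve E_n[g(Z, theta, eta_hat)] = 0 for theta, where
    g(Z, theta, eta) = [D - (1-D) h(X'beta)] [Y - X'mu] - D*theta,
    and return the plug-in standard error sqrt(E_n[g^2]) / (pi_hat * sqrt(n))."""
    n = len(Y)
    pi_hat = D.mean()
    core = (D - (1 - D) * h(X @ beta_hat)) * (Y - X @ mu_hat)
    theta_hat = core.mean() / pi_hat      # E_n[g] = 0  <=>  theta = E_n[core] / pi_hat
    g = core - D * theta_hat
    se = np.sqrt(np.mean(g ** 2)) / (pi_hat * np.sqrt(n))
    return theta_hat, se
```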
Step 1

In this part, we write $g_{i,n}:=g(Z_i,\theta_0,\eta_0)$ (making the dependence on $n$ explicit in the notation). Recall that $E\,g_{i,n}=0$. We apply the Lindeberg-Feller central limit theorem for triangular arrays by checking a Lyapunov condition. Since $(g_{i,n})_{1\le i\le n}$ are i.i.d., it suffices to prove that
\[
\limsup_{n\to\infty}\ \frac{E\big(|g_{1,n}|^{2+\delta}\big)}{\big(E\,g_{1,n}^2\big)^{1+\delta/2}} < \infty \tag{B.2}
\]
for some $\delta>0$. Assumption 3.3(i) implies that $E(|g_{1,n}|^{2+\delta})$ is bounded uniformly in $n$. Moreover,
\[
E\big(g_{1,n}^2\big) = \pi\,E\big((Y(1)-X^T\mu_0-\theta_0)^2\,\big|\,D=1\big) + (1-\pi)\,E\big(h^2(X^T\beta_0)(Y(0)-X^T\mu_0)^2\,\big|\,D=0\big).
\]
Due to Assumption 3.3(ii) we have $E(g_{1,n}^2)\ge c$, where $c>0$ does not depend on $n$. Thus, (B.2) holds. It follows that $(E(g_{1,n}^2))^{-1/2}\,I_1$ converges in distribution to a standard normal random variable.
Step 2

Set $\psi_{j+p}=\psi'_j$, $j=1,\dots,p$, and denote by $\Psi$ the diagonal matrix of dimension $2p$ with diagonal elements $\psi_j$, $j=1,\dots,2p$. Let also $\bar\Psi$ be the diagonal matrix of dimension $2p$ with diagonal elements $\bar\psi_j=\sqrt{n^{-1}\sum_{i=1}^n U_{i,j}^2}$, $j=1,\dots,2p$. By Assumption 3.6 we have $\psi_j\ge\bar\psi_j$. Hence,
\[
|I_2| \le \|\Psi(\hat\eta-\eta_0)\|_1\,\big\|\Psi^{-1}\sqrt{n}\,\mathbb{E}_n[\nabla_\eta g(Z,\theta_0,\eta_0)]\big\|_\infty
\le \|\Psi(\hat\eta-\eta_0)\|_1\,\big\|\bar\Psi^{-1}\sqrt{n}\,\mathbb{E}_n[\nabla_\eta g(Z,\theta_0,\eta_0)]\big\|_\infty.
\]
Here, the term $\|\bar\Psi^{-1}\sqrt{n}\,\mathbb{E}_n[\nabla_\eta g(Z,\theta_0,\eta_0)]\|_\infty$ is a maximum of self-normalized sums of the variables $U_{i,j}$ and it can be bounded by using standard inequalities for self-normalized sums, cf. Lemma C.1 below. From the orthogonality conditions and Assumption 3.3 we have $E(U_{i,j})=0$, $E(U_{i,j}^2)\ge c$ and $E(|U_{i,j}|^3)<\infty$ for any $i$ and any $j$. By Lemma C.1 and the fact that $\Phi^{-1}(1-a)\le\sqrt{2\log(1/a)}$ for all $a\in(0,1)$ we obtain that, with probability tending to 1 as $n\to\infty$,
\[
\big\|\bar\Psi^{-1}\sqrt{n}\,\mathbb{E}_n[\nabla_\eta g(Z,\theta_0,\eta_0)]\big\|_\infty \le \Phi^{-1}(1-\gamma/(2p)) \le \sqrt{2\log(2p/\gamma)} \lesssim \sqrt{\log(p)}.
\]
Next, inequalities (B.18) and (B.23) imply that $\|\Psi(\hat\eta-\eta_0)\|_1 \lesssim (s/\kappa^2_\Sigma)\sqrt{\log(p)/n}$ with probability tending to 1 as $n\to\infty$. Using these facts and the growth condition (i) in Assumption 3.5, we conclude that $I_2$ converges to 0 in probability as $n\to\infty$.
Step 3

Let $H:=\mathbb{E}_n\big[\nabla^2_{\eta\eta} g(Z,\theta_0,\tilde\eta)\big]\in\mathbb{R}^{2p\times 2p}$ and let $h_{k,j}$ be the elements of the matrix $H$. We have
\[
|I_3| \le \frac{\sqrt{n}}{2}\,\|\hat\eta-\eta_0\|_1^2\ \max_{1\le k,j\le 2p}|h_{k,j}|. \tag{B.3}
\]
We now control the random variable $\max_{1\le k,j\le 2p}|h_{k,j}|$. To do this, we first note that
\[
\frac{\partial^2}{\partial\beta\,\partial\beta^T}\,g(Z,\theta,\eta) = -(1-D)\,h''(X^T\beta)\big[Y-X^T\mu\big]XX^T,\qquad
\frac{\partial^2}{\partial\mu\,\partial\beta^T}\,g(Z,\theta,\eta) = \frac{\partial^2}{\partial\beta\,\partial\mu^T}\,g(Z,\theta,\eta) = (1-D)\,h'(X^T\beta)XX^T,\qquad
\frac{\partial^2}{\partial\mu\,\partial\mu^T}\,g(Z,\theta,\eta)=0.
\]
It follows that
\[
\max_{1\le k,j\le 2p}|h_{k,j}| \le \max\Big(\max_{1\le k,j\le p}|\tilde h_{k,j}|,\ \max_{1\le k,j\le p}|\bar h_{k,j}|\Big), \tag{B.4}
\]
where
\[
\tilde h_{k,j} = \frac1n\sum_{i=1}^n (1-D_i)\,h''(X_i^T\tilde\beta)(Y_i-X_i^T\tilde\mu)X_{i,k}X_{i,j}, \qquad
\bar h_{k,j} = \frac1n\sum_{i=1}^n (1-D_i)\,h'(X_i^T\tilde\beta)X_{i,k}X_{i,j}.
\]
We now evaluate separately the terms $\max_{1\le k,j\le p}|\tilde h_{k,j}|$ and $\max_{1\le k,j\le p}|\bar h_{k,j}|$. Note that $\tilde h_{k,j}$ can be decomposed as $\tilde h_{k,j} = \tilde h_{k,j,1} + \tilde h_{k,j,2} + \tilde h_{k,j,3} + \tilde h_{k,j,4}$, where
\[
\tilde h_{k,j,1} = n^{-1}\sum_{i=1}^n (1-D_i)\,h''(X_i^T\beta_0)(Y_i-X_i^T\mu_0)X_{i,k}X_{i,j},\qquad
\tilde h_{k,j,2} = n^{-1}\sum_{i=1}^n (1-D_i)\,h''(X_i^T\beta_0)\,X_i^T(\mu_0-\tilde\mu)\,X_{i,k}X_{i,j},
\]
\[
\tilde h_{k,j,3} = n^{-1}\sum_{i=1}^n (1-D_i)\big(h''(X_i^T\tilde\beta)-h''(X_i^T\beta_0)\big)(Y_i-X_i^T\mu_0)X_{i,k}X_{i,j},\qquad
\tilde h_{k,j,4} = n^{-1}\sum_{i=1}^n (1-D_i)\big(h''(X_i^T\tilde\beta)-h''(X_i^T\beta_0)\big)\,X_i^T(\mu_0-\tilde\mu)\,X_{i,k}X_{i,j}.
\]
It follows from Assumption 3.3 that, for all $k,j$,
\[
|\tilde h_{k,j,2}| \le C\,\|\mu_0-\hat\mu\|_1. \tag{B.5}
\]
Here and in what follows we denote by $C$ positive constants depending only on $K$ that can be different on different appearances. Next, from Assumptions 3.3 and 3.2(i) we obtain that, for all $k,j$,
\[
|\tilde h_{k,j,3}| \le C\,\|\beta_0-\hat\beta\|_1, \qquad |\tilde h_{k,j,4}| \le C\,\|\beta_0-\hat\beta\|_1\|\mu_0-\hat\mu\|_1. \tag{B.6}
\]
From (B.5), (B.6), Theorem 3.2 and the growth condition (i) in Assumption 3.5 we find that, with probability tending to 1 as $n\to\infty$,
\[
\max_{1\le k,j\le p}\big(|\tilde h_{k,j,2}|+|\tilde h_{k,j,3}|+|\tilde h_{k,j,4}|\big) \lesssim (s/\kappa^2_\Sigma)\sqrt{\log(p)/n} \lesssim 1. \tag{B.7}
\]
Next, again from Assumption 3.3, we deduce that $|E(\tilde h_{k,j,1})|\le C$, while by Hoeffding's inequality
\[
P\big(|\tilde h_{k,j,1}-E(\tilde h_{k,j,1})|\ge x\big) \le 2\exp(-Cnx^2), \qquad \forall x>0,
\]
where $C>0$. Therefore, by the union bound over all pairs $(k,j)$, with probability tending to 1 as $n\to\infty$,
\[
\max_{1\le k,j\le p}|\tilde h_{k,j,1}| \le C\big(1+\sqrt{\log(p)/n}\big) \lesssim 1. \tag{B.8}
\]
Finally, combining (B.7) and (B.8) we obtain that, with probability tending to 1 as $n\to\infty$, $\max_{1\le k,j\le p}|\tilde h_{k,j}| \lesssim 1$. Quite similarly, we get that with probability tending to 1 as $n\to\infty$, $\max_{1\le k,j\le p}|\bar h_{k,j}| \lesssim 1$. Thus, with probability tending to 1 as $n\to\infty$ we have $\max_{1\le k,j\le 2p}|h_{k,j}| \lesssim 1$. On the other hand, $\|\hat\eta-\eta_0\|_1 \lesssim (s/\kappa^2_\Sigma)\sqrt{\log(p)/n}$ with probability tending to 1 as $n\to\infty$ due to Theorem 3.2. Using these facts together with (B.3) and the growth condition (i) in Assumption 3.5, we conclude that $I_3$ tends to 0 in probability as $n\to\infty$.
Step 4

We now prove that if $\hat\theta\to\theta_0$ in probability then $n^{-1}\sum_{i=1}^n g^2(Z_i,\hat\theta,\hat\eta)\to E\,g^2(Z,\theta_0,\eta_0)$ in probability as $n\to\infty$. We have
\[
g(Z,\theta,\eta) = [D-(1-D)h(X^T\beta)][Y-X^T\mu] - D\theta.
\]
Theorem 3.2 and the growth condition (i) in Assumption 3.5 imply that $\|\beta_0-\hat\beta\|_1$ is bounded by 1 on an event $A_n$ of probability tending to 1 as $n\to\infty$. Using Assumption 3.3 we deduce that, on the event $A_n$, the values $X_i^T\beta_0$ and $X_i^T\hat\beta$ for all $i$ belong to a subset of $\mathbb{R}$ of diameter at most $2K$. On the other hand, Assumption 3.2(i) implies that the function $h$ is bounded and Lipschitz on any compact subset of $\mathbb{R}$. Therefore, using again Assumption 3.3 we find that on the event $A_n$ we have $|h(X_i^T\beta_0)-h(X_i^T\hat\beta)|\le C\|\beta_0-\hat\beta\|_1$ for all $i$. It follows from this remark and from Assumption 3.3 that, on the event $A_n$,
\[
|g(Z_i,\hat\theta,\hat\eta)-g(Z_i,\theta_0,\eta_0)| \le |\hat\theta-\theta_0| + |h(X_i^T\beta_0)-h(X_i^T\hat\beta)|\,|Y_i-X_i^T\mu_0| + \big(1+|h(X_i^T\hat\beta)|\big)\,|X_i^T(\mu_0-\hat\mu)|
\le |\hat\theta-\theta_0| + C\big(\|\beta_0-\hat\beta\|_1 + \|\mu_0-\hat\mu\|_1 + \|\mu_0-\hat\mu\|_1\|\beta_0-\hat\beta\|_1\big) =: \zeta_n
\]
for all $i$. Also note that, due to Assumption 3.3, the random variables $g(Z_i,\theta_0,\eta_0)$ are a.s. uniformly bounded. Thus, using the equality $b^2-a^2=(b-a)^2+2a(b-a)$, $\forall a,b\in\mathbb{R}$, we get that, on the event $A_n$,
\[
|g^2(Z_i,\hat\theta,\hat\eta)-g^2(Z_i,\theta_0,\eta_0)| \le C(\zeta_n^2+\zeta_n)
\]
for all $i$. Using this fact together with Theorem 3.2, the growth condition (i) in Assumption 3.5 and the convergence $\hat\theta\to\theta_0$ in probability, we find that $n^{-1}\sum_{i=1}^n g^2(Z_i,\hat\theta,\hat\eta) - n^{-1}\sum_{i=1}^n g^2(Z_i,\theta_0,\eta_0) \to 0$ in probability as $n\to\infty$. We conclude by applying the law of large numbers to the sum $n^{-1}\sum_{i=1}^n g^2(Z_i,\theta_0,\eta_0)$. $\Box$

B.3 Proof of Theorem 3.2

B.3.1 Proof of (3.14)

Recall that $\hat\beta$ is defined as
\[
\hat\beta \in \arg\min_{\beta\in\mathbb{R}^p}\left(\frac1n\sum_{i=1}^n\big[(1-D_i)H(X_i^T\beta)-D_iX_i^T\beta\big] + \lambda\sum_{j=1}^p\psi_j|\beta_j|\right) \tag{B.9}
\]
with penalty loadings satisfying Assumption 3.6. Let $\Psi_d\in\mathbb{R}^{p\times p}$ be the diagonal matrix with diagonal entries $\psi_1,\dots,\psi_p$. Let also $\bar\Psi_d\in\mathbb{R}^{p\times p}$ be the diagonal matrix with diagonal entries $\bar\psi_j=\sqrt{n^{-1}\sum_{i=1}^n U_{i,j}^2}$, $j=1,\dots,p$. We denote by $S_\beta\subseteq\{1,\dots,p\}$ the set of indices of the non-zero components of $\beta_0$. By assumption, $\mathrm{Card}(S_\beta)=\|\beta_0\|_0\le s_\beta$.
Step 1: Concentration inequality

We first bound the sup-norm of the gradient of the objective function using Lemma C.1. Recall that for $1\le j\le p$ we have $U_{i,j}=\big[(1-D_i)h(X_i^T\beta_0)-D_i\big]X_{i,j}$. Set $\mathcal{S}_j := \sum_{i=1}^n U_{i,j}/(\sqrt{n}\,\bar\psi_j)$ and consider the event
\[
\mathcal{B} := \left\{\frac1n\max_{1\le j\le p}\left|\sum_{i=1}^n\frac{U_{i,j}}{\bar\psi_j}\right| \le \frac{\lambda}{c}\right\}.
\]
By construction, the random variables $U_{i,j}$ are i.i.d. across $i$, $E(U_{i,j})=0$, $E(U_{i,j}^2)\ge c$ and $E(|U_{i,j}|^3)\le C$ by Assumptions 3.3 and 3.2. Using these remarks, Assumption 3.6 and Lemma C.1, we obtain
\[
P\big(\mathcal{B}^C\big) = P\left(\frac{c}{\sqrt{n}}\max_{1\le j\le p}|\mathcal{S}_j| > c\,\Phi^{-1}(1-\gamma/(2p))/\sqrt{n}\right) = P\left(\max_{1\le j\le p}|\mathcal{S}_j| > \Phi^{-1}(1-\gamma/(2p))\right) = o(1)
\]
as $n\to\infty$.
Step 2: Restricted Eigenvalue condition for the empirical Gram matrix

The empirical Gram matrix is
\[
\hat\Sigma := \frac1n\sum_{i=1}^n(1-D_i)X_iX_i^T = \frac1n\sum_{i=1}^n\big[(1-D_i)X_i\big]\big[(1-D_i)X_i\big]^T.
\]
We also recall the Restricted Eigenvalue (RE) condition (Bickel et al., 2009). For a non-empty subset $S\subseteq\{1,\dots,p\}$ and $\alpha>0$, define the set
\[
\mathcal{C}[S,\alpha] := \{v\in\mathbb{R}^p:\ \|v_{S^C}\|_1\le\alpha\|v_S\|_1,\ v\ne 0\}, \tag{B.10}
\]
where $S^C$ stands for the complement of $S$. Then, for given $s_0\in\{1,\dots,p\}$ and $\alpha>0$, the matrix $\hat\Sigma$ satisfies the RE($s_0,\alpha$) condition if there exists $\kappa(\hat\Sigma)>0$ such that
\[
\min_{S\subseteq\{1,\dots,p\}:\,\mathrm{Card}(S)\le s_0}\ \min_{v\in\mathcal{C}[S,\alpha]}\ \frac{v^T\hat\Sigma v}{\|v_S\|_2^2} \ge \kappa^2(\hat\Sigma). \tag{B.11}
\]
We now use Lemma C.2, stated and proved in Section C below. Note that Assumption 3.5 implies (C.2) therein, and set $V_i=(1-D_i)X_i$. Then, for any $s_0\in[1,p/2]$ and $\alpha>0$, $\hat\Sigma$ satisfies the RE($s_0,\alpha$) condition with $\kappa(\hat\Sigma)=c_*\kappa_\Sigma$, where $c_*\in(0,1)$ is an absolute constant, with probability tending to 1 as $n\to\infty$.
Step 3: Basic inequality

At this step, we prove that with probability tending to 1 as $n\to\infty$, $\hat\beta$ satisfies the following inequality (further called the basic inequality):
\[
\frac{\tau_0}{2}(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0) \le \lambda\big(\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1\big) + \frac{\lambda}{c}\|\Psi_d(\hat\beta-\beta_0)\|_1, \tag{B.12}
\]
where $\tau_0>0$ does not depend on $n$.

By optimality of $\hat\beta$ we have
\[
\frac1n\sum_{i=1}^n\big[\gamma_{\hat\beta}(X_i,D_i)-\gamma_{\beta_0}(X_i,D_i)\big] \le \lambda\big(\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1\big),
\]
where $\gamma_\beta(X,D):=(1-D)H(X^T\beta)-DX^T\beta$. Subtracting the inner product of the gradient $\nabla_\beta\gamma_{\beta_0}(X_i,D_i)$ and $\hat\beta-\beta_0$ on both sides, we find
\[
\frac1n\sum_{i=1}^n\Big[\gamma_{\hat\beta}(X_i,D_i)-\gamma_{\beta_0}(X_i,D_i)-\big((1-D_i)h(X_i^T\beta_0)-D_i\big)(\hat\beta-\beta_0)^TX_i\Big]
\le \lambda\big(\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1\big) - \frac1n\sum_{i=1}^n\big((1-D_i)h(X_i^T\beta_0)-D_i\big)(\hat\beta-\beta_0)^TX_i. \tag{B.13}
\]
Using a Taylor expansion, we get that there exists $0\le t\le 1$ such that
\[
\frac1n\sum_{i=1}^n\Big[\gamma_{\hat\beta}(X_i,D_i)-\gamma_{\beta_0}(X_i,D_i)-\big((1-D_i)h(X_i^T\beta_0)-D_i\big)(\hat\beta-\beta_0)^TX_i\Big]
= \frac12(\hat\beta-\beta_0)^T\left[\frac1n\sum_{i=1}^n(1-D_i)X_iX_i^T\,h'(X_i^T\tilde\beta)\right](\hat\beta-\beta_0),
\]
where $\tilde\beta=t\hat\beta+(1-t)\beta_0$. Plugging this into (B.13) and using the facts that $|\sum_i a_ib_i|\le\|a\|_1\|b\|_\infty$ and $\psi_j\ge\bar\psi_j$, we get that, on the event $\mathcal{B}$, which occurs with probability tending to 1 as $n\to\infty$,
\[
\frac12(\hat\beta-\beta_0)^T\left[\frac1n\sum_{i=1}^n(1-D_i)X_iX_i^T\,h'(X_i^T\tilde\beta)\right](\hat\beta-\beta_0) \tag{B.14}
\]
\[
\le \lambda\big(\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1\big) - \frac1n\sum_{i=1}^n\big((1-D_i)h(X_i^T\beta_0)-D_i\big)(\hat\beta-\beta_0)^TX_i
\le \lambda\big(\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1\big) + \max_{1\le j\le p}\left|\frac1n\sum_{i=1}^n\frac{X_{i,j}}{\bar\psi_j}\big((1-D_i)h(X_i^T\beta_0)-D_i\big)\right|\,\|\Psi_d(\hat\beta-\beta_0)\|_1
\le \lambda\big(\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1\big) + \frac{\lambda}{c}\|\Psi_d(\hat\beta-\beta_0)\|_1.
\]
By Assumption 3.2 we have $h'>0$, which implies that the left-hand side of (B.14) is non-negative. Hence we have, under the event $\mathcal{B}$,
\[
0 \le \lambda\big(\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1\big) + (\lambda/c)\|\Psi_d(\hat\beta-\beta_0)\|_1,
\]
which implies that
\[
\|\Psi_d\hat\beta\|_1 \le \bar c\,\|\Psi_d\beta_0\|_1, \tag{B.15}
\]
where $\bar c=(c+1)/(c-1)$. Note that $\max_j\psi_j\le\max_j\psi_{j,\max}\le\bar\psi$, where $\bar\psi>0$ does not depend on $n$. On the other hand, Assumption 3.3(ii) and the fact that the random variables $U_{i,j}$ are uniformly bounded imply that $\min_j\psi_j\ge\sqrt{c/2}=:\underline\psi$ with probability tending to 1 as $n\to\infty$ (this follows immediately from Hoeffding's inequality, the union bound and the fact that $\log(p)/n\to 0$). Hence, with probability tending to 1 as $n\to\infty$,
\[
\|\hat\beta\|_1 \le \bar c\,\frac{\bar\psi}{\underline\psi}\,\|\beta_0\|_1. \tag{B.16}
\]
We now use Assumption 3.2(ii). If $\|\beta_0\|_1\le C_1$ for a constant $C_1$, then $\|\hat\beta\|_1\le\bar c(\bar\psi/\underline\psi)C_1$ with probability tending to 1 as $n\to\infty$, so that $\min_{i=1,\dots,n}h'(X_i^T\tilde\beta)\ge h'\big(-K\max(1,\bar c\,\bar\psi/\underline\psi)C_1\big)>0$; if instead $h'\ge c_2>0$ on the whole real line, then obviously $\min_{i=1,\dots,n}h'(X_i^T\tilde\beta)\ge c_2$. It follows that there exists $\tau_0>0$ that does not depend on $n$ such that, with probability tending to 1 as $n\to\infty$,
\[
\tau_0\,v^T\hat\Sigma v \le v^T\left[\frac1n\sum_{i=1}^n(1-D_i)X_iX_i^T\,h'(X_i^T\tilde\beta)\right]v, \qquad \forall v\in\mathbb{R}^p. \tag{B.17}
\]
Using (B.17) with $v=\hat\beta-\beta_0$ and combining it with inequality (B.14) yields (B.12).

Step 4: Control of the $\ell_1$-error for $\hat\beta$

We prove that with probability tending to 1 as $n\to\infty$,
\[
\|\Psi_d(\hat\beta-\beta_0)\|_1 \le \frac{4c}{c-1}\cdot\frac{\bar\psi^2\lambda s_\beta}{c_*^2\tau_0\kappa_\Sigma^2}, \qquad
\|\hat\beta-\beta_0\|_1 \le \frac{4c}{c-1}\cdot\frac{\bar\psi^2\lambda s_\beta}{\underline\psi\,c_*^2\tau_0\kappa_\Sigma^2}. \tag{B.18}
\]
It suffices to prove the first inequality in (B.18). The second inequality follows as an immediate consequence.

We will use the basic inequality (B.12). First, we bound $\|\Psi_d\beta_0\|_1-\|\Psi_d\hat\beta\|_1$. By the triangle inequality,
\[
\|\Psi_d\beta_{0,S_\beta}\|_1 - \|\Psi_d\hat\beta_{S_\beta}\|_1 \le \|\Psi_d(\beta_{0,S_\beta}-\hat\beta_{S_\beta})\|_1,
\]
and
\[
\|\Psi_d\beta_{0,S_\beta^C}\|_1 - \|\Psi_d\hat\beta_{S_\beta^C}\|_1 = 2\|\Psi_d\beta_{0,S_\beta^C}\|_1 - \|\Psi_d\beta_{0,S_\beta^C}\|_1 - \|\Psi_d\hat\beta_{S_\beta^C}\|_1 \le 2\|\Psi_d\beta_{0,S_\beta^C}\|_1 - \|\Psi_d(\beta_{0,S_\beta^C}-\hat\beta_{S_\beta^C})\|_1 \le -\|\Psi_d(\beta_{0,S_\beta^C}-\hat\beta_{S_\beta^C})\|_1.
\]
The last inequality follows from the fact that $\|\beta_{0,S_\beta^C}\|_1=0$. Hence,
\[
\|\Psi_d\beta_0\|_1 - \|\Psi_d\hat\beta\|_1 + \frac1c\|\Psi_d(\hat\beta-\beta_0)\|_1 \le \Big(1+\frac1c\Big)\|\Psi_d(\hat\beta_{S_\beta}-\beta_{0,S_\beta})\|_1 - \Big(1-\frac1c\Big)\|\Psi_d(\hat\beta_{S_\beta^C}-\beta_{0,S_\beta^C})\|_1. \tag{B.19}
\]
Plugging this result in (B.12) we get, with probability tending to 1 as $n\to\infty$,
\[
(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0) \le \frac{2\lambda}{\tau_0}\left[\Big(1+\frac1c\Big)\|\Psi_d(\hat\beta_{S_\beta}-\beta_{0,S_\beta})\|_1 - \Big(1-\frac1c\Big)\|\Psi_d(\hat\beta_{S_\beta^C}-\beta_{0,S_\beta^C})\|_1\right], \tag{B.20}
\]
and thus
\[
(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0) + \frac{2\lambda}{\tau_0}\Big(1-\frac1c\Big)\|\Psi_d(\hat\beta-\beta_0)\|_1 \le \frac{4\lambda}{\tau_0}\|\Psi_d(\hat\beta_{S_\beta}-\beta_{0,S_\beta})\|_1 \le \frac{4\lambda\bar\psi\sqrt{s_\beta}}{\tau_0}\,\|\hat\beta_{S_\beta}-\beta_{0,S_\beta}\|_2, \tag{B.21}
\]
where we have used the fact that $\mathrm{Card}(S_\beta)\le s_\beta$ due to Assumption 3.1. Recall that $c>1$. Since $(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0)\ge 0$, inequality (B.20) implies a cone condition $\Psi_d(\hat\beta-\beta_0)\in\mathcal{C}[S_\beta,\bar c]$ for $\Psi_d(\hat\beta-\beta_0)$, which in turn implies (with probability tending to 1 as $n\to\infty$) a cone condition $\hat\beta-\beta_0\in\mathcal{C}[S_\beta,\bar c\,\bar\psi/\underline\psi]$ for $\hat\beta-\beta_0$.
Therefore, using (B.11) (where we recall that $\kappa(\hat\Sigma)=c_*\kappa_\Sigma$), we obtain that, with probability tending to 1 as $n\to\infty$,
\[
(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0) + \frac{2\lambda}{\tau_0}\Big(1-\frac1c\Big)\|\Psi_d(\hat\beta-\beta_0)\|_1 \le \frac{4\lambda\bar\psi\sqrt{s_\beta}}{\tau_0}\sqrt{\frac{(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0)}{c_*^2\kappa_\Sigma^2}}.
\]
Using here the inequality $ab\le(a^2+b^2)/2$, $\forall a,b>0$, we find that, with probability tending to 1 as $n\to\infty$,
\[
\frac12(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0) + \frac{2\lambda}{\tau_0}\Big(1-\frac1c\Big)\|\Psi_d(\hat\beta-\beta_0)\|_1 \le \frac{8\lambda^2\bar\psi^2 s_\beta}{\tau_0^2c_*^2\kappa_\Sigma^2}, \tag{B.22}
\]
which implies the first inequality in (B.18). Since $\lambda\lesssim\sqrt{\log(p)/n}$, the proof of (3.14) is complete.

B.3.2 Proof of (3.15)

Recall that $\hat\mu$ is defined as
\[
\hat\mu \in \arg\min_{\mu\in\mathbb{R}^p}\left(\frac1n\sum_{i=1}^n(1-D_i)\,h'(X_i^T\hat\beta)\big(Y_i-X_i^T\mu\big)^2 + \lambda'\sum_{j=1}^p\psi'_j|\mu_j|\right).
\]
Let $\Psi'\in\mathbb{R}^{p\times p}$ denote the diagonal matrix with diagonal entries $\psi'_1,\dots,\psi'_p$. We will prove that, with probability tending to 1 as $n\to\infty$,
\[
\|\Psi'(\hat\mu-\mu_0)\|_1 \lesssim \frac{s}{\kappa_\Sigma^2}\sqrt{\frac{\log(p)}{n}}. \tag{B.23}
\]
Using an argument analogous to that after (B.15), we easily get that (B.23) implies (3.15).
Step 1: Concentration inequality

Define $V_{i,j} := (1-D_i)\,h'(X_i^T\beta_0)\big[Y_i-X_i^T\mu_0\big]X_{i,j}$, set $\mathcal{S}'_j := \sum_{i=1}^n V_{i,j}/(\sqrt{n}\,\bar\psi'_j)$ and consider the event
\[
\mathcal{B}' := \left\{\frac1n\max_{1\le j\le p}\left|\sum_{i=1}^n\frac{V_{i,j}}{\bar\psi'_j}\right| \le \frac{\lambda'}{2c}\right\}.
\]
The random variables $V_{i,j}$, $i=1,\dots,n$, are i.i.d. and $E(V_{i,j})=0$, $E(V_{i,j}^2)\ge c$ and $E(|V_{i,j}|^3)<C$ for all $i,j$ by Assumptions 3.2 and 3.3. Using Assumptions 3.1, 3.6, and Lemma C.1, we obtain
\[
P\big(\mathcal{B}'^C\big) = P\left(\frac{2c}{\sqrt{n}}\max_{1\le j\le p}|\mathcal{S}'_j| > 2c\,\Phi^{-1}(1-\gamma/(2p))/\sqrt{n}\right) = P\left(\max_{1\le j\le p}|\mathcal{S}'_j| > \Phi^{-1}(1-\gamma/(2p))\right) = o(1)
\]
as $n\to\infty$.

Step 2: Control of the $\ell_1$-error for $\hat\mu$

Introduce the notation $\gamma_{\beta,\mu}(Z_i) = (1-D_i)\,h'(X_i^T\beta)\big(Y_i-X_i^T\mu\big)^2$. It follows from the definition of $\hat\mu$ that
\[
\frac1n\sum_{i=1}^n\big[\gamma_{\hat\beta,\hat\mu}(Z_i)-\gamma_{\hat\beta,\mu_0}(Z_i)\big] \le \lambda'\big(\|\Psi'\mu_0\|_1-\|\Psi'\hat\mu\|_1\big).
\]
Here
\[
\frac1n\sum_{i=1}^n\big[\gamma_{\hat\beta,\hat\mu}(Z_i)-\gamma_{\hat\beta,\mu_0}(Z_i)\big] = (\hat\mu-\mu_0)^T\left(\frac1n\sum_{i=1}^n(1-D_i)h'(X_i^T\hat\beta)X_iX_i^T\right)(\hat\mu-\mu_0) + \frac2n\sum_{i=1}^n(1-D_i)h'(X_i^T\hat\beta)(Y_i-X_i^T\mu_0)X_i^T(\mu_0-\hat\mu).
\]
Hence
\[
(\hat\mu-\mu_0)^T\left(\frac1n\sum_{i=1}^n(1-D_i)h'(X_i^T\hat\beta)X_iX_i^T\right)(\hat\mu-\mu_0)
\le \lambda'\big(\|\Psi'\mu_0\|_1-\|\Psi'\hat\mu\|_1\big) + \frac2n\sum_{i=1}^n(1-D_i)h'(X_i^T\beta_0)(Y_i-X_i^T\mu_0)X_i^T(\hat\mu-\mu_0) + R_n, \tag{B.24}
\]
where
\[
R_n = \frac2n\sum_{i=1}^n(1-D_i)\big[h'(X_i^T\hat\beta)-h'(X_i^T\beta_0)\big](Y_i-X_i^T\mu_0)X_i^T(\hat\mu-\mu_0)
= \frac2n\sum_{i=1}^n(1-D_i)(Y_i-X_i^T\mu_0)h''(X_i^T\tilde\beta)\,X_i^T(\hat\beta-\beta_0)\,X_i^T(\hat\mu-\mu_0)
\]
with $\tilde\beta=t\hat\beta+(1-t)\beta_0$ for some $t\in[0,1]$. Setting
\[
A = \frac2n\sum_{i=1}^n(1-D_i)(Y_i-X_i^T\mu_0)h''(X_i^T\tilde\beta)\,X_iX_i^T,
\]
we can write $R_n=(\hat\beta-\beta_0)^TA(\hat\mu-\mu_0)$. From (B.24) we deduce that on the event $\mathcal{B}'$, which occurs with probability tending to 1 as $n\to\infty$,
\[
(\hat\mu-\mu_0)^T\left(\frac1n\sum_{i=1}^n(1-D_i)h'(X_i^T\hat\beta)X_iX_i^T\right)(\hat\mu-\mu_0) \le \lambda'\big(\|\Psi'\mu_0\|_1-\|\Psi'\hat\mu\|_1\big) + \frac{\lambda'}{c}\|\Psi'(\hat\mu-\mu_0)\|_1 + (\hat\beta-\beta_0)^TA(\hat\mu-\mu_0).
\]
We now use (B.17) (noticing that $\hat\beta=\tilde\beta$ for $t=1$) to obtain that, with probability tending to 1 as $n\to\infty$,
\[
\tau_0(\hat\mu-\mu_0)^T\hat\Sigma(\hat\mu-\mu_0) \le \lambda'\big(\|\Psi'\mu_0\|_1-\|\Psi'\hat\mu\|_1\big) + \frac{\lambda'}{c}\|\Psi'(\hat\mu-\mu_0)\|_1 + (\hat\beta-\beta_0)^TA(\hat\mu-\mu_0). \tag{B.25}
\]
Next, observe that, with probability tending to 1 as $n\to\infty$,
\[
(\hat\beta-\beta_0)^TA(\hat\mu-\mu_0) - \frac{\tau_0}{2}(\hat\mu-\mu_0)^T\hat\Sigma(\hat\mu-\mu_0) \le C(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0), \tag{B.26}
\]
where $C>0$ does not depend on $n$. To see this, set $u_i=(\hat\beta-\beta_0)^TX_i$, $v_i=(\hat\mu-\mu_0)^TX_i$ and $a_i=(Y_i-X_i^T\mu_0)h''(X_i^T\tilde\beta)$. We have
\[
(\hat\beta-\beta_0)^TA(\hat\mu-\mu_0) - \frac{\tau_0}{2}(\hat\mu-\mu_0)^T\hat\Sigma(\hat\mu-\mu_0)
= \frac{\tau_0}{2n}\sum_{i=1}^n(1-D_i)\Big(\frac{4a_i}{\tau_0}u_iv_i - v_i^2\Big)
\le \frac{2}{\tau_0 n}\sum_{i=1}^n(1-D_i)a_i^2u_i^2.
\]
Since, with probability tending to 1 as $n\to\infty$, we have $\max_i|a_i|\le C$ for a constant $C>0$ that does not depend on $n$, inequality (B.26) follows. We also note that, due to (B.22), with probability tending to 1 as $n\to\infty$,
\[
(\hat\beta-\beta_0)^T\hat\Sigma(\hat\beta-\beta_0) \lesssim \frac{\lambda^2 s_\beta}{\kappa_\Sigma^2} \lesssim \frac{s_\beta\log(p)}{n\kappa_\Sigma^2}. \tag{B.27}
\]
Combining (B.25), (B.26) and (B.27), we finally get that, with probability tending to 1 as $n\to\infty$,
\[
\frac{\tau_0}{2}(\hat\mu-\mu_0)^T\hat\Sigma(\hat\mu-\mu_0) \le \lambda'\left(\|\Psi'\mu_0\|_1-\|\Psi'\hat\mu\|_1+\frac1c\|\Psi'(\hat\mu-\mu_0)\|_1\right) + \frac{\bar c_1 s_\beta\log(p)}{n\kappa_\Sigma^2}, \tag{B.28}
\]
where $\bar c_1>0$ does not depend on $n$. Let $S_\mu\subseteq\{1,\dots,p\}$ denote the set of indices of the non-zero components of $\mu_0$. By assumption, $\mathrm{Card}(S_\mu)=\|\mu_0\|_0\le s_\mu$. The same argument as in (B.19) (where we replace $\beta_0,\hat\beta,\Psi_d,S_\beta$ by $\mu_0,\hat\mu,\Psi',S_\mu$, respectively) yields
\[
\|\Psi'\mu_0\|_1-\|\Psi'\hat\mu\|_1+\frac1c\|\Psi'(\hat\mu-\mu_0)\|_1 \le \Big(1+\frac1c\Big)\|\Psi'(\hat\mu_{S_\mu}-\mu_{0,S_\mu})\|_1 - \Big(1-\frac1c\Big)\|\Psi'(\hat\mu_{S_\mu^C}-\mu_{0,S_\mu^C})\|_1.
\]
This and (B.28) imply that, with probability tending to 1 as $n\to\infty$,
\[
\frac{\tau_0}{2}(\hat\mu-\mu_0)^T\hat\Sigma(\hat\mu-\mu_0) + \lambda'\Big(1-\frac1c\Big)\|\Psi'(\hat\mu-\mu_0)\|_1 \le 2\lambda'\|\Psi'(\hat\mu_{S_\mu}-\mu_{0,S_\mu})\|_1 + \frac{\bar c_1 s_\beta\log(p)}{n\kappa_\Sigma^2}, \tag{B.29}
\]
where we have used the fact that $c>1$. We now consider two cases.

1. $\lambda'\|\Psi'(\hat\mu_{S_\mu}-\mu_{0,S_\mu})\|_1 \le \bar c_1 s_\beta\log(p)/(n\kappa_\Sigma^2)$. In this case, inequality (B.29) implies
\[
\|\Psi'(\hat\mu-\mu_0)\|_1 \lesssim \frac{s_\beta\log(p)}{\lambda' n\kappa_\Sigma^2}
\]
and (B.23) follows immediately since $\sqrt{\log(p)/n}\lesssim\lambda'$. Consequently, (3.15) holds in this case.

2. $\lambda'\|\Psi'(\hat\mu_{S_\mu}-\mu_{0,S_\mu})\|_1 > \bar c_1 s_\beta\log(p)/(n\kappa_\Sigma^2)$. Then with probability tending to 1 as $n\to\infty$ we have
\[
\frac{\tau_0}{2}(\hat\mu-\mu_0)^T\hat\Sigma(\hat\mu-\mu_0) + \lambda'\Big(1-\frac1c\Big)\|\Psi'(\hat\mu-\mu_0)\|_1 \le 3\lambda'\|\Psi'(\hat\mu_{S_\mu}-\mu_{0,S_\mu})\|_1.
\]
This inequality is analogous to (B.21). In particular, it implies the cone condition, which now takes the form $\|\Psi'(\hat\mu_{S_\mu^C}-\mu_{0,S_\mu^C})\|_1 \le \alpha\|\Psi'(\hat\mu_{S_\mu}-\mu_{0,S_\mu})\|_1$ with $\alpha=5c/(c-1)$. Arguing then as in the proof of (B.22), with the RE condition applied on the corresponding cone, we obtain that, with probability tending to 1 as $n\to\infty$,
\[
\frac{\tau_0}{2}(\hat\mu-\mu_0)^T\hat\Sigma(\hat\mu-\mu_0) + C\lambda'\|\Psi'(\hat\mu-\mu_0)\|_1 \lesssim \frac{(\lambda')^2 s_\mu}{\kappa_\Sigma^2}, \tag{B.30}
\]
where $C>0$ does not depend on $n$. Since $\lambda'\lesssim\sqrt{\log(p)/n}$, we get (B.23).

Thus, the proof of (3.15) is complete. $\Box$
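The immunization-step estimator $\hat\mu$ analysed above is a weighted $\ell_1$-penalized least-squares problem and can be computed along the same lines as the balancing step. The sketch below again assumes, purely for illustration, the exponential link (so that $h'=\exp$); it is a plain proximal-gradient loop, not the paper's implementation.

```python
import numpy as np

def fit_penalized_immunization(X, Y, D, beta_hat, lam_p, psi_p,
                               h_prime=np.exp, n_iter=2000, step=None):
    """ISTA sketch of the weighted-l1 'immunization' step:
    min_mu  n^{-1} sum_i (1-D_i) h'(X_i'beta_hat) (Y_i - X_i'mu)^2
            + lam_p * sum_j psi_p[j] * |mu_j|."""
    n, p = X.shape
    w = (1 - D) * h_prime(X @ beta_hat)           # non-negative observation weights
    if step is None:
        # safe step size: 1 / Lipschitz constant of the gradient of the smooth part
        L = 2 * np.linalg.norm((X * w[:, None]).T @ X / n, 2)
        step = 1.0 / L
    mu = np.zeros(p)
    for _ in range(n_iter):
        grad = -2 * X.T @ (w * (Y - X @ mu)) / n  # gradient of the weighted squared loss
        z = mu - step * grad
        mu = np.sign(z) * np.maximum(np.abs(z) - step * lam_p * psi_p, 0.0)
    return mu
```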
C Auxiliary Lemmas

Lemma C.1 (Deviation of maximum of self-normalized sums)
Consider
\[
\mathcal{S}_j := \sum_{i=1}^n U_{i,j}\left(\sum_{i=1}^n U_{i,j}^2\right)^{-1/2},
\]
where the $U_{i,j}$ are independent random variables across $i$ with mean zero and, for all $i,j$, $E[|U_{i,j}|^3]\le C_1$ and $E[U_{i,j}^2]\ge C_2$ for some positive constants $C_1,C_2$ independent of $n$. Let $p=p(n)$ satisfy the condition $\log(p)=o(n^{1/3})$ and let $\gamma=\gamma(n)\in(0,1)$ be such that $\log(1/\gamma)\lesssim\log(p)$. Then,
\[
P\left(\max_{1\le j\le p}|\mathcal{S}_j| > \Phi^{-1}(1-\gamma/(2p))\right) \le \gamma(1+o(1)) \quad \text{as } n\to\infty.
\]
Proof.
We use a corollary of a result from de la Peña et al. (2009) given by Belloni et al. (2012, p. 2409), which in our case can be stated as follows. Let $\mathcal{S}_j$ and $U_{i,j}$ satisfy the assumptions of the present lemma. If there exist positive numbers $\ell_n\to\infty$ and $\gamma=\gamma(n)\in(0,1)$ such that
\[
0 < \Phi^{-1}(1-\gamma/(2p)) \le \frac{C_2^{1/2}}{C_1^{1/3}}\,\frac{n^{1/6}}{\ell_n} - 1, \tag{C.1}
\]
then
\[
P\left(\max_{1\le j\le p}|\mathcal{S}_j| > \Phi^{-1}(1-\gamma/(2p))\right) \le \gamma\left(1+\frac{A}{\ell_n^3}\right),
\]
where $A>0$ is an absolute constant. Since $\Phi^{-1}(1-\gamma/(2p))\le\sqrt{2\log(2p/\gamma)}$, and since we assume that $\log(1/\gamma)\lesssim\log(p)$ and $\log(p)=o(n^{1/3})$, condition (C.1) is satisfied with $\ell_n=\ell(n)=(n^{1/3}/\log(p))^{1/4}$ for $n$ large enough. Then $\ell(n)\to\infty$ as $n\to\infty$ and the lemma follows. $\Box$
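A quick simulation gives a feel for the bound in Lemma C.1: with bounded, mean-zero variables, the frequency with which $\max_j|\mathcal{S}_j|$ exceeds $\Phi^{-1}(1-\gamma/(2p))$ should not be much larger than $\gamma$. The numbers below (sample size, dimension, number of replications) are arbitrary illustration choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, gamma, reps = 500, 40, 0.05, 2000
threshold = norm.ppf(1 - gamma / (2 * p))

exceed = 0
for _ in range(reps):
    U = rng.uniform(-1, 1, size=(n, p))                   # mean zero, bounded moments
    S = U.sum(axis=0) / np.sqrt((U ** 2).sum(axis=0))     # self-normalized sums
    exceed += np.max(np.abs(S)) > threshold

print(f"empirical exceedance: {exceed / reps:.3f}  (Lemma C.1 bound: about {gamma})")
```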
Lemma C.2 Let $s_0\in[1,p/2]$ be an integer and $\alpha>0$. Let $V\in\mathbb{R}^p$ be a random vector such that $\|V\|_\infty\le M<\infty$ (a.s.), and set $\Sigma=E(VV^T)$. Let $\Sigma$ satisfy (3.13), and assume that
\[
s_0/\kappa_\Sigma^2 = o(p) \ \text{ as } n\to\infty, \qquad \text{and} \qquad s_0 \lesssim \frac{n\,\kappa_\Sigma^4}{\log(p)\log^3(n)}. \tag{C.2}
\]
Consider i.i.d. random vectors $V_1,\dots,V_n$ with the same distribution as $V$. Then, for all $n$ large enough, with probability at least $1-\exp(-C\log(p)\log^3(n))$, where $C>0$ is a constant depending only on $M$, the empirical matrix
\[
\hat\Sigma = \frac1n\sum_{i=1}^n V_iV_i^T
\]
satisfies the RE($s_0,\alpha$) condition with $\kappa(\hat\Sigma)=c_*\kappa_\Sigma$, where $c_*\in(0,1)$ is an absolute constant.

Proof. We set the parameters of Theorem 22 in Rudelson and Zhou (2013) as follows: $s_0=s_0$, $k_0=\alpha$, and due to (3.13) we have, in the notation of that theorem, $K(s_0,k_0,\Sigma^{1/2})\le 1/\kappa_\Sigma$ and $\rho^2\ge\kappa_\Sigma^2$. Also note that $\|\Sigma^{1/2}e_j\|_2^2=E[(V^Te_j)^2]\le M^2$, where $e_j$ denotes the $j$-th canonical basis vector in $\mathbb{R}^p$ (this, in particular, implies that $\kappa_\Sigma\le M$). Thus, the value $d$ defined in Theorem 22 in Rudelson and Zhou (2013) satisfies $d\lesssim s_0/\kappa_\Sigma^2$, and the condition $d\le p$ holds true for $n$ large enough due to (C.2). Next, note that the condition $n\gtrsim x\log^3(x)$ is satisfied for all $x\le n/\log^3(n)$ and $n\ge 3$, so that the penultimate display formula in Theorem 22 of Rudelson and Zhou (2013) can be written as $d\log(p)/\rho^2\lesssim n/\log^3(n)$. Given the above bounds for $d$ and $\rho$, a sufficient condition for this inequality is $s_0\log(p)/\kappa_\Sigma^4\lesssim n/\log^3(n)$, which is granted by (C.2). Thus, all the conditions of Theorem 22 in Rudelson and Zhou (2013) are satisfied and we find that, for all $n$ large enough, with probability at least $1-\exp(-C\log(p)\log^3(n))$, where $C>0$ depends only on $M$, we have
\[
\min_{S\subseteq\{1,\dots,p\}:\,\mathrm{Card}(S)=s_0}\ \min_{v\in\mathcal{C}[S,\alpha]}\ \frac{v^T\hat\Sigma v}{\|v_S\|_2^2} \ge (1-\delta)^2\kappa_\Sigma^2, \tag{C.3}
\]
where $\delta\in(0,1/5)$ (remark that there is a typo in Theorem 22 in Rudelson and Zhou (2013) that is corrected in (C.3): the last formula of that theorem should be $(1-\delta)\|\Sigma^{1/2}u\|_2 \le \|Xu\|_2/\sqrt{n} \le (1+3\delta)\|\Sigma^{1/2}u\|_2$, where $0<\delta<1/5$). Although (C.3) involves the minimum over sets with $\mathrm{Card}(S)=s_0$ rather than $\mathrm{Card}(S)\le s_0$ as in (B.11), these two conditions are equivalent. Indeed, as shown in (Bellec et al., 2018, page 3607),
\[
\bigcup_{S\subseteq\{1,\dots,p\}:\,\mathrm{Card}(S)\le s_0}\big\{v\in\mathbb{R}^p:\ \|v_{S^C}\|_1\le\alpha\|v_S\|_1\big\} = \Big\{v\in\mathbb{R}^p:\ \|v\|_1\le(1+\alpha)\sum_{j=1}^{s_0}v_j^*\Big\},
\]
where $v_1^*\ge\dots\ge v_p^*$ denotes a non-increasing rearrangement of $|v_1|,\dots,|v_p|$. On the other hand,
\[
\bigcup_{S\subseteq\{1,\dots,p\}:\,\mathrm{Card}(S)=s_0}\big\{v\in\mathbb{R}^p:\ \|v_{S^C}\|_1\le\alpha\|v_S\|_1\big\} \supseteq \big\{v\in\mathbb{R}^p:\ \|v_{S^*(v)^C}\|_1\le\alpha\|v_{S^*(v)}\|_1\big\} = \Big\{v\in\mathbb{R}^p:\ \|v\|_1\le(1+\alpha)\sum_{j=1}^{s_0}v_j^*\Big\},
\]
where $S^*(v)$ is the set of the $s_0$ largest in absolute value components of $v$. $\Box$