Exact Trend Control in Estimating Treatment Effects Using Panel Data with Heterogenous Trends
Chirok Han*
Department of Economics, Korea University
This Version: June 2020
Abstract
For a panel model considered by Abadie et al. (2010), the counterfactual outcomes constructed by Abadie et al., Hsiao et al. (2012), and Doudchenko and Imbens (2017) may all be confounded by uncontrolled heterogenous trends. Based on exact matching on the trend predictors, I propose new methods of estimating the model-specific treatment effects, which are free from heterogenous trends. When applied to Abadie et al.'s (2010) model and data, the new estimators suggest considerably smaller effects of California's tobacco control program.
Key Words:
Synthetic control, difference-in-differences, heterogenous trends, panel data, treatment effects, matching, balancing, multiple control groups, regularization, constrained ridge, constrained lasso, constrained elastic net.
JEL Classification:
C01, C1

* Department of Economics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, Korea. [email protected]. The author thanks Professor Myoung-jae Lee and Changhui Kang for useful comments.

Introduction
In this paper I propose new methods of estimating treatment effects for panel models with heterogenous trends. Two motivational numerical examples are illustrated in Figure 1 based on simulated data generated by a model considered by Abadie, Diamond and Hainmueller (2010, ADH), with details given in Appendix A.2. Trends are plotted in the figure for the true untreated outcomes, the ADH synthetic control outcomes, and the construction by one of my new methods. In part (a) of Figure 1, the ADH synthetic control outcomes are far from the truth even for the pre-treatment periods, presumably due to the violation of the convexity or interpolation assumption (ADH, 2010; Gobillon and Magnac, 2016; see also Figure 5 in the appendix for the generated untreated outcomes). Though not very useful, the ADH results at least do not mislead the researcher, as their inappropriateness is unequivocal. In part (b), however, the ADH synthetic control looks flawless for the pre-treatment periods, but the post-treatment synthetic control outcomes are far from the truth. Later developments such as Hsiao, Ching and Wan (2012, HCW hereafter) and Doudchenko and Imbens (2017) suffer from similar biases, while the methods I propose in this paper work well, as Figure 1 shows.

The model considered here is identical to ADH's (2010) and is given by

(1) y⁰_it = μ_i + γ_t′z_i + δ_t′h_i + u_it,  y¹_it = τ_it + y⁰_it,

where z_i and y_it are observed, with y_it = y¹_it if the ith unit is treated in period t and y_it = y⁰_it otherwise. Unit 1 is treated for t > T, and the remaining units (i = 2, …, J+1) are untreated for all t. The unobservable trends γ_t and δ_t are fixed effects that can be dependent on any other random variables. For example, γ_t can be small in magnitude in the pre-treatment periods and large in the post-treatment periods. Similarly, the μ_i are arbitrary fixed effects.
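The simulation design of Appendix A.2 is not reproduced in this excerpt; purely as an illustration, a model of the form (1) can be simulated along the following lines. All dimensions, trend paths, and the constant effect τ_t = 1 below are my own hypothetical choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
J, T_pre, T_post = 20, 10, 5          # untreated units, pre/post periods (assumed sizes)
T_total = T_pre + T_post

# Trend predictors and latent loadings: one nonconstant z component, one latent factor
x = rng.normal(size=J + 1)            # nonconstant part of z_i (K = 1)
h = rng.normal(size=J + 1)            # latent factor loading h_i (r = 1)
mu = rng.normal(size=J + 1)           # unit fixed effects mu_i

# Time effects gamma_t, delta_t are fixed sequences and are allowed to trend
t = np.arange(1, T_total + 1)
gamma = 0.5 * t                       # common shock loading on x_i
delta = 0.3 * np.sqrt(t)              # common shock loading on h_i

# Untreated potential outcomes y0[t, i] = mu_i + gamma_t * x_i + delta_t * h_i + u_it
u = rng.normal(scale=0.1, size=(T_total, J + 1))
y0 = mu + np.outer(gamma, x) + np.outer(delta, h) + u

# Unit 1 (column 0) is treated after period T_pre with effect tau_t = 1
y = y0.copy()
y[T_pre:, 0] += 1.0
```

Because γ_t and δ_t grow over time while the x_i and h_i differ across units, each unit follows its own trend, which is the confounding the paper is concerned with.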
Figure 1: Trends of counterfactual outcomes. (a) Pre-treatment outcomes not traced by ADH's synthetic control. (b) Bias in post-treatment counterfactual outcome estimation. (Axes: period versus outcome, with the treatment date marked; series: true counterfactual, ADH (2010), this paper.)

Note. Simulated data. See Appendix A.2 for the data generating processes. ADH's (2010) counterfactual trends are found using the R package Synth. (a) The ADH counterfactual outcomes are far from the truth even for the pre-treatment period. (b) The ADH synthetic control is flawless in the pre-treatment period, but ADH's post-treatment counterfactual trend is severely biased.

The unobservable constituents μ_i, γ_t and δ_t should be normalized somehow for identification, but how they are normalized is of no consequence because I take the difference-in-differences (DID) approach. The observed vector z_i contains K + 1 components, including the constant term for common time effects, and the unobservable h_i has r elements, where K and r are typically small. The variables z_i and h_i determine how each unit responds to the common shocks γ_t and δ_t. The random errors u_it are assumed to have zero mean conditional on z_i and h_i.

The goal is to find w₂, …, w_{J+1} such that the linear combination Σ_{j=2}^{J+1} w_j y_jt forms a sensible counterfactual comparison for the treated unit while controlling for the trends due to the common shocks γ_t and δ_t. Unlike Doudchenko and Imbens (2017), I firmly base my analysis on the model: I seek w₂, …, w_{J+1} such that y_1t − Σ_{j=2}^{J+1} w_j y_jt is free from confounding trends driven by γ_t′z_i and δ_t′h_i in the model. Identification is sought not by algorithm but by the model and the population distribution of the related random variables.

My approach begins with distinguishing the variables responsible for trend heterogeneity from those which are balanced on in order to enhance comparability. When a set of variables (such as z_i and h_i) are responsible for heterogenous trends, they should be exactly balanced on by hard constraints in order to avoid bias due to uncontrolled trends, since the components γ_t, δ_t, z_i and h_i are fixed effects; the balancing covariates, such as pre-treatment outcomes, on the other hand, need not be exactly matched on.

The importance of exact matching on z_i and h_i has been overlooked in the literature. As discussed in Section 3 later, ADH's (2010) algorithm is relevant in a subtle way, but their nonnegativity constraint is obstruent. HCW's (2012) regression-based method and Doudchenko and Imbens's (2017) elastic-net proposal do not attend to the heterogenous trends γ_t′z_i and δ_t′h_i. Consequences of ignoring their significance are visible in Figure 1 above and in Figure 4 later in Section 3.3.

Exact balancing on trending covariates does not obliterate the necessity of regularization, especially when the number J of the untreated units is comparable to or larger than the number of balancing covariates, as is the case in many applications. Without regularization the weight vector may not be uniquely identified.
Undoubtedly, all the extant methods implement regularization in some way. ADH (2010) impose the nonnegativity and adding-up constraints as hard restrictions. HCW (2012) select a subset of control groups based on the researcher's judgment. Doudchenko and Imbens (2017) implement elastic-net penalties. I also consider regularization, where the penalty term is motivated by ordinary least squares (OLS) rather than given heuristically. My proposal leads to a 'constrained ridge' regression and its lasso and elastic-net variants, all of which are now well accepted by the econometric community.

The rest of this paper is organized as follows. Section 2 presents the new estimators, and Section 3 compares them with extant estimators. The last section contains concluding remarks. All the proofs are gathered in the appendix, which also contains discussions on establishing asymptotics. Throughout the paper, Y_t and U_t denote the J × 1 vectors of y_jt and u_jt, respectively, for j ≥ 2, i.e., for the untreated units. The weight vector (w₂, …, w_{J+1})′ is denoted by w, and Z is the (K+1) × J matrix (z₂, …, z_{J+1}). The exact-balancing restriction for z_i is thus written as z₁ = Zw.

This section presents the new estimators. Section 2.1 considers a model with a common component γ_t′z_i but without latent factors in order to motivate the exact-matching constraint z₁ = Zw and regularization. Section 2.2 considers the same model but introduces balancing covariates. Section 2.3 makes an extension to models with unobservable common factors.

To begin with, consider the model in (1) without h_i, so the potential untreated outcomes are modeled by y⁰_it = μ_i + γ_t′z_i + u_it, where u_it shows no systematic trends if the model is correctly specified. For s and t with s ≤ T < t, where T is the last period before treatment, we have

(2) y_it − y_is = τ_t I(i = 1) + (γ_t − γ_s)′z_i + (u_it − u_is),  i = 1, 2, …
, J + 1, with I(·) denoting the indicator function. In (2), a post-treatment period t is compared with a single pre-treatment period s for the sake of simple exposition. Generalization by changing y_is to T⁻¹Σ_{s=1}^{T} y_is or any other weighted average makes no serious difference in the arguments to follow; likewise, y_it can be replaced with an average over the post-treatment periods.

An obvious estimator of τ_t in (2) can be obtained by the OLS regression of y_it − y_is on I(i = 1) and z_i using the J + 1 cross-sectional observations as the sample. Because the dummy variable I(i = 1) has value 1 only for i = 1, the OLS estimator of γ_t − γ_s is also obtained by regressing y_it − y_is on z_i using i ≥ 2, and then τ_t is estimated as the prediction error for i = 1. That is, the OLS estimator of γ_t − γ_s is (ZZ′)⁻¹Z(Y_t − Y_s), and τ̂_t = (y_1t − y_1s) − (Y_t − Y_s)′Z′(ZZ′)⁻¹z₁. With w_a denoting Z′(ZZ′)⁻¹z₁, this τ̂_t is written as τ̂_t = (y_1t − y_1s) − (Y_t − Y_s)′w_a, which is the DID estimator using Y_t′w_a as the constructed control group. In the simple case of z_i = 1, the elements of w_a are uniform, i.e., w_a = J⁻¹(1, 1, …, 1)′, and Y_t′w_a is the unweighted average of y_jt over the untreated units. In this sense w_a generalizes the unweighted averaging operator. Note that w_a depends on the z_i only, and the choice of s and t is irrelevant.

The weight vector w_a eliminates the confounding trends driven by z_i from y_1t − Y_t′w_a because

y_1t − Y_t′w_a = (τ_t + μ₁ + γ_t′z₁ + u_1t) − (μ′w_a + γ_t′Zw_a + U_t′w_a) = τ_t + (μ₁ − μ′w_a) + (u_1t − U_t′w_a),  μ = (μ₂, …, μ_{J+1})′,

and thus the DID estimator τ̂_t satisfies τ̂_t = τ_t + [(u_1t − u_1s) − (U_t − U_s)′w_a]. We clearly have
E(τ̂_t) = τ_t because w_a is a function of z₁, …, z_{J+1}, provided that the random disturbances u_jt have zero mean for all t conditional on the trending covariates z₁, …, z_{J+1}.

The above w_a is not the only weight vector that gives an unbiased estimator of τ_t by DID. Any w satisfying z₁ = Zw and E(U_t′w) = 0 works, because then y⁰_1t − Y_t′w = (μ₁ − μ′w) + (u_1t − U_t′w). Given the arbitrariness of γ_t, unbiased estimation of τ_t requires z₁ = Zw as a minimal condition, which is the exact-balancing constraint emphasized in the introduction, and which w_a turns out to satisfy.

It is noteworthy that w_a = Z′(ZZ′)⁻¹z₁ is the solution to the constrained ℓ₂ minimization

(3) min_w w′w subject to z₁ = Zw.

(See the appendix for a proof that w_a solves (3).) That is, w_a is the smallest (in terms of Euclidean norm) of the vectors satisfying z₁ = Zw. Under the iid assumption for u_jt, w_a also minimizes the sampling variability in the constructed counterfactual outcomes conditional on z₁, …, z_{J+1}, since var(U_t′w) = σ²_u w′w for nonrandom w. In plain words, Y_t′w_a would exhibit the least fluctuation over time while satisfying z₁ = Zw_a.

It is subtle to discuss how a weight w is defined for the model y⁰_it = μ_i + γ_t′z_i + u_it. For a given w, let τ̂_t(w) = (y_1t − Y_t′w) − (y_1s − Y_s′w) for s ≤ T < t, which is the DID estimator using Y_t′w as the constructed comparison group. The restriction that τ̂_t(w) should be unbiased for τ_t alone does not identify a w in the population, since z₁ = Zw and E(U_t′w) = 0 are satisfied by infinitely many w's if J > K + 1. For example, when z_i = 1, any J × 1 vector of fixed numbers that sum to 1, such as the uniform weights, uneven weights, non-convex weights with some negative elements, and infinitely many others, allows τ̂_t(w) to be unbiased for τ_t if the model is correctly specified so that E(u_it | z₁, …, z_{J+1}) = 0 for all t. The weight w_a = Z′(ZZ′)⁻¹z₁ is just one particular choice that generalizes the uniform weights. The identification of w_a additionally requires the minimization of w′w in (3) on top of the unbiasedness requirement (z₁ = Zw).

A natural alternative to w′w in (3) is the ℓ₁ norm ‖w‖₁ = Σ_{j=2}^{J+1} |w_j|, which leads to

(4) min_w ‖w‖₁ subject to z₁ = Zw,

a constrained ℓ₁ minimization problem, also known as basis pursuit (see Mallat, 2009, Chapter 12). Algorithms using the Alternating Direction Method of Multipliers (ADMM) are available for this problem (the R package ADMM). The minimization problem (4) can also be written as the standard linear program

(5) min_{w⁺, w⁻} Σ_{j=2}^{J+1} (w⁺_j + w⁻_j) subject to z₁ = Zw⁺ − Zw⁻, w⁺_j ≥ 0, w⁻_j ≥ 0 ∀j,

because w = w⁺ − w⁻ and |w_j| = w⁺_j + w⁻_j for w⁺_j = max(w_j, 0) and w⁻_j = −min(w_j, 0). Note that the ℓ₁ minimization problem does not necessarily have a unique solution (e.g., when z_i = 1), in which case we can minimize εw′w + ‖w‖₁ instead of ‖w‖₁ for some small positive constant ε to achieve uniqueness (see Gaines et al., 2018, p. 863). The elastic-net style loss function (1 − α)w′w + α‖w‖₁ using other α parameter values can also be used. The elastic-net minimization can be implemented as a constrained lasso using α‖w‖₁ as the penalty, the zero vector as the response vector, and (1 − α)^{1/2} I_J as the feature matrix. See James et al.
(2019) for a fast algorithm for the constrained lasso and its implementation in the R package PACLasso.

Given a weight vector w, the presence of systematic trends in the prediction error y_1t − Y_t′w can be tested for the pre-treatment periods by regressing it on t, unless T is too small. There is no 'generated regressors' problem if w is a function of z₁, …, z_{J+1}. In addition, the mutual compatibility of two estimated weight vectors, w⁽¹⁾ and w⁽²⁾ say, can be tested by regressing Y_t′w⁽¹⁾ − Y_t′w⁽²⁾ on t − T, after_t, and after_t·(t − T) using all the observations, where after_t is the dummy variable for t > T. Overall significance can be interpreted as evidence of model misspecification, although overall insignificance does not necessarily imply correct model specification, because U_t′[w⁽¹⁾ − w⁽²⁾] can show no systematic trends while some u_it's still do. If z_i contains pre-treatment outcomes (e.g., ADH, 2010), the estimated w is not necessarily exogenous, and the generated-regressors problem applies. In that case, the testing results should be taken only as a diagnostic summary statistic. In all cases, decision by human intuition using visual examination, rather than formal testing, is a promising alternative.

With regard to how to present the estimated counterfactual outcomes, if z_i contains no pre-treatment dependent variables, then Y_t′w and y_1t may have systematically different levels, just as in the standard DID framework. The counterfactual outcomes are thus better presented by c + Y_t′w, where the intercept c deals with the pre-treatment level difference. For example, c can be the average of y_1s − Y_s′w_a over the pre-treatment periods. This modification does not change anything about the estimation of treatment effects but only helps presentation.
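As a minimal numerical sketch of the Section 2.1 construction (simulated data; all sizes and parameter values are my own illustrative assumptions), the weights w_a = Z′(ZZ′)⁻¹z₁, the DID estimate, and the pre-treatment trend diagnostic just described can be computed as:

```python
import numpy as np

rng = np.random.default_rng(1)
J, T = 20, 10                          # untreated units; T = last pre-treatment period
K = 2                                  # nonconstant trend predictors

# z_i = (1, x_i')'; z1 for the treated unit, Z is (K+1) x J for the untreated units
X = rng.normal(size=(K, J + 1))
Zstar = np.vstack([np.ones(J + 1), X])
z1, Z = Zstar[:, 0], Zstar[:, 1:]

# w_a = Z'(ZZ')^{-1} z1 solves min w'w subject to z1 = Zw
w_a = Z.T @ np.linalg.solve(Z @ Z.T, z1)

# Simulated outcomes with heterogenous trends gamma_t' z_i; effect tau = 2 after T
t_all = np.arange(1, T + 6)
Gamma = np.vstack([0.2 * t_all, 0.5 * t_all, -0.3 * t_all]).T   # (T+5) x (K+1)
y = Gamma @ Zstar + rng.normal(scale=0.05, size=(T + 5, J + 1))
y[T:, 0] += 2.0                        # treatment effect on unit 1

# DID: average post-treatment gap minus average pre-treatment gap
gap = y[:, 0] - y[:, 1:] @ w_a
tau_hat = gap[T:].mean() - gap[:T].mean()

# Diagnostic: regress the pre-treatment gap on t; the slope should be near zero
slope = np.polyfit(np.arange(1, T + 1), gap[:T], 1)[0]
```

Because z₁ = Zw_a holds exactly, the γ_t′z_i trends cancel from the gap, so the pre-treatment slope is driven by noise only and tau_hat is close to the true effect of 2.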
We have thus far considered controlling for heterogenous trends driven by γ_t′z_i by imposing the exact-matching constraint that z₁ = Zw. In most applications the number K of the nonconstant variables in z_i is much smaller than the number J of untreated units, and the restrictions z₁ = Zw do not identify a unique w. As a supplementary means of identifying a single vector, we have considered minimizing the ℓ₂, the ℓ₁, or an elastic-net norm of w.

Now, besides the trending covariates z_i, the researcher may also want some other variables to be balanced on in pursuit of robustness against outliers or local misspecification. Typical balancing covariates include pre-treatment outcomes or their deviations from the pre-treatment average, while other exogenous features such as post-treatment controls can also be taken into consideration. Unlike the trend predictors z_i, these balancing covariates need not be matched on exactly.

Let q_i denote the m × 1 vector of such balancing covariates, e.g., q_i = (y_i1, …, y_iT)′, where m can be larger or smaller than J. Let Q be the m × J matrix of q_i for the untreated units, i.e., Q = (q₂, …, q_{J+1}). Matching seeks to make (q₁ − Qw)′(q₁ − Qw) as small as possible, which leads to a natural extension of (3):

(6) min_w (q₁ − Qw)′(q₁ − Qw) + λw′w subject to z₁ = Zw

for a user-specified tuning parameter λ ≥ 0 (and λ > 0 if Q′Q is singular). This is a constrained ridge (CRIDGE) regression of q₁ on Q with penalty λw′w and constraints z₁ = Zw. The shrinkage parameter λ inversely relates to the desired matching quality relative to the magnitude of w′w. If λ = 0 (allowed if Q′Q is nonsingular), we pursue the best matching without shrinkage. If λ = ∞, we give up on balancing and pursue maximal shrinkage, leading to the w_a of the previous section. A finite positive λ is a compromise.
In all cases, we explicitly impose the restrictions that z₁ = Zw, and thus heterogenous trends due to different z_i are perfectly controlled for.

Given λ, the solution to (6) is

(7) ŵ = w̃_ridge + G_λ⁻¹Z′(ZG_λ⁻¹Z′)⁻¹(z₁ − Zw̃_ridge),  G_λ = Q′Q + λI_J,

where w̃_ridge = G_λ⁻¹Q′q₁ is the unconstrained ridge estimator (see the appendix for a proof). Note that G_λ is invertible if λ > 0, whether or not Q′Q is, and thus ŵ is well defined if Z is of full row rank and λ > 0. The resulting treatment effect estimators are obtained by DID using Y_t′ŵ as the constructed control group.

There is a more revealing expression for ŵ than (7). To derive it, let us first partial out z_i from Q and from q₁. Precisely, let B = QZ′(ZZ′)⁻¹, the matrix of the OLS estimators from the regression of the rows of Q on Z′, and let Q̃ = Q − BZ and q̃ = q₁ − Bz₁, the prediction errors. Then ŵ is decomposed as follows:

(8) ŵ = w_a + ŵ_b,  w_a = Z′(ZZ′)⁻¹z₁,  ŵ_b = (Q̃′Q̃ + λI)⁻¹Q̃′q̃,

which is the sum of the maximum-shrinkage estimator w_a subject to z₁ = Zw and the unconstrained ridge estimator ŵ_b for balancing on the covariates orthogonal to Z (proved in the appendix). Note that (8) does not hold if the variables are automatically normalized in the ridge regression procedure, but whether to normalize q_i or not is not critical under z₁ = Zŵ, according to experiments. See Doudchenko and Imbens (2017) for more on normalization without the constraints.

By substituting the ℓ₁ norm for the squared ℓ₂ norm w′w in (6), we have the constrained lasso (CLASSO) version

(9) min_w (q₁ − Qw)′(q₁ − Qw) + λ‖w‖₁ subject to z₁ = Zw,

where λ is again a user-specified parameter. A fast optimization algorithm is available (James et al., 2019; see also Gaines et al., 2018).
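The closed form (7) and the decomposition (8) can be sketched numerically as follows; the matrices are random toy inputs, and all dimensions are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
J, K1, m, lam = 20, 3, 12, 2.0         # units, rows of Z (= K+1), balancing covariates, lambda

Z = rng.normal(size=(K1, J)); z1 = rng.normal(size=K1)
Q = rng.normal(size=(m, J));  q1 = rng.normal(size=m)

# Constrained ridge (7): w_hat = w_ridge + G^{-1}Z'(Z G^{-1} Z')^{-1}(z1 - Z w_ridge)
G = Q.T @ Q + lam * np.eye(J)
w_ridge = np.linalg.solve(G, Q.T @ q1)             # unconstrained ridge estimator
GiZt = np.linalg.solve(G, Z.T)
w_hat = w_ridge + GiZt @ np.linalg.solve(Z @ GiZt, z1 - Z @ w_ridge)

# Decomposition (8): partial Z out of Q and q1, then w_hat = w_a + w_b
B = Q @ Z.T @ np.linalg.inv(Z @ Z.T)               # OLS of the rows of Q on Z'
Qt, qt = Q - B @ Z, q1 - B @ z1
w_a = Z.T @ np.linalg.solve(Z @ Z.T, z1)           # maximum-shrinkage weights
w_b = np.linalg.solve(Qt.T @ Qt + lam * np.eye(J), Qt.T @ qt)
```

Here `w_hat` satisfies the exact-balancing constraint Zŵ = z₁, `w_b` lies in the null space of Z, and `w_a + w_b` reproduces `w_hat`, matching (8).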
CRIDGE and CLASSO both shrink the parameters, but only CLASSO achieves variable selection. Though a simple decomposition like (8) is not available for CLASSO, the modified constrained lasso

min_w (q̃ − Q̃w)′(q̃ − Q̃w) + λ‖w‖₁ subject to z₁ = Zw

after partialing out z_i from q₁ and Q is identical to the original problem (9). Again, if the balancing covariates are to be scaled within the optimization algorithm, the original variables and the variables after partialing out naturally give different results.

When the usual constrained-lasso algorithms fail, one can again modify the ℓ₁ norm to a nominal elastic-net norm, as Gaines et al. (2018) remark. The elastic-net objective function is

(q₁ − Qw)′(q₁ − Qw) + λ[(1 − α)w′w + α‖w‖₁],

which equals the lasso objective function (q_aug − Q_aug w)′(q_aug − Q_aug w) + λα‖w‖₁, where q_aug = (q₁′, 0′)′ and Q_aug = [Q′, √(λ(1 − α)) I_J]′; see Gaines et al. (2018). Doudchenko and Imbens (2017) propose a cross-validation method of selecting λ (and α). I propose comparison by visualization after trying several different λ values.

Example 1.
ADH (2010) analyze the effect of the 1988 California tobacco control program using their synthetic control method. The dependent variable is cigarette consumption. ADH use 7 variables x_i = (x_i1, …, x_i7)′ as trend predictors: log per capita state personal income (x_i1), the percentage of the population aged 15–24 (x_i2), the retail price of cigarettes (x_i3), and per capita beer consumption (x_i4), all of which are averaged over the 1980–1988 period, together with three years of lagged smoking consumption (1975, 1980, and 1988). The balancing covariates are the pre-treatment outcomes (1970–1988). The counterfactual outcomes by ADH, by the constrained ridge with λ = 2, and by the constrained lasso with the same λ are plotted in Figure 2(a), where z_i = (1, x_i′)′ and q_i = (y_i1, …, y_iT)′. Figure 2(a) suggests that the treatment effects by CRIDGE and CLASSO are nontrivially smaller than by ADH. The results by CRIDGE and CLASSO are only marginally different from each other. The last three variables in x_i are included in both z_i and q_i, and removing them from z_i is immaterial.

If we let z_i = 1 and q_i = (x_i′, y_i1, …, y_iT)′ instead, that is, if ADH's seven 'predictor' variables are used as balancing covariates instead of as trending covariates, then the results from ADH, CRIDGE, and CLASSO are all very similar, as Figure 2(b) shows. It turns out that the x_i1 variable, ln(GDP per capita), is the main driver of the dissimilarity between (a) and (b) of Figure 2; if we let z_i = (1, x_i1)′ and q_i = (x_i2, …, x_i7, y_i1, …, y_iT)′, the resulting trends are close to those in Figure 2(a).
Removing the duplicates (x_i5, x_i6 and x_i7) is again of little consequence.

For the model given by (1), ADH (2010) treat the constant term in z_i and the nonconstant terms differently, where the constant term is exactly matched on by the adding-up constraint and the nonconstant terms appear in the minimization. My approach, on the other hand, treats all the terms in z_i identically by exact matching. Balancing q₁ and Qw is a different issue; they are matched by the minimization of (q₁ − Qw)′(q₁ − Qw) without requiring exact balancing. The covariates z_i appear in the model as the drivers of nuisance trends, and q_i is introduced to enhance comparability.

Figure 2: Trends in cigarette sales in California. (a) 1 and x_i as trending covariates; y_i1, …, y_iT as balancing covariates. (b) 1 as the trending covariate; x_i, y_i1, …, y_iT as balancing covariates. (Axes: year versus per-capita cigarette sales in packs; series: actual California, synthetic California by ADH (2010), constrained ridge, constrained lasso.)

Note. ADH (2010) data. (a) Trending covariates are 1 and x_i, where x_i contains ln(GDP per capita), the percentage aged 15–24, retail price, beer consumption per capita, and cigarette sales per capita in 1988, 1980 and 1975 (see ADH, 2010, Table 1); balancing covariates (q_i) are y_i1, …, y_iT. (b) Only the constant term is used as the trending covariate, and all variables in x_i and q_i are used for balancing. In both (a) and (b), λ = 2 for the constrained ridge and lasso.

A practical remark on the selection of λ is worth making. If the model is correctly specified so that u_it shows no systematic trends, i.e., if u_it has zero mean conditional on z₁, …, z_{J+1} for all t, then any w satisfying z₁ = Zw will eliminate confounding systematic trends in y_1t − Y_t′w. When that happens, the choice of λ would not make much difference in principle. On the other hand, systematicity in y_1t − Y_t′w in the pre-treatment periods would be evidence of possible misspecification of the model for some i or all, in which case matching on variables such as pre-treatment outcomes will hopefully mitigate the problem. Since a larger λ deteriorates the matching quality and increases the variability in Y_t′w, it would be an acceptable practice to enlarge λ while keeping the discrepancy between q₁ and Qw within a tolerable range. Though fuzzy theoretically, the acceptability is usually clear to human eyes, as the time series of y_1s and Y_s′w in the pre-treatment periods can be compared visually without difficulty. Also, the constrained shrinkage estimators are continuous in λ (except at λ = 0, at which Q′Q may be singular) for given data, and small changes in λ will lead to only small changes in the trend of Y_t′w.

We have thus far considered the case where h_i is empty in (1).
In many applications, a few variables in z_i would be sufficient as the driving force of trend heterogeneity. Besides, soft matching on the lagged dependent variables often obliterates the necessity of unobservable common factors. In some cases, however, researchers may want to allow for unobservable h_i, especially if no observable trending covariates are available. In this section, we discuss how to handle h_i.

Because h_i makes trends heterogenous, it is again essential to have h₁ and Hw exactly balanced, where H = (h₂, …, h_{J+1}). But this is infeasible since the h_i are not observed. ADH (2010) replace h₁ = Hw with the sufficient condition that y_1s = Y_s′w for all s ≤ T, which is not attainable unless T is smaller than J. But even when J is large enough for y_1s = Y_s′w for all s ≤ T, the nonnegativity of w_j imposed by ADH (2010) does not necessarily guarantee z₁ = Zw and y_1s = Y_s′w at the same time. Adverse examples have been illustrated in Figure 1.

When the h_i are unobserved, an obvious strategy is to estimate them rather than attempting to find a detour. If ȟ_i denotes the initial estimator of h_i and Ȟ = (ȟ₂, …, ȟ_{J+1}), the corresponding optimization problem is

min_w (q₁ − Qw)′(q₁ − Qw) + λw′w subject to z₁ = Zw and ȟ₁ = Ȟw.
There are in total K + 1 + r constraints, which are generally satisfied by a nonempty set of parameters if J > K + 1 + r, as holds in usual applications. If J is too small, the researcher would try to reduce K or r or both; it is not very sensible to have more common factors than the number of untreated units in applications.

A convenient way of estimating h_i is least squares using the pre-treatment data:

min_{μ_i, γ_t, δ_t, h_i} Σ_{i=1}^{J+1} Σ_{t=1}^{T} (y_it − μ_i − γ_t′z_i − δ_t′h_i)²,

or in matrix notation

min_{μ*, Γ, δ, H*} tr{(Y* − ιμ*′ − ΓZ* − δH*)′(Y* − ιμ*′ − ΓZ* − δH*)},

where Y* is the T × (J+1) matrix of y_it for i = 1, …, J+1 (columns) and t = 1, …, T (rows), ι is the T × 1 vector of ones, μ* is the (J+1) × 1 vector of μ_i, Γ = (γ₁, …, γ_T)′, Z* = (z₁, Z), δ = (δ₁, …, δ_T)′, and H* = (h₁, H). The concentrated loss function is

(10) min_{δ, H*} tr{(MY*M_{Z*′} − MδH*M_{Z*′})′(MY*M_{Z*′} − MδH*M_{Z*′})},

where M = I_T − T⁻¹ιι′ and M_{Z*′} = I − Z*′(Z*Z*′)⁻¹Z*. Let A = MY*M_{Z*′}. The common factors in A are estimated as √T times the orthonormal eigenvectors of AA′ corresponding to the r largest eigenvalues, and the associated factor loading estimators are (h̃₁, …, h̃_{J+1}) = T⁻¹δ̃′A, where δ̃ is the matrix of estimated common factors. Note that the estimated common factors correspond to Mδ rather than to δ itself, and the estimated factor loadings to H†* = H*M_{Z*′} = H* − H*Z*′(Z*Z*′)⁻¹Z* = [h₁, H] − H*Z*′(Z*Z*′)⁻¹[z₁, Z] rather than to H* itself. But, given that z₁ = Zw, we have h₁ = Hw if and only if h₁† = H†w, where H†* = [h₁†, H†].
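The eigen-decomposition steps for (10) can be sketched as follows; the dimensions and the data generating process are illustrative assumptions. Note that numpy's `eigh` returns eigenvalues in ascending order, so the last r columns correspond to the r largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
T, J1, K1, r = 12, 21, 3, 2           # pre-periods, J+1 units, K+1 predictors, factors

Zstar = np.vstack([np.ones(J1), rng.normal(size=(K1 - 1, J1))])   # (K+1) x (J+1)
delta = rng.normal(size=(T, r))       # true common factors
Hstar = rng.normal(size=(r, J1))      # true factor loadings
Ystar = rng.normal(size=(T, K1)) @ Zstar + delta @ Hstar \
        + rng.normal(scale=0.05, size=(T, J1))

# A = M Y* M_{Z*'}: demean over time, partial Z* out across units
M = np.eye(T) - np.ones((T, T)) / T
MZ = np.eye(J1) - Zstar.T @ np.linalg.solve(Zstar @ Zstar.T, Zstar)
A = M @ Ystar @ MZ

# Factors: sqrt(T) times the orthonormal eigenvectors of AA' for the r largest eigenvalues
eigval, eigvec = np.linalg.eigh(A @ A.T)
delta_tilde = np.sqrt(T) * eigvec[:, -r:]         # T x r estimated factors
H_tilde = delta_tilde.T @ A / T                   # r x (J+1) estimated loadings
```

By construction δ̃′δ̃/T = I_r, and δ̃H̃ is the best rank-r approximation of A, so the residual A − δ̃H̃ reflects only the idiosyncratic noise.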
We can therefore use the estimated factor loadings h̃_i in the constrained ridge, lasso and elastic-net optimization, although h₁† and H†ŵ are then not exactly balanced, due to the discrepancy between h̃_i and h†_i (after rotation). It is a nuisance that the constrained estimator vector ŵ satisfies z₁ = Zŵ and h̃₁ = H̃ŵ, but not h₁† = H†ŵ or h₁ = Hŵ. Thus y_1t − Y_t′ŵ still contains a remaining trend term, as shown in

y⁰_1t − Y_t′ŵ = (μ₁ − μ′ŵ) + δ_t′(h₁ − Hŵ) + (u_1t − U_t′ŵ).

Given z₁ = Zŵ and h̃₁ = H̃ŵ, we have δ_t′(h₁ − Hŵ) = δ_t′B⁻¹[(Bh₁† − h̃₁) − (BH† − H̃)ŵ], where B denotes the rotation linking the loadings h†_i to their estimates, so the remaining trend is driven by the estimation error in the factor loadings. Example 2.
For the application in ADH (2010), again let x_i be the seven predictor variables used by ADH (2010), as in Example 1. Let h̃_i be the vector of two factor loadings found in y_it after temporally demeaning and cross-sectionally partialing out (1, x_i′)′. If we let z_i = (1, x_i′)′ and q_i = (y_i1, …, y_iT)′, then the estimated counterfactual outcomes using h̃_i as extra trend predictors are given in Figure 3(a), which is very similar to Figure 2(a). On the other hand, if q_i = (x_i′, y_i1, …, y_iT)′, z_i contains only 1, and h̃_i contains the four estimated factor loadings in y_it after temporal and cross-sectional demeaning (without x_i partialed out), then the CRIDGE and CLASSO results are very similar to the ADH synthetic control, as shown in Figure 3(b), just like in Figure 2(b). Changing z_i is consequential, but controlling for estimated hidden factor loadings does not make much difference in this example.

In this exercise, the estimated factor loadings explain the pre-treatment outcomes well. When each of the seven variables in x_i is regressed on the four estimated factor loadings found in part (b), the R-squared is low for the first four controls and very high for the last three (the lagged outcomes), as Table 1 shows. The results remain stable when r is increased up to 10. This suggests that the role of hidden factors is limited when q_i or z_i contains some pre-treatment outcomes.

If some common factors are observed (e.g., incidental linear or quadratic trends), then they can be partialed out by replacing the M matrix in (10) with an appropriate projection matrix. For example, if y⁰_it = γ_t′z_i + g_t′μ_i + δ_t′h_i + u_it, where g_t is observable and the fixed effects are subsumed in g_t′μ_i, then M is to be replaced with M_{[1,g]}, say, where g = (g₁, …, g_T)′.
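Replacing M with a projection against observed common factors can be sketched as below, where the observed factor g is a hypothetical linear trend of my choosing:

```python
import numpy as np

T = 12
g = np.arange(1.0, T + 1)                      # an observed common factor, e.g. a linear trend
G = np.column_stack([np.ones(T), g])           # columns [1, g], as in M_[1,g]

# Annihilator M_[1,g] = I - G(G'G)^{-1}G' projects out both the constant and g
M_1g = np.eye(T) - G @ np.linalg.solve(G.T @ G, G.T)
```

Premultiplying the data by M_1g removes any component that is a linear combination of the constant and g before the factor extraction step.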
Finally,the number r of common factors may be chosen exogenously by the researcher or by usingan automatic selection procedure. I recommend the former method. Specifically, increasing r starting from zero and plotting the estimated counterfactual outcomes will give the researcherclear ideas how the results change as more hidden factors are allowed for in the model. This section compares the new methods with ADH (2010), HCW (2012), and Doudchenko andImbens (2017). 14igure 3: Trends of cigarette sales in California(a) z i = (1 , x (cid:48) i ) (cid:48) , q i = ( y i , . . . , y iT ) (cid:48) , and r = 2 Year pe r- c ap i t a c i ga r e tt e s a l e s ( i n pa cks ) Actual CaliforniaSynthetic California by ADH (2010)Constrained RidgeConstrained Lasso (b) z i = 1 , q i = ( x (cid:48) i , y i , . . . , y iT ) (cid:48) , and r = 4 Year pe r- c ap i t a c i ga r e tt e s a l e s ( i n pa cks ) Actual CaliforniaSynthetic California by ADH (2010)Constrained RidgeConstrained Lasso
Note.
The tuning parameter λ is set to 2. In (b), r is chosen to be 4 because there are four predictors in x_i other than pre-treatment outcomes. Changing r to 2 makes practically no difference.

Table 1: R-squared from regressing each of the seven predictors in x_i on the estimated factor loadings.

Note.
The sample size is J + 1 = 39, and the explanatory variables are the estimated factor loadings obtained by least squares applied to cross-sectionally and temporally demeaned pre-treatment outcomes.

3.1 Comparison to ADH (2010)

ADH's (2010) synthetic control algorithm consists of two layers of optimization, which I call the 'inner' and 'outer' optimization loops. The inner loop finds an optimal ŵ(V) for a given V by minimizing (z − Zw)'V(z − Zw) subject to the adding-up and nonnegativity constraints (called the 'ADH constraints' in short in this subsection), and the outer loop finds an optimal diagonal positive semidefinite V by minimizing Σ_{s=1}^{T₀} [y_{1s} − Y_s'ŵ(V)]². The final weight estimator is ŵ = ŵ(V̂). ADH (2010) also discuss using a user-specified V.

For a given V, if there exists a w satisfying the ADH constraints and the exact-balancing condition z = Zw simultaneously, the inner-loop loss function (z − Zw)'V(z − Zw) attains zero at such a w. Even in that case, however, a unique w is not identified in general because the constraints are linear in w. For example, if z₁ = 0 (a scalar) and the controls' predictor values form pairs that are symmetric about zero, then any weights placed symmetrically on such pairs minimize the inner-loop loss function, and in such a case a particular weight vector will be chosen arbitrarily by the numerical procedure used for the optimization. In contrast, if no w satisfies both the ADH constraints and the exact-balancing condition simultaneously, then ADH's algorithm sacrifices exact balancing to abide by the ADH constraints. The consequences of abandoning exact balancing to save the ADH constraints are illustrated in Figure 1, as discussed repeatedly. The V weight is determined by the outer-loop minimization for balancing the pre-treatment outcomes. (If a fixed V is used, balancing the pre-treatment outcomes is irrelevant.)
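As an illustration of the inner loop, the problem for a user-specified diagonal V is a small quadratic program. The following is a minimal sketch on illustrative random numbers, using scipy's SLSQP rather than the optimizer of the actual synth packages:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
K, J = 4, 8
Z = rng.normal(size=(K, J))                 # predictor values of the J untreated units
z = Z @ rng.dirichlet(np.ones(J))           # treated predictors: a convex combination, so zero loss is attainable
V = np.diag(rng.uniform(0.5, 2.0, size=K))  # a user-specified diagonal V

def loss(w):
    # inner-loop objective (z - Zw)'V(z - Zw)
    d = z - Z @ w
    return d @ V @ d

res = minimize(loss, np.full(J, 1.0 / J), method="SLSQP",
               bounds=[(0.0, None)] * J,                       # nonnegativity
               constraints=[{"type": "eq",
                             "fun": lambda w: w.sum() - 1.0}])  # adding-up
w_hat = res.x
```

Because z is constructed here as a convex combination of the columns of Z, the minimized loss is numerically zero, as in the exactly balanced case discussed above; the non-uniqueness issue arises precisely when several such w attain zero loss.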
For the finally chosen V̂, no matter whether it is the outcome of the outer-loop optimization or given exogenously, the solution ŵ = ŵ(V̂) need not be unique, nor need it satisfy z = Zŵ. Notably, the selection of V is blind to whether z = Zŵ(V), because V is chosen by the outer loop involving only the pre-treatment outcomes. For example, if some V allow for z = Zŵ(V) and others do not, the ADH algorithm does not necessarily choose one that allows for z = Zŵ(V), since V is determined by minimizing Σ_{s=1}^{T₀}[y_{1s} − Y_s'ŵ(V)]², which does not necessarily minimize [z − Zŵ(V)]'[z − Zŵ(V)].

The nonnegativity and adding-up constraints provide attractive interpretations to practitioners, but the benefits come with nontrivial costs. First, ADH's (2010) two-layer optimization procedure may fail to converge or may give a suboptimal choice of synthetic control. For example, Abadie and Gardeazabal (2003) find a 'synthetic Basque' composed of weights on Cataluna and Madrid in their study on the political turmoil in Spain. But a thorough investigation reveals that a lower root mean squared prediction error can be achieved by an alternative synthetic Basque that also places weight on Baleares. (Finding this weight vector requires more direct use of the Karush–Kuhn–Tucker conditions; neither the Stata 'synth' package nor the R 'Synth' package identifies this synthetic control.) This suggests that researchers should not be overly confident about the meaningfulness of the estimated w weights.

The second issue involves the nonnegativity constraint and is more subtle. The nonnegativity constraint may be incompatible with z = Zw, i.e., R^J_+ ∩ {w : z = Zw} = ∅, in which case trends in y_{1t} − Y_t'w due to z − Zw may confound the treatment effects if w is forced to be in R^J_+.
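Whether the ADH constraints and exact balancing can hold simultaneously is a linear feasibility question that can be checked directly. A minimal sketch on illustrative numbers, writing the adding-up condition as an explicit equality row of the linear program:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
K, J = 3, 6
Z = rng.normal(size=(K, J))

def adh_feasible(Z, z):
    """Is there w >= 0 with 1'w = 1 and Zw = z (exact balancing)?"""
    J = Z.shape[1]
    A_eq = np.vstack([Z, np.ones((1, J))])
    b_eq = np.concatenate([z, [1.0]])
    res = linprog(np.zeros(J), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * J, method="highs")
    return res.status == 0  # 0 = optimal (feasible), 2 = infeasible

z_in = Z @ rng.dirichlet(np.ones(J))  # a point inside the convex hull of Z's columns
z_out = z_in + 100.0                  # far outside the (bounded) hull
```

In the second case exact balancing is impossible on the simplex, which is precisely the situation where forcing w into R^J_+ leaves a discrepancy z − Zŵ.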
The importance of nonnegativity can be controversial, but it is noteworthy that a discrepancy between z and Zw can lead to a nonnegligible confounding trend in y_{1t} − Y_t'w, while a negative w_j only affects interpretation. If one wishes, the nonnegativity restriction can be made soft by, for example, the constrained lasso

min_{w⁺, w⁻} ‖q − Qw⁺ + Qw⁻‖² + λ Σ_{j=2}^{J+1} (w⁺_j + κw⁻_j),

for some large positive κ, subject to the constraints that z = Zw⁺ − Zw⁻, w⁺_j ≥ 0 and w⁻_j ≥ 0 for all j, which modifies a generalized version of (5). The above soft nonnegativity will allow w_j < 0 for some j if hard nonnegativity is incompatible with z = Zw, but will try to keep w as close to the nonnegative domain as possible. However, the benefit looks only minor because the appealing interpretation attached to nonnegativity is lost anyway if some w_j are negative.

3.2 Comparison to HCW (2012)

HCW (2012) take an alternative approach of regressing y_{1t} on Y_t for a selected subset of the untreated units using the pre-treatment observations to estimate the intercept c and the slope vector w. The counterfactual outcomes are then formed as ĉ + Y_t'ŵ for t > T₀, where ĉ and ŵ are the OLS estimators. As Li and Bell (2017) derive, this estimator is justified under mean stationarity. If the unobserved trends show mean nonstationarity, HCW's (2012) method needs modification.

To see the source of the bias and its remedy, let us take a simple example with z_i = 1.
Given the OLS estimators ĉ and ŵ, the estimated treatment effects are

(11) τ̂_t = y_{1t} − ĉ − Y_t'ŵ = τ_t + γ̈_t(1 − 1'ŵ) + δ̈_t'(h₁ − Hŵ) + (ü_{1t} − Ü_t'ŵ),

where γ̈_t = γ_t − γ̄_pre, δ̈_t = δ_t − δ̄_pre, ü_{1t} = u_{1t} − ū_{1,pre}, and Ü_t = U_t − Ū_pre, with ξ̄_pre denoting T₀^{-1} Σ_{t=1}^{T₀} ξ_t for a generic variable ξ_t.

The OLS regression of y_{1t} on Y_t for t ≤ T₀ may give systematic biases in τ̂_t for this model, due to the γ̈_t(1 − 1'ŵ) term among others, because the stated OLS regression does not guarantee 1'ŵ →_p 1. The origin of this failure is in fact endogeneity. Example 3 below demonstrates that 1'ŵ < 1 asymptotically (as T₀ → ∞) if y_{1t} is regressed on Y_t for t ≤ T₀ in a model with z_i = 1 and empty h_i, so that systematic changes in the trend γ_t may confound the treatment effects.

Example 3.
Consider the model y_it = μ_i + γ_t + u_it, where γ_t are common time effects. Let J be small and T₀ → ∞, as considered by HCW (2012). The OLS slope estimator ŵ from the regression of y_{1t} on Y_t using the pre-treatment observations is

ŵ = (Y'M₀Y)^{-1} Y'M₀y₁ = [(γ1' + U)'M₀(γ1' + U)]^{-1} (γ1' + U)'M₀(γ + u₁) = (σ²_γ 11' + S_U)^{-1} σ²_γ 1 + o_p(1),

where y_i = (y_{i1}, ..., y_{iT₀})', Y = (y₂, ..., y_{J+1}), M₀ = I_{T₀} − T₀^{-1}11', U is the T₀ × J matrix of u_{jt} for j ≥ 2 and t ≤ T₀, γ = (γ₁, ..., γ_{T₀})', σ²_γ = plim T₀^{-1}γ'M₀γ, and S_U = plim T₀^{-1}U'M₀U. Thus, when J is fixed,

1'ŵ = σ²_γ 1'(σ²_γ 11' + S_U)^{-1}1 + o_p(1) = σ²_γ 1'S_U^{-1}1 / (1 + σ²_γ 1'S_U^{-1}1) + o_p(1),

which implies that

(12) 1 − 1'ŵ →_p (1 + σ²_γ 1'S_U^{-1}1)^{-1} > 0.
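Example 3 is easy to check by simulation. In the following minimal sketch, the errors are i.i.d. standard normal, so S_U = I₃ and, with σ²_γ = 1 and J = 3, the limit of 1'ŵ in (12) is 3/(1 + 3) = 3/4:

```python
import numpy as np

rng = np.random.default_rng(7)
J, T0 = 3, 5000
gamma = rng.normal(size=T0)              # common time effects, sigma_gamma^2 = 1
mu = rng.normal(size=J + 1)              # unit fixed effects
U = rng.normal(size=(T0, J + 1))         # idiosyncratic errors, so S_U -> I_J
Y_all = mu + gamma[:, None] + U          # y_it = mu_i + gamma_t + u_it
y1, Y = Y_all[:, 0], Y_all[:, 1:]

# HCW: regress y_1t on an intercept and the J controls over the pre-period
X = np.column_stack([np.ones(T0), Y])
coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
s = coef[1:].sum()                       # 1'w-hat; probability limit is 3/4 here
```

The estimated weight sum settles near 3/4 rather than 1, confirming that the common time effect γ_t is only partially passed through to the counterfactual.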
In the presence of common time effects γ_t, the estimated τ̂_t(ŵ) systematically depends on γ̈_t(1 − 1'ŵ), as is apparent from (11) and (12). Without the mean stationarity of γ_t that ensures γ̈_t ≈ 0, τ̂_t(ŵ) is systematically biased away from τ_t.

An obvious solution to the problem is to impose the restriction that 1'w = 1 in case z_i = 1 as in Example 3, and that z = Zw for general z_i, which is exactly our exact-balancing constraint. If h_i is nonempty in (1), then h_i can be estimated and the constraints that h̃₁ = H̃w can be added as explained in Section 2.3. Because the number of common factors is typically small, there almost certainly exist some w vectors that satisfy the restrictions. This modified HCW method is a special case of the constrained ridge regressions proposed in this paper, corresponding to λ = 0.

The above constrained OLS is easy to implement, but it requires T₀ > J − K − 1. If there are many untreated units (J large), HCW (2012) select a sufficiently small subset a priori by the researcher's judgment, which is sometimes arbitrary but often acceptable as long as rationales are provided. The constraints that z = Zw and that h̃₁ = H̃w are always crucial.

3.3 Comparison to Doudchenko and Imbens (2017)

Doudchenko and Imbens (2017) propose minimizing the elastic-net loss function Σ_{s=1}^{T₀} (y_{1s} − c − Y_s'w)² + λ((1 − α)‖w‖²/2 + α‖w‖₁) without constraints. Their proposal (elastic net, and no constraints) can be understood as a modification of ADH (2010) and also as a modification of HCW (2012) to an elastic-net framework. When the signal is strong in the pre-treatment periods, such that matching on the observed pre-treatment outcomes deals with trends adequately, this elastic-net solution may work well (though bias may still exist for the endogeneity reason explained in Section 3.2), but otherwise there is no device to control for heterogenous trends in the outcomes in the post-treatment periods.

Let us take numerical examples.
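Before turning to the numerical examples, the modified HCW estimator just described (constrained ridge with λ = 0 and, since z_i = 1, the single constraint 1'w = 1) can be sketched by solving the KKT system directly; all numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(11)
J, T0 = 6, 200
gamma = np.cumsum(rng.normal(size=T0))   # mean-nonstationary common trend (z_i = 1 case)
mu = rng.normal(size=J + 1)
U = rng.normal(size=(T0, J + 1))
Y_all = mu + gamma[:, None] + U
y1, Y = Y_all[:, 0], Y_all[:, 1:]

def constrained_ridge(q, Q, Z, z, lam):
    """Minimize ||q - Qw||^2 + lam * w'w subject to Zw = z via the stacked KKT system."""
    Jn, Kn = Q.shape[1], Z.shape[0]
    G = Q.T @ Q + lam * np.eye(Jn)
    KKT = np.block([[G, -Z.T], [Z, np.zeros((Kn, Kn))]])
    rhs = np.concatenate([Q.T @ q, z])
    return np.linalg.solve(KKT, rhs)[:Jn]

Z = np.ones((1, J)); z = np.array([1.0])        # exact balancing: 1'w = 1
qd = y1 - y1.mean(); Qd = Y - Y.mean(axis=0)    # demeaning absorbs the intercept c
w_c = constrained_ridge(qd, Qd, Z, z, lam=0.0)  # lam = 0: the modified HCW (constrained OLS)
```

By construction the weights add up to one exactly, so the common-trend term γ̈_t(1 − 1'ŵ) in (11) vanishes.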
Figure 4 is obtained by applying Doudchenko and Imbens's (2017) proposal to the two simulated data sets considered for Figure 1. The elastic-net mixing parameter α is set close to 1 (near the lasso), and the tuning parameter is λ = 0.01, a value that gives a visually appealing pre-treatment matching; larger λ values such as 0.1 and 1 are poor at reproducing the trend in the pre-treatment outcomes. The results are compromised for both data sets in the post-treatment periods, which seems to be due to the endogeneity bias discussed in Section 3.2. Imposing 1'w = 1 as a hard restriction controls for common time effects, and imposing z = Zw for more general models gives the elastic-net version of what the present paper proposes.

Figure 4: Trends constructed by Doudchenko and Imbens (2017). (a) Data for Figure 1(a); (b) data for Figure 1(b). Each panel plots the outcome by period, with the treatment date marked, for the true counterfactual, Doudchenko and Imbens (2017), and CRIDGE.

Note. Simulated data used in Figure 1. Doudchenko and Imbens's (2017) counterfactual trends are obtained using the R package glmnet with no standardization and including the intercept. The elastic-net mixing parameter α is set close to 1, and the λ parameter is set to 0.01. For both (a) and (b) the post-treatment counterfactual outcomes are understated by Doudchenko and Imbens's method.

It is noteworthy that Doudchenko and Imbens (2017) do not refer to an explicit model; see their introduction. In other words, their aim is not to control for heterogenous trends in models like (1) but to estimate counterfactual trends based on regularized matching on pre-treatment outcomes (identification by regularization).

4 Conclusion

For model (1) considered by ADH (2010), I propose new estimators of treatment effects by treating the trending variables (z_i and h_i in the model) and other balancing covariates (denoted q_i in this paper) differently. Without further assumptions on the time-varying coefficients (γ_t and δ_t in the model), exact balancing of the trend predictors as a hard restriction is crucial for properly dealing with heterogenous trends driven by the trending covariates. The adverse consequences of making the exact matching soft are illustrated in Figures 1 and 4, where all the extant estimators exhibit compromised behavior for data generated by (1) without hidden factors. The new estimators proposed in this paper work well.

References
Abadie, A., A. Diamond, and J. Hainmueller (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program, Journal of the American Statistical Association 105 (490), 493–505.

Abadie, A., and J. Gardeazabal (2003). The economic costs of conflict: A case study of the Basque Country, American Economic Review 93 (1), 113–132.

Doudchenko, N., and G. W. Imbens (2017). Balancing, regression, difference-in-differences and synthetic control methods: A synthesis, arXiv preprint.

Gaines, B. R., J. Kim, and H. Zhou (2018). Algorithms for fitting the constrained lasso, Journal of Computational and Graphical Statistics 27 (4), 861–871.

Gobillon, L., and T. Magnac (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls, Review of Economics and Statistics 98 (3), 535–551.

Hsiao, C., H. S. Ching, and S. K. Wan (2012). A panel data approach for program evaluation: Measuring the benefits of political and economic integration of Hong Kong with mainland China, Journal of Applied Econometrics 27, 705–740.

James, G. M., C. Paulson, and P. Rusmevichientong (2019). Penalized and constrained optimization: An application to high-dimensional website advertising, Journal of the American Statistical Association, DOI: 10.1080/01621459.2019.1609970.

Li, K. T., and D. R. Bell (2017). Estimation of average treatment effects with panel data: Asymptotic theory and implementation, Journal of Econometrics 197 (1), 65–75.

Mallat, S. A Wavelet Tour of Signal Processing: The Sparse Way, Academic Press, Elsevier.
A Appendix
A.1 Mathematical Proofs
Solution to (3).
The Lagrangian function is L = (1/2)w'w + μ'(z − Zw). The first-order conditions are (i) w_a = Z'μ̂ and (ii) z = Zw_a. Condition (i) implies that Zw_a = ZZ'μ̂, i.e., z = ZZ'μ̂, and thus μ̂ = (ZZ')^{-1}z. Substituting this back into (i), we have w_a = Z'(ZZ')^{-1}z.

Incidentally, we can also directly show that w_a minimizes w'w subject to Zw = z. For any w satisfying z = Zw, we have w'w − w_a'w_a = w'w − z'(ZZ')^{-1}z = w'w − w'Z'(ZZ')^{-1}Zw = w'[I − Z'(ZZ')^{-1}Z]w ≥ 0, because I − Z'(ZZ')^{-1}Z is symmetric and idempotent, hence positive semidefinite.

Proof of (7).
The Lagrangian function for (6) is L = (1/2)[(q − Qw)'(q − Qw) + λw'w] + ℓ'(z − Zw), where ℓ is the vector of Lagrange multipliers. The first-order conditions are (i) G_λŵ − Q'q − Z'ℓ̂ = 0, where G_λ = Q'Q + λI_J, and (ii) z = Zŵ. From (i), we have (i') ŵ = ŵ_ridge + G_λ^{-1}Z'ℓ̂, where ŵ_ridge = G_λ^{-1}Q'q is the unconstrained ridge estimator. Premultiplying (i') by Z and substituting (ii) gives z = Zŵ_ridge + ZG_λ^{-1}Z'ℓ̂, which implies that ℓ̂ = (ZG_λ^{-1}Z')^{-1}(z − Zŵ_ridge). Substituting this back into (i') gives (7).

Proof of (8).
Given the constraint z = Zw, we have q − Qw = q̃ − Q̃w for q̃ = q − Bz and Q̃ = Q − BZ with any B. Thus, the solution to (6) is identical to the solution of min_w (q̃ − Q̃w)'(q̃ − Q̃w) + λw'w subject to z = Zw. With the choice B = QZ'(ZZ')^{-1}, we have ZQ̃' = 0. Letting G̃_λ = Q̃'Q̃ + λI, we have G̃_λ^{-1} = λ^{-1}I − λ^{-1}Q̃'(Q̃Q̃' + λI_m)^{-1}Q̃, which implies ZG̃_λ^{-1} = λ^{-1}Z and Zw̃_ridge = ZG̃_λ^{-1}Q̃'q̃ = λ^{-1}ZQ̃'q̃ = 0. The result follows from (7).

A.2 Data generating processes
The data used for producing Figure 1(a) are generated as follows. The common trend γ_t0 is the sum of a periodic function of 2πt/T and the linear term 2t/T; the kth element of γ_t (k = 1, ..., K) alternates in sign over k and is a periodic function of 2πt/T with a phase shift proportional to log k. The remaining components are

z_ik = z⁰_ik − i/J + k, with z⁰_ik ~ iid N(0, 1),
μ_i = z̄_i − i/J + μ⁰_i, with μ⁰_i ~ iid N(0, 1),
u_it = c_u ũ_it, where ũ_it = ρ ũ_{i,t−1} + u*_it, u*_it ~ iid N(0, 1), ũ_{i,0} = 0,
y_it = μ_i + γ_t0 + γ_t'z_i + u_it, i = 1, ..., J, t = 1, ..., T,

with c_u and ρ constants in (0, 1). Above we set J = 38, T₀ = 20, T₁ = 10, T = T₀ + T₁ = 30, and K = 4, similarly to the application in ADH (2010). Data are generated by R with the initial random seed set to 55. This is the data generating process for Figure 1(a) in the introduction. If γ_s is set to γ_{T₀} for all s ≤ T₀ after γ_{T₀} is generated, so that there are no obvious trends in the pre-treatment periods, we have the data for Figure 1(b). See Figure 5 for the generated untreated outcomes.

A.3 Discussions on asymptotics
This appendix demonstrates how to establish asymptotics for the average treatment effect (ATE) estimator based on the constrained ridge estimator for model (1) without h_i, i.e., y_it = μ_i + γ_t'z_i + u_it. Let c = (c₀', c₁')' be given, where the T₀ nonpositive elements of c₀ add up to −1 and the T₁ (= T − T₀) nonnegative elements of c₁ add up to 1. The ATE estimator of DID form is τ̂ = c'(y₁ − Yŵ), where y_i = (y_{i1}, ..., y_{iT})', Y = (y₂, ..., y_{J+1}), and ŵ is the constrained ridge estimator. An obvious choice of c is c₀ = −T₀^{-1}(1, ..., 1)' and c₁ = T₁^{-1}(1, ..., 1)', which leads to

τ̂ = T₁^{-1} Σ_{t=T₀+1}^{T} (y_{1t} − Y_t'ŵ) − T₀^{-1} Σ_{s=1}^{T₀} (y_{1s} − Y_s'ŵ).

Let the true ATE be defined by τ̄ = Σ_{t=T₀+1}^{T} c_t τ_t. Then, since z = Zŵ, we have

τ̂ = τ̄ + c'(u₁ − Uŵ),

where u_i = (u_{i1}, ..., u_{iT})' and U = (u₂, ..., u_{J+1}). We shall assume that c_j'c_j = O(T_j^{-1}) and T_j → ∞ for j = 0, 1, which are satisfied by the above averaging operators. Note that c'c = T₀^{-1} + T₁^{-1} and (1/2)min(T₀, T₁) ≤ (c'c)^{-1} ≤ min(T₀, T₁) if c_j'c_j = T_j^{-1}. Under the further assumption that the maximal eigenvalue of E(u_iu_i') is uniformly bounded, we have c'u_i →_p 0 for each i, because E(c'u_i) = 0 and var(c'u_i) = c'E(u_iu_i')c = O(c'c) → 0. That is, c'u_i = O_p(‖c‖) for every i, where ‖c‖ = (c'c)^{1/2}. When J is fixed, c'Uŵ = O_p(‖c‖) too because ŵ is convergent, and thus τ̂ − τ̄ = O_p(‖c‖) →_p 0.

The case where J increases is harder to deal with. Write c'Uŵ = (ŵ ⊗ c)'vec(U), so that (c'Uŵ)² = (ŵ ⊗ c)'vec(U)vec(U)'(ŵ ⊗ c). If the maximal eigenvalue of E[vec(U)vec(U)' | ŵ] is uniformly bounded, then the law of iterated expectations implies that E[(c'Uŵ)²] = (c'c)E(ŵ'ŵ)O(1). For E(ŵ'ŵ), we have ŵ'ŵ ≤ 2w_a'w_a + 2ŵ_b'ŵ_b due to (8), where w_a = Z'(ZZ')^{-1}z and ŵ_b = (Q̃'Q̃ + λI)^{-1}Q̃'q̃. The maximal-shrinkage component w_a is easy to handle: w_a'w_a = z'(ZZ')^{-1}z = O_p(J^{-1}), so it is not unnatural to assume that E(w_a'w_a) is bounded. For the unconstrained ridge component, we have ŵ_b'ŵ_b = q̃'Q̃(Q̃'Q̃ + λI)^{-2}Q̃'q̃. When the minimal eigenvalue of T₀^{-1}(Q̃'Q̃ + λI) is bounded below by a strictly positive universal constant, ŵ_b'ŵ_b has the same order as T₀^{-2}q̃'Q̃Q̃'q̃. If furthermore the maximal eigenvalue of Q̃Q̃' is O_p(J), then ŵ_b'ŵ_b = O_p(J/T₀) = O_p(1). Thus, we may assume that E(ŵ'ŵ) = O(1), under which c'Uŵ = O_p(‖c‖) →_p 0.

Above we have demonstrated a path to establishing τ̂ − τ̄ = O_p(‖c‖). This reasoning is, however, incomplete. First, it is hard to verify the condition that C(ŵ) ≡ E[vec(U)vec(U)' | ŵ] has a uniformly bounded maximal eigenvalue. In particular, q_i usually depends on u_it in the pre-treatment periods, so the maximal eigenvalue of C(ŵ) generally depends on ŵ, and how it behaves is unclear. Second, it is hard to verify the condition that E(ŵ_b'ŵ_b) is bounded. The demonstration above involves showing that ŵ_b'ŵ_b is stochastically bounded, which does not necessarily imply that E(ŵ_b'ŵ_b) is bounded; determining under what circumstances E(ŵ_b'ŵ_b) is bounded requires its direct evaluation, which is challenging if not impossible.

The difficulty in the above demonstration originates from the fact that E[(c'Uŵ)²] is evaluated. One might instead want to use the Cauchy–Schwarz bound (c'Uŵ)² ≤ (c'UU'c)(ŵ'ŵ) together with Markov's inequality, which is abortive in case J → ∞ because c'UU'c is of order Jc'c, not c'c.

Figure 5: Generated untreated outcomes. (a) Trends for Figure 1(a); (b) trends for Figure 1(b). Each panel plots the outcome by period.

Note. In each figure, the dark line is for the treated unit and the gray ones are for the 37 untreated units.