Assessing the Sensitivity of Synthetic Control Treatment Effect Estimates to Misspecification Error
Billy Ferguson
Graduate School of Business, Stanford University
[email protected]

Brad Ross
Graduate School of Business, Stanford University
[email protected]
Last Updated February 23, 2021
Abstract
We propose a sensitivity analysis for Synthetic Control (SC) treatment effect estimates to interrogate the assumption that the SC method is well-specified, namely that choosing weights to minimize pre-treatment prediction error yields accurate predictions of counterfactual post-treatment outcomes. Our data-driven procedure recovers the set of treatment effects consistent with the assumption that the misspecification error incurred by the SC method is at most the observable misspecification error incurred when using the SC estimator to predict the outcomes of some control unit. We show that under one definition of misspecification error, our procedure provides a simple, geometric motivation for comparing the estimated treatment effect to the distribution of placebo residuals to assess estimate credibility. When we apply our procedure to several canonical studies that report SC estimates, we broadly confirm the conclusions drawn by the source papers.
∗ We are grateful for invaluable comments from and stimulating conversations with Bharat Chandar, Jiafeng Chen, Benny Goldman, Guido Imbens, Advik Shreekumar, Charlie Walker, and the participants in the Stanford Econometrics and Applied Lunches.

The Synthetic Control (SC) method was originally developed in Abadie and Gardeazabal (2003), Abadie, Diamond, and Hainmueller (2010), and Abadie, Diamond, and Hainmueller (2015) to estimate treatment effects in comparative case study settings, in which a researcher observes panel data on aggregate outcomes for a small number of large, heterogeneous units, only one of which receives some intervention of interest at some point in time. For the SC method to yield credible treatment effect estimates for the treated unit, researchers must assume that the SC method is well-specified: if there is a convex combination of control units' pre-treatment outcomes that closely approximates the pre-treatment outcomes of the treated unit, then that same convex combination of control units' post-treatment outcomes will yield good estimates of the treated unit's post-treatment control outcomes. While this assumption is necessary for tractable treatment effect estimation, it is unlikely to hold exactly in practice. In this paper, we develop a sensitivity analysis of SC treatment effect estimates that bounds the true treatment effect under the assumption that any deviations from this well-specified method assumption are at most as severe as the deviations observed in placebo analyses of the control units.

To build intuition for where this assumption might lead researchers astray, we present two placebo analyses in which we apply the SC method to panel data from the evaluation of a 1989 tobacco control program implemented in California, as in Abadie et al. (2010).
In particular, we use the SC method to predict per-capita tobacco sales in Virginia and Delaware in the year 2000 using the other members of the donor pool as control units. Since neither Virginia nor Delaware received the treatment and we observe their true control outcomes post-treatment, we can see whether the SC method correctly predicts no effects in either state.

In Figure 1a, we depict the observed, true control outcomes for Virginia alongside two different convex combinations of the remaining control units' outcome trends. The first is the orange synthetic control trend constructed in typical SC fashion, namely as the convex combination of control units' outcomes that most closely approximates Virginia's pre-treatment outcomes (Abadie et al., 2010). While this procedure yields a trend with good pre-treatment fit, it does a subpar job of predicting Virginia's control outcome in the year 2000.

Next, since we observe Virginia's control outcomes post-treatment in this placebo analysis, we can instead construct the "best-looking" (in a pre-treatment fit sense) convex combination of the remaining control units' outcome trends that matches Virginia's control outcome in 2000 exactly, shown in green. Perhaps surprisingly, there exists a convex combination of control units that exactly predicts our post-treatment outcome of interest while achieving only marginally worse pre-treatment fit than the best-fitting trend chosen by the SC method.

When we conduct the same exercise with Delaware as the "treated" unit of interest, we see in Figure 1b that, just as with Virginia, although the trend constructed by the SC method (again in orange) has good-looking pre-treatment fit, it does a poor job estimating the true control outcome of interest.
In addition, the researcher must assume that there are no idiosyncratic factors that affect the treated unit's counterfactual control outcomes post-treatment but not the control units' post-treatment outcomes besides their differing treatment statuses; one way to operationalize this idea is the linear factor model presented in Abadie et al. (2010) and studied in depth in Ferman and Pinto (2019), in which units' factor loadings do not vary before and after treatment. It is also important to assume that the treatment does not affect units in the donor pool, although researchers can simply exclude "control" units for which spillover effects are a concern (Abadie, 2020). Typically, such assumptions must be justified using domain knowledge about the setting of interest, so we do not concern ourselves with assessing their validity in this paper.

We will discuss how we can compute such a convex combination in Section 4. Note that doing so is only possible because Virginia's control outcome in 2000 lies between the minimum and maximum of the other control units' outcomes in 2000, in which case there are many convex combinations with no prediction error.

(a) Virginia Placebo Analysis (b) Delaware Placebo Analysis

Figure 1: Visualizations of placebo analyses with Virginia and Delaware as treated units; in both panels, the blue trend denotes the true control outcome trend of the placebo treated unit, the orange trend denotes the synthetic control trend selected by the SC method, and the green trend denotes the "best-looking" synthetic control trend that minimizes pre-treatment fit error while exactly matching the placebo treated unit's control outcome in 2000.

However, unlike when we used Virginia as the placebo treated unit, we cannot construct a convex combination of control units' outcome trends that matches both Delaware's control outcome in 2000
exactly and its pre-treatment outcomes well, so the best-looking convex combination of control units' trends we select to match Delaware's outcome in 2000 exactly (again in green) has unacceptable pre-treatment fit.

These two examples indicate we should interpret SC estimates with caution; the placebo analysis using Virginia suggests good pre-treatment fit is not sufficient for good post-treatment accuracy, while the placebo analysis using Delaware suggests good pre-treatment accuracy, while feasible, may not even be achievable alongside post-treatment accuracy. As discussed in Section 4, the additional pre-treatment fit error incurred by the green trends beyond the minimum error incurred by the orange trends is one natural measure of misspecification error.

This perspective on the informativeness of pre-treatment fit (or lack thereof) is also the motivation for our proposed sensitivity analysis. While we do not observe the treated unit's counterfactual post-treatment outcomes and thus cannot compute its misspecification error, we can compute the misspecification errors incurred by the SC method when we use it to predict control units' post-treatment outcomes, as in the placebo analyses of Virginia and Delaware. Our procedure assumes the treated unit's misspecification error is at most the misspecification error of a given control unit and computes the set of treatment effects consistent with the assumption that the unknown misspecification error incurred by the SC method is at most this error bound.

In Figure 2a, we depict the sets of plausible counterfactual control outcomes for California computed by our procedure consistent with the assumptions that the SC method's misspecification error for California is at most the observed misspecification errors for Virginia and Delaware, indicated by the green and purple dotted intervals respectively.
For intuition, we also include examples of predicted counterfactual trends for California that satisfy these error bounds in light green for Virginia and light purple for Delaware.

(a) California SC Predictions (b) California Treatment Effect Bounds
Figure 2: In Figure 2a, the blue trend denotes the observed outcome trend for California and the orange trend denotes the synthetic control trend selected by the SC method. The green and purple dotted intervals depict the sets of plausible counterfactual control outcomes for California under Virginia and Delaware's misspecification errors respectively, and we provide examples of counterfactual trends for California that satisfy the error bound for Virginia in light green and the error bound for Delaware in light purple. Figure 2b presents the plausible treatment effect bounds corresponding to the misspecification errors of all the control units in the donor pool in order of error magnitude, with the bounds for Virginia and Delaware highlighted with green and purple dashed lines. The red region indicates where in the distribution of misspecification errors a zero treatment effect first becomes plausible.
Our procedure also finds the minimum misspecification error necessary for zero to be a plausible treatment effect, which we can use to assess how reasonable a zero treatment effect would be by benchmarking it against the placebo misspecification errors described above. In Figure 2b, we show the treatment effect bounds corresponding to each control unit's misspecification error in order of increasing error magnitude, and we use the red region to highlight where in the distribution of misspecification errors a zero treatment effect first becomes plausible. We also illustrate how Virginia and Delaware's misspecification errors compare to those of the other control units by highlighting the treatment effect bounds corresponding to Virginia and Delaware's misspecification errors with green and purple dashed lines.

Although in general our proposed treatment effect bounds must be computed numerically using convex programming tools (Boyd and Vandenberghe, 2004), applying our procedure with misspecification error defined as the minimum distance between the SC weights and any vector of weights with perfect predictive accuracy yields closed-form bounds whose widths are determined by scaled-up residuals from predicting control units' outcomes with the SC method. As such, one can view our procedure applied with this misspecification error metric as a geometric motivation for comparing the estimated treatment effect to the distribution of placebo residuals to assess estimate credibility.

As we discuss in more detail in Section 4.1, the control units with the largest and smallest post-treatment outcomes in 2000 cannot be perfectly predicted using a convex combination of the remaining control units' outcomes, so the misspecification errors incurred by the SC method for these two placebo treated units are infinite. As a result, the sets of plausible treatment effects corresponding to these extremal units mechanically span the real line, so they cannot be plotted, and they cannot rule out a zero treatment effect.
To introduce the ideas underlying our sensitivity analysis, in this section we present a particular version of our proposed procedure based on a measure of misspecification error that yields intuitive, closed-form expressions for our treatment effect bounds. Later, we generalize the procedure to accommodate other valuable notions of misspecification error.

2.1 Notation and the SC Method
Before describing our procedure, we introduce some necessary notation and review the SC method. In the canonical setting used to motivate the SC method, we acquire data about J + 1 units across T time periods [T] := {1, ..., T} to estimate the effect of a policy intervention, referred to as the treatment, affecting a single treated unit indexed by j = 1. The treatment is first implemented just after period T_0 < T and stays in effect for all remaining periods T_0 + 1, ..., T. The set of J remaining control units J := {2, ..., J + 1} that are not affected by the treatment is called the donor pool. For each unit j ∈ {1, ..., J + 1} and each time period t ∈ [T], we let Y_{jt}(1) and Y_{jt}(0) denote that unit's potential outcomes in that period under treatment and lack thereof, respectively (Imbens and Rubin, 2015). Next, let the indicator D_{jt} = 1 if unit j is exposed to the treatment in period t and D_{jt} = 0 otherwise. We then let Y_{jt} := D_{jt} Y_{jt}(1) + (1 − D_{jt}) Y_{jt}(0) denote the potential outcome we observe for unit j in period t, and let Y_t := (Y_{2t}, ..., Y_{(J+1)t})' denote the vector of control units' observed outcomes in period t.

Typically, the goal in comparative case study settings like these is to estimate the treatment effect on the treated unit (with index j = 1) in some post-treatment period T* > T_0:

    τ_{T*} := Y_{1T*}(1) − Y_{1T*}(0).

Because we only observe Y_{1T*} = Y_{1T*}(1), and not Y_{1T*}(0), estimating τ_{T*} reduces to estimating Y_{1T*}(0). Although there are many ways one could do so, the SC method assumes it is possible to compute Y_{1T*}(0) using a weighted sum of the control units' outcomes in period T* (Abadie and Gardeazabal, 2003, Abadie et al., 2010):

    Y_{1T*}(0) = Y_{T*}'w = Σ_{j=2}^{J+1} w_j Y_{jT*}(0)   for some w = (w_2, ..., w_{J+1})' ∈ R^J.   (1)

We refer to this weighted combination of control units' outcome trends as a synthetic control. In particular, Abadie et al.
(2010) propose choosing weights w that make the weighted average of the control units' pre-treatment outcomes as similar as possible to the treated unit's pre-treatment outcomes. Let x_j := (Y_{j1}, ..., Y_{jT_0})' be the vector of unit j's observed, pre-treatment outcomes and X be the T_0 × J matrix whose columns are the control units' observed pre-treatment outcomes (i.e. X's j-th column is given by x_{j+1}). Then we can write the SC estimator as the minimizer of pre-treatment prediction error over the set of positive weights that sum to one:

    w_sc := argmin_{w ∈ R^J} ‖x_1 − Xw‖_2   s.t.   1'w = 1,  w ≥ 0.   (2)

Later, we will use Δ_J := {w ∈ R^J : w ≥ 0, 1'w = 1} to denote the set of valid SC weights.

Once w_sc has been computed, we can estimate τ_{T*} by

    τ̂_sc_{T*} := Y_{1T*} − Y_{T*}'w_sc = Y_{1T*}(1) − Σ_{j=2}^{J+1} w_sc,j Y_{jT*}(0).

If (1) holds for weights w = w_sc, we say the SC method is well-specified, in which case we have that τ̂_sc_{T*} = τ_{T*}. However, as illustrated in Section 1, the SC method is unlikely to be well-specified in practice.

Abadie et al. (2010), Chernozhukov et al. (2017), and Ferman and Pinto (2019) discuss several models under which such an assumption is reasonable.

While uncommon in practice, x_1 could in principle lie in the convex hull of the columns of X, in which case (2) could have an infinite number of solutions with perfect pre-treatment fit, some of which would provide better post-treatment fit than others (Abadie, 2020). In the sections that follow, we assume perfect pre-treatment fit is not achievable because it is empirically rare and doing so allows us to develop the more intuitive sensitivity analysis presented in Section 2. However, in Section B of the Appendix, we discuss in detail how the generalized sensitivity analysis described in Section 4.1 can easily account for non-uniqueness of the SC estimator when generating treatment effect bounds.
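The program in (2) is a small convex problem that off-the-shelf optimizers handle directly. The sketch below is a minimal illustration, not the authors' implementation; it uses SciPy's SLSQP solver on simulated data, and the names `sc_weights`, `x1`, and `X0` are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def sc_weights(x1, X0):
    """Solve (2): minimize ||x1 - X0 @ w||_2 over the simplex {w >= 0, sum(w) = 1}.

    x1 : (T0,) pre-treatment outcomes of the treated unit.
    X0 : (T0, J) pre-treatment outcomes of the J control units, one per column.
    """
    T0, J = X0.shape
    result = minimize(
        lambda w: np.sum((x1 - X0 @ w) ** 2),  # squared pre-treatment fit error
        np.full(J, 1.0 / J),                   # start from uniform weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,               # w >= 0 (w <= 1 then follows anyway)
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # 1'w = 1
    )
    return result.x

# Toy data: the "treated" trend is close to a 60/40 mix of the first two controls.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(10, 4))
x1 = 0.6 * X0[:, 0] + 0.4 * X0[:, 1] + 0.01 * rng.normal(size=10)
w_sc = sc_weights(x1, X0)
```

Any quadratic programming solver would return the same weights; the simplex constraints are what keep the fitted weights interpretable as a weighted average of control units.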
In the next section, we will introduce a natural way to measure the degree to which the SC method deviates from well-specification, on which we will base our sensitivity analysis.

In Section C.2 of the Appendix, we discuss how the sensitivity analyses we introduce below can also apply to extensions of the SC method that incorporate additional pre-treatment covariates, relax the convex weight constraints, add an intercept term, and minimize different and sometimes data-adaptive objective functions. For expositional clarity, however, the basic SC method presented here will suffice to motivate our proposed procedures.

We also note that Abadie and L'Hour (2018) and Kellogg, Mogstad, Pouliot, and Torgovitsky (2020) propose modifying the SC objective to penalize solutions that interpolate more between units, since such solutions will yield worse predictions if the relationship between pre-treatment outcomes and post-treatment outcomes is nonlinear. Our sensitivity analysis can also be applied to these alternative estimators, as we detail in Section C.2 of the Appendix.

Despite the concerns about the effectiveness of the SC method raised in Section 1, we can still attempt to assess what the true value of Y_{1T*}(0) might be under limited misspecification error. Since τ̂_sc_{T*} = Y_{1T*}(1) − Y_{T*}'w_sc is an affine function of Y_{T*} and τ_{T*} is a scalar, it is always possible to choose some set of weights w ∈ R^J such that Y_{T*}'w = Y_{1T*}(0) and thus Y_{1T*}(1) − Y_{T*}'w = τ_{T*}. More importantly, such weights are not at all unique; in fact, the set of optimal weights

    W* := {w ∈ R^J : Y_{T*}'w = Y_{1T*}(0)}

forms a (J − 1)-dimensional hyperplane in R^J. Thus, a natural measure of misspecification error in the SC weights w_sc is the difference between w_sc and the closest weights w* to w_sc in W*, where distance is measured by the ℓ2-norm. More formally,
we can define w* like so:

    w* := argmin_{w ∈ R^J} ‖w_sc − w‖_2   s.t.   Y_{T*}'w = Y_{1T*}(0)  (⇔ w ∈ W*).   (3)

Note that we do not restrict ourselves to considering weights within the set of convex weights Δ_J; though such a restriction prevents SC estimates from extrapolating beyond the outcomes in the data (Abadie, 2020), it may be that the closest weights that allow for optimal prediction of Y_{1T*}(0) lie outside Δ_J, or that W* and Δ_J do not overlap at all. In what follows, we will frequently focus on the magnitude of misspecification error, which we denote by d(w_sc, W*) := ‖w_sc − w*‖_2.

Since W* is a hyperplane, we could in principle solve (3) by projecting w_sc onto W*. Because we do not observe Y_{1T*}(0), we cannot do so in practice. However, if we are willing to assume d(w_sc, W*) ≤ B for some bound B ≥ 0, then there must be some weight vector w ∈ R^J within a radius-B ℓ2-ball around w_sc such that Y_{T*}'w = Y_{1T*}(0). Crucially, this assumption limits the magnitude of method misspecification error while allowing the direction of that error to remain arbitrary. If we let

    Ŵ_B := {w ∈ R^J : ‖w_sc − w‖_2 ≤ B}

denote the set of all weights ℓ2-distance at most B away from w_sc, then we know that the true potential outcome Y_{1T*}(0) lies within the following set of values:

    Y^B_{1T*}(0) := Y_{T*}'Ŵ_B = {Y_{T*}'w : ‖w_sc − w‖_2 ≤ B}.

Since the function w ↦ Y_{T*}'w is continuous in w and Ŵ_B is compact, the set Y^B_{1T*}(0) containing Y_{1T*}(0) must be a closed interval in R. As a result, we can characterize the interval Y^B_{1T*}(0) by computing its endpoints Y^{B,−}_{1T*}(0) and Y^{B,+}_{1T*}(0), which are the solutions to the following two optimization problems:

    Y^{B,−}_{1T*}(0) := min_{w ∈ R^J} Y_{T*}'w   s.t. ‖w_sc − w‖_2 ≤ B
    Y^{B,+}_{1T*}(0) := max_{w ∈ R^J} Y_{T*}'w   s.t.
‖w_sc − w‖_2 ≤ B.   (4)

Since Y^{B,−}_{1T*}(0) and Y^{B,+}_{1T*}(0) are defined as the extrema of linear functions on an ℓ2-ball centered at w_sc, they can easily be computed in closed form:

    Y^{B,−}_{1T*}(0) = Y_{T*}'w_sc − B‖Y_{T*}‖_2,
    Y^{B,+}_{1T*}(0) = Y_{T*}'w_sc + B‖Y_{T*}‖_2.

Then, since τ_{T*} is linear in Y_{1T*}(0), we can translate these bounds on Y_{1T*}(0) into bounds on τ_{T*}:

    τ_{T*} ∈ T^B_{T*} := [Y_{1T*}(1) − Y^{B,+}_{1T*}(0), Y_{1T*}(1) − Y^{B,−}_{1T*}(0)] = τ̂_sc_{T*} + B‖Y_{T*}‖_2 · [−1, 1].   (5)

2.2.2 Bound Calibration via Placebo Effect Estimation

Unfortunately, the discussion in Section 2.2.1 does not make it clear how one should choose an appropriate misspecification error bound B from which the bounds on τ_{T*} in (5) can be constructed. However, since we do observe Y_{jT*} = Y_{jT*}(0) for each control unit j ∈ J, we can use a distance measure similar to d(w_sc, W*) to quantify the misspecification error in SC estimates of Y_{jT*}(0) for j ∈ J using the remaining J − 1 control units as donor pools. Then, we can assume the treated unit's post-treatment potential outcome Y_{1T*}(0) is no more difficult to estimate using the SC method than some percentage of the control units' post-treatment control outcomes and use these measures to inform our choice of bound B.

Importantly, the methodology we propose below based on this intuition only relies on the assumption that the magnitude of the misspecification error for the treated unit is no larger than the magnitudes of the placebo misspecification errors for some percentage of the control units.
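Returning to the interval (5): given a candidate error bound B, its endpoints are immediate to compute. The snippet below (all numbers made up purely for illustration, names ours) also checks that the lower endpoint is attained at the extreme point w_sc + B·Y_{T*}/‖Y_{T*}‖_2 of the ball, i.e. the weights that maximize the predicted counterfactual.

```python
import numpy as np

def effect_bounds(tau_hat_sc, B, Y_post):
    """Interval (5): tau_hat_sc + B * ||Y_post||_2 * [-1, 1]."""
    half_width = B * np.linalg.norm(Y_post)
    return tau_hat_sc - half_width, tau_hat_sc + half_width

rng = np.random.default_rng(1)
w_sc = rng.normal(size=6)       # stand-in SC weights
Y_post = rng.normal(size=6)     # stand-in for controls' outcomes in period T*
Y1_obs, B = 2.0, 0.5            # made-up treated outcome and error bound
tau_hat_sc = Y1_obs - Y_post @ w_sc
lo, hi = effect_bounds(tau_hat_sc, B, Y_post)

# The weights on the ball's boundary that maximize Y_post @ w; they produce
# the largest plausible counterfactual and hence the smallest plausible effect.
w_max = w_sc + B * Y_post / np.linalg.norm(Y_post)
```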
Given that the differences in characteristics between the treated and control units are a primary reason researchers should use the SC method in the first place (Abadie, 2020), it is likely implausible that the unknown direction of the treated unit's misspecification error is similar to the directions of the control units' placebo misspecification errors.

To formalize the ideas presented above, we first define the following quantities analogous to X, Y_{T*}, w_sc, and W* when we view control unit j as a placebo treated unit and the other J − 1 control units as the donor pool: let X_{−j} be the T_0 × (J − 1) matrix whose columns are the pre-treatment outcomes of the J − 1 control units other than j, let Y_{(−j)T*} := (Y_{kT*})_{k ∈ J, k ≠ j}' be the (J − 1)-vector of the observed control outcomes of the J − 1 control units besides j, and let w^{(j)}_sc ∈ R^{J−1} be the synthetic control weights chosen as if control unit j were the treated unit and the remaining J − 1 control units were the donor pool, i.e. by solving the following optimization problem similar to (2):

    w^{(j)}_sc := argmin_{w ∈ R^{J−1}} ‖x_j − X_{−j}w‖_2   s.t.   1'w = 1,  w ≥ 0.   (6)

Finally, let W*_j := {w ∈ R^{J−1} : Y_{(−j)T*}'w = Y_{jT*}(0)} denote the set of weight vectors w ∈ R^{J−1} that yield placebo unit j's control outcome in period T*.

Since we observe Y_{jT*} = Y_{jT*}(0) for placebo unit j, we can actually compute the distance d(w^{(j)}_sc, W*_j), defined analogously to the unobservable d(w_sc, W*) in (3):

    d(w^{(j)}_sc, W*_j) := min_{w ∈ R^{J−1}} ‖w^{(j)}_sc − w‖_2   s.t.   Y_{(−j)T*}'w = Y_{jT*}  (⇔ w ∈ W*_j).   (7)

(Here X_{−j}'s k-th column is given by x_{k+1} if k < j and x_{k+2} if k > j.) Let R̂_sc_{jT*} := Y_{(−j)T*}'w^{(j)}_sc − Y_{jT*} denote the residual from the SC estimator used to predict Y_{jT*} = Y_{jT*}(0).
Then as with (3), (7) is a basic projection problem with a closed-form solution (see Cheney and Kincaid (2009), pages 450–451, for example):

    d(w^{(j)}_sc, W*_j) = |R̂_sc_{jT*}| / ‖Y_{(−j)T*}‖_2 = |Y_{(−j)T*}'w^{(j)}_sc − Y_{jT*}| / ‖Y_{(−j)T*}‖_2.   (8)

Although d(w^{(j)}_sc, W*_j) is defined purely geometrically, choosing the ℓ2-norm to measure distance in weight space implies d(w^{(j)}_sc, W*_j) can also be characterized as a scaled variant of the absolute placebo SC residual |R̂_sc_{jT*}| for control unit j using the J − 1 other control units as the donor pool. We will discuss this observation in more detail in Section 2.3.

For notational convenience, we use the shorthand B_j = d(w^{(j)}_sc, W*_j) and assume control units' indices align with the sorted order of their respective B_j values, so that the j-th control unit has the (j − 1)-th-smallest B_j. Then once we have computed d(w^{(j)}_sc, W*_j) for all j ∈ J, we can compute bounds on the treatment effect based on (5) for each j ∈ J by choosing B = B_j := d(w^{(j)}_sc, W*_j):

    T^{B_j}_{T*} = τ̂_sc_{T*} + |R̂_sc_{jT*}| (‖Y_{T*}‖_2 / ‖Y_{(−j)T*}‖_2) · [−1, 1].   (9)

Another natural quantity of interest is the minimum bound B on d(w_sc, W*) such that a zero treatment effect lies within T^B_{T*}, i.e.

    B̄ := min {B ≥ 0 : 0 ∈ T^B_{T*}}.

With B̄ in hand, we can then find the control unit j ∈ J such that B_j ≤ B̄ ≤ B_{j+1} (where B_{J+2} := ∞) and report the statistic ν := (j − 1)/J, interpreted as the fraction of control units for which it would have to be "easier" for the SC method to estimate Y_{jT*}(0) than Y_{1T*}(0) if the treatment effect τ_{T*} for the treated unit were actually zero.
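Equations (8)–(9) and the ν statistic translate directly into a few lines of code. The sketch below uses made-up numbers and our own function names; the threshold B̄ is taken as given here (its closed form for the treated unit appears in (11)).

```python
import numpy as np

def placebo_interval(tau_hat_sc, w_j_sc, Y_post, Y_post_minus_j, Y_j_obs):
    """Placebo error B_j via (8) and the effect interval T^{B_j} via (9)."""
    resid = Y_post_minus_j @ w_j_sc - Y_j_obs          # placebo SC residual
    B_j = abs(resid) / np.linalg.norm(Y_post_minus_j)  # distance to W*_j, eq. (8)
    half_width = B_j * np.linalg.norm(Y_post)          # = |resid| * N_j, eq. (9)
    return B_j, (tau_hat_sc - half_width, tau_hat_sc + half_width)

def nu_statistic(B_bar, placebo_errors):
    """Fraction of control units whose placebo error B_j lies below B_bar."""
    B_sorted = np.sort(np.asarray(placebo_errors))
    return np.searchsorted(B_sorted, B_bar) / B_sorted.size

# Made-up numbers purely to exercise the formulas.
Y_post = np.array([3.0, 1.0, 2.0, 4.0])     # all controls' outcomes in period T*
Y_post_minus_j = np.array([1.0, 2.0, 4.0])  # dropping placebo unit j
w_j_sc = np.array([0.5, 0.25, 0.25])        # placebo SC weights for unit j
B_j, (lo, hi) = placebo_interval(0.0, w_j_sc, Y_post, Y_post_minus_j, Y_j_obs=2.5)

nu = nu_statistic(B_bar=0.3, placebo_errors=[0.05, 0.10, 0.40, 0.90])
```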
For the purposes of computation, B̄ can be defined similarly to d(w_sc, W*), with the unobserved Y_{1T*}(0) replaced by the observed outcome Y_{1T*} of the treated unit in period T*:

    B̄ := min_{w ∈ R^J} ‖w_sc − w‖_2   s.t.   Y_{T*}'w = Y_{1T*}.   (10)

As with (7), B̄ can easily be computed in closed form by projecting w_sc onto the hyperplane {w ∈ R^J : Y_{T*}'w = Y_{1T*}}:

    B̄ = |Y_{T*}'w_sc − Y_{1T*}| / ‖Y_{T*}‖_2.   (11)

For reference, we summarize the sensitivity analysis procedure we have developed above in Procedure 1. We also demonstrate one way to visualize T^{B_j}_{T*} for each j ∈ J along with B_j and B_{j+1} in Figure 3a using data on California's 1989 tobacco control program analyzed in Abadie et al. (2010). In the figure, the units of the x-axis are

Procedure 1. Sensitivity Analysis
1. For each control unit j ∈ J:

   (a) Use the SC method to predict unit j's outcome in period T*, treating the other J − 1 control units as the donor pool; compute the observed residual from this prediction, R̂_sc_{jT*} = Y_{(−j)T*}'w^{(j)}_sc − Y_{jT*}.

   (b) Compute the bounds T^{B_j}_{T*} on the treatment effect τ_{T*} under the assumption that the misspecification error d(w_sc, W*) incurred by estimating Y_{1T*}(0) with the SC method is at most the misspecification error B_j incurred by the SC method in Step 1a:

       T^{B_j}_{T*} = τ̂_sc_{T*} + |R̂_sc_{jT*}| (‖Y_{T*}‖_2 / ‖Y_{(−j)T*}‖_2) · [−1, 1].
2. Compute the minimum misspecification error B̄ needed for 0 ∈ T^{B̄}_{T*}, i.e. for zero to be a plausible treatment effect estimate:

       B̄ = |Y_{T*}'w_sc − Y_{1T*}| / ‖Y_{T*}‖_2,

   and find the control unit j with the largest misspecification error still smaller than B̄, i.e. where B_j ≤ B̄ ≤ B_{j+1}.

3. Visualize the treatment effect bounds T^{B_j}_{T*} for each j ∈ J and the misspecification errors B_j and B_{j+1} in a plot like Figure 3a, and report the percentage ν = (j − 1)/J of control units whose misspecification errors B_j are smaller than B̄.

percentile ranks p_j := (j − 1)/J of the ordered set of placebo misspecification errors {B_j : j ∈ J} rather than the units of B_j, so that it is easy to read ν off of the x-axis where the red shaded region begins.

Before proceeding, we make note of several interesting properties of our proposed bounds T^{B_j}_{T*}. To do so, we define N_j := ‖Y_{T*}‖_2 / ‖Y_{(−j)T*}‖_2 so we can write

    T^{B_j}_{T*} = τ̂_sc_{T*} + |R̂_sc_{jT*}| N_j · [−1, 1].

Since Y_{(−j)T*} contains all of the entries of Y_{T*} except Y_{jT*}(0), we have that ‖Y_{(−j)T*}‖_2 ≤ ‖Y_{T*}‖_2, so N_j ≥ 1. Intuitively, this inflation of the placebo residual for unit j in T^{B_j}_{T*} corrects for the fact that the placebo SC procedure for estimating Y_{jT*}(0) has one fewer control unit at its disposal than the SC procedure for estimating Y_{1T*}(0) and thus has less flexibility to make extreme predictions than the SC procedure would for our actual task of interest.

Next, we can write N_j as

    N_j = sqrt( Σ_{k=2}^{J+1} Y_{kT*}(0)^2 / [Σ_{k=2}^{J+1} Y_{kT*}(0)^2 − Y_{jT*}(0)^2] ) = [1 − (|Y_{jT*}(0)| / ‖Y_{T*}‖_2)^2]^{−1/2},

enabling us to make two more observations.
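The two expressions for N_j above agree because ‖Y_{(−j)T*}‖_2^2 = ‖Y_{T*}‖_2^2 − Y_{jT*}(0)^2. A quick numerical check of that identity, with arbitrary made-up outcomes and names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
Y_post = rng.normal(size=8)        # stand-in for controls' outcomes in period T*
j = 3                              # index of the placebo unit being dropped
Y_post_minus_j = np.delete(Y_post, j)

# Direct definition N_j = ||Y|| / ||Y_{-j}|| versus the rewritten closed form.
N_direct = np.linalg.norm(Y_post) / np.linalg.norm(Y_post_minus_j)
N_formula = (1.0 - (abs(Y_post[j]) / np.linalg.norm(Y_post)) ** 2) ** -0.5
```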
First, N_j is increasing in the magnitude of Y_{jT*}(0) relative to ‖Y_{T*}‖_2, meaning T^{B_j}_{T*} is wider if unit j has a larger-magnitude outcome in period T* relative to the outcomes of the other control units, and thus could generate more extreme predictions if it contributed to the SC predicted outcome.

Second, under mild conditions, the bounds T^{B_j}_{T*} converge to purely residual-based bounds τ̂_sc_{T*} + |R̂_sc_{jT*}| · [−1, 1] as the size of the donor pool increases. Consider a sequence of donor pools indexed by their sizes, which with an abuse of notation we denote {J_J : J ∈ N}. Then, provided that the outcomes of the units in each of the donor pools do not grow too quickly or too slowly in magnitude, i.e. if

    lim_{J → ∞} max_{j ∈ J_J} |Y_{jT*}(0)| / ‖Y_{T*}‖_2 = 0,

the ratios N_j converge uniformly to 1 as the sample size J increases. As a consequence, for some α ∈ [0, 1], the bounds T^{B_⌈(1−α)J⌉}_{T*} calibrated to the (1 − α)-th percentile of the ordered set of placebo distances B_j will shrink towards the bounds τ̂_sc_{T*} + |R̂_sc_{⌈(1−α)J⌉T*}| · [−1, 1] as J → ∞.

As described in Section 2.2.2, we can view T^{B_j}_{T*} as the set of plausible treatment effects for the treated unit if we assume that the magnitude of misspecification error d(w_sc, W*) incurred by estimating Y_{1T*}(0) with the SC estimator is no larger than the magnitude of misspecification error d(w^{(j)}_sc, W*_j) incurred by treating unit j as the treated unit and estimating Y_{jT*}(0) with the placebo SC estimator. Then, fixing some fraction α ∈ [0, 1], T^{B_⌈(1−α)J⌉}_{T*} contains the set of plausible treatment effects under the assumption that it is "no harder" to estimate Y_{1T*}(0) when unit 1 is the treated unit than it is to estimate Y_{jT*}(0) for any of the ⌈(1 − α)J⌉ "easiest-to-estimate" control units, i.e.
those with the ⌈(1 − α)J⌉ smallest misspecification error magnitudes. Further, B̄ quantifies the magnitude of the misspecification error the SC method would have to incur for a treatment effect of zero to be plausible. This magnitude can be compared to control units' misspecification error magnitudes B_j to benchmark how "reasonable" a treatment effect of zero might be, as measured by the percentage ν of control units for which B_j ≤ B̄.

Despite the resemblance of our sensitivity analysis to frequentist statistical inference procedures, we caution against interpreting ν as the p-value corresponding to a test of no treatment effect and T^{B_⌈(1−α)J⌉}_{T*} as a confidence interval for the treatment effect, since our methodology is based on the perspective that uncertainty in SC estimates results from modeling error, not statistical noise. We believe this perspective is important because in most comparative case studies, we only observe a single outcome sample path over a limited number of time periods for each of a small number of heterogeneous units, only one of which is ever treated (Abadie, 2020). As a result, any stochastic model with enough structure to allow for tractable statistical inference in such settings must rely on potentially unrealistic assumptions about the data generating process to make any progress, e.g. distributional assumptions on the stochastic outcome processes, a stance on the treatment assignment mechanism, and/or growing dataset asymptotics.

Further, while some of the statistical approaches to characterizing uncertainty in SC estimates do acknowledge and accommodate the possibility of misspecification error (Chernozhukov et al., 2017, 2018, Cattaneo et al., 2019), the assumptions they make to limit its effect on inferential validity can be difficult to justify in comparative case study settings and to interpret for practitioners, e.g.
stationarity of units' outcome processes, large numbers of observed pre- and post-treatment periods, exchangeability of SC residuals across periods, and/or mean-zero post-treatment SC residuals. While our sensitivity analysis avoids the statistical perspective on estimate uncertainty that is the norm in empirical economics, we believe it provides a transparent evaluation of the credibility of SC counterfactuals in the presence of misspecification error.
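Under the baseline $\ell_2$ weight-space definition of misspecification error, the key quantities have simple closed forms: the distance from the SC weights to the set of weights that exactly reproduce a target outcome is a point-to-hyperplane distance, and the resulting treatment effect bounds widen linearly in the allowed distance via Cauchy-Schwarz. The following minimal sketch illustrates the geometry with hypothetical toy numbers (not data from the paper):

```python
import numpy as np

def weight_space_distance(w_sc, Y_post, y_target):
    """Minimal l2 distance from the fitted SC weights w_sc to the
    hyperplane of weights that exactly predict y_target, i.e.
    {w : Y_post @ w = y_target}.  Equals |residual| / ||Y_post||."""
    return abs(Y_post @ w_sc - y_target) / np.linalg.norm(Y_post)

def effect_bounds(y_treated, Y_post, w_sc, B):
    """Treatment effect bounds when the weights may move at most B from
    w_sc in l2 distance: by Cauchy-Schwarz, the SC prediction Y_post @ w
    can move by at most B * ||Y_post||."""
    center = Y_post @ w_sc
    slack = B * np.linalg.norm(Y_post)
    return (y_treated - center - slack, y_treated - center + slack)

# Toy illustration (hypothetical numbers):
Y_post = np.array([90.0, 110.0, 100.0])  # control outcomes in period T*
w_sc = np.array([0.2, 0.3, 0.5])         # fitted SC weights
y_treated = 80.0                         # treated unit's observed outcome

# Distance the weights must travel for a zero effect to be plausible;
# at exactly this distance, the upper bound just reaches zero.
B_zero = weight_space_distance(w_sc, Y_post, y_treated)
lo, hi = effect_bounds(y_treated, Y_post, w_sc, B_zero)
```

With these toy numbers the SC prediction is $101$, so the point estimate is $-21$ and the bounds at distance `B_zero` are $[-42, 0]$, illustrating how $B$ is calibrated so that zero is just barely plausible.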
Our methodology also provides an alternative motivation for a variant of the popular design-based placebo test of no treatment effect originally proposed in Abadie et al. (2010). Abadie et al. (2010) suggest comparing the absolute SC residual $|\hat R^{sc}_{T^*}| := |Y^\top_{T^*} w^{sc} - Y_{1T^*}|$ under the assumption of no treatment effect (so $Y_{1T^*} = Y_{1T^*}(1) = Y_{1T^*}(0)$) to the distribution of absolute placebo residuals $|\hat R^{sc}_{jT^*}|$ for $j \in \mathcal{J}$; Abadie et al. (2010) interpret $|\hat R^{sc}_{T^*}|$ being large relative to $|\hat R^{sc}_{jT^*}|$ for $j \in \mathcal{J}$ as strong evidence of a non-zero treatment effect, assuming pre-treatment fit is also good. In particular, if we take a design-based perspective and treat outcomes as fixed quantities (see Imbens and Rubin (2015)), then under the admittedly unrealistic assumption that treatment is assigned uniformly at random to the units under consideration, the percentage of absolute residuals $|\hat R^{sc}_{jT^*}|$ that are smaller than $|\hat R^{sc}_{T^*}|$ can be interpreted as a $p$-value for a test of the null hypothesis of no treatment effect. Abadie et al. (2010), Firpo and Possebom (2018), and others suggest using test statistics based on the ratios of post-treatment mean squared error under the null hypothesis to pre-treatment prediction error, but in light of the discussion about the relationship between pre- and post-treatment error in Section 1, it is unclear how meaningful such relative error metrics are in practice. Bojinov and Shephard (2019) and Rambachan and Shephard (2019) discuss similar philosophical issues in the context of time series.
To see the connection between the placebo test described above and our proposed procedure, recall that the statistic $\nu$ defined at the end of Section 2.2.2 is computed by asking what fraction of control units' placebo distances $B_j = d(w^{(j)sc}, \mathcal{W}^*_j) = |\hat R^{sc}_{jT^*}| / \|Y^{(-j)}_{T^*}\|$ (from (8)) are smaller than the minimum bound $B = |\hat R^{sc}_{T^*}| / \|Y_{T^*}\|$ (from (11)) on $d(w^{sc}, \mathcal{W}^*)$ required for 0 to lie in the set of plausible treatment effects $T_{B,T^*}$. If we multiply $B$ and $B_j$ for $j \in \mathcal{J}$ by $\|Y_{T^*}\|$, we can see that $\nu$ can equivalently be computed by asking for what fraction of control units $j \in \mathcal{J}$ is $B_j \|Y_{T^*}\| = |\hat R^{sc}_{jT^*}| \cdot \|Y_{T^*}\| / \|Y^{(-j)}_{T^*}\|$ smaller than $B \|Y_{T^*}\| = |\hat R^{sc}_{T^*}|$. Since the ratios $N_j = \|Y_{T^*}\| / \|Y^{(-j)}_{T^*}\|$ are all greater than one from the discussion at the end of Section 2.2.2, we can see that under the assumption of random treatment assignment, $\nu$ can be interpreted as the $p$-value corresponding to a more conservative variant of Abadie et al. (2010)'s placebo test described above. Further, for any $\alpha \in [0,1]$, we can view $T_{B_{\lceil(1-\alpha)J\rceil},T^*}$ as the set of treatment effects under which our conservative version of Abadie et al. (2010)'s placebo test would fail to reject the null hypothesis of zero treatment effect at level $\alpha$. Per the discussion at the end of Section 2.2.2, the degree of conservativeness of this placebo test also decreases in the size of the donor pool under mild conditions. Thus, our procedure motivates comparing the treated and control units' absolute residuals to assess errors in SC estimates without starting from a random treatment assignment assumption.

It is important to note that, like most papers in the SC literature, our method assumes the donor pool is fixed before treatment effect estimation.
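The algebra above, multiplying both sides of $B_j \le B$ by $\|Y_{T^*}\|$ so that the comparison of distances becomes a comparison of placebo residuals inflated by $N_j \ge 1$, can be checked numerically. The sketch below (with generic simulated inputs, purely for illustration) computes $\nu$ both ways and assumes `Y_post` is the vector of control units' period-$T^*$ outcomes:

```python
import numpy as np

def nu_from_distances(R_treated, R_placebos, Y_post):
    """nu computed from the weight-space distances: the share of j with
    B_j = |R_j| / ||Y^(-j)|| no larger than B = |R| / ||Y||."""
    Y = np.asarray(Y_post, dtype=float)
    B = abs(R_treated) / np.linalg.norm(Y)
    B_j = np.array([abs(r) / np.linalg.norm(np.delete(Y, j))
                    for j, r in enumerate(R_placebos)])
    return np.mean(B_j <= B)

def nu_from_residuals(R_treated, R_placebos, Y_post):
    """Equivalent computation after multiplying through by ||Y||: the
    share of j with |R_j| * N_j <= |R|, where N_j = ||Y|| / ||Y^(-j)||
    is at least one, making the comparison conservative."""
    Y = np.asarray(Y_post, dtype=float)
    N = np.array([np.linalg.norm(Y) / np.linalg.norm(np.delete(Y, j))
                  for j in range(len(Y))])
    scaled = np.abs(R_placebos) * N
    return np.mean(scaled <= abs(R_treated))
```

Both functions return the same $\nu$ away from knife-edge ties, which is the equivalence used in the text.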
In practice, however, researchers often exercise tremendous discretion in donor pool selection in ways that can dramatically change results, as demonstrated by the popular leave-unit-out robustness check we illustrate in Section 3.2. Despite this sensitivity, inclusion of only the control units that are believed to be "most similar" to the treated unit is explicitly advocated for in the SC literature (Abadie, 2020). Doing so is encouraged because, as discussed in Footnote 5, synthetic controls that interpolate more between control units that are very different from the treated unit can be quite biased if the relationship between pre-treatment outcomes (and covariates) and post-treatment outcomes is non-linear (Abadie, 2020; Abadie and L'Hour, 2018; Kellogg et al., 2020).

Since our sensitivity analysis defines robustness relative to SC performance when predicting control units' outcomes, and the researcher has significant latitude to select those control units, one might worry that our sensitivity analysis is itself sensitive to the choice of donor pool. Because the inclusion or exclusion of a control unit from the donor pool has the potential to affect both the SC estimates of the treated and placebo treated units and the placebo misspecification errors incurred by the SC method, the impact of donor pool manipulation is often ambiguous. It is possible, though, that an adversarial researcher could select the donor pool to maximize the perceived robustness of their SC estimates, but such doctoring has always been a vulnerability of both the SC literature and empirical economics more broadly (Broderick, Giordano, and Meager, 2020).

We note that our procedure can assess sensitivity to the inclusion or exclusion of control units that exist in the observed donor pool, since such choices are equivalent to toggling the weights corresponding to certain control units between zero and non-zero values.
However, we cannot determine the impact of including potential control units not reported by the researcher. For this reason, it is crucial that researchers are transparent about the universe of possible control units from which they select their donor pool and precise about the procedure according to which such selection occurs.

Unfortunately, much ambiguity remains about how researchers should go about defining such a universe. For example, in the context of the tobacco control program studied in Abadie et al. (2010), one might argue that California is more similar along many dimensions (e.g. total population or GDP) to countries like Germany and the UK than to US states like Nebraska or Utah; perhaps data on control units from Abadie et al. (2015)'s study of German reunification (augmented with data on tobacco sales) would yield better SC counterfactuals? While such a line of reasoning is compelling, a researcher could also argue that cultural norms around smoking in California are more similar to those of other US states than of European countries. While contrived, this small example illustrates the kind of subjectivity inherent in the donor pool selection process, and to our knowledge, there exist no agreed-upon best practices or formal criteria for inclusion or exclusion of particular units. As such, we view studying the effect of donor pool selection on SC estimates with more analytical precision as an important area for future investigation.
We now demonstrate how to apply our sensitivity analysis as outlined by Procedure 1 by re-examining three canonical policies studied often in the SC literature: California's tobacco control program on tobacco sales using data provided by Abadie et al. (2010), German reunification on GDP using data from Abadie et al. (2015), and the Mariel boatlift and Cuban mass migration on the 20th percentile of the wage distribution in Miami using data as in Peri and Yasenov (2019).

In Figure 3, we summarize the results from each case study by plotting the range of possible treatment effects at each percentile rank $p_j = (j-1)/J$ of the ordered set of placebo misspecification errors $\{B_j : j \in \mathcal{J}\}$. In Figure 3a, the horizontal, dotted blue line represents the SC point estimate of the effect of California's tobacco program on tobacco sales. For each of the $J$ observed placebo misspecification errors $B_j$, we use blue points to denote the maximum and minimum treatment effects possible for California if we allow for misspecification error up to $B_j$. The $x$-axis represents $B_j$ with its percentile rank $p_j$ within the ordered set of placebo misspecification errors. We highlight in red the interval of the placebo misspecification error distribution where the allowable misspecification error first yields treatment effect bounds containing zero. Summarizing Figure 3a, we can see that the SC weight estimates for California would need to incur at least as much error as the 94.7th percentile of the 38 placebo misspecification errors for a zero treatment effect to be plausible.

Figure 3: Plots of treatment effect bounds $T_{B_j,T^*}$ corresponding to each control unit $j$'s misspecification error, computed using data from three papers using the SC method: (a) Effect of California's Tobacco Control Program; (b) Effect of German Reunification on GDP; (c) Effect of Mariel Boatlift on Low-Income Wages. The units of the $x$-axes are percentile ranks $p_j$ of the set of misspecification errors $\{B_j : j \in \mathcal{J}\}$.
We highlight in red the region between $B_j$ and $B_{j+1}$ in which our treatment effect bounds first contain zero.

Of course, our procedure is not the first to purport to help researchers assess the susceptibility of their SC treatment effect estimates to misspecification error. We next show that our procedure provides more complete and interpretable measures of SC estimate robustness compared to two commonly used robustness checks in the SC literature. In the spirit of Bertrand, Duflo, and Mullainathan (2004), we believe methods are best tested on real datasets, so we implement two popular alternative procedures in repeated placebo versions of each of our three case studies, treating each of the control units as a placebo treated unit and comparing the results of these methods to those delivered by our procedure.

First, we examine the "leave-unit-out" robustness check, which entails dropping each control unit from the donor pool and recomputing SC outcome estimates with a donor pool consisting of the remaining control units (Abadie, 2020). The researcher is then supposed to assess robustness qualitatively by checking whether the set of treatment effects outputted by this procedure have the same signs as, and similar magnitudes to, the effects computed using the full donor pool. When we treat Delaware as the placebo treated unit in the context of the state-by-state smoking data from Abadie et al. (2010) and conduct the leave-unit-out robustness check, we see that the alternative predictions generated by this procedure fail to capture the extent of the prediction error incurred by the SC method, as illustrated in Figure 4a.

Unfortunately, this inadequacy is not isolated to Delaware or the state-level smoking data from Abadie et al. (2010). If we repeat this placebo procedure with each of the other control units in each of our three case studies, we find that ( . ) of the control units from Abadie et al. (2010); ( . ) of the control units from Abadie et al.
(2015); and ( . ) of the control units from Peri and Yasenov (2019) have last-period outcomes outside the range of their corresponding leave-unit-out predictions.

It suffices to drop only those control units with positive weight in the full-sample vector of SC weights, because dropping units with zero weight will not affect SC estimates.

Figure 4: In Figure 4a (Delaware Leave-Unit-Out Analysis), we show the SC trends for Delaware computed while leaving out each donor unit that received positive weight (weight shown in parentheses) when running the SC method with the entire donor pool. In Figure 4b (Virginia Leave-Time-Out Analysis), we show the SC trend for Virginia computed while excluding the pre-treatment fit errors for the six periods immediately prior to treatment (between the two vertical black lines) from the SC objective.

In some sense, this result is not so surprising, since the leave-unit-out analysis only assesses the sensitivity of SC estimates to a particular cause of misspecification error: mistakenly including a particular unit in the donor pool and placing positive weight on that unit's outcome in a SC estimate.

The second diagnostic we consider, the "leave-time-out" or "backdating" procedure, involves fitting a synthetic control using only the pre-treatment outcomes up to some number of periods before the first treatment period; the remaining pre-treatment periods, in which control outcomes for the treated unit are known, are used as a validation set to assess the quality of the SC method's predictions out-of-sample (Abadie, 2020). Treating Virginia as the placebo treated unit in the context of the state-by-state smoking data from Abadie et al. (2010), we leave out the six time periods before California was treated (between the two black vertical lines in Figure 4b) and fit the synthetic control on the remaining pre-treatment periods.
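The leave-time-out exercise is mechanical to implement: fit the SC weights on the earliest pre-treatment periods only, then score the predictions on the held-out validation periods. The sketch below is a generic illustration, not the authors' code; it uses a simplex-constrained least-squares fit solved with SciPy's SLSQP (one common way to compute SC weights), and all data are simulated:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sc_weights(X_pre, y_pre):
    """Canonical SC weight fit: least squares over the simplex
    (w >= 0, sum(w) = 1), solved numerically with SLSQP.
    X_pre: (T0, J) control outcomes; y_pre: (T0,) treated outcomes."""
    J = X_pre.shape[1]
    res = minimize(
        lambda w: np.sum((X_pre @ w - y_pre) ** 2),
        x0=np.full(J, 1.0 / J),
        bounds=[(0.0, 1.0)] * J,
        constraints=({"type": "eq", "fun": lambda w: w.sum() - 1.0},),
        method="SLSQP",
    )
    return res.x

def backdate_check(X_pre, y_pre, n_holdout):
    """Leave-time-out diagnostic: refit on the earliest pre-treatment
    periods, then report RMSE on the held-out validation periods."""
    w = fit_sc_weights(X_pre[:-n_holdout], y_pre[:-n_holdout])
    resid = X_pre[-n_holdout:] @ w - y_pre[-n_holdout:]
    return w, float(np.sqrt(np.mean(resid ** 2)))
```

A large validation RMSE relative to the training fit is the signal practitioners eyeball in plots like Figure 4b, though, as discussed below, there is no agreed-upon formal cutoff.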
Given the gap in Figure 4b between the true control trend in black and the backdated synthetic control in purple over the five validation periods, many researchers would be skeptical about their SC estimates. In Virginia's case, though, the backdated SC trend predicts the outcome in 2000 remarkably well and clearly outperforms the non-backdated SC trend in the other post-treatment periods.

If a researcher only considers the backdating exercise as a diagnostic for the credibility of the original (non-backdated) SC estimates, then the poor predictive performance of the backdated SC counterfactual in the validation periods correctly indicates that the original SC counterfactual does not reflect the true control trend in the post-treatment periods. However, some researchers also use the backdated SC trend itself to compute treatment effect estimates, since the leave-time-out exercise directly tests the predictive performance of the same counterfactual on which treatment effect estimates are based. If Virginia were the treated unit, such researchers would be misled; the poor fit in the validation periods does not translate

                          California    Germany
Donor Pool Size               38           16
SC
  False Pos. Rate           10.5%        25.0%
  False Neg. Rate           26.3%        12.5%
  Total Err. Rate           36.8%        37.5%
Backdated SC
  False Pos. Rate           13.2%        18.7%
  False Neg. Rate           10.5%         6.2%
  Total Err. Rate           23.7%        24.9%
Table 1: This table summarizes the results from our placebo analyses of the leave-time-out robustness check. A false positive occurs when the backdated SC trend fits well in the validation periods but the counterfactual control trend does not fit well post-treatment; similarly, a false negative occurs when the backdated SC trend does not fit well in the validation periods but the counterfactual control trend does fit well post-treatment. The error rate is defined as the sum of the false positive and false negative rates. We report error rates for both backdated and non-backdated SC counterfactual trends.

into meaningfully subpar post-treatment fit.

Given these concerns, we repeat this placebo analysis with each of the other control units in the studies of California's tobacco control law and German reunification. In particular, we visually code each application of the backdating procedure as yielding a "false positive" (the backdated SC trend fits well in the validation periods but the counterfactual control trend does not fit well post-treatment), a "false negative" (the backdated SC trend does not fit well in the validation periods but the counterfactual control trend does fit well post-treatment), or neither if the procedure properly rejected a counterfactual with bad post-treatment fit or did not reject a counterfactual with good post-treatment fit. Since there is not a consensus among practitioners about which of the backdated or non-backdated SC trends should determine treatment effect estimates, we report false positive and false negative rates for both types of counterfactuals.

As can be seen from Table 1, while the performance of the leave-time-out procedure is better than the performance of the leave-unit-out procedure, it still leaves much to be desired, given that it had the potential to mislead researchers roughly a quarter of the time it was applied in our placebo analyses.
Again, these results should not be unexpected, since the leave-time-out analysis is only assessing the sensitivity of SC estimates to misspecification error caused by overfitting to outcomes close to the first treated period. More importantly, there are no agreed-upon formal criteria we know of in the literature for trusting or doubting synthetic control estimates based on the leave-unit-out or backdating exercises. Researchers (including us) seem to decide based on visual appeal, which, as we have demonstrated, can lead researchers astray. In fact, we could not reproduce the error rates in Table 1 when we conducted the coding exercise described above twice, six months apart; the version included here reports the results from our second coding attempt, and departures from our first results were not uniform in any direction.

Unfortunately, there are not enough pre-treatment periods in the data from Peri and Yasenov (2019) to reliably evaluate SC predictions from the backdated fit.

Our notions of good and poor fit here are necessarily heuristic, since we know of no accepted formal criteria in the literature for what constitutes acceptable fit in the validation periods. A more systematic way to code each placebo analysis would be to survey a sample of practitioners who use the SC method and ask them whether they find the predictive performance of backdated and non-backdated SC trends acceptable; we leave such a survey for future work.

Figure 5: We plot the results from placebo analyses of Procedure 1 using data from three case studies. On the $x$-axis, we vary the percentile rank cutoff that determines the width of the bounds generated by our procedure, and on the $y$-axis we show the share of placebo treated units for which those bounds correctly include a zero treatment effect. The dashed black 45-degree line demarcates $p$ percent of placebo units' true outcomes being covered with a misspecification error cutoff at the $p$-th percentile rank.
In contrast, our procedure provides a comprehensive and less subjective approach to assessing how all types of misspecification error could affect SC estimates.

To assess the effectiveness of our proposed method in comparison, we subject it to the same placebo analyses we used to interrogate the leave-unit-out and leave-time-out robustness checks studied above. In particular, we treat the control units in each of our three case studies as placebo treated units and apply our sensitivity analysis to each. In Figure 5, we plot the share of control units for which our procedure yields bounds on the treatment effect that correctly contain zero, for each possible percentile rank at which we could generate bounds using our procedure. When the researcher chooses a threshold of misspecification error in terms of a $p$-th percentile rank cutoff that they deem "acceptable" when constructing treatment effect bounds, our placebo analyses suggest that doing so correctly captures zero treatment effects for approximately $p$ percent of the placebo treated units. In some sense, this calibrated relationship between the chosen percentile rank cutoff and bound coverage of placebo treated units' outcomes is not surprising, since our procedure can be viewed as a particular way of synthesizing the results of repeated placebo analyses. Section A of the Appendix describes the mechanics behind this connection in more detail.

In contrast, it is difficult to translate the results of the leave-unit-out and leave-time-out robustness checks into clear insights about the validity of the SC treatment effect estimate for the treated unit. Recall the leave-unit-out placebo analyses conducted using the data from Abadie et al. (2010); if the true control outcomes lie outside of the range of leave-unit-out trends for ( . ) of placebo treated units, are we meant to believe that the range of leave-unit-out trends for California only captures the true counterfactual control trend ( . )
of the time under some sampling model for the potential outcomes? As discussed above, it is even harder to understand how informative the backdating placebo analyses are about the value of the leave-time-out analysis applied to the treated unit. While these other methods are plagued by ambiguities in implementation and interpretation, the only degree of freedom left by our procedure for the researcher to determine, the acceptable percentile rank cutoff, is both directly meaningful and closely connected to a natural summary statistic of our procedure's performance in placebo analyses.

The $\ell_2$-distances defined in Section 2 between the SC weights and the closest weights that correctly predict $Y_{jT^*}(0)$, namely $d(w^{sc}, \mathcal{W}^*)$ and $d(w^{(j)sc}, \mathcal{W}^*_j)$, are natural measures of misspecification error magnitudes, but they are certainly not the only ones researchers can use to assess the sensitivity of SC treatment effect estimates. Instead of measuring the misspecification error incurred by the SC method relative to weights $w$ with the $\ell_2$-distance of $w^{(j)sc}$ to $w$, we can use any function $m_j : \mathbb{R}^{\mathcal{J}\setminus\{j\}} \to [0,\infty]$ such that $m_j(w^{(j)sc}) = 0$ to measure the distance of $w$ to $w^{(j)sc}$. To allow for these alternative misspecification error metrics $m_j$, we generalize our proposed sensitivity analysis in Procedure 2, which nests the analysis described in Section 2 for $m_j(w) = m^{wt}_j(w) := \|w^{(j)sc} - w\|$.
Note that as long as the $m_j$ are convex functions, then although the optimization problems (12), (13), and (15) likely do not have closed-form solutions as their equivalents in Section 2 do, their solutions are still easily computable numerically using off-the-shelf convex optimization software (Boyd and Vandenberghe, 2004).

To demonstrate the value of this more general procedure, we focus on an alternative misspecification error metric $m^{err}_j(w)$, defined as the extra pre-treatment prediction error incurred by $w$ relative to the minimum achievable pre-treatment prediction error with valid SC weights, assuming $w$ are also valid SC weights:

$$m^{err}_j(w) := \frac{\|x_j - X_{-j} w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(w)}{\min_{\tilde w \in \mathbb{R}^{\mathcal{J}\setminus\{j\}}} \left\{ \|x_j - X_{-j}\tilde w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w) \right\}} - 1, \tag{16}$$

where for a given set $C \subseteq \mathbb{R}^{\mathcal{J}\setminus\{j\}}$, $\psi_C(w)$ is a penalty term designed to constrain

Procedure 2. Generalized Sensitivity Analysis
1. For each control unit $j \in \mathcal{J}$:

(a) Compute the misspecification error $d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j)$ incurred by estimating the placebo post-treatment outcome of interest $Y_{jT^*}(0)$ for control unit $j$ with the SC method, using the other $J-1$ units in the donor pool as control units, as in (7):

$$d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) := \inf_{w \in \mathbb{R}^{J-1}} m_j(w) \quad \text{s.t.} \quad Y^{(-j)\top}_{T^*} w = Y_{jT^*} \quad \left(\Leftrightarrow w \in \mathcal{W}^*_j\right). \tag{12}$$

(b) Compute the largest and smallest plausible counterfactual control outcomes $Y^{B_j,-}_{1T^*}(0)$ and $Y^{B_j,+}_{1T^*}(0)$ under the assumption that the misspecification error $d_m(w^{sc}, \mathcal{W}^*)$ incurred by estimating $Y_{1T^*}(0)$ with the SC method is at most the misspecification error $B_j := d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j)$, as in (4):

$$Y^{B_j,-}_{1T^*}(0) := \inf_{w \in \mathbb{R}^{J}} \left\{ Y^\top_{T^*} w : m(w) \le d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) \right\}$$
$$Y^{B_j,+}_{1T^*}(0) := \sup_{w \in \mathbb{R}^{J}} \left\{ Y^\top_{T^*} w : m(w) \le d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) \right\} \tag{13}$$

(c) Compute the bounds $T_{B_j,T^*}$ on the treatment effect $\tau_{T^*}$ under the assumption that the misspecification error $d_m(w^{sc}, \mathcal{W}^*)$ incurred by estimating $Y_{1T^*}(0)$ with the SC method is at most the misspecification error $B_j$, as in (9):

$$T_{B_j,T^*} := \left[ Y_{1T^*} - Y^{B_j,+}_{1T^*}(0),\; Y_{1T^*} - Y^{B_j,-}_{1T^*}(0) \right] \tag{14}$$

2. Compute the minimum misspecification error $B$ needed for $0 \in T_{B,T^*}$, i.e. for $0$ to be a plausible treatment effect estimate, as in (10):

$$B := \inf_{w \in \mathbb{R}^{J}} m(w) \quad \text{s.t.} \quad Y^\top_{T^*} w = Y_{1T^*}, \tag{15}$$

and find the control unit $j$ with the largest misspecification error still smaller than $B$, i.e. where $B_j \le B \le B_{j+1}$.

3. Visualize the treatment effect bounds $T_{B_j,T^*}$ for each $j \in \mathcal{J}$ and the misspecification errors $B_j$ and $B_{j+1}$ in a plot like Figure 3a, and report the percentage $\nu = (j-1)/J$ of control units whose misspecification errors $B_j$ are smaller than $B$.
$w$ to lie in the set $C$ when $m_j$ is used in minimization problems:

$$\psi_C(w) := \begin{cases} 0 & w \in C \\ \infty & \text{otherwise.} \end{cases} \tag{17}$$

The denominator of the fraction in (16) is just the pre-treatment prediction error incurred by the canonical SC estimator, since $\|x_j - X_{-j}\tilde w\|$ is exactly the objective function minimized in (6) to construct a synthetic control and $\psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w)$ just ensures that the minimizer of $\|x_j - X_{-j}\tilde w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w)$ is a vector of valid SC weights. Then, if we use $m^{err}_j$ as the misspecification error metric in our proposed sensitivity analysis, we can interpret the misspecification error $d_{m^{err}_j}(w^{(j)sc}, \mathcal{W}^*_j)$ as the minimum amount of additional pre-treatment prediction error (relative to the minimum possible) a researcher would have to tolerate for a vector of SC weights that yields a correct prediction of $Y_{jT^*}(0)$ to be considered a "reasonable" choice of weights.

As it happens, the weights that solve (12) under $m^{err}_j$ are exactly the weights that yield the green outcome trends in Figure 1 that match Virginia's and Delaware's outcomes in 2000 and achieve the smallest possible pre-treatment prediction error magnitudes while doing so. Further, suppose we treat unit $j$ as the treated unit and the other $J-1$ control units as the donor pool. Then the sets $[Y^{B_j,-}_{jT^*}(0), Y^{B_j,+}_{jT^*}(0)]$ for $j \in \mathcal{J}$, with endpoints defined analogously to (13), i.e. the sets that contain the plausible predicted control outcomes for each unit $j$ assuming misspecification error is no larger than $j$'s own true misspecification error, are exactly the red dashed intervals in Figures 1a and 1b.

Although $m^{err}_j$ has clear intuitive appeal, it does have several shortcomings. First, it is only well-defined if $X_{-j} w \neq x_j$ for all $w \in \Delta^{\mathcal{J}\setminus\{j\}}$; otherwise, the denominator in (16) will be zero, in which case $m^{err}_j$ is unusable given the dataset of interest.
Second, the sets of $w$ that perfectly predict the period-$T^*$ outcomes for the control units with the largest and smallest values of $Y_{jT^*}$ do not intersect with $\Delta^{\mathcal{J}\setminus\{j\}}$ at all, in which case $m^{err}_j(w)$ will be infinite for all feasible $w$ in (12). Then $d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) = \infty$ for the two units with the largest and smallest period-$T^*$ outcomes, meaning $T_{B_j,T^*} = (-\infty, \infty)$. Despite the fact that these bounds contain the whole real line, we do not intend their vacuousness to reflect that all treatment effects are equally plausible; we simply mean to convey that the particular bounds corresponding to the control units with extreme outcomes are uninformative about the treatment effect for the treated unit.

In addition to the generalization of our sensitivity analysis to other misspecification error metrics described above, we also extend our procedure to measure the sensitivity of alternative outcome contrast estimates in Section C.1 of the Appendix and to apply to effect estimates generated by other policy evaluation methods for panel data in Section C.2 of the Appendix. Further, we demonstrate in Section B of the Appendix how this generalized procedure can be used to account for potential non-uniqueness of the SC estimator in the sensitivity analysis from Section 2.

4.2 Choosing a Misspecification Error Metric

To understand how the choice of misspecification error metric can affect the output of Procedure 2, we compare the results of our sensitivity analysis based on $m^{wt}_j$ shown in Figure 3 to results based on two additional misspecification error metrics, which we review below:

1. Unconstrained weight space: $m^{wt}_j(w) = \|w^{(j)sc} - w\|$; as described above, using this metric yields the sensitivity analysis given in Procedure 1.
2. Constrained weight space: $m^{wt}_{\Delta^{\mathcal{J}\setminus\{j\}}}(w) := \|w^{(j)sc} - w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(w)$; this metric still measures distance in weight space but requires $w$ to lie in the set of valid SC weights.

3. Constrained error space: $m^{err}_j(w) := \dfrac{\|x_j - X_{-j}w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(w)}{\min_{\tilde w \in \mathbb{R}^{\mathcal{J}\setminus\{j\}}} \left\{\|x_j - X_{-j}\tilde w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w)\right\}} - 1$; as discussed in Section 4.1, this metric measures distance by the extra error incurred by $w$ relative to the error incurred by the vector of SC weights.

The results of repeating our earlier case studies with the alternative misspecification metrics described above are shown in Figure 6. First, the treatment effect bounds for the tobacco control program in California in Figure 6a uniformly indicate that the finding of a large, negative effect is robust, since for all choices of $m_j$, the misspecification error for California would need to be large relative to the misspecification errors of most control units. For a zero treatment effect to be plausible, the constrained weight space metric suggests California's misspecification error would need to be larger than that of 92.1% of the control units, and the remaining two metrics require error larger than that of 94.7% of control units. The results for the German reunification and Mariel boatlift settings shown in Figures 6b and 6c exhibit more variation across misspecification metrics. While the constrained error space metric suggests fairly robust results in the German reunification setting, the other two are less supportive.
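The three metrics reviewed above differ only in what they penalize, so they can be sketched side by side. The code below is an illustrative implementation under simplifying assumptions (a single target period, toy inputs, and SciPy's generic SLSQP solver rather than the paper's code): metric 1 has a closed form, while metrics 2 and 3 require constrained optimization.

```python
import numpy as np
from scipy.optimize import minimize

def d_unconstrained(w_sc, Y_post, y_target):
    """Metric 1: closed-form l2 distance from w_sc to the hyperplane
    {w : Y_post @ w = y_target}."""
    return abs(Y_post @ w_sc - y_target) / np.linalg.norm(Y_post)

def d_constrained(w_sc, Y_post, y_target):
    """Metric 2: the nearest point must also be a valid SC weight
    vector (on the simplex); solved numerically with SLSQP."""
    J = len(w_sc)
    cons = (
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},
        {"type": "eq", "fun": lambda w: Y_post @ w - y_target},
    )
    res = minimize(lambda w: np.sum((w - w_sc) ** 2), x0=w_sc,
                   bounds=[(0.0, 1.0)] * J, constraints=cons,
                   method="SLSQP")
    # Infeasible target (outside the convex hull of control outcomes)
    # means no valid weights reproduce y_target: infinite distance.
    return np.linalg.norm(res.x - w_sc) if res.success else np.inf

def m_err(w, X_pre, y_pre):
    """Metric 3: extra pre-treatment error of w relative to the best
    achievable error over valid SC weights; infinite off the simplex."""
    if np.any(w < -1e-9) or abs(w.sum() - 1.0) > 1e-9:
        return np.inf
    err = lambda v: np.linalg.norm(y_pre - X_pre @ v)
    J = len(w)
    best = minimize(lambda v: err(v) ** 2, x0=np.full(J, 1.0 / J),
                    bounds=[(0.0, 1.0)] * J,
                    constraints=({"type": "eq",
                                  "fun": lambda v: v.sum() - 1.0},),
                    method="SLSQP")
    denom = err(best.x)  # a zero denominator leaves m_err undefined
    return err(w) / denom - 1.0
```

Consistent with the discussion in the text, the constrained weight-space distance is always at least the unconstrained one at a fixed target, even though their percentile ranks across placebo units can order either way.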
In the Mariel boatlift setting, the treatment effect of the influx of immigrants on low-income wages in Miami is reasonably indistinguishable from zero using the unconstrained weight space and constrained error space metrics, as has been argued in the literature by other means.

Perhaps counterintuitively, Figure 6b demonstrates that the percentile rank of the misspecification error needed for a zero treatment effect to be plausible using the constrained weight space metric is smaller than the equivalent percentile rank using the unconstrained weight space metric. At first, this phenomenon may seem impossible since, holding the magnitude of misspecification error fixed, the bounds constructed by maximizing and minimizing over the unconstrained set of weights in (13) should be mechanically wider than the bounds constructed over the constrained set. However, recall that the units on the $x$-axis in Figure 6b correspond to the percentile ranks of the placebo misspecification errors, not their magnitudes. Because the relative sizes of the placebo misspecification errors also depend on the choice of metric, it

Figure 6: Plots of treatment effect bounds $T_{B_j,T^*}$ using different misspecification error metrics, corresponding to each control unit $j$'s misspecification error, computed using data from three papers using the SC method: (a) Effect of California's Tobacco Control Program; (b) Effect of German Reunification on GDP; (c) Effect of Mariel Boatlift on Low-Income Wages. The units of the $x$-axes are percentile ranks $p_j$ of the set of misspecification errors $\{B_j : j \in \mathcal{J}\}$. We shade each region between $B_j$ and $B_{j+1}$ in which our treatment effect bounds first contain zero in the color corresponding to the relevant misspecification error metric.
is certainly possible that, at a fixed percentile rank in the placebo misspecification error distribution, either metric could yield wider bounds. While this ambiguity may suggest visualizing the bounds defined via the two weight space metrics in terms of absolute misspecification error magnitudes, as discussed in Section 2.2.2, it is hard to determine what constitutes a reasonable amount of misspecification error measured using $\ell_2$-distances in weight space. Benchmarking against the placebo misspecification errors of the control units provides a more meaningful characterization of the robustness of SC estimates.

Unfortunately, given the ambiguous relationships between metrics discussed above, we cannot recommend a single preferred misspecification error metric for all settings. Rather, we believe the choice should be made based on the researcher's prior beliefs about the SC method's susceptibility to misspecification error. When comparing the constrained and unconstrained weight space metrics, the decision should be determined by the researcher's belief about the validity of the SC weight constraints. If the researcher just views the constraints as a convenient way of inducing sparsity in the SC weights, then conducting the sensitivity analysis while enforcing those constraints would fail to capture the possible misspecification error induced by the imposition of the constraints when choosing the SC weights. However, if the researcher believes the weight constraints capture important structural features of the setting, for example that treatment effect estimates based on extrapolation are undesirable, then they may wish to use the constrained weight space metric and only evaluate misspecification error incurred by the minimization of the wrong objective function when selecting the SC weights, not the weight constraints themselves.

The choice between the constrained weight and constrained error space metrics is more subtle.
The sensitivity analyses based on weight space metrics search over alternative weights without regard to the direction of the deviation when constructing treatment effect bounds. On the other hand, the constrained error space metric penalizes alternative weights that perform poorly on the original SC objective. Therefore, if the researcher does not believe pre-treatment fit is at all informative about post-treatment fit, they may prefer the weight space metrics. However, if the researcher maintains that good pre-treatment fit is a desirable and informative property of the weights used to construct counterfactual predictions, the constrained error space metric may make more sense. In principle, one could even interpolate between the different metrics.
In this paper, we demonstrate that pre-treatment fit is neither necessary nor sufficient for good post-treatment fit and that existing robustness checks often fail to capture the extent of this disconnect due to their heuristic motivations and ad-hoc interpretations. To structure conversations about the robustness of SC estimates, we provide researchers with a procedure to systematically assess SC estimate sensitivity to misspecification error in an interpretable, data-driven manner.

[Footnote: By extrapolation, we mean estimates of $Y_{1T^*}(0)$ that lie outside the range of control units' period-$T^*$ outcomes (Abadie, 2020).]

Building on our method, we could potentially develop a conditional prediction interval for the treated unit's outcome in a given period (Cattaneo et al., 2019). Of course, there is much more to be done to understand the viability (or lack thereof) of this general approach given the conceptual difficulty in measuring variability due to sampling in comparative case study settings, so we leave doing so to future work.

In conclusion, we hope that researchers will perform the sensitivity analysis outlined in Procedures 1 and 2 as part of their future comparative case studies employing the SC method and visualize their results as in Figures 3 and 6.

[Footnote: Such a perspective is reminiscent of the partial identification approach taken in Rambachan and Roth (2019) to allow for limited violations of the parallel trends assumption in the context of event studies.]

References

Alberto Abadie and Javier Gardeazabal. The economic costs of conflict: A case study of the Basque Country.
American Economic Review, 93(1):113–132, March 2003.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, 2010.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Comparative politics and the synthetic control method. American Journal of Political Science, 59(2):495–510, 2015.

Bruno Ferman and Cristine Pinto. Synthetic controls with imperfect pre-treatment fit, 2019.

Alberto Abadie. Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 2020.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Charles F. Manski and John V. Pepper. How do right-to-carry laws affect crime rates? Coping with ambiguity using bounded-variation assumptions. The Review of Economics and Statistics, 100(2):232–244, 2018. doi: 10.1162/REST_a_00689. URL https://doi.org/10.1162/REST_a_00689.

Sergio Firpo and Vitor Possebom. Synthetic control method: Inference, sensitivity analysis and confidence sets. Journal of Causal Inference, 6(2), 2018.

Victor Chernozhukov, Kaspar Wuthrich, and Yinchu Zhu. An exact and robust conformal inference method for counterfactual and synthetic controls. arXiv preprint arXiv:1712.09089, 2017.

Victor Chernozhukov, Kaspar Wuthrich, and Yinchu Zhu. Practical and robust t-test based inference for synthetic control and related methods. arXiv preprint arXiv:1812.10820, 2018.

Matias D. Cattaneo, Yingjie Feng, and Rocio Titiunik. Prediction intervals for synthetic control methods. arXiv preprint arXiv:1912.07120, 2019.

Kathleen T. Li. Statistical inference for average treatment effects estimated by synthetic control methods. Journal of the American Statistical Association, pages 1–16, 2019.

Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Alberto Abadie and Jeremy L'Hour. A penalized synthetic control estimator for disaggregated data. 2018.

Maxwell Kellogg, Magne Mogstad, Guillaume Pouliot, and Alexander Torgovitsky. Combining matching and synthetic controls to trade off biases from extrapolation and interpolation. Technical report, National Bureau of Economic Research, 2020.

Ward Cheney and David Kincaid. Linear algebra: Theory and applications. The Australian Mathematical Society, 110, 2009.

Iavor Bojinov and Neil Shephard. Time series experiments and causal estimands: Exact randomization tests and trading. Journal of the American Statistical Association, 114(528):1665–1682, 2019.

Ashesh Rambachan and Neil Shephard. Econometric analysis of potential outcomes time series: Instruments, shocks, linearity and the causal response function. arXiv preprint arXiv:1903.01637, 2019.

Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness metric: Can dropping a little data change conclusions?, 2020.

Giovanni Peri and Vasil Yasenov. The labor market effects of a refugee wave: Synthetic control method meets the Mariel boatlift. Journal of Human Resources, 54(2):267–309, 2019.

George J. Borjas. The wage impact of the Marielitos: A reappraisal. ILR Review, 70(5):1077–1110, 2017. doi: 10.1177/0019793917692945. URL https://doi.org/10.1177/0019793917692945.

Marianne Bertrand, Esther Duflo, and Sendhil Mullainathan. How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 119(1):249–275, 2004.

Andreas Hagemann. Inference with a single treated cluster. arXiv preprint arXiv:2010.04076, 2020.

Ashesh Rambachan and Jonathan Roth. An honest approach to parallel trends. Technical report, Working Paper, https://scholar.harvard.edu/files/jroth/files…, 2019.

Nikolay Doudchenko and Guido W. Imbens. Balancing, regression, difference-in-differences and synthetic control methods: A synthesis. Technical report, National Bureau of Economic Research, 2016.

Eli Ben-Michael, Avi Feller, and Jesse Rothstein. The augmented synthetic control method. arXiv preprint arXiv:1811.04170, 2018.

Dmitry Arkhangelsky, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. Synthetic difference in differences, 2020.
A Placebo Analysis of Procedures 1 and 2
We perform our sensitivity analysis on each of the control units in the three case studies as an extension of the placebo analysis used to assess the performance of our procedure at the end of Section 3. We treat each control unit as the placebo treated unit and run Procedure 2 under the three misspecification error metrics discussed in Section 4.2. For each placebo treated unit, our procedure returns the minimum percentile rank of the placebo control units' misspecification errors at which a zero treatment effect is in the range of effects plausible under the allowed misspecification error. Within each case study, we can vary the level of acceptable misspecification error by choosing a different percentile rank cutoff. At each proposed cutoff, we generate bounds on the treatment effect and observe the share of control units for which we correctly include the zero treatment effect.

In Figure 7, we report the results of this placebo exercise by plotting the share of control units for which our procedure generates bounds that correctly contain a zero treatment effect under each possible percentile rank cutoff. Across case studies and misspecification error metrics, a $p$th percentile rank cutoff is associated with correct predictions for around $p$ percent of the control units. This direct correspondence between percentile rank cutoff and the coverage of placebo units' outcomes is to be expected given the design of our placebo analysis.

When our sensitivity analysis is performed on the true treated unit, the set of placebo misspecification errors is calculated when, for each control unit $j$, we calculate the distance (in a chosen metric) between the canonical SC estimate using the remaining $J - 1$ control units and the closest weights that correctly predict the target outcome. If we were to plot the share of control units covered at each percentile rank of the misspecification error distribution generated in our analysis of the true treated unit, we would exactly recover the 45-degree line.
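The cutoff-versus-coverage computation described above can be sketched in a few lines. In this sketch, the per-unit minimum percentile ranks are random stand-ins for the actual output of Procedure 2, so the numbers are purely illustrative:

```python
import numpy as np

# Placebo coverage sketch (all data are hypothetical stand-ins).
# min_rank[j] plays the role of the smallest percentile rank cutoff at which
# placebo treated unit j's bounds first include a zero treatment effect.
rng = np.random.default_rng(0)
J = 20                                    # number of control units
min_rank = rng.uniform(0, 100, size=J)    # stand-in for Procedure 2 output

# At each candidate cutoff p, coverage is the share of placebo units whose
# bounds at that cutoff contain the (true) zero treatment effect.
cutoffs = np.arange(0, 101, 5)
coverage = [(min_rank <= p).mean() for p in cutoffs]
```

With uniformly distributed minimum ranks, the resulting coverage curve tracks the 45-degree line, mirroring the direct correspondence described above.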
However, when we perform the analysis on control unit $j$ as a placebo treated unit, the set of misspecification errors is constructed by creating placebo SC weights for the remaining placebo control units without unit $j$, that is, using $J - 2$ control units. Redefining the set of misspecification errors without placebo control unit $j$ can yield slightly different SC estimates, which generate the deviations seen in each panel of Figure 7.

Figure 7: We plot the results from our placebo analysis of Procedure 2 using three different misspecification error metrics in each of the three case studies, as in Section 4.2: (a) Effect of California's Tobacco Control Program; (b) Effect of German Reunification on GDP; (c) Effect of Mariel Boatlift on Low-Income Wages. On the $x$-axis, we vary the percentile rank cutoff that determines the width of the bounds generated by our procedure, and on the $y$-axis we show the share of placebo treated units for which those bounds correctly include a zero treatment effect. The dashed black 45-degree line demarcates $p$ percent of placebo units' true outcomes being covered with a misspecification error cutoff at the $p$th percentile rank.

In Figure 7, we also observe that the unconstrained weight space metric yields placebo coverage rates that always lie weakly above the 45-degree line. While this result is not theoretically guaranteed, we can see why we might expect such a phenomenon. Consider the placebo analysis of control unit $j$. Using the remaining $J - 1$ control units, the closed-form solution for the observable misspecification error under the unconstrained weight space metric is given in Equation (11): the ratio of the absolute SC residual to the $\ell$-norm of the vector of the $J - 1$ control unit outcomes, $B_0^{(j)} = |\hat R^{sc}_{jT^*}| / \| Y_{(-j)T^*} \|$. Placebo treated unit $j$'s misspecification error will be compared to the distribution of misspecification errors of the remaining $J - 1$ placebo control units, calculated as each placebo control unit $k$'s ratio of the absolute residual from predicting its post-treatment SC estimate using the remaining $J - 2$ control units to the $\ell$-norm of the vector of the $J - 2$ remaining control unit outcomes, $d(w^{(k)}_{sc}, \mathcal{W}^*_k) = |\hat R^{sc}_{kT^*}| / \| Y_{(-j,-k)T^*} \|$. Because most SC estimates are sparse and likely place little or no weight on unit $j$ anyway, the residuals will not change much whether $J - 1$ or $J - 2$ placebo control units are used to construct SC estimates, in which case the primary difference in misspecification error will be driven by the reduction in the norm of the vector of control unit outcomes with $J - 2$ units instead of $J - 1$ units, $\| Y_{(-j,-k)T^*} \| \leq \| Y_{(-j)T^*} \|$. Therefore, it is reasonable to expect that the treatment effect bounds for control unit $j$ will include a zero treatment effect at a weakly lower percentile rank in the placebo analysis than in the sensitivity analysis of the true treated unit, which is why the unconstrained weight space placebo coverage rates lie weakly above the 45-degree line.

While there is no observable pattern in the constrained weight space case, we do observe that the constrained error space metric yields placebo coverage rates that are mostly below the 45-degree line. Recall from Equation (16) and Procedure 2 that the constrained error space misspecification error (12) is the ratio of the pre-treatment error of the best-fitting synthetic control that achieves perfect post-treatment accuracy to the minimum SC pre-treatment error. When placebo treated unit $j$'s misspecification error is compared to the errors of the remaining $J - 1$ units that are fit with only $J - 2$ placebo control units, both the numerator and denominator of the control unit misspecification errors weakly increase, as pre-treatment fit can only get worse with fewer units.
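To make the mechanics concrete, the following sketch computes the unconstrained weight space misspecification error as a point-to-hyperplane distance and shows that dropping a control unit weakly shrinks the outcome-vector norm; all numbers are hypothetical and chosen only for illustration:

```python
import numpy as np

# Unconstrained weight space misspecification error: |SC residual| / ||Y||,
# i.e., the l2 distance from the SC weights to the hyperplane of weights
# that exactly predict the target post-treatment outcome. Data are made up.
rng = np.random.default_rng(1)
Y = rng.normal(100.0, 10.0, size=5)    # post-treatment control outcomes
w_sc = np.full(5, 1 / 5)               # stand-in SC weights
y_target = float(Y @ w_sc) + 3.0       # target outcome, off by a residual of 3

resid = Y @ w_sc - y_target            # equals -3.0 by construction
B = abs(resid) / np.linalg.norm(Y)     # closed-form misspecification error

# Projecting w_sc onto {w : Y @ w = y_target} attains exactly that distance.
w_star = w_sc - resid * Y / (Y @ Y)
assert np.isclose(Y @ w_star, y_target)
assert np.isclose(np.linalg.norm(w_star - w_sc), B)

# Dropping a unit weakly shrinks ||Y||, so the same residual implies a
# weakly larger misspecification error with fewer units in the donor pool.
B_drop = abs(resid) / np.linalg.norm(np.delete(Y, 0))
assert B_drop >= B
```

The last comparison is exactly the mechanism conjectured above for why the placebo coverage rates under this metric sit weakly above the 45-degree line.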
This behavior makes the relative magnitudes of control unit misspecification errors computed using $J - 2$ control units and the equivalent errors computed using $J - 1$ control units theoretically ambiguous. However, since the constrained error space metric placebo coverage rates tend to be below the 45-degree line, it must be that the misspecification errors with $J - 2$ units are typically smaller than with $J - 1$ units. This suggests that adding a control unit to the donor pool tends to decrease the pre-treatment fit error constrained to perform well in the post-treatment period of interest by less than it decreases the minimum achievable pre-treatment error.

B Handling Non-Uniqueness of the SC Estimator
As discussed in Footnote 5 in Section 2.1, it is possible that $x$ could lie in the convex hull of the columns of $X$, in which case the optimization problem (2) that defines the SC estimator could have multiple or even infinitely many solutions. In this case, the sensitivity analysis described in Section 2 would understate the impact of misspecification error on treatment effect estimates because it does not account for the multiplicity of valid SC estimators that could result from solving (2). Thankfully, we can apply the generalized sensitivity analysis described in Section 4.1 with an appropriate choice of misspecification error metric $m_j$ to allow for a weight-space-based sensitivity analysis that accounts for non-uniqueness.

In particular, we can choose $m^{wt,mult}_j(w)$ to measure the distance from $w$ to the closest (in $\ell$ distance) SC weights that solve (2). Formally, we let
$$\mathcal{W}^{(j)}_{sc} := \arg\min_{w \in \mathbb{R}^{J - 1\{j \neq 1\}}} \left\{ \| x_j - X_{(-j)} w \| + \psi_{\Delta^{J - 1\{j \neq 1\}}}(w) \right\}$$
with $\psi$ defined in (17), and define
$$m^{wt,mult}_j(w) := \inf_{\tilde w \in \mathbb{R}^{J - 1\{j \neq 1\}}} \left\{ \| \tilde w - w \| + \psi_{\mathcal{W}^{(j)}_{sc}}(\tilde w) \right\}.$$
If we apply our generalized sensitivity analysis using misspecification error metric $m^{wt,mult}_j$, the resulting treatment effect bounds will capture the impact of both misspecification error and potential non-uniqueness of the SC estimator because, in effect, the solutions to (12) and (13) will range over all weights that lie within $d_{m_j}(w^{(j)}_{sc}, \mathcal{W}^*_j)$ of some valid SC estimator, not just the particular SC estimator returned by the estimation procedure.

Note that since $\mathcal{W}^{(j)}_{sc}$ is defined as the solution set of a convex optimization problem and must therefore be a convex set (see Section 4.2.1 of Boyd and Vandenberghe (2004)), and since the infimum of a convex function in one of its arguments over a convex set must be convex in its remaining arguments (again, see Boyd and Vandenberghe (2004)), $m^{wt,mult}_j$ must be convex. However, (12) and (13) with $m^{wt,mult}_j$ directly plugged in are not formulated in manners that are particularly amenable to computation. Instead, we can write (12) using misspecification error metric $m^{wt,mult}_j$ in a form that is more directly solvable as follows:
$$\begin{aligned} d_{m_j}(w^{(j)}_{sc}, \mathcal{W}^*_j) := \inf_{\tilde w, w \in \mathbb{R}^{J-1}} \; & \| \tilde w - w \| \\ \text{s.t.}\; & Y^T_{(-j)T^*} w = Y_{jT^*} \quad \left( \Leftrightarrow w \in \mathcal{W}^*_j \right) \\ & \| x_j - X_{(-j)} \tilde w \| \leq \| x_j - X_{(-j)} w^{(j)}_{sc} \| \\ & 1^T \tilde w = 1 \\ & \tilde w \geq 0. \end{aligned} \tag{18}$$
Similarly, we can write the optimization problems in (13) in a form that facilitates the use of common convex solvers as follows:
$$\begin{aligned} Y^{B_j,-1}_{T^*}(0) := \inf_{\tilde w, w \in \mathbb{R}^{J-1}} \; & Y^T_{(-j)T^*} w \\ \text{s.t.}\; & \| \tilde w - w \| \leq d_{m_j}(w^{(j)}_{sc}, \mathcal{W}^*_j) \\ & \| x_j - X_{(-j)} \tilde w \| \leq \| x_j - X_{(-j)} w^{(j)}_{sc} \| \\ & 1^T \tilde w = 1 \\ & \tilde w \geq 0, \end{aligned} \tag{19}$$
and $Y^{B_j,+1}_{T^*}(0)$ is defined similarly.
Finally, we note that generalizing the constrained weight space sensitivity analysis discussed in Section 4.2 to allow for non-uniqueness is essentially the same as the process described above, with the additional constraints $1^T w = 1$ and $w \geq 0$ added to (18) and (19).

C Other Generalizations
C.1 Other Contrasts
While the treatment effect $\tau_{T^*}$ in period $T^*$ is a natural estimand in comparative case study settings, researchers are often interested in other linear contrasts of outcomes, like the average treatment effect across all post-treatment periods or the effect on the average slope of the treated unit's outcome path. Our sensitivity analysis can be extended naturally to assess the robustness of synthetic control-based estimates of these alternative estimands.

Let $Y_{j,T_0+1:T}(d) := (Y_{j,T_0+1}(d), \ldots, Y_{jT}(d))^T$ denote the vector containing unit $j$'s potential outcomes under treatment arm $d$ in each of the $T - T_0$ post-treatment periods, and suppose we are interested in assessing the robustness of an estimand $\tau_c$ parameterized by the vector $c := (c(1)^T, c(0)^T)^T \in \mathbb{R}^{2(T - T_0)}$:
$$\tau_c := c^T \begin{bmatrix} Y_{1,T_0+1:T}(1) \\ Y_{1,T_0+1:T}(0) \end{bmatrix} = c(1)^T Y_{1,T_0+1:T}(1) + c(0)^T Y_{1,T_0+1:T}(0).$$
For example, if we use the contrast vector $c_{T^*}$ defined entrywise as
$$[c_{T^*}(d)]_t := 1\{t = T^*\}(d - (1 - d)),$$
we recover the treatment effect in period $T^*$ studied in Section 2, $\tau_{T^*} = \tau_{c_{T^*}}$. If we instead use $c_{avg}$ defined entrywise as
$$[c_{avg}(d)]_t := \frac{1}{T - T_0}(d - (1 - d)),$$
we recover the average treatment effect $\tau_{c_{avg}}$ across the $T - T_0$ post-treatment periods. Using $c_{slo}$ defined entrywise as
$$[c_{slo}(d)]_t := \frac{1}{T - T_0 - 1}\left( 1\{t = T\} - 1\{t = T_0 + 1\} \right)(d - (1 - d))$$
yields the effect $\tau_{c_{slo}}$ on the average slope of the treated unit's outcome path, since the sums in the average slopes telescope:
$$\begin{aligned} \tau_{c_{slo}} :=\; & \frac{1}{T - T_0 - 1} \sum_{t = T_0 + 1}^{T - 1} \left( Y_{1,t+1}(1) - Y_{1t}(1) \right) - \frac{1}{T - T_0 - 1} \sum_{t = T_0 + 1}^{T - 1} \left( Y_{1,t+1}(0) - Y_{1t}(0) \right) \\ =\; & \frac{1}{T - T_0 - 1} \left\{ \left[ Y_{1T}(1) - Y_{1,T_0+1}(1) \right] - \left[ Y_{1T}(0) - Y_{1,T_0+1}(0) \right] \right\}. \end{aligned}$$

Next, let $Y_{j,T_0+1:T} := (Y_{j,T_0+1}(D_{j,T_0+1}), \ldots, Y_{jT}(D_{jT}))^T$ be the vector of unit $j$'s observed post-treatment outcomes, let $Y$ be the $(T - T_0) \times J$ matrix of control units' post-treatment outcomes, where the $j$th column of $Y$ is $Y_{j+1,T_0+1:T}$, and let $Y_{-j}$ denote the matrix $Y$ with its $j$th column deleted.
Then, once we have chosen a contrast $c$, we can write the synthetic control estimate of $\tau_c$ for the treated unit as follows:
$$\hat\tau^{sc}_c := c(1)^T Y_{1,T_0+1:T} + c(0)^T \hat Y_{1,T_0+1:T} = c(1)^T Y_{1,T_0+1:T} + c(0)^T Y w_{sc}.$$
Similarly, we can write the placebo treatment effect for the $j$th control unit using the other $J - 1$ control units as the donor pool as follows:
$$\hat\tau^{(j),sc}_c := c(1)^T Y_{j,T_0+1:T} + c(0)^T \hat Y_{j,T_0+1:T} = c(1)^T Y_{j,T_0+1:T} + c(0)^T Y_{-j} w^{(j)}_{sc}.$$
Given this characterization of $\hat\tau^{sc}_c$, modifying the general procedure described in Section 4 is relatively straightforward. First, we replace the constraints $Y^T_{(-j)T^*} w = Y_{jT^*}$ and $Y^T_{T^*} w = Y_{1T^*}$ requiring perfect post-treatment accuracy in period $T^*$ in the optimization problems (12) and (15) with the constraints $c(0)^T Y_{j,T_0+1:T} = c(0)^T Y_{-j} w$ and $c(0)^T Y_{1,T_0+1:T} = c(0)^T Y w$, which is equivalent to requiring correct treatment effect estimation for control unit $j$ in the case of (12) and for the treated unit under the assumption of no effect in the case of (15). The definition of $d_{m^{(j)}}(w^{(j)}_{sc}, \mathcal{W}^*_j)$ should also be updated accordingly. Next, we replace the computations of the bounds on $Y_{1T^*}(0)$ in (13) with the following bounds on the component of the treatment effect that depends on the counterfactual control outcomes $Y_{1,T_0+1:T}(0)$:
$$\mu^{B_j,-1}(0) := \min_{w \in \mathbb{R}^{J-1}} \left\{ c(0)^T Y_{-j} w : m^{(j)}(w) \leq d_{m^{(j)}}(w^{(j)}_{sc}, \mathcal{W}^*_j) \right\}$$
$$\mu^{B_j,+1}(0) := \max_{w \in \mathbb{R}^{J-1}} \left\{ c(0)^T Y_{-j} w : m^{(j)}(w) \leq d_{m^{(j)}}(w^{(j)}_{sc}, \mathcal{W}^*_j) \right\}.$$
Finally, we replace the bounds in (14) with the following bounds on $\tau_c$:
$$TB^c_j := \left[ c(1)^T Y_{1,T_0+1:T} + \mu^{B_j,-1}(0),\; c(1)^T Y_{1,T_0+1:T} + \mu^{B_j,+1}(0) \right].$$

C.2 Other Panel Data Methods
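To fix ideas, here is a small numpy sketch that constructs the three contrast vectors described in C.1 and evaluates a contrast estimate of the form $c(1)^T Y_1 + c(0)^T \hat Y_1$; the outcome values and the number of post-treatment periods are hypothetical:

```python
import numpy as np

# Contrast vectors over n_post post-treatment periods, following the
# definitions above: c(1) weights treated outcomes, c(0) weights the
# synthetic control predictions. All numbers here are illustrative.
n_post = 4

def contrast_period(k):
    """Effect in the k-th post-treatment period (0-indexed)."""
    c1 = np.zeros(n_post)
    c1[k] = 1.0
    return c1, -c1

def contrast_avg():
    """Average effect across the post-treatment periods."""
    c1 = np.full(n_post, 1.0 / n_post)
    return c1, -c1

def contrast_slope():
    """Effect on the average slope; the telescoping sum leaves only endpoints."""
    c1 = np.zeros(n_post)
    c1[-1], c1[0] = 1.0 / (n_post - 1), -1.0 / (n_post - 1)
    return c1, -c1

y_treated = np.array([5.0, 6.0, 8.0, 9.0])   # observed treated outcomes
y_synth = np.array([4.0, 4.5, 5.0, 5.5])     # SC predictions of Y(0)

c1, c0 = contrast_avg()
tau_avg = c1 @ y_treated + c0 @ y_synth      # mean per-period difference
```

Here `tau_avg` equals the mean of the per-period treatment effect estimates, and swapping in `contrast_period` or `contrast_slope` recovers the other two estimands.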
The outcomes-based SC method described in Section 2.1 is by no means the only method for generating counterfactual predictions in comparative case study settings. Besides the classic Difference-in-Differences estimator (Bertrand et al., 2004) and SC estimators that incorporate other pre-treatment covariates (Abadie et al., 2015), a whole suite of methods for panel data inspired by the SC method have been proposed in the past decade, including but not limited to the estimators proposed in Doudchenko and Imbens (2016), Chernozhukov et al. (2017), Ben-Michael, Feller, and Rothstein (2018), Abadie and L'Hour (2018), Arkhangelsky, Athey, Hirshberg, Imbens, and Wager (2020), and Kellogg et al. (2020).

As has been noted in Doudchenko and Imbens (2016), Chernozhukov et al. (2017), and Cattaneo et al. (2019), among others, we can write many of these alternative treatment effect estimators for panel data as affine functions of the control units' post-treatment outcomes $Y_{T^*}$ in period $T^*$:
$$\hat\tau_{T^*} := Y_{1T^*} - \left( \hat\mu + Y^T_{T^*} \hat w \right) = Y_{1T^*} - \begin{bmatrix} 1 & Y^T_{T^*} \end{bmatrix} \begin{bmatrix} \hat\mu \\ \hat w \end{bmatrix}, \tag{20}$$
where we now allow for an intercept term $\hat\mu$ in addition to weights on the control units' post-treatment outcomes.
As indicated by the second equality in (20), to allow for an intercept term, we can simply add an extra "control unit" to the donor pool with ones as all of its outcomes, so we assume we include such an intercept unit and omit the explicit intercept term in what follows.

Once we take this more general perspective, we can see that Procedure 1 easily generalizes to accommodate alternative policy evaluation methods that generate treatment effect estimates as in (20), since all the procedure requires are the weights on units' post-treatment outcomes that generate counterfactual outcome predictions. Specifically, the only difference is that instead of using the SC method to generate the weights used to compute the residuals in steps 1a and 2, we use the weights outputted by some alternative policy evaluation method.

Further, many of the methods listed above can be described as choosing weights to solve a particular instance of the following general convex program:
$$J_{\hat V, r, C}(w) := (x - Xw)^T \hat V (x - Xw) + r(w) + \psi_C(w), \qquad \hat w := \arg\min_{w \in \mathbb{R}^{J+1}} J_{\hat V, r, C}(w), \tag{21}$$
where $C$ is a convex set, $\psi_C$ is a penalty term as defined in (17) to ensure $\hat w \in C$, $\hat V$ is a weighting matrix that can be chosen in a potentially data-driven manner, $r : \mathbb{R}^{J+1} \to [0, \infty)$ is a convex penalty term that regularizes the weights $w$ in some fashion, the columns of $X$ can contain additional pre-treatment covariates beyond control units' pre-treatment outcomes, and we include an additional first column in $X$ containing ones in all the rows corresponding to pre-treatment outcomes and zeros in the other rows.
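One concrete instance of the general program (21) takes $\hat V = I$, an $\ell_2$ penalty $r(w) = \lambda \|w\|^2$, and no constraint set ($\psi_C \equiv 0$), which reduces to ridge regression on the control units' outcomes and admits a closed form. The sketch below is illustrative only, with synthetic data:

```python
import numpy as np

# Ridge instance of the general convex program (21): V_hat = I,
# r(w) = lam * ||w||^2, psi_C = 0. The first column of X is the intercept
# unit's column of ones; all data are synthetic.
rng = np.random.default_rng(2)
T0, J = 8, 5
X = np.column_stack([np.ones(T0), rng.normal(size=(T0, J))])  # intercept first
x = rng.normal(size=T0)                                       # treated unit's path
lam = 0.1

# argmin_w ||x - X w||^2 + lam * ||w||^2 has the familiar closed form.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(J + 1), X.T @ x)

# First-order condition of this instance of (21): the gradient vanishes.
grad = 2 * X.T @ (X @ w_hat - x) + 2 * lam * w_hat
assert np.allclose(grad, 0.0)
```

Other members of the family swap in a data-driven $\hat V$, a different penalty $r$, or a simplex constraint set $C$, at which point a generic convex solver replaces the closed form.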
For brevity, we leave it to the reader to see how the methods listed above can be written in the form of (21) in the papers introducing them. Given this common characterization of many alternative policy evaluation methods, we can see that, for an appropriately defined misspecification error metric $m_j(w)$ that measures the misspecification error incurred by an alternative policy evaluation method's weights relative to the weights $w$, the general Procedure 2 can be used without modification to assess the robustness of treatment effect estimates outputted by that alternative policy evaluation method. One particular misspecification error metric of interest is an analogue of the constrained error space metric $m^{err}_j$ defined in Section 4.1, corresponding to an alternative policy evaluation method defined by particular choices of $\hat V$, $r$, and $C$:
$$m^{gen,err}_j(w) = \frac{J_{\hat V, r, C}(w)}{\min_{\tilde w \in \mathbb{R}^{J - 1\{j \neq 1\}}} J_{\hat V, r, C}(\tilde w)} - 1.$$
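This generalized error-ratio metric is cheap to compute once the method's objective is in hand. The sketch below evaluates it for the ridge-style instance of (21) discussed above, with synthetic data and a perturbed candidate weight vector standing in for a method's output:

```python
import numpy as np

# Generalized constrained error space metric: the ratio of a candidate
# weight vector's objective value to the minimized objective, minus one.
# Ridge-style instance of (21); all data are synthetic.
rng = np.random.default_rng(3)
T0, J = 8, 5
X = rng.normal(size=(T0, J))
x = rng.normal(size=T0)
lam = 0.1

def J_obj(w):
    """Objective J_{V_hat, r, C} with V_hat = I, r(w) = lam*||w||^2, psi_C = 0."""
    return float((x - X @ w) @ (x - X @ w) + lam * w @ w)

w_min = np.linalg.solve(X.T @ X + lam * np.eye(J), X.T @ x)  # global minimizer
w_cand = w_min + 0.05 * rng.normal(size=J)                   # perturbed weights

m_gen_err = J_obj(w_cand) / J_obj(w_min) - 1.0
assert m_gen_err >= 0.0   # zero exactly at the minimizer, positive elsewhere
```

By construction the metric is zero at the method's own weights and grows as a candidate's objective value deteriorates, mirroring the behavior of $m^{err}_j$ in Section 4.1.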