Assessing the Sensitivity of Synthetic Control Treatment Effect Estimates to Misspecification Error
Billy Ferguson
Graduate School of Business, Stanford University
[email protected]

Brad Ross
Graduate School of Business, Stanford University
[email protected]
Last Updated February 23, 2021
Abstract
We propose a sensitivity analysis for Synthetic Control (SC) treatment effect estimates to interrogate the assumption that the SC method is well-specified, namely that choosing weights to minimize pre-treatment prediction error yields accurate predictions of counterfactual post-treatment outcomes. Our data-driven procedure recovers the set of treatment effects consistent with the assumption that the misspecification error incurred by the SC method is at most the observable misspecification error incurred when using the SC estimator to predict the outcomes of some control unit. We show that under one definition of misspecification error, our procedure provides a simple, geometric motivation for comparing the estimated treatment effect to the distribution of placebo residuals to assess estimate credibility. When we apply our procedure to several canonical studies that report SC estimates, we broadly confirm the conclusions drawn by the source papers.
∗ We are grateful for invaluable comments from and stimulating conversations with Bharat Chandar, Jiafeng Chen, Benny Goldman, Guido Imbens, Advik Shreekumar, Charlie Walker, and the participants in the Stanford Econometrics and Applied Lunches.

The Synthetic Control (SC) method was originally developed in Abadie and Gardeazabal (2003), Abadie, Diamond, and Hainmueller (2010), and Abadie, Diamond, and Hainmueller (2015) to estimate treatment effects in comparative case study settings, in which a researcher observes panel data on aggregate outcomes for a small number of large, heterogeneous units, only one of which receives some intervention of interest at some point in time. For the SC method to yield credible treatment effect estimates for the treated unit, researchers must assume that the SC method is well-specified: if there is a convex combination of control units' pre-treatment outcomes that closely approximates the pre-treatment outcomes of the treated unit, then that same convex combination of control units' post-treatment outcomes will yield good estimates of the treated unit's post-treatment control outcomes. While this assumption is necessary for tractable treatment effect estimation, it is unlikely to hold exactly in practice. In this paper, we develop a sensitivity analysis of SC treatment effect estimates that bounds the true treatment effect under the assumption that any deviations from this well-specified method assumption are at most as severe as the deviations observed in placebo analyses of the control units.

To build intuition for where this assumption might lead researchers astray, we present two placebo analyses in which we apply the SC method to panel data from the evaluation of a 1989 tobacco control program implemented in California, as in Abadie et al. (2010).
In particular, we use the SC method to predict per-capita tobacco sales in Virginia and Delaware in the year 2000 using the other members of the donor pool as control units. Since neither Virginia nor Delaware received the treatment and we observe their true control outcomes post-treatment, we can see whether the SC method correctly predicts no effects in either state.

In Figure 1a, we depict the observed, true control outcomes for Virginia alongside two different convex combinations of the remaining control units' outcome trends. The first is the orange synthetic control trend constructed in typical SC fashion, namely as the convex combination of control units' outcomes that most closely approximates Virginia's pre-treatment outcomes (Abadie et al., 2010). While this procedure yields a trend with good pre-treatment fit, it does a subpar job of predicting Virginia's control outcome in the year 2000.

Next, since we observe Virginia's control outcomes post-treatment in this placebo analysis, we can instead construct the "best-looking" (in a pre-treatment fit sense) convex combination of the remaining control units' outcome trends that matches Virginia's control outcome in 2000 exactly, shown in green. Perhaps surprisingly, there exists a convex combination of control units that exactly predicts our post-treatment outcome of interest while achieving only marginally worse pre-treatment fit than the best-fitting trend chosen by the SC method.

When we conduct the same exercise with Delaware as the "treated" unit of interest, we see in Figure 1b that, just as with Virginia, although the trend constructed by the SC method (again in orange) has good-looking pre-treatment fit, it does a poor job estimating the true control outcome of interest.
In addition, the researcher must assume that there are no idiosyncratic factors that affect the treated unit's counterfactual control outcomes post-treatment but not the control units' post-treatment outcomes besides their differing treatment statuses; one way to operationalize this idea is the linear factor model presented in Abadie et al. (2010) and studied in depth in Ferman and Pinto (2019), in which units' factor loadings do not vary before and after treatment. It is also important to assume that the treatment does not affect units in the donor pool, although researchers can simply exclude "control" units for which spillover effects are a concern (Abadie, 2020). Typically, such assumptions must be justified using domain knowledge about the setting of interest, so we do not concern ourselves with assessing their validity in this paper.

We will discuss how we can compute such a convex combination in Section 4. Note that doing so is only possible because Virginia's control outcome in 2000 lies between the minimum and maximum of the other control units' outcomes in 2000, in which case there are many convex combinations with no prediction error.

(a) Virginia Placebo Analysis (b) Delaware Placebo Analysis

Figure 1: Visualizations of placebo analyses with Virginia and Delaware as treated units; in both panels, the blue trend denotes the true control outcome trend of the placebo treated unit, the orange trend denotes the synthetic control trend selected by the SC method, and the green trend denotes the "best-looking" synthetic control trend that minimizes pre-treatment fit error while exactly matching the placebo treated unit's control outcome in 2000.

However, unlike when we used Virginia as the placebo treated unit, we cannot construct a convex combination of control units' outcome trends that matches both Delaware's control outcome in 2000
exactly and its pre-treatment outcomes well, so the best-looking convex combination of control units' trends we select to match Delaware's outcome in 2000 exactly (again in green) has unacceptable pre-treatment fit.

These two examples indicate we should interpret SC estimates with caution; the placebo analysis using Virginia suggests good pre-treatment fit is not sufficient for good post-treatment accuracy, while the placebo analysis using Delaware suggests good pre-treatment accuracy, while feasible, may not even be achievable alongside post-treatment accuracy. As discussed in Section 4, the additional pre-treatment fit error incurred by the green trends beyond the minimum error incurred by the orange trends is one natural measure of misspecification error.

This perspective on the informativeness of pre-treatment fit (or lack thereof) is also the motivation for our proposed sensitivity analysis. While we do not observe the treated unit's counterfactual post-treatment outcomes and thus cannot compute its misspecification error, we can compute the misspecification errors incurred by the SC method when we use it to predict control units' post-treatment outcomes, as in the placebo analyses of Virginia and Delaware. Our procedure assumes the treated unit's misspecification error is at most the misspecification error of a given control unit and computes the set of treatment effects consistent with the assumption that the unknown misspecification error incurred by the SC method is at most this error bound.

In Figure 2a, we depict the sets of plausible counterfactual control outcomes for California computed by our procedure consistent with the assumptions that the SC method's misspecification error for California is at most the observed misspecification errors for Virginia and Delaware, indicated by the green and purple dotted intervals respectively.
For intuition, we also include examples of predicted counterfactual trends for California that satisfy these error bounds in light green for Virginia and light purple for Delaware.

(a) California SC Predictions (b) California Treatment Effect Bounds
Figure 2: In Figure 2a, the blue trend denotes the observed outcome trend for California and the orange trend denotes the synthetic control trend selected by the SC method. The green and purple dotted intervals depict the sets of plausible counterfactual control outcomes for California under Virginia and Delaware's misspecification errors respectively, and we provide examples of counterfactual trends for California that satisfy the error bound for Virginia in light green and the error bound for Delaware in light purple. Figure 2b presents the plausible treatment effect bounds corresponding to the misspecification errors of all the control units in the donor pool in order of error magnitude, with the bounds for Virginia and Delaware highlighted with green and purple dashed lines. The red region indicates where in the distribution of misspecification errors a zero treatment effect first becomes plausible.
Our procedure also finds the minimum misspecification error necessary for zero to be a plausible treatment effect, which we can use to assess how reasonable a zero treatment effect would be by benchmarking it against the placebo misspecification errors described above. In Figure 2b, we show the treatment effect bounds corresponding to each control unit's misspecification error in order of increasing error magnitude, and we use the red region to highlight where in the distribution of misspecification errors a zero treatment effect first becomes plausible. We also illustrate how Virginia and Delaware's misspecification errors compare to those of the other control units by highlighting the treatment effect bounds corresponding to Virginia and Delaware's misspecification errors with green and purple dashed lines.

Although in general our proposed treatment effect bounds must be computed numerically using convex programming tools (Boyd and Vandenberghe, 2004), applying our procedure with misspecification error defined as the minimum distance between the SC weights and any vector of weights with perfect predictive accuracy yields closed-form bounds whose widths are determined by scaled-up residuals from predicting control units' outcomes with the SC method. As such, one can view our procedure applied with this misspecification error metric as a geometric motivation for comparing the estimated treatment effect to the distribution of placebo residuals to assess estimate credibility.

As we discuss in more detail in Section 4.1, the control units with the largest and smallest post-treatment outcomes in 2000 cannot be perfectly predicted using a convex combination of the remaining control units' outcomes, so the misspecification errors incurred by the SC method for these two placebo treated units are infinite. As a result, the sets of plausible treatment effects corresponding to these extremal units mechanically span the real line, so they cannot be plotted, and they cannot rule out a zero treatment effect.
To introduce the ideas underlying our sensitivity analysis, in this section we present a particular version of our proposed procedure based on a measure of misspecification error that yields intuitive, closed-form expressions for our treatment effect bounds. Later, we generalize the procedure to accommodate other valuable notions of misspecification error.

2.1 Notation and the SC Method
Before describing our procedure, we introduce some necessary notation and review the SC method. In the canonical setting used to motivate the SC method, we acquire data about J + 1 units across T time periods [T] := {1, ..., T} to estimate the effect of a policy intervention, referred to as the treatment, affecting a single treated unit indexed by j = 1. The treatment is first implemented just after period T_0 < T and stays in effect for all remaining periods T_0 + 1, ..., T. The set of J remaining control units J := {2, ..., J + 1} that are not affected by the treatment is called the donor pool. For each unit j ∈ {1, ..., J + 1} and each time period t ∈ [T], we let Y_{jt}(1) and Y_{jt}(0) denote that unit's potential outcomes in that period under treatment and lack thereof, respectively (Imbens and Rubin, 2015). Next, let the indicator D_{jt} = 1 if unit j is exposed to the treatment in period t and D_{jt} = 0 otherwise. We then let Y_{jt} := D_{jt} Y_{jt}(1) + (1 − D_{jt}) Y_{jt}(0) denote the potential outcome we observe for unit j in period t, and let Y_t := (Y_{2t}, ..., Y_{(J+1)t})' denote the vector of control units' observed outcomes in period t.

Typically, the goal in comparative case study settings like these is to estimate the treatment effect on the treated unit (with index j = 1) in some post-treatment period T* > T_0:

    τ_{T*} := Y_{1T*}(1) − Y_{1T*}(0).

Because we only observe Y_{1T*} = Y_{1T*}(1), and not Y_{1T*}(0), estimating τ_{T*} reduces to estimating Y_{1T*}(0). Although there are many ways one could do so, the SC method assumes it is possible to compute Y_{1T*}(0) using a weighted sum of the control units' outcomes in period T* (Abadie and Gardeazabal, 2003, Abadie et al., 2010):

    Y_{1T*}(0) = Y_{T*}'w = Σ_{j=2}^{J+1} w_j Y_{jT*}(0)   for some w = (w_2, ..., w_{J+1})' ∈ R^J.   (1)

We refer to this weighted combination of control units' outcome trends as a synthetic control. In particular, Abadie et al.
(2010) propose choosing weights w that make the weighted average of the control units' pre-treatment outcomes as similar as possible to the treated unit's pre-treatment outcomes. Let x_j := (Y_{j1}, ..., Y_{jT_0})' be the vector of unit j's observed, pre-treatment outcomes and X be the T_0 × J matrix whose columns are the control units' observed pre-treatment outcomes (i.e. X's j-th column is given by x_{j+1}). Then we can write the SC estimator as the minimizer of pre-treatment prediction error over the set of positive weights that sum to one:

    w_sc := argmin_{w ∈ R^J} ‖x_1 − Xw‖_2   s.t.   1'w = 1,  w ≥ 0.   (2)

Later, we will use Δ_J := {w ∈ R^J : w ≥ 0, 1'w = 1} to denote the set of valid SC weights.

Once w_sc has been computed, we can estimate τ_{T*} by

    τ̂_sc_{T*} := Y_{1T*} − Y_{T*}'w_sc = Y_{1T*}(1) − Σ_{j=2}^{J+1} w_sc,j Y_{jT*}(0).

If (1) holds for weights w = w_sc, we say the SC method is well-specified, in which case we have that τ̂_sc_{T*} = τ_{T*}. However, as illustrated in Section 1, the SC method is unlikely to be well-specified in practice.

Abadie et al. (2010), Chernozhukov et al. (2017), and Ferman and Pinto (2019) discuss several models under which such an assumption is reasonable.

While uncommon in practice, x_1 could in principle lie in the convex hull of the columns of X, in which case (2) could have an infinite number of solutions with perfect pre-treatment fit, some of which would provide better post-treatment fit than others (Abadie, 2020). In the sections that follow, we assume perfect pre-treatment fit is not achievable because it is empirically rare and doing so allows us to develop the more intuitive sensitivity analysis presented in Section 2. However, in Section B of the Appendix, we discuss in detail how the generalized sensitivity analysis described in Section 4.1 can easily account for non-uniqueness of the SC estimator when generating treatment effect bounds.
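The program in (2) is a small convex problem that off-the-shelf optimizers handle directly. The sketch below is a minimal illustration, not the authors' implementation; it uses SciPy's SLSQP solver on simulated data, and the names `sc_weights`, `x1`, and `X0` are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def sc_weights(x1, X0):
    """Solve (2): minimize ||x1 - X0 @ w||_2 over the simplex {w >= 0, sum(w) = 1}.

    x1 : (T0,) pre-treatment outcomes of the treated unit.
    X0 : (T0, J) pre-treatment outcomes of the J control units, one per column.
    """
    T0, J = X0.shape
    result = minimize(
        lambda w: np.sum((x1 - X0 @ w) ** 2),  # squared pre-treatment fit error
        np.full(J, 1.0 / J),                   # start from uniform weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,               # w >= 0 (w <= 1 then follows anyway)
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # 1'w = 1
    )
    return result.x

# Toy data: the "treated" trend is close to a 60/40 mix of the first two controls.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(10, 4))
x1 = 0.6 * X0[:, 0] + 0.4 * X0[:, 1] + 0.01 * rng.normal(size=10)
w_sc = sc_weights(x1, X0)
```

Any quadratic programming solver would return the same weights; the simplex constraints are what keep the fitted weights interpretable as a weighted average of control units.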
In the next section, we will introduce a natural way to measure the degree to which the SC method deviates from well-specification, on which we will base our sensitivity analysis.

In Section C.2 of the Appendix, we discuss how the sensitivity analyses we introduce below can also apply to extensions of the SC method that incorporate additional pre-treatment covariates, relax the convex weight constraints, add an intercept term, and minimize different and sometimes data-adaptive objective functions. For expositional clarity, however, the basic SC method presented here will suffice to motivate our proposed procedures.

We also note that Abadie and L'Hour (2018) and Kellogg, Mogstad, Pouliot, and Torgovitsky (2020) propose modifying the SC objective to penalize solutions that interpolate more between units, since such solutions will yield worse predictions if the relationship between pre-treatment outcomes and post-treatment outcomes is nonlinear. Our sensitivity analysis can also be applied to these alternative estimators, as we detail in Section C.2 of the Appendix.

Despite the concerns about the effectiveness of the SC method raised in Section 1, we can still attempt to assess what the true value of Y_{1T*}(0) might be under limited misspecification error. Since τ̂_sc_{T*} = Y_{1T*}(1) − Y_{T*}'w_sc is an affine function of Y_{T*} and τ_{T*} is a scalar, it is always possible to choose some set of weights w ∈ R^J such that Y_{T*}'w = Y_{1T*}(0) and thus Y_{1T*}(1) − Y_{T*}'w = τ_{T*}. More importantly, such weights are not at all unique; in fact, the set of optimal weights

    W* := {w ∈ R^J : Y_{T*}'w = Y_{1T*}(0)}

forms a (J − 1)-dimensional hyperplane in R^J. Thus, a natural measure of misspecification error in the SC weights w_sc is the difference between w_sc and the closest weights w* to w_sc in W*, where distance is measured by the ℓ2-norm. More formally,
we can define w* like so:

    w* := argmin_{w ∈ R^J} ‖w_sc − w‖_2   s.t.   Y_{T*}'w = Y_{1T*}(0)  (⇔ w ∈ W*).   (3)

Note that we do not restrict ourselves to considering weights within the set of convex weights Δ_J; though such a restriction prevents SC estimates from extrapolating beyond the outcomes in the data (Abadie, 2020), it may be that the closest weights that allow for optimal prediction of Y_{1T*}(0) lie outside Δ_J, or that W* and Δ_J do not overlap at all. In what follows, we will frequently focus on the magnitude of misspecification error, which we denote by d(w_sc, W*) := ‖w_sc − w*‖_2.

Since W* is a hyperplane, we could in principle solve (3) by projecting w_sc onto W*. Because we do not observe Y_{1T*}(0), we cannot do so in practice. However, if we are willing to assume d(w_sc, W*) ≤ B for some bound B ≥ 0, then there must be some weight vector w ∈ R^J within a radius-B ℓ2-ball around w_sc such that Y_{T*}'w = Y_{1T*}(0). Crucially, this assumption limits the magnitude of method misspecification error while allowing the direction of that error to remain arbitrary. If we let

    Ŵ_B := {w ∈ R^J : ‖w_sc − w‖_2 ≤ B}

denote the set of all weights ℓ2-distance at most B away from w_sc, then we know that the true potential outcome Y_{1T*}(0) lies within the following set of values:

    Y^B_{1T*}(0) := Y_{T*}'Ŵ_B = {Y_{T*}'w : ‖w_sc − w‖_2 ≤ B}.

Since the function w ↦ Y_{T*}'w is continuous in w and Ŵ_B is compact, the set Y^B_{1T*}(0) containing Y_{1T*}(0) must be a closed interval in R. As a result, we can characterize the interval Y^B_{1T*}(0) by computing its endpoints Y^{B,−}_{1T*}(0) and Y^{B,+}_{1T*}(0), which are the solutions to the following two optimization problems:

    Y^{B,−}_{1T*}(0) := min_{w ∈ R^J} Y_{T*}'w   s.t. ‖w_sc − w‖_2 ≤ B
    Y^{B,+}_{1T*}(0) := max_{w ∈ R^J} Y_{T*}'w   s.t.
‖w_sc − w‖_2 ≤ B.   (4)

Since Y^{B,−}_{1T*}(0) and Y^{B,+}_{1T*}(0) are defined as the extrema of linear functions on an ℓ2-ball centered at w_sc, they can easily be computed in closed form:

    Y^{B,−}_{1T*}(0) = Y_{T*}'w_sc − B‖Y_{T*}‖_2,
    Y^{B,+}_{1T*}(0) = Y_{T*}'w_sc + B‖Y_{T*}‖_2.

Then, since τ_{T*} is linear in Y_{1T*}(0), we can translate these bounds on Y_{1T*}(0) into bounds on τ_{T*}:

    τ_{T*} ∈ T^B_{T*} := [Y_{1T*}(1) − Y^{B,+}_{1T*}(0), Y_{1T*}(1) − Y^{B,−}_{1T*}(0)] = τ̂_sc_{T*} + B‖Y_{T*}‖_2 · [−1, 1].   (5)

2.2.2 Bound Calibration via Placebo Effect Estimation

Unfortunately, the discussion in Section 2.2.1 does not make it clear how one should choose an appropriate misspecification error bound B from which the bounds on τ_{T*} in (5) can be constructed. However, since we do observe Y_{jT*} = Y_{jT*}(0) for each control unit j ∈ J, we can use a distance measure similar to d(w_sc, W*) to quantify the misspecification error in SC estimates of Y_{jT*}(0) for j ∈ J using the remaining J − 1 control units as donor pools. Then, we can assume the treated unit's post-treatment potential outcome Y_{1T*}(0) is no more difficult to estimate using the SC method than some percentage of the control units' post-treatment control outcomes and use these measures to inform our choice of bound B.

Importantly, the methodology we propose below based on this intuition only relies on the assumption that the magnitude of the misspecification error for the treated unit is no larger than the magnitudes of the placebo misspecification errors for some percentage of the control units.
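Returning to the interval (5): given a candidate error bound B, its endpoints are immediate to compute. The snippet below (all numbers made up purely for illustration, names ours) also checks that the lower endpoint is attained at the extreme point w_sc + B·Y_{T*}/‖Y_{T*}‖_2 of the ball, i.e. the weights that maximize the predicted counterfactual.

```python
import numpy as np

def effect_bounds(tau_hat_sc, B, Y_post):
    """Interval (5): tau_hat_sc + B * ||Y_post||_2 * [-1, 1]."""
    half_width = B * np.linalg.norm(Y_post)
    return tau_hat_sc - half_width, tau_hat_sc + half_width

rng = np.random.default_rng(1)
w_sc = rng.normal(size=6)       # stand-in SC weights
Y_post = rng.normal(size=6)     # stand-in for controls' outcomes in period T*
Y1_obs, B = 2.0, 0.5            # made-up treated outcome and error bound
tau_hat_sc = Y1_obs - Y_post @ w_sc
lo, hi = effect_bounds(tau_hat_sc, B, Y_post)

# The weights on the ball's boundary that maximize Y_post @ w; they produce
# the largest plausible counterfactual and hence the smallest plausible effect.
w_max = w_sc + B * Y_post / np.linalg.norm(Y_post)
```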
Given that the differences in characteristics between the treated and control units are a primary reason researchers should use the SC method in the first place (Abadie, 2020), it is likely implausible that the unknown direction of the treated unit's misspecification error is similar to the directions of the control units' placebo misspecification errors.

To formalize the ideas presented above, we first define the following quantities analogous to X, Y_{T*}, w_sc, and W* when we view control unit j as a placebo treated unit and the other J − 1 control units as the donor pool: let X_{−j} be the T_0 × (J − 1) matrix whose columns are the pre-treatment outcomes of the J − 1 control units other than j, let Y_{(−j)T*} := (Y_{kT*})_{k ∈ J, k ≠ j}' be the (J − 1)-vector of the observed control outcomes of the J − 1 control units besides j, and let w^{(j)}_sc ∈ R^{J−1} be the synthetic control weights chosen as if control unit j were the treated unit and the remaining J − 1 control units were the donor pool, i.e. by solving the following optimization problem similar to (2):

    w^{(j)}_sc := argmin_{w ∈ R^{J−1}} ‖x_j − X_{−j}w‖_2   s.t.   1'w = 1,  w ≥ 0.   (6)

Finally, let W*_j := {w ∈ R^{J−1} : Y_{(−j)T*}'w = Y_{jT*}(0)} denote the set of weight vectors w ∈ R^{J−1} that yield placebo unit j's control outcome in period T*.

Since we observe Y_{jT*} = Y_{jT*}(0) for placebo unit j, we can actually compute the distance d(w^{(j)}_sc, W*_j), defined analogously to the unobservable d(w_sc, W*) in (3):

    d(w^{(j)}_sc, W*_j) := min_{w ∈ R^{J−1}} ‖w^{(j)}_sc − w‖_2   s.t.   Y_{(−j)T*}'w = Y_{jT*}  (⇔ w ∈ W*_j).   (7)

(Here X_{−j}'s k-th column is given by x_{k+1} if k < j and x_{k+2} if k > j.) Let R̂_sc_{jT*} := Y_{(−j)T*}'w^{(j)}_sc − Y_{jT*} denote the residual from the SC estimator used to predict Y_{jT*} = Y_{jT*}(0).
Then as with (3), (7) is a basic projection problem with a closed-form solution (see Cheney and Kincaid (2009), pages 450–451, for example):

    d(w^{(j)}_sc, W*_j) = |R̂_sc_{jT*}| / ‖Y_{(−j)T*}‖_2 = |Y_{(−j)T*}'w^{(j)}_sc − Y_{jT*}| / ‖Y_{(−j)T*}‖_2.   (8)

Although d(w^{(j)}_sc, W*_j) is defined purely geometrically, choosing the ℓ2-norm to measure distance in weight space implies d(w^{(j)}_sc, W*_j) can also be characterized as a scaled variant of the absolute placebo SC residual |R̂_sc_{jT*}| for control unit j using the J − 1 other control units as the donor pool. We will discuss this observation in more detail in Section 2.3.

For notational convenience, we use the shorthand B_j = d(w^{(j)}_sc, W*_j) and assume control units' indices align with the sorted order of their respective B_j values, so that the j-th control unit has the (j − 1)-th-smallest B_j. Then once we have computed d(w^{(j)}_sc, W*_j) for all j ∈ J, we can compute bounds on the treatment effect based on (5) for each j ∈ J by choosing B = B_j := d(w^{(j)}_sc, W*_j):

    T^{B_j}_{T*} = τ̂_sc_{T*} + |R̂_sc_{jT*}| (‖Y_{T*}‖_2 / ‖Y_{(−j)T*}‖_2) · [−1, 1].   (9)

Another natural quantity of interest is the minimum bound B on d(w_sc, W*) such that a zero treatment effect lies within T^B_{T*}, i.e.

    B̄ := min {B ≥ 0 : 0 ∈ T^B_{T*}}.

With B̄ in hand, we can then find the control unit j ∈ J such that B_j ≤ B̄ ≤ B_{j+1} (where B_{J+2} := ∞) and report the statistic ν := (j − 1)/J, interpreted as the fraction of control units for which it would have to be "easier" for the SC method to estimate Y_{jT*}(0) than Y_{1T*}(0) if the treatment effect τ_{T*} for the treated unit were actually zero.
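Equations (8)–(9) and the ν statistic translate directly into a few lines of code. The sketch below uses made-up numbers and our own function names; the threshold B̄ is taken as given here (its closed form for the treated unit appears in (11)).

```python
import numpy as np

def placebo_interval(tau_hat_sc, w_j_sc, Y_post, Y_post_minus_j, Y_j_obs):
    """Placebo error B_j via (8) and the effect interval T^{B_j} via (9)."""
    resid = Y_post_minus_j @ w_j_sc - Y_j_obs          # placebo SC residual
    B_j = abs(resid) / np.linalg.norm(Y_post_minus_j)  # distance to W*_j, eq. (8)
    half_width = B_j * np.linalg.norm(Y_post)          # = |resid| * N_j, eq. (9)
    return B_j, (tau_hat_sc - half_width, tau_hat_sc + half_width)

def nu_statistic(B_bar, placebo_errors):
    """Fraction of control units whose placebo error B_j lies below B_bar."""
    B_sorted = np.sort(np.asarray(placebo_errors))
    return np.searchsorted(B_sorted, B_bar) / B_sorted.size

# Made-up numbers purely to exercise the formulas.
Y_post = np.array([3.0, 1.0, 2.0, 4.0])     # all controls' outcomes in period T*
Y_post_minus_j = np.array([1.0, 2.0, 4.0])  # dropping placebo unit j
w_j_sc = np.array([0.5, 0.25, 0.25])        # placebo SC weights for unit j
B_j, (lo, hi) = placebo_interval(0.0, w_j_sc, Y_post, Y_post_minus_j, Y_j_obs=2.5)

nu = nu_statistic(B_bar=0.3, placebo_errors=[0.05, 0.10, 0.40, 0.90])
```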
For the purposes of computation, B̄ can be defined similarly to d(w_sc, W*), with the unobserved Y_{1T*}(0) replaced by the observed outcome Y_{1T*} of the treated unit in period T*:

    B̄ := min_{w ∈ R^J} ‖w_sc − w‖_2   s.t.   Y_{T*}'w = Y_{1T*}.   (10)

As with (7), B̄ can easily be computed in closed form by projecting w_sc onto the hyperplane {w ∈ R^J : Y_{T*}'w = Y_{1T*}}:

    B̄ = |Y_{T*}'w_sc − Y_{1T*}| / ‖Y_{T*}‖_2.   (11)

For reference, we summarize the sensitivity analysis procedure we have developed above in Procedure 1. We also demonstrate one way to visualize T^{B_j}_{T*} for each j ∈ J along with B_j and B_{j+1} in Figure 3a using data on California's 1989 tobacco control program analyzed in Abadie et al. (2010). In the figure, the units of the x-axis are

Procedure 1. Sensitivity Analysis
1. For each control unit j ∈ J:

   (a) Use the SC method to predict unit j's outcome in period T*, treating the other J − 1 control units as the donor pool; compute the observed residual from this prediction, R̂_sc_{jT*} = Y_{(−j)T*}'w^{(j)}_sc − Y_{jT*}.

   (b) Compute the bounds T^{B_j}_{T*} on the treatment effect τ_{T*} under the assumption that the misspecification error d(w_sc, W*) incurred by estimating Y_{1T*}(0) with the SC method is at most the misspecification error B_j incurred by the SC method in Step 1a:

       T^{B_j}_{T*} = τ̂_sc_{T*} + |R̂_sc_{jT*}| (‖Y_{T*}‖_2 / ‖Y_{(−j)T*}‖_2) · [−1, 1].
2. Compute the minimum misspecification error B̄ needed for 0 ∈ T^{B̄}_{T*}, i.e. for zero to be a plausible treatment effect estimate:

       B̄ = |Y_{T*}'w_sc − Y_{1T*}| / ‖Y_{T*}‖_2,

   and find the control unit j with the largest misspecification error still smaller than B̄, i.e. where B_j ≤ B̄ ≤ B_{j+1}.

3. Visualize the treatment effect bounds T^{B_j}_{T*} for each j ∈ J and the misspecification errors B_j and B_{j+1} in a plot like Figure 3a, and report the percentage ν = (j − 1)/J of control units whose misspecification errors B_j are smaller than B̄.

percentile ranks p_j := (j − 1)/J of the ordered set of placebo misspecification errors {B_j : j ∈ J} rather than the units of B_j, so that it is easy to read ν off of the x-axis where the red shaded region begins.

Before proceeding, we make note of several interesting properties of our proposed bounds T^{B_j}_{T*}. To do so, we define N_j := ‖Y_{T*}‖_2 / ‖Y_{(−j)T*}‖_2 so we can write

    T^{B_j}_{T*} = τ̂_sc_{T*} + |R̂_sc_{jT*}| N_j · [−1, 1].

Since Y_{(−j)T*} contains all of the entries of Y_{T*} except Y_{jT*}(0), we have that ‖Y_{(−j)T*}‖_2 ≤ ‖Y_{T*}‖_2, so N_j ≥ 1. Intuitively, this inflation of the placebo residual for unit j in T^{B_j}_{T*} corrects for the fact that the placebo SC procedure for estimating Y_{jT*}(0) has one fewer control unit at its disposal than the SC procedure for estimating Y_{1T*}(0) and thus has less flexibility to make extreme predictions than the SC procedure would for our actual task of interest.

Next, we can write N_j as

    N_j = sqrt( Σ_{k=2}^{J+1} Y_{kT*}(0)^2 / [Σ_{k=2}^{J+1} Y_{kT*}(0)^2 − Y_{jT*}(0)^2] ) = [1 − (|Y_{jT*}(0)| / ‖Y_{T*}‖_2)^2]^{−1/2},

enabling us to make two more observations.
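The two expressions for N_j above agree because ‖Y_{(−j)T*}‖_2^2 = ‖Y_{T*}‖_2^2 − Y_{jT*}(0)^2. A quick numerical check of that identity, with arbitrary made-up outcomes and names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
Y_post = rng.normal(size=8)        # stand-in for controls' outcomes in period T*
j = 3                              # index of the placebo unit being dropped
Y_post_minus_j = np.delete(Y_post, j)

# Direct definition N_j = ||Y|| / ||Y_{-j}|| versus the rewritten closed form.
N_direct = np.linalg.norm(Y_post) / np.linalg.norm(Y_post_minus_j)
N_formula = (1.0 - (abs(Y_post[j]) / np.linalg.norm(Y_post)) ** 2) ** -0.5
```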
First, N_j is increasing in the magnitude of Y_{jT*}(0) relative to ‖Y_{T*}‖_2, meaning T^{B_j}_{T*} is wider if unit j has a larger-magnitude outcome in period T* relative to the outcomes of the other control units, and thus could generate more extreme predictions if it contributed to the SC predicted outcome.

Second, under mild conditions, the bounds T^{B_j}_{T*} converge to purely residual-based bounds τ̂_sc_{T*} + |R̂_sc_{jT*}| · [−1, 1] as the size of the donor pool increases. Consider a sequence of donor pools indexed by their sizes, which with an abuse of notation we denote {J_J : J ∈ N}. Then, provided that the outcomes of the units in each of the donor pools do not grow too quickly or too slowly in magnitude, i.e. if

    lim_{J → ∞} max_{j ∈ J_J} |Y_{jT*}(0)| / ‖Y_{T*}‖_2 = 0,

the ratios N_j converge uniformly to 1 as the sample size J increases. As a consequence, for some α ∈ [0, 1], the bounds T^{B_⌈(1−α)J⌉}_{T*} calibrated to the (1 − α)-th percentile of the ordered set of placebo distances B_j will shrink towards the bounds τ̂_sc_{T*} + |R̂_sc_{⌈(1−α)J⌉T*}| · [−1, 1] as J → ∞.

As described in Section 2.2.2, we can view T^{B_j}_{T*} as the set of plausible treatment effects for the treated unit if we assume that the magnitude of misspecification error d(w_sc, W*) incurred by estimating Y_{1T*}(0) with the SC estimator is no larger than the magnitude of misspecification error d(w^{(j)}_sc, W*_j) incurred by treating unit j as the treated unit and estimating Y_{jT*}(0) with the placebo SC estimator. Then, fixing some fraction α ∈ [0, 1], T^{B_⌈(1−α)J⌉}_{T*} contains the set of plausible treatment effects under the assumption that it is "no harder" to estimate Y_{1T*}(0) when unit 1 is the treated unit than it is to estimate Y_{jT*}(0) for any of the ⌈(1 − α)J⌉ "easiest-to-estimate" control units, i.e.
those with the ⌈(1 − α)J⌉ smallest misspecification error magnitudes. Further, B̄ quantifies the magnitude of the misspecification error the SC method would have to incur for a treatment effect of zero to be plausible. This magnitude can be compared to control units' misspecification error magnitudes B_j to benchmark how "reasonable" a treatment effect of zero might be, as measured by the percentage ν of control units for which B_j ≤ B̄.

Despite the resemblance of our sensitivity analysis to frequentist statistical inference procedures, we caution against interpreting ν as the p-value corresponding to a test of no treatment effect and T^{B_⌈(1−α)J⌉}_{T*} as a confidence interval for the treatment effect, since our methodology is based on the perspective that uncertainty in SC estimates results from modeling error, not statistical noise. We believe this perspective is important because in most comparative case studies, we only observe a single outcome sample path over a limited number of time periods for each of a small number of heterogeneous units, only one of which is ever treated (Abadie, 2020). As a result, any stochastic model with enough structure to allow for tractable statistical inference in such settings must rely on potentially unrealistic assumptions about the data generating process to make any progress, e.g. distributional assumptions on the stochastic outcome processes, a stance on the treatment assignment mechanism, and/or growing dataset asymptotics.

Further, while some of the statistical approaches to characterizing uncertainty in SC estimates do acknowledge and accommodate the possibility of misspecification error (Chernozhukov et al., 2017, 2018, Cattaneo et al., 2019), the assumptions they make to limit its effect on inferential validity can be difficult to justify in comparative case study settings and to interpret for practitioners, e.g.
stationarity of units' outcome processes, large numbers of observed pre- and post-treatment periods, exchangeability of SC residuals across periods, and/or mean-zero post-treatment SC residuals. While our sensitivity analysis avoids the statistical perspective on estimate uncertainty that is the norm in empirical economics, we believe it provides a transparent evaluation of the credibility of SC counterfactuals in the presence of misspecification error.
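Under the baseline $\ell_2$ weight-space definition of misspecification error, the key quantities have simple closed forms: the distance from the SC weights to the set of weights that exactly reproduce a target outcome is a point-to-hyperplane distance, and the resulting treatment effect bounds widen linearly in the allowed distance via Cauchy-Schwarz. The following minimal sketch illustrates the geometry with hypothetical toy numbers (not data from the paper):

```python
import numpy as np

def weight_space_distance(w_sc, Y_post, y_target):
    """Minimal l2 distance from the fitted SC weights w_sc to the
    hyperplane of weights that exactly predict y_target, i.e.
    {w : Y_post @ w = y_target}.  Equals |residual| / ||Y_post||."""
    return abs(Y_post @ w_sc - y_target) / np.linalg.norm(Y_post)

def effect_bounds(y_treated, Y_post, w_sc, B):
    """Treatment effect bounds when the weights may move at most B from
    w_sc in l2 distance: by Cauchy-Schwarz, the SC prediction Y_post @ w
    can move by at most B * ||Y_post||."""
    center = Y_post @ w_sc
    slack = B * np.linalg.norm(Y_post)
    return (y_treated - center - slack, y_treated - center + slack)

# Toy illustration (hypothetical numbers):
Y_post = np.array([90.0, 110.0, 100.0])  # control outcomes in period T*
w_sc = np.array([0.2, 0.3, 0.5])         # fitted SC weights
y_treated = 80.0                         # treated unit's observed outcome

# Distance the weights must travel for a zero effect to be plausible;
# at exactly this distance, the upper bound just reaches zero.
B_zero = weight_space_distance(w_sc, Y_post, y_treated)
lo, hi = effect_bounds(y_treated, Y_post, w_sc, B_zero)
```

With these toy numbers the SC prediction is $101$, so the point estimate is $-21$ and the bounds at distance `B_zero` are $[-42, 0]$, illustrating how $B$ is calibrated so that zero is just barely plausible.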
Our methodology also provides an alternative motivation for a variant of the popular design-based placebo test of no treatment effect originally proposed in Abadie et al. (2010). Abadie et al. (2010) suggest comparing the absolute SC residual $|\hat R^{sc}_{T^*}| := |Y^\top_{T^*} w^{sc} - Y_{1T^*}|$ under the assumption of no treatment effect (so $Y_{1T^*} = Y_{1T^*}(1) = Y_{1T^*}(0)$) to the distribution of absolute placebo residuals $|\hat R^{sc}_{jT^*}|$ for $j \in \mathcal{J}$; Abadie et al. (2010) interpret $|\hat R^{sc}_{T^*}|$ being large relative to $|\hat R^{sc}_{jT^*}|$ for $j \in \mathcal{J}$ as strong evidence of a non-zero treatment effect, assuming pre-treatment fit is also good. In particular, if we take a design-based perspective and treat outcomes as fixed quantities (see Imbens and Rubin (2015)), then under the admittedly unrealistic assumption that treatment is assigned uniformly at random to the units under consideration, the percentage of absolute residuals $|\hat R^{sc}_{jT^*}|$ that are smaller than $|\hat R^{sc}_{T^*}|$ can be interpreted as a $p$-value for a test of the null hypothesis of no treatment effect. Abadie et al. (2010), Firpo and Possebom (2018), and others suggest using test statistics based on the ratios of post-treatment mean squared error under the null hypothesis to pre-treatment prediction error, but in light of the discussion about the relationship between pre- and post-treatment error in Section 1, it is unclear how meaningful such relative error metrics are in practice. Bojinov and Shephard (2019) and Rambachan and Shephard (2019) discuss similar philosophical issues in the context of time series.
To see the connection between the placebo test described above and our proposed procedure, recall that the statistic $\nu$ defined at the end of Section 2.2.2 is computed by asking what fraction of control units' placebo distances $B_j = d(w^{(j)sc}, \mathcal{W}^*_j) = |\hat R^{sc}_{jT^*}| / \|Y^{(-j)}_{T^*}\|$ (from (8)) are smaller than the minimum bound $B = |\hat R^{sc}_{T^*}| / \|Y_{T^*}\|$ (from (11)) on $d(w^{sc}, \mathcal{W}^*)$ required for 0 to lie in the set of plausible treatment effects $T_{B,T^*}$. If we multiply $B$ and $B_j$ for $j \in \mathcal{J}$ by $\|Y_{T^*}\|$, we can see that $\nu$ can equivalently be computed by asking for what fraction of control units $j \in \mathcal{J}$ is $B_j \|Y_{T^*}\| = |\hat R^{sc}_{jT^*}| \cdot \|Y_{T^*}\| / \|Y^{(-j)}_{T^*}\|$ smaller than $B \|Y_{T^*}\| = |\hat R^{sc}_{T^*}|$. Since the ratios $N_j = \|Y_{T^*}\| / \|Y^{(-j)}_{T^*}\|$ are all greater than one from the discussion at the end of Section 2.2.2, we can see that under the assumption of random treatment assignment, $\nu$ can be interpreted as the $p$-value corresponding to a more conservative variant of Abadie et al. (2010)'s placebo test described above. Further, for any $\alpha \in [0,1]$, we can view $T_{B_{\lceil(1-\alpha)J\rceil},T^*}$ as the set of treatment effects under which our conservative version of Abadie et al. (2010)'s placebo test would fail to reject the null hypothesis of zero treatment effect at level $\alpha$. Per the discussion at the end of Section 2.2.2, the degree of conservativeness of this placebo test also decreases in the size of the donor pool under mild conditions. Thus, our procedure motivates comparing the treated and control units' absolute residuals to assess errors in SC estimates without starting from a random treatment assignment assumption.

It is important to note that, like most papers in the SC literature, our method assumes the donor pool is fixed before treatment effect estimation.
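The algebra above, multiplying both sides of $B_j \le B$ by $\|Y_{T^*}\|$ so that the comparison of distances becomes a comparison of placebo residuals inflated by $N_j \ge 1$, can be checked numerically. The sketch below (with generic simulated inputs, purely for illustration) computes $\nu$ both ways and assumes `Y_post` is the vector of control units' period-$T^*$ outcomes:

```python
import numpy as np

def nu_from_distances(R_treated, R_placebos, Y_post):
    """nu computed from the weight-space distances: the share of j with
    B_j = |R_j| / ||Y^(-j)|| no larger than B = |R| / ||Y||."""
    Y = np.asarray(Y_post, dtype=float)
    B = abs(R_treated) / np.linalg.norm(Y)
    B_j = np.array([abs(r) / np.linalg.norm(np.delete(Y, j))
                    for j, r in enumerate(R_placebos)])
    return np.mean(B_j <= B)

def nu_from_residuals(R_treated, R_placebos, Y_post):
    """Equivalent computation after multiplying through by ||Y||: the
    share of j with |R_j| * N_j <= |R|, where N_j = ||Y|| / ||Y^(-j)||
    is at least one, making the comparison conservative."""
    Y = np.asarray(Y_post, dtype=float)
    N = np.array([np.linalg.norm(Y) / np.linalg.norm(np.delete(Y, j))
                  for j in range(len(Y))])
    scaled = np.abs(R_placebos) * N
    return np.mean(scaled <= abs(R_treated))
```

Both functions return the same $\nu$ away from knife-edge ties, which is the equivalence used in the text.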
In practice, however, researchers often exercise tremendous discretion in donor pool selection in ways that can dramatically change results, as demonstrated by the popular leave-unit-out robustness check we illustrate in Section 3.2. Despite this sensitivity, inclusion of only the control units that are believed to be "most similar" to the treated unit is explicitly advocated for in the SC literature (Abadie, 2020). Doing so is encouraged because, as discussed in Footnote 5, synthetic controls that interpolate more between control units that are very different from the treated unit can be quite biased if the relationship between pre-treatment outcomes (and covariates) and post-treatment outcomes is non-linear (Abadie, 2020; Abadie and L'Hour, 2018; Kellogg et al., 2020).

Since our sensitivity analysis defines robustness relative to SC performance when predicting control units' outcomes, and the researcher has significant latitude to select those control units, one might worry that our sensitivity analysis is itself sensitive to the choice of donor pool. Because the inclusion or exclusion of a control unit from the donor pool has the potential to affect both the SC estimates of the treated and placebo treated units and the placebo misspecification errors incurred by the SC method, the impact of donor pool manipulation is often ambiguous. It is possible, though, that an adversarial researcher could select the donor pool to maximize the perceived robustness of their SC estimates, but such doctoring has always been a vulnerability of both the SC literature and empirical economics more broadly (Broderick, Giordano, and Meager, 2020).

We note that our procedure can assess sensitivity to the inclusion or exclusion of control units that exist in the observed donor pool, since such choices are equivalent to toggling the weights corresponding to certain control units between zero and non-zero values.
However, we cannot determine the impact of including potential control units not reported by the researcher. For this reason, it is crucial that researchers are transparent about the universe of possible control units from which they select their donor pool and precise about the procedure according to which such selection occurs.

Unfortunately, much ambiguity remains about how researchers should go about defining such a universe. For example, in the context of the tobacco control program studied in Abadie et al. (2010), one might argue that California is more similar along many dimensions (e.g. total population or GDP) to countries like Germany and the UK than to US states like Nebraska or Utah; perhaps data on control units from Abadie et al. (2015)'s study of German reunification (augmented with data on tobacco sales) would yield better SC counterfactuals? While such a line of reasoning is compelling, a researcher could also argue that cultural norms around smoking in California are more similar to those of other US states than of European countries. While contrived, this small example illustrates the kind of subjectivity inherent in the donor pool selection process, and to our knowledge, there exist no agreed-upon best practices or formal criteria for inclusion or exclusion of particular units. As such, we view studying the effect of donor pool selection on SC estimates with more analytical precision as an important area for future investigation.
We now demonstrate how to apply our sensitivity analysis as outlined by Procedure 1 by re-examining three canonical policies studied often in the SC literature: California's tobacco control program on tobacco sales using data provided by Abadie et al. (2010), German reunification on GDP using data from Abadie et al. (2015), and the Mariel boatlift and Cuban mass migration on the 20th percentile of the wage distribution in Miami using data as in Peri and Yasenov (2019).

In Figure 3, we summarize the results from each case study by plotting the range of possible treatment effects at each percentile rank $p_j = (j-1)/J$ of the ordered set of placebo misspecification errors $\{B_j : j \in \mathcal{J}\}$. In Figure 3a, the horizontal, dotted blue line represents the SC point estimate of the effect of California's tobacco program on tobacco sales. For each of the $J$ observed placebo misspecification errors $B_j$, we use blue points to denote the maximum and minimum treatment effects possible for California if we allow for misspecification error up to $B_j$. The $x$-axis represents $B_j$ with its percentile rank $p_j$ within the ordered set of placebo misspecification errors. We highlight in red the interval of the placebo misspecification error distribution where the allowable misspecification error first yields treatment effect bounds containing zero. Summarizing Figure 3a, we can see that the SC weight estimates for California would need to incur at least as much error as the 94.7th percentile of the 38 placebo misspecification errors for a zero treatment effect to be plausible.

Figure 3: Plots of treatment effect bounds $T_{B_j,T^*}$ corresponding to each control unit $j$'s misspecification error, computed using data from three papers using the SC method: (a) Effect of California's Tobacco Control Program; (b) Effect of German Reunification on GDP; (c) Effect of Mariel Boatlift on Low-Income Wages. The units of the $x$-axes are percentile ranks $p_j$ of the set of misspecification errors $\{B_j : j \in \mathcal{J}\}$.
We highlight in red the region between $B_j$ and $B_{j+1}$ in which our treatment effect bounds first contain zero.

Of course, our procedure is not the first to purport to help researchers assess the susceptibility of their SC treatment effect estimates to misspecification error. We next show that our procedure provides more complete and interpretable measures of SC estimate robustness compared to two commonly used robustness checks in the SC literature. In the spirit of Bertrand, Duflo, and Mullainathan (2004), we believe methods are best tested on real datasets, so we implement two popular alternative procedures in repeated placebo versions of each of our three case studies, treating each of the control units as a placebo treated unit and comparing the results of these methods to those delivered by our procedure.

First, we examine the "leave-unit-out" robustness check, which entails dropping each control unit from the donor pool and recomputing SC outcome estimates with a donor pool consisting of the remaining control units (Abadie, 2020). The researcher is then supposed to assess robustness qualitatively by checking whether the set of treatment effects outputted by this procedure have the same signs as, and similar magnitudes to, the effects computed using the full donor pool. When we treat Delaware as the placebo treated unit in the context of the state-by-state smoking data from Abadie et al. (2010) and conduct the leave-unit-out robustness check, we see that the alternative predictions generated by this procedure fail to capture the extent of the prediction error incurred by the SC method, as illustrated in Figure 4a.

Unfortunately, this inadequacy is not isolated to Delaware or the state-level smoking data from Abadie et al. (2010). If we repeat this placebo procedure with each of the other control units in each of our three case studies, we find that ( . ) of the control units from Abadie et al. (2010); ( . ) of the control units from Abadie et al.
(2015); and ( . ) of the control units from Peri and Yasenov (2019) have last-period outcomes outside the range of their corresponding leave-unit-out predictions.

It suffices to drop only those control units with positive weight in the full-sample vector of SC weights, because dropping units with zero weight will not affect SC estimates.

Figure 4: In Figure 4a (Delaware Leave-Unit-Out Analysis), we show the SC trends for Delaware computed while leaving out each donor unit that received positive weight (weight shown in parentheses) when running the SC method with the entire donor pool. In Figure 4b (Virginia Leave-Time-Out Analysis), we show the SC trend for Virginia computed while excluding the pre-treatment fit errors for the six periods immediately prior to treatment (between the two vertical black lines) from the SC objective.

In some sense, this result is not so surprising, since the leave-unit-out analysis only assesses the sensitivity of SC estimates to a particular cause of misspecification error: mistakenly including a particular unit in the donor pool and placing positive weight on that unit's outcome in a SC estimate.

The second diagnostic we consider, the "leave-time-out" or "backdating" procedure, involves fitting a synthetic control using only the pre-treatment outcomes up to some number of periods before the first treatment period; the remaining pre-treatment periods, in which control outcomes for the treated unit are known, are used as a validation set to assess the quality of the SC method's predictions out-of-sample (Abadie, 2020). Treating Virginia as the placebo treated unit in the context of the state-by-state smoking data from Abadie et al. (2010), we leave out the six time periods before California was treated (between the two black vertical lines in Figure 4b) and fit the synthetic control on the remaining pre-treatment periods.
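The leave-time-out exercise is mechanical to implement: fit the SC weights on the earliest pre-treatment periods only, then score the predictions on the held-out validation periods. The sketch below is a generic illustration, not the authors' code; it uses a simplex-constrained least-squares fit solved with SciPy's SLSQP (one common way to compute SC weights), and all data are simulated:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sc_weights(X_pre, y_pre):
    """Canonical SC weight fit: least squares over the simplex
    (w >= 0, sum(w) = 1), solved numerically with SLSQP.
    X_pre: (T0, J) control outcomes; y_pre: (T0,) treated outcomes."""
    J = X_pre.shape[1]
    res = minimize(
        lambda w: np.sum((X_pre @ w - y_pre) ** 2),
        x0=np.full(J, 1.0 / J),
        bounds=[(0.0, 1.0)] * J,
        constraints=({"type": "eq", "fun": lambda w: w.sum() - 1.0},),
        method="SLSQP",
    )
    return res.x

def backdate_check(X_pre, y_pre, n_holdout):
    """Leave-time-out diagnostic: refit on the earliest pre-treatment
    periods, then report RMSE on the held-out validation periods."""
    w = fit_sc_weights(X_pre[:-n_holdout], y_pre[:-n_holdout])
    resid = X_pre[-n_holdout:] @ w - y_pre[-n_holdout:]
    return w, float(np.sqrt(np.mean(resid ** 2)))
```

A large validation RMSE relative to the training fit is the signal practitioners eyeball in plots like Figure 4b, though, as discussed below, there is no agreed-upon formal cutoff.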
Given the gap in Figure 4b between the true control trend in black and the backdated synthetic control in purple over the five validation periods, many researchers would be skeptical about their SC estimates. In Virginia's case, though, the backdated SC trend predicts the outcome in 2000 remarkably well and clearly outperforms the non-backdated SC trend in the other post-treatment periods.

If a researcher only considers the backdating exercise as a diagnostic for the credibility of the original (non-backdated) SC estimates, then the poor predictive performance of the backdated SC counterfactual in the validation periods correctly indicates that the original SC counterfactual does not reflect the true control trend in the post-treatment periods. However, some researchers also use the backdated SC trend itself to compute treatment effect estimates, since the leave-time-out exercise directly tests the predictive performance of the same counterfactual on which treatment effect estimates are based. If Virginia were the treated unit, such researchers would be misled; the poor fit in the validation periods does not translate

                          California    Germany
Donor Pool Size               38           16
SC
  False Pos. Rate           10.5%        25.0%
  False Neg. Rate           26.3%        12.5%
  Total Err. Rate           36.8%        37.5%
Backdated SC
  False Pos. Rate           13.2%        18.7%
  False Neg. Rate           10.5%         6.2%
  Total Err. Rate           23.7%        24.9%
Table 1: This table summarizes the results from our placebo analyses of the leave-time-out robustness check. A false positive occurs when the backdated SC trend fits well in the validation periods but the counterfactual control trend does not fit well post-treatment; similarly, a false negative occurs when the backdated SC trend does not fit well in the validation periods but the counterfactual control trend does fit well post-treatment. The error rate is defined as the sum of the false positive and false negative rates. We report error rates for both backdated and non-backdated SC counterfactual trends.

into meaningfully subpar post-treatment fit.

Given these concerns, we repeat this placebo analysis with each of the other control units in the studies of California's tobacco control law and German reunification. In particular, we visually code each application of the backdating procedure as yielding a "false positive" (the backdated SC trend fits well in the validation periods but the counterfactual control trend does not fit well post-treatment), a "false negative" (the backdated SC trend does not fit well in the validation periods but the counterfactual control trend does fit well post-treatment), or neither if the procedure properly rejected a counterfactual with bad post-treatment fit or did not reject a counterfactual with good post-treatment fit. Since there is not a consensus among practitioners about which of the backdated or non-backdated SC trends should determine treatment effect estimates, we report false positive and false negative rates for both types of counterfactuals.

As can be seen from Table 1, while the performance of the leave-time-out procedure is better than the performance of the leave-unit-out procedure, it still leaves much to be desired, given that it had the potential to mislead researchers roughly a quarter of the time it was applied in our placebo analyses.
Again, these results should not be unexpected, since the leave-time-out analysis is only assessing the sensitivity of SC estimates to misspecification error caused by overfitting to outcomes close to the first treated period. More importantly, there are no agreed-upon formal criteria we know of in the literature for trusting or doubting synthetic control estimates based on the leave-unit-out or backdating exercises. Researchers (including us) seem to decide based on visual appeal, which, as we have demonstrated, can lead researchers astray. In fact, we could not reproduce the error rates in Table 1 when we conducted the coding exercise described above twice, six months apart; the version included here reports the results from our second coding attempt, and departures from our first results were not uniform in any direction.

Unfortunately, there are not enough pre-treatment periods in the data from Peri and Yasenov (2019) to reliably evaluate SC predictions from the backdated fit.

Our notions of good and poor fit here are necessarily heuristic, since we know of no accepted formal criteria in the literature for what constitutes acceptable fit in the validation periods. A more systematic way to code each placebo analysis would be to survey a sample of practitioners who use the SC method and ask them whether they find the predictive performance of backdated and non-backdated SC trends acceptable; we leave such a survey for future work.

Figure 5: We plot the results from placebo analyses of Procedure 1 using data from three case studies. On the $x$-axis, we vary the percentile rank cutoff that determines the width of the bounds generated by our procedure, and on the $y$-axis we show the share of placebo treated units for which those bounds correctly include a zero treatment effect. The dashed black 45-degree line demarcates $p$ percent of placebo units' true outcomes being covered with a misspecification error cutoff at the $p$-th percentile rank.
In contrast, our procedure provides a comprehensive and less subjective approach to assessing how all types of misspecification error could affect SC estimates.

To assess the effectiveness of our proposed method in comparison, we subject it to the same placebo analyses we used to interrogate the leave-unit-out and leave-time-out robustness checks studied above. In particular, we treat the control units in each of our three case studies as placebo treated units and apply our sensitivity analysis to each. In Figure 5, we plot the share of control units for which our procedure yields bounds on the treatment effect that correctly contain zero, for each possible percentile rank at which we could generate bounds using our procedure. When the researcher chooses a threshold of misspecification error in terms of a $p$-th percentile rank cutoff that they deem "acceptable" when constructing treatment effect bounds, our placebo analyses suggest that doing so correctly captures zero treatment effects for approximately $p$ percent of the placebo treated units. In some sense, this calibrated relationship between the chosen percentile rank cutoff and bound coverage of placebo treated units' outcomes is not surprising, since our procedure can be viewed as a particular way of synthesizing the results of repeated placebo analyses. Section A of the Appendix describes the mechanics behind this connection in more detail.

In contrast, it is difficult to translate the results of the leave-unit-out and leave-time-out robustness checks into clear insights about the validity of the SC treatment effect estimate for the treated unit. Recall the leave-unit-out placebo analyses conducted using the data from Abadie et al. (2010); if the true control outcomes lie outside of the range of leave-unit-out trends for ( . ) of placebo treated units, are we meant to believe that the range of leave-unit-out trends for California only captures the true counterfactual control trend ( . )
of the time under some sampling model for the potential outcomes? As discussed above, it is even harder to understand how informative the backdating placebo analyses are about the value of the leave-time-out analysis applied to the treated unit. While these other methods are plagued by ambiguities in implementation and interpretation, the only degree of freedom left by our procedure for the researcher to determine, the acceptable percentile rank cutoff, is both directly meaningful and closely connected to a natural summary statistic of our procedure's performance in placebo analyses.

The $\ell_2$-distances defined in Section 2 between the SC weights and the closest weights that correctly predict $Y_{jT^*}(0)$, namely $d(w^{sc}, \mathcal{W}^*)$ and $d(w^{(j)sc}, \mathcal{W}^*_j)$, are natural measures of misspecification error magnitudes, but they are certainly not the only ones researchers can use to assess the sensitivity of SC treatment effect estimates. Instead of measuring the misspecification error incurred by the SC method relative to weights $w$ with the $\ell_2$-distance of $w^{(j)sc}$ to $w$, we can use any function $m_j : \mathbb{R}^{\mathcal{J}\setminus\{j\}} \to [0,\infty]$ such that $m_j(w^{(j)sc}) = 0$ to measure the distance of $w$ to $w^{(j)sc}$. To allow for these alternative misspecification error metrics $m_j$, we generalize our proposed sensitivity analysis in Procedure 2, which nests the analysis described in Section 2 for $m_j(w) = m^{wt}_j(w) := \|w^{(j)sc} - w\|$.
Note that as long as the $m_j$ are convex functions, then although the optimization problems (12), (13), and (15) likely do not have closed-form solutions as their equivalents in Section 2 do, their solutions are still easily computable numerically using off-the-shelf convex optimization software (Boyd and Vandenberghe, 2004).

To demonstrate the value of this more general procedure, we focus on an alternative misspecification error metric $m^{err}_j(w)$, defined as the extra pre-treatment prediction error incurred by $w$ relative to the minimum achievable pre-treatment prediction error with valid SC weights, assuming $w$ are also valid SC weights:

$$m^{err}_j(w) := \frac{\|x_j - X_{-j} w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(w)}{\min_{\tilde w \in \mathbb{R}^{\mathcal{J}\setminus\{j\}}} \left\{ \|x_j - X_{-j}\tilde w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w) \right\}} - 1, \tag{16}$$

where for a given set $C \subseteq \mathbb{R}^{\mathcal{J}\setminus\{j\}}$, $\psi_C(w)$ is a penalty term designed to constrain

Procedure 2. Generalized Sensitivity Analysis
1. For each control unit $j \in \mathcal{J}$:

(a) Compute the misspecification error $d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j)$ incurred by estimating the placebo post-treatment outcome of interest $Y_{jT^*}(0)$ for control unit $j$ with the SC method, using the other $J-1$ units in the donor pool as control units, as in (7):

$$d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) := \inf_{w \in \mathbb{R}^{J-1}} m_j(w) \quad \text{s.t.} \quad Y^{(-j)\top}_{T^*} w = Y_{jT^*} \quad \left(\Leftrightarrow w \in \mathcal{W}^*_j\right). \tag{12}$$

(b) Compute the largest and smallest plausible counterfactual control outcomes $Y^{B_j,-}_{1T^*}(0)$ and $Y^{B_j,+}_{1T^*}(0)$ under the assumption that the misspecification error $d_m(w^{sc}, \mathcal{W}^*)$ incurred by estimating $Y_{1T^*}(0)$ with the SC method is at most the misspecification error $B_j := d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j)$, as in (4):

$$Y^{B_j,-}_{1T^*}(0) := \inf_{w \in \mathbb{R}^{J}} \left\{ Y^\top_{T^*} w : m(w) \le d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) \right\}$$
$$Y^{B_j,+}_{1T^*}(0) := \sup_{w \in \mathbb{R}^{J}} \left\{ Y^\top_{T^*} w : m(w) \le d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) \right\} \tag{13}$$

(c) Compute the bounds $T_{B_j,T^*}$ on the treatment effect $\tau_{T^*}$ under the assumption that the misspecification error $d_m(w^{sc}, \mathcal{W}^*)$ incurred by estimating $Y_{1T^*}(0)$ with the SC method is at most the misspecification error $B_j$, as in (9):

$$T_{B_j,T^*} := \left[ Y_{1T^*} - Y^{B_j,+}_{1T^*}(0),\; Y_{1T^*} - Y^{B_j,-}_{1T^*}(0) \right] \tag{14}$$

2. Compute the minimum misspecification error $B$ needed for $0 \in T_{B,T^*}$, i.e. for $0$ to be a plausible treatment effect estimate, as in (10):

$$B := \inf_{w \in \mathbb{R}^{J}} m(w) \quad \text{s.t.} \quad Y^\top_{T^*} w = Y_{1T^*}, \tag{15}$$

and find the control unit $j$ with the largest misspecification error still smaller than $B$, i.e. where $B_j \le B \le B_{j+1}$.

3. Visualize the treatment effect bounds $T_{B_j,T^*}$ for each $j \in \mathcal{J}$ and the misspecification errors $B_j$ and $B_{j+1}$ in a plot like Figure 3a, and report the percentage $\nu = (j-1)/J$ of control units whose misspecification errors $B_j$ are smaller than $B$.
$w$ to lie in the set $C$ when $m_j$ is used in minimization problems:

$$\psi_C(w) := \begin{cases} 0 & w \in C \\ \infty & \text{otherwise.} \end{cases} \tag{17}$$

The denominator of the fraction in (16) is just the pre-treatment prediction error incurred by the canonical SC estimator, since $\|x_j - X_{-j}\tilde w\|$ is exactly the objective function minimized in (6) to construct a synthetic control and $\psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w)$ just ensures that the minimizer of $\|x_j - X_{-j}\tilde w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w)$ is a vector of valid SC weights. Then, if we use $m^{err}_j$ as the misspecification error metric in our proposed sensitivity analysis, we can interpret the misspecification error $d_{m^{err}_j}(w^{(j)sc}, \mathcal{W}^*_j)$ as the minimum amount of additional pre-treatment prediction error (relative to the minimum possible) a researcher would have to tolerate for a vector of SC weights that yields a correct prediction of $Y_{jT^*}(0)$ to be considered a "reasonable" choice of weights.

As it happens, the weights that solve (12) under $m^{err}_j$ are exactly the weights that yield the green outcome trends in Figure 1 that match Virginia's and Delaware's outcomes in 2000 and achieve the smallest possible pre-treatment prediction error magnitudes while doing so. Further, suppose we treat unit $j$ as the treated unit and the other $J-1$ control units as the donor pool. Then the sets $[Y^{B_j,-}_{jT^*}(0), Y^{B_j,+}_{jT^*}(0)]$ for $j \in \mathcal{J}$, with endpoints defined analogously to (13), i.e. the sets that contain the plausible predicted control outcomes for each unit $j$ assuming misspecification error is no larger than $j$'s own true misspecification error, are exactly the red dashed intervals in Figures 1a and 1b.

Although $m^{err}_j$ has clear intuitive appeal, it does have several shortcomings. First, it is only well-defined if $X_{-j} w \neq x_j$ for all $w \in \Delta^{\mathcal{J}\setminus\{j\}}$; otherwise, the denominator in (16) will be zero, in which case $m^{err}_j$ is unusable given the dataset of interest.
Second, the sets of $w$ that perfectly predict the period-$T^*$ outcomes for the control units with the largest and smallest values of $Y_{jT^*}$ do not intersect with $\Delta^{\mathcal{J}\setminus\{j\}}$ at all, in which case $m^{err}_j(w)$ will be infinite for all feasible $w$ in (12). Then $d_{m_j}(w^{(j)sc}, \mathcal{W}^*_j) = \infty$ for the two units with the largest and smallest period-$T^*$ outcomes, meaning $T_{B_j,T^*} = (-\infty, \infty)$. Despite the fact that these bounds contain the whole real line, we do not intend their vacuousness to reflect that all treatment effects are equally plausible; we simply mean to convey that the particular bounds corresponding to the control units with extreme outcomes are uninformative about the treatment effect for the treated unit.

In addition to the generalization of our sensitivity analysis to other misspecification error metrics described above, we also extend our procedure to measure the sensitivity of alternative outcome contrast estimates in Section C.1 of the Appendix and to apply to effect estimates generated by other policy evaluation methods for panel data in Section C.2 of the Appendix. Further, we demonstrate in Section B of the Appendix how this generalized procedure can be used to account for potential non-uniqueness of the SC estimator in the sensitivity analysis from Section 2.

4.2 Choosing a Misspecification Error Metric

To understand how the choice of misspecification error metric can affect the output of Procedure 2, we compare the results of our sensitivity analysis based on $m^{wt}_j$ shown in Figure 3 to results based on two additional misspecification error metrics, which we review below:

1. Unconstrained weight space: $m^{wt}_j(w) = \|w^{(j)sc} - w\|$; as described above, using this metric yields the sensitivity analysis given in Procedure 1.
2. Constrained weight space: $m^{wt}_{\Delta^{\mathcal{J}\setminus\{j\}}}(w) := \|w^{(j)sc} - w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(w)$; this metric still measures distance in weight space but requires $w$ to lie in the set of valid SC weights.

3. Constrained error space: $m^{err}_j(w) := \dfrac{\|x_j - X_{-j}w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(w)}{\min_{\tilde w \in \mathbb{R}^{\mathcal{J}\setminus\{j\}}} \left\{\|x_j - X_{-j}\tilde w\| + \psi_{\Delta^{\mathcal{J}\setminus\{j\}}}(\tilde w)\right\}} - 1$; as discussed in Section 4.1, this metric measures distance by the extra error incurred by $w$ relative to the error incurred by the vector of SC weights.

The results of repeating our earlier case studies with the alternative misspecification metrics described above are shown in Figure 6. First, the treatment effect bounds for the tobacco control program in California in Figure 6a uniformly indicate that the finding of a large, negative effect is robust, since for all choices of $m_j$, the misspecification error for California would need to be large relative to the misspecification errors of most control units. For a zero treatment effect to be plausible, the constrained weight space metric suggests California's misspecification error would need to be larger than that of 92.1% of the control units, and the remaining two metrics require error larger than that of 94.7% of control units. The results for the German reunification and Mariel boatlift settings shown in Figures 6b and 6c exhibit more variation across misspecification metrics. While the constrained error space metric suggests fairly robust results in the German reunification setting, the other two are less supportive.
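The three metrics reviewed above differ only in what they penalize, so they can be sketched side by side. The code below is an illustrative implementation under simplifying assumptions (a single target period, toy inputs, and SciPy's generic SLSQP solver rather than the paper's code): metric 1 has a closed form, while metrics 2 and 3 require constrained optimization.

```python
import numpy as np
from scipy.optimize import minimize

def d_unconstrained(w_sc, Y_post, y_target):
    """Metric 1: closed-form l2 distance from w_sc to the hyperplane
    {w : Y_post @ w = y_target}."""
    return abs(Y_post @ w_sc - y_target) / np.linalg.norm(Y_post)

def d_constrained(w_sc, Y_post, y_target):
    """Metric 2: the nearest point must also be a valid SC weight
    vector (on the simplex); solved numerically with SLSQP."""
    J = len(w_sc)
    cons = (
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},
        {"type": "eq", "fun": lambda w: Y_post @ w - y_target},
    )
    res = minimize(lambda w: np.sum((w - w_sc) ** 2), x0=w_sc,
                   bounds=[(0.0, 1.0)] * J, constraints=cons,
                   method="SLSQP")
    # Infeasible target (outside the convex hull of control outcomes)
    # means no valid weights reproduce y_target: infinite distance.
    return np.linalg.norm(res.x - w_sc) if res.success else np.inf

def m_err(w, X_pre, y_pre):
    """Metric 3: extra pre-treatment error of w relative to the best
    achievable error over valid SC weights; infinite off the simplex."""
    if np.any(w < -1e-9) or abs(w.sum() - 1.0) > 1e-9:
        return np.inf
    err = lambda v: np.linalg.norm(y_pre - X_pre @ v)
    J = len(w)
    best = minimize(lambda v: err(v) ** 2, x0=np.full(J, 1.0 / J),
                    bounds=[(0.0, 1.0)] * J,
                    constraints=({"type": "eq",
                                  "fun": lambda v: v.sum() - 1.0},),
                    method="SLSQP")
    denom = err(best.x)  # a zero denominator leaves m_err undefined
    return err(w) / denom - 1.0
```

Consistent with the discussion in the text, the constrained weight-space distance is always at least the unconstrained one at a fixed target, even though their percentile ranks across placebo units can order either way.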
In the Mariel boatlift setting, the treatment effect of the influx of immigrants on low-income wages in Miami is reasonably indistinguishable from zero using the unconstrained weight space and constrained error space metrics, as has been argued in the literature by other means.

Perhaps counterintuitively, Figure 6b demonstrates that the percentile rank of the misspecification error needed for a zero treatment effect to be plausible using the constrained weight space metric is smaller than the equivalent percentile rank using the unconstrained weight space metric. At first, this phenomenon may seem impossible since, holding the magnitude of misspecification error fixed, the bounds constructed by maximizing and minimizing over the unconstrained set of weights in (13) should be mechanically wider than the bounds constructed over the constrained set. However, recall that the units on the $x$-axis in Figure 6b correspond to the percentile ranks of the placebo misspecification errors, not their magnitudes. Because the relative sizes of the placebo misspecification errors also depend on the choice of metric, it

Figure 6: Plots of treatment effect bounds $T_{B_j,T^*}$ using different misspecification error metrics, corresponding to each control unit $j$'s misspecification error, computed using data from three papers using the SC method: (a) Effect of California's Tobacco Control Program; (b) Effect of German Reunification on GDP; (c) Effect of Mariel Boatlift on Low-Income Wages. The units of the $x$-axes are percentile ranks $p_j$ of the set of misspecification errors $\{B_j : j \in \mathcal{J}\}$. We shade each region between $B_j$ and $B_{j+1}$ in which our treatment effect bounds first contain zero in the color corresponding to the relevant misspecification error metric.
is certainly possible that, at a fixed percentile rank in the placebo misspecification error distribution, either metric could yield wider bounds. While this ambiguity may suggest visualizing the bounds defined via the two weight space metrics in terms of absolute misspecification error magnitudes, as discussed in Section 2.2.2, it is hard to determine what constitutes a reasonable amount of misspecification error measured using $\ell_2$-distances in weight space. Benchmarking against the placebo misspecification errors of the control units provides a more meaningful characterization of the robustness of SC estimates.

Unfortunately, given the ambiguous relationships between metrics discussed above, we cannot recommend a single preferred misspecification error metric for all settings. Rather, we believe the choice should be made based on the researcher's prior beliefs about the SC method's susceptibility to misspecification error. When comparing the constrained and unconstrained weight space metrics, the decision should be determined by the researcher's belief about the validity of the SC weight constraints. If the researcher just views the constraints as a convenient way of inducing sparsity in the SC weights, then conducting the sensitivity analysis while enforcing those constraints would fail to capture the possible misspecification error induced by the imposition of the constraints when choosing the SC weights. However, if the researcher believes the weight constraints capture important structural features of the setting, for example that treatment effect estimates based on extrapolation are undesirable, then they may wish to use the constrained weight space metric and only evaluate misspecification error incurred by the minimization of the wrong objective function when selecting the SC weights, not the weight constraints themselves.

The choice between the constrained weight and constrained error space metrics is more subtle.
The sensitivity analyses based on weight space metrics search over alternative weights without regard to the direction of the deviation when constructing treatment effect bounds. On the other hand, the constrained error space metric penalizes alternative weights that perform poorly on the original SC objective. Therefore, if the researcher does not believe pre-treatment fit is at all informative about post-treatment fit, they may prefer the weight space metrics. However, if the researcher maintains that good pre-treatment fit is a desirable and informative property of the weights used to construct counterfactual predictions, the constrained error space metric may make more sense. In principle, one could even interpolate between the different metrics.
In this paper, we demonstrate that pre-treatment fit is neither necessary nor sufficient for good post-treatment fit and that existing robustness checks often fail to capture the extent of this disconnect due to their heuristic motivations and ad-hoc interpretations. To structure conversations about the robustness of SC estimates, we provide researchers with a procedure to systematically assess SC estimate sensitivity to misspecification error in an interpretable, data-driven manner.

[Footnote: By extrapolation, we mean estimates of $Y_{1T^*}(0)$ that lie outside the range of control units' period-$T^*$ outcomes (Abadie, 2020).]

Building on our method, we could potentially develop a conditional prediction interval for the treated unit's outcome in a given period (Cattaneo et al., 2019). Of course, there is much more to be done to understand the viability (or lack thereof) of this general approach given the conceptual difficulty in measuring variability due to sampling in comparative case study settings, so we leave doing so to future work.

In conclusion, we hope that researchers will perform the sensitivity analysis outlined in Procedures 1 and 2 as part of their future comparative case studies employing the SC method and visualize their results as in Figures 3 and 6.

[Footnote: Such a perspective is reminiscent of the partial identification approach taken in Rambachan and Roth (2019) to allow for limited violations of the parallel trends assumption in the context of event studies.]

References

Alberto Abadie and Javier Gardeazabal. The economic costs of conflict: A case study of the Basque Country.
American Economic Review, 93(1):113–132, March 2003.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, 2010.

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Comparative politics and the synthetic control method. American Journal of Political Science, 59(2):495–510, 2015.

Bruno Ferman and Cristine Pinto. Synthetic controls with imperfect pre-treatment fit, 2019.

Alberto Abadie. Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 2020.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Charles F. Manski and John V. Pepper. How do right-to-carry laws affect crime rates? Coping with ambiguity using bounded-variation assumptions. The Review of Economics and Statistics, 100(2):232–244, 2018. doi: 10.1162/REST_a_00689. URL https://doi.org/10.1162/REST_a_00689.

Sergio Firpo and Vitor Possebom. Synthetic control method: Inference, sensitivity analysis and confidence sets. Journal of Causal Inference, 6(2), 2018.

Victor Chernozhukov, Kaspar Wuthrich, and Yinchu Zhu. An exact and robust conformal inference method for counterfactual and synthetic controls. arXiv preprint arXiv:1712.09089, 2017.

Victor Chernozhukov, Kaspar Wuthrich, and Yinchu Zhu. Practical and robust t-test based inference for synthetic control and related methods. arXiv preprint arXiv:1812.10820, 2018.

Matias D. Cattaneo, Yingjie Feng, and Rocio Titiunik. Prediction intervals for synthetic control methods. arXiv preprint arXiv:1912.07120, 2019.

Kathleen T. Li. Statistical inference for average treatment effects estimated by synthetic control methods. Journal of the American Statistical Association, pages 1–16, 2019.

Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Alberto Abadie and Jeremy L'Hour. A penalized synthetic control estimator for disaggregated data. 2018.

Maxwell Kellogg, Magne Mogstad, Guillaume Pouliot, and Alexander Torgovitsky. Combining matching and synthetic controls to trade off biases from extrapolation and interpolation. Technical report, National Bureau of Economic Research, 2020.

Ward Cheney and David Kincaid. Linear algebra: Theory and applications. The Australian Mathematical Society, 110, 2009.

Iavor Bojinov and Neil Shephard. Time series experiments and causal estimands: Exact randomization tests and trading. Journal of the American Statistical Association, 114(528):1665–1682, 2019.

Ashesh Rambachan and Neil Shephard. Econometric analysis of potential outcomes time series: Instruments, shocks, linearity and the causal response function. arXiv preprint arXiv:1903.01637, 2019.

Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness metric: Can dropping a little data change conclusions?, 2020.

Giovanni Peri and Vasil Yasenov. The labor market effects of a refugee wave: Synthetic control method meets the Mariel boatlift. Journal of Human Resources, 54(2):267–309, 2019.

George J. Borjas. The wage impact of the Marielitos: A reappraisal. ILR Review, 70(5):1077–1110, 2017. doi: 10.1177/0019793917692945. URL https://doi.org/10.1177/0019793917692945.

Marianne Bertrand, Esther Duflo, and Sendhil Mullainathan. How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 119(1):249–275, 2004.

Andreas Hagemann. Inference with a single treated cluster. arXiv preprint arXiv:2010.04076, 2020.

Ashesh Rambachan and Jonathan Roth. An honest approach to parallel trends. Technical report, Working Paper, https://scholar.harvard.edu/files/jroth/files…, 2019.

Nikolay Doudchenko and Guido W. Imbens. Balancing, regression, difference-in-differences and synthetic control methods: A synthesis. Technical report, National Bureau of Economic Research, 2016.

Eli Ben-Michael, Avi Feller, and Jesse Rothstein. The augmented synthetic control method. arXiv preprint arXiv:1811.04170, 2018.

Dmitry Arkhangelsky, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. Synthetic difference in differences, 2020.
A Placebo Analysis of Procedures 1 and 2
We perform our sensitivity analysis on each of the control units in the three case studies as an extension of the placebo analysis used to assess the performance of our procedure at the end of Section 3. We treat each control unit as the placebo treated unit and run Procedure 2 under the three misspecification error metrics discussed in Section 4.2. For each placebo treated unit, our procedure returns the minimum percentile rank of the placebo control units' misspecification errors at which a zero treatment effect is in the range of effects plausible under the allowed misspecification error. Within each case study, we can vary the level of acceptable misspecification error by choosing a different percentile rank cutoff. At each proposed cutoff, we generate bounds on the treatment effect and observe the share of control units for which we correctly include the zero treatment effect.

In Figure 7, we report the results of this placebo exercise by plotting the share of control units for which our procedure generates bounds that correctly contain a zero treatment effect under each possible percentile rank cutoff. Across case studies and misspecification error metrics, a $p$th percentile rank cutoff is associated with correct predictions for around $p$ percent of the control units. This direct correspondence between percentile rank cutoff and the coverage of placebo units' outcomes is to be expected given the design of our placebo analysis.

When our sensitivity analysis is performed on the true treated unit, the set of placebo misspecification errors is calculated when, for each control unit $j$, we calculate the distance (in a chosen metric) between the canonical SC estimate using the remaining $J - 1$ control units and the closest weights that correctly predict the target outcome. If we were to plot the share of control units covered at each percentile rank of the misspecification error distribution generated in our analysis of the true treated unit, we would exactly recover the 45-degree line.
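The cutoff-versus-coverage computation described above can be sketched in a few lines. In this sketch, the per-unit minimum percentile ranks are random stand-ins for the actual output of Procedure 2, so the numbers are purely illustrative:

```python
import numpy as np

# Placebo coverage sketch (all data are hypothetical stand-ins).
# min_rank[j] plays the role of the smallest percentile rank cutoff at which
# placebo treated unit j's bounds first include a zero treatment effect.
rng = np.random.default_rng(0)
J = 20                                    # number of control units
min_rank = rng.uniform(0, 100, size=J)    # stand-in for Procedure 2 output

# At each candidate cutoff p, coverage is the share of placebo units whose
# bounds at that cutoff contain the (true) zero treatment effect.
cutoffs = np.arange(0, 101, 5)
coverage = [(min_rank <= p).mean() for p in cutoffs]
```

With uniformly distributed minimum ranks, the resulting coverage curve tracks the 45-degree line, mirroring the direct correspondence described above.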
However, when we perform the analysis on control unit $j$ as a placebo treated unit, the set of misspecification errors is constructed by creating placebo SC weights for the remaining placebo control units without unit $j$, that is, using $J - 2$ control units. Redefining the set of misspecification errors without placebo control unit $j$ can yield slightly different SC estimates, which generate the deviations seen in each panel of Figure 7.

Figure 7: We plot the results from our placebo analysis of Procedure 2 using three different misspecification error metrics in each of the three case studies, as in Section 4.2: (a) Effect of California's Tobacco Control Program; (b) Effect of German Reunification on GDP; (c) Effect of Mariel Boatlift on Low-Income Wages. On the $x$-axis, we vary the percentile rank cutoff that determines the width of the bounds generated by our procedure, and on the $y$-axis we show the share of placebo treated units for which those bounds correctly include a zero treatment effect. The dashed black 45-degree line demarcates $p$ percent of placebo units' true outcomes being covered with a misspecification error cutoff at the $p$th percentile rank.

In Figure 7, we also observe that the unconstrained weight space metric yields placebo coverage rates that always lie weakly above the 45-degree line. While this result is not theoretically guaranteed, we can see why we might expect such a phenomenon. Consider the placebo analysis of control unit $j$. Using the remaining $J - 1$ control units, the closed-form solution for the observable misspecification error under the unconstrained weight space metric is given in Equation (11): the ratio of the absolute SC residual to the $\ell$-norm of the vector of the $J - 1$ control unit outcomes, $B_0^{(j)} = |\hat R^{sc}_{jT^*}| / \| Y_{(-j)T^*} \|$. Placebo treated unit $j$'s misspecification error will be compared to the distribution of misspecification errors of the remaining $J - 1$ placebo control units, calculated as each placebo control unit $k$'s ratio of the absolute residual from predicting its post-treatment SC estimate using the remaining $J - 2$ control units to the $\ell$-norm of the vector of the $J - 2$ remaining control unit outcomes, $d(w^{(k)}_{sc}, \mathcal{W}^*_k) = |\hat R^{sc}_{kT^*}| / \| Y_{(-j,-k)T^*} \|$. Because most SC estimates are sparse and likely place little or no weight on unit $j$ anyway, the residuals will not change much whether $J - 1$ or $J - 2$ placebo control units are used to construct SC estimates, in which case the primary difference in misspecification error will be driven by the reduction in the norm of the vector of control unit outcomes with $J - 2$ units instead of $J - 1$ units, $\| Y_{(-j,-k)T^*} \| \leq \| Y_{(-j)T^*} \|$. Therefore, it is reasonable to expect that the treatment effect bounds for control unit $j$ will include a zero treatment effect at a weakly lower percentile rank in the placebo analysis than in the sensitivity analysis of the true treated unit, which is why the unconstrained weight space placebo coverage rates lie weakly above the 45-degree line.

While there is no observable pattern in the constrained weight space case, we do observe that the constrained error space metric yields placebo coverage rates that are mostly below the 45-degree line. Recall from Equation (16) and Procedure 2 that the constrained error space misspecification error (12) is the ratio of the pre-treatment error of the best-fitting synthetic control that achieves perfect post-treatment accuracy to the minimum SC pre-treatment error. When placebo treated unit $j$'s misspecification error is compared to the errors of the remaining $J - 1$ units that are fit with only $J - 2$ placebo control units, both the numerator and denominator of the control unit misspecification errors weakly increase, as pre-treatment fit can only get worse with fewer units.
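To make the mechanics concrete, the following sketch computes the unconstrained weight space misspecification error as a point-to-hyperplane distance and shows that dropping a control unit weakly shrinks the outcome-vector norm; all numbers are hypothetical and chosen only for illustration:

```python
import numpy as np

# Unconstrained weight space misspecification error: |SC residual| / ||Y||,
# i.e., the l2 distance from the SC weights to the hyperplane of weights
# that exactly predict the target post-treatment outcome. Data are made up.
rng = np.random.default_rng(1)
Y = rng.normal(100.0, 10.0, size=5)    # post-treatment control outcomes
w_sc = np.full(5, 1 / 5)               # stand-in SC weights
y_target = float(Y @ w_sc) + 3.0       # target outcome, off by a residual of 3

resid = Y @ w_sc - y_target            # equals -3.0 by construction
B = abs(resid) / np.linalg.norm(Y)     # closed-form misspecification error

# Projecting w_sc onto {w : Y @ w = y_target} attains exactly that distance.
w_star = w_sc - resid * Y / (Y @ Y)
assert np.isclose(Y @ w_star, y_target)
assert np.isclose(np.linalg.norm(w_star - w_sc), B)

# Dropping a unit weakly shrinks ||Y||, so the same residual implies a
# weakly larger misspecification error with fewer units in the donor pool.
B_drop = abs(resid) / np.linalg.norm(np.delete(Y, 0))
assert B_drop >= B
```

The last comparison is exactly the mechanism conjectured above for why the placebo coverage rates under this metric sit weakly above the 45-degree line.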
This behavior makes the relative magnitudes of control unit misspecification errors computed using $J - 2$ control units and the equivalent errors computed using $J - 1$ control units theoretically ambiguous. However, since the constrained error space metric placebo coverage rates tend to be below the 45-degree line, it must be that the misspecification errors with $J - 2$ units are typically smaller than with $J - 1$ units. This suggests that adding a control unit to the donor pool tends to decrease the pre-treatment fit error constrained to perform well in the post-treatment period of interest by less than it decreases the minimum achievable pre-treatment error.

B Handling Non-Uniqueness of the SC Estimator
As discussed in Footnote 5 in Section 2.1, it is possible that $x$ could lie in the convex hull of the columns of $X$, in which case the optimization problem (2) that defines the SC estimator could have multiple or even infinitely many solutions. In this case, the sensitivity analysis described in Section 2 would understate the impact of misspecification error on treatment effect estimates because it does not account for the multiplicity of valid SC estimators that could result from solving (2). Thankfully, we can apply the generalized sensitivity analysis described in Section 4.1 with an appropriate choice of misspecification error metric $m_j$ to allow for a weight-space-based sensitivity analysis that accounts for non-uniqueness.

In particular, we can choose $m^{wt,mult}_j(w)$ to measure the distance from $w$ to the closest (in $\ell$ distance) SC weights that solve (2). Formally, we let
$$\mathcal{W}^{(j)}_{sc} := \arg\min_{w \in \mathbb{R}^{J - 1\{j \neq 1\}}} \left\{ \| x_j - X_{(-j)} w \| + \psi_{\Delta^{J - 1\{j \neq 1\}}}(w) \right\}$$
with $\psi$ defined in (17), and define
$$m^{wt,mult}_j(w) := \inf_{\tilde w \in \mathbb{R}^{J - 1\{j \neq 1\}}} \left\{ \| \tilde w - w \| + \psi_{\mathcal{W}^{(j)}_{sc}}(\tilde w) \right\}.$$
If we apply our generalized sensitivity analysis using misspecification error metric $m^{wt,mult}_j$, the resulting treatment effect bounds will capture the impact of both misspecification error and potential non-uniqueness of the SC estimator because, in effect, the solutions to (12) and (13) will range over all weights that lie within $d_{m_j}(w^{(j)}_{sc}, \mathcal{W}^*_j)$ of some valid SC estimator, not just the particular SC estimator returned by the estimation procedure.

Note that since $\mathcal{W}^{(j)}_{sc}$ is defined as the solution set of a convex optimization problem and must therefore be a convex set (see Section 4.2.1 of Boyd and Vandenberghe (2004)), and since the infimum of a convex function in one of its arguments over a convex set must be convex in its remaining arguments (again, see Boyd and Vandenberghe (2004)), $m^{wt,mult}_j$ must be convex. However, (12) and (13) with $m^{wt,mult}_j$ directly plugged in are not formulated in manners that are particularly amenable to computation. Instead, we can write (12) using misspecification error metric $m^{wt,mult}_j$ in a form that is more directly solvable as follows:
$$\begin{aligned} d_{m_j}(w^{(j)}_{sc}, \mathcal{W}^*_j) := \inf_{\tilde w, w \in \mathbb{R}^{J-1}} \; & \| \tilde w - w \| \\ \text{s.t.}\; & Y^T_{(-j)T^*} w = Y_{jT^*} \quad \left( \Leftrightarrow w \in \mathcal{W}^*_j \right) \\ & \| x_j - X_{(-j)} \tilde w \| \leq \| x_j - X_{(-j)} w^{(j)}_{sc} \| \\ & 1^T \tilde w = 1 \\ & \tilde w \geq 0. \end{aligned} \tag{18}$$
Similarly, we can write the optimization problems in (13) in a form that facilitates the use of common convex solvers as follows:
$$\begin{aligned} Y^{B_j,-1}_{T^*}(0) := \inf_{\tilde w, w \in \mathbb{R}^{J-1}} \; & Y^T_{(-j)T^*} w \\ \text{s.t.}\; & \| \tilde w - w \| \leq d_{m_j}(w^{(j)}_{sc}, \mathcal{W}^*_j) \\ & \| x_j - X_{(-j)} \tilde w \| \leq \| x_j - X_{(-j)} w^{(j)}_{sc} \| \\ & 1^T \tilde w = 1 \\ & \tilde w \geq 0, \end{aligned} \tag{19}$$
and $Y^{B_j,+1}_{T^*}(0)$ is defined similarly.
Finally, we note that generalizing the constrained weight space sensitivity analysis discussed in Section 4.2 to allow for non-uniqueness is essentially the same as the process described above, with the additional constraints $1^T w = 1$ and $w \geq 0$ added to (18) and (19).

C Other Generalizations
C.1 Other Contrasts
While the treatment effect $\tau_{T^*}$ in period $T^*$ is a natural estimand in comparative case study settings, researchers are often interested in other linear contrasts of outcomes, like the average treatment effect across all post-treatment periods or the effect on the average slope of the treated unit's outcome path. Our sensitivity analysis can be extended naturally to assess the robustness of synthetic control-based estimates of these alternative estimands.

Let $Y_{j,T_0+1:T}(d) := (Y_{j,T_0+1}(d), \ldots, Y_{jT}(d))^T$ denote the vector containing unit $j$'s potential outcomes under treatment arm $d$ in each of the $T - T_0$ post-treatment periods, and suppose we are interested in assessing the robustness of an estimand $\tau_c$ parameterized by the vector $c := (c(1)^T, c(0)^T)^T \in \mathbb{R}^{2(T - T_0)}$:
$$\tau_c := c^T \begin{bmatrix} Y_{1,T_0+1:T}(1) \\ Y_{1,T_0+1:T}(0) \end{bmatrix} = c(1)^T Y_{1,T_0+1:T}(1) + c(0)^T Y_{1,T_0+1:T}(0).$$
For example, if we use the contrast vector $c_{T^*}$ defined entrywise as
$$[c_{T^*}(d)]_t := 1\{t = T^*\}(d - (1 - d)),$$
we recover the treatment effect in period $T^*$ studied in Section 2, $\tau_{T^*} = \tau_{c_{T^*}}$. If we instead use $c_{avg}$ defined entrywise as
$$[c_{avg}(d)]_t := \frac{1}{T - T_0}(d - (1 - d)),$$
we recover the average treatment effect $\tau_{c_{avg}}$ across the $T - T_0$ post-treatment periods. Using $c_{slo}$ defined entrywise as
$$[c_{slo}(d)]_t := \frac{1}{T - T_0 - 1}\left( 1\{t = T\} - 1\{t = T_0 + 1\} \right)(d - (1 - d))$$
yields the effect $\tau_{c_{slo}}$ on the average slope of the treated unit's outcome path, since the sums in the average slopes telescope:
$$\begin{aligned} \tau_{c_{slo}} :=\; & \frac{1}{T - T_0 - 1} \sum_{t = T_0 + 1}^{T - 1} \left( Y_{1,t+1}(1) - Y_{1t}(1) \right) - \frac{1}{T - T_0 - 1} \sum_{t = T_0 + 1}^{T - 1} \left( Y_{1,t+1}(0) - Y_{1t}(0) \right) \\ =\; & \frac{1}{T - T_0 - 1} \left\{ \left[ Y_{1T}(1) - Y_{1,T_0+1}(1) \right] - \left[ Y_{1T}(0) - Y_{1,T_0+1}(0) \right] \right\}. \end{aligned}$$

Next, let $Y_{j,T_0+1:T} := (Y_{j,T_0+1}(D_{j,T_0+1}), \ldots, Y_{jT}(D_{jT}))^T$ be the vector of unit $j$'s observed post-treatment outcomes, let $Y$ be the $(T - T_0) \times J$ matrix of control units' post-treatment outcomes, where the $j$th column of $Y$ is $Y_{j+1,T_0+1:T}$, and let $Y_{-j}$ denote the matrix $Y$ with its $j$th column deleted.
Then, once we have chosen a contrast $c$, we can write the synthetic control estimate of $\tau_c$ for the treated unit as follows:
$$\hat\tau^{sc}_c := c(1)^T Y_{1,T_0+1:T} + c(0)^T \hat Y_{1,T_0+1:T} = c(1)^T Y_{1,T_0+1:T} + c(0)^T Y w_{sc}.$$
Similarly, we can write the placebo treatment effect for the $j$th control unit using the other $J - 1$ control units as the donor pool as follows:
$$\hat\tau^{(j),sc}_c := c(1)^T Y_{j,T_0+1:T} + c(0)^T \hat Y_{j,T_0+1:T} = c(1)^T Y_{j,T_0+1:T} + c(0)^T Y_{-j} w^{(j)}_{sc}.$$
Given this characterization of $\hat\tau^{sc}_c$, modifying the general procedure described in Section 4 is relatively straightforward. First, we replace the constraints $Y^T_{(-j)T^*} w = Y_{jT^*}$ and $Y^T_{T^*} w = Y_{1T^*}$ requiring perfect post-treatment accuracy in period $T^*$ in the optimization problems (12) and (15) with the constraints $c(0)^T Y_{j,T_0+1:T} = c(0)^T Y_{-j} w$ and $c(0)^T Y_{1,T_0+1:T} = c(0)^T Y w$, which is equivalent to requiring correct treatment effect estimation for control unit $j$ in the case of (12) and for the treated unit under the assumption of no effect in the case of (15). The definition of $d_{m^{(j)}}(w^{(j)}_{sc}, \mathcal{W}^*_j)$ should also be updated accordingly. Next, we replace the computations of the bounds on $Y_{1T^*}(0)$ in (13) with the following bounds on the component of the treatment effect that depends on the counterfactual control outcomes $Y_{1,T_0+1:T}(0)$:
$$\mu^{B_j,-1}(0) := \min_{w \in \mathbb{R}^{J-1}} \left\{ c(0)^T Y_{-j} w : m^{(j)}(w) \leq d_{m^{(j)}}(w^{(j)}_{sc}, \mathcal{W}^*_j) \right\}$$
$$\mu^{B_j,+1}(0) := \max_{w \in \mathbb{R}^{J-1}} \left\{ c(0)^T Y_{-j} w : m^{(j)}(w) \leq d_{m^{(j)}}(w^{(j)}_{sc}, \mathcal{W}^*_j) \right\}.$$
Finally, we replace the bounds in (14) with the following bounds on $\tau_c$:
$$TB^c_j := \left[ c(1)^T Y_{1,T_0+1:T} + \mu^{B_j,-1}(0),\; c(1)^T Y_{1,T_0+1:T} + \mu^{B_j,+1}(0) \right].$$

C.2 Other Panel Data Methods
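To fix ideas, here is a small numpy sketch that constructs the three contrast vectors described in C.1 and evaluates a contrast estimate of the form $c(1)^T Y_1 + c(0)^T \hat Y_1$; the outcome values and the number of post-treatment periods are hypothetical:

```python
import numpy as np

# Contrast vectors over n_post post-treatment periods, following the
# definitions above: c(1) weights treated outcomes, c(0) weights the
# synthetic control predictions. All numbers here are illustrative.
n_post = 4

def contrast_period(k):
    """Effect in the k-th post-treatment period (0-indexed)."""
    c1 = np.zeros(n_post)
    c1[k] = 1.0
    return c1, -c1

def contrast_avg():
    """Average effect across the post-treatment periods."""
    c1 = np.full(n_post, 1.0 / n_post)
    return c1, -c1

def contrast_slope():
    """Effect on the average slope; the telescoping sum leaves only endpoints."""
    c1 = np.zeros(n_post)
    c1[-1], c1[0] = 1.0 / (n_post - 1), -1.0 / (n_post - 1)
    return c1, -c1

y_treated = np.array([5.0, 6.0, 8.0, 9.0])   # observed treated outcomes
y_synth = np.array([4.0, 4.5, 5.0, 5.5])     # SC predictions of Y(0)

c1, c0 = contrast_avg()
tau_avg = c1 @ y_treated + c0 @ y_synth      # mean per-period difference
```

Here `tau_avg` equals the mean of the per-period treatment effect estimates, and swapping in `contrast_period` or `contrast_slope` recovers the other two estimands.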
The outcomes-based SC method described in Section 2.1 is by no means the only method for generating counterfactual predictions in comparative case study settings. Besides the classic Difference-in-Differences estimator (Bertrand et al., 2004) and SC estimators that incorporate other pre-treatment covariates (Abadie et al., 2015), a whole suite of methods for panel data inspired by the SC method have been proposed in the past decade, including but not limited to the estimators proposed in Doudchenko and Imbens (2016), Chernozhukov et al. (2017), Ben-Michael, Feller, and Rothstein (2018), Abadie and L'Hour (2018), Arkhangelsky, Athey, Hirshberg, Imbens, and Wager (2020), and Kellogg et al. (2020).

As has been noted in Doudchenko and Imbens (2016), Chernozhukov et al. (2017), and Cattaneo et al. (2019), among others, we can write many of these alternative treatment effect estimators for panel data as affine functions of the control units' post-treatment outcomes $Y_{T^*}$ in period $T^*$:
$$\hat\tau_{T^*} := Y_{1T^*} - \left( \hat\mu + Y^T_{T^*} \hat w \right) = Y_{1T^*} - \begin{bmatrix} 1 & Y^T_{T^*} \end{bmatrix} \begin{bmatrix} \hat\mu \\ \hat w \end{bmatrix}, \tag{20}$$
where we now allow for an intercept term $\hat\mu$ in addition to weights on the control units' post-treatment outcomes.
As indicated by the second equality in (20), to allow for an intercept term, we can simply add an extra "control unit" to the donor pool with ones as all of its outcomes, so we assume we include such an intercept unit and omit the explicit intercept term in what follows.

Once we take this more general perspective, we can see that Procedure 1 easily generalizes to accommodate alternative policy evaluation methods that generate treatment effect estimates as in (20), since all the procedure requires are the weights on units' post-treatment outcomes that generate counterfactual outcome predictions. Specifically, the only difference is that instead of using the SC method to generate the weights used to compute the residuals in steps 1a and 2, we use the weights outputted by some alternative policy evaluation method.

Further, many of the methods listed above can be described as choosing weights to solve a particular instance of the following general convex program:
$$J_{\hat V, r, C}(w) := (x - Xw)^T \hat V (x - Xw) + r(w) + \psi_C(w), \qquad \hat w := \arg\min_{w \in \mathbb{R}^{J+1}} J_{\hat V, r, C}(w), \tag{21}$$
where $C$ is a convex set, $\psi_C$ is a penalty term as defined in (17) to ensure $\hat w \in C$, $\hat V$ is a weighting matrix that can be chosen in a potentially data-driven manner, $r : \mathbb{R}^{J+1} \to [0, \infty)$ is a convex penalty term that regularizes the weights $w$ in some fashion, the columns of $X$ can contain additional pre-treatment covariates beyond control units' pre-treatment outcomes, and we include an additional first column in $X$ containing ones in all the rows corresponding to pre-treatment outcomes and zeros in the other rows.
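One concrete instance of the general program (21) takes $\hat V = I$, an $\ell_2$ penalty $r(w) = \lambda \|w\|^2$, and no constraint set ($\psi_C \equiv 0$), which reduces to ridge regression on the control units' outcomes and admits a closed form. The sketch below is illustrative only, with synthetic data:

```python
import numpy as np

# Ridge instance of the general convex program (21): V_hat = I,
# r(w) = lam * ||w||^2, psi_C = 0. The first column of X is the intercept
# unit's column of ones; all data are synthetic.
rng = np.random.default_rng(2)
T0, J = 8, 5
X = np.column_stack([np.ones(T0), rng.normal(size=(T0, J))])  # intercept first
x = rng.normal(size=T0)                                       # treated unit's path
lam = 0.1

# argmin_w ||x - X w||^2 + lam * ||w||^2 has the familiar closed form.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(J + 1), X.T @ x)

# First-order condition of this instance of (21): the gradient vanishes.
grad = 2 * X.T @ (X @ w_hat - x) + 2 * lam * w_hat
assert np.allclose(grad, 0.0)
```

Other members of the family swap in a data-driven $\hat V$, a different penalty $r$, or a simplex constraint set $C$, at which point a generic convex solver replaces the closed form.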
For brevity, we leave it to the reader to see how the methods listed above can be written in the form of (21) in the papers introducing them. Given this common characterization of many alternative policy evaluation methods, we can see that, for an appropriately defined misspecification error metric $m_j(w)$ that measures the misspecification error incurred by an alternative policy evaluation method's weights relative to the weights $w$, the general Procedure 2 can be used without modification to assess the robustness of treatment effect estimates outputted by that alternative policy evaluation method. One particular misspecification error metric of interest is an analogue of the constrained error space metric $m^{err}_j$ defined in Section 4.1, corresponding to an alternative policy evaluation method defined by particular choices of $\hat V$, $r$, and $C$:
$$m^{gen,err}_j(w) = \frac{J_{\hat V, r, C}(w)}{\min_{\tilde w \in \mathbb{R}^{J - 1\{j \neq 1\}}} J_{\hat V, r, C}(\tilde w)} - 1.$$
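This generalized error-ratio metric is cheap to compute once the method's objective is in hand. The sketch below evaluates it for the ridge-style instance of (21) discussed above, with synthetic data and a perturbed candidate weight vector standing in for a method's output:

```python
import numpy as np

# Generalized constrained error space metric: the ratio of a candidate
# weight vector's objective value to the minimized objective, minus one.
# Ridge-style instance of (21); all data are synthetic.
rng = np.random.default_rng(3)
T0, J = 8, 5
X = rng.normal(size=(T0, J))
x = rng.normal(size=T0)
lam = 0.1

def J_obj(w):
    """Objective J_{V_hat, r, C} with V_hat = I, r(w) = lam*||w||^2, psi_C = 0."""
    return float((x - X @ w) @ (x - X @ w) + lam * w @ w)

w_min = np.linalg.solve(X.T @ X + lam * np.eye(J), X.T @ x)  # global minimizer
w_cand = w_min + 0.05 * rng.normal(size=J)                   # perturbed weights

m_gen_err = J_obj(w_cand) / J_obj(w_min) - 1.0
assert m_gen_err >= 0.0   # zero exactly at the minimizer, positive elsewhere
```

By construction the metric is zero at the method's own weights and grows as a candidate's objective value deteriorates, mirroring the behavior of $m^{err}_j$ in Section 4.1.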