Matching Estimators with Few Treated and Many Control Observations
Bruno Ferman
Sao Paulo School of Economics - FGV
First Draft: May 5th, 2017
This Draft: October 6th, 2019
Abstract
We analyze the properties of matching estimators when there are few treated, but many control observations. We show that, under standard assumptions, the nearest neighbor matching estimator for the average treatment effect on the treated is asymptotically unbiased in this framework. However, when the number of treated observations is fixed, the estimator is not consistent, and it is generally not asymptotically normal. Since standard inference methods are inadequate in this setting, we propose alternative inference methods based on the theory of randomization tests under approximate symmetry. We consider the implications of our findings for synthetic control applications.
Keywords: matching estimators, treatment effects, hypothesis testing, randomization inference, synthetic control estimator
JEL Codes: C12; C13; C21

* The author gratefully acknowledges the comments and suggestions of Luis Alvarez, Ricardo Paes de Barros, Lucas Finamor, Sergio Firpo, Ricardo Masini, Cristine Pinto, Vitor Possebom, Pedro Sant'Anna, Azeem Shaikh, and participants of the 2017 California Econometrics Conference and of the Rio-Sao Paulo Econometrics Conference. Deivis Angeli provided outstanding research assistance.

† email: [email protected]; address: Sao Paulo School of Economics, FGV, Rua Itapeva no. 474, Sao Paulo - Brazil, 01332-000; telephone number: +55 11 3799-3350

Introduction
Matching estimators have been widely used for the estimation of treatment effects under a conditional independence assumption (CIA). In many cases, matching estimators have been applied in settings where (1) the interest is in the average treatment effect on the treated (ATT), and (2) there is a large reservoir of potential controls (Imbens and Wooldridge (2009)). Abadie and Imbens (2006) study the asymptotic properties of matching estimators when the number of control observations grows at a higher rate than the number of treated observations. However, their asymptotic theory still depends on both the number of treated and control observations going to infinity. Therefore, reliance on such asymptotic approximations should be considered with caution when the number of treated observations is small, even if the total number of observations is large.

In this paper, we analyze the properties of matching estimators when the number of treated observations is fixed, while the number of control observations goes to infinity. We first show that the nearest neighbor matching estimator is asymptotically unbiased for the ATT, under standard assumptions used in the literature on estimation of treatment effects under selection on observables. This is consistent with Abadie and Imbens (2006), who show that the conditional bias of the matching estimator can be ignored, provided that the number of control observations increases fast enough relative to the number of treated observations. In their setting, the matching estimator is consistent and asymptotically normal. In our setting, however, the variance of the matching estimator does not converge to zero, and the estimator will not generally be asymptotically normal.
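As a stylized illustration of these two points (approximate unbiasedness, but a variance that does not vanish as the number of controls grows), consider the following sketch. The function names and the data generating process are our own illustrative assumptions, not code or a design from the paper:

```python
import numpy as np

def matching_att(y_treat, x_treat, y_ctrl, x_ctrl, M=1):
    """Nearest neighbor matching estimator of the ATT: for each treated
    unit, impute the missing untreated outcome with the average outcome
    of its M nearest controls (Euclidean metric, for illustration)."""
    y_treat, y_ctrl = np.asarray(y_treat, float), np.asarray(y_ctrl, float)
    x_treat = np.asarray(x_treat, float).reshape(len(y_treat), -1)
    x_ctrl = np.asarray(x_ctrl, float).reshape(len(y_ctrl), -1)
    effects = []
    for i in range(len(y_treat)):
        dist = np.linalg.norm(x_ctrl - x_treat[i], axis=1)
        nn = np.argsort(dist)[:M]  # indexes of the M nearest controls
        effects.append(y_treat[i] - y_ctrl[nn].mean())
    return float(np.mean(effects))

# Monte Carlo with the number of treated units held fixed at 2: the
# spread of the estimator barely changes as the number of controls
# grows, illustrating that the estimator is not consistent.
rng = np.random.default_rng(0)

def estimator_sd(n_treat, n_ctrl, reps=500):
    draws = []
    for _ in range(reps):
        x1, x0 = rng.uniform(0, 1, n_treat), rng.uniform(0, 1, n_ctrl)
        y1 = x1 + rng.normal(0, 1, n_treat)  # zero true treatment effect
        y0 = x0 + rng.normal(0, 1, n_ctrl)
        draws.append(matching_att(y1, x1, y0, x0, M=1))
    return float(np.std(draws))

sd_small, sd_large = estimator_sd(2, 100), estimator_sd(2, 5000)
```

In this assumed design, with unit-variance errors and two treated units, the standard deviation of the estimator stays close to 1 whether there are 100 or 5,000 controls; only increasing the number of treated observations would shrink it.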
Our theory complements the theory developed by Abadie and Imbens (2006), providing a better approximation to settings in which there is a large number of control observations relative to treated observations, but the number of treated observations is not large enough for us to rely on asymptotic results that require it to go to infinity. (See Imbens (2004), Imbens and Wooldridge (2009), and Imbens (2014) for reviews. This is true whether we consider the average treatment effect on the treated conditional or unconditional on the covariates of the treated observations. Also, this is true whether asymptotic unbiasedness is defined based on the limit of the expected value of the estimator, or based on the expected value of the asymptotic distribution.)

We conduct an empirical Monte Carlo (MC) study based on real data, as suggested by Huber et al. (2013). When the dimensionality of the covariates is low, and we consider matching estimators with few nearest neighbors, our simulations suggest that, regardless of the number of treated observations, the bias of the matching estimator is close to zero, even when the number of control observations is not large. Increasing the dimensionality of the covariates and/or increasing the number of nearest neighbors used in the estimation implies that we need an increasing number of controls to keep our approximations reliable.

The fact that the matching estimator is not asymptotically normal in our setting poses important challenges when it comes to inference. Inference based on the asymptotic distribution of the matching estimator derived by Abadie and Imbens (2006) should not provide a good approximation when the number of treated observations is very small, even if there are many control observations. The bootstrap procedure proposed by Otsu and Rai (2017) also relies on the number of both treated and control observations going to infinity. For finite samples, Rosenbaum (1984) and Rosenbaum (2002) consider permutation tests for observational studies under strong ignorability.
However, these tests rely on restrictive assumptions. Rothe (2017) provides robust confidence intervals for average treatment effects under limited overlap. For the case with continuous covariates, he combines his method with subclassification on the propensity score. However, with few treated and many control observations, it would not be possible to reliably estimate a propensity score. Therefore, we consider two alternative inference methods based on the theory of randomization tests under an approximate symmetry assumption, developed by Canay et al. (2017). One test relies on permutations, while the other relies on group transformations given by sign changes.

(The finite sample properties of matching and other related estimators have been evaluated in detail in simulations by, for example, Frolich (2004), Busso et al. (2014), Huber et al. (2013), and Bodory et al. (2018). In contrast to their approach, we provide theoretical and simulation results holding the number of treated observations fixed, but relying on the number of control observations going to infinity. Rosenbaum (1984) assumes that the propensity score follows a logit model, while Rosenbaum (2002) assumes that observations are matched in pairs such that the probability of treatment assignment is the same conditional on the pair. A test based on permutations has been studied in the context of an approximate symmetry assumption by Canay and Kamat (2018) for regression discontinuity designs, while a test based on sign changes has been studied in the context of an approximate symmetry assumption by Canay et al. (2017) for a series of applications.)

With few treated observations, tests based on the asymptotic distribution derived by Abadie and Imbens (2006) and on the bootstrap procedure proposed by Otsu and Rai (2017) can have important size distortions, while the two randomization inference tests we propose control size well even when the number of treated observations is very small. However, the randomization inference tests are only valid for testing sharper null hypotheses. (We show that the randomization inference procedures are valid when we consider more stringent null hypotheses than the null hypothesis that the ATT is equal to zero. Alternatively, these procedures can be valid to test the null hypothesis that the ATT is equal to zero if we impose additional auxiliary assumptions. See details in Section 4.)

We show that the size distortion and power of each test depend crucially on the number of treated observations, the number of control observations, and the number of nearest neighbors used in the estimation, providing guidance on how to evaluate the trade-offs among these test procedures in different scenarios.

As an empirical illustration, we consider the "Jovem de Futuro" (Youth of the Future) program. This program has been running in Brazil since 2008, aimed at improving the quality of education in public schools by improving management practices and allocating grants to treated schools. In 2010, the program was implemented in a randomized control trial with 15 treated schools in Rio de Janeiro and 39 treated schools in Sao Paulo. We estimate the effects of the program using a matching estimator with the non-experimental sample as the control schools. We take advantage of the fact that there were about 1,000 other public schools in Rio de Janeiro and more than 3,000 other public schools in Sao Paulo that did not participate in the experiment, therefore providing a setting with few treated and many control observations. We find marginally significant treatment effects for Sao Paulo, and small and insignificant effects for Rio de Janeiro, which is consistent with the estimates based on the randomized control trial.
Moreover, using the experimental control schools as the treated group for the matching estimator (so that we should expect to find no significant results), we provide empirical evidence that inference based on the asymptotic distribution derived by Abadie and Imbens (2006) may lead to over-rejection when there are very few treated observations, while the randomization inference procedures control size better in this case.

The remainder of this paper proceeds as follows. We present our theoretical setup in Section 2. In Section 3, we derive the asymptotic distribution of the matching estimator and derive conditions under which it is asymptotically unbiased. In Section 4, we consider alternative inference methods that are asymptotically valid when the number of control observations goes to infinity, while the number of treated observations remains fixed. In Section 5, we present an empirical MC simulation based on the "Jovem de Futuro" program, and estimate the effects of this program using a matching estimator. In Section 6, we contrast the different inference procedures in light of the theoretical results presented in Section 4 and the simulations presented in Section 5, providing guidance on which method should be chosen depending on the setting. Concluding remarks, including a discussion of potential implications for Synthetic Control applications, are presented in Section 7.
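Before turning to the formal setup, the second of the two randomization procedures described above (based on sign changes, developed formally in Section 4) can be sketched as follows. This is a stylized illustration under assumed inputs, not code from the paper: each entry of `tau_i` stands for a treated outcome minus the average outcome of its M nearest control neighbors.

```python
import itertools
import numpy as np

def sign_change_test(tau_i, alpha=0.05):
    """Randomization test under sign changes: under the null that the
    conditional means of treated and untreated outcomes coincide (and a
    symmetry assumption on the errors), each per-unit estimate tau_i is
    asymptotically symmetric around zero, so flipping its sign leaves
    the joint limiting distribution unchanged."""
    tau_i = np.asarray(tau_i, dtype=float)
    n1 = len(tau_i)

    def stat(v):
        # t-statistic-type quantity |mean| / spread; any fixed scaling
        # constant is irrelevant since it multiplies every transformed
        # statistic equally and leaves the comparison unchanged
        vbar = v.mean()
        denom = np.sqrt(np.sum((v - vbar) ** 2))
        return np.inf if denom == 0 else abs(vbar) / denom

    t_actual = stat(tau_i)
    t_all = np.sort([stat(np.array(g) * tau_i)
                     for g in itertools.product([-1, 1], repeat=n1)])
    # reject when the actual statistic exceeds the ceil(K*(1-alpha))-th
    # ordered value over all 2^n1 sign assignments
    k = int(np.ceil(len(t_all) * (1 - alpha))) - 1
    return bool(t_actual > t_all[k])
```

For example, `sign_change_test([5, 4, 6, 5, 4, 6])` rejects at the 5% level, while a sequence of estimates centered at zero does not.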
We are interested in estimating the effect of a binary treatment on some outcome. Following Rubin (1973), for each unit i we denote the potential outcomes Y_i(1) if observation i receives treatment and Y_i(0) if observation i does not receive treatment. Therefore, the observed outcome for unit i is given by Y_i = W_i Y_i(1) + (1 − W_i) Y_i(0), where the variable W_i ∈ {0, 1} indicates treatment status. (Influential papers that evaluate the use of non-experimental methods in empirical applications where a randomized control trial is available include LaLonde (1986), Dehejia and Wahba (1999), and Dehejia and Wahba (2002).) In addition to Y_i and W_i, we also observe for each unit i a continuous random vector of pretreatment variables of dimension k in R^k, which we denote by X_i. The case in which components of X_i are discrete, with a finite number of support points, can be easily dealt with by estimating treatment effects within subsamples defined by their values, and then aggregating over such covariates, as argued by Abadie and Imbens (2006). We assume that we observe a sample of N_1 treated (N_0 control) units that consists of i.i.d. observations of units with W_i = 1 (W_i = 0), and that treated and control observations are independent. Let I_w denote the set of indexes for observations with W_i = w.

Assumption 1 (Sample)
For w ∈ {0, 1}, {Y_i, X_i}_{i ∈ I_w} consists of N_w i.i.d. observations with W_i = w. Furthermore, we assume that individuals in the treated and control samples are independent.

We consider the case in which the number of treated observations (N_1) is fixed, while the number of control observations (N_0) goes to infinity. One possibility is that there is a large set of units that could potentially be treated, but only a finite number of those units actually receive treatment. For example, in the empirical application, to be presented in Section 5, there is a large number of schools that could potentially receive the treatment, but only a small number of schools actually received it. Alternatively, we can imagine that there is a large number of treated units, but we only have data from a small sample of them.

We focus on two distinct estimands. First, we consider the conditional average treatment effect on the treated (CATT),

τ({X_i}_{i ∈ I_1}) ≡ (1/N_1) Σ_{i ∈ I_1} E[Y_i(1) − Y_i(0) | X_i, W_i = 1],   (1)

which is, conditional on the realization of {X_i}_{i ∈ I_1}, the expected treatment effect for the treated units with these covariate values. We also consider the unconditional average treatment effect on the treated (UATT), which we denote by

τ' ≡ E[Y_i(1) − Y_i(0) | W_i = 1].   (2)

In both cases, we focus on estimands related to the treatment effect on the treated because, given our setting with N_1 finite and N_0 large, there is no hope of constructing a counterfactual for the control observations using only a finite set of treated observations. In the framework of Imbens and Rubin (2015), these two estimands are defined based on a super-population.

Assumption 1 does not impose any restriction on how the distribution of (Y_i(1), Y_i(0), X_i) for treated and control observations may differ. The following assumption does restrict the way in which these distributions may differ.

Assumption 2 (Conditional Independence Assumption)
Conditional on X_i, the distribution of Y_i(0) is the same for i in the treated and in the control groups.

Assumption 2 is equivalent to the conditional independence assumption (CIA). While in Assumption 1 we allow for different distributions of (Y_i(0), Y_i(1), X_i) depending on whether i is treated or control, Assumption 2 restricts the conditional distribution of Y_i(0) given X_i to be the same for both treatment and control observations. However, the density of X_i for the treated observations (f_1(X_i)) can potentially be different from the density of X_i for the control observations (f_0(X_i)). This is what generates potential bias in a simple comparison of means between treated and control groups, without taking into account that these groups might have different distributions of covariates X_i.

The next assumption states that possible values of X_i for the treated observations are in the support of the distribution of X_i for the control observations.

Assumption 3 (Overlap) X_1 ⊂ X_0, where X_w is the support of f_w(X_i), for w ∈ {0, 1}.

(We do not need to impose such a restriction on Y_i(1) because of our focus on average treatment effects on the treated. A common formulation of overlap in this literature is Pr(W = 1 | X = x) < 1 − η for some η > 0.)

This assumption guarantees that, for each i in the treated group, we can find an observation j in the control group with covariates X_j arbitrarily close to X_i when N_0 → ∞.

The main identification problem arises from the fact that we observe either Y_i(1) or Y_i(0) for each observation i. Note that, if we had two observations, i ∈ I_1 and j ∈ I_0, with X_i = X_j = x, then, under Assumption 2, E[Y_i | W_i = 1, X_i = x] − E[Y_j | W_j = 0, X_j = x] = E[Y_i(1) − Y_i(0) | X_i = x, W_i = 1]. The main challenge is that, with a continuous random variable X_i, the probability of finding observations with exactly the same X_i is zero. The idea of the nearest neighbor matching estimator is to impute the missing potential outcomes of a treated observation i ∈ I_1 with observations in the control group j ∈ I_0 that are as close as possible in terms of covariates X_i. More specifically, for a given metric d(a, b) in R^k, let J_M(i) be the set of M nearest neighbors in the control group of observation i ∈ I_1. Then the matching estimator is given by

τ̂ = (1/N_1) Σ_{i ∈ I_1} [ Y_i − (1/M) Σ_{j ∈ J_M(i)} Y_j ].   (3)

For w ∈ {0, 1}, we define µ(x, w) = E[Y | X = x, W = w] and ǫ_i = Y_i − µ(X_i, W_i). Since we are focusing on the average treatment effect on the treated, we also define µ_w(x) = E[Y(w) | X = x, W_i = 1]. Under Assumption 2, we have that µ(x, 0) = µ_0(x). (Note that Abadie and Imbens (2006) define µ_w(x) = E[Y(w) | X = x]. We use a slightly different definition because we focus on the average treatment effects on the treated.) Using this notation, note that the CATT is given by

τ({X_i}_{i ∈ I_1}) = (1/N_1) Σ_{i ∈ I_1} [ µ_1(X_i) − µ_0(X_i) ],   (4)

and the matching estimator can be written as

τ̂ = (1/N_1) Σ_{i ∈ I_1} [ µ_1(X_i) − (1/M) Σ_{j ∈ J_M(i)} µ_0(X_j) + ǫ_i − (1/M) Σ_{j ∈ J_M(i)} ǫ_j ].   (5)

We first show that τ̂ is an asymptotically unbiased estimator for the CATT when the number of treated observations is fixed and the number of control observations grows, and we derive its asymptotic distribution in this setting.

Proposition 1
Under Assumptions 1, 2, and 3,

1. If µ_0(x) is continuous and bounded, then E[τ̂ | {X_i}_{i ∈ I_1}] → τ({X_i}_{i ∈ I_1}) when N_0 → ∞ and N_1 is fixed.

2. If h̃(x) = E[h(Y(0)) | X = x] is continuous and bounded for any h(y) continuous and bounded, then, conditional on {X_i}_{i ∈ I_1},

τ̂ →d τ({X_i}_{i ∈ I_1}) + (1/N_1) Σ_{i ∈ I_1} [ ǫ_i − (1/M) Σ_{m=1}^{M} ǫ_m(X_i) ]

when N_0 → ∞ and N_1 is fixed, where ǫ_m(X_i) =d Y_i(0) | X_i − µ_0(X_i) for i ∈ I_1, and ǫ_m(X_i) is independent across m and i.

Proof.
See details in Appendix A.1.1.

Let X_{i(m)} be the covariate value of the m-closest match to observation i. The main intuition for the results in Proposition 1 is that, for a fixed X_i = x̄, X_{i(m)} →p x̄ when N_0 → ∞, because, holding M fixed, we will always be able to find M observations in the control group that are arbitrarily close to x̄. Independence of ǫ_m(X_i) across m and i follows from the fact that the probability of two treated observations sharing the same nearest neighbor converges to zero.

Proposition 1 shows that, conditional on the realization of {X_i}_{i ∈ I_1}, the expected value of the matching estimator converges to τ({X_i}_{i ∈ I_1}) = (1/N_1) Σ_{i ∈ I_1} (µ_1(X_i) − µ_0(X_i)) when N_0 → ∞. We also derive the asymptotic distribution of the matching estimator conditional on {X_i}_{i ∈ I_1}, which is centered on τ({X_i}_{i ∈ I_1}). This is important for the construction of the inference methods we propose in Section 4. These results are valid for any fixed value of N_1, including the case with N_1 = 1.

Remark 1
The condition that µ_0(x) is continuous and bounded would be satisfied if we assume that µ_0(x) is continuous and X_0 is compact, as is assumed by Abadie and Imbens (2006). The intuition behind the assumption used in part 2 of Proposition 1 is that the conditional distribution of Y(0) given X = x changes "smoothly" with x. This guarantees that the outcome of the m-closest match to treated observation i, Y_{i(m)}, converges in distribution to Y_i(0) | X_i = x̄ when X_{i(m)} →p x̄. In Appendix A.1.2, we show that this condition is satisfied if, for example, Y(0) | X = x ~ N(θ(x), σ²(x)), where θ(x) and σ²(x) are continuous functions of x.

Remark 2
We focus on the properties of the matching estimator conditional on {X_i}_{i ∈ I_1}. We might be interested, however, in the unconditional properties of the matching estimator. Under the assumptions from part 1 of Proposition 1, E[τ̂] = E{E[τ̂ | {X_i}_{i ∈ I_1}]} converges to τ', which is the UATT. See details in Appendix A.1.3.

Remark 3
With N_1 fixed, the estimator is not consistent. This happens because, with a fixed number of treated observations, we cannot apply a law of large numbers to the average of the errors of the treated observations. For the same reason, the matching estimator will not be asymptotically normal, unless we assume that the error ǫ_i is normal. These conclusions are similar to the ones derived by Conley and Taber (2011) for differences-in-differences estimators with few treated groups.

Remark 4
Consider a bias-corrected estimator given by

τ̂_biasadj = (1/N_1) Σ_{i ∈ I_1} [ Y_i − (1/M) Σ_{j ∈ J_M(i)} ( Y_j + µ̂_0(X_i) − µ̂_0(X_j) ) ],   (6)

where µ̂_0(X_i) is an estimator for µ_0(X_i). With additional assumptions, we can also guarantee that τ̂_biasadj has the same asymptotic distribution as τ̂. The intuition is that µ̂_0(X_i) − µ̂_0(X_{i(m)}) converges in probability to zero when N_0 → ∞, because X_{i(m)} →p X_i. See details in Appendix A.1.4.

Remark 5
We consider an asymptotic framework in which M is held fixed, while N_0 → ∞, which is similar to what Abadie and Imbens (2006) call fixed-M asymptotics in their setting. (The difference relative to the framework considered by Abadie and Imbens (2006) is that we also hold N_1 fixed.) As argued by Abadie and Imbens (2006), the motivation for such fixed-M asymptotics is to provide an approximation to the sampling distribution of matching estimators with a small number of matches. Matching estimators using few matches have been widely used in applied work (see Abadie and Imbens (2006)). Moreover, Imbens and Rubin (2015) argue against using matching estimators with many matches, as this tends to increase the bias of the resulting estimator, while the marginal gains in precision from increasing the number of matches are limited.

The fact that the matching estimator is not generally asymptotically normal when N_1 is fixed and N_0 → ∞ poses an important challenge when it comes to inference. In particular, inference based on the asymptotically normal distribution derived by Abadie and Imbens (2006), or on the bootstrap procedure suggested by Otsu and Rai (2017), should not provide a good approximation in our setting, as the asymptotic theory behind these methods relies on both N_1 and N_0 going to infinity. We therefore consider alternative inference methods based on the theory of randomization tests under an approximate symmetry assumption, developed by Canay et al. (2017). We derive conditions under which these methods are asymptotically valid when N_0 → ∞, even with fixed N_1. The first test is based on group transformations given by permutations, while the second test is based on group transformations given by sign changes.

Consider a function of the data given by

S̃_{N_0} = ( S̃^0_{N_0,1}, S̃^1_{N_0,1}, ..., S̃^M_{N_0,1}, ..., S̃^0_{N_0,N_1}, S̃^1_{N_0,N_1}, ..., S̃^M_{N_0,N_1} )′   (7)

where S̃^0_{N_0,i} = Y_i and S̃^m_{N_0,i} = Y_{i(m)} for m = 1, ..., M.
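A stylized sketch of the permutation test built from this vector (with the test statistic and decision rule as developed in the remainder of this subsection) is given below; the array layout and function names are our own illustrative assumptions:

```python
import itertools
import numpy as np

def permutation_test(s, alpha=0.05):
    """Randomization test over within-set permutations.

    s is an (N1, M+1) array: row i holds the treated outcome Y_i in
    column 0 and the outcomes of its M nearest control neighbors in
    columns 1..M.  The statistic is recomputed for every reassignment
    of the 'treated' slot within each row.  Because the statistic only
    depends on which element occupies the treated slot, we enumerate
    those (M+1)^N1 choices instead of all (M+1)!^N1 permutations; each
    choice corresponds to the same number of permutations, so the
    quantiles of the permutation distribution are unchanged."""
    s = np.asarray(s, dtype=float)
    n1, m_plus_1 = s.shape

    def stat(cols):
        # cols[i] is the column of row i playing the treated role
        vals = []
        for i, c in enumerate(cols):
            rest = np.delete(s[i], c)  # the M remaining 'control' slots
            vals.append(s[i, c] - rest.mean())
        return abs(np.mean(vals))

    t_actual = stat([0] * n1)
    t_all = np.sort([stat(cols)
                     for cols in itertools.product(range(m_plus_1),
                                                   repeat=n1)])
    # reject when the actual statistic exceeds the ceil(K*(1-alpha))-th
    # ordered value of the permutation distribution
    k = int(np.ceil(len(t_all) * (1 - alpha))) - 1
    return bool(t_actual > t_all[k])
```

For instance, with three treated units whose outcomes are far above all of their matched neighbors, the test rejects at the 5% level; when the large outcome sits among the neighbors instead, it does not.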
That is, S̃_{N_0} is a vector containing the outcomes of the treated observations and of their M nearest neighbors. The distribution of S̃_{N_0} depends on N_0, because the quality of the matches depends on N_0. In this notation, the matching estimator is given by

τ̂ = (1/N_1) Σ_{i=1}^{N_1} [ S̃^0_{N_0,i} − (1/M) Σ_{j=1}^{M} S̃^j_{N_0,i} ].   (8)

Let G̃_i be the set of all permutations π_i = (π_i(0), ..., π_i(M)) of {0, 1, ..., M}, π = ⊗_{i=1}^{N_1} π_i, and G̃ = ⊗_{i=1}^{N_1} G̃_i. Note that G̃ is the set of all permutations that reassign the treatment status conditional on having exactly one treated observation in each group formed by treated observation i and its M nearest neighbors. For a given π ∈ G̃, consider

S̃^π_{N_0} = ( S̃^{π_1(0)}_{N_0,1}, S̃^{π_1(1)}_{N_0,1}, ..., S̃^{π_1(M)}_{N_0,1}, ..., S̃^{π_{N_1}(0)}_{N_0,N_1}, S̃^{π_{N_1}(1)}_{N_0,N_1}, ..., S̃^{π_{N_1}(M)}_{N_0,N_1} )′.

Let K̃ = |G̃| and denote by

T̃^(1)(S̃_{N_0}) ≤ T̃^(2)(S̃_{N_0}) ≤ ... ≤ T̃^(K̃)(S̃_{N_0})   (9)

the ordered values of {T̃(S̃^π_{N_0}) : π ∈ G̃}, where

T̃(S̃^π_{N_0}) = | (1/N_1) Σ_{i=1}^{N_1} [ S̃^{π_i(0)}_{N_0,i} − (1/M) Σ_{j=1}^{M} S̃^{π_i(j)}_{N_0,i} ] |.   (10)

We set k̃ = ⌈K̃(1 − α)⌉, where α is the significance level of the test, and define the decision rule of the test as

φ̃(S̃_{N_0}) = 1 if T̃(S̃_{N_0}) > T̃^(k̃)(S̃_{N_0}), and 0 if T̃(S̃_{N_0}) ≤ T̃^(k̃)(S̃_{N_0}).   (11)

In words, we calculate the test statistic T̃(S̃^π_{N_0}) for all possible permutations in G̃, and then we reject the null if the actual test statistic T̃(S̃_{N_0}) is large relative to the distribution given by these permutations.

If N_1 > 1, we could also consider a standardized test statistic

T̃_std(S̃^π_{N_0}) = | (1/N_1) Σ_{i=1}^{N_1} ( S̃^{π_i(0)}_{N_0,i} − (1/M) Σ_{j=1}^{M} S̃^{π_i(j)}_{N_0,i} ) | / sqrt( (1/N_1²) Σ_{i=1}^{N_1} ( S̃^{π_i(0)}_{N_0,i} − (1/M) Σ_{j=1}^{M} S̃^{π_i(j)}_{N_0,i} − τ̃^π )² ),   (12)

where τ̃^π = (1/N_1) Σ_{i=1}^{N_1} ( S̃^{π_i(0)}_{N_0,i} − (1/M) Σ_{j=1}^{M} S̃^{π_i(j)}_{N_0,i} ), and consider a decision rule as in (11).

We show that, if we consider the null hypothesis

H_0 : Y_i(0) | X_i =d Y_i(1) | X_i for all i ∈ I_1,   (13)

then such a test is asymptotically level α, meaning that the probability of rejection under the null converges to a value lower than or equal to α when N_0 → ∞.

Proposition 2
Suppose the assumptions used in part 2 of Proposition 1 are valid, and that the distribution of Y_i(0) | X_i is continuous. If we consider the problem of testing (13), then a test based on the decision rule defined in (11) is asymptotically level α, for any α ∈ (0, 1), when N_0 → ∞ and N_1 is fixed.

Proof. See details of the proof in Appendix A.1.5.

The main intuition of the proof is that, when N_0 → ∞, the limiting distribution of S̃_{N_0}, under the null, is invariant to the transformations in G̃. From the proof of Proposition 1, note that S̃^m_{N_0,i} = Y_{i(m)} →d Y_i(0) | X_i, for all m = 1, ..., M. Therefore, under the null defined in (13), we have that S̃^j_{N_0,i} →d Y_i(0) | X_i for all j = 0, ..., M. Moreover, asymptotically, S̃^j_{N_0,i} is independent across i and j, because the probability that two treated units share the same nearest neighbor converges to zero when N_0 → ∞. This last point is true whenever at least one covariate is continuous.

Remark 6
Rosenbaum (2002) considers Fisher exact tests in observational studies with matched pairs. He shows that, if the probability of treatment assignment is the same for both observations in each pair, then a permutation test conditional on the pair is valid, even in finite samples. With a finite N_0 and continuous X_i, however, it is not possible to guarantee this condition, even under Assumption 2, since we will not have, in general, a perfect match in terms of covariates. We show that this condition can be approximately satisfied when N_0 → ∞ and N_1 is fixed using the theory of randomization inference under approximate symmetry developed by Canay et al. (2017).

Remark 7
The null hypothesis (13) implies that τ({X_i}_{i ∈ I_1}) = 0, but the converse is not true. To understand why this is crucial for this test, suppose, for example, that E[Y_i(1) | X_i] = E[Y_i(0) | X_i] for all i ∈ I_1, but V[Y_i(1) | X_i] > V[Y_i(0) | X_i] (in this case, τ({X_i}_{i ∈ I_1}) = 0, but the null hypothesis (13) does not hold). If M > 1, then a permutation that uses control observations in place of treated ones would have a less volatile distribution relative to the distribution of the matching estimator. This would lead to a rejection rate higher than α. This is an important drawback of this test, if the underlying interest is in testing τ({X_i}_{i ∈ I_1}) = 0. In this case, the test may reject at a rate higher than α even when τ({X_i}_{i ∈ I_1}) = 0. However, we need to consider a more stringent null hypothesis to guarantee that the test is valid for any fixed value of N_1 (even for N_1 = 1).

(If all covariates are discrete, then there are other inference methods that could be used, such as the one proposed by Rothe (2017). Therefore, we are interested in considering inference methods that are valid when at least one covariate is continuous. The intuition is similar to the one presented by Ferman and Pinto (2019a) for the DID estimator: the matching estimator compares the average of N_1 treated observations with the average of M × N_1 control nearest neighbors.)

Remark 8
This permutation test is similar in spirit to the test proposed by Conley and Taber (2011) for differences-in-differences with few treated and many control groups. Note that they assume that errors are i.i.d. across groups. In line with their results, the permutation test we propose would also be asymptotically valid if, for example, we test the null hypothesis τ({X_i}_{i ∈ I_1}) = 0 (instead of the null hypothesis (13)), but impose the additional auxiliary assumptions that ǫ_i is i.i.d. for all i, and that the treatment effect is homogeneous. This highlights the fact that we need to rely on stronger assumptions if we want to construct a test that is valid regardless of the number of treated observations.

Remark 9
If we consider T̃_std(S̃^π_{N_0}) instead of T̃(S̃^π_{N_0}) as the test statistic, then the limitation of the permutation test presented in Remark 7 is ameliorated. The intuition is that the distribution of the test statistic is more similar across permutations when we standardize, as MacKinnon and Webb (2019) observe for DID applications. If we consider a setting with N_1 fixed, then the test would still only remain valid for the sharp null hypothesis (13) (or if we impose the homogeneous treatment effects and homoskedasticity auxiliary assumptions). However, based on the simulations from MacKinnon and Webb (2019), such an adjustment is very effective even for relatively small values of N_1.

Remark 10
It is possible to consider test statistics other than the ones presented in equations (10) and (12). We consider T̃(S̃^π_{N_0}) and T̃_std(S̃^π_{N_0}) because we want to have power against alternatives such that τ({X_i}_{i ∈ I_1}) ≠ 0.

(If M > 1, the variance of the treated observations has a relatively larger impact on the variance of the matching estimator than the variance of the control observations. As a consequence, permutations that place control observations as treated would have a lower variance than the actual estimator if V[Y_i(1) | X_i] > V[Y_i(0) | X_i]. Following the same logic, this also implies that such a test may have low power if the treatment decreases the variance of the outcome (that is, V[Y_i(1) | X_i] < V[Y_i(0) | X_i]). Ferman and Pinto (2019a) consider a method similar to the one proposed by Conley and Taber (2011), but that allows for specific forms of heteroskedasticity based on the observed covariates. However, if we consider that all the observable variables that may induce heteroskedasticity are already included as covariates in the matching process, then this would be innocuous. More specifically, in this case, we would already be considering only permutations of observations with similar values of X_i, which is essentially a non-parametric version of Ferman and Pinto (2019a). Conley and Taber (2011) also consider alternative methods to allow for heteroskedasticity. However, these alternatives would also allow only for heteroskedasticity based on observables.)

Remark 11
As outlined by Bugni et al. (2018), the null hypothesis Y_i(0) | X_i =d Y_i(1) | X_i is implied by what is sometimes referred to as a "sharp null hypothesis," in which Y_i(1) = Y_i(0) with probability one.

Remark 12
The test would remain valid if we consider the null hypothesis Y_i(0) | X_i =d Y_i(1) | X_i + c_i for all i ∈ I_1, for a known vector of constants c = (c_1, ..., c_{N_1}), instead of the null hypothesis defined in (13).

Remark 13
Canay et al. (2017) consider a randomized version of the test to deal with cases in which T̃(S̃_{N_0}) = T̃^(k̃)(S̃_{N_0}). Their approach guarantees a test with asymptotic size α. We focus on the non-randomized version of the test that rejects the null hypothesis if T̃(S̃_{N_0}) > T̃^(k̃)(S̃_{N_0}), which guarantees that the test is asymptotically level α, although it may be conservative. The under-rejection will only be relevant if K̃ is very small, where K̃ is a function of N_1 and M.

Remark 14
This test is asymptotically valid, when N_0 → ∞, in part because the probability that different treated observations share the same nearest neighbor goes to zero. In finite samples, however, this may not be the case, and two treated observations will likely share the same nearest neighbor when N_0 is not large enough relative to N_1 and M. To take that into account, we consider a finite sample fix in the permutation test. If a control observation is the nearest neighbor for two or more treated observations, then we restrict to permutations of S̃_{N_0} such that this control observation is always placed as either treated or control. Since the probability that two treated observations share the same nearest neighbor goes to zero when N_1 is fixed and N_0 → ∞, for a fixed M, this finite sample adjustment is asymptotically irrelevant. (Another alternative would be to consider a matching estimator without replacement. However, this would generate lower quality matches, which implies more bias (Abadie and Imbens (2006)). Moreover, matching without replacement has the disadvantage that the estimator is not invariant to different sortings of the data.)

Remark 15

This test is also asymptotically valid for bias-corrected matching estimators, as presented in (6). In this case, we define S̃^0_{N_0,i} = Y_i and S̃^m_{N_0,i} = Y_{i(m)} + µ̂_0(X_i) − µ̂_0(X_{i(m)}). The key idea is that, again, S̃^m_{N_0,i} = Y_{i(m)} + µ̂_0(X_i) − µ̂_0(X_{i(m)}) →d Y_i(0) | X_i, for all m = 1, ..., M, because µ̂_0(X_i) − µ̂_0(X_{i(m)}) →p 0.

We now consider an alternative function of the data given by

S_{N_0} = ( τ̂_{N_0,1}, ..., τ̂_{N_0,N_1} )′   (14)

where τ̂_{N_0,i} = Y_i − (1/M) Σ_{j ∈ J_M(i)} Y_j. Each τ̂_{N_0,i} depends on the M nearest neighbors of observation i, so its distribution depends on N_0.

Following Canay et al. (2017), we consider a test statistic given by

T(S_{N_0}) = |τ̂| / sqrt( (1/N_1²) Σ_{i=1}^{N_1} (τ̂_{N_0,i} − τ̂)² ),   (15)

where τ̂ = (1/N_1) Σ_{i ∈ I_1} τ̂_{N_0,i} is the matching estimator for the treatment effect on the treated. We consider the group of transformations given by G = {−1, 1}^{N_1}, where gS_{N_0} = ( g_1 τ̂_{N_0,1}, ..., g_{N_1} τ̂_{N_0,N_1} )′. Let K = |G| and denote by

T^(1)(S_{N_0}) ≤ T^(2)(S_{N_0}) ≤ ... ≤ T^(K)(S_{N_0})   (16)

the ordered values of {T(gS_{N_0}) : g ∈ G}. Let k = ⌈K(1 − α)⌉, where α is the significance level of the test, and define the decision rule

φ(S_{N_0}) = 1 if T(S_{N_0}) > T^(k)(S_{N_0}), and 0 if T(S_{N_0}) ≤ T^(k)(S_{N_0}).   (17)

In words, we calculate the test statistic T(gS_{N_0}) for all possible gS_{N_0} = ( g_1 τ̂_{N_0,1}, ..., g_{N_1} τ̂_{N_0,N_1} )′, and then we compare the actual test statistic T(S_{N_0}) with the distribution {T(gS_{N_0}) : g ∈ G}. We show that such a test is asymptotically valid if we consider the null hypothesis

H_0 : µ_1(X_i) = µ_0(X_i) for all i ∈ I_1.   (18)

Proposition 3
Suppose the assumptions used in part 2 of Proposition 1 are valid, and that the distribution of Y_i(1) | X_i (Y_i(0) | X_i) is continuous and symmetric around μ₁(X_i) (μ₀(X_i)) for all i = 1, ..., N. If we consider the problem of testing 18, then a test based on the decision rule defined in 17 is asymptotically level α for any α ∈ (0, 1), when N₀ → ∞ and N₁ is fixed.

Proof.
See details of the proof in Appendix A.1.6. Again, the main intuition of the proof is that, when N₀ → ∞, the limiting distribution of S_N, under the null, is invariant to the transformations in G. This is true if, asymptotically, τ̂_{N,i} and τ̂_{N,j} are independent for i ≠ j, and the distribution of τ̂_{N,i} is symmetric around zero. It is not necessary for τ̂_{N,i} to have the same distribution across i. From Proposition 1, we know that, under the null, the asymptotic distribution of τ̂_{N,i}, conditional on {X_i}_{i∈I₁}, is given by ε_i − (1/M) Σ_{m=1}^{M} ε_m(X_i). This distribution is symmetric around zero, given the assumption that Y_i(1) | X_i and Y_i(0) | X_i are symmetric around μ₁(X_i) and μ₀(X_i), which coincide under the null, for all i = 1, ..., N. Moreover, Proposition 1 also shows that, asymptotically, the τ̂_{N,i} are independent across i.

Remark 16
The null hypothesis defined in 18 allows for different distributions of potential outcomes when treated and control. In particular, it allows for heteroskedasticity, as it may be that V[Y_i(1) | X_i] ≠ V[Y_i(0) | X_i] under the null. This null hypothesis is implied by more narrowly defined null hypotheses that are usually considered in Fisher-type tests, such as Y_i(0) | X_i =d Y_i(1) | X_i or Y_i(0) = Y_i(1) with probability one. However, it is still more stringent than the null hypothesis that τ({X_i}_{i∈I₁}) = 0.

Remark 17
If the null hypothesis 18 is false, but τ({X_i}_{i∈I₁}) = 0, then the test would tend to be conservative. The reason is that, in this case, we will have (1/N₁) Σ_{i∈I₁} g_i (μ₁(X_i) − μ₀(X_i)) ≠ 0 for at least some g ≠ (1, ..., 1), even though (1/N₁) Σ_{i∈I₁} (μ₁(X_i) − μ₀(X_i)) = 0. This will tend to generate a distribution for the test statistic, given these group transformations, that is more volatile than the distribution of the actual test statistic. Therefore, even if we consider a null hypothesis τ({X_i}_{i∈I₁}) = 0, we should still expect to have a level-α test.

Remark 18
This test can be extended to test null hypotheses of the form μ₁(X_i) = μ₀(X_i) + c_i for all i ∈ I₁, for a known vector of constants c = (c₁, ..., c_{N₁}).

Remark 19
Similarly to the point raised in Remark 14, this test is asymptotically valid because the probability that different treated observations share the same nearest neighbor goes to zero when N₀ → ∞, which implies that τ̂_{N,i} and τ̂_{N,i′} are asymptotically uncorrelated for i ≠ i′. Therefore, we also suggest a finite-sample adjustment, in which we restrict to sign changes such that g_i = g_j if i and j share the same nearest neighbor. Similar to the finite-sample adjustment used in the test based on permutations, the probability that this modification is relevant converges to zero when N₀ → ∞.

Remark 20
Remark 13 also applies to this test.
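To fix ideas, the sign-change test defined by (14)–(17) can be sketched in a few lines. This is an illustrative implementation under our own conventions (function and variable names are ours, not from the paper), using full enumeration of G = {−1, 1}^{N₁}, which is feasible for the small N₁ we have in mind:

```python
import itertools
import math

import numpy as np

def sign_change_test(tau_i, alpha=0.05):
    """Non-randomized sign-change test of H0: mu1(X_i) = mu0(X_i) for all i in I1.

    tau_i holds the unit-level contrasts tau_hat_{N,i} = Y_i minus the average
    outcome of the M nearest control neighbors of treated unit i.
    """
    tau_i = np.asarray(tau_i, dtype=float)
    n1 = tau_i.size

    def stat(x):
        # t-type statistic: |tau_hat| over the cross-unit dispersion of the tau_i
        xbar = x.mean()
        return abs(xbar) / math.sqrt(np.sum((x - xbar) ** 2) / n1 ** 2)

    t_obs = stat(tau_i)
    # enumerate all K = 2^{n1} sign changes g in G = {-1, 1}^{n1};
    # the finite-sample adjustment of Remark 19 would additionally force
    # g_i = g_j whenever i and j share a nearest neighbor
    draws = sorted(stat(np.array(g) * tau_i)
                   for g in itertools.product([-1.0, 1.0], repeat=n1))
    K = len(draws)
    k = math.ceil(K * (1 - alpha))
    reject = t_obs > draws[k - 1]        # reject iff T(S_N) > T^{(k)}(S_N)
    p_value = sum(d >= t_obs for d in draws) / K
    return bool(reject), p_value
```

With strongly positive contrasts, for example sign_change_test([2.1, 1.9, 2.4, 2.2, 1.8, 2.0]), the test rejects with p-value 2/64, since only g = (1, ..., 1) and g = (−1, ..., −1) reproduce the observed statistic.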
Remark 21
This test is also asymptotically valid for bias-corrected matching estimators, as defined in equation 6. In this case, we define τ̃_{N,i} = Y_i − (1/M) Σ_{j∈J_M(i)} (Y_j − μ̂₀(X_j) + μ̂₀(X_i)).

"Jovem de Futuro" Program: Monte Carlo Simulations & Empirical Application

We explore the validity of matching estimators and of different inferential methods in the estimation of the effects of an educational program in Brazil called "Jovem de Futuro". This application provides a setting with few treated and many control schools. In Section 5.1, we conduct an empirical Monte Carlo (MC) study based on this application (e.g., Huber et al. (2013)), while in Section 5.2 we estimate the effects of the program using matching estimators.

Before we proceed, we start with a brief description of the program and present some descriptive statistics (see Barros et al. (2012) for more details). The "Jovem de Futuro" program, an initiative of the "Instituto Unibanco" (Unibanco Institute), aims to improve the quality of education in Brazilian public schools. It is a three-year intervention based on two efforts: (i) providing school managers with strategies and instruments to become more efficient and productive, and (ii) providing conditional cash transfers to schools. In 2007, the Unibanco Institute created and implemented the program in three schools in Sao Paulo. It then implemented a few randomized control trials in the following years to evaluate the impact of the program.

We focus on the 2010 implementation of the program, which took place in Rio de Janeiro and Sao Paulo. Schools in these two states were invited to participate in the program, knowing in advance that they would be randomly assigned to receive the program starting in 2010, or that they would be placed first as a control group and would start the program only in 2013.
We use information from the 2007 to 2012 "Exame Nacional do Ensino Médio" (ENEM), a national exam that evaluates high school students in Brazil, as our measure of school performance. (The conditions of the cash transfers are to improve students' performance on a standardized examination by the Institute at the end of each school year and to implement a participatory budget process in the school.) Focusing on schools with test score information from 2007 to 2012, we have 15 treated schools in Rio de Janeiro and 39 in Sao Paulo, with the same number of control schools in each state.

Column 1 of Table 1 presents the difference in test scores for treated and control experimental schools in Rio de Janeiro, and column 3 shows the same difference for schools in Sao Paulo. Panel A presents this information for 2007 to 2009, before the intervention. For Rio de Janeiro, all differences are small and not statistically different from zero, as one would expect given random assignment. For Sao Paulo, however, there are significant differences in test scores in 2007 and 2008, suggesting that there may have been some problems in the assignment of treatment schools. Panel B presents the results for the three years after the implementation of the program. The comparison between treated and control schools suggests a null effect of the program in Rio de Janeiro, and a positive and significant effect in Sao Paulo. We should be careful in interpreting the results for Sao Paulo, however, due to the imbalances in pre-intervention test scores.

Columns 2 and 4 of Table 1 present differences in test scores between public schools that did not participate in the experiment and schools in the experimental control group. In Rio de Janeiro, schools that (voluntarily) decided to participate in the experiment had better outcomes prior to the intervention, relative to other schools that did not participate in the experiment. In Sao Paulo, schools in the experimental control group were, on average, worse than the schools that did not participate in the experiment. Interestingly, Rio de Janeiro has 966 and Sao Paulo has 3,481 non-experimental public schools, thus providing a setting with few treated and many (non-experimental) control schools.

It is not possible to identify the schools that participated in the "Jovem de Futuro" experiment using the public-access ENEM microdata before 2007. For this reason, we do not consider earlier implementations of the program in Minas Gerais and Rio Grande do Sul, because we would have only one year of pre-treatment outcomes.

For 2007 and 2008, we focus on the score on a 63-question multiple-choice test on various subjects (Portuguese, History, Geography, Math, Physics, Chemistry, and Biology). Since 2009, the exam has been composed of 180 multiple-choice questions, equally divided into four areas of knowledge: languages, codes, and related technologies; human sciences and related technologies; natural sciences and related technologies; and mathematics and its technologies. In this case, we consider the average score for these four areas. For each year and for each state, we standardize the test scores based on the sample of students from the experimental control schools. We exclude one control and two treated schools from Sao Paulo because they lack information for at least one of these years.

Rosa (2015) analyzes the "Jovem de Futuro" program using a differences-in-differences approach, exploiting the experimental design of the program. He finds a positive and significant effect of the program for both Rio de Janeiro and Sao Paulo. There are a few differences in our analyses that justify the different results. First, we consider an intention-to-treat effect, including schools that abandoned the program after its implementation, while Rosa (2015) includes only strata with no attritors (see Ferman and Ponczek (2017) for a discussion of potential bias from the exclusion of strata with attrition problems). Second, Rosa (2015) considers an exam that was administered in the treated and control schools to evaluate this program. We are not able to use this dataset because this information is not available for non-experimental schools. Finally, we aggregate our data at the school level, while Rosa (2015) uses individual-level data.
We consider an empirical MC study based on the "Jovem de Futuro" implementation. We first estimate a probit model using schools' average test scores in the three years prior to the intervention as covariates. We estimate the probit model using the implementation of the program in Sao Paulo, where the program focused on serving schools with lower test scores, so treatment selection is a more severe problem in this case. We also include private schools to have a larger population for the simulation study. Then we exclude the treated schools and draw placebo treatments for all schools in Brazil, with a treatment selection process based on the estimated probit model. We have a population of 20,363 schools for this simulation study. Based on these simulations, we find, on average, a difference of −0.32 points in a standardized test score when we simply compare treated and control schools under this selection process, revealing that schools that participated in this program had, on average, worse test scores relative to other schools.

For each realization of the placebo treatment, we control the number of treated and control observations by selecting a random sample of N₁ treated and N₀ control schools, with N₁ ranging from 5 to 50 and N₀ ∈ {50, 500}. We then estimate the nearest neighbor matching estimator with M ∈ {1, 4, 10}, using three years of pre-intervention outcomes as matching variables. We also calculate rejection rates based on the asymptotic distribution derived by Abadie and Imbens (2006), and based on the randomization inference tests presented in Section 4.

Simulation results are similar if we include only public schools. Results available upon request.
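The design of the empirical MC study can be mimicked with synthetic data. The sketch below is our own illustration, not the paper's code: the logistic selection index stands in for the estimated probit, and the data-generating process is invented for the example. It draws a placebo treatment that targets low-scoring units, samples N₁ treated and N₀ control units, and computes the M-nearest-neighbor estimator on three pre-treatment outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic "population" of schools with three pre-treatment scores;
# the logistic index below is a stand-in for the paper's estimated probit
POOL = 20_000
X = rng.normal(size=(POOL, 3))
p_treat = 1.0 / (1.0 + np.exp(0.8 * X.mean(axis=1)))  # lower scores -> more likely treated

def one_replication(n1=10, n0=500, M=4):
    """One placebo draw: returns the M-nearest-neighbor ATT estimate.

    The treatment is placebo, so the estimand is zero by construction."""
    treated = rng.random(POOL) < p_treat
    t_idx = rng.choice(np.flatnonzero(treated), size=n1, replace=False)
    c_idx = rng.choice(np.flatnonzero(~treated), size=n0, replace=False)
    Y = X.sum(axis=1) + rng.normal(size=POOL)          # outcome depends on X only
    tau_i = []
    for i in t_idx:
        dist = np.sum((X[c_idx] - X[i]) ** 2, axis=1)  # squared Euclidean distance
        nn = c_idx[np.argsort(dist)[:M]]               # M nearest controls (with replacement across i)
        tau_i.append(Y[i] - Y[nn].mean())
    return float(np.mean(tau_i))
```

Averaging one_replication over many draws should give an estimate close to zero, while the raw treated-control difference is negative by construction, mirroring the selection on low scores described above.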
Bias and Mean Square Error
Panel A of Table 2 shows the average bias of the nearest-neighbor matching estimator. Columns 1 and 2 have M = 1. For N₀ = 50, the matching estimator for the treatment effect on the treated has a bias of around 0.01, regardless of the number of observations in the treated group, which reflects the fact that, with a finite N₀, it is impossible to guarantee a perfect match in X for the treated observations and their nearest neighbors. This bias, however, equals only about 3% of the bias of a naive comparison between treated and control observations, suggesting that, in this setting, the matching estimator is very effective in controlling for differences in observables between treated and control schools, even when N₀ is not large. Consistent with Proposition 1, the average bias shrinks to zero when we increase the number of control observations, regardless of the number of treated observations. When the matching estimator has more nearest neighbors, the bias increases, but it remains close to zero when N₀ = 500. This happens because, with a limited number of control observations, we end up with poorer matches when considering an estimator with more nearest neighbors. This loss in match quality becomes less relevant when there are many control observations.

Panel B of Table 2 presents the mean square error (MSE) of the matching estimators. While the MSE is always decreasing in N₁ and N₀, two competing forces come into play when M increases. On the one hand, using more nearest neighbors reduces the variance of the matching estimator. On the other hand, it increases the bias of the estimator. With N₀ = 500, since increasing M from one to ten has little impact on the bias, using more nearest neighbors (in this range) always reduces the MSE of the matching estimator. However, with smaller N₀ there are some cases in which increasing M actually increases the MSE, exposing the trade-off between bias and variance for the matching estimator.

Appendix Table A.1 presents simulations in which the dimensionality of the covariates increases.
While the number of covariates does not affect the theoretical conclusions from Proposition 1, these simulations confirm the intuition that, when the dimensionality of the covariates increases, a larger N₀ is required to keep our approximations reliable. Finally, Appendix Table A.2 presents simulations for a bias-corrected estimator, as defined in equation 6. While the average bias is reduced using this procedure, the effects on the MSE are ambiguous. In particular, the bias-corrected estimator may lead to higher MSE when N₁ is very small and N₀ is not large. When N₀ is large, the bias correction becomes less relevant, so the bias and MSE of the two estimators become very similar.

Inference: test size
Panels C to E of Table 2 show rejection rates for 5% tests using different inference methods. A superscript "+" indicates a rejection rate greater than 6%, and a superscript "−" a rejection rate lower than 4%. Importantly, while the different test procedures rely on different null hypotheses, all these null hypotheses are valid in the simulations. We discuss in detail the implications of considering tests that rely on different null hypotheses in Section 6.

Panel C of Table 2 presents rejection rates using the test based on Abadie and Imbens (2006). Rejection rates for a 5% test are higher than 13% when N₁ = 5, and around 9% when N₁ = 10, for all values of N₀ and M. This happens because the asymptotic distribution derived by Abadie and Imbens (2006) relies on N₁ → ∞, even though it allows N₀ to grow at a faster rate than N₁. When N₁ increases, rejection rates go down, although they are still marginally higher than 5% even when N₁ = 50. The simulations suggest that rejection rates computed using the asymptotic variance derived by Abadie and Imbens (2006) should be considered with caution when the number of treated observations is very small.

We generate three additional covariates with the same distributions as the test scores from 2007, 2008, and 2009, but which are independent of all other random variables in the model. Then we estimate the matching estimator including these variables, in addition to the original ones, as covariates. A mismatch in these additional variables would not directly generate bias in the matching estimator. However, the addition of these variables makes it harder to find a good match in terms of relevant covariates, which might lead to higher bias.

We use linear least squares with only the nearest neighbors to estimate μ₀(x). This is the procedure used in the teffects command in Stata.

While there is an asymmetry in that over-rejection is usually considered a more relevant problem relative to under-rejection, it is also important to highlight cases in which a test under-rejects, as this might imply that the test is under-powered.

We consider in our simulations the default options of the teffects command in Stata, which uses the robust standard errors derived by Abadie and Imbens (2006) with two nearest neighbors for the estimation of the variance.

Panel D of Table 2 shows rejection rates using the randomization inference test based on permutations. Rejection rates are close to 5% in most cases. The exceptions are the scenarios with M = 1 and N₁ = 5, and with M = 10 and N₀ ∈ {50, 500}, in which the test is conservative. In both cases, the test is conservative because there are relatively few possible permutations. In the first case, there are few possible permutations because the dimension of S̃_N is small. Therefore, the test should remain conservative even when we increase N₀. In the second case, the test is conservative because we end up with many shared nearest neighbors (see Remark 14). Therefore, the test would lead to rejection rates closer to 5% if we increase N₀.

Panel E of Table 2 shows rejection rates using the randomization inference test based on sign changes, presented in Section 4.2. When the nearest-neighbor matching estimator with M = 1 is considered, rejection rates using this test are close to 5%, except when N₁ = 5. In this case, few different group transformations exist, which explains why the test is conservative. When we consider matching estimators with M > 1 and N₀ = 50, the test under-rejects the null hypothesis, even for larger N₁. This happens because increasing M increases the probability that different treated observations share the same nearest neighbors, which in turn reduces the number of group transformations. When N₀ = 500, this problem becomes less relevant, and rejection rates approach 5% when M = 4. However, the test is still conservative when M = 10. Since this comes from a higher proportion of shared neighbors when M = 10, the test would lead to rejection rates closer to 5% if we increase N₀ (except for the case with N₁ = 5).

We use the non-randomized version of the test, in which we do not reject the null hypothesis in case of equality. We could guarantee the correct size if we used a randomized version of the test, as explained in Remark 13. Similar to the case of permutations, the conservativeness of the sign-change test also happens because we use the non-randomized version of the test, in which we do not reject in case of equality; again, we could guarantee the correct size if we used a randomized version of the test.

Appendix Table A.1 shows some over-rejection for the randomization inference tests when we increase the dimensionality of the covariates, which is explained by the fact that the bias is more relevant in this scenario. Again, such over-rejection does not arise if N₀ is large enough. When a bias-corrected estimator is used, Appendix Table A.2 also shows some over-rejection in the permutation test when N₁ and N₀ are small, despite the fact that the bias is smaller. When N₀ is large, there is not much difference in rejection rates between the standard and the bias-corrected matching estimator. Finally, Appendix Table A.3 shows that the bootstrap test proposed by Otsu and Rai (2017) can also lead to over-rejection when N₁ is small. When N₁ and N₀ increase, rejection rates converge to 5%, which was expected given their theoretical results.

Inference: test power
Table 3 presents rejection rates when we assume a homogeneous treatment effect (that is, Y_i(1) = Y_i(0) + c for a constant c > 0). An important caveat when comparing these different inference procedures is that inference based on the asymptotic distribution derived by Abadie and Imbens (2006) leads to over-rejection under the null, particularly when N₁ is small. Therefore, these results should be considered with caution in these cases. As expected, the power of these tests is increasing with N₁. The power is also increasing with M, but at decreasing rates, which is expected given the discussion presented by Imbens and Rubin (2015) that M should not be large. Most importantly, the two randomization inference tests present non-trivial power in many settings in which tests that rely on N₁ → ∞ would lead to over-rejection. The only exceptions are cases in which the randomization inference tests are conservative. In Section 6, we contrast these different inference procedures in more detail, providing guidance on how to evaluate the trade-offs of these methods in different settings.

We focus on the wild bootstrap implementation of the test, using the two-point distribution suggested by Mammen (1993). Another alternative proposed by Otsu and Rai (2017) would be a nonparametric bootstrap. However, with few treated and many control observations, we would likely generate bootstrap samples with no treated observations. Differently from the other tests we considered, this test must be based on a bias-corrected estimator, and it requires some properties of the estimator for μ₀(x) (see Otsu and Rai (2017) for details). Following Otsu and Rai (2017), we estimate μ₀(x) using linear OLS with all control observations. We also present results using the estimator for μ₀(x) used by default in the teffects command in Stata, which makes the over-rejection more significant.
Empirical Application

Our idea is to estimate the effects of the program using a matching estimator with the experimental treated schools as treated observations and schools that did not participate in the experiment as control observations, therefore providing a setting with few treated and many control observations. Moreover, we take advantage of the randomized control trial to analyze the validity of the matching estimator and of different inference methods in this setting. More specifically, we consider a matching estimator using the experimental control schools as treated observations, and schools that did not participate in the experiment as control observations. Since the experimental control schools did not actually receive the treatment in the analyzed period, we should not expect to find significant effects in this case.

One important caveat in using ENEM test scores is that the treatment may have affected the probability that a student would take the exam. We do not find, however, significant differences in the number of students who took the exam between treated and control schools (see Appendix Table A.4). Moreover, one of our main exercises in this empirical application is to analyze the performance of matching estimators using the experimental control schools as the treated observations. Since the experimental control schools were not affected by the treatment, we do not have any reason to believe sample selection should be a problem in this case.

Table 4 shows estimated effects from 2010 to 2012 using the experimental control schools as the treated observations in our matching estimators. These schools volunteered to participate in the program, but were not actually treated during this period. Therefore, if the matching estimators are valid, then we should not expect to find significant effects. In addition to the point estimates, p-values are calculated using the asymptotic distribution derived by Abadie and Imbens (2006), and from the two proposed RI tests.
We use test scores from 2007 to 2009 as matching variables. Interestingly, estimates for Rio de Janeiro (columns 1 to 4) generally have lower p-values using the test based on Abadie and Imbens (2006), relative to the alternative inference procedures. In particular, a test based on Abadie and Imbens (2006) would reject the null at 10% in two cases, while the other tests would fail to reject the null. This is consistent with our simulations from Section 5.1, which show that the test based on Abadie and Imbens (2006) may lead to over-rejection when N₁ is small. The difference in p-values across methods is less pronounced when we consider estimates for Sao Paulo, which is consistent with the larger number of "treated" schools in Sao Paulo.

Finally, Table 5 presents estimated effects using the experimental treated schools as the treated observations in our matching estimators. The effects for Rio de Janeiro are small and not significantly different from zero, which is consistent with the experimental results presented in Table 1. For Sao Paulo, some results for 2011 and 2012 are significant, depending on the specification. While positive, the estimates for Sao Paulo are generally smaller than the experimental results presented in Table 1, which is consistent with the imbalances in pre-treatment outcomes for the experimental sample.

The different test procedures we consider present important trade-offs in terms of size distortion, power, and the underlying null hypothesis they rely on. In light of the theoretical properties derived in Section 4, and of the empirical evidence presented in Section 5, we provide guidance on how to evaluate these trade-offs. First, note that tests based on the asymptotic distribution derived by Abadie and Imbens (2006), and on the bootstrap procedure proposed by Otsu and Rai (2017), are valid to test the null hypothesis τ({X_i}_{i∈I₁}) = 0, provided both N₁ and N₀ are large enough so that the asymptotic approximations are reliable.
The randomization inference tests, in contrast, rely on more stringent null hypotheses (alternatively, we can consider that the randomization inference tests are valid to test the null τ({X_i}_{i∈I₁}) = 0 if we impose additional auxiliary assumptions). Therefore, if we are interested in testing this null and we believe N₁ is large enough so that this approximation is reasonable, then we should use one of the tests that rely on N₁ → ∞, instead of the randomization inference ones. In our simulations, for example, there is only a slight over-rejection when N₁ = 50, so the advantage of using an inference method that is valid under a less stringent null hypothesis should dominate.

When N₁ is not that large, the over-rejection of tests that rely on N₁ → ∞ becomes more relevant, so it may be reasonable to consider alternative inference procedures that allow for N₁ fixed. The randomization inference test based on sign changes relies on a slightly more stringent null hypothesis, that the average treatment effect for each value of X_i is equal to zero. However, in light of Remark 17, if τ({X_i}_{i∈I₁}) = 0, but the null is false because treatment effects are heterogeneous across X_i, then the test would under-reject. This means that such a test would only reject at a rate greater than α when it is actually the case that τ({X_i}_{i∈I₁}) ≠ 0 (asymptotically, with N₁ fixed and N₀ → ∞). Therefore, if the goal is to test the null τ({X_i}_{i∈I₁}) = 0, then we should not expect over-rejection for any value of N₁. This test would have low power if N₁ is very small, or if N₀ is not very large, so that many treated observations share the same nearest neighbors. Since the proportion of shared nearest neighbors is increasing with M, this provides another reason to avoid using matching estimators with large M (see Imbens and Rubin (2015) for other reasons to avoid using large M). In our simulations, the test based on sign changes becomes an attractive alternative for intermediate values of N₁.
In these cases, it has the correct size and non-trivial power, while tests that rely on N₁ → ∞ presented relevant size distortions. When N₁ = 5, however, this test is underpowered, so it should not be used.

Finally, the randomization inference test based on permutations is the only one that provides correct size and non-trivial power when N₁ is very small. However, it relies on a very stringent null hypothesis, which implies that we may reject at a rate greater than α even when τ({X_i}_{i∈I₁}) = 0 (see Remark 7). Using a standardized test statistic T̃_std(S̃^π_N) helps ameliorate the over-rejection when V[Y_i(1) | X_i] > V[Y_i(0) | X_i]. However, this correction would be less effective exactly when N₁ is very small. Therefore, this test should only be used when alternative methods either lead to significant over-rejection or provide trivial power.

We consider the asymptotic properties of matching estimators when the number of control observations is large, but the number of treated observations is fixed. In this setting, the nearest neighbor matching estimator is asymptotically unbiased for the ATT under standard assumptions used in the literature on estimation of treatment effects under selection on observables. Moreover, we provide tests, based on the theory of randomization under approximate symmetry, that are asymptotically valid when the number of treated observations is fixed and the number of control observations goes to infinity. The different test procedures we consider present important trade-offs in terms of size distortion, power, and the underlying null hypothesis they rely on. We therefore provide guidance on how to evaluate the trade-offs among these different test procedures in specific settings.

Our results are also relevant for SC applications.
Following Doudchenko and Imbens (2016), the SC and matching estimators are nested in a framework in which the estimated counterfactual outcome for the treated observation is a linear combination of the outcomes for the controls. In the framework of Doudchenko and Imbens (2016), if we consider linear combinations of the controls such that the weights given to observations with large discrepancies in pre-treatment outcomes relative to the treated units go to zero, then, following the same arguments as we do for the matching estimator, the estimator is asymptotically unbiased if treatment assignment is "as good as random" conditional on this set of pre-treatment outcomes. This is exactly the case for the penalized SC estimator for disaggregated data proposed by Abadie and L'Hour (2019). Under these conditions, the randomization inference test we propose based on sign changes remains asymptotically valid when the number of control units goes to infinity. This provides an interesting alternative for inference, when there are multiple treated units and a large number of control units, that relies on neither exchangeability nor homoskedasticity assumptions. The only caveat is that a very large number of control observations is needed when the number of pre-treatment periods is large, so that the approximations remain reliable.
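In the Doudchenko and Imbens (2016) framework, both estimators build the counterfactual as Σ_j w_j Y_j over the controls; the M-nearest-neighbor estimator is the special case with weight 1/M on the M closest controls. A minimal sketch of these weights (our own illustration, with hypothetical names):

```python
import numpy as np

def nn_weights(x_treated, X_controls, M=3):
    """Weights of the M-nearest-neighbor estimator viewed as a linear
    combination of controls: w_j = 1/M on the M controls closest to the
    treated unit in pre-treatment outcomes, and 0 elsewhere."""
    dist = np.sum((X_controls - x_treated) ** 2, axis=1)
    w = np.zeros(len(X_controls))
    w[np.argsort(dist)[:M]] = 1.0 / M
    return w

# the estimated counterfactual is then w @ Y_controls; weights on controls
# far from the treated unit are exactly zero, as required for the
# asymptotic unbiasedness argument in the text
```

A synthetic control estimator would instead choose the w_j by minimizing a pre-treatment fit criterion (with a penalty, in the case of Abadie and L'Hour (2019)), but it shares the same linear-combination structure.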
References
Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic Control Methods for Com-parative Case Studies: Estimating the Effect of California’s Tobacco Control Program.
Journal of the American Statiscal Association , 105(490):493–505.Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators foraverage treatment effects.
Econometrica , 74(1):235–267.Abadie, A. and Imbens, G. W. (2011). Bias-corrected matching estimators for averagetreatment effects.
Journal of Business & Economic Statistics , 29(1):1–11.Abadie, A. and L’Hour, J. (2019). A penalized synthetic control estimator for disaggregateddata.Barros, R., de Carvalho, M., Franco, S., and Rosal´em, A. (2012). Impacto do projeto jovemde futuro.
Estudos em Avalia¸c˜ao Educacional , 23(51):214–226. If, however, treatment assignment is only “as good as random” conditional on a set of common factors(which allows for some correlation between treatment assignment and post-treatment potential outcomes),then this would not necessarily be true. See Abadie et al. (2010), Botosaru and Ferman (2019), Ferman andPinto (2019b), and Ferman (2019) for a discussion on the validity of the synthetic control estimator undera different set of assumptions. See Firpo and Possebom (2018), Ferman and Pinto (2017) and Hahn and Shi (2017) for a discussion onthe placebo test proposed by Abadie et al. (2010). Chernozhukov et al. (2017) propose a permutation testbased on the timing of the intervention. This test, however, would require a very large number of periods.Instead, our test may be an alternative when the number of periods is not large, but the number of controlunits is large.
Journal of Business & Economic Statistics, 0(ja):1-43.

Botosaru, I. and Ferman, B. (2019). On the role of covariates in the synthetic control method. Econometrics Journal, 22(2):117-130.

Bugni, F. A., Canay, I. A., and Shaikh, A. M. (2018). Inference under covariate-adaptive randomization. Journal of the American Statistical Association.

Busso, M., DiNardo, J., and McCrary, J. (2014). New evidence on the finite sample properties of propensity score reweighting and matching estimators. The Review of Economics and Statistics, 96(5):885-897.

Canay, I. A. and Kamat, V. (2018). Approximate permutation tests and induced order statistics in the regression discontinuity design. The Review of Economic Studies. Forthcoming.

Canay, I. A., Romano, J. P., and Shaikh, A. M. (2017). Randomization tests under an approximate symmetry assumption. Econometrica, 85(3):1013-1030.

Chernozhukov, V., Wuthrich, K., and Zhu, Y. (2017). An exact and robust conformal inference method for counterfactual and synthetic controls.

Conley, T. G. and Taber, C. R. (2011). Inference with "difference in differences" with a small number of policy changes. The Review of Economics and Statistics, 93(1):113-125.

Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053-1062.

Dehejia, R. H. and Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. The Review of Economics and Statistics, 84(1):151-161.

Doudchenko, N. and Imbens, G. (2016). Balancing, regression, difference-in-differences and synthetic control methods: A synthesis.

Ferman, B. (2019). On the properties of the synthetic control estimator with many periods and many controls. arXiv e-prints, arXiv:1906.06665.

Ferman, B. and Pinto, C. (2017). Placebo tests for synthetic controls. MPRA Paper 78079, University Library of Munich, Germany.

Ferman, B. and Pinto, C. (2019a). Inference in differences-in-differences with few treated groups and heteroskedasticity. The Review of Economics and Statistics, 101(3):452-467.

Ferman, B. and Pinto, C. (2019b). Synthetic controls with imperfect pre-treatment fit.

Ferman, B. and Ponczek, V. (2017). Should we drop covariate cells with attrition problems? MPRA paper, University Library of Munich, Germany.

Firpo, S. P. and Possebom, V. A. (2018). Synthetic control method: Inference, sensitivity analysis and confidence sets. Journal of Causal Inference, 6.

Frölich, M. (2004). Finite-sample properties of propensity-score matching and weighting estimators. The Review of Economics and Statistics, 86(1):77-90.

Hahn, J. and Shi, R. (2017). Synthetic control and inference. Econometrics, 5(4).

Huber, M., Lechner, M., and Wunsch, C. (2013). The performance of estimators based on the propensity score. Journal of Econometrics, 175(1):1-21.

Imbens, G. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics.

Imbens, G. (2014). Matching methods in practice: Three examples. NBER Working Paper 19959, National Bureau of Economic Research.

Imbens, G. and Wooldridge, J. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5-86.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY, USA.

LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4):604-620.

MacKinnon, J. G. and Webb, M. D. (2019). Randomization inference for difference-in-differences with few treated clusters. Journal of Econometrics. Forthcoming.

Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. Annals of Statistics, 21(1):255-285.

Otsu, T. and Rai, Y. (2017). Bootstrap inference of matching estimators for average treatment effects. Journal of the American Statistical Association, 112(520):1720-1732.

Rosa, L. (2015). Avaliação de impacto do programa jovem de futuro.

Rosenbaum, P. R. (1984). Conditional permutation tests and the propensity score in observational studies. Journal of the American Statistical Association, 79(387):565-574.

Rosenbaum, P. R. (2002). Covariance adjustment in randomized experiments and observational studies. Statistical Science, 17(3):286-327.

Rothe, C. (2017). Robust confidence intervals for average treatment effects under limited overlap. Econometrica, 85(2):645-660.

Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 29(1):159-183.

Table 1: "Jovem de Futuro": Summary Statistics
                      Rio de Janeiro                       Sao Paulo
              Exp. Treated    Nonexp. Control      Exp. Treated    Nonexp. Control
              - Exp. Control  - Exp. Control       - Exp. Control  - Exp. Control
                   (1)             (2)                  (3)             (4)

Panel A: Before treatment
2007              0.040          -0.091                0.116***        0.117***
                 (0.111)         (0.082)              (0.042)         (0.034)
2008              0.006          -0.136**              0.091**         0.061
                 (0.098)         (0.059)              (0.041)         (0.046)
2009              0.026          -0.122                0.030           0.096**
                 (0.111)         (0.079)              (0.053)         (0.045)

Panel B: After treatment
2010             -0.063          -0.197***             0.097*          0.070*
                 (0.124)         (0.073)              (0.057)         (0.042)
2011              0.065          -0.086                0.142***        0.112***
                 (0.101)         (0.059)              (0.048)         (0.039)
2012              0.016          -0.121**              0.129**         0.093**
                 (0.102)         (0.050)              (0.054)         (0.041)
Table 2: Empirical Monte Carlo Simulation

                       M = 1                 M = 4                 M = 10
                N0 = 50   N0 = 500     N0 = 50   N0 = 500     N0 = 50   N0 = 500
                  (1)       (2)          (3)       (4)          (5)       (6)

Panel A: |average bias| (× 100)
N1 = 5           1.143     0.338        1.618     0.673        2.156     0.936
N1 = 10          1.112     0.465        1.585     0.711        2.085     0.706
N1 = 25          0.883     0.369        1.547     0.576        2.148     0.833
N1 = 50          1.030     0.466        1.608     0.635        2.137     0.771

Panel B: mean squared error (× 100)
N1 = 5           2.587     2.481        1.822     1.591        1.733     1.440
N1 = 10          1.425     1.286        1.005     0.816        0.989     0.739
N1 = 25          0.677     0.516        0.515     0.344        0.522     0.315
N1 = 50          0.453     0.268        0.357     0.186        0.365     0.167

Panel C: rejection rates based on AI (2006)
N1 = 5           0.139+    +            +         +            +         +
N1 = 10          0.093+    +            +         +            +         +
N1 = 25          0.068+    +            +         +            +         +
N1 = 50          0.063     +            +         +            +         +

Panel D: test based on RI, permutations
N1 = 5           0.009     −            −         −
N1 = 10          0.049     0.053        0.045     0.052        0.024     −
N1 = 25          0.053     0.049        0.025     −            −         −
N1 = 50          0.054     0.046        0.016−    −            −         −

Panel E: test based on RI, sign changes
N1 = 5           0.009−    −            −         −            −         −
N1 = 10          0.049     0.053        0.000     −            −         −
N1 = 25          0.052     0.052        0.000     −            −         −
N1 = 50          0.053     0.050        0.000     −            −         −

Note: This table presents simulation results from the empirical MC study described in Section 5.1. Panel A reports the average bias (multiplied by 100), while Panel B reports the mean squared error (multiplied by 100) of the matching estimator. Panel C presents rejection rates based on the asymptotic distribution derived by Abadie and Imbens (2006). Panel D presents rejection rates for the randomization inference test based on permutations, proposed in Section 4.2, while Panel E presents rejection rates for the randomization inference test based on sign changes, proposed in Section 4.1. We include a superscript "+" when the rejection rate is greater than 6% and a superscript "−" when it is lower than 4%. For each combination (N1, N0), we run 10,000 simulations.
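The pattern in Panels A and B, with the bias shrinking as the number of controls grows while the variance does not vanish when the number of treated units is fixed, can be illustrated with a stylized simulation. This is only a sketch under an assumed DGP (a scalar uniform covariate, a linear conditional mean, and zero true effect), not the empirical MC design based on the Jovem de Futuro data:

```python
import numpy as np

def nn_match_att(y_t, x_t, y_c, x_c, M=1):
    """M-nearest-neighbor matching estimator of the ATT (scalar covariate)."""
    taus = []
    for yi, xi in zip(y_t, x_t):
        idx = np.argsort(np.abs(x_c - xi))[:M]  # M closest controls
        taus.append(yi - y_c[idx].mean())
    return np.mean(taus)

def simulate(n1, n0, n_sims, rng):
    """Distribution of tau_hat under a zero-effect DGP: Y = X + N(0, 1)."""
    est = np.empty(n_sims)
    for s in range(n_sims):
        x_t = rng.uniform(0.3, 0.7, n1)   # treated support inside control support
        x_c = rng.uniform(0.0, 1.0, n0)
        y_t = x_t + rng.normal(0, 1, n1)  # true ATT is zero
        y_c = x_c + rng.normal(0, 1, n0)
        est[s] = nn_match_att(y_t, x_t, y_c, x_c)
    return est

rng = np.random.default_rng(42)
est_small = simulate(n1=5, n0=50, n_sims=500, rng=rng)
est_large = simulate(n1=5, n0=5000, n_sims=500, rng=rng)
# the conditional bias shrinks as n0 grows, but with n1 fixed the variance
# of tau_hat stays bounded away from zero (roughly 2/n1 here)
```

Increasing n0 by two orders of magnitude removes the matching discrepancy but leaves the dispersion of the estimator essentially unchanged, which is the sense in which the estimator is asymptotically unbiased yet not consistent.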
Table 3: Empirical Monte Carlo Simulation: Test Power

                       M = 1                 M = 4                 M = 10
                N0 = 50   N0 = 500     N0 = 50   N0 = 500     N0 = 50   N0 = 500
                  (1)       (2)          (3)       (4)          (5)       (6)

Panel A: rejection rates based on AI (2006)
N1 = 5           0.386+    +            +         +            +         +
N1 = 10          0.471+    +            +         +            +         +
N1 = 25          0.683+    +            +         +            +         +
N1 = 50          0.821+    +            +         +            +         +

Panel B: test based on RI, permutations
N1 = 5           0.034     −            +         +            +         +
N1 = 10          0.314+    +            +         +            +         +
N1 = 25          0.608+    +            +         +            +         +
N1 = 50          0.785+    +            +         +            +         +

Panel C: test based on RI, sign changes
N1 = 5           0.036     −            −         −            −
N1 = 10          0.312+    +            −         +            −         +
N1 = 25          0.606+    +            −         +            −         +
N1 = 50          0.786     +            +         −            +         −

Note: This table presents rejection rates for the empirical MC study described in Section 5.1, considering a true treatment effect of 0.20 standard deviations in the individual-level test scores. Panel A presents rejection rates based on the asymptotic distribution derived by Abadie and Imbens (2006). Panel B presents rejection rates for the randomization inference test based on permutations, proposed in Section 4.2, while Panel C presents rejection rates for the randomization inference test based on sign changes, proposed in Section 4.1. We include a superscript "+" when the rejection rate is greater than 6% and a superscript "−" when it is lower than 4%. For each combination (N1, N0), we run 10,000 simulations.

Table 4: Non-experimental Results, Experimental Control Schools as Treated Observations
                          Rio de Janeiro                    Sao Paulo
                     M = 1    M = 4    M = 10       M = 1    M = 4    M = 10
                      (1)      (2)      (3)          (4)      (5)      (6)

Treatment effects in 2010
Point Estimate       0.087   -0.003    0.046        0.000    0.018    0.004
p-values:
  AI (2006)          0.091    0.941    0.086        0.995    0.601    0.924
  RI-permutation     0.124    0.960    0.449        0.996    0.679    0.933
  RI-sign changes    0.123    0.938    0.179        0.996    0.609    0.917

Treatment effects in 2011
Point Estimate       0.043   -0.032    0.000       -0.019   -0.027   -0.013
p-values:
  AI (2006)          0.566    0.396    0.997        0.746    0.475    0.692
  RI-permutation     0.659    0.626    0.999        0.771    0.553    0.783
  RI-sign changes    0.662    0.438    0.997        0.734    0.496    0.693

Treatment effects in 2012
Point Estimate       0.070   -0.019    0.006       -0.072   -0.034   -0.019
p-values:
  AI (2006)          0.263    0.522    0.885        0.169    0.383    0.616
  RI-permutation     0.295    0.742    0.918        0.189    0.453    0.665
  RI-sign changes    0.306    0.576    0.896        0.185    0.382    0.495

Note: This table presents non-experimental results using a matching estimator with experimental control schools as treated observations and non-experimental schools as control observations. Columns 1 to 3 present results for Rio de Janeiro using 1, 4, or 10 nearest neighbors in the estimation, while columns 4 to 6 present results for Sao Paulo. We present the estimated effects separately for 2010, 2011, and 2012. For each estimate, we present p-values calculated based on the asymptotic distribution derived by Abadie and Imbens (2006), and based on the randomization inference procedures described in Section 4.

Table 5: Non-experimental Results, Experimental Treated Schools as Treated Observations
                          Rio de Janeiro                    Sao Paulo
                     M = 1    M = 4    M = 10       M = 1    M = 4    M = 10
                      (1)      (2)      (3)          (4)      (5)      (6)

Treatment effects in 2010
Point Estimate      -0.056   -0.012    0.017        0.039    0.025    0.051
p-values:
  AI (2006)          0.319    0.736    0.596        0.412    0.516    0.119
  RI-permutation     0.344    0.851    0.816        0.429    0.587    0.305
  RI-sign changes    0.349    0.766    0.624        0.427    0.538    0.111

Treatment effects in 2011
Point Estimate      -0.100    0.045    0.033        0.040    0.070    0.055
p-values:
  AI (2006)          0.247    0.415    0.545        0.318    0.080    0.123
  RI-permutation     0.280    0.600    0.667        0.321    0.098    0.181
  RI-sign changes    0.273    0.551    0.537        0.346    0.116    0.213

Treatment effects in 2012
Point Estimate       0.023    0.030    0.044        0.054    0.089    0.063
p-values:
  AI (2006)          0.719    0.516    0.293        0.312    0.032    0.090
  RI-permutation     0.739    0.694    0.543        0.311    0.052    0.142
  RI-sign changes    0.720    0.566    0.257        0.337    0.043    0.122

Note: This table replicates the results from Table 4 using the experimental treated schools as treated observations for the matching estimators.

Online Appendix for "Matching Estimators with Few Treated and Many Control Observations"
A.1 Proof of Main Results
A.1.1 Proof of Proposition 1

Proof. For a given realization $X_i = \bar{x}$ of an observation in the treated group and for a given $\epsilon > 0$, consider the probability that the $M$ closest realizations of $\{X_j\}_{j \in I_0}$ satisfy $d(X_j, \bar{x}) < \epsilon$. Let $X_{i(M)}$ be the $M$-closest match of observation $i$. Then,

$$\Pr\left(d(X_{i(M)}, \bar{x}) > \epsilon\right) = \sum_{m=0}^{M-1} \Pr\left(d(X_j, \bar{x}) < \epsilon \text{ for exactly } m \text{ observations}\right) = \sum_{m=0}^{M-1} \binom{N_0}{m} \left[\Pr(d(X_j, \bar{x}) < \epsilon)\right]^m \left[\Pr(d(X_j, \bar{x}) > \epsilon)\right]^{N_0 - m}. \quad (19)$$

Since $\bar{x} \in \mathbb{X}$, we have that $\Pr(d(X_j, \bar{x}) < \epsilon) > 0$, which implies that $\Pr(d(X_j, \bar{x}) > \epsilon) < 1$. Therefore, $\Pr\left(d(X_{i(M)}, \bar{x}) > \epsilon\right) \rightarrow 0$. By analogy, the $m$-nearest neighbor of $i$ for $m < M$ also converges in probability to $\bar{x}$.

Now consider

$$E[\hat{\tau} \mid \{X_i\}_{i \in I_1}] = \frac{1}{N_1} \sum_{i \in I_1} \left( \mu_1(X_i) - E\left[ \frac{1}{M} \sum_{m=1}^{M} \mu_0(X_{i(m)}) \right] \right). \quad (20)$$

Since $\mu_0(x)$ is continuous and bounded and $X_{i(m)} \stackrel{p}{\rightarrow} X_i$, we have that $E[\mu_0(X_{i(m)}) \mid X_i] \rightarrow \mu_0(X_i)$, which proves part 1 of Proposition 1.

For part 2, we assume that $\tilde{h}(x) = E[h(Y(0)) \mid X = x]$ is continuous and bounded for any $h: \mathbb{R} \rightarrow \mathbb{R}$ continuous and bounded. Let $Y_{i(m)}$ be the outcome of the $m$-nearest neighbor of treated observation $i$. Therefore, for any $h(y)$ continuous and bounded, and for a given $X_i = \bar{x}$, we have that

$$E[h(Y_{i(m)})] = E\left\{ E[h(Y_{i(m)}) \mid X_{i(m)}] \right\} = E\left\{ \tilde{h}(X_{i(m)}) \right\} \rightarrow \tilde{h}(\bar{x}) = E[h(Y(0)) \mid X = \bar{x}]. \quad (21)$$

By the Portmanteau Lemma, we have that $Y_{i(m)} \stackrel{d}{\rightarrow} Y(0) \mid \{X = \bar{x}\}$. Under Assumption 2, $Y_{i(m)} \stackrel{d}{\rightarrow} \mu_0(X_i) + \epsilon_m(X_i)$, where $\epsilon_m(X_i) \stackrel{d}{=} Y_i(0) \mid X_i - \mu_0(X_i)$. Therefore, conditional on $\{X_i\}_{i \in I_1}$,

$$\hat{\tau} = \frac{1}{N_1} \sum_{i \in I_1} \left[ Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{i(m)} \right] \stackrel{d}{\rightarrow} \frac{1}{N_1} \sum_{i \in I_1} \left[ \left(\mu_1(X_i) - \mu_0(X_i)\right) + \epsilon_i - \frac{1}{M} \sum_{m=1}^{M} \epsilon_m(X_i) \right]. \quad (22)$$

Now we just have to show that $\epsilon_m(X_i)$ is independent across $m$ and $i$. Since $X_i$ is a continuous random variable, $X_i \neq X_j$ with probability one for $i \neq j$ with $i, j \in I_1$. Since there is a finite number of treated observations, it must be that, conditional on $\{X_i\}_{i=1}^{N_1}$, there is an $\eta > 0$ such that $d(X_i, X_j) > \eta$ for all $i, j \in I_1$ with $i \neq j$. However, we know that $\Pr(d(X_i, X_{i(m)}) > \epsilon) \rightarrow 0$ for any $\epsilon > 0$. Therefore, the probability that $k \in I_0$ belongs to both $J_M(i)$ and $J_M(j)$ converges to zero. Under the assumption that the errors $\epsilon_i$ are independent across $i$ (which is guaranteed by Assumption 1), we have that $\epsilon_m(X_i)$ is independent across $m$ and $i$.

A.1.2 Particular case: $Y(0) \mid X$ is normally distributed

Let $Y \sim N(\theta, \sigma^2)$. We first want to show that $\tilde{h}(\theta, \sigma) = E[h(Y) \mid \theta, \sigma]$ is continuous and bounded for any $h(\cdot)$ continuous and bounded. In this case,

$$\tilde{h}(\theta, \sigma) = \int h(y) \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{y - \theta}{\sigma}\right)^2} dy. \quad (23)$$

Let $g(y, \theta, \sigma) = h(y) \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{y - \theta}{\sigma}\right)^2}$. Since $h(y)$ is continuous and bounded, $g(y, \theta, \sigma)$ is integrable for all $(\theta, \sigma)$, and, for all $y \in \mathbb{R}$, $g(y, \theta, \sigma)$ is continuous in $(\theta, \sigma)$. We now show that there is a neighborhood of $(\theta_0, \sigma_0)$ and an integrable function $q: \mathbb{R} \rightarrow \mathbb{R}_+$ such that, for all $(\theta, \sigma)$ in this neighborhood, $|g(y, \theta, \sigma)| \leq q(y)$.

Consider the neighborhood of $(\theta_0, \sigma_0)$ given by $(\theta_0 - \delta, \theta_0 + \delta) \times (\sigma_0 - \delta, \sigma_0 + \delta)$ (where $\delta$ is sufficiently small so that $\sigma_0 - \delta > 0$), and let

$$q(y) = \begin{cases} |h(y)| \frac{1}{\sqrt{2\pi}(\sigma_0 - \delta)} e^{-\frac{1}{2}\left(\frac{y - (\theta_0 + \delta)}{\sigma_0 + \delta}\right)^2} & \text{if } y > \theta_0 + \delta, \\ |h(y)| \frac{1}{\sqrt{2\pi}(\sigma_0 - \delta)} & \text{if } y \in [\theta_0 - \delta, \theta_0 + \delta], \\ |h(y)| \frac{1}{\sqrt{2\pi}(\sigma_0 - \delta)} e^{-\frac{1}{2}\left(\frac{y - (\theta_0 - \delta)}{\sigma_0 + \delta}\right)^2} & \text{if } y < \theta_0 - \delta. \end{cases} \quad (24)$$

For any $(\theta, \sigma) \in (\theta_0 - \delta, \theta_0 + \delta) \times (\sigma_0 - \delta, \sigma_0 + \delta)$, $|g(y, \theta, \sigma)| \leq q(y)$, and $q(y)$ is integrable. Therefore, $\tilde{h}(\theta, \sigma)$ is continuous at any point $(\theta_0, \sigma_0)$. Moreover, since $h(y)$ is bounded, $\tilde{h}(\theta, \sigma)$ is also bounded.

Now let $Y(0) \mid X = x \sim N(\theta(x), \sigma^2(x))$. Since compositions of continuous functions are continuous, it follows that $\tilde{h}(x) = \int h(y) \frac{1}{\sqrt{2\pi}\,\sigma(x)} e^{-\frac{1}{2}\left(\frac{y - \theta(x)}{\sigma(x)}\right)^2} dy$ is bounded and continuous in $x$.

A.1.3 Unconditional Expectation
Now we consider the unconditional expectation of $\hat{\tau}$:

$$E[\hat{\tau}] = E\left\{ E[\hat{\tau} \mid \{X_i\}_{i \in I_1}] \right\} = \frac{1}{N_1} \sum_{i \in I_1} E\left[ \mu_1(X_i) - \frac{1}{M} \sum_{m=1}^{M} \mu_0(X_{i(m)}) \right]. \quad (25)$$

We need that $E[\mu_0(X_{i(m)})] \rightarrow E[\mu_0(X_i)]$. We know that $E[\mu_0(X_{i(m)}) \mid X_i] \rightarrow \mu_0(X_i)$ for all $X_i$. Again using the fact that $\mu_0(x)$ is continuous and bounded, we have that $E[\mu_0(X_{i(m)})] = E\left\{ E[\mu_0(X_{i(m)}) \mid X_i] \right\} \rightarrow E[\mu_0(X_i)]$. Therefore,

$$E[\hat{\tau}] \rightarrow E[\mu_1(X_i) - \mu_0(X_i)], \quad (26)$$

where this expectation is taken according to $f_1(x)$, the density function of the treated units.

A.1.4 Bias-corrected Matching Estimator
We consider the bias-corrected matching estimator using linear least squares on the nearest neighbors to estimate $\mu_0(x)$. This is the procedure used in the teffects command in Stata. Considering, for simplicity, the case with $k = 1$, note that

$$\hat{\tau}^{biasadj} = \hat{\tau} + \frac{1}{N_1 M} \sum_{m=1}^{M} \sum_{i \in I_1} \hat{\beta} \left( X_{i(m)} - X_i \right), \quad (27)$$

where $\hat{\beta} = \frac{\sum_{m=1}^{M} \sum_{i \in I_1} (X_{i(m)} - \bar{X}) Y_{i(m)}}{\sum_{m=1}^{M} \sum_{i \in I_1} (X_{i(m)} - \bar{X})^2}$ and $\bar{X} = \frac{1}{N_1 M} \sum_{m=1}^{M} \sum_{i \in I_1} X_{i(m)}$.

We assume that $Y_i(0) \mid X_i = x$ is uniformly bounded for almost all $x \in \mathbb{X}$ and that $X_i$ is bounded, so there is a $C_1$ such that $|X_{i(m)} - \bar{X}| \leq C_1$. Define $\mathcal{X} = \sum_{m=1}^{M} \sum_{i \in I_1} (X_{i(m)} - \bar{X})^2$. If we have at least two treated observations, then $\exists\, C > 0$ such that $\Pr(\mathcal{X} < C) \rightarrow 0$. Therefore,

$$\Pr\left( |\hat{\beta}| \geq c \right) = \Pr\left( \left| \frac{\sum_{m=1}^{M} \sum_{i \in I_1} (X_{i(m)} - \bar{X}) Y_{i(m)}}{\mathcal{X}} \right| \geq c \right) \leq \Pr\left( \frac{\sum_{m=1}^{M} \sum_{i \in I_1} |X_{i(m)} - \bar{X}|\, |Y_{i(m)}|}{\mathcal{X}} \geq c \right)$$
$$\leq \Pr\left( \frac{C_1 \sum_{m=1}^{M} \sum_{i \in I_1} |Y_{i(m)}|}{\mathcal{X}} \geq c \;\middle|\; \mathcal{X} < C \right) \Pr(\mathcal{X} < C) + \Pr\left( \frac{C_1 \sum_{m=1}^{M} \sum_{i \in I_1} |Y_{i(m)}|}{C} \geq c \;\middle|\; \mathcal{X} \geq C \right) \Pr(\mathcal{X} \geq C).$$

Since $\Pr(\mathcal{X} < C) \rightarrow 0$, the first term converges to zero. Since we assume that $Y_i(0) \mid X_i = x$ is uniformly bounded for almost all $x \in \mathbb{X}$, we can always find $c$ such that the second term is lower than any $\eta > 0$, which implies that $\hat{\beta} = O_p(1)$. Since $X_{i(m)} - X_i = o_p(1)$ for all $i$ and $m$, $\frac{1}{N_1 M} \sum_{m=1}^{M} \sum_{i \in I_1} \hat{\beta} (X_{i(m)} - X_i) = o_p(1)$, so $|\hat{\tau}^{biasadj} - \hat{\tau}| = o_p(1)$.

These assumptions are weaker than the assumptions of Abadie and Imbens (2011). The proof would be easier if we used all control observations to estimate $\mu_0(x)$ using linear least squares; in this case, $\hat{\beta}$ would converge to the population OLS coefficients.

A.1.5 Proof of Proposition 2

Proof. We apply Theorem 3.1 from Canay et al. (2017). We first show that, when $N_0 \rightarrow \infty$, the limiting distribution of $\tilde{S}_{N_0}$ under the null, $\tilde{S}$, is invariant to transformations in $\tilde{G}$. From the proof of Proposition 1, note that $Y_{i(m)} \stackrel{d}{\rightarrow} Y_i(0) \mid X_i$. Therefore, under the null that $Y_i(0) \mid X_i \stackrel{d}{=} Y_i(1) \mid X_i$, we have that $\tilde{S}^j_{N_0,i} \stackrel{d}{\rightarrow} Y_i(0) \mid X_i$ for all $j = 0, \ldots, M$. Moreover, asymptotically, $\tilde{S}^j_{N_0,i}$ is independent across $i$ and $j$, because the probability that two treated units share the same nearest neighbor converges to zero when $N_0 \rightarrow \infty$. Therefore, the asymptotic distribution of $\tilde{S}_{N_0}$ is invariant to transformations in $\tilde{G}$.

We also have that the test statistic function $\tilde{T}(\tilde{S})$ is continuous. Finally, we show that, for two distinct elements $\pi \in \tilde{G}$ and $\pi' \in \tilde{G}$, either $\tilde{T}(\tilde{S}_\pi) = \tilde{T}(\tilde{S}_{\pi'})$ for all possible realizations of $\tilde{S}$, or $\Pr(\tilde{T}(\tilde{S}_\pi) \neq \tilde{T}(\tilde{S}_{\pi'})) = 1$. Suppose $M > 1$. Then, if $\pi$ and $\pi'$ are such that $\pi_i(0) = \pi'_i(0)$ for all $i \in I_1$, then we will have $\tilde{T}(\tilde{S}_\pi) = \tilde{T}(\tilde{S}_{\pi'})$ for all possible realizations of $\tilde{S}$. If $\pi$ and $\pi'$ are such that $\pi_i(0) \neq \pi'_i(0)$ for at least one $i \in I_1$, then the probability that $\tilde{T}(\tilde{S}_\pi) = \tilde{T}(\tilde{S}_{\pi'})$ would be equal to zero, because $\tilde{S}$ is a continuous random variable. For the case $M = 1$, we would have $\tilde{T}(\tilde{S}_\pi) = \tilde{T}(\tilde{S}_{\pi'})$ for all possible realizations of $\tilde{S}$ if $\pi$ and $\pi'$ are such that $\pi_i(0) = \pi'_i(0)$ for all $i$, or $\pi_i(0) = \pi'_i(1)$ for all $i$. Otherwise, $\Pr(\tilde{T}(\tilde{S}_\pi) \neq \tilde{T}(\tilde{S}_{\pi'})) = 1$.

A.1.6 Proof of Proposition 3

Proof. Again, we apply Theorem 3.1 from Canay et al. (2017). We first show that, when $N_0 \rightarrow \infty$, the limiting distribution of $S_{N_0}$ under the null, $S$, is invariant to sign changes. This is true if, asymptotically, $\hat{\tau}_i$ and $\hat{\tau}_j$ are independent for $i \neq j$, and the distribution of $\hat{\tau}_i$ is symmetric around zero. It is not necessary for $\hat{\tau}_i$ to have the same distribution across $i$. From Proposition 1, we know that, under the null, the asymptotic distribution of $\hat{\tau}_i$ conditional on $\{X_i\}_{i \in I_1}$ is given by $\epsilon_i - \frac{1}{M} \sum_{m=1}^{M} \epsilon_m(X_i)$. This distribution is symmetric around zero given the assumption that $Y_i(1) \mid X_i$ and $Y_i(0) \mid X_i$ are symmetric around their means for all $i = 1, \ldots, N_1$. Moreover, Proposition 1 also shows that, asymptotically, $\hat{\tau}_i$ is independent across $i$.

We also have that the test statistic function $T(S)$ is continuous. Finally, we show that, for two distinct elements $g \in G$ and $g' \in G$, either $T(gS) = T(g'S)$ for all possible realizations of $S$, or $\Pr(T(gS) \neq T(g'S)) = 1$. If $g$ and $g'$ are such that $g_i = g'_i$ for all $i$, or $g_i = -g'_i$ for all $i$, then $T(gS) = T(g'S)$ for all possible realizations of $S$. Otherwise, given that $S$ is a continuous random variable, $\Pr(T(gS) \neq T(g'S)) = 1$.

A.2 Appendix Tables and Figures

Table A.1:
Empirical Monte Carlo Simulation: More Covariates

                       M = 1                 M = 4                 M = 10
                N0 = 50   N0 = 500     N0 = 50   N0 = 500     N0 = 50   N0 = 500
                  (1)       (2)          (3)       (4)          (5)       (6)

Panel A: |average bias| (× 100)
N1 = 5           3.181     1.703        3.999     2.280        4.625     2.509
N1 = 10          2.822     1.717        3.776     2.201        4.889     2.951
N1 = 25          3.005     1.744        3.656     2.196        4.538     2.657
N1 = 50          2.657     1.657        3.476     2.138        4.294     2.644

Panel B: mean squared error (× 100)
N1 = 5           3.190     2.679        2.228     1.720        2.282     1.645
N1 = 10          1.682     1.299        1.261     0.887        1.393     0.875
N1 = 25          0.820     0.557        0.718     0.403        0.764     0.389
N1 = 50          0.543     0.299        0.493     0.227        0.588     0.240

Panel C: rejection rates based on AI (2006)
N1 = 5           0.154+    +            +         +            +         +
N1 = 10          0.091+    +            +         +            +         +
N1 = 25          0.073+    +            +         +            +         +
N1 = 50          0.072+    +            +         +            +         +

Panel D: test based on RI, permutations
N1 = 5           0.013     −            −         −
N1 = 10          0.051     0.051        0.047     0.058        0.032     −
N1 = 25          0.064+    −
N1 = 50          0.069+    −            −

Panel E: test based on RI, sign changes
N1 = 5           0.012−    −            −         −            −         −
N1 = 10          0.049     0.051        0.000     −            −         −
N1 = 25          0.064+    −            +         −            −
N1 = 50          0.069+    −            −         −

Note: This table replicates the simulations presented in Table 2, with the difference that we add three additional covariates that are uncorrelated with the potential outcomes.
Table A.2: Empirical Monte Carlo Simulation: Bias-corrected Estimator

                       M = 1                 M = 4                 M = 10
                N0 = 50   N0 = 500     N0 = 50   N0 = 500     N0 = 50   N0 = 500
                  (1)       (2)          (3)       (4)          (5)       (6)

Panel A: |average bias| (× 100)
N1 = 5          13.181     0.141        0.212     0.014        0.495     0.064
N1 = 10          0.042     0.027        0.357     0.062        0.684     0.151
N1 = 25          0.452     0.125        0.452     0.105        0.620     0.255
N1 = 50          0.285     0.007        0.403     0.081        0.614     0.130

Panel B: mean squared error (× 100)
N1 = 5    >