Design-Based Uncertainty for Quasi-Experiments
Ashesh Rambachan† Jonathan Roth‡

August 6, 2020
Abstract
Social scientists are often interested in estimating causal effects in settings where all units in the population are observed (e.g. all 50 US states). Design-based approaches, which view the treatment as the random object of interest, may be more appealing than standard sampling-based approaches in such contexts. This paper develops a design-based theory of uncertainty suitable for quasi-experimental settings, in which the researcher estimates the treatment effect as if treatment was randomly assigned, but in reality treatment probabilities may depend in unknown ways on the potential outcomes. We first study the properties of the simple difference-in-means (SDIM) estimator. The SDIM is unbiased for a finite-population design-based analog to the average treatment effect on the treated (ATT) if treatment probabilities are uncorrelated with the potential outcomes in a finite-population sense. We further derive expressions for the variance of the SDIM estimator and a central limit theorem under sequences of finite populations with growing sample size. We then show how our results can be applied to analyze the distribution and estimand of difference-in-differences (DiD) and two-stage least squares (2SLS) from a design-based perspective when treatment is not completely randomly assigned.

∗ We thank Isaiah Andrews, Iavor Bojinov, Peng Ding, Pedro Sant’Anna, Yotam Shem-Tov, and Neil Shephard for helpful comments and suggestions. Rambachan gratefully acknowledges support from the NSF Graduate Research Fellowship under Grant DGE1745303.
† Harvard University, Department of Economics. Email: [email protected]
‡ Microsoft Research. Email: [email protected]
Introduction
Standard econometric analyses of causal effects typically view the data obtained by the econometrician as a random sample from a larger superpopulation. This sampling-based view may be unnatural in economic contexts where the entire population of interest is observed. For example, applied researchers are often interested in the causal effect of state-level policies when outcomes for all 50 US states are observed (Manski and Pepper, 2018). Similar difficulties arise when the researcher has access to large-scale administrative data for the entire population of interest. In these settings, it may be more attractive to view uncertainty as purely design-based, i.e. arising due to the stochastic nature of the treatment assignment for a finite population. A celebrated literature in statistics, dating to at least Neyman (1923) and Fisher (1935), has analyzed randomized experiments from such a design-based perspective. This finite population view has received recent attention in the econometrics literature, e.g. from Abadie et al. (2017, 2020).

However, there remains a gap between the typical assumptions used in existing finite population causal analyses and many leading empirical settings in which a finite population perspective is conceptually attractive. Typically, finite population analyses of causal effects assume that the observable data were generated from a randomized experiment, in which the treatment is randomly assigned to units through an assignment mechanism with known probabilities (e.g., Imbens and Rubin (2015), Aronow and Middleton (2015), Middleton (2018), Savje and Delevoye (2020) among others). In contrast, social scientists often employ “quasi-experimental” methods, in which the data are analyzed as if treatment were randomly assigned, but random assignment is not guaranteed by design. The probability of treatment assignment is therefore not known to the researcher.
In such settings, it is desirable to understand the properties of quasi-experimental estimators if in fact the data-generating process differs from random assignment.

Existing analyses of quasi-experimental estimators, such as simple difference-in-means (SDIM), difference-in-differences (DiD), and two-stage least squares (2SLS), often adopt a sampling-based view and consider the limiting distribution of the estimator in settings where treatment is not independent of potential outcomes. It is typically possible to obtain asymptotically valid causal estimation and inference under orthogonality conditions that are weaker than strict independence between the treatment (or instrument) and potential outcomes. However, the interpretation of the causal estimand differs under these weaker assumptions: for example, it may be an average treatment effect on the treated (ATT) or a local average treatment effect (LATE), rather than an average treatment effect (ATE). Given the attractiveness of the design-based approach for many quasi-experimental settings, it is useful to understand from the design-based perspective whether it is possible to obtain valid inference on an interpretable causal parameter when randomization fails.

To bridge these gaps, we study the estimation and inference of treatment effects in a finite population setting where the probability of treatment assignment varies arbitrarily across units. We analyze a treatment assignment mechanism that allows each unit to have an idiosyncratic probability $p_i$ of receiving a binary treatment. The idiosyncratic probability $p_i$ may depend arbitrarily on $i$'s potential outcomes $(Y_i(1), Y_i(0))$. In this sense, our model allows for the possibility that the “quasi-experimental” research design may not, in fact, mimic random assignment.
We study the properties of three popular quasi-experimental estimators (SDIM, DiD, and 2SLS) under this assignment mechanism from a purely design-based perspective.

We begin with an analysis of the simple difference-in-means estimator (SDIM) in Section 3. We first establish a finite-population analog to the omitted variable bias formula, which decomposes the expectation of the SDIM into two terms: (i) a finite-population design-based analog to the average treatment effect on the treated (ATT), and (ii) a bias term equal to the finite-population covariance between the unit-specific treatment probabilities and their untreated potential outcomes. We then derive the finite population asymptotic distribution of the SDIM as the size of the population grows large. We derive intuitive formulas for the asymptotic variance of the SDIM statistic, as well as a central limit theorem under appropriate regularity conditions. As in the standard completely randomized experiment, the usual variance estimate is consistent for an upper bound on the variance of the estimator. An interesting feature of our setting is that the standard variance estimator may be conservative even under constant treatment effects if treatment probabilities differ across units. Thus, standard confidence intervals deliver asymptotically conservative inference for the finite-population ATT when the unit-specific treatment probabilities are orthogonal to the potential outcomes.

In Section 4, we extend the results for the SDIM to difference-in-differences (DiD). We show that the DiD estimator is unbiased for the finite population ATT under a finite-population analogue to the well-known “parallel trends” assumption in the sampling-based literature (e.g., see Chapter 5 of Angrist and Pischke (2009)). Our results thus help bridge the gap between the sampling-based literature on DiD and recent work by Athey and Imbens (2018), who study DiD from a design-based perspective but assume completely random treatment timing.
As with the SDIM, we show that widely used cluster-robust standard errors […] $Z_i$ and binary treatment $D_i$. The stochastic nature of the data now arises due to the assignment of the instrument $Z_i$, holding fixed the potential outcomes $Y(d)$ and the potential treatments $D(z)$, as in Kang et al. (2018). We provide an intuitive expression for the estimand of 2SLS allowing for an arbitrary relationship between the probability that $Z_i = 1$ and the potential outcomes. Our results thus provide a bridge between recent work by Kang et al. (2018), who study instrumental variables models from a design-based perspective in which the instrument is completely randomly assigned, and sampling-based models of sensitivity analysis for IV (e.g. Conley et al. (2010)). When the instrument is completely random, our expression reduces to the well-known result that the estimand of 2SLS is a local average treatment effect (LATE) (Angrist and Imbens, 1994; Angrist et al., 1996). We generalize this result, showing that the 2SLS estimand also has an interesting causal interpretation from a design-based perspective under the weaker condition that the probability that $Z_i = 1$ has zero finite population covariance with both $D_i(0)$ and $Y_i(D_i(0))$. Under this condition, the 2SLS estimand is a weighted average of the causal effects for compliers, where the weights are equal to the unit-specific probabilities of receiving $Z_i = 1$. This parameter can be interpreted as an instrument-propensity reweighted local average treatment effect.

Footnote: Concretely, we analyze the asymptotic distribution of the SDIM along a sequence of finite populations in which both the size of the population and the number of treated units grow large. Similar finite population asymptotics have been considered in the context of randomized experiments (Li and Ding, 2017; Abadie et al., 2017, 2020).
As with the previously discussed estimators, standard inference methods yield asymptotically conservative inference for this estimand under “strong instrument” asymptotics.

Consider a finite population of $N$ units. Let $D_i$ denote a binary indicator for whether unit $i$ adopts a treatment of interest. Units are associated with potential outcomes $Y_i(1)$, $Y_i(0)$, under treatment and control respectively, and the observed outcome equals $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$. Throughout the paper, the potential outcomes are treated as fixed (or conditioned on), and the stochastic nature of the data arises only due to the random assignment or adoption of treatment.

Each unit independently adopts the treatment with idiosyncratic probability $p_i$. We allow for $p_i$ to be arbitrarily related to the potential outcomes with $p_i = g(Y_i(1), Y_i(0), W_i)$, where $g$ is an unknown link function that maps $(Y_i(1), Y_i(0))$ and some other (possibly unobserved) $i$-level pre-treatment covariates $W_i$ into the unit interval. Since the researcher neither observes the pair of potential outcomes nor knows the link function $g$, the unit-specific treatment probabilities $p_i$ are unknown to the researcher. For example, such unit-specific treatment probabilities may arise if units decide whether to adopt the treatment based on a choice model in which each unit’s adoption decision depends on its potential outcomes, pre-treatment covariates and idiosyncratic taste or information shocks $\nu_i$ (e.g., see Heckman and Vytlacil (2006) among many others). In this view, the randomness in treatment adoption in our model arises from the randomness in the idiosyncratic shocks $\nu_i$ conditional on the potential outcomes and pre-treatment covariates.

Example 1.
The Tax Cuts and Jobs Act of 2017 allowed for US census tracts meeting certain criteria to receive tax benefits if they were designated by the governor of their state as “Opportunity Zones.” Suppose we are interested in the effect of an eligible census tract being designated as an Opportunity Zone ($D$) on housing price growth ($Y$), as in Chen et al. (2019). Since housing price growth is observed for all eligible census tracts, it is attractive to think of the randomness in the data as coming from the choice of which tracts to designate as Opportunity Zones, rather than from drawing the observed sample from a superpopulation of census tracts. Owing to the vagaries of the political process, it is plausible that the choice of which of the eligible census tracts to designate as Opportunity Zones is as-if randomly assigned. For instance, the choice of which tracts to designate may depend on arbitrary factors such as the order in which briefings about tracts were presented ($\nu_i$) that are unrelated to the potential outcomes. It therefore may be sensible to estimate the causal effect of the policy by comparing outcomes for designated and non-designated census tracts as if it were a randomized experiment. Nevertheless, we may still worry that, in addition to the aforementioned idiosyncratic factors, the probability a particular tract is designated as an Opportunity Zone depends on the benefit of treatment ($Y_i(1) - Y_i(0)$) and other fixed features of the tract such as its partisan lean ($W_i$). It is therefore instructive to analyze the properties of quasi-experimental estimators if we view the uncertainty in the data as coming from the idiosyncratic factors $\nu_i$ but allow the probability of treatment to depend arbitrarily on the other fixed factors that affect treatment choice, $p_i = g(Y_i(1), Y_i(0), W_i)$.

Following the literature on completely randomized experiments (e.g.
Imbens and Rubin (2015)), we condition on the number of treated and control units, $N_1 := \sum_i D_i$ and $N_0 := N - N_1$ respectively. It is straightforward to derive the distribution of treatment assignments $D = (D_1, ..., D_N)$ conditional on $N_1$ and $N_0$:

$$P\Big(D = d \,\Big|\, \sum_i D_i = N_1\Big) = C \prod_i p_i^{d_i} (1 - p_i)^{1 - d_i} \tag{1}$$

for all $d \in \{0, 1\}^N$ such that $\sum_i d_i = N_1$, and zero otherwise. We refer to this as a Poisson rejective assignment mechanism, since it parallels what Hajek (1964) refers to as Poisson rejective sampling, in which units are sampled from a finite population only if $D_i = 1$ and $D$ has the distribution given in (1).

Footnote: This follows from the fact that $P(D = d \mid \sum_i D_i = N_1) = P(D = d \wedge \sum_i D_i = N_1) / P(\sum_i D_i = N_1)$, so that the constant in (1) is $C = 1 / P(\sum_i D_i = N_1)$.

As notation, define the marginal assignment probability as $\pi_i := P(D_i = 1 \mid \sum_i D_i = N_1)$. Additionally, for non-stochastic weights $w_i$ and a non-stochastic attribute $X_i$ (such as a potential outcome), define

$$E_w[X_i] := \frac{1}{\sum_i w_i} \sum_i w_i X_i \quad \text{and} \quad Var_w[X_i] := \frac{1}{\sum_i w_i} \sum_i w_i (X_i - E_w[X_i])^2$$

to be the finite-population weighted expectation and variance respectively. Analogously, define $Cov_w[X_i, Y_i] = E_w[(X_i - E_w[X_i])(Y_i - E_w[Y_i])]$. A subscript of 1 (as in $Cov_1$ and $Var_1$) denotes uniform weights $w_i = 1$. We denote by $E_R[\cdot] = E[\cdot \mid \sum_i D_i = N_1]$ the expectation with respect to the randomization distribution for the treatment assignment $D$, conditional on the number of treated units. The operators $V_R[\cdot]$ and $Cov_R[\cdot, \cdot]$ are defined analogously as the variance and covariance respectively over the randomization distribution for the treatment assignment $D$, conditional on the number of treated units.

We begin by analyzing the properties of the simple difference in means (SDIM) estimator,

$$\hat\tau := \frac{1}{N_1} \sum_i D_i Y_i - \frac{1}{N_0} \sum_i (1 - D_i) Y_i. \tag{2}$$

Our results are thus relevant for quasi-experimental settings where the researcher compares the treated and untreated units as if they were randomly assigned, but may be concerned that in fact treatment probabilities were related to potential outcomes.

We first turn our attention to the expectation of $\hat\tau$ under the treatment assignment mechanism (1). Observe that

$$\begin{aligned} E_R[\hat\tau] &= \frac{1}{N_1} \sum_i \pi_i \underbrace{(Y_i(0) + \tau_i)}_{= Y_i(1)} - \frac{1}{N_0} \sum_i (1 - \pi_i) Y_i(0) \\ &= \underbrace{\frac{1}{N_1} \sum_i \pi_i \tau_i}_{= \tau_{ATT}} + \frac{N}{N_0} \frac{N}{N_1} \underbrace{\left( \frac{1}{N} \sum_i \left( \pi_i - \frac{N_1}{N} \right) Y_i(0) \right)}_{= Cov_1[\pi_i, Y_i(0)]}, \end{aligned} \tag{3}$$

where $\tau_i = Y_i(1) - Y_i(0)$ is unit $i$'s causal effect.
The first term in the previous display is a weighted average of the unit-specific causal effects, where the weights are proportional to the unit-specific treatment probabilities. We interpret this object as a finite-population analogue to the average treatment effect on the treated since

$$\frac{1}{N_1} \sum_i \pi_i \tau_i = E_R\left[ \frac{1}{N_1} \sum_i D_i \tau_i \right] =: \tau_{ATT}. \tag{4}$$

$\tau_{ATT}$ is the expected value of what Imbens (2004) and Sekhon and Shem-Tov (2020) refer to as the sample average treatment effect on the treated (SATT), where the expectation is taken over the stochastic realization of which units are treated. The second term in (3) is the SDIM’s bias for $\tau_{ATT}$ and equals a constant times the finite-population covariance between the treatment probabilities $\pi_i$ and the untreated potential outcomes $Y_i(0)$. The bias is zero if all units are treated with the same probability (i.e. $\pi_i = N_1 / N$ for all $i$), and furthermore under this condition $\tau_{ATT}$ reduces to the average treatment effect.

This characterization of the bias of the SDIM estimator suggests that researchers may conduct sensitivity analysis under different assumptions about the finite-population covariance between the treatment probabilities and the untreated potential outcomes, i.e., report the range of possible values for $\hat\tau - \frac{N}{N_0} \frac{N}{N_1} Cov_1[\pi_i, Y_i(0)]$ under different assumptions about the possible magnitudes of $Cov_1[\pi_i, Y_i(0)]$, where the subscript 1 denotes uniform weights. Such a sensitivity analysis is related to, but different from, existing design-based sensitivity analyses developed in, for example, Rosenbaum (1987), Chapter 4 of Rosenbaum (2002), Rosenbaum (2005) among many others. The approach in those papers places bounds on the relative odds ratio of treatment between two units (i.e., $\frac{\pi_i (1 - \pi_j)}{\pi_j (1 - \pi_i)}$ for $i \neq j$) and examines the extent to which the relative odds ratio must vary across units such that we may no longer reject a particular sharp (Fisher) null of interest.
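The decomposition in (3) can be checked exactly on a small population by enumerating every assignment with $N_1$ treated units. The following is a minimal sketch, not the authors' code: the six-unit population (`p`, `y0`, `y1`) and the helper names `rejective_distribution` and `sdim` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import prod

def rejective_distribution(p, n1):
    """Exact P(D = d | sum(D) = n1) from equation (1), keyed by the treated set."""
    n = len(p)
    weights = {}
    for treated in combinations(range(n), n1):
        t = set(treated)
        # Unnormalized weight: product of p_i for treated units, (1 - p_i) otherwise.
        weights[treated] = prod(p[i] if i in t else 1.0 - p[i] for i in range(n))
    total = sum(weights.values())  # equals 1 / C in equation (1)
    return {k: w / total for k, w in weights.items()}

def sdim(treated, y1, y0):
    """Simple difference in means, equation (2), for a given treated set."""
    t = set(treated)
    n = len(y1)
    mean_treated = sum(y1[i] for i in t) / len(t)
    mean_control = sum(y0[i] for i in range(n) if i not in t) / (n - len(t))
    return mean_treated - mean_control

# Hypothetical population: p_i is made to rise with Y_i(0), so bias is expected.
p = [0.2, 0.5, 0.7, 0.4, 0.9, 0.3]
y0 = [1.0, 2.0, 4.0, 0.5, 5.0, 1.5]
y1 = [2.0, 2.5, 6.0, 0.5, 8.0, 2.0]
n, n1 = len(p), 3
n0 = n - n1

dist = rejective_distribution(p, n1)
# Marginal assignment probabilities pi_i = P(D_i = 1 | sum(D) = n1).
pi = [sum(pr for tr, pr in dist.items() if i in tr) for i in range(n)]

expected_sdim = sum(pr * sdim(tr, y1, y0) for tr, pr in dist.items())
tau_att = sum(pi[i] * (y1[i] - y0[i]) for i in range(n)) / n1
cov_pi_y0 = sum((pi[i] - n1 / n) * y0[i] for i in range(n)) / n
bias = (n / n0) * (n / n1) * cov_pi_y0

print(expected_sdim, tau_att + bias)  # the two sides of equation (3)
```

Because the enumeration is exact, the two printed numbers should agree up to floating-point error; the approach is only practical for small $N$, since the number of assignments grows combinatorially.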
In contrast, we focus on examining how the bias of the SDIM estimator for a particular weighted average treatment effect varies with the finite population covariance between treatment probabilities and untreated potential outcomes.

Equation (3) may also be interpreted as a finite population version of the omitted variables bias formula for regression analyses. Defining the errors $\varepsilon_i^Y = Y_i(0) - E_{1-\pi}[Y_i(0)]$ and $\varepsilon_i^\tau = \tau_i - \tau_{ATT}$, we may rewrite the observed outcome for unit $i$ as

$$Y_i = \beta_0 + D_i \tau_{ATT} + u_i, \tag{5}$$

where $\beta_0 = E_{1-\pi}[Y_i(0)]$ and $u_i = \varepsilon_i^Y + D_i \varepsilon_i^\tau$. One can show that the expression derived above for $E_R[\hat\tau - \tau_{ATT}]$ is equivalent to $E_R\left[ \frac{Cov_1[D_i, u_i]}{Var_1[D_i]} \right]$, where the subscript 1 denotes uniform weights, which in light of equation (5) coincides with the omitted variable bias formula for the coefficient on $D_i$ in an OLS regression of $Y_i$ on $D_i$.

We now turn our attention to the variance and distribution of $\hat\tau$. The exact finite-sample variance and distribution functions are complicated functions of the $p_i$, and we therefore rely on a triangular array asymptotic approximation using a sequence of finite populations where the number of units grows large, in the spirit of Freedman (2008b,a), Lin (2013), and Li and Ding (2017). We consider sequences of populations indexed by $m$ of size $N_m$, with $N_{1m}$ treated units, potential outcomes $\{Y_{im}(d) : d = 0, 1;\ i = 1, ..., N_m\}$, and assignment weights $p_{1m}, ..., p_{N_m m}$. For brevity, we leave the subscript $m$ implicit in our notation; all limits are implicitly taken as $m \to \infty$. Our results will provide an approximation to the properties of $\hat\tau$ for finite populations with a sufficiently large number of units.

To analyze its distribution, note that $\hat\tau$ may be re-written as

$$\hat\tau = \sum_i \frac{D_i}{\pi_i} \tilde{Y}_i - \frac{1}{N_0} \sum_i Y_i(0), \tag{6}$$

where $\tilde{Y}_i := \pi_i \left( \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0) \right)$. The second term on the right-hand side of the previous display is non-stochastic.
The first term, on the other hand, can be viewed as a Horvitz-Thompson estimator for the total $\sum_{i=1}^{N} \pi_i \left( \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0) \right)$ under what Hajek (1964) refers to as Poisson rejective sampling. We can therefore make use of results from Hajek (1964) to obtain its asymptotic distribution under a sequence of finite populations as described above.

To obtain the asymptotic variance of $\hat\tau$, we impose the following assumption on the sequence of populations.

Assumption 1.
The sequence of populations satisfies $\sum_{i=1}^{N} \pi_i (1 - \pi_i) \to \infty$.

Note that $\pi_i (1 - \pi_i)$ is the variance of the Bernoulli random variable $D_i$, so Assumption 1 implies that the sum of the variances of the $D_i$ grows large. Assumption 1 also implies that both $N_1$ and $N_0$ go to infinity, since $\sum_{i=1}^{N} \pi_i (1 - \pi_i) \leq \min\{\sum_i \pi_i, \sum_i (1 - \pi_i)\} = \min\{N_1, N_0\}$. Note that Assumption 1 is trivially satisfied under the familiar overlap condition (i.e., $\pi_i \in [\eta, 1 - \eta]$ for some $\eta > 0$). However, overlap for all units is not necessary for Assumption 1 to hold, and indeed Assumption 1 allows for $\pi_i = 0$ or $\pi_i = 1$ for some units.

Lemma 3.1. Under Assumption 1,

$$V_R[\hat\tau] [1 + o(1)] = \frac{\frac{1}{N} \sum_{k=1}^{N} \pi_k (1 - \pi_k)}{\frac{N_0}{N} \frac{N_1}{N}} \left[ \frac{1}{N_1} Var_{\tilde\pi}[Y_i(1)] + \frac{1}{N_0} Var_{\tilde\pi}[Y_i(0)] - \frac{1}{N} Var_{\tilde\pi}[\tau_i] \right], \tag{7}$$

where $o(1) \to 0$ and the weights are given by $\tilde\pi_i = \pi_i (1 - \pi_i)$.

Proof. Since $\hat\tau$ can be represented as a Horvitz-Thompson estimator under Poisson rejective sampling, Theorem 6.1 in Hajek (1964) implies that

$$V_R[\hat\tau] [1 + o(1)] = \left[ \sum_{k=1}^{N} \pi_k (1 - \pi_k) \right] Var_{\tilde\pi}\left[ \tilde{Y}_i \right], \tag{8}$$

where in this display and the next $\tilde{Y}_i$ denotes $\frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0)$, i.e. the variable appearing in (6) divided by $\pi_i$. Standard decomposition arguments for completely randomized experiments (e.g. Imbens and Rubin (2015)), modified to replace unweighted variances with weighted variances, yield that

$$Var_{\tilde\pi}\left[ \tilde{Y}_i \right] = \frac{N}{N_1 N_0} \left( \frac{1}{N_1} Var_{\tilde\pi}[Y_i(1)] + \frac{1}{N_0} Var_{\tilde\pi}[Y_i(0)] - \frac{1}{N} Var_{\tilde\pi}[\tau_i] \right),$$

which together with the previous display yields the desired result.

Lemma 3.1 shows that the asymptotic variance of $\hat\tau$ depends on the weighted variance of the treated and untreated potential outcomes and treatment effects, where unit $i$ is weighted proportionally to the variance of their treatment status $V_R[D_i] = \pi_i (1 - \pi_i)$. The leading constant term is less than or equal to one by Jensen’s inequality, with equality when $\pi_i$ is constant across units.
Thus, in the special case of a completely random experiment, the formula in Lemma 3.1 reduces to $(1 + o(1)) \left( \frac{1}{N_1} Var_1[Y_i(1)] + \frac{1}{N_0} Var_1[Y_i(0)] - \frac{1}{N} Var_1[\tau_i] \right)$, where the subscript 1 denotes uniform weights. This mimics the familiar formula for completely randomized experiments up to a degrees-of-freedom correction.

We next provide an upper bound for the asymptotic variance derived in Lemma 3.1. We will later provide regularity conditions under which the standard variance estimator is asymptotically consistent for this upper bound.
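The role of the leading constant in Lemma 3.1 can be illustrated numerically. The sketch below is an illustration only: the function name `leading_constant` and the two probability vectors are hypothetical, and the marginal probabilities are taken as given inputs that sum to $N_1$ rather than derived from the $p_i$.

```python
def leading_constant(pi):
    """Leading constant in equation (7): mean of pi_k (1 - pi_k) over (N0/N)(N1/N)."""
    n = len(pi)
    n1 = sum(pi)          # marginal probabilities sum to the number of treated units
    n0 = n - n1
    mean_bernoulli_var = sum(q * (1.0 - q) for q in pi) / n
    return mean_bernoulli_var / ((n0 / n) * (n1 / n))

# Hypothetical marginal treatment probabilities, each vector summing to N1 = 3.
uniform = [0.5] * 6
uneven = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]

print(leading_constant(uniform))  # constant probabilities: Jensen holds with equality
print(leading_constant(uneven))   # dispersed probabilities: constant falls below one
```

Since $q \mapsto q(1-q)$ is concave, spreading the $\pi_i$ away from their mean $N_1/N$ can only lower the constant, which is the Jensen argument referred to in the text.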
Lemma 3.2.
Under Assumption 1, the right-hand side of (7) is bounded above by

$$\frac{1}{N_1} Var_{\pi}[Y_i(1)] + \frac{1}{N_0} Var_{1-\pi}[Y_i(0)], \tag{9}$$

and the bound holds with equality if and only if

$$E_{\tilde\pi}\left[ \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0) \right] = \frac{1}{N_1} E_{\pi}[Y_i(1)] + \frac{1}{N_0} E_{1-\pi}[Y_i(0)]$$

and

$$\frac{\pi_i}{N_1 / N} Y_i(1) - \frac{1 - \pi_i}{N_0 / N} Y_i(0) = \frac{\pi_i}{N_1 / N} E_{\pi}[Y_i(1)] - \frac{1 - \pi_i}{N_0 / N} E_{1-\pi}[Y_i(0)] \quad \text{for all } i.$$

Footnote (to the completely randomized special case of Lemma 3.1): The $1 + o(1)$ correction is needed here because $Var_1[Y_i(d)] = \frac{1}{N} \sum_i (Y_i(d) - E_1[Y_i(d)])^2$, which differs from the usual finite population variance by the degrees-of-freedom correction factor $\frac{N}{N-1}$.

Proof.
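From (8), we see that the right-hand side of (7) is equivalent to

$$\sum_{i=1}^{N} \pi_i (1 - \pi_i) \left( \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0) - E_{\tilde\pi}\left[ \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0) \right] \right)^2.$$

Since for any $X$, $E_{\tilde\pi}[X] = \arg\min_{\mu} \sum_{i=1}^{N} \pi_i (1 - \pi_i)(X_i - \mu)^2$, it follows that this is bounded above by

$$\sum_{i=1}^{N} \pi_i (1 - \pi_i) \left( \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0) - \left( E_{\pi}\left[ \frac{1}{N_1} Y_i(1) \right] + E_{1-\pi}\left[ \frac{1}{N_0} Y_i(0) \right] \right) \right)^2, \tag{10}$$

and this bound holds with equality if and only if $E_{\tilde\pi}\left[ \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0) \right] = \frac{1}{N_1} E_{\pi}[Y_i(1)] + \frac{1}{N_0} E_{1-\pi}[Y_i(0)]$. Let $\dot{Y}_i(1) = Y_i(1) - E_{\pi}[Y_i(1)]$ and $\dot{Y}_i(0) = Y_i(0) - E_{1-\pi}[Y_i(0)]$. Then the expression in (10) can be written as

$$\begin{aligned} \sum_{i=1}^{N} \pi_i (1 - \pi_i) \left( \frac{1}{N_1} \dot{Y}_i(1) + \frac{1}{N_0} \dot{Y}_i(0) \right)^2 &= \Bigg[ \frac{1}{N_1^2} \sum_{i=1}^{N} \pi_i \dot{Y}_i(1)^2 + \frac{1}{N_0^2} \sum_{i=1}^{N} (1 - \pi_i) \dot{Y}_i(0)^2 - \frac{1}{N_1^2} \sum_{i=1}^{N} \pi_i^2 \dot{Y}_i(1)^2 \\ &\qquad - \frac{1}{N_0^2} \sum_{i=1}^{N} (1 - \pi_i)^2 \dot{Y}_i(0)^2 + \frac{2}{N_1 N_0} \sum_{i=1}^{N} \pi_i (1 - \pi_i) \dot{Y}_i(1) \dot{Y}_i(0) \Bigg] \\ &= \left[ \frac{1}{N_1} Var_{\pi}[Y_i(1)] + \frac{1}{N_0} Var_{1-\pi}[Y_i(0)] - \frac{1}{N^2} \sum_{i=1}^{N} \left( \frac{\pi_i}{N_1 / N} \dot{Y}_i(1) - \frac{1 - \pi_i}{N_0 / N} \dot{Y}_i(0) \right)^2 \right], \end{aligned}$$

from which the result is immediate.

Corollary 3.1.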
If treatment effects are constant, $Y_i(1) = \tau + Y_i(0)$ for all $i$, and $E_R[\hat\tau] = \tau$, then the bound in Lemma 3.2 holds with equality only if $\pi_i = \frac{N_1}{N}$ for all $i$ such that $Y_i(0) \neq E_{\tilde\pi}[Y_i(0)]$.

Proof. The two conditions for equality in Lemma 3.2 together with the assumption that $Y_i(1) = \tau + Y_i(0)$ imply that

$$E_R[\hat\tau] - \tau = N \left( \pi_i - \frac{N_1}{N} \right) \left( \frac{1}{N_1} + \frac{1}{N_0} \right) \left( Y_i(0) - E_{\tilde\pi}[Y_i(0)] \right) \quad \text{for all } i,$$

from which the result follows immediately, since the left-hand side is zero.

We thus see that under constant treatment effects, if $\hat\tau$ is unbiased then the asymptotic variance of $\hat\tau$ will be strictly lower than the upper bound when treatment probabilities are not uniform (unless the treatment probabilities differ from uniformity only for a set of units for which $Y_i(0) = E_{\tilde\pi}[Y_i(0)]$).

Remark 1.
It is straightforward to show that if $\pi_i = \frac{N_1}{N}$ for all $i$, then the bound in Lemma 3.2 holds with equality if and only if treatment effects are constant, which is a standard result for completely randomized experiments. When $\pi_i \neq \frac{N_1}{N}$, Lemma 3.2 implies that the bound holds with equality only in knife-edge cases.

Next, we provide a regularity condition under which the standard variance estimator is consistent for the upper bound on the asymptotic variance of $\hat\tau$ given in (9). Let

$$\hat{s}^2 = \frac{1}{N_1} \hat{s}_1^2 + \frac{1}{N_0} \hat{s}_0^2,$$

where

$$\hat{s}_1^2 := \frac{1}{N_1} \sum_i D_i (Y_i - \bar{Y}_1)^2, \qquad \hat{s}_0^2 := \frac{1}{N_0} \sum_i (1 - D_i)(Y_i - \bar{Y}_0)^2,$$

and $\bar{Y}_1 := \frac{1}{N_1} \sum_i D_i Y_i$, $\bar{Y}_0 := \frac{1}{N_0} \sum_i (1 - D_i) Y_i$. The following assumption and consistency result generalize those in Li and Ding (2017) for the case of completely randomized assignment.
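The upper bound in Lemma 3.2, which is the quantity the estimator $\hat{s}^2$ targets, can be compared numerically with the right-hand side of (7). The sketch below is illustrative only: the population values are hypothetical, the marginal probabilities `pi` are taken as given inputs summing to $N_1$, and the helper names are not from the paper.

```python
def wmean(w, x):
    """Finite-population weighted expectation E_w[x]."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def wvar(w, x):
    """Finite-population weighted variance Var_w[x]."""
    m = wmean(w, x)
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sum(w)

def asymptotic_variance(pi, y1, y0):
    """Right-hand side of equation (7)."""
    n = len(pi)
    n1 = sum(pi)
    n0 = n - n1
    wt = [q * (1.0 - q) for q in pi]               # weights pi_i (1 - pi_i)
    tau = [a - b for a, b in zip(y1, y0)]           # unit-level effects
    lead = (sum(wt) / n) / ((n0 / n) * (n1 / n))
    return lead * (wvar(wt, y1) / n1 + wvar(wt, y0) / n0 - wvar(wt, tau) / n)

def upper_bound(pi, y1, y0):
    """Expression (9): the bound targeted by the usual variance estimator."""
    n1 = sum(pi)
    n0 = len(pi) - n1
    return wvar(pi, y1) / n1 + wvar([1.0 - q for q in pi], y0) / n0

# Hypothetical population with non-uniform marginal probabilities (sum = N1 = 3).
pi = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
y0 = [1.0, 2.0, 4.0, 0.5, 5.0, 1.5]
y1 = [2.0, 2.5, 6.0, 0.5, 8.0, 2.0]

v = asymptotic_variance(pi, y1, y0)
b = upper_bound(pi, y1, y0)
print(v, b)  # Lemma 3.2 implies v <= b
```

The same helpers can be used to confirm that (7) and (8) agree, by comparing `v` with the sum of the weights times the weighted variance of $\frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0)$.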
Assumption 2.
Define $m_N(1) := \max_{1 \leq i \leq N} (Y_i(1) - E_{\pi}[Y_i(1)])^2$, and analogously $m_N(0) := \max_{1 \leq i \leq N} (Y_i(0) - E_{1-\pi}[Y_i(0)])^2$. Assume that

$$\frac{1}{N_1} \frac{m_N(1)}{Var_{\pi}[Y_i(1)]} \to 0 \quad \text{and} \quad \frac{1}{N_0} \frac{m_N(0)}{Var_{1-\pi}[Y_i(0)]} \to 0.$$

Lemma 3.3.
Under Assumptions 1 and 2,

$$\frac{\hat{s}^2}{\frac{1}{N_1} Var_{\pi}[Y_i(1)] + \frac{1}{N_0} Var_{1-\pi}[Y_i(0)]} \xrightarrow{p} 1.$$

Proof. See Appendix.
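For concreteness, the SDIM in (2) and the variance estimator $\hat{s}^2$ can be computed from observed data as follows. This is a minimal sketch with hypothetical data and a hypothetical function name `sdim_with_variance`; the 1.96 critical value reflects the normal approximation discussed in the text, and the resulting interval is conservative rather than exact.

```python
from math import sqrt

def sdim_with_variance(d, y):
    """Return (tau_hat, s2_hat) for binary treatments d and observed outcomes y."""
    n1 = sum(d)
    n0 = len(d) - n1
    ybar1 = sum(di * yi for di, yi in zip(d, y)) / n1
    ybar0 = sum((1 - di) * yi for di, yi in zip(d, y)) / n0
    # Within-group variances divide by the group size, as in the text
    # (no degrees-of-freedom correction).
    s1 = sum(di * (yi - ybar1) ** 2 for di, yi in zip(d, y)) / n1
    s0 = sum((1 - di) * (yi - ybar0) ** 2 for di, yi in zip(d, y)) / n0
    return ybar1 - ybar0, s1 / n1 + s0 / n0

# Hypothetical observed data: three treated units and three control units.
d = [1, 1, 1, 0, 0, 0]
y = [2.0, 4.0, 6.0, 1.0, 2.0, 3.0]

tau_hat, s2_hat = sdim_with_variance(d, y)
half_width = 1.96 * sqrt(s2_hat)
ci = (tau_hat - half_width, tau_hat + half_width)
print(tau_hat, s2_hat, ci)
```

With six units the asymptotic approximation is of course not credible; the example is only meant to make the formulas concrete.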
Finally, we introduce an assumption that allows us to obtain a central limit theorem for the SDIM $\hat\tau$.

Assumption 3. Let $\tilde{Y}_i = \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0)$, and assume $\sigma_{\tilde\pi}^2 = Var_{\tilde\pi}\left[ \tilde{Y}_i \right] > 0$. Suppose that for all $\epsilon > 0$,

$$\frac{1}{\sigma_{\tilde\pi}^2} E_{\tilde\pi}\left[ \left( \tilde{Y}_i - E_{\tilde\pi}\left[ \tilde{Y}_i \right] \right)^2 \mathbb{1}\left[ \left| \tilde{Y}_i - E_{\tilde\pi}\left[ \tilde{Y}_i \right] \right| \geq \sqrt{\sum_i \pi_i (1 - \pi_i)} \cdot \sigma_{\tilde\pi} \epsilon \right] \right] \to 0.$$

Assumption 3 is similar to the Lindeberg condition for the standard Lindeberg-Feller central limit theorem, and imposes that the weighted finite-population variance of $\tilde{Y}_i$ is not dominated by a small number of observations. Viewing $\hat\tau$ as a Horvitz-Thompson estimator under Poisson rejective sampling in light of (6), the following result follows immediately from Theorem 1 in Berger (1998), which is based on Hajek (1964).

Lemma 3.4.
Suppose Assumptions 1 and 3 hold. Then,

$$\frac{\hat\tau - E_R[\hat\tau]}{\sqrt{V_R[\hat\tau]}} \xrightarrow{d} N(0, 1).$$

Footnote: Berger (1998) gives the result using the actual inclusion probabilities $\pi_i$, whereas Hajek (1964) states a similar result where the Horvitz-Thompson estimator uses an approximation to the $\pi_i$ in terms of the $p_i$.

The results for scalar outcomes $Y_i$ extend easily to the multiple outcome case with $Y_i \in \mathbb{R}^K$. This is relevant when we observe multiple outcome measures in a cross-section, or we observe the same outcome measure for multiple periods (or both). We use the extension to multiple outcomes in our finite population analysis of difference-in-differences and instrumental variables settings later in the paper.

We extend our notation from the scalar case, so that $Y_i \in \mathbb{R}^K$, and for a fixed vector-valued characteristic $X_i$ (e.g. a function of the potential outcomes), $E_w[X_i] := \frac{1}{\sum_i w_i} \sum_i w_i X_i$ and $Var_w[X_i] = \frac{1}{\sum_i w_i} \sum_i w_i (X_i - E_w[X_i])(X_i - E_w[X_i])'$. In particular, define

$$S_{1,w} := Var_w[Y_i(1)], \qquad S_{0,w} := Var_w[Y_i(0)], \qquad S_{10,w} := E_w\left[ (Y_i(1) - E_w[Y_i(1)])(Y_i(0) - E_w[Y_i(0)])' \right]$$

to be the weighted finite population variances and covariance of $Y_i(1)$ and $Y_i(0)$. Additionally, the vector-valued ATT is defined as $\tau_{ATT} := \frac{1}{N_1} \sum_i \pi_i (Y_i(1) - Y_i(0))$, and we consider the vector-valued SDIM estimator $\hat\tau = \frac{1}{N_1} \sum_i D_i Y_i(1) - \frac{1}{N_0} \sum_i (1 - D_i) Y_i(0)$. We also generalize the variance estimators introduced above,

$$\hat{s} := \frac{1}{N_1} \hat{s}_1 + \frac{1}{N_0} \hat{s}_0, \qquad \hat{s}_1 := \frac{1}{N_1} \sum_i D_i (Y_i - \bar{Y}_1)(Y_i - \bar{Y}_1)', \qquad \hat{s}_0 := \frac{1}{N_0} \sum_i (1 - D_i)(Y_i - \bar{Y}_0)(Y_i - \bar{Y}_0)',$$

where $\bar{Y}_1 := \frac{1}{N_1} \sum_i D_i Y_i$ and $\bar{Y}_0 := \frac{1}{N_0} \sum_i (1 - D_i) Y_i$.

We introduce the following assumptions on the sequence of finite populations.

Assumption 4.
Suppose that $N_1 / N \to p \in (0, 1)$, and $S_{1,w}$, $S_{0,w}$, $S_{10,w}$ have finite limits for $w \in \{\pi, 1 - \pi, \tilde\pi\}$.

Assumption 5.
Assume that

$$\max_{1 \leq i \leq N} \| Y_i(1) - E_{\pi}[Y_i(1)] \|^2 / N \to 0 \quad \text{and} \quad \max_{1 \leq i \leq N} \| Y_i(0) - E_{1-\pi}[Y_i(0)] \|^2 / N \to 0,$$

where $\| \cdot \|$ is the Euclidean norm.

Assumption 6.
Let $\tilde{Y}_i = \frac{1}{N_1} Y_i(1) + \frac{1}{N_0} Y_i(0)$, and let $\lambda_{\min}$ be the minimal eigenvalue of $\Sigma_{\tilde\pi} = Var_{\tilde\pi}\left[ \tilde{Y}_i \right]$. Assume $\lambda_{\min} > 0$ and for all $\epsilon > 0$,

$$\frac{1}{\lambda_{\min}} E_{\tilde\pi}\left[ \left\| \tilde{Y}_i - E_{\tilde\pi}\left[ \tilde{Y}_i \right] \right\|^2 \cdot \mathbb{1}\left[ \left\| \tilde{Y}_i - E_{\tilde\pi}\left[ \tilde{Y}_i \right] \right\| \geq \sqrt{\sum_i \pi_i (1 - \pi_i) \cdot \lambda_{\min}} \cdot \epsilon \right] \right] \to 0.$$

Assumption 4 requires that the fraction of treated units and the (weighted) variance and covariances of the potential outcomes have limits. Assumption 5 is a multivariate analog of Assumption 2 in that it requires that no single observation dominate the $\pi$- or $(1 - \pi)$-weighted variance of the potential outcomes. Assumption 6 is a multivariate generalization of the Lindeberg-type condition in Assumption 3.

Proposition 3.1 (Results for vector-valued outcomes).

(1) $E_R[\hat\tau] = \tau_{ATT} + \frac{N}{N_0} \frac{N}{N_1} \left( \frac{1}{N} \sum_i \left( \pi_i - \frac{N_1}{N} \right) Y_i(0) \right)$.

(2) Under Assumptions 1 and 4,

$$V_R[\hat\tau] + o(N^{-1}) = \frac{\frac{1}{N} \sum_{k=1}^{N} \pi_k (1 - \pi_k)}{\frac{N_0}{N} \frac{N_1}{N}} \left[ \frac{1}{N_1} Var_{\tilde\pi}[Y_i(1)] + \frac{1}{N_0} Var_{\tilde\pi}[Y_i(0)] - \frac{1}{N} Var_{\tilde\pi}[\tau_i] \right] \leq \frac{1}{N_1} Var_{\pi}[Y_i(1)] + \frac{1}{N_0} Var_{1-\pi}[Y_i(0)],$$

where $A \leq B$ if $B - A$ is positive semi-definite.

(3) Under Assumptions 1, 4, and 5, $\hat{s}_1 - Var_{\pi}[Y_i(1)] \xrightarrow{p} 0$ and $\hat{s}_0 - Var_{1-\pi}[Y_i(0)] \xrightarrow{p} 0$.

(4) Under Assumptions 1, 4, and 6, $V_R[\hat\tau]^{-1/2} (\hat\tau - E_R[\hat\tau]) \xrightarrow{d} N(0, I)$.

Assumption 4 implies $\Sigma_\tau = \lim_{N \to \infty} N V_R[\hat\tau]$ exists, so the previous display can alternatively be written as $\sqrt{N} (\hat\tau - E_R[\hat\tau]) \xrightarrow{d} N(0, \Sigma_\tau)$.

Proof. See appendix.
In this section, we apply our results to provide a design-based analysis of difference-in-differences estimators (e.g., Chapter 5 of Angrist and Pischke (2009)). Such a design-based analysis is useful since applied researchers commonly use difference-in-differences estimators in quasi-experimental settings to analyze the causal effects of state-level policies in which outcomes for all 50 US states are observed.

Suppose we observe panel data for a population of $N$ units for periods $t = -\bar{T}, ..., \bar{T}$. Units with $D_i = 1$ receive a treatment of interest beginning at period $t = 1$. The observed outcome for unit $i$ at period $t$ is $Y_{it} = Y_{it}(D_i)$. We assume the treatment has no effect prior to its implementation, so that $Y_{it}(1) = Y_{it}(0)$ for all $t < 1$. Consider the common dynamic two-way fixed effects (TWFE) or “event-study” regression specification

$$Y_{it} = \alpha_i + \phi_t + \sum_{s \neq 0} D_i \times \mathbb{1}[s = t] \times \beta_s + \epsilon_{it}. \tag{11}$$

Footnote: We focus on the case with non-staggered treatment timing, since it may be difficult to interpret the estimand of standard two-way fixed effects models under treatment effect heterogeneity and staggered treatment timing (Borusyak and Jaravel, 2016; de Chaisemartin and D’Haultfœuille, 2018; Goodman-Bacon, 2018; Athey and Imbens, 2018). The results in this section could be extended to other estimators with a more sensible interpretation under staggered timing, e.g. Callaway and Sant’Anna (2019); Sun and Abraham (2020).

It is well known in this setting that $\hat\beta_t = \hat\tau_t - \hat\tau_0$, where

$$\hat\tau_t = \frac{1}{N_1} \sum_i D_i Y_{it} - \frac{1}{N_0} \sum_i (1 - D_i) Y_{it}.$$

Thus, $\hat\beta_t$ is the difference in the SDIM estimators for the outcome in period $t$ and period 0. Letting $Y_i = (Y_{i,-\bar{T}}, ..., Y_{i,\bar{T}})'$, (3) implies that under Poisson rejective assignment,

$$E_R\left[ \hat\beta_t \right] = \tau_t + \frac{N}{N_0} \frac{N}{N_1} Cov_1[\pi_i, Y_{it}(0) - Y_{i0}(0)],$$

where $\tau_t = \frac{1}{N_1} \sum_i \pi_i (Y_{it}(1) - Y_{it}(0))$ is the ATT in period $t$, the subscript 1 on the covariance denotes uniform weights, and we use the fact that $\tau_0 = 0$ by the no-anticipation assumption. Thus, the bias in $\hat\beta_t$ is proportional to the finite population covariance between $\pi_i$ and trends in the untreated potential outcomes, $Y_{it}(0) - Y_{i0}(0)$. It follows that $\hat\beta_t$ is unbiased for $\tau_t$ over the randomization distribution if $Cov_1[\pi_i, Y_{it}(0) - Y_{i0}(0)] = 0$, or equivalently, if

$$E_R\left[ \frac{1}{N_1} \sum_i D_i (Y_{it}(0) - Y_{i0}(0)) \right] = E_R\left[ \frac{1}{N_0} \sum_i (1 - D_i)(Y_{it}(0) - Y_{i0}(0)) \right],$$

which mimics the familiar “parallel trends” assumption from the sampling-based model.

Further, if the sequence of populations satisfies the assumptions in part (4) of Proposition 3.1, then

$$\sqrt{N} (\hat\beta - (\tau + \delta)) \xrightarrow{d} N(0, \Sigma), \tag{12}$$

where $\hat\beta$ is the vector that stacks $\hat\beta_t$, $\Sigma = \lim_{N \to \infty} N V_R\left[ \hat\beta \right]$, and $\tau$, $\delta$ are the vectors that stack $\tau_t$ and $\delta_t = \frac{N}{N_0} \frac{N}{N_1} Cov_1[\pi_i, Y_{it}(0) - Y_{i0}(0)]$. Part (3) implies that the variance estimator $\hat{s}$ is asymptotically conservative for $\hat\beta$. It is easily verified that $\hat{s}$ corresponds with the cluster-robust variance estimator for (11) that clusters at level $i$ (up to degrees of freedom corrections). The normal limiting model in (12) has been studied by Roth (2019) and Rambachan and Roth (2019) from a sampling-based perspective in which parallel trends may fail; our results show that it also has a sensible interpretation from a design-based perspective.
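The relationship between the event-study coefficients and period-by-period differences in means can be sketched in a few lines. The panel below is hypothetical, as are the function names `sdim` and `event_study`; the code computes the coefficients directly as differences of SDIMs relative to period 0 rather than by running the TWFE regression (11).

```python
def sdim(d, y):
    """Simple difference in means between treated (d = 1) and control (d = 0) units."""
    n1 = sum(d)
    n0 = len(d) - n1
    mean_treated = sum(di * yi for di, yi in zip(d, y)) / n1
    mean_control = sum((1 - di) * yi for di, yi in zip(d, y)) / n0
    return mean_treated - mean_control

def event_study(d, panel, periods, ref=0):
    """Return {t: tau_hat_t - tau_hat_ref} for every period t other than ref.

    panel[i][k] is the observed outcome of unit i in period periods[k].
    """
    tau_hat = {t: sdim(d, [row[k] for row in panel]) for k, t in enumerate(periods)}
    return {t: tau_hat[t] - tau_hat[ref] for t in periods if t != ref}

# Hypothetical balanced panel: two treated and two control units, periods -1, 0, 1.
periods = [-1, 0, 1]
d = [1, 1, 0, 0]
panel = [
    [1.0, 2.0, 5.0],  # treated unit
    [2.0, 3.0, 6.0],  # treated unit
    [0.0, 1.0, 2.0],  # control unit
    [1.0, 2.0, 3.0],  # control unit
]

beta_hat = event_study(d, panel, periods)
print(beta_hat)  # pre-period coefficient is 0 here because pre-trends are parallel
```

In this toy panel the treated and control groups move in parallel before treatment, so the coefficient for $t = -1$ is zero and the coefficient for $t = 1$ captures the post-treatment divergence.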
Instrumental Variables
In this section, we apply our results to analyze the properties of two-stage least squares instrumental variables estimators. Let $Z_i \in \{0, 1\}$ be an instrument. Let $D_i(z) \in \{0, 1\}$ be the potential treatment status as a function of $z$. Let $Y_i(d)$ be the potential outcome as a function of $d \in \{0, 1\}$. Our notation $Y_i(d)$ encodes the so-called “exclusion restriction” that $Z$ affects $Y$ only through $D$. We observe $(Y_i, D_i, Z_i)$, where $Y_i = Y_i(D_i(Z_i))$ and $D_i = D_i(Z_i)$. We treat $Z_i$ as stochastic and the potential outcomes for both $D$ and $Y$ as fixed. The number of units with $Z_i = 1$ is denoted by $N_1^Z$ and the number of units with $Z_i = 0$ is denoted by $N_0^Z$.

Example 2. Researchers may have data on student outcomes for all students attending public and private schools in a particular geographic area (e.g., Goodman (2008) observes data on all high school graduates in Massachusetts from 2003-2005). The instrument $Z_i$ could be an indicator for whether a student is offered a subsidy for attending private school, $D_i$ could be an indicator for whether a student attends private school, and $Y_i$ could be a student’s test score. We might suspect that an organization assigns scholarships essentially as-if random, but it is also plausible that they may target their offers to students who are likely to accept if offered, or who have high benefits from private school, so that $P(Z_i = 1)$ may be related to $Y_i(d)$ and $D_i(z)$. It is therefore instructive to consider the distribution of the 2SLS estimator when $Z_i$ is not completely randomly assigned.

In canonical IV frameworks, it is traditionally assumed that the instrument $Z$ is independent of the potential outcomes (see Angrist and Imbens (1994); Angrist et al. (1996) for a sampling-based model, and Kang et al. (2018) for a design-based model). We instead allow for the possibility that the probability that $Z_i = 1$ may differ across units, and be arbitrarily related to the potential outcomes. In particular, we suppose that

$$P\left( Z = z \,\Big|\, \sum_i Z_i = N_1^Z \right) = C \prod_i p_i^{z_i} (1 - p_i)^{1 - z_i} \tag{13}$$

for all $z \in \{0, 1\}^N$ such that $\sum_i z_i = N_1^Z$, and zero otherwise. Thus, the assignment of the instrument $Z_i$ mimics the Poisson rejective assignment of $D_i$ in (1). We update the notation to use $E_{R_Z}[\cdot]$, $V_{R_Z}[\cdot]$ to denote the expectations and variances with respect to the randomization distribution of $Z$ conditional on the number of units assigned to $Z = 1$. We also maintain the monotonicity assumption that is commonly imposed in IV settings.

Assumption 7 (Monotonicity). $D_i(1) \geq D_i(0)$ for all $i$.
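To make the assignment mechanism in (13) concrete, the following minimal sketch enumerates the randomization distribution of the instrument for a small hypothetical population. The values of `p`, `d0`, and `d1` and the helper name `instrument_law` are assumptions made for illustration, not quantities from the paper. The code computes the normalizing constant $C$, the marginal probabilities $P(Z_i = 1 \mid \sum_i Z_i = N_1^Z)$, and checks Assumption 7 for the hypothetical potential treatments.

```python
from itertools import combinations

import numpy as np


def instrument_law(p, n_z1):
    """Enumerate all z in {0,1}^N with sum(z) = n_z1 and return them with
    their probabilities under (13) and the normalising constant C."""
    n = len(p)
    zs, weights = [], []
    for chosen in combinations(range(n), n_z1):
        z = np.zeros(n, dtype=int)
        z[list(chosen)] = 1
        zs.append(z)
        # Unnormalised weight: prod_i p_i^{z_i} (1 - p_i)^{1 - z_i}
        weights.append(np.prod(np.where(z == 1, p, 1.0 - p)))
    weights = np.array(weights)
    c = 1.0 / weights.sum()
    return np.array(zs), c * weights, c


# Hypothetical unit-level parameters p_i for N = 5 units, with N_1^Z = 2.
p = np.array([0.1, 0.4, 0.5, 0.6, 0.9])
zs, prob, c = instrument_law(p, n_z1=2)

# Marginal probability that Z_i = 1, conditional on sum(Z) = 2.
pi_z = prob @ zs

# Hypothetical potential treatment statuses D_i(0) and D_i(1).
d0 = np.array([0, 0, 1, 0, 0])
d1 = np.array([1, 0, 1, 1, 0])
monotone = bool(np.all(d1 >= d0))      # Assumption 7
compliers = np.flatnonzero(d1 > d0)    # units with D_i(1) > D_i(0)

print(pi_z, monotone, compliers)
```

In this sketch the marginal probabilities `pi_z` sum to `n_z1` by construction and generally differ from the underlying `p`, which is why the two are kept as separate objects.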
A common method for estimating treatment effects in an instrumental variables setting is two-stage least squares (2SLS), defined as $\hat{\beta}_{2SLS} := \hat{\tau}_{RF} / \hat{\tau}_{FS}$ with

$$\hat{\tau}_{RF} := \frac{1}{N_1^Z} \sum_i Z_i Y_i - \frac{1}{N_0^Z} \sum_i (1 - Z_i) Y_i, \qquad \hat{\tau}_{FS} := \frac{1}{N_1^Z} \sum_i Z_i D_i - \frac{1}{N_0^Z} \sum_i (1 - Z_i) D_i.$$

$\hat{\tau}_{RF}$ is often referred to as the “reduced-form” coefficient, whereas $\hat{\tau}_{FS}$ is referred to as the “first-stage” coefficient.

Observe that $\hat{\tau}_{RF}$ is a SDIM for the effect of $Z_i$ on $Y_i$, whereas $\hat{\tau}_{FS}$ can be viewed as a SDIM for the effect of $Z_i$ on $D_i$. Equation (3) thus implies that

$$E_{R_Z}[\hat{\tau}_{RF}] = \frac{1}{N_1^Z} \sum_i \pi_i^Z \left( Y_i(D_i(1)) - Y_i(D_i(0)) \right) + \frac{N}{N_0^Z} \frac{N}{N_1^Z} \mathrm{Cov}_1\left[ \pi_i^Z, Y_i(D_i(0)) \right],$$

where $\mathrm{Cov}_1\left[ \pi_i^Z, Y_i(D_i(0)) \right] = \frac{1}{N} \sum_i \left( \pi_i^Z - \frac{N_1^Z}{N} \right) Y_i(D_i(0))$ is the finite population covariance between $\pi_i^Z$ and $Y_i(D_i(0))$. Let $C = \{ i : D_i(1) > D_i(0) \}$ denote the set of compliers. Under Assumption 7, $Y_i(D_i(1)) - Y_i(D_i(0))$ equals $Y_i(1) - Y_i(0)$ for compliers and zero for all other units, so the previous display implies that

$$E_{R_Z}[\hat{\tau}_{RF}] = \frac{1}{N_1^Z} \sum_{i \in C} \pi_i^Z (Y_i(1) - Y_i(0)) + \frac{N}{N_0^Z} \frac{N}{N_1^Z} \mathrm{Cov}_1\left[ \pi_i^Z, Y_i(D_i(0)) \right]. \tag{14}$$

By an analogous argument for $\hat{\tau}_{FS}$, we obtain that

$$E_{R_Z}[\hat{\tau}_{FS}] = \frac{1}{N_1^Z} \sum_{i \in C} \pi_i^Z + \frac{N}{N_0^Z} \frac{N}{N_1^Z} \mathrm{Cov}_1\left[ \pi_i^Z, D_i(0) \right]. \tag{15}$$

Define $\beta_{2SLS} := E_{R_Z}[\hat{\tau}_{RF}] / E_{R_Z}[\hat{\tau}_{FS}]$.

Our earlier results imply that under suitable regularity conditions $\hat{\beta}_{2SLS}$ is approximately normally distributed around $\beta_{2SLS}$ in large populations. Let $\mathbf{Y}_i = (Y_i, D_i)'$ and define the potential outcomes $\mathbf{Y}_i(z) = (Y_i(D_i(z)), D_i(z))'$. If the sequence of populations satisfies the assumptions in Proposition 3.1, part 4 (using $\mathbf{Y}_i$ as just defined, and adding sub- or super-script $Z$ as needed), then

$$\sqrt{N} \begin{pmatrix} \hat{\tau}_{RF} - E_{R_Z}[\hat{\tau}_{RF}] \\ \hat{\tau}_{FS} - E_{R_Z}[\hat{\tau}_{FS}] \end{pmatrix} \to_d N(0, \Sigma_\tau),$$

where $\Sigma_\tau = \lim_{N \to \infty} N \, V_{R_Z}\left[ (\hat{\tau}_{RF}, \hat{\tau}_{FS})' \right]$. Assuming further that the sequence of populations satisfies $(E_{R_Z}[\hat{\tau}_{RF}], E_{R_Z}[\hat{\tau}_{FS}]) \to (\tau_{RF}^*, \tau_{FS}^*)$ with $\tau_{FS}^* > 0$, the uniform delta method (e.g., Theorem 3.8 in van der Vaart (2000)) implies that

$$\sqrt{N} (\hat{\beta}_{2SLS} - \beta_{2SLS}) \to_d N(0, g' \Sigma_\tau g),$$

where $g$ is the gradient of $h(x, y) = x / y$ evaluated at $(\tau_{RF}^*, \tau_{FS}^*)$. Proposition 3.1 likewise implies that it is possible to obtain asymptotically conservative inference for $\beta_{2SLS}$ using plug-in estimates of the variance.

How should we interpret the estimand $\beta_{2SLS}$? First, note that if $\pi_i^Z \equiv \frac{N_1^Z}{N}$, so that all units receive $Z = 1$ with equal probability, then equations (14) and (15) imply that $\beta_{2SLS} = \frac{1}{|C|} \sum_{i \in C} (Y_i(1) - Y_i(0))$, which is the canonical local average treatment effect (LATE) for compliers (Angrist et al., 1996). Interestingly, our results show that $\beta_{2SLS}$ has a general causal interpretation under the weaker assumption that $\mathrm{Cov}_1\left[ \pi_i^Z, Y_i(D_i(0)) \right] = \mathrm{Cov}_1\left[ \pi_i^Z, D_i(0) \right] = 0$, so that the probability that $Z_i = 1$ may differ across units but the finite population covariances of the assignment probabilities $\pi_i^Z$ with $D_i(0)$ and $Y_i(D_i(0))$ are equal to zero. Under this assumption, we have that

$$\beta_{2SLS} = \frac{1}{\sum_{i \in C} \pi_i^Z} \sum_{i \in C} \pi_i^Z (Y_i(1) - Y_i(0)).$$

The parameter $\beta_{2SLS}$ can then be interpreted as a $\pi_i^Z$-weighted LATE for compliers. The weights given to each complier are proportional to the probability that $Z_i = 1$. This is intuitive, as a complier with a low probability of having $Z_i = 1$ should have little effect on the 2SLS estimator.

Conclusion

This paper analyzes the properties of quasi-experimental estimators, such as SDIM, DiD, and 2SLS, in a finite population setting in which treatment probabilities are non-constant across units and may vary systematically with potential outcomes. Analogous to familiar results in the sampling-based framework, we show that one can obtain valid causal inference for certain interpretable causal estimands if complete randomization is replaced with weaker orthogonality conditions.
More generally, our results allow one to understand the bias and limiting distribution of these estimators for the ATT as a function of the finite-population covariance between $\pi_i$ and functions of the potential outcomes, akin to familiar omitted variable bias formulas.

The analysis in this paper could be extended in a variety of directions. First, the analysis might be extended to settings where the stochastic nature of the data arises both from the assignment of treatment and from sampling a subset of units from a finite population, as in Abadie et al. (2020). As in Abadie et al. (2020), the analysis could also be extended to allow for clustered sampling or treatment assignment. Second, our results on the limiting distribution of the SDIM suggest that a variety of mis-specification robust tools and sensitivity analyses, which have been developed under the assumption of asymptotic normality from a sampling-based perspective, could potentially be applied in finite population contexts as well (e.g., Armstrong and Kolesar (2018a,b); Bonhomme and Weidner (2018); Andrews et al. (2017, 2019)). However, the finite population setting studied here differs from the usual sampling-based approach in that the variance matrix is only conservatively estimated. It would be useful to study which guarantees of size control and/or optimality from the sampling literature are robust to this modification.

Footnote: It is well known in sampling-based instrumental variables settings that the delta method fails under “weak-instrument asymptotics” in which $E_{R_Z}[\hat{\tau}_{FS}]$ drifts towards zero (Staiger and Stock, 1997). Similar issues apply here. However, the test statistic used to form Anderson-Rubin confidence intervals, which are robust to weak identification, can be written as a quadratic form in a SDIM statistic (see, e.g., Li and Ding (2017)). Our results could thus also be applied to analyze the properties of Anderson-Rubin based CIs under weak identification asymptotics.

References

Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge, “Sampling-Based versus Design-Based Uncertainty in Regression Analysis,” Econometrica, 2020, (1), 265–296. https://onlinelibrary.wiley.com/doi/pdf/10.3982/ECTA12675.
Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey Wooldridge, “When Should You Adjust Standard Errors for Clustering?,” Working Paper 24003, National Bureau of Economic Research November 2017.
Andrews, Isaiah, Matthew Gentzkow, and Jesse Shapiro, “Measuring the Sensitivity of Parameter Estimates to Estimation Moments,” The Quarterly Journal of Economics, 2017, (4), 1553–1592.
Andrews, Isaiah, Matthew Gentzkow, and Jesse Shapiro, “On the Informativeness of Descriptive Statistics for Structural Estimates,” Technical Report 2019.
Angrist, Joshua and Guido Imbens, “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 1994, (2), 467–475.
Angrist, Joshua D. and Jorn-Steffen Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton: Princeton University Press, 2009.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin, “Identification of Causal Effects Using Instrumental Variables,” Journal of the American Statistical Association, 1996, (434), 444–455.
Armstrong, Timothy and Michal Kolesar, “Optimal Inference in a Class of Regression Models,” Econometrica, 2018, 655–683.
Armstrong, Timothy and Michal Kolesar, “Simple and Honest Confidence Intervals in Nonparametric Regression,” Technical Report 2018.
Aronow, Peter M. and Joel A. Middleton, “A class of unbiased estimators of the average treatment effect in randomized experiments,” Journal of Causal Inference, 2015, (1), 135–154.
Athey, Susan and Guido Imbens, “Design-Based Analysis in Difference-In-Differences Settings with Staggered Adoption,” arXiv:1808.05293 [cs, econ, math, stat], August 2018.
Berger, Yves G., “Rate of convergence to normal distribution for the Horvitz-Thompson estimator,” Journal of Statistical Planning and Inference, April 1998, (2), 209–226.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan, “How Much Should We Trust Differences-In-Differences Estimates?,” The Quarterly Journal of Economics, February 2004, (1), 249–275.
Bonhomme, Stéphane and Martin Weidner, “Minimizing Sensitivity to Model Misspecification,” Technical Report 2018.
Borusyak, Kirill and Xavier Jaravel, “Revisiting Event Study Designs,” SSRN Scholarly Paper ID 2826228, Social Science Research Network, Rochester, NY August 2016.
Callaway, Brantly and Pedro H. C. Sant’Anna, “Difference-in-Differences with Multiple Time Periods,” SSRN Scholarly Paper ID 3148250, Social Science Research Network, Rochester, NY March 2019.
Chen, Jiafeng, Edward Glaeser, and David Wessel, “The (Non-) Effect of Opportunity Zones on Housing Prices,” Technical Report w26587, National Bureau of Economic Research, Cambridge, MA December 2019.
Conley, Timothy G., Christian B. Hansen, and Peter E. Rossi, “Plausibly Exogenous,” The Review of Economics and Statistics, October 2010, (1), 260–272.
de Chaisemartin, Clément and Xavier D’Haultfœuille, “Two-way fixed effects estimators with heterogeneous treatment effects,” arXiv:1803.08807 [econ], March 2018.
Fisher, R. A., The design of experiments, Oxford, England: Oliver & Boyd, 1935.
Freedman, David A., “On Regression Adjustments in Experiments with Several Treatments,” The Annals of Applied Statistics, 2008, (1), 176–196.
Freedman, David A., “On regression adjustments to experimental data,” Advances in Applied Mathematics, 2008, (2), 180–193.
Goodman-Bacon, Andrew, “Difference-in-Differences with Variation in Treatment Timing,” Working Paper 25018, National Bureau of Economic Research September 2018.
Goodman, Joshua, “Who merits financial aid?: Massachusetts’ Adams Scholarship,” Journal of Public Economics, 2008, 2121–2131.
Hajek, Jaroslav, “Asymptotic Theory of Rejective Sampling with Varying Probabilities from a Finite Population,” Annals of Mathematical Statistics, December 1964, (4), 1491–1523.
Heckman, James J. and Edward J. Vytlacil, “Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation,” in “Handbook of Econometrics,” Vol. 6 2006, pp. 4779–4874.
Imbens, Guido W., “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” The Review of Economics and Statistics, February 2004, (1), 4–29.
Imbens, Guido W. and Donald B. Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge: Cambridge University Press, 2015.
Kang, Hyunseung, Laura Peck, and Luke Keele, “Inference for instrumental variables: a randomization inference approach,” Journal of the Royal Statistical Society: Series A (Statistics in Society), 2018, (4), 1231–1254. https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12353.
Li, Xinran and Peng Ding, “General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference,” Journal of the American Statistical Association, October 2017, (520), 1759–1769. https://doi.org/10.1080/01621459.2017.1295865.
Lin, Winston, “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s critique,” The Annals of Applied Statistics, 2013, (1), 295–318.
Manski, Charles F. and John V. Pepper, “How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions,” Review of Economics and Statistics, 2018, (2), 232–244.
Middleton, Joel A., “A Unified Theory of Regression Adjustment for Design-based Inference,” Technical Report, arXiv preprint arXiv:1803.06011 2018.
Neyman, Jerzy, “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.,” Statistical Science, 1923, (4), 465–472.
Rambachan, Ashesh and Jonathan Roth, “An Honest Approach to Parallel Trends,” Technical Report 2019.
Rosenbaum, Paul, “Sensitivity Analysis in Observational Studies,” in B. S. Everitt and D. C. Howell, eds., Encyclopedia of Statistics in Behavioral Science, 2005.
Rosenbaum, Paul R., “Sensitivity analysis for certain permutation inferences in matched observational studies,” Technical Report 1 1987.
Rosenbaum, Paul R., Observational Studies, Springer Science, 2002.
Roth, Jonathan, “Pre-test with Caution: Event-study Estimates After Testing for Parallel Trends,” Working paper, 2019.
Savje, Frederik and Angele Delevoye, “Consistency of the Horvitz-Thompson estimator under general sampling and experimental designs,” Journal of Statistical Planning and Inference, 2020, 190–197.
Sekhon, Jasjeet S. and Yotam Shem-Tov, “Inference on a New Class of Sample Average Treatment Effects,” Journal of the American Statistical Association, February 2020, pp. 1–18.
Staiger, Douglas and James H. Stock, “Instrumental Variables Regression with Weak Instruments,” Econometrica, 1997, (3), 557–586.
Sun, Liyan and Sarah Abraham, “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects,” Working Paper, 2020.
van der Vaart, A. W., Asymptotic Statistics, Cambridge University Press, June 2000.

Design-Based Uncertainty for Quasi-Experiments
Appendix
Ashesh Rambachan, Jonathan Roth
August 6, 2020
A Additional Proofs
Proof of Lemma 4
Proof.
It suffices to show that $\hat{s}_1^2 / \mathrm{Var}_\pi[Y_i(1)] \to_p 1$ and $\hat{s}_0^2 / \mathrm{Var}_{1-\pi}[Y_i(0)] \to_p 1$. We provide a proof for the former; the latter proof is analogous. For notational convenience, let $v_1 = \mathrm{Var}_\pi[Y_i(1)]$. From the definition of $\hat{s}_1^2$, we can write

$$\frac{\hat{s}_1^2}{v_1} = \frac{1}{v_1} \left( \left( \frac{1}{N_1} \sum_i D_i (Y_i(1) - E_\pi[Y_i(1)])^2 \right) - (\bar{Y}_1 - E_\pi[Y_i(1)])^2 \right).$$

Now, $\frac{1}{N_1} \sum_i D_i (Y_i(1) - E_\pi[Y_i(1)])^2$ can be viewed as a Horvitz-Thompson estimator of $\frac{1}{N_1} \sum_i \pi_i (Y_i(1) - E_\pi[Y_i(1)])^2 = v_1$, and thus by Theorem 6.2 in Hajek (1964), its variance is equal to

$$(1 + o(1)) \left( \frac{1}{N_1^2} \sum_i \pi_i (1 - \pi_i) \right) \cdot \mathrm{Var}_{\tilde{\pi}}\left[ (Y_i(1) - E_\pi[Y_i(1)])^2 \right].$$

Note further that

$$\left( \frac{1}{N_1^2} \sum_i \pi_i (1 - \pi_i) \right) \cdot \mathrm{Var}_{\tilde{\pi}}\left[ (Y_i(1) - E_\pi[Y_i(1)])^2 \right] \leq \frac{1}{N_1^2} \sum_i \pi_i (1 - \pi_i) (Y_i(1) - E_\pi[Y_i(1)])^4 \leq \frac{1}{N_1^2} m_N(1) \sum_i \pi_i (Y_i(1) - E_\pi[Y_i(1)])^2 = \frac{1}{N_1} m_N(1) \mathrm{Var}_\pi[Y_i(1)].$$

Applying Chebyshev’s inequality, we have

$$\frac{1}{N_1} \sum_i D_i (Y_i(1) - E_\pi[Y_i(1)])^2 - v_1 = O_p\left( \sqrt{\frac{1}{N_1} m_N(1) \mathrm{Var}_\pi[Y_i(1)]} \right).$$

Next, viewing $\bar{Y}_1$ as a Horvitz-Thompson estimator, we see that its variance is bounded by $(1 + o(1)) \left( \frac{1}{N_1^2} \sum_i \pi_i (1 - \pi_i) \right) \cdot \mathrm{Var}_{\tilde{\pi}}[Y_i(1)]$, which by similar logic to that above is bounded above by $(1 + o(1)) \frac{1}{N_1} \mathrm{Var}_\pi[Y_i(1)]$. Thus, by Chebyshev’s inequality,

$$\bar{Y}_1 - E_\pi[Y_i(1)] = O_p\left( \sqrt{\frac{1}{N_1} \mathrm{Var}_\pi[Y_i(1)]} \right).$$

Combining the results above, it follows that

$$\frac{\hat{s}_1^2}{v_1} = \frac{1}{v_1} \left( v_1 + O_p\left( \sqrt{\frac{m_N(1) v_1}{N_1}} \right) + O_p\left( \frac{1}{N_1} v_1 \right) \right) = 1 + O_p\left( \sqrt{\frac{m_N(1)}{v_1 N_1}} \right) + O_p\left( \frac{1}{N_1} \right).$$

However, the first $O_p$ term converges to 0 by assumption, and since Assumption 1 implies that $N_1 \to \infty$, the second $O_p$ term converges to 0 as well.

Proof of Proposition 3.1
Proof.
The proof of claim (1) is analogous to equation (3). We next prove claim (2). For simplicity, let $A_n = V_R[\hat{\boldsymbol{\tau}}]$, let $B_n$ be the right-hand side of the first equality in claim (2), and let $C_n$ be the right-hand side of the inequality in claim (2). We first prove the inequality. Note that by the definition of a semi-definite matrix, it suffices to show that $l' B_n l \leq l' C_n l$ for all $l \in \mathbb{R}^K$. However, letting $Y_i(d) = l' \mathbf{Y}_i(d)$, the desired inequality follows from Lemma 3.2.

Next, observe that $A_n - B_n = o(N^{-1})$ if and only if $D_n := N A_n - N B_n = o(1)$, which holds if and only if $l' D_n l = o(1)$ for all $l \in L := \{ e_j \mid 1 \leq j \leq K \} \cup \{ e_j - e_{j'} \mid 1 \leq j, j' \leq K \}$, where $e_j$ is the $j$th basis vector in $\mathbb{R}^K$. To obtain the last equivalence, note that $e_j' D_n e_j = [D_n]_{jj}$ (the $(j, j)$ element of $D_n$), whereas exploiting the fact that $D_n$ is symmetric, $(e_j - e_{j'})' D_n (e_j - e_{j'}) = [D_n]_{jj} + [D_n]_{j'j'} - 2 [D_n]_{jj'}$, and so convergence of $l' D_n l$ to zero for all $l \in L$ is equivalent to convergence of each of the elements of $D_n$. Next, note that if $Y_i(d) = l' \mathbf{Y}_i(d)$, then $\hat{\tau}$ as defined in (2) is equal to $l' \hat{\boldsymbol{\tau}}$ and $\mathrm{Var}_{\tilde{\pi}}[Y_i(d)] = l' \mathrm{Var}_{\tilde{\pi}}[\mathbf{Y}_i(d)] l$. It follows from Lemma 3.1 that

$$N \cdot l' V_R[\hat{\boldsymbol{\tau}}] l \, [1 + o(1)] = \frac{\frac{1}{N} \sum_{k=1}^N \pi_k (1 - \pi_k)}{\frac{N_0}{N} \frac{N_1}{N}} \, l' \left[ \frac{N}{N_1} \mathrm{Var}_{\tilde{\pi}}[\mathbf{Y}_i(1)] + \frac{N}{N_0} \mathrm{Var}_{\tilde{\pi}}[\mathbf{Y}_i(0)] - \mathrm{Var}_{\tilde{\pi}}[\boldsymbol{\tau}_i] \right] l, \tag{16}$$

which implies that $l' D_n l = l' (N A_n) l \cdot o(1)$. However, Assumption 4, together with the inequality in claim (2), implies that the right-hand side of the previous display is $O(1)$, and thus $l' (N A_n) l = O(1)$, from which the desired result follows.

The proof of (3) is similar to the proof of Lemma A3 in Li and Ding (2017), which gives a similar result in the case of completely randomized experiments. We provide a proof for the convergence of $\hat{\mathbf{s}}_1$; the convergence of $\hat{\mathbf{s}}_0$ is similar. As in the proof of claim (2), it suffices to show that $l' \hat{\mathbf{s}}_1 l - l' \mathrm{Var}_\pi[\mathbf{Y}_i(1)] l \to_p 0$ for all $l \in L$.
Let $Y_i(d) = l' \mathbf{Y}_i(d)$. Then

$$l' \hat{\mathbf{s}}_1 l = \frac{1}{N_1} \sum_i D_i \left( l' \mathbf{Y}_i(1) - \frac{1}{N_1} \sum_j D_j l' \mathbf{Y}_j(1) \right)^2 = \left( \frac{1}{N_1} \sum_i D_i \left( l' \mathbf{Y}_i(1) - l' E_\pi[\mathbf{Y}_i(1)] \right)^2 \right) - \left( \frac{1}{N_1} \sum_i D_i l' \mathbf{Y}_i(1) - E_\pi[l' \mathbf{Y}_i(1)] \right)^2, \tag{17}$$

where the second equality uses the bias-variance decomposition. The first term can be viewed as a Horvitz-Thompson estimator of $\frac{1}{N_1} \sum_i \pi_i \left( l' \mathbf{Y}_i(1) - E_\pi[l' \mathbf{Y}_i(1)] \right)^2 = \mathrm{Var}_\pi[l' \mathbf{Y}_i(1)]$ under Poisson rejective sampling, and thus has variance equal to

$$(1 + o(1)) \frac{1}{N_1^2} \sum_i \pi_i (1 - \pi_i) \, \mathrm{Var}_{\tilde{\pi}}\left[ \left( l' \mathbf{Y}_i(1) - E_\pi[l' \mathbf{Y}_i(1)] \right)^2 \right].$$

Further, observe that

$$\frac{1}{N_1^2} \sum_i \pi_i (1 - \pi_i) \, \mathrm{Var}_{\tilde{\pi}}\left[ \left( l' \mathbf{Y}_i(1) - E_\pi[l' \mathbf{Y}_i(1)] \right)^2 \right] \leq \frac{1}{N_1} E_\pi\left[ \left( l' \mathbf{Y}_i(1) - E_\pi[l' \mathbf{Y}_i(1)] \right)^4 \right] \leq \frac{1}{N_1} \max_i \left\{ \left( l' \mathbf{Y}_i(1) - E_\pi[l' \mathbf{Y}_i(1)] \right)^2 \right\} \cdot \mathrm{Var}_\pi[l' \mathbf{Y}_i(1)] \leq \left[ \|l\|^2 \frac{N}{N_1} \right] \left[ \max_i \| \mathbf{Y}_i(1) - E_\pi[\mathbf{Y}_i(1)] \|^2 / N \right] \cdot \left[ l' \mathrm{Var}_\pi[\mathbf{Y}_i(1)] l \right] = o(1),$$

where the first inequality is obtained using the fact that $\mathrm{Var}_{\tilde{\pi}}[X] \leq E_{\tilde{\pi}}[X^2]$, expanding the definition of $E_{\tilde{\pi}}[\cdot]$, and using the inequality $\pi_i (1 - \pi_i) \leq \pi_i$, analogous to the argument in the proof of Lemma 3.3; the final inequality uses the Cauchy-Schwarz inequality and factors out $l$; and we obtain that the final term is $o(1)$ by noting that the first and final bracketed terms are $O(1)$ by Assumption 4 and the middle term is $o(1)$ by Assumption 5. Applying Chebyshev’s inequality, it follows that the first term in (17) is equal to $\mathrm{Var}_\pi[l' \mathbf{Y}_i(1)] + o_p(1)$.

To complete the proof of the claim, we show that the second term in (17) is $o_p(1)$. Note that we can view $\frac{1}{N_1} \sum_i D_i l' \mathbf{Y}_i(1)$ as a Horvitz-Thompson estimator of $E_\pi[l' \mathbf{Y}_i(1)]$. Following similar arguments to those in the preceding paragraph, we have that its variance is bounded above by $\frac{1}{N_1} l' \mathrm{Var}_\pi[\mathbf{Y}_i(1)] l$, which is $o(1)$ by Assumption 4 combined with the fact that Assumption 1 implies $N_1 \to \infty$.
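Applying Chebyshev’s inequality again, we obtain that the second term in (17) is $o_p(1)$, as needed.

To prove claim (4), appealing to the Cramer-Wold device, it suffices to show that for any $l \in \mathbb{R}^K \setminus \{0\}$, $Y_i = l' \mathbf{Y}_i$, and $\hat{\tau}$ as defined in (2), $V_R[\hat{\tau}]^{-1/2} (\hat{\tau} - \tau) \to_d N(0, 1)$. This follows from Proposition 3.4, provided that we can show that Assumption 6 implies that Assumption 3 holds when $Y_i = l' \mathbf{Y}_i$ for any conformable vector $l$. Indeed, recall that $\sigma_{\tilde{\pi}}^2 = l' \Sigma_{\tilde{\pi}} l \geq \lambda_{\min} \|l\|^2$, and hence $\frac{1}{\lambda_{\min}} \geq \frac{\|l\|^2}{\sigma_{\tilde{\pi}}^2}$. From the Cauchy-Schwarz inequality,

$$\left\| \tilde{\mathbf{Y}}_i - E_{\tilde{\pi}}[\tilde{\mathbf{Y}}_i] \right\|^2 \cdot \|l\|^2 \geq \left( \tilde{Y}_i - E_{\tilde{\pi}}[\tilde{Y}_i] \right)^2.$$

Together with the previous inequality, this implies that

$$\frac{1}{\lambda_{\min}} E_{\tilde{\pi}}\left[ \left\| \tilde{\mathbf{Y}}_i - E_{\tilde{\pi}}[\tilde{\mathbf{Y}}_i] \right\|^2 \cdot 1\left[ \left\| \tilde{\mathbf{Y}}_i - E_{\tilde{\pi}}[\tilde{\mathbf{Y}}_i] \right\| \geq \sqrt{\sum_i \pi_i (1 - \pi_i) \cdot \lambda_{\min}} \cdot \epsilon \right] \right] \geq \frac{1}{\sigma_{\tilde{\pi}}^2} E_{\tilde{\pi}}\left[ \left( \tilde{Y}_i - E_{\tilde{\pi}}[\tilde{Y}_i] \right)^2 \cdot 1\left[ \left| \tilde{Y}_i - E_{\tilde{\pi}}[\tilde{Y}_i] \right| \geq \sqrt{\sum_i \pi_i (1 - \pi_i)} \cdot \sigma_{\tilde{\pi}} \epsilon \right] \right],$$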