Decomposing Treatment Effect Variation
Peng Ding (UC Berkeley)        Avi Feller (UC Berkeley)        Luke Miratrix (Harvard GSE)

July 31, 2017
Abstract
Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the "black box" of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of randomized experiments, this paper proposes a framework for decomposing overall treatment effect variation into a systematic component explained by observed covariates and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are entirely justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully-interacted linear regression and two-stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an $R^2$-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment.

Key Words: Noncompliance; Heterogeneous treatment effect; Idiosyncratic treatment effect variation; Randomization inference; Systematic treatment effect variation.

* Peng Ding (Email: [email protected]) is Assistant Professor, Department of Statistics, University of California, Berkeley. Avi Feller (Email: [email protected]) is Assistant Professor, Goldman School of Public Policy, University of California, Berkeley. Luke Miratrix (Email: [email protected]) is Assistant Professor, Harvard Graduate School of Education. We thank Alberto Abadie, Donald Rubin, participants at the Applied Statistics Seminar at the Harvard Institute of Quantitative Social Science, and colleagues at University of California, Berkeley and Harvard University for helpful comments. We gratefully acknowledge financial support from the Spencer Foundation through a grant entitled "Using Emerging Methods with Existing Data from Multi-site Trials to Learn About and From Variation in Educational Program Effects," and from the Institute for Education Science (IES Grant …).

1 Introduction
The analysis of randomized experiments has traditionally focused on the average treatment effect, often ignoring or assuming away treatment effect variation (e.g., Neyman, 1923; Fisher, 1935; Kempthorne, 1952; Rosenbaum, 2002). Today, understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the "black box" of the average treatment effect. This is clear from the increasing number of papers on the topic in statistics and machine learning (Hill, 2011; Athey and Imbens, 2016; Wager and Athey, 2017), biostatistics (Huang et al., 2012; Matsouaka et al., 2014), education (Raudenbush and Bloom, 2015), economics (Heckman et al., 1997; Crump et al., 2008; Djebbari and Smith, 2008), political science (Green and Kern, 2012; Imai and Ratkovic, 2013), and other areas.

This paper proposes a framework for decomposing overall treatment effect variation in a randomized experiment into a systematic component that is explained by observed covariates, and an idiosyncratic component that is not explained (Heckman et al., 1997; Djebbari and Smith, 2008). In doing so, we make several key contributions. First, we take a fully randomization-based perspective (see Rosenbaum, 2002; Imbens and Rubin, 2015), and propose estimators that are entirely justified by the randomization itself. This is in contrast to much of the literature on randomization-based methods, where treatment effect variation is typically a nuisance (e.g., Rosenbaum, 1999, 2007). Similar to Lin (2013), we show that the resulting estimator is very similar in form to linear regression with interactions between the treatment indicator and covariates. Unlike with linear regression, however, the proposed estimator does not require any modeling assumptions on the marginal outcomes.

Second, we extend these methods from intention-to-treat (ITT) analysis to allow for noncompliance, proposing a randomization-based estimator of systematic treatment effect variation for the Local Average Treatment Effect (LATE) under noncompliance (Angrist et al., 1996). We show that this estimator is nearly identical to the two-stage least squares estimator with interactions between the treatment and covariates. We believe that this is a particularly novel contribution to the recent literature seeking to reconcile the randomization-based tradition in statistics and the linear model-based perspective more common in econometrics (Abadie, 2003; Imbens, 2014; Imbens and Rubin, 2015).

Armed with these estimators, we turn to two practical tools for decomposing treatment effect variation. The first is an omnibus test for the presence of systematic treatment effect variation. While versions of this test have been proposed previously, largely in the context of linear models (Cox, 1984; Crump et al., 2008), our proposed test is fully randomization-based and can also account for noncompliance. The second is to develop and bound an $R^2$-like measure of the fraction of treatment effect variation explained by covariates. This builds on previous versions proposed in the econometrics literature (Heckman et al., 1997; Djebbari and Smith, 2008), again extending results to account for noncompliance. This approach is also closely related to the Oaxaca–Blinder decomposition in economics (Oaxaca, 1973; Blinder, 1973). See Angrist et al. (2013) for a recent application that also addresses compliance.
Finally, we apply these methods to the Head Start Impact Study, a large-scale randomized trial of Head Start, a federally funded preschool program (Puma et al., 2010). We relegate the technical details and some further extensions to the online Supplementary Material.

2 Randomization inference for vector outcomes

Assume that we have $n$ units in an experiment. For unit $i$, let $X_i = (X_{1i}, \ldots, X_{Ki})^T \in \mathbb{R}^K$ denote the vector of pretreatment covariates, with the constant 1 as its first component. Let $T_i$ denote the treatment indicator, with 1 for treatment and 0 for control. We use the potential outcomes framework (Neyman, 1923; Rubin, 1974) to define causal effects. Under the Stable Unit Treatment Value Assumption (Rubin, 1980) that there is only one version of the treatment and no interference among units, we define $Y_i(1)$ and $Y_i(0)$ as the potential outcomes of unit $i$ under treatment and control, respectively. The observed outcome, $Y_i^{obs} = T_i Y_i(1) + (1 - T_i) Y_i(0)$, is quite general and includes continuous, binary, and zero-inflated cases. On the difference scale, the individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$.

Importantly, this is finite population inference in that we condition on the $n$ units at hand: the potential outcomes are fixed and pre-treatment. This differs from super population inference, in which some variables or residuals are assumed to be independent and identically distributed (iid) draws from some distribution. See, for example, Rosenbaum (2002), Imbens and Rubin (2015) and Li and Ding (2017). Under the potential outcomes framework, $\{Y_i(1), Y_i(0)\}_{i=1}^n$ are all fixed numbers; the randomness of any estimator comes from the assignment mechanism, which is the distribution of possible treatment assignments $T = (T_1, \ldots, T_n)^T$. Note that $\mathrm{pr}\{(T_1, \ldots, T_n) = (t_1, \ldots, t_n)\} = \binom{n}{n_1}^{-1}$ if $\sum_{i=1}^n t_i = n_1$.

To set up our overall framework, we first generalize Neyman (1923)'s classic results to vector outcomes. We consider a completely randomized experiment, with $n_1$ units assigned to treatment and $n_0$ units assigned to control; in total we have $\binom{n}{n_1}$ possible randomizations. We are interested in estimating the finite population average treatment effect on a vector outcome $V \in \mathbb{R}^K$:
$$\tau_V = \frac{1}{n} \sum_{i=1}^n \{V_i(1) - V_i(0)\},$$
where $V_i(1)$ and $V_i(0)$ are the potential outcomes of $V$ for unit $i$. The Neyman-type unbiased estimator for $\tau_V$ is the difference between the sample mean vectors of the observed outcomes under treatment and control:
$$\widehat{\tau}_V = \bar{V}_1^{obs} - \bar{V}_0^{obs} = \frac{1}{n_1} \sum_{i=1}^n T_i V_i^{obs} - \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) V_i^{obs} = \frac{1}{n_1} \sum_{i=1}^n T_i V_i(1) - \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) V_i(0).$$

The behavior of our estimator, and of our estimators for heterogeneity discussed later, revolves around covariances of vector outcomes. For notation, let $A = \{A_1, \ldots, A_n\}$ be a collection of $n$ vectors, with $\bar{A} = n^{-1} \sum_{i=1}^n A_i$ the vector mean, and define the covariance operator on $A$ as
$$S(A) = \frac{1}{n-1} \sum_{i=1}^n (A_i - \bar{A})(A_i - \bar{A})^T,$$
which gives the covariance matrix of the $n$ vectors in $A$. For example, $A_i$ can be $V_i(1)$, $V_i(0)$ or $V_i(1) - V_i(0)$. The following theorem, generalizing the results for scalar outcomes from Neyman (1923), demonstrates that $\widehat{\tau}_V$ is unbiased and gives its covariance matrix.

Theorem 1.
Over all possible randomizations of a completely randomized experiment, $\widehat{\tau}_V$ is unbiased for $\tau_V$, with $K \times K$ covariance matrix
$$\mathrm{cov}(\widehat{\tau}_V) = \frac{S\{V(1)\}}{n_1} + \frac{S\{V(0)\}}{n_0} - \frac{S\{V(1) - V(0)\}}{n}. \quad (1)$$
The diagonal elements of this matrix are the variances of the estimators of each component of $\tau_V$.

The covariance matrix of $\widehat{\tau}_V$ depends on the various covariances of the potential outcomes under treatment and control. In particular, the last term depends on the correlation between the potential outcomes $V(1)$ and $V(0)$, and therefore cannot be identified from the observed data. When the individual treatment effects are constant for all components of $V$, the last term in the above covariance matrix vanishes, because then $S\{V(1) - V(0)\} = 0_{K \times K}$. Under this assumption, we can unbiasedly estimate the sampling covariance matrix $\mathrm{cov}(\widehat{\tau}_V)$ by replacing the covariances of the potential outcomes by the sample analogues:
$$\widehat{\mathrm{cov}}(\widehat{\tau}_V) = \frac{\widehat{S}_1(V^{obs})}{n_1} + \frac{\widehat{S}_0(V^{obs})}{n_0},$$
where
$$\widehat{S}_t(V^{obs}) = \frac{1}{n_t - 1} \sum_{i=1}^n I(T_i = t)(V_i - \bar{V}_t^{obs})(V_i - \bar{V}_t^{obs})^T \quad (t = 0, 1) \quad (2)$$
are the sample covariance matrices of $V^{obs}$ in the treatment and control groups. Without the constant treatment effect assumption, the covariance estimator $\widehat{\mathrm{cov}}(\widehat{\tau}_V)$ is conservative in the sense that the difference between the expectation of the variance estimator and the true variance is a non-negative definite matrix. In particular, the diagonal terms of the expected estimator will all be larger than the truth. Letting $K = 1$, the covariance matrices become simple variances, which recovers Neyman's original result.

Using the mathematical framework introduced in the Appendix and in Li and Ding (2017), we can easily generalize Theorem 1 to more complicated experimental designs, e.g., cluster-randomized trials (Middleton and Aronow, 2015) and unbalanced $2^2$ split-plot designs (Zhao et al., 2017).
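To make Theorem 1 concrete, the following minimal sketch computes the vector difference-in-means estimator and the conservative covariance estimate. It is our own illustration in Python with NumPy; the function and variable names are not from the paper.

```python
import numpy as np

def neyman_vector(V, T):
    """Difference in mean vectors with the conservative covariance estimate.

    V is an (n, K) array of observed outcomes; T is a 0/1 assignment vector.
    """
    V1, V0 = V[T == 1], V[T == 0]
    n1, n0 = len(V1), len(V0)
    tau_hat = V1.mean(axis=0) - V0.mean(axis=0)
    # Sample covariance matrices of equation (2); np.cov already uses the
    # n_t - 1 denominator. Dropping the S{V(1) - V(0)}/n term makes the
    # estimate conservative when treatment effects are not constant.
    cov_hat = (np.atleast_2d(np.cov(V1, rowvar=False)) / n1
               + np.atleast_2d(np.cov(V0, rowvar=False)) / n0)
    return tau_hat, cov_hat
```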
3 Systematic treatment effect variation

We now apply this general framework to treatment effect variation. We decompose the individual treatment effect, $\tau_i$, via
$$\tau_i = Y_i(1) - Y_i(0) = X_i^T \beta + \varepsilon_i, \quad (i = 1, \ldots, n) \quad (3)$$
with $\beta$ being the finite population linear regression coefficient of $\tau_i$ on $X_i$, defined by
$$\beta = \arg\min_{b \in \mathbb{R}^K} \sum_{i=1}^n (\tau_i - X_i^T b)^2. \quad (4)$$
Following Heckman et al. (1997) and Djebbari and Smith (2008), we call $\delta_i = X_i^T \beta$ the systematic treatment effect variation explained by the observed covariates, $X_i$, and call $\varepsilon_i$ the idiosyncratic treatment effect variation not explained by $X_i$.

More generally, we can view this decomposition in a regression-style framework. Define
$$S_{xx} = \frac{1}{n} \sum_{i=1}^n X_i X_i^T \in \mathbb{R}^{K \times K}, \quad S_{x\varepsilon} = \frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i \in \mathbb{R}^K, \quad S_{x\tau} = \frac{1}{n} \sum_{i=1}^n \tau_i X_i \in \mathbb{R}^K,$$
where $S_{xx}$ is non-degenerate, analogous to the usual full rank assumption in linear models. Also define
$$S_{xt} = \frac{1}{n} \sum_{i=1}^n X_i Y_i(t) \in \mathbb{R}^K, \quad (t = 0, 1).$$
These are all finite population quantities, in that they are fixed pre-randomization values. The definition of $\beta$ gives $S_{x\varepsilon} = 0$, i.e., $\varepsilon_i$ and $X_i$ have finite population covariance zero. Therefore, in the spirit of the agnostic regression framework (e.g., Lin, 2013), the systematic component, $\delta_i = X_i^T \beta$, is a projection of $\tau_i$ onto the linear space spanned by $X_i$, and the idiosyncratic treatment effect, $\varepsilon_i$, is the corresponding residual. The linear projection applies to general outcomes, including the binary case.

Because of our finite population focus, if we observed all the potential outcomes we could immediately calculate all individual treatment effects, apply standard linear regression theory to (3), and obtain $\beta$. In particular, the solution of (4), i.e., the ordinary least squares (OLS) solution from regressing $\tau$ on $X$, is
$$\beta = S_{xx}^{-1} S_{x\tau} = S_{xx}^{-1} S_{x1} - S_{xx}^{-1} S_{x0} \equiv \gamma_1 - \gamma_0, \quad (5)$$
where $\gamma_1 = S_{xx}^{-1} S_{x1}$ and $\gamma_0 = S_{xx}^{-1} S_{x0}$ are the corresponding finite population regression coefficients of the potential outcomes on the covariates. Let $e_i(1) = Y_i(1) - X_i^T \gamma_1$ and $e_i(0) = Y_i(0) - X_i^T \gamma_0$ be the residual potential outcomes from the regression of $Y_i(t)$ onto $X_i$. Our idiosyncratic treatment variation is then the difference of residuals: $\varepsilon_i = e_i(1) - e_i(0)$. In practice, we do not fully observe these components, but we can obtain unbiased or consistent estimates for them as we discuss below.

We now turn to estimating $\beta$. As shown in (5), $\beta$ has three components. The first term, $S_{xx}$, is fully observed, as all the covariates are observed. Our estimation then depends on the sample analogues of $S_{x1}$ and $S_{x0}$:
$$\widehat{S}_{x1} = \frac{1}{n_1} \sum_{i=1}^n T_i Y_i^{obs} X_i \in \mathbb{R}^K, \quad \widehat{S}_{x0} = \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) Y_i^{obs} X_i \in \mathbb{R}^K.$$
The $\widehat{S}_{xt}$'s capture how the observed potential outcomes correlate with the covariates. We plug these into (5) to obtain an overall estimate of $\beta$. The randomization of $T$ then justifies the following theorem.

Theorem 2.
Under decomposition (3), $S_{xx}^{-1} \widehat{S}_{x1}$ and $S_{xx}^{-1} \widehat{S}_{x0}$ are unbiased estimates of $\gamma_1$ and $\gamma_0$, respectively. Therefore
$$\widehat{\beta}_{RI} = S_{xx}^{-1} \widehat{S}_{x1} - S_{xx}^{-1} \widehat{S}_{x0}$$
is an unbiased estimator for $\beta$, with covariance matrix
$$\mathrm{cov}(\widehat{\beta}_{RI}) = S_{xx}^{-1} \left[ \frac{S\{Y(1)X\}}{n_1} + \frac{S\{Y(0)X\}}{n_0} - \frac{S(\tau X)}{n} \right] S_{xx}^{-1}. \quad (6)$$

Here, for example, $S\{Y(0)X\}$ denotes the covariance operator on new unit-level variables $Y_i(0) X_i \in \mathbb{R}^K$, made by scaling the $X_i$ vector of each unit by $Y_i(0)$; similarly for $S\{Y(1)X\}$ and $S(\tau X)$. This slight abuse of notation gives formulae less cluttered by subscripts and excessive annotation. As with the vector version of Neyman's formula, the square root of the diagonal of $\mathrm{cov}(\widehat{\beta}_{RI})$ gives the standard errors of $\widehat{\beta}_{RI}$.

The covariance formula (6) generalizes the result of Neyman (1923) for the average treatment effect, reducing to Neyman's formula if $X_i = 1$ for all units. We can obtain a "conservative" estimate of $\mathrm{cov}(\widehat{\beta}_{RI})$ by
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{RI}) = S_{xx}^{-1} \left[ \frac{\widehat{S}_1(Y^{obs}X)}{n_1} + \frac{\widehat{S}_0(Y^{obs}X)}{n_0} \right] S_{xx}^{-1},$$
recalling the definitions of the sample covariance operators $\widehat{S}_1$ and $\widehat{S}_0$ introduced in (2). Similar to Neyman (1923), this implicitly assumes $S(\tau X) = 0$. Under the assumption that $\varepsilon_i = 0$ for all units (i.e., no idiosyncratic variation whatsoever), we can instead use $S(\widehat{\tau} X)$ with $\widehat{\tau}_i = X_i^T \widehat{\beta}_{RI}$ as a plug-in estimate for $S(\tau X)$. This yields tighter standard errors based on the diagonal elements of the covariance matrix.
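Computationally, $\widehat{\beta}_{RI}$ only involves the fixed design moment $S_{xx}$ and two within-arm averages. A sketch under our own naming, with `X` including a leading column of ones:

```python
import numpy as np

def beta_ri(X, Y, T):
    """Randomization-based estimator of Theorem 2 with conservative covariance."""
    n = len(Y)
    n1 = int(T.sum()); n0 = n - n1
    S_xx = X.T @ X / n                        # fixed and fully observed
    S_x1 = X[T == 1].T @ Y[T == 1] / n1       # unbiased for S_{x1}
    S_x0 = X[T == 0].T @ Y[T == 0] / n0       # unbiased for S_{x0}
    beta = np.linalg.solve(S_xx, S_x1 - S_x0)
    # Conservative covariance: drop S(tau X)/n and plug in the sample
    # covariances of the unit-level vectors Y_i X_i within each arm.
    C1 = np.atleast_2d(np.cov(X[T == 1] * Y[T == 1][:, None], rowvar=False))
    C0 = np.atleast_2d(np.cov(X[T == 0] * Y[T == 0][:, None], rowvar=False))
    S_inv = np.linalg.inv(S_xx)
    cov = S_inv @ (C1 / n1 + C0 / n0) @ S_inv
    return beta, cov
```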
Finite population asymptotic analysis. Theorem 2 holds for any finite sample. To obtain confidence intervals and to conduct hypothesis testing as we describe below, we need to prove further that $\widehat{\beta}_{RI}$ is asymptotically normal with mean $\beta$ and covariance $\mathrm{cov}(\widehat{\beta}_{RI})$. Finite population asymptotic analysis, however, has a slightly different flavor from the usual super population approach. Formally, the finite population asymptotic scheme embeds the finite population $\{(X_i, Y_i(1), Y_i(0))\}_{i=1}^n$ into a hypothetical sequence of finite populations with sizes approaching infinity. This effectively assumes that all the finite population quantities, for example, $S_{xx}$ and $\beta$, depend on $n$, although they are fixed numbers for a given finite population. Moreover, the sample quantities such as $\widehat{S}_{x1}$ and $\widehat{\beta}_{RI}$ depend on $n$ as well, and are random quantities due to the randomization of $T$. For notational simplicity, we drop the index $n$ for all these quantities. Importantly, we must impose some regularity conditions on the hypothetical sequence of finite populations. Throughout the paper, we invoke the following conditions for asymptotic analysis, which are required for a form of the finite population central limit theorem discussed in Li and Ding (2017, Theorem 5).

Condition 1. (i) Stable treatment proportions: $p_1 = n_1/n$ and $p_0 = n_0/n$ have positive limiting values; (ii) Stable means, variances and covariances: the finite population means, variances and covariances of the covariates and potential outcomes have finite limiting values; (iii) $\max_{1 \le i \le n} \|V_i - \bar{V}\|_2^2 / n \to 0$, where $V_i$ can be the covariate vector, the outcome, or products of them.
Under these conditions, we can extend Theorem 2 to a sequence of finite populations:
$$\sqrt{n}\,\big(\widehat{\beta}_{RI} - \beta\big) \xrightarrow{d} N\Big(0,\; \lim_{n \to \infty} S_{xx}^{-1}\big[ p_1^{-1} S\{Y(1)X\} + p_0^{-1} S\{Y(0)X\} - S(\tau X) \big] S_{xx}^{-1} \Big). \quad (7)$$
As a result, we can state that $\widehat{\beta}_{RI}$ is approximately normal with mean $\beta$ and covariance matrix (6), which allows us to construct confidence intervals and hypothesis tests. In our theory below, we use this informal statement instead of (7) to avoid notational complexity.

Conditions (i) and (ii) are natural. Condition (iii) holds if $V$ has more than two moments (Li and Ding, 2017). For bounded covariates and outcomes, (iii) is satisfied automatically. For more technical discussion of finite population causal inference, see Ding (2014), Aronow et al. (2014), and Middleton and Aronow (2015); for regularity conditions of the finite population central limit theorems, see Hájek (1960) and Lehmann (1998). A recent review is Li and Ding (2017).

Linear regression with treatment-covariate interactions. The results from randomization inference can shed light on the familiar case of linear regression with treatment-covariate interactions. This classical approach assumes the model
$$Y_i^{obs} = X_i^T \gamma + T_i X_i^T \beta + u_i, \quad (i = 1, \ldots, n) \quad (8)$$
where $\{u_i\}_{i=1}^n$ are errors implicitly assumed to induce the randomness, and where $\beta$ models systematic treatment effect variation, as in (3). Departing from much of the previous literature (e.g., Cox, 1984; Berrington de González and Cox, 2007; Crump et al., 2008), we study the properties of the least squares estimator under complete randomization, without assuming that model (8) is correctly specified. In particular, we do not assume any i.i.d. sampling; the assignment mechanism drives the distribution of the OLS estimator.

Theorem 3.
The OLS estimator for $\beta$ from fitting model (8) can be rewritten as
$$\widehat{\beta}_{OLS} = \widehat{S}_{xx,1}^{-1} \widehat{S}_{x1} - \widehat{S}_{xx,0}^{-1} \widehat{S}_{x0}, \quad \text{where} \quad \widehat{S}_{xx,t} = \frac{1}{n_t} \sum_{i=1}^n I(T_i = t) X_i X_i^T, \quad (t = 0, 1).$$
Over all possible randomizations of $T$, $\widehat{S}_{xx,1}^{-1} \widehat{S}_{x1}$ and $\widehat{S}_{xx,0}^{-1} \widehat{S}_{x0}$ are consistent estimates of $\gamma_1$ and $\gamma_0$, respectively; $\widehat{\beta}_{OLS}$ therefore follows an asymptotic normal distribution with mean $\beta$ and covariance matrix
$$\mathrm{cov}(\widehat{\beta}_{OLS}) = S_{xx}^{-1} \left[ \frac{S\{e(1)X\}}{n_1} + \frac{S\{e(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right] S_{xx}^{-1}, \quad (9)$$
with $e_i(1)$, $e_i(0)$, and $\varepsilon_i$ as defined after (5).

This estimate is simply the difference between $\widehat{\gamma}_{1,OLS} = \widehat{S}_{xx,1}^{-1} \widehat{S}_{x1}$ and $\widehat{\gamma}_{0,OLS} = \widehat{S}_{xx,0}^{-1} \widehat{S}_{x0}$, two OLS regressions run separately on each treatment arm. For treated units, define the residual $\widehat{e}_i = Y_i^{obs} - X_i^T \widehat{\gamma}_{1,OLS}$, and for control units, define the residual $\widehat{e}_i = Y_i^{obs} - X_i^T \widehat{\gamma}_{0,OLS}$. We can drop the unidentifiable term $S(\varepsilon X)$, estimate $S\{e(1)X\}$ and $S\{e(0)X\}$ by their sample analogues, and conservatively estimate the asymptotic covariance matrix (9) by
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{OLS}) = \widehat{S}_{xx,1}^{-1} \left[ \frac{\widehat{S}_1(\widehat{e}X)}{n_1} \right] \widehat{S}_{xx,1}^{-1} + \widehat{S}_{xx,0}^{-1} \left[ \frac{\widehat{S}_0(\widehat{e}X)}{n_0} \right] \widehat{S}_{xx,0}^{-1}.$$
This form of the sandwich variance estimator has the same probability limit as the Huber–White covariance estimator for linear model (8) (Huber, 1967; White, 1980; Lin, 2013; Angrist and Pischke, 2008).

Importantly, $\widehat{\beta}_{RI}$ and $\widehat{\beta}_{OLS}$ are quite similar in form. In particular, $\widehat{\beta}_{RI}$ uses the true $S_{xx}$ while $\widehat{\beta}_{OLS}$ separately estimates the covariance matrix for each treatment arm, $\widehat{S}_{xx,1}$ and $\widehat{S}_{xx,0}$. The latter is effectively a ratio estimator. Although this introduces some small bias (on the order of $1/n$), using the estimated $\widehat{S}_{xx,t}$ rather than the true $S_{xx}$ can often lead to gains in precision, especially when covariates are strongly correlated with the potential outcomes. In particular, the OLS estimator, by separately estimating the (known) $S_{xx}$ matrix for each treatment arm, can account for random imbalances in the covariates in both arms.

The RI estimator, by comparison, has no adjustment whatsoever, and so cannot account for such random covariate imbalances. However, in Section 3.4 below, and in the supplementary materials, we introduce a different form of adjustment that uses covariates to make the estimates of the $S_{xt}$ more precise. Depending on the structure of covariates, this estimator could be better or worse than OLS adjustment; we leave a thorough investigation of these trade-offs for future work.

Regardless, we again emphasize that we do not rely on classical OLS assumptions to justify the OLS estimator here. Rather, randomization (plus some mild regularity conditions for the finite sample asymptotics) justifies our results. For related discussion, see Cochran (1977) on ratio estimators in surveys.
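As an illustration of Theorem 3, the sketch below computes $\widehat{\beta}_{OLS}$ as the difference of two separate within-arm OLS fits, together with the conservative sandwich covariance. This reproduces the interaction coefficients from a fully-interacted regression of $Y$ on $X$ and $TX$; the names are ours.

```python
import numpy as np

def beta_ols(X, Y, T):
    """Difference of two within-arm OLS fits plus conservative sandwich."""
    est = {}
    for t in (0, 1):
        Xt, Yt = X[T == t], Y[T == t]
        nt = len(Yt)
        Sxx_t = Xt.T @ Xt / nt                  # arm-specific ratio estimate
        gamma_t = np.linalg.solve(Sxx_t, Xt.T @ Yt / nt)
        e_t = Yt - Xt @ gamma_t                 # within-arm residuals
        Se_t = np.atleast_2d(np.cov(Xt * e_t[:, None], rowvar=False))
        Sxx_inv = np.linalg.inv(Sxx_t)
        est[t] = (gamma_t, Sxx_inv @ (Se_t / nt) @ Sxx_inv)
    beta = est[1][0] - est[0][0]
    cov = est[1][1] + est[0][1]                 # drop S(eps X)/n: conservative
    return beta, cov
```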
Omnibus test for systematic treatment effect variation. Finally, we can use these results to develop an omnibus test for the presence of any systematic treatment effect variation. The null hypothesis of no treatment effect variation explained by the observed covariates can be characterized by
$$H_0(X): \beta_{-1} = 0,$$
where $\beta_{-1}$ contains all the components of $\beta$ except the first component, corresponding to the intercept. Under $H_0(X)$, the individual treatment effects have no linear dependence on $X$.

We then construct a Wald-type test for $H_0(X)$ using an estimator $\widehat{\beta}$ and its covariance estimator $\widehat{\mathrm{cov}}(\widehat{\beta})$; it could be $\widehat{\beta}_{RI}$ or $\widehat{\beta}_{OLS}$. Let $\widehat{\beta}_{-1}$ and $\widehat{\mathrm{cov}}(\widehat{\beta}_{-1})$ denote the sub-vector of $\widehat{\beta}$ and sub-matrix of $\widehat{\mathrm{cov}}(\widehat{\beta})$ corresponding to the non-intercept coordinates of $X$. We reject when
$$\widehat{\beta}_{-1}^T\, \big\{\widehat{\mathrm{cov}}(\widehat{\beta}_{-1})\big\}^{-1} \widehat{\beta}_{-1} > q_{K-1}(1 - \alpha), \quad (10)$$
where $q_{K-1}(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\chi^2$ random variable with $K - 1$ degrees of freedom. We can also include transformations of $X$ (or other basis functions) in the model for $\delta_i$ to allow for more flexible systematic treatment effect variation, which could enhance power or model more complex relationships between the covariates and treatment impact.

In the Supplementary Material, we describe two additional points about systematic treatment effect variation that we briefly address here. First, as mentioned above, we can use model-assisted estimation to improve the randomization-based estimator. In particular, improving estimation of the $\widehat{S}_{xt}$ directly improves $\widehat{\beta}_{RI}$, as the $\widehat{S}_{xt}$ are the only random components. If we replace the standard sample estimator, $\widehat{S}_{xt}$, by a more efficient, model-assisted estimator, as in survey sampling (Cochran, 1977; Särndal et al., 2003), we can achieve meaningful precision gains in practice. More importantly, this setup allows researchers to assess systematic variation across one set of covariates while adjusting for another set.

Second, under the assumption of no idiosyncratic variation (i.e., $\varepsilon_i = 0$ for all $i$), we can obtain exact inference for $\beta$ by inverting a sequence of randomization-based tests. This complements previous work on randomization-based tests for the presence of idiosyncratic treatment effect variation (Ding et al., 2016).
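Returning to the omnibus test, here is a sketch of (10). It assumes the intercept is the first coordinate of $X$, applies equally to $\widehat{\beta}_{RI}$ or $\widehat{\beta}_{OLS}$ with the matching covariance estimate, and uses SciPy for the $\chi^2$ quantile.

```python
import numpy as np
from scipy import stats

def omnibus_test(beta_hat, cov_hat, alpha=0.05):
    """Wald chi-square test of H_0(X), dropping the intercept coordinate."""
    b = beta_hat[1:]
    C = cov_hat[1:, 1:]
    w = float(b @ np.linalg.solve(C, b))      # Wald statistic
    df = len(b)                               # K - 1 degrees of freedom
    p_value = stats.chi2.sf(w, df)
    return w, p_value, w > stats.chi2.ppf(1 - alpha, df)
```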
4 Idiosyncratic treatment effect variation

After characterizing the systematic component of treatment effect variation, we now turn to characterizing the idiosyncratic component. Since this quantity is inherently unidentifiable, we propose sharp bounds on this component and a framework for sensitivity analysis. We then leverage these results to bound an $R^2$-like measure of the treatment effect variation explained by covariates.

We first define the main quantities of interest:
$$S_{\tau\tau} = \frac{1}{n} \sum_{i=1}^n (\tau_i - \tau)^2, \quad S_{\delta\delta} = \frac{1}{n} \sum_{i=1}^n (\delta_i - \tau)^2, \quad S_{\varepsilon\varepsilon} = \frac{1}{n} \sum_{i=1}^n \varepsilon_i^2,$$
with $\delta_i$ and $\varepsilon_i$ defined as in (3). Then $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$. We can immediately estimate $S_{\delta\delta}$ via the sample variance of $\{\widehat{\delta}_i = X_i^T \widehat{\beta}\}_{i=1}^n$, where $\widehat{\beta}$ is a consistent estimator, e.g., $\widehat{\beta}_{RI}$ or $\widehat{\beta}_{OLS}$. However, the idiosyncratic variance, $S_{\varepsilon\varepsilon}$, is inherently unidentifiable because it depends on the joint distribution of the potential outcomes.

We can, however, derive sharp bounds for $S_{\varepsilon\varepsilon}$. Let $F_1(y)$ and $F_0(y)$ be the empirical cumulative distribution functions of $\{e_i(1)\}_{i=1}^n$ and $\{e_i(0)\}_{i=1}^n$. Below we denote by $e(t)$ a random variable taking equal probabilities on the $n$ values of $\{e_i(t)\}_{i=1}^n$. Based on the Fréchet–Hoeffding bounds (Hoeffding, 1941; Fréchet, 1951; Nelsen, 2007), we can bound $S_{\varepsilon\varepsilon}$ as follows.

Theorem 4. $S_{\varepsilon\varepsilon}$ has sharp bounds $\underline{S}_{\varepsilon\varepsilon} \le S_{\varepsilon\varepsilon} \le \overline{S}_{\varepsilon\varepsilon}$, where
$$\underline{S}_{\varepsilon\varepsilon} = \int_0^1 \{F_1^{-1}(u) - F_0^{-1}(u)\}^2\, du, \quad \overline{S}_{\varepsilon\varepsilon} = \int_0^1 \{F_1^{-1}(u) - F_0^{-1}(1 - u)\}^2\, du,$$
with $F_t^{-1}(u) = \inf\{x : F_t(x) \ge u\}$ as the quantile function. The lower and upper bounds are attainable when $e(1)$ and $e(0)$ have the same ranks and opposite ranks, respectively.

The lower bound $\underline{S}_{\varepsilon\varepsilon}$ corresponds to a rank-preserving relationship between $e(1)$ and $e(0)$, and the upper bound $\overline{S}_{\varepsilon\varepsilon}$ corresponds to an anti-rank-preserving relationship between $e(1)$ and $e(0)$. Equivalently, they correspond to the cases where the Spearman rank correlation coefficients between $e(1)$ and $e(0)$ are $+1$ and $-1$.

In practice, we can often sharpen these bounds because we are unlikely to have negatively associated potential outcomes after adjusting for covariates. If we assume a nonnegative correlation between $e(1)$ and $e(0)$, we have the following corollary:

Corollary 1.
If the correlation between $e(1)$ and $e(0)$ is nonnegative, then the bounds for $S_{\varepsilon\varepsilon}$ become
$$\underline{S}_{\varepsilon\varepsilon} \le S_{\varepsilon\varepsilon} \le V_1 + V_0,$$
where $V_t$ is the variance of $e(t)$ for $t = 0, 1$.

We can consistently estimate each quantity: $S_{\delta\delta}$ by the sample variance of the $X_i^T \widehat{\beta}$; $F_1(y)$ and $F_0(y)$ by $\widehat{F}_1(y)$ and $\widehat{F}_0(y)$, the empirical cumulative distribution functions of the residuals $\widehat{e}_i$ under treatment and control; and $V_1$ and $V_0$ by the sample variances of the $\widehat{e}_i$ in the treatment and control groups.
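The plug-in bounds only require empirical quantiles of the two residual samples. The sketch below approximates the integrals of Theorem 4 on a midpoint grid (the grid size is an arbitrary choice of ours) and also returns the improved upper bound of Corollary 1.

```python
import numpy as np

def s_ee_bounds(e1, e0, grid=10000):
    """Plug-in Frechet-Hoeffding bounds for S_ee from residual samples."""
    u = (np.arange(grid) + 0.5) / grid        # midpoint grid on (0, 1)
    q1, q0 = np.quantile(e1, u), np.quantile(e0, u)
    lower = np.mean((q1 - q0) ** 2)           # comonotonic: same ranks
    upper = np.mean((q1 - q0[::-1]) ** 2)     # antitonic: opposite ranks
    upper_nonneg = np.var(e1) + np.var(e0)    # Corollary 1: V_1 + V_0
    return lower, upper, upper_nonneg
```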
Variance of the overall ITT estimator. We can use these results to obtain sharper bounds on the variance of Neyman (1923)'s estimate of the overall ITT, $\widehat{\tau} = n_1^{-1} \sum_{i=1}^n T_i Y_i^{obs} - n_0^{-1} \sum_{i=1}^n (1 - T_i) Y_i^{obs}$, extending previous work by Heckman et al. (1997) and Aronow et al. (2014). See also Fogarty (2016). Applying the results in Section 2 for scalar outcomes, we have the following variance for the difference-in-means estimator:
$$\mathrm{var}(\widehat{\tau}) = \frac{S\{Y(1)\}}{n_1} + \frac{S\{Y(0)\}}{n_0} - \left( \frac{S_{\delta\delta}}{n} + \frac{S_{\varepsilon\varepsilon}}{n} \right),$$
using $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$. As we discuss above, Neyman (1923) proposed a lower bound for the overall $\mathrm{var}(\widehat{\tau})$ under the assumption of a constant treatment effect, $S_{\tau\tau} = 0$. More recently, Aronow et al. (2014) instead proposed to bound $S_{\tau\tau}$ via the Fréchet–Hoeffding bounds. We can modestly improve these results by applying the Fréchet–Hoeffding bounds to $S_{\varepsilon\varepsilon}$ alone rather than to $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$. So long as $S_{\delta\delta} > 0$, this yields strictly tighter bounds on $\mathrm{var}(\widehat{\tau})$ than the corresponding bounds that do not incorporate covariate information. In turn, this gives a tighter estimate of the standard error for the same difference-in-means estimator, $\widehat{\tau}$.
A variance ratio test. Finally, while the relationship between $e(0)$ and $e(1)$ is inherently unidentifiable, there is some information in the data about the relationship between $\varepsilon_i$, the individual-level idiosyncratic treatment effect, and $Y_i(0)$, the control potential outcome. In particular, Raudenbush and Bloom (2015) noted that if the variance of the treatment potential outcomes is smaller than the variance of the control potential outcomes, then the treatment effect must be negatively associated with the control potential outcomes. In the Supplementary Material, we extend this result to incorporate covariates and propose a formal test.

Sensitivity analysis. Going beyond worst-case bounds, we can assess the sensitivity of our estimate of $S_{\varepsilon\varepsilon}$ to different assumptions on the dependence between the potential outcomes. Using the probability integral transformation, we represent the residual potential outcomes as
$$e(1) = F_1^{-1}(U_1), \quad e(0) = F_0^{-1}(U_0), \quad U_1, U_0 \sim \mathrm{Uniform}(0, 1).$$
Therefore, the dependence of the potential outcomes is determined by the dependence of the uniform random variables $U_1$ and $U_0$, which are the standardized ranks of the potential outcomes. When $U_1 = U_0$, $S_{\varepsilon\varepsilon}$ attains the lower bound $\underline{S}_{\varepsilon\varepsilon}$; when $U_1 = 1 - U_0$, $S_{\varepsilon\varepsilon}$ attains the upper bound $\overline{S}_{\varepsilon\varepsilon}$; when $U_1 \perp U_0$, $S_{\varepsilon\varepsilon}$ attains the improved upper bound $V_1 + V_0$.

Rather than simply examine extreme scenarios of $S_{\varepsilon\varepsilon}$, we can instead draw $U_1$ from a mixture of $U_0$ and another independent uniform random variable $V$:
$$U_1 = \begin{cases} U_0 & \text{with probability } \rho, \\ V & \text{with probability } 1 - \rho, \end{cases} \qquad U_0, V \stackrel{iid}{\sim} \mathrm{Uniform}(0, 1), \quad (11)$$
where the sensitivity parameter $\rho$ captures the association between $U_1$ and $U_0$. An immediate interpretation of $\rho$ is as the proportion of rank-preserved units, with the other $1 - \rho$ as the proportion of units with independent treatment and control residual outcomes. When $\rho = 0$, $U_1 \perp U_0$, and the residual potential outcomes are independent; when $\rho = 1$, $U_1 = U_0$, and the residual potential outcomes have the same ranks. Values of $\rho$ between 0 and 1 correspond to positive rank correlation but not full rank preservation. Note that the representation of the joint distribution is not unique, because we can choose any copula as a joint distribution of $(U_1, U_0)$ (Nelsen, 2007). We choose the above representation and notation $\rho$ for the following theorem.

Theorem 5.
If representation (11) holds, then $\rho$ is Spearman's rank correlation coefficient between $e(1)$ and $e(0)$. Furthermore, $S_{\varepsilon\varepsilon}$ is a linear function of $\rho$:
$$S_{\varepsilon\varepsilon}(\rho) = \rho \underline{S}_{\varepsilon\varepsilon} + (1 - \rho)(V_1 + V_0).$$

We cannot extract any information about $\rho$ from the data. We therefore treat $\rho$ as a sensitivity parameter, choose a plausible range of $\rho$, and obtain corresponding values for $S_{\varepsilon\varepsilon}$.

Treatment effect $R^2$. A natural question is the relative magnitudes of $S_{\delta\delta}$ and $S_{\varepsilon\varepsilon}$ (Djebbari and Smith, 2008). Continuing the regression analogy, this is an $R^2$-like measure for the proportion of total treatment effect variation explained by the systematic component:
$$R^2_\tau = \frac{S_{\delta\delta}}{S_{\tau\tau}} = \frac{S_{\delta\delta}}{S_{\delta\delta} + S_{\varepsilon\varepsilon}},$$
which is the ratio between the finite population variances of $\delta$ and $\tau$. As above, we can directly estimate $S_{\delta\delta}$ but must bound $S_{\varepsilon\varepsilon}$. Applying Theorem 4, we obtain the following bounds on $R^2_\tau$.

Corollary 2.
The sharp bounds on $R^2_\tau$ are
$$\frac{S_{\delta\delta}}{S_{\delta\delta} + \overline{S}_{\varepsilon\varepsilon}} \le R^2_\tau \le \frac{S_{\delta\delta}}{S_{\delta\delta} + \underline{S}_{\varepsilon\varepsilon}}.$$
If we further assume that the correlation between $e(1)$ and $e(0)$ is nonnegative, the sharp bounds on $R^2_\tau$ are
$$\frac{S_{\delta\delta}}{S_{\delta\delta} + V_1 + V_0} \le R^2_\tau \le \frac{S_{\delta\delta}}{S_{\delta\delta} + \underline{S}_{\varepsilon\varepsilon}}.$$

We estimate these bounds via plug-in estimates. Note that Djebbari and Smith (2008) explore a similar quantity by using a permutation approach to approximate the Fréchet–Hoeffding upper and lower bounds. Finally, we can use the sensitivity results for $S_{\varepsilon\varepsilon}$, with values of $\rho \in [0, 1]$, to obtain
$$R^2_\tau(\rho) = \frac{S_{\delta\delta}}{S_{\delta\delta} + S_{\varepsilon\varepsilon}(\rho)}.$$
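Theorem 5 and Corollary 2 reduce the sensitivity analysis to a one-line computation; a sketch under our naming:

```python
import numpy as np

def r2_tau_curve(s_dd, s_ee_lower, v1, v0, rhos=None):
    """R^2_tau(rho) along a grid of the sensitivity parameter (Theorem 5)."""
    rhos = np.linspace(0.0, 1.0, 101) if rhos is None else rhos
    s_ee = rhos * s_ee_lower + (1.0 - rhos) * (v1 + v0)
    return rhos, s_dd / (s_dd + s_ee)
```

Plotting the returned values against $\rho$ gives the kind of sensitivity curves we report in the application below.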
5 Noncompliance

We now extend our results to allow for noncompliance. Let $T$ be the indicator of treatment assigned, $D$ be the indicator of treatment received, $Y$ be the outcome of interest, and $X$ be pretreatment covariates. Under the Stable Unit Treatment Value Assumption, we define $D_i(t)$ and $Y_i(t)$ as the potential outcomes for unit $i$ under treatment assignment $t$. Following Angrist et al. (1996) and Frangakis and Rubin (2002), we can classify units into four compliance types based on the joint values of $D_i(1)$ and $D_i(0)$:
$$U_i = \begin{cases} \text{Always Taker } (a) & \text{if } D_i(1) = 1, D_i(0) = 1, \\ \text{Never Taker } (n) & \text{if } D_i(1) = 0, D_i(0) = 0, \\ \text{Complier } (c) & \text{if } D_i(1) = 1, D_i(0) = 0, \\ \text{Defier } (d) & \text{if } D_i(1) = 0, D_i(0) = 1. \end{cases}$$
Denote by $n_u$ and $\pi_u$ the number and proportion of units of compliance type $U = u$, for $u = a, n, c, d$.

Throughout our discussion, we invoke the following assumptions, which are commonly used for analyzing randomized experiments with noncompliance.

Assumption 1. (i) Monotonicity: $D_i(1) \ge D_i(0)$; (ii) Exclusion restrictions for Always Takers and Never Takers: $Y_i(1) = Y_i(0)$ for all units with $D_i(1) = D_i(0)$; (iii) Strong instrument: $\pi_c > C > 0$, where $C$ is a positive constant independent of the sample size.

Monotonicity rules out the existence of Defiers, i.e., $\pi_d = 0$. Under monotonicity, we can estimate the proportions $\pi_u$ using the observed counts of units classified by $T$ and $D$: let $n_{td} = \#\{i : T_i = t, D_i = d\}$, and then $\widehat{\pi}_n = n_{10}/n_1$, $\widehat{\pi}_a = n_{01}/n_0$, and $\widehat{\pi}_c = n_{11}/n_1 - n_{01}/n_0$. The exclusion restrictions assume that treatment assignment has no effect on the outcome for Always Takers and Never Takers. As a result, treatment effect variation is trivially zero for Always Takers and Never Takers. Note that this is the unit-level exclusion restriction imposed in Angrist et al. (1996). This can be relaxed in other settings; for example, we could assume the impact of randomization for these groups is zero on average (see Imbens and Rubin, 2015). Finally, to avoid technical complexity, we rule out the weak instrument case (Bound et al., 1995; Staiger and Stock, 1997), i.e., the case where $\pi_c$ is within a small neighborhood of 0 with radius shrinking to 0.
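These moment estimators are simple functions of the observed $(T, D)$ cross-tabulation; a minimal sketch, with our own naming:

```python
import numpy as np

def compliance_proportions(T, D):
    """Moment estimates of the compliance-type proportions under monotonicity."""
    T, D = np.asarray(T), np.asarray(D)
    n1, n0 = (T == 1).sum(), (T == 0).sum()
    pi_n = ((T == 1) & (D == 0)).sum() / n1          # pi_n = n_10 / n_1
    pi_a = ((T == 0) & (D == 1)).sum() / n0          # pi_a = n_01 / n_0
    pi_c = ((T == 1) & (D == 1)).sum() / n1 - pi_a   # pi_c = n_11/n_1 - n_01/n_0
    return pi_c, pi_a, pi_n
```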
We are interested in treatment effect variation among Compliers, which motivates the following decomposition:
$$\tau_i = Y_i(1) - Y_i(0) = \begin{cases} 0, & \text{if } U_i = a \text{ or } n, \\ X_i^T \beta_c + \varepsilon_i, & \text{if } U_i = c, \end{cases} \quad (12)$$
where $\beta_c$ is the regression coefficient of $\tau_i$ on $X_i$ among Compliers, analogous to (3).

We now extend the results of Section 3 to estimate systematic treatment effect variation among Compliers. Define
$$S_{xx,u} = \frac{1}{n_u} \sum_{i=1}^n I(U_i = u) X_i X_i^T, \quad S_{xt,u} = \frac{1}{n_u} \sum_{i=1}^n I(U_i = u) Y_i(t) X_i, \quad (t = 0, 1; \; u = a, c, n).$$
Then, analogous to (5),
$$\beta_c = S_{xx,c}^{-1}(S_{x1,c} - S_{x0,c}) = S_{xx,c}^{-1} S_{x1,c} - S_{xx,c}^{-1} S_{x0,c} \equiv \gamma_{1c} - \gamma_{0c}, \quad (13)$$
where $\gamma_{1c} = S_{xx,c}^{-1} S_{x1,c}$ and $\gamma_{0c} = S_{xx,c}^{-1} S_{x0,c}$ are the linear regression coefficients of $Y(1)$ and $Y(0)$ on the covariates among Compliers.

Unlike in the ITT case, we cannot estimate these quantities directly. Instead, following standard results from noncompliance (e.g., Angrist et al., 1996; Abadie, 2003; Angrist and Pischke, 2008), we use estimates from observed subgroups to estimate the desired quantities of interest. Define sample moments:
$$\widehat{S}_{xx,td} = \frac{1}{n_t} \sum_{i=1}^n I(T_i = t) I(D_i = d) X_i X_i^T, \quad \widehat{S}_{x,td} = \frac{1}{n_t} \sum_{i=1}^n I(T_i = t) I(D_i = d) Y_i^{obs} X_i, \quad (t, d = 0, 1). \quad (14)$$
The following theorem connects these quantities with the finite population quantities in (13).

Theorem 6.
Over all possible randomizations of a completely randomized experiment, both $\widehat{S}_{xx}(1) = \widehat{S}_{xx,11} - \widehat{S}_{xx,01}$ and $\widehat{S}_{xx}(0) = \widehat{S}_{xx,00} - \widehat{S}_{xx,10}$ are unbiased for $\pi_c S_{xx,c}$, and
$$E(\widehat{S}_{x,11} - \widehat{S}_{x,01}) = \pi_c S_{x1,c}, \quad E(\widehat{S}_{x,00} - \widehat{S}_{x,10}) = \pi_c S_{x0,c}. \quad (15)$$

This theorem shows that we can obtain unbiased estimates for all terms in (13). The following corollary shows that we can then obtain consistent estimates for $\gamma_{1c}$, $\gamma_{0c}$, and $\beta_c$, recalling that in the asymptotic analysis we need to embed $\{X_i, Y_i(1), Y_i(0), D_i(1), D_i(0)\}_{i=1}^n$ into a hypothetical sequence of finite populations under Condition 1.

Corollary 3. $\widehat{\gamma}_{1c,RI} = \widehat{S}_{xx}(1)^{-1}(\widehat{S}_{x,11} - \widehat{S}_{x,01})$ and $\widehat{\gamma}_{0c,RI} = \widehat{S}_{xx}(0)^{-1}(\widehat{S}_{x,00} - \widehat{S}_{x,10})$ are consistent for $\gamma_{1c}$ and $\gamma_{0c}$. Furthermore, $\widehat{\beta}_{c,RI} = \widehat{\gamma}_{1c,RI} - \widehat{\gamma}_{0c,RI}$ is consistent for $\beta_c$ and follows an asymptotic normal distribution with covariance matrix
$$\mathrm{cov}(\widehat{\beta}_{c,RI}) = (\pi_c S_{xx,c})^{-1} \left[ \frac{S\{e'(1)X\}}{n_1} + \frac{S\{e'(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right] (\pi_c S_{xx,c})^{-1}, \quad (16)$$
where we define the residual potential outcomes to be:
$$e_i'(1) = \begin{cases} Y_i(1) - X_i^T \gamma_{1c}, & U_i = a, \\ Y_i(1) - X_i^T \gamma_{0c}, & U_i = n, \\ Y_i(1) - X_i^T \gamma_{1c}, & U_i = c, \end{cases} \qquad e_i'(0) = \begin{cases} Y_i(0) - X_i^T \gamma_{1c}, & U_i = a, \\ Y_i(0) - X_i^T \gamma_{0c}, & U_i = n, \\ Y_i(0) - X_i^T \gamma_{0c}, & U_i = c. \end{cases} \quad (17)$$
The idiosyncratic variation is $\varepsilon_i = e_i'(1) - e_i'(0)$ for unit $i$, with $\varepsilon_i = 0$ for Never Takers and Always Takers, and with $\varepsilon_i$ for Compliers as in (12). The two sets of residuals are not formed from a regression on all units, but instead from the population regression on Compliers alone. As in the ITT case, we can estimate $S\{e'(1)X\}$ and $S\{e'(0)X\}$ using their sample analogues; $S(\varepsilon X)$, however, is unidentifiable. For units with $D_i = 1$, we define the residual $\widehat{e}_i' = Y_i^{obs} - X_i^T \widehat{\gamma}_{1c,RI}$, and for units with $D_i = 0$, we define the residual $\widehat{e}_i' = Y_i^{obs} - X_i^T \widehat{\gamma}_{0c,RI}$. Therefore, we can obtain a conservative estimate of the asymptotic covariance (16) in the following sandwich form:
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{c,RI}) = \widehat{S}_{xx}(1)^{-1} \left[ \frac{\widehat{S}_1(\widehat{e}'X)}{n_1} \right] \widehat{S}_{xx}(1)^{-1} + \widehat{S}_{xx}(0)^{-1} \left[ \frac{\widehat{S}_0(\widehat{e}'X)}{n_0} \right] \widehat{S}_{xx}(0)^{-1}.$$
As with the ITT analog, so long as we have Assumption 1, randomization itself fully justifies the theorem and estimators, without relying on a model of the observed outcomes.
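A sketch of the Complier estimator $\widehat{\beta}_{c,RI}$ from Corollary 3, built from the four $(T, D)$-cell moments in (14) and differenced as in Theorem 6; names are ours.

```python
import numpy as np

def beta_c_ri(X, Y, T, D):
    """Complier coefficient via differenced (T, D)-cell moments (Theorem 6)."""
    n1, n0 = (T == 1).sum(), (T == 0).sum()

    def cell(t, d, nt):
        m = (T == t) & (D == d)
        return X[m].T @ X[m] / nt, X[m].T @ Y[m] / nt

    Sxx_11, Sxy_11 = cell(1, 1, n1)   # Compliers + Always Takers, treated
    Sxx_01, Sxy_01 = cell(0, 1, n0)   # Always Takers, control
    Sxx_00, Sxy_00 = cell(0, 0, n0)   # Compliers + Never Takers, control
    Sxx_10, Sxy_10 = cell(1, 0, n1)   # Never Takers, treated
    gamma_1c = np.linalg.solve(Sxx_11 - Sxx_01, Sxy_11 - Sxy_01)
    gamma_0c = np.linalg.solve(Sxx_00 - Sxx_10, Sxy_00 - Sxy_10)
    return gamma_1c - gamma_0c
```

The same differenced matrices $\widehat{S}_{xx}(1)$ and $\widehat{S}_{xx}(0)$ reappear in the sandwich covariance above.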
Two-stage least squares. We now turn to the standard two-stage least squares (TSLS) setting in econometrics (e.g., Angrist and Pischke, 2008). First, we impose a linear regression model with treatment-covariate interactions:
$$Y_i^{obs} = X_i^T \gamma + D_i X_i^T \beta + u_i, \quad (i = 1, \ldots, n).$$
This model does not, however, account for the dependence between $D_i$ and $u_i$. In the language of econometrics, the treatment received is "endogenous," i.e., $D_i$ and the error term $u_i$ are assumed to be correlated; we therefore use $T_i$ as an instrument for $D_i$. The TSLS estimates $(\widehat{\gamma}_{TSLS}, \widehat{\beta}_{TSLS})$ are the solutions to the following estimating equations:
$$n^{-1} \sum_{i=1}^n \begin{pmatrix} X_i \\ T_i X_i \end{pmatrix} \big(Y_i^{obs} - X_i^T \widehat{\gamma}_{TSLS} - D_i X_i^T \widehat{\beta}_{TSLS}\big) = 0. \quad (18)$$
This approach is based on $M$-estimation, though there are many other ways to formalize the TSLS estimator (e.g., Imbens, 2014). The following theorem shows that the fully-interacted TSLS estimator $\widehat{\beta}_{TSLS}$ is consistent for $\beta_c$ across randomizations.

Theorem 7.
Over all randomizations, the TSLS estimator $\widehat{\beta}_{TSLS}$ follows an asymptotic normal distribution with mean $\beta_c$ and covariance matrix
$$(\pi_c S_{xx,c})^{-1} \left[ \frac{S\{e''(1)X\}}{n_1} + \frac{S\{e''(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right] (\pi_c S_{xx,c})^{-1},$$
where the residual potential outcomes are defined as
$$e_i''(1) = \begin{cases} Y_i(1) - X_i^T(\gamma_\infty + \beta_c), & U_i = a, \\ Y_i(1) - X_i^T \gamma_\infty, & U_i = n, \\ Y_i(1) - X_i^T(\gamma_\infty + \beta_c), & U_i = c, \end{cases} \qquad e_i''(0) = \begin{cases} Y_i(0) - X_i^T(\gamma_\infty + \beta_c), & U_i = a, \\ Y_i(0) - X_i^T \gamma_\infty, & U_i = n, \\ Y_i(0) - X_i^T \gamma_\infty, & U_i = c, \end{cases}$$
where $\gamma_\infty$ is the probability limit of the TSLS regression coefficient $\widehat{\gamma}_{TSLS}$, and the idiosyncratic treatment effect is $\varepsilon_i \equiv e_i''(1) - e_i''(0)$.

For variance estimation, define the residual as $\widehat{e}_i'' = Y_i^{obs} - X_i^T(\widehat{\gamma}_{TSLS} + \widehat{\beta}_{TSLS})$ for units with $D_i = 1$, and $\widehat{e}_i'' = Y_i^{obs} - X_i^T \widehat{\gamma}_{TSLS}$ for units with $D_i = 0$. We can then use the following sandwich variance estimator:
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{TSLS}) = \widehat{S}_{xx}(1)^{-1} \left[ \frac{\widehat{S}_1(\widehat{e}''X)}{n_1} \right] \widehat{S}_{xx}(1)^{-1} + \widehat{S}_{xx}(0)^{-1} \left[ \frac{\widehat{S}_0(\widehat{e}''X)}{n_0} \right] \widehat{S}_{xx}(0)^{-1},$$
which has the same probability limit as the Huber–White covariance estimator for $\widehat{\beta}_{TSLS}$. Therefore, the randomization itself effectively justifies the use of TSLS for estimating systematic treatment effect variation among Compliers, extending our ITT results.

Finally, while $\widehat{\beta}_{TSLS}$ is a consistent estimator for $\beta_c$, $\widehat{\gamma}_{TSLS}$ is not, in general, a consistent estimator for $\gamma_{0c}$; that is, $\gamma_\infty \ne \gamma_{0c}$. Instead, $\widehat{\gamma}_{TSLS}$ converges to $\gamma_\infty = S_{xx}^{-1} S_{x0} - \pi_a S_{xx}^{-1} S_{xx,a} \beta_c$. In the special case of one-sided noncompliance (i.e., $\pi_a = 0$), $\gamma_\infty = \gamma_0 = S_{xx}^{-1} S_{x0}$, the population OLS regression coefficient, among all Compliers and Never Takers, of $Y(0)$ on the covariates.
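Because (18) is just-identified, $(\widehat{\gamma}_{TSLS}, \widehat{\beta}_{TSLS})$ can be computed in one line of linear algebra, instrumenting the regressors $(X_i, D_i X_i)$ with $(X_i, T_i X_i)$. A sketch, with our own naming:

```python
import numpy as np

def beta_tsls(X, Y, T, D):
    """Fully-interacted TSLS: instrument (X, T*X) for regressors (X, D*X)."""
    T, D = np.asarray(T, float), np.asarray(D, float)
    K = X.shape[1]
    Z = np.hstack([X, T[:, None] * X])          # instruments
    W = np.hstack([X, D[:, None] * X])          # endogenous design
    theta = np.linalg.solve(Z.T @ W, Z.T @ Y)   # just-identified IV solution
    return theta[:K], theta[K:]                 # (gamma_hat, beta_hat)
```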
Omnibus test for systematic treatment effect variation among Compliers. With a point estimate $\widehat{\beta}$ and covariance estimate $\widehat{\mathrm{cov}}(\widehat{\beta})$ for $\beta_c$, we can use the same Wald-type $\chi^2$ test as in (10) for the presence of systematic treatment effect variation among Compliers. Here, the estimator can be either the randomization-based $\widehat{\beta}_{c,RI}$ or the TSLS estimator $\widehat{\beta}_{TSLS}$; the degrees of freedom are the same, $K - 1$. Unlike in the ITT case, we are not aware of existing tests for systematic treatment effect variation among Compliers.
Decomposing overall treatment effect variation. We now turn to decomposing the overall treatment effect variation in the presence of noncompliance. In this setting, we have three sources of treatment effect variation: (i) systematic treatment effect variation among Compliers, (ii) idiosyncratic treatment effect variation among Compliers, and (iii) treatment effect variation due to noncompliance.

First, recall that the total treatment effect variation is $S_{\tau\tau} = \sum_{i=1}^n (\tau_i - \tau)^2 / n$. We can define a similar quantity among Compliers:
$$S_{\tau\tau,c} = \frac{1}{n_c} \sum_{i=1}^n I(U_i = c)(\tau_i - \tau_c)^2.$$
As in Section 4, we can decompose this variation into systematic and idiosyncratic treatment effect variation for Compliers, respectively:
$$S_{\delta\delta,c} = \frac{1}{n_c} \sum_{i=1}^n I(U_i = c)(\delta_i - \tau_c)^2, \quad S_{\varepsilon\varepsilon,c} = \frac{1}{n_c} \sum_{i=1}^n I(U_i = c)\, \varepsilon_i^2.$$
Because treatment effects for Never Takers and Always Takers are zero, there is no treatment effect variation within these groups. The component of treatment effect variation due to compliance status is
$$S_{\tau\tau,U} = \sum_{u = c, a, n} \pi_u (\tau_u - \tau)^2.$$
Using $\tau_a = \tau_n = 0$ and $\tau = \pi_c \tau_c$, which follow from the exclusion restrictions, we have the following theorem summarizing the relationships among the above components.

Theorem 8. $S_{\tau\tau} = \pi_c S_{\tau\tau,c} + S_{\tau\tau,U}$, $S_{\tau\tau,c} = S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}$, and $S_{\tau\tau,U} = \pi_c(1 - \pi_c)\tau_c^2$.
In words, total treatment effect variation has three parts: (i) systematic treatment effect variation among Compliers, $\pi_c S_{\delta\delta,c}$; (ii) idiosyncratic treatment effect variation among Compliers, $\pi_c S_{\varepsilon\varepsilon,c}$; and (iii) treatment effect variation due to noncompliance, $S_{\tau\tau,U}$.

As in the ITT case, even though $S_{\varepsilon\varepsilon,c}$ is not identifiable, we can derive bounds in terms of the marginal distributions of the residuals, $\{e_i'(1) = Y_i(1) - X_i^T \gamma_{1c} : U_i = c, i = 1, \ldots, n\}$ and $\{e_i'(0) = Y_i(0) - X_i^T \gamma_{0c} : U_i = c, i = 1, \ldots, n\}$, denoted by $F_{1c}(y)$ and $F_{0c}(y)$, and with marginal variances $V_{1c}$ and $V_{0c}$. Once we estimate these quantities, we can plug them into Theorem 4 and Corollary 1 to get our bounds. As compliance status is only partially observed, we have to estimate these quantities by differencing observed distributions; we defer this and some other technical details to the Supplementary Material.
Since there are two sources of variation (covariates and noncompliance), there are three possible $R^2$-type measures. First, we can measure the treatment effect variation explained by noncompliance alone (i.e., only $U$):
$$R^2_{\tau,U} = \frac{S_{\tau\tau,U}}{S_{\tau\tau}} = \frac{S_{\tau\tau,U}}{S_{\tau\tau,U} + \pi_c S_{\tau\tau,c}} = \frac{S_{\tau\tau,U}}{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c} + \pi_c S_{\varepsilon\varepsilon,c}}.$$
Second, we can measure the proportion of treatment effect variation among Compliers explained by covariates (i.e., only $X$):
$$R^2_{\tau,c} = \frac{S_{\delta\delta,c}}{S_{\tau\tau,c}} = \frac{S_{\delta\delta,c}}{S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}}.$$
Third, we can measure the treatment effect variation explained by covariates and noncompliance together (i.e., both $X$ and $U$):
$$R^2_{\tau,UX} = \frac{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c}}{S_{\tau\tau}} = \frac{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c}}{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c} + \pi_c S_{\varepsilon\varepsilon,c}}.$$
For each measure, we can use tailored versions of Corollary 1 to construct bounds, or conduct sensitivity analysis as in Section 4.2, with the sensitivity parameter expressed as the Spearman correlation between the treatment and control potential outcomes among Compliers.
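Given plug-in estimates of the components, the three measures are simple ratios; a sketch under our naming, where bounds or sensitivity values for $S_{\varepsilon\varepsilon,c}$ enter through the `s_ee_c` argument:

```python
def r2_noncompliance(pi_c, tau_c, s_dd_c, s_ee_c):
    """The three R^2-type measures above, using the identities of Theorem 8."""
    s_tt_U = pi_c * (1.0 - pi_c) * tau_c ** 2        # variation from noncompliance
    total = s_tt_U + pi_c * (s_dd_c + s_ee_c)        # = S_tau_tau
    r2_U = s_tt_U / total                            # noncompliance only
    r2_c = s_dd_c / (s_dd_c + s_ee_c)                # covariates, among Compliers
    r2_UX = (s_tt_U + pi_c * s_dd_c) / total         # covariates and noncompliance
    return r2_U, r2_c, r2_UX
```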
6 Simulation study
We simulate completely randomized experiments to evaluate the finite sample performance of the tests for systematic treatment effect variation based on $\widehat{\beta}_{OLS}$, $\widehat{\beta}_{RI}$, and $\widehat{\beta}_{RI}^w$, the model-assisted version discussed in the Supplementary Material. Our data generating process is inspired by the Head Start Impact Study (HSIS), analyzed in the next section. For a given sample size, we first generate four independent covariates: $X_1$, a standard normal; $X_2$, a binary covariate; $X_3$, a binary covariate with probability 0.25 of being 1; and $X_4$, a standard normal.
The control potential outcomes are then generated from a linear model in the four covariates,
$$Y_i(0) = a_0 + a_1 X_{1i} + a_2 X_{2i} - a_3 X_{3i} + a_4 X_{4i} + u_i, \quad u_i \sim N(0, \sigma^2),$$
with positive constants $a_k$. We select $\sigma^2 = 0.26$ to make the marginal variance of the control potential outcomes 1; thus we can interpret impacts in "effect size" units.
The $R^2$ of regressing $Y(0)$ on the covariates is approximately 0.74, due to the "pre-test"-like variable $X_{4i}$. Without $X_{4i}$, the $R^2$ is about 0.09. The treatment effects are $\tau_i = \delta_i + \varepsilon_i$, with (i) either $\delta_i = 0.3$ for all $i$, or $\delta_i$ a linear function of two of the covariates; and (ii) either $\varepsilon_i = 0$ for all $i$, or $\varepsilon_i$ drawn as mean-zero Gaussian noise. All combinations of these two options give the four cases of (a) no treatment effect variation, (b) only systematic variation, (c) idiosyncratic variation with no systematic variation, and (d) both systematic and idiosyncratic variation. For an $\alpha$-level test of systematic variation, scenarios (a) and (c) should only reject at rate $\alpha$, while we would like to see high rejection rates for scenarios (b) and (d). For scenario (d), the $R^2_\tau$ is about 0.5; systematic variation explains a good share of the overall variation.
To generate a synthetic dataset, we generated all potential outcomes, randomized units into treatment with probability 0.6, and then calculated the corresponding observed outcomes. We then conducted a test for systematic variation using each of our three estimators. For $\widehat{\beta}_{RI}$ and $\widehat{\beta}_{OLS}$ we use $X_1$, $X_2$, $X_3$. For our covariate-adjusted estimator $\widehat{\beta}_{RI}^w$ we also include the fairly predictive $X_4$ for adjustment.
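For concreteness, here is a sketch of this data generating process. The values marked PLACEHOLDER are illustrative choices of ours; the corresponding constants did not survive in the text above, which pins down only $\sigma^2 = 0.26$, the probability 0.25 for $X_3$, the constant effect 0.3, and the assignment probability 0.6.

```python
import numpy as np

rng = np.random.default_rng(2017)

def simulate(n, systematic=True, idiosyncratic=True):
    X1, X4 = rng.standard_normal(n), rng.standard_normal(n)
    X2 = rng.binomial(1, 0.5, n).astype(float)     # PLACEHOLDER probability
    X3 = rng.binomial(1, 0.25, n).astype(float)
    # PLACEHOLDER coefficients for the Y(0) model; sigma^2 = 0.26 is given.
    Y0 = (0.2 + 0.2 * X1 + 0.4 * X2 - 0.4 * X3 + 0.8 * X4
          + rng.normal(0.0, np.sqrt(0.26), n))
    if systematic:
        delta = 0.1 + 0.2 * X1 + 0.2 * X2          # PLACEHOLDER coefficients
    else:
        delta = np.full(n, 0.3)                    # constant effect
    eps = rng.normal(0.0, 0.3, n) if idiosyncratic else 0.0  # PLACEHOLDER sd
    Y1 = Y0 + delta + eps
    T = rng.binomial(1, 0.6, n)                    # assignment probability 0.6
    Y = np.where(T == 1, Y1, Y0)
    X = np.column_stack([np.ones(n), X1, X2, X3])  # covariates used in tests
    return X, Y, T
```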
Figure 1 shows the power of these tests, with $\alpha = 0.05$, for different sample sizes. First, all estimators appear asymptotically valid, consistent with the theoretical results. The OLS and adjusted estimators are slightly anti-conservative for small $n$, however, with rejection rates of around 9%. Second, the OLS estimator appears to have the greatest power in this setting, which is unsurprising since the true data generating process is a linear model. Finally, covariate adjustment slightly improves the power of the RI estimator. Overall, in the scenarios we consider, we only achieve decent levels of power in large samples, although there seems to be reasonable power for the sample size in the data application, $n = 3{,}586$.
Figure 1: Power of the tests based on $\widehat{\beta}_{RI}$, $\widehat{\beta}_{OLS}$, and $\widehat{\beta}_{RI}^w$, by sample size $N$, for scenarios (a) none, (b) systematic, (c) idiosyncratic, and (d) systematic + idiosyncratic.
We next simulate completely randomized experiments with noncompliance to evaluate the finite sample performance of the tests for systematic treatment effect variation among Compliers based on $\widehat{\beta}_{c,RI}$ and $\widehat{\beta}_{TSLS}$. We first generated a complete dataset as in the ITT case above, and then assigned strata membership to all units with probabilities proportional to their covariates. For Always Takers we then set $Y_i(0) = Y_i(1)$, and for Never Takers, $Y_i(1) = Y_i(0)$. The overall ITT is now reduced to 0.21 (due to the zero effects for Never Takers and Always Takers), although the CACE is still approximately 0.3. The proportion of Compliers is approximately 68%.

The Compliers have the systematic and idiosyncratic effects described above. We tested for the presence of systematic variation for Compliers under the exclusion restrictions. Figure 2 shows the power of these tests for our RI and TSLS estimators. First, in this scenario, the TSLS and RI estimators are virtually equivalent; the additional adjustment provided by TSLS does not add significantly to the precision. We see the tests are valid (they even appear conservative) for cases (a) and (c). Power is reduced compared to the ITT simulation; this is reasonable, as power is effectively a function of the number of Compliers, with additional uncertainty due to partial information about the identity of Compliers.
Figure 2: Power of the tests based on $\widehat{\beta}_{c,RI}$ and $\widehat{\beta}_{TSLS}$, by sample size $N$, for the same four scenarios as in Figure 1.
7 Application to the Head Start Impact Study
Established in 1965, Head Start is the largest Federal preschool program in the United States, serving nearly 1 million low-income three- and four-year-old children each year at a cost of over $7 billion (Administration for Children and Families, 2015). Researchers and policymakers have debated Head Start's effectiveness since its inception, with early randomized trials finding limited impacts (e.g., Westinghouse Learning Corporation, 1969) and quasi-experimental studies showing much larger effects (e.g., Currie and Thomas, 1995). Designed in part to settle this debate, the Head Start Impact Study (HSIS) is a large-scale, nationally representative randomized trial of Head Start first launched in 2002 (Puma et al., 2010). The Congressional mandate for HSIS included two broad questions: (1) the program's overall impact, and (2) how impacts vary across children and centers. The policy debate has largely focused on this first question; HSIS only found modest average effects on a range of children's cognitive and social-emotional outcomes. However, both the original study and several recent papers argue that these topline results mask important treatment effect variation (e.g., Bloom and Weiland, 2014; Bitler et al., 2014; Ding et al., 2016; Walters, 2015; Feller et al., 2016). Understanding such variation is critical both for assessing the program's benefits and costs and for improving the practice and science of early childhood education.

HSIS collected a rich set of covariates about children and their families, including pre-test score, child's age, child's race, child's home language, mother's education level, and mother's marital status. At the same time, many potentially important covariates are unavailable. For instance, while families must be low-income to be eligible for Head Start, HSIS does not include information on families' actual income nor other financial details that could be important predictors of program impact. In addition, Feller et al. (2016) and others argue that the setting in which a child would otherwise receive care is an important source of impact variation, although this is not directly observable.

We now use the methods outlined above to assess treatment effect variation in HSIS. The original study included $n = 4{,}400$ total children, with $n_1 = 2{,}644$ in the treatment group and $n_0 = 1{,}796$ in the control group. Following earlier analyses (Ding et al., 2016) and to simplify exposition, we restrict our attention to a complete-case subset of the HSIS, with $n_1 = 2{,}238$ in the treatment group and $n_0 = 1{,}348$ in the control group (so $p_1 \approx 0.62$ and $p_0 \approx 0.38$).

We first explore treatment effect variation for the ITT estimate, beginning with estimating systematic treatment effect variation. We examine three estimators: the randomization-based and OLS estimators discussed in Section 3, $\widehat{\beta}_{RI}$ and $\widehat{\beta}_{OLS}$, and the corresponding model-assisted version of the RI estimator discussed in the Supplementary Material, $\widehat{\beta}_{RI}^w$. For this latter estimator, we use all available covariates to adjust the standard estimators; that is, $W$ is the entire vector of covariates.

Omnibus test for systematic treatment effect variation.
We begin by using these estimators for an omnibus test of whether any treatment effect variation is explained by the full set of covariates. The p-value for the unadjusted $\widehat{\beta}_{RI}$ estimator is 0.39; the model-assisted $\widehat{\beta}_{RI}^w$ and, especially, $\widehat{\beta}_{OLS}$ give smaller p-values. While we expect the unadjusted $\widehat{\beta}_{RI}$ to have the lowest power, it is instructive that the p-value for $\widehat{\beta}_{OLS}$ is substantially smaller than the p-value for the covariate-adjusted $\widehat{\beta}_{RI}^w$. As we discuss in Section 3.2, $\widehat{\beta}_{OLS}$ can account for covariate imbalance across experimental arms by estimating the $S_{xx}$ matrix separately for the treatment and control groups. By contrast, $\widehat{\beta}_{RI}$ does not address imbalance in $X$ and instead attempts to residualize out the $Y$ in order to get a more precise estimate of the relationship of the $X$ to $Y$ for each treatment arm. Based on the discrepancy in p-values, adjusting for baseline imbalance is clearly important in this example.

Treatment effect $R^2_\tau$. Next, we examine how much of the variation could be explained by our covariates. Figure 3a shows values of the treatment effect $R^2_\tau$ using $\widehat{\beta}_{RI}^w$ to estimate the systematic variation. Results are nearly identical using the other estimators. In the worst case of perfect negative dependence between the potential outcomes (not shown), the treatment effect $R^2_\tau$ could be smaller still. Assuming a nonnegative correlation between the residual potential outcomes, $R^2_\tau$ ranges from 0.03 to 0.76 as the sensitivity parameter $\rho$ ranges over $[0, 1]$. While the estimate is clearly sensitive to the unidentifiable sensitivity parameter, the covariates explain a substantial proportion of treatment effect variation for values of $\rho$ near 1.

Figure 3: Treatment effect $R^2_\tau$, with sensitivity parameter $\rho \in [0, 1]$: (a) overall $R^2_\tau$ for the ITT (covariates only, $X$) and the LATE (covariates only, $X$; covariates and compliance, $X + U$); (b) ITT treatment effect $R^2$ with $\rho = 1$, separately by covariate (child's age, both parents live with child, child is male, mother's education level, mother's marital status, pre-test score, caregiver's age, child's race, dual-language learner, mother is recent immigrant).

We can also use this framework to assess the relative importance of each covariate in terms of explaining overall treatment effect variation. To do this, we use the model-assisted RI estimator, $\widehat{\beta}_{RI}^w$, adjusting for all covariates (i.e., $\dim(W) = 17$) but restricting systematic treatment effect variation to one covariate at a time. Note that we consider factors (e.g., race) as a group. Figure 3b shows the resulting estimates for the upper bound of $R^2_\tau$, with lower bound estimates all below 0.01. Having a mother who is a recent immigrant and dual-language learner status (which are highly correlated in practice) could each explain a substantial proportion of treatment effect variation, consistent with previous results from Bloom and Weiland (2014) and Bitler et al. (2014). This is not true for other covariates, like mother's education level.
Negative correlation between treatment effect and control potential outcomes.
Finally, we test whether the individual-level idiosyncratic treatment effects, $\{\varepsilon_i\}_{i=1}^n$, are negatively correlated with the control potential outcomes, $\{Y_i(0)\}_{i=1}^n$, extending results from Raudenbush and Bloom (2015). As outlined in the Supplementary Material, we do so by testing whether the variance of $\{Y_i^{\mathrm{obs}} - X_i^T\hat{\beta}^{w}_{\mathrm{RI}} : T_i = 1\}$ is smaller than the variance of $\{Y_i^{\mathrm{obs}} : T_i = 0\}$. This yields a $p$-value of 0.02, which suggests that the unexplained treatment effect is indeed larger for smaller values of the control potential outcomes. This result is consistent with findings from Bitler et al. (2014), who use a quantile treatment effect approach.
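The test itself is straightforward to compute. Below is a minimal sketch of the variance-ratio statistic from the Supplementary Material (Theorem A.3); the function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

def variance_ratio_test(r1, y0, alpha=0.05):
    """One-sided variance-ratio test: reject when the beta-adjusted
    treated outcomes r1 have significantly smaller variance than the
    control outcomes y0 (Supplementary Material, Theorem A.3).
    """
    n1, n0 = len(r1), len(y0)
    s1, s0 = np.var(r1, ddof=1), np.var(y0, ddof=1)
    k1 = stats.kurtosis(r1, fisher=False)      # sample kurtosis of each group
    k0 = stats.kurtosis(y0, fisher=False)
    z = (np.log(s1) - np.log(s0)) / np.sqrt((k1 - 1) / n1 + (k0 - 1) / n0)
    return z, z < stats.norm.ppf(alpha)        # reject for small z
```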
Accounting for noncompliance.
As with many social experiments, there is substantial noncompliance with random assignment in HSIS. In the analysis sample we consider here, the estimated proportions of the compliance types are $\hat{\pi}_c = 0.69$ for Compliers, $\hat{\pi}_a = 0.13$ for Always Takers, and $\hat{\pi}_n = 0.18$ for Never Takers. Given the exclusion restrictions for Always Takers and Never Takers, the treatment effect is therefore zero (by assumption) for over 30 percent of the sample, suggesting that noncompliance will be an important component of treatment effect variation.

In the setting with noncompliance, we focus on two estimators for systematic treatment effect variation among Compliers: the randomization-based estimator, $\hat{\beta}_{c,\mathrm{RI}}$, and the two-stage least squares estimator, $\hat{\beta}_{\mathrm{TSLS}}$. We first use these estimators to construct omnibus tests for systematic treatment effect variation among Compliers. Tests using both estimators show strong evidence for such variation, with $p$-value 0.02 using $\hat{\beta}_{c,\mathrm{RI}}$ and $p$-value 0.01 using $\hat{\beta}_{\mathrm{TSLS}}$.

Finally, we turn to decomposing the overall treatment effect. As in the ITT case, we assume that the potential outcomes have a nonnegative correlation. Figure 3a shows the treatment effect $R^2$ among Compliers, which ranges from $R^2_{\tau,c} = 0.05$ to $R^2_{\tau,c} = 0.68$. Next, we can calculate the treatment effect variation due to noncompliance, $R^2_{\tau,U}$. In the case of HSIS, this is relatively small (between 0.01 and 0.16), in part because the overall treatment effect is fairly small. Therefore, the overall treatment effect decomposition due to both covariates and noncompliance, $R^2_{\tau,UX}$, is quite close to $R^2_{\tau,c}$, as shown in Figure 3a. Taken together, these estimates suggest that there is indeed important treatment effect variation that is neither captured by pre-treatment covariates nor by noncompliance, consistent with previous results in Ding et al. (2016).
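For intuition, the moment-differencing logic behind $\hat{\beta}_{c,\mathrm{RI}}$ (Theorem 6 in the Supplementary Material) can be sketched in a few lines of Python: moments for Compliers are recovered by differencing the observed $(T, D)$ cells, using the exclusion restrictions for Always Takers and Never Takers. The implementation below is illustrative only and omits standard errors.

```python
import numpy as np

def beta_c_ri(X, y, t, d):
    """Plug-in randomization-based estimator of systematic treatment
    effect variation among Compliers (a sketch).

    X : (n, K) covariates; y : (n,) outcomes
    t : (n,) treatment assignment; d : (n,) treatment receipt
    """
    T1, T0 = t == 1, t == 0
    def mom_xx(mask):
        # mean of D_i * X_i X_i^T within an assignment arm
        Xm, dm = X[mask], d[mask]
        return (Xm * dm[:, None]).T @ Xm / mask.sum()
    def mom_xy(mask, receipt):
        # mean of I(D_i = receipt) * X_i Y_i within an assignment arm
        w = (d[mask] == receipt) * y[mask]
        return X[mask].T @ w / mask.sum()
    pic_Sxx = mom_xx(T1) - mom_xx(T0)              # estimates pi_c * S_xx,c
    pic_Sx1 = mom_xy(T1, 1) - mom_xy(T0, 1)        # estimates pi_c * S_x1,c
    pic_Sx0 = mom_xy(T0, 0) - mom_xy(T1, 0)        # estimates pi_c * S_x0,c
    return np.linalg.solve(pic_Sxx, pic_Sx1 - pic_Sx0)   # pi_c cancels
```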
Discussion

In this paper, we propose a broad, flexible framework for assessing and decomposing treatment effect variation in randomized experiments, with and without noncompliance. In general, we believe this is a natural setup for researchers to formulate and investigate a broad range of questions about impact heterogeneity (e.g., Heckman et al., 1997). Applications include assessing underlying causal mechanisms and targeting treatments based on individual-level characteristics. Understanding such variation is also important for the design of experiments. Djebbari and Smith (2008), for example, argue that characterizing the size of the idiosyncratic treatment effect is useful for determining the value of additional data collection.

We briefly note several directions for future work. First, our primary purpose was to propose a framework for analysis rooted in and justified by the randomization itself. As a result, we focused on the core properties of several relatively simple versions of linear regression and TSLS. We did not, however, fully explore their practical and finite-sample properties. For example, in future work, we hope to determine the settings in which model assistance will most improve estimation and to assess the increased power of the OLS approach relative to the unbiased RI approach. We are also investigating how to connect the model-assisted and OLS approaches to take advantage of both methods of precision gain. Similarly, there is still much room for improvement in characterizing the degree of heterogeneity, such as with an effect size for the systematic variation.

Second, a natural extension is to use more complex methods to estimate systematic treatment effects, such as via hierarchical models (Feller and Gelman, 2015) or via machine learning methods (Wager and Athey, 2017), extending the results for the omnibus test and treatment effect $R^2_\tau$ accordingly. While the guarantees from randomization are clearly weaker in such settings, researchers can assess these tradeoffs themselves. For example, hierarchical modeling would be especially useful in the Head Start Impact Study due to the multi-site design (Bloom and Weiland, 2014).

Third, a question of increasing practical importance is the generalizability of experimental results to a given target population (Stuart et al., 2011). We believe that the treatment effect $R^2_\tau$ is a critical measure for assessing the credibility of these generalizations. In short, if there is substantial idiosyncratic treatment effect variation, i.e., if $R^2_\tau$ is small, then researchers should be wary of using observed covariates to extrapolate treatment effects.

Finally, a question is how to extend this treatment effect variation framework to non-randomized settings. While the results would necessarily rest on much stronger assumptions, many settings already use an as-if-randomized framework, such as in observational studies (Rosenbaum, 2002; Imbens and Rubin, 2015). Under this approach, extensions should be natural.

References
A. Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113:231–263, 2003.
Administration for Children and Families. Head Start program facts, fiscal year 2014. Available at https://eclkc.ohs.acf.hhs.gov/hslc/data/factsheets/docs/hs-program-fact-sheet-2014.pdf, 2015.
J. D. Angrist and J. Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press, 2008.
J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455, 1996.
J. D. Angrist, P. A. Pathak, and C. R. Walters. Explaining charter school effectiveness. American Economic Journal: Applied Economics, 5(4):1–27, 2013.
P. M. Aronow, D. P. Green, and D. K. Lee. Sharp bounds on the variance in randomized experiments. The Annals of Statistics, 42:850–871, 2014.
S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
A. Berrington de González and D. R. Cox. Interpretation of interaction: A review. The Annals of Applied Statistics, 1:371–385, 2007.
M. Bitler, H. Hoynes, and T. Domina. Experimental evidence on distributional effects of Head Start. Working Paper, 2014.
A. S. Blinder. Wage discrimination: reduced form and structural estimates. Journal of Human Resources, 8:436–455, 1973.
H. S. Bloom and C. Weiland. To what extent do the effects of Head Start on enrolled children vary across sites? Working Paper, 2014.
J. Bound, D. A. Jaeger, and R. M. Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90:443–450, 1995.
W. G. Cochran. Sampling Techniques. New York: John Wiley & Sons, 3rd edition, 1977.
D. R. Cox. Interaction (with discussion). International Statistical Review, 52:1–24, 1984.
R. K. Crump, V. J. Hotz, G. W. Imbens, and O. A. Mitnik. Nonparametric tests for treatment effect heterogeneity. Review of Economics and Statistics, 90:389–405, 2008.
J. Currie and D. Thomas. Does Head Start make a difference? American Economic Review, 85(3):341–364, 1995.
P. Ding. A paradox from randomization-based causal inference. arXiv preprint arXiv:1402.0142, 2014.
P. Ding, A. Feller, and L. W. Miratrix. Randomization inference for treatment effect variation. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 78:655–671, 2016.
H. Djebbari and J. Smith. Heterogeneous impacts in PROGRESA. Journal of Econometrics, 145:64–80, 2008.
A. Feller and A. Gelman. Hierarchical models for causal effects. Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, 2015.
A. Feller, T. Grindal, L. Miratrix, and L. C. Page. Compared to what? Variation in the impacts of early childhood education by alternative care type. The Annals of Applied Statistics, 10(3):1245–1285, 2016.
R. A. Fisher. The Design of Experiments. Edinburgh: Oliver & Boyd, 1st edition, 1935.
C. B. Fogarty. Regression assisted inference for the average treatment effect in paired experiments. arXiv preprint arXiv:1612.05179, 2016.
C. E. Frangakis and D. B. Rubin. Principal stratification in causal inference. Biometrics, 58:21–29, 2002.
M. Fréchet. Sur les tableaux de corrélation dont les marges sont données. Annals Universite de Lyon, Sect. A, Ser. 3, 14:53–77, 1951.
D. P. Green and H. L. Kern. Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. The Public Opinion Quarterly, 76:491–511, 2012.
J. Hájek. Limiting distributions in simple random sampling from a finite population. Publications of the Mathematics Institute of the Hungarian Academy of Science, 5:361–374, 1960.
J. J. Heckman, J. Smith, and N. Clements. Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies, 64:487–535, 1997.
J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20:217–240, 2011.
W. Hoeffding. Masstabinvariante Korrelationsmasse für diskontinuierliche Verteilungen. Arkiv für matematischen Wirtschaften und Sozialforschung, 7:49–70, 1941.
Y. Huang, P. B. Gilbert, and H. Janes. Assessing treatment-selection markers using a potential outcomes framework. Biometrics, 68:687–696, 2012.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233, 1967.
K. Imai and M. Ratkovic. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7:443–470, 2013.
G. Imbens. Instrumental variables: An econometrician's perspective (with discussion). Statistical Science, 29:323–358, 2014.
G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, and in the Social and Biomedical Sciences. New York: Cambridge University Press, 2015.
O. Kempthorne. The Design and Analysis of Experiments. New York: Wiley, 1952.
E. L. Lehmann. Elements of Large-Sample Theory. New York: Springer, 1998.
X. Li and P. Ding. General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association, in press, 2017.
W. Lin. Agnostic notes on regression adjustments to experimental data: reexamining Freedman's critique. The Annals of Applied Statistics, 7:295–318, 2013.
R. A. Matsouaka, J. Li, and T. Cai. Evaluating marker-guided treatment selection strategies. Biometrics, 70:489–499, 2014.
J. A. Middleton and P. M. Aronow. Unbiased estimation of the average treatment effect in cluster-randomized experiments. Statistics, Politics and Policy, 6:39–75, 2015.
R. B. Nelsen. An Introduction to Copulas. New York: Springer, 2nd edition, 2007.
J. Neyman. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5:465–472, 1923.
R. Oaxaca. Male-female wage differentials in urban labor markets. International Economic Review, 14:693–709, 1973.
M. Puma, S. Bell, R. Cook, C. Heid, G. Shapiro, P. Broene, F. Jenkins, P. Fletcher, L. Quinn, J. Friedman, et al. Head Start Impact Study: Final report. Technical report, Department of Health and Human Services, Administration for Children and Families, Washington, DC, 2010.
S. W. Raudenbush and H. S. Bloom. Learning about and from a distribution of program impacts using multisite trials. American Journal of Evaluation, 36(4):475–499, 2015.
P. R. Rosenbaum. Reduced sensitivity to hidden bias at upper quantiles in observational studies with dilated treatment effects. Biometrics, 55:560–564, 1999.
P. R. Rosenbaum. Observational Studies. New York: Springer, 2nd edition, 2002.
P. R. Rosenbaum. Confidence intervals for uncommon but dramatic responses to treatment. Biometrics, 63:1164–1171, 2007.
D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701, 1974.
D. B. Rubin. Comment on "Randomization analysis of experimental data: the Fisher randomization test" by D. Basu. Journal of the American Statistical Association, 75:591–593, 1980.
C.-E. Särndal, B. Swensson, and J. Wretman. Model-Assisted Survey Sampling. New York: Springer, 2003.
D. O. Staiger and J. H. Stock. Instrumental variables regression with weak instruments. Econometrica, 65:557–586, 1997.
E. A. Stuart, S. R. Cole, C. P. Bradshaw, and P. J. Leaf. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011.
S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, in press, 2017.
C. R. Walters. Inputs in the production of early childhood human capital: Evidence from Head Start. American Economic Journal: Applied Economics, 7(4):76–102, 2015.
Westinghouse Learning Corporation. The Impact of Head Start: An Evaluation of the Effects of Head Start on Children's Cognitive and Affective Development, Volume 1: Report to the Office of Economic Opportunity. Athens, Ohio: Westinghouse Learning Corporation and Ohio University, 1969.
H. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48:817–838, 1980.
A. Zhao, P. Ding, R. Mukerjee, and T. Dasgupta. Randomization-based causal inference from split-plot designs. Annals of Statistics, in press, 2017.

Supplementary Material for "Decomposing Treatment Effect Variation"
Appendix A gives all the proofs, and Appendix B provides the additional commentary mentioned in the main text. The finite population central limit theorem (FPCLT) we use for our asymptotic proofs is Theorem 5 of Li and Ding (2017), which requires some mild moment conditions on the covariates and potential outcomes, as outlined in the main text.
Appendix A Lemmas and Proofs
Before we prove Theorem 1, we provide a few lemmas to ease the notational burden and the amount of algebra in subsequent calculations. These lemmas allow us to derive expressions for our estimators in terms of matrix algebra rather than the summation-style approach typically seen for Neyman-style derivations in the literature.

To begin, let $\mathbf{1}_n = (1, \ldots, 1)^T$ and $\mathbf{0}_n = (0, \ldots, 0)^T$ be column vectors of length $n$, and let $I_n$ be the $n \times n$ identity matrix. Then $S_n = I_n - n^{-1}\mathbf{1}_n\mathbf{1}_n^T$ is the projection matrix orthogonal to $\mathbf{1}_n$, with $S_n\mathbf{1}_n = \mathbf{0}_n$. Under this formulation, the covariance matrix of the treatment assignment vector is a scaled projection matrix orthogonal to $\mathbf{1}_n$, as shown in the following lemma.

Lemma A.1. The treatment assignment vector $T$ of a completely randomized experiment has
$$E(T) = \frac{n_1}{n}\mathbf{1}_n, \qquad \mathrm{cov}(T) = \frac{n_1 n_0}{n(n-1)}\,S_n.$$

Proof of Lemma A.1.
The conclusions follow from
$$E(T_i) = \frac{n_1}{n}, \qquad \mathrm{var}(T_i) = \frac{n_1 n_0}{n^2}, \qquad \mathrm{cov}(T_i, T_j) = -\frac{n_1 n_0}{n^2(n-1)} \quad (i \neq j).$$
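Because Lemma A.1 involves only finitely many equally likely assignments, it can be verified by brute force. The following self-contained check (ours, not part of the original argument) enumerates all complete randomizations for a small $n$:

```python
import numpy as np
from itertools import combinations

n, n1 = 6, 3
n0 = n - n1
# All C(n, n1) equally likely complete randomizations
assignments = np.array([[1 if i in idx else 0 for i in range(n)]
                        for idx in combinations(range(n), n1)])
mean_T = assignments.mean(axis=0)            # exact E(T)
cov_T = np.cov(assignments.T, bias=True)     # exact cov(T) over all draws
S_n = np.eye(n) - np.ones((n, n)) / n
assert np.allclose(mean_T, n1 / n)
assert np.allclose(cov_T, n1 * n0 / (n * (n - 1)) * S_n)
```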
The projection matrix $S_n$ acts as a covariance operator, as illustrated by the following lemma.

Lemma A.2. Let $U_i, V_i \in \mathbb{R}^K$ be column vectors of length $K$. Define $U = [U_1, U_2, \ldots, U_n]$ and $V = [V_1, V_2, \ldots, V_n] \in \mathbb{R}^{K \times n}$ as two matrices of dimension $K \times n$. If $\bar{U} = n^{-1}\sum_{i=1}^n U_i = n^{-1}U\mathbf{1}_n$ and $\bar{V} = n^{-1}\sum_{i=1}^n V_i = n^{-1}V\mathbf{1}_n$, then
$$U S_n V^T = \sum_{i=1}^n (U_i - \bar{U})(V_i - \bar{V})^T. \tag{A.1}$$
In particular, when $U_i = V_i$,
$$V S_n V^T = \sum_{i=1}^n (V_i - \bar{V})(V_i - \bar{V})^T = (n-1)\,S(V).$$

Proof of Lemma A.2.
The left-hand side of (A.1) is equal to
$$U S_n V^T = UV^T - n^{-1}(U\mathbf{1}_n)(V\mathbf{1}_n)^T = \sum_{i=1}^n U_iV_i^T - n^{-1}(n\bar{U})(n\bar{V})^T = \sum_{i=1}^n U_iV_i^T - n\bar{U}\bar{V}^T,$$
which is the same as the right-hand side of (A.1).

Theorem 1: A generalized, vector-outcome version of Neyman. To prove the generalized Neyman result, we bundle our vector potential outcomes into matrices and use the above lemmas to obtain their covariance matrix. The theorem is exact, requiring no asymptotics; using the FPCLT to show that the estimator has an approximately Normal distribution, allowing for classic testing and inference, is a separate, subsequent step.
Proof of Theorem 1.
Define $V_1 = [V_1(1), \ldots, V_n(1)]$ and $V_0 = [V_1(0), \ldots, V_n(0)]$ as the matrices of the potential outcomes. Then the Neymanian simple difference-in-means estimator has the following representation:
$$\hat{\tau}_V = \bar{V}_1^{\mathrm{obs}} - \bar{V}_0^{\mathrm{obs}} = \frac{1}{n_1}\sum_{i=1}^n T_iV_i(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)V_i(0) = \frac{1}{n_1}V_1T - \frac{1}{n_0}V_0(\mathbf{1}_n - T) = \left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)T - \frac{1}{n_0}V_0\mathbf{1}_n.$$
The unbiasedness of $\hat{\tau}_V$ now follows from the linearity of expectation and Lemma A.1. For the covariance, note that the second term above is constant and so is not involved. Applying Lemmas A.1 and A.2, we obtain the covariance matrix of $\hat{\tau}_V$:
$$\mathrm{cov}(\hat{\tau}_V) = \left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)\mathrm{cov}(T)\left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)^T = \frac{n_1n_0}{n(n-1)}\left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)S_n\left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)^T$$
$$= \frac{n_1n_0}{n(n-1)}\left(\frac{1}{n_1^2}V_1S_nV_1^T + \frac{1}{n_0^2}V_0S_nV_0^T + \frac{1}{n_1n_0}V_1S_nV_0^T + \frac{1}{n_1n_0}V_0S_nV_1^T\right)$$
$$= \frac{n_0}{n_1n}S\{V(1)\} + \frac{n_1}{n_0n}S\{V(0)\} + \frac{1}{n(n-1)}\left(V_1S_nV_0^T + V_0S_nV_1^T\right).$$
To simplify the third term, we use the fact that $ab^T + ba^T = aa^T + bb^T - (a - b)(a - b)^T$ for two column vectors $a$ and $b$:
$$\{V_i(1) - \bar{V}(1)\}\{V_i(0) - \bar{V}(0)\}^T + \{V_i(0) - \bar{V}(0)\}\{V_i(1) - \bar{V}(1)\}^T$$
$$= \{V_i(1) - \bar{V}(1)\}\{V_i(1) - \bar{V}(1)\}^T + \{V_i(0) - \bar{V}(0)\}\{V_i(0) - \bar{V}(0)\}^T - \{V_i(1) - V_i(0) - \bar{V}(1) + \bar{V}(0)\}\{V_i(1) - V_i(0) - \bar{V}(1) + \bar{V}(0)\}^T.$$
Summing over $i = 1, \ldots, n$ and applying Lemma A.2, we have
$$\frac{V_1S_nV_0^T + V_0S_nV_1^T}{n-1} = S\{V(1)\} + S\{V(0)\} - S\{V(1) - V(0)\}.$$
Therefore, the covariance of $\hat{\tau}_V$ simplifies to
$$\mathrm{cov}(\hat{\tau}_V) = \frac{n_0}{n_1n}S\{V(1)\} + \frac{n_1}{n_0n}S\{V(0)\} + \frac{1}{n}\left[S\{V(1)\} + S\{V(0)\} - S\{V(1) - V(0)\}\right] = \frac{S\{V(1)\}}{n_1} + \frac{S\{V(0)\}}{n_0} - \frac{S\{V(1) - V(0)\}}{n}.$$
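Again as our own sanity check, the covariance formula of Theorem 1 can be verified by exact enumeration for a tiny population:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, n1 = 6, 3
V1 = rng.normal(size=(2, n))                 # potential outcomes V_i(1)
V0 = rng.normal(size=(2, n))                 # potential outcomes V_i(0)
ests = []
for idx in combinations(range(n), n1):
    t = np.isin(np.arange(n), idx)
    ests.append(V1[:, t].mean(axis=1) - V0[:, ~t].mean(axis=1))
exact_cov = np.cov(np.array(ests).T, bias=True)   # exact over all assignments
S = lambda M: np.cov(M, bias=False)               # finite-population S(.), ddof = 1
theory = S(V1) / n1 + S(V0) / (n - n1) - S(V1 - V0) / n
assert np.allclose(exact_cov, theory)
```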
Theorem 2: Behavior of $\hat{\beta}_{\mathrm{RI}}$. To show the properties of $\hat{\beta}_{\mathrm{RI}}$, we express the systematic variation as a vector of new potential outcomes, namely the original outcome scaled by the different covariates of interest. This allows for immediate use of Theorem 1.

Proof of Theorem 2. Because $\hat{S}_{xt}$ is the sample mean of $\{X_iY_i^{\mathrm{obs}} : T_i = t, i = 1, \ldots, n\} = \{X_iY_i(t) : T_i = t, i = 1, \ldots, n\}$, it is unbiased for the population mean $S_{xt}$. Thus the estimator $\hat{\beta}_{\mathrm{RI}}$ is also unbiased for $\beta$, as $S_{xx}^{-1}$ is fixed and the expectation is linear. Its sampling covariance over all possible randomizations is
$$\mathrm{cov}(\hat{\beta}_{\mathrm{RI}}) = S_{xx}^{-1}\,\mathrm{cov}(\hat{S}_{x1} - \hat{S}_{x0})\,S_{xx}^{-1}.$$
Therefore, we need only obtain the covariance of
$$\hat{S}_{x1} - \hat{S}_{x0} = \frac{1}{n_1}\sum_{i=1}^n T_iX_iY_i^{\mathrm{obs}} - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)X_iY_i^{\mathrm{obs}},$$
which is the difference between the sample means of $\{X_iY_i(1) : i = 1, \ldots, n\}$ under treatment and $\{X_iY_i(0) : i = 1, \ldots, n\}$ under control. Viewing $X_iY_i^{\mathrm{obs}}$ as a vector outcome in a completely randomized experiment, we can apply Theorem 1 to obtain
$$\mathrm{cov}(\hat{S}_{x1} - \hat{S}_{x0}) = \frac{S\{XY(1)\}}{n_1} + \frac{S\{XY(0)\}}{n_0} - \frac{S(X\tau)}{n},$$
which completes the proof.

Theorem 3: Behavior of $\hat{\beta}_{\mathrm{OLS}}$. We first use the well-known fact that the estimates from an OLS model with treatment fully interacted with covariates are equivalent to those from separate regressions of the outcome on the covariates in the control and treatment groups. This means we can obtain $\hat{\gamma}_{\mathrm{OLS}}$ by regressing $Y^{\mathrm{obs}}$ on $X$ using the control group data, and $\widehat{\gamma + \beta}_{\mathrm{OLS}}$ by regressing $Y^{\mathrm{obs}}$ on $X$ using the treatment group data, giving estimated coefficients of
$$\hat{\gamma}_{\mathrm{OLS}} = \hat{S}_{xx,0}^{-1}\hat{S}_{x0} \qquad \text{and} \qquad \hat{\beta}_{\mathrm{OLS}} = \hat{S}_{xx,1}^{-1}\hat{S}_{x1} - \hat{S}_{xx,0}^{-1}\hat{S}_{x0}.$$
As a quick heuristic argument for this, consider that the least squares problem for the interacted model separates into two components, one for each group; re-parameterizing then gives the above.

We now prove the properties of $\hat{\beta}_{\mathrm{OLS}}$. Here we have to use asymptotics for the entire theorem, unlike the case of $\hat{\beta}_{\mathrm{RI}}$, where the mean and covariance are exact and asymptotics are needed only for the asymptotic normality of the estimator.

Proof of Theorem 3.
First expand the difference between $\hat{\beta}_{\mathrm{OLS}}$ and $\beta$ as
$$\hat{\beta}_{\mathrm{OLS}} - \beta = \hat{S}_{xx,1}^{-1}\left(\hat{S}_{x1} - \hat{S}_{xx,1}\gamma_1\right) - \hat{S}_{xx,0}^{-1}\left(\hat{S}_{x0} - \hat{S}_{xx,0}\gamma_0\right),$$
where $\gamma_1 = \gamma + \beta$ and $\gamma_0 = \gamma$. This will be close to the related quantity
$$\Delta = S_{xx}^{-1}\left(\hat{S}_{x1} - \hat{S}_{xx,1}\gamma_1\right) - S_{xx}^{-1}\left(\hat{S}_{x0} - \hat{S}_{xx,0}\gamma_0\right). \tag{A.2}$$
For the above to make sense and hold, we need our asymptotic framework; in particular, we need the associated moment conditions described in the main text. We next observe that the difference between $\hat{\beta}_{\mathrm{OLS}} - \beta$ and $\Delta$ is of higher order, because
$$(\hat{\beta}_{\mathrm{OLS}} - \beta) - \Delta = (\hat{S}_{xx,1}^{-1} - S_{xx}^{-1})(\hat{S}_{x1} - \hat{S}_{xx,1}\gamma_1) - (\hat{S}_{xx,0}^{-1} - S_{xx}^{-1})(\hat{S}_{x0} - \hat{S}_{xx,0}\gamma_0) \tag{A.3}$$
$$= O_P(n^{-1/2})O_P(n^{-1/2}) - O_P(n^{-1/2})O_P(n^{-1/2}) = O_P(n^{-1}), \tag{A.4}$$
following from the FPCLT applied to the four terms in (A.3). This is an argument commonly used in the survey sampling literature for ratio estimators (Cochran, 1977).

We next focus on the asymptotic distribution of $\Delta$, because the asymptotic distribution of $\hat{\beta}_{\mathrm{OLS}} - \beta$ will be the same. Further simplify (A.2) as
$$\Delta = S_{xx}^{-1}\left[\frac{1}{n_1}\sum_{i=1}^n T_iX_ie_i(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)X_ie_i(0)\right], \tag{A.5}$$
where $e_i(1) = Y_i(1) - X_i^T\gamma_1$ and $e_i(0) = Y_i(0) - X_i^T\gamma_0$ are the residual potential outcomes. (To see this, note, for example, that both $\hat{S}_{x1}$ and $\hat{S}_{xx,1}$ are averages over the treated units, and we can factor out an $X_i$ to get $X_i$ times the difference between $Y_i$ and its predicted value.)

Applying Theorem 1 to the vector outcome $Xe$, we obtain the covariance matrix of $\Delta$, which is also the asymptotic covariance of $\hat{\beta}_{\mathrm{OLS}} - \beta$ due to (A.4). The asymptotic normality follows from the representation (A.5) and the FPCLT.

Theorem 4: Bounds for $R^2_\tau$. To prove Theorem 4, we invoke the following Fréchet–Hoeffding inequality (Hoeffding, 1941; Fréchet, 1951; Heckman et al., 1997; Aronow et al., 2014).
Lemma A.3. If we know only the marginal distributions of two random variables $X \sim F_X(x)$ and $Y \sim F_Y(y)$, then $E(XY)$ can be sharply bounded by
$$\int_0^1 F_X^{-1}(u)F_Y^{-1}(1-u)\,\mathrm{d}u \;\le\; E(XY) \;\le\; \int_0^1 F_X^{-1}(u)F_Y^{-1}(u)\,\mathrm{d}u.$$

Lemma A.3 immediately implies the following bound for $\mathrm{var}(X - Y)$ when $E(X - Y) = 0$.

Lemma A.4. If we know only the marginal distributions $X \sim F_X(x)$ and $Y \sim F_Y(y)$, and $E(X - Y) = 0$, then $\mathrm{var}(X - Y)$ can be sharply bounded by
$$\int_0^1 \left\{F_X^{-1}(u) - F_Y^{-1}(u)\right\}^2\mathrm{d}u \;\le\; \mathrm{var}(X - Y) \;\le\; \int_0^1 \left\{F_X^{-1}(u) - F_Y^{-1}(1-u)\right\}^2\mathrm{d}u.$$

Proof of Lemma A.4. The variance of $X - Y$ can be decomposed as
$$\mathrm{var}(X - Y) = E(X - Y)^2 = E(X^2) + E(Y^2) - 2E(XY),$$
which depends on the following three terms:
$$E(X^2) = \int x^2\,\mathrm{d}F_X(x) = \int_0^1 \left\{F_X^{-1}(u)\right\}^2\mathrm{d}u, \qquad E(Y^2) = \int_0^1 \left\{F_Y^{-1}(u)\right\}^2\mathrm{d}u = \int_0^1 \left\{F_Y^{-1}(1-u)\right\}^2\mathrm{d}u,$$
$$\int_0^1 F_X^{-1}(u)F_Y^{-1}(1-u)\,\mathrm{d}u \;\le\; E(XY) \;\le\; \int_0^1 F_X^{-1}(u)F_Y^{-1}(u)\,\mathrm{d}u.$$
Plugging these expressions into the variance of $X - Y$ yields the desired bounds.

Applying Lemma A.4, we can easily prove Theorem 4.

Proof of Theorem 4.
Because $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$, we need only bound $S_{\varepsilon\varepsilon}$, the finite population variance of $\varepsilon_i = \{Y_i(1) - X_i^T\gamma_1\} - \{Y_i(0) - X_i^T\gamma_0\} = e_i(1) - e_i(0)$. We can identify the marginal distributions of $\{e_i(1) : i = 1, \ldots, n\}$ and $\{e_i(0) : i = 1, \ldots, n\}$, and we also know that $n^{-1}\sum_{i=1}^n \varepsilon_i = 0$. Therefore, the bounds in Lemma A.4 imply the bounds in Theorem 4.
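Numerically, the bounds of Lemma A.4 amount to coupling the two marginals comonotonically (lower bound) or antitonically (upper bound). A small sketch of ours, approximating the integrals with an equally spaced quantile grid:

```python
import numpy as np

def var_diff_bounds(e1, e0, grid=1000):
    """Sharp bounds on var{e(1) - e(0)} from the marginals alone
    (Lemma A.4); function and variable names are illustrative.
    """
    u = (np.arange(grid) + 0.5) / grid
    q1, q0 = np.quantile(e1, u), np.quantile(e0, u)
    lower = np.mean((q1 - q0) ** 2)          # comonotone coupling
    upper = np.mean((q1 - q0[::-1]) ** 2)    # antitone coupling, F^{-1}(1-u)
    return lower, upper
```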
Theorem 5: Sensitivity analysis.

Proof of Theorem 5.
The joint distribution of $(U_1, U_2)$ is
$$C(u_1, u_2) = P(U_1 \le u_1, U_2 \le u_2) = \rho\min(u_1, u_2) + (1 - \rho)\,u_1u_2.$$
That is, the distribution function $C(u_1, u_2)$ is a weighted average of $\min(u_1, u_2) = C_R(u_1, u_2)$ and $u_1u_2 = C_I(u_1, u_2)$, the joint distributions when $U_2 = U_1$ and when $U_2 \perp U_1$, respectively. According to Nelsen (2007, Theorem 5.1.6), Spearman's rank correlation coefficient between $e(1)$ and $e(0)$ is
$$12\int_0^1\!\!\int_0^1 \{C(u_1, u_2) - u_1u_2\}\,\mathrm{d}u_1\mathrm{d}u_2 = 12\rho\int_0^1\!\!\int_0^1 \{\min(u_1, u_2) - u_1u_2\}\,\mathrm{d}u_1\mathrm{d}u_2 = 12\rho\left(\frac{1}{3} - \frac{1}{4}\right) = \rho.$$
To complete the proof of the theorem, we need only show that the covariance between $e(1)$ and $e(0)$ is linear in $\rho$, which follows from
$$\int_0^1\!\!\int_0^1 F_1^{-1}(u_1)F_0^{-1}(u_2)\,\mathrm{d}C(u_1, u_2) = \rho\int_0^1\!\!\int_0^1 F_1^{-1}(u_1)F_0^{-1}(u_2)\,\mathrm{d}C_R(u_1, u_2) + (1 - \rho)\int_0^1\!\!\int_0^1 F_1^{-1}(u_1)F_0^{-1}(u_2)\,\mathrm{d}C_I(u_1, u_2)$$
$$= \rho\int_0^1 F_1^{-1}(u)F_0^{-1}(u)\,\mathrm{d}u + (1 - \rho)\int_0^1 F_1^{-1}(u)\,\mathrm{d}u\int_0^1 F_0^{-1}(u)\,\mathrm{d}u.$$
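A quick Monte Carlo check (ours) of the Spearman calculation: drawing $U_2 = U_1$ with probability $\rho$ and independently otherwise induces exactly the mixture copula above, and the sample rank correlation is then approximately $\rho$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rho, n = 0.4, 200_000
u1 = rng.uniform(size=n)
# U2 = U1 with probability rho; an independent uniform otherwise
u2 = np.where(rng.uniform(size=n) < rho, u1, rng.uniform(size=n))
print(stats.spearmanr(u1, u2)[0])   # approximately 0.4
```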
Theorem 6: Extending to noncompliance. Theorem 6 shows how to estimate the outcome-to-covariate relationships of the Compliers by estimating different aggregate covariance relationships across all the strata for different observed groups and then taking differences. Due to the exclusion restrictions for the Never Takers and Always Takers, this recovers the desired relationships for the Compliers alone.

First, a small bit of notation: due to the exclusion restrictions for Never Takers and Always Takers, define the population covariance between $X$ and $Y(1) = Y(0)$ within strata $U = a$ and $U = n$ as
$$S_{x\cdot,u} = \frac{1}{n_u}\sum_{i=1}^n I(U_i = u)X_iY_i(1) = \frac{1}{n_u}\sum_{i=1}^n I(U_i = u)X_iY_i(0), \qquad (u = a, n).$$

Proof of Theorem 6. We first create an estimator for $S_{xx,c}$. From the observed data with $(T_i, D_i) = (1, 1)$,
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iD_iX_iX_i^T\right\} = E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = a)X_iX_i^T + \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = c)X_iX_i^T\right\} = \pi_aS_{xx,a} + \pi_cS_{xx,c}. \tag{A.6}$$
Similar to (A.6), we have
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)X_iX_i^T\right\} = \pi_nS_{xx,n}, \tag{A.7}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_iX_iX_i^T\right\} = \pi_aS_{xx,a}, \tag{A.8}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)(1 - D_i)X_iX_i^T\right\} = \pi_nS_{xx,n} + \pi_cS_{xx,c}. \tag{A.9}$$
Subtracting the left-hand side of (A.8) from (A.6), or the left-hand side of (A.7) from (A.9), gives unbiased estimators for $\pi_cS_{xx,c}$.

Second, analogous to $S_{xx,c}$, we consider the sample covariances between $X$ and $Y^{\mathrm{obs}}$ to obtain estimators for $S_{x1,c}$ and $S_{x0,c}$. From the observed data with $(T_i, D_i) = (1, 1)$,
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iD_iX_iY_i^{\mathrm{obs}}\right\} = E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = a)X_iY_i(1) + \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = c)X_iY_i(1)\right\} = \pi_aS_{x\cdot,a} + \pi_cS_{x1,c}. \tag{A.10}$$
Similar to (A.10), we have
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)X_iY_i^{\mathrm{obs}}\right\} = \pi_nS_{x\cdot,n}, \tag{A.11}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_iX_iY_i^{\mathrm{obs}}\right\} = \pi_aS_{x\cdot,a}, \tag{A.12}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)(1 - D_i)X_iY_i^{\mathrm{obs}}\right\} = \pi_nS_{x\cdot,n} + \pi_cS_{x0,c}. \tag{A.13}$$
Subtracting (A.12) from (A.10), and subtracting (A.11) from (A.13), we obtain the results in (15).

Corollary 3: Behavior of $\hat{\beta}_{c,\mathrm{RI}}$. Theorem 6 shows how to obtain unbiased estimates of the components of our estimator, which we can then plug in to obtain a consistent estimator of $\beta_c$. We next show how this plug-in estimator behaves.

Proof of Corollary 3.
First we write $\hat{\beta}_{c,\mathrm{RI}} - \beta_c$ in plug-in form, with the unbiased moment estimators of Theorem 6 in place of their population counterparts. Second, we introduce the analogous quantity $\Delta_c$, obtained by replacing each estimated matrix inverse with the fixed matrix $(\pi_cS_{xx,c})^{-1}$. Third, we observe that the difference between $\hat{\beta}_{c,\mathrm{RI}} - \beta_c$ and $\Delta_c$ is of higher order, following the same argument as in (A.4); therefore, we need only find the asymptotic distribution of $\Delta_c$.

Simple algebra, writing $D_i$ in terms of the compliance strata and regrouping the sums by stratum using the exclusion restrictions, reduces $\Delta_c$ to
$$\tilde{\beta}_{c,\mathrm{RI}} - \beta_c = (\pi_cS_{xx,c})^{-1}\left[\frac{1}{n_1}\sum_{i=1}^n T_iX_ie_i'(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)X_ie_i'(0)\right], \tag{A.14}$$
where $e_i'(1)$ and $e_i'(0)$ are the residual potential outcomes defined in the main text. The representation in (A.14) implies the asymptotic covariance matrix according to Theorem 1 and the asymptotic normality of $\tilde{\beta}_{c,\mathrm{RI}}$ according to the FPCLT.

Theorem 7: Behavior of $\hat{\beta}_{\mathrm{TSLS}}$. While the notation and matrix algebra are considerably more involved, the overall structure of the proof follows the earlier one for the OLS estimator of the ITT. In particular, we show that the estimator asymptotically converges to a more tractable version with a fixed leading matrix, and then use the usual covariance argument on the remaining terms. Before doing this, we first derive the probability limits of the estimator by working through the matrix algebra.
Proof of Theorem 7.
First, we find the probability limits of the TSLS estimators (recall $p_1 = n_1/n$ and $p_0 = n_0/n$):
$$\begin{pmatrix}\hat{\gamma}_{\mathrm{TSLS}} \\ \hat{\beta}_{\mathrm{TSLS}}\end{pmatrix} = \left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i \\ T_iX_i\end{pmatrix}(X_i^T, D_iX_i^T)\right\}^{-1}\left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i \\ T_iX_i\end{pmatrix}Y_i^{\mathrm{obs}}\right\} \stackrel{P}{\longrightarrow} \begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1}\begin{pmatrix}G \\ H\end{pmatrix}. \tag{A.15}$$
The term $A$ is $A = S_{xx}$, and the terms $(B, C, D, G, H)$ are the population limits of the corresponding sample quantities; we find each of them in turn. Term $B$ is
$$B = E\left\{\frac{1}{n}\sum_{i=1}^n D_iX_iX_i^T\right\} = E\left\{\frac{1}{n}\sum_{i=1}^n T_iI(U_i = a)X_iX_i^T + \frac{1}{n}\sum_{i=1}^n T_iI(U_i = c)X_iX_i^T + \frac{1}{n}\sum_{i=1}^n (1 - T_i)I(U_i = a)X_iX_i^T\right\} = \pi_aS_{xx,a} + p_1\pi_cS_{xx,c}.$$
Term $C$ is $C = E\{n^{-1}\sum_{i=1}^n T_iX_iX_i^T\} = p_1S_{xx}$. Term $D$ is
$$D = E\left\{\frac{1}{n}\sum_{i=1}^n T_iD_iX_iX_i^T\right\} = p_1\pi_aS_{xx,a} + p_1\pi_cS_{xx,c}.$$
Term $G$ is $G = E\{n^{-1}\sum_{i=1}^n X_iY_i^{\mathrm{obs}}\} = p_1S_{x1} + p_0S_{x0}$, and term $H$ is $H = E\{n^{-1}\sum_{i=1}^n T_iX_iY_i^{\mathrm{obs}}\} = p_1S_{x1}$. We apply the following formula for the inverse of a block matrix:
$$\begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}S_D^{-1} & -A^{-1}BS_A^{-1} \\ -D^{-1}CS_D^{-1} & S_A^{-1}\end{pmatrix},$$
where $S_D = A - BD^{-1}C$ and $S_A = D - CA^{-1}B$ are the Schur complements of the blocks $D$ and $A$. Omitting some tedious matrix algebra, we obtain
$$S_D = p_0\,\pi_cS_{xx,c}(\pi_aS_{xx,a} + \pi_cS_{xx,c})^{-1}S_{xx}, \qquad S_A = p_1p_0\,\pi_cS_{xx,c},$$
$$\begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + \pi_cS_{xx,c})S_{xx,c}^{-1} & -p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + p_1\pi_cS_{xx,c})S_{xx,c}^{-1} \\ -p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1} & p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1}\end{pmatrix}.$$
Therefore, according to (A.15), the probability limit of $\hat{\gamma}_{\mathrm{TSLS}}$ is
$$p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + \pi_cS_{xx,c})S_{xx,c}^{-1}(p_1S_{x1} + p_0S_{x0}) - p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + p_1\pi_cS_{xx,c})S_{xx,c}^{-1}(p_1S_{x1})$$
$$= S_{xx}^{-1}S_{x0} - \pi_a\pi_c^{-1}S_{xx}^{-1}S_{xx,a}S_{xx,c}^{-1}(S_{x1} - S_{x0}) = \gamma - \pi_aS_{xx}^{-1}S_{xx,a}\beta_c \equiv \gamma_\infty, \tag{A.16}$$
and the probability limit of $\hat{\beta}_{\mathrm{TSLS}}$ is
$$-p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1}(p_1S_{x1} + p_0S_{x0}) + p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1}(p_1S_{x1}) = \pi_c^{-1}S_{xx,c}^{-1}(S_{x1} - S_{x0}) = \beta_c, \tag{A.17}$$
where we use $S_{x1} - S_{x0} = \pi_c(S_{x1,c} - S_{x0,c})$, which is guaranteed by the exclusion restrictions.

We next find the asymptotic distribution of $\hat{\beta}_{\mathrm{TSLS}}$. Following the derivation in Corollary 3, we first write
$$\begin{pmatrix}\hat{\gamma}_{\mathrm{TSLS}} \\ \hat{\beta}_{\mathrm{TSLS}}\end{pmatrix} - \begin{pmatrix}\gamma_\infty \\ \beta_c\end{pmatrix} = \left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i \\ T_iX_i\end{pmatrix}(X_i^T, D_iX_i^T)\right\}^{-1}\left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i(Y_i^{\mathrm{obs}} - X_i^T\gamma_\infty - D_iX_i^T\beta_c) \\ T_iX_i(Y_i^{\mathrm{obs}} - X_i^T\gamma_\infty - D_iX_i^T\beta_c)\end{pmatrix}\right\},$$
then introduce
$$\Delta_{\mathrm{TSLS}} = \begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1}\begin{pmatrix}n^{-1}\sum_i T_iX_ie_i''(1) + n^{-1}\sum_i (1 - T_i)X_ie_i''(0) \\ n^{-1}\sum_i T_iX_ie_i''(1)\end{pmatrix} = \begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1}\begin{pmatrix}n^{-1}\sum_i T_iX_i\{e_i''(1) - e_i''(0)\} + n^{-1}\sum_i X_ie_i''(0) \\ n^{-1}\sum_i T_iX_ie_i''(1)\end{pmatrix}, \tag{A.18}$$
with $(A, B, C, D)$ defined in (A.15) and $\{e_i''(1), e_i''(0)\}$ defined in Theorem 7, and finally recognize that the difference between the two displays above is of higher order. Again we need only find the asymptotic distribution of $\Delta_{\mathrm{TSLS}}$. The covariance of the second factor on the right-hand side of (A.18) is (dropping the constant term $n^{-1}\sum_i X_ie_i''(0)$)
$$\mathrm{cov}\begin{pmatrix}n^{-1}\sum_i T_iX_i\{e_i''(1) - e_i''(0)\} \\ n^{-1}\sum_i T_iX_ie_i''(1)\end{pmatrix} = \frac{n_1n_0}{n^3}\begin{pmatrix}S(X\varepsilon) & \tfrac{1}{2}[S\{Xe''(1)\} - S\{Xe''(0)\} + S(X\varepsilon)] \\ \tfrac{1}{2}[S\{Xe''(1)\} - S\{Xe''(0)\} + S(X\varepsilon)] & S\{Xe''(1)\}\end{pmatrix},$$
using the polarization identity $2S(a, b) = S(a) + S(b) - S(a - b)$ applied to $X\{e''(1) - e''(0)\}$ and $Xe''(1)$. Therefore, according to (A.18), the asymptotic covariance of $\Delta_{\mathrm{TSLS}}$, and hence of $\hat{\beta}_{\mathrm{TSLS}} - \beta_c$, is the $(2, 2)$ block of the sandwich of this matrix between the block inverse and its transpose, which equals
$$\frac{n_1n_0}{n^3}\Big\{(p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})S(X\varepsilon)(p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})^T + (p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})S\{Xe''(1)\}(p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})^T$$
$$- (p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})[S\{Xe''(1)\} - S\{Xe''(0)\} + S(X\varepsilon)](p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})^T\Big\}$$
$$= (\pi_cS_{xx,c})^{-1}\left[\frac{S\{Xe''(1)\}}{n_1} + \frac{S\{Xe''(0)\}}{n_0} - \frac{S(X\varepsilon)}{n}\right](\pi_cS_{xx,c})^{-1}.$$
The asymptotic normality follows from the representation in (A.18) and the FPCLT.
Theorem 8: Decomposition of variation in non-compliance.
The following proof uses two facts: $\tau_a = \tau_n = 0$ and $\tau = \pi_c\tau_c$.

Proof of Theorem 8.
Write the total treatment effect variation as
$$S_{\tau\tau} = \frac{1}{n}\sum_{i=1}^n (\tau_i - \tau)^2 = \frac{1}{n}\sum_{i=1}^n \tau_i^2 - \tau^2 = \frac{1}{n}\sum_{i=1}^n I(U_i = c)\tau_i^2 - \pi_c^2\tau_c^2 = \pi_c\left(\frac{1}{n_c}\sum_{i=1}^n I(U_i = c)\tau_i^2 - \tau_c^2\right) + \pi_c(1 - \pi_c)\tau_c^2,$$
the treatment effect variation explained by compliance status as
$$S_{\tau\tau,U} = \sum_{u = c,a,n} \pi_u(\tau_u - \tau)^2 = \pi_c(\tau_c - \pi_c\tau_c)^2 + \pi_a(0 - \pi_c\tau_c)^2 + \pi_n(0 - \pi_c\tau_c)^2 = \pi_c\tau_c^2\left\{(1 - \pi_c)^2 + \pi_c(\pi_a + \pi_n)\right\} = \pi_c(1 - \pi_c)\tau_c^2,$$
and the subtotal treatment effect variation for Compliers as
$$S_{\tau\tau,c} = \frac{1}{n_c}\sum_{i=1}^n I(U_i = c)(\tau_i - \tau_c)^2 = \frac{1}{n_c}\sum_{i=1}^n I(U_i = c)\tau_i^2 - \tau_c^2.$$
Therefore, the three terms above satisfy $S_{\tau\tau} = \pi_cS_{\tau\tau,c} + S_{\tau\tau,U}$. The decomposition $S_{\tau\tau,c} = S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}$ follows immediately from the definition of $\beta_c$.
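The identity $S_{\tau\tau} = \pi_cS_{\tau\tau,c} + S_{\tau\tau,U}$ can be confirmed numerically; here is a small self-contained check (ours) with simulated compliance strata and the exclusion restrictions imposed:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
u = rng.choice(["c", "a", "n"], size=n, p=[0.6, 0.2, 0.2])
tau = np.where(u == "c", rng.normal(1.0, 1.0, size=n), 0.0)  # exclusion restrictions
pi_c = np.mean(u == "c")
tau_bar = tau.mean()                       # equals pi_c * tau_c
tau_c = tau[u == "c"].mean()
S_tt   = np.mean((tau - tau_bar) ** 2)     # total variation
S_tt_c = np.mean((tau[u == "c"] - tau_c) ** 2)
S_tt_U = pi_c * (1 - pi_c) * tau_c ** 2
assert np.isclose(S_tt, pi_c * S_tt_c + S_tt_U)
```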
Appendix B More detailed comments

Appendices B.1–B.4 give more details on some technical issues and extensions mentioned in the main text, and Appendix B.5 contains the proofs of the results in Appendix B.

Appendix B.1 Covariate adjustment to improve efficiency
In the main text, the role of covariates has been to model the treatment effect alone. In general, we also want to use covariates to reduce the sampling variability of $\hat{\beta}_{\mathrm{RI}}$, just as we can use covariates to get more precise estimates of the average treatment effect. In particular, the goal is to estimate $\hat{S}_{xt} \in \mathbb{R}^K$ more precisely; because these are the only random components in $\hat{\beta}_{\mathrm{RI}}$, if we estimate them more precisely, we estimate $\hat{\beta}_{\mathrm{RI}}$ more precisely as well. Let $W_i \in \mathbb{R}^J$ denote a vector of pretreatment covariates without the intercept term. Because $X_i$ and $W_i$ have different roles in estimation, they may also contain different sets of covariates, though, in practice, $X$ is likely to be a subset of $W$.

Following the covariate adjustment approach in survey sampling, we can obtain a model-assisted estimator for $\beta$ that uses $W$ to reduce sampling variability. To see this, we need several definitions. Define $\bar{W} = n^{-1}\sum_{i=1}^n W_i$ and $S_{ww} = n^{-1}\sum_{i=1}^n W_iW_i^T$, with $\det(S_{ww}) > 0$; define $\bar{W}_t$ and $\hat{S}_{ww,t}$ as the sample mean and covariance of $W$ under treatment arm $t$; and define $\hat{B}_t \in \mathbb{R}^{J \times K}$ as the regression coefficient of $Y^{\mathrm{obs}}X$ on $W$ for treatment arm $t$:
$$\hat{B}_t = \hat{S}_{ww,t}^{-1}\left\{\frac{1}{n_t}\sum_{i=1}^n I(T_i = t)W_i(Y_i^{\mathrm{obs}}X_i)^T\right\}.$$
The model-assisted estimator for $S_{xt}$ is then
$$\hat{S}^w_{xt} = \hat{S}_{xt} - \hat{B}_t^T(\bar{W}_t - \bar{W}), \qquad (t = 0, 1).$$
As a result, we can improve the randomization-based estimator via
$$\hat{\beta}^w_{\mathrm{RI}} = S_{xx}^{-1}(\hat{S}^w_{x1} - \hat{S}^w_{x0}).$$

Theorem A.1. The model-assisted estimator $\hat{\beta}^w_{\mathrm{RI}}$ is consistent for $\beta$ with asymptotic covariance
$$S_{xx}^{-1}\left[\frac{S\{E(1)\}}{n_1} + \frac{S\{E(0)\}}{n_0} - \frac{S(\Delta)}{n}\right]S_{xx}^{-1},$$
where $E_i(t) = Y_i(t)X_i - B_t^T(W_i - \bar{W})$ is the residual term and $\Delta_i = E_i(1) - E_i(0)$.

The estimator $\hat{\beta}^w_{\mathrm{RI}}$ uses covariates both to estimate treatment effect variation and to reduce sampling variability. Asymptotically, as long as $W$ is predictive of the marginal potential outcomes, the model-assisted estimator will improve precision over the unassisted estimator.
Appendix B.2 Fisherian exact inference

When $\varepsilon_i = 0$ for all $i$, we can obtain exact inference for $\beta$ based on the Fisher randomization test (Rubin, 1980; Rosenbaum, 2002; Ding et al., 2016). For a known $\beta$, the null hypothesis
$$H_0(\beta): Y_i(1) - Y_i(0) = X_i^T\beta \quad \text{for all } i \tag{A.19}$$
is sharp, in the sense of allowing for full imputation of all missing potential outcomes from the observed data. We can perform a randomization test using any sensible test statistic measuring the deviation from the null hypothesis $H_0(\beta)$; for example, the test statistic $t(T, Y^{\mathrm{obs}}; \beta)$ can be the difference-in-means, difference-in-medians, or Kolmogorov–Smirnov statistic comparing the two samples $\{Y_i^{\mathrm{obs}} - X_i^T\beta : T_i = 1, i = 1, \ldots, n\}$ and $\{Y_i^{\mathrm{obs}} : T_i = 0, i = 1, \ldots, n\}$. We can then obtain a $(1 - \alpha)$-level confidence region for $\beta$ by inverting a sequence of randomization tests:
$$\mathrm{CR}_\alpha = \{\beta : \text{Randomization test fails to reject } H_0(\beta) \text{ at significance level } \alpha\}.$$
The confidence region $\mathrm{CR}_\alpha$ is exact regardless of the sample size, and it is valid for general experimental designs if we use the corresponding assignment mechanism to simulate the null distribution of the test statistic. Due to the duality between testing and interval estimation, we reject the null hypothesis $H_0(X)$ of Section 3.3 if $\mathrm{CR}_\alpha$ contains no $\beta$ whose non-intercept coordinates are all zero, which controls the type one error rate at $\alpha$.
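For completeness, here is a minimal sketch of this randomization test for a completely randomized design, using the difference-in-means statistic on the $\beta$-adjusted outcomes; the function name is illustrative, and inverting it over a grid of $\beta$ values yields $\mathrm{CR}_\alpha$.

```python
import numpy as np

def frt_pvalue(X, y, t, beta, n_draws=2000, seed=0):
    """Fisher randomization test of the sharp null
    H0(beta): Y_i(1) - Y_i(0) = X_i' beta  (Appendix B.2).
    A sketch for a completely randomized design.
    """
    rng = np.random.default_rng(seed)
    y0 = y - t * (X @ beta)          # imputed control outcomes under H0
    def diff_means(assign):
        return abs(y0[assign == 1].mean() - y0[assign == 0].mean())
    obs = diff_means(t)
    draws = []
    for _ in range(n_draws):
        t_star = rng.permutation(t)  # re-randomize, holding n1 fixed
        draws.append(diff_means(t_star))
    return (np.sum(np.array(draws) >= obs) + 1) / (n_draws + 1)
```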
Appendix B.3 A variance ratio test

Raudenbush and Bloom (2015) noticed that if the variance of the treatment potential outcome is smaller than that of the control potential outcome, then the correlation between the individual treatment effect and the control potential outcome is negative. This statement does not involve any covariates, but it can be generalized to incorporate systematic and idiosyncratic treatment effect variation. Below we give a finite population version of their result.

Theorem A.2. If the finite population variance of $\{Y_i(1) - X_i^T\beta\}_{i=1}^n$ is smaller than that of $\{Y_i(0)\}_{i=1}^n$, then the idiosyncratic treatment effect variation, $\{\varepsilon_i\}_{i=1}^n$, is negatively correlated with the control potential outcomes.

Because the condition in Theorem A.2 depends only on the marginal distributions of the potential outcomes, we propose a formal variance ratio test of it using the observed data, which is a generalization of a similar theorem in Ding et al. (2016):

Theorem A.3. The variance ratio test with rejection region
$$\frac{\log s_1^2 - \log s_0^2}{\sqrt{(\hat{\kappa}_1 - 1)/n_1 + (\hat{\kappa}_0 - 1)/n_0}} < \Phi^{-1}(\alpha)$$
has asymptotic size no larger than $\alpha$, where $s_1^2$ and $\hat{\kappa}_1$ are the sample variance and kurtosis of $\{Y_i^{\mathrm{obs}} - X_i^T\hat{\beta}_{\mathrm{RI}} : T_i = 1, i = 1, \ldots, n\}$, $s_0^2$ and $\hat{\kappa}_0$ are the sample variance and kurtosis of $\{Y_i^{\mathrm{obs}} : T_i = 0, i = 1, \ldots, n\}$, and $\Phi^{-1}(\alpha)$ is the $\alpha$-th quantile of the standard normal distribution.

For finite population inference, the test in Theorem A.3 is generally conservative, but for superpopulation inference it is asymptotically exact. Note that Raudenbush and Bloom (2015) and Theorem A.2 concern only the detection of a negative association; unfortunately, there is no testable condition for a positive association.

Appendix B.4 More on noncompliance: estimating the bounds of the $R^2$s

The component $S_{\tau\tau,U}$ and the probability $\pi_c$ are directly identifiable according to the previous discussion. Furthermore, $S_{\delta\delta,c}$ is also identifiable according to the following result.

Corollary A.1. $S_{\delta\delta,c}$ can be expressed as the expectation of the following quantity:
$$\frac{1}{\pi_c}\left\{\frac{1}{n}\sum_{i=1}^n (\delta_i - \tau_c)^2 - \frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)(\delta_i - \tau_c)^2 - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_i(\delta_i - \tau_c)^2\right\}.$$
Because $\pi_c$, $\delta_i = X_i^T\beta_c$, and $\tau_c$ can be estimated by a plug-in approach, $S_{\delta\delta,c}$ can also be estimated from the observed data.

In the ITT case, estimation of the residual distributions is straightforward. In the noncompliance case, however, the estimation of $F_{1c}(y)$ and $F_{0c}(y)$ requires more discussion, because $U_i$ is a latent variable. To avoid notational clutter, we assume that $\gamma_{c1}$ and $\gamma_{c0}$ are known; in practice we can replace them by the randomization-based estimators $\hat{\gamma}_{c1,\mathrm{RI}}$ and $\hat{\gamma}_{c0,\mathrm{RI}}$, and the consistency of the final estimator will not be affected. Recall the potential residuals $e_i'(1)$ and $e_i'(0)$ defined in (17), and the observed value $e_i' = T_ie_i'(1) + (1 - T_i)e_i'(0)$. We define the following quantities:
$$\hat{F}_{11}(y) = \frac{1}{n_1}\sum_{i=1}^n T_iD_iI(e_i' \le y), \qquad \hat{F}_{10}(y) = \frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)I(e_i' \le y),$$
$$\hat{F}_{01}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_iI(e_i' \le y), \qquad \hat{F}_{00}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)(1 - D_i)I(e_i' \le y). \tag{A.20}$$
Similar to Corollary 3, we have the following result.

Corollary A.2. For any $y$,
$$E\{\hat{F}_{11}(y) - \hat{F}_{01}(y)\} = \pi_cF_{1c}(y), \qquad E\{\hat{F}_{00}(y) - \hat{F}_{10}(y)\} = \pi_cF_{0c}(y).$$
Therefore, we can estimate $F_{1c}(y)$ by $\{\hat{F}_{11}(y) - \hat{F}_{01}(y)\}/\hat{\pi}_c$, and $F_{0c}(y)$ by $\{\hat{F}_{00}(y) - \hat{F}_{10}(y)\}/\hat{\pi}_c$. As mentioned above, in practice we use $\hat{e}_i'$ instead of $e_i'$ in the formulas in (A.20).

Appendix B.5 Proofs of the theorems and corollaries in Appendix B
Proof of Theorem A.1.
The population-level OLS regression matrix of $Y(t)X$ on $W$ is
$$B_t = S_{ww}^{-1}\left\{\frac{1}{n}\sum_{i=1}^n W_i\{Y_i(t)X_i\}^T\right\} \in \mathbb{R}^{J \times K}.$$
Define $\tilde{S}^w_{xt} = \hat{S}_{xt} + B_t^T(\bar{W} - \bar{W}_t)$ and $\tilde{\beta}^w_{\mathrm{RI}} = S_{xx}^{-1}(\tilde{S}^w_{x1} - \tilde{S}^w_{x0})$. According to the same argument as (A.4), $\hat{\beta}^w_{\mathrm{RI}}$ and $\tilde{\beta}^w_{\mathrm{RI}}$ have the same asymptotic covariance, so in the following we need only discuss the covariance of $\tilde{\beta}^w_{\mathrm{RI}}$. Because
$$\tilde{S}^w_{x1} - \tilde{S}^w_{x0} = \frac{1}{n_1}\sum_{i=1}^n T_i\left\{Y_i(1)X_i + B_1^T(\bar{W} - W_i)\right\} - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)\left\{Y_i(0)X_i + B_0^T(\bar{W} - W_i)\right\} = \frac{1}{n_1}\sum_{i=1}^n T_iE_i(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)E_i(0)$$
can be represented as the difference between the sample means of $E_i(1)$ and $E_i(0)$, applying Theorem 2 we obtain its covariance:
$$\mathrm{cov}\left(\tilde{S}^w_{x1} - \tilde{S}^w_{x0}\right) = \frac{S\{E(1)\}}{n_1} + \frac{S\{E(0)\}}{n_0} - \frac{S\{\Delta\}}{n},$$
which completes the proof.

Proof of Theorem A.2.
For simplicity, we abuse the variance and covariance notation for the finite population; for example, $\mathrm{var}\{Y(0)\} = \sum_{i=1}^n \{Y_i(0) - \bar{Y}(0)\}^2/(n-1)$. If $\mathrm{var}\{Y(1) - X^T\beta\} \le \mathrm{var}\{Y(0)\}$, then $\mathrm{var}\{Y(0) + \varepsilon\} \le \mathrm{var}\{Y(0)\}$. Expanding the left-hand side,
$$\mathrm{var}\{Y(0)\} + \mathrm{var}\{\varepsilon\} + 2\,\mathrm{cov}\{Y(0), \varepsilon\} \le \mathrm{var}\{Y(0)\},$$
which implies $2\,\mathrm{cov}\{Y(0), \varepsilon\} \le -\mathrm{var}\{\varepsilon\} < 0$.

Let $(c_1, \cdots, c_n)^T$ and $(d_1, \ldots, d_n)^T$ be two vectors of nonnegative constants with the same mean $m > 0$ and finite population variances $S_{cc}$ and $S_{dd}$. The difference vector $(c_1 - d_1, \ldots, c_n - d_n)^T$ has mean zero and variance $S_{c-d,c-d}$. Let
$$\hat{\theta}_c = \frac{1}{n_1}\sum_{i=1}^n T_ic_i, \qquad \hat{\theta}_d = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)d_i$$
be the sample means of the treatment and control groups, respectively.
Lemma A.5. Under the regularity conditions for the FPCLT, $\log\hat{\theta}_c - \log\hat{\theta}_d$ has asymptotic mean zero and variance
$$\frac{1}{m^2}\left(\frac{S_{cc}}{n_1} + \frac{S_{dd}}{n_0} - \frac{S_{c-d,c-d}}{n}\right). \tag{A.21}$$

Proof of Lemma A.5.
According to the FPCLT, we have the following joint asymptotic normality of $\hat{\theta}_c$ and $\hat{\theta}_d$:
$$\begin{pmatrix}\hat{\theta}_c \\ \hat{\theta}_d\end{pmatrix} = \begin{pmatrix}n_1^{-1}\sum_{i=1}^n T_ic_i \\ n_0^{-1}\sum_{i=1}^n (1 - T_i)d_i\end{pmatrix} \stackrel{a}{\sim} N\left[\begin{pmatrix}m \\ m\end{pmatrix}, \begin{pmatrix}V_{cc} & V_{cd} \\ V_{cd} & V_{dd}\end{pmatrix}\right],$$
where
$$V_{cc} = \frac{n_0}{n_1n}S_{cc}, \qquad V_{dd} = \frac{n_1}{n_0n}S_{dd}, \qquad V_{cd} = -\frac{1}{2n}\left(S_{cc} + S_{dd} - S_{c-d,c-d}\right).$$
Applying a Taylor expansion at $m$, we have $\log\hat{\theta}_c - \log\hat{\theta}_d = \{(\hat{\theta}_c - m) - (\hat{\theta}_d - m)\}/m + o_P(n^{-1/2})$, which, coupled with Neyman (1923)'s variance formula, gives the asymptotic variance of $\log\hat{\theta}_c - \log\hat{\theta}_d$ in (A.21).

Proof of Theorem A.3.
First, as a direct consequence of Lemma A.5, the finite population variance is never larger than the superpopulation variance, with equality only when $S_{c-d,c-d} = 0$. Therefore, we need only show that the test in Theorem A.3 is asymptotically exact for superpopulation inference, so that the asymptotic size of the test is no larger than $\alpha$ for finite population inference. Second, replacing $\beta$ by its consistent estimator $\hat{\beta}_{\mathrm{RI}}$ does not affect the asymptotic distribution of the test statistic, due to Slutsky's Theorem; for simplicity, we treat $\beta$ as known in our asymptotic analysis. With these two ingredients, Theorem A.3 follows directly from the variance ratio test in Ding et al. (2016, Theorem 2, Supplementary Material).

Proof of Corollary A.1. The conclusion follows from
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)(\delta_i - \tau_c)^2\right\} = E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = n)(\delta_i - \tau_c)^2\right\} = \frac{1}{n}\sum_{i=1}^n I(U_i = n)(\delta_i - \tau_c)^2,$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_i(\delta_i - \tau_c)^2\right\} = E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = a)(\delta_i - \tau_c)^2\right\} = \frac{1}{n}\sum_{i=1}^n I(U_i = a)(\delta_i - \tau_c)^2.$$

Proof of Corollary A.2.
We rewrite
$$\hat{F}_{11}(y) = \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = c)I\{e_i'(1) \le y\} + \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = a)I\{e_i'(1) \le y\},$$
$$\hat{F}_{10}(y) = \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = n)I\{e_i'(1) \le y\},$$
$$\hat{F}_{01}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = a)I\{e_i'(0) \le y\},$$
$$\hat{F}_{00}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = c)I\{e_i'(0) \le y\} + \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = n)I\{e_i'(0) \le y\}.$$
In the above formulas, the only random components are the $T_i$; taking expectations stratum by stratum and using the exclusion restrictions yields the conclusions.