Decomposing Treatment Effect Variation
Peng Ding (UC Berkeley)        Avi Feller (UC Berkeley)        Luke Miratrix (Harvard GSE)

July 31, 2017
Abstract
Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the "black box" of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of randomized experiments, this paper proposes a framework for decomposing overall treatment effect variation into a systematic component explained by observed covariates and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are entirely justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully-interacted linear regression and two-stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an $R^2$-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment.

Key Words: Noncompliance; Heterogeneous treatment effect; Idiosyncratic treatment effect variation; Randomization inference; Systematic treatment effect variation.

* Peng Ding (Email: [email protected]) is Assistant Professor, Department of Statistics, University of California, Berkeley. Avi Feller (Email: [email protected]) is Assistant Professor, Goldman School of Public Policy, University of California, Berkeley. Luke Miratrix (Email: [email protected]) is Assistant Professor, Harvard Graduate School of Education. We thank Alberto Abadie, Donald Rubin, participants at the Applied Statistics Seminar at the Harvard Institute of Quantitative Social Science, and colleagues at University of California, Berkeley and Harvard University for helpful comments. We gratefully acknowledge financial support from the Spencer Foundation through a grant entitled "Using Emerging Methods with Existing Data from Multi-site Trials to Learn About and From Variation in Educational Program Effects," and from the Institute for Education Science (IES Grant …).

1 Introduction
The analysis of randomized experiments has traditionally focused on the average treatment effect, often ignoring or assuming away treatment effect variation (e.g., Neyman, 1923; Fisher, 1935; Kempthorne, 1952; Rosenbaum, 2002). Today, understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the "black box" of the average treatment effect. This is clear from the increasing number of papers on the topic in statistics and machine learning (Hill, 2011; Athey and Imbens, 2016; Wager and Athey, 2017), biostatistics (Huang et al., 2012; Matsouaka et al., 2014), education (Raudenbush and Bloom, 2015), economics (Heckman et al., 1997; Crump et al., 2008; Djebbari and Smith, 2008), political science (Green and Kern, 2012; Imai and Ratkovic, 2013), and other areas.

This paper proposes a framework for decomposing overall treatment effect variation in a randomized experiment into a systematic component that is explained by observed covariates, and an idiosyncratic component that is not explained (Heckman et al., 1997; Djebbari and Smith, 2008). In doing so, we make several key contributions. First, we take a fully randomization-based perspective (see Rosenbaum, 2002; Imbens and Rubin, 2015), and propose estimators that are entirely justified by the randomization itself. This is in contrast to much of the literature on randomization-based methods, where treatment effect variation is typically a nuisance (e.g., Rosenbaum, 1999, 2007). Similar to Lin (2013), we show that the resulting estimator is very similar in form to linear regression with interactions between the treatment indicator and covariates. Unlike with linear regression, however, the proposed estimator does not require any modeling assumptions on the marginal outcomes.

Second, we extend these methods from intention-to-treat (ITT) analysis to allow for noncompliance, proposing a randomization-based estimator of systematic treatment effect variation for the Local Average Treatment Effect (LATE) under noncompliance (Angrist et al., 1996). We show that this estimator is nearly identical to the two-stage least squares estimator with interactions between the treatment and covariates. We believe that this is a particularly novel contribution to the recent literature seeking to reconcile the randomization-based tradition in statistics and the linear model-based perspective more common in econometrics (Abadie, 2003; Imbens, 2014; Imbens and Rubin, 2015).

Armed with these estimators, we turn to two practical tools for decomposing treatment effect variation. The first is an omnibus test for the presence of systematic treatment effect variation. While versions of this test have been proposed previously, largely in the context of linear models (Cox, 1984; Crump et al., 2008), our proposed test is fully randomization-based and can also account for noncompliance. The second is to develop and bound an $R^2$-like measure of the fraction of treatment effect variation explained by covariates. This builds on previous versions proposed in the econometrics literature (Heckman et al., 1997; Djebbari and Smith, 2008), again extending results to account for noncompliance. This approach is also closely related to the Oaxaca–Blinder decomposition in economics (Oaxaca, 1973; Blinder, 1973). See Angrist et al. (2013) for a recent application that also addresses compliance.
Finally, we apply these methods to the Head Start Impact Study, a large-scale randomized trial of Head Start, a federally funded preschool program (Puma et al., 2010). We relegate the technical details and some further extensions to the online Supplementary Material.

2 Randomization inference for vector outcomes

Assume that we have $n$ units in an experiment. For unit $i$, let $X_i = (X_{1i}, \ldots, X_{Ki})^T \in \mathbb{R}^K$ denote the vector of pretreatment covariates, with the constant 1 as its first component. Let $T_i$ denote the treatment indicator, with 1 for treatment and 0 for control. We use the potential outcomes framework (Neyman, 1923; Rubin, 1974) to define causal effects. Under the Stable Unit Treatment Value Assumption (Rubin, 1980) that there is only one version of the treatment and no interference among units, we define $Y_i(1)$ and $Y_i(0)$ as the potential outcomes of unit $i$ under treatment and control, respectively. The observed outcome, $Y_i^{obs} = T_i Y_i(1) + (1 - T_i) Y_i(0)$, is quite general and includes continuous, binary, and zero-inflated cases. On the difference scale, the individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$.

Importantly, this is finite population inference in that we condition on the $n$ units at hand: the potential outcomes are fixed and pre-treatment. This differs from super population inference, in which some variables or residuals are assumed to be independent and identically distributed (iid) draws from some distribution. See, for example, Rosenbaum (2002), Imbens and Rubin (2015) and Li and Ding (2017). Under the potential outcomes framework, $\{Y_i(1), Y_i(0)\}_{i=1}^n$ are all fixed numbers; the randomness of any estimator comes from the assignment mechanism, which is the distribution of possible treatment assignments $T = (T_1, \ldots, T_n)^T$. Note that $\mathrm{pr}\{(T_1, \ldots, T_n) = (t_1, \ldots, t_n)\} = \binom{n}{n_1}^{-1}$ if $\sum_{i=1}^n t_i = n_1$.

To set up our overall framework, we first generalize Neyman (1923)'s classic results to vector outcomes. We consider a completely randomized experiment, with $n_1$ units assigned to treatment and $n_0$ units assigned to control; in total we have $\binom{n}{n_1}$ possible randomizations. We are interested in estimating the finite population average treatment effect on a vector outcome $V \in \mathbb{R}^K$:
$$\tau_V = \frac{1}{n} \sum_{i=1}^n \{V_i(1) - V_i(0)\},$$
where $V_i(1)$ and $V_i(0)$ are the potential outcomes of $V$ for unit $i$. The Neyman-type unbiased estimator for $\tau_V$ is the difference between the sample mean vectors of the observed outcomes under treatment and control:
$$\widehat{\tau}_V = \bar{V}_1^{obs} - \bar{V}_0^{obs} = \frac{1}{n_1} \sum_{i=1}^n T_i V_i^{obs} - \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) V_i^{obs} = \frac{1}{n_1} \sum_{i=1}^n T_i V_i(1) - \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) V_i(0).$$

The behavior of our estimator, and of our estimators for heterogeneity discussed later, revolves around covariances of vector outcomes. For notation, let $A = \{A_1, \ldots, A_n\}$ be a collection of $n$ vectors, with $\bar{A} = n^{-1} \sum_{i=1}^n A_i$ the vector mean, and define the covariance operator on $A$ as
$$S(A) = \frac{1}{n-1} \sum_{i=1}^n (A_i - \bar{A})(A_i - \bar{A})^T,$$
which gives the covariance matrix of the $n$ vectors in $A$. For example, $A_i$ can be $V_i(1)$, $V_i(0)$ or $V_i(1) - V_i(0)$. The following theorem, generalizing the results for scalar outcomes from Neyman (1923), demonstrates that $\widehat{\tau}_V$ is unbiased and gives its covariance matrix.

Theorem 1.
Over all possible randomizations of a completely randomized experiment, $\widehat{\tau}_V$ is unbiased for $\tau_V$, with $K \times K$ covariance matrix
$$\mathrm{cov}(\widehat{\tau}_V) = \frac{S\{V(1)\}}{n_1} + \frac{S\{V(0)\}}{n_0} - \frac{S\{V(1) - V(0)\}}{n}. \quad (1)$$
The diagonal elements of this matrix are the variances of the estimators of each component of $\tau_V$.

The covariance matrix of $\widehat{\tau}_V$ depends on the various covariances of the potential outcomes under treatment and control. In particular, the last term depends on the correlation between the potential outcomes $V(1)$ and $V(0)$, and therefore cannot be identified from the observed data. When the individual treatment effects are constant for all components of $V$, the last term in the above covariance matrix vanishes, because then $S\{V(1) - V(0)\} = 0_{K \times K}$. Under this assumption, we can unbiasedly estimate the sampling covariance matrix $\mathrm{cov}(\widehat{\tau}_V)$ by replacing the covariances of the potential outcomes by the sample analogues:
$$\widehat{\mathrm{cov}}(\widehat{\tau}_V) = \frac{\widehat{S}_1(V^{obs})}{n_1} + \frac{\widehat{S}_0(V^{obs})}{n_0},$$
where
$$\widehat{S}_t(V^{obs}) = \frac{1}{n_t - 1} \sum_{i=1}^n I(T_i = t)(V_i - \bar{V}_t^{obs})(V_i - \bar{V}_t^{obs})^T \quad (t = 0, 1) \quad (2)$$
are the sample covariance matrices of $V^{obs}$ in the treatment and control groups. Without the constant treatment effect assumption, the covariance estimator $\widehat{\mathrm{cov}}(\widehat{\tau}_V)$ is conservative in the sense that the difference between the expectation of the variance estimator and the true variance is a non-negative definite matrix. In particular, the diagonal terms of the expected estimator will all be larger than the truth. Letting $K = 1$, the covariance matrices become simple variances, which recovers Neyman's original result.

Using the mathematical framework introduced in the Appendix and in Li and Ding (2017), we can easily generalize Theorem 1 to more complicated experimental designs, e.g., cluster-randomized trials (Middleton and Aronow, 2015) and unbalanced $2^2$ split-plot designs (Zhao et al., 2017).
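To make Theorem 1 concrete, the following minimal sketch computes the vector difference-in-means estimator and the conservative covariance estimate. It is our own illustration in Python with NumPy; the function and variable names are not from the paper.

```python
import numpy as np

def neyman_vector(V, T):
    """Difference in mean vectors with the conservative covariance estimate.

    V is an (n, K) array of observed outcomes; T is a 0/1 assignment vector.
    """
    V1, V0 = V[T == 1], V[T == 0]
    n1, n0 = len(V1), len(V0)
    tau_hat = V1.mean(axis=0) - V0.mean(axis=0)
    # Sample covariance matrices of equation (2); np.cov already uses the
    # n_t - 1 denominator. Dropping the S{V(1) - V(0)}/n term makes the
    # estimate conservative when treatment effects are not constant.
    cov_hat = (np.atleast_2d(np.cov(V1, rowvar=False)) / n1
               + np.atleast_2d(np.cov(V0, rowvar=False)) / n0)
    return tau_hat, cov_hat
```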
3 Systematic treatment effect variation

We now apply this general framework to treatment effect variation. We decompose the individual treatment effect, $\tau_i$, via
$$\tau_i = Y_i(1) - Y_i(0) = X_i^T \beta + \varepsilon_i, \quad (i = 1, \ldots, n) \quad (3)$$
with $\beta$ being the finite population linear regression coefficient of $\tau_i$ on $X_i$, defined by
$$\beta = \arg\min_{b \in \mathbb{R}^K} \sum_{i=1}^n (\tau_i - X_i^T b)^2. \quad (4)$$
Following Heckman et al. (1997) and Djebbari and Smith (2008), we call $\delta_i = X_i^T \beta$ the systematic treatment effect variation explained by the observed covariates, $X_i$, and call $\varepsilon_i$ the idiosyncratic treatment effect variation not explained by $X_i$.

More generally, we can view this decomposition in a regression-style framework. Define
$$S_{xx} = \frac{1}{n} \sum_{i=1}^n X_i X_i^T \in \mathbb{R}^{K \times K}, \quad S_{x\varepsilon} = \frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i \in \mathbb{R}^K, \quad S_{x\tau} = \frac{1}{n} \sum_{i=1}^n \tau_i X_i \in \mathbb{R}^K,$$
where $S_{xx}$ is non-degenerate, analogous to the usual full rank assumption in linear models. Also define
$$S_{xt} = \frac{1}{n} \sum_{i=1}^n X_i Y_i(t) \in \mathbb{R}^K, \quad (t = 0, 1).$$
These are all finite population quantities, in that they are fixed pre-randomization values. The definition of $\beta$ gives $S_{x\varepsilon} = 0$, i.e., $\varepsilon_i$ and $X_i$ have finite population covariance zero. Therefore, in the spirit of the agnostic regression framework (e.g., Lin, 2013), the systematic component, $\delta_i = X_i^T \beta$, is a projection of $\tau_i$ onto the linear space spanned by $X_i$, and the idiosyncratic treatment effect, $\varepsilon_i$, is the corresponding residual. The linear projection applies to general outcomes, including the binary case.

Because of our finite population focus, if we observed all the potential outcomes we could immediately calculate all individual treatment effects, apply standard linear regression theory to (3), and obtain $\beta$. In particular, the solution of (4), i.e., the ordinary least squares (OLS) solution from regressing $\tau$ on $X$, is
$$\beta = S_{xx}^{-1} S_{x\tau} = S_{xx}^{-1} S_{x1} - S_{xx}^{-1} S_{x0} \equiv \gamma_1 - \gamma_0, \quad (5)$$
where $\gamma_1 = S_{xx}^{-1} S_{x1}$ and $\gamma_0 = S_{xx}^{-1} S_{x0}$ are the corresponding finite population regression coefficients of the potential outcomes on the covariates. Let $e_i(1) = Y_i(1) - X_i^T \gamma_1$ and $e_i(0) = Y_i(0) - X_i^T \gamma_0$ be the residual potential outcomes from the regression of $Y_i(t)$ onto $X_i$. Our idiosyncratic treatment variation is then the difference of residuals: $\varepsilon_i = e_i(1) - e_i(0)$. In practice, we do not fully observe these components, but we can obtain unbiased or consistent estimates for them as we discuss below.

We now turn to estimating $\beta$. As shown in (5), $\beta$ has three components. The first term, $S_{xx}$, is fully observed, as all the covariates are observed. Our estimation then depends on the sample analogues of $S_{x1}$ and $S_{x0}$:
$$\widehat{S}_{x1} = \frac{1}{n_1} \sum_{i=1}^n T_i Y_i^{obs} X_i \in \mathbb{R}^K, \quad \widehat{S}_{x0} = \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) Y_i^{obs} X_i \in \mathbb{R}^K.$$
The $\widehat{S}_{xt}$'s capture how the observed potential outcomes correlate with the covariates. We plug these into (5) to obtain an overall estimate of $\beta$. The randomization of $T$ then justifies the following theorem.

Theorem 2.
Under decomposition (3), $S_{xx}^{-1} \widehat{S}_{x1}$ and $S_{xx}^{-1} \widehat{S}_{x0}$ are unbiased estimates of $\gamma_1$ and $\gamma_0$, respectively. Therefore
$$\widehat{\beta}_{RI} = S_{xx}^{-1} \widehat{S}_{x1} - S_{xx}^{-1} \widehat{S}_{x0}$$
is an unbiased estimator for $\beta$, with covariance matrix
$$\mathrm{cov}(\widehat{\beta}_{RI}) = S_{xx}^{-1} \left[ \frac{S\{Y(1)X\}}{n_1} + \frac{S\{Y(0)X\}}{n_0} - \frac{S(\tau X)}{n} \right] S_{xx}^{-1}. \quad (6)$$

Here, for example, $S\{Y(0)X\}$ denotes the covariance operator on new unit-level variables $Y_i(0) X_i \in \mathbb{R}^K$, made by scaling the $X_i$ vector of each unit by $Y_i(0)$; similarly for $S\{Y(1)X\}$ and $S(\tau X)$. This slight abuse of notation gives formulae less cluttered by subscripts and excessive annotation. As with the vector version of Neyman's formula, the square root of the diagonal of $\mathrm{cov}(\widehat{\beta}_{RI})$ gives the standard errors of $\widehat{\beta}_{RI}$.

The covariance formula (6) generalizes the result of Neyman (1923) for the average treatment effect, reducing to Neyman's formula if $X_i = 1$ for all units. We can obtain a "conservative" estimate of $\mathrm{cov}(\widehat{\beta}_{RI})$ by
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{RI}) = S_{xx}^{-1} \left[ \frac{\widehat{S}_1(Y^{obs}X)}{n_1} + \frac{\widehat{S}_0(Y^{obs}X)}{n_0} \right] S_{xx}^{-1},$$
recalling the definitions of the sample covariance operators $\widehat{S}_1$ and $\widehat{S}_0$ introduced in (2). Similar to Neyman (1923), this implicitly assumes $S(\tau X) = 0$. Under the assumption that $\varepsilon_i = 0$ for all units (i.e., no idiosyncratic variation whatsoever), we can instead use $S(\widehat{\tau} X)$ with $\widehat{\tau}_i = X_i^T \widehat{\beta}_{RI}$ as a plug-in estimate for $S(\tau X)$. This yields tighter standard errors based on the diagonal elements of the covariance matrix.
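Computationally, $\widehat{\beta}_{RI}$ only involves the fixed design moment $S_{xx}$ and two within-arm averages. A sketch under our own naming, with `X` including a leading column of ones:

```python
import numpy as np

def beta_ri(X, Y, T):
    """Randomization-based estimator of Theorem 2 with conservative covariance."""
    n = len(Y)
    n1 = int(T.sum()); n0 = n - n1
    S_xx = X.T @ X / n                        # fixed and fully observed
    S_x1 = X[T == 1].T @ Y[T == 1] / n1       # unbiased for S_{x1}
    S_x0 = X[T == 0].T @ Y[T == 0] / n0       # unbiased for S_{x0}
    beta = np.linalg.solve(S_xx, S_x1 - S_x0)
    # Conservative covariance: drop S(tau X)/n and plug in the sample
    # covariances of the unit-level vectors Y_i X_i within each arm.
    C1 = np.atleast_2d(np.cov(X[T == 1] * Y[T == 1][:, None], rowvar=False))
    C0 = np.atleast_2d(np.cov(X[T == 0] * Y[T == 0][:, None], rowvar=False))
    S_inv = np.linalg.inv(S_xx)
    cov = S_inv @ (C1 / n1 + C0 / n0) @ S_inv
    return beta, cov
```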
Finite population asymptotic analysis. Theorem 2 holds for any finite sample. To obtain confidence intervals and to conduct hypothesis testing as we describe below, we need to prove further that $\widehat{\beta}_{RI}$ is asymptotically normal with mean $\beta$ and covariance $\mathrm{cov}(\widehat{\beta}_{RI})$. Finite population asymptotic analysis, however, has a slightly different flavor from the usual super population approach. Formally, the finite population asymptotic scheme embeds the finite population $\{(X_i, Y_i(1), Y_i(0))\}_{i=1}^n$ into a hypothetical sequence of finite populations with sizes approaching infinity. This effectively assumes that all the finite population quantities, for example, $S_{xx}$ and $\beta$, depend on $n$, although they are fixed numbers for a given finite population. Moreover, the sample quantities such as $\widehat{S}_{x1}$ and $\widehat{\beta}_{RI}$ depend on $n$ as well, and are random quantities due to the randomization of $T$. For notational simplicity, we drop the index $n$ for all these quantities. Importantly, we must impose some regularity conditions on the hypothetical sequence of finite populations. Throughout the paper, we invoke the following conditions for asymptotic analysis, which are required for a form of the finite population central limit theorem discussed in Li and Ding (2017, Theorem 5).

Condition 1. (i) Stable treatment proportions: $p_1 = n_1/n$ and $p_0 = n_0/n$ have positive limiting values; (ii) Stable means, variances and covariances: the finite population means, variances and covariances of the covariates and potential outcomes have finite limiting values; (iii) $\max_{1 \le i \le n} \|V_i - \bar{V}\|_2^2 / n \to 0$, where $V_i$ can be the covariate vector, the outcome, or products of them.
Under these conditions, we can extend Theorem 2 to a sequence of finite populations:
$$\sqrt{n}\,\big(\widehat{\beta}_{RI} - \beta\big) \xrightarrow{d} N\Big(0,\; \lim_{n \to \infty} S_{xx}^{-1}\big[ p_1^{-1} S\{Y(1)X\} + p_0^{-1} S\{Y(0)X\} - S(\tau X) \big] S_{xx}^{-1} \Big). \quad (7)$$
As a result, we can state that $\widehat{\beta}_{RI}$ is approximately normal with mean $\beta$ and covariance matrix (6), which allows us to construct confidence intervals and hypothesis tests. In our theory below, we use this informal statement instead of (7) to avoid notational complexity.

Conditions (i) and (ii) are natural. Condition (iii) holds if $V$ has more than two moments (Li and Ding, 2017). For bounded covariates and outcomes, (iii) is satisfied automatically. For more technical discussion of finite population causal inference, see Ding (2014), Aronow et al. (2014), and Middleton and Aronow (2015); for regularity conditions of the finite population central limit theorems, see Hájek (1960) and Lehmann (1998). A recent review is Li and Ding (2017).

Linear regression with treatment-covariate interactions. The results from randomization inference can shed light on the familiar case of linear regression with treatment-covariate interactions. This classical approach assumes the model
$$Y_i^{obs} = X_i^T \gamma + T_i X_i^T \beta + u_i, \quad (i = 1, \ldots, n) \quad (8)$$
where $\{u_i\}_{i=1}^n$ are errors implicitly assumed to induce the randomness, and where $\beta$ models systematic treatment effect variation, as in (3). Departing from much of the previous literature (e.g., Cox, 1984; Berrington de González and Cox, 2007; Crump et al., 2008), we study the properties of the least squares estimator under complete randomization, without assuming that model (8) is correctly specified. In particular, we do not assume any i.i.d. sampling; the assignment mechanism drives the distribution of the OLS estimator.

Theorem 3.
The OLS estimator for $\beta$ from fitting model (8) can be rewritten as
$$\widehat{\beta}_{OLS} = \widehat{S}_{xx,1}^{-1} \widehat{S}_{x1} - \widehat{S}_{xx,0}^{-1} \widehat{S}_{x0}, \quad \text{where} \quad \widehat{S}_{xx,t} = \frac{1}{n_t} \sum_{i=1}^n I(T_i = t) X_i X_i^T, \quad (t = 0, 1).$$
Over all possible randomizations of $T$, $\widehat{S}_{xx,1}^{-1} \widehat{S}_{x1}$ and $\widehat{S}_{xx,0}^{-1} \widehat{S}_{x0}$ are consistent estimates of $\gamma_1$ and $\gamma_0$, respectively; $\widehat{\beta}_{OLS}$ therefore follows an asymptotic normal distribution with mean $\beta$ and covariance matrix
$$\mathrm{cov}(\widehat{\beta}_{OLS}) = S_{xx}^{-1} \left[ \frac{S\{e(1)X\}}{n_1} + \frac{S\{e(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right] S_{xx}^{-1}, \quad (9)$$
with $e_i(1)$, $e_i(0)$, and $\varepsilon_i$ as defined after (5).

This estimate is simply the difference between $\widehat{\gamma}_{1,OLS} = \widehat{S}_{xx,1}^{-1} \widehat{S}_{x1}$ and $\widehat{\gamma}_{0,OLS} = \widehat{S}_{xx,0}^{-1} \widehat{S}_{x0}$, two OLS regressions run separately on each treatment arm. For treated units, define the residual $\widehat{e}_i = Y_i^{obs} - X_i^T \widehat{\gamma}_{1,OLS}$, and for control units, define the residual $\widehat{e}_i = Y_i^{obs} - X_i^T \widehat{\gamma}_{0,OLS}$. We can drop the unidentifiable term $S(\varepsilon X)$, estimate $S\{e(1)X\}$ and $S\{e(0)X\}$ by their sample analogues, and conservatively estimate the asymptotic covariance matrix (9) by
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{OLS}) = \widehat{S}_{xx,1}^{-1} \left[ \frac{\widehat{S}_1(\widehat{e}X)}{n_1} \right] \widehat{S}_{xx,1}^{-1} + \widehat{S}_{xx,0}^{-1} \left[ \frac{\widehat{S}_0(\widehat{e}X)}{n_0} \right] \widehat{S}_{xx,0}^{-1}.$$
This form of the sandwich variance estimator has the same probability limit as the Huber–White covariance estimator for linear model (8) (Huber, 1967; White, 1980; Lin, 2013; Angrist and Pischke, 2008).

Importantly, $\widehat{\beta}_{RI}$ and $\widehat{\beta}_{OLS}$ are quite similar in form. In particular, $\widehat{\beta}_{RI}$ uses the true $S_{xx}$ while $\widehat{\beta}_{OLS}$ separately estimates the covariance matrix for each treatment arm, $\widehat{S}_{xx,1}$ and $\widehat{S}_{xx,0}$. The latter is effectively a ratio estimator. Although this introduces some small bias (on the order of $1/n$), using the estimated $\widehat{S}_{xx,t}$ rather than the true $S_{xx}$ can often lead to gains in precision, especially when covariates are strongly correlated with the potential outcomes. In particular, the OLS estimator, by separately estimating the (known) $S_{xx}$ matrix for each treatment arm, can account for random imbalances in the covariates in both arms.

The RI estimator, by comparison, has no adjustment whatsoever, and so cannot account for such random covariate imbalances. However, in Section 3.4 below, and in the supplementary materials, we introduce a different form of adjustment that uses covariates to make the estimates of the $S_{xt}$ more precise. Depending on the structure of covariates, this estimator could be better or worse than OLS adjustment; we leave a thorough investigation of these trade-offs for future work.

Regardless, we again emphasize that we do not rely on classical OLS assumptions to justify the OLS estimator here. Rather, randomization (plus some mild regularity conditions for the finite sample asymptotics) justifies our results. For related discussion, see Cochran (1977) on ratio estimators in surveys.
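As an illustration of Theorem 3, the sketch below computes $\widehat{\beta}_{OLS}$ as the difference of two separate within-arm OLS fits, together with the conservative sandwich covariance. This reproduces the interaction coefficients from a fully-interacted regression of $Y$ on $X$ and $TX$; the names are ours.

```python
import numpy as np

def beta_ols(X, Y, T):
    """Difference of two within-arm OLS fits plus conservative sandwich."""
    est = {}
    for t in (0, 1):
        Xt, Yt = X[T == t], Y[T == t]
        nt = len(Yt)
        Sxx_t = Xt.T @ Xt / nt                  # arm-specific ratio estimate
        gamma_t = np.linalg.solve(Sxx_t, Xt.T @ Yt / nt)
        e_t = Yt - Xt @ gamma_t                 # within-arm residuals
        Se_t = np.atleast_2d(np.cov(Xt * e_t[:, None], rowvar=False))
        Sxx_inv = np.linalg.inv(Sxx_t)
        est[t] = (gamma_t, Sxx_inv @ (Se_t / nt) @ Sxx_inv)
    beta = est[1][0] - est[0][0]
    cov = est[1][1] + est[0][1]                 # drop S(eps X)/n: conservative
    return beta, cov
```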
Omnibus test for systematic treatment effect variation. Finally, we can use these results to develop an omnibus test for the presence of any systematic treatment effect variation. The null hypothesis of no treatment effect variation explained by the observed covariates can be characterized by
$$H_0(X): \beta_{-1} = 0,$$
where $\beta_{-1}$ contains all the components of $\beta$ except the first component, corresponding to the intercept. Under $H_0(X)$, the individual treatment effects have no linear dependence on $X$.

We then construct a Wald-type test for $H_0(X)$ using an estimator $\widehat{\beta}$ and its covariance estimator $\widehat{\mathrm{cov}}(\widehat{\beta})$; it could be $\widehat{\beta}_{RI}$ or $\widehat{\beta}_{OLS}$. Let $\widehat{\beta}_{-1}$ and $\widehat{\mathrm{cov}}(\widehat{\beta}_{-1})$ denote the sub-vector of $\widehat{\beta}$ and sub-matrix of $\widehat{\mathrm{cov}}(\widehat{\beta})$ corresponding to the non-intercept coordinates of $X$. We reject when
$$\widehat{\beta}_{-1}^T\, \big\{\widehat{\mathrm{cov}}(\widehat{\beta}_{-1})\big\}^{-1} \widehat{\beta}_{-1} > q_{K-1}(1 - \alpha), \quad (10)$$
where $q_{K-1}(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\chi^2$ random variable with $K - 1$ degrees of freedom. We can also include transformations of $X$ (or other basis functions) in the model for $\delta_i$ to allow for more flexible systematic treatment effect variation, which could enhance power or model more complex relationships between the covariates and treatment impact.

In the Supplementary Material, we describe two additional points about systematic treatment effect variation that we briefly address here. First, as mentioned above, we can use model-assisted estimation to improve the randomization-based estimator. In particular, improving estimation of the $\widehat{S}_{xt}$ directly improves $\widehat{\beta}_{RI}$, as the $\widehat{S}_{xt}$ are the only random components. If we replace the standard sample estimator, $\widehat{S}_{xt}$, by a more efficient, model-assisted estimator, as in survey sampling (Cochran, 1977; Särndal et al., 2003), we can achieve meaningful precision gains in practice. More importantly, this setup allows researchers to assess systematic variation across one set of covariates while adjusting for another set.

Second, under the assumption of no idiosyncratic variation (i.e., $\varepsilon_i = 0$ for all $i$), we can obtain exact inference for $\beta$ by inverting a sequence of randomization-based tests. This complements previous work on randomization-based tests for the presence of idiosyncratic treatment effect variation (Ding et al., 2016).
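Returning to the omnibus test, here is a sketch of (10). It assumes the intercept is the first coordinate of $X$, applies equally to $\widehat{\beta}_{RI}$ or $\widehat{\beta}_{OLS}$ with the matching covariance estimate, and uses SciPy for the $\chi^2$ quantile.

```python
import numpy as np
from scipy import stats

def omnibus_test(beta_hat, cov_hat, alpha=0.05):
    """Wald chi-square test of H_0(X), dropping the intercept coordinate."""
    b = beta_hat[1:]
    C = cov_hat[1:, 1:]
    w = float(b @ np.linalg.solve(C, b))      # Wald statistic
    df = len(b)                               # K - 1 degrees of freedom
    p_value = stats.chi2.sf(w, df)
    return w, p_value, w > stats.chi2.ppf(1 - alpha, df)
```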
4 Idiosyncratic treatment effect variation

After characterizing the systematic component of treatment effect variation, we now turn to characterizing the idiosyncratic component. Since this quantity is inherently unidentifiable, we propose sharp bounds on this component and a framework for sensitivity analysis. We then leverage these results to bound an $R^2$-like measure of the treatment effect variation explained by covariates.

We first define the main quantities of interest:
$$S_{\tau\tau} = \frac{1}{n} \sum_{i=1}^n (\tau_i - \tau)^2, \quad S_{\delta\delta} = \frac{1}{n} \sum_{i=1}^n (\delta_i - \tau)^2, \quad S_{\varepsilon\varepsilon} = \frac{1}{n} \sum_{i=1}^n \varepsilon_i^2,$$
with $\delta_i$ and $\varepsilon_i$ defined as in (3). Then $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$. We can immediately estimate $S_{\delta\delta}$ via the sample variance of $\{\widehat{\delta}_i = X_i^T \widehat{\beta}\}_{i=1}^n$, where $\widehat{\beta}$ is a consistent estimator, e.g., $\widehat{\beta}_{RI}$ or $\widehat{\beta}_{OLS}$. However, the idiosyncratic variance, $S_{\varepsilon\varepsilon}$, is inherently unidentifiable because it depends on the joint distribution of the potential outcomes.

We can, however, derive sharp bounds for $S_{\varepsilon\varepsilon}$. Let $F_1(y)$ and $F_0(y)$ be the empirical cumulative distribution functions of $\{e_i(1)\}_{i=1}^n$ and $\{e_i(0)\}_{i=1}^n$. Below we denote by $e(t)$ a random variable taking equal probabilities on the $n$ values of $\{e_i(t)\}_{i=1}^n$. Based on the Fréchet–Hoeffding bounds (Hoeffding, 1941; Fréchet, 1951; Nelsen, 2007), we can bound $S_{\varepsilon\varepsilon}$ as follows.

Theorem 4. $S_{\varepsilon\varepsilon}$ has sharp bounds $\underline{S}_{\varepsilon\varepsilon} \le S_{\varepsilon\varepsilon} \le \overline{S}_{\varepsilon\varepsilon}$, where
$$\underline{S}_{\varepsilon\varepsilon} = \int_0^1 \{F_1^{-1}(u) - F_0^{-1}(u)\}^2\, du, \quad \overline{S}_{\varepsilon\varepsilon} = \int_0^1 \{F_1^{-1}(u) - F_0^{-1}(1 - u)\}^2\, du,$$
with $F_t^{-1}(u) = \inf\{x : F_t(x) \ge u\}$ as the quantile function. The lower and upper bounds are attainable when $e(1)$ and $e(0)$ have the same ranks and opposite ranks, respectively.

The lower bound $\underline{S}_{\varepsilon\varepsilon}$ corresponds to a rank-preserving relationship between $e(1)$ and $e(0)$, and the upper bound $\overline{S}_{\varepsilon\varepsilon}$ corresponds to an anti-rank-preserving relationship between $e(1)$ and $e(0)$. Equivalently, they correspond to the cases where the Spearman rank correlation coefficients between $e(1)$ and $e(0)$ are $+1$ and $-1$.

In practice, we can often sharpen these bounds because we are unlikely to have negatively associated potential outcomes after adjusting for covariates. If we assume a nonnegative correlation between $e(1)$ and $e(0)$, we have the following corollary:

Corollary 1.
If the correlation between $e(1)$ and $e(0)$ is nonnegative, then the bounds for $S_{\varepsilon\varepsilon}$ become
$$\underline{S}_{\varepsilon\varepsilon} \le S_{\varepsilon\varepsilon} \le V_1 + V_0,$$
where $V_t$ is the variance of $e(t)$ for $t = 0, 1$.

We can consistently estimate each quantity: $S_{\delta\delta}$ by the sample variance of the $X_i^T \widehat{\beta}$; $F_1(y)$ and $F_0(y)$ by $\widehat{F}_1(y)$ and $\widehat{F}_0(y)$, the empirical cumulative distribution functions of the residuals $\widehat{e}_i$ under treatment and control; and $V_1$ and $V_0$ by the sample variances of the $\widehat{e}_i$ in the treatment and control groups.
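The plug-in bounds only require empirical quantiles of the two residual samples. The sketch below approximates the integrals of Theorem 4 on a midpoint grid (the grid size is an arbitrary choice of ours) and also returns the improved upper bound of Corollary 1.

```python
import numpy as np

def s_ee_bounds(e1, e0, grid=10000):
    """Plug-in Frechet-Hoeffding bounds for S_ee from residual samples."""
    u = (np.arange(grid) + 0.5) / grid        # midpoint grid on (0, 1)
    q1, q0 = np.quantile(e1, u), np.quantile(e0, u)
    lower = np.mean((q1 - q0) ** 2)           # comonotonic: same ranks
    upper = np.mean((q1 - q0[::-1]) ** 2)     # antitonic: opposite ranks
    upper_nonneg = np.var(e1) + np.var(e0)    # Corollary 1: V_1 + V_0
    return lower, upper, upper_nonneg
```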
Variance of the overall ITT estimator. We can use these results to obtain sharper bounds on the variance of Neyman (1923)'s estimate of the overall ITT, $\widehat{\tau} = n_1^{-1} \sum_{i=1}^n T_i Y_i^{obs} - n_0^{-1} \sum_{i=1}^n (1 - T_i) Y_i^{obs}$, extending previous work by Heckman et al. (1997) and Aronow et al. (2014). See also Fogarty (2016). Applying the results in Section 2 for scalar outcomes, we have the following variance for the difference-in-means estimator:
$$\mathrm{var}(\widehat{\tau}) = \frac{S\{Y(1)\}}{n_1} + \frac{S\{Y(0)\}}{n_0} - \left( \frac{S_{\delta\delta}}{n} + \frac{S_{\varepsilon\varepsilon}}{n} \right),$$
using $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$. As we discuss above, Neyman (1923) proposed a lower bound for the overall $\mathrm{var}(\widehat{\tau})$ under the assumption of a constant treatment effect, $S_{\tau\tau} = 0$. More recently, Aronow et al. (2014) instead proposed to bound $S_{\tau\tau}$ via the Fréchet–Hoeffding bounds. We can modestly improve these results by applying the Fréchet–Hoeffding bounds to $S_{\varepsilon\varepsilon}$ alone rather than to $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$. So long as $S_{\delta\delta} > 0$, this yields strictly tighter bounds on $\mathrm{var}(\widehat{\tau})$ than the corresponding bounds that do not incorporate covariate information. In turn, this gives a tighter estimate of the standard error for the same difference-in-means estimator, $\widehat{\tau}$.
A variance ratio test. Finally, while the relationship between $e(0)$ and $e(1)$ is inherently unidentifiable, there is some information in the data about the relationship between $\varepsilon_i$, the individual-level idiosyncratic treatment effect, and $Y_i(0)$, the control potential outcome. In particular, Raudenbush and Bloom (2015) noted that if the variance of the treatment potential outcomes is smaller than the variance of the control potential outcomes, then the treatment effect must be negatively associated with the control potential outcomes. In the Supplementary Material, we extend this result to incorporate covariates and propose a formal test.

Sensitivity analysis. Going beyond worst-case bounds, we can assess the sensitivity of our estimate of $S_{\varepsilon\varepsilon}$ to different assumptions on the dependence between the potential outcomes. Using the probability integral transformation, we represent the residual potential outcomes as
$$e(1) = F_1^{-1}(U_1), \quad e(0) = F_0^{-1}(U_0), \quad U_1, U_0 \sim \mathrm{Uniform}(0, 1).$$
Therefore, the dependence of the potential outcomes is determined by the dependence of the uniform random variables $U_1$ and $U_0$, which are the standardized ranks of the potential outcomes. When $U_1 = U_0$, $S_{\varepsilon\varepsilon}$ attains the lower bound $\underline{S}_{\varepsilon\varepsilon}$; when $U_1 = 1 - U_0$, $S_{\varepsilon\varepsilon}$ attains the upper bound $\overline{S}_{\varepsilon\varepsilon}$; when $U_1 \perp U_0$, $S_{\varepsilon\varepsilon}$ attains the improved upper bound $V_1 + V_0$.

Rather than simply examine extreme scenarios of $S_{\varepsilon\varepsilon}$, we can instead draw $U_1$ from a mixture of $U_0$ and another independent uniform random variable $V$:
$$U_1 = \begin{cases} U_0 & \text{with probability } \rho, \\ V & \text{with probability } 1 - \rho, \end{cases} \qquad U_0, V \stackrel{iid}{\sim} \mathrm{Uniform}(0, 1), \quad (11)$$
where the sensitivity parameter $\rho$ captures the association between $U_1$ and $U_0$. An immediate interpretation of $\rho$ is as the proportion of rank-preserved units, with the other $1 - \rho$ as the proportion of units with independent treatment and control residual outcomes. When $\rho = 0$, $U_1 \perp U_0$, and the residual potential outcomes are independent; when $\rho = 1$, $U_1 = U_0$, and the residual potential outcomes have the same ranks. Values of $\rho$ between 0 and 1 correspond to positive rank correlation but not full rank preservation. Note that the representation of the joint distribution is not unique, because we can choose any copula as a joint distribution of $(U_1, U_0)$ (Nelsen, 2007). We choose the above representation and notation $\rho$ for the following theorem.

Theorem 5.
If representation (11) holds, then $\rho$ is Spearman's rank correlation coefficient between $e(1)$ and $e(0)$. Furthermore, $S_{\varepsilon\varepsilon}$ is a linear function of $\rho$:
$$S_{\varepsilon\varepsilon}(\rho) = \rho \underline{S}_{\varepsilon\varepsilon} + (1 - \rho)(V_1 + V_0).$$

We cannot extract any information about $\rho$ from the data. We therefore treat $\rho$ as a sensitivity parameter, choose a plausible range of $\rho$, and obtain corresponding values for $S_{\varepsilon\varepsilon}$.

Treatment effect $R^2$. A natural question is the relative magnitudes of $S_{\delta\delta}$ and $S_{\varepsilon\varepsilon}$ (Djebbari and Smith, 2008). Continuing the regression analogy, this is an $R^2$-like measure for the proportion of total treatment effect variation explained by the systematic component:
$$R^2_\tau = \frac{S_{\delta\delta}}{S_{\tau\tau}} = \frac{S_{\delta\delta}}{S_{\delta\delta} + S_{\varepsilon\varepsilon}},$$
which is the ratio between the finite population variances of $\delta$ and $\tau$. As above, we can directly estimate $S_{\delta\delta}$ but must bound $S_{\varepsilon\varepsilon}$. Applying Theorem 4, we obtain the following bounds on $R^2_\tau$.

Corollary 2.
The sharp bounds on $R^2_\tau$ are
$$\frac{S_{\delta\delta}}{S_{\delta\delta} + \overline{S}_{\varepsilon\varepsilon}} \le R^2_\tau \le \frac{S_{\delta\delta}}{S_{\delta\delta} + \underline{S}_{\varepsilon\varepsilon}}.$$
If we further assume that the correlation between $e(1)$ and $e(0)$ is nonnegative, the sharp bounds on $R^2_\tau$ are
$$\frac{S_{\delta\delta}}{S_{\delta\delta} + V_1 + V_0} \le R^2_\tau \le \frac{S_{\delta\delta}}{S_{\delta\delta} + \underline{S}_{\varepsilon\varepsilon}}.$$

We estimate these bounds via plug-in estimates. Note that Djebbari and Smith (2008) explore a similar quantity by using a permutation approach to approximate the Fréchet–Hoeffding upper and lower bounds. Finally, we can use the sensitivity results for $S_{\varepsilon\varepsilon}$, with values of $\rho \in [0, 1]$, to obtain
$$R^2_\tau(\rho) = \frac{S_{\delta\delta}}{S_{\delta\delta} + S_{\varepsilon\varepsilon}(\rho)}.$$
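Theorem 5 and Corollary 2 reduce the sensitivity analysis to a one-line computation; a sketch under our naming:

```python
import numpy as np

def r2_tau_curve(s_dd, s_ee_lower, v1, v0, rhos=None):
    """R^2_tau(rho) along a grid of the sensitivity parameter (Theorem 5)."""
    rhos = np.linspace(0.0, 1.0, 101) if rhos is None else rhos
    s_ee = rhos * s_ee_lower + (1.0 - rhos) * (v1 + v0)
    return rhos, s_dd / (s_dd + s_ee)
```

Plotting the returned values against $\rho$ gives the kind of sensitivity curves we report in the application below.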
5 Noncompliance

We now extend our results to allow for noncompliance. Let $T$ be the indicator of treatment assigned, $D$ be the indicator of treatment received, $Y$ be the outcome of interest, and $X$ be pretreatment covariates. Under the Stable Unit Treatment Value Assumption, we define $D_i(t)$ and $Y_i(t)$ as the potential outcomes for unit $i$ under treatment assignment $t$. Following Angrist et al. (1996) and Frangakis and Rubin (2002), we can classify units into four compliance types based on the joint values of $D_i(1)$ and $D_i(0)$:
$$U_i = \begin{cases} \text{Always Taker } (a) & \text{if } D_i(1) = 1, D_i(0) = 1, \\ \text{Never Taker } (n) & \text{if } D_i(1) = 0, D_i(0) = 0, \\ \text{Complier } (c) & \text{if } D_i(1) = 1, D_i(0) = 0, \\ \text{Defier } (d) & \text{if } D_i(1) = 0, D_i(0) = 1. \end{cases}$$
Denote by $n_u$ and $\pi_u$ the number and proportion of units of compliance type $U = u$, for $u = a, n, c, d$.

Throughout our discussion, we invoke the following assumptions, which are commonly used for analyzing randomized experiments with noncompliance.

Assumption 1. (i) Monotonicity: $D_i(1) \ge D_i(0)$; (ii) Exclusion restrictions for Always Takers and Never Takers: $Y_i(1) = Y_i(0)$ for all units with $D_i(1) = D_i(0)$; (iii) Strong instrument: $\pi_c > C > 0$, where $C$ is a positive constant independent of the sample size.

Monotonicity rules out the existence of Defiers, i.e., $\pi_d = 0$. Under monotonicity, we can estimate the proportions $\pi_u$ using the observed counts of units classified by $T$ and $D$: let $n_{td} = \#\{i : T_i = t, D_i = d\}$, and then $\widehat{\pi}_n = n_{10}/n_1$, $\widehat{\pi}_a = n_{01}/n_0$, and $\widehat{\pi}_c = n_{11}/n_1 - n_{01}/n_0$. The exclusion restrictions assume that treatment assignment has no effect on the outcome for Always Takers and Never Takers. As a result, treatment effect variation is trivially zero for Always Takers and Never Takers. Note that this is the unit-level exclusion restriction imposed in Angrist et al. (1996). This can be relaxed in other settings; for example, we could assume the impact of randomization for these groups is zero on average (see Imbens and Rubin, 2015). Finally, to avoid technical complexity, we rule out the weak instrument case (Bound et al., 1995; Staiger and Stock, 1997), i.e., the case where $\pi_c$ is within a small neighborhood of 0 with radius shrinking to 0.
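These moment estimators are simple functions of the observed $(T, D)$ cross-tabulation; a minimal sketch, with our own naming:

```python
import numpy as np

def compliance_proportions(T, D):
    """Moment estimates of the compliance-type proportions under monotonicity."""
    T, D = np.asarray(T), np.asarray(D)
    n1, n0 = (T == 1).sum(), (T == 0).sum()
    pi_n = ((T == 1) & (D == 0)).sum() / n1          # pi_n = n_10 / n_1
    pi_a = ((T == 0) & (D == 1)).sum() / n0          # pi_a = n_01 / n_0
    pi_c = ((T == 1) & (D == 1)).sum() / n1 - pi_a   # pi_c = n_11/n_1 - n_01/n_0
    return pi_c, pi_a, pi_n
```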
We are interested in treatment effect variation among Compliers, which motivates the following decomposition:
$$\tau_i = Y_i(1) - Y_i(0) = \begin{cases} 0, & \text{if } U_i = a \text{ or } n, \\ X_i^T \beta_c + \varepsilon_i, & \text{if } U_i = c, \end{cases} \quad (12)$$
where $\beta_c$ is the regression coefficient of $\tau_i$ on $X_i$ among Compliers, analogous to (3).

We now extend the results of Section 3 to estimate systematic treatment effect variation among Compliers. Define
$$S_{xx,u} = \frac{1}{n_u} \sum_{i=1}^n I(U_i = u) X_i X_i^T, \quad S_{xt,u} = \frac{1}{n_u} \sum_{i=1}^n I(U_i = u) Y_i(t) X_i, \quad (t = 0, 1; \; u = a, c, n).$$
Then, analogous to (5),
$$\beta_c = S_{xx,c}^{-1}(S_{x1,c} - S_{x0,c}) = S_{xx,c}^{-1} S_{x1,c} - S_{xx,c}^{-1} S_{x0,c} \equiv \gamma_{1c} - \gamma_{0c}, \quad (13)$$
where $\gamma_{1c} = S_{xx,c}^{-1} S_{x1,c}$ and $\gamma_{0c} = S_{xx,c}^{-1} S_{x0,c}$ are the linear regression coefficients of $Y(1)$ and $Y(0)$ on the covariates among Compliers.

Unlike in the ITT case, we cannot estimate these quantities directly. Instead, following standard results from noncompliance (e.g., Angrist et al., 1996; Abadie, 2003; Angrist and Pischke, 2008), we use estimates from observed subgroups to estimate the desired quantities of interest. Define sample moments:
$$\widehat{S}_{xx,td} = \frac{1}{n_t} \sum_{i=1}^n I(T_i = t) I(D_i = d) X_i X_i^T, \quad \widehat{S}_{x,td} = \frac{1}{n_t} \sum_{i=1}^n I(T_i = t) I(D_i = d) Y_i^{obs} X_i, \quad (t, d = 0, 1). \quad (14)$$
The following theorem connects these quantities with the finite population quantities in (13).

Theorem 6.
Over all possible randomizations of a completely randomized experiment, both $\widehat{S}_{xx}(1) = \widehat{S}_{xx,11} - \widehat{S}_{xx,01}$ and $\widehat{S}_{xx}(0) = \widehat{S}_{xx,00} - \widehat{S}_{xx,10}$ are unbiased for $\pi_c S_{xx,c}$, and
$$E(\widehat{S}_{x,11} - \widehat{S}_{x,01}) = \pi_c S_{x1,c}, \quad E(\widehat{S}_{x,00} - \widehat{S}_{x,10}) = \pi_c S_{x0,c}. \quad (15)$$

This theorem shows that we can obtain unbiased estimates for all terms in (13). The following corollary shows that we can then obtain consistent estimates for $\gamma_{1c}$, $\gamma_{0c}$, and $\beta_c$, recalling that in the asymptotic analysis we need to embed $\{X_i, Y_i(1), Y_i(0), D_i(1), D_i(0)\}_{i=1}^n$ into a hypothetical sequence of finite populations under Condition 1.

Corollary 3. $\widehat{\gamma}_{1c,RI} = \widehat{S}_{xx}(1)^{-1}(\widehat{S}_{x,11} - \widehat{S}_{x,01})$ and $\widehat{\gamma}_{0c,RI} = \widehat{S}_{xx}(0)^{-1}(\widehat{S}_{x,00} - \widehat{S}_{x,10})$ are consistent for $\gamma_{1c}$ and $\gamma_{0c}$. Furthermore, $\widehat{\beta}_{c,RI} = \widehat{\gamma}_{1c,RI} - \widehat{\gamma}_{0c,RI}$ is consistent for $\beta_c$ and follows an asymptotic normal distribution with covariance matrix
$$\mathrm{cov}(\widehat{\beta}_{c,RI}) = (\pi_c S_{xx,c})^{-1} \left[ \frac{S\{e'(1)X\}}{n_1} + \frac{S\{e'(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right] (\pi_c S_{xx,c})^{-1}, \quad (16)$$
where we define the residual potential outcomes to be:
$$e_i'(1) = \begin{cases} Y_i(1) - X_i^T \gamma_{1c}, & U_i = a, \\ Y_i(1) - X_i^T \gamma_{0c}, & U_i = n, \\ Y_i(1) - X_i^T \gamma_{1c}, & U_i = c, \end{cases} \qquad e_i'(0) = \begin{cases} Y_i(0) - X_i^T \gamma_{1c}, & U_i = a, \\ Y_i(0) - X_i^T \gamma_{0c}, & U_i = n, \\ Y_i(0) - X_i^T \gamma_{0c}, & U_i = c. \end{cases} \quad (17)$$
The idiosyncratic variation is $\varepsilon_i = e_i'(1) - e_i'(0)$ for unit $i$, with $\varepsilon_i = 0$ for Never Takers and Always Takers, and with $\varepsilon_i$ for Compliers as in (12). The two sets of residuals are not formed from a regression on all units, but instead from the population regression on Compliers alone. As in the ITT case, we can estimate $S\{e'(1)X\}$ and $S\{e'(0)X\}$ using their sample analogues; $S(\varepsilon X)$, however, is unidentifiable. For units with $D_i = 1$, we define the residual $\widehat{e}_i' = Y_i^{obs} - X_i^T \widehat{\gamma}_{1c,RI}$, and for units with $D_i = 0$, we define the residual $\widehat{e}_i' = Y_i^{obs} - X_i^T \widehat{\gamma}_{0c,RI}$. Therefore, we can obtain a conservative estimate of the asymptotic covariance (16) in the following sandwich form:
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{c,RI}) = \widehat{S}_{xx}(1)^{-1} \left[ \frac{\widehat{S}_1(\widehat{e}'X)}{n_1} \right] \widehat{S}_{xx}(1)^{-1} + \widehat{S}_{xx}(0)^{-1} \left[ \frac{\widehat{S}_0(\widehat{e}'X)}{n_0} \right] \widehat{S}_{xx}(0)^{-1}.$$
As with the ITT analog, so long as we have Assumption 1, randomization itself fully justifies the theorem and estimators, without relying on a model of the observed outcomes.
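A sketch of the Complier estimator $\widehat{\beta}_{c,RI}$ from Corollary 3, built from the four $(T, D)$-cell moments in (14) and differenced as in Theorem 6; names are ours.

```python
import numpy as np

def beta_c_ri(X, Y, T, D):
    """Complier coefficient via differenced (T, D)-cell moments (Theorem 6)."""
    n1, n0 = (T == 1).sum(), (T == 0).sum()

    def cell(t, d, nt):
        m = (T == t) & (D == d)
        return X[m].T @ X[m] / nt, X[m].T @ Y[m] / nt

    Sxx_11, Sxy_11 = cell(1, 1, n1)   # Compliers + Always Takers, treated
    Sxx_01, Sxy_01 = cell(0, 1, n0)   # Always Takers, control
    Sxx_00, Sxy_00 = cell(0, 0, n0)   # Compliers + Never Takers, control
    Sxx_10, Sxy_10 = cell(1, 0, n1)   # Never Takers, treated
    gamma_1c = np.linalg.solve(Sxx_11 - Sxx_01, Sxy_11 - Sxy_01)
    gamma_0c = np.linalg.solve(Sxx_00 - Sxx_10, Sxy_00 - Sxy_10)
    return gamma_1c - gamma_0c
```

The same differenced matrices $\widehat{S}_{xx}(1)$ and $\widehat{S}_{xx}(0)$ reappear in the sandwich covariance above.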
Two-stage least squares. We now turn to the standard two-stage least squares (TSLS) setting in econometrics (e.g., Angrist and Pischke, 2008). First, we impose a linear regression model with treatment-covariate interactions:
$$Y_i^{obs} = X_i^T \gamma + D_i X_i^T \beta + u_i, \quad (i = 1, \ldots, n).$$
This model does not, however, account for the dependence between $D_i$ and $u_i$. In the language of econometrics, the treatment received is "endogenous," i.e., $D_i$ and the error term $u_i$ are assumed to be correlated; we therefore use $T_i$ as an instrument for $D_i$. The TSLS estimates $(\widehat{\gamma}_{TSLS}, \widehat{\beta}_{TSLS})$ are the solutions to the following estimating equations:
$$n^{-1} \sum_{i=1}^n \begin{pmatrix} X_i \\ T_i X_i \end{pmatrix} \big(Y_i^{obs} - X_i^T \widehat{\gamma}_{TSLS} - D_i X_i^T \widehat{\beta}_{TSLS}\big) = 0. \quad (18)$$
This approach is based on $M$-estimation, though there are many other ways to formalize the TSLS estimator (e.g., Imbens, 2014). The following theorem shows that the fully-interacted TSLS estimator $\widehat{\beta}_{TSLS}$ is consistent for $\beta_c$ across randomizations.

Theorem 7.
Over all randomizations, the TSLS estimator $\widehat{\beta}_{TSLS}$ follows an asymptotic normal distribution with mean $\beta_c$ and covariance matrix
$$(\pi_c S_{xx,c})^{-1} \left[ \frac{S\{e''(1)X\}}{n_1} + \frac{S\{e''(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right] (\pi_c S_{xx,c})^{-1},$$
where the residual potential outcomes are defined as
$$e_i''(1) = \begin{cases} Y_i(1) - X_i^T(\gamma_\infty + \beta_c), & U_i = a, \\ Y_i(1) - X_i^T \gamma_\infty, & U_i = n, \\ Y_i(1) - X_i^T(\gamma_\infty + \beta_c), & U_i = c, \end{cases} \qquad e_i''(0) = \begin{cases} Y_i(0) - X_i^T(\gamma_\infty + \beta_c), & U_i = a, \\ Y_i(0) - X_i^T \gamma_\infty, & U_i = n, \\ Y_i(0) - X_i^T \gamma_\infty, & U_i = c, \end{cases}$$
where $\gamma_\infty$ is the probability limit of the TSLS regression coefficient $\widehat{\gamma}_{TSLS}$, and the idiosyncratic treatment effect is $\varepsilon_i \equiv e_i''(1) - e_i''(0)$.

For variance estimation, define the residual as $\widehat{e}_i'' = Y_i^{obs} - X_i^T(\widehat{\gamma}_{TSLS} + \widehat{\beta}_{TSLS})$ for units with $D_i = 1$, and $\widehat{e}_i'' = Y_i^{obs} - X_i^T \widehat{\gamma}_{TSLS}$ for units with $D_i = 0$. We can then use the following sandwich variance estimator:
$$\widehat{\mathrm{cov}}(\widehat{\beta}_{TSLS}) = \widehat{S}_{xx}(1)^{-1} \left[ \frac{\widehat{S}_1(\widehat{e}''X)}{n_1} \right] \widehat{S}_{xx}(1)^{-1} + \widehat{S}_{xx}(0)^{-1} \left[ \frac{\widehat{S}_0(\widehat{e}''X)}{n_0} \right] \widehat{S}_{xx}(0)^{-1},$$
which has the same probability limit as the Huber–White covariance estimator for $\widehat{\beta}_{TSLS}$. Therefore, the randomization itself effectively justifies the use of TSLS for estimating systematic treatment effect variation among Compliers, extending our ITT results.

Finally, while $\widehat{\beta}_{TSLS}$ is a consistent estimator for $\beta_c$, $\widehat{\gamma}_{TSLS}$ is not, in general, a consistent estimator for $\gamma_{0c}$; that is, $\gamma_\infty \ne \gamma_{0c}$. Instead, $\widehat{\gamma}_{TSLS}$ converges to $\gamma_\infty = S_{xx}^{-1} S_{x0} - \pi_a S_{xx}^{-1} S_{xx,a} \beta_c$. In the special case of one-sided noncompliance (i.e., $\pi_a = 0$), $\gamma_\infty = \gamma_0 = S_{xx}^{-1} S_{x0}$, the population OLS regression coefficient, among all Compliers and Never Takers, of $Y(0)$ on the covariates.
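Because (18) is just-identified, $(\widehat{\gamma}_{TSLS}, \widehat{\beta}_{TSLS})$ can be computed in one line of linear algebra, instrumenting the regressors $(X_i, D_i X_i)$ with $(X_i, T_i X_i)$. A sketch, with our own naming:

```python
import numpy as np

def beta_tsls(X, Y, T, D):
    """Fully-interacted TSLS: instrument (X, T*X) for regressors (X, D*X)."""
    T, D = np.asarray(T, float), np.asarray(D, float)
    K = X.shape[1]
    Z = np.hstack([X, T[:, None] * X])          # instruments
    W = np.hstack([X, D[:, None] * X])          # endogenous design
    theta = np.linalg.solve(Z.T @ W, Z.T @ Y)   # just-identified IV solution
    return theta[:K], theta[K:]                 # (gamma_hat, beta_hat)
```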
Omnibus test for systematic treatment effect variation among Compliers. With a point estimate $\widehat{\beta}$ and covariance estimate $\widehat{\mathrm{cov}}(\widehat{\beta})$ for $\beta_c$, we can use the same Wald-type $\chi^2$ test as in (10) for the presence of systematic treatment effect variation among Compliers. Here, the estimator can be either the randomization-based $\widehat{\beta}_{c,RI}$ or the TSLS estimator $\widehat{\beta}_{TSLS}$; the degrees of freedom are the same, $K - 1$. Unlike in the ITT case, we are not aware of existing tests for systematic treatment effect variation among Compliers.
Decomposing overall treatment effect variation. We now turn to decomposing the overall treatment effect variation in the presence of noncompliance. In this setting, we have three sources of treatment effect variation: (i) systematic treatment effect variation among Compliers, (ii) idiosyncratic treatment effect variation among Compliers, and (iii) treatment effect variation due to noncompliance.

First, recall that the total treatment effect variation is $S_{\tau\tau} = \sum_{i=1}^n (\tau_i - \tau)^2 / n$. We can define a similar quantity among Compliers:
$$S_{\tau\tau,c} = \frac{1}{n_c} \sum_{i=1}^n I(U_i = c)(\tau_i - \tau_c)^2.$$
As in Section 4, we can decompose this variation into systematic and idiosyncratic treatment effect variation for Compliers, respectively:
$$S_{\delta\delta,c} = \frac{1}{n_c} \sum_{i=1}^n I(U_i = c)(\delta_i - \tau_c)^2, \quad S_{\varepsilon\varepsilon,c} = \frac{1}{n_c} \sum_{i=1}^n I(U_i = c)\, \varepsilon_i^2.$$
Because treatment effects for Never Takers and Always Takers are zero, there is no treatment effect variation within these groups. The component of treatment effect variation due to compliance status is
$$S_{\tau\tau,U} = \sum_{u = c, a, n} \pi_u (\tau_u - \tau)^2.$$
Using $\tau_a = \tau_n = 0$ and $\tau = \pi_c \tau_c$, which follow from the exclusion restrictions, we have the following theorem summarizing the relationships among the above components.

Theorem 8. $S_{\tau\tau} = \pi_c S_{\tau\tau,c} + S_{\tau\tau,U}$, $S_{\tau\tau,c} = S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}$, and $S_{\tau\tau,U} = \pi_c(1 - \pi_c)\tau_c^2$.
In words, total treatment effect variation has three parts: (i) systematic treatment effect variation among Compliers, $\pi_c S_{\delta\delta,c}$; (ii) idiosyncratic treatment effect variation among Compliers, $\pi_c S_{\varepsilon\varepsilon,c}$; and (iii) treatment effect variation due to noncompliance, $S_{\tau\tau,U}$.

As in the ITT case, even though $S_{\varepsilon\varepsilon,c}$ is not identifiable, we can derive bounds in terms of the marginal distributions of the residuals, $\{e_i'(1) = Y_i(1) - X_i^T \gamma_{1c} : U_i = c, i = 1, \ldots, n\}$ and $\{e_i'(0) = Y_i(0) - X_i^T \gamma_{0c} : U_i = c, i = 1, \ldots, n\}$, denoted by $F_{1c}(y)$ and $F_{0c}(y)$, and with marginal variances $V_{1c}$ and $V_{0c}$. Once we estimate these quantities, we can plug them into Theorem 4 and Corollary 1 to get our bounds. As compliance status is only partially observed, we have to estimate these quantities by differencing observed distributions; we defer this and some other technical details to the Supplementary Material.
Since there are two sources of variation (covariates and noncompliance), there are three possible $R^2$-type measures. First, we can measure the treatment effect variation explained by noncompliance alone (i.e., only $U$):
$$R^2_{\tau,U} = \frac{S_{\tau\tau,U}}{S_{\tau\tau}} = \frac{S_{\tau\tau,U}}{S_{\tau\tau,U} + \pi_c S_{\tau\tau,c}} = \frac{S_{\tau\tau,U}}{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c} + \pi_c S_{\varepsilon\varepsilon,c}}.$$
Second, we can measure the proportion of treatment effect variation among Compliers explained by covariates (i.e., only $X$):
$$R^2_{\tau,c} = \frac{S_{\delta\delta,c}}{S_{\tau\tau,c}} = \frac{S_{\delta\delta,c}}{S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}}.$$
Third, we can measure the treatment effect variation explained by covariates and noncompliance together (i.e., both $X$ and $U$):
$$R^2_{\tau,UX} = \frac{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c}}{S_{\tau\tau}} = \frac{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c}}{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c} + \pi_c S_{\varepsilon\varepsilon,c}}.$$
For each measure, we can use tailored versions of Corollary 1 to construct bounds, or conduct sensitivity analysis as in Section 4.2, with the sensitivity parameter expressed as the Spearman correlation between the treatment and control potential outcomes among Compliers.
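Given plug-in estimates of the components, the three measures are simple ratios; a sketch under our naming, where bounds or sensitivity values for $S_{\varepsilon\varepsilon,c}$ enter through the `s_ee_c` argument:

```python
def r2_noncompliance(pi_c, tau_c, s_dd_c, s_ee_c):
    """The three R^2-type measures above, using the identities of Theorem 8."""
    s_tt_U = pi_c * (1.0 - pi_c) * tau_c ** 2        # variation from noncompliance
    total = s_tt_U + pi_c * (s_dd_c + s_ee_c)        # = S_tau_tau
    r2_U = s_tt_U / total                            # noncompliance only
    r2_c = s_dd_c / (s_dd_c + s_ee_c)                # covariates, among Compliers
    r2_UX = (s_tt_U + pi_c * s_dd_c) / total         # covariates and noncompliance
    return r2_U, r2_c, r2_UX
```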
6 Simulation study
We simulate completely randomized experiments to evaluate the finite sample performance of the tests for systematic treatment effect variation based on $\widehat{\beta}_{OLS}$, $\widehat{\beta}_{RI}$, and $\widehat{\beta}_{RI}^w$, the model-assisted version discussed in the Supplementary Material. Our data generating process is inspired by the Head Start Impact Study (HSIS), analyzed in the next section. For a given sample size, we first generate four independent covariates: $X_1$, a standard normal; $X_2$, a binary covariate; $X_3$, a binary covariate with probability 0.25 of being 1; and $X_4$, a standard normal.
The control potential outcomes are then generated from a linear model in the four covariates,
$$Y_i(0) = a_0 + a_1 X_{1i} + a_2 X_{2i} - a_3 X_{3i} + a_4 X_{4i} + u_i, \quad u_i \sim N(0, \sigma^2),$$
with positive constants $a_k$. We select $\sigma^2 = 0.26$ to make the marginal variance of the control potential outcomes 1; thus we can interpret impacts in "effect size" units.
The $R^2$ of regressing $Y(0)$ on the covariates is approximately 0.74, due to the "pre-test"-like variable $X_{4i}$. Without $X_{4i}$, the $R^2$ is about 0.09. The treatment effects are $\tau_i = \delta_i + \varepsilon_i$, with (i) either $\delta_i = 0.3$ for all $i$, or $\delta_i$ a linear function of two of the covariates; and (ii) either $\varepsilon_i = 0$ for all $i$, or $\varepsilon_i$ drawn as mean-zero Gaussian noise. All combinations of these two options give the four cases of (a) no treatment effect variation, (b) only systematic variation, (c) idiosyncratic variation with no systematic variation, and (d) both systematic and idiosyncratic variation. For an $\alpha$-level test of systematic variation, scenarios (a) and (c) should only reject at rate $\alpha$, while we would like to see high rejection rates for scenarios (b) and (d). For scenario (d), the $R^2_\tau$ is about 0.5; systematic variation explains a good share of the overall variation.
To generate a synthetic dataset, we generated all potential outcomes, randomized units into treatment with probability 0.6, and then calculated the corresponding observed outcomes. We then conducted a test for systematic variation using each of our three estimators. For $\widehat{\beta}_{RI}$ and $\widehat{\beta}_{OLS}$ we use $X_1$, $X_2$, $X_3$. For our covariate-adjusted estimator $\widehat{\beta}_{RI}^w$ we also include the fairly predictive $X_4$ for adjustment.
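For concreteness, here is a sketch of this data generating process. The values marked PLACEHOLDER are illustrative choices of ours; the corresponding constants did not survive in the text above, which pins down only $\sigma^2 = 0.26$, the probability 0.25 for $X_3$, the constant effect 0.3, and the assignment probability 0.6.

```python
import numpy as np

rng = np.random.default_rng(2017)

def simulate(n, systematic=True, idiosyncratic=True):
    X1, X4 = rng.standard_normal(n), rng.standard_normal(n)
    X2 = rng.binomial(1, 0.5, n).astype(float)     # PLACEHOLDER probability
    X3 = rng.binomial(1, 0.25, n).astype(float)
    # PLACEHOLDER coefficients for the Y(0) model; sigma^2 = 0.26 is given.
    Y0 = (0.2 + 0.2 * X1 + 0.4 * X2 - 0.4 * X3 + 0.8 * X4
          + rng.normal(0.0, np.sqrt(0.26), n))
    if systematic:
        delta = 0.1 + 0.2 * X1 + 0.2 * X2          # PLACEHOLDER coefficients
    else:
        delta = np.full(n, 0.3)                    # constant effect
    eps = rng.normal(0.0, 0.3, n) if idiosyncratic else 0.0  # PLACEHOLDER sd
    Y1 = Y0 + delta + eps
    T = rng.binomial(1, 0.6, n)                    # assignment probability 0.6
    Y = np.where(T == 1, Y1, Y0)
    X = np.column_stack([np.ones(n), X1, X2, X3])  # covariates used in tests
    return X, Y, T
```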
Figure 1 shows the power of these tests, with $\alpha = 0.05$, for different sample sizes. First, all estimators appear asymptotically valid, consistent with the theoretical results. The OLS and adjusted estimators are slightly anti-conservative for small $n$, however, with rejection rates of around 9%. Second, the OLS estimator appears to have the greatest power in this setting, which is unsurprising since the true data generating process is a linear model. Finally, covariate adjustment slightly improves the power of the RI estimator. Overall, in the scenarios we consider, we only achieve decent levels of power in large samples, although there seems to be reasonable power for the sample size in the data application, $n = 3{,}586$.
Figure 1: Power of the tests based on $\widehat{\beta}_{RI}$, $\widehat{\beta}_{OLS}$, and $\widehat{\beta}_{RI}^w$, by sample size $N$, for scenarios (a) none, (b) systematic, (c) idiosyncratic, and (d) systematic + idiosyncratic.
We next simulate completely randomized experiments with noncompliance to evaluate the finite sample performance of the tests for systematic treatment effect variation among Compliers based on $\widehat{\beta}_{c,RI}$ and $\widehat{\beta}_{TSLS}$. We first generated a complete dataset as in the ITT case above, and then assigned strata membership to all units with probabilities proportional to their covariates. For Always Takers we then set $Y_i(0) = Y_i(1)$, and for Never Takers, $Y_i(1) = Y_i(0)$. The overall ITT is now reduced to 0.21 (due to the zero effects for Never Takers and Always Takers), although the CACE is still approximately 0.3. The proportion of Compliers is approximately 68%.

The Compliers have the systematic and idiosyncratic effects described above. We tested for the presence of systematic variation for Compliers under the exclusion restrictions. Figure 2 shows the power of these tests for our RI and TSLS estimators. First, in this scenario, the TSLS and RI estimators are virtually equivalent; the additional adjustment provided by TSLS does not add significantly to the precision. We see the tests are valid (they even appear conservative) for cases (a) and (c). Power is reduced compared to the ITT simulation; this is reasonable, as power is effectively a function of the number of Compliers, with additional uncertainty due to partial information about the identity of Compliers.
Figure 2: Power of the tests based on $\widehat{\beta}_{c,RI}$ and $\widehat{\beta}_{TSLS}$, by sample size $N$, for the same four scenarios as in Figure 1.
7 Application to the Head Start Impact Study
Established in 1965, Head Start is the largest Federal preschool program in the United States, serving nearly 1 million low-income three- and four-year-old children each year at a cost of over $7 billion (Administration for Children and Families, 2015). Researchers and policymakers have debated Head Start's effectiveness since its inception, with early randomized trials finding limited impacts (e.g., Westinghouse Learning Corporation, 1969) and quasi-experimental studies showing much larger effects (e.g., Currie and Thomas, 1995). Designed in part to settle this debate, the Head Start Impact Study (HSIS) is a large-scale, nationally representative randomized trial of Head Start first launched in 2002 (Puma et al., 2010). The Congressional mandate for HSIS included two broad questions: (1) the program's overall impact, and (2) how impacts vary across children and centers. The policy debate has largely focused on this first question; HSIS only found modest average effects on a range of children's cognitive and social-emotional outcomes. However, both the original study and several recent papers argue that these topline results mask important treatment effect variation (e.g., Bloom and Weiland, 2014; Bitler et al., 2014; Ding et al., 2016; Walters, 2015; Feller et al., 2016). Understanding such variation is critical both for assessing the program's benefits and costs and for improving the practice and science of early childhood education.

HSIS collected a rich set of covariates about children and their families, including pre-test score, child's age, child's race, child's home language, mother's education level, and mother's marital status. At the same time, many potentially important covariates are unavailable. For instance, while families must be low-income to be eligible for Head Start, HSIS does not include information on families' actual income nor other financial details that could be important predictors of program impact. In addition, Feller et al. (2016) and others argue that the setting in which a child would otherwise receive care is an important source of impact variation, although this is not directly observable.

We now use the methods outlined above to assess treatment effect variation in HSIS. The original study included $n = 4{,}400$ total children, with $n_1 = 2{,}644$ in the treatment group and $n_0 = 1{,}796$ in the control group. Following earlier analyses (Ding et al., 2016) and to simplify exposition, we restrict our attention to a complete-case subset of the HSIS, with $n_1 = 2{,}238$ in the treatment group and $n_0 = 1{,}348$ in the control group (so $p_1 \approx 0.62$ and $p_0 \approx 0.38$).

We first explore treatment effect variation for the ITT estimate, beginning with estimating systematic treatment effect variation. We examine three estimators: the randomization-based and OLS estimators discussed in Section 3, $\widehat{\beta}_{RI}$ and $\widehat{\beta}_{OLS}$, and the corresponding model-assisted version of the RI estimator discussed in the Supplementary Material, $\widehat{\beta}_{RI}^w$. For this latter estimator, we use all available covariates to adjust the standard estimators; that is, $W$ is the entire vector of covariates.

Omnibus test for systematic treatment effect variation.
We begin by using these estimators for an omnibus test of whether any treatment effect variation is explained by the full set of covariates. The p-value for the unadjusted $\widehat{\beta}_{RI}$ estimator is 0.39; the model-assisted $\widehat{\beta}_{RI}^w$ and, especially, $\widehat{\beta}_{OLS}$ give smaller p-values. While we expect the unadjusted $\widehat{\beta}_{RI}$ to have the lowest power, it is instructive that the p-value for $\widehat{\beta}_{OLS}$ is substantially smaller than the p-value for the covariate-adjusted $\widehat{\beta}_{RI}^w$. As we discuss in Section 3.2, $\widehat{\beta}_{OLS}$ can account for covariate imbalance across experimental arms by estimating the $S_{xx}$ matrix separately for the treatment and control groups. By contrast, $\widehat{\beta}_{RI}$ does not address imbalance in $X$ and instead attempts to residualize out the $Y$ in order to get a more precise estimate of the relationship of the $X$ to $Y$ for each treatment arm. Based on the discrepancy in p-values, adjusting for baseline imbalance is clearly important in this example.

Treatment effect $R^2_\tau$. Next, we examine how much of the variation could be explained by our covariates. Figure 3a shows values of the treatment effect $R^2_\tau$ using $\widehat{\beta}_{RI}^w$ to estimate the systematic variation. Results are nearly identical using the other estimators. In the worst case of perfect negative dependence between the potential outcomes (not shown), the treatment effect $R^2_\tau$ could be smaller still. Assuming a nonnegative correlation between the residual potential outcomes, $R^2_\tau$ ranges from 0.03 to 0.76 as the sensitivity parameter $\rho$ ranges over $[0, 1]$. While the estimate is clearly sensitive to the unidentifiable sensitivity parameter, the covariates explain a substantial proportion of treatment effect variation for values of $\rho$ near 1.

Figure 3: Treatment effect $R^2_\tau$, with sensitivity parameter $\rho \in [0, 1]$: (a) overall $R^2_\tau$ for the ITT (covariates only, $X$) and the LATE (covariates only, $X$; covariates and compliance, $X + U$); (b) ITT treatment effect $R^2$ with $\rho = 1$, separately by covariate (child's age, both parents live with child, child is male, mother's education level, mother's marital status, pre-test score, caregiver's age, child's race, dual-language learner, mother is recent immigrant).

We can also use this framework to assess the relative importance of each covariate in terms of explaining overall treatment effect variation. To do this, we use the model-assisted RI estimator, $\widehat{\beta}_{RI}^w$, adjusting for all covariates (i.e., $\dim(W) = 17$) but restricting systematic treatment effect variation to one covariate at a time. Note that we consider factors (e.g., race) as a group. Figure 3b shows the resulting estimates for the upper bound of $R^2_\tau$, with lower bound estimates all below 0.01. Having a mother who is a recent immigrant and dual-language learner status (which are highly correlated in practice) could each explain a substantial proportion of treatment effect variation, consistent with previous results from Bloom and Weiland (2014) and Bitler et al. (2014). This is not true for other covariates, like mother's education level.
Negative correlation between treatment effect and control potential outcomes.
Finally, we test whether the individual-level idiosyncratic treatment effects, $\{\varepsilon_i\}_{i=1}^n$, are negatively correlated with the control potential outcomes, $\{Y_i(0)\}_{i=1}^n$, extending results from Raudenbush and Bloom (2015). As outlined in the Supplementary Material, we do so by testing whether the variance of $\{Y_i^{\mathrm{obs}} - X_i^T\hat{\beta}^{w}_{\mathrm{RI}} : T_i = 1\}$ is smaller than the variance of $\{Y_i^{\mathrm{obs}} : T_i = 0\}$. This yields a $p$-value of 0.02, which suggests that the unexplained treatment effect is indeed larger for smaller values of the control potential outcomes. This result is consistent with findings from Bitler et al. (2014), who use a quantile treatment effect approach.
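The test itself is straightforward to compute. Below is a minimal sketch of the variance-ratio statistic from the Supplementary Material (Theorem A.3); the function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

def variance_ratio_test(r1, y0, alpha=0.05):
    """One-sided variance-ratio test: reject when the beta-adjusted
    treated outcomes r1 have significantly smaller variance than the
    control outcomes y0 (Supplementary Material, Theorem A.3).
    """
    n1, n0 = len(r1), len(y0)
    s1, s0 = np.var(r1, ddof=1), np.var(y0, ddof=1)
    k1 = stats.kurtosis(r1, fisher=False)      # sample kurtosis of each group
    k0 = stats.kurtosis(y0, fisher=False)
    z = (np.log(s1) - np.log(s0)) / np.sqrt((k1 - 1) / n1 + (k0 - 1) / n0)
    return z, z < stats.norm.ppf(alpha)        # reject for small z
```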
Accounting for noncompliance.
As with many social experiments, there is substantial noncompliance with random assignment in HSIS. In the analysis sample we consider here, the estimated proportions of the compliance types are $\hat{\pi}_c = 0.69$ for Compliers, $\hat{\pi}_a = 0.13$ for Always Takers, and $\hat{\pi}_n = 0.18$ for Never Takers. Given the exclusion restrictions for Always Takers and Never Takers, the treatment effect is therefore zero (by assumption) for over 30 percent of the sample, suggesting that noncompliance will be an important component of treatment effect variation.

In the setting with noncompliance, we focus on two estimators for systematic treatment effect variation among Compliers: the randomization-based estimator, $\hat{\beta}_{c,\mathrm{RI}}$, and the two-stage least squares estimator, $\hat{\beta}_{\mathrm{TSLS}}$. We first use these estimators to construct omnibus tests for systematic treatment effect variation among Compliers. Tests using both estimators show strong evidence for such variation, with $p$-value 0.02 using $\hat{\beta}_{c,\mathrm{RI}}$ and $p$-value 0.01 using $\hat{\beta}_{\mathrm{TSLS}}$.

Finally, we turn to decomposing the overall treatment effect. As in the ITT case, we assume that the potential outcomes have a nonnegative correlation. Figure 3a shows the treatment effect $R^2$ among Compliers, which ranges from $R^2_{\tau,c} = 0.05$ to $R^2_{\tau,c} = 0.68$. Next, we can calculate the treatment effect variation due to noncompliance, $R^2_{\tau,U}$. In the case of HSIS, this is relatively small (between 0.01 and 0.16), in part because the overall treatment effect is fairly small. Therefore, the overall treatment effect decomposition due to both covariates and noncompliance, $R^2_{\tau,UX}$, is quite close to $R^2_{\tau,c}$, as shown in Figure 3a. Taken together, these estimates suggest that there is indeed important treatment effect variation that is neither captured by pre-treatment covariates nor by noncompliance, consistent with previous results in Ding et al. (2016).
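For intuition, the moment-differencing logic behind $\hat{\beta}_{c,\mathrm{RI}}$ (Theorem 6 in the Supplementary Material) can be sketched in a few lines of Python: moments for Compliers are recovered by differencing the observed $(T, D)$ cells, using the exclusion restrictions for Always Takers and Never Takers. The implementation below is illustrative only and omits standard errors.

```python
import numpy as np

def beta_c_ri(X, y, t, d):
    """Plug-in randomization-based estimator of systematic treatment
    effect variation among Compliers (a sketch).

    X : (n, K) covariates; y : (n,) outcomes
    t : (n,) treatment assignment; d : (n,) treatment receipt
    """
    T1, T0 = t == 1, t == 0
    def mom_xx(mask):
        # mean of D_i * X_i X_i^T within an assignment arm
        Xm, dm = X[mask], d[mask]
        return (Xm * dm[:, None]).T @ Xm / mask.sum()
    def mom_xy(mask, receipt):
        # mean of I(D_i = receipt) * X_i Y_i within an assignment arm
        w = (d[mask] == receipt) * y[mask]
        return X[mask].T @ w / mask.sum()
    pic_Sxx = mom_xx(T1) - mom_xx(T0)              # estimates pi_c * S_xx,c
    pic_Sx1 = mom_xy(T1, 1) - mom_xy(T0, 1)        # estimates pi_c * S_x1,c
    pic_Sx0 = mom_xy(T0, 0) - mom_xy(T1, 0)        # estimates pi_c * S_x0,c
    return np.linalg.solve(pic_Sxx, pic_Sx1 - pic_Sx0)   # pi_c cancels
```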
Discussion

In this paper, we propose a broad, flexible framework for assessing and decomposing treatment effect variation in randomized experiments, with and without noncompliance. In general, we believe this is a natural setup for researchers to formulate and investigate a broad range of questions about impact heterogeneity (e.g., Heckman et al., 1997). Applications include assessing underlying causal mechanisms and targeting treatments based on individual-level characteristics. Understanding such variation is also important for the design of experiments. Djebbari and Smith (2008), for example, argue that characterizing the size of the idiosyncratic treatment effect is useful for determining the value of additional data collection.

We briefly note several directions for future work. First, our primary purpose was to propose a framework for analysis rooted in and justified by the randomization itself. As a result, we focused on the core properties of several relatively simple versions of linear regression and TSLS. We did not, however, fully explore their practical and finite-sample properties. For example, in future work, we hope to determine the settings in which model assistance will most improve estimation and to assess the increased power of the OLS approach relative to the unbiased RI approach. We are also investigating how to connect the model-assisted and OLS approaches to take advantage of both methods of precision gain. Similarly, there is still much room for improvement in characterizing the degree of heterogeneity, such as with an effect size for the systematic variation.

Second, a natural extension is to use more complex methods to estimate systematic treatment effects, such as via hierarchical models (Feller and Gelman, 2015) or via machine learning methods (Wager and Athey, 2017), extending the results for the omnibus test and treatment effect $R^2_\tau$ accordingly. While the guarantees from randomization are clearly weaker in such settings, researchers can assess these tradeoffs themselves. For example, hierarchical modeling would be especially useful in the Head Start Impact Study due to the multi-site design (Bloom and Weiland, 2014).

Third, a question of increasing practical importance is the generalizability of experimental results to a given target population (Stuart et al., 2011). We believe that the treatment effect $R^2_\tau$ is a critical measure for assessing the credibility of these generalizations. In short, if there is substantial idiosyncratic treatment effect variation, i.e., if $R^2_\tau$ is small, then researchers should be wary of using observed covariates to extrapolate treatment effects.

Finally, a question is how to extend this treatment effect variation framework to non-randomized settings. While the results would necessarily rest on much stronger assumptions, many settings already use an as-if-randomized framework, such as in observational studies (Rosenbaum, 2002; Imbens and Rubin, 2015). Under this approach, extensions should be natural.

References
A. Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113:231–263, 2003.
Administration for Children and Families. Head Start program facts, fiscal year 2014. Available at https://eclkc.ohs.acf.hhs.gov/hslc/data/factsheets/docs/hs-program-fact-sheet-2014.pdf, 2015.
J. D. Angrist and J. Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press, 2008.
J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455, 1996.
J. D. Angrist, P. A. Pathak, and C. R. Walters. Explaining charter school effectiveness. American Economic Journal: Applied Economics, 5(4):1–27, 2013.
P. M. Aronow, D. P. Green, and D. K. Lee. Sharp bounds on the variance in randomized experiments. The Annals of Statistics, 42:850–871, 2014.
S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
A. Berrington de González and D. R. Cox. Interpretation of interaction: A review. The Annals of Applied Statistics, 1:371–385, 2007.
M. Bitler, H. Hoynes, and T. Domina. Experimental evidence on distributional effects of Head Start. Working Paper, 2014.
A. S. Blinder. Wage discrimination: reduced form and structural estimates. Journal of Human Resources, 8:436–455, 1973.
H. S. Bloom and C. Weiland. To what extent do the effects of Head Start on enrolled children vary across sites? Working Paper, 2014.
J. Bound, D. A. Jaeger, and R. M. Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90:443–450, 1995.
W. G. Cochran. Sampling Techniques. New York: John Wiley & Sons, 3rd edition, 1977.
D. R. Cox. Interaction (with discussion). International Statistical Review, 52:1–24, 1984.
R. K. Crump, V. J. Hotz, G. W. Imbens, and O. A. Mitnik. Nonparametric tests for treatment effect heterogeneity. Review of Economics and Statistics, 90:389–405, 2008.
J. Currie and D. Thomas. Does Head Start make a difference? American Economic Review, 85(3):341–364, 1995.
P. Ding. A paradox from randomization-based causal inference. arXiv preprint arXiv:1402.0142, 2014.
P. Ding, A. Feller, and L. W. Miratrix. Randomization inference for treatment effect variation. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 78:655–671, 2016.
H. Djebbari and J. Smith. Heterogeneous impacts in PROGRESA. Journal of Econometrics, 145:64–80, 2008.
A. Feller and A. Gelman. Hierarchical models for causal effects. Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, 2015.
A. Feller, T. Grindal, L. Miratrix, and L. C. Page. Compared to what? Variation in the impacts of early childhood education by alternative care type. The Annals of Applied Statistics, 10(3):1245–1285, 2016.
R. A. Fisher. The Design of Experiments. Edinburgh: Oliver & Boyd, 1st edition, 1935.
C. B. Fogarty. Regression assisted inference for the average treatment effect in paired experiments. arXiv preprint arXiv:1612.05179, 2016.
C. E. Frangakis and D. B. Rubin. Principal stratification in causal inference. Biometrics, 58:21–29, 2002.
M. Fréchet. Sur les tableaux de corrélation dont les marges sont données. Annals Universite de Lyon, Sect. A, Ser. 3, 14:53–77, 1951.
D. P. Green and H. L. Kern. Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. The Public Opinion Quarterly, 76:491–511, 2012.
J. Hájek. Limiting distributions in simple random sampling from a finite population. Publications of the Mathematics Institute of the Hungarian Academy of Science, 5:361–374, 1960.
J. J. Heckman, J. Smith, and N. Clements. Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies, 64:487–535, 1997.
J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20:217–240, 2011.
W. Hoeffding. Masstabinvariante Korrelationsmasse für diskontinuierliche Verteilungen. Arkiv für matematischen Wirtschaften und Sozialforschung, 7:49–70, 1941.
Y. Huang, P. B. Gilbert, and H. Janes. Assessing treatment-selection markers using a potential outcomes framework. Biometrics, 68:687–696, 2012.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233, 1967.
K. Imai and M. Ratkovic. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7:443–470, 2013.
G. Imbens. Instrumental variables: An econometrician's perspective (with discussion). Statistical Science, 29:323–358, 2014.
G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, and in the Social and Biomedical Sciences. New York: Cambridge University Press, 2015.
O. Kempthorne. The Design and Analysis of Experiments. New York: Wiley, 1952.
E. L. Lehmann. Elements of Large-Sample Theory. New York: Springer, 1998.
X. Li and P. Ding. General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association, in press, 2017.
W. Lin. Agnostic notes on regression adjustments to experimental data: reexamining Freedman's critique. The Annals of Applied Statistics, 7:295–318, 2013.
R. A. Matsouaka, J. Li, and T. Cai. Evaluating marker-guided treatment selection strategies. Biometrics, 70:489–499, 2014.
J. A. Middleton and P. M. Aronow. Unbiased estimation of the average treatment effect in cluster-randomized experiments. Statistics, Politics and Policy, 6:39–75, 2015.
R. B. Nelsen. An Introduction to Copulas. New York: Springer, 2nd edition, 2007.
J. Neyman. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5:465–472, 1923.
R. Oaxaca. Male-female wage differentials in urban labor markets. International Economic Review, 14:693–709, 1973.
M. Puma, S. Bell, R. Cook, C. Heid, G. Shapiro, P. Broene, F. Jenkins, P. Fletcher, L. Quinn, J. Friedman, et al. Head Start Impact Study: Final report. Technical report, Department of Health and Human Services, Administration for Children and Families, Washington, DC, 2010.
S. W. Raudenbush and H. S. Bloom. Learning about and from a distribution of program impacts using multisite trials. American Journal of Evaluation, 36(4):475–499, 2015.
P. R. Rosenbaum. Reduced sensitivity to hidden bias at upper quantiles in observational studies with dilated treatment effects. Biometrics, 55:560–564, 1999.
P. R. Rosenbaum. Observational Studies. New York: Springer, 2nd edition, 2002.
P. R. Rosenbaum. Confidence intervals for uncommon but dramatic responses to treatment. Biometrics, 63:1164–1171, 2007.
D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701, 1974.
D. B. Rubin. Comment on "Randomization analysis of experimental data: the Fisher randomization test" by D. Basu. Journal of the American Statistical Association, 75:591–593, 1980.
C.-E. Särndal, B. Swensson, and J. Wretman. Model-Assisted Survey Sampling. New York: Springer, 2003.
D. O. Staiger and J. H. Stock. Instrumental variables regression with weak instruments. Econometrica, 65:557–586, 1997.
E. A. Stuart, S. R. Cole, C. P. Bradshaw, and P. J. Leaf. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011.
S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, in press, 2017.
C. R. Walters. Inputs in the production of early childhood human capital: Evidence from Head Start. American Economic Journal: Applied Economics, 7(4):76–102, 2015.
Westinghouse Learning Corporation. The Impact of Head Start: An Evaluation of the Effects of Head Start on Children's Cognitive and Affective Development, Volume 1: Report to the Office of Economic Opportunity. Athens, Ohio: Westinghouse Learning Corporation and Ohio University, 1969.
H. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48:817–838, 1980.
A. Zhao, P. Ding, R. Mukerjee, and T. Dasgupta. Randomization-based causal inference from split-plot designs. Annals of Statistics, in press, 2017.

Supplementary Material for "Decomposing Treatment Effect Variation"
Appendix A gives all the proofs, and Appendix B provides the additional commentary mentioned in the main text. The finite population central limit theorem (FPCLT) we use for our asymptotic proofs is Theorem 5 of Li and Ding (2017), which requires some mild moment conditions on the covariates and potential outcomes, as outlined in the main text.
Appendix A Lemmas and Proofs
Before we prove Theorem 1, we provide a few lemmas to ease the notational burden and the amount of algebra in subsequent calculations. These lemmas allow us to derive expressions for our estimators in terms of matrix algebra rather than the summation-style approach typically seen for Neyman-style derivations in the literature.

To begin, let $\mathbf{1}_n = (1, \ldots, 1)^T$ and $\mathbf{0}_n = (0, \ldots, 0)^T$ be column vectors of length $n$, and let $I_n$ be the $n \times n$ identity matrix. Then $S_n = I_n - n^{-1}\mathbf{1}_n\mathbf{1}_n^T$ is the projection matrix orthogonal to $\mathbf{1}_n$, with $S_n\mathbf{1}_n = \mathbf{0}_n$. Under this formulation, the covariance matrix of the treatment assignment vector is a scaled projection matrix orthogonal to $\mathbf{1}_n$, as shown in the following lemma.

Lemma A.1. The treatment assignment vector $T$ of a completely randomized experiment has
$$E(T) = \frac{n_1}{n}\mathbf{1}_n, \qquad \mathrm{cov}(T) = \frac{n_1 n_0}{n(n-1)}\,S_n.$$

Proof of Lemma A.1.
The conclusions follow from
$$E(T_i) = \frac{n_1}{n}, \qquad \mathrm{var}(T_i) = \frac{n_1 n_0}{n^2}, \qquad \mathrm{cov}(T_i, T_j) = -\frac{n_1 n_0}{n^2(n-1)} \quad (i \neq j).$$
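Because Lemma A.1 involves only finitely many equally likely assignments, it can be verified by brute force. The following self-contained check (ours, not part of the original argument) enumerates all complete randomizations for a small $n$:

```python
import numpy as np
from itertools import combinations

n, n1 = 6, 3
n0 = n - n1
# All C(n, n1) equally likely complete randomizations
assignments = np.array([[1 if i in idx else 0 for i in range(n)]
                        for idx in combinations(range(n), n1)])
mean_T = assignments.mean(axis=0)            # exact E(T)
cov_T = np.cov(assignments.T, bias=True)     # exact cov(T) over all draws
S_n = np.eye(n) - np.ones((n, n)) / n
assert np.allclose(mean_T, n1 / n)
assert np.allclose(cov_T, n1 * n0 / (n * (n - 1)) * S_n)
```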
The projection matrix $S_n$ acts as a covariance operator, as illustrated by the following lemma.

Lemma A.2. Let $U_i, V_i \in \mathbb{R}^K$ be column vectors of length $K$. Define $U = [U_1, U_2, \ldots, U_n]$ and $V = [V_1, V_2, \ldots, V_n] \in \mathbb{R}^{K \times n}$ as two matrices of dimension $K \times n$. If $\bar{U} = n^{-1}\sum_{i=1}^n U_i = n^{-1}U\mathbf{1}_n$ and $\bar{V} = n^{-1}\sum_{i=1}^n V_i = n^{-1}V\mathbf{1}_n$, then
$$U S_n V^T = \sum_{i=1}^n (U_i - \bar{U})(V_i - \bar{V})^T. \tag{A.1}$$
In particular, when $U_i = V_i$,
$$V S_n V^T = \sum_{i=1}^n (V_i - \bar{V})(V_i - \bar{V})^T = (n-1)\,S(V).$$

Proof of Lemma A.2.
The left-hand side of (A.1) is equal to
$$U S_n V^T = UV^T - n^{-1}(U\mathbf{1}_n)(V\mathbf{1}_n)^T = \sum_{i=1}^n U_iV_i^T - n^{-1}(n\bar{U})(n\bar{V})^T = \sum_{i=1}^n U_iV_i^T - n\bar{U}\bar{V}^T,$$
which is the same as the right-hand side of (A.1).

Theorem 1: A generalized, vector-outcome version of Neyman. To prove the generalized Neyman result, we bundle our vector potential outcomes into matrices and use the above lemmas to obtain their covariance matrix. The theorem is exact, requiring no asymptotics; using the FPCLT to show that the estimator has an approximately Normal distribution, allowing for classic testing and inference, is a separate, subsequent step.
Proof of Theorem 1.
Define $V_1 = [V_1(1), \ldots, V_n(1)]$ and $V_0 = [V_1(0), \ldots, V_n(0)]$ as the matrices of the potential outcomes. Then the Neymanian simple difference-in-means estimator has the following representation:
$$\hat{\tau}_V = \bar{V}_1^{\mathrm{obs}} - \bar{V}_0^{\mathrm{obs}} = \frac{1}{n_1}\sum_{i=1}^n T_iV_i(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)V_i(0) = \frac{1}{n_1}V_1T - \frac{1}{n_0}V_0(\mathbf{1}_n - T) = \left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)T - \frac{1}{n_0}V_0\mathbf{1}_n.$$
The unbiasedness of $\hat{\tau}_V$ now follows from the linearity of expectation and Lemma A.1. For the covariance, note that the second term above is constant and so is not involved. Applying Lemmas A.1 and A.2, we obtain the covariance matrix of $\hat{\tau}_V$:
$$\mathrm{cov}(\hat{\tau}_V) = \left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)\mathrm{cov}(T)\left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)^T = \frac{n_1n_0}{n(n-1)}\left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)S_n\left(\frac{V_1}{n_1} + \frac{V_0}{n_0}\right)^T$$
$$= \frac{n_1n_0}{n(n-1)}\left(\frac{1}{n_1^2}V_1S_nV_1^T + \frac{1}{n_0^2}V_0S_nV_0^T + \frac{1}{n_1n_0}V_1S_nV_0^T + \frac{1}{n_1n_0}V_0S_nV_1^T\right)$$
$$= \frac{n_0}{n_1n}S\{V(1)\} + \frac{n_1}{n_0n}S\{V(0)\} + \frac{1}{n(n-1)}\left(V_1S_nV_0^T + V_0S_nV_1^T\right).$$
To simplify the third term, we use the fact that $ab^T + ba^T = aa^T + bb^T - (a - b)(a - b)^T$ for two column vectors $a$ and $b$:
$$\{V_i(1) - \bar{V}(1)\}\{V_i(0) - \bar{V}(0)\}^T + \{V_i(0) - \bar{V}(0)\}\{V_i(1) - \bar{V}(1)\}^T$$
$$= \{V_i(1) - \bar{V}(1)\}\{V_i(1) - \bar{V}(1)\}^T + \{V_i(0) - \bar{V}(0)\}\{V_i(0) - \bar{V}(0)\}^T - \{V_i(1) - V_i(0) - \bar{V}(1) + \bar{V}(0)\}\{V_i(1) - V_i(0) - \bar{V}(1) + \bar{V}(0)\}^T.$$
Summing over $i = 1, \ldots, n$ and applying Lemma A.2, we have
$$\frac{V_1S_nV_0^T + V_0S_nV_1^T}{n-1} = S\{V(1)\} + S\{V(0)\} - S\{V(1) - V(0)\}.$$
Therefore, the covariance of $\hat{\tau}_V$ simplifies to
$$\mathrm{cov}(\hat{\tau}_V) = \frac{n_0}{n_1n}S\{V(1)\} + \frac{n_1}{n_0n}S\{V(0)\} + \frac{1}{n}\left[S\{V(1)\} + S\{V(0)\} - S\{V(1) - V(0)\}\right] = \frac{S\{V(1)\}}{n_1} + \frac{S\{V(0)\}}{n_0} - \frac{S\{V(1) - V(0)\}}{n}.$$
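Again as our own sanity check, the covariance formula of Theorem 1 can be verified by exact enumeration for a tiny population:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, n1 = 6, 3
V1 = rng.normal(size=(2, n))                 # potential outcomes V_i(1)
V0 = rng.normal(size=(2, n))                 # potential outcomes V_i(0)
ests = []
for idx in combinations(range(n), n1):
    t = np.isin(np.arange(n), idx)
    ests.append(V1[:, t].mean(axis=1) - V0[:, ~t].mean(axis=1))
exact_cov = np.cov(np.array(ests).T, bias=True)   # exact over all assignments
S = lambda M: np.cov(M, bias=False)               # finite-population S(.), ddof = 1
theory = S(V1) / n1 + S(V0) / (n - n1) - S(V1 - V0) / n
assert np.allclose(exact_cov, theory)
```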
Theorem 2: Behavior of $\hat{\beta}_{\mathrm{RI}}$. To show the properties of $\hat{\beta}_{\mathrm{RI}}$, we express the systematic variation as a vector of new potential outcomes, namely the original outcome scaled by the different covariates of interest. This allows for immediate use of Theorem 1.

Proof of Theorem 2. Because $\hat{S}_{xt}$ is the sample mean of $\{X_iY_i^{\mathrm{obs}} : T_i = t, i = 1, \ldots, n\} = \{X_iY_i(t) : T_i = t, i = 1, \ldots, n\}$, it is unbiased for the population mean $S_{xt}$. Thus the estimator $\hat{\beta}_{\mathrm{RI}}$ is also unbiased for $\beta$, as $S_{xx}^{-1}$ is fixed and the expectation is linear. Its sampling covariance over all possible randomizations is
$$\mathrm{cov}(\hat{\beta}_{\mathrm{RI}}) = S_{xx}^{-1}\,\mathrm{cov}(\hat{S}_{x1} - \hat{S}_{x0})\,S_{xx}^{-1}.$$
Therefore, we need only obtain the covariance of
$$\hat{S}_{x1} - \hat{S}_{x0} = \frac{1}{n_1}\sum_{i=1}^n T_iX_iY_i^{\mathrm{obs}} - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)X_iY_i^{\mathrm{obs}},$$
which is the difference between the sample means of $\{X_iY_i(1) : i = 1, \ldots, n\}$ under treatment and $\{X_iY_i(0) : i = 1, \ldots, n\}$ under control. Viewing $X_iY_i^{\mathrm{obs}}$ as a vector outcome in a completely randomized experiment, we can apply Theorem 1 to obtain
$$\mathrm{cov}(\hat{S}_{x1} - \hat{S}_{x0}) = \frac{S\{XY(1)\}}{n_1} + \frac{S\{XY(0)\}}{n_0} - \frac{S(X\tau)}{n},$$
which completes the proof.

Theorem 3: Behavior of $\hat{\beta}_{\mathrm{OLS}}$. We first use the well-known fact that the estimates from an OLS model with treatment fully interacted with covariates are equivalent to those from separate regressions of the outcome on the covariates in the control and treatment groups. This means we can obtain $\hat{\gamma}_{\mathrm{OLS}}$ by regressing $Y^{\mathrm{obs}}$ on $X$ using the control group data, and $\widehat{\gamma + \beta}_{\mathrm{OLS}}$ by regressing $Y^{\mathrm{obs}}$ on $X$ using the treatment group data, giving estimated coefficients of
$$\hat{\gamma}_{\mathrm{OLS}} = \hat{S}_{xx,0}^{-1}\hat{S}_{x0} \qquad \text{and} \qquad \hat{\beta}_{\mathrm{OLS}} = \hat{S}_{xx,1}^{-1}\hat{S}_{x1} - \hat{S}_{xx,0}^{-1}\hat{S}_{x0}.$$
As a quick heuristic argument for this, consider that the least squares problem for the interacted model separates into two components, one for each group; re-parameterizing then gives the above.

We now prove the properties of $\hat{\beta}_{\mathrm{OLS}}$. Here we have to use asymptotics for the entire theorem, unlike the case of $\hat{\beta}_{\mathrm{RI}}$, where the mean and covariance are exact and asymptotics are needed only for the asymptotic normality of the estimator.

Proof of Theorem 3.
First expand the difference between $\hat{\beta}_{\mathrm{OLS}}$ and $\beta$ as
$$\hat{\beta}_{\mathrm{OLS}} - \beta = \hat{S}_{xx,1}^{-1}\left(\hat{S}_{x1} - \hat{S}_{xx,1}\gamma_1\right) - \hat{S}_{xx,0}^{-1}\left(\hat{S}_{x0} - \hat{S}_{xx,0}\gamma_0\right),$$
where $\gamma_1 = \gamma + \beta$ and $\gamma_0 = \gamma$. This will be close to the related quantity
$$\Delta = S_{xx}^{-1}\left(\hat{S}_{x1} - \hat{S}_{xx,1}\gamma_1\right) - S_{xx}^{-1}\left(\hat{S}_{x0} - \hat{S}_{xx,0}\gamma_0\right). \tag{A.2}$$
For the above to make sense and hold, we need our asymptotic framework; in particular, we need the associated moment conditions described in the main text. We next observe that the difference between $\hat{\beta}_{\mathrm{OLS}} - \beta$ and $\Delta$ is of higher order, because
$$(\hat{\beta}_{\mathrm{OLS}} - \beta) - \Delta = (\hat{S}_{xx,1}^{-1} - S_{xx}^{-1})(\hat{S}_{x1} - \hat{S}_{xx,1}\gamma_1) - (\hat{S}_{xx,0}^{-1} - S_{xx}^{-1})(\hat{S}_{x0} - \hat{S}_{xx,0}\gamma_0) \tag{A.3}$$
$$= O_P(n^{-1/2})O_P(n^{-1/2}) - O_P(n^{-1/2})O_P(n^{-1/2}) = O_P(n^{-1}), \tag{A.4}$$
following from the FPCLT applied to the four terms in (A.3). This is an argument commonly used in the survey sampling literature for ratio estimators (Cochran, 1977).

We next focus on the asymptotic distribution of $\Delta$, because the asymptotic distribution of $\hat{\beta}_{\mathrm{OLS}} - \beta$ will be the same. Further simplify (A.2) as
$$\Delta = S_{xx}^{-1}\left[\frac{1}{n_1}\sum_{i=1}^n T_iX_ie_i(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)X_ie_i(0)\right], \tag{A.5}$$
where $e_i(1) = Y_i(1) - X_i^T\gamma_1$ and $e_i(0) = Y_i(0) - X_i^T\gamma_0$ are the residual potential outcomes. (To see this, note, for example, that both $\hat{S}_{x1}$ and $\hat{S}_{xx,1}$ are averages over the treated units, and we can factor out an $X_i$ to get $X_i$ times the difference between $Y_i$ and its predicted value.)

Applying Theorem 1 to the vector outcome $Xe$, we obtain the covariance matrix of $\Delta$, which is also the asymptotic covariance of $\hat{\beta}_{\mathrm{OLS}} - \beta$ due to (A.4). The asymptotic normality follows from the representation (A.5) and the FPCLT.

Theorem 4: Bounds for $R^2_\tau$. To prove Theorem 4, we invoke the following Fréchet–Hoeffding inequality (Hoeffding, 1941; Fréchet, 1951; Heckman et al., 1997; Aronow et al., 2014).
Lemma A.3. If we know only the marginal distributions of two random variables $X \sim F_X(x)$ and $Y \sim F_Y(y)$, then $E(XY)$ can be sharply bounded by
$$\int_0^1 F_X^{-1}(u)F_Y^{-1}(1-u)\,\mathrm{d}u \;\le\; E(XY) \;\le\; \int_0^1 F_X^{-1}(u)F_Y^{-1}(u)\,\mathrm{d}u.$$

Lemma A.3 immediately implies the following bound for $\mathrm{var}(X - Y)$ when $E(X - Y) = 0$.

Lemma A.4. If we know only the marginal distributions $X \sim F_X(x)$ and $Y \sim F_Y(y)$, and $E(X - Y) = 0$, then $\mathrm{var}(X - Y)$ can be sharply bounded by
$$\int_0^1 \left\{F_X^{-1}(u) - F_Y^{-1}(u)\right\}^2\mathrm{d}u \;\le\; \mathrm{var}(X - Y) \;\le\; \int_0^1 \left\{F_X^{-1}(u) - F_Y^{-1}(1-u)\right\}^2\mathrm{d}u.$$

Proof of Lemma A.4. The variance of $X - Y$ can be decomposed as
$$\mathrm{var}(X - Y) = E(X - Y)^2 = E(X^2) + E(Y^2) - 2E(XY),$$
which depends on the following three terms:
$$E(X^2) = \int x^2\,\mathrm{d}F_X(x) = \int_0^1 \left\{F_X^{-1}(u)\right\}^2\mathrm{d}u, \qquad E(Y^2) = \int_0^1 \left\{F_Y^{-1}(u)\right\}^2\mathrm{d}u = \int_0^1 \left\{F_Y^{-1}(1-u)\right\}^2\mathrm{d}u,$$
$$\int_0^1 F_X^{-1}(u)F_Y^{-1}(1-u)\,\mathrm{d}u \;\le\; E(XY) \;\le\; \int_0^1 F_X^{-1}(u)F_Y^{-1}(u)\,\mathrm{d}u.$$
Plugging these expressions into the variance of $X - Y$ yields the desired bounds.

Applying Lemma A.4, we can easily prove Theorem 4.

Proof of Theorem 4.
Because $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$, we need only bound $S_{\varepsilon\varepsilon}$, the finite population variance of $\varepsilon_i = \{Y_i(1) - X_i^T\gamma_1\} - \{Y_i(0) - X_i^T\gamma_0\} = e_i(1) - e_i(0)$. We can identify the marginal distributions of $\{e_i(1) : i = 1, \ldots, n\}$ and $\{e_i(0) : i = 1, \ldots, n\}$, and we also know that $n^{-1}\sum_{i=1}^n \varepsilon_i = 0$. Therefore, the bounds in Lemma A.4 imply the bounds in Theorem 4.
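Numerically, the bounds of Lemma A.4 amount to coupling the two marginals comonotonically (lower bound) or antitonically (upper bound). A small sketch of ours, approximating the integrals with an equally spaced quantile grid:

```python
import numpy as np

def var_diff_bounds(e1, e0, grid=1000):
    """Sharp bounds on var{e(1) - e(0)} from the marginals alone
    (Lemma A.4); function and variable names are illustrative.
    """
    u = (np.arange(grid) + 0.5) / grid
    q1, q0 = np.quantile(e1, u), np.quantile(e0, u)
    lower = np.mean((q1 - q0) ** 2)          # comonotone coupling
    upper = np.mean((q1 - q0[::-1]) ** 2)    # antitone coupling, F^{-1}(1-u)
    return lower, upper
```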
Theorem 5: Sensitivity analysis.

Proof of Theorem 5.
The joint distribution of $(U_1, U_2)$ is
$$C(u_1, u_2) = P(U_1 \le u_1, U_2 \le u_2) = \rho\min(u_1, u_2) + (1 - \rho)\,u_1u_2.$$
That is, the distribution function $C(u_1, u_2)$ is a weighted average of $\min(u_1, u_2) = C_R(u_1, u_2)$ and $u_1u_2 = C_I(u_1, u_2)$, the joint distributions when $U_2 = U_1$ and when $U_2 \perp U_1$, respectively. According to Nelsen (2007, Theorem 5.1.6), Spearman's rank correlation coefficient between $e(1)$ and $e(0)$ is
$$12\int_0^1\!\!\int_0^1 \{C(u_1, u_2) - u_1u_2\}\,\mathrm{d}u_1\mathrm{d}u_2 = 12\rho\int_0^1\!\!\int_0^1 \{\min(u_1, u_2) - u_1u_2\}\,\mathrm{d}u_1\mathrm{d}u_2 = 12\rho\left(\frac{1}{3} - \frac{1}{4}\right) = \rho.$$
To complete the proof of the theorem, we need only show that the covariance between $e(1)$ and $e(0)$ is linear in $\rho$, which follows from
$$\int_0^1\!\!\int_0^1 F_1^{-1}(u_1)F_0^{-1}(u_2)\,\mathrm{d}C(u_1, u_2) = \rho\int_0^1\!\!\int_0^1 F_1^{-1}(u_1)F_0^{-1}(u_2)\,\mathrm{d}C_R(u_1, u_2) + (1 - \rho)\int_0^1\!\!\int_0^1 F_1^{-1}(u_1)F_0^{-1}(u_2)\,\mathrm{d}C_I(u_1, u_2)$$
$$= \rho\int_0^1 F_1^{-1}(u)F_0^{-1}(u)\,\mathrm{d}u + (1 - \rho)\int_0^1 F_1^{-1}(u)\,\mathrm{d}u\int_0^1 F_0^{-1}(u)\,\mathrm{d}u.$$
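A quick Monte Carlo check (ours) of the Spearman calculation: drawing $U_2 = U_1$ with probability $\rho$ and independently otherwise induces exactly the mixture copula above, and the sample rank correlation is then approximately $\rho$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rho, n = 0.4, 200_000
u1 = rng.uniform(size=n)
# U2 = U1 with probability rho; an independent uniform otherwise
u2 = np.where(rng.uniform(size=n) < rho, u1, rng.uniform(size=n))
print(stats.spearmanr(u1, u2)[0])   # approximately 0.4
```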
Theorem 6: Extending to noncompliance. Theorem 6 shows how to estimate the outcome-to-covariate relationships of the Compliers by estimating different aggregate covariance relationships across all the strata for different observed groups and then taking differences. Due to the exclusion restrictions for the Never Takers and Always Takers, this recovers the desired relationships for the Compliers alone.

First, a small bit of notation: due to the exclusion restrictions for Never Takers and Always Takers, define the population covariance between $X$ and $Y(1) = Y(0)$ within strata $U = a$ and $U = n$ as
$$S_{x\cdot,u} = \frac{1}{n_u}\sum_{i=1}^n I(U_i = u)X_iY_i(1) = \frac{1}{n_u}\sum_{i=1}^n I(U_i = u)X_iY_i(0), \qquad (u = a, n).$$

Proof of Theorem 6. We first create an estimator for $S_{xx,c}$. From the observed data with $(T_i, D_i) = (1, 1)$,
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iD_iX_iX_i^T\right\} = E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = a)X_iX_i^T + \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = c)X_iX_i^T\right\} = \pi_aS_{xx,a} + \pi_cS_{xx,c}. \tag{A.6}$$
Similar to (A.6), we have
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)X_iX_i^T\right\} = \pi_nS_{xx,n}, \tag{A.7}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_iX_iX_i^T\right\} = \pi_aS_{xx,a}, \tag{A.8}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)(1 - D_i)X_iX_i^T\right\} = \pi_nS_{xx,n} + \pi_cS_{xx,c}. \tag{A.9}$$
Subtracting the left-hand side of (A.8) from (A.6), or the left-hand side of (A.7) from (A.9), gives unbiased estimators for $\pi_cS_{xx,c}$.

Second, analogous to $S_{xx,c}$, we consider the sample covariances between $X$ and $Y^{\mathrm{obs}}$ to obtain estimators for $S_{x1,c}$ and $S_{x0,c}$. From the observed data with $(T_i, D_i) = (1, 1)$,
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iD_iX_iY_i^{\mathrm{obs}}\right\} = E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = a)X_iY_i(1) + \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = c)X_iY_i(1)\right\} = \pi_aS_{x\cdot,a} + \pi_cS_{x1,c}. \tag{A.10}$$
Similar to (A.10), we have
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)X_iY_i^{\mathrm{obs}}\right\} = \pi_nS_{x\cdot,n}, \tag{A.11}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_iX_iY_i^{\mathrm{obs}}\right\} = \pi_aS_{x\cdot,a}, \tag{A.12}$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)(1 - D_i)X_iY_i^{\mathrm{obs}}\right\} = \pi_nS_{x\cdot,n} + \pi_cS_{x0,c}. \tag{A.13}$$
Subtracting (A.12) from (A.10), and subtracting (A.11) from (A.13), we obtain the results in (15).

Corollary 3: Behavior of $\hat{\beta}_{c,\mathrm{RI}}$. Theorem 6 shows how to obtain unbiased estimates of the components of our estimator, which we can then plug in to obtain a consistent estimator of $\beta_c$. We next show how this plug-in estimator behaves.

Proof of Corollary 3.
First we write $\hat{\beta}_{c,\mathrm{RI}} - \beta_c$ in plug-in form, with the unbiased moment estimators of Theorem 6 in place of their population counterparts. Second, we introduce the analogous quantity $\Delta_c$, obtained by replacing each estimated matrix inverse with the fixed matrix $(\pi_cS_{xx,c})^{-1}$. Third, we observe that the difference between $\hat{\beta}_{c,\mathrm{RI}} - \beta_c$ and $\Delta_c$ is of higher order, following the same argument as in (A.4); therefore, we need only find the asymptotic distribution of $\Delta_c$.

Simple algebra, writing $D_i$ in terms of the compliance strata and regrouping the sums by stratum using the exclusion restrictions, reduces $\Delta_c$ to
$$\tilde{\beta}_{c,\mathrm{RI}} - \beta_c = (\pi_cS_{xx,c})^{-1}\left[\frac{1}{n_1}\sum_{i=1}^n T_iX_ie_i'(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)X_ie_i'(0)\right], \tag{A.14}$$
where $e_i'(1)$ and $e_i'(0)$ are the residual potential outcomes defined in the main text. The representation in (A.14) implies the asymptotic covariance matrix according to Theorem 1 and the asymptotic normality of $\tilde{\beta}_{c,\mathrm{RI}}$ according to the FPCLT.

Theorem 7: Behavior of $\hat{\beta}_{\mathrm{TSLS}}$. While the notation and matrix algebra are considerably more involved, the overall structure of the proof follows the earlier one for the OLS estimator of the ITT. In particular, we show that the estimator asymptotically converges to a more tractable version with a fixed leading matrix, and then use the usual covariance argument on the remaining terms. Before doing this, we first derive the probability limits of the estimator by working through the matrix algebra.
Proof of Theorem 7.
First, we find the probability limits of the TSLS estimators (recall $p_1 = n_1/n$ and $p_0 = n_0/n$):
$$\begin{pmatrix}\hat{\gamma}_{\mathrm{TSLS}} \\ \hat{\beta}_{\mathrm{TSLS}}\end{pmatrix} = \left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i \\ T_iX_i\end{pmatrix}(X_i^T, D_iX_i^T)\right\}^{-1}\left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i \\ T_iX_i\end{pmatrix}Y_i^{\mathrm{obs}}\right\} \stackrel{P}{\longrightarrow} \begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1}\begin{pmatrix}G \\ H\end{pmatrix}. \tag{A.15}$$
The term $A$ is $A = S_{xx}$, and the terms $(B, C, D, G, H)$ are the population limits of the corresponding sample quantities; we find each of them in turn. Term $B$ is
$$B = E\left\{\frac{1}{n}\sum_{i=1}^n D_iX_iX_i^T\right\} = E\left\{\frac{1}{n}\sum_{i=1}^n T_iI(U_i = a)X_iX_i^T + \frac{1}{n}\sum_{i=1}^n T_iI(U_i = c)X_iX_i^T + \frac{1}{n}\sum_{i=1}^n (1 - T_i)I(U_i = a)X_iX_i^T\right\} = \pi_aS_{xx,a} + p_1\pi_cS_{xx,c}.$$
Term $C$ is $C = E\{n^{-1}\sum_{i=1}^n T_iX_iX_i^T\} = p_1S_{xx}$. Term $D$ is
$$D = E\left\{\frac{1}{n}\sum_{i=1}^n T_iD_iX_iX_i^T\right\} = p_1\pi_aS_{xx,a} + p_1\pi_cS_{xx,c}.$$
Term $G$ is $G = E\{n^{-1}\sum_{i=1}^n X_iY_i^{\mathrm{obs}}\} = p_1S_{x1} + p_0S_{x0}$, and term $H$ is $H = E\{n^{-1}\sum_{i=1}^n T_iX_iY_i^{\mathrm{obs}}\} = p_1S_{x1}$. We apply the following formula for the inverse of a block matrix:
$$\begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}S_D^{-1} & -A^{-1}BS_A^{-1} \\ -D^{-1}CS_D^{-1} & S_A^{-1}\end{pmatrix},$$
where $S_D = A - BD^{-1}C$ and $S_A = D - CA^{-1}B$ are the Schur complements of the blocks $D$ and $A$. Omitting some tedious matrix algebra, we obtain
$$S_D = p_0\,\pi_cS_{xx,c}(\pi_aS_{xx,a} + \pi_cS_{xx,c})^{-1}S_{xx}, \qquad S_A = p_1p_0\,\pi_cS_{xx,c},$$
$$\begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + \pi_cS_{xx,c})S_{xx,c}^{-1} & -p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + p_1\pi_cS_{xx,c})S_{xx,c}^{-1} \\ -p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1} & p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1}\end{pmatrix}.$$
Therefore, according to (A.15), the probability limit of $\hat{\gamma}_{\mathrm{TSLS}}$ is
$$p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + \pi_cS_{xx,c})S_{xx,c}^{-1}(p_1S_{x1} + p_0S_{x0}) - p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx}^{-1}(\pi_aS_{xx,a} + p_1\pi_cS_{xx,c})S_{xx,c}^{-1}(p_1S_{x1})$$
$$= S_{xx}^{-1}S_{x0} - \pi_a\pi_c^{-1}S_{xx}^{-1}S_{xx,a}S_{xx,c}^{-1}(S_{x1} - S_{x0}) = \gamma - \pi_aS_{xx}^{-1}S_{xx,a}\beta_c \equiv \gamma_\infty, \tag{A.16}$$
and the probability limit of $\hat{\beta}_{\mathrm{TSLS}}$ is
$$-p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1}(p_1S_{x1} + p_0S_{x0}) + p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1}(p_1S_{x1}) = \pi_c^{-1}S_{xx,c}^{-1}(S_{x1} - S_{x0}) = \beta_c, \tag{A.17}$$
where we use $S_{x1} - S_{x0} = \pi_c(S_{x1,c} - S_{x0,c})$, which is guaranteed by the exclusion restrictions.

We next find the asymptotic distribution of $\hat{\beta}_{\mathrm{TSLS}}$. Following the derivation in Corollary 3, we first write
$$\begin{pmatrix}\hat{\gamma}_{\mathrm{TSLS}} \\ \hat{\beta}_{\mathrm{TSLS}}\end{pmatrix} - \begin{pmatrix}\gamma_\infty \\ \beta_c\end{pmatrix} = \left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i \\ T_iX_i\end{pmatrix}(X_i^T, D_iX_i^T)\right\}^{-1}\left\{\frac{1}{n}\sum_{i=1}^n \begin{pmatrix}X_i(Y_i^{\mathrm{obs}} - X_i^T\gamma_\infty - D_iX_i^T\beta_c) \\ T_iX_i(Y_i^{\mathrm{obs}} - X_i^T\gamma_\infty - D_iX_i^T\beta_c)\end{pmatrix}\right\},$$
then introduce
$$\Delta_{\mathrm{TSLS}} = \begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1}\begin{pmatrix}n^{-1}\sum_i T_iX_ie_i''(1) + n^{-1}\sum_i (1 - T_i)X_ie_i''(0) \\ n^{-1}\sum_i T_iX_ie_i''(1)\end{pmatrix} = \begin{pmatrix}A & B \\ C & D\end{pmatrix}^{-1}\begin{pmatrix}n^{-1}\sum_i T_iX_i\{e_i''(1) - e_i''(0)\} + n^{-1}\sum_i X_ie_i''(0) \\ n^{-1}\sum_i T_iX_ie_i''(1)\end{pmatrix}, \tag{A.18}$$
with $(A, B, C, D)$ defined in (A.15) and $\{e_i''(1), e_i''(0)\}$ defined in Theorem 7, and finally recognize that the difference between the two displays above is of higher order. Again we need only find the asymptotic distribution of $\Delta_{\mathrm{TSLS}}$. The covariance of the second factor on the right-hand side of (A.18) is (dropping the constant term $n^{-1}\sum_i X_ie_i''(0)$)
$$\mathrm{cov}\begin{pmatrix}n^{-1}\sum_i T_iX_i\{e_i''(1) - e_i''(0)\} \\ n^{-1}\sum_i T_iX_ie_i''(1)\end{pmatrix} = \frac{n_1n_0}{n^3}\begin{pmatrix}S(X\varepsilon) & \tfrac{1}{2}[S\{Xe''(1)\} - S\{Xe''(0)\} + S(X\varepsilon)] \\ \tfrac{1}{2}[S\{Xe''(1)\} - S\{Xe''(0)\} + S(X\varepsilon)] & S\{Xe''(1)\}\end{pmatrix},$$
using the polarization identity $2S(a, b) = S(a) + S(b) - S(a - b)$ applied to $X\{e''(1) - e''(0)\}$ and $Xe''(1)$. Therefore, according to (A.18), the asymptotic covariance of $\Delta_{\mathrm{TSLS}}$, and hence of $\hat{\beta}_{\mathrm{TSLS}} - \beta_c$, is the $(2, 2)$ block of the sandwich of this matrix between the block inverse and its transpose, which equals
$$\frac{n_1n_0}{n^3}\Big\{(p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})S(X\varepsilon)(p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})^T + (p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})S\{Xe''(1)\}(p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})^T$$
$$- (p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})[S\{Xe''(1)\} - S\{Xe''(0)\} + S(X\varepsilon)](p_1^{-1}p_0^{-1}\pi_c^{-1}S_{xx,c}^{-1})^T\Big\}$$
$$= (\pi_cS_{xx,c})^{-1}\left[\frac{S\{Xe''(1)\}}{n_1} + \frac{S\{Xe''(0)\}}{n_0} - \frac{S(X\varepsilon)}{n}\right](\pi_cS_{xx,c})^{-1}.$$
The asymptotic normality follows from the representation in (A.18) and the FPCLT.
Theorem 8: Decomposition of variation in non-compliance.
The following proof uses two facts: $\tau_a = \tau_n = 0$ and $\tau = \pi_c\tau_c$.

Proof of Theorem 8.
Write the total treatment effect variation as
$$S_{\tau\tau} = \frac{1}{n}\sum_{i=1}^n (\tau_i - \tau)^2 = \frac{1}{n}\sum_{i=1}^n \tau_i^2 - \tau^2 = \frac{1}{n}\sum_{i=1}^n I(U_i = c)\tau_i^2 - \pi_c^2\tau_c^2 = \pi_c\left(\frac{1}{n_c}\sum_{i=1}^n I(U_i = c)\tau_i^2 - \tau_c^2\right) + \pi_c(1 - \pi_c)\tau_c^2,$$
the treatment effect variation explained by compliance status as
$$S_{\tau\tau,U} = \sum_{u = c,a,n} \pi_u(\tau_u - \tau)^2 = \pi_c(\tau_c - \pi_c\tau_c)^2 + \pi_a(0 - \pi_c\tau_c)^2 + \pi_n(0 - \pi_c\tau_c)^2 = \pi_c\tau_c^2\left\{(1 - \pi_c)^2 + \pi_c(\pi_a + \pi_n)\right\} = \pi_c(1 - \pi_c)\tau_c^2,$$
and the subtotal treatment effect variation for Compliers as
$$S_{\tau\tau,c} = \frac{1}{n_c}\sum_{i=1}^n I(U_i = c)(\tau_i - \tau_c)^2 = \frac{1}{n_c}\sum_{i=1}^n I(U_i = c)\tau_i^2 - \tau_c^2.$$
Therefore, the three terms above satisfy $S_{\tau\tau} = \pi_cS_{\tau\tau,c} + S_{\tau\tau,U}$. The decomposition $S_{\tau\tau,c} = S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}$ follows immediately from the definition of $\beta_c$.
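The identity $S_{\tau\tau} = \pi_cS_{\tau\tau,c} + S_{\tau\tau,U}$ can be confirmed numerically; here is a small self-contained check (ours) with simulated compliance strata and the exclusion restrictions imposed:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
u = rng.choice(["c", "a", "n"], size=n, p=[0.6, 0.2, 0.2])
tau = np.where(u == "c", rng.normal(1.0, 1.0, size=n), 0.0)  # exclusion restrictions
pi_c = np.mean(u == "c")
tau_bar = tau.mean()                       # equals pi_c * tau_c
tau_c = tau[u == "c"].mean()
S_tt   = np.mean((tau - tau_bar) ** 2)     # total variation
S_tt_c = np.mean((tau[u == "c"] - tau_c) ** 2)
S_tt_U = pi_c * (1 - pi_c) * tau_c ** 2
assert np.isclose(S_tt, pi_c * S_tt_c + S_tt_U)
```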
Appendix B More detailed comments

Appendices B.1–B.4 give more details on some technical issues and extensions mentioned in the main text, and Appendix B.5 contains the proofs of the results in Appendix B.

Appendix B.1 Covariate adjustment to improve efficiency
In the main text, the role of covariates has been to model the treatment effect alone. In general, we also want to use covariates to reduce the sampling variability of $\hat{\beta}_{\mathrm{RI}}$, just as we can use covariates to get more precise estimates of the average treatment effect. In particular, the goal is to estimate $\hat{S}_{xt} \in \mathbb{R}^K$ more precisely; because these are the only random components in $\hat{\beta}_{\mathrm{RI}}$, if we estimate them more precisely, we estimate $\hat{\beta}_{\mathrm{RI}}$ more precisely as well. Let $W_i \in \mathbb{R}^J$ denote a vector of pretreatment covariates without the intercept term. Because $X_i$ and $W_i$ have different roles in estimation, they may also contain different sets of covariates, though, in practice, $X$ is likely to be a subset of $W$.

Following the covariate adjustment approach in survey sampling, we can obtain a model-assisted estimator for $\beta$ that uses $W$ to reduce sampling variability. To see this, we need several definitions. Define $\bar{W} = n^{-1}\sum_{i=1}^n W_i$ and $S_{ww} = n^{-1}\sum_{i=1}^n W_iW_i^T$, with $\det(S_{ww}) > 0$; define $\bar{W}_t$ and $\hat{S}_{ww,t}$ as the sample mean and covariance of $W$ under treatment arm $t$; and define $\hat{B}_t \in \mathbb{R}^{J \times K}$ as the regression coefficient of $Y^{\mathrm{obs}}X$ on $W$ for treatment arm $t$:
$$\hat{B}_t = \hat{S}_{ww,t}^{-1}\left\{\frac{1}{n_t}\sum_{i=1}^n I(T_i = t)W_i(Y_i^{\mathrm{obs}}X_i)^T\right\}.$$
The model-assisted estimator for $S_{xt}$ is then
$$\hat{S}^w_{xt} = \hat{S}_{xt} - \hat{B}_t^T(\bar{W}_t - \bar{W}), \qquad (t = 0, 1).$$
As a result, we can improve the randomization-based estimator via
$$\hat{\beta}^w_{\mathrm{RI}} = S_{xx}^{-1}(\hat{S}^w_{x1} - \hat{S}^w_{x0}).$$

Theorem A.1. The model-assisted estimator $\hat{\beta}^w_{\mathrm{RI}}$ is consistent for $\beta$ with asymptotic covariance
$$S_{xx}^{-1}\left[\frac{S\{E(1)\}}{n_1} + \frac{S\{E(0)\}}{n_0} - \frac{S(\Delta)}{n}\right]S_{xx}^{-1},$$
where $E_i(t) = Y_i(t)X_i - B_t^T(W_i - \bar{W})$ is the residual term and $\Delta_i = E_i(1) - E_i(0)$.

The estimator $\hat{\beta}^w_{\mathrm{RI}}$ uses covariates both to estimate treatment effect variation and to reduce sampling variability. Asymptotically, as long as $W$ is predictive of the marginal potential outcomes, the model-assisted estimator will improve precision over the unassisted estimator.
Appendix B.2 Fisherian exact inference

When $\varepsilon_i = 0$ for all $i$, we can obtain exact inference for $\beta$ based on the Fisher randomization test (Rubin, 1980; Rosenbaum, 2002; Ding et al., 2016). For a known $\beta$, the null hypothesis
$$H_0(\beta): Y_i(1) - Y_i(0) = X_i^T\beta \quad \text{for all } i \tag{A.19}$$
is sharp, in the sense of allowing for full imputation of all missing potential outcomes from the observed data. We can perform a randomization test using any sensible test statistic measuring the deviation from the null hypothesis $H_0(\beta)$; for example, the test statistic $t(T, Y^{\mathrm{obs}}; \beta)$ can be the difference-in-means, difference-in-medians, or Kolmogorov–Smirnov statistic comparing the two samples $\{Y_i^{\mathrm{obs}} - X_i^T\beta : T_i = 1, i = 1, \ldots, n\}$ and $\{Y_i^{\mathrm{obs}} : T_i = 0, i = 1, \ldots, n\}$. We can then obtain a $(1 - \alpha)$-level confidence region for $\beta$ by inverting a sequence of randomization tests:
$$\mathrm{CR}_\alpha = \{\beta : \text{Randomization test fails to reject } H_0(\beta) \text{ at significance level } \alpha\}.$$
The confidence region $\mathrm{CR}_\alpha$ is exact regardless of the sample size, and it is valid for general experimental designs if we use the corresponding assignment mechanism to simulate the null distribution of the test statistic. Due to the duality between testing and interval estimation, we reject the null hypothesis $H_0(X)$ of Section 3.3 if $\mathrm{CR}_\alpha$ contains no $\beta$ whose non-intercept coordinates are all zero, which controls the type one error rate at $\alpha$.
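For completeness, here is a minimal sketch of this randomization test for a completely randomized design, using the difference-in-means statistic on the $\beta$-adjusted outcomes; the function name is illustrative, and inverting it over a grid of $\beta$ values yields $\mathrm{CR}_\alpha$.

```python
import numpy as np

def frt_pvalue(X, y, t, beta, n_draws=2000, seed=0):
    """Fisher randomization test of the sharp null
    H0(beta): Y_i(1) - Y_i(0) = X_i' beta  (Appendix B.2).
    A sketch for a completely randomized design.
    """
    rng = np.random.default_rng(seed)
    y0 = y - t * (X @ beta)          # imputed control outcomes under H0
    def diff_means(assign):
        return abs(y0[assign == 1].mean() - y0[assign == 0].mean())
    obs = diff_means(t)
    draws = []
    for _ in range(n_draws):
        t_star = rng.permutation(t)  # re-randomize, holding n1 fixed
        draws.append(diff_means(t_star))
    return (np.sum(np.array(draws) >= obs) + 1) / (n_draws + 1)
```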
Appendix B.3 A variance ratio test

Raudenbush and Bloom (2015) noticed that if the variance of the treatment potential outcome is smaller than that of the control potential outcome, then the correlation between the individual treatment effect and the control potential outcome is negative. This statement does not involve any covariates, but it can be generalized to incorporate systematic and idiosyncratic treatment effect variation. Below we give a finite population version of their result.

Theorem A.2. If the finite population variance of $\{Y_i(1) - X_i^T\beta\}_{i=1}^n$ is smaller than that of $\{Y_i(0)\}_{i=1}^n$, then the idiosyncratic treatment effect variation, $\{\varepsilon_i\}_{i=1}^n$, is negatively correlated with the control potential outcomes.

Because the condition in Theorem A.2 depends only on the marginal distributions of the potential outcomes, we propose a formal variance ratio test of it using the observed data, which is a generalization of a similar theorem in Ding et al. (2016):

Theorem A.3. The variance ratio test with rejection region
$$\frac{\log s_1^2 - \log s_0^2}{\sqrt{(\hat{\kappa}_1 - 1)/n_1 + (\hat{\kappa}_0 - 1)/n_0}} < \Phi^{-1}(\alpha)$$
has asymptotic size no larger than $\alpha$, where $s_1^2$ and $\hat{\kappa}_1$ are the sample variance and kurtosis of $\{Y_i^{\mathrm{obs}} - X_i^T\hat{\beta}_{\mathrm{RI}} : T_i = 1, i = 1, \ldots, n\}$, $s_0^2$ and $\hat{\kappa}_0$ are the sample variance and kurtosis of $\{Y_i^{\mathrm{obs}} : T_i = 0, i = 1, \ldots, n\}$, and $\Phi^{-1}(\alpha)$ is the $\alpha$-th quantile of the standard normal distribution.

For finite population inference, the test in Theorem A.3 is generally conservative, but for superpopulation inference it is asymptotically exact. Note that Raudenbush and Bloom (2015) and Theorem A.2 concern only the detection of a negative association; unfortunately, there is no testable condition for a positive association.

Appendix B.4 More on noncompliance: estimating the bounds of the $R^2$s

The component $S_{\tau\tau,U}$ and the probability $\pi_c$ are directly identifiable according to the previous discussion. Furthermore, $S_{\delta\delta,c}$ is also identifiable according to the following result.

Corollary A.1. $S_{\delta\delta,c}$ can be expressed as the expectation of the following quantity:
$$\frac{1}{\pi_c}\left\{\frac{1}{n}\sum_{i=1}^n (\delta_i - \tau_c)^2 - \frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)(\delta_i - \tau_c)^2 - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_i(\delta_i - \tau_c)^2\right\}.$$
Because $\pi_c$, $\delta_i = X_i^T\beta_c$, and $\tau_c$ can be estimated by a plug-in approach, $S_{\delta\delta,c}$ can also be estimated from the observed data.

In the ITT case, estimation of the residual distributions is straightforward. In the noncompliance case, however, the estimation of $F_{1c}(y)$ and $F_{0c}(y)$ requires more discussion, because $U_i$ is a latent variable. To avoid notational clutter, we assume that $\gamma_{c1}$ and $\gamma_{c0}$ are known; in practice we can replace them by the randomization-based estimators $\hat{\gamma}_{c1,\mathrm{RI}}$ and $\hat{\gamma}_{c0,\mathrm{RI}}$, and the consistency of the final estimator will not be affected. Recall the potential residuals $e_i'(1)$ and $e_i'(0)$ defined in (17), and the observed value $e_i' = T_ie_i'(1) + (1 - T_i)e_i'(0)$. We define the following quantities:
$$\hat{F}_{11}(y) = \frac{1}{n_1}\sum_{i=1}^n T_iD_iI(e_i' \le y), \qquad \hat{F}_{10}(y) = \frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)I(e_i' \le y),$$
$$\hat{F}_{01}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_iI(e_i' \le y), \qquad \hat{F}_{00}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)(1 - D_i)I(e_i' \le y). \tag{A.20}$$
Similar to Corollary 3, we have the following result.

Corollary A.2. For any $y$,
$$E\{\hat{F}_{11}(y) - \hat{F}_{01}(y)\} = \pi_cF_{1c}(y), \qquad E\{\hat{F}_{00}(y) - \hat{F}_{10}(y)\} = \pi_cF_{0c}(y).$$
Therefore, we can estimate $F_{1c}(y)$ by $\{\hat{F}_{11}(y) - \hat{F}_{01}(y)\}/\hat{\pi}_c$, and $F_{0c}(y)$ by $\{\hat{F}_{00}(y) - \hat{F}_{10}(y)\}/\hat{\pi}_c$. As mentioned above, in practice we use $\hat{e}_i'$ instead of $e_i'$ in the formulas in (A.20).

Appendix B.5 Proofs of the theorems and corollaries in Appendix B
Proof of Theorem A.1.
The population-level OLS regression matrix of $Y(t)X$ on $W$ is
$$B_t = S_{ww}^{-1}\left\{\frac{1}{n}\sum_{i=1}^n W_i\{Y_i(t)X_i\}^T\right\} \in \mathbb{R}^{J \times K}.$$
Define $\tilde{S}^w_{xt} = \hat{S}_{xt} + B_t^T(\bar{W} - \bar{W}_t)$ and $\tilde{\beta}^w_{\mathrm{RI}} = S_{xx}^{-1}(\tilde{S}^w_{x1} - \tilde{S}^w_{x0})$. According to the same argument as (A.4), $\hat{\beta}^w_{\mathrm{RI}}$ and $\tilde{\beta}^w_{\mathrm{RI}}$ have the same asymptotic covariance, so in the following we need only discuss the covariance of $\tilde{\beta}^w_{\mathrm{RI}}$. Because
$$\tilde{S}^w_{x1} - \tilde{S}^w_{x0} = \frac{1}{n_1}\sum_{i=1}^n T_i\left\{Y_i(1)X_i + B_1^T(\bar{W} - W_i)\right\} - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)\left\{Y_i(0)X_i + B_0^T(\bar{W} - W_i)\right\} = \frac{1}{n_1}\sum_{i=1}^n T_iE_i(1) - \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)E_i(0)$$
can be represented as the difference between the sample means of $E_i(1)$ and $E_i(0)$, applying Theorem 2 we obtain its covariance:
$$\mathrm{cov}\left(\tilde{S}^w_{x1} - \tilde{S}^w_{x0}\right) = \frac{S\{E(1)\}}{n_1} + \frac{S\{E(0)\}}{n_0} - \frac{S\{\Delta\}}{n},$$
which completes the proof.

Proof of Theorem A.2.
For simplicity, we abuse the variance and covariance notation for the finite population; for example, $\mathrm{var}\{Y(0)\} = \sum_{i=1}^n \{Y_i(0) - \bar{Y}(0)\}^2/(n-1)$. If $\mathrm{var}\{Y(1) - X^T\beta\} \le \mathrm{var}\{Y(0)\}$, then $\mathrm{var}\{Y(0) + \varepsilon\} \le \mathrm{var}\{Y(0)\}$. Expanding the left-hand side,
$$\mathrm{var}\{Y(0)\} + \mathrm{var}\{\varepsilon\} + 2\,\mathrm{cov}\{Y(0), \varepsilon\} \le \mathrm{var}\{Y(0)\},$$
which implies $2\,\mathrm{cov}\{Y(0), \varepsilon\} \le -\mathrm{var}\{\varepsilon\} < 0$.

Let $(c_1, \cdots, c_n)^T$ and $(d_1, \ldots, d_n)^T$ be two vectors of nonnegative constants with the same mean $m > 0$ and finite population variances $S_{cc}$ and $S_{dd}$. The difference vector $(c_1 - d_1, \ldots, c_n - d_n)^T$ has mean zero and variance $S_{c-d,c-d}$. Let
$$\hat{\theta}_c = \frac{1}{n_1}\sum_{i=1}^n T_ic_i, \qquad \hat{\theta}_d = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)d_i$$
be the sample means of the treatment and control groups, respectively.
Lemma A.5. Under the regularity conditions for the FPCLT, $\log\hat{\theta}_c - \log\hat{\theta}_d$ has asymptotic mean zero and variance
$$\frac{1}{m^2}\left(\frac{S_{cc}}{n_1} + \frac{S_{dd}}{n_0} - \frac{S_{c-d,c-d}}{n}\right). \tag{A.21}$$

Proof of Lemma A.5.
According to the FPCLT, we have the following joint asymptotic normality of $\hat{\theta}_c$ and $\hat{\theta}_d$:
$$\begin{pmatrix}\hat{\theta}_c \\ \hat{\theta}_d\end{pmatrix} = \begin{pmatrix}n_1^{-1}\sum_{i=1}^n T_ic_i \\ n_0^{-1}\sum_{i=1}^n (1 - T_i)d_i\end{pmatrix} \stackrel{a}{\sim} N\left[\begin{pmatrix}m \\ m\end{pmatrix}, \begin{pmatrix}V_{cc} & V_{cd} \\ V_{cd} & V_{dd}\end{pmatrix}\right],$$
where
$$V_{cc} = \frac{n_0}{n_1n}S_{cc}, \qquad V_{dd} = \frac{n_1}{n_0n}S_{dd}, \qquad V_{cd} = -\frac{1}{2n}\left(S_{cc} + S_{dd} - S_{c-d,c-d}\right).$$
Applying a Taylor expansion at $m$, we have $\log\hat{\theta}_c - \log\hat{\theta}_d = \{(\hat{\theta}_c - m) - (\hat{\theta}_d - m)\}/m + o_P(n^{-1/2})$, which, coupled with Neyman (1923)'s variance formula, gives the asymptotic variance of $\log\hat{\theta}_c - \log\hat{\theta}_d$ in (A.21).

Proof of Theorem A.3.
First, as a direct consequence of Lemma A.5, the finite population variance is never larger than the superpopulation variance, with equality only when $S_{c-d,c-d} = 0$. Therefore, we need only show that the test in Theorem A.3 is asymptotically exact for superpopulation inference, so that the asymptotic size of the test is no larger than $\alpha$ for finite population inference. Second, replacing $\beta$ by its consistent estimator $\hat{\beta}_{\mathrm{RI}}$ does not affect the asymptotic distribution of the test statistic, due to Slutsky's Theorem; for simplicity, we treat $\beta$ as known in our asymptotic analysis. With these two ingredients, Theorem A.3 follows directly from the variance ratio test in Ding et al. (2016, Theorem 2, Supplementary Material).

Proof of Corollary A.1. The conclusion follows from
$$E\left\{\frac{1}{n_1}\sum_{i=1}^n T_i(1 - D_i)(\delta_i - \tau_c)^2\right\} = E\left\{\frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = n)(\delta_i - \tau_c)^2\right\} = \frac{1}{n}\sum_{i=1}^n I(U_i = n)(\delta_i - \tau_c)^2,$$
$$E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)D_i(\delta_i - \tau_c)^2\right\} = E\left\{\frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = a)(\delta_i - \tau_c)^2\right\} = \frac{1}{n}\sum_{i=1}^n I(U_i = a)(\delta_i - \tau_c)^2.$$

Proof of Corollary A.2.
We rewrite
$$\hat{F}_{11}(y) = \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = c)I\{e_i'(1) \le y\} + \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = a)I\{e_i'(1) \le y\},$$
$$\hat{F}_{10}(y) = \frac{1}{n_1}\sum_{i=1}^n T_iI(U_i = n)I\{e_i'(1) \le y\},$$
$$\hat{F}_{01}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = a)I\{e_i'(0) \le y\},$$
$$\hat{F}_{00}(y) = \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = c)I\{e_i'(0) \le y\} + \frac{1}{n_0}\sum_{i=1}^n (1 - T_i)I(U_i = n)I\{e_i'(0) \le y\}.$$
In the above formulas, the only random components are the $T_i$; taking expectations stratum by stratum and using the exclusion restrictions yields the conclusions.