Heterogeneous Coefficients, Control Variables, and Identification of Treatment Effects
WHITNEY K. NEWEY † AND SAMI STOULI §

Abstract.
Multidimensional heterogeneity and endogeneity are important features of models with multiple treatments. We consider a heterogeneous coefficients model where the outcome is a linear combination of dummy treatment variables, with each variable representing a different kind of treatment. We use control variables to give necessary and sufficient conditions for identification of average treatment effects. With mutually exclusive treatments we find that, provided the generalized propensity scores (Imbens, 2000) are bounded away from zero with probability one, a simple identification condition is that their sum be bounded away from one with probability one. These results generalize the classical identification result of Rosenbaum and Rubin (1983) for binary treatments.
Keywords: Treatment effect; Multiple treatments; Heterogeneous coefficients; Control variable; Identification; Conditional nonsingularity; Propensity score.

1. Introduction
Models that allow for multiple treatments are important for program evaluation and the estimation of treatment effects (Cattaneo 2010; Heckman, Ichimura, Smith, and Todd 1998; Imai and van Dyk 2004; Imbens 2000; Graham and Pinto 2018; Lechner 2001; Wooldridge 2004). A general class is heterogeneous coefficient models where the outcome is a linear combination of dummy treatment variables and unobserved heterogeneity. These models allow for multiple treatment regimes, with each variable representing a different kind of treatment. These models also feature multidimensional heterogeneity, with the dimension of unobserved heterogeneity being determined by the number of treatment regimes.

Endogeneity is often a problem in these models because we are interested in the effect of treatment variables on an outcome, and the treatment variables are correlated with heterogeneity. Control variables provide an important means of controlling for endogeneity with multidimensional heterogeneity. For treatment effects, a control variable is an observed variable that makes heterogeneity and treatment variables independent when it is conditioned on (Rosenbaum and Rubin, 1983).

We use control variables to give necessary and sufficient conditions for identification of average treatment effects based on conditional nonsingularity of the second moment matrix of the vector of dummy treatment variables given the controls. This result is familiar in the binary treatment case, but its generalization to multiple treatments appears to be new. With mutually exclusive treatments we find that, provided the generalized propensity scores (Imbens, 2000) are bounded away from zero with probability one, a simple condition for identification is that their sum be bounded away from one with probability one. These results provide an important generalization of the classical identification result of Rosenbaum and Rubin (1983) for binary treatments.

† Department of Economics, MIT.
§ Department of Economics, University of Bristol.

2. Modeling of Treatment Effects
Let $Y$ denote an outcome variable of interest, $X$ a vector of dummy variables $X(t)$, $t \in \mathcal{T} \equiv \{1, \ldots, T\}$, taking value one if treatment $t$ occurs and zero otherwise, and $\varepsilon$ a structural disturbance vector of finite dimension. We consider a heterogeneous coefficients model of the form

\[ Y = p(X)'\varepsilon, \qquad p(X) = (1, X(1), \ldots, X(T))'. \tag{2.1} \]

This model is linear in the treatment dummy variables, with coefficients $\varepsilon$ that need not be independent of $X$. We assume that the vector $\varepsilon$ is mean independent of the endogenous treatments $X$, conditional on an observable control variable denoted $V$.

Assumption 1. For the model in (2.1), there exists a control variable $V$ such that $E[\varepsilon \mid X, V] = E[\varepsilon \mid V]$.

The Rosenbaum and Rubin (1983) treatment effects model is included as a special case where $X \in \{0, 1\}$ is a treatment dummy variable that is equal to one if treatment occurs and equals zero without treatment, and

\[ p(X) = (1, X)'. \]

In this case $\varepsilon = (\varepsilon_1, \varepsilon_2)'$ is two dimensional, with $\varepsilon_1$ giving the outcome without treatment and $\varepsilon_2$ being the treatment effect. Here the control variables in $V$ would be observable variables such that Assumption 1 holds, i.e., the coefficients $(\varepsilon_1, \varepsilon_2)$ are mean independent of treatment conditional on controls; this is the unconfoundedness assumption of Rosenbaum and Rubin (1983).

A central object of interest in model (2.1) is the average structural function given by $\mu(X) \equiv p(X)'E[\varepsilon]$; see Blundell and Powell (2003) and Wooldridge (2005). This function is also referred to as the dose-response function in the statistics literature (e.g., Imbens, 2000). When $X \in \{0, 1\}$ is a dummy variable for treatment, $\mu(0)$ gives the average outcome if every unit remained untreated and $\mu(1)$ the average outcome if every unit were treated, with $\mu(1) - \mu(0)$ being the average treatment effect. In general, the average effect of some treatment $t \in \mathcal{T}$ is $\mu(e_t) - \mu(0_T)$, with $e_t = (0, \ldots, 0, 1, 0, \ldots, 0)'$ defined as a $T$-vector with all components equal to zero, except the $t$th, which is one, and $0_T$ a $T$-vector of zeros.

The conditional mean independence assumption and the form of the structural function $p(X)'\varepsilon$ in (2.1) together imply that the control regression function of $Y$ given $(X, V)$, $E[Y \mid X, V]$, is a linear combination of the treatment variables:

\[ E[Y \mid X, V] = p(X)'E[\varepsilon \mid X, V] = p(X)'E[\varepsilon \mid V] = p(X)'q(V), \qquad q(V) \equiv E[\varepsilon \mid V]. \tag{2.2} \]

The average structural function can thus be expressed as a known linear combination of $E[q(V)]$ from equation (2.2). By iterated expectations,

\[ p(X)'E[q(V)] = p(X)'E[E[\varepsilon \mid V]] = \mu(X). \]

We use the varying coefficient structure of the control regression function (2.2) and the implied linear form of $\mu(X)$ to give conditions that are necessary as well as sufficient for identification.

3. Identification Analysis
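To fix ideas before the formal results, the relation $\mu(X) = p(X)'E[q(V)]$ from Section 2 can be illustrated with a small simulation in the binary treatment case. The sketch below is not taken from the paper: the data-generating process, the three-cell discrete control, the sample size, and the function names `simulate` and `estimate_mean_q` are all assumptions made for illustration. Within each cell of $V$ it solves the sample analogue of $E[p(X)p(X)' \mid V]\,q(V) = E[p(X)Y \mid V]$ and then averages $q(V)$ over the distribution of $V$. By construction, $E[\varepsilon_2] = 3$ is the true average treatment effect.

```python
import numpy as np

def simulate(n, rng):
    """Hypothetical design: binary treatment, control V in {0, 1, 2}."""
    v = rng.integers(0, 3, size=n)                # discrete control variable
    pscore = np.array([0.2, 0.5, 0.8])[v]         # P(V), strictly inside (0, 1)
    x = (rng.random(n) < pscore).astype(float)    # treatment depends on V only
    eps1 = v + rng.normal(size=n)                 # outcome without treatment
    eps2 = 1.0 + 2.0 * v + rng.normal(size=n)     # unit-level treatment effect
    y = eps1 + eps2 * x                           # model (2.1) with p(X) = (1, X)'
    return y, x, v

def estimate_mean_q(y, x, v):
    """Estimate E[q(V)] by solving the cell-wise moment equations."""
    p = np.column_stack([np.ones_like(x), x])     # rows are p(X_i)'
    mean_q = np.zeros(2)
    for cell in np.unique(v):
        mask = v == cell
        m = p[mask].T @ p[mask] / mask.sum()      # sample E[p(X)p(X)' | V = cell]
        b = p[mask].T @ y[mask] / mask.sum()      # sample E[p(X)Y | V = cell]
        q_cell = np.linalg.solve(m, b)            # requires m to be nonsingular
        mean_q += mask.mean() * q_cell            # weight by sample share of the cell
    return mean_q

rng = np.random.default_rng(0)
y, x, v = simulate(200_000, rng)
mean_q = estimate_mean_q(y, x, v)
ate_hat = mean_q[1]                               # estimate of mu(1) - mu(0)
naive = y[x == 1].mean() - y[x == 0].mean()       # ignores the control variable
print("control-variable estimate of the ATE:", ate_hat)
print("naive difference in means:", naive)
```

In this design the naive difference in means is biased upward because units with larger $V$ are both more likely to be treated and have larger treatment effects. The call to `np.linalg.solve` also makes visible where conditional nonsingularity enters: if some cell contained only treated or only untreated units, the cell-wise system would be singular.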
A sufficient condition for identification of the average structural function is nonsingularity of the second moment matrix of the treatment dummies given the controls, $E[p(X)p(X)' \mid V]$, with probability one. Under the maintained assumption that $E[p(X)p(X)']$ is nonsingular, this condition is also necessary. Theorem 1 states our first main result. The proofs of all formal results are given in the Appendix.

Theorem 1. Suppose that $E[\|\varepsilon\|^2] < \infty$, $E[p(X)p(X)']$ is nonsingular, and Assumption 1 holds. Then: $E[p(X)p(X)' \mid V]$ is nonsingular with probability one if, and only if, $\mu(X)$ is identified.

When $X \in \{0, 1\}$ and $p(X) = (1, X)'$, the identification condition becomes the standard condition for the treatment effect model

\[ Y = \varepsilon_1 + \varepsilon_2 X, \qquad E[\varepsilon \mid X, V] = E[\varepsilon \mid V], \qquad \varepsilon \equiv (\varepsilon_1, \varepsilon_2)'. \]

The identification condition is that the conditional second moment matrix of $(1, X)'$ given $V$ is nonsingular with probability one, which is the same as

\[ \mathrm{var}(X \mid V) = P(V)[1 - P(V)] > 0, \qquad P(V) \equiv \Pr[X = 1 \mid V], \tag{3.1} \]

with probability one, where $P(V)$ is the propensity score. Here we can see that the identification condition is the same as $0 < P(V) < 1$ with probability one, which is the standard identification condition.

With multiple treatments, because $p(X)$ includes an intercept, the identification condition is the same as nonsingularity of the variance matrix $\mathrm{var}(X \mid V)$ with probability one. This result generalizes (3.1).

Theorem 2. $E[p(X)p(X)' \mid V]$ is nonsingular with probability one if, and only if, the variance matrix $\mathrm{var}(X \mid V)$ is nonsingular with probability one.

Considerable simplification occurs with mutually exclusive treatments, which allows for the formulation of an equivalent condition for nonsingularity of $E[p(X)p(X)' \mid V]$ solely in terms of the generalized propensity scores (Imbens, 2000). This result generalizes the standard identification condition for binary $X$.

Theorem 3. Suppose that $\Pr[X(t) = 1 \mid V] > 0$ for each $t \in \mathcal{T}$ with probability one. With mutually exclusive treatments, $E[p(X)p(X)' \mid V]$ is nonsingular with probability one if, and only if,

\[ \sum_{s=1}^{T} \Pr[X(s) = 1 \mid V] < 1, \tag{3.2} \]

with probability one.

4. Discussion
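Condition (3.2) and the equivalence in Theorem 2 lend themselves to a quick numerical check at a fixed value of the controls. The sketch below is illustrative only: the propensity score vectors and the helper names `second_moment_matrix` and `conditional_variance` are assumptions, not part of the paper. For mutually exclusive treatments with generalized propensity scores $\pi_t = \Pr[X(t) = 1 \mid V = v]$, it builds $E[p(X)p(X)' \mid V = v]$ and $\mathrm{var}(X \mid V = v) = \mathrm{diag}(\pi) - \pi\pi'$ and compares their smallest eigenvalues when the scores sum to less than one and when they sum to one.

```python
import numpy as np

def second_moment_matrix(pi):
    """E[p(X)p(X)' | V = v] for mutually exclusive treatments with scores pi."""
    pi = np.asarray(pi, dtype=float)
    size = pi.size + 1
    m = np.empty((size, size))
    m[0, 0] = 1.0              # intercept squared
    m[0, 1:] = pi              # E[X' | V = v]
    m[1:, 0] = pi              # E[X | V = v]
    m[1:, 1:] = np.diag(pi)    # E[XX' | V = v] is diagonal under mutual exclusivity
    return m

def conditional_variance(pi):
    """var(X | V = v) = diag(pi) - pi pi' for mutually exclusive treatments."""
    pi = np.asarray(pi, dtype=float)
    return np.diag(pi) - np.outer(pi, pi)

interior = [0.2, 0.3, 0.1]     # scores sum to 0.6, so (3.2) holds
boundary = [0.5, 0.3, 0.2]     # scores sum to one, so (3.2) fails

for label, pi in (("interior", interior), ("boundary", boundary)):
    lam_m = np.linalg.eigvalsh(second_moment_matrix(pi)).min()
    lam_v = np.linalg.eigvalsh(conditional_variance(pi)).min()
    print(label, "sum of scores:", sum(pi))
    print("  smallest eigenvalue of E[pp'|V=v]:", lam_m)
    print("  smallest eigenvalue of var(X|V=v):", lam_v)
```

Both matrices have determinant $\prod_t \pi_t \,(1 - \sum_s \pi_s)$, so up to rounding error the smallest eigenvalue is zero exactly when the scores sum to one. In applications the scores would have to be estimated, so any such check would be subject to sampling error.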
The heterogeneous coefficients formulation we propose for multiple treatment effects reveals the central role of the conditional nonsingularity condition for identification. Because this condition is in principle testable, establishing that it is also necessary demonstrates testability of identification (e.g., Breusch, 1986). With mutually exclusive treatments, the formulation of the equivalent condition (3.2) thus relates testability of identification to the generalized propensity scores. This is a generalization of the relationship between testability of identification and the propensity score in the binary treatment case.

Conditions that are both necessary and sufficient are also important for the determination of minimal conditions for identification. In unpublished work Wooldridge (2004) considers a restricted version of our model with $E[X \mid \varepsilon, V] = E[X \mid V]$ and $E[p(X)p(X)' \mid \varepsilon, V] = E[p(X)p(X)' \mid V]$, and shows that $q(V)$ is identified if $E[p(X)p(X)' \mid V]$ is invertible. The additional conditional second moments assumption implies that his identification condition differs from ours. Thus his result and proof do not apply in our setting, which only assumes conditional mean independence $E[\varepsilon \mid X, V] = E[\varepsilon \mid V]$, and our results show that conditional second moments independence is not necessary for identification in multiple treatment effect models.

Graham and Pinto (2018) consider a related approach in work independent of the first version of this paper (Newey and Stouli, 2018) where the identification result (Lemma 1 in the Appendix) was derived. The conditional nonsingularity condition we propose is weaker than their identification condition, and we study necessity as well as sufficiency for identification of average treatment effects.

The identification results we obtain here are of general interest for the vast treatment effects literature (see, e.g., Imbens, 2004; Imbens and Wooldridge, 2009; and Athey and Imbens, 2017, for reviews) and complement existing results on identification of treatment effects.

Appendix A. Proofs
A.1. Preliminary result.

Lemma 1. Suppose that $E[\|\varepsilon\|^2] < \infty$ and Assumption 1 holds. If $E[p(X)p(X)' \mid V]$ is nonsingular with probability one then $q(V)$ is identified.

Proof. Let $\lambda_{\min}(V)$ denote the smallest eigenvalue of $E[p(X)p(X)' \mid V]$. Suppose that $\bar{q}(V) \neq q(V)$ with positive probability on a set $\widetilde{\mathcal{V}}$, and note that $\lambda_{\min}(V) > 0$ on the support $\mathcal{V}$ of $V$ by assumption. Then

\[
\begin{aligned}
E\left[\left\{p(X)'\{\bar{q}(V) - q(V)\}\right\}^2\right]
&= E\left[\{\bar{q}(V) - q(V)\}' E[p(X)p(X)' \mid V] \{\bar{q}(V) - q(V)\}\right] \\
&\geq E\left[\|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)\right] \\
&\geq E\left[1(V \in \mathcal{V} \cap \widetilde{\mathcal{V}}) \|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)\right].
\end{aligned}
\]

By definition $\Pr[\widetilde{\mathcal{V}}] > 0$ and $\widetilde{\mathcal{V}} \subseteq \mathcal{V}$, so that $\widetilde{\mathcal{V}} \cap \mathcal{V} = \widetilde{\mathcal{V}}$. Thus the fact that $\|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)$ is positive on $\widetilde{\mathcal{V}} \cap \mathcal{V}$ implies

\[ E\left[1(V \in \mathcal{V} \cap \widetilde{\mathcal{V}}) \|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)\right] > 0. \]

We have shown that, for $\bar{q}(V) \neq q(V)$ with positive probability on a set $\widetilde{\mathcal{V}}$,

\[ E\left[\left\{p(X)'\{\bar{q}(V) - q(V)\}\right\}^2\right] > 0, \]

which implies $p(X)'\bar{q}(V) \neq p(X)'q(V)$. Therefore, $q(V)$ is identified from $E[Y \mid X, V]$. □

A.2. Proof of Theorem 1.
We first show that nonsingularity of $E[p(X)p(X)' \mid V]$ with probability one implies identification of $\mu(X)$. By Lemma 1, if $E[p(X)p(X)' \mid V]$ is nonsingular with probability one then $q(V)$ is identified, and hence $E[q(V)]$ also is. By $p(X)$ being a known function, $p(X)'E[q(V)] = \mu(X)$ is identified.

We now establish that nonsingularity of $E[p(X)p(X)' \mid V]$ with probability one is necessary for identification of $\mu(X)$. It suffices to show that singularity of $E[p(X)p(X)' \mid V]$ with positive probability implies that $\mu(X)$ is not identified, i.e., there exists an observationally equivalent $\bar{q}(V) \neq q(V)$ with positive probability such that $p(X)'E[\bar{q}(V)] \neq p(X)'E[q(V)]$ with positive probability. By nonsingularity of $E[p(X)p(X)']$ and linearity of $\mu(X)$, the conclusion holds if, and only if, there exists an observationally equivalent $\bar{q}(V) \neq q(V)$ with positive probability such that $E[\bar{q}(V)] \neq E[q(V)]$.

Suppose that $E[p(X)p(X)' \mid V]$ is singular with positive probability and let $\Delta(V)$ be such that $E[p(X)p(X)' \mid V]\Delta(V) = 0$. We have that $\Delta(V) \neq 0$ on a set $\widetilde{\mathcal{V}}$ with $\Pr[\widetilde{\mathcal{V}}] > 0$. For $J = T + 1$, define

\[ \widetilde{\mathcal{V}}_j = \{v \in \widetilde{\mathcal{V}} : \Delta_j(v) \neq 0\}, \qquad j \in \{1, \ldots, J\}. \]

Then $\cup_{j=1}^{J} \widetilde{\mathcal{V}}_j = \{v \in \widetilde{\mathcal{V}} : \Delta(v) \neq 0\} = \widetilde{\mathcal{V}}$. Hence

\[ 0 < \Pr[\widetilde{\mathcal{V}}] = \Pr[\cup_{j=1}^{J} \widetilde{\mathcal{V}}_j] \leq \sum_{j=1}^{J} \Pr[\widetilde{\mathcal{V}}_j], \]

which implies that $\Pr[\widetilde{\mathcal{V}}_{j^*}] > 0$ for some $j^* \in \{1, \ldots, J\}$.

Set $\widetilde{\Delta}(v) = \Delta(v)$ for $v \in \widetilde{\mathcal{V}}_{j^*}$, and $\widetilde{\Delta}(v) = 0$ otherwise. By construction $\widetilde{\Delta}_{j^*}(V) \neq 0$ on $\widetilde{\mathcal{V}}_{j^*}$, and letting

\[ \widetilde{\widetilde{\Delta}}(V) = \mathrm{sign}\{\widetilde{\Delta}_{j^*}(V)\} \frac{\widetilde{\Delta}(V)}{\|\widetilde{\Delta}(V)\|}, \]

we have that $\widetilde{\widetilde{\Delta}}_{j^*}(V) > 0$ on $\widetilde{\mathcal{V}}_{j^*}$ and $\|\widetilde{\widetilde{\Delta}}(V)\| = 1$, and hence $E[|\widetilde{\widetilde{\Delta}}(V)|] < \infty$ and $E[\widetilde{\widetilde{\Delta}}_{j^*}(V)] \neq 0$. Therefore $E[\widetilde{\widetilde{\Delta}}(V)] \neq 0$, which implies that $E[q(V) + \widetilde{\widetilde{\Delta}}(V)] \neq E[q(V)]$. The result follows. □

A.3. Proof of Theorem 2.
The matrices $E[p(X)p(X)' \mid V]$ and $\mathrm{var}(X \mid V)$ are positive semidefinite, so each is nonsingular if, and only if, it is positive definite. The matrix $E[p(X)p(X)' \mid V]$ is of the form

\[ E[p(X)p(X)' \mid V] = \begin{pmatrix} 1 & E[X' \mid V] \\ E[X \mid V] & E[XX' \mid V] \end{pmatrix}, \tag{A.1} \]

and is positive definite if, and only if, the Schur complement of $1$ in (A.1) is positive definite (Boyd and Vandenberghe, 2004, Appendix A.5.5), i.e., if, and only if,

\[ E[XX' \mid V] - E[X \mid V]E[X' \mid V] = \mathrm{var}(X \mid V) \]

is positive definite with probability one, as claimed. □

A.4. Proof of Theorem 3.
For a vector $w \in \mathbb{R}^T$, let $\mathrm{diag}(w)$ denote the $T \times T$ diagonal matrix with diagonal elements $w_1, \ldots, w_T$. For mutually exclusive treatments, the matrix $E[p(X)p(X)' \mid V]$ is of the form

\[ E[p(X)p(X)' \mid V] = \begin{pmatrix} 1 & E[X' \mid V] \\ E[X \mid V] & \mathrm{diag}(E[X \mid V]) \end{pmatrix}. \tag{A.2} \]

The matrix $\mathrm{diag}(E[X \mid V])$ has diagonal elements $E[X(t) \mid V] = \Pr[X(t) = 1 \mid V] > 0$, for each $t \in \mathcal{T}$, and hence is positive definite. Therefore, $E[p(X)p(X)' \mid V]$ is positive definite if, and only if, the Schur complement of $\mathrm{diag}(E[X \mid V])$ in (A.2) is positive definite (Boyd and Vandenberghe, 2004, Appendix A.5.5), i.e., if, and only if,

\[ 0 < 1 - E[X' \mid V]\,\mathrm{diag}(E[X \mid V])^{-1} E[X \mid V] = 1 - \sum_{s=1}^{T} E[X(s) \mid V] = 1 - \sum_{s=1}^{T} \Pr[X(s) = 1 \mid V], \]

with probability one, as claimed. □

References

Athey, S. and Imbens, G. (2017). The state of applied econometrics: causality and policy evaluation. Journal of Economic Perspectives. 31, 3–32.
Blundell, R. and Powell, J. L. (2003). Endogeneity in nonparametric and semiparametric regression models. Econometric Society Monographs. 36, 312–357.
Boyd, S. P. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Breusch, T. S. (1986). Hypothesis testing in unidentified models. Review of Economic Studies. 53, 635–651.
Cattaneo, M. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics.
Graham, B. S. and Pinto, C. C. D. X. (2018). Semiparametrically efficient estimation of the average linear regression function. eprint arXiv:1810.12511.
Heckman, J. J., Ichimura, H., Smith, J. and Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica. 66, 1017–1098.
Imai, K. and van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association. 99, 854–866.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika. 87, 706–710.
Imbens, G. (2004). Nonparametric estimation of average treatment effects under exogeneity. Review of Economics and Statistics. 86, 4–29.
Imbens, G. and Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature. 47, 5–86.
Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In Econometric evaluation of labour market policies in Europe, 43–58. Physica, Heidelberg.
Newey, W. K. and Stouli, S. (2018). Heterogenous coefficients, discrete instruments, and identification of treatment effects. eprint arXiv:1811.09837.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika. 70, 41–55.
Wooldridge, J. M. (2004). Estimating average partial effects under conditional moment independence assumptions. Cemmap working paper CWP03/04.
Wooldridge, J. M. (2005). Unobserved heterogeneity and the estimation of average partial effects. In Identification and inference for econometric models, pp. 27–55.