Heterogeneous Coefficients, Control Variables, and Identification of Treatment Effects

Abstract

Multidimensional heterogeneity and endogeneity are important features of models with multiple treatments. We consider a heterogeneous coefficients model where the outcome is a linear combination of dummy treatment variables, with each variable representing a different kind of treatment. We use control variables to give necessary and sufficient conditions for identification of average treatment effects. With mutually exclusive treatments we find that, provided the generalized propensity scores (Imbens, 2000) are bounded away from zero with probability one, a simple identification condition is that their sum be bounded away from one with probability one. These results generalize the classical identification result of Rosenbaum and Rubin (1983) for binary treatments.


arXiv [econ.EM]

WHITNEY K. NEWEY† AND SAMI STOULI§

Keywords: Treatment effect; Multiple treatments; Heterogeneous coefficients; Control variable; Identification; Conditional nonsingularity; Propensity score.

1. Introduction

Models that allow for multiple treatments are important for program evaluation and the estimation of treatment effects (Cattaneo 2010; Heckman, Ichimura, Smith, and Todd 1998; Imai and van Dyk 2004; Imbens 2000; Graham and Pinto 2018; Lechner 2001; Wooldridge 2004). A general class is heterogeneous coefficient models where the outcome is a linear combination of dummy treatment variables and unobserved heterogeneity. These models allow for multiple treatment regimes, with each variable representing a different kind of treatment. These models also feature multidimensional heterogeneity, with the dimension of unobserved heterogeneity being determined by the number of treatment regimes.

Endogeneity is often a problem in these models because we are interested in the effect of treatment variables on an outcome, and the treatment variables are correlated with heterogeneity. Control variables provide an important means of controlling for endogeneity with multidimensional heterogeneity. For treatment effects, a control variable is an observed variable that makes heterogeneity and treatment variables independent when it is conditioned on (Rosenbaum and Rubin, 1983).

We use control variables to give necessary and sufficient conditions for identification of average treatment effects based on conditional nonsingularity of the second moment matrix of the vector of dummy treatment variables given the controls. This result is familiar in the binary treatment case, but its generalization to multiple treatments appears to be new. With mutually exclusive treatments we find that, provided the generalized propensity scores (Imbens, 2000) are bounded away from zero with probability one, a simple condition for identification is that their sum be bounded away from one with probability one. These results provide an important generalization of Rosenbaum and Rubin's (1983) classical identification result for binary treatments.

† Department of Economics, MIT, [email protected].
§ Department of Economics, University of Bristol, [email protected].

2. Modeling of Treatment Effects

Let $Y$ denote an outcome variable of interest, $X$ a vector of dummy variables $X(t)$, $t \in \mathcal{T} \equiv \{1, \ldots, T\}$, taking value one if treatment $t$ occurs and zero otherwise, and $\varepsilon$ a structural disturbance vector of finite dimension. We consider a heterogeneous coefficients model of the form

(2.1)  $Y = p(X)'\varepsilon, \quad p(X) = (1, X(1), \ldots, X(T))'.$

This model is linear in the treatment dummy variables, with coefficients $\varepsilon$ that need not be independent of $X$. We assume that the vector $\varepsilon$ is mean independent of the endogenous treatments $X$, conditional on an observable control variable denoted $V$.

Assumption 1. For the model in (2.1), there exists a control variable $V$ such that $E[\varepsilon \mid X, V] = E[\varepsilon \mid V]$.

The Rosenbaum and Rubin (1983) treatment effects model is included as a special case where $X \in \{0, 1\}$ is a treatment dummy variable that is equal to one if treatment occurs and equals zero without treatment, and $p(X) = (1, X)'$. In this case $\varepsilon = (\varepsilon_1, \varepsilon_2)'$ is two dimensional with $\varepsilon_1$ giving the outcome without treatment, and $\varepsilon_2$ being the treatment effect. Here the control variables in $V$ would be observable variables such that Assumption 1 holds, i.e., the coefficients $(\varepsilon_1, \varepsilon_2)$ are mean independent of treatment conditional on controls; this is the unconfoundedness assumption of Rosenbaum and Rubin (1983).

A central object of interest in model (2.1) is the average structural function given by $\mu(X) \equiv p(X)'E[\varepsilon]$; see Blundell and Powell (2003) and Wooldridge (2005). This function is also referred to as the dose-response function in the statistics literature (e.g., Imbens, 2000). When $X \in \{0, 1\}$ is a dummy variable for treatment, $\mu(0)$ gives the average outcome if every unit remained untreated and $\mu(1)$ the average outcome if every unit were treated, with $\mu(1) - \mu(0)$ being the average treatment effect. In general, the average effect of some treatment $t \in \mathcal{T}$ is $\mu(e_t) - \mu(0_T)$, with $e_t = (0, \ldots, 0, 1, 0, \ldots, 0)'$ defined as a $T$-vector with all components equal to zero, except the $t$th, which is one, and $0_T$ a $T$-vector of zeros.

The conditional mean independence assumption and the form of the structural function $p(X)'\varepsilon$ in (2.1) together imply that the control regression function of $Y$ given $(X, V)$, $E[Y \mid X, V]$, is a linear combination of the treatment variables:

(2.2)  $E[Y \mid X, V] = p(X)'E[\varepsilon \mid X, V] = p(X)'E[\varepsilon \mid V] = p(X)'q(V), \quad q(V) \equiv E[\varepsilon \mid V].$

The average structural function can thus be expressed as a known linear combination of $E[q(V)]$ from equation (2.2). By iterated expectations,

$p(X)'E[q(V)] = p(X)'E[E[\varepsilon \mid V]] = \mu(X).$

We use the varying coefficient structure of the control regression function (2.2) and the implied linear form of $\mu(X)$ to give conditions that are necessary as well as sufficient for identification.

3. Identification Analysis
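To fix ideas before the formal analysis, the following Python sketch simulates the binary-treatment special case of (2.1) and recovers $E[q(V)]$ from the control regression function (2.2) by fitting $Y$ on $p(X) = (1, X)'$ within each value of $V$ and averaging over the distribution of $V$. The data-generating process (a three-valued control, the propensity values 0.2, 0.5, 0.8, normal disturbances) and the function names `simulate` and `asf_coefficients` are assumptions made for illustration only; they are not part of the paper, and the paper itself proposes no estimator.

```python
import numpy as np

def simulate(n, seed=0):
    """Hypothetical design for Y = eps1 + eps2 * X with a discrete control V."""
    rng = np.random.default_rng(seed)
    v = rng.integers(0, 3, size=n)                 # control V in {0, 1, 2}
    # Heterogeneous coefficients whose conditional means depend on V.
    eps1 = 1.0 + v + rng.normal(size=n)            # outcome without treatment
    eps2 = 2.0 - 0.5 * v + rng.normal(size=n)      # unit-level treatment effect
    # Treatment depends on V only, so E[eps | X, V] = E[eps | V] (Assumption 1).
    pscore = np.array([0.2, 0.5, 0.8])[v]
    x = (rng.random(n) < pscore).astype(float)
    y = eps1 + eps2 * x
    return y, x, v

def asf_coefficients(y, x, v):
    """Estimate E[q(V)] by averaging within-V least squares fits of Y on (1, X)."""
    total = np.zeros(2)
    for value in np.unique(v):
        cell = v == value
        p = np.column_stack([np.ones(cell.sum()), x[cell]])
        q_hat, *_ = np.linalg.lstsq(p, y[cell], rcond=None)  # estimate of q(value)
        total += cell.mean() * q_hat                          # weight by Pr[V = value]
    return total

y, x, v = simulate(200_000)
coef = asf_coefficients(y, x, v)          # estimates E[eps] = (2.0, 1.5) in this design
ate_hat = coef[1]                         # estimate of mu(1) - mu(0)
naive = y[x == 1].mean() - y[x == 0].mean()  # ignores V, so it is confounded here
print(coef, ate_hat, naive)
```

In this design the within-cell fits are well defined because the propensity score lies strictly between zero and one in every cell, so the conditional second moment matrix of $(1, X)'$ given $V$ is nonsingular. The raw difference in means mixes the treatment effect with the dependence of both $X$ and $\varepsilon$ on $V$.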

A sufficient condition for identification of the average structural function is nonsingularity of the second moment matrix of the treatment dummies given the controls, $E[p(X)p(X)' \mid V]$, with probability one. Under the maintained assumption that $E[p(X)p(X)']$ is nonsingular, this condition is also necessary. Theorem 1 states our first main result. The proofs of all formal results are given in the Appendix.

Theorem 1. Suppose that $E[\|\varepsilon\|^2] < \infty$, $E[p(X)p(X)']$ is nonsingular, and Assumption 1 holds. Then: $E[p(X)p(X)' \mid V]$ is nonsingular with probability one if, and only if, $\mu(X)$ is identified.

When $X \in \{0, 1\}$ and $p(X) = (1, X)'$, the identification condition becomes the standard condition for the treatment effect model

$Y = \varepsilon_1 + \varepsilon_2 X, \quad E[\varepsilon \mid X, V] = E[\varepsilon \mid V], \quad \varepsilon \equiv (\varepsilon_1, \varepsilon_2)'.$

The identification condition is that the conditional second moment matrix of $(1, X)'$ given $V$ is nonsingular with probability one, which is the same as

(3.1)  $\mathrm{var}(X \mid V) = P(V)[1 - P(V)] > 0, \quad P(V) \equiv \Pr[X = 1 \mid V],$

with probability one, where $P(V)$ is the propensity score. Here we can see that the identification condition is the same as $0 < P(V) < 1$ with probability one, which is the standard identification condition.

With multiple treatments, because $p(X)$ includes an intercept, the identification condition is the same as nonsingularity of the variance matrix $\mathrm{var}(X \mid V)$ with probability one. This result generalizes (3.1).

Theorem 2. $E[p(X)p(X)' \mid V]$ is nonsingular with probability one if, and only if, the variance matrix $\mathrm{var}(X \mid V)$ is nonsingular with probability one.

Considerable simplification occurs with mutually exclusive treatments, which allows for the formulation of an equivalent condition for nonsingularity of $E[p(X)p(X)' \mid V]$ solely in terms of the generalized propensity scores (Imbens, 2000). This result generalizes the standard identification condition for binary $X$.

Theorem 3. Suppose that $\Pr[X(t) = 1 \mid V] > 0$ for each $t \in \mathcal{T}$ with probability one. With mutually exclusive treatments, $E[p(X)p(X)' \mid V]$ is nonsingular with probability one if, and only if,

(3.2)  $\sum_{s=1}^{T} \Pr[X(s) = 1 \mid V] < 1,$

with probability one.

4. Discussion
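Condition (3.2) of Theorem 3 can be illustrated numerically at a fixed value of the controls. The Python sketch below builds $E[p(X)p(X)' \mid V]$ from a vector of generalized propensity scores, using the fact that mutually exclusive dummies satisfy $X(s)X(t) = 0$ for $s \neq t$ and $X(t)^2 = X(t)$, and checks nonsingularity through the smallest eigenvalue. The function names and the numerical tolerance are illustrative assumptions, not part of the paper.

```python
import numpy as np

def second_moment_matrix(pscores):
    """E[p(X)p(X)' | V] for mutually exclusive dummies with Pr[X(t)=1 | V] = pscores[t]."""
    pi = np.asarray(pscores, dtype=float)
    top = np.concatenate([[1.0], pi])               # first row: (1, E[X' | V])
    bottom = np.column_stack([pi, np.diag(pi)])     # rows: (E[X | V], diag(E[X | V]))
    return np.vstack([top, bottom])

def is_nonsingular(pscores, tol=1e-10):
    """True when the smallest eigenvalue of the symmetric matrix exceeds tol."""
    return bool(np.linalg.eigvalsh(second_moment_matrix(pscores)).min() > tol)

# Scores sum to 0.5 < 1: some units receive no treatment, and the matrix is nonsingular.
print(is_nonsingular([0.2, 0.3]))
# Scores sum to exactly 1: the dummies add up to the intercept, so the matrix is singular.
print(is_nonsingular([0.4, 0.6]))
```

When the scores sum to one, every unit receives some treatment, the dummies are perfectly collinear with the constant, and the baseline outcome cannot be separated from the treatment effects at that value of $V$.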

The heterogeneous coefficients formulation we propose for multiple treatment effects reveals the central role of the conditional nonsingularity condition for identification. Because this condition is in principle testable, establishing that it is also necessary demonstrates testability of identification (e.g., Breusch, 1986). With mutually exclusive treatments, the formulation of the equivalent condition (3.2) thus relates testability of identification to the generalized propensity scores. This is a generalization of the relationship between testability of identification and the propensity score in the binary treatment case.

Conditions that are both necessary and sufficient are also important for the determination of minimal conditions for identification. In unpublished work Wooldridge (2004) considers a restricted version of our model with $E[X \mid \varepsilon, V] = E[X \mid V]$ and $E[p(X)p(X)' \mid \varepsilon, V] = E[p(X)p(X)' \mid V]$, and shows that $q(V)$ is identified if $E[p(X)p(X)' \mid V]$ is invertible. The additional conditional second moments assumption implies that his identification condition differs from ours. Thus his result and proof do not apply in our setting, which only assumes conditional mean independence $E[\varepsilon \mid X, V] = E[\varepsilon \mid V]$, and our results show that conditional second moments independence is not necessary for identification in multiple treatment effect models.

Graham and Pinto (2018) consider a related approach in work independent of the first version of this paper (Newey and Stouli, 2018) where the identification result (Lemma 1 in the Appendix) was derived. The conditional nonsingularity condition we propose is weaker than their identification condition, and we study necessity as well as sufficiency for identification of average treatment effects.

The identification results we obtain here are of general interest for the vast treatment effects literature (e.g., Imbens, 2004, Imbens and Wooldridge, 2009, and Athey and Imbens, 2017, for reviews) and complement existing results on identification of treatment effects.

Appendix A. Proofs

A.1. Preliminary result.

Lemma 1. Suppose that $E[\|\varepsilon\|^2] < \infty$ and Assumption 1 holds. If $E[p(X)p(X)' \mid V]$ is nonsingular with probability one then $q(V)$ is identified.

Proof. Let $\lambda_{\min}(V)$ denote the smallest eigenvalue of $E[p(X)p(X)' \mid V]$. Suppose that $\bar{q}(V) \neq q(V)$ with positive probability on a set $\tilde{\mathcal{V}}$, and note that $\lambda_{\min}(V) > 0$ on $\mathcal{V}$ by assumption. Then

$E\left[\{p(X)'\{\bar{q}(V) - q(V)\}\}^2\right] = E\left[\{\bar{q}(V) - q(V)\}' E[p(X)p(X)' \mid V] \{\bar{q}(V) - q(V)\}\right]$
$\geq E\left[\|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)\right]$
$\geq E\left[1(V \in \mathcal{V} \cap \tilde{\mathcal{V}}) \|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)\right].$

By definition $\Pr[\tilde{\mathcal{V}}] > 0$ and $\tilde{\mathcal{V}} \subseteq \mathcal{V}$ so that $\tilde{\mathcal{V}} \cap \mathcal{V} = \tilde{\mathcal{V}}$. Thus the fact that $\|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)$ is positive on $\tilde{\mathcal{V}} \cap \mathcal{V}$ implies

$E\left[1(V \in \mathcal{V} \cap \tilde{\mathcal{V}}) \|\bar{q}(V) - q(V)\|^2 \lambda_{\min}(V)\right] > 0.$

We have shown that, for $\bar{q}(V) \neq q(V)$ with positive probability on a set $\tilde{\mathcal{V}}$,

$E\left[\{p(X)'\{\bar{q}(V) - q(V)\}\}^2\right] > 0,$

which implies $p(X)'\bar{q}(V) \neq p(X)'q(V)$. Therefore, $q(V)$ is identified from $E[Y \mid X, V]$. □

A.2. Proof of Theorem 1.

We first show that nonsingularity of $E[p(X)p(X)' \mid V]$ with probability one implies identification of $\mu(X)$. By Lemma 1, if $E[p(X)p(X)' \mid V]$ is nonsingular with probability one then $q(V)$ is identified, and hence $E[q(V)]$ also is. By $p(X)$ being a known function, $p(X)'E[q(V)] = \mu(X)$ is identified.

We now establish that nonsingularity of $E[p(X)p(X)' \mid V]$ with probability one is necessary for identification of $\mu(X)$. It suffices to show that singularity of $E[p(X)p(X)' \mid V]$ with positive probability implies that $\mu(X)$ is not identified, i.e., there exists an observationally equivalent $\bar{q}(V) \neq q(V)$ with positive probability such that $p(X)'E[\bar{q}(V)] \neq p(X)'E[q(V)]$ with positive probability. By nonsingularity of $E[p(X)p(X)']$ and linearity of $\mu(X)$, the conclusion holds if, and only if, there exists an observationally equivalent $\bar{q}(V) \neq q(V)$ with positive probability such that $E[\bar{q}(V)] \neq E[q(V)]$.

Suppose that $E[p(X)p(X)' \mid V]$ is singular with positive probability and let $\Delta(V)$ be such that $E[p(X)p(X)' \mid V]\Delta(V) = 0$. We have that $\Delta(V) \neq 0$ on a set $\tilde{\mathcal{V}}$ with $\Pr[\tilde{\mathcal{V}}] > 0$. For $J = T + 1$, define

$\tilde{\mathcal{V}}_j = \{v \in \tilde{\mathcal{V}} : \Delta_j(v) \neq 0\}, \quad j \in \{1, \ldots, J\}.$

Then $\cup_{j=1}^{J} \tilde{\mathcal{V}}_j = \{v \in \tilde{\mathcal{V}} : \Delta(v) \neq 0\} = \tilde{\mathcal{V}}$. Hence

$0 < \Pr[\tilde{\mathcal{V}}] = \Pr[\cup_{j=1}^{J} \tilde{\mathcal{V}}_j] \leq \sum_{j=1}^{J} \Pr[\tilde{\mathcal{V}}_j],$

which implies that $\Pr[\tilde{\mathcal{V}}_{j^*}] > 0$ for some $j^* \in \{1, \ldots, J\}$.

Set $\tilde{\Delta}(v) = \Delta(v)$ for $v \in \tilde{\mathcal{V}}_{j^*}$, and $\tilde{\Delta}(v) = 0$ otherwise. By construction $\tilde{\Delta}_{j^*}(V) \neq 0$ on $\tilde{\mathcal{V}}_{j^*}$, and letting

$\tilde{\tilde{\Delta}}(V) = \mathrm{sign}\{\tilde{\Delta}_{j^*}(V)\} \frac{\tilde{\Delta}(V)}{\|\tilde{\Delta}(V)\|},$

we have that $\tilde{\tilde{\Delta}}_{j^*}(V) > 0$ and $\|\tilde{\tilde{\Delta}}(V)\| = 1$ on $\tilde{\mathcal{V}}_{j^*}$, and hence $E[\|\tilde{\tilde{\Delta}}(V)\|^2] < \infty$ and $E[\tilde{\tilde{\Delta}}_{j^*}(V)] \neq 0$. Therefore $E[\tilde{\tilde{\Delta}}(V)] \neq 0$, which implies that $E[q(V) + \tilde{\tilde{\Delta}}(V)] \neq E[q(V)]$. The result follows. □

A.3. Proof of Theorem 2.

Proof of Theorem 3.

For a vector $w \in \mathbb{R}^T$, let $\mathrm{diag}(w)$ denote the $T \times T$ diagonal matrix with diagonal elements $w_1, \ldots, w_T$. For mutually exclusive treatments, the matrix $E[p(X)p(X)' \mid V]$ is of the form

(A.2)  $E[p(X)p(X)' \mid V] = \begin{bmatrix} 1 & E[X' \mid V] \\ E[X \mid V] & \mathrm{diag}(E[X \mid V]) \end{bmatrix}.$

The matrix $\mathrm{diag}(E[X \mid V])$ has diagonal elements $E[X(t) \mid V] = \Pr[X(t) = 1 \mid V] > 0$, for each $t \in \mathcal{T}$, and hence is positive definite. Therefore, $E[p(X)p(X)' \mid V]$ is positive definite if, and only if, the Schur complement of $\mathrm{diag}(E[X \mid V])$ in (A.2) is positive definite (Boyd and Vandenberghe, 2004, Appendix A.5.5), i.e., if, and only if,

$0 < 1 - E[X' \mid V]\,\mathrm{diag}(E[X \mid V])^{-1} E[X \mid V] = 1 - \sum_{s=1}^{T} E[X(s) \mid V] = 1 - \sum_{s=1}^{T} \Pr[X(s) = 1 \mid V],$

with probability one, as claimed. □

References

Athey, S. and Imbens, G. (2017). The state of applied econometrics: causality and policy evaluation. Journal of Economic Perspectives. 31, 3–32.

Blundell, R. and Powell, J. L. (2003). Endogeneity in nonparametric and semiparametric regression models. Econometric Society Monographs. 36, 312–357.

Boyd, S. P. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Breusch, T. S. (1986). Hypothesis testing in unidentified models. Review of Economic Studies. 53, 635–651.

Cattaneo, M. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics.

Graham, B. S. and Pinto, C. C. D. X. (2018). Semiparametrically efficient estimation of the average linear regression function. eprint arXiv:1810.12511.

Heckman, J. J., Ichimura, H., Smith, J. and Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica. 66, 1017–1098.

Imai, K. and van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association. 99, 854–866.

Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika. 87, 706–710.

Imbens, G. (2004). Nonparametric estimation of average treatment effects under exogeneity. Review of Economics and Statistics. 86, 4–29.

Imbens, G. and Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature. 47, 5–86.

Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In Econometric evaluation of labour market policies in Europe, 43–58. Physica, Heidelberg.

Newey, W. K. and Stouli, S. (2018). Heterogenous coefficients, discrete instruments, and identification of treatment effects. eprint arXiv:1811.09837.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika. 70, 41–55.

Wooldridge, J. M. (2004). Estimating average partial effects under conditional moment independence assumptions. Cemmap working paper CWP03/04.

Wooldridge, J. M. (2005). Unobserved heterogeneity and the estimation of average partial effects. In Identification and inference for econometric models, pp. 27–55.