Identifying the Effect of a Mis-classified, Binary, Endogenous Regressor ∗

Francis J. DiTraglia and Camilo García-Jimeno
Final Version: January 23, 2019
Abstract
This paper studies identification of the effect of a mis-classified, binary, endogenous regressor when a discrete-valued instrumental variable is available. We begin by showing that the only existing point identification result for this model is incorrect. We go on to derive the sharp identified set under mean independence assumptions for the instrument and measurement error. The resulting bounds are novel and informative, but fail to point identify the effect of interest. This motivates us to consider alternative and slightly stronger assumptions: we show that adding second and third moment independence assumptions suffices to identify the model.
Keywords:
Instrumental variables, Measurement error, Endogeneity
JEL Codes:
C10, C25, C26

∗ We thank Daron Acemoglu, Manuel Arellano, Kristy Buzard, Xu Cheng, Bernardo da Silveira, Bo Honoré, Arthur Lewbel, Chuck Manski, Sophocles Mavroeidis, Francesca Molinari, Yuya Takahashi, the associate editor, two anonymous referees, and seminar participants at Cambridge, CEMFI, Chicago Booth, Manchester, Northwestern, Oxford, Penn State, Princeton, UCL, the 2016 Greater New York Area Econometrics Colloquium, Camp Econometrics IX, and the 2017 North American Summer Meeting of the Econometric Society for valuable comments and suggestions. This document supersedes an earlier version entitled "On Mis-measured Binary Regressors: New Results and Some Comments on the Literature."
1 Introduction
Measurement error and endogeneity are pervasive features of economic data. Conveniently, a valid instrumental variable corrects for both problems when the measurement error is classical, i.e. uncorrelated with the true value of the regressor. Many regressors of interest in applied work, however, are binary and thus cannot be subject to classical measurement error: the only way to mis-classify a true one is downwards, as a zero, while the only way to mis-classify a true zero is upwards, as a one, creating negative dependence between the truth and the measurement error. When faced with non-classical measurement error, the instrumental variables estimator can be severely biased. In this paper, we study an additively separable model of the form

y = c(x) + β(x)T* + ε    (1)

where ε is a mean-zero error term, T* is a binary, potentially endogenous regressor of interest, and x is a vector of exogenous controls. Because T* is binary, there is no loss of generality from writing the model in this form rather than the more familiar y = h(T*, x) + ε: simply define β(x) = h(1, x) − h(0, x) and c(x) = h(0, x). We ask whether, and if so under what conditions, a discrete instrumental variable z suffices to non-parametrically identify the causal effect β(x) of T*, when we observe not T* but a mis-classified binary surrogate T.

We proceed under the assumption of non-differential measurement error. This condition has been widely used in the existing literature and imposes that T provides no additional information beyond that contained in (T*, x). Even in this fairly standard setting, identification remains an open question: we begin by showing that the only existing identification result for this model is incorrect. We then go on to derive the sharp identified set under the standard first-moment assumptions from the related literature. We show that regardless of the number of values that z takes on, the model is not point identified. This motivates us to consider alternative, and slightly stronger, assumptions. We show that, given a binary instrument, the addition of a second moment independence assumption suffices to identify a model with one-sided mis-classification. Adding a second moment restriction on the measurement error along with a third moment independence assumption for the instrument suffices to identify the model in general. This result likewise requires only a binary z.

Our work relates to a large literature that considers departures from classical measurement error by allowing the measurement error to be related to the true value of the unobserved regressor. Chen et al. (2005) obtain identification in a general class of moment condition models with mis-measured data by relying on the existence of an auxiliary dataset from which they can estimate the measurement error process. In contrast, Hu and Shennach (2008) and Song (2015) rely on an instrumental variable and an additional conditional location assumption on the measurement error distribution. More recently, Hu et al. (2015) use a continuous instrument to identify the ratio of partial effects of two continuous regressors, one measured with error, in a linear single index model. Unfortunately, these approaches cannot be applied to the case of a mis-measured binary regressor.

A number of papers have studied models with an exogenous binary regressor subject to non-differential measurement error. One group of papers asks what can be learned without recourse to an instrumental variable. An early contribution by Aigner (1973) characterizes the asymptotic bias of OLS in this setting, and proposes a correction using outside information. Bollinger (1996) derives bounds for the mean regression of y given a mis-measured binary regressor T*. Black et al. (2000) and Kane et al. (1999) consider a linear model and show that when two alternative measures T_1 and T_2 of T* are available, a non-linear GMM estimator can be used to recover the effect of interest. Subsequently, Frazis and Loewenstein (2003) note that an instrumental variable can take the place of one of the measures. Mahajan (2006) extends the results of Black et al. (2000) and Kane et al. (1999) to a more general setting using a binary instrument in place of one of the treatment measures, establishing non-parametric identification of the conditional mean function. When T* is in fact exogenous, this coincides with the causal effect. Hu (2008) derives related results when the mis-classified discrete regressor may take on more than two values. Lewbel (2007a) provides an identification result for the same model as Mahajan (2006) under different assumptions. In particular, his "instrument-like variable" need not satisfy the usual exclusion restriction so long as it does not interact with T* and takes on three or more values.

Much less is known about the case in which a binary, or discrete, regressor is not only mis-classified but endogenous. The first paper to provide a formal result for this case is Mahajan (2006). He extends his main result to the case of an endogenous treatment, providing an explicit proof of identification under the usual IV assumption in a model with additively separable errors. As we show below, however, this result is false; Appendix B provides a detailed explanation of the error in Mahajan's proof. Several more recent papers also consider the case of a mis-classified, endogenous, binary regressor. Kreider et al. (2012) partially identify the effects of food stamps on health outcomes of children under weak measurement error assumptions by relying on auxiliary data. Similarly, Battistin et al. (2014) study the returns to schooling in a setting with multiple mis-reported measures of educational qualifications. Unlike these two papers, our approach does not depend on the availability of auxiliary data. In a different vein, Shiu (2016) uses an exclusion restriction for the participation equation and an additional valid instrument to identify the effect of a discrete, mis-classified endogenous regressor in a semi-parametric selection model. Similarly, Nguimkeu et al. (2016) use exclusion restrictions for both the participation equation and the measurement error equation to identify a parametric model with endogenous participation and one-sided endogenous mis-reporting. Unlike those of the preceding two papers, our results rely neither on parametric assumptions nor additional exclusion restrictions. Other than Mahajan (2006), the paper most closely related to our own is that of Ura (2018), who derives partial identification results for a local average treatment effect without the non-differential assumption. In contrast, we study an additively separable model under non-differential measurement error and derive both partial and point identification results.

The remainder of the paper is organized as follows. Section 2.1 describes our model and assumptions, Section 2.2 relates our results to existing work, and Sections 2.3–2.4 present our partial and point identification results.
2.1 Model and Assumptions

As defined in the preceding section, our model is y = c(x) + β(x)T* + ε, where ε is a mean-zero error term, and the parameter of interest is β(x) – the effect of an unobserved, binary, endogenous regressor T*. Suppose we observe a valid and relevant binary instrument z. In the discussion following Corollary 2.2 below, we explain how these results generalize to the case of an arbitrary discrete-valued instrument. We assume that the model and instrument satisfy the following conditions:

Assumption 2.1. (i) y = c(x) + β(x)T* + ε where T* ∈ {0, 1} and E[ε] = 0; (ii) z ∈ {0, 1}, where 0 < P(z = 1 | x) < 1, and P(T* = 1 | x, z = 1) ≠ P(T* = 1 | x, z = 0); (iii) E[ε | x, z] = 0.

Assumption 2.1 (i) is a restatement of the additively separable model from Equation 1, which includes as a special case the linear model y = c + βT* + x′γ + ε that is pervasive in empirical economics. Assumptions 2.1 (ii) and (iii) are the textbook instrumental variable relevance and validity conditions, respectively. Although it involves T*, Assumption 2.1 (ii) is testable: see the discussion following Lemma 2.1. Under Assumption 2.1, the Wald estimator

[E(y | z = 1, x) − E(y | z = 0, x)] / [E(T* | z = 1, x) − E(T* | z = 0, x)]

identifies β(x). Unfortunately this estimator is infeasible, as we observe not T* but a mis-classified binary surrogate T. To make further progress, we must impose conditions on the process that generates T. Accordingly, define the following mis-classification probabilities:

α_0(x, z) = P(T = 1 | T* = 0, x, z),  α_0(x) = P(T = 1 | T* = 0, x),
α_1(x, z) = P(T = 0 | T* = 1, x, z),  α_1(x) = P(T = 0 | T* = 1, x).

Assumption 2.2. (i) α_0(x, z) = α_0(x) and α_1(x, z) = α_1(x); (ii) α_0(x) + α_1(x) < 1; (iii) E[ε | x, z, T*, T] = E[ε | x, z, T*].

Assumption 2.2 (i) states that the mis-classification probabilities do not depend on the instrument z. Assumption 2.2 (ii) restricts the extent of mis-classification and is equivalent to requiring that T and T* be positively correlated. Assumption 2.2 (iii) is often referred to as "non-differential measurement error." Intuitively, it maintains that T provides no additional information about ε, and hence y, given knowledge of (T*, z, x). While Assumption 2.2 (ii) is quite mild, Assumptions 2.2 (i) and (iii) are more restrictive, as discussed by Bound et al. (2001). To take a specific example, suppose that y is log wage and T* is an indicator for college completion. If T is a potentially erroneous measure of college completion taken from a university's administrative records, then the assumption of non-differential measurement error is quite plausible. If, on the other hand, T is a self-report of college completion and there are "returns to lying" about college completion, i.e. employers only imperfectly observe worker ability, this assumption is less plausible; see Hu and Lewbel (2012) for a proposal to estimate the "returns to lying" in this context. Note, however, that our assumptions on the mis-classification process are conditional on x: we place no restrictions on the relationship between observed covariates and the mis-classification errors. In contrast, Bound et al. (2001) consider unconditional versions of our Assumption 2.2. Instrument validity – Assumption 2.1 (iii) – is more plausible after conditioning on a rich set of exogenous controls, and the same is true of our mis-classification assumptions. For more discussion of settings in which the assumption of non-differential measurement error is warranted, see Carroll et al. (2006).
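To make the roles of Assumptions 2.1–2.2 concrete, the following simulation is a minimal sketch (in Python; the design, parameter values, and variable names are purely illustrative and not taken from the paper). It generates data with an endogenous T* and non-differential mis-classification, then contrasts the infeasible Wald estimator based on T* with its feasible counterpart based on T:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
c, beta = 1.0, 2.0            # intercept and causal effect (covariates suppressed)
alpha0, alpha1 = 0.1, 0.2     # P(T=1 | T*=0) and P(T=0 | T*=1)

z = rng.binomial(1, 0.5, n)                       # binary instrument
u = rng.normal(0, 1, n)                           # unobserved confounder
t_star = (0.3 * z + 0.5 * u + rng.normal(0, 1, n) > 0.5).astype(int)
eps = u + rng.normal(0, 1, n)                     # endogeneity: eps correlated with T*
y = c + beta * t_star + eps                       # but E[eps | z] = 0, so z is valid

# Non-differential mis-classification, independent of z given T* (Assumption 2.2)
t_obs = np.where(t_star == 1,
                 1 - rng.binomial(1, alpha1, n),  # true ones flipped down w.p. alpha1
                 rng.binomial(1, alpha0, n))      # true zeros flipped up w.p. alpha0

def wald(y, t, z):
    return (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

print("infeasible Wald (T*):", wald(y, t_star, z))  # close to beta = 2
print("feasible Wald   (T): ", wald(y, t_obs, z))   # close to beta/(1 - a0 - a1) = 2.857
```

The feasible estimator is inflated by the factor 1/(1 − α_0 − α_1), exactly the distortion formalized in Lemma 2.2 below.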
2.2 Existing Results

Existing results from the literature – see for example Frazis and Loewenstein (2003) and Mahajan (2006) – establish that β(x) is point identified if Assumptions 2.1–2.2 are augmented to include the following condition:

Assumption 2.3 (Joint Exogeneity). E[ε | x, z, T*] = 0.

Assumption 2.3 strengthens the mean independence condition from Assumption 2.1 (iii) to hold jointly for T* and z. By iterated expectations, this implies that T* is exogenous, i.e. E[ε | x, T*] = 0. If T* is endogenous, Assumption 2.3 clearly fails. Mahajan (2006) argues, however, that the following restriction, along with our Assumptions 2.1–2.2, suffices to identify β(x) when T* may be endogenous:

Assumption 2.4 (Mahajan (2006) Equation 11). E[ε | x, z, T*, T] = E[ε | x, T*].

Assumption 2.4 does not require E[ε | x, T*] to be zero, but maintains that it does not vary with z. We show in Appendix B, however, that under Assumptions 2.1–2.2, Assumption 2.4 can only hold if T* is exogenous. If z is a valid instrument and T* is endogenous, then Assumption 2.4 implies that there is no first-stage relationship between z and T*. As such, identification in the case where T* is endogenous is an open question.

2.3 Partial Identification

In this section we derive the sharp identified set under Assumptions 2.1–2.2 and show that β(x) is not point identified. For a discussion of how our partial identification results can be interpreted in a local average treatment effects (LATE) setting, see Appendix C. To simplify the notation, define the following shorthand for the unobserved and observed first stage probabilities:

p*_k(x) = P(T* = 1 | x, z = k),  p_k(x) = P(T = 1 | x, z = k).    (2)

We first state two lemmas that will be used repeatedly below.

Lemma 2.1.
Under Assumption 2.2 (i),

[1 − α_0(x) − α_1(x)] p*_k(x) = p_k(x) − α_0(x)
[1 − α_0(x) − α_1(x)] [1 − p*_k(x)] = 1 − p_k(x) − α_1(x)

where the first-stage probabilities p*_k(x) and p_k(x) are as defined in Equation 2.
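Since Lemma 2.1 is an application of the law of total probability, it can be checked by direct arithmetic. A minimal sketch, with arbitrary illustrative parameter values:

```python
alpha0, alpha1, p_star = 0.1, 0.2, 0.6   # any values with alpha0 + alpha1 < 1

# Law of total probability: P(T=1 | z=k) = alpha0*(1 - p*_k) + (1 - alpha1)*p*_k
p = alpha0 * (1 - p_star) + (1 - alpha1) * p_star

assert abs((1 - alpha0 - alpha1) * p_star - (p - alpha0)) < 1e-12
assert abs((1 - alpha0 - alpha1) * (1 - p_star) - (1 - p - alpha1)) < 1e-12

# Inverting the first identity recovers the unobserved first stage:
print((p - alpha0) / (1 - alpha0 - alpha1))   # = 0.6 = p_star
```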
Lemma 2.2. Under Assumptions 2.1 and 2.2 (i)–(ii),

β(x) Cov(z, T | x) = [1 − α_0(x) − α_1(x)] Cov(y, z | x).

Lemma 2.1 relates the observed first-stage probabilities p_k(x) to their unobserved counterparts p*_k(x) in terms of the mis-classification probabilities α_0(x) and α_1(x). By Assumption 2.2 (ii), 1 − α_0(x) − α_1(x) > 0, so the Lemma can be solved for p*_k(x) in terms of α_0(x), α_1(x), and the observed first-stage probabilities. Moreover, by taking differences evaluated at k = 1 and k = 0, this Lemma shows that p*_1(x) = p*_0(x) if and only if p_1(x) = p_0(x). In other words, Assumption 2.1 (ii) is testable under Assumption 2.2 (ii). Lemma 2.2 relates the instrumental variables (IV) estimand, Cov(y, z | x) / Cov(z, T | x), to the mis-classification probabilities. Since 0 < 1 − α_0(x) − α_1(x) < 1, the IV estimand is biased away from zero in the presence of mis-classification. Together these lemmas bound the causal effect of interest: β(x) lies between the reduced form and IV estimands. Without Assumption 2.2 (iii), non-differential measurement error, these bounds are sharp.

Theorem 2.1.
Under Assumptions 2.1 and 2.2 (i)–(ii), α_0(x) ≤ p_k(x) ≤ 1 − α_1(x) for k = 0, 1, and

E[y | x, z = k] = c(x) + β(x) [p_k(x) − α_0(x)] / [1 − α_0(x) − α_1(x)].    (3)

Provided that p_0(x) ≠ p_1(x), these expressions characterize the sharp identified set for c(x), β(x), α_0(x), and α_1(x).
Corollary 2.1. Under the conditions of Theorem 2.1, the sharp identified set for β(x) is the closed interval between the reduced form estimand Cov(y, z | x) / Var(z | x) and the IV estimand Cov(y, z | x) / Cov(z, T | x).

Corollary 2.1 follows by taking differences of the expression for E[y | x, z = k] across k = 1 and k = 0, and substituting the maximum and minimum values of α_0(x) + α_1(x) consistent with the observed first-stage probabilities. If a priori restrictions on α_0 and α_1 are available, e.g. α_0 = 0, α_1 = 0, or α_0 = α_1, these bounds can be improved; for more discussion, see Corollary 2.2 of DiTraglia and García-Jimeno (2017). Note that the only role of the condition p_0(x) ≠ p_1(x) in the preceding two results is to ensure that it is possible to satisfy Assumption 2.1 (ii). Frazis and Loewenstein (2003) point out that the IV estimand provides an upper bound for β(x), and Lemmas 2.1–2.2 are well-known in the literature (see e.g. Frazis and Loewenstein, 2003; Mahajan, 2006). Nevertheless, we are unaware of any published result that explicitly states both bounds from Corollary 2.1 or proves that they are sharp under Assumptions 2.1 and 2.2 (i)–(ii); the only exception is the incorrect result of Mahajan (2006) described in Section 2.2 and Appendix B.

Neither Theorem 2.1 nor Corollary 2.1 imposes Assumption 2.2 (iii) – non-differential measurement error. While this assumption plays an important role in existing identification results for an exogenous T* (see Section 2.2), its identifying power under endogeneity has not been addressed in the literature. We now show that this assumption in general yields further restrictions on the probabilities α_0(x) and α_1(x), but fails to point identify β(x). To simplify the proof of sharpness, we assume that y is continuously distributed, which is natural in an additively separable model. Without this assumption, the bounds that we derive are still valid, but may not be sharp. Nevertheless, the reasoning from our proof can be generalized to cases in which y does not have a continuous support set.
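In practice the two bounds of Corollary 2.1 are sample analogues of simple covariance ratios. A hedged sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def beta_bounds(y, t, z):
    """Corollary 2.1: the identified set for beta is the closed interval between
    the reduced-form and IV estimands (scalar case, covariates suppressed)."""
    cov = lambda a, b: np.cov(a, b)[0, 1]
    reduced_form = cov(y, z) / cov(z, z)   # Cov(y, z) / Var(z)
    iv = cov(y, z) / cov(z, t)             # Cov(y, z) / Cov(z, T)
    return tuple(sorted([reduced_form, iv]))

# With the simulated data from the earlier sketch (beta = 2, alpha0 + alpha1 = 0.3),
# beta_bounds(y, t_obs, z) returns an interval of roughly [0.2, 2.9] covering beta.
```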
Theorem 2.2. Suppose that the conditional distribution of y given (x, T, z) is continuous. Further suppose that the conditions of Theorem 2.1 and Assumption 2.2 (iii) hold. For any k such that E[y | x, T = 0, z = k] ≠ E[y | x, T = 1, z = k], let A_k denote the set of pairs (α_0(x), α_1(x)) such that α_0(x) < p_k(x) < 1 − α_1(x) and

μ̲_tk( q̲_tk(α_0(x), α_1(x), x), x ) ≤ μ_1k(α_0(x), x) ≤ μ̄_tk( q̄_tk(α_0(x), α_1(x), x), x )

for all t = 0, 1, where

μ̲_tk(q, x) = E[y | y ≤ q, x, T = t, z = k],  μ̄_tk(q, x) = E[y | y > q, x, T = t, z = k],

μ_1k(α_0(x), x) = { p_k(x) E[y | x, z = k, T = 1] − α_0(x) E[y | x, z = k] } / { p_k(x) − α_0(x) }

and we define

q̲_tk(α_0(x), α_1(x), x) = F⁻¹_tk( r_tk(α_0(x), α_1(x), x) | x )
q̄_tk(α_0(x), α_1(x), x) = F⁻¹_tk( 1 − r_tk(α_0(x), α_1(x), x) | x )

where F⁻¹_tk(· | x) is the conditional quantile function of y given (x, T = t, z = k),

r_1k(α_0(x), α_1(x), x) = [(1 − α_1(x)) / p_k(x)] [ (p_k(x) − α_0(x)) / (1 − α_0(x) − α_1(x)) ]
r_0k(α_0(x), α_1(x), x) = [α_1(x) / (1 − p_k(x))] [ (p_k(x) − α_0(x)) / (1 − α_0(x) − α_1(x)) ]

and p_k(x) is defined in Equation 2. The sharp identified set for c(x), β(x), α_0(x) and α_1(x) is characterized by Equation 3 and (α_0(x), α_1(x)) ∈ A*, where

(i) A* ≡ A_0 ∩ A_1 if E[y | x, T = 0, z = k] ≠ E[y | x, T = 1, z = k] for all k = 0, 1;
(ii) A* ≡ A_k if E[y | x, T = 0, z = k] ≠ E[y | x, T = 1, z = k] and E[y | x, T = 0, z = ℓ] = E[y | x, T = 1, z = ℓ];
(iii) A* ≡ {(α_0(x), α_1(x)) : α_0(x) ≤ p_k(x) ≤ 1 − α_1(x) for all k} if E[y | x, T = 0, z = k] = E[y | x, T = 1, z = k] for all k = 0, 1.

Imposing Assumption 2.2 (iii) strictly improves upon the identified set from Theorem 2.1 unless E[y | x, T = 0, z = k] = E[y | x, T = 1, z = k] for all k. Even if β(x) = 0, the difference of these observable means is generically nonzero. (Suppressing dependence on x, there are only two settings in which E[y | T = 0, z = k] = E[y | T = 1, z = k]: first, if the true value of either α_0 or α_1 lies at the upper boundary of the identified set from Theorem 2.1; second, if β = E[ε | T* = 0, z = k] − E[ε | T* = 1, z = k].) The intuition for Theorem 2.2 is as follows. For simplicity, suppress dependence on x. Now, fix (T = t, z = k) and (α_0, α_1). The observed distribution of y given (T = t, z = k), call it F_tk, is a mixture of two unobserved distributions: the distribution of y given (T = t, z = k, T* = 1), call it F_{1tk}, and the distribution of y given (T = t, z = k, T* = 0), call it F_{0tk}. The mixing probabilities are r_tk and 1 − r_tk from the statement of Theorem 2.2 and are fully determined by (α_0, α_1) and p_k. Assumptions 2.1 (i) and 2.2 (iii) imply that the unobserved means E[y | T*, T, z] are fully determined by (α_0, α_1) given the observed means E[y | T, z]. The question is whether it is possible, given the observed distribution F_tk, to construct F_{1tk} and F_{0tk} with the required values for E[y | T*, T, z] such that F_tk = r_tk F_{1tk} + (1 − r_tk) F_{0tk} for all combinations (t, k). If not, then (α_0, α_1) does not belong to the identified set. Our proof provides necessary and sufficient conditions for such a mixture to exist at a given point (α_0, α_1). We can then appeal to the reasoning from Theorem 2.1 to complete the argument. By ruling out values for α_0 and α_1, Theorem 2.2 restricts β via Lemma 2.2. While these restrictions can be very informative, they do not yield point identification.
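To illustrate how the feasibility check behind Theorem 2.2 can be taken to data, the following sketch (our own illustration, not the paper's estimator; covariates suppressed, interior cases 0 < r_tk < 1 only, and it assumes every (T, z) cell is populated) grid-searches over (α_0, α_1), computes the implied mean of y given T* = 1, and checks it against the truncated-mean bounds, keeping the implied β from Lemma 2.2 for every surviving pair:

```python
import numpy as np

def identified_set(y, t, z, grid=np.linspace(0.0, 0.45, 46)):
    """(alpha0, alpha1) survives if the implied E[y | T*=1, z=k] lies between
    the means of the bottom and top r_tk fractions of each (T=t, z=k) cell."""
    cov = lambda a, b: np.cov(a, b)[0, 1]
    iv = cov(y, z) / cov(z, t)
    p = {k: t[z == k].mean() for k in (0, 1)}
    keep = []
    for a0 in grid:
        for a1 in grid:
            if a0 + a1 >= 1:
                continue
            ok = True
            for k in (0, 1):
                if not (a0 < p[k] < 1 - a1):
                    ok = False
                    break
                p_star = (p[k] - a0) / (1 - a0 - a1)
                ey, ey1 = y[z == k].mean(), y[(z == k) & (t == 1)].mean()
                mu1 = (p[k] * ey1 - a0 * ey) / (p[k] - a0)   # E[y | T*=1, z=k]
                for tt, r in ((1, (1 - a1) * p_star / p[k]),
                              (0, a1 * p_star / (1 - p[k]))):
                    if not 0 < r < 1:
                        continue                              # boundary cases unrestricted
                    cell = np.sort(y[(z == k) & (t == tt)])
                    m = max(1, int(round(r * cell.size)))
                    if not cell[:m].mean() <= mu1 <= cell[-m:].mean():
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                keep.append((a0, a1, (1 - a0 - a1) * iv))     # implied beta, Lemma 2.2
    return keep
```

Applied to the simulated data from Section 2.1, the surviving pairs always include (0, 0), consistent with the corollary that follows.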
Corollary 2.2. Under Assumptions 2.1 and 2.2, the identified set for β(x) contains both the IV estimand Cov(y, z | x) / Cov(z, T | x) and the true coefficient β(x).

Corollary 2.2 follows by Lemma 2.2 because α_0(x) = α_1(x) = 0 always belongs to the sharp identified set from Theorem 2.2. Non-differential measurement error cannot exclude the possibility that there is no mis-classification because in this case it is trivial to construct the required mixtures. Although we focus throughout this paper on the case of a binary instrument, one might wonder whether point identification can be achieved by increasing the support of z, perhaps along the lines of Lewbel (2007a). The answer turns out to be no. Suppose that we were to modify Assumptions 2.1 and 2.2 to hold for all values of z in some discrete support set. By Lemma 2.2, a binary instrument identifies β(x) up to knowledge of the mis-classification probabilities α_0(x) and α_1(x). It follows that any pair of values (k, ℓ) in the support set of z identifies the same object. Accordingly, to identify β(x) it is necessary and sufficient to identify the mis-classification probabilities. A binary instrument fails to identify these probabilities because we can never exclude the possibility of zero mis-classification. The same is true of a discrete K-valued instrument. Increasing the support of z does, however, shrink the identified set by increasing the number of restrictions available: in this case Theorems 2.1–2.2 continue to apply replacing "k = 0, 1" with "for all k."
1” with “for all k .” The results of the preceding section establish that β ( x ) is not point identified under As-sumptions 2.1 and 2.2. In light of this, there are two possible ways to proceed: either onecan report partial identification bounds based on our characterization of the sharp identifiedset from Theorem 2.2, or one can attempt to impose stronger assumptions to obtain pointidentification. In this section we consider the second possibility. We begin by defining thefollowing functions of the model parameters: θ ( x ) = β ( x ) [1 − α ( x ) − α ( x )] − (4) θ ( x ) = [ θ ( x )] [1 + α ( x ) − α ( x )] (5) θ ( x ) = [ θ ( x )] (cid:2) { − α ( x ) − α ( x ) } + 6 α ( x ) { − α ( x ) } (cid:3) (6)Now consider the following additional assumption: Assumption 2.5. E [ ε | x , z ] = E [ ε | x ]Assumption 2.5 is a second moment version of the standard mean exclusion restriction forthe instrument z – Assumption 2.1 (iii). It requires that the conditional variance of the errorterm given the covariates x does not depend on z , but does not require homoskedasticitywith respect to x , T ∗ or T . Assumption 2.5 allows us to derive the following lemma: Lemma 2.3.
Assumption 2.5 allows us to derive the following lemma:

Lemma 2.3. Under Assumptions 2.1, 2.2 and 2.5,

Cov(y², z | x) = 2 Cov(yT, z | x) θ_1(x) − Cov(T, z | x) θ_2(x)

where θ_1(x) and θ_2(x) are defined in Equations 4–5.

Lemma 2.2 identifies θ_1(x). Since Cov(z, T | x) ≠ 0 by Assumption 2.1 (ii), we can solve for θ_2(x) in terms of observables only, using Lemma 2.3. Given knowledge of θ_2(x), we can solve Equation 5 for the difference of mis-classification rates so long as β(x) ≠ 0.

Corollary 2.3. Under Assumptions 2.1–2.2 and 2.5, α_0(x) − α_1(x) is identified so long as β(x) ≠ 0.

Thus, under one-sided mis-classification, i.e. if it is known a priori that α_0(x) = 0 or α_1(x) = 0, augmenting our baseline Assumptions 2.1–2.2 with Assumption 2.5 suffices to identify β(x). Notice that β(x) = 0 if and only if θ_1(x) = 0. Thus, β(x) is still identified in the case where Corollary 2.3 fails to apply. Assumption 2.5 does not suffice to identify β(x) without a priori restrictions on the mis-classification error rates. To achieve identification in the general case, we impose the following additional conditions:

Assumption 2.6. (i) E[ε² | x, z, T*, T] = E[ε² | x, z, T*]; (ii) E[ε³ | x, z] = E[ε³ | x].

Assumption 2.6 (i) is a second moment version of the non-differential measurement error assumption, Assumption 2.2 (iii). It requires that, given knowledge of (x, T*, z), T provides no additional information about the variance of the error term. Note that Assumption 2.6 (i) does not require homoskedasticity of ε with respect to x or T*. Assumption 2.6 (ii) is a third moment version of Assumption 2.5. It requires that the conditional third moment of the error term given x does not depend on z. This condition neither requires nor excludes skewness in the error term conditional on covariates: it merely states that the skewness is unaffected by the instrument. While Assumptions 2.5 and 2.6 may appear somewhat unusual, they are implied by the more intuitive independence conditions ε ⊥ z | x and ε ⊥ T | (x, T*, z). Although E[ε² | x, z] = E[ε² | x] and E[ε² | x, z, T*, T] = E[ε² | x, z, T*] are technically weaker than assuming full independence, we would be somewhat dubious of any supposed "natural experiment" that purportedly satisfied mean exclusion but not independence. Indeed, as discussed by Imbens and Rubin (1997), an instrument satisfying mean exclusion but not independence could become invalid if the outcome variable were transformed, for example by taking logs. As it is not uncommon for applied papers to report results in both logs and levels (e.g. Angrist, 1990), our view is that researchers implicitly assume more than mean exclusion in typical applications of instrumental variables. Analogous reasoning applies to the non-differential measurement error assumption. Assumption 2.6 allows us to derive the following Lemma which, combined with Lemma 2.3, leads to point identification:
Lemma 2.4. Under Assumptions 2.1–2.2 and 2.5–2.6,

Cov(y³, z | x) = 3 Cov(y²T, z | x) θ_1(x) − 3 Cov(yT, z | x) θ_2(x) + Cov(T, z | x) θ_3(x)

where θ_1(x), θ_2(x) and θ_3(x) are defined in Equations 4–6.
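Lemmas 2.2–2.4 form a triangular linear system in (θ_1, θ_2, θ_3) that can be solved recursively from sample covariances; inverting Equations 4–6 then recovers the structural parameters, anticipating Theorem 2.3 below. A sketch under our illustrative simulation design (not the paper's estimator, and it will be unstable when β(x) is near zero):

```python
import numpy as np

def point_identify(y, t, z):
    """Solve the system of Lemmas 2.2-2.4 for (theta1, theta2, theta3), then
    invert Equations 4-6 for (beta, alpha0, alpha1) as in Theorem 2.3."""
    cov = lambda a, b: np.cov(a, b)[0, 1]
    pi = cov(t, z)
    eta = [cov(y**j, z) for j in (1, 2, 3)]
    tau = [cov(t * y**j, z) for j in (1, 2)]
    th1 = eta[0] / pi                                            # Lemma 2.2
    th2 = (2 * tau[0] * th1 - eta[1]) / pi                       # Lemma 2.3
    th3 = (eta[2] - 3 * tau[1] * th1 + 3 * tau[0] * th2) / pi    # Lemma 2.4
    beta = np.sign(th1) * np.sqrt(3 * (th2 / th1)**2 - 2 * th3 / th1)
    # The two roots of the quadratic in the proof are alpha0 and 1 - alpha1:
    roots = (th2 + np.array([-1.0, 1.0])
             * np.sqrt(3 * th2**2 - 2 * th1 * th3)) / (2 * th1**2)
    return beta, roots.min(), 1 - roots.max()   # (beta, alpha0, alpha1)

# On the simulated data from Section 2.1, point_identify(y, t_obs, z) is
# approximately (2.0, 0.1, 0.2), matching the design's (beta, alpha0, alpha1).
```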
Theorem 2.3. Under Assumptions 2.1–2.2 and 2.5–2.6, β(x) is identified. If β(x) ≠ 0, then α_0(x) and α_1(x) are likewise identified.

Lemmas 2.2–2.4 yield a linear system of three equations in θ_1(x), θ_2(x) and θ_3(x). Under Assumption 2.1 (ii), the system has a unique solution, so θ_1(x), θ_2(x) and θ_3(x) are identified. The proof of Theorem 2.3 shows that, so long as β(x) ≠ 0, Equations 4–6 can be solved for β(x), α_0(x) and α_1(x). In particular, using steps from the proof of Theorem 2.3,

β(x) = sign[θ_1(x)] √( 3 [θ_2(x)/θ_1(x)]² − 2 [θ_3(x)/θ_1(x)] ).

If we relax Assumption 2.2 (ii) and assume only that α_0(x) + α_1(x) ≠ 1, β(x) is identified only up to sign: in this case the sign of θ_1(x) need not equal that of β(x).

3 Estimation and Inference

We now briefly outline how the identification results from Section 2 can be used to estimate and carry out statistical inference for the parameters of interest: (α_0(x), α_1(x), β(x)). Lemmas 2.2–2.4 yield a system of linear moment equations in the reduced form parameters θ′(x) = (θ_1(x), θ_2(x), θ_3(x)). Defining a vector of intercepts κ′(x) = (κ_1(x), κ_2(x), κ_3(x)), and a vector of observables w′ = (T, y, yT, y², y²T, y³), we can write this system as

E[ {Ψ(θ(x)) w_i − κ(x)} ⊗ (1, z_i)′ | x_i = x ] = 0    (7)

where

Ψ(θ(x)) ≡ [ −θ_1(x)   1   0          0   0          0
             θ_2(x)   0  −2θ_1(x)    1   0          0
            −θ_3(x)   0   3θ_2(x)    0  −3θ_1(x)    1 ].    (8)

Using Equations 4–6, we can re-write Ψ as a function of (α_0(x), α_1(x), β(x)), leaving us with a just-identified, non-parametric conditional moment problem. Because the conditioning variables in Equation 7 are the same as the arguments of the unknown functions (α_0, α_1, β), this problem fits within the framework of Lewbel (2007b), permitting straightforward estimation and inference via a local GMM procedure. If β(x) is close to zero, however, this procedure can perform poorly; in this case the moment conditions from Equation 7 are only weakly informative about α_0(x) and α_1(x). An earlier version of this paper (DiTraglia and García-Jimeno, 2017) discusses this problem in more detail and provides a solution based on generalized moment selection (Andrews and Soares, 2010) that combines the moment inequalities implied by our partial identification results from Section 2.3 with the moment equalities from Equation 7.

4 Conclusion

This paper has studied identification and inference for a mis-classified, binary, endogenous regressor in an additively separable model using a discrete instrumental variable. We have shown that the only existing identification result for this model is incorrect, and gone on to derive the sharp identified set under standard first-moment assumptions from the literature. Strengthening these assumptions to hold for second and third moments, we have established point identification for the effect of interest. An interesting extension of the results presented above would be to consider the case of discrete regressors that take on more than two values.
A Proofs

All of the results in this paper hold x fixed. This allows us to completely ignore the presence of covariates in the proofs that follow. Accordingly we work in terms of scalars α_0, α_1, β, p_k, etc. rather than functions α_0(x), α_1(x), β(x), p_k(x). The former should be understood as the value of the latter evaluated at some particular x.

A.1 Partial Identification Results
Proof of Lemma 2.1.
Follows from a simple calculation using the law of total probability.
Proof of Lemma 2.2.
Immediate, since Cov(z, T) = (1 − α_0 − α_1) Cov(z, T*) by Lemma 2.1.

Proof of Theorem 2.1.
To show that α_0 ≤ p_k ≤ 1 − α_1, substitute p*_k = 0 and p*_k = 1, respectively, into Lemma 2.1 and rearrange. To show that E[y | z = k] = c + β(p_k − α_0)/(1 − α_0 − α_1), take conditional expectations of Equation 1 and apply Assumption 2.1 (iii) and Lemma 2.1. To prove sharpness we need to show that for any (c, β, α_0, α_1) that satisfy α_0 ≤ p_k ≤ 1 − α_1 and E[y | z = k] = c + β(p_k − α_0)/(1 − α_0 − α_1) we can construct a valid joint distribution for (y, T, T*, z) that is compatible with the observed distribution of (y, T, z), provided that p_0 ≠ p_1. To establish this result, we factorize the joint distribution of (y, T, T*, z) into the product of a conditional y | (T, T*, z) and marginal (T, T*, z). The argument proceeds in two steps. Our first step relies on the fact that Assumptions 2.1 (i) and (iii) do not constrain the distribution of (T, T*, z) while 2.1 (ii) and 2.2 (i)–(ii) constrain only the distribution of (T, T*, z). Under these latter three assumptions, we show how to construct a valid joint distribution for (T, T*, z) that is compatible with the observed distribution of (T, z) for any (α_0, α_1) satisfying α_0 ≤ p_k ≤ 1 − α_1. Our second step shows how to construct a valid conditional distribution for y given (T, T*, z) under Assumptions 2.1 (i) and (iii) that is compatible with the observed conditional distribution of y given (T, z) for any (c, β, α_0, α_1) satisfying E[y | z = k] = c + β(p_k − α_0)/(1 − α_0 − α_1). Combining the two steps gives the required joint distribution for (y, T*, T, z).

For the first step, we need to construct a valid joint probability mass function p(T*, T, z) with support set {0, 1} × {0, 1} × {0, 1}. By Assumption 2.2 (i), p(T | T*, z) = p(T | T*) and hence

p(T*, T, z) = p(T | T*) p(T* | z) p(z).

Since p(z) is observed, to construct a valid joint probability mass function p(T*, T, z) it suffices to construct valid conditional probability mass functions p(T | T*) and p(T* | z). Since α_0 ≤ p_k ≤ 1 − α_1, both α_0 and α_1 are guaranteed to lie between zero and one. This gives a valid construction of p(T | T*). Moreover the corresponding values of p*_k implied by Lemma 2.1 are also guaranteed to lie between zero and one. This gives a valid construction of p(T* | z) that satisfies Assumption 2.1 (ii), since p_0 ≠ p_1 by assumption and (p_1 − p_0) = (p*_1 − p*_0)(1 − α_0 − α_1) by Lemma 2.1. Because our construction relies on Lemma 2.1, which is simply an application of the law of total probability, the resulting distribution p(T, T*, z) is automatically compatible with p(T, z) = p(T | z) p(z).

For the second step, we need to construct a valid conditional distribution for y given (T, T*, z). To begin we define the following notation:

r_tk ≡ P(T* = 1 | T = t, z = k)
F_k(τ) ≡ P(y ≤ τ | z = k)
F_tk(τ) ≡ P(y ≤ τ | T = t, z = k)
F_{t*tk}(τ) ≡ P(y ≤ τ | T* = t*, T = t, z = k)
G_k(τ) ≡ P(ε ≤ τ | z = k)
G_{t*tk}(τ) ≡ P(ε ≤ τ | T* = t*, T = t, z = k).

Assumption 2.1 (i) imposes a relationship between G_{t*tk} and F_{t*tk} for each t*, namely

G_{0tk}(τ) = F_{0tk}(τ + c),  G_{1tk}(τ) = F_{1tk}(τ + c + β)    (A.1)

and thus we see that

G_k(τ) = r_{1k} p_k F_{11k}(τ + c + β) + r_{0k}(1 − p_k) F_{10k}(τ + c + β) + (1 − r_{1k}) p_k F_{01k}(τ + c) + (1 − r_{0k})(1 − p_k) F_{00k}(τ + c)    (A.2)

applying the law of total probability and Bayes' rule. Moreover,

F_tk(τ) = r_tk F_{1tk}(τ) + (1 − r_tk) F_{0tk}(τ)    (A.3)

for all t, k ∈ {0, 1}, and by Bayes' rule,

r_{1k} = (1 − α_1) p*_k / p_k,  r_{0k} = α_1 p*_k / (1 − p_k).    (A.4)

There are four cases, corresponding to different possibilities for the r_tk. The first case violates one of our model assumptions. For each of the remaining cases, we show that it is possible to construct the required distributions F_{0tk}, F_{1tk} under Assumptions 2.1 (i) and (iii) for any (c, β, α_0, α_1) such that E(y | z = k) = c + β(p_k − α_0)/(1 − α_0 − α_1).

Case I: r_{1k} = 0, r_{0k} ≠ 0. By Equation A.4 this requires α_1 = 1, violating Assumption 2.2 (ii).

Case II: r_{1k} = r_{0k} = 0. By Equation A.4, this requires p*_k = 0 which in turn requires p_k = α_0. By Equation A.3 we have F_tk = F_{0tk}, while F_{1tk} is unrestricted. Substituting into A.2,

G_k(τ) = p_k F_{01k}(τ + c) + (1 − p_k) F_{00k}(τ + c) = F_k(τ + c).

Now, since F_k(τ + c) is the conditional CDF of y − c given that z = k, and G_k is the conditional CDF of ε given z = k, we see that Assumption 2.1 (iii) is satisfied if and only if E(y | z = k) = c, which is equal to c + β(p_k − α_0)/(1 − α_0 − α_1) since p_k − α_0 = 0.

Case III: r_{0k} = 0, r_{1k} ≠ 0. By Equation A.4 this requires α_1 = 0 and p*_k ≠ 0. By Equation A.3 we have F_0k = F_{00k} and, since r_{1k} ≠ 0, we can solve to obtain

F_{11k}(τ) = (1/r_{1k}) [F_1k(τ) − (1 − r_{1k}) F_{01k}(τ)].

Substituting into Equation A.2, we obtain

G_k(τ) = [(1 − p_k) F_0k(τ + c) + p_k F_1k(τ + c + β)] + p_k(1 − r_{1k}) [F_{01k}(τ + c) − F_{01k}(τ + c + β)].

Now, F_0k(τ + c) is the conditional CDF of (y − c) given (T = 0, z = k) while F_1k(τ + c + β) is the conditional CDF of (y − c − β) given (T = 1, z = k). Similarly, F_{01k}(τ + c) is the conditional CDF of ε given (T* = 0, T = 1, z = k) while F_{01k}(τ + c + β) is the conditional CDF of (ε − β) given (T* = 0, T = 1, z = k). Since G_k(τ) is the conditional CDF of ε given z = k, we see that Assumption 2.1 (iii) is satisfied if and only if

0 = (1 − p_k) E(y − c | T = 0, z = k) + p_k E(y − c − β | T = 1, z = k) + p_k(1 − r_{1k}) [E(ε | T* = 0, T = 1, z = k) − E(ε − β | T* = 0, T = 1, z = k)].

Rearranging, this is equivalent to

E(y | z = k) = c + β (p_k − α_0)/(1 − α_0) = c + β (p_k − α_0)/(1 − α_0 − α_1)

since α_1 = 0 in this case. As explained above, F_0k = F_{00k} in the present case while F_{10k} is undefined. We are free to choose any distributions for F_{01k} and F_{11k} that satisfy Equation A.3, for example F_{01k} = F_{11k} = F_1k.

Case IV: r_{1k} ≠ 0, r_{0k} ≠ 0. In this case, we can solve Equation A.3 to obtain

F_{1tk}(τ) = (1/r_tk) [F_tk(τ) − (1 − r_tk) F_{0tk}(τ)].

Substituting this into Equation A.2, we have

G_k(τ) = F_k(τ + c + β) + p_k(1 − r_{1k}) [F_{01k}(τ + c) − F_{01k}(τ + c + β)] + (1 − p_k)(1 − r_{0k}) [F_{00k}(τ + c) − F_{00k}(τ + c + β)]

using the fact that F_k(τ) = p_k F_1k(τ) + (1 − p_k) F_0k(τ). Now, F_k(τ + c + β) is the conditional CDF of (y − c − β) given z = k, while F_{0tk}(τ + c) is the conditional CDF of ε given (T* = 0, T = t, z = k) and F_{0tk}(τ + c + β) is the conditional CDF of (ε − β) given (T* = 0, T = t, z = k). Since G_k(τ) is the conditional CDF of ε given z = k, we see that Assumption 2.1 (iii) is satisfied if and only if

0 = E[y − c − β | z = k] + p_k(1 − r_{1k}) [E(ε | T* = 0, T = 1, z = k) − E(ε − β | T* = 0, T = 1, z = k)] + (1 − p_k)(1 − r_{0k}) [E(ε | T* = 0, T = 0, z = k) − E(ε − β | T* = 0, T = 0, z = k)]

equivalently, 0 = E[y − c − β | z = k] + β [p_k(1 − r_{1k}) + (1 − p_k)(1 − r_{0k})]. But since [p_k(1 − r_{1k}) + (1 − p_k)(1 − r_{0k})] = (1 − p*_k) and p*_k = (p_k − α_0)/(1 − α_0 − α_1), this becomes

E[y | z = k] = c + β (p_k − α_0)/(1 − α_0 − α_1).

Thus, in this case we are free to choose any distributions for F_{0tk} and F_{1tk} that satisfy Equation A.3. For example we could take F_{0tk} = F_{1tk} = F_tk.
The result follows by substituting the largest and smallest possible values for α_0 + α_1 and taking the difference of the expressions for E[y | z = k].

Proof of Theorem 2.2.
The only difference between the conditions of Theorem 2.1 and those of Theorem 2.2 is that the latter imposes Assumption 2.2 (iii) while the former does not. Accordingly, the present argument builds on the proof of Theorem 2.1 and relies on the notation defined within it. Under Assumption 2.1 (i), Assumption 2.2 (iii) is equivalent to E[y | T, T*, z] = E[y | T*, z]. Hence, non-differential measurement error constrains only the conditional distribution of y given (T, T*, z). For this reason, we need only revisit the second step of the proof of Theorem 2.1. Consider a point (c, β, α_0, α_1) that satisfies Equation 3 and α_0 ≤ p_k ≤ 1 − α_1 for all k. Since this point lies in the identified set from Theorem 2.1, it suffices to determine whether there exist valid conditional CDFs F_{0tk}, F_{1tk} such that F_tk = (1 − r_tk) F_{0tk} + r_tk F_{1tk} for all t, k and E[y | T, T*, z] = E[y | T*, z].

Let μ_{t*tk} ≡ E[y | T = t, z = k, T* = t*], μ_tk ≡ E[y | T = t, z = k], and μ*_{t*k} ≡ E[y | z = k, T* = t*]. By Assumption 2.2 (iii), μ_{t*tk} = μ*_{t*k} for t* = 0, 1. Hence, by iterated expectations,

μ_0k = (1 − r_{0k}) μ*_0k + r_{0k} μ*_1k
μ_1k = (1 − r_{1k}) μ*_0k + r_{1k} μ*_1k.

Now, (μ_0k, μ_1k) are observed while r_{0k} and r_{1k} depend only on the observed first-stage probability p_k and the mis-classification probabilities (α_0, α_1). Thus, at a given point (c, β, α_0, α_1) in the identified set from Theorem 2.1 the preceding equations form a linear system in μ*_0k and μ*_1k. After some algebra, we find that the determinant is

r_{1k} − r_{0k} = [ (p_k − α_0)/(1 − α_0 − α_1) ] [ (1 − p_k − α_1)/(p_k (1 − p_k)) ].

Suppose first that r_{1k} = r_{0k} = r so the determinant condition fails. This occurs if and only if α_0 = p_k or α_1 = 1 − p_k. If μ_0k ≠ μ_1k, the system is inconsistent: no solution for (μ*_0k, μ*_1k) exists. Hence α_0 = p_k and α_1 = 1 − p_k are excluded from the identified set under non-differential measurement error so long as μ_0k ≠ μ_1k. If instead μ_0k = μ_1k = μ, the system is consistent but rank deficient: any pair (μ*_0k, μ*_1k) such that μ = (1 − r) μ*_0k + r μ*_1k is a solution and hence satisfies the assumption of non-differential measurement error. One such solution is μ*_0k = μ*_1k = μ so we are free to set F_{00k} = F_{10k} = F_0k and F_{01k} = F_{11k} = F_1k. Hence, if μ_0k = μ_1k then α_0 = p_k lies within the sharp identified set if p_k < p_ℓ and α_1 = 1 − p_k lies in the sharp identified set if p_ℓ < p_k.

Now suppose that r_{1k} ≠ r_{0k}, which occurs if and only if α_0 ≠ p_k and α_1 ≠ 1 − p_k. In this case the system has a unique solution, namely

μ*_0k = (r_{1k} μ_0k − r_{0k} μ_1k)/(r_{1k} − r_{0k}) = [(1 − p_k) E(y | T = 0, z = k) − α_1 E(y | z = k)] / (1 − p_k − α_1)
μ*_1k = [(μ_0k − μ_1k) + (r_{1k} μ_0k − r_{0k} μ_1k)] / (r_{1k} − r_{0k}) = [p_k E(y | T = 1, z = k) − α_0 E(y | z = k)] / (p_k − α_0).

Since μ_{10k} = μ_{11k} = μ*_1k and μ_{00k} = μ_{01k} = μ*_0k under non-differential measurement error, the mis-classification probabilities (α_0, α_1) combined with the observable moments completely determine the means of F_{1tk} and F_{0tk} whenever the determinant condition holds. If μ_0k = μ_1k then μ*_0k = μ*_1k so we are free to set F_{00k} = F_{10k} = F_0k and F_{01k} = F_{11k} = F_1k. Combining this with the reasoning from the preceding paragraph, we see that Assumption 2.2 (iii) imposes no additional restrictions for any k such that μ_0k = μ_1k. Accordingly, for the remainder of the proof we consider only the case in which μ_0k ≠ μ_1k. Given (α_0, α_1): r_tk, μ*_0k, and μ*_1k are fixed. The question is whether, for a given pair (α_0, α_1) and observed CDFs F_tk, we can construct valid CDFs F_{0tk}, F_{1tk} such that

∫ τ F_{1tk}(dτ) = μ*_1k,  ∫ τ F_{0tk}(dτ) = μ*_0k,  F_tk(τ) = r_tk F_{1tk}(τ) + (1 − r_tk) F_{0tk}(τ).

For a given pair (t, k), there are two cases: 0 < r_tk < 1 or r_tk ∈ {0, 1}.

Case I: r_tk ∈ {0, 1}. If r_tk = 1 then μ*_1k = μ_tk so we can set F_{1tk} = F_tk. In this case F_{0tk} is unrestricted. Analogously, if r_tk = 0, μ*_0k = μ_tk so we can set F_{0tk} = F_tk with F_{1tk} unrestricted.

Case II: 0 < r_tk < 1. Define the function μ_tk(ξ) = E[y | y ∈ I_tk(ξ), T = t, z = k] and the closed interval I_tk(ξ) = [F⁻¹_tk(1 − ξ − r_tk), F⁻¹_tk(1 − ξ)] where 0 ≤ ξ ≤ 1 − r_tk. The function μ_tk is decreasing in ξ, attaining its maximum μ̄_tk at ξ = 0 and its minimum μ̲_tk at ξ = 1 − r_tk.

Suppose first that μ*_1k does not lie in the interval [μ̲_tk, μ̄_tk]. We show that it is impossible to construct valid CDFs F_{1tk} and F_{0tk} that satisfy F_tk(τ) = r_tk F_{1tk}(τ) + (1 − r_tk) F_{0tk}(τ). Since r_tk ≠ 1, we can solve the expression for F_{0tk} to yield F_{0tk}(τ) = [F_tk(τ) − r_tk F_{1tk}(τ)]/(1 − r_tk). Hence, since r_tk ≠ 0, the requirement that 0 ≤ F_{0tk}(τ) ≤ 1 yields

[F_tk(τ) − (1 − r_tk)]/r_tk ≤ F_{1tk}(τ) ≤ F_tk(τ)/r_tk.    (A.5)

Now define F̄_{1tk}(τ) = min{1, F_tk(τ)/r_tk} and F̲_{1tk}(τ) = max{0, [F_tk(τ) − (1 − r_tk)]/r_tk}. By combining Equation A.5 with 0 ≤ F_{1tk}(τ) ≤ 1, we obtain F̲_{1tk}(τ) ≤ F_{1tk}(τ) ≤ F̄_{1tk}(τ). Thus, F̲_{1tk} first-order stochastically dominates F_{1tk}, which first-order stochastically dominates F̄_{1tk}. Hence,

∫ τ F̄_{1tk}(dτ) ≤ ∫ τ F_{1tk}(dτ) ≤ ∫ τ F̲_{1tk}(dτ).

But notice that

∫ τ F̄_{1tk}(dτ) = μ̲_tk,  ∫ τ F_{1tk}(dτ) = μ*_1k,  ∫ τ F̲_{1tk}(dτ) = μ̄_tk

so we have μ̲_tk ≤ μ*_1k ≤ μ̄_tk, which contradicts μ*_1k ∉ [μ̲_tk, μ̄_tk].

Now suppose that μ*_1k ∈ [μ̲_tk, μ̄_tk]. We show how to construct densities f_{1tk} and f_{0tk} that yield CDFs F_{1tk}, F_{0tk} satisfying the requirements described above. Since the conditional distribution of y given (T, z) is continuous, μ_tk(ξ) is continuous on its domain and takes on all values in [μ̲_tk, μ̄_tk] by the intermediate value theorem. Thus, there exists a ξ* such that μ_tk(ξ*) = μ*_1k. Let f_tk(τ) = dF_tk(τ)/dτ, which is non-negative by the assumption that y is continuously distributed. Now, define

f_{1tk}(τ) = f_tk(τ) 1{τ ∈ I_tk(ξ*)} / r_tk,  f_{0tk}(τ) = f_tk(τ) 1{τ ∉ I_tk(ξ*)} / (1 − r_tk).

Clearly f_{1tk} ≥ 0 and f_{0tk} ≥ 0. Integrating,

∫_R f_{1tk}(τ) dτ = (1/r_tk) ∫_{I_tk(ξ*)} f_tk(τ) dτ = 1,  ∫_R f_{0tk}(τ) dτ = (1/(1 − r_tk)) ∫_{I^C_tk(ξ*)} f_tk(τ) dτ = 1

where I^C_tk is the complement of I_tk. By construction,

r_tk ∫_A f_{1tk}(τ) dτ + (1 − r_tk) ∫_A f_{0tk}(τ) dτ = ∫_A f_tk(τ) dτ

for any set A. Finally,

∫_R τ f_{1tk}(τ) dτ = (1/r_tk) ∫_{I_tk(ξ*)} τ f_tk(τ) dτ = μ_tk(ξ*) = μ*_1k.

A.2 Point Identification Results

In the proofs of Lemma 2.3, Lemma 2.4, and Theorem 2.3, we employ the shorthand π ≡ Cov(T, z), η_j ≡ Cov(y^j, z) for j = 1, 2, 3, and τ_j ≡ Cov(T y^j, z) for j = 1, 2. Hence Lemma 2.2 becomes η_1 = πθ_1, while Lemma 2.3 becomes η_2 = 2τ_1θ_1 − πθ_2, and Lemma 2.4 becomes η_3 = 3τ_2θ_1 − 3τ_1θ_2 + πθ_3.

Proof of Lemma 2.3.
By Assumption 2.1 (i) and the basic properties of covariance,

η_2 = β² Cov(T*, z) + 2β [c Cov(T*, z) + Cov(T*ε, z)] + 2c Cov(ε, z) + Cov(ε², z)
τ_1 = cπ + Cov(Tε, z) + β Cov(TT*, z)

using the fact that T* is binary. Now, by Assumptions 2.1 (iii) and 2.5 we have Cov(ε, z) = Cov(ε², z) = 0. And, using Assumptions 2.2 (i) and (ii), one can show that Cov(TT*, z) = (1 − α_1) Cov(T*, z) and Cov(T*, z) = π/(1 − α_0 − α_1). Hence,

η_2 = θ_1(β + 2c)π + 2β Cov(T*ε, z)
2τ_1θ_1 − πθ_2 = [2θ_1 c + 2θ_1²(1 − α_1) − θ_2] π + 2θ_1 Cov(Tε, z)

but since θ_2 = θ_1²[(1 − α_1) + α_0], we see that [2θ_1²(1 − α_1) − θ_2] = θ_1² (1 − α_0 − α_1) = θ_1 β. Thus, it suffices to show that β Cov(T*ε, z) = θ_1 Cov(Tε, z). This equality is trivially satisfied when β = 0, so suppose that β ≠ 0. In this case it suffices to show that (1 − α_0 − α_1) Cov(T*ε, z) = Cov(Tε, z). Define m*_tk = E[ε | T* = t, z = k], p*_k = P(T* = 1 | z = k), and q ≡ P(z = 1). Then, by iterated expectations, Bayes' rule, and Assumption 2.2 (iii),

Cov(T*ε, z) = q(1 − q)(p*_1 m*_11 − p*_0 m*_10)
Cov(Tε, z) = q(1 − q){ (1 − α_1)[p*_1 m*_11 − p*_0 m*_10] + α_0[(1 − p*_1) m*_01 − (1 − p*_0) m*_00] }.

But by Assumption 2.1 (iii), E[ε | z = k] = m*_1k p*_k + m*_0k (1 − p*_k) = 0 and thus we obtain m*_0k(1 − p*_k) = −m*_1k p*_k. Therefore (1 − α_0 − α_1) Cov(T*ε, z) = Cov(Tε, z) as required.
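The final equality in this proof, Cov(Tε, z) = (1 − α_0 − α_1) Cov(T*ε, z), is easy to confirm by Monte Carlo under the maintained assumptions. A quick sketch (design values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, a0, a1 = 1_000_000, 0.1, 0.2
z = rng.binomial(1, 0.5, n)
u = rng.normal(0, 1, n)
t_star = (0.3 * z + 0.5 * u + rng.normal(0, 1, n) > 0.5).astype(int)
eps = u + rng.normal(0, 1, n)                 # E[eps | z] = 0, Cov(T*, eps) != 0
t = np.where(t_star == 1, 1 - rng.binomial(1, a1, n), rng.binomial(1, a0, n))

cov = lambda a, b: np.cov(a, b)[0, 1]
print((1 - a0 - a1) * cov(t_star * eps, z))   # these two numbers agree
print(cov(t * eps, z))                        # up to simulation error
```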
Proof of Lemma 2.4.
Since T* is binary, it follows from the basic properties of covariance that

η_3 = Cov[(c + ε)³, z] + 3β Cov[(c + ε)² T*, z] + 3β² Cov[(c + ε) T*, z] + β³ Cov(T*, z)
τ_2 = Cov[(c + ε)² T, z] + 2β Cov[(c + ε) TT*, z] + β² Cov(TT*, z).

By Assumptions 2.1 (iii), 2.5, and 2.6 (ii), Cov[(c + ε)³, z] = 0. Expanding,

η_3 = 3β Cov(T*ε², z) + (3β² + 6cβ) Cov(T*ε, z) + (β³ + 3cβ² + 3c²β) Cov(T*, z)
τ_2 = c² Cov(T, z) + β(β + 2c) Cov(TT*, z) + Cov(Tε², z) + 2c Cov(Tε, z) + 2β Cov(TT*ε, z).

Now, define s*_tk = E[ε² | T* = t, z = k] and recall p*_k = P(T* = 1 | z = k). By iterated expectations, Bayes' rule, and Assumption 2.6 (i),

Cov(T*ε², z) = q(1 − q)(p*_1 s*_11 − p*_0 s*_10)
Cov(Tε², z) = q(1 − q){ (1 − α_1)[p*_1 s*_11 − p*_0 s*_10] + α_0[(1 − p*_1) s*_01 − (1 − p*_0) s*_00] }.

By Assumption 2.5, E[ε² | z = 1] = E[ε² | z = 0] and thus, by iterated expectations, we have p*_1 s*_11 − p*_0 s*_10 = −[(1 − p*_1) s*_01 − (1 − p*_0) s*_00], which implies

Cov(Tε², z) = (1 − α_0 − α_1) Cov(T*ε², z).    (A.6)

Similarly, by iterated expectations and Assumptions 2.2 (i)–(iii),

Cov(TT*ε, z) = q(1 − q)(1 − α_1)(p*_1 m*_11 − p*_0 m*_10) = (1 − α_1) Cov(T*ε, z)    (A.7)

where m*_tk is defined as in the proof of Lemma 2.3. As shown in the proof of Lemma 2.3,

Cov(TT*, z) = (1 − α_1) Cov(T*, z),  Cov(T*, z) = π/(1 − α_0 − α_1),  Cov(T*ε, z) = Cov(Tε, z)/(1 − α_0 − α_1)

and combining these equalities with Equations A.6 and A.7, it follows that

τ_2 = (1 − α_0 − α_1) Cov(T*ε², z) + 2[(1 − α_1)(c + β) − cα_0] Cov(T*ε, z) + [(1 − α_1)(c + β)² − c²α_0] Cov(T*, z)
τ_1 = (1 − α_0 − α_1) Cov(T*ε, z) + [(1 − α_1)(c + β) − cα_0] Cov(T*, z)

using τ_1 = cπ + Cov(Tε, z) + β Cov(TT*, z) as shown in the proof of Lemma 2.3. Thus,

3τ_2θ_1 − 3τ_1θ_2 + πθ_3 = K_1 Cov(T*ε², z) + K_2 Cov(T*ε, z) + K_3 Cov(T*, z)

where K_1 ≡ 3θ_1(1 − α_0 − α_1) = 3β and

K_2 ≡ 6θ_1[(1 − α_1)(c + β) − cα_0] − 3θ_2(1 − α_0 − α_1)
K_3 ≡ 3θ_1[(1 − α_1)(c + β)² − c²α_0] − 3θ_2[(1 − α_1)(c + β) − cα_0] + θ_3(1 − α_0 − α_1).

Substituting the definitions of θ_1, θ_2, and θ_3 from Equations 4–6, tedious but straightforward algebra shows that K_2 = 3β² + 6cβ and K_3 = β³ + 3cβ² + 3c²β. Therefore the coefficients of η_3 equal those of 3τ_2θ_1 − 3τ_1θ_2 + πθ_3 and the result follows.
Collecting the results of Lemmas 2.2–2.4, we have

η_1 = πθ_1,  η_2 = 2τ_1θ_1 − πθ_2,  η_3 = 3τ_2θ_1 − 3τ_1θ_2 + πθ_3

which is a linear system in θ_1, θ_2, θ_3 with determinant −π³. Since π ≠ 0 by Assumption 2.1 (ii), θ_1, θ_2 and θ_3 are identified. Now, so long as β ≠ 0, we can rearrange Equations 5 and 6 to obtain

A = θ_2/θ_1² = 1 + (α_0 − α_1)    (A.8)
B = θ_3/θ_1³ = (1 − α_0 − α_1)² + 6α_0(1 − α_1).    (A.9)

Equation A.8 gives (1 − α_1) = A − α_0. Hence (1 − α_0 − α_1) = A − 2α_0 and α_0(1 − α_1) = α_0(A − α_0). Substituting into Equation A.9 and simplifying, (A² − B) + 2Aα_0 − 2α_0² = 0. Substituting for (1 − α_1) analogously yields a quadratic in (1 − α_1) with identical coefficients. It follows that one root of (A² − B) + 2Ar − 2r² = 0 is α_0 and the other is 1 − α_1. Solving,

r = [A ± √(3A² − 2B)] / 2 = (1/(2θ_1²)) ( θ_2 ± √(3θ_2² − 2θ_1θ_3) ).    (A.10)

Substituting Equations 5 and 6, simple algebra shows that 3θ_2² − 2θ_1θ_3 = θ_1⁴(1 − α_0 − α_1)². This quantity is strictly greater than zero since θ_1 ≠ 0 and α_0 + α_1 ≠ 1. It follows that both roots of the quadratic are real. Moreover, 3θ_2²/θ_1⁴ − 2θ_3/θ_1³ identifies (1 − α_0 − α_1)². Substituting into Equation 4, it follows that β is identified up to sign. If α_0 + α_1 < 1, sign(β) = sign(θ_1) so that both the sign and magnitude of β are identified. If α_0 + α_1 < 1, 1 − α_1 > α_0, so (1 − α_1) is the larger root of (A² − B) + 2Ar − 2r² = 0 and α_0 is the smaller root.

B Comment on Mahajan (2006)
Expanding on our discussion from Section 2.2 above, we now show that Mahajan's identification argument for an endogenous regressor in an additively separable model (his Section A.2) is incorrect. Unless otherwise indicated, all notation used below is as defined in Section 2.

The first step of Mahajan (2006) A.2 argues (correctly) that under Assumptions 2.1 and 2.2 (i)–(ii), knowledge of α_0(x) and α_1(x) is sufficient to identify β(x). This step is equivalent to our Lemma 2.2 above. The second step appeals to Mahajan (2006) Theorem 1 to argue that α_0(x) and α_1(x) are indeed point identified. To understand the logic of this second step, we first re-state Mahajan (2006) Theorem 1 in our notation. As in Section 2 above, T* denotes an unobserved binary random variable, z is a binary instrument, T an observed binary surrogate for T*, y an outcome of interest, and x a vector of covariates.

Assumption B.1 (Mahajan (2006) Theorem 1). Define g(T*, x) ≡ E[y | x, T*] and v ≡ y − g(T*, x). Suppose that knowledge of (y, T*, x) is sufficient to identify g and that:
(i) P(T* = 1 | x, z = 0) ≠ P(T* = 1 | x, z = 1);
(ii) T is conditionally independent of z given (x, T*);
(iii) α_0(x) + α_1(x) < 1;
(iv) E[v | x, z, T*, T] = 0;
(v) g(1, x) ≠ g(0, x).

Theorem B.1 (Mahajan (2006) Theorem 1). Under Assumption B.1, α_0(x) and α_1(x) are point identified, as is g(T*, x).

Assumption B.1 (i) is equivalent to our Assumption 2.1 (ii), while Assumptions B.1 (ii)–(iii) are equivalent to our Assumptions 2.2 (i)–(ii). Assumption B.1 (v) serves the same purpose as β(x) ≠ 0 in our Theorem 2.3: unless T* affects y, we cannot identify the mis-classification probabilities. The key difference between Theorem B.1 and the setting we consider in Section 2 comes from Assumption B.1 (iv). This is essentially a stronger version of our Assumptions 2.1 (iii) and 2.2 (iii) but applies to the projection error v, defined in Assumption B.1, rather than the structural error ε, defined in Assumption 2.1 (i). Accordingly, Theorem B.1 identifies the conditional mean function g rather than the causal effect β(x).

Although the meaning of the error term changes when we move from a structural to a reduced form model, the meaning of the mis-classification error rates does not: α_0(x) and α_1(x) are simply conditional probabilities for T given (T*, x). Step 2 of Mahajan (2006) A.2 relies on this insight. The idea is to find a way to satisfy Assumption B.1 (iv) simultaneously with Assumptions 2.1 (iii) and 2.2 (iii), while allowing T* to be endogenous. If this can be achieved, α_0(x), α_1(x) will be identified via Theorem B.1, and identification of β(x) will follow from step 1 of A.2 (our Lemma 2.2). To this end, Mahajan (2006) invokes the condition

E(y | x, z, T*, T) = E(y | x, T*).    (B.1)

Because Mahajan (2006) A.2 assumes an additively separable model – our Assumption 2.1 (i) – we see that

E(y | x, z, T*, T) = c(x) + β(x)T* + E(ε | x, z, T*, T)

so Equation B.1 is equivalent to E(ε | x, z, T*, T) = E(ε | x, T*). Note that this allows T* to be endogenous, as it does not require E(ε | x, T*) = 0. Now, applying Equation B.1 to the definition of v from Assumption B.1, we have

E(v | x, z, T*, T) = E[y − E(y | x, T*) | x, z, T*, T] = 0

which satisfies Assumption B.1 (iv) as required. Based on this reasoning, Mahajan (2006) claims that Equation B.1, along with Assumptions B.1 (iv), 2.1, and 2.2 (i)–(ii), suffices to identify the effect β(x) of an endogenous T*, so long as g(1, x) ≠ g(0, x). As we now show, however, these assumptions are contradictory unless T* is exogenous.

By Equation B.1 and Assumption 2.1 (i), E(ε | x, z, T*, T) = E(ε | x, T*) and thus by iterated expectations, we obtain

E(ε | x, T*, z) = E_{T | x, T*, z}[E(ε | x, T*, T, z)] = E_{T | x, T*, z}[E(ε | x, T*)] = E(ε | x, T*).    (B.2)

Now, let m*_tk(x) = E(ε | x, T* = t, z = k). Using this notation, Equation B.2 is equivalent to m*_t1(x) = m*_t0(x) for t = 0, 1. Combining iterated expectations with Assumption 2.1 (iii),

E(ε | x, z = k) = [1 − p*_k(x)] m*_0k(x) + p*_k(x) m*_1k(x) = 0    (B.3)

for k = 0, 1, where p*_k(x) ≡ P(T* = 1 | x, z = k). But substituting m*_t1(x) = m*_t0(x) ≡ m*_t(x) into Equation B.3 for k = 0, 1, we obtain

[1 − p*_0(x)] m*_0(x) + p*_0(x) m*_1(x) = 0
[1 − p*_1(x)] m*_0(x) + p*_1(x) m*_1(x) = 0.

The preceding two equalities are convex combinations of m*_0 and m*_1. The only way that both can equal zero simultaneously is if either p*_0(x) = p*_1(x), contradicting Assumption 2.1 (ii), or if m*_tk(x) = 0 for all (t, k), which implies that T* is exogenous. Hence Mahajan (2006) A.2 fails: given the assumption that z is a valid instrument for ε, Equation B.1 implies that either there is no first-stage relationship between z and T* or that T* is exogenous. The root of the problem with A.2 is the attempt to use one instrument to satisfy both the assumptions of Theorem B.1 and Lemma 2.2. If one had access to a second instrument w, or equivalently a second mis-measured surrogate for T*, that satisfied Assumption B.1, one could use w to recover α_0(x) and α_1(x) via Theorem B.1 and z to recover the IV estimand β(x)/[1 − α_0(x) − α_1(x)] via Lemma 2.2.
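The contradiction at the heart of this appendix is a two-equation linear system, so it can be displayed numerically: with a genuine first stage, the only (m*_0, m*_1) consistent with Equation B.3 is the zero vector. A minimal sketch (values illustrative):

```python
import numpy as np

# Equation B.3 with m*_t constant in z (Equation B.2):
#   (1 - p*_0) m0 + p*_0 m1 = 0
#   (1 - p*_1) m0 + p*_1 m1 = 0
p0, p1 = 0.4, 0.6                       # a genuine first stage: p*_0 != p*_1
M = np.array([[1 - p0, p0], [1 - p1, p1]])
print(np.linalg.det(M))                 # = p*_1 - p*_0 != 0, so M is invertible
print(np.linalg.solve(M, [0.0, 0.0]))   # m0 = m1 = 0: T* must be exogenous
```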
While allowing for arbitrary observed heterogeneity through the covariates x , all of the results pre-sented above assume an additively separable model – Assumption 2.1 (i). In this section we brieflydiscuss how our partial identification results can be interpreted in a local average treatment effects(LATE) setting. For simplicity, we suppress explicit conditioning on the covariates x throughout.In lieu of Assumption 2.1 (i), consider a non-separable model of the form y = h ( T ∗ , z, ε ). Let T ∗ ( z ) denote an individual’s potential treatment and Y ( t ∗ , z ) denote her potential outcome, where t ∗ , z ∈ { , } . Using this notation we can write Y ( t ∗ , z ) = h ( t ∗ , z, ε ). Let J ∈ { a, c, d, n } index the our LATE principal strata: a = always-taker, c = complier, d = defier, and n = never-taker. If J = a , then T ∗ ( z ) = 1; if J = c , then T ∗ ( z ) = z ; if J = d , then T ∗ ( z ) = 1 − z ; and if J = n , then T ∗ ( z ) = 0. In a LATE model, Assumption 2.1 (iii) is replaced by the standard LATE assumptions: Assumption C.1 (Unconfounded Type) . P ( J = j | z = 1) = P ( J = j | z = 0) for all j ∈ { a, c, d, n } . Assumption C.2 (Mean Exclusion Restriction) . For all t ∗ ∈ { , } and j ∈ { a, c, d, n } , E [ Y ( t ∗ , | T ∗ = t ∗ , z = 1] = E [ Y ( t ∗ , | T ∗ = t ∗ , z = 1] = E [ Y ( t ∗ ) | J = j ] . Assumption C.3 (Monotonicity) . P (cid:0) T ∗ (1) ≥ T ∗ (0) (cid:1) = 1As is well known, Assumption 2.1 (iii) combined with the preceding three conditions impliesthat the instrumental variables estimand based on T ∗ identifies the average treatment effect amongcompliers: E [ y | z = 1] − E [ y | z = 0] p ∗ − p ∗ = E [ Y (1) − Y (0) | J = c ] . The numerator of the preceding expression is observed, but under mis-classification the denominatoris not. Notice, however, that Assumptions 2.2 (i)–(ii) only concern the joint distribution of T given( T ∗ , z ). As such, they have the same meaning in a LATE model as in an additively separablemodel. Imposing these conditions, Lemma 2.1 continues to hold in a LATE model. It follows that p − p = (1 − α − α )( p ∗ − p ∗ ) so that E [ y | z = 1] − E [ y | z = 0] p − p = E [ Y (1) − Y (0) | J = c ]1 − α − α . Moreover, α ≤ p k ≤ − α for all k . Thus, the bound from Corollary 2.1 remains valid in a LATEmodel: E [ Y (1) − Y (0) | J = c ] must lie between the IV and reduced form estimands.Unlike Assumptions 2.2 (i)–(ii), Assumption 2.2 (iii), non-differential measurement error, isexplicitly stated in terms of the unobservable error term in an additively separable model. Ourderivation of the additional restrictions on ( α , α ) implied by non-differential measurement errorin the proof of Theorem 2.2, however, does not use Assumption 2.2 (iii) directly. Rather, ituses a condition that is equivalent to it in an additively separable model, namely E [ Y | T ∗ , T, z ] = E [ Y | T ∗ , z ]. Hence, as long as this equality holds, regardless of whether one is in an additivelyseparable model or a LATE model, the bounds on ( α , α ) from Theorem 2.2 remain valid. Since Y = (1 − T ∗ ) Y (0) + T ∗ Y (1), the appropriate modification of Assumption 2.2 (iii) is as follows. Assumption C.4 (Non-differential Measurement Error) . E [ Y (0) | T ∗ , T, z ] = E [ Y (0) | T ∗ , z ] and E [ Y (1) | T ∗ , T, z ] = E [ Y (1) | T ∗ , z ]To summarize, if one wishes to re-interpret our parameter β as a local average treatment effect,the partial identification bounds from Theorems 2.1 and 2.2 above remain valid. 
To summarize, if one wishes to re-interpret our parameter $\beta$ as a local average treatment effect, the partial identification bounds from Theorems 2.1 and 2.2 above remain valid, provided that Assumption 2.1 (i) is replaced by $y = h(T^*, z, \varepsilon)$, Assumption 2.1 (iii) is replaced by Assumptions C.1–C.3, and Assumption 2.2 (iii) is replaced by Assumption C.4. In a LATE model, however, our proofs of sharpness no longer apply, as they do not consider the testable implications of the LATE assumptions themselves. For partial identification results that consider these implications but do not impose non-differential measurement error, see Ura (2018). For discussion of the testable implications of a LATE model, see Kitagawa (2015).

References

Aigner, D. J., 1973. Regression with a binary independent variable subject to errors of observation. Journal of Econometrics 1, 49–60.
Andrews, D. W., Soares, G., 2010. Inference for parameters defined by moment inequalities using generalized moment selection. Econometrica 78 (1), 119–157.
Angrist, J. D., 1990. Lifetime earnings and the Vietnam era draft lottery: evidence from Social Security administrative records. The American Economic Review, 313–336.
Battistin, E., Nadai, M. D., Sianesi, B., 2014. Misreported schooling, multiple measures and returns to educational qualifications. Journal of Econometrics 181 (2), 136–150.
Black, D. A., Berger, M. C., Scott, F. A., 2000. Bounding parameter estimates with nonclassical measurement error. Journal of the American Statistical Association 95 (451), 739–748.
Bollinger, C. R., 1996. Bounding mean regressions when a binary regressor is mismeasured. Journal of Econometrics 73, 387–399.
Bollinger, C. R., van Hasselt, M., 2015. Bayesian moment-based inference in a regression model with misclassification error. Working paper.
Bound, J., Brown, C., Mathiowetz, N., 2001. Measurement error in survey data. In: Handbook of Econometrics. Vol. 5. Elsevier, pp. 3705–3843.
Carroll, R. J., Ruppert, D., Crainiceanu, C. M., Stefanski, L. A., 2006. Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC.
Chen, X., Hong, H., Tamer, E., 2005. Measurement error models with auxiliary data. The Review of Economic Studies 72 (2), 343–366.
Chen, X., Hu, Y., Lewbel, A., 2008a. Nonparametric identification of regression models containing a misclassified dichotomous regressor with instruments. Economics Letters 100, 381–384.
Chen, X., Hu, Y., Lewbel, A., 2008b. A note on the closed-form identification of regression models with a mismeasured binary regressor. Statistics & Probability Letters 78 (12), 1473–1479.
DiTraglia, F. J., García-Jimeno, C., 2017. Mis-classified, binary, endogenous regressors: Identification and inference. Tech. rep., NBER Working Paper.
Hu, Y., Shennach, S. M., 2008. Instrumental variable treatment of nonclassical measurement error models. Econometrica 76 (1), 195–216.
Hu, Y., Shiu, J.-L., Woutersen, T., 2015. Identification and estimation of single-index models with measurement error and endogeneity. The Econometrics Journal 18 (3), 347–362.
Imbens, G. W., Rubin, D. B., 1997. Estimating outcome distributions for compliers in instrumental variables models. The Review of Economic Studies 64 (4), 555–574.
Kane, T. J., Rouse, C. E., Staiger, D., 1999. Estimating returns to schooling when schooling is misreported. Tech. rep., National Bureau of Economic Research, NBER Working Paper 7235.
Kitagawa, T., 2015. A test for instrument validity. Econometrica 83 (5), 2043–2063.
Kreider, B., Pepper, J. V., Gundersen, C., Jolliffe, D., 2012. Identifying the effects of SNAP (food stamps) on child health outcomes when participation is endogenous and misreported. Journal of the American Statistical Association 107 (499), 958–975.
Lewbel, A., 2007a. Estimation of average treatment effects with misclassification. Econometrica 75 (2), 537–551.
Lewbel, A., 2007b. A local generalized method of moments estimator. Economics Letters 94, 124–128.
Mahajan, A., 2006. Identification and estimation of regression models with misclassification. Econometrica 74 (3), 631–665.
Molinari, F., 2008. Partial identification of probability distributions with misclassified data. Journal of Econometrics 144 (1), 81–117.
Nguimkeu, P., Denteh, A., Tchernis, R., 2016. On the estimation of treatment effects with endogenous misreporting. Working paper.
Shiu, J.-L., 2016. Identification and estimation of endogenous selection models in the presence of misclassification errors. Economic Modelling 52 (Part B), 507–518.
Song, S., 2015. Semiparametric estimation of models with conditional moment restrictions in the presence of nonclassical measurement errors. Journal of Econometrics 185 (1), 95–109.
Ura, T., 2018. Heterogeneous treatment effects with mismeasured endogenous treatment. Quantitative Economics 9 (3), 1335–1370.
van Hasselt, M., Bollinger, C. R., 2012. Binary misclassification and identification in regression models. Economics Letters 115, 81–84.