Identifying the Effect of a Mis-classified, Binary, Endogenous Regressor ∗

Francis J. DiTraglia and Camilo García-Jimeno
Final Version: January 23, 2019
Abstract
This paper studies identification of the effect of a mis-classified, binary, endogenous regressor when a discrete-valued instrumental variable is available. We begin by showing that the only existing point identification result for this model is incorrect. We go on to derive the sharp identified set under mean independence assumptions for the instrument and measurement error. The resulting bounds are novel and informative, but fail to point identify the effect of interest. This motivates us to consider alternative and slightly stronger assumptions: we show that adding second and third moment independence assumptions suffices to identify the model.
Keywords:
Instrumental variables, Measurement error, Endogeneity
JEL Codes:
C10, C25, C26

∗ We thank Daron Acemoglu, Manuel Arellano, Kristy Buzard, Xu Cheng, Bernardo da Silveira, Bo Honoré, Arthur Lewbel, Chuck Manski, Sophocles Mavroeidis, Francesca Molinari, Yuya Takahashi, the associate editor, two anonymous referees, and seminar participants at Cambridge, CEMFI, Chicago Booth, Manchester, Northwestern, Oxford, Penn State, Princeton, UCL, the 2016 Greater New York Area Econometrics Colloquium, Camp Econometrics IX, and the 2017 North American Summer Meeting of the Econometric Society for valuable comments and suggestions. This document supersedes an earlier version entitled "On Mis-measured Binary Regressors: New Results and Some Comments on the Literature."
1 Introduction
Measurement error and endogeneity are pervasive features of economic data. Conveniently, a valid instrumental variable corrects for both problems when the measurement error is classical, i.e. uncorrelated with the true value of the regressor. Many regressors of interest in applied work, however, are binary and thus cannot be subject to classical measurement error: the only way to mis-classify a true one is downwards, as a zero, while the only way to mis-classify a true zero is upwards, as a one, creating negative dependence between the truth and the measurement error. When faced with non-classical measurement error, the instrumental variables estimator can be severely biased. In this paper, we study an additively separable model of the form

y = c(x) + β(x)T* + ε    (1)

where ε is a mean-zero error term, T* is a binary, potentially endogenous regressor of interest, and x is a vector of exogenous controls. Because T* is binary, there is no loss of generality from writing the model in this form rather than the more familiar y = h(T*, x) + ε: simply define β(x) = h(1, x) − h(0, x) and c(x) = h(0, x). We ask whether, and if so under what conditions, a discrete instrumental variable z suffices to non-parametrically identify the causal effect β(x) of T*, when we observe not T* but a mis-classified binary surrogate T.

We proceed under the assumption of non-differential measurement error. This condition has been widely used in the existing literature and imposes that T provides no additional information beyond that contained in (T*, x). Even in this fairly standard setting, identification remains an open question: we begin by showing that the only existing identification result for this model is incorrect. We then go on to derive the sharp identified set under the standard first-moment assumptions from the related literature. We show that regardless of the number of values that z takes on, the model is not point identified. This motivates us to consider alternative, and slightly stronger, assumptions. We show that, given a binary instrument, the addition of a second moment independence assumption suffices to identify a model with one-sided mis-classification. Adding a second moment restriction on the measurement error along with a third moment independence assumption for the instrument suffices to identify the model in general. This result likewise requires only a binary z.

Our work relates to a large literature that considers departures from classical measurement error by allowing the measurement error to be related to the true value of the unobserved regressor. Chen et al. (2005) obtain identification in a general class of moment condition models with mis-measured data by relying on the existence of an auxiliary dataset from which they can estimate the measurement error process. In contrast, Hu and Shennach (2008) and Song (2015) rely on an instrumental variable and an additional conditional location assumption on the measurement error distribution. More recently, Hu et al. (2015) use a continuous instrument to identify the ratio of partial effects of two continuous regressors, one measured with error, in a linear single index model. Unfortunately, these approaches cannot be applied to the case of a mis-measured binary regressor.

A number of papers have studied models with an exogenous binary regressor subject to non-differential measurement error. One group of papers asks what can be learned without recourse to an instrumental variable. An early contribution by Aigner (1973) characterizes the asymptotic bias of OLS in this setting, and proposes a correction using outside information. Bollinger (1996) derives bounds for the mean regression of y given a mis-measured binary regressor T*. Black et al. (2000) and Kane et al. (1999) consider a linear model and show that when two alternative measures T_1 and T_2 of T* are available, a non-linear GMM estimator can be used to recover the effect of interest. Subsequently, Frazis and Loewenstein (2003) note that an instrumental variable can take the place of one of the measures. Mahajan (2006) extends the results of Black et al. (2000) and Kane et al. (1999) to a more general setting using a binary instrument in place of one of the treatment measures, establishing non-parametric identification of the conditional mean function. When T* is in fact exogenous, this coincides with the causal effect. Hu (2008) derives related results when the mis-classified discrete regressor may take on more than two values. Lewbel (2007a) provides an identification result for the same model as Mahajan (2006) under different assumptions. In particular, his "instrument-like variable" need not satisfy the usual exclusion restriction so long as it does not interact with T* and takes on three or more values.

Much less is known about the case in which a binary, or discrete, regressor is not only mis-classified but endogenous. The first paper to provide a formal result for this case is Mahajan (2006). He extends his main result to the case of an endogenous treatment, providing an explicit proof of identification under the usual IV assumption in a model with additively separable errors. As we show below, however, this result is false; Appendix B provides a detailed explanation of the error in Mahajan's proof. Several more recent papers also consider the case of a mis-classified, endogenous, binary regressor. Kreider et al. (2012) partially identify the effects of food stamps on health outcomes of children under weak measurement error assumptions by relying on auxiliary data. Similarly, Battistin et al. (2014) study the returns to schooling in a setting with multiple mis-reported measures of educational qualifications. Unlike these two papers, our approach does not depend on the availability of auxiliary data. In a different vein, Shiu (2016) uses an exclusion restriction for the participation equation and an additional valid instrument to identify the effect of a discrete, mis-classified endogenous regressor in a semi-parametric selection model. Similarly, Nguimkeu et al. (2016) use exclusion restrictions for both the participation equation and the measurement error equation to identify a parametric model with endogenous participation and one-sided endogenous mis-reporting. Unlike those of the preceding two papers, our results rely neither on parametric assumptions nor additional exclusion restrictions. Other than Mahajan (2006), the paper most closely related to our own is that of Ura (2018), who derives partial identification results for a local average treatment effect without the non-differential assumption. In contrast, we study an additively separable model under non-differential measurement error and derive both partial and point identification results.

The remainder of the paper is organized as follows. Section 2.1 describes our model and assumptions, Section 2.2 relates our results to existing work, and Sections 2.3–2.4 present our partial and point identification results.
2.1 Model and Assumptions

As defined in the preceding section, our model is y = c(x) + β(x)T* + ε, where ε is a mean-zero error term, and the parameter of interest is β(x) – the effect of an unobserved, binary, endogenous regressor T*. Suppose we observe a valid and relevant binary instrument z. In the discussion following Corollary 2.2 below, we explain how these results generalize to the case of an arbitrary discrete-valued instrument. We assume that the model and instrument satisfy the following conditions:

Assumption 2.1. (i) y = c(x) + β(x)T* + ε where T* ∈ {0, 1} and E[ε] = 0; (ii) z ∈ {0, 1}, where 0 < P(z = 1 | x) < 1, and P(T* = 1 | x, z = 1) ≠ P(T* = 1 | x, z = 0); (iii) E[ε | x, z] = 0.

Assumption 2.1 (i) is a restatement of the additively separable model from Equation 1, which includes as a special case the linear model y = c + βT* + x′γ + ε that is pervasive in empirical economics. Assumptions 2.1 (ii) and (iii) are the textbook instrumental variable relevance and validity conditions, respectively. Although it involves T*, Assumption 2.1 (ii) is testable: see the discussion following Lemma 2.1. Under Assumption 2.1, the Wald estimator

[E(y | z = 1, x) − E(y | z = 0, x)] / [E(T* | z = 1, x) − E(T* | z = 0, x)]

identifies β(x). Unfortunately this estimator is infeasible, as we observe not T* but a mis-classified binary surrogate T. To make further progress, we must impose conditions on the process that generates T. Accordingly, define the following mis-classification probabilities:

α_0(x, z) = P(T = 1 | T* = 0, x, z),  α_0(x) = P(T = 1 | T* = 0, x),
α_1(x, z) = P(T = 0 | T* = 1, x, z),  α_1(x) = P(T = 0 | T* = 1, x).

Assumption 2.2. (i) α_0(x, z) = α_0(x) and α_1(x, z) = α_1(x); (ii) α_0(x) + α_1(x) < 1; (iii) E[ε | x, z, T*, T] = E[ε | x, z, T*].

Assumption 2.2 (i) states that the mis-classification probabilities do not depend on the instrument z. Assumption 2.2 (ii) restricts the extent of mis-classification and is equivalent to requiring that T and T* be positively correlated. Assumption 2.2 (iii) is often referred to as "non-differential measurement error." Intuitively, it maintains that T provides no additional information about ε, and hence y, given knowledge of (T*, z, x). While Assumption 2.2 (ii) is quite mild, Assumptions 2.2 (i) and (iii) are more restrictive, as discussed by Bound et al. (2001). To take a specific example, suppose that y is log wage and T* is an indicator for college completion. If T is a potentially erroneous measure of college completion taken from a university's administrative records, then the assumption of non-differential measurement error is quite plausible. If, on the other hand, T is a self-report of college completion and there are "returns to lying" about college completion, i.e. employers only imperfectly observe worker ability, this assumption is less plausible; see Hu and Lewbel (2012) for a proposal to estimate the "returns to lying" in this context. Note, however, that our assumptions on the mis-classification process are conditional on x: we place no restrictions on the relationship between observed covariates and the mis-classification errors. In contrast, Bound et al. (2001) consider unconditional versions of our Assumption 2.2. Instrument validity – Assumption 2.1 (iii) – is more plausible after conditioning on a rich set of exogenous controls, and the same is true of our mis-classification assumptions. For more discussion of settings in which the assumption of non-differential measurement error is warranted, see Carroll et al. (2006).
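To make the roles of Assumptions 2.1–2.2 concrete, the following simulation is a minimal sketch (in Python; the design, parameter values, and variable names are purely illustrative and not taken from the paper). It generates data with an endogenous T* and non-differential mis-classification, then contrasts the infeasible Wald estimator based on T* with its feasible counterpart based on T:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
c, beta = 1.0, 2.0            # intercept and causal effect (covariates suppressed)
alpha0, alpha1 = 0.1, 0.2     # P(T=1 | T*=0) and P(T=0 | T*=1)

z = rng.binomial(1, 0.5, n)                       # binary instrument
u = rng.normal(0, 1, n)                           # unobserved confounder
t_star = (0.3 * z + 0.5 * u + rng.normal(0, 1, n) > 0.5).astype(int)
eps = u + rng.normal(0, 1, n)                     # endogeneity: eps correlated with T*
y = c + beta * t_star + eps                       # but E[eps | z] = 0, so z is valid

# Non-differential mis-classification, independent of z given T* (Assumption 2.2)
t_obs = np.where(t_star == 1,
                 1 - rng.binomial(1, alpha1, n),  # true ones flipped down w.p. alpha1
                 rng.binomial(1, alpha0, n))      # true zeros flipped up w.p. alpha0

def wald(y, t, z):
    return (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

print("infeasible Wald (T*):", wald(y, t_star, z))  # close to beta = 2
print("feasible Wald   (T): ", wald(y, t_obs, z))   # close to beta/(1 - a0 - a1) = 2.857
```

The feasible estimator is inflated by the factor 1/(1 − α_0 − α_1), exactly the distortion formalized in Lemma 2.2 below.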
2.2 Existing Results

Existing results from the literature – see for example Frazis and Loewenstein (2003) and Mahajan (2006) – establish that β(x) is point identified if Assumptions 2.1–2.2 are augmented to include the following condition:

Assumption 2.3 (Joint Exogeneity). E[ε | x, z, T*] = 0.

Assumption 2.3 strengthens the mean independence condition from Assumption 2.1 (iii) to hold jointly for T* and z. By iterated expectations, this implies that T* is exogenous, i.e. E[ε | x, T*] = 0. If T* is endogenous, Assumption 2.3 clearly fails. Mahajan (2006) argues, however, that the following restriction, along with our Assumptions 2.1–2.2, suffices to identify β(x) when T* may be endogenous:

Assumption 2.4 (Mahajan (2006) Equation 11). E[ε | x, z, T*, T] = E[ε | x, T*].

Assumption 2.4 does not require E[ε | x, T*] to be zero, but maintains that it does not vary with z. We show in Appendix B, however, that under Assumptions 2.1–2.2, Assumption 2.4 can only hold if T* is exogenous. If z is a valid instrument and T* is endogenous, then Assumption 2.4 implies that there is no first-stage relationship between z and T*. As such, identification in the case where T* is endogenous is an open question.

2.3 Partial Identification

In this section we derive the sharp identified set under Assumptions 2.1–2.2 and show that β(x) is not point identified. For a discussion of how our partial identification results can be interpreted in a local average treatment effects (LATE) setting, see Appendix C. To simplify the notation, define the following shorthand for the unobserved and observed first stage probabilities:

p*_k(x) = P(T* = 1 | x, z = k),  p_k(x) = P(T = 1 | x, z = k).    (2)

We first state two lemmas that will be used repeatedly below.

Lemma 2.1.
Under Assumption 2.2 (i),

[1 − α_0(x) − α_1(x)] p*_k(x) = p_k(x) − α_0(x)
[1 − α_0(x) − α_1(x)] [1 − p*_k(x)] = 1 − p_k(x) − α_1(x)

where the first-stage probabilities p*_k(x) and p_k(x) are as defined in Equation 2.
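Since Lemma 2.1 is an application of the law of total probability, it can be checked by direct arithmetic. A minimal sketch, with arbitrary illustrative parameter values:

```python
alpha0, alpha1, p_star = 0.1, 0.2, 0.6   # any values with alpha0 + alpha1 < 1

# Law of total probability: P(T=1 | z=k) = alpha0*(1 - p*_k) + (1 - alpha1)*p*_k
p = alpha0 * (1 - p_star) + (1 - alpha1) * p_star

assert abs((1 - alpha0 - alpha1) * p_star - (p - alpha0)) < 1e-12
assert abs((1 - alpha0 - alpha1) * (1 - p_star) - (1 - p - alpha1)) < 1e-12

# Inverting the first identity recovers the unobserved first stage:
print((p - alpha0) / (1 - alpha0 - alpha1))   # = 0.6 = p_star
```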
Lemma 2.2. Under Assumptions 2.1 and 2.2 (i)–(ii),

β(x) Cov(z, T | x) = [1 − α_0(x) − α_1(x)] Cov(y, z | x).

Lemma 2.1 relates the observed first-stage probabilities p_k(x) to their unobserved counterparts p*_k(x) in terms of the mis-classification probabilities α_0(x) and α_1(x). By Assumption 2.2 (ii), 1 − α_0(x) − α_1(x) > 0, so the Lemma can be solved for p*_k(x) in terms of α_0(x), α_1(x), and the observed first-stage probabilities. Moreover, by taking differences evaluated at k = 1 and k = 0, this Lemma shows that p*_1(x) = p*_0(x) if and only if p_1(x) = p_0(x). In other words, Assumption 2.1 (ii) is testable under Assumption 2.2 (ii). Lemma 2.2 relates the instrumental variables (IV) estimand, Cov(y, z | x) / Cov(z, T | x), to the mis-classification probabilities. Since 0 < 1 − α_0(x) − α_1(x) < 1, the IV estimand is biased away from zero in the presence of mis-classification. Together these lemmas bound the causal effect of interest: β(x) lies between the reduced form and IV estimands. Without Assumption 2.2 (iii), non-differential measurement error, these bounds are sharp.

Theorem 2.1.
Under Assumptions 2.1 and 2.2 (i)–(ii), α_0(x) ≤ p_k(x) ≤ 1 − α_1(x) for k = 0, 1, and

E[y | x, z = k] = c(x) + β(x) [p_k(x) − α_0(x)] / [1 − α_0(x) − α_1(x)].    (3)

Provided that p_0(x) ≠ p_1(x), these expressions characterize the sharp identified set for c(x), β(x), α_0(x), and α_1(x).
Corollary 2.1. Under the conditions of Theorem 2.1, the sharp identified set for β(x) is the closed interval between the reduced form estimand Cov(y, z | x) / Var(z | x) and the IV estimand Cov(y, z | x) / Cov(z, T | x).

Corollary 2.1 follows by taking differences of the expression for E[y | x, z = k] across k = 1 and k = 0, and substituting the maximum and minimum values of α_0(x) + α_1(x) consistent with the observed first-stage probabilities. If a priori restrictions on α_0 and α_1 are available, e.g. α_0 = 0, α_1 = 0, or α_0 = α_1, these bounds can be improved; for more discussion, see Corollary 2.2 of DiTraglia and García-Jimeno (2017). Note that the only role of the condition p_0(x) ≠ p_1(x) in the preceding two results is to ensure that it is possible to satisfy Assumption 2.1 (ii). Frazis and Loewenstein (2003) point out that the IV estimand provides an upper bound for β(x), and Lemmas 2.1–2.2 are well-known in the literature (see e.g. Frazis and Loewenstein, 2003; Mahajan, 2006). Nevertheless, we are unaware of any published result that explicitly states both bounds from Corollary 2.1 or proves that they are sharp under Assumptions 2.1 and 2.2 (i)–(ii); the only exception is the incorrect result of Mahajan (2006) described in Section 2.2 and Appendix B.

Neither Theorem 2.1 nor Corollary 2.1 imposes Assumption 2.2 (iii) – non-differential measurement error. While this assumption plays an important role in existing identification results for an exogenous T* (see Section 2.2), its identifying power under endogeneity has not been addressed in the literature. We now show that this assumption in general yields further restrictions on the probabilities α_0(x) and α_1(x), but fails to point identify β(x). To simplify the proof of sharpness, we assume that y is continuously distributed, which is natural in an additively separable model. Without this assumption, the bounds that we derive are still valid, but may not be sharp. Nevertheless, the reasoning from our proof can be generalized to cases in which y does not have a continuous support set.
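In practice the two bounds of Corollary 2.1 are sample analogues of simple covariance ratios. A hedged sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def beta_bounds(y, t, z):
    """Corollary 2.1: the identified set for beta is the closed interval between
    the reduced-form and IV estimands (scalar case, covariates suppressed)."""
    cov = lambda a, b: np.cov(a, b)[0, 1]
    reduced_form = cov(y, z) / cov(z, z)   # Cov(y, z) / Var(z)
    iv = cov(y, z) / cov(z, t)             # Cov(y, z) / Cov(z, T)
    return tuple(sorted([reduced_form, iv]))

# With the simulated data from the earlier sketch (beta = 2, alpha0 + alpha1 = 0.3),
# beta_bounds(y, t_obs, z) returns an interval of roughly [0.2, 2.9] covering beta.
```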
Theorem 2.2. Suppose that the conditional distribution of y given (x, T, z) is continuous. Further suppose that the conditions of Theorem 2.1 and Assumption 2.2 (iii) hold. For any k such that E[y | x, T = 0, z = k] ≠ E[y | x, T = 1, z = k], let A_k denote the set of pairs (α_0(x), α_1(x)) such that α_0(x) < p_k(x) < 1 − α_1(x) and

μ̲_tk( q̲_tk(α_0(x), α_1(x), x), x ) ≤ μ_1k(α_0(x), x) ≤ μ̄_tk( q̄_tk(α_0(x), α_1(x), x), x )

for all t = 0, 1, where

μ̲_tk(q, x) = E[y | y ≤ q, x, T = t, z = k],  μ̄_tk(q, x) = E[y | y > q, x, T = t, z = k],

μ_1k(α_0(x), x) = { p_k(x) E[y | x, z = k, T = 1] − α_0(x) E[y | x, z = k] } / { p_k(x) − α_0(x) }

and we define

q̲_tk(α_0(x), α_1(x), x) = F⁻¹_tk( r_tk(α_0(x), α_1(x), x) | x )
q̄_tk(α_0(x), α_1(x), x) = F⁻¹_tk( 1 − r_tk(α_0(x), α_1(x), x) | x )

where F⁻¹_tk(· | x) is the conditional quantile function of y given (x, T = t, z = k),

r_1k(α_0(x), α_1(x), x) = [(1 − α_1(x)) / p_k(x)] [ (p_k(x) − α_0(x)) / (1 − α_0(x) − α_1(x)) ]
r_0k(α_0(x), α_1(x), x) = [α_1(x) / (1 − p_k(x))] [ (p_k(x) − α_0(x)) / (1 − α_0(x) − α_1(x)) ]

and p_k(x) is defined in Equation 2. The sharp identified set for c(x), β(x), α_0(x) and α_1(x) is characterized by Equation 3 and (α_0(x), α_1(x)) ∈ A*, where

(i) A* ≡ A_0 ∩ A_1 if E[y | x, T = 0, z = k] ≠ E[y | x, T = 1, z = k] for all k = 0, 1;
(ii) A* ≡ A_k if E[y | x, T = 0, z = k] ≠ E[y | x, T = 1, z = k] and E[y | x, T = 0, z = ℓ] = E[y | x, T = 1, z = ℓ];
(iii) A* ≡ {(α_0(x), α_1(x)) : α_0(x) ≤ p_k(x) ≤ 1 − α_1(x) for all k} if E[y | x, T = 0, z = k] = E[y | x, T = 1, z = k] for all k = 0, 1.

Imposing Assumption 2.2 (iii) strictly improves upon the identified set from Theorem 2.1 unless E[y | x, T = 0, z = k] = E[y | x, T = 1, z = k] for all k. Even if β(x) = 0, the difference of these observable means is generically nonzero. (Suppressing dependence on x, there are only two settings in which E[y | T = 0, z = k] = E[y | T = 1, z = k]: first, if the true value of either α_0 or α_1 lies at the upper boundary of the identified set from Theorem 2.1; second, if β = E[ε | T* = 0, z = k] − E[ε | T* = 1, z = k].) The intuition for Theorem 2.2 is as follows. For simplicity, suppress dependence on x. Now, fix (T = t, z = k) and (α_0, α_1). The observed distribution of y given (T = t, z = k), call it F_tk, is a mixture of two unobserved distributions: the distribution of y given (T = t, z = k, T* = 1), call it F_{1tk}, and the distribution of y given (T = t, z = k, T* = 0), call it F_{0tk}. The mixing probabilities are r_tk and 1 − r_tk from the statement of Theorem 2.2 and are fully determined by (α_0, α_1) and p_k. Assumptions 2.1 (i) and 2.2 (iii) imply that the unobserved means E[y | T*, T, z] are fully determined by (α_0, α_1) given the observed means E[y | T, z]. The question is whether it is possible, given the observed distribution F_tk, to construct F_{1tk} and F_{0tk} with the required values for E[y | T*, T, z] such that F_tk = r_tk F_{1tk} + (1 − r_tk) F_{0tk} for all combinations (t, k). If not, then (α_0, α_1) does not belong to the identified set. Our proof provides necessary and sufficient conditions for such a mixture to exist at a given point (α_0, α_1). We can then appeal to the reasoning from Theorem 2.1 to complete the argument. By ruling out values for α_0 and α_1, Theorem 2.2 restricts β via Lemma 2.2. While these restrictions can be very informative, they do not yield point identification.
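To illustrate how the feasibility check behind Theorem 2.2 can be taken to data, the following sketch (our own illustration, not the paper's estimator; covariates suppressed, interior cases 0 < r_tk < 1 only, and it assumes every (T, z) cell is populated) grid-searches over (α_0, α_1), computes the implied mean of y given T* = 1, and checks it against the truncated-mean bounds, keeping the implied β from Lemma 2.2 for every surviving pair:

```python
import numpy as np

def identified_set(y, t, z, grid=np.linspace(0.0, 0.45, 46)):
    """(alpha0, alpha1) survives if the implied E[y | T*=1, z=k] lies between
    the means of the bottom and top r_tk fractions of each (T=t, z=k) cell."""
    cov = lambda a, b: np.cov(a, b)[0, 1]
    iv = cov(y, z) / cov(z, t)
    p = {k: t[z == k].mean() for k in (0, 1)}
    keep = []
    for a0 in grid:
        for a1 in grid:
            if a0 + a1 >= 1:
                continue
            ok = True
            for k in (0, 1):
                if not (a0 < p[k] < 1 - a1):
                    ok = False
                    break
                p_star = (p[k] - a0) / (1 - a0 - a1)
                ey, ey1 = y[z == k].mean(), y[(z == k) & (t == 1)].mean()
                mu1 = (p[k] * ey1 - a0 * ey) / (p[k] - a0)   # E[y | T*=1, z=k]
                for tt, r in ((1, (1 - a1) * p_star / p[k]),
                              (0, a1 * p_star / (1 - p[k]))):
                    if not 0 < r < 1:
                        continue                              # boundary cases unrestricted
                    cell = np.sort(y[(z == k) & (t == tt)])
                    m = max(1, int(round(r * cell.size)))
                    if not cell[:m].mean() <= mu1 <= cell[-m:].mean():
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                keep.append((a0, a1, (1 - a0 - a1) * iv))     # implied beta, Lemma 2.2
    return keep
```

Applied to the simulated data from Section 2.1, the surviving pairs always include (0, 0), consistent with the corollary that follows.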
Corollary 2.2. Under Assumptions 2.1 and 2.2, the identified set for β(x) contains both the IV estimand Cov(y, z | x) / Cov(z, T | x) and the true coefficient β(x).

Corollary 2.2 follows by Lemma 2.2 because α_0(x) = α_1(x) = 0 always belongs to the sharp identified set from Theorem 2.2. Non-differential measurement error cannot exclude the possibility that there is no mis-classification because in this case it is trivial to construct the required mixtures. Although we focus throughout this paper on the case of a binary instrument, one might wonder whether point identification can be achieved by increasing the support of z, perhaps along the lines of Lewbel (2007a). The answer turns out to be no. Suppose that we were to modify Assumptions 2.1 and 2.2 to hold for all values of z in some discrete support set. By Lemma 2.2, a binary instrument identifies β(x) up to knowledge of the mis-classification probabilities α_0(x) and α_1(x). It follows that any pair of values (k, ℓ) in the support set of z identifies the same object. Accordingly, to identify β(x) it is necessary and sufficient to identify the mis-classification probabilities. A binary instrument fails to identify these probabilities because we can never exclude the possibility of zero mis-classification. The same is true of a discrete K-valued instrument. Increasing the support of z does, however, shrink the identified set by increasing the number of restrictions available: in this case Theorems 2.1–2.2 continue to apply replacing "k = 0, 1" with "for all k."
1” with “for all k .” The results of the preceding section establish that β ( x ) is not point identified under As-sumptions 2.1 and 2.2. In light of this, there are two possible ways to proceed: either onecan report partial identification bounds based on our characterization of the sharp identifiedset from Theorem 2.2, or one can attempt to impose stronger assumptions to obtain pointidentification. In this section we consider the second possibility. We begin by defining thefollowing functions of the model parameters: θ ( x ) = β ( x ) [1 − α ( x ) − α ( x )] − (4) θ ( x ) = [ θ ( x )] [1 + α ( x ) − α ( x )] (5) θ ( x ) = [ θ ( x )] (cid:2) { − α ( x ) − α ( x ) } + 6 α ( x ) { − α ( x ) } (cid:3) (6)Now consider the following additional assumption: Assumption 2.5. E [ ε | x , z ] = E [ ε | x ]Assumption 2.5 is a second moment version of the standard mean exclusion restriction forthe instrument z – Assumption 2.1 (iii). It requires that the conditional variance of the errorterm given the covariates x does not depend on z , but does not require homoskedasticitywith respect to x , T ∗ or T . Assumption 2.5 allows us to derive the following lemma: Lemma 2.3.
Assumption 2.5 allows us to derive the following lemma:

Lemma 2.3. Under Assumptions 2.1, 2.2 and 2.5,

Cov(y², z | x) = 2 Cov(yT, z | x) θ_1(x) − Cov(T, z | x) θ_2(x)

where θ_1(x) and θ_2(x) are defined in Equations 4–5.

Lemma 2.2 identifies θ_1(x). Since Cov(z, T | x) ≠ 0 by Assumption 2.1 (ii), we can solve for θ_2(x) in terms of observables only, using Lemma 2.3. Given knowledge of θ_2(x), we can solve Equation 5 for the difference of mis-classification rates so long as β(x) ≠ 0.

Corollary 2.3. Under Assumptions 2.1–2.2 and 2.5, α_0(x) − α_1(x) is identified so long as β(x) ≠ 0.

Thus, under one-sided mis-classification, i.e. if it is known a priori that α_0(x) = 0 or α_1(x) = 0, augmenting our baseline Assumptions 2.1–2.2 with Assumption 2.5 suffices to identify β(x). Notice that β(x) = 0 if and only if θ_1(x) = 0. Thus, β(x) is still identified in the case where Corollary 2.3 fails to apply. Assumption 2.5 does not suffice to identify β(x) without a priori restrictions on the mis-classification error rates. To achieve identification in the general case, we impose the following additional conditions:

Assumption 2.6. (i) E[ε² | x, z, T*, T] = E[ε² | x, z, T*]; (ii) E[ε³ | x, z] = E[ε³ | x].

Assumption 2.6 (i) is a second moment version of the non-differential measurement error assumption, Assumption 2.2 (iii). It requires that, given knowledge of (x, T*, z), T provides no additional information about the variance of the error term. Note that Assumption 2.6 (i) does not require homoskedasticity of ε with respect to x or T*. Assumption 2.6 (ii) is a third moment version of Assumption 2.5. It requires that the conditional third moment of the error term given x does not depend on z. This condition neither requires nor excludes skewness in the error term conditional on covariates: it merely states that the skewness is unaffected by the instrument. While Assumptions 2.5 and 2.6 may appear somewhat unusual, they are implied by the more intuitive independence conditions ε ⊥ z | x and ε ⊥ T | (x, T*, z). Although E[ε² | x, z] = E[ε² | x] and E[ε² | x, z, T*, T] = E[ε² | x, z, T*] are technically weaker than assuming full independence, we would be somewhat dubious of any supposed "natural experiment" that purportedly satisfied mean exclusion but not independence. Indeed, as discussed by Imbens and Rubin (1997), an instrument satisfying mean exclusion but not independence could become invalid if the outcome variable were transformed, for example by taking logs. As it is not uncommon for applied papers to report results in both logs and levels (e.g. Angrist, 1990), our view is that researchers implicitly assume more than mean exclusion in typical applications of instrumental variables. Analogous reasoning applies to the non-differential measurement error assumption. Assumption 2.6 allows us to derive the following Lemma which, combined with Lemma 2.3, leads to point identification:
Lemma 2.4. Under Assumptions 2.1–2.2 and 2.5–2.6,

Cov(y³, z | x) = 3 Cov(y²T, z | x) θ_1(x) − 3 Cov(yT, z | x) θ_2(x) + Cov(T, z | x) θ_3(x)

where θ_1(x), θ_2(x) and θ_3(x) are defined in Equations 4–6.
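Lemmas 2.2–2.4 form a triangular linear system in (θ_1, θ_2, θ_3) that can be solved recursively from sample covariances; inverting Equations 4–6 then recovers the structural parameters, anticipating Theorem 2.3 below. A sketch under our illustrative simulation design (not the paper's estimator, and it will be unstable when β(x) is near zero):

```python
import numpy as np

def point_identify(y, t, z):
    """Solve the system of Lemmas 2.2-2.4 for (theta1, theta2, theta3), then
    invert Equations 4-6 for (beta, alpha0, alpha1) as in Theorem 2.3."""
    cov = lambda a, b: np.cov(a, b)[0, 1]
    pi = cov(t, z)
    eta = [cov(y**j, z) for j in (1, 2, 3)]
    tau = [cov(t * y**j, z) for j in (1, 2)]
    th1 = eta[0] / pi                                            # Lemma 2.2
    th2 = (2 * tau[0] * th1 - eta[1]) / pi                       # Lemma 2.3
    th3 = (eta[2] - 3 * tau[1] * th1 + 3 * tau[0] * th2) / pi    # Lemma 2.4
    beta = np.sign(th1) * np.sqrt(3 * (th2 / th1)**2 - 2 * th3 / th1)
    # The two roots of the quadratic in the proof are alpha0 and 1 - alpha1:
    roots = (th2 + np.array([-1.0, 1.0])
             * np.sqrt(3 * th2**2 - 2 * th1 * th3)) / (2 * th1**2)
    return beta, roots.min(), 1 - roots.max()   # (beta, alpha0, alpha1)

# On the simulated data from Section 2.1, point_identify(y, t_obs, z) is
# approximately (2.0, 0.1, 0.2), matching the design's (beta, alpha0, alpha1).
```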
Theorem 2.3. Under Assumptions 2.1–2.2 and 2.5–2.6, β(x) is identified. If β(x) ≠ 0, then α_0(x) and α_1(x) are likewise identified.

Lemmas 2.2–2.4 yield a linear system of three equations in θ_1(x), θ_2(x) and θ_3(x). Under Assumption 2.1 (ii), the system has a unique solution, so θ_1(x), θ_2(x) and θ_3(x) are identified. The proof of Theorem 2.3 shows that, so long as β(x) ≠ 0, Equations 4–6 can be solved for β(x), α_0(x) and α_1(x). In particular, using steps from the proof of Theorem 2.3,

β(x) = sign[θ_1(x)] √( 3 [θ_2(x)/θ_1(x)]² − 2 [θ_3(x)/θ_1(x)] ).

If we relax Assumption 2.2 (ii) and assume only that α_0(x) + α_1(x) ≠ 1, β(x) is identified only up to sign: in this case the sign of θ_1(x) need not equal that of β(x).

3 Estimation and Inference

We now briefly outline how the identification results from Section 2 can be used to estimate and carry out statistical inference for the parameters of interest: (α_0(x), α_1(x), β(x)). Lemmas 2.2–2.4 yield a system of linear moment equations in the reduced form parameters θ′(x) = (θ_1(x), θ_2(x), θ_3(x)). Defining a vector of intercepts κ′(x) = (κ_1(x), κ_2(x), κ_3(x)), and a vector of observables w′ = (T, y, yT, y², y²T, y³), we can write this system as

E[ {Ψ(θ(x)) w_i − κ(x)} ⊗ (1, z_i)′ | x_i = x ] = 0    (7)

where

Ψ(θ(x)) ≡ [ −θ_1(x)   1   0          0   0          0
             θ_2(x)   0  −2θ_1(x)    1   0          0
            −θ_3(x)   0   3θ_2(x)    0  −3θ_1(x)    1 ].    (8)

Using Equations 4–6, we can re-write Ψ as a function of (α_0(x), α_1(x), β(x)), leaving us with a just-identified, non-parametric conditional moment problem. Because the conditioning variables in Equation 7 are the same as the arguments of the unknown functions (α_0, α_1, β), this problem fits within the framework of Lewbel (2007b), permitting straightforward estimation and inference via a local GMM procedure. If β(x) is close to zero, however, this procedure can perform poorly; in this case the moment conditions from Equation 7 are only weakly informative about α_0(x) and α_1(x). An earlier version of this paper (DiTraglia and García-Jimeno, 2017) discusses this problem in more detail and provides a solution based on generalized moment selection (Andrews and Soares, 2010) that combines the moment inequalities implied by our partial identification results from Section 2.3 with the moment equalities from Equation 7.

4 Conclusion

This paper has studied identification and inference for a mis-classified, binary, endogenous regressor in an additively separable model using a discrete instrumental variable. We have shown that the only existing identification result for this model is incorrect, and gone on to derive the sharp identified set under standard first-moment assumptions from the literature. Strengthening these assumptions to hold for second and third moments, we have established point identification for the effect of interest. An interesting extension of the results presented above would be to consider the case of discrete regressors that take on more than two values.
A Proofs

All of the results in this paper hold x fixed. This allows us to completely ignore the presence of covariates in the proofs that follow. Accordingly we work in terms of scalars α_0, α_1, β, p_k, etc. rather than functions α_0(x), α_1(x), β(x), p_k(x). The former should be understood as the value of the latter evaluated at some particular x.

A.1 Partial Identification Results
Proof of Lemma 2.1.
Follows from a simple calculation using the law of total probability.
Proof of Lemma 2.2.
Immediate, since Cov(z, T) = (1 − α_0 − α_1) Cov(z, T*) by Lemma 2.1.

Proof of Theorem 2.1.
To show that α_0 ≤ p_k ≤ 1 − α_1, substitute p*_k = 0 and p*_k = 1, respectively, into Lemma 2.1 and rearrange. To show that E[y | z = k] = c + β(p_k − α_0)/(1 − α_0 − α_1), take conditional expectations of Equation 1 and apply Assumption 2.1 (iii) and Lemma 2.1. To prove sharpness we need to show that for any (c, β, α_0, α_1) that satisfy α_0 ≤ p_k ≤ 1 − α_1 and E[y | z = k] = c + β(p_k − α_0)/(1 − α_0 − α_1) we can construct a valid joint distribution for (y, T, T*, z) that is compatible with the observed distribution of (y, T, z), provided that p_0 ≠ p_1. To establish this result, we factorize the joint distribution of (y, T, T*, z) into the product of a conditional y | (T, T*, z) and marginal (T, T*, z). The argument proceeds in two steps. Our first step relies on the fact that Assumptions 2.1 (i) and (iii) do not constrain the distribution of (T, T*, z) while 2.1 (ii) and 2.2 (i)–(ii) constrain only the distribution of (T, T*, z). Under these latter three assumptions, we show how to construct a valid joint distribution for (T, T*, z) that is compatible with the observed distribution of (T, z) for any (α_0, α_1) satisfying α_0 ≤ p_k ≤ 1 − α_1. Our second step shows how to construct a valid conditional distribution for y given (T, T*, z) under Assumptions 2.1 (i) and (iii) that is compatible with the observed conditional distribution of y given (T, z) for any (c, β, α_0, α_1) satisfying E[y | z = k] = c + β(p_k − α_0)/(1 − α_0 − α_1). Combining the two steps gives the required joint distribution for (y, T*, T, z).

For the first step, we need to construct a valid joint probability mass function p(T*, T, z) with support set {0, 1} × {0, 1} × {0, 1}. By Assumption 2.2 (i), p(T | T*, z) = p(T | T*) and hence

p(T*, T, z) = p(T | T*) p(T* | z) p(z).

Since p(z) is observed, to construct a valid joint probability mass function p(T*, T, z) it suffices to construct valid conditional probability mass functions p(T | T*) and p(T* | z). Since α_0 ≤ p_k ≤ 1 − α_1, both α_0 and α_1 are guaranteed to lie between zero and one. This gives a valid construction of p(T | T*). Moreover the corresponding values of p*_k implied by Lemma 2.1 are also guaranteed to lie between zero and one. This gives a valid construction of p(T* | z) that satisfies Assumption 2.1 (ii), since p_0 ≠ p_1 by assumption and (p_1 − p_0) = (p*_1 − p*_0)(1 − α_0 − α_1) by Lemma 2.1. Because our construction relies on Lemma 2.1, which is simply an application of the law of total probability, the resulting distribution p(T, T*, z) is automatically compatible with p(T, z) = p(T | z) p(z).

For the second step, we need to construct a valid conditional distribution for y given (T, T*, z). To begin we define the following notation:

r_tk ≡ P(T* = 1 | T = t, z = k)
F_k(τ) ≡ P(y ≤ τ | z = k)
F_tk(τ) ≡ P(y ≤ τ | T = t, z = k)
F_{t*tk}(τ) ≡ P(y ≤ τ | T* = t*, T = t, z = k)
G_k(τ) ≡ P(ε ≤ τ | z = k)
G_{t*tk}(τ) ≡ P(ε ≤ τ | T* = t*, T = t, z = k).

Assumption 2.1 (i) imposes a relationship between G_{t*tk} and F_{t*tk} for each t*, namely

G_{0tk}(τ) = F_{0tk}(τ + c),  G_{1tk}(τ) = F_{1tk}(τ + c + β)    (A.1)

and thus we see that

G_k(τ) = r_{1k} p_k F_{11k}(τ + c + β) + r_{0k}(1 − p_k) F_{10k}(τ + c + β) + (1 − r_{1k}) p_k F_{01k}(τ + c) + (1 − r_{0k})(1 − p_k) F_{00k}(τ + c)    (A.2)

applying the law of total probability and Bayes' rule. Moreover,

F_tk(τ) = r_tk F_{1tk}(τ) + (1 − r_tk) F_{0tk}(τ)    (A.3)

for all t, k ∈ {0, 1}, and by Bayes' rule,

r_{1k} = (1 − α_1) p*_k / p_k,  r_{0k} = α_1 p*_k / (1 − p_k).    (A.4)

There are four cases, corresponding to different possibilities for the r_tk. The first case violates one of our model assumptions. For each of the remaining cases, we show that it is possible to construct the required distributions F_{0tk}, F_{1tk} under Assumptions 2.1 (i) and (iii) for any (c, β, α_0, α_1) such that E(y | z = k) = c + β(p_k − α_0)/(1 − α_0 − α_1).

Case I: r_{1k} = 0, r_{0k} ≠ 0. By Equation A.4 this requires α_1 = 1, violating Assumption 2.2 (ii).

Case II: r_{1k} = r_{0k} = 0. By Equation A.4, this requires p*_k = 0 which in turn requires p_k = α_0. By Equation A.3 we have F_tk = F_{0tk}, while F_{1tk} is unrestricted. Substituting into A.2,

G_k(τ) = p_k F_{01k}(τ + c) + (1 − p_k) F_{00k}(τ + c) = F_k(τ + c).

Now, since F_k(τ + c) is the conditional CDF of y − c given that z = k, and G_k is the conditional CDF of ε given z = k, we see that Assumption 2.1 (iii) is satisfied if and only if E(y | z = k) = c, which is equal to c + β(p_k − α_0)/(1 − α_0 − α_1) since p_k − α_0 = 0.

Case III: r_{0k} = 0, r_{1k} ≠ 0. By Equation A.4 this requires α_1 = 0 and p*_k ≠ 0. By Equation A.3 we have F_0k = F_{00k} and, since r_{1k} ≠ 0, we can solve to obtain

F_{11k}(τ) = (1/r_{1k}) [F_1k(τ) − (1 − r_{1k}) F_{01k}(τ)].

Substituting into Equation A.2, we obtain

G_k(τ) = [(1 − p_k) F_0k(τ + c) + p_k F_1k(τ + c + β)] + p_k(1 − r_{1k}) [F_{01k}(τ + c) − F_{01k}(τ + c + β)].

Now, F_0k(τ + c) is the conditional CDF of (y − c) given (T = 0, z = k) while F_1k(τ + c + β) is the conditional CDF of (y − c − β) given (T = 1, z = k). Similarly, F_{01k}(τ + c) is the conditional CDF of ε given (T* = 0, T = 1, z = k) while F_{01k}(τ + c + β) is the conditional CDF of (ε − β) given (T* = 0, T = 1, z = k). Since G_k(τ) is the conditional CDF of ε given z = k, we see that Assumption 2.1 (iii) is satisfied if and only if

0 = (1 − p_k) E(y − c | T = 0, z = k) + p_k E(y − c − β | T = 1, z = k) + p_k(1 − r_{1k}) [E(ε | T* = 0, T = 1, z = k) − E(ε − β | T* = 0, T = 1, z = k)].

Rearranging, this is equivalent to

E(y | z = k) = c + β (p_k − α_0)/(1 − α_0) = c + β (p_k − α_0)/(1 − α_0 − α_1)

since α_1 = 0 in this case. As explained above, F_0k = F_{00k} in the present case while F_{10k} is undefined. We are free to choose any distributions for F_{01k} and F_{11k} that satisfy Equation A.3, for example F_{01k} = F_{11k} = F_1k.

Case IV: r_{1k} ≠ 0, r_{0k} ≠ 0. In this case, we can solve Equation A.3 to obtain

F_{1tk}(τ) = (1/r_tk) [F_tk(τ) − (1 − r_tk) F_{0tk}(τ)].

Substituting this into Equation A.2, we have

G_k(τ) = F_k(τ + c + β) + p_k(1 − r_{1k}) [F_{01k}(τ + c) − F_{01k}(τ + c + β)] + (1 − p_k)(1 − r_{0k}) [F_{00k}(τ + c) − F_{00k}(τ + c + β)]

using the fact that F_k(τ) = p_k F_1k(τ) + (1 − p_k) F_0k(τ). Now, F_k(τ + c + β) is the conditional CDF of (y − c − β) given z = k, while F_{0tk}(τ + c) is the conditional CDF of ε given (T* = 0, T = t, z = k) and F_{0tk}(τ + c + β) is the conditional CDF of (ε − β) given (T* = 0, T = t, z = k). Since G_k(τ) is the conditional CDF of ε given z = k, we see that Assumption 2.1 (iii) is satisfied if and only if

0 = E[y − c − β | z = k] + p_k(1 − r_{1k}) [E(ε | T* = 0, T = 1, z = k) − E(ε − β | T* = 0, T = 1, z = k)] + (1 − p_k)(1 − r_{0k}) [E(ε | T* = 0, T = 0, z = k) − E(ε − β | T* = 0, T = 0, z = k)]

equivalently, 0 = E[y − c − β | z = k] + β [p_k(1 − r_{1k}) + (1 − p_k)(1 − r_{0k})]. But since [p_k(1 − r_{1k}) + (1 − p_k)(1 − r_{0k})] = (1 − p*_k) and p*_k = (p_k − α_0)/(1 − α_0 − α_1), this becomes

E[y | z = k] = c + β (p_k − α_0)/(1 − α_0 − α_1).

Thus, in this case we are free to choose any distributions for F_{0tk} and F_{1tk} that satisfy Equation A.3. For example we could take F_{0tk} = F_{1tk} = F_tk.
The result follows by substituting the largest and smallest possible values for α_0 + α_1 and taking the difference of the expressions for E[y | z = k].

Proof of Theorem 2.2.
The only difference between the conditions of Theorem 2.1 and those of Theorem 2.2 is that the latter imposes Assumption 2.2 (iii) while the former does not. Accordingly, the present argument builds on the proof of Theorem 2.1 and relies on the notation defined within it. Under Assumption 2.1 (i), Assumption 2.2 (iii) is equivalent to E[y | T, T*, z] = E[y | T*, z]. Hence, non-differential measurement error constrains only the conditional distribution of y given (T, T*, z). For this reason, we need only revisit the second step of the proof of Theorem 2.1. Consider a point (c, β, α_0, α_1) that satisfies Equation 3 and α_0 ≤ p_k ≤ 1 − α_1 for all k. Since this point lies in the identified set from Theorem 2.1, it suffices to determine whether there exist valid conditional CDFs F_{0tk}, F_{1tk} such that F_tk = (1 − r_tk) F_{0tk} + r_tk F_{1tk} for all t, k and E[y | T, T*, z] = E[y | T*, z].

Let μ_{t*tk} ≡ E[y | T = t, z = k, T* = t*], μ_tk ≡ E[y | T = t, z = k], and μ*_{t*k} ≡ E[y | z = k, T* = t*]. By Assumption 2.2 (iii), μ_{t*tk} = μ*_{t*k} for t* = 0, 1. Hence, by iterated expectations,

μ_0k = (1 − r_{0k}) μ*_0k + r_{0k} μ*_1k
μ_1k = (1 − r_{1k}) μ*_0k + r_{1k} μ*_1k.

Now, (μ_0k, μ_1k) are observed while r_{0k} and r_{1k} depend only on the observed first-stage probability p_k and the mis-classification probabilities (α_0, α_1). Thus, at a given point (c, β, α_0, α_1) in the identified set from Theorem 2.1 the preceding equations form a linear system in μ*_0k and μ*_1k. After some algebra, we find that the determinant is

r_{1k} − r_{0k} = [ (p_k − α_0)/(1 − α_0 − α_1) ] [ (1 − p_k − α_1)/(p_k (1 − p_k)) ].

Suppose first that r_{1k} = r_{0k} = r so the determinant condition fails. This occurs if and only if α_0 = p_k or α_1 = 1 − p_k. If μ_0k ≠ μ_1k, the system is inconsistent: no solution for (μ*_0k, μ*_1k) exists. Hence α_0 = p_k and α_1 = 1 − p_k are excluded from the identified set under non-differential measurement error so long as μ_0k ≠ μ_1k. If instead μ_0k = μ_1k = μ, the system is consistent but rank deficient: any pair (μ*_0k, μ*_1k) such that μ = (1 − r) μ*_0k + r μ*_1k is a solution and hence satisfies the assumption of non-differential measurement error. One such solution is μ*_0k = μ*_1k = μ so we are free to set F_{00k} = F_{10k} = F_0k and F_{01k} = F_{11k} = F_1k. Hence, if μ_0k = μ_1k then α_0 = p_k lies within the sharp identified set if p_k < p_ℓ and α_1 = 1 − p_k lies in the sharp identified set if p_ℓ < p_k.

Now suppose that r_{1k} ≠ r_{0k}, which occurs if and only if α_0 ≠ p_k and α_1 ≠ 1 − p_k. In this case the system has a unique solution, namely

μ*_0k = (r_{1k} μ_0k − r_{0k} μ_1k)/(r_{1k} − r_{0k}) = [(1 − p_k) E(y | T = 0, z = k) − α_1 E(y | z = k)] / (1 − p_k − α_1)
μ*_1k = [(μ_0k − μ_1k) + (r_{1k} μ_0k − r_{0k} μ_1k)] / (r_{1k} − r_{0k}) = [p_k E(y | T = 1, z = k) − α_0 E(y | z = k)] / (p_k − α_0).

Since μ_{10k} = μ_{11k} = μ*_1k and μ_{00k} = μ_{01k} = μ*_0k under non-differential measurement error, the mis-classification probabilities (α_0, α_1) combined with the observable moments completely determine the means of F_{1tk} and F_{0tk} whenever the determinant condition holds. If μ_0k = μ_1k then μ*_0k = μ*_1k so we are free to set F_{00k} = F_{10k} = F_0k and F_{01k} = F_{11k} = F_1k. Combining this with the reasoning from the preceding paragraph, we see that Assumption 2.2 (iii) imposes no additional restrictions for any k such that μ_0k = μ_1k. Accordingly, for the remainder of the proof we consider only the case in which μ_0k ≠ μ_1k. Given (α_0, α_1): r_tk, μ*_0k, and μ*_1k are fixed. The question is whether, for a given pair (α_0, α_1) and observed CDFs F_tk, we can construct valid CDFs F_{0tk}, F_{1tk} such that

∫ τ F_{1tk}(dτ) = μ*_1k,  ∫ τ F_{0tk}(dτ) = μ*_0k,  F_tk(τ) = r_tk F_{1tk}(τ) + (1 − r_tk) F_{0tk}(τ).

For a given pair (t, k), there are two cases: 0 < r_tk < 1 or r_tk ∈ {0, 1}.

Case I: r_tk ∈ {0, 1}. If r_tk = 1 then μ*_1k = μ_tk so we can set F_{1tk} = F_tk. In this case F_{0tk} is unrestricted. Analogously, if r_tk = 0, μ*_0k = μ_tk so we can set F_{0tk} = F_tk with F_{1tk} unrestricted.

Case II: 0 < r_tk < 1. Define the function μ_tk(ξ) = E[y | y ∈ I_tk(ξ), T = t, z = k] and the closed interval I_tk(ξ) = [F⁻¹_tk(1 − ξ − r_tk), F⁻¹_tk(1 − ξ)] where 0 ≤ ξ ≤ 1 − r_tk. The function μ_tk is decreasing in ξ, attaining its maximum μ̄_tk at ξ = 0 and its minimum μ̲_tk at ξ = 1 − r_tk.

Suppose first that μ*_1k does not lie in the interval [μ̲_tk, μ̄_tk]. We show that it is impossible to construct valid CDFs F_{1tk} and F_{0tk} that satisfy F_tk(τ) = r_tk F_{1tk}(τ) + (1 − r_tk) F_{0tk}(τ). Since r_tk ≠ 1, we can solve the expression for F_{0tk} to yield F_{0tk}(τ) = [F_tk(τ) − r_tk F_{1tk}(τ)]/(1 − r_tk). Hence, since r_tk ≠ 0, the requirement that 0 ≤ F_{0tk}(τ) ≤ 1 yields

[F_tk(τ) − (1 − r_tk)]/r_tk ≤ F_{1tk}(τ) ≤ F_tk(τ)/r_tk.    (A.5)

Now define F̄_{1tk}(τ) = min{1, F_tk(τ)/r_tk} and F̲_{1tk}(τ) = max{0, [F_tk(τ) − (1 − r_tk)]/r_tk}. By combining Equation A.5 with 0 ≤ F_{1tk}(τ) ≤ 1, we obtain F̲_{1tk}(τ) ≤ F_{1tk}(τ) ≤ F̄_{1tk}(τ). Thus, F̲_{1tk} first-order stochastically dominates F_{1tk}, which first-order stochastically dominates F̄_{1tk}. Hence,

∫ τ F̄_{1tk}(dτ) ≤ ∫ τ F_{1tk}(dτ) ≤ ∫ τ F̲_{1tk}(dτ).

But notice that

∫ τ F̄_{1tk}(dτ) = μ̲_tk,  ∫ τ F_{1tk}(dτ) = μ*_1k,  ∫ τ F̲_{1tk}(dτ) = μ̄_tk

so we have μ̲_tk ≤ μ*_1k ≤ μ̄_tk, which contradicts μ*_1k ∉ [μ̲_tk, μ̄_tk].

Now suppose that μ*_1k ∈ [μ̲_tk, μ̄_tk]. We show how to construct densities f_{1tk} and f_{0tk} that yield CDFs F_{1tk}, F_{0tk} satisfying the requirements described above. Since the conditional distribution of y given (T, z) is continuous, μ_tk(ξ) is continuous on its domain and takes on all values in [μ̲_tk, μ̄_tk] by the intermediate value theorem. Thus, there exists a ξ* such that μ_tk(ξ*) = μ*_1k. Let f_tk(τ) = dF_tk(τ)/dτ, which is non-negative by the assumption that y is continuously distributed. Now, define

f_{1tk}(τ) = f_tk(τ) 1{τ ∈ I_tk(ξ*)} / r_tk,  f_{0tk}(τ) = f_tk(τ) 1{τ ∉ I_tk(ξ*)} / (1 − r_tk).

Clearly f_{1tk} ≥ 0 and f_{0tk} ≥ 0. Integrating,

∫_R f_{1tk}(τ) dτ = (1/r_tk) ∫_{I_tk(ξ*)} f_tk(τ) dτ = 1,  ∫_R f_{0tk}(τ) dτ = (1/(1 − r_tk)) ∫_{I^C_tk(ξ*)} f_tk(τ) dτ = 1

where I^C_tk is the complement of I_tk. By construction,

r_tk ∫_A f_{1tk}(τ) dτ + (1 − r_tk) ∫_A f_{0tk}(τ) dτ = ∫_A f_tk(τ) dτ

for any set A. Finally,

∫_R τ f_{1tk}(τ) dτ = (1/r_tk) ∫_{I_tk(ξ*)} τ f_tk(τ) dτ = μ_tk(ξ*) = μ*_1k.

A.2 Point Identification Results

In the proofs of Lemma 2.3, Lemma 2.4, and Theorem 2.3, we employ the shorthand π ≡ Cov(T, z), η_j ≡ Cov(y^j, z) for j = 1, 2, 3, and τ_j ≡ Cov(T y^j, z) for j = 1, 2. Hence Lemma 2.2 becomes η_1 = πθ_1, while Lemma 2.3 becomes η_2 = 2τ_1θ_1 − πθ_2, and Lemma 2.4 becomes η_3 = 3τ_2θ_1 − 3τ_1θ_2 + πθ_3.

Proof of Lemma 2.3.
By Assumption 2.1 (i) and the basic properties of covariance,

η_2 = β² Cov(T*, z) + 2β [c Cov(T*, z) + Cov(T*ε, z)] + 2c Cov(ε, z) + Cov(ε², z)
τ_1 = cπ + Cov(Tε, z) + β Cov(TT*, z)

using the fact that T* is binary. Now, by Assumptions 2.1 (iii) and 2.5 we have Cov(ε, z) = Cov(ε², z) = 0. And, using Assumptions 2.2 (i) and (ii), one can show that Cov(TT*, z) = (1 − α_1) Cov(T*, z) and Cov(T*, z) = π/(1 − α_0 − α_1). Hence,

η_2 = θ_1(β + 2c)π + 2β Cov(T*ε, z)
2τ_1θ_1 − πθ_2 = [2θ_1 c + 2θ_1²(1 − α_1) − θ_2] π + 2θ_1 Cov(Tε, z)

but since θ_2 = θ_1²[(1 − α_1) + α_0], we see that [2θ_1²(1 − α_1) − θ_2] = θ_1² (1 − α_0 − α_1) = θ_1 β. Thus, it suffices to show that β Cov(T*ε, z) = θ_1 Cov(Tε, z). This equality is trivially satisfied when β = 0, so suppose that β ≠ 0. In this case it suffices to show that (1 − α_0 − α_1) Cov(T*ε, z) = Cov(Tε, z). Define m*_tk = E[ε | T* = t, z = k], p*_k = P(T* = 1 | z = k), and q ≡ P(z = 1). Then, by iterated expectations, Bayes' rule, and Assumption 2.2 (iii),

Cov(T*ε, z) = q(1 − q)(p*_1 m*_11 − p*_0 m*_10)
Cov(Tε, z) = q(1 − q){ (1 − α_1)[p*_1 m*_11 − p*_0 m*_10] + α_0[(1 − p*_1) m*_01 − (1 − p*_0) m*_00] }.

But by Assumption 2.1 (iii), E[ε | z = k] = m*_1k p*_k + m*_0k (1 − p*_k) = 0 and thus we obtain m*_0k(1 − p*_k) = −m*_1k p*_k. Therefore (1 − α_0 − α_1) Cov(T*ε, z) = Cov(Tε, z) as required.
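The final equality in this proof, Cov(Tε, z) = (1 − α_0 − α_1) Cov(T*ε, z), is easy to confirm by Monte Carlo under the maintained assumptions. A quick sketch (design values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, a0, a1 = 1_000_000, 0.1, 0.2
z = rng.binomial(1, 0.5, n)
u = rng.normal(0, 1, n)
t_star = (0.3 * z + 0.5 * u + rng.normal(0, 1, n) > 0.5).astype(int)
eps = u + rng.normal(0, 1, n)                 # E[eps | z] = 0, Cov(T*, eps) != 0
t = np.where(t_star == 1, 1 - rng.binomial(1, a1, n), rng.binomial(1, a0, n))

cov = lambda a, b: np.cov(a, b)[0, 1]
print((1 - a0 - a1) * cov(t_star * eps, z))   # these two numbers agree
print(cov(t * eps, z))                        # up to simulation error
```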
Proof of Lemma 2.4.
Since T* is binary, it follows from the basic properties of covariance that

η_3 = Cov[(c + ε)³, z] + 3β Cov[(c + ε)² T*, z] + 3β² Cov[(c + ε) T*, z] + β³ Cov(T*, z)
τ_2 = Cov[(c + ε)² T, z] + 2β Cov[(c + ε) TT*, z] + β² Cov(TT*, z).

By Assumptions 2.1 (iii), 2.5, and 2.6 (ii), Cov[(c + ε)³, z] = 0. Expanding,

η_3 = 3β Cov(T*ε², z) + (3β² + 6cβ) Cov(T*ε, z) + (β³ + 3cβ² + 3c²β) Cov(T*, z)
τ_2 = c² Cov(T, z) + β(β + 2c) Cov(TT*, z) + Cov(Tε², z) + 2c Cov(Tε, z) + 2β Cov(TT*ε, z).

Now, define s*_tk = E[ε² | T* = t, z = k] and recall p*_k = P(T* = 1 | z = k). By iterated expectations, Bayes' rule, and Assumption 2.6 (i),

Cov(T*ε², z) = q(1 − q)(p*_1 s*_11 − p*_0 s*_10)
Cov(Tε², z) = q(1 − q){ (1 − α_1)[p*_1 s*_11 − p*_0 s*_10] + α_0[(1 − p*_1) s*_01 − (1 − p*_0) s*_00] }.

By Assumption 2.5, E[ε² | z = 1] = E[ε² | z = 0] and thus, by iterated expectations, we have p*_1 s*_11 − p*_0 s*_10 = −[(1 − p*_1) s*_01 − (1 − p*_0) s*_00], which implies

Cov(Tε², z) = (1 − α_0 − α_1) Cov(T*ε², z).    (A.6)

Similarly, by iterated expectations and Assumptions 2.2 (i)–(iii),

Cov(TT*ε, z) = q(1 − q)(1 − α_1)(p*_1 m*_11 − p*_0 m*_10) = (1 − α_1) Cov(T*ε, z)    (A.7)

where m*_tk is defined as in the proof of Lemma 2.3. As shown in the proof of Lemma 2.3,

Cov(TT*, z) = (1 − α_1) Cov(T*, z),  Cov(T*, z) = π/(1 − α_0 − α_1),  Cov(T*ε, z) = Cov(Tε, z)/(1 − α_0 − α_1)

and combining these equalities with Equations A.6 and A.7, it follows that

τ_2 = (1 − α_0 − α_1) Cov(T*ε², z) + 2[(1 − α_1)(c + β) − cα_0] Cov(T*ε, z) + [(1 − α_1)(c + β)² − c²α_0] Cov(T*, z)
τ_1 = (1 − α_0 − α_1) Cov(T*ε, z) + [(1 − α_1)(c + β) − cα_0] Cov(T*, z)

using τ_1 = cπ + Cov(Tε, z) + β Cov(TT*, z) as shown in the proof of Lemma 2.3. Thus,

3τ_2θ_1 − 3τ_1θ_2 + πθ_3 = K_1 Cov(T*ε², z) + K_2 Cov(T*ε, z) + K_3 Cov(T*, z)

where K_1 ≡ 3θ_1(1 − α_0 − α_1) = 3β and

K_2 ≡ 6θ_1[(1 − α_1)(c + β) − cα_0] − 3θ_2(1 − α_0 − α_1)
K_3 ≡ 3θ_1[(1 − α_1)(c + β)² − c²α_0] − 3θ_2[(1 − α_1)(c + β) − cα_0] + θ_3(1 − α_0 − α_1).

Substituting the definitions of θ_1, θ_2, and θ_3 from Equations 4–6, tedious but straightforward algebra shows that K_2 = 3β² + 6cβ and K_3 = β³ + 3cβ² + 3c²β. Therefore the coefficients of η_3 equal those of 3τ_2θ_1 − 3τ_1θ_2 + πθ_3 and the result follows.
Collecting the results of Lemmas 2.2–2.4, we have

η_1 = πθ_1,  η_2 = 2τ_1θ_1 − πθ_2,  η_3 = 3τ_2θ_1 − 3τ_1θ_2 + πθ_3

which is a linear system in θ_1, θ_2, θ_3 with determinant −π³. Since π ≠ 0 by Assumption 2.1 (ii), θ_1, θ_2 and θ_3 are identified. Now, so long as β ≠ 0, we can rearrange Equations 5 and 6 to obtain

A = θ_2/θ_1² = 1 + (α_0 − α_1)    (A.8)
B = θ_3/θ_1³ = (1 − α_0 − α_1)² + 6α_0(1 − α_1).    (A.9)

Equation A.8 gives (1 − α_1) = A − α_0. Hence (1 − α_0 − α_1) = A − 2α_0 and α_0(1 − α_1) = α_0(A − α_0). Substituting into Equation A.9 and simplifying, (A² − B) + 2Aα_0 − 2α_0² = 0. Substituting for (1 − α_1) analogously yields a quadratic in (1 − α_1) with identical coefficients. It follows that one root of (A² − B) + 2Ar − 2r² = 0 is α_0 and the other is 1 − α_1. Solving,

r = [A ± √(3A² − 2B)] / 2 = (1/(2θ_1²)) ( θ_2 ± √(3θ_2² − 2θ_1θ_3) ).    (A.10)

Substituting Equations 5 and 6, simple algebra shows that 3θ_2² − 2θ_1θ_3 = θ_1⁴(1 − α_0 − α_1)². This quantity is strictly greater than zero since θ_1 ≠ 0 and α_0 + α_1 ≠ 1. It follows that both roots of the quadratic are real. Moreover, 3θ_2²/θ_1⁴ − 2θ_3/θ_1³ identifies (1 − α_0 − α_1)². Substituting into Equation 4, it follows that β is identified up to sign. If α_0 + α_1 < 1, sign(β) = sign(θ_1) so that both the sign and magnitude of β are identified. If α_0 + α_1 < 1, 1 − α_1 > α_0, so (1 − α_1) is the larger root of (A² − B) + 2Ar − 2r² = 0 and α_0 is the smaller root.

B Comment on Mahajan (2006)
Expanding on our discussion from Section 2.2 above, we now show that Mahajan's identification argument for an endogenous regressor in an additively separable model (his Section A.2) is incorrect. Unless otherwise indicated, all notation used below is as defined in Section 2.

The first step of Mahajan (2006) A.2 argues (correctly) that under Assumptions 2.1 and 2.2 (i)–(ii), knowledge of α_0(x) and α_1(x) is sufficient to identify β(x). This step is equivalent to our Lemma 2.2 above. The second step appeals to Mahajan (2006) Theorem 1 to argue that α_0(x) and α_1(x) are indeed point identified. To understand the logic of this second step, we first re-state Mahajan (2006) Theorem 1 in our notation. As in Section 2 above, T* denotes an unobserved binary random variable, z is a binary instrument, T an observed binary surrogate for T*, y an outcome of interest, and x a vector of covariates.

Assumption B.1 (Mahajan (2006) Theorem 1). Define g(T*, x) ≡ E[y | x, T*] and v ≡ y − g(T*, x). Suppose that knowledge of (y, T*, x) is sufficient to identify g and that:
(i) P(T* = 1 | x, z = 0) ≠ P(T* = 1 | x, z = 1);
(ii) T is conditionally independent of z given (x, T*);
(iii) α_0(x) + α_1(x) < 1;
(iv) E[v | x, z, T*, T] = 0;
(v) g(1, x) ≠ g(0, x).

Theorem B.1 (Mahajan (2006) Theorem 1). Under Assumption B.1, α_0(x) and α_1(x) are point identified, as is g(T*, x).

Assumption B.1 (i) is equivalent to our Assumption 2.1 (ii), while Assumptions B.1 (ii)–(iii) are equivalent to our Assumptions 2.2 (i)–(ii). Assumption B.1 (v) serves the same purpose as β(x) ≠ 0 in our Theorem 2.3: unless T* affects y, we cannot identify the mis-classification probabilities. The key difference between Theorem B.1 and the setting we consider in Section 2 comes from Assumption B.1 (iv). This is essentially a stronger version of our Assumptions 2.1 (iii) and 2.2 (iii) but applies to the projection error v, defined in Assumption B.1, rather than the structural error ε, defined in Assumption 2.1 (i). Accordingly, Theorem B.1 identifies the conditional mean function g rather than the causal effect β(x).

Although the meaning of the error term changes when we move from a structural to a reduced form model, the meaning of the mis-classification error rates does not: α_0(x) and α_1(x) are simply conditional probabilities for T given (T*, x). Step 2 of Mahajan (2006) A.2 relies on this insight. The idea is to find a way to satisfy Assumption B.1 (iv) simultaneously with Assumptions 2.1 (iii) and 2.2 (iii), while allowing T* to be endogenous. If this can be achieved, α_0(x), α_1(x) will be identified via Theorem B.1, and identification of β(x) will follow from step 1 of A.2 (our Lemma 2.2). To this end, Mahajan (2006) invokes the condition

E(y | x, z, T*, T) = E(y | x, T*).    (B.1)

Because Mahajan (2006) A.2 assumes an additively separable model – our Assumption 2.1 (i) – we see that

E(y | x, z, T*, T) = c(x) + β(x)T* + E(ε | x, z, T*, T)

so Equation B.1 is equivalent to E(ε | x, z, T*, T) = E(ε | x, T*). Note that this allows T* to be endogenous, as it does not require E(ε | x, T*) = 0. Now, applying Equation B.1 to the definition of v from Assumption B.1, we have

E(v | x, z, T*, T) = E[y − E(y | x, T*) | x, z, T*, T] = 0

which satisfies Assumption B.1 (iv) as required. Based on this reasoning, Mahajan (2006) claims that Equation B.1, along with Assumptions B.1 (iv), 2.1, and 2.2 (i)–(ii), suffices to identify the effect β(x) of an endogenous T*, so long as g(1, x) ≠ g(0, x). As we now show, however, these assumptions are contradictory unless T* is exogenous.

By Equation B.1 and Assumption 2.1 (i), E(ε | x, z, T*, T) = E(ε | x, T*) and thus by iterated expectations, we obtain

E(ε | x, T*, z) = E_{T | x, T*, z}[E(ε | x, T*, T, z)] = E_{T | x, T*, z}[E(ε | x, T*)] = E(ε | x, T*).    (B.2)

Now, let m*_tk(x) = E(ε | x, T* = t, z = k). Using this notation, Equation B.2 is equivalent to m*_t1(x) = m*_t0(x) for t = 0, 1. Combining iterated expectations with Assumption 2.1 (iii),

E(ε | x, z = k) = [1 − p*_k(x)] m*_0k(x) + p*_k(x) m*_1k(x) = 0    (B.3)

for k = 0, 1, where p*_k(x) ≡ P(T* = 1 | x, z = k). But substituting m*_t1(x) = m*_t0(x) ≡ m*_t(x) into Equation B.3 for k = 0, 1, we obtain

[1 − p*_0(x)] m*_0(x) + p*_0(x) m*_1(x) = 0
[1 − p*_1(x)] m*_0(x) + p*_1(x) m*_1(x) = 0.

The preceding two equalities are convex combinations of m*_0 and m*_1. The only way that both can equal zero simultaneously is if either p*_0(x) = p*_1(x), contradicting Assumption 2.1 (ii), or if m*_tk(x) = 0 for all (t, k), which implies that T* is exogenous. Hence Mahajan (2006) A.2 fails: given the assumption that z is a valid instrument for ε, Equation B.1 implies that either there is no first-stage relationship between z and T* or that T* is exogenous. The root of the problem with A.2 is the attempt to use one instrument to satisfy both the assumptions of Theorem B.1 and Lemma 2.2. If one had access to a second instrument w, or equivalently a second mis-measured surrogate for T*, that satisfied Assumption B.1, one could use w to recover α_0(x) and α_1(x) via Theorem B.1 and z to recover the IV estimand β(x)/[1 − α_0(x) − α_1(x)] via Lemma 2.2.
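The contradiction at the heart of this appendix is a two-equation linear system, so it can be displayed numerically: with a genuine first stage, the only (m*_0, m*_1) consistent with Equation B.3 is the zero vector. A minimal sketch (values illustrative):

```python
import numpy as np

# Equation B.3 with m*_t constant in z (Equation B.2):
#   (1 - p*_0) m0 + p*_0 m1 = 0
#   (1 - p*_1) m0 + p*_1 m1 = 0
p0, p1 = 0.4, 0.6                       # a genuine first stage: p*_0 != p*_1
M = np.array([[1 - p0, p0], [1 - p1, p1]])
print(np.linalg.det(M))                 # = p*_1 - p*_0 != 0, so M is invertible
print(np.linalg.solve(M, [0.0, 0.0]))   # m0 = m1 = 0: T* must be exogenous
```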
While allowing for arbitrary observed heterogeneity through the covariates x , all of the results pre-sented above assume an additively separable model – Assumption 2.1 (i). In this section we brieflydiscuss how our partial identification results can be interpreted in a local average treatment effects(LATE) setting. For simplicity, we suppress explicit conditioning on the covariates x throughout.In lieu of Assumption 2.1 (i), consider a non-separable model of the form y = h ( T ∗ , z, ε ). Let T ∗ ( z ) denote an individual’s potential treatment and Y ( t ∗ , z ) denote her potential outcome, where t ∗ , z ∈ { , } . Using this notation we can write Y ( t ∗ , z ) = h ( t ∗ , z, ε ). Let J ∈ { a, c, d, n } index the our LATE principal strata: a = always-taker, c = complier, d = defier, and n = never-taker. If J = a , then T ∗ ( z ) = 1; if J = c , then T ∗ ( z ) = z ; if J = d , then T ∗ ( z ) = 1 − z ; and if J = n , then T ∗ ( z ) = 0. In a LATE model, Assumption 2.1 (iii) is replaced by the standard LATE assumptions: Assumption C.1 (Unconfounded Type) . P ( J = j | z = 1) = P ( J = j | z = 0) for all j ∈ { a, c, d, n } . Assumption C.2 (Mean Exclusion Restriction) . For all t ∗ ∈ { , } and j ∈ { a, c, d, n } , E [ Y ( t ∗ , | T ∗ = t ∗ , z = 1] = E [ Y ( t ∗ , | T ∗ = t ∗ , z = 1] = E [ Y ( t ∗ ) | J = j ] . Assumption C.3 (Monotonicity) . P (cid:0) T ∗ (1) ≥ T ∗ (0) (cid:1) = 1As is well known, Assumption 2.1 (iii) combined with the preceding three conditions impliesthat the instrumental variables estimand based on T ∗ identifies the average treatment effect amongcompliers: E [ y | z = 1] − E [ y | z = 0] p ∗ − p ∗ = E [ Y (1) − Y (0) | J = c ] . The numerator of the preceding expression is observed, but under mis-classification the denominatoris not. Notice, however, that Assumptions 2.2 (i)–(ii) only concern the joint distribution of T given( T ∗ , z ). As such, they have the same meaning in a LATE model as in an additively separablemodel. Imposing these conditions, Lemma 2.1 continues to hold in a LATE model. It follows that p − p = (1 − α − α )( p ∗ − p ∗ ) so that E [ y | z = 1] − E [ y | z = 0] p − p = E [ Y (1) − Y (0) | J = c ]1 − α − α . Moreover, α ≤ p k ≤ − α for all k . Thus, the bound from Corollary 2.1 remains valid in a LATEmodel: E [ Y (1) − Y (0) | J = c ] must lie between the IV and reduced form estimands.Unlike Assumptions 2.2 (i)–(ii), Assumption 2.2 (iii), non-differential measurement error, isexplicitly stated in terms of the unobservable error term in an additively separable model. Ourderivation of the additional restrictions on ( α , α ) implied by non-differential measurement errorin the proof of Theorem 2.2, however, does not use Assumption 2.2 (iii) directly. Rather, ituses a condition that is equivalent to it in an additively separable model, namely E [ Y | T ∗ , T, z ] = E [ Y | T ∗ , z ]. Hence, as long as this equality holds, regardless of whether one is in an additivelyseparable model or a LATE model, the bounds on ( α , α ) from Theorem 2.2 remain valid. Since Y = (1 − T ∗ ) Y (0) + T ∗ Y (1), the appropriate modification of Assumption 2.2 (iii) is as follows. Assumption C.4 (Non-differential Measurement Error) . E [ Y (0) | T ∗ , T, z ] = E [ Y (0) | T ∗ , z ] and E [ Y (1) | T ∗ , T, z ] = E [ Y (1) | T ∗ , z ]To summarize, if one wishes to re-interpret our parameter β as a local average treatment effect,the partial identification bounds from Theorems 2.1 and 2.2 above remain valid. 
To summarize, if one wishes to re-interpret our parameter $\beta$ as a local average treatment effect, the partial identification bounds from Theorems 2.1 and 2.2 above remain valid, provided that Assumption 2.1 (i) is replaced by $y = h(T^*, z, \varepsilon)$, Assumption 2.1 (iii) is replaced by Assumptions C.1–C.3, and Assumption 2.2 (iii) is replaced by Assumption C.4. In a LATE model, however, our proofs of sharpness no longer apply, as they do not consider the testable implications of the LATE assumptions themselves. For partial identification results that consider these implications but do not impose non-differential measurement error, see Ura (2018). For discussion of the testable implications of a LATE model, see Kitagawa (2015).

References

Aigner, D. J., 1973. Regression with a binary independent variable subject to errors of observation. Journal of Econometrics 1, 49–60.
Andrews, D. W., Soares, G., 2010. Inference for parameters defined by moment inequalities using generalized moment selection. Econometrica 78 (1), 119–157.
Angrist, J. D., 1990. Lifetime earnings and the Vietnam era draft lottery: evidence from Social Security administrative records. The American Economic Review, 313–336.
Battistin, E., Nadai, M. D., Sianesi, B., 2014. Misreported schooling, multiple measures and returns to educational qualifications. Journal of Econometrics 181 (2), 136–150.
Black, D. A., Berger, M. C., Scott, F. A., 2000. Bounding parameter estimates with nonclassical measurement error. Journal of the American Statistical Association 95 (451), 739–748.
Bollinger, C. R., 1996. Bounding mean regressions when a binary regressor is mismeasured. Journal of Econometrics 73, 387–399.
Bollinger, C. R., van Hasselt, M., 2015. Bayesian moment-based inference in a regression model with misclassification error. Working paper.
Bound, J., Brown, C., Mathiowetz, N., 2001. Measurement error in survey data. In: Handbook of Econometrics. Vol. 5. Elsevier, pp. 3705–3843.
Carroll, R. J., Ruppert, D., Crainiceanu, C. M., Stefanski, L. A., 2006. Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC.
Chen, X., Hong, H., Tamer, E., 2005. Measurement error models with auxiliary data. The Review of Economic Studies 72 (2), 343–366.
Chen, X., Hu, Y., Lewbel, A., 2008a. Nonparametric identification of regression models containing a misclassified dichotomous regressor with instruments. Economics Letters 100, 381–384.
Chen, X., Hu, Y., Lewbel, A., 2008b. A note on the closed-form identification of regression models with a mismeasured binary regressor. Statistics & Probability Letters 78 (12), 1473–1479.
DiTraglia, F. J., García-Jimeno, C., 2017. Mis-classified, binary, endogenous regressors: Identification and inference. Tech. rep., NBER Working Paper.
Hu, Y., Shennach, S. M., 2008. Instrumental variable treatment of nonclassical measurement error models. Econometrica 76 (1), 195–216.
Hu, Y., Shiu, J.-L., Woutersen, T., 2015. Identification and estimation of single-index models with measurement error and endogeneity. The Econometrics Journal 18 (3), 347–362.
Imbens, G. W., Rubin, D. B., 1997. Estimating outcome distributions for compliers in instrumental variables models. The Review of Economic Studies 64 (4), 555–574.
Kane, T. J., Rouse, C. E., Staiger, D., 1999. Estimating returns to schooling when schooling is misreported. Tech. rep., National Bureau of Economic Research, NBER Working Paper 7235.
Kitagawa, T., 2015. A test for instrument validity. Econometrica 83 (5), 2043–2063.
Kreider, B., Pepper, J. V., Gundersen, C., Jolliffe, D., 2012. Identifying the effects of SNAP (food stamps) on child health outcomes when participation is endogenous and misreported. Journal of the American Statistical Association 107 (499), 958–975.
Lewbel, A., 2007a. Estimation of average treatment effects with misclassification. Econometrica 75 (2), 537–551.
Lewbel, A., 2007b. A local generalized method of moments estimator. Economics Letters 94, 124–128.
Mahajan, A., 2006. Identification and estimation of regression models with misclassification. Econometrica 74 (3), 631–665.
Molinari, F., 2008. Partial identification of probability distributions with misclassified data. Journal of Econometrics 144 (1), 81–117.
Nguimkeu, P., Denteh, A., Tchernis, R., 2016. On the estimation of treatment effects with endogenous misreporting. Working paper.
Shiu, J.-L., 2016. Identification and estimation of endogenous selection models in the presence of misclassification errors. Economic Modelling 52 (Part B), 507–518.
Song, S., 2015. Semiparametric estimation of models with conditional moment restrictions in the presence of nonclassical measurement errors. Journal of Econometrics 185 (1), 95–109.
Ura, T., 2018. Heterogeneous treatment effects with mismeasured endogenous treatment. Quantitative Economics 9 (3), 1335–1370.
van Hasselt, M., Bollinger, C. R., 2012. Binary misclassification and identification in regression models. Economics Letters 115, 81–84.