Decomposing Identification Gains and Evaluating Instrument Identification Power for Partially Identified Average Treatment Effects
Lina Zhang∗, David T. Frazier†, D.S. Poskitt‡, Xueyan Zhao§

September 8, 2020
Abstract
This paper studies the instrument identification power for the average treatment effect (ATE) in partially identified binary outcome models with an endogenous binary treatment. We propose a novel approach to measure the instrument identification power by their ability to reduce the width of the ATE bounds. We show that instrument strength, as determined by the extreme values of the conditional propensity score, and its interplays with the degree of endogeneity and the exogenous covariates all play a role in bounding the ATE. We decompose the ATE identification gains into a sequence of measurable components, and construct a standardized quantitative measure for the instrument identification power (IIP). The decomposition and the IIP evaluation are illustrated with finite-sample simulation studies and an empirical example of childbearing and women's labor supply. Our simulations show that the IIP is a useful tool for detecting irrelevant instruments.
Keywords: Binary Dependent Variables; Average Treatment Effect; Instrument Identification Power; Instrument Relevance; Endogeneity; Partial Identification.

∗ Corresponding author. Department of Econometrics and Business Statistics, Monash University ([email protected]).
† Department of Econometrics and Business Statistics, Monash University ([email protected]).
‡ Department of Econometrics and Business Statistics, Monash University ([email protected]).
§ Department of Econometrics and Business Statistics, Monash University ([email protected]).

1 Introduction
This paper investigates the identification power of instrumental variables for the average treatment effect (ATE) in partially identified triangular equations system models with binary endogenous variables. Binary outcome models with a binary endogenous treatment have been widely used in empirical studies. The role played by instrumental variables (IVs) in such models has long been a controversial topic and has been discussed in many papers (see for example Heckman, 1978; Maddala, 1986; Wilde, 2000; Freedman and Sekhon, 2010; Mourifié and Méango, 2014; Han and Vytlacil, 2017; Li, Poskitt, and Zhao, 2019). In particular, there is a notion of "identification by functional form" (Li et al., 2019), whereby such non-linear models can be point identified even without any IVs, relying on restrictive parametric assumptions such as a bivariate probit. However, such identification has been described as "fragile" (Marra and Radice, 2011; Li et al., 2019), as models such as the bivariate probit are overly restrictive. Once less restrictive assumptions are allowed, the IVs have been shown to play a crucial role for meaningful identification in partially identified models (see for example Chesher, 2005, 2010; Shaikh and Vytlacil, 2011; Li et al., 2019).

The literature on partially identified models offers a useful framework for IV identification analysis. The identified set for the ATE, defined as all possible values of the ATE from different observationally equivalent structures that can give rise to the observed data, offers an obvious measure for identification power. For example, Kitagawa (2009) and Swanson et al. (2018) use the size of the identified set to measure the identification power of model assumptions. Naturally, the width of the ATE identified set can also provide a measure to examine the IV contribution to the identification gains. In this paper, we use the reduction in the width of the identified set as a measure for identification gains.
Since the pioneering work of Manski (1990), most of the ATE partial identification studies with an endogenous treatment have relied on the IVs to bound the ATE (see Heckman and Vytlacil, 1999, 2001; Vytlacil and Yildiz, 2007; Chesher, 2010; Chiburis, 2010; Shaikh and Vytlacil, 2011; Vuong and Xu, 2017; Flores and Chen, 2018). Both Chesher (2010) and Li et al. (2018) show that the existence and the strength of the IVs can significantly affect the identification of the ATE for discrete outcome models. However, the mechanism through which the IV strength translates to identification gains in such non-linear models has not been well understood by researchers.

In endogenous treatment effect models, the IVs exert their influence through their impact on the treatment propensity score. Heckman, Urzua, and Vytlacil (2006) provide a comprehensive study of the properties of IVs in models with continuous outcomes, and point out the central role of the propensity scores in such models. (Other works that establish the important role of the propensity score include Rosenbaum and Rubin (1983), Heckman and Robb (1985, 1986), Heckman (1990), and Ahn and Powell (1993).) In continuous outcome models, it is well known that "identification at infinity" requires the propensity score to take values arbitrarily close to zero and one. For binary models, Li et al. (2018) use a pseudo R² to measure IV strength and show that the ATE bound width decreases as the pseudo R² increases. As with linear models, it is natural to expect that the propensity score variation is also a key component that governs the ability of the IVs to identify the ATE. However, to the authors' knowledge, no rigorous examinations have yet been conducted to investigate the factors contributing to the identification gains of the ATE for discrete outcome models when "identification at infinity" fails.
It is part of the purpose of this paper to investigate this lacuna. This paper presents a rigorous examination of the role of IVs and their interplays with other factors in the identification gains for the ATE in binary outcome models with an endogenous binary treatment. Using the bivariate joint threshold crossing model proposed by Shaikh and Vytlacil (2011) (henceforth referred to as the SV model or SV bounds) as an example, we study the identification gains achieved by the SV bounds against those from an ATE bounds benchmark, the bounds of Manski (1990) (hereafter Manski bounds). The rationale for using Manski's bounds as a benchmark follows from the observation that if the IVs are irrelevant, then the SV bounds collapse to the Manski bounds (see Remark 2.1 of Shaikh and Vytlacil (2011)). Using this framework, we disentangle the various impacts of IVs on identification gains, which yields a novel decomposition of the ATE SV bounds identification gains. This decomposition provides useful insights into the different sources and nature of identification gains.

Our paper makes several contributions. Firstly, we distinguish the concepts of IV strength and IV identification power for binary dependent variables models. We show that, as in the case of linear models, the IV strength, as measured by the range of the conditional propensity score (CPS) values that are attributable to the IVs, plays a crucial role in the identification gains when bounding the ATE. More importantly, we demonstrate that unlike linear models, the IV identification power is also determined by the interplay of the IVs with the sign and the degree of treatment endogeneity. This is because in such non-linear models, the ATE bounds are governed by the joint probabilities of the outcome and the treatment, which vary with the degree of endogeneity.

Secondly, we construct a standardized quantitative measure of the IV identification power (IIP). The IIP measures the IV contribution to identification gains by quantifying the reduction in the size of the ATE identification set that can be attributed to the instruments alone. Works that aim to provide measures of the explained variation in limited dependent variable models, such as Veall and Zimmermann (1992, 1996), are already available, and Windmeijer (1995) provides a comprehensive review of various pseudo R² goodness-of-fit measures. In general, pseudo R² statistics are developed for single equation limited dependent variable models, rather than for triangular systems with a binary endogenous treatment. Although such pseudo R² statistics will yield a measure of the IV strength (as used in Li et al., 2018), they are not appropriate measures for IV identification power, as they fail to capture the critical fact that the IV identification information pertaining to the ATE varies with the degree of endogeneity. Consequently, any suggestion that pseudo R² statistics will be an indicator of the IV identification power would be misplaced. In contrast, the IIP proposed in this paper is specifically designed to evaluate the identification gains that can be solely attributed to the IVs.

Finally, our paper also provides potential insights into the literature on instrument relevancy, weak instruments and instrument selection. The importance of the IIP measure is that it enables a ranking of alternative IVs by their identification power, thereby offering a potential criterion for detection of irrelevant IVs and for selection of sets of IVs for constructing the ATE bounds. In this way, our measure is akin to existing approaches in the generalized method of moments (GMM) literature that seek to determine instrument "relevancy". The ability of our approach to determine and rank sets of IVs by their identification gains leads us to document, we believe for the first time, a critically important feature of binary triangular equations systems: while in the population adding irrelevant IVs cannot increase the IV identification power, in finite samples using such IVs to partially identify the ATE could lead to a loss in IV identification power, which may result in wider ATE bounds, especially when the variation of the covariates is limited. We liken this phenomenon to the well-known problem of irrelevant moment conditions in GMM (see Breusch et al., 1999; Hall and Peixe, 2003; Hall, 2005; Hall et al., 2007, among others) and leave a more rigorous study of this topic for future research.

The rest of this paper is organized as follows. In Section 2 we present our model setup and the SV bounds. In Section 3 we establish how the conditional propensity score, the endogeneity and the covariates affect the ATE bounds. Section 4 introduces our decomposition of identification gains, and studies how it can be used to gauge the instrument identification power. Section 5 defines the index of IIP and presents some of its basic properties. A comprehensive numerical analysis and graphical presentation are given in Section 6 to illustrate our results. Finite sample evaluation of the decomposition analysis is presented in Section 7, and an empirical example is given in Section 8 to demonstrate the usefulness of the decomposition and the IIP in evaluating instrument relevance. The paper closes in Section 9 with some summary remarks. All proofs are relegated to the Appendix.
2 Model Setup and the SV Bounds

Following the potential outcome framework, let Y be a binary outcome such that Y = D·Y1 + (1 − D)·Y0, where D ∈ {0, 1} is a treatment indicator with D = 1 denoting being treated and D = 0 denoting being untreated. The pair Y0, Y1 ∈ {0, 1} are the two potential outcomes in the untreated and treated states. We observe (Y, D, X, Z), where X denotes a vector of exogenous covariates and Z represents a vector of instruments that can be either continuous or discrete. Suppose we are interested in the conditional ATE, defined as

    ATE(x) = E[Y1 | X = x] − E[Y0 | X = x].

Because only one of the potential outcomes is observed, we are faced with a missing data problem. If the potential outcomes are independent of the treatment D then it can be shown that the ATE(x) is point identified. However, in many empirical studies D is endogenous and hence correlated with the potential outcomes. Nevertheless, with the help of IVs we may partially identify the ATE(x) and construct an identified set for the ATE under mild conditions that are satisfied by a wide range of data generating processes.

For notational simplicity, henceforth we will use Pr(A | w) to represent Pr(A | W = w) for any event A, random variable W and its possible value w, unless otherwise stated. For any generic random variables A and B, the support of A is denoted by Ω_A and the support of A conditional on B = b is given by Ω_{A|b}. Let F_{A,B} denote the joint cumulative distribution function (CDF) of (A, B), F_A the marginal CDF of A, and F_{A|B} the conditional CDF of A given B. Corresponding density functions will be denoted using a lower case f with associated subscripts in an obvious way.

We now introduce the model and the identified set of the ATE studied in Shaikh and Vytlacil (2011), based on which we explore the factors determining the ATE bounds and how they impact the ATE bound width. Consider the joint threshold crossing model

    Y = 1[ν1(D, X) > ε1],
    D = 1[ν2(X, Z) > ε2],                                                    (1)

where ν1(·,·) and ν2(·,·) are unknown functions, and (ε1, ε2)′ is an unobservable error term with joint CDF F_{ε1,ε2}. Threshold crossing models are often used in treatment evaluation studies (see Heckman and Vytlacil, 1999, 2001, for example), and have been shown to be informative in the sense that the sign of the ATE can be recovered from the observable data, and the ATE can even be point identified in certain circumstances; see Shaikh and Vytlacil (2005, 2011), Vytlacil and Yildiz (2007) and Vuong and Xu (2017) among others. Moreover, tests for the applicability of threshold crossing have also been developed; see Heckman and Vytlacil (2005), Bhattacharya et al. (2012), Machado et al. (2013) and Kitagawa (2015) for example. The following assumption summarises the conditions imposed by Shaikh and Vytlacil (2011).
Assumption 2.1
The model in (1) is assumed to satisfy the following conditions:
(a) The distribution of the error term (ε1, ε2)′ has a strictly positive density with respect to the Lebesgue measure on R².
(b) (X, Z) is independent of (ε1, ε2).
(c) The distribution of ν2(X, Z) | X is non-degenerate.
(d) The support of the distribution of (X, Z), Ω_{X,Z}, is compact.
(e) ν1 : Ω_{D,X} → R and ν2 : Ω_{X,Z} → R are continuous in both arguments.

Assumption 2.1 ensures that the instruments in Z satisfy the exclusion restriction, are independent of the error term (ε1, ε2)′, and are relevant to the treatment D. (Bhattacharya et al. (2012) demonstrate that the SV bounds still hold under a rank similarity condition, a weaker property that allows heterogeneity in the sign of the ATE(x). Furthermore, as mentioned in Vytlacil and Yildiz (2007), it is possible to achieve ATE point identification via the SV bounds if X contains a continuous element or the exclusion restriction holds in both equations.) In addition, Assumption 2.1 (a) and (b) are such that Z enters the distribution of the outcome Y only through the propensity score, a property called index sufficiency. Conditions (d) and (e) are required to establish the sharpness of the identified set, and are imposed for analytical simplicity.

Denote the random variable P = Pr[D = 1 | X, Z] with support Ω_P. Under Assumption 2.1 (a)-(c), Shaikh and Vytlacil (2011) show that the sign of the ATE(x) is identified: for any p and p′ in Ω_P such that p > p′,

    sgn[ATE(x)] = sgn[ν1(1, x) − ν1(0, x)] = sgn{Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′]},      (2)

where sgn[·] is the conventional signum function.
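To make the sign identification result (2) concrete, the following Monte Carlo sketch simulates a bivariate probit special case of model (1) at a fixed x. The thresholds ν1(0, x) = 0.5, ν1(1, x) = 1.3, the instrument points z ∈ {−1, 1} and ρ = 0.5 are illustrative assumptions, not values from the paper; the check is that the sign of Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′] matches the (positive) sign of the ATE(x).

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 200_000, 0.5

# Hypothetical parametrization of model (1) at a fixed x:
# Y = 1[nu1(D, x) > eps1], D = 1[z > eps2], bivariate normal errors.
nu1 = {0: 0.5, 1: 1.3}                  # nu1(0, x), nu1(1, x); ATE(x) > 0
cov = [[1.0, rho], [rho, 1.0]]

def simulate(z):
    e = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    d = (z > e[:, 1]).astype(int)                       # treatment equation
    y = (np.where(d == 1, nu1[1], nu1[0]) > e[:, 0]).astype(int)  # outcome
    return y, d

y_hi, d_hi = simulate(1.0)      # high propensity value p  = Pr[D=1|x, z=1]
y_lo, d_lo = simulate(-1.0)     # low propensity value  p' = Pr[D=1|x, z=-1]

p, p_prime = d_hi.mean(), d_lo.mean()
diff = y_hi.mean() - y_lo.mean()        # Pr[Y=1|x,p] - Pr[Y=1|x,p']
print(p > p_prime, diff > 0)            # sign of ATE(x) recovered from data
```

Since the true ATE(x) = Φ(1.3) − Φ(0.5) > 0 here, the probability difference comes out positive, as (2) predicts, and this holds for either sign of ρ.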
Given (2), it is apparent that the sign of the ATE(x) is recovered from the observables if Z is valid, in the sense that Z is independent of (ε1, ε2) and has nonzero predictive power for the treatment, meaning that there exist two different values p, p′ ∈ Ω_{P|x} such that p = Pr[D = 1 | x, z] and p′ = Pr[D = 1 | x, z′].

More importantly, Assumption 2.1 is sufficient to construct bounds for the ATE, referred to as the SV bounds. Let P and P′ be two independent random variables with the same distribution, and let x, x′ be any two values in Ω_X. Now, define H(x, x′) = E[h(x, x′, P, P′) | P > P′], where

    h(x, x′, p, p′) = Pr[Y = 1, D = 1 | x′, p] − Pr[Y = 1, D = 1 | x′, p′]
                      − Pr[Y = 1, D = 0 | x, p′] + Pr[Y = 1, D = 0 | x, p].

Let X⁺(x) = {x′ : H(x, x′) ≥ 0}, X⁻(x) = {x′ : H(x, x′) ≤ 0}, X̃⁺(x) = {x′ : H(x′, x) ≥ 0}, and X̃⁻(x) = {x′ : H(x′, x) ≤ 0}. Then the SV lower bound is

    L_SV(x) = sup_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 1 | x, p] + sup_{x′ ∈ X⁺(x)} Pr[Y = 1, D = 0 | x′, p] }
              − inf_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 0 | x, p] + p · inf_{x′ ∈ X̃⁺(x)} Pr[Y = 1 | x′, p, D = 1] },    (3)

and the SV upper bound is

    U_SV(x) = inf_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 1 | x, p] + (1 − p) · inf_{x′ ∈ X̃⁻(x)} Pr[Y = 1 | x′, p, D = 0] }
              − sup_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 0 | x, p] + sup_{x′ ∈ X⁻(x)} Pr[Y = 1, D = 1 | x′, p] }.         (4)

The SV bounds in (3) and (4) consist of two layers of intersection evaluations. The first layer intersects over all possible values of the conditional propensity score, or equivalently, of the IVs. The second layer utilizes the identifying information contained in the covariates.
In particular, for given x, the second layer of intersections is taken over values of the covariates other than x, say x′, which lie in a certain subset of Ω_X and for which there exists a z′ ∈ Ω_{Z|x′} such that p = Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. Thus, both the IVs and the covariates contribute to the identification gains of the SV bounds. It is understood that in (3) and (4) the supremum and infimum operators are only taken over regions where all conditional probabilities are well defined. The probabilities Pr[Y = y, D = d | x′, p] and Pr[Y = y | x′, p, D = d] are well defined for y ∈ {0, 1} and d ∈ {0, 1} if there exists a value z′ ∈ Ω_{Z|x′} such that Pr[D = 1 | x′, z′] = p. The supremum over an empty set is defined as 0, and the infimum over an empty set is defined as 1. Given (3) and (4), the width of the SV bounds can be defined as

    ω_SV(x) = U_SV(x) − L_SV(x).

In the next section, we study the factors that impact the SV bounds and ω_SV(x).

As discussed in the introduction, for binary dependent variables the propensity of being treated is a key factor that carries the identification information in the IVs. Therefore, we start from the conditional propensity score (CPS) of the treatment, defined as Pr[D = 1 | X = x, Z], which is a random variable (function) of the IV Z, and study the features of the CPS that are crucial in determining the SV bound width.

In the following proposition, for the sake of completeness, we first restate the sharpness result in Shaikh and Vytlacil (2011) under a stronger support condition Ω_{X,P} = Ω_X × Ω_P, and then introduce our new results about the connections between P = Pr[D = 1 | X, Z] and the SV bound width. Denote the two extreme values of the support of the variable P by p̲ := inf{p ∈ Ω_P} and p̄ := sup{p ∈ Ω_P} respectively.

Proposition 3.1
Let Assumption 2.1 hold. If Ω_{X,P} = Ω_X × Ω_P, then the SV bounds in (3) and (4) are sharp. In addition, for any given x ∈ Ω_X:
(i) L_SV(x) is weakly increasing as p̲ decreases or as p̄ increases;
(ii) U_SV(x) is weakly decreasing as p̲ decreases or as p̄ increases;
and hence
(iii) ω_SV(x) is weakly decreasing as p̲ decreases or as p̄ increases.

Notice that under the restriction Ω_{X,P} = Ω_X × Ω_P, the support of P is the same as the support of the CPS Pr[D = 1 | X = x, Z] for every x ∈ Ω_X. Proposition 3.1 shows that the locations of the lower and upper SV bounds are determined by the extreme values of the CPS, i.e. p̲ and p̄. Moreover, the width of the SV bounds ω_SV(x) weakly decreases as the support of the CPS "expands". This means that when the IVs are good predictors of the treatment status, the identified set of the ATE(x) (the SV bounds) is likely to be informative.

The feature revealed by Proposition 3.1 is significant. It indicates that in partially identified models with binary dependent variables, the property of IVs that determines their contribution to identification gains is different from that which has hitherto been held to be important. Key ingredients of conventional measures of IV strength are the correlation between the IVs and the endogenous regressors (as evaluated via the first-stage F-statistic for continuous endogenous regressors, or the pseudo R² for binary response variables), as well as the variation of the IVs relative to that of the random noise. However, Proposition 3.1 indicates that two IV sets that have the same CPS end points will make identical contributions to identification gains when partially identifying the ATE, irrespective of their correlation with the endogenous regressors or their variability.

The restriction Ω_{X,P} = Ω_X × Ω_P in Proposition 3.1 is utilized in Shaikh and Vytlacil (2011) to simplify the expression of the SV bounds and to prove the sharpness result. It is also one of the sufficient conditions that ensure global identification in a parametric triangular system model with a binary endogenous treatment; see Han and Vytlacil (2017), Theorem 5.1. The condition Ω_{X,P} = Ω_X × Ω_P says that for any x, x′ ∈ Ω_X we have Ω_{P|x} = Ω_{P|x′}; i.e. there exist possible realizations z, z′ of Z such that Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. This might fail to hold in practice, especially when the variation in Z is limited. One sufficient condition for Ω_{X,P} = Ω_X × Ω_P to hold is that D is mean independent of X given Z. The condition is needed here because, without this support restriction, the SV bounds may not exhibit a monotonic relationship with the extreme values of the CPS. (Without Ω_{X,P} = Ω_X × Ω_P, the SV bounds need not be sharp. Chiburis (2010) shows that under joint threshold crossing the sharp ATE bounds can only be implicitly determined by a copula, so that neither a closed form expression nor a computationally feasible linear programming algorithm that solves this problem exists. We therefore maintain the support restriction.)

Fortunately, although Proposition 3.1 is derived using the support constraint, from the simulations in Section 7 we can see that the SV bound width decreases, on average, as the extreme values of the CPS move towards their endpoints (zero and one). In fact, as we will now show, without the imposition of the support condition Ω_{X,P} = Ω_X × Ω_P, a "widest bound" under Assumption 2.1 that restricts the size of ω_SV(x) can be derived for any given x ∈ Ω_X. Define the two extremes of the CPS as p̲(x) := inf_{z ∈ Ω_{Z|x}} Pr[D = 1 | x, z] and p̄(x) := sup_{z ∈ Ω_{Z|x}} Pr[D = 1 | x, z].

Proposition 3.2
Let Assumption 2.1 hold. There exists a function ω̄ : Ω_X → [0, 1] such that 0 ≤ ω_SV(x) ≤ ω̄(x) for any given x ∈ Ω_X. In addition,

    if ATE(x) > 0, then ω̄(x) = Pr[Y = 1, D = 1 | x, p̲(x)] + Pr[Y = 0, D = 0 | x, p̄(x)];
    if ATE(x) < 0, then ω̄(x) = Pr[Y = 1, D = 0 | x, p̄(x)] + Pr[Y = 0, D = 1 | x, p̲(x)].

Moreover, ω̄(x) is weakly decreasing as p̲(x) decreases or as p̄(x) increases.

The explicit expressions of the widest bounds, with width ω̄(x), can be found in (14) and (16); see the proof of Proposition 3.2. From Proposition 3.2 we can see that ω̄(x) is monotone in the extreme values of the CPS, i.e. (p̲(x), p̄(x)), and we are able to conclude that the extreme values of the CPS govern the size of the SV bound width even without the support restriction. Moreover, under the extreme case of perfect prediction, Proposition 3.2 implies that the ATE(x) is point identified by the SV bounds. Suppose p∗, p∗∗ ∈ Ω_{P|x} are such that Pr[D = 0 | x, p∗] = 1 and Pr[D = 1 | x, p∗∗] = 1. By the definition of p̲(x) and p̄(x), we then have p∗ = p̲(x) and p∗∗ = p̄(x). Proposition 3.2 then yields that ω̄(x) = 0 whatever the sign of the ATE(x), indicating that the ATE(x) is point identified. From the above discussion it is apparent that perfect prediction in the binary dependent variables model is equivalent to "identification at infinity". A similar discussion can be found for partial identification of the ATE in models with discrete outcomes in Chesher (2010).
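As a numerical illustration of Proposition 3.2, the sketch below evaluates ω̄(x) under an assumed bivariate probit specification of (1) with ATE(x) > 0 (the thresholds ν1(1, x) = 1.3, ν1(0, x) = 0.5 and ρ = 0.5 are purely illustrative) for progressively wider CPS ranges (p̲(x), p̄(x)). The widest width shrinks monotonically and approaches zero under near-perfect prediction.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho, nu1_1, nu1_0 = 0.5, 1.3, 0.5     # assumed bivariate probit parameters
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def widest_width(p_lo, p_hi):
    """omega_bar(x) for ATE(x) > 0: Pr[Y=1,D=1|x,p_lo] + Pr[Y=0,D=0|x,p_hi]."""
    pr_11 = mvn.cdf([nu1_1, norm.ppf(p_lo)])              # Pr[Y=1, D=1 | x, p_lo]
    pr_00 = 1 - norm.cdf(nu1_0) - p_hi + mvn.cdf([nu1_0, norm.ppf(p_hi)])
    return pr_11 + pr_00                                  # second term: Pr[Y=0, D=0 | x, p_hi]

widths = [widest_width(*ps) for ps in [(0.4, 0.6), (0.1, 0.9), (0.001, 0.999)]]
print(widths)   # strictly decreasing; last value is close to zero
```

Since Pr[Y = 1, D = 1 | x, p] ≤ p and Pr[Y = 0, D = 0 | x, p] ≤ 1 − p, the final width is at most 0.002 here, illustrating that perfect prediction delivers point identification.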
The importance of the IVs in determining the ATE bounds via the CPS has been recognized in several studies, but it seems that another crucial determinant, the degree of endogeneity, has so far received little attention. The ATE bounds are constructed using the joint probabilities of the outcome and the treatment, and the IVs affect those joint probabilities not only directly through the CPS but also indirectly through the co-movements of the outcome and the treatment due to the endogeneity. Thus, it is reasonable to expect that the information contained in the IVs may be correspondingly scaled via the leverage induced by the degree of endogeneity.

To facilitate obtaining interpretable relationships between the degree of endogeneity and the SV bound width, we introduce a family of bivariate single parameter copulae that specifies the joint distribution of the stochastic error terms in (1), while requiring neither the copula nor the marginal distributions to be known. Denote a copula as C(·, ·; ρ) : (0, 1)² → (0, 1), where ρ ∈ Ω_ρ is a scalar dependence parameter that fully describes the joint dependence between ε1 and ε2, with their dependence increasing as ρ increases. It is worth noting that in our setting, for any given copula, the dependence parameter ρ can be understood as indicating the level of endogeneity. We also impose additional dependence structure, the concordance ordering, on the copula C(·, ·; ρ). Let F_{ε1,ε2} and F̃_{ε1,ε2} be two distinct CDFs. Following Joe (1997), we define F̃_{ε1,ε2} as being more concordant than F_{ε1,ε2}, denoted by F_{ε1,ε2} ≺_c F̃_{ε1,ε2}, if

    F_{ε1,ε2}(e1, e2) ≤ F̃_{ε1,ε2}(e1, e2)  for all (e1, e2) ∈ R².

For ρ1 ≠ ρ2 and u1, u2 ∈ (0, 1), we say that the copula C(·, ·; ρ) satisfies the concordance ordering with respect to ρ, denoted as C(u1, u2; ρ1) ≺_c C(u1, u2; ρ2), if

    C(u1, u2; ρ1) ≤ C(u1, u2; ρ2)  for any ρ1 < ρ2.                          (5)

The concordance ordering with respect to ρ is a stochastic dominance restriction.
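Condition (5) can be checked numerically for a given copula family. The sketch below uses the Gaussian (normal) copula as the example family and verifies the concordance ordering on a grid of (u1, u2) values for two dependence parameters ρ1 < ρ2:

```python
from scipy.stats import multivariate_normal, norm

def gaussian_copula(u1, u2, rho):
    """C(u1, u2; rho): the Gaussian copula evaluated via the bivariate normal CDF."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return mvn.cdf([norm.ppf(u1), norm.ppf(u2)])

# Concordance ordering (5): C(., .; rho1) <= C(., .; rho2) whenever rho1 < rho2.
us = [0.1, 0.3, 0.5, 0.7, 0.9]
ok = all(gaussian_copula(a, b, 0.2) <= gaussian_copula(a, b, 0.6) + 1e-12
         for a in us for b in us)
print(ok)  # True: the normal copula is ordered by concordance in rho
```

The small tolerance guards against numerical noise in the bivariate normal CDF; the inequality itself holds exactly for the normal copula.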
The concordance ordering is embodied in many well-known copulae, including the normal copula; see Joe (1997), Section 5.1, for the copula families where (5) holds. Similar stochastic dominance conditions are employed in, e.g., Han and Vytlacil (2017) and Han and Lee (2019) to derive identification and estimation results for the parametric bivariate probit model and its generalizations.

Assumption 3.1
The joint distribution of (ε1, ε2)′ is given by a member of the single parameter copula family F_{ε1,ε2}(e1, e2) = C(F_{ε1}(e1), F_{ε2}(e2); ρ) for (e1, e2) ∈ R², where C(·, ·; ρ) satisfies the concordance ordering with respect to ρ.

Assumption 3.1 defines a class of data generating processes that is sufficient for us to establish the relationship between endogeneity, as captured by the dependence parameter ρ, and the widest SV bound width ω̄(x). The derivation of the following proposition requires neither the copula C(·, ·; ρ) nor the marginal distributions F_{ε1} and F_{ε2} to be specified.

Proposition 3.3
Under Assumptions 2.1 and 3.1, the widest SV bound width ω̄(x) is weakly increasing in ρ when ATE(x) > 0, and ω̄(x) is weakly decreasing in ρ when ATE(x) < 0.

Proposition 3.3 implies that the (widest) SV bound width could be significantly impacted by the degree of endogeneity, even if the extreme values of the CPS are fixed. (In the special case of a normal bivariate probit model, ρ represents the correlation between the error terms and Ω_ρ = (−1, 1); in general, Ω_ρ need not be (−1, 1).) In addition, Proposition 3.3 also reveals that the effect of endogeneity is asymmetric. To be more specific, with a positive treatment effect, negative endogeneity helps narrow down the ATE bound width, while the opposite holds true for a negative treatment effect. As a consequence, a set of seemingly weak IVs may possess considerable identification power when the sign and degree of endogeneity are favourable. Conversely, a set of "seemingly strong" IVs can be surprisingly powerless due to an undesirable sign or degree of endogeneity, resulting in wide ATE bounds. Thus, the conventional tests for detecting IV strength, such as the first-stage F-statistic and pseudo R², or the associated weak IV tests designed for linear models, can be misleading in measuring IV identification power. The result here shows that IV strength is a different concept from IV identification power in this binary model.
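The asymmetry in Proposition 3.3 can be seen numerically. Under the same kind of assumed bivariate probit specification used earlier (normal copula, illustrative thresholds giving ATE(x) > 0, CPS extremes held fixed at 0.2 and 0.8), the widest width ω̄(x) rises with the endogeneity parameter ρ, so negative endogeneity tightens the bounds while positive endogeneity loosens them:

```python
from scipy.stats import multivariate_normal, norm

nu1_1, nu1_0 = 1.3, 0.5     # assumed thresholds, so ATE(x) = Phi(1.3) - Phi(0.5) > 0
p_lo, p_hi = 0.2, 0.8       # CPS extremes held fixed across rho

def widest_width(rho):
    """omega_bar(x) for ATE(x) > 0 under a normal copula with dependence rho."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    pr_11 = mvn.cdf([nu1_1, norm.ppf(p_lo)])              # Pr[Y=1, D=1 | x, p_lo]
    pr_00 = 1 - norm.cdf(nu1_0) - p_hi + mvn.cdf([nu1_0, norm.ppf(p_hi)])
    return pr_11 + pr_00                                  # Pr[Y=0, D=0 | x, p_hi]

w = [widest_width(r) for r in (-0.5, 0.0, 0.5)]
print(w)   # increasing in rho: negative endogeneity yields a narrower widest bound
```

Both joint probabilities are increasing in ρ for a fixed CPS value, which is what drives the monotone pattern predicted by the proposition.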
As we have seen from the construction of the SV bounds in Section 2, both the IVs and the covariates contribute to identifying the ATE under model (1). It is perhaps not surprising to find that there are situations where the covariates fail to further tighten the SV bounds, a feature previously noted in Chiburis (2010). This happens when, conditional on D, the covariates in X have no additional effects on the outcome Y, leading to ω_SV(x) = ω̄(x). The following proposition formalizes these statements.

Proposition 3.4
Let Assumption 2.1 hold. If the random variable ν1(D, X) | D is degenerate, then ω_SV(x) = ω̄(x).

Proposition 3.4 implies that any further reduction in the SV bound width from ω̄(x) to ω_SV(x) can be attributed to the additional identification information in the covariates X. In particular, focusing on the second layer of intersections over X⁺(x), X⁻(x), X̃⁺(x) and X̃⁻(x) in the bounds (3) and (4), we can see that such identification gain is extracted from matching pairs (x, z), (x′, z′) ∈ Ω_{X,Z} such that Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. Thus, broader support and greater variability in X increase the probability of finding a matching pair.

To sum up, from the discussion in Section 3, we know that the identification power for the ATE SV bounds is determined by the extreme values of the CPS, the sign and the degree of endogeneity, and the variability (or support) of the covariates in the outcome equation.

Based on the discussions above, in this section we introduce a novel decomposition of the identification gains of the SV bounds. It disentangles the identification gains into components that are attributable to the gains obtained from the IVs and the exogenous covariates. To construct the decomposition, let us first introduce the benchmark ATE bounds of Manski (1990) (Manski bounds), which are obtained without reference to IVs and are given by

    L_M(x) = −Pr[Y = 1, D = 0 | x] − Pr[Y = 0, D = 1 | x],
    U_M(x) = Pr[Y = 1, D = 1 | x] + Pr[Y = 0, D = 0 | x],                    (6)

where (with obvious notation) L_M(x) and U_M(x) are the lower and upper bounds respectively. From (6), it is apparent that the width of the Manski bounds, defined as ω_M(x) = U_M(x) − L_M(x), is one for any given x ∈ Ω_X, with the lower bound and the upper bound falling on either side of zero.
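A quick numeric sketch of the Manski bounds in (6), using made-up joint probabilities for (Y, D) at a fixed x: whatever the joint distribution, the bounds straddle zero and the width is exactly one, since U_M(x) − L_M(x) is the sum of all four joint probabilities.

```python
# Hypothetical joint probabilities Pr[Y=y, D=d | x] at a fixed x
# (illustrative numbers; any distribution summing to one works).
p_y1d1, p_y1d0, p_y0d1, p_y0d0 = 0.30, 0.10, 0.15, 0.45

L_M = -p_y1d0 - p_y0d1      # lower Manski bound, eq. (6)
U_M = p_y1d1 + p_y0d0       # upper Manski bound, eq. (6)

print(L_M, U_M, U_M - L_M)  # bounds straddle zero; width is always one
```

Here L_M = −0.25 and U_M = 0.75, so the interval contains zero and has unit width, which is why the Manski bounds alone never identify the sign of the ATE(x).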
Thus, [L_M(x), U_M(x)] is uninformative as to the sign or location of the treatment effect, and it is often referred to in the literature as "the worst case scenario" (see Tamer, 2010; Chiburis, 2010; Bhattacharya et al., 2012, for example).

Our proposed decomposition of identification gains is inspired by the implications of the theoretical results in Section 3. For any given x ∈ Ω_X, the decomposition consists of four components, denoted by C_1(x) to C_4(x) respectively. Each component corresponds to the identification gains made by the SV bounds over the benchmark Manski bounds.

(i) C_1(x): Contribution of IV Validity. The first component of the identification gains is the reduction in the SV bound width relative to the benchmark Manski bound width, due to the identification of the sign of ATE(x). This contribution is accredited to IV validity, since by (2) we can identify the sign of ATE(x) if the IVs are independent of the error terms (ε_1, ε_2) and ν(X, Z) | X is nondegenerate (or equivalently, if the IVs are valid), regardless of the IV strength. For all x ∈ Ω_X,

C_1(x) = 1[ATE(x) ≤ 0] U_M(x) − 1[ATE(x) ≥ 0] L_M(x),

which is equivalent to the width of the negative (positive) part of the Manski bounds if ATE(x) is identified to be positive (negative).

(ii) C_2(x): Contribution of IV Strength. Conditional on the first component, IV validity, the second component captures the further reduction achieved by the SV bound width via intersecting over all possible values of Z. This is reflected in the dependence of the SV bounds in (3) and (4) on the two extreme values of the CPS: the closer the extreme values are to 0 and 1, the greater is C_2(x). [Footnote: If ATE(x) = 0 is identified by (2), i.e. Pr[Y = 1 | x, p] = Pr[Y = 1 | x, p′] for any p > p′, then it is obvious that the first contribution of the SV bounds already leads to point identification of ATE(x), and the IV identification power IIP(x), which will be introduced in Section 5, achieves its maximum value one.] Formally,

C_2(x) = ω_M(x) − ω(x) − C_1(x).

(iii) C_3(x): Contribution of Covariates. The third component is the incremental reduction in the SV bound width brought about by intersecting over all possible values of the exogenous covariates X that fall into the sets X_1(x), X_1^-(x), X_0(x) and X_0^-(x) via matching on the same propensity score values. As implied by Proposition 3.4, this component is attributed to the variation of the exogenous covariates:

C_3(x) = ω(x) − ω_SV(x).

(iv) C_4(x): Remaining SV Bound Width. The last component is due to the unobservable error terms, and relates to the remaining SV bound width that cannot be further reduced by the observable data under the SV modeling assumptions. This component can be thought of as reflecting the signal-to-noise ratio of the error terms. By construction, C_4(x) = ω_SV(x).

It is easy to see that C_1(x) + C_2(x) + C_3(x) + C_4(x) = ω_M(x) = 1. If ν(X, Z) | X is degenerate and the IVs have no explanatory power for the treatment, then C_1(x) = C_2(x) = C_3(x) = 0 and the SV bounds reduce to the Manski bounds. It is worth noting that although we do not decompose the identification gains based on the sign and the degree of endogeneity, the magnitude of all four components varies with them. According to Proposition 3.3, the sign and the degree of endogeneity affect ω(x), which enters all four components either directly or indirectly, because the four components sum to the fixed value one. In addition, C_1(x) to C_4(x) can always be identified and estimated from the data.
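Since C_1(x) to C_4(x) are simple functions of the Manski bounds, the widest SV width ω(x) and the actual SV width ω_SV(x), the decomposition can be computed mechanically once these quantities are estimated. A minimal Python sketch, with a hypothetical set of input values and a function name of our own choosing:

```python
def decompose_gains(L_M, U_M, ate_sign, omega, omega_sv):
    """Decompose the identification gains of the SV bounds over the
    Manski bounds [L_M, U_M] into C_1(x), ..., C_4(x).
    ate_sign: +1 or -1, the identified sign of ATE(x);
    omega:    widest SV bound width (Proposition 3.3);
    omega_sv: actual SV bound width."""
    omega_m = U_M - L_M                 # equals one in population
    # C1: part of the Manski bounds removed by sign identification
    C1 = -L_M if ate_sign > 0 else U_M
    C2 = omega_m - omega - C1           # gain from IV strength
    C3 = omega - omega_sv               # gain from covariate variation
    C4 = omega_sv                       # remaining SV bound width
    iip = C1 + C2                       # = omega_m - omega
    return C1, C2, C3, C4, iip

# Hypothetical example: Manski bounds [-0.35, 0.65], positive ATE(x),
# widest SV width 0.40, actual SV width 0.25
C1, C2, C3, C4, iip = decompose_gains(-0.35, 0.65, +1, 0.40, 0.25)
assert abs(C1 + C2 + C3 + C4 - 1.0) < 1e-9  # components sum to omega_M = 1
assert abs(iip - 0.60) < 1e-9               # equals omega_M - omega
```

By construction the four components sum to ω_M(x), and C_1(x) + C_2(x) equals ω_M(x) − ω(x), the quantity used in Section 5 as the IV identification power IIP(x).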
In practice, once the model has been estimated (parametrically or non-parametrically), the estimates can be used to construct the decomposition. Detailed numerical illustrations and simulations of the decomposition are presented in Sections 6 and 7.

By construction, the identification gains decomposition satisfies C_1(x) + C_2(x) + C_3(x) + C_4(x) = ω_M(x) = 1 for all x ∈ Ω_X, with each C_j(x) representing the proportion of total identification gains that can be attributed to the corresponding component. Based on the decomposition, we can then construct a quantitative measure of IV identification power in the partial identification setting. Suppose Assumption 2.1 holds, bar condition (c). For all x ∈ Ω_X, define the IV identification power IIP(x) as

IIP(x) := ω_M(x) − ω(x), if ν(X, Z) | X = x is nondegenerate;
IIP(x) := 0, if ν(X, Z) | X = x is degenerate,   (7)

where ω(x) is the widest width of the SV bounds defined in Proposition 3.3. Setting IIP(x) = 0 when ν(X, Z) | X = x is degenerate is equivalent to setting ω(x) = ω_M(x) = 1, meaning that the widest width of the SV bounds equals the width of the benchmark Manski bounds because the IVs are irrelevant. From the decomposition, we have
IIP(x) = C_1(x) + C_2(x) when the IVs are valid and relevant. Thus IIP(x) represents the proportion of the identification gains that is due to the IVs alone, and it can be viewed as an index of the IV identification power. The overall IV identification power can be obtained by taking the expectation of IIP(x) over Ω_X, i.e. E_X[IIP(X)]. The following proposition formalizes some important properties of IIP(x) as an indicator of the IV identification power.

Proposition 5.1
The index
IIP(x) lies in the unit interval [0, 1], and under Assumption 2.1 IIP(x) has the following properties:
(i) IIP(x) always lies in [0, 1] and can identify whether at least one of the IVs used to achieve the SV bounds is relevant;
(ii) IIP(x) = 0 if none of the IVs in Z are relevant, in which case the SV bounds reduce to the benchmark Manski bounds;
(iii) IIP(x) = 1 if the IVs in Z have perfect predictive power for the treatment D (identification at infinity holds), in the sense that there exist p* and p** in Ω_{P|x} such that Pr[D = 0 | x, p*] = 1 and Pr[D = 1 | x, p**] = 1. Moreover, ATE(x) is point identified when IIP(x) = 1.

Proposition 5.1 indicates that
IIP(x) is a meaningful measure of IV usefulness for improving the ATE partial identification. Therefore, values of IIP(x) can be compared across different sets of IVs, or across different values of x given the same set of IVs, since they are standardized relative to the same baseline benchmark. [Footnote: IIP(x) or E_X[IIP(X)] can also be compared across various studies if necessary.] For example, a value of, say, IIP(x) = 0.6 indicates that the IVs alone shrink the benchmark Manski bound width by 60 percentage points, so IIP(x) is a meaningful measure independent of the specific SV bounds. [Footnote: Theoretically, the value of IIP(x) lies in [0, 1] and the width of the Manski bounds is always one, so IIP(x) can be interpreted as the percentage points of the identification gains brought by the IVs. In finite-sample settings, where the estimated Manski bound width may no longer be exactly one, the same interpretation can be obtained by computing the ratio IIP(x)/ω_M(x) using their associated estimates.] In addition, the values of IIP(x) at its end points are intuitively interpretable: [Footnote: The definition allows IIP(x) to be discontinuous when Ω_{P|x} = {p_x} for some constant p_x ∈ [0, 1], i.e. when Ω_{P|x} is a singleton.] IIP(x) = 0 identifies situations where the IVs are completely irrelevant, and, when the IVs are able to perfectly predict the treatment status (when identification at infinity holds), IIP(x) = 1 and point identification of ATE(x) is achieved.

Numerical analysis is used in Section 6 to illustrate the behaviour of IIP(x) in a class of representative models. At this point we note that IIP(x) ignores the component of identification gains attributable to the exogenous covariates, namely C_3(x). In view of the additivity of the identification gains decomposition, this neglect seems entirely reasonable since we know, from Section 3, that for a given degree of endogeneity and given extremes of the CPS, the value of ω(x) does not vary with the identification information contained in the covariates. This indicates that IIP(x) is a measure of identification gains due to the IVs alone, without the contribution of the additional identification power provided by the exogenous covariates. It measures the smallest identification gains relative to the benchmark Manski bounds that can be achieved by a given set of IVs. More importantly, focusing on IIP(x) introduces considerable computational simplification when comparing sets of IVs, as it avoids the second layer of the intersection bounds required to compute the SV bounds.

In this section we illustrate numerically the theoretical results on the decomposition of the SV bounds studied in the previous sections, and how each component affects the SV bounds. We consider as our data generating process (DGP) a version of the model in (1) with a linear additive latent structure, similar to that studied in Li et al. (2019):

Y = 1[αD + βX + ε_1 > 0],
D = 1[γZ + πX + ε_2 > 0],   (8)

where the exogenous regressor X and the IV Z are assumed mutually independent, and, without loss of generality, X ∼ N(0,
1) and Z ∈ {−1, 1} with Pr(Z = 1) = 1/2. In addition, (X, Z)′ ⊥ (ε_1, ε_2), where the error term (ε_1, ε_2) is zero-mean bivariate normal with unit variances and correlation ρ. For this specification, given the distribution of Z, there is a monotonic one-to-one mapping from the coefficient of the IV, γ, to the range of the conditional propensity score. We capture changes in the extreme values of the CPS using a grid of values for γ together with ρ = −0.99 : 0.05 : 0.99. We set α = 1 and π = 0 across all parameter settings. Under this DGP, the SV bound width is affected by α, β and the variation of the exogenous covariates. Since α and the distribution of X are held fixed, we select β from a set of three values ranging from β = 0.05 to β = 0.45, so that changes in β capture the variation of the exogenous covariates given the distribution of X. Using the DGP as characterized by (8), we compute the SV bounds [L_SV(x), U_SV(x)] and the Manski bounds [L_M(x), U_M(x)], and implement the identification gains decomposition according to the true DGP. In what follows we present the outcomes obtained when x = E[X]. [Footnote: In our experiments we have calculated ATE(x) and its bounds at various quantile points of X, but space considerations prevent us from listing all our results here. We present the outcomes obtained when X equals its modal/mean value, as these are representative.]

In Figure 6.1, the subplots in the first row display the upper and lower bounds of ATE(x), and the subplots in the second row present the corresponding bound widths. For the Manski bounds we can see that the width is always one, and the upper and lower bounds stand on either side of zero, as previously noted. The SV bounds reduce to the Manski bounds when the IVs are irrelevant with γ = 0 (the separate lines in the graphs at γ = 0). When γ moves away from γ = 0, the SV bound width drops significantly. Then, as the magnitude of γ increases, i.e. as the end points of the CPS expand, the SV bound width decreases. In addition, since α > 0 and ATE(x) is positive, the SV bound width increases as ρ increases. Moreover, comparison of the plots for different values of β reveals that β plays a critical role in determining the SV bounds, in the sense that larger β produces significantly narrower bound widths. When β = 0.05 the SV bound width is non-negligible when the absolute value of γ is small, while when β = 0.45 point identification of ATE(x) is achieved for most of the (γ, ρ) pairs. These results indicate that for a given IV strength, as measured by γ or the associated range of the CPS, the lower the value of ρ in the (−1, +1) range or the bigger the impact of x, the narrower the SV bounds that can be achieved. In other words, for a given IV strength, a larger identification gain can be achieved if the error correlation ρ is large in magnitude and has the opposite sign to the sign of ATE(x).

[Figure 6.1: Manski and SV Bounds for ATE(x = E[X]), showing the Manski bound width and the SV bound widths for each of the three values of β. Note: Three-dimensional plots of the ATE bounds as functions of (γ, ρ). When γ = 0, the SV bounds reduce to the Manski bounds with bound width one.]

The decomposition of identification gains obtained when γ takes two representative values, ρ takes four values in (−1, 1) (two negative, two positive), and β takes its three values is displayed for x = E[X] in Figure 6.2. We can see that when ATE(x) is positive, the contribution of IV validity, as measured by C_1(x), is determined by the Manski lower bound, and decreases as ρ increases (conversely, numerical results not reported here show that when ATE(x) is negative, C_1(x) increases as ρ increases), while C_1(x) is invariant to β. By way of contrast, the contribution of the component C_2(x) also does not change with β, but it increases significantly as the magnitude of γ increases, due to the impact of the IVs on the range of the CPS. The component of identification gains
C_3(x) also contributes significantly when β is relatively large (e.g. β = 0.45).

Figure 6.3 depicts the index
IIP(x) as a function of (γ, ρ) on a lattice of γ values crossed with ρ ∈ {−0.99 : 0.05 : 0.99}. The plot confirms that, when ATE(x) is positive, the IV identification power IIP(x) increases as the IV strength (|γ|) increases, but for the same IV strength, IIP(x) is higher the lower the value of ρ. We also found, based on results not reported here, that, when ATE(x) is negative, a rising level of positive endogeneity drives up IIP(x) and reduces the width of the SV bounds.

By way of summary, the theoretical results presented in Sections 3, 4 and 5 are clearly reflected in the features observed in the numerical outcomes reported here. Firstly, IIP(x) is bigger when the IVs are stronger (|γ| higher). In addition, for a given IV strength in the first-stage treatment equation, higher IIP(x) can be achieved if the endogeneity ρ has the opposite sign to ATE(x) and is of high magnitude (|ρ|); and if the endogeneity is of the same sign as ATE(x), then the lower the degree of endogeneity, the better the identification power. Of course, adding the additional identification gain C_3(x) to IIP(x) leads to the SV bound width ω_SV(x), and C_3(x) depends on the properties of the covariates.

Next, we study the empirical performance of our decomposition analysis for alternative sets of IVs. We present finite-sample results to show how
IIP(x) can be used to rank the identification power of different sets of IVs and to potentially detect irrelevant IVs, when determining which set of IVs should be used to construct the ATE bounds. The advantage of this strategy over conventional IV strength evaluations (such as those akin to the first-stage IV F-statistic or the CPS) is that IIP(x) captures the IV identification power in terms of the IVs' ability to shrink the width of the ATE bounds, incorporating the IV strength and its interaction with the direction and magnitude of endogeneity in the nonlinear model. Consider i.i.d. samples generated from a DGP similar to (8) with two IVs:

Y = 1[αD + βX + ε_1 > 0],
D = 1[πX + γ_1 Z_1 + γ_2 Z_2 + ε_2 > 0],   (9)

The identification power
IIP(x) can provide a testable implication of IV relevance, but a formal test is beyond the scope of this paper.

[Figure 6.2: Decomposition of Identification Gains (x = E[X]); panels (a), (b) and (c) correspond to the three values of β. Note: The green line depicts the amount of the IV validity contribution C_1(x). To aid legibility, C_1(x), ..., C_4(x) have been rendered as C1, ..., C4 in each of the subplots in this figure. The x-axis displays the values of γ; for space limitations, only nonnegative values of γ are represented.]

[Figure 6.3: IIP(x) (x = E[X]). Note: Three-dimensional plot of IIP(x) as a function of (γ, ρ). The value of β does not affect IIP(x) in this case because π = 0 and no matches of Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′] exist for x = E[X] and z, z′ ∈ {−1, 1}. When γ = 0, IIP(x) = 0 because the IV is irrelevant.]

In (9), the two IVs in Z = (Z_1, Z_2)′ are Z_1 ∼ Bernoulli(1/
2) and Z_2, a discrete IV taking seven distinct values, three negative and four nonnegative, with unequal probabilities. We set α = 1, β = 1, π = −1, and the IV coefficients (γ_1, γ_2) to fixed positive values, and (ε_1, ε_2) is jointly normal with mean zero, variance one and correlation ρ ∈ {0.5, 0.8}. In addition, Z_1, Z_2 and X are mutually independent, and also independent of (ε_1, ε_2). Consider two cases of covariate variability: case 1, continuous X ∼ N(0, 1); and case 2, binary X ∼ Bernoulli(1/2). We evaluate the bounds at x = 0; the value of ATE(x) = E[Y_1 − Y_0 | X = 0] under the DGP (9) is 0.341.

In order to evaluate the finite-sample performance of IIP(x) as an index for measuring IV identification power, we consider five alternative sets of IV options. In addition to the two valid IVs Z_1 and Z_2 in the DGP, we introduce two "pseudo" IVs: a misspecified binary proxy Z̃_2 = 1[Z_2 > 0] of Z_2, and an irrelevant IV Z_3 ∈ {0, 1} such that Pr[Z_3 = 1] = 2/3 and Z_3 ⊥ (ε_1, ε_2, Z_1, Z_2, X). To illustrate the behaviour of the IIP(x) estimation, we use sample data for (Y, D, X) generated from the DGP in (9) to estimate models with five alternative IV sets: (1) only one valid IV Z_1 (omitting Z_2); (2) only one valid IV Z_2 (omitting Z_1); (3) one valid Z_1 and one misspecified Z̃_2; (4) two valid IVs Z_1 and Z_2; and (5) the two valid IVs Z_1 and Z_2 plus one irrelevant Z_3.

Table 7.1: Population CPS Range and IIP(x) (x = 0, cases 1 and 2)

| Sets | IVs | CPS definition | CPS Range | IIP(x) (ρ = 0.5) | IIP(x) (ρ = 0.8) |
| (1) | Z_1 | Pr[D = 1 | x, Z_1] | [·, ·] | · | · |
| (2) | Z_2 | Pr[D = 1 | x, Z_2] | [·, ·] | · | · |
| (3) | Z_1, Z̃_2 | Pr[D = 1 | x, Z_1, Z̃_2] | [·, ·] | · | · |
| (4) | Z_1, Z_2 | Pr[D = 1 | x, Z_1, Z_2] | [·, ·] | · | · |
| (5) | Z_1, Z_2, Z_3 | Pr[D = 1 | x, Z_1, Z_2, Z_3] | [·, ·] | · | · |

Note: The population CPS and
IIP(x) are the same for case 1 and case 2.

Table 7.1 reports the population CPS range and IIP(x) for cases 1 and 2, at x = 0. Note that the covariate variability impacts neither the population CPS nor IIP(x), so the values of the CPS range and IIP(x) for case 1 are the same as those for case 2. Looking at the CPS range as a measure of IV strength, we can see that the CPS range is the widest when both valid and relevant IVs Z_1 and Z_2 are used, as in (4). Adding an irrelevant IV Z_3 does not change the theoretical CPS range, so theoretically (5) has the same IV strength as (4). The CPS range decreases when only one of the two valid IVs is used, as in (1) and (2), with Z_2 being stronger, with a wider CPS range, than Z_1. As expected, when a valid IV is incorrectly specified as a proxy dummy Z̃_2 as in (3), the CPS range is narrower than that of the best set in (4), but wider than that in (1) with Z_1 alone. Interestingly, comparing IV set (3) with (2), set (2) with only one valid IV actually results in a wider CPS range than that for the two IVs in set (3) with Z_2 misspecified, though the CPS interval for (3) is not completely nested within the interval for (2).

Whilst the CPS range indicates the IV strength, it is IIP(x) that captures the identification power of each IV set, measuring the reduction of the SV bound width relative to the benchmark Manski bound width due to the contribution of the IVs. As seen from the two IIP(x) columns in Table 7.1, the same IV strength achieves bigger identification gains for ρ = 0.5 than for ρ = 0.8. This is consistent with the results in Section 6: as ρ and ATE(x) are both positive in this case, the lower the absolute value of ρ, the higher the IIP(x). For example, for IV set (4), the Manski bound width is reduced by a smaller amount when ρ = 0.8, and the reduction increases to 0.625 (or 62.5 percentage points) when ρ = 0.5. The equally most powerful IV sets are (4) and (5), and the least powerful set is (1).

We next present the finite-sample estimation of the Manski and SV bounds, and conduct the decomposition analysis based on the estimates of the bounds. The sample size is set to n = 500, 5000 and 10000, and each design is replicated M = 1000 times. Tables 7.2 to 7.5 present the sample averages (over the M replications) of the estimated bounds, the estimated C_1(x) to C_4(x), and IIP(x) for the five IV sets at x = 0. We use the "half-median-unbiased estimator" (HMUE) of the intersection bounds proposed by Chernozhukov, Lee, and Rosen (2013) (hereafter CLR) to estimate the benchmark Manski bounds and the SV bounds. In particular, we employ maximum likelihood estimation (MLE) to estimate the bounding functions and select the critical values for bias correction according to the simulation-based methodology of CLR. [Footnote: The CLR half-median-unbiased estimator produces an upper bound estimator that exceeds its true value and a lower bound estimator that falls below its true value, each with probability at least one half asymptotically. We report the HMUE of the Manski bounds for comparison purposes; other estimation methods for the Manski bounds are also available, see e.g. Imbens and Manski (2004).] [Footnote: Theoretically, the construction of the SV bounds requires the matching of pairs (x, z) and (x′, z′) such that Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. In practice, it is hard to find such pairs with exactly equal CPS, especially when the variation of the covariates is limited. In the simulations, the SV bounds are computed by matching (x, z) and (x′, z′) such that |Pr[D = 1 | x, z] − Pr[D = 1 | x′, z′]| < c, with c = 1%. Although the estimated SV bounds depend on c, the estimated IIP(x) does not; therefore the choice of c has no impact on the performance of IIP(x).]

The results of Tables 7.2 to 7.5 relate to the two different covariate distributions (case 1: X ∼ N(0,
1); case 2: X ∼ Bernoulli(1/2)) and the two ρ values (ρ = 0.5, 0.8). In case 1 (Tables 7.2 and 7.3), where the covariate possesses sufficient variation, the true SV bounds point identify the ATE(x) for both ρ = 0.5 and ρ = 0.
8. In case 2 (Tables 7.4 and 7.5), the true SV bounds fail to point identify the ATE(x) due to the limited variation in X.

Next, we focus on the left part of each table, which displays the HMUEs of the ATE bounds and the Hausdorff distance between the true bounds and the estimated bounds, evaluated at x = 0. For all four tables, we can see that the estimated Manski bounds are the same across all five IV sets, always include zero, and have a width a little over one. The estimated SV bounds identify the sign of ATE(x) for all five IV sets. Moreover, the IV sets with greater identification power lead to narrower estimated SV bounds and also improve the estimation accuracy in most of the scenarios. More precisely, the Hausdorff distance of the estimated SV bounds to the true bounds decreases as the IV identification power increases. Moving to the right part of each table, we first note that, for each given IV set, the estimated C_1(x) to C_4(x) and IIP(x) all converge to their true values as the sample size n increases, indicating that the estimated identification gains are more accurate for larger sample sizes. We also note that the estimated C_1(x), which is determined by the Manski bounds, is the same for different IV sets. This result is quite intuitive, because the identification gains brought by IV validity should not vary with the IV strength. Comparison of Tables 7.2 and 7.3, or of Tables 7.4 and 7.5, also reveals that the impact of the endogeneity degree on IV identification power is captured by the estimated IIP(x). Importantly, the true ranking of IIP(x) in Table 7.1 is correctly revealed by the finite-sample estimates of IIP(x).

It is interesting to analyze the effect of adding an additional but completely irrelevant IV on the finite-sample performance of ATE partial identification, by comparing the results obtained using IV sets (4) and (5).
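Since the estimated and true bounds here are closed intervals, the Hausdorff distance reduces to the larger of the two endpoint discrepancies. A short Python sketch (the function name is ours; the interval values are illustrative):

```python
def hausdorff_interval(a, b):
    """Hausdorff distance between two non-empty closed intervals
    a = (a_lo, a_hi) and b = (b_lo, b_hi). On the real line it
    equals max(|a_lo - b_lo|, |a_hi - b_hi|)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return max(abs(a_lo - b_lo), abs(a_hi - b_hi))

# Estimated SV bounds vs. a point-identified true bound at ATE(x) = 0.341
d = hausdorff_interval((0.117, 0.775), (0.341, 0.341))
assert abs(d - 0.434) < 1e-9
```

With the estimated SV bounds [0.117, 0.775] and a point-identified true bound at the true ATE(x) = 0.341, this returns 0.434, consistent with the d_H(x) value reported for IV set (1) at n = 500 in Table 7.2.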
Adding Z_3 to (Z_1, Z_2) actually produces a small decrease in the estimated IIP(x), on average, for almost all of the DGP designs considered in this section. The Cramér–von Mises test and the Kolmogorov–Smirnov test confirm that the average values of the estimates of IIP(x) under scenario (4) are significantly different from those obtained under scenario (5) when the sample size is N = 500 or N = 5000, for both endogeneity degrees and for both case 1 and case 2. When the sample size is sufficiently large, N = 10000, the estimates of IIP(x) under scenarios (4) and (5) are no longer significantly different, except [Footnote: Simulation results of the bounds at other values of x display similar patterns to those at x = 0 and are therefore not reported, due to space limitations.] [Footnote: The Hausdorff distance between sets A and B is defined as max{sup_{a∈A} d(a, B), sup_{b∈B} d(b, A)}, where d(b, A) := inf_{a∈A} ||b − a||, and is taken to be ∞ if either A or B is empty. The Hausdorff distance is a natural generalization of Euclidean distance and has been employed to study convergence properties when a set, rather than a point, is the parameter of interest; see e.g. Hansen et al. (1995), Manski and Tamer (2002) and Chernozhukov et al. (2007).] [Footnote: Because C_1(x) to C_4(x) are functions of L_M(x), U_M(x), ω(x) and ω_SV(x), the estimates of C_1(x) to C_4(x) are computed using the HMUE of the bounds or their widths. We compute ω(x) as the width of the estimated bounds (by the HMUE of CLR) [L_SV(x), U_SV(x)] in (14) if ATE(x) > 0.] when ρ = 0.
8. This suggests that, in practice, the loss of information (efficiency) that arises from using an irrelevant IV can have a statistically significant practical effect on the IV identification power, which can be captured by our proposed index IIP(x). Such an information loss can lead to wider ATE bounds, especially when the covariate possesses limited variation. In particular, from Tables 7.4 and 7.5 we can see that when the covariate X is a binary variable (case 2), on average, the estimated SV bounds using (Z_1, Z_2) are significantly narrower than those estimated with the IV set including the irrelevant IV, (Z_1, Z_2, Z_3), especially for small sample sizes. Analyzing the results across the replications, we find that about 78% (for both endogeneity degrees) of the replications give narrower estimated SV bounds with IV set (Z_1, Z_2) than with (Z_1, Z_2, Z_3) for sample size N = 500; this rate falls to 53% (ρ = 0.5) and 64% (ρ = 0.8) for the sufficiently large sample size N = 10000.

On the other hand, IV irrelevancy cannot always be detected by simply comparing the estimated SV bound widths under different IV sets. That is, adding an irrelevant IV as in (5) can further shrink the SV bound width when the covariate X is continuous, although the improvement occurs at the third decimal place and its degree decreases as the sample size increases. [Footnote: The shrinkage of the estimated SV bounds using the irrelevant Z_3 is due to finite-sample estimation error. In particular, because the estimate of the coefficient of the irrelevant Z_3 is nonzero with probability one, it results in more matched pairs (x, z) and (x′, z′) such that |Pr[D = 1 | x, z] − Pr[D = 1 | x′, z′]| < c (see footnote 12), especially when the covariate is continuous.] These outcomes reinforce a fortiori the warning that simply adding extra IVs without assessing their identification power is unlikely to be a good practical modelling strategy, whereas the finite-sample estimates of our proposed IIP(x) are more reliable in detecting the efficiency loss from IV irrelevancy. Footnote: For case 1, in Tables 7.2 and 7.3 we find that when the sample size is N = 500, (i) there are 22% (ρ = 0.5) and a similar percentage (ρ = 0.
8) of the 1000 replications where at least one (either lower or upper) estimated SV bound using (Z_1, Z_2) is closer to its true value, compared to that obtained by also using the irrelevant IV; and (ii) 12% of the replications yield wider estimated SV bounds when using the irrelevant IV, for both endogeneity degrees.

Table 7.2: Case 1. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.5, X ∼ N(0, 1), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.246, 0.899] | 0.092 | [0.117, 0.775] | 0.434 | 0.246 | 0.186 | 0.056 | 0.658 | 0.432 |
| (2) only Z_2 | | | [0.246, 0.562] | 0.227 | | 0.342 | 0.241 | 0.316 | 0.587 |
| (3) Z_1, Z̃_2 | | | [0.193, 0.759] | 0.418 | | 0.218 | 0.116 | 0.565 | 0.464 |
| (4) Z_1, Z_2 | | | [0.290, 0.455] | 0.121 | | 0.436 | 0.298 | 0.165 | 0.682 |
| (5) Z_1, Z_2, Z_3 | | | [0.300, 0.451] | 0.116 | | 0.424 | 0.324 | 0.151 | 0.670 |
| n = 5000 |
| (1) only Z_1 | [-0.202, 0.846] | 0.030 | [0.121, 0.768] | 0.427 | 0.202 | 0.145 | 0.053 | 0.648 | 0.347 |
| (2) only Z_2 | | | [0.266, 0.372] | 0.078 | | 0.334 | 0.406 | 0.106 | 0.536 |
| (3) Z_1, Z̃_2 | | | [0.221, 0.757] | 0.416 | | 0.194 | 0.116 | 0.536 | 0.395 |
| (4) Z_1, Z_2 | | | [0.312, 0.377] | 0.043 | | 0.446 | 0.335 | 0.066 | 0.648 |
| (5) Z_1, Z_2, Z_3 | | | [0.316, 0.373] | 0.038 | | 0.442 | 0.347 | 0.057 | 0.644 |
| n = 10000 |
| (1) only Z_1 | [-0.198, 0.838] | 0.022 | [0.123, 0.768] | 0.427 | 0.198 | 0.139 | 0.054 | 0.645 | 0.337 |
| (2) only Z_2 | | | [0.263, 0.363] | 0.080 | | 0.331 | 0.407 | 0.101 | 0.528 |
| (3) Z_1, Z̃_2 | | | [0.225, 0.756] | 0.414 | | 0.189 | 0.118 | 0.531 | 0.387 |
| (4) Z_1, Z_2 | | | [0.317, 0.365] | 0.031 | | 0.444 | 0.346 | 0.048 | 0.642 |
| (5) Z_1, Z_2, Z_3 | | | [0.320, 0.362] | 0.027 | | 0.443 | 0.353 | 0.042 | 0.641 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications. The Manski bounds, their d_H(x) and C_1(x) are common to all five IV sets and are reported only in row (1) of each sample-size block.

Table 7.3:
Case 1. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.8, X ∼ N(0, 1), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.157, 0.996] | 0.098 | [0.124, 0.873] | 0.532 | 0.157 | 0.205 | 0.041 | 0.750 | 0.362 |
| (2) only Z_2 | | | [0.233, 0.559] | 0.229 | | 0.382 | 0.288 | 0.326 | 0.539 |
| (3) Z_1, Z̃_2 | | | [0.191, 0.848] | 0.507 | | 0.246 | 0.093 | 0.657 | 0.403 |
| (4) Z_1, Z_2 | | | [0.291, 0.437] | 0.107 | | 0.495 | 0.355 | 0.146 | 0.652 |
| (5) Z_1, Z_2, Z_3 | | | [0.298, 0.431] | 0.100 | | 0.482 | 0.382 | 0.133 | 0.639 |
| n = 5000 |
| (1) only Z_1 | [-0.121, 0.924] | 0.028 | [0.128, 0.860] | 0.519 | 0.121 | 0.149 | 0.042 | 0.732 | 0.271 |
| (2) only Z_2 | | | [0.254, 0.357] | 0.088 | | 0.346 | 0.475 | 0.103 | 0.467 |
| (3) Z_1, Z̃_2 | | | [0.208, 0.853] | 0.512 | | 0.210 | 0.068 | 0.645 | 0.332 |
| (4) Z_1, Z_2 | | | [0.312, 0.378] | 0.043 | | 0.489 | 0.369 | 0.066 | 0.610 |
| (5) Z_1, Z_2, Z_3 | | | [0.315, 0.373] | 0.038 | | 0.486 | 0.380 | 0.058 | 0.607 |
| n = 10000 |
| (1) only Z_1 | [-0.117, 0.918] | 0.022 | [0.129, 0.860] | 0.519 | 0.117 | 0.146 | 0.042 | 0.731 | 0.263 |
| (2) only Z_2 | | | [0.258, 0.357] | 0.083 | | 0.346 | 0.473 | 0.099 | 0.463 |
| (3) Z_1, Z̃_2 | | | [0.212, 0.851] | 0.510 | | 0.209 | 0.071 | 0.639 | 0.326 |
| (4) Z_1, Z_2 | | | [0.316, 0.369] | 0.034 | | 0.491 | 0.374 | 0.053 | 0.607 |
| (5) Z_1, Z_2, Z_3 | | | [0.319, 0.365] | 0.030 | | 0.491 | 0.381 | 0.046 | 0.607 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications.

Table 7.4: Case 2. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.5, X ∼ Bernoulli(1/2), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.263, 0.904] | 0.102 | [0.060, 0.776] | 0.237 | 0.263 | 0.185 | 0.002 | 0.716 | 0.448 |
| (2) only Z_2 | | | [0.110, 0.669] | 0.179 | | 0.359 | -0.014 | 0.559 | 0.621 |
| (3) Z_1, Z̃_2 | | | [0.098, 0.769] | 0.224 | | 0.237 | -0.004 | 0.671 | 0.499 |
| (4) Z_1, Z_2 | | | [0.166, 0.647] | 0.131 | | 0.439 | -0.017 | 0.481 | 0.701 |
| (5) Z_1, Z_2, Z_3 | | | [0.160, 0.656] | 0.140 | | 0.433 | -0.025 | 0.496 | 0.695 |
| n = 5000 |
| (1) only Z_1 | [-0.206, 0.849] | 0.034 | [0.068, 0.769] | 0.223 | 0.206 | 0.148 | 0.000 | 0.701 | 0.354 |
| (2) only Z_2 | | | [0.135, 0.640] | 0.148 | | 0.337 | 0.007 | 0.506 | 0.543 |
| (3) Z_1, Z̃_2 | | | [0.115, 0.754] | 0.207 | | 0.211 | 0.000 | 0.639 | 0.417 |
| (4) Z_1, Z_2 | | | [0.210, 0.619] | 0.079 | | 0.446 | -0.006 | 0.409 | 0.653 |
| (5) Z_1, Z_2, Z_3 | | | [0.208, 0.620] | 0.081 | | 0.444 | -0.007 | 0.412 | 0.650 |
| n = 10000 |
| (1) only Z_1 | [-0.198, 0.841] | 0.024 | [0.069, 0.768] | 0.221 | 0.198 | 0.141 | 0.001 | 0.699 | 0.339 |
| (2) only Z_2 | | | [0.138, 0.640] | 0.145 | | 0.333 | 0.005 | 0.502 | 0.531 |
| (3) Z_1, Z̃_2 | | | [0.118, 0.751] | 0.204 | | 0.207 | 0.000 | 0.633 | 0.406 |
| (4) Z_1, Z_2 | | | [0.216, 0.612] | 0.070 | | 0.447 | -0.006 | 0.396 | 0.645 |
| (5) Z_1, Z_2, Z_3 | | | [0.217, 0.613] | 0.071 | | 0.447 | -0.003 | 0.396 | 0.645 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications.

Table 7.5:
Case 2. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.8, X ∼ Bernoulli(1/2), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.165, 0.972] | 0.084 | [0.077, 0.868] | 0.276 | 0.165 | 0.183 | -0.001 | 0.790 | 0.348 |
| (2) only Z_2 | | | [0.114, 0.751] | 0.212 | | 0.330 | 0.006 | 0.637 | 0.495 |
| (3) Z_1, Z̃_2 | | | [0.133, 0.863] | 0.270 | | 0.243 | -0.001 | 0.730 | 0.408 |
| (4) Z_1, Z_2 | | | [0.209, 0.732] | 0.154 | | 0.458 | -0.008 | 0.523 | 0.623 |
| (5) Z_1, Z_2, Z_3 | | | [0.200, 0.738] | 0.164 | | 0.441 | -0.007 | 0.538 | 0.606 |
| n = 5000 |
| (1) only Z_1 | [-0.117, 0.925] | 0.026 | [0.086, 0.861] | 0.268 | 0.117 | 0.149 | 0.001 | 0.776 | 0.266 |
| (2) only Z_2 | | | [0.144, 0.720] | 0.175 | | 0.340 | 0.010 | 0.576 | 0.457 |
| (3) Z_1, Z̃_2 | | | [0.154, 0.848] | 0.256 | | 0.232 | -0.001 | 0.694 | 0.349 |
| (4) Z_1, Z_2 | | | [0.255, 0.694] | 0.102 | | 0.486 | 0.001 | 0.439 | 0.603 |
| (5) Z_1, Z_2, Z_3 | | | [0.255, 0.696] | 0.105 | | 0.483 | 0.001 | 0.440 | 0.600 |
| n = 10000 |
| (1) only Z_1 | [-0.111, 0.919] | 0.019 | [0.087, 0.860] | 0.267 | 0.111 | 0.146 | 0.000 | 0.773 | 0.257 |
| (2) only Z_2 | | | [0.148, 0.713] | 0.171 | | 0.338 | 0.015 | 0.565 | 0.450 |
| (3) Z_1, Z̃_2 | | | [0.158, 0.846] | 0.253 | | 0.230 | 0.000 | 0.688 | 0.342 |
| (4) Z_1, Z_2 | | | [0.263, 0.693] | 0.100 | | 0.491 | -0.002 | 0.430 | 0.603 |
| (5) Z_1, Z_2, Z_3 | | | [0.263, 0.692] | 0.100 | | 0.489 | 0.001 | 0.429 | 0.601 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications.

Empirical Application: Women LFP and Childbearing
In this section, we apply our novel decomposition and IV evaluation method to study the effects of childbearing on women's labor supply. The dataset analyzed here is from the 1980 Census Public Use Micro Samples (PUMS), available at Angrist and Evans (2009). We follow the data construction in Angrist and Evans (1998), where the sample consists of married women aged 21–35 with two or more children. The dataset contains 254,652 observations; see Table 2 in Angrist and Evans (1998) for more details and descriptive statistics. The binary outcome Y indicates whether an individual was paid for work in the year prior to the census (Y = 1) or not (Y = 0). The treatment effect of interest is the impact of having more than two children on the labor force participation Y. Thus, the binary treatment is D ∈ {0, 1}, with D = 1 denoting having more than two children.

Following Angrist and Evans (1998, Table 11), we use as continuous regressors the woman's age, her age at first birth, and the ages of the first two children (in quarters), and binary regressors for the first child being a boy, the second child being a boy, black, hispanic, and other race, as well as the interactions of the above-mentioned continuous and indicator variables. For computational simplicity, we reduce the dimension of the covariates by utilizing the conditional propensity score X_P := P̂r[D = 1 | X] as a covariate, where P̂r[D = 1 | X] is estimated via a probit model and X includes all of the regressors mentioned above. Three sets of IVs are considered in this section: (1) the binary indicator that the first two children are the same sex ("Samesex"), (2) the binary indicator that the second birth was a twin ("Twins"), and (3) both indicators ("Both = {Samesex, Twins}"). To provide a basis for comparison of the SV bounds with other ATE bounding analyses, we also compute the ATE bounds of Heckman and Vytlacil (2001) (hereafter HV bounds) and Chesher (2010) (hereafter Chesher bounds). To be consistent with our previous numerical analyses in Section 7, we use the method of CLR to compute all four bounds of interest, via MLE for estimating the bounding functions and the simulation-based method for correcting the bias of the intersecting bounds.

Table 8.1 reports the weighted averages of the HMUE and of the CLR two-sided confidence intervals (at the 90%, 95% and 99% significance levels) of the four bounds of ATE(X_P), with weights given by the estimated kernel density of X_P. Panels (a), (b) and (c) display the results using IV Samesex, Twins and
Both, respectively. The estimated averages of the Manski bounds in all three panels are essentially identical, since the Manski bounds do not depend on IVs. In all panels, the HV bounds improve on the benchmark Manski bounds, with the HV bound width using Twins being narrower than that using Samesex, and the HV bound width using Both being the narrowest. The Chesher bounds using Samesex fail to identify the sign of ATE(X_P), as the estimate is a union of a negative and a positive interval. When the IV Twins or Both is used instead, the weighted average of the 95% confidence interval of the Chesher bounds is [−0.349, −0.019] (using Twins) or [−0.335, −0.026] (using Both), revealing negative effects of having a third child on women's labor force participation. For the SV bounds, the results using the IV Twins or Both dramatically outperform those using Samesex: the 95% confidence intervals using Samesex, Twins and Both are [−0.548, −0.022], [−0.272, −0.031] and [−0.269, −0.042], respectively. To summarize the results above, we can see that for the ATE bounds in which the IV plays a key role in extracting identifying information, i.e. the HV, Chesher and SV bounds, the IV
Both gives us the narrowest bounds (on average).

The ranking of the IV identification power of the three available IVs revealed by the discussion above is confirmed and explained by the identification gains decomposition and the IIP reported in Table 8.2. The results based on the 95% confidence interval show that, given the same contribution of IV validity for the three IVs, which is 44.6% on average, the identification power of Twins (68.2%) is significantly larger than that of Samesex (47.1%). Closer inspection of the data reveals that the contribution of Twins to the identification gains exceeds that of Samesex because whenever Twins = 1 the treatment D = 1, i.e. Twins is a perfect predictor of being treated, whereas this is not the case for Samesex. It is this feature, of course, that explains the superior performance when the HV, Chesher and SV bounds are evaluated using Twins rather than Samesex. Moreover, when both IVs Samesex and Twins are used, the identification power of Both (70.3%) exceeds that of either single IV, Samesex or Twins. This indicates that although the identification power of Samesex is dominated by Twins, Samesex can still make an extra contribution when identifying the ATE. This is intuitive because the mechanisms through which the two IVs drive the probability of having a third child are different. One remark on the above analysis is that, for other ATE bounds that exploit the identifying information of IVs, for example the HV and Chesher bounds, IVs with a higher IIP clearly lead to narrower bounds for the ATE. This indicates that although the IIP is constructed to measure the IV's contribution to the SV bounds, it is also a meaningful measure of IV identification power and can be used to indicate IV relevance in other ATE bounds.

The two-stage least squares (2SLS) estimates of Angrist and Evans (1998, Table 11) give an ATE estimate of −0.123 using Samesex and an estimate of −0.087 using Twins, each with an associated 95% confidence interval. As would be expected, the 95% two-sided confidence intervals of all four bounds cover the 2SLS estimates and their associated 95% confidence intervals for both IVs.

(a) IV: Samesex
Manski HV Chesher SV
HMUE [-0.560,0.439] [-0.537,0.401] [-0.537,-0.011] ∪ [0.011,0.401] [-0.538,-0.030]
90% CI [-0.566,0.445] [-0.546,0.411] [-0.546,-0.005] ∪ [0.005,0.411] [-0.546,-0.023]
95% CI [-0.567,0.446] [-0.548,0.412] [-0.548,-0.004] ∪ [0.004,0.412] [-0.548,-0.022]
99% CI [-0.569,0.448] [-0.551,0.416] [-0.551,-0.001] ∪ [0.001,0.416] [-0.551,-0.020]

(b) IV: Twins
Manski HV Chesher SV
HMUE [-0.560,0.439] [-0.304,0.113] [-0.305,-0.061] [-0.185,-0.101]
90% CI [-0.566,0.445] [-0.341,0.151] [-0.342,-0.026] [-0.259,-0.042]
95% CI [-0.567,0.446] [-0.349,0.158] [-0.349,-0.019] [-0.272,-0.031]
99% CI [-0.569,0.448] [-0.364,0.172] [-0.365,-0.004] [-0.299,-0.012]

(c) IV: Both = {Samesex, Twins}
Manski HV Chesher SV
HMUE [-0.560,0.439] [-0.295,0.097] [-0.295,-0.065] [-0.200,-0.105]
90% CI [-0.566,0.445] [-0.329,0.131] [-0.329,-0.032] [-0.259,-0.051]
95% CI [-0.567,0.446] [-0.336,0.137] [-0.335,-0.026] [-0.269,-0.042]
99% CI [-0.569,0.448] [-0.349,0.151] [-0.349,-0.011] [-0.289,-0.027]
Note: The first row of panels (a)-(c) reports the weighted average of the HMUE of the four ATE bounds, and the second to fourth rows report the weighted averages of the CLR two-sided confidence intervals at different significance levels.
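The Manski column above serves as the no-IV benchmark, and its near-unit width is no accident: for a binary outcome, the no-assumption bounds of Manski (1990) always have width exactly one. The following minimal Python sketch illustrates the construction; the conditional probability values are hypothetical, chosen only for illustration.

```python
def manski_bounds(p_y1d1, p_y1d0, p_d1):
    """No-assumption (Manski, 1990) bounds on ATE(x) for a binary outcome.

    Inputs are the observable conditional probabilities
    p_y1d1 = Pr[Y=1, D=1 | x], p_y1d0 = Pr[Y=1, D=0 | x], p_d1 = Pr[D=1 | x].
    Pr[Y_1 = 1 | x] lies in [p_y1d1, p_y1d1 + (1 - p_d1)] and
    Pr[Y_0 = 1 | x] lies in [p_y1d0, p_y1d0 + p_d1]."""
    lower = p_y1d1 - (p_y1d0 + p_d1)          # smallest Y_1-prob minus largest Y_0-prob
    upper = (p_y1d1 + 1.0 - p_d1) - p_y1d0    # largest Y_1-prob minus smallest Y_0-prob
    return lower, upper

# Hypothetical conditional probabilities, for illustration only.
lo, hi = manski_bounds(p_y1d1=0.30, p_y1d0=0.25, p_d1=0.40)
print(lo, hi)   # the width hi - lo is always exactly 1
```

This is why the Manski intervals in panels (a)-(c) are essentially identical and roughly one unit wide: all of the narrowing must come from the IVs and the covariates.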
Table 8.2: Decomposition of Identification Gains and Instrument Identification Power

(a) IV: Samesex
                  C1     C2     C3     C4     IIP
Based on HMUE     0.439  0.034  0.019  0.508  0.473
Based on 90% CI   0.445  0.026  0.018  0.523  0.472
Based on 95% CI   0.446  0.024  0.018  0.526  0.471
Based on 99% CI   0.448  0.021  0.019  0.532  0.471

(b) IV: Twins
                  C1     C2     C3     C4     IIP
Based on HMUE     0.439  0.317  0.163  0.081  0.756
Based on 90% CI   0.445  0.250  0.100  0.216  0.695
Based on 95% CI   0.446  0.236  0.090  0.242  0.682
Based on 99% CI   0.448  0.209  0.075  0.286  0.657

(c) IV: Both = {Samesex, Twins}
                  C1     C2     C3     C4     IIP
Based on HMUE     0.439  0.330  0.134  0.096  0.769
Based on 90% CI   0.445  0.270  0.090  0.206  0.715
Based on 95% CI   0.446  0.257  0.085  0.226  0.703
Based on 99% CI   0.448  0.232  0.078  0.260  0.681

Note: C1–C4 and IIP are the weighted averages of their associated conditional estimates given X_P, with the kernel density of X_P as weights. For panels (a) to (c), C1 to C4 are computed as described in footnote 14, and the estimates in each row correspond to different significance levels of the CLR estimation.
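The kernel-density weighting described in the note can be sketched as follows. This is a minimal illustration, not the paper's estimation code: the Gaussian kernel, the bandwidth, and the simulated propensity scores and conditional IIP(x) curve are all hypothetical.

```python
import numpy as np

def kde_weighted_average(xp_sample, grid, estimates, bandwidth=0.05):
    """Average conditional estimates over a grid of X_P values, weighting
    each grid point by a Gaussian kernel density estimate of X_P, so that
    propensity-score regions with more observations receive more weight."""
    z = (grid[:, None] - xp_sample[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).mean(axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))
    weights = density / density.sum()                 # normalize to sum to one
    return float(np.sum(weights * estimates))

rng = np.random.default_rng(1)
xp = rng.beta(2, 5, size=1000)       # hypothetical estimated propensity scores X_P
grid = np.linspace(0.01, 0.99, 50)   # points at which the conditional estimates live
iip_grid = 0.7 - 0.3 * grid          # a hypothetical declining IIP(x) curve
avg = kde_weighted_average(xp, grid, iip_grid)
print(avg)
```

Weighting by the density of X_P ensures that propensity-score regions containing many observations dominate the reported averages, as in Tables 8.1 and 8.2.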
To explore the heterogeneity of the treatment effects, Figure 8.1 graphs the four bounds of interest against X_P. From Figure 8.1, we can see that when the more powerful of the three IVs are employed, namely Twins or Both, the HV bounds narrow down the possible range of ATE(X_P) relative to the benchmark Manski bounds, especially for individuals with a small probability of having a third child. In addition, they can even identify the negative effect for individuals with a propensity score X_P close to zero. Similar properties are exhibited by the Chesher bounds. The SV bounds indicate that for women who are less likely to have more than two children, it is more probable that there will be a negative effect on their labor force participation once they have a third child, roughly in the region of −10% to −15%. For individuals who are more likely to have more than two children, the effect of having a third child is still negative but with a larger possible range, roughly from −10% to −40% when their propensity score is about 0.6, and roughly from 0% to −30% when their propensity score is close to one.

To check the heterogeneity of the IV identification power, Figure 8.2 displays the decompositions plotted against X_P. It is obvious that the IV identification power of Twins and Both is significantly larger than that of Samesex, across all possible values of X_P. Furthermore, the contribution of the covariate appears to be amplified when Twins is involved in deriving the bounds, leading to a further reduction in the width of the unexplained part relative to the benchmark.
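The conditional IIP(x) curves in Figure 8.2 are driven by the width formula of Proposition 3.2: for a positive ATE(x), IIP(x) = 1 − ω̄(x), where ω̄(x) is the sum of Pr[Y=1, D=1 | x, ·] at the lowest attainable propensity score and Pr[Y=0, D=0 | x, ·] at the highest. The sketch below evaluates this by Monte Carlo under an illustrative DGP — the bivariate-normal errors, the index values m(1) = 0.5 and m(0) = −0.5, the endogeneity ρ = 0.5, and the two propensity-score ranges are all hypothetical choices, not estimates from the application.

```python
import numpy as np

def iip_mc(nu1, nu0, rho, p_lo, p_hi, n=400_000, seed=0):
    """Monte Carlo sketch of IIP(x) = 1 - omega_bar(x) for a positive ATE(x),
    where omega_bar(x) = Pr[Y=1,D=1 | x, p_lo] + Pr[Y=0,D=0 | x, p_hi] is the
    widest possible SV width.  (e1, e2) are bivariate standard normal with
    correlation rho; Y = 1{e1 < nu(D, x)} and D = 1{F(e2) < p}, with F(e2)
    approximated by the empirical CDF of the simulated e2."""
    rng = np.random.default_rng(seed)
    e = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    e1, e2 = e[:, 0], e[:, 1]
    u2 = e2.argsort().argsort() / (n - 1.0)           # empirical CDF of e2
    pr_y1d1_lo = np.mean((e1 < nu1) & (u2 < p_lo))    # Pr[Y=1, D=1 | p_lo]
    pr_y0d0_hi = np.mean((e1 >= nu0) & (u2 >= p_hi))  # Pr[Y=0, D=0 | p_hi]
    return 1.0 - (pr_y1d1_lo + pr_y0d0_hi)

# A weak IV barely moves the propensity score; a strong one moves it a lot.
weak = iip_mc(nu1=0.5, nu0=-0.5, rho=0.5, p_lo=0.40, p_hi=0.60)
strong = iip_mc(nu1=0.5, nu0=-0.5, rho=0.5, p_lo=0.05, p_hi=0.95)
print(round(weak, 2), round(strong, 2))
```

Widening the attainable propensity-score range shrinks both residual probabilities, so the IIP rises toward one — the same mechanism that makes Twins, a near-perfect predictor of treatment when Twins = 1, so much more informative than Samesex.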
Figure 8.1: Estimated Bounds of ATE(x). Note: Panels (a)-(c) plot the estimated ATE(x) bounds (Manski, HV, Chesher and SV) as functions of the propensity score X_P, for IV Samesex, Twins and Both, respectively. The red lines are the upper bounds and the blue lines are the lower bounds, and the blue shaded areas represent the confidence regions.

Figure 8.2: Decomposition of Identification Gains. Note: Panels (a)-(c) depict the estimated decomposition of identification gains over the Manski bound width (C_j(x)/ω_M(x), j = 1, 2, 3, 4) against the conditional probability of being treated, X_P = P̂r(D = 1 | X), for IV Samesex, Twins and Both, respectively.

Conclusion

In this paper we explore the factors that determine the identification gains for the ATE in models with binary endogenous variables. We use the reduction in the size of the ATE identification set as a measure of identification power, and conduct our analysis with the identification gains achieved by the SV bounds (Shaikh and Vytlacil, 2011) against the benchmark Manski bounds (Manski, 1990). We decompose the identification gains into the impacts of the IV validity, the IV strength and the variability of the exogenous covariates. More importantly, we construct the index "IIP" as a measure of the IV identification power.

We have developed theoretical results to show the complex mechanism through which IVs affect the identification of the ATE. We find that the IV identification power in a nonparametric and partially identified model is fundamentally different from the traditional understanding of IV strength in a parametric linear model, as measured, for instance, by the pseudo-R² or F-statistic from the reduced-form treatment equation. We have shown that in partially identified non-linear models it is not only the traditional IV strength that determines the identification gains obtained when bounding the ATE, but also the interplay of the IVs with the degree of endogeneity and the variability of the exogenous covariates. The conventional notion of IV strength or weakness no longer provides a full picture of the IV identification power.
The IIP provides a more appropriate measure of IV identification power, namely, the contribution made by the IVs in shrinking the ATE identified set. Importantly, we illustrate how the range of the conditional propensity score and the IIP relate to the ATE bounds for different levels of endogeneity, finite sample sizes and covariate variabilities. The results show that the IIP works well in finite sample settings as a tool for measuring the IV identification power and for providing guidance on detecting irrelevant IVs. We find that missing IVs, or misspecification of relevant IVs, can result in wider ATE identified sets and a loss of identification power. We also find that the loss of efficiency in finite samples from adding an irrelevant IV can be more reliably detected by the estimated IIP(x), even though an irrelevant IV can sometimes result in a narrower SV bound width. The empirical application also demonstrates the practical usefulness of our novel decomposition of the identification gains and of the IIP index.

The study of the IIP in this paper sheds new light on IV relevance in partial identification frameworks, and offers a potential criterion for IV selection in high-dimensional settings. It also raises new questions as to what constitutes an adequate definition of weak IVs in conjunction with ATE bounding analyses. Explorations of these issues are left for future research.
References
Ahn, H. and J. L. Powell (1993): "Semiparametric estimation of censored selection models with a nonparametric selection mechanism," Journal of Econometrics, 58, 3–29.
Angrist, J. and W. Evans (1998): "Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size," American Economic Review, 88, 450–477.
Angrist, J. D. and W. N. Evans (2009): "Replication data for: Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size," https://doi.org/10.7910/DVN/4W9GW2, Harvard Dataverse, V1, UNF:3:gmuGDmy3Gcf/k1/lAJqw/A==.
Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2012): "Treatment effect bounds: An application to Swan–Ganz catheterization," Journal of Econometrics, 168, 223–243.
Breusch, T., H. Qian, P. Schmidt, and D. Wyhowski (1999): "Redundancy of moment conditions," Journal of Econometrics, 91, 89–111.
Chernozhukov, V., H. Hong, and E. Tamer (2007): "Estimation and confidence regions for parameter sets in econometric models," Econometrica, 75, 1243–1284.
Chernozhukov, V., S. Lee, and A. M. Rosen (2013): "Intersection bounds: Estimation and inference," Econometrica, 81, 667–737.
Chesher, A. (2005): "Nonparametric identification under discrete variation," Econometrica, 73, 1525–1550.
——— (2010): "Instrumental variable models for discrete outcomes," Econometrica, 78, 575–601.
Chiburis, R. C. (2010): "Semiparametric bounds on treatment effects," Journal of Econometrics, 159, 267–275.
Flores, C. A. and X. Chen (2018): Average Treatment Effect Bounds with an Instrumental Variable: Theory and Practice, Springer.
Frazier, D. T., E. Renault, L. Zhang, and X. Zhao (2019): "Weak instruments test in discrete choice models," paper presented at the 2019 North American Summer Meeting of the Econometric Society, Seattle, Washington, June 28, 2019.
Freedman, D. A. and J. S. Sekhon (2010): "Endogeneity in probit response models," Political Analysis, 18, 138–150.
Hall, A. R. (2005): Generalized Method of Moments, Oxford University Press.
Hall, A. R., A. Inoue, K. Jana, and C. Shin (2007): "Information in generalized method of moments estimation and entropy-based moment selection," Journal of Econometrics, 138, 488–512.
Hall, A. R. and F. P. Peixe (2003): "A consistent method for the selection of relevant instruments," Econometric Reviews, 22, 269–287.
Han, S. and S. Lee (2019): "Estimation in a generalization of bivariate probit models with dummy endogenous regressors," Journal of Applied Econometrics, 34, 994–1015.
Han, S. and E. J. Vytlacil (2017): "Identification in a generalization of bivariate probit models with dummy endogenous regressors," Journal of Econometrics, 199, 63–73.
Hansen, L. P., J. Heaton, and E. G. Luttmer (1995): "Econometric evaluation of asset pricing models," The Review of Financial Studies, 8, 237–274.
Heckman, J. (1990): "Varieties of selection bias," The American Economic Review, 80, 313–318.
Heckman, J. J. (1978): "Dummy endogenous variables in a simultaneous equation system," Econometrica, 46, 931–959.
Heckman, J. J. and R. Robb (1985): "Alternative methods for evaluating the impact of interventions: An overview," Journal of Econometrics, 30, 239–267.
——— (1986): "Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes," in Drawing Inferences from Self-Selected Samples, Springer, 63–107.
Heckman, J. J., S. Urzua, and E. Vytlacil (2006): "Understanding instrumental variables in models with essential heterogeneity," The Review of Economics and Statistics, 88, 389–432.
Heckman, J. J. and E. Vytlacil (2005): "Structural equations, treatment effects, and econometric policy evaluation," Econometrica, 73, 669–738.
Heckman, J. J. and E. J. Vytlacil (1999): "Local instrumental variables and latent variable models for identifying and bounding treatment effects," Proceedings of the National Academy of Sciences, 96, 4730–4734.
——— (2001): "Instrumental variables, selection models, and tight bounds on the average treatment effect," in Econometric Evaluation of Labour Market Policies, Springer, 1–15.
Imbens, G. W. and J. D. Angrist (1994): "Identification and estimation of local average treatment effects," Econometrica, 62, 467–475.
Imbens, G. W. and C. F. Manski (2004): "Confidence intervals for partially identified parameters," Econometrica, 72, 1845–1857.
Joe, H. (1997): Multivariate Models and Multivariate Dependence Concepts, Chapman and Hall/CRC.
Kitagawa, T. (2009): "Identification region of the potential outcome distributions under instrument independence," Tech. rep., CEMMAP Working Paper.
——— (2015): "A test for instrument validity," Econometrica, 83, 2043–2063.
Li, C., D. S. Poskitt, and X. Zhao (2018): "Bounds for average treatment effect: A comparison of nonparametric and quasi maximum likelihood estimators," Tech. rep., Working Paper, Monash University.
——— (2019): "The bivariate probit model, maximum likelihood estimation, pseudo true parameters and partial identification," Journal of Econometrics, 209, 94–113.
Machado, C., A. Shaikh, and E. Vytlacil (2013): "Instrumental variables and the sign of the average treatment effect," unpublished manuscript, Getúlio Vargas Foundation, University of Chicago, and New York University.
Maddala, G. S. (1986): Limited-Dependent and Qualitative Variables in Econometrics, 3, Cambridge University Press.
Manski, C. F. (1990): "Nonparametric bounds on treatment effects," The American Economic Review, 80, 319–323.
Manski, C. F. and E. Tamer (2002): "Inference on regressions with interval data on a regressor or outcome," Econometrica, 70, 519–546.
Marra, G. and R. Radice (2011): "Estimation of a semiparametric recursive bivariate probit model in the presence of endogeneity," Canadian Journal of Statistics, 39, 259–279.
Mourifié, I. and R. Méango (2014): "A note on the identification in two equations probit model with dummy endogenous regressor," Economics Letters, 125, 360–363.
Rosenbaum, P. R. and D. B. Rubin (1983): "The central role of the propensity score in observational studies for causal effects," Biometrika, 70, 41–55.
Shaikh, A. and E. Vytlacil (2005): "Threshold crossing models and bounds on treatment effects: A nonparametric analysis," Tech. rep., National Bureau of Economic Research.
Shaikh, A. M. and E. J. Vytlacil (2011): "Partial identification in triangular systems of equations with binary dependent variables," Econometrica, 79, 949–955.
Swanson, S. A., M. A. Hernán, M. Miller, J. M. Robins, and T. S. Richardson (2018): "Partial identification of the average treatment effect using instrumental variables: Review of methods for binary instruments, treatments, and outcomes," Journal of the American Statistical Association, 113, 933–947.
Tamer, E. (2010): "Partial identification in econometrics," Annual Review of Economics, 2, 167–195.
Veall, M. R. and K. F. Zimmermann (1992): "Pseudo-R²'s in the ordinal probit model," Journal of Mathematical Sociology, 16, 333–342.
——— (1996): "Pseudo-R² measures for some common limited dependent variable models," Journal of Economic Surveys, 10, 241–259.
Vuong, Q. and H. Xu (2017): "Counterfactual mapping and individual treatment effects in nonseparable models with binary endogeneity," Quantitative Economics, 8, 589–610.
Vytlacil, E. and N. Yildiz (2007): "Dummy endogenous variables in weakly separable models," Econometrica, 75, 757–779.
Wilde, J. (2000): "Identification of multiple equation probit models with endogenous dummy regressors," Economics Letters, 69, 309–312.
Windmeijer, F. A. (1995): "Goodness-of-fit measures in binary choice models," Econometric Reviews, 14, 101–116.
A Appendix
Throughout the proofs, let P = Pr[D = 1 | X, Z] with support Ω_P, and let p(x, z) = Pr[D = 1 | x, z].

A.1 Lemmas
Lemma A.1
Under Assumption 2.1 (a) and (b), for any p, p′ ∈ Ω_{P|x} such that p > p′, we have

Pr[D = 0 | x, p] + Pr[Y = y, D = 1 | x, p] − {Pr[D = 0 | x, p′] + Pr[Y = y, D = 1 | x, p′]} ≤ 0,
Pr[D = 1 | x, p] + Pr[Y = y, D = 0 | x, p] − {Pr[D = 1 | x, p′] + Pr[Y = y, D = 0 | x, p′]} ≥ 0,

for y ∈ {0, 1}. In addition,

Pr[Y = y, D = 1 | x, p] − Pr[Y = y, D = 1 | x, p′] ≥ 0,
Pr[Y = y, D = 0 | x, p] − Pr[Y = y, D = 0 | x, p′] ≤ 0,

for y ∈ {0, 1}. Lastly, if ν(1, x) > ν(0, x) given x ∈ Ω_X, then Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′] ≥ 0. If ν(1, x) ≤ ν(0, x) given x ∈ Ω_X, then Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′] ≤ 0. Strict inequalities hold if Assumption 2.1 (c) is imposed on the DGP.
Proof of Lemma A.1.
Under Assumption 2.1 (a) and (b), for p, p′ ∈ Ω_{P|x} with p > p′, we have

Pr[D = 0 | x, p] + Pr[Y = 1, D = 1 | x, p] − {Pr[D = 0 | x, p′] + Pr[Y = 1, D = 1 | x, p′]}
= Pr[ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] − Pr[p′ ≤ F_{ε_2}(ε_2) < p]
= −Pr[ε_1 ≥ ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≤ 0.

Similar manipulations show that

Pr[D = 0 | x, p] + Pr[Y = 0, D = 1 | x, p] − {Pr[D = 0 | x, p′] + Pr[Y = 0, D = 1 | x, p′]} ≤ 0,
Pr[D = 1 | x, p] + Pr[Y = 1, D = 0 | x, p] − {Pr[D = 1 | x, p′] + Pr[Y = 1, D = 0 | x, p′]} ≥ 0, and
Pr[D = 1 | x, p] + Pr[Y = 0, D = 0 | x, p] − {Pr[D = 1 | x, p′] + Pr[Y = 0, D = 0 | x, p′]} ≥ 0.

In addition, using relatively straightforward if somewhat tedious algebra, we can obtain the following inequalities:

Pr[Y = 0, D = 1 | x, p] − Pr[Y = 0, D = 1 | x, p′] = Pr[ε_1 ≥ ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≥ 0,
Pr[Y = 1, D = 1 | x, p] − Pr[Y = 1, D = 1 | x, p′] = Pr[ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≥ 0,
Pr[Y = 0, D = 0 | x, p] − Pr[Y = 0, D = 0 | x, p′] = −Pr[ε_1 ≥ ν(0, x), p′ ≤ F_{ε_2}(ε_2) < p] ≤ 0, and
Pr[Y = 1, D = 0 | x, p] − Pr[Y = 1, D = 0 | x, p′] = −Pr[ε_1 < ν(0, x), p′ ≤ F_{ε_2}(ε_2) < p] ≤ 0.

Now suppose that ν(1, x) > ν(0, x) given x ∈ Ω_X. Then it follows that

Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′]
= Pr[Y = 1, D = 1 | x, p] + Pr[Y = 1, D = 0 | x, p] − Pr[Y = 1, D = 1 | x, p′] − Pr[Y = 1, D = 0 | x, p′]
= Pr[ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] − Pr[ε_1 < ν(0, x), p′ ≤ F_{ε_2}(ε_2) < p]
= Pr[ν(0, x) ≤ ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≥ 0.

Finally, using a parallel argument in the case where ν(1, x) ≤ ν(0, x) given x ∈ Ω_X, we can conclude that the inequalities stated in the lemma hold.

Lemma A.2
Under Assumptions 2.1 and 3.1, the following results hold. The joint probabilities Pr[Y = y, D = d | x, p] for y, d ∈ {0, 1} are functions of the dependence parameter ρ. In addition,
(a) Pr[Y = 1, D = 1 | x, p] and Pr[Y = 0, D = 0 | x, p] are weakly increasing in ρ;
(b) Pr[Y = 1, D = 0 | x, p] and Pr[Y = 0, D = 1 | x, p] are weakly decreasing in ρ.

Proof of Lemma A.2.
For any given p ∈ Ω_P,

Pr[Y = 1, D = 1 | x, p] = Pr[ε_1 < ν(1, x), F_{ε_2}(ε_2) < p | x, p]
= Pr[ε_1 < ν(1, x), F_{ε_2}(ε_2) < p]
= C(F_{ε_1}(ν(1, x)), p; ρ). (10)

Because the copula C(·, ·; ρ) satisfies the concordant ordering with respect to ρ, we know that Pr[Y = 1, D = 1 | x, p] is weakly increasing in ρ. Since

Pr[Y = 0, D = 1 | x, p] = Pr[D = 1 | x, p] − Pr[Y = 1, D = 1 | x, p] = p − C(F_{ε_1}(ν(1, x)), p; ρ),

Pr[Y = 0, D = 1 | x, p] is decreasing in ρ. In addition,

Pr[Y = 0, D = 0 | x, p] = Pr[ε_1 ≥ ν(0, x), F_{ε_2}(ε_2) ≥ p | x, p]
= Pr[ε_1 ≥ ν(0, x), F_{ε_2}(ε_2) ≥ p]
= Pr[ε_1 ≥ ν(0, x)] − Pr[ε_1 ≥ ν(0, x), F_{ε_2}(ε_2) < p]
= Pr[ε_1 ≥ ν(0, x)] − Pr[F_{ε_2}(ε_2) < p] + Pr[ε_1 < ν(0, x), F_{ε_2}(ε_2) < p]
= 1 − F_{ε_1}(ν(0, x)) − p + C(F_{ε_1}(ν(0, x)), p; ρ). (11)

From (11) we can see that Pr[Y = 0, D = 0 | x, p] is weakly increasing in ρ, which immediately implies that Pr[Y = 1, D = 0 | x, p] is weakly decreasing in ρ.

A.2 Proofs
Proof of Proposition 3.1.
To begin, let us first introduce the following notation:

L⁰(x, p) = Pr[Y = 1, D = 0 | x, p] + sup_{x′ ∈ X⁰⁻(x)} Pr[Y = 1, D = 1 | x′, p],
L¹(x, p) = Pr[Y = 1, D = 1 | x, p] + sup_{x′ ∈ X¹(x)} Pr[Y = 1, D = 0 | x′, p],
U⁰(x, p) = Pr[Y = 1, D = 0 | x, p] + p · inf_{x′ ∈ X⁰(x)} Pr[Y = 1 | x′, p, D = 1],
U¹(x, p) = Pr[Y = 1, D = 1 | x, p] + (1 − p) · inf_{x′ ∈ X¹⁻(x)} Pr[Y = 1 | x′, p, D = 0].

Then the SV bounds become

L_SV(x) = L¹(x, p̄(x)) − U⁰(x, p̲(x)) and U_SV(x) = U¹(x, p̄(x)) − L⁰(x, p̲(x)), (12)

and under Assumption 2.1 the SV bounds are sharp if Ω_{X,P} = Ω_X × Ω_P (Shaikh and Vytlacil, 2011, Theorem 2.1).

Next we show that L⁰(x, p) is weakly decreasing in p (ceteris paribus). Under Assumption 2.1 and Ω_{X,P} = Ω_X × Ω_P, for every x ∈ Ω_X there exists x_l ∈ X⁰⁻(x) such that ν(1, x_l) = sup_{x′ ∈ X⁰⁻(x)} ν(1, x′) and

L⁰(x, p) = Pr[Y = 1, D = 0 | x, p] + Pr[Y = 1, D = 1 | x_l, p]

(for detailed particulars see the proof of Shaikh and Vytlacil, 2011, Theorem 2.1 (ii); the proof is contained in the supplementary material of Shaikh and Vytlacil (2011)). For p, p′ ∈ Ω_P with p′ < p, we now have

L⁰(x, p) − L⁰(x, p′)
= Pr[Y = 1, D = 0 | x, p] + Pr[Y = 1, D = 1 | x_l, p] − Pr[Y = 1, D = 0 | x, p′] − Pr[Y = 1, D = 1 | x_l, p′]
= Pr[ε_1 ≤ ν(1, x_l), p′ < F_{ε_2}(ε_2) ≤ p] − Pr[ε_1 ≤ ν(0, x), p′ < F_{ε_2}(ε_2) ≤ p]
= −Pr[ν(1, x_l) < ε_1 ≤ ν(0, x), p′ < F_{ε_2}(ε_2) ≤ p] ≤ 0, (13)

since x_l ∈ X⁰⁻(x), and Lemma 2 in Shaikh and Vytlacil (2011) shows that x_l ∈ X⁰⁻(x) implies ν(1, x_l) ≤ ν(0, x). Thus, from (13), L⁰(x, p) is weakly decreasing in p. Similar arguments show that L¹(x, p) is weakly increasing in p, U⁰(x, p) is weakly increasing in p, and U¹(x, p) is weakly decreasing in p. Hence L_SV(x) is weakly increasing in p̄(x) and U_SV(x) is weakly decreasing in p̄(x). On the other hand, L_SV(x) is weakly decreasing in p̲(x) and U_SV(x) is weakly increasing in p̲(x). This completes the proof of the proposition.

Proof of Proposition 3.2.
Suppose that ATE(x) > 0 for x ∈ Ω_X. Under Assumption 2.1, from the definitions of X⁰(x), X⁰⁻(x), X¹(x) and X¹⁻(x), we know that X⁰(x) and X¹(x) are nonempty for every x ∈ Ω_X, since x itself belongs to these two sets, while X⁰⁻(x) and X¹⁻(x) may be empty for some x ∈ Ω_X. Recall that the supremum and infimum over an empty set are defined as zero and one, respectively. Thus, for the four functions defined in the proof of Proposition 3.1 we have

L⁰(x, p) ≥ Pr[Y = 1, D = 0 | x, p],
L¹(x, p) ≥ Pr[Y = 1 | x, p],
U⁰(x, p) ≤ Pr[Y = 1 | x, p], and
U¹(x, p) ≤ Pr[Y = 1, D = 1 | x, p] + Pr[D = 0 | x, p].

The ATE SV bounds are therefore bounded by [L_SV(x), U_SV(x)] ⊂ [L̲_SV(x), Ū_SV(x)], where

L̲_SV(x) = sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p], and
Ū_SV(x) = inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 1 | x, p] + Pr[D = 0 | x, p]} − sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 0 | x, p], (14)

and the widest possible width ω̄(x) := Ū_SV(x) − L̲_SV(x) is

ω̄(x) = inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 1 | x, p] + Pr[D = 0 | x, p]} − sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 0 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] + inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p].

From Lemma A.1 it follows that

ω̄(x) = Pr[Y = 1, D = 1 | x, p̄(x)] + Pr[D = 0 | x, p̄(x)] − Pr[Y = 1, D = 0 | x, p̲(x)] − Pr[Y = 1 | x, p̄(x)] + Pr[Y = 1 | x, p̲(x)]
= Pr[Y = 1, D = 1 | x, p̲(x)] + Pr[Y = 0, D = 0 | x, p̄(x)]. (15)

Now consider the case where ATE(x) < 0. In contrast to the positive ATE(x) case, X⁰⁻(x) and X¹⁻(x) are nonempty for every x ∈ Ω_X, since x itself belongs to these two sets, while X⁰(x) and X¹(x) may be empty for some x ∈ Ω_X. Thus, the following inequalities hold:

L⁰(x, p) ≥ Pr[Y = 1 | x, p],
L¹(x, p) ≥ Pr[Y = 1, D = 1 | x, p],
U⁰(x, p) ≤ Pr[Y = 1, D = 0 | x, p] + Pr[D = 1 | x, p], and
U¹(x, p) ≤ Pr[Y = 1 | x, p],

so that [L_SV(x), U_SV(x)] ⊂ [L̲_SV(x), Ū_SV(x)], where

Ū_SV(x) = inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p], and
L̲_SV(x) = sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 1 | x, p] − inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 0 | x, p] + Pr[D = 1 | x, p]}. (16)

The widest possible width of the SV bounds is now therefore

ω̄(x) = inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 1 | x, p] + inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 0 | x, p] + Pr[D = 1 | x, p]},

and from Lemma A.1 we have that

ω̄(x) = Pr[Y = 1 | x, p̄(x)] − Pr[Y = 1 | x, p̲(x)] − Pr[Y = 1, D = 1 | x, p̄(x)] + Pr[Y = 1, D = 0 | x, p̲(x)] + Pr[D = 1 | x, p̲(x)]
= Pr[Y = 1, D = 0 | x, p̄(x)] + Pr[Y = 0, D = 1 | x, p̲(x)]. (17)

The nature of the relationship between ω̄(x) and p̄(x) and p̲(x) follows directly from the expressions in (15) and (17) upon application of Lemma A.1.

Proof of Proposition 3.3.
The proof follows directly from the expression for ω̄(x) in Proposition 3.2 and Lemma A.2.

Proof of Proposition 3.4.
Without loss of generality, assume that the distribution of ε has been“normalized” to be uniform over [0 , ν ( D, X ) | D indicates that there exists a function m : { , } (cid:55)→ R such that ν ( d, x ) = m ( d ) for all ( d, x ) ∈ { , }× Ω X . Take ATE( x ) to be positive. When H ( x, x (cid:48) ) is well defined and ν ( D, X ) = m ( D ), X ( x ) = X ( x ) = Ω X , and X − ( x ) = X − ( x ) = ∅ .Since ε is continuously distributed we can conclude that ∀ ( x, z ) , ( z (cid:48) , x (cid:48) ) ∈ Ω X,Z such that Pr[ D =1 | z (cid:48) , x (cid:48) ] = Pr[ D = 1 | x, z ] we must have ν ( x, z ) = ν ( z (cid:48) , x (cid:48) ).For L SV ( x ), consider sup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ]. If X ( x ) is empty, or if there does not exista z (cid:48) such that Pr[ D = 1 | x (cid:48) , z (cid:48) ] = p , then sup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ] is set to zero. Since X ( x )equals Ω X because ν ( D, X ) = m ( D ), we have Pr[ D = 1 | x (cid:48) , z (cid:48) ] = p for at least ( z (cid:48) , x (cid:48) ) = ( x, z ), and thussup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ] is well-defined. It follows thatsup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ] = sup x (cid:48) ∈ X ( x ) Pr[ ν (0 , x (cid:48) ) > ε , ν ( x (cid:48) , z (cid:48) ) ≤ ε | x (cid:48) , p ]= sup x (cid:48) ∈ X ( x ) Pr[ m (0) > ε , ν ( x, z ) ≤ ε | x (cid:48) , p ]= sup x (cid:48) ∈ X ( x ) Pr[ m (0) > ε , ν ( x, z ) ≤ ε | x, p ]=Pr[ Y = 1 , D = 0 | x, p ] , (18)where the second equality arises because the CDF of ε is the strictly positive and ν (0 , x (cid:48) ) = m (0) isdegenerate. The third equality is due to the assumed independence of ( X, Z ). 
Similarly,
\[
\begin{aligned}
p \cdot \inf_{x' \in X_1(x)} \Pr[Y=1 \mid x', p, D=1] &= \inf_{x' \in X_1(x)} \Pr[Y=1, D=1 \mid x', p] \\
&= \inf_{x' \in X_1(x)} \Pr[\nu_1(1,x') > \varepsilon_1, \nu_2(x',z') > \varepsilon_2 \mid x', p] \\
&= \inf_{x' \in X_1(x)} \Pr[m(1) > \varepsilon_1, \nu_2(x,z) > \varepsilon_2 \mid x, p] \\
&= \Pr[Y=1, D=1 \mid x, p]. \quad (19)
\end{aligned}
\]
By virtue of equations (18) and (19), and Lemma A.1, $L_{SV}(x)$ can be rewritten as
\[
\begin{aligned}
L_{SV}(x) &= \sup_{p \in \Omega_{P|x}} \big\{ \Pr[Y=1, D=1 \mid x,p] + \Pr[Y=1, D=0 \mid x,p] \big\} - \inf_{p \in \Omega_{P|x}} \big\{ \Pr[Y=1, D=0 \mid x,p] + \Pr[Y=1, D=1 \mid x,p] \big\} \\
&= \sup_{p \in \Omega_{P|x}} \Pr[Y=1 \mid x,p] - \inf_{p \in \Omega_{P|x}} \Pr[Y=1 \mid x,p] \\
&= \Pr[Y=1 \mid x, \bar p(x)] - \Pr[Y=1 \mid x, \underline p(x)]. \quad (20)
\end{aligned}
\]
For $U_{SV}(x)$, because $X_0^-(x)$ and $X_1^-(x)$ are empty, from Lemma A.1 we get
\[
\begin{aligned}
U_{SV}(x) &= \inf_{p \in \Omega_{P|x}} \big\{ \Pr[Y=1, D=1 \mid x,p] + (1-p) \big\} - \sup_{p \in \Omega_{P|x}} \Pr[Y=1, D=0 \mid x,p] \\
&= \Pr[Y=1, D=1 \mid x, \bar p(x)] + (1 - \bar p(x)) - \Pr[Y=1, D=0 \mid x, \underline p(x)]. \quad (21)
\end{aligned}
\]
The expressions in (20) and (21) now yield the result that
\[
\begin{aligned}
\omega_{SV}(x) &= \Pr[Y=1, D=1 \mid x, \bar p(x)] + (1 - \bar p(x)) - \Pr[Y=1, D=0 \mid x, \underline p(x)] - \Pr[Y=1 \mid x, \bar p(x)] + \Pr[Y=1 \mid x, \underline p(x)] \\
&= \Pr[Y=0, D=0 \mid x, \bar p(x)] + \Pr[Y=1, D=1 \mid x, \underline p(x)],
\end{aligned}
\]
which is equal to $\bar\omega(x)$. The proof for the negative ATE$(x)$ case is completely analogous, and the details are omitted.

Proof of Proposition 5.1. (i) We first show that
IIP$(x)$ is well-defined, in the sense that we are able to identify whether $Z$ is relevant or not. If, for a given $x \in \Omega_X$, there exist a $z$ and a $z'$ in $\Omega_{Z|x}$ such that $z \ne z'$ and $\Pr[D=1 \mid x,z] \ne \Pr[D=1 \mid x,z']$, then the IV $Z$ is relevant. If $Z$ is relevant, then IIP$(x) = 1 - \bar\omega(x)$, where $\bar\omega(x)$ is the widest possible width defined in Proposition 3.2. Otherwise, $Z$ is irrelevant; by Proposition 3.4 the SV bounds then reduce to the benchmark Manski bounds, and we have IIP$(x) = 0$.

Next, we prove that IIP$(x) \in [0,1]$. Since $\bar\omega(x)$ is a summation of conditional probabilities for all $x \in \Omega_X$, it follows that $\bar\omega(x) \ge 0$, and hence IIP$(x) \le$
1. Whenever $Z$ is relevant the sign of ATE$(x)$ is identified, and from Lemma A.1 it follows that if ATE$(x) > 0$,
\[
\bar\omega(x) = \Pr[Y=1, D=1 \mid x, \underline p(x)] + \Pr[Y=0, D=0 \mid x, \bar p(x)] \le \Pr[Y=1, D=1 \mid x] + \Pr[Y=0, D=0 \mid x], \quad (22)
\]
which is no greater than one, and if ATE$(x) < 0$,
\[
\bar\omega(x) = \Pr[Y=1, D=0 \mid x, \bar p(x)] + \Pr[Y=0, D=1 \mid x, \underline p(x)] \le \Pr[Y=1, D=0 \mid x] + \Pr[Y=0, D=1 \mid x], \quad (23)
\]
which is also no greater than one. Thus IIP$(x) = 1 - \bar\omega(x) \ge 0$ for all $x \in \Omega_X$, and IIP$(x) \in [0,1]$.

(ii) If $Z$ is irrelevant, by definition we have IIP$(x) = 0$, and the SV bounds reduce to the benchmark Manski bounds by Proposition 3.4. To establish necessity, we show that the presumption that the events "$Z$ is relevant" and "IIP$(x) = 0$" occur simultaneously leads to a contradiction. If $Z$ is relevant, then the index IIP$(x) = 1 - \bar\omega(x)$. The goal, therefore, is to show that a relevant $Z$ leads to $\bar\omega(x)$ strictly less than one, by verifying that the inequalities (22) and (23) are strict. Take (22) as an example; the result for (23) can be verified analogously. Since
\[
\begin{aligned}
\Pr[Y=1, D=1 \mid x] - \Pr[Y=1, D=1 \mid x, \underline p(x)]
&= \int_{p \in \Omega_{P|x}} \big[ \Pr[Y=1, D=1 \mid x,p] - \Pr[Y=1, D=1 \mid x, \underline p(x)] \big] \, d\Pr[P=p \mid X=x] \\
&= \int_{p \in \Omega_{P|x}} \Pr\big[ \varepsilon_1 < \nu_1(1,x),\ \underline p(x) \le \varepsilon_2 < p \big] \, d\Pr[P=p \mid X=x], \quad (24)
\end{aligned}
\]
the relevance of $Z$ guarantees that there exists a $p \in \Omega_{P|x}$ such that $p \ne \underline p(x)$ and $\Pr[P=p \mid X=x] > 0$, which, together with the fact that $(\varepsilon_1, \varepsilon_2)$ has full support, implies that (24) is strictly positive. Similar arguments can be applied to show that $\Pr[Y=0, D=0 \mid x] - \Pr[Y=0, D=0 \mid x, \bar p(x)] > 0$. Hence $\bar\omega(x) < \Pr[Y=1, D=1 \mid x] + \Pr[Y=0, D=0 \mid x] \le 1$, leading to IIP$(x) > 0$, a contradiction.

(iii) If $Z$ is a perfect predictor of the treatment $D$, in the sense that there exist a $z^*$ and a $z^{**}$ in $\Omega_{Z|x}$ such that $\Pr(D=0 \mid x, z^*) = 1$ and $\Pr(D=1 \mid x, z^{**}) = 1$, this obviously implies that $Z$ is relevant and IIP$(x) = 1 - \bar\omega(x)$.
Furthermore, $\underline p(x) = p(x, z^*) = 0$ and $\bar p(x) = p(x, z^{**}) = 1$. Hence, it can be easily shown from the expressions for $\bar\omega(x)$ that perfect prediction by $Z$ leads to the equality $\bar\omega(x) = 0$ for both ATE$(x) > 0$ and ATE$(x) < 0$. Thus IIP$(x) = 1 - \bar\omega(x) = 1$. Moreover, since $\bar\omega(x)$ is the widest possible width for the SV bounds, we have $0 \le \omega_{SV}(x) \le \bar\omega(x)$, and when $\bar\omega(x) = 0$ it follows that $\omega_{SV}(x) = 0$. The ATE$(x)$ is therefore point identified if IIP$(x) = 1$.
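To make the construction in Proposition 5.1(i) concrete, the sketch below computes a plug-in analogue of IIP$(x)$ from simulated data for a design in which ATE$(x)$ is known to be positive and the instrument is binary. The data generating process, the sample estimator, and the relevance cutoff `tol` are all our own illustrative assumptions (exogenous covariates are suppressed for brevity).

```python
# Plug-in sketch of IIP(x) = 1 - omega_bar(x) for a known-positive ATE(x), where
# omega_bar(x) = Pr[Y=1,D=1|x,p_low] + Pr[Y=0,D=0|x,p_high] as in (22).
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, b=1.2, rho=0.6):
    """Bivariate-probit-style DGP with a binary IV Z; b = 0 makes Z irrelevant."""
    z = rng.integers(0, 2, n)
    e = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], n)
    d = (-0.3 + b * z > e[:, 1]).astype(int)
    m = {0: -0.2, 1: 0.7}                        # m(1) > m(0): positive ATE(x)
    y = (np.where(d == 1, m[1], m[0]) > e[:, 0]).astype(int)
    return y, d, z

def iip(y, d, z, tol=0.01):
    """Sample analogue of IIP(x) when ATE(x) is known to be positive."""
    p = {v: d[z == v].mean() for v in (0, 1)}    # propensity score at each z
    if abs(p[0] - p[1]) < tol:                   # Z does not shift Pr[D=1|x,z]
        return 0.0                               # irrelevant IV: IIP(x) = 0
    z_lo, z_hi = (0, 1) if p[0] < p[1] else (1, 0)
    cell = lambda v, yy, dd: ((y[z == v] == yy) & (d[z == v] == dd)).mean()
    omega = cell(z_lo, 1, 1) + cell(z_hi, 0, 0)  # widest width omega_bar(x)
    return 1.0 - omega

y1, d1, z1 = simulate(200_000)
iip_rel = iip(y1, d1, z1)                        # relevant IV: strictly positive
y0, d0, z0 = simulate(200_000, b=0.0)
iip_irr = iip(y0, d0, z0)                        # irrelevant IV: exactly zero
print(iip_rel, iip_irr)
```

Running the estimator on the `b = 0` design returns exactly zero, in line with part (ii) of the proposition, while the relevant-instrument design delivers a strictly positive index, with larger values as the extreme propensity scores approach 0 and 1.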