Decomposing Identification Gains and Evaluating Instrument Identification Power for Partially Identified Average Treatment Effects
Lina Zhang∗, David T. Frazier†, D.S. Poskitt‡, Xueyan Zhao§

September 8, 2020
Abstract
This paper studies the instrument identification power for the average treatment effect (ATE) in partially identified binary outcome models with an endogenous binary treatment. We propose a novel approach to measure the instrument identification power by their ability to reduce the width of the ATE bounds. We show that instrument strength, as determined by the extreme values of the conditional propensity score, and its interplays with the degree of endogeneity and the exogenous covariates all play a role in bounding the ATE. We decompose the ATE identification gains into a sequence of measurable components, and construct a standardized quantitative measure for the instrument identification power (IIP). The decomposition and the IIP evaluation are illustrated with finite-sample simulation studies and an empirical example of childbearing and women's labor supply. Our simulations show that the IIP is a useful tool for detecting irrelevant instruments.
Keywords: Binary Dependent Variables; Average Treatment Effect; Instrument Identification Power; Instrument Relevance; Endogeneity; Partial Identification.

∗ Corresponding author. Department of Econometrics and Business Statistics, Monash University ([email protected]).
† Department of Econometrics and Business Statistics, Monash University ([email protected]).
‡ Department of Econometrics and Business Statistics, Monash University ([email protected]).
§ Department of Econometrics and Business Statistics, Monash University ([email protected]).

1 Introduction
This paper investigates the identification power of instrumental variables for the average treatment effect (ATE) in partially identified triangular equations system models with binary endogenous variables. Binary outcome models with a binary endogenous treatment have been widely used in empirical studies. The role played by instrumental variables (IVs) in such models has long been a controversial topic and has been discussed in many papers (see for example Heckman, 1978; Maddala, 1986; Wilde, 2000; Freedman and Sekhon, 2010; Mourifié and Méango, 2014; Han and Vytlacil, 2017; Li, Poskitt, and Zhao, 2019). In particular, there is a notion of "identification by functional form" (Li et al., 2019), whereby such non-linear models can be point identified even without any IVs, relying on restrictive parametric assumptions such as a bivariate probit. However, such identification has been described as "fragile" (Marra and Radice, 2011; Li et al., 2019), as models such as the bivariate probit are overly restrictive. Once less restrictive assumptions are allowed, the IVs have been shown to play a crucial role for meaningful identification in partially identified models (see for example Chesher, 2005, 2010; Shaikh and Vytlacil, 2011; Li et al., 2019).

The literature on partially identified models offers a useful framework for IV identification analysis. The identified set for the ATE, defined as all possible values of the ATE from different observationally equivalent structures that can give rise to the observed data, offers an obvious measure for identification power. For example, Kitagawa (2009) and Swanson et al. (2018) use the size of the identified set to measure the identification power of model assumptions. Naturally, the width of the ATE identified set can also provide a measure to examine the IV contribution to the identification gains. In this paper, we use the reduction in the width of the identified set as a measure for identification gains.
Since the pioneering work of Manski (1990), most of the ATE partial identification studies with an endogenous treatment have relied on the IVs to bound the ATE (see Heckman and Vytlacil, 1999, 2001; Vytlacil and Yildiz, 2007; Chesher, 2010; Chiburis, 2010; Shaikh and Vytlacil, 2011; Vuong and Xu, 2017; Flores and Chen, 2018). Both Chesher (2010) and Li et al. (2018) show that the existence and the strength of the IVs can significantly affect the identification of the ATE for discrete outcome models. However, the mechanism through which the IV strength translates to identification gains in such non-linear models has not been well understood by researchers.

In endogenous treatment effect models, the IVs exert their influence through their impact on the treatment propensity score. Heckman, Urzua, and Vytlacil (2006) provide a comprehensive study of the properties of IVs in models with continuous outcomes, and point out the central role of the propensity scores in such models. (Other works that establish the important role of the propensity score include Rosenbaum and Rubin (1983), Heckman and Robb (1985, 1986), Heckman (1990), and Ahn and Powell (1993).) In continuous outcome models, it is well known that "identification at infinity" requires the propensity score to take values arbitrarily close to zero and one. For binary models, Li et al. (2018) use a pseudo R² to measure IV strength and show that the ATE bound width decreases as the pseudo R² increases. As with linear models, it is natural to expect that the propensity score variation is also a key component that governs the ability of the IVs to identify the ATE. However, to the authors' knowledge, no rigorous examinations have yet been conducted to investigate the factors contributing to the identification gains of the ATE for discrete outcome models when "identification at infinity" fails.
It is part of the purpose of this paper to investigate this lacuna. This paper presents a rigorous examination of the role of IVs and their interplays with other factors in the identification gains for the ATE in binary outcome models with an endogenous binary treatment. Using the bivariate joint threshold crossing model proposed by Shaikh and Vytlacil (2011) (henceforth referred to as the SV model or SV bounds) as an example, we study the identification gains achieved by the SV bounds against those from an ATE bounds benchmark, the bounds of Manski (1990) (hereafter Manski bounds). The rationale for using Manski's bounds as a benchmark follows from the observation that if the IVs are irrelevant, then the SV bounds collapse to the Manski bounds (see Remark 2.1 of Shaikh and Vytlacil (2011)). Using this framework, we disentangle the various impacts of IVs on identification gains, which yields a novel decomposition of the ATE SV bounds identification gains. This decomposition provides useful insights into the different sources and nature of identification gains.

Our paper makes several contributions. Firstly, we distinguish the concepts of IV strength and IV identification power for binary dependent variables models. We show that, as in the case of linear models, the IV strength, as measured by the range of the conditional propensity score (CPS) values that are attributable to the IVs, plays a crucial role in the identification gains when bounding the ATE. More importantly, we demonstrate that unlike linear models, the IV identification power is also determined by the interplay of the IVs with the sign and the degree of treatment endogeneity. This is because in such non-linear models, the ATE bounds are governed by the joint probabilities of the outcome and the treatment, which vary with the degree of endogeneity.

Secondly, we construct a standardized quantitative measure of the IV identification power (IIP). The IIP measures the IV contribution to identification gains by quantifying the reduction in the size of the ATE identification set that can be attributed to the instruments alone. Works that aim to provide measures of the explained variation in limited dependent variable models, such as Veall and Zimmermann (1992, 1996), are already available, and Windmeijer (1995) provides a comprehensive review of various pseudo R² goodness-of-fit measures. In general, pseudo R² statistics are developed for single equation limited dependent variable models, rather than for triangular systems with a binary endogenous treatment. Although such pseudo R² statistics will yield a measure of the IV strength (as used in Li et al., 2018), they are not appropriate measures for IV identification power, as they fail to capture the critical fact that the IV identification information pertaining to the ATE varies with the degree of endogeneity. Consequently, any suggestion that pseudo R² statistics will be an indicator of the IV identification power would be misplaced. In contrast, the IIP proposed in this paper is specifically designed to evaluate the identification gains that can be solely attributed to the IVs.

Finally, our paper also provides potential insights into the literature on instrument relevancy, weak instruments and instrument selection. The importance of the IIP measure is that it enables a ranking of alternative IVs by their identification power, thereby offering a potential criterion for detection of irrelevant IVs and for selection of sets of IVs for constructing the ATE bounds. In this way, our measure is akin to existing approaches in the generalized method of moments (GMM) literature that seek to determine instrument "relevancy". The ability of our approach to determine and rank sets of IVs by their identification gains leads us to document, we believe for the first time, a critically important feature of binary triangular equations systems: while in the population adding irrelevant IVs cannot increase the IV identification power, in finite samples using such IVs to partially identify the ATE could lead to a loss in IV identification power, which may result in wider ATE bounds, especially when the variation of the covariates is limited. We liken this phenomenon to the well-known problem of irrelevant moment conditions in GMM (see Breusch et al., 1999; Hall and Peixe, 2003; Hall, 2005; Hall et al., 2007, among others) and leave a more rigorous study of this topic for future research.

The rest of this paper is organized as follows. In Section 2 we present our model setup and the SV bounds. In Section 3 we establish how the conditional propensity score, the endogeneity and the covariates affect the ATE bounds. Section 4 introduces our decomposition of identification gains, and studies how it can be used to gauge the instrument identification power. Section 5 defines the index of IIP and presents some of its basic properties. A comprehensive numerical analysis and graphical presentation are given in Section 6 to illustrate our results. Finite sample evaluation of the decomposition analysis is presented in Section 7, and an empirical example is given in Section 8 to demonstrate the usefulness of the decomposition and the IIP in evaluating instrument relevance. The paper closes in Section 9 with some summary remarks. All proofs are relegated to the Appendix.
2 Model Setup and the SV Bounds

Following the potential outcome framework, let Y be a binary outcome such that Y = D·Y1 + (1 − D)·Y0, where D ∈ {0, 1} is a treatment indicator with D = 1 denoting being treated and D = 0 denoting being untreated. The pair Y0, Y1 ∈ {0, 1} are the two potential outcomes in the untreated and treated states. We observe (Y, D, X, Z), where X denotes a vector of exogenous covariates and Z represents a vector of instruments that can be either continuous or discrete. Suppose we are interested in the conditional ATE, defined as

    ATE(x) = E[Y1 | X = x] − E[Y0 | X = x].

Because only one of the potential outcomes is observed, we are faced with a missing data problem. If the potential outcomes are independent of the treatment D then it can be shown that the ATE(x) is point identified. However, in many empirical studies D is endogenous and hence correlated with the potential outcomes. Nevertheless, with the help of IVs we may partially identify the ATE(x) and construct an identified set for the ATE under mild conditions that are satisfied by a wide range of data generating processes.

For notational simplicity, henceforth we will use Pr(A | w) to represent Pr(A | W = w) for any event A, random variable W and its possible value w, unless otherwise stated. For any generic random variables A and B, the support of A is denoted by Ω_A and the support of A conditional on B = b is given by Ω_{A|b}. Let F_{A,B} denote the joint cumulative distribution function (CDF) of (A, B), F_A the marginal CDF of A, and F_{A|B} the conditional CDF of A given B. Corresponding density functions will be denoted using a lower case f with associated subscripts in an obvious way.

We now introduce the model and the identified set of the ATE studied in Shaikh and Vytlacil (2011), based on which we explore the factors determining the ATE bounds and how they impact the ATE bound width. Consider the joint threshold crossing model

    Y = 1[ν1(D, X) > ε1],
    D = 1[ν2(X, Z) > ε2],                                                    (1)

where ν1(·,·) and ν2(·,·) are unknown functions, and (ε1, ε2)′ is an unobservable error term with joint CDF F_{ε1,ε2}. Threshold crossing models are often used in treatment evaluation studies (see Heckman and Vytlacil, 1999, 2001, for example), and have been shown to be informative in the sense that the sign of the ATE can be recovered from the observable data, and the ATE can even be point identified in certain circumstances; see Shaikh and Vytlacil (2005, 2011), Vytlacil and Yildiz (2007) and Vuong and Xu (2017) among others. Moreover, tests for the applicability of threshold crossing have also been developed; see Heckman and Vytlacil (2005), Bhattacharya et al. (2012), Machado et al. (2013) and Kitagawa (2015) for example. The following assumption summarises the conditions imposed by Shaikh and Vytlacil (2011).
Assumption 2.1
The model in (1) is assumed to satisfy the following conditions:
(a) The distribution of the error term (ε1, ε2)′ has a strictly positive density with respect to the Lebesgue measure on R².
(b) (X, Z) is independent of (ε1, ε2).
(c) The distribution of ν2(X, Z) | X is non-degenerate.
(d) The support of the distribution of (X, Z), Ω_{X,Z}, is compact.
(e) ν1 : Ω_{D,X} → R and ν2 : Ω_{X,Z} → R are continuous in both arguments.

Assumption 2.1 ensures that the instruments in Z satisfy the exclusion restriction, are independent of the error term (ε1, ε2)′, and are relevant to the treatment D. (Bhattacharya et al. (2012) demonstrate that the SV bounds still hold under a rank similarity condition, a weaker property that allows heterogeneity in the sign of the ATE(x). Furthermore, as mentioned in Vytlacil and Yildiz (2007), it is possible to achieve ATE point identification via the SV bounds if X contains a continuous element or the exclusion restriction holds in both equations.) In addition, Assumption 2.1 (a) and (b) are such that Z enters the distribution of the outcome Y only through the propensity score, a property called index sufficiency. Conditions (d) and (e) are required to establish the sharpness of the identified set, and are imposed for analytical simplicity.

Denote the random variable P = Pr[D = 1 | X, Z] with support Ω_P. Under Assumption 2.1 (a)-(c), Shaikh and Vytlacil (2011) show that the sign of the ATE(x) is identified: for any p and p′ in Ω_P such that p > p′,

    sgn[ATE(x)] = sgn[ν1(1, x) − ν1(0, x)] = sgn{Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′]},      (2)

where sgn[·] is the conventional signum function.
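To make the sign identification result (2) concrete, the following Monte Carlo sketch simulates a bivariate probit special case of model (1) at a fixed x. The thresholds ν1(0, x) = 0.5, ν1(1, x) = 1.3, the instrument points z ∈ {−1, 1} and ρ = 0.5 are illustrative assumptions, not values from the paper; the check is that the sign of Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′] matches the (positive) sign of the ATE(x).

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 200_000, 0.5

# Hypothetical parametrization of model (1) at a fixed x:
# Y = 1[nu1(D, x) > eps1], D = 1[z > eps2], bivariate normal errors.
nu1 = {0: 0.5, 1: 1.3}                  # nu1(0, x), nu1(1, x); ATE(x) > 0
cov = [[1.0, rho], [rho, 1.0]]

def simulate(z):
    e = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    d = (z > e[:, 1]).astype(int)                       # treatment equation
    y = (np.where(d == 1, nu1[1], nu1[0]) > e[:, 0]).astype(int)  # outcome
    return y, d

y_hi, d_hi = simulate(1.0)      # high propensity value p  = Pr[D=1|x, z=1]
y_lo, d_lo = simulate(-1.0)     # low propensity value  p' = Pr[D=1|x, z=-1]

p, p_prime = d_hi.mean(), d_lo.mean()
diff = y_hi.mean() - y_lo.mean()        # Pr[Y=1|x,p] - Pr[Y=1|x,p']
print(p > p_prime, diff > 0)            # sign of ATE(x) recovered from data
```

Since the true ATE(x) = Φ(1.3) − Φ(0.5) > 0 here, the probability difference comes out positive, as (2) predicts, and this holds for either sign of ρ.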
Given (2), it is apparent that the sign of the ATE(x) is recovered from the observables if Z is valid, in the sense that Z is independent of (ε1, ε2) and has nonzero predictive power for the treatment, meaning that there exist two different values p, p′ ∈ Ω_{P|x} such that p = Pr[D = 1 | x, z] and p′ = Pr[D = 1 | x, z′].

More importantly, Assumption 2.1 is sufficient to construct bounds for the ATE, referred to as the SV bounds. Let P and P′ be two independent random variables with the same distribution, and let x, x′ be any two values in Ω_X. Now, define H(x, x′) = E[h(x, x′, P, P′) | P > P′], where

    h(x, x′, p, p′) = Pr[Y = 1, D = 1 | x′, p] − Pr[Y = 1, D = 1 | x′, p′]
                      − Pr[Y = 1, D = 0 | x, p′] + Pr[Y = 1, D = 0 | x, p].

Let X⁺(x) = {x′ : H(x, x′) ≥ 0}, X⁻(x) = {x′ : H(x, x′) ≤ 0}, X̃⁺(x) = {x′ : H(x′, x) ≥ 0}, and X̃⁻(x) = {x′ : H(x′, x) ≤ 0}. Then the SV lower bound is

    L_SV(x) = sup_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 1 | x, p] + sup_{x′ ∈ X⁺(x)} Pr[Y = 1, D = 0 | x′, p] }
              − inf_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 0 | x, p] + p · inf_{x′ ∈ X̃⁺(x)} Pr[Y = 1 | x′, p, D = 1] },    (3)

and the SV upper bound is

    U_SV(x) = inf_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 1 | x, p] + (1 − p) · inf_{x′ ∈ X̃⁻(x)} Pr[Y = 1 | x′, p, D = 0] }
              − sup_{p ∈ Ω_{P|x}} { Pr[Y = 1, D = 0 | x, p] + sup_{x′ ∈ X⁻(x)} Pr[Y = 1, D = 1 | x′, p] }.         (4)

The SV bounds in (3) and (4) consist of two layers of intersection evaluations. The first layer intersects over all possible values of the conditional propensity score, or equivalently, of the IVs. The second layer utilizes the identifying information contained in the covariates.
In particular, for given x, the second layer of intersections is taken over values of the covariates other than x, say x′, which lie in a certain subset of Ω_X and for which there exists a z′ ∈ Ω_{Z|x′} such that p = Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. Thus, both the IVs and the covariates contribute to the identification gains of the SV bounds. It is understood that in (3) and (4) the supremum and infimum operators are only taken over regions where all conditional probabilities are well defined. The probabilities Pr[Y = y, D = d | x′, p] and Pr[Y = y | x′, p, D = d] are well defined for y ∈ {0, 1} and d ∈ {0, 1} if there exists a value z′ ∈ Ω_{Z|x′} such that Pr[D = 1 | x′, z′] = p. The supremum over an empty set is defined as 0, and the infimum over an empty set is defined as 1. Given (3) and (4), the width of the SV bounds can be defined as

    ω_SV(x) = U_SV(x) − L_SV(x).

In the next section, we study the factors that impact the SV bounds and ω_SV(x).

As discussed in the introduction, for binary dependent variables the propensity of being treated is a key factor that carries the identification information in the IVs. Therefore, we start from the conditional propensity score (CPS) of the treatment, defined as Pr[D = 1 | X = x, Z], which is a random variable (function) of the IV Z, and study the features of the CPS that are crucial in determining the SV bound width.

In the following proposition, for the sake of completeness, we first restate the sharpness result in Shaikh and Vytlacil (2011) under a stronger support condition Ω_{X,P} = Ω_X × Ω_P, and then introduce our new results about the connections between P = Pr[D = 1 | X, Z] and the SV bound width. Denote the two extreme values of the support of the variable P by p̲ := inf{p ∈ Ω_P} and p̄ := sup{p ∈ Ω_P} respectively.

Proposition 3.1
Let Assumption 2.1 hold. If Ω_{X,P} = Ω_X × Ω_P, then the SV bounds in (3) and (4) are sharp. In addition, for any given x ∈ Ω_X:
(i) L_SV(x) is weakly increasing as p̲ decreases or as p̄ increases;
(ii) U_SV(x) is weakly decreasing as p̲ decreases or as p̄ increases;
and hence
(iii) ω_SV(x) is weakly decreasing as p̲ decreases or as p̄ increases.

Notice that under the restriction Ω_{X,P} = Ω_X × Ω_P, the support of P is the same as the support of the CPS Pr[D = 1 | X = x, Z] for every x ∈ Ω_X. Proposition 3.1 shows that the locations of the lower and upper SV bounds are determined by the extreme values of the CPS, i.e. p̲ and p̄. Moreover, the width of the SV bounds ω_SV(x) weakly decreases as the support of the CPS "expands". This means that when the IVs are good predictors of the treatment status, the identified set of the ATE(x) (the SV bounds) is likely to be informative.

The feature revealed by Proposition 3.1 is significant. It indicates that in partially identified models with binary dependent variables, the property of IVs that determines their contribution to identification gains is different from that which has hitherto been held to be important. Key ingredients of conventional measures of IV strength are the correlation between the IVs and the endogenous regressors (as evaluated via the first-stage F-statistic for continuous endogenous regressors, or the pseudo R² for binary response variables), as well as the variation of the IVs relative to that of the random noise. However, Proposition 3.1 indicates that two IV sets that have the same CPS end points will make identical contributions to identification gains when partially identifying the ATE, irrespective of their correlation with the endogenous regressors or their variability.

The restriction Ω_{X,P} = Ω_X × Ω_P in Proposition 3.1 is utilized in Shaikh and Vytlacil (2011) to simplify the expression of the SV bounds and to prove the sharpness result. It is also one of the sufficient conditions that ensure global identification in a parametric triangular system model with a binary endogenous treatment; see Han and Vytlacil (2017), Theorem 5.1. The condition Ω_{X,P} = Ω_X × Ω_P says that for any x, x′ ∈ Ω_X we have Ω_{P|x} = Ω_{P|x′}; i.e. there exist possible realizations z, z′ of Z such that Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. This might fail to hold in practice, especially when the variation in Z is limited. One sufficient condition for Ω_{X,P} = Ω_X × Ω_P to hold is that D is mean independent of X given Z. The condition is needed here because, without this support restriction, the SV bounds may not exhibit a monotonic relationship with the extreme values of the CPS. (Without Ω_{X,P} = Ω_X × Ω_P, the SV bounds need not be sharp. Chiburis (2010) shows that under joint threshold crossing the sharp ATE bounds can only be implicitly determined by a copula, so that neither a closed form expression nor a computationally feasible linear programming algorithm that solves this problem exists. We therefore maintain the support restriction.)

Fortunately, although Proposition 3.1 is derived using the support constraint, from the simulations in Section 7 we can see that the SV bound width decreases, on average, as the extreme values of the CPS move towards their endpoints (zero and one). In fact, as we will now show, without the imposition of the support condition Ω_{X,P} = Ω_X × Ω_P, a "widest bound" under Assumption 2.1 that restricts the size of ω_SV(x) can be derived for any given x ∈ Ω_X. Define the two extremes of the CPS as p̲(x) := inf_{z ∈ Ω_{Z|x}} Pr[D = 1 | x, z] and p̄(x) := sup_{z ∈ Ω_{Z|x}} Pr[D = 1 | x, z].

Proposition 3.2
Let Assumption 2.1 hold. There exists a function ω̄ : Ω_X → [0, 1] such that 0 ≤ ω_SV(x) ≤ ω̄(x) for any given x ∈ Ω_X. In addition,

    if ATE(x) > 0, then ω̄(x) = Pr[Y = 1, D = 1 | x, p̲(x)] + Pr[Y = 0, D = 0 | x, p̄(x)];
    if ATE(x) < 0, then ω̄(x) = Pr[Y = 1, D = 0 | x, p̄(x)] + Pr[Y = 0, D = 1 | x, p̲(x)].

Moreover, ω̄(x) is weakly decreasing as p̲(x) decreases or as p̄(x) increases.

The explicit expressions of the widest bounds, with width ω̄(x), can be found in (14) and (16); see the proof of Proposition 3.2. From Proposition 3.2 we can see that ω̄(x) is monotone in the extreme values of the CPS, i.e. (p̲(x), p̄(x)), and we are able to conclude that the extreme values of the CPS govern the size of the SV bound width even without the support restriction. Moreover, under the extreme case of perfect prediction, Proposition 3.2 implies that the ATE(x) is point identified by the SV bounds. Suppose p∗, p∗∗ ∈ Ω_{P|x} are such that Pr[D = 0 | x, p∗] = 1 and Pr[D = 1 | x, p∗∗] = 1. By the definition of p̲(x) and p̄(x), we then have p∗ = p̲(x) and p∗∗ = p̄(x). Proposition 3.2 then yields that ω̄(x) = 0 whatever the sign of the ATE(x), indicating that the ATE(x) is point identified. From the above discussion it is apparent that perfect prediction in the binary dependent variables model is equivalent to "identification at infinity". A similar discussion can be found for partial identification of the ATE in models with discrete outcomes in Chesher (2010).
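As a numerical illustration of Proposition 3.2, the sketch below evaluates ω̄(x) under an assumed bivariate probit specification of (1) with ATE(x) > 0 (the thresholds ν1(1, x) = 1.3, ν1(0, x) = 0.5 and ρ = 0.5 are purely illustrative) for progressively wider CPS ranges (p̲(x), p̄(x)). The widest width shrinks monotonically and approaches zero under near-perfect prediction.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho, nu1_1, nu1_0 = 0.5, 1.3, 0.5     # assumed bivariate probit parameters
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def widest_width(p_lo, p_hi):
    """omega_bar(x) for ATE(x) > 0: Pr[Y=1,D=1|x,p_lo] + Pr[Y=0,D=0|x,p_hi]."""
    pr_11 = mvn.cdf([nu1_1, norm.ppf(p_lo)])              # Pr[Y=1, D=1 | x, p_lo]
    pr_00 = 1 - norm.cdf(nu1_0) - p_hi + mvn.cdf([nu1_0, norm.ppf(p_hi)])
    return pr_11 + pr_00                                  # second term: Pr[Y=0, D=0 | x, p_hi]

widths = [widest_width(*ps) for ps in [(0.4, 0.6), (0.1, 0.9), (0.001, 0.999)]]
print(widths)   # strictly decreasing; last value is close to zero
```

Since Pr[Y = 1, D = 1 | x, p] ≤ p and Pr[Y = 0, D = 0 | x, p] ≤ 1 − p, the final width is at most 0.002 here, illustrating that perfect prediction delivers point identification.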
The importance of the IVs in determining the ATE bounds via the CPS has been recognized in several studies, but it seems that another crucial determinant, the degree of endogeneity, has so far received little attention. The ATE bounds are constructed using the joint probabilities of the outcome and the treatment, and the IVs affect those joint probabilities not only directly through the CPS but also indirectly through the co-movements of the outcome and the treatment due to the endogeneity. Thus, it is reasonable to expect that the information contained in the IVs may be correspondingly scaled via the leverage induced by the degree of endogeneity.

To facilitate obtaining interpretable relationships between the degree of endogeneity and the SV bound width, we introduce a family of bivariate single parameter copulae that specifies the joint distribution of the stochastic error terms in (1), while requiring neither the copula nor the marginal distributions to be known. Denote a copula as C(·, ·; ρ) : (0, 1)² → (0, 1), where ρ ∈ Ω_ρ is a scalar dependence parameter that fully describes the joint dependence between ε1 and ε2, with their dependence increasing as ρ increases. It is worth noting that in our setting, for any given copula, the dependence parameter ρ can be understood as indicating the level of endogeneity. We also impose additional dependence structure, the concordance ordering, on the copula C(·, ·; ρ). Let F_{ε1,ε2} and F̃_{ε1,ε2} be two distinct CDFs. Following Joe (1997), we define F̃_{ε1,ε2} as being more concordant than F_{ε1,ε2}, denoted by F_{ε1,ε2} ≺_c F̃_{ε1,ε2}, if

    F_{ε1,ε2}(e1, e2) ≤ F̃_{ε1,ε2}(e1, e2)  for all (e1, e2) ∈ R².

For ρ1 ≠ ρ2 and u1, u2 ∈ (0, 1), we say that the copula C(·, ·; ρ) satisfies the concordance ordering with respect to ρ, denoted as C(u1, u2; ρ1) ≺_c C(u1, u2; ρ2), if

    C(u1, u2; ρ1) ≤ C(u1, u2; ρ2)  for any ρ1 < ρ2.                          (5)

The concordance ordering with respect to ρ is a stochastic dominance restriction.
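Condition (5) can be checked numerically for a given copula family. The sketch below uses the Gaussian (normal) copula as the example family and verifies the concordance ordering on a grid of (u1, u2) values for two dependence parameters ρ1 < ρ2:

```python
from scipy.stats import multivariate_normal, norm

def gaussian_copula(u1, u2, rho):
    """C(u1, u2; rho): the Gaussian copula evaluated via the bivariate normal CDF."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    return mvn.cdf([norm.ppf(u1), norm.ppf(u2)])

# Concordance ordering (5): C(., .; rho1) <= C(., .; rho2) whenever rho1 < rho2.
us = [0.1, 0.3, 0.5, 0.7, 0.9]
ok = all(gaussian_copula(a, b, 0.2) <= gaussian_copula(a, b, 0.6) + 1e-12
         for a in us for b in us)
print(ok)  # True: the normal copula is ordered by concordance in rho
```

The small tolerance guards against numerical noise in the bivariate normal CDF; the inequality itself holds exactly for the normal copula.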
The concordance ordering is embodied in many well-known copulae, including the normal copula; see Joe (1997), Section 5.1, for the copula families where (5) holds. Similar stochastic dominance conditions are employed in, e.g., Han and Vytlacil (2017) and Han and Lee (2019) to derive identification and estimation results for the parametric bivariate probit model and its generalizations.

Assumption 3.1
The joint distribution of (ε1, ε2)′ is given by a member of the single parameter copula family F_{ε1,ε2}(e1, e2) = C(F_{ε1}(e1), F_{ε2}(e2); ρ) for (e1, e2) ∈ R², where C(·, ·; ρ) satisfies the concordance ordering with respect to ρ.

Assumption 3.1 defines a class of data generating processes that is sufficient for us to establish the relationship between endogeneity, as captured by the dependence parameter ρ, and the widest SV bound width ω̄(x). The derivation of the following proposition requires neither the copula C(·, ·; ρ) nor the marginal distributions F_{ε1} and F_{ε2} to be specified.

Proposition 3.3
Under Assumptions 2.1 and 3.1, the widest SV bound width ω̄(x) is weakly increasing in ρ when ATE(x) > 0, and ω̄(x) is weakly decreasing in ρ when ATE(x) < 0.

Proposition 3.3 implies that the (widest) SV bound width could be significantly impacted by the degree of endogeneity, even if the extreme values of the CPS are fixed. (In the special case of a normal bivariate probit model, ρ represents the correlation between the error terms and Ω_ρ = (−1, 1); in general, Ω_ρ need not be (−1, 1).) In addition, Proposition 3.3 also reveals that the effect of endogeneity is asymmetric. To be more specific, with a positive treatment effect, negative endogeneity helps narrow down the ATE bound width, while the opposite holds true for a negative treatment effect. As a consequence, a set of seemingly weak IVs may possess considerable identification power when the sign and degree of endogeneity are favourable. Conversely, a set of "seemingly strong" IVs can be surprisingly powerless due to an undesirable sign or degree of endogeneity, resulting in wide ATE bounds. Thus, the conventional tests for detecting IV strength, such as the first-stage F-statistic and pseudo R², or the associated weak IV tests designed for linear models, can be misleading in measuring IV identification power. The result here shows that IV strength is a different concept from IV identification power in this binary model.
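The asymmetry in Proposition 3.3 can be seen numerically. Under the same kind of assumed bivariate probit specification used earlier (normal copula, illustrative thresholds giving ATE(x) > 0, CPS extremes held fixed at 0.2 and 0.8), the widest width ω̄(x) rises with the endogeneity parameter ρ, so negative endogeneity tightens the bounds while positive endogeneity loosens them:

```python
from scipy.stats import multivariate_normal, norm

nu1_1, nu1_0 = 1.3, 0.5     # assumed thresholds, so ATE(x) = Phi(1.3) - Phi(0.5) > 0
p_lo, p_hi = 0.2, 0.8       # CPS extremes held fixed across rho

def widest_width(rho):
    """omega_bar(x) for ATE(x) > 0 under a normal copula with dependence rho."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    pr_11 = mvn.cdf([nu1_1, norm.ppf(p_lo)])              # Pr[Y=1, D=1 | x, p_lo]
    pr_00 = 1 - norm.cdf(nu1_0) - p_hi + mvn.cdf([nu1_0, norm.ppf(p_hi)])
    return pr_11 + pr_00                                  # Pr[Y=0, D=0 | x, p_hi]

w = [widest_width(r) for r in (-0.5, 0.0, 0.5)]
print(w)   # increasing in rho: negative endogeneity yields a narrower widest bound
```

Both joint probabilities are increasing in ρ for a fixed CPS value, which is what drives the monotone pattern predicted by the proposition.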
As we have seen from the construction of the SV bounds in Section 2, both the IVs and the covariates contribute to identifying the ATE under model (1). It is perhaps not surprising to find that there are situations where the covariates fail to further tighten the SV bounds, a feature previously noted in Chiburis (2010). This happens when, conditional on D, the covariates in X have no additional effects on the outcome Y, leading to ω_SV(x) = ω̄(x). The following proposition formalizes these statements.

Proposition 3.4
Let Assumption 2.1 hold. If the random variable ν1(D, X) | D is degenerate, then ω_SV(x) = ω̄(x).

Proposition 3.4 implies that any further reduction in the SV bound width from ω̄(x) to ω_SV(x) can be attributed to the additional identification information in the covariates X. In particular, focusing on the second layer of intersections over X⁺(x), X⁻(x), X̃⁺(x) and X̃⁻(x) in the bounds (3) and (4), we can see that such identification gain is extracted from matching pairs (x, z), (x′, z′) ∈ Ω_{X,Z} such that Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. Thus, broader support and greater variability in X increase the probability of finding a matching pair.

To sum up, from the discussion in Section 3, we know that the identification power for the ATE SV bounds is determined by the extreme values of the CPS, the sign and the degree of endogeneity, and the variability (or support) of the covariates in the outcome equation.

Based on the discussions above, in this section we introduce a novel decomposition of the identification gains of the SV bounds. It disentangles the identification gains into components that are attributable to the gains obtained from the IVs and the exogenous covariates. To construct the decomposition, let us first introduce the benchmark ATE bounds of Manski (1990) (Manski bounds), which are obtained without reference to IVs and are given by

    L_M(x) = −Pr[Y = 1, D = 0 | x] − Pr[Y = 0, D = 1 | x],
    U_M(x) = Pr[Y = 1, D = 1 | x] + Pr[Y = 0, D = 0 | x],                    (6)

where (with obvious notation) L_M(x) and U_M(x) are the lower and upper bounds respectively. From (6), it is apparent that the width of the Manski bounds, defined as ω_M(x) = U_M(x) − L_M(x), is one for any given x ∈ Ω_X, with the lower bound and the upper bound falling on either side of zero.
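A quick numeric sketch of the Manski bounds in (6), using made-up joint probabilities for (Y, D) at a fixed x: whatever the joint distribution, the bounds straddle zero and the width is exactly one, since U_M(x) − L_M(x) is the sum of all four joint probabilities.

```python
# Hypothetical joint probabilities Pr[Y=y, D=d | x] at a fixed x
# (illustrative numbers; any distribution summing to one works).
p_y1d1, p_y1d0, p_y0d1, p_y0d0 = 0.30, 0.10, 0.15, 0.45

L_M = -p_y1d0 - p_y0d1      # lower Manski bound, eq. (6)
U_M = p_y1d1 + p_y0d0       # upper Manski bound, eq. (6)

print(L_M, U_M, U_M - L_M)  # bounds straddle zero; width is always one
```

Here L_M = −0.25 and U_M = 0.75, so the interval contains zero and has unit width, which is why the Manski bounds alone never identify the sign of the ATE(x).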
Thus, [L_M(x), U_M(x)] is uninformative as to the sign or location of the treatment effect, and it is often referred to in the literature as "the worst case scenario" (see Tamer, 2010; Chiburis, 2010; Bhattacharya et al., 2012, for example).

Our proposed decomposition of identification gains is inspired by the implications of the theoretical results in Section 3. For any given x ∈ Ω_X, the decomposition consists of four components, denoted by C_1(x) to C_4(x) respectively. Each component corresponds to the identification gains made by the SV bounds over the benchmark Manski bounds.

(i) C_1(x): Contribution of IV Validity. The first component of the identification gains is the reduction in the SV bound width relative to the benchmark Manski bound width, due to the identification of the sign of ATE(x). This contribution is accredited to IV validity, since by (2) we can identify the sign of ATE(x) if the IVs are independent of the error terms (ε_1, ε_2) and ν(X, Z) | X is nondegenerate (or equivalently, if the IVs are valid), regardless of the IV strength. For all x ∈ Ω_X,

C_1(x) = 1[ATE(x) ≤ 0] U_M(x) − 1[ATE(x) ≥ 0] L_M(x),

which is equivalent to the width of the negative (positive) part of the Manski bounds if ATE(x) is identified to be positive (negative).

(ii) C_2(x): Contribution of IV Strength. Conditional on the first component, IV validity, the second component captures the further reduction achieved by the SV bound width via intersecting over all possible values of Z. This is reflected in the dependence of the SV bounds in (3) and (4) on the two extreme values of the CPS: the closer the extreme values are to 0 and 1, the greater is C_2(x). [Footnote: If ATE(x) = 0 is identified by (2), i.e. Pr[Y = 1 | x, p] = Pr[Y = 1 | x, p′] for any p > p′, then it is obvious that the first contribution of the SV bounds already leads to point identification of ATE(x), and the IV identification power IIP(x), which will be introduced in Section 5, achieves its maximum value one.] Formally,

C_2(x) = ω_M(x) − ω(x) − C_1(x).

(iii) C_3(x): Contribution of Covariates. The third component is the incremental reduction in the SV bound width brought about by intersecting over all possible values of the exogenous covariates X that fall into the sets X_1(x), X_1^-(x), X_0(x) and X_0^-(x) via matching on the same propensity score values. As implied by Proposition 3.4, this component is attributed to the variation of the exogenous covariates:

C_3(x) = ω(x) − ω_SV(x).

(iv) C_4(x): Remaining SV Bound Width. The last component is due to the unobservable error terms, and relates to the remaining SV bound width that cannot be further reduced by the observable data under the SV modeling assumptions. This component can be thought of as reflecting the signal-to-noise ratio of the error terms. By construction, C_4(x) = ω_SV(x).

It is easy to see that C_1(x) + C_2(x) + C_3(x) + C_4(x) = ω_M(x) = 1. If ν(X, Z) | X is degenerate and the IVs have no explanatory power for the treatment, then C_1(x) = C_2(x) = C_3(x) = 0 and the SV bounds reduce to the Manski bounds. It is worth noting that although we do not decompose the identification gains based on the sign and the degree of endogeneity, the magnitude of all four components varies with them. According to Proposition 3.3, the sign and the degree of endogeneity affect ω(x), which enters all four components either directly or indirectly, because the four components sum to the fixed value one. In addition, C_1(x) to C_4(x) can always be identified and estimated from the data.
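Since C_1(x) to C_4(x) are simple functions of the Manski bounds, the widest SV width ω(x) and the actual SV width ω_SV(x), the decomposition can be computed mechanically once these quantities are estimated. A minimal Python sketch, with a hypothetical set of input values and a function name of our own choosing:

```python
def decompose_gains(L_M, U_M, ate_sign, omega, omega_sv):
    """Decompose the identification gains of the SV bounds over the
    Manski bounds [L_M, U_M] into C_1(x), ..., C_4(x).
    ate_sign: +1 or -1, the identified sign of ATE(x);
    omega:    widest SV bound width (Proposition 3.3);
    omega_sv: actual SV bound width."""
    omega_m = U_M - L_M                 # equals one in population
    # C1: part of the Manski bounds removed by sign identification
    C1 = -L_M if ate_sign > 0 else U_M
    C2 = omega_m - omega - C1           # gain from IV strength
    C3 = omega - omega_sv               # gain from covariate variation
    C4 = omega_sv                       # remaining SV bound width
    iip = C1 + C2                       # = omega_m - omega
    return C1, C2, C3, C4, iip

# Hypothetical example: Manski bounds [-0.35, 0.65], positive ATE(x),
# widest SV width 0.40, actual SV width 0.25
C1, C2, C3, C4, iip = decompose_gains(-0.35, 0.65, +1, 0.40, 0.25)
assert abs(C1 + C2 + C3 + C4 - 1.0) < 1e-9  # components sum to omega_M = 1
assert abs(iip - 0.60) < 1e-9               # equals omega_M - omega
```

By construction the four components sum to ω_M(x), and C_1(x) + C_2(x) equals ω_M(x) − ω(x), the quantity used in Section 5 as the IV identification power IIP(x).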
In practice, once the model has been estimated (parametrically or non-parametrically), the estimates can be used to construct the decomposition. Detailed numerical illustrations and simulations of the decomposition are presented in Sections 6 and 7.

By construction, the identification gains decomposition satisfies C_1(x) + C_2(x) + C_3(x) + C_4(x) = ω_M(x) = 1 for all x ∈ Ω_X, with each C_j(x) representing the proportion of total identification gains that can be attributed to the corresponding component. Based on the decomposition, we can then construct a quantitative measure of IV identification power in the partial identification setting. Suppose Assumption 2.1 holds, bar condition (c). For all x ∈ Ω_X, define the IV identification power IIP(x) as

IIP(x) := ω_M(x) − ω(x), if ν(X, Z) | X = x is nondegenerate;
IIP(x) := 0, if ν(X, Z) | X = x is degenerate,   (7)

where ω(x) is the widest width of the SV bounds defined in Proposition 3.3. Setting IIP(x) = 0 when ν(X, Z) | X = x is degenerate is equivalent to setting ω(x) = ω_M(x) = 1, meaning that the widest width of the SV bounds equals the width of the benchmark Manski bounds because the IVs are irrelevant. From the decomposition, we have
IIP(x) = C_1(x) + C_2(x) when the IVs are valid and relevant. Thus IIP(x) represents the proportion of the identification gains that is due to the IVs alone, and it can be viewed as an index of the IV identification power. The overall IV identification power can be obtained by taking the expectation of IIP(x) over Ω_X, i.e. E_X[IIP(X)]. The following proposition formalizes some important properties of IIP(x) as an indicator of the IV identification power.

Proposition 5.1
The index
IIP(x) lies in the unit interval [0, 1], and under Assumption 2.1 IIP(x) has the following properties:
(i) IIP(x) always lies in [0, 1] and can identify whether at least one of the IVs used to achieve the SV bounds is relevant;
(ii) IIP(x) = 0 if none of the IVs in Z are relevant, in which case the SV bounds reduce to the benchmark Manski bounds;
(iii) IIP(x) = 1 if the IVs in Z have perfect predictive power for the treatment D (identification at infinity holds), in the sense that there exist p* and p** in Ω_{P|x} such that Pr[D = 0 | x, p*] = 1 and Pr[D = 1 | x, p**] = 1. Moreover, ATE(x) is point identified when IIP(x) = 1.

Proposition 5.1 indicates that
IIP(x) is a meaningful measure of IV usefulness for improving the ATE partial identification. Therefore, values of IIP(x) can be compared across different sets of IVs, or across different values of x given the same set of IVs, since they are standardized relative to the same baseline benchmark. [Footnote: IIP(x) or E_X[IIP(X)] can also be compared across various studies if necessary.] For example, a value of, say, IIP(x) = 0.6 indicates that the IVs alone shrink the benchmark Manski bound width by 60 percentage points, so IIP(x) is a meaningful measure independent of the specific SV bounds. [Footnote: Theoretically, the value of IIP(x) lies in [0, 1] and the width of the Manski bounds is always one, so IIP(x) can be interpreted as the percentage points of the identification gains brought by the IVs. In finite-sample settings, where the estimated Manski bound width may no longer be exactly one, the same interpretation can be obtained by computing the ratio IIP(x)/ω_M(x) using their associated estimates.] In addition, the values of IIP(x) at its end points are intuitively interpretable: [Footnote: The definition allows IIP(x) to be discontinuous when Ω_{P|x} = {p_x} for some constant p_x ∈ [0, 1], i.e. when Ω_{P|x} is a singleton.] IIP(x) = 0 identifies situations where the IVs are completely irrelevant, and, when the IVs are able to perfectly predict the treatment status (when identification at infinity holds), IIP(x) = 1 and point identification of ATE(x) is achieved.

Numerical analysis is used in Section 6 to illustrate the behaviour of IIP(x) in a class of representative models. At this point we note that IIP(x) ignores the component of identification gains attributable to the exogenous covariates, namely C_3(x). In view of the additivity of the identification gains decomposition, this neglect seems entirely reasonable since we know, from Section 3, that for a given degree of endogeneity and given extremes of the CPS, the value of ω(x) does not vary with the identification information contained in the covariates. This indicates that IIP(x) is a measure of identification gains due to the IVs alone, without the contribution of the additional identification power provided by the exogenous covariates. It measures the smallest identification gains relative to the benchmark Manski bounds that can be achieved by a given set of IVs. More importantly, focusing on IIP(x) introduces considerable computational simplification when comparing sets of IVs, as it avoids the second layer of the intersection bounds required to compute the SV bounds.

In this section we illustrate numerically the theoretical results on the decomposition of the SV bounds studied in the previous sections, and how each component affects the SV bounds. We consider as our data generating process (DGP) a version of the model in (1) with a linear additive latent structure, similar to that studied in Li et al. (2019):

Y = 1[αD + βX + ε_1 > 0],
D = 1[γZ + πX + ε_2 > 0],   (8)

where the exogenous regressor X and the IV Z are assumed mutually independent, and, without loss of generality, X ∼ N(0,
1) and Z ∈ {−1, 1} with Pr(Z = 1) = 1/2. In addition, (X, Z)′ ⊥ (ε_1, ε_2), where the error term (ε_1, ε_2) is zero-mean bivariate normal with unit variances and correlation ρ. For this specification, given the distribution of Z, there is a monotonic one-to-one mapping from the coefficient of the IV, γ, to the range of the conditional propensity score. We capture changes in the extreme values of the CPS using a grid of values for γ together with ρ = −0.99 : 0.05 : 0.99. We set α = 1 and π = 0 across all parameter settings. Under this DGP, the SV bound width is affected by α, β and the variation of the exogenous covariates. Since α and the distribution of X are held fixed, we select β from a set of three values ranging from β = 0.05 to β = 0.45, so that changes in β capture the variation of the exogenous covariates given the distribution of X. Using the DGP as characterized by (8), we compute the SV bounds [L_SV(x), U_SV(x)] and the Manski bounds [L_M(x), U_M(x)], and implement the identification gains decomposition according to the true DGP. In what follows we present the outcomes obtained when x = E[X]. [Footnote: In our experiments we have calculated ATE(x) and its bounds at various quantile points of X, but space considerations prevent us from listing all our results here. We present the outcomes obtained when X equals its modal/mean value, as these are representative.]

In Figure 6.1, the subplots in the first row display the upper and lower bounds of ATE(x), and the subplots in the second row present the corresponding bound widths. For the Manski bounds we can see that the width is always one, and the upper and lower bounds stand on either side of zero, as previously noted. The SV bounds reduce to the Manski bounds when the IVs are irrelevant with γ = 0 (the separate lines in the graphs at γ = 0). When γ moves away from γ = 0, the SV bound width drops significantly. Then, as the magnitude of γ increases, i.e. as the end points of the CPS expand, the SV bound width decreases. In addition, since α > 0 and ATE(x) is positive, the SV bound width increases as ρ increases. Moreover, comparison of the plots for different values of β reveals that β plays a critical role in determining the SV bounds, in the sense that larger β produces significantly narrower bound widths. When β = 0.05 the SV bound width is non-negligible when the absolute value of γ is small, while when β = 0.45 point identification of ATE(x) is achieved for most of the (γ, ρ) pairs. These results indicate that for a given IV strength, as measured by γ or the associated range of the CPS, the lower the value of ρ in the (−1, +1) range or the bigger the impact of x, the narrower the SV bounds that can be achieved. In other words, for a given IV strength, a larger identification gain can be achieved if the error correlation ρ is large in magnitude and has the opposite sign to the sign of ATE(x).

[Figure 6.1: Manski and SV Bounds for ATE(x = E[X]), showing the Manski bound width and the SV bound widths for each of the three values of β. Note: Three-dimensional plots of the ATE bounds as functions of (γ, ρ). When γ = 0, the SV bounds reduce to the Manski bounds with bound width one.]

The decomposition of identification gains obtained when γ takes two representative values, ρ takes four values in (−1, 1) (two negative, two positive), and β takes its three values is displayed for x = E[X] in Figure 6.2. We can see that when ATE(x) is positive, the contribution of IV validity, as measured by C_1(x), is determined by the Manski lower bound, and decreases as ρ increases (conversely, numerical results not reported here show that when ATE(x) is negative, C_1(x) increases as ρ increases), while C_1(x) is invariant to β. By way of contrast, the contribution of the component C_2(x) also does not change with β, but it increases significantly as the magnitude of γ increases, due to the impact of the IVs on the range of the CPS. The component of identification gains
C_3(x) also contributes significantly when β is relatively large (e.g. β = 0.45).

Figure 6.3 depicts the index
IIP(x) as a function of (γ, ρ) on a lattice of γ values crossed with ρ ∈ {−0.99 : 0.05 : 0.99}. The plot confirms that, when ATE(x) is positive, the IV identification power IIP(x) increases as the IV strength (|γ|) increases, but for the same IV strength, IIP(x) is higher the lower the value of ρ. We also found, based on results not reported here, that, when ATE(x) is negative, a rising level of positive endogeneity drives up IIP(x) and reduces the width of the SV bounds.

By way of summary, the theoretical results presented in Sections 3, 4 and 5 are clearly reflected in the features observed in the numerical outcomes reported here. Firstly, IIP(x) is bigger when the IVs are stronger (|γ| higher). In addition, for a given IV strength in the first-stage treatment equation, higher IIP(x) can be achieved if the endogeneity ρ has the opposite sign to ATE(x) and is of high magnitude (|ρ|); and if the endogeneity is of the same sign as ATE(x), then the lower the degree of endogeneity, the better the identification power. Of course, adding the additional identification gain C_3(x) to IIP(x) leads to the SV bound width ω_SV(x), and C_3(x) depends on the properties of the covariates.

Next, we study the empirical performance of our decomposition analysis for alternative sets of IVs. We present finite-sample results to show how
IIP(x) can be used to rank the identification power of different sets of IVs and to potentially detect irrelevant IVs, when determining which set of IVs should be used to construct the ATE bounds. The advantage of this strategy over conventional IV strength evaluations (such as those akin to the first-stage IV F-statistic or the CPS) is that IIP(x) captures the IV identification power in terms of the IVs' ability to shrink the width of the ATE bounds, incorporating the IV strength and its interaction with the direction and magnitude of endogeneity in the nonlinear model. Consider i.i.d. samples generated from a DGP similar to (8) with two IVs:

Y = 1[αD + βX + ε_1 > 0],
D = 1[πX + γ_1 Z_1 + γ_2 Z_2 + ε_2 > 0],   (9)

The identification power
IIP(x) can provide a testable implication of IV relevance, but a formal test is beyond the scope of this paper.

[Figure 6.2: Decomposition of Identification Gains (x = E[X]); panels (a), (b) and (c) correspond to the three values of β. Note: The green line depicts the amount of the IV validity contribution C_1(x). To aid legibility, C_1(x), ..., C_4(x) have been rendered as C1, ..., C4 in each of the subplots in this figure. The x-axis displays the values of γ; for space limitations, only nonnegative values of γ are represented.]

[Figure 6.3: IIP(x) (x = E[X]). Note: Three-dimensional plot of IIP(x) as a function of (γ, ρ). The value of β does not affect IIP(x) in this case because π = 0 and no matches of Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′] exist for x = E[X] and z, z′ ∈ {−1, 1}. When γ = 0, IIP(x) = 0 because the IV is irrelevant.]

In (9), the two IVs in Z = (Z_1, Z_2)′ are Z_1 ∼ Bernoulli(1/
2) and Z_2, a discrete IV taking seven distinct values, three negative and four nonnegative, with unequal probabilities. We set α = 1, β = 1, π = −1, and the IV coefficients (γ_1, γ_2) to fixed positive values, and (ε_1, ε_2) is jointly normal with mean zero, variance one and correlation ρ ∈ {0.5, 0.8}. In addition, Z_1, Z_2 and X are mutually independent, and also independent of (ε_1, ε_2). Consider two cases of covariate variability: case 1, continuous X ∼ N(0, 1); and case 2, binary X ∼ Bernoulli(1/2). We evaluate the bounds at x = 0; the value of ATE(x) = E[Y_1 − Y_0 | X = 0] under the DGP (9) is 0.341.

In order to evaluate the finite-sample performance of IIP(x) as an index for measuring IV identification power, we consider five alternative sets of IV options. In addition to the two valid IVs Z_1 and Z_2 in the DGP, we introduce two "pseudo" IVs: a misspecified binary proxy Z̃_2 = 1[Z_2 > 0] of Z_2, and an irrelevant IV Z_3 ∈ {0, 1} such that Pr[Z_3 = 1] = 2/3 and Z_3 ⊥ (ε_1, ε_2, Z_1, Z_2, X). To illustrate the behaviour of the IIP(x) estimation, we use sample data for (Y, D, X) generated from the DGP in (9) to estimate models with five alternative IV sets: (1) only one valid IV Z_1 (omitting Z_2); (2) only one valid IV Z_2 (omitting Z_1); (3) one valid Z_1 and one misspecified Z̃_2; (4) two valid IVs Z_1 and Z_2; and (5) the two valid IVs Z_1 and Z_2 plus one irrelevant Z_3.

Table 7.1: Population CPS Range and IIP(x) (x = 0, cases 1 and 2)

| Sets | IVs | CPS definition | CPS Range | IIP(x) (ρ = 0.5) | IIP(x) (ρ = 0.8) |
| (1) | Z_1 | Pr[D = 1 | x, Z_1] | [·, ·] | · | · |
| (2) | Z_2 | Pr[D = 1 | x, Z_2] | [·, ·] | · | · |
| (3) | Z_1, Z̃_2 | Pr[D = 1 | x, Z_1, Z̃_2] | [·, ·] | · | · |
| (4) | Z_1, Z_2 | Pr[D = 1 | x, Z_1, Z_2] | [·, ·] | · | · |
| (5) | Z_1, Z_2, Z_3 | Pr[D = 1 | x, Z_1, Z_2, Z_3] | [·, ·] | · | · |

Note: The population CPS and
IIP(x) are the same for case 1 and case 2.

Table 7.1 reports the population CPS range and IIP(x) for cases 1 and 2, at x = 0. Note that the covariate variability impacts neither the population CPS nor IIP(x), so the values of the CPS range and IIP(x) for case 1 are the same as those for case 2. Looking at the CPS range as a measure of IV strength, we can see that the CPS range is the widest when both valid and relevant IVs Z_1 and Z_2 are used, as in (4). Adding an irrelevant IV Z_3 does not change the theoretical CPS range, so theoretically (5) has the same IV strength as (4). The CPS range decreases when only one of the two valid IVs is used, as in (1) and (2), with Z_2 being stronger, with a wider CPS range, than Z_1. As expected, when a valid IV is incorrectly specified as a proxy dummy Z̃_2 as in (3), the CPS range is narrower than that of the best set in (4), but wider than that in (1) with Z_1 alone. Interestingly, comparing IV set (3) with (2), set (2) with only one valid IV actually results in a wider CPS range than that for the two IVs in set (3) with Z_2 misspecified, though the CPS interval for (3) is not completely nested within the interval for (2).

Whilst the CPS range indicates the IV strength, it is IIP(x) that captures the identification power of each IV set, measuring the reduction of the SV bound width relative to the benchmark Manski bound width due to the contribution of the IVs. As seen from the two IIP(x) columns in Table 7.1, the same IV strength achieves bigger identification gains for ρ = 0.5 than for ρ = 0.8. This is consistent with the results in Section 6: as ρ and ATE(x) are both positive in this case, the lower the absolute value of ρ, the higher the IIP(x). For example, for IV set (4), the Manski bound width is reduced by a smaller amount when ρ = 0.8, and the reduction increases to 0.625 (or 62.5 percentage points) when ρ = 0.5. The equally most powerful IV sets are (4) and (5), and the least powerful set is (1).

We next present the finite-sample estimation of the Manski and SV bounds, and conduct the decomposition analysis based on the estimates of the bounds. The sample size is set to n = 500, 5000 and 10000, and each design is replicated M = 1000 times. Tables 7.2 to 7.5 present the sample averages (over the M replications) of the estimated bounds, the estimated C_1(x) to C_4(x), and IIP(x) for the five IV sets at x = 0. We use the "half-median-unbiased estimator" (HMUE) of the intersection bounds proposed by Chernozhukov, Lee, and Rosen (2013) (hereafter CLR) to estimate the benchmark Manski bounds and the SV bounds. In particular, we employ maximum likelihood estimation (MLE) to estimate the bounding functions and select the critical values for bias correction according to the simulation-based methodology of CLR. [Footnote: The CLR half-median-unbiased estimator produces an upper bound estimator that exceeds its true value and a lower bound estimator that falls below its true value, each with probability at least one half asymptotically. We report the HMUE of the Manski bounds for comparison purposes; other estimation methods for the Manski bounds are also available, see e.g. Imbens and Manski (2004).] [Footnote: Theoretically, the construction of the SV bounds requires the matching of pairs (x, z) and (x′, z′) such that Pr[D = 1 | x, z] = Pr[D = 1 | x′, z′]. In practice, it is hard to find such pairs with exactly equal CPS, especially when the variation of the covariates is limited. In the simulations, the SV bounds are computed by matching (x, z) and (x′, z′) such that |Pr[D = 1 | x, z] − Pr[D = 1 | x′, z′]| < c, with c = 1%. Although the estimated SV bounds depend on c, the estimated IIP(x) does not; therefore the choice of c has no impact on the performance of IIP(x).]

The results of Tables 7.2 to 7.5 relate to the two different covariate distributions (case 1: X ∼ N(0,
1); case 2: X ∼ Bernoulli(1/2)) and the two ρ values (ρ = 0.5, 0.8). In case 1 (Tables 7.2 and 7.3), where the covariate possesses sufficient variation, the true SV bounds point identify the ATE(x) for both ρ = 0.5 and ρ = 0.
8. In case 2 (Tables 7.4 and 7.5), the true SV bounds fail to point identify the ATE(x) due to the limited variation in X.

Next, we focus on the left part of each table, which displays the HMUEs of the ATE bounds and the Hausdorff distance between the true bounds and the estimated bounds, evaluated at x = 0. For all four tables, we can see that the estimated Manski bounds are the same across all five IV sets, always include zero, and have a width a little over one. The estimated SV bounds identify the sign of ATE(x) for all five IV sets. Moreover, the IV sets with greater identification power lead to narrower estimated SV bounds and also improve the estimation accuracy in most of the scenarios. More precisely, the Hausdorff distance of the estimated SV bounds to the true bounds decreases as the IV identification power increases. Moving to the right part of each table, we first note that, for each given IV set, the estimated C_1(x) to C_4(x) and IIP(x) all converge to their true values as the sample size n increases, indicating that the estimated identification gains are more accurate for larger sample sizes. We also note that the estimated C_1(x), which is determined by the Manski bounds, is the same for different IV sets. This result is quite intuitive, because the identification gains brought by IV validity should not vary with the IV strength. Comparison of Tables 7.2 and 7.3, or of Tables 7.4 and 7.5, also reveals that the impact of the endogeneity degree on IV identification power is captured by the estimated IIP(x). Importantly, the true ranking of IIP(x) in Table 7.1 is correctly revealed by the finite-sample estimates of IIP(x).

It is interesting to analyze the effect of adding an additional but completely irrelevant IV on the finite-sample performance of ATE partial identification, by comparing the results obtained using IV sets (4) and (5).
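Since the estimated and true bounds here are closed intervals, the Hausdorff distance reduces to the larger of the two endpoint discrepancies. A short Python sketch (the function name is ours; the interval values are illustrative):

```python
def hausdorff_interval(a, b):
    """Hausdorff distance between two non-empty closed intervals
    a = (a_lo, a_hi) and b = (b_lo, b_hi). On the real line it
    equals max(|a_lo - b_lo|, |a_hi - b_hi|)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return max(abs(a_lo - b_lo), abs(a_hi - b_hi))

# Estimated SV bounds vs. a point-identified true bound at ATE(x) = 0.341
d = hausdorff_interval((0.117, 0.775), (0.341, 0.341))
assert abs(d - 0.434) < 1e-9
```

With the estimated SV bounds [0.117, 0.775] and a point-identified true bound at the true ATE(x) = 0.341, this returns 0.434, consistent with the d_H(x) value reported for IV set (1) at n = 500 in Table 7.2.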
Adding Z_3 to (Z_1, Z_2) actually produces a small decrease in the estimated IIP(x), on average, for almost all of the DGP designs considered in this section. The Cramér–von Mises test and the Kolmogorov–Smirnov test confirm that the average values of the estimates of IIP(x) under scenario (4) are significantly different from those obtained under scenario (5) when the sample size is N = 500 or N = 5000, for both endogeneity degrees and for both case 1 and case 2. When the sample size is sufficiently large, N = 10000, the estimates of IIP(x) under scenarios (4) and (5) are no longer significantly different, except [Footnote: Simulation results of the bounds at other values of x display similar patterns to those at x = 0 and are therefore not reported, due to space limitations.] [Footnote: The Hausdorff distance between sets A and B is defined as max{sup_{a∈A} d(a, B), sup_{b∈B} d(b, A)}, where d(b, A) := inf_{a∈A} ||b − a||, and is taken to be ∞ if either A or B is empty. The Hausdorff distance is a natural generalization of Euclidean distance and has been employed to study convergence properties when a set, rather than a point, is the parameter of interest; see e.g. Hansen et al. (1995), Manski and Tamer (2002) and Chernozhukov et al. (2007).] [Footnote: Because C_1(x) to C_4(x) are functions of L_M(x), U_M(x), ω(x) and ω_SV(x), the estimates of C_1(x) to C_4(x) are computed using the HMUE of the bounds or their widths. We compute ω(x) as the width of the estimated bounds (by the HMUE of CLR) [L_SV(x), U_SV(x)] in (14) if ATE(x) > 0.] when ρ = 0.
8. This suggests that, in practice, the loss of information (efficiency) that arises from using an irrelevant IV can have a statistically significant practical effect on the IV identification power, which can be captured by our proposed index IIP(x). Such an information loss can lead to wider ATE bounds, especially when the covariate possesses limited variation. In particular, from Tables 7.4 and 7.5 we can see that when the covariate X is a binary variable (case 2), on average, the estimated SV bounds using (Z_1, Z_2) are significantly narrower than those estimated with the IV set including the irrelevant IV, (Z_1, Z_2, Z_3), especially for small sample sizes. Analyzing the results across the replications, we find that about 78% (for both endogeneity degrees) of the replications give narrower estimated SV bounds with IV set (Z_1, Z_2) than with (Z_1, Z_2, Z_3) for sample size N = 500; this rate falls to 53% (ρ = 0.5) and 64% (ρ = 0.8) for the sufficiently large sample size N = 10000.

On the other hand, IV irrelevancy cannot always be detected by simply comparing the estimated SV bound widths under different IV sets. That is, adding an irrelevant IV as in (5) can further shrink the SV bound width when the covariate X is continuous, although the improvement occurs at the third decimal place and its degree decreases as the sample size increases. [Footnote: The shrinkage of the estimated SV bounds using the irrelevant Z_3 is due to finite-sample estimation error. In particular, because the estimate of the coefficient of the irrelevant Z_3 is nonzero with probability one, it results in more matched pairs (x, z) and (x′, z′) such that |Pr[D = 1 | x, z] − Pr[D = 1 | x′, z′]| < c (see footnote 12), especially when the covariate is continuous.] These outcomes reinforce a fortiori the warning that simply adding extra IVs without assessing their identification power is unlikely to be a good practical modelling strategy, whereas the finite-sample estimates of our proposed IIP(x) are more reliable in detecting the efficiency loss from IV irrelevancy. Footnote: For case 1, in Tables 7.2 and 7.3 we find that when the sample size is N = 500, (i) there are 22% (ρ = 0.5) and a similar percentage (ρ = 0.
8) of the 1000 replications where at least one (either lower or upper) estimated SV bound using (Z_1, Z_2) is closer to its true value, compared to that obtained by also using the irrelevant IV; and (ii) 12% of the replications yield wider estimated SV bounds when using the irrelevant IV, for both endogeneity degrees.

Table 7.2: Case 1. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.5, X ∼ N(0, 1), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.246, 0.899] | 0.092 | [0.117, 0.775] | 0.434 | 0.246 | 0.186 | 0.056 | 0.658 | 0.432 |
| (2) only Z_2 | | | [0.246, 0.562] | 0.227 | | 0.342 | 0.241 | 0.316 | 0.587 |
| (3) Z_1, Z̃_2 | | | [0.193, 0.759] | 0.418 | | 0.218 | 0.116 | 0.565 | 0.464 |
| (4) Z_1, Z_2 | | | [0.290, 0.455] | 0.121 | | 0.436 | 0.298 | 0.165 | 0.682 |
| (5) Z_1, Z_2, Z_3 | | | [0.300, 0.451] | 0.116 | | 0.424 | 0.324 | 0.151 | 0.670 |
| n = 5000 |
| (1) only Z_1 | [-0.202, 0.846] | 0.030 | [0.121, 0.768] | 0.427 | 0.202 | 0.145 | 0.053 | 0.648 | 0.347 |
| (2) only Z_2 | | | [0.266, 0.372] | 0.078 | | 0.334 | 0.406 | 0.106 | 0.536 |
| (3) Z_1, Z̃_2 | | | [0.221, 0.757] | 0.416 | | 0.194 | 0.116 | 0.536 | 0.395 |
| (4) Z_1, Z_2 | | | [0.312, 0.377] | 0.043 | | 0.446 | 0.335 | 0.066 | 0.648 |
| (5) Z_1, Z_2, Z_3 | | | [0.316, 0.373] | 0.038 | | 0.442 | 0.347 | 0.057 | 0.644 |
| n = 10000 |
| (1) only Z_1 | [-0.198, 0.838] | 0.022 | [0.123, 0.768] | 0.427 | 0.198 | 0.139 | 0.054 | 0.645 | 0.337 |
| (2) only Z_2 | | | [0.263, 0.363] | 0.080 | | 0.331 | 0.407 | 0.101 | 0.528 |
| (3) Z_1, Z̃_2 | | | [0.225, 0.756] | 0.414 | | 0.189 | 0.118 | 0.531 | 0.387 |
| (4) Z_1, Z_2 | | | [0.317, 0.365] | 0.031 | | 0.444 | 0.346 | 0.048 | 0.642 |
| (5) Z_1, Z_2, Z_3 | | | [0.320, 0.362] | 0.027 | | 0.443 | 0.353 | 0.042 | 0.641 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications. The Manski bounds, their d_H(x) and C_1(x) are common to all five IV sets and are reported only in row (1) of each sample-size block.

Table 7.3:
Case 1. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.8, X ∼ N(0, 1), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.157, 0.996] | 0.098 | [0.124, 0.873] | 0.532 | 0.157 | 0.205 | 0.041 | 0.750 | 0.362 |
| (2) only Z_2 | | | [0.233, 0.559] | 0.229 | | 0.382 | 0.288 | 0.326 | 0.539 |
| (3) Z_1, Z̃_2 | | | [0.191, 0.848] | 0.507 | | 0.246 | 0.093 | 0.657 | 0.403 |
| (4) Z_1, Z_2 | | | [0.291, 0.437] | 0.107 | | 0.495 | 0.355 | 0.146 | 0.652 |
| (5) Z_1, Z_2, Z_3 | | | [0.298, 0.431] | 0.100 | | 0.482 | 0.382 | 0.133 | 0.639 |
| n = 5000 |
| (1) only Z_1 | [-0.121, 0.924] | 0.028 | [0.128, 0.860] | 0.519 | 0.121 | 0.149 | 0.042 | 0.732 | 0.271 |
| (2) only Z_2 | | | [0.254, 0.357] | 0.088 | | 0.346 | 0.475 | 0.103 | 0.467 |
| (3) Z_1, Z̃_2 | | | [0.208, 0.853] | 0.512 | | 0.210 | 0.068 | 0.645 | 0.332 |
| (4) Z_1, Z_2 | | | [0.312, 0.378] | 0.043 | | 0.489 | 0.369 | 0.066 | 0.610 |
| (5) Z_1, Z_2, Z_3 | | | [0.315, 0.373] | 0.038 | | 0.486 | 0.380 | 0.058 | 0.607 |
| n = 10000 |
| (1) only Z_1 | [-0.117, 0.918] | 0.022 | [0.129, 0.860] | 0.519 | 0.117 | 0.146 | 0.042 | 0.731 | 0.263 |
| (2) only Z_2 | | | [0.258, 0.357] | 0.083 | | 0.346 | 0.473 | 0.099 | 0.463 |
| (3) Z_1, Z̃_2 | | | [0.212, 0.851] | 0.510 | | 0.209 | 0.071 | 0.639 | 0.326 |
| (4) Z_1, Z_2 | | | [0.316, 0.369] | 0.034 | | 0.491 | 0.374 | 0.053 | 0.607 |
| (5) Z_1, Z_2, Z_3 | | | [0.319, 0.365] | 0.030 | | 0.491 | 0.381 | 0.046 | 0.607 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications.

Table 7.4: Case 2. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.5, X ∼ Bernoulli(1/2), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.263, 0.904] | 0.102 | [0.060, 0.776] | 0.237 | 0.263 | 0.185 | 0.002 | 0.716 | 0.448 |
| (2) only Z_2 | | | [0.110, 0.669] | 0.179 | | 0.359 | -0.014 | 0.559 | 0.621 |
| (3) Z_1, Z̃_2 | | | [0.098, 0.769] | 0.224 | | 0.237 | -0.004 | 0.671 | 0.499 |
| (4) Z_1, Z_2 | | | [0.166, 0.647] | 0.131 | | 0.439 | -0.017 | 0.481 | 0.701 |
| (5) Z_1, Z_2, Z_3 | | | [0.160, 0.656] | 0.140 | | 0.433 | -0.025 | 0.496 | 0.695 |
| n = 5000 |
| (1) only Z_1 | [-0.206, 0.849] | 0.034 | [0.068, 0.769] | 0.223 | 0.206 | 0.148 | 0.000 | 0.701 | 0.354 |
| (2) only Z_2 | | | [0.135, 0.640] | 0.148 | | 0.337 | 0.007 | 0.506 | 0.543 |
| (3) Z_1, Z̃_2 | | | [0.115, 0.754] | 0.207 | | 0.211 | 0.000 | 0.639 | 0.417 |
| (4) Z_1, Z_2 | | | [0.210, 0.619] | 0.079 | | 0.446 | -0.006 | 0.409 | 0.653 |
| (5) Z_1, Z_2, Z_3 | | | [0.208, 0.620] | 0.081 | | 0.444 | -0.007 | 0.412 | 0.650 |
| n = 10000 |
| (1) only Z_1 | [-0.198, 0.841] | 0.024 | [0.069, 0.768] | 0.221 | 0.198 | 0.141 | 0.001 | 0.699 | 0.339 |
| (2) only Z_2 | | | [0.138, 0.640] | 0.145 | | 0.333 | 0.005 | 0.502 | 0.531 |
| (3) Z_1, Z̃_2 | | | [0.118, 0.751] | 0.204 | | 0.207 | 0.000 | 0.633 | 0.406 |
| (4) Z_1, Z_2 | | | [0.216, 0.612] | 0.070 | | 0.447 | -0.006 | 0.396 | 0.645 |
| (5) Z_1, Z_2, Z_3 | | | [0.217, 0.613] | 0.071 | | 0.447 | -0.003 | 0.396 | 0.645 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications.

Table 7.5:
Case 2. True and Estimated Bounds, and Decomposition of Identification Gains (ρ = 0.8, X ∼ Bernoulli(1/2), x = 0)

| IV set | [L_M(x), U_M(x)] | d_H(x) | [L_SV(x), U_SV(x)] | d_H(x) | C_1(x) | C_2(x) | C_3(x) | C_4(x) | IIP(x) |
| True DGP: Z_1, Z_2 | [−·, ·] | | [·, ·] | | | | | | |
| n = 500 |
| (1) only Z_1 | [-0.165, 0.972] | 0.084 | [0.077, 0.868] | 0.276 | 0.165 | 0.183 | -0.001 | 0.790 | 0.348 |
| (2) only Z_2 | | | [0.114, 0.751] | 0.212 | | 0.330 | 0.006 | 0.637 | 0.495 |
| (3) Z_1, Z̃_2 | | | [0.133, 0.863] | 0.270 | | 0.243 | -0.001 | 0.730 | 0.408 |
| (4) Z_1, Z_2 | | | [0.209, 0.732] | 0.154 | | 0.458 | -0.008 | 0.523 | 0.623 |
| (5) Z_1, Z_2, Z_3 | | | [0.200, 0.738] | 0.164 | | 0.441 | -0.007 | 0.538 | 0.606 |
| n = 5000 |
| (1) only Z_1 | [-0.117, 0.925] | 0.026 | [0.086, 0.861] | 0.268 | 0.117 | 0.149 | 0.001 | 0.776 | 0.266 |
| (2) only Z_2 | | | [0.144, 0.720] | 0.175 | | 0.340 | 0.010 | 0.576 | 0.457 |
| (3) Z_1, Z̃_2 | | | [0.154, 0.848] | 0.256 | | 0.232 | -0.001 | 0.694 | 0.349 |
| (4) Z_1, Z_2 | | | [0.255, 0.694] | 0.102 | | 0.486 | 0.001 | 0.439 | 0.603 |
| (5) Z_1, Z_2, Z_3 | | | [0.255, 0.696] | 0.105 | | 0.483 | 0.001 | 0.440 | 0.600 |
| n = 10000 |
| (1) only Z_1 | [-0.111, 0.919] | 0.019 | [0.087, 0.860] | 0.267 | 0.111 | 0.146 | 0.000 | 0.773 | 0.257 |
| (2) only Z_2 | | | [0.148, 0.713] | 0.171 | | 0.338 | 0.015 | 0.565 | 0.450 |
| (3) Z_1, Z̃_2 | | | [0.158, 0.846] | 0.253 | | 0.230 | 0.000 | 0.688 | 0.342 |
| (4) Z_1, Z_2 | | | [0.263, 0.693] | 0.100 | | 0.491 | -0.002 | 0.430 | 0.603 |
| (5) Z_1, Z_2, Z_3 | | | [0.263, 0.692] | 0.100 | | 0.489 | 0.001 | 0.429 | 0.601 |

Note: The estimated bounds, the Hausdorff distance d_H(x) and the decompositions are averages over 1000 replications.

Empirical Application: Women LFP and Childbearing
In this section, we apply our novel decomposition and IV evaluation method to study the effects of childbearing on women's labor supply. The dataset analyzed here is from the 1980 Census Public Use Micro Samples (PUMS), available at Angrist and Evans (2009). We follow the data construction in Angrist and Evans (1998), where the sample consists of married women aged 21–35 with two or more children. The dataset contains 254,652 observations; see Table 2 in Angrist and Evans (1998) for more details and descriptive statistics. The binary outcome Y indicates whether an individual was paid for work in the year prior to the census (Y = 1) or not (Y = 0). The treatment effect of interest is the impact of having more than two children on the labor force participation Y. Thus, the binary treatment is D ∈ {0, 1}, with D = 1 denoting having more than two children.

Following Angrist and Evans (1998, Table 11), we use as continuous regressors the woman's age, her age at first birth, and the ages of the first two children (in quarters), and binary regressors for the first child being a boy, the second child being a boy, black, hispanic, and other race, as well as the interactions of the above-mentioned continuous and indicator variables. For computational simplicity, we reduce the dimension of the covariates by utilizing the conditional propensity score X_P := P̂r[D = 1 | X] as a covariate, where P̂r[D = 1 | X] is estimated via a probit model and X includes all of the regressors mentioned above. Three sets of IVs are considered in this section: (1) the binary indicator that the first two children are the same sex ("Samesex"), (2) the binary indicator that the second birth was a twin ("Twins"), and (3) both indicators ("Both = {Samesex, Twins}"). To provide a basis for comparison of the SV bounds with other ATE bounding analyses, we also compute the ATE bounds of Heckman and Vytlacil (2001) (hereafter HV bounds) and Chesher (2010) (hereafter Chesher bounds). To be consistent with our previous numerical analyses in Section 7, we use the method of CLR to compute all four bounds of interest, via MLE for estimating the bounding functions and the simulation-based method for correcting the bias of the intersecting bounds.

Table 8.1 reports the weighted averages of the HMUE and of the CLR two-sided confidence intervals (at the 90%, 95% and 99% significance levels) of the four bounds of ATE(X_P), with weights given by the estimated kernel density of X_P. Panels (a), (b) and (c) display the results using IV Samesex, Twins and
Both, respectively. The estimated averages of the Manski bounds in all three panels are essentially identical, since the Manski bounds do not depend on IVs. In all panels, the HV bounds improve on the benchmark Manski bounds, with the HV bound width using Twins being narrower than that using Samesex, and the HV bound width using Both being the narrowest. The Chesher bounds using Samesex fail to identify the sign of ATE(X_P), as the estimate is a union of a negative and a positive interval. When the IV Twins or Both is used instead, the weighted average of the 95% confidence interval of the Chesher bounds is [−0.349, −0.019] (using Twins) or [−0.335, −0.026] (using Both), revealing negative effects of having a third child on women's labor force participation. For the SV bounds, the results using the IV Twins or Both dramatically outperform those using Samesex: the 95% confidence intervals using Samesex, Twins and Both are [−0.548, −0.022], [−0.272, −0.031] and [−0.269, −0.042], respectively. To summarize the results above, we can see that for the ATE bounds in which the IV plays a key role in extracting identifying information, i.e. the HV, Chesher and SV bounds, the IV
Both gives us the narrowest bounds (on average).

The ranking of the IV identification power of the three available IVs revealed by the discussion above is confirmed and explained by the identification gains decomposition and the IIP reported in Table 8.2. The results based on the 95% confidence interval show that, given the same contribution of IV validity for the three IVs, which is 44.6% on average, the identification power of Twins (68.2%) is significantly larger than that of Samesex (47.1%). Closer inspection of the data reveals that the contribution of Twins to the identification gains exceeds that of Samesex because whenever Twins = 1 the treatment D = 1, i.e. Twins is a perfect predictor of being treated, whereas this is not the case for Samesex. It is this feature, of course, that explains the superior performance when the HV, Chesher and SV bounds are evaluated using Twins rather than Samesex. Moreover, when both IVs Samesex and Twins are used, the identification power of Both (70.3%) exceeds that of either single IV, Samesex or Twins. This indicates that although the identification power of Samesex is dominated by Twins, Samesex can still make an extra contribution when identifying the ATE. This is intuitive because the mechanisms through which the two IVs drive the probability of having a third child are different. One remark on the above analysis is that, for other ATE bounds that exploit the identifying information of IVs, for example the HV and Chesher bounds, IVs with a higher IIP clearly lead to narrower bounds for the ATE. This indicates that although the IIP is constructed to measure the IV's contribution to the SV bounds, it is also a meaningful measure of IV identification power and can be used to indicate IV relevance in other ATE bounds.

The two-stage least squares (2SLS) estimates of Angrist and Evans (1998, Table 11) give an ATE estimate of −0.123 using Samesex and an estimate of −0.087 using Twins, each with an associated 95% confidence interval. As would be expected, the 95% two-sided confidence intervals of all four bounds cover the 2SLS estimates and their associated 95% confidence intervals for both IVs.

(a) IV: Samesex
Manski HV Chesher SV
HMUE [-0.560,0.439] [-0.537,0.401] [-0.537,-0.011] ∪ [0.011,0.401] [-0.538,-0.030]
90% CI [-0.566,0.445] [-0.546,0.411] [-0.546,-0.005] ∪ [0.005,0.411] [-0.546,-0.023]
95% CI [-0.567,0.446] [-0.548,0.412] [-0.548,-0.004] ∪ [0.004,0.412] [-0.548,-0.022]
99% CI [-0.569,0.448] [-0.551,0.416] [-0.551,-0.001] ∪ [0.001,0.416] [-0.551,-0.020]

(b) IV: Twins
Manski HV Chesher SV
HMUE [-0.560,0.439] [-0.304,0.113] [-0.305,-0.061] [-0.185,-0.101]
90% CI [-0.566,0.445] [-0.341,0.151] [-0.342,-0.026] [-0.259,-0.042]
95% CI [-0.567,0.446] [-0.349,0.158] [-0.349,-0.019] [-0.272,-0.031]
99% CI [-0.569,0.448] [-0.364,0.172] [-0.365,-0.004] [-0.299,-0.012]

(c) IV: Both = {Samesex, Twins}
Manski HV Chesher SV
HMUE [-0.560,0.439] [-0.295,0.097] [-0.295,-0.065] [-0.200,-0.105]
90% CI [-0.566,0.445] [-0.329,0.131] [-0.329,-0.032] [-0.259,-0.051]
95% CI [-0.567,0.446] [-0.336,0.137] [-0.335,-0.026] [-0.269,-0.042]
99% CI [-0.569,0.448] [-0.349,0.151] [-0.349,-0.011] [-0.289,-0.027]
Note: The first row of panels (a)-(c) reports the weighted average of the HMUE of the four ATE bounds, and the second to fourth rows report the weighted averages of the CLR two-sided confidence intervals at different significance levels.
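The Manski column above serves as the no-IV benchmark, and its near-unit width is no accident: for a binary outcome, the no-assumption bounds of Manski (1990) always have width exactly one. The following minimal Python sketch illustrates the construction; the conditional probability values are hypothetical, chosen only for illustration.

```python
def manski_bounds(p_y1d1, p_y1d0, p_d1):
    """No-assumption (Manski, 1990) bounds on ATE(x) for a binary outcome.

    Inputs are the observable conditional probabilities
    p_y1d1 = Pr[Y=1, D=1 | x], p_y1d0 = Pr[Y=1, D=0 | x], p_d1 = Pr[D=1 | x].
    Pr[Y_1 = 1 | x] lies in [p_y1d1, p_y1d1 + (1 - p_d1)] and
    Pr[Y_0 = 1 | x] lies in [p_y1d0, p_y1d0 + p_d1]."""
    lower = p_y1d1 - (p_y1d0 + p_d1)          # smallest Y_1-prob minus largest Y_0-prob
    upper = (p_y1d1 + 1.0 - p_d1) - p_y1d0    # largest Y_1-prob minus smallest Y_0-prob
    return lower, upper

# Hypothetical conditional probabilities, for illustration only.
lo, hi = manski_bounds(p_y1d1=0.30, p_y1d0=0.25, p_d1=0.40)
print(lo, hi)   # the width hi - lo is always exactly 1
```

This is why the Manski intervals in panels (a)-(c) are essentially identical and roughly one unit wide: all of the narrowing must come from the IVs and the covariates.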
Table 8.2: Decomposition of Identification Gains and Instrument Identification Power

(a) IV: Samesex
                  C1     C2     C3     C4     IIP
Based on HMUE     0.439  0.034  0.019  0.508  0.473
Based on 90% CI   0.445  0.026  0.018  0.523  0.472
Based on 95% CI   0.446  0.024  0.018  0.526  0.471
Based on 99% CI   0.448  0.021  0.019  0.532  0.471

(b) IV: Twins
                  C1     C2     C3     C4     IIP
Based on HMUE     0.439  0.317  0.163  0.081  0.756
Based on 90% CI   0.445  0.250  0.100  0.216  0.695
Based on 95% CI   0.446  0.236  0.090  0.242  0.682
Based on 99% CI   0.448  0.209  0.075  0.286  0.657

(c) IV: Both = {Samesex, Twins}
                  C1     C2     C3     C4     IIP
Based on HMUE     0.439  0.330  0.134  0.096  0.769
Based on 90% CI   0.445  0.270  0.090  0.206  0.715
Based on 95% CI   0.446  0.257  0.085  0.226  0.703
Based on 99% CI   0.448  0.232  0.078  0.260  0.681

Note: C1–C4 and IIP are the weighted averages of their associated conditional estimates given X_P, with the kernel density of X_P as weights. For panels (a) to (c), C1 to C4 are computed as described in footnote 14, and the estimates in each row correspond to different significance levels of the CLR estimation.
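The kernel-density weighting described in the note can be sketched as follows. This is a minimal illustration, not the paper's estimation code: the Gaussian kernel, the bandwidth, and the simulated propensity scores and conditional IIP(x) curve are all hypothetical.

```python
import numpy as np

def kde_weighted_average(xp_sample, grid, estimates, bandwidth=0.05):
    """Average conditional estimates over a grid of X_P values, weighting
    each grid point by a Gaussian kernel density estimate of X_P, so that
    propensity-score regions with more observations receive more weight."""
    z = (grid[:, None] - xp_sample[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).mean(axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))
    weights = density / density.sum()                 # normalize to sum to one
    return float(np.sum(weights * estimates))

rng = np.random.default_rng(1)
xp = rng.beta(2, 5, size=1000)       # hypothetical estimated propensity scores X_P
grid = np.linspace(0.01, 0.99, 50)   # points at which the conditional estimates live
iip_grid = 0.7 - 0.3 * grid          # a hypothetical declining IIP(x) curve
avg = kde_weighted_average(xp, grid, iip_grid)
print(avg)
```

Weighting by the density of X_P ensures that propensity-score regions containing many observations dominate the reported averages, as in Tables 8.1 and 8.2.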
To explore the heterogeneity of the treatment effects, Figure 8.1 graphs the four bounds of interest against X_P. From Figure 8.1, we can see that when the more powerful of the three IVs are employed, namely Twins or Both, the HV bounds narrow down the possible range of ATE(X_P) relative to the benchmark Manski bounds, especially for individuals with a small probability of having a third child. In addition, they can even identify the negative effect for individuals with a propensity score X_P close to zero. Similar properties are exhibited by the Chesher bounds. The SV bounds indicate that for women who are less likely to have more than two children, it is more probable that there will be a negative effect on their labor force participation once they have a third child, roughly in the region of −10% to −15%. For individuals who are more likely to have more than two children, the effect of having a third child is still negative but with a larger possible range, roughly from −10% to −40% when their propensity score is about 0.6, and roughly from 0% to −30% when their propensity score is close to one.

To check the heterogeneity of the IV identification power, Figure 8.2 displays the decompositions plotted against X_P. It is obvious that the IV identification power of Twins and Both is significantly larger than that of Samesex, across all possible values of X_P. Furthermore, the contribution of the covariate appears to be amplified when Twins is involved in deriving the bounds, leading to a further reduction in the width of the unexplained part relative to the benchmark.
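The conditional IIP(x) curves in Figure 8.2 are driven by the width formula of Proposition 3.2: for a positive ATE(x), IIP(x) = 1 − ω̄(x), where ω̄(x) is the sum of Pr[Y=1, D=1 | x, ·] at the lowest attainable propensity score and Pr[Y=0, D=0 | x, ·] at the highest. The sketch below evaluates this by Monte Carlo under an illustrative DGP — the bivariate-normal errors, the index values m(1) = 0.5 and m(0) = −0.5, the endogeneity ρ = 0.5, and the two propensity-score ranges are all hypothetical choices, not estimates from the application.

```python
import numpy as np

def iip_mc(nu1, nu0, rho, p_lo, p_hi, n=400_000, seed=0):
    """Monte Carlo sketch of IIP(x) = 1 - omega_bar(x) for a positive ATE(x),
    where omega_bar(x) = Pr[Y=1,D=1 | x, p_lo] + Pr[Y=0,D=0 | x, p_hi] is the
    widest possible SV width.  (e1, e2) are bivariate standard normal with
    correlation rho; Y = 1{e1 < nu(D, x)} and D = 1{F(e2) < p}, with F(e2)
    approximated by the empirical CDF of the simulated e2."""
    rng = np.random.default_rng(seed)
    e = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    e1, e2 = e[:, 0], e[:, 1]
    u2 = e2.argsort().argsort() / (n - 1.0)           # empirical CDF of e2
    pr_y1d1_lo = np.mean((e1 < nu1) & (u2 < p_lo))    # Pr[Y=1, D=1 | p_lo]
    pr_y0d0_hi = np.mean((e1 >= nu0) & (u2 >= p_hi))  # Pr[Y=0, D=0 | p_hi]
    return 1.0 - (pr_y1d1_lo + pr_y0d0_hi)

# A weak IV barely moves the propensity score; a strong one moves it a lot.
weak = iip_mc(nu1=0.5, nu0=-0.5, rho=0.5, p_lo=0.40, p_hi=0.60)
strong = iip_mc(nu1=0.5, nu0=-0.5, rho=0.5, p_lo=0.05, p_hi=0.95)
print(round(weak, 2), round(strong, 2))
```

Widening the attainable propensity-score range shrinks both residual probabilities, so the IIP rises toward one — the same mechanism that makes Twins, a near-perfect predictor of treatment when Twins = 1, so much more informative than Samesex.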
Figure 8.1: Estimated Bounds of ATE(x). Note: Panels (a)-(c) plot the estimated ATE(x) bounds (Manski, HV, Chesher and SV) as functions of the propensity score X_P, for IV Samesex, Twins and Both, respectively. The red lines are the upper bounds and the blue lines are the lower bounds, and the blue shaded areas represent the confidence regions.

Figure 8.2: Decomposition of Identification Gains. Note: Panels (a)-(c) depict the estimated decomposition of identification gains over the Manski bound width (C_j(x)/ω_M(x), j = 1, 2, 3, 4) against the conditional probability of being treated, X_P = P̂r(D = 1 | X), for IV Samesex, Twins and Both, respectively.

Conclusion

In this paper we explore the factors that determine the identification gains for the ATE in models with binary endogenous variables. We use the reduction in the size of the ATE identification set as a measure of identification power, and conduct our analysis with the identification gains achieved by the SV bounds (Shaikh and Vytlacil, 2011) against the benchmark Manski bounds (Manski, 1990). We decompose the identification gains into the impacts of the IV validity, the IV strength and the variability of the exogenous covariates. More importantly, we construct the index "IIP" as a measure of the IV identification power.

We have developed theoretical results to show the complex mechanism through which IVs affect the identification of the ATE. We find that the IV identification power in a nonparametric and partially identified model is fundamentally different from the traditional understanding of IV strength in a parametric linear model, as measured, for instance, by the pseudo-R² or F-statistic from the reduced-form treatment equation. We have shown that in partially identified non-linear models it is not only the traditional IV strength that determines the identification gains obtained when bounding the ATE, but also the interplay of the IVs with the degree of endogeneity and the variability of the exogenous covariates. The conventional notion of IV strength or weakness no longer provides a full picture of the IV identification power.
The IIP provides a more appropriate measure of IV identification power, namely, the contribution made by the IVs in shrinking the ATE identified set. Importantly, we illustrate how the range of the conditional propensity score and the IIP relate to the ATE bounds for different levels of endogeneity, finite sample sizes and covariate variabilities. The results show that the IIP works well in finite sample settings as a tool for measuring the IV identification power and for providing guidance on detecting irrelevant IVs. We find that missing IVs, or misspecification of relevant IVs, can result in wider ATE identified sets and a loss of identification power. We also find that the loss of efficiency in finite samples from adding an irrelevant IV can be more reliably detected by the estimated IIP(x), even though an irrelevant IV can sometimes result in a narrower SV bound width. The empirical application also demonstrates the practical usefulness of our novel decomposition of the identification gains and of the IIP index.

The study of the IIP in this paper sheds new light on IV relevance in partial identification frameworks, and offers a potential criterion for IV selection in high-dimensional settings. It also raises new questions as to what constitutes an adequate definition of weak IVs in conjunction with ATE bounding analyses. Explorations of these issues are left for future research.
References
Ahn, H. and J. L. Powell (1993): "Semiparametric estimation of censored selection models with a nonparametric selection mechanism," Journal of Econometrics, 58, 3–29.
Angrist, J. and W. Evans (1998): "Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size," American Economic Review, 88, 450–477.
Angrist, J. D. and W. N. Evans (2009): "Replication data for: Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size," https://doi.org/10.7910/DVN/4W9GW2, Harvard Dataverse, V1, UNF:3:gmuGDmy3Gcf/k1/lAJqw/A==.
Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2012): "Treatment effect bounds: An application to Swan–Ganz catheterization," Journal of Econometrics, 168, 223–243.
Breusch, T., H. Qian, P. Schmidt, and D. Wyhowski (1999): "Redundancy of moment conditions," Journal of Econometrics, 91, 89–111.
Chernozhukov, V., H. Hong, and E. Tamer (2007): "Estimation and confidence regions for parameter sets in econometric models," Econometrica, 75, 1243–1284.
Chernozhukov, V., S. Lee, and A. M. Rosen (2013): "Intersection bounds: Estimation and inference," Econometrica, 81, 667–737.
Chesher, A. (2005): "Nonparametric identification under discrete variation," Econometrica, 73, 1525–1550.
——— (2010): "Instrumental variable models for discrete outcomes," Econometrica, 78, 575–601.
Chiburis, R. C. (2010): "Semiparametric bounds on treatment effects," Journal of Econometrics, 159, 267–275.
Flores, C. A. and X. Chen (2018): Average Treatment Effect Bounds with an Instrumental Variable: Theory and Practice, Springer.
Frazier, D. T., E. Renault, L. Zhang, and X. Zhao (2019): "Weak instruments test in discrete choice models," paper presented at the 2019 North American Summer Meeting of the Econometric Society, Seattle, Washington, June 28, 2019.
Freedman, D. A. and J. S. Sekhon (2010): "Endogeneity in probit response models," Political Analysis, 18, 138–150.
Hall, A. R. (2005): Generalized Method of Moments, Oxford University Press.
Hall, A. R., A. Inoue, K. Jana, and C. Shin (2007): "Information in generalized method of moments estimation and entropy-based moment selection," Journal of Econometrics, 138, 488–512.
Hall, A. R. and F. P. Peixe (2003): "A consistent method for the selection of relevant instruments," Econometric Reviews, 22, 269–287.
Han, S. and S. Lee (2019): "Estimation in a generalization of bivariate probit models with dummy endogenous regressors," Journal of Applied Econometrics, 34, 994–1015.
Han, S. and E. J. Vytlacil (2017): "Identification in a generalization of bivariate probit models with dummy endogenous regressors," Journal of Econometrics, 199, 63–73.
Hansen, L. P., J. Heaton, and E. G. Luttmer (1995): "Econometric evaluation of asset pricing models," The Review of Financial Studies, 8, 237–274.
Heckman, J. (1990): "Varieties of selection bias," The American Economic Review, 80, 313–318.
Heckman, J. J. (1978): "Dummy endogenous variables in a simultaneous equation system," Econometrica, 46, 931–959.
Heckman, J. J. and R. Robb (1985): "Alternative methods for evaluating the impact of interventions: An overview," Journal of Econometrics, 30, 239–267.
——— (1986): "Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes," in Drawing Inferences from Self-Selected Samples, Springer, 63–107.
Heckman, J. J., S. Urzua, and E. Vytlacil (2006): "Understanding instrumental variables in models with essential heterogeneity," The Review of Economics and Statistics, 88, 389–432.
Heckman, J. J. and E. Vytlacil (2005): "Structural equations, treatment effects, and econometric policy evaluation," Econometrica, 73, 669–738.
Heckman, J. J. and E. J. Vytlacil (1999): "Local instrumental variables and latent variable models for identifying and bounding treatment effects," Proceedings of the National Academy of Sciences, 96, 4730–4734.
——— (2001): "Instrumental variables, selection models, and tight bounds on the average treatment effect," in Econometric Evaluation of Labour Market Policies, Springer, 1–15.
Imbens, G. W. and J. D. Angrist (1994): "Identification and estimation of local average treatment effects," Econometrica, 62, 467–475.
Imbens, G. W. and C. F. Manski (2004): "Confidence intervals for partially identified parameters," Econometrica, 72, 1845–1857.
Joe, H. (1997): Multivariate Models and Multivariate Dependence Concepts, Chapman and Hall/CRC.
Kitagawa, T. (2009): "Identification region of the potential outcome distributions under instrument independence," Tech. rep., CEMMAP Working Paper.
——— (2015): "A test for instrument validity," Econometrica, 83, 2043–2063.
Li, C., D. S. Poskitt, and X. Zhao (2018): "Bounds for average treatment effect: A comparison of nonparametric and quasi maximum likelihood estimators," Tech. rep., Working Paper, Monash University.
——— (2019): "The bivariate probit model, maximum likelihood estimation, pseudo true parameters and partial identification," Journal of Econometrics, 209, 94–113.
Machado, C., A. Shaikh, and E. Vytlacil (2013): "Instrumental variables and the sign of the average treatment effect," unpublished manuscript, Getúlio Vargas Foundation, University of Chicago, and New York University.
Maddala, G. S. (1986): Limited-Dependent and Qualitative Variables in Econometrics, 3, Cambridge University Press.
Manski, C. F. (1990): "Nonparametric bounds on treatment effects," The American Economic Review, 80, 319–323.
Manski, C. F. and E. Tamer (2002): "Inference on regressions with interval data on a regressor or outcome," Econometrica, 70, 519–546.
Marra, G. and R. Radice (2011): "Estimation of a semiparametric recursive bivariate probit model in the presence of endogeneity," Canadian Journal of Statistics, 39, 259–279.
Mourifié, I. and R. Méango (2014): "A note on the identification in two equations probit model with dummy endogenous regressor," Economics Letters, 125, 360–363.
Rosenbaum, P. R. and D. B. Rubin (1983): "The central role of the propensity score in observational studies for causal effects," Biometrika, 70, 41–55.
Shaikh, A. and E. Vytlacil (2005): "Threshold crossing models and bounds on treatment effects: A nonparametric analysis," Tech. rep., National Bureau of Economic Research.
Shaikh, A. M. and E. J. Vytlacil (2011): "Partial identification in triangular systems of equations with binary dependent variables," Econometrica, 79, 949–955.
Swanson, S. A., M. A. Hernán, M. Miller, J. M. Robins, and T. S. Richardson (2018): "Partial identification of the average treatment effect using instrumental variables: Review of methods for binary instruments, treatments, and outcomes," Journal of the American Statistical Association, 113, 933–947.
Tamer, E. (2010): "Partial identification in econometrics," Annual Review of Economics, 2, 167–195.
Veall, M. R. and K. F. Zimmermann (1992): "Pseudo-R²'s in the ordinal probit model," Journal of Mathematical Sociology, 16, 333–342.
——— (1996): "Pseudo-R² measures for some common limited dependent variable models," Journal of Economic Surveys, 10, 241–259.
Vuong, Q. and H. Xu (2017): "Counterfactual mapping and individual treatment effects in nonseparable models with binary endogeneity," Quantitative Economics, 8, 589–610.
Vytlacil, E. and N. Yildiz (2007): "Dummy endogenous variables in weakly separable models," Econometrica, 75, 757–779.
Wilde, J. (2000): "Identification of multiple equation probit models with endogenous dummy regressors," Economics Letters, 69, 309–312.
Windmeijer, F. A. (1995): "Goodness-of-fit measures in binary choice models," Econometric Reviews, 14, 101–116.
A Appendix
Throughout the proofs, let P = Pr[D = 1 | X, Z] with support Ω_P, and let p(x, z) = Pr[D = 1 | x, z].

A.1 Lemmas
Lemma A.1
Under Assumption 2.1 (a) and (b), for any p, p′ ∈ Ω_{P|x} such that p > p′, we have

Pr[D = 0 | x, p] + Pr[Y = y, D = 1 | x, p] − {Pr[D = 0 | x, p′] + Pr[Y = y, D = 1 | x, p′]} ≤ 0,
Pr[D = 1 | x, p] + Pr[Y = y, D = 0 | x, p] − {Pr[D = 1 | x, p′] + Pr[Y = y, D = 0 | x, p′]} ≥ 0,

for y ∈ {0, 1}. In addition,

Pr[Y = y, D = 1 | x, p] − Pr[Y = y, D = 1 | x, p′] ≥ 0,
Pr[Y = y, D = 0 | x, p] − Pr[Y = y, D = 0 | x, p′] ≤ 0,

for y ∈ {0, 1}. Lastly, if ν(1, x) > ν(0, x) given x ∈ Ω_X, then Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′] ≥ 0. If ν(1, x) ≤ ν(0, x) given x ∈ Ω_X, then Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′] ≤ 0. Strict inequalities hold if Assumption 2.1 (c) is imposed on the DGP.
Proof of Lemma A.1.
Under Assumption 2.1 (a) and (b), for p, p′ ∈ Ω_{P|x} with p > p′, we have

Pr[D = 0 | x, p] + Pr[Y = 1, D = 1 | x, p] − {Pr[D = 0 | x, p′] + Pr[Y = 1, D = 1 | x, p′]}
= Pr[ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] − Pr[p′ ≤ F_{ε_2}(ε_2) < p]
= −Pr[ε_1 ≥ ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≤ 0.

Similar manipulations show that

Pr[D = 0 | x, p] + Pr[Y = 0, D = 1 | x, p] − {Pr[D = 0 | x, p′] + Pr[Y = 0, D = 1 | x, p′]} ≤ 0,
Pr[D = 1 | x, p] + Pr[Y = 1, D = 0 | x, p] − {Pr[D = 1 | x, p′] + Pr[Y = 1, D = 0 | x, p′]} ≥ 0, and
Pr[D = 1 | x, p] + Pr[Y = 0, D = 0 | x, p] − {Pr[D = 1 | x, p′] + Pr[Y = 0, D = 0 | x, p′]} ≥ 0.

In addition, using relatively straightforward if somewhat tedious algebra, we can obtain the following inequalities:

Pr[Y = 0, D = 1 | x, p] − Pr[Y = 0, D = 1 | x, p′] = Pr[ε_1 ≥ ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≥ 0,
Pr[Y = 1, D = 1 | x, p] − Pr[Y = 1, D = 1 | x, p′] = Pr[ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≥ 0,
Pr[Y = 0, D = 0 | x, p] − Pr[Y = 0, D = 0 | x, p′] = −Pr[ε_1 ≥ ν(0, x), p′ ≤ F_{ε_2}(ε_2) < p] ≤ 0, and
Pr[Y = 1, D = 0 | x, p] − Pr[Y = 1, D = 0 | x, p′] = −Pr[ε_1 < ν(0, x), p′ ≤ F_{ε_2}(ε_2) < p] ≤ 0.

Now suppose that ν(1, x) > ν(0, x) given x ∈ Ω_X. Then it follows that

Pr[Y = 1 | x, p] − Pr[Y = 1 | x, p′]
= Pr[Y = 1, D = 1 | x, p] + Pr[Y = 1, D = 0 | x, p] − Pr[Y = 1, D = 1 | x, p′] − Pr[Y = 1, D = 0 | x, p′]
= Pr[ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] − Pr[ε_1 < ν(0, x), p′ ≤ F_{ε_2}(ε_2) < p]
= Pr[ν(0, x) ≤ ε_1 < ν(1, x), p′ ≤ F_{ε_2}(ε_2) < p] ≥ 0.

Finally, using a parallel argument in the case where ν(1, x) ≤ ν(0, x) given x ∈ Ω_X, we can conclude that the inequalities stated in the lemma hold.

Lemma A.2
Under Assumptions 2.1 and 3.1, the following results hold. The joint probabilities Pr[Y = y, D = d | x, p] for y, d ∈ {0, 1} are functions of the dependence parameter ρ. In addition,
(a) Pr[Y = 1, D = 1 | x, p] and Pr[Y = 0, D = 0 | x, p] are weakly increasing in ρ;
(b) Pr[Y = 1, D = 0 | x, p] and Pr[Y = 0, D = 1 | x, p] are weakly decreasing in ρ.

Proof of Lemma A.2.
For any given p ∈ Ω_P,

Pr[Y = 1, D = 1 | x, p] = Pr[ε_1 < ν(1, x), F_{ε_2}(ε_2) < p | x, p]
= Pr[ε_1 < ν(1, x), F_{ε_2}(ε_2) < p]
= C(F_{ε_1}(ν(1, x)), p; ρ). (10)

Because the copula C(·, ·; ρ) satisfies the concordant ordering with respect to ρ, we know that Pr[Y = 1, D = 1 | x, p] is weakly increasing in ρ. Since

Pr[Y = 0, D = 1 | x, p] = Pr[D = 1 | x, p] − Pr[Y = 1, D = 1 | x, p] = p − C(F_{ε_1}(ν(1, x)), p; ρ),

Pr[Y = 0, D = 1 | x, p] is decreasing in ρ. In addition,

Pr[Y = 0, D = 0 | x, p] = Pr[ε_1 ≥ ν(0, x), F_{ε_2}(ε_2) ≥ p | x, p]
= Pr[ε_1 ≥ ν(0, x), F_{ε_2}(ε_2) ≥ p]
= Pr[ε_1 ≥ ν(0, x)] − Pr[ε_1 ≥ ν(0, x), F_{ε_2}(ε_2) < p]
= Pr[ε_1 ≥ ν(0, x)] − Pr[F_{ε_2}(ε_2) < p] + Pr[ε_1 < ν(0, x), F_{ε_2}(ε_2) < p]
= 1 − F_{ε_1}(ν(0, x)) − p + C(F_{ε_1}(ν(0, x)), p; ρ). (11)

From (11) we can see that Pr[Y = 0, D = 0 | x, p] is weakly increasing in ρ, which immediately implies that Pr[Y = 1, D = 0 | x, p] is weakly decreasing in ρ.

A.2 Proofs
Proof of Proposition 3.1.
To begin, let us first introduce the following notation:

L⁰(x, p) = Pr[Y = 1, D = 0 | x, p] + sup_{x′ ∈ X⁰⁻(x)} Pr[Y = 1, D = 1 | x′, p],
L¹(x, p) = Pr[Y = 1, D = 1 | x, p] + sup_{x′ ∈ X¹(x)} Pr[Y = 1, D = 0 | x′, p],
U⁰(x, p) = Pr[Y = 1, D = 0 | x, p] + p · inf_{x′ ∈ X⁰(x)} Pr[Y = 1 | x′, p, D = 1],
U¹(x, p) = Pr[Y = 1, D = 1 | x, p] + (1 − p) · inf_{x′ ∈ X¹⁻(x)} Pr[Y = 1 | x′, p, D = 0].

Then the SV bounds become

L_SV(x) = L¹(x, p̄(x)) − U⁰(x, p̲(x)) and U_SV(x) = U¹(x, p̄(x)) − L⁰(x, p̲(x)), (12)

and under Assumption 2.1 the SV bounds are sharp if Ω_{X,P} = Ω_X × Ω_P (Shaikh and Vytlacil, 2011, Theorem 2.1).

Next we show that L⁰(x, p) is weakly decreasing in p (ceteris paribus). Under Assumption 2.1 and Ω_{X,P} = Ω_X × Ω_P, for every x ∈ Ω_X there exists x_l ∈ X⁰⁻(x) such that ν(1, x_l) = sup_{x′ ∈ X⁰⁻(x)} ν(1, x′) and

L⁰(x, p) = Pr[Y = 1, D = 0 | x, p] + Pr[Y = 1, D = 1 | x_l, p]

(for detailed particulars see the proof of Shaikh and Vytlacil, 2011, Theorem 2.1 (ii); the proof is contained in the supplementary material of Shaikh and Vytlacil (2011)). For p, p′ ∈ Ω_P with p′ < p, we now have

L⁰(x, p) − L⁰(x, p′)
= Pr[Y = 1, D = 0 | x, p] + Pr[Y = 1, D = 1 | x_l, p] − Pr[Y = 1, D = 0 | x, p′] − Pr[Y = 1, D = 1 | x_l, p′]
= Pr[ε_1 ≤ ν(1, x_l), p′ < F_{ε_2}(ε_2) ≤ p] − Pr[ε_1 ≤ ν(0, x), p′ < F_{ε_2}(ε_2) ≤ p]
= −Pr[ν(1, x_l) < ε_1 ≤ ν(0, x), p′ < F_{ε_2}(ε_2) ≤ p] ≤ 0, (13)

since x_l ∈ X⁰⁻(x), and Lemma 2 in Shaikh and Vytlacil (2011) shows that x_l ∈ X⁰⁻(x) implies ν(1, x_l) ≤ ν(0, x). Thus, from (13), L⁰(x, p) is weakly decreasing in p. Similar arguments show that L¹(x, p) is weakly increasing in p, U⁰(x, p) is weakly increasing in p, and U¹(x, p) is weakly decreasing in p. Hence L_SV(x) is weakly increasing in p̄(x) and U_SV(x) is weakly decreasing in p̄(x). On the other hand, L_SV(x) is weakly decreasing in p̲(x) and U_SV(x) is weakly increasing in p̲(x). This completes the proof of the proposition.

Proof of Proposition 3.2.
Suppose that ATE(x) > 0 for x ∈ Ω_X. Under Assumption 2.1, from the definitions of X⁰(x), X⁰⁻(x), X¹(x) and X¹⁻(x), we know that X⁰(x) and X¹(x) are nonempty for every x ∈ Ω_X, since x itself belongs to these two sets, while X⁰⁻(x) and X¹⁻(x) may be empty for some x ∈ Ω_X. Recall that the supremum and infimum over an empty set are defined as zero and one, respectively. Thus, for the four functions defined in the proof of Proposition 3.1 we have

L⁰(x, p) ≥ Pr[Y = 1, D = 0 | x, p],
L¹(x, p) ≥ Pr[Y = 1 | x, p],
U⁰(x, p) ≤ Pr[Y = 1 | x, p], and
U¹(x, p) ≤ Pr[Y = 1, D = 1 | x, p] + Pr[D = 0 | x, p].

The ATE SV bounds are therefore bounded by [L_SV(x), U_SV(x)] ⊂ [L̲_SV(x), Ū_SV(x)], where

L̲_SV(x) = sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p], and
Ū_SV(x) = inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 1 | x, p] + Pr[D = 0 | x, p]} − sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 0 | x, p], (14)

and the widest possible width ω̄(x) := Ū_SV(x) − L̲_SV(x) is

ω̄(x) = inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 1 | x, p] + Pr[D = 0 | x, p]} − sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 0 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] + inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p].

From Lemma A.1 it follows that

ω̄(x) = Pr[Y = 1, D = 1 | x, p̄(x)] + Pr[D = 0 | x, p̄(x)] − Pr[Y = 1, D = 0 | x, p̲(x)] − Pr[Y = 1 | x, p̄(x)] + Pr[Y = 1 | x, p̲(x)]
= Pr[Y = 1, D = 1 | x, p̲(x)] + Pr[Y = 0, D = 0 | x, p̄(x)]. (15)

Now consider the case where ATE(x) < 0. In contrast to the positive ATE(x) case, X⁰⁻(x) and X¹⁻(x) are nonempty for every x ∈ Ω_X, since x itself belongs to these two sets, while X⁰(x) and X¹(x) may be empty for some x ∈ Ω_X. Thus, the following inequalities hold:

L⁰(x, p) ≥ Pr[Y = 1 | x, p],
L¹(x, p) ≥ Pr[Y = 1, D = 1 | x, p],
U⁰(x, p) ≤ Pr[Y = 1, D = 0 | x, p] + Pr[D = 1 | x, p], and
U¹(x, p) ≤ Pr[Y = 1 | x, p],

so that [L_SV(x), U_SV(x)] ⊂ [L̲_SV(x), Ū_SV(x)], where

Ū_SV(x) = inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p], and
L̲_SV(x) = sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 1 | x, p] − inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 0 | x, p] + Pr[D = 1 | x, p]}. (16)

The widest possible width of the SV bounds is now therefore

ω̄(x) = inf_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1 | x, p] − sup_{p ∈ Ω_{P|x}} Pr[Y = 1, D = 1 | x, p] + inf_{p ∈ Ω_{P|x}} {Pr[Y = 1, D = 0 | x, p] + Pr[D = 1 | x, p]},

and from Lemma A.1 we have that

ω̄(x) = Pr[Y = 1 | x, p̄(x)] − Pr[Y = 1 | x, p̲(x)] − Pr[Y = 1, D = 1 | x, p̄(x)] + Pr[Y = 1, D = 0 | x, p̲(x)] + Pr[D = 1 | x, p̲(x)]
= Pr[Y = 1, D = 0 | x, p̄(x)] + Pr[Y = 0, D = 1 | x, p̲(x)]. (17)

The nature of the relationship between ω̄(x) and p̄(x) and p̲(x) follows directly from the expressions in (15) and (17) upon application of Lemma A.1.

Proof of Proposition 3.3.
The proof follows directly from the expression for ω̄(x) in Proposition 3.2 and Lemma A.2.

Proof of Proposition 3.4.
Without loss of generality, assume that the distribution of ε has been“normalized” to be uniform over [0 , ν ( D, X ) | D indicates that there exists a function m : { , } (cid:55)→ R such that ν ( d, x ) = m ( d ) for all ( d, x ) ∈ { , }× Ω X . Take ATE( x ) to be positive. When H ( x, x (cid:48) ) is well defined and ν ( D, X ) = m ( D ), X ( x ) = X ( x ) = Ω X , and X − ( x ) = X − ( x ) = ∅ .Since ε is continuously distributed we can conclude that ∀ ( x, z ) , ( z (cid:48) , x (cid:48) ) ∈ Ω X,Z such that Pr[ D =1 | z (cid:48) , x (cid:48) ] = Pr[ D = 1 | x, z ] we must have ν ( x, z ) = ν ( z (cid:48) , x (cid:48) ).For L SV ( x ), consider sup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ]. If X ( x ) is empty, or if there does not exista z (cid:48) such that Pr[ D = 1 | x (cid:48) , z (cid:48) ] = p , then sup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ] is set to zero. Since X ( x )equals Ω X because ν ( D, X ) = m ( D ), we have Pr[ D = 1 | x (cid:48) , z (cid:48) ] = p for at least ( z (cid:48) , x (cid:48) ) = ( x, z ), and thussup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ] is well-defined. It follows thatsup x (cid:48) ∈ X ( x ) Pr[ Y = 1 , D = 0 | x (cid:48) , p ] = sup x (cid:48) ∈ X ( x ) Pr[ ν (0 , x (cid:48) ) > ε , ν ( x (cid:48) , z (cid:48) ) ≤ ε | x (cid:48) , p ]= sup x (cid:48) ∈ X ( x ) Pr[ m (0) > ε , ν ( x, z ) ≤ ε | x (cid:48) , p ]= sup x (cid:48) ∈ X ( x ) Pr[ m (0) > ε , ν ( x, z ) ≤ ε | x, p ]=Pr[ Y = 1 , D = 0 | x, p ] , (18)where the second equality arises because the CDF of ε is the strictly positive and ν (0 , x (cid:48) ) = m (0) isdegenerate. The third equality is due to the assumed independence of ( X, Z ). 
Similarly,
\[
\begin{aligned}
p \cdot \inf_{x' \in X_1(x)} \Pr[Y=1 \mid x', p, D=1] &= \inf_{x' \in X_1(x)} \Pr[Y=1, D=1 \mid x', p] \\
&= \inf_{x' \in X_1(x)} \Pr[\nu_1(1,x') > \varepsilon_1, \nu_2(x',z') > \varepsilon_2 \mid x', p] \\
&= \inf_{x' \in X_1(x)} \Pr[m(1) > \varepsilon_1, \nu_2(x,z) > \varepsilon_2 \mid x, p] \\
&= \Pr[Y=1, D=1 \mid x, p]. \quad (19)
\end{aligned}
\]
By virtue of equations (18) and (19), and Lemma A.1, $L_{SV}(x)$ can be rewritten as
\[
\begin{aligned}
L_{SV}(x) &= \sup_{p \in \Omega_{P|x}} \big\{ \Pr[Y=1, D=1 \mid x,p] + \Pr[Y=1, D=0 \mid x,p] \big\} - \inf_{p \in \Omega_{P|x}} \big\{ \Pr[Y=1, D=0 \mid x,p] + \Pr[Y=1, D=1 \mid x,p] \big\} \\
&= \sup_{p \in \Omega_{P|x}} \Pr[Y=1 \mid x,p] - \inf_{p \in \Omega_{P|x}} \Pr[Y=1 \mid x,p] \\
&= \Pr[Y=1 \mid x, \bar p(x)] - \Pr[Y=1 \mid x, \underline p(x)]. \quad (20)
\end{aligned}
\]
For $U_{SV}(x)$, because $X_0^-(x)$ and $X_1^-(x)$ are empty, from Lemma A.1 we get
\[
\begin{aligned}
U_{SV}(x) &= \inf_{p \in \Omega_{P|x}} \big\{ \Pr[Y=1, D=1 \mid x,p] + (1-p) \big\} - \sup_{p \in \Omega_{P|x}} \Pr[Y=1, D=0 \mid x,p] \\
&= \Pr[Y=1, D=1 \mid x, \bar p(x)] + (1 - \bar p(x)) - \Pr[Y=1, D=0 \mid x, \underline p(x)]. \quad (21)
\end{aligned}
\]
The expressions in (20) and (21) now yield the result that
\[
\begin{aligned}
\omega_{SV}(x) &= \Pr[Y=1, D=1 \mid x, \bar p(x)] + (1 - \bar p(x)) - \Pr[Y=1, D=0 \mid x, \underline p(x)] - \Pr[Y=1 \mid x, \bar p(x)] + \Pr[Y=1 \mid x, \underline p(x)] \\
&= \Pr[Y=0, D=0 \mid x, \bar p(x)] + \Pr[Y=1, D=1 \mid x, \underline p(x)],
\end{aligned}
\]
which is equal to $\bar\omega(x)$. The proof for the negative ATE$(x)$ case is completely analogous, and the details are omitted.

Proof of Proposition 5.1. (i) We first show that
IIP$(x)$ is well-defined, in the sense that we are able to identify whether $Z$ is relevant or not. If, for a given $x \in \Omega_X$, there exist a $z$ and a $z'$ in $\Omega_{Z|x}$ such that $z \ne z'$ and $\Pr[D=1 \mid x,z] \ne \Pr[D=1 \mid x,z']$, then the IV $Z$ is relevant. If $Z$ is relevant, then IIP$(x) = 1 - \bar\omega(x)$, where $\bar\omega(x)$ is the widest possible width defined in Proposition 3.2. Otherwise, $Z$ is irrelevant; by Proposition 3.4 the SV bounds then reduce to the benchmark Manski bounds, and we have IIP$(x) = 0$.

Next, we prove that IIP$(x) \in [0,1]$. Since $\bar\omega(x)$ is a summation of conditional probabilities for all $x \in \Omega_X$, it follows that $\bar\omega(x) \ge 0$, and hence IIP$(x) \le$
1. Whenever $Z$ is relevant the sign of ATE$(x)$ is identified, and from Lemma A.1 it follows that if ATE$(x) > 0$,
\[
\bar\omega(x) = \Pr[Y=1, D=1 \mid x, \underline p(x)] + \Pr[Y=0, D=0 \mid x, \bar p(x)] \le \Pr[Y=1, D=1 \mid x] + \Pr[Y=0, D=0 \mid x], \quad (22)
\]
which is no greater than one, and if ATE$(x) < 0$,
\[
\bar\omega(x) = \Pr[Y=1, D=0 \mid x, \bar p(x)] + \Pr[Y=0, D=1 \mid x, \underline p(x)] \le \Pr[Y=1, D=0 \mid x] + \Pr[Y=0, D=1 \mid x], \quad (23)
\]
which is also no greater than one. Thus IIP$(x) = 1 - \bar\omega(x) \ge 0$ for all $x \in \Omega_X$, and IIP$(x) \in [0,1]$.

(ii) If $Z$ is irrelevant, by definition we have IIP$(x) = 0$, and the SV bounds reduce to the benchmark Manski bounds by Proposition 3.4. To establish necessity, we show that the presumption that the events "$Z$ is relevant" and "IIP$(x) = 0$" occur simultaneously leads to a contradiction. If $Z$ is relevant, then the index IIP$(x) = 1 - \bar\omega(x)$. The goal, therefore, is to show that a relevant $Z$ leads to $\bar\omega(x)$ strictly less than one, by verifying that the inequalities (22) and (23) are strict. Take (22) as an example; the result for (23) can be verified analogously. Since
\[
\begin{aligned}
\Pr[Y=1, D=1 \mid x] - \Pr[Y=1, D=1 \mid x, \underline p(x)]
&= \int_{p \in \Omega_{P|x}} \big[ \Pr[Y=1, D=1 \mid x,p] - \Pr[Y=1, D=1 \mid x, \underline p(x)] \big] \, d\Pr[P=p \mid X=x] \\
&= \int_{p \in \Omega_{P|x}} \Pr\big[ \varepsilon_1 < \nu_1(1,x),\ \underline p(x) \le \varepsilon_2 < p \big] \, d\Pr[P=p \mid X=x], \quad (24)
\end{aligned}
\]
the relevance of $Z$ guarantees that there exists a $p \in \Omega_{P|x}$ such that $p \ne \underline p(x)$ and $\Pr[P=p \mid X=x] > 0$, which, together with the fact that $(\varepsilon_1, \varepsilon_2)$ has full support, implies that (24) is strictly positive. Similar arguments can be applied to show that $\Pr[Y=0, D=0 \mid x] - \Pr[Y=0, D=0 \mid x, \bar p(x)] > 0$. Hence $\bar\omega(x) < \Pr[Y=1, D=1 \mid x] + \Pr[Y=0, D=0 \mid x] \le 1$, leading to IIP$(x) > 0$, a contradiction.

(iii) If $Z$ is a perfect predictor of the treatment $D$, in the sense that there exist a $z^*$ and a $z^{**}$ in $\Omega_{Z|x}$ such that $\Pr(D=0 \mid x, z^*) = 1$ and $\Pr(D=1 \mid x, z^{**}) = 1$, this obviously implies that $Z$ is relevant and IIP$(x) = 1 - \bar\omega(x)$.
Furthermore, $\underline p(x) = p(x, z^*) = 0$ and $\bar p(x) = p(x, z^{**}) = 1$. Hence, it can be easily shown from the expressions for $\bar\omega(x)$ that perfect prediction by $Z$ leads to the equality $\bar\omega(x) = 0$ for both ATE$(x) > 0$ and ATE$(x) < 0$. Thus IIP$(x) = 1 - \bar\omega(x) = 1$. Moreover, since $\bar\omega(x)$ is the widest possible width for the SV bounds, we have $0 \le \omega_{SV}(x) \le \bar\omega(x)$, and when $\bar\omega(x) = 0$ it follows that $\omega_{SV}(x) = 0$. The ATE$(x)$ is therefore point identified if IIP$(x) = 1$.
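To make the construction in Proposition 5.1(i) concrete, the sketch below computes a plug-in analogue of IIP$(x)$ from simulated data for a design in which ATE$(x)$ is known to be positive and the instrument is binary. The data generating process, the sample estimator, and the relevance cutoff `tol` are all our own illustrative assumptions (exogenous covariates are suppressed for brevity).

```python
# Plug-in sketch of IIP(x) = 1 - omega_bar(x) for a known-positive ATE(x), where
# omega_bar(x) = Pr[Y=1,D=1|x,p_low] + Pr[Y=0,D=0|x,p_high] as in (22).
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, b=1.2, rho=0.6):
    """Bivariate-probit-style DGP with a binary IV Z; b = 0 makes Z irrelevant."""
    z = rng.integers(0, 2, n)
    e = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], n)
    d = (-0.3 + b * z > e[:, 1]).astype(int)
    m = {0: -0.2, 1: 0.7}                        # m(1) > m(0): positive ATE(x)
    y = (np.where(d == 1, m[1], m[0]) > e[:, 0]).astype(int)
    return y, d, z

def iip(y, d, z, tol=0.01):
    """Sample analogue of IIP(x) when ATE(x) is known to be positive."""
    p = {v: d[z == v].mean() for v in (0, 1)}    # propensity score at each z
    if abs(p[0] - p[1]) < tol:                   # Z does not shift Pr[D=1|x,z]
        return 0.0                               # irrelevant IV: IIP(x) = 0
    z_lo, z_hi = (0, 1) if p[0] < p[1] else (1, 0)
    cell = lambda v, yy, dd: ((y[z == v] == yy) & (d[z == v] == dd)).mean()
    omega = cell(z_lo, 1, 1) + cell(z_hi, 0, 0)  # widest width omega_bar(x)
    return 1.0 - omega

y1, d1, z1 = simulate(200_000)
iip_rel = iip(y1, d1, z1)                        # relevant IV: strictly positive
y0, d0, z0 = simulate(200_000, b=0.0)
iip_irr = iip(y0, d0, z0)                        # irrelevant IV: exactly zero
print(iip_rel, iip_irr)
```

Running the estimator on the `b = 0` design returns exactly zero, in line with part (ii) of the proposition, while the relevant-instrument design delivers a strictly positive index, with larger values as the extreme propensity scores approach 0 and 1.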