[PDF] Instrument Validity for Heterogeneous Causal Effects

Abstract

This paper provides a general framework for testing instrument validity in heterogeneous causal effect models. We first generalize the testable implications of the instrument validity assumption provided by Balke and Pearl (1997), Imbens and Rubin (1997), and Heckman and Vytlacil (2005). The generalization involves the cases where the treatment can be multivalued (and ordered) or unordered, and there can be conditioning covariates. Based on these testable implications, we propose a nonparametric test which is proved to be asymptotically size controlled and consistent. Because of the nonstandard nature of the problem in question, the test statistic is constructed based on a nonsmooth map, which causes technical complications. We provide an extended continuous mapping theorem and an extended delta method, which may be of independent interest, to establish the asymptotic distribution of the test statistic under null. We then extend the bootstrap method proposed by Fang and Santos (2018) to approximate this asymptotic distribution and construct a critical value for the test. Compared to the test proposed by Kitagawa (2015), our test can be applied in more general settings and may achieve power improvement. Evidence that the test performs well on finite samples is provided via simulations. We revisit the empirical study of Card (1993) and use their data to demonstrate application of the proposed test in practice. We show that a valid instrument for a multivalued treatment may not remain valid if the treatment is coarsened.

Full PDF

aa r X i v : . [ ec on . E M ] S e p Instrument Validity for Heterogeneous Causal Effects

Zhenting Sun ∗ National School of Development, Peking [email protected] 7, 2020

Abstract

This paper provides a general framework for testing instrument validity in heterogeneouscausal effect models. We ﬁrst generalize the testable implications of the instrument validity as-sumption provided by Balke and Pearl (1997), Imbens and Rubin (1997), and Heckman andVytlacil (2005). The generalization involves the cases where the treatment can be multivalued(and ordered) or unordered, and there can be conditioning covariates. Based on these testableimplications, we propose a nonparametric test which is proved to be asymptotically size controlledand consistent. Because of the nonstandard nature of the problem in question, the test statisticis constructed based on a nonsmooth map, which causes technical complications. We provide anextended continuous mapping theorem and an extended delta method, which may be of inde-pendent interest, to establish the asymptotic distribution of the test statistic under null. We thenextend the bootstrap method proposed by Fang and Santos (2018) to approximate this asymptoticdistribution and construct a critical value for the test. Compared to the test proposed by Kitagawa(2015), our test can be applied in more general settings and may achieve power improvement.Evidence that the test performs well on ﬁnite samples is provided via simulations. We revisit theempirical study of Card (1993) and use their data to demonstrate application of the proposed testin practice. We show that a valid instrument for a multivalued treatment may not remain valid ifthe treatment is coarsened.

Keywords:

Instrument validity, heterogeneous causal effects, general nonparametric test, powerimprovement, extended continuous mapping theorem, extended delta method ∗ This article is a revised version of the ﬁrst chapter of the author’s doctoral thesis at UC San Diego. I am deeply gratefulto Brendan K. Beare, Zheng Fang, Andres Santos, Yixiao Sun, and Kaspar W¨uthrich for their constant support on this paper.I thank Shengtao Dai, Tongyu Li, and Xingyu Li for their excellent work as research assistants. I also thank Roy Allen, QihuiChen, Asad Dossani, Graham Elliott, Ivan Fernandez-Val, Wenzheng Gao, James D. Hamilton, Jungbin Hwang, Toru Kitagawa,Sungwon Lee, Juwon Seo, and all seminar participants for their insightful suggestions and comments. Introduction

The local average treatment effect (LATE) framework, introduced by the seminal works Imbens andAngrist (1994) and Angrist et al. (1996), is a commonly used approach in studies of instrumentalvariable (IV) models with treatment effect heterogeneity. The local quantile treatment effect (LQTE)is a concept similar to LATE. While LATE shows the treatment effect on the mean of the outcome,LQTE is more informative in regard to the effect on the outcome distribution. These causal effectmodels rely on several strong and sometimes controversial assumptions of IV validity: 1) The instru-ment should not affect the outcome directly; 2) it should be as good as being randomly assigned;and 3) it affects the treatment in monotone fashion. Violations of these conditions can generally leadto inconsistent treatment effect estimates. Relevant surveys and discussion of this can be found inAngrist and Pischke (2008), Angrist and Pischke (2014), Imbens (2014), Imbens and Rubin (2015),Koenker et al. (2017), Melly and W¨uthrich (2017), and Huber and W¨uthrich (2018). Since theplausibility of the analyses of such models depends on IV validity, economics research has developedmethods to examine these conditions based on testable implications.This paper provides a general framework for testing such IV validity assumptions. We ﬁrst gen-eralize the testable implications obtained by Balke and Pearl (1997), Imbens and Rubin (1997), andHeckman and Vytlacil (2005) for binary treatments. The generalization includes the cases where thetreatment can be multivalued (and ordered) or unordered, and conditioning covariates may exist. Then based on these testable implications, we propose a nonparametric test which can easily beapplied in practice.Kitagawa (2015) was the ﬁrst paper to propose a test of IV validity in heterogeneous causaleffect models based on the testable implications in the literature. The test, constructed using abootstrap method, is for binary treatments. It was shown to be asymptotically uniformly size con-trolled and consistent. Since the bootstrap critical value converges to a number larger than the − α quantile of the asymptotic distribution of the test statistic over a large region of the null, thetest could be conservative. Mouriﬁ´e and Wan (2017) reformulated as conditional inequalities thetestable implications used in Kitagawa (2015). Then they showed that these inequalities could betested in the intersection bounds framework of Chernozhukov et al. (2013) using the Stata packageprovided by Chernozhukov et al. (2015). The test is also for binary treatments and could be conser-vative as well. It restricts the support of the outcome variables to be compact, ruling out the casewhere outcomes can be unbounded. Huber and Mellace (2015) derived a testable implication fora weaker LATE identifying condition, that is, that the potential outcomes are mean independent ofinstruments, conditional on each selection type. However, the condition of potential outcomes beingmean independent of instruments is not sufﬁcient if we are concerned with distributional features See, for example, studies of LQTE in Abadie (2002), Ananat and Michaels (2008), Cawley and Meyerhoefer (2012),Fr¨olich and Melly (2013), and Eren and Ozbeklik (2014). Studies of LATE with binary treatments can be found in Angrist (1990), Angrist and Krueger (1991), and Vytlacil (2002).Those with multivalued treatments can be found in Angrist and Imbens (1995), Angrist and Krueger (1995), and Vytlacil(2006). Identiﬁcation of causal effects in unordered choice (treatment) models can be found in Heckman et al. (2006),Heckman and Vytlacil (2007), Heckman et al. (2008), and Heckman and Pinto (2018).

2f a complier’s potential outcomes, such as the quantile treatment effects for compliers; see Abadieet al. (2002) for details. The focus of the present paper is on full statistical independence of potentialoutcomes and instruments.The null hypothesis for the testable implications used in Kitagawa (2015) consists of a set ofinequalities. The reason why the test proposed by Kitagawa (2015) could be conservative is thatthey used an upper bound on the asymptotic distribution of the test statistic under null to constructthe bootstrap critical value. The upper bound is identical to the asymptotic distribution when all theinequalities in the null are binding. In the study described in the present paper, we solve a technicalissue and establish the pointwise asymptotic distribution of the test statistic under null. Then weconstruct the critical value based on this asymptotic distribution, rather than on an upper bound,and therefore improve the power of the test.A modiﬁed variance-weighted Kolmogorov–Smirnov (KS) test statistic is employed in our test.As mentioned by Kitagawa (2015), variance-weighted KS statistics have been widely applied inthe literature on conditional moment inequalities, such as in Andrews and Shi (2013), Armstrong(2014), Armstrong and Chan (2016), and Chetverikov (2018). More general KS statistics can befound in the stochastic dominance testing literature, such as in Abadie (2002), Barrett and Donald(2003), Horv´ath et al. (2006), Linton et al. (2010), Barrett et al. (2014), and Donald and Hsu(2016).There are two major complications in deriving and approximating the asymptotic distribution ofthe test statistic under null. First, the test statistic involves a nonsmooth (nondifferentiable) mapof unknown parameters (underlying probability distributions), and the delta method fails to work.We provide an extended continuous mapping theorem and an extended delta method, which mightbe of independent interest, to overcome this difﬁculty. By showing that the conditions of the ex-tended delta method are satisﬁed under several weak assumptions, we establish the null asymptoticdistribution of the test statistic. Second, since the null asymptotic distribution involves a nonlinearfunction, the standard bootstrap method may fail to approximate this distribution consistently. Dis-cussion of this issue can be found in D¨umbgen (1993), Andrews (2000), Hirano and Porter (2012),Hansen (2017), Fang and Santos (2018), and Hong and Li (2018). To achieve a consistent approx-imation, we extend the bootstrap approach proposed by Fang and Santos (2018) and provide avalid bootstrap critical value. The test is found to be asymptotically sized controlled and consistent.Evidence that the test performs well on ﬁnite samples is provided via simulations.We now introduce the following notation, which will be used throughout the paper. We let denote Hoffmann–Jørgensen weak convergence in a metric space. For a set D , denote the space ofbounded functions on D by ℓ ∞ ( D ) : ℓ ∞ ( D ) = { f : D → R : k f k ∞ < ∞} , where k f k ∞ = sup x ∈ D | f ( x ) | . Other applications of this bootstrap method can be found in Beare and Moon (2015), Beare and Fang (2017), Seo (2018),Beare and Shi (2019), and Sun and Beare (2019). A similar bootstrap approach can be found in Hong and Li (2018). D is a topological space, let C ( D ) denote the set of continuous functions on D : C ( D ) = { f : D → R : f is continuous } . To formally introduce the topic of interest, we ﬁrst consider the heterogeneous causal effect modelof Imbens and Angrist (1994). Let Y ∈ R be the observable outcome variable, and let D ∈ { , } be the observable treatment variable, where D = 1 indicates that an individual receives treatment.Let Z ∈ { , } be a binary instrumental variable. Let Y dz ∈ R be the potential outcome variable for D = d and Z = z , where d, z ∈ { , } . Similarly, let D z be the potential treatment variablefor Z = z . The instrument validity assumption for binary treatment and binary IV is formalized asfollows. Assumption 2.1

IV validity for binary D and binary Z :(i) Instrument Exclusion: With probability , Y d = Y d for each d ∈ { , } .(ii) Random Assignment: The variable Z is jointly independent of ( Y , Y , Y , Y , D , D ) .(iii) Instrument Monotonicity: The potential treatment response indicators satisfy D ≥ D withprobability . Assumption 2.1 is from Imbens and Rubin (1997), but it does not require strict instrument mono-tonicity. In this paper, we are not concerned with the strict monotonicity assumption, which is alsoknown as the instrument relevance assumption. Let (Ω , A , P ) be a probability space on which all random elements are well deﬁned. Let B R m denote the Borel σ -algebra on R m for all m ∈ N . For all Borel sets B and C , we follow Kitagawa(2015) and deﬁne probability measures as follows: P ( B, C ) = P ( Y ∈ B, D ∈ C | Z = 1) and P ( B, C ) = P ( Y ∈ B, D ∈ C | Z = 0) . Under Assumption 2.1(i), we can deﬁne a potential outcome variable Y d such that Y d = Y d = Y d almost surely. Imbens and Rubin (1997) showed that for every Borel set B , P ( B, { } ) − P ( B, { } ) = P ( Y ∈ B, D > D ) and P ( B, { } ) − P ( B, { } ) = P ( Y ∈ B, D > D ) . (1) See Rubin (1974) and Splawa-Neyman et al. (1990) for further discussion of the potential outcomes. As mentioned by Kitagawa (2015), the instrument relevance assumption can be assessed by inferring the coefﬁcient inthe ﬁrst-stage regression of D onto Z . For simplicity of notation, we implicitly assume that ( Y, D, Z ) is ( A , B R ) -measurable.

4o see why (1) is true, we can write P ( B, { } ) − P ( B, { } ) = P ( Y ∈ B, D = 1 | Z = 1) − P ( Y ∈ B, D = 1 | Z = 0)= P ( Y ∈ B, D = 1) − P ( Y ∈ B, D = 1) = P ( Y ∈ B, D = 1 , D = 0) , where the second equality follows from Assumptions 2.1(i) and 2.1(ii) and the third equality followsfrom Assumption 2.1(iii). Similar reasoning gives the second equation in (1). Since the probabilitiesin (1) are nonnegative, we obtain the testable implication of Assumption 2.1 in Balke and Pearl(1997), Imbens and Rubin (1997), and Heckman and Vytlacil (2005): For all B ∈ B R , P ( B, { } ) − P ( B, { } ) ≥ and P ( B, { } ) − P ( B, { } ) ≥ . (2)To understand (2) graphically, suppose that Y is a continuous variable and that p z ( y, d ) is thederivative of the function P z (( −∞ , y ] , { d } ) with respect to y for all d, z ∈ { , } . The followinggraphs show a case where (2) holds.Figure 1: A special case satisfying testable implication (2) p ( y, p ( y, − − (a) P ( B, { } ) > P ( B, { } ) p ( y, p ( y, − − (b) P ( B, { } ) > P ( B, { } ) The ﬁrst inequality in (2) is shown in Figure 1a, where the derivative p ( y, is greater than p ( y, everywhere. The second inequality in (2) is shown in Figure 1b, where the derivative p ( y, is greater than p ( y, everywhere. Additional graphical examples can be found in Kitagawa(2015). Section 2.1 discussed the case where the treatment and the instrument are both binary. In manyapplications, D and Z can be multivalued. See, for example, Angrist and Imbens (1995), where thetreatment variable is the number of years of schooling completed by a student and can take morethan two values. Now suppose that D ∈ D = { d , d , . . . } and Z ∈ Z = { z , z , . . . , z K } . We let d max be the maximum value of D if it exists, and d min the minimum value of D if it exists. Suppose thereexist potential variables Y dz for d ∈ D and z ∈ Z , and D z for z ∈ Z . Then the IV validity assumptionfor multivalued treatment D and multivalued instrument Z is formalized as follows. Assumption 2.2

IV validity for multivalued D and multivalued Z : i) Instrument Exclusion: With probability , Y dz = Y dz = · · · = Y dz K for all d ∈ D .(ii) Random Assignment: The variable Z is jointly independent of ( ˜ Y , ˜ D ) , where ˜ Y = ( Y d z , . . . , Y d z K , Y d z , . . . , Y d z K , . . . ) and ˜ D = ( D z , D z , . . . , D z K ) . (iii) Instrument Monotonicity: The potential treatment response variables satisfy D z k +1 ≥ D z k withprobability for all k ∈ { , , . . . , K − } . Assumption 2.2 is similar to Assumptions 1 and 2 of Angrist and Imbens (1995). Since we allow mul-tivalued Z , the monotonicity assumption needs to hold for each pair ( D z k , D z k +1 ) . The next lemmaestablishes a testable implication of Assumption 2.2 when the treatment variable has a maximumvalue and/or a minimum value. Lemma 2.1

A testable implication of Assumption 2.2 is that for all k with ≤ k ≤ K − , all Borelsets B , and all C = ( −∞ , c ] with c ∈ R , the following hold: P ( Y ∈ B, D = d max | Z = z k ) ≤ P ( Y ∈ B, D = d max | Z = z k +1 ) if d max existsand P ( Y ∈ B, D = d min | Z = z k ) ≥ P ( Y ∈ B, D = d min | Z = z k +1 ) if d min exists ; (3) P ( D ∈ C | Z = z k ) ≥ P ( D ∈ C | Z = z k +1 ) . (4)Lemma 2.1 generalized testable implication (2) to the case where the treatment and the instrumentcan both be multivalued. The testable implication (ﬁrst-order stochastic dominance) discussed byAngrist and Imbens (1995) for Assumption 2.2 is equivalent to (4). Clearly, if D and Z are bothbinary as assumed in Section 2.1, with d max = 1 and d min = 0 , then (3) is equivalent to (2) and (4)is implied by (3). To the best of our knowledge, (3) is new in the literature. Studies of identiﬁcation of causal effects in unordered choice (treatment) models can be found inHeckman et al. (2006), Heckman and Vytlacil (2007), and Heckman et al. (2008). Heckman andPinto (2018) showed that the assumptions in the preceding literature could be relaxed, and theydeﬁned a new monotonicity condition for the identiﬁcation of causal effects in such models. Wefollow Heckman and Pinto (2018) and suppose that the support D of D is an unordered set with D = { d , d , . . . , d J } and that the support Z of Z with Z = { z , . . . , z K } can be unordered as well.The unordered monotonicity condition proposed by Heckman and Pinto (2018) is as follows. Assumption 2.3

The potential treatment response indicators satisfy the condition that for all d ∈ D and all z, z ′ ∈ Z , { D z ′ = d } ≥ { D z = d } almost surely or { D z ′ = d } ≤ { D z = d } almost surely. See Heckman and Pinto (2018, pp. 2–3) for a discussion of these assumptions.

6t is worth noting that in Assumption 2.3, D is allowed to be a vector random element. In the casewhere D, Z ∈ { , } , Assumption 2.3 is equivalent to the assumption that { D = 1 } ≥ { D = 1 } almost surely or { D = 1 } ≤ { D = 1 } almost surely. According to the context of the issue ofinterest, we can prespecify a set C ⊂ D × Z × Z and assume that { D z ′ = d } ≤ { D z = d } almostsurely for all ( d, z, z ′ ) ∈ C , which is similar to Assumption 2.1(iii). With this monotonicity condition,we introduce the IV validity assumption for unordered treatment. Assumption 2.4

IV validity for unordered D and unordered Z :(i) Instrument Exclusion: With probability , Y dz = Y dz ′ for all d ∈ D and all z, z ′ ∈ Z .(ii) Random Assignment: The random element Z is jointly independent of ( ˜ Y , ˜ D ) , where ˜ Y = ( Y d z , . . . , Y d z K , Y d z , . . . , Y d z K , . . . , Y d J z , . . . , Y d J z K ) and ˜ D = ( D z , D z , . . . , D z K ) . (iii) Instrument Monotonicity: The potential treatment elements satisfy { D z ′ = d } ≤ { D z = d } with probability for all ( d, z, z ′ ) ∈ C . Under this assumption, we can deﬁne Y d by Y d = Y dz almost surely for all z , and hence P ( Y ∈ B, D = d | Z = z ′ ) = E [1 { Y d ∈ B } · { D z ′ = d } ] ≤ E [1 { Y d ∈ B } · { D z = d } ] = P ( Y ∈ B, D = d | Z = z ) for all Borel sets B and all ( d, z, z ′ ) ∈ C . Lemma 2.2

A testable implication of Assumption 2.4 is given by P ( Y ∈ B, D = d | Z = z ′ ) ≤ P ( Y ∈ B, D = d | Z = z ) (5) for all Borel sets B and all ( d, z, z ′ ) ∈ C , where C is a prespeciﬁed subset of D × Z × Z . In this section, we consider the case where conditioning covariates may exist, that is, the randomassignment assumption holds conditional on some covariates. Suppose X is a conditioning covariatevector, let X be the set of possible values of X , and let X = { x , x , . . . , x L } .First, consider the case introduced in Section 2.2 where the treatment and the instrument areboth multivalued (and ordered). A testable implication with conditioning covariates is as follows. Lemma 2.3

A testable implication of the conditional version of Assumption 2.2 is that P ( Y ∈ B, D = d max | Z = z k , X = x l ) ≤ P ( Y ∈ B, D = d max | Z = z k +1 , X = x l ) if d max existsand P ( Y ∈ B, D = d min | Z = z k , X = x l ) ≥ P ( Y ∈ B, D = d min | Z = z k +1 , X = x l ) if d min exists , and P ( D ∈ C | Z = z k , X = x l ) ≥ P ( D ∈ C | Z = z k +1 , X = x l ) , (6)7 or all k with ≤ k ≤ K − , all l with ≤ l ≤ L , all B ∈ B R , and all C = ( −∞ , c ] with c ∈ R . Second, consider the case introduced in Section 2.3 where the treatment and the instrument canboth be unordered. A testable implication with conditioning covariates is as follows.

Lemma 2.4

A testable implication of the conditional version of Assumption 2.4 is given by P ( Y ∈ B, D = d | Z = z ′ , X = x l ) ≤ P ( Y ∈ B, D = d | Z = z, X = x l ) (7) for all Borel sets B , all ( d, z, z ′ ) ∈ C , and all l with ≤ l ≤ L , where C is a prespeciﬁed subset of D × Z × Z . The inequality in (7) is similar to the generalized regression monotonicity (GRM) hypothesis in Hsuet al. (2019). The major difference is that Z is allowed to be unordered in (7). To highlight the idea, we ﬁrst introduce the test for the case where the treatment is multivalued(and ordered), with support D = { d , d , . . . } . The other cases will be discussed as extensions inlater sections. Also, we let Z be multivalued with support Z = { z , . . . , z K } . The test is constructedbased on the testable implication given in (3) and (4). Without loss of generality, we assume thatboth d min and d max exist, with d min = 0 and d max = 1 . In practice, we can always normalize d min and d max to and , respectively. Then (3) and (4) are equivalent to ( − d · { P ( Y ∈ B, D = d | Z = z k +1 ) − P ( Y ∈ B, D = d | Z = z k ) } ≤ and P ( D ∈ C | Z = z k +1 ) − P ( D ∈ C | Z = z k ) ≤ (8)for all k with ≤ k ≤ K − , all closed intervals B in R , each d ∈ { , } , and all C = ( −∞ , c ] with c ∈ R . Here, (3) and (4) originally require (8) to hold for all Borel sets B . Similarly to LemmaB.7 of Kitagawa (2015), we can show (by applying Lemma C1 of Andrews and Shi (2013)) that (8)holding for all closed intervals B is equivalent to (8) holding for all Borel sets B .By deﬁnition, for all B, C ∈ B R and all k with ≤ k ≤ K , P ( Y ∈ B, D ∈ C | Z = z k ) = P ( Y ∈ B, D ∈ C, Z = z k ) / P ( Z = z k ) . We now deﬁne function spaces G K = (cid:8) R × R ×{ z k } : k = 1 , , . . . , K (cid:9) , G = (cid:8)(cid:0) R × R ×{ z k } , R × R ×{ z k +1 } (cid:1) : k = 1 , , . . . , K − (cid:9) , H = n ( − d · B ×{ d }× R : B is a closed interval in R , d ∈ { , } o , ¯ H = n ( − d · B ×{ d }× R : B is a closed, open, or half-closed interval in R , d ∈ { , } o , H = { R × C × R : C = ( −∞ , c ] , c ∈ R } , ¯ H = { R × C × R : C = ( −∞ , c ] or C = ( −∞ , c ) , c ∈ R } , H = H ∪ H , and ¯ H = ¯ H ∪ ¯ H . (9)8et P denote the set of probability measures on ( R , B R ) . We use an i.i.d. sample { ( Y i , D i , Z i ) } ni =1 which is distributed according to some probability distribution Q in P , that is, that the measure Q ( G ) = P (( Y i , D i , Z i ) ∈ G ) for all G ∈ B R , to construct a test for the testable implication given in(3) and (4) (or in (8)). For every Q ∈ P and every measurable function v , by an abuse of notationwe deﬁne Q ( v ) = Z v d Q. (10)Deﬁne, by convention (see, for example, Folland (1999, p. 45)), that · ∞ = 0 . (11)For each Q ∈ P , the closure of H in L ( Q ) is equal to ¯ H (Lemma A.1). For every Q ∈ P and every ( h, g ) ∈ ¯ H × G with g = ( g , g ) , deﬁne φ Q ( h, g ) = Q ( h · g ) Q ( g ) − Q ( h · g ) Q ( g ) . (12)With (11), φ Q is always well deﬁned. Then the null hypothesis equivalent to (8) is H : sup ( h,g ) ∈H×G φ Q ( h, g ) ≤ (13)if the underlying distribution of the data is Q . Since Q ( v ) is continuous on L ( Q ) , (13) is equivalentto sup ( h,g ) ∈ ¯ H×G φ Q ( h, g ) ≤ . The alternative hypothesis is naturally set to H : sup ( h,g ) ∈H×G φ Q ( h, g ) > . Deﬁne the sample analogue of φ Q by ˆ φ Q ( h, g ) = ˆ Q ( h · g )ˆ Q ( g ) − ˆ Q ( h · g )ˆ Q ( g ) , where ˆ Q denotes the empirical probability measure of Q such that for every measurable function v , ˆ Q ( v ) = 1 n n X i =1 v ( Y i , D i , Z i ) , (14)and { ( Y i , D i , Z i ) } ni =1 is the i.i.d. sample distributed according to Q .The goal of this section is to construct a test for the H in (13). To evaluate the ability of thetest to provide size control, we consider a “local” sequence of probability distributions { P n } ∞ n =1 ⊂ P under which the testable implication is true and P n converges to some probability measure P ∈ P .We introduce the next two assumptions to formalize the above settings.9 ssumption 3.1 { ( Y i , D i , Z i ) } ni =1 is an i.i.d. data set distributed according to probability distribution P n for each n , where D i and Z i are discrete variables with support D and Z , respectively. Assumption 3.2

There is a probability measure P ∈ P such that lim n →∞ Z (cid:18) √ n n d P / n − d P / o − v d P / (cid:19) = 0 (15) for some measurable function v , where d P / n and d P / denote the square roots of the densities of P n and P , respectively. Assumptions 3.1 and 3.2 assume an i.i.d. sample whose distribution P n is allowed to change as n increases, and to converge to some probability measure P as deﬁned in (3.10.10) of van der Vaartand Wellner (1996). In the local analysis of Fang and Santos (2018), they considered the case wherethe value of the underlying parameter may be close to a point at which the map involved in the teststatistic is only directionally differentiable (not fully differentiable). A similarly convergent probabil-ity sequence was introduced to show the local size control of their test. As will be shown later, themap involved in our test statistic is nondifferentiable (neither fully nor directionally differentiable).We follow Fang and Santos (2018) and assume such a convergent probability sequence to show thelocal size control of our test.Clearly,

H × G ⊂ L ( P ) × ( L ( P ) × L ( P )) . Under Assumption 3.2, deﬁne a metric ρ P on L ( P ) × ( L ( P ) × L ( P )) by ρ P (( h, g ) , ( h ′ , g ′ )) = k h − h ′ k L ( P ) + k g − g ′ k L ( P ) + k g − g ′ k L ( P ) (16)for all ( h, g ) , ( h ′ , g ′ ) ∈ L ( P ) × ( L ( P ) × L ( P )) with g = ( g , g ) and g ′ = ( g ′ , g ′ ) . By Lemma A.8,the closure of H × G in L ( P ) × ( L ( P ) × L ( P )) under ρ P is equal to ¯ H × G , where ¯ H is deﬁned in(9). Deﬁne Λ( Q ) = K Y k =1 Q (cid:0) R × R ×{ z k } (cid:1) for all Q ∈ P , and T n = n · K Y k =1 ˆ P n (cid:0) R × R ×{ z k } (cid:1) , where ˆ P n is the empirical probability measure of P n deﬁned as in (14). Under Assumption 3.2, wemainly consider the nontrivial case where Λ( P ) > . Also, for every Q ∈ P , deﬁne σ Q ( h, g ) = Λ( Q ) · ( Q (cid:0) h · g (cid:1) Q ( g ) − Q ( h · g ) Q ( g ) + Q (cid:0) h · g (cid:1) Q ( g ) − Q ( h · g ) Q ( g ) ) (17)for all ( h, g ) ∈ ¯ H × G with g = ( g , g ) , where Q m ( g j ) = [ Q ( g j )] m for m ∈ N and j ∈ { , } . Lemma 3.1

Under Assumptions 3.1 and 3.2, √ T n ( ˆ φ P n − φ P ) G for some tight random element G which almost surely has a uniformly ρ P -continuous path, and for all ( h, g ) ∈ ¯ H × G with g = ( g , g ) , See Examples 2.1 and 2.2 of Fang and Santos (2018). In a metric space, tightness implies separability. he variance V ar ( G ( h, g )) is equal to the σ P ( h, g ) given in (17) , where σ P ( h, g ) ≤ / · max ( g ′ ,g ′ ) ∈G { Λ ( P ) /P ( g ′ ) + Λ ( P ) /P ( g ′ ) } ≤ / · ( K − − ( K − , (18) and K is the number of elements in Z . In particular, σ P ( h, g ) ≤ / for all ( h, g ) ∈ ¯ H × G when K = 2 . Lemma 3.1 provides the asymptotic distribution of √ T n ( ˆ φ P n − φ P ) and its asymptotic variance,which is uniformly bounded by for all K > . We used the quantity √ T n instead of √ n to establishthe asymptotic distribution in order to achieve a known bound for the asymptotic variance. Thebound in (18) will be useful when we construct the test statistic. By (17), for every ( h, g ) ∈ ¯ H×G with g = ( g , g ) , deﬁne the sample analogue of σ P ( h, g ) by ˆ σ P n ( h, g ) = T n n · ( ˆ P n (cid:0) h · g (cid:1) ˆ P n ( g ) − ˆ P n ( h · g )ˆ P n ( g ) + ˆ P n (cid:0) h · g (cid:1) ˆ P n ( g ) − ˆ P n ( h · g )ˆ P n ( g ) ) . (19)Note that for each h ∈ ¯ H and each g l ∈ G K , if ˆ P n ( g l ) = 0 then ˆ P n ( h · g l ) = 0 . By (11), ˆ σ P n is welldeﬁned.We may extend the idea of Kitagawa (2015) and construct the test statistic to be sup ( h,g ) ∈H×G √ T n ˆ φ P n ( h, g )max { ξ, ˆ σ P n ( h, g ) } (20)for some positive number (trimming parameter) ξ . Here, ξ plays two roles: (1) Since ˆ σ P n canbe zero, ξ bounds the denominator away from zero; (2) as shown in the Monte Carlo studies ofKitagawa (2015) and the present paper, different values of ξ , from small (close to 0) to large (closeto 1), may lead to different powers of the test for the same data generating process (DGP), whichcould be close to 0. Kitagawa (2015) suggests that if there is no prior knowledge available abouta likely alternative, the default choice of ξ could be set to . according to the simulation studiesfor the binary treatment and binary instrument case. They also suggest that users report test resultsusing different values of ξ . However, the underlying distribution of the data can never be fullyexplored or represented by limited simulation designs, so an “optimal” value of ξ which is plausiblefor all possible DGPs may not exist. If we repeat the test using the same data set but differentvalues of ξ and make a decision based on all these results, we might encounter an issue of multiplecomparisons. As a consequence, the size of the test, or more precisely the “family-wise error rate,”may not be controlled by the nominal signiﬁcance level. With all these considerations, this paperconstructs the test statistic in a way that, loosely speaking, computes the weighted average of thetest statistics in (20) over ξ . If we put all the weight on one particular value of ξ , the test statisticdegenerates to the test statistic in (20).Let Ξ be a predetermined closed subset of [0 , such that Ξ . The set Ξ contains all thevalues of ξ used for constructing the test statistic. Only one of the values greater than (or equal to)the bound in Lemma 3.1, say , needs to be included in Ξ . The test statistic in (20) reduces to the11nweighted KS statistic when ξ = 1 . Also, for every A ⊂ ¯ H×G , deﬁne a map S A : ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) → ℓ ∞ (Ξ) by S A ( ψ ) ( ξ ) = sup ( h,g ) ∈ A ψ ( ξ, h, g ) for all ψ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) . For simplicity of notation, we will write S for S ¯ H×G . Deﬁne M : ℓ ∞ ( ¯ H × G ) → ℓ ∞ (Ξ × ¯ H × G ) by M ( ϕ )( ξ, h, g ) = max { ξ, ϕ ( h, g ) } (21)for all ϕ ∈ ℓ ∞ ( ¯ H × G ) and ( ξ, h, g ) ∈ Ξ × ¯ H × G . Let ν be a positive measure on Ξ . Assumption 3.3

The measure ν satisﬁes that < ν (Ξ) < ∞ and S H×G ( ˆ φ P n / M (ˆ σ P n )) ∈ L ( ν ) for all ω ∈ Ω and all n . Note that for every ﬁnite sample set, S H×G ( ˆ φ P n / M (ˆ σ P n )) = S ( ˆ φ P n / M (ˆ σ P n )) . (22)See the discussion in Section 4 about the computational simpliﬁcation of S H×G ( ˆ φ P n / M (ˆ σ P n )) . De-ﬁne a function I : L ( ν ) → R by I ( f ) = R Ξ f d ν for all f ∈ L ( ν ) . Now we set the test statisticto p T n I ◦ S

H×G ˆ φ P n M (ˆ σ P n ) ! . (23)The measure ν could be a Dirac measure centered at some ﬁxed ξ ∈ Ξ . This is equivalent to using aparticular value for the trimming parameter to construct the test statistic as in (20). Or ν could bea discrete or continuous probability measure that assigns probabilities to the elements of Ξ . This isequivalent to using a weighted average of the test statistics in (20) over ξ . By using (23), we takeinto account the fact that the values of ξ may inﬂuence the power of the test, and we can also avoidthe multiple testing issue. Deﬁne Ψ H×G = { ( h, g ) ∈ H × G : φ P ( h, g ) = 0 } and Ψ ¯ H×G = (cid:8) ( h, g ) ∈ ¯ H × G : φ P ( h, g ) = 0 (cid:9) . (24)Since { a }×{ }× R , − { a }×{ }× R ∈ H for all a ∈ R , Ψ H×G and Ψ ¯ H×G are not empty.

Theorem 3.1

Suppose Assumptions 3.1, 3.2, and 3.3 hold. If the H in (13) is true with Q = P n forall n , then p T n I ◦ S

H×G ˆ φ P n M (ˆ σ P n ) ! I ◦ S Ψ ¯ H×G (cid:18) G M ( σ P ) (cid:19) , (25) where G is as in Lemma 3.1. Theorem 3.1 provides the pointwise asymptotic distribution of the test statistic if the H in (13)12s true with Q = P n for all n . To ﬁnd this asymptotic distribution, we employed the extendeddelta method provided in Appendix A. Because the map M is nondifferentiable, the existing deltamethods fail to work in establishing the weak convergence in (25). In Appendix A, we providean extended continuous mapping theorem and an extended delta method elaborated by TheoremsA.1 and A.2, respectively, to deal with this technical issue. See further discussion in Remark B.3.Theorem A.1 can be viewed as an extension of Theorem 1.11.1 of van der Vaart and Wellner (1996),and Theorem A.2 can be viewed as an extension of Theorem 3.9.5 of van der Vaart and Wellner(1996) and of Theorem 2.1 of Fang and Santos (2018).In Theorem 3.1, we consider the general case, where D = { d , d , . . . } . If D is a ﬁnite set with D = { d , d , . . . , d J } , then I ◦ S Ψ ¯ H×G ( G / M ( σ P )) = I ◦ S Ψ H×G ( G / M ( σ P )) under null, because itcan be shown that in this special case Ψ ¯ H×G is equal to the closure of Ψ H×G in ¯ H × G under ρ P and G / M ( σ P ) is continuous under ρ P for every ﬁxed ξ . We summarize this in the following corollary. Corollary 3.1

Under the assumptions of Theorem 3.1 with D = { d , d , . . . , d J } , p T n I ◦ S

H×G ˆ φ P n M (ˆ σ P n ) ! I ◦ S Ψ H×G (cid:18) G M ( σ P ) (cid:19) , (26) where G is as in Lemma 3.1. It was shown that the asymptotic distribution in (26) involves a map S Ψ H×G where Ψ H×G dependson the underlying probability measure P . Therefore, we need to ﬁnd a “valid” estimator \ Ψ H×G for Ψ H×G in order to consistently approximate the asymptotic distribution. If \ Ψ H×G can be constructedappropriately, a natural approximation of S Ψ H×G can be constructed by S \ Ψ H×G . By the deﬁnition of Ψ H×G in (24), we construct \ Ψ H×G by \ Ψ H×G = ( ( h, g ) ∈ H × G : p T n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ φ P n ( h, g ) M (ˆ σ P n ) ( ξ , h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ τ n ) (27)with τ n → ∞ and τ n / √ n → as n → ∞ , where ξ is a small positive number. We suggest using ξ = 0 . in practice. It can be shown that S \ Ψ H×G can also be used to approximate the asymptoticdistribution in (25) when D = { d , d , . . . } . This is a method similar to that which is used in Beareand Shi (2019) and Sun and Beare (2019) to estimate contact sets in independent contexts. SeeLinton et al. (2010) and Lee et al. (2013) for further discussion of estimation of contact sets. Each ( h, g ) is included in \ Ψ H×G if √ T n | ˆ φ P n ( h, g ) | is no more than τ n estimated standard deviations fromzero. As mentioned by Sun and Beare (2019), we can effectively use pointwise conﬁdence intervalsto select points in this way. More precisely, the weak convergence in (25) is under P n . See the equation in (B.50). .1.1 Test Procedure We implement the test in the following sequence of steps:(1) Obtain the bootstrap sample { ( ˆ Y i , ˆ D i , ˆ Z i ) } ni =1 drawn independently with replacement from thesample { ( Y i , D i , Z i ) } ni =1 .(2) Calculate the bootstrap version of ˆ φ P n by ˆ φ BP n ( h, g ) = ˆ P Bn ( h · g )ˆ P Bn ( g ) − ˆ P Bn ( h · g )ˆ P Bn ( g ) , (28)let T Bn = n · Q Kk =1 ˆ P Bn (cid:0) R × R ×{ z k } (cid:1) , and calculate the bootstrap version of ˆ σ P n by ˆ σ BP n ( h, g ) = r T Bn n · s ˆ P Bn ( h · g )ˆ P Bn ( g ) − ˆ P Bn ( h · g ) ˆ P Bn ( g ) + ˆ P Bn ( h · g )ˆ P Bn ( g ) − ˆ P Bn ( h · g ) ˆ P Bn ( g ) (29)for all ( h, g ) ∈ ¯ H × G , where ˆ P Bn ( v ) = n − P ni =1 v ( ˆ Y i , ˆ D i , ˆ Z i ) for all measurable v .(3) Calculate the bootstrap version of the test statistic by I ◦ S \ Ψ H×G (cid:18)q T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) (cid:19) . (30)Since the I ◦ S Ψ ¯ H×G in the asymptotic distribution in (25) is a nonlinear map, the bootstraptest statistic in (30) was constructed following the idea of Fang and Santos (2018). The non-linearity of the map

I ◦ S Ψ ¯ H×G may cause inconsistencies in the bootstrap approximation. SeeD¨umbgen (1993), Andrews (2000), and Fang and Santos (2018) for details. Because of thedenominator M (ˆ σ BP n ) , our approach is an extension of that of Fang and Santos (2018). Sim-ilarly to (23), we can simplify the calculation of (30) in practice. See Section 4 for detailsregarding Monte Carlo simulations.(4) Repeat steps (1), (2), and (3) n B times independently, for (say) n B = 1000 . Given the nominalsigniﬁcance level α , calculate the bootstrap critical value ˆ c − α by ˆ c − α = inf (cid:26) c : P (cid:18) I ◦ S \ Ψ H×G (cid:18)q T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) (cid:19) ≤ c (cid:12)(cid:12)(cid:12)(cid:12) { ( Y i , D i , Z i ) } ni =1 (cid:19) ≥ − α (cid:27) . (31)In practice, we approximate ˆ c − α by computing the − α quantile of the n B independentlygenerated bootstrap statistics, with n B chosen as large as is computationally convenient.(5) The decision rule for the test is: Reject H if √ T n I ◦ S

H×G ( ˆ φ P n / M (ˆ σ P n )) > ˆ c − α . Theorem 3.2

Suppose Assumptions 3.1, 3.2, and 3.3 hold. i) If the H in (13) is true with Q = P n for all n and the CDF of I ◦S Ψ ¯ H×G ( G / M ( σ P )) is increasingand continuous at its − α quantile c − α , where G is the asymptotic limit given by Lemma B.8,then lim n →∞ P ( √ T n I ◦ S

H×G ( ˆ φ P n / M (ˆ σ P n )) > ˆ c − α ) ≤ α . If, in addition, P n = P for all large n , then lim n →∞ P ( √ T n I ◦ S

H×G ( ˆ φ P n / M (ˆ σ P n )) > ˆ c − α ) = α .(ii) If the H in (13) is false with Q = P and P n = P for all large n , then lim n →∞ P ( √ T n I ◦ S

H×G ( ˆ φ P n / M (ˆ σ P n )) > ˆ c − α ) = 1 . It is implied by Theorem 11.1 of Davydov et al. (1998) that in (i) of Theorem 3.2, the CDF of

I ◦S Ψ ¯ H×G ( G / M ( σ P )) is differentiable and has a positive derivative everywhere except at countablymany points in its support, provided that I ◦ S Ψ ¯ H×G ( G / M ( σ P )) = 0 . If I ◦ S Ψ ¯ H×G ( G / M ( σ P )) = 0 at null conﬁgurations, our test statistic converges to zero in probability and so does the critical value.Theorem 3.2 does not show clearly how the rejection rate of the test will behave asymptotically inthis case. As discussed in Sun and Beare (2019), this is a common theoretical limitation for irregulartesting problems. Tests based on the machinery of Fang and Santos (2018), and also those based ongeneralized moment selection (Andrews and Soares, 2010; Andrews and Shi, 2013), may encounterthis issue. One practical resolution is to replace the bootstrap critical value ˆ c − α with max { ˆ c − α , η } or ˆ c − α + η , where η is some small positive constant. See, for instance, Donald and Hsu (2016,p. 13). Simulation results showed that the empirical rejection rates of our test with η = 0 are lowerthan the nominal signiﬁcance level when I ◦ S Ψ ¯ H×G ( G / M ( σ P )) = 0 under null conﬁgurations. In this section, we consider the special case where the treatment D and the instrument Z are bothbinary. Kitagawa (2015) constructed a test for the instrument validity assumption based on testableimplication (2) when D and Z are both binary. We now compare the results from Section 3.1 withthose of Kitagawa (2015). Let z = 0 , z = 1 , d = 0 , and d = 1 . All the results in Section 3.1 holdin this case, and the test statistic in (23) is now numerically equal to the one constructed by Kitagawa(2015) if we let ν be a Dirac measure. Recall that the instrument is allowed to be multivalued underthe constructions in Section 3. The testing strategy in this paper is different from that of Kitagawa (2015). To make this pointclear, we consider a simple case where P n = P for all n and the H in (13) is true with Q = P . We establish the asymptotic distribution in (26) and use it to construct the critical value, whileKitagawa (2015) used an upper bound of the asymptotic distribution to construct the critical value. For the case where the treatment is binary and the instrument is multivalued, Kitagawa (2015) constructed the teststatistic by ﬁrst computing the normalized differences of two empirical probability measures between neighboring pairs ofvalues of instruments (ordered according to the propensity score), and then taking the maximum value of all these differences.Since these differences can be mutually correlated, it would not be straightforward to obtain the asymptotic distribution oftheir test statistic and approximate its null distribution by bootstrap. Our test achieves size control under Assumption 3.2 (the convergence of a “local” sequence of probability distributions),while the test of Kitagawa (2015) achieves uniform size control under different conditions. Assuming a ﬁxed P makes thecomparison more explicit.

15s introduced in Section 2, we follow Kitagawa (2015) and deﬁne probability measures P ( B, C ) = P ( Y ∈ B, D ∈ C | Z = 1) and P ( B, C ) = P ( Y ∈ B, D ∈ C | Z = 0) for all B, C ∈ B R . Now we deﬁne F b = (cid:8) ( − d · B ×{ d } : B is a closed interval , d ∈ { , } (cid:9) , and write P d ( f ) = R f d P d for all measurable f and each d ∈ { , } . Kitagawa (2015) showed thattheir critical value converged to the − α quantile of the distribution sup f ∈F b G H ( f ) / ( ξ ∨ σ H ( f )) ,where H = λP + (1 − λ ) P , λ = P ( Z = 1) , G H is an H -Brownian bridge, and σ H ( f ) is the standarddeviation of G H ( f ) , that is, σ H ( f ) = H ( f ) − H ( f ) . Let F ∗ b = { f ∈ F b : P ( f ) = P ( f ) } . Then it iseasy to show that H ( f ) = P ( f ) = P ( f ) for all f ∈ F ∗ b . Let ν be a Dirac measure centered at some ξ . It can be shown that sup f ∈F b G H ( f ) ξ ∨ σ H ( f ) ≥ sup f ∈F ∗ b G H ( f ) ξ ∨ σ H ( f ) L = I ◦ S Ψ H×G (cid:18) G M ( σ P ) (cid:19) , (32)where I ◦S Ψ H×G ( G / M ( σ P )) is the asymptotic distribution of the test statistic in (26) and “ L = ” meansequivalence in distribution. The bootstrap critical value proposed in the present paper is based on I ◦S Ψ H×G ( G / M ( σ P )) (equivalently, sup f ∈F ∗ b G H ( f ) / ( ξ ∨ σ H ( f )) ), while the one of Kitagawa (2015) isbased on the upper bound sup f ∈F b G H ( f ) / ( ξ ∨ σ H ( f )) . Speciﬁcally, Kitagawa (2015) constructed abootstrap approximation for the Gaussian process G H / ( ξ ∨ σ H ) , denoted by G BH / ( ξ ∨ σ BH ) , and thencomputed the bootstrap test statistic by sup f ∈F b G BH ( f ) / ( ξ ∨ σ BH ( f )) . We estimate F ∗ b by a subset of F b , denoted by c F ∗ b , and compute the bootstrap test statistic by sup f ∈ c F ∗ b G BH ( f ) / ( ξ ∨ σ BH ( f )) . Clearly,our bootstrap test statistic is numerically smaller than that of Kitagawa (2015), and hence the criticalvalue is smaller. It can also be shown that our critical value converges to the − α quantile of sup f ∈F ∗ b G H ( f ) / ( ξ ∨ σ H ( f )) . Since the test statistic in (23) is numerically equivalent to that ofKitagawa (2015), this shows that the power of the test can be improved by use of our approach. Seethe simulation evidence in Appendix C. With testable implication (5), we deﬁne the function space

H × G = (cid:8)(cid:0) B ×{ d }× R , (cid:0) R × R ×{ z } , R × R ×{ z ′ } (cid:1)(cid:1) : B is a closed interval , ( d, z, z ′ ) ∈ C (cid:9) . (33)For every probability measure Q with (10), we deﬁne φ Q ( h, g ) = Q ( h · g ) /Q ( g ) − Q ( h · g ) /Q ( g ) for every ( h, g ) ∈ H × G with g = ( g , g ) . Testable implication (5) is equivalent to the H in H : sup ( h,g ) ∈H×G φ Q ( h, g ) ≤ and H : sup ( h,g ) ∈H×G φ Q ( h, g ) > Q is the underlying probability distribution of the data. Then we can follow the test procedure inSection 3.1.1 to conduct the test with the function space H × G deﬁned in (33).

We follow the setup in Section 2.4 and suppose X is a d X -dimensional vector random variable. First,consider the testable implication in Lemma 2.3 with d min = 0 and d max = 1 . Deﬁne function spaces G = (cid:8)(cid:0) R × R ×{ z k }×{ x l } , R × R ×{ z k +1 }×{ x l } (cid:1) : k = 1 , , . . . , K − , l = 1 , , . . . , L (cid:9) , H = n ( − d · B ×{ d }× R × R dX : B is a closed interval , d ∈ { , } o , H = (cid:8) R × C × R × R dX : C = ( −∞ , c ] , c ∈ R (cid:9) , and H = H ∪ H . (34)For every probability measure Q with (10), we deﬁne φ Q ( h, g ) = Q ( h · g ) /Q ( g ) − Q ( h · g ) /Q ( g ) for every ( h, g ) ∈ H × G with g = ( g , g ) . Testable implication (6) is equivalent to the H in H : sup ( h,g ) ∈H×G φ Q ( h, g ) ≤ and H : sup ( h,g ) ∈H×G φ Q ( h, g ) > if Q is the underlying probability distribution of the data. Then we can follow the test procedure inSection 3.1.1 to conduct the test with the function space H × G deﬁned by the H and the G in (34).Next, consider the testable implication in Lemma 2.4. Deﬁne the function space H × G =  (cid:16) B ×{ d }× R × R dX , (cid:0) R × R ×{ z }×{ x l } , R × R ×{ z ′ }×{ x l } (cid:1)(cid:17) : B is a closed interval , ( d, z, z ′ ) ∈ C , l = 1 , , . . . , L  . (35)For every probability measure Q with (10), we deﬁne φ Q ( h, g ) = Q ( h · g ) /Q ( g ) − Q ( h · g ) /Q ( g ) for every ( h, g ) ∈ H × G with g = ( g , g ) . Testable implication (7) is equivalent to the H in H : sup ( h,g ) ∈H×G φ Q ( h, g ) ≤ and H : sup ( h,g ) ∈H×G φ Q ( h, g ) > if Q is the underlying probability distribution of the data. Then we can follow the test procedure inSection 3.1.1 to conduct the test with the function space H × G deﬁned in (35).

We ﬁrst designed Monte Carlo simulations for the case where D and Z are both multivalued randomvariables such that D ∈ { , , } and Z ∈ { , , } . Simulation comparisons with Kitagawa (2015)for the case where D and Z are both binary are given in Appendix C. Each simulation consisted of Monte Carlo iterations and bootstrap iterations. To expedite the simulation, we employedthe warp-speed method of Giacomini et al. (2013). As shown in (18), σ P is bounded by (1 / · ( K − ) − ( K − , where K = 3 in our setting. In each simulation, the measure ν was set to be a Diracmeasure δ ξ centered at one of the following values of ξ : . , . , . , . , . , . , . , . , . , and , or to be a probability measure ¯ ν ξ that assigns equal probabilities (weights) to the valuesof ξ listed above. The nominal signiﬁcance level α was set to . .When calculating the supremum of the test statistic √ T n I ◦ S

H×G ( ˆ φ P n / M (ˆ σ P n )) in (23), wefollowed the numerical computation approach used by Kitagawa (2015). Speciﬁcally, we calculatedthe supremum using only the closed intervals B with the values of { Y i } ni =1 observed in the data asthe endpoints, that is, B = [ a, b ] with a, b ∈ { Y , Y , . . . , Y n } and a ≤ b . It is not hard to show thatthe test statistic calculated in this way is equal to that in (23). We also used such closed intervals tocalculate the bootstrap test statistic I ◦ S \ Ψ H×G ( p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n )) in (30). From all suchintervals, we found those that satisfy the inequality in (27) and used them to calculate the supremumof p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) for each ξ listed above. The ﬁrst set of simulations was designed to investigate the size of the test and the selection ofthe tuning parameter. As shown in (27), the estimate \ Ψ H×G involves a tuning parameter τ n with τ n → ∞ and τ n / √ n → as n → ∞ . In practice, we need to use a particular value of τ n foreach sample size n . For this set of simulations, we set n to and τ n to , , , , and ∞ . For τ n = ∞ , \ Ψ H×G = H × G and the test is conservative. We compared the rejection rates obtainedusing each of these values of τ n and decided which value would be a good option for sample sizesclose to . We let U ∼ Unif(0 , , V ∼ Unif(0 , , N ∼ N(0 , , N ∼ N(1 , , N ∼ N(2 , , Z = 2 × { U ≤ . } +1 { . < U ≤ . } ( P ( Z = 2) = 0 . ), D z = 2 × { V ≤ . } +1 { . < V ≤ . } for z = 0 , , , D = P z =0 { Z = z } × D z , and Y = P d =0 { D = d } × N d . All the variables U , V , N , N , and N are mutually independent. Clearly, Assumption 2.2 holds in this case with z = 0 , z = 1 , and z = 2 .Table 1 shows the results of the simulations. The rejection rates were inﬂuenced by the values of τ n and ξ . For each measure ν , a smaller τ n yields greater rejection rates, because a smaller τ n leadsto a smaller critical value according to (27). For τ n = 2 , all the rejection rates were close to thosefor τ n = ∞ (the conservative case). Similar to the pattern of the results shown in Kitagawa (2015),some rejection rates for τ n = 2 with δ ξ centered at particular values of ξ were slightly upwardlybiased compared to the nominal size. Overall, however, the results showed good performance ofthe test in terms of size control. When sample sizes are less than or close to , we suggest using τ n = 2 in practice to achieve good size control without a signiﬁcant power loss. When the samplesize increases, τ n should be increased accordingly. It is also worth noting that when we used themeasure ¯ ν ξ , the rejection rates could be well controlled by the nominal signiﬁcance level. Thus ifwe have no additional information about the choice of ξ , ¯ ν ξ can be a default choice for us.18able 1: Rejection Rates under H for Multivalued D and Multivalued Zτ n ξ for δ ξ ¯ ν ξ ∞ The second set of simulations was designed to investigate the power of the test. Six data generatingprocesses (DGPs) in total were considered, and Assumption 2.2 did not hold with z = 0 , z = 1 , and z = 2 . Sample sizes were set to n = 200 , , , , and . The probability P ( Z = 2) = r n ,with r n = 1 / , / , / , / , and / for the corresponding sample sizes. We set τ n to , assuggested in the preceding set of simulations. DGPs (1)–(4) are the cases where (3) was violated and(4) was not violated, and DGPs (5) and (6) are the cases where both (3) and (4) were violated. Welet U ∼ Unif(0 , , V ∼ Unif(0 , , W ∼ Unif(0 , , and Z = 2 × { U ≤ r n } + 1 { r n < U ≤ r n + 0 . } .For DGPs (1)–(4), we let D z = 2 × { V ≤ . } + 1 { . < V ≤ . } for z = 0 , , , D = P z =0 { Z = z } × D z , N ∼ N(0 , , N ∼ N(0 , , and N dz ∼ N(0 , for d = 0 , , and z = 1 , .(1): N ∼ N( − . , and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) .(2): N ∼ N(0 , . ) and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) .(3): N ∼ N(0 , . ) and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) .(4): N a ∼ N( − , . ) , N b ∼ N( − . , . ) , N c ∼ N(0 , . ) , N d ∼ N(0 . , . ) , N e ∼ N(1 , . ) , N = 1 { W ≤ . } × N a + 1 { . < W ≤ . } × N b + 1 { . . } × N e , and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) .For DGPs (5) and (6), we let N ∼ N(0 , , N ∼ N(1 , , and N ∼ N(2 , .(5): D = 2 × { V ≤ . } + 1 { . < V ≤ . } , D = 2 × { V ≤ . } + 1 { . < V ≤ . } , D = D , D = P z =0 { Z = z } × D z , and Y = P d =0 { D = d } × N d .(6): D = 2 × { V ≤ . } + 1 { . < V ≤ . } , D = 2 × { V ≤ . } + 1 { . < V ≤ . } , D = D , D = P z =0 { Z = z } × D z , and Y = P d =0 { D = d } × N d .All the variables U , V , N , N , N , N , N , N , N , N , N , N , N , and N were setto be mutually independent for each DGP. We brieﬂy explain how DGPs (1)–(4) violate (3), whichis shown graphically in Figure 2. We let p z ( y, d ) be the derivative of P ( Y ∈ ( −∞ , y ] , D = d | Z = z ) with respect to y for all d, z ∈ { , , } . Similarly to Figure 1, if (3) were true, then we would have p ( y, ≤ p ( y, ≤ p ( y, everywhere. For DGPs (1)–(4), p ( y,

2) = p ( y, held for all y , but19 ( y, ≤ p ( y, did not hold on some range of R . DGPs (5) and (6) are the cases where themonotonicity assumption did not hold and both (3) and (4) were violated.Figure 2: Curves of p ( y, (dashed) and p ( y, (solid) for DGPs (1)–(4) p ( y, p ( y, − − (a) DGP (1) p ( y, p ( y, − − (b) DGP (2) p ( y, p ( y, − − (c) DGP (3) p ( y, p ( y, − − (d) DGP (4) Table 2 shows the rejection rates under DGPs (1)–(6), that is, the power of the test. For each DGPand each measure ν , the rejection rate increased as the sample size n was increased. The results for ν = ¯ ν ξ showed that if we have no information about the choice of ξ , using the weighted average ofthe statistics over ξ is a desirable option. When n > , the rejection rates for using ν = ¯ ν ξ were ata relatively high level compared to the results for using a Dirac measure. We revisit one empirical example discussed by Kitagawa (2015) to show the performance of theproposed test in practice. The example is from Card (1993), who used college proximity as aninstrument of years of schooling to study the causal link between education and earnings. The dataare from the Young Men Cohort of the National Longitudinal Survey. In the original study of Card(1993), the educational level D is a multivalued treatment variable, while Kitagawa (2015) treatedit as a binary treatment variable T with T = 1 { D ≥ } . The results of the test of Kitagawa (2015)showed that the instrument was not valid when no covariates were controlled.We use the originally deﬁned treatment variable D to reconduct the test. Speciﬁcally, the treat-ment D is education attainment observed in 1976 (the variable “ed76”), the instrument Z is whetheran individual grew up near a 4-year college (the variable “nearc4”), and the outcome is log wageobserved in 1976 (the variable “lwage76”) in the data set. The available sample size is 3010. Wefollow the setup in Section 3 with D = { , , . . . , } and Z = { , } . The instrument Z = 1 impliesthat an individual grew up near a 4-year college. Table 3 shows the p -values obtained from ourtest using each measure ν . From these results we conclude that we do not reject the validity ofinstrument Z .The testable implication used by Kitagawa (2015) for binary T is that P ( Y ∈ B, T = 0 | Z = 1) − P ( Y ∈ B, T = 0 | Z = 0) ≤ and P ( Y ∈ B, T = 1 | Z = 1) − P ( Y ∈ B, T = 1 | Z = 0) ≥ (36)20able 2: Rejection Rates under H for Multivalued D and Multivalued Z DGP n ξ for δ ξ ¯ ν ξ for all closed intervals B . The inequalities in (36) are equivalent to the following for all closedintervals B : P ( Y ∈ B, D < | Z = 1) − P ( Y ∈ B, D < | Z = 0) ≤ and P ( Y ∈ B, D ≥ | Z = 1) − P ( Y ∈ B, D ≥ | Z = 0) ≥ , (37)which are different from those in the testable implication given in (3) and (4) and are not implied byAssumption 2.2. Thus a valid instrument Z for multivalued D which satisﬁes the testable implicationgiven in (3) and (4) may not satisfy the inequalities in (36), that is, Z may not remain valid for binary(or coarsened) T . This provides a possible explanation for why we accept Z but Kitagawa (2015)rejected it.To be more explicit, we consider a simpler example. Let U ∼ Unif (0 , , V ∼ Unif (0 , , Y d ∼ Unif ( d, d + 1) for d ∈ { , , } , Z = 1 { U ≤ . } , D = 2 × { V ≤ . } + 1 { . < V ≤ . } , D =2 × { V ≤ . } + 1 { . < V ≤ . } , D = P z =0 { Z = z } × D z , and Y = P d =0 { D = d } × Y d ,21able 3: p -values Obtained from the Proposed Test for Each Measure ν ξ for δ ξ ¯ ν ξ where U , V , Y , Y , and Y are mutually independent. We can verify that Assumption 2.2 holdsfor Z and D in this example. It can be shown that for every Borel set B and each z ∈ { , } , P ( Y ∈ B, D ≥ | Z = z ) = P ( Y ∈ B, D z = 1) + P ( Y ∈ B, D z = 2) . Let B = [1 , . Then we have P ( Y ∈ B, D ≥ | Z = 1) − P ( Y ∈ B, D ≥ | Z = 0) = P ( D = 0 , D = 1) − P ( D = 1 , D = 2) < . (38)The inequality in (38) shows that the valid instrument Z for multivalued D does not satisfy theinequalities as those in (37). Equivalently, the instrument Z is not valid for the coarsened treatment T = 1 { D ≥ } . The reason why Z does not remain valid is as follows. Assumption 2.1 for Z and T speciﬁed in this example requires Y ′ t = Y ′ t almost surely for t ∈ { , } , where Y ′ tz is the potentialoutcome variable for T = t and Z = z with t ∈ { , } and z ∈ { , } . With the potential outcomevariables, we can write Y = X d =0 { D = d } · Y d = X z =0 { Z = z } · X t =0 { T = t } · Y ′ tz ! . For every ω ∈ Ω with Z ( ω ) = z and T ( ω ) = 1 , Y ( ω ) = P d =1 { D z ( ω ) = d } · Y d ( ω ) = Y ′ z ( ω ) . If Y ′ = Y ′ almost surely, it follows that Y ′ = Y ′ = X d =1 { D = d } · Y d + 1 { D = 0 } · W almost surely with D = X z =0 { Z = z } · D z , (39)where W is a random variable such that W ( ω ) = Y ′ ( ω ) = Y ′ ( ω ) for almost all ω with T ( ω ) = 0 .However, (39) shows that Z affects Y ′ and Y ′ through D , and therefore Y ′ and Y ′ are notnecessarily independent of Z . Thus Assumption 2.1(ii) may fail for Z and (coarsened) T .For empirical or theoretical reasons, we may want to coarsen a multivalued treatment to be abinary variable in some circumstances. However, Angrist and Imbens (1995, p. 436) and Marshall(2016) showed that such coarsening may lead to inconsistent estimates for the average per-unittreatment effect and the effect of obtaining a particular treatment intensity level beyond obtainingonly the preceding level. They provided several special cases in which the estimates could be con-sistent, such as the case where the instrument only affects reaching a particular treatment intensityand the case where the effect at all intensities other than a particular one is zero. But further dis-cussion of Marshall (2016) showed that these cases are often implausible in practice. For the dataset of Card (1993), the treatment variable deﬁned by Kitagawa (2015), T = 1 { D ≥ } , can beconsidered as a four-year college degree. The simple numerical example designed above shows that22oarsening may undermine the validity of the instrument for T , so the IV estimate for the effect ofobtaining a college degree may be inconsistent. This provides another perspective for understand-ing the inconsistency of the coarsened estimator. In general, therefore, coarsening is not a desirableoption for us. This also shows the signiﬁcance of the generalization of the test in the present paper. In this paper, we provided a general framework for testing instrument validity in heterogeneouscausal effect models. We generalized the testable implications of the instrument validity assump-tions in the literature, and based on them we proposed a nonparametric bootstrap test. An extendedcontinuous mapping theorem and an extended delta method were provided to establish the asymp-totic distribution of the test statistic, which may be of independent interest. The proposed test canbe applied in more general settings and may achieve power improvement.

References

Abadie, A. (2002). Bootstrap tests for distributional treatment effects in instrumental variable mod-els.

Journal of the American Statistical Association , 97(457):284–292.Abadie, A., Angrist, J., and Imbens, G. (2002). Instrumental variables estimates of the effect ofsubsidized training on the quantiles of trainee earnings.

Econometrica , 70(1):91–117.Ananat, E. O. and Michaels, G. (2008). The effect of marital breakup on the income distribution ofwomen with children.

Journal of Human Resources , 43(3):611–629.Andrews, D. W. (2000). Inconsistency of the bootstrap when a parameter is on the boundary of theparameter space.

Econometrica , 68(2):399–405.Andrews, D. W. and Shi, X. (2013). Inference based on conditional moment inequalities.

Economet-rica , 81(2):609–666.Andrews, D. W. and Soares, G. (2010). Inference for parameters deﬁned by moment inequalitiesusing generalized moment selection.

Econometrica , 78(1):119–157.Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from socialsecurity administrative records.

The American Economic Review , 80(3):313–336.Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of average causaleffects in models with variable treatment intensity.

Journal of the American Statistical Association ,90(430):431–442.Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identiﬁcation of causal effects using instru-mental variables.

Journal of the American Statistical Association , 91(434):444–455.23ngrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling andearnings?

The Quarterly Journal of Economics , 106(4):979–1014.Angrist, J. D. and Krueger, A. B. (1995). Split-sample instrumental variables estimates of the returnto schooling.

Journal of Business & Economic Statistics , 13(2):225–235.Angrist, J. D. and Pischke, J.-S. (2008).

Mostly Harmless Econometrics: An Empiricist’s Companion .Princeton University Press.Angrist, J. D. and Pischke, J.-S. (2014).

Mastering Metrics: The Path from Cause to Effect . PrincetonUniversity Press.Armstrong, T. B. (2014). Weighted KS statistics for inference on conditional moment inequalities.

Journal of Econometrics , 181(2):92–116.Armstrong, T. B. and Chan, H. P. (2016). Multiscale adaptive inference on conditional momentinequalities.

Journal of Econometrics , 194(1):24–43.Balke, A. and Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance.

Journal of the American Statistical Association , 92(439):1171–1176.Barrett, G. F. and Donald, S. G. (2003). Consistent tests for stochastic dominance.

Econometrica ,71(1):71–104.Barrett, G. F., Donald, S. G., and Bhattacharya, D. (2014). Consistent nonparametric tests for Lorenzdominance.

Journal of Business & Economic Statistics , 32(1):1–13.Beare, B. K. and Fang, Z. (2017). Weak convergence of the least concave majorant of estimators fora concave distribution function.

Electronic Journal of Statistics , 11(2):3841–3870.Beare, B. K. and Moon, J.-M. (2015). Nonparametric tests of density ratio ordering.

EconometricTheory , 31(3):471–492.Beare, B. K. and Shi, X. (2019). An improved bootstrap test of density ratio ordering.

Econometricsand Statistics , 10:9–26.Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling.Working Paper 4483, National Bureau of Economic Research.Cawley, J. and Meyerhoefer, C. (2012). The medical care costs of obesity: An instrumental variablesapproach.

Journal of Health Economics , 31(1):219–230.Chernozhukov, V., Kim, W., Lee, S., and Rosen, A. M. (2015). Implementing intersection bounds instata.

The Stata Journal , 15(1):21–44.Chernozhukov, V., Lee, S., and Rosen, A. M. (2013). Intersection bounds: Estimation and inference.

Econometrica , 81(2):667–737. 24hetverikov, D. (2018). Adaptive tests of conditional moment inequalities.

Econometric Theory ,34(1):186–227.Davydov, Y. A., Lifshits, M. A., and Smorodina, N. V. (1998).

Local Properties of Distributions ofStochastic Functionals , volume 173. American Mathematical Society.Donald, S. G. and Hsu, Y.-C. (2016). Improving the power of tests of stochastic dominance.

Econo-metric Reviews , 35(4):553–585.D¨umbgen, L. (1993). On nondifferentiable functions and the bootstrap.

Probability Theory andRelated Fields , 95(1):125–140.Eren, O. and Ozbeklik, S. (2014). Who beneﬁts from Job Corps? A distributional analysis of anactive labor market program.

Journal of Applied Econometrics , 29(4):586–611.Fang, Z. and Santos, A. (2018). Inference on directionally differentiable functions.

The Review ofEconomic Studies , 86(1):377–412.Folland, G. B. (1999).

Real Analysis: Modern Techniques and Their Applications . John Wiley & Sons.Fr¨olich, M. and Melly, B. (2013). Unconditional quantile treatment effects under endogeneity.

Jour-nal of Business & Economic Statistics , 31(3):346–357.Giacomini, R., Politis, D. N., and White, H. (2013). A warp-speed method for conducting MonteCarlo experiments involving bootstrap estimators.

Econometric Theory , 29(3):567–589.Hansen, B. E. (2017). Regression kink with an unknown threshold.

Journal of Business & EconomicStatistics , 35(2):228–240.Heckman, J. J. and Pinto, R. (2018). Unordered monotonicity.

Econometrica , 86(1):1–35.Heckman, J. J., Urzua, S., and Vytlacil, E. (2006). Understanding instrumental variables in modelswith essential heterogeneity.

The Review of Economics and Statistics , 88(3):389–432.Heckman, J. J., Urzua, S., and Vytlacil, E. (2008). Instrumental variables in models with multipleoutcomes: The general unordered case.

Annales d’Economie et de Statistique , pages 151–174.Heckman, J. J. and Vytlacil, E. (2005). Structural equations, treatment effects, and econometricpolicy evaluation.

Econometrica , 73(3):669–738.Heckman, J. J. and Vytlacil, E. J. (2007). Econometric evaluation of social programs, part ii: Usingthe marginal treatment effect to organize alternative econometric estimators to evaluate socialprograms, and to forecast their effects in new environments. In

Handbook of Econometrics , pages4875–5143. Amsterdam: Elsevier.Hirano, K. and Porter, J. R. (2012). Impossibility results for nondifferentiable functionals.

Econo-metrica , 80(4):1769–1790. 25ong, H. and Li, J. (2018). The numerical delta method.

Journal of Econometrics , 206(2):379–394.Horv´ath, L., Kokoszka, P., and Zitikis, R. (2006). Testing for stochastic dominance using the weightedMcFadden-type statistic.

Journal of Econometrics , 133(1):191–205.Hsu, Y.-C., Liu, C.-A., and Shi, X. (2019). Testing generalized regression monotonicity.

EconometricTheory , 35(6):1146–1200.Huber, M. and Mellace, G. (2015). Testing instrument validity for LATE identiﬁcation based oninequality moment constraints.

Review of Economics and Statistics , 97(2):398–411.Huber, M. and W¨uthrich, K. (2018). Local average and quantile treatment effects under endogeneity:A review.

Journal of Econometric Methods , 8(1).Imbens, G. (2014). Instrumental variables: An econometrician’s perspective. Technical report,National Bureau of Economic Research.Imbens, G. W. and Angrist, J. D. (1994). Identiﬁcation and estimation of local average treatmenteffects.

Econometrica , 62(2):467–475.Imbens, G. W. and Rubin, D. B. (1997). Estimating outcome distributions for compliers in instru-mental variables models.

The Review of Economic Studies , 64(4):555–574.Imbens, G. W. and Rubin, D. B. (2015).

Causal Inference in Statistics, Social, and Biomedical Sciences .Cambridge University Press.Kitagawa, T. (2015). A test for instrument validity.

Econometrica , 83(5):2043–2063.Koenker, R., Chernozhukov, V., He, X., and Peng, L. (2017).

Handbook of Quantile Regression . CRCPress.Lee, S., Song, K., and Whang, Y.-J. (2013). Testing functional inequalities.

Journal of Econometrics ,172(1):14–32.Linton, O., Song, K., and Whang, Y.-J. (2010). An improved bootstrap test of stochastic dominance.

Journal of Econometrics , 154(2):186–202.Marshall, J. (2016). Coarsening bias: How coarse treatment measurement upwardly biases instru-mental variable estimates.

Political Analysis , 24(2):157–171.Melly, B. and W¨uthrich, K. (2017). Local quantile treatment effects. In

Handbook of Quantile Regres-sion , pages 145–164. Chapman and Hall/CRC.Mouriﬁ´e, I. and Wan, Y. (2017). Testing local average treatment effect assumptions.

Review ofEconomics and Statistics , 99(2):305–313.Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomizedstudies.

Journal of Educational Psychology , 66(5):688.26eo, J. (2018). Tests of stochastic monotonicity with improved power.

Journal of Econometrics ,207(1):53–70.Splawa-Neyman, J., Dabrowska, D. M., and Speed, T. (1990). On the application of probabilitytheory to agricultural experiments. Essay on principles. Section 9.

Statistical Science , pages 465–472.Sun, Z. and Beare, B. K. (2019). Improved nonparametric bootstrap tests of Lorenz dominance.

Journal of Business & Economic Statistics . Forthcoming.van der Vaart, A. W. and Wellner, J. A. (1996).

Weak Convergence and Empirical Processes . Springer.Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result.

Econometrica , 70(1):331–341.Vytlacil, E. (2006). Ordered discrete-choice selection models and local average treatment effectassumptions: Equivalence, nonequivalence, and representation results.

The Review of Economicsand Statistics , 88(3):578–581. 27 nstrument Validity for Heterogeneous Causal EffectsSupplementary Appendix

Zhenting SunNational School of Development, Peking [email protected] 7, 2020

The appendix consists of three sections. Section A provides auxiliary theorems and lemmas, someof which may be of independent interest, such as the extended continuous mapping theorem andthe extended delta method. Section B contains the proofs of the main results in the paper. Section Cshows the power comparisons between the proposed test and the test of Kitagawa (2015) via MonteCarlo simulations.

A Auxiliary Results

We follow van der Vaart and Wellner (1996) to introduce some notation we use multiple times inthe appendix. Let (Ω , A , P ) be an arbitrary probability space. For an arbitrary map T : Ω → ¯ R , wedeﬁne the outer integral or outer expectation of T with respect to P by E ∗ [ T ] = inf (cid:8) E [ U ] : U ≥ T, U : Ω → ¯ R measurable and E [ U ] exists (cid:9) . The outer probability of an arbitrary subset B of Ω is P ∗ ( B ) = inf { P ( A ) : A ⊃ B, A ∈ A} . The inner integral (or inner expectation) and the inner probability can be deﬁned as E ∗ [ T ] = − E ∗ [ − T ] and P ∗ ( B ) = 1 − P ∗ (Ω \ B ) , respectively. We denote a minimal measurable majorant of T (resp. a maximal measurable minorant)by T ∗ (resp. T ∗ ), which always exists by Lemma 1.2.1 of van der Vaart and Wellner (1996). Suppose T is a real-valued map deﬁned on an arbitrary product probability space (Ω × Ω , A × A , P × P ) .We write E ∗ [ T ] for the outer expectation as before, and for every ω , we deﬁne E ∗ [ T ]( ω ) = inf Z U ( ω ) d P ( ω ) , (A.1)1here the inﬁmum is taken over all measurable functions U : Ω → ¯ R with U ( ω ) ≥ T ( ω , ω ) forall ω such that R U d P exists. Then E ∗ [ E ∗ [ T ]] is the outer integral of the function E ∗ [ T ] : Ω → ¯ R ,and we call E ∗ [ E ∗ [ T ]] the repeated outer expectation. We deﬁne the repeated inner expectation E ∗ [ E ∗ [ T ]] analogously. Theorem A.1 (Extended continuous mapping)

Let D and E be metric spaces with metrics d and e ,respectively. Let D ⊂ D . Let X be Borel measurable and take values in D . Suppose, in addition, thateither of the following conditions holds:(a) Let D n ⊂ D . Let X n : Ω → D with X n ( ω ) ∈ D n for all ω ∈ Ω and all n . Let g n be a randommap with g n ( ω ) : D n → E (for every ω ∈ Ω , g n ( ω ) is a map on D n ). The random map g n satisﬁes the condition that for every ε > there is a measurable set A ⊂ Ω with P ( A ) ≥ − ε such that if x n → x with x n ∈ D n and x ∈ D , then g n ( x n ) converges to g ( x ) uniformly on A ( sup ω ∈ A e ( g n ( ω )( x n ) , g ( x )) → ), where g : D → E is a ﬁxed (deterministic) map. Also, X isseparable.(b) Let D n ( ω ) ⊂ D for all ω ∈ Ω and all n . Let X n : Ω → D with X n ( ω ) ∈ D n ( ω ) for all ω ∈ Ω and all n . Let g n be a random map with g n ( ω ) : D n ( ω ) → E (for every ω ∈ Ω , g n ( ω ) is a map on D n ( ω ) ).The random map g n satisﬁes the condition that for every ε > there is a measurable set A ⊂ Ω with P ( A ) ≥ − ε such that for every subsequence { x n m } , if x n m → x with x n m ∈ D n m ( ω n m ) , ω n m ∈ A , and x ∈ D , then g n m ( ω n m ) ( x n m ) converges to g ( x ) , where g : D → E is a ﬁxedcontinuous map.Then we have that(i) X n X implies that g n ( X n ) g ( X ) ;(ii) If X n converges to X in outer probability, then g n ( X n ) converges to g ( X ) in outer probability;(iii) If X n converges to X outer almost surely, then g n ( X n ) converges to g ( X ) outer almost surely. Remark A.1

Theorem A.1 is an extension of Theorem 1.11.1 (extended continuous mapping) of van derVaart and Wellner (1996). Theorem 1.11.1 of van der Vaart and Wellner (1996) assumes that every g n is a ﬁxed map. Theorem A.1 allows every g n to be random. Theorem A.1(i) will be used to establishTheorem A.2 (extended delta method). Proof of Theorem A.1.

Suppose Condition (a) holds. Assume the weakest of the three assumptions:the one in (i) that X n X . First , let D ∞ be the set of all x for which there exists a sequence { x n } with x n ∈ D n and x n → x . By the representation theorem (see, for example, Theorem 9.4 of Pollard(1990) or Theorem 1.10.4 of van der Vaart and Wellner (1996)), along the lines of the second Additional technical details about the repeated expectations can be found in van der Vaart and Wellner (1996, pp. 10–12). This is a condition similar to almost uniform convergence. See Deﬁnition 1.9.1(ii) of van der Vaart and Wellner (1996).By Lemma 1.9.2(iii) of van der Vaart and Wellner (1996), almost uniform convergence is equivalent to outer almost sureconvergence if the limit is Borel measurable. See Deﬁnition 1.9.1(i) of convergence in outer probability in van der Vaart and Wellner (1996). See Deﬁnition 1.9.1(iii) of outer almost sure convergence in van der Vaart and Wellner (1996). P ∗ ( X ∈ D ∞ ) = 1 . Second , ﬁx ε and a measurable set A with P ( A ) ≥ − ε that satisﬁes theassumptions, and suppose there is some subsequence such that x n ′ → x with x n ′ ∈ D n ′ for all n ′ and x ∈ D ∩ D ∞ . Since x ∈ D ∞ , there is a sequence y n → x with y n ∈ D n for all n . Fill out thesubsequence x n ′ to an entire sequence by putting x n = y n for all n / ∈ { n ′ } . Then by assumption, g n ( x n ) → g ( x ) uniformly on A on this entire sequence, hence also on the subsequence, that is, g n ′ ( x n ′ ) → g ( x ) uniformly on A . Third , let x m → x in D ∩ D ∞ . For every m , there is a sequence y m,n ∈ D n with y m,n → x m as n → ∞ . Fix a small ε > and a measurable set A with P ( A ) ≥ − ε that satisﬁes the assumptions. Now we have that g n ( y m,n ) → g ( x m ) uniformly on A . For every m , take n m such that | y m,n m − x m | < /m and | g n m ( y m,n m ) − g ( x m ) | < /m uniformly on A andsuch that n m is increasing in m . Then y m,n m → x , and hence g n m ( y m,n m ) → g ( x ) uniformly on A . Since | g ( x m ) − g ( x ) | ≤ | g n m ( y m,n m ) − g ( x m ) | + | g n m ( y m,n m ) − g ( x ) | uniformly on A , we have | g ( x m ) − g ( x ) | → . Thus g is continuous on D ∩ D ∞ .For simplicity of notation, we will write D for D ∩ D ∞ . Without loss of generality, we assumethat X takes its values in D . Since g is continuous on D now, g ( X ) is Borel measurable.(i). Let F be an arbitrary closed set in E . By the assumptions, for every ε > there is ameasurable set A ⊂ Ω with P ( A ) ≥ − ε such that if x n → x with x n ∈ D n and x ∈ D , then g n ( x n ) converges to g ( x ) uniformly on A , that is, sup ω ∈ A | g n ( ω ) ( x n ) − g ( x ) | → . Fix ε and A . Then ∩ ∞ k =1 ∪ ∞ m = k ∪ ω ∈ A ( g m ( ω )) − ( F ) ⊂ g − ( F ) ∪ ( D − D ) . (A.2)Suppose x is an element of the set on the left-hand side of (A.2). For every n , there exist n ′ > n , ω n ′ ∈ A , and x n ′ ∈ g n ′ ( ω n ′ ) − ( F ) ⊂ D n ′ such that d ( x n ′ , x ) ≤ /n . Therefore, there is a subsequence x n m ∈ g n m ( ω n m ) − ( F ) ⊂ D n m with ω n m ∈ A such that n m ↑ ∞ and x n m → x as m → ∞ . By thedeﬁnition of A , either g n m ( ω n m )( x n m ) → g ( x ) or x / ∈ D . Since F is closed, this implies that g ( x ) ∈ F or x / ∈ D . Then for every k , lim sup n →∞ P ∗ ( g n ( X n ) ∈ F ) ≤ lim sup n →∞ P ∗ (cid:16)nn X n ∈ ∪ ∞ m = k g − m ( F ) o ∩ A o ∪ A c (cid:17) = lim sup n →∞ E h(cid:16) nn X n ∈ ∪ ∞ m = k g − m ( F ) o ∩ A o ∨ { A c } (cid:17) ∗ i , (A.3)where the equality is from Lemmas 1.2.3(i) and 1.2.1 of van der Vaart and Wellner (1996). Then byLemmas 1.2.2(viii), 1.2.1, and 1.2.3(i) of van der Vaart and Wellner (1996), E h(cid:16) nn X n ∈ ∪ ∞ m = k g − m ( F ) o ∩ A o ∨ { A c } (cid:17) ∗ i = E h(cid:16) nn X n ∈ ∪ ∞ m = k g − m ( F ) o ∩ A o(cid:17) ∗ ∨ (1 { A c } ) i ≤ P ∗ (cid:16)n X n ∈ ∪ ∞ m = k g − m ( F ) o ∩ A (cid:17) + P ( A c ) . (A.4)By (A.3) and (A.4), together with Theorem 1.3.4(iii) (portmanteau) of van der Vaart and Wellner31996), we have lim sup n →∞ P ∗ ( g n ( X n ) ∈ F ) ≤ lim sup n →∞ P ∗ (cid:16)n X n ∈ ∪ ∞ m = k g − m ( F ) o ∩ A (cid:17) + P ( A c ) ≤ lim sup n →∞ P ∗ (cid:16) X n ∈ ∪ ∞ m = k ∪ ω ∈ A ( g m ( ω )) − ( F ) (cid:17) + ε ≤ P (cid:16) X ∈ ∪ ∞ m = k ∪ ω ∈ A ( g m ( ω )) − ( F ) (cid:17) + ε. Letting k → ∞ together with (A.2) gives lim sup n →∞ P ∗ ( g n ( X n ) ∈ F ) ≤ P (cid:16) X ∈ ∩ ∞ k =1 ∪ ∞ m = k ∪ ω ∈ A ( g m ( ω )) − ( F ) (cid:17) + ε ≤ P ( g ( X ) ∈ F ) + ε. Since ε can be arbitrarily small, we can conclude that lim sup n →∞ P ∗ ( g n ( X n ) ∈ F ) ≤ P ( g ( X ) ∈ F ) .By Theorem 1.3.4(iii) of van der Vaart and Wellner (1996) again, g n ( X n ) g ( X ) .(ii). Choose δ n ↓ with P ∗ ( d ( X n , X ) ≥ δ n ) → . Fix ε > . Let A ⊂ Ω be a measurable setwith P ( A ) ≥ − ε that satisﬁes the assumptions. Let B n ( ω ) be the set of all x such that there isa y ∈ D n with d ( y, x ) < δ n and e ( g n ( ω ) ( y ) , g ( x )) > ε . Let B n = ∪ ω ∈ A B n ( ω ) . Suppose x ∈ B n for inﬁnitely many n . Then there are sequences ω n m ∈ A and x n m ∈ D n m with x n m → x suchthat e ( g n m ( ω n m ) ( x n m ) , g ( x )) > ε for each m . This implies that x n m → x with x n m ∈ D n m butthat g n m ( x n m ) does not converge to g ( x ) uniformly on A . Thus by assumption, x / ∈ D . Notethat x ∈ lim sup B n is equivalent to x ∈ B n for inﬁnitely many n . Thus we can conclude that lim sup B n ∩ D = ∅ . Since g is continuous on D , B n ∩ D is relatively open in D and hence relativelyBorel. This is because if z ∈ D is close enough to x ∈ B n ∩ D , then d ( y, z ) ≤ d ( y, x ) + d ( x, z ) < δ n and e ( g n ( ω ) ( y ) , g ( z )) ≥ e ( g n ( ω ) ( y ) , g ( x )) − e ( g ( z ) , g ( x )) > ε . Since X takes values in D byassumption, by Lemma 1.2.3(i) of van der Vaart and Wellner (1996), P ∗ ( X ∈ B n ) = E ∗ [1 { X ∈ B n } ] = E [1 { X ∈ B n ∩ D } ] . Also, by the dominated convergence theorem, E [1 { X ∈ B n ∩ D } ] ≤ E [1 { X ∈ ∪ ∞ m = n ( B m ∩ D ) } ] → E [1 { X ∈ ∩ ∞ n =1 ∪ ∞ m = n ( B m ∩ D ) } ] = P ( X ∈ lim sup B n ∩ D ) = 0 . This implies that P ∗ ( X ∈ B n ) → as n → ∞ . Now we have that P ∗ ( e ( g n ( X n ) , g ( X )) > ε ) ≤ P ∗ ( { e ( g n ( X n ) , g ( X )) > ε } ∩ A ) + P ( A c ) ≤ P ∗ ( X ∈ B n or d ( X n , X ) ≥ δ n ) + ε → ε. Since ε is arbitrary, the claim holds.(iii). By Lemmas 1.9.3(i) and 1.9.2(iii) of van der Vaart and Wellner (1996), it sufﬁces toprove that sup m ≥ n e ( g m ( X m ) , g ( X )) converges to in outer probability. Choose δ n ↓ with P ∗ (cid:0) sup m ≥ n d ( X m , X ) ≥ δ n (cid:1) → . Fix ε > . Let A ⊂ Ω be a measurable set with P ( A ) ≥ − ε such that if x n → x with x n ∈ D n and x ∈ D , then g n ( x n ) converges to g ( x ) uniformly on4 . Let B n ( ω ) be the set of all x such that there are m ≥ n and y ∈ D m with d ( y, x ) < δ n and e ( g m ( ω ) ( y ) , g ( x )) > ε . Let B n = ∪ ω ∈ A B n ( ω ) . Then we can ﬁnish the proof along the lines of theproof of (ii).Suppose Condition (b) holds. Repeat the proofs of (i), (ii), and (iii) under Condition (a) withthe properties of g n and g under Condition (b). For (ii), let B n ( ω ) be the set of all x such that thereis a y ∈ D n ( ω ) with d ( y, x ) < δ n and e ( g n ( ω ) ( y ) , g ( x )) > ε . For (iii), let B n ( ω ) be the set of all x such that there are m ≥ n and y ∈ D m ( ω ) with d ( y, x ) < δ n and e ( g m ( ω ) ( y ) , g ( x )) > ε . Thekey difference is that Condition (a) requires that X n ( ω ) ∈ D n for all ω holds for some ﬁxed D n .Condition (b) only requires that X n ( ω ) ∈ D n ( ω ) for all ω holds for some random D n which can takedifferent values D n ( ω ) for different ω . On the other hand, Condition (b) strengthens the propertiesof g n and g so that the claims hold as well. Theorem A.2 (Extended delta method)

Let D and E be metric spaces, and let r n be constants with r n → ∞ . Let ˆ φ n : Ω → D F ⊂ D be a random element for every n . Let D ⊂ D .(i) Let F : D F → E satisfy the condition that for every ε > , there is a measurable set A ⊂ Ω with P ( A ) ≥ − ε such that for some map F ′ φ on D , r n ( F ( ˆ φ n + r − n h n ) − F ( ˆ φ n )) → F ′ φ ( h ) uniformly on A for every convergent sequence { h n } ⊂ D with ˆ φ n ( ω ) + r − n h n ∈ D F for all n and all ω and h n → h ∈ D . If X n : Ω → D F are maps with X n ( ω ) − ˆ φ n ( ω ) + ˆ φ n ( ω ′ ) ∈ D F for all ω, ω ′ ∈ Ω and r n ( X n − ˆ φ n ) X , where X is separable and takes its values in D , then r n ( F ( X n ) − F ( ˆ φ n )) F ′ φ ( X ) . Moreover, if F ′ φ is continuous on all of D , then r n ( F ( X n ) −F ( ˆ φ n )) − F ′ φ ( r n ( X n − ˆ φ n )) converges to zero in outer probability.(ii) Let F : D F → E satisfy the condition that for every ε > , there is a measurable set A ⊂ Ω with P ( A ) ≥ − ε such that for some continuous map F ′ φ on D , r n m {F ( ˆ φ n m ( ω n m ) + r − n m h n m ) − F ( ˆ φ n m ( ω n m )) } → F ′ φ ( h ) for every convergent subsequence { h n m } ⊂ D with ˆ φ n m ( ω n m ) + r − n m h n m ∈ D F , ω n m ∈ A , and h n m → h ∈ D . If X n : Ω → D F are maps with r n ( X n − ˆ φ n ) X , where X takes its valuesin D , then r n ( F ( X n ) − F ( ˆ φ n )) F ′ φ ( X ) . Moreover, if F ′ φ is continuous on all of D , then r n ( F ( X n ) − F ( ˆ φ n )) − F ′ φ ( r n ( X n − ˆ φ n )) converges to zero in outer probability. Remark A.2

Theorem A.2 is an extension of Theorem 3.9.5 (delta method) of van der Vaart and Wellner(1996). Here, ˆ φ n is allowed to be random, which is the key difference between the two theorems.Theorem A.2 is used to establish the asymptotic distribution of the test statistic under null. Proof of Theorem A.2. (i). The proof mainly relies on the results of Theorem A.1. Deﬁne D n ( ω ) = { h ∈ D : ˆ φ n ( ω ) + r − n h ∈ D F } for every n and every ω ∈ Ω . Let D n = ∩ ω ∈ Ω D n ( ω ) . Deﬁne5 n ( ω ) ( h ) = r n ( F ( ˆ φ n ( ω ) + r − n h ) − F ( ˆ φ n ( ω ))) for every n , every ω ∈ Ω , and every h ∈ D n . Here, g n is a random map because of ˆ φ n . For every n and every ω ∈ Ω , g n ( ω ) : D n → E . By the assumptions,for every ε > there is a measurable set A ⊂ Ω with P ( A ) ≥ − ε such that if h n ∈ D n with h n → h ∈ D , then g n ( h n ) → F ′ φ ( h ) uniformly on A . Also, r n ( X n ( ω ) − ˆ φ n ( ω )) ∈ D n for all ω byassumption. Now by Theorem A.1(i) (under Condition (a)), r n ( F ( X n ) − F ( ˆ φ n )) = g n ( r n ( X n − ˆ φ n )) F ′ φ ( X ) . Moreover, suppose F ′ φ is continuous on all of D , and let f n ( h ) = ( g n ( h ) , F ′ φ ( h )) for every h ∈ D n .By Theorem A.1(i) again, (cid:16) r n ( F ( X n ) − F ( ˆ φ n )) , F ′ φ ( r n ( X n − ˆ φ n )) (cid:17) = f n ( r n ( X n − ˆ φ n )) (cid:0) F ′ φ , F ′ φ (cid:1) ( X ) . Thus by Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner (1996), r n ( F ( X n ) −F ( ˆ φ n )) − F ′ φ ( r n ( X n − ˆ φ n )) . The claim follows from Lemma 1.10.2(iii) of van der Vaart andWellner (1996).(ii). Together with the continuity of F ′ φ , by arguments similar to the proof of (i), we can showthat the claim holds by Theorem A.1(i) (under Condition (b)). Lemma A.1

Let P be the set of probability measures deﬁned in Section 3. Let H , ¯ H , H , ¯ H , H , and ¯ H be as in (9) . Then for every Q ∈ P , the closures of H and H in L ( Q ) are equal to ¯ H and ¯ H ,respectively. Also, the closure of H in L ( Q ) is equal to ¯ H for every Q ∈ P . Proof of Lemma A.1.

Let H d = { ( − d · B ×{ d }× R : B is a closed interval in R } for d ∈ { , } . Weﬁrst show that the closure of H d in L ( Q ) is equal to ¯ H d = n ( − d · B ×{ d }× R : B is a closed, open, or half-closed interval in R o . If this is true, the ﬁrst claim of the Lemma follows from ¯ H = ¯ H ∪ ¯ H .Suppose there is a sequence { h n } ⊂ H d such that k h n − h k L ( Q ) → for some h ∈ L ( Q ) .Then h n is a Cauchy sequence, that is, k h n − h m k L ( Q ) → as n, m → ∞ . By the deﬁnition of H d , h n = ( − d · B n ×{ d }× R , where B n is a closed interval in R . It is possible that R B n ×{ d }× R d Q → ,and in this case there is a B = { a } for some a ∈ R such that Q ( B × R × R ) = 0 and h n → ( − d · B ×{ d }× R ∈ H d . If R B n ×{ d }× R d Q , then there is an ε > such that for all n ε > , thereis an n > n ε such that k h n k L ( Q ) > ε . For a δ ≪ ε , there is an N such that k h n − h m k L ( Q ) < δ forall m, n > N . Thus there is an n > N such that k h n k L ( Q ) > ε and k h n − h n k L ( Q ) < δ for all n > N . Now let δ be such that < δ ≪ δ . Then there is an N > n such that k h n − h m k L ( Q ) <δ for all m, n > N . Thus there is an n > N such that k h n k L ( Q ) > ε and k h n − h n k L ( Q ) <δ for all n > N . In this way, we can ﬁnd a sequence { h n k } k with h n k = ( − d · B nk ×{ d }× R , k h n k k L ( Q ) > ε , k h n − h n k k L ( Q ) < δ k for all n > n k , and δ k ↓ . Let B ∞ = ∪ ∞ j =1 ∩ ∞ k = j B n k . For6very K , k h n k − h n K k L ( Q ) < δ K for all k > K . Notice that for every K ′ > K , k h n K − ( − d · ( ∩ ∞ k = K ′ B nk ) ×{ d }× R k L ( Q ) = Z | B nK ×{ d }× R − ( ∩ ∞ k = K ′ B nk ) ×{ d }× R | d Q = Z B nK \ ( ∩ ∞ k = K ′ B nk ) ×{ d }× R d Q + Z ( ∩ ∞ k = K ′ B nk ) \ B nK ×{ d }× R d Q. Because B n k is a closed interval for all k , we have that for every K ′′ ≥ K ′ , there exist L and L with K ′ ≤ L ≤ L ≤ K ′′ such that ∪ K ′′ k = K ′ ( B n K \ B n k ) = ( B n K \ B n L ) ∪ ( B n K \ B n L ) . Then since k h n k − h n K k L ( Q ) = Q ( B n K \ B n k × { d } × R ) + Q ( B n k \ B n K × { d } × R ) < δ K for all k > K , we have Z B nK \ ( ∩ ∞ k = K ′ B nk ) ×{ d }× R d Q = Q ( B n K \ ( ∩ ∞ k = K ′ B n k ) × { d } × R )= Q ( ∪ ∞ k = K ′ ( B n K \ B n k ) × { d } × R ) ≤ δ K . Similarly, it is easy to show that R ( ∩ ∞ k = K ′ B nk ) \ B nK ×{ d }× R d Q ≤ δ K . Thus it follows that k h n K − ( − d · ( ∩ ∞ k = K ′ B nk ) ×{ d }× R k L ( Q ) ≤ δ K , which is true for all K ′ > K . Letting K ′ → ∞ , by the dominated convergence theorem ( B ∞ = ∪ ∞ j =1 ∩ ∞ k = j B n k ) we have k h n K − ( − d · B ∞ ×{ d }× R k L ( Q ) ≤ δ K . This implies that k h n K − ( − d · B ∞ ×{ d }× R k L ( Q ) → as K → ∞ , because δ K ↓ . Finally, we have k h n − ( − d · B ∞ ×{ d }× R k L ( Q ) ≤ k h n − h n K k L ( Q ) + k h n K − ( − d · B ∞ ×{ d }× R k L ( Q ) → . Clearly, B ∞ can be a closed, open, or half-closed interval in R . Also, every element of ¯ H d is equal tothe limit of a sequence of elements of H d under the L ( Q ) norm. Thus the closure of H d in L ( Q ) is equal to ¯ H d for every Q ∈ P . Similarly, we can show that the closure of H in L ( Q ) is equal to ¯ H for every Q ∈ P . As a result, the closure of H = H ∪ H in L ( Q ) is equal to ¯ H = ¯ H ∪ ¯ H forevery Q ∈ P . Lemma A.2

Let H and H be deﬁned as in (9) . Then H is a VC class with VC index V ( H ) = 3 ,and H is a VC class with VC index V ( H ) = 2 . Proof of Lemma A.2.

All the functions h ∈ H take the form h = − B ×{ }× R or h = 1 B ×{ }× R ,where B is a closed interval. If h = − B ×{ }× R , the subgraph of h is C B = (cid:8) ( y, w, z, t ) ⊂ R : t < − B ×{ }× R ( y, w, z ) (cid:9) . See the deﬁnition of VC class of functions in van der Vaart and Wellner (1996, p. 141). h = 1 B ×{ }× R , the subgraph of h is C B = (cid:8) ( y, w, z, t ) ⊂ R : t < B ×{ }× R ( y, w, z ) (cid:9) . Let C = { C dB : B is a closed interval in R , d ∈ { , }} . Suppose there are two different points a = ( y , w , z , t ) , a = ( y , w , z , t ) ∈ R with y

Let H be deﬁned as in (9) . Then H is totally bounded under k·k L r ( Q ) for every probabilitymeasure Q ∈ P and every r ≥ . Proof of Lemma A.3.

Let N ( ε, H j , L r ( Q )) denote the covering number under the L r ( Q ) norm for H j for j ∈ { , } and all ε > , where H j is deﬁned as in (9). Since H and H are VC classes by8emma A.2 with V ( H ) = 3 and V ( H ) = 2 , by Theorem 2.6.7 of van der Vaart and Wellner (1996)with envelope function F = 1 and r ≥ we have that for every probability measure Q , N ( ε, H , L r ( Q )) ≤ K e ) (1 /ε ) r and N ( ε, H , L r ( Q )) ≤ K e ) (1 /ε ) r for universal constants K , K ≥ and every ε ∈ (0 , . Since H = H ∪ H , we have N ( ε, H , L r ( Q )) ≤ K e ) (1 /ε ) r + K e ) (1 /ε ) r , (A.5)which implies that H is totally bounded. Lemma A.4

Let ¯ H be as in (9) . Then ¯ H is compact under k·k L ( Q ) for every Q ∈ P . Proof of Lemma A.4.

By Lemma A.3, H is totally bounded under k·k L ( Q ) for all Q ∈ P . Supposethat H ⊂ S j ∈ J B ε/ ( h j ) , where J is a ﬁnite index set and B ε/ ( h j ) is an open ball with center h j and radius ε/ under k · k L ( Q ) . By Lemma A.1, ¯ H is equal to the closure of H in L ( Q ) . Clearly, ¯ H ⊂ S j ∈ J B ε/ ( h j ) ⊂ S j ∈ J B ε ( h j ) , and therefore N ( ε, ¯ H , L ( Q )) ≤ N ( ε/ , H , L ( Q )) , (A.6)which, together with (A.5), implies that ¯ H is totally bounded. Since L ( Q ) is complete, ¯ H is compactin L ( Q ) .Let ¯ H and G K be deﬁned as in (9). Let V = (cid:8) h · f : h ∈ ¯ H , f ∈ G K (cid:9) . Then deﬁne ˜ V = V ∪ G K . (A.7) Lemma A.5

The function space ˜ V is Donsker and pre-Gaussian uniformly in Q ∈ P . Proof of Lemma A.5.

For every δ > and every Q ∈ P , deﬁne ˜ V δ,Q = n v − v ′ : v, v ′ ∈ ˜ V , k v − v ′ k L ( Q ) < δ o and ˜ V ∞ = n ( v − v ′ ) : v, v ′ ∈ ˜ V o . First, we show that ˜ V δ,Q is Q -measurable for all Q ∈ P . Similarly to the construction of H , weconstruct function spaces by H q = n ( − d · B ×{ d }× R : B = [ a, b ] , a, b ∈ Q , a ≤ b, d ∈ { , } o , H q = { R × C × R : C = ( −∞ , c ] , c ∈ Q } , and H q = H q ∪ H q , where Q denotes the set of all rational numbers. Now deﬁne ˜ V q = { h · f : h ∈ H q , f ∈ G K } ∪ G K and ˜ V qδ,Q = n v − v ′ : v, v ′ ∈ ˜ V q , k v − v ′ k L ( Q ) < δ o . See Deﬁnition 2.3.3 of Q -measurable class in van der Vaart and Wellner (1996).

9y construction, G K is a ﬁnite set. Since Q is countable (and therefore the set of ordered pairs ofelements of Q is countable), H q and H q are countable (and therefore H q and ˜ V q are countable).Clearly, ˜ V qδ,Q is a countable subset of ˜ V δ,Q . For every v ∈ ˜ V , there is a sequence { v m } ⊂ ˜ V q such that v m → v pointwise, because Q is dense in R . For example, if v = ( − d · √ , √ ] ×{ d }× R · R × R ×{ z k } , we can ﬁnd v m = ( − d · [ a m ,b m ] ×{ d }× R · R × R ×{ z k } with a m ↓ √ , b m ↓ √ , and a m , b m ∈ Q . Suppose v − v ′ ∈ ˜ V δ,Q and v m , v ′ m ∈ ˜ V q such that v m → v and v ′ m → v ′ pointwise.It is easy to show that k v m − v ′ m k L ( Q ) < δ for large m , that is, v m − v ′ m ∈ ˜ V qδ,Q for large m . ByExample 2.3.4 of van der Vaart and Wellner (1996), ˜ V δ,Q is Q -measurable, and this is true for all δ > . Similarly, ˜ V ∞ is Q -measurable.By the construction of ˜ V , F = 1 is a measurable envelope function with R F d Q < ∞ . Also, lim M →∞ sup Q ∈P R F · { F > M } d Q = 0 . For all H ∈ P and all ε ≥ , N (cid:16) ε k F k L ( H ) , ˜ V , L ( H ) (cid:17) = N (cid:16) ε, ˜ V , L ( H ) (cid:17) = 1 . (A.8)For all H ∈ P and all ε > , N (cid:0) ε, V , L ( H ) (cid:1) ≤ N (cid:16) ε , ¯ H , L ( H ) (cid:17) · N (cid:16) ε , G K , L ( H ) (cid:17) ≤ K · N (cid:16) ε , ¯ H , L ( H ) (cid:17) , (A.9)where K is the number of elements in G K . Thus by the deﬁnition of ˜ V in (A.7), N (cid:16) ε, ˜ V , L ( H ) (cid:17) ≤ K · N (cid:16) ε , ¯ H , L ( H ) (cid:17) + K (A.10)for all H ∈ P and all ε > . Let Q denote the set of ﬁnitely discrete probability measures. Theresults in (A.5), (A.6), (A.8), and (A.10) imply that Z ∞ sup H ∈Q r log N (cid:16) ε k F k L ( H ) , ˜ V , L ( H ) (cid:17) d ε = Z sup H ∈Q r log N (cid:16) ε, ˜ V , L ( H ) (cid:17) d ε ≤ Z r log n K · ( K + K ) · · (16 e ) (4 /ε ) + K o d ε < ∞ . The claim of the Lemma follows from Theorem 2.8.3 of van der Vaart and Wellner (1996).

Lemma A.6

The function space ˜ V deﬁned in (A.7) is Glivenko–Cantelli uniformly in Q ∈ P . Proof of Lemma A.6.

Similarly to the proof of Lemma A.5, we can show that ˜ V is Q -measurablefor every Q ∈ P . With F = 1 being an envelope function of ˜ V , we have lim M →∞ sup Q ∈P R F · { F > M } d Q = 0 . Similarly to the proofs of Lemmas A.1, A.4, and A.5, we can show that for every Q ∈ P and every ε > , the closure of H in L ( Q ) is equal to ¯ H , N ( ε, ¯ H , L ( Q )) ≤ N ( ε/ , H , L ( Q )) ,and N ( ε, ˜ V , L ( Q )) ≤ K · N (cid:0) ε/ , ¯ H , L ( Q ) (cid:1) + K . Then by (A.5), with the envelope function F = 1 we can show that sup H ∈Q n log N ( ε k F k L ( H ) , ˜ V , L ( H )) = o ( n ) , where Q n is the collection of allpossible realizations of empirical measures of n observations. Then by Theorem 2.8.1 in van derVaart and Wellner (1996), ˜ V is Glivenko–Cantelli uniformly in Q ∈ P .10 emma A.7 Let H and G be deﬁned as in (9) , let ρ P be as in (16) , and deﬁne H × G as the closure of

H × G in L ( P ) × ( L ( P ) × L ( P )) under ρ P . Then N (cid:0) ε, H × G , ρ P (cid:1) = O (cid:0) /ε (cid:1) as ε → . Proof of Lemma A.7.

By the constructions of

H × G and the metric ρ P , N ( ε, H × G , ρ P ) ≤ N (cid:16) ε , H , L ( P ) (cid:17) · h N (cid:16) ε , G K , L ( P ) (cid:17)i , where G K is deﬁned as in (9). By the construction of G K , N (cid:0) ε/ , G K , L ( P ) (cid:1) ≤ K , where K is the number of elements in G K . This, together with (A.5), implies that N ( ε, H × G , ρ P ) = O (cid:0) /ε (cid:1) as ε → . Similarly to (A.6), N (cid:0) ε, H × G , ρ P (cid:1) ≤ N (cid:16) ε , H × G , ρ P (cid:17) = O (cid:18) ε (cid:19) as ε → . Lemma A.8

Let H and G be deﬁned as in (9) , and let ρ P be as in (16) . Then H × G , the closure of

H × G under ρ P in Lemma A.7, is compact and H × G = ¯

H × G , where ¯ H is deﬁned as in (9) . Proof of Lemma A.8.

The ﬁrst claim follows from Lemma A.7 and the fact that L ( P ) × ( L ( P ) × L ( P )) is complete under ρ P . The second claim holds by the constructions of ρ P and G . B Main Results

Proof of Lemma 2.1.

Suppose Assumption 2.2 holds. Then we can deﬁne Y d by Y d = Y dz = Y dz = · · · = Y dz K almost surely for all d ∈ D . First, suppose d max exists. Under Assumption 2.2, forall k with ≤ k ≤ K − and all Borel sets B , P ( Y ∈ B, D = d max | Z = z k ) = P ( Y d max ∈ B, D z k = d max )= X j P (cid:0) Y d max ∈ B, D z k = d max , D z k +1 = d j (cid:1) = P (cid:0) Y d max ∈ B, D z k = d max , D z k +1 = d max (cid:1) and P ( Y ∈ B, D = d max | Z = z k +1 ) = P (cid:0) Y d max ∈ B, D z k +1 = d max (cid:1) = X j P (cid:0) Y d max ∈ B, D z k = d j , D z k +1 = d max (cid:1) . Thus P ( Y ∈ B, D = d max | Z = z k +1 ) ≥ P ( Y ∈ B, D = d max | Z = z k ) . Second, suppose d min exists.Then similarly, P ( Y ∈ B, D = d min | Z = z k ) ≥ P ( Y ∈ B, D = d min | Z = z k +1 ) . Remark B.1

Lemmas 2.2, 2.3, and 2.4 can be proved analogously. emma B.1 Let D L = { R ∈ ℓ ∞ ( ˜ V ) : R ( h · g l ) /R ( g l ) exists for all h ∈ ¯ H and all g l ∈ G K } . Deﬁne L : D L ⊂ ℓ ∞ ( ˜ V ) → ℓ ∞ (cid:0) ¯ H × G (cid:1) by L ( R ) ( h, g ) = R ( h · g ) R ( g ) − R ( h · g ) R ( g ) for all R ∈ D L and all ( h, g ) ∈ ¯ H × G with g = ( g , g ) . Then L is uniformly Hadamard differentiable along every sequence P n → P in D L , tangentially to ℓ ∞ ( ˜ V ) , with the derivative L ′ P deﬁned by L ′ P ( H ) ( h, g ) = H ( h · g ) P ( g ) − P ( h · g ) H ( g ) P ( g ) − H ( h · g ) P ( g ) − P ( h · g ) H ( g ) P ( g ) (B.11) for all H ∈ ℓ ∞ ( ˜ V ) . Remark B.2

By the deﬁnition of L , L ( Q ) = φ Q for all Q ∈ P . We will apply Lemma B.1 along withthe suitable delta method to deduce the asymptotic distributions of √ n ( ˆ φ P n − φ P ) and the bootstrapversion of this random element. Proof of Lemma B.1.

For all t n → , P n → P , and H n → H in ℓ ∞ ( ˜ V ) such that P n ∈ D L and P n + t n H n ∈ D L , we have that for each ( h, g ) ∈ ¯ H × G with g = ( g , g ) , L ( P n + t n H n ) ( h, g ) − L ( P n ) ( h, g )= (cid:18) ( P n + t n H n ) ( h · g )( P n + t n H n ) ( g ) − ( P n + t n H n ) ( h · g )( P n + t n H n ) ( g ) (cid:19) − (cid:18) P n ( h · g ) P n ( g ) − P n ( h · g ) P n ( g ) (cid:19) = t n H n ( h · g ) P n ( g ) − t n P n ( h · g ) H n ( g )( P n + t n H n ) ( g ) P n ( g ) − t n H n ( h · g ) P n ( g ) − t n P n ( h · g ) H n ( g )( P n + t n H n ) ( g ) P n ( g ) . Thus it is easy to show that lim n →∞ sup ( h,g ) ∈ ¯ H×G (cid:12)(cid:12)(cid:12)(cid:12) L ( P n + t n H n ) ( h, g ) − L ( P n ) ( h, g ) t n − L ′ P ( H ) ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12) = 0 , where L ′ P is deﬁned as in (B.11). This implies that L is uniformly differentiable and veriﬁes thederivative in (B.11). Lemma B.2

Under Assumptions 3.1 and 3.2 with P n , P ∈ ℓ ∞ ( ˜ V ) , we have that sup v ∈ ˜ V |√ n ( P n − P )( v ) − Q ( v ) | → , where Q ( v ) = P ( vv ) for all v ∈ ˜ V and v is as in Assumption 3.2, and that √ n ( ˆ P n − P ) converges under P n in distribution to the process G P + Q for a tight P -Brownian bridge G P with E [ G P ( v ) G P ( v )] = P ( v v ) − P ( v ) P ( v ) for all v , v ∈ ˜ V . Proof of Lemma B.2.

The Lemma holds by Assumptions 3.1 and 3.2, the facts that sup v ∈ ˜ V | P ( v ) | ≤ and sup v ∈ ˜ V | P n ( v ) | ≤ for all n , Lemma A.5 in this paper, and Theorem 3.10.12 of van der Vaartand Wellner (1996). See the deﬁnitions of Hadamard differentiability and uniform Hadamard differentiability in van der Vaart and Wellner(1996, pp. 372–375). By (11), L ′ P is well deﬁned. emma B.3 Under Assumptions 3.1 and 3.2 with P n , P ∈ ℓ ∞ ( ˜ V ) , we have that P n → P and that ˆ P n → P , ˆ φ P n → φ P , T n /n → Λ( P ) , and ˆ σ P n → σ P almost uniformly. Proof of Lemma B.3.

By Lemma B.2 in this paper, H¨older’s inequality, and Lemma 3.10.11 ofvan der Vaart and Wellner (1996), we have that k P n − P k ∞ ≤k P n − P − n − / Q k ∞ + k n − / Q k ∞ ≤k P n − P − n − / Q k ∞ + n − / sup v ∈ ˜ V | P ( v ) P ( v ) | / → , where Q is the function deﬁned in Lemma B.2. By Lemma A.6 in this paper and Lemma 1.9.3of van der Vaart and Wellner (1996), k ˆ P n − P n k ∞ → almost uniformly. Then we have that k ˆ P n − P k ∞ → almost uniformly. The rest of the results follow from the constructions of ˆ φ P n , T n /n , and ˆ σ P n . By the construction of ¯ H , the σ Q ( h, g ) in (17) can also be written as σ Q ( h, g ) = Λ( Q ) · (cid:26) | Q ( h · g ) | Q ( g ) − Q ( h · g ) Q ( g ) + | Q ( h · g ) | Q ( g ) − Q ( h · g ) Q ( g ) (cid:27) . (B.12)Similarly to (B.12), we can write the ˆ σ P n ( h, g ) in (19) as ˆ σ P n ( h, g ) = T n n · ( | ˆ P n ( h · g ) | ˆ P n ( g ) − ˆ P n ( h · g )ˆ P n ( g ) + | ˆ P n ( h · g ) | ˆ P n ( g ) − ˆ P n ( h · g )ˆ P n ( g ) ) . (B.13)Then the almost uniform convergence of ˆ P n to P in ℓ ∞ ( ˜ V ) implies the almost uniform convergenceof the ˆ σ P n in (B.13) to the σ P as in (B.12). Proof of Lemma 3.1.

By the Hadamard derivative of L in (B.11), together with Lemma B.2 in thispaper and Theorem 3.9.4 (delta method) of van der Vaart and Wellner (1996), we have that under P n , √ n ( ˆ φ P n − φ P ) = √ n {L ( ˆ P n ) − L ( P ) } L ′ P ( G P + Q ) . (B.14)By Lemma B.3, T n /n → Λ( P ) almost uniformly. Thus by Lemmas 1.9.3(ii) and 1.10.2(iii), Example1.4.7 (Slutsky’s lemma), and Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner(1996), p T n ( ˆ φ P n − φ P ) = p T n /n · √ n ( ˆ φ P n − φ P ) Λ( P ) / L ′ P ( G P + Q ) . (B.15)Let G = Λ( P ) / L ′ P ( G P + Q ) . Then G is tight, because G P is tight and L ′ P is a continuous map.Thus (B.15) veriﬁes the ﬁrst claim of Lemma 3.1. Now we show the continuity of G under ρ P .Deﬁne a semimetric on ˜ V by ρ ( v, v ′ ) = E (cid:2) | G P ( v ) − G P ( v ′ ) | (cid:3) / v, v ′ ∈ ˜ V . This semimetric is the one deﬁned in van der Vaart and Wellner (1996, p. 39)with p = 2 . Since G P is tight, it follows from the discussion in Example 1.5.10 of van der Vaart andWellner (1996) that G P almost surely has a uniformly ρ -continuous path. Since G P is a P -Brownianbridge, ρ ( v, v ′ ) = P (( v − v ′ ) ) − P ( v − v ′ ) ≤ k v − v ′ k L ( P ) (B.16)for all v, v ′ ∈ ˜ V . Therefore, G P almost surely has a uniformly continuous path under k · k L ( P ) . ByLemma 3.10.11 of van der Vaart and Wellner (1996), P ( v ) = 0 and P ( v ) < ∞ , where v is as inAssumption 3.2. H¨older’s inequality implies that for every v ∈ L ( P ) , k v · k L ( P ) ≤ · k v k L ( P ) . ByH¨older’s inequality, P and Q are both continuous on ˜ V under k · k L ( P ) , where Q is as in LemmaB.2. Suppose that there are ( h, g ) , ( h ′ , g ′ ) ∈ ¯ H × G with g = ( g , g ) and g ′ = ( g ′ , g ′ ) . Then for j ∈ { , } we have (cid:13)(cid:13) g j − g ′ j (cid:13)(cid:13) L ( P ) ≤ ρ P (( h, g ) , ( h ′ , g ′ )) and (cid:13)(cid:13) h · g j − h ′ · g ′ j (cid:13)(cid:13) L ( P ) ≤ k h − h ′ k L ( P ) + (cid:13)(cid:13) g j − g ′ j (cid:13)(cid:13) L ( P ) ≤ ρ P (( h, g ) , ( h ′ , g ′ )) . (B.17)By (B.11) and (B.17), together with the continuity of G P , P , and Q under k · k L ( P ) , we concludethat G almost surely has a continuous path under ρ P .Next, we show the variance of G ( h, g ) for each ( h, g ) ∈ ¯ H × G with g = ( g , g ) . Since L ′ P ( H ) islinear in H , V ar ( G ( h, g )) = Λ( P ) · V ar ( L ′ P ( G P )( h, g )) . First, we have that V ar ( L ′ P ( G P ) ( h, g ))= E "(cid:18) G P ( h · g ) P ( g ) − P ( h · g ) G P ( g ) P ( g ) − G P ( h · g ) P ( g ) − P ( h · g ) G P ( g ) P ( g ) (cid:19) . (B.18)Since G P is a Brownian bridge with E [ G P ( v ) G P ( v )] = P ( v v ) − P ( v ) P ( v ) for all v , v ∈ ˜ V , wehave E "(cid:18) G P ( h · g ) P ( g ) − P ( h · g ) G P ( g ) P ( g ) (cid:19) = P (cid:0) h · g (cid:1) − P ( h · g ) P ( g ) + P ( h · g ) P ( g ) − P ( h · g ) P ( g ) − P ( h · g ) P ( g ) + 2 P ( h · g ) P ( g )= P (cid:0) h · g (cid:1) P ( g ) − P ( h · g ) P ( g ) . (B.19)Similarly, E "(cid:18) G P ( h · g ) P ( g ) − P ( h · g ) G P ( g ) P ( g ) (cid:19) = P (cid:0) h · g (cid:1) P ( g ) − P ( h · g ) P ( g ) . (B.20)14lso, we have that E [( G P ( h · g ) P ( g ) − P ( h · g ) G P ( g )) ( G P ( h · g ) P ( g ) − P ( h · g ) G P ( g ))]= P ( g ) P ( g ) P (cid:0) h g g (cid:1) − P ( g ) P ( hg ) P ( hg g ) − P ( hg ) P ( g ) P ( hg g )+ P ( hg ) P ( hg ) P ( g g ) = 0 , (B.21)where we use the fact that g g = 0 by the construction of G . By (B.21), the expectation on theright-hand side of (B.18) is equal to the sum of the expectations in (B.19) and (B.20). Thus we nowhave that V ar ( L ′ P ( G P ) ( h, g )) = P (cid:0) h · g (cid:1) P ( g ) − P ( h · g ) P ( g ) + P (cid:0) h · g (cid:1) P ( g ) − P ( h · g ) P ( g ) , which, together with V ar ( G ( h, g )) = Λ( P ) · V ar ( L ′ P ( G P )( h, g )) , veriﬁes that V ar ( G ( g, h )) = σ P ( h, g ) for the σ P in (17). For every ( h, g ) ∈ ¯ H × G with g = ( g , g ) , σ P ( h, g ) = Λ ( P ) ( P (cid:0) h · g (cid:1) P ( g ) − P ( h · g ) P ( g ) + P (cid:0) h · g (cid:1) P ( g ) − P ( h · g ) P ( g ) ) = Λ ( P ) P ( g ) | P ( h · g ) | P ( g ) (cid:20) − | P ( h · g ) | P ( g ) (cid:21) + Λ ( P ) P ( g ) | P ( h · g ) | P ( g ) (cid:20) − | P ( h · g ) | P ( g ) (cid:21) . Since ≤ | P ( hg j ) | /P ( g j ) ≤ for j ∈ { , } , σ P ( h, g ) ≤ / · { Λ ( P ) /P ( g ) + Λ ( P ) /P ( g ) } .Recall that K is the number of elements in Z . We have that for each j ∈ { , } , Λ ( P ) P ( g j ) ≤ max ≤ k ′ ≤ K Q Kk =1 P (cid:0) R × R ×{ z k } (cid:1) P (cid:0) R × R ×{ z k ′ } (cid:1) ≤ (cid:18) K − (cid:19) K − , which implies that σ P ( h, g ) ≤ / · max ( g ′ ,g ′ ) ∈G { Λ ( P ) /P ( g ′ ) + Λ ( P ) /P ( g ′ ) } ≤ / · ( K − − ( K − . When K = 2 , σ P ( h, g ) ≤ / by the construction of Λ( P ) . Lemma B.4

Under ρ P , φ P and σ P are continuous on ¯ H × G . Proof of Lemma B.4.

Suppose there are ( h, g ) , ( h k , g k ) ∈ ¯ H×G with g = ( g , g ) , g k = ( g k , g k ) , and ( h k , g k ) → ( h, g ) under ρ P . Since G K is ﬁnite, ( h k , g k ) → ( h, g ) under ρ P implies that P ( g kj ) = P ( g j ) for each j ∈ { , } when k is sufﬁciently large. If P ( g j ) = 0 , then by (11) P ( h · g j ) /P ( g j ) = 0 , P ( h k · g kj ) /P ( g kj ) = 0 when k is large, and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) P ( h · g j ) P ( g j ) − P ( h k · g kj ) P ( g kj ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 0 . If P ( g j ) = 0 for some g j ∈ G K , then Λ( P ) = 0 , which is a trivial case. We consider this case only for the sake ofcompleteness. P ( g j ) = 0 , then for each j ∈ { , } and large k , P ( g kj ) = P ( g j ) = 0 and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) P ( h · g j ) P ( g j ) − P ( h k · g kj ) P ( g kj ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:13)(cid:13) h · g j − h k · g kj (cid:13)(cid:13) L ( P ) P ( g j ) ≤ ρ P (( h, g ) , ( h k , g k )) P ( g j ) by H¨older’s inequality and (B.17). Thus we can conclude that (cid:12)(cid:12) φ P ( h, g ) − φ P ( h k , g k ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:18) P ( h · g ) P ( g ) − P ( h · g ) P ( g ) (cid:19) − (cid:18) P ( h k · g k ) P ( g k ) − P ( h k · g k ) P ( g k ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) → if ( h k , g k ) → ( h, g ) under ρ P . Similarly, we can show that σ P is continuous on ¯ H × G under ρ P .We deﬁne some new notation which will be used in the following results. Deﬁne a randomelement ˆ ϕ P : Ω → ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) such that for each ω ∈ Ω and each ( ξ, h, g ) ∈ Ξ × ¯ H × G , ˆ ϕ P ( ω )( ξ, h, g ) = φ P ( h, g ) M (ˆ σ P n ( ω )) ( ξ, h, g ) , (B.22)and let ϕ P ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) be such that for each ( ξ, h, g ) ∈ Ξ × ¯ H × G , ϕ P ( ξ, h, g ) = φ P ( h, g ) M ( σ P ) ( ξ, h, g ) . Here, ˆ σ P n is estimated from data, hence it depends on ω , and so does ˆ ϕ P . When there is no dangerof confusion, we omit the ω from ˆ σ P n and ˆ ϕ P for brevity. Given each sequence r n → ∞ and each ν which satisﬁes Assumption 3.3, deﬁne D n ( ω ) = (cid:8) ψ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) : S (cid:0) ˆ ϕ P ( ω ) + r − n ψ (cid:1) ∈ L ( ν ) (cid:9) (B.23)for all ω ∈ Ω , and g n ( ω ) ( ψ ) = r n I ◦ S (cid:0) ˆ ϕ P ( ω ) + r − n ψ (cid:1) (B.24)for all ω ∈ Ω and all ψ ∈ D n ( ω ) . Here, g n also depends on ω ; for brevity, we omit ω from g n aswell. If the H in (13) is true with Q = P n for all n , then S ( ˆ ϕ P ) = 0 by Lemma B.5, and so g n ( ψ ) = r n (cid:8) I ◦ S (cid:0) ˆ ϕ P + r − n ψ (cid:1) − I ◦ S ( ˆ ϕ P ) (cid:9) . Deﬁne a correspondence Ψ : Ξ × ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) ։ ¯ H × G by Ψ ( ξ, ψ ) = (cid:8) ( h, g ) ∈ ¯ H × G : ψ ( ξ, h, g ) = S ( ψ ) ( ξ ) (cid:9) (B.25)for all ξ ∈ Ξ and all ψ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) , and deﬁne a metric ρ ξψ on Ξ × ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) by ρ ξψ (( ξ , ψ ) , ( ξ , ψ )) = | ξ − ξ | + k ψ − ψ k ∞ (B.26)16or all ( ξ , ψ ) , ( ξ , ψ ) ∈ Ξ × ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) . Also, deﬁne a metric on Ξ × ¯ H × G by ρ ξhg (( ξ , h , g ) , ( ξ , h , g )) = | ξ − ξ | + ρ P (( h , g ) , ( h , g )) (B.27)for all ( ξ , h , g ) , ( ξ , h , g ) ∈ Ξ × ¯ H × G . For every set A ⊂ ¯ H × G and every δ > , deﬁne A δ = (cid:26) ( h, g ) ∈ ¯ H × G : inf ( h ′ ,g ′ ) ∈ A ρ P (( h, g ) , ( h ′ , g ′ )) ≤ δ (cid:27) . (B.28) Lemma B.5

Suppose Assumption 3.2 holds and the H in (13) is true with Q = P n for all n . Thenthe H in (13) is true with Q = P . This implies that sup ( h,g ) ∈ ¯ H×G φ P ( h, g ) = 0 , and hence that S ( ϕ P ) = 0 and S ( ˆ ϕ P ) = 0 for all ω ∈ Ω . Proof of Lemma B.5.

By Lemma B.3, we have k P n − P k ∞ → . Thus φ P n → φ P in ℓ ∞ ( ¯ H × G ) , andby the assumption that sup ( h,g ) ∈H×G φ P n ( h, g ) ≤ for all n , we have that sup ( h,g ) ∈H×G φ P ( h, g ) ≤ .This implies that sup ( h,g ) ∈ ¯ H×G φ P ( h, g ) ≤ by the constructions of φ P and ¯ H . By the construction of ¯ H × G , there is some ( h, g ) ∈ ¯ H × G , such as h = 1 { a }×{ }× R for some a ∈ R , for which φ P ( h, g ) = 0 .Therefore, sup ( h,g ) ∈ ¯ H×G φ P ( h, g ) = 0 under the assumptions. Because ξ ∈ Ξ is always positive bythe construction of Ξ , we have that S ( ϕ P ) ( ξ ) = 0 for all ξ ∈ Ξ . For the same reason, S ( ˆ ϕ P ) ( ξ ) = 0 for all ξ ∈ Ξ and all ω ∈ Ω . Lemma B.6

The correspondence Ψ deﬁned in (B.25) is upper hemicontinuous at ( ξ, ϕ P ) for all ξ ∈ Ξ .In addition, suppose the H in (13) is true with Q = P . Then for every δ > there is an ε > such that Ψ ( ξ ′ , ψ ) ⊂ Ψ ( ξ, ϕ P ) δ (where the latter is deﬁned as in (B.28) ) for all ξ, ξ ′ ∈ Ξ and all ψ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) with k ψ − ϕ P k ∞ < ε . Proof of Lemma B.6.

We ﬁrst show that Ψ is upper hemicontinuous at ( ξ, ϕ P ) for all ξ ∈ Ξ . Wedo this in three steps . First , we show that

Ψ ( ξ, ϕ P ) is compact for each ξ ∈ Ξ under ρ P . Clearly,given an arbitrary ξ ∈ Ξ , ϕ P ( ξ, · , · ) is continuous on ¯ H × G under ρ P by Lemma B.4. Because ¯ H × G is compact by Lemma A.8,

Ψ ( ξ, ϕ P ) is not empty. Since Ψ ( ξ, ϕ P ) ⊂ ¯ H × G , it sufﬁces to showthat

Ψ ( ξ, ϕ P ) is closed in ¯ H × G . Fix ξ ∈ Ξ . Suppose there is a sequence { ( h n , g n ) } n ⊂ Ψ ( ξ, ϕ P ) such that ( h n , g n ) → ( h, g ) ∈ ¯ H × G under ρ P . Then for all n , ϕ P ( ξ, h n , g n ) = S ( ϕ P ) ( ξ ) . Since ϕ P ( ξ, · , · ) is continuous by Lemma B.4, ϕ P ( ξ, h n , g n ) → ϕ P ( ξ, h, g ) as ( h n , g n ) → ( h, g ) . Thus ϕ P ( ξ, h, g ) = S ( ϕ P ) ( ξ ) , which implies that Ψ ( ξ, ϕ P ) is closed in ¯ H × G and therefore compact.

Second , we show that if there is a sequence { ( ξ n , ψ n ) , ( h n , g n ) } such that ( h n , g n ) ∈ Ψ ( ξ n , ψ n ) and ρ ξψ (( ξ n , ψ n ) , ( ξ, ϕ P )) → , where ρ ξψ is deﬁned in (B.26), then ( h n , g n ) has a limit point in Ψ ( ξ, ϕ P ) . Notice that by the constructions of Ξ and ¯ H × G , Ξ × ¯ H × G is compact under the metric ρ ξhg deﬁned in (B.27). It is easy to show, by Lemma B.4, that ϕ P is continuous on Ξ × ¯ H × G under See Deﬁnition 17.2 of upper hemicontinuity in Aliprantis and Border (2006). See the deﬁnition of limit point in Aliprantis and Border (2006, p. 31). ξhg , and hence that it is uniformly continuous. Thus ρ ξψ (( ξ n , ψ n ) , ( ξ, ϕ P )) → implies that |S ( ψ n ) ( ξ n ) − S ( ϕ P ) ( ξ ) | ≤ sup ( h,g ) ∈ ¯ H×G | ψ n ( ξ n , h, g ) − ϕ P ( ξ n , h, g ) | + sup ( h,g ) ∈ ¯ H×G | ϕ P ( ξ n , h, g ) − ϕ P ( ξ, h, g ) | → , where sup ( h,g ) ∈ ¯ H×G | ϕ P ( ξ n , h, g ) − ϕ P ( ξ, h, g ) | converges to because ϕ P is uniformly continuouson Ξ × ¯ H × G under ρ ξhg . This implies that ψ n ( ξ n , h n , g n ) → S ( ϕ P ) ( ξ ) . Suppose, by way of con-tradiction, that ( h n , g n ) has no limit point in Ψ ( ξ, ϕ P ) . This implies that for each ( h, g ) ∈ Ψ ( ξ, ϕ P ) there exist an open neighborhood V h,g of ( h, g ) and an n h,g such that ( h n , g n ) V h,g when n ≥ n h,g .Because we have shown that Ψ ( ξ, ϕ P ) is compact in ¯ H × G , there is a ﬁnite open cover V such that Ψ ( ξ, ϕ P ) ⊂ V = V h ,g ∪ · · · ∪ V h M ,g M . Let n = max m ≤ M n h m ,g m . Thus if n > n , then ( h n , g n ) V ,and hence ( h n , g n ) Ψ ( ξ, ϕ P ) . Since ¯ H × G is compact and V c is closed in ¯ H × G , V c is compact.Notice that V c ∩ Ψ( ξ, ϕ P ) = ∅ . Thus sup ( h,g ) ∈ V c ϕ P ( ξ, h, g ) < sup ( h,g ) ∈ ¯ H×G ϕ P ( ξ, h, g ) = sup ( h,g ) ∈ Ψ( ξ,ϕ P ) ϕ P ( ξ, h, g ) . Let δ = sup ( h,g ) ∈ ¯ H×G ϕ P ( ξ, h, g ) − sup ( h,g ) ∈ V c ϕ P ( ξ, h, g ) . Recall that ( h n , g n ) ∈ V c for all n > n .Thus ψ n ( ξ n , h n , g n ) = sup ( h,g ) ∈ ¯ H×G ψ n ( ξ n , h, g ) = sup ( h,g ) ∈ V c ψ n ( ξ n , h, g ) , so (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ψ n ( ξ n , h n , g n ) − sup ( h,g ) ∈ V c ϕ P ( ξ, h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup ( h,g ) ∈ ¯ H×G | ψ n ( ξ n , h, g ) − ϕ P ( ξ n , h, g ) | + sup ( h,g ) ∈ ¯ H×G | ϕ P ( ξ n , h, g ) − ϕ P ( ξ, h, g ) | → . This implies that for sufﬁciently large n , ψ n ( ξ n , h n , g n ) ≤ sup ( h,g ) ∈ V c ϕ P ( ξ, h, g ) + δ ( h,g ) ∈ ¯ H×G ϕ P ( ξ, h, g ) − δ . This contradicts ψ n ( ξ n , h n , g n ) → S ( ϕ P ) ( ξ ) . Thus ( h n , g n ) has a limit point in Ψ ( ξ, ϕ P ) . Third , byTheorem 17.20(ii) of Aliprantis and Border (2006), together with the fact that Ξ × ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) is ﬁrst countable under the metric ρ ξψ deﬁned in (B.26) (every metric space is ﬁrst countable), Ψ isupper hemicontinuous at ( ξ, ϕ P ) .Now we prove the second claim in the Lemma. Fix δ > . Since Ψ is upper hemicontinuous at ( ξ, ϕ P ) for all ξ ∈ Ξ , we have that for each ξ there is an open ball B ε ξ ( ξ, ϕ P ) under ρ ξψ with center ( ξ, ϕ P ) and radius ε ξ such that Ψ ( ξ ′ , ϕ ′ ) ⊂ Ψ ( ξ, ϕ P ) δ for all ( ξ ′ , ϕ ′ ) ∈ B ε ξ ( ξ, ϕ P ) , where Ψ ( ξ, ϕ P ) δ is deﬁned as in (B.28). Notice that { B ε ξ / ( ξ ) } ξ ∈ Ξ is an open cover of Ξ , where each B ε ξ / ( ξ ) isan open ball in R with center ξ and radius ε ξ / . Since Ξ is compact by construction, there is aﬁnite open cover { B ε i ( ξ i ) } Mi =1 of Ξ with ε i = ε ξ i / . Let ε = min i ≤ M ε i . Then for every ξ ′ ∈ Ξ and every ψ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) with k ψ − ϕ P k ∞ < ε , there is an open ball B ε ξi ( ξ i , ϕ P ) such that18 ξ ′ , ψ ) ⊂ B ε ξi ( ξ i , ϕ P ) . This implies that Ψ ( ξ ′ , ψ ) ⊂ Ψ ( ξ i , ϕ P ) δ . Suppose the H in (13) is true with Q = P . By Lemma B.5, we have that S ( ϕ P ) = 0 and Ψ ( ξ, ϕ P ) = Ψ( ˜ ξ, ϕ P ) = (cid:8) ( h, g ) ∈ ¯ H × G : φ P ( h, g ) = 0 (cid:9) for all ξ, ˜ ξ ∈ Ξ . Thus Ψ ( ξ ′ , ψ ) ⊂ Ψ ( ξ, ϕ P ) δ for all ξ ∈ Ξ , that is, the second claim holds. Lemma B.7

Suppose Assumptions 3.1, 3.2, and 3.3 hold and the H in (13) is true with Q = P n for all n . For every ε > , there is a measurable set Ω ⊂ Ω with P (Ω ) ≥ − ε such that for every subsequence { ψ n m } with ψ n m ∈ D n m ( ω n m ) , ω n m ∈ Ω , where D n m ( ω n m ) is deﬁned in (B.23) , and ψ n m → ψ forsome ψ ∈ C (cid:0) Ξ × ¯ H × G (cid:1) under the ρ ξhg deﬁned in (B.27) , we have that g n m ( ω n m ) ( ψ n m ) → I ◦ S Ψ( ξ,ϕ P ) ( ψ ) , where g n m is deﬁned in (B.24) . Proof of Lemma B.7.

For simplicity of notation, we replace n m with n . Note that all the followingresults hold for every subsequence indexed by n m . By Lemma A.8, ¯ H × G is compact under ρ P .By Lemma B.3, we have ˆ σ P n → σ P almost uniformly. Then by construction, ˆ ϕ P → ϕ P almostuniformly, where ˆ ϕ P is deﬁned in (B.22). By Lemma B.5, S ( ϕ P ) = 0 and S ( ˆ ϕ P ) = 0 for all ω ∈ Ω .For every ψ ∈ C (cid:0) Ξ × ¯ H × G (cid:1) , since ˆ ϕ P ( ξ, · , · ) + r − n ψ ( ξ, · , · ) may not be continuous on ¯ H × G , Ψ (cid:0) ξ, ˆ ϕ P + r − n ψ (cid:1) may be empty. Here, we construct a modiﬁed version of ˆ ϕ P , denoted by ˜ ϕ P , suchthat(i) ˜ ϕ P ( ξ, · , · ) is upper semicontinuous for every ω ∈ Ω , every n , and every ξ ∈ Ξ ;(ii) sup ( h,g ) ∈ ¯ H×G ˆ ϕ P ( ξ, h, g ) = sup ( h,g ) ∈ ¯ H×G ˜ ϕ P ( ξ, h, g ) for every ω ∈ Ω , every n , and every ξ ∈ Ξ ;(iii) sup ( h,g ) ∈ ¯ H×G (cid:0) ˆ ϕ P + r − n ψ (cid:1) ( ξ, h, g ) = sup ( h,g ) ∈ ¯ H×G (cid:0) ˜ ϕ P + r − n ψ (cid:1) ( ξ, h, g ) for every function ψ ∈ C (cid:0) Ξ × ¯ H × G (cid:1) , every ω ∈ Ω , every n , and every ξ ∈ Ξ ;(iv) for every ε > there is a measurable set A ⊂ Ω with P ( A ) ≥ − ε such that for all ϕ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) , ˜ ϕ P + r − n ϕ → ϕ P uniformly on A .Speciﬁcally, for all ω ∈ Ω , all ( ξ, h, g ) ∈ Ξ × ¯ H × G , and all n , we deﬁne ˜ ϕ P ( ξ, h, g ) by ˜ ϕ P ( ξ, h, g ) = lim δ ↓ sup ( h ′ ,g ′ ) ∈ B δ ( h,g ) ˆ ϕ P ( ξ, h ′ , g ′ ) , (B.29)where B δ ( h, g ) is an open ball in ¯ H × G under ρ P with center ( h, g ) and radius δ .Fix ω ∈ Ω , n , and ξ ∈ Ξ . First , we prove (i), that is, ˜ ϕ P ( ξ, · , · ) is upper semicontinuous at every ( h, g ) ∈ ¯ H × G . Fix ( h, g ) ∈ ¯ H × G . By (B.29), for each ε > , there is a δ ε > such that ˆ ϕ P ( ξ, h ′ , g ′ ) ≤ ˜ ϕ P ( ξ, h, g ) + ε (B.30)19or all ( h ′ , g ′ ) ∈ B δ ε ( h, g ) , where B δ ε ( h, g ) denotes the open ball in ¯ H × G under ρ P with center ( h, g ) and radius δ ε . Fix ( h , g ) ∈ B δ ε / ( h, g ) . By deﬁnition, there is a δ > such that for all δ ′ with < δ ′ ≤ δ , ˜ ϕ P ( ξ, h , g ) ≤ sup ( h ,g ) ∈ B δ ′ ( h ,g ) ˆ ϕ P ( ξ, h , g ) + ε . Let δ = min { δ ε / , δ } . Then for this ( h , g ) , we have that ˜ ϕ P ( ξ, h , g ) ≤ sup ( h ,g ) ∈ B δ ( h ,g ) ˆ ϕ P ( ξ, h , g ) + ε . Notice that if ( h , g ) ∈ B δ ( h , g ) , then ( h , g ) ∈ B δ ε ( h, g ) , and hence ˆ ϕ P ( ξ, h , g ) ≤ ˜ ϕ P ( ξ, h, g )+ ε/ . This implies that sup ( h ,g ) ∈ B δ ( h ,g ) ˆ ϕ P ( ξ, h , g ) ≤ ˜ ϕ P ( ξ, h, g )+ ε/ , and hence ˜ ϕ P ( ξ, h , g ) ≤ ˜ ϕ P ( ξ, h, g )+ ε . This shows that for each ε > , there is a δ ε > such that for all ( h , g ) ∈ B δ ε / ( h, g ) , ˜ ϕ P ( ξ, h , g ) ≤ ˜ ϕ P ( ξ, h, g ) + ε . Second , we prove (ii), that is, sup ( h,g ) ∈ ¯ H×G ˆ ϕ P ( ξ, h, g ) = sup ( h,g ) ∈ ¯ H×G ˜ ϕ P ( ξ, h, g ) . (B.31)By the deﬁnition of ˜ ϕ P , we have ˆ ϕ P ( ξ, h, g ) ≤ ˜ ϕ P ( ξ, h, g ) for all ( h, g ) ∈ ¯ H × G , and hence sup ( h,g ) ∈ ¯ H×G ˆ ϕ P ( ξ, h, g ) ≤ sup ( h,g ) ∈ ¯ H×G ˜ ϕ P ( ξ, h, g ) . Also, by the deﬁnition of ˜ ϕ P , ˜ ϕ P ( ξ, h, g ) ≤ sup ( h ′ ,g ′ ) ∈ ¯ H×G ˆ ϕ P ( ξ, h ′ , g ′ ) for all ( h, g ) . Thus sup ( h,g ) ∈ ¯ H×G ˜ ϕ P ( ξ, h, g ) ≤ sup ( h,g ) ∈ ¯ H×G ˆ ϕ P ( ξ, h, g ) ,and (B.31) holds. Similarly , by the deﬁnition of ˜ ϕ P , we have that ˆ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) ≤ ˜ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) for all ( h, g ) ∈ ¯ H × G , and hence sup ( h,g ) ∈ ¯ H×G { ˆ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) } ≤ sup ( h,g ) ∈ ¯ H×G { ˜ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) } . Fix ( h, g ) ∈ ¯ H × G . Since ψ ( ξ, · , · ) is continuous under ρ P , for every ε > there is a ¯ δ > such that sup ( h ′ ,g ′ ) ∈ B δ ( h,g ) { ˆ ϕ P ( ξ, h ′ , g ′ ) + r − n ψ ( ξ, h, g ) − ε } ≤ sup ( h ′ ,g ′ ) ∈ B δ ( h,g ) { ˆ ϕ P ( ξ, h ′ , g ′ ) + r − n ψ ( ξ, h ′ , g ′ ) } for all δ ≤ ¯ δ . By the deﬁnition of ˜ ϕ P , this implies that ˜ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) − ε ≤ lim δ ↓ sup ( h ′ ,g ′ ) ∈ B δ ( h,g ) { ˆ ϕ P ( ξ, h ′ , g ′ ) + r − n ψ ( ξ, h ′ , g ′ ) }≤ sup ( h,g ) ∈ ¯ H×G { ˆ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) } . Since ε is arbitrary, we have ˜ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) ≤ sup ( h,g ) ∈ ¯ H×G { ˆ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) } .This holds for all ( h, g ) ∈ ¯ H × G , which implies that sup ( h,g ) ∈ ¯ H×G { ˆ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) } ≥ sup ( h,g ) ∈ ¯ H×G { ˜ ϕ P ( ξ, h, g ) + r − n ψ ( ξ, h, g ) } . Thus (iii) is proved. 20 ast , we prove (iv). Since ϕ P ( ξ, · , · ) is continuous, we have that sup ( ξ,h,g ) ∈ Ξ × ¯ H×G (cid:12)(cid:12) ˜ ϕ P ( ξ, h, g ) + r − n ϕ ( ξ, h, g ) − ϕ P ( ξ, h, g ) (cid:12)(cid:12) ≤ sup ( ξ,h,g ) ∈ Ξ × ¯ H×G | ˆ ϕ P ( ξ, h, g ) − ϕ P ( ξ, h, g ) | + r − n k ϕ k ∞ . (iv) follows from the facts that ˆ ϕ P → ϕ P almost uniformly, as mentioned at the beginning of theproof, and k ϕ k ∞ < ∞ .Fix ε > . By property (iv), let Ω ⊂ Ω be a measurable set such that P (Ω ) ≥ − ε and ˜ ϕ P + t − n ϕ → ϕ P uniformly on Ω for all ϕ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) . Let ψ n ∈ D n ( ω n ) , ω n ∈ Ω , and ψ ∈ C (cid:0) Ξ × ¯ H × G (cid:1) be arbitrary maps such that ψ n → ψ . By property (i) that we proved above, wehave that Ψ (cid:0) ξ, ˜ ϕ P + r − n ψ (cid:1) = ∅ for all ω ∈ Ω , all n , and all ξ ∈ Ξ . It is easy to show that because ψ n → ψ in ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) , sup ξ ∈ Ξ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup ( h,g ) ∈ ¯ H×G (cid:8) ˆ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ n ( ξ, h, g ) (cid:9) − sup ( h,g ) ∈ ¯ H×G (cid:8) ˆ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ ( ξ, h, g ) (cid:9)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ r − n sup ( ξ,h,g ) ∈ Ξ × ¯ H×G | ψ n ( ξ, h, g ) − ψ ( ξ, h, g ) | = o (cid:0) r − n (cid:1) . Since ˜ ϕ P + r − n ψ converges to ϕ P uniformly on Ω , by Lemma B.6 there is a sequence δ n ↓ suchthat Ψ (cid:0) ξ, ˜ ϕ P ( ω ) + r − n ψ (cid:1) ⊂ Ψ ( ξ, ϕ P ) δ n for all ξ ∈ Ξ and all ω ∈ Ω . (By Lemma B.6, δ n does notdepend on ξ ∈ Ξ or on ω ∈ Ω .) Since S ( ϕ P ) = 0 by Lemma B.5, we have that for all ξ ∈ Ξ , Ψ ( ξ, ϕ P ) = { ( h, g ) ∈ ¯ H × G : φ P ( h, g ) = 0 } . (B.32)By Lemma B.5 and the constructions of ˆ ϕ P and ˜ ϕ P , we also have that for all ω , ˆ ϕ P ≤ and ˜ ϕ P ≤ on Ξ × ¯ H × G , and ˆ ϕ P ( ξ, · , · ) = 0 on Ψ ( ξ, ϕ P ) . Thus for every ξ ∈ Ξ , sup ( h,g ) ∈ ¯ H×G (cid:8) ˆ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ ( ξ, h, g ) (cid:9) ≥ sup ( h,g ) ∈ Ψ( ξ,ϕ P ) (cid:8) ˆ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ ( ξ, h, g ) (cid:9) = sup ( h,g ) ∈ Ψ( ξ,ϕ P ) r − n ψ ( ξ, h, g ) . By property (iii) of ˜ ϕ P , together with the results shown above, we have that sup ξ ∈ Ξ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup ( h,g ) ∈ ¯ H×G (cid:8) ˆ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ ( ξ, h, g ) (cid:9) − sup ( h,g ) ∈ Ψ( ξ,ϕ P ) r − n ψ ( ξ, h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = sup ξ ∈ Ξ  sup ( h,g ) ∈ Ψ ( ξ, ˜ ϕ P ( ω n )+ r − n ψ ) (cid:8) ˜ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ ( ξ, h, g ) (cid:9) − sup ( h,g ) ∈ Ψ( ξ,ϕ P ) r − n ψ ( ξ, h, g )  ≤ sup ξ ∈ Ξ ( sup ( h,g ) ∈ Ψ( ξ,ϕ P ) δn (cid:8) ˜ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ ( ξ, h, g ) (cid:9) − sup ( h,g ) ∈ Ψ( ξ,ϕ P ) r − n ψ ( ξ, h, g ) ) . Ψ( ξ, ϕ P ) δ n , sup ξ ∈ Ξ ( sup ( h,g ) ∈ Ψ( ξ,ϕ P ) δn (cid:8) ˜ ϕ P ( ω n ) ( ξ, h, g ) + r − n ψ ( ξ, h, g ) (cid:9) − sup ( h,g ) ∈ Ψ( ξ,ϕ P ) r − n ψ ( ξ, h, g ) ) ≤ sup ξ ∈ Ξ ( sup ρ P (( h ,g ) , ( h ,g )) ≤ δ n r − n | ψ ( ξ, h , g ) − ψ ( ξ, h , g ) | ) = o ( r − n ) . Finally, combining all the results above, we can conclude that sup ξ ∈ Ξ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) S (cid:0) ˆ ϕ P ( ω n ) + r − n ψ n (cid:1) ( ξ ) − r − n sup ( h,g ) ∈ Ψ( ξ,ϕ P ) ψ ( ξ, h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o (cid:0) r − n (cid:1) . This implies that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) g n ( ω n ) ( ψ n ) − Z Ξ sup ( h,g ) ∈ Ψ( ξ,ϕ P ) ψ ( ξ, h, g ) dν ( ξ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Z Ξ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) r n S (cid:0) ˆ ϕ P ( ω n ) + r − n ψ n (cid:1) ( ξ ) − sup ( h,g ) ∈ Ψ( ξ,ϕ P ) ψ ( ξ, h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) dν ( ξ ) = o (1) . Proof of Theorem 3.1.

By (B.14), √ n ( ˆ φ P n − φ P ) L ′ P ( G P + Q ) , where L ′ P ( G P + Q ) is tight asshown in the proof of Lemma 3.1. By Lemma B.3, M (ˆ σ P n ) → M ( σ P ) almost uniformly, and hencethis convergence is also in outer probability by Lemma 1.9.3(ii) of van der Vaart and Wellner (1996).By Lemma 1.10.2(iii) of van der Vaart and Wellner (1996), M (ˆ σ P n ) M ( σ P ) . By Example 1.4.7(Slutsky’s lemma) of van der Vaart and Wellner (1996), we have that ( √ n ( ˆ φ P n − φ P ) , M (ˆ σ P n )) ( L ′ P ( G P + Q ) , M ( σ P )) . Let ℓ ∞ (Ξ × ¯ H × G ) + = { ψ ∈ ℓ ∞ (Ξ × ¯ H × G ) : k /ψ k ∞ < ∞} . Deﬁne a map f : ℓ ∞ ( ¯ H × G ) × ℓ ∞ (Ξ × ¯ H × G ) + → ℓ ∞ (Ξ × ¯ H × G ) by f ( ϕ, ψ ) = ϕ/ψ for all ( ϕ, ψ ) ∈ ℓ ∞ ( ¯ H × G ) × ℓ ∞ (Ξ × ¯ H × G ) + . Clearly, ( L ′ P ( G P + Q ) , M ( σ P )) takes its values in ℓ ∞ ( ¯ H × G ) × ℓ ∞ (Ξ × ¯ H × G ) + . Itis easy to show that f is continuous under the metric k ( ϕ, ψ ) − ( ϕ ′ , ψ ′ ) k = k ϕ − ϕ ′ k ∞ + k ψ − ψ ′ k ∞ .By Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner (1996), f ( √ n ( ˆ φ P n − φ P ) , M (ˆ σ P n )) = √ n ( ˆ φ P n − φ P ) M (ˆ σ P n ) L ′ P ( G P + Q ) M ( σ P ) . By Lemma B.5, we have that

I ◦ S ( φ P / M (ˆ σ P n )) = 0 . Then by Theorem A.2(ii) and Lemma B.7,together with the continuity of I ◦ S Ψ( ξ,ϕ P ) under k · k ∞ , we have √ n ( I ◦ S ˆ φ P n M (ˆ σ P n ) ! − I ◦ S (cid:18) φ P M (ˆ σ P n ) (cid:19)) I ◦ S Ψ( ξ,ϕ P ) (cid:18) L ′ P ( G P + Q ) M ( σ P ) (cid:19) . (B.33)By Lemma B.3, T n /n → Λ( P ) almost uniformly. Then by Lemmas 1.9.3(ii) and 1.10.2(iii), Example1.4.7 (Slutsky’s lemma), and Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner221996), together with (B.33), we have that r T n n · √ n ( I ◦ S ˆ φ P n M (ˆ σ P n ) !) I ◦ S Ψ( ξ,ϕ P ) (cid:18) G M ( σ P ) (cid:19) , where G = p Λ( P ) L ′ P ( G P + Q ) as in Lemma 3.1. By Lemma B.5, we have that Ψ( ξ, ϕ P ) = Ψ ¯ H×G deﬁned by (24) for all ξ ∈ Ξ under the assumptions. Remark B.3

If the H in (13) is true with Q = P n for all n , it is easy to show that S ( φ P / M ( σ P )) = 0 (see Lemma B.5). Thus it sufﬁces to ﬁnd the asymptotic distribution of √ n I ◦ S

H×G ˆ φ P n M (ˆ σ P n ) ! = √ n ( I ◦ S ˆ φ P n M (ˆ σ P n ) ! − I ◦ S (cid:18) φ P M ( σ P ) (cid:19)) . (B.34) If we can ﬁnd the asymptotic distribution of √ n ( ˆ φ P n / M (ˆ σ P n ) − φ P / M ( σ P )) and the “derivative” of I◦S (see, for example, the deﬁnition of Hadamard directional derivative in Shapiro (1990) and Fang andSantos (2018)), then by the delta method of Fang and Santos (2018), it is straightforward to obtain theasymptotic distribution of (B.34) . However, establishing the limiting distribution of √ n ( ˆ φ P n / M (ˆ σ P n ) − φ P / M ( σ P )) is technically tricky. By the constructions of φ P and σ P , we can view φ P / M ( σ P ) as amap of P . Speciﬁcally, let V = { v : v = h · g l or v = h · g l for some h ∈ ¯ H and g l ∈ G K } and D Q = { Q ∈ ℓ ∞ ( V ∪ G K ) : Q ( h · g l ) /Q ( g l ) and Q ( h · g l ) /Q ( g l ) exist for all h ∈ ¯ H and g l ∈ G K } . Thenwe extend the deﬁnitions of φ Q and σ Q for all Q ∈ P , that is, the φ Q deﬁned in (12) and the σ Q deﬁnedin (17) , to all Q ∈ D Q . Clearly, P ⊂ D Q by (11) . Deﬁne a map T : D Q → ℓ ∞ (Ξ × ¯ H × G ) by T ( Q )( ξ, h, g ) = φ Q ( h, g ) M ( σ Q )( ξ, h, g ) for all Q ∈ D Q and ( ξ, h, g ) ∈ Ξ × ¯ H × G . Now we have that T ( P ) = φ P / M ( σ P ) and T ( ˆ P n ) =ˆ φ P n / M (ˆ σ P n ) . Suppose we have weak convergence of √ n ( ˆ P n − P ) in some suitable space. Then if T isHadamard (directionally) differentiable, by delta method we can establish weak convergence of √ n ˆ φ P n M (ˆ σ P n ) − φ P M ( σ P ) ! = √ n (cid:16) T ( ˆ P n ) − T ( P ) (cid:17) . (B.35) Unfortunately, however, T is nondifferentiable, because of the nondifferentiability of the M deﬁned in (21) ( M is not differentiable even when Ξ is a singleton), and hence it is not straightforward to showthe convergence of √ n ( T ( ˆ P n ) − T ( P )) . Inspired by Kitagawa (2015), with the asymptotic distributionof √ n ( ˆ φ P n / M (ˆ σ P n ) − φ P / M (ˆ σ P n )) (which can be obtained by using Slutsky’s theorem), we can insteadestablish the asymptotic distribution of √ n ( I ◦ S ˆ φ P n M (ˆ σ P n ) ! − I ◦ S (cid:18) φ P M (ˆ σ P n ) (cid:19)) , (B.36)23 here S ( φ P / M (ˆ σ P n )) = 0 by Lemma B.5 if the H in (13) is true with Q = P n for all n . However,existing delta methods cannot be used to establish the asymptotic distribution of (B.36) either. Since φ P / M (ˆ σ P n ) is a random element, delta methods such as Theorem 3.9.4 or Theorem 3.9.5 of van derVaart and Wellner (1996), or Theorem 2.1 of Fang and Santos (2018), do not work in this case. Toovercome the technical complications due to the random element φ P / M (ˆ σ P n ) , we provide the extendedcontinuous mapping theorem and the extended delta method elaborated by Theorems A.1 and A.2,respectively. Proof of Corollary 3.1.

By Lemma B.5, φ P ( h, g ) ≤ for all ( h, g ) ∈ ¯ H × G , and there exists ( h , g ) ∈ ¯ H × G with g = ( g , g ) such that φ P ( h , g ) = 0 . First , we show that if h = ( − d · A ×{ d }× R , where d ∈ { , } and A is a half-closed interval or an open interval, then for every closedinterval B such that B ⊂ A , we have that φ P (˜ h, g ) = 0 with ˜ h = ( − d · B ×{ d }× R . Suppose, byway of contradiction, that A = ( a , a ) and B = [ b , b ] with a < b , a > b , and φ P (˜ h, g ) < with ˜ h = ( − d · B ×{ d }× R . Let h L = ( − d · ( a ,b ) ×{ d }× R and h R = ( − d · ( b ,a ) ×{ d }× R . Then bythe deﬁnition of φ P , φ P ( h , g ) = P ( h · g ) P ( g ) − P ( h · g ) P ( g ) = P (( h L + ˜ h + h R ) · g ) P ( g ) − P (( h L + ˜ h + h R ) · g ) P ( g )= φ P (˜ h, g ) + φ P ( h L , g ) + φ P ( h R , g ) . Since φ P ( h , g ) = 0 but φ P (˜ h, g ) < , we have φ P ( h L , g ) + φ P ( h R , g ) > . This implies thateither φ P ( h L , g ) > or φ P ( h R , g ) > . However, since ( h L , g ) , ( h R , g ) ∈ ¯ H × G , Lemma B.5shows that both φ P ( h L , g ) and φ P ( h R , g ) are nonpositive. This is a contradiction. When A isa half-closed interval, we can show analogously that the claim is true. Second , we show that if h = 1 R × C × R with C = ( −∞ , c ) for some c ∈ R , then there is a sequence of sets C k = ( −∞ , c k ] with c k ↑ c such that φ P ( h k , g ) = 0 with h k = 1 R × C k × R . By assumption, D is a ﬁnite set. UnderAssumption 3.1, D is a discrete random variable with D ∈ D under P n . Then D ∈ D under P byLemma B.3, and the claim holds.The above results imply that Ψ ¯ H×G ⊂ Ψ H×G , where Ψ H×G is the closure of Ψ H×G in ¯ H × G under ρ P . By (24) and Lemma B.4, Ψ ¯ H×G = Ψ

H×G . By Lemma 3.1, G almost surely has a continuous pathunder ρ P . By Lemma B.4, σ P is continuous under ρ P . Thus the Corollary follows from Theorem 3.1and the continuity of G / M ( σ P ) under ρ P for every ﬁxed ξ ∈ Ξ .We now introduce the notation for the bootstrap elements. Let ( W n , . . . , W nn ) be a vector ofrandom multinomial weights independent of { ( Y i , D i , Z i ) } ni =1 for all n . As deﬁned in (14), ˆ P n isthe empirical measure of an i.i.d. sample { ( Y i , D i , Z i ) } ni =1 from probability distribution P n . Giventhe sample values, the { ( ˆ Y i , ˆ D i , ˆ Z i ) } ni =1 introduced in Section 3.1.1 is an i.i.d. sample from ˆ P n .We can write the empirical measure of { ( ˆ Y i , ˆ D i , ˆ Z i ) } ni =1 , given sample { ( Y i , D i , Z i ) } ni =1 , as ˆ P Bn = n − P ni =1 W ni δ ( Y i ,D i ,Z i ) , where δ ( Y i ,D i ,Z i ) is a Dirac measure centered at ( Y i , D i , Z i ) . Given the ˆ φ BP n , T Bn , and ˆ σ BP n deﬁned in Section 3.1.1, ˆ φ BP n / M (ˆ σ BP n ) is a map of { ( Y i , D i , Z i , W ni ) } ni =1 to the space ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) . We follow Section 3.6 of van der Vaart and Wellner (1996) and (A.1) to deﬁne the24onditional outer expectations. When we compute the outer expectations as in (A.1), independenceis understood in terms of a product space. Under Assumptions 3.1 and 3.2, each term ( Y i , D i , Z i ) of the sequence { ( Y i , D i , Z i ) } ∞ i =1 has probability distribution P . Let { ( Y i , D i , Z i ) } ∞ i =1 be the coor-dinate projections on the ﬁrst ∞ coordinates of the product space (( R ) ∞ , B ∞ R , P ∞ ) × ( W , C , P W ) ,and let the multinomial vectors W depend on the last factor only. For each real-valued map T on (( R ) ∞ , B ∞ R , P ∞ ) × ( W , C , P W ) , we can take (Ω , A , P ) = (( R ) ∞ , B ∞ R , P ∞ ) and (Ω , A , P ) =( W , C , P W ) and deﬁne a real-valued map E ∗ W [ T ] on (( R ) ∞ , B ∞ R , P ∞ ) by E ∗ W [ T ]( { ( Y i , D i , Z i ) } ∞ i =1 ) = E ∗ [ T ]( { ( Y i , D i , Z i ) } ∞ i =1 ) (B.37)for each sequence { ( Y i , D i , Z i ) } ∞ i =1 ∈ ( R ) ∞ , where E ∗ [ T ] is deﬁned as in (A.1). We call the left-hand side of (B.37) the conditional outer expectation of T given the sequence { ( Y i , D i , Z i ) } ∞ i =1 .Since E ∗ W [ T ] is a real-valued map on (( R ) ∞ , B ∞ R , P ∞ ) , we can compute its outer and inner inte-grals (expectations) with respect to (( R ) ∞ , B ∞ R , P ∞ ) . For simplicity of notation, we write themas E ∗ [ E ∗ W [ T ]] and E ∗ [ E ∗ W [ T ]] , respectively. If T ( { ( Y i , D i , Z i ) } ∞ i =1 , · ) is a measurable integrablemap on ( W , C , P W ) for every given sequence { ( Y i , D i , Z i ) } ∞ i =1 , we then write E W [ T ] for E ∗ W [ T ] and call the quantity E W [ T ]( { ( Y i , D i , Z i ) } ∞ i =1 ) the conditional expectation of T given the sequence { ( Y i , D i , Z i ) } ∞ i =1 . The conditional inner expectation is deﬁned analogously. If D is a metric spacewith metric d , we deﬁne BL ( D ) = { f : D → R : k f k ∞ ≤ , | f ( x ) − f ( x ) | ≤ d ( x , x ) for all x , x ∈ D } . Lemma B.8

Suppose Assumptions 3.1 and 3.2 hold.(i) p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) satisﬁes sup f ∈ BL ( ℓ ∞ (Ξ × ¯ H×G )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E W " f p T Bn ( ˆ φ BP n − ˆ φ P n ) M (ˆ σ BP n ) ! − E (cid:20) f (cid:18) G M ( σ P ) (cid:19)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) → (B.38) in outer probability, where G = p Λ( P ) · L ′ P ( G P ) is tight and G P is as in Lemma B.2;(ii) p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) G / M ( σ P ) ; (iii) For each continuous, bounded f : ℓ ∞ (Ξ × ¯ H × G ) → R , f ( p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n )) is ameasurable function of { W ni } ni =1 for every given sequence { ( Y i , D i , Z i ) } ∞ i =1 . Proof of Lemma B.8. (i). To explore the conditional property of p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) ,we consider the entire sequence { ( Y i , D i , Z i ) } ∞ i =1 . Each term ( Y i , D i , Z i ) in { ( Y i , D i , Z i ) } ∞ i =1 hasprobability distribution P under Assumptions 3.1 and 3.2. Now the ˆ P n deﬁned in (14) can be viewed This implies that p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) is asymptotically measurable jointly in { ( Y i , D i , Z i ) } ∞ i =1 and W byLemma 1.3.8 of van der Vaart and Wellner (1996). We follow Section 3.6 of van der Vaart and Wellner (1996) to obtain the conditional property of the bootstrap element p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) given the entire sequence { ( Y i , D i , Z i ) } ∞ i =1 .

25s being computed with the ﬁrst n elements of { ( Y i , D i , Z i ) } ∞ i =1 that are distributed according to P .By Lemma A.5, √ n ( ˆ P n − P ) G P under P , where G P is the limit shown in Lemma B.2. By theconstruction of ˜ V in (A.7), F = 1 is an envelope function of ˜ V and P ∗ (sup v ∈ ˜ V | v − P ( v ) | ) < ∞ ,where P ∗ is the outer probability measure of P . By Lemma A.5, ˜ V is Donsker. By Theorem 3.6.2 ofvan der Vaart and Wellner (1996), we have that sup f ∈ BL ( ℓ ∞ ( ˜ V )) | E W [ f {√ n ( ˆ P Bn − ˆ P n ) } ] − E [ f ( G P )] | → (B.39)outer almost surely and E W [ f {√ n ( ˆ P Bn − ˆ P n ) } ∗ ] − E W [ f {√ n ( ˆ P Bn − ˆ P n ) } ∗ ] → (B.40)almost surely for every f ∈ BL ( ℓ ∞ ( ˜ V )) . Here, the asterisks denote the measurable cover functionswith respect to { ( Y i , D i , Z i ) } ∞ i =1 and W jointly. Then by Lemmas B.1, A.5, and A.6 in this paper, andTheorem 3.9.13 of van der Vaart and Wellner (1996), we have sup f ∈ BL ( ℓ ∞ ( ¯ H×G )) | E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] − E [ f ( L ′ P ( G P ))] | → (B.41)outer almost surely and E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ∗ ] − E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ∗ ] → (B.42)almost surely for every f ∈ BL ( ℓ ∞ ( ¯ H × G )) . The outer almost sure convergence in (B.41) impliesthat √ n ( L ( ˆ P Bn ) − L ( ˆ P n )) L ′ P ( G P ) for almost every given sequence { ( Y i , D i , Z i ) } ∞ i =1 . By LemmaA.6 in this paper, and Lemmas 1.9.2 and 1.9.3 of van der Vaart and Wellner (1996), we have that k ˆ P Bn − ˆ P n k ∞ → outer almost surely for almost every given sequence { ( Y i , D i , Z i ) } ∞ i =1 . By LemmaA.6 again, k ˆ P n − P k ∞ → for almost every sequence { ( Y i , D i , Z i ) } ∞ i =1 . Thus now we have that k ˆ P Bn − P k ∞ ≤ k ˆ P Bn − ˆ P n k ∞ + k ˆ P n − P k ∞ → outer almost surely for almost every given sequence { ( Y i , D i , Z i ) } ∞ i =1 . This implies that k ˆ σ BP n − σ P k ∞ → and T Bn /n → Λ( P ) outer almost surely foralmost every given sequence { ( Y i , D i , Z i ) } ∞ i =1 . This, together with (B.41), and Lemmas 1.9.2(i) and1.10.2(iii), Example 1.4.7 (Slutsky’s lemma), and Theorem 1.3.6 (continuous mapping) of van derVaart and Wellner (1996), implies that p T Bn ( L ( ˆ P Bn ) − L ( ˆ P n )) / M (ˆ σ BP n ) G / M ( σ P ) for almostevery given sequence { ( Y i , D i , Z i ) } ∞ i =1 . Since G P is tight, G is tight by (B.11).(ii). By (B.42) and Theorem 2.37 of Folland (1999) (Fubini), together with the dominatedconvergence theorem and Lemma 1.2.1 of van der Vaart and Wellner (1996), E ∗ [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] − E ∗ [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] → (B.43) As discussed in van der Vaart and Wellner (1996, p. 183), f {√ n ( ˆ P Bn − ˆ P n ) } is measurable as a function of the randomweights given the values of the sample. Thus we use the conditional expectation E W [ f {√ n ( ˆ P Bn − ˆ P n ) } ] in (B.39). Similarly,we use the conditional expectation E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] in (B.41). f ∈ BL ( ℓ ∞ ( ¯ H × G )) . By (B.41), together with the deﬁnition of outer almost sure conver-gence (Deﬁnition 1.9.1(iii) of van der Vaart and Wellner (1996)), we have that for every function f ∈ BL ( ℓ ∞ ( ¯ H × G )) , | E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] − E [ f ( L ′ P ( G P ))] | ∗ → (B.44)almost surely. Thus by (B.44), together with Lemma 1.2.2(iii) of van der Vaart and Wellner (1996),we have that | ( E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ]) ∗ − E [ f ( L ′ P ( G P ))] | → (B.45)almost surely for every f ∈ BL ( ℓ ∞ ( ¯ H × G )) . By Lemma 1.2.6 (Fubini’s theorem) of van der Vaartand Wellner (1996), E ∗ [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] ≥ E ∗ [ E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ]] ≥ E ∗ [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] . (B.46)Then by Lemma 1.2.1 of van der Vaart and Wellner (1996) and (B.43), we have that E ∗ [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] = E [( E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ]) ∗ ] + o (1) . (B.47)Now with (B.45) we can conclude that | E ∗ [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ] − E [ f ( L ′ P ( G P ))] | = | E [( E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ]) ∗ ] + o (1) − E [ f ( L ′ P ( G P ))] |≤ E [ | ( E W [ f {√ n ( L ( ˆ P Bn ) − L ( ˆ P n )) } ]) ∗ − E [ f ( L ′ P ( G P ))] | ] + o (1) → for every f ∈ BL ( ℓ ∞ ( ¯ H × G )) , where the equality is from (B.47) and the convergence is by thedominated convergence theorem together with the almost sure convergence in (B.45). This impliesthat √ n ( L ( ˆ P Bn ) − L ( ˆ P n )) L ′ P ( G P ) unconditionally. Similarly, by (B.39) and (B.40) we can easilyshow that √ n ( ˆ P Bn − ˆ P n ) G P unconditionally. Thus we can conclude that ˆ P Bn − ˆ P n → in outerprobability by Lemma 1.10.2(iii) of van der Vaart and Wellner (1996). By Lemma A.6 in this paperand Lemmas 1.9.3 and 1.2.2(i) of van der Vaart and Wellner (1996), we have that ˆ P Bn → P in outerprobability, and hence T Bn /n → Λ( P ) and M (ˆ σ BP n ) → M ( σ P ) in outer probability by Theorem 1.9.5(continuous mapping) of van der Vaart and Wellner (1996). By Lemma 1.10.2(iii), Example 1.4.7(Slutsky’s lemma), and Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner (1996), p T Bn ( L ( ˆ P Bn ) − L ( ˆ P n )) / M (ˆ σ BP n ) G / M ( σ P ) unconditionally. This veriﬁes (ii) of the Lemma.(iii). This claim holds naturally under our constructions.To explore the property of the bootstrap test statistic, we introduce the following notation. For27ll sets A , A ⊂ ¯ H × G , deﬁne −→ d H ( A , A ) = sup a ∈ A inf b ∈ A ρ P ( a, b ) and d H ( A , A ) = max n −→ d H ( A , A ) , −→ d H ( A , A ) o . Also, deﬁne \ Ψ ¯ H×G = ( ( h, g ) ∈ ¯ H × G : p T n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ φ P n ( h, g ) M (ˆ σ P n ) ( ξ , h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ τ n ) , (B.48)where ξ and τ n are as in (27). Notice the difference between \ Ψ H×G in (27) and \ Ψ ¯ H×G in (B.48).Clearly, \ Ψ H×G ⊂ \ Ψ ¯ H×G . Lemma B.9

Under Assumptions 3.1 and 3.2, if the H in (13) is true with Q = P n for all n , then d H ( \ Ψ ¯ H×G , Ψ ¯ H×G ) → in outer probability, where Ψ ¯ H×G is deﬁned as in (24) . Proof of Lemma B.9.

First, under the assumptions, we have that for all ε > , lim n →∞ P ∗ (cid:16) −→ d H (cid:16) Ψ ¯ H×G , \ Ψ ¯ H×G (cid:17) > ε (cid:17) ≤ lim n →∞ P ∗ (cid:16) Ψ ¯ H×G \ \ Ψ ¯ H×G = ∅ (cid:17) ≤ lim n →∞ P ∗ sup ( h,g ) ∈ ¯ H×G p T n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ φ P n ( h, g ) − φ P ( h, g ) ξ ∨ ˆ σ P n ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > τ n ! . By Theorem 3.1, √ T n ( ˆ φ P n − φ P ) G . By Lemma B.3, ˆ σ P n → σ P almost uniformly, which impliesthat ˆ σ P n σ P by Lemmas 1.9.3(ii) and 1.10.2(iii) of van der Vaart and Wellner (1996). Thus byExample 1.4.7 (Slutsky’s lemma) and Theorem 1.3.6 (continuous mapping) of van der Vaart andWellner (1996), sup ( h,g ) ∈ ¯ H×G p T n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ φ P n ( h, g ) − φ P ( h, g ) ξ ∨ ˆ σ P n ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup ( h,g ) ∈ ¯ H×G (cid:12)(cid:12)(cid:12)(cid:12) G ( h, g ) ξ ∨ σ P ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12) . Since τ n → ∞ , we have that lim n →∞ P ∗ ( −→ d H (Ψ ¯ H×G , \ Ψ ¯ H×G ) > ε ) = 0 .Next, consider −→ d H ( \ Ψ ¯ H×G , Ψ ¯ H×G ) . Deﬁne d (( h, g ) , A ) = inf ( h ′ ,g ′ ) ∈ A ρ P (( h, g ) , ( h ′ , g ′ )) for all ( h, g ) ∈ ¯ H × G and all subsets A ⊂ ¯ H × G . For each ε > , deﬁne ˜ D ε = (cid:8) ( h, g ) ∈ ¯ H × G : d (cid:0) ( h, g ) , Ψ ¯ H×G (cid:1) ≥ ε (cid:9) . The product space ¯ H × G is compact under ρ P by Lemma A.8. Suppose there exist { ( h n , g n ) } n ⊂ ˜ D ε ( h n , g n ) → ( h, g ) for some ( h, g ) ∈ ¯ H × G . Then d (cid:0) ( h, g ) , Ψ ¯ H×G (cid:1) = inf ( h ′ ,g ′ ) ∈ Ψ ¯ H×G ρ P (( h, g ) , ( h ′ , g ′ )) ≥ inf ( h ′ ,g ′ ) ∈ Ψ ¯ H×G ρ P (( h n , g n ) , ( h ′ , g ′ )) − ρ P (( h, g ) , ( h n , g n )) ≥ ε − ρ P (( h, g ) , ( h n , g n )) , which is true for all n . Letting n → ∞ gives d (cid:0) ( h, g ) , Ψ ¯ H×G (cid:1) ≥ ε . This implies that ˜ D ε is closed in ¯ H × G , which is compact, and thus ˜ D ε is compact. If ˜ D ε = ∅ , then clearly lim n →∞ P ∗ (cid:16) −→ d H (cid:16) \ Ψ ¯ H×G , Ψ ¯ H×G (cid:17) > ε (cid:17) = lim n →∞ P ∗  sup ( h,g ) ∈ \ Ψ ¯ H×G inf ( h ′ ,g ′ ) ∈ Ψ ¯ H×G ρ P (( h, g ) , ( h ′ , g ′ )) > ε  = 0 . If ˜ D ε = ∅ , then there is a δ ε > such that inf ( h,g ) ∈ ˜ D ε | φ P ( h, g ) | > δ ε , since φ P is continuous byLemma B.4. Also, ˆ σ P n is uniformly bounded in ( h, g ) and ω , so there is a δ ′ ε > such that for all ω ∈ Ω , inf ( h,g ) ∈ ˜ D ε | φ P ( h, g ) / ( ξ ∨ ˆ σ P n ( h, g )) | > δ ′ ε . Thus if ˜ D ε = ∅ , we have lim n →∞ P ∗ (cid:16) −→ d H (cid:16) \ Ψ ¯ H×G , Ψ ¯ H×G (cid:17) > ε (cid:17) = lim n →∞ P ∗  sup ( h,g ) ∈ \ Ψ ¯ H×G inf ( h ′ ,g ′ ) ∈ Ψ ¯ H×G ρ P (( h, g ) , ( h ′ , g ′ )) > ε  ≤ lim n →∞ P ∗  sup ( h,g ) ∈ \ Ψ ¯ H×G \ Ψ ¯ H×G (cid:12)(cid:12)(cid:12)(cid:12) φ P ( h, g ) ξ ∨ ˆ σ P n ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12) > δ ′ ε , sup ( h,g ) ∈ \ Ψ ¯ H×G \ Ψ ¯ H×G p T n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ φ P n ( h, g ) ξ ∨ ˆ σ P n ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ τ n  . By Lemma B.3, we have that ˆ φ P n → φ P almost uniformly. Thus there is a measurable set A with P ( A ) ≥ − ε such that for sufﬁciently large n , sup ( h,g ) ∈ \ Ψ ¯ H×G \ Ψ ¯ H×G (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ φ P n ( h, g ) ξ ∨ ˆ σ P n ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ sup ( h,g ) ∈ \ Ψ ¯ H×G \ Ψ ¯ H×G (cid:12)(cid:12)(cid:12)(cid:12) φ P ( h, g ) ξ ∨ ˆ σ P n ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12) − δ ′ ε uniformly on A . Thus we now have that lim n →∞ P ∗ (cid:16) −→ d H (cid:16) \ Ψ ¯ H×G , Ψ ¯ H×G (cid:17) > ε (cid:17) ≤ lim n →∞ P ∗  n sup ( h,g ) ∈ \ Ψ ¯ H×G \ Ψ ¯ H×G (cid:12)(cid:12)(cid:12) φ P ( h,g ) ξ ∨ ˆ σ Pn ( h,g ) (cid:12)(cid:12)(cid:12) > δ ′ ε o ∩ n sup ( h,g ) ∈ \ Ψ ¯ H×G \ Ψ ¯ H×G √ T n (cid:12)(cid:12)(cid:12) ˆ φ Pn ( h,g ) ξ ∨ ˆ σ Pn ( h,g ) (cid:12)(cid:12)(cid:12) ≤ τ n o ∩ A  + P ( A c ) ≤ lim n →∞ P ∗ r T n n δ ′ ε < sup ( h,g ) ∈ \ Ψ ¯ H×G \ Ψ ¯ H×G r T n n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ φ P n ( h, g ) ξ ∨ ˆ σ P n ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ τ n √ n  + ε = ε, because τ n / √ n → as n → ∞ . Here, ε can be arbitrarily small. Proof of Theorem 3.2. (i). Fix ψ ∈ C (cid:0) Ξ × ¯ H × G (cid:1) under the ρ ξhg deﬁned in (B.27). It is easyto show that Ξ × ¯ H × G is compact under ρ ξhg , and thus ψ is uniformly continuous on Ξ × ¯ H × G .This implies that for every ε > , there is a δ > such that | ψ ( ξ ′ , h ′ , g ′ ) − ψ ( ξ, h, g ) | ≤ ε/ν (Ξ) for29ll ( ξ, h, g ) , ( ξ ′ , h ′ , g ′ ) ∈ Ξ × ¯ H × G with ρ ξhg (( ξ ′ , h ′ , g ′ ) , ( ξ, h, g )) ≤ δ . Also, by the constructions of Ψ ¯ H×G in (24) and \ Ψ ¯ H×G in (B.48), we have that (cid:12)(cid:12)(cid:12)

I ◦ S \ Ψ ¯ H×G ( ψ ) − I ◦ S Ψ ¯ H×G ( ψ ) (cid:12)(cid:12)(cid:12) ≤ ν (Ξ) sup ρ ξhg (( ξ ′ ,h ′ ,g ′ ) , ( ξ,h,g )) ≤ d H (cid:16) \ Ψ ¯ H×G , Ψ ¯ H×G (cid:17) | ψ ( ξ ′ , h ′ , g ′ ) − ψ ( ξ, h, g ) | . By Lemma B.9, this implies that P ∗ (cid:16)(cid:12)(cid:12)(cid:12) I ◦ S \ Ψ ¯ H×G ( ψ ) − I ◦ S Ψ ¯ H×G ( ψ ) (cid:12)(cid:12)(cid:12) > ε (cid:17) ≤ P ∗ (cid:16) d H (cid:16) \ Ψ ¯ H×G , Ψ ¯ H×G (cid:17) > δ (cid:17) → . Notice that |I ◦ S \ Ψ ¯ H×G ( ψ ) − I ◦ S \ Ψ ¯ H×G ( ψ ) | ≤ ν (Ξ) k ψ − ψ k ∞ for all ψ , ψ ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) .By Lemma S.3.6 of Fang and Santos (2018),

I ◦ S \ Ψ ¯ H×G satisﬁes Assumption 4 of Fang and Santos(2018). Together with Lemma B.8, by repeating the proof of Theorem 3.2 of Fang and Santos (2018)with G Bn = p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n ) , where G Bn replaces G ∗ n in their notation, we can show that sup f ∈ BL ( R ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E W  f  I ◦ S \ Ψ ¯ H×G  p T Bn (cid:16) ˆ φ BP n − ˆ φ P n (cid:17) M (cid:0) ˆ σ BP n (cid:1)  − E (cid:20) f (cid:26) I ◦ S Ψ ¯ H×G (cid:18) G M ( σ P ) (cid:19)(cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) → (B.49)in outer probability, where G is the limit obtained in Lemma B.8 and G / M ( σ P ) is tight by LemmaB.8(i). Since the sample is ﬁnite, that is, we have only ﬁnitely many observations { ( Y i , D i , Z i ) } ni =1 in the data set, by the constructions of \ Ψ H×G in (27) and \ Ψ ¯ H×G in (B.48) we have that

I ◦ S \ Ψ H×G  p T Bn (cid:16) ˆ φ BP n − ˆ φ P n (cid:17) M (cid:0) ˆ σ BP n (cid:1)  = I ◦ S \ Ψ ¯ H×G  p T Bn (cid:16) ˆ φ BP n − ˆ φ P n (cid:17) M (cid:0) ˆ σ BP n (cid:1)  . (B.50)Then (B.49) and (B.50) imply that sup f ∈ BL ( R ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E W  f  I ◦ S \ Ψ H×G  p T Bn (cid:16) ˆ φ BP n − ˆ φ P n (cid:17) M (cid:0) ˆ σ BP n (cid:1)  − E (cid:20) f (cid:26) I ◦ S Ψ ¯ H×G (cid:18) G M ( σ P ) (cid:19)(cid:27)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) → (B.51)in outer probability. Let F denote the CDF of I ◦ S Ψ ¯ H×G ( G / M ( σ P )) , and deﬁne ˆ F n by ˆ F n ( c ) = P  I ◦ S \ Ψ H×G  p T Bn (cid:16) ˆ φ BP n − ˆ φ P n (cid:17) M (cid:0) ˆ σ BP n (cid:1)  ≤ c (cid:12)(cid:12)(cid:12)(cid:12) { ( Y i , D i , Z i ) } ∞ i =1  . Since by assumption F is continuous and increasing at c − α , by a proof similar to that of Theorem This conditional probability given { ( Y i , D i , Z i ) } ∞ i =1 is numerically equal to that given { ( Y i , D i , Z i ) } ni =1 in (31). ε > , P ∗ ( | ˆ c − α − c − α | > ε ) → . (B.52)By the deﬁnitions of G (in the proof of Lemma 3.1) and G (in Lemma B.8), together with thelinearity of L ′ P , we have that G = G + Λ( P ) / L ′ P ( Q ) . Let H n = √ n ( P n − P ) . By Lemma B.2, k H n − Q k ∞ → as n → ∞ . Notice that P n = P + n − / H n . By Lemma B.1, we have that lim n →∞ sup ( h,g ) ∈ Ψ ¯ H×G (cid:12)(cid:12)(cid:12)(cid:12) L ( P n ) ( h, g ) − L ( P ) ( h, g ) n − / − L ′ P ( Q ) ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ lim n →∞ sup ( h,g ) ∈ ¯ H×G (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) L (cid:0) P + n − / H n (cid:1) ( h, g ) − L ( P ) ( h, g ) n − / − L ′ P ( Q ) ( h, g ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 0 . (B.53)By construction, L ( P ) = 0 on Ψ ¯ H×G because L ( P ) = φ P . By assumption, we have that L ( P n ) = φ P n ≤ on Ψ ¯ H×G and (B.53) implies that L ′ P ( Q ) ≤ on Ψ ¯ H×G . Thus we have that G ≤ G and I ◦ S Ψ ¯ H×G ( G / M ( σ P )) ≤ I ◦ S Ψ ¯ H×G ( G / M ( σ P )) . Since G / M ( σ P ) ∈ ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) , where ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) is a Banach space under k·k ∞ and G is tight by Lemma 3.1, we have that G / M ( σ P ) is tight (hence separable ) and is Radon by Theorem 7.1.7 of Bogachev (2007). Since I ◦ S Ψ ¯ H×G is continuous and convex, Theorem 11.1(i) of Davydov et al. (1998) implies that the CDF of

I ◦S Ψ ¯ H×G ( G / M ( σ P )) is everywhere continuous except possibly at the point r = inf (cid:8) r : P (cid:0) I ◦ S Ψ ¯ H×G ( G / M ( σ P )) ≤ r (cid:1) > (cid:9) .Because I ◦ S Ψ ¯ H×G ( G / M ( σ P )) ≤ I ◦ S Ψ ¯ H×G ( G / M ( σ P )) , we have that r ≤ inf (cid:8) r : P (cid:0) I ◦ S Ψ ¯ H×G ( G / M ( σ P )) ≤ r (cid:1) > (cid:9) < c − α , where the last inequality follows from that the CDF of I ◦ S Ψ ¯ H×G ( G / M ( σ P )) is continuous andincreasing at c − α . This implies that I ◦ S Ψ ¯ H×G ( G / M ( σ P )) is continuous at c − α . Now by (25) and(B.52) in this paper, together with Example 1.4.7 (Slutsky’s lemma), Theorem 1.3.6 (continuousmapping), and Theorem 1.3.4(vi) of van der Vaart and Wellner (1996), we conclude that lim n →∞ P ∗ p T n I ◦ S ˆ φ P n M (ˆ σ P n ) ! > ˆ c − α ! = P (cid:18) I ◦ S Ψ ¯ H×G (cid:18) G M ( σ P ) (cid:19) > c − α (cid:19) ≤ α, (B.54)where the inequality follows from that c − α is the − α quantile for I ◦ S Ψ ¯ H×G ( G / M ( σ P )) . If, inaddition, P n = P for all n , then by Assumption 3.2 we have that v = 0 and hence Q = 0 . Thisimplies that G = G and that the inequality in (B.54) holds with equality.(ii). Let ˆ c ′ − α be the bootstrap critical value obtained using the bootstrap test statistic I ◦ See the deﬁnition of separability in van der Vaart and Wellner (1996, p. 17). The closure of a separable subset of a metricspace is separable. ( p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n )) in place of I ◦ S \ Ψ H×G ( p T Bn ( ˆ φ BP n − ˆ φ P n ) / M (ˆ σ BP n )) in the test pro-cedure in Section 3.1.1. By arguments similar to those in the proof of part (i), we can show that ˆ c ′ − α → c ′ − α in outer probability, where c ′ − α is the − α quantile for I ◦ S ( G / M ( σ P )) . Clearly, ˆ c ′ − α ≥ ˆ c − α by construction. By Lemma B.3, ˆ φ P n / M (ˆ σ P n ) → φ P / M ( σ P ) in ℓ ∞ (cid:0) Ξ × ¯ H × G (cid:1) almost uniformly, and hence almost uniformly

I ◦ S

H×G ˆ φ P n M (ˆ σ P n ) ! → I ◦ S H×G (cid:18) φ P M ( σ P ) (cid:19) > , where the inequality follows from the assumption that the H in (13) is false with Q = P . Thus wehave that [ I◦S

H×G ( √ T n ˆ φ P n / M (ˆ σ P n ))] − → almost uniformly ( T n /n → Λ( P ) almost uniformly byLemma B.3). By Lemma 1.9.3(ii) and 1.10.2(iii), Example 1.4.7 (Slutsky’s lemma), and Theorems1.3.6 (continuous mapping) and 1.3.4(vi) of van der Vaart and Wellner (1996), we now concludethat P ∗ I ◦ S

H×G √ T n ˆ φ P n M (ˆ σ P n ) ! > ˆ c − α ! ≥ P ∗ I ◦ S

H×G √ T n ˆ φ P n M (ˆ σ P n ) ! > ˆ c ′ − α ! → . C Additional Monte Carlo Studies

The Monte Carlo experiments discussed in this section followed the design of Kitagawa (2015),where the treatment and the instrument were both binary, with D ∈ { , } and Z ∈ { , } , andwe compared our results with theirs. We simulated the limiting rejection rates using the approachproposed in the present paper and that proposed by Kitagawa (2015) with the same randomlygenerated data. In this special case, if the measure ν is set to be a Dirac measure, the asymptoticdistribution of the test statistic under null can be written as sup f ∈F ∗ b G H ( f ) / ( ξ ∨ σ H ( f )) in (32).Since the test proposed by Kitagawa (2015) constructed the critical value based on the upper bound sup f ∈F b G H ( f ) / ( ξ ∨ σ H ( f )) in (32), to show the power improvement of the proposed test on a ﬁnitesample more clearly, we constructed the critical value using sup f ∈F ∗ b G H ( f ) / ( ξ ∨ σ H ( f )) instead of I ◦ S Ψ H×G ( G / M ( σ P )) , which is equivalent to it in distribution. That is, we approximated G H and σ H by G BH and σ BH following the bootstrap method of Kitagawa (2015). Then we estimated F ∗ b by c F ∗ b in a way similar to (27), which is the key difference between our approach and that of Kitagawa(2015). Last, we constructed the bootstrap test statistic from sup f ∈ c F ∗ b G BH ( f ) / ( ξ ∨ σ BH ( f )) and usedit to create the critical value. Because of c F ∗ b , our bootstrap test statistic can approximate the nulldistribution consistently and the power of the test can be improved. This new bootstrap test statisticis asymptotically equivalent to that in (30), and the new critical value is asymptotically equivalentto ˆ c − α in Section 3.1.1. Here, we implicitly assume that the CDF of

I ◦ S ( G / M ( σ P )) is continuous and strictly increasing at c ′ − α . Theorem11.1 of Davydov et al. (1998) implies that the CDF of I ◦ S ( G / M ( σ P )) is differentiable and has a positive derivative every-where except at countably many points in its support, provided that I ◦ S ( G / M ( σ P )) is not a constant. By construction, I ◦ S ( G / M ( σ P )) is not a constant in general cases. Monte Carlo iterations and bootstrap iterations. For eachDGP, the measure ν was set to a Dirac measure centered at ξ = 0 . , . , . , and . The nominalsigniﬁcance level α was set to . . C.1 Size Control and Tuning Parameter Selection

We ﬁrst ran simulations to investigate the size of the test and the selection of the tuning parame-ter. As suggested in Section 4, for sample sizes less than , we can use τ n = 2 for the tuningparameter. In this set of simulations, we set n = 2000 and τ n = 1 , , , , ∞ . For the DGP, we used U ∼ Unif(0 , , V ∼ Unif(0 , , N ∼ N(0 , , N ∼ N(1 , , Z = 1 { U ≤ . } , D = 1 { V ≤ . } , D = 1 { V ≤ . } , D = P z =0 { Z = z } × D z , and Y = P d =0 { D = d } × N d , where U , V , N , and N were mutually independent. This DGP is equivalent to that used by Kitagawa (2015) to show thesize control of their test. The results in Table 4 conﬁrmed the conclusion from Table 1: For τ n = 2 ,the rejection rates were close to those for τ n = ∞ and close to the nominal size. Recall that a smallertuning parameter τ n yields greater power for the test. Thus we kept using τ n = 2 in this case.Table 4: Rejection Rates under H for Binary D and Binary Zτ n ξ ∞ C.2 Power Comparison

Four DGPs were considered for the power comparisons. The sample sizes were set to n = 200 , , , , and , and the tuning parameter was set to τ n = 2 . The probability P ( Z = 1) = r n with r n = 1 / , / , / , / , and / for the corresponding sample sizes. We let U ∼ Unif(0 , , V ∼ Unif(0 , , W ∼ Unif(0 , , Z = 1 { U ≤ r n } , D = 1 { V ≤ . } , D = 1 { V ≤ . } , D = P z =0 { Z = z } × D z , N ∼ N(0 , , N ∼ N(0 , , and N ∼ N(0 , .(1): N ∼ N( − . , and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) .(2): N ∼ N(0 , . ) and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) .(3): N ∼ N(0 , . ) and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) .(4): N a ∼ N( − , . ) , N b ∼ N( − . , . ) , N c ∼ N(0 , . ) , N d ∼ N(0 . , . ) , N e ∼ N(1 , . ) , N = 1 { W ≤ . } × N a + 1 { . < W ≤ . } × N b + 1 { . . } × N e , and Y = P z =0 { Z = z } × ( P d =0 { D = d } × N dz ) . 33ll the variables U , V , N , N , N , and N were set to be mutually independent for each DGP.Table 5 shows a comparison of the powers of the two tests. The results suggest that the proposedtest achieves a manifest power improvement over that of Kitagawa (2015).Table 5: Rejection Rates under H for Binary D and Binary Z DGP n The Proposed Test Test of Kitagawa (2015) ξ ξ

References

Aliprantis, C. D. and Border, K. (2006).

Inﬁnite Dimensional Analysis: A Hitchhiker’s Guide . SpringerScience & Business Media.Bogachev, V. I. (2007).

Measure Theory , volume 2. Springer Science & Business Media.Davydov, Y. A., Lifshits, M. A., and Smorodina, N. V. (1998).

Local Properties of Distributions ofStochastic Functionals , volume 173. American Mathematical Society.Fang, Z. and Santos, A. (2018). Inference on directionally differentiable functions.

The Review ofEconomic Studies , 86(1):377–412.Folland, G. B. (1999).

Real Analysis: Modern Techniques and Their Applications . John Wiley & Sons.Kitagawa, T. (2015). A test for instrument validity.

Econometrica , 83(5):2043–2063.34ollard, D. (1990). Empirical processes: Theory and applications. In

NSF-CBMS Regional ConferenceSeries in Probability and Statistics , pages i–86. JSTOR.Shapiro, A. (1990). On concepts of directional differentiability.

Journal of Optimization Theory andApplications , 66(3):477–487.van der Vaart, A. W. and Wellner, J. A. (1996).