Adaptive, Rate-Optimal Testing in Instrumental Variables Models∗

Christoph Breunig†  Xiaohong Chen‡

First version: August 2018. Revised: June 18, 2020.
This paper proposes simple, data-driven, optimal rate-adaptive inferences on a structural function in semi-nonparametric conditional moment restrictions. We consider two types of hypothesis tests based on leave-one-out sieve estimators. A structure-space test (ST) uses a quadratic distance between the structural functions of endogenous variables, while an image-space test (IT) uses a quadratic distance of the conditional moment from zero. For both tests, we analyze their respective classes of nonparametric alternative models that are separated from the null hypothesis by the minimax rate of testing. That is, the sum of the type I and the type II errors of the test, uniformly over the class of nonparametric alternative models, cannot be improved by any other test. Our new minimax rate of the ST differs from the known minimax rate of estimation in nonparametric instrumental variables (NPIV) models. We propose computationally simple and novel exponential-scan data-driven choices of sieve regularization parameters and adjusted chi-squared critical values. The resulting tests attain the minimax rate of testing, and hence optimally adapt to the unknown smoothness of functions and are robust to the unknown degree of ill-posedness (endogeneity). Data-driven confidence sets are easily obtained by inverting the adaptive ST. Monte Carlo studies demonstrate that our adaptive ST has good size and power properties in finite samples for testing monotonicity or equality restrictions in NPIV models. Empirical applications to nonparametric multi-product demands with endogenous prices are presented.
Keywords: Instrumental variables; Minimax rate of testing; Adaptive testing; Exponential scan; Confidence sets; Quadratic functionals; Shape restrictions.

∗ We thank Don Andrews, Tim Armstrong, Denis Chetverikov, Tim Christensen, Enno Mammen, Peter Mathe and others at numerous workshops and conferences for helpful comments. The empirical work here is researchers' own analyses based in part on data from The Nielsen Company (US), LLC and marketing databases provided through the Nielsen Datasets at the Kilts Center for Marketing Data Center at The University of Chicago Booth School of Business. The conclusions drawn from the Nielsen data are those of the researchers and do not reflect the views of Nielsen. Nielsen is not responsible for, had no role in, and was not involved in analyzing and preparing the results reported herein.
† Department of Economics, Emory University, Rich Memorial Building, Atlanta, GA 30322, USA. Email: [email protected]
‡ Cowles Foundation for Research in Economics, Yale University, Box 208281, New Haven, CT 06520, USA. Email: [email protected]

1. Introduction

Models with endogeneity are pervasive in economics, and endogeneity is one of the most distinguishing features that differentiate econometrics from statistics. In the big data era, semiparametric and nonparametric methods and models allowing for flexible endogeneity are increasingly widely used in empirical research.
A common difficulty in applying any semiparametric or nonparametric method in practice is how to choose tuning (regularization) parameters in a simple, data-driven way that still possesses some "optimal" theoretical properties. For nonparametric models with endogeneity, such as nonparametric instrumental variables (NPIV) and nonparametric quantile instrumental variables (NPQIV) models, it is well known that the finite-sample performance of various estimators and tests is much more sensitive to tuning parameters than in models without endogeneity.

There are a few papers on data-driven choices of regularization parameters in estimation of a NPIV model. However, it is well known that data-driven choices designed for "optimal" nonparametric estimation do not lead to "optimal" inference (testing and confidence sets) in nonparametric settings (see, e.g., Gine and Nickl [2016]). To the best of our knowledge, there is currently no work on minimax rate-optimal testing for NPIV models, nor on data-driven choices of regularization parameters for rate-optimal testing and confidence sets in NPIV models. In this paper we address this important issue within the framework of rate-optimal testing in semiparametric and nonparametric conditional moment restrictions. As a leading example, we provide computationally simple, data-driven choices of tuning parameters for optimal inference on NPIV functions, such as multi-product demand functions (of endogenous prices) in industrial organization.

This paper first considers minimax rate-optimal hypothesis testing in semiparametric or nonparametric models defined by conditional moment restrictions.
The maintained modeling assumption is that there is a nonparametric structural function h satisfying

E[ρ(Y, h(X)) | W] = 0,  (1.1)

where ρ is a possibly non-smooth mapping that is known up to the function h, X is a d_x-dimensional vector of continuous endogenous regressors, W is a d_w-dimensional vector of conditioning (instrumental) variables, and the joint distribution of (Y, X, W) is unspecified. Our goal is to test whether h coincides with a restricted function h_r, such as a parametric, semiparametric, or shape-restricted function. For example, in a NPIV model E[Y − h(X) | W] = 0 we would be interested in testing whether the structural NPIV function h coincides with some decreasing function h_r. We propose two statistics for this hypothesis testing problem. The ST is based on the squared distance between (estimators of) h and its restricted version h_r. The IT is based on the squared distance between a nonparametric estimator of E[ρ(Y, h_r(X)) | W] and zero.

(On data-driven regularization choices for NPIV estimation, see, e.g., Horowitz [2014], Liu and Tao [2014], Centorrino [2014], Chen and Christensen [2015], Breunig and Johannes [2016], Gautier and Le Pennec [2018], Jansson and Pouzo [2019] and the references therein. Most of these papers suggest data-driven procedures without establishing rate-adaptivity in NPIV estimation.)

Both test statistics are based on simple leave-one-out sieve estimators of quadratic functionals. We establish an upper bound on the sum of the type I and type II errors. Specifically, we bound the type I error uniformly over distributions satisfying the null hypothesis, and the type II error uniformly over a class of nonparametric alternative models separated from the null hypothesis by a so-called rate of testing. We then establish a lower bound for the sum of the type I and type II errors at the same separation rate. Thus, no other test provides a better performance with respect to the sum of these errors.
This optimal rate of separation is called the minimax rate of testing.

A key technical step in establishing the minimax rate of our ST in quadratic distance is to derive a tight upper bound on the convergence rate of a leave-one-out sieve estimator of a quadratic functional of a NPIV function h; see the online Appendix F. This rate differs from the existing minimax rate of convergence, in root-mean-squared error, of any consistent estimator of the function h itself. However, the minimax rate of our ST depends on an optimal choice of sieve dimension (the key tuning parameter), which is determined by the unknown degree of ill-posedness (due to the endogeneity in (1.1)) and the unknown regularity of the nonparametric alternative functions that differ from the null restricted functions.

We propose a computationally simple, data-driven version of our ST that requires a priori knowledge of neither the smoothness of the nonparametric alternative functions h nor the degree of ill-posedness. The data-driven test rejects the null hypothesis as soon as there is a sieve dimension (say, the smallest sieve dimension) in an admissible index set such that the corresponding normalized quadratic distance estimator exceeds one; it fails to reject the null when the maximal (over the admissible index set) normalized quadratic distance estimator is less than or equal to one. The cardinality of this admissible index set is determined by a novel exponential-scan (ES) method that automatically takes the unknown degree of ill-posedness (endogeneity) into account. We show that our data-driven ST attains the minimax optimal rate for severely ill-posed problems and is within a log log(n) term of it for mildly ill-posed problems, where n is the sample size.
This extra log log(n) term is the price to pay for adaptivity to the unknown smoothness of the nonparametric alternative functions that differ from the null restricted function h_r.

By inverting the adaptive tests we obtain confidence sets on restricted (constrained) functions h_r. These confidence sets do not require additional choices of tuning parameters. The adaptive minimax rate of testing determines the radius of the confidence sets. We argue that the radius based on our adaptive ST can only be marginally improved, in a very limited range of submodels, depending on the regularity of the unknown function h in NPIV models.

(We prove in Appendix F that this convergence rate coincides with the lower bound derived by Chen and Christensen [2018] for estimation of a quadratic functional of a NPIV function. As shown in Chen and Christensen [2018], the plug-in sieve estimator does not achieve the optimal minimax rate for estimation of a quadratic functional of a NPIV in the mildly ill-posed case.)

Monte Carlo studies indicate that our data-driven ST is not only computationally very fast, but also has accurate size and good power in finite samples, without the need for computationally intensive bootstrap critical values. In a simulation study of hypothesis testing for monotonicity of a NPIV function, our adaptive ST automatically leads to a data-driven confidence set under monotonicity restrictions when the null is not rejected. When the null of monotonicity is rejected by our adaptive ST, the data-driven choice of the smallest sieve dimension leading to rejection can still yield a consistent sieve estimate of the unrestricted NPIV function h, while parts of the true h and its sieve estimate lie outside the monotonicity-constrained confidence sets. This simulation demonstrates the importance of a data-driven choice of tuning parameters for testing shapes of a NPIV function.
We provide empirical applications concerning shapes of consumer demand, where our data-driven test detects heterogeneity in the curvature of demand curves across income groups. For instance, our adaptive ST fails to reject that the demand for certain nondurable goods is decreasing (in own price) for low-income households, but does reject the decreasing shape for high-income households. It may therefore lead to erroneous policy evaluations when a nonparametric decreasing demand (in own price) is imposed across all income levels.

Our main contribution is the data-driven, rate-optimal hypothesis test in structure space. For comparison, we also present the minimax rate-adaptive image-space test (IT). Although both are simple to implement, their data-driven procedures choose different key tuning parameters to achieve their respective minimax optimal rates of testing. The sieve dimension J for approximating h(X) is the key tuning parameter in the ST approach, while the sieve dimension K for approximating the conditional moment function E[ρ(Y, h_r(X)) | W] is the key tuning parameter in the IT approach for a simple or parametric null hypothesis. The adaptive ST has the advantage of automatically providing a data-driven choice of the sieve dimension J that can be used for estimation of a semi-nonparametric or shape-restricted function h_r. This greatly simplifies the construction of data-driven confidence sets. In addition, we show both theoretically and via Monte Carlo simulations that the adaptive ST can be more powerful than the adaptive IT when the dimension of the conditional instruments is larger than the dimension of the endogenous regressors (i.e., d_w > d_x). On the other hand, the image-space test (IT) is more convenient for non-separable models such as nonparametric quantile IV regressions, as well as for partially identified models.
Literature review: The concept of a minimax rate of testing in nonparametric models was perhaps first introduced by Ingster [1993] and Spokoiny [1996]. It has been applied to optimal testing in nonparametric regression models without endogeneity, including Horowitz and Spokoiny [2001], Guerre and Lavergne [2005] and others. Our paper is the first to study minimax rate-optimal tests in nonparametric conditional moment restrictions with endogeneity, including the NPIV model as a leading example.

There are papers on specification tests for NPIV-type models that extend Bierens [1990]'s test for conditional moment restrictions to models allowing for functions depending on endogenous regressors; see, e.g., Horowitz [2006], Breunig [2015], Santos [2012], Tao [2014], Chernozhukov et al. [2015], Zhu [2020] and the references therein. These tests are similar to what we call the image-space test. Among these papers, Chernozhukov et al. [2015] is the most general, providing inference on equality and/or inequality constrained conditional moment restrictions allowing for partial identification. Chen and Pouzo [2015] provide inference results using either sieve Wald ("structure-space") or sieve QLR ("image-space") tests for general point-identified semi-nonparametric conditional moment restriction models. Chetverikov and Wilhelm [2017] studied mean-squared-rate sieve estimation of a NPIV function under a monotonicity restriction. Freyberger and Reeves [2019] considered confidence sets for a monotone NPIV function. Compiani [2019] also imposed a monotonicity restriction in his estimation of an IO demand NPIV function. None of these papers considers minimax tests for NPIV-type models or data-driven choices of key tuning parameters. Our paper is the first to propose simple, adaptive structure- and image-space tests that achieve minimax rate-optimality.
In addition, we provide confidence sets for NPIV functions h_r based on a data-driven choice of tuning parameters.

The remainder of the paper is organized as follows. Section 2 describes our data-driven structure-space test (ST); it also presents a simulation study of adaptive testing for monotonicity in NPIV models. Section 3 first establishes the minimax optimal rate of the ST, and then shows that this optimal rate is attained (within a log log n term) by our data-driven ST procedure. Section 4 introduces the data-driven image-space test statistic and presents its minimax rate of testing and adaptivity. Section 5 provides three empirical illustrations; it also contains additional Monte Carlo studies comparing the finite-sample size and power properties of the adaptive ST and the adaptive IT. Section 6 briefly concludes. Appendices A and B contain proofs of the minimax rates for the ST and for the adaptive ST under a simple null hypothesis, respectively. The online supplementary appendices contain additional material: Appendix C presents robustness checks using bootstrap critical values for the empirical applications; Appendices D and E provide proofs of the optimal rates of the adaptive ST under composite null hypotheses and of the adaptive IT, respectively; Appendices F and G contain additional technical lemmas.

(This has a close connection to the robustness or sensitivity literature that has gained popularity in macroeconomics; see, e.g., Hansen and Sargent [2008].)

2. Preview of Adaptive Structure-Space Test

We first introduce the null and alternative hypotheses as well as the concept of the minimax rate of testing in Subsection 2.1. We then describe our new data-driven, rate-adaptive structure-space test (ST) for NPIV-type models in Subsection 2.2. The formal theoretical justifications are postponed to Section 3. Subsection 2.3 provides a simulation study of our adaptive test for monotonicity of structural functions.
Let H denote some class of functions. Let {(Y_i, X_i, W_i)}_{i=1}^n be a random sample from the distribution P_h of (Y, X, W), where h ∈ H, such that

E[Y − h(X) | W] = 0.

Let H_r denote a subset of functions in H that satisfy a conjectured restriction, such as monotonicity, concavity, some other shape restriction, or a parametric restriction. For any h ∈ H, we introduce h_r ∈ H_r such that E|E[h(X) − h_r(X) | W]|² ≤ E|E[h(X) − h̃(X) | W]|² for all h̃ ∈ H_r.

We analyze the null hypothesis that there exists a function h ∈ H with E[Y − h(X) | W] = 0 satisfying the conjectured restriction captured by H_r; specifically, that the set

H₀ := { h ∈ H : E[Y − h(X) | W] = 0 and ∫ (h(x) − h_r(x))² μ(x) dx = 0 }

is not empty. Here we measure the distance between restricted and unrestricted functions with a measure depending on a prespecified weighting function μ, which is restricted to be positive on the support of X. If we want to test that h coincides with h_r on some subset of the support of X only, then this modified null hypothesis can be implemented by changing μ accordingly. To analyze the power of any test against nonparametric alternatives, we require some separation between the null and the class of nonparametric alternatives. The resulting class of alternatives considered in this paper is given by

H₁(δ, r_n) := { h ∈ H : E[Y − h(X) | W] = 0 and ∫ (h(x) − h_r(x))² μ(x) dx ≥ δ r_n² }

for some constant δ > 0 and a sequence of positive numbers r_n. The rate r_n is also known as the rate of testing, and we establish its optimality in the minimax sense as described below. In this paper, we establish the minimax rate of testing r_n in the sense of Ingster [1993]: we propose a test which minimizes the sum of the maximal Type I error uniformly over H₀ and the maximal Type II error uniformly over H₁(δ, r_n). Moreover, we show that the sum of both errors cannot be improved by any other test.
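To fix ideas, the μ-weighted quadratic distance that separates the null from the alternatives can be computed numerically. A minimal sketch with illustrative functions (a weakly decreasing h_r, and an h that violates monotonicity for x > 5/7) and μ taken as the uniform weight on [0, 1]; none of these choices come from the paper:

```python
import numpy as np

def h_r(x):
    return 1.0 - x                  # weakly decreasing (null) function

def h(x):
    return 1.0 - x + 0.7 * x**2     # h'(x) > 0 for x > 5/7: violates H0

# Quadratic distance  ∫ (h(x) - h_r(x))^2 μ(x) dx  with μ = 1 on [0, 1],
# approximated by an average over a fine uniform grid.
x = np.linspace(0.0, 1.0, 100_001)
dist = np.mean((h(x) - h_r(x))**2)  # ≈ ∫ 0.49 x^4 dx = 0.098
print(round(float(dist), 3))        # → 0.098
```

A test then has to detect whether this distance is zero (the null) or exceeds the separation threshold δ r_n² (the alternative).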
The minimax rate of testing requires an optimal choice of tuning parameters depending on unknown smoothness properties of the structural function h and unknown mapping properties of the conditional expectation given W. We provide a data-driven extension of the minimax test, i.e., a testing procedure that adapts to the smoothness of the unrestricted function h in the presence of unknown smoothing properties of the conditional expectation operator.

In this section we describe our adaptive structure-space test for point-identified NPIV models. Our test builds on a leave-one-out series estimator of the quadratic distance ∫ (h(x) − h_r(x))² μ(x) dx. For a given sieve dimension J, it is given by

D̂_J(h_r) = 2/(n(n − 1)) Σ_{1≤i<i'≤n} (Y_i − h_r(X_i)) b^{K(J)}(W_i)' Â' G Â b^{K(J)}(W_{i'}) (Y_{i'} − h_r(X_{i'})),  (2.1)

where b^{K(J)}(·) is a vector of K(J) basis functions spanning the instrument space, G = ∫ ψ^J(x) ψ^J(x)' μ(x) dx is the Gram matrix of the basis ψ^J used to approximate h, and Â is the sample analog of the matrix A introduced in Section 3.1. Below, ĥ^r_J denotes the restricted sieve NPIV estimator of h_r with sieve dimension J.
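In matrix form, a statistic of this type sums a weighted kernel over all pairs i ≠ i′. A minimal numerical sketch, with random placeholder residuals and instrument basis, and an identity matrix standing in for the weighting Â′GÂ (none of these are the paper's estimated objects):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 5

# Placeholder ingredients: residuals u_i = Y_i - h_r(X_i), instrument
# basis vectors b(W_i), and Omega standing in for the weighting A' G A.
u = rng.normal(size=n)
B = rng.normal(size=(n, K))
Omega = np.eye(K)

# Pairwise kernel  k(i, i') = u_i * b(W_i)' Omega b(W_i') * u_{i'}.
M = u[:, None] * B                  # row i holds u_i * b(W_i)'
Kmat = M @ Omega @ M.T

# Leave-one-out U-statistic: average over off-diagonal entries only,
#   D_J = 2 / (n (n - 1)) * sum_{i < i'} k(i, i').
D_J = (Kmat.sum() - np.trace(Kmat)) / (n * (n - 1))
print(np.isfinite(D_J))             # → True
```

Dropping the diagonal (the leave-one-out step) removes the own-observation bias term, which is what makes the estimator of the quadratic functional centered under the null.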
For a nominal level α ∈ (0, 1) we define

ŜT_n = 1{ ∃ J ∈ Î_n such that Ŵ_J := n D̂_J(ĥ^r_J) / (η̂_J(α) v̂_J(ĥ^r_J)) > 1 },  (2.2)

where 1{·} denotes the indicator function and v̂_J, Î_n, and η̂_J(α) are defined as follows. First,

v̂_J(h) = ‖ n^{-1} Σ_{i=1}^n (Y_i − h(X_i))² G^{1/2} Â b^{K(J)}(W_i) b^{K(J)}(W_i)' Â' G^{1/2} ‖_F,  (2.3)

where ‖·‖_F denotes the Frobenius norm. Second, the index set Î_n is constructed via a simple exponential-scan (ES) procedure:

Î_n = { J ≤ Ĵ_max : J = 2^j J₀ where j = 0, 1, ..., j_max },  (2.4)

where J₀ := ⌊√(log log n)⌋, j_max := ⌈log₂(n^{1/3}/J₀)⌉, and

Ĵ_max = min{ J > J₀ : ζ(J) √(ℓ(J)(log n)/n) ≥ s_min((B'B/n)^{-1/2}(B'Ψ/n)G^{-1/2}) },

where ℓ(J) = 0.1 J, s_min(·) denotes the minimal singular value, and B and Ψ denote the n × K(J) and n × J matrices of basis functions evaluated at the observations W_i and X_i, respectively. Further, ζ(J) = √J for spline, wavelet, or trigonometric sieve bases, and ζ(J) = J for orthogonal polynomial bases.

Finally, the critical value η̂_J(α) is specified differently for testing equality and for testing inequality constraints. For testing equality constraints, we compute η̂_J(α) using a Bonferroni correction to a critical value from a centralized chi-squared distribution relative to the cardinality of the ES index set, denoted by #(Î_n). That is,

η̂_J(α) = ( q(α/#(Î_n), J) − J ) / √J,  (2.5)

where q(a, J) denotes the upper a-quantile of the χ² distribution with J degrees of freedom. For testing inequality constraints we have implemented two approaches. The first one is presented in Remark 3.1, which is a simple, data-driven, finite-dimensional correction to the chi-squared critical values.
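The scan grid (2.4) and the Bonferroni-adjusted chi-squared threshold (2.5) are simple to compute. A minimal sketch, in which the cap is passed in as a fixed number rather than estimated via Ĵ_max, and the chi-squared quantile uses the Wilson–Hilferty approximation to keep the sketch dependency-free (both are simplifications, not the paper's procedure):

```python
import math
from statistics import NormalDist

def exponential_scan(n, J_cap):
    """Index set (2.4): J = 2^j * J0 with J0 = floor(sqrt(log log n))."""
    J0 = max(int(math.sqrt(math.log(math.log(n)))), 1)
    grid, J = [], J0
    while J <= J_cap:
        grid.append(J)
        J *= 2
    return grid

def chi2_upper_quantile(a, df):
    """Upper-a quantile of chi-squared(df), Wilson-Hilferty approximation."""
    z = NormalDist().inv_cdf(1.0 - a)
    return df * (1.0 - 2.0 / (9.0 * df) + z * math.sqrt(2.0 / (9.0 * df))) ** 3

n, alpha = 1000, 0.05
grid = exponential_scan(n, J_cap=32)
# Bonferroni correction over the cardinality of the scan grid, as in (2.5):
crit = {J: (chi2_upper_quantile(alpha / len(grid), J) - J) / math.sqrt(J)
        for J in grid}
print(grid)                         # → [1, 2, 4, 8, 16, 32]
```

The doubling grid keeps the number of candidate dimensions of order log n, so the Bonferroni penalty α/#(Î_n) only costs a log log factor in the separation rate.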
The second one is introduced in Remark 3.2, which is a bootstrap approach to calculating the critical values.

(The Frobenius norm of a J × J matrix M = (M_{jl})_{1≤j,l≤J} is defined as ‖M‖_F = (Σ_{j,l=1}^J M_{jl}²)^{1/2}.)

2.3 A Monte Carlo Study: Adaptive Testing for Monotonicity

We investigate the finite-sample performance of our adaptive ST of the null of a weakly decreasing function in Monte Carlo experiments. The results are based on 5000 Monte Carlo replications for each experiment. In all experiments, realizations of Y are generated according to a NPIV model

Y = h(X) + U,  E[U | W] = 0,  (2.6)

for some unknown structural function h. The functional form of h and the joint distribution of (X, W, U) vary across the Monte Carlo designs. The results presented in this section indicate that our adaptive ST with simple data-driven critical values has very good size and power in finite samples, and that the adaptive ST with computationally demanding bootstrapped critical values yields no obvious improvement in terms of size or power.

Let Φ denote the standard normal distribution function. We set X_i = Φ(X*_i) and W_i = Φ(W*_i), where the random vector (X*_i, W*_i, U_i)' is generated according to

(X*_i, W*_i, U_i)' ∼ N(0, Σ),  (2.7)

where Σ has unit diagonal entries, Cov(X*_i, W*_i) = ξ, Cov(W*_i, U_i) = 0, and a fixed positive covariance between X*_i and U_i that induces endogeneity. The parameter ξ captures the strength of the instrument and varies in the experiments below: as ξ increases the instrument becomes stronger (the ill-posedness gets weaker). We generate Y_i according to (2.6) where

h(x) = c [ 1/2 − Φ((x − 1)/c) ]

for some constant 0 < c ≤ 1. The function h is monotonically decreasing, where c captures the degree of monotonicity: for small c the function h is close to zero, and for c = 1 it holds that h(x) ≈ φ(0)(1 − x), where φ denotes the standard normal probability density function.

We study the size and power patterns of our adaptive ST under the null hypothesis that the NPIV function h is weakly decreasing on the support of X.
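The design (2.6)–(2.7) is straightforward to simulate. A sketch in which the X*–U covariance is set to 0.5 purely as an illustrative placeholder (the paper's exact value is not reproduced here):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(42)
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # standard normal CDF

def simulate(n, xi, c, rho_xu=0.5):
    """Draw one sample from the NPIV design (2.6)-(2.7).
    rho_xu, the X*-U covariance driving endogeneity, is an
    illustrative placeholder value, not the paper's."""
    Sigma = np.array([[1.0,    xi,  rho_xu],
                      [xi,     1.0, 0.0],
                      [rho_xu, 0.0, 1.0]])
    Xs, Ws, U = rng.multivariate_normal(np.zeros(3), Sigma, size=n).T
    X, W = Phi(Xs), Phi(Ws)
    h = c * (0.5 - Phi((X - 1.0) / c))  # monotonically decreasing structural function
    Y = h + U
    return Y, X, W

Y, X, W = simulate(n=500, xi=0.7, c=1.0)
print(Y.shape)                          # → (500,)
```

Because X and W are marginal probability-integral transforms of jointly normal draws, both are supported on (0, 1), while U remains correlated with X but mean-independent of W.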
For the monotonicity test, we implement the test statistic ŜT_n given in (2.2) using quadratic B-spline basis functions with varying numbers of knots for h. Because the derivative of a quadratic B-spline is piecewise linear, monotonicity constraints are easily imposed on the restricted function through linear inequality constraints on its derivative at the knots; we use K(J) = 2J. We make use of the data-driven critical value η̂_J(α) given in Remark 3.1 (which is implemented using the R package coneproj). As a comparison, we also implement our adaptive ST using bootstrap critical values as described in Remark 3.2. In each Monte Carlo iteration, we generate 200 bootstrap replications using random weights ω ∼ N(0, 1) drawn independently of (X, W, U).

Table 1: Testing Monotonicity – Empirical Size for the adaptive tests ŜT_n and ŜT^B_n (columns: sample size n, parameters c and ξ, empirical size at the 10%, 5%, and 1% nominal levels, and the averages Ĵ and Ĵ^B at the 5% level).

Table 1 reports the empirical size for different nominal levels of the test ŜT_n and its bootstrap analog ŜT^B_n for the sample sizes n ∈ {500, 1000}. Results are presented for three values of the instrument-strength parameter ξ and three values of c. Overall, we see from Table 1 that the test provides adequate size control across the different parameter values of ξ and c. It is interesting to see that the computationally demanding bootstrap version ŜT^B_n with 200 bootstrap replications does not yield any improvement in terms of size control.

In addition to size, Table 1 also presents the average data-driven choice of the tuning parameter J at the 5% nominal level, denoted by Ĵ for the test ŜT_n and by Ĵ^B for its bootstrap analog ŜT^B_n. Specifically, Ĵ averages the J which maximizes Ŵ_J over the index set Î_n when the null is not rejected, and the smallest J ∈ Î_n such that Ŵ_J > 1 when the null is rejected (the smallest such J corresponds to early stopping when we reject the null).

(We did run 1000 Monte Carlo replications with n = 500 and 500 bootstrap evaluations per Monte Carlo replication for the monotonicity test. ŜT^B_n with 500 bootstrap evaluations has slightly more accurate size than with 200 bootstrap evaluations. Nevertheless, our simple adaptive ŜT_n test also has very good size control in 1000 Monte Carlo replications and is very fast to compute.)

Figure 1:
Adaptive Monotonicity Test – Empirical Power for the adaptive tests ŜT_n and ŜT^B_n for two values of ξ. Solid (dashed) lines show power results for data-driven (bootstrap) critical values. Power curves are not size adjusted. LHS: n = 500; RHS: n = 1000.

From Table 1 we see that the average data-driven choice Ĵ increases as the strength of instruments increases (captured by the parameter ξ). Further, Ĵ decreases as the regularity of the structural function h declines (captured by the parameter c). This is due to the fact that, with increasing nonlinearity of h, a smaller number of knots is sufficient in order to reject the hypothesis. The data-driven choice of J hence works in the opposite direction than in adaptive estimation, where larger smoothness leads to smaller values of J. Finally, we see that the value of Ĵ increases with the sample size.

To study the power of the test that the NPIV function h is monotonically decreasing, we consider deviations from the null of weak monotonicity. Specifically, we examine the rejection probabilities of the adaptive ST when the data are generated by the design (2.7) but using the structural function

h(x) = −x/5 + c_A x².

Note that h'(x) ≤ 0 if and only if x ≤ 0.1/c_A. Since the support of X is contained in [0, 1], we obtain from our model that the null hypothesis of weakly decreasing functions is satisfied only if c_A ≤ 0.1.

Figure 2:
Estimated NPIV curves with data generated from (2.8) with c_A = 0.2 and n = 1000, showing the true structural function (black dotted lines), the unconstrained estimator (red dashed lines), and the constrained estimator (blue solid lines). LHS: 95% CB based on the constrained estimator. RHS: 95% CB based on the unconstrained estimator.

Figure 1 depicts the power functions of the adaptive monotonicity tests ŜT_n and ŜT^B_n, the latter based on 200 bootstrap iterations, at the 5% nominal level for two values of ξ and sample sizes n ∈ {500, 1000}. From Figure 1 we see that both tests become more powerful, for c_A > 0.
1, as the parameter of instrument strength ξ and the sample size n increase. The bootstrap test ŜT^B_n is somewhat more powerful for small values of c_A when the instrument is weaker and n = 1000. In the other cases, the power improvement from using bootstrap critical values is of small magnitude or absent. When the power curves are size adjusted, the slight advantage in power of ŜT^B_n over ŜT_n disappears; in particular, we found that ŜT_n is more powerful when c_A is large. We did not pursue ŜT^B_n with a larger number of bootstrap iterations as it is computationally demanding.

To illustrate the implications of our adaptive inference procedure for estimation, we consider a modification of our data generating process, namely the model

Y = −X/5 + c_A X² + U/2,  (2.8)

where (X, W, U) is generated as in (2.7) and we consider c_A ∈ {0.2, 0.3}; for both values of c_A the null hypothesis of a weakly decreasing function is violated. We apply our adaptive ST to one sample {(Y_i, X_i, W_i)} of size n = 1000, for which we obtain a data-driven choice of sieve dimension Ĵ = 3 in both cases c_A ∈ {0.2, 0.3}. Based on this dimension-parameter choice, Figure 2 shows the constrained sieve NPIV estimator (blue solid line) and the unconstrained sieve NPIV estimator (red dashed lines). We show the 95% uniform confidence bands (CB) following Chen and Christensen [2018], based on 1000 bootstrap replications, for the constrained estimator on the left and for the unconstrained estimator on the right. From Figure 2 we see that the difference between the CBs based on the constrained and the unconstrained estimator is minor, although there is a slight improvement of the CB based on the constrained estimator.

Figure 3:
Estimated NPIV curves with data generated from (2.8) with c_A = 0.3 and n = 1000, showing the true structural function (black dotted lines), the unconstrained estimator (red dashed lines), and the constrained estimator (blue solid lines). LHS: 95% CB based on the constrained estimator. RHS: 95% CB based on the unconstrained estimator.

Figure 3 shows the estimation results when c_A = 0.3.

3. Adaptive Inference via the Structure-Space Test
This section presents several results on the data-driven structure-space test (ST) statistics. Subsection 3.1 introduces the notation and the main regularity conditions. Subsection 3.2 establishes the minimax rate of testing without a data-driven choice of the sieve dimension. Subsection 3.3 establishes the minimax rate of testing of the adaptive ST. Subsection 3.4 shows that this rate coincides with the rate of testing attained by tests of composite null hypotheses. Subsection 3.5 proposes data-driven confidence sets obtained by inverting the adaptive ST under the null hypothesis.
Before we state the minimax rate of testing in structure space, we introduce additional notation and the main assumptions. For a random variable X, we define the space L²(X) as the equivalence class of all measurable functions of X with finite second moment, with ‖·‖_{L²(X)} the associated norm. For any sigma-finite measure μ we define ‖φ‖²_μ := ∫ φ²(x) μ(x) dx for all φ ∈ L²_μ := {φ : ‖φ‖_μ < ∞}.

Assumption 1. (i)
H ⊂ L²(X); (ii) sup_{w∈W} sup_{h∈H} E_h[ρ²(Y, h_r(X)) | W = w] ≤ σ² < ∞ and sup_{h∈H} E_h[ρ⁴(Y, h_r(X))] < ∞; and (iii) inf_{w∈W} inf_{h∈H} Var_h(ρ(Y, h_r(X)) | W = w) ≥ σ₀² > 0.

Let A = [S'G_b^{-1}S]^{-1} S'G_b^{-1}, where S = E[b^K(W) ψ^J(X)'] and G_b = E[b^K(W) b^K(W)']. We introduce the projections Π_J h(·) = ψ^J(·)' G^{-1} ∫ ψ^J(x) h(x) μ(x) dx for h ∈ L²_μ and Π_K m(·) = b^K(·)' G_b^{-1} E[b^K(W) m(W)] for m ∈ L²(W). The minimal singular value of G_b^{-1/2} S G^{-1/2} is denoted by s_J. We make use of the notation ζ_J = max(ζ_{ψ,J}, ζ_{b,K}) for K = K(J), where ζ_{ψ,J} = sup_x ‖G^{-1/2} ψ^J(x)‖ and ζ_{b,K} = sup_w ‖G_b^{-1/2} b^K(w)‖. We define d_x = dim(X) and d_w = dim(W).

Assumption 2. (i) s_J^{-1} ζ_J √((log J)/n) = O(1); (ii) ‖Π_J h − h‖_μ = O(J^{-p/d_x}) for all h ∈ H and some p > 0 such that ζ_J √(log J) = O(J^{p/d_x}); and (iii) the eigenvalues of G and G_b are uniformly bounded from below and above.

Let T : L²(X) → L²(W) denote the conditional expectation operator given by Th(w) = E[h(X) | W = w]. We further define Ψ_J = clsp{ψ₁, ..., ψ_J} ⊂ L²(X).

Assumption 3. (i) sup_{h∈Ψ_J} ‖(Π_K T − T)h‖_{L²(W)} / ‖h‖_μ = o(s_J) and (ii) ‖T(h − h_r − Π_J(h − h_r))‖_{L²(W)} = O(s_J ‖h − h_r − Π_J(h − h_r)‖_μ) for all h ∈ H.

Assumption 4.
For any h ∈ H , T h = 0 implies that (cid:107) h (cid:107) µ = 0 .Discussion of Assumptions. Assumption 1 captures second moment bounds. In addi-tion, a lower bound for the variance is imposed. Assumption 2 (i) imposes bounds on thegrowth of the basis functions relative to the singular values of the matrix G − / b SG − / .Assumption 2 (i)(ii) imposes bounds on the growth of the basis functions which are knownfor commonly used bases. For instance, ζ b,K = O ( √ K ) and ζ ψ,J = O ( √ J ) for polyno-mial spline, wavelet and cosine bases, and ζ b,K = O ( K ) and ζ ψ,J = O ( J ) for orthogonal14olynomial bases; see, e.g., Newey [1997], Huang [1998]. Assumption 3 (i) is a mild condi-tion on the approximation properties of the basis used for the instrument space. In fact, (cid:107) (Π K T − T ) h (cid:107) L ( W ) = 0 for all h ∈ Ψ J when the basis functions for B K and Ψ J form either aRiesz basis or an eigenfunction basis for the conditional expectation operator. Assumption3 (ii) is the usual L stability condition imposed in the NPIV literature when h r = 0 (cf.Assumption 6 in Blundell et al. [2007] and Assumption 5.2(ii) in Chen and Pouzo [2012]).Note that Assumption 3 (ii) is also automatically satisfied by Riesz bases. Assumption 4is required for identification of the quadratic functional (cid:107) h (cid:107) µ and the condition can be lessrestrictive than imposing L completeness when the support of µ is a subset of the supportof X . Example 3.1 (NQIV) . The ST test can also be applied to models with nonseperable un-observables after linearization. Consider as an example the nonparametric quantile instru-mental variable model with conditional moment restriction E[ { Y ≤ h ( X ) } − q | W ] = 0 for some q ∈ (0 , . A linearization of the model can be obtained using the Frechet derivativeat h r maps h to E[ f Y | X,W ( h r ( X ))( h ( X ) − h r ( X )) | W ] . 
This leads to a modified version of ourtest statistic where B (cid:48) Ψ is replaced by an empirical analog of E[ f Y | X,W ( h r ( X )) b K ( W ) ψ J ( X ) (cid:48) ] and Y − h r ( X ) by f Y | X,W ( h r ( X )) h r ( X ) . We do not address the estimation of the condi-tional density and hence, the NQIV case explicitly for our structural space test. Example 3.2 (Testing Derivatives of h ) . Note that the test can be extended to check forderivatives of the function h . To do so, we replace G by the matrix (cid:90) ∂ x ψ J ( x )( ∂ x ψ J )( x ) (cid:48) µ ( x ) dx as long as the basis function ψ j are differentiable on the support of µ . This straightforwardextension is only possible in case of ST but not for IT and hence, illustrates the advantageof the ST approach. We first consider the simple hypothesis case where H = { h } and, in particular, h r = h ,for some known function h satisfying (1.1) with ρ ( Y, h ( X )) = Y − h ( X ). We introduce a J dependent analog to the adaptive structure-space test (cid:99) ST n under the simple null: ST n,J = (cid:40) n (cid:98) D J ( h ) √ η (cid:98) v J ( h ) > (cid:41) η >
0. The test ST n,J with optimally chosen J serves as a benchmark ofour adaptive ST procedure (given in (3.5)) for the simple null hypothesis case. Theorem 3.1.
Let Assumptions 1–4 be satisfied. Then, for any ε > 0 there exists a constant δ* > 0 such that

lim sup_{n→∞} { P_{h_0}(ST_{n,J} = 1) + sup_{h∈H(δ*, r_{n,J})} P_h(ST_{n,J} = 0) } ≤ ε,   (3.1)

where the rate r_{n,J} is given by

r_{n,J} = n^{−1/2} s_J^{−1} J^{1/4} + J^{−p/d_x}.   (3.2)

Theorem 3.1 shows that the test ST_{n,J} attains the rate of testing r_{n,J}. This rate consists of a variance part and a bias part. The optimal choice of J requires knowledge of unknown mapping properties of the conditional expectation operator T and of the unknown smoothness of the true structural function h_0, as illustrated below. A central step in achieving this rate result is to establish a rate of convergence of the quadratic distance estimator D̂_J(h_0); see Theorem F.1 in the online appendix. We thus make use of the close connection between minimax optimal quadratic functional estimation and minimax optimal testing.

We differentiate between two degrees of ill-posedness, which are typically considered in the literature. The sieve L² measure of ill-posedness is defined as

τ_J = sup_{h∈Ψ_J, h≠0} ‖h‖_µ / ‖Th‖_{L²(W)} ≤ sup_{h∈Ψ_J, h≠0} ‖h‖_µ / ‖Π_K Th‖_{L²(W)} = s_J^{−1}.

We call the model (1.1)

1. mildly ill-posed if τ_j ∼ j^{ζ/d_x} for some ζ > 0;
2. severely ill-posed if τ_j ∼ exp(j^{ζ/d_x}/2) for some ζ > 0.

The next corollary provides concrete rates of testing when the dimension parameter J is chosen to balance variance and squared bias under classical smoothness conditions.

Corollary 3.1.
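The balancing behind Corollary 3.1 can be checked numerically. The sketch below is an illustration under assumptions, not part of the paper's procedure: it plugs the mildly ill-posed model s_J⁻¹ = J^{ζ/d_x} into r_{n,J} of (3.2), minimizes over J on a grid, and compares the minimum with the closed-form rate n^{−2p/(4(p+ζ)+d_x)}.

```python
import numpy as np

# Illustrative smoothness and ill-posedness parameters (assumed values).
p, zeta, d_x = 2.0, 1.0, 1.0

def min_rate(n):
    # r_{n,J} = n^{-1/2} s_J^{-1} J^{1/4} + J^{-p/d_x} with s_J^{-1} = J^{zeta/d_x},
    # minimized over an integer grid of sieve dimensions J.
    J = np.arange(1.0, 100000.0)
    return (n ** -0.5 * J ** (zeta / d_x + 0.25) + J ** (-p / d_x)).min()

def closed_form(n):
    # Corollary 3.1, mildly ill-posed case.
    return n ** (-2 * p / (4 * (p + zeta) + d_x))
```

At the balancing J both terms of r_{n,J} are equal, so the grid minimum should sit between one and roughly two times the closed-form rate.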
Let Assumptions 1–4 be satisfied. Then the rate of testing r_{n,J} given in (3.2) is of the following form:

1. Mildly ill-posed case: choosing J ∼ n^{2d_x/(4(p+ζ)+d_x)} implies

r_{n,J} = n^{−2p/(4(p+ζ)+d_x)};   (3.3)

2. Severely ill-posed case: choosing J ∼ (log n − ((2p + d_x/2)/ζ) log log n)^{d_x/ζ} implies

r_{n,J} = (log n)^{−p/ζ}.   (3.4)

If {a_n} and {b_n} are sequences of positive numbers, we use the notation a_n ≲ b_n if lim sup_{n→∞} a_n/b_n < ∞, and a_n ∼ b_n if a_n ≲ b_n and b_n ≲ a_n.

In the next result, we establish a lower bound for the rate of testing in each of the ill-posed scenarios considered in the previous corollary. Below, ⟨·,·⟩_µ denotes the inner product associated with L²_µ.

Theorem 3.2. Let Assumptions 1(iii) and 4 be satisfied. Assume that ‖Th‖²_{L²(W)} ≲ Σ_{j≥1} τ_j^{−2} ⟨h, ψ̃_j⟩²_µ for all h ∈ H and an orthonormal basis {ψ̃_j}_{j≥1} in L²_µ. Then for any ε > 0 there exists a constant δ* > 0 such that

lim inf_{n→∞} inf_{T_n} { P_{h_0}(T_n = 1) + sup_{h∈H(δ*, r_n)} P_h(T_n = 0) } ≥ 1 − ε,

where r_n is given by:

1. Mildly ill-posed case: r_n = n^{−2p/(4(p+ζ)+d_x)};
2. Severely ill-posed case: r_n = (log n)^{−p/ζ}.

From Corollary 3.1 and Theorem 3.2 we conclude that r_{n,J} is the minimax rate of testing once J is chosen to balance variance and squared bias. In particular, we conclude that the rate of testing is always nonparametric, in contrast to the estimation of quadratic functionals, where the √n-rate can also be achieved.

We propose a data-driven ST that rejects the null hypothesis H_0 = {h_0} ≠ ∅, for some known function h_0 satisfying (1.1), as soon as the normalized estimator D̂_J(h_0) is sufficiently large for at least one J. Specifically, we consider the data-driven test

ST_n = 1{ ∃ J ∈ Î_n such that n D̂_J(h_0) > η̂_J(α) √(v̂_J(h_0)) },   (3.5)

where η̂_J(α), v̂_J(h_0), and the index set Î_n are given in Subsection 2.2.

We define J_0 to be the smallest dimension parameter such that the variance dominates the squared bias within a √(log log n) term, that is,

J_0 = min{ J : J^{−2p/d_x} ≤ √(log log n) n^{−1} s_J^{−2} √J }.   (3.6)

Recall the definition of the index set Î_n given in (2.4), which relies on an upper bound Ĵ_max. We introduce a dimension parameter J̄, slowly growing with the sample size n, which controls the complexity of the ES index set Î_n. Specifically, Chen and Christensen [2015, Theorem 3.2] show that Ĵ_max ≤ J̄ holds with probability approaching one, where J̄ satisfies the rate restrictions imposed in the next assumption. We also make use of the notation ζ̄ = ζ_{J̄}.
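The scan construction in (3.5) can be sketched numerically. The following is a stylized illustration under assumptions, not the paper's implementation: the data are simulated, Chebyshev (cosine) sieves stand in for B-splines, v below is only a crude plug-in variance proxy for v̂_J, and the threshold eta is a fixed placeholder of order √(log log n) rather than the adjusted chi-squared critical value η̂_J(α).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated NPIV-style data; the simple null is h0(x) = sin(pi * x).
W = rng.uniform(-1, 1, n)
X = np.clip(0.7 * W + 0.3 * rng.normal(size=n), -1, 1)
h0 = lambda x: np.sin(np.pi * x)
Y = h0(X) + 0.2 * rng.normal(size=n)
u = Y - h0(X)                                   # residuals under the simple null

def loo_stat(J, K):
    """Leave-one-out U-statistic for the quadratic distance at sieve dimension J."""
    psi = np.cos(np.outer(np.arccos(X), np.arange(J)))   # Chebyshev sieve for h
    b = np.cos(np.outer(np.arccos(W), np.arange(K)))     # Chebyshev sieve for instruments
    Gb = b.T @ b / n                                     # sample analog of G_b
    S = b.T @ psi / n                                    # sample analog of S = E[b_K(W) psi_J(X)']
    # Sample analog of A = [S' G_b^{-1} S]^{-1} S' G_b^{-1}.
    A = np.linalg.solve(S.T @ np.linalg.solve(Gb, S), S.T @ np.linalg.solve(Gb, np.eye(K)))
    V = (A @ b.T).T                                      # row i is A b_K(W_i)
    Q = V @ V.T
    # Off-diagonal (leave-one-out) quadratic form: dropping the diagonal removes own-observation bias.
    D = (u @ Q @ u - np.sum(np.diag(Q) * u ** 2)) / (n * (n - 1))
    v = 2 * np.sum((Q * np.outer(u, u)) ** 2) / n ** 2   # crude variance proxy, an assumption
    return D, v, Q

# Scan over a geometric index set, rejecting if any normalized statistic is large.
I_n = [2, 4, 8]
eta = 3 * np.sqrt(np.log(np.log(n)))                     # placeholder critical value
reject = any(n * loo_stat(J, 2 * J)[0] > eta * np.sqrt(loo_stat(J, 2 * J)[1]) for J in I_n)
```

Only the leave-one-out structure and the scan decision mirror the text; all tuning choices above are placeholders.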
Assumption 5. (i) s_{J̄}^{−1} ζ̄ √((log n)/n) = o(1); (ii) for any α ∈ (0, 1) it holds that η̂_J(α) = O(√(log log n)) and (log log J)^c ≤ η̂_J(α) for some constant c > 0 and for all J_0 ≤ J ≤ J̄, with probability approaching one.

Assumption 5 imposes an upper bound on the growth of J̄, the population counterpart of the upper bound Ĵ_max of the set Î_n. Assumption 5(i) is a slight modification of Assumption 2(i), considered uniformly over J_0 ≤ J ≤ J̄. Assumption 5(ii) imposes a mild restriction on the critical values η̂_J(α) given in (2.5).

Theorem 3.3.
Let Assumptions 1–3 and 5 be satisfied. Then, for any ε > 0 there exists a constant δ° > 0 such that

lim sup_{n→∞} { P_{h_0}(ST_n = 1) + sup_{h∈H(δ°, r_n)} P_h(ST_n = 0) } ≤ ε,   (3.7)

where the rate r_n is given by

r_n = n^{−1/2} (log log n)^{1/4} s_{J_0}^{−1} J_0^{1/4}.   (3.8)

Theorem 3.3 establishes an upper bound for the testing rate of the adaptive structure-space test ST_n. The proof of Theorem 3.3 relies on a novel exponential bound for degenerate U-statistics based on sieve estimators. Adaptive testing for inverse problems was considered for deconvolution models (with known degree of ill-posedness) by Butucea et al. [2009]. In functional linear models, adaptive tests (under unknown, mild degree of ill-posedness) were proposed by Lei [2014]. In Gaussian white noise models, an adaptive test was proposed by Ingster et al. [2012] that also covers the severely ill-posed case but requires knowledge of the ill-posedness scenario. We now illustrate the upper bound under classical smoothness assumptions. Again, we distinguish between the mildly and severely ill-posed cases.

Corollary 3.2.
Let Assumptions 1–5 be satisfied. Then, the adaptive rate of testing r_n given in (3.8) satisfies:

1. Mildly ill-posed case: r_n = (√(log log n)/n)^{2p/(4(p+ζ)+d_x)};
2. Severely ill-posed case: r_n = (log n)^{−p/ζ}.

From Corollary 3.2 we see that in the mildly ill-posed case the adaptive ST attains the minimax rate of testing within a (log log n)-term. For adaptive testing without endogeneity, it is well known that such a (log log n)-term is required; see Spokoiny [1996]. In the severely ill-posed case, our adaptive test attains the exact minimax rate of testing and hence there is no price to pay for adaptation.

We extend the results from the previous subsection to the case of composite hypotheses and, in particular, allow for testing inequality constraints. Below, we discuss two different approaches for deriving critical values in the case of inequality-constrained tests. Both methods rely on cone properties imposed on the restricted set of functions. We use the notation Π^r_J for the projection onto H^r_J. Here, the set H^r_J is used to approximate the set of functions H^r ⊂ H which satisfy a conjectured restriction.

Remark 3.1 (Adaptive critical values for inequality constraints). In both cases, we rely on the assumption that H^r_J is a polyhedral cone. In this case, we may infer from Silvapulle and Sen [2005, Lemma 3.13.5] the existence of a collection of faces {H_1, …, H_L} such that the collection of their relative interiors {ri(H_1), …, ri(H_L)} forms a partition of H^r_J. Let P_l be the projection matrix onto the linear space spanned by H_l, where J_l = rank(P_l). Then, a Bonferroni correction of the adaptive critical values of Al Mohamad et al. [2018] gives

η̂_J(α) = Σ_{l=1}^L 1{Π^r_J ĥ_J ∈ ri(H_l)} (q(α/#(Î_n), J_l) − J_l)/√(2 J_l),

where ĥ_J is the unconstrained analog of (2.1) and we impose the restriction J_l ≥ 1.
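The face-dependent critical value of Remark 3.1 can be sketched in the simplest polyhedral-cone case, the nonnegativity cone {β : β ≥ 0}, whose faces are indexed by the active coordinates. This is an illustration under assumptions: the chi-squared quantile q(a, d) is replaced by the Wilson-Hilferty approximation so the sketch needs only the standard library, and the cone, data, and scan-set size are made up.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(a, d):
    # Wilson-Hilferty approximation to the 1-a quantile of chi-squared with d
    # degrees of freedom (a stand-in for an exact quantile routine).
    z = NormalDist().inv_cdf(1 - a)
    return d * (1 - 2 / (9 * d) + z * np.sqrt(2 / (9 * d))) ** 3

def eta_hat(beta_hat, alpha, n_scan):
    # Cone projection onto {beta >= 0}: coordinatewise positive part.
    proj = np.maximum(beta_hat, 0.0)
    # The face on which the projection lands is spanned by its positive coordinates;
    # J_l is its dimension, with the restriction J_l >= 1.
    J_l = max(int(np.sum(proj > 0)), 1)
    a = alpha / n_scan                     # Bonferroni correction over the scan set
    return (chi2_quantile(a, J_l) - J_l) / np.sqrt(2 * J_l)
```

A smaller face (fewer active coordinates) lowers J_l, and a larger scan set tightens the Bonferroni level, both of which move the critical value, which is the mechanism the remark describes.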
Remark 3.2 (Bootstrap critical values for inequality constraints). We propose a modification of the bootstrap procedure of Fang and Seo [2019] which imposes a cone condition on H^r. Below, Z_J denotes the sieve bootstrap score proposed by Chen and Christensen [2018]. We proceed in two steps:

Step 1.
Introduce a sequence of independent and identically distributed random variables {ω_i}_{i=1}^n drawn independently of the original data {(Y_i, X_i, W_i)}_{i=1}^n. Compute the bootstrap version of the quadratic distance estimator given by

D̂^B_J(ĥ_J) = 2/(n(n−1)) Σ_{1≤i<i′≤n} ω_i ω_{i′} Û_i Û_{i′}.   (3.9)

Step 2. For some γ_n ∈ (0, 1), we construct the 1 − γ_n quantile of {n D̂^B_J(ĥ_J)/√(v̂_J(ĥ_J))}, based on N bootstrap samples, denoted by τ̂_{n,1−γ_n}. (In the implementation of the procedure we use throughout the paper the choice γ_n = 0.1/log n, following Chernozhukov et al. [2013].) Set κ̂_J = ν_n c_n/τ̂_{n,1−γ_n}, where ν_n = n/√(v̂_J(ĥ_J)) and c_n is such that ‖(ĥ_J − h_0)/√(v̂_J(ĥ_J)) − Z_J‖_µ = o_p(c_n). Compute η̂_J(α) as the 1 − α/#(Î_n) quantile of {n D̃^B_J(ĥ_J)/√(v̂_J(ĥ_J))}, based on N bootstrap samples, where D̃^B_J(ĥ_J) coincides with (3.9) but with Û_i replaced by Û_i − κ̂_J ĥ_J(X_i) − Π^r_J(Z_J − κ̂_J ĥ_J)(X_i).

(A cone C is called polyhedral if there is some matrix M such that C = {β ∈ R^J : Mβ ≥ 0}.)

In the following, we impose restrictions on the complexity of the set H^r_J. As we rely on empirical process theory, we make use of that literature's notation. Let N_[](t, H, ‖·‖_µ) denote the smallest number of brackets of size t (under ‖·‖_µ) required to cover H. We further denote H^r_J(∆_{J,n}) = {h ∈ H^r_J : ‖h − h_r‖_∞ ≤ ∆_{J,n}} for some ∆_{J,n} > 0. Below, η_J(α) denotes a deterministic sequence satisfying η_J(α) = O(√(log log n)) and (log log J)^c ≤ η_J(α) for some constant c > 0 and all J_0 ≤ J ≤ J̄.

Assumption 6. (i) For any h ∈ H there exists a sequence (∆_{J,n})_{n≥1} satisfying ĥ^r_J ∈ H^r_J(∆_{J,n}) with probability approaching one and ∫_0^1 √(N_[](tC, H^r_J(∆_{J,n}), ‖·‖_µ)) dt ≤ C_{J,n}, where Σ_{J∈Î_n} C_{J,n} ∆_{J,n}/(log log J) = o_p(1) and max_{J∈Î_n} ∆_{J,n} ζ_J (log J) = o_p(1). (ii) For some J ∈ Î_n and any ε > 0 there exist constants c, C > 0 such that lim sup_{n→∞} sup_{h∈H_0} P_h(η̂_J(α) < C η_J(α)) < ε and lim sup_{n→∞} sup_{h∈H(δ°, r_n)} P_h(η̂_J(α) > c η_J(α)) < ε.

Assumption 6(i) is a mild restriction on the complexity of the set of functions H^r_J(∆_{J,n}), imposed through rate conditions on ∆_{J,n}. These conditions determine the rate of convergence of the constrained sieve estimator to any function h_r satisfying a conjectured restriction captured by H^r. A similar condition was imposed in Assumption C.2 of Chen and Pouzo [2012]. Further note that the critical value estimators introduced in Remarks 3.1 and 3.2 satisfy Assumption 6(ii) under mild conditions. In the case of the adaptive critical values (see Remark 3.1), the cone projection leads to weakly larger critical values than the one given in (2.5), since η̂_J(α) is now determined by the dimension of the face on which the cone projection lands. In the case of the bootstrap critical values (see Remark 3.2), Assumption 6(ii) can be justified by following Fang and Seo [2019].

The next result establishes an upper bound for the rate of testing of ŜT_n.

Theorem 3.4. Let Assumptions 1–6 be satisfied. Then, for any ε > 0 there exists a constant δ° > 0 such that

lim sup_{n→∞} { sup_{h∈H_0} P_h(ŜT_n = 1) + sup_{h∈H(δ°, r_n)} P_h(ŜT_n = 0) } ≤ ε,   (3.10)

where the rate r_n is given in Theorem 3.3.

From Theorem 3.4 we see that ŜT_n attains the rate of testing r_n, which is the same rate of testing obtained by ST_n in the case of simple hypotheses. Under the restrictions imposed in Assumption 6, we thus conclude that estimation of restricted functions does not imply slower rates of testing. In the definition of ŜT_n, the dimension parameter for estimating the structural function under the conjectured restriction is set equal to that of the unrestricted estimator of the structural function. In this sense, our inference results do not require undersmoothing conditions.
Finally, we note that the test statistic can be trivially modified for tests where the constrained functions h might be estimated using a fixed, finite-dimensional sieve space.

We now propose L² confidence sets based on inverting the structure-space test. The resulting confidence region imposes conjectured restrictions on the function of interest h. The (1 − α)-confidence set is given by

C_n(α) = { h ∈ H^r : n D̂_J(h) ≤ η̂_J(α) √(v̂_J(h)) for all J ∈ Î_n }.   (3.11)

The following corollary exploits our previous results and the introduced assumptions to characterize the asymptotic size and power properties of our procedure.

Corollary 3.3.
Let Assumptions 1–6 be satisfied. Then, for any α > 0 it holds that

lim sup_{n→∞} sup_{h∈H_0} P_h(h ∉ C_n(α)) ≤ α   (3.12)

and there exists a constant δ° > 0 such that

lim inf_{n→∞} inf_{h∈H(δ°, r_n)} P_h(h ∉ C_n(α)) ≥ 1 − α.   (3.13)

Corollary 3.3 shows that the L² confidence set C_n(α) controls size uniformly over the class of functions H_0. Moreover, the result establishes power uniformly over the class H(δ°, r_n). We immediately see from Corollary 3.3 that the size of the L² confidence ball depends on the degree of ill-posedness captured by the minimal singular values s_J.

Corollary 3.4. Let Assumptions 1–6 be satisfied. Then, we have

lim sup_{n→∞} sup_{h∈H_0} P_h( diam(C_n(α)) ≥ C n^{−1/2} (log log n)^{1/4} s_{J_0}^{−1} J_0^{1/4} ) = 0

for some constant C > 0, where the dimension parameter J_0 is given in (3.6).

Corollary 3.4 yields a confidence region with diameter of order n^{−1/2} (log log n)^{1/4} s_{J_0}^{−1} J_0^{1/4} for confidence sets based on inverting the structure-space test statistic. We see that the diameter of the confidence set does not adapt to the regularity of submodels. The following remark illustrates that the gain from adaptation is expected to be minor in inverse problems.

Remark 3.3 (Adaptive Confidence Sets). Consider the mildly ill-posed case where the degree of ill-posedness is fixed at ζ and we wish to adapt over the function classes H(p_1) and H(p_2) with smoothness parameters p_2 > p_1. Suppose we have an adaptive estimator which attains in the mildly ill-posed case the L² rate of convergence n^{−p_2/(2(p_2+ζ)+d_x)}, in comparison to the rate of testing n^{−p_1/(2(p_1+ζ)+d_x/2)}. It is known in statistical regression models (see Robins and Van Der Vaart [2006] and Cai and Low [2006]) that rate adaptation is only possible over submodels (indexed by p_2) such that the rate of estimation over the submodel is larger than the rate of testing over the "supermodel". Specifically, we obtain the restriction

n^{−p_1/(2(p_1+ζ)+d_x/2)} ≲ n^{−p_2/(2(p_2+ζ)+d_x)}.

This condition translates into the smoothness restriction p_2 < p_1 (2ζ + d_x)/(2ζ + d_x/2), and hence, in this sense, adaptation is only possible when the submodel H(p_2) satisfies

p_2 ∈ ( p_1, p_1 (2ζ + d_x)/(2ζ + d_x/2) ),

which shows that for large values of ζ (or of the dimension d_x) adaptation with respect to H(p_2) can only be achieved over a very limited range of smoothness p_2.
4. Adaptive Inference via the Image-Space Test
In this section we consider a data-driven test in the image space of the conditional expectation mapping. Subsection 4.1 proposes a data-driven image-space test (IT) statistic. Subsection 4.2 establishes the minimax rate of testing of the adaptive IT.

4.1. The Data-driven Image-Space Test Statistic
We consider the set of functions satisfying H_0 = {h ∈ H : m(·, h) = 0} ∩ H^r ≠ ∅, where, in this section, H^r is a finite-dimensional, compact function space. We do not address IT with an infinite-dimensional restricted set of functions here, as this would require the choice of an additional tuning parameter.

In contrast to the previous section, we specify alternative models through deviations from the conditional moment restriction. For convenience of notation, we introduce the conditional moment function m(·, h_r) = E[ρ(Y, h_r(X)) | W = ·]. For the image-space test, we consider a class of functions which are separated from H_0 in the sense

M(δ, r_n) := { h ∈ H : m(W, h) = 0 and E[m²(W, h_r)] ≥ δ² r_n² }.

We propose an image-space test based on a leave-one-out sieve estimator of the quadratic functional E[m²(W, h_r)] given by

D̂_K(ĥ_r) = 2/(n(n−1)) Σ_{1≤i<i′≤n} ρ(Y_i, ĥ_r(X_i)) b_K(W_i)′ (B′B/n)⁻ b_K(W_{i′}) ρ(Y_{i′}, ĥ_r(X_{i′})).   (4.1)

The image-space test is then given by

IT_n = 1{ ∃ K ∈ Î_n such that n D̂_K(ĥ_r) > η̂_K(α) √(v̂_K(ĥ_r)) },   (4.2)

based on η̂_K(α) as given in (2.5) and the estimated normalization factor

v̂_K(h) = ‖ n⁻¹ Σ_{i=1}^n ρ²(Y_i, h(X_i)) (B′B/n)^{−1/2} b_K(W_i) b_K(W_i)′ (B′B/n)^{−1/2} ‖²_F.   (4.3)

The image-space test IT_n also relies on the ES selection method to determine the index set Î_n as given in (2.4), yet its upper bound has to be modified as follows. We replace the upper bound Ĵ_max of the index set Î_n by the estimator

K̂_max = min{ K > J : ζ(K) √(ℓ(K)(log n)/n) ≥ s_min((B′B/n)^{−1/2}) },

where ℓ(K) = 0.1K.

4.2. Adaptive Testing

For the IT case, the index set Î_n depends on the upper bound K̂_max. As in the previous section, we may assume that K̂_max ≤ K̄ with probability approaching one, where K̄ satisfies the following rate conditions.
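The leave-one-out structure of (4.1) and the Frobenius-norm normalization of (4.3) can be sketched as follows. This is an illustration under assumptions: the residuals rho and the power-series instrument sieve are simulated stand-ins, not estimates from a constrained NPIV fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 400, 6

W = rng.uniform(0, 1, n)
rho = rng.normal(size=n)                        # stand-ins for rho(Y_i, h_r(X_i))
B = np.column_stack([W**k for k in range(K)])   # stylized instrument sieve b_K(W_i)'

Binv = np.linalg.solve(B.T @ B / n, np.eye(K))  # (B'B/n)^{-1}
M = B @ Binv @ B.T                              # M[i,i'] = b_K(W_i)'(B'B/n)^{-1} b_K(W_i')

# (4.1): off-diagonal (leave-one-out) part of the quadratic form, which removes
# the own-observation bias of the plug-in estimator of E[m^2(W, h_r)].
D_K = (rho @ M @ rho - np.sum(np.diag(M) * rho**2)) / (n * (n - 1))

# (4.3): squared Frobenius norm of the residual-weighted second-moment matrix.
# Any square root of (B'B/n)^{-1} works here, since square roots differ by an
# orthogonal factor and the Frobenius norm is invariant under it.
R = np.linalg.cholesky(Binv)
BR = B @ R                                      # row i is b_K(W_i)' R
Omega = BR.T @ (BR * (rho**2)[:, None]) / n
v_K = np.sum(Omega**2)
```

The vectorized off-diagonal sum is exactly the U-statistic double sum over i ≠ i′, divided by n(n−1).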
We further denote H^r(∆_n) = {h ∈ H^r : ‖h − h_r‖_∞ ≤ ∆_n} for some ∆_n > 0, where h_r = arg min_{h∈H^r} ‖m(·, h)‖_{L²(W)}. Further, we define b̃_K = G_b^{−1/2} b_K with entries b̃_k, 1 ≤ k ≤ K.

Assumption 7. (i) ζ_{b,K̄} √((log n)/n) = o(1); (ii) ‖Π_K m(·, h_r) − m(·, h_r)‖_{L²(W)} = O(K^{−γ/d_w}) for some γ > 0 such that ζ_{b,K} √(log K) = O(K^{γ/d_w}); (iii) the eigenvalues of G_b are uniformly bounded from below and above; (iv) ∫_0^1 √(N_[](w, H^r, ‖·‖_{L²(Z)})) dw ≤ C for some constant C > 0; (v) ‖E_h[(ρ(Y, h(X)) − ρ(Y, h_r(X))) b̃_K(W)]‖ ≤ ‖h − h_r‖_µ for all h ∈ H^r(∆_n); and (vi) ĥ_r ∈ H^r(∆_n) with probability approaching one such that max_{1≤k≤K̄} E sup_{h∈H^r(∆_n)} |(ρ(Y, h(X)) − ρ(Y, h_r(X))) b̃_k(W)|² ≤ C ∆_n^κ for some κ ∈ (0, 1] and ∆_n Σ_{K∈Î_n} √K/(log log K) = o_p(1).

Assumption 7(ii) does not impose regularity on the structural function h but rather on the conditional mean function m. Assumption 7(iv) restricts the complexity of the set of functions H^r by imposing a finite entropy integral condition. Assumptions 7(v)(vi) were similarly imposed in Chen and Pouzo [2012]. We are now in the position to establish an upper bound for the type I and type II errors of the data-driven test, uniformly over the function class M(δ°, r_n).

Theorem 4.1. Let Assumptions 1, 5(ii) with J replaced by K, and 7 be satisfied. Then, for any ε > 0 there exists a constant δ° > 0 such that

lim sup_{n→∞} { sup_{h∈H_0} P_h(ÎT_n = 1) + sup_{h∈M(δ°, r_n)} P_h(ÎT_n = 0) } ≤ ε,   (4.4)

where the rate r_n is given by r_n = (√(log log n)/n)^{2γ/(4γ+d_w)}.

Remark 4.1 (Formulation of Hypotheses). We see from Theorem 4.1 that the IT rate attains the usual minimax rate of testing within a (log log n) term and, in particular, does not suffer from the ill-posedness of the underlying inverse problem. This is due to the definition of the class of alternative functions M(δ°, r_n), which measures deviations from the null by the squared norm of the conditional mean function under consideration. Note that it is possible to achieve the same rate of testing as in ST once we are willing to impose link conditions between ‖h − h_r‖_µ and ‖m(·, h_r)‖_{L²(W)}. In the interest of space, we do not provide this result explicitly in this paper.

Remark 4.2 (Comparison of Testing Rates). From Corollary 3.1 we see that only the rate of testing of ST suffers from the ill-posed inverse problem. Nevertheless, in the mildly ill-posed case, the rate of ST can even be better than that of IT, that is,

n^{−2p/(4(p+ζ)+d_x)} < n^{−2γ/(4γ+d_w)}

if and only if the dimension of W satisfies d_w > (γ/p)(4ζ + d_x). In contrast, in the severely ill-posed case the rate of ST is always slower than the IT rate.
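The threshold in Remark 4.2 is a short arithmetic fact that can be verified directly. The sketch below compares the two rate exponents for illustrative parameter values (the specific numbers are made up).

```python
# ST's mildly ill-posed exponent is 2p/(4(p+zeta)+d_x); IT's is 2*gamma/(4*gamma+d_w).
# ST is faster exactly when d_w > (gamma/p)*(4*zeta + d_x).
def st_exponent(p, zeta, d_x):
    return 2 * p / (4 * (p + zeta) + d_x)

def it_exponent(gamma, d_w):
    return 2 * gamma / (4 * gamma + d_w)

p, zeta, d_x, gamma = 1.0, 0.5, 1.0, 1.0
threshold = (gamma / p) * (4 * zeta + d_x)   # equals 3 for these values
```

At d_w equal to the threshold the two exponents coincide; above it the ST exponent is strictly larger.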
5. Empirical Applications and Further Simulation Studies
We present three empirical applications of our adaptive structure-space tests for NPIV models. The first tests for monotonicity of demand for differentiated products in IO. The second tests for monotonicity of a gasoline demand function. The third tests for monotonicity or parametric specification of Engel curves for non-durable goods consumption. In all of the empirical applications, we implement the adaptive test ŜT_n given in (2.2), using the adaptive critical values η̂_J(α) presented in Remark 3.1 (see Appendix C for bootstrap critical values). Throughout this section, we use a quadratic B-spline basis with a varying number of knots to approximate h and set K(J) = 2J. For any sieve dimension J ∈ Î_n (the ES index set), we denote the corresponding "standardized test value" as

Ŵ_J := n D̂_J(ĥ^r_J) / (η̂_J(α) √(v̂_J(ĥ^r_J)))

at the nominal level α = 0.05. The null hypothesis is rejected whenever Ŵ_J > 1 for some J ∈ Î_n. The tables below report a set 𝒥̂ ⊂ Î_n, which equals arg max_{J∈Î_n} Ŵ_J when the null is not rejected at the nominal level α = 0.05, and equals {J ∈ Î_n : Ŵ_J > 1} when the null is rejected at the nominal level α = 0.05. Let Ĵ denote the minimal element of 𝒥̂. In all the tables below, we report Ŵ_Ĵ and its corresponding p-value, which should, by Bonferroni correction, be compared to the nominal level 0.05 divided by the cardinality of Î_n, which differs across the applications presented below.

5.1.1. Adaptive ST Testing for a Multi-Product Demand System

Recently, Compiani [2019] applies nonparametric estimation of a NPIV model as an alternative to the BLP semiparametric specification. He directly imposes shape restrictions to reduce dimensionality in nonparametric estimation of own-price and cross-price elasticities of multi-product demand. Following Compiani [2019], we use the Nielsen scanner data set on demand for strawberries in California.

Our application here is to provide adaptive nonparametric hypothesis tests of the monotonicity of demand in its own price. First notice that by imposing an index restriction and a connected-substitutes assumption the empirical model can be written in NPIV form, following Berry and Haile [2014] and Compiani [2019]:

Y_o = h(P_o, P_no, P_other, S_o, In) − U,   E[U | W_o, W_no, W_other, W_{S_o}, In] = 0,

where Y_o denotes a measure of taste for organic products, S_o denotes the share of the organic products, In income, and U unobserved shocks for organic produce. The vector (P_o, P_no, P_other) denotes the prices of organic strawberries, non-organic strawberries, and non-strawberry fresh fruit, respectively. To account for possible endogeneity of prices, we follow Compiani [2019] and use Hausman-type instrumental variables denoted by (W_o, W_no, W_other) and shipping-point spot prices W_{S_o} as a proxy for the wholesale prices faced by retailers.

Income      H0: h is decreasing in P_o                H0: h is increasing in P_o
groups      Ŵ_Ĵ      p val.   reject H0?   𝒥̂          Ŵ_Ĵ    p val.   reject H0?   𝒥̂
low         0.552    0.103    no           {–}        –      –        yes          {–}
middle      0.234    0.246    no           {–}        –      0.000    yes          {–, –, –}
high        1.526    0.000    yes          {–, –, –}  –      0.000    yes          {–, –, –, –}

Table 2: Adaptive testing for monotonic demand for organic products.
Data on prices and quantities come from the 2014 Nielsen scanner data set. For each market, the most granular unit of observation in the Nielsen data is a UPC (i.e., a specific bar code). We consider a "low income", "middle income", and "high income" group based on individuals between the 5%–25%, 40%–60%, and 75%–95% quantiles, respectively, of the income distribution. The resulting sample sizes for the three subgroups are 1509, 1491, and 2093. We implement our adaptive test ŜT_n by making use of a semiparametric specification of the structural demand function: we consider the tensor product of quadratic B-splines q_{J₁}(P_o) and the vector (1, In), hence J = 2J₁. The other variables are accounted for separately, that is, P_no and P_other are fixed with two interior knots and market shares S_o enter linearly. The cardinality of the index set Î_n changes across the subgroups considered: the set contains three elements for the low and middle income groups but four elements for the high income group.

According to Table 2, our adaptive test rejects the null of a weakly decreasing demand curve (in own price) at the nominal level α = 0.05 for the high income group, but fails to reject decreasing demand for the low and middle income groups. In addition, our adaptive test rejects the null of a weakly increasing demand curve (in own price) for all income groups.
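The reporting rule used in the tables can be stated compactly. The sketch below is a stylized version of that rule with made-up standardized test values; it only reproduces the decision logic, not the estimation behind Ŵ_J.

```python
import numpy as np

def report(W_vals, I_n):
    # W_vals[i] is the standardized test value for sieve dimension I_n[i].
    # Reject when some W_J > 1; report {J : W_J > 1} in that case,
    # and the argmax of W_J over the scan set otherwise.
    rejected = [J for J, w in zip(I_n, W_vals) if w > 1]
    if rejected:
        return True, rejected
    best = I_n[int(np.argmax(W_vals))]
    return False, [best]
```

The minimal element of the reported set then plays the role of Ĵ, whose p-value is compared against the Bonferroni-corrected level 0.05 divided by the cardinality of the scan set.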
In this subsection, we revisit the question of whether gasoline demand is monotonically decreasing in its own price. We consider the following partially linear specification of the demand function:

Y = h(P, In) + γ′H + U,   E[U | W, In, H] = 0,

where Y denotes annual log gasoline consumption of a household, P the log price of gasoline (average local price), In log household income, H control variables (such as the log age of the household respondent, log household size, the log number of drivers, and the number of workers in the household), and W the distance to a major oil platform, used as an instrumental variable. We allow the price P to be endogenous, but assume that (In, H) is exogenous.
Income      H0: h is decreasing in P                 H0: h is increasing in P
groups      Ŵ_Ĵ      p val.   reject H0?   𝒥̂         Ŵ_Ĵ    p val.   reject H0?   𝒥̂
low         0.433    0.111    no           {–}       –      –        yes          {–, –}
high        11.703   0.000    yes          {–, –}    –      –        yes          {–, –, –}

Table 3: Adaptive testing for monotonicity of gasoline demand.
Consumer theory requires downward-sloping compensated demand curves. In the following we provide a test for monotonicity of the uncompensated (Marshallian) demand curve. The data are from the 2001 National Household Travel Survey (NHTS), which surveys, at the household level, the civilian noninstitutionalized population in the United States. Following Blundell et al. [2017], we restrict the analysis to households with a white respondent, two or more adults, at least one child under age 16, and at least one driver. The resulting sample contains 3,640 observations. (We thank Matthias Parey for sharing the dataset with us. We refer the reader to Blundell et al. [2017] for a detailed description of the data.)

Chetverikov and Wilhelm [2017] and Freyberger and Reeves [2019] have used this dataset to estimate gasoline demand by assuming a decreasing demand curve. Instead, we

Figure 4:
Estimated demand curves for the low income group using Ĵ = 8: constrained estimator (blue solid line) and unconstrained estimator (red dashed line). LHS: dotted blue lines show 95% uniform CBs based on the constrained estimator. RHS: dotted red lines show 95% uniform CBs based on the unconstrained estimator. All other variables are fixed at their median levels (implying $32,500 of income).

aim at testing the monotonicity of the gasoline demand here. We consider a semiparametric specification similar to theirs to approximate the unknown function h; that is, our test is based on the tensor product of quadratic B-splines q_{J₁}(P) and the vector (1, In), hence J = 2J₁. To analyze heterogeneity in demand across income levels, we make use of two different subsamples of the data set. We consider a sample of n = 803 "low income" households whose income is below the first quartile, and a sample of n = 1369 "high income" households whose income is weakly above the third quartile. When computing the ES index set, we obtain an index set Î_n with two elements for the low income group and three elements for the high income group.

According to Table 3, our adaptive test rejects the null of weakly decreasing gasoline demand at the nominal level α = 0.05 for the high income group, but fails to reject decreasing demand for the low income group. In addition, our adaptive test rejects the null of a weakly increasing gasoline demand curve for both income groups.

We illustrate our testing results in Figures 4 and 5, which show the graphs of the restricted and unrestricted NPIV estimators for the low and high income groups, respectively. The estimators are implemented using the sieve dimension Ĵ = 8 in both groups. Variables other than price are fixed at their median level in each subgroup. In

Figure 5:
Estimated demand curves for the high income group using (cid:98) J = 8: constraint estimator(blue solid line) and unconstraint estimator (red dashed line). LHS: Dotted blue linesshow 95% uniform CBs based on constrained estimator. RHS: Dotted red lines show95% uniform CBs based on unconstrained estimator. All other variables are fixed totheir median levels (implying $120,000 of income). particular, the median income is given by roughly $32,500 and $120,000 for the two incomegroups considered. We also provide 95% bootstrap uniform confidence bands (CBs), using1000 bootstrap iterations. Both figures are in line with our adaptive testing results reportedin Table 3, that is, only the demand curves for high income households seem to violate amontonically decreasing shape. For high income households the unrestricted demand curvesare slightly outside of the 95% confidence bands of the restricted NPIV estimator. Engel curves play a central role in the analysis of consumer demand and describe thehousehold budget share Y (cid:96) for non-durable goods (cid:96) as a function of household log-totalexpenditure X . Blundell et al. [2007] estimated a system of nonparametric Engel curvesas functions of endogenous log-total expenditure and family size, using log-gross earningsof the head of household as an instrument W . We use a subset of their data from the1995 British Family Expenditure Survey, with the head of household aged between 20 and55 and in work, and household with one or two children. This leaves a sample of size n = 1027. As an illustration we consider Engel curves h (cid:96) ( X ) for four non-durable goods (cid:96) :“food in”, “fuel”, “travel”, and “leisure”: E [ Y j − h (cid:96) ( X ) | W ] = 0. We use the same quadratic29-spline basis with up to 3 knots to approximate all the Engel curves. Hence the index set (cid:98) I n = { , , , } is the same for the different Engel curves. Table 4 reports the adaptivetest for weak monotonicity of Engel curves. 
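Both applications rest on the same sieve ingredient: a quadratic B-spline basis, interacted with a group indicator in the gasoline application. A minimal, self-contained sketch of that construction follows; the knot locations, the simulated inputs, and the interaction with the indicator are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def bspline_basis(x, interior_knots, degree=2):
    """Quadratic (default) B-spline basis on [0, 1] via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    t = np.concatenate([np.zeros(degree + 1),
                        np.asarray(interior_knots, dtype=float),
                        np.ones(degree + 1)])
    m = len(t) - 1
    # Degree-0 (indicator) functions on the knot intervals.
    B = np.zeros((x.size, m))
    for j in range(m):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    B[x == 1.0, m - degree - 1] = 1.0  # close the right endpoint
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        Bn = np.zeros((x.size, m - d))
        for j in range(m - d):
            if t[j + d] > t[j]:
                Bn[:, j] += (x - t[j]) / (t[j + d] - t[j]) * B[:, j]
            if t[j + d + 1] > t[j + 1]:
                Bn[:, j] += (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) * B[:, j + 1]
        B = Bn
    return B  # shape: (len(x), len(interior_knots) + degree + 1)

# Tensor the price basis with (1, group indicator), doubling the dimension.
rng = np.random.default_rng(0)
price = rng.uniform(size=200)
high_inc = rng.integers(0, 2, size=200)
qJ = bspline_basis(price, [0.25, 0.5, 0.75])      # J = 6 basis functions
sieve = np.hstack([qJ, qJ * high_inc[:, None]])   # total dimension 2J = 12
```

Varying the number of interior knots traces out the candidate sieve dimensions J over which the exponential-scan index set Î_n searches.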
                 H0: h0 is increasing                  H0: h0 is decreasing
Goods        Ŵ_Ĵ      p value   reject H0?   Ĵ         Ŵ_Ĵ      p value   reject H0?   Ĵ
"food in"    1.347    0.002     yes          {·}       -0.269   0.821     no           {·}
"fuel"       4.484    0.000     yes          {·, ·}    ·        ·         ·            {·}
"travel"     2.074    0.000     yes          {·}       ·        ·         ·            {·}
"leisure"    0.295    0.151     no           {·}       ·        ·         ·            {·, ·}

Table 4: Adaptive testing for monotonicity of Engel curves.
Table 4 reports that our test rejects increasing Engel curves for the "food in", "fuel", and "travel" categories, and also rejects a decreasing Engel curve for "leisure", at the 0.05 nominal level. Previously, to decide whether the Engel curves are strictly monotonic, estimated derivatives of these functions together with their 95% uniform confidence bands were provided in Chen and Christensen [2018, Figure 4]. Those confidence bands contain zero over almost the whole support of household expenditure. It is interesting to see that our adaptive test is more informative about monotonicity in certain directions that are not obvious from their uniform confidence bands.

                 H0: h0 is linear                      H0: h0 is quadratic
Goods        Ŵ_Ĵ      p value   reject H0?   Ĵ         Ŵ_Ĵ      p value   reject H0?   Ĵ
"food in"    -0.169   0.644     no           {·}       ·        ·         ·            {·}
"fuel"       1.174    0.004     yes          {·}       -0.089   0.502     no           {·}
"travel"     1.052    0.007     yes          {·}       -0.108   0.531     no           {·}
"leisure"    0.536    0.074     no           {·}       ·        ·         ·            {·}

Table 5: Adaptive testing for linear/quadratic specification of Engel curves.
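In the notation of the moment restriction E[Y_ℓ − h_ℓ(X) | W] = 0, the two composite nulls reported in Table 5 can be written schematically as parametric restrictions on h_ℓ (with β's denoting unknown coefficients; this is a summary of the hypotheses, not a formula from the paper):

```latex
H_0^{\mathrm{lin}}:\; h_\ell(x) = \beta_{0,\ell} + \beta_{1,\ell}\, x
\qquad\text{versus}\qquad
H_0^{\mathrm{quad}}:\; h_\ell(x) = \beta_{0,\ell} + \beta_{1,\ell}\, x + \beta_{2,\ell}\, x^2,
```

each maintained jointly with the conditional moment restriction E[Y_ℓ − h_ℓ(X) | W] = 0.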
The most popular class of parametric demand systems is the almost ideal class, pioneered by Deaton and Muellbauer [1980], where budget shares are assumed to be linear in log-total expenditure. Banks et al. [1997] proposed a popular extension of this model that includes a squared term in log-total expenditure. Table 5 presents tests of either a linear or a quadratic specification of the Engel curves for the four goods considered above. According to this table, at the nominal level α = 0.05 our adaptive ST fails to reject the quadratic form for all the goods, while it rejects a linear Engel curve for the fuel and travel goods.

5.2. A Monte Carlo Study: Testing for Parametric Restrictions
In this subsection, we compare the finite-sample performance of the adaptive structure-space test (ŜT_n) and the adaptive image-space test (ÎT_n) when h_0 takes a parametric form under the null. The results are based on 5000 Monte Carlo replications for every experiment. We consider two designs similar to the simulation setup in (2.7). First, we set X_i = Φ(X*_i) and W_i = Φ(W*_i), where the random vector (X*_i, W*_i, U_i)′ is drawn from a mean-zero trivariate normal distribution in which X*_i and W*_i have unit variance, U_i has variance σ²_U, the covariance between X*_i and W*_i equals ξ, the covariance between X*_i and U_i is nonzero (inducing endogeneity), and W*_i is uncorrelated with U_i; we refer to this as design (5.1). Second, we consider a multivariate conditioning variable W = (W_1, W_2): we set X_i = Φ(X*_i), W_{1i} = Φ(W*_{1i}), and W_{2i} = Φ(W*_{2i}), where (X*_i, W*_{1i}, W*_{2i}, U_i)′ is mean-zero multivariate normal, the covariance between X*_i and each of W*_{1i}, W*_{2i} equals ξ, and the instruments are mutually uncorrelated and uncorrelated with W*-components other than through X*_i; we refer to this as design (5.2). In both designs, (5.1) and (5.2), we let σ_U = 0.2 and generate Y_i according to (2.6) with the structural function

    h_0(x) = −x/(·) + c_A x² + (c_B x)³,

where (·) denotes a fixed positive constant and the constants c_A and c_B vary across the experiments below. Specifically, we consider either a linear null, (c_A, c_B) = (0, 0), or a quadratic null, (c_A, c_B) = (1, 0); under the alternative we set (c_A, c_B) = (1, 1) when the null is linear and (c_A, c_B) = (1, 5) when the null is quadratic.

The simulation sample size is n = 500. We implement the adaptive structure-space test ŜT_n given in (2.2) using quadratic B-spline basis functions with a varying number of knots. The number of knots varies within the index set Î_n as implemented in the last subsection, again with K(J) = 2J. In addition, we implement the adaptive image-space test ÎT_n given in (4.2) using quadratic B-spline basis functions with a varying number of knots, where h_0 is replaced by the 2SLS estimator.

In Table 6, we report the empirical rejection probabilities of the test statistics ŜT_n and ÎT_n and of their bootstrapped versions ŜT^B_n and ÎT^B_n, with 200 bootstrap replications for the bootstrapped critical values. Again, the adaptive tests and their respective bootstrapped versions perform similarly in terms of size and power.

Design            ξ     Null (c_A,c_B)  Alt (c_A,c_B)   ŜT^B_n  ŜT_n   Ĵ_ST   ÎT^B_n  ÎT_n   K̂_IT   t-test
(5.1), d_x = d_w  0.·   (0, 0)                          0.046   0.055  3.92   0.051   0.047  3.81   0.026
                        (1, 0)                          0.022   0.031  4.17   0.029   0.024  4.14   0.001
                        (0, 0)          (1, 0)          0.072   0.088  3.78   0.084   0.080  3.76   0.084
                        (0, 0)          (1, 1)          0.272   0.279  3.40   0.252   0.245  3.54   0.380
                        (1, 0)          (1, 5)          0.092   0.120  4.16   0.038   0.037  4.12   0.087
                  0.·   (0, 0)                          0.046   0.049  4.13   0.042   0.049  3.82   0.045
                        (1, 0)                          0.026   0.029  4.53   0.023   0.026  4.12   0.027
                        (0, 0)          (1, 0)          0.185   0.167  3.68   0.184   0.198  3.57   0.293
                        (0, 0)          (1, 1)          0.810   0.815  3.06   0.802   0.822  3.14   0.912
                        (1, 0)          (1, 5)          0.515   0.538  4.08   0.295   0.330  4.12   0.728
(5.2), d_x < d_w  0.·   (0, 0)                          0.052   0.052  3.98   0.041   0.039  7.82   0.035
                        (1, 0)                          0.030   0.032  4.20   0.028   0.032  8.12   0.007
                        (0, 0)          (1, 0)          0.111   0.115  3.79   0.061   0.052  7.73   0.121
                        (0, 0)          (1, 1)          0.430   0.461  3.31   0.113   0.116  7.42   0.536
                        (1, 0)          (1, 5)          0.321   0.340  4.28   0.031   0.032  8.17   0.255
                  0.·   (0, 0)                          0.042   0.050  4.10   0.036   0.040  7.84   0.045
                        (1, 0)                          0.026   0.031  4.43   0.033   0.034  8.13   0.034
                        (0, 0)          (1, 0)          0.209   0.206  3.61   0.064   0.063  7.46   0.322
                        (0, 0)          (1, 1)          0.874   0.883  3.04   0.370   0.354  6.54   0.942
                        (1, 0)          (1, 5)          0.687   0.700  4.08   0.054   0.052  8.25   0.797

Table 6: Empirical rejection probabilities of ŜT^B_n, ŜT_n, ÎT^B_n, ÎT_n, and the t-test (of the hypothesis (c_A, c_B) = (0, 0) if the null is linear and of (c_A, c_B) = (1, 0) if the null is quadratic) under the 5% nominal level, n = 500. Rows without an "Alt" entry report empirical size.

All these adaptive tests are compared with the asymptotic t-test of the hypothesis c_A = 0 if the null is linear and of c_B = 0 if the null is quadratic. We see that the adaptive tests are overall not very conservative in comparison with the t-test, which imposes the specific parametric restrictions. The power of all tests increases as ξ increases (i.e., as the instrument gets stronger). Remarkably, the adaptive ST performs about as well as the asymptotic t-test does. For the first design, (5.1) (with d_x = d_w), the adaptive ST is somewhat more powerful than the adaptive IT when the true model is cubic, i.e., (c_A, c_B) = (1, 1); for the second design, (5.2) (with d_x < d_w), the adaptive ST is much more powerful than the adaptive IT. This finding is consistent with our theoretical results as described in Remark 4.2. This severe power difference also holds when cubic or quartic B-splines are used as the vector of instrument basis functions b_K(W).

6. Conclusion

In this paper, we propose data-driven, minimax rate-adaptive hypothesis tests on a structural function in semi-nonparametric conditional moment restrictions. Our main focus is the adaptive structure-space test (ST), which is based on a leave-one-out sieve estimate of the quadratic distance between structural functions of endogenous variables in NPIV models. We also present the minimax rate-adaptive image-space test (IT) as a comparison, because our sieve IT is related to Bierens-type tests that have been utilized in many existing papers on inference for semi-nonparametric conditional moment restrictions. None of the existing papers on testing in NPIV models, however, considers a data-driven choice of tuning parameters. For both tests, we first establish their respective minimax rates of testing against classes of nonparametric alternative models.
We then provide computationally simple data-driven choices of sieve tuning parameters and adaptive critical values. The resulting tests attain the optimal minimax rate of testing, adapt to the unknown smoothness of functions, and are robust to the unknown degree of ill-posedness (endogeneity). Data-driven confidence sets are easily obtained by inverting the adaptive ST. Monte Carlo studies and empirical applications demonstrate that our simple, adaptive ST has good size and power properties in finite samples for testing monotonicity or parametric restrictions in NPIV models, without the need for computationally intensive bootstrapped critical values.
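To make the simulation design of Section 5.2 concrete, a minimal sketch of the first design, (5.1), follows. Only ξ and σ_U = 0.2 are pinned down in the text; the X*-U covariance (set to 0.1 here) and the linear slope of h_0 (set to −1/2 here) are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt

# Standard normal cdf, applied elementwise.
Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def simulate_design_51(n=500, xi=0.5, sigma_u=0.2, c_a=1.0, c_b=0.0, seed=0):
    rng = np.random.default_rng(seed)
    # Covariance of (X*, W*, U): xi links instrument and regressor; the
    # X*-U covariance of 0.1 (an assumption) induces endogeneity.
    cov = np.array([[1.0, xi,  0.1],
                    [xi,  1.0, 0.0],
                    [0.1, 0.0, sigma_u ** 2]])
    xs, ws, u = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    x, w = Phi(xs), Phi(ws)  # map regressor and instrument onto (0, 1)
    # Structural function: linear term plus quadratic (c_a) and cubic (c_b)
    # terms; the slope -1/2 on the linear part is an assumption.
    h0 = -x / 2 + c_a * x ** 2 + (c_b * x) ** 3
    return x, w, h0 + u

x, w, y = simulate_design_51()
```

Setting (c_a, c_b) = (0, 0) or (1, 0) generates data under the linear or quadratic null, while (1, 1) and (1, 5) generate the alternatives used in Table 6.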
References

D. Al Mohamad, E. W. Van Zwet, E. Cator, and J. J. Goeman. Adaptive critical value for constrained likelihood ratio testing. arXiv preprint arXiv:1806.01325, 2018.
J. Banks, R. Blundell, and A. Lewbel. Quadratic Engel curves and consumer demand. Review of Economics and Statistics, 79(4):527–539, 1997.
S. T. Berry and P. A. Haile. Identification in differentiated products markets using market level data. Econometrica, 82(5):1749–1797, 2014.
H. J. Bierens. A consistent conditional moment test of functional form. Econometrica, 58(6):1443–1458, 1990.
R. Blundell, X. Chen, and D. Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613–1669, 2007.
R. Blundell, J. Horowitz, and M. Parey. Nonparametric estimation of a nonseparable demand function under the Slutsky inequality restriction. Review of Economics and Statistics, 99(2):291–304, 2017.
C. Breunig. Goodness-of-fit tests based on series estimators in nonparametric instrumental regression. Journal of Econometrics, 184(2):328–346, 2015.
C. Breunig and J. Johannes. Adaptive estimation of functionals in nonparametric instrumental regression. Econometric Theory, 32(3):612–654, 2016.
C. Butucea, C. Matias, and C. Pouet. Adaptive goodness-of-fit testing from indirect observations. Ann. Inst. Henri Poincaré Probab. Stat., 45(2):352–372, 2009.
T. T. Cai and M. G. Low. Adaptive confidence balls. The Annals of Statistics, 34(1):202–228, 2006.
S. Centorrino. Data driven selection of the regularization parameter in additive nonparametric instrumental regressions. Working paper, Stony Brook, 2014.
X. Chen and T. Christensen. Optimal sup-norm rates, adaptivity and inference in nonparametric instrumental variables estimation. Working paper, 2015.
X. Chen and T. M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39–84, 2018.
X. Chen and D. Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.
X. Chen and D. Pouzo. Sieve quasi likelihood ratio inference on semi/nonparametric conditional moment models. Econometrica, 83(3):1013–1079, 2015.
X. Chen and M. Reiß. On rate optimality for ill-posed inverse problems in econometrics. Econometric Theory, 27(3):497–521, 2011.
V. Chernozhukov, S. Lee, and A. M. Rosen. Intersection bounds: Estimation and inference. Econometrica, 81(2):667–737, 2013.
V. Chernozhukov, W. K. Newey, and A. Santos. Constrained conditional moment restriction models. arXiv preprint arXiv:1509.06311, 2015.
D. Chetverikov and D. Wilhelm. Nonparametric instrumental variable estimation under monotonicity. Econometrica, 85:1303–1320, 2017.
O. Collier, L. Comminges, and A. B. Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics, 45(3):923–958, 2017.
G. Compiani. Market counterfactuals and the specification of multi-product demand: A nonparametric approach. Available at SSRN 3134152, 2019.
A. Deaton and J. Muellbauer. An almost ideal demand system. American Economic Review, pages 312–326, 1980.
Z. Fang and J. Seo. A general framework for inference on shape restrictions. arXiv preprint arXiv:1910.07689, 2019.
J. Freyberger and B. Reeves. Inference under shape restrictions. Working paper, University of Wisconsin, 2019.
E. Gautier and E. Le Pennec. Adaptive estimation in the nonparametric random coefficients binary choice model by needlet thresholding. Electronic Journal of Statistics, 12(1):277–320, 2018.
E. Gine and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2016.
E. Guerre and P. Lavergne. Data-driven rate-optimal specification testing in regression models. The Annals of Statistics, 33(2):840–870, 2005.
L. P. Hansen and T. J. Sargent. Robustness. Princeton University Press, 2008.
J. L. Horowitz. Testing a parametric model against a nonparametric alternative with identification through instrumental variables. Econometrica, 74(2):521–538, 2006.
J. L. Horowitz. Adaptive nonparametric instrumental variables estimation: Empirical choice of the regularization parameter. Journal of Econometrics, 180:158–173, 2014.
J. L. Horowitz and V. G. Spokoiny. An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative. Econometrica, 69(3):599–631, 2001.
C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for U-statistics of order two. In Stochastic Inequalities and Applications, pages 55–69. Springer, 2003.
J. Z. Huang. Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics, 26(1):242–272, 1998.
Y. I. Ingster. Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Math. Methods Statist., 2(2):85–114, 1993.
Y. I. Ingster, T. Sapatinas, and I. A. Suslina. Minimax signal detection in ill-posed inverse problems. The Annals of Statistics, 40(3):1524–1549, 2012.
M. Jansson and D. Pouzo. Towards a general large sample theory for regularized estimators. Working paper, University of California, Berkeley, 2019.
J. Lei. Adaptive global testing for functional linear models. Journal of the American Statistical Association, 109(506):624–634, 2014.
C.-A. Liu and J. Tao. Model selection and model averaging in nonparametric instrumental variables models. Working paper, National University of Singapore and University of Wisconsin-Madison, 2014.
W. K. Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168, 1997.
J. Robins and A. van der Vaart. Adaptive nonparametric confidence sets. The Annals of Statistics, 34(1):229–253, 2006.
A. Santos. Inference in nonparametric instrumental variables with partial identification. Econometrica, 80(1):213–275, 2012.
M. J. Silvapulle and P. K. Sen. Constrained Statistical Inference: Inequality, Order and Shape Restrictions. John Wiley & Sons, 2005.
V. G. Spokoiny. Adaptive hypothesis testing using wavelets. The Annals of Statistics, 24(6):2477–2498, 1996.
J. Tao. Inference for point and partially identified semi-nonparametric conditional moment models. Unpublished manuscript, University of Wisconsin-Madison, 2014.
A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
Y. Zhu. Inference in nonparametric/semiparametric moment equality models with shape restrictions. Quantitative Economics, 11(2):609–636, 2020.
A. Proofs of Minimax ST Results in Subsection 3.2

We introduce additional notation used in the proofs. For an r × c matrix M with r ≤ c and full row rank r, we let M^-_l denote its left pseudoinverse, namely (M′M)^- M′, where ′ denotes transpose and ^- denotes generalized inverse. We define A = [S′G_b^- S]^- S′G_b^-, where S = E[b_K(W) ψ_J(X)′]. For all J we have s_J = s_min(G_b^{-1/2} S G^{-1/2}) > 0 and
\[
\big\| G^{1/2} A G_b^{1/2} \big\|
= \big\| G^{1/2} [S' G_b^- S]^- S' G_b^{-1/2} \big\|
= \big\| [G^{-1/2} S' G_b^- S G^{-1/2}]^- G^{-1/2} S' G_b^{-1/2} \big\|
= \big\| \big( G_b^{-1/2} S G^{-1/2} \big)^-_l \big\|
= s_J^{-1}.
\]
We set V_{Ji} := (Y_i − h(X_i)) G^{1/2} A b_K(W_i). We also introduce the projection Q_J h(·) = ψ_J(·)′ A E[h(X) b_K(W)] for h ∈ L²_µ, and the normalization term
\[
v_J(h) = \big\| \mathbb{E}_h\big[ (Y - h(X))^2\, G^{1/2} A\, b_{K(J)}(W)\, b_{K(J)}(W)'\, A' G^{1/2} \big] \big\|_F.
\]
For simplicity of notation, we write D̂_J instead of D̂_J(h) and v_J instead of v_J(h).

Proof of Theorem 3.1.
For the proof of the result, we proceed in three steps. First, we bound the type I error of the test statistic S̃T_{n,J} = 1{n D̂_J > η′ v_J} for some constant η′ > 0. Second, we bound the type II error of S̃T_{n,J} with η′ replaced by η″ > 0. Third, we make the dependence of η′ and η″ on η explicit and show that the bounds derived in the previous steps are sufficient to control the type I and type II errors of our test statistic S̃T_{n,J}.

Step 1: Under the simple null hypothesis we have ‖h − h_0‖²_µ = 0, and hence for any ε > 0 and η sufficiently large,
\[
P_{h_0}\big( \widetilde{ST}_{n,J} = 1 \big) = P_{h_0}\big( n \widehat D_J > \eta' v_J \big) \le \varepsilon
\]
by the upper bound derived in (F.4).

Step 2: From Step 2 of the proof of Theorem 3.3 (with J* replaced by J and η″_{J*} replaced by η″), we obtain uniformly over h ∈ H(δ*, r_{n,J}) that
\[
P_h\big( \widetilde{ST}_{n,J} = 0 \big) = P_h\Big( \widehat D_J \le \frac{\eta'' v_J}{n} \Big) = o(1).
\]

Step 3: Finally, we account for estimation of the normalization factor v_J. Since |v̂_J − v_J| ≤ c_0 v_J wpa1 for some 0 < c_0 < 1, it holds that (1 − c_0) v_J ≤ v̂_J ≤ (1 + c_0) v_J wpa1. Let η′ and η″ be such that
\[
\sqrt{\eta} = \frac{\eta'}{1 - c_0} = \frac{\eta''}{1 + c_0}.
\]
We control the type I error of the feasible test ST_{n,J} as follows:
\[
P_{h_0}( ST_{n,J} = 1 )
\le P_{h_0}\big( ST_{n,J} = 1,\ \widehat v_J \ge (1 - c_0) v_J \big) + P_{h_0}\big( \widehat v_J < (1 - c_0) v_J \big)
\le P_{h_0}\big( n \widehat D_J > \sqrt{\eta}\,(1 - c_0)\, v_J \big) + o(1)
= P_{h_0}\big( n \widehat D_J > \eta' v_J \big) + o(1) = o(1),
\]
where the last equality is due to Step 1. To bound the type II error of ST_{n,J}, we calculate uniformly over h ∈ H(δ*, r_{n,J}) that
\[
P_h( ST_{n,J} = 0 )
\le P_h\big( ST_{n,J} = 0,\ \widehat v_J \le (1 + c_0) v_J \big) + P_h\big( \widehat v_J > (1 + c_0) v_J \big)
= P_h\big( n |\widehat D_J| \le \eta'' v_J \big) + o(1) = o(1),
\]
where the last equality is due to Step 2.

Proof of Corollary 3.1.
We make use of the observation s_J^{-1} = (1 + o(1)) τ_J.

Proof of (3.3). The choice J ∼ n^{2d_x/(4(p+ζ)+d_x)} implies
\[
\big( n^{-1} \tau_J^2 \sqrt J \big)^{1/2} \sim \big( n^{-1} J^{1/2 + 2\zeta/d_x} \big)^{1/2} \sim n^{-2p/(4(p+\zeta)+d_x)},
\]
and for the bias term J^{-p/d_x} ∼ n^{-2p/(4(p+ζ)+d_x)}.

Proof of (3.4). The choice J ∼ (log n − (2p + d_x/2) ζ^{-1} log log n)^{d_x/ζ} implies
\[
n^{-1} \tau_J^2 \sqrt J \sim n^{-1} \sqrt J \exp\big( J^{\zeta/d_x} \big)
\sim (\log n)^{d_x/(2\zeta)} (\log n)^{-(2p + d_x/2)/\zeta}
= (\log n)^{-2p/\zeta},
\]
and for the bias term J^{-p/d_x} ∼ (log n)^{-p/ζ}, which yields equation (3.4).

Proof of Theorem 3.2.
It is sufficient to consider the Gaussian reduced-form NPIR as in Chen and Reiß [2011]. From the proof of Collier et al. [2017, Lemma 3] we infer, for any h_θ ∈ H(δ*, r_n), that
\[
\inf_{T_n} \Big\{ P_{h_0}(T_n = 1) + \sup_{h \in H(\delta^*, r_n)} P_h(T_n = 0) \Big\}
\ge \inf_{T_n} \big\{ P_{h_0}(T_n = 1) + P_{h_\theta}(T_n = 0) \big\}
\ge 1 - V(P_{h_\theta}, P_{h_0})
\ge 1 - \sqrt{ \chi^2(P_{h_\theta}, P_{h_0}) }, \tag{A.1}
\]
where V(·,·) denotes the total variation distance and χ²(·,·) denotes the χ² divergence. We consider θ = (θ_j)_{j≥1} with θ_j ∈ {−1, 1} and introduce the test function
\[
h_\theta = h_0 + \frac{\delta^*}{\sqrt n} \sum_{j=1}^{J} \tau_j^2 \theta_j \widetilde\psi_j \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{-1/4},
\]
where {ψ̃_j}_{j≥1} forms an orthonormal basis in L²_µ and the dimension parameter J satisfies the inequality restriction
\[
\frac{(\delta^*)^2}{n} \Big( \sum_{j=1}^{J} \tau_j^4\, j^{4p/d_x} \Big)^{1/2} \le C \tag{A.2}
\]
for some constant C > 0. Therefore, orthonormality of the basis functions {ψ̃_j}_{j≥1} in L²_µ together with the Cauchy-Schwarz inequality implies
\[
\sum_{j=1}^{\infty} \langle h_\theta - h_0, \widetilde\psi_j \rangle_\mu^2\, j^{2p/d_x}
= \frac{(\delta^*)^2}{n} \sum_{j=1}^{J} \tau_j^4\, j^{2p/d_x} \Big( \sum_{l=1}^{J} \tau_l^4 \Big)^{-1/2}
\le \frac{(\delta^*)^2}{n} \Big( \sum_{j=1}^{J} \tau_j^4\, j^{4p/d_x} \Big)^{1/2} \le C,
\]
and we conclude that h_θ − h_0 attains the sieve approximation error imposed in Assumption 2. Denoting r_n = n^{-1/2}(Σ_{j=1}^J τ_j⁴)^{1/4}, we obtain
\[
\| h_\theta - h_0 \|_\mu = \frac{\delta^*}{\sqrt n} \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{1/4} = \delta^* r_n,
\]
using ∫ ψ̃_j²(x) µ(x) dx = 1, and hence h_θ ∈ H(δ*, r_n). Under P_{h_θ} the conditional distribution of Y given W is Gaussian with mean T h_θ(W) and variance 1. We may assume that {λ_j, ψ̃_j, b̃_j}, j ≥ 1, forms a singular value decomposition of the conditional expectation operator T. The χ² divergence between P_{h_θ} and P_{h_0} thus satisfies
\[
\chi^2(P_{h_\theta}, P_{h_0}) = \int \Big( \frac{d P_{h_\theta}}{d P_{h_0}} \Big)^2 d P_{h_0} - 1
= \prod_{j=1}^{J} \frac{ \exp(-n \lambda_j^2 \gamma_j^2) + \exp(n \lambda_j^2 \gamma_j^2) }{2} - 1,
\]
where we define γ_j = n^{-1/2} τ_j² (Σ_{l=1}^J τ_l⁴)^{-1/4}. By Tsybakov [2009, Section 2.7.5] there exists a constant c_0 > 0 such that (exp(−nλ_j²γ_j²) + exp(nλ_j²γ_j²))/2 ≤ exp(c_0 n²λ_j⁴γ_j⁴). By making use of the condition λ_j ≤ c τ_j^{-1} for all j ≥ 1 and some c > 0, we obtain (using τ_j → ∞) that
\[
\chi^2(P_{h_\theta}, P_{h_0}) \le \exp\Big( c_0 c^4 n^2 \sum_{j=1}^{J} \tau_j^{-4} \gamma_j^4 \Big) - 1 \le \exp(c_0 c^4) - 1,
\]
by the definition of γ_j. Consequently, the result follows by making use of inequality (A.1).

Finally, we derive specific rate results under the different measures of ill-posedness. Consider the mildly ill-posed case with the choice J ∼ n^{2d_x/(4(p+ζ)+d_x)}; then J satisfies constraint (A.2) up to a constant and
\[
n^{-1/2} \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{1/4} \sim n^{-1/2} \Big( \sum_{j=1}^{J} j^{4\zeta/d_x} \Big)^{1/4} \sim n^{-2p/(4(p+\zeta)+d_x)}.
\]
In the severely ill-posed case, the choice J ∼ (log n − (2p + d_x/2) ζ^{-1} log log n)^{d_x/ζ} also satisfies (A.2) up to a constant and
\[
n^{-1/2} \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{1/4} \sim n^{-1/2} \Big( \sum_{j=1}^{J} \exp\big( 2 j^{\zeta/d_x} \big) \Big)^{1/4} \sim (\log n)^{-p/\zeta},
\]
which completes the proof.

B. Proofs of Adaptive ST Results in Subsection 3.3
We require some additional notation. We set Z_i = (Y_i, X_i, W_i) and introduce the function
\[
R(Z_i, Z_{i'}, D_i) = (Y_i - h(X_i)) 1_{D_i}\, b_K(W_i)' A' G A\, b_K(W_{i'}) (Y_{i'} - h(X_{i'})) 1_{D_{i'}}
- \mathbb{E}_h[(Y - h(X)) 1_D\, b_K(W)]' A' G A\, \mathbb{E}_h[b_K(W) (Y - h(X)) 1_D]
\]
for any set D_i. We define R_1(Z_i, Z_{i'}) := R(Z_i, Z_{i'}, M_i) and R_2(Z_i, Z_{i'}) := R(Z_i, Z_{i'}, M_i^c), where M_i = {|Y_i − h(X_i)| ≤ M_n} and (M_n)_{n≥1} is an increasing sequence satisfying M_n = o(√n ζ_J^{-1} (log log J)^{-1/2}). Based on the kernels R_l, l = 1, 2, we introduce the U-statistics
\[
U_{J,l} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} R_l(Z_i, Z_{i'}).
\]
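The U-statistics U_{J,l} defined above have the generic second-order form U = (2/(n(n−1))) Σ_{i<i'} R(Z_i, Z_{i'}). A minimal numerical sketch with a placeholder kernel (a toy stand-in for the matrix-valued kernels R_l, not the paper's statistic) is:

```python
import numpy as np

def u_statistic(z, kernel):
    """Second-order U-statistic: average of kernel(z_i, z_j) over pairs i < j."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += kernel(z[i], z[j])
    return 2.0 * total / (n * (n - 1))

rng = np.random.default_rng(0)
z = rng.normal(size=50)
# Toy degenerate kernel R(a, b) = a * b: E[R(a, Z)] = a * E[Z] = 0 for
# mean-zero data, the regime covered by the exponential inequality in Lemma G.1.
u = u_statistic(z, lambda a, b: a * b)
# Cross-check via the identity: sum over i != i' of z_i z_j equals
# (sum z_i)^2 - sum z_i^2.
check = (z.sum() ** 2 - (z ** 2).sum()) / (50 * 49)
assert np.isclose(u, check)
```

The leave-one-out structure (summing only over i ≠ i') is exactly what removes the diagonal bias term from the quadratic-distance estimate D̂_J.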
Proof of Theorem 3.3.

For the proof of the result, we proceed in three steps. First, we bound the type I error of the test statistic
\[
\widetilde{ST}_n = 1\Big\{ \max_{J \in \mathcal I_n} \big| n \widehat D_J / (\eta'_J v_J) \big| > 1 \Big\}
\]
for some η′_J > 0. Second, we bound the type II error of S̃T_n with η′_J replaced by η″_J > 0. Let η′_J and η″_J be such that
\[
\sqrt{\eta_J} = \frac{\eta'_J}{1 - c_0} = \frac{\eta''_J}{1 + c_0}
\]
for some constant 0 < c_0 < 1. Finally, we show that the derived bounds are sufficient to control the type I and type II errors of our test statistic ST_n.

Step 1: To control the type I error of S̃T_n, we use the following decomposition, which holds for all h ∈ H_0 with U_i = Y_i − h_0(X_i) and a constant c_1 > 0:
\[
P_h\big( \widetilde{ST}_n = 1 \big)
\le \underbrace{ P_h\Big( \max_{J \in \mathcal I_n} \big| n U_{J,1} / (\eta'_J v_J) \big| > c_1 \Big) }_{=:\,I}
+ \underbrace{ P_h\Big( \max_{J \in \mathcal I_n} \big| n U_{J,2} / (\eta'_J v_J) \big| > c_1 \Big) }_{=:\,II}
+ \underbrace{ P_h\Big( \max_{J \in \mathcal I_n} \Big| \frac{1}{(n-1)\eta'_J v_J} \sum_{i \ne i'} U_i U_{i'}\, b_K(W_i)' \big( A'GA - \widehat A' G \widehat A \big) b_K(W_{i'}) \Big| > c_1 \Big) }_{=:\,III}.
\]

Consider I. From Lemma G.1 and Lemma G.2 we conclude, for u = 2 log log J and the choice of (M_n), that for all J ∈ I_n,
\[
\Lambda_1 \sqrt u + \Lambda_2 u + \Lambda_3 u^{3/2} + \Lambda_4 u^2
\le n v_J \sqrt{\log\log J} + 4 \sigma^2 n s_J^{-2} \log\log J + \sigma n s_J^{-2} (2 \log\log J)^{3/2} + 4 n s_J^{-2} (\log\log J)^2
\]
for n sufficiently large. Lemma F.2 implies s_J^{-2} ≤ σ^{-2} v_J, and Assumption 5(ii) yields log log J ≲ η′_J for all J ∈ I_n and n sufficiently large; hence
\[
\Lambda_1 \sqrt u + \Lambda_2 u + \Lambda_3 u^{3/2} + \Lambda_4 u^2 \le 4^{-1} n v_J \eta'_J.
\]
Consequently, the exponential inequality for degenerate U-statistics in Lemma G.1 with u = 2 log log J, together with the definition of I_n, yields
\[
I \le \sum_{J \in \mathcal I_n} P_h\big( | n U_{J,1} | > c_1 \eta'_J v_J \big) \le 6 \sum_{J \in \mathcal I_n} \exp(-2 \log\log J) = o(1).
\]

Consider II. By Markov's inequality,
\[
II \le n\, \mathbb{E}_h\big[ U^2 1\{ |U| > M_n \} \big] \max_{J \in \mathcal I_n} \frac{ \zeta_J^2 \| G^{1/2} A G_b^{1/2} \|^2 }{ \eta'_J v_J }
\le n M_n^{-2}\, \mathbb{E}_h[U^4]\, \zeta_J^2 \max_{J \in \mathcal I_n} \frac{ s_J^{-2} }{ \eta'_J v_J },
\]
where the fourth moment of U = Y − h_0(X) is bounded under Assumption 1(ii). From Lemma F.2 we deduce s_J^{-2} ≤ σ^{-2} v_J. Thus, the definition of (M_n) gives II = o(1), where we also use the rate restriction on J imposed in Assumption 5(i).

Consider III. Lemma F.5 implies, uniformly in J ∈ I_n,
\[
\frac{1}{n-1} \sum_{i \ne i'} U_i U_{i'}\, b_K(W_i)' \big( A'GA - \widehat A' G \widehat A \big) b_K(W_{i'}) = o_p( v_J \eta'_J ),
\]
and hence III = o(1).
Step 2: We control the type II error of the test statistic S̃T_n with η′_J replaced by η″_J > 0. We may assume that there exists J̃ with J̃^{-2p/d} ≤ n^{-1} v_{J̃}. Thus, by the construction of the set I_n, there exists J* ∈ I_n such that J̃ ≤ J* < 2J̃. We denote K* = K(J*). We further evaluate, for all h ∈ H(δ°, r_n), that
\[
P_h\big( \widetilde{ST}_n = 0 \big)
= P_h\big( n \widehat D_J \le \eta''_J v_J \text{ for all } J \in \mathcal I_n \big)
\le P_h\big( n \widehat D_{J^*} \le \eta''_{J^*} v_{J^*} \big).
\]
We make use of the notation B_J = ‖E_h[V_J]‖² − ‖h − h_0‖²_µ and obtain uniformly over h ∈ H(δ°, r_n) that
\[
P_h\big( n \widehat D_{J^*} \le \eta''_{J^*} v_{J^*} \big)
= P_h\Big( \| \mathbb{E}_h[V_{J^*}] \|^2 - \widehat D_{J^*} > \| \mathbb{E}_h[V_{J^*}] \|^2 - \frac{\eta''_{J^*} v_{J^*}}{n} \Big)
\le IV + V,
\]
where IV bounds the deviation of the degenerate second-order part (n(n−1))^{-1} Σ_{j≤J*} Σ_{i≠i'} V_{ij}V_{i'j} over half the threshold ‖h − h_0‖²_µ − η″_{J*} v_{J*}/n + B_{J*}, V bounds the deviation of the remaining first-order (linear) part over the other half, and we use n/(n−1) ≤ 2 for all n ≥ 2.

Consider IV. We first derive an upper bound for the term B_{J*}. The definitions of V_J and Q_J imply
\[
\| \mathbb{E}_h[V_{J^*}] \|^2 = \mathbb{E}_h[(Y - h_0(X)) b_{K^*}(W)']\, A' G A\, \mathbb{E}_h[(Y - h_0(X)) b_{K^*}(W)] = \| Q_{J^*}(h - h_0) \|_\mu^2.
\]
Consequently, from Lemma F.3 we infer
\[
| B_{J^*} | = \big| \| Q_{J^*}(h - h_0) \|_\mu^2 - \| h - h_0 \|_\mu^2 \big| \le C_B r_n^2
\]
for some constant C_B, due to the definition of J*. To establish an upper bound for IV, we make use of inequality (F.3) together with Markov's inequality, which yields
\[
IV = O\Bigg( \frac{ n^{-2} \big\| \langle Q_{J^*}(h - h_0), \psi_{J^*} \rangle'_\mu ( G_b^{-1/2} S )^-_l \big\|^2 + n^{-2} v_{J^*}^2 }{ \big( \| h - h_0 \|_\mu^2 - \eta''_{J^*} n^{-1} v_{J^*} + B_{J^*} \big)^2 } \Bigg). \tag{B.2}
\]
In the following, we distinguish between two cases. First, consider the case where
\[
n^{-1} \eta''_{J^*} v_{J^*} \le C_\eta r_n^2
\]
for some constant C_η > 0. For any h ∈ H(δ°, r_n) we have ‖h − h_0‖²_µ ≥ (δ°)² r_n², and hence we obtain the lower bound
\[
\| h - h_0 \|_\mu^2 - \eta''_{J^*} n^{-1} v_{J^*} + B_{J^*} \ge \big( (\delta^\circ)^2 - C_\eta - C_B \big) r_n^2 \ge c_2 r_n^2
\]
for the constant c_2 := (δ°)² − C_η − C_B, which is positive for any δ° > √(C_η + C_B). From inequality (B.2) we infer
\[
IV \le O\Big( \frac{ v_{J^*}^2 }{ c_2^2 r_n^4 n^2 } \Big) = o(1).
\]
Second, consider the case where n^{-2} ‖⟨Q_{J*}(h − h_0), ψ_{J*}⟩′_µ (G_b^{-1/2}S)^-_l‖² dominates. Using ‖(G_b^{-1/2} S G^{-1/2})^-_l‖ = s_{J*}^{-1} together with Assumption 2(iii), i.e., the eigenvalues of G are uniformly bounded away from zero, we obtain
\[
n^{-2} \big\| \langle Q_{J^*}(h - h_0), \psi_{J^*} \rangle'_\mu ( G_b^{-1/2} S )^-_l \big\|^2
= O\Big( n^{-2} s_{J^*}^{-2} \big\| \langle Q_{J^*}(h - h_0), \psi_{J^*} \rangle_\mu \big\|^2 \Big)
= O\Big( n^{-2} s_{J^*}^{-2} \big( \| h - h_0 \|_\mu + (J^*)^{-p/d} \big)^2 \Big),
\]
where the last bound is due to Lemma F.3. For any h ∈ H(δ°, r_n) we have ‖h − h_0‖²_µ ≥ (δ°)² r_n² > (δ°)² n^{-1} v_{J*}, and hence obtain the lower bound
\[
\| h - h_0 \|_\mu^2 - \eta''_{J^*} n^{-1} v_{J^*} + B_{J^*} \ge \Big( 1 - \frac{1 + C_B}{(\delta^\circ)^2} \Big) \| h - h_0 \|_\mu^2 \ge c_3 \| h - h_0 \|_\mu^2
\]
for the constant c_3 = 1 − (1 + C_B)/(δ°)², which is positive for any (δ°)² > 1 + C_B. Hence, inequality (B.2) yields for all h ∈ H(δ°, r_n) that
\[
IV = O\Big( n^{-2} s_{J^*}^{-2} \big( \| h - h_0 \|_\mu^{-2} + \| h - h_0 \|_\mu^{-4} (J^*)^{-2p/d} \big) \Big)
= O\big( n^{-2} s_{J^*}^{-2} r_n^{-2} \big) = o(1).
\]
Finally, V = o(1) by making use of Lemma F.4.
Finally, we account for estimation of the normalization factor $v_J$. Now since $|\widehat v_J-v_J|\leq c_0v_J$ wpa1, for some $0<c_0<1$ and uniformly over $J\in\mathcal I_n$ by Lemma F.6, it holds that $(1-c_0)v_J\leq\widehat v_J\leq(1+c_0)v_J$ wpa1. Hence, we control the type I error of the test $ST_n$ as follows, using for all $h\in\mathcal H_0$ that
$$P_h(ST_n=1)\leq P_h\big(ST_n=1,\ \widehat v_J\geq(1-c_0)v_J\text{ for all }J\in\mathcal I_n\big)+P_h\big(\widehat v_J<(1-c_0)v_J\text{ for some }J\in\mathcal I_n\big)$$
$$\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J|/(\eta_J\widehat v_J)\big\}>\sqrt2,\ \widehat v_J\geq(1-c_0)v_J\text{ for all }J\in\mathcal I_n\Big)+o(1)$$
$$\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J|/(\eta_Jv_J)\big\}>(1-c_0)\sqrt2\Big)+o(1)=P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J|/(\eta_J'v_J)\big\}>\sqrt2\Big)+o(1)=o(1),$$
where $\eta_J':=(1-c_0)\eta_J$ and the last equality is due to Step 1. To bound the type II error of the test $ST_n$, recall the definition of $J^*\in\mathcal I_n$ given in Step 2. We evaluate for all $h\in\mathcal H_1(\delta^\circ,r_n)$ that
$$P_h(ST_n=0)\leq P_h\big(|n\widehat D_{J^*}|\leq\sqrt2\,\eta_{J^*}\widehat v_{J^*}\big)\leq P_h\big(|n\widehat D_{J^*}|\leq\sqrt2\,\eta_{J^*}\widehat v_{J^*},\ \widehat v_{J^*}\leq(1+c_0)v_{J^*}\big)+P_h\big(\widehat v_{J^*}>(1+c_0)v_{J^*}\big)$$
$$=P_h\big(|n\widehat D_{J^*}|\leq\sqrt2(1+c_0)\eta_{J^*}v_{J^*}\big)+o(1)=P_h\big(|n\widehat D_{J^*}|\leq\eta''_{J^*}v_{J^*}\big)+o(1)=o(1),$$
where $\eta''_{J^*}:=\sqrt2(1+c_0)\eta_{J^*}$ and the last equality is due to Step 2. Proof of Corollary 3.2.
Theorem 3.3 establishes the rate $r_n^2=n^{-1}\sqrt{\log\log n}\,s_{J_0}^{-2}\sqrt{J_0}$, where $J_0$ satisfies $J_0^{-2p/d_x}\sim n^{-1}\sqrt{\log\log n}\,s_{J_0}^{-2}\sqrt{J_0}$. Consequently, following the proof of Corollary 3.1, the result immediately follows.

Supplement to “Adaptive, Rate-Optimal Testing in Instrumental Variables Models”
Christoph Breunig Xiaohong Chen
First version: August 2018, Revised June 18, 2020
This supplementary appendix contains materials to support our paper. Appendix C contains robustness checks using bootstrap critical values for the empirical illustrations. Appendix D provides proofs of our adaptive ST results under a composite null hypothesis. Appendix E establishes the results for our adaptive IT for conditional moment restrictions. Appendix F provides an upper bound for quadratic distance estimation, which is essential for our minimax testing results; it also contains further technical lemmas. Finally, Appendix G gathers an exponential inequality for U-statistics.
C. Empirical Results based on Bootstrap Critical Values
In this section, we present robustness checks of our empirical findings when the critical values of our adaptive test are computed using the bootstrap. For the bootstrap critical values $\widehat\eta_J(\alpha)$, the i.i.d. bootstrap weights are generated according to $\omega\sim N(0,1)$.

Income    H0: h1 is decreasing in P^o             H0: h1 is increasing in P^o
groups    W_J     p val.   reject H0?   J         W_J     p val.   reject H0?   J
low       0.497   0.094    no      { }            { }     { , , }
high      1.471   0.002    yes     { , , }        { , , , }

Table 7: Adaptive bootstrap testing for monotonic demand for organic products.
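The multiplier-bootstrap construction behind critical values such as $\widehat\eta_J(\alpha)$ can be sketched computationally. The snippet below is a minimal illustration under simplifying assumptions: the statistic is a plain normalized quadratic form in an instrument basis, and the function name, its inputs, and the $N(0,1)$ multiplier weights are illustrative rather than the paper's exact formulas.

```python
import numpy as np

def bootstrap_critical_value(residuals, basis, alpha=0.05, n_boot=500, seed=0):
    """(1 - alpha) multiplier-bootstrap critical value for a normalized
    quadratic statistic sum_k (n^{-1/2} sum_i omega_i u_i b_k(W_i))^2,
    with i.i.d. N(0,1) weights omega_i (illustrative simplification)."""
    rng = np.random.default_rng(seed)
    n = len(residuals)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        omega = rng.standard_normal(n)                    # i.i.d. N(0,1) multipliers
        score = (omega * residuals) @ basis / np.sqrt(n)  # K-vector of perturbed sums
        stats[b] = score @ score                          # quadratic form
    return np.quantile(stats, 1 - alpha)
```

Repeating such a bootstrap quantile computation at each sieve dimension in the scanned index set yields data-driven critical values of the kind used in this section.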
Table 7 reports bootstrap adaptive testing results for monotonic multi-product demand in Subsection 5.1.1. It replicates Table 2.

Income    H0: h1 is decreasing in P1              H0: h1 is increasing in P1
groups    W_J     p val.   reject H0?   J         W_J     p val.   reject H0?   J
low       0.408   0.159    no      { }            .611    0.000    yes     { }
high      15.256  0.000    yes     { , }          .894    0.000    yes     { }

Table 8: Adaptive bootstrap testing for monotonic gasoline demand.

          H0: h is increasing                     H0: h is decreasing
Goods     W_J     p value   reject H0?   J        W_J      p value   reject H0?   J
“food in” 1.449   0.003     yes     { }           -0.262   0.947     no      { }
“fuel”    3.936   0.000     yes     { }           { }
“travel”  2.079   0.001     yes     { }           { }
“leisure” 0.241   0.152     no      { }           { , }

Table 9: Adaptive bootstrap testing for monotonicity of Engel curves.
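The p-values reported in Tables 7–10 compare an observed scan statistic against its bootstrap distribution. In its simplest form the computation reduces to the sketch below, where the array layout of the bootstrap draws (replications by scanned dimensions) is an assumed convention, not the paper's notation.

```python
import numpy as np

def scan_pvalue(observed_max, boot_draws):
    """Bootstrap p-value of a scan statistic: the share of bootstrap
    replications whose max over the scanned dimensions exceeds the
    observed maximum.  boot_draws has shape (n_boot, |I_n|)."""
    boot_max = np.max(np.asarray(boot_draws), axis=1)  # max over J per replication
    return float(np.mean(boot_max >= observed_max))
```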
Table 8 reports bootstrap adaptive testing results for monotonic gasoline demand in Subsection 5.1.2. It replicates Table 3. Table 9 reports bootstrap adaptive testing results for monotonic Engel curves in Subsection 5.1.3. It replicates Table 4.

          H0: h is linear                         H0: h is quadratic
Goods     W_J     p value   reject H0?   J        W_J      p value   reject H0?   J
“food in” -0.177  0.785     no      { }           { }
“fuel”    1.141   0.009     yes     { }           -0.078   0.502     no      { }
“travel”  1.260   0.005     yes     { }           -0.098   0.539     no      { }
“leisure” 0.482   0.077     no      { }           { }

Table 10: Adaptive bootstrap testing for linear/quadratic Engel curves.
Table 10 reports bootstrap adaptive testing results for either linear or quadratic Engelcurves in Subsection 5.1.3. It replicates Table 5.
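The decision rule underlying the adaptive tests in these robustness checks can be summarized computationally: scan a geometrically growing set of sieve dimensions and reject when the maximal studentized statistic exceeds the adjusted critical value. The sketch below is illustrative only; the function names, the dyadic grid, and the use of the deterministic threshold $\sqrt2$ are simplifying assumptions.

```python
import numpy as np

def index_set(J_max):
    """Exponential scan: candidate sieve dimensions growing geometrically,
    here a dyadic grid {1, 2, 4, ...} capped at J_max (illustrative)."""
    out, J = [], 1
    while J <= J_max:
        out.append(J)
        J *= 2
    return out

def adaptive_st(D_hat, v_hat, eta, n, threshold=np.sqrt(2)):
    """Adaptive structure-space test: reject when the maximal studentized
    statistic n * D_hat_J / (eta_J * v_hat_J) over the scanned dimensions
    exceeds the adjusted critical value.  Returns (decision, argmax index)."""
    t = n * np.asarray(D_hat, float) / (np.asarray(eta, float) * np.asarray(v_hat, float))
    j = int(np.argmax(t))
    return int(t[j] > threshold), j
```

The returned argmax index plays the role of the data-driven dimension reported as $\widehat J$ in the tables above.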
D. Proofs of Adaptive ST Results Under Composite Null Hypothesis in Subsection 3.4
Recall the definition of the deterministic index set $\mathcal I_n$ in (B.1), satisfying $\widehat{\mathcal I}_n\subset\mathcal I_n$ with probability approaching one. Also recall the notation $\widetilde b^K(\cdot)=G_b^{-1/2}b^K(\cdot)$. Below, we denote by $C>0$ a generic constant, and we write $v_J$ instead of $v_J(h^r)$ and $\widehat\eta_J(\alpha)$ instead of $\widehat\eta_J$. Moreover, $\eta_J$ denotes a deterministic sequence satisfying $\eta_J=O(\sqrt{\log\log n})$ and $(\log\log J)^c\leq\eta_J$ for some constant $c>0$ and all $\underline J\leq J\leq\overline J$. Proof of Theorem 3.4.
For the proof of the result, we proceed in three steps. First, we bound the type I error of the test statistic
$$\widetilde{ST}_n=\mathbb 1\Big\{\max_{J\in\mathcal I_n}\big|n\widehat D_J(\widehat h^r_J)/(\eta_J\widehat v_J(\widehat h^r_J))\big|>\sqrt2\Big\}.$$
Second, we bound the type II error of $\widetilde{ST}_n$. Third, we show that these steps are sufficient to control the type I and type II errors of our test statistic $\widehat{ST}_n$. Step 1:
We control the type I error of the test statistic $\widetilde{ST}_n$ by making use of the decomposition, for all $h\in\mathcal H_0$:
$$P_h\big(\widetilde{ST}_n=1\big)\leq\underbrace{P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h^r)}{\eta_J\widehat v_J(h^r)}>\frac{\sqrt2}3\Big)}_{I}+\underbrace{P_h\Big(n\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}\big|\widehat D_J(h)-\widehat D_J(h^r)\big|/(\eta_J\widehat v_J(h^r))>\frac{\sqrt2}3\Big)}_{II}$$
$$+\underbrace{P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{n\widehat D_J(\widehat h^r_J)}{\eta_J\widehat v_J(h^r)}\Big|\max_{J\in\mathcal I_n}\Big|1-\frac{\widehat v_J(h^r)}{\widehat v_J(\widehat h^r_J)}\Big|>\frac{\sqrt2}3\Big)}_{III}.$$
We have $I=o(1)$ due to the proof of Theorem 3.1. Consider $II$. We may assume that $\widehat h^r_J\in\mathcal H^r_J(\Delta_{J,n})$ due to Assumption 6(i). The definition of the estimator $\widehat D_J(h)$ implies for all $J\in\mathcal I_n$ and $h\in\mathcal H^r_J(\Delta_{J,n})$ that
$$\frac{n}{\eta_Jv_J}\big(\widehat D_J(h)-\widehat D_J(h^r)\big)=\frac1{\eta_Jv_J(n-1)}\sum_{i\ne i'}(h-h^r)(X_i)(h-h^r)(X_{i'})\,b^K(W_i)'A'GA\,b^K(W_{i'})$$
$$+\frac1{\eta_Jv_J(n-1)}\sum_{i\ne i'}(h-h^r)(X_i)(h-h^r)(X_{i'})\,b^K(W_i)'\big(A'GA-\widehat A'G\widehat A\big)b^K(W_{i'})$$
$$+\frac2{\eta_Jv_J(n-1)}\sum_{i\ne i'}\big(Y_i-h^r(X_i)\big)(h-h^r)(X_{i'})\,b^K(W_i)'A'GA\,b^K(W_{i'})$$
$$+\frac2{\eta_Jv_J(n-1)}\sum_{i\ne i'}\big(Y_i-h^r(X_i)\big)(h-h^r)(X_{i'})\,b^K(W_i)'\big(A'GA-\widehat A'G\widehat A\big)b^K(W_{i'})$$
$$=T_{1,J}(h)+T_{2,J}(h)+T_{3,J}(h)+T_{4,J}(h).$$
Consider $T_{1,J}(h)$.
We obtain
$$|T_{1,J}(h)|\leq\frac1{\eta_Jv_J}\Big\|n^{-1/2}\sum_{i=1}^n(h-h^r)(X_i)\,b^K(W_i)'A'G^{1/2}\Big\|^2+\frac1{n\eta_Jv_J}\sum_{i=1}^n\big\|(h-h^r)(X_i)\,b^K(W_i)'A'G^{1/2}\big\|^2=T_{11,J}(h)+T_{12,J}(h).$$
Consider $T_{11,J}(h)$. We obtain
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)\leq P_h\Big(\exists J\in\mathcal I_n:\ \sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big((h-h^r)(X_i)b^K(W_i)'-\mathrm E\big[(h-h^r)(X)b^K(W)'\big]\Big)A'G^{1/2}\Big\|$$
$$>\sqrt{\eta_Jv_J/3}-\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\big\|G^{1/2}A\,\mathrm E\big[(h-h^r)(X)b^K(W)\big]\big\|\Big)$$
$$\leq P_h\Big(\exists J\in\mathcal I_n:\ \sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big((h-h^r)(X_i)b^K(W_i)'-\mathrm E\big[(h-h^r)(X)b^K(W)'\big]\Big)A'G^{1/2}\Big\|>(1-c_1)\sqrt{\eta_Jv_J/3}\Big),$$
where the second inequality is due to Lemma F.3, i.e., that uniformly in $h\in\mathcal H^r_J$:
$$\big\|G^{1/2}A\,\mathrm E\big[(h-h^r)(X)b^K(W)\big]\big\|=\|Q_J(h-h^r)\|_\mu\leq C\big(\Delta_{J,n}+J^{-p/d}\big),$$
and, for $n$ sufficiently large, $C(\Delta_{J,n}+J^{-p/d})\leq c_1\sqrt{\eta_Jv_J/3}$ for some $0<c_1<1$, uniformly over $J\in\mathcal I_n$. Let $s_j^{-1}$, $1\leq j\leq J$, be the nondecreasing singular values of $G^{1/2}AG_b^{1/2}$. Further, let $e_j$ be the unit vector with $1$ at the $j$-th position. We bound for all $h\in\mathcal H^r_J(\Delta_{J,n})$ and all $j\geq1$:
$$\mathrm E\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\big|(h-h^r)(X)\,\widetilde b^K(W)'\mathrm{diag}(s_1^{-1},\dots,s_J^{-1})e_j\big|^2\leq\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\|h-h^r\|_\infty^2\,\mathrm E\big|\widetilde b^K(W)'\mathrm{diag}(s_1^{-1},\dots,s_J^{-1})e_j\big|^2\leq\Delta_{J,n}^2\big\|\mathrm{diag}(s_1^{-1},\dots,s_J^{-1})e_j\big\|^2\leq s_j^{-2}\Delta_{J,n}^2$$
by the definition of $\mathcal H^r_J(\Delta_{J,n})$. Consequently, Markov's inequality together with the proof of Chen and Pouzo [2012, Lemma C.1] yields
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)\leq\sum_{J\in\mathcal I_n}P_h\Big(\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)$$
$$\leq\sum_{J\in\mathcal I_n}\frac3{(1-c_1)^2\eta_Jv_J}\,\mathrm E\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big((h-h^r)(X_i)b^K(W_i)'-\mathrm E\big[(h-h^r)(X)b^K(W)'\big]\Big)A'G^{1/2}\Big\|^2$$
$$\leq\frac C{(1-c_1)^2}\sum_{J\in\mathcal I_n}\frac1{\eta_Jv_J}\Big(\sum_{j=1}^Js_j^{-2}\Big)\Delta_{J,n}^2\bigg(\int_0^1\sqrt{\log N_{[\,]}(tC,\mathcal H^r_J,\|\cdot\|_\mu)}\,dt\bigg)^2\leq\frac C{\sigma_0^2(1-c_1)^2}\sum_{J\in\mathcal I_n}\frac{\Delta_{J,n}^2}{\eta_J}\,C_{J,n}$$
for $n$ sufficiently large, using $\sum_{j=1}^Js_j^{-2}\leq\sigma_0^{-2}\sqrt J\,v_J$ by the Cauchy–Schwarz inequality and Lemma F.2, where $C_{J,n}$ is as in Assumption 6(i). Consequently, the rate condition imposed in Assumption 6(i) implies
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)=o(1).\qquad(\text{D.1})$$
Consider $T_{12,J}(h)$. Using the notation $\widetilde b^K(\cdot)=G_b^{-1/2}b^K(\cdot)$, we obtain
$$\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{12,J}(h)\leq\max_{J\in\mathcal I_n}\Big(\sup_{h\in\mathcal H^r_J}\|h-h^r\|_\infty^2\ \sup_w\|\widetilde b^K(w)\|^2\ \big\|(G_b^{-1/2}SG^{-1/2})^-_l\big\|^2\Big)\Big/(\eta_Jv_J)\leq\max_{J\in\mathcal I_n}\frac{\big(\Delta_{J,n}\zeta_Js_J^{-1}\big)^2}{\eta_Jv_J}=o(1).$$
Consider $T_{2,J}(h)$.
For all $J\in\mathcal I_n$ and $h\in\mathcal H^r_J$ we evaluate
$$|T_{2,J}(h)|\leq\frac{2n}{\eta_Jv_J}\Big\|\frac1n\sum_{i=1}^n\Big((h-h^r)(X_i)\widetilde b^K(W_i)-\mathrm E\big[(h-h^r)(X)\widetilde b^K(W)\big]\Big)\Big\|^2\big\|G^{1/2}(\widehat A-A)G_b^{1/2}\big\|^2+\frac{2n}{\eta_Jv_J}\Big\|G^{1/2}(\widehat A-A)G_b^{1/2}\,\mathrm E\big[(h-h^r)(X)\widetilde b^K(W)\big]\Big\|^2$$
$$=2T_{21,J}(h)+2T_{22,J}(h).$$
Consequently, using Chen and Christensen [2018, Lemma F.10(b)] (with $G_\psi$ replaced by $G$), i.e.,
$$\big\|G^{1/2}(\widehat A-A)G_b^{1/2}\big\|=O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big),$$
we obtain
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{21,J}(h)>\frac13\Big)\leq\sum_{J\in\mathcal I_n}P_h\Big(\sup_{h\in\mathcal H^r_J}T_{21,J}(h)>\frac13\Big)\leq C\sum_{J\in\mathcal I_n}\bigg(\int_0^1\sqrt{\log N_{[\,]}(wC,\mathcal H^r_J,\|\cdot\|_\mu)}\,dw\bigg)^2\frac{K(J)}{\eta_Jv_J}\Delta_{J,n}^2\,n^{-1}s_J^{-2}\zeta_J^2\log J$$
$$\leq Cn^{-1}s_{\overline J}^{-2}\zeta^2(\log n)\sum_{J\in\mathcal I_n}\frac{C_{J,n}\Delta_{J,n}^2}{\eta_J},$$
where $s_{\overline J}^{-1}\zeta\sqrt{(\log n)/n}=o(1)$, and consequently
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{21,J}(h)>\frac13\Big)=o\Big(\sum_{J\in\mathcal I_n}\frac{C_{J,n}\Delta_{J,n}^2}{\eta_J}\Big)=o(1),$$
where the last equality is due to Assumption 6(i). Consider $T_{22,J}(h)$. We make use of the inequality
$$T_{22,J}(h)\leq\frac n{\eta_Jv_J}\big\|\Pi_KT(h-h^r)\big\|^2_{L^2(W)}\big\|G^{1/2}(\widehat A-A)G_b^{1/2}\big\|^2.$$
Now Assumption 3 implies $\|\Pi_KT(h-h^r)\|_{L^2(W)}=O(s_J\|h-h^r\|_\mu)$ and thus we obtain
$$\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{22,J}(h)=O_p\Big(\max_{J\in\mathcal I_n}\frac{\Delta_{J,n}^2\zeta_J^2\log J}{\eta_Jv_J}\Big)=o_p(1),$$
by Assumption 6(i). The bounds for $T_{3,J}(h)$ and $T_{4,J}(h)$ follow analogously. Consider $III$. We make use of the inequality
$$III\leq P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{n\widehat D_J(\widehat h^r_J)}{\eta_J\widehat v_J(h^r)}\Big|>1\Big)+P_h\Big(\max_{J\in\mathcal I_n}\Big|1-\frac{\widehat v_J(h^r)}{\widehat v_J(\widehat h^r_J)}\Big|>\frac{\sqrt2}3\Big),$$
where the first term on the right-hand side tends to zero, which follows immediately from the upper bounds derived for $I$ and $II$. Consequently, from Lemma F.7 we infer $III=o(1)$. Step 2:
We control the type II error of the test statistic $\widetilde{ST}_n$, where $\eta_J'$ is replaced by $\eta_J''>0$. Let $J^*$ be as in the proof of Theorem 3.1. We then obtain uniformly over $h\in\mathcal H_1(\delta^\circ,r_n)$ that
$$P_h\big(\widetilde{ST}_n=0\big)\leq P_h\big(n\widehat D_{J^*}(\widehat h^r_{J^*})\leq\eta''_{J^*}v_{J^*}\big)$$
$$\leq P_h\Big(\big|\|\mathrm E_h[V_{J^*}]\|^2-\widehat D_{J^*}(h^r)\big|>\frac{\|\mathrm E_h[V_{J^*}]\|^2}2-\frac{\eta''_{J^*}v_{J^*}}n\Big)+P_h\Big(\big|\widehat D_{J^*}(\widehat h^r_{J^*})-\widehat D_{J^*}(h^r)\big|>\frac{\|\mathrm E_h[V_{J^*}]\|^2}2-\frac{\eta''_{J^*}v_{J^*}}n\Big),$$
where the first summand on the right-hand side tends to zero following the proof of Theorem 3.1. The second summand tends to zero analogously to the previous step of this proof. Step 3:
Finally, we account for estimation of the normalization factor $v_J$. We control the type I error of the test $\widehat{ST}_n$ as follows, using for any $\widetilde J\in\mathcal I_n$ and $h\in\mathcal H_0$, by making use of Assumption 6(iii),
$$P_h\big(\widehat{ST}_n=1\big)\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J(\widehat h^r_J)|/(\widehat\eta_Jv_J)\big\}>1,\ \widehat\eta_J\geq C\eta_J\text{ for all }J\in\mathcal I_n\Big)+P_h\big(\widehat\eta_J<C\eta_J\text{ for some }J\in\mathcal I_n\big)+o(1)$$
$$\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J(\widehat h^r_J)|/(C\eta_Jv_J)\big\}>1\Big)+P_h\big(\widehat\eta_{\widetilde J}<C\eta_{\widetilde J}\big)+o(1)=o(1),$$
where the last equality is due to Step 1. To bound the type II error of the test $\widehat{ST}_n$, recall the definition of $J^*\in\mathcal I_n$ introduced in Step 2. We evaluate for all $h\in\mathcal H_1(\delta^\circ,r_n)$, by making use of Assumption 6(iii), that
$$P_h\big(\widehat{ST}_n=0\big)\leq P_h\big(|n\widehat D_{J^*}(\widehat h^r_{J^*})|\leq\widehat\eta_{J^*}v_{J^*},\ \widehat\eta_{J^*}\leq c_2\eta_{J^*}\big)+P_h\big(\widehat\eta_{J^*}>c_2\eta_{J^*}\big)=P_h\big(|n\widehat D_{J^*}(\widehat h^r_{J^*})|\leq c_2\eta_{J^*}v_{J^*}\big)+o(1)=o(1),$$
where the last equality is due to Step 2. Proof of Corollary 3.3.
Proof of (3.12). We observe
$$\limsup_{n\to\infty}\sup_{h\in\mathcal H_0}P_h\big(h\notin\mathcal C_n(\alpha)\big)=\limsup_{n\to\infty}\sup_{h\in\mathcal H_0}P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h)}{\widehat\eta_J\widehat v_J(h)}>\sqrt2\Big)\leq\alpha,$$
where the last inequality is due to Step 1 of the proof of Theorem 3.3 and Step 3 of the proof of Theorem 3.4.

Proof of (3.13). Let $J^*$ be as in Step 2 of the proof of Theorem 3.3. We observe for all $h\in\mathcal H_1(\delta^\circ,r_n)$ that
$$P_h\big(h\notin\mathcal C_n(\alpha)\big)=P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h)}{\widehat\eta_J\widehat v_J(h)}>\sqrt2\Big)=1-P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h)}{\widehat\eta_J\widehat v_J(h)}\leq\sqrt2\Big)\geq1-P_h\Big(\frac{n\widehat D_{J^*}(h)}{\widehat\eta_{J^*}\widehat v_{J^*}(h)}\leq\sqrt2\Big)\geq1-\alpha,$$
for $n$ sufficiently large, where the last inequality is due to Step 2 of the proof of Theorem 3.3 and Step 3 of the proof of Theorem 3.4.

Proof of Corollary 3.4. For any $h_0\in\mathcal H_0$, we analyze the diameter of the confidence set $\mathcal C_n(\alpha)$ under $P_{h_0}$. For all $h\in\mathcal C_n(\alpha)\subset\mathcal H$ it holds, for all $J\in\mathcal I_n$, by using the definition of the projection $Q_J$:
$$\|h-h_0\|_\mu\leq\|Q_J\Pi_J(h-h_0)\|_\mu+\|\Pi_Jh-h\|_\mu+\|\Pi_Jh_0-h_0\|_\mu\leq\|Q_J(h-h_0)\|_\mu+O\big(J^{-p/d_x}\big),$$
where the second inequality is due to the triangle inequality and Assumptions 2(ii) and 3(ii). The upper bound established in (F.4) yields
$$\Big|\|Q_J(h-h_0)\|_\mu^2-\widehat D_J(h)\Big|\leq C\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big)$$
with probability approaching one. Consequently, the definition of the confidence set $\mathcal C_n(\alpha)$, with $h\in\mathcal C_n(\alpha)$, gives
$$\|Q_J(h-h_0)\|_\mu^2\leq\widehat D_J(h)+C\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big)$$
$$\leq\sqrt2\,n^{-1}\widehat\eta_J\widehat v_J+C\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big)\leq C\Big(n^{-1}\eta_Jv_J+n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|\Big)$$
with probability approaching one. We may choose $J=J_0\in\mathcal I_n$ for $n$ sufficiently large and hence the result follows by applying Lemma F.1 and Assumption 5(ii), i.e., $\eta_{J_0}=O(\sqrt{\log\log n})$.

E. Proofs of Adaptive IT Results in Subsection 4.2
For the adaptive IT results, we may consider the restriction to the index set $\mathcal I_n$ given in (B.1), but where the upper bound $\overline J$ is replaced by $\overline K$. As in Appendix B, this implies that $\widehat\eta_K(\alpha)$ is deterministic; it is denoted by $\eta_K$ in the following.
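The object analyzed below, the leave-one-out statistic $\widehat D_K$ built from residuals $\rho(Y_i,h^r(X_i))$ and the instrument sieve $b^K$ weighted by $(B'B/n)^-$, can be computed as in the following sketch. The implementation via a Cholesky orthonormalization of the sample Gram matrix is an illustrative choice, not the paper's exact algorithm.

```python
import numpy as np

def it_statistic(rho, B):
    """Leave-one-out (off-diagonal) estimate of ||Pi_K m(., h^r)||^2:
    D_K = (n(n-1))^{-1} sum_{i != i'} rho_i rho_{i'} b^K(W_i)' (B'B/n)^{-1} b^K(W_{i'}),
    implemented by orthonormalizing the basis with a Cholesky factor of the
    sample Gram matrix (illustrative simplification)."""
    n = B.shape[0]
    G = B.T @ B / n                          # sample Gram matrix of b^K(W)
    L = np.linalg.cholesky(G)                # G = L L'
    Bt = B @ np.linalg.inv(L).T              # rows: orthonormalized basis values
    s = rho @ Bt                             # K-vector of residual-weighted sums
    full = s @ s                             # includes the diagonal terms i = i'
    diag = np.sum((rho[:, None] * Bt) ** 2)  # remove the i = i' terms
    return (full - diag) / (n * (n - 1))
```

Under the null the conditional moment $m(\cdot,h^r)$ is zero and the statistic is centered at zero; large positive values signal a violation of the restriction.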
The proof is based on two steps, where we bound the type I and the type II errors of $\widehat{IT}_n$ separately.

Step 1: The definition of the test statistic $\widehat{IT}_n$ implies for all $h\in\mathcal H_0$:
$$P_h\big(\widehat{IT}_n=1\big)\leq\underbrace{P_h\Big(\max_{K\in\mathcal I_n}\frac{n\widehat D_K(h^r)}{\eta_K\widehat v_K(h^r)}>\frac{\sqrt2}3\Big)}_{I}+\underbrace{P_h\Big(n\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}\big|\widehat D_K(h)-\widehat D_K(h^r)\big|/(\eta_K\widehat v_K(h^r))>\frac{\sqrt2}3\Big)}_{II}$$
$$+\underbrace{P_h\Big(\max_{K\in\mathcal I_n}\Big|\frac{n\widehat D_K(\widehat h^r)}{\eta_K\widehat v_K(h^r)}\Big|\max_{K\in\mathcal I_n}\Big|1-\frac{\widehat v_K(h^r)}{\widehat v_K(\widehat h^r)}\Big|>\frac{\sqrt2}3\Big)}_{III}.$$
We obtain $I=o(1)$ following the proof of Theorem 3.1 (with $U_i$ replaced by $\rho(Y_i,h^r(X_i))$ and $\widehat A'G\widehat A$ replaced by $(B'B/n)^-$). Consider $II$. The definition of the estimator $\widehat D_K$ implies for all $K\in\mathcal I_n$ and $h\in\mathcal H^r$ that
$$\frac n{\eta_Kv_K}\big(\widehat D_K(h)-\widehat D_K(h^r)\big)=\frac1{\eta_Kv_K(n-1)}\sum_{i\ne i'}\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\big(\rho(Y_{i'},h(X_{i'}))-\rho(Y_{i'},h^r(X_{i'}))\big)\widetilde b^K(W_i)'\widetilde b^K(W_{i'})$$
$$+\frac2{\eta_Kv_K(n-1)}\sum_{i\ne i'}\rho(Y_i,h^r(X_i))\big(\rho(Y_{i'},h(X_{i'}))-\rho(Y_{i'},h^r(X_{i'}))\big)\widetilde b^K(W_i)'\widetilde b^K(W_{i'})=T_{1,K}(h)+T_{2,K}(h).$$
Consider $T_{1,K}(h)$. We obtain
$$|T_{1,K}(h)|\leq\frac1{\eta_Kv_K}\Big\|n^{-1/2}\sum_{i=1}^n\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\widetilde b^K(W_i)\Big\|^2+\frac1{n\eta_Kv_K}\sum_{i=1}^n\Big\|\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\widetilde b^K(W_i)\Big\|^2=T_{11,K}(h)+T_{12,K}(h).$$
Consider $T_{11,K}(h)$. We obtain
$$P_h\Big(\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}T_{11,K}(h)>\frac13\Big)\leq P_h\Big(\exists K\in\mathcal I_n:\ \sup_{h\in\mathcal H^r}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big(\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\widetilde b^K(W_i)-\mathrm E\big[\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b^K(W)\big]\Big)\Big\|$$
$$>\sqrt{\eta_Kv_K/3}-\sup_{h\in\mathcal H^r}\Big\|\mathrm E\big[\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b^K(W)\big]\Big\|\Big).$$
We evaluate, by using Assumption 3(ii), that uniformly in $h\in\mathcal H^r$:
$$\Big\|\mathrm E\big[\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b^K(W)\big]\Big\|\leq C\|h-h^r\|_\mu\leq C\Delta_n.$$
Further, for all $K\in\mathcal I_n$ and $1\leq k\leq K$ we have
$$\mathrm E\sup_{h\in\mathcal H^r}\big|\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b_k(W)\big|^2\leq C\Delta_n^{2\kappa}$$
by Assumption 6(i). Consequently, following the derivation of the upper bound (D.1), we obtain
$$P_h\Big(\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}T_{11,K}(h)>\frac13\Big)\leq C\max_{K\in\mathcal I_n}\frac{K\Delta_n^{2\kappa}}{\eta_Kv_K}\bigg(\int_0^1\sqrt{\log N_{[\,]}(wC,\mathcal H^r,\|\cdot\|_\infty)}\,dw\bigg)^2\leq\frac C{(1-c_1)^2}\max_{K\in\mathcal I_n}\frac{K\Delta_n^{2\kappa}}{\eta_Kv_K}=o(1),$$
where the second inequality is based on the inequality $\Delta_n^2\leq c_1\eta_Kv_K$ for some $0<c_1<1$, uniformly over $K\in\mathcal I_n$, as $n$ becomes sufficiently large. Consider $T_{12,K}(h)$. We obtain
$$\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}T_{12,K}(h)\leq\max_{K\in\mathcal I_n}\frac{(\Delta_n\zeta_K)^2}{\eta_Kv_K}=o(1).$$
The bound for $T_{2,K}(h)$ follows analogously. Step 2:
It is sufficient to control the type II error of the statistic
$$\widetilde{IT}_n=\mathbb 1\Big\{\max_{K\in\mathcal I_n}\big|n\widehat D_K(\widehat h^r)/(\eta''_Kv_K)\big|>1\Big\}$$
for some $\eta''_K>0$. Let $K^*$ denote the largest integer such that $n^{-1}\sqrt{K^*}\leq r_n^2$. We evaluate for all $h\in\mathcal M_1(\delta^\circ,r_n)$ that
$$P_h\big(\widetilde{IT}_n=0\big)=P_h\big(n\widehat D_K(\widehat h^r)\leq\eta''_Kv_K\ \text{for all }K\in\mathcal I_n\big)\leq P_h\big(n\widehat D_{K^*}(\widehat h^r)\leq\eta''_{K^*}v_{K^*}\big).$$
We make use of the notation $B_K=\|\mathrm E_h[V_K]\|^2-\|m(\cdot,h^r)\|^2_{L^2(W)}$, where we write $V_K=\rho(Y,h^r(X))G_b^{-1/2}b^K(W)$. We obtain uniformly over $h\in\mathcal M_1(\delta^\circ,r_n)$ that
$$P_h\big(n\widehat D_{K^*}(\widehat h^r)\leq\eta''_{K^*}v_{K^*}\big)=P_h\Big(\|\mathrm E_h[V_{K^*}]\|^2-\widehat D_{K^*}(\widehat h^r)>\|\mathrm E_h[V_{K^*}]\|^2-\frac{\eta''_{K^*}v_{K^*}}n\Big)$$
$$\leq\underbrace{P_h\bigg(\Big|\frac1{n(n-1)}\sum_{k=1}^{K^*}\sum_{i\ne i'}\big(V_{ik}V_{i'k}-\mathrm E_h[V_k]^2\big)\Big|>\frac13\|m(\cdot,h^r)\|^2_{L^2(W)}-\frac{\eta''_{K^*}v_{K^*}}n+B_{K^*}\bigg)}_{IV}$$
$$+\underbrace{P_h\bigg(\Big|\frac1n\sum_{i=1}^n\big(m^2(W_i,h^r)-\mathrm E[m^2(W,h^r)]\big)\Big|>\frac13\|m(\cdot,h^r)\|^2_{L^2(W)}-\frac{\eta''_{K^*}v_{K^*}}n+B_{K^*}\bigg)}_{V}$$
$$+\underbrace{P_h\bigg(\big|\widehat D_{K^*}(\widehat h^r)-\widehat D_{K^*}(h^r)\big|>\frac13\|m(\cdot,h^r)\|^2_{L^2(W)}-\frac{\eta''_{K^*}v_{K^*}}n+B_{K^*}\bigg)}_{VI},$$
using $n/(n-1)\leq2$ for all $n\geq2$. Consider $IV$. We observe $|B_{K^*}|\leq C_Br_n^2$, which is due to the upper bound
$$\|\mathrm E_h[V_{K^*}]\|=\big(\mathrm E_h[\rho(Y,h^r(X))b^{K^*}(W)']\,G_b^{-1}\,\mathrm E_h[\rho(Y,h^r(X))b^{K^*}(W)]\big)^{1/2}=\|\Pi_{K^*}m(\cdot,h^r)\|_{L^2(W)}.$$
Further, from the proof of Lemma F.9 we deduce
$$\mathrm E_h\Big|\frac1{n(n-1)}\sum_{k=1}^{K^*}\sum_{i\ne i'}\big(V_{ik}V_{i'k}-\mathrm E_h[V_k]^2\big)\Big|^2\leq\frac Cn\Big(\|\Pi_{K^*}m(\cdot,h^r)\|^2_{L^2(W)}+\frac{v_{K^*}^2}n\Big).$$
Consequently, Markov's inequality yields
$$IV=O\bigg(\frac{n^{-1}\|\Pi_{K^*}m(\cdot,h^r)\|^2_{L^2(W)}+n^{-2}v_{K^*}^2}{\big(\|m(\cdot,h^r)\|^2_{L^2(W)}-\eta''_{K^*}n^{-1}v_{K^*}+B_{K^*}\big)^2}\bigg).\qquad(\text{E.1})$$
In the following, we distinguish between two cases. First, consider the case where $n^{-2}v_{K^*}^2$ dominates the numerator. For any $h\in\mathcal M_1(\delta^\circ,r_n)$ we have $\|m(\cdot,h^r)\|^2_{L^2(W)}\geq\delta^\circ r_n^2$ and hence, using that $v_{K^*}\geq c_0\sqrt{K^*}$ for some constant $0<c_0<1$, we obtain the lower bound
$$\|m(\cdot,h^r)\|^2_{L^2(W)}-\eta''_{K^*}n^{-1}v_{K^*}+B_{K^*}\geq(\delta^\circ-c_0-C_B)r_n^2\geq C_1r_n^2$$
for the constant $C_1:=\delta^\circ-c_0-C_B$, which is positive for any $\delta^\circ>c_0+C_B$. From inequality (E.1) we infer $IV=O\big(r_n^{-4}n^{-2}v_{K^*}^2\big)=o(1)$. Second, consider the case where $n^{-1}\|\Pi_{K^*}m(\cdot,h^r)\|^2_{L^2(W)}$ dominates. For any $h\in\mathcal M_1(\delta^\circ,r_n)$ we have $\|m(\cdot,h^r)\|^2_{L^2(W)}\geq\delta^\circ r_n^2>\delta^\circ n^{-1}\sqrt{K^*}$ and hence obtain the lower bound
$$\|m(\cdot,h^r)\|^2_{L^2(W)}-\eta''_{K^*}n^{-1}v_{K^*}+B_{K^*}\geq\Big(1-\frac{1+C_B}{\delta^\circ}\Big)\|m(\cdot,h^r)\|^2_{L^2(W)}\geq c_2\|m(\cdot,h^r)\|^2_{L^2(W)}$$
for the constant $c_2:=1-(1+C_B)/\delta^\circ$, which is positive for any $\delta^\circ>1+C_B$. Hence, inequality (E.1) yields for all $h\in\mathcal M_1(\delta^\circ,r_n)$ that
$$IV=O\big(n^{-1}\|m(\cdot,h^r)\|^{-2}_{L^2(W)}\big)=O\big(n^{-1}r_n^{-2}\big)=o(1).$$
Finally, $V=o(1)$ by making use of Lemma F.4 (with $U_i$ replaced by $\rho(Y_i,h^r(X_i))$ and $\widehat A'G\widehat A$ replaced by $(B'B/n)^-$), and $VI=o(1)$ by following Step 1.

F. Technical Results
Theorem F.1.
Let Assumptions 1–3 be satisfied. Then, it holds that
$$\big|\widehat D_J-\|h-h_0\|_\mu^2\big|=O_p\Big(n^{-1}s_J^{-2}\sqrt J+n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+J^{-2p/d_x}\Big).$$
Proof.
We make use of the decomposition
$$\widehat D_J-\|h-h_0\|_\mu^2=\widehat D_J-\|Q_J(h-h_0)\|_\mu^2+\|Q_J(h-h_0)\|_\mu^2-\|h-h_0\|_\mu^2.$$
Note that
$$\|Q_J(h-h_0)\|_\mu^2=\int\big(\psi^J(x)'A\,\mathrm E_h[(Y-h_0(X))b^K(W)]\big)^2\mu(x)dx=\mathrm E_h[(Y-h_0(X))b^K(W)]'A'\underbrace{\int\psi^J(x)\psi^J(x)'\mu(x)dx}_{=G}A\,\mathrm E_h[(Y-h_0(X))b^K(W)]$$
$$=\big\|G^{1/2}A\,\mathrm E_h[(Y-h_0(X))b^K(W)]\big\|^2=\|\mathrm E_h[V_J]\|^2,$$
using the notation $V_{Ji}=(Y_i-h_0(X_i))G^{1/2}Ab^K(W_i)$. Thus, the definition of the estimator $\widehat D_J$ implies
$$\widehat D_J-\|Q_J(h-h_0)\|_\mu^2=\frac1{n(n-1)}\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\qquad(\text{F.1})$$
$$+\frac1{n(n-1)}\sum_{i\ne i'}Y_iY_{i'}b^K(W_i)'\big(A'GA-\widehat A'G\widehat A\big)b^K(W_{i'}),\qquad(\text{F.2})$$
where we bound both summands on the right-hand side separately in the following. Consider the summand in (F.1); we observe
$$\Big|\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\Big|^2=\sum_{j,j'=1}^J\sum_{i\ne i'}\sum_{i''\ne i'''}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big).$$
We distinguish three different cases: first, $i,i',i'',i'''$ are all different; second, either $i=i''$ or $i'=i'''$; third, $i=i''$ and $i'=i'''$. We thus calculate for each $j,j'\geq1$:
$$\sum_{i\ne i'}\sum_{i''\ne i'''}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big)=\sum_{i,i',i'',i'''\text{ all different}}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big)$$
$$+2\sum_{i\ne i'\ne i''}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'j'}-\mathrm E_h[V_{j'}]^2\big)+\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{ij'}V_{i'j'}-\mathrm E_h[V_{j'}]^2\big).$$
Due to independent observations we have
$$\sum_{i,i',i'',i'''\text{ all different}}\mathrm E_h\Big[\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big)\Big]=0.$$
Consequently, we calculate
$$\mathrm E_h\Big|\frac1{n(n-1)}\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\Big|^2=\frac{2(n-2)}{n(n-1)}\underbrace{\sum_{j,j'=1}^J\mathrm E_h[V_j]\,\mathrm E_h[V_{j'}]\,\mathrm{Cov}_h(V_j,V_{j'})}_{=:\,I}+\frac2{n(n-1)}\underbrace{\Big(\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2-\Big(\sum_{j=1}^J\mathrm E_h[V_j]^2\Big)^2\Big)}_{=:\,II}.$$
To bound the summand $I$ we observe that
$$I=\sum_{j,j'=1}^J\mathrm E_h[V_j]\,\mathrm E_h[V_{j'}]\,\mathrm{Cov}_h(V_j,V_{j'})=\mathrm E_h[V_J]'\,\mathrm{Cov}_h(V_J,V_J)\,\mathrm E_h[V_J]$$
$$\leq\lambda_{\max}\big(\mathrm{Var}_h\big((Y-h_0(X))G_b^{-1/2}b^K(W)\big)\big)\big\|G_b^{1/2}A'G^{1/2}\,\mathrm E_h[V_J]\big\|^2\leq\overline\sigma^2\big\|\mathrm E_h[(Y-h_0(X))b^K(W)]'A'GAG_b^{1/2}\big\|^2$$
$$=\overline\sigma^2\Big\|\int Q_J(h-h_0)(x)\,\psi^J(x)'\mu(dx)\,(G_b^{-1/2}S)^-_l\Big\|^2,$$
by using the notation $V_{Ji}=(Y_i-h_0(X_i))G^{1/2}Ab^K(W_i)$, $AG_b^{1/2}=\big((G_b^{-1/2}S)^-_l\big)'$, and Lemma F.8, i.e., $\lambda_{\max}\big(\mathrm{Var}_h\big((Y-h_0(X))\widetilde b^K(W)\big)\big)\leq\overline\sigma^2$. Consider $II$. We observe
$$II=\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2-\Big(\sum_{j=1}^J\mathrm E_h[V_j]^2\Big)^2\leq\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2=v_J^2.$$
The upper bounds derived for the terms $I$ and $II$ imply for all $n\geq2$:
$$\mathrm E_h\Big|\frac1{n(n-1)}\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\Big|^2\leq\overline\sigma^2C\Big(\frac1n\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|^2+\frac{v_J^2}{n^2}\Big).\qquad(\text{F.3})$$
Consequently, equality (F.2) together with Lemma F.4 yields
$$\widehat D_J-\|Q_J(h-h_0)\|_\mu^2=O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big),\qquad(\text{F.4})$$
which implies the variance part by employing Lemma F.1. Finally, Lemma F.3 implies for the bias term
$$\|Q_J(h-h_0)\|_\mu^2-\|h-h_0\|_\mu^2=O\big(J^{-2p/d}\big),$$
which completes the proof. Lemma F.1.
Let Assumption 1(ii) be satisfied. Then, it holds that $v_J\leq\overline\sigma^2s_J^{-2}\sqrt J$.

Proof.

In the following, let $e_j$ be the unit vector with $1$ at the $j$-th position. We obtain
$$v_J^2=\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2=\sum_{j=1}^J\Big\|\mathrm E\big[\underbrace{\mathrm E_h[(Y-h_0(X))^2|W]\,e_j'G^{1/2}Ab^K(W)}_{=:\,\chi_j(W)}\,G^{1/2}Ab^K(W)\big]\Big\|^2$$
$$=\sum_{j=1}^J\big\|G^{1/2}AG_b^{1/2}\,\mathrm E[\chi_j(W)\widetilde b^K(W)]\big\|^2\leq s_J^{-2}\sum_{j=1}^J\big\|\mathrm E[\chi_j(W)\widetilde b^K(W)]\big\|^2,$$
using the relationship $\|G^{1/2}AG_b^{1/2}\|=s_J^{-1}$. For all $j\geq1$ we have $\|\mathrm E[\chi_j(W)\widetilde b^K(W)]\|\leq\|\chi_j\|_{L^2(W)}$. Now using that $\sup_{w\in\mathcal W}\sup_{h\in\mathcal H}\mathrm E_h[(Y-h(X))^2|W=w]\leq\overline\sigma^2$ due to Assumption 1(ii), we get
$$\sum_{j=1}^J\|\chi_j\|^2_{L^2(W)}\leq\overline\sigma^4\sum_{j=1}^J\mathrm E\big|e_j'G^{1/2}Ab^K(W)\big|^2=\overline\sigma^4\sum_{j=1}^Je_j'G^{1/2}AG_bA'G^{1/2}e_j\leq\overline\sigma^4s_J^{-2}J,$$
which yields $v_J^2\leq\overline\sigma^4s_J^{-4}J$ and hence the assertion.
Let Assumption 1 (iii) be satisfied. Then, it holds
\[
\sqrt{\sum_{j=1}^J s_j^{-4}} \le \underline{\sigma}^{-2}\,v_J,
\]
where $s_j^{-1}$, $1 \le j \le J$, are the nondecreasing singular values of $G^{1/2}AG_b^{1/2}$.

Proof. In the following, let $e_j$ be the unit vector with 1 at the $j$-th position. Introduce a unitary matrix $Q$ such that by Schur decomposition
\[
Q'G^{1/2}AG_bA'G^{1/2}Q = \operatorname{diag}\big(s_1^{-2},\ldots,s_J^{-2}\big).
\]
We make use of the notation $\tilde V_{Ji} = (Y_i-h(X_i))Q'G^{1/2}Ab^K(W_i)$. Now since the Frobenius norm is invariant under unitary matrix multiplication we have
\[
v_J^2 = \sum_{j,j'=1}^J\big(\mathbb{E}_h[\tilde V_j\tilde V_{j'}]\big)^2 \ge \sum_{j=1}^J\big(\mathbb{E}_h[\tilde V_j^2]\big)^2
= \sum_{j=1}^J\Big(\mathbb{E}\big|(Y-h(X))\,e_j'Q'G^{1/2}Ab^K(W)\big|^2\Big)^2
\ge \underline{\sigma}^4\sum_{j=1}^J\Big(\mathbb{E}\big[e_j'Q'G^{1/2}Ab^K(W)b^K(W)'A'G^{1/2}Qe_j\big]\Big)^2
\]
\[
= \underline{\sigma}^4\sum_{j=1}^J\big(e_j'Q'G^{1/2}AG_bA'G^{1/2}Qe_j\big)^2
= \underline{\sigma}^4\sum_{j=1}^J\big(e_j'\operatorname{diag}(s_1^{-2},\ldots,s_J^{-2})e_j\big)^2
\ge \underline{\sigma}^4\sum_{j=1}^J s_j^{-4},
\]
using $\inf_{w\in\mathcal W}\inf_{h\in\mathcal H}\mathbb{E}_h[(Y-h(X))^2|W=w] \ge \underline{\sigma}^2$ by Assumption 1 (iii).

Lemma F.3.
Let Assumptions 2 and 3 be satisfied. Then, for all $h\in\mathcal H$ we have
\[
\|Q_J(h-h_0)\|^2_\mu = \|h-h_0\|^2_\mu + O\big(J^{-2p/d}\big).
\]

Proof. Using the notation $\tilde b^K(\cdot) := G_b^{-1/2}b^K(\cdot)$, we observe for all $h\in\mathcal H$ that
\[
\|Q_J(h-h_0)\|_\mu = \big\|(G_b^{-1/2}SG^{-1/2})^-_l\,\mathbb{E}[\tilde b^K(W)(h-h_0)(X)]\big\|
\le \big\|(G_b^{-1/2}SG^{-1/2})^-_l\,\mathbb{E}[\tilde b^K(W)(\Pi_Jh-\Pi_Jh_0)(X)]\big\|
+ \big\|(G_b^{-1/2}SG^{-1/2})^-_l\,\mathbb{E}\big[\tilde b^K(W)\big((h-h_0)(X)-(\Pi_Jh-\Pi_Jh_0)(X)\big)\big]\big\|
\]
\[
\le \|\Pi_Jh-\Pi_Jh_0\|_\mu + s_J^{-1}\big\|\Pi_KT\big((h-h_0)-(\Pi_Jh-\Pi_Jh_0)\big)\big\|_{L^2(W)}
\le \|\Pi_Jh-\Pi_Jh_0\|_\mu + O\big(J^{-p/d}\big)
\]
by making use of Assumption 3 (ii).

Lemma F.4.
Let Assumptions 1–3 be satisfied. Then, uniformly in $h\in\mathcal H$ it holds
\[
\frac{1}{n(n-1)}\sum_{i\ne i'}\big(Y_i-h(X_i)\big)\big(Y_{i'}-h(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})
= O_p\Big(n^{-1}s_J^{-2}\sqrt{J} + n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big).
\]

Proof.
In the proof, we establish an upper bound of
\[
\frac{1}{n^2}\sum_{i,i'}\big(Y_i-h(X_i)\big)\big(Y_{i'}-h(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})
= \mathbb{E}[(h-h_0)(X)b^K(W)]'\big(A'GA-\hat A'G\hat A\big)\mathbb{E}[(h-h_0)(X)b^K(W)]
\]
\[
\quad + 2\Big(\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i)' - \mathbb{E}[(h-h_0)(X)b^K(W)]'\Big)\big(A'GA-\hat A'G\hat A\big)\mathbb{E}[(h-h_0)(X)b^K(W)]
\]
\[
\quad + \Big(\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i)' - \mathbb{E}[(h-h_0)(X)b^K(W)]'\Big)\big(A'GA-\hat A'G\hat A\big)\Big(\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i) - \mathbb{E}[(h-h_0)(X)b^K(W)]\Big)
\]
uniformly in $h\in\mathcal H$. It is sufficient to bound the first summand on the right hand side. We make use of the decomposition
\[
\mathbb{E}[(h-h_0)(X)b^K(W)]'\big(A'GA-\hat A'G\hat A\big)\mathbb{E}[(h-h_0)(X)b^K(W)]
= 2\,\mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(A-\hat A)\,\mathbb{E}[(h-h_0)(X)b^K(W)]
- \mathbb{E}[(h-h_0)(X)b^K(W)]'(A-\hat A)'G(A-\hat A)\,\mathbb{E}[(h-h_0)(X)b^K(W)]
= 2T_1 - T_2,
\]
where we bound each summand separately in what follows. Consider $T_1$. Below, we show the result
\[
T_1 = O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big). \tag{F.5}
\]
To do so, we make use of the decomposition
\[
T_1 = \mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(\hat A-A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
+ \mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(\hat A-A)\,\mathbb{E}\big[\big(h-h_0-\Pi_J(h-h_0)\big)(X)b^K(W)\big]. \tag{F.6}
\]
Consider the first summand on the right hand side of the equation.
Using the definition of the left pseudoinverse we can write $\hat A = (\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}$ where $\hat S = n^{-1}\sum_i b^K(W_i)\psi^J(X_i)'$. Making use of the relation $Q_J\Pi_Jh = \Pi_Jh$ and $\hat SG^{-1}\langle h,\psi^J\rangle_\mu = n^{-1}\sum_i\Pi_Jh(X_i)b^K(W_i)$ yields
\[
\mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(A-\hat A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
= \int Q_J(h-h_0)(x)\Big(\Pi_J(h-h_0)(x) - \psi^J(x)'(\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}\,\mathbb{E}[(h-h_0)(X)b^K(W)]\Big)\mu(x)\,dx
\]
\[
= \langle Q_J(h-h_0),\psi^J\rangle'_\mu(\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}\Big(\frac1n\sum_i\Pi_J(h-h_0)(X_i)b^K(W_i) - \mathbb{E}[(h-h_0)(X)b^K(W)]\Big)
\]
\[
= \langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\Big(\frac1n\sum_i\Pi_J(h-h_0)(X_i)\tilde b^K(W_i) - \mathbb{E}[\Pi_J(h-h_0)(X)\tilde b^K(W)]\Big)
\]
\[
\quad + \langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\,G_b^{-1/2}S'\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big(\frac1n\sum_i\Pi_J(h-h_0)(X_i)\tilde b^K(W_i) - \mathbb{E}[\Pi_J(h-h_0)(X)\tilde b^K(W)]\Big)
= T_{11} + T_{12},
\]
where we used the notation $\tilde b^K(\cdot) = G_b^{-1/2}b^K(\cdot)$. Consider $T_{11}$.
We obtain
\[
\mathbb{E}|T_{11}| \le n^{-1/2}\,\mathbb{E}\Big|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\,\Pi_J(h-h_0)(X)\tilde b^K(W)\Big|
\le n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\|\Pi_KT(h-h_0)\|_{L^2(W)}
\]
\[
\quad + 2n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\big\|\Pi_KT\big(h-h_0-\Pi_J(h-h_0)\big)\big\|_{L^2(W)}
= O\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big),
\]
where the second bound is due to the Cauchy–Schwarz inequality and the third bound is due to Assumption 2. To establish an upper bound for $T_{12}$ we infer from Chen and Christensen [2018, Lemma F.10 (c)] that
\[
|T_{12}| \le \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|
\times \Big\|G_b^{-1/2}S'\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big\|
\times \Big\|\frac1n\sum_i b^K(W_i)\Pi_J(h-h_0)(X_i) - \mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]\Big\|
\]
\[
= \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\times O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big)\times O_p\big(n^{-1/2}\zeta_J\big)
= O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big),
\]
using Assumption 2 (i), i.e., $s_J^{-1}\zeta_J^2\sqrt{(\log J)/n} = O(1)$. Consider the second summand on the right hand side of (F.6).
Following the upper bound of $T_{12}$ we obtain
\[
\Big|\mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(\hat A-A)\,\mathbb{E}\big[\big(h-h_0-\Pi_J(h-h_0)\big)(X)b^K(W)\big]\Big|
\le \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\Big\|G_b^{-1/2}S'\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big\|
\times \big\|\langle T(h-h_0-\Pi_J(h-h_0)),\tilde b^K\rangle_{L^2(W)}\big\|
\]
\[
\le \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\big\|\Pi_KT\big(h-h_0-\Pi_J(h-h_0)\big)\big\|_{L^2(W)}\times O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big)
= O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big),
\]
using that $s_J^{-1}\|T(h-h_0-\Pi_J(h-h_0))\|_{L^2(W)} = O\big(\|h-h_0-\Pi_J(h-h_0)\|_\mu\big)$ by Assumption 3 (ii) and $\zeta_J\sqrt{\log J}\,\|h-\Pi_Jh\|_\mu = O(1)$ by Assumption 2 (ii), which implies the upper bound (F.5).

Consider $T_2$. We make use of the decomposition
\[
T_2 = \mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]'(\hat A-A)'G(\hat A-A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
+ 2\,\mathbb{E}[\Pi_J^\perp(h-h_0)(X)b^K(W)]'(\hat A-A)'G(\hat A-A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
\]
\[
\quad + \mathbb{E}[\Pi_J^\perp(h-h_0)(X)b^K(W)]'(\hat A-A)'G(\hat A-A)\,\mathbb{E}[\Pi_J^\perp(h-h_0)(X)b^K(W)]
= T_{21} + T_{22} + T_{23},
\]
where we denote the projection $\Pi_J^\perp = \operatorname{id}-\Pi_J$. Consider $T_{21}$.
We make use of the inequality
\[
\mathbb{E}\Big\|\Big(\frac1n\sum_i(h-h_0)(X_i)b^K(W_i) - \mathbb{E}[(h-h_0)(X)b^K(W)]\Big)'A'G^{1/2}\Big\|^2
\le n^{-1}\,\mathbb{E}\Big[(h-h_0)^2(X)\big\|b^K(W)'A'G^{1/2}\big\|^2\Big] \le n^{-1}s_J^{-2}\sqrt{J},
\]
which yields
\[
T_{21} \le \Big\|G^{1/2}\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big\|^2
\times \Big\|\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i) - \mathbb{E}[(Y-h(X))b^K(W)]\Big\|^2
\]
\[
\quad + 2\Big\|\frac1n\sum_i\Big(b^K(W_i)(h-h_0)(X_i) - \mathbb{E}[(Y-h(X))b^K(W)]\Big)'A'G^{1/2}\Big\|^2
= O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big)\times O_p\big(n^{-1/2}\zeta_J\big) + O_p\big(n^{-1}v_J\big)
= O_p\big(n^{-1}s_J^{-2}\sqrt{J}\big)
\]
using Chen and Christensen [2018, Lemma F.10 (b)] (with $G_\psi$ replaced by $G$) and that $s_J^{-1}\zeta_J^2\sqrt{(\log J)/n} = O(1)$ by Assumption 2 (i). Since $|T_{22}| \le \sqrt{T_{21}T_{23}}$ we conclude $T_2 = O_p\big(n^{-1}s_J^{-2}\sqrt{J}\big)$, which completes the proof.

Lemma F.5.
Let Assumptions 1–5 be satisfied. Then, under $H_0 = \{h_0\}$ it holds uniformly in $J\in\mathcal I_n$:
\[
\frac1n\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'}) = O_p(v_J).
\]

Proof.
We make use of the inequality
\[
\frac1n\Big|\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})\Big|
\le \frac1n\Big\|\sum_i\big(Y_i-h_0(X_i)\big)\tilde b^K(W_i)\Big\|^2\,\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\|
\]
\[
\le \frac1n\Big|\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\tilde b^K(W_i)'\tilde b^K(W_{i'})\Big|\,\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\|
+ \frac1n\sum_i\big\|\big(Y_i-h_0(X_i)\big)\tilde b^K(W_i)\big\|^2\,\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\|.
\]
Note that $n^{-1}\sum_i\big\|(Y_i-h_0(X_i))\tilde b^K(W_i)\big\|^2 \le \overline{\sigma}^2K + o_p(1)$ uniformly in $K$.
Further, by Lemma F.2 we have $v_J \ge \underline{\sigma}^2\sqrt{J}$ (we may assume that $s_1 = 1$) and thus we obtain, for any $\epsilon>0$,
\[
P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{1}{nv_J}\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})\Big| > \epsilon\Big)
\]
\[
\le P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{1}{n\sqrt J}\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\tilde b^K(W_i)'\tilde b^K(W_{i'})\Big| > \epsilon\,\underline{\sigma}^2\Big)
+ P_h\Big(\max_{J\in\mathcal I_n}\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\| > \epsilon\,\underline{\sigma}^2/2\Big)
\]
\[
\quad + P_h\Big(\max_{J\in\mathcal I_n}K\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\| > \epsilon\,\underline{\sigma}^2/2\Big) + o(1),
\]
where each of the remaining terms on the right hand side is $o(1)$ by the arguments used in the proof of Lemma F.4, which completes the proof.

Lemma F.6.
Let Assumptions 1–5 be satisfied. Then, for some $c\in(0,1)$ it holds $|\hat v_J - v_J| \le c\,v_J$ wpa1 for all $J\in\mathcal I_n$.

Proof. We denote $G_\sigma = \mathbb{E}_h[(Y-h(X))^2b^K(W)b^K(W)']$ and its empirical analog $\hat G_\sigma = n^{-1}\sum_i(Y_i-h(X_i))^2b^K(W_i)b^K(W_i)'$. Note that for any $J\times J$ matrix $M$ it holds $\|M\|_F \le \sqrt J\,\|M\|$. Hence, for all $J\in\mathcal I_n$ the triangle inequality implies
\[
|\hat v_J - v_J| \le \big\|G^{1/2}\hat A\hat G_\sigma\hat A'G^{1/2} - G^{1/2}AG_\sigma A'G^{1/2}\big\|_F
\le \sqrt J\,\big\|G^{1/2}\hat A\hat G_\sigma\hat A'G^{1/2} - G^{1/2}AG_\sigma A'G^{1/2}\big\|.
\]
Thus, the result follows from the proof of Chen and Christensen [2015, Lemma E.16].
Lemma F.7.
Let Assumptions 1–5 be satisfied. Then, we have
\[
\max_{J\in\mathcal I_n}\Big|1 - \frac{\hat v_J(h_0)}{\hat v_J(\hat h_J)}\Big| = o_p(1).
\]

Proof.
For all $J\in\mathcal I_n$ and $h\in\mathcal H_{r_J}$ the triangle inequality implies
\[
|\hat v_J(h) - \hat v_J(h_0)| \le \Big\|G^{1/2}\hat A\,\frac1n\sum_i\Big(\big(Y_i-h(X_i)\big)^2 - \big(Y_i-h_0(X_i)\big)^2\Big)b^K(W_i)b^K(W_i)'\,\hat A'G^{1/2}\Big\|_F
\]
\[
\le \sqrt J\,\Big\|G^{1/2}\hat A\,\frac1n\sum_i\big(h(X_i)-h_0(X_i)\big)^2b^K(W_i)b^K(W_i)'\,\hat A'G^{1/2}\Big\|
+ 2\sqrt J\,\Big\|G^{1/2}\hat A\,\frac1n\sum_i\big(h_0(X_i)-h(X_i)\big)\big(Y_i-h_0(X_i)\big)b^K(W_i)b^K(W_i)'\,\hat A'G^{1/2}\Big\|
= T_1 + T_2.
\]
Consider $T_1$. Following the proof of Theorem 3.4 we obtain
\[
T_1 \le \sqrt J\,s_J^{-2}\,\Big\|\frac1n\sum_i\big(h(X_i)-h_0(X_i)\big)^2\tilde b^K(W_i)\tilde b^K(W_i)'\Big\| = O_p\big(\sqrt J\,s_J^{-2}n^{-1}\big) = o_p(1)
\]
uniformly in $J\in\mathcal I_n$ by Assumption 5 (i). Analogously, we obtain $T_2 = o_p(1)$ uniformly in $J\in\mathcal I_n$.

Lemma F.8.
Under Assumptions 1 (ii) and 2 (iii) it holds for all $h\in\mathcal H$ that
\[
\lambda_{\max}\Big(\operatorname{Var}_h\big(\rho(Y,h(X))G_b^{-1/2}b^K(W)\big)\Big) \le \overline{\sigma}^2 < \infty.
\]

Proof. For any $\gamma\in\mathbb R^K$ it holds
\[
\gamma'\operatorname{Var}_h\big(\rho(Y,h(X))G_b^{-1/2}b^K(W)\big)\gamma
\le \mathbb{E}\Big[\mathbb{E}_h[\rho^2(Z,h)|W]\big(\gamma'G_b^{-1/2}b^K(W)\big)^2\Big]
\le \overline{\sigma}^2\,\mathbb{E}\Big[\big(\gamma'G_b^{-1/2}b^K(W)\big)^2\Big]
= \overline{\sigma}^2\gamma'G_b^{-1/2}\,\mathbb{E}\big[b^K(W)b^K(W)'\big]G_b^{-1/2}\gamma = \overline{\sigma}^2\|\gamma\|^2,
\]
where the second inequality is due to Assumption 1 (ii).

Lemma F.9.
Under the conditions of Theorem 4.1 it holds
\[
\widehat D_K - \|m(\cdot,h)\|^2_{L^2(W)} = O_p\Big(n^{-1}\sqrt K + n^{-1/2}\|\Pi_Km(\cdot,h)\|_{L^2(W)} + K^{-2\gamma/d_w}\Big).
\]

Proof.
Similarly to Theorem F.1 we obtain
\[
\widehat D_K - \|m(\cdot,h)\|^2_{L^2(W)} = \widehat D_K - \|\Pi_Km(\cdot,h)\|^2_{L^2(W)} + \|\Pi_Km(\cdot,h)\|^2_{L^2(W)} - \|m(\cdot,h)\|^2_{L^2(W)}.
\]
Following the first part of the proof of Theorem F.1 with $V_{Ki}$ replaced by $\rho(Y_i,h(X_i))G_b^{-1/2}b^K(W_i)$ and using Lemma F.4 yields
\[
\widehat D_K - \|\Pi_Km(\cdot,h)\|^2_{L^2(W)} = O_p\Big(n^{-1}\sqrt K + n^{-1/2}\|\Pi_Km(\cdot,h)\|_{L^2(W)}\Big).
\]
Moreover, since $\Pi_K$ is an orthogonal projection on $L^2(W)$ we have $\langle\Pi_Km(\cdot,h)-m(\cdot,h),\Pi_Km(\cdot,h)\rangle_{L^2(W)} = 0$ and thus
\[
\big|\,\|\Pi_Km(\cdot,h)\|^2_{L^2(W)} - \|m(\cdot,h)\|^2_{L^2(W)}\big| = \|\Pi_Km(\cdot,h) - m(\cdot,h)\|^2_{L^2(W)} = O\big(K^{-2\gamma/d_w}\big),
\]
where the last bound is due to the sieve approximation rate imposed in Assumption 7.

G. U-statistics deviation results
We make use of the following exponential inequality established by Houdré and Reynaud-Bouret [2003].
Lemma G.1 (Houdré and Reynaud-Bouret [2003]). Let $U_n$ be a degenerate U-statistic of order 2 with kernel $R$ based on a simple random sample $Z_1,\ldots,Z_n$. Then there exists a generic constant $C>0$ such that for all $u>0$:
\[
P_h\Big(\Big|\sum_{1\le i<i'\le n}R(Z_i,Z_{i'})\Big| \ge C\big(\Lambda_1\sqrt u + \Lambda_2 u + \Lambda_3 u^{3/2} + \Lambda_4 u^2\big)\Big) \le 6\exp(-u),
\]
where
\[
\Lambda_1^2 = n(n-1)\,\mathbb{E}_h[R^2(Z_1,Z_2)],\qquad
\Lambda_2 = n\sup\Big\{\mathbb{E}_h[R(Z_1,Z_2)\nu(Z_1)\kappa(Z_2)] : \|\nu\|_{L^2(Z)}\le1,\ \|\kappa\|_{L^2(Z)}\le1\Big\},
\]
\[
\Lambda_3 = \sqrt n\,\sup_{z}\sup_{\|\nu\|_{L^2(Z)}\le1}\big|\mathbb{E}_h[R(Z_1,z)\nu(Z_1)]\big|,\qquad
\Lambda_4 = \sup_{z_1,z_2}|R(z_1,z_2)|.
\]
Lemma G.2. Let Assumption 1 (ii) be satisfied. Given the kernel $R$ it holds under $H_0$:
\[
\Lambda_1^2 \le n(n-1)\,v_J^2, \tag{G.1}
\]
\[
\Lambda_2 \le 2\overline{\sigma}^2 n s_J^{-2}, \tag{G.2}
\]
\[
\Lambda_3 \le \overline{\sigma}\sqrt n\,M_n\zeta_{b,K}s_J^{-2}, \tag{G.3}
\]
\[
\Lambda_4 \le M_n^2\zeta_{b,K}^2 s_J^{-2}. \tag{G.4}
\]

Proof.
Proof of (G.1). Recall the notation $V_{Ji} = U_iG^{1/2}Ab^K(W_i)$ with $U_i = Y_i-h_0(X_i)$; then we evaluate under $H_0$:
\[
\mathbb{E}_h[R^2(Z_1,Z_2)] \le \mathbb{E}_h\big|U_1b^K(W_1)'A'GAb^K(W_2)U_2\big|^2
= \mathbb{E}_h\Big[U_1^2\,b^K(W_1)'A'GA\,\mathbb{E}_h\big[U_2^2b^K(W_2)b^K(W_2)'\big]A'GA\,b^K(W_1)\Big]
= \mathbb{E}_h\Big[(V_{J1})'\,\mathbb{E}_h\big[V_J(V_J)'\big]V_{J1}\Big]
= \sum_{j,j'=1}^J\big(\mathbb{E}_h[V_jV_{j'}]\big)^2 = v_J^2.
\]
Proof of (G.2). For any functions $\nu$ and $\kappa$ with $\|\nu\|_{L^2(Z)}\le1$ and $\|\kappa\|_{L^2(Z)}\le1$, respectively, we obtain, writing $U^M := U\,\mathbb 1\{|U|\le M_n\}$:
\[
\big|\mathbb{E}_h[R(Z_1,Z_2)\nu(Z_1)\kappa(Z_2)]\big|
\le \big|\mathbb{E}_h[U^Mb^K(W)'\nu(Z)]\,A'GA\,\mathbb{E}_h[U^Mb^K(W)\kappa(Z)]\big| + \big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)]\big\|^2
\]
\[
\le \big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)\kappa(Z)]\big\|\,\big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)\nu(Z)]\big\| + \big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)]\big\|^2
\]
\[
\le \big\|G^{1/2}AG_b^{1/2}\big\|^2\Big(\sqrt{\mathbb{E}\big[|\mathbb{E}_h[U^M\kappa(Z)|W]|^2\big]}\times\sqrt{\mathbb{E}\big[|\mathbb{E}_h[U^M\nu(Z)|W]|^2\big]} + \mathbb{E}\big[|\mathbb{E}_h[U^M|W]|^2\big]\Big).
\]
Now observe $\mathbb{E}\big[|\mathbb{E}_h[U^M\kappa(Z)|W]|^2\big] \le \mathbb{E}\big[\mathbb{E}_h[U^2|W]\kappa^2(Z)\big] \le \overline{\sigma}^2$ by Assumption 1 (ii) and using that $\|\kappa\|_{L^2(Z)}\le1$, which yields the upper bound by using $\|G^{1/2}AG_b^{1/2}\| = s_J^{-1}$.

Proof of (G.3). Observe that for any $z = (u,w)$ and any $\nu$ with $\|\nu\|_{L^2(Z)}\le1$:
\[
\big|\mathbb{E}_h[R(Z_1,z)\nu(Z_1)]\big|
\le \mathbb{E}_h\big|U\,\mathbb 1\{|U|\le M_n\}\,b^K(W)'A'GA\,b^K(w)u\,\mathbb 1\{|u|\le M_n\}\,\nu(Z)\big|
\le \big\|G^{1/2}Ab^K(w)u\,\mathbb 1\{|u|\le M_n\}\big\|\,\big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)\nu(Z)]\big\|
\le \overline{\sigma}M_n\zeta_{b,K}\big\|G^{1/2}AG_b^{1/2}\big\|^2,
\]
again by using Assumption 1 (ii), and hence the upper bound (G.3) follows.

Proof of (G.4). Observe that for any $z_1 = (u_1,w_1)$ and $z_2 = (u_2,w_2)$ we get
\[
|R(z_1,z_2)| \le \big|u_1\mathbb 1\{|u_1|\le M_n\}\,b^K(w_1)'A'GA\,b^K(w_2)u_2\mathbb 1\{|u_2|\le M_n\}\big|
\le \sup_{u,w}\big\|G^{1/2}Ab^K(w)u\,\mathbb 1\{|u|\le M_n\}\big\|^2
\le M_n^2\zeta_{b,K}^2\big\|G^{1/2}AG_b^{1/2}\big\|^2,
\]
which yields (G.4) and completes the proof.
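As a toy numerical illustration of the degeneracy that Lemma G.1 exploits, consider the scalar kernel $R(z_1,z_2) = z_1z_2$ with mean-zero data, for which $\mathbb{E}[R(Z_1,z)] = 0$ for every fixed $z$. The short simulation below (sample size, replication count, and the normal design are illustrative choices only) checks that the U-statistic $\sum_{i<i'}R(Z_i,Z_{i'})$ then fluctuates on the scale $\sqrt{n(n-1)\mathbb{E}[R^2]/2}$, i.e., on the order of the leading term $\Lambda_1\sqrt u$ of the deviation bound:

```python
import numpy as np

# Toy degenerate order-2 U-statistic: for i.i.d. mean-zero Z_i the kernel
# R(z1, z2) = z1 * z2 satisfies E[R(Z1, z)] = 0 for every fixed z, and
# U_n = sum_{i<i'} R(Z_i, Z_{i'}) has standard deviation
# sqrt(n(n-1)/2 * E[R^2]), which is Lambda_1 of Lemma G.1 up to a constant.
rng = np.random.default_rng(1)
n, reps = 200, 2000
z = rng.standard_normal((reps, n))
s = z.sum(axis=1)
# Identity: sum_{i<i'} z_i z_{i'} = (s^2 - sum_i z_i^2) / 2
u_stats = 0.5 * (s**2 - (z**2).sum(axis=1))
scale = np.sqrt(n * (n - 1) / 2.0)  # E[R^2] = 1 for standard normal Z
print(abs(u_stats.mean()) < 0.1 * scale, 0.8 < u_stats.std() / scale < 1.25)
```

The closed-form identity for $\sum_{i<i'}z_iz_{i'}$ avoids the $O(n^2)$ double loop, so the check runs in a fraction of a second.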