Adaptive, Rate-Optimal Testing in Instrumental Variables Models∗

Christoph Breunig†  Xiaohong Chen‡

First version: August 2018. Revised: June 18, 2020.
This paper proposes simple, data-driven, optimal rate-adaptive inferences on a structural function in semi-nonparametric conditional moment restrictions. We consider two types of hypothesis tests based on leave-one-out sieve estimators. A structure-space test (ST) uses a quadratic distance between the structural functions of endogenous variables, while an image-space test (IT) uses a quadratic distance of the conditional moment from zero. For both tests, we analyze their respective classes of nonparametric alternative models that are separated from the null hypothesis by the minimax rate of testing. That is, the sum of the type I and the type II errors of the test, uniformly over the class of nonparametric alternative models, cannot be improved by any other test. Our new minimax rate of the ST differs from the known minimax rate of estimation in nonparametric instrumental variables (NPIV) models. We propose computationally simple and novel exponential-scan data-driven choices of sieve regularization parameters and adjusted chi-squared critical values. The resulting tests attain the minimax rate of testing, and hence optimally adapt to the unknown smoothness of functions and are robust to the unknown degree of ill-posedness (endogeneity). Data-driven confidence sets are easily obtained by inverting the adaptive ST. Monte Carlo studies demonstrate that our adaptive ST has good size and power properties in finite samples for testing monotonicity or equality restrictions in NPIV models. Empirical applications to nonparametric multi-product demands with endogenous prices are presented.
Keywords: Instrumental variables; Minimax rate of testing; Adaptive testing; Exponential scan; Confidence sets; Quadratic functionals; Shape restrictions.

∗ We thank Don Andrews, Tim Armstrong, Denis Chetverikov, Tim Christensen, Enno Mammen, Peter Mathe and others at numerous workshops and conferences for helpful comments. The empirical work here is researchers' own analyses based in part on data from The Nielsen Company (US), LLC and marketing databases provided through the Nielsen Datasets at the Kilts Center for Marketing Data Center at The University of Chicago Booth School of Business. The conclusions drawn from the Nielsen data are those of the researchers and do not reflect the views of Nielsen. Nielsen is not responsible for, had no role in, and was not involved in analyzing and preparing the results reported herein.
† Department of Economics, Emory University, Rich Memorial Building, Atlanta, GA 30322, USA. Email: [email protected]
‡ Cowles Foundation for Research in Economics, Yale University, Box 208281, New Haven, CT 06520, USA. Email: [email protected]

1. Introduction

Models with endogeneity are pervasive in economics, and endogeneity is one of the most distinguishing features that differentiate econometrics from statistics. In the big data era, semiparametric and nonparametric methods and models allowing for flexible endogeneity are increasingly widely used in empirical research.
A common difficulty in applying any semiparametric or nonparametric method in practice is how to choose tuning (regularization) parameters in a simple, data-driven way that still possesses some "optimal" theoretical properties. For nonparametric models with endogeneity, such as nonparametric instrumental variables (NPIV) and nonparametric quantile instrumental variables (NPQIV) models, it is well known that the finite-sample performance of various estimators and tests is much more sensitive to tuning parameters than in models without endogeneity.

There are a few papers on data-driven choices of regularization parameters in estimation of a NPIV model. However, it is well known that data-driven choices designed for "optimal" nonparametric estimation do not lead to "optimal" inference (testing and confidence sets) in nonparametric settings (see, e.g., Gine and Nickl [2016]). To the best of our knowledge, there is currently no work on minimax rate-optimal testing for NPIV models, nor on data-driven choices of regularization parameters for rate-optimal testing and confidence sets in NPIV models. In this paper we address this important issue within the framework of rate-optimal testing in semiparametric and nonparametric conditional moment restrictions. As a leading example, we provide computationally simple, data-driven choices of tuning parameters for optimal inference on NPIV functions, such as multi-product demand functions (of endogenous prices) in industrial organization.

This paper first considers minimax rate-optimal hypothesis testing in semiparametric or nonparametric models defined by conditional moment restrictions.
The maintained modeling assumption is that there is a nonparametric structural function h satisfying

E[ρ(Y, h(X)) | W] = 0,  (1.1)

where ρ is a possibly non-smooth mapping that is known up to the function h, X is a d_x-dimensional vector of continuous endogenous regressors, W is a d_w-dimensional vector of conditioning (instrumental) variables, and the joint distribution of (Y, X, W) is unspecified. Our goal is to test whether h coincides with a restricted function h_r, such as a parametric, semiparametric, or shape-restricted function. For example, in a NPIV model E[Y − h(X) | W] = 0 we would be interested in testing whether the structural NPIV function h coincides with some decreasing function h_r. We propose two statistics for this hypothesis testing problem. The ST is based on the squared distance between (estimators of) h and its restricted version h_r. The IT is based on the squared distance between a nonparametric estimator of E[ρ(Y, h_r(X)) | W] and zero.

(On data-driven regularization choices for NPIV estimation, see, e.g., Horowitz [2014], Liu and Tao [2014], Centorrino [2014], Chen and Christensen [2015], Breunig and Johannes [2016], Gautier and Le Pennec [2018], Jansson and Pouzo [2019] and the references therein. Most of these papers suggest data-driven procedures without establishing rate-adaptivity in NPIV estimation.)

Both test statistics are based on simple leave-one-out sieve estimators of quadratic functionals. We establish an upper bound on the sum of the type I and type II errors. Specifically, we bound the type I error uniformly over distributions satisfying the null hypothesis, and the type II error uniformly over a class of nonparametric alternative models separated from the null hypothesis by a so-called rate of testing. We then establish a lower bound for the sum of the type I and type II errors at the same separation rate. Thus, no other test provides a better performance with respect to the sum of these errors.
This optimal rate of separation is called the minimax rate of testing.

A key technical step in establishing the minimax rate of our ST in quadratic distance is to derive a tight upper bound on the convergence rate of a leave-one-out sieve estimator of a quadratic functional of a NPIV function h; see the online Appendix F. This rate differs from the existing minimax rate of convergence, in root-mean-squared error, of any consistent estimator of the function h itself. However, the minimax rate of our ST depends on an optimal choice of sieve dimension (the key tuning parameter), which is determined by the unknown degree of ill-posedness (due to the endogeneity in (1.1)) and the unknown regularity of the nonparametric alternative functions that differ from the null restricted functions.

We propose a computationally simple, data-driven version of our ST that requires a priori knowledge of neither the smoothness of the nonparametric alternative functions h nor the degree of ill-posedness. The data-driven test rejects the null hypothesis as soon as there is a sieve dimension (say, the smallest sieve dimension) in an admissible index set such that the corresponding normalized quadratic distance estimator exceeds one; it fails to reject the null when the maximal (over the admissible index set) normalized quadratic distance estimator is less than or equal to one. The cardinality of this admissible index set is determined by a novel exponential-scan (ES) method that automatically takes the unknown degree of ill-posedness (endogeneity) into account. We show that our data-driven ST attains the minimax optimal rate for severely ill-posed problems and is within a log log(n) term of it for mildly ill-posed problems, where n is the sample size.
This extra log log(n) term is the price to pay for adaptivity to the unknown smoothness of the nonparametric alternative functions that differ from the null restricted function h_r.

By inverting the adaptive tests we obtain confidence sets on restricted (constrained) functions h_r. These confidence sets do not require additional choices of tuning parameters. The adaptive minimax rate of testing determines the radius of the confidence sets. We argue that the radius based on our adaptive ST can only be marginally improved, in a very limited range of submodels, depending on the regularity of the unknown function h in NPIV models.

(We prove in Appendix F that this convergence rate coincides with the lower bound derived by Chen and Christensen [2018] for estimation of a quadratic functional of a NPIV function. As shown in Chen and Christensen [2018], the plug-in sieve estimator does not achieve the optimal minimax rate for estimation of a quadratic functional of a NPIV in the mildly ill-posed case.)

Monte Carlo studies indicate that our data-driven ST is not only computationally very fast, but also has accurate size and good power in finite samples, without the need for computationally intensive bootstrap critical values. In a simulation study of hypothesis testing for monotonicity of a NPIV function, our adaptive ST automatically leads to a data-driven confidence set under monotonicity restrictions when the null is not rejected. When the null of monotonicity is rejected by our adaptive ST, the data-driven choice of the smallest sieve dimension leading to rejection can still yield a consistent sieve estimate of the unrestricted NPIV function h, while parts of the true h and its sieve estimate lie outside the monotonicity-constrained confidence sets. This simulation demonstrates the importance of a data-driven choice of tuning parameters for testing shapes of a NPIV function.
We provide empirical applications concerning shapes of consumer demand, where our data-driven test detects heterogeneity in the curvature of demand curves across income groups. For instance, our adaptive ST fails to reject that the demand for certain nondurable goods is decreasing (in own price) for low-income households, but does reject the decreasing shape for high-income households. It may therefore lead to erroneous policy evaluations when a nonparametric decreasing demand (in own price) is imposed across all income levels.

Our main contribution is the data-driven, rate-optimal hypothesis test in structure space. For comparison, we also present the minimax rate-adaptive image-space test (IT). Although both are simple to implement, their data-driven procedures choose different key tuning parameters to achieve their respective minimax optimal rates of testing. The sieve dimension J for approximating h(X) is the key tuning parameter in the ST approach, while the sieve dimension K for approximating the conditional moment function E[ρ(Y, h_r(X)) | W] is the key tuning parameter in the IT approach for a simple or parametric null hypothesis. The adaptive ST has the advantage of automatically providing a data-driven choice of the sieve dimension J that can be used for estimation of a semi-nonparametric or shape-restricted function h_r. This greatly simplifies the construction of data-driven confidence sets. In addition, we show both theoretically and via Monte Carlo simulations that the adaptive ST can be more powerful than the adaptive IT when the dimension of the conditional instruments is larger than the dimension of the endogenous regressors (i.e., d_w > d_x). On the other hand, the image-space test (IT) is more convenient for non-separable models such as nonparametric quantile IV regressions, as well as for partially identified models.
Literature review: The concept of a minimax rate of testing in nonparametric models was perhaps first introduced by Ingster [1993] and Spokoiny [1996]. It has been applied to optimal testing in nonparametric regression models without endogeneity, including Horowitz and Spokoiny [2001], Guerre and Lavergne [2005] and others. Our paper is the first to study minimax rate-optimal tests in nonparametric conditional moment restrictions with endogeneity, including the NPIV model as a leading example.

There are papers on specification tests for NPIV-type models that extend Bierens [1990]'s test for conditional moment restrictions to models allowing for functions depending on endogenous regressors; see, e.g., Horowitz [2006], Breunig [2015], Santos [2012], Tao [2014], Chernozhukov et al. [2015], Zhu [2020] and the references therein. These tests are similar to what we call the image-space test. Among these papers, Chernozhukov et al. [2015] is the most general, providing inference on equality and/or inequality constrained conditional moment restrictions allowing for partial identification. Chen and Pouzo [2015] provide inference results using either sieve Wald ("structure-space") or sieve QLR ("image-space") tests for general point-identified semi-nonparametric conditional moment restriction models. Chetverikov and Wilhelm [2017] studied mean-squared-rate sieve estimation of a NPIV function under a monotonicity restriction. Freyberger and Reeves [2019] considered confidence sets for a monotone NPIV function. Compiani [2019] also imposed a monotonicity restriction in his estimation of an IO demand NPIV function. None of these papers considers minimax tests for NPIV-type models or data-driven choices of key tuning parameters. Our paper is the first to propose simple, adaptive structure- and image-space tests that achieve minimax rate-optimality.
In addition, we provide confidence sets for NPIV functions h_r based on a data-driven choice of tuning parameters.

The remainder of the paper is organized as follows. Section 2 describes our data-driven structure-space test (ST); it also presents a simulation study of adaptive testing for monotonicity in NPIV models. Section 3 first establishes the minimax optimal rate of the ST, and then shows that this optimal rate is attained (within a log log n term) by our data-driven ST procedure. Section 4 introduces the data-driven image-space test statistic and presents its minimax rate of testing and adaptivity. Section 5 provides three empirical illustrations; it also contains additional Monte Carlo studies comparing the finite-sample size and power properties of the adaptive ST and the adaptive IT. Section 6 briefly concludes. Appendices A and B contain proofs of the minimax rates for the ST and for the adaptive ST under a simple null hypothesis, respectively. The online supplementary appendices contain additional material: Appendix C presents robustness checks using bootstrap critical values for the empirical applications; Appendices D and E provide proofs of the optimal rates of the adaptive ST under composite null hypotheses and of the adaptive IT, respectively; Appendices F and G contain additional technical lemmas.

(This has a close connection to the robustness or sensitivity literature that has gained popularity in macroeconomics; see, e.g., Hansen and Sargent [2008].)

2. Preview of Adaptive Structure-Space Test

We first introduce the null and alternative hypotheses as well as the concept of the minimax rate of testing in Subsection 2.1. We then describe our new data-driven, rate-adaptive structure-space test (ST) for NPIV-type models in Subsection 2.2. The formal theoretical justifications are postponed to Section 3. Subsection 2.3 provides a simulation study of our adaptive test for monotonicity of structural functions.
Let H denote some class of functions. Let {(Y_i, X_i, W_i)}_{i=1}^n be a random sample from the distribution P_h of (Y, X, W), where h ∈ H, such that

E[Y − h(X) | W] = 0.

Let H_r denote a subset of functions in H that satisfy a conjectured restriction, such as monotonicity, concavity, some other shape restriction, or a parametric restriction. For any h ∈ H, we introduce h_r ∈ H_r such that E|E[h(X) − h_r(X) | W]|² ≤ E|E[h(X) − h̃(X) | W]|² for all h̃ ∈ H_r.

We analyze the null hypothesis that there exists a function h ∈ H with E[Y − h(X) | W] = 0 satisfying the conjectured restriction captured by H_r; specifically, that the set

H₀ := { h ∈ H : E[Y − h(X) | W] = 0 and ∫ (h(x) − h_r(x))² μ(x) dx = 0 }

is not empty. Here we measure the distance between restricted and unrestricted functions with a measure depending on a prespecified weighting function μ, which is restricted to be positive on the support of X. If we want to test that h coincides with h_r on some subset of the support of X only, then this modified null hypothesis can be implemented by changing μ accordingly. To analyze the power of any test against nonparametric alternatives, we require some separation between the null and the class of nonparametric alternatives. The resulting class of alternatives considered in this paper is given by

H₁(δ, r_n) := { h ∈ H : E[Y − h(X) | W] = 0 and ∫ (h(x) − h_r(x))² μ(x) dx ≥ δ r_n² }

for some constant δ > 0 and a sequence of positive numbers r_n. The rate r_n is also known as the rate of testing, and we establish its optimality in the minimax sense as described below. In this paper, we establish the minimax rate of testing r_n in the sense of Ingster [1993]: we propose a test which minimizes the sum of the maximal Type I error uniformly over H₀ and the maximal Type II error uniformly over H₁(δ, r_n). Moreover, we show that the sum of both errors cannot be improved by any other test.
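To fix ideas, the μ-weighted quadratic distance that separates the null from the alternatives can be computed numerically. A minimal sketch with illustrative functions (a weakly decreasing h_r, and an h that violates monotonicity for x > 5/7) and μ taken as the uniform weight on [0, 1]; none of these choices come from the paper:

```python
import numpy as np

def h_r(x):
    return 1.0 - x                  # weakly decreasing (null) function

def h(x):
    return 1.0 - x + 0.7 * x**2     # h'(x) > 0 for x > 5/7: violates H0

# Quadratic distance  ∫ (h(x) - h_r(x))^2 μ(x) dx  with μ = 1 on [0, 1],
# approximated by an average over a fine uniform grid.
x = np.linspace(0.0, 1.0, 100_001)
dist = np.mean((h(x) - h_r(x))**2)  # ≈ ∫ 0.49 x^4 dx = 0.098
print(round(float(dist), 3))        # → 0.098
```

A test then has to detect whether this distance is zero (the null) or exceeds the separation threshold δ r_n² (the alternative).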
The minimax rate of testing requires an optimal choice of tuning parameters depending on unknown smoothness properties of the structural function h and unknown mapping properties of the conditional expectation given W. We provide a data-driven extension of the minimax test, i.e., a testing procedure that adapts to the smoothness of the unrestricted function h in the presence of unknown smoothing properties of the conditional expectation operator.

In this section we describe our adaptive structure-space test for point-identified NPIV models. Our test builds on a leave-one-out series estimator of the quadratic distance ∫ (h(x) − h_r(x))² μ(x) dx. For a given sieve dimension J, it is given by

D̂_J(h_r) = 2/(n(n − 1)) Σ_{1≤i<i'≤n} (Y_i − h_r(X_i)) b^{K(J)}(W_i)' Â' G Â b^{K(J)}(W_{i'}) (Y_{i'} − h_r(X_{i'})),  (2.1)

where b^{K(J)}(·) is a vector of K(J) basis functions spanning the instrument space, G = ∫ ψ^J(x) ψ^J(x)' μ(x) dx is the Gram matrix of the basis ψ^J used to approximate h, and Â is the sample analog of the matrix A introduced in Section 3.1. Below, ĥ^r_J denotes the restricted sieve NPIV estimator of h_r with sieve dimension J.
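In matrix form, a statistic of this type sums a weighted kernel over all pairs i ≠ i′. A minimal numerical sketch, with random placeholder residuals and instrument basis, and an identity matrix standing in for the weighting Â′GÂ (none of these are the paper's estimated objects):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 5

# Placeholder ingredients: residuals u_i = Y_i - h_r(X_i), instrument
# basis vectors b(W_i), and Omega standing in for the weighting A' G A.
u = rng.normal(size=n)
B = rng.normal(size=(n, K))
Omega = np.eye(K)

# Pairwise kernel  k(i, i') = u_i * b(W_i)' Omega b(W_i') * u_{i'}.
M = u[:, None] * B                  # row i holds u_i * b(W_i)'
Kmat = M @ Omega @ M.T

# Leave-one-out U-statistic: average over off-diagonal entries only,
#   D_J = 2 / (n (n - 1)) * sum_{i < i'} k(i, i').
D_J = (Kmat.sum() - np.trace(Kmat)) / (n * (n - 1))
print(np.isfinite(D_J))             # → True
```

Dropping the diagonal (the leave-one-out step) removes the own-observation bias term, which is what makes the estimator of the quadratic functional centered under the null.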
For a nominal level α ∈ (0, 1) we define

ŜT_n = 1{ ∃ J ∈ Î_n such that Ŵ_J := n D̂_J(ĥ^r_J) / (η̂_J(α) v̂_J(ĥ^r_J)) > 1 },  (2.2)

where 1{·} denotes the indicator function and v̂_J, Î_n, and η̂_J(α) are defined as follows. First,

v̂_J(h) = ‖ n^{-1} Σ_{i=1}^n (Y_i − h(X_i))² G^{1/2} Â b^{K(J)}(W_i) b^{K(J)}(W_i)' Â' G^{1/2} ‖_F,  (2.3)

where ‖·‖_F denotes the Frobenius norm. Second, the index set Î_n is constructed via a simple exponential-scan (ES) procedure:

Î_n = { J ≤ Ĵ_max : J = 2^j J₀ where j = 0, 1, ..., j_max },  (2.4)

where J₀ := ⌊√(log log n)⌋, j_max := ⌈log₂(n^{1/3}/J₀)⌉, and

Ĵ_max = min{ J > J₀ : ζ(J) √(ℓ(J)(log n)/n) ≥ s_min((B'B/n)^{-1/2}(B'Ψ/n)G^{-1/2}) },

where ℓ(J) = 0.1 J, s_min(·) denotes the minimal singular value, and B and Ψ denote the n × K(J) and n × J matrices of basis functions evaluated at the observations W_i and X_i, respectively. Further, ζ(J) = √J for spline, wavelet, or trigonometric sieve bases, and ζ(J) = J for orthogonal polynomial bases.

Finally, the critical value η̂_J(α) is specified differently for testing equality and for testing inequality constraints. For testing equality constraints, we compute η̂_J(α) using a Bonferroni correction to a critical value from a centralized chi-squared distribution relative to the cardinality of the ES index set, denoted by #(Î_n). That is,

η̂_J(α) = ( q(α/#(Î_n), J) − J ) / √J,  (2.5)

where q(a, J) denotes the upper a-quantile of the χ² distribution with J degrees of freedom. For testing inequality constraints we have implemented two approaches. The first one is presented in Remark 3.1, which is a simple, data-driven, finite-dimensional correction to the chi-squared critical values.
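The scan grid (2.4) and the Bonferroni-adjusted chi-squared threshold (2.5) are simple to compute. A minimal sketch, in which the cap is passed in as a fixed number rather than estimated via Ĵ_max, and the chi-squared quantile uses the Wilson–Hilferty approximation to keep the sketch dependency-free (both are simplifications, not the paper's procedure):

```python
import math
from statistics import NormalDist

def exponential_scan(n, J_cap):
    """Index set (2.4): J = 2^j * J0 with J0 = floor(sqrt(log log n))."""
    J0 = max(int(math.sqrt(math.log(math.log(n)))), 1)
    grid, J = [], J0
    while J <= J_cap:
        grid.append(J)
        J *= 2
    return grid

def chi2_upper_quantile(a, df):
    """Upper-a quantile of chi-squared(df), Wilson-Hilferty approximation."""
    z = NormalDist().inv_cdf(1.0 - a)
    return df * (1.0 - 2.0 / (9.0 * df) + z * math.sqrt(2.0 / (9.0 * df))) ** 3

n, alpha = 1000, 0.05
grid = exponential_scan(n, J_cap=32)
# Bonferroni correction over the cardinality of the scan grid, as in (2.5):
crit = {J: (chi2_upper_quantile(alpha / len(grid), J) - J) / math.sqrt(J)
        for J in grid}
print(grid)                         # → [1, 2, 4, 8, 16, 32]
```

The doubling grid keeps the number of candidate dimensions of order log n, so the Bonferroni penalty α/#(Î_n) only costs a log log factor in the separation rate.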
The second one is introduced in Remark 3.2, which is a bootstrap approach to calculating the critical values.

(The Frobenius norm of a J × J matrix M = (M_{jl})_{1≤j,l≤J} is defined as ‖M‖_F = (Σ_{j,l=1}^J M_{jl}²)^{1/2}.)

2.3 A Monte Carlo Study: Adaptive Testing for Monotonicity

We investigate the finite-sample performance of our adaptive ST of the null of a weakly decreasing function in Monte Carlo experiments. The results are based on 5000 Monte Carlo replications for each experiment. In all experiments, realizations of Y are generated according to a NPIV model

Y = h(X) + U,  E[U | W] = 0,  (2.6)

for some unknown structural function h. The functional form of h and the joint distribution of (X, W, U) vary across the Monte Carlo designs. The results presented in this section indicate that our adaptive ST with simple data-driven critical values has very good size and power in finite samples, and that the adaptive ST with computationally demanding bootstrapped critical values yields no obvious improvement in terms of size or power.

Let Φ denote the standard normal distribution function. We set X_i = Φ(X*_i) and W_i = Φ(W*_i), where the random vector (X*_i, W*_i, U_i)' is generated according to

(X*_i, W*_i, U_i)' ∼ N(0, Σ),  (2.7)

where Σ has unit diagonal entries, Cov(X*_i, W*_i) = ξ, Cov(W*_i, U_i) = 0, and a fixed positive covariance between X*_i and U_i that induces endogeneity. The parameter ξ captures the strength of the instrument and varies in the experiments below: as ξ increases the instrument becomes stronger (the ill-posedness gets weaker). We generate Y_i according to (2.6) where

h(x) = c [ 1/2 − Φ((x − 1)/c) ]

for some constant 0 < c ≤ 1. The function h is monotonically decreasing, where c captures the degree of monotonicity: for small c the function h is close to zero, and for c = 1 it holds that h(x) ≈ φ(0)(1 − x), where φ denotes the standard normal probability density function.

We study the size and power patterns of our adaptive ST under the null hypothesis that the NPIV function h is weakly decreasing on the support of X.
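The design (2.6)–(2.7) is straightforward to simulate. A sketch in which the X*–U covariance is set to 0.5 purely as an illustrative placeholder (the paper's exact value is not reproduced here):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(42)
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # standard normal CDF

def simulate(n, xi, c, rho_xu=0.5):
    """Draw one sample from the NPIV design (2.6)-(2.7).
    rho_xu, the X*-U covariance driving endogeneity, is an
    illustrative placeholder value, not the paper's."""
    Sigma = np.array([[1.0,    xi,  rho_xu],
                      [xi,     1.0, 0.0],
                      [rho_xu, 0.0, 1.0]])
    Xs, Ws, U = rng.multivariate_normal(np.zeros(3), Sigma, size=n).T
    X, W = Phi(Xs), Phi(Ws)
    h = c * (0.5 - Phi((X - 1.0) / c))  # monotonically decreasing structural function
    Y = h + U
    return Y, X, W

Y, X, W = simulate(n=500, xi=0.7, c=1.0)
print(Y.shape)                          # → (500,)
```

Because X and W are marginal probability-integral transforms of jointly normal draws, both are supported on (0, 1), while U remains correlated with X but mean-independent of W.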
For the monotonicity test, we implement the test statistic ŜT_n given in (2.2) using quadratic B-spline basis functions with varying numbers of knots for h. Because the derivative of a quadratic B-spline is piecewise linear, monotonicity constraints are easily imposed on the restricted function through linear inequality constraints on its derivative at the knots; we use K(J) = 2J. We make use of the data-driven critical value η̂_J(α) given in Remark 3.1 (which is implemented using the R package coneproj). As a comparison, we also implement our adaptive ST using bootstrap critical values as described in Remark 3.2. In each Monte Carlo iteration, we generate 200 bootstrap replications using random weights ω ∼ N(0, 1) drawn independently of (X, W, U).

Table 1: Testing Monotonicity – Empirical Size for the adaptive tests ŜT_n and ŜT^B_n (columns: sample size n, parameters c and ξ, empirical size at the 10%, 5%, and 1% nominal levels, and the averages Ĵ and Ĵ^B at the 5% level).

Table 1 reports the empirical size for different nominal levels of the test ŜT_n and its bootstrap analog ŜT^B_n for the sample sizes n ∈ {500, 1000}. Results are presented for three values of the instrument-strength parameter ξ and three values of c. Overall, we see from Table 1 that the test provides adequate size control across the different parameter values of ξ and c. It is interesting to see that the computationally demanding bootstrap version ŜT^B_n with 200 bootstrap replications does not yield any improvement in terms of size control.

In addition to size, Table 1 also presents the average data-driven choice of the tuning parameter J at the 5% nominal level, denoted by Ĵ for the test ŜT_n and by Ĵ^B for its bootstrap analog ŜT^B_n. Specifically, Ĵ averages the J which maximizes Ŵ_J over the index set Î_n when the null is not rejected, and the smallest J ∈ Î_n such that Ŵ_J > 1 when the null is rejected (the smallest such J corresponds to early stopping when we reject the null).

(We did run 1000 Monte Carlo replications with n = 500 and 500 bootstrap evaluations per Monte Carlo replication for the monotonicity test. ŜT^B_n with 500 bootstrap evaluations has slightly more accurate size than with 200 bootstrap evaluations. Nevertheless, our simple adaptive ŜT_n test also has very good size control in 1000 Monte Carlo replications and is very fast to compute.)

Figure 1:
Adaptive Monotonicity Test – Empirical Power for the adaptive tests ŜT_n and ŜT^B_n for two values of ξ. Solid (dashed) lines show power results for data-driven (bootstrap) critical values. Power curves are not size adjusted. LHS: n = 500; RHS: n = 1000.

From Table 1 we see that the average data-driven choice Ĵ increases as the strength of instruments increases (captured by the parameter ξ). Further, Ĵ decreases as the regularity of the structural function h declines (captured by the parameter c). This is due to the fact that, with increasing nonlinearity of h, a smaller number of knots is sufficient in order to reject the hypothesis. The data-driven choice of J hence works in the opposite direction than in adaptive estimation, where larger smoothness leads to smaller values of J. Finally, we see that the value of Ĵ increases with the sample size.

To study the power of the test that the NPIV function h is monotonically decreasing, we consider deviations from the null of weak monotonicity. Specifically, we examine the rejection probabilities of the adaptive ST when the data are generated by the design (2.7) but using the structural function

h(x) = −x/5 + c_A x².

Note that h'(x) ≤ 0 if and only if x ≤ 0.1/c_A. Since the support of X is contained in [0, 1], we obtain from our model that the null hypothesis of weakly decreasing functions is satisfied only if c_A ≤ 0.1.

Figure 2:
Estimated NPIV curves with data generated from (2.8) with c_A = 0.2 and n = 1000, showing the true structural function (black dotted lines), the unconstrained estimator (red dashed lines), and the constrained estimator (blue solid lines). LHS: 95% CB based on the constrained estimator. RHS: 95% CB based on the unconstrained estimator.

Figure 1 depicts the power functions of the adaptive monotonicity tests ŜT_n and ŜT^B_n, the latter based on 200 bootstrap iterations, at the 5% nominal level for two values of ξ and sample sizes n ∈ {500, 1000}. From Figure 1 we see that both tests become more powerful, for c_A > 0.
1, as the parameter of instrument strength ξ and the sample size n increase. The bootstrap test ŜT^B_n is somewhat more powerful for small values of c_A when the instrument is weaker and n = 1000. In the other cases, the power improvement from using bootstrap critical values is of small magnitude or absent. When the power curves are size adjusted, the slight advantage in power of ŜT^B_n over ŜT_n disappears; in particular, we found that ŜT_n is more powerful when c_A is large. We did not pursue ŜT^B_n with a larger number of bootstrap iterations as it is computationally demanding.

To illustrate the implications of our adaptive inference procedure for estimation, we consider a modification of our data generating process, namely the model

Y = −X/5 + c_A X² + U/2,  (2.8)

where (X, W, U) is generated as in (2.7) and we consider c_A ∈ {0.2, 0.3}; for both values of c_A the null hypothesis of a weakly decreasing function is violated. We apply our adaptive ST to one sample {(Y_i, X_i, W_i)} of size n = 1000, for which we obtain a data-driven choice of sieve dimension Ĵ = 3 in both cases c_A ∈ {0.2, 0.3}. Based on this dimension-parameter choice, Figure 2 shows the constrained sieve NPIV estimator (blue solid line) and the unconstrained sieve NPIV estimator (red dashed lines). We show the 95% uniform confidence bands (CB) following Chen and Christensen [2018], based on 1000 bootstrap replications, for the constrained estimator on the left and for the unconstrained estimator on the right. From Figure 2 we see that the difference between the CBs based on the constrained and the unconstrained estimator is minor, although there is a slight improvement of the CB based on the constrained estimator.

Figure 3:
Estimated NPIV curves with data generated from (2.8) with c_A = 0.3 and n = 1000, showing the true structural function (black dotted lines), the unconstrained estimator (red dashed lines), and the constrained estimator (blue solid lines). LHS: 95% CB based on the constrained estimator. RHS: 95% CB based on the unconstrained estimator.

Figure 3 shows the estimation results when c_A = 0.3.

3. Adaptive Inference via the Structure-Space Test
This section presents several results on the data-driven structure-space test (ST) statistics. Subsection 3.1 introduces the notation and the main regularity conditions. Subsection 3.2 establishes the minimax rate of testing without a data-driven choice of the sieve dimension. Subsection 3.3 establishes the minimax rate of testing of the adaptive ST. Subsection 3.4 shows that this rate coincides with the rate of testing attained by tests of composite null hypotheses. Subsection 3.5 proposes data-driven confidence sets obtained by inverting the adaptive ST under the null hypothesis.
Before we state the minimax rate of testing in structure space, we introduce additional notation and the main assumptions. For a random variable X, we define the space L²(X) as the equivalence class of all measurable functions of X with finite second moment, with ‖·‖_{L²(X)} the associated norm. For any sigma-finite measure μ we define ‖φ‖²_μ := ∫ φ²(x) μ(x) dx for all φ ∈ L²_μ := {φ : ‖φ‖_μ < ∞}.

Assumption 1. (i)
H ⊂ L²(X); (ii) sup_{w∈W} sup_{h∈H} E_h[ρ²(Y, h_r(X)) | W = w] ≤ σ² < ∞ and sup_{h∈H} E_h[ρ⁴(Y, h_r(X))] < ∞; and (iii) inf_{w∈W} inf_{h∈H} Var_h(ρ(Y, h_r(X)) | W = w) ≥ σ₀² > 0.

Let A = [S'G_b^{-1}S]^{-1} S'G_b^{-1}, where S = E[b^K(W) ψ^J(X)'] and G_b = E[b^K(W) b^K(W)']. We introduce the projections Π_J h(·) = ψ^J(·)' G^{-1} ∫ ψ^J(x) h(x) μ(x) dx for h ∈ L²_μ and Π_K m(·) = b^K(·)' G_b^{-1} E[b^K(W) m(W)] for m ∈ L²(W). The minimal singular value of G_b^{-1/2} S G^{-1/2} is denoted by s_J. We make use of the notation ζ_J = max(ζ_{ψ,J}, ζ_{b,K}) for K = K(J), where ζ_{ψ,J} = sup_x ‖G^{-1/2} ψ^J(x)‖ and ζ_{b,K} = sup_w ‖G_b^{-1/2} b^K(w)‖. We define d_x = dim(X) and d_w = dim(W).

Assumption 2. (i) s_J^{-1} ζ_J √((log J)/n) = O(1); (ii) ‖Π_J h − h‖_μ = O(J^{-p/d_x}) for all h ∈ H and some p > 0 such that ζ_J √(log J) = O(J^{p/d_x}); and (iii) the eigenvalues of G and G_b are uniformly bounded from below and above.

Let T : L²(X) → L²(W) denote the conditional expectation operator given by Th(w) = E[h(X) | W = w]. We further define Ψ_J = clsp{ψ₁, ..., ψ_J} ⊂ L²(X).

Assumption 3. (i) sup_{h∈Ψ_J} ‖(Π_K T − T)h‖_{L²(W)} / ‖h‖_μ = o(s_J) and (ii) ‖T(h − h_r − Π_J(h − h_r))‖_{L²(W)} = O(s_J ‖h − h_r − Π_J(h − h_r)‖_μ) for all h ∈ H.

Assumption 4.
For any h ∈ H , T h = 0 implies that (cid:107) h (cid:107) µ = 0 .Discussion of Assumptions. Assumption 1 captures second moment bounds. In addi-tion, a lower bound for the variance is imposed. Assumption 2 (i) imposes bounds on thegrowth of the basis functions relative to the singular values of the matrix G − / b SG − / .Assumption 2 (i)(ii) imposes bounds on the growth of the basis functions which are knownfor commonly used bases. For instance, ζ b,K = O ( √ K ) and ζ ψ,J = O ( √ J ) for polyno-mial spline, wavelet and cosine bases, and ζ b,K = O ( K ) and ζ ψ,J = O ( J ) for orthogonal14olynomial bases; see, e.g., Newey [1997], Huang [1998]. Assumption 3 (i) is a mild condi-tion on the approximation properties of the basis used for the instrument space. In fact, (cid:107) (Π K T − T ) h (cid:107) L ( W ) = 0 for all h ∈ Ψ J when the basis functions for B K and Ψ J form either aRiesz basis or an eigenfunction basis for the conditional expectation operator. Assumption3 (ii) is the usual L stability condition imposed in the NPIV literature when h r = 0 (cf.Assumption 6 in Blundell et al. [2007] and Assumption 5.2(ii) in Chen and Pouzo [2012]).Note that Assumption 3 (ii) is also automatically satisfied by Riesz bases. Assumption 4is required for identification of the quadratic functional (cid:107) h (cid:107) µ and the condition can be lessrestrictive than imposing L completeness when the support of µ is a subset of the supportof X . Example 3.1 (NQIV) . The ST test can also be applied to models with nonseperable un-observables after linearization. Consider as an example the nonparametric quantile instru-mental variable model with conditional moment restriction E[ { Y ≤ h ( X ) } − q | W ] = 0 for some q ∈ (0 , . A linearization of the model can be obtained using the Frechet derivativeat h r maps h to E[ f Y | X,W ( h r ( X ))( h ( X ) − h r ( X )) | W ] . 
This leads to a modified version of ourtest statistic where B (cid:48) Ψ is replaced by an empirical analog of E[ f Y | X,W ( h r ( X )) b K ( W ) ψ J ( X ) (cid:48) ] and Y − h r ( X ) by f Y | X,W ( h r ( X )) h r ( X ) . We do not address the estimation of the condi-tional density and hence, the NQIV case explicitly for our structural space test. Example 3.2 (Testing Derivatives of h ) . Note that the test can be extended to check forderivatives of the function h . To do so, we replace G by the matrix (cid:90) ∂ x ψ J ( x )( ∂ x ψ J )( x ) (cid:48) µ ( x ) dx as long as the basis function ψ j are differentiable on the support of µ . This straightforwardextension is only possible in case of ST but not for IT and hence, illustrates the advantageof the ST approach. We first consider the simple hypothesis case where H = { h } and, in particular, h r = h ,for some known function h satisfying (1.1) with ρ ( Y, h ( X )) = Y − h ( X ). We introduce a J dependent analog to the adaptive structure-space test (cid:99) ST n under the simple null: ST n,J = (cid:40) n (cid:98) D J ( h ) √ η (cid:98) v J ( h ) > (cid:41) η >
0. The test ST n,J with optimally chosen J serves as a benchmark ofour adaptive ST procedure (given in (3.5)) for the simple null hypothesis case. Theorem 3.1.
Let Assumptions 1–4 be satisfied. Then, for any ε > 0 there exists a constant δ* > 0 such that

lim sup_{n→∞} { P_{h_0}(ST_{n,J} = 1) + sup_{h∈H(δ*, r_{n,J})} P_h(ST_{n,J} = 0) } ≤ ε,   (3.1)

where the rate r_{n,J} is given by

r_{n,J} = n^{−1/2} s_J^{−1} J^{1/4} + J^{−p/d_x}.   (3.2)

Theorem 3.1 shows that the test ST_{n,J} attains the rate of testing r_{n,J}. This rate consists of a variance part and a bias part. The optimal choice of J requires knowledge of unknown mapping properties of the conditional expectation operator T and of the unknown smoothness of the true structural function h_0, as illustrated below. A central step in achieving this rate result is to establish a rate of convergence of the quadratic distance estimator D̂_J(h_0); see Theorem F.1 in the online appendix. We thus make use of the close connection between minimax optimal quadratic functional estimation and minimax optimal testing.

We differentiate between two degrees of ill-posedness, which are typically considered in the literature. The sieve L² measure of ill-posedness is defined as

τ_J = sup_{h∈Ψ_J, h≠0} ‖h‖_µ / ‖Th‖_{L²(W)} ≤ sup_{h∈Ψ_J, h≠0} ‖h‖_µ / ‖Π_K Th‖_{L²(W)} = s_J^{−1}.

We call the model (1.1)

1. mildly ill-posed if τ_j ∼ j^{ζ/d_x} for some ζ > 0;
2. severely ill-posed if τ_j ∼ exp(j^{ζ/d_x}/2) for some ζ > 0.

The next corollary provides concrete rates of testing when the dimension parameter J is chosen to balance variance and squared bias under classical smoothness conditions.

Corollary 3.1.
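The balancing behind Corollary 3.1 can be checked numerically. The sketch below is an illustration under assumptions, not part of the paper's procedure: it plugs the mildly ill-posed model s_J⁻¹ = J^{ζ/d_x} into r_{n,J} of (3.2), minimizes over J on a grid, and compares the minimum with the closed-form rate n^{−2p/(4(p+ζ)+d_x)}.

```python
import numpy as np

# Illustrative smoothness and ill-posedness parameters (assumed values).
p, zeta, d_x = 2.0, 1.0, 1.0

def min_rate(n):
    # r_{n,J} = n^{-1/2} s_J^{-1} J^{1/4} + J^{-p/d_x} with s_J^{-1} = J^{zeta/d_x},
    # minimized over an integer grid of sieve dimensions J.
    J = np.arange(1.0, 100000.0)
    return (n ** -0.5 * J ** (zeta / d_x + 0.25) + J ** (-p / d_x)).min()

def closed_form(n):
    # Corollary 3.1, mildly ill-posed case.
    return n ** (-2 * p / (4 * (p + zeta) + d_x))
```

At the balancing J both terms of r_{n,J} are equal, so the grid minimum should sit between one and roughly two times the closed-form rate.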
Let Assumptions 1–4 be satisfied. Then the rate of testing r_{n,J} given in (3.2) is of the following form:

1. Mildly ill-posed case: choosing J ∼ n^{2d_x/(4(p+ζ)+d_x)} implies

r_{n,J} = n^{−2p/(4(p+ζ)+d_x)};   (3.3)

2. Severely ill-posed case: choosing J ∼ (log n − ((2p + d_x/2)/ζ) log log n)^{d_x/ζ} implies

r_{n,J} = (log n)^{−p/ζ}.   (3.4)

If {a_n} and {b_n} are sequences of positive numbers, we use the notation a_n ≲ b_n if lim sup_{n→∞} a_n/b_n < ∞, and a_n ∼ b_n if a_n ≲ b_n and b_n ≲ a_n.

In the next result, we establish a lower bound for the rate of testing in each of the ill-posed scenarios considered in the previous corollary. Below, ⟨·,·⟩_µ denotes the inner product associated with L²_µ.

Theorem 3.2. Let Assumptions 1(iii) and 4 be satisfied. Assume that ‖Th‖²_{L²(W)} ≲ Σ_{j≥1} τ_j^{−2} ⟨h, ψ̃_j⟩²_µ for all h ∈ H and an orthonormal basis {ψ̃_j}_{j≥1} in L²_µ. Then for any ε > 0 there exists a constant δ* > 0 such that

lim inf_{n→∞} inf_{T_n} { P_{h_0}(T_n = 1) + sup_{h∈H(δ*, r_n)} P_h(T_n = 0) } ≥ 1 − ε,

where r_n is given by:

1. Mildly ill-posed case: r_n = n^{−2p/(4(p+ζ)+d_x)};
2. Severely ill-posed case: r_n = (log n)^{−p/ζ}.

From Corollary 3.1 and Theorem 3.2 we conclude that r_{n,J} is the minimax rate of testing once J is chosen to balance variance and squared bias. In particular, we conclude that the rate of testing is always nonparametric, in contrast to the estimation of quadratic functionals, where the √n-rate can also be achieved.

We propose a data-driven ST that rejects the null hypothesis H_0 = {h_0} ≠ ∅, for some known function h_0 satisfying (1.1), as soon as the normalized estimator D̂_J(h_0) is sufficiently large for at least one J. Specifically, we consider the data-driven test

ST_n = 1{ ∃ J ∈ Î_n such that n D̂_J(h_0) > η̂_J(α) √(v̂_J(h_0)) },   (3.5)

where η̂_J(α), v̂_J(h_0), and the index set Î_n are given in Subsection 2.2.

We define J_0 to be the smallest dimension parameter such that the variance dominates the squared bias within a √(log log n) term, that is,

J_0 = min{ J : J^{−2p/d_x} ≤ √(log log n) n^{−1} s_J^{−2} √J }.   (3.6)

Recall the definition of the index set Î_n given in (2.4), which relies on an upper bound Ĵ_max. We introduce a dimension parameter J̄, slowly growing with the sample size n, which controls the complexity of the ES index set Î_n. Specifically, Chen and Christensen [2015, Theorem 3.2] show that Ĵ_max ≤ J̄ holds with probability approaching one, where J̄ satisfies the rate restrictions imposed in the next assumption. We also make use of the notation ζ̄ = ζ_{J̄}.
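The scan construction in (3.5) can be sketched numerically. The following is a stylized illustration under assumptions, not the paper's implementation: the data are simulated, Chebyshev (cosine) sieves stand in for B-splines, v below is only a crude plug-in variance proxy for v̂_J, and the threshold eta is a fixed placeholder of order √(log log n) rather than the adjusted chi-squared critical value η̂_J(α).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated NPIV-style data; the simple null is h0(x) = sin(pi * x).
W = rng.uniform(-1, 1, n)
X = np.clip(0.7 * W + 0.3 * rng.normal(size=n), -1, 1)
h0 = lambda x: np.sin(np.pi * x)
Y = h0(X) + 0.2 * rng.normal(size=n)
u = Y - h0(X)                                   # residuals under the simple null

def loo_stat(J, K):
    """Leave-one-out U-statistic for the quadratic distance at sieve dimension J."""
    psi = np.cos(np.outer(np.arccos(X), np.arange(J)))   # Chebyshev sieve for h
    b = np.cos(np.outer(np.arccos(W), np.arange(K)))     # Chebyshev sieve for instruments
    Gb = b.T @ b / n                                     # sample analog of G_b
    S = b.T @ psi / n                                    # sample analog of S = E[b_K(W) psi_J(X)']
    # Sample analog of A = [S' G_b^{-1} S]^{-1} S' G_b^{-1}.
    A = np.linalg.solve(S.T @ np.linalg.solve(Gb, S), S.T @ np.linalg.solve(Gb, np.eye(K)))
    V = (A @ b.T).T                                      # row i is A b_K(W_i)
    Q = V @ V.T
    # Off-diagonal (leave-one-out) quadratic form: dropping the diagonal removes own-observation bias.
    D = (u @ Q @ u - np.sum(np.diag(Q) * u ** 2)) / (n * (n - 1))
    v = 2 * np.sum((Q * np.outer(u, u)) ** 2) / n ** 2   # crude variance proxy, an assumption
    return D, v, Q

# Scan over a geometric index set, rejecting if any normalized statistic is large.
I_n = [2, 4, 8]
eta = 3 * np.sqrt(np.log(np.log(n)))                     # placeholder critical value
reject = any(n * loo_stat(J, 2 * J)[0] > eta * np.sqrt(loo_stat(J, 2 * J)[1]) for J in I_n)
```

Only the leave-one-out structure and the scan decision mirror the text; all tuning choices above are placeholders.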
Assumption 5. (i) s_{J̄}^{−1} ζ̄ √((log n)/n) = o(1); (ii) for any α ∈ (0, 1) it holds that η̂_J(α) = O(√(log log n)) and (log log J)^c ≤ η̂_J(α) for some constant c > 0 and for all J_0 ≤ J ≤ J̄, with probability approaching one.

Assumption 5 imposes an upper bound on the growth of J̄, the population counterpart of the upper bound Ĵ_max of the set Î_n. Assumption 5(i) is a slight modification of Assumption 2(i), considered uniformly over J_0 ≤ J ≤ J̄. Assumption 5(ii) imposes a mild restriction on the critical values η̂_J(α) given in (2.5).

Theorem 3.3.
Let Assumptions 1–3 and 5 be satisfied. Then, for any ε > 0 there exists a constant δ° > 0 such that

lim sup_{n→∞} { P_{h_0}(ST_n = 1) + sup_{h∈H(δ°, r_n)} P_h(ST_n = 0) } ≤ ε,   (3.7)

where the rate r_n is given by

r_n = n^{−1/2} (log log n)^{1/4} s_{J_0}^{−1} J_0^{1/4}.   (3.8)

Theorem 3.3 establishes an upper bound for the testing rate of the adaptive structure-space test ST_n. The proof of Theorem 3.3 relies on a novel exponential bound for degenerate U-statistics based on sieve estimators. Adaptive testing for inverse problems was considered for deconvolution models (with known degree of ill-posedness) by Butucea et al. [2009]. In functional linear models, adaptive tests (under unknown, mild degree of ill-posedness) were proposed by Lei [2014]. In Gaussian white noise models, an adaptive test was proposed by Ingster et al. [2012] that also covers the severely ill-posed case but requires knowledge of the ill-posedness scenario. We now illustrate the upper bound under classical smoothness assumptions. Again, we distinguish between the mildly and severely ill-posed cases.

Corollary 3.2.
Let Assumptions 1–5 be satisfied. Then, the adaptive rate of testing r_n given in (3.8) satisfies:

1. Mildly ill-posed case: r_n = (√(log log n)/n)^{2p/(4(p+ζ)+d_x)};
2. Severely ill-posed case: r_n = (log n)^{−p/ζ}.

From Corollary 3.2 we see that in the mildly ill-posed case the adaptive ST attains the minimax rate of testing within a (log log n)-term. For adaptive testing without endogeneity, it is well known that such a (log log n)-term is required; see Spokoiny [1996]. In the severely ill-posed case, our adaptive test attains the exact minimax rate of testing and hence there is no price to pay for adaptation.

We extend the results from the previous subsection to the case of composite hypotheses and, in particular, allow for testing inequality constraints. Below, we discuss two different approaches for deriving critical values in the case of inequality-constrained tests. Both methods rely on cone properties imposed on the restricted set of functions. We use the notation Π^r_J for the projection onto H^r_J. Here, the set H^r_J is used to approximate the set of functions H^r ⊂ H which satisfy a conjectured restriction.

Remark 3.1 (Adaptive critical values for inequality constraints). In both cases, we rely on the assumption that H^r_J is a polyhedral cone. In this case, we may infer from Silvapulle and Sen [2005, Lemma 3.13.5] the existence of a collection of faces {H_1, …, H_L} such that the collection of their relative interiors {ri(H_1), …, ri(H_L)} forms a partition of H^r_J. Let P_l be the projection matrix onto the linear space spanned by H_l, where J_l = rank(P_l). Then, a Bonferroni correction of the adaptive critical values of Al Mohamad et al. [2018] gives

η̂_J(α) = Σ_{l=1}^L 1{Π^r_J ĥ_J ∈ ri(H_l)} (q(α/#(Î_n), J_l) − J_l)/√(2 J_l),

where ĥ_J is the unconstrained analog of (2.1) and we impose the restriction J_l ≥ 1.
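The face-dependent critical value of Remark 3.1 can be sketched in the simplest polyhedral-cone case, the nonnegativity cone {β : β ≥ 0}, whose faces are indexed by the active coordinates. This is an illustration under assumptions: the chi-squared quantile q(a, d) is replaced by the Wilson-Hilferty approximation so the sketch needs only the standard library, and the cone, data, and scan-set size are made up.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(a, d):
    # Wilson-Hilferty approximation to the 1-a quantile of chi-squared with d
    # degrees of freedom (a stand-in for an exact quantile routine).
    z = NormalDist().inv_cdf(1 - a)
    return d * (1 - 2 / (9 * d) + z * np.sqrt(2 / (9 * d))) ** 3

def eta_hat(beta_hat, alpha, n_scan):
    # Cone projection onto {beta >= 0}: coordinatewise positive part.
    proj = np.maximum(beta_hat, 0.0)
    # The face on which the projection lands is spanned by its positive coordinates;
    # J_l is its dimension, with the restriction J_l >= 1.
    J_l = max(int(np.sum(proj > 0)), 1)
    a = alpha / n_scan                     # Bonferroni correction over the scan set
    return (chi2_quantile(a, J_l) - J_l) / np.sqrt(2 * J_l)
```

A smaller face (fewer active coordinates) lowers J_l, and a larger scan set tightens the Bonferroni level, both of which move the critical value, which is the mechanism the remark describes.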
Remark 3.2 (Bootstrap critical values for inequality constraints). We propose a modification of the bootstrap procedure of Fang and Seo [2019] which imposes a cone condition on H^r. Below, Z_J denotes the sieve bootstrap score proposed by Chen and Christensen [2018]. We proceed in two steps:

Step 1.
Introduce a sequence of independent and identically distributed random variables {ω_i}_{i=1}^n drawn independently of the original data {(Y_i, X_i, W_i)}_{i=1}^n. Compute the bootstrap version of the quadratic distance estimator given by

D̂^B_J(ĥ_J) = 2/(n(n−1)) Σ_{1≤i<i′≤n} ω_i ω_{i′} Û_i Û_{i′}.   (3.9)

Step 2. For some γ_n ∈ (0, 1), we construct the 1 − γ_n quantile of {n D̂^B_J(ĥ_J)/√(v̂_J(ĥ_J))}, based on N bootstrap samples, denoted by τ̂_{n,1−γ_n}. (In the implementation of the procedure we use throughout the paper the choice γ_n = 0.1/log n, following Chernozhukov et al. [2013].) Set κ̂_J = ν_n c_n/τ̂_{n,1−γ_n}, where ν_n = n/√(v̂_J(ĥ_J)) and c_n is such that ‖(ĥ_J − h_0)/√(v̂_J(ĥ_J)) − Z_J‖_µ = o_p(c_n). Compute η̂_J(α) as the 1 − α/#(Î_n) quantile of {n D̃^B_J(ĥ_J)/√(v̂_J(ĥ_J))}, based on N bootstrap samples, where D̃^B_J(ĥ_J) coincides with (3.9) but with Û_i replaced by Û_i − κ̂_J ĥ_J(X_i) − Π^r_J(Z_J − κ̂_J ĥ_J)(X_i).

(A cone C is called polyhedral if there is some matrix M such that C = {β ∈ R^J : Mβ ≥ 0}.)

In the following, we impose restrictions on the complexity of the set H^r_J. As we rely on empirical process theory, we make use of that literature's notation. Let N_[](t, H, ‖·‖_µ) denote the smallest number of brackets of size t (under ‖·‖_µ) required to cover H. We further denote H^r_J(∆_{J,n}) = {h ∈ H^r_J : ‖h − h_r‖_∞ ≤ ∆_{J,n}} for some ∆_{J,n} > 0. Below, η_J(α) denotes a deterministic sequence satisfying η_J(α) = O(√(log log n)) and (log log J)^c ≤ η_J(α) for some constant c > 0 and all J_0 ≤ J ≤ J̄.

Assumption 6. (i) For any h ∈ H there exists a sequence (∆_{J,n})_{n≥1} satisfying ĥ^r_J ∈ H^r_J(∆_{J,n}) with probability approaching one and ∫_0^1 √(N_[](tC, H^r_J(∆_{J,n}), ‖·‖_µ)) dt ≤ C_{J,n}, where Σ_{J∈Î_n} C_{J,n} ∆_{J,n}/(log log J) = o_p(1) and max_{J∈Î_n} ∆_{J,n} ζ_J (log J) = o_p(1). (ii) For some J ∈ Î_n and any ε > 0 there exist constants c, C > 0 such that lim sup_{n→∞} sup_{h∈H_0} P_h(η̂_J(α) < C η_J(α)) < ε and lim sup_{n→∞} sup_{h∈H(δ°, r_n)} P_h(η̂_J(α) > c η_J(α)) < ε.

Assumption 6(i) is a mild restriction on the complexity of the set of functions H^r_J(∆_{J,n}), imposed through rate conditions on ∆_{J,n}. These conditions determine the rate of convergence of the constrained sieve estimator to any function h_r satisfying a conjectured restriction captured by H^r. A similar condition was imposed in Assumption C.2 of Chen and Pouzo [2012]. Further note that the critical value estimators introduced in Remarks 3.1 and 3.2 satisfy Assumption 6(ii) under mild conditions. In the case of the adaptive critical values (see Remark 3.1), the cone projection leads to weakly larger critical values than the one given in (2.5), since η̂_J(α) is now determined by the dimension of the face on which the cone projection lands. In the case of the bootstrap critical values (see Remark 3.2), Assumption 6(ii) can be justified by following Fang and Seo [2019].

The next result establishes an upper bound for the rate of testing of ŜT_n.

Theorem 3.4. Let Assumptions 1–6 be satisfied. Then, for any ε > 0 there exists a constant δ° > 0 such that

lim sup_{n→∞} { sup_{h∈H_0} P_h(ŜT_n = 1) + sup_{h∈H(δ°, r_n)} P_h(ŜT_n = 0) } ≤ ε,   (3.10)

where the rate r_n is given in Theorem 3.3.

From Theorem 3.4 we see that ŜT_n attains the rate of testing r_n, which is the same rate of testing obtained by ST_n in the case of simple hypotheses. Under the restrictions imposed in Assumption 6, we thus conclude that estimation of restricted functions does not imply slower rates of testing. In the definition of ŜT_n, the dimension parameter for estimating the structural function under the conjectured restriction is set equal to that of the unrestricted estimator of the structural function. In this sense, our inference results do not require undersmoothing conditions.
Finally, we note that the test statistic can be trivially modified for tests where the constrained functions h might be estimated using a fixed, finite-dimensional sieve space.

We now propose L² confidence sets based on inverting the structure-space test. The resulting confidence region imposes conjectured restrictions on the function of interest h. The (1 − α)-confidence set is given by

C_n(α) = { h ∈ H^r : n D̂_J(h) ≤ η̂_J(α) √(v̂_J(h)) for all J ∈ Î_n }.   (3.11)

The following corollary exploits our previous results and the introduced assumptions to characterize the asymptotic size and power properties of our procedure.

Corollary 3.3.
Let Assumptions 1–6 be satisfied. Then, for any α > 0 it holds that

lim sup_{n→∞} sup_{h∈H_0} P_h(h ∉ C_n(α)) ≤ α   (3.12)

and there exists a constant δ° > 0 such that

lim inf_{n→∞} inf_{h∈H(δ°, r_n)} P_h(h ∉ C_n(α)) ≥ 1 − α.   (3.13)

Corollary 3.3 shows that the L² confidence set C_n(α) controls size uniformly over the class of functions H_0. Moreover, the result establishes power uniformly over the class H(δ°, r_n). We immediately see from Corollary 3.3 that the size of the L² confidence ball depends on the degree of ill-posedness captured by the minimal singular values s_J.

Corollary 3.4. Let Assumptions 1–6 be satisfied. Then, we have

lim sup_{n→∞} sup_{h∈H_0} P_h( diam(C_n(α)) ≥ C n^{−1/2} (log log n)^{1/4} s_{J_0}^{−1} J_0^{1/4} ) = 0

for some constant C > 0, where the dimension parameter J_0 is given in (3.6).

Corollary 3.4 yields a confidence region with diameter of order n^{−1/2} (log log n)^{1/4} s_{J_0}^{−1} J_0^{1/4} for confidence sets based on inverting the structure-space test statistic. We see that the diameter of the confidence set does not adapt to the regularity of submodels. The following remark illustrates that the gain from adaptation is expected to be minor in inverse problems.

Remark 3.3 (Adaptive Confidence Sets). Consider the mildly ill-posed case where the degree of ill-posedness is fixed at ζ and we wish to adapt over the function classes H(p_1) and H(p_2) with smoothness parameters p_2 > p_1. Suppose we have an adaptive estimator which attains in the mildly ill-posed case the L² rate of convergence n^{−p_2/(2(p_2+ζ)+d_x)}, in comparison to the rate of testing n^{−p_1/(2(p_1+ζ)+d_x/2)}. It is known in statistical regression models (see Robins and Van Der Vaart [2006] and Cai and Low [2006]) that rate adaptation is only possible over submodels (indexed by p_2) such that the rate of estimation over the submodel is larger than the rate of testing over the "supermodel". Specifically, we obtain the restriction

n^{−p_1/(2(p_1+ζ)+d_x/2)} ≲ n^{−p_2/(2(p_2+ζ)+d_x)}.

This condition translates into the smoothness restriction p_2 < p_1 (2ζ + d_x)/(2ζ + d_x/2), and hence, in this sense, adaptation is only possible when the submodel H(p_2) satisfies

p_2 ∈ ( p_1, p_1 (2ζ + d_x)/(2ζ + d_x/2) ),

which shows that for large values of ζ (or of the dimension d_x) adaptation with respect to H(p_2) can only be achieved over a very limited range of smoothness p_2.
4. Adaptive Inference via the Image-Space Test
In this section we consider a data-driven test in the image space of the conditional expectation mapping. Subsection 4.1 proposes a data-driven image-space test (IT) statistic. Subsection 4.2 establishes the minimax rate of testing of the adaptive IT.

4.1. The Data-driven Image-Space Test Statistic
We consider the set of functions satisfying H_0 = {h ∈ H : m(·, h) = 0} ∩ H^r ≠ ∅, where, in this section, H^r is a finite-dimensional, compact function space. We do not address IT with an infinite-dimensional restricted set of functions here, as this would require the choice of an additional tuning parameter.

In contrast to the previous section, we specify alternative models through deviations from the conditional moment restriction. For convenience of notation, we introduce the conditional moment function m(·, h_r) = E[ρ(Y, h_r(X)) | W = ·]. For the image-space test, we consider a class of functions which are separated from H_0 in the sense

M(δ, r_n) := { h ∈ H : m(W, h) = 0 and E[m²(W, h_r)] ≥ δ² r_n² }.

We propose an image-space test based on a leave-one-out sieve estimator of the quadratic functional E[m²(W, h_r)] given by

D̂_K(ĥ_r) = 2/(n(n−1)) Σ_{1≤i<i′≤n} ρ(Y_i, ĥ_r(X_i)) b_K(W_i)′ (B′B/n)⁻ b_K(W_{i′}) ρ(Y_{i′}, ĥ_r(X_{i′})).   (4.1)

The image-space test is then given by

IT_n = 1{ ∃ K ∈ Î_n such that n D̂_K(ĥ_r) > η̂_K(α) √(v̂_K(ĥ_r)) },   (4.2)

based on η̂_K(α) as given in (2.5) and the estimated normalization factor

v̂_K(h) = ‖ n⁻¹ Σ_{i=1}^n ρ²(Y_i, h(X_i)) (B′B/n)^{−1/2} b_K(W_i) b_K(W_i)′ (B′B/n)^{−1/2} ‖²_F.   (4.3)

The image-space test IT_n also relies on the ES selection method to determine the index set Î_n as given in (2.4), yet its upper bound has to be modified as follows. We replace the upper bound Ĵ_max of the index set Î_n by the estimator

K̂_max = min{ K > J : ζ(K) √(ℓ(K)(log n)/n) ≥ s_min((B′B/n)^{−1/2}) },

where ℓ(K) = 0.1K.

4.2. Adaptive Testing

For the IT case, the index set Î_n depends on the upper bound K̂_max. As in the previous section, we may assume that K̂_max ≤ K̄ with probability approaching one, where K̄ satisfies the following rate conditions.
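The leave-one-out structure of (4.1) and the Frobenius-norm normalization of (4.3) can be sketched as follows. This is an illustration under assumptions: the residuals rho and the power-series instrument sieve are simulated stand-ins, not estimates from a constrained NPIV fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 400, 6

W = rng.uniform(0, 1, n)
rho = rng.normal(size=n)                        # stand-ins for rho(Y_i, h_r(X_i))
B = np.column_stack([W**k for k in range(K)])   # stylized instrument sieve b_K(W_i)'

Binv = np.linalg.solve(B.T @ B / n, np.eye(K))  # (B'B/n)^{-1}
M = B @ Binv @ B.T                              # M[i,i'] = b_K(W_i)'(B'B/n)^{-1} b_K(W_i')

# (4.1): off-diagonal (leave-one-out) part of the quadratic form, which removes
# the own-observation bias of the plug-in estimator of E[m^2(W, h_r)].
D_K = (rho @ M @ rho - np.sum(np.diag(M) * rho**2)) / (n * (n - 1))

# (4.3): squared Frobenius norm of the residual-weighted second-moment matrix.
# Any square root of (B'B/n)^{-1} works here, since square roots differ by an
# orthogonal factor and the Frobenius norm is invariant under it.
R = np.linalg.cholesky(Binv)
BR = B @ R                                      # row i is b_K(W_i)' R
Omega = BR.T @ (BR * (rho**2)[:, None]) / n
v_K = np.sum(Omega**2)
```

The vectorized off-diagonal sum is exactly the U-statistic double sum over i ≠ i′, divided by n(n−1).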
We further denote H^r(∆_n) = {h ∈ H^r : ‖h − h_r‖_∞ ≤ ∆_n} for some ∆_n > 0, where h_r = arg min_{h∈H^r} ‖m(·, h)‖_{L²(W)}. Further, we define b̃_K = G_b^{−1/2} b_K with entries b̃_k, 1 ≤ k ≤ K.

Assumption 7. (i) ζ_{b,K̄} √((log n)/n) = o(1); (ii) ‖Π_K m(·, h_r) − m(·, h_r)‖_{L²(W)} = O(K^{−γ/d_w}) for some γ > 0 such that ζ_{b,K} √(log K) = O(K^{γ/d_w}); (iii) the eigenvalues of G_b are uniformly bounded from below and above; (iv) ∫_0^1 √(N_[](w, H^r, ‖·‖_{L²(Z)})) dw ≤ C for some constant C > 0; (v) ‖E_h[(ρ(Y, h(X)) − ρ(Y, h_r(X))) b̃_K(W)]‖ ≤ ‖h − h_r‖_µ for all h ∈ H^r(∆_n); and (vi) ĥ_r ∈ H^r(∆_n) with probability approaching one such that max_{1≤k≤K̄} E sup_{h∈H^r(∆_n)} |(ρ(Y, h(X)) − ρ(Y, h_r(X))) b̃_k(W)|² ≤ C ∆_n^κ for some κ ∈ (0, 1] and ∆_n Σ_{K∈Î_n} √K/(log log K) = o_p(1).

Assumption 7(ii) does not impose regularity on the structural function h but rather on the conditional mean function m. Assumption 7(iv) restricts the complexity of the set of functions H^r by imposing a finite entropy integral condition. Assumptions 7(v)(vi) were similarly imposed in Chen and Pouzo [2012]. We are now in the position to establish an upper bound for the type I and type II errors of the data-driven test, uniformly over the function class M(δ°, r_n).

Theorem 4.1. Let Assumptions 1, 5(ii) with J replaced by K, and 7 be satisfied. Then, for any ε > 0 there exists a constant δ° > 0 such that

lim sup_{n→∞} { sup_{h∈H_0} P_h(ÎT_n = 1) + sup_{h∈M(δ°, r_n)} P_h(ÎT_n = 0) } ≤ ε,   (4.4)

where the rate r_n is given by r_n = (√(log log n)/n)^{2γ/(4γ+d_w)}.

Remark 4.1 (Formulation of Hypotheses). We see from Theorem 4.1 that the IT rate attains the usual minimax rate of testing within a (log log n) term and, in particular, does not suffer from the ill-posedness of the underlying inverse problem. This is due to the definition of the class of alternative functions M(δ°, r_n), which measures deviations from the null by the squared norm of the conditional mean function under consideration. Note that it is possible to achieve the same rate of testing as in ST once we are willing to impose link conditions between ‖h − h_r‖_µ and ‖m(·, h_r)‖_{L²(W)}. In the interest of space, we do not provide this result explicitly in this paper.

Remark 4.2 (Comparison of Testing Rates). From Corollary 3.1 we see that only the rate of testing of ST suffers from the ill-posed inverse problem. Nevertheless, in the mildly ill-posed case, the rate of ST can even be better than that of IT, that is,

n^{−2p/(4(p+ζ)+d_x)} < n^{−2γ/(4γ+d_w)}

if and only if the dimension of W satisfies d_w > (γ/p)(4ζ + d_x). In contrast, in the severely ill-posed case the rate of ST is always slower than the IT rate.
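The threshold in Remark 4.2 is a short arithmetic fact that can be verified directly. The sketch below compares the two rate exponents for illustrative parameter values (the specific numbers are made up).

```python
# ST's mildly ill-posed exponent is 2p/(4(p+zeta)+d_x); IT's is 2*gamma/(4*gamma+d_w).
# ST is faster exactly when d_w > (gamma/p)*(4*zeta + d_x).
def st_exponent(p, zeta, d_x):
    return 2 * p / (4 * (p + zeta) + d_x)

def it_exponent(gamma, d_w):
    return 2 * gamma / (4 * gamma + d_w)

p, zeta, d_x, gamma = 1.0, 0.5, 1.0, 1.0
threshold = (gamma / p) * (4 * zeta + d_x)   # equals 3 for these values
```

At d_w equal to the threshold the two exponents coincide; above it the ST exponent is strictly larger.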
5. Empirical Applications and Further Simulation Studies
We present three empirical applications of our adaptive structure-space tests for NPIV models. The first tests for monotonicity of demand for differentiated products in IO. The second tests for monotonicity of a gasoline demand function. The third tests for monotonicity or parametric specification of Engel curves for non-durable goods consumption. In all of the empirical applications, we implement the adaptive test ŜT_n given in (2.2), using the adaptive critical values η̂_J(α) presented in Remark 3.1 (see Appendix C for bootstrap critical values). Throughout this section, we use a quadratic B-spline basis with a varying number of knots to approximate h and set K(J) = 2J. For any sieve dimension J ∈ Î_n (the ES index set), we denote the corresponding "standardized test value" as

Ŵ_J := n D̂_J(ĥ^r_J) / (η̂_J(α) √(v̂_J(ĥ^r_J)))

at the nominal level α = 0.05. The null hypothesis is rejected whenever Ŵ_J > 1 for some J ∈ Î_n. The tables below report a set 𝒥̂ ⊂ Î_n, which equals arg max_{J∈Î_n} Ŵ_J when the null is not rejected at the nominal level α = 0.05, and equals {J ∈ Î_n : Ŵ_J > 1} when the null is rejected at the nominal level α = 0.05. Let Ĵ denote the minimal element of 𝒥̂. In all the tables below, we report Ŵ_Ĵ and its corresponding p-value, which should, by Bonferroni correction, be compared to the nominal level 0.05 divided by the cardinality of Î_n, which differs across the applications presented below.

5.1.1. Adaptive ST Testing for a Multi-Product Demand System

Recently, Compiani [2019] applies nonparametric estimation of a NPIV model as an alternative to the BLP semiparametric specification. He directly imposes shape restrictions to reduce dimensionality in nonparametric estimation of own-price and cross-price elasticities of multi-product demand. Following Compiani [2019], we use the Nielsen scanner data set on demand for strawberries in California.

Our application here is to provide adaptive nonparametric hypothesis tests of the monotonicity of demand in its own price. First notice that by imposing an index restriction and a connected-substitutes assumption the empirical model can be written in NPIV form, following Berry and Haile [2014] and Compiani [2019]:

Y_o = h(P_o, P_no, P_other, S_o, In) − U,   E[U | W_o, W_no, W_other, W_{S_o}, In] = 0,

where Y_o denotes a measure of taste for organic products, S_o denotes the share of the organic products, In income, and U unobserved shocks for organic produce. The vector (P_o, P_no, P_other) denotes the prices of organic strawberries, non-organic strawberries, and non-strawberry fresh fruit, respectively. To account for possible endogeneity of prices, we follow Compiani [2019] and use Hausman-type instrumental variables denoted by (W_o, W_no, W_other) and shipping-point spot prices W_{S_o} as a proxy for the wholesale prices faced by retailers.

Income      H0: h is decreasing in P_o                H0: h is increasing in P_o
groups      Ŵ_Ĵ      p val.   reject H0?   𝒥̂          Ŵ_Ĵ    p val.   reject H0?   𝒥̂
low         0.552    0.103    no           {–}        –      –        yes          {–}
middle      0.234    0.246    no           {–}        –      0.000    yes          {–, –, –}
high        1.526    0.000    yes          {–, –, –}  –      0.000    yes          {–, –, –, –}

Table 2: Adaptive testing for monotonic demand for organic products.
Data on prices and quantities come from the 2014 Nielsen scanner data set. For each market, the most granular unit of observation in the Nielsen data is a UPC (i.e., a specific bar code). We consider a "low income", "middle income", and "high income" group based on individuals between the 5%–25%, 40%–60%, and 75%–95% quantiles, respectively, of the income distribution. The resulting sample sizes for the three subgroups are 1509, 1491, and 2093. We implement our adaptive test ŜT_n by making use of a semiparametric specification of the structural demand function: we consider the tensor product of quadratic B-splines q_{J₁}(P_o) and the vector (1, In), hence J = 2J₁. The other variables are accounted for separately, that is, P_no and P_other are fixed with two interior knots and market shares S_o enter linearly. The cardinality of the index set Î_n changes across the subgroups considered: the set contains three elements for the low and middle income groups but four elements for the high income group.

According to Table 2, our adaptive test rejects the null of a weakly decreasing demand curve (in own price) at the nominal level α = 0.05 for the high income group, but fails to reject decreasing demand for the low and middle income groups. In addition, our adaptive test rejects the null of a weakly increasing demand curve (in own price) for all income groups.
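The reporting rule used in the tables can be stated compactly. The sketch below is a stylized version of that rule with made-up standardized test values; it only reproduces the decision logic, not the estimation behind Ŵ_J.

```python
import numpy as np

def report(W_vals, I_n):
    # W_vals[i] is the standardized test value for sieve dimension I_n[i].
    # Reject when some W_J > 1; report {J : W_J > 1} in that case,
    # and the argmax of W_J over the scan set otherwise.
    rejected = [J for J, w in zip(I_n, W_vals) if w > 1]
    if rejected:
        return True, rejected
    best = I_n[int(np.argmax(W_vals))]
    return False, [best]
```

The minimal element of the reported set then plays the role of Ĵ, whose p-value is compared against the Bonferroni-corrected level 0.05 divided by the cardinality of the scan set.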
In this subsection, we revisit the question of whether gasoline demand is monotonically decreasing in its own price. We consider the following partially linear specification of the demand function:

Y = h(P, In) + γ′H + U,   E[U | W, In, H] = 0,

where Y denotes annual log gasoline consumption of a household, P the log price of gasoline (average local price), In log household income, H control variables (such as the log age of the household respondent, log household size, the log number of drivers, and the number of workers in the household), and W the distance to a major oil platform, used as an instrumental variable. We allow the price P to be endogenous, but assume that (In, H) is exogenous.
Income      H0: h is decreasing in P                 H0: h is increasing in P
groups      Ŵ_Ĵ      p val.   reject H0?   𝒥̂         Ŵ_Ĵ    p val.   reject H0?   𝒥̂
low         0.433    0.111    no           {–}       –      –        yes          {–, –}
high        11.703   0.000    yes          {–, –}    –      –        yes          {–, –, –}

Table 3: Adaptive testing for monotonicity of gasoline demand.
Consumer theory requires downward-sloping compensated demand curves. In the following we provide a test for monotonicity of the uncompensated (Marshallian) demand curve. The data are from the 2001 National Household Travel Survey (NHTS), which surveys, at the household level, the civilian noninstitutionalized population in the United States. Following Blundell et al. [2017], we restrict the analysis to households with a white respondent, two or more adults, at least one child under age 16, and at least one driver. The resulting sample contains 3,640 observations. (We thank Matthias Parey for sharing the dataset with us. We refer the reader to Blundell et al. [2017] for a detailed description of the data.)

Chetverikov and Wilhelm [2017] and Freyberger and Reeves [2019] have used this dataset to estimate gasoline demand by assuming a decreasing demand curve. Instead, we

Figure 4:
Estimated demand curves for the low income group using Ĵ = 8: constrained estimator (blue solid line) and unconstrained estimator (red dashed line). LHS: dotted blue lines show 95% uniform CBs based on the constrained estimator. RHS: dotted red lines show 95% uniform CBs based on the unconstrained estimator. All other variables are fixed at their median levels (implying $32,500 of income).

aim at testing the monotonicity of the gasoline demand here. We consider a semiparametric specification similar to theirs to approximate the unknown function h; that is, our test is based on the tensor product of quadratic B-splines q_{J₁}(P) and the vector (1, In), hence J = 2J₁. To analyze heterogeneity in demand across income levels, we make use of two different subsamples of the data set. We consider a sample of n = 803 "low income" households whose income is below the first quartile, and a sample of n = 1369 "high income" households whose income is weakly above the third quartile. When computing the ES index set, we obtain an index set Î_n with two elements for the low income group and three elements for the high income group.

According to Table 3, our adaptive test rejects the null of weakly decreasing gasoline demand at the nominal level α = 0.05 for the high income group, but fails to reject decreasing demand for the low income group. In addition, our adaptive test rejects the null of a weakly increasing gasoline demand curve for both income groups.

We illustrate our testing results in Figures 4 and 5, which show the graphs of the restricted and unrestricted NPIV estimators for the low and high income groups, respectively. The estimators are implemented using the sieve dimension Ĵ = 8 in both groups. Variables other than price are fixed at their median level in each subgroup. In

Figure 5:
Estimated demand curves for the high income group using (cid:98) J = 8: constraint estimator(blue solid line) and unconstraint estimator (red dashed line). LHS: Dotted blue linesshow 95% uniform CBs based on constrained estimator. RHS: Dotted red lines show95% uniform CBs based on unconstrained estimator. All other variables are fixed totheir median levels (implying $120,000 of income). particular, the median income is given by roughly $32,500 and $120,000 for the two incomegroups considered. We also provide 95% bootstrap uniform confidence bands (CBs), using1000 bootstrap iterations. Both figures are in line with our adaptive testing results reportedin Table 3, that is, only the demand curves for high income households seem to violate amontonically decreasing shape. For high income households the unrestricted demand curvesare slightly outside of the 95% confidence bands of the restricted NPIV estimator. Engel curves play a central role in the analysis of consumer demand and describe thehousehold budget share Y (cid:96) for non-durable goods (cid:96) as a function of household log-totalexpenditure X . Blundell et al. [2007] estimated a system of nonparametric Engel curvesas functions of endogenous log-total expenditure and family size, using log-gross earningsof the head of household as an instrument W . We use a subset of their data from the1995 British Family Expenditure Survey, with the head of household aged between 20 and55 and in work, and household with one or two children. This leaves a sample of size n = 1027. As an illustration we consider Engel curves h (cid:96) ( X ) for four non-durable goods (cid:96) :“food in”, “fuel”, “travel”, and “leisure”: E [ Y j − h (cid:96) ( X ) | W ] = 0. We use the same quadratic29-spline basis with up to 3 knots to approximate all the Engel curves. Hence the index set (cid:98) I n = { , , , } is the same for the different Engel curves. Table 4 reports the adaptivetest for weak monotonicity of Engel curves. 
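Both applications rest on the same sieve ingredient: a quadratic B-spline basis, interacted with a group indicator in the gasoline application. A minimal, self-contained sketch of that construction follows; the knot locations, the simulated inputs, and the interaction with the indicator are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def bspline_basis(x, interior_knots, degree=2):
    """Quadratic (default) B-spline basis on [0, 1] via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    t = np.concatenate([np.zeros(degree + 1),
                        np.asarray(interior_knots, dtype=float),
                        np.ones(degree + 1)])
    m = len(t) - 1
    # Degree-0 (indicator) functions on the knot intervals.
    B = np.zeros((x.size, m))
    for j in range(m):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    B[x == 1.0, m - degree - 1] = 1.0  # close the right endpoint
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        Bn = np.zeros((x.size, m - d))
        for j in range(m - d):
            if t[j + d] > t[j]:
                Bn[:, j] += (x - t[j]) / (t[j + d] - t[j]) * B[:, j]
            if t[j + d + 1] > t[j + 1]:
                Bn[:, j] += (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) * B[:, j + 1]
        B = Bn
    return B  # shape: (len(x), len(interior_knots) + degree + 1)

# Tensor the price basis with (1, group indicator), doubling the dimension.
rng = np.random.default_rng(0)
price = rng.uniform(size=200)
high_inc = rng.integers(0, 2, size=200)
qJ = bspline_basis(price, [0.25, 0.5, 0.75])      # J = 6 basis functions
sieve = np.hstack([qJ, qJ * high_inc[:, None]])   # total dimension 2J = 12
```

Varying the number of interior knots traces out the candidate sieve dimensions J over which the exponential-scan index set Î_n searches.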
                 H0: h0 is increasing                  H0: h0 is decreasing
Goods        Ŵ_Ĵ      p value   reject H0?   Ĵ         Ŵ_Ĵ      p value   reject H0?   Ĵ
"food in"    1.347    0.002     yes          {·}       -0.269   0.821     no           {·}
"fuel"       4.484    0.000     yes          {·, ·}    ·        ·         ·            {·}
"travel"     2.074    0.000     yes          {·}       ·        ·         ·            {·}
"leisure"    0.295    0.151     no           {·}       ·        ·         ·            {·, ·}

Table 4: Adaptive testing for monotonicity of Engel curves.
Table 4 reports that our test rejects increasing Engel curves for the "food in", "fuel", and "travel" categories, and also rejects a decreasing Engel curve for "leisure", at the 0.05 nominal level. Previously, to decide whether the Engel curves are strictly monotonic, estimated derivatives of these functions together with their 95% uniform confidence bands were provided in Chen and Christensen [2018, Figure 4]. Those confidence bands contain zero over almost the whole support of household expenditure. It is interesting to see that our adaptive test is more informative about monotonicity in certain directions that are not obvious from their uniform confidence bands.

                 H0: h0 is linear                      H0: h0 is quadratic
Goods        Ŵ_Ĵ      p value   reject H0?   Ĵ         Ŵ_Ĵ      p value   reject H0?   Ĵ
"food in"    -0.169   0.644     no           {·}       ·        ·         ·            {·}
"fuel"       1.174    0.004     yes          {·}       -0.089   0.502     no           {·}
"travel"     1.052    0.007     yes          {·}       -0.108   0.531     no           {·}
"leisure"    0.536    0.074     no           {·}       ·        ·         ·            {·}

Table 5: Adaptive testing for linear/quadratic specification of Engel curves.
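In the notation of the moment restriction E[Y_ℓ − h_ℓ(X) | W] = 0, the two composite nulls reported in Table 5 can be written schematically as parametric restrictions on h_ℓ (with β's denoting unknown coefficients; this is a summary of the hypotheses, not a formula from the paper):

```latex
H_0^{\mathrm{lin}}:\; h_\ell(x) = \beta_{0,\ell} + \beta_{1,\ell}\, x
\qquad\text{versus}\qquad
H_0^{\mathrm{quad}}:\; h_\ell(x) = \beta_{0,\ell} + \beta_{1,\ell}\, x + \beta_{2,\ell}\, x^2,
```

each maintained jointly with the conditional moment restriction E[Y_ℓ − h_ℓ(X) | W] = 0.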
The most popular class of parametric demand systems is the almost ideal class, pioneered by Deaton and Muellbauer [1980], where budget shares are assumed to be linear in log-total expenditure. Banks et al. [1997] proposed a popular extension of this model that includes a squared term in log-total expenditure. Table 5 presents tests of either a linear or a quadratic specification of the Engel curves for the four goods considered above. According to this table, at the nominal level α = 0.05 our adaptive ST fails to reject the quadratic form for all the goods, while it rejects a linear Engel curve for the fuel and travel goods.

5.2. A Monte Carlo Study: Testing for Parametric Restrictions
In this subsection, we compare the finite-sample performance of the adaptive structure-space test (ŜT_n) and the adaptive image-space test (ÎT_n) when h_0 takes a parametric form under the null. The results are based on 5000 Monte Carlo replications for every experiment. We consider two designs similar to the simulation setup in (2.7). First, we set X_i = Φ(X*_i) and W_i = Φ(W*_i), where the random vector (X*_i, W*_i, U_i)′ is drawn from a mean-zero trivariate normal distribution in which X*_i and W*_i have unit variance, U_i has variance σ²_U, the covariance between X*_i and W*_i equals ξ, the covariance between X*_i and U_i is nonzero (inducing endogeneity), and W*_i is uncorrelated with U_i; we refer to this as design (5.1). Second, we consider a multivariate conditioning variable W = (W_1, W_2): we set X_i = Φ(X*_i), W_{1i} = Φ(W*_{1i}), and W_{2i} = Φ(W*_{2i}), where (X*_i, W*_{1i}, W*_{2i}, U_i)′ is mean-zero multivariate normal, the covariance between X*_i and each of W*_{1i}, W*_{2i} equals ξ, and the instruments are mutually uncorrelated and uncorrelated with W*-components other than through X*_i; we refer to this as design (5.2). In both designs, (5.1) and (5.2), we let σ_U = 0.2 and generate Y_i according to (2.6) with the structural function

    h_0(x) = −x/(·) + c_A x² + (c_B x)³,

where (·) denotes a fixed positive constant and the constants c_A and c_B vary across the experiments below. Specifically, we consider either a linear null, (c_A, c_B) = (0, 0), or a quadratic null, (c_A, c_B) = (1, 0); under the alternative we set (c_A, c_B) = (1, 1) when the null is linear and (c_A, c_B) = (1, 5) when the null is quadratic.

The simulation sample size is n = 500. We implement the adaptive structure-space test ŜT_n given in (2.2) using quadratic B-spline basis functions with a varying number of knots. The number of knots varies within the index set Î_n as implemented in the last subsection, again with K(J) = 2J. In addition, we implement the adaptive image-space test ÎT_n given in (4.2) using quadratic B-spline basis functions with a varying number of knots, where h_0 is replaced by the 2SLS estimator.

In Table 6, we report the empirical rejection probabilities of the test statistics ŜT_n and ÎT_n and of their bootstrapped versions ŜT^B_n and ÎT^B_n, with 200 bootstrap replications for the bootstrapped critical values. Again, the adaptive tests and their respective bootstrapped versions perform similarly in terms of size and power.

Design            ξ     Null (c_A,c_B)  Alt (c_A,c_B)   ŜT^B_n  ŜT_n   Ĵ_ST   ÎT^B_n  ÎT_n   K̂_IT   t-test
(5.1), d_x = d_w  0.·   (0, 0)                          0.046   0.055  3.92   0.051   0.047  3.81   0.026
                        (1, 0)                          0.022   0.031  4.17   0.029   0.024  4.14   0.001
                        (0, 0)          (1, 0)          0.072   0.088  3.78   0.084   0.080  3.76   0.084
                        (0, 0)          (1, 1)          0.272   0.279  3.40   0.252   0.245  3.54   0.380
                        (1, 0)          (1, 5)          0.092   0.120  4.16   0.038   0.037  4.12   0.087
                  0.·   (0, 0)                          0.046   0.049  4.13   0.042   0.049  3.82   0.045
                        (1, 0)                          0.026   0.029  4.53   0.023   0.026  4.12   0.027
                        (0, 0)          (1, 0)          0.185   0.167  3.68   0.184   0.198  3.57   0.293
                        (0, 0)          (1, 1)          0.810   0.815  3.06   0.802   0.822  3.14   0.912
                        (1, 0)          (1, 5)          0.515   0.538  4.08   0.295   0.330  4.12   0.728
(5.2), d_x < d_w  0.·   (0, 0)                          0.052   0.052  3.98   0.041   0.039  7.82   0.035
                        (1, 0)                          0.030   0.032  4.20   0.028   0.032  8.12   0.007
                        (0, 0)          (1, 0)          0.111   0.115  3.79   0.061   0.052  7.73   0.121
                        (0, 0)          (1, 1)          0.430   0.461  3.31   0.113   0.116  7.42   0.536
                        (1, 0)          (1, 5)          0.321   0.340  4.28   0.031   0.032  8.17   0.255
                  0.·   (0, 0)                          0.042   0.050  4.10   0.036   0.040  7.84   0.045
                        (1, 0)                          0.026   0.031  4.43   0.033   0.034  8.13   0.034
                        (0, 0)          (1, 0)          0.209   0.206  3.61   0.064   0.063  7.46   0.322
                        (0, 0)          (1, 1)          0.874   0.883  3.04   0.370   0.354  6.54   0.942
                        (1, 0)          (1, 5)          0.687   0.700  4.08   0.054   0.052  8.25   0.797

Table 6: Empirical rejection probabilities of ŜT^B_n, ŜT_n, ÎT^B_n, ÎT_n, and the t-test (of the hypothesis (c_A, c_B) = (0, 0) if the null is linear and of (c_A, c_B) = (1, 0) if the null is quadratic) under the 5% nominal level, n = 500. Rows without an "Alt" entry report empirical size.

All these adaptive tests are compared with the asymptotic t-test of the hypothesis c_A = 0 if the null is linear and of c_B = 0 if the null is quadratic. We see that the adaptive tests are overall not very conservative in comparison with the t-test, which imposes the specific parametric restrictions. The power of all tests increases as ξ increases (i.e., as the instrument gets stronger). Remarkably, the adaptive ST performs about as well as the asymptotic t-test does. For the first design, (5.1) (with d_x = d_w), the adaptive ST is somewhat more powerful than the adaptive IT when the true model is cubic, i.e., (c_A, c_B) = (1, 1); for the second design, (5.2) (with d_x < d_w), the adaptive ST is much more powerful than the adaptive IT. This finding is consistent with our theoretical results as described in Remark 4.2. This severe power difference also holds when cubic or quartic B-splines are used as the vector of instrument basis functions b_K(W).

6. Conclusion

In this paper, we propose data-driven, minimax rate-adaptive hypothesis tests on a structural function in semi-nonparametric conditional moment restrictions. Our main focus is the adaptive structure-space test (ST), which is based on a leave-one-out sieve estimate of the quadratic distance between structural functions of endogenous variables in NPIV models. We also present the minimax rate-adaptive image-space test (IT) as a comparison, because our sieve IT is related to Bierens-type tests that have been utilized in many existing papers on inference for semi-nonparametric conditional moment restrictions. None of the existing papers on testing in NPIV models, however, considers a data-driven choice of tuning parameters. For both tests, we first establish their respective minimax rates of testing against classes of nonparametric alternative models.
We then provide computationally simple data-driven choices of sieve tuning parameters and adaptive critical values. The resulting tests attain the optimal minimax rate of testing, adapt to the unknown smoothness of functions, and are robust to the unknown degree of ill-posedness (endogeneity). Data-driven confidence sets are easily obtained by inverting the adaptive ST. Monte Carlo studies and empirical applications demonstrate that our simple, adaptive ST has good size and power properties in finite samples for testing monotonicity or parametric restrictions in NPIV models, without the need for computationally intensive bootstrapped critical values.
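To make the simulation design of Section 5.2 concrete, a minimal sketch of the first design, (5.1), follows. Only ξ and σ_U = 0.2 are pinned down in the text; the X*-U covariance (set to 0.1 here) and the linear slope of h_0 (set to −1/2 here) are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt

# Standard normal cdf, applied elementwise.
Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def simulate_design_51(n=500, xi=0.5, sigma_u=0.2, c_a=1.0, c_b=0.0, seed=0):
    rng = np.random.default_rng(seed)
    # Covariance of (X*, W*, U): xi links instrument and regressor; the
    # X*-U covariance of 0.1 (an assumption) induces endogeneity.
    cov = np.array([[1.0, xi,  0.1],
                    [xi,  1.0, 0.0],
                    [0.1, 0.0, sigma_u ** 2]])
    xs, ws, u = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    x, w = Phi(xs), Phi(ws)  # map regressor and instrument onto (0, 1)
    # Structural function: linear term plus quadratic (c_a) and cubic (c_b)
    # terms; the slope -1/2 on the linear part is an assumption.
    h0 = -x / 2 + c_a * x ** 2 + (c_b * x) ** 3
    return x, w, h0 + u

x, w, y = simulate_design_51()
```

Setting (c_a, c_b) = (0, 0) or (1, 0) generates data under the linear or quadratic null, while (1, 1) and (1, 5) generate the alternatives used in Table 6.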
References

D. Al Mohamad, E. W. Van Zwet, E. Cator, and J. J. Goeman. Adaptive critical value for constrained likelihood ratio testing. arXiv preprint arXiv:1806.01325, 2018.
J. Banks, R. Blundell, and A. Lewbel. Quadratic Engel curves and consumer demand. Review of Economics and Statistics, 79(4):527–539, 1997.
S. T. Berry and P. A. Haile. Identification in differentiated products markets using market level data. Econometrica, 82(5):1749–1797, 2014.
H. J. Bierens. A consistent conditional moment test of functional form. Econometrica, 58(6):1443–1458, 1990.
R. Blundell, X. Chen, and D. Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613–1669, 2007.
R. Blundell, J. Horowitz, and M. Parey. Nonparametric estimation of a nonseparable demand function under the Slutsky inequality restriction. Review of Economics and Statistics, 99(2):291–304, 2017.
C. Breunig. Goodness-of-fit tests based on series estimators in nonparametric instrumental regression. Journal of Econometrics, 184(2):328–346, 2015.
C. Breunig and J. Johannes. Adaptive estimation of functionals in nonparametric instrumental regression. Econometric Theory, 32(3):612–654, 2016.
C. Butucea, C. Matias, and C. Pouet. Adaptive goodness-of-fit testing from indirect observations. Ann. Inst. Henri Poincaré Probab. Stat., 45(2):352–372, 2009.
T. T. Cai and M. G. Low. Adaptive confidence balls. The Annals of Statistics, 34(1):202–228, 2006.
S. Centorrino. Data driven selection of the regularization parameter in additive nonparametric instrumental regressions. Working paper, Stony Brook, 2014.
X. Chen and T. Christensen. Optimal sup-norm rates, adaptivity and inference in nonparametric instrumental variables estimation. Working paper, 2015.
X. Chen and T. M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39–84, 2018.
X. Chen and D. Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.
X. Chen and D. Pouzo. Sieve quasi likelihood ratio inference on semi/nonparametric conditional moment models. Econometrica, 83(3):1013–1079, 2015.
X. Chen and M. Reiß. On rate optimality for ill-posed inverse problems in econometrics. Econometric Theory, 27(3):497–521, 2011.
V. Chernozhukov, S. Lee, and A. M. Rosen. Intersection bounds: Estimation and inference. Econometrica, 81(2):667–737, 2013.
V. Chernozhukov, W. K. Newey, and A. Santos. Constrained conditional moment restriction models. arXiv preprint arXiv:1509.06311, 2015.
D. Chetverikov and D. Wilhelm. Nonparametric instrumental variable estimation under monotonicity. Econometrica, 85:1303–1320, 2017.
O. Collier, L. Comminges, and A. B. Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics, 45(3):923–958, 2017.
G. Compiani. Market counterfactuals and the specification of multi-product demand: A nonparametric approach. Available at SSRN 3134152, 2019.
A. Deaton and J. Muellbauer. An almost ideal demand system. American Economic Review, pages 312–326, 1980.
Z. Fang and J. Seo. A general framework for inference on shape restrictions. arXiv preprint arXiv:1910.07689, 2019.
J. Freyberger and B. Reeves. Inference under shape restrictions. Working paper, University of Wisconsin, 2019.
E. Gautier and E. Le Pennec. Adaptive estimation in the nonparametric random coefficients binary choice model by needlet thresholding. Electronic Journal of Statistics, 12(1):277–320, 2018.
E. Gine and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2016.
E. Guerre and P. Lavergne. Data-driven rate-optimal specification testing in regression models. The Annals of Statistics, 33(2):840–870, 2005.
L. P. Hansen and T. J. Sargent. Robustness. Princeton University Press, 2008.
J. L. Horowitz. Testing a parametric model against a nonparametric alternative with identification through instrumental variables. Econometrica, 74(2):521–538, 2006.
J. L. Horowitz. Adaptive nonparametric instrumental variables estimation: Empirical choice of the regularization parameter. Journal of Econometrics, 180:158–173, 2014.
J. L. Horowitz and V. G. Spokoiny. An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative. Econometrica, 69(3):599–631, 2001.
C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for U-statistics of order two. In Stochastic Inequalities and Applications, pages 55–69. Springer, 2003.
J. Z. Huang. Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics, 26(1):242–272, 1998.
Y. I. Ingster. Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Math. Methods Statist., 2(2):85–114, 1993.
Y. I. Ingster, T. Sapatinas, and I. A. Suslina. Minimax signal detection in ill-posed inverse problems. The Annals of Statistics, 40(3):1524–1549, 2012.
M. Jansson and D. Pouzo. Towards a general large sample theory for regularized estimators. Working paper, University of California, Berkeley, 2019.
J. Lei. Adaptive global testing for functional linear models. Journal of the American Statistical Association, 109(506):624–634, 2014.
C.-A. Liu and J. Tao. Model selection and model averaging in nonparametric instrumental variables models. Working paper, National University of Singapore and University of Wisconsin-Madison, 2014.
W. K. Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168, 1997.
J. Robins and A. van der Vaart. Adaptive nonparametric confidence sets. The Annals of Statistics, 34(1):229–253, 2006.
A. Santos. Inference in nonparametric instrumental variables with partial identification. Econometrica, 80(1):213–275, 2012.
M. J. Silvapulle and P. K. Sen. Constrained Statistical Inference: Inequality, Order and Shape Restrictions. John Wiley & Sons, 2005.
V. G. Spokoiny. Adaptive hypothesis testing using wavelets. The Annals of Statistics, 24(6):2477–2498, 1996.
J. Tao. Inference for point and partially identified semi-nonparametric conditional moment models. Unpublished manuscript, University of Wisconsin-Madison, 2014.
A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
Y. Zhu. Inference in nonparametric/semiparametric moment equality models with shape restrictions. Quantitative Economics, 11(2):609–636, 2020.
A. Proofs of Minimax ST Results in Subsection 3.2

We introduce additional notation used in the proofs. For an r × c matrix M with r ≤ c and full row rank r, we let M^-_l denote its left pseudoinverse, namely (M′M)^- M′, where ′ denotes transpose and ^- denotes generalized inverse. We define A = [S′G_b^- S]^- S′G_b^-, where S = E[b_K(W) ψ_J(X)′]. For all J we have s_J = s_min(G_b^{-1/2} S G^{-1/2}) > 0 and
\[
\big\| G^{1/2} A G_b^{1/2} \big\|
= \big\| G^{1/2} [S' G_b^- S]^- S' G_b^{-1/2} \big\|
= \big\| [G^{-1/2} S' G_b^- S G^{-1/2}]^- G^{-1/2} S' G_b^{-1/2} \big\|
= \big\| \big( G_b^{-1/2} S G^{-1/2} \big)^-_l \big\|
= s_J^{-1}.
\]
We set V_{Ji} := (Y_i − h(X_i)) G^{1/2} A b_K(W_i). We also introduce the projection Q_J h(·) = ψ_J(·)′ A E[h(X) b_K(W)] for h ∈ L²_µ, and the normalization term
\[
v_J(h) = \big\| \mathbb{E}_h\big[ (Y - h(X))^2\, G^{1/2} A\, b_{K(J)}(W)\, b_{K(J)}(W)'\, A' G^{1/2} \big] \big\|_F.
\]
For simplicity of notation, we write D̂_J instead of D̂_J(h) and v_J instead of v_J(h).

Proof of Theorem 3.1.
For the proof of the result, we proceed in three steps. First, we bound the type I error of the test statistic S̃T_{n,J} = 1{n D̂_J > η′ v_J} for some constant η′ > 0. Second, we bound the type II error of S̃T_{n,J} with η′ replaced by η″ > 0. Third, we make the dependence of η′ and η″ on η explicit and show that the bounds derived in the previous steps are sufficient to control the type I and type II errors of our test statistic S̃T_{n,J}.

Step 1: Under the simple null hypothesis we have ‖h − h_0‖²_µ = 0, and hence for any ε > 0 and η sufficiently large,
\[
P_{h_0}\big( \widetilde{ST}_{n,J} = 1 \big) = P_{h_0}\big( n \widehat D_J > \eta' v_J \big) \le \varepsilon
\]
by the upper bound derived in (F.4).

Step 2: From Step 2 of the proof of Theorem 3.3 (with J* replaced by J and η″_{J*} replaced by η″), we obtain uniformly over h ∈ H(δ*, r_{n,J}) that
\[
P_h\big( \widetilde{ST}_{n,J} = 0 \big) = P_h\Big( \widehat D_J \le \frac{\eta'' v_J}{n} \Big) = o(1).
\]

Step 3: Finally, we account for estimation of the normalization factor v_J. Since |v̂_J − v_J| ≤ c_0 v_J wpa1 for some 0 < c_0 < 1, it holds that (1 − c_0) v_J ≤ v̂_J ≤ (1 + c_0) v_J wpa1. Let η′ and η″ be such that
\[
\sqrt{\eta} = \frac{\eta'}{1 - c_0} = \frac{\eta''}{1 + c_0}.
\]
We control the type I error of the feasible test ST_{n,J} as follows:
\[
P_{h_0}( ST_{n,J} = 1 )
\le P_{h_0}\big( ST_{n,J} = 1,\ \widehat v_J \ge (1 - c_0) v_J \big) + P_{h_0}\big( \widehat v_J < (1 - c_0) v_J \big)
\le P_{h_0}\big( n \widehat D_J > \sqrt{\eta}\,(1 - c_0)\, v_J \big) + o(1)
= P_{h_0}\big( n \widehat D_J > \eta' v_J \big) + o(1) = o(1),
\]
where the last equality is due to Step 1. To bound the type II error of ST_{n,J}, we calculate uniformly over h ∈ H(δ*, r_{n,J}) that
\[
P_h( ST_{n,J} = 0 )
\le P_h\big( ST_{n,J} = 0,\ \widehat v_J \le (1 + c_0) v_J \big) + P_h\big( \widehat v_J > (1 + c_0) v_J \big)
= P_h\big( n |\widehat D_J| \le \eta'' v_J \big) + o(1) = o(1),
\]
where the last equality is due to Step 2.

Proof of Corollary 3.1.
We make use of the observation s_J^{-1} = (1 + o(1)) τ_J.

Proof of (3.3). The choice J ∼ n^{2d_x/(4(p+ζ)+d_x)} implies
\[
\big( n^{-1} \tau_J^2 \sqrt J \big)^{1/2} \sim \big( n^{-1} J^{1/2 + 2\zeta/d_x} \big)^{1/2} \sim n^{-2p/(4(p+\zeta)+d_x)},
\]
and for the bias term J^{-p/d_x} ∼ n^{-2p/(4(p+ζ)+d_x)}.

Proof of (3.4). The choice J ∼ (log n − (2p + d_x/2) ζ^{-1} log log n)^{d_x/ζ} implies
\[
n^{-1} \tau_J^2 \sqrt J \sim n^{-1} \sqrt J \exp\big( J^{\zeta/d_x} \big)
\sim (\log n)^{d_x/(2\zeta)} (\log n)^{-(2p + d_x/2)/\zeta}
= (\log n)^{-2p/\zeta},
\]
and for the bias term J^{-p/d_x} ∼ (log n)^{-p/ζ}, which yields equation (3.4).

Proof of Theorem 3.2.
It is sufficient to consider the Gaussian reduced-form NPIR as in Chen and Reiß [2011]. From the proof of Collier et al. [2017, Lemma 3] we infer, for any h_θ ∈ H(δ*, r_n), that
\[
\inf_{T_n} \Big\{ P_{h_0}(T_n = 1) + \sup_{h \in H(\delta^*, r_n)} P_h(T_n = 0) \Big\}
\ge \inf_{T_n} \big\{ P_{h_0}(T_n = 1) + P_{h_\theta}(T_n = 0) \big\}
\ge 1 - V(P_{h_\theta}, P_{h_0})
\ge 1 - \sqrt{ \chi^2(P_{h_\theta}, P_{h_0}) }, \tag{A.1}
\]
where V(·,·) denotes the total variation distance and χ²(·,·) denotes the χ² divergence. We consider θ = (θ_j)_{j≥1} with θ_j ∈ {−1, 1} and introduce the test function
\[
h_\theta = h_0 + \frac{\delta^*}{\sqrt n} \sum_{j=1}^{J} \tau_j^2 \theta_j \widetilde\psi_j \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{-1/4},
\]
where {ψ̃_j}_{j≥1} forms an orthonormal basis in L²_µ and the dimension parameter J satisfies the inequality restriction
\[
\frac{(\delta^*)^2}{n} \Big( \sum_{j=1}^{J} \tau_j^4\, j^{4p/d_x} \Big)^{1/2} \le C \tag{A.2}
\]
for some constant C > 0. Therefore, orthonormality of the basis functions {ψ̃_j}_{j≥1} in L²_µ together with the Cauchy-Schwarz inequality implies
\[
\sum_{j=1}^{\infty} \langle h_\theta - h_0, \widetilde\psi_j \rangle_\mu^2\, j^{2p/d_x}
= \frac{(\delta^*)^2}{n} \sum_{j=1}^{J} \tau_j^4\, j^{2p/d_x} \Big( \sum_{l=1}^{J} \tau_l^4 \Big)^{-1/2}
\le \frac{(\delta^*)^2}{n} \Big( \sum_{j=1}^{J} \tau_j^4\, j^{4p/d_x} \Big)^{1/2} \le C,
\]
and we conclude that h_θ − h_0 attains the sieve approximation error imposed in Assumption 2. Denoting r_n = n^{-1/2}(Σ_{j=1}^J τ_j⁴)^{1/4}, we obtain
\[
\| h_\theta - h_0 \|_\mu = \frac{\delta^*}{\sqrt n} \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{1/4} = \delta^* r_n,
\]
using ∫ ψ̃_j²(x) µ(x) dx = 1, and hence h_θ ∈ H(δ*, r_n). Under P_{h_θ} the conditional distribution of Y given W is Gaussian with mean T h_θ(W) and variance 1. We may assume that {λ_j, ψ̃_j, b̃_j}, j ≥ 1, forms a singular value decomposition of the conditional expectation operator T. The χ² divergence between P_{h_θ} and P_{h_0} thus satisfies
\[
\chi^2(P_{h_\theta}, P_{h_0}) = \int \Big( \frac{d P_{h_\theta}}{d P_{h_0}} \Big)^2 d P_{h_0} - 1
= \prod_{j=1}^{J} \frac{ \exp(-n \lambda_j^2 \gamma_j^2) + \exp(n \lambda_j^2 \gamma_j^2) }{2} - 1,
\]
where we define γ_j = n^{-1/2} τ_j² (Σ_{l=1}^J τ_l⁴)^{-1/4}. By Tsybakov [2009, Section 2.7.5] there exists a constant c_0 > 0 such that (exp(−nλ_j²γ_j²) + exp(nλ_j²γ_j²))/2 ≤ exp(c_0 n²λ_j⁴γ_j⁴). By making use of the condition λ_j ≤ c τ_j^{-1} for all j ≥ 1 and some c > 0, we obtain (using τ_j → ∞) that
\[
\chi^2(P_{h_\theta}, P_{h_0}) \le \exp\Big( c_0 c^4 n^2 \sum_{j=1}^{J} \tau_j^{-4} \gamma_j^4 \Big) - 1 \le \exp(c_0 c^4) - 1,
\]
by the definition of γ_j. Consequently, the result follows by making use of inequality (A.1).

Finally, we derive specific rate results under the different measures of ill-posedness. Consider the mildly ill-posed case with the choice J ∼ n^{2d_x/(4(p+ζ)+d_x)}; then J satisfies constraint (A.2) up to a constant and
\[
n^{-1/2} \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{1/4} \sim n^{-1/2} \Big( \sum_{j=1}^{J} j^{4\zeta/d_x} \Big)^{1/4} \sim n^{-2p/(4(p+\zeta)+d_x)}.
\]
In the severely ill-posed case, the choice J ∼ (log n − (2p + d_x/2) ζ^{-1} log log n)^{d_x/ζ} also satisfies (A.2) up to a constant and
\[
n^{-1/2} \Big( \sum_{j=1}^{J} \tau_j^4 \Big)^{1/4} \sim n^{-1/2} \Big( \sum_{j=1}^{J} \exp\big( 2 j^{\zeta/d_x} \big) \Big)^{1/4} \sim (\log n)^{-p/\zeta},
\]
which completes the proof.

B. Proofs of Adaptive ST Results in Subsection 3.3
We require some additional notation. We set Z_i = (Y_i, X_i, W_i) and introduce the function
\[
R(Z_i, Z_{i'}, D_i) = (Y_i - h(X_i)) 1_{D_i}\, b_K(W_i)' A' G A\, b_K(W_{i'}) (Y_{i'} - h(X_{i'})) 1_{D_{i'}}
- \mathbb{E}_h[(Y - h(X)) 1_D\, b_K(W)]' A' G A\, \mathbb{E}_h[b_K(W) (Y - h(X)) 1_D]
\]
for any set D_i. We define R_1(Z_i, Z_{i'}) := R(Z_i, Z_{i'}, M_i) and R_2(Z_i, Z_{i'}) := R(Z_i, Z_{i'}, M_i^c), where M_i = {|Y_i − h(X_i)| ≤ M_n} and (M_n)_{n≥1} is an increasing sequence satisfying M_n = o(√n ζ_J^{-1} (log log J)^{-1/2}). Based on the kernels R_l, l = 1, 2, we introduce the U-statistics
\[
U_{J,l} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} R_l(Z_i, Z_{i'}).
\]
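The U-statistics U_{J,l} defined above have the generic second-order form U = (2/(n(n−1))) Σ_{i<i'} R(Z_i, Z_{i'}). A minimal numerical sketch with a placeholder kernel (a toy stand-in for the matrix-valued kernels R_l, not the paper's statistic) is:

```python
import numpy as np

def u_statistic(z, kernel):
    """Second-order U-statistic: average of kernel(z_i, z_j) over pairs i < j."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += kernel(z[i], z[j])
    return 2.0 * total / (n * (n - 1))

rng = np.random.default_rng(0)
z = rng.normal(size=50)
# Toy degenerate kernel R(a, b) = a * b: E[R(a, Z)] = a * E[Z] = 0 for
# mean-zero data, the regime covered by the exponential inequality in Lemma G.1.
u = u_statistic(z, lambda a, b: a * b)
# Cross-check via the identity: sum over i != i' of z_i z_j equals
# (sum z_i)^2 - sum z_i^2.
check = (z.sum() ** 2 - (z ** 2).sum()) / (50 * 49)
assert np.isclose(u, check)
```

The leave-one-out structure (summing only over i ≠ i') is exactly what removes the diagonal bias term from the quadratic-distance estimate D̂_J.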
Proof of Theorem 3.3.

For the proof of the result, we proceed in three steps. First, we bound the type I error of the test statistic
\[
\widetilde{ST}_n = 1\Big\{ \max_{J \in \mathcal I_n} \big| n \widehat D_J / (\eta'_J v_J) \big| > 1 \Big\}
\]
for some η′_J > 0. Second, we bound the type II error of S̃T_n with η′_J replaced by η″_J > 0. Let η′_J and η″_J be such that
\[
\sqrt{\eta_J} = \frac{\eta'_J}{1 - c_0} = \frac{\eta''_J}{1 + c_0}
\]
for some constant 0 < c_0 < 1. Finally, we show that the derived bounds are sufficient to control the type I and type II errors of our test statistic ST_n.

Step 1: To control the type I error of S̃T_n, we use the following decomposition, which holds for all h ∈ H_0 with U_i = Y_i − h_0(X_i) and a constant c_1 > 0:
\[
P_h\big( \widetilde{ST}_n = 1 \big)
\le \underbrace{ P_h\Big( \max_{J \in \mathcal I_n} \big| n U_{J,1} / (\eta'_J v_J) \big| > c_1 \Big) }_{=:\,I}
+ \underbrace{ P_h\Big( \max_{J \in \mathcal I_n} \big| n U_{J,2} / (\eta'_J v_J) \big| > c_1 \Big) }_{=:\,II}
+ \underbrace{ P_h\Big( \max_{J \in \mathcal I_n} \Big| \frac{1}{(n-1)\eta'_J v_J} \sum_{i \ne i'} U_i U_{i'}\, b_K(W_i)' \big( A'GA - \widehat A' G \widehat A \big) b_K(W_{i'}) \Big| > c_1 \Big) }_{=:\,III}.
\]

Consider I. From Lemma G.1 and Lemma G.2 we conclude, for u = 2 log log J and the choice of (M_n), that for all J ∈ I_n,
\[
\Lambda_1 \sqrt u + \Lambda_2 u + \Lambda_3 u^{3/2} + \Lambda_4 u^2
\le n v_J \sqrt{\log\log J} + 4 \sigma^2 n s_J^{-2} \log\log J + \sigma n s_J^{-2} (2 \log\log J)^{3/2} + 4 n s_J^{-2} (\log\log J)^2
\]
for n sufficiently large. Lemma F.2 implies s_J^{-2} ≤ σ^{-2} v_J, and Assumption 5(ii) yields log log J ≲ η′_J for all J ∈ I_n and n sufficiently large; hence
\[
\Lambda_1 \sqrt u + \Lambda_2 u + \Lambda_3 u^{3/2} + \Lambda_4 u^2 \le 4^{-1} n v_J \eta'_J.
\]
Consequently, the exponential inequality for degenerate U-statistics in Lemma G.1 with u = 2 log log J, together with the definition of I_n, yields
\[
I \le \sum_{J \in \mathcal I_n} P_h\big( | n U_{J,1} | > c_1 \eta'_J v_J \big) \le 6 \sum_{J \in \mathcal I_n} \exp(-2 \log\log J) = o(1).
\]

Consider II. By Markov's inequality,
\[
II \le n\, \mathbb{E}_h\big[ U^2 1\{ |U| > M_n \} \big] \max_{J \in \mathcal I_n} \frac{ \zeta_J^2 \| G^{1/2} A G_b^{1/2} \|^2 }{ \eta'_J v_J }
\le n M_n^{-2}\, \mathbb{E}_h[U^4]\, \zeta_J^2 \max_{J \in \mathcal I_n} \frac{ s_J^{-2} }{ \eta'_J v_J },
\]
where the fourth moment of U = Y − h_0(X) is bounded under Assumption 1(ii). From Lemma F.2 we deduce s_J^{-2} ≤ σ^{-2} v_J. Thus, the definition of (M_n) gives II = o(1), where we also use the rate restriction on J imposed in Assumption 5(i).

Consider III. Lemma F.5 implies, uniformly in J ∈ I_n,
\[
\frac{1}{n-1} \sum_{i \ne i'} U_i U_{i'}\, b_K(W_i)' \big( A'GA - \widehat A' G \widehat A \big) b_K(W_{i'}) = o_p( v_J \eta'_J ),
\]
and hence III = o(1).
Step 2: We control the type II error of the test statistic S̃T_n with η′_J replaced by η″_J > 0. We may assume that there exists J̃ with J̃^{-2p/d} ≤ n^{-1} v_{J̃}. Thus, by the construction of the set I_n, there exists J* ∈ I_n such that J̃ ≤ J* < 2J̃. We denote K* = K(J*). We further evaluate, for all h ∈ H(δ°, r_n), that
\[
P_h\big( \widetilde{ST}_n = 0 \big)
= P_h\big( n \widehat D_J \le \eta''_J v_J \text{ for all } J \in \mathcal I_n \big)
\le P_h\big( n \widehat D_{J^*} \le \eta''_{J^*} v_{J^*} \big).
\]
We make use of the notation B_J = ‖E_h[V_J]‖² − ‖h − h_0‖²_µ and obtain uniformly over h ∈ H(δ°, r_n) that
\[
P_h\big( n \widehat D_{J^*} \le \eta''_{J^*} v_{J^*} \big)
= P_h\Big( \| \mathbb{E}_h[V_{J^*}] \|^2 - \widehat D_{J^*} > \| \mathbb{E}_h[V_{J^*}] \|^2 - \frac{\eta''_{J^*} v_{J^*}}{n} \Big)
\le IV + V,
\]
where IV bounds the deviation of the degenerate second-order part (n(n−1))^{-1} Σ_{j≤J*} Σ_{i≠i'} V_{ij}V_{i'j} over half the threshold ‖h − h_0‖²_µ − η″_{J*} v_{J*}/n + B_{J*}, V bounds the deviation of the remaining first-order (linear) part over the other half, and we use n/(n−1) ≤ 2 for all n ≥ 2.

Consider IV. We first derive an upper bound for the term B_{J*}. The definitions of V_J and Q_J imply
\[
\| \mathbb{E}_h[V_{J^*}] \|^2 = \mathbb{E}_h[(Y - h_0(X)) b_{K^*}(W)']\, A' G A\, \mathbb{E}_h[(Y - h_0(X)) b_{K^*}(W)] = \| Q_{J^*}(h - h_0) \|_\mu^2.
\]
Consequently, from Lemma F.3 we infer
\[
| B_{J^*} | = \big| \| Q_{J^*}(h - h_0) \|_\mu^2 - \| h - h_0 \|_\mu^2 \big| \le C_B r_n^2
\]
for some constant C_B, due to the definition of J*. To establish an upper bound for IV, we make use of inequality (F.3) together with Markov's inequality, which yields
\[
IV = O\Bigg( \frac{ n^{-2} \big\| \langle Q_{J^*}(h - h_0), \psi_{J^*} \rangle'_\mu ( G_b^{-1/2} S )^-_l \big\|^2 + n^{-2} v_{J^*}^2 }{ \big( \| h - h_0 \|_\mu^2 - \eta''_{J^*} n^{-1} v_{J^*} + B_{J^*} \big)^2 } \Bigg). \tag{B.2}
\]
In the following, we distinguish between two cases. First, consider the case where
\[
n^{-1} \eta''_{J^*} v_{J^*} \le C_\eta r_n^2
\]
for some constant C_η > 0. For any h ∈ H(δ°, r_n) we have ‖h − h_0‖²_µ ≥ (δ°)² r_n², and hence we obtain the lower bound
\[
\| h - h_0 \|_\mu^2 - \eta''_{J^*} n^{-1} v_{J^*} + B_{J^*} \ge \big( (\delta^\circ)^2 - C_\eta - C_B \big) r_n^2 \ge c_2 r_n^2
\]
for the constant c_2 := (δ°)² − C_η − C_B, which is positive for any δ° > √(C_η + C_B). From inequality (B.2) we infer
\[
IV \le O\Big( \frac{ v_{J^*}^2 }{ c_2^2 r_n^4 n^2 } \Big) = o(1).
\]
Second, consider the case where n^{-2} ‖⟨Q_{J*}(h − h_0), ψ_{J*}⟩′_µ (G_b^{-1/2}S)^-_l‖² dominates. Using ‖(G_b^{-1/2} S G^{-1/2})^-_l‖ = s_{J*}^{-1} together with Assumption 2(iii), i.e., the eigenvalues of G are uniformly bounded away from zero, we obtain
\[
n^{-2} \big\| \langle Q_{J^*}(h - h_0), \psi_{J^*} \rangle'_\mu ( G_b^{-1/2} S )^-_l \big\|^2
= O\Big( n^{-2} s_{J^*}^{-2} \big\| \langle Q_{J^*}(h - h_0), \psi_{J^*} \rangle_\mu \big\|^2 \Big)
= O\Big( n^{-2} s_{J^*}^{-2} \big( \| h - h_0 \|_\mu + (J^*)^{-p/d} \big)^2 \Big),
\]
where the last bound is due to Lemma F.3. For any h ∈ H(δ°, r_n) we have ‖h − h_0‖²_µ ≥ (δ°)² r_n² > (δ°)² n^{-1} v_{J*}, and hence obtain the lower bound
\[
\| h - h_0 \|_\mu^2 - \eta''_{J^*} n^{-1} v_{J^*} + B_{J^*} \ge \Big( 1 - \frac{1 + C_B}{(\delta^\circ)^2} \Big) \| h - h_0 \|_\mu^2 \ge c_3 \| h - h_0 \|_\mu^2
\]
for the constant c_3 = 1 − (1 + C_B)/(δ°)², which is positive for any (δ°)² > 1 + C_B. Hence, inequality (B.2) yields for all h ∈ H(δ°, r_n) that
\[
IV = O\Big( n^{-2} s_{J^*}^{-2} \big( \| h - h_0 \|_\mu^{-2} + \| h - h_0 \|_\mu^{-4} (J^*)^{-2p/d} \big) \Big)
= O\big( n^{-2} s_{J^*}^{-2} r_n^{-2} \big) = o(1).
\]
Finally, V = o(1) by making use of Lemma F.4.
Finally, we account for estimation of the normalization factor $v_J$. Now since $|\widehat v_J-v_J|\leq c_0v_J$ wpa1, for some $0<c_0<1$ and uniformly over $J\in\mathcal I_n$ by Lemma F.6, it holds that $(1-c_0)v_J\leq\widehat v_J\leq(1+c_0)v_J$ wpa1. Hence, we control the type I error of the test $ST_n$ as follows, using for all $h\in\mathcal H_0$ that
$$P_h(ST_n=1)\leq P_h\big(ST_n=1,\ \widehat v_J\geq(1-c_0)v_J\text{ for all }J\in\mathcal I_n\big)+P_h\big(\widehat v_J<(1-c_0)v_J\text{ for some }J\in\mathcal I_n\big)$$
$$\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J|/(\eta_J\widehat v_J)\big\}>\sqrt2,\ \widehat v_J\geq(1-c_0)v_J\text{ for all }J\in\mathcal I_n\Big)+o(1)$$
$$\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J|/(\eta_Jv_J)\big\}>(1-c_0)\sqrt2\Big)+o(1)=P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J|/(\eta_J'v_J)\big\}>\sqrt2\Big)+o(1)=o(1),$$
where $\eta_J':=(1-c_0)\eta_J$ and the last equality is due to Step 1. To bound the type II error of the test $ST_n$, recall the definition of $J^*\in\mathcal I_n$ given in Step 2. We evaluate for all $h\in\mathcal H_1(\delta^\circ,r_n)$ that
$$P_h(ST_n=0)\leq P_h\big(|n\widehat D_{J^*}|\leq\sqrt2\,\eta_{J^*}\widehat v_{J^*}\big)\leq P_h\big(|n\widehat D_{J^*}|\leq\sqrt2\,\eta_{J^*}\widehat v_{J^*},\ \widehat v_{J^*}\leq(1+c_0)v_{J^*}\big)+P_h\big(\widehat v_{J^*}>(1+c_0)v_{J^*}\big)$$
$$=P_h\big(|n\widehat D_{J^*}|\leq\sqrt2(1+c_0)\eta_{J^*}v_{J^*}\big)+o(1)=P_h\big(|n\widehat D_{J^*}|\leq\eta''_{J^*}v_{J^*}\big)+o(1)=o(1),$$
where $\eta''_{J^*}:=\sqrt2(1+c_0)\eta_{J^*}$ and the last equality is due to Step 2. Proof of Corollary 3.2.
Theorem 3.3 establishes the rate $r_n^2=n^{-1}\sqrt{\log\log n}\,s_{J_0}^{-2}\sqrt{J_0}$, where $J_0$ satisfies $J_0^{-2p/d_x}\sim n^{-1}\sqrt{\log\log n}\,s_{J_0}^{-2}\sqrt{J_0}$. Consequently, following the proof of Corollary 3.1, the result immediately follows.

Supplement to “Adaptive, Rate-Optimal Testing in Instrumental Variables Models”
Christoph Breunig Xiaohong Chen
First version: August 2018, Revised June 18, 2020
This supplementary appendix contains materials to support our paper. Appendix C contains robustness checks using bootstrap critical values for the empirical illustrations. Appendix D provides proofs of our adaptive ST results under a composite null hypothesis. Appendix E establishes the results for our adaptive IT for conditional moment restrictions. Appendix F provides an upper bound for quadratic distance estimation, which is essential for our minimax testing results; it also contains further technical lemmas. Finally, Appendix G gathers an exponential inequality for U-statistics.
C. Empirical Results based on Bootstrap Critical Values
In this section, we present robustness checks of our empirical findings when the critical values of our adaptive test are computed using the bootstrap. For the bootstrap critical values $\widehat\eta_J(\alpha)$, the i.i.d. bootstrap weights are generated according to $\omega\sim N(0,1)$.

Income    H0: h1 is decreasing in P^o             H0: h1 is increasing in P^o
groups    W_J     p val.   reject H0?   J         W_J     p val.   reject H0?   J
low       0.497   0.094    no      { }            { }     { , , }
high      1.471   0.002    yes     { , , }        { , , , }

Table 7: Adaptive bootstrap testing for monotonic demand for organic products.
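The multiplier-bootstrap construction behind critical values such as $\widehat\eta_J(\alpha)$ can be sketched computationally. The snippet below is a minimal illustration under simplifying assumptions: the statistic is a plain normalized quadratic form in an instrument basis, and the function name, its inputs, and the $N(0,1)$ multiplier weights are illustrative rather than the paper's exact formulas.

```python
import numpy as np

def bootstrap_critical_value(residuals, basis, alpha=0.05, n_boot=500, seed=0):
    """(1 - alpha) multiplier-bootstrap critical value for a normalized
    quadratic statistic sum_k (n^{-1/2} sum_i omega_i u_i b_k(W_i))^2,
    with i.i.d. N(0,1) weights omega_i (illustrative simplification)."""
    rng = np.random.default_rng(seed)
    n = len(residuals)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        omega = rng.standard_normal(n)                    # i.i.d. N(0,1) multipliers
        score = (omega * residuals) @ basis / np.sqrt(n)  # K-vector of perturbed sums
        stats[b] = score @ score                          # quadratic form
    return np.quantile(stats, 1 - alpha)
```

Repeating such a bootstrap quantile computation at each sieve dimension in the scanned index set yields data-driven critical values of the kind used in this section.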
Table 7 reports bootstrap adaptive testing results for monotonic multi-product demand in Subsection 5.1.1. It replicates Table 2.

Income    H0: h1 is decreasing in P1              H0: h1 is increasing in P1
groups    W_J     p val.   reject H0?   J         W_J     p val.   reject H0?   J
low       0.408   0.159    no      { }            .611    0.000    yes     { }
high      15.256  0.000    yes     { , }          .894    0.000    yes     { }

Table 8: Adaptive bootstrap testing for monotonic gasoline demand.

          H0: h is increasing                     H0: h is decreasing
Goods     W_J     p value   reject H0?   J        W_J      p value   reject H0?   J
“food in” 1.449   0.003     yes     { }           -0.262   0.947     no      { }
“fuel”    3.936   0.000     yes     { }           { }
“travel”  2.079   0.001     yes     { }           { }
“leisure” 0.241   0.152     no      { }           { , }

Table 9: Adaptive bootstrap testing for monotonicity of Engel curves.
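The p-values reported in Tables 7–10 compare an observed scan statistic against its bootstrap distribution. In its simplest form the computation reduces to the sketch below, where the array layout of the bootstrap draws (replications by scanned dimensions) is an assumed convention, not the paper's notation.

```python
import numpy as np

def scan_pvalue(observed_max, boot_draws):
    """Bootstrap p-value of a scan statistic: the share of bootstrap
    replications whose max over the scanned dimensions exceeds the
    observed maximum.  boot_draws has shape (n_boot, |I_n|)."""
    boot_max = np.max(np.asarray(boot_draws), axis=1)  # max over J per replication
    return float(np.mean(boot_max >= observed_max))
```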
Table 8 reports bootstrap adaptive testing results for monotonic gasoline demand in Subsection 5.1.2. It replicates Table 3. Table 9 reports bootstrap adaptive testing results for monotonic Engel curves in Subsection 5.1.3. It replicates Table 4.

          H0: h is linear                         H0: h is quadratic
Goods     W_J     p value   reject H0?   J        W_J      p value   reject H0?   J
“food in” -0.177  0.785     no      { }           { }
“fuel”    1.141   0.009     yes     { }           -0.078   0.502     no      { }
“travel”  1.260   0.005     yes     { }           -0.098   0.539     no      { }
“leisure” 0.482   0.077     no      { }           { }

Table 10: Adaptive bootstrap testing for linear/quadratic Engel curves.
Table 10 reports bootstrap adaptive testing results for either linear or quadratic Engelcurves in Subsection 5.1.3. It replicates Table 5.
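The decision rule underlying the adaptive tests in these robustness checks can be summarized computationally: scan a geometrically growing set of sieve dimensions and reject when the maximal studentized statistic exceeds the adjusted critical value. The sketch below is illustrative only; the function names, the dyadic grid, and the use of the deterministic threshold $\sqrt2$ are simplifying assumptions.

```python
import numpy as np

def index_set(J_max):
    """Exponential scan: candidate sieve dimensions growing geometrically,
    here a dyadic grid {1, 2, 4, ...} capped at J_max (illustrative)."""
    out, J = [], 1
    while J <= J_max:
        out.append(J)
        J *= 2
    return out

def adaptive_st(D_hat, v_hat, eta, n, threshold=np.sqrt(2)):
    """Adaptive structure-space test: reject when the maximal studentized
    statistic n * D_hat_J / (eta_J * v_hat_J) over the scanned dimensions
    exceeds the adjusted critical value.  Returns (decision, argmax index)."""
    t = n * np.asarray(D_hat, float) / (np.asarray(eta, float) * np.asarray(v_hat, float))
    j = int(np.argmax(t))
    return int(t[j] > threshold), j
```

The returned argmax index plays the role of the data-driven dimension reported as $\widehat J$ in the tables above.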
D. Proofs of Adaptive ST Results Under Composite Null Hypothesis in Subsection 3.4
Recall the definition of the deterministic index set $\mathcal I_n$ in (B.1), satisfying $\widehat{\mathcal I}_n\subset\mathcal I_n$ with probability approaching one. Also recall the notation $\widetilde b^K(\cdot)=G_b^{-1/2}b^K(\cdot)$. Below, we denote by $C>0$ a generic constant, and we write $v_J$ instead of $v_J(h^r)$ and $\widehat\eta_J(\alpha)$ instead of $\widehat\eta_J$. Moreover, $\eta_J$ denotes a deterministic sequence satisfying $\eta_J=O(\sqrt{\log\log n})$ and $(\log\log J)^c\leq\eta_J$ for some constant $c>0$ and all $\underline J\leq J\leq\overline J$. Proof of Theorem 3.4.
For the proof of the result, we proceed in three steps. First, we bound the type I error of the test statistic
$$\widetilde{ST}_n=\mathbb 1\Big\{\max_{J\in\mathcal I_n}\big|n\widehat D_J(\widehat h^r_J)/(\eta_J\widehat v_J(\widehat h^r_J))\big|>\sqrt2\Big\}.$$
Second, we bound the type II error of $\widetilde{ST}_n$. Third, we show that these steps are sufficient to control the type I and type II errors of our test statistic $\widehat{ST}_n$. Step 1:
We control the type I error of the test statistic $\widetilde{ST}_n$ by making use of the decomposition, for all $h\in\mathcal H_0$:
$$P_h\big(\widetilde{ST}_n=1\big)\leq\underbrace{P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h^r)}{\eta_J\widehat v_J(h^r)}>\frac{\sqrt2}3\Big)}_{I}+\underbrace{P_h\Big(n\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}\big|\widehat D_J(h)-\widehat D_J(h^r)\big|/(\eta_J\widehat v_J(h^r))>\frac{\sqrt2}3\Big)}_{II}$$
$$+\underbrace{P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{n\widehat D_J(\widehat h^r_J)}{\eta_J\widehat v_J(h^r)}\Big|\max_{J\in\mathcal I_n}\Big|1-\frac{\widehat v_J(h^r)}{\widehat v_J(\widehat h^r_J)}\Big|>\frac{\sqrt2}3\Big)}_{III}.$$
We have $I=o(1)$ due to the proof of Theorem 3.1. Consider $II$. We may assume that $\widehat h^r_J\in\mathcal H^r_J(\Delta_{J,n})$ due to Assumption 6(i). The definition of the estimator $\widehat D_J(h)$ implies for all $J\in\mathcal I_n$ and $h\in\mathcal H^r_J(\Delta_{J,n})$ that
$$\frac{n}{\eta_Jv_J}\big(\widehat D_J(h)-\widehat D_J(h^r)\big)=\frac1{\eta_Jv_J(n-1)}\sum_{i\ne i'}(h-h^r)(X_i)(h-h^r)(X_{i'})\,b^K(W_i)'A'GA\,b^K(W_{i'})$$
$$+\frac1{\eta_Jv_J(n-1)}\sum_{i\ne i'}(h-h^r)(X_i)(h-h^r)(X_{i'})\,b^K(W_i)'\big(A'GA-\widehat A'G\widehat A\big)b^K(W_{i'})$$
$$+\frac2{\eta_Jv_J(n-1)}\sum_{i\ne i'}\big(Y_i-h^r(X_i)\big)(h-h^r)(X_{i'})\,b^K(W_i)'A'GA\,b^K(W_{i'})$$
$$+\frac2{\eta_Jv_J(n-1)}\sum_{i\ne i'}\big(Y_i-h^r(X_i)\big)(h-h^r)(X_{i'})\,b^K(W_i)'\big(A'GA-\widehat A'G\widehat A\big)b^K(W_{i'})$$
$$=T_{1,J}(h)+T_{2,J}(h)+T_{3,J}(h)+T_{4,J}(h).$$
Consider $T_{1,J}(h)$.
We obtain
$$|T_{1,J}(h)|\leq\frac1{\eta_Jv_J}\Big\|n^{-1/2}\sum_{i=1}^n(h-h^r)(X_i)\,b^K(W_i)'A'G^{1/2}\Big\|^2+\frac1{n\eta_Jv_J}\sum_{i=1}^n\big\|(h-h^r)(X_i)\,b^K(W_i)'A'G^{1/2}\big\|^2=T_{11,J}(h)+T_{12,J}(h).$$
Consider $T_{11,J}(h)$. We obtain
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)\leq P_h\Big(\exists J\in\mathcal I_n:\ \sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big((h-h^r)(X_i)b^K(W_i)'-\mathrm E\big[(h-h^r)(X)b^K(W)'\big]\Big)A'G^{1/2}\Big\|$$
$$>\sqrt{\eta_Jv_J/3}-\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\big\|G^{1/2}A\,\mathrm E\big[(h-h^r)(X)b^K(W)\big]\big\|\Big)$$
$$\leq P_h\Big(\exists J\in\mathcal I_n:\ \sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big((h-h^r)(X_i)b^K(W_i)'-\mathrm E\big[(h-h^r)(X)b^K(W)'\big]\Big)A'G^{1/2}\Big\|>(1-c_1)\sqrt{\eta_Jv_J/3}\Big),$$
where the second inequality is due to Lemma F.3, i.e., that uniformly in $h\in\mathcal H^r_J$:
$$\big\|G^{1/2}A\,\mathrm E\big[(h-h^r)(X)b^K(W)\big]\big\|=\|Q_J(h-h^r)\|_\mu\leq C\big(\Delta_{J,n}+J^{-p/d}\big),$$
and, for $n$ sufficiently large, $C(\Delta_{J,n}+J^{-p/d})\leq c_1\sqrt{\eta_Jv_J/3}$ for some $0<c_1<1$, uniformly over $J\in\mathcal I_n$. Let $s_j^{-1}$, $1\leq j\leq J$, be the nondecreasing singular values of $G^{1/2}AG_b^{1/2}$. Further, let $e_j$ be the unit vector with $1$ at the $j$-th position. We bound for all $h\in\mathcal H^r_J(\Delta_{J,n})$ and all $j\geq1$:
$$\mathrm E\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\big|(h-h^r)(X)\,\widetilde b^K(W)'\mathrm{diag}(s_1^{-1},\dots,s_J^{-1})e_j\big|^2\leq\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\|h-h^r\|_\infty^2\,\mathrm E\big|\widetilde b^K(W)'\mathrm{diag}(s_1^{-1},\dots,s_J^{-1})e_j\big|^2\leq\Delta_{J,n}^2\big\|\mathrm{diag}(s_1^{-1},\dots,s_J^{-1})e_j\big\|^2\leq s_j^{-2}\Delta_{J,n}^2$$
by the definition of $\mathcal H^r_J(\Delta_{J,n})$. Consequently, Markov's inequality together with the proof of Chen and Pouzo [2012, Lemma C.1] yields
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)\leq\sum_{J\in\mathcal I_n}P_h\Big(\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)$$
$$\leq\sum_{J\in\mathcal I_n}\frac3{(1-c_1)^2\eta_Jv_J}\,\mathrm E\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big((h-h^r)(X_i)b^K(W_i)'-\mathrm E\big[(h-h^r)(X)b^K(W)'\big]\Big)A'G^{1/2}\Big\|^2$$
$$\leq\frac C{(1-c_1)^2}\sum_{J\in\mathcal I_n}\frac1{\eta_Jv_J}\Big(\sum_{j=1}^Js_j^{-2}\Big)\Delta_{J,n}^2\bigg(\int_0^1\sqrt{\log N_{[\,]}(tC,\mathcal H^r_J,\|\cdot\|_\mu)}\,dt\bigg)^2\leq\frac C{\sigma_0^2(1-c_1)^2}\sum_{J\in\mathcal I_n}\frac{\Delta_{J,n}^2}{\eta_J}\,C_{J,n}$$
for $n$ sufficiently large, using $\sum_{j=1}^Js_j^{-2}\leq\sigma_0^{-2}\sqrt J\,v_J$ by the Cauchy–Schwarz inequality and Lemma F.2, where $C_{J,n}$ is as in Assumption 6(i). Consequently, the rate condition imposed in Assumption 6(i) implies
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J(\Delta_{J,n})}T_{11,J}(h)>\frac13\Big)=o(1).\qquad(\text{D.1})$$
Consider $T_{12,J}(h)$. Using the notation $\widetilde b^K(\cdot)=G_b^{-1/2}b^K(\cdot)$, we obtain
$$\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{12,J}(h)\leq\max_{J\in\mathcal I_n}\Big(\sup_{h\in\mathcal H^r_J}\|h-h^r\|_\infty^2\ \sup_w\|\widetilde b^K(w)\|^2\ \big\|(G_b^{-1/2}SG^{-1/2})^-_l\big\|^2\Big)\Big/(\eta_Jv_J)\leq\max_{J\in\mathcal I_n}\frac{\big(\Delta_{J,n}\zeta_Js_J^{-1}\big)^2}{\eta_Jv_J}=o(1).$$
Consider $T_{2,J}(h)$.
For all $J\in\mathcal I_n$ and $h\in\mathcal H^r_J$ we evaluate
$$|T_{2,J}(h)|\leq\frac{2n}{\eta_Jv_J}\Big\|\frac1n\sum_{i=1}^n\Big((h-h^r)(X_i)\widetilde b^K(W_i)-\mathrm E\big[(h-h^r)(X)\widetilde b^K(W)\big]\Big)\Big\|^2\big\|G^{1/2}(\widehat A-A)G_b^{1/2}\big\|^2+\frac{2n}{\eta_Jv_J}\Big\|G^{1/2}(\widehat A-A)G_b^{1/2}\,\mathrm E\big[(h-h^r)(X)\widetilde b^K(W)\big]\Big\|^2$$
$$=2T_{21,J}(h)+2T_{22,J}(h).$$
Consequently, using Chen and Christensen [2018, Lemma F.10(b)] (with $G_\psi$ replaced by $G$), i.e.,
$$\big\|G^{1/2}(\widehat A-A)G_b^{1/2}\big\|=O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big),$$
we obtain
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{21,J}(h)>\frac13\Big)\leq\sum_{J\in\mathcal I_n}P_h\Big(\sup_{h\in\mathcal H^r_J}T_{21,J}(h)>\frac13\Big)\leq C\sum_{J\in\mathcal I_n}\bigg(\int_0^1\sqrt{\log N_{[\,]}(wC,\mathcal H^r_J,\|\cdot\|_\mu)}\,dw\bigg)^2\frac{K(J)}{\eta_Jv_J}\Delta_{J,n}^2\,n^{-1}s_J^{-2}\zeta_J^2\log J$$
$$\leq Cn^{-1}s_{\overline J}^{-2}\zeta^2(\log n)\sum_{J\in\mathcal I_n}\frac{C_{J,n}\Delta_{J,n}^2}{\eta_J},$$
where $s_{\overline J}^{-1}\zeta\sqrt{(\log n)/n}=o(1)$, and consequently
$$P_h\Big(\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{21,J}(h)>\frac13\Big)=o\Big(\sum_{J\in\mathcal I_n}\frac{C_{J,n}\Delta_{J,n}^2}{\eta_J}\Big)=o(1),$$
where the last equality is due to Assumption 6(i). Consider $T_{22,J}(h)$. We make use of the inequality
$$T_{22,J}(h)\leq\frac n{\eta_Jv_J}\big\|\Pi_KT(h-h^r)\big\|^2_{L^2(W)}\big\|G^{1/2}(\widehat A-A)G_b^{1/2}\big\|^2.$$
Now Assumption 3 implies $\|\Pi_KT(h-h^r)\|_{L^2(W)}=O(s_J\|h-h^r\|_\mu)$ and thus we obtain
$$\max_{J\in\mathcal I_n}\sup_{h\in\mathcal H^r_J}T_{22,J}(h)=O_p\Big(\max_{J\in\mathcal I_n}\frac{\Delta_{J,n}^2\zeta_J^2\log J}{\eta_Jv_J}\Big)=o_p(1),$$
by Assumption 6(i). The bounds for $T_{3,J}(h)$ and $T_{4,J}(h)$ follow analogously. Consider $III$. We make use of the inequality
$$III\leq P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{n\widehat D_J(\widehat h^r_J)}{\eta_J\widehat v_J(h^r)}\Big|>1\Big)+P_h\Big(\max_{J\in\mathcal I_n}\Big|1-\frac{\widehat v_J(h^r)}{\widehat v_J(\widehat h^r_J)}\Big|>\frac{\sqrt2}3\Big),$$
where the first term on the right-hand side tends to zero, which follows immediately from the upper bounds derived for $I$ and $II$. Consequently, from Lemma F.7 we infer $III=o(1)$. Step 2:
We control the type II error of the test statistic $\widetilde{ST}_n$, where $\eta_J'$ is replaced by $\eta_J''>0$. Let $J^*$ be as in the proof of Theorem 3.1. We then obtain uniformly over $h\in\mathcal H_1(\delta^\circ,r_n)$ that
$$P_h\big(\widetilde{ST}_n=0\big)\leq P_h\big(n\widehat D_{J^*}(\widehat h^r_{J^*})\leq\eta''_{J^*}v_{J^*}\big)$$
$$\leq P_h\Big(\big|\|\mathrm E_h[V_{J^*}]\|^2-\widehat D_{J^*}(h^r)\big|>\frac{\|\mathrm E_h[V_{J^*}]\|^2}2-\frac{\eta''_{J^*}v_{J^*}}n\Big)+P_h\Big(\big|\widehat D_{J^*}(\widehat h^r_{J^*})-\widehat D_{J^*}(h^r)\big|>\frac{\|\mathrm E_h[V_{J^*}]\|^2}2-\frac{\eta''_{J^*}v_{J^*}}n\Big),$$
where the first summand on the right-hand side tends to zero following the proof of Theorem 3.1. The second summand tends to zero analogously to the previous step of this proof. Step 3:
Finally, we account for estimation of the normalization factor $v_J$. We control the type I error of the test $\widehat{ST}_n$ as follows, using for any $\widetilde J\in\mathcal I_n$ and $h\in\mathcal H_0$, by making use of Assumption 6(iii),
$$P_h\big(\widehat{ST}_n=1\big)\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J(\widehat h^r_J)|/(\widehat\eta_Jv_J)\big\}>1,\ \widehat\eta_J\geq C\eta_J\text{ for all }J\in\mathcal I_n\Big)+P_h\big(\widehat\eta_J<C\eta_J\text{ for some }J\in\mathcal I_n\big)+o(1)$$
$$\leq P_h\Big(\max_{J\in\mathcal I_n}\big\{|n\widehat D_J(\widehat h^r_J)|/(C\eta_Jv_J)\big\}>1\Big)+P_h\big(\widehat\eta_{\widetilde J}<C\eta_{\widetilde J}\big)+o(1)=o(1),$$
where the last equality is due to Step 1. To bound the type II error of the test $\widehat{ST}_n$, recall the definition of $J^*\in\mathcal I_n$ introduced in Step 2. We evaluate for all $h\in\mathcal H_1(\delta^\circ,r_n)$, by making use of Assumption 6(iii), that
$$P_h\big(\widehat{ST}_n=0\big)\leq P_h\big(|n\widehat D_{J^*}(\widehat h^r_{J^*})|\leq\widehat\eta_{J^*}v_{J^*},\ \widehat\eta_{J^*}\leq c_2\eta_{J^*}\big)+P_h\big(\widehat\eta_{J^*}>c_2\eta_{J^*}\big)=P_h\big(|n\widehat D_{J^*}(\widehat h^r_{J^*})|\leq c_2\eta_{J^*}v_{J^*}\big)+o(1)=o(1),$$
where the last equality is due to Step 2. Proof of Corollary 3.3.
Proof of (3.12). We observe
$$\limsup_{n\to\infty}\sup_{h\in\mathcal H_0}P_h\big(h\notin\mathcal C_n(\alpha)\big)=\limsup_{n\to\infty}\sup_{h\in\mathcal H_0}P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h)}{\widehat\eta_J\widehat v_J(h)}>\sqrt2\Big)\leq\alpha,$$
where the last inequality is due to Step 1 of the proof of Theorem 3.3 and Step 3 of the proof of Theorem 3.4.

Proof of (3.13). Let $J^*$ be as in Step 2 of the proof of Theorem 3.3. We observe for all $h\in\mathcal H_1(\delta^\circ,r_n)$ that
$$P_h\big(h\notin\mathcal C_n(\alpha)\big)=P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h)}{\widehat\eta_J\widehat v_J(h)}>\sqrt2\Big)=1-P_h\Big(\max_{J\in\mathcal I_n}\frac{n\widehat D_J(h)}{\widehat\eta_J\widehat v_J(h)}\leq\sqrt2\Big)\geq1-P_h\Big(\frac{n\widehat D_{J^*}(h)}{\widehat\eta_{J^*}\widehat v_{J^*}(h)}\leq\sqrt2\Big)\geq1-\alpha,$$
for $n$ sufficiently large, where the last inequality is due to Step 2 of the proof of Theorem 3.3 and Step 3 of the proof of Theorem 3.4.

Proof of Corollary 3.4. For any $h_0\in\mathcal H_0$, we analyze the diameter of the confidence set $\mathcal C_n(\alpha)$ under $P_{h_0}$. For all $h\in\mathcal C_n(\alpha)\subset\mathcal H$ it holds, for all $J\in\mathcal I_n$, by using the definition of the projection $Q_J$:
$$\|h-h_0\|_\mu\leq\|Q_J\Pi_J(h-h_0)\|_\mu+\|\Pi_Jh-h\|_\mu+\|\Pi_Jh_0-h_0\|_\mu\leq\|Q_J(h-h_0)\|_\mu+O\big(J^{-p/d_x}\big),$$
where the second inequality is due to the triangle inequality and Assumptions 2(ii) and 3(ii). The upper bound established in (F.4) yields
$$\Big|\|Q_J(h-h_0)\|_\mu^2-\widehat D_J(h)\Big|\leq C\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big)$$
with probability approaching one. Consequently, the definition of the confidence set $\mathcal C_n(\alpha)$, with $h\in\mathcal C_n(\alpha)$, gives
$$\|Q_J(h-h_0)\|_\mu^2\leq\widehat D_J(h)+C\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big)$$
$$\leq\sqrt2\,n^{-1}\widehat\eta_J\widehat v_J+C\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big)\leq C\Big(n^{-1}\eta_Jv_J+n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|\Big)$$
with probability approaching one. We may choose $J=J_0\in\mathcal I_n$ for $n$ sufficiently large and hence the result follows by applying Lemma F.1 and Assumption 5(ii), i.e., $\eta_{J_0}=O(\sqrt{\log\log n})$.

E. Proofs of Adaptive IT Results in Subsection 4.2
For the adaptive IT results, we may consider the restriction to the index set $\mathcal I_n$ given in (B.1), but where the upper bound $\overline J$ is replaced by $\overline K$. As in Appendix B, this implies that $\widehat\eta_K(\alpha)$ is deterministic; it is denoted by $\eta_K$ in the following.
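The object analyzed below, the leave-one-out statistic $\widehat D_K$ built from residuals $\rho(Y_i,h^r(X_i))$ and the instrument sieve $b^K$ weighted by $(B'B/n)^-$, can be computed as in the following sketch. The implementation via a Cholesky orthonormalization of the sample Gram matrix is an illustrative choice, not the paper's exact algorithm.

```python
import numpy as np

def it_statistic(rho, B):
    """Leave-one-out (off-diagonal) estimate of ||Pi_K m(., h^r)||^2:
    D_K = (n(n-1))^{-1} sum_{i != i'} rho_i rho_{i'} b^K(W_i)' (B'B/n)^{-1} b^K(W_{i'}),
    implemented by orthonormalizing the basis with a Cholesky factor of the
    sample Gram matrix (illustrative simplification)."""
    n = B.shape[0]
    G = B.T @ B / n                          # sample Gram matrix of b^K(W)
    L = np.linalg.cholesky(G)                # G = L L'
    Bt = B @ np.linalg.inv(L).T              # rows: orthonormalized basis values
    s = rho @ Bt                             # K-vector of residual-weighted sums
    full = s @ s                             # includes the diagonal terms i = i'
    diag = np.sum((rho[:, None] * Bt) ** 2)  # remove the i = i' terms
    return (full - diag) / (n * (n - 1))
```

Under the null the conditional moment $m(\cdot,h^r)$ is zero and the statistic is centered at zero; large positive values signal a violation of the restriction.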
The proof is based on two steps, where we bound the type I and the type II errors of $\widehat{IT}_n$ separately.

Step 1: The definition of the test statistic $\widehat{IT}_n$ implies for all $h\in\mathcal H_0$:
$$P_h\big(\widehat{IT}_n=1\big)\leq\underbrace{P_h\Big(\max_{K\in\mathcal I_n}\frac{n\widehat D_K(h^r)}{\eta_K\widehat v_K(h^r)}>\frac{\sqrt2}3\Big)}_{I}+\underbrace{P_h\Big(n\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}\big|\widehat D_K(h)-\widehat D_K(h^r)\big|/(\eta_K\widehat v_K(h^r))>\frac{\sqrt2}3\Big)}_{II}$$
$$+\underbrace{P_h\Big(\max_{K\in\mathcal I_n}\Big|\frac{n\widehat D_K(\widehat h^r)}{\eta_K\widehat v_K(h^r)}\Big|\max_{K\in\mathcal I_n}\Big|1-\frac{\widehat v_K(h^r)}{\widehat v_K(\widehat h^r)}\Big|>\frac{\sqrt2}3\Big)}_{III}.$$
We obtain $I=o(1)$ following the proof of Theorem 3.1 (with $U_i$ replaced by $\rho(Y_i,h^r(X_i))$ and $\widehat A'G\widehat A$ replaced by $(B'B/n)^-$). Consider $II$. The definition of the estimator $\widehat D_K$ implies for all $K\in\mathcal I_n$ and $h\in\mathcal H^r$ that
$$\frac n{\eta_Kv_K}\big(\widehat D_K(h)-\widehat D_K(h^r)\big)=\frac1{\eta_Kv_K(n-1)}\sum_{i\ne i'}\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\big(\rho(Y_{i'},h(X_{i'}))-\rho(Y_{i'},h^r(X_{i'}))\big)\widetilde b^K(W_i)'\widetilde b^K(W_{i'})$$
$$+\frac2{\eta_Kv_K(n-1)}\sum_{i\ne i'}\rho(Y_i,h^r(X_i))\big(\rho(Y_{i'},h(X_{i'}))-\rho(Y_{i'},h^r(X_{i'}))\big)\widetilde b^K(W_i)'\widetilde b^K(W_{i'})=T_{1,K}(h)+T_{2,K}(h).$$
Consider $T_{1,K}(h)$. We obtain
$$|T_{1,K}(h)|\leq\frac1{\eta_Kv_K}\Big\|n^{-1/2}\sum_{i=1}^n\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\widetilde b^K(W_i)\Big\|^2+\frac1{n\eta_Kv_K}\sum_{i=1}^n\Big\|\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\widetilde b^K(W_i)\Big\|^2=T_{11,K}(h)+T_{12,K}(h).$$
Consider $T_{11,K}(h)$. We obtain
$$P_h\Big(\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}T_{11,K}(h)>\frac13\Big)\leq P_h\Big(\exists K\in\mathcal I_n:\ \sup_{h\in\mathcal H^r}\Big\|\frac1{\sqrt n}\sum_{i=1}^n\Big(\big(\rho(Y_i,h(X_i))-\rho(Y_i,h^r(X_i))\big)\widetilde b^K(W_i)-\mathrm E\big[\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b^K(W)\big]\Big)\Big\|$$
$$>\sqrt{\eta_Kv_K/3}-\sup_{h\in\mathcal H^r}\Big\|\mathrm E\big[\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b^K(W)\big]\Big\|\Big).$$
We evaluate, by using Assumption 3(ii), that uniformly in $h\in\mathcal H^r$:
$$\Big\|\mathrm E\big[\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b^K(W)\big]\Big\|\leq C\|h-h^r\|_\mu\leq C\Delta_n.$$
Further, for all $K\in\mathcal I_n$ and $1\leq k\leq K$ we have
$$\mathrm E\sup_{h\in\mathcal H^r}\big|\big(\rho(Y,h(X))-\rho(Y,h^r(X))\big)\widetilde b_k(W)\big|^2\leq C\Delta_n^{2\kappa}$$
by Assumption 6(i). Consequently, following the derivation of the upper bound (D.1), we obtain
$$P_h\Big(\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}T_{11,K}(h)>\frac13\Big)\leq C\max_{K\in\mathcal I_n}\frac{K\Delta_n^{2\kappa}}{\eta_Kv_K}\bigg(\int_0^1\sqrt{\log N_{[\,]}(wC,\mathcal H^r,\|\cdot\|_\infty)}\,dw\bigg)^2\leq\frac C{(1-c_1)^2}\max_{K\in\mathcal I_n}\frac{K\Delta_n^{2\kappa}}{\eta_Kv_K}=o(1),$$
where the second inequality is based on the inequality $\Delta_n^2\leq c_1\eta_Kv_K$ for some $0<c_1<1$, uniformly over $K\in\mathcal I_n$, as $n$ becomes sufficiently large. Consider $T_{12,K}(h)$. We obtain
$$\max_{K\in\mathcal I_n}\sup_{h\in\mathcal H^r}T_{12,K}(h)\leq\max_{K\in\mathcal I_n}\frac{(\Delta_n\zeta_K)^2}{\eta_Kv_K}=o(1).$$
The bound for $T_{2,K}(h)$ follows analogously. Step 2:
It is sufficient to control the type II error of the statistic
$$\widetilde{IT}_n=\mathbb 1\Big\{\max_{K\in\mathcal I_n}\big|n\widehat D_K(\widehat h^r)/(\eta''_Kv_K)\big|>1\Big\}$$
for some $\eta''_K>0$. Let $K^*$ denote the largest integer such that $n^{-1}\sqrt{K^*}\leq r_n^2$. We evaluate for all $h\in\mathcal M_1(\delta^\circ,r_n)$ that
$$P_h\big(\widetilde{IT}_n=0\big)=P_h\big(n\widehat D_K(\widehat h^r)\leq\eta''_Kv_K\ \text{for all }K\in\mathcal I_n\big)\leq P_h\big(n\widehat D_{K^*}(\widehat h^r)\leq\eta''_{K^*}v_{K^*}\big).$$
We make use of the notation $B_K=\|\mathrm E_h[V_K]\|^2-\|m(\cdot,h^r)\|^2_{L^2(W)}$, where we write $V_K=\rho(Y,h^r(X))G_b^{-1/2}b^K(W)$. We obtain uniformly over $h\in\mathcal M_1(\delta^\circ,r_n)$ that
$$P_h\big(n\widehat D_{K^*}(\widehat h^r)\leq\eta''_{K^*}v_{K^*}\big)=P_h\Big(\|\mathrm E_h[V_{K^*}]\|^2-\widehat D_{K^*}(\widehat h^r)>\|\mathrm E_h[V_{K^*}]\|^2-\frac{\eta''_{K^*}v_{K^*}}n\Big)$$
$$\leq\underbrace{P_h\bigg(\Big|\frac1{n(n-1)}\sum_{k=1}^{K^*}\sum_{i\ne i'}\big(V_{ik}V_{i'k}-\mathrm E_h[V_k]^2\big)\Big|>\frac13\|m(\cdot,h^r)\|^2_{L^2(W)}-\frac{\eta''_{K^*}v_{K^*}}n+B_{K^*}\bigg)}_{IV}$$
$$+\underbrace{P_h\bigg(\Big|\frac1n\sum_{i=1}^n\big(m^2(W_i,h^r)-\mathrm E[m^2(W,h^r)]\big)\Big|>\frac13\|m(\cdot,h^r)\|^2_{L^2(W)}-\frac{\eta''_{K^*}v_{K^*}}n+B_{K^*}\bigg)}_{V}$$
$$+\underbrace{P_h\bigg(\big|\widehat D_{K^*}(\widehat h^r)-\widehat D_{K^*}(h^r)\big|>\frac13\|m(\cdot,h^r)\|^2_{L^2(W)}-\frac{\eta''_{K^*}v_{K^*}}n+B_{K^*}\bigg)}_{VI},$$
using $n/(n-1)\leq2$ for all $n\geq2$. Consider $IV$. We observe $|B_{K^*}|\leq C_Br_n^2$, which is due to the upper bound
$$\|\mathrm E_h[V_{K^*}]\|=\big(\mathrm E_h[\rho(Y,h^r(X))b^{K^*}(W)']\,G_b^{-1}\,\mathrm E_h[\rho(Y,h^r(X))b^{K^*}(W)]\big)^{1/2}=\|\Pi_{K^*}m(\cdot,h^r)\|_{L^2(W)}.$$
Further, from the proof of Lemma F.9 we deduce
$$\mathrm E_h\Big|\frac1{n(n-1)}\sum_{k=1}^{K^*}\sum_{i\ne i'}\big(V_{ik}V_{i'k}-\mathrm E_h[V_k]^2\big)\Big|^2\leq\frac Cn\Big(\|\Pi_{K^*}m(\cdot,h^r)\|^2_{L^2(W)}+\frac{v_{K^*}^2}n\Big).$$
Consequently, Markov's inequality yields
$$IV=O\bigg(\frac{n^{-1}\|\Pi_{K^*}m(\cdot,h^r)\|^2_{L^2(W)}+n^{-2}v_{K^*}^2}{\big(\|m(\cdot,h^r)\|^2_{L^2(W)}-\eta''_{K^*}n^{-1}v_{K^*}+B_{K^*}\big)^2}\bigg).\qquad(\text{E.1})$$
In the following, we distinguish between two cases. First, consider the case where $n^{-2}v_{K^*}^2$ dominates the numerator. For any $h\in\mathcal M_1(\delta^\circ,r_n)$ we have $\|m(\cdot,h^r)\|^2_{L^2(W)}\geq\delta^\circ r_n^2$ and hence, using that $v_{K^*}\geq c_0\sqrt{K^*}$ for some constant $0<c_0<1$, we obtain the lower bound
$$\|m(\cdot,h^r)\|^2_{L^2(W)}-\eta''_{K^*}n^{-1}v_{K^*}+B_{K^*}\geq(\delta^\circ-c_0-C_B)r_n^2\geq C_1r_n^2$$
for the constant $C_1:=\delta^\circ-c_0-C_B$, which is positive for any $\delta^\circ>c_0+C_B$. From inequality (E.1) we infer $IV=O\big(r_n^{-4}n^{-2}v_{K^*}^2\big)=o(1)$. Second, consider the case where $n^{-1}\|\Pi_{K^*}m(\cdot,h^r)\|^2_{L^2(W)}$ dominates. For any $h\in\mathcal M_1(\delta^\circ,r_n)$ we have $\|m(\cdot,h^r)\|^2_{L^2(W)}\geq\delta^\circ r_n^2>\delta^\circ n^{-1}\sqrt{K^*}$ and hence obtain the lower bound
$$\|m(\cdot,h^r)\|^2_{L^2(W)}-\eta''_{K^*}n^{-1}v_{K^*}+B_{K^*}\geq\Big(1-\frac{1+C_B}{\delta^\circ}\Big)\|m(\cdot,h^r)\|^2_{L^2(W)}\geq c_2\|m(\cdot,h^r)\|^2_{L^2(W)}$$
for the constant $c_2:=1-(1+C_B)/\delta^\circ$, which is positive for any $\delta^\circ>1+C_B$. Hence, inequality (E.1) yields for all $h\in\mathcal M_1(\delta^\circ,r_n)$ that
$$IV=O\big(n^{-1}\|m(\cdot,h^r)\|^{-2}_{L^2(W)}\big)=O\big(n^{-1}r_n^{-2}\big)=o(1).$$
Finally, $V=o(1)$ by making use of Lemma F.4 (with $U_i$ replaced by $\rho(Y_i,h^r(X_i))$ and $\widehat A'G\widehat A$ replaced by $(B'B/n)^-$), and $VI=o(1)$ by following Step 1.

F. Technical Results
Theorem F.1.
Let Assumptions 1–3 be satisfied. Then, it holds that
$$\big|\widehat D_J-\|h-h_0\|_\mu^2\big|=O_p\Big(n^{-1}s_J^{-2}\sqrt J+n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+J^{-2p/d_x}\Big).$$
Proof.
We make use of the decomposition
$$\widehat D_J-\|h-h_0\|_\mu^2=\widehat D_J-\|Q_J(h-h_0)\|_\mu^2+\|Q_J(h-h_0)\|_\mu^2-\|h-h_0\|_\mu^2.$$
Note that
$$\|Q_J(h-h_0)\|_\mu^2=\int\big(\psi^J(x)'A\,\mathrm E_h[(Y-h_0(X))b^K(W)]\big)^2\mu(x)dx=\mathrm E_h[(Y-h_0(X))b^K(W)]'A'\underbrace{\int\psi^J(x)\psi^J(x)'\mu(x)dx}_{=G}A\,\mathrm E_h[(Y-h_0(X))b^K(W)]$$
$$=\big\|G^{1/2}A\,\mathrm E_h[(Y-h_0(X))b^K(W)]\big\|^2=\|\mathrm E_h[V_J]\|^2,$$
using the notation $V_{Ji}=(Y_i-h_0(X_i))G^{1/2}Ab^K(W_i)$. Thus, the definition of the estimator $\widehat D_J$ implies
$$\widehat D_J-\|Q_J(h-h_0)\|_\mu^2=\frac1{n(n-1)}\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\qquad(\text{F.1})$$
$$+\frac1{n(n-1)}\sum_{i\ne i'}Y_iY_{i'}b^K(W_i)'\big(A'GA-\widehat A'G\widehat A\big)b^K(W_{i'}),\qquad(\text{F.2})$$
where we bound both summands on the right-hand side separately in the following. Consider the summand in (F.1); we observe
$$\Big|\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\Big|^2=\sum_{j,j'=1}^J\sum_{i\ne i'}\sum_{i''\ne i'''}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big).$$
We distinguish three different cases: first, $i,i',i'',i'''$ are all different; second, either $i=i''$ or $i'=i'''$; third, $i=i''$ and $i'=i'''$. We thus calculate for each $j,j'\geq1$:
$$\sum_{i\ne i'}\sum_{i''\ne i'''}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big)=\sum_{i,i',i'',i'''\text{ all different}}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big)$$
$$+2\sum_{i\ne i'\ne i''}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'j'}-\mathrm E_h[V_{j'}]^2\big)+\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{ij'}V_{i'j'}-\mathrm E_h[V_{j'}]^2\big).$$
Due to independent observations we have
$$\sum_{i,i',i'',i'''\text{ all different}}\mathrm E_h\Big[\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\big(V_{i''j'}V_{i'''j'}-\mathrm E_h[V_{j'}]^2\big)\Big]=0.$$
Consequently, we calculate
$$\mathrm E_h\Big|\frac1{n(n-1)}\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\Big|^2=\frac{2(n-2)}{n(n-1)}\underbrace{\sum_{j,j'=1}^J\mathrm E_h[V_j]\,\mathrm E_h[V_{j'}]\,\mathrm{Cov}_h(V_j,V_{j'})}_{=:\,I}+\frac2{n(n-1)}\underbrace{\Big(\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2-\Big(\sum_{j=1}^J\mathrm E_h[V_j]^2\Big)^2\Big)}_{=:\,II}.$$
To bound the summand $I$ we observe that
$$I=\sum_{j,j'=1}^J\mathrm E_h[V_j]\,\mathrm E_h[V_{j'}]\,\mathrm{Cov}_h(V_j,V_{j'})=\mathrm E_h[V_J]'\,\mathrm{Cov}_h(V_J,V_J)\,\mathrm E_h[V_J]$$
$$\leq\lambda_{\max}\big(\mathrm{Var}_h\big((Y-h_0(X))G_b^{-1/2}b^K(W)\big)\big)\big\|G_b^{1/2}A'G^{1/2}\,\mathrm E_h[V_J]\big\|^2\leq\overline\sigma^2\big\|\mathrm E_h[(Y-h_0(X))b^K(W)]'A'GAG_b^{1/2}\big\|^2$$
$$=\overline\sigma^2\Big\|\int Q_J(h-h_0)(x)\,\psi^J(x)'\mu(dx)\,(G_b^{-1/2}S)^-_l\Big\|^2,$$
by using the notation $V_{Ji}=(Y_i-h_0(X_i))G^{1/2}Ab^K(W_i)$, $AG_b^{1/2}=\big((G_b^{-1/2}S)^-_l\big)'$, and Lemma F.8, i.e., $\lambda_{\max}\big(\mathrm{Var}_h\big((Y-h_0(X))\widetilde b^K(W)\big)\big)\leq\overline\sigma^2$. Consider $II$. We observe
$$II=\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2-\Big(\sum_{j=1}^J\mathrm E_h[V_j]^2\Big)^2\leq\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2=v_J^2.$$
The upper bounds derived for the terms $I$ and $II$ imply for all $n\geq2$:
$$\mathrm E_h\Big|\frac1{n(n-1)}\sum_{j=1}^J\sum_{i\ne i'}\big(V_{ij}V_{i'j}-\mathrm E_h[V_j]^2\big)\Big|^2\leq\overline\sigma^2C\Big(\frac1n\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|^2+\frac{v_J^2}{n^2}\Big).\qquad(\text{F.3})$$
Consequently, equality (F.2) together with Lemma F.4 yields
$$\widehat D_J-\|Q_J(h-h_0)\|_\mu^2=O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle_\mu'(G_b^{-1/2}S)^-_l\big\|+n^{-1}v_J\Big),\qquad(\text{F.4})$$
which implies the variance part by employing Lemma F.1. Finally, Lemma F.3 implies for the bias term
$$\|Q_J(h-h_0)\|_\mu^2-\|h-h_0\|_\mu^2=O\big(J^{-2p/d}\big),$$
which completes the proof. Lemma F.1.
Let Assumption 1(ii) be satisfied. Then, it holds that $v_J\leq\overline\sigma^2s_J^{-2}\sqrt J$.

Proof.

In the following, let $e_j$ be the unit vector with $1$ at the $j$-th position. We obtain
$$v_J^2=\sum_{j,j'=1}^J\big(\mathrm E_h[V_jV_{j'}]\big)^2=\sum_{j=1}^J\Big\|\mathrm E\big[\underbrace{\mathrm E_h[(Y-h_0(X))^2|W]\,e_j'G^{1/2}Ab^K(W)}_{=:\,\chi_j(W)}\,G^{1/2}Ab^K(W)\big]\Big\|^2$$
$$=\sum_{j=1}^J\big\|G^{1/2}AG_b^{1/2}\,\mathrm E[\chi_j(W)\widetilde b^K(W)]\big\|^2\leq s_J^{-2}\sum_{j=1}^J\big\|\mathrm E[\chi_j(W)\widetilde b^K(W)]\big\|^2,$$
using the relationship $\|G^{1/2}AG_b^{1/2}\|=s_J^{-1}$. For all $j\geq1$ we have $\|\mathrm E[\chi_j(W)\widetilde b^K(W)]\|\leq\|\chi_j\|_{L^2(W)}$. Now using that $\sup_{w\in\mathcal W}\sup_{h\in\mathcal H}\mathrm E_h[(Y-h(X))^2|W=w]\leq\overline\sigma^2$ due to Assumption 1(ii), we get
$$\sum_{j=1}^J\|\chi_j\|^2_{L^2(W)}\leq\overline\sigma^4\sum_{j=1}^J\mathrm E\big|e_j'G^{1/2}Ab^K(W)\big|^2=\overline\sigma^4\sum_{j=1}^Je_j'G^{1/2}AG_bA'G^{1/2}e_j\leq\overline\sigma^4s_J^{-2}J,$$
which yields $v_J^2\leq\overline\sigma^4s_J^{-4}J$ and hence the assertion.
Let Assumption 1 (iii) be satisfied. Then, it holds
\[
\sqrt{\sum_{j=1}^J s_j^{-4}} \le \underline{\sigma}^{-2}\,v_J,
\]
where $s_j^{-1}$, $1 \le j \le J$, are the nondecreasing singular values of $G^{1/2}AG_b^{1/2}$.

Proof. In the following, let $e_j$ be the unit vector with 1 at the $j$-th position. Introduce a unitary matrix $Q$ such that by Schur decomposition
\[
Q'G^{1/2}AG_bA'G^{1/2}Q = \operatorname{diag}\big(s_1^{-2},\ldots,s_J^{-2}\big).
\]
We make use of the notation $\tilde V_{Ji} = (Y_i-h(X_i))Q'G^{1/2}Ab^K(W_i)$. Now since the Frobenius norm is invariant under unitary matrix multiplication we have
\[
v_J^2 = \sum_{j,j'=1}^J\big(\mathbb{E}_h[\tilde V_j\tilde V_{j'}]\big)^2 \ge \sum_{j=1}^J\big(\mathbb{E}_h[\tilde V_j^2]\big)^2
= \sum_{j=1}^J\Big(\mathbb{E}\big|(Y-h(X))\,e_j'Q'G^{1/2}Ab^K(W)\big|^2\Big)^2
\ge \underline{\sigma}^4\sum_{j=1}^J\Big(\mathbb{E}\big[e_j'Q'G^{1/2}Ab^K(W)b^K(W)'A'G^{1/2}Qe_j\big]\Big)^2
\]
\[
= \underline{\sigma}^4\sum_{j=1}^J\big(e_j'Q'G^{1/2}AG_bA'G^{1/2}Qe_j\big)^2
= \underline{\sigma}^4\sum_{j=1}^J\big(e_j'\operatorname{diag}(s_1^{-2},\ldots,s_J^{-2})e_j\big)^2
\ge \underline{\sigma}^4\sum_{j=1}^J s_j^{-4},
\]
using $\inf_{w\in\mathcal W}\inf_{h\in\mathcal H}\mathbb{E}_h[(Y-h(X))^2|W=w] \ge \underline{\sigma}^2$ by Assumption 1 (iii).

Lemma F.3.
Let Assumptions 2 and 3 be satisfied. Then, for all $h\in\mathcal H$ we have
\[
\|Q_J(h-h_0)\|^2_\mu = \|h-h_0\|^2_\mu + O\big(J^{-2p/d}\big).
\]

Proof. Using the notation $\tilde b^K(\cdot) := G_b^{-1/2}b^K(\cdot)$, we observe for all $h\in\mathcal H$ that
\[
\|Q_J(h-h_0)\|_\mu = \big\|(G_b^{-1/2}SG^{-1/2})^-_l\,\mathbb{E}[\tilde b^K(W)(h-h_0)(X)]\big\|
\le \big\|(G_b^{-1/2}SG^{-1/2})^-_l\,\mathbb{E}[\tilde b^K(W)(\Pi_Jh-\Pi_Jh_0)(X)]\big\|
+ \big\|(G_b^{-1/2}SG^{-1/2})^-_l\,\mathbb{E}\big[\tilde b^K(W)\big((h-h_0)(X)-(\Pi_Jh-\Pi_Jh_0)(X)\big)\big]\big\|
\]
\[
\le \|\Pi_Jh-\Pi_Jh_0\|_\mu + s_J^{-1}\big\|\Pi_KT\big((h-h_0)-(\Pi_Jh-\Pi_Jh_0)\big)\big\|_{L^2(W)}
\le \|\Pi_Jh-\Pi_Jh_0\|_\mu + O\big(J^{-p/d}\big)
\]
by making use of Assumption 3 (ii).

Lemma F.4.
Let Assumptions 1–3 be satisfied. Then, uniformly in $h\in\mathcal H$ it holds
\[
\frac{1}{n(n-1)}\sum_{i\ne i'}\big(Y_i-h(X_i)\big)\big(Y_{i'}-h(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})
= O_p\Big(n^{-1}s_J^{-2}\sqrt{J} + n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big).
\]

Proof.
In the proof, we establish an upper bound of
\[
\frac{1}{n^2}\sum_{i,i'}\big(Y_i-h(X_i)\big)\big(Y_{i'}-h(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})
= \mathbb{E}[(h-h_0)(X)b^K(W)]'\big(A'GA-\hat A'G\hat A\big)\mathbb{E}[(h-h_0)(X)b^K(W)]
\]
\[
\quad + 2\Big(\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i)' - \mathbb{E}[(h-h_0)(X)b^K(W)]'\Big)\big(A'GA-\hat A'G\hat A\big)\mathbb{E}[(h-h_0)(X)b^K(W)]
\]
\[
\quad + \Big(\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i)' - \mathbb{E}[(h-h_0)(X)b^K(W)]'\Big)\big(A'GA-\hat A'G\hat A\big)\Big(\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i) - \mathbb{E}[(h-h_0)(X)b^K(W)]\Big)
\]
uniformly in $h\in\mathcal H$. It is sufficient to bound the first summand on the right hand side. We make use of the decomposition
\[
\mathbb{E}[(h-h_0)(X)b^K(W)]'\big(A'GA-\hat A'G\hat A\big)\mathbb{E}[(h-h_0)(X)b^K(W)]
= 2\,\mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(A-\hat A)\,\mathbb{E}[(h-h_0)(X)b^K(W)]
- \mathbb{E}[(h-h_0)(X)b^K(W)]'(A-\hat A)'G(A-\hat A)\,\mathbb{E}[(h-h_0)(X)b^K(W)]
= 2T_1 - T_2,
\]
where we bound each summand separately in what follows. Consider $T_1$. Below, we show the result
\[
T_1 = O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big). \tag{F.5}
\]
To do so, we make use of the decomposition
\[
T_1 = \mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(\hat A-A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
+ \mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(\hat A-A)\,\mathbb{E}\big[\big(h-h_0-\Pi_J(h-h_0)\big)(X)b^K(W)\big]. \tag{F.6}
\]
Consider the first summand on the right hand side of the equation.
Using the definition of the left pseudoinverse we can write $\hat A = (\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}$ where $\hat S = n^{-1}\sum_i b^K(W_i)\psi^J(X_i)'$. Making use of the relation $Q_J\Pi_Jh = \Pi_Jh$ and $\hat SG^{-1}\langle h,\psi^J\rangle_\mu = n^{-1}\sum_i\Pi_Jh(X_i)b^K(W_i)$ yields
\[
\mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(A-\hat A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
= \int Q_J(h-h_0)(x)\Big(\Pi_J(h-h_0)(x) - \psi^J(x)'(\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}\,\mathbb{E}[(h-h_0)(X)b^K(W)]\Big)\mu(x)\,dx
\]
\[
= \langle Q_J(h-h_0),\psi^J\rangle'_\mu(\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}\Big(\frac1n\sum_i\Pi_J(h-h_0)(X_i)b^K(W_i) - \mathbb{E}[(h-h_0)(X)b^K(W)]\Big)
\]
\[
= \langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\Big(\frac1n\sum_i\Pi_J(h-h_0)(X_i)\tilde b^K(W_i) - \mathbb{E}[\Pi_J(h-h_0)(X)\tilde b^K(W)]\Big)
\]
\[
\quad + \langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\,G_b^{-1/2}S'\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big(\frac1n\sum_i\Pi_J(h-h_0)(X_i)\tilde b^K(W_i) - \mathbb{E}[\Pi_J(h-h_0)(X)\tilde b^K(W)]\Big)
= T_{11} + T_{12},
\]
where we used the notation $\tilde b^K(\cdot) = G_b^{-1/2}b^K(\cdot)$. Consider $T_{11}$.
We obtain
\[
\mathbb{E}|T_{11}| \le n^{-1/2}\,\mathbb{E}\Big|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\,\Pi_J(h-h_0)(X)\tilde b^K(W)\Big|
\le n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\|\Pi_KT(h-h_0)\|_{L^2(W)}
\]
\[
\quad + 2n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\big\|\Pi_KT\big(h-h_0-\Pi_J(h-h_0)\big)\big\|_{L^2(W)}
= O\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big),
\]
where the second bound is due to the Cauchy–Schwarz inequality and the third bound is due to Assumption 2. To establish an upper bound for $T_{12}$ we infer from Chen and Christensen [2018, Lemma F.10 (c)] that
\[
|T_{12}| \le \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|
\times \Big\|G_b^{-1/2}S'\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big\|
\times \Big\|\frac1n\sum_i b^K(W_i)\Pi_J(h-h_0)(X_i) - \mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]\Big\|
\]
\[
= \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\times O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big)\times O_p\big(n^{-1/2}\zeta_J\big)
= O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big),
\]
using Assumption 2 (i), i.e., $s_J^{-1}\zeta_J^2\sqrt{(\log J)/n} = O(1)$. Consider the second summand on the right hand side of (F.6).
Following the upper bound of $T_{12}$ we obtain
\[
\Big|\mathbb{E}[(h-h_0)(X)b^K(W)]'A'G(\hat A-A)\,\mathbb{E}\big[\big(h-h_0-\Pi_J(h-h_0)\big)(X)b^K(W)\big]\Big|
\le \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\Big\|G_b^{-1/2}S'\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big\|
\times \big\|\langle T(h-h_0-\Pi_J(h-h_0)),\tilde b^K\rangle_{L^2(W)}\big\|
\]
\[
\le \big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\,\big\|\Pi_KT\big(h-h_0-\Pi_J(h-h_0)\big)\big\|_{L^2(W)}\times O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big)
= O_p\Big(n^{-1/2}\big\|\langle Q_J(h-h_0),\psi^J\rangle'_\mu(G_b^{-1/2}S)^-_l\big\|\Big),
\]
using that $s_J^{-1}\|T(h-h_0-\Pi_J(h-h_0))\|_{L^2(W)} = O\big(\|h-h_0-\Pi_J(h-h_0)\|_\mu\big)$ by Assumption 3 (ii) and $\zeta_J\sqrt{\log J}\,\|h-\Pi_Jh\|_\mu = O(1)$ by Assumption 2 (ii), which implies the upper bound (F.5).

Consider $T_2$. We make use of the decomposition
\[
T_2 = \mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]'(\hat A-A)'G(\hat A-A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
+ 2\,\mathbb{E}[\Pi_J^\perp(h-h_0)(X)b^K(W)]'(\hat A-A)'G(\hat A-A)\,\mathbb{E}[\Pi_J(h-h_0)(X)b^K(W)]
\]
\[
\quad + \mathbb{E}[\Pi_J^\perp(h-h_0)(X)b^K(W)]'(\hat A-A)'G(\hat A-A)\,\mathbb{E}[\Pi_J^\perp(h-h_0)(X)b^K(W)]
= T_{21} + T_{22} + T_{23},
\]
where we denote the projection $\Pi_J^\perp = \operatorname{id}-\Pi_J$. Consider $T_{21}$.
We make use of the inequality
\[
\mathbb{E}\Big\|\Big(\frac1n\sum_i(h-h_0)(X_i)b^K(W_i) - \mathbb{E}[(h-h_0)(X)b^K(W)]\Big)'A'G^{1/2}\Big\|^2
\le n^{-1}\,\mathbb{E}\Big[(h-h_0)^2(X)\big\|b^K(W)'A'G^{1/2}\big\|^2\Big] \le n^{-1}s_J^{-2}\sqrt{J},
\]
which yields
\[
T_{21} \le \Big\|G^{1/2}\Big((\hat G_b^{-1/2}\hat S)^-_l\hat G_b^{-1/2}G_b^{1/2} - (G_b^{-1/2}S)^-_l\Big)\Big\|^2
\times \Big\|\frac1n\sum_i\big(Y_i-h(X_i)\big)b^K(W_i) - \mathbb{E}[(Y-h(X))b^K(W)]\Big\|^2
\]
\[
\quad + 2\Big\|\frac1n\sum_i\Big(b^K(W_i)(h-h_0)(X_i) - \mathbb{E}[(Y-h(X))b^K(W)]\Big)'A'G^{1/2}\Big\|^2
= O_p\big(n^{-1/2}s_J^{-1}\zeta_J\sqrt{\log J}\big)\times O_p\big(n^{-1/2}\zeta_J\big) + O_p\big(n^{-1}v_J\big)
= O_p\big(n^{-1}s_J^{-2}\sqrt{J}\big)
\]
using Chen and Christensen [2018, Lemma F.10 (b)] (with $G_\psi$ replaced by $G$) and that $s_J^{-1}\zeta_J^2\sqrt{(\log J)/n} = O(1)$ by Assumption 2 (i). Since $|T_{22}| \le \sqrt{T_{21}T_{23}}$ we conclude $T_2 = O_p\big(n^{-1}s_J^{-2}\sqrt{J}\big)$, which completes the proof.

Lemma F.5.
Let Assumptions 1–5 be satisfied. Then, under $H_0 = \{h_0\}$ it holds uniformly in $J\in\mathcal I_n$:
\[
\frac1n\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'}) = O_p(v_J).
\]

Proof.
We make use of the inequality
\[
\frac1n\Big|\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})\Big|
\le \frac1n\Big\|\sum_i\big(Y_i-h_0(X_i)\big)\tilde b^K(W_i)\Big\|^2\,\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\|
\]
\[
\le \frac1n\Big|\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\tilde b^K(W_i)'\tilde b^K(W_{i'})\Big|\,\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\|
+ \frac1n\sum_i\big\|\big(Y_i-h_0(X_i)\big)\tilde b^K(W_i)\big\|^2\,\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\|.
\]
Note that $n^{-1}\sum_i\big\|(Y_i-h_0(X_i))\tilde b^K(W_i)\big\|^2 \le \overline{\sigma}^2K + o_p(1)$ uniformly in $K$.
Further, by Lemma F.2 we have $v_J \ge \underline{\sigma}^2\sqrt{J}$ (we may assume that $s_1 = 1$) and thus we obtain, for any $\epsilon>0$,
\[
P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{1}{nv_J}\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\,b^K(W_i)'\big(A'GA-\hat A'G\hat A\big)b^K(W_{i'})\Big| > \epsilon\Big)
\]
\[
\le P_h\Big(\max_{J\in\mathcal I_n}\Big|\frac{1}{n\sqrt J}\sum_{i\ne i'}\big(Y_i-h_0(X_i)\big)\big(Y_{i'}-h_0(X_{i'})\big)\tilde b^K(W_i)'\tilde b^K(W_{i'})\Big| > \epsilon\,\underline{\sigma}^2\Big)
+ P_h\Big(\max_{J\in\mathcal I_n}\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\| > \epsilon\,\underline{\sigma}^2/2\Big)
\]
\[
\quad + P_h\Big(\max_{J\in\mathcal I_n}K\big\|G_b^{1/2}\big(A'GA-\hat A'G\hat A\big)G_b^{1/2}\big\| > \epsilon\,\underline{\sigma}^2/2\Big) + o(1),
\]
where each of the remaining terms on the right hand side is $o(1)$ by the arguments used in the proof of Lemma F.4, which completes the proof.

Lemma F.6.
Let Assumptions 1–5 be satisfied. Then, for some $c\in(0,1)$ it holds $|\hat v_J - v_J| \le c\,v_J$ wpa1 for all $J\in\mathcal I_n$.

Proof. We denote $G_\sigma = \mathbb{E}_h[(Y-h(X))^2b^K(W)b^K(W)']$ and its empirical analog $\hat G_\sigma = n^{-1}\sum_i(Y_i-h(X_i))^2b^K(W_i)b^K(W_i)'$. Note that for any $J\times J$ matrix $M$ it holds $\|M\|_F \le \sqrt J\,\|M\|$. Hence, for all $J\in\mathcal I_n$ the triangle inequality implies
\[
|\hat v_J - v_J| \le \big\|G^{1/2}\hat A\hat G_\sigma\hat A'G^{1/2} - G^{1/2}AG_\sigma A'G^{1/2}\big\|_F
\le \sqrt J\,\big\|G^{1/2}\hat A\hat G_\sigma\hat A'G^{1/2} - G^{1/2}AG_\sigma A'G^{1/2}\big\|.
\]
Thus, the result follows from the proof of Chen and Christensen [2015, Lemma E.16].
Lemma F.7.
Let Assumptions 1–5 be satisfied. Then, we have
\[
\max_{J\in\mathcal I_n}\Big|1 - \frac{\hat v_J(h_0)}{\hat v_J(\hat h_J)}\Big| = o_p(1).
\]

Proof.
For all $J\in\mathcal I_n$ and $h\in\mathcal H_{r_J}$ the triangle inequality implies
\[
|\hat v_J(h) - \hat v_J(h_0)| \le \Big\|G^{1/2}\hat A\,\frac1n\sum_i\Big(\big(Y_i-h(X_i)\big)^2 - \big(Y_i-h_0(X_i)\big)^2\Big)b^K(W_i)b^K(W_i)'\,\hat A'G^{1/2}\Big\|_F
\]
\[
\le \sqrt J\,\Big\|G^{1/2}\hat A\,\frac1n\sum_i\big(h(X_i)-h_0(X_i)\big)^2b^K(W_i)b^K(W_i)'\,\hat A'G^{1/2}\Big\|
+ 2\sqrt J\,\Big\|G^{1/2}\hat A\,\frac1n\sum_i\big(h_0(X_i)-h(X_i)\big)\big(Y_i-h_0(X_i)\big)b^K(W_i)b^K(W_i)'\,\hat A'G^{1/2}\Big\|
= T_1 + T_2.
\]
Consider $T_1$. Following the proof of Theorem 3.4 we obtain
\[
T_1 \le \sqrt J\,s_J^{-2}\,\Big\|\frac1n\sum_i\big(h(X_i)-h_0(X_i)\big)^2\tilde b^K(W_i)\tilde b^K(W_i)'\Big\| = O_p\big(\sqrt J\,s_J^{-2}n^{-1}\big) = o_p(1)
\]
uniformly in $J\in\mathcal I_n$ by Assumption 5 (i). Analogously, we obtain $T_2 = o_p(1)$ uniformly in $J\in\mathcal I_n$.

Lemma F.8.
Under Assumptions 1 (ii) and 2 (iii) it holds for all $h\in\mathcal H$ that
\[
\lambda_{\max}\Big(\operatorname{Var}_h\big(\rho(Y,h(X))G_b^{-1/2}b^K(W)\big)\Big) \le \overline{\sigma}^2 < \infty.
\]

Proof. For any $\gamma\in\mathbb R^K$ it holds
\[
\gamma'\operatorname{Var}_h\big(\rho(Y,h(X))G_b^{-1/2}b^K(W)\big)\gamma
\le \mathbb{E}\Big[\mathbb{E}_h[\rho^2(Z,h)|W]\big(\gamma'G_b^{-1/2}b^K(W)\big)^2\Big]
\le \overline{\sigma}^2\,\mathbb{E}\Big[\big(\gamma'G_b^{-1/2}b^K(W)\big)^2\Big]
= \overline{\sigma}^2\gamma'G_b^{-1/2}\,\mathbb{E}\big[b^K(W)b^K(W)'\big]G_b^{-1/2}\gamma = \overline{\sigma}^2\|\gamma\|^2,
\]
where the second inequality is due to Assumption 1 (ii).

Lemma F.9.
Under the conditions of Theorem 4.1 it holds
\[
\widehat D_K - \|m(\cdot,h)\|^2_{L^2(W)} = O_p\Big(n^{-1}\sqrt K + n^{-1/2}\|\Pi_Km(\cdot,h)\|_{L^2(W)} + K^{-2\gamma/d_w}\Big).
\]

Proof.
Similarly to Theorem F.1 we obtain
\[
\widehat D_K - \|m(\cdot,h)\|^2_{L^2(W)} = \widehat D_K - \|\Pi_Km(\cdot,h)\|^2_{L^2(W)} + \|\Pi_Km(\cdot,h)\|^2_{L^2(W)} - \|m(\cdot,h)\|^2_{L^2(W)}.
\]
Following the first part of the proof of Theorem F.1 with $V_{Ki}$ replaced by $\rho(Y_i,h(X_i))G_b^{-1/2}b^K(W_i)$ and using Lemma F.4 yields
\[
\widehat D_K - \|\Pi_Km(\cdot,h)\|^2_{L^2(W)} = O_p\Big(n^{-1}\sqrt K + n^{-1/2}\|\Pi_Km(\cdot,h)\|_{L^2(W)}\Big).
\]
Moreover, since $\Pi_K$ is an orthogonal projection on $L^2(W)$ we have $\langle\Pi_Km(\cdot,h)-m(\cdot,h),\Pi_Km(\cdot,h)\rangle_{L^2(W)} = 0$ and thus
\[
\big|\,\|\Pi_Km(\cdot,h)\|^2_{L^2(W)} - \|m(\cdot,h)\|^2_{L^2(W)}\big| = \|\Pi_Km(\cdot,h) - m(\cdot,h)\|^2_{L^2(W)} = O\big(K^{-2\gamma/d_w}\big),
\]
where the last bound is due to the sieve approximation rate imposed in Assumption 7.

G. U-statistics deviation results
We make use of the following exponential inequality established by Houdré and Reynaud-Bouret [2003].
Lemma G.1 (Houdré and Reynaud-Bouret [2003]). Let $U_n$ be a degenerate U-statistic of order 2 with kernel $R$ based on a simple random sample $Z_1,\ldots,Z_n$. Then there exists a generic constant $C>0$ such that for all $u>0$:
\[
P_h\Big(\Big|\sum_{1\le i<i'\le n}R(Z_i,Z_{i'})\Big| \ge C\big(\Lambda_1\sqrt u + \Lambda_2 u + \Lambda_3 u^{3/2} + \Lambda_4 u^2\big)\Big) \le 6\exp(-u),
\]
where
\[
\Lambda_1^2 = n(n-1)\,\mathbb{E}_h[R^2(Z_1,Z_2)],\qquad
\Lambda_2 = n\sup\Big\{\mathbb{E}_h[R(Z_1,Z_2)\nu(Z_1)\kappa(Z_2)] : \|\nu\|_{L^2(Z)}\le1,\ \|\kappa\|_{L^2(Z)}\le1\Big\},
\]
\[
\Lambda_3 = \sqrt n\,\sup_{z}\sup_{\|\nu\|_{L^2(Z)}\le1}\big|\mathbb{E}_h[R(Z_1,z)\nu(Z_1)]\big|,\qquad
\Lambda_4 = \sup_{z_1,z_2}|R(z_1,z_2)|.
\]
Lemma G.2. Let Assumption 1 (ii) be satisfied. Given the kernel $R$ it holds under $H_0$:
\[
\Lambda_1^2 \le n(n-1)\,v_J^2, \tag{G.1}
\]
\[
\Lambda_2 \le 2\overline{\sigma}^2 n s_J^{-2}, \tag{G.2}
\]
\[
\Lambda_3 \le \overline{\sigma}\sqrt n\,M_n\zeta_{b,K}s_J^{-2}, \tag{G.3}
\]
\[
\Lambda_4 \le M_n^2\zeta_{b,K}^2 s_J^{-2}. \tag{G.4}
\]

Proof.
Proof of (G.1). Recall the notation $V_{Ji} = U_iG^{1/2}Ab^K(W_i)$ with $U_i = Y_i-h_0(X_i)$; then we evaluate under $H_0$:
\[
\mathbb{E}_h[R^2(Z_1,Z_2)] \le \mathbb{E}_h\big|U_1b^K(W_1)'A'GAb^K(W_2)U_2\big|^2
= \mathbb{E}_h\Big[U_1^2\,b^K(W_1)'A'GA\,\mathbb{E}_h\big[U_2^2b^K(W_2)b^K(W_2)'\big]A'GA\,b^K(W_1)\Big]
= \mathbb{E}_h\Big[(V_{J1})'\,\mathbb{E}_h\big[V_J(V_J)'\big]V_{J1}\Big]
= \sum_{j,j'=1}^J\big(\mathbb{E}_h[V_jV_{j'}]\big)^2 = v_J^2.
\]
Proof of (G.2). For any functions $\nu$ and $\kappa$ with $\|\nu\|_{L^2(Z)}\le1$ and $\|\kappa\|_{L^2(Z)}\le1$, respectively, we obtain, writing $U^M := U\,\mathbb 1\{|U|\le M_n\}$:
\[
\big|\mathbb{E}_h[R(Z_1,Z_2)\nu(Z_1)\kappa(Z_2)]\big|
\le \big|\mathbb{E}_h[U^Mb^K(W)'\nu(Z)]\,A'GA\,\mathbb{E}_h[U^Mb^K(W)\kappa(Z)]\big| + \big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)]\big\|^2
\]
\[
\le \big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)\kappa(Z)]\big\|\,\big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)\nu(Z)]\big\| + \big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)]\big\|^2
\]
\[
\le \big\|G^{1/2}AG_b^{1/2}\big\|^2\Big(\sqrt{\mathbb{E}\big[|\mathbb{E}_h[U^M\kappa(Z)|W]|^2\big]}\times\sqrt{\mathbb{E}\big[|\mathbb{E}_h[U^M\nu(Z)|W]|^2\big]} + \mathbb{E}\big[|\mathbb{E}_h[U^M|W]|^2\big]\Big).
\]
Now observe $\mathbb{E}\big[|\mathbb{E}_h[U^M\kappa(Z)|W]|^2\big] \le \mathbb{E}\big[\mathbb{E}_h[U^2|W]\kappa^2(Z)\big] \le \overline{\sigma}^2$ by Assumption 1 (ii) and using that $\|\kappa\|_{L^2(Z)}\le1$, which yields the upper bound by using $\|G^{1/2}AG_b^{1/2}\| = s_J^{-1}$.

Proof of (G.3). Observe that for any $z = (u,w)$ and any $\nu$ with $\|\nu\|_{L^2(Z)}\le1$:
\[
\big|\mathbb{E}_h[R(Z_1,z)\nu(Z_1)]\big|
\le \mathbb{E}_h\big|U\,\mathbb 1\{|U|\le M_n\}\,b^K(W)'A'GA\,b^K(w)u\,\mathbb 1\{|u|\le M_n\}\,\nu(Z)\big|
\le \big\|G^{1/2}Ab^K(w)u\,\mathbb 1\{|u|\le M_n\}\big\|\,\big\|G^{1/2}A\,\mathbb{E}_h[U^Mb^K(W)\nu(Z)]\big\|
\le \overline{\sigma}M_n\zeta_{b,K}\big\|G^{1/2}AG_b^{1/2}\big\|^2,
\]
again by using Assumption 1 (ii), and hence the upper bound (G.3) follows.

Proof of (G.4). Observe that for any $z_1 = (u_1,w_1)$ and $z_2 = (u_2,w_2)$ we get
\[
|R(z_1,z_2)| \le \big|u_1\mathbb 1\{|u_1|\le M_n\}\,b^K(w_1)'A'GA\,b^K(w_2)u_2\mathbb 1\{|u_2|\le M_n\}\big|
\le \sup_{u,w}\big\|G^{1/2}Ab^K(w)u\,\mathbb 1\{|u|\le M_n\}\big\|^2
\le M_n^2\zeta_{b,K}^2\big\|G^{1/2}AG_b^{1/2}\big\|^2,
\]
which yields (G.4) and completes the proof.
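As a toy numerical illustration of the degeneracy that Lemma G.1 exploits, consider the scalar kernel $R(z_1,z_2) = z_1z_2$ with mean-zero data, for which $\mathbb{E}[R(Z_1,z)] = 0$ for every fixed $z$. The short simulation below (sample size, replication count, and the normal design are illustrative choices only) checks that the U-statistic $\sum_{i<i'}R(Z_i,Z_{i'})$ then fluctuates on the scale $\sqrt{n(n-1)\mathbb{E}[R^2]/2}$, i.e., on the order of the leading term $\Lambda_1\sqrt u$ of the deviation bound:

```python
import numpy as np

# Toy degenerate order-2 U-statistic: for i.i.d. mean-zero Z_i the kernel
# R(z1, z2) = z1 * z2 satisfies E[R(Z1, z)] = 0 for every fixed z, and
# U_n = sum_{i<i'} R(Z_i, Z_{i'}) has standard deviation
# sqrt(n(n-1)/2 * E[R^2]), which is Lambda_1 of Lemma G.1 up to a constant.
rng = np.random.default_rng(1)
n, reps = 200, 2000
z = rng.standard_normal((reps, n))
s = z.sum(axis=1)
# Identity: sum_{i<i'} z_i z_{i'} = (s^2 - sum_i z_i^2) / 2
u_stats = 0.5 * (s**2 - (z**2).sum(axis=1))
scale = np.sqrt(n * (n - 1) / 2.0)  # E[R^2] = 1 for standard normal Z
print(abs(u_stats.mean()) < 0.1 * scale, 0.8 < u_stats.std() / scale < 1.25)
```

The closed-form identity for $\sum_{i<i'}z_iz_{i'}$ avoids the $O(n^2)$ double loop, so the check runs in a fraction of a second.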