Inference in Nonparametric Series Estimation with Specification Searches for the Number of Series Terms
Byunghoon Kang∗
Department of Economics, Lancaster University
February 26, 2020
Abstract
Nonparametric series regression often involves specification search over the tuning parameter, i.e., evaluating estimates and confidence intervals with a different number of series terms. This paper develops pointwise and uniform inferences for conditional mean functions in nonparametric series estimation that are uniform in the number of series terms. As a result, this paper constructs confidence intervals and confidence bands with possibly data-dependent series terms that have valid asymptotic coverage probabilities. This paper also considers a partially linear model setup and develops inference methods for the parametric part that are uniform in the number of series terms. The finite sample performance of the proposed methods is investigated in various simulation setups as well as in an illustrative example, i.e., the nonparametric estimation of the wage elasticity of the expected labor supply from Blomquist and Newey (2002).
Keywords:
Nonparametric series regression, Pointwise confidence interval, Smoothing parameter choice, Specification search, Undersmoothing, Uniform confidence bands.
JEL classification:
C12, C14.
We consider the following nonparametric regression model:

y_i = g(x_i) + ε_i,   E(ε_i | x_i) = 0,   (1.1)

where {y_i, x_i}_{i=1}^n is i.i.d., y_i is a scalar response variable, x_i ∈ X ⊂ R^{d_x} is a vector of covariates, and g(x) = E(y_i | x_i = x) is the conditional mean function. The theory of estimation and inference is well developed for nonparametric series (sieve) methods in a large body of the econometrics and statistics literature. Series estimators have also received attention in applied economics because they have many appealing features, e.g., they can easily impose shape restrictions such as additive separability and monotonicity. Once the basis function is chosen (e.g., polynomial or regression spline series of fixed order), implementation requires a choice of the number of series terms K = K_n, where K denotes the order of the polynomials or the number of knots in the splines. However, this often involves some ad hoc specification searches over K ∈ 𝒦_n.

∗ I thank the editor Peter Phillips, the co-editor Iván Fernández-Val, and the two anonymous referees for thoughtful comments that significantly improved this paper. I am also grateful to Bruce Hansen, Jack Porter, Xiaoxia Shi, and Joachim Freyberger for useful comments and discussions, and I thank Michal Kolesár, Denis Chetverikov, Yixiao Sun, Andres Santos, Patrik Guggenberger, Federico Bugni, Joris Pinkse, Liangjun Su, Myung Hwan Seo, and Áureo de Paula for helpful conversations and criticism. This paper is a revised version of the first chapter of my Ph.D. thesis at UW-Madison and was previously titled "Inference in Nonparametric Series Estimation with Data-Dependent Undersmoothing". I acknowledge support from the Kwanjeong Educational Foundation Graduate Research Fellowship and the Leon Mears Dissertation Fellowship from UW-Madison. All errors are my own. Email: [email protected], Homepage: https://sites.google.com/site/davidbhkang
For example, when x_i ∈ R^{d_x} is vector valued, researchers often evaluate different numbers of terms in each dimension separately and construct a set of bases with different powers and cross-products of covariates. Although specification search seems necessary in some cases, it may lead to misleading inference if the first-step specification search or series term selection is not taken into account.

Existing theory for the asymptotic normality of t-statistics and valid inference imposes a so-called undersmoothing (i.e., overfitting) condition, that is, a faster rate of K than the mean-squared error (MSE) optimal convergence rate, and many papers in the literature suggest rule-of-thumb choices that give the desired level of undersmoothing. Among many others, Newey (2013) suggested increasing K until the standard errors are large relative to small changes in the objects of interest. Newey, Powell, and Vella (1999) suggested using more terms than chosen by cross-validation. Horowitz and Lee (2012) suggested increasing K until the integrated variance suddenly increases and then adding additional terms.

In this paper, we formally justify these rule-of-thumb or "plug-in" methods with undersmoothed K̂ for valid inference in nonparametric series regression. Specifically, we provide pointwise inference for g(x) with possibly data-dependent (undersmoothed) K̂ ∈ 𝒦_n, i.e., we construct a 100(1 − α)% confidence interval (CI) satisfying

lim inf_{n→∞} P( g(x) ∈ [ ĝ_n(K̂, x) ± ĉ_{1−α}(x) √( V̂_n(K̂, x)/n ) ] ) ≥ 1 − α,   (1.2)

with an estimator ĝ_n(K, x) and variance estimator V̂_n(K, x) using K series terms, and critical values ĉ_{1−α}(x) from the supremum of the t-statistics.
For this result, we first develop a uniform distributional approximation theory for the absolute value of the supremum of the t-statistics over different series terms, which yields asymptotically valid confidence intervals that are uniform in K ∈ 𝒦_n:

P( g(x) ∈ [ ĝ_n(K, x) ± ĉ_{1−α}(x) √( V̂_n(K, x)/n ) ], ∀K ∈ 𝒦_n ) = 1 − α + o(1).   (1.3)

The critical values ĉ_{1−α}(x) can be easily implemented using simple simulation or weighted bootstrap methods.

Furthermore, this paper develops the construction of confidence bands for g(x) with asymptotically uniform (in K ∈ 𝒦_n) coverage, with critical values ĉ_{1−α} chosen to satisfy

P( g(x) ∈ [ ĝ_n(K, x) ± ĉ_{1−α} √( V̂_n(K, x)/n ) ], ∀K ∈ 𝒦_n, ∀x ∈ X ) = 1 − α + o(1).   (1.4)

(As a referee noted, the bias and MSE of the series estimator depend not only on K but also on the specific bases or sieve spaces, e.g., the order of the splines. In this paper, we fix the basis function, and we do not allow searching over the specific bases or sieve spaces.)

Analogous to the pointwise inference in (1.2), we can show the validity of confidence bands with the data-dependent K̂. Even in pointwise inference, deriving a uniform asymptotic distribution theory for all sequences of t-statistics over K ∈ 𝒦_n may not be possible unless p = |𝒦_n| is finite. Allowing p → ∞ as n → ∞, the results in this paper build on coupling inequalities for the supremum of the empirical process developed by Chernozhukov, Chetverikov, and Kato (2014a, 2016), combined with the anti-concentration inequality of Chernozhukov, Chetverikov, and Kato (2014b).

We also provide inference methods in a partially linear model setup, focusing on the common parametric part.
Unlike nonparametric objects of interest that have a convergence rate slower than n^{1/2} (e.g., the regression function or regression derivative), the t-statistics for the parametric object of interest are asymptotically equivalent for all sequences of K under the standard rate condition K/n → 0 as n → ∞. To account for the dependence of the t-statistics on the different sequences of K in this setup, we consider a faster rate of K that grows as fast as the sample size n, as in Cattaneo, Jansson, and Newey (2018a, 2018b), and develop an asymptotic distribution of the t-statistics over K ∈ 𝒦_n. Then, we discuss methods to construct confidence intervals that are similar to those in the nonparametric regression setup and provide uniform (in K ∈ 𝒦_n) coverage properties.

We investigate the finite sample coverage and length properties of the proposed CIs and uniform confidence bands in various simulation setups. As an illustrative example, we revisit the nonparametric estimation of the labor supply function using the entire individual piecewise-linear budget set as in Blomquist and Newey (2002). Imposing additive separability, which is derived from economic theory, Blomquist and Newey (2002) estimate the conditional mean of the labor supply function using series estimation and report the wage elasticity of the expected labor supply as well as other welfare measures under various specifications of the number of series terms.

Several important papers have investigated the asymptotic properties of series (and sieve) estimators, including Andrews (1991a); Eastwood and Gallant (1991); Newey (1997); Chen and Shen (1998); Huang (2003); Chen (2007); Chen and Liao (2014); Chen, Liao, and Sun (2014); Belloni, Chernozhukov, Chetverikov, and Kato (2015); and Chen and Christensen (2015), among many others.
This paper extends inference based on the t-statistic under a single sequence of K to sequences of K over a set 𝒦_n and focuses on both pointwise and uniform inference on g(x), which is an irregular (i.e., slower than n^{1/2} rate) linear functional, under an i.i.d. setup.

The supremum t-statistic has been used as a correction for multiple-testing problems and to construct simultaneous confidence bands, and the importance of multiple-testing problems (data mining or data snooping) has long been noted in various other contexts (see Leamer (1983), White (2000), Romano and Wolf (2005), Hansen (2005)).

There is also a growing literature on data-dependent series term selection and its impact on estimation and inference in econometrics and statistics. Asymptotic optimality results for cross-validation have been developed, e.g., by Li (1987), Andrews (1991b), and Hansen (2015). Horowitz (2014) develops data-driven methods for choosing the sieve dimension in nonparametric instrumental variables (NPIV) estimation such that the resulting NPIV estimators attain the optimal sup-norm or L_2-norm rates adaptive to the unknown smoothness of g(x). Although we do not pursue adaptive inference in this paper, there is also a large statistical literature on adaptive inference. For example, Giné and Nickl (2010) and Chernozhukov, Chetverikov, and Kato (2014b) construct adaptive confidence bands in the density estimation problem (see Giné and Nickl (2015, Section 8) for comprehensive lists of references).
However, once a data-driven choice is obtained for adaptive estimation (e.g., Lepski (1990)-type procedures), one still requires an undersmoothing condition for inference to eliminate asymptotic bias terms (see Theorem 1 of Giné and Nickl (2010)), and this may result in similar specification search issues when choosing a sufficiently "large" K in practice.

We could, in principle, consider kernel-based estimation, for which several data-dependent bandwidth selection rules and explicit bias corrections have been proposed. However, many applications estimate g(x) using (global) series estimation, which easily imposes shape constraints (such as additive separability to reduce dimensionality), and are also interested in both pointwise and uniform inference. Given the issues of specification search, our paper is closely related to a recent paper by Armstrong and Kolesár (2018), which considers a bandwidth snooping adjustment for kernel-based inference.

Unlike for kernel-based methods, little is known about the statistical properties of data-dependent selection rules and explicit bias formulas for general series estimation. Zhou, Shen, and Wolfe (1998) and Huang (2003) are two of the few exceptions. A recent paper, Cattaneo, Farrell, and Feng (2019), develops novel explicit asymptotic bias/integrated mean squared error (IMSE) formulas and asymptotic theory for bias-correction methods for general partitioning-based series estimators. The results in Cattaneo, Farrell, and Feng (2019) can be used as an alternative to the undersmoothing approach to avoid specification search issues.

The remainder of the paper is organized as follows. Section 2 introduces the basic nonparametric series regression setup and the candidate set 𝒦_n. Section 3 provides the pointwise inference, and Section 4 provides inference uniform in x ∈ X. Section 5 extends our inference methods to the partially linear model setup.
Section 6 summarizes Monte Carlo experiments in various setups, and Section 7 illustrates an empirical example as in Blomquist and Newey (2002). Section 8 concludes the paper. Appendix A includes the main proofs, and Appendix B includes figures and tables. Additional supporting lemmas and simulation results are provided in the Online Supplementary Material available at Cambridge Journals Online (journals.cambridge.org/ect).

(See Härdle and Linton (1994) and Li and Racine (2007) for references. See also Hall and Horowitz (2013), Calonico, Cattaneo, and Farrell (2018), Schennach (2015), and references therein for various recent works on related bias issues and inference for kernel estimators.)

1.1 Notation

||A|| denotes the spectral norm, which equals the largest singular value of a matrix A, and λ_min(A), λ_max(A) denote the minimum and maximum eigenvalues of a symmetric matrix A, respectively. o_p(·) and O_p(·) denote the usual stochastic order symbols, →_d denotes convergence in distribution, and ⇒ denotes weak convergence. Let a ∧ b = min{a, b}, a ∨ b = max{a, b}, and denote ⌊a⌋ as the largest integer less than the real number a. For two sequences of positive real numbers a_n and b_n, a_n ≲ b_n denotes a_n ≤ c b_n for all n sufficiently large with some constant c > 0, and a_n ≍ b_n denotes a_n ≲ b_n and b_n ≲ a_n. Furthermore, a_n ≲_P b_n denotes a_n = O_p(b_n). For a given random variable X_i and 1 ≤ p < ∞, L^p(X) is the space of all L^p-norm bounded functions with ||f||_{L^p} = [E||f(X_i)||^p]^{1/p}, ℓ^∞(X) denotes the space of all bounded functions under the sup-norm, and ||f||_∞ = sup_{x∈X} |f(x)| for bounded real-valued functions f on the support X.

We introduce the nonparametric series regression setup in the model (1.1).
Given a random sample {y_i, x_i}_{i=1}^n, we are interested in inference on the conditional mean g(x) = E(y_i | x_i = x) at a particular point x ∈ X ⊂ R^{d_x} or uniformly in x ∈ X.

Let ĝ_n(K, x) be an estimator of g(x) using K = K_n ≥ 1 series terms, where P(K, x) = (p_1(x), ..., p_K(x))' is a vector of basis functions that can change with n. Standard examples of basis functions are power series, Fourier series, orthogonal polynomials, splines, and wavelets. The series estimator is then obtained by least squares (LS) estimation of y_i on the regressors P(K, x_i):

ĝ_n(K, x) = P(K, x)' β̂_K,   β̂_K = (P^K' P^K)^{-1} P^K' Y,   (2.1)

where P^K = [P_{K1}, ..., P_{Kn}]', P_{Ki} ≡ P(K, x_i) = (p_1(x_i), p_2(x_i), ..., p_K(x_i))', and Y = (y_1, ..., y_n)'. Define the least squares residuals ε̂_{Ki} = y_i − P'_{Ki} β̂_K and

V̂_n(K, x) = P(K, x)' Q̂_K^{-1} Ω̂_K Q̂_K^{-1} P(K, x),   Q̂_K = (1/n) Σ_{i=1}^n P_{Ki} P'_{Ki},   Ω̂_K = (1/n) Σ_{i=1}^n P_{Ki} P'_{Ki} ε̂²_{Ki},   (2.2)

and consider the t-statistic

T̂_n(K, x) ≡ √n ( ĝ_n(K, x) − g(x) ) / V̂_n(K, x)^{1/2}.   (2.3)

Under standard regularity conditions (discussed in the next section), the t-statistic can be decomposed as follows:

T̂_n(K, x) = (1/√n) Σ_{i=1}^n P(K, x)' Q_K^{-1} P_{Ki} ε_i / V̂_n(K, x)^{1/2} − r_n(K, x) / √( V̂_n(K, x)/n ) + o_p(1),   (2.4)

where Q_K = E(P_{Ki} P'_{Ki}), r_n(K, x) = g(x) − P(K, x)' β_K, and β_K ≡ (E[P_{Ki} P'_{Ki}])^{-1} E[P_{Ki} y_i] is the best linear L_2 projection coefficient.
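As a concrete illustration, the estimator in (2.1), the variance estimator in (2.2), and a conventional pointwise CI can be sketched as follows. This is a minimal sketch, not the paper's code: the power-series basis, the simulated data-generating process, and all function names are assumptions made for the example.

```python
import numpy as np

def basis(x, K):
    """P(K, x): assumed power-series basis (1, x, ..., x^{K-1}) at points x."""
    x = np.atleast_1d(x)
    return np.column_stack([x ** j for j in range(K)])

def series_fit(y, x, K):
    """Series LS estimator as in (2.1) and plug-in variance as in (2.2)."""
    n = len(y)
    P = basis(x, K)                                   # P^K, shape (n, K)
    beta = np.linalg.solve(P.T @ P, P.T @ y)          # betahat_K
    eps = y - P @ beta                                # residuals epshat_{Ki}
    Qinv = np.linalg.inv(P.T @ P / n)                 # Qhat_K^{-1}
    Omega = (P * eps[:, None] ** 2).T @ P / n         # Omegahat_K
    def ghat(x0):                                     # ghat_n(K, x0)
        return basis(x0, K) @ beta
    def Vhat(x0):                                     # Vhat_n(K, x0), quadratic form
        A = basis(x0, K) @ Qinv
        return np.einsum('ij,jk,ik->i', A, Omega, A)
    return ghat, Vhat

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)      # assumed DGP
ghat, Vhat = series_fit(y, x, K=5)

# conventional 95% pointwise CI at x0 = 0 with the normal critical value
x0 = 0.0
se = np.sqrt(Vhat(x0)[0] / n)
ci = (ghat(x0)[0] - 1.96 * se, ghat(x0)[0] + 1.96 * se)
```

Here `Vhat(x0)` returns V̂_n(K, x0), so `np.sqrt(Vhat(x0)/n)` is the standard error entering the t-statistic (2.3); the paper's point is that this CI is only one candidate among K ∈ 𝒦_n.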
The first term in the decomposition (2.4) converges to a standard normal distribution for a deterministic sequence K → ∞ as n → ∞, and the second term does not necessarily converge to 0 due to the approximation error r_n(K, x). The second term can be ignored under an undersmoothing assumption, and the asymptotic distribution of the t-statistic, T̂_n(K, x) →_d N(0, 1), is obtained. Then, a 100(1 − α)% confidence interval for g(x) can be easily constructed using the normal critical value z_{1−α/2}:

[ ĝ_n(K, x) ± z_{1−α/2} √( V̂_n(K, x)/n ) ].   (2.5)

However, it is not clear whether the conventional CI using normal critical values (2.5) has a correct coverage probability with a possibly data-dependent K̂ such as cross-validation or IMSE-optimal selection. First, T̂_n(K̂, x) →_d N(0, 1) may not hold with a random sequence K̂, even if we assume the asymptotic bias is negligible. Second, it is well known that some data-dependent rules K̂ do not satisfy the undersmoothing rate conditions, which can lead to a large asymptotic bias and coverage distortion of the standard CI. For example, suppose that the researcher uses K̂ = K̂_cv selected by cross-validation; then, K̂_cv is typically too "small" and violates the undersmoothing assumption needed to ensure asymptotic normality without bias terms and valid inference.

As discussed in the introduction, the undersmoothing assumption involves possibly ad hoc methods to choose the series terms K over a candidate set 𝒦_n for valid inference, and cross-validation methods naturally involve specification search over a set of different numbers of series terms. The following assumption on the set 𝒦_n is constructed to allow a broad range of K, such that 𝒦_n can contain the (unknown) MSE-optimal rate of K as well as undersmoothing rates that increase faster than the MSE-optimal rate.

Assumption 2.1. (Set of number of series terms) Assume the candidate set is 𝒦_n = {K_j : 1 ≤ j ≤ p}, where K_1 = K̲ → ∞ and K_p = K̄ → ∞ as n → ∞.

Here, we consider a possibly growing set of numbers of series terms; a similar assumption is used in the literature, for example, in Newey (1994a, 1994b). Suppose g(x) belongs to the Hölder space of smoothness s > 0, Σ(s, X); then, we obtain the optimal L_2 convergence rate O_p(n^{−s/(2s+d_x)}) with K ≍ n^{d_x/(d_x+2s)}. Assumption 2.1 allows 𝒦_n to contain the L_2-optimal rates of K for a large set of classes of functions. By setting 𝒦_n = [K̲, K̄] ∩ N with K̲ ≍ n^{φ̲} and K̄ ≍ n^{φ̄}, where φ̲ = d_x/(d_x + 2s̄) and φ̄ = d_x/(d_x + 2s̲), Assumption 2.1 contains the numbers of series terms that attain the optimal L_2 rate of convergence for g(x) ∈ ∪_{s∈S} Σ(s, X), S = [s̲, s̄]. A similar assumption is used in the literature on adaptive inference, although we do not pursue this direction in the current paper.

Assumption 2.1 gives flexible choices of K, as we only assume the rates of K̲ and K̄, for example, K̄ = C n^{φ̄} and K̲ = c n^{φ̲}, where c and C can be set arbitrarily small or large. We only require rate restrictions uniformly over K ∈ 𝒦_n to guarantee the linearization of the t-statistic in (2.4) and restrictions on the rate of the cardinality p = |𝒦_n|. Since K ∈ 𝒦_n is a positive integer and p ≤ K̄, p grows at a rate much slower than n under the rate restrictions in Section 3.

Remark 2.1 (𝒦_n and the largest K̄). As a referee noted, specification search is often performed over a simple pre-defined set in practice. For example, a researcher may only use quadratic, cubic, or quartic terms in polynomial regression or try only a few different numbers of knots in regression splines to observe how the estimate and standard error change. In the nonparametric estimation of the Mincer equation (Heckman, Lochner, and Todd (2006)), researchers may consider a regression of log wages on experience with polynomials of order K = 1 (linear) to K = 4 (quartic). However, it may not be clear how to define 𝒦_n a priori in practice. One must first consider a set of pre-selected models over which to search. As discussed earlier and suggested by many papers in the literature, formal data-dependent methods that attain the optimal L_2-norm or sup-norm rates, such as cross-validation, can be a useful guideline for 𝒦_n.
For example, one can consider a reasonable set 𝒦̃_n first, choose K̂_cv ∈ 𝒦̃_n by cross-validation, and then consider 𝒦_n = [K̂_cv, c_1 K̂_cv] or [K̂_cv, K̂_cv n^{c_2}] for some constants c_1, c_2 > 0. One can also search K̲ and K̄ sequentially by calculating changes in cross-validation or standard errors from the initial candidate set. Extending the results developed in this paper to data-dependent 𝒦_n is beyond the scope of the paper.

In this section, we focus on pointwise inference for g(x). The goal of this section is to provide a uniform distributional approximation theory for T̂_n(K, x) over the set 𝒦_n and to establish the uniform (in K ∈ 𝒦_n) coverage properties of the confidence intervals for g(x) in (1.2) and (1.3), together with the construction of critical values.

From the decomposition of the t-statistic in (2.4), we first consider the (infeasible) test statistic

max_{K∈𝒦_n} |t_n(K, x)| = max_{1≤j≤p} |t_n(K_j, x)|,   (3.1)

where t_n(K, x) = n^{−1/2} Σ_{i=1}^n P(K, x)' Q_K^{-1} P_{Ki} ε_i / V_n(K, x)^{1/2} with the series variance V_n(K, x) = P(K, x)' Q_K^{-1} Ω_K Q_K^{-1} P(K, x) and Ω_K = E(P_{Ki} P'_{Ki} ε_i²). In general, t_n(K, x), K ∈ 𝒦_n, does not have a limiting distribution because it is not asymptotically tight under Assumption 2.1 unless |𝒦_n| is finite or 𝒦_n satisfies restrictive assumptions. However, we show below that there exists a sequence of random variables max_{1≤j≤p} Σ_{i=1}^n |Z_{ij}| such that | max_{K∈𝒦_n} |t_n(K, x)| − max_{1≤j≤p} Σ_{i=1}^n |Z_{ij}| | = O_p(a_n) for a sequence of constants a_n → 0, where Z_i = (Z_{i1}, ..., Z_{ip})' is a Gaussian random vector in R^p such that Z_i ∼ N(0, n^{-1} Σ_n), with (j, l) elements of the variance-covariance matrix

Σ_n(j, l) = E[t_n(K_j, x) t_n(K_l, x)] = P(K_j, x)' Q_{K_j}^{-1} Ω_{K_j,K_l} Q_{K_l}^{-1} P(K_l, x) / ( V_n(K_j, x)^{1/2} V_n(K_l, x)^{1/2} ),   (3.2)

where Ω_{K_j,K_l} = E(P_{K_j i} P'_{K_l i} ε_i²). By replacing the unknown Σ_n and V_n(K, x) with consistent estimators Σ̂_n and V̂_n(K, x), we show below that we can approximate max_{K∈𝒦_n} |T̂_n(K, x)| by max_{1≤j≤p} Σ_{i=1}^n |Ẑ_{ij}| and then obtain critical values using a simulation-based method that provides the valid coverage properties in (1.2) and (1.3).

(All of our results continue to hold with fixed p; however, it may be preferable to use larger sets 𝒦_n with p → ∞ to give greater flexibility to the candidate models as the sample size n increases. In an earlier version of the paper, we provided the weak convergence of a series process under the same rates of K ∈ 𝒦_n and high-level assumptions; this can be viewed as an analogous result in the kernel estimation literature (see Section 2 of Armstrong and Kolesár (2018) and other references therein).)
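This simulation-based construction of the critical value can be sketched as follows. The sketch is a minimal illustration under assumed choices (power-series basis, candidate set {3, 4, 5, 6}, simulated data), not the paper's implementation; it draws the Gaussian maximum directly from N(0, Σ̂_n), which has the same distribution as the sum-over-i representation since Σ_i Ẑ_{ij} is N(0, Σ̂_n(j, j)).

```python
import numpy as np

rng = np.random.default_rng(0)
n, x0, alpha, B = 400, 0.0, 0.05, 2000
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)      # assumed DGP
Kset = [3, 4, 5, 6]                                   # assumed candidate set, p = |Kset|

def basis(x, K):
    """Assumed power-series basis (1, x, ..., x^{K-1})."""
    x = np.atleast_1d(x)
    return np.column_stack([x ** j for j in range(K)])

# per-K ingredients: influence terms P(K,x0)' Qhat_K^{-1} P_Ki and residuals
score, resid = [], []
for K in Kset:
    P = basis(x, K)
    beta = np.linalg.solve(P.T @ P, P.T @ y)
    resid.append(y - P @ beta)
    score.append((basis(x0, K) @ np.linalg.inv(P.T @ P / n) @ P.T)[0])  # (n,)

p = len(Kset)
V = np.array([np.mean(score[j] ** 2 * resid[j] ** 2) for j in range(p)])  # Vhat_n(K_j, x0)
Sigma = np.empty((p, p))                              # Sigmahat_n, plug-in version of (3.2)
for j in range(p):
    for l in range(p):
        cov = np.mean(score[j] * score[l] * resid[j] * resid[l])  # Vhat_n(K_j, K_l, x0)
        Sigma[j, l] = cov / np.sqrt(V[j] * V[l])

# chat_{1-alpha}(x0): (1-alpha) quantile of max_j |Z_j|, Z ~ N(0, Sigmahat_n)
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(p))     # jitter for numerical PD
draws = np.abs(L @ rng.standard_normal((p, B))).max(axis=0)
crit = np.quantile(draws, 1 - alpha)
```

Because the candidate t-statistics are positively correlated across nested bases, the resulting `crit` typically exceeds the pointwise normal critical value 1.96 only moderately.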
We define ĉ_{1−α}(x) as follows:

ĉ_{1−α}(x) ≡ (1 − α) quantile of max_{1≤j≤p} Σ_{i=1}^n |Ẑ_{ij}|, where Ẑ_i = (Ẑ_{i1}, ..., Ẑ_{ip})' ∼ N(0, n^{-1} Σ̂_n),

Σ̂_n(j, j) = 1,   Σ̂_n(j, l) = V̂_n(K_j, K_l, x) / ( V̂_n(K_j, x)^{1/2} V̂_n(K_l, x)^{1/2} ),

V̂_n(K_j, K_l, x) = P(K_j, x)' Q̂_{K_j}^{-1} Ω̂_{K_j,K_l} Q̂_{K_l}^{-1} P(K_l, x),   Ω̂_{K_j,K_l} = (1/n) Σ_{i=1}^n P_{K_j i} P'_{K_l i} ε̂_{K_j i} ε̂_{K_l i},   (3.3)

where Σ̂_n is a consistent estimator of the variance-covariance matrix Σ_n defined in (3.2), V̂_n(K, x) is the simple plug-in estimator of V_n(K, x) as in (2.2), and ε̂_{Ki} = y_i − P'_{Ki} β̂_K for all K ∈ 𝒦_n. One can compute ĉ_{1−α}(x) by simulating B (typically B = 1000 or 5000) i.i.d. random vectors Ẑ_i^b ∼ N(0, n^{-1} Σ̂_n) and taking the (1 − α) sample quantile of { max_{1≤j≤p} Σ_{i=1}^n |Ẑ_{ij}^b| : b = 1, ..., B }. Alternatively, we can use weighted bootstrap methods; see Section 4 for the implementation and the validity of bootstrap procedures in the construction of confidence bands.

To establish our main results, we impose mild regularity conditions uniform in K ∈ 𝒦_n. For each K ∈ 𝒦_n, define ζ_K ≡ sup_{x∈X} ||P(K, x)|| as the largest normalized length of the regressor vector and λ_K ≡ (λ_min(Q_K))^{−1/2} for the K × K design matrix Q_K = E(P_{Ki} P'_{Ki}).

Assumption 3.1. (Regularity conditions - model)
(i) {y_i, x_i}_{i=1}^n are i.i.d. random variables satisfying the model (1.1).
(ii) max_{K∈𝒦_n} λ_K ≲ 1, and for each K ∈ 𝒦_n, as K → ∞, there exist c_K, ℓ_K such that sup_{x∈X} |r_n(K, x)| ≤ ℓ_K c_K and E[r_n(K, x)²]^{1/2} ≤ c_K, where r_n(K, x) = g(x) − P(K, x)' β_K and β_K = (E[P_{Ki} P'_{Ki}])^{-1} E[P_{Ki} y_i].
Assumption 3.2. (Regularity conditions - pointwise inference)
(i) max_{K∈𝒦_n} √( ζ_K² log K log p / n ) (1 + √K ℓ_K c_K) + ℓ_K c_K log p → 0 as n → ∞.
(ii) sup_{x∈X} E(ε_i² | x_i = x) < ∞, inf_{x∈X} E(ε_i² | x_i = x) > 0, and either of the following conditions holds: (a) sup_{x∈X} E[|ε_i|^q | x_i = x] < ∞ for some q ≥ 3, or (b) there exists a constant C > 0 such that sup_{x∈X} E[exp(|ε_i|/C) | X_i = x] ≤ 2.
(iii) max_{K∈𝒦_n} | V_n(K, x)/V̂_n(K, x) − 1 | = o_p(1/log p), max_{1≤j,l≤p} | Σ̂_n(j, l) − Σ_n(j, l) | = o_p(1/log p).

Assumption 3.2(i) restricts the growth of K uniformly over 𝒦_n. The rate conditions can be replaced by specific bounds on ζ_K, c_K, ℓ_K for various sieve bases. For example, when X = [0, 1]^{d_x}, the probability density of x_i is uniformly bounded above and bounded away from zero, and g(x) ∈ Σ(s, X), i.e., the Hölder space of smoothness s > 0, then λ_K ≲ 1, ζ_K ≲ √K, and ℓ_K c_K ≲ K^{−(s∧s_0)/d_x} for regression spline series of order s_0, and Assumption 3.2(i) is satisfied when √( K (log K)² / n ) (1 + K^{1/2} K^{−(s∧s_0)/d_x}) + K^{−(s∧s_0)/d_x} log K → 0. Other standard regularity conditions in the literature (e.g., Newey (1997) and Chen (2007)) can also be used here, and the rate condition can be improved with different pointwise linearization and approximation bounds, as in Huang (2003) for splines and Cattaneo et al. (2019) for partitioning-based estimators.

Assumption 3.2(ii) imposes either bounded polynomial moments or sub-exponential moments of the regression errors. Assumption 3.2(iii) imposes the consistency of the variance estimator V̂_n(K, x) uniformly in K ∈ 𝒦_n, which holds under mild regularity conditions (see Lemma 5.1 of Belloni et al. (2015) and Lemmas 3.1-3.2 of Chen and Christensen (2015)).

Theorem 3.1.
Suppose that Assumptions 2.1, 3.1, and 3.2 hold and that either of the following rate conditions holds, depending on case (a) or (b) in Assumption 3.2(ii): (a) (max_K ζ_K)² log n log p / n ∨ max_K ζ_K log^{1/2} n log p / n^{1/2−1/q} → 0, or (b) (max_K ζ_K)² log n log p / n → 0. If, in addition, we assume that max_{K∈𝒦_n} | √n r_n(K, x) / V_n(K, x)^{1/2} | = o(1/√log p), then

sup_{u∈R} | P( max_{K∈𝒦_n} |T̂_n(K, x)| ≤ u ) − P( max_{1≤j≤p} Σ_{i=1}^n |Ẑ_{ij}| ≤ u ) | = o(1),   (3.4)

and the following coverage property holds:

P( g(x) ∈ [ ĝ_n(K, x) ± ĉ_{1−α}(x) √( V̂_n(K, x)/n ) ], ∀K ∈ 𝒦_n ) = 1 − α + o(1)   (3.5)

with the critical value ĉ_{1−α}(x) defined in (3.3). Alternatively, if we assume | √n r_n(K̂, x) / V_n(K̂, x)^{1/2} | = o(1/√log p) with K̂ ∈ 𝒦_n, then the following holds:

lim inf_{n→∞} P( g(x) ∈ [ ĝ_n(K̂, x) ± ĉ_{1−α}(x) √( V̂_n(K̂, x)/n ) ] ) ≥ 1 − α.   (3.6)

Theorem 3.1 provides a uniform coverage property of the confidence interval over K ∈ 𝒦_n for the regression function g(x). Equation (3.6) guarantees the asymptotic coverage of the CI for data-dependent K̂ ∈ 𝒦_n with undersmoothing. Note that standard inference methods in the nonparametric regression setup typically consider a singleton set 𝒦_n = {K} with K → ∞ as n → ∞. The rate restriction is mild because it only requires K/n^{1−2/q} → 0, up to log n terms, in case (a) and K/n → 0, up to log n terms, in case (b) when ζ_K ≲ √K, as for splines and wavelet series. Theorem 3.1 builds upon a coupling inequality for maxima of sums of random vectors in Chernozhukov, Chetverikov, and Kato (2014a) combined with the anti-concentration inequality in Chernozhukov, Chetverikov, and Kato (2014b).

Remark 3.1 (Undersmoothing assumption). Note that (3.5) requires an undersmoothing assumption uniformly over K ∈ 𝒦_n. Without max_{K∈𝒦_n} | √n r_n(K, x) / V_n(K, x)^{1/2} | = o(1), the coverage in (3.5) can be understood as uniform confidence intervals for the pseudo-true value g(K, x) = P(K, x)' β_K, i.e.,

P( g(K, x) ∈ [ ĝ_n(K, x) ± ĉ_{1−α}(x) √( V̂_n(K, x)/n ) ], ∀K ∈ 𝒦_n ) = 1 − α + o(1).   (3.7)

However, a uniform undersmoothing condition is not assumed in (3.6); it only requires that the chosen K̂ ∈ 𝒦_n satisfies the undersmoothing condition such that the asymptotic bias is negligible. This allows broader ranges of K in 𝒦_n, including the unknown MSE-optimal rate. We formally justify rule-of-thumb methods for valid inference suggested in the literature, which include adding series terms, blowing up the number of terms after cross-validation, or "plug-in" methods for choosing K̂, such as those in Newey, Powell, and Vella (1999) and Newey (2013). Here, uniform (in K ∈ 𝒦_n) inference accounts for the uncertainty from specification search by using a larger critical value ĉ_{1−α}(x) than the normal critical value z_{1−α/2}.

Remark 3.2 (Other functionals). Here, we focus on the leading example of g(x) for some fixed point x ∈ X; however, we can consider other linear functionals a(g(·)), such as the regression derivative a(g(x)) = (d/dx) g(x).
All the results in this paper can be applied to irregular (slower than n^{1/2} rate) linear functionals using the estimator a(ĝ_n(K, x)) = a_K(x)' β̂_K and an appropriate transformation of the basis, a_K(x) = (a(p_1(x)), ..., a(p_K(x)))', with proper smoothness conditions on the functional and continuity conditions on the derivative, as in Newey (1997). Although the verification of the previous results for regular (n^{1/2} rate) functionals, such as integrals and weighted average derivatives, is beyond the scope of this paper, we examine similar results for the partially linear model setup in Section 5.

This section provides the construction of confidence bands for g(x) (uniform in K ∈ 𝒦_n) given in (1.4). We define the empirical process

T̂_n(K, x) ≡ √n ( ĝ_n(K, x) − g(x) ) / V̂_n(K, x)^{1/2}   (4.1)

over 𝒦_n × X, and we show below that the supremum of the empirical process, sup_{(K,x)∈𝒦_n×X} |T̂_n(K, x)|, can be approximated by a sequence of random variables sup_{(K,x)∈𝒦_n×X} |Z_n(K, x)|, where Z_n(K, x) is a tight Gaussian random process in ℓ^∞(𝒦_n × X) with zero mean and covariance function

E[Z_n(K, x) Z_n(K', x')] = P(K, x)' Q_K^{-1} Ω_{K,K'} Q_{K'}^{-1} P(K', x') / ( V_n(K, x)^{1/2} V_n(K', x')^{1/2} ).   (4.2)

Although the Gaussian approximation is an important first step, the covariance function (4.2) is generally difficult to construct for the purpose of uniform inference. Thus, we employ weighted bootstrap methods similar to Belloni et al. (2015) and show the validity of the bootstrap procedure for uniform confidence bands. Let e_1, ..., e_n be a sequence of i.i.d. standard exponential random variables that are independent of X_n = {x_1, ..., x_n}.
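The weighted bootstrap band construction formalized below in (4.3)-(4.5) can be sketched as follows. This is a hedged illustration under assumed choices (power-series basis, a small candidate set, an evaluation grid for x, and a simulated DGP), not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, B = 400, 0.05, 500
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)      # assumed DGP
Kset = [3, 4, 5]                                      # assumed candidate set
grid = np.linspace(-0.9, 0.9, 25)                     # assumed evaluation grid for x

def basis(x, K):
    """Assumed power-series basis (1, x, ..., x^{K-1})."""
    x = np.atleast_1d(x)
    return np.column_stack([x ** j for j in range(K)])

fits = {}
for K in Kset:
    P = basis(x, K)
    beta = np.linalg.solve(P.T @ P, P.T @ y)          # unweighted LS fit
    eps = y - P @ beta
    A = basis(grid, K) @ np.linalg.inv(P.T @ P / n)   # rows P(K,x)' Qhat_K^{-1}
    V = np.mean((A @ P.T) ** 2 * eps ** 2, axis=1)    # Vhat_n(K, x) on the grid
    fits[K] = (P, beta, V)

sup_stats = np.empty(B)
for b in range(B):
    e = rng.exponential(1.0, n)                       # i.i.d. Exp(1) weights
    sup_b = 0.0
    for K in Kset:
        P, beta, V = fits[K]
        # weighted LS coefficients, solving the bootstrap normal equations
        beta_e = np.linalg.solve(P.T @ (e[:, None] * P), P.T @ (e * y))
        Te = np.sqrt(n) * (basis(grid, K) @ (beta_e - beta)) / np.sqrt(V)
        sup_b = max(sup_b, np.abs(Te).max())          # sup over x for this K
    sup_stats[b] = sup_b                              # sup over (K, x)
crit_band = np.quantile(sup_stats, 1 - alpha)         # bootstrap band critical value
```

The resulting `crit_band` exceeds the pointwise normal critical value because the supremum is taken over both the evaluation points and the candidate numbers of series terms; plugging it into bands of the form ĝ_n(K, x) ± crit_band · √(V̂_n(K, x)/n) mimics (4.6).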
For (K, x) ∈ 𝒦_n × X, we define the (centered) weighted bootstrap process

T̂_n^e(K, x) = √n ( ĝ_n^e(K, x) − ĝ_n(K, x) ) / V̂_n(K, x)^{1/2},   (4.3)

where ĝ_n^e(K, x) = P(K, x)' β̂_K^e, and β̂_K^e is obtained by the following weighted least squares regression:

β̂_K^e = arg min_{β∈R^K} Σ_{i=1}^n e_i ( y_i − P(K, x_i)' β )².   (4.4)

Define the critical value

ĉ_{1−α} ≡ (1 − α) conditional quantile of sup_{K∈𝒦_n, x∈X} |T̂_n^e(K, x)| given the data X_n,   (4.5)

and consider confidence bands of the form

[ ĝ_n(K, x) ± ĉ_{1−α} √( V̂_n(K, x)/n ) ],   K ∈ 𝒦_n, x ∈ X.   (4.6)

To establish the validity of the bootstrap critical values and the confidence bands in (4.6), we show below that the conditional distribution of sup_{(K,x)∈𝒦_n×X} |T̂_n^e(K, x)| is "close" to the distribution of sup_{(K,x)∈𝒦_n×X} |Z_n(K, x)| and to that of sup_{(K,x)∈𝒦_n×X} |T̂_n(K, x)|, using coupling inequalities for the supremum of the empirical process and the bootstrap process as in Chernozhukov et al. (2016). Then, similar to Theorem 3.1, this gives bounds on the Kolmogorov distance between the distribution functions P( sup_{K∈𝒦_n, x∈X} |T̂_n(K, x)| ≤ u ) and P( sup_{K∈𝒦_n, x∈X} |T̂_n^e(K, x)| ≤ u | X_n ).

The following assumptions are used to establish the coverage probability of the confidence bands uniformly over K ∈ 𝒦_n. Define α(K, x) ≡ Q_K^{−1/2} P(K, x) / V_n(K, x)^{1/2}, and

ζ_1^L = max_{K∈𝒦_n} sup_{x,x'∈X, x≠x'} ||α(K, x) − α(K, x')|| / ||x − x'||,   ζ_2^L = sup_{x∈X} max_{K,K'∈𝒦_n: K≠K'} ||α(K, x) − α(K', x)|| / |K − K'|.

Assumption 4.1.
(Regularity conditions - uniform inference)
(i) $\sup_x E[|\varepsilon_i|^q \mid x_i = x] < \infty$ for some $q > 2$, and $\inf_{x\in\mathcal{X}} E(\varepsilon_i^2 \mid x_i = x) > 0$.
(ii) $\max_{K\in\mathcal{K}_n}\sqrt{\lambda_K^2\zeta_K^2\log K\log n/n}\,\big(n^{1/q} + \ell_K c_K\sqrt{K}\big) + (\ell_K c_K)^2\log n \to 0$ as $n\to\infty$.
(iii) $\log(\zeta^L_1 \vee \zeta^L_2) \lesssim \log n$, $\max_K \zeta_K^{q/(q-2)}\log n/n \lesssim 1$, and $\max_K \zeta_K \lesssim \log n$.
(iv) $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}\Big|\frac{V_n(K,x)}{\hat V_n(K,x)} - 1\Big| = o_p(1/\log n)$.

The conditions on $\zeta^L_1$, $\zeta^L_2$ and $\max_{K\in\mathcal{K}_n}\zeta_K$ are similar to those in Chernozhukov et al. (2014a) and Belloni et al. (2015).

Theorem 4.1.
Suppose that Assumptions 2.1, 3.1, and 4.1 hold, and $(\max_K\zeta_K)\log^{1/(2q)}n/n^{1/2-1/q} \to 0$, $(\max_K\zeta_K)^2\log n/n \to 0$ as $n\to\infty$. If, in addition, we assume that $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}\big|\sqrt{n}\,r_n(K,x)/V_n(K,x)^{1/2}\big| = o(1/\sqrt{\log n})$, then
$$P\Big(g(x) \in \Big[\hat g_n(K,x) \pm \hat c_{1-\alpha}\sqrt{\hat V_n(K,x)/n}\Big],\ K\in\mathcal{K}_n,\ x\in\mathcal{X}\Big) = 1-\alpha+o(1) \qquad (4.7)$$
with the critical value $\hat c_{1-\alpha}$ in (4.5).

Alternatively, if we assume $\sup_{x\in\mathcal{X}}\big|\sqrt{n}\,r_n(\hat K,x)/V_n(\hat K,x)^{1/2}\big| = o(1/\sqrt{\log n})$ with $\hat K\in\mathcal{K}_n$, then the following coverage property holds:
$$\liminf_{n\to\infty} P\Big(g(x) \in \Big[\hat g_n(\hat K,x) \pm \hat c_{1-\alpha}\sqrt{\hat V_n(\hat K,x)/n}\Big],\ x\in\mathcal{X}\Big) \geq 1-\alpha. \qquad (4.8)$$

Theorem 4.1 establishes the asymptotic coverage of the confidence bands defined in (4.6) uniformly over $K\in\mathcal{K}_n$. Furthermore, it shows that a confidence band with possibly data-dependent $\hat K\in\mathcal{K}_n$ has asymptotic coverage of at least $1-\alpha$. The confidence band constructed in (4.8) requires a substantially weaker undersmoothing assumption, similar to Theorem 3.1.

In this section, we provide inference methods for the partially linear model (PLM) setup. For notational simplicity, we use notation similar to that of the nonparametric regression setup. Suppose we observe a random sample $\{y_i, w_i, x_i\}_{i=1}^n$, where $y_i$ is the scalar response variable, $w_i\in\mathcal{W}\subset\mathbb{R}$ is the treatment/policy variable of interest, and $x_i\in\mathcal{X}\subset\mathbb{R}^{d_x}$ is a set of explanatory variables. For simplicity, we assume that $w_i$ is a scalar. We consider the model
$$y_i = \theta w_i + g(x_i) + \varepsilon_i, \qquad E(\varepsilon_i \mid w_i, x_i) = 0. \qquad (5.1)$$
We are interested in inference on $\theta$ after approximating the unknown function $g(x)$ by series terms/regressors $p(x_i)$ among a set of potential control variables.
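As a concrete numerical companion to the partialling-out estimator defined in (5.2) below, here is a minimal sketch (the toy data and names are illustrative, not from the paper):

```python
import numpy as np

def plm_theta_hat(y, w, P):
    """Partialling-out estimator theta_hat(K) = (W'M_K W)^{-1} W'M_K Y,
    where M_K = I - P (P'P)^{-1} P' annihilates the series terms P (eq. (5.2)).
    Implemented by residualizing both y and w on the columns of P."""
    proj = np.linalg.pinv(P)          # (P'P)^{-1} P' for full-column-rank P
    Mw = w - P @ (proj @ w)           # residual of w on the series terms
    My = y - P @ (proj @ y)           # residual of y on the series terms
    return float(Mw @ My / (Mw @ Mw))
```

By the Frisch–Waugh theorem this equals the coefficient on $w$ from the joint least squares regression of $y$ on $(w, P)$, which is a convenient numerical check.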
Specification searches can be performed over the number of approximating terms or over the number of covariates used in estimating the nonparametric part.

The series estimator $\hat\theta_n(K)$ of $\theta$ using the first $K = K_n$ terms is obtained by standard LS estimation of $y_i$ on $w_i$ and $P_{Ki} = P(K,x_i)$, and it has the usual "partialling out" formula
$$\hat\theta_n(K) = \big(W'M_KW\big)^{-1}W'M_KY, \qquad (5.2)$$
where $W = (w_1,\cdots,w_n)'$, $M_K = I_n - P^K(P^{K\prime}P^K)^{-1}P^{K\prime}$, $P^K = [P_{K1},\cdots,P_{Kn}]'$, and $Y = (y_1,\cdots,y_n)'$.

The asymptotic normality of, and valid inference for, $\hat\theta_n(K)$ have been developed in the literature. Donald and Newey (1994) derived the asymptotic normality of $\hat\theta_n(K)$ under the standard rate condition $K/n\to 0$. Belloni, Chernozhukov, and Hansen (2014) analyzed asymptotic normality and uniformly valid inference for the post-double-selection estimator even when $K$ is much larger than $n$ (see also Kozbur (2018)). Recent papers by Cattaneo, Jansson, and Newey (2018a, 2018b) provided a valid approximation theory for $\hat\theta_n(K)$ when $K$ grows at the same rate as $n$.

An approximation theory using a faster rate for $K$ ($K/n\to c$, $0<c<1$) than the standard rate condition ($K/n\to 0$) is particularly useful for our purpose of establishing the asymptotic distribution of the t-statistics over $K\in\mathcal{K}_n$. From the results in Cattaneo, Jansson, and Newey (2018a), we have the following decomposition:
$$\sqrt{n}\,(\hat\theta_n(K)-\theta) = \Big(\frac{1}{n}W'M_KW\Big)^{-1}\frac{1}{\sqrt{n}}W'M_K(Y-W\theta) = \hat\Gamma_n(K)^{-1}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n v_iM_{K,ii}\varepsilon_i + \frac{1}{\sqrt{n}}\sum_{i=1}^n\sum_{j=1,\,j\neq i}^n v_iM_{K,ij}\varepsilon_j\Big) + o_p(1), \qquad (5.3)$$
where $v_i \equiv w_i - g_w(x_i)$, $g_w(x_i) \equiv E[w_i\mid x_i]$, and $\hat\Gamma_n(K) = W'M_KW/n$. For any deterministic sequence $K\to\infty$ satisfying the standard rate condition $K/n\to 0$, $\sqrt{n}(\hat\theta_n(K)-\theta)$ is asymptotically normal with variance $V = \Gamma^{-1}\Omega\Gamma^{-1}$, $\Gamma = E[v_iv_i']$, $\Omega = E[v_iv_i'\varepsilon_i^2]$. Unlike the nonparametric object of interest in the fully nonparametric model, where the variance term increases with $K$, $\hat\theta_n(K)$ has a parametric ($n^{1/2}$) convergence rate, and the estimators $\hat\theta_n(K)$ along all such sequences of $K$ are asymptotically equivalent under $K/n\to 0$. (See also Robinson (1988), Linton (1995), and the references therein for corresponding results for kernel estimators. This is also related to the well-known result in two-step semiparametric estimation that the asymptotic variance of a two-step semiparametric estimator does not depend on the type of the first-step estimator or the smoothing parameter sequence under certain conditions; see Newey (1994b).)

However, under the faster rate condition $K/n\to c$ for $0<c<1$, the limiting behavior varies across $K$, so that we can provide an asymptotic distribution of the t-statistics along the different sequences of $K$ over $\mathcal{K}_n$.

The following assumption on $\mathcal{K}_n$ is considered, and we impose the regularity conditions that are used in Cattaneo, Jansson, and Newey (2018a, Assumption PLM) uniformly over $K\in\mathcal{K}_n$.

Assumption 5.1. (Set of finite number of series terms) Assume $\mathcal{K}_n = \{\underline{K}\equiv K_1,\cdots,K_m,\cdots,\overline{K}\equiv K_p\}$, where $K_m\to\infty$ and $K_m/n\to c_m$ as $n\to\infty$ for all $m = 1,\ldots,p$, with constants $c_m$ such that $0 < c_1 < c_2 < \cdots < c_p < 1$, and fixed $p$.

Assumption 5.2. (Regularity conditions - partially linear model)
(i) $\{y_i,w_i,x_i\}_{i=1}^n$ are i.i.d. random variables satisfying the model (5.1).
(ii) There exist constants $0 < c \leq C < \infty$ such that $E[\varepsilon_i^2\mid w_i,x_i] \geq c$ and $E[v_i^2\mid x_i] \geq c$, and $E[\varepsilon_i^4\mid w_i,x_i] \leq C$ and $E[v_i^4\mid x_i] \leq C$.
(iii) $\mathrm{rank}(P^K) = K$ (a.s.) and $M_{K,ii} \geq C$ for some $C > 0$, for all $K\in\mathcal{K}_n$.
(iv) For each $K\in\mathcal{K}_n$, there exist some $\gamma_g, \gamma_{g_w} > 0$ such that
$$\min_{\eta_g} E[(g(x_i)-\eta_g'P_{Ki})^2] = O(K^{-2\gamma_g}), \qquad \min_{\eta_{g_w}} E[(g_w(x_i)-\eta_{g_w}'P_{Ki})^2] = O(K^{-2\gamma_{g_w}}).$$

Assumption 5.2 does not require $K/n\to 0$. For instance, $\gamma_g = s_g/d_x$ and $\gamma_{g_w} = s_w/d_x$ when $\mathcal{X}$ is compact and the unknown functions $g(x)$ and $g_w(x)$ have $s_g$ and $s_w$ continuous derivatives, respectively.

Under Assumptions 5.1 and 5.2 and the undersmoothing condition $nK^{-2(\gamma_g+\gamma_{g_w})}\to 0$, we derive the joint asymptotic distribution of the t-statistics $T_n(K,\theta) = \sqrt{n}\,V_n(K)^{-1/2}(\hat\theta_n(K)-\theta)$ over $K\in\mathcal{K}_n$:
$$(T_n(K_1,\theta),\cdots,T_n(K_p,\theta))' \overset{d}{\longrightarrow} Z_\Sigma = (Z_1,\cdots,Z_p)' \sim N(0,\Sigma),$$
where $V_n(K) = \Gamma_n(K)^{-1}\Omega_n(K)\Gamma_n(K)^{-1}$,
$$\Gamma_n(K) = \frac{1}{n}\sum_{i=1}^n M_{K,ii}E[v_i^2\mid x_i], \qquad \Omega_n(K) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n M_{K,ij}^2E[v_i^2\varepsilon_j^2\mid x_i,x_j],$$
and the variance-covariance matrix $\Sigma$ has $(l,l')$ element
$$\Sigma(l,l') \equiv \lim_{n\to\infty}\frac{V_n(K_l,K_{l'})}{V_n(K_l)^{1/2}V_n(K_{l'})^{1/2}}, \qquad V_n(K_l,K_{l'}) = \Gamma_n(K_l)^{-1}\Omega_n(K_l,K_{l'})\Gamma_n(K_{l'})^{-1},$$
$$\Omega_n(K_l,K_{l'}) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n M_{K_l,ij}M_{K_{l'},ij}E[v_i^2\varepsilon_j^2\mid x_i,x_j], \qquad (5.4)$$
for $l,l' = 1,\ldots,p$. Then, we can similarly define critical values as in (3.3) to construct confidence intervals for $\theta$ uniform in $K\in\mathcal{K}_n$, analogous to the nonparametric setup. Let
$$\hat c_{1-\alpha} \equiv (1-\alpha)\ \text{quantile of}\ \max_{m=1,\ldots,p}|\hat Z_m|, \qquad \hat Z_\Sigma = (\hat Z_1,\ldots,\hat Z_p)' \sim N(0,\hat\Sigma_n), \qquad (5.5)$$
where $\hat\Sigma_n$ is a consistent estimator of the unknown $\Sigma$ defined in (5.4).

Theorem 5.1 is the main result for the partially linear model setup; it provides asymptotic coverage results for the CIs uniform in $K\in\mathcal{K}_n$, analogous to the nonparametric setup in Section 3.

Theorem 5.1. Suppose that Assumptions 5.1 and 5.2 hold. In addition, assume that $nK^{-2(\gamma_g+\gamma_{g_w})}\to 0$ and $\max_{K,K'\in\mathcal{K}_n}\big|\frac{\hat V_n(K,K')}{V_n(K,K')}-1\big| = o_p(1)$ as $n,K\to\infty$.
Then,
$$\lim_{n\to\infty} P\Big(\theta \in \Big[\hat\theta_n(K) \pm \hat c_{1-\alpha}\sqrt{\hat V_n(K)/n}\Big],\ \forall K\in\mathcal{K}_n\Big) = 1-\alpha, \qquad (5.6)$$
$$\liminf_{n\to\infty} P\Big(\theta \in \Big[\hat\theta_n(\hat K) \pm \hat c_{1-\alpha}\sqrt{\hat V_n(\hat K)/n}\Big]\Big) \geq 1-\alpha, \quad \hat K\in\mathcal{K}_n, \qquad (5.7)$$
where the critical value $\hat c_{1-\alpha}$ is defined in (5.5).

Remark 5.1.
Note that the construction of the CIs requires consistent estimation of the variance $\Omega_n(K)$. As discussed in Cattaneo, Jansson, and Newey (2018a, 2018b), constructing a heteroskedasticity-robust estimator of $\Omega_n(K)$ under $K/n\to c > 0$ is more delicate than under $K/n\to 0$. Consider estimators of the form
$$\hat\Omega_n(K,\kappa_n) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n \kappa_{ij}\,\hat v_{K,i}^2\,\hat\varepsilon_{K,j}^2, \qquad (5.8)$$
where $\hat v_K = M_KW$, $\hat\varepsilon_K = M_K(Y-W\hat\theta_n(K))$, and $\kappa_n$ is a symmetric matrix with $(i,j)$ element $\kappa_{ij}$. Cattaneo, Jansson, and Newey (2018b) show that $\hat\Omega_n(K,\kappa_n)$ is consistent even under heteroskedasticity and $K/n\to c > 0$ for suitable choices of $\kappa_n$, and they provide a sufficient condition for consistency. See Theorems 3 and 4 of Cattaneo, Jansson, and Newey (2018b) for further discussion.

This section investigates the small sample performance of the proposed inference methods. We report the empirical coverage and the average length of the confidence intervals/confidence bands considered in Sections 3 and 4 under various simulation setups.

We consider the following data generating process:
$$y_i = g(x_i) + \varepsilon_i, \qquad x_i = \Phi(x_i^*), \qquad \begin{pmatrix}x_i^*\\ \varepsilon_i\end{pmatrix} \sim N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & 0\\ 0 & \sigma^2(x_i^*)\end{pmatrix}\right),$$
where $\Phi(\cdot)$ is the standard normal cumulative distribution function, used to ensure compact support, and $\sigma^2(x_i^*) \propto (1+2x_i^*)^2$ (heteroskedastic). We investigate three functions for $g(x)$: $g_1(x)$, a log-sign function of the form $\ln(|x-\cdot|+1)\,\mathrm{sgn}(x-\cdot)$; $g_2(x)$, a rescaled sine function; and $g_3(x)$, a function with a sharp peak generated by the normal density $\phi(10(x-\cdot))$, where $\phi(\cdot)$ is the standard normal probability density function and $\mathrm{sgn}(\cdot)$ is the sign function. $g_1(x)$ is used in Newey and Powell (2003), as well as Chen and Christensen (2018); $g_2(x)$ and $g_3(x)$ are rescaled versions of functions used in Hall and Horowitz (2013).
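The design above can be generated in a few lines. A minimal sketch, with two caveats: the peak function below is only a stand-in in the spirit of $g_3$ (the paper's exact constants are not reproduced), and the error-scale normalization is an illustrative choice:

```python
import math
import numpy as np

def simulate_dgp(n, g, seed=0):
    """Draw one sample: x_i = Phi(x*_i) with x*_i standard normal (so x has
    compact support in (0, 1)), and heteroskedastic errors with standard
    deviation proportional to |1 + 2 x*_i|.  The factor 1/2 below is a
    placeholder normalization; `g` is any regression function on [0, 1]."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(n)
    x = np.array([0.5 * (1.0 + math.erf(z / math.sqrt(2.0))) for z in x_star])  # Phi(x*)
    sigma = np.abs(1.0 + 2.0 * x_star) / 2.0    # heteroskedastic scale (illustrative)
    y = g(x) + sigma * rng.standard_normal(n)
    return x, y

# Stand-in peak function in the spirit of g3 (sharp bump near the middle):
g_peak = lambda x: x - 0.5 + np.exp(-50.0 * (x - 0.5) ** 2)
```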
See Figure 1 for the shapes of all three functions on $[0,1]$. The sample size is $n = 200$. We report results for quadratic splines with evenly placed knots, where the number of series terms $K$ is selected among $\mathcal{K}_n = \{\underline{K}, \underline{K}+1, \ldots, \overline{K}\}$, with the endpoints $\underline{K}$ and $\overline{K}$ set proportional to powers of $n$ and rounded up to the nearest integer. We then calculate the pointwise coverage rate (COV) and the average length (AL) of various 95% nominal CIs, as well as of the analogous uniform CBs, over grid points of $x$ in an inner subinterval of the support $\mathcal{X}$. We also considered homoskedastic errors ($\sigma^2(x_i^*) = 1$), other sample sizes, polynomial regressions, and different specifications as in Cattaneo and Farrell (2013) with multivariate and non-normal regressors; the results show qualitatively similar patterns and hence are not reported here for brevity. Additional simulation results are reported in the Online Supplementary Material.

Table 1 reports the nominal 95% coverage of the following pointwise CIs at four interior evaluation points: (1) the standard CI with $\hat K_{cv}\in\mathcal{K}_n$ selected to minimize the leave-one-out cross-validation criterion; (2) the robust CI in (3.6) with $\hat K_{cv}$, using the critical value $\hat c_{1-\alpha}(x)$; and (3) the robust CI using $\hat K_{cv+} = \hat K_{cv} + 2$. Analogous uniform inference results for CBs are also reported. The critical values $\hat c_{1-\alpha}(x)$ and $\hat c_{1-\alpha}$ are constructed using Monte Carlo methods and the weighted bootstrap method, respectively. Overall, we find that the coverage of the standard CI with $\hat K_{cv}$ is far below 95% over the support, although it has the shortest length. In contrast, the coverage of the robust CIs based on $\hat K_{cv}$ or $\hat K_{cv+}$ with $\hat c_{1-\alpha}(x)$ is close to or above 95% and performs well across the different simulation designs, consistent with the theoretical results in Theorem 3.1. Using the undersmoothed $\hat K_{cv+}$ (more terms than cross-validation selects) works quite well at most points, in particular for highly nonlinear designs with relatively large bias, e.g., Model 3 ($g_3(x)$) at the peak $x = 0.5$.
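The leave-one-out cross-validation choice $\hat K_{cv}$ used above has a well-known closed form for linear smoothers: the leave-one-out residual equals the full-sample residual divided by one minus the leverage. A short sketch (a simple power-series basis stands in for the quadratic splines, and all names are illustrative):

```python
import numpy as np

def loo_cv_choice(y, x, K_grid):
    """Pick K minimizing leave-one-out CV over a grid of series terms.

    Uses the linear-smoother shortcut: loo_residual_i = residual_i / (1 - h_ii),
    where h_ii is the leverage of the series LS fit.  A power-series basis
    replaces the paper's quadratic splines here (illustrative simplification).
    """
    scores = {}
    for K in K_grid:
        P = np.vander(x, K, increasing=True)     # (n, K) basis matrix
        H = P @ np.linalg.pinv(P)                # hat matrix P (P'P)^{-1} P'
        resid = y - H @ y
        h = np.clip(np.diag(H), 0.0, 1.0 - 1e-10)
        scores[K] = float(np.mean((resid / (1.0 - h)) ** 2))
    K_cv = min(scores, key=scores.get)
    return K_cv, scores
```

A robust interval would then be built at $\hat K_{cv}$ (or $\hat K_{cv}+2$) using the sup-t critical value rather than 1.96.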
Uniform coverage rates of the confidence bands with selected $K$ appear conservative; this is due to the large critical values based on the weighted bootstrap, which must be uniform in both $K\in\mathcal{K}_n$ and $x\in\mathcal{X}$, including boundary points.

In this section, we illustrate the inference procedures by revisiting Blomquist and Newey (2002). Understanding how tax policy affects individual labor supply has been a central issue in labor economics (see Hausman (1985) and Blundell and MaCurdy (1999), among many others). Blomquist and Newey (2002) estimate the conditional mean of hours of work given the individual's nonlinear budget set using nonparametric series estimation. They also estimate the wage elasticity of the expected labor supply and find evidence of possible misspecification of the usual parametric approach, such as maximum likelihood estimation (MLE). (The possibly poor coverage of standard kernel-based CIs for $g_3(x)$ at the single peak $x = 0.5$ was also described in Hall and Horowitz (2013, Figure 3).)

Specifically, Blomquist and Newey (2002) exploit an additive structure and consider the following model:
$$h_i = g(x_i) + \varepsilon_i, \qquad E(\varepsilon_i\mid x_i) = 0, \qquad (7.1)$$
$$g(x_i) = g(y_J, w_J) + \sum_{j=1}^{J-1}\big[g(y_j, w_j, \ell_j) - g(y_{j+1}, w_{j+1}, \ell_j)\big], \qquad (7.2)$$
where $h_i$ is the hours worked of the $i$-th individual and $x_i = (y_1,\cdots,y_J, w_1,\cdots,w_J, \ell_1,\cdots,\ell_J, J)$ is the budget set, which can be represented by the intercept $y_j$ (non-labor income), slope $w_j$ (marginal wage rate), and end point $\ell_j$ of the $j$-th segment in a piecewise-linear budget set with $J$ segments. Equation (7.2) for the conditional mean function follows from Theorem 2.1 of Blomquist and Newey (2002), and this additive structure substantially reduces the dimensionality of the problem. To approximate $g(x)$, they consider the power series
$$p_k(x) = \Big(y_J^{p(k)}w_J^{q(k)},\ \sum_{j=1}^{J-1}\ell_j^{m(k)}\big(y_j^{p(k)}w_j^{q(k)} - y_{j+1}^{p(k)}w_{j+1}^{q(k)}\big)\Big)$$
with nonnegative integer exponents $p(k)$, $q(k)$, $m(k)$. The sample size is $n = 2321$. See Section 5 of Blomquist and Newey (2002) for more detailed descriptions. They estimate the wage elasticity of the expected labor supply,
$$E_w = \frac{\bar w}{\bar h}\Big[\frac{\partial g(w,\cdots,w,\bar y,\cdots,\bar y)}{\partial w}\Big]\Big|_{w=\bar w}, \qquad (7.3)$$
which is the regression derivative of $g(x)$ evaluated at the mean net wage rate $\bar w$, virtual income $\bar y$, and level of hours $\bar h$.

Table 2 reproduces Table 1 of Blomquist and Newey (2002). They report estimates $\hat E_w$ and standard errors $SE(\hat E_w)$ for a different number of series terms, obtained by adding further series terms. For example, the estimates in the second row use the terms in the first row $(1, y_J, w_J)$ with the additional terms $(\Delta y, \Delta w)$. Here, $\ell^m\Delta y^pw^q$ denotes approximating the term $\sum_j\ell_j^m(y_j^pw_j^q - y_{j+1}^pw_{j+1}^q)$. Blomquist and Newey (2002) also report a cross-validation criterion, $CV$, for each specification. In their formulation, series terms are chosen to maximize $CV$, which minimizes the asymptotic MSE.
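Given a table of estimates and standard errors across specifications, the sup-t critical value and the corresponding robust intervals can be simulated in a few lines; a sketch with hypothetical placeholder inputs (not the paper's numbers):

```python
import numpy as np

def sup_t_critval(Sigma, alpha=0.05, reps=100_000, seed=0):
    """(1 - alpha) quantile of max_m |Z_m| for Z ~ N(0, Sigma), by Monte Carlo,
    where Sigma is an estimated correlation matrix across specifications."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((reps, Sigma.shape[0])) @ L.T
    return float(np.quantile(np.abs(Z).max(axis=1), 1 - alpha))

def robust_cis(estimates, ses, Sigma, alpha=0.05):
    """Per-specification robust CIs: estimate +/- c_sup * SE, with the sup-t
    critical value replacing the pointwise 1.96."""
    c = sup_t_critval(Sigma, alpha)
    return [(e - c * s, e + c * s) for e, s in zip(estimates, ses)]
```

With a single specification ($p = 1$) the critical value reduces to the usual normal quantile, so the robust interval nests the standard one.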
In addition to their original table, we add the standard 95% CI for each specification, i.e., $CI(K) = \hat E_w(K) \pm 1.96\,SE(\hat E_w(K))$. From Table 2, it is ambiguous which large model ($K$) should be used for inference, and we do not have compelling data-dependent methods for selecting one of the large $K$ for the confidence interval to be reported. Here we want to construct CIs that are robust to such specification searches.

Figure 2 displays pointwise 95% uniform CIs for each $K_m\in\mathcal{K}_n$, where $K_m$ corresponds to a specification in Table 2 with increasing order of series terms, along with the point estimates and the standard 95% confidence intervals. From Figure 2, we reject a zero wage elasticity of the expected labor supply uniformly over the reported specifications. (It is straightforward to construct $\hat c_{1-\alpha}(x)$ using the covariance structure under homoskedastic errors; it only requires the estimated variances for the different $K\in\mathcal{K}_n$ that are already reported in the table of Blomquist and Newey (2002), and the critical value is based on 100,000 simulation repetitions.) Table 2 also reports the robust confidence intervals $CI^{\sup}_{\hat E_w}(K) = \hat E_w(K) \pm \hat c_{1-\alpha}(x)\,SE(\hat E_w(K))$ with possibly data-dependent $\hat K$, justified by Theorem 3.1 (eq. (3.6)). Cross-validation chooses $\hat K_{cv}$; using the undersmoothed $\hat K_{cv+}$ or $\hat K_{cv++}$ widens the standard CI, and the robust CIs $CI^{\sup}_{\hat E_w}(\hat K_{cv+})$ and $CI^{\sup}_{\hat E_w}(\hat K_{cv++})$ are wider still but continue to exclude zero.

This paper considers nonparametric inference methods given specification searches over different numbers of series terms in the nonparametric series regression model. We provide methods for constructing uniform CIs and confidence bands by adjusting the conventional normal critical value to a critical value based on the supremum of the t-statistics.
The critical values can be constructed using simple Monte Carlo simulation or weighted bootstrap methods. We then provide an extension of the proposed CIs to the partially linear model setup. Finally, we investigate the finite sample properties of the proposed methods and illustrate the uniform CIs in an empirical example from Blomquist and Newey (2002).

While beyond the scope of this paper, there are several potential directions in which to extend the results established here. First, investigating the coverage property of CIs with data-dependent $\hat K$ using bias-corrected methods is of interest. In particular, it would be of interest to analyze bias-corrected CIs and confidence bands using cross-validation methods combined with the recent results established in Cattaneo, Farrell, and Feng (2019). Second, an extension of the current theory to quantile regression (e.g., Belloni, Chernozhukov, Chetverikov, and Fernández-Val (2019)) or to the nonparametric IV setup would be desirable. In the NPIV setup, for example, one can consider pointwise CIs (or uniform confidence bands) that are uniform in pairs $(K_n, J_n)\in\mathcal{K}_n\times\mathcal{J}_n$, with an additional dimension for the instrument sieve and the number of instruments $J = J_n$. This is a difficult problem, and it would require a distinct theory to address the ill-posed inverse problem as well as the two-dimensional choice. We leave these topics for future research.

References

Andrews, D. W. K. (1991a): "Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Regression Models,"
Econometrica, 59, 307-345.

Andrews, D. W. K. (1991b): "Asymptotic Optimality of Generalized C_L, Cross-Validation, and Generalized Cross-Validation in Regression with Heteroskedastic Errors," Journal of Econometrics, 47, 359-377.

Armstrong, T. B. and M. Kolesár (2018): "A Simple Adjustment for Bandwidth Snooping," Review of Economic Studies, 85, 732-765.

Belloni, A., V. Chernozhukov, D. Chetverikov, and I. Fernández-Val (2019): "Conditional Quantile Processes Based on Series or Many Regressors," Journal of Econometrics, 213, 4-29.

Belloni, A., V. Chernozhukov, D. Chetverikov, and K. Kato (2015): "Some New Asymptotic Theory for Least Squares Series: Pointwise and Uniform Results," Journal of Econometrics, 186, 345-366.

Belloni, A., V. Chernozhukov, and C. Hansen (2014): "Inference on Treatment Effects after Selection among High-Dimensional Controls," Review of Economic Studies, 81, 608-650.

Blomquist, S. and W. K. Newey (2002): "Nonparametric Estimation with Nonlinear Budget Sets," Econometrica, 70, 2455-2480.

Blundell, R. and T. E. MaCurdy (1999): "Labor Supply: A Review of Alternative Approaches," Handbook of Labor Economics, In: O. Ashenfelter, D. Card (Eds.), Vol. 3, Elsevier, Chapter 27.

Calonico, S., M. D. Cattaneo, and M. H. Farrell (2018): "On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference," Journal of the American Statistical Association, 113, 767-779.

Cattaneo, M. D. and M. H. Farrell (2013): "Optimal Convergence Rates, Bahadur Representation, and Asymptotic Normality of Partitioning Estimators," Journal of Econometrics, 174, 127-143.

Cattaneo, M. D., M. H. Farrell, and Y. Feng (2019): "Large Sample Properties of Partitioning-Based Series Estimators," Annals of Statistics, forthcoming.

Cattaneo, M. D., M. Jansson, and W. K. Newey (2018a): "Alternative Asymptotics and the Partially Linear Model with Many Regressors," Econometric Theory, 34, 277-301.

Cattaneo, M. D., M. Jansson, and W. K. Newey (2018b): "Inference in Linear Regression Models with Many Covariates and Heteroscedasticity," Journal of the American Statistical Association, 113, 1350-1361.

Chao, J. C., N. R. Swanson, J. A. Hausman, W. K. Newey, and T. Woutersen (2012): "Asymptotic Distribution of JIVE in a Heteroskedastic IV Regression with Many Instruments," Econometric Theory, 28, 42-86.

Chatterjee, S. (2005): "An Error Bound in the Sudakov-Fernique Inequality," arXiv:math/0510424.

Chen, X. (2007): "Large Sample Sieve Estimation of Semi-nonparametric Models," Handbook of Econometrics, In: J. J. Heckman, E. Leamer (Eds.), Vol. 6B, Elsevier, Chapter 76.

Chen, X. and T. Christensen (2015): "Optimal Uniform Convergence Rates and Asymptotic Normality for Series Estimators Under Weak Dependence and Weak Conditions," Journal of Econometrics, 188, 447-465.

Chen, X. and T. Christensen (2018): "Optimal Sup-norm Rates and Uniform Inference on Nonlinear Functionals of Nonparametric IV Regression," Quantitative Economics, 9(1), 39-85.

Chen, X. and Z. Liao (2014): "Sieve M Inference on Irregular Parameters," Journal of Econometrics, 182, 70-86.

Chen, X., Z. Liao, and Y. Sun (2014): "Sieve Inference on Possibly Misspecified Semi-nonparametric Time Series Models," Journal of Econometrics, 178, 639-658.

Chen, X. and X. Shen (1998): "Sieve Extremum Estimates for Weakly Dependent Data," Econometrica, 66(2), 289-314.

Chernozhukov, V., D. Chetverikov, and K. Kato (2014a): "Gaussian Approximation of Suprema of Empirical Processes," The Annals of Statistics, 42(4), 1564-1597.

Chernozhukov, V., D. Chetverikov, and K. Kato (2014b): "Anti-Concentration and Honest, Adaptive Confidence Bands," The Annals of Statistics, 42(5), 1787-1818.

Chernozhukov, V., D. Chetverikov, and K. Kato (2016): "Empirical and Multiplier Bootstraps for Suprema of Empirical Processes of Increasing Complexity, and Related Gaussian Couplings," Stochastic Processes and their Applications, 126(12), 3632-3651.

Donald, S. G. and W. K. Newey (1994): "Series Estimation of Semilinear Models," Journal of Multivariate Analysis, 50, 30-40.

Eastwood, B. J. and A. R. Gallant (1991): "Adaptive Rules for Seminonparametric Estimators That Achieve Asymptotic Normality," Econometric Theory, 7, 307-340.

Giné, E. and R. Nickl (2010): "Confidence Bands in Density Estimation," The Annals of Statistics, 38, 1122-1170.

Giné, E. and R. Nickl (2015): Mathematical Foundations of Infinite-Dimensional Statistical Models, Cambridge University Press.

Hall, P. and J. Horowitz (2013): "A Simple Bootstrap Method for Constructing Nonparametric Confidence Bands for Functions," The Annals of Statistics, 41, 1892-1921.

Hansen, B. E. (2015): "The Integrated Mean Squared Error of Series Regression and a Rosenthal Hilbert-Space Inequality," Econometric Theory, 31, 337-361.

Hansen, P. R. (2005): "A Test for Superior Predictive Ability," Journal of Business and Economic Statistics, 23, 365-380.

Härdle, W. and O. Linton (1994): "Applied Nonparametric Methods," Handbook of Econometrics, In: R. F. Engle, D. F. McFadden (Eds.), Vol. 4, Elsevier, Chapter 38.

Hausman, J. A. (1985): "The Econometrics of Nonlinear Budget Sets," Econometrica, 53, 1255-1282.

Heckman, J. J., L. J. Lochner, and P. E. Todd (2006): "Earnings Functions, Rates of Return and Treatment Effects: The Mincer Equation and Beyond," Handbook of the Economics of Education, In: E. A. Hanushek and F. Welch (Eds.), Vol. 1, Elsevier, Chapter 7.

Horowitz, J. L. (2014): "Adaptive Nonparametric Instrumental Variables Estimation: Empirical Choice of the Regularization Parameter," Journal of Econometrics, 180, 158-173.

Horowitz, J. L. and S. Lee (2012): "Uniform Confidence Bands for Functions Estimated Nonparametrically with Instrumental Variables," Journal of Econometrics, 168, 175-188.

Huang, J. Z. (2003): "Local Asymptotics for Polynomial Spline Regression," The Annals of Statistics, 31, 1600-1635.

Kozbur, D. (2018): "Inference in Additively Separable Models With a High-Dimensional Set of Conditioning Variables," Working Paper, arXiv:1503.05436.

Leamer, E. E. (1983): "Let's Take the Con Out of Econometrics," The American Economic Review, 73, 31-43.

Lepski, O. V. (1990): "On a Problem of Adaptive Estimation in Gaussian White Noise," Theory of Probability and its Applications, 35, 454-466.

Li, K. C. (1987): "Asymptotic Optimality for C_p, C_L, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958-975.

Li, Q. and J. S. Racine (2007): Nonparametric Econometrics: Theory and Practice, Princeton University Press.

Linton, O. (1995): "Second Order Approximation in the Partially Linear Regression Model," Econometrica, 63(5), 1079-1112.

Newey, W. K. (1994a): "Series Estimation of Regression Functionals," Econometric Theory, 10, 1-28.

Newey, W. K. (1994b): "The Asymptotic Variance of Semiparametric Estimators," Econometrica, 62, 1349-1382.

Newey, W. K. (1997): "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics, 79, 147-168.

Newey, W. K. (2013): "Nonparametric Instrumental Variables Estimation," American Economic Review: Papers & Proceedings, 103, 550-556.

Newey, W. K. and J. L. Powell (2003): "Instrumental Variable Estimation of Nonparametric Models," Econometrica, 71, 1565-1578.

Newey, W. K., J. L. Powell, and F. Vella (1999): "Nonparametric Estimation of Triangular Simultaneous Equations Models," Econometrica, 67, 565-603.

Robinson, P. M. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica, 56(4), 931-954.

Romano, J. P. and M. Wolf (2005): "Stepwise Multiple Testing as Formalized Data Snooping," Econometrica, 73, 1237-1282.

Schennach, S. M. (2015): "A Bias Bound Approach to Nonparametric Inference," CEMMAP Working Paper CWP71/15.

van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes, Springer.

White, H. (2000): "A Reality Check for Data Snooping," Econometrica, 68, 1097-1126.

Zhou, S., X. Shen, and D. A. Wolfe (1998): "Local Asymptotics for Regression Splines and Confidence Regions," The Annals of Statistics, 26, 1760-1782.

Appendix A Proofs
A.1 Preliminaries and Useful Lemmas
We define additional notation for the empirical process theory used in the proof of Theorem 4.1. Given a measurable space $(S,\mathcal{S})$, let $\mathcal{F}$ be a class of measurable functions $f: S\to\mathbb{R}$. For any probability measure $Q$ on $(S,\mathcal{S})$, we define the covering number $N(\epsilon,\mathcal{F},L_2(Q))$ as the minimal number of $L_2(Q)$-balls of radius $\epsilon$ needed to cover $\mathcal{F}$, with $L_2(Q)$ norm $\|f\|_{Q,2} = (\int|f|^2\,dQ)^{1/2}$. The uniform entropy numbers relative to the $L_2(Q)$ norms are defined as $\sup_Q\log N(\epsilon\|F\|_{Q,2},\mathcal{F},L_2(Q))$, where the supremum is over all discrete probability measures and $F$ is an envelope function of $\mathcal{F}$. We say $\mathcal{F}$ is of VC type with envelope $F$ if there are constants $A, v > 0$ such that $\sup_Q N(\epsilon\|F\|_{Q,2},\mathcal{F},L_2(Q)) \leq (A/\epsilon)^v$ for all $0 < \epsilon \leq 1$.

Let $z_i = (\varepsilon_i, x_i)$ be i.i.d. random vectors defined on the probability space $(\mathcal{Z} = \mathcal{E}\times\mathcal{X}, \mathcal{A}, P)$ with common probability distribution $P \equiv P_{\varepsilon,x}$. We think of $(\varepsilon_1,x_1),\cdots,(\varepsilon_n,x_n)$ as the coordinates of the infinite product probability space. We avoid discussing nonmeasurability issues and outer expectations (for the related issues, see van der Vaart and Wellner (1996)). Throughout the proofs, we let $c, C > 0$ denote generic constants that do not depend on $n$ and may change at each appearance.

For any sequence $\{K = K_n : n\geq 1\}\in\prod_{n=1}^\infty\mathcal{K}_n$ under Assumption 2.1, we first define the orthonormalized vector of basis functions
$$\tilde P(K,x) \equiv Q_K^{-1/2}P(K,x) = E[P_{Ki}P_{Ki}']^{-1/2}P(K,x), \qquad \tilde P_{Ki} = \tilde P(K,x_i), \qquad \tilde P^K = [\tilde P_{K1},\cdots,\tilde P_{Kn}]'.$$
We observe that $\hat g_n(K,x) = \tilde P(K,x)'(\tilde P^{K\prime}\tilde P^K)^{-1}\tilde P^{K\prime}Y$ and $V_n(K,x) = \tilde P(K,x)'\tilde\Omega_K\tilde P(K,x)$, with $\tilde\Omega_K = E(\tilde P_{Ki}\tilde P_{Ki}'\varepsilon_i^2)$.
Without loss of generality, we may impose the normalization $Q_K = E(P_{Ki}P_{Ki}') = I_K$ uniformly over $K\in\mathcal{K}_n$, since $\hat g_n(K,x)$ is invariant to nonsingular linear transformations of $P(K,x)$. However, we shall treat $Q_K$ as unknown and deal with the non-orthonormalized series terms. Next, we re-define the pseudo-true value $\beta_K$, with an abuse of notation, using the orthonormalized series terms $\tilde P_{Ki}$. That is, $y_i = \tilde P_{Ki}'\beta_K + \varepsilon_{Ki}$, $E[\tilde P_{Ki}\varepsilon_{Ki}] = 0$, where $\varepsilon_{Ki} = r_{Ki} + \varepsilon_i$, $r_n(K,x) = g(x) - \tilde P(K,x)'\beta_K$, $r_{Ki} = r_n(K,x_i)$, and $r_K \equiv (r_{K1},\cdots,r_{Kn})'$. We also define $\hat Q_K \equiv \frac{1}{n}\tilde P^{K\prime}\tilde P^K$, $\underline\sigma^2 \equiv \inf_x E[\varepsilon_i^2\mid x_i=x]$, and $\bar\sigma^2 \equiv \sup_x E[\varepsilon_i^2\mid x_i=x]$.

We first provide useful lemmas that will be used in the proofs of Theorems 3.1 and 4.1. Versions of the proofs of Lemmas 1 and 2 with $\mathcal{K}_n = \{K\}$ are available in the literature, for example in Belloni et al. (2015) and Chen and Christensen (2015), among many others. Maximal inequalities are used in the proofs of Lemmas 1 and 2 to bound the remainder terms in the linearization of the t-statistics. Also note that different rate conditions on $K$, such as those in Newey (1997), can be used here but lead to different bounds. We provide the proofs of Lemmas 1 and 2 in the Online Supplementary Material (Section B).

Lemma 1. Suppose that Assumptions 2.1, 3.1, and 3.2 hold. Then $\|\hat Q_K - I_K\| = O_p\big(\sqrt{\lambda_K^2\zeta_K^2\log K/n}\big)$ for any $K\in\mathcal{K}_n$, and the following holds:
$$\max_{K\in\mathcal{K}_n}|R_1(K,x)| = O_p\Big(\max_{K\in\mathcal{K}_n}\sqrt{\frac{\lambda_K^2\zeta_K^2\log K\log p}{n}}\,\big(1+\ell_Kc_K\sqrt{K}\big)\Big), \qquad (A.1)$$
$$\max_{K\in\mathcal{K}_n}|R_2(K,x)| = O_p\Big(\max_{K\in\mathcal{K}_n}(\ell_Kc_K)\sqrt{\log p}\Big), \qquad (A.2)$$
where, with $\varepsilon = (\varepsilon_1,\cdots,\varepsilon_n)'$,
$$R_1(K,x) \equiv \frac{1}{\sqrt{nV_n(K,x)}}\tilde P(K,x)'(\hat Q_K^{-1}-I_K)\tilde P^{K\prime}(\varepsilon+r_K), \qquad R_2(K,x) \equiv \frac{1}{\sqrt{nV_n(K,x)}}\tilde P(K,x)'\tilde P^{K\prime}r_K.$$

Lemma 2.
Suppose that Assumptions 2.1, 3.1 and 4.1 hold. Then the following holds:
$$\sup_{K\in\mathcal{K}_n,\,x\in\mathcal{X}}|R_1(K,x)| = O_p\Big(\max_{K\in\mathcal{K}_n}\sqrt{\frac{\lambda_K^2\zeta_K^2\log K\log n}{n}}\,\big(n^{1/q}+\ell_Kc_K\sqrt{K}\big)\Big), \qquad (A.3)$$
$$\sup_{K\in\mathcal{K}_n,\,x\in\mathcal{X}}|R_2(K,x)| = O_p\Big(\max_{K\in\mathcal{K}_n}(\ell_Kc_K)\sqrt{\log n}\Big), \qquad (A.4)$$
where $R_1(K,x)$ and $R_2(K,x)$ are defined in Lemma 1.

A.2 Proofs of the Main Results
A.2.1 Proof of Theorem 3.1
Proof.
For any $K \in \mathcal{K}_n$, we first consider the decomposition of the $t$-statistic in (2.4) with the known variance $V_n(K,x)$,
$$T_n(K,x) = \sqrt{\frac{n}{V_n(K,x)}}\,\tilde{P}(K,x)'(\widehat{\beta}_K - \beta_K) - \sqrt{\frac{n}{V_n(K,x)}}\, r_n(K,x) = t_n(K,x) + R_1(K,x) + R_2(K,x) + \nu_n(K,x)$$
where $t_n(K,x) = n^{-1/2}\sum_{i=1}^{n}\tilde{P}(K,x)'\tilde{P}_{Ki}\varepsilon_i / V_n(K,x)^{1/2}$, $R_1(K,x)$ and $R_2(K,x)$ are defined in Lemma 1, and $\nu_n(K,x) = -\sqrt{n}\,V_n(K,x)^{-1/2} r_n(K,x)$. Define
$$t_n \equiv (t_n(K_1,x), \cdots, t_n(K_p,x))' = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_i$$
where $\xi_i = (\xi_{i1}, \xi_{i2}, \cdots, \xi_{ip})' \in \mathbb{R}^p$ with $\xi_{ij} = \tilde{P}(K_j,x)'\tilde{P}_{K_j i}\varepsilon_i / V_n(K_j,x)^{1/2}$ and $p = |\mathcal{K}_n|$. Note that $E[\xi_{ij}] = 0$ and
$$E[|\xi_{ij}|^3] \lesssim E\big[|\tilde{P}(K_j,x)'\tilde{P}_{K_j i}/V_n(K_j,x)^{1/2}|^3\big] \sup_{x} E[|\varepsilon_i|^3 \mid x_i = x] \lesssim \max_{K}\zeta_K$$
for all $1 \le i \le n$, $1 \le j \le p$. By Lemma A.2 in the Online Supplementary Material, for any $\delta > 0$, there exists a random variable $\max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|$ with independent random vectors $\{Z_i\}_{i=1}^{n} \in \mathbb{R}^p$, $Z_i \sim N(0, n^{-1}E[\xi_i\xi_i'])$, $1 \le i \le n$, such that
$$P\Big(\Big|\max_{1\le j\le p}|t_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| > \delta\Big) \lesssim \frac{\log(p\vee n)}{\delta^2} D_1 + \frac{\log^2(p\vee n)}{\delta^3 n^{1/2}}(D_2 + D_3) + \frac{\log n}{n}$$
where $D_1 = E\big[\max_{1\le j,l\le p}|n^{-1}\sum_{i=1}^{n}(\xi_{ij}\xi_{il} - E[\xi_{ij}\xi_{il}])|\big]$, $D_2 = E\big[\max_{1\le j\le p} n^{-1}\sum_{i=1}^{n}|\xi_{ij}|^3\big]$, and $D_3 = n^{-1}\sum_{i=1}^{n} E\big[\max_{1\le j\le p}|\xi_{ij}|^3\,\mathbf{1}\big(\max_{1\le j\le p}|\xi_{ij}| > \delta\sqrt{n}/\log(p\vee n)\big)\big]$.

First consider case (a) in Assumption 3.2(ii). Combining the bounds for $D_1, D_2, D_3$ in Lemma B.1 in the Online Supplementary Material gives, for any $\delta > 0$,
$$P\Big(\Big|\max_{1\le j\le p}|t_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| > \delta\Big) \lesssim \frac{\log(p\vee n)}{\delta^2}\Big[\Big(\frac{(\max_K\zeta_K)^2\log p}{n}\Big)^{1/2} + \frac{(\max_K\zeta_K)^2\log p}{n^{1-2/q}}\Big] + \frac{\log^2(p\vee n)}{\delta^3}\Big[\Big(\frac{(\max_K\zeta_K)^2}{n}\Big)^{1/2} + \frac{(\max_K\zeta_K)^2\log p}{n^{3/2-2/q}}\Big] + \frac{\log^{q-1}(p\vee n)}{\delta^q}\frac{(\max_K\zeta_K)^q}{n^{q/2-1}} + \frac{\log n}{n}.$$
For $\gamma > 0$, by setting
$$\delta = \gamma^{-1/2}\Big(\frac{(\max_K\zeta_K)^2\log^2(p\vee n)}{n}\Big)^{1/4} + \gamma^{-1/2}\Big(\frac{(\max_K\zeta_K)^2\log(p\vee n)\log p}{n^{1-2/q}}\Big)^{1/2} + \gamma^{-1/3}\Big(\frac{(\max_K\zeta_K)^2\log^2(p\vee n)\log p}{n^{3/2-2/q}}\Big)^{1/3},$$
we have
$$P\Big(\Big|\max_{1\le j\le p}|t_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| > C_1\delta\Big) \le C_2\Big(\gamma + \frac{\log n}{n}\Big)$$
where $C_1, C_2$ are positive constants that depend only on $q$. If we take $\gamma = \gamma_n \to 0$ with $\gamma_n = \log(p\vee n)^{-1/2}$, then the above implies that there exists $\max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|$ such that
$$\Big|\max_{1\le j\le p}|t_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| = o_p\Big(\Big(\frac{(\max_K\zeta_K)^2\log^3(p\vee n)}{n}\Big)^{1/4} + \frac{(\max_K\zeta_K)\log^{3/4}(p\vee n)\log^{1/2} p}{n^{1/2-1/q}}\Big).$$
Next, consider case (b) in Assumption 3.2(ii). For any $\delta > 0$,
$$P\Big(\Big|\max_{1\le j\le p}|t_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| > \delta\Big) \lesssim \frac{\log(p\vee n)}{\delta^2}\Big[\Big(\frac{(\max_K\zeta_K)^2\log p}{n}\Big)^{1/2} + \frac{(\max_K\zeta_K)^2\log(pn)\log p}{n}\Big] + \frac{\log^2(p\vee n)}{\delta^3}\Big[\Big(\frac{(\max_K\zeta_K)^2}{n}\Big)^{1/2} + \frac{(\max_K\zeta_K)^2\log^2(pn)\log p}{n^{3/2}}\Big] + \frac{\log^2(p\vee n)}{\delta^3}\Big[n^{1/2}\Big(\frac{\delta n^{1/2}}{\log^2(p\vee n)} + (\max_K\zeta_K)\log p\Big)\exp\Big(-\frac{\delta\sqrt{n}}{C\max_K\zeta_K\log p\log(p\vee n)}\Big)\Big] + \frac{\log n}{n}$$
by Lemma B.1 in the Online Supplementary Material. Similarly, by setting
$$\delta = \max\Big\{\gamma^{-1/2}\big((\max_K\zeta_K)^2\log^2(p\vee n)/n\big)^{1/4},\; C\big((\max_K\zeta_K)^2\log^2(p\vee n)\log p/n\big)^{1/2}\Big\},$$
we have, for $\gamma > 0$,
$$P\Big(\Big|\max_{1\le j\le p}|t_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| > C_1\delta\Big) \le C_2\Big(\gamma + \frac{\log n}{n}\Big)$$
where $C_1, C_2$ are universal constants which do not depend on $n$. Here we use $\delta\sqrt{n}/(C\max_K\zeta_K\log p\log(p\vee n)) \ge \log(p\vee n)$. By taking $\gamma = \log(p\vee n)^{-1/2}$, there exists $\max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|$ such that
$$\Big|\max_{1\le j\le p}|t_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| = o_p\Big(\Big(\frac{(\max_K\zeta_K)^2\log^3(p\vee n)}{n}\Big)^{1/4} + \Big(\frac{(\max_K\zeta_K)^2\log^2(p\vee n)\log p}{n}\Big)^{1/2}\Big).$$
In either case (a) or (b), the above coupling inequality shows that there exists a sequence of random variables $\max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|$ such that $\big|\max_{K\in\mathcal{K}_n}|t_n(K,x)| - \max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|\big| = o_p(a_n)$, $a_n = 1/(\log p)^{1/2}$, under the rate conditions imposed in Theorem 3.1. Furthermore,
$$\Big|\max_{1\le j\le p}|T_n(K_j,x)| - \max_{1\le j\le p}|t_n(K_j,x)|\Big| \le \max_{1\le j\le p}|T_n(K_j,x) - t_n(K_j,x)| \le \max_{1\le j\le p}|R_1(K_j,x)| + \max_{1\le j\le p}|R_2(K_j,x)| + \max_{1\le j\le p}|\nu_n(K_j,x)| = o_p(a_n) \quad \text{(A.5)}$$
with $a_n = 1/(\log p)^{1/2}$ by Lemma 1 and the assumptions imposed in Theorem 3.1. We also have
$$\Big|\max_{1\le j\le p}|T_n(K_j,x)| - \max_{1\le j\le p}|\widehat{T}_n(K_j,x)|\Big| \le \max_{1\le j\le p}|T_n(K_j,x) - \widehat{T}_n(K_j,x)| \le \max_{1\le j\le p}|T_n(K_j,x)|\,\max_{1\le j\le p}\big|1 - V_n(K_j,x)^{1/2}/\widehat{V}_n(K_j,x)^{1/2}\big| = o_p(a_n) \quad \text{(A.6)}$$
where we use Lemma 1 and $\max_{1\le j\le p}|t_n(K_j,x)| \lesssim_P \sqrt{\log p}$ by the maximal inequality (e.g., Lemma A.4 in the Online Supplementary Material) and Assumption 3.2(iii) with $a_n = 1/(\log p)^{1/2}$. Combining (A.5) and (A.6) gives $\big|\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| - \max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|\big| = o_p(a_n)$ with $a_n = 1/(\log p)^{1/2}$.
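As a numerical sketch of this step — forming the vector of $t$-statistics across a grid of series terms and comparing its maximum with the maximum of a coupled Gaussian vector — the following snippet may help. The data-generating process, the polynomial basis, and the grid $\mathcal{K}_n = \{3,4,5,6\}$ are all hypothetical, and the true errors are used so that the standardization is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
n, x0, alpha = 500, 0.5, 0.05
Ks = [3, 4, 5, 6]                       # hypothetical grid K_n of series terms
x = rng.uniform(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + eps

def basis(t, K):
    # power-series basis (1, t, ..., t^{K-1}); splines would work the same way
    return np.vander(np.atleast_1d(t), K, increasing=True)

# xi[i, j] = P(K_j, x0)' Q_{K_j}^{-1} P_{K_j, i} eps_i / V(K_j, x0)^{1/2}
xi = np.empty((n, len(Ks)))
for j, K in enumerate(Ks):
    P = basis(x, K)
    Q_inv = np.linalg.inv(P.T @ P / n)
    w = P @ Q_inv @ basis(x0, K)[0]     # influence weights at x0
    V = np.mean((w * eps) ** 2)         # plug-in variance at x0 (errors known here)
    xi[:, j] = w * eps / np.sqrt(V)

t_vec = xi.sum(axis=0) / np.sqrt(n)     # (t_n(K_1,x0), ..., t_n(K_p,x0))
Sigma = xi.T @ xi / n                   # sample analogue of E[xi_i xi_i']
Z = rng.multivariate_normal(np.zeros(len(Ks)), Sigma, size=20_000)
gauss_max = np.max(np.abs(Z), axis=1)   # simulated draws of max_j |Z_j|
c_gauss = np.quantile(gauss_max, 1 - alpha)
```

By construction each $\xi_{ij}$ has unit sample variance, so the diagonal of `Sigma` equals one and the quantile `c_gauss` lies between the pointwise normal quantile and the Bonferroni quantile, reflecting the strong correlation of the $t$-statistics across nested $K$.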
Then there exists some sequence of positive constants $\delta_n$ such that $\delta_n = o(1)$ and $P\big(\big|\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| - \max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|\big| > a_n\delta_n\big) = o(1)$. For any $u \in \mathbb{R}$, we have
$$P\Big(\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| \le u\Big) \le P\Big(\Big\{\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| \le u\Big\} \cap \Big\{\Big|\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| \le a_n\delta_n\Big\}\Big) + P\Big(\Big|\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| - \max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big| > a_n\delta_n\Big)$$
$$\le P\Big(\max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big| \le u + a_n\delta_n\Big) + o(1) \le P\Big(\max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big| \le u\Big) + C a_n\delta_n E\Big[\max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big] + o(1)$$
where the last inequality uses the anti-concentration inequality (Lemma A.8 in the Online Supplementary Material). The reverse inequality holds by a similar argument, and thus
$$\sup_{u\in\mathbb{R}}\Big|P\Big(\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| \le u\Big) - P\Big(\max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big| \le u\Big)\Big| \lesssim a_n\delta_n E\Big[\max_{1\le j\le p}\Big|\sum_{i=1}^{n} Z_{ij}\Big|\Big] + o(1) = o(1)$$
where we use $E[\max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}|] \lesssim \sqrt{\log p}$ by the Gaussian maximal inequality and $a_n = (\log p)^{-1/2}$. Using the same arguments together with $\big|\max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}| - \max_{1\le j\le p}|\sum_{i=1}^{n} \widehat{Z}_{ij}|\big| = o_p(a_n)$, which follows from a Sudakov-Fernique type bound (e.g., Chatterjee (2005)) and Assumption 3.2(iii), we have $\sup_{u\in\mathbb{R}}\big|P(\max_{1\le j\le p}|\sum_{i=1}^{n}\widehat{Z}_{ij}| \le u) - P(\max_{1\le j\le p}|\sum_{i=1}^{n} Z_{ij}| \le u)\big| = o(1)$.
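In practice the critical value $\widehat{c}_{1-\alpha}(x)$ in (3.3) is the $(1-\alpha)$ quantile of the simulated Gaussian maximum $\max_{1\le j\le p}|\sum_{i=1}^n \widehat{Z}_{ij}|$. A minimal sketch, with a hypothetical estimated correlation matrix standing in for the plug-in covariance of the $t$-statistics across $K$:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05
# hypothetical estimated correlation matrix of (t_n(K_1,x), t_n(K_2,x), t_n(K_3,x));
# t-statistics at neighbouring K are typically strongly correlated
Sigma_hat = np.array([[1.0, 0.9, 0.8],
                      [0.9, 1.0, 0.9],
                      [0.8, 0.9, 1.0]])
Z = rng.multivariate_normal(np.zeros(3), Sigma_hat, size=200_000)
c_hat = np.quantile(np.max(np.abs(Z), axis=1), 1 - alpha)  # critical value c_{1-alpha}(x)
```

Because of the strong correlation, `c_hat` exceeds the pointwise normal quantile 1.96 but stays well below the Bonferroni quantile $z_{1-0.05/6} \approx 2.39$; this is the sense in which the proposed bands adapt to the dependence across specifications rather than paying the full union-bound price.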
Therefore, the following holds by the triangle inequality,
$$\sup_{u\in\mathbb{R}}\Big|P\Big(\max_{1\le j\le p}|\widehat{T}_n(K_j,x)| \le u\Big) - P\Big(\max_{1\le j\le p}\Big|\sum_{i=1}^{n}\widehat{Z}_{ij}\Big| \le u\Big)\Big| = o(1),$$
and then we conclude
$$P\Big(\max_{K\in\mathcal{K}_n}|\widehat{T}_n(K,x)| \le \widehat{c}_{1-\alpha}(x)\Big) = 1 - \alpha + o(1)$$
with the critical value $\widehat{c}_{1-\alpha}(x)$ given in (3.3), and the coverage result (3.5) follows.

Finally, we will show (3.6). For $\widehat{K} \in \mathcal{K}_n$, observe that
$$|\widehat{T}_n(\widehat{K},x)| \le \big(|t_n(\widehat{K},x)| + |R_1(\widehat{K},x)| + |R_2(\widehat{K},x)| + |\nu_n(\widehat{K},x)|\big)\,\big|V_n(\widehat{K},x)^{1/2}/\widehat{V}_n(\widehat{K},x)^{1/2}\big| \quad \text{(A.7)}$$
by the triangle inequality. Then,
$$P\Big(g(x) \in \Big[\widehat{g}_n(\widehat{K},x) \pm \widehat{c}_{1-\alpha}(x)\sqrt{\widehat{V}_n(\widehat{K},x)/n}\,\Big]\Big) \ge P\Big(|t_n(\widehat{K},x)| + |R_1(\widehat{K},x)| + |R_2(\widehat{K},x)| + |\nu_n(\widehat{K},x)| \le \widehat{c}_{1-\alpha}(x)\,\big|\widehat{V}_n(\widehat{K},x)^{1/2}/V_n(\widehat{K},x)^{1/2}\big|\Big)$$
$$\ge P\Big(|t_n(\widehat{K},x)| + |R_1(\widehat{K},x)| + |R_2(\widehat{K},x)| + |\nu_n(\widehat{K},x)| \le \widehat{c}_{1-\alpha}(x)(1 - a_n\delta_{1n})\Big) - \epsilon_{1n} \quad \text{(A.8)}$$
$$\ge P\Big(|t_n(\widehat{K},x)| \le \widehat{c}_{1-\alpha}(x)(1 - a_n\delta_{1n}) - a_n\delta_{2n} - a_n\delta_{3n}\Big) - \epsilon_{1n} - \epsilon_{2n} - \epsilon_{3n} \quad \text{(A.9)}$$
$$\ge P\Big(\max_{K\in\mathcal{K}_n}|t_n(K,x)| \le \widehat{c}_{1-\alpha}(x)(1 - a_n\delta_{1n}) - a_n\delta_{2n} - a_n\delta_{3n}\Big) - \epsilon_{1n} - \epsilon_{2n} - \epsilon_{3n} \quad \text{(A.10)}$$
$$\ge P\Big(\max_{1\le j\le p}\Big|\sum_{i=1}^{n}\widehat{Z}_{ij}\Big| \le \widehat{c}_{1-\alpha}(x) - \tilde{\delta}_n\Big) - \tilde{\epsilon}_n \quad \text{(A.11)}$$
$$\ge 1 - \alpha - \sup_{u} P\Big(\Big|\max_{1\le j\le p}\Big|\sum_{i=1}^{n}\widehat{Z}_{ij}\Big| - u\Big| \le \tilde{\delta}_n\Big) - \tilde{\epsilon}_n \ge 1 - \alpha - o(1). \quad \text{(A.12)}$$
The first inequality follows by (A.7), and (A.8) holds by Assumption 3.2(iii) with some sequences of positive constants $\delta_{1n} = o(1)$, $\epsilon_{1n} = o(1)$; (A.9) follows by $|R_1(\widehat{K},x)| + |R_2(\widehat{K},x)| = o_p(a_n)$ from Lemma 1 and the assumption $\big|\sqrt{n}\,r_n(\widehat{K},x)/V_n(\widehat{K},x)^{1/2}\big| = o(a_n)$ with $a_n = 1/(\log p)^{1/2}$ and some sequences of constants $\delta_{2n} = o(1)$, $\epsilon_{2n} = o(1)$, $\delta_{3n} = o(1)$, $\epsilon_{3n} = o(1)$. (A.10) follows by $|t_n(\widehat{K},x)| \le \max_{K\in\mathcal{K}_n}|t_n(K,x)|$, and (A.11) holds by $\big|\max_{K\in\mathcal{K}_n}|t_n(K,x)| - \max_{1\le j\le p}|\sum_{i=1}^{n}\widehat{Z}_{ij}|\big| = o_p(a_n)$ with some sequences $\delta_{4n} = o(1)$, $\epsilon_{4n} = o(1)$ and defining $\tilde{\delta}_n = \widehat{c}_{1-\alpha}(x)a_n\delta_{1n} + a_n\delta_{2n} + a_n\delta_{3n} + a_n\delta_{4n}$, $\tilde{\epsilon}_n = \epsilon_{1n} + \epsilon_{2n} + \epsilon_{3n} + \epsilon_{4n}$. Finally, (A.12) holds by Lemma A.8, $E[\max_{1\le j\le p}|\sum_{i=1}^{n}\widehat{Z}_{ij}|] \lesssim \sqrt{\log p}$ and $\tilde{\delta}_n\sqrt{\log p} = o(1)$ since $\widehat{c}_{1-\alpha}(x) \lesssim \sqrt{\log p}$ by Lemma A.15. This completes the proof. $\square$

A.2.2 Proof of Theorem 4.1
Proof.
Similar to the proof of Theorem 3.1, we have the following linearization of the $t$-statistic uniformly in $(K,x) \in \mathcal{K}_n \times \mathcal{X}$,
$$T_n(K,x) = t_n(K,x) + \nu_n(K,x) + R_n(K,x),$$
where $t_n(K,x) = n^{-1/2}\sum_{i=1}^{n}\tilde{P}(K,x)'\tilde{P}_{Ki}\varepsilon_i/V_n(K,x)^{1/2}$ and $R_n(K,x) = R_1(K,x) + R_2(K,x)$. Define $f_{n,K,x} : (\mathcal{E}\times\mathcal{X}) \mapsto \mathbb{R}$ for given $n \ge 1$, $K \in \mathcal{K}_n$, $x \in \mathcal{X}$,
$$f_{n,K,x}(\varepsilon, t) = \frac{\tilde{P}(K,x)'\tilde{P}(K,t)\,\varepsilon}{V_n(K,x)^{1/2}}, \quad (\varepsilon,t) \in \mathcal{E}\times\mathcal{X}, \quad \text{(A.13)}$$
and consider the class of measurable functions $\mathcal{F}_n = \{f_{n,K,x} : (K,x) \in \mathcal{K}_n\times\mathcal{X}\}$. Then, we consider the following empirical process,
$$\Big\{t_n(K,x) : (K,x) \in \mathcal{K}_n\times\mathcal{X}\Big\} = \Big\{n^{-1/2}\sum_{i=1}^{n} f_{n,K,x}(\varepsilon_i, x_i) : (K,x) \in \mathcal{K}_n\times\mathcal{X}\Big\},$$
which is indexed by the class of functions $\mathcal{F}_n$. Define $\alpha(K,x) \equiv \tilde{P}(K,x)/V_n(K,x)^{1/2} = \tilde{P}(K,x)/\|\Omega_K^{1/2}\tilde{P}(K,x)\|$. Note that $|f_{n,K,x}(\varepsilon,t)| = |\alpha(K,x)'\tilde{P}(K,t)\varepsilon| \le C|\varepsilon|\max_K\zeta_K$ for any $(K,x) \in \mathcal{K}_n\times\mathcal{X}$. We define the envelope function $F_n(\varepsilon,t) \equiv C|\varepsilon|\max_K\zeta_K \vee 1$. By Assumption 4.1, we have
$$|f_{n,K,x} - f_{n,K',x'}| = |\varepsilon|\,|\alpha(K,x)'\tilde{P}(K,t) - \alpha(K',x')'\tilde{P}(K',t)| \le |\varepsilon|\big[|\alpha(K,x)'\tilde{P}(K,t) - \alpha(K,x)'\tilde{P}(K',t)| + |\alpha(K,x)'\tilde{P}(K',t) - \alpha(K',x)'\tilde{P}(K',t)| + |\alpha(K',x)'\tilde{P}(K',t) - \alpha(K',x')'\tilde{P}(K',t)|\big] \le |\varepsilon|\,A\max_K\zeta_K\, L_n\big(\|x - x'\| + |K - K'|\big)$$
for all $x,x' \in \mathcal{X}$, $K,K' \in \mathcal{K}_n$, where $L_n$ collects the Lipschitz constants of $x \mapsto \tilde{P}(K,x)$ and $K \mapsto \tilde{P}(K,x)$ from Assumption 4.1. Therefore, the class of functions $\mathcal{F}_n = \{f_{n,K,x} : (K,x)\in\mathcal{K}_n\times\mathcal{X}\}$ is of VC type: there are constants $A, V > 0$ such that
$$\sup_Q N\big(\epsilon\|F_n\|_{L^2(Q)}, \mathcal{F}_n, L^2(Q)\big) \le (A L_n/\epsilon)^V, \quad 0 < \epsilon \le 1.$$
Then, using Theorem 2.1 in Chernozhukov et al. (2016) (Lemma A.9 in the Online Supplementary Material) with $B(f) = 0$, there exists a tight Gaussian process $G_n(f)$ in $\ell^\infty(\mathcal{F}_n)$ and $Z_n(K,x) = G_n(f_{n,K,x})$ in $\ell^\infty(\mathcal{K}_n\times\mathcal{X})$ with zero mean and covariance function (4.2), $E[G_n(f)G_n(f')] = \mathrm{Cov}(f_{n,K,x}(\varepsilon_i,x_i), f'_{n,K',x'}(\varepsilon_i,x_i))$, and a sequence of random variables $\widetilde{Z} \equiv \sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|Z_n(K,x)|$ such that, for every $\gamma \in (0,1)$,
$$P\Big(\Big|\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|t_n(K,x)| - \widetilde{Z}\Big| > C_1\delta_n\Big) \le C_2(\gamma + n^{-1}) \quad \text{(A.14)}$$
where $C_1, C_2$ are positive constants that depend only on $q$, and
$$\delta_n = \gamma^{-1/q} n^{-1/2+1/q}\max_K\zeta_K\log n + \gamma^{-1/3} n^{-1/6}(\max_K\zeta_K)^{1/3}\log^{2/3} n$$
by Assumption 4.1(iii) and assuming $\log n \le n$. By taking $\gamma = (\log n)^{-1/2}$, we have
$$\Big|\sup_{K,x}|t_n(K,x)| - \widetilde{Z}\Big| = o_p\big(n^{-1/2+1/q}\max_K\zeta_K\log^{1+1/(2q)} n + n^{-1/6}(\max_K\zeta_K)^{1/3}\log^{5/6} n\big).$$
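The object being coupled here — the supremum of the studentized process over $(K,x)$ — can be evaluated on a finite grid. A sketch with a hypothetical design (the statistic is infeasible since it centers at the true $g$, exactly as in the proof; the basis and grid of series terms are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
g = lambda t: np.sin(2 * np.pi * t)          # true regression function (known here)
x = rng.uniform(0.0, 1.0, n)
y = g(x) + 0.5 * rng.normal(0.0, 1.0, n)
Ks = [3, 4, 5, 6]                            # hypothetical grid of series terms
grid = np.linspace(0.1, 0.9, 41)             # grid of evaluation points in X

sup_T = 0.0
for K in Ks:
    P = np.vander(x, K, increasing=True)     # power-series basis (illustrative)
    Q_inv = np.linalg.inv(P.T @ P / n)
    beta = np.linalg.lstsq(P, y, rcond=None)[0]
    resid = y - P @ beta
    Pg = np.vander(grid, K, increasing=True)
    # heteroskedasticity-robust plug-in variance V_hat(K, x) on the grid:
    # V = p(x)' Q^{-1} Omega Q^{-1} p(x),  Omega = E[P P' eps^2]
    Omega = (P * (resid ** 2)[:, None]).T @ P / n
    A = Pg @ Q_inv
    V_hat = np.einsum('ij,jk,ik->i', A, Omega, A)
    T = np.sqrt(n) * np.abs(Pg @ beta - g(grid)) / np.sqrt(V_hat)
    sup_T = max(sup_T, T.max())              # sup over (K, x) of |T_n(K, x)|
```

The small-$K$ specifications contribute large values of the statistic through their bias, which is why undersmoothing (or the bias conditions on $r_n$) matters for the coverage results.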
Furthermore, $|R_1(K,x)| = o_p(a_n)$, $|R_2(K,x)| = o_p(a_n)$, and $|\nu_n(K,x)| = o_p(a_n)$ uniformly in $(K,x) \in \mathcal{K}_n\times\mathcal{X}$ with $a_n = 1/(\log n)^{1/2}$ by Lemma 2 and the rate conditions. Again, considering the class of functions $\mathcal{F}_n = \{f_{n,K,x} : (K,x)\in\mathcal{K}_n\times\mathcal{X}\}$, we have
$$E\Big[\sup_{K,x}|t_n(K,x)|\Big] \lesssim \sqrt{\log n} + (\max_K\zeta_K)^{q/(q-1)}\log n/\sqrt{n} \lesssim \sqrt{\log n}$$
by Lemma A.13 and Assumption 4.1(iii), and we have $\sup_{K,x}|t_n(K,x)| \lesssim_P \sqrt{\log n}$. Further, $\sup_{K,x}|Z_n(K,x)| \lesssim_P \sqrt{\log n}$ using Dudley's inequality (Corollary 2.2.8 in van der Vaart and Wellner (1996)), and using the same arguments given in Theorem 3.1, we have $\big|\sup_{K,x}|\widehat{T}_n(K,x)| - \widetilde{Z}\big| = o_p(a_n)$ with $a_n = 1/(\log n)^{1/2}$ and
$$\sup_{u\in\mathbb{R}}\Big|P\Big(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}_n(K,x)| \le u\Big) - P(\widetilde{Z} \le u)\Big| = o(1). \quad \text{(A.15)}$$
Next, we consider the following (infeasible) bootstrap process,
$$T^e_n(K,x) = \frac{\sqrt{n}\,(\widehat{g}^e_n(K,x) - \widehat{g}_n(K,x))}{V_n(K,x)^{1/2}}, \quad (K,x)\in\mathcal{K}_n\times\mathcal{X},$$
where $\widehat{g}^e_n(K,x) = \tilde{P}(K,x)'\widehat{\beta}^e_K$, $\widehat{\beta}^e_K$ is defined in (4.4) with $\tilde{P}(K,x_i)$, and the $e_i$ are i.i.d. standard exponential random variables independent of $X^n = \{x_1, ..., x_n\}$.
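A minimal sketch of this exponential-weight bootstrap (illustrative basis and design; an unstudentized sup band is computed for brevity, whereas the paper's statistic divides by $V_n(K,x)^{1/2}$ and takes the sup over $K$ as well):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, B = 400, 5, 500
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0, n)
P = np.vander(x, K, increasing=True)            # illustrative power-series basis
beta_hat = np.linalg.lstsq(P, y, rcond=None)[0]
grid = np.linspace(0.05, 0.95, 50)
Pg = np.vander(grid, K, increasing=True)
g_hat = Pg @ beta_hat

sup_draws = np.empty(B)
for b in range(B):
    e = rng.exponential(1.0, n)                 # i.i.d. Exp(1): mean 1, variance 1
    Pe = P * e[:, None]
    # weighted least squares: beta^e = (P' E P)^{-1} P' E y,  E = diag(e)
    beta_e = np.linalg.solve(Pe.T @ P, Pe.T @ y)
    sup_draws[b] = np.max(np.abs(Pg @ beta_e - g_hat))
c_band = np.quantile(sup_draws, 0.95)           # half-width of a sup band (unstudentized)
```

Because the weights have mean one and variance one, $\sqrt{n}(\widehat\beta^e_K - \widehat\beta_K)$ mimics the sampling fluctuation of $\sqrt{n}(\widehat\beta_K - \beta_K)$ conditional on the data, which is what the coupling argument for $t^e_n$ formalizes.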
Then, we have
$$T^e_n(K,x) = \frac{\sqrt{n}\,(\widehat{g}^e_n(K,x) - g(x))}{V_n(K,x)^{1/2}} - \frac{\sqrt{n}\,(\widehat{g}_n(K,x) - g(x))}{V_n(K,x)^{1/2}} = t^e_n(K,x) + R^e_n(K,x) - R_n(K,x)$$
where $t^e_n(K,x) = n^{-1/2}\sum_{i=1}^{n}(e_i - 1)f_{n,K,x}(\varepsilon_i, x_i)$, $R^e_n(K,x) = R^e_1(K,x) + R^e_2(K,x)$, and $R^e_1(K,x)$, $R^e_2(K,x)$ are defined the same as in Lemma 1 with the rescaled data $\{(\sqrt{e_i}\,\tilde{P}(K,x_i), \sqrt{e_i}\,\varepsilon_i)\}_{i=1}^{n}$. Note that $\widehat{\beta}^e_K$ is the weighted least squares estimator for the original data, and we can extend the uniform linearization results in Lemma 2 by replacing $\zeta_K$ with $\zeta^e_K = \zeta_K\log^{1/2} n$ and noting that $E[e_i] = 1$, $\mathrm{Var}(e_i) = 1$, and $\max_{1\le i\le n}|e_i| = O_p(\log n)$.

By applying Theorem 2.1 in Chernozhukov et al. (2016) to the weighted bootstrap process $t^e_n(K,x)$, there exists a random variable $\widetilde{Z}^e \stackrel{d|X^n}{=} \sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|Z_n(K,x)|$ such that, for every $\gamma \in (0,1)$,
$$P\Big(\Big|\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|t^e_n(K,x)| - \widetilde{Z}^e\Big| > C_1\delta_n\Big) \le C_2(\gamma + n^{-1}) \quad \text{(A.16)}$$
where $C_1, C_2$ are positive constants that depend only on $q$,
$$\delta_n = \gamma^{-1/q} n^{-1/2+1/q}\max_K\zeta_K\log n + \gamma^{-1/3} n^{-1/6}(\max_K\zeta_K)^{1/3}\log n,$$
and $\stackrel{d|X^n}{=}$ denotes that the two random variables have the same conditional distribution given $X^n$. Further,
$$\Big|\sup_{K,x}|\widehat{T}^e_n(K,x)| - \sup_{K,x}|t^e_n(K,x)|\Big| \le \sup_{K,x}|\widehat{T}^e_n(K,x) - T^e_n(K,x)| + \sup_{K,x}|T^e_n(K,x) - t^e_n(K,x)| = o_p(a_n)$$
by using $E\big[\sup_{K,x}|t^e_n(K,x)|\big] \le \max_{1\le i\le n}|e_i|\,E\big[\sup_{K,x}|t_n(K,x)|\big] \lesssim_P \log^{3/2} n$, Assumption 4.1(iv), and $|R^e_n(K,x)| = o_p(a_n)$, $|R_n(K,x)| = o_p(a_n)$ uniformly in $(K,x)\in\mathcal{K}_n\times\mathcal{X}$ under the rate conditions in Assumption 4.1(ii) with $a_n = 1/(\log n)^{1/2}$.
Then, there exist some sequences of positive constants $\delta_{1n}, \delta_{2n}$ such that $\delta_{1n} = o(1)$, $\delta_{2n} = o(1)$, and
$$P\Big(\Big|\sup_{K,x}|\widehat{T}^e_n(K,x)| - \sup_{K,x}|t^e_n(K,x)|\Big| > a_n\delta_{1n}\Big) \le \delta_{2n}. \quad \text{(A.17)}$$
Combining (A.16) and (A.17) gives
$$P\Big(\Big|\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}^e_n(K,x)| - \widetilde{Z}^e\Big| > a_n\delta_{1n} + C_1\delta_n\Big) \le C_2(\gamma + n^{-1}) + \delta_{2n}. \quad \text{(A.18)}$$
By Markov's inequality, the following is deduced from (A.18): for every $\nu \in (0,1)$,
$$P\Big(\Big|\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}^e_n(K,x)| - \widetilde{Z}^e\Big| > a_n\delta_{1n} + C_1\delta_n \,\Big|\, X^n\Big) \le \nu^{-1}\big(C_2(\gamma + n^{-1}) + \delta_{2n}\big) \quad \text{(A.19)}$$
with probability at least $1 - \nu$. A similar derivation as in Theorem 3.1 using Lemma A.14 gives
$$\sup_{u\in\mathbb{R}}\Big|P\Big(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}^e_n(K,x)| \le u \,\Big|\, X^n\Big) - P(\widetilde{Z} \le u)\Big| \lesssim (a_n\delta_{1n} + C_1\delta_n)\sqrt{\log n} + \nu^{-1}\big(C_2(\gamma + n^{-1}) + \delta_{2n}\big) \quad \text{(A.20)}$$
with probability at least $1 - \nu$, where we use $\widetilde{Z}^e \stackrel{d|X^n}{=} \widetilde{Z}$ and $E[\sup_{K,x}|Z_n(K,x)|] \lesssim \sqrt{\log n}$. By taking $\gamma = (\log n)^{-1/2}$ and $\nu = \nu_n \to 0$ slowly enough that $\nu_n^{-1}\big((\log n)^{-1/2} \vee \delta_{2n}\big) = o(1)$, and using $\delta_n = o(a_n)$ and the rate conditions imposed in the theorem, the right-hand side of (A.20) is $o_p(1)$. Combining this with (A.15),
$$\sup_{u\in\mathbb{R}}\Big|P\Big(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}^e_n(K,x)| \le u \,\Big|\, X^n\Big) - P\Big(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}_n(K,x)| \le u\Big)\Big| = o_p(1). \quad \text{(A.21)}$$
Then, the coverage result (4.7) follows. The second part of the theorem, (4.8), can be derived similarly as in the proof of Theorem 3.1, and this completes the proof. $\square$

A.2.3 Proof of Theorem 5.1
Proof.
Conditional on $X = [x_1, \cdots, x_n]'$, the following decomposition holds for any sequence $K \in \mathcal{K}_n$:
$$\sqrt{n}\,(\widehat{\theta}_n(K) - \theta_0) = \widehat{\Gamma}_n(K)^{-1} S_n(K), \quad \widehat{\Gamma}_n(K) = \frac{1}{n} W'M_K W, \quad S_n(K) = \frac{1}{\sqrt{n}} W'M_K(g + \varepsilon)$$
where $g = [g_1, \cdots, g_n]'$, $g_i = g(x_i)$, $g^w = [g^w_1, \cdots, g^w_n]'$, $g^w_i = g^w(x_i) = E[w_i|x_i]$, and $v = [v_1, \cdots, v_n]'$. All remaining arguments involve conditional expectations (conditioning on $X$) and hold almost surely (a.s.). Under Assumption 5.2,
$$\widehat{\Gamma}_n(K) = \Gamma_n(K) + o_p(1), \quad \Gamma_n(K) = \frac{1}{n}\sum_{i=1}^{n} M_{K,ii}\, E[v_i^2\,|\,x_i]$$
by Lemma 1 of Cattaneo, Jansson, and Newey (2018a). Moreover,
$$S_n(K) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} M_{K,ii}\, v_i\varepsilon_i - \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\sum_{j=1,\, j\neq i}^{n} P_{K,ij}\, v_i\varepsilon_j + o_p(1),$$
and the leading terms form a martingale difference array $\{y_{in}\}$ with conditional variance $s_n^2(X)$, as in the proof of Lemma A2 in Chao et al. (2012), where $y_{in} = y_{1,in} + y_{2,in}$ and each $y_{k,in}$ decomposes into $\omega_{k,in}$ and $\bar{y}_{k,in}$ terms. It then suffices to show that, for any $\delta > 0$,
$$P\Big(\Big|\sum_{i=2}^{n} E[y_{in}^2 \mid \mathcal{X}_1, ..., \mathcal{X}_{i-1}, X] - s_n^2(X)\Big| \ge \delta \,\Big|\, X\Big) \to 0.$$
Decompose
$$\sum_{i=2}^{n} E[y_{in}^2 \mid \mathcal{X}_1, ..., \mathcal{X}_{i-1}, X] - s_n^2(X) = \sum_{i=2}^{n} E[y_{1,in}^2 \mid \mathcal{X}_1, ..., \mathcal{X}_{i-1}, X] - \sum_{i=2}^{n}\big(E[\omega_{1,in}^2 \mid X] + E[\bar{y}_{1,in}^2 \mid X]\big) \quad \text{(A.23)}$$
$$+ \sum_{i=2}^{n} E[y_{2,in}^2 \mid \mathcal{X}_1, ..., \mathcal{X}_{i-1}, X] - \sum_{i=2}^{n}\big(E[\omega_{2,in}^2 \mid X] + E[\bar{y}_{2,in}^2 \mid X]\big) \quad \text{(A.24)}$$
$$+ 2\Big(\sum_{i=2}^{n}\big(E[\omega_{1,in}\omega_{2,in} \mid \mathcal{X}_1, ..., \mathcal{X}_{i-1}, X] - E[\omega_{1,in}\omega_{2,in} \mid X]\big) \quad \text{(A.25)}$$
$$+ \sum_{i=2}^{n} E[\omega_{1,in}\bar{y}_{2,in} + \omega_{2,in}\bar{y}_{1,in} \mid \mathcal{X}_1, ..., \mathcal{X}_{i-1}, X] + \sum_{i=2}^{n}\big(E[\bar{y}_{1,in}\bar{y}_{2,in} \mid \mathcal{X}_1, ..., \mathcal{X}_{i-1}, X] - E[\bar{y}_{1,in}\bar{y}_{2,in} \mid X]\big)\Big). \quad \text{(A.26)}$$
(A.23) and (A.24) converge to 0 a.s. by the proof of Lemma A2 in Chao et al. (2012). Moreover, it is straightforward to verify that (A.25) and (A.26) converge to 0 a.s. since $P_{K_1,ij}P_{K_2,ij} \le P_{K_1,ij}^2 \vee P_{K_2,ij}^2$, $K_1 \asymp K_2$, and by closely following the proof of Lemma A2 in Chao et al. (2012). Then we can apply the martingale central limit theorem and deduce $Y_n \stackrel{d}{\to} N(0,1)$ using similar arguments to the proof of Lemma A2 in Chao et al. (2012). The coverage results (5.6) and (5.7) follow by the joint convergence of $\widehat{T}_n(K,\theta_0)$ with $\max_{K\in\mathcal{K}_n}|\widehat{V}_n(K)/V_n(K) - 1| = o_p(1)$, $\|\widehat{\Sigma}_n - \Sigma_n\| = o_p(1)$ as $n, K \to \infty$ under the assumptions imposed in Theorem 5.1 and the Slutsky theorem. This completes the proof. $\square$

Appendix B Figures and Tables

[Figure 1 here; panels: $g_1(x)$, $g_2(x)$, $g_3(x)$.]

Figure 1: Different functions $g(x)$ used in the simulations (Section 6). Solid lines (black) are $g_1(x) = \ln(|x - 1/2| + 1)\,\mathrm{sgn}(x - 1/2)$ and $g_2(x) = \sin(7\pi x/2)/[1 + 2x^2(\mathrm{sgn}(x) + 1)]$; dotted lines (blue) are $g_3(x) = x - 1/2 + \phi(10(x - 1/2))$, where $\phi(\cdot)$ is the standard normal pdf.

Table 1: Coverage and Length of Nominal 95% CIs and CBs - Splines

                                          Pointwise                                  Uniform
                          x = 0.…     x = 0.…     x = 0.…     x = 0.…
                          COV   AL    COV   AL    COV   AL    COV   AL    COV   AL
Model 1: $g_1(x) = \ln(|x - 1/2| + 1)\,\mathrm{sgn}(x - 1/2)$
  Robust ($\widehat{K}_{cv}$)   0.98  0.37  0.98  0.46  0.96  1.14  0.95  1.76  0.97  1.33
  Robust ($\widehat{K}_{cv+}$)  0.98  0.51  0.98  0.49  0.98  1.51  0.97  2.08  0.98  1.42
Model 2: $g_2(x) = \sin(7\pi x/2)/[1 + 2x^2(\mathrm{sgn}(x) + 1)]$
  Standard                      0.80  0.28  0.93  0.36  0.91  0.92  0.92  1.49  0.27  0.69
  Robust ($\widehat{K}_{cv}$)   0.93  0.37  0.97  0.46  0.96  1.14  0.95  1.76  0.96  1.33
  Robust ($\widehat{K}_{cv+}$)  0.98  0.51  0.98  0.49  0.98  1.51  0.97  2.08  0.98  1.42
Model 3: $g_3(x) = x - 1/2 + \phi(10(x - 1/2))$
  Robust ($\widehat{K}_{cv}$)   0.88  0.39  0.74  0.50  0.96  1.23  0.95  1.85  0.75  1.35
  Robust ($\widehat{K}_{cv+}$)  0.98  0.52  0.92  0.53  0.98  1.52  0.97  2.06  0.97  1.44

Notes: "Pointwise" reports coverage (COV) and average length (AL) of (1) the standard 95% CI with $\widehat{K}_{cv} \in \mathcal{K}_n$; (2) the robust CI with $\widehat{K}_{cv}$; (3) the robust CI with $\widehat{K}_{cv+}$. "Uniform" reports analogous uniform inference results for confidence bands. $\widehat{K}_{cv}$ is selected to minimize leave-one-out cross-validation and $\widehat{K}_{cv+} = \widehat{K}_{cv} + 2$. Quadratic spline regressions with evenly placed knots are used.
Table 2: Wage Elasticity Estimates of the Expected Labor Supply

Columns report the cross-validation criterion (CV), the estimate $\widehat{E}_w(K)$, its standard error $SE_{\widehat{E}_w}$, and the confidence interval $CI_{\widehat{E}_w}(K)$ for nested series specifications built from $1$, $y^J$, $w^J$ and differences such as $\Delta y$, $\Delta w$, $\ell\Delta y$, $\ell\Delta w$, $\ell\Delta yw$, together with interactions $y^J w^J$. With critical value $\widehat{c}_{1-\alpha}(x) = 2.$…, the uniform intervals are $CI^{\sup}_{\widehat{E}_w}(\widehat{K}_{cv}) = [0.$…$, 0.$…$]$, $CI^{\sup}_{\widehat{E}_w}(\widehat{K}_{cv+}) = [0.$…$, 0.$…$]$, and $CI^{\sup}_{\widehat{E}_w}(\widehat{K}_{cv++}) = [0.$…$, 0.$…$]$.

Notes: $y$: non-labor income, $w$: marginal wage rates, $\ell$: the end point of the segment in a piecewise linear budget set. $\ell^m\Delta y^p w^q$ denotes $\sum_j \ell_j^m (y_j^p w_j^q - y_{j+1}^p w_{j+1}^q)$. CV denotes the cross-validation criterion defined in Blomquist and Newey (2002, p.2464). $\widehat{K}_{cv} = K_5$, the 5th smallest model, is chosen by the cross-validation, and let $\widehat{K}_{cv+} = K_6$, $\widehat{K}_{cv++} = K_7$. $CI^{\sup}_{\widehat{E}_w}(K) = \widehat{E}_w(K) \pm \widehat{c}_{1-\alpha}(x)\, SE_{\widehat{E}_w}(K)$, $CI_{\widehat{E}_w}(K) = \widehat{E}_w(K) \pm z_{1-\alpha/2}\, SE_{\widehat{E}_w}(K)$.

[Figure 2 here; x-axis: number of series terms $K_1, K_2, \ldots$; legend: CV (cross-validation), point estimate, robust CI, standard CI.]
Figure 2 plots the wage elasticity estimates of the expected labor supply, as in Table 2, with standard pointwise 95% CIs as well as CIs that are uniform in $K \in \mathcal{K}_n$, constructed with the critical value $\widehat{c}_{1-\alpha}(x)$.
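The leave-one-out cross-validation criterion used to select $\widehat{K}_{cv}$ in Tables 1-2 and Figure 2 can be computed without refitting, via the standard hat-matrix identity $CV(K) = n^{-1}\sum_i \{(y_i - \widehat{g}(x_i))/(1 - h_{ii})\}^2$. A sketch with a hypothetical design (power-series basis; the paper's application uses the specifications of Blomquist and Newey (2002) instead):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.5 * rng.normal(0.0, 1.0, n)

def loo_cv(K):
    P = np.vander(x, K, increasing=True)
    H = P @ np.linalg.solve(P.T @ P, P.T)   # hat matrix with K series terms
    resid = y - H @ y
    # leave-one-out residuals via the hat-matrix shortcut
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

Ks = list(range(2, 9))
cv = {K: loo_cv(K) for K in Ks}
K_cv = min(cv, key=cv.get)                  # \hat K_cv minimizes the criterion
K_cv_plus = K_cv + 2                        # the "undersmoothed" choice from Table 1
```

Since cross-validation targets estimation risk rather than coverage, the paper's robust intervals remain valid at $\widehat{K}_{cv}$ (and at the enlarged $\widehat{K}_{cv+}$) precisely because they are built to be uniform over the whole grid of candidate $K$.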