Powerful Inference∗

Xiaohong Chen† (Yale University and Cowles Foundation), Sokbae Lee‡ (Columbia University and Institute for Fiscal Studies), Myung Hwan Seo§ (Seoul National University and Institute of Economic Research)

August 26, 2020
Abstract
We develop an inference method for a (sub)vector of parameters identified by conditional moment restrictions, which are implied by economic models such as rational behavior and Euler equations. Building on Bierens (1990), we propose penalized maximum statistics and combine bootstrap inference with model selection. Our method is optimized to be powerful against a set of local alternatives of interest by solving a data-dependent max-min problem for tuning parameter selection. We demonstrate the efficacy of our method by a proof of concept using two empirical examples: rational unbiased reporting of ability status and the elasticity of intertemporal substitution.
Keywords: conditional moments, economic restrictions, penalization, regularization, multiplier bootstrap, max-min.

∗ This work was supported in part by the European Research Council (ERC-2014-CoG-646917-ROMIA) and by the UK Economic and Social Research Council (ESRC) through research grant (ES/P008909/1) to the CeMMAP. Part of this research was carried out when Seo was visiting Cowles Foundation in 2018/2019 and was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2018S1A5A2A01033487).
† [email protected]  ‡ [email protected]  § [email protected]

1 Introduction
Conditional moment restrictions are ubiquitous in economics. There is now a mature literature on estimation and inference with conditional moment restrictions. A standard approach in the literature is first to estimate unknown parameters and second to develop suitable test statistics based on the estimators. In this paper, we take a different path and aim to carry out inference directly, skipping the first step of estimating the parameters of interest. To convey our main point succinctly, we focus on a simple version of conditional moment restrictions, namely Chamberlain (1987)'s model:

E[g(X_i, θ) | W_i] = 0 a.s. if and only if θ = θ₀,   (1)

where g(x, θ) is a function known up to θ₀, which is a finite-dimensional parameter.

Arguably, the most challenging data scenario in applying (1) is when the dimension (p) of W_i ∈ R^p is high, due to the curse of dimensionality. For example, Benítez-Silva, Buchinsky, Chan, Cheidvasser, and Rust (2004) consider testing a conditional moment restriction with a dataset of sample size n ≈ 350 and p ≈ 20. They examine whether a self-reported disability status is an unbiased indicator of the Social Security Administration (SSA)'s disability award decision. They fail to reject the null hypothesis of rational unbiased reporting of ability status with a battery of parametric and nonparametric tests; one may ask, however, whether this is driven by a relatively large number of conditioning variables given the sample size. This is the motivation of our paper. Since p = 20 is much smaller than n = 350, we do not assume that p grows with n. Nonetheless, it is still demanding to build an inference method by conditioning on double-digit covariates in a fully nonparametric fashion.

To construct a confidence set for θ₀ or its subvector in a data scenario similar to the one mentioned above, we propose to invert the Bierens (1990) test in conjunction with a method of penalization.
The original Bierens test is designed to test a functional form of nonlinear regression models and has subsequently been extended to time series (de Jong, 1996) and to a more general form (Stinchcombe and White, 1998), among other things. There are different specification tests based on conditional moment restrictions in the literature (e.g., Bierens, 1982; Fan and Li, 1996; Andrews, 1997; Bierens and Ploberger, 1997; Fan and Li, 2000; Horowitz and Spokoiny, 2001). In a recent working paper, Antoine and Lavergne (2020) leverage Bierens (1982)'s integrated conditional moment statistic to develop an inference procedure in a linear instrumental variable model. In this paper, we modify Bierens (1990)'s maximum statistic to an ℓ₁-penalized maximum statistic. Our idea of ℓ₁-penalization resembles the LASSO (Tibshirani, 1996); however, its use is fundamentally distinct.

In the LASSO, parameter estimation and model selection are combined to improve prediction accuracy. However, ℓ₁-penalized estimators are irregular, and inference based on them requires careful treatment (see, e.g., Taylor and Tibshirani (2015) among others). Our proposal of an ℓ₁-penalized Bierens maximum statistic is motivated by the following research questions:

• Can we make use of penalization without distorting inference?
• Can we optimize a model selection procedure to improve the power of a test?

Our solution to these questions in the context of conditional moment models is to carry out inference directly based on the Bierens (1990) test without estimating θ₀. That is, we propose to combine inference with model selection to improve the power of the Bierens test. Our penalized inference method is asymptotically valid when the null hypothesis is true and can be optimized to be powerful against a local alternative of interest. Furthermore, the penalized test statistic is easier to compute than the one without penalization.
The computational gains from penalization are practically important since the p-value is constructed by a multiplier bootstrap procedure. The penalization tuning parameter is selected by solving a data-dependent max-min problem.

We demonstrate the usefulness of our method by a proof of concept. First, we revisit the test of Benítez-Silva, Buchinsky, Chan, Cheidvasser, and Rust (2004) and show that our method yields a rejection of the null hypothesis at the conventional level, in contrast to the original analysis. Second, we revisit Yogo (2004) and find that an uninformative confidence interval for the elasticity of intertemporal substitution, based on annual US series (n ≈ 105 and p = 4), can turn into an informative one. Both empirical examples suggest that there is substantive evidence for the efficacy of our proposed method.

The remainder of the paper is organized as follows. In Section 2, we define the test statistic and describe how to obtain bootstrap p-values. Section 3 establishes bootstrap validity. In Section 4, we derive consistency and local power and propose how to calibrate the penalization parameter to optimize the power of the test. In Section 5, we extend our method to subvector inference for θ₁, for which we use plug-in estimation of θ₂, where θ = (θ₁, θ₂). Sections 6 and 7 give two empirical examples and Section 8 gives concluding remarks. All the proofs are in Section A.

2 Test Statistic and Bootstrap Inference

In this section, we introduce our test statistic and describe how to carry out bootstrap inference. Let ‖a‖ denote the Euclidean norm of a vector a. Before we present our test statistic, we first assume the following conditions.

Assumption 1. (i) Θ is a compact subset of R^d.
(ii) E[g(X_i, θ) | W_i] = 0 a.s. if and only if θ = θ₀, where θ₀ ∈ Θ.
(iii) E[‖g(X_i, θ)‖] < ∞ for each θ ∈ Θ.
(iv) W_i is a bounded random vector in R^p.
The boundedness assumption on W_i is without loss of generality since we can take a one-to-one transformation to ensure that each component of W_i is bounded (for instance, x ↦ tan⁻¹(x) componentwise, as used in Bierens (1990)). Define

M(θ, γ) := E[g(X_i, θ) exp(W_i'γ)].   (2)

Let µ(·) denote the Lebesgue measure on R^p. Bierens (1990) established the following result.

Lemma 1 (Bierens (1990)). Let Assumption 1 hold. Then, M(θ, γ) = 0 under θ = θ₀ and M(θ, γ) ≠ 0 a.e. under θ ≠ θ₀. That is, µ{γ ∈ R^p : M(θ, γ) = 0} is either zero or infinite.

To minimize the notational complexity, we often abbreviate M(γ) := M(θ₀, γ) throughout this paper. In order to test the null hypothesis H₀ : θ = θ₀ against the alternative hypothesis H₁ : θ ≠ θ₀, we construct a test statistic as follows. Define U_i(θ) := g(X_i, θ) and, as before, write U_i := g(X_i, θ₀). We start with the case that the dimension of U_i is one. Define

M_n(γ) := (1/n) Σ_{i=1}^n U_i exp(W_i'γ),
s_n²(γ) := (1/n) Σ_{i=1}^n [U_i exp(W_i'γ)]²,
Q_n(γ) := n [M_n(γ)]² / s_n²(γ).   (3)

Note that U_i exp(W_i'γ) is a centered random variable under the null hypothesis H₀. Define the test statistic as

T_n(λ) := max_{γ∈Γ} [ √Q_n(γ) − λ‖γ‖₁ ],   (4)

where ‖a‖₁ is the ℓ₁ norm of a vector a, Γ is a compact subset of R^p, and λ ≥ 0 is the penalization parameter. We regard T_n(λ) as a stochastic process indexed by λ ∈ Λ, where Λ is a compact subset of R₊ := {λ ∈ R | λ ≥ 0}.

We consider the multiplier bootstrap to carry out inference. Define

M_{n,*}(γ) := (1/n) Σ_{i=1}^n η*_i U_i exp(W_i'γ),
s_{n,*}²(γ) := (1/n) Σ_{i=1}^n [η*_i U_i exp(W_i'γ)]²,
Q_{n,*}(γ) := n [M_{n,*}(γ)]² / s_{n,*}²(γ),   (5)

where η*_i is drawn from N(0, 1) and independent from the data {(X_i, W_i) : i = 1, ..., n}. For each bootstrap replication r, let

T^(r)_{n,*}(λ) := max_{γ∈Γ} [ √(Q^(r)_{n,*}(γ)) − λ‖γ‖₁ ].   (6)

For each λ, the bootstrap p-value is defined as

p*(λ) := (1/R) Σ_{r=1}^R 1{T^(r)_{n,*}(λ) > T_n(λ)}

for a large R. We reject the null hypothesis at the α level if and only if p*(λ) < α. Then, a bootstrap confidence interval for θ₀ can be constructed by inverting a pointwise test of H₀ : θ = θ₀.

It is straightforward to extend our method to multiple conditional moment restrictions, although there may not be a unique way of doing so. For example, suppose that U_i = (U_i^(1), ..., U_i^(J))' is a J × 1 vector. Let Q_n^(j)(γ) denote Q_n(γ) in (3) using U_i^(j) for j = 1, ..., J. Then, we may generalize the test statistic in (4) by

T_n(λ) := max_{γ∈Γ} [ (Σ_{j=1}^J Q_n^(j)(γ))^{1/2} − λ‖γ‖₁ ].   (7)

Alternatively, we may take a more general quadratic form for the first term inside the brackets in (7). For simplicity, in what follows, we focus on the case that U_i is a scalar.

Define K(γ₁, γ₂) := E[U_i² exp(W_i'(γ₁ + γ₂))] and s²(γ) := E[U_i² exp(2W_i'γ)]. We make the following regularity assumptions.

Assumption 2. (i) Γ is a compact subset of R^p.
(ii) {X_i, W_i} is a strictly stationary, β-mixing sequence, where W_i is adapted to the filtration F_{i−1}, the mixing coefficient β_k satisfies k^{c/(c−2)} (log k)^{2(c−1)/(c−2)} β_k → 0, and {U_i} is a martingale difference sequence (mds) with 0 < E|U_i|^{2c} < ∞, for some c > 2.
(iii) Λ is a compact subset of R₊ := {λ ∈ R | λ ≥ 0}.

Assumption 2(ii) is standard and ensures the weak convergence of √n M_n(γ) and s_n²(γ); see, e.g., Arcones and Yu (1994) for the formal definition of β-mixing and a functional central limit theorem under β-mixing.
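As a concrete numerical illustration, the statistic in (3)-(4) and the bootstrap p-value in (5)-(6) can be sketched in a few lines. The sketch below is in Python with NumPy and is not the authors' code: the maximization over Γ is approximated by a finite, user-supplied candidate set `gammas`, and the function name `penalized_bierens_test` is illustrative.

```python
import numpy as np

def penalized_bierens_test(U, W, lam, gammas, R=199, seed=0):
    """Sketch of the penalized Bierens test: T_n(lam) in (4) with the
    multiplier-bootstrap p-value of (5)-(6).  The max over Gamma is
    approximated by the finite candidate set `gammas`."""
    n = len(U)

    def stat(u):
        best = -np.inf
        for g in gammas:
            e = np.exp(W @ g)                  # exp(W_i' gamma)
            M = np.mean(u * e)                 # M_n(gamma)
            s2 = np.mean((u * e) ** 2)         # s_n^2(gamma)
            val = np.sqrt(n * M ** 2 / s2) - lam * np.abs(g).sum()
            best = max(best, val)              # sqrt(Q_n) minus l1 penalty
        return best

    T = stat(U)
    rng = np.random.default_rng(seed)
    # multiplier bootstrap: eta*_i ~ N(0,1) multiplies each summand
    T_star = np.array([stat(rng.standard_normal(n) * U) for _ in range(R)])
    return T, np.mean(T_star > T)              # statistic and p-value p*(lam)
```

In practice the candidate set would be replaced by a numerical optimizer over Γ (Section 6 uses particle swarm optimization); the grid version only illustrates the structure of the statistic.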
The boundedness of W_i and Γ, together with the moment condition on U_i, imply that sup_{(γ₁,γ₂)∈Γ×Γ} K(γ₁, γ₂) < ∞ and inf_{γ∈Γ} s²(γ) > 0.

Let {M(γ) : γ ∈ Γ} be a centered Gaussian process with the covariance kernel E[M(γ₁)M(γ₂)'] = K(γ₁, γ₂). Also, let ⇒ denote weak convergence in the space of uniformly bounded functions on the parameter space, endowed with the uniform metric.

We first establish the weak convergence of T_n(λ).

Theorem 1.
Let Assumptions 1 and 2 hold. Then,

√n M_n(γ) ⇒ M(γ),   (8)
s_n²(γ) →p s²(γ) uniformly in Γ.   (9)

Furthermore,

T_n(λ) ⇒ max_{γ∈Γ} [ |M(γ)| / s(γ) − λ‖γ‖₁ ].

We now show that the bootstrap analog T_{n,*}(λ) of T_n(λ) converges weakly to the same limit.

Theorem 2.
Let Assumptions 1 and 2 hold. Then,

√n M_{n,*}(γ) ⇒ M(γ),   (10)
s_{n,*}²(γ) →p s²(γ) uniformly in Γ.   (11)

Furthermore,

T_{n,*}(λ) ⇒ max_{γ∈Γ} [ |M(γ)| / s(γ) − λ‖γ‖₁ ] a.s.

Theorems 1 and 2 imply that the bootstrap critical values are valid for any converging sequence λ_n. See the proof of Theorem 4 for more details.

4 Consistency, Local Power and Calibration of λ_n

Suppose that H₁ : θ = θ* for some θ* ≠ θ₀. Then,

M_n(γ) = (1/n) Σ_{i=1}^n g(X_i, θ*) exp(W_i'γ) →p E[g(X_i, θ*) exp(W_i'γ)],
s_n(γ) →p √(E[g(X_i, θ*)² exp(2W_i'γ)]).

Therefore, for any λ ∈ Λ,

n^{−1/2} T_n(λ) →p sup_{γ∈Γ} |E[g(X_i, θ*) exp(W_i'γ)]| / √(E[g(X_i, θ*)² exp(2W_i'γ)]),

yielding the consistency of the test, as stated in the following theorem.

Theorem 3.
Let Assumptions 1 and 2 hold. Then, for θ* ≠ θ₀, T_n(λ) →p +∞ for any λ ∈ Λ.

Consider a sequence of local hypotheses of the following form: for some nonzero constant vector B,

θ_n = θ₀ + B n^{−1/2},

which leads to

U_{i,n} = U_i + G(X_i, θ₀) B n^{−1/2},   (12)

where G(X_i, θ) = ∂g(X_i, θ)/∂θ', assuming the continuous differentiability of g(·) at θ₀. Unless g(X_i, θ) is linear in θ, the term G(X_i, θ) depends on θ. However, under the null hypothesis, G(X_i, θ₀) is completely specified. For B, we may set it as a vector of ones times a constant. The form of (12) will be intimately related to our proposal regarding how to calibrate the penalization parameter λ.

As before, write G_i := G(X_i, θ₀). Under (12), we have

√n M_n(γ) = (1/√n) Σ_{i=1}^n U_i exp(W_i'γ) + (1/n) Σ_{i=1}^n exp(W_i'γ) G_i B + o_p(1)
          ⇒ M(γ) + E[exp(W_i'γ) G_i B].

Then, we can establish that

T_n(λ) ⇒ max_{γ∈Γ} [ |M(γ) + E[exp(W_i'γ) G_i B]| / s(γ) − λ‖γ‖₁ ],   (13)

using arguments identical to those used to prove Theorem 1.

Define the noncentrality term

κ(γ) := E[exp(W_i'γ) G_i B] / √(E[U_i² exp(2W_i'γ)]).

For the test to have non-trivial power, we need that κ(γ*) ≠ 0 with positive probability, where γ*(B) denotes a (random) maximizer of the stochastic process in (13). Since the penalty affects γ*(B) in different ways under the null of B = 0 and alternatives of B ≠ 0, its implication for the power of the test is not straightforward to analyze. The subsequent section proposes a method to select λ in a more systematic way to increase power.

Here, we discuss some examples where the presence of the penalty, −λ‖γ‖₁, may increase the power of the test. They concern cases in which κ(γ) is maximized at γ = 0 or near zero.
Since the penalty term −λ‖γ‖₁ is also maximized at γ = 0 (where it equals zero), the penalty pushes the maximizer γ*(B) in (13) toward the maximizer of κ(γ). This means the penalty helps increase the value of the statistic under the alternative. On the other hand, the magnitude of the penalty is smaller under the alternative than under the null because the maximizer γ*(B) is closer to zero when B ≠ 0. This decreases the critical values. Together, these two effects enhance the power of the test.

The first example is a kind of conditional homoskedasticity: if E[U_i² e^{W_i'γ}] = E[U_i²] E[e^{W_i'γ}] and E[exp(W_i'γ) G_i B] = E[exp(W_i'γ)] E[G_i B], then κ(γ) ≤ E[G_i B] / √(E[U_i²]) by Jensen's inequality, and the equality holds if and only if γ = 0. As the second example, suppose that W_i = (Z_i, F_i) and F_i is pure noise that is independent of everything else. Then, partitioning γ = (γ₁, γ₂) conformably, the noncentrality term can be rewritten as

κ(γ) = { E[exp(Z_i'γ₁) G_i B] / √(E[U_i² exp(2Z_i'γ₁)]) } × { E[exp(F_i'γ₂)] / √(E[exp(2F_i'γ₂)]) } =: κ₁(γ₁) κ₂(γ₂).

Then, as before, by Jensen's inequality, κ₂(γ₂) ≤ 1, and the equality holds if and only if γ₂ = 0. Therefore, the (random) maximizer of the non-penalized stochastic process |M(γ)/s(γ) + κ(γ)| will center around γ = 0 under homoskedasticity; and it will revolve around γ₂ = 0 if F_i is pure noise.

4.3 Calibration of λ

The penalty function works differently depending on how it shrinks the maximizer γ̂ under the alternatives. Ideally, it should induce sparse solutions that force zeros for the coefficients of the irrelevant conditioning variables, so as to maximize the power of the test. On one hand, the penalty helps to increase the power by increasing κ(γ) under homoskedasticity or by boosting κ₂(γ₂) in the presence of pure noise.
On the other hand, the penalty works against both the critical values and κ(γ) by introducing a bias into the maximizer γ*.

Although it is demanding to characterize the optimal choice of λ analytically, we can elaborate on the choice of the penalty parameter λ under the limit experiment

(M(γ) + E[exp(W_i'γ) G_i B]) / s(γ),

for which we parametrize the size of the deviation by B. Then, our test becomes

T(λ, B, α) = 1{ max_{γ∈Γ} |(M(γ) + E[exp(W_i'γ) G_i B]) / s(γ)| − λ‖γ‖₁ > c_α(λ) }   (14)

for a critical value c_α(λ), which is the (1 − α) quantile of max_{γ∈Γ} [ |M(γ)|/s(γ) − λ‖γ‖₁ ].

Let R(λ, B, α) = E[T(λ, B, α)] denote the power function of the test under the limit experiment for given λ, B and α, where 0 < α < 1 is a prespecified level of the test. We propose to select λ by solving the max-min problem:

max_{λ∈Λ} min_{B∈B} R(λ, B, α),   (15)

where Λ is a set of possible values of λ and B is a set of possible values of B. In some applications, the inner minimization over B ∈ B is simple and easy to characterize. For Λ, we can take a discrete set of possible values of λ, including 0, if suitable. The idea behind (15) is as follows. For each candidate λ, the size of the test is constrained properly because R(λ, 0, α) ≤ α. We then look at the least-favorable local power among possible values of B and choose the λ that maximizes the least-favorable local power.

To operationalize our proposal, we again rely on a multiplier bootstrap.
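To make the rule (15) concrete, the following Python/NumPy sketch approximates it by simulation, anticipating the shifted multiplier bootstrap defined next. It is a sketch under simplifying assumptions, not the authors' implementation: `calibrate_lambda` is an illustrative name, Γ is approximated by a finite candidate set, and the derivatives G_i are passed in as an array.

```python
import numpy as np

def calibrate_lambda(U, G, W, lambdas, Bs, gammas, alpha=0.1, R=200, seed=0):
    """Sketch of the max-min rule (15): for each lambda, approximate the
    critical value by the (1-alpha)-quantile of the B=0 bootstrap draws,
    estimate the local power at each B via the shifted bootstrap, and pick
    the lambda maximizing the least-favorable power."""
    n = len(U)
    rng = np.random.default_rng(seed)
    etas = rng.standard_normal((R, n))         # common multipliers across (lambda, B)

    def stat(u, lam):
        best = -np.inf
        for g in gammas:
            e = np.exp(W @ g)
            M = np.mean(u * e)
            s2 = np.mean((u * e) ** 2)
            best = max(best, np.sqrt(n * M ** 2 / s2) - lam * np.abs(g).sum())
        return best

    best_lam, best_power = None, -np.inf
    for lam in lambdas:
        # B = 0 bootstrap draws give the critical value c*_alpha(lam)
        T0 = np.array([stat(eta * U, lam) for eta in etas])
        crit = np.quantile(T0, 1 - alpha)
        # shifted summand (eta*_i U_i + B G_i / sqrt(n)) exp(W_i' gamma)
        worst = min(
            np.mean([stat(eta * U + B / np.sqrt(n) * G, lam) > crit
                     for eta in etas])
            for B in Bs
        )
        if worst > best_power:
            best_lam, best_power = lam, worst
    return best_lam, best_power
```

Reusing the same multipliers across λ and B mirrors the paper's use of a single bootstrap sample for the whole calibration and keeps the comparison across tuning parameters less noisy.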
Define

M_{n,*}(γ, B) := (1/n) Σ_{i=1}^n (η*_i U_i + G_i B/√n) exp(W_i'γ),
s_{n,*}²(γ, B) := (1/n) Σ_{i=1}^n [(η*_i U_i + G_i B/√n) exp(W_i'γ)]²,
Q_{n,*}(γ, B) := n [M_{n,*}(γ, B)]² / s_{n,*}²(γ, B),

where η*_i is drawn from N(0, 1) and independent from the data {(X_i, W_i) : i = 1, ..., n}. The quantities above are simply shifted versions of (5). For each bootstrap replication r, let

T^(r)_{n,*}(λ, B) := max_{γ∈Γ} [ √(Q^(r)_{n,*}(γ, B)) − λ‖γ‖₁ ].   (16)

Then the critical value c_α(λ) is approximated by c*_α(λ), the (1 − α)-quantile of {T^(r)_{n,*}(λ, 0) : r = 1, ..., R} for some large value of R. Once c*_α(λ) is obtained, R(λ, B, α) is approximated by

(1/R) Σ_{r=1}^R 1{ T^(r)_{n,*}(λ, B) > c*_α(λ) }.   (17)

We conclude this section by commenting that we use the shifted version s_{n,*}²(γ, B) instead of s_{n,*}²(γ, 0) when we define (16). This is because we would like to mimic more closely the finite-sample distribution of the test statistic under the alternative.

Let Λ₀ ⊂ Λ denote the set of global solutions of (15), so that

min_{B∈B} R(λ₀, B, α) > min_{B∈B} R(λ, B, α) for any λ ∈ Λ∖Λ₀ and λ₀ ∈ Λ₀.   (18)

Similarly, let λ̂ denote a maximizer of min_{B∈B} R_n(λ, B, α), where R_n(λ, B, α) := Pr*{T_{n,*}(λ, B) > c*_α(λ)}.

Define T(λ) := max_{γ∈Γ} [ |M(γ)|/s(γ) − λ‖γ‖₁ ]. Let F_λ and F*_λ denote the distribution function of T(λ) and that of T_{n,*}(λ) conditional on the sample X_n, respectively. We make the following regularity condition on F_λ.

Assumption 3.
The partial derivatives {∂F_λ(x)/∂x : λ ∈ Λ} exist and are bounded away from zero for all λ ∈ Λ and x ∈ [min_λ F_λ^{−1}(1 − α) − c, max_λ F_λ^{−1}(1 − α) + c] for some c > 0.

The following theorem shows that the bootstrap critical values c*_α(λ) are uniformly consistent for c_α(λ), which is the (1 − α) quantile of T(λ). Furthermore, it establishes the consistency of our proposed calibration method in the sense that d(λ̂, Λ₀) →p 0, where d(x, X) := min{|x − y| : y ∈ X}.

Theorem 4.
Let Assumptions 1, 2 and 3 hold. Then, c*_α(λ) →p c_α(λ) uniformly in Λ. Thus, d(λ̂, Λ₀) →p 0.

5 Subvector Inference

Partition θ = (θ₁', θ₂')' and θ₀ = (θ₁₀', θ₂₀')'. We now consider inference for θ₁. We assume that for each fixed θ₁, there exists some prior estimator θ̂₂(θ₁) = ψ_n({X_i, W_i}_{i=1}^n) of θ₂(θ₁), such that θ₂₀ = θ₂(θ₁₀). For example, suppose that g(X_i, θ) can be written as g(X_i, θ) = g₁(X_i, θ₁) − θ₂. Then, θ₂₀ = E[g₁(X_i, θ₁₀)], thereby yielding the following estimator of θ₂ given θ₁:

θ̂₂(θ₁) = n^{−1} Σ_{i=1}^n g₁(X_i, θ₁).

In what follows, we assume standard regularity conditions on θ̂₂(θ₁). Let Θ₁ denote the parameter space for θ₁.

Assumption 4. Suppose that there exists a √n-consistent estimator θ̂₂(θ₁) of θ₂(θ₁) that has the following representation: uniformly in θ₁ ∈ Θ₁,

√n (θ̂₂(θ₁) − θ₂(θ₁)) = (1/√n) Σ_{i=1}^n η_{ni}(θ₁) + o_p(1),

where {ψ_{ni}(θ₁) = (η_{ni}(θ₁), U_i)} is a strictly stationary ergodic mds array with V_ψ(θ₁) = lim_{n→∞} E[ψ_{ni}(θ₁) ψ_{ni}(θ₁)'] > 0. Furthermore, assume that there exists G(x, θ) such that E|G(X_i, θ₀)| < ∞, θ ↦ G(X_i, θ) is continuous at θ₀ almost surely,

g(X_i, θ) − g(X_i, θ₀) − G(X_i, θ₀)'(θ − θ₀) = o_p(‖θ − θ₀‖),   (19)

and that θ₁ ↦ θ₂(θ₁) is continuous.

Define Û_i(θ₁) := g[X_i, {θ₁, θ̂₂(θ₁)}], Û_i := Û_i(θ₁₀), and accordingly define the statistics

M̂_n(γ) := (1/n) Σ_{i=1}^n Û_i exp(W_i'γ),
ŝ_n²(γ) := (1/n) Σ_{i=1}^n [Û_i exp(W_i'γ)]²,
Q̂_n(γ) := n [M̂_n(γ)]² / ŝ_n²(γ),

and the test statistic

T̂_n(λ) := max_{γ∈Γ} [ √(Q̂_n(γ)) − λ‖γ‖₁ ].   (20)

Partition G(X_i, θ) = [G₁(X_i, θ)', G₂(X_i, θ)']' corresponding to the partial derivatives with respect to θ₁ and θ₂.

Theorem 5.
Let Assumptions 1, 2 and 4 hold. Then,

T̂_n(λ) ⇒ max_{γ∈Γ} [ |M(γ) + Z' E[G₂(X_i, θ₀) exp(W_i'γ)]| / s(γ) − λ‖γ‖₁ ],

where (Z, M(γ)) is a centered Gaussian random vector with E[ZZ'] = V_η and E[Z M(γ)] = lim_{n→∞} E[U_i η_{ni} exp(W_i'γ)].

The bootstrap needs to account for the estimation error represented by η_{ni}. Define η̂_{ni} := η̂_{ni}(θ₁₀) and Ĝ_i := G₂(X_i, (θ₁₀, θ̂₂)), where η̂_{ni}(θ₁) is a consistent estimator of η_{ni}(θ₁). Let

M̂_{n,*}(γ) := (1/n) Σ_{i=1}^n η*_i { Û_i exp(W_i'γ) + (η̂_{ni}'/n) Σ_{j=1}^n Ĝ_j exp(W_j'γ) },
ŝ_{n,*}²(γ) := (1/n) Σ_{i=1}^n [ η*_i { Û_i exp(W_i'γ) + (η̂_{ni}'/n) Σ_{j=1}^n Ĝ_j exp(W_j'γ) } ]².   (21)

Then, we proceed with these modified quantities, as in Section 3.

We start with a sequence of local alternatives θ₁n = θ₁₀ + B/√n. Then, expressing the hypothesized value of θ₁n explicitly, we write the corresponding statistics as

M̂_n(θ₁n, γ) := (1/n) Σ_{i=1}^n Û_i(θ₁n) exp(W_i'γ),
ŝ_n²(θ₁n, γ) := (1/n) Σ_{i=1}^n [Û_i(θ₁n) exp(W_i'γ)]²,
Q̂_n(θ₁n, γ) := n [M̂_n(θ₁n, γ)]² / ŝ_n²(θ₁n, γ),

and the test statistic

T̂_n(θ₁n) := max_{γ∈Γ} [ √(Q̂_n(θ₁n, γ)) − λ_n‖γ‖₁ ].   (22)

The limit of the test statistic T̂_n(θ₁n) can be easily obtained by modifying the proof of Theorem 5.
Specifically, we note that

√n M̂_n(θ₁n, γ) = (1/√n) Σ_{i=1}^n g_i(θ₁n, θ₂(θ₁n)) exp(W_i'γ) + (1/√n) Σ_{i=1}^n (η_{ni}'/n) Σ_{j=1}^n G₂j(θ₁n, θ₂(θ₁n)) exp(W_j'γ) + o_p(1)
= (1/√n) Σ_{i=1}^n g_i(θ₀) exp(W_i'γ) + (B'/n) Σ_{i=1}^n G₁i(θ₀) exp(W_i'γ) + (1/√n) Σ_{i=1}^n (η_{ni}'/n) Σ_{j=1}^n G₂j(θ₀) exp(W_j'γ) + o_p(1).

Thus, the noncentrality term is determined by B' E[G₁i(θ₀) exp(W_i'γ)].

As shorthand notation, let G₁i := G₁(X_i, θ₀) and G₂i := G₂(X_i, θ₀). We now adjust (14) in Section 4.3 as follows: let

T(λ, B) = 1{ max_{γ∈Γ} |(M(γ) + Z' E[G₂i exp(W_i'γ)] + B' E[G₁i exp(W_i'γ)]) / s(γ)| − λ‖γ‖₁ > c_α(λ) }   (23)

for a critical value c_α(λ), and let R(λ, B) = E[T(λ, B)]. Then, as before, choose λ by solving (15). To implement this procedure, we modify the steps in Section 4.3 with

M̂_{n,*,B}(γ) := (1/n) Σ_{i=1}^n η*_i { Û_i exp(W_i'γ) + (η̂_{ni}'/n) Σ_{j=1}^n Ĝ_j exp(W_j'γ) } + (1/n) Σ_{i=1}^n (B'/√n) G₁i exp(W_i'γ),
ŝ_{n,*,B}²(γ) := (1/n) Σ_{i=1}^n [ η*_i { Û_i exp(W_i'γ) + (η̂_{ni}'/n) Σ_{j=1}^n Ĝ_j exp(W_j'γ) } + (1/n) Σ_{k=1}^n (B'/√n) G₁k exp(W_k'γ) ]².

Then, the remaining steps are identical to those in Section 4.3.

When θ₁ ↦ g(·, (θ₁, θ₂)) is linear, G₁i(θ) does not depend on θ. Otherwise, the procedure above needs a preliminary estimator of θ₁. To avoid this preliminary step, one could restrict B'G₁i(θ) to be a constant.

6 Testing Rational Unbiased Reporting of Ability Status
Benítez-Silva, Buchinsky, Chan, Cheidvasser, and Rust (2004, BBCCR hereafter) examine whether a self-reported disability status is an unbiased indicator of the Social Security Administration (SSA)'s disability award decision. Specifically, they test if U_i = Ã_i − D̃_i has mean zero conditional on covariates W_i, where Ã_i is the SSA disability award decision and D̃_i is a self-reported disability status indicator. Their null hypothesis is H₀ : E[Ã_i − D̃_i | W_i] = 0, which is termed the hypothesis of rational unbiased reporting of ability status (the RUR hypothesis). They use a battery of tests, including Bierens (1990)'s original test, and conclude that they fail to reject the RUR hypothesis. In fact, the Bierens (1990) test has the smallest p-value of 0.09 in their test results (see Table II of their paper). In this section, we revisit this result and apply our testing procedure.

Table 1 shows a two-way table of Ã_i and D̃_i, and Table 2 reports the summary statistics of Ã_i and D̃_i along with those of the covariates W_i. After removing individuals with missing values in any of the covariates, the sample size is n = 347 and the number of covariates is p = 21.¹

Table 1: Self-reported disability and SSA award decision

                                 SSA award decision (Ã)
Self-reported disability (D̃)       0      1   Total
  0                                35     51      86
  1                                61    200     261
Total                              96    251     347

Before computing the penalized maximum test statistic, we first studentize each of the covariates and transform them by x ↦ tan⁻¹(x) componentwise. This step ensures that each of the components of W_i is bounded and that they are comparable with each other. The space Γ is set as Γ = [−1, 1]^p. To compute T_n in (4), we use the particleswarm solver available in Matlab. Particle swarm optimization (PSO) is a population-based stochastic optimization method.²

¹ According to Table I in BBCCR, the sample size is 393 before removing observations with missing values; however, there are only 388 observations in the data file archived at the Journal of Applied Econometrics web page. After removing missing values, the size of the sample extract we use is n = 347, whereas the originally reported sample size is n = 356 in Table 2 in BBCCR.
² Specifically, the particleswarm solver is included in the Global Optimization Toolbox of Matlab.

Table 2: Summary statistics

                                   Mean    Std   Min   Max     γ̂
SSA award decision (Ã)             0.75   0.43     0     1
Self-reported disability (D̃)       0.72   0.45     0     1
Covariates
White                              0.57   0.50     0     1   0.07
Married                            0.58   0.49     0     1  -0.16
Prof./voc. training                0.36   0.48     0     1   0.17
Male                               0.39   0.49     0     1   0.02
Age at application to SSDI        55.97   4.81    33    76   0.33
Respondent income/1000             6.19  10.28     0    52   0.12
Hospitalization                    0.88   1.44     0    14     –
Doctor visits                     13.12  13.19     0    90     –
Stroke                             0.07   0.26     0     1  -0.90
Psych. problems                    0.25   0.43     0     1     –
Arthritis                          0.40   0.49     0     1     –
Fracture                           0.13   0.33     0     1     –
Back problem                       0.59   0.49     0     1  -0.13
Problem with walking in room       0.15   0.36     0     1     –
Problem sitting                    0.48   0.50     0     1  -0.03
Problem getting up                 0.59   0.49     0     1  -0.03
Problem getting out of bed         0.24   0.43     0     1  -0.13
Problem getting up the stairs      0.45   0.50     0     1     –
Problem eating or dressing         0.07   0.26     0     1     –
Prop. worked in t − 1

Recall the test statistic

T_n(λ) := max_{γ∈Γ} [ √Q_n(γ) − λ‖γ‖₁ ].   (24)

It is computationally easier to obtain T_n(λ) when λ is relatively large, because the relevant subset of Γ is smaller with a larger λ.

To choose an optimal λ as described in Section 4.3, first note that G_i = 1 in this example. Therefore, for each λ ∈ Λ, the objective of the inner minimization problem, min_{B∈B} R(λ, B), is a decreasing function of |B|. Thus, it suffices to evaluate the smallest value of |B| satisfying B ∈ B. Here, we take it to be the sample standard deviation of U_i. For λ, we take Λ = {0.2, 0.3, ..., 1.0}. This range of λ's is chosen by some preliminary analysis. When λ is smaller than 0.2, it is considerably harder to obtain stable solutions; thus, we do not consider smaller values of λ.
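The preprocessing step described above (studentize each covariate, then apply the arctangent transform componentwise) is easy to replicate. A minimal Python/NumPy sketch, with an illustrative function name:

```python
import numpy as np

def transform_covariates(W):
    """Studentize each covariate columnwise, then map it through arctan,
    so that every component of W is bounded and on a comparable scale
    (assumes no covariate is constant, so the standard deviations are
    nonzero)."""
    Z = (W - W.mean(axis=0)) / W.std(axis=0)   # studentize columnwise
    return np.arctan(Z)                        # bounded in (-pi/2, pi/2)
```

After this transformation every component lies strictly inside (−π/2, π/2), which is one way to satisfy the boundedness requirement of Assumption 1(iv).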
Since λ (cid:55)→ T n ( λ ) is a decreasing function, wefirst start with the largest value of λ and then solves sequently by lowering the valueof λ , while checking whether the newly obtained solution indeed is larger than theprevious solution. This procedure results in a solution path by λ .Top-left panel of Figure 1 shows the solution path λ (cid:55)→ T n ( λ ) along with thenumber of selected covariates, which is defined to be ones whose coefficients are noless than 0.01 in absolute value. For the latter, 4 covariates are selected with λ = 1 ,whereas 12 are chosen with λ = 0 . . Top-right panel displays the rejection proba-bility defined in (17) when B = 0 (size) and B = (cid:98) σ ( U i ) , where (cid:98) σ ( U i ) is the samplestandard deviation of U i . The level of the test is 0.1 and there are 100 replicationsto compute the rejection probability. The power is relatively flat up to λ = 0 . , in-crease a bit at λ = 0 . and is maximized at λ = 0 . . The bottom panel visualizes eachof 21 coefficients as λ decreases. It can be seen that the proportion of working in t − ( worked prev in the legend of the figure) has the largest coefficient (in absolutevalue) for all values of λ and an indicator of stroke has the second largest coefficient,followed by age at application to Social Security Disability Insurance (SSDI). The co- Matlab . We use the default option of the particleswarm solver. It is possible to adopt the two-step approach used in Qu and Tkachenko (2016). That is, we startwith the PSO solver, followed by multiple local searches. Further, the genetic algorithm (GA) canbe used in the first step instead of PSO and both GA and PSO methods can be compared to checkwhether a global solution is obtained. We do not pursue these refinements in our paper to save thecomputational times of bootstrap inference. λ = 0 . .Table 3: Bootstrap Inference λ Test statistic No. 
of selected cov.s Bootstrap p-value0.2 3.525 12 0.0210.3 3.213 11 0.020Since the power in the top-right panel of Figure 1 is higher at λ = 0 . and . , wereport bootstrap test results for λ ∈ { . , . } in Table 3. There are R = 1 , boot-strap replications to obtain the bootstrap p-values. Interestingly, we are able to rejectthe RUR hypothesis at the 5 percent level, unlike BBCCR. Furthermore, our anal-ysis suggests that the employment history, captured by the proportion of workingpreviously, stroke, and the age at application to SSDI are the three most indicativecovariates that point to the departure from the RUR hypothesis. In this section, we revisit Yogo (2004) and conduct inference on the elasticity of in-tertemporal substitution (EIS). We look at the case of annual US series (1891–1995)used in Yogo (2004) and focus on U t ( θ ) = ∆ c t +1 − θ − θ r t , where ∆ c t +1 is the con-sumption growth at time t +1 and r t is the real interest rate at time t . The instruments W t are the twice lagged nominal interest rate, inflation, consumption growth, and logdividend-price ratio. As in the previous section, each instrument is studentized andis transformed by tan − ( · ) . The main parameter of interest is EIS, denoted by θ . Inthis example, we have data { (∆ c t +1 , r t , W t ) : t = 1 , . . . , n } .To conduct subvector inference for θ developed in Section 5, we use a demeanedversion: (cid:98) U t ( θ ) = (cid:32) ∆ c t +1 − n n (cid:88) t =1 ∆ c t +1 (cid:33) − θ (cid:32) r t − n n (cid:88) t =1 r t (cid:33) . θ ( θ ) = E [∆ c t +1 ] − θ E [ r t ] , (cid:98) θ ( θ ) = 1 n n (cid:88) t =1 ∆ c t +1 − θ n n (cid:88) t =1 r t ,η nt ( θ ) = (∆ c t +1 − E [ c t +1 ]) − θ ( r t − E [ r t ]) , (cid:98) η nt ( θ ) = (cid:98) U t ( θ ) . 
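To make this profiling step concrete, the following is a minimal sketch (our construction, not the authors' code; the data are simulated stand-ins for Yogo's annual series, and all variable names are ours) of the demeaned residual Û_t(θ₁). Demeaning removes the intercept θ₀ by construction, so only θ₁ remains to be tested.

```python
import numpy as np

def demeaned_residual(dc, r, theta1):
    """U_hat_t(theta1) = (dc_t - mean(dc)) - theta1 * (r_t - mean(r)).

    Profiles out the intercept theta0: the sample mean of the
    returned residual is exactly zero for any value of theta1.
    """
    return (dc - dc.mean()) - theta1 * (r - r.mean())

rng = np.random.default_rng(0)
n = 105                                          # roughly the length of an 1891-1995 annual series
r = rng.normal(0.02, 0.05, n)                    # simulated real interest rate
dc = 0.01 + 0.3 * r + rng.normal(0, 0.03, n)     # simulated consumption growth

u = demeaned_residual(dc, r, theta1=0.3)
print(abs(u.mean()))                             # numerically zero, up to floating-point error
```

Because the residual is exactly demeaned in sample, the bootstrap statistics below can be computed at each candidate θ₁ without ever estimating θ₀.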
Then, because G_t = −1, adopting (21) yields the following multiplier bootstrap:

    M̂_{n,*}(γ) = (1/n) Σ_{t=1}^n η*_t Û_t { exp(W_t'γ) − (1/n) Σ_{s=1}^n exp(W_s'γ) },

    ŝ²_{n,*}(γ) = (1/n) Σ_{t=1}^n [ η*_t Û_t { exp(W_t'γ) − (1/n) Σ_{s=1}^n exp(W_s'γ) } ]².    (25)

Furthermore, since G_t = −r_t, we can use the following quantities to calibrate the optimal λ:

    M̂_{n,*,B}(γ) = (1/n) Σ_{t=1}^n η*_t Û_t { exp(W_t'γ) − (1/n) Σ_{s=1}^n exp(W_s'γ) } − (1/n) Σ_{t=1}^n (B r_t/√n) exp(W_t'γ),

    ŝ²_{n,*,B}(γ) = (1/n) Σ_{t=1}^n [ η*_t Û_t { exp(W_t'γ) − (1/n) Σ_{s=1}^n exp(W_s'γ) } − (1/n) Σ_{s=1}^n (B r_s/√n) exp(W_s'γ) ]².

To solve for λ in (15), we take Λ to be a six-point grid on [0, 1] that includes λ = 0, and we take 𝓑 to be a singleton set, since the rejection probability is an increasing function of |B| and η*_t ∼ N(0, 1) is symmetrically distributed about zero. In computing the optimal λ, the level of the test is 0.1 and a grid of values for θ₁ is used; the rejection probability for each case is computed from 100 replications. Figure 2 shows a heatmap of the rejection probability. The first noticeable result is that setting λ = 0 results in a substantial loss of power, indicating that the sample is not large enough to accommodate all four instruments without penalization. The power is above 0.6 as long as λ is bounded away from zero. There are some differences across different values of θ₁. For simplicity, we use a common tuning parameter across different values of θ₁ and set λ at the grid value at which the power is maximized or nearly maximized.

The top panel of Figure 3 shows the bootstrap p-values for each θ₁, which we invert to obtain a confidence interval for the EIS.
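As a rough illustration of the bootstrap statistics in (25), the sketch below (our simplified construction with simulated data; variable names are ours, and we evaluate a single fixed γ, whereas the actual procedure maximizes the penalized studentized statistic over Γ with particle swarm optimization) draws multiplier-bootstrap copies of the studentized statistic and reads off a level-0.1 critical value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 105, 4
W = np.arctan(rng.normal(size=(n, p)))   # instruments, bounded via the arctan transform
U_hat = rng.normal(size=n)               # stand-in for the demeaned residuals U_hat_t

def bootstrap_stat(U_hat, W, gamma, rng):
    """One multiplier-bootstrap draw of the studentized statistic at a fixed gamma.

    M* = mean over t of eta*_t U_hat_t (e_t - mean(e)),
    s*^2 = mean of the squared summands, with eta*_t ~ N(0, 1).
    """
    e = np.exp(W @ gamma)
    h = e - e.mean()                     # demeaned exponential instrument transform
    eta = rng.normal(size=len(U_hat))    # Gaussian multipliers
    summand = eta * U_hat * h
    M = summand.mean()
    s = np.sqrt((summand ** 2).mean())
    return np.sqrt(len(U_hat)) * abs(M) / s

gamma = np.full(p, 0.1)
draws = np.array([bootstrap_stat(U_hat, W, gamma, rng) for _ in range(1000)])
crit = np.quantile(draws, 0.9)           # bootstrap critical value for a 0.1-level test
```

In the paper's procedure, the same multiplier draws η*_t are reused across γ, the statistic is maximized over Γ net of the penalty λ‖γ‖₁, and the shift term indexed by B is added when calibrating λ.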
Yogo (2004) commented that “there appears to be identification failure for the annual U.S. series.” Indeed, the confidence interval from the Anderson-Rubin (AR) test was bounded but wide, and those from the Lagrange multiplier (LM) test and the conditional likelihood ratio (LR) test were [−∞, ∞] (see Table 3 of Yogo, 2004). Our confidence interval is tighter than any of these intervals based on unconditional moments, suggesting that the conditional moment restrictions can provide a more informative confidence interval without arbitrarily creating a particular set of unconditional moment restrictions. The bottom panel of Figure 3 displays the estimated γ. The estimates of γ vary over θ₁, even changing signs, and seem to have large impacts when θ₁ ranges from 0.1 to 0.2. No particular instrument stands out in terms of the estimated magnitude. In a nutshell, we demonstrate that a seemingly uninformative set of instruments can yield an informative inference result once the unconditional moment restrictions are strengthened to the infinite-dimensional conditional moment restrictions with the aid of penalization.

Conclusion

We have developed an inference method for a (sub)vector of parameters using an ℓ₁-penalized maximum statistic. Our inference procedure is based on the multiplier bootstrap and combines inference with model selection to improve the power of the test. We have recommended solving a data-dependent max-min problem to select the penalization tuning parameter. We have demonstrated the efficacy of our method using two empirical examples.

There are multiple directions in which to extend our method. First, we may consider a panel data setting where the number of conditioning variables grows as the time series dimension increases. Second, the unknown parameters may include an unknown function (e.g., Chamberlain, 1992; Newey and Powell, 2003; Ai and Chen, 2003; Chen and Pouzo, 2012).
In view of the results in Breunig and Chen (2020), Bierens-type tests without penalization might not work well when the parameter of interest is a nonparametric function. It would be interesting to study whether and to what extent our penalization method improves power for nonparametric inference. Third, a continuum of conditional moment restrictions (e.g., a conditional independence assumption) might be relevant in some applications. All of these extensions call for substantial developments in both theory and computation.

A Proofs
Proof of Theorem 1.
Due to the boundedness of Γ and W_i, U_i exp(W_i'γ) is Lipschitz continuous in γ and constitutes a VC-subgraph class; see, e.g., Arcones and Yu (1994) for the definition. Thus, it remains to show the finite-dimensional convergence of √n M_n(γ) and s²_n(γ) for (8) and (9); see, e.g., Andrews (1992) and Arcones and Yu (1994) for the stochastic equicontinuity. The martingale difference sequence central limit theorem and the ergodic theorem then yield the desired finite-dimensional convergence under Assumptions 1 and 2; see Hall and Heyde (1980). Finally, for the convergence of T_n(λ), note that both Λ and Γ are bounded, implying that λ‖γ‖₁ is uniformly continuous. Thus, the process |M(γ)|/s(γ) − λ‖γ‖₁ converges weakly in ℓ∞(Γ × Λ), the space of bounded functions on Γ × Λ, and the weak convergence of T_n(λ) follows from the continuous mapping theorem since the (elementwise) max is a continuous operator.
For the same reason as in the proof of Theorem 1, it is sufficient to verify the conditional finite-dimensional convergence. As η*_i U_i exp(W_i'γ) is a martingale difference sequence, we verify the conditions in Theorem 3.2 of Hall and Heyde (1980), a conditional central limit theorem for martingales. Their first condition, that n^{−1/2} max_i |η*_i U_i exp(W_i'γ)| →_p 0, and their last condition, that E[max_i η*²_i U²_i exp(2W_i'γ)] = O(n), are straightforward since exp(W_i'γ) is bounded and |η*_i U_i| has a finite cth moment for some c > 2. Next, n^{−1} Σ_{i=1}^n η*²_i U²_i exp(2W_i'γ) →_p E[U²_i exp(2W_i'γ)] by the ergodic theorem. This completes the proof.
It follows from Lemma 1 that

    |E[g(X_i, θ*) exp(W_i'γ)]| / √(E[g²(X_i, θ*) exp(2W_i'γ)]) > 0

for almost every γ ∈ Γ. Then, the result follows from the ergodic theorem.
We begin by showing that c*_α(λ) →_p c_α(λ) uniformly in Λ. First, recall that the inverse map on the space of distribution functions F that assigns the αth quantile is Hadamard-differentiable at F provided that F is differentiable at F^{−1}(α) with a strictly positive derivative; see, e.g., Section 3.9.4.2 of van der Vaart and Wellner (1996). Therefore, for the uniform consistency of the bootstrap, it is sufficient to show that F*_λ(x) →_p F_λ(x) uniformly over x ∈ [min_λ F_λ^{−1}(α), max_λ F_λ^{−1}(α)] and λ ∈ Λ. This is a direct consequence of the conditional stochastic equicontinuity and the convergence of the finite-dimensional distributions established in Theorem 2. Next, the preceding step implies that T_{n,*}(λ, B) − c*_α(λ) converges weakly. This in turn implies the uniform convergence in probability of R_n(λ, B) to R(λ, B). Since R is continuous on a compact set, the usual consistency argument yields that d(λ̂, Λ*) →_p 0.
Write g_i(θ) and G_i(θ) for g(X_i, θ) and G(X_i, θ), respectively. Note that for θ = θ₀,

    √n M̂_n(γ) = (1/√n) Σ_{i=1}^n g_i(θ₀) exp(W_i'γ) + (1/√n) Σ_{i=1}^n η_{ni} (1/n) Σ_{j=1}^n G_j(θ₀) exp(W_j'γ) + o_p(1)

due to Assumption 4. Then, (1/n) Σ_{j=1}^n G_j(θ₀) exp(W_j'γ) converges uniformly in probability, and (1/√n) Σ_{i=1}^n [a₁ g_i(θ₀) exp(W_i'γ) + a₂ η_{ni}] is P-Donsker for any reals a₁ and a₂, by the same reasoning as in the proof of Theorem 1. Similarly, the uniform convergence of ŝ_n(γ) follows since g(·) is Lipschitz in θ by (19).

References
Ai, C., and X. Chen (2003): “Efficient estimation of models with conditional moment restrictions containing unknown functions,” Econometrica, 71(6), 1795–1843.

Andrews, D. W. K. (1992): “Generic uniform convergence,” Econometric Theory, 8(2), 241–257.

Andrews, D. W. K. (1997): “A Conditional Kolmogorov Test,” Econometrica, 65(5), 1097–1128.

Antoine, B., and P. Lavergne (2020): “Identification-Robust Nonparametric Inference in a Linear IV Model,” Discussion Papers dp20-03, Department of Economics, Simon Fraser University.

Arcones, M. A., and B. Yu (1994): “Central limit theorems for empirical and U-processes of stationary mixing sequences,” Journal of Theoretical Probability, 7(1), 47–71.

Benítez-Silva, H., M. Buchinsky, H. M. Chan, S. Cheidvasser, and J. Rust (2004): “How large is the bias in self-reported disability?,” Journal of Applied Econometrics, 19(6), 649–670.

Bierens, H. J. (1982): “Consistent model specification tests,” Journal of Econometrics, 20(1), 105–134.

Bierens, H. J. (1990): “A consistent conditional moment test of functional form,” Econometrica, 58(6), 1443–1458.

Bierens, H. J., and W. Ploberger (1997): “Asymptotic Theory of Integrated Conditional Moment Tests,” Econometrica, 65(5), 1129–1151.

Breunig, C., and X. Chen (2020): “Adaptive, Rate-Optimal Testing in Instrumental Variables Models,” arXiv:2006.09587 [econ.EM], https://arxiv.org/abs/2006.09587.

Chamberlain, G. (1987): “Asymptotic efficiency in estimation with conditional moment restrictions,” Journal of Econometrics, 34(3), 305–334.

Chamberlain, G. (1992): “Efficiency bounds for semiparametric regression,” Econometrica, 60(3), 567–596.

Chen, X., and D. Pouzo (2012): “Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals,” Econometrica, 80(1), 277–321.

de Jong, R. M. (1996): “The Bierens test under data dependence,” Journal of Econometrics, 72(1), 1–32.

Fan, Y., and Q. Li (1996): “Consistent model specification tests: omitted variables and semiparametric functional forms,” Econometrica, pp. 865–890.

Fan, Y., and Q. Li (2000): “Consistent model specification tests: Kernel-based tests versus Bierens’ ICM tests,” Econometric Theory, pp. 1016–1041.

Hall, P., and C. C. Heyde (1980): Martingale Limit Theory and Its Application. Academic Press.

Horowitz, J. L., and V. G. Spokoiny (2001): “An Adaptive, Rate-Optimal Test of a Parametric Mean-Regression Model Against a Nonparametric Alternative,” Econometrica, 69(3), 599–631.

Kennedy, J., and R. Eberhart (1995): “Particle swarm optimization,” in Proceedings of ICNN’95 - International Conference on Neural Networks, vol. 4, pp. 1942–1948.

Newey, W. K., and J. L. Powell (2003): “Instrumental Variable Estimation of Nonparametric Models,” Econometrica, 71(5), 1565–1578.

Qu, Z., and D. Tkachenko (2016): “Global Identification in DSGE Models Allowing for Indeterminacy,” Review of Economic Studies, 84(3), 1306–1345.

Stinchcombe, M. B., and H. White (1998): “Consistent Specification Testing with Nuisance Parameters Present Only under the Alternative,” Econometric Theory, 14(3), 295–325.

Taylor, J., and R. J. Tibshirani (2015): “Statistical learning and selective inference,” Proceedings of the National Academy of Sciences, 112(25), 7629–7634.

Tibshirani, R. (1996): “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.

van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer, New York.

Yogo, M. (2004): “Estimating the elasticity of intertemporal substitution when instruments are weak,” Review of Economics and Statistics, 86(3), 797–810.