[PDF] Semiparametric Testing with Highly Persistent Predictors

Abstract

We address the issue of semiparametric efficiency in the bivariate regression problem with a highly persistent predictor, where the joint distribution of the innovations is regarded an infinite-dimensional nuisance parameter. Using a structural representation of the limit experiment and exploiting invariance relationships therein, we construct invariant point-optimal tests for the regression coefficient of interest. This approach naturally leads to a family of feasible tests based on the component-wise ranks of the innovations that can gain considerable power relative to existing tests under non-Gaussian innovation distributions, while behaving equivalently under Gaussianity. When an i.i.d. assumption on the innovations is appropriate for the data at hand, our tests exploit the efficiency gains possible. Moreover, we show by simulation that our test remains well behaved under some forms of conditional heteroskedasticity.

Full PDF

aa r X i v : . [ ec on . E M ] S e p Semiparametric Testing with Highly Persistent Predictors ∗ Bas J.M. Werker and Bo Zhou Econometrics and Finance Group, Tilburg University Department of Economics and Finance, Durham University

Abstract

We address the issue of semiparametric eﬃciency in the bivariate regression problemwith a highly persistent predictor, where the joint distribution of the innovationsis regarded an inﬁnite-dimensional nuisance parameter. Using a structural repre-sentation of the limit experiment and exploiting invariance relationships therein,we construct invariant point-optimal tests for the regression coeﬃcient of interest.This approach naturally leads to a family of feasible tests based on the component-wise ranks of the innovations that can gain considerable power relative to existingtests under non-Gaussian innovation distributions, while behaving equivalently un-der Gaussianity. When an i.i.d. assumption on the innovations is appropriate forthe data at hand, our tests exploit the eﬃciency gains possible. Moreover, we showby simulation that our test remains well behaved under some forms of conditionalheteroskedasticity.

JEL classiﬁcation:

C12, C14

Keywords: predictive regression, limit experiment, LABF, maximal invariant,rank statistics. ∗ We thank Gaia Becheri for signiﬁcant input on an earlier version of this paper. We also thankPeter Boswijk, Feike Drost, Ramon van den Akker, two referees, the associate editor, and participantsat the European Conferences of the Econometrics Community (EC2) conference, Amsterdam, Dec 2017;Aarhus University, Sep 2018 for helpful comments. Introduction

Over the past two decades, inference for the bivariate regression model with a highlypersistent predictor has been well studied under the assumption of bivariate Gaus-sian innovations. Several procedures have been proposed in the econometric litera-ture, see Cavanagh et al. (1995), Campbell and Yogo (2006), Jansson and Moreira(2006), Elliott et al. (2015), and Moreira and Mour˜ao (2016). These inference pro-cedures are all constructed based on the assumption of Gaussian innovations and,while their validity has been established under weaker assumptions, the asymptoticpower of all these procedures cannot go beyond the Gaussian power envelope.In the present paper we show that, when the application supports an additionalassumption of serially independent innovations, sizable power gains are possible be-yond the Gaussian power envelope. We establish this result by studying in detail theinvariance structures that are present in the limiting experiment associated with thepredictive regression model. This leads to a semiparametric power envelop which,under non-Gaussian innovation distributions, lies above the Gaussian power enve-lope. In that case, even without knowing the innovation distribution, our methoddominates existing QMLE-based methods.Our results precisely quantify the statistical eﬃciency gains from non-Gaussianinnovation distributions when innovations are serially independent in predictive re-gression models. Under such, arguably restrictive assumption, we construct semi-parametrically optimal (in a sense to be made precise later) tests. Whether inconcrete applications the assumption of serially independence is warranted, is anempirical question. When it is, it can, as our results show, be exploited leading tosizable power gains (of, as Section 5 shows, up to 30% under Student- t innovationdistributions). Symmetrically, to make an informed choice, we study the behaviorof our test when the innovations are not i.i.d. but exhibit conditional heteroskedas-ticity as often found in (ﬁnancial) applications. Section 5.2 shows that, for thedeviations studied, our test still has desirable size and power properties.We note that our conceptual ideas reach further. We could, for instance, allowfor serial dependence along the lines of Zhou et al. (2019) where an AR-type modelon the error is imposed. Conditional heterogeneity could formally be addressedalong the lines of Ling et al. (2003) where a GARCH-type structure on the error isimposed; or following Boswijk et al. (2005) where the (potentially nonstationary)volatility is estimated nonparametrically. These relaxations would technically benon-trivial and are left for future research. Note that, in view of the robustness-eﬃciency trade-oﬀ (see, e.g., M¨uller, 2011), an i.i.d. assumption on the innovations ltimately driving the error term is not avoidable. Our test gives the empiricalresearchers an additional option: an improved power when innovations are i.i.d.and non-Gaussian.The study of (optimal) semiparametric inference in the predictive regressionmodel is complicated by the nonstandard asymptotic behavior induced by the local-to-unity asymptotics on the persistence parameter. More precisely, the associatedlikelihood ratios are of the Locally Asymptotically Brownian Functional (LABF)form in (see Jeganathan, 1995) and henceforth outside the conventional

LocallyAsymptotically Normality (LAN) world. As a consequence, the usual semipara-metric approach based on projecting the score of the parameter of interest on thetangent space of nuisance scores is not straightforward. In particular, the modeldoes not feature an adaptiveness property, which complicates its analysis. Jansson(2008) deals with the unit root testing problem, which also admits the LABF form,by guessing and then proving a least favorable direction of parametric submodels.An alternative approach has been proposed for the unit root testing problem inZhou et al. (2019) and generalized to other common types of limiting experimentsin Zhou (2020). In the present paper we apply these techniques to the predictiveregression model.The key idea is to exploit invariance structures in a so-called “structural” rep-resentation of the limit experiment. This approach sets us apart from most ofthe statistical and econometric literature where invariance arguments are used inthe sequence of experiments. Instead, we obtain procedures which are invariant inthe limit experiment, thereby making the analysis tractable and applicable to manymodels. Furthermore, the unique bivariate nature of the predictive regression modelleads to a nonstandard multivariate structure in the associated limit experiment (seeTheorem 3.1). Therefore, we present the approach in detail in the present paper.Our contribution is twofold. First, we derive the semiparametric power enve-lope for (asymptotically) invariant tests in case the predictor’s persistence level isassumed to be known, based on the structural LABF limit experiments. More pre-cisely, Girsanov’s theorem, combined with the limiting likelihood ratios for LABFexperiments, leads to a description of the limit experiment by stochastic diﬀerentialequations (SDEs). The observations in the limit experiment correspond to the lim-its of partial-sum processes of the innovations and score functions in the predictiveregression model. In this structural representation of the limit experiment, we ﬁndthat the nuisance parameters induced by the density function of the innovationsonly appear in the drifts of the driving Brownian motions. This leads to an invari-ance restriction by taking the Brownian bridges (which are invariant with respect to hese drifts) of these processes, and allows us to remove the nonparametric nuisanceparameter (the density f of the innovations). We show that this also generatesthe maximal invariant . In this way, we avoid the problem of explicitly ﬁnding theleast-favorable submodel. The likelihood of the maximal invariant immediately, bythe Neyman-Pearson lemma, leads to the semiparametric power envelope.Second, we propose a family of semiparametric feasible tests that has desirableproperties. These tests are constructed using (asymptotically) suﬃcient statisticsthat are based on the increments of innovations, their component-wise ranks, and apair of chosen marginal reference densities for both innovations including a referencecorrelation parameter. The ranks appear naturally as rank-based partial-sum scoreprocesses which weakly converge to the Brownian bridge that is invariant w.r.t.the density perturbation parameters. To further eliminate the remaining nuisanceparameter, namely the predictor’s persistence level, we employ the ApproximateLeast Favorable Distribution (ALFD) approach proposed by Elliott et al. (2015).We also follow their suggestion to switch to standard asymptotic approximationswhen the persistence parameter is far from unity. This helps to control the size ofour tests uniformly under both non-stationarity and stationarity, see Appendix C.The tests thus obtained are semiparametric in the sense that they have correctasymptotic sizes (under all innovation densities allowed) regardless of the choices ofthe marginal reference densities or the reference correlation.Next to their uniform (relative to our model) validity, our test are more powerfulthan existing tests when the true innovation density is non-Gaussian. In particular,we compare our test to Elliott et al. (2015) (henceforth denoted as EMW), whichis based on Gaussian likelihood ratios (see also Jansson and Moreira, 2006). Ourasymptotic analysis using invariance arguments shows that, under non-Gaussianinnovations, the EMW test actually is measurable with respect to an invariant inthe limit that is not maximally invariant. As a result, under non-Gaussianity, we canconstruct tests that outperform the Gaussian power envelope and, thus, outperformthe EMW test; see Remark 3.2. The power improvement depends on the choicesof the marginal reference densities: when they are “closer” to the true marginaldensities, we gain more power (and, again, while always having the desired size).Additionally, if one ﬁxes the marginal reference densities to be Gaussian, our testis generally still more powerful than the EMW test under non-Gaussian innovationdensity; while under Gaussian innovation density, our test performs equivalently tothe EMW test. This property is often referred to as the Chernoﬀ-Savage result (seeChernoﬀ and Savage (1958)). In the present LABF setting we have not been ableto formally prove this Chernoﬀ-Savage result, but our simulations indicate that this roperty nevertheless may hold.Our rank-based test can be regarded a generalized version of quasi-likelihoodratio tests which take the reference density to be Gaussian. The extra freedom tochoose the reference density also comes with the cost of actually choosing it. How-ever, we note that, in line with traditional quasi-likelihood methods, one can alwayschoose the Gaussian reference density. Based on the classical Chernoﬀ-Savage re-sult, we conjecture that our rank-based procedure will then always outperform thequasi-likelihood procedure. This is conﬁrmed by simulations and intuition, but, asdiscussed below, given the non-standard limiting experiment structure, we have notbeen able to prove this formally. Alternative, one could study a plug-in estimatorwhere the reference density is nonparametrically estimated. We do not study thisformally in the present paper; however, see Section 5.3 for some simulation results.Similarly, one may envision an approach where one pre-tests the residuals for, e.g.,high kurtosis and chooses a references density based on that pre-test result.The paper is organized as follows. In Section 2, we introduce the model and test-ing problem under consideration. In Section 3, we develop the asymptotic powerenvelope for test that are (asymptotically) invariant with respect to the innovationdensity f , assuming the predictor’s persistence parameter γ is known. This devel-opment is based on the theory of limit experiments (see, e.g., Le Cam (1986) andVan der Vaart (2000)) and a structural version for models of LABF likelihood ratios(see Zhou et al. (2019)). In particular, this section explains where our power gainscome from, see Remark 3.2. In Section 4, we employ the ALFD approach proposedby Elliott et al. (2015), among several available choices in the literature, to elim-inate the nuisance parameter γ . In Section 5, we report large- and small-sampleperformances of our tests under both i.i.d. and conditional heteroskedastic errors.Section 6 concludes. All proofs are gathered in the appendix. Let y t denote a random variable, observable at time t , that we wish to predict attime t − x t − . We consider the predictiveregression model y t = µ + βx t − + ε yt , (1) x t − α = γ ( x t − − α ) + ε xt , (2) ith x = 0. The parameter space is given by µ ∈ R , α ∈ R , β ∈ R , and γ ∈ ( − , t = 1 , . . . , T .Equation (2) features, along the lines of Cavanagh et al. (1995) and Jansson and Moreira(2006), an intercept α . However, as µ is a nuisance parameter in our model, theintercept α can be subsumed in µ without aﬀecting inference on β . Indeed, ourtest statistics will only depend on the increments of x t , denoted by ∆ x t , and theirassociated ranks and, thus, they are invariant with respect to α . We therefore omit α in the rest of this paper.To eliminate the nuisance intercept parameter µ in (1), one can directly imposean invariance restriction in the sequence of predictive regression experiments. Forinstance, the Jansson and Moreira (2006) test is based on the maximal invariantstatistic ( y − y , y − y , . . . , y T − y ) ′ . In the present paper, our statistic is onlybased on y t ’s through their ranks and, thus, also enjoys ﬁnite-sample invariancew.r.t. µ . To simplify notation, we set µ = 0 throughout the paper and nowhereassume E f ( ε yt ) = 0. We will need to impose E f ( ε xt ) = 0: allowing for deterministictrends in x t would lead to an entirely diﬀerent asymptotic analysis.Summarizing, as outlined in the introduction, we assume that the innovations ε t = ( ε yt , ε xt ) ′ are independent and identically distributed (i.i.d.) with (bivariate)density f satisfying the following condition. Assumption 1. (a) E f ( ε xt ) = 0 and Var f ( ε t ) =  σ y ρσ y σ x ρσ y σ x σ x  is a ﬁnitepositive-deﬁnite matrix.(b) The density f is absolutely continuous with a.e. derivative ˙ f =  ˙ f y ˙ f x  .(c) The (standardized) Fisher information for location, J f =  J f yy J f yx J f yx J f xx  = E f (cid:0) ℓ f ℓ ′ f (cid:1) , where ℓ f is the (standardized) location score function ℓ f =  σ y ℓ f y σ x ℓ f x  =  − σ y ˙ f y /f − σ x ˙ f x /f  , Note that this assumption on the initial value x could possibly be relaxed to the weaker as-sumption T − / x = o P (1) under β = 0 and γ = 1. One can possibly proceed along the linesof M¨uller and Elliott (2003); see also a remark on this point in Section 4 of Jansson and Moreira(2006). We keep the assumption x = 0 for simplicity. s ﬁnite. (d) f > . (cid:3) Let F denote the set of densities satisfying Assumption 1.The Fisher information J f and scores ℓ f for location are standardized in thesense that they are actually those related to ε yt /σ y and ε xt /σ x . As a result, ℓ f and J f do not depend on σ y or σ x . Note, however, that they both still depend on thecorrelation between the innovations ε yt and ε xt , i.e., they still depend on ρ .We are interested in (optimal) tests for the (composite) null hypothesisH : β = 0 , γ ∈ ( − , , f ∈ F , (3)versus the one-sided alternativeH : β > , γ ∈ ( − , , f ∈ F . (4)As the literature focuses on test derived using an assumed Gaussian innovationdensity, we will throughout this paper consider Gaussian densities as a special case.This will allow us to make explicit where the power improvements come from in thecase of non-Gaussian, serially independent, innovations ( ε y , ε x ). Remark f ) . In case f is zero-mean bivariate Gaussian with correlationmatrix R =  ρρ  , Assumption 1 is satisﬁed with ℓ f ( ε y , ε x ) = R −  ε y /σ y ε x /σ x  and J f = R − . Following the by now standard approach in the literature, we study the limit exper-iment in the sense of H´ajek-Le Cam by considering local alternatives for all modelparameters, that is, for both the parameter of interest β and the nuisance parame-ters ( γ and f ). For β and γ the appropriate rates of convergence are well known, see,e.g., Elliott and Stock (1994), Campbell and Yogo (2006), or Jansson and Moreira(2006). More precisely, we consider a T − -localization rate for β and γ , i.e., β = β ( T ) ( b ) = bT σ y σ x , γ = γ ( T ) ( c ) = 1 + cT , (5) Being a Fisher information for location, J f is automatically nonsingular and positive deﬁnite, seeMayer-Wolf et al. (1990, Theorem 2.3). ith b ∈ R and c ∈ ( −∞ , Observe that the local perturbation for b features ascaling by σ y /σ x . This ensures that the limit experiment will not depend on σ y and σ x (although it still depends on ρ ).The nuisance parameter f is inﬁnite dimensional, so it is somewhat more involvedto describe its relevant local perturbations. Introduce the separable Hilbert spaceL ,f = L ,f ( R , B ) = n h ∈ L f ( R , B ) | E f h ( ε ) = 0 , E f ε x h ( ε ) = 0 o , (6)where L f ( R , B ) denotes, the space of Borel-measurable functions h : R → R satisfying E f h ( ε ) = R R h ( ε ) f ( ε )d ε < ∞ . The model assumption E f ( ε xt ) = 0induces the restriction that local perturbations for f are orthogonal to the ﬁrstcomponent of ε : E f ε x h ( ε ) = 0.The separability of the Hilbert space L ,f ensures the existence of a count-able orthonormal basis h k , k ∈ N , such that each h k is bounded and two timescontinuously diﬀerentiable with bounded derivatives; see, e.g., Rudin (1987, Theo-rem 3.14). Therefore, any function h ∈ L ,f can be written as h = P ∞ k =1 η k h k , forsome η = ( η k ) k ∈ N ∈ ℓ = { ( z k ) k ∈ N | P ∞ k =1 z k < ∞} . Besides the space ℓ , we alsoneed the space c which is deﬁned as the subset of sequences with ﬁnite support,i.e., c = ( ( z k ) k ∈ N ∈ R N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ X k =1 { z k = 0 } < ∞ ) . (7)Observe that c is a dense subspace of ℓ . It is introduced only in the asymptoticanalysis to avoid convergence of inﬁnite-dimensional processes and possibly inducedmathematical complications, see Section 2.2. However, the restriction η ∈ c willnot aﬀect our conclusions. Indeed, considering η ∈ c restricts our analysis to asubset of all semiparametric models which potentially makes the obtained upperbound higher. However, as we are able to show that this higher upper bound is(point-wisely) attainable by feasible tests for arbitrary innovation density in se-quence, see Remark 3.1, it constitutes the semiparametric power envelope and thetest is semiparametrically optimal.We model local perturbations to the innovation density f as f ( T ) η ( e ) = f ( e ) √ T ∞ X k =1 η k h k ( e ) ! for all e ∈ R , (8) We use here the common approach in the literature to restrict the nuisance parameter c to ( −∞ , c ∈ R ; see, e.g., Moreira and Mour˜ao (2016). here η ∈ c . We thus use a standard localization rate T − / for the bivariate den-sity f . Indeed, Proposition 3.1 below shows that all the above rates are appropriatein the sense that they lead to contiguous alternatives for the induced probabilitymeasures as T tends to inﬁnity.In order to show that the above localization of the innovation density is valid,we need to establish that f ( T ) η ∈ F . This is the content of the next proposition. Proposition 2.1.

Let f ∈ F and η ∈ c , then there exists a ﬁnite integer e T suchthat for all T ≥ e T we have f ( T ) η ∈ F . The proof uses exactly the same arguments as in the proof of Proposition 3.1 inZhou et al. (2019), but with support R instead of R . It is therefore omitted.In terms of the local parameters b , c , and η , the hypothesis of interest becomesH : b = 0 , c ∈ R , η ∈ c , (9)versus the one-sided alternativeH : b > , c ∈ R , η ∈ c . (10) In order to derive the limiting experiment for the predictive regression model, weneed to introduce some partial-sum processes and study their asymptotic behavior.We denote by P ( T ) b,c,η ; f the law of ( y , x ) ′ , . . . , ( y T , x T ) ′ under the model (1)–(2),where the parameters β and γ are given by (5) and the innovation density is givenby (8). Formally, we deﬁne the sequence of experiments of interest as E ( T ) ( f ) := (cid:16) Ω ( T ) , F ( T ) , n P ( T ) b,c,η ; f : b, c ∈ R , η ∈ c o(cid:17) , T ∈ N , (11)where Ω ( T ) := R × T and F ( T ) := B ( R × T ). We denote the expectation taken underthe measure P ( T )0 , , f by E ( T ) .Let us already mention that we will also introduce a collection of probabilitymeasures P b,c,η , deﬁned on a probability space (Ω , F ), representing the limit exper-iment E ( f ) in Section 3.1 below; see (26). We will denote he expectation takenunder the measure P , , by E . That is, P ( T ) and E ( T ) refer to ﬁnite-sample distri-butions in the sequence of experiments, while P and E refer to distributions in thelimit experiment.As a ﬁnal ingredient for our analysis, we introduce some partial-sum processesthat we use throughout to link the sequence of experiments E ( T ) ( f ) to the limitexperiment E ( f ). In particular, deﬁne, with ∆ x t := x t − x t − , the partial-sum rocesses W ( T ) ε ( s ) := 1 √ T ⌊ sT ⌋ X t =1 ∆ x t σ x , (12) W ( T ) ℓ fy ( s ) := 1 √ T ⌊ sT ⌋ X t =1 σ y ℓ f y ( y t , ∆ x t ) , (13) W ( T ) ℓ fx ( s ) := 1 √ T ⌊ sT ⌋ X t =1 σ x ℓ f x ( y t , ∆ x t ) , (14) W ( T ) h k ( s ) := 1 √ T ⌊ sT ⌋ X t =1 h k ( y t , ∆ x t ) , k ∈ N . (15)Here we standardize the ﬁrst three partial-sum processes by the standard deviations σ y and σ x in order to make their limits scale invariant. Under P ( T )0 , , f , by theFunctional Central Limit Theorem (see also Lemma A.1), we have  W ( T ) ε ( s ) W ( T ) ℓ fy ( s ) W ( T ) ℓ fx ( s ) W ( T ) h ( s )  ⇒  W ε ( s ) W ℓ fy ( s ) W ℓ fx ( s ) W h ( s )  , s ∈ [0 , , (16)where the Brownian motions W ε , W ℓ fy , W ℓ fx and W h are deﬁned on the commonprobability space (Ω , F , P , , ). We have to be precise about the notion of weakconvergence adopted in (16) as W h is inﬁnite dimensional. In line with stochasticprocess theory, we mean that all ﬁnite-dimensional subprocesses of W ( T ) h weaklyconverges in the space D M +3 [0 ,

1] with the uniform topology, where M is the di-mension of the ﬁnite-dimensional subprocess considered. This is precisely becausewe take the local parameter η to be in c . For the sake of convenient notation,we write the seemingly inﬁnite-dimensional convergence (16). As argued above, weare ultimately able to attain the semiparametric power envelope induced under therestriction η ∈ c so that we can claim semiparametric optimality.Next, deﬁne the column vectors J f y h = ( J f y h k ) k ∈ N and J f x h = ( J f x h k ) k ∈ N ,where J f y h k := E f (cid:2) σ y ℓ f y ( ε t ) h k ( ε t ) (cid:3) and J f x h k := E f [ σ x ℓ f x ( ε t ) h k ( ε t )]. As we havethe equalities E f (cid:2) ε xt ℓ f y ( ε t ) (cid:3) = − σ y R R ε x ˙ f y ( ε ) f ( ε ) f ( ε )d ε = − σ y R R ε x ˙ f y ( ε )d ε = 0 andE f [ ε xt ℓ f x ( ε t )] = − σ x R R ε x ˙ f x ( ε ) f ( ε ) f ( ε )d ε = − σ x R R ε x ˙ f x ( ε )d ε = σ x R R f ( ε )d ε = σ x ,the behavior of the Brownian motions W ε , W ℓ fy , W ℓ fx and W h is described by the One may consider partial sum processes that start at t = 2 in order to make them exactly invariantto translations in x t . This would, clearly, have no eﬀect on our asymptotic results. ovariance matrix Var  W ε (1) W ℓ fy (1) W ℓ fx (1) W h (1)  =  J f yy J f yx J ′ f y h J f yx J f xx J ′ f x h J f y h J f x h I ∞  , (17)where I ∞ denotes the ∞ -dimensional identity matrix. The scaling by σ x and σ y introduced in (12)–(15) is indeed such that the covariance matrix (17) does notdepend on σ x or σ y . Again, it still depends on ρ through the various J matrices.Recall that the functions h k form an orthonormal basis for all zero-mean ﬁnite-variance functions that are orthogonal to ε xt . In view of the covariance matrix (17),we may thus write, for s ∈ [0 , W ℓ fy ( s ) = J ′ f y h W h ( s ) , (18) W ℓ fx ( s ) = W ε ( s ) + J ′ f x h W h ( s ) . (19)Consequently, we also have Var (cid:2) W ℓ fy (1) (cid:3) = J f yy = J ′ f y h J f y h , (20)Var (cid:2) W ℓ fx (1) (cid:3) = J f xx = 1 + J ′ f x h J f x h , (21)Cov (cid:2) W ℓ fy (1) , W ℓ fx (1) (cid:3) = J f yx = J ′ f y h J f x h . (22)We again consider the special case of a Gaussian density f . Remark f ) . In the situation of Gaussian f as discussed in Remark 2.1,we may write the decomposition (21) as W ℓ fx = W ε − ρ √ − ρ W ⊥ where W ⊥ is thestandard Brownian motion generated by the increments ( ε y /σ y − ρε x /σ x ) / p − ρ .Indeed, W ε and W ⊥ are independent (calculate the correlation of the incrementsthat generate both processes). Thus, we also ﬁnd J ′ f x h W h ( s ) = − ρW ⊥ and thedecomposition (21) becomes J f xx = 1 + ρ − ρ = − ρ = J f yy . Moreover, we have W ℓ fy = √ − ρ W ⊥ and J f yx = − ρ − ρ . f by invari-ance We ﬁrst focus on eliminating the nuisance parameter f from the testing problemoutlined in Section 2. We will see that this can be handled using invariance argu-ments in the limit experiment, which we derive in Section 3.1. In Section 4, weconsider the nuisance parameter γ .We take the following steps in this section: . Provide a structural representation of the limit experiment (Section 3.1).2. Characterize maximally invariant test statistics in this limit experiment (Sec-tion 3.2).3. Provide a structural representation of the invariant limit experiment (Sec-tion 3.3).4. Provide a feasible version of the asymptotically invariant test statistics to beapplied in the sequence of predictive regression experiments (Section 3.4).These steps also show that, to eliminate the nuisance parameter f , instead of study-ing invariance restrictions in the sequence of ﬁnite-sample experiments, we only im-pose them in the limit experiment. Unlike for the location parameter µ (of ε yt ), thislimiting invariance property of the parameter f does not follow directly from ex-act ﬁnite-sample invariance properties. Notably, the existing tests in the literatureshare this feature, as they also (implicitly) impose the invariance restriction in thelimit, though not in the sequence; see Remark 3.2. As far as we know, all existingtests belong to the class of asymptotically invariant (w.r.t. f ) tests, while our testis semiparametrically optimal in the model we study. Section 5 shows that this ap-proach leads to considerable power gains in case the innovations are non-Gaussian,while no power is lost under Gaussianity. We consider the limit experiment corresponding to the predictive regression model (1)–(2) using the local perturbations (5) and (8), i.e., the limit of the experiments E ( T ) ( f ) indexed by T , by studying the asymptotic behavior of the induced likeli-hood ratios. We expand the likelihood ratio around ( β, γ, η ) = (0 , ,

0) and deriveits limit in the following proposition, which can be interpreted as a generalizationof Lemma 4 in Jansson and Moreira (2006) by including non-Gaussian distributionsand perturbations thereof. Proposition 3.1.

Fix f ∈ F . Consider the local parameters b ∈ R , c ∈ R , and η ∈ c . Then,(i) Under P ( T )0 , , f , the log-likelihood ratio of the predictive regression experimentsatisﬁes, as T → ∞ , log dP ( T ) b,c,η ; f dP ( T )0 , , f = ∆ ( T ) ( b, c, η ) − Q ( T ) ( b, c, η ) + o P (1) , (23) As preparation for the results in Section 4, we allow in this proposition for local perturbations withrespect to γ even though, in the present section, γ is assumed to be known. here ∆ ( T ) ( b, c, η ) = bT T X t =1 x t − σ x σ y ℓ f y ( y t , ∆ x t ) + cT T X t =1 x t − ℓ f x ( y t , ∆ x t )+ 1 √ T T X t =1 X k η k h k ( y t , ∆ x t ) , Q ( T ) ( b, c, η ) = (cid:0) b J f yy + c J f xx + 2 bcJ f yx (cid:1) T T X t =1 x t − σ x + (cid:16) bJ ′ f y h η + 2 cJ ′ f x h η (cid:17) T / T X t =1 x t − σ x + η ′ η. (ii) Still under P ( T )0 , , f , as T → ∞ , we have log dP ( T ) b,c,η ; f dP ( T )0 , , f ⇒ L ( b, c, η ) = ∆( b, c, η ) − Q ( b, c, η ) , (24) where ∆( b, c, η ) = b Z W ε ( s )d W ℓ fy ( s ) + c Z W ε ( s )d W ℓ fx ( s ) + η ′ W h (1)= Z W ε ( s ) (cid:0) bJ f y h + cJ f x h (cid:1) ′ d W h ( s ) + c Z W ε ( s )d W ε ( s ) + η ′ W h (1) , Q ( b, c, η ) = (cid:0) b J f yy + c J f xx + 2 bcJ f yx (cid:1) Z W ε ( s ) d s + η ′ η + (cid:16) bJ ′ f y h η + 2 cJ ′ f x h η (cid:17) Z W ε ( s )d s = Z (cid:12)(cid:12)(cid:0) bJ f y h + cJ f x h (cid:1) W ε ( s ) + η (cid:12)(cid:12) d s + c Z W ε ( s ) d s. (iii) For every b, c ∈ R and η ∈ c , under P , , , E [exp ( L ( b, c, η ))] = 1 . A proof of Proposition 3.1 is provided in Appendix B, but let us give a brief sketchhere. Part (i) is immediate from an informal Taylor expansion of the log-likelihoodratios and, formally, follows from Hallin et al. (2015), which provides generally ap-plicable suﬃcient conditions for the quadratic expansion of likelihood ratios withdensities that are diﬀerentiable in quadratic mean (DQM). This DQM condition isimplied, for location models, by the absolutely continuity of the innovation densityfunction and ﬁniteness of the associated Fisher information, i.e., precisely the con-tent of Assumption 1. A detailed discussion can be found in Le Cam (1986, Section17.3) or Yang and Le Cam (2000, Section 7.3). Part (ii) follows from the continu-ous mapping theorem applied to the weak convergence in (16). Both forms of thecentral sequence ∆ and quadratic term Q follow from (18) and (19). Part (iii) fol-lows from standard stochastic calculations concerning Dol´eans-Dade exponentials.To see this, note that W h and W ε are independent in view of (17) and, thus, havevanishing quadratic covariation. art (iii) of Proposition 3.1 ensures that we can introduce a collection of prob-ability measures P b,c,η on the measurable space (Ω , F ) (on which the Brownianmotions W ε , W ℓ fy , W ℓ fx and W h are deﬁned) by the Radon-Nikodym derivatived P b,c,η d P , , = exp L ( b, c, η ) , (25)where L ( b, c, η ) is deﬁned in (24). Then, in the sense of H´ajek-Le Cam (see, forinstance, Van der Vaart (2000), Chapter 9), the sequence of predictive regressionexperiments, indexed by sample size T , weakly converges to the limit experimentdescribed by the measures P b,c,η . We formally deﬁne this limit experiment by E ( f ) := (cid:16) Ω , F , n P b,c,η : b, c ∈ R , η ∈ c o(cid:17) , (26)where Ω := C [0 , × C [0 , × C [0 , × C N [0 ,

1] and F := B C ⊗ B C ⊗ B C ⊗ ( ⊗ ∞ k =1 B C ).The following statement is an immediate consequence of Proposition 3.1. Corollary 3.1.

Let f ∈ F , then the sequence of experiments E ( T ) ( f ) converges tothe limit experiment E ( f ) as T → ∞ . Although the log-likelihood ratios L ( b, c, η ) formally describe the limiting exper-iment, it is more insightful to provide, what we call, a structural representation.This structural representation provides a ﬁxed-horizon continuous-time model forwhich the likelihoods are exactly equal to exp ( L ( b, c, η )). From a statistical point ofview, the induced experiments are thus equal. The result follows from an immediateapplication of Girsanov’s theorem to the Radon-Nikodym derivates (24). Its proofis therefore omitted. Theorem 3.1.

Fix f ∈ F . Let, under P , , , Z ε , and Z h be zero-drift Brownianmotions with covariance according to the ﬁrst and last row and column of (17).The limit experiment E ( f ) can be described as: observe { ( W ε ( s ) , W h ( s )) : s ∈ [0 , } generated by d W ε ( s ) = cW ε ( s )d s + d Z ε ( s ) , (27)d W h ( s ) = ( bJ f y h + cJ f x h ) W ε ( s )d s + η d s + d Z h ( s ) . (28)A few remarks can be made in relation to Theorem 3.1. First, note that for b = c = 0 and η = 0, we obtain W ε = Z ε and W h = Z h . Secondly, the theoremessentially states that while ( W ε , W ′ h ) ′ is a zero-drift Brownian motion under P , , ,it becomes an Ornstein-Uhlenbeck process under P b,c,η , where the log-likelihoodratio log (d P b,c,η / d P , , ) equals L ( b, c, η ). Observe in particular that local pertur-bations of the innovation density f , as described by η , only aﬀect the drift in (28).We will consider inference procedures that are invariant with respect to η in the imit experiment. In terms of the (sequence of) predictive regression model(s) thisconsequently translates into invariance with respect to (local perturbations in) theinnovation density f .In view of (18)–(19), we may also writed W ℓ fy ( s ) = ( bJ f yy + cJ f yx ) W ε ( s )d s + J ′ f y h η d s + d Z ℓ fy ( s ) , (29)d W ℓ fx ( s ) = ( bJ f yx + cJ f xx ) W ε ( s )d s + J ′ f x h η d s + d Z ℓ fx ( s ) , (30)where Z ℓ fx and Z ℓ fy are zero-drift Brownian motions under P , , . However, theseequations do not contain any additional information, precisely given (18) and (19).Nevertheless, they will turn out useful when describing the likelihood ratio of themaximal invariant M to be introduced below in (33). In the limit experiment E ( f ), the parameter b ∈ R is the parameter of interest, while c ∈ R and η ∈ c are nuisance parameters. Observe that the nuisance parameter η appears only in the drift of the SDEs in Theorem 3.1. This suggests an invariancerestriction in line with the approach in Zhou et al. (2019) for unit root testing.To be speciﬁc, we ﬁrst introduce, for η ∈ c , the transformations g η : C N [0 , → C N [0 ,

1] by [ g η ( W )]( s ) = W ( s ) − ηs, (31)for W ∈ C N [0 ,

1] and all s ∈ [0 , g η adds a drift s

7→ − ηs to W . Thus, Theorem 3.1 implies that the law of ( W ε , ( g η ( W h )) ′ ) ′ under P b,c, is the same as the law of ( W ε , W ′ h ) ′ under P b,c,η . Denote by G η the group oftransformations g η for η ∈ c . We can now characterize the maximal invariantwith respect to G η in the limit experiment E ( f ).For any process W , we deﬁne the associated bridge process by B W ( s ) := W ( s ) − sW (1) , (32)for all s ∈ [0 , B g η ( W ) ( s ) = [ g η ( W )]( s ) − s [ g η ( W )](1)= W ( s ) − ηs − s ( W (1) − η )= W ( s ) − sW (1)= B W ( s ) . By (19) and (18), the same holds for W ℓ fx and W ℓ fy . s a result, the bridges B W h are invariant under the transformations g η .Deﬁne the mapping M by M ( W ε , W h ) := ( W ε , B W h ). It then follows thatstatistics that are measurable with respect to the σ -ﬁeld M = σ ( M ( W ε , W h )) = σ (cid:0) W ε , B W h (cid:1) , (33)are invariant with respect to g η for all η ∈ c . Moreover, in the following theorem,we show M to be maximally invariant. Its proof is, again, provided in Appendix B. Theorem 3.2.

In the limit experiment E ( f ) , for η ∈ c , the σ -ﬁeld M in (33) ismaximally invariant with respect to G η . Theorem 3.2 implies that any inference invariant with respect to G η must be mea-surable with respect to M ; see, e.g., Lehmann and Romano (2006, Theorem 6.2.1).Therefore, by the Neyman-Pearson lemma, inference based on the likelihood ra-tio with respect to M yields the power envelope for invariant tests in the limitexperiment E ( f ). The following result provides this likelihood ratio. Theorem 3.3.

Fix f ∈ F . Then the likelihood ratios in the limit experiment E ( f ) restricted to the maximal invariant M are given by exp L M ( b, c ) := d P M b,c d P M , = E (cid:20) d P b,c,η d P , , |M (cid:21) = exp (cid:18) ∆ M ( b, c ) − Q M ( b, c ) (cid:19) , (34) where ∆ M ( b, c ) = Z W ε ( s ) (cid:0) bJ f y h + cJ f x h (cid:1) ′ d B W h ( s ) + c Z W ε ( s )d W ε ( s ) (35)= b Z W ε ( s )d B ℓ fy ( s ) + c (cid:18)Z W ε ( s )d B ℓ fx ( s ) + W ε (1) W ε (cid:19) , Q M ( b, c ) = (cid:0) bJ f y h + cJ f x h (cid:1) Z (cid:0) W ε ( s ) − W ε (cid:1) d s + c Z W ε ( s ) d s (36)= (cid:0) b J f yy + c ( J f xx −

1) + 2 bcJ f yx (cid:1) (cid:16) W ε − ( W ε ) (cid:17) + c (cid:0) W ε (cid:1) , with W ε = R W ε ( s ) d s and W ε = R W ε ( s )d s . The proof is provided in Appendix B. The ﬁrst ways to write ∆ M ( b, c ) and Q M ( b, c ) make explicit that the likelihood factorizes in a conditional likelihoodgiven W ε and the marginal likelihood of W ε . Both second ways to write ∆ M ( b, c )and Q M ( b, c ) follow from (18)–(19) and (20)–(22). Those are the versions that weuse below to construct our feasible test statistics. Theorem 3.3 also immediately ields the semiparametric power envelope, still for ﬁxed c , that we do not presentin detail for brevity.The restriction to invariant tests removes the nuisance parameter η from thetesting problem. Indeed, the likelihood ratio (34) no longer depends on η . Therefore,we can formally deﬁne the limit experiment restricted to the maximal invariance M as E M ( f ) := (cid:16) Ω , M , n P M b,c : b, c ∈ R o(cid:17) . (37)Again, the likelihood ratios d P M b,c / d P M , can also be interpreted as Girsanov trans-formations. We state this as a corollary as the result follows immediately fromcalculating the bridges corresponding to W ℓ fy and W ℓ fx in Theorem 3.3. Corollary 3.2.

Fix f ∈ F . Let, under P M , , Z ε and Z h be zero-drift Brownianmotions with covariance according to the ﬁrst and last row and column of (17). Thelimit experiment E M ( f ) can be described as follows: we observe, with B W h ( s ) = W h ( s ) − sW h (1) , (cid:8)(cid:0) W ε ( s ) , B W h ( s ) (cid:1) : s ∈ [0 , (cid:9) with ( W ε , W h ) generated by d W ε ( s ) = cW ε ( s )d s + d Z ε ( s ) , (38)d W h ( s ) = ( bJ f y h + cJ f x h ) W ε ( s )d s + d Z h ( s ) . (39)The diﬀerence between Corollary 3.2 and Theorem 3.1 is twofold. First, besidesthe process W ε , the observation in the invariant limit experiment in Corollary 3.2 isonly the Brownian bridge B W h and not the complete Brownian motion W h . Second,as a consequence of this, the nuisance parameter η disappeared from (39).Corollary 3.2 does not provide, as far as we know, a further invariance structurethat can be used to eliminate the nuisance parameter c . As a result, we rely, inSection 4, on the so-called Approximate Least Favorable Distribution method todeal with this last nuisance parameter.We conclude this section by again considering the special case of a Gaussianinnovation density f . This also shows where exactly our power gains, under seriallyindependent innovations, come from relative to the Gaussian procedures in, forinstance, Jansson and Moreira (2006). Remark . One may expectthe semiparametric power envelope to be formally attainable by a likelihood-ratiotest constructed using a nonparametric estimate of the score function ℓ f . Intuitively,the argument is as follows. Rewrite R W ε ( s )d B ℓ fy ( s ) = R (cid:0) W ε ( s ) − W ε (cid:1) d W ℓ fy ( s ).Hence, even though there is a bias a (at rate √ T ) in the estimated score function,this bias will be canceled out automatically since R (cid:0) W ε ( s ) − W ε (cid:1) d (cid:16) as + W ℓ fy ( s ) (cid:17) = R (cid:0) W ε ( s ) − W ε (cid:1) d W ℓ fy ( s ). The same argument applies to the term R W ε ( s )d B ℓ fx ( s ). ompare the discussion in Jansson (2008, Section 6) for the unit root testing prob-lem and Zhou (2020, Section 2) for general LAN, LAMN, and LABF experiments. Remark f ) . In the situation of Gaussian f , Remark 2.1 and Re-mark 2.2 imply that B ℓ fy and B ℓ fx are linear combinations of B ε and B ⊥ (theBrownian bridges generated by W ε and W ⊥ , respectively). As a result, the opti-mal invariant procedures are measurable with respect to W ε and B ⊥ . Using thesame conditional expectation calculation, the associated log-likelihood ratio of theGaussian σ -ﬁeld, M Gaussian = σ ( W ε , B ⊥ ), leads to the Gaussian log-likelihood ratioin Jansson and Moreira (2006, Lemma 3). As B ⊥ is spanned by B W h , the σ -ﬁeld M Gaussian is also invariant w.r.t η (or f ), but it is not maximally invariant. Asa consequence, under non-Gaussianity, this leads to an eﬃciency loss in statisticalinference.Note that all existing tests in the literature are (essentially) based on the Gaus-sian likelihood of the generally non-maximally invariant M Gaussian , e.g., Jansson and Moreira(2006) and Elliott et al. (2015). Therefore, these tests belong to the class of asymp-totically invariant tests. This invariance imposed in the limiting experiment is asso-ciated to invariance w.r.t. the innovation density f in the sequence as η represents lo-cal perturbations precisely of f . Indeed, we have the convergence W ( T ) ε ( s ) ⇒ W ε ( s )and the one associated to W ⊥ for all f ∈ F , hence, η will not enter the associatedequation (27) in the limiting experiment. See M¨uller (2011) for a more comprehen-sive analysis of this convergence. The elimination of the nuisance parameter η is performed in the limit experiment E ( f ) and leads to E M ( f ). We now show how this elimination can be mimicked inthe actual predictive regression model of interest, i.e., in E ( T ) ( f ). It is reasonableto expect that exploiting the asymptotic invariance structures also works “well” forthe sequence of experiments. The claim will be substantiated by the simulationresults in Section 5.In line with the vast literature on rank-based inference, the appearance of theBrownian Bridges B ℓ fx and B ℓ fy in Corollary 3.2, naturally suggest to use statisticsthat are based on ranks of the innovations ε yt and ε xt in the predictive regressionmodel. Indeed, we will follow that route. However, in the present situation we dealwith bivariate innovations ( ε yt , ε xt ) which complicates the analysis considerably rela-tive to models with univariate innovations that are mostly studied in the literature.As the true innovation density f is unknown, we actually base our test statistic n an assumed (so-called reference ) density g that also satisﬁes Assumption 1. Let g y and g x denote the marginal densities for the ﬁrst, respectively, second componentof g . The bivariate nature of the innovations ( ε yt , ε xt ) implies that we cannot dealwith a completely general reference bivariate density g . Thus, we choose marginalreference densities g y and g x , and a reference correlation parameter ρ g . For themarginal reference densities, we impose the standard condition in the rank-basedinference literature, see, e.g., Theorem 13.5 in Van der Vaart (2000). Assumption 2.

The marginal reference densities g i , i = { y, x } , are strictly posi-tive, absolutely continuous with derivative ˙ g i and J g i := R ( ˙ g i /g i ) g i < ∞ . More-over, we have lim T →∞ T T X t =1 (cid:18) − ˙ g i g i (cid:18) G − i (cid:18) tT + 1 (cid:19)(cid:19)(cid:19) = J g i , (40) where G − i is the inverse cumulative distribution function associated to g i . Moreover, given an additionally chosen reference correlation ρ g ∈ ( − , ℓ g ( ε y , ε x ) := (cid:0) ℓ g y ( ε y , ε x ) , ℓ g x ( ε y , ε x ) (cid:1) ′ (41)where ℓ g y ( ε y , ε x ) = − (cid:18) ˙ g y g y ( ε y ) − ρ g ˙ g x g x ( ε x ) (cid:19) . (1 − ρ g ) ,ℓ g x ( ε y , ε x ) = − (cid:18) ˙ g x g x ( ε x ) − ρ g ˙ g y g y ( ε y ) (cid:19) . (1 − ρ g ) . The linearity of the reference score functions ℓ g y and ℓ g x is key to the analysisthat follows. It implies that, when using component-wise ranks of the innovations( ε y , ε x ), the resulting rank-based processes converge to a bivariate Brownian bridge.Despite its seemingly restrictive nature, the linearity allows use to fully exploitthe invariance structures embedded in the predictive regression model of interest,leading to sizable power gains (see Section 5).Now, let R y,t denote the rank of y t (among y , . . . , y T ), while R x,t denotes therank of ∆ x t = x t − x t − (among ∆ x , . . . , ∆ x T ). Note that the pairs ( R y,t , R x,t )equal the (component-wise) ranks of ( ε yt , ε xt ) under β = 0 and γ = 0. We deﬁne thebivariate partial sum process of the rank-based scores by B ( T ) ℓ g ( s ) = (cid:16) B ( T ) ℓ gy ( s ) , B ( T ) ℓ gx ( s ) (cid:17) ′ := 1 √ T ⌊ sT ⌋ X t =1 ℓ g (cid:18) G − y (cid:18) R y,t T + 1 (cid:19) , G − x (cid:18) R x,t T + 1 (cid:19)(cid:19) , (42)for s ∈ [0 , B ( T ) ℓ g underP ( T )0 , ,η ; f . Its proof is again provided in Appendix B. roposition 3.2. Suppose ε t = ( ε yt , ε xt ) ′ are i.i.d. innovations with density f ∈ F .Let g y and g x be reference densities that satisfy Assumption 2 and ﬁx the referencecorrelation ρ g . Then, under P ( T )0 , ,η ; f , we have B ( T ) ℓ g ⇒ B ℓ g , (43) where B ℓ g is a bivariate Brownian bridge, i.e., B ℓ g ( s ) = W ℓ g ( s ) − sW ℓ g (1) , with W ℓ g a zero-drift Brownian motion. The covariance of W ℓ g with W ε and W ℓ f :=( W ℓ fy , W ℓ fx ) ′ is given by Var  W ε (1) W ℓ f (1) W ℓ g (1)  =  e ′ σ ′ εg e J f J fg σ εg J gf J g  , (44) where e = (0 , ′ , σ εg = ( σ εg y , σ εg x ) ′ = E f (cid:2) ε xt ℓ g (cid:0) G − y ( F y ( ε yt )) , G − x ( F x ( ε xt )) (cid:1)(cid:3) ,J fg = J ′ gf = E f h ℓ f ( ε yt , ε xt ) ℓ g (cid:0) G − y ( F y ( ε yt )) , G − x ( F x ( ε xt )) (cid:1) ′ i ,J g = E f h ℓ g (cid:0) G − y ( F y ( ε yt )) , G − x ( F x ( ε xt )) (cid:1) ℓ g (cid:0) G − y ( F y ( ε yt )) , G − x ( F x ( ε xt )) (cid:1) ′ i . The above result is classical for univariate rank statistics. In the present paper,we use component-wise bivariate ranks. One complication is that the matrix J g depends on f through its copula. This implies that, like J g , it will have to beestimated in applications; compare also to the discussion of Theorem 3.1 in Zhou(2020).We use the rank-based processes B ( T ) ℓ g to replace B ℓ f in the likelihood ratioin Theorem 3.3; see Section 4.2 for details. In line with Remark 3.1, one couldcontemplate to use reference densities ˆ f based on a non-parametric estimate of thetrue innovation density, but we leave a formal analysis for future work. As we willsee in Section 5, even for incorrectly chosen reference densities (that is, for g = f ),our procedure features power gains over existing Gaussian based procedures. Thesegains come from the assumption that the error term ε t is driven by some i.i.d.innovations, which may possibly be maintained in empirical work. It is importantto note that choosing a reference density g = f does not aﬀect the validity of ourtest. The test will be of the appropriate level irrespective of the reference densities g y and g x chosen (provided they satisfy Assumption 2). But, likelihood ratio testsbased on Theorem 3.3 still feature the nuisance parameter c . We deal with this inthe next section.For completeness, we also provide the equivalent to Corollary 3.2 when usingthe reference density g . orollary 3.3. Fix f ∈ F . Let b ∈ R , c ∈ ( −∞ , , and η ∈ c . Then, under P b,c,η , the behavior of W ε and B ℓ g follows d W ε ( s ) = cW ε ( s )d s + d Z ε ( s ) , (45)d W ℓ g ( s ) = J gf  bc  W ε ( s )d s + d Z ℓ g ( s ) , (46) where, under P , , , Z ℓ g is a bivariate Brownian motion with variance J g and co-variance with W ε equal to σ εg . γ by ALFD In the previous section, we have developed the semiparametric power envelope fortests on b that are invariant with respect to η , under the assumption that c isknown. We now address the question of testing the regression coeﬃcient β in case γ is treated as a nuisance parameter as well.As argued in the the discussion following Corollary 3.2, we conjecture that thenuisance parameter c cannot be dealt with using invariance arguments. Variousalternative methods to deal with nuisance parameters in testing problems have beenused in the literature. In relation to the predictive regression model at hand, wemention the Bonferroni method (Cavanagh et al. (1995) and Campbell and Yogo(2006)); tests based on a conditional unbiasedness condition (Jansson and Moreira(2006)); and tests based on a numerically calculated Approximate Least FavorableDistribution (ALFD) as more recently proposed in Elliott et al. (2015). All thesetechniques apply to the Gaussian likelihood ratio statistic in Remark 3.2.These approaches have diﬀerent advantages and disadvantages. Campbell and Yogo(2006) proposes a modiﬁed Bonferroni method to eliminate the nuisance parameter c , leading to a simple yet more powerful test than the Cavanagh et al. (1995) test.However, as pointed out by Phillips (2014), inference based on Bonferroni boundscan be severely undersized when the predictor is “far away” from being a unit rootprocess ( γ << c in the Gaussian likelihood ratio—and derives an optimaltest in the class of conditionally unbiased tests. Nevertheless, such a conditionalunbiasedness constraint narrows the considered class and rules out some more pow-erful tests. Consequently, as shown by the simulation results of Jansson and Moreira(2006), the associated test has relatively low power compared to the Campbell and Yogo c to optimize weighted average power over somecompact interval (of c ). Note that with respect to our parameter of interest b , weconsider point-optimal test and do not use weighted powers over a discretized spaceto avoid the induced computational complexities. On one hand, the ALFD yieldsan upper bound of the weighted average power for all valid tests. On the otherhand, integrating out the likelihood statistic w.r.t. the ALFD leads to a “nearlyoptimal” test whose power is close to the upper bound. Moreover, by switchingto standard asymptotic approximations in case γ appears to be far from unity, theassociated test can achieve better size and power performances uniformly for all c ∈ ( −∞ ,

0] (e.g., across the parameter space γ ∈ ( − , This leads totests that are of correct size for all relevant c and have good power performance.We conﬁrm these properties by simulations in Section 5. In Section 3 we used invariance arguments to reduce the predictive regression testingproblem towards log-likelihood ratio of the form (34) where b is the parameter ofinterest to be tested and c is a nuisance parameter. We brieﬂy outline, in thepresent section, how the Approximate Least Favorable Distribution approach inElliott et al. (2015) works in our setting.Rewrite the log-likelihood ratio of the maximal invariant M in Theorem 3.3 as L M ( b, c ) = bS + cS − (cid:0) ( b, c ) J f ( b, c ) ′ − c (cid:1) S − c S , We expect that other approaches based on likelihood ratios, e.g., the approaches ofCampbell and Yogo (2006) and Jansson and Moreira (2006), will apply here as well. This is be-cause (i) the semiparametric likelihood ratio L M ( b, c ) in Theorem 3.3 is as the general version of theGaussian likelihood ratio in Jansson and Moreira (2006, Lemma 3), thus when the true density isGaussian, the former reduces to the latter; (ii) its rank-based proxy in (54) has the same structure(exponential family); and (iii) the asymptotic behaviors of the associated rank-based processes areknown and consistently estimable. here S = Z W ε ( s )d B ℓ fy ( s ) , S = Z W ε ( s )d B ℓ fx ( s ) + W ε (1) W ε , (47) S = W ε − (cid:0) W ε (cid:1) and S = W ε . One can thus consider the four-dimensional suﬃcient statistic S := ( S , S , S , S ).For notational simplicity, in the present section, we denote by F b,c ( S ) the distribu-tion of S under P b,c . The hypothesis of interest isH : b = 0 , c ∈ ( −∞ ,

0] versus H : b > , c ∈ ( −∞ , . (48)Note that, thus, both the null and the alternative hypothesis are composite. Weﬁrst discuss elimination of the nuisance parameter c under the alternative and,subsequently, its elimination under the null.To eliminate the nuisance parameter c under the alternative, a standard ap-proach is to consider a so-called weighted average power (see, e.g., Andrews and Ploberger(1994)) WAP( ϕ ) = Z c (cid:18)Z S ϕ ( S )d F b,c ( S ) (cid:19) dΛ ( c ) , (49)where ϕ is some test function for the problem above and Λ is a probability weightingmeasure for c ∈ ( −∞ , can be chosen by the researcherand reﬂects the weights that she assigns to various values of c under the alternative.Due to Fubini’s Theorem, we haveWAP( ϕ ) = Z S ϕ ( S )d Z c F b,c ( S )dΛ ( c ) , (50)which leads to the simple alternative hypothesis H , under which the distributionof S is given by the mixture F b ;Λ ( S ) = R F b,c ( S )dΛ ( c ). In this way, the testingproblem is reduced to testing H against H .Subsequently, in order to eliminate the nuisance parameter c under the null weproceed as follows. Again we impose a probability weighting measure Λ for c andintroduce the simple null hypothesis, denoted H , under which the distributionof S is given by F b ;Λ ( S ) = R F b,c ( S )dΛ ( c ). Now we deﬁne the test ϕ ¯ b ;Λ by ϕ ¯ b, Λ ( S ) =  F ¯ b, Λ ( S ) > κ d F , Λ ( S ) , F ¯ b, Λ ( S ) ≤ κ d F , Λ ( S ) , (51)where the critical value κ is chosen to obtain the desired size. By the Neyman-Pearson Lemma, ϕ ¯ b, Λ is point optimal at b = ¯ b , for the problem of testing the nullH against the alternative H .The problem of choosing Λ is, unfortunately, more complicated than that ofchoosing Λ . The reason is that we want to control the rejection probability of he test, not only under H , but for all values of c ∈ ( −∞ , α test under H is of correct size for theentire null hypothesis H . However, for some speciﬁc choices of Λ this statementis true, and such a distribution is called a least-favorable distribution ; see, e.g.,Lehmann and Romano (2006), Theorem 3.8.1. Formally, a distribution Λ ∗ is called least favorable if the most powerful level- α test (51) for testing H ∗ against H is of the desired size for the (entire) null hypothesis H . Moreover, once more byTheorem 3.8.1 in Lehmann and Romano (2006), the test ϕ ¯ b, Λ ∗ is also point optimal(at b = ¯ b ) for this problem. A least-favorable distribution Λ ∗ exists in most of theusual statistical problems. conditions that ensure this and associated references canbe found in Section 3.8 of Lehmann and Romano (2006).As, in most cases, the least-favorable distribution Λ ∗ is not easily obtained,Elliott et al. (2015) propose a numerical method to ﬁnd, what they call, an “Ap-proximate Least Favorable Distribution” (ALFD). The ALFD is deﬁned as follows. Deﬁnition 1. An ǫ -ALFD is a probability distribution Λ ∗ ǫ over ( −∞ , satisfying(i) the Neyman-Pearson test (51) with Λ = Λ ∗ ǫ and critical value κ = κ ∗ , i.e., ϕ ¯ b, Λ ∗ ǫ , is of size α under H ∗ ǫ and has power ¯ π against H ;(ii) there exists κ ∗ ǫ such that the test (51) with Λ = Λ ∗ ǫ and κ = κ ∗ ǫ , ϕ ǫ ¯ b, Λ ∗ ǫ , is oflevel α under H , and has power of at least ¯ π − ǫ against H . The test ϕ ǫ ¯ b, Λ ∗ ǫ (in particular, the ALFD Λ ∗ ǫ and the critical value κ ∗ ǫ ) is ex-actly what we are looking for, once we have set the weights Λ of interest for thealternative hypothesis. Besides the size control under H , the deﬁnition above alsoensures that the test ϕ ǫ ¯ b, Λ ∗ ǫ enjoys a near-optimality property with a relatively smallpower loss (less than ǫ ).Note that even for a given (small) value of ǫ , the ALFD Λ ∗ ǫ is not necessarily“close” to the least favorable distribution Λ ∗ . Actually, (possibly inﬁnitely) manypairs of (Λ ∗ ǫ , κ ∗ ǫ ) may satisfy Deﬁnition 1. The details about how to implement thenumerical algorithm to determine a pair of (Λ ∗ ǫ , κ ∗ ǫ ) (henceforth the test ϕ ¯ b, Λ ∗ ǫ )for a small ǫ can be found in Section 3 and Appendix A of Elliott et al. (2015). Asthe nuisance parameter space c ∈ ( −∞ ,

0] is unbounded, we also need to “switch”back to standard test statistics (i.e., in the stationary case) for large values of | c | .We provide in Appendix C the details about our test for the standard part of thelimit experiment E ( f ). .2 Putting it all together Putting everything together, our test for the predictive regression model is based onapplying the ALFD approach to the rank-based counterpart (using Proposition 3.2)of the asymptotically point-optimal invariant derived in Theorem 3.3.We thus replace, in the suﬃcient statistic S = ( S , S , S , S ) in (47), W ε , B ℓ fy ,and B ℓ fx by W ( T ) ε , B ( T ) ℓ gy , and B ( T ) ℓ gx , leading to the feasible rank-based statistic S ( T ) g := (cid:16) S ( T ) g, , S ( T ) g, , S ( T ) g, , S ( T ) g, (cid:17) , (52)where S ( T ) g, = Z W ( T ) ε ( s )d B ( T ) ℓ gy ( s ) ,S ( T ) g, = Z W ( T ) ε ( s )d B ( T ) ℓ gx ( s ) + W ( T ) ε (1) Z W ( T ) ε ( s )d s,S ( T ) g, = Z W ( T ) ε ( s ) d s − (cid:18)Z W ( T ) ε ( s )d s (cid:19) ,S ( T ) g, = Z W ( T ) ε ( s ) d s. To make the log-likelihood ratio L M in (34) fully feasible, we also have to deal with J f . From Kagan and Landsman (1999) we know that J f is diagonalized by theCholesky root of the correlation matrix R g . Therefore, we replace J f by J p =  J p yy J p yx J p yx J p xx  := R − g ′ diag { J g y , J g x } R − g , (53)where J g y and J g x are the Fisher information of the chosen marginal reference den-sities deﬁned in Assumption 2, and R g is the correlation matrix based on the chosenreference correlation ρ g , i.e., R g := (cid:16) ρ g ρ g (cid:17) . We recommend to use a consistentestimate of ρ as ρ g regarding the power of the test, although any choice of ρ g wouldlead to correct sizes. This leads to our feasible rank-based log-likelihood statistic L ( T ) g ( b, c ) := bS ( T ) g, + cS ( T ) g, − (cid:0) ( b, c ) J p ( b, c ) ′ − c (cid:1) S ( T ) g, − c S ( T ) g, , (54)of which the limit is given by the proposition below. Proposition 4.1.

Suppose ε t = ( ε yt , ε xt ) ′ are i.i.d. innovations with density f ∈ F .Let g y and g x be reference densities that satisfy Assumption 2 and ﬁx the referencecorrelation ρ g . Then, for b ∈ R and c ∈ ( −∞ , , under P ( T )0 , ,η ; f , we have L ( T ) g ( b, c ) ⇒ L g ( b, c ) , where L g ( b, c ) := bS g, + cS g, − (cid:0) ( b, c ) J p ( b, c ) ′ − c (cid:1) S g, − c S g, (55) ith S g, = Z W ε ( s )d B ℓ gy ( s ) , S g, = Z W ε ( s )d B ℓ gx ( s ) + W ε (1) W ε ,S g, = W ε − (cid:0) W ε (cid:1) and S g, = W ε . We omit the proof of Proposition 4.1 since it directly follows from the weakconvergences in (16) and (43), the continuous mapping theorem, and the rank-basedstochastic integral convergence argument in the proof of Lemma 4.1 of Zhou et al.(2019).Although not explicit in the above, observe that the statistic L ( T ) g ( b, c ) in (54)still depends on σ x through W ( T ) ε deﬁned in (12). We will simply replace σ x by itssample counterpart below. As long as this estimator is consistent, the continuousmapping theorem shows that this replacement has no asymptotic consequences. Thestatistic does not depend on σ y , but it does depends on the reference correlation ρ g .Now, applying the ALFD algorithm to L ( T ) g ( b, c ), we obtain a distribution Λ ∗ ǫ ,g and critical value κ g,n such that the test ϕ g,n ( S ( T ) g , ρ g ) =  R L ( T ) g (¯ b, c )dΛ ( c ) > κ g,n R L ( T ) g (0 , c )dΛ ∗ ǫ ,g ( c )0 if R L ( T ) g (¯ b, c )dΛ ( c ) < κ g,n R L ( T ) g (0 , c )dΛ ∗ ǫ ,g ( c ) (56)is of size α . Here ¯ b serves as a ﬁxed alternative point for the quasi-likelihood statistic;see Elliott et al. (1992).In order to get the appropriate critical values of the test, note that we needconsistent estimates, under the null, of J g and J fg . We need these in order toensures the feasibility of the numerically determined pair (Λ ∗ ǫ , κ ∗ ǫ ). In applications J g and J fg can easily be estimated, however, in the Monte Carlo study below weestimate J g and J fg based on the known. This is necessary as we cannot aﬀord todetermine a pair (Λ ∗ ǫ , κ ∗ ǫ ) for each repetition in the simulation. That would be toointensive computationally. In this section, we explore by Monte Carlo the size and power properties of ourtest (56), combined with the switching approach detailed in Appendix C, (labeledWZ) relative to the Gaussian quasi-likelihood counterpart in Elliott et al. (2015)(labeled EMW). From the theoretical results, both tests should enjoy good size A simple consistent estimator for J g would be the sample covariance of the rank-based scores ℓ g deﬁned in (41) and a direct rank-based estimator for J fg can be found in Cassart et al. (2010). roperties but the WZ test should exhibit larger power in case the true innovationdistribution is not Gaussian. Under Gaussian innovation distribution, both testsshould have similar power.Section 5.1 provides simulations under the predictive regression model studiedformally in this paper. Section 5.2 provides results of our test under conditionalheteroskedasticity. Finally, Section 5.3 provides results when the reference densityused in the test is estimated. We simulate the model (1)–(2) with µ = 2, σ y = 3, σ x = 3, and ρ = − .

5. Allresults reported in this section are based on 10,000 replications.For the ALFD approach, we choose a discrete weighting distribution Λ in (49)where each of the 57 points c ∈ { , − . , − . , . . . , − } of the support have equal weight. The same 57 points are also as the support ofΛ ∗ ǫ . For the test statistic in (56), we choose a ﬁxed alternative ¯ b = B (1 . ρ g , we use the simple samplecorrelation of ˆ ǫ yt and ˆ ǫ xt under the null, where ˆ ǫ yt = y t − P Tt =1 y t and ˆ ǫ xt is the residualof the regression of x t on x t − .We present the power curves in two ways. The ﬁrst presentation follows Elliott et al.(2015): We let the local nuisance parameter c (which governs the persistence of thepredictor) take 21 values c ∈ { , − , − , . . . , − } . And to have roughly similarpower for each value of c , we transform the parameter b by b = B ( δ ) = δ r − c + 61 − ρ , for c < . (57)Alternatives for β are now characterized by diﬀerent values of δ . The null hypothesisH corresponds to δ = 0, and we let the parameter of interest b take three alterna-tives: δ ∈ { , , } . Secondly, we present power curves where we ﬁx the nuisanceparameter c = −

25 and plot the rejection rates for δ ∈ [0 , α is chosen to be 5% in all cases.In Figure 1, we reports the large-sample ( T = 2 , f and the marginal reference densities g y and g x . The upper-left subplotreports the case where f is a multivariate t density, while g y and g x are bothunivariate t densities. Both the EMW test and the WZ test are of correct sizefor all chosen values of c . Under the alternative hypothesis (i.e., for δ ∈ { , , } ),

50 100 150 200 -c r e j e c t i on r a t e s f = Multi-t , g y = g x = t -c r e j e c t i on r a t e s f = Multi-t , g y = g x = Gaussian -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = t -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = Gaussian Figure 1: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively. For all the four cases, the correlation is − .

5. The sample size is2,000. the WZ test is more powerful than the EMW test. Taking the alternative δ = 2 asexample, for most values of c , the power of the EMW test is about 65% while theWZ test attains about 90% power. In the upper-right subplot, we keep f unchangedand let g y and g x both be Gaussian. Both tests provide correct size and, again, theWZ test is more powerful than the EMW test. However, compared to the upper-left subplot, we observe that the WZ test suﬀers a small power loss when choosingreference densities that are further away from the true ones. When f is Gaussian,the WZ test with Gaussian marginal reference densities shares almost the same sizeand power performances as the EMW test, as shown by the bottom-left subplot.The bottom-right subplot presents the case when f is Gaussian, while the marginalreference densities g y and g x are univariate t . In this case, the WZ test is less

50 100 150 200 -c r e j e c t i on r a t e s f = Multi-t , g y = g x = t -c r e j e c t i on r a t e s f = Multi-t , g y = g x = Gaussian -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = t -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = Gaussian Figure 2: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively. For all the four cases, the correlation is − .

5. The sample size is200. powerful than the EMW test. In practice, we may want to avoid this power loss bypre-testing the residuals under the null hypothesis. We study this in Section 5.3.Actually, one can always use Gaussian reference densities as a conservativechoice, which is based on a (numerical) Chernoﬀ and Savage (1958) result — keep-ing the marginal reference densities g y and g x Gaussian, the WZ test is alwaysmore powerful than the EMW test when f is non-Gaussian, and it works as wellas the EMW test when f is Gaussian. A formal proof of this result in LABF-typeexperiments is still an open question, but we show that this property holds in somemore simulations. In Figure 5, we ﬁx g y and g x to be Gaussian, and choose fourdiﬀerent multivariate innovation distributions: (i) Gaussian copula with Laplacemarginal distributions (top-left, labeled Multi-Laplace); (ii) Multivariate Pearson r e j e c t i on r a t e s f = Multi-t , g y = g x = t r e j e c t i on r a t e s f = Multi-t , g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = t r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = Gaussian Figure 3: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for ﬁxed value of c = −

25 and diﬀerent values of δ ∈ [0 , − .

5. The sample size is 2,000. distribution with skewness 3 and kurtosis 36 (top-right, labeled Multi-Pearson);(iii) Gaussian copula with t distribution for the ﬁrst dimension and Gaussian dis-tribution for the second dimension (bottom-left, labeled Multi-combo1); and (iv) t copula with Gaussian for the ﬁrst dimension and t for the second dimension(bottom-right, labeled Multi-combo2). These simulations support the Chernoﬀ-Savage result and also show that the further away the true distribution is fromGaussian, the more power can be gained by the WZ test. Moreover, case (iv) inthe bottom-right subplot shows that actually the power we gain by the WZ testis from the innovation of the ﬁrst dimension, ε yt . When the distribution of ε yt isGaussian, we do as well as the EMW test. We conjecture that inference for β in thepredictive regression model (1)-(2) is adaptive with respect to the marginal densityof ε xt , when γ is eliminated by the ALFD approach in Elliott et al. (2015). r e j e c t i on r a t e s f = Multi-t , g y = g x = t r e j e c t i on r a t e s f = Multi-t , g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = t r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = Gaussian Figure 4: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for ﬁxed value of c = −

25 and diﬀerent values of δ ∈ [0 , − .

5. The sample size is 2,000.

In Figure 3 and Figure 4, we present the powers of the WZ test and the EMWtest (for ﬁxed c = −

15 and for δ ∈ [0 , T = 200) results for both tests in Figure 2and Figure 6 (the small-sample counterparts of Figure 1 and Figure 5, respectively).The conclusions are similar: both tests are of good size (all around 4 . ∗ ǫ and κ g . The WZ test still gains considerable powerin the case of non-Gaussian densities, though the gain is slightly smaller than inthe large-sample case. This once more shows the additional information present,when supported by the application at hand, of an i.i.d.-ness assumption on theinnovations. Appendix D provides additional simulation results in Figure 11 and

15 and δ ∈ [0 , ρ = − . f , the Chernoﬀ-Savage result, and decent small-sample performances. In many (ﬁnancial) applications the maintained assumption of i.i.d. innovations willnot be satisﬁed. We therefore study, by simulation, the behavior of the tests whenthe innovations exhibit conditional heteroskedasticity. The tests are identical tothose in the previous sections, thus not adapted to deal with possible heteroskedas-ticity.

50 100 150 200 -c r e j e c t i on r a t e s f = Multi-laplace, g y = g x = Gaussian -c r e j e c t i on r a t e s f = Multi-pearson, g y = g x = Gaussian -c r e j e c t i on r a t e s f = Multi-combo1, g y = g x = Gaussian -c r e j e c t i on r a t e s f = Multi-combo2, g y = g x = Gaussian Figure 6: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively. For all the four cases, ρ = − . T = 200. Keeping everything else unchanged, we replace the i.i.d. innovations ( ε yt , ε xt ) ′ , bya univariate GARCH(1,1) model (i) for ε yt only; or (ii) for both ε yt and ε xt in thedata generating process. Formally, we choose ε yt = p − ρ ε ,t + ρε ,t ,ε xt = ε ,t , where, for case (ii), ε ,t and ε ,t are independently generated by the GARCH(1,1)model ε j,t = ν j,t p h j,t ,h j,t = 1 + 0 . ε j,t − + 0 . h j,t − , for j = 1 ,

2, where ν j,t ’s are i.i.d. innovations. For case (i), we let ε ,t be i.i.d. and ndependent of ε ,t . The joint density of ν ,t and ν ,t is denoted by f . The GARCHparameters are chosen based on common empirical ﬁndings. -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.1 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.5 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.9 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.1 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.5 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.9 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.1 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.5 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.9 Figure 7: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively, under heteroskedasticity . For all the four cases, T = 2 , In Figure 7, we present case (i) where only the innovations of the responsevariable, ε y , exhibit conditional heteroskedasticity, while the predictor innovations

50 100 150 200 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.1 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.5 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.9 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.1 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.5 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.9 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.1 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.5 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.9 Figure 8: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively, under heteroskedasticity . For all the four cases, T = 2 , ε x are still i.i.d. We show results for three density combinations as mentioned inthe title of each subplot and three diﬀerent values for the correlation of innovations( ρ = − . − .

5, and − . ε y will not ﬀect their size performances much. In terms of power, the WZ test outperformsthe EMW test under the t distribution, and both tests have similar powers underGaussianity. In addition, when ε y is exhibits more heteroskedasticity (i.e., when ρ is close to 0), the WZ test gains more power as heteroskedasticity pushes theunconditional innovation distribution further away from Gaussianity.Figure 8 presents the results for case (ii) where both innovations, ε y and ε x ,are heteroskedastic and correlated as modeled above. When ρ is close to zero, thesize distortion becomes smaller, while for larger (absolute) values of ρ , both testsbecome more oversized, especially under heavy-tailed innovation distribution. TheWZ test suﬀers less size distortion than the EMW test under t distributions and,using t reference marginal densities, the size distortion bcomes even smaller (seethe bottom panel).The small-sample counterparts of Figure 7 and Figure 8 with T = 200 are pro-vided in Appendix D. We draw conclusions similar to the i.i.d. case in Section 5.1.Additionally, we ﬁnd that, when both ε y and ε x are heteroskedastic and their cor-relation is close to −

1, both the EMW and WZ tests are less over-sized in thesmall-sample case.These conclusions above also apply to other GARCH settings with diﬀerent valuechosen for parameters. These simulation results are available upon request.

In this section, we provide simulation results for the WZ test based on nonpara-metrically estimated reference densities, i.e., g y = ˆ f y and g x = ˆ f x , under the i.i.d.setting as in Section 5.1.In Figure 9, we compare the WZ test with g y = ˆ f y and g x = ˆ f x (dotted lines)with the EMW test (dashed lines) and with as the WZ test using correctly speciﬁedreference marginal densities (solid lines). When both the true and the referencedensities are Gaussian (right plot), we see that all three tests perform similarlywith decent size and power properties. When the true innovation distribution isStudent- t , all three tests control the sizes well, while in terms of power, bothWZ tests outperform the Gaussian-based EMW test. The WZ test with estimatedreference densities suﬀers a small eﬃciency loss due to the nonparametric estimation.Figure 10 provides the small-sample results under the same setting but withsample size T = 200. In general, the smaller sample leads to lower size and powerfor the WZ test with estimated reference densities relative to the large-sample case.But again, when f is heavy-tailed, it can be more powerful than the EMW test.

50 100 150 200 -c r e j e c t i on r a t e s f = t3; = -0.5 -c r e j e c t i on r a t e s f = G; = -0.5 Figure 9: Rejection rates of the WZ test (solid lines), the EMW test with Student- t marginal reference densities (dashed lines), and the EMW test with nonparametricallyestimated density ˆ f (dotted lines) for diﬀerent values of δ = 0, 1, 2, and 3, correspondingto lines in blue, green, brown, and red, respectively. For all the four cases, T = 2 , -c r e j e c t i on r a t e s f = t3; = -0.5 -c r e j e c t i on r a t e s f = G; = -0.5 Figure 10: Rejection rates of the WZ test (solid lines), the EMW test with Student- t marginal reference densities (dashed lines), and the EMW test with nonparametricallyestimated density ˆ f (dotted lines) for diﬀerent values of δ = 0, 1, 2, and 3, correspondingto lines in blue, green, brown, and red, respectively. For all the four cases, T = 200. In this paper, we show that there is signiﬁcant statistical information, when sup-ported by the application at hand, in a maintained assumption of serially indepen- ent innovations in a predictive regression model. We exploit this information byderiving the (maximal) invariance structures in the associated limit experiment.Speciﬁcally, we ﬁrst derive the maximal invariant in the (structural) limit exper-iment where the predictor’s persistence parameter is assumed to be known. Thisleads to the semiparametric power envelope for test that are invariant with re-spect to the innovation density. The associated likelihood ratio thus gives the semi-parametric counterparts of the Gaussian suﬃcient statistics of Jansson and Moreira(2006). Under non-Gaussianity, larger powers are possible than under Gaussianity;a well-known result in many classical statistical models. To eliminate the predictor’spersistence nuisance parameter, we employ the ALFD approach recently proposedin Elliott et al. (2015).Our analysis naturally leads to statistics based on the bivariate component-wiseranks of the innovations in the model. Our statistics involve a choice of referencedensities that is, subject to some mild regularity conditions, largely arbitrary. Irre-spective of the choice of reference densities, our test are of correct asymptotic size.Under non-Gaussianity, even with incorrectly speciﬁed reference densities, our testhave better power properties than existing tests in the literature that are derivedunder the assumption of Gaussian innovation densities. These alternative tests donot need serially independent innovations and, as a result, we precisely quantify thepower improvements possible when such an assumption is supported by the data.Monte Carlo simulations corroborate our asymptotic results and illustrate that therank-based tests also work well in smaller samples. References

Andrews, D. W. and Ploberger, W. (1994), “Optimal tests when a nuisance pa-rameter is present only under the alternative,”

Econometrica: Journal of theEconometric Society , 1383–1414.Boswijk, H. P. et al. (2005), “Adaptive testing for a unit root with nonstationaryvolatility,”

UvA-Econometrics Discussion Paper , 7.Campbell, J. Y. and Yogo, M. (2006), “Eﬃcient tests of stock return predictability,”

Journal of ﬁnancial economics , 81, 27–60.Cassart, D., Hallin, M., Paindaveine, D., et al. (2010), “On the estimation of cross-information quantities in rank-based inference,” in

Nonparametrics and Robust-ness in Modern Statistical Inference and Time Series Analysis: A Festschrift n Honor of Professor Jana Jureˇckov´a , Institute of Mathematical Statistics, pp.35–45.Cavanagh, C. L., Elliott, G., and Stock, J. H. (1995), “Inference in models withnearly integrated regressors,” Econometric theory , 11, 1131–1147.Chan, N. H. and Wei, C. (1988), “Limiting distributions of least squares estimatesof unstable autoregressive processes,”

The Annals of Statistics , 367–401.Chernoﬀ, H. and Savage, I. R. (1958), “Asymptotic normality and eﬃciency ofcertain nonparametric test statistics,”

The Annals of Mathematical Statistics ,972–994.Elliott, G., M¨uller, U. K., and Watson, M. W. (2015), “Nearly optimal tests whena nuisance parameter is present under the null hypothesis,”

Econometrica , 83,771–811.Elliott, G., Rothenberg, T. J., and Stock, J. H. (1992), “Eﬃcient tests for an au-toregressive unit root,” .Elliott, G. and Stock, J. H. (1994), “Inference in time series regression when theorder of integration of a regressor is unknown,”

Econometric theory , 10, 672–700.Hallin, M., Van den Akker, R., and Werker, B. J. (2011), “A class of simpledistribution-free rank-based unit root tests,”

Journal of econometrics , 163, 200–214.Hallin, M., Van Den Akker, R., and Werker, B. J. (2015), “On quadratic expansionsof log-likelihoods and a general asymptotic linearity result,” in

MathematicalStatistics and Limit Theorems , Springer, pp. 147–165.Hansen, B. E. (1992), “Convergence to stochastic integrals for dependent heteroge-neous processes,”

Econometric Theory , 8, 489–500.Jacod, J. and Shiryaev, A. (2002),

Limit theorems for stochastic processes , vol. 288,Berlin: Springer.Jansson, M. (2008), “Semiparametric power envelopes for tests of the unit roothypothesis,”

Econometrica , 76, 1103–1142.Jansson, M. and Moreira, M. J. (2006), “Optimal inference in regression modelswith nearly integrated regressors,”

Econometrica , 74, 681–714. eganathan, P. (1995), “Some aspects of asymptotic theory with applications totime series models,” Econometric Theory , 11, 818–887.Kagan, A. and Landsman, Z. (1999), “Relation between the covariance and Fisherinformation matrices,”

Statistics & Probability Letters , 42, 7–13.Le Cam, L. M. (1986), “Asymptotic methods in statistical theory,” .Lehmann, E. L. and Romano, J. P. (2006),

Testing statistical hypotheses , SpringerScience & Business Media.Ling, S., McAleer, M., et al. (2003), “On adaptive estimation in nonstationaryARMA models with GARCH errors,”

The Annals of Statistics , 31, 642–674.Mayer-Wolf, E. et al. (1990), “The Cram´er-Rao functional and limiting laws,”

TheAnnals of Probability , 18, 840–850.Moreira, M. J. and Mour˜ao, R. (2016), “A critical value function approach, with anapplication to persistent time-series,” arXiv preprint arXiv:1606.03496 .M¨uller, U. K. (2011), “Eﬃcient tests under a weak convergence assumption,”

Econo-metrica , 79, 395–435.M¨uller, U. K. and Elliott, G. (2003), “Tests for unit roots and the initial condition,”

Econometrica , 71, 1269–1286.Phillips, P. C. (2014), “On conﬁdence intervals for autoregressive roots and predic-tive regression,”

Econometrica , 82, 1177–1195.Rudin, W. (1987), “Real and complex analysis,” .Van der Vaart, A. W. (2000),

Asymptotic statistics , vol. 3, Cambridge universitypress.Yang, G. L. and Le Cam, L. (2000), “Asymptotics in statistics: some basic con-cepts,”

Berlin, German: Springer .Zhou, B. (2020), “A General Semiparametric Approach for LAN, LAMN, and LABFExperiments,”

Working Paper .Zhou, B., van den Akker, R., and Werker, B. J. (2019), “Semiparametrically optimalhybrid rank tests for unit roots,”

The Annals of Statistics , 47, 2601–2638. Auxiliaries

The lemma below shows that the partial sum processes introduced in Section 2.2weakly converge to the associated Brownian motions. Due to the i.i.d.-ness of theinnovations, the lemma follows, e.g., from the functional central limit theorem VIII.3.33 in Jacod and Shiryaev (2002).

Lemma A.1.

Let f ∈ F and let, with m ≥ , k , . . . , k m − ∈ N . Deﬁne, with thenotation of Section 2.2, W ( T ) = (cid:0) W ( T ) ε , W ( T ) ℓ fy , W ( T ) ℓ fx , W ( T ) h , . . . , W ( T ) h m − (cid:1) ′ and W = (cid:0) W ε , W ℓ fy , W ℓ fx , W h , . . . , W h m − (cid:1) ′ . Then, in D R m [0 , under P ( T )0 , , , f , we have W ( T ) ⇒ W , (cid:10) W ( T ) , W ( T ) (cid:11) (1) = (cid:2) W ( T ) , W ( T ) (cid:3) (1) + o P (1) = Var (cid:0) W (1) (cid:1) + o P (1) . B Proofs

Proof of Proposition 3.1.Proof of Part (i):

Suppose y t and x t − , for t = 1 , , . . . , T , are generated from (1)–(2). Then, usingthe local parameter perturbations (5), the log-likelihood ratio equalslog dP ( T ) b,c,η ; f dP ( T )0 , , f = LLR ( T ) I ( b, c ) + LLR ( T ) II ( b, c, η ) , (58)whereLLR ( T ) I ( b, c ) := T X t =1 log f (cid:16) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:17) f ( y t , ∆ x t ) , LLR ( T ) II ( b, c, η ) := T X t =1 log √ T ∞ X k =1 η k h k (cid:18) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:19)! . We ﬁrst use Proposition 1 in Hallin et al. (2015) to proveLLR ( T ) I ( b, c ) = bT T X t =1 x t − σ x σ y ℓ f y ( y t , ∆ x t ) + cT T X t =1 x t − ℓ f x ( y t , ∆ x t ) (59) − (cid:0) b J f yy + c J f xx + 2 bcJ f yx (cid:1) T T X t =1 x t − σ x ! + o P (1) . ssumption 1 (a) implies that the density f is diﬀerentiable in quadratic mean, i.e., √ f ( e − w ) √ f ( e ) = 1 + 12 [ w ′ ℓ f ( e ) + r ( e, w )] , e, w ∈ R , (60)where E f r ( ε t , w ) = o ( w ) . (61)In the notation of Hallin et al. (2015), we have LR T t = f (cid:16) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:17) f ( y t , ∆ x t ) ,S T t = (cid:18) T x t − σ x σ y ℓ f y ( y t , ∆ x t ) , T x t − ℓ f x ( y t , ∆ x t ) (cid:19) ′ ,R T t = r ( ε t , w T t ) , where r is implicitly deﬁned in (60), w T t = (cid:16) − bT σ y σ x x t − , − cT x t − (cid:17) ′ , and h T = ( b, c ) ′ .Thus, (60) implies LR T t = (cid:18) h ′ T S T t + R T t ) (cid:19) . To complete the proof of Part (i), we show that condition ( a ), ( b ), ( c ), and ( d ) inProposition 1 of Hallin et al. (2015) are satisﬁed. Condition (a) . This is immediate since h T = ( b, c ) ′ is a constant vector. Condition (b) . Display (2), E ( T ) (cid:2) S T t (cid:12)(cid:12) F T,t − (cid:3) = 0 with F T,s − = σ ( ε yt , ε xt : t < s ),follows immediately from the independence of ε t and F T,t − , E f (cid:2) ℓ f y ( ε t ) (cid:3) = 0, andE f [ ℓ f x ( ε t )] = 0. The second equation in Display (3) is met as J T := T X t =1 E ( T ) [ S T t S ′ T t |F T,t − ]= T X t =1  T x t − σ x J f yy T x t − σ x J f yx T x t − σ x J f yx T x t − σ x J f xx  ⇒ J :=  J f yy R W ε ( s )d s J f yx R W ε ( s )d sJ f yx R W ε ( s )d s J f xx R W ε ( s )d s  , where the weak convergence follows from a combination of Lemma A.1, Theorem2.1 in Hansen (1992), and the continuous mapping theorem. Next we verify theconditional Lindeberg condition (the ﬁrst equation in Display (3)), which is, for all δ > T X t =1 E ( T ) h ( h ′ T S T t ) { | h ′ T S Tt | >δ } (cid:12)(cid:12) F T,t − i = o P (1) . bserve T X t =1 E ( T ) h ( h ′ T S T t ) { | h ′ T S Tt | >δ } (cid:12)(cid:12) F T,t − i = T X t =1 E ( T ) "(cid:18) bT x t − σ x σ y ℓ f y ( y t , ∆ x t ) + cT x t − ℓ f x ( y t , ∆ x t ) (cid:19) { ( h ′ T S Tt ) >δ } (cid:12)(cid:12) F T,t − ≤ T X t =1 E ( T ) "(cid:18) bT x t − σ x σ y ℓ f y ( y t , ∆ x t ) (cid:19) { bx t − σ y ℓ fy ( y t , ∆ x t )) >δ T σ x } (cid:12)(cid:12) F T,t − + 4 T X t =1 E ( T ) (cid:20)(cid:16) cT x t − ℓ f x ( y t , ∆ x t ) (cid:17) { cx t − ℓ fy ( y t , ∆ x t )) >δ T } (cid:12)(cid:12) F T,t − (cid:21) . To complete the proof, we just need to show separately, for any given δ > T X t =1 E ( T ) "(cid:18) bT x t − σ x σ y ℓ f y ( y t , ∆ x t ) (cid:19) { | bx t − σ y ℓ fy ( y t , ∆ x t ) | >δT σ x } (cid:12)(cid:12) F T,t − = o P (1) , T X t =1 E ( T ) "(cid:18) cT x t − ℓ f x ( y t , ∆ x t ) (cid:19) { | cx t − ℓ fy ( y t , ∆ x t ) | >δT } (cid:12)(cid:12) F T,t − = o P (1) . Using the notation ζ ( M ) = E f h(cid:0) bσ y ℓ f y ( y t , ∆ x t ) (cid:1) { | bσ y ℓ fy ( y t , ∆ x t ) | >δT } i , we see,for instance, that the left-hand-side of the second term of the previous display isbounded by ζ δ √ T k W ( T ) ε k ∞ ! Z (cid:16) W ( T ) ε ( u − ) (cid:17) d u = o P (1) , by a combination of Lemma A.1, the continuous mapping theorem, and ζ ( M ) → M → ∞ (dominated convergence). The same strategy works for the other term. Condition (c) . This condition consists two asymptotic negligibility properties(the Displays (4) and (5) in Hallin et al. (2015)) of the remainder terms R T t = r ( ε t , w T t ). Recall w T t = (cid:16) − bT σ y σ x x t − , − cT x t − (cid:17) ′ , by (61), we have T E f (cid:2) r ( ε t , w T t ) |F T,t − (cid:3) = o P (1) , which ensures the Display (4): P Tt =1 E ( T ) (cid:2) R T t |F T,t − (cid:3) = o P (1). Display (5), thatis T X t =1 (cid:16) − E ( T ) [ LR T t |F T,t − ] (cid:17) = o P (1) , is trivially met by plugging in LR T t = LLR ( T ) I ( b, c ) to the left-hand-side which giveszero due to the assumed non-negativity of f . Condition (d) . This condition is satisﬁed since x = 0, so thatlog LR T t = log f (cid:16) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:17) f ( y t , ∆ x t ) = log f ( y t , ∆ x t ) f ( y t , ∆ x t ) = o P (1) . ubsequently, for the second term of the log likelihood ratio, LLR ( T ) II ( b, c, η ), weprove that it equals η ′ √ T T X t =1 X k h k ( y t , ∆ x t ) − aJ ′ f y h η + (cid:16) bJ ′ f y h η + 2 cJ ′ f x h η (cid:17) T / T X t =1 x t − σ x + η ′ η ! + o P (1) . This completes the proof for Part (i). Since we assume that the functions h k , k ∈ N , are two times continuously diﬀerentiable with bounded derivatives, by aTaylor Series expansion, we have h k (cid:18) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:19) (62)= h k ( y t , ∆ x t ) − bT σ y σ x x t − ˙ h k,y ( y t , ∆ x t ) − cT x t − ˙ h k,y ( y t , ∆ x t ) + o P (1) , where ˙ h k,y and ˙ h k,y are the ﬁrst-order derivatives of ˙ h k with respect to the ﬁrst andsecond argument, respectively. In this equality, higher-order terms are omitted sincethe second-order derivatives, denoted by ¨ h k,yy , ¨ h k,yx , and ¨ h k,xx , are bounded, i.e.,there exists a real number M , such that (cid:12)(cid:12)(cid:12) ¨ h k,yy (cid:12)(cid:12)(cid:12) < M , (cid:12)(cid:12)(cid:12) ¨ h k,yx (cid:12)(cid:12)(cid:12) < M , and (cid:12)(cid:12)(cid:12) ¨ h k,xx (cid:12)(cid:12)(cid:12) < M .Therefore,1 √ T T X t =1 (cid:18) bT σ y σ x x t − (cid:19) ¨ h k,yy ( y t , ∆ x t ) < √ T T X t =1 (cid:18) bT σ y σ x x t − (cid:19) M ⇒ √ T (cid:18) abσ y Z W ε ( s )d s + b σ y Z W ε ( s )d s (cid:19) M = O P (cid:18) √ T (cid:19) = o P (1) , and similar results hold for other higher order terms of ¨ h k,yx and ¨ h k,xx . Also, usinglog(1 + x ) = x − x + O ( x ), we haveLLR ( T ) II ( b, c, η ) (63)= 1 √ T T X t =1 ∞ X k =1 η k h k (cid:18) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:19) −

12 1 T T X t =1 " ∞ X k =1 η k h k (cid:18) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:19) + o P (1)= 1 √ T T X t =1 ∞ X k =1 η k (cid:20) h k ( y t , ∆ x t ) − bT σ y σ x x t − ˙ h k,y ( y t , ∆ x t ) − cT x t − ˙ h k,x ( y t , ∆ x t ) (cid:21) −

12 1 T T X t =1 " ∞ X k =1 η k h k (cid:18) y t − bT σ y σ x x t − , ∆ x t − cT x t − (cid:19) + o P (1)= 1 √ T T X t =1 ∞ X k =1 η k (cid:20) h k ( y t , ∆ x t ) − bT x t − σ x J f y h k − cT x t − σ x J f x h k (cid:21) − ∞ X k =1 η k + o P (1)= η ′ √ T T X t =1 X k h k ( y t , ∆ x t ) − aJ ′ f y h η − (cid:16) bJ ′ f y h η + cJ ′ f x h η (cid:17) T / T X t =1 x t − σ x − η ′ η + o P (1) . he third equality follows from Lemma A.1, E f (cid:2) ˙ h k,y ( y t , ∆ x t ) (cid:3) = R R ˙ h k,y ( e ) f ( e )d e = h k ( e ) f ( e ) (cid:12)(cid:12) R − R R h k ( e ) ˙ f y f ( e )d e = J f y h k , E f (cid:2) ˙ h k,x ( y t , ∆ x t ) (cid:3) = J f x h k , the assumptionE f (cid:2) h k ( e ) (cid:3) = 1, and E f [ h i ( e ) h j ( e )] = 0 when i = j .Putting together (59) and (63) completes the proof of the LAQ result in Part(i). Proof of Part (ii):

The proof for this part follows immediately from the Func-tional Central Limit Theorem (see, e.g., Lemma A.1 and Theorem 2.4 in Chan and Wei(1988)). The convergence of integrals as R W ε ( s )d W ℓ fy ( s ) needs an additional ar-gument as it does not follow automatically from Lemma A.1. The argument isidentical to that in the proof of Proposition 3.2 in Zhou et al. (2019). Proof of Part (iii):

Taking the expectation of exp L ( b, c, η ) under P , , willdirectly lead to the result. Proof of Theorem 3.2.

The proof follows from the deﬁnition of the maximal invari-ant in Section 6.2 of Lehmann and Romano (2006), which, in terms of the presentproblem, is: M is called maximally invariant with respect to G η if (i) it is in-variant, and if (ii) the equality M ( W ε , W h ) = M ( f W ε , f W h ), with the mapping M deﬁned in Section 3.2, implies that ( W ε , W h ) can be transformed into ( f W ε , f W h ) withsome transformation g η ∈ G η . Since (i) is trivially met, the proof is complete byestablishing (ii).Suppose, indeed, M ( W ε ( s ) , W h ( s )) = M ( f W ε ( s ) , f W h ( s )), s ∈ [0 , W ε ( s ) = f W ε ( s ) and B W h ( s ) = e B h ( s ) , for all s ∈ [0 , . This in turn implies, for all s ∈ [0 , W ε ( s ) − f W ε ( s ) = 0 and W h ( s ) − f W h ( s ) = c g s with c g = W h (1) − f W h (1) ∈ R . This shows that ( W ε , W h ) can indeed be transformedto ( f W ε , f W h ) by the transformation g η ∈ G η with η = c g . Thus condition (ii) isveriﬁed and the proof is complete. Proof of Theorem 3.3.

Observe that we can decompose the central sequence ∆( b, c, η )in (24) as ∆( b, c, η ) = ∆ M ( b, c ) + ∆ ⊥⊥ ( b, c, η ) , (64)with ∆ ⊥⊥ ( b, c, η ) = W ε (cid:0) bJ f y h + cJ f x h (cid:1) ′ W h (1) + η ′ W h (1) . (65)Under P , , , W h (1) is independent of M while W ε is measurable with respectto M . As a result, under P , , and conditionally on M , ∆ ⊥⊥ ( b, c, η ) is normally istributed with mean zero and variance (cid:12)(cid:12) W ε (cid:0) bJ f y h + cJ f x h (cid:1) + η (cid:12)(cid:12) . As ∆ M and Q are obviously M -measurable, we ﬁnd E (cid:20) d P b,c,η d P , , |M (cid:21) = E (cid:20) exp (cid:18) ∆ M ( b, c ) + ∆ ⊥⊥ ( b, c, η ) − Q ( b, c, η ) (cid:19) |M (cid:21) = exp (cid:18) ∆ M ( b, c ) − Q ( b, c, η ) (cid:19) E [exp ∆ ⊥⊥ ( b, c, η ) |M ]= exp (cid:18) ∆ M ( b, c ) − Q M ( b, c ) (cid:19) . This completes the proof.

Proof of Proposition 3.2.

The proposition is somewhat nonstandard as it deals withbivariate component-wise ranks, but otherwise its proof mimics that of Lemma A.1in Hallin et al. (2011). Tightness of the processes follows exactly as in that lemma,so we only consider convergence of the ﬁnite-dimensional distributions. We nowfrom the so-called H´ajek Representation Theorem (we use it in the version of The-orem 13.5 in Van der Vaart (2000)), that we may write1 √ T ⌊ sT ⌋ X t =1 − ˙ g y g y (cid:18) G − y (cid:18) R y,t T + 1 (cid:19)(cid:19) = 1 √ T ⌊ sT ⌋ X t =1 − ˙ g y g y (cid:0) G − y ( F y ( ε yt )) (cid:1) − √ T T X t =1 − ˙ g y g y (cid:0) G − y ( F y ( ε yt )) (cid:1) + o P (1) . The equivalent statement holds for the ranks R x,t , with y replaced by x everywherein the above expression. The claim then follows from the functional central limittheorem applied to the partial sums of − ˙ g y g y (cid:0) G − y ( F y ( ε yt )) (cid:1) and − ˙ g x g x (cid:0) G − x ( F x ( ε xt )) (cid:1) ,jointly with W ( T ) ε and W ( T ) ℓ f . Proof of Corollary 3.3.

The behavior of W ε under P b,c,η is already given in thestructural limit experiment associated to the maximal invariant M in Corollary 3.2.To get the behavior of W ℓ g under P b,c,η , ﬁrst decompose it as W ℓ g ( s ) = vW ε ( s ) + AW ℓ f ( s ) + W ⊥ ( s )for some v ∈ R × and A ∈ R × , where W ⊥ is a Brownian motion independent of W ε and W ℓ f . The appropriate values of v and A satisfy the relation J gf = Cov (cid:2) W ℓ g (1) , W ℓ f (1) (cid:3) = Cov (cid:2) vW ε (1) + AW ℓ f (1) + W ⊥ (1) , W ℓ f (1) (cid:3) = v e ′ + AJ f . Note that in v and A there are 6 unknowns and here there are only two equations, which we onlyneed for this proof. The other four equations are given by the equalities Cov (cid:2) W ℓ g (1) , W ε (1) (cid:3) = σ εg and Cov (cid:2) W ℓ g (1) , W ℓ g (1) (cid:3) = J g . hen the proof is complete upon noting that, under P b,c,η , we haved W ℓ g ( s ) = v d W ε ( s ) + A d W ℓ f ( s ) + d W ⊥ ( s )= v ( cW ε d s + d Z ε ( s )) + A (cid:0) J f ( b, c ) ′ W ε ( s )d s + d Z ℓ f ( s ) (cid:1) + d W ⊥ ( s )= ( v e ′ + AJ f )( b, c ) ′ W ε ( s )d s + (cid:0) d Z ε ( s ) + d Z ℓ f ( s ) + d Z ⊥ ( s ) (cid:1) = J gf ( b, c ) ′ W ε ( s )d s + d Z ℓ g ( s ) . C Switching Tests to Standard Case

The numerical approach of Elliott et al. (2015) needs to discretize the nuisanceparameter space under the null hypothesis (and the associated mesh is regardedas the support of Λ ∗ ǫ ). However, in the present case, the null parameter spaceof c is ( −∞ , | c | is large enough so that the predictor essentially behaveslike a stationary time series. In that case, the problem reduces to a standard testwith a stationary regressor. In particular, the authors propose to use a “switching”function χ = { ˆ c < K } based on some estimator ˆ c of c and a chosen “threshold” K to distinguish the nonstandard situation from the standard one. Then, one canemploy the following (combined) test function ϕ n,s,χ ( S ) = χϕ s ( S ) + (1 − χ ) ϕ n ( S ) , (66)where ϕ s is some test for the standard case, and ϕ n is the test (56) for the nonstan-dard case. For the standard test ϕ s , following the argument in the same paper, weuse the semiparametric version of the t -test ϕ s ( S ) = (cid:8) b ⋆ (cid:14) σ b ⋆ > κ s (cid:9) (67)with b ⋆ = S S J f yy − J f yx J f yy c ⋆ , c ⋆ = S − ( J f yx /J f yy ) S (cid:0) ( J f xx − − J f yx /J f yy (cid:1) S + S , and σ b ⋆ = vuut J f yy S + (cid:18) J f yx J f yy (cid:19) (cid:0) ( J f xx − − J f yx /J f yy (cid:1) S + S . Here b ⋆ and c ⋆ are the maximum likelihood estimators of b and c based on thelikelihood ratio in (34).The proof of the following lemma can be found in the Supplementary Materialof Elliott et al. (2015) (Appendix C.4). emma C.1. For s ∈ [0 , , let Z ( s ) and Z ( s ) be two independent standard Brow-nian motions, and W ( s ) be the associated Ornstein-Uhlenbeck process of Z ( s ) ,deﬁned by d W ( s ) = cW ( s )d s + d Z ( s ) . Deﬁne the demeaned process W µ ( s ) = W ( s ) − R W ( s )d s . Then, as c → −∞ , we have  √− c R W ( s )d Z ( s ) √− c R W µ ( s )d Z ( s ) − c R W ( s ) d s − c R W µ ( s ) d s  ⇒  z z  , (68) where z and z are two independent standard normal random variables. Lemma C.2.

Suppose the suﬃcient statistics S , S , S , S are deﬁned in (47),where the behavior of ( W ε , B ℓ fy , B ℓ fx ) ′ is described by the limit experiment E M ( f ) in Corollary 3.2. Then, under P c,η and as c → −∞ , we have √− c  S + J f yx / S + J f xx /  ⇒ N  ,  J f yy J f yx J f yx J f xx  , − cS ⇒ , and − cS ⇒ . Subsequently, still under P c,η and as c → −∞ , we have b ⋆ /σ b ⋆ ⇒ N (0 , . Proof of Lemma C.2.

Note that, in this proof, all convergence results (as c → −∞ )follow immediately from Lemma C.1.First, we give the convergence results for S and S : Recall d W ε ( s ) = cW ε ( s )d s +d Z ε ( s ) for s ∈ [0 ,

1] which makes W ε ( s ) an Ornstein-Uhlenbeck process. Then wehave, as c → −∞ , − cS = − c (cid:16) W ε − (cid:0) W ε (cid:1) (cid:17) = − c Z W µε ( s ) d s → , (69) − cS = − cW ε = − c Z W ε ( s ) d s → . Next, we give the convergence results of statistics S and S : To this end,we state ﬁrst some results derived from Lemma C.1: Deﬁne W µε ( s ) = W ε ( s ) − R W ε ( s )d s for s ∈ [0 ,

1] and any inﬁnite-dimensional vector A , A ∈ R ∞× , wehave  − cA ′ R W µε ( s )d Z h ( s ) − cA ′ R W µε ( s )d Z h ( s )  ⇒ N   ,  A ′ A A ′ A A ′ A A ′ A  . ence, following the decomposition √− cS = √− c Z W ε ( s )d B ℓ fy ( s )= √− c Z W µε ( s )d W ℓ fy ( s )= √− c Z W µε ( s )d Z ℓ fy ( s ) + √− c × cJ f yx Z W µε ( s ) W ε ( s )d s = √− c Z W µε ( s )d Z ℓ fy ( s ) − √− c J f yx (cid:18) − c Z W µε ( s ) d s (cid:19) , we ﬁnd √− cS + √− c J f yx ⇒ N (cid:0) , J f yy (cid:1) . Similarly, by the decomposition √− cS = √− c (cid:18)Z W ε ( s )d B ℓ fx ( s ) + W ε (1) Z W ε ( s )d s (cid:19) = √− c (cid:18)Z W ε ( s )d W ε ( s ) + J f x h Z W µε ( s )d W h ( s ) (cid:19) = √− c (cid:18)Z W ε ( s )d Z ε ( s ) + J f x h Z W µε ( s )d Z h ( s ) (cid:19) − √− c (cid:18) − c Z W ε ( s ) d s − cJ f x h J ′ f x h Z ( W µε ( s )) d s (cid:19) , and J f xx = 1 + J f x h J ′ f x h , we have √− cS + √− c J f xx ⇒ N (0 , J f xx ) . The covariance of √− cS and √− cS is J f y h J ′ f x h = J f yx . In total, we have √− c  S S  + 12  J f yx J f xx  ⇒ N   ,  J f yy J f yx J f yx J f xx  . (70)Finally, we show b ⋆ /σ b ⋆ ⇒ N (0 , √− c  S S − J fyx J fyy S  + 12  J f yx J f xx − J fyx J fyy  ⇒ N   ,  J f yy J f xx − J fyx J fyy  . Thus, after some algebra, b ⋆ √− c = √− cS J f yy ( − cS ) − J f yx J f yy c ⋆ √− c = √− cS J f yy ( − cS ) − J f yx J f yy √− cS − ( J f yx /J f yy ) √− cS (( J f xx − − J f yx /J f yy )( − cS ) + ( − cS ) ⇒ N , J f yy + (cid:18) J f yx J f yy (cid:19) J f xx − J f yx /J f yy !! . oreover, following (69), we have σ b ⋆ √− c = vuut J f yy ( − cS ) + (cid:18) J f yx J f yy (cid:19) J f xx − − J f yx /J f yy )( − cS ) + ( − cS ) ⇒ vuut J f yy + (cid:18) J f yx J f yy (cid:19) J f xx − J f yx /J f yy , which completes the proof.To introduce the rank-based standard test ϕ s , we deﬁne, in terms of S g, , S g, , S g, and S g, , the rank-based statistics b ⋆g = S g, + ρ g S g, S g, J g y , c ⋆g = S g, − ρ g S g, S g, J g x , and σ b ⋆g = s S g, J g y . Note, S g, = S and S g, = S . Now, b ⋆g and c ⋆g serve as rank-based estimatorsof b and c . The following lemma can be regarded as the rank-based version ofLemma C.2. Lemma C.3.

Deﬁne the statistic S g := ( S g, , S g, , S g, , S g, ) where S g, , S g, , S g, and S g, are introduced in Proposition 4.1. Then, under P c,η and as c → −∞ , wehave b ⋆g /σ b ⋆g ⇒ N (0 , . (71) Proof.

Recall, as c → −∞ , − cS → − cS →

1, hence σ b ⋆g √− c → p J g y . Rewrite b ⋆g √− c = √− c ( S g, + ρ g S g, ) − cS g, J g y = 1 − cS g, J g y √− c Z W ε ( s )d B g y ( s ) , where B g y := B ℓ gy + ρ g B ℓ gx . It is not hard to ﬁnd that, based on the construction in(41)-(42), B g y is the limit of the partial-sum process √ T P ⌊ sT ⌋ t =1 − ˙ g y g y (cid:16) G − y (cid:16) R y,t T +1 (cid:17)(cid:17) .Therefore, under H , B g y is a Brownian bridge. As c → −∞ , by Lemma C.1, wehave √− c Z W ε ( s )d B g y ( s ) = √− c Z W µε ( s )d W g y ( s ) ⇒ N (0 , J g y ) , where W g y is the associated Brownian motion of B g y . Thus b ⋆g √− c ⇒ N (0 , J gy ),which in turn completes the proof. ow we have the standard test ϕ g,s ( S g , ρ g ) = n b ⋆g (cid:14) σ b ⋆g > κ g,s o where κ g,s is the (1 − α )-quantile of a standard normal distribution. Similarly,employing the (combined) test as in (66), we obtain the rank-based test ϕ g,χ g ( S g , ρ g ) = χ g ϕ g,s ( S g , ρ g ) + (1 − χ g ) ϕ g,n ( S g , ρ g ) , where χ g = { c ⋆g < K g } .Replacing S g by its ﬁnite-sample counterpart S ( T ) g , deﬁnes the feasible test ϕ g,χ g ( S ( T ) g , ρ g ). In the Monte Carlo study in Section 5, following Elliott et al.(2015), we choose K g = − Additional Simulation Results r e j e c t i on r a t e s f = Multi-laplace, g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-pearson, g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-combo1, g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-combo2, g y = g x = Gaussian Figure 11: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for ﬁxed value of c = −

25 and diﬀerent values of δ ∈ [0 , ρ = − . T = 2 , r e j e c t i on r a t e s f = Multi-laplace, g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-pearson, g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-combo1, g y = g x = Gaussian r e j e c t i on r a t e s f = Multi-combo2, g y = g x = Gaussian Figure 12: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for ﬁxed value of c = −

25 and diﬀerent values of δ ∈ [0 , ρ = − . T = 200. 53

50 100 150 200 -c r e j e c t i on r a t e s f = Multi-t , g y = g x = t -c r e j e c t i on r a t e s f = Multi-t , g y = g x = Gaussian -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = t -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = Gaussian Figure 13: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively. For all cases, ρ = − . T = 2 ,

50 100 150 200 -c r e j e c t i on r a t e s f = Multi-t , g y = g x = t -c r e j e c t i on r a t e s f = Multi-t , g y = g x = Gaussian -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = t -c r e j e c t i on r a t e s f = Multi-Gaussian, g y = g x = Gaussian Figure 14: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively. For all cases, ρ = − . T = 200.55

50 100 150 200 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.1 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.5 -c r e j e c t i on r a t e s f=Multi-G, gy=gx=G; =-0.9 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.1 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.5 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=G; =-0.9 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.1 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.5 -c r e j e c t i on r a t e s f=Multi-t , gy=gx=t ; =-0.9 Figure 15: Rejection rates of the WZ test (solid lines) and the EMW test (dashed lines)for diﬀerent values of δ = 0, 1, 2, and 3, corresponding to lines in blue, green, brown,and red, respectively, under heteroskedasticity . For all the four cases, T = 200.56