Predictive Quantile Regression with Mixed Roots and Increasing Dimensions
Rui Fan∗, Ji Hyung Lee†, Youngki Shin‡

January 2020
Abstract
In this paper we study the benefit of using the adaptive LASSO for predictive quantile regression. It is common that predictors in predictive quantile regression have various degrees of persistence and exhibit different signal strengths in explaining the dependent variable. We show that the adaptive LASSO achieves consistent variable selection and the oracle properties under the simultaneous presence of stationary, unit-root and cointegrated predictors. Some encouraging simulation and out-of-sample prediction results are reported.
Keywords: adaptive LASSO, cointegration, forecasting, oracle property, quantile regression
JEL classification:
C22, C53, C61
Predictive quantile regression (QR) identifies the impact of predictors on a set of conditional quantiles of the response variable, providing richer information about the heterogeneous distributional predictability. When a large number of predictors are available, it is not clear how to select a set of informative predictors. Such a selection can improve forecasting performance at the quantile levels under study, so it is important for practical problems in economics and finance. Forecasting the conditional quantiles of stock returns has recently attracted much attention since the tail quantile information of stock returns has been used as the most important risk measure in finance. Many economic state variables have been employed to predict stock returns, and the number of candidate predictors is often large. In a predictive regression framework, this large set of candidate predictors includes highly persistent and potentially cointegrated predictors.

Unlike the large literature on predictive mean regression, predictive QR is relatively less understood. The existing papers study the predictability of the mean of excess returns on the equity market; see, e.g., Campbell (1987), Fama and French (1988), Hodrick (1992), Cenesizoglu and Timmermann (2012), Andersen et al. (2020), among others. Another interesting line of research investigates whether a certain period of a regime, rather than the whole sample period, explains the predictability of market returns; see Farmer et al. (2019), Demetrescu et al. (2020), Harvey et al. (2020), for example. Predictive QR provides a different predictive relation between stock returns and predictors, complementing the mean predictive model of financial data.

∗ Department of Economics, Rensselaer Polytechnic Institute. Email: [email protected]
† Department of Economics, University of Illinois. Email: [email protected]
‡ Department of Economics, McMaster University. Email: [email protected]
Cenesizoglu and Timmermann (2008) is an early paper on predictive QR, and Maynard et al. (2011), Lee (2016), Fan and Lee (2019), Gungor and Luger (2019) and Cai et al. (2020) recently developed inference methods in predictive QR with nonstationarity and/or heteroskedasticity. This paper, however, focuses on the variable selection and oracle properties with an increasing number of mixed-root predictors.

Although many methods are available for variable selection, the penalized regression methods are most popular, such as Tibshirani (1996)'s $L_1$-penalized linear regression (LASSO). LASSO regression is known to reduce overfitting of the model, thereby improving the prediction accuracy. The regression models also become more interpretable after LASSO selection. While LASSO regression has been extensively discussed for cross-sectional data, only a few papers have investigated LASSO in stationary or nonstationary time series contexts. Recently, Koo et al. (2020) used LASSO to improve the stock return prediction. They show that LASSO significantly reduces forecasting mean squared errors even with a mixture of stationary, unit-root, and cointegrated variables. However, the conventional LASSO method may not have variable selection consistency and the oracle property (Meinshausen and Bühlmann (2004); Fan and Li (2001)). On the other hand, the adaptive LASSO (ALASSO) proposed by Zou (2006) enables the oracle property. Instead of imposing the same penalty weight on all candidate parameters, ALASSO penalizes each parameter proportionally to the inverse of its initial estimate. With a proper choice of the tuning parameter $\lambda_n$, the adaptive penalty weights for the irrelevant variables approach infinity, whereas those for the relevant variables converge to constants. Lee et al. (2018) apply this ALASSO to a predictive regression framework. Similar to Koo et al. (2020), predictors are allowed to have different degrees of persistence and cointegration. Lee et al.
(2018) find that ALASSO and a newly proposed Twin Adaptive LASSO (TALASSO) outperform the other existing methods in terms of predictor selection consistency and out-of-sample mean squared errors. The purpose of the TALASSO of Lee et al. (2018) is to break the cointegration group among the regressors, in order to eliminate the irrelevant persistent predictors in the second stage. Since the conditional mean of stock returns is stationary and quickly mean reverting, including an irrelevant and persistent predictor is detrimental for prediction. We do not pursue a similar method here, since certain quantiles of stock returns can be highly persistent, so nonstationary predictors may have strong signals. In such a case, we want to exploit the signals rather than break the relation.

Some effort has also been made to investigate variable selection and model estimation in QR. The LASSO penalized regression methods were studied in the QR framework for high-dimensional data analysis (Portnoy (1984); Portnoy (1985); Knight and Fu (2000); Koenker (2005); Li and Zhu (2008); Belloni and Chernozhukov (2011), to name a few). To overcome the problem of inconsistent variable selection in the LASSO regularization technique (Fan et al. (2014); Wang et al. (2012)), recent studies have further considered ALASSO in QR. Wu and Liu (2009) discuss how to conduct variable selection for QR models using SCAD and the ALASSO method. Zheng et al. (2013) establish the oracle property for an ALASSO QR model with heterogeneous error sequences. The adaptive weights they use are constructed from the consistent estimator using Belloni and Chernozhukov (2011)'s $L_1$-penalized QR model. Zheng et al. (2015) study a globally ALASSO method for ultra high-dimensional QR models. They propose a generalized information criterion (GIC)-based strategy to select the penalty level $\lambda_n$.

This paper contributes to the high-dimensional predictive QR literature by studying the ALASSO penalty method.
In particular, we consider the case where a diverging number of predictors possess different degrees of persistence and potential cointegration. In this model, several important econometric questions arise. The first is whether the oracle properties hold for an ALASSO QR method in a mixed-root time series framework. In this paper, we develop asymptotic theories to show that, for models with a diverging number of mixed-root predictors, the ALASSO penalty can consistently select the true active predictors on the conditional distribution of the response variable, with probability approaching one. The second econometric question is how to select the penalty parameter for ALASSO QR. The choice of the penalty parameter $\lambda_n$ is a subtle yet important issue in practice. We discuss two conventional criteria for the choice: the Bayesian information criterion (BIC) (Schwarz (1978)) and the GIC (Nishii (1984)). Our simulation experiments in Section 5 indicate that the BIC- and GIC-based methods outperform conventional methods such as LASSO QR in terms of out-of-sample forecasting. The GIC method used in this paper closely follows the idea of Fan and Tang (2013) and Zheng et al. (2015).

The rest of this paper is organized as follows. In Section 2, we introduce the predictive QR models under various degrees of persistent predictors. We then propose the ALASSO QR (ALQR) estimators. Section 3 compares the out-of-sample quantile prediction using the Welch and Goyal (2008) data, confirming the improved forecasting performance of the proposed ALQR methods. In Section 4, we study the asymptotics and oracle properties of the ALQR estimators with the mixed roots and increasing dimension, providing new theoretical contributions. Section 5 presents simulation studies to support our theory in Section 4 and the empirical results in Section 3. A discussion on the choices of the penalty parameter $\lambda_n$ is also included in this section.
Section 6 concludes, and all technical proofs are relegated to the Appendix.

We use the following notation. For a vector $a$, $\|a\| = \sqrt{a'a}$ represents the $L_2$-norm. For a matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and largest eigenvalues. $\rightarrow_d$, $\rightarrow_p$ and $\rightarrow_{a.s.}$ represent convergence in distribution, convergence in probability and almost sure convergence, respectively. All limit theory assumes $n \to \infty$, so we oftentimes omit this condition. $\approx$ signifies "being equal" after ignoring the asymptotically negligible terms. $O(1)$ and $o(1)$ ($O_p(1)$ and $o_p(1)$) are (stochastically) asymptotically bounded or negligible quantities.

Model, Assumption and Estimation
We first consider a predictive QR model with only unit-root predictors:
$$Q_{y_t}(\tau|\mathcal{F}_{t-1}) = \alpha_\tau + x_{t-1}'\beta_\tau, \quad (1)$$
$$x_t = x_{t-1} + v_t,$$
where $x_t$ is a $p_n \times 1$ unit-root process with an $O_p(1)$ initialization and $v_t = \sum_{j=0}^{\infty} F_{vj}\epsilon_{t-j}$, following Assumption 2.1 below. $Q_{y_t}(\tau|\mathcal{F}_{t-1})$ is the conditional quantile of $y_t$ such that $\Pr\left(y_t \le Q_{y_t}(\tau|\mathcal{F}_{t-1}) = \alpha_\tau + x_{t-1}'\beta_\tau \,\middle|\, \mathcal{F}_{t-1}\right) = \tau$, with the vector of true parameters $\beta_\tau = (\beta_{\tau,1}, \ldots, \beta_{\tau,p_n})'$, and $\{\mathcal{F}_t\}_{t \ge 0}$ is a natural filtration. Define $u_{t\tau} := y_t - Q_{y_t}(\tau|\mathcal{F}_{t-1})$; then $\Pr(u_{t\tau} < 0 \,|\, \mathcal{F}_{t-1}) = \tau$. Let $\psi_\tau(u) = \tau - 1(u < 0)$, so that $E(\psi_\tau(u_{t\tau})|\mathcal{F}_{t-1}) = E(\tau - 1(u_{t\tau} < 0)|\mathcal{F}_{t-1}) = \tau - \Pr(u_{t\tau} < 0 \,|\, \mathcal{F}_{t-1}) = 0$. To allow for a diverging number of predictors, we assume $p_n$ increases as $n \to \infty$, but slower than $n$. The restrictions on the growth rate of $p_n$ are detailed in Section 4. For notational simplicity, we use $p_n = p$ hereafter.

Remark 2.1
By the law of iterated expectations, $E[E(\psi_\tau(u_{t\tau})|\mathcal{F}_{t-1})] = E(\psi_\tau(u_{t\tau})) = \tau - \Pr(u_{t\tau} < 0) = 0$, so $u_{t\tau}$ has both the conditional and unconditional $\tau$-quantile at zero, so that $F^{-1}_{u_{t\tau}}(\tau) = F^{-1}_{u_{t\tau}}(\tau|\mathcal{F}_{t-1})$.

For Model (1), the ordinary QR estimator (Koenker and Bassett (1978)) is:
$$(\hat{\alpha}^{QR}_\tau, \hat{\beta}^{QR\prime}_\tau)' = \arg\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^p} \sum_{t=1}^{n} \rho_\tau(y_t - \alpha - x_{t-1}'\beta), \quad (2)$$
where $\rho_\tau(u) = u(\tau - 1(u < 0))$, τ ∈ (0,
1), and $1(\cdot)$ denotes the indicator function.

We define the ALASSO QR (ALQR) estimator as the minimizer of the penalized QR objective function:
$$(\hat{\alpha}^{ALQR}_\tau, \hat{\beta}^{ALQR\prime}_\tau)' = \arg\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^p} \sum_{t=1}^{n} \rho_\tau(y_t - \alpha - x_{t-1}'\beta) + \sum_{j=1}^{p} \lambda_{n,j}|\beta_j|. \quad (3)$$
As in Zou (2006), we define the ALQR penalty term as $\lambda_{n,j} = \lambda_n/\omega_j$ with $\omega_j = |\tilde{\beta}_{\tau,j}|^{\gamma}$, for a certain value of $\gamma$. For each parameter $\beta_j$, $j = 1, \ldots, p$, the degree of penalty depends on the importance of the original predictor whose regression coefficient is $\beta_j$. For each $j$, let $\tilde{\beta}_{\tau,j}$ be a first-step consistent estimate of $\beta_{\tau,j}$. The ALASSO penalty can avoid over- (under-) penalizing the important (irrelevant) predictors by utilizing the information from the first-step consistent estimation. In our case, the ordinary QR estimators from (2) are qualified candidates for $\tilde{\beta}_\tau$. The ALQR in (3) is a convex optimization problem under an adaptively weighted $L_1$ constraint. The performance of the estimators obtained from (3) is affected by the choice of the tuning parameter $\lambda_n$. In Section 5, we discuss the choice of $\lambda_n$ in detail.

We make the following assumptions under Model (1).

Assumption 2.1
The innovation of the unit-root predictors, $v_t$, follows a linear process:
$$v_t = \sum_{j=0}^{\infty} F_{vj}\epsilon_{t-j}, \quad \epsilon_t \sim mds(0, \Sigma), \quad \Sigma > 0, \quad E\|\epsilon_t\|^{\nu} < \infty, \ \nu > 2,$$
$$F_{v0} = I_p, \quad \sum_{j=0}^{\infty} \|F_{vj}\| < \infty, \quad F_v(r) = \sum_{j=0}^{\infty} F_{vj}r^j, \quad F_v(1) = \sum_{j=0}^{\infty} F_{vj} > 0, \quad \Sigma_{vv} = F_v(1)\Sigma F_v(1)'.$$

Assumption 2.2
The contemporaneous covariance between $\psi_\tau(u_{t\tau})$ and $v_t$ is $\Sigma_{\psi v} = Cov(\psi_\tau(u_{t\tau}), v_t')$.

Remark 2.2
The QR model in (1) leads to zero covariance between $v_t$ and $\psi_\tau(u_{t-j,\tau})$ for $j \ge 1$, since $E(\psi_\tau(u_{t\tau})|\mathcal{F}_{t-1}) = \tau - \Pr(u_{t\tau} < 0 \,|\, \mathcal{F}_{t-1}) = 0$ for all $t$.

We now introduce a QR model with mixed-root predictors. Assume that predictors have heterogeneous degrees of persistence of I(0), I(1), and cointegration. This type of model is most relevant in practice. The QR model in Section 2.1 can be considered as a special case of this model.

Let $z_t : p_z \times 1$, $x^c_t : p_c \times 1$, $x_t : p_x \times 1$, and $p = p_z + p_c + p_x$. Given $\{y_t, z_t, x^c_t, x_t\}_{t=1}^n$, the QR model with mixed roots is:
$$Q_{y_t}(\tau|\mathcal{F}_{t-1}) = \alpha_\tau + z_{t-1}'\beta^z_\tau + x^{c\prime}_{t-1}\beta^c_\tau + x_{t-1}'\beta^x_\tau, \quad (4)$$
where $x_t = x_{t-1} + v_t$. The cointegrated system in $x^c_t$ has the triangular representation of Phillips (1991):
$$A x^c_t = x^{c_1}_t - A_2 x^{c_2}_t = v^{c_1}_t, \quad \Delta x^{c_2}_t = v^{c_2}_t. \quad (5)$$
This representation allows us to conveniently characterize the cointegration relations in $x^c_t$ using the matrices $A = (I_{p_1}, -A_2) : p_1 \times p_c$ and $A_2 : p_1 \times p_2$.

Let $X_t = (z_t', x^{c\prime}_t, x_t')'$ and $\beta^* = (\beta^{z\prime}, \beta^{c\prime}, \beta^{x\prime})'$. Define $v^c_t = (v^{c_1\prime}_t, v^{c_2\prime}_t)' : p_c \times 1$ and $e_t = (z_t', v^{c\prime}_t, v_t')' : p \times 1$. We impose Assumption 2.1 and the following assumptions on the vector $e_t$:

Assumption 2.3
$$e_t = F_e(L)\epsilon_t = \sum_{j=0}^{\infty} F_{ej}\epsilon_{t-j}, \quad \epsilon_t = (\epsilon_{zt}', \epsilon_{ct}', \epsilon_{vt}')' \sim mds(0, \Sigma), \quad \Sigma_{p \times p} = \begin{pmatrix} \Sigma_{zz} & \Sigma_{zc} & \Sigma_{zv} \\ \Sigma_{zc}' & \Sigma_{cc} & \Sigma_{cv} \\ \Sigma_{zv}' & \Sigma_{cv}' & \Sigma_{vv} \end{pmatrix} > 0,$$
$$E\|\epsilon_t\|^{\nu} < \infty, \ \nu > 2, \quad F_{e0} = I_p, \quad \sum_{j=0}^{\infty} j\|F_{ej}\| < \infty,$$
$$F_e(z) = \sum_{j=0}^{\infty} F_{ej}z^j, \quad F_e(1) = \sum_{j=0}^{\infty} F_{ej} > 0, \quad \Omega_{ee} = \sum_{h=-\infty}^{\infty} E(e_t e_{t-h}') = F_e(1)\Sigma F_e(1)' : p \times p.$$

Assumption 2.4
$$E(\psi_\tau(u_{t\tau})z_t') = 0, \quad E(\psi_\tau(u_{t\tau})v^{c_1\prime}_t) = 0, \quad E(\psi_\tau(u_{t\tau})v^{c_2\prime}_t) = \Sigma_{\psi c}, \quad E(\psi_\tau(u_{t\tau})v_t') = \Sigma_{\psi v}.$$

Remark 2.3
The cointegrated regressors $x^c_t$ can be decomposed as $(x^{c_1\prime}_t, x^{c_2\prime}_t)'$, where $x^{c_1}_t - A_2 x^{c_2}_t = v^{c_1}_t$, of dimension $p_1 \times 1$, indicates the vector of I(0) cointegrating residuals, and $x^{c_2}_t : p_2 \times 1$ is the vector of unit-root predictors. In Assumption 2.4, the QR-induced regression errors $\psi_\tau(u_{t\tau})$ are contemporaneously correlated with the innovations of the unit-root sequences $x^{c_2}_t$ and $x_t$. This is commonly assumed in the cointegration and predictive regression literature, inducing a potential second-order bias arising from the one-sided correlation. As discussed in Remark 2.2, however, our QR modeling does not allow the correlation of $\psi_\tau(u_{t\tau})$ with the predetermined regressors in this paper. For the stationary predictor $z_t$ and the cointegrating residuals $v^{c_1}_t$, we rule out the identification issue by assuming zero correlation between $\psi_\tau(u_{t\tau})$ and $z_t$, $v^{c_1}_t$.

The ordinary QR estimators for Model (4) are:
$$(\hat{\alpha}^{QR*}_\tau, \hat{\beta}^{QR*\prime}_\tau)' = \arg\min_{\alpha^* \in \mathbb{R},\, \beta^* \in \mathbb{R}^p} \sum_{t=1}^{n} \rho_\tau(y_t - \alpha^* - X_{t-1}'\beta^*), \quad (6)$$
and the ALQR estimators are:
$$(\hat{\alpha}^{ALQR*}_\tau, \hat{\beta}^{ALQR*\prime}_\tau)' = \arg\min_{\alpha^* \in \mathbb{R},\, \beta^* \in \mathbb{R}^p} \sum_{t=1}^{n} \rho_\tau(y_t - \alpha^* - X_{t-1}'\beta^*) + \sum_{j=1}^{p} \lambda_{n,j}|\beta^*_j|, \quad (7)$$
where $\lambda_{n,j} = \lambda_n/\omega_j$ with $\omega_j = |\tilde{\beta}_{\tau,j}|^{\gamma}$, for a user-chosen value of $\gamma$.

In this section, we demonstrate how the ALQR method can improve the prediction of stock returns. Following the literature, we use an updated version of the Welch and Goyal (2008) data. The data range from January 1952 to December 2017, with a total of 792 monthly observations. The dependent variable is excess stock returns.
The set of predictors under study comprises the dividend price ratio (dp), dividend yield (dy), earnings price ratio (ep), book-to-market ratio (bm), stock variance (svar), default yield spread (dfy), default return spread (dfr), inflation (infl), net equity expansion (ntis), long term rate of returns (ltr), long term yield (lty), and treasury bills (tbl). These financial and macroeconomic variables are commonly used in the literature on stock return prediction. The definition of these variables is available from Welch and Goyal (2008).

Before applying our prediction strategies, we emphasize that mixed degrees of persistence are present in this real data set. In Figures 1–2, we plot the time series of the predictors using the whole sample. We notice that two-thirds of the predictors (dp, dy, ep, bm, dfy, ntis, lty, and tbl) are quite persistent and differ from the remaining four stationary predictors (svar, dfr, ltr and infl). In Table 1, we reconfirm this fact by looking at the estimated first-order autoregression coefficients for the persistent predictors. Second, the Johansen test indicates that the cointegrating rank is 3 in most of the 779-month rolling windows. All of this evidence strongly suggests using a prediction method that can accommodate the mixed roots. Therefore, we need a method enabling variable selection and model estimation with a mixed set of stationary, nonstationary and cointegrating predictors.

We investigate the performance of the proposed ALQR method using two different tuning parameter choices: the Bayesian information criterion (BIC) and the generalized information criterion (GIC). The tuning parameter selection is discussed further in Section 5.1 below. We also assess the performance of ALQR by comparing it to existing alternative methods, including QR (regular quantile regression), LASSO and the unconditional quantiles. For LASSO, we employ the LASSO penalty of Tibshirani (1996) in a QR framework.
See Koenker (2005, Section 4.9.2) and Knight and Fu (2000) for further details.

The prediction performance is evaluated by the following two measures, which are also used in Lu and Su (2015). The first measure is the out-of-sample one-step-ahead quantile prediction error (or final prediction error), defined as:
$$\text{FPE}(\tau) = \frac{1}{S}\sum_{s=1}^{S} \rho_\tau(y_s - \hat{y}_s), \quad (8)$$
where $S$ is the number of out-of-sample predictions. Since FPE($\tau$) is the average of the quantile loss function, it can be taken as a measure of proper centering at quantile $\tau$. A smaller FPE implies better prediction performance. The second measure is the out-of-sample $R^2$:
$$R^2(\tau) = 1 - \frac{\sum_{t=n}^{T-1} \rho_\tau(y_{t+1} - \hat{y}_{t+1,\tau})}{\sum_{t=n}^{T-1} \rho_\tau(y_{t+1} - \bar{y}_{t+1,\tau})}, \quad (9)$$
where $n$ is the sample size for in-sample estimation, $T$ is the total sample size ($T = n + S$), $\hat{y}_{t+1,\tau}$ is the one-step-ahead prediction of the $\tau$th quantile of $y$ at time $t$ given the data from the past $n$ periods, and $\bar{y}_{t+1,\tau}$ is the historical unconditional $\tau$th quantile of $y$ over the past $n$ periods. A larger $R^2$ implies better prediction performance. (The LASSO method is implemented using rq.fit.lasso in the R package quantreg.)

Tables 2, 4, 6 and 8 provide the prediction results evaluated by FPE and $R^2$. The first two are based on the average performance over the last 12 months of predictions, while the last two study the last 24 months. Bold numbers indicate the best performance at each quantile: the smallest FPE or the largest $R^2$. We summarize our findings as follows: (1) The ALQR method shows comparable or better forecasting performance than the other methods: it produces the smallest one-step-ahead prediction errors, and thus the largest $R^2$, across most quantiles of interest; (2) Compared to the baseline unconditional quantile method, ALQR can greatly reduce forecast errors across quantiles.
Notably, such an improvement is more significant at the tails than at the median; (3) Another interesting finding concerns the prediction performance at the median. As we can see in Tables 6 and 8, none of the methods works better than the baseline method in predicting the average market performance. This result is in line with the documented difficulty of predicting average stock market returns in the literature; (4) From Tables 3, 5, 7 and 9, we observe that the ALQR typically selects fewer predictors than the conventional LASSO method, confirming the conservative variable selection of LASSO. In Section 4, we further prove the consistent variable selection and oracle properties of the ALQR method. In Section 5, we indeed confirm that the number of variables selected by ALQR is closer to the true sparsity in a simulation environment that mimics our empirical scenario in this section.

Our empirical analysis using QR-based methods shows that QR can be an informative technique for predicting financial market returns. It allows us to study a wider range of predictability beyond the mean of the return distribution. For example, in Figure 3, we provide a clear picture of how a set of state variables can effectively contribute to the prediction of the return distribution. For τ = 0.
5 (the median of the market returns), the majority of the predictors have zero coefficients and the only informative predictor is dp. This result is consistent with findings in the literature such as Welch and Goyal (2008) and Fan and Lee (2019), suggesting the difficulty of predicting the center of the return distribution. The failure to predict median returns does not necessarily imply that the predictors are uninformative about the return distribution. For investors or policy makers, it would also be useful to find informative predictors in the bad times (lower tail) or the good times (upper tail) of the market. In Figure 3, our results at the other quantiles show that dp remains an effective nonstationary predictor with larger positive coefficients. In addition, a stationary predictor, stock variance (svar), now becomes informative with large positive coefficients. For τ = 0.
1, more predictors become effective in predicting that quantile of the return distribution, which also indicates the asymmetry in their ability to predict market returns. The economic story behind these results can be addressed. For example, the negative coefficients of svar at the lower quantiles indicate that larger stock volatility shifts the lower tail further downward, while the positive coefficients of svar at the upper quantiles indicate that larger stock volatility can also shift the upper tail further upward, thus increasing the prospects of subsequent larger positive returns. The results may suggest that in the period of a bull market, returns can be driven upward by higher values of the dividend price ratio and larger stock variance. On the contrary, a bear market can be affected by various factors influencing market performance. For example, we find that the default yield spread, treasury bills, and stock variance show meaningful coefficients in periods of unusually low market returns.

In this section, we show that the oracle properties (Fan and Li (2001)) remain valid in a mixed-root time series QR framework with a diverging number of predictors. This result therefore provides theoretical support for using the ALQR method to improve forecasting accuracy when a large number of stationary and nonstationary predictors are employed. We first investigate a special case where all predictors are I(1) (Section 2.1). Then, we generalize the results to the case where the predictors can be a mixture of I(0), I(1), and cointegrated processes, as in Section 2.2. In both cases, the number of predictors p is assumed to increase at a slower rate than n. In particular, p = n^ζ, with ζ ∈ (0,
1).  (10)

Some additional restrictions on $\zeta$, systematically combined with the rates of convergence, are discussed in the Appendices. For simplicity, we remove the intercept terms in Model (1) and Model (4). Following Lee (2016), the dequantiled dependent variable is defined as $y_{t\tau} := y_t - \hat{\alpha}^{QR}_\tau$. Because the presence of the intercept term is not involved in the ALASSO selection, removing the intercept has no impact on the asymptotic results developed in this section. In particular, in a mixed-root model, the intercept term can be treated as one of the I(0) predictors. So the oracle properties we show in this section are invariant with respect to the decision whether to include the intercept or not.

Consider the QR model with unit-root predictors defined in Section 2.1 under Assumptions 2.1 and 2.2. To construct an adaptive penalty term in ALQR, we need a consistent estimator of $\beta_\tau$ in the first step. In the following lemma, we show that the QR estimator defined in (2) is a qualified initial estimator for ALQR. We make the following assumptions with $A(t) := E[f_{t-1}(0)x_{t-1}x_{t-1}']$, $B(t) := E[\psi^2_\tau(u_{t\tau})x_{t-1}x_{t-1}']$, and the generic constants $c_{A1}$, $c_{A2}$, $c_{B1}$ and $c_{B2}$, which may depend on $t$ and $p$.

Assumption 4.1 (Assumption f) (1) The distribution function of $u_{t\tau}$, $F(\cdot)$, has a continuous density $f(\cdot)$ with $f(a) > 0$ on $\{a : 0 < F(a) < 1\}$. (2) The derivative of the conditional distribution function $F_{t-1}(a) = \Pr[u_{t\tau} < a|\mathcal{F}_{t-1}]$, which we denote $f_{t-1}(\cdot)$, is continuous and uniformly bounded above by a finite constant $c_f$ almost surely. (3) For any sequence $\zeta_n \to F^{-1}(\tau)$, $f_{t-1}(\zeta_n)$ is uniformly integrable, and $E[f^{\eta}_{t-1}(F^{-1}(\tau))] < \infty$ for some $\eta > 1$.

Assumption 4.2 (Assumption A1) $0 < c_{A1} \le \lambda_{\min}\!\left(\frac{A(t)}{t}\right) \le \lambda_{\max}\!\left(\frac{A(t)}{t}\right) \le c_f\, \lambda_{\max}\!\left(\frac{E[x_{t-1}x_{t-1}']}{t}\right) \le c_{A2} < \infty$.

Assumption 4.3 (Assumption B1) $0 < c_{B1} \le \lambda_{\min}\!\left(\frac{B(t)}{t}\right) \le \lambda_{\max}\!\left(\frac{B(t)}{t}\right) \le c_{B2} < \infty$.
Assumption 4.4 (Assumption P) (P1) $p = n^{\zeta}$ with $0 < \alpha\zeta < 1$; (P2) $c_{A2}^{1/2} n^{1/2} p^{\alpha} \vee c_{B2}^{1/2} p^{\alpha} = O(c_{A1})$; and (P3) $p^{3/2} n^{-(\alpha\zeta + 2)} c_{A2}^{1/2} = n^{(3/2)\zeta - \alpha\zeta - 2} c_{A2}^{1/2} = o(1)$.

Assumption 4.5 (Assumption λ) (λ1) $\lambda_n \to \infty$, and $\dfrac{\lambda_n n^{\zeta}}{n^{\alpha\zeta} c_{A1}} \to 0$; (λ2) $\dfrac{\lambda_n n^{(1-\alpha\zeta)\gamma}}{n^{\alpha\zeta} c_{A2}^{1/2}} \to \infty$, with $\gamma > 0$.
Assumptions A1 and B1 modify existing conditions from the stationary QR literature. In particular, we adapt Assumption A.2 of Lu and Su (2015) to our nonstationary framework. Under Assumption 2.1 it is easy to show $E[x_{t-1}x_{t-1}'] = t \cdot \Sigma_{vv}$, so we impose a similar set of restricted eigenvalue conditions for $\Sigma_{vv} = E[x_{t-1}x_{t-1}']/t$, $A(t)/t$ and $B(t)/t$. Therefore, this is a natural extension of the existing conditions in the iid or stationary QR literature to nonstationary time series with increasing dimension.

Remark 4.2
In Assumption P, the condition $0 < \alpha\zeta < 1$ implies $a_n = p^{\alpha}/n = n^{\alpha\zeta}/n = n^{\alpha\zeta - 1} \to 0$, indicating the restriction on the growth rate of the number of parameters. In the Appendix, we discuss the technicality of Assumptions P and λ in detail. Assumption f is a standard one from the literature.

Remark 4.3
Assumption λ collects a set of rate conditions to prove the oracle properties. In particular, Assumption (λ1) is required to prove Theorem 4.1, while Assumption (λ2) is used to show Theorem 4.2.

Lemma 4.1 (Consistency of the QR Estimator with Unit-Root Predictors)
Under Assumptions 4.1, 4.2, 4.3, 4.4 and 4.5, the QR estimator of (2) is $(n/p^{\alpha})$-consistent:
$$\left\| \hat{\beta}^{QR}_\tau - \beta_\tau \right\| = O_p\!\left(\frac{p^{\alpha}}{n}\right).$$

Remark 4.4
With I(0) predictors, the rate of convergence is $n^{1/2}$ with fixed $p$, and $n^{1/2}/p^{1/2}$ with increasing $p$. Thus, the loss of information (or degrees of freedom) for estimating the increasing number of parameters is $p^{1/2}$. With I(1) predictors, the rate of convergence is $n$ with fixed $p$, where the super-consistency comes from the stronger signal in $X'X$. With increasing $p$, $p^{\alpha}/n = p^{1/2} \cdot p^{\alpha - 1/2}/n$, where the additional rate loss $p^{\alpha - 1/2}$ comes from the increasing singularity of $E[x_t x_t']$, as summarized in Assumptions A1 and B1. In contrast, for I(0) predictors, $0 < \lambda_{\min}\!\left(\frac{X'X}{n}\right) < \lambda_{\max}\!\left(\frac{X'X}{n}\right) < \infty$, with $p/n \to 0$. We now show the oracle properties of the ALQR estimators from (3).
Theorem 4.1 (ALQR Estimation Consistency under Unit Roots)
Under Assumptions 4.1, 4.2, 4.3, 4.4 and 4.5, the ALQR estimator $\hat{\beta}^{ALQR}_\tau$ of (3) is $(n/p^{\alpha})$-consistent:
$$\left\| \hat{\beta}^{ALQR}_\tau - \beta_\tau \right\| = O_p\!\left(\frac{p^{\alpha}}{n}\right).$$

Theorem 4.2 (ALQR Sparsity under Unit Roots)
Under Assumptions 4.1, 4.2, 4.3, 4.4 and 4.5,
$$\Pr(j \in \hat{\mathcal{A}}_n) \longrightarrow 0, \quad \text{for } j \notin \mathcal{A}, \text{ as } n \to \infty,$$
where $\hat{\mathcal{A}}_n = \{j : \hat{\beta}^{ALQR}_{\tau,j} \ne 0\}$ and $\mathcal{A} = \{j : \beta_{\tau,j} \ne 0\}$.

Now let us consider the general QR model in Section 2.2 with heterogeneous degrees of predictor persistence, under Assumptions 2.1, 2.3 and 2.4. Recall that $X_t := (z_t', x^{c_1\prime}_t, x^{c_2\prime}_t, x_t')'$ and $\beta_\tau = (\beta^{z\prime}_\tau, \beta^{c_1\prime}_\tau, \beta^{c_2\prime}_\tau, \beta^{x\prime}_\tau)'$ are $p \times 1$ vectors, so that
$$Q_{y_t}(\tau|\mathcal{F}_{t-1}) = \alpha_\tau + z_{t-1}'\beta^z_\tau + x^{c_1\prime}_{t-1}\beta^{c_1}_\tau + x^{c_2\prime}_{t-1}\beta^{c_2}_\tau + x_{t-1}'\beta^x_\tau = \alpha_\tau + X_{t-1}'\beta_\tau.$$
To show the oracle properties in the mixed-roots case, we define a transformation matrix $Q$ that captures the cointegration relation between $x^{c_1}_t$ and $x^{c_2}_t$ in the following way:
$$Q_{(p_n \times p_n)} = \begin{pmatrix} I_{p_z} & 0 & 0 & 0 \\ 0 & I_{p_1} & 0 & 0 \\ 0 & A_2' & I_{p_2} & 0 \\ 0 & 0 & 0 & I_{p_x} \end{pmatrix}.$$
We discuss the transformed model first:
$$Q_{y_t}(\tau|\mathcal{F}_{t-1}) = \alpha_\tau + X_{t-1}'\beta_\tau = \alpha_\tau + (X_{t-1}'Q^{-1})Q\beta_\tau := \alpha_\tau + \tilde{x}_{t-1}'\tilde{\beta}_\tau,$$
where
$$\tilde{\beta}_\tau = Q\beta_\tau = \begin{pmatrix} \beta^z_\tau \\ \beta^{c_1}_\tau \\ A_2'\beta^{c_1}_\tau + \beta^{c_2}_\tau \\ \beta^x_\tau \end{pmatrix}$$
and
$$\tilde{x}_{t-1}' = X_{t-1}'Q^{-1} = (z_{t-1}', x^{c_1\prime}_{t-1}, x^{c_2\prime}_{t-1}, x_{t-1}') \begin{pmatrix} I_{p_z} & 0 & 0 & 0 \\ 0 & I_{p_1} & 0 & 0 \\ 0 & -A_2' & I_{p_2} & 0 \\ 0 & 0 & 0 & I_{p_x} \end{pmatrix} = (z_{t-1}', v^{c_1\prime}_{t-1}, x^{c_2\prime}_{t-1}, x_{t-1}') := (w^{(0)\prime}_{t-1}, w^{(1)\prime}_{t-1}),$$
where $w^{(0)}_t = [z_t', v^{c_1\prime}_t]'$ is a $(p_z + p_1) \times 1$ I(0) process, $w^{(1)}_t = [x^{c_2\prime}_t, x_t']'$ is a $(p_2 + p_x) \times 1$ I(1) process, and $p_1$ is the cointegration rank.

Let the number of I(0) predictors in $\tilde{x}_{t-1}$ be $r \equiv p_z + p_1 \ge 0$, and assume $r$ is fixed; this condition can easily be generalized to allow $r$ to increase. The number of I(1) predictors in $\tilde{x}_{t-1}$ is $(p - r)$, which grows as $p$ diverges. Note that Section 4.1 is a special case with $r = 0$, i.e., no cointegration. This section considers the more general model with $r \ge 0$:
$$Q_{y_t}(\tau|\mathcal{F}_{t-1}) = \alpha_\tau + X_{t-1}'\beta_\tau = \alpha_\tau + (X_{t-1}'Q^{-1})Q\beta_\tau := \alpha_\tau + \tilde{x}_{t-1}'\tilde{\beta}_\tau. \quad (11)$$
Let $M_n$ be a diagonal matrix of the form
$$M_n := \begin{pmatrix} \sqrt{n}\, I_r & 0 \\ 0 & n\, I_{p_n - r} \end{pmatrix}.$$
We make the following additional assumptions to show the QR consistency of the transformed model (11).
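The algebra of the rotation $Q$ above can be verified numerically. The sketch below is our own illustration with toy dimensions $p_z = p_1 = p_2 = p_x = 1$ and a hypothetical loading $A_2$ (none of these values come from the paper); it checks that the rotated regressor is the cointegrating residual $v^{c_1}_t$ and that the fitted quantile is unchanged, $\tilde{x}_t'\tilde{\beta}_\tau = X_t'\beta_\tau$:

```python
import numpy as np

eye, zeros = np.eye, np.zeros
p_z, p_1, p_2, p_x = 1, 1, 1, 1      # toy dimensions: one predictor per block
A2 = np.array([[0.7]])               # hypothetical cointegrating loading A_2

# Q maps beta = (beta_z, beta_c1, beta_c2, beta_x) to
# (beta_z, beta_c1, A_2' beta_c1 + beta_c2, beta_x), as in the text.
Q = np.block([
    [eye(p_z), zeros((p_z, p_1)), zeros((p_z, p_2)), zeros((p_z, p_x))],
    [zeros((p_1, p_z)), eye(p_1), zeros((p_1, p_2)), zeros((p_1, p_x))],
    [zeros((p_2, p_z)), A2.T, eye(p_2), zeros((p_2, p_x))],
    [zeros((p_x, p_z)), zeros((p_x, p_1)), zeros((p_x, p_2)), eye(p_x)],
])
Q_inv = np.linalg.inv(Q)   # same matrix with -A_2' in the off-diagonal block

# One observation X_t = (z_t, x_t^{c1}, x_t^{c2}, x_t)' built so that
# x^{c1} - A_2 x^{c2} = v^{c1}, i.e. the triangular representation (5).
z, x_c2, x_unit, v_c1 = 0.3, 2.0, 5.0, 0.1
x_c1 = A2[0, 0] * x_c2 + v_c1
X_t = np.array([z, x_c1, x_c2, x_unit])
beta = np.array([1.0, -0.5, 0.25, 0.8])

x_tilde = X_t @ Q_inv      # = (z, v^{c1}, x^{c2}, x): I(0) block then I(1) block
beta_tilde = Q @ beta

assert np.isclose(x_tilde[1], v_c1)                  # rotated regressor is the residual
assert np.isclose(x_tilde @ beta_tilde, X_t @ beta)  # fitted quantile is unchanged
```

The off-diagonal sign flip between $Q$ and $Q^{-1}$ is what turns the I(1) pair $(x^{c_1}_t, x^{c_2}_t)$ into the stationary residual plus a unit-root component.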
Assumption 4.6 (Assumption A2)
$$0 < \bar{c}_{A1} \le \lambda_{\min}\!\left(\frac{M_n' \sum_{t=1}^{n} E\left[f_{t-1}(0)\tilde{x}_{t-1}\tilde{x}_{t-1}'\right] M_n}{n}\right) \le \lambda_{\max}\!\left(\frac{M_n' \sum_{t=1}^{n} E\left[f_{t-1}(0)\tilde{x}_{t-1}\tilde{x}_{t-1}'\right] M_n}{n}\right) \le \bar{c}_{A2} < \infty$$
w.p.a.1 as $n \to \infty$.

Assumption 4.7 (Assumption B2)
$$0 < \bar{c}_{B1} \le \lambda_{\min}\!\left(\frac{M_n' \sum_{t=1}^{n} E\left[\psi^2_\tau(u_{t\tau})\tilde{x}_{t-1}\tilde{x}_{t-1}'\right] M_n}{n}\right) \le \lambda_{\max}\!\left(\frac{M_n' \sum_{t=1}^{n} E\left[\psi^2_\tau(u_{t\tau})\tilde{x}_{t-1}\tilde{x}_{t-1}'\right] M_n}{n}\right) \le \bar{c}_{B2} < \infty$$
w.p.a.1 as $n \to \infty$.

Assumption 4.8 (Assumption λ1') $\lambda_n \to \infty$, and $\dfrac{\lambda_n n^{\zeta}}{n^{1/2 + \alpha\zeta} c_{A1}} \to 0$.

Assumption 4.9 (Assumption λ2') $\lambda_n \to \infty$, and $\dfrac{\lambda_n n^{(1/2 - \alpha\zeta)\gamma}}{n^{\alpha\zeta} c_{A2}^{1/2}} \to \infty$.

Remark 4.5
Compared to Assumptions (λ1) and (λ2), Assumptions (λ1') and (λ2') are more restrictive conditions; hence the latter two imply the former two. This is because we need to accommodate both I(0) and I(1) predictors, unlike Section 4.1, which only deals with I(1) predictors.

To accommodate the different rates of convergence, we use the notation $a^{(0)}_n = p^{\alpha}/\sqrt{n}$ and $a_n = a^{(1)}_n = p^{\alpha}/n$ to represent the I(0) and I(1) rates, respectively.

Lemma 4.2 (Consistency of the Transformed QR Estimator $\tilde{\beta}_\tau$ under Mixed Roots) Under Assumptions 4.6, 4.7, 4.8 and 4.9, the QR estimator of the transformed model (11), $\tilde{\beta}^{QR}_{\tau,j}$, is a consistent estimator of $\tilde{\beta}_{\tau,j}$ such that, for $j = 1, \ldots, r$, $(a^{(0)}_n)^{-1}(\tilde{\beta}^{QR}_{\tau,j} - \tilde{\beta}_{\tau,j}) = O_p(1)$, and for $j = r+1, \ldots, p_n$, $(a^{(1)}_n)^{-1}(\tilde{\beta}^{QR}_{\tau,j} - \tilde{\beta}_{\tau,j}) = O_p(1)$.

Using the consistency of the transformed QR model (11), we show the consistency of the original QR model (4). Since $\tilde{\beta}^{c_2}_\tau = A_2'\beta^{c_1}_\tau + \beta^{c_2}_\tau$, the convergence rate of $\hat{\beta}^{c_2}_\tau$ will be determined by the rate of $\hat{\beta}^{c_1}_\tau$. Note that
$$\hat{\beta}^{c_2}_\tau = \big(\tilde{\beta}^{c_2}_\tau\big)^{QR} - A_2'\hat{\beta}^{c_1}_\tau = \left(\big(\tilde{\beta}^{c_2}_\tau\big)^{QR} - \tilde{\beta}^{c_2}_\tau\right) - A_2'\left(\hat{\beta}^{c_1}_\tau - \beta^{c_1}_\tau\right) + \left(\tilde{\beta}^{c_2}_\tau - A_2'\beta^{c_1}_\tau\right) = O_p\big(a^{(1)}_n\big) + O_p\big(a^{(0)}_n\big) + \beta^{c_2}_\tau = \beta^{c_2}_\tau + O_p\big(a^{(0)}_n\big).$$
Thus, the I(0) rate dominates, and $\big\|\hat{\beta}^{c_2}_\tau - \beta^{c_2}_\tau\big\| = O_p\big(a^{(0)}_n\big)$. We have the following consistency for the original (untransformed) QR estimator, with the reduced rate of convergence for $\hat{\beta}^{c_2}_\tau$.

Corollary 4.3 (Consistency of the QR Estimator under Mixed Roots)
Under Assumptions 4.6, 4.7, and 4.8, the QR estimator $\hat{\beta}^{QR*}_\tau$ in (6) is a consistent estimator of $\beta_\tau$ such that, for $j = 1,\ldots,r+p_1$, $(\hat{\beta}^{QR*}_{\tau,j} - \beta_{\tau,j}) = O_p\big(a_n^{(0)}\big)$, and for $j = r+p_1+1,\ldots,p_n$, $(\hat{\beta}^{QR*}_{\tau,j} - \beta_{\tau,j}) = O_p\big(a_n^{(1)}\big)$.
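The contrast between the $a_n^{(0)}$ and $a_n^{(1)}$ rates can be illustrated numerically. The following is a minimal Python sketch, not the paper's code: it fits a median ($\tau = 0.5$) regression by linear programming on one stationary and one random-walk predictor and compares the slope errors. The function name `median_reg`, the sample size, and the DGP coefficients (0.5 and 0.2) are illustrative assumptions; the unit-root slope is typically estimated far more precisely, consistent with the faster I(1) rate.

```python
import numpy as np
from scipy.optimize import linprog

def median_reg(X, y):
    """Median (tau = 0.5) regression via the standard LP formulation:
    variables are [b+, b-, u+, u-] with residual y - Xb = u+ - u-."""
    n, p = X.shape
    tau = 0.5
    c = np.concatenate([np.zeros(2 * p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p + 2 * n),
                  method="highs")
    return res.x[:p] - res.x[p:2 * p]

rng = np.random.default_rng(1)
n = 300
x_i0 = rng.normal(size=n)              # stationary, I(0) predictor
x_i1 = np.cumsum(rng.normal(size=n))   # unit-root, I(1) predictor
u = rng.normal(size=n)
# predictive regression: y_t on the lagged predictors x_{t-1}
y = 0.5 * x_i0[:-1] + 0.2 * x_i1[:-1] + u[1:]
X = np.column_stack([x_i0[:-1], x_i1[:-1]])
b = median_reg(X, y)
err_i0, err_i1 = abs(b[0] - 0.5), abs(b[1] - 0.2)
```

With the seed above, the I(1) slope error is an order of magnitude smaller than the I(0) slope error, mirroring the $\sqrt{n}$ versus $n$ rate gap.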
We now have the following oracle properties of the ALQR estimator with mixed roots, providing theoretical support for using ALQR in empirical practice with stationary and nonstationary time series data.
Theorem 4.4 (ALQR Estimation Consistency under Mixed Roots)
Under Assumptions 4.6, 4.7, and 4.8, the ALQR estimator $\hat{\beta}^{ALQR*}_\tau$ from (7) is a consistent estimator of $\beta_\tau$ such that, for $j = 1,\ldots,r+p_1$, $(\hat{\beta}^{ALQR*}_{\tau,j} - \beta_{\tau,j}) = O_p\big(a_n^{(0)}\big)$, and for $j = r+p_1+1,\ldots,p_n$, $(\hat{\beta}^{ALQR*}_{\tau,j} - \beta_{\tau,j}) = O_p\big(a_n^{(1)}\big)$.

Theorem 4.5 (ALQR Sparsity under Mixed Roots)
Under Assumptions 4.6, 4.7, 4.8, and 4.9,
$$\Pr\big(j \in \hat{A}_n\big) \longrightarrow 0, \quad \text{for } j \notin A,$$
where $\hat{A}_n = \{j : \hat{\beta}^{ALQR*}_{\tau,j} \neq 0\}$ and $A = \{j : \beta_{\tau,j} \neq 0\}$.

5 Monte Carlo Simulations

In this section, we conduct a set of Monte Carlo simulation studies to evaluate the forecasting performance of the proposed ALQR method. Based on the stock return application in Section 3, we construct a simulation environment that closely follows the updated monthly data of Welch and Goyal (2008), ranging from January 1952 to December 2017. Recall that the set of predictors considered comprises the dividend price ratio (dp), dividend yield (dy), earnings price ratio (ep), book-to-market ratio (bm), stock variance (svar), default yield spread (dfy), default return spread (dfr), inflation (infl), net equity expansion (ntis), long-term rate of returns (ltr), long-term yield (lty), and Treasury bill rate (tbl). In Table 1, we confirm that they can be categorized into eight persistent predictors and four stationary ones. We construct the following simulation design.

We calibrate the structure of $z_t$ by estimating a VAR(p) model, with the lag order selected by BIC, on the data for svar, dfr, infl, and ltr. The selected specification is a VAR(2),
$$z_t = \hat{\Phi}_1 z_{t-1} + \hat{\Phi}_2 z_{t-2} + u^z_t, \qquad u^z_t \sim N(0, \hat{\Sigma}_z),$$
where $\hat{\Phi}_1$ and $\hat{\Phi}_2$ are the estimated (sparse) $4\times 4$ coefficient matrices and $\hat{\Sigma}_z$ is the estimated innovation covariance matrix. The cointegrated predictors are calibrated to dp, dy, dfy, and ep. We can thus generate a set of cointegrated predictors from the following process:
$$(x^{c\prime}_{1t}, x^{c\prime}_{2t})' = \hat{\Pi}\,(x^{c\prime}_{1,t-1}, x^{c\prime}_{2,t-1})' + v^c_t, \qquad v^c_t \sim N(0, \hat{\Sigma}_v),$$
with N(0, 1) innovations and stationary initializations.

We consider the following scenarios with different numbers of zero slope coefficients:
(1) 6 zero coefficients: two entries of $\beta^z$ and the last four entries of $(\beta^{c\prime}, \beta^{x\prime})'$ are set to zero;
(2) 4 zero coefficients: two entries of $\beta^z$ and the last two entries of $(\beta^{c\prime}, \beta^{x\prime})'$ are set to zero;
(3) 0 zero coefficients: all twelve slope coefficients are nonzero (many of them small).

Without loss of generality, we set the intercept $\alpha = 0$. We simulate data of length $n = 1000$, using the last 12 periods for out-of-sample prediction evaluation and the remaining observations for in-sample estimation. The total number of replications for each experiment is 1000. The quantiles of interest are $\tau = 0.05, 0.1, 0.5, 0.9,$ and $0.95$. Forecasting performance is evaluated by the FPE and $R^2$ measures defined in (8) and (9), respectively, averaged over the 1000 replications.

5.1 Selection of the tuning parameter $\lambda_n$

In the $L_1$-penalized regression literature, the choice of the tuning parameter $\lambda_n$ often plays a crucial role in achieving consistent model selection and the optimal convergence rate for the regression estimator. To achieve the oracle property, many studies have pointed out that $\lambda_n$ must satisfy certain rate conditions that depend on the size of the true model. Because the true model size is unknown in practice, various criteria have been developed to help identify the number of relevant variables in the underlying model. Commonly used criteria include $k$-fold cross-validation (CV), the Akaike information criterion (AIC), BIC, and GIC. However, it has been shown that the first two methods may fail to identify the true model effectively (Shao (1997); Wang et al. (2007); Zhang et al. (2010)). As discussed in Zheng et al. (2013), the statistical properties of $k$-fold CV are not well understood for high-dimensional regression, especially under the heavy-tailed errors where QR is often applied. In contrast, the model selection consistency of the BIC procedure proposed by Wang and Leng (2007) has been demonstrated in Wang et al. (2007) for fixed dimensionality and in Wang et al. (2009) for increasing-dimensional problems with $p < n$. Recently, Fan and Tang (2013) and Zheng et al. (2015) consider the GIC criterion to select the optimal $\lambda_n$ in practice. They show that, for (ultra-)high-dimensional regressions, the GIC tuning-parameter procedure consistently identifies the underlying true model with probability approaching 1. Following these results, we select the optimal tuning parameter $\lambda_n$ by minimizing BIC or GIC. For the information-criterion-based method, we adapt it to the QR framework.
Given $\lambda$, the objective function of BIC or GIC is defined as
$$BIC(\lambda) \text{ or } GIC(\lambda) = \log\!\left(\frac{1}{n}\sum_{t=1}^n \rho_\tau\big(y_t - x_{t-1}'\hat{\beta}_\tau(\lambda)\big)\right) + \Gamma_n \cdot |\hat{S}(\lambda)|, \qquad (12)$$
where $\hat{\beta}_\tau(\lambda)$ is the ALQR estimate, i.e., the minimizer of (3) with $\lambda_n = \lambda$; $\Gamma_n$ is a positive sequence converging to 0 as $n$ grows; and $|\hat{S}(\lambda)|$ is the number of active variables identified by $\hat{\beta}_\tau(\lambda)$. Our goal is to find the optimal $\lambda_n$ that minimizes this objective function. The difference between BIC and GIC is the choice of $\Gamma_n$, which controls the degree of the penalty on model size. In this paper, we let $\Gamma_n = \log(n)/n$ for BIC and $\Gamma_n = \log(p_n)\cdot\log(\log(n))/n$ for GIC, following Zheng et al. (2015).

The procedure for finding the optimal $\lambda_n$ is summarized as follows:
1. Run an unpenalized QR to obtain the estimate $\tilde{\beta}_\tau$. This first-step consistent estimate of $\beta_{\tau,j}$ is used to construct the penalty weights $\lambda_{n,j}$ of ALQR.
2. Select $\lambda_n$ from an equally spaced grid and run a penalized QR to obtain $\hat{\beta}_\tau(\lambda_n)$. Note that, in the penalized regressions, we use $\lambda_{n,j} = \lambda_n/|\tilde{\beta}_{\tau,j}|^\gamma$ for ALQR and $\lambda_{n,j} = \lambda_n$ for LASSO, respectively.
3. Compute $BIC(\lambda_n)$ or $GIC(\lambda_n)$ in (12).
4. Repeat steps 2-3 for each candidate $\lambda_n$ over the search grid. The optimal $\lambda_n$ is then selected as the one that gives the smallest value of $BIC(\lambda_n)$ or $GIC(\lambda_n)$.

To alleviate the computational burden, we search for the optimal $\lambda_n$ over a prespecified fine grid. For our simulation study, the search grid for ALQR is an equally spaced sequence of small values (on the order of negative powers of 10), while the grid for LASSO is an equally spaced sequence running up to 10. The difference in the size of $\lambda_n$ between ALQR and LASSO is mainly due to the fact that the tuning parameter $\lambda_n$ of ALQR needs to be weighted by $1/|\tilde{\beta}_{\tau,j}|^\gamma$ in order to achieve model selection consistency.
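The steps above can be sketched in Python. This is a hedged illustration rather than the authors' code: `penalized_qr` solves the weighted-$L_1$ quantile program by linear programming, and `select_lambda` implements criterion (12) with the BIC/GIC choices of $\Gamma_n$; the grid values, the tolerance used to count active variables, and the small floor on the first-step weights are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def penalized_qr(X, y, tau, lam):
    """Weighted-L1-penalized quantile regression via the standard LP:
    min_b sum_t rho_tau(y_t - x_t'b) + sum_j lam_j |b_j|."""
    n, p = X.shape
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (p,))
    # variables: [b+, b-, u+, u-]; residual y - Xb = u+ - u-
    c = np.concatenate([lam, lam, tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p + 2 * n),
                  method="highs")
    return res.x[:p] - res.x[p:2 * p]

def select_lambda(X, y, tau, grid, beta_tilde, gamma=1.0, crit="bic"):
    """Choose lambda_n by minimizing the BIC/GIC objective in (12)."""
    n, p = X.shape
    # Gamma_n: log(n)/n for BIC; log(p_n)*log(log(n))/n for GIC
    Gamma = np.log(n) / n if crit == "bic" else np.log(p) * np.log(np.log(n)) / n
    best = None
    for lam in grid:
        # adaptive weights; a small floor avoids dividing by an exact zero
        lam_j = lam / np.maximum(np.abs(beta_tilde), 1e-8) ** gamma
        b = penalized_qr(X, y, tau, lam_j)
        size = int(np.sum(np.abs(b) > 1e-6))           # |S(lambda)|
        resid = y - X @ b
        loss = np.mean(np.where(resid >= 0, tau * resid, (tau - 1) * resid))
        crit_val = np.log(loss) + Gamma * size
        if best is None or crit_val < best[0]:
            best = (crit_val, lam, b)
    return best[1], best[2]
```

Setting `lam = 0` in `penalized_qr` recovers the unpenalized first-step QR estimate used for the adaptive weights.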
Thus, for our simulation design with large $\gamma$ and small $\tilde{\beta}_{\tau,j}$ satisfying $|\tilde{\beta}_{\tau,j}| < 1$, the $\lambda_n$ of ALQR can be much smaller than the $\lambda_n$ of LASSO.

5.2 Simulation results

Table 10 shows the results with BIC. It presents a comparison of the finite-sample performance of four prediction strategies. In terms of overall performance, ALQR has the smallest FPE and the largest $R^2$ under scenarios (1) and (2). When some predictors in fact have no predictability for the $\tau$th quantile of $y_t$, i.e., when the true values of some coefficients are zero, ALQR dominates the other methods. In scenario (3), where all coefficients are nonzero and many of them are small, the prediction performance of the ALASSO-based methods is still comparable to or better than that of the other methods.

In Table 12, we summarize the results with GIC. They provide further evidence supporting the use of ALQR. For the first two scenarios, where certain degrees of sparsity are imposed on the underlying data generating process, ALQR with GIC performs uniformly better than the other three methods. In scenario (3) with no sparsity, ALQR is slightly worse than LASSO but comparable at $\tau = 0.5$ and $\tau = 0.9$. From Tables 11 and 13, we confirm that the number of predictors selected by ALQR is always closer to the true number of nonzero predictors, supporting the empirical results in Section 3 and the theoretical results in Section 4.
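The out-of-sample measures can be computed as in the sketch below. Since (8) and (9) are not reproduced in this section, the definitions here are assumptions for illustration: FPE is taken to be the average check loss of the quantile forecasts, and $R^2$ a Koenker-Machado-type out-of-sample measure relative to the unconditional-quantile benchmark.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau applied elementwise."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def oos_measures(y_test, q_pred, y_train, tau):
    """Out-of-sample FPE and R^2 for quantile forecasts (assumed forms of
    the paper's (8)-(9)): FPE is average check loss, and R^2 compares it
    against the unconditional tau-quantile of the training sample."""
    fpe = np.mean(check_loss(y_test - q_pred, tau))
    q_unc = np.quantile(y_train, tau)            # unconditional benchmark
    bench = np.mean(check_loss(y_test - q_unc, tau))
    r2 = 1.0 - fpe / bench
    return fpe, r2
```

A perfect forecast gives FPE = 0 and $R^2$ = 1; a forecast no better than the unconditional quantile gives $R^2 \le 0$, matching the negative LASSO entries in Tables 2 and 4.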
6 Conclusion

In this paper, we show that the adaptive LASSO for quantile regression is attractive for forecasting with stationary and nonstationary predictors as well as cointegrated predictors. The framework is general in that the regressors may have mixed roots, without requiring the researcher's knowledge of the specific structure of the order of integration. In this general framework, the proposed ALQR estimator is shown to preserve the oracle property in prediction. These advantages offer substantial convenience and robustness to empirical researchers working on quantile prediction with time series data.

We focus on the case in which the number of covariates ($p$) is smaller than the sample size ($n$), even though we allow $p$ to increase as $n$ grows. This framework covers a wide range of practical applications in economics, such as the stock return quantile prediction we study in this paper. It would be interesting future research to allow $p$ even larger than $n$, which has not been studied in this general time series framework.

Figure 1: Time-series plots of the persistent predictors (dp, dy, ep, bm, dfy, ntis, lty, tbl).
Note: All plots are based on 792 monthly observations ranging from January 1952 to December 2017. The full predictor names are defined in the main text.
Figure 2: Time-series plots of the stationary predictors (svar, dfr, ltr, infl).
Figure 3: Estimated coefficients (coef.value) over time for the nonstationary predictors (dp, dy, ep, bm, dfy, ntis, lty, tbl) and the stationary predictors (svar, dfr, infl, ltr) at τ = 0.1, 0.5, and 0.9. Note: The full predictor names are available from Section 3.
Table 2: The out-of-sample prediction of stock returns evaluated by 12-period average: FPE and R² (λ selected by BIC)

              τ=0.05     0.1        0.5        0.9        0.95
FPE   QR      0.0036     0.0046     —          —          —
      LASSO   0.0039     0.0061     0.0053     0.0036     0.0023
      ALQR    —          —          —          —          —
R²    QR      0.0743     0.0988     —          —          —
      LASSO   -0.0080    -0.1939    -0.0646    0.2648     0.2895
      ALQR    —          —          —          —          —
Table 3: The number of selected predictors (B_num) and λ selected by BIC (12-period average)

              τ=0.05     0.1        0.5        0.9        0.95
B_num LASSO   4.0000     5.3333     1.4167     1.3333     5.0000
      ALQR    4.2500     1.0000     0.0000     5.0000     4.0000
λ     LASSO   0.7000     0.4593     9.1677     3.1417     0.2000
      ALQR    0.0157     0.2993     4.8868     0.0002     0.0011

Table 4: The out-of-sample prediction of stock returns evaluated by 12-period average: FPE and R² (λ selected by GIC)

              τ=0.05     0.1        0.5        0.9        0.95
FPE   QR      0.0036     0.0036     —          —          —
      LASSO   0.0039     0.0039     0.0053     0.0038     0.0023
      ALQR    —          —          —          —          —
R²    QR      0.0743     0.0743     —          —          —
      LASSO   -0.0047    -0.0047    -0.0646    0.2144     0.2895
      ALQR    —          —          —          —          —
Table 5: The number of selected predictors (B_num) and λ selected by GIC (12-period average)

              τ=0.05     0.1        0.5        0.9        0.95
B_num LASSO   7.0000     7.0000     1.4167     5.9167     5.0000
      ALQR    4.2500     4.2500     0.0000     5.0000     4.0000
λ     LASSO   0.2000     0.2000     9.1677     0.3167     0.2000
      ALQR    0.0157     0.0157     4.8868     0.0002     0.0011

Table 6: The out-of-sample prediction of stock returns evaluated by 24-period average: FPE and R² (λ selected by BIC)

                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.0036     0.0055     0.0086     0.0041     —
      LASSO     0.0038     0.0060     —          —          —
      unc. q.   0.0037     0.0054     —          —          —
R²    QR        0.0369     -0.0074    -0.0825    0.2193     —
      LASSO     -0.0099    -0.1104    —          —          —
Table 7: The number of selected predictors (B_num) and λ selected by BIC (24-period average)

              τ=0.05     0.1        0.5        0.9        0.95
B_num LASSO   4.0000     4.8333     1.6250     1.5000     5.0000
      ALQR    2.6250     1.0000     0.0000     4.8333     4.0000
λ     LASSO   0.7000     0.6177     8.5708     3.0750     0.2000
      ALQR    0.0503     0.3669     6.3314     0.0003     0.0013

Table 8: The out-of-sample prediction of stock returns evaluated by 24-period average: FPE and R² (λ selected by GIC)

                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.0036     0.0055     0.0086     0.0041     —
      LASSO     0.0038     0.0057     —          —          —
      unc. q.   0.0037     0.0054     —          —          —
R²    QR        0.0369     -0.0074    -0.0825    0.2193     —
      LASSO     -0.0236    -0.0487    —          —          —
Table 9: The number of selected predictors (B_num) and λ selected by GIC (24-period average)

              τ=0.05     0.1        0.5        0.9        0.95
B_num LASSO   6.5833     9.0833     1.6250     5.7083     5.0000
      ALQR    2.6250     1.0000     0.0000     4.8750     4.0000
λ     LASSO   0.2833     0.0968     8.5708     0.3625     0.2000
      ALQR    0.0503     0.3669     6.3314     0.0002     0.0013

Table 10: The results of simulation experiments: FPE and R² (λ selected by BIC)

Scenario (1): 6 zero coefficients
                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.1821     0.2835     0.5807     0.2817     0.1795
      LASSO     0.1091     0.1813     0.4048     0.1807     0.1084
      ALQR      —          —          —          —          —
      unc. q.   2.7366     4.9691     12.1063    4.8815     2.6289
R²    QR        0.9335     0.9430     0.9520     0.9423     0.9317
      LASSO     0.9601     0.9635     0.9666     0.9630     0.9588
      ALQR      —          —          —          —          —
      unc. q.   0.0000     0.0000     0.0000     0.0000     0.0000

Scenario (2): 4 zero coefficients
                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.1845     0.2864     0.5882     0.2878     0.1848
      LASSO     0.1085     0.1818     0.4056     0.1800     0.1078
      ALQR      —          —          —          —          —
      unc. q.   2.6803     4.8142     12.1974    4.8994     2.5749
R²    QR        0.9312     0.9405     0.9518     0.9413     0.9282
      LASSO     0.9595     0.9622     0.9667     0.9633     0.9581
      ALQR      —          —          —          —          —
      unc. q.   0.0000     0.0000     0.0000     0.0000     0.0000

Scenario (3): 0 zero coefficients
                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.1927     0.2968     0.5992     0.2964     0.1906
      LASSO     —          —          —          —          —
      ALQR      0.1099     0.1825     —          —          —
R²    ALQR      0.9592     0.9630     —          —          —
Table 11: The number of selected predictors (B_num) and λ selected by BIC (simulation experiments)

Scenario (1): 6 zero coefficients
              τ=0.05      0.1         0.5         0.9         0.95
B_num LASSO   9.80        9.39        8.99        9.43        9.74
      ALQR    6.24        6.06        5.97        6.05        6.26
λ     LASSO   8.271500    15.676250   36.464583   15.170958   8.556958
      ALQR    0.000171    0.000238    0.000399    0.000234    0.000169

Scenario (2): 4 zero coefficients
              τ=0.05      0.1         0.5         0.9         0.95
B_num LASSO   10.33       10.09       9.74        10.09       10.27
      ALQR    8.23        8.056       7.99        8.04        8.20
λ     LASSO   6.750593    12.559635   29.983385   12.217093   7.177177
      ALQR    0.000117    0.000160    0.000254    0.000161    0.000118

Scenario (3): 0 zero coefficients
              τ=0.05      0.1         0.5         0.9         0.95
B_num LASSO   11.97       11.99       12.00       11.99       11.98
      ALQR    11.77       11.86       11.95       11.86       11.78
λ     LASSO   0.200927    0.158552    0.064593    0.112010    0.197302
      ALQR    1.75E-05    1.88E-05    1.87E-05    1.89E-05    1.75E-05

Table 12: The results of simulation experiments: FPE and R² (λ selected by GIC)

Scenario (1): 6 zero coefficients
                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.1815     0.2834     0.5805     0.2817     0.1793
      LASSO     0.1084     0.1817     0.4045     0.1805     0.1084
      ALQR      —          —          —          —          —
      unc. q.   2.7285     4.9162     12.0961    4.8751     2.6290
R²    QR        0.9335     0.9424     0.9520     0.9422     0.9318
      LASSO     0.9603     0.9630     0.9666     0.9630     0.9588
      ALQR      —          —          —          —          —
      unc. q.   0.0000     0.0000     0.0000     0.0000     0.0000

Scenario (2): 4 zero coefficients
                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.1852     0.2876     0.5880     0.2874     0.1848
      LASSO     0.1090     0.1825     0.4035     0.1799     0.1076
      ALQR      —          —          —          —          —
      unc. q.   2.7245     4.8459     12.2354    4.8893     2.5796
R²    QR        0.9320     0.9407     0.9519     0.9412     0.9283
      LASSO     0.9600     0.9623     0.9670     0.9632     0.9583
      ALQR      —          —          —          —          —
      unc. q.   0.0000     0.0000     0.0000     0.0000     0.0000

Scenario (3): 0 zero coefficients
                τ=0.05     0.1        0.5        0.9        0.95
FPE   QR        0.1945     0.2972     0.6002     0.2964     0.1906
      LASSO     —          —          —          —          —
      ALQR      0.1105     0.1820     —          —          —
R²    QR        0.9272     0.9398     0.9509     0.9395     0.9289
      LASSO     —          —          —          —          —
      ALQR      0.9586     0.9631     —          —          —
Table 13: The number of selected predictors (B_num) and λ selected by GIC (simulation experiments)

Scenario (1): 6 zero coefficients
              τ=0.05      0.1         0.5         0.9         0.95
B_num LASSO   10.14       9.79        9.35        9.80        10.13
      ALQR    6.80        6.36        6.11        6.35        6.76
λ     LASSO   6.205000    11.660042   28.387583   11.542000   6.319167
      ALQR    0.000122    0.000193    0.000355    0.000194    0.000124

Scenario (2): 4 zero coefficients
              τ=0.05      0.1         0.5         0.9         0.95
B_num LASSO   10.58       10.37       9.98        10.34       10.56
      ALQR    8.58        8.28        8.09        8.29        8.55
λ     LASSO   5.175218    9.325552    23.824677   9.457427    5.274260
      ALQR    8.46E-05    0.000130    0.000230    0.000129    8.66E-05

Scenario (3): 0 zero coefficients
              τ=0.05      0.1         0.5         0.9         0.95
B_num LASSO   11.98       11.99       12.00       12.00       11.98
      ALQR    11.86       11.92       11.98       11.93       11.85
λ     LASSO   0.142135    0.123552    0.064343    0.090093    0.145260
      ALQR    1.34E-05    1.39E-05    1.27E-05    1.36E-05    1.40E-05

Appendices
A Proofs for Section 4.1
Proof of Lemma 4.1:
Let $a_n = p_n^{\alpha}/n$ and $c \in \mathbb{R}^{p_n}$ with $\|c\| = C$, where $C$ is a finite constant. Denote the (unpenalized) quantile objective function by $Q^{QR}_n(\beta_\tau)$.

To show consistency, it suffices to show that for any $\epsilon > 0$ there exists a sufficiently large constant $C$ such that
$$P\left\{ \inf_{\|c\|=C} Q^{QR}_n(\beta_\tau + a_n c) > Q^{QR}_n(\beta_\tau) \right\} \geq 1 - \epsilon. \qquad (13)$$
This inequality implies that, with probability at least $1-\epsilon$, there is a local minimizer $\tilde{\beta}_\tau$ in the shrinking ball $\{\beta_\tau + a_n c, \|c\| \leq C\}$ such that $\|\tilde{\beta}_\tau - \beta_\tau\| = O_p(a_n)$. Thus, the proof is completed if we show that the following term is positive:
$$Q^{QR}_n(\beta_\tau + a_n c) - Q^{QR}_n(\beta_\tau) = \sum_{t=1}^n \rho_\tau(u_{t\tau} - x_{t-1}'a_n c) - \sum_{t=1}^n \rho_\tau(u_{t\tau}). \qquad (14)$$
By Knight's identity,
$$\sum_{t=1}^n \left[ \rho_\tau(u_{t\tau} - x_{t-1}'a_n c) - \rho_\tau(u_{t\tau}) \right] = -a_n \sum_{t=1}^n x_{t-1}'c \cdot \psi_\tau(u_{t\tau}) + \sum_{t=1}^n \int_0^{x_{t-1}'a_n c} \big( 1(u_{t\tau} \leq s) - 1(u_{t\tau} \leq 0) \big)\, ds$$
$$= -a_n \sum_{t=1}^n x_{t-1}'c \cdot \psi_\tau(u_{t\tau}) + \sum_{t=1}^n E\left[ \int_0^{x_{t-1}'a_n c} \big( 1(u_{t\tau} \leq s) - 1(u_{t\tau} \leq 0) \big)\, ds \right] + \sum_{t=1}^n \left\{ \int_0^{x_{t-1}'a_n c} \big( 1(u_{t\tau} \leq s) - 1(u_{t\tau} \leq 0) \big)\, ds - E\left[ \int_0^{x_{t-1}'a_n c} \big( 1(u_{t\tau} \leq s) - 1(u_{t\tau} \leq 0) \big)\, ds \right] \right\} \equiv I_1 + I_2 + I_3.$$
We will show that $I_1$ and $I_3$ are dominated by $I_2$ and that $I_2 > 0$. For $I_1$,
$$E|I_1|^2 = a_n^2 \left\{ \sum_{t=1}^n c' E\big[ \psi_\tau^2(u_{t\tau})\, x_{t-1}x_{t-1}' \big] c + 2\sum_{t=2}^n \sum_{k=1}^{t-1} c' E\big[ \psi_\tau(u_{t\tau})\psi_\tau(u_{k\tau})\, x_{t-1}x_{k-1}' \big] c \right\} = a_n^2 \sum_{t=1}^n c' E\big[ \psi_\tau^2(u_{t\tau})\, x_{t-1}x_{t-1}' \big] c \leq a_n^2 \sum_{t=1}^n t\, \bar{c}_B C^2 \leq C^2 a_n^2 \bar{c}_B\, \frac{n(n+1)}{2}.$$
The second equality holds since, for $k \leq t-1$,
$$E\big[ \psi_\tau(u_{t\tau})\psi_\tau(u_{k\tau})\, x_{t-1}x_{k-1}' \big] = E\big[ E_{t-1}[\psi_\tau(u_{t\tau})]\, \psi_\tau(u_{k\tau})\, x_{t-1}x_{k-1}' \big] = 0,$$
and the last inequality holds by Assumption B1. Therefore, Chebyshev's inequality implies that
$$I_1 = O_p\big(a_n \bar{c}_B^{1/2} n\big) = O_p\big(a_n^2 \bar{c}_B^{1/2} n^2 p_n^{-\alpha}\big).$$
Next, we derive the lower bound of $I_2$:
$$I_2 = \sum_{t=1}^n E \int_0^{a_n x_{t-1}'c} \big( F_{t-1}(s) - F_{t-1}(0) \big)\, ds = \sum_{t=1}^n E \int_0^{a_n x_{t-1}'c} f_{t-1}(0)\, s \, ds \, \{1 + o_p(1)\} = \frac{1}{2} a_n^2 \sum_{t=1}^n c' E\big[ f_{t-1}(0)\, x_{t-1}x_{t-1}' \big] c \, \{1 + o(1)\} \geq \frac{C^2 a_n^2 \underline{c}_A}{2} \cdot \frac{n(n+1)}{2} = O\big(a_n^2 n^2 \underline{c}_A\big).$$
The first equality holds by the law of iterated expectations and the second by a Taylor expansion. The inequality holds under Assumption A1. Finally, we bound $I_3$:
$$Var(I_3) = Var\left( \sum_{t=1}^n \int_0^{x_{t-1}'a_n c} \big( 1(u_{t\tau} \leq s) - 1(u_{t\tau} \leq 0) \big)\, ds \right) \leq E\left( \sum_{t=1}^n \int_0^{x_{t-1}'a_n c} \big( 1(u_{t\tau} \leq s) - 1(u_{t\tau} \leq 0) \big)\, ds \right)^2$$
$$\leq E\left[ \sum_{t=1}^n \big( x_{t-1}'a_n c \big)^2 + 2 \sum_{t=2}^n \sum_{k=1}^{t-1} \big| x_{t-1}'a_n c \big| \big| x_{k-1}'a_n c \big| \right] = a_n^2 \sum_{t=1}^n c' E\big[ x_{t-1}x_{t-1}' \big] c + 2a_n^2 \sum_{t=2}^n \sum_{k=1}^{t-1} E\big[ \big| x_{t-1}'c \big| \big| x_{k-1}'c \big| \big] \equiv V_1 + V_2.$$
Using arguments similar to those for $I_1$, we have $V_1 \leq a_n^2 C^2 (\bar{c}_A/c_f)\, n(n+1)/2 = O\big(a_n^2 \bar{c}_A c_f^{-1} n^2\big)$. By the Cauchy-Schwarz inequality, Assumption A1, and $t > k$, we have
$$V_2 \leq 2a_n^2 \sum_{t=2}^n \sum_{k=1}^{t-1} \sqrt{E\big[ (x_{t-1}'c)^2 \big]}\, \sqrt{E\big[ (x_{k-1}'c)^2 \big]} \leq 2a_n^2 \sum_{t=2}^n \sum_{k=1}^{t-1} \sqrt{\frac{C^2 t\, \bar{c}_A}{c_f}}\, \sqrt{\frac{C^2 t\, \bar{c}_A}{c_f}} \leq 2a_n^2\, \frac{C^2 \bar{c}_A}{c_f} \sum_{t=2}^n \sum_{k=1}^{t-1} t = O\big(a_n^2 \bar{c}_A c_f^{-1} n^3\big).$$
Therefore $Var(I_3) = O(a_n^2 \bar{c}_A n^3)$, and Chebyshev's inequality implies that
$$I_3 = O_p\big(\bar{c}_A^{1/2} a_n n^{3/2}\big) = O_p\big( \bar{c}_A^{1/2} a_n^2 n^{5/2} p_n^{-\alpha} \big).$$
By Assumptions p1 and p2, $I_2$ dominates $I_1$ and $I_3$, and we establish the desired result. $\blacksquare$
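The decomposition above rests on Knight's identity, $\rho_\tau(u-v) - \rho_\tau(u) = -v\,\psi_\tau(u) + \int_0^v \left[1(u \le s) - 1(u \le 0)\right] ds$ with $\psi_\tau(u) = \tau - 1(u < 0)$. A quick numerical check in Python (a sketch; for $u \neq 0$, where the indicator conventions coincide):

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - 1(u < 0))."""
    return u * (tau - (u < 0))

def knight_rhs(u, v, tau, m=200_000):
    """Right-hand side of Knight's identity; the integral over [0, v]
    (v may be negative) is evaluated by a Riemann sum."""
    ds = v / m
    s = np.arange(m) * ds
    integral = np.sum((u <= s).astype(float) - float(u <= 0)) * ds
    return -v * (tau - float(u < 0)) + integral

# rho(u - v) - rho(u) should match knight_rhs(u, v, tau) up to the
# Riemann-sum discretization error
for tau in (0.3, 0.7):
    for u in (1.0, -0.7):
        for v in (2.0, -1.5):
            lhs = rho(u - v, tau) - rho(u, tau)
            assert abs(lhs - knight_rhs(u, v, tau)) < 1e-3
```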
Proof of Theorem 4.1:
For simplicity, in this proof we use $\hat{\beta}_\tau$ to denote the ALQR estimator $\hat{\beta}^{ALQR}_\tau$. Without loss of generality, let $\beta_{\tau,1}, \beta_{\tau,2}, \ldots, \beta_{\tau,q_n}$ be nonzero and $\beta_{\tau,q_n+1}, \beta_{\tau,q_n+2}, \ldots, \beta_{\tau,p_n}$ be zero. Let $E_{t-1}(\cdot) \equiv E(\cdot\,|\,\mathcal{F}_{t-1})$.

To show consistency, it suffices to show that for any $\epsilon > 0$ there exists a sufficiently large constant $C$ such that
$$P\left\{ \inf_{\|c\| \leq C} Q_n(\beta_\tau + a_n c) > Q_n(\beta_\tau) \right\} \geq 1 - \epsilon. \qquad (15)$$
This inequality implies that, with probability at least $1 - \epsilon$, there is a local minimizer $\hat{\beta}_\tau$ in the shrinking ball $\{\beta_\tau + a_n c, \|c\| \leq C\}$ such that $\|\hat{\beta}_\tau - \beta_\tau\| = O_p(a_n)$. Since
$$Q_n(\beta_\tau + a_n c) - Q_n(\beta_\tau) \qquad (16)$$
$$= \left( \sum_{t=1}^n \rho_\tau(u_{t\tau} - x_{t-1}'a_n c) - \sum_{t=1}^n \rho_\tau(u_{t\tau}) \right) + \sum_{j=1}^{p_n} \lambda_{n,j}\big( |\beta_{\tau,j} + a_n c_j| - |\beta_{\tau,j}| \big)$$
$$= \left( \sum_{t=1}^n \rho_\tau(u_{t\tau} - x_{t-1}'a_n c) - \sum_{t=1}^n \rho_\tau(u_{t\tau}) \right) + \sum_{j=1}^{q_n} \lambda_{n,j}\big( |\beta_{\tau,j} + a_n c_j| - |\beta_{\tau,j}| \big) + \sum_{j=q_n+1}^{p_n} \lambda_{n,j} |a_n c_j|$$
$$\geq \left( \sum_{t=1}^n \rho_\tau(u_{t\tau} - x_{t-1}'a_n c) - \sum_{t=1}^n \rho_\tau(u_{t\tau}) \right) + \sum_{j=1}^{q_n} \lambda_{n,j}\big( |\beta_{\tau,j} + a_n c_j| - |\beta_{\tau,j}| \big) \equiv D_2 + D_1,$$
we need to show that $D_1 + D_2$ is positive.

For $D_1$, we know $\big| |\beta_{\tau,j} + a_n c_j| - |\beta_{\tau,j}| \big| \leq |a_n c_j|$. Given that $\tilde{\beta}_\tau$ is an $a_n^{-1}$-consistent estimator of $\beta_\tau$, we have $a_n^{-1}(\tilde{\beta}_{\tau,j} - \beta_{\tau,j}) = O_p(1)$, and then $\tilde{\beta}_{\tau,j} = O_p(a_n) + \beta_{\tau,j} = o_p(1) + \beta_{\tau,j}$. Thus,
$$|D_1| \leq \sum_{j=1}^{q_n} \lambda_{n,j}|a_n c_j| = \lambda_n a_n \sum_{j=1}^{q_n} \left| \frac{c_j}{|\tilde{\beta}_{\tau,j}|^\gamma} \right| \leq \lambda_n a_n \left( \sum_{j=1}^{q_n} \frac{1}{(\tilde{\beta}_{\tau,j})^{2\gamma}} \right)^{1/2} \|c\| \qquad (17)$$
$$\leq C \lambda_n a_n \left( \sum_{j=1}^{q_n} \frac{1}{(o_p(1) + \beta_{\tau,j})^{2\gamma}} \right)^{1/2} = C \lambda_n a_n O_p(1)\, q_n^{1/2} \leq O_p\big(\lambda_n a_n q_n^{1/2}\big) \leq O_p\big(\lambda_n a_n p_n^{1/2}\big). \qquad (18)$$
Next we consider $D_2$. Following the proof of Lemma 4.1, we have $D_2 \geq O_p(a_n^2 n^2 \underline{c}_A)$, and it is positive with probability approaching 1. Under Assumption λ1, $D_2$ dominates $D_1$. We complete the proof of Theorem 4.1. $\blacksquare$

Proof of Theorem 4.2:
From Theorem 4.1, for a sufficiently large constant $C$, $\hat{\beta}^{ALQR}_\tau$ is a local minimizer lying in the ball $\{\beta_\tau + a_n c, \|c\| \leq C\}$ with probability tending to 1, where $a_n = p_n^\alpha/n$. For simplicity, in this proof we use $\hat{\beta}_\tau$ to denote the ALQR estimator $\hat{\beta}^{ALQR}_\tau$.

First, note that the subgradient of the unpenalized objective function, $s(\beta_\tau) = (s_1(\beta_\tau), \ldots, s_{p_n}(\beta_\tau))'$, is given by (Sherwood et al. (2016), page 298 and Lemma 1):
$$s_j(\beta_\tau) = -\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( y_{t\tau} - x_{t-1}'\beta_\tau \big) + \sum_{t=1}^n x_{t-1,j}\, k_t \quad \text{for } 1 \leq j \leq p_n,$$
where $k_t = 0$ if $y_{t\tau} - x_{t-1}'\beta_\tau \neq 0$ and $k_t \in [-\tau, 1-\tau]$ if $y_{t\tau} - x_{t-1}'\beta_\tau = 0$. Let $\mathcal{D} = \{t : y_{t\tau} - x_{t-1}'\hat{\beta}_\tau = 0\}$; then
$$s_j(\hat{\beta}_\tau) = -\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( y_{t\tau} - x_{t-1}'\hat{\beta}_\tau \big) + \sum_{t \in \mathcal{D}} x_{t-1,j}\big( k_t^* + (1-\tau) \big) = -\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( y_{t\tau} - x_{t-1}'\hat{\beta}_\tau \big) + h_n,$$
where $h_n := \sum_{t \in \mathcal{D}} x_{t-1,j}\big( k_t^* + (1-\tau) \big)$ and
$$k_t^* \begin{cases} = 0, & \text{if } y_{t\tau} - x_{t-1}'\hat{\beta}_\tau \neq 0, \\ \in [-1, 0], & \text{if } y_{t\tau} - x_{t-1}'\hat{\beta}_\tau = 0. \end{cases}$$
With probability one (Koenker (2005, Section 2.2)), $|\mathcal{D}| = p_n$. Thus $h_n = O_p\big(p_n^{3/2}\big) = O_p\big(n^{(3/2)\zeta}\big)$.

Next, define the subgradient of the penalized objective function as $S_j(\beta_\tau)$:
$$S_j(\beta_\tau) = -\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( y_{t\tau} - x_{t-1}'\beta_\tau \big) + h_n + \frac{\lambda_n}{|\tilde{\beta}_{j,\tau}|^\gamma}\, sgn(\beta_{j,\tau}) = -\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( u_{t\tau} - x_{t-1}'\delta_\tau \big) + h_n + \frac{\lambda_n}{|\tilde{\beta}_{j,\tau}|^\gamma}\, sgn(\beta_{j,\tau}),$$
where $u_{t\tau} = u_t - F_u^{-1}(\tau)$ and $\delta_\tau = \hat{\beta}_\tau - \beta_\tau$. The subgradient condition requires that at the optimum, $\hat{\beta}_\tau$, $0 \in S_j(\hat{\beta}_\tau)$. That is,
$$\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( u_{t\tau} - x_{t-1}'\hat{\delta}_\tau \big) - h_n = \frac{\lambda_n}{|\tilde{\beta}_{j,\tau}|^\gamma}\, sgn(\hat{\beta}_{j,\tau}). \qquad (19)$$
If $j = 1, \ldots, q_n$, this implies that $|sgn(\hat{\beta}_{j,\tau})| = 1$.
Then we can write the subgradient condition (19) as
$$\left| \sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( u_{t\tau} - \hat{\delta}_\tau' x_{t-1} \big) - h_n \right| = \frac{\lambda_n}{|\tilde{\beta}_{j,\tau}|^\gamma}. \qquad (20)$$
In the following, we show that this subgradient condition does not hold for $j \notin A$, i.e., $j = q_n+1, \ldots, p_n$. It suffices to show:

(a) $\big(n^{\alpha\zeta+2}\bar{c}_A^{1/2}\big)^{-1}\, \lambda_n / |\tilde{\beta}_{j,\tau}|^\gamma \to \infty$;

(b) $\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( u_{t\tau} - x_{t-1}'\delta_\tau \big) - h_n \leq O_p\big(n^{\alpha\zeta+2}\bar{c}_A^{1/2}\big)$.

For (b), we first show that the first term on the left-hand side of (20) is dominated by $O_p(n^{\alpha\zeta+2}\bar{c}_A^{1/2})$:
$$\sum_{t=1}^n x_{t-1,j}\, \psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) = \sum_{t=1}^n x_{t-1,j}\Big[ \psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) - E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) - \psi_\tau(u_{t\tau}) + E_{t-1}\psi_\tau(u_{t\tau}) \Big] + \sum_{t=1}^n x_{t-1,j}\, E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) + \sum_{t=1}^n x_{t-1,j}\, \psi_\tau(u_{t\tau}) \equiv I_1 + I_2 + I_3. \qquad (21)$$
For $I_1$, write $I_1 = A - B$, where
$$A \equiv \sum_{t=1}^n x_{t-1,j}\Big[ \psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) - E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) \Big], \qquad B \equiv \sum_{t=1}^n x_{t-1,j}\Big[ \psi_\tau(u_{t\tau}) - E_{t-1}\psi_\tau(u_{t\tau}) \Big].$$
A Taylor expansion gives
$$E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) = E_{t-1}\psi_\tau(u_{t\tau}) + \left.\frac{\partial E_{t-1}\psi_\tau(u_{t\tau} - \delta' x_{t-1})}{\partial \delta'}\right|_{\delta = \delta_\tau}\!\delta_\tau + o_p(a_n) = -x_{t-1}' f_{t-1}(0)\, \delta_\tau + o_p(a_n), \qquad (22)$$
$$E_{t-1}\psi_\tau^2\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) = E_{t-1}\psi_\tau^2(u_{t\tau}) + \left.\frac{\partial E_{t-1}\psi_\tau^2(u_{t\tau} - \delta' x_{t-1})}{\partial \delta'}\right|_{\delta = \delta_\tau}\!\delta_\tau + o_p(a_n) = \tau(1-\tau) - x_{t-1}' f_{t-1}(0)\, \delta_\tau + o_p(a_n). \qquad (23)$$
Then, we have
$$E(A) = E\,E_{t-1}(A) = E\left\{ \sum_{t=1}^n x_{t-1,j}\Big[ E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) - E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) \Big] \right\} = 0,$$
and
$$E_{t-1}(A^2) = \sum_{t=1}^n x_{t-1,j}^2\, E_{t-1}\Big[ \psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) - E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) \Big]^2 = \sum_{t=1}^n x_{t-1,j}^2\Big[ E_{t-1}\psi_\tau^2\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) - \big( E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) \big)^2 \Big]$$
$$= \sum_{t=1}^n x_{t-1,j}^2\Big( \tau(1-\tau) - x_{t-1}'cf_{t-1}(0)\,O_p(a_n) + o_p(a_n) \Big) - \sum_{t=1}^n x_{t-1,j}^2\Big( c'x_{t-1}x_{t-1}'c\,f_{t-1}^2(0)\,O_p(a_n^2) - x_{t-1}'cf_{t-1}(0)\,O_p(a_n)o_p(a_n) + o_p(a_n^2) \Big) \quad \text{(by (22) and (23))},$$
since $\delta_\tau = \hat{\beta}_\tau - \beta_\tau = O_p(a_n)\,c$ and $x_{t-1}'\delta_\tau = x_{t-1}'c\,O_p(a_n)$. (24)

Under Assumption f1 and Assumption A1, we have, by Jensen's inequality,
$$E\left[ \sum_{t=1}^n c' f_{t-1}(0)\, x_{t-1}x_{t-1}'c \right]^{1/2} \leq \left[ \sum_{t=1}^n E\, c' f_{t-1}(0)\, x_{t-1}x_{t-1}'c \right]^{1/2} \leq O\big(\bar{c}_A^{1/2} n\big),$$
$$E\left( \max_{1\leq t\leq n} |x_{t-1}'c f_{t-1}(0)| \right)^2 \leq E\left( \sum_{t=1}^n |x_{t-1}'c f_{t-1}(0)| \right)^2 = E\left[ \sum_{t=1}^n \big(x_{t-1}'c f_{t-1}(0)\big)^2 + 2\sum_{t=2}^n\sum_{k=1}^{t-1} |x_{t-1}'c f_{t-1}(0)|\, |x_{k-1}'c f_{k-1}(0)| \right]$$
$$\leq c_f \sum_{t=1}^n E\big( c'x_{t-1}x_{t-1}'c\, f_{t-1}(0) \big) + 2\sum_{t=2}^n\sum_{k=1}^{t-1} \sqrt{E\big[ (x_{t-1}'c f_{t-1}(0))^2 \big]}\, \sqrt{E\big[ (x_{k-1}'c f_{k-1}(0))^2 \big]} \leq O\big(c_f \bar{c}_A n^2\big) + O\big(c_f \bar{c}_A n^3\big) = O\big(c_f \bar{c}_A n^3\big)$$
(by the Cauchy-Schwarz inequality), and
$$E\left( \max_{1\leq t\leq n} |x_{t-1}'c f_{t-1}(0)| \right) \leq \left[ E\left( \max_{1\leq t\leq n} |x_{t-1}'c f_{t-1}(0)| \right)^2 \right]^{1/2} \leq \big[ O(c_f \bar{c}_A n^3) \big]^{1/2} = O\big(c_f^{1/2}\bar{c}_A^{1/2} n^{3/2}\big).$$
Using the above results and the Cauchy-Schwarz inequality, we can show that
$$E\,E_{t-1}(A^2) \leq O(n^2) + O(a_n)\,E\!\left[ \max_{1\leq t\leq n}|x_{t-1}'cf_{t-1}(0)| \sum_{t=1}^n x_{t-1,j}^2 \right] + o(n^2 a_n) + O(a_n)\,E\!\left[ \Big( \max_{1\leq t\leq n}|x_{t-1}'cf_{t-1}(0)| \Big)^2 \sum_{t=1}^n x_{t-1,j}^2 \right] + o(a_n)\,E\!\left[ \max_{1\leq t\leq n}|x_{t-1}'cf_{t-1}(0)| \sum_{t=1}^n x_{t-1,j}^2 \right] + o(n^2 a_n^2)$$
$$\leq O(n^2) + O(a_n)O(n^2)O\big(c_f^{1/2}\bar{c}_A^{1/2}n^{3/2}\big) + o(n^2 a_n) + O(a_n)O(n^2)O\big(c_f\bar{c}_A n^2\big) + o(a_n)O(n)O\big(c_f^{1/2}\bar{c}_A^{1/2}n^{3/2}\big) + o(n^2 a_n^2)$$
$$= O(n^2) + O\big(a_n n^{7/2}\bar{c}_A^{1/2}\big) + o(a_n n^2) + O\big(a_n n^4 \bar{c}_A\big) + o\big(a_n n^{5/2}\bar{c}_A^{1/2}\big) + o(a_n^2 n^2)$$
$$= O(n^2) + O\big(n^{\alpha\zeta+(5/2)}\bar{c}_A^{1/2}\big) + o\big(n^{\alpha\zeta+1}\big) + O\big(n^{\alpha\zeta+3}\bar{c}_A\big) + o\big(n^{\alpha\zeta+(3/2)}\bar{c}_A^{1/2}\big) + o\big(n^{2\alpha\zeta}\big) \equiv (1) + (2) + (3) + (4) + (5) + (6). \qquad (25)$$
Given the condition $0 < \alpha\zeta < 1$, it is easy to verify that (4) dominates the other terms. Thus $E(A^2) \leq O(n^{\alpha\zeta+3}\bar{c}_A)$, and Chebyshev's inequality implies that
$$A \leq O_p\big(n^{\alpha\zeta+(3/2)}\bar{c}_A^{1/2}\big). \qquad (26)$$
For $B$, since
$$E_{t-1}[\psi_\tau(u_{t\tau})] = 0, \qquad E_{t-1}[\psi_\tau^2(u_{t\tau})] = \tau(1-\tau), \qquad (27)$$
we can show that
$$E_{t-1}(B) = \sum_{t=1}^n x_{t-1,j}\Big[ E_{t-1}\psi_\tau(u_{t\tau}) - E_{t-1}\psi_\tau(u_{t\tau}) \Big] = 0, \qquad E_{t-1}(B^2) = \sum_{t=1}^n x_{t-1,j}^2\, E_{t-1}\psi_\tau^2(u_{t\tau}) = O_p(n^2),$$
which implies that
$$B \leq O_p(n). \qquad (28)$$
By the results of (26) and (28), we have
$$I_1 \leq O_p\big(n^{\alpha\zeta+(3/2)}\bar{c}_A^{1/2}\big). \qquad (29)$$
For $I_2$, by (22),
$$\sum_{t=1}^n x_{t-1,j}\, E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) = -\sum_{t=1}^n x_{t-1,j}\, x_{t-1}' f_{t-1}(0)\,\delta_\tau + o_p(a_n)\sum_{t=1}^n x_{t-1,j}.$$
Thus,
$$E\left[ \sum_{t=1}^n x_{t-1,j}\, E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) \right] = -E\sum_{t=1}^n \big[ x_{t-1,j}\, x_{t-1}'cf_{t-1}(0)\,O_p(a_n) \big] + E\big[ o_p(a_n n^{3/2}) \big] \leq O(a_n)\,E\!\left[ \max_{1\leq t\leq n}|x_{t-1}'cf_{t-1}(0)| \sum_{t=1}^n |x_{t-1,j}| \right] + o(a_n n^{3/2})$$
$$\leq O(a_n)O(n^{3/2})O\big(c_f^{1/2}\bar{c}_A^{1/2}n^{3/2}\big) + o(a_n n^{3/2}) = O\big(a_n n^3 \bar{c}_A^{1/2}\big) + o(a_n n^{3/2}) = O\big(n^{\alpha\zeta+2}\bar{c}_A^{1/2}\big), \qquad (30)$$
and, for the second moment,
$$E\left( \sum_{t=1}^n x_{t-1,j}\, E_{t-1}\psi_\tau\big( u_{t\tau} - \delta_\tau' x_{t-1} \big) \right)^2 = E\left( -\sum_{t=1}^n x_{t-1,j}\, x_{t-1}'f_{t-1}(0)\,\delta_\tau + o_p(a_n)\sum_{t=1}^n x_{t-1,j} \right)^2$$
$$\leq O(a_n^2)\,E\!\left[ \Big( \max_{1\leq t\leq n}|x_{t-1}'cf_{t-1}(0)| \Big)^2 \Big( \sum_{t=1}^n |x_{t-1,j}| \Big)^2 \right] + o(a_n^2)\,E\!\left[ \sum_{t=1}^n x_{t-1,j}^2 + 2\sum_{t=2}^n\sum_{k=1}^{t-1} x_{t-1,j}x_{k-1,j} \right] + o(a_n^2)\,E\!\left[ \Big( \max_{1\leq t\leq n}|x_{t-1}'cf_{t-1}(0)| \sum_{t=1}^n |x_{t-1,j}| \Big)\, O_p(n^{3/2}) \right]$$
$$= O\big(a_n^2 n^6 \bar{c}_A\big) = O\big(n^{2\alpha\zeta+4}\bar{c}_A\big), \qquad (31)$$
where the second inequality holds by the Cauchy-Schwarz inequality and the following result:
$$\sum_{t=2}^n\sum_{k=1}^{t-1} x_{t-1,j}\,x_{k-1,j} \leq \sum_{t=2}^n \left[ x_{t-1,j}\left( \sum_{k=1}^{t-1} x_{k-1,j} \right) \right] \leq \left( \sum_{t=2}^n x_{t-1,j}^2 \right)^{1/2} \left[ \sum_{t=2}^n \left( \sum_{k=1}^{t-1} x_{k-1,j} \right)^2 \right]^{1/2} = O_p(n)\cdot O_p(1)\left[ \sum_{t=2}^n t^3 \right]^{1/2} = O_p(n)\, O_p(n^2) = O_p(n^3).$$
By the results of (30) and (31), we apply Chebyshev's inequality and obtain that
$$I_2 \le O_p\big(n^{\alpha\zeta+2}c_A^{1/2}\big). \tag{32}$$
For $I_3$, it can be shown that
$$E_{t-1}\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau(u_{t\tau}) = \sum_{t=1}^{n} x_{t-1,j}\, E_{t-1}\psi_\tau(u_{t\tau}) = 0,$$
$$E_{t-1}\Big[\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau(u_{t\tau})\Big]^2 = \sum_{t=1}^{n} x_{t-1,j}^2\, E_{t-1}\psi_\tau^2(u_{t\tau}) + 0 = \tau(1-\tau)\, O_p(n^2).$$
This implies that
$$I_3 \le O_p(n). \tag{33}$$
By the results of (29), (32) and (33), we find the upper bound of the first term in (b):
$$\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau\big(u_{t\tau}-\delta_\tau' x_{t-1}\big) \le O_p\big(n^{\alpha\zeta+2}c_A^{1/2}\big).$$
For the second term in (b), under Assumption 3, we have
$$\frac{h_n}{n^{\alpha\zeta+2}c_A^{1/2}} = \frac{O_p(p_n^{1/2})}{n^{\alpha\zeta+2}c_A^{1/2}} = O_p(1)\, o(1) = o_p(1).$$
Hence
$$\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau\big(u_{t\tau}-x_{t-1}'\hat\delta_\tau\big) - h_n \le O_p\big(n^{\alpha\zeta+2}c_A^{1/2}\big). \tag{34}$$
For (a), under the assumption $\lambda_n n^{(1-\alpha\zeta)\gamma}\big/\big(n^{\alpha\zeta+2}c_A^{1/2}\big) \to \infty$:
$$\big(n^{\alpha\zeta+2}c_A^{1/2}\big)^{-1}\lambda_n\big|\tilde\beta_{j,\tau}\big|^{-\gamma} = \frac{\lambda_n n^{(1-\alpha\zeta)\gamma}}{n^{\alpha\zeta+2}c_A^{1/2}}\,\big|(n/p_n^{\alpha})\tilde\beta_{j,\tau}\big|^{-\gamma} = \frac{\lambda_n n^{(1-\alpha\zeta)\gamma}}{n^{\alpha\zeta+2}c_A^{1/2}}\, O_p(1) \to \infty. \tag{35}$$
By (20), (34), and (35), we can show that, for any $j \notin A$,
$$\Pr\big(j\in\hat A_n\big) \le \Pr\Big(\Big|\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau\big(u_{t\tau}-\delta_\tau' x_{t-1}\big) - h_n\Big| = \lambda_n\big|\tilde\beta_{j,\tau}\big|^{-\gamma}\Big)$$
$$= \Pr\Big(\frac{1}{n^{\alpha\zeta+2}c_A^{1/2}}\Big|\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau\big(u_{t\tau}-\delta_\tau' x_{t-1}\big) - h_n\Big| = \frac{\lambda_n n^{(1-\alpha\zeta)\gamma}}{n^{\alpha\zeta+2}c_A^{1/2}}\big|(n/p_n^{\alpha})\tilde\beta_{j,\tau}\big|^{-\gamma}\Big) \longrightarrow 0.$$

B Proofs for Section 4.2
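Before the remaining proofs, it may help to see the object these bounds concern in executable form: the adaptive LASSO quantile regression estimator minimizes $\sum_t \rho_\tau(y_t - x_{t-1}'\beta) + \lambda_n \sum_j |\tilde\beta_{j,\tau}|^{-\gamma}|\beta_j|$. The following is a minimal numerical sketch via the standard linear-programming formulation of penalized quantile regression; the data-generating process, sample size, and tuning value $\lambda_n = \sqrt{n}$ are illustrative assumptions, not choices made in the paper.

```python
# Minimal sketch: adaptive LASSO quantile regression solved as a linear program.
# The DGP, n, p, tau, gamma, and lambda_n below are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

def penalized_qr(y, X, tau, weights):
    """Solve min_b sum_t rho_tau(y_t - x_t'b) + sum_j weights[j] * |b_j| as an LP."""
    n, p = X.shape
    # Decision variables: [b_plus (p), b_minus (p), u_plus (n), u_minus (n)], all >= 0,
    # with residual y - Xb = u_plus - u_minus and b = b_plus - b_minus.
    c = np.concatenate([weights, weights, tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    z = res.x
    return z[:p] - z[p:2 * p]

rng = np.random.default_rng(0)
n, p, tau, gamma = 400, 6, 0.5, 1.0
beta0 = np.array([1.0, 0.0, 0.0, 0.8, 0.0, 0.0])   # sparse truth (illustrative)
X = rng.standard_normal((n, p))                    # stationary toy design
y = X @ beta0 + rng.standard_normal(n)             # median-zero errors

beta_pilot = penalized_qr(y, X, tau, np.zeros(p))  # unpenalized pilot estimate
lam = np.sqrt(n)                                   # illustrative tuning value
w = lam / np.maximum(np.abs(beta_pilot), 1e-10) ** gamma   # adaptive weights
beta_alqr = penalized_qr(y, X, tau, w)

print(np.round(beta_alqr, 3))
```

With this stationary Gaussian design the irrelevant coefficients are driven to (numerically) zero at the LP vertex while the signals survive, mirroring the selection-consistency result; the mixed-root designs studied in the paper would replace `X` with simulated unit-root and cointegrated processes.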
Proof of Lemma 4.2:
Let $Q_n^{QR}(\beta_\tau)$ denote the (unpenalized) quantile regression objective function. To show the result of consistency, it suffices to show that for any $\epsilon > 0$ there exists a constant $C$ such that
$$P\Big\{\inf_{\|c\|=C} Q_n^{QR}\big(\tilde\beta_\tau + a_n M_n c\big) > Q_n^{QR}\big(\tilde\beta_\tau\big)\Big\} \ge 1-\epsilon. \tag{36}$$
Here
$$a_n M_n = \begin{pmatrix} a_n^{(0)} I_r & 0 \\ 0 & a_n^{(1)} I_{p_n-r} \end{pmatrix},$$
so that $a_n M_n c = \big(a_n^{(0)}c_1,\ldots,a_n^{(0)}c_r,\; a_n^{(1)}c_{r+1},\ldots,a_n^{(1)}c_{p_n}\big)'$. As is shown in the proof of Lemma 4.1 above, the proof is completed if we show that the following term is positive:
$$Q_n^{QR}\big(\tilde\beta_\tau + a_n M_n c\big) - Q_n^{QR}\big(\tilde\beta_\tau\big) = \sum_{t=1}^{n}\rho_\tau\big(u_{t\tau} - \tilde x_{t-1}' a_n M_n c\big) - \sum_{t=1}^{n}\rho_\tau(u_{t\tau}).$$
By Knight's identity,
$$\sum_{t=1}^{n}\big[\rho_\tau\big(u_{t\tau} - \tilde x_{t-1}' a_n M_n c\big) - \rho_\tau(u_{t\tau})\big] = -a_n\sum_{t=1}^{n}\tilde x_{t-1}' M_n c\,\psi_\tau(u_{t\tau}) + \sum_{t=1}^{n}\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds$$
$$= -a_n\sum_{t=1}^{n}\tilde x_{t-1}' M_n c\,\psi_\tau(u_{t\tau}) + \sum_{t=1}^{n} E\Big[\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds\Big]$$
$$\quad + \sum_{t=1}^{n}\Big\{\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds - E\Big[\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds\Big]\Big\} \equiv I_1 + I_2 + I_3.$$
We will show that $I_1$ and $I_3$ are dominated by $I_2$, and that $I_2 > 0$.
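The two-term expansion used here is the standard application of Knight's identity (see Knight and Fu, 2000); stated cleanly for the check function:

```latex
% Check function and quantile score:
%   \rho_\tau(u) = u\,(\tau - 1\{u < 0\}), \qquad \psi_\tau(u) = \tau - 1\{u < 0\}.
% Knight's identity: for any real u and v,
\rho_\tau(u - v) - \rho_\tau(u)
  = -\,v\,\psi_\tau(u) + \int_{0}^{v}\big(1\{u \le s\} - 1\{u \le 0\}\big)\,ds .
```

Taking $u = u_{t\tau}$, $v = \tilde x_{t-1}' a_n M_n c$ and summing over $t$ gives the expansion in the proof: the first term is linear in $c$ with conditional mean zero, while the integral term is nonnegative (for either sign of $v$) and supplies the quadratic growth that makes $I_2$ dominate.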
For $I_1$,
$$E|I_1|^2 = a_n^2\sum_{t=1}^{n} c'M_n'\, E\big[\psi_\tau^2(u_{t\tau})\,\tilde x_{t-1}\tilde x_{t-1}'\big]M_n c + 2a_n^2\sum_{t=2}^{n}\sum_{k=1}^{t-1} c'M_n'\, E\big[\psi_\tau(u_{t\tau})\psi_\tau(u_{k\tau})\,\tilde x_{t-1}\tilde x_{k-1}'\big]M_n c$$
$$= a_n^2\sum_{t=1}^{n} c'M_n'\, E\big[\psi_\tau^2(u_{t\tau})\,\tilde x_{t-1}\tilde x_{t-1}'\big]M_n c = a_n^2\, c'M_n'\sum_{t=1}^{n} E\big[\psi_\tau^2(u_{t\tau})\,\tilde x_{t-1}\tilde x_{t-1}'\big]M_n c \le a_n^2\, n\, c_B\, C^2,$$
from Assumption B2, so that $I_1 = O\big(a_n n^{1/2} c_B^{1/2}\big)$. For $I_2$,
$$I_2 = \sum_{t=1}^{n} E\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds = \sum_{t=1}^{n} E\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(f_{t-1}(0)\cdot s\big)\,ds\,\{1+o_p(1)\}$$
$$= \frac{1}{2}a_n^2\sum_{t=1}^{n} c'M_n'\, E\big[f_{t-1}(0)\,\tilde x_{t-1}\tilde x_{t-1}'\big]M_n c\,\{1+o_p(1)\} = \frac{1}{2}a_n^2\, c'\Big(M_n'\sum_{t=1}^{n} E\big[f_{t-1}(0)\,\tilde x_{t-1}\tilde x_{t-1}'\big]M_n\Big)c\,\{1+o_p(1)\}$$
$$\ge \frac{1}{2}a_n^2\, f\, c_A\, C^2\, n \quad \text{by Assumption A2} \quad = O\big(a_n^2\, n\, c_A\big).$$
For $I_3$,
$$Var(I_3) = Var\Big(\sum_{t=1}^{n}\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds\Big) \le E\Big(\sum_{t=1}^{n}\int_{0}^{\tilde x_{t-1}' a_n M_n c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds\Big)^2$$
$$\le a_n^2\sum_{t=1}^{n} c'M_n'\, E\big[x_{t-1}x_{t-1}'\big]M_n c + 2a_n^2\sum_{t=2}^{n}\sum_{k=1}^{t-1} E\big[\big|x_{t-1}' M_n c\big|\,\big|x_{k-1}' M_n c\big|\big] = V_{3,1} + V_{3,2}.$$
Then $V_{3,1} \le O\big(\gamma_n^2 c_x n\big)$ and $V_{3,2} \le O\big(\gamma_n^2 c_x n\big)$ as in the proof of Lemma 4.1. Thus $I_3 = O_p\big(c_x^{1/2}\gamma_n n^{1/2}\big)$, and hence we establish the desired result.

Remark B.1
We define $Q^{-1}M_n = M_n^*$; then our normalizing matrix is
$$M_n^* = \begin{pmatrix} \sqrt{n}\, I_{p_z} & 0 & 0 & 0 \\ 0 & \sqrt{n}\, I_{p_1} & 0 & 0 \\ 0 & -\sqrt{n}\, A' & I_{p_1} & 0 \\ 0 & 0 & 0 & I_{p_x} \end{pmatrix}$$
and
$$a_n^{(1)} M_n^* c = \frac{p_n^{\alpha}}{n}\begin{pmatrix} \sqrt{n}\, I_{p_z} & 0 & 0 & 0 \\ 0 & \sqrt{n}\, I_{p_1} & 0 & 0 \\ 0 & -\sqrt{n}\, A' & I_{p_1} & 0 \\ 0 & 0 & 0 & I_{p_x} \end{pmatrix}c = \begin{pmatrix} \frac{p_n^{\alpha}}{\sqrt{n}} I_{p_z} & 0 & 0 & 0 \\ 0 & \frac{p_n^{\alpha}}{\sqrt{n}} I_{p_1} & 0 & 0 \\ 0 & -\frac{p_n^{\alpha}}{\sqrt{n}} A' & \frac{p_n^{\alpha}}{n} I_{p_1} & 0 \\ 0 & 0 & 0 & \frac{p_n^{\alpha}}{n} I_{p_x} \end{pmatrix}c$$
$$= \Big(\frac{p_n^{\alpha}}{\sqrt{n}}(c_1,\ldots,c_{p_z}),\; \frac{p_n^{\alpha}}{\sqrt{n}}(c_{p_z+1},\ldots,c_{p_z+p_1}),\; -\frac{p_n^{\alpha}}{\sqrt{n}}(c^*_1,\ldots,c^*_{p_1}) + O\Big(\frac{p_n^{\alpha}}{n}\Big),\; \frac{p_n^{\alpha}}{n}(c_{p_z+2p_1+1},\ldots,c_{p_n})\Big)',$$
where $(c^*_1,\ldots,c^*_{p_1})' = A'(c_{p_z+1},\ldots,c_{p_z+p_1})'$, so the reduced rate for the third block component is well accommodated. Let us define the convergence rates $\tilde a^*_{n,j}$ accordingly, by
$$\tilde a^*_{n,j} = \begin{cases} \dfrac{p_n^{\alpha}}{\sqrt{n}} = a_n^{(0)}, & j = 1,\ldots,r+p_1, \\[4pt] \dfrac{p_n^{\alpha}}{n} = a_n^{(1)}, & j = r+p_1+1,\ldots,p_n. \end{cases}$$

Proof of Theorem 4.4: Following the proof of Theorem 4.1, it suffices to show that for any $\epsilon > 0$ there exists a constant $C$ such that
$$P\Big\{\inf_{\|c\|\le C} Q_n\big(\beta_\tau + a_n^{(1)} M_n^* c\big) > Q_n(\beta_\tau)\Big\} \ge 1-\epsilon,$$
where $M_n^* = Q^{-1}M_n$ as in Remark B.1. This inequality implies that with probability at least $1-\epsilon$ there is a local minimizer $\hat\beta^*_\tau$ in the shrinking ball $\{\beta_\tau + a_n^{(1)} M_n^* c,\ \|c\|\le C\}$ such that $\|\hat\beta^*_{\tau,j} - \beta_{\tau,j}\| = O_p(\tilde a^*_{n,j})$, where $\tilde a^*_{n,j}$ is the $j$-th dominating rate from $a_n^{(1)} M_n^* c$, i.e.,
$$\tilde a^*_{n,j} = \begin{cases} \dfrac{p_n^{\alpha}}{\sqrt{n}} = a_n^{(0)}, & j = 1,\ldots,r+p_1, \\[4pt] \dfrac{p_n^{\alpha}}{n} = a_n^{(1)}, & j = r+p_1+1,\ldots,p_n. \end{cases}$$
Then
$$Q_n\big(\beta_\tau + a_n^{(1)} M_n^* c\big) - Q_n(\beta_\tau) \ge \Big(\sum_{t=1}^{n}\rho_\tau\big(u_{t\tau} - X_{t-1}' a_n^{(1)} M_n^* c\big) - \sum_{t=1}^{n}\rho_\tau(u_{t\tau})\Big) + \sum_{j=1}^{q_n}\lambda_{n,j}\big(\big|\beta_{\tau,j} + \tilde a^*_{n,j} c_j\big| - \big|\beta_{\tau,j}\big|\big) = d_1^* + d_2^*.$$
For $d_1^*$, similarly to Theorem 4.1, we have
$$\sum_{t=1}^{n}\rho_\tau\big(u_{t\tau} - X_{t-1}' a_n^{(1)} M_n^* c\big) - \sum_{t=1}^{n}\rho_\tau(u_{t\tau}) = -a_n^{(1)}\sum_{t=1}^{n} X_{t-1}' M_n^* c\,\psi_\tau(u_{t\tau}) + \sum_{t=1}^{n}\int_{0}^{X_{t-1}' a_n^{(1)} M_n^* c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds.$$
Note that
$$M_n^{*\prime}\sum_{t=1}^{n} E\big[f_{t-1}(0)\, X_{t-1}X_{t-1}'\big]M_n^* = M_n'\big(Q^{-1}\big)'\sum_{t=1}^{n} E\big[f_{t-1}(0)\, X_{t-1}X_{t-1}'\big]Q^{-1}M_n = M_n'\sum_{t=1}^{n} E\big[f_{t-1}(0)\,\tilde x_{t-1}\tilde x_{t-1}'\big]M_n,$$
which is controlled by Assumption A2. Thus, using exactly the same proof as Theorem 4.1, the dominating order in $d_1^*$ is
$$\sum_{t=1}^{n} E\int_{0}^{X_{t-1}' a_n^{(1)} M_n^* c}\big(1(u_{t\tau}\le s) - 1(u_{t\tau}\le 0)\big)\,ds = \sum_{t=1}^{n} E\int_{0}^{X_{t-1}' a_n^{(1)} M_n^* c}\big(f_{t-1}(0)\cdot s\big)\,ds\,\{1+o_p(1)\}$$
$$= \frac{1}{2}\big(a_n^{(1)}\big)^2 c'\Big(M_n^{*\prime}\sum_{t=1}^{n} E\big[f_{t-1}(0)\, X_{t-1}X_{t-1}'\big]M_n^*\Big)c\,\{1+o_p(1)\} \ge O_p\Big(\big(a_n^{(1)}\big)^2 n^2 c_A\Big) = O_p\big(p_n^{2\alpha} c_A\big).$$
For $d_2^*$, the only differences from the proof of Theorem 4.1 are the rates of divergence in $\lambda_{n,j}$ for $j = 1,\ldots,r+p_1$ and for $j = r+p_1+1,\ldots,p_n$. From Corollary 4.3,
$$\hat\beta_{\tau,j} = O_p\big(a_n^{(0)}\big) + \beta_{\tau,j} = o_p(1) + \beta_{\tau,j}, \quad j = 1,\ldots,r+p_1,$$
$$\hat\beta_{\tau,j} = O_p\big(a_n^{(1)}\big) + \beta_{\tau,j} = o_p(1) + \beta_{\tau,j}, \quad j = r+p_1+1,\ldots,p_n,$$
and clearly $a_n^{(0)} > a_n^{(1)}$ for any given $n$. Since $\beta_{\tau,j} \ne 0$ for $j = 1,\ldots,q_n$, we have
$$\sum_{j=1}^{q_n}\lambda_{n,j}\big(\big|\beta_{\tau,j} + \tilde a^*_{n,j} c_j\big| - \big|\beta_{\tau,j}\big|\big) \le \sum_{j=1}^{q_n}\lambda_{n,j}\,\tilde a^*_{n,j}\,|c_j| = \lambda_n\sum_{j=1}^{q_n}\frac{\tilde a^*_{n,j}\,|c_j|}{\big|\hat\beta_{\tau,j}\big|^{\gamma}} \le \lambda_n\max_j\big(\tilde a^*_{n,j}\big)\Big(\sum_{j=1}^{q_n}\big(o_p(1)+\beta_{\tau,j}\big)^{-2\gamma}\Big)^{1/2}\|c\|$$
$$= \lambda_n\, a_n^{(0)}\, q_n^{1/2}\, O_p(1) = O_p\big(\lambda_n\,(p_n^{\alpha}/\sqrt{n})\, p_n^{1/2}\big) = O_p\Big(\frac{\lambda_n\, p_n^{1/2+\alpha}}{n^{1/2}}\Big),$$
so $d_2^*$ is dominated by $d_1^* = O_p\big(p_n^{2\alpha} c_A\big)$ under the condition
$$\frac{\lambda_n\, p_n^{1/2-\alpha}}{n^{1/2} c_A} = \frac{\lambda_n\, n^{(1/2-\alpha)\zeta}}{n^{1/2} c_A} = \frac{\lambda_n\, n^{\zeta/2}}{n^{1/2+\alpha\zeta} c_A} \to 0.$$

Proof of Theorem 4.5:
Among $\hat\beta^{ALQR*}_\tau = \hat\beta^*_\tau = \big(\hat\beta^{z*\prime}_\tau, \hat\beta^{c_1*\prime}_\tau, \hat\beta^{c_2*\prime}_\tau, \hat\beta^{x*\prime}_\tau\big)'$, we just need to show ALQR sparsity for $\hat\beta^{c*}_\tau$, since the other parts are covered by Theorem 4.2 and the existing proof for the I(0) case. Among $j = r+1,\ldots,r+p_1$, for any $j \notin A$,
$$\big|\hat\beta^*_{j,\tau} - \beta_{j,\tau}\big| = \big|\hat\beta^*_{j,\tau}\big| = O_p\Big(\frac{p_n^{\alpha}}{\sqrt{n}}\Big),$$
so, following the proof of Theorem 4.2,
$$\Pr\big(j\in\hat A_n\big) \le \Pr\Big(\Big|\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau\big(u_{t\tau}-\delta_\tau' x_{t-1}\big) - h_n\Big| = \lambda_n\big|\hat\beta^*_{j,\tau}\big|^{-\gamma}\Big)$$
$$= \Pr\Big(\frac{1}{n^{\alpha\zeta+2}c_A^{1/2}}\Big|\sum_{t=1}^{n} x_{t-1,j}\,\psi_\tau\big(u_{t\tau}-\delta_\tau' x_{t-1}\big) - h_n\Big| = \frac{\lambda_n\big(\sqrt{n}/p_n^{\alpha}\big)^{\gamma}}{n^{\alpha\zeta+2}c_A^{1/2}}\Big|\frac{\sqrt{n}}{p_n^{\alpha}}\hat\beta^*_{j,\tau}\Big|^{-\gamma}\Big) \longrightarrow 0,$$
as long as
$$\frac{\lambda_n\big(\sqrt{n}/p_n^{\alpha}\big)^{\gamma}}{n^{\alpha\zeta+2}c_A^{1/2}} = \frac{\lambda_n\, n^{(1/2-\alpha\zeta)\gamma}}{n^{\alpha\zeta+2}c_A^{1/2}} \to \infty.$$

References
Andersen, T. G., N. Fusari, and V. Todorov (2020). The pricing of tail risk and the equity premium: Evidence from international option markets. Journal of Business & Economic Statistics 38(3), 662–678.
Belloni, A. and V. Chernozhukov (2011). ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39(1), 82–130.
Cai, Z., H. Chen, and X. Liao (2020). A new robust inference for predictive quantile regression. Available at SSRN 3593817.
Campbell, J. Y. (1987). Stock returns and the term structure. Journal of Financial Economics 18(2), 373–399.
Cenesizoglu, T. and A. Timmermann (2008). Is the distribution of stock returns predictable? Available at SSRN 1107185.
Cenesizoglu, T. and A. Timmermann (2012). Do return prediction models add economic value? Journal of Banking & Finance 36(11), 2974–2987.
Demetrescu, M., I. Georgiev, P. M. Rodrigues, and A. R. Taylor (2020). Testing for episodic predictability in stock returns. Journal of Econometrics.
Fama, E. F. and K. R. French (1988). Dividend yields and expected stock returns. Journal of Financial Economics 22(1), 3–25.
Fan, J., Y. Fan, and E. Barut (2014). Adaptive robust variable selection. The Annals of Statistics 42(1), 324–351.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348–1360.
Fan, R. and J. H. Lee (2019). Predictive quantile regressions under persistence and conditional heteroskedasticity. Journal of Econometrics 213(1), 261–280.
Fan, Y. and C. Y. Tang (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 531–552.
Farmer, L., L. Schmidt, and A. Timmermann (2019). Pockets of predictability. Available at SSRN 3152386.
Gungor, S. and R. Luger (2019). Exact inference in long-horizon predictive quantile regressions with an application to stock returns. Journal of Financial Econometrics.
Harvey, D. I., S. J. Leybourne, R. Sollis, and A. R. Taylor (2020). Real-time detection of regimes of predictability in the US equity premium. Journal of Applied Econometrics.
Hodrick, R. J. (1992). Dividend yields and expected stock returns: Alternative procedures for inference and measurement. The Review of Financial Studies 5(3), 357–386.
Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. The Annals of Statistics, 1356–1378.
Koenker, R. (2005). Quantile Regression. Econometric Society Monographs. Cambridge University Press.
Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica, 33–50.
Koo, B., H. M. Anderson, M. H. Seo, and W. Yao (2020). High-dimensional predictive regression in the presence of cointegration. Journal of Econometrics.
Lee, J. H. (2016). Predictive quantile regression with persistent covariates: IVX-QR approach. Journal of Econometrics 192(1), 105–118.
Lee, J. H., Z. Shi, and Z. Gao (2018). On LASSO for predictive regression. arXiv preprint arXiv:1810.03140.
Li, Y. and J. Zhu (2008). L1-norm quantile regression. Journal of Computational and Graphical Statistics 17(1), 163–185.
Lu, X. and L. Su (2015). Jackknife model averaging for quantile regressions. Journal of Econometrics 188(1), 40–58.
Maynard, A., K. Shimotsu, and Y. Wang (2011). Inference in predictive quantile regressions. Unpublished manuscript.
Meinshausen, N. and P. Bühlmann (2004). Consistent neighbourhood selection for sparse high-dimensional graphs with the lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH), Zürich.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 758–765.
Phillips, P. C. (1991). Optimal inference in cointegrated systems. Econometrica: Journal of the Econometric Society, 283–306.
Portnoy, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. The Annals of Statistics 12(4), 1298–1309.
Portnoy, S. (1985). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. II. Normal approximation. The Annals of Statistics, 1403–1417.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 221–242.
Sherwood, B. and L. Wang (2016). Partially linear additive quantile regression in ultra-high dimension. The Annals of Statistics 44(1), 288–317.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
Wang, H. and C. Leng (2007). Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102(479), 1039–1048.
Wang, H., B. Li, and C. Leng (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(3), 671–683.
Wang, H., G. Li, and G. Jiang (2007). Robust regression shrinkage and consistent variable selection through the LAD-lasso. Journal of Business & Economic Statistics 25(3), 347–355.
Wang, H., R. Li, and C.-L. Tsai (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94(3), 553–568.
Wang, L., Y. Wu, and R. Li (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association 107(497), 214–222.
Welch, I. and A. Goyal (2008). A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies 21(4), 1455–1508.
Wu, Y. and Y. Liu (2009). Variable selection in quantile regression. Statistica Sinica, 801–817.
Zhang, Y., R. Li, and C.-L. Tsai (2010). Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association 105(489), 312–323.
Zheng, Q., C. Gallagher, and K. Kulasekera (2013). Adaptive penalized quantile regression for high dimensional data. Journal of Statistical Planning and Inference 143(6), 1029–1038.
Zheng, Q., L. Peng, and X. He (2015). Globally adaptive quantile regression with ultra-high dimensional data. The Annals of Statistics 43(5), 2225–2258.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418–1429.