High Dimensional Forecast Combinations Under Latent Structures
Zhentao Shi, Liangjun Su and Tian Xie

October 20, 2020
Abstract
This paper proposes a novel high dimensional forecast combination estimator in the presence of many forecasts and potential latent group structures. The new algorithm, which we call $\ell_2$-relaxation, minimizes the squared $\ell_2$-norm of the weight vector subject to a relaxed version of the first-order conditions, instead of minimizing the mean squared forecast error as standard optimal forecast combination procedures do. A proper choice of the tuning parameter achieves bias and variance trade-off, and incorporates as special cases the simple average (equal-weight) strategy and the conventional optimal weighting scheme. When the variance-covariance (VC) matrix of the individual forecast errors exhibits latent group structures (a block equicorrelation matrix plus a VC for idiosyncratic noises), $\ell_2$-relaxation delivers combined forecasts with roughly equal within-group weights. Asymptotic optimality of the new method is established by exploiting the duality between the sup-norm restriction and the high-dimensional sparse $\ell_1$-norm penalization. Excellent finite sample performance of our method is demonstrated in Monte Carlo simulations. Its wide applicability is highlighted in three real data examples concerning empirical applications of microeconomics, macroeconomics, and finance.

Key Words: Factor models; forecast combination puzzle; high dimension; Lasso; latent group; machine learning; optimality.
JEL Classification: C22, C53, C55

Shi acknowledges financial support from the Research Grants Council (RGC) No. 14500118. Su acknowledges financial support from Tsinghua University. Xie's research is supported by the Natural Science Foundation of China (71701175), the Chinese Ministry of Education Project of Humanities and Social Sciences (17YJC790174), and the Fundamental Research Funds for the Central Universities.

Address correspondence: Zhentao Shi: [email protected], Department of Economics, 928 Esther Lee Building, the Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China; Tel: (852) 3943-1432; Fax: (852) 2603-5805. Liangjun Su: [email protected], School of Economics and Management, Tsinghua University, Beijing, China; Tel: (86 10) 6278-9506. Tian Xie: [email protected], College of Business, Shanghai University of Finance and Economics, China.

1 Introduction
Forecasting has been one of the fundamental tasks since the dawn of econometrics. In his paper presented at the inaugural meeting of the Econometric Society, Cowles (1933) found little evidence of forecastability of the stock market among dozens of financial services and publications. As part of the postwar development of national accounting and statistics, Theil (1992, originally 1955) examined the accuracy of forecasting models of macroeconomic indicators compiled by the Economic Commission for Europe. Besides quantitative economic data analysis, forecasting is the backbone of macroeconomic theory, under the names of adaptive expectation, rational expectation, and so on. Individuals and firms rely on their perception of the future to make decisions about consumption, investment and production. Monetary and fiscal authorities communicate their future targets to stabilize price levels and unemployment rates. Well-functioning financial markets and economies count on prediction, confidence, and interaction of market participants and regulators.

Thanks to information technology advancements in recent decades, the costs of collecting market forecaster data and running predictive algorithms have dropped tremendously. We have fast-forwarded into the era of big data, when machine learning and artificial intelligence profoundly change the routines of our daily life and the ways we conduct economic research. While we indulge in the rich-data environment and the awesome computing power which those pioneer econometricians could only envy, we must search for effective approaches to ensembling disaggregate information scattered in many individual forecasters or forecasting models in order to distill signals out of noises. In this paper, we seek enlightenment from Bates and Granger (1969)'s seminal work on forecast combination, which has become part of the core technical toolkit of modern forecasting practice.

We consider high dimensional forecast combinations in the presence of many forecasts. Suppose that $y_{t+1}$ is an outcome variable and there are $N$ forecasts, $\{f_{it}\}_{i \in [N]}$, available at time $t$ for $y_{t+1}$, where $t \in [T] := \{1, 2, \ldots, T\}$, $[N] := \{1, 2, \ldots, N\}$, and ":=" signifies definition. Let $\mathbf{f}_t = (f_{1t}, \ldots, f_{Nt})'$. We are interested in finding an $N \times 1$ weight vector $\mathbf{w} = (w_i)_{i \in [N]}$ to form a linear combination $\mathbf{w}'\mathbf{f}_t$ whose mean squared forecast error (MSFE) is minimized. One way to estimate $\mathbf{w}$ is to run the restricted least squares (RLS):
$$\min_{\mathbf{w} \in \mathbb{R}^N} \frac{1}{T}\sum_{t=1}^{T}\left(y_{t+1} - \mathbf{w}'\mathbf{f}_t\right)^2 \quad \text{subject to } \mathbf{w}'\mathbf{1}_N = 1, \tag{1.1}$$
where $\mathbf{1}_N$ is an $N \times 1$ vector of ones. Alternatively, one can define the forecast error $e_{it} = y_{t+1} - f_{it}$, compute its sample variance-covariance (VC) matrix $\widehat{\boldsymbol{\Sigma}} := T^{-1}\sum_{t=1}^{T}\mathbf{e}_t\mathbf{e}_t'$ with $\mathbf{e}_t = (e_{it})_{i \in [N]}$, and follow Bates and Granger (1969) to work with the following minimization problem:
$$\min_{\mathbf{w} \in \mathbb{R}^N} \mathbf{w}'\widehat{\boldsymbol{\Sigma}}\mathbf{w} \quad \text{subject to } \mathbf{w}'\mathbf{1}_N = 1. \tag{1.2}$$
Denote the solution to the above constrained optimization problem as $\widehat{\mathbf{w}}_{\mathrm{BG}}$. When $\widehat{\boldsymbol{\Sigma}}$ is invertible, the explicit solution is
$$\widehat{\mathbf{w}}_{\mathrm{BG}} = (\mathbf{1}_N'\widehat{\boldsymbol{\Sigma}}^{-1}\mathbf{1}_N)^{-1}\widehat{\boldsymbol{\Sigma}}^{-1}\mathbf{1}_N. \tag{1.3}$$
The two formulations in (1.2) and (1.1) are numerically equivalent, as demonstrated by Granger and Ramanathan (1984). Apparently, the requirement of the invertibility of $\widehat{\boldsymbol{\Sigma}}$ is not innocuous in high dimensional settings, and in fact $\widehat{\boldsymbol{\Sigma}}$ must be singular when $N > T$.
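For readers who want to reproduce the benchmark, here is a minimal numerical sketch (ours, not from the paper) of the Bates-Granger weights in (1.3); the fallback to the Moore-Penrose pseudoinverse when $\widehat{\boldsymbol{\Sigma}}$ is singular is an ad hoc workaround for illustration, not the paper's proposal.

```python
import numpy as np

def bates_granger_weights(E):
    """Bates-Granger combination weights from a T x N matrix of forecast errors.

    Implements w_BG = (1' Sigma^{-1} 1)^{-1} Sigma^{-1} 1 from (1.3).
    Falls back to the Moore-Penrose pseudoinverse when Sigma is singular
    (e.g., N > T); this fallback is an ad hoc fix, not the l2-relaxation.
    """
    T, N = E.shape
    Sigma = E.T @ E / T                      # sample VC matrix of forecast errors
    ones = np.ones(N)
    try:
        s_inv_1 = np.linalg.solve(Sigma, ones)
    except np.linalg.LinAlgError:
        s_inv_1 = np.linalg.pinv(Sigma) @ ones
    return s_inv_1 / (ones @ s_inv_1)        # normalize so that w'1 = 1

# toy example: N = 5 forecasters over T = 200 periods
rng = np.random.default_rng(0)
E = rng.standard_normal((200, 5)) @ np.diag([1.0, 1.2, 0.8, 1.5, 1.0])
w = bates_granger_weights(E)
print(w, w.sum())                            # weights sum to one
```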
Despite the MSFE-optimality of the above Bates and Granger's combined forecasts, the optimal weights often do not yield the best forecast in empirical work when they are replaced with their sample estimates; rather, the simple average (namely, equal-weight forecast combination) often performs better. This empirical fact is best known as the "forecast combination puzzle," which was first noted by Clemen (1989) and formally named by Stock and Watson (2004). In particular, Clemen (1989) reviews the literature up to the late 1980s and notes that across a large number of papers averaging forecasts appears to be a more robust procedure than optimal combination. A reasonable explanation suggests that the errors in the estimation of the weights can be large and thus dominate the gains from the use of optimal combination; see, e.g., Smith and Wallis (2009) and Claeskens et al. (2016). See Timmermann (2006) for an excellent review of forecast combination.

A lesson learned from this literature is that it is unwise to include all possible variables. Limiting the number of unknown parameters can help reduce estimation error. This has led to the adoption of various shrinkage and variable selection techniques. For example, Elliott et al. (2013) propose a complete subset regression (CSR) approach to forecast combinations by using equal weights to combine forecasts based on the same number of predictive variables. They show that in many cases subset regression combinations amount to a form of shrinkage that is more general than the conventional variable-by-variable shrinkage implied by ridge regression. In contrast, Diebold and Shin (2019) bring forth the partially egalitarian Lasso (peLASSO) procedures that discard some forecasts and then select and shrink the remaining forecasts toward the equal weights. In essence, both Elliott et al. (2013) and Diebold and Shin (2019) assign two distinct weights to two subsets of models: zero weight to a large subset of the models and equal or roughly equal weights to the remaining subset of the models.

In this paper, we extend the ideas of Elliott et al. (2013) and Diebold and Shin (2019) and propose a new shrinkage technique for forecast combinations. Specifically, we propose to minimize the squared $\ell_2$-norm of the weight vector subject to a relaxed version of the first order conditions from the minimization problem in (1.2), yielding the $\ell_2$-relaxation problem. The strategy is similar in spirit to the relaxation in the Dantzig selector à la Candes and Tao (2007). Interestingly, the $\ell_2$-relaxed optimal forecasts incorporate both the simple average (equal-weight) strategy and the conventional optimal weighting scheme as special cases, by setting the tuning parameter to be sufficiently large or zero, respectively. A proper choice of the tuning parameter can achieve the usual bias and variance trade-off and deliver combined forecasts with roughly equal groupwise weights when the variance-covariance (VC) matrix of the individual forecasts exhibits a certain latent group structure. This is consistent with the intuition that one should assign the same weights to all individuals within the same group and potentially distinct weights to individuals in different groups when the forecast error VC matrix exactly exhibits a block equicorrelation structure. When the VC matrix is contaminated with a noisy component, we show that the resultant $\ell_2$-relaxed weights are close to the infeasible groupwise equal weights.

The approximate latent group structure is not a man-made artifact.
It is inherent in many forecast combination problems. For example, it emerges when the forecast errors exhibit a factor structure as in Hsiao and Wan (2014) and Chan and Pauwels (2018) and the factor loadings are directly governed by certain latent group structures or are approximable by a few values. It is also present if one considers forecast combinations based on a large number of forecast models (say, $2^{10} = 1024$) with a fixed number of predictive regressors (say, $p = 10$). In the latter case, we argue that the $p$ regressors serve as a part of the "latent" factors.

We develop original asymptotic results to support this new $\ell_2$-relaxation estimation method. Noticing the duality between the sup-norm constraint and the $\ell_1$-penalization of Lasso (Tibshirani, 1996), we first develop the asymptotic convergence in its dual problem (see (2.6) in the next section), which resembles Lasso in view of its $\ell_1$-penalty, instead of directly working with the primal problem of $\ell_2$-relaxation (see (2.5)), which is a linearly constrained quadratic optimization. Studies of high dimensional regression have prepared a set of inequalities for Lasso to handle sparse regressions. We sharpen these techniques in our context to cope with groupwise sparsity. Once the convergence of the high dimensional parameters in the dual problem is established, the convergence of the combination weights proceeds immediately, and then follows the asymptotic optimality of $\ell_2$-relaxation.

We assess the finite sample behavior of $\ell_2$-relaxation in Monte Carlo simulations. Compared with the oracle estimator and popular off-the-shelf machine learning estimators, $\ell_2$-relaxation performs well under various data generating processes (DGPs). We further evaluate its empirical accuracy in three real data examples covering box office prediction, inflation forecasts by surveyed experts, and stock market volatility forecasts based on regressions. These examples demonstrate the wide applicability of $\ell_2$-relaxation.

Literature Review.
It is worth mentioning that various shrinkage and machine learning techniques have been applied in the forecasting literature since the pioneering work of Tibshirani (1996). See, e.g., Li and Chen (2014), Conflitti et al. (2015), Konzen and Ziegelmann (2016), Stasinakis et al. (2016), Bayer (2018), Wilms et al. (2018), Kotchoni et al. (2019), Coulombe et al. (2020), and Roccazzella et al. (2020). We complement these works by considering an $\ell_2$-relaxation of the regularized weight estimation problem, which exhibits certain optimality properties when the forecast error VC matrix can be decomposed into the sum of a low-rank matrix and a VC of idiosyncratic shocks.

Our paper is also related to the recent literature on latent group structures in panel data analysis; see, e.g., Bonhomme and Manresa (2015), Su et al. (2016), Bonhomme et al. (2017), Su and Ju (2018), Su et al. (2019), Vogt and Linton (2017), and Vogt and Linton (2020). Except for Bonhomme and Manresa (2015) and Bonhomme et al. (2017), all these previous studies focus on the recovery of the latent group structures. In this paper, although we assume that a community or latent group structure lies in the dominant term of the forecast error VC matrix, we do not attempt to recover the membership of each individual forecast.

Lastly, our paper adds to the vast literature on portfolio optimization as well; see Ledoit and Wolf (2004), Disatnik and Katz (2012), Fan et al. (2012), Fan et al. (2016), Ledoit and Wolf (2017), and Ao et al. (2019), among many others. In particular, Ledoit and Wolf (2004) use Bayesian methods for shrinking the sample correlation matrix to an equicorrelated target and show that this helps select portfolios with low volatility compared to those based on the sample correlation; and Ledoit and Wolf (2017) promote a nonlinear shrinkage estimator that is more flexible than previous linear shrinkage estimators and has just the right number of free parameters. Apparently, our method can also be used to estimate optimal portfolios.

Organization.
The rest of the paper is organized as follows. In Section 2, we introduce the $\ell_2$-relaxation primal problem and its dual problem, and discuss their attributes without imposing structures on the VC matrix of the forecast errors. Section 3 considers the latent group structures in the VC matrix and develops the finite sample numerical properties of the optimization problems. In Section 4, we study the asymptotic behaviors of the estimators in both the dual and primal problems and establish the asymptotic optimality of the $\ell_2$-relaxed estimator of the combination weights. Section 5 reports Monte Carlo simulation results. The new method is applied to three datasets in Section 6. We provide the proofs of all theoretical results in Appendix A. Additional numerical results are contained in Appendix B.

Notation.
For a random variable $x$, we write its population mean as $\mathrm{E}[x]$; for a sample $(x_1, \ldots, x_T)$, we write its sample mean as $\mathrm{E}_T[x_t] := T^{-1}\sum_{t=1}^T x_t$. A plain $b$ denotes a scalar, a boldface lowercase $\mathbf{b}$ denotes a vector, and a boldface uppercase $\mathbf{B}$ denotes a matrix. The $\ell_1$-norm and $\ell_2$-norm of $\mathbf{b} = (b_1, \ldots, b_n)'$ are denoted as $\|\mathbf{b}\|_1 := \sum_{i=1}^n |b_i|$ and $\|\mathbf{b}\|_2 := \left(\sum_{i=1}^n b_i^2\right)^{1/2}$, respectively. For a generic $n \times m$ matrix $\mathbf{B}$, we denote $\mathbf{B}_{i\cdot}$ as the $i$-th row ($1 \times m$ vector) and $\mathbf{B}_{\cdot j}$ as the $j$-th column ($n \times 1$ vector). $\phi_{\max}(\cdot)$ and $\phi_{\min}(\cdot)$ denote the largest and smallest eigenvalues of a real symmetric matrix, respectively. Define the spectral norm, sup-norm, and maximum column $\ell_2$ matrix norm as $\|\mathbf{B}\|_{\mathrm{sp}} := \phi_{\max}^{1/2}(\mathbf{B}'\mathbf{B})$, $\|\mathbf{B}\|_{\infty} := \max_{i \le n, j \le m}|b_{ij}|$, and $\|\mathbf{B}\|_{c} := \max_{j \le m}\|\mathbf{B}_{\cdot j}\|_2$, respectively. $\mathbf{0}_n$ and $\mathbf{1}_n$ are $n \times 1$ vectors of zeros and ones, respectively, and $\mathbf{I}_n$ is the $n \times n$ identity matrix. "w.p.a.1" is short for "with probability approaching one" in asymptotic statements. We write $a \asymp b$ when both $a/b$ and $b/a$ are stochastically bounded. Let $a \wedge b = \min\{a, b\}$.

The cross-sectional units are indexed by $i \in [N] := \{1, \ldots, N\}$. For a generic index set $\mathcal{G} \subset [N]$, we denote $|\mathcal{G}|$ as the cardinality of $\mathcal{G}$, and $\mathbf{b}_{\mathcal{G}} = (b_i)_{i \in \mathcal{G}}$ as the $|\mathcal{G}|$-dimensional subvector. We call a vector of similar sign if no pair of its elements takes opposite signs. Formally, an $n$-vector $\mathbf{b}$ is of similar sign if $\mathbf{b} \in \mathcal{S}^n := \{\mathbf{b} \in \mathbb{R}^n : b_i b_j \ge 0 \text{ for all } i, j \in [n]\}$. Let $\mathcal{G}_1, \ldots, \mathcal{G}_K$ be a partition of $[N]$, and denote $N_k = |\mathcal{G}_k|$. We call a generic $N$-vector $\mathbf{b}$ of similar sign within all groups if $\mathbf{b} \in \mathcal{S}_{\mathrm{all}} := \{\mathbf{b} \in \mathbb{R}^N : \mathbf{b}_{\mathcal{G}_k} \in \mathcal{S}^{N_k} \text{ for all } k \in [K]\}$, and define $\tilde{\mathcal{S}}_{\mathrm{all}} := \{\mathbf{b} \in \mathcal{S}_{\mathrm{all}} : \mathbf{1}_N'\mathbf{b} = 0\}$ with the further restriction that the elements of $\mathbf{b}$ add up to 0.

2 $\ell_2$-Relaxation for the Optimal Forecast Combination

In this section, we introduce the idea of $\ell_2$-relaxation in the classical forecast combination problem. We first discuss the low-dimensional case, and then proceed to the high-dimensional case.

2.1 $\ell_2$-Relaxation in the Low Dimensional Case

To solve the constrained optimization problem in (1.2), we rewrite it as the unconstrained Lagrangian problem:
$$\frac{1}{2}\mathbf{w}'\widehat{\boldsymbol{\Sigma}}\mathbf{w} + \gamma\left(\mathbf{w}'\mathbf{1}_N - 1\right), \tag{2.1}$$
where $\gamma$ is the Lagrangian multiplier. If $\widehat{\boldsymbol{\Sigma}}$ is invertible, the explicit solution is given in (1.3). Nevertheless, $\widehat{\boldsymbol{\Sigma}}$ is frequently non-invertible in high-dimensional settings. In this case, (1.2) has multiple solutions, all of which satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$$\widehat{\boldsymbol{\Sigma}}\mathbf{w} + \gamma\mathbf{1}_N = \mathbf{0}_N, \quad \mathbf{w}'\mathbf{1}_N - 1 = 0. \tag{2.2}$$
The non-unique solution due to the singularity of $\widehat{\boldsymbol{\Sigma}}$ is parallel to the familiar multi-collinearity issue in ordinary least squares (OLS) regression. To motivate the $\ell_2$-relaxation, we take a digression and consider the generic OLS regression in the following example.

Example 1
The OLS seeks a parameter $\boldsymbol{\theta} \in \mathbb{R}^p$ that minimizes the sum of squared residuals (SSR) $\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2^2$, where $\mathbf{y}$ is a $T$-vector of the dependent variable and $\mathbf{X}$ is a $T \times p$ matrix of independent variables. When the columns of $\mathbf{X}$ are collinear, any solution that satisfies the first order condition $\mathbf{X}'\mathbf{X}\boldsymbol{\theta} = \mathbf{X}'\mathbf{y}$ minimizes the SSR. One solution to the collinearity problem is provided by the well-known ridge regression (Tikhonov, 1977): $\widehat{\boldsymbol{\theta}}_{\mathrm{ridge}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^p}\{T^{-1}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2^2 + \lambda\|\boldsymbol{\theta}\|_2^2\}$, where $\lambda$ is a tuning parameter. Alternatively, one can set up a criterion to pick one out of these multiple solutions. For example, under the $\ell_2$-norm, one of the most popular criteria, the problem can be written as
$$\min_{\boldsymbol{\theta} \in \mathbb{R}^p}\|\boldsymbol{\theta}\|_2^2 \quad \text{subject to } \mathbf{X}'\mathbf{X}\boldsymbol{\theta} = \mathbf{X}'\mathbf{y}. \tag{2.3}$$
Unlike the ridge regression, no tuning parameter is involved in (2.3). Apparently, when $\mathbf{X}'\mathbf{X}$ is non-singular, the solution to the last minimization problem is given by the OLS solution $\widehat{\boldsymbol{\theta}}_{\mathrm{OLS}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, which is the only feasible point satisfying the constraint in (2.3).

Following the spirit of (2.3), a unique solution to (1.2) can be found by
$$\min_{(\mathbf{w}, \gamma) \in \mathbb{R}^{N+1}}\|\mathbf{w}\|_2^2 \quad \text{subject to } \mathbf{w}'\mathbf{1}_N = 1 \text{ and } \widehat{\boldsymbol{\Sigma}}\mathbf{w} + \gamma\mathbf{1}_N = \mathbf{0}_N. \tag{2.4}$$
The above minimization problem can be solved regardless of the invertibility of $\widehat{\boldsymbol{\Sigma}}$, and (1.3) is its solution when $\widehat{\boldsymbol{\Sigma}}$ is indeed invertible.

2.2 $\ell_2$-Relaxation in the High Dimensional Case

Potential problems arise when $N$ is large relative to the time dimension $T$. "High dimensional" here means that the number of unknown parameters, in our context $N$, is comparable to the sample size $T$. For example, we allow $N/T \to c \in (0, \infty]$ as $(N, T) \to \infty$.

Consider the case where $N$ is of similar magnitude to $T$ but $N < T$, say $N = 80$ and $T = 100$. Even if $\widehat{\boldsymbol{\Sigma}}$ is non-singular, a few small sample eigenvalues of $\widehat{\boldsymbol{\Sigma}}$ are likely to be close to zero, leading to a numerically unstable solution from (1.3). The numerical instability is due to the fact that the weights are solely decided by $\widehat{\boldsymbol{\Sigma}}$ via the $(N+1)$-equation linear system in (2.4).

An idea to stabilize the numerical solution is to expand the feasible set. Inspired by the Dantzig selector of Candes and Tao (2007) and the relaxed empirical likelihood of Shi (2016), we consider relaxing the sup-norm of the KKT condition as follows:
$$\min_{(\mathbf{w}, \gamma) \in \mathbb{R}^{N+1}}\|\mathbf{w}\|_2^2 \quad \text{subject to } \mathbf{w}'\mathbf{1}_N = 1 \text{ and } \|\widehat{\boldsymbol{\Sigma}}\mathbf{w} + \gamma\mathbf{1}_N\|_{\infty} \le \tau, \tag{2.5}$$
where $\tau$ is a tuning parameter to be specified by the user. We call the problem in (2.5) the $\ell_2$-relaxation primal problem, and denote the solution to (2.5) as $\widehat{\mathbf{w}} = \widehat{\mathbf{w}}_{\tau}$, where the dependence of $\widehat{\mathbf{w}}$ on $\tau$ is often suppressed for notational conciseness.

The constraints in (2.5) are feasible for any $\tau \ge 0$.
The solution to (2.5) is unique because the objective is a strictly convex function and the feasible set is a closed convex set. Therefore, while (1.2) may have multiple solutions, the objective function in (2.5) selects from the feasible set the solution with the smallest $\ell_2$-norm. With modern convex optimization modeling languages and open-source convex solvers, the quadratic optimization with constraints in (2.5) can be handled with ease even when $N$ is in the hundreds or thousands. Proprietary convex solvers can also be called upon for further speed gains in numerical operations; see Gao and Shi (2020).

Remark 2.1. The tuning parameter $\tau$ plays a crucial role in the $\ell_2$-relaxation problem. Interestingly, the penalized scheme in (2.5) incorporates the two extremes, simple average and optimal weighting, as special cases.

(a) If $\tau = 0$, the solution $\widehat{\mathbf{w}}$ is characterized by the KKT conditions in (2.2); if, in addition, $\widehat{\boldsymbol{\Sigma}}$ is invertible, the unique solution is then given by the optimal weighting $\widehat{\mathbf{w}}_{\mathrm{BG}}$ defined in (1.3).
(b) When $\tau > 0$, it relaxes the high dimensional component of the KKT conditions in (2.2). If $\tau$ is sufficiently large, say $\tau \ge \max_{i \in [N]}|\widehat{\boldsymbol{\Sigma}}_{i\cdot}\mathbf{1}_N|/N$, the second constraint in (2.5) will not play a role in minimizing the objective function, and it is easy to see that $\widehat{\mathbf{w}} = N^{-1}\mathbf{1}_N$ is the unique solution for $\mathbf{w}$ in (2.5). That is, the simple average is optimal for the primal problem in (2.5) provided $\tau$ is sufficiently large. (If $\tau = \max_{i \in [N]}|\widehat{\boldsymbol{\Sigma}}_{i\cdot}\mathbf{1}_N|/N$, then $\widehat{\gamma} = 0$ is also the unique solution for $\gamma$.)

[Figure 1: The Bias and Variance Trade-off under Different $\tau$ Values. Two panels plot the squared bias, variance, and MSE against $\tau$.]

(c) The optimal weighting strategy in (1.2) often performs unsatisfactorily in practice because the estimation of the optimal weights yields biases and large estimation errors. If the tuning parameter $\tau$ is chosen properly, the $\ell_2$-relaxation can achieve the right balance between the optimal weighting and the simple average by exploiting their bias-variance trade-off. Figure 1 demonstrates the bias and variance trade-off under a range of $\tau$, where the DGP is described in Section B.1 in the appendix. This graph plots the squared bias, variance and MSE of the first element ($\hat{w}_1 = \hat{w}_{\tau,1}$) of $\widehat{\mathbf{w}}_{\tau}$ and those of the one-step-ahead forecast $\hat{y}_{T+1}$ with $T = 100$ as functions of $\tau$. As can be seen clearly from the figure, both the $\ell_2$-relaxation weight estimator and the one-step-ahead forecast associated with a small value of $\tau$ tend to have small biases but large variances, whereas those associated with a large value of $\tau$ tend to have large biases but small variances. In the middle, there is a large range of values of $\tau$ where the combined forecast yields an MSFE that is smaller than either that of the simple average estimator (attainable for sufficiently large $\tau$) or that of the Bates-Granger optimally combined estimator (attainable for $\tau = 0$).

The primal problem in (2.5) is accompanied by the dual problem stated in the following lemma.

Lemma 1
The dual problem of (2.5) is
$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^N}\left\{\frac{1}{2}\boldsymbol{\alpha}'\widehat{\mathbf{A}}'\widehat{\mathbf{A}}\boldsymbol{\alpha} + \frac{1}{N}\mathbf{1}_N'\widehat{\boldsymbol{\Sigma}}\boldsymbol{\alpha} + \tau\|\boldsymbol{\alpha}\|_1 - \frac{1}{2N}\right\} \quad \text{subject to } \mathbf{1}_N'\boldsymbol{\alpha} = 0, \tag{2.6}$$
where $\widehat{\mathbf{A}} = (\mathbf{I}_N - N^{-1}\mathbf{1}_N\mathbf{1}_N')\widehat{\boldsymbol{\Sigma}}$ is the demeaned version of $\widehat{\boldsymbol{\Sigma}}$. Denote $\widehat{\boldsymbol{\alpha}} = \widehat{\boldsymbol{\alpha}}_{\tau}$ as a solution to the dual problem in (2.6); it is connected with the solution to the primal problem in (2.5) via
$$\widehat{\mathbf{w}} = \widehat{\mathbf{A}}\widehat{\boldsymbol{\alpha}} + \frac{\mathbf{1}_N}{N}. \tag{2.7}$$

Remark 2.2. The dual problem of (2.4) can be produced by setting $\tau = 0$ in (2.6). When $\tau > 0$, (2.6) is an $\ell_1$-penalized optimization where the criterion function is the summation of a quadratic form of $\boldsymbol{\alpha}$, a linear combination of $\boldsymbol{\alpha}$, and the $\ell_1$-norm of $\boldsymbol{\alpha}$, while the constraint is linear in $\boldsymbol{\alpha}$. The dual problem is instrumental in our theoretical analyses due to its similarity to the familiar Lasso, the $\ell_1$-penalized sparse regression problem (Tibshirani, 1996).

Remark 2.3. Since $\mathrm{rank}(\widehat{\mathbf{A}}) \le \mathrm{rank}(\mathbf{I}_N - N^{-1}\mathbf{1}_N\mathbf{1}_N') = N - 1$,
the singularity of $\widehat{\mathbf{A}}$ may induce multiple solutions to the dual problem in (2.6). Despite this, the uniqueness of $\widehat{\mathbf{w}}$ as a solution to the primal problem in (2.5) implies $\widehat{\mathbf{A}}\widehat{\boldsymbol{\alpha}}^{(1)} = \widehat{\mathbf{A}}\widehat{\boldsymbol{\alpha}}^{(2)}$ for any $\widehat{\boldsymbol{\alpha}}^{(1)}$ and $\widehat{\boldsymbol{\alpha}}^{(2)}$ that solve (2.6). It is sufficient to find any solution to the dual problem to recover the same $\widehat{\mathbf{w}}$ in the primal.

In the next section, we assume that $\widehat{\boldsymbol{\Sigma}}$ has a certain structure and then examine its implications on the optimal forecast combination based on the $\ell_2$-relaxation.
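Numerically, (2.5) is a small quadratic program with one linear and one sup-norm constraint. The sketch below (ours, not from the paper) solves it with CVXPY, one open-source modeling language of the kind mentioned above; the solver and function choices are assumptions of this illustration, and any convex solver would do.

```python
import numpy as np
import cvxpy as cp

def l2_relaxation_weights(Sigma, tau):
    """Solve the l2-relaxation primal (2.5):
    min ||w||_2^2  s.t.  w'1 = 1  and  ||Sigma w + gamma 1||_inf <= tau."""
    N = Sigma.shape[0]
    w = cp.Variable(N)
    gamma = cp.Variable()
    constraints = [
        cp.sum(w) == 1,                                         # w'1 = 1
        cp.norm(Sigma @ w + gamma * np.ones(N), "inf") <= tau,  # relaxed KKT
    ]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value, gamma.value

# tau = 0 reproduces the Bates-Granger weights when Sigma is invertible,
# while a sufficiently large tau returns the simple average w = 1/N.
```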
3 Latent Group Structures

In this section, we impose latent group structures on $\widehat{\boldsymbol{\Sigma}}$ (or its population version) and then study their implications on the $\ell_2$-relaxed estimates of the weights.

3.1 $\widehat{\boldsymbol{\Sigma}}$ and Latent Group Structure

Statistical analysis of high dimensional problems typically postulates structures on the data generating process for dimension reduction. For example, variable selection methods such as Lasso (Tibshirani, 1996) and SCAD (Fan and Li, 2001) are motivated by regressions with sparsity, meaning most of the regression coefficients are either exactly zero or approximately zero. Similarly, in large VC estimation various structures have been considered in the literature. Bickel and Levina (2008) impose many off-diagonal elements to be zero, Engle and Kelly (2012) assume a block equicorrelation structure, and Ledoit and Wolf (2004) use Bayesian methods for shrinking the sample correlation matrix to an equicorrelated target, to name a few.

Latent group structures in panel data are an alternative way to reduce dimensions, which has now grown into a burgeoning literature. While the optimization problem (2.5) is formulated for a generic covariance matrix $\widehat{\boldsymbol{\Sigma}}$, to analyze it in depth in the high dimensional framework we assume that $\widehat{\boldsymbol{\Sigma}} = \{\widehat{\Sigma}_{ij}\}_{i,j \in [N]}$ can be approximated by a block equicorrelation matrix:
$$\widehat{\boldsymbol{\Sigma}} = \widehat{\boldsymbol{\Sigma}}^* + \widehat{\boldsymbol{\Sigma}}^e, \tag{3.1}$$
where $\widehat{\boldsymbol{\Sigma}}^* = \{\widehat{\Sigma}^*_{ij}\}_{i,j \in [N]}$ is a block equicorrelation matrix and $\widehat{\boldsymbol{\Sigma}}^e = \{\widehat{\Sigma}^e_{ij}\}_{i,j \in [N]}$ denotes the deviation of $\widehat{\boldsymbol{\Sigma}}$ from the block equicorrelation matrix. We will treat $\widehat{\boldsymbol{\Sigma}}^*$ as the oracle object in Section 3.2, instead of its population counterpart, so that we can study the finite-sample numerical properties in Section 3.3 without resorting to asymptotics. We write
$$\underset{(N \times N)}{\widehat{\boldsymbol{\Sigma}}^*} = \underset{(N \times K)}{\mathbf{Z}}\,\underset{(K \times K)}{\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}}}\,\mathbf{Z}', \tag{3.2}$$
where $\mathbf{Z} = \{Z_{ik}\}$ denotes an $N \times K$ binary matrix providing the cluster membership of each individual forecast, i.e., $Z_{ik} = 1$ if forecast $i$ belongs to group $\mathcal{G}_k \subset [N]$ and $Z_{ik} = 0$ otherwise, and $\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}} = \{\widehat{\Sigma}^{\mathrm{co}}_{kl}\}_{k,l \in [K]}$ is a $K \times K$ symmetric positive definite matrix. Here, the superscript "co" stands for "core". Note that $\widehat{\Sigma}^*_{ij} = \widehat{\Sigma}^{\mathrm{co}}_{kl}$ if $i \in \mathcal{G}_k$ and $j \in \mathcal{G}_l$.
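To make (3.2) concrete, a toy sketch (ours, with made-up numbers) constructs $\widehat{\boldsymbol{\Sigma}}^*$ from a membership matrix $\mathbf{Z}$ and a core matrix:

```python
import numpy as np

# Toy illustration of (3.2): N = 5 forecasts in K = 2 groups.
# Rows of Z are one-hot memberships: units 1-3 in group 1, units 4-5 in group 2.
Z = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])
Sigma_co = np.array([[1.0, 0.3],     # K x K symmetric positive definite core
                     [0.3, 0.8]])
Sigma_star = Z @ Sigma_co @ Z.T      # N x N block equicorrelation matrix

# Every entry with i in group k and j in group l equals Sigma_co[k, l]:
assert Sigma_star[0, 1] == Sigma_co[0, 0] and Sigma_star[0, 4] == Sigma_co[0, 1]
```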
An econometrician observes $\widehat{\boldsymbol{\Sigma}}$ from the data but not $\widehat{\boldsymbol{\Sigma}}^*$. We will be precise about the definition of "approximation" for $\widehat{\boldsymbol{\Sigma}}^e$ in Assumption 1 later. Recall that $N_k := |\mathcal{G}_k|$ denotes the number of individuals in the $k$-th group, and $\sum_{k=1}^K N_k = N$. For ease of notation and after necessary re-ordering of the $N$ forecast units, we write
$$\widehat{\boldsymbol{\Sigma}}^* = (\widehat{\Sigma}^{\mathrm{co}}_{kl} \cdot \mathbf{1}_{N_k}\mathbf{1}_{N_l}')_{k,l \in [K]}, \tag{3.3}$$
in which the units in the same group cluster together in a block. The re-ordering is for the convenience of notation only. The theory to be developed is irrelevant to the ordering of the forecast units, and does not require knowledge of the membership matrix $\mathbf{Z}$.

We now motivate the above decomposition via two examples.

Example 2
Chan and Pauwels (2018) assume the existence of a "best" unbiased forecast $f_t$ of variable $y_{t+1}$ with an associated forecast error $e_t$, and the forecast error $e_{it}$ of model $i$ can be decomposed as
$$e_{it} = e_t + u_{it},$$
where $e_t$ represents the forecast error from the best forecasting model, and $u_{it}$ is the deviation of $e_{it}$ from the best forecasting model. Assuming that $\mathrm{E}[u_{it}] = 0$ and $\mathrm{E}[e_t u_{it}] = 0$ for each $i$, the VC of $\mathbf{e}_t = (e_{1t}, \ldots, e_{Nt})'$ can be written as $\boldsymbol{\Sigma} = \mathrm{E}[\mathbf{e}_t\mathbf{e}_t'] = \mathrm{E}[e_t^2]\mathbf{1}_N\mathbf{1}_N' + \mathrm{E}[\mathbf{u}_t\mathbf{u}_t']$, where $\mathbf{u}_t = (u_{1t}, \ldots, u_{Nt})'$. At the sample level, we have $\widehat{\boldsymbol{\Sigma}} = \widehat{\boldsymbol{\Sigma}}^* + \widehat{\boldsymbol{\Sigma}}^e$, where $\widehat{\boldsymbol{\Sigma}} = \mathrm{E}_T[\mathbf{e}_t\mathbf{e}_t']$, $\widehat{\boldsymbol{\Sigma}}^* = \mathrm{E}_T[e_t^2]\mathbf{1}_N\mathbf{1}_N'$, and $\widehat{\boldsymbol{\Sigma}}^e = \mathrm{E}_T[\mathbf{u}_t\mathbf{u}_t'] + \mathbf{1}_N\mathrm{E}_T[e_t\mathbf{u}_t'] + \mathrm{E}_T[e_t\mathbf{u}_t]\mathbf{1}_N'$. In this case, all the $N$ forecast units belong to the same group $\mathcal{G}_1$, as $\mathrm{rank}(\widehat{\boldsymbol{\Sigma}}^*) = 1$.

Example 3
Consider that each individual forecast $f_{it}$ is generated from a factor model
$$f_{it} = \boldsymbol{\lambda}_{g_i}'\boldsymbol{\eta}_t + u_{it}, \tag{3.4}$$
where $\boldsymbol{\lambda}_{g_i}$ is a $q \times 1$ vector of factor loadings, $\boldsymbol{\eta}_t$ is a $q \times 1$ vector of latent factors, and $u_{it}$ is an idiosyncratic shock. Here $g_i$ denotes individual $i$'s membership, i.e., it takes value $k$ if individual $i$ belongs to group $\mathcal{G}_k$, for $k \in [K]$ and $i \in [N]$. Similarly, assume $y_{t+1} = \boldsymbol{\lambda}_y'\boldsymbol{\eta}_t + u_{y,t+1}$. Assume that $\mathrm{E}[u_{it}|\boldsymbol{\eta}_t] = 0$, $\mathrm{E}[u_{y,t+1}|\boldsymbol{\eta}_t] = 0$, and $\mathrm{E}[u_{it}u_{y,t+1}|\boldsymbol{\eta}_t] = 0$. For simplicity, we also assume conditional homoskedasticity $\mathrm{var}(\mathbf{u}_t|\boldsymbol{\eta}_t) = \boldsymbol{\Omega}_u$ and that the factor loadings are nonstochastic. Then individual $i$'s forecast error is
$$e_{it} := y_{t+1} - f_{it} = [(\boldsymbol{\lambda}_y - \boldsymbol{\lambda}_{g_i})'\boldsymbol{\eta}_t + u_{y,t+1}] - u_{it} = \boldsymbol{\lambda}^{\dagger\prime}_{g_i}\boldsymbol{\eta}^{\dagger}_t - u_{it},$$
where $\boldsymbol{\eta}^{\dagger}_t = (\boldsymbol{\eta}_t', u_{y,t+1})'$ and $\boldsymbol{\lambda}^{\dagger}_{g_i} = ((\boldsymbol{\lambda}_y - \boldsymbol{\lambda}_{g_i})', 1)'$, or equivalently $\mathbf{e}_t = \boldsymbol{\Lambda}^{\dagger}\boldsymbol{\eta}^{\dagger}_t - \mathbf{u}_t$ in vector form, where $\boldsymbol{\Lambda}^{\dagger} := (\boldsymbol{\lambda}^{\dagger}_{g_1}, \ldots, \boldsymbol{\lambda}^{\dagger}_{g_N})'$. The population VC of $\mathbf{e}_t$ is given by
$$\boldsymbol{\Sigma} = \mathrm{E}[\mathbf{e}_t\mathbf{e}_t'] = \boldsymbol{\Lambda}^{\dagger}\mathrm{E}[\boldsymbol{\eta}^{\dagger}_t\boldsymbol{\eta}^{\dagger\prime}_t]\boldsymbol{\Lambda}^{\dagger\prime} + \boldsymbol{\Omega}_u.$$
Other than the $q$ factors in $\boldsymbol{\eta}_t$, the additional latent factor $u_{y,t+1}$ in $y_{t+1}$ is unforeseeable at time $t$. In other words, given the information set $\mathcal{I}_t$ that contains $(\{f_{it}\}_{i \in [N]}, \boldsymbol{\eta}_t)$ and $\mathbf{u}_t$ at time $t$, the error $u_{y,t+1} = y_{t+1} - \boldsymbol{\lambda}_y'\boldsymbol{\eta}_t = y_{t+1} - \mathrm{E}(y_{t+1}|\mathcal{I}_t)$ must be orthogonal to $\mathcal{I}_t$. Then $\mathrm{E}[u_{it}u_{y,t+1}|\mathcal{I}_t] = 0$ implies $\mathrm{E}[u_{it}u_{y,t+1}|\boldsymbol{\eta}_t] = 0$ by the law of iterated expectations. Decompose the sample VC as $\widehat{\boldsymbol{\Sigma}} = \widehat{\boldsymbol{\Sigma}}^* + \widehat{\boldsymbol{\Sigma}}^e$, where $\widehat{\boldsymbol{\Sigma}} = \mathrm{E}_T[\mathbf{e}_t\mathbf{e}_t']$, $\widehat{\boldsymbol{\Sigma}}^* = \boldsymbol{\Lambda}^{\dagger}\mathrm{E}_T[\boldsymbol{\eta}^{\dagger}_t\boldsymbol{\eta}^{\dagger\prime}_t]\boldsymbol{\Lambda}^{\dagger\prime}$, and $\widehat{\boldsymbol{\Sigma}}^e = \mathrm{E}_T[\mathbf{u}_t\mathbf{u}_t'] - \boldsymbol{\Lambda}^{\dagger}\mathrm{E}_T[\boldsymbol{\eta}^{\dagger}_t\mathbf{u}_t'] - \mathrm{E}_T[\mathbf{u}_t\boldsymbol{\eta}^{\dagger\prime}_t]\boldsymbol{\Lambda}^{\dagger\prime}$. By construction, the core matrix has element $\widehat{\Sigma}^{\mathrm{co}}_{kl} = \boldsymbol{\lambda}^{\dagger\prime}_k\mathrm{E}_T[\boldsymbol{\eta}^{\dagger}_t\boldsymbol{\eta}^{\dagger\prime}_t]\boldsymbol{\lambda}^{\dagger}_l$ for $k, l \in [K]$, the equicorrelation matrix has element $\widehat{\Sigma}^*_{ij} = \widehat{\Sigma}^{\mathrm{co}}_{kl}$ if $i \in \mathcal{G}_k$ and $j \in \mathcal{G}_l$, and $\mathrm{rank}(\widehat{\boldsymbol{\Sigma}}^*) \le (q+1) \wedge K$.

Remark 3.1. We emphasize that our theory below does not require knowledge of the group membership of individual forecasts. If one believes in the multi-factor structure in (3.4), one can always conduct principal component analysis (PCA) to estimate the factor loadings and factors first in the large $N$ and large $T$ framework. Then one can apply either the K-means algorithm or the sequential binary segmentation algorithm (Wang and Su, 2020) to the estimated factor loadings to identify the group membership. Alternatively, one can consider the regression of $f_{it}$ on the estimated factors and apply the classifier-Lasso (Su et al., 2016) or other methods to recover the group membership. The advantage of $\ell_2$-relaxation is that it directly works with the sample moments and hence bypasses the factor structure and the group membership.

Remark 3.2. It is worth mentioning that Hsiao and Wan (2014) also assume that the forecast errors exhibit a multi-factor structure. But they do not assume the presence of $K$ latent groups in the $N$ factor loadings, and write $\boldsymbol{\lambda}_i$ in place of $\boldsymbol{\lambda}_{g_i}$ in our Example 3. In the absence of latent group structures among the factor loadings $\{\boldsymbol{\lambda}_i\}_{i \in [N]}$ in the true DGP, we can follow the lead of Bonhomme et al. (2017) and consider their discretization. For simplicity, if $\boldsymbol{\lambda}_i$ lies in a compact parameter space in $\mathbb{R}^q$, then for each $i \in [N]$ there exist $K$, $g_i \in [K]$, and $\boldsymbol{\lambda}_{g_i} \in \mathbb{R}^q$ such that $\|\boldsymbol{\lambda}_i - \boldsymbol{\lambda}_{g_i}\| \le \delta_K$. As $K$ diverges to infinity, the approximation error $\delta_K$ can be made as small as possible.
As a result, we can continue to decompose the $i$-th forecast error in Example 3 as $e_{i,t} = \boldsymbol{\lambda}^{\dagger\prime}_{g_i}\boldsymbol{\eta}^{\dagger}_t - u^a_{i,t}$, where $\boldsymbol{\lambda}^{\dagger}_{g_i} = ((\boldsymbol{\lambda}_y - \boldsymbol{\lambda}_{g_i})', 1)'$ and $u^a_{i,t}$ now contains the discretization error.

Latent groups are present not only in approximate factor models, as in the above two motivating examples, but also in many forecast problems in which multi-factor structures are hidden implicitly. Here follows such an example.

Example 4
Suppose that the outcome variable $y_{t+1}$ is generated via the process
$$y_{t+1} = \mathbf{x}_t'\boldsymbol{\theta}_0 + u_{t+1} \quad \text{for } t = -T_0, \ldots, -1, 0, 1, \ldots,$$
where $\mathbf{x}_t = (x_{j,t})_{j=1}^p$ is a $p \times 1$ vector of potential predictive variables and $\boldsymbol{\theta}_0 = (\theta_{0j})_{j=1}^p$ is a $p \times 1$ vector of regression coefficients. Due to costly variable collection or ignorance, forecaster $i$ utilizes only a subset $\mathbf{x}_{S_i,t}$ of $\mathbf{x}_t$, where $S_i \subset [p]$, to exercise prediction with the OLS estimate. Let $\widehat{\boldsymbol{\theta}}_{S_i,t} = (\sum_{l=-T_0+1}^{t}\mathbf{x}_{S_i,l-1}\mathbf{x}_{S_i,l-1}')^{-1}\sum_{l=-T_0+1}^{t}\mathbf{x}_{S_i,l-1}y_l$, and let $\widehat{\boldsymbol{\theta}}_{i,t}$ be the sparse $p \times 1$ vector that embeds the corresponding $\widehat{\boldsymbol{\theta}}_{S_i,t}$, so that $(\widehat{\boldsymbol{\theta}}_{i,t})_{S_i} = \widehat{\boldsymbol{\theta}}_{S_i,t}$ and $(\widehat{\boldsymbol{\theta}}_{i,t})_{[p]\setminus S_i} = \mathbf{0}$. We consider two forecasting schemes: fixed window and rolling window.

(i) In the case of a fixed estimation window for $t \in \{-T_0, \ldots, 0\}$, the $i$-th forecast of $y_{t+1}$ is given by $\widehat{f}_{i,t} := \mathbf{x}_{S_i,t}'\widehat{\boldsymbol{\theta}}_{S_i,0}$ for $t \ge 1$. The associated forecast error is
$$e_{it} = y_{t+1} - \widehat{f}_{i,t} = y_{t+1} - \mathbf{x}_t'\widehat{\boldsymbol{\theta}}_{i,0} = u_{t+1} + \mathbf{x}_t'(\boldsymbol{\theta}_0 - \widehat{\boldsymbol{\theta}}_{i,0}).$$
Apparently, this is an exact $(p+1)$-factor model with factors $(\mathbf{x}_t', u_{t+1})'$ and factor loadings $((\boldsymbol{\theta}_0 - \widehat{\boldsymbol{\theta}}_{i,0})', 1)'$.

(ii) In the case of a rolling window of extending length, where the estimation sample at time $t$ is $\{-T_0, \ldots, t\}$, the forecast error is
$$e_{it} = y_{t+1} - \mathbf{x}_{S_i,t}'\widehat{\boldsymbol{\theta}}_{S_i,t} = u_{t+1} + \mathbf{x}_t'(\boldsymbol{\theta}_0 - \widehat{\boldsymbol{\theta}}_{i,t}) = u_{t+1} + \mathbf{x}_t'(\boldsymbol{\theta}_0 - \boldsymbol{\theta}_i) + \epsilon_{i,t},$$
where $\widehat{\boldsymbol{\theta}}_{i,t} \stackrel{p}{\to} \boldsymbol{\theta}_i$ as $T_0 \to \infty$ is assumed to hold uniformly in $(i, t)$ under some regularity conditions that include covariance stationarity, and $\epsilon_{i,t} := \mathbf{x}_t'(\widehat{\boldsymbol{\theta}}_{i,t} - \boldsymbol{\theta}_i)$. Therefore, we have an approximate $(p+1)$-factor model with factors $(\mathbf{x}_t', u_{t+1})'$ and factor loadings $((\boldsymbol{\theta}_0 - \boldsymbol{\theta}_i)', 1)'$. Similar analysis applies to the rolling window of fixed length $L$, in which forecaster $i$ estimates the coefficient by $\widehat{\boldsymbol{\theta}}^L_{S_i,t} = (\sum_{l=t-L+1}^{t}\mathbf{x}_{S_i,l-1}\mathbf{x}_{S_i,l-1}')^{-1}\sum_{l=t-L+1}^{t}\mathbf{x}_{S_i,l-1}y_l$.

In either case, if $p = 10$ then the total number of potential combinations amounts to $N = 2^{10} = 1024$ forecasting models under consideration. It is reasonable to model the forecast errors with a multi-factor structure as above.

3.2 The Oracle Problem

We consider the oracle problem by assuming away the idiosyncratic component $\widehat{\boldsymbol{\Sigma}}^e$ in $\widehat{\boldsymbol{\Sigma}}$. Given the latent group structure, an oracle version of (1.2) is
$$\min_{\mathbf{w} \in \mathbb{R}^N}\mathbf{w}'\widehat{\boldsymbol{\Sigma}}^*\mathbf{w} \quad \text{subject to } \mathbf{w}'\mathbf{1}_N = 1, \tag{3.5}$$
where the infeasible block equicorrelation covariance matrix $\widehat{\boldsymbol{\Sigma}}^*$ replaces the sample covariance $\widehat{\boldsymbol{\Sigma}}$ in (1.2). In the oracle problem, the group structure can be identified by inspecting the values of the entries in $\widehat{\boldsymbol{\Sigma}}^*$. The rank of $\widehat{\boldsymbol{\Sigma}}^*$ is at most $K$ due to the latent group pattern, leading to multiple, and in fact an infinite number of, solutions. Despite this, each solution yields the same optimal value for the primal objective function. Therefore it suffices to identify only one solution. The oracle counterpart of the $\ell_2$-relaxation primal problem in (2.5) is
$$\min_{(\mathbf{w}, \gamma) \in \mathbb{R}^{N+1}}\|\mathbf{w}\|_2^2 \quad \text{subject to } \mathbf{w}'\mathbf{1}_N = 1 \text{ and } \|\widehat{\boldsymbol{\Sigma}}^*\mathbf{w} + \gamma\mathbf{1}_N\|_{\infty} \le \tau. \tag{3.6}$$
Denote the solution of the weights in the above problem as $\mathbf{w}^*_{\tau}$, which in general does not admit an explicit form due to the presence of the sup-norm in the inequality constraint. However, in the special case of $\tau = 0$, the inequality constraint can be equivalently written as $K$ equality constraints, so that we can solve for $\mathbf{w}^* := \mathbf{w}^*_{\tau=0}$ in closed form. Define $r_k := N_k/N$ as the fraction of the $k$-th group members in the cross section. Let $\mathbf{r} = (r_k)_{k \in [K]}$ and $\mathbf{r}^{1/2} = (\sqrt{r_k})_{k \in [K]}$. Let "$\circ$" be the Hadamard product.
Lemma 2 (a) The solution to (3.6) must take within-group equal values of the form
$$\mathbf{w}^*_{\tau} = \left(N^{-1}b^*_{\tau 1}\mathbf{1}_{N_1}', \cdots, N^{-1}b^*_{\tau K}\mathbf{1}_{N_K}'\right)', \tag{3.7}$$
where $\mathbf{b}^*_{\tau} := (b^*_{\tau k})_{k \in [K]}$ solves
$$\min_{(\mathbf{b}, \gamma) \in \mathbb{R}^{K+1}}\frac{1}{N}\left\|\mathbf{b} \circ \mathbf{r}^{1/2}\right\|_2^2 \quad \text{subject to } \mathbf{1}_K'(\mathbf{b} \circ \mathbf{r}) = 1 \text{ and } \|\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}}(\mathbf{b} \circ \mathbf{r}) + \gamma\mathbf{1}_K\|_{\infty} \le \tau. \tag{3.8}$$
(b) In the special case of $\tau = 0$, the solution $\mathbf{b}^* := \mathbf{b}^*_{\tau=0}$ to (3.8) is given by
$$\mathbf{b}^* = \mathbf{r}^{-1} \circ \frac{(\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}})^{-1}\mathbf{1}_K}{\mathbf{1}_K'(\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}})^{-1}\mathbf{1}_K}, \tag{3.9}$$
where $\mathbf{r}^{-1} = (r_k^{-1})_{k \in [K]}$.

Lemma 2(a) shows that the squared $\ell_2$-norm objective function produces the within-group equally weighted solution $\mathbf{w}^*_{\tau}$. The high dimensional oracle problem with $(N+1)$ free parameters in (3.6) is reduced to only $(K+1)$ free parameters by the "core primal problem" in (3.8). When $\tau = 0$, we can obtain the explicit solution of $\mathbf{w}^*$ by inserting (3.9) into (3.7).

As in Lemma 1, we proceed with the dual problem. The dual of (3.6) is
$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^N}\left\{\frac{1}{2}\boldsymbol{\alpha}'\widehat{\mathbf{A}}^{*\prime}\widehat{\mathbf{A}}^*\boldsymbol{\alpha} + \frac{1}{N}\mathbf{1}_N'\widehat{\boldsymbol{\Sigma}}^*\boldsymbol{\alpha} + \tau\|\boldsymbol{\alpha}\|_1 - \frac{1}{2N}\right\} \quad \text{subject to } \mathbf{1}_N'\boldsymbol{\alpha} = 0, \tag{3.10}$$
where $\widehat{\mathbf{A}}^* = (\mathbf{I}_N - N^{-1}\mathbf{1}_N\mathbf{1}_N')\widehat{\boldsymbol{\Sigma}}^*$. For any $\boldsymbol{\alpha}^* = \boldsymbol{\alpha}^*_{\tau}$ that solves (3.10), Lemma 1 implies that the solution to (3.10) is not unique, and the unique $\mathbf{w}^*_{\tau}$ and the non-unique $\boldsymbol{\alpha}^*_{\tau}$ are connected via
$$\mathbf{w}^*_{\tau} = \widehat{\mathbf{A}}^*\boldsymbol{\alpha}^*_{\tau} + \frac{\mathbf{1}_N}{N}. \tag{3.11}$$
To develop the counterpart of Lemma 2 for the dual problem, we need some extra notation. For a generic $N$-vector $\boldsymbol{\alpha} = (\alpha_i)_{i \in [N]}$, denote $a_k = \sum_{i \in \mathcal{G}_k}\alpha_i$ as the $k$-th within-group summation of $\boldsymbol{\alpha}$. Let $\mathbf{a} = (a_k)_{k \in [K]}$. Let $\widehat{\mathbf{A}}^{\mathrm{co}} = \mathbf{R}^{1/2}(\mathbf{I}_K - \mathbf{1}_K\mathbf{r}')\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}}$, where $\mathbf{R} = \mathrm{diag}(\mathbf{r})$ is the $K \times K$ diagonal matrix that stacks the elements of $\mathbf{r}$ along the diagonal. Note that $\widehat{\mathbf{A}}^{\mathrm{co}}$ is the weighted demeaned core of $\widehat{\mathbf{A}}^*$, with the weights depending on the relative group sizes $\mathbf{r}$. The following lemma characterizes the features of $\boldsymbol{\alpha}^* = \boldsymbol{\alpha}^*_{\tau}$.

Lemma 3 (a) If $\tau > 0$, any solution $\boldsymbol{\alpha}^* = \boldsymbol{\alpha}^*_{\tau}$ to (3.10) must be of similar signs, i.e., $\boldsymbol{\alpha}^* \in \mathcal{S}_{\mathrm{all}}$.
(b) The low dimensional core dual problem for (3.10) is
$$\min_{\mathbf{a} \in \mathbb{R}^K}\left\{\frac{N}{2}\mathbf{a}'\widehat{\mathbf{A}}^{\mathrm{co}\prime}\widehat{\mathbf{A}}^{\mathrm{co}}\mathbf{a} + \mathbf{r}'\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}}\mathbf{a} + \tau\|\mathbf{a}\|_1 - \frac{1}{2N}\right\} \quad \text{subject to } \mathbf{1}_K'\mathbf{a} = 0. \tag{3.12}$$
(c) In the special case of $\tau = 0$, the solution $\mathbf{a}^* := \mathbf{a}^*_{\tau=0}$ to (3.12) is
$$\mathbf{a}^* = N^{-1}(\tilde{\mathbf{A}}^{\mathrm{co}\prime}\tilde{\mathbf{A}}^{\mathrm{co}})^{-1}\tilde{\mathbf{A}}^{\mathrm{co}\prime}\left((\mathbf{b}^* \circ \mathbf{r} - \mathbf{r})'\mathbf{R}^{-1/2}, 0\right)', \tag{3.13}$$
where $\tilde{\mathbf{A}}^{\mathrm{co}} := (\widehat{\mathbf{A}}^{\mathrm{co}\prime}, \mathbf{1}_K)'$ is a $(K+1) \times K$ matrix of full column rank.

Remark 3.4. Lemma 3(a) shows that the $\ell_1$-norm penalization in (3.10) precludes opposite signs of the estimates $\boldsymbol{\alpha}^*_{\tau}$ within a group, which implies $\|\boldsymbol{\alpha}^*_{\tau}\|_1 = \|\mathbf{a}^*_{\tau}\|_1$ for any $\tau > 0$.
Lemma 3(b) reduces the high dimensional oracle dual problem in $\boldsymbol{\alpha} \in \mathbb{R}^N$ to the low dimensional oracle dual problem in $\mathbf{a} \in \mathbb{R}^K$. Lemma 3(c) is the counterpart of (3.9) for the dual, which involves the augmented (by a row of 1's) weighted demeaned core $\tilde{\mathbf{A}}^{\mathrm{co}}$. A numerical lower bound and a stochastic lower bound related to $\tilde{\mathbf{A}}^{\mathrm{co}}$ will be established in Lemma 4(b) and Lemma 5(a) in Section A.3 of the Appendix.

3.3 Numerical Properties

In this section, we derive the (finite sample) numerical properties of the estimates $\widehat{\mathbf{w}}$ and $\widehat{\boldsymbol{\alpha}}$ under a finite $N$. These properties are prepared for the development of the asymptotic theory in Section 4. The sample estimator $\widehat{\mathbf{w}}$ deviates from the oracle $\mathbf{w}^*_{\tau}$ due to the presence of the idiosyncratic shock and the tuning parameter $\tau$. We will show that the effect of the idiosyncratic shock is embodied in the quantity $\phi_e := \|\widehat{\boldsymbol{\Sigma}}^e\|_c$, which can be viewed as a measurement of the noise level, or contamination level, of $\widehat{\boldsymbol{\Sigma}}^*$ in the model.

Theorem 1 below reports the numerical properties of the sample estimator, where the condition $\tau > \phi_e\|\mathbf{b}^*\|_{\infty}/\sqrt{N}$ can be satisfied w.p.a.1 in the asymptotic analysis.

Theorem 1
Suppose that $\tau > \phi_e\|\mathbf{b}^*\|_{\infty}/\sqrt{N}$. Then
(a) $\|\widehat{\mathbf{w}}\|_2 \le \|\mathbf{w}^*\|_2 \le \|\mathbf{b}^*\|_{\infty}/\sqrt{N}$;
(b) $\widehat{\boldsymbol{\alpha}} \in \mathcal{S}_{\mathrm{all}}$.

Remark 3.5. Theorem 1(a) reports the upper bound for $\|\widehat{\mathbf{w}}\|_2$, which is used in establishing part (b). If the ratio between the tolerance $\tau$ and the noise level $\phi_e$ is sufficiently large, in that it is larger than $\|\mathbf{b}^*\|_{\infty}/\sqrt{N}$, the estimator $\widehat{\boldsymbol{\alpha}}$ must be of similar sign within each group. This result is proved by exploiting the KKT conditions associated with the Lagrangian of (2.6). The intuition is that when the specified $\tau$ is large, for any $i, j \in \mathcal{G}_k$, the column-wise difference in the noise, i.e., $\widehat{\boldsymbol{\Sigma}}^e_{\cdot i} - \widehat{\boldsymbol{\Sigma}}^e_{\cdot j}$, is unable to push the two associated KKT conditions to be satisfied simultaneously for a pair $\widehat{\alpha}_i$ and $\widehat{\alpha}_j$ of opposite signs.

Remark 3.6. The result in Theorem 1(b) is important. It reminds us of the grouping effect of the elastic net (Zou and Hastie, 2005). A regression method exhibits the grouping effect if the regression coefficients of a group of highly correlated regressors in the design matrix $\mathbf{X}$ tend to be equal (up to a change of sign if negatively correlated). It is well-known that while Lasso yields sparse solutions in many cases, it does not have the grouping effect. In contrast, the elastic net penalty, as a convex combination of the Lasso ($\ell_1$) and ridge ($\ell_2$) penalties, encourages the grouping effect and has the advantage of automatically including all the highly correlated variables in the group. In the presence of a latent group structure in the dominant component $\widehat{\boldsymbol{\Sigma}}^*$ of $\widehat{\boldsymbol{\Sigma}}$, $\widehat{\boldsymbol{\Sigma}}$ plays the role of the Gram matrix $\mathbf{X}'\mathbf{X}$. Then $\widehat{\boldsymbol{\Sigma}}_{\cdot i}$ and $\widehat{\boldsymbol{\Sigma}}_{\cdot j}$ are asymptotically collinear if the $i$-th and $j$-th forecasts come from the same group (e.g., $i, j \in \mathcal{G}_k$ for some $k \in [K]$), a feature similar to highly correlated regressors in the regression framework. As a result, the $\ell_2$-relaxation estimator of the weights enjoys similar properties as the elastic net estimator.

Remark 3.7. The $\ell_1$-penalized form in (2.6) also draws similarity to high dimensional Lasso estimation under sparsity. The consistency of Lasso requires that the correlation among the columns of the design matrix should not be too strong; otherwise, various versions of restricted eigenvalue conditions break down (Bickel et al., 2009; Van De Geer and Bühlmann, 2009; Belloni et al., 2012). When $\widehat{\boldsymbol{\Sigma}}$ is treated as a design matrix (or more precisely, the Gram matrix) in the regression framework, its columns are highly correlated, or asymptotically perfectly collinear. Consider the extreme case where $\widehat{\boldsymbol{\Sigma}} = \widehat{\boldsymbol{\Sigma}}^*$, so that $\widehat{\boldsymbol{\Sigma}}^*$ is not contaminated by the noise component $\widehat{\boldsymbol{\Sigma}}^e$. If we try to solve (2.6) in this case, the estimated $\widehat{\boldsymbol{\alpha}}$ will be numerically very unstable due to perfect collinearity, and we cannot expect it to converge to a fixed $\boldsymbol{\alpha}^*$ under the $\ell_2$ norm. Therefore we must seek a compatibility condition tailored for the group structure. Due to its technical nature, we relegate it to Lemma 4 in Section A.3 of the Appendix.

4 Asymptotic Theory
We study the asymptotic properties of our $\ell_2$-relaxed estimator of the combination weights in this section. To this end, we impose some conditions and study the asymptotic properties of the estimator in the dual problem first.

We consider a triangular array of models indexed by $T$ and $N$, both of which pass to infinity. Let $\phi_{NT} := \sqrt{(\log N)/(T \wedge N)} \to 0$.
Note that we allow both $N \gg T$ (as in standard high dimensional problems) and $T \gg N$ or $T \asymp N$ in the definition of $\phi_{NT}$. But we rule out the traditional case of "fixed $N$ and large $T$", which can be trivially handled by (1.2). Furthermore, let $\boldsymbol{\Sigma}^e_0 = \mathrm{E}[\widehat{\boldsymbol{\Sigma}}^e]$, $\boldsymbol{\Delta}^e = \widehat{\boldsymbol{\Sigma}}^e - \boldsymbol{\Sigma}^e_0$, $\boldsymbol{\Sigma}^{\mathrm{co}}_0 = \mathrm{E}[\widehat{\boldsymbol{\Sigma}}^{\mathrm{co}}]$, and $\boldsymbol{\Delta}^{\mathrm{co}} = \widehat{\boldsymbol{\Sigma}}^{\mathrm{co}} - \boldsymbol{\Sigma}^{\mathrm{co}}_0$. We impose the following assumption.

Assumption 1 (a) $\phi_{\max}(\boldsymbol{\Sigma}^e_0) = O(\sqrt{N}\phi_{NT})$, $\|\boldsymbol{\Sigma}^e_0\|_c \le C_e \cdot \phi_{\max}(\boldsymbol{\Sigma}^e_0)$, and $\|\boldsymbol{\Delta}^e\|_{\infty} = O_p((T/\log N)^{-1/2})$;
(b) $c_1 \le \phi_{\min}(\boldsymbol{\Sigma}^{\mathrm{co}}_0) \le \phi_{\max}(\boldsymbol{\Sigma}^{\mathrm{co}}_0) \le c_2$, and $\|\boldsymbol{\Delta}^{\mathrm{co}}\|_{\infty} = O_p((T/\log N)^{-1/2})$;
where $c_1$, $c_2$, and $C_e$ are positive finite constants.

The first condition in Assumption 1(a) allows the maximal eigenvalue of the $N \times N$ matrix $\boldsymbol{\Sigma}^e_0$ to diverge to infinity, but at the limited rate $\sqrt{N}\phi_{NT}$. The second condition in (a) is similar to but weaker than the absolute row-sum condition that is frequently used to model weak cross-sectional dependence; see, e.g., Fan et al. (2013). The third condition in (a) requires that the sampling error $\Delta^e_{ij}$ be controlled by $(T/\log N)^{-1/2}$ uniformly over $i$ and $j$, so that each element of $\widehat{\boldsymbol{\Sigma}}^e$ should not deviate too much from its population mean $\boldsymbol{\Sigma}^e_0$. This condition can be established under some low-level assumptions; see, e.g., Chapter 6 in Wainwright (2019). Assumption 1(b) bounds all eigenvalues of the population core away from 0 and infinity, and imposes a similar stochastic order on $\boldsymbol{\Delta}^{\mathrm{co}}$ as that on $\boldsymbol{\Delta}^e$ in Assumption 1(a).

Example 5 (Example 3, cont.) Following the notation of Example 3, we can decompose the population variance-covariance matrix $\boldsymbol{\Sigma} := \mathrm{E}[\mathbf{e}_t\mathbf{e}_t']$ as $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^* + \boldsymbol{\Sigma}^e$, where $\boldsymbol{\Sigma}^* = \boldsymbol{\Lambda}^{\dagger}\mathrm{E}[\boldsymbol{\eta}^{\dagger}_t\boldsymbol{\eta}^{\dagger\prime}_t]\boldsymbol{\Lambda}^{\dagger\prime}$ and $\boldsymbol{\Sigma}^e = \boldsymbol{\Omega}_u = \{\Omega_{u,ij}\}$. The corresponding sampling error is
$$\Delta^e_{ij} = \{\mathrm{E}_T[u_{i,t}u_{j,t}] - \Omega_{u,ij}\} - \sum_{l \in \{i,j\}}\left\{(\boldsymbol{\lambda}_y - \boldsymbol{\lambda}_{g_l})'\mathrm{E}_T[\boldsymbol{\eta}_t(u_{y,t+1} - u_{l,t})] + \mathrm{E}_T[u_{y,t+1}u_{l,t}]\right\}.$$
Then the first part of Assumption 1(a) is satisfied as long as $\phi_{\max}(\boldsymbol{\Omega}_u) = O(\sqrt{N}\phi_{NT})$. For the sampling error matrix, if
$$\max_{i,j \in [N]}\left\{\|\mathrm{E}_T[\boldsymbol{\eta}_t(u_{y,t+1} - u_{i,t})]\| + |\mathrm{E}_T[u_{y,t+1}u_{i,t}]| + |\mathrm{E}_T[u_{i,t}u_{j,t}] - \Omega_{u,ij}|\right\} = O_p((T/\log N)^{-1/2}),$$
then $\|\boldsymbol{\Delta}^e\|_{\infty} = O_p((T/\log N)^{-1/2})$ is satisfied as well.

The extent of relaxation in (2.5) is controlled by the tuning parameter $\tau$, which is to be chosen by cross validation (CV) in the simulations and applications. We spell out the admissible range of $\tau$ in Assumption 2(a) below. Assumption 2(b) restricts $r := \min_{k \in [K]} r_k$ relative to $K$.

Assumption 2 As $(N, T) \to \infty$, (a) $\sqrt{K}\phi_{NT}/\tau + K^{3/2}\tau \to 0$; (b) $r \asymp K^{-1}$.

In order to meet the condition $\sqrt{K}\phi_{NT}/\tau \to 0$, we can set $\tau = D_{\tau}\sqrt{K}\phi_{NT}$ for some slowly diverging sequence $D_{\tau}$ as $(N, T) \to \infty$, for example $\log\log(N \wedge T)$. If $K$ is finite, this specification implies that $\tau$ should shrink to zero at a rate slightly slower than $\phi_{NT}$. We allow $K \to \infty$, provided $K^{3/2}\tau \to 0$, so that the noise component $\widehat{\boldsymbol{\Sigma}}^e$ would not offset the dominant grouping effect of the $\ell_2$-relaxation in the presence of latent groups in $\widehat{\boldsymbol{\Sigma}}^*$. The particular rate $K^{3/2}\tau$ will appear as the order of convergence in Theorem 3 below. Though the exact number of groups $K$ is usually unknown in reality, if the researcher believes that $K$ is asymptotically dominated by some explicit rate function $\bar{K}_{N,T}$ of $N$ and $T$ in that $\limsup K/\bar{K}_{N,T} < 1$,
say $\bar{K}_{N,T} = (N \wedge T)^{1/5}$, then all the following theoretical results still hold if $K$ is replaced by $\bar{K}_{N,T}$ and $\tau$ is replaced by $\tau_{\bar{K}} = C_{\tau}\bar{K}_{N,T}^{1/2}\phi_{NT}$ for some positive constant $C_{\tau}$, as long as Assumption 2(a) is replaced by $\bar{K}_{N,T}^{1/2}\phi_{NT}/\tau + \bar{K}_{N,T}^{3/2}\tau \to 0$.

Assumption 2(b) requires that $r$ be proportional to the reciprocal of $K$. Had a group included too few members, the weight of the group would be too small to matter, and the corresponding coefficient $b^*_k$ in (3.9) is inversely affected by the group size. Indeed, Assumption 2(b) is a simplifying assumption for notational conciseness. If we drop it, $r$ will appear in the rates of convergence in all the following results, which complicates the expressions but adds no new insight.

4.1 $\widehat{\boldsymbol{\alpha}}$ in the Dual

We start with the dual problem (2.6) under Assumption 1. Following the discussion in Section 3.2, there are multiple solutions to the oracle dual in (3.10) due to the rank deficiency of $\boldsymbol{\Sigma}^*$. But if we want to establish convergence in probability, we must declare a target to which the estimator will converge. We construct such a desirable $\boldsymbol{\alpha}^*$ in (4.1) below, denoted as $\widehat{\boldsymbol{\alpha}}^*$, where the "hat" signifies its dependence on the realization of $\widehat{\boldsymbol{\alpha}}$ and the "star" indicates its validity as an oracle estimator. Define $\widehat{\boldsymbol{\alpha}}^* = (\widehat{\boldsymbol{\alpha}}^*_{\mathcal{G}_k})_{k \in [K]}$, where
$$\widehat{\boldsymbol{\alpha}}^*_{\mathcal{G}_k} = a^*_k\left(\frac{\widehat{\boldsymbol{\alpha}}_{\mathcal{G}_k}}{\widehat{a}_k} \cdot 1\{\widehat{a}_k a^*_k > 0\} + \frac{\mathbf{1}_{N_k}}{N_k} \cdot 1\{\widehat{a}_k a^*_k \le 0\}\right), \tag{4.1}$$
where $\widehat{a}_k = \sum_{i \in \mathcal{G}_k}\widehat{\alpha}_i$ is the sum of the $\widehat{\alpha}_i$ in the $k$-th group and $1\{\cdot\}$ is the usual indicator function. The above $\widehat{\boldsymbol{\alpha}}^*_{\mathcal{G}_k}$ is designed such that the $k$-th oracle group weight $a^*_k$ is distributed across the $k$-th group members proportionally to $\widehat{\boldsymbol{\alpha}}_{\mathcal{G}_k}/\widehat{a}_k$ when $\widehat{a}_k$ and $a^*_k$ share the same sign, whereas $a^*_k$ is distributed equally across the $k$-th group members when they take opposite signs. When $\widehat{\boldsymbol{\alpha}} \in \tilde{\mathcal{S}}_{\mathrm{all}}$, which holds w.p.a.1 in view of Theorem 1 and Lemma 5(b) in the Appendix, it is easy to verify that
$$(i)\ \widehat{\boldsymbol{\alpha}}^* \in \tilde{\mathcal{S}}_{\mathrm{all}}, \quad (ii)\ \|\boldsymbol{\alpha}^*\|_1 = \|\widehat{\boldsymbol{\alpha}}^*\|_1, \quad \text{and} \quad (iii)\ \widehat{\boldsymbol{\alpha}} - \widehat{\boldsymbol{\alpha}}^* \in \tilde{\mathcal{S}}_{\mathrm{all}}. \tag{4.2}$$
For example, $(i)$ in (4.2) holds because by construction $\widehat{\boldsymbol{\alpha}}^* \in \mathcal{S}_{\mathrm{all}}$ as long as $\widehat{\boldsymbol{\alpha}} \in \tilde{\mathcal{S}}_{\mathrm{all}}$, and $\mathbf{1}_N'\widehat{\boldsymbol{\alpha}}^* = \sum_{k=1}^K\mathbf{1}_{N_k}'\widehat{\boldsymbol{\alpha}}^*_{\mathcal{G}_k} = \sum_{k=1}^K a^*_k = \mathbf{1}_K'\mathbf{a}^* = 0$. The following theorem shows that the solution to the Lasso-type $\ell_1$-penalization problem (2.6) is close to the desirable oracle estimator $\widehat{\boldsymbol{\alpha}}^*$.

Theorem 2 Suppose that Assumptions 1 and 2 hold. Then $\|\widehat{\boldsymbol{\alpha}} - \widehat{\boldsymbol{\alpha}}^*\|_1 = O_p(N^{-1}K\tau)$ and $\|\widehat{\mathbf{A}}(\widehat{\boldsymbol{\alpha}} - \widehat{\boldsymbol{\alpha}}^*)\|_2 = O_p(N^{-1/2}K\tau)$.
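The construction of the oracle target in (4.1) is mechanical once the group partition and the oracle group sums are given. A small sketch (ours; the inputs alpha_hat, a_star, and groups are hypothetical names for this illustration) reads:

```python
import numpy as np

def oracle_target(alpha_hat, a_star, groups):
    """Build the oracle target alpha_hat_star of (4.1).

    alpha_hat : (N,) dual solution; a_star : (K,) oracle group sums;
    groups    : list of K integer index arrays partitioning range(N).
    """
    alpha_star = np.empty_like(alpha_hat)
    for k, idx in enumerate(groups):
        a_hat_k = alpha_hat[idx].sum()
        if a_hat_k * a_star[k] > 0:
            # distribute a_star[k] proportionally to alpha_hat within the group
            alpha_star[idx] = a_star[k] * alpha_hat[idx] / a_hat_k
        else:
            # opposite signs (or zero sum): distribute a_star[k] equally
            alpha_star[idx] = a_star[k] / len(idx)
    return alpha_star
```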
Theorem 2 is a key result that characterizes the convergence rate of the high-dimensional parameter b α in the dual problem to its oracle group counterpart α ∗ , represented by theconstructed unique solution b α ∗ . Although our ultimate interest lies in the weight estimate b w in theprimal problem, in theoretical analysis we first work with b α in the dual problem instead. This detouris taken because the dual is an ℓ -penalized optimization which resembles Lasso. The intensive studyof Lasso in statistics and econometrics offers a set of inequalities involving the ℓ -norms of b α , b α ∗ andtheir difference ( b α − b α ∗ ) at our disposal to analyze its asymptotic behavior. Remark 4.2.
All high dimensional estimation problems require a certain notion of sparsity to reduce dimensionality. It is helpful to compare our setting of latent group structures with the sparse regression estimated by Lasso. For Lasso estimation, the complexity of the problem is governed by the total number of regressors ($p$ in Example 1), while under sparsity the non-zero coefficients control the effective number of parameters, which is assumed to be far fewer than the sample size. For the $\ell_1$-penalized dual in (2.6), the complexity is the number of forecasters $N$, whereas under the group structures the number of groups $K$ determines the effective number of parameters.

Remark 4.3. A critical technical step in proving the consistency of high-dimensional Lasso problems is the compatibility condition (Bühlmann and van de Geer, 2011, Ch. 6.13) in conjunction with the restricted eigenvalue condition (Bickel et al., 2009). In our paper, the compatibility condition is established in Lemma 4(a), and the restricted eigenvalue is represented by $\phi_A := 1 \wedge \phi_{\min}(\tilde{\mathbf{A}}^{\mathrm{co}\prime}\tilde{\mathbf{A}}^{\mathrm{co}})$. In particular, instead of assuming a lower bound for the restricted eigenvalues as most high dimensional Lasso papers do, we derive a finite sample lower bound for $\phi_A$ in Lemma 4(b), as well as its convergence rate in Lemma 5(a), under our latent group structures and the primitive Assumptions 1 and 2. We deem these developments original contributions to the literature, though we relegate them to Section A.3 due to their technical nature.

4.2 $\widehat{\mathbf{w}}$ in the Primal

Given Theorem 2, we proceed to handle the estimator $\widehat{\mathbf{w}}$ in the $\ell_2$-relaxation primal problem. Notice that for the simple average weight $\widehat{\mathbf{w}}_{\mathrm{SA}} = \mathbf{1}_N/N$, although $\widehat{\mathbf{w}}_{\mathrm{SA}}$ does not mimic $\mathbf{w}^*$ in general, the $\ell_2$-distance
$$\|\widehat{\mathbf{w}}_{\mathrm{SA}} - \mathbf{w}^*\|_2 \le \|\widehat{\mathbf{w}}_{\mathrm{SA}}\|_2 + \|\mathbf{w}^*\|_2 \le (1 + \|\mathbf{b}^*\|_{\infty})/\sqrt{N}$$
can shrink to zero in the limit, for example $\|\widehat{\mathbf{w}}_{\mathrm{SA}} - \mathbf{w}^*\|_2 \asymp 1/\sqrt{N}$ in the simplest case when $K$ is finite. It is thus only non-trivial if we manage to show $\|\widehat{\mathbf{w}} - \mathbf{w}^*\|_1 = o_p(1)$ and $\|\widehat{\mathbf{w}} - \mathbf{w}^*\|_2 = o_p(N^{-1/2})$, which is stated in the following Corollary 1.

Corollary 1
Under the assumptions in Theorem 2, we have $\|\widehat{\mathbf{w}} - \mathbf{w}^*\|_2 = O_p(N^{-1/2}K\tau) = o_p(N^{-1/2})$ and $\|\widehat{\mathbf{w}} - \mathbf{w}^*\|_1 = O_p(K\tau) = o_p(1)$.
To connect the primal problem with the dual problem, we consider the decompo-sition: b w − w ∗ = b A ( b α − b α ∗ ) + ( b A − A ∗ ) b α ∗ ;see (A.39) in the appendix. The ℓ -norm of the first term on the right-hand side is bounded by The-orem 2. Moreover, the first term dominates the ℓ -norm of the second term in which the magnitudeof ( b A − A ∗ ) is well controlled under Assumption 1.16orollary 1 establishes meaningful convergence of the norms of the b w to its oracle counterpart w ∗ . The convergence further implies a desirable oracle inequality in Theorem 3 below, which showsthat the empirical risk under b w is asymptotically as small as if we knew the oracle object b Σ ∗ . Theorem 3 (Oracle inequalities)
Under the assumptions in Theorem 2, we have(a) b w ′ b Σ b w ≤ w ∗′ b Σ ∗ w ∗ + O p (cid:0) τ K / (cid:1) . Furthermore, let b Σ new and b Σ ∗ new be the counterparts of b Σ and b Σ ∗ from a new (testing) sample,which can be dependent or independent of the training dataset used to estimate b w and w ∗ . If thetesting dataset is generated by the same data generating process as that of the training dataset, then(a) b w ′ b Σ new b w ≤ w ∗′ b Σ ∗ new w ∗ + O p (cid:0) τ K / (cid:1) ;(b) b w ′ b Σ b w ≤ b w ′ b Σ new b w ≤ Q ( Σ )+ O p ( τ K / ) , where Σ = Σ ∗ + Σ e and Q ( Σ ) := min w ′ N =1 w ′ Σ w . Remark 4.5.
Theorem 3(a) is an in-sample oracle inequality, and (b) is an out-of-sample oracleinequality. The proof is a term-by-term analysis of the difference between b w ′ b Σ b w and w ∗′ b Σ ∗ w ∗ causedby the idiosyncratic shock. Again, because the magnitude of the idiosyncratic shock is controlled byAssumption 1, the convergence of the weight estimator in Corollary 1 allows the sample risk b w ′ b Σ b w to approximate the oracle risk w ∗′ b Σ ∗ w ∗ . The approximation is nontrivial by noting that w ∗′ b Σ ∗ w ∗ and w ∗′ b Σ ∗ new w ∗ are bounded away from 0 given the low rank structure of b Σ ∗ new and that τ K / → Remark 4.6 . While our ℓ -relaxation regularizes the combination weights, there is another lineof literature of regularizing the high dimensional VC estimation or its inverse (the precision matrix);see Bickel and Levina (2008), Fan et al. (2013), and the overview by Fan et al. (2016). Theorem 3(c)implies that the in-sample and out-of-sample risks coming out of the ℓ -relaxation are comparablewith the resultant risk from estimating the high dimensional VC matrix. Specifically, Q ( Σ ) isBates and Granger (1969)’s optimal risk under the population VC Σ , and Σ is the target of highdimensional VC estimation or precision matrix estimation. The population VC Σ takes into accountboth the low rank component Σ ∗ and the high rank component Σ e . Even if Σ can be estimated sowell that the estimation error is completely eliminated, our out-of-sample risk b w ′ b Σ new b w is within an O p (cid:0) τ K / (cid:1) tolerance level of Q ( Σ ).In summary, we have shown that under the high dimensional asymptotic framework where N/T →∞ is allowed as ( N, T ) → ∞ , we can construct a unique oracle target b α ∗ that satisfies a set of desirableproperties. Since the dual problem (2.6) is an ℓ -penalized optimization, we establish in Theorem 1the convergence of b α to b α ∗ by statistical techniques that deal with the ℓ -regularization, thanks tothe amenable comparability condition and the derived restricted eigenvalue. Then Theorem 2 canbe extended to the convergence of the weight b w in Corollary 1 and furthermore the convergence ofthe sample risk to the oracle risks in Theorem 3. We illustrate the performance of the proposed ℓ -relaxation method via Monte Carlo simulations.For simplicity, the simulated DGP follows a group pattern, with an equal number of members in eachgroup, i.e., N k = N/K for each k ∈ [ K ]. We start with a baseline model of independent factors, andthen extend it to allow dynamic factors and approximate factors.17 .1 Simulation Design Let Ψ co be a K × K symmetric positive definite matrix, and Ψ = Ψ co ⊗ ( N ′ N ) be its N × N equicorrelation matrix. The N forecasters are generated by f t = Ψ / η t + u t , (5.1)where Ψ / = N − / ( Ψ co ) / ⊗ ( N ′ N ) , η t ∼ N ( , I N ) , and the idiosyncratic noise u t ∼ N ( , Ω u )are independent across t, and η t and u t are independent. The target variable is generated as y t +1 = w ∗ ψ ′ Ψ / η t + u y,t +1 , where u y,t +1 ∼ N (0 , σ y ) is independent of η t and u t , and w ∗ ψ = [( Ψ co ) − K ] ⊗ N / [ N ′ K ( Ψ co ) − K ] . The forecast error vector is e t := y t +1 N − f t = ( N w ∗ ψ ′ − I N ) Ψ / η t + u y,t +1 N − u t , and its VC can be written as E (cid:2) e t e ′ t (cid:3) = (cid:0) I N − N w ∗′ ψ (cid:1) Ψ (cid:0) I N − w ∗ ψ ′ N (cid:1) + σ y N ′ N | {z } Σ + Ω u |{z} Ω . 
Let $\boldsymbol{\Psi}^{\mathrm{co}}$ be a $K \times K$ symmetric positive definite matrix, and let $\boldsymbol{\Psi} = \boldsymbol{\Psi}^{\mathrm{co}} \otimes (\mathbf{1}_{N/K}\mathbf{1}_{N/K}')$ be its $N \times N$ equicorrelation matrix. The $N$ forecasters are generated by
$$\mathbf{f}_t = \boldsymbol{\Psi}^{1/2}\boldsymbol{\eta}_t + \mathbf{u}_t, \tag{5.1}$$
where $\boldsymbol{\Psi}^{1/2} = (N/K)^{-1/2}(\boldsymbol{\Psi}^{\mathrm{co}})^{1/2} \otimes (\mathbf{1}_{N/K}\mathbf{1}_{N/K}')$, $\boldsymbol{\eta}_t \sim N(\mathbf{0}, \mathbf{I}_N)$, and the idiosyncratic noise $\mathbf{u}_t \sim N(\mathbf{0}, \boldsymbol{\Omega}_u)$ are independent across $t$, and $\boldsymbol{\eta}_t$ and $\mathbf{u}_t$ are independent. The target variable is generated as
$$y_{t+1} = \mathbf{w}^{*\prime}_{\psi}\boldsymbol{\Psi}^{1/2}\boldsymbol{\eta}_t + u_{y,t+1},$$
where $u_{y,t+1} \sim N(0, \sigma_y^2)$ is independent of $\boldsymbol{\eta}_t$ and $\mathbf{u}_t$, and
$$\mathbf{w}^*_{\psi} = \frac{[(\boldsymbol{\Psi}^{\mathrm{co}})^{-1}\mathbf{1}_K] \otimes \mathbf{1}_{N/K}}{(N/K)\,\mathbf{1}_K'(\boldsymbol{\Psi}^{\mathrm{co}})^{-1}\mathbf{1}_K}.$$
The forecast error vector is
$$\mathbf{e}_t := y_{t+1}\mathbf{1}_N - \mathbf{f}_t = (\mathbf{1}_N\mathbf{w}^{*\prime}_{\psi} - \mathbf{I}_N)\boldsymbol{\Psi}^{1/2}\boldsymbol{\eta}_t + u_{y,t+1}\mathbf{1}_N - \mathbf{u}_t,$$
and its VC can be written as
$$\mathrm{E}[\mathbf{e}_t\mathbf{e}_t'] = \underbrace{(\mathbf{I}_N - \mathbf{1}_N\mathbf{w}^{*\prime}_{\psi})\boldsymbol{\Psi}(\mathbf{I}_N - \mathbf{w}^*_{\psi}\mathbf{1}_N') + \sigma_y^2\mathbf{1}_N\mathbf{1}_N'}_{\boldsymbol{\Sigma}} + \underbrace{\boldsymbol{\Omega}_u}_{\boldsymbol{\Omega}}.$$
By construction, $\mathbf{w}^*_{\psi}$ solves $\min_{\mathbf{w}'\mathbf{1}_N = 1}\mathbf{w}'\boldsymbol{\Sigma}\mathbf{w}$.

We compare the following estimators of $\mathbf{w}$, all subject to the restriction $\mathbf{w}'\mathbf{1}_N = 1$: (i) the group-membership oracle estimator; (ii) simple averaging (SA); (iii) Lasso; (iv) Ridge; (v) the principal component (PC) grouping estimator; and (vi) $\ell_2$-relaxation. All the tuning parameters are obtained by the conventional 5-fold cross validation (CV) through a grid search from 0.1 to 1 with increment 0.1. We have also tried the 3-fold CV and the 10-fold CV, which yield similar results.

We elaborate on the two estimators in which the group identity is explicitly involved. The oracle estimator takes advantage of the true group membership in the DGP. Given information about the group membership, we reduce the $N$ forecasters to $K$ forecasters $f_{(g_k),t} = |\mathcal{G}_k|^{-1}\sum_{i \in \mathcal{G}_k}f_{it}$ for $k \in [K]$, and use the low-dimensional (1.3) to find the optimal weights. We estimate the group membership in PC as follows. We compute the $T \times N$ in-sample forecast error matrix $\hat{\mathbf{E}} = (\hat{\mathbf{e}}_1, \ldots, \hat{\mathbf{e}}_T)'$ and save the associated $N \times N$ factor loading matrix $\hat{\boldsymbol{\Gamma}}$ of the singular value decomposition $\hat{\mathbf{E}} = \hat{\mathbf{U}}\hat{\mathbf{D}}\hat{\boldsymbol{\Gamma}}'$, where $\hat{\mathbf{D}}$ is the 'diagonal' matrix of the singular values in descending order. We extract the first $q$ columns of $\hat{\boldsymbol{\Gamma}}$, and perform the standard k-means clustering algorithm to partition the factor loading vectors into $K$ estimated groups $\hat{\mathcal{G}}_k$, $k \in [K]$. In the simulations we use the true $K$ and try $q = 5, 10,$
20 toavoid tuning on these hyperparameters in this PC grouping procedure.We estimated the weights ˆ w with the training sample { ( y t +1 , f t ) , t ∈ [ T ]), and then cast the1-step-ahead prediction ˆ w ′ f T +1 for y T +2 . The above exercise is repeated to evaluate the meansquared forecast error (MSFE) E (cid:2) ( y T +2 − ˆ w ′ f T +1 ) (cid:3) − σ y of each estimator, where the unpredictablecomponents σ y in the MSFE is subtracted and the mathematical expectations are approximated byempirical averages of 1000 simulation replications. In addition, we report the mean absolute forecasterror (MAFE), which is not covered by our theory, in Appendix B.4 for reference.We experiment with three training sample sizes T = 50 , ,
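The PC grouping steps described above (SVD of the error matrix followed by k-means on the leading loadings, then the Bates-Granger formula on the group averages) can be sketched as follows. The function names and the scikit-learn KMeans settings are illustrative choices, not the paper's implementation.

```python
# Sketch of the PC grouping estimator: SVD of the in-sample error matrix,
# k-means on the leading q right-singular vectors, then Bates-Granger (1.3)
# on the K group-average forecasters.
import numpy as np
from sklearn.cluster import KMeans

def pc_grouping(e_hat, K, q):
    # full_matrices=False gives the economical SVD  E = U D Gamma'
    _, _, gamma_t = np.linalg.svd(e_hat, full_matrices=False)
    loadings = gamma_t[:q, :].T            # N x q factor-loading vectors
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(loadings)

def grouped_bg_weights(f, y_next, labels, K):
    # average forecasters within each estimated group, then apply (1.3)
    fg = np.column_stack([f[:, labels == k].mean(axis=1) for k in range(K)])
    e = y_next[:, None] - fg
    Sigma = e.T @ e / len(y_next)
    b = np.linalg.solve(Sigma, np.ones(K))
    return b / b.sum()                     # K group weights summing to one
```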
We experiment with three training sample sizes T = 50, 100, and 200, with the corresponding K = 2, 4, 6 and N = 100, 200, 300, respectively. The core matrix is specified as the K × K diagonal matrix

    Ψ^co = diag(1, 3/2, 2, ..., (K+1)/2),

whose k-th diagonal entry is (k+1)/2, while Ω_u = σ²_u I_N with σ²_u fixed at 5. To highlight the effect of the signal-to-noise ratio (SNR) on forecast accuracy, we specify σ²_y = 1 as the low-signal design (with SNR around 3:7) and a smaller σ²_y as the high-signal design.

DGP1. The baseline model sets η_t ∼ N(0, I_N) i.i.d. across t. Figure 2 illustrates the estimated weights of a typical replication. The four rows of sub-figures correspond to the oracle, the ℓ₂-relaxation, Lasso, and Ridge, respectively; the three columns represent the results under K = 2, 4, and 6. Figure 2 shows clearly that ℓ₂-relaxation is capable of capturing the grouping patterns, although it does not explicitly classify individuals into groups. Such patterns are observed in neither Lasso nor Ridge.

[Figure 2: Illustration of Estimated Weights in DGP1. Rows of panels: Oracle, L2Relax, LASSO, Ridge; columns: K = 2, 4, 6.]

Table 1: MSFE for DGP1: the Baseline Model

T    N    K   Oracle  SA     Lasso  Ridge  PC(q=5)  PC(q=10)  PC(q=20)  ℓ₂-relax
Panel A: Low SNR
50   100  2   0.189   1.039  0.722  1.389  0.785    0.801     0.848     0.252
100  200  4   0.179   3.259  0.403  1.072  1.189    1.267     1.503     0.251
200  300  6   0.070   4.615  0.171  0.324  1.399    1.323     1.363     0.116
Panel B: High SNR
50   100  2   0.158   0.999  0.440  0.470  0.739    0.768     0.731     0.230
100  200  4   0.137   3.134  0.185  0.252  1.108    1.239     1.303     0.161
200  300  6   0.093   4.420  0.120  0.126  1.306    1.437     1.391     0.108
Table 1 reports the out-of-sample prediction accuracy under DGP1. The first three columns show the settings of T, N and K, and Columns 4–11 contain the MSFEs of the labeled estimators. Almost all estimators perform better under a high SNR than under a low SNR (except for the PC estimator with large q and T). The infeasible grouping information helps the oracle estimator prevail in all cases. Our ℓ₂-relaxation estimator is always the best feasible estimator, with MSFE close to that of the oracle estimator. The off-the-shelf shrinkage estimators Lasso and Ridge are in general better than the PC estimator. With the group pattern in the DGP, SA in general lags far behind the other feasible estimators that learn the combination weights from the data. Notice that the MSFEs of the oracle, Lasso, Ridge, and ℓ₂-relaxation decrease as T grows along with N, whereas the results of SA and of PC under all values of q deteriorate as (T, N) increase.
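For completeness, we note that the ℓ₂-relaxation weights compared above solve a quadratic program with linear constraints (see (2.5) and its linear-constraint reformulation (A.1) in the appendix), which off-the-shelf convex solvers handle directly. Below is a minimal cvxpy sketch, assuming the sample VC Σ̂ has been formed and τ has been chosen by CV; the solver defaults are an assumption of the sketch.

```python
# Minimal sketch of the l2-relaxation program (2.5):
#   min (1/2)||w||_2^2  s.t.  w'1 = 1,  ||Sigma_hat w + gamma 1||_inf <= tau.
import cvxpy as cp
import numpy as np

def l2_relax(Sigma_hat, tau):
    N = Sigma_hat.shape[0]
    w = cp.Variable(N)
    gamma = cp.Variable()                    # free intercept-like multiplier
    constraints = [cp.sum(w) == 1,
                   cp.norm(Sigma_hat @ w + gamma * np.ones(N), "inf") <= tau]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value
```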
DGP2. Time dependence is an essential feature of time series forecasting. We extend the baseline i.i.d. model by allowing for serial dependence in {η_t}. Specifically, for each i, we generate η_{it} from an AR(1) model

    η_{it} = ρ_i η_{i,t−1} + ε_{η,it},

where ρ_i ∼ Uniform(0, 0.9) is a random autoregressive coefficient, the noise ε_{η,it} ∼ i.i.d. N(0, 1 − ρ²_i), and the initial value η_{i0} ∼ i.i.d. N(0, 1). This scaling keeps the unconditional variance of η_{it} equal to one for any draw of {ρ_i} while maintaining strict stationarity over t.
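A minimal sketch of this factor-generating step is given below; the seed is an illustrative assumption. The unit-variance scaling of the innovations is what keeps DGP2 comparable to DGP1.

```python
# Sketch of the DGP2 factor dynamics: heterogeneous AR(1) processes scaled
# so that each eta_it has unit unconditional variance.
import numpy as np

def dynamic_factors(T, N, rng=np.random.default_rng(1)):
    rho = rng.uniform(0.0, 0.9, size=N)          # rho_i ~ Uniform(0, 0.9)
    eta = np.empty((T, N))
    eta[0] = rng.standard_normal(N)              # eta_i0 ~ N(0, 1)
    for t in range(1, T):
        innov = rng.standard_normal(N) * np.sqrt(1.0 - rho**2)
        eta[t] = rho * eta[t - 1] + innov        # stationary, Var(eta_it) = 1
    return eta
```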
Table 2: MSFE for DGP2: the Dynamic Factor Model

T    N    K   Oracle  SA     Lasso  Ridge  PC(q=5)  PC(q=10)  PC(q=20)  ℓ₂-relax
Panel A: Low SNR
50   100  2   0.259   1.129  0.644  1.174  0.834    0.828     0.939     0.276
100  200  4   0.161   2.980  0.397  1.178  1.138    1.257     1.218     0.168
200  300  6   0.117   4.311  0.241  0.425  1.368    1.430     1.339     0.159
Panel B: High SNR
50   100  2   0.263   1.078  0.476  0.485  0.769    0.838     0.820     0.267
100  200  4   0.149   3.101  0.200  0.261  1.081    1.202     1.291     0.162
200  300  6   0.096   4.452  0.130  0.130  1.265    1.367     1.459     0.112
Intuitively, the conventional 5-fold CV that randomly permutes the data may not be suitable for time series, since it accounts for neither the chronological order nor the serial correlation of the data. Practitioners usually resort to out-of-sample (OOS) evaluation instead; see Bergmeir and Benítez (2012) and Mirakyan et al. (2017), among others. See also Arlot and Celisse (2010) for a survey of cross-validation procedures for model selection.

We adopt the following procedure among the many variations of OOS evaluation. The full dataset, in chronological order, is divided into 5 blocks of equal size (up to rounding). In the first iteration, block 1 is the training data for model estimation and block 2 serves as the evaluation data for computing the MSFE. In the second iteration, blocks 1–2 are taken as the training data and block 3 serves as the evaluation data, and so on for 4 iterations in total. Over a pre-specified grid of potential values of the tuning parameter, the one that minimizes the average MSFE over the 4 iterations is selected.

The MSFEs for DGP2 are presented in Table 2. The rankings of relative performance among the six estimators are similar to those in Table 1. ℓ₂-relaxation remains the best performer among the feasible estimators in all cases, second only to the infeasible oracle estimator.

Our results are not sensitive to different evaluation methods. Bergmeir et al. (2018) argue that the standard 5-fold CV is valid in purely autoregressive models with uncorrelated errors. Simulation results for DGP2 under the conventional 5-fold CV are reported in Appendix B.3.

DGP3. Acknowledging that equal factor loadings within a group can be an approximation of more general factor loading configurations, we experiment with a second extension of the baseline model. We define Ψ̃^{1/2} = Ψ^{1/2} + E, where the entries of E are i.i.d. N(0, N^{−1/2}), as a perturbed factor loading matrix to replace Ψ^{1/2} in (5.1). The results are reported in Table 3. Although the additional noise in the factor loadings enlarges the MSFEs of all estimators relative to their counterparts in Table 1, the ranking pattern of the estimators stays the same as in DGP1.

In summary, we observe consistently robust performance of ℓ₂-relaxation superior to the other feasible estimators across the DGP designs, signal strengths, and CV methods.
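A minimal sketch of the blocked OOS tuning search used for DGP2 above (and later for the empirical applications) follows; here `fit` is a placeholder for any estimator that maps a training sample and a tuning parameter to combination weights, an assumption of the sketch rather than a function from the paper.

```python
# Sketch of the blocked OOS search: split chronologically into 5 blocks,
# estimate on blocks 1..j, evaluate on block j+1, pick the best tau on average.
import numpy as np

def oos_cv(f, y_next, tau_grid, fit, n_blocks=5):
    edges = np.linspace(0, len(y_next), n_blocks + 1).astype(int)
    scores = []
    for tau in tau_grid:
        errs = []
        for j in range(1, n_blocks):              # 4 iterations in total
            tr, ev = slice(0, edges[j]), slice(edges[j], edges[j + 1])
            w = fit(f[tr], y_next[tr], tau)       # e.g. the l2_relax sketch
            errs.append(np.mean((y_next[ev] - f[ev] @ w) ** 2))
        scores.append(np.mean(errs))
    return tau_grid[int(np.argmin(scores))]
```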
Table 3: MSFE for DGP3: the Approximate Factor Model

T    N    K   Oracle  SA     Lasso  Ridge  PC(q=5)  PC(q=10)  PC(q=20)  ℓ₂-relax
Panel A: Low SNR
50   100  2   0.372   1.054  1.076  1.668  0.848    0.949     0.868     0.504
100  200  4   0.340   3.470  0.745  1.580  1.621    1.754     1.806     0.445
200  300  6   0.335   4.635  0.552  0.920  2.277    2.283     2.288     0.420
Panel B: High SNR
50   100  2   0.392   1.059  0.722  0.726  0.895    0.909     0.915     0.422
100  200  4   0.280   3.335  0.403  0.508  1.641    1.791     1.956     0.306
200  300  6   0.274   5.226  0.347  0.343  2.174    2.242     2.260     0.309
6 Empirical Applications

We apply our proposed method to three empirical examples. The first exercise is a microeconomic study of forecasting box office revenues. The second is a macroeconomic application to a survey of professional forecasters. The last conducts a one-period-ahead forecast of the realized volatility of the S&P 500 index. In all three cases, we employ the MSFE as the criterion to assess forecasting performance. The results under the MAFE are available in Appendix B.5. The ℓ₂-relaxation enjoys excellent performance in these examples.

6.1 Box Office Forecasting

The motion picture industry devotes enormous resources to marketing in order to influence consumer sentiment toward its products. These resources are intended to reduce the supply-demand friction in the market. On the supply side, movie making is an expensive business; on the demand side, the audience's taste is notoriously hard to catch. Accurate prediction of box office revenues is financially crucial for motion picture investors.

Based on data for Hollywood movies released in North America between October 1, 2010 and June 30, 2012, Lehrer and Xie (2017) demonstrate the sound OOS performance of prediction model averaging (PMA). We revisit their dataset of 94 cross-sectional observations (movies), 28 non-constant explanatory variables, and 95 candidate forecasters generated from a multitude of model specifications. Guided by the intuition that input variables capturing similar characteristics are "closer" to one another, Lehrer and Xie (2017) cluster the input variables into six groups in their Appendix D.1:
Key variables: Constant, Animation, Family, Weeks, Screens, VOL: T−/−
Twitter Volume: T−/−, T−/−, T−/−, T−/−
Twitter Sentiment: T−/−, T−/−, T−/−, T−/−, T−/−
Rating Related: PG, PG13, R, Budget
Male Genre: Action, Adventure, Crime, Fantasy, Sci-Fi, Thriller
Female Genre: Comedy, Drama, Mystery, Romance
Since the 95 forecasters are generated from these input variables, the potential group patterns may help the ℓ₂-relaxation achieve more accurate forecasts than other off-the-shelf machine learning shrinkage methods in this setting.

Following Lehrer and Xie (2017), we randomly split the full sample of n = 94 movies into a training set of size n_tr and an evaluation set of size n_ev = n − n_tr. We consider evaluation sizes n_ev = 10, 20, 30, and 40. We repeat this procedure 1,000 times and evaluate the OOS MSFE of PMA, Lasso, Ridge, and ℓ₂-relaxation. Movies are viewed as independent observations and thus the tuning parameters are chosen by the conventional 5-fold CV. We conduct a grid search from 0 to 5.

Table 4: Relative MSFE of Movie Forecasting

n_ev   PMA    Lasso  Ridge  ℓ₂-relax
10     1.000  1.010  1.101  0.915
20     1.000  1.032  1.153  0.893
30     1.000  1.017  1.192  0.833
40     1.000  1.030  1.185  0.766

Note: The MSFE of PMA is normalized as 1.
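The repeated random-split protocol above is straightforward to express in code. In the sketch below, `estimate` is a hypothetical stand-in for any of the compared methods (PMA, Lasso, Ridge, or ℓ₂-relaxation) fitted on the training split.

```python
# Sketch of the repeated random-split OOS evaluation behind Table 4.
import numpy as np

def oos_mse(X, y, estimate, n_ev, n_rep=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    mses = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        tr, ev = perm[n_ev:], perm[:n_ev]      # random training/evaluation split
        w = estimate(X[tr], y[tr])             # combination weights from training
        mses.append(np.mean((y[ev] - X[ev] @ w) ** 2))
    return float(np.mean(mses))
```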
Since the magnitude of the MSFEs varies with the evaluation size n_ev, for convenience of comparison we report the mean risk relative to that of PMA in Table 4. Entries smaller than 1 indicate better performance relative to PMA. The results show that ℓ₂-relaxation yields the lowest risk in each case, followed by PMA, which is known to outperform Lasso and Ridge in Lehrer and Xie (2017). The improvement of ℓ₂-relaxation over PMA increases with n_ev. In particular, when n_ev = 40 the MSFE of ℓ₂-relaxation is 23.4% smaller than that of PMA.

The weights generated by the ℓ₂-relaxation method, averaged across the 1,000 repetitions, reveal small variations in the relative importance of the 95 candidate models. As each of these forecast models is an ensemble of several underlying input variables, it is more interesting to look at the partial weights of the 28 input characteristics. Figure 3 plots the averaged partial weights from the ℓ₂-relaxation associated with these 28 variables under n_ev = 10; the pattern is similar for other values of n_ev. The horizontal axis of Figure 3 orders all 28 non-constant variables from the one with the highest partial weight (GENRE: Adventure) to the one with the lowest (GENRE: Thriller). The vertical axis represents the averaged weight. We shade the variables with partial weights of exactly 1 or 0; these variables are either included or excluded in all candidate models following Lehrer and Xie (2017)'s pre-screening procedure.

We focus on the variables whose averaged partial weights lie strictly between 0 and 1. Variables such as RATE-R, VOL: T-4/-6 and VOL: T-7/-13 yield similar averaged partial weights, and this is robust under other values of n_ev. The variable groups revealed by ℓ₂-relaxation offer accurate prediction, although they do not coincide with the manually selected groups. These findings open interesting possibilities for alternative categorizations of the features of movies as well as their social media popularity.

[Figure 3: Averaged Partial Weights for All Variables by ℓ₂-relaxation under n_ev = 10.]

6.2 Survey of Professional Forecasters

Firms, consumers, and monetary policy authorities count on the outlook of inflation to make rational economic decisions. Besides model-based inflation forecasts published by governments and research institutes, surveys of professional forecasters (SPF) report experts' perceptions about future price level movements. A long-standing myth of forecast combination lies in the robustness of the simple average, documented by Ang et al. (2007), who extract the mean or median as a predictor in a simple linear regression. Recent research shows that modern machine learning methods can assist by assigning data-driven weights to individual forecasters to gather disaggregated information; see, for example, Diebold and Shin (2018).

The European Central Bank's SPF asks many professional institutions for their expectations of the euro-zone macroeconomic outlook. We revisit Genre et al. (2013)'s harmonized index of consumer prices (HICP) dataset, which covers 1999Q1–2018Q4. The experts were asked about their one-year- and two-year-ahead predictions. The raw data record 119 forecasters in total, but are highly unbalanced with many missing values, mainly due to entry and exit over the long time span. We follow Genre et al. (2013) to obtain 30 qualified forecasters by first filtering out irregular respondents who missed more than 50% of the observations, and then using a simple AR(1) regression to interpolate the missing values in the middle.

Our benchmark is simple averaging (SA) with equal weights on all 30 forecasters. We compare the forecast errors of SA, Lasso, Ridge, and ℓ₂-relaxation. We use a rolling window of 40 quarters for estimation. The tuning parameters are selected by the OOS CV approach described in the simulation of DGP2.

Table 5: Relative MSFE of HICP Forecasting

Horizon          SA     Lasso  Ridge  ℓ₂-relax
One-year-ahead   1.000  0.954  0.964  0.940
Two-year-ahead   1.000  0.887  0.969  0.518

Note: The MSFE of SA is normalized as 1.
The relative risks are presented in Table 5, with the MSFE of SA standardized as 1. ℓ₂-relaxation attains the best performance at both horizons. While Lasso and Ridge yield results close to SA at the one-year-ahead horizon, ℓ₂-relaxation improves the relative MSFE by a moderate 6%. Moreover, at the two-year-ahead horizon ℓ₂-relaxation beats SA by a whopping 48.2%.

Since we do not directly observe the underlying factors upon which the forecasters base their decisions, we illustrate in Figure 4 the weights associated with the 30 forecasters, averaged over the rolling windows. The figure suggests that the 30 forecasters roughly fall into a few groups by their weight estimates. All weights lie within the range of (−0.6, 0.6).

[Figure 4: Averaged Model Weights for Different Forecast Horizons by ℓ₂-relaxation. Panels: (a) one-year-ahead forecasting; (b) two-year-ahead forecasting.]

6.3 Forecasting Realized Volatility

Modeling the realized volatility (RV) of financial assets is of great interest to practitioners in risk management and portfolio allocation. To capture the main characteristics of this variance time series, a plethora of models have been proposed, such as Andersen et al. (2001)'s fractionally integrated autoregressive moving average model and Corsi (2009)'s heterogeneous autoregressive (HAR) model. In particular, HAR owes its popularity to its simple linear regression structure. HAR linearly projects the variance of discretely sampled returns onto the linear space spanned by the lagged squared return over the identical return horizon, in combination with the squared returns over longer and/or shorter return horizons. The standard HAR model writes the h-step-ahead daily RV y_{t+h} in the linear regression form

    y_{t+h} = β₀ + β_d y_t^{(1)} + β_w y_t^{(5)} + β_m y_t^{(22)} + ε_{t+h},    (6.1)

where y_t^{(l)} := l^{−1} Σ_{s=1}^{l} y_{t−s} is the average of the previous l periods of RV from period t. A typical choice of the index vector is l = [1, 5, 22]. We evaluate the ℓ₂-relaxation with the full index vector [1, 2, ..., 22] relative to the benchmark HAR with l = [1, 5, 22].
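A minimal sketch of constructing the 22 HAR averages y_t^{(l)} used as regressors follows; the alignment of the h-step-ahead target is handled separately, and the function name is an illustrative choice.

```python
# Sketch of the HAR regressor construction: y_t^{(l)} is the average of the
# previous l realized volatilities, built here for the full index 1..22.
import numpy as np

def har_regressors(rv, lags=range(1, 23)):
    # rv is a 1-d array of realized volatility; row t collects y_t^{(l)}
    max_lag = max(lags)
    T = len(rv)
    X = np.column_stack([
        np.array([rv[t - l:t].mean() for t in range(max_lag, T)])
        for l in lags
    ])
    return X                                  # shape (T - 22) x 22
```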
We conduct h-step-ahead rolling window forecasting with window length 100. The daily, weekly, bi-weekly, and monthly forecasts with h = 1, 5, 10, and 22 are analyzed. For comparison, we also run the h-step-ahead forecasts by simple OLS, Lasso, and Ridge when all 22 variables {y_t^{(1)}, ..., y_t^{(22)}} are used. We do not expect each individual regressor to be an unbiased estimate of the dependent variable, and hence for all estimators we drop the restriction that the coefficients add up to 1. All tuning parameters are determined using the OOS CV as in the simulation of DGP2.

Table 6: Relative Performance of Volatility Forecasting

horizon h        HAR     OLS     Lasso   Ridge   ℓ₂-relax
1 (daily)        1.0000  1.4401  0.9500  0.9555  0.8654
5 (weekly)       1.0000  1.3691  0.9240  0.8564  0.8207
10 (bi-weekly)   1.0000  1.5419  1.0034  0.9474  0.8793
22 (monthly)     1.0000  1.7485  1.1737  1.1622  0.9012

Note: The MSFE of HAR is normalized as 1.
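A minimal sketch of the rolling-window evaluation just described is given below. It assumes the target vector has been pre-aligned so that y[t] is the h-step-ahead target for the regressor row X[t], that X includes a constant column, and that `fit` is a hypothetical estimator returning a coefficient vector.

```python
# Sketch of the rolling-window forecast evaluation with window length 100.
import numpy as np

def rolling_msfe(X, y, fit, window=100):
    errs = []
    for start in range(len(y) - window):
        tr = slice(start, start + window)
        beta = fit(X[tr], y[tr])               # re-estimate on each window
        t = start + window                     # first observation after window
        errs.append((y[t] - X[t] @ beta) ** 2)
    return float(np.mean(errs))
```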
Table 6 shows the out-of-sample MSFEs relative to the benchmark HAR. Entries smaller than 1 indicate better performance relative to HAR. There are several interesting findings. (i) OLS suffers the worst performance in all cases, due to overfitting by the 22-variable full index vector in the 100-day rolling windows. (ii) Shrinkage estimators mitigate overfitting.
While HAR outperforms OLS by fusing the daily, weekly and monthly lagged observations to reduce the number of regressors from 22 to 3, the pre-selected index vector [1, 5, 22] cannot beat the shrinkage estimators at the short horizons. Ridge outperforms Lasso in general, indicating that the underlying coefficients may take small values but may not be exactly zero. (iii) ℓ₂-relaxation is more accurate than the other methods at all horizons. In particular, as noise accumulates at the prolonged horizon h = 22, HAR becomes more stable than Lasso and Ridge, whereas ℓ₂-relaxation still prevails.

Panels (a)–(d) of Figure 5 display the coefficients, averaged over the rolling windows, for each forecast horizon. The solid line, circles, and dashed line indicate the benchmark HAR, ℓ₂-relaxation, and OLS, respectively. While HAR incorporates a fixed lag index l = [1, 5, 22], ℓ₂-relaxation exhibits its own patterns, distinct from those of HAR and OLS. The curves of averaged coefficients are smooth, balancing the flexibility of OLS and the parsimony of HAR. Furthermore, at short horizons the estimated coefficients shrink monotonically to zero as the lag increases, reflecting the dampening effects of past observations.

[Figure 5: Averaged Coefficients by ℓ₂-relaxation. Panels (a) h = 1, (b) h = 5, (c) h = 10, (d) h = 22; each panel plots coefficients against Lags 1–22 for HAR, L2-Relax, and OLS.]

7 Conclusion

This paper presents a new machine learning algorithm, the ℓ₂-relaxation, for forecast combination in the presence of many forecasts. When the forecast error VC matrix can be approximated by a block equicorrelation structure, we establish its asymptotic optimality in the high dimensional context where N is potentially larger than T. Simulations and real data applications demonstrate the excellent performance of the ℓ₂-relaxation method relative to the Lasso, Ridge, and SA estimators.

The present work raises several interesting issues for further research. First, additional forms of restrictions can be imposed to accompany the ℓ₂-relaxation. For example, if sparsity is also desirable, we may consider adding another constraint ‖w‖₁ ≤ τ₁ for some tuning parameter τ₁, which is similar in spirit to the mixed ℓ₁-ℓ₂ penalty of the elastic net (Zou and Hastie, 2005). For another example, if non-negative weights are desirable, we can add the constraints w_i ≥ 0 for each i, similar to Jagannathan and Ma (2003). Second, as mentioned in the introduction, the idea of ℓ₂-relaxation can also be applied to portfolio optimization problems, and it is interesting to investigate how this approach works in comparison with shrinkage or regularization methods used to estimate optimal portfolios. Third, since our ℓ₂-relaxation is motivated by the MSFE loss function, it is possible to consider other forms of relaxation when other loss functions (e.g., MAFE) are under investigation. We shall explore these topics in future work.
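As one illustration of the first extension mentioned above, the sparsity or non-negativity restrictions can be appended to the program (2.5) with one extra constraint each. The following is a minimal cvxpy sketch of the constraint set, an assumption of this illustration rather than the paper's implementation.

```python
# Sketch of a sparsity/non-negativity augmented l2-relaxation:
# add ||w||_1 <= tau1 (and optionally w >= 0) to problem (2.5).
import cvxpy as cp
import numpy as np

def l2_relax_augmented(Sigma_hat, tau, tau1, nonneg=False):
    N = Sigma_hat.shape[0]
    w = cp.Variable(N)
    gamma = cp.Variable()
    cons = [cp.sum(w) == 1,
            cp.norm(Sigma_hat @ w + gamma * np.ones(N), "inf") <= tau,
            cp.norm(w, 1) <= tau1]             # elastic-net-like l1 budget
    if nonneg:
        cons.append(w >= 0)                    # Jagannathan-Ma-style restriction
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), cons).solve()
    return w.value
```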
References

Andersen, T. G., T. Bollerslev, F. X. Diebold, and P. Labys (2001). The distribution of realized exchange rate volatility. Journal of the American Statistical Association 96(453), 42–55.
Ang, A., G. Bekaert, and M. Wei (2007). Do macro variables, asset markets, or surveys forecast inflation better? Journal of Monetary Economics 54(4), 1163–1212.
Ao, M., L. Yingying, and X. Zheng (2019). Approaching mean-variance efficiency for large portfolios. The Review of Financial Studies 32(7), 2890–2919.
Arlot, S. and A. Celisse (2010). A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79.
Bates, J. M. and C. W. Granger (1969). The combination of forecasts. Operational Research Quarterly, 451–468.
Bayer, S. (2018). Combining value-at-risk forecasts using penalized quantile regressions. Econometrics and Statistics 8, 56–77.
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80, 2369–2429.
Bergmeir, C. and J. M. Benítez (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences 191, 192–213.
Bergmeir, C., R. J. Hyndman, and B. Koo (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis 120, 70–83.
Bickel, P., Y. Ritov, and A. Tsybakov (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37(4), 1705–1732.
Bickel, P. J. and E. Levina (2008). Regularized estimation of large covariance matrices. The Annals of Statistics 36(1), 199–227.
Bonhomme, S., T. Lamadon, and E. Manresa (2017). Discretizing unobserved heterogeneity. University of Chicago, Becker Friedman Institute for Economics Working Paper (2019-16).
Bonhomme, S. and E. Manresa (2015). Grouped patterns of heterogeneity in panel data. Econometrica 83(3), 1147–1184.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.
Bühlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
Candes, E. and T. Tao (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35(6), 2313–2351.
Chan, F. and L. L. Pauwels (2018). Some theoretical results on forecast combinations. International Journal of Forecasting 34(1), 64–74.
Claeskens, G., J. R. Magnus, A. L. Vasnev, and W. Wang (2016). The forecast combination puzzle: A simple theoretical explanation. International Journal of Forecasting 32(3), 754–762.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting 5(4), 559–583.
Conflitti, C., C. De Mol, and D. Giannone (2015). Optimal combination of survey forecasts. International Journal of Forecasting 31(4), 1096–1103.
Corsi, F. (2009). A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics 7(2), 174–196.
Coulombe, P. G., M. Leroux, D. Stevanovic, and S. Surprenant (2020). How is machine learning useful for macroeconomic forecasting? Technical report, CIRANO.
Cowles, A. (1933). Can stock market forecasters forecast? Econometrica, 309–324.
Diebold, F. X. and M. Shin (2018). Machine learning for regularized survey forecast combination: Partially-egalitarian lasso and its derivatives. International Journal of Forecasting.
Diebold, F. X. and M. Shin (2019). Machine learning for regularized survey forecast combination: Partially-egalitarian lasso and its derivatives. International Journal of Forecasting 35(4), 1679–1691.
Disatnik, D. and S. Katz (2012). Portfolio optimization using a block structure for the covariance matrix. Journal of Business Finance & Accounting 39(5-6), 806–843.
Elliott, G., A. Gargano, and A. Timmermann (2013). Complete subset regressions. Journal of Econometrics 177(2), 357–373.
Engle, R. and B. Kelly (2012). Dynamic equicorrelation. Journal of Business & Economic Statistics 30(2), 212–228.
Fan, J., A. Furger, and D. Xiu (2016). Incorporating global industrial classification standard into portfolio allocation: A simple factor-based large covariance matrix estimator with high-frequency data. Journal of Business & Economic Statistics 34(4), 489–503.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348–1360.
Fan, J., Y. Liao, and H. Liu (2016). An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19(1), C1–C32.
Fan, J., Y. Liao, and M. Mincheva (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(4), 603–680.
Fan, J., J. Zhang, and K. Yu (2012). Vast portfolio selection with gross-exposure constraints. Journal of the American Statistical Association 107(498), 592–606.
Gao, Z. and Z. Shi (2020). Implementing convex optimization in R: Two econometric examples. Computational Economics.
Genre, V., G. Kenny, A. Meyler, and A. Timmermann (2013). Combining expert forecasts: Can anything beat the simple average? International Journal of Forecasting 29(1), 108–121.
Granger, C. W. and R. Ramanathan (1984). Improved methods of combining forecasts. Journal of Forecasting 3(2), 197–204.
Hsiao, C. and S. K. Wan (2014). Is there an optimal forecast combination? Journal of Econometrics 178, 294–309.
Jagannathan, R. and T. Ma (2003). Risk reduction in large portfolios: Why imposing the wrong constraints helps. The Journal of Finance 58(4), 1651–1683.
Konzen, E. and F. A. Ziegelmann (2016). Lasso-type penalties for covariate selection and forecasting in time series. Journal of Forecasting 35(7), 592–612.
Kotchoni, R., M. Leroux, and D. Stevanovic (2019). Macroeconomic forecast accuracy in a data-rich environment. Journal of Applied Econometrics 34(7), 1050–1072.
Ledoit, O. and M. Wolf (2004). Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management 30(4), 110–119.
Ledoit, O. and M. Wolf (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks. The Review of Financial Studies 30(12), 4349–4388.
Lehrer, S. F. and T. Xie (2017). Box office buzz: Does social media data steal the show from model uncertainty when forecasting for Hollywood? The Review of Economics and Statistics 99(5), 749–755.
Li, J. and W. Chen (2014). Forecasting macroeconomic time series: Lasso-based approaches and their forecast combinations with dynamic factor models. International Journal of Forecasting 30(4), 996–1015.
Mirakyan, A., M. Meyer-Renschhausen, and A. Koch (2017). Composite forecasting approach, application for next-day electricity price forecasting. Energy Economics 66, 228–237.
Roccazzella, F., P. Gambetti, and F. Vrins (2020). Optimal and robust combination of forecasts via constrained optimization and shrinkage. Technical report, LFIN Working Paper Series 2020/6.
Shi, Z. (2016). Econometric estimation with high-dimensional moment equalities. Journal of Econometrics 195(1), 104–119.
Smith, J. and K. F. Wallis (2009). A simple explanation of the forecast combination puzzle. Oxford Bulletin of Economics and Statistics 71(3), 331–355.
Stasinakis, C., G. Sermpinis, K. Theofilatos, and A. Karathanasopoulos (2016). Forecasting US unemployment with radial basis neural networks, Kalman filters and support vector regressions. Computational Economics 47(4), 569–587.
Stock, J. H. and M. W. Watson (2004). Combination forecasts of output growth in a seven-country data set. Journal of Forecasting 23(6), 405–430.
Su, L. and G. Ju (2018). Identifying latent grouped patterns in panel data models with interactive fixed effects. Journal of Econometrics 206(2), 554–573.
Su, L., Z. Shi, and P. C. Phillips (2016). Identifying latent structures in panel data. Econometrica 84(6), 2215–2264.
Su, L., X. Wang, and S. Jin (2019). Sieve estimation of time-varying panel data models with latent structures. Journal of Business & Economic Statistics 37(2), 334–349.
Theil, H. (1992). Who forecasts best? Chapter from International Economic Papers 5 (1955), 194–199, pp. 1115–1120. Springer.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 267–288.
Tikhonov (1977). Solutions of Ill-Posed Problems. Winston and Sons, Washington, DC.
Timmermann, A. (2006). Forecast combinations. Handbook of Economic Forecasting 1, 135–196.
Van De Geer, S. A. and P. Bühlmann (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics 3, 1360–1392.
Vogt, M. and O. Linton (2017). Classification of non-parametric regression functions in longitudinal data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(1), 5–27.
Vogt, M. and O. Linton (2020). Multiscale clustering of nonparametric regression curves. Journal of Econometrics.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Volume 48. Cambridge University Press.
Wang, W. and L. Su (2020). Identifying latent group structures in nonlinear panels. Journal of Econometrics.
Wilms, I., J. Rombouts, and C. Croux (2018). Multivariate lasso-based forecast combinations for stock market volatility.
Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301–320.
APPENDIX
This appendix is composed of two sections. Appendix A contains the proofs of the theoretical statements in the paper. Appendix B contains further explanations and additional results on the simulation and empirical exercises.
A Proofs of the Main Results
A.1 Proof of the Result in Section 2
Proof of Lemma 1 . First, we can rewrite the minimization problem in (2.5) in terms of linearconstraints: min ( w ,γ ) ∈ R N +1 k w k s.t. w ′ N − , (cid:16) b Σ 1 N (cid:17) (cid:0) w ′ γ ′ (cid:1) ′ ≤ τ N , and − (cid:16) b Σ 1 N (cid:17) (cid:0) w ′ γ ′ (cid:1) ′ ≤ τ N (A.1)where “ ≤ ” holds elementwise hereafter. Define the Lagrangian function as L ( w , γ ; α , α , α ) = 12 w ′ w + α ′ (cid:18)(cid:16) b Σ 1 N (cid:17) (cid:18) w γ (cid:19) − τ N (cid:19) − α ′ (cid:18)(cid:16) b Σ 1 N (cid:17) (cid:18) w γ (cid:19) + τ N (cid:19) + α (cid:0) w ′ N − (cid:1) (A.2)and the associated Lagrangian dual function as g ( α , α , α ) = inf w ,γ L ( w , γ ; α , α , α ) , where α ≥ α ≥
0, and α are the Lagrangian multipliers for the three constraints in (A.1), respectively.Let ϕ ( w ,γ ) = k w k , the objective function in (A.1). Define its conjugate function as ϕ ∗ ( a ,b ) = sup w ,γ (cid:26) a ′ w + bγ − k w k (cid:27) = (cid:26) k a k if b = 0 ∞ otherwise . The linear constraints indicate an explicit dual function (See Boyd and Vandenberghe, 2004, p.221): g ( α , α , α )= − τ ′ N ( α + α ) − α − ϕ ∗ (cid:16) b Σ ( α − α ) − α N , ′ N ( α − α ) (cid:17) = ( − τ ′ N ( α + α ) − α − (cid:13)(cid:13)(cid:13) b Σ ( α − α ) − α N (cid:13)(cid:13)(cid:13) if ′ N ( α − α ) = 0 ∞ otherwise . Let α = α − α . When τ >
0, the two inequalities b Σ i · w + γ ≤ τ and − b Σ i · w − γ ≤ τ cannot bebinding simultaneously. The associated Lagrangian multipliers α i and α i must satisfy α i · α i = 0for all i ∈ [ N ]. This implies that k α k = ′ N α + ′ N α so that the dual problem can be simplifiedas max α ,α (cid:26) − (cid:13)(cid:13)(cid:13) b Σ α − α N (cid:13)(cid:13)(cid:13) − α − τ k α k (cid:27) s.t. ′ N α = 0 . (A.3)32aking the partial derivative of the above criterion function with respect to α yields( b Σ α − α N ) ′ N − , or equivalently, α = N ( ′ N b Σ α − . Then (cid:13)(cid:13)(cid:13) b Σ α − α N (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13) b A α + N N (cid:13)(cid:13)(cid:13)(cid:13) = α ′ b A ′ b A α + 1 N where b A = (cid:0) I N − N − N ′ N (cid:1) b Σ as defined in the main text. We conclude that the dual problem in(A.3) is equivalent tomin α ∈ R N (cid:26) α ′ b A ′ b A α + 1 N ′ N b Σ α + τ k α k − N (cid:27) subject to ′ N α = 0 , (A.4)where we keep the constant − N which is irrelevant to the optimization.When b α = b α − b α is the solution to (A.4), the solution of α in (A.3) is b α = N ( ′ N b Σ b α − . The first order condition of (A.2) with respect to w evaluated at the solution gives N = b w + b Σ ( b α − b α ) + b α N = b w − b Σ b α + 1 N ( ′ N b Σ b α − N = b w − b Σ b α − N N . as ′ N b α = 0. The result in (2.7) follows. (cid:4) A.2 Proofs of the Results in Section 3
Proof of Lemma 2. Part (a) . Suppose w is a feasible point. Due to the block equicorrelationstructure, the N rows in b Σ ∗ w contain only K distinct rows: b Σ ∗ w = b Σ co1 · (cid:16)P i ∈G w i , . . . , P i ∈G K w i (cid:17) ′ · N ... b Σ co K · (cid:16)P i ∈G w i , . . . , P i ∈G K w i (cid:17) ′ · N K . We prove the result by contradiction. Suppose the elements in the k -th group of an optimizer w ∗ τ = ( w ∗ τ, , ..., w ∗ τ ,N ) ′ are unequal for some k ∈ [ K ]. As before, we frequently suppress the dependenceof w ∗ τ and w ∗ τ,i on τ and write them as w ∗ and w ∗ i , respectively. Then we can construct an alternativeestimator ˇ w ∗ = ( ˇ w ∗ , ..., ˇ w ∗ N ) ′ such thatˇ w ∗ i = N − k X j ∈G k w ∗ j if i ∈ G k for each k ∈ [ K ] . It is easy to verify that ˇ w ∗′ N = P Ni =1 ˇ w ∗ i = P Kk =1 N k ˇ w ∗ i = P Kk =1 P j ∈G k w ∗ j = P Ni =1 w ∗ i = 1 and (cid:13)(cid:13)(cid:13) b Σ ∗ ˇ w ∗ + γ N (cid:13)(cid:13)(cid:13) ∞ = (cid:13)(cid:13)(cid:13) b Σ ∗ w ∗ + γ N (cid:13)(cid:13)(cid:13) ∞ ≤ τ by the fact that w ∗ satisfies the constraints. That is, ˇ w ∗
33s also feasible. On the other hand, by the Jensen inequality, k ˇ w ∗ k = N X i =1 ( ˇ w ∗ i ) = K X k =1 N k N − k X j ∈G k w ∗ j ≤ K X k =1 X j ∈G k ( w ∗ j ) = k w ∗ k , where the inequality becomes a strict inequality as long as w ∗ j , j ∈ G k , are not all equal for some k ∈ [ K ] . But this contradicts the presumption that w ∗ is the minimizer. Therefore, w ∗ = w ∗ τ mustbe in the form of (3.7) with b ∗ τk = r − k P i ∈G k w ∗ τ,i . Substituting (3.7) into (3.6), the minimizationproblem in (3.6) can be equivalently written as that in (3.8). Part (b) . Under τ = 0, the restriction of (3.6) can be written as (cid:18) b Σ ∗ N ′ N (cid:19) (cid:18) w γ (cid:19) = (cid:18) N (cid:19) . Since the rank of b Σ ∗ is K and there are an infinite number of solutions of ( w , γ ) to the above systemof N + 1 equations. However, the i -th equation and the j -th equation are exactly the same if i and j are the in the same group, and the ( N + 1)-equation system can be reduced to a system of K + 1equations: (cid:18) b Σ co K ′ K (cid:19) X i ∈G w i , . . . , X i ∈G K w i , γ ′ = (cid:18) K (cid:19) . As b Σ co is of full rank, the solution to the above ( K + 1)-equation system is unique: b ∗ ◦ r = X i ∈G w ∗ ,i , . . . , X i ∈G K w ∗ ,i ′ = ( b Σ co ) − K ′ K ( b Σ co ) − K . where we use the fact that P i ∈G k w ∗ τ,i = b ∗ τ,k r k for each k ∈ [ K ] and all τ ≥ b ∗ = r − ◦ ( b Σ co ) − K ′ K ( b Σ co ) − K . (cid:4) Proof of Lemma 3. Part (a).
We prove the result by contradiction. Suppose that thereexists some k ∈ [ K ] such that the k -th group of an optimizer α ∗ = ( α ∗ , ..., α ∗ N ) ′ has elements ofopposite signs, viz, there are i, j ∈ G k such that α ∗ i α ∗ j <
0. Construct an alternative estimatorˇ α ∗ = (ˇ α ∗ , ..., ˇ α ∗ N ) ′ , where ˇ α ∗ i = N − k a ∗ k for i ∈ G k and all k ∈ [ K ]where a ∗ k = P j ∈G k α ∗ j . By construction, ˇ α ∗ ∈ S all as it replaces each α ∗ i with i ∈ G k by the groupwiseaverage. It is obvious that α ∗′ b A ∗′ b A ∗ α = ˇ α ∗′ b A ∗′ b A ∗ ˇ α ∗ and ′ N b Σ ∗ α = ′ N b Σ ∗ ˇ α ∗ . On the other hand,by the triangle inequality k ˇ α ∗ k = N X i =1 | ˇ α ∗ i | = K X k =1 N k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N − k X j ∈G k α ∗ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < N X i =1 (cid:12)(cid:12) α ∗ j (cid:12)(cid:12) = k α ∗ k , { α ∗ j , j ∈ G k } change signs forsome k ∈ [ K ] . As a result, the objective function of the dual problem in (3.10) is strictly larger whenevaluated at α ∗ than that at ˇ α ∗ . This contradicts the presumption that α ∗ is an optimizer of (3.10). Part (b).
For any α ∈ R N , we have (cid:13)(cid:13)(cid:13) b A ∗ α (cid:13)(cid:13)(cid:13) = α ′ b Σ ∗ (cid:18) I N − N N ′ N (cid:19) b Σ ∗ α = α ′ b Σ ∗ b Σ ∗ α − N (cid:16) ′ N b Σ ∗ α (cid:17) . (A.5)The group structure in b Σ ∗ implies b Σ ∗ α = (cid:16) b Σ co1 · a · ′ N , . . . , b Σ co K · a · ′ N K (cid:17) ′ . Therefore, we have α ′ b Σ ∗ b Σ ∗ α = K X k =1 N k (cid:16) b Σ co k · a (cid:17) = N K X k =1 r k (cid:16) b Σ co k · a (cid:17) = N a ′ b Σ co R b Σ co a , ′ N b Σ ∗ α = K X k =1 N k b Σ co k · a = N K X k =1 r k b Σ co k · a = N r ′ b Σ co a . (A.6)Substituting these two equations to (A.5) yields (cid:13)(cid:13)(cid:13) b A ∗ α (cid:13)(cid:13)(cid:13) = N a ′ b Σ co R b Σ co a − N (cid:16) r ′ b Σ co a (cid:17) = N a ′ b Σ co (cid:0) R − rr ′ (cid:1) b Σ co a . On the other hand, noticing R1 K = r and ′ K R1 K = ′ K r = 1, we have N (cid:13)(cid:13)(cid:13) b A co a (cid:13)(cid:13)(cid:13) = N a ′ b Σ co (cid:0) I K − r1 ′ K (cid:1) R (cid:0) I K − K r ′ (cid:1) b Σ co a = N a ′ b Σ co (cid:0) R − R1 K r ′ − r1 ′ K R + r1 ′ K R1 K r ′ (cid:1) b Σ co a = N a ′ b Σ co (cid:0) R − rr ′ (cid:1) b Σ co a . Therefore we obtain (cid:13)(cid:13)(cid:13) b A ∗ α (cid:13)(cid:13)(cid:13) = N (cid:13)(cid:13)(cid:13) b A co a (cid:13)(cid:13)(cid:13) = N a ′ b A co ′ b A co a . (A.7)In the objective function (3.10), by (A.7) we can use N a ′ b A co ′ b A co a to replace α ′ b A ∗′ b A ∗ α , by (A.6)we can use r ′ b Σ co a to replace N ′ N b Σ ∗ α , and by Part (a) its solution must be of similar sign in that k α ∗ k = k a ∗ k . Consequently, the problem in (3.10) is equivalent to that in (3.12). Part (c) . We first show ˜ A co is of full column rank. The first K -rows of ˜ A co is b A co = R / ( I K − K r ′ ) b Σ co . Notice that I K − r1 ′ K is idempotent and R / and b Σ co are both of full rank,rank( b A co ) = rank( I K − r1 ′ K ) = trace( I K − r1 ′ K ) = K −
1. In other words, b A co is rank deficient andits null space is one-dimensional. The null space of b A co isker( b A co ) = n c ( b Σ co ) − K : c ∈ R \ { } o , as b A co ( b Σ co ) − K = R / ( I K − K r ′ ) K = N . Moreover, since ′ K ( b Σ co ) − K = 0, ( b Σ co ) − K is not in the null space of ′ K . In other words, ker( b A co ) ∩ ker( ′ K ) is empty and we must haverank( ˜ A co ) =rank( b A co )+rank( ′ K ) = ( K −
1) + 1 = K. τ = 0 in (3.11), we have b A ∗ α ∗ = w ∗ − N N . (A.8)Premultiplying both sides of the above equation by the K × N block diagonal matrix diag( r − / ′ N , ...,r − / K ′ N K ) , we obtain N b A co a ∗ = R − / ( b ∗ ◦ r − r ) . (A.9)As b A co is a submatrix of ˜ A co , the above equation implies N ˜ A co a ∗ = (cid:18) N b A co a ∗ N ′ K a ∗ (cid:19) = (cid:18) R − / ( b ∗ ◦ r − r )0 (cid:19) , where we use the restriction N ′ K a ∗ = 0 . Since ˜ A co is of full column rank, we explicitly solve a ∗ = (cid:16) ˜ A co ′ ˜ A co (cid:17) − ˜ A co ′ ˜ b co /N. (cid:4) Proof of Theorem 1 . Part (a) . Substituting ( w ∗ , γ ∗ ) into the constraint in (2.5), we obtain (cid:13)(cid:13)(cid:13) b Σw ∗ + γ ∗ N (cid:13)(cid:13)(cid:13) ∞ = k b Σ ∗ w ∗ + b Σ e w ∗ + γ ∗ N k ∞ = k b Σ e w ∗ k ∞ = max i k b Σ ei · w ∗ k ∞ ≤ k b Σ e k c k w ∗ k ≤ φ e k b ∗ k ∞ / √ N . (A.10)where the second equality follows by the KKT condition b Σ ∗ w ∗ + γ ∗ N = 0. The presumption φ e k b ∗ k ∞ / √ N < τ in the statement makes sure that k b Σw ∗ + γ ∗ N k ∞ < τ holds with strict inequality.This strict inequality means that ( w ∗ , γ ∗ ) lies in the interior of the feasible set of (2.5). Because b w is the minimizer of the problem in (2.5), its ℓ -norm is no greater than any other feasible solution.Thus k b w k ≤ k w ∗ k and furthermore k w ∗ k is bounded by k b ∗ k ∞ / √ N by (3.7). Part (b) . The Lagrangian of (2.6) can be written as L ( α , γ ) = 12 α ′ b A ′ b A α + 1 N ′ N b Σ α + τ k α k + γ ′ N α − N , where γ is the Lagrangian multiplier for the constraint ′ N α = 0. Consider the subgradient of L ( b α , b γ )with respect to α i for any i ∈ [ N ], where ( b α , b γ ) is the optimizer. Noting that b A ′ b A = b A ′ b Σ due tothe fact that I N − N − N ′ N is a projection matrix, the KKT conditions imply that (cid:12)(cid:12)(cid:12)(cid:12) b α ′ b A ′ b A · i + 1 N ′ N b Σ · i + b γ (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) ( b A b α + 1 N N ) ′ b Σ · i + b γ (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) b w ′ b Σ · i + b γ (cid:12)(cid:12)(cid:12) ≤ τ for all i ∈ [ N ]and furthermore b w ′ b Σ · i + b γ = τ sign ( b α i ) for all b α i = 0 . (A.11)Suppose b α / ∈ S all . Without loss of generality, let b α i > b α j < i, j ∈ G k , i = j .(A.11) indicates b w ′ b Σ · i + b γ = τ and b w ′ b Σ · j + b γ = − τ . Subtracting these two equations on both sides36ields 2 τ = (cid:12)(cid:12)(cid:12) b w ′ ( b Σ · i − b Σ · j ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) b w ′ [( b Σ e · i + b Σ ∗· i ) − ( b Σ e · j + b Σ ∗· j )] (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) b w ′ ( b Σ e · i − b Σ e · j ) (cid:12)(cid:12)(cid:12) ≤ k b Σ e · i − b Σ e · j k k b w k ≤ k b Σ e k c k b w k ≤ φ e k b ∗ k ∞ / √ N , (A.12)where the third equality holds as b Σ ∗· i = b Σ ∗· j for i and j in the same group k , and the last in-equality by Part (a) which bounds k b w k . The above inequality (A.12) violates the presumption τ > φ e k b ∗ k ∞ / √ N . We thus conclude b α ∈ S all . (cid:4) A.3 Lemmas Prepared for the Proof of Theorem 2
Lemma 4(a) below provides a compatibility inequality that links k δ k and k b A δ k for any δ ∈ ˜ S all .Instead of imposing a restricted eigenvalue condition as an assumption, we establish our comparabilityinequality under a primitive condition about φ e . Recall that φ A := φ min ( ˜ A co ′ ˜ A co ) ∧ r :=min k ∈ [ K ] r k . Lemma 4 (a) If φ e ≤ p N φ A /K , we have k δ k ≤ p K/ ( N φ A ) k b A δ k for any δ ∈ ˜ S all .(b) φ − A ≤ r − φ − ( b Σ co ) + K − φ max ( b Σ co ) /φ min ( b Σ co ) . (c) k a ∗ k ≤ N − p K/φ A (cid:0) k b ∗ k ∞ + 1 (cid:1) . Remark A.1.
The constants 1 / φ e , is controlled bythe order p N φ A /K , then the ℓ -norm of δ can be controlled by the ℓ -norm of k b A δ k multipliedby a factor involving K/φ A , which is the ratio between the number of groups K and the square ofthe minimal non-trivial singular value of the augmented weighted demeaned core ˜ A co . In the proofof Lemma 4, we introduce an original self-defined semi-norm (A.14) to take advantage of the grouppattern. Remark A.2.
Another necessary condition for Lasso to achieve reasonable performance is thatthe ℓ -norm of the true coefficients cannot be too large. In our context, since Lemma 3 implies k α ∗ k = k a ∗ k , Part (c) sets an upper bound for the ℓ -norm of the true coefficient value for (3.10). Proof of Lemma 4. Part (a) . For a generic vector δ ∈ ˜ S all , we have k b A δ k ≥ k b A ∗ δ k − k (cid:0) I N − N − N ′ N (cid:1) b Σ e δ k ≥ k b A ∗ δ k − k b Σ e δ k , (A.13)where the first inequality holds by the the triangle inequality, and the second follows because I N − N − N ′ N is a projection matrix. We will bound the two terms on the right hand side.To take advantage of the group structure to handle collinearity, we introduce a novel groupwisesemi-norm and establish a corresponding version of compatibility condition. Let d k = P i ∈G k δ i for k ∈ [ K ] and d = ( d , . . . , d K ) ′ . Define a groupwise ℓ semi-norm k·k G : R N R + as k δ k G = k d k . (A.14)The definition of the semi-norm depends on the true group membership, which is infeasible in reality.We introduce this semi-norm only for theoretical development. In the estimation we do not need to37now the true group membership. This semi-norm k δ k G allows k δ k G = 0 even if δ = N , whileit remains homogeneous, sub-additive, and non-negative — all other desirable properties of a norm.Moreover, if δ ∈ S all it is obvious k δ k = X k ∈ [ K ] (cid:12)(cid:12)(cid:12)(cid:12) X i ∈G k δ i (cid:12)(cid:12)(cid:12)(cid:12) = X k ∈ [ K ] | d k | ≤ √ K k δ k G (A.15)by either the Cauchy-Schwarz or Jensen’s inequality.For any δ ∈ ˜ S all , we have k b A ∗ δ k = √ N k b A co d k = √ N (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) b A co d (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) = √ N k ˜ A co d k ≥ p N φ A k d k = p N φ A k δ k G ≥ r N φ A K k δ k , (A.16)where the first equality follows by (A.7), the third equality by ′ K d = ′ N δ = 0, and the last inequalityby (A.15). We have found a lower bound for the first term on the right hand side of (A.13). For thesecond term on the right hand side of (A.13), we have k b Σ e δ k ≤ k b Σ || c k δ k ≤ φ e k δ k (A.17)by (A.57) and the definition of φ e .Under the presumption φ e ≤ p N φ A /K , (A.13) and (A.16)-(A.17) together imply k b A δ k ≥ ( p N φ A /K − φ e ) k δ k ≥ p N φ A /K k δ k . Then the result in (a) follows.
Part (b) . Notice˜ A co ′ ˜ A co = b A co ′ b A co + K ′ K = b Σ co (cid:0) I K − r · ′ K (cid:1) R (cid:0) I K − K · r ′ (cid:1) b Σ co + K ′ K = b Σ co R b Σ co + K ′ K − b Σ co rr ′ b Σ co . By the Sherman-Morrison formula in (A.59),( ˜ A co ′ ˜ A co ) − = A − + A − b Σ co rr ′ b Σ co A − − r ′ b Σ co A − b Σ co r , (A.18)where A = b Σ co R b Σ co + K ′ K and moreover A − = A − − A − K ′ K A − ′ K A − K (A.19)by (A.58), where A = b Σ co R b Σ co . Obviously φ max (cid:0) A − (cid:1) ≤ [ φ min ( A )] − ≤ r − φ − ( b Σ co ) (A.20)38nd ′ K A − K ≤ φ − ( R ) φ − ( b Σ co ) ′ K ( b Σ co ) − K . (A.21)The denominator of the second term on the right hand side of (A.18) is1 − r ′ b Σ co A − b Σ co r = 1 − r ′ b Σ co A − − A − K ′ K A − ′ K A − K ! b Σ co r = r ′ b Σ co A − K ′ K A − ′ K A − K b Σ co r = h ′ K ( b Σ co ) − K i ′ K A − K > , (A.22)where the second equality follows by r ′ b Σ co A − b Σ co r = r ′ R − r = ′ K r = 1 , and the third equality by r ′ b Σ co A − K = ′ K ( b Σ co ) − K . The numerator of the second term on the right hand side of (A.18)has rank 1, and thus φ max (cid:16) A − b Σ co rr ′ b Σ co A − (cid:17) = trace (cid:16) A − b Σ co rr ′ b Σ co A − (cid:17) = r ′ b Σ co A − b Σ co r ≤ r ′ b Σ co A − b Σ co r = r ′ R − ( b Σ co ) − R − r = ′ K ( b Σ co ) − K . (A.23)Combine (A.22) and (A.23), φ max (cid:16) A − b Σ co rr ′ b Σ co A − (cid:17) − r ′ b Σ co A − b Σ co r ≤ ′ K ( b Σ co ) − K × ′ K A − K h ′ K ( b Σ co ) − K i ≤ φ − ( b Σ co ) (cid:20)(cid:16) ′ K ( b Σ co ) − K (cid:17) − + φ − ( R ) φ − ( b Σ co ) (cid:21) ≤ φ − ( b Σ co ) Kφ − ( b Σ co ) + r − φ − ( b Σ co ) , (A.24)where the second inequality holds by (A.21).Applying the spectral norm to (A.18) yields φ − A = φ max (cid:16) ( ˜ A co ′ ˜ A co ) − (cid:17) ≤ φ max (cid:0) A − (cid:1) + φ max (cid:16) A − b Σ co rr ′ b Σ co A − (cid:17) − r ′ b Σ co A − b Σ co r ≤ r − φ − ( b Σ co ) + φ max ( b Σ co ) Kφ min ( b Σ co ) + r − φ − ( b Σ co )by (A.20) and (A.24). Part (c) . Given the expression of a ∗ in Lemma 3, its ℓ -norm is bounded by k a ∗ k ≤ k ( ˜ A co ′ ˜ A co ) − ˜ A co ′ k sp k ( b ∗ ◦ r − r ) ′ R − / k /N ≤ N p φ A (cid:16) k b ∗ ◦ r / k + k r / k (cid:17) ≤ N p φ A ( k b ∗ k ∞ + 1) , (A.25)39here r / = ( r / , . . . , r / k ) ′ . In addition, the Cauchy-Schwarz inequality entails k a ∗ k ≤ √ K k a ∗ k = N − p K/φ A ( k b ∗ k ∞ + 1) (A.26)as stated in the lemma. (cid:4) Lemma 5 collects the implications of Assumptions 1 and 2 for some building blocks of our asymp-totic theory. Lemma 5(a) provides the magnitude φ e , φ − A and k b ∗ k ∞ , and part (b) shows that thekey condition in the numerical properties in Theorem 1 is satisfied. Lemma 5
Under Assumptions 1 and 2, (a) φ_e = O_p(√N φ_NT), φ_A^{−1} = O_p(K), and ‖b*‖_∞ = O_p(√K); (b) the event {φ_e ‖b*‖_∞ / √N < τ} occurs w.p.a.1.
By the definition of φ e and the triangle inequality, φ e = k b Σ e k c ≤ k Σ e k c + k ∆ e k c ≤ C e φ max ( Σ e ) + √ N k ∆ e k ∞ = O p ( √ N φ NT ) , (A.27)where the second inequality and the last equality follow by Assumption 1(a).Noting that b Σ co = Σ co0 + ∆ co , and then w.p.a.1 φ min ( b Σ co ) ≥ φ min ( Σ co0 ) − k ∆ co k sp ≥ φ min ( Σ co0 ) − K k ∆ co k ∞ ≥ c − O p ( K ( T / log N ) − / ) ≥ c/ φ max ( b Σ co ) ≤ φ max ( Σ co0 ) + k ∆ co k sp ≤ c (A.29)w.p.a.1. Suppose (A.28) and (A.29) occur. Given Assumption 2(b) about the rate of r , Lemma 4(b)implies φ − A ≤ r − c − + 4 K cc = O p ( K )and (3.9) implies k b ∗ k ∞ ≤ (cid:13)(cid:13)(cid:13) ( b Σ co ) − K (cid:13)(cid:13)(cid:13) ∞ / h r · ′ K ( b Σ co ) − K i ≤ (cid:16) ′ K ( b Σ co ) − K (cid:17) / / h r · ′ K ( b Σ co ) − K i ≤ r − φ − (cid:16) b Σ co (cid:17) (cid:16) ′ K ( b Σ co ) − K (cid:17) − / ≤ r − K − / φ / (cid:16) b Σ co (cid:17) /φ min (cid:16) b Σ co (cid:17) ≤ r − K − / · O p ( c / /c ) = O p ( √ K ) . (A.30) Part (b).
The results of Part (a) gives φ e k b ∗ k ∞ / √ N = O p ( K / φ NT ) . Since K / φ NT /τ → τ > φ e k b ∗ k ∞ / √ N when ( N, T ) are sufficiently large and thus the event n φ e k b ∗ k ∞ / √ N < τ o occurs w.p.a.1. (cid:4) .4 Proofs of the Results in Section 4 Proof of Theorem 2.
When the sample size is sufficiently large we have n φ e k b ∗ k ∞ / √ N < τ o w.p.a.1 by Lemma 5(b). We can then construct the desirable b α ∗ according to (4.1). Since b α is thesolution to (A.4),12 b α ′ b A ′ b A b α + 1 N ′ N b Σ b α + τ k α k ≤ b α ∗′ b A ′ b A b α ∗ + 1 N ′ N b Σ b α ∗ + τ k b α ∗ k . Define ψ = b α − b α ∗ . Rearranging the above inequality yields ψ ′ b A ′ b A ψ + 2 τ k b α k ≤ − ψ ′ b Σ ( b A b α ∗ + N N ) + 2 τ k b α ∗ k . (A.31)Notice that ψ ′ b Σ (cid:18) b A b α ∗ + N N (cid:19) = ψ ′ b Σ ( b A ∗ b α ∗ + N N ) + ψ ′ b Σ ( b A − b A ∗ ) b α ∗ = ψ ′ b Σw ∗ + ψ ′ b Σ ′ (cid:0) I N − N − N ′ N (cid:1) b Σ e b α ∗ = ( ψ ′ b Σ ∗ w ∗ + ψ ′ b Σ e w ∗ ) + ψ ′ b A ′ b Σ e b α ∗ = ( − γ ∗ ψ ′ N + ψ ′ b Σ e w ∗ ) + ψ ′ b A ′ b Σ e b α ∗ = ψ ′ b Σ e w ∗ + ψ ′ b A ′ b Σ e b α ∗ , (A.32)where the fourth equality follows by the fact that b Σ ∗ w ∗ = − γ ∗ N implied by the KKT conditionsin (2.2) with τ = 0, and the last equality by ψ ′ N = ( b α − b α ∗ ) ′ N = 0 as the dual problems entailboth b α and b α ∗ sum up to 0. Plugging (A.32) into (A.31) to bound the right hand side of (A.31), wehave ψ ′ b A ′ b A ψ + 2 τ k b α k ≤ (cid:12)(cid:12)(cid:12) ψ ′ b Σ e w ∗ + ψ ′ b A ′ b Σ e b α ∗ (cid:12)(cid:12)(cid:12) + 2 τ k b α ∗ k ≤ k ψ k ( ζ + ζ ) + 2 τ k b α ∗ k , (A.33)where ζ = k b Σ e w ∗ k ∞ and ζ = k b A ′ b Σ e b α ∗ k ∞ .Now we bound ζ and ζ in turn. By (A.10), (A.27) and Lemma 5(a), we have ζ ≤ k b Σ e k c k w ∗ k ≤ φ e k b ∗ k ∞ / √ N = O p ( K / φ NT ) . For ζ , we have by (A.56), ζ ≤ k b A k c k b Σ e k c k b α ∗ k = k b A k c · φ e k b α ∗ k . (A.34)Noting that (cid:13)(cid:13) I N − N − N ′ N (cid:13)(cid:13) sp = 1 and by the triangle inequality, we have k b A k c ≤ k b Σ k c ≤ k b Σ ∗ k c + k b Σ ec k≤ √ N ( φ max ( Σ co0 ) + k ∆ co k ∞ ) = O p ( √ N ) , where the third inequality follows from Assumption 1(b). Noting that k b α ∗ k = k a ∗ k ≤ (cid:0) k b ∗ k ∞ + 1 (cid:1) p K/φ A /N by Lemma 4(c), we continue (A.34) to obtain ζ = O p ( √ N ) O p ( √ N φ NT ) N − p K/φ A ( k b ∗ k ∞ + 1) = O p ( K / φ NT p K/φ A )as k b ∗ k ∞ = O p (cid:0) K / (cid:1) by Lemma 5(a). We thus obtain ζ + ζ = O p ( K / φ NT p K/φ A )) by the fact K/φ A ≥ φ A .Now, suppose that the sample size is sufficiently large so that ζ + ζ ≤ τ p K/φ A / τ in Assumption 2(a). We push (A.33) further to attain ψ ′ b A ′ b A ψ + 2 τ k b α k ≤ τ p K/φ A k ψ k + 2 τ k b α ∗ k . Then ψ ′ b A ′ b A ψ ≤ τ p K/φ A k ψ k + 2 τ ( k b α ∗ k − k b α k ) ≤ τ (cid:16)p K/φ A + 2 (cid:17) k ψ k , where the last inequality follows by the triangle inequality: k b α ∗ k −k b α k ≤ k ψ k . Adding τ p K/φ A k ψ k to both sides of the above inequality yields ψ ′ b A ′ b A ψ + τ p K/φ A k ψ k ≤ τ (cid:16)p K/φ A + 1 (cid:17) k ψ k ≤ τ p K/φ A k ψ k (A.35)where the last inequality follows the fact K/φ A ≥ φ − A = O p ( K ) in Lemma 5(a), we have r K/φ A N φ e = r K/φ A N O p (cid:16) √ N φ NT (cid:17) = O p (cid:16)p K/φ A φ NT (cid:17) = O p ( Kφ NT ) = O p ( K / τ ) = o p (1) . where the last two equalities hold by Assumption 2(a). This implies that the condition φ e ≤ p N φ A /K in Lemma 4 is satisfied w.p.a.1. Moreover, ψ ∈ ˜ S all by construction of b α ∗ in (4.1).We hence invoke Lemma 4 to continue (A.35):4 τ p K/φ A k ψ k ≤ τ K/φ A √ N k b A ψ k ≤ ψ ′ b A ′ b A ψ + 32 τ ( K/φ A ) N (A.36)where the last inequality follows by 8 ab ≤ a + 32 b . Combining (A.35) and (A.36), we have12 ψ ′ b A ′ b A ψ + τ p K/φ A k ψ k ≤ τ ( K/φ A ) N . 
(A.37)The above equality immediately implies k ψ k = k b α − b α ∗ k ≤ τ ( K/φ A ) / N = O p ( K/φ A ) / τN ! = O p (cid:18) K τN (cid:19) and q ψ ′ b A ′ b A ψ = (cid:13)(cid:13)(cid:13) b A ( b α − b α ∗ ) (cid:13)(cid:13)(cid:13) ≤ τ K/φ A √ N = O p (cid:18) ( K/φ A ) τ √ N (cid:19) = O p (cid:18) K τ √ N (cid:19) , (A.38)given K/φ A = O p (cid:0) K (cid:1) by Lemma 5(a). This completes the proof of the theorem. (cid:4) roof of Corollary 1 . Recall that b A ∗ = ( I N − N − N ′ N ) b Σ ∗ and b A = ( I N − N − N ′ N ) b Σ . Let b A e := ( I N − N − N ′ N ) b Σ e . Then we have b w − w ∗ = (cid:18) b A b α + N N (cid:19) − (cid:18) b A ∗ b α ∗ + N N (cid:19) = b A b α − (cid:16) b A − b A e (cid:17) b α ∗ = b A ( b α − b α ∗ ) + b A e b α ∗ . (A.39)For the first term in (A.39), by (A.38) we have k b A ( b α − b α ∗ ) k = O p (cid:16) N − / K τ (cid:17) . (A.40)For the second term in (A.39), we have k b A e b α ∗ k ≤ k b Σ e b α ∗ k ≤ k Σ e b α ∗ k + k ∆ e b α ∗ k := I + I bythe triangle inequality. Notice that I ≤ φ max ( Σ e ) k b α ∗ k ≤ φ max ( Σ e ) k a ∗ k ≤ O p (cid:16) √ N φ NT (cid:17) k b ∗ k ∞ + 1 N p φ A = O p K / φ NT p N φ A ! , (A.41)where the second inequality follows as b α ∗ ∈ ˜ S all ⊂ S all by construction, the third inequality by (A.25)and Assumption 1(a), and the last equality by and Lemma 5(a). Moreover, I ≤ k ∆ e k c k b α ∗ k ≤ √ N k ∆ e k ∞ k b α ∗ k = √ N k ∆ e k ∞ k a ∗ k ≤ √ N O p (cid:16) ( T / log N ) − / (cid:17) N − p K/φ A ( k b ∗ k ∞ + 1)= O p (cid:16) Kφ NT / p N φ A (cid:17) , (A.42)where the first inequality follows by (A.57), the first equality holds by the fact that k b α ∗ k = k α ∗ k = k a ∗ k as in (4.2), the third inequality by Assumption 1(a) and Lemma 4(c), and the last equalityholds and Lemma 5(a).Then collecting (A.39), (A.40), (A.41) and (A.42), we have k b w − w ∗ k = O p (cid:16) N − / τ K (cid:17) + O p (cid:16) Kφ NT / p N φ A (cid:17) = O p (cid:16) N − / τ K (cid:17) = o p (cid:16) N − / (cid:17) by Assumption 2(a) and Lemma 5(a). In addition, the Cauchy-Schwarz inequality immediatelyimplies k b w − w ∗ k ≤ √ N k b w − w ∗ k = O p (cid:0) K τ (cid:1) = o p (1) . (cid:4) Proof of Theorem 3 . Part (a) . Denote ψ w = b w − w ∗ . We first show the in-sample oracleinequality. Decompose b w ′ b Σ b w − w ∗′ b Σ ∗ w ∗ = ( w ∗′ b Σw ∗ + 2 ψ ′ w b Σw ∗ + ψ w b Σ ψ w ) − w ∗′ b Σ ∗ w ∗ = w ∗′ b Σ e w ∗ + 2 ψ ′ w b Σw ∗ + ψ w b Σ ψ w = w ∗′ b Σ e w ∗ + 2 ψ ′ w ( b Σ ∗ + ∆ e ) w ∗ + 2 ψ ′ w Σ e w ∗ + ψ ′ w ( b Σ ∗ + ∆ e ) ψ w + ψ ′ w Σ e ψ w = : II + 2 II + 2 II + II + II .
43e bound II by | II | ≤ φ max ( b Σ e ) k w ∗ k ≤ ( φ max ( Σ e ) + φ max ( ∆ e )) k w ∗ k ≤ ( φ max ( Σ e ) + N || ∆ e || ∞ ) k w ∗ k ≤ (cid:16) O p ( √ N φ NT ) + N O p (( T / log N ) − / ) (cid:17) k b ∗ k ∞ N = O p ( Kφ NT ) , where the third inequality holds by the Gershgorin circle theorem, and the fourth by Assumption 1,and the last by Lemma 5(a). The second term II is bounded by | II | ≤ k b Σ ∗ + ∆ e k ∞ k ψ w k k w ∗ k ≤ (cid:16) k b Σ ∗ k ∞ + k ∆ e k ∞ (cid:17) k ψ w k k w ∗ k ≤ (cid:16) k b Σ co k ∞ + k ∆ e k ∞ (cid:17) k ψ w k √ N k w ∗ k = O p (cid:16) c + ( T / log N ) − / (cid:17) O p (cid:0) τ K (cid:1) k b ∗ k ∞ = O p (cid:16) τ K / (cid:17) , where the first inequality follows by (A.54), the third inequality by the Cauchy-Schwarz inequality,the first equality holds by Assumptions 1, Corollary 1, and Theorem 1(a), and the last equality byLemma 5(a). For II , we have | II | ≤ φ max ( Σ e ) k ψ w k k w ∗ k = O p ( √ N φ NT ) O p (cid:16) N − / τ K (cid:17) k b ∗ k ∞ N − / = O p (cid:16) N − / φ NT τ K / (cid:17) by (A.55), Assumptions 1, and Corollary 1. Similarly, | II | ≤ k b Σ ∗ + ∆ e k ∞ k ψ w k = O p ( c + ( T / log N ) − / ) O p (cid:0) τ K (cid:1) = O p (cid:0) τ K (cid:1) , and | II | ≤ φ max ( Σ e ) k ψ w k = O p ( √ N φ NT ) (cid:0) N − τ K (cid:1) = O p (cid:16) N − / φ NT τ K (cid:17) . Collecting all these five terms, and notice that O p (cid:0) τ K / (cid:1) is the dominating order, we have (cid:12)(cid:12)(cid:12) b w ′ b Σ b w − w ∗∗′ b Σ ∗ w ∗ (cid:12)(cid:12)(cid:12) = O p (cid:16) τ K / (cid:17) = o p (1)under Assumption 2(a). Part (b) . The same argument goes through if we replace b Σ with b Σ new , and replace b Σ ∗ with b Σ ∗ new , because in the above analysis of the in-sample oracle inequality we always bound the variousquantities by separating the norms of vectors and the square matrices. We conclude the out-of-sampleoracle inequality. Part (c) . This proof involves two steps: (i) Establish the closeness between b w ′ b Σ new b w and b w ′ b Σ b w as shown in (A.47) below; (ii) Establish the closeness between b w ′ b Σ b w and Q ( Σ ) as shown in (A.52)below, where Q ( S ) := min w ∈ R N , w ′ N =1 w ′ Sw for a generic N × N positive semi-definite matrix S .Obviously b w ′ b Σ new b w ≥ b w ′ b Σ b w = Q ( b Σ ). On the other hand, b w ′ b Σ b w = b w ′ b Σ new b w + b w ′ ( b Σ − b Σ new ) b w ≥ b w ′ b Σ new b w − k b Σ − b Σ new k sp k b w k (A.43)by the triangle inequality and (A.55). We focus on the term k b Σ − b Σ new k sp k b w k .44or the first factor, notice b Σ − b Σ new = b Σ ∗ − b Σ ∗ , new + b Σ e − b Σ e, new = ( b Σ ∗ − Σ ∗ ) − ( b Σ ∗ , new − Σ ∗ ) + ∆ e − ∆ e, new . Under Assumption 1(b), k b Σ ∗ − Σ ∗ k ∞ = k b Σ co − Σ co0 k ∞ = k ∆ co k ∞ = O p (( T / log N ) − / ) and thereforeby Gershgorin circle theorem the spectral norm is bounded by k b Σ ∗ − Σ ∗ k sp ≤ N k ∆ co k ∞ = O p (cid:16) N ( T / log N ) − / (cid:17) . (A.44)Gershgorin circle theorem also implies k ∆ e k sp ≤ φ e = O p ( √ N φ NT ) , where the stochastic orderfollows by Lemma 5(a). Since the new testing data comes from the same data generating process asthat of the training data, the same stochastic bounds is applicable to the terms involving the newdata, and then k b Σ − b Σ new k sp ≤ k b Σ ∗ − Σ ∗ k sp + k b Σ ∗ , new − Σ ∗ k sp + k ∆ e k sp + k ∆ e, new k sp = O p (cid:16) N ( T / log N ) − / (cid:17) + O p ( √ N φ NT ) = O p ( N φ NT ) . (A.45)The second factor is bounded by k b w k ≤ k b ∗ k ∞ /N = O p ( K/N ) (A.46)according to Theorem 1(a) and Lemma 5(a). 
Collecting (A.43), (A.45) and (A.46), we have
$$0 \le \widehat{w}'\widehat\Sigma_{\mathrm{new}}\widehat{w} - \widehat{w}'\widehat\Sigma\widehat{w} \le \|\widehat\Sigma - \widehat\Sigma_{\mathrm{new}}\|_{sp}\|\widehat{w}\|^2 = O_p(N\phi_{NT})\,O_p(K/N) = O_p(K\phi_{NT}). \quad \text{(A.47)}$$

Next, consider the population matrices $\Sigma_0$ and $\Sigma^*_0$. Because $\Sigma_0 - \Sigma^*_0 = \Sigma_e$ is positive semi-definite,
$$Q(\Sigma_0) \ge Q(\Sigma^*_0). \quad \text{(A.48)}$$
Since $\mathrm{rank}(\Sigma^*_0) = K \ll N$, the solution to $\min_{w \in \mathbb{R}^N,\, w'1_N = 1} w'\Sigma^*_0 w$ is not unique, but all solutions attain the same minimum $Q(\Sigma^*_0)$. Thus, in order to evaluate $Q(\Sigma^*_0)$, we can simply use the within-group equal-weight optimizer $w^\sharp$, which solves
$$\min_{(w,\gamma) \in \mathbb{R}^{N+1}} \|w\|^2 \quad \text{subject to} \quad w'1_N = 1 \ \text{ and } \ \Sigma^*_0 w + \gamma 1_N = 0.$$
The only difference between $w^\sharp$ and $w^*$ is that the former is associated with the population $\Sigma^*_0$ while the latter is associated with the sample $\widehat\Sigma^*$. Parallel to (3.7), (3.9) and (A.30), we can write
$$w^\sharp = \Big(\frac{b^\sharp_1}{N}\,1'_{N_1}, \cdots, \frac{b^\sharp_K}{N}\,1'_{N_K}\Big)', \quad \text{where } b^\sharp = r^{\circ(-1)} \circ (\Sigma^{co}_0)^{-1}1_K \big/ \big[1'_K(\Sigma^{co}_0)^{-1}1_K\big],$$
and it is bounded by
$$\|b^\sharp\|_\infty \le \frac{\big\|(\Sigma^{co}_0)^{-1}1_K\big\|_\infty}{r_{\min}\,1'_K(\Sigma^{co}_0)^{-1}1_K} \le \frac{K^{-1/2}}{r_{\min}}\cdot\frac{\phi_{\max}(\Sigma^{co}_0)}{\phi_{\min}(\Sigma^{co}_0)} = O(\sqrt{K})$$
under Assumption 1(b), and furthermore
$$\|w^\sharp\|^2 \le N\|w^\sharp\|_\infty^2 = N\big(\|b^\sharp\|_\infty/N\big)^2 = O(K/N). \quad \text{(A.49)}$$
We continue from (A.48):
$$Q(\Sigma^*_0) = w^{\sharp\prime}\widehat\Sigma^* w^\sharp + w^{\sharp\prime}\big(\Sigma^*_0 - \widehat\Sigma^*\big)w^\sharp \ge w^{*\prime}\widehat\Sigma^* w^* + w^{\sharp\prime}\big(\Sigma^*_0 - \widehat\Sigma^*\big)w^\sharp \ge w^{*\prime}\widehat\Sigma^* w^* - \|\Sigma^*_0 - \widehat\Sigma^*\|_{sp}\|w^\sharp\|^2, \quad \text{(A.50)}$$
where the first inequality follows because $w^*$ is the optimizer associated with $\widehat\Sigma^*$, and the second inequality is derived by the same reasoning as used to obtain (A.43). (A.50) and (A.48) imply
$$w^{*\prime}\widehat\Sigma^* w^* \le Q(\Sigma_0) + \|\Sigma^*_0 - \widehat\Sigma^*\|_{sp}\|w^\sharp\|^2 \le Q(\Sigma_0) + O_p\big(K(T/\log N)^{-1/2}\big) \quad \text{(A.51)}$$
in view of (A.44) and (A.49). Combining Part (a) and (A.51),
$$\widehat{w}'\widehat\Sigma\widehat{w} \le Q(\Sigma_0) + O_p\big(\tau K^{3/2}\big). \quad \text{(A.52)}$$
In conjunction with (A.47), and noticing that $K\phi_{NT}$ is of smaller order than $\tau K^{3/2}$, the conclusion follows. $\blacksquare$
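The comparison (A.48) and the closed-form minimizer behind $Q(\cdot)$ are easy to check numerically. Below is a minimal Python sketch (not part of the formal argument; the dimensions, seed and matrix constructions are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 30, 3

def Q(S):
    # Q(S) = min_{w : w'1_N = 1} w'Sw; for an invertible S the minimizer
    # is S^{-1} 1_N / (1_N' S^{-1} 1_N)
    one = np.ones(S.shape[0])
    w = np.linalg.solve(S, one)
    w = w / (one @ w)
    return float(w @ S @ w)

# rank-K "group" component plus a tiny ridge to keep it invertible
B = rng.standard_normal((N, K))
Sigma_star = B @ B.T + 1e-6 * np.eye(N)
A = rng.standard_normal((N, N))
Sigma_e = A @ A.T                      # positive semi-definite by construction
Sigma0 = Sigma_star + Sigma_e

assert Q(Sigma0) >= Q(Sigma_star) - 1e-10   # the comparison in (A.48)
```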
A.5 Elementary Inequalities on Matrix Norms

We collect some elementary inequalities used in the proofs.
Lemma 6. Let $a$ and $b$ be two vectors, and $A$ and $B$ be two matrices of compatible dimensions. Then we have
$$\|Ab\|_\infty \le \|A\|_\infty \|b\|_1 \quad \text{(A.53)}$$
$$|a'Ab| \le \|A\|_\infty \|a\|_1 \|b\|_1 \quad \text{(A.54)}$$
$$|a'Ab| \le \|A\|_{sp} \|a\| \|b\| \quad \text{(A.55)}$$
$$\|ABb\|_\infty \le \|A'\|_c \|B\|_c \|b\|_1. \quad \text{(A.56)}$$
If $S$ is a symmetric matrix,
$$\|Sb\| \le \|S\|_c \|b\|_1. \quad \text{(A.57)}$$
If $\Sigma$ is positive definite,
$$\big(\Sigma + aa'\big)^{-1} = \Sigma^{-1} - \frac{\Sigma^{-1}aa'\Sigma^{-1}}{1 + a'\Sigma^{-1}a} \quad \text{(A.58)}$$
$$\big(\Sigma - aa'\big)^{-1} = \Sigma^{-1} + \frac{\Sigma^{-1}aa'\Sigma^{-1}}{1 - a'\Sigma^{-1}a}. \quad \text{(A.59)}$$

Proof. The first inequality follows because $\|Ab\|_\infty = \max_i |A_{i\cdot} b| \le \max_i \|A_{i\cdot}\|_\infty \|b\|_1 = \|A\|_\infty \|b\|_1$. It implies the second inequality,
$$|a'Ab| \le \|a\|_1 \|Ab\|_\infty \le \|A\|_\infty \|a\|_1 \|b\|_1,$$
and the fourth inequality,
$$\|ABb\|_\infty \le \|AB\|_\infty \|b\|_1 = \max_{i,j} |A_{i\cdot}B_{\cdot j}|\, \|b\|_1 \le \|A'\|_c \|B\|_c \|b\|_1.$$
The third inequality follows from the Cauchy-Schwarz inequality: $|a'Ab| \le \|A'a\|\|b\| \le \|A\|_{sp}\|a\|\|b\|$. For the symmetric matrix $S$,
$$\|Sb\| = \sqrt{b'SSb} \le \sqrt{\|SS\|_\infty}\,\|b\|_1 \le \sqrt{\max_i (SS)_{ii}}\,\|b\|_1 = \|S\|_c \|b\|_1,$$
where the first inequality follows by (A.54), and the second inequality and the last equality are due to the symmetry of $S$.

The Sherman-Morrison formula gives $(\Sigma + ab')^{-1} = \Sigma^{-1} - \Sigma^{-1}ab'\Sigma^{-1}/(1 + b'\Sigma^{-1}a)$ for any compatible vectors $a$ and $b$; (A.58) and (A.59) follow by setting $b = a$ and $b = -a$, respectively. $\blacksquare$
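The inequalities of Lemma 6 are also easy to verify numerically. The sketch below checks (A.53)-(A.58) on random inputs, under our reading of the norm conventions in the proofs: $\|\cdot\|_\infty$ of a matrix is the maximal absolute entry, and $\|\cdot\|_c$ is the maximal column $\ell_2$-norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
a, b = rng.standard_normal(n), rng.standard_normal(n)
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
S = A + A.T                                   # symmetric
Sig = A @ A.T + np.eye(n)                     # positive definite

max_elem = lambda M: np.max(np.abs(M))                   # ||M||_inf of a matrix
col_norm = lambda M: np.max(np.linalg.norm(M, axis=0))   # ||M||_c
l1 = lambda v: np.sum(np.abs(v))
tol = 1e-12

assert np.max(np.abs(A @ b)) <= max_elem(A) * l1(b) + tol                      # (A.53)
assert abs(a @ A @ b) <= max_elem(A) * l1(a) * l1(b) + tol                     # (A.54)
assert abs(a @ A @ b) <= np.linalg.norm(A, 2) * np.linalg.norm(a) * np.linalg.norm(b) + tol  # (A.55)
assert np.max(np.abs(A @ B @ b)) <= col_norm(A.T) * col_norm(B) * l1(b) + tol  # (A.56)
assert np.linalg.norm(S @ b) <= col_norm(S) * l1(b) + tol                      # (A.57)

Sinv = np.linalg.inv(Sig)                                                      # (A.58)
lhs = np.linalg.inv(Sig + np.outer(a, a))
rhs = Sinv - (Sinv @ np.outer(a, a) @ Sinv) / (1 + a @ Sinv @ a)
assert np.allclose(lhs, rhs)
```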
B Additional Results for the Numerical Work

In this appendix, we report additional designs and results for the numerical work in the paper.
B.1 Simulation of the Demonstrative Example in Figure 1
The demonstrative example in Figure 1 is generated from the DGP
$$y_{t+1} = \sum_{i=1}^{20} w_i f_{it} + u_{t+1}, \quad t = 1, \ldots, 100.$$
The dependent variable $y_{t+1}$ is a linear combination of two groups of input variables, $\{f_{it}\}_{i=1}^{10}$ and $\{f_{it}\}_{i=11}^{20}$, with group weights $w_i = w_{(1)} \cdot 1\{1 \le i \le 10\} + w_{(2)} \cdot 1\{11 \le i \le 20\}$, where $1\{\cdot\}$ denotes the indicator function and the two within-group constants $w_{(1)}$ and $w_{(2)}$ are chosen so that $\sum_{i=1}^{20} w_i = 1$. We set $f_{it} \sim$ i.i.d. $N(1,1)$ for $1 \le i \le 10$, $f_{it} \sim$ i.i.d. $N(0,1)$ for $11 \le i \le 20$, and $u_{t+1} \sim$ i.i.d. $N(0, \sigma_u^2)$. We implement the $\ell_2$-relaxation method under different values of $\tau$ on an evenly spaced grid from 0 to 1. Let $\widehat{w}_1 = \widehat{w}_{\tau,1}$ denote the first element of the $\ell_2$-relaxation estimator $\widehat{w}_\tau$. We report in Figure 1 the empirical squared bias, variance and MSE of $\widehat{w}_1$, and those of the one-step-ahead forecast $\widehat{y}_{n+1}$ with $n = 100$, over 1000 replications, as functions of $\tau$.
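For readers who wish to replicate this exercise, a minimal Python sketch follows. It writes the $\ell_2$-relaxation step as the convex program $\min_w \|w\|^2$ subject to $w'1_N = 1$ and $\|\widehat\Sigma w + \gamma 1_N\|_\infty \le \tau$, which is our reading of the relaxed first-order conditions; the within-group weights and the noise standard deviation below are hypothetical placeholders rather than the paper's exact design constants.

```python
import numpy as np
import cvxpy as cp

def l2_relax(Sigma_hat, tau):
    """Sketch of l2-relaxation: min ||w||^2 s.t. w'1 = 1, ||Sigma_hat w + gamma 1||_inf <= tau."""
    N = Sigma_hat.shape[0]
    w, gamma = cp.Variable(N), cp.Variable()
    constraints = [cp.sum(w) == 1,
                   cp.norm(Sigma_hat @ w + gamma * np.ones(N), "inf") <= tau]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value

rng = np.random.default_rng(42)
n = 100
w0 = np.r_[np.full(10, 0.06), np.full(10, 0.04)]   # hypothetical weights, summing to one
f = np.column_stack([rng.normal(1.0, 1.0, (n, 10)),
                     rng.normal(0.0, 1.0, (n, 10))])
y = f @ w0 + 0.5 * rng.standard_normal(n)          # 0.5 is a placeholder noise s.d.

e = y[:, None] - f                   # forecast errors e_it = y_{t+1} - f_it
Sigma_hat = e.T @ e / n              # sample variance-covariance of the errors
w_hat = l2_relax(Sigma_hat, tau=0.5)
print(w_hat.round(3))                # weights are roughly equal within each group
```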
B.2 Calculation of SNR in Section 5

To obtain the SNR in the simulations, we decompose $y_{t+1} = w_\psi^{*\prime}(f_t - u_t) + u_{y,t+1}$ into
$$y_{t+1} = \underbrace{w_\psi^{*\prime}\big(f_t - P[u_t|f_t]\big)}_{\text{signal}} \;\underbrace{-\, w_\psi^{*\prime}\big(u_t - P[u_t|f_t]\big) + u_{y,t+1}}_{\text{noise}},$$
where $P[u_t|f_t] = E[u_t f_t']\big(E[f_t f_t']\big)^{-1} f_t = \Omega_u(\Psi + \Omega_u)^{-1} f_t$ is the projection of $u_t$ onto the linear space spanned by $f_t$. By construction, the signal and noise terms are orthogonal in that
$$E\big[w_\psi^{*\prime}\big(f_t - P[u_t|f_t]\big)\big(u_t - P[u_t|f_t]\big)'w_\psi^*\big] = w_\psi^{*\prime} E\Big[\big(f_t - \Omega_u(\Psi+\Omega_u)^{-1}f_t\big)\big(u_t - \Omega_u(\Psi+\Omega_u)^{-1}f_t\big)'\Big] w_\psi^* = 0.$$
A simple calculation shows that the variance of the noise component is
$$\mathrm{var}\big[w_\psi^{*\prime}(u_t - P[u_t|f_t]) + u_{y,t+1}\big] = w_\psi^{*\prime}\big[\Omega_u - \Omega_u(\Psi+\Omega_u)^{-1}\Omega_u\big]w_\psi^* + \sigma_y^2.$$
This, along with the fact that $\mathrm{var}[y_{t+1}] = w_\psi^{*\prime}\Psi w_\psi^* + \sigma_y^2$, implies that the SNR here can be defined as
$$\mathrm{SNR} = \frac{\mathrm{var}[y_{t+1}] - \mathrm{var}\big[w_\psi^{*\prime}(u_t - P[u_t|f_t]) + u_{y,t+1}\big]}{\mathrm{var}\big[w_\psi^{*\prime}(u_t - P[u_t|f_t]) + u_{y,t+1}\big]} = \frac{w_\psi^{*\prime}\big[\Psi - \Omega_u + \Omega_u(\Psi+\Omega_u)^{-1}\Omega_u\big]w_\psi^*}{w_\psi^{*\prime}\big[\Omega_u - \Omega_u(\Psi+\Omega_u)^{-1}\Omega_u\big]w_\psi^* + \sigma_y^2}.$$
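The decomposition can be verified numerically; the sketch below computes the closed-form SNR and compares it with a Monte Carlo estimate (all inputs, $\Psi$, $\Omega_u$, $\sigma_y^2$ and the weight vector, are arbitrary placeholders of ours):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 5, 200_000
w = rng.dirichlet(np.ones(N))                   # stands in for w*_psi
A1, A2 = rng.standard_normal((N, N)), rng.standard_normal((N, N))
Psi = A1 @ A1.T + np.eye(N)                     # var(f_t - u_t)
Omega_u = 0.5 * A2 @ A2.T + 0.5 * np.eye(N)     # var(u_t)
sigma2_y = 1.0

P = Omega_u @ np.linalg.inv(Psi + Omega_u)      # coefficient matrix of P[u_t | f_t]
snr = (w @ (Psi - Omega_u + P @ Omega_u) @ w) / \
      (w @ (Omega_u - P @ Omega_u) @ w + sigma2_y)

# Monte Carlo check: f_t = g_t + u_t with g_t ~ N(0, Psi) independent of u_t
g = rng.multivariate_normal(np.zeros(N), Psi, size=T)
u = rng.multivariate_normal(np.zeros(N), Omega_u, size=T)
f = g + u
proj = f @ P.T                                  # P[u_t | f_t], row by row
signal = (f - proj) @ w
noise = (u - proj) @ w + np.sqrt(sigma2_y) * rng.standard_normal(T)
print(snr, signal.var() / noise.var())          # the two should be close
```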
B.3 Simulation Results under DGP2 by Conventional 5-fold CV

The MSFEs under DGP2 obtained with the conventional 5-fold CV are close to those in Table 2, and all patterns remain. A sketch of the CV procedure is given below, followed by the results in Table 7.
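For concreteness, the conventional (non-time-ordered) 5-fold CV can be sketched as follows; it reuses the hypothetical l2_relax routine from the sketch in Appendix B.1, and the candidate grid for $\tau$ is an arbitrary choice.

```python
import numpy as np

def cv_tau(y, f, taus, n_folds=5, seed=0):
    """Pick tau by conventional (randomly partitioned) 5-fold CV on squared errors."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for tau in taus:
        mse = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(len(y)), test)
            e = y[train, None] - f[train]
            Sigma_hat = e.T @ e / len(train)
            w = l2_relax(Sigma_hat, tau)        # sketched in Appendix B.1
            mse += np.mean((y[test] - f[test] @ w) ** 2)
        scores.append(mse / n_folds)
    return taus[int(np.argmin(scores))]

# example usage: tau_cv = cv_tau(y, f, taus=np.linspace(0.0, 1.0, 11))
```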
Table 7: MSFE for DGP2 by Conventional 5-fold CV

  T    N    K   Oracle    SA     Lasso   Ridge   PC(q=5)  PC(q=10)  PC(q=20)  ℓ2-relax
Panel A: Low SNR
  50  100   2   0.259    1.129   0.708   1.484   0.834    0.828     0.939     0.272
 100  200   4   0.161    2.980   0.379   1.169   1.138    1.257     1.218     0.196
 200  300   6   0.117    4.311   0.206   0.419   1.368    1.430     1.339     0.118
Panel B: High SNR
  50  100   2   0.263    1.078   0.483   0.512   0.769    0.838     0.820     0.270
 100  200   4   0.149    3.101   0.200   0.262   1.081    1.202     1.291     0.155
 200  300   6   0.096    4.452   0.129   0.139   1.265    1.367     1.459     0.121
B.4 MAFE for the Simulations
In this section we report the mean absolute forecast error (MAFE) in the simulations. The MAFE is defined as
$$\mathrm{MAFE} = E\big[\big|y_{T+1} - \widehat{w}'f_{T+1}\big|\big] - \sigma_y\sqrt{2/\pi},$$
where the unpredictable component, $\sigma_y \int_{-\infty}^{\infty} |x| \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\,dx = \sigma_y\sqrt{2/\pi}$, is subtracted. In the simulations $\sigma_y$ is known, but it is unknown in empirical applications.

The results are collected in Table 8. The results of the oracle, Lasso, Ridge, and the $\ell_2$-relaxation tend to decrease with $T$, whereas those of SA and PC may diverge as $T$ increases. Similar to the MSFE results in the main text, the $\ell_2$-relaxation is always the best feasible estimator in all cases.
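In code, the adjustment simply subtracts $\sigma_y\sqrt{2/\pi}$, the mean absolute value of an $N(0, \sigma_y^2)$ variable, from the raw mean absolute error; a minimal sketch (the function name is ours):

```python
import numpy as np

def mafe(y_true, y_pred, sigma_y):
    """Mean absolute forecast error, net of the unpredictable
    component sigma_y * sqrt(2 / pi) = E|N(0, sigma_y^2)|."""
    err = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    return err - sigma_y * np.sqrt(2.0 / np.pi)
```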
Table 8: MAFE for Simulations

  T    N    K   Oracle    SA     Lasso   Ridge   PC(q=5)  PC(q=10)  PC(q=20)  ℓ2-relax
Panel A: DGP1 with low SNR
  50  100   2   0.336    0.721   0.457   0.472   0.604    0.614     0.609     0.346
 100  200   4   0.224    1.315   0.273   0.327   0.726    0.799     0.798     0.245
 200  300   6   0.177    1.595   0.209   0.215   0.809    0.837     0.849     0.195
Panel B: DGP1 with high SNR
  50  100   2   0.063    0.356   0.233   0.422   0.278    0.278     0.305     0.085
 100  200   4   0.076    0.845   0.153   0.351   0.377    0.398     0.448     0.099
 200  300   6   0.017    1.067   0.051   0.104   0.432    0.403     0.404     0.034
Panel C: DGP2 with low SNR
  50  100   2   0.340    0.753   0.479   0.486   0.612    0.650     0.643     0.343
 100  200   4   0.236    1.313   0.285   0.336   0.740    0.780     0.809     0.259
 200  300   6   0.177    1.608   0.210   0.216   0.791    0.834     0.862     0.195
Panel D: DGP2 with high SNR
  50  100   2   0.105    0.378   0.246   0.388   0.290    0.286     0.325     0.111
 100  200   4   0.081    0.777   0.173   0.389   0.356    0.407     0.375     0.089
 200  300   6   0.053    1.053   0.094   0.158   0.428    0.447     0.428     0.069
Panel E: DGP3 with low SNR
  50  100   2   0.435    0.799   0.578   0.599   0.720    0.709     0.705     0.441
 100  200   4   0.364    1.402   0.426   0.488   0.961    1.018     1.032     0.383
 200  300   6   0.350    1.609   0.372   0.390   1.083    1.120     1.126     0.364
Panel F: DGP3 with high SNR
  50  100   2   0.163    0.406   0.387   0.550   0.356    0.339     0.360     0.197
 100  200   4   0.073    0.856   0.191   0.426   0.448    0.463     0.506     0.090
 200  300   6   0.047    1.085   0.176   0.290   0.586    0.625     0.603     0.054

B.5 MAFE for the Empirical Applications

Tables 9-11 report the MAFEs relative to the respective benchmarks in the three empirical applications. The results and patterns are robust in comparison with those based on the MSFE.

Table 9: Relative MAFE of Box Office Forecast

 n_ev   PMA     Lasso   Ridge   ℓ2-relax
  10   1.000    0.999   1.078   0.985
  20   1.000    1.003   1.072   0.984
  30   1.000    1.003   1.064   0.976
  40   1.000    1.008   1.074   0.965

Note: The MAFE of PMA is normalized to 1.
Table 10: Relative MAFE of Inflation Forecast

 horizon            SA     Lasso   Ridge   ℓ2-relax
 One-year-ahead    1.000   0.971   0.996   0.940
 Two-year-ahead    1.000   0.857   0.994   0.709

Note: The MAFE of SA is normalized to 1.