Learning from Forecast Errors: A New Approach to Forecast Combinations
Tae-Hwy Lee∗ and Ekaterina Seregina†

September 14, 2020
Abstract
This paper studies forecast combination (as an expert system) using the precision matrix estimation of forecast errors when the latter admit the approximate factor model. This approach incorporates the fact that experts often use common sets of information and hence tend to make common mistakes. This premise is evidenced in many empirical results. For example, the European Central Bank's Survey of Professional Forecasters on Euro-area real GDP growth demonstrates that the professional forecasters tend to jointly understate or overstate GDP growth. Motivated by this stylized fact, we develop a novel framework which exploits the factor structure of the forecast errors and the sparsity in the precision matrix of the idiosyncratic components of the forecast errors. The proposed algorithm is called the Factor Graphical Model (FGM). Our approach overcomes the challenge of obtaining forecasts that contain unique information, which was shown to be necessary to achieve a "winning" forecast combination. In simulations, we demonstrate the merits of the FGM in comparison with the equal-weighted forecasts and the standard graphical methods in the literature. An empirical application to forecasting macroeconomic time series in a big data environment highlights the advantage of the FGM approach in comparison with the existing methods of forecast combination.
Keywords: High-dimensionality; Graphical Lasso; Approximate Factor Model; Nodewise Regression; Precision Matrix
JEL Classifications: C13, C38, C55

∗ Department of Economics, University of California, Riverside. Email: [email protected].
† Department of Economics, University of California, Riverside. Email: [email protected].

1 Introduction
The search for the best forecast combination has been an important ongoing research question in economics. Clemen (1989) pointed out that combining forecasts is "practical, economical and useful. Many empirical tests have demonstrated the value of composite forecasting. We no longer need to justify that methodology". However, as demonstrated by Diebold and Shin (2019), some issues remain unresolved. Despite findings based on theoretical grounds, equal-weighted forecasts have proved surprisingly difficult to beat. It can be shown (see Timmermann (2006)) that equal weights are optimal in situations with an arbitrary number of forecasts when the individual forecast errors have the same variance and identical pairwise correlations. Many methodologies that seek the best forecast combination use equal weights as a benchmark: for instance, Diebold and Shin (2019) develop the "partially egalitarian Lasso" that discards some forecasts and then selects and shrinks the remaining forecasts toward equal weights.

In this paper we are interested in finding the combination of forecasts which yields the best out-of-sample performance in terms of the mean-squared forecast error (MSFE). We claim that the success of equal weights is partly due to the fact that forecasters use the same set of public information to make forecasts and, hence, tend to make common mistakes. This statement is supported by the empirical results based on the European Central Bank's Survey of Professional Forecasters of Euro-area real GDP growth: we demonstrate that the forecasters tend to jointly understate or overstate GDP growth. Therefore, we propose that the forecast errors include common and idiosyncratic components, which allows us to model the tendency of the forecast errors to move together due to the common component.

Several recent papers support this methodology: Atiya (2020) provides a good graphical illustration showing that "forecast combination should be a winning strategy if the constituent forecasts are either diverse or comparable in performance". Thomson et al. (2019) find that the benefit of combining forecasts from many models/experts depends upon the extent to which these individual forecasts contain unique information. Our paper provides a simple framework to separate unique/individual errors from the common ones to improve the accuracy of the combined forecast.

In high dimensions, when the number of forecasts is large relative to the sample size, the sample covariance matrix of the forecast errors underlying the standard forecast combination (see Bates and Granger (1969)) is subject to estimation uncertainty. In the literature on portfolio allocation and forecast combination, there are three popular methods to construct a good covariance matrix estimator. The first uses shrinkage of the sample covariance matrix (see Ledoit and Wolf (2003, 2004a,b, 2012, 2015) among others). The second imposes some structure on the data, such as using factor models to decompose the covariance matrix into low-rank and sparse components (see Fan et al. (2011, 2019, 2017)). The third uses thresholding of the sample covariance (see Bickel and Levina (2008); Cai and Liu (2011)). Rothman et al. (2008) emphasized that shrinkage estimators of the form proposed by Ledoit and Wolf (2003, 2004b) do not affect the eigenvectors of the covariance, only the eigenvalues. However, Johnstone and Lu (2009) showed that the sample eigenvectors are also not consistent in high dimensions.

When constructing a covariance matrix estimator using the factor model for high-dimensional problems, the idiosyncratic part remains large after we estimate the low-rank component. Therefore, in addition to a factor model, we need some sparsity condition for estimating the residual precision matrix. Thresholding estimators produce a sparse covariance but do not take into account the structure in the data. Fan et al. (2008) showed that the precision matrix takes advantage of the factor structure and, hence, can be better estimated in the factor approach. At the same time, the forecast combination problem requires an estimator of the precision matrix, and regularizing a covariance matrix does not guarantee a well-behaved estimator of the precision.

Our paper develops a new precision matrix estimator for the forecast errors under the approximate factor model with unobserved factors that addresses the aforementioned limitations. We call our algorithm the Factor Graphical Model. We use a factor model to estimate a sparse idiosyncratic component, and then apply a graphical model (the Graphical Lasso (Friedman et al. (2008)) or nodewise regression (Meinshausen and Bühlmann (2006))) for the estimation of the precision matrix of the idiosyncratic terms.

A few papers have used graphical models to estimate the covariance matrix of the idiosyncratic component when the factors are known and the loadings are assumed to be constant. Brownlees et al. (2018) estimate a sparse covariance matrix for high-frequency data and construct the realized network for financial data. Barigozzi et al. (2018) develop a power-law partial correlation network based on Gaussian graphical models. They show that when the dimension of the system is large, the largest eigenvalues of the precision converge to a positive affine transformation. Koike (2020) uses the Weighted Graphical Lasso to estimate a sparse covariance matrix of the idiosyncratic component for a factor model with observable factors for high-frequency data. That paper derives consistency and asymptotic mixed normality for the estimator based on the realized covariance matrix.

Our paper makes several important contributions. First, we develop a novel framework that models the tendency of the forecast errors to move together due to the common component. This framework is supported by the stylized fact that forecasters tend to jointly understate or overstate the predicted series of interest. Second, we develop a novel high-dimensional precision matrix estimator which combines the benefits of the factor structure and the sparsity of the precision matrix of the idiosyncratic component for the forecast combination under the approximate factor model. Third, this is the first paper that provides a simple framework to overcome the challenge of obtaining forecasts that contain unique information, which was shown to be necessary to achieve a "winning" forecast combination. The empirical application highlights the advantage of this methodology in comparison with the existing methods of forecast combination.

The paper is structured as follows: Section 2 reviews the Graphical Lasso and nodewise regression techniques. Section 3 studies the approximate factor models for the forecast combination. Section 4 introduces the Factor Graphical Model and discusses the tuning of the proposed model. Section 5 provides simulations. Section 6 studies an empirical application to macroeconomic time series, and Section 7 concludes.
Notation. For the convenience of the reader, we summarize the notation used throughout the paper. Given a vector u ∈ R^d and a parameter a ∈ [1, ∞), let ‖u‖_a denote the ℓ_a-norm. Given a matrix U ∈ R^{p×p} and parameters a, b ∈ [1, ∞), let |||U|||_{a,b} denote the induced matrix-operator norm max_{‖y‖_a=1} ‖Uy‖_b. The special cases are |||U|||_1 ≡ max_{1≤j≤p} Σ_{i=1}^p |U_{i,j}| for the ℓ_1/ℓ_1-operator norm, and the operator norm (ℓ_2-matrix norm) |||U|||_2^2 ≡ Λ_max(UU′), so that |||U|||_2 is equal to the maximal singular value of U. Finally, ‖U‖_∞ denotes the element-wise maximum max_{i,j} |U_{i,j}|.

2 Graphical Models

This section briefly reviews a class of models, called Gaussian graphical models, that search for an estimator of the precision matrix (see Bishop (2006); Hastie et al. (2001) for a more detailed description). In graphical models, each vertex represents a random variable, and the graph visualizes the joint distribution of the entire set of random variables.
Sparse graphs have a relatively small number of edges. Among the main challenges in working with graphical models are choosing the structure of the graph (model selection) and estimating the edge parameters from the data.

Suppose we have p competing forecasts of the univariate series y_t, t = 1, . . . , T. Let e_t = (e_{1t}, . . . , e_{pt})′ ∼ N(0, Σ) be a p × 1 vector of forecast errors. The precision matrix Σ^{-1} ≡ Θ contains information about the partial covariances between the variables. For instance, if Θ_{ij}, which is the ij-th element of the precision matrix, is zero, then the variables i and j are conditionally independent, given the other variables.

Given a sample {e_t}_{t=1}^T, let S = (1/T) Σ_{t=1}^T e_t e_t′ denote the sample covariance matrix and D̂ ≡ diag(S). We can write down the Gaussian log-likelihood (up to constants) as l(Θ) = log det(Θ) − trace(SΘ). The maximum likelihood estimate (MLE) of Θ is Θ̂ = S^{-1}. In high-dimensional settings it is necessary to regularize the precision matrix, which means that some edges will be zero. In the following subsections we discuss the two most widely used techniques for estimating sparse high-dimensional precision matrices.

2.1 Graphical Lasso

The first approach to induce sparsity in the estimation of the precision matrix is to add a penalty to the maximum likelihood and use the connection between the precision matrix and regression coefficients to minimize the following weighted penalized objective (Janková and van de Geer (2018)):

    Θ̂ = arg min_{Θ=Θ′} { trace(SΘ) − log det(Θ) + λ Σ_{i≠j} D̂_{ii}^{1/2} D̂_{jj}^{1/2} |Θ_{ij}| },   (2.1)

over positive definite symmetric matrices, where λ ≥ 0 is a penalty parameter. When λ = 0, the MLEs of Σ and Θ in (2.1) are the sample covariance matrix S and its inverse S^{-1}, respectively. When λ > 0, the solution yields penalized MLEs of the covariance and precision matrices, denoted as Σ̂ and Θ̂ = Σ̂^{-1}.

The penalized likelihood formulation was proposed by Yuan and Lin (2007), who solve the optimization problem in (2.1) using an interior-point method for the max log-determinant problem (see Vandenberghe et al. (1998) for more details on the method). This procedure guarantees the positive-definiteness of the penalized MLE of Θ. However, the method is computationally demanding and is limited to p ≤ 10. Banerjee et al. (2008) develop a different framework using block-coordinate descent. They show that the optimization problem in (2.1) is convex, consider estimation of Σ rather than Θ, and show that one can easily recover Θ using their procedure.

One of the most popular and fast algorithms to solve the optimization problem in (2.1) is the Graphical Lasso (GLASSO), introduced by Friedman et al. (2008). The Graphical Lasso procedure is summarized in Algorithm 1: it combines the neighborhood method of Meinshausen and Bühlmann (2006) and the block-coordinate descent of Banerjee et al. (2008).
Algorithm 1 Graphical Lasso (Friedman et al. (2008))

Initialize W = S + λI. The diagonal of W remains the same in what follows. Repeat for j = 1, . . . , p, 1, . . . , p, . . . until convergence:
• Partition W into part 1: all but the j-th row and column, and part 2: the j-th row and column.
• Solve the score equations using cyclical coordinate descent: W_{11}β − s_{12} + λ · Sign(β) = 0. This gives a (p − 1) × 1 solution β̂.
• Update ŵ_{12} = W_{11}β̂.
In the final cycle (for j = 1, . . . , p) solve for

    1/θ̂_{22} = w_{22} − β̂′ŵ_{12},   θ̂_{12} = −θ̂_{22}β̂.
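As an illustration, the following minimal sketch solves the unweighted version of (2.1) with scikit-learn's graphical_lasso solver, which implements the Friedman et al. (2008) algorithm. The simulated errors and the penalty level alpha are placeholders, and the diagonal weighting by D̂ in (2.1) is omitted for simplicity.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(0)
E = rng.standard_normal((200, 10))   # T x p matrix of forecast errors (placeholder data)
S = np.cov(E, rowvar=False)          # sample covariance matrix S

# Penalized MLEs of the covariance and precision matrices from (2.1)
Sigma_hat, Theta_hat = graphical_lasso(S, alpha=0.1)
```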
2.2 Nodewise Regression

An alternative approach to induce sparsity in the estimation of the precision matrix in equation (2.1) is to solve for Θ̂ one column at a time via linear regressions, replacing population moments by their sample counterparts S. Repeating this procedure for each variable j = 1, . . . , p, we estimate the elements of Θ̂ column by column using {e_t}_{t=1}^T via p linear regressions. Meinshausen and Bühlmann (2006) use this approach (which we refer to as MB) to incorporate sparsity into the estimation of the precision matrix. They fit p separate Lasso regressions using each variable (node) as the response and the others as predictors to estimate Θ̂. This method is known as the "nodewise" regression and is reviewed below based on van de Geer et al. (2014) and Callot et al. (2019).

Let e_j be the T × 1 vector of observations on the j-th forecast error; the remaining errors are collected in a T × (p − 1) matrix E_{−j}. For each j = 1, . . . , p we run the following Lasso regression:

    γ̂_j = arg min_{γ ∈ R^{p−1}} ( ‖e_j − E_{−j}γ‖_2^2/T + 2λ_j ‖γ‖_1 ),   (2.2)

where γ̂_j = {γ̂_{j,k}; j = 1, . . . , p, k ≠ j} is a (p − 1) × 1 vector of estimated regression coefficients. Define

    Ĉ = [  1         −γ̂_{1,2}   · · ·   −γ̂_{1,p}
          −γ̂_{2,1}    1         · · ·   −γ̂_{2,p}
            ⋮           ⋮          ⋱        ⋮
          −γ̂_{p,1}   −γ̂_{p,2}   · · ·    1      ].   (2.3)

For j = 1, . . . , p, define

    τ̂_j^2 = ‖e_j − E_{−j}γ̂_j‖_2^2/T + λ_j ‖γ̂_j‖_1   (2.4)

and write

    T̂^2 = diag(τ̂_1^2, . . . , τ̂_p^2).   (2.5)

The approximate inverse is defined as

    Θ̂ = T̂^{-2} Ĉ.   (2.6)

One caveat to keep in mind when using the MB method is that the estimator in (2.6) is not self-adjoint. Callot et al. (2019) show (see their Lemma A.1) that Θ̂ in (2.6) is positive definite with high probability; however, Θ̂ may still fail to be positive definite in finite samples. A possible solution is to use the matrix symmetrization procedure of Fan et al. (2018) followed by eigenvalue cleaning as in Callot et al. (2017) and Hautsch et al. (2012). The procedure to estimate the precision matrix using nodewise regression is summarized in Algorithm 2.

Algorithm 2 Nodewise regression by Meinshausen and Bühlmann (2006) (MB)

Repeat for j = 1, . . . , p:
• Estimate γ̂_j using (2.2) for a given λ_j.
• Select λ_j using a suitable information criterion (see Section 4.1 for the possible options).
Calculate Ĉ and T̂^2. Return Θ̂ = T̂^{-2}Ĉ.
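A sketch of Algorithm 2 in Python follows. It relies on scikit-learn's Lasso, whose objective ‖y − Xγ‖_2^2/(2T) + α‖γ‖_1 matches (2.2) with α = λ_j. For brevity a common penalty level is used for all j, whereas the text selects each λ_j separately by the GIC of Section 4.1.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_precision(E, lam):
    """Nodewise (MB) estimator: build C-hat and T-hat^2 from (2.3)-(2.5)
    and return Theta-hat = T-hat^{-2} C-hat as in (2.6)."""
    T, p = E.shape
    C = np.eye(p)
    tau2 = np.zeros(p)
    for j in range(p):
        y, X = E[:, j], np.delete(E, j, axis=1)
        gamma = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
        C[j, np.arange(p) != j] = -gamma
        resid = y - X @ gamma
        tau2[j] = resid @ resid / T + lam * np.abs(gamma).sum()  # (2.4)
    return np.diag(1.0 / tau2) @ C                               # (2.6)
```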
3 Approximate Factor Models for Forecast Errors

Approximate factor models for forecasts were first considered by Chan et al. (1999), who modeled a panel of ex-ante forecasts of a single time series as a dynamic factor model and found that the combined forecasts improved on individual ones even when all forecasts have the same information set (up to a difference in lags). This result emphasizes the benefit of forecast combination even when the individual forecasts are not based on different information and, therefore, do not broaden the information set used by any one forecaster.

In this paper, we are interested in finding the combination of forecasts which yields the best out-of-sample performance in terms of the mean-squared forecast error, to be introduced later. We claim that forecasters use the same set of public information to make forecasts and, hence, tend to make common mistakes. Figure 1 illustrates this statement: it shows quarterly forecasts of Euro-area real GDP growth produced by the European Central Bank's Survey of Professional Forecasters from 1999Q3 to 2019Q3. As described in Diebold and Shin (2019), forecasts are solicited for one year ahead of the latest available outcome: e.g., the 2007Q1 survey asked the respondents to forecast GDP growth over 2006Q3-2007Q3. As evidenced by Figure 1, forecasters tend to jointly understate or overstate GDP growth, meaning that their forecast errors include common and idiosyncratic parts. Therefore, we can model the tendency of the forecast errors to move together via a factor decomposition.

Recall that we have p competing forecasts of the univariate series y_t, t = 1, . . . , T, and that e_t = (e_{1t}, . . . , e_{pt})′ ∼ N(0, Σ) is a p × 1 vector of forecast errors that admit a q-factor model:

    e_t = B f_t + ε_t,   t = 1, . . . , T,   (3.1)

where f_t = (f_{1t}, . . . , f_{qt})′ is a q × 1 vector of factors of the forecast errors for the p models, B is a p × q matrix of factor loadings, and ε_t is the idiosyncratic component that cannot be explained by the common factors. Unobservable factors and loadings are usually estimated by principal component analysis (PCA), studied in Bai (2003); Bai and Ng (2002); Connor and Korajczyk (1988); Stock and Watson (2002). A strict factor structure assumes that the idiosyncratic disturbances, ε_t, are uncorrelated with each other, whereas an approximate factor structure allows correlation of the idiosyncratic disturbances (Chamberlain and Rothschild (1983)).

We use the following notation: E[ε_t ε_t′] = Σ_ε, E[e_t e_t′] = Σ = BΣ_f B′ + Σ_ε, and E[ε_t | f_t] = 0. The objective function to recover factors and loadings from (3.1) is:

    min_{f_1,...,f_T, B} (1/T) Σ_{t=1}^T (e_t − Bf_t)′(e_t − Bf_t)   (3.2)
    s.t. B′B = I_q,   (3.3)

where (3.3) is the normalization necessary for the unique identification of the factors. Fixing the value of B, we can project the forecast errors e_t onto the space spanned by B: f_t = (B′B)^{-1}B′e_t = B′e_t. When combined with (3.2), this yields a concentrated objective function for B:

    max_B tr[ B′ ( (1/T) Σ_{t=1}^T e_t e_t′ ) B ].   (3.4)

It is well known (see Stock and Watson (2002) among others) that B̂ formed from the first q eigenvectors of (1/T) Σ_{t=1}^T e_t e_t′ is the solution to (3.4).
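The PCA step can be coded in a few lines. The sketch below recovers B̂ as the top-q eigenvectors of (1/T) Σ_t e_t e_t′ under the normalization (3.3), and the factors as f̂_t = B̂′e_t; it is reused by the later sketches.

```python
import numpy as np

def pca_factors(E, q):
    """Solve (3.2)-(3.4): loadings B-hat = top-q eigenvectors of
    (1/T) sum_t e_t e_t'; factors f_t = B'e_t; residuals eps_t = e_t - B f_t."""
    T, p = E.shape
    M = E.T @ E / T
    _, vecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    B_hat = vecs[:, ::-1][:, :q]       # first q eigenvectors (descending order)
    F_hat = E @ B_hat                  # T x q matrix of factor estimates
    eps_hat = E - F_hat @ B_hat.T      # idiosyncratic residuals
    return B_hat, F_hat, eps_hat
```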
Once we obtain f̂_t and B̂, we can get an estimate of the covariance matrix of the forecast errors, Σ̂ = B̂Σ̂_f B̂′ + Σ̂_ε.

Having obtained all the necessary estimates, we move to the forecast combination exercise. Suppose we have p competing forecasts, ŷ_t = (ŷ_{1,t}, . . . , ŷ_{p,t})′, of the variable y_t, t = 1, . . . , T. Let Θ = Σ^{-1} be the precision matrix of the forecast errors. The combined forecast is defined as

    ŷ_t^c = w′ŷ_t,   (3.5)

where w is a p × 1 vector of combination weights, and the risk of the combination is R(w, Σ) = w′Σw. As shown in Bates and Granger (1969), the optimal forecast combination minimizes the variance of the combined forecast error:

    min_w R(w, Σ)   s.t. w′ι_p = 1,   (3.6)

where ι_p is a p × 1 vector of ones. The solution is

    w = Θι_p / (ι_p′Θι_p).   (3.7)

If the true precision matrix is known, equation (3.7) is guaranteed to yield the optimal forecast combination. In practice, one has to estimate Θ; hence, the out-of-sample performance of the combined forecast is affected by the estimation error. As pointed out by Smith and Wallis (2009) and Claeskens et al. (2016), when the estimation uncertainty of the weights is taken into account, there is no guarantee that the "optimal" forecast combination will be better than equal weights or even improve on the individual forecasts. Concretely, for any estimator of the covariance matrix and combination weights, we have:

    | R(ŵ, Σ̂) − R(w, Σ̂) | ≤ ‖ŵ − w‖_1 ‖Σ̂w‖_∞.   (3.8)

Equation (3.8) implies that, given an estimator of the covariance matrix, the risk of the combined forecast is bounded by the estimation error in the optimal forecast combination weights. In order to control the latter, define a = ι_p′Θι_p/p and â = ι_p′Θ̂ι_p/p. We can easily obtain the following bound on the optimal combination weights:

    ‖ŵ − w‖_1 ≤ ‖(Θ̂ − Θ)ι_p‖_1 / (a p) + |a − â| ‖Θι_p‖_1 / (|â| a p),   (3.9)

where the inequality was shown in Callot et al. (2019). Therefore, in order to control the estimation uncertainty in the combination weights, one needs to obtain a consistent estimator of the precision matrix Θ.
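Before turning to the proposed estimator, note that the mapping (3.7) from a precision matrix estimate to the combination weights is a single line of linear algebra; the sketch below is a direct transcription.

```python
import numpy as np

def combination_weights(Theta):
    """Optimal combination weights from (3.7): w = Theta*iota / (iota'Theta iota)."""
    iota = np.ones(Theta.shape[0])
    return Theta @ iota / (iota @ Theta @ iota)

# Combined forecast (3.5): y_hat_c = combination_weights(Theta_hat) @ y_hat
```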
4 Factor Graphical Model

We can use the optimization problem in (2.1) or (2.2) to directly estimate the precision matrix Θ and apply it to find the forecast combination weights in (3.7). However, as pointed out by Dai et al. (2019), for high-dimensional problems the idiosyncratic part remains large after we estimate the low-rank component of the covariance matrix. Therefore, in addition to a factor model, we need some sparsity condition for estimating the residual covariance matrix, Σ_ε. Rothman et al. (2008) emphasized that shrinkage estimators of the form proposed by Ledoit and Wolf (2003, 2004b) do not affect the eigenvectors of the covariance, only the eigenvalues. However, Johnstone and Lu (2009) showed that the sample eigenvectors are also not consistent in high dimensions. Fan et al. (2011) first construct a sample covariance matrix of the residuals based on the estimated factor model, and then apply adaptive thresholding to estimate the idiosyncratic component of the covariance matrix.

Since our interest is in constructing weights for the forecast combination, our goal is to estimate the precision matrix of the forecast errors. However, as pointed out by Koike (2020), when common factors are present across the forecast errors, the precision matrix cannot be sparse, because all pairs of forecast errors are partially correlated, given the other forecast errors, through the common factors. Therefore, we impose a sparsity assumption on the precision matrix of the idiosyncratic errors, Θ_ε, which is obtained using the estimated residuals after removing the co-movements induced by the factors (see Barigozzi et al. (2018); Brownlees et al. (2018); Koike (2020)).

We use the weighted Graphical Lasso and nodewise regression as shrinkage techniques to estimate the precision matrix of the residuals. Let Θ_ε = Σ_ε^{-1} and Θ_f = Σ_f^{-1} be the precision matrices of the idiosyncratic and common components, respectively. Once the precision of the low-rank component is obtained, similarly to Fan et al. (2011), we use the Sherman-Morrison-Woodbury formula to estimate the precision of the forecast errors:

    Θ = Θ_ε − Θ_ε B [Θ_f + B′Θ_ε B]^{-1} B′Θ_ε.   (4.1)

To obtain Θ̂_f = Σ̂_f^{-1}, we use Σ̂_f = (1/T) Σ_{t=1}^T (f̂_t − f̄)(f̂_t − f̄)′, where f̂_t = B̂′e_t. To get Θ̂_ε, we develop two approaches: the first uses the weighted GLASSO of Algorithm 1, with the initial estimate of the covariance matrix of the idiosyncratic errors calculated as Σ̂_ε = (1/T) Σ_{t=1}^T (ε̂_t − ε̄)(ε̂_t − ε̄)′, where ε̂_t = e_t − B̂f̂_t. The second uses nodewise regression and applies Algorithm 2 to ε̂. Once we estimate Θ̂_f and Θ̂_ε, we can get Θ̂ using a sample analogue of (4.1). We call the proposed procedures the Factor Graphical Lasso and the Factor nodewise regression and summarize them in Algorithm 3 and Algorithm 4, respectively.
Algorithm 3 Factor Graphical Lasso (Factor GLASSO)

Estimate the residuals ε̂_t = e_t − B̂f̂_t using PCA. Get Σ̂_ε = (1/T) Σ_{t=1}^T (ε̂_t − ε̄)(ε̂_t − ε̄)′.
Estimate a sparse Θ_ε using the weighted Graphical Lasso: initialize Algorithm 1 with W = Σ̂_ε + λI.
Estimate Θ using the Sherman-Morrison-Woodbury formula in (4.1).
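Putting the pieces together, a minimal sketch of Algorithm 3 follows, reusing pca_factors from the earlier sketch. It uses scikit-learn's unweighted graphical lasso, a simplification of the weighted initialization W = Σ̂_ε + λI described in the text.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def factor_glasso(E, q, lam):
    """Factor GLASSO (Algorithm 3): PCA residuals -> sparse Theta_eps ->
    precision of the forecast errors via the Woodbury identity (4.1)."""
    B, F, eps = pca_factors(E, q)
    Theta_f = np.linalg.inv(np.atleast_2d(np.cov(F, rowvar=False)))
    Sigma_eps = np.cov(eps, rowvar=False)
    _, Theta_eps = graphical_lasso(Sigma_eps, alpha=lam)
    middle = np.linalg.inv(Theta_f + B.T @ Theta_eps @ B)   # bracket in (4.1)
    return Theta_eps - Theta_eps @ B @ middle @ B.T @ Theta_eps
```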
Algorithm 4 Factor nodewise regression by Meinshausen and Bühlmann (2006) (Factor MB)

Estimate the residuals ε̂_t = e_t − B̂f̂_t using PCA. Get Σ̂_ε = (1/T) Σ_{t=1}^T (ε̂_t − ε̄)(ε̂_t − ε̄)′.
Estimate a sparse Θ_ε using nodewise regression: apply Algorithm 2 to ε̂.
Estimate Θ using the Sherman-Morrison-Woodbury formula in (4.1).

Now we can use Θ̂ to estimate the forecast combination weights:

    ŵ = Θ̂ι_p / (ι_p′Θ̂ι_p),   (4.2)

where Θ̂ is obtained from Algorithm 3 or Algorithm 4.

4.1 Choice of the Tuning Parameters

Algorithms 3 and 4 require the tuning parameters λ (from Algorithm 1) and λ_j (from Algorithm 2), respectively. We now briefly comment on the choices of both tuning parameters.

To motivate the choice of the tuning parameter for GLASSO and Factor GLASSO, we first briefly discuss some of the existing options for choosing λ in (2.1) in the simulations and the empirical application. Usually λ is selected from a grid of values F_λ = (λ_min, . . . , λ_max) to minimize a score measuring the goodness of fit. Some popular examples include multifold cross-validation (CV), the Stability Approach to Regularization Selection (STARS, Liu et al. (2010)), and the Extended Bayesian Information Criterion (EBIC, Foygel and Drton (2010)). Since we are interested in estimating a sparse high-dimensional precision matrix, we need a method for selecting the tuning parameter which is consistent in high dimensions. Meinshausen and Bühlmann (2010) suggest that CV performs poorly for high-dimensional data: it overfits (Liu et al. (2010)) and does not consistently select models (Shao (1993)). Zhu and Cribben (2018) pointed out that STARS is not computationally efficient; it is consistent under certain conditions, but suffers from the problem of overselection in estimating Gaussian graphical models. In contrast, EBIC is computationally efficient and is considered the state-of-the-art technique for choosing the tuning parameter for undirected graphs. The EBIC score is:

    λ_EBIC = arg min_{λ ∈ F_λ} { −l(Θ_λ) + log(T)·df(Θ_λ) + 4·df(Θ_λ)·log(p)·η },   (4.3)

where η ∈ [0, 1], Θ_λ is the precision matrix estimated for the tuning parameter λ ∈ F_λ, and the log-likelihood is l(Θ_λ) = log det(Θ_λ) − trace(SΘ_λ). For the estimation of graphical models, the degrees of freedom are usually defined as the number of unique non-zero elements in the estimated precision matrix, df(Θ_λ) = Σ_{i≤j} 1{Θ_λ,ij ≠ 0}. When η = 0, (4.3) reduces to the original BIC (Schwarz (1978)). Chen and Chen (2008) showed that EBIC with η = 1 is consistent as long as the dimension p grows at a polynomial rate in T. Hence, in our simulations and the empirical exercise we use EBIC with η = 1 for GLASSO and Factor GLASSO in Algorithms 1 and 3.

For Algorithms 2 and 4, we follow Callot et al. (2019) and choose λ_j in (2.2) by minimizing the generalized information criterion (GIC) developed by Fan and Tang (2013). Let |Ŝ_j(λ_j)| denote the estimated number of nonzero parameters in the vector γ̂_j:

    GIC(λ_j) = log( ‖e_j − E_{−j}γ̂_j‖_2^2/T ) + |Ŝ_j(λ_j)| (log(p)/T) log(log(T)).   (4.4)
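A sketch of the EBIC selection rule (4.3) for GLASSO is given below. The grid of λ values is a placeholder, and the unweighted graphical lasso again stands in for the weighted version of Algorithm 1.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def ebic_select(S, T, lam_grid, eta=1.0):
    """Pick lambda minimizing the EBIC score in (4.3), with
    l(Theta) = log det(Theta) - trace(S Theta) and df = number of
    unique nonzero elements of the estimated precision matrix."""
    p = S.shape[0]
    best_score, best_lam = np.inf, None
    for lam in lam_grid:
        _, Theta = graphical_lasso(S, alpha=lam)
        ll = np.linalg.slogdet(Theta)[1] - np.trace(S @ Theta)
        df = np.count_nonzero(np.triu(Theta))
        score = -ll + np.log(T) * df + 4 * df * np.log(p) * eta
        if score < best_score:
            best_score, best_lam = score, lam
    return best_lam
```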
5 Simulations

We divide the simulation results into two subsections. In the first subsection we study the consistency of Factor GLASSO and Factor MB for estimating the precision matrix and the combination weights. In the second subsection we evaluate the out-of-sample forecasting performance of the Factor Graphical models from Algorithms 3 and 4 in terms of the mean-squared forecast error. We compare the performance of the factor-based models with the equal-weighted (EW) forecast combination and with GLASSO and nodewise regression from Algorithms 1 and 2. All exercises use 100 Monte Carlo simulations.
5.1 Consistency of the Estimators

5.1.1 Simulation Design

We consider sparse Gaussian graphical models which may be fully specified by a precision matrix Θ. Hence, the random sample is distributed as e_t = (e_{1t}, . . . , e_{pt})′ ∼ N(0, Σ), where Θ = Σ^{-1}, for t = 1, . . . , T. Let Θ̂ be the precision matrix estimator. We show consistency of the Factor GLASSO (Algorithm 3) and Factor MB (Algorithm 4) in (i) the operator norm, |||Θ̂ − Θ|||_2, (ii) the ℓ_1/ℓ_1-matrix norm, which is the maximum absolute column sum of the matrix, |||Θ̂ − Θ|||_1, and (iii) the ℓ_1-vector norm for the combination weights, ‖ŵ − w‖_1, where w is given by (3.7). The forecast errors are assumed to have the following structure:

    f_t = φ_f f_{t−1} + ζ_t,   (5.1)
    e_t = B f_t + ε_t,   t = 1, . . . , T,   (5.2)

where e_t is a p × 1 vector of forecast errors distributed as N(0, Σ), f_t is a q × 1 vector of factors, B is a p × q matrix of factor loadings, φ_f is an autoregressive parameter in the factors (a scalar for simplicity), ζ_t is a q × 1 random error distributed as N(0, σ_ζ^2 I), and ε_t is a p × 1 idiosyncratic error distributed as N(0, Σ_ε), with a sparse Θ_ε that has the random graph structure described below. To create B in (5.2) we take the first q columns of the upper triangular matrix from a Cholesky decomposition of the p × p Toeplitz matrix parameterized by ρ: that is, Q = (Q)_{ij}, where (Q)_{ij} = ρ^{|i−j|}, i, j ∈ {1, . . . , p}. We hold ρ and φ_f fixed and set σ_ζ = 1. The specification in (5.1)-(5.2) leads to the low-rank plus sparse decomposition of the covariance matrix:

    E[e_t e_t′] = Σ = BΣ_f B′ + Σ_ε.   (5.3)

When Σ_ε has a sparse inverse Θ_ε, this leads to the low-rank plus sparse decomposition of the precision matrix Θ, such that Θ can be expressed as a function of the low-rank Θ_f plus the sparse Θ_ε.

We consider the following setup: p = T^δ and q = 2(log T)^ν for fixed exponents δ and ν, and T = [2^κ] for a grid of half-integer values of κ starting at κ = 7. Our setup allows the number of individual forecasts, p, and the number of common factors in the forecast errors, q, to increase with the sample size, T.

A sparse precision matrix of the idiosyncratic components, Θ_ε, is constructed as follows: we first generate the adjacency matrix using a random graph structure. Define a p × p adjacency matrix A_ε which represents the structure of the graph:

    A_ε,ij = 1 for i ≠ j with probability π, and 0 otherwise.   (5.4)

We set A_ε,ij = A_ε,ji, and the adjacency matrix has all diagonal elements equal to zero. Such a structure results in s_T = p(p − 1)π/2 edges; we set π = 1/(pT^c) for a fixed exponent c, which makes s_T = O(T^{δ−c}). To obtain a positive definite precision matrix we apply the procedure described in Zhao et al. (2012): using their notation, Θ_ε = A_ε · v + (|τ| + 0.1 + u)I, where u > 0 is added to the diagonal, v > 0 controls the magnitude of the partial correlations, and τ is the smallest eigenvalue of A_ε · v. In our simulations we fix u and v at small positive values following Zhao et al. (2012).
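The design above can be replicated with a short generator. The sketch below follows the Zhao et al. (2012) construction, with the magnitudes u and v passed in as arguments since their exact values are design choices.

```python
import numpy as np

def random_graph_precision(p, pi, u, v, seed=0):
    """Generate the sparse Theta_eps of (5.4): a symmetric random-graph
    adjacency matrix A (zero diagonal), scaled by v, with a diagonal
    shift that guarantees positive definiteness (Zhao et al. (2012))."""
    rng = np.random.default_rng(seed)
    upper = np.triu((rng.random((p, p)) < pi).astype(float), k=1)
    A = upper + upper.T                    # A_ij = A_ji = 1 w.p. pi, i != j
    tau = np.linalg.eigvalsh(A * v).min()  # smallest eigenvalue of A*v
    return A * v + np.eye(p) * (abs(tau) + 0.1 + u)
```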
5.1.2 Simulation Results

Figures 2 and 3 show the averaged (over Monte Carlo simulations) errors of the estimators of the precision matrix Θ and the optimal combination weights versus the sample size T on a logarithmic scale (base 2). The estimate of the precision matrix of the EW forecast is obtained using the fact that equal weights imply diagonal covariance and precision matrices. To determine the values of the diagonal elements we use the shrinkage intensity coefficient calculated as the average of the eigenvalues of the sample covariance matrix of the forecast errors (see Ledoit and Wolf (2004b)).

As evidenced by Figures 2 and 3, Factor GLASSO and Factor MB demonstrate superior performance over EW and the non-factor-based models (GLASSO and MB). Furthermore, our method achieves a lower estimation error in the combination weights (3.9), which leads to a lower risk of the combined forecast, as shown in (3.8). Interestingly, even though the precision matrix estimated using Factor MB has a faster convergence rate in the |||·|||_1 and |||·|||_2 norms than Factor GLASSO, the weights estimated using Factor GLASSO converge faster. Also note that the precision matrix estimated using the EW method shows good convergence properties; however, in terms of estimating the combination weights (and, as a corollary of (3.9), controlling risk), the EW method does not exhibit convergence. This is in agreement with previously reported findings (see Claeskens et al. (2016); Smith and Wallis (2009) among others) that equal weights are not theoretically optimal; nevertheless, as demonstrated in Figure 4, the EW combination still delivers a relatively good performance in terms of MSFE.
5.2 Performance of the Combined Forecasts

5.2.1 Simulation Design

We consider the standard forecasting model in the literature (e.g., Stock and Watson (2002)), which uses the factor structure of high-dimensional predictors. Suppose the data are generated from the following data generating process (DGP):

    x_t = Λg_t + v_t,   (5.5)
    g_t = φ g_{t−1} + ξ_t,   (5.6)
    y_{t+1} = g_t′α + Σ_{s=1}^∞ θ_s ε_{t+1−s} + ε_{t+1},   (5.7)

where y_{t+1} is the univariate series of interest in forecasting, x_t is an N × 1 vector of predictors, g_t is an r × 1 vector of common factors, Λ is an N × r matrix of factor loadings, v_t is an N × 1 idiosyncratic error distributed as N(0, σ_v^2 I), φ is an autoregressive parameter in the factors (a scalar for simplicity), ξ_t is an r × 1 random error distributed as N(0, σ_ξ^2 I), ε_{t+1} is a random error following N(0, σ_ε^2), and α is an r × 1 vector of coefficients drawn from a normal distribution; we set σ_ε = 1. The coefficients θ_s are set according to the rule

    θ_s = (1 + s)^{c_1} c_2^s,   (5.8)

as in Hansen (2008). We consider two values of c_1 and four values of c_2. We generate the r factors using (5.6) with a grid of 10 different AR(1) coefficients φ equidistant between 0 and 0.9.
To create Λ in (5.5) we take the first r rows of the upper triangular matrix from a Cholesky decomposition of the N × N Toeplitz matrix parameterized by ρ, and we consider a grid of 10 different values of ρ equidistant between 0 and 0.9. Each individual forecast is produced by a factor-augmented autoregression of order (k, l), denoted as FAR(k, l):

    ŷ_{t+1} = μ̂ + κ̂_1 ĝ_{1,t} + · · · + κ̂_k ĝ_{k,t} + ψ̂_1 y_t + · · · + ψ̂_l y_{t+1−l},   (5.9)

where the factors (ĝ_{1,t}, . . . , ĝ_{k,t}) are estimated from equation (5.5). We consider FAR models of various orders, with k = 1, . . . , K and l = 1, . . . , L; we also consider the models without any lagged y or any factors. Therefore, the total number of forecasting models is p ≡ (1 + K) × (1 + L), which includes the forecasting models using the naive average or no factors.
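A sketch of one FAR(k, l) forecast appears below; it reuses pca_factors to extract ĝ_t from the predictors, assumes l ≥ 1, and estimates (5.9) by OLS.

```python
import numpy as np

def far_forecast(y, X, k, l):
    """FAR(k, l) one-step forecast from (5.9): OLS of y_{t+1} on a constant,
    k PCA factors of the predictors X (T x N), and l lags of y."""
    T = len(y)
    _, G, _ = pca_factors(X, k)                       # T x k factor estimates
    Z = np.array([np.r_[1.0, G[t], y[t - l + 1:t + 1][::-1]]
                  for t in range(l - 1, T - 1)])      # regressors at time t
    beta, *_ = np.linalg.lstsq(Z, y[l:], rcond=None)  # targets are y_{t+1}
    z_last = np.r_[1.0, G[-1], y[T - l:][::-1]]       # regressors at t = T-1
    return z_last @ beta                              # forecast of y_{T+1}
```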
The total number of observations is T, and the number of observations in the regression period (the training sample) is set to the first half of the sample, t = 1, . . . , m ≡ T/2, leaving the second half, t = m + 1, . . . , T, for the out-of-sample evaluation (the test sample). We roll the estimation window over the test sample of size n ≡ T − m to update all the estimates at each point in time. Recall that q denotes the number of factors in the forecast errors, as in equation (3.1). We first examine the properties of the combined forecasts based on the Factor Graphical models as T and p vary, and compare their performance with the combined forecasts based on GLASSO, MB, and the EW forecasts.

5.2.2 Simulation Results

We analyzed various scenarios for the MSFE simulation exercise. For the first set of simulations we consider a low-dimensional setup to demonstrate the advantage of using the FGM even when the number of forecasts, p, is small relative to the sample size, T: (1) in such a scenario EW has an advantage, since there are not many models to combine and assigning equal weights should produce satisfactory performance, and (2) the non-factor-based models have an advantage over the models that estimate factors, due to the estimation errors. As a result, this low-dimensional setup is favorable to EW and the non-factor-based models. Figure 4 shows the MSFE for different sample sizes and fixed parameters: we report the results for two values of c_1. As evidenced by Figure 4, the models that use the factor structure outperform the EW combination and the non-factor-based counterparts for both values of c_1.

Figures 5-9 show the performance in terms of MSFE for different numbers of predictors N and different values of c_2, φ, ρ, and q: the factor-based models (Factor GLASSO and Factor MB) outperform the equal-weighted forecast combination and the standard GLASSO and nodewise regression without any factor structure. As evidenced by the figures, these findings are robust to changes in the model parameters. Importantly, Figure 9 shows the scenario where the true number of principal components, r, is equal to 5, whereas none of the forecasters use PCA for prediction: in this case, including at least 2 common components of the forecast errors reduces the MSFE, such that Factor GLASSO and Factor MB outperform the EW forecast combination. Based on Figures 4-9, we see that Factor GLASSO, in general, has a lower MSFE than Factor MB. This finding is further supported by our empirical application in Section 6.

6 Empirical Application

An empirical application to forecasting macroeconomic time series in a big data environment highlights the advantage of both Factor Graphical models described in Algorithms 3 and 4 in comparison with the existing methods of forecast combination. We use the large monthly frequency macroeconomic database of McCracken and Ng (2016), who provide a comprehensive description of the dataset of 128 macroeconomic series. We consider the time period 1960:1-2020:07 with the total number of observations T = 726; the training sample consists of m = 120 observations, and the test sample is of size n ≡ T − m − h + 1, where h is the forecast horizon. We roll the estimation window over the test sample to update all the estimates at each point in time t = m, . . . , T − h. We estimate h-step-ahead forecasts from FAR(k, l) with k = 0, 1, . . . , K = 9 and l = 0, 1, . . . , L = 11. The total number of forecasting models is p = 120. The optimal number of factors in the forecast errors (denoted by q in equation (3.1)) is chosen using the standard data-driven method based on the information criterion IC1 described in Bai and Ng (2002).
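For reference, a sketch of the IC1 criterion of Bai and Ng (2002) used to pick q is given below; it reuses pca_factors and evaluates candidate values up to a user-supplied q_max.

```python
import numpy as np

def bai_ng_ic1(E, q_max):
    """IC1(q) = log(V(q)) + q * ((p+T)/(pT)) * log(pT/(p+T)),
    where V(q) is the average squared idiosyncratic residual
    after removing q principal-component factors."""
    T, p = E.shape
    penalty = (p + T) / (p * T) * np.log(p * T / (p + T))
    scores = []
    for q in range(1, q_max + 1):
        _, _, eps = pca_factors(E, q)
        scores.append(np.log((eps ** 2).mean()) + q * penalty)
    return int(np.argmin(scores)) + 1
```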
We note that in the majority of cases the optimal number of factors was estimated to be equal to 1.

Table 1 compares the performance of Factor GLASSO and Factor MB with the competitors for predicting four representative macroeconomic indicators of the US economy: monthly industrial production (INDPROD), the S&P500 composite index, the civilian unemployment rate (UNRATE), and the effective federal funds rate (FEDFUNDS), using the 127 remaining macroeconomic series. Let {Y_t}_{t=1}^T be the series of interest for forecasting. Similarly to Coulombe et al. (2020), for INDPROD and S&P500 we forecast the average growth rate (with logs):

    y_{t+h}^{(h)} = (1/h) ln(Y_{t+h}/Y_t).

For UNRATE we forecast the average change (without logs):

    y_{t+h}^{(h)} = (1/h) (Y_{t+h} − Y_t).

And for FEDFUNDS we forecast the log of the series:

    y_{t+h}^{(h)} = ln(Y_{t+h}).

As evidenced by Table 1, our methods outperform EW, GLASSO, and nodewise regression: accounting for the factor structure results in a lower MSFE. Therefore, the FGM framework developed in this paper leads to superior performance of the combined forecast as compared to EW even when the models/experts do not contain a lot of unique information. Our empirical application demonstrates that this finding does not originate from a difference in the performance of EW versus the graphical models: as evidenced by Table 1, the performance of GLASSO is worse than that of EW for the FEDFUNDS series, whereas Factor GLASSO outperforms EW. Therefore, the improvement in the combined forecast comes from the use of the factor structure in the forecast errors. Note that, in contrast with EW and the non-factor-based methods, the performance of Factor GLASSO and Factor MB does not deteriorate significantly when the forecast horizon, h, increases.
7 Conclusions

This paper proposed a novel precision matrix estimator for the forecast combination when the experts are assumed to make common mistakes. We account for the factor structure in the forecast errors by decomposing the precision matrix into low-rank and sparse components, where the latter is estimated using the Graphical Lasso or nodewise regression. The proposed algorithms are called the Factor Graphical Models (Factor GLASSO and Factor MB). The framework developed in this paper overcomes the challenge of obtaining forecasts that contain unique information, which was shown to be necessary to achieve a "winning" forecast combination. Our simulations demonstrate the consistency of the developed procedure for estimating the precision matrix and the optimal forecast combination weights. An empirical application to forecasting macroeconomic time series in a big data environment highlights the advantage of the FGM approach in comparison with the existing methods of forecast combination. It would be interesting to apply our model to the analysis of some prominent forecasting competition strategies, such as the M4 competition; we leave this exercise for future research.

References
Atiya, A. F. (2020). Why does forecast combination work so well? International Journal of Forecasting, 36(1):197-200.

Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1):135-171.

Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1):191-221.

Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516.

Barigozzi, M., Brownlees, C., and Lugosi, G. (2018). Power-law partial correlation network models. Electronic Journal of Statistics, 12(2):2905-2929.

Bates, J. M. and Granger, C. W. J. (1969). The combination of forecasts. Operations Research, 20(4):451-468.

Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577-2604.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag, Berlin, Heidelberg.

Brownlees, C., Nualart, E., and Sun, Y. (2018). Realized networks. Journal of Applied Econometrics, 33(7):986-1006.

Cai, T. and Liu, W. (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672-684.

Callot, L., Caner, M., Önder, A. O., and Ulaşan, E. (2019). A nodewise regression approach to estimating large portfolios. Journal of Business & Economic Statistics, 0(0):1-12.

Callot, L. A. F., Kock, A. B., and Medeiros, M. C. (2017). Modeling and forecasting large realized covariance matrices and portfolio choice. Journal of Applied Econometrics, 32(1):140-158.

Chamberlain, G. and Rothschild, M. (1983). Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51(5):1281-1304.

Chan, Y. L., Stock, J. H., and Watson, M. W. (1999). A dynamic factor model framework for forecast combination. Spanish Economic Review, 1(2):91-121.

Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759-771.

Claeskens, G., Magnus, J. R., Vasnev, A. L., and Wang, W. (2016). The forecast combination puzzle: A simple theoretical explanation. International Journal of Forecasting, 32(3):754-762.

Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5(4):559-583.

Connor, G. and Korajczyk, R. A. (1988). Risk and return in an equilibrium APT: application of a new test methodology. Journal of Financial Economics, 21(2):255-289.

Coulombe, P. G., Leroux, M., Stevanovic, D., and Surprenant, S. (2020). How is machine learning useful for macroeconomic forecasting? arXiv:2008.12477.

Dai, C., Lu, K., and Xiu, D. (2019). Knowing factors or factor loadings, or neither? Evaluating estimators of large covariance matrices with noisy and asynchronous data. Journal of Econometrics, 208(1):43-79.

Diebold, F. X. and Shin, M. (2019). Machine learning for regularized survey forecast combination: Partially-egalitarian lasso and its derivatives. International Journal of Forecasting, 35(4):1679-1691.

Fan, J., Fan, Y., and Lv, J. (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147:186-197.

Fan, J., Liao, Y., and Mincheva, M. (2011). High-dimensional covariance matrix estimation in approximate factor models. The Annals of Statistics, 39(6):3320-3356.

Fan, J., Liu, H., and Wang, W. (2018). Large covariance estimation through elliptical factor models. The Annals of Statistics, 46(4):1383-1414.

Fan, J., Wang, W., and Zhong, Y. (2019). Robust covariance estimation for approximate factor models. Journal of Econometrics, 208(1):5-22.

Fan, J., Xue, L., and Yao, J. (2017). Sufficient forecasting using factor models. Journal of Econometrics, 201(2):292-306.

Fan, Y. and Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B, 75(3):531-552.

Foygel, R. and Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, NIPS, pages 604-612, USA. Curran Associates Inc.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the Graphical Lasso. Biostatistics, 9(3):432-441.

Hansen, B. E. (2008). Least-squares forecast averaging. Journal of Econometrics, 146(2):342-350.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.

Hautsch, N., Kyj, L. M., and Oomen, R. C. A. (2012). A blocking and regularization approach to high-dimensional realized covariance estimation. Journal of Applied Econometrics, 27(4):625-645.

Janková, J. and van de Geer, S. (2018). Inference in high-dimensional graphical models. Handbook of Graphical Models, Chapter 14, pages 325-351. CRC Press.

Johnstone, I. M. and Lu, A. Y. (2009). Sparse principal components analysis. arXiv:0901.4392.

Koike, Y. (2020). De-biased graphical lasso for high-frequency data. Entropy, 22(4):456.

Ledoit, O. and Wolf, M. (2003). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5):603-621.

Ledoit, O. and Wolf, M. (2004a). Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management, 30(4):110-119.

Ledoit, O. and Wolf, M. (2004b). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365-411.

Ledoit, O. and Wolf, M. (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics, 40(2):1024-1060.

Ledoit, O. and Wolf, M. (2015). Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions. Journal of Multivariate Analysis, 139:360-384.

Liu, H., Roeder, K., and Wasserman, L. (2010). Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, NIPS'10, pages 1432-1440, USA. Curran Associates Inc.

McCracken, M. W. and Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics, 34(4):574-589.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3):1436-1462.

Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society, Series B, 72:417-473.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464.

Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422):486-494.

Smith, J. and Wallis, K. F. (2009). A simple explanation of the forecast combination puzzle. Oxford Bulletin of Economics and Statistics, 71(3):331-355.

Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167-1179.

Thomson, M. E., Pollock, A. C., Önkal, D., and Gönül, M. S. (2019). Combining forecasts: performance and coherence. International Journal of Forecasting, 35(2):474-484.

Timmermann, A. (2006). Forecast combinations. Handbook of Economic Forecasting, Vol. 1, Chapter 4, pages 135-196. Elsevier.

van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166-1202.

Vandenberghe, L., Boyd, S., and Wu, S.-P. (1998). Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications, 19(2):499-533.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35.

Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2012). The HUGE package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 13(1):1059-1062.

Zhu, Y. and Cribben, I. (2018). Sparse graphical models for functional connectivity networks: Best methods and the autocorrelation issue. Brain Connectivity, 8(3):139-165.
Figure 1: The European Central Bank's (ECB) Survey of Professional Forecasters (SPF). Each circle denotes the forecast of a professional forecaster in the SPF for the quarterly 1-year-ahead forecasts of Euro-area real GDP growth, year-on-year percentage change. The actual series is the blue line.
Source: European Central Bank.
Figure 2: Averaged errors of the estimators of Θ on logarithmic scale (base 2), under the design of Section 5.1.1 (p = T^δ, q = 2(log T)^ν, s_T = O(T^{δ−c})).

Figure 3: Averaged errors of the estimator of w on logarithmic scale (base 2), under the same design as Figure 2.
Figure 4: Plots of the MSFE over the sample size T, for two values of c_1 (panels (a) and (b)), with c_2, ρ, and φ held fixed; N = 100, r = 5, σ_ξ = 1, L = 7, K = 2, p = 24, q = 5.

Figure 5: Plots of the MSFE over the number of predictors N, with c_1, c_2, ρ, and φ held fixed; T = 800, r = 5, σ_ξ = 1, L = 7, K = 2, p = 24, q = 5.

Figure 6: Plots of the MSFE over the four values of c_2, with c_1, ρ, and φ held fixed; T = 800, N = 100, r = 5, σ_ξ = 1, L = 7, K = 2, p = 24, q = 5.

Figure 7: Plots of the MSFE over the values of φ ∈ {0, 0.1, . . . , 0.9}, with c_1, c_2, and ρ held fixed; T = 800, N = 100, r = 5, σ_ξ = 1, L = 7, K = 2, p = 24, q = 5.

Figure 8: Plots of the MSFE over the values of ρ ∈ {0, 0.1, . . . , 0.9}, with c_1, c_2, and φ held fixed; T = 800, N = 100, r = 5, σ_ξ = 1, L = 7, K = 2, p = 24, q = 5.

Figure 9: Plots of the MSFE over the values of q, with c_1, c_2, ρ, and φ held fixed; T = 800, N = 100, r = 5, σ_ξ = 1, L = 12, K = 0, p = 13.

Table 1: Prediction of Monthly Macroeconomic Variables (MSFE by forecast horizon h).

INDPROD
h    EW          GLASSO      Factor GLASSO    MB          Factor MB
1    2.77E-04    1.51E-04    1.24E-04         2.23E-04    1.28E-04
2    3.26E-04    1.79E-04    5.59E-05         1.61E-04    1.38E-04
3    1.55E-04    9.77E-05    3.81E-05         1.17E-04    6.54E-05
4    1.18E-04    7.60E-05    2.38E-05         1.03E-04    2.65E-05

S&P500

UNRATE

FEDFUNDS