On the Forecast Combination Puzzle
Wei Qian, Craig A. Rolling∗, Gang Cheng, Yuhong Yang

Abstract.
It is often reported in the forecast combination literature that a simple average of candidate forecasts is more robust than sophisticated combining methods. This phenomenon is usually referred to as the "forecast combination puzzle". Motivated by this puzzle, we explore its possible explanations, including estimation error, invalid weighting formulas and model screening. We show that the existing understanding of the puzzle should be complemented by the distinction of different forecast combination scenarios known as combining for adaptation and combining for improvement. Applying combining methods without consideration of the underlying scenario can itself cause the puzzle. Based on our new understandings, both simulations and real data evaluations are conducted to illustrate the causes of the puzzle. We further propose a multi-level AFTER strategy that can integrate the strengths of different combining methods and adapt intelligently to the underlying scenario. In particular, by treating the simple average as a candidate forecast, the proposed strategy is shown to avoid the heavy cost of estimation error and, to a large extent, solve the forecast combination puzzle.
Key Words: combining for adaptation, combining for improvement, multi-level AFTER, model selection, structural break
1. Introduction
Since the seminal work of Bates and Granger (1969), both empirical and theoretical investigations support that when multiple candidate forecasts for a target variable are available to an analyst, forecast combination often provides more accurate and robust forecasting performance in terms of mean square forecast error (MSFE) than using a single candidate forecast. The benefits of forecast combination are attributable to the facts that individual forecasts often use different sets of information, are subject to model bias from different but unknown model misspecifications, and/or are varyingly affected by structural breaks. The review of Timmermann (2006) provides a comprehensive account of various forecast combination methods. In particular, one popular method is to combine forecasts by estimating a theoretically optimal weight through the minimization of mean square error (MSE). For example, Bates and Granger (1969) propose to find the optimal weight using the error variance-covariance structure of the individual forecasts. Granger and Ramanathan (1984) construct the optimal weight under a linear regression framework.

∗ Co-first author

Despite the ever-increasing popularity and sophistication of combining methods, it is repeatedly reported in past literature that the simple average (SA) is a very effective and robust forecast combination method that often outperforms more complicated combining methods (see Winkler and Makridakis (1983), Clemen and Winkler (1986) and Diebold and Pauly (1990) for some early examples). In a review and annotated bibliography of earlier studies, Clemen (1989) raises the question, "What is the explanation for the robustness of the simple average of forecasts?". Specifically, he proposes two questions of interest: "(1) Why does the simple average work so well, and (2) under what conditions do other specific methods work better?" The robustness of SA is also echoed in more recent literature.
For example, Stock and Watson (2004) build autoregressive models with univariate predictors (macroeconomic variables) as candidate forecasts for output growth of seven developed countries, and find that SA, together with other methods of least data adaptivity, is among the top-performing forecast combination methods. Stock and Watson (2004) further coin the term "Forecast Combination Puzzle" (for brevity, we refer to the puzzle as FCP hereafter), which refers to "the repeated finding that simple combination forecasts outperform sophisticated adaptive combination methods in empirical applications". In another recent example, Genre et al. (2013) use survey data from professional forecasters as the individual candidates to construct combined forecasts for three target variables. Despite some promising results of complicated methods, they further note that the observed improvement over SA is rather vague when a period of financial crisis is included in the analysis. The past empirical evidence appears to support the mysterious existence of FCP, which is also summarized in Timmermann (2006, section 7.1).

Many attempts have been made to demystify FCP. One popular and arguably the most well-studied explanation for FCP is the estimation error of the combining methods that rely on the optimal weight estimation by MSE minimization. Smith and Wallis (2009) rigorously study the estimation error issue. Using the forecast error variance-covariance structure, they show both theoretically and numerically that the estimator targeting the optimal weight can have large variance and consequently, the estimated optimal weight can be very different from the true optimal weight, often even more so than the simple equal weight. Elliott (2011) studies the theoretical maximal performance gain of the optimal weight over SA by optimizing the error variance-covariance structure, and points out that the gain is often small enough to be overshadowed by estimation error.
Timmermann (2006) and Hsiao and Wan (2014) also illustrate conditions for the optimal weight to be close to the equal weight so that the relative gain of the optimal weight over SA is small. Claeskens et al. (2014) consider the random weight and show that when the weight variance is taken into account, SA can perform better than using the "optimal" weight. Under linear regression settings, Huang and Lee (2010) discuss the estimation error and the relative gain of the optimal weight.

In addition to estimation error, nonstationarity and structural breaks in the data generating process (DGP) are believed to contribute to the unstable performance of the estimated "optimal" weight. For example, Hendry and Clements (2004) demonstrate that when candidate forecasting models are all misspecified and breaks occur in the information variables, forecast combination methods that target the optimal weight may not perform as well as SA. Also, Huang and Lee (2010) propose that the candidate forecasts are often weak, that is, they have low predictive content on the target variable, making the optimal weight similar to the simple equal weight.

While the aforementioned points are valid and valuable, they do not depict the complete picture of the puzzle. In this paper, we provide our perspectives on FCP to contribute to its settling. In our view, besides providing explanations of FCP, it is also very important to point out the potential danger of recommending SA for broad and indiscriminate use. Here, we focus on the mean squared error (MSE). It should be pointed out that the main points are expected to stand for other losses as well (e.g., absolute error) and that some combination approaches (e.g., AFTER) can handle general loss functions.

The rest of this article is organized as follows. In section 2, we list some aspects that have not been much addressed but are important towards the understanding of FCP in our view.
We formally introduce the setup of the forecast combination problem we consider in section 3. Our understandings of FCP are elaborated in sections 4-8. In particular, section 5 proposes a multi-level AFTER approach to solve FCP. The performance of this approach is also evaluated in section 9 using the U.S. Survey of Professional Forecasters (SPF) data. A brief conclusion is given in section 10.
2. Additional Aspects of FCP
The previous work has nicely pointed out that estimation error is an important source of FCP and has characterized the impact of the estimation error in idealized settings. Indeed, in general, even when the forecast combination weighting formula is valid in the sense that an optimal weight can be correctly estimated by minimizing MSE, an insufficiently large sample size may not support reliable estimation of the weight, resulting in inflated variance of the combined forecast. The explanation with structural breaks also makes sense for certain situations. However, in our view, there are several additional aspects that need to be considered for understanding FCP.

1. A key factor missing in addressing the FCP is the true nature of improvability of the candidate forecasts. While we all strive for better forecast performance than the candidates, that may not be feasible (at least for the methods considered). Thus we have two scenarios (Yang, 2004): i) One of the candidates is pretty much the best we can hope for (within the considerations, of course) and consequently any attempt to beat it will not succeed. We refer to this scenario as "Combining for Adaptation" (CFA), because the proper goal of a forecast combination method under this scenario should be targeting the performance of the best individual candidate forecast, which is unknown. ii) The other is that a significant gain of accuracy over all the individual candidates can be materialized. We refer to this scenario as "Combining for Improvement" (CFI), because the proper goal of a forecast combination method under this scenario should be targeting the performance of the best combination of the candidate forecasts to overcome defects of the candidates. In our experience, both scenarios occur commonly in real problems. Without factoring in this aspect, comparison of different combination methods may be grossly misleading due to the well-known sin of comparing apples to oranges.
In our view, empirical studies on forecast combinations in the future need to bring this lurking aspect into the analysis. With the above forecast combination scenarios spelled out, a natural question follows: Can we design a combination method to bridge the two camps of methods proposed for the two scenarios respectively, so as to help solve the FCP?

2. The methods being examined in the literature on FCP are mostly specific choices (e.g., least squares estimation). Can we do better with other methods (that may or may not have been invented yet) to avoid the heavy estimation price? Also, the currently investigated methods often assume the forecasts are unbiased and the forecast errors are stationary, which may not be proper for many applications. What happens when these assumptions fail?

3. It has been stated in the literature that the simple methods (e.g., SA) are robust based on empirical studies. We feel this is not necessarily true in the usual statistical sense (rigorously or loosely). In many published empirical results, the candidate forecasts were carefully selected/built and thus well-behaved. Therefore, the finding in favor of robustness of SA may be proper only for situations in which the data analyst has extensive expertise on the forecasting problem and has done quite a bit of work on screening out poor/un-useful candidates. We argue that it is much more desirable to investigate FCP broadly so as to allow the possibility of poor/redundant candidates for wider and more realistic applications. It should be added that in various situations, the screening of forecasts is far from an easy task and its complexity may well be at the same level as model selection/averaging. Therefore, even for top experts, the view that we can do a good job in screening the candidate forecasts and then simply recruit SA is overly optimistic. With the above, an important matter is to examine the robustness of SA in a broader context.
As is described in the first item, there are two distinct scenarios: CFA and CFI. The CFA scenario can happen if one of the candidate forecasts is based on a model sophisticated enough to capture the true DGP (yet still relatively simple), and/or the other candidate forecasts only add redundant information. The CFI scenario can often happen when different candidate forecasts use different information, and/or their underlying models have misspecifications in different ways.

There are different existing combining methods designed for the two scenarios. The methods for the CFI scenario typically seek to estimate the optimal weight aggressively; examples include variance-covariance based optimization (Bates and Granger, 1969) and linear regression (Granger and Ramanathan, 1984). These methods are likely to suffer from estimation error, causing unstable performance relative to SA. On the other hand, the combining methods for the CFA scenario should ideally perform similarly to the best individual candidate forecast and should not be subject as severely to estimation error as the methods for CFI. The typical methods suitable for the CFA scenario include AIC model averaging (Buckland et al., 1997) and Bayesian model averaging (e.g., Garratt et al., 2003), both in parametric settings. The method of AFTER (Yang, 2004) can be applied more broadly in parametric and non-parametric settings, regardless of the nature of the candidate forecasts. As one of the main contributions of this article, we show that the distinction between the two scenarios provides one of the keys to understanding the FCP.
We will see in section 4 that an analyst who fails to understand and bring in the underlying scenario and the specific type of data when choosing a combining method can incorrectly apply a combining method not designed for the underlying scenario and consequently deliver forecasting results worse than other methods (e.g., SA).

For the questions raised in the second item regarding whether we can avoid the estimation price, we cannot fully address them without a proper framework, because for any sensible method, one can always find a situation that favors it over its competitors. The framework we consider with sound theoretical support is through a minimax view: if one has a specific class of combinations of the forecasts in mind and wants to target the best combination in this class, then without any restriction/assumption on unbiasedness of the candidate forecasts and stationarity of the forecast errors, the minimax view seeks a clear understanding of the minimum price we have to pay no matter what method (existing or not) is used for combining. It turns out that the framework from the minimax view is closely related to the forecast combination scenarios discussed in the first item, and Yang (2004) provides a detailed theoretical exposition of the distinct forecast combination scenarios and associated minimax results.

Indeed, Yang (2004) shows that from a minimax perspective, because of the aggressive target set for the CFI scenario, we have to pay an unavoidably heavier cost than for the target set under the CFA scenario. Specifically, if we let K denote the number of forecasts and T denote the forecasting horizon, Yang (2004) shows that when the target is to find the optimal weight to minimize the general empirical risk over a set of weights satisfying a convex constraint (which is appropriate under the CFI scenario), the estimation cost is O(K log(1 + T/K)/T) for relatively large T (T > K), and O(log(K)/√(T log T)) for relatively small T (T ≤ K).
In contrast, if the target is to match the performance of the best individual forecast (which is appropriate under the CFA scenario), the estimation cost is only O(log(K)/T).

Because of the unavoidable heavy cost under the CFI scenario, it is not always ideal to pursue the aggressive target of the optimal weight. Indeed, even if the optimal weight gives better performance than the best individual candidate, the improvement may not be enough to offset the additional estimation cost (i.e., increased variance) as precisely (in minimax rate) identified in Yang (2004) and Wang et al. (2014). As another contribution of our work, we show in section 6 that an appropriately constructed forecast combination strategy can perform in a smart way according to the underlying CFI or CFA scenario. If CFI is the correct scenario, the proposed strategy can behave both aggressively and conservatively, so that it performs similarly to SA when SA is much better than, e.g., the linear regression method.

Besides the estimation error and the necessary distinction of underlying scenarios discussed in the first two items, the following three reasons can also contribute to FCP. First, the weighting derivation formula used by complicated methods is often not suitable for the situation. For example, under structural breaks, old historical data no longer hold support for a valid optimal weighting scheme, and the known justification of well-established combining methods fails as a result. Indeed, Hendry and Clements (2004) demonstrate that when candidate forecasting models are all misspecified and breaks occur in the information variables, methods that estimate the optimal weight may not perform as well as SA. In section 7, our Monte Carlo examples also show that SA may dominate the complicated methods when breaks occur in DGP dynamics. Second, it is common practice that the candidate forecasts are already screened in some ways so that they are more or less on an equal footing.
For example, Stock and Watson (1998) and Stock and Watson (2004) apply various model selection methods such as AIC and BIC to identify promising linear or nonlinear candidate forecast models. Recently, Bordignon et al. (2013) select models of different types (ARMAX, time-varying coefficients, etc.) and suggest that SA works well when combining a small number of well-performing forecasts. In studies using survey data of professional forecasters, it is also expected that each professional forecaster performs some model screening before satisfactorily settling down with their own forecast. In these cases, there may not be particularly poor candidate forecasts, and the candidates (at least the top ones) may tend to contribute more or less equally to the optimal combination, making SA a competitive method. In section 8, we use Monte Carlo examples to show that screening can be a source of FCP. Lastly, the puzzle can also be a result of publication bias; people do not tend to emphasize the performance of SA when SA does not work well.

With all our understandings of FCP discussed above, we address the issues raised in the third item and provide further information on the robustness of SA in sections 6-8. In particular, we will see that SA is actually not robust in performance in several directions: its performance may change significantly or even substantially when i) an optimal, poor or redundant forecast is added; or ii) the screening of the candidate forecasts is done to a different degree. In addition, the size of the rolling window to deal with structural breaks affects the relative performance of SA as well. Fortunately, as will be seen, some combination methods can largely avoid these defects.

3. Problem Setup

Suppose that an analyst is interested in forecasting a real-valued time series y_1, y_2, .... Given each time point t ≥
1, let x_t be the (possibly multivariate) information variable vector revealed prior to the observation of y_t. The x_t may not be accessible to the analyst. Conditional on x_t and z_{t-1} := {(x_j, y_j), 1 ≤ j ≤ t-1}, y_t is subsequently generated from some unknown distribution p_t(·|x_t, z_{t-1}) with conditional mean m_t = E(y_t | x_t, z_{t-1}) and conditional variance v_t = Var(y_t | x_t, z_{t-1}). Then, y_t can be represented as y_t = m_t + ε_t, where ε_t is the random noise with conditional mean 0 and conditional variance v_t.

Assume that prior to the observation of y_t, the analyst has access to K real-valued candidate forecasts ŷ_{t,i} (i = 1, ..., K). These forecasts may be constructed with different model structures, and/or with different components of the information variables, but the details regarding how each original forecast is created may not be available in practice and are not assumed to be known. The analyst's objective in (linear) forecast combination is to construct a weight vector w = (w_1, ..., w_K)^T ∈ R^K, based on the available information prior to the observation of y_t, to find a point forecast of y_t by forecast combination ŷ_{t,w} = Σ_{i=1}^K w_i ŷ_{t,i}. The weight vector may be different at different time points.

To gauge the performance of a procedure that produces forecasts {ŷ_t, t = 1, 2, ...} given time horizon T, we consider the average forecast risk

R_T = (1/T) Σ_{t=1}^T E(y_t − ŷ_t)²

in our analysis and simulation studies. For real data evaluation, since the risk cannot be computed, we use the mean square forecast error (MSFE) as a substitute:

MSFE_T = (1/T) Σ_{t=1}^T (y_t − ŷ_t)².

According to the FCP, simple methods with little or no time variation in the weight w (e.g., equal weighting) often outperform complicated methods with much time variation in terms of R_T and MSFE_T.
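The combination and evaluation mechanics above are simple to state in code. The following minimal sketch (function and variable names are our own, and the toy numbers are purely illustrative) computes a linearly combined forecast ŷ_{t,w} and its MSFE_T:

```python
import numpy as np

def combine_forecasts(candidate_forecasts, weights):
    """Linear forecast combination: y_hat_{t,w} = sum_i w_i * y_hat_{t,i}.

    candidate_forecasts: array of shape (T, K); column i holds y_hat_{t,i}
    weights:             array of shape (K,)
    """
    return candidate_forecasts @ weights

def msfe(y, y_hat):
    """Mean square forecast error over the evaluation horizon T."""
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

# Two candidate forecasts over T = 4 periods (toy numbers).
forecasts = np.array([[1.0, 3.0],
                      [2.0, 2.0],
                      [0.0, 4.0],
                      [1.0, 1.0]])
y = np.array([2.0, 1.0, 3.0, 1.0])

sa = combine_forecasts(forecasts, np.array([0.5, 0.5]))  # simple average
print(msfe(y, sa))  # -> 0.5
```

With time-varying weights, one would simply supply a different weight vector at each t; SA corresponds to the constant weight vector (1/K, ..., 1/K).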
4. CFA versus CFI: A Hidden Source of FCP

In this section, we study the performance of forecast combination methods under the two distinct scenarios. Failure to recognize these scenarios can itself result in the FCP. We use two simple but illustrative Monte Carlo examples under regression settings similar to those of Huang and Lee (2010) to demonstrate the CFA and CFI scenarios.
Case 1.
Suppose y_t (t = 1, ..., T) is generated by the linear model

y_t = x_t β + ε_t,

where the x_t's are i.i.d. N(0, σ²_X), and the ε_t's are independent of the x_t's and are i.i.d. N(0, σ²). Consider the two candidate forecasts generated by

Forecast 1: ŷ_{t,1} = x_t β̂_t;
Forecast 2: ŷ_{t,2} = α̂_t,

where β̂_t and α̂_t are both obtained from ordinary least squares (OLS) estimation using historical data.

Given that Forecast 1 essentially represents the true model, combining it with Forecast 2 cannot improve over the performance of the best individual forecast asymptotically, thus giving an example of the CFA scenario. Let T_0 be a fixed start point of the evaluation period, and let T be the end point. Given the evaluation period from T_0 to T, let R_{T,1}, R_{T,2} and R_{T,w} be the average forecast risks of Forecast 1, Forecast 2 and the combined forecast, respectively. If we let R_{T,SA} be the average forecast risk at time T for SA, we expect that R_{T,SA} > R_{T,1}. Indeed, Proposition 2 in the Appendix shows

R_{T,1} / R_{T,SA} → σ² / (σ² + β²σ²_X/4) as T → ∞,   (1)

and asymptotically, the optimal combination assigns all the weight to Forecast 1.

Under the CFA scenario, since the best candidate is unknown, the natural goal of forecast combination is to match the performance of the best candidate.

Case 2.

Suppose y_t (t = 1, ..., T) is generated by the linear model

y_t = (x_{t,1} + x_{t,2}) β + ε_t,

where the x_t = (x_{t,1}, x_{t,2})^T are i.i.d. following a bivariate normal distribution with mean 0 and common variance σ²_{X1} = σ²_{X2} = σ²_X. Let ρ denote the correlation between x_{t,1} and x_{t,2}. The random errors ε_t are independent of the x_t's and are i.i.d. N(0, σ²). Consider the two candidate forecasts generated by

Forecast 1: ŷ_{t,1} = x_{t,1} β̂_{t,1};
Forecast 2: ŷ_{t,2} = x_{t,2} β̂_{t,2},

where β̂_{t,1} and β̂_{t,2} are both obtained from OLS estimation with historical data.

Different from Case 1, Case 2 presents a scenario where each candidate forecast employs only part of the information set.
It is expected, to some extent, that combining the two forecasts works like pooling different sources of important information, resulting in performance better than either of the candidate forecasts. Defining the average forecast risks R_{T,1}, R_{T,2}, R_{T,SA} the same way as in Case 1, we can see from Proposition 3 in the Appendix that

R_{T,1} / R_{T,SA} → (σ²_X β² (1 − ρ²) + σ²) / (σ²_X β² (1 − ρ²)(1 − ρ)/2 + σ²) as T → ∞.   (2)

Clearly, when the two information sets are not highly correlated, SA can improve the forecast performance over the best candidate. This case gives a typical example of the CFI scenario, and it is appropriate to seek the more aggressive goal of finding the best linear combination of candidate forecasts.

Our view is that discussion of the FCP should take into account the different combining scenarios. Next, we perform Monte Carlo studies on the two cases to provide an explanation of the puzzle. Combining methods suitable for the CFA scenario have been developed to target the performance of the best individual candidate. In our numerical studies, we choose the AFTER method (Yang, 2004) as the representative, and it is known that AFTER pays a smaller estimation price than methods that target the optimal linear or convex weighting. In contrast, combining methods for the CFI scenario usually attempt to estimate the optimal weight. We choose linear regression of the response on the candidate forecasts (LinReg) as the representative. The method of Bates and Granger (1969) without estimating correlation (BG for brevity) is used as an additional benchmark.

For Case 1, we perform simulations as follows. Set σ² = σ²_X = 1. Consider a sequence of 20 β's such that the corresponding signal-to-noise (S/N) ratios are evenly spaced between 0.05 and 5 on the logarithmic scale. For each β, we conduct the following simulation 100 times to estimate the average forecast risk. A sample of 100 observations is generated.
The first 60 observations are used to build the candidate forecast models, which are subsequently used to generate forecasts for the remaining 40 observations. Forecast combination methods including the SA, BG, AFTER and LinReg methods are applied to combine the candidate forecasts, and the last 20 observations are used for performance evaluation. The average forecast risk of each forecast combination method is divided by that of SA to obtain the normalized average forecast risk (denoted by normalized R_T). The results are summarized in Figure 1. For Case 2, we set β_1 = β_2 = β, ρ = 0 and σ² = σ²_{X1} = σ²_{X2} = 1. The remaining simulation settings are the same as in Case 1. The normalized average forecast risks (relative to SA) are summarized in Figure 2.

In Case 1, it is clear from Figure 1 that AFTER is the preferred method of choice under the CFA scenario. LinReg, on the other hand, consistently underperforms compared to AFTER. Interestingly, when S/N is relatively low (less than 0.35), we observe the "puzzle" that LinReg performs worse than SA, which is due to the weight estimation error. If the analyst correctly identifies that it is the CFA scenario and applies a corresponding method like AFTER, the "puzzle" disappears: AFTER can perform better than (or very close to) SA, while LinReg fails.

In Case 2, if the analyst applies AFTER without realizing the underlying CFI scenario, we observe the "puzzle" that SA outperforms AFTER. The "puzzle" is not entirely surprising since AFTER is designed to target the performance of the best individual forecast, while (2) shows that SA can improve over the best individual forecast. LinReg appears to be the correct method of choice when the S/N ratio is relatively high.
Figure 1: (Case 1) Comparing the average forecast risk of different forecast combination methods (dashed line represents the SA baseline; x-axis is on the logarithmic scale).

However, similar to what is observed in Case 1, LinReg suffers from weight estimation error when the S/N ratio is low, once again giving the "puzzle" that LinReg performs worse than SA. Case 2 also shows the interesting observation that it is not always optimal to apply SA even when SA is the "optimal" weight in a restricted sense. Indeed, (A.2) and (A.3) in Proposition 3 imply that if we adopt the common restriction that the sum of all weights is 1, SA is the asymptotically optimal weight. However, if we impose no restriction on the weight range, the asymptotic optimal weight assigns a unit weight to each candidate forecast. This explains the advantage of LinReg over SA in Case 2 when the S/N ratio
Figure 2: (Case 2) Comparing the average forecast risk of different forecast combination methods (dashed line represents the SA baseline; x-axis is on the logarithmic scale).

is large.

The observations above illustrate that different combining methods can have strikingly different performance depending on the underlying scenario. The FCP can appear when a combining method is not properly chosen according to the correct scenario. Without knowing the underlying scenario, comparing these methods may not provide a complete picture of FCP, and blindly applying SA may result in sub-optimal performance. We advocate the practice of trying to identify the underlying scenario (CFA or CFI) when considering forecast combination. It should be pointed out that when the relevant information is limited, it may not be feasible to confidently identify the forecast combination scenario. In such a case, a forced selection, similar to the comparison of model selection and model combining (averaging) described in Yuan and Yang (2005), would induce enlarged variability of the resulting forecast. A better solution is an adaptive combination of forecasts, as illustrated in the next section.
5. Multi-level AFTER
With the understanding in section 4, we see that when considering forecast combination methods, an effort should be made to understand whether there is much room for improvement over the best candidate. When this is difficult to decide or impractical to implement due to handling a large number of quantities to be forecast in real time, we may turn to the question: Can we find an adaptive (or universal) combining strategy that performs well in both the CFA and CFI scenarios? Note that here adaptive refers to adaptation to the forecast combination scenario (instead of adaptation to achieving the best individual performance). Another question follows: Under the CFI scenario, can the adaptive combining strategy still perform as well as SA when the price of estimation error is high? As we have seen in Case 2 of section 4, using methods (e.g., LinReg) intended for the CFI scenario alone cannot successfully address the second question.

It turns out that the answers to these two questions are affirmative. The idea is related to a philosophical comment in Clemen et al. (1995):

"Any combination of forecasts yields a single forecast. As a result, a particular combination of a given set of forecasts can itself be thought of as a forecasting method that could compete..."
The use of combinations of forecast (or procedure) combinations is a theoretically powerful tool to achieve adaptive minimax optimality (see, e.g., Yang (2004), Wang et al. (2014)). In the context of our discussion, combined forecasts such as SA, AFTER and LinReg can all be considered as candidate forecasts and may be used as individual candidates in a forecast combination scheme.

Accordingly, we design a two-step combining strategy: first, we construct three new candidate forecasts using SA, AFTER and LinReg; second, we apply the AFTER algorithm to these new candidate forecasts to generate a combined forecast. We refer to this two-step algorithm as multi-level AFTER (or mAFTER for short) because two layers of AFTER algorithms are involved. The key lies in the AFTER algorithm in the second step, which allows mAFTER to automatically target the performance of the best individual candidate among SA, AFTER and LinReg. Under the CFA scenario, mAFTER can perform as if we were using AFTER alone, considering that AFTER is the proper method of choice. Under the CFI scenario, mAFTER can perform closely to the better of SA and LinReg. Thus, when LinReg suffers from severe estimation error, mAFTER will perform closely to SA and thereby avoid the high cost.

Indeed, if we denote the forecasts generated from SA, LinReg and mAFTER by ŷ_t^(SA), ŷ_t^(LR) and ŷ_t^(M), respectively, we have Proposition 1 as follows.

Proposition 1.
Under the regularity conditions shown in the Appendix, the average forecast risk of the mAFTER strategy satisfies

(1/T) Σ_{t=T_0}^{T} E(y_t − ŷ_t^(M))² ≤ inf( inf_{1≤i≤K} (1/T) Σ_{t=T_0}^{T} E(y_t − ŷ_{t,i})² + c_1 log(K)/T,
    (1/T) Σ_{t=T_0}^{T} E(y_t − ŷ_t^(SA))² + c_2/T,
    (1/T) Σ_{t=T_0}^{T} E(y_t − ŷ_t^(LR))² + c_2/T ),

where c_1 and c_2 are some positive constants not depending on the time horizon T.

Proposition 1 is a consequence of Theorem 5 in Yang (2004). It indicates that, in terms of the average forecast risk, mAFTER can match the performance of the best original individual forecast, the SA forecast and the LinReg forecast (whichever is best), with a relatively small price of order at most log(K)/T.

To confirm that the mAFTER strategy can solve the "puzzles" illustrated in the previous section, we repeat the simulation studies of Case 1 and Case 2 and summarize the results in Figure 3 and Figure 4, respectively. In Case 1, it suffices to see that mAFTER correctly tracks the performance of AFTER. In Case 2, when S/N is relatively large, mAFTER tracks the performance of LinReg, and when S/N is small, it stays close to SA.
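To make the two-step construction concrete, here is a simplified sketch of an AFTER-style combiner and the mAFTER idea. The exponential weighting below is a stripped-down variant (Yang's AFTER also updates conditional variance estimates, which we hold fixed here), and all function names are our own:

```python
import numpy as np

def after_weights(past_errors, v=1.0):
    """Simplified AFTER-style weights: exponential in cumulative squared
    forecast error. (The full AFTER algorithm also tracks variance
    estimates; the variance v is held fixed here for clarity.)"""
    s = -0.5 * np.sum(past_errors ** 2, axis=0) / v
    w = np.exp(s - s.max())          # subtract max for numerical stability
    return w / w.sum()

def after_combine(y, forecasts):
    """Sequentially combine candidate forecasts (T x K array) with
    AFTER-style weights; returns the T combined forecasts."""
    T, K = forecasts.shape
    out = np.empty(T)
    for t in range(T):
        if t > 0:
            w = after_weights(y[:t, None] - forecasts[:t])
        else:
            w = np.full(K, 1.0 / K)  # equal weights before any feedback
        out[t] = forecasts[t] @ w
    return out

def multi_level_after(y, forecasts, other_combined):
    """mAFTER sketch: treat SA (and any other combined forecast, e.g. a
    LinReg combination) as extra candidates, then run AFTER again on
    this second-level pool."""
    sa = forecasts.mean(axis=1)
    level2 = np.column_stack([sa, other_combined, after_combine(y, forecasts)])
    return after_combine(y, level2)

# Toy demo: candidate 1 is perfect, candidate 2 is not; the AFTER-style
# weight drifts toward candidate 1 as errors accumulate.
y = np.array([1.0, 1.0, 1.0, 1.0])
F = np.column_stack([np.ones(4), np.zeros(4)])
print(after_combine(y, F))
```

The second-level call is what lets the strategy fall back on SA when the aggressive first-level combiner pays a high estimation price.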
6. Is SA Really Robust?
The SA has been praised for being robustly among the top performers relative to other forecast combination methods. It is obvious that SA cannot be robust in the traditional statistical sense: even a single really bad candidate can damage the performance of the combined forecast arbitrarily badly. A more interesting question is to assess the robustness of SA in practically relevant settings. The previous two sections have shown that SA is not robust in terms of its relative performance when dealing with the two different scenarios. In this section, we show that SA is not robust even in a loose sense when new forecast candidates are added to the candidate pool, especially if the new candidates carry only redundant information with respect to the original candidate pool. In contrast, the AFTER-type combining methods can be rather robust against adding poor or redundant candidate forecasts. Here, we consider the following three cases.
Case 3.
Suppose a new information variable $x_{t,3}$ has the same distribution as $x_{t,1}$, and is independent of $z_{t-1}$ and $(x_{t,1}, x_{t,2})$. A new candidate forecast $\hat{y}_{t,3} = x_{t,3}\hat{\beta}_{t,3}$ joins the candidate pool in Case 2, where $\hat{\beta}_{t,3}$ is obtained from OLS estimation with historical data.

Case 4.
A new candidate forecast $\hat{y}_{t,4} = x_{t,2}\hat{\beta}_{t,2}$, identical to Forecast 2, joins the candidate pool in Case 2.

Figure 3: (Case 1) Performance of mAFTER under the adaptation scenario (dashed line represents the SA baseline; x-axis is in logarithmic scale).

Case 5.
A new candidate forecast $\hat{y}_{t,5} = \tilde{x}_{t,5}\tilde{\beta}_{t,5}$ is generated using a transformed information variable $\tilde{x}_{t,5} = \exp(x_{t,1})$, where $\tilde{\beta}_{t,5}$ is obtained from OLS estimation with historical data.

Note that the new candidate in Case 3 is a very poor forecast, while the new candidates in Case 4 and Case 5 contain a subset of the information variables. In all of the cases above, no new information is added to the candidate pool. Following the same simulation setting as Case 2, we focus on SA and AFTER and compute the ratio between the MSFE after adding the new candidate and the MSFE in Case 2.

Figure 4: (Case 2) Performance of mAFTER under the improvement scenario (dashed line represents the SA baseline; x-axis is in logarithmic scale).

Figure 5 shows that the performance of AFTER remains almost the same, while the performance of SA worsens after adding the non-informative or redundant candidate forecasts.
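The qualitative effect can be reproduced in a stylized check in the spirit of Cases 3-5: add an exact duplicate of a weaker candidate and compare how SA and an AFTER-flavoured performance weighting react. All series and constants below are illustrative, not the paper's simulation design:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 3000
m = np.sin(np.arange(T) / 20.0)             # common predictable component
y = m + 0.3 * rng.standard_normal(T)
f1 = m + 0.2 * rng.standard_normal(T)       # stronger candidate
f2 = m + 1.0 * rng.standard_normal(T)       # weaker candidate
f_dup = f2.copy()                           # Case 4 analogue: exact duplicate

mse = lambda f: float(np.mean((y - f) ** 2))

def exp_weight_combine(fs, lam=0.1):
    # Weights exponential in cumulative squared error, so nearly all
    # weight concentrates on the best candidate (AFTER-flavoured sketch).
    s = np.array([-lam * np.sum((y - f) ** 2) for f in fs])
    w = np.exp(s - s.max())
    w /= w.sum()
    return sum(wi * f for wi, f in zip(w, fs))

sa_ratio = mse(np.mean([f1, f2, f_dup], axis=0)) / mse(np.mean([f1, f2], axis=0))
ew_ratio = mse(exp_weight_combine([f1, f2, f_dup])) / mse(exp_weight_combine([f1, f2]))
# sa_ratio exceeds 1 (SA degrades); ew_ratio stays near 1
```

Because SA must split weight equally, the duplicate drags the average toward the weaker forecast, while the performance-based weighting is barely affected.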
7. Improper Weighting Formulas: A Source of the FCP Revisited
Generally speaking, the popular forecast combination methods often implicitly assume that the time series and/or the forecast errors are stationary. In theory, they should therefore perform well if we have access to long enough historical data.
Figure 5: Studying the robustness of SA against adding new candidate forecasts.

In practice, however, such derived weighting formulas can often be unsuitable when the DGP changes and the candidate forecasts cannot adjust quickly to the new reality. For example, it is often believed that structural breaks can happen unexpectedly, making the relative performance of the candidate forecasts unstable and giving the impression that SA performs well. Next, we use a Monte Carlo example to illustrate the FCP under structural breaks. Rather than assuming deterministic shifts in information variables (Hendry and Clements, 2004), we consider breaks in the DGP dynamics:
$$y_t = \begin{cases} \sum_{k} \beta_{1,k}\, y_{t-k} + \varepsilon_t & \text{if } 1 \le t \le 50, \\ \beta_{2,1}\, y_{t-1} + \beta_{2,2}\, y_{t-2} + \varepsilon_t & \text{if } 51 \le t \le 100, \\ \beta_{3,1}\, y_{t-1} + \varepsilon_t & \text{if } 101 \le t \le T, \end{cases}$$
where the coefficients $\beta_{j,k}$ ($j = 1, 2, 3$) are randomly generated from a uniform distribution, the $\varepsilon_t$'s are i.i.d. normal, and the break points are $t = 50$ and $t = 100$. The candidate forecast models are autoregressions from lag 1 to lag 6, and we apply SA, BG, LinReg and AFTER to generate the combined forecasts. The simulation is repeated 100 times, and the last 100 time points serve as the evaluation period to obtain the average forecast risk. For comparison, we consider the BG, LinReg and AFTER methods with estimation rolling window size rw = 20 or 40, meaning that only the most recent rw observations are used to estimate the weights for each forecast. The results are summarized in Table 1. The average forecast risk is normalized with respect to SA, and numbers in parentheses are standard errors.

Table 1: Comparing the normalized average forecast risk of different combination methods under structural breaks.

           SA      LinReg          BG              AFTER
standard   1.000   1.026 (0.011)   1.005 (0.003)   1.047 (0.010)
rw = 40    1.000   1.060 (0.033)   0.992 (0.002)   0.991 (0.009)
rw = 20    1.000   1.64 (0.42)     0.980 (0.003)   0.952 (0.007)

We can see from Table 1 that all three standard combining methods, when estimating weights from all historical data, underperform SA due to the unstable relative performance of the candidate forecasts. As we shrink the estimation window to the most recent 40 and 20 time points, BG and AFTER achieve better performance than SA, while the performance of LinReg worsens. This result can be understood by noting that two opposing factors are at play when we shrink the weight estimation window: using only the most recent forecasts decreases the bias of a weighting formula fitted to the old data, but simultaneously increases the variance of the estimated weights. Among the three methods considered, the estimation error factor dominates for LinReg.
On the other hand, AFTER is not designed to aggressively target the optimal weight, and thus it benefits the most from the shrinking rolling window. Due to the complex impact of structural breaks on forecast combination methods, it is arguably true that the focus should be on how to detect the problem (see, e.g., Altissimo and Corradi, 2003; Davis et al., 2006) and how to devise new combining forms accordingly (e.g., using the most recent observations to avoid an improper weighting formula). However, proper identification of structural breaks can be difficult to achieve in practice, and this example shows that in the presence of structural breaks, the relative performance of SA is not as robust as that of BG and AFTER with naïvely chosen rolling windows.
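The rolling-window idea can be sketched as follows. The inverse-MSFE weighting below is a BG-flavoured simplification of the methods actually compared, and the toy error history is an illustrative assumption:

```python
import numpy as np

def rolling_inverse_msfe_weights(errors, rw=None):
    """Sketch of rolling-window weight estimation: weights inversely
    proportional to each candidate's MSFE over the most recent rw
    observations (a BG-flavoured simplification), so that weights can
    adapt after a structural break instead of staying anchored to the
    full, pre-break error history."""
    recent = errors if rw is None else errors[-rw:]
    msfe = np.mean(recent ** 2, axis=0) + 1e-12   # guard against division by zero
    w = 1.0 / msfe
    return w / w.sum()

# Toy error history: candidate 0 is accurate before a break at t = 80,
# candidate 1 is accurate after it.
errors = np.vstack([np.column_stack([np.full(80, 0.1), np.full(80, 1.0)]),
                    np.column_stack([np.full(40, 1.0), np.full(40, 0.1)])])
w_full = rolling_inverse_msfe_weights(errors)         # anchored to the old regime
w_roll = rolling_inverse_msfe_weights(errors, rw=20)  # tracks the new regime
```

The full-sample weights still favour the pre-break winner, whereas the rw = 20 window shifts almost all weight to the candidate that is accurate in the new regime.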
8. Linking Forecast Model Screening to FCP
In empirical studies, the candidate forecasting models are often screened/selected in some way to generate a smaller set of candidates for combining. As demonstrated in Case 3 of Section 6, the performance of SA is particularly susceptible to poor-performing candidate models. The common practice of model screening may therefore contribute to improving the performance of SA. Next, we illustrate the impact of screening with a Monte Carlo example. Let $x_t \in \mathbb{R}^p$ ($p = 20$) be the $p$-dimensional information variable vector randomly generated from a multivariate normal distribution with covariance $\Sigma$, where $(\Sigma)_{i,j} = \rho^{|i-j|}$ and $\rho = 0$ or 0.5. Consider a DGP with a linear model setting $y_t = x_t^T\beta + \varepsilon_t$, where the coefficient vector $\beta = (3, \ldots, 0)^T$ has only its first 7 entries nonzero and the $\varepsilon_t$ are i.i.d. $N(0, \sigma^2)$ with $\sigma = 2$ or 4. Under this setting, only the first 7 variables in $x_t$ are important for $y_t$, while the remaining variables are redundant.

If we assume that the analyst has full access to the information vectors $x_t$, we may build linear models with any subset of the information variables as the candidate forecasts. It is known from Wang et al. (2014) that if we select the best subset model of the right size using the ABC criterion (Yang, 1999) or combine the subset regression models by proper adaptive combining methods (Yang, 2001), the prediction risk can adaptively achieve minimax optimality over soft and hard sparse function classes. Inspired by this result, we consider the following screening-and-combining approach. First, for each model size (that is, the number of information variables used in a candidate linear model), choose the best OLS model based on estimation mean square error. Second, from the $p$ models selected in the first step, find the top $X\%$ ($X = 10, 20, 40, 60, 80$) of the models based on the ABC criterion. Note that the ABC criterion for a subset model of size $r$ is
$$\mathrm{ABC}(r) = \sum_{t=1}^{n} (y_t - \hat{y}_{t,r})^2 + 2r\sigma^2 + \sigma^2 \log\binom{p}{r},$$
where $n$ is the estimation sample size, $\hat{y}_{t,r}$ is the fitted response, and $\sigma^2$ can be replaced by the estimation mean square error. The subset models remaining after the two-step screening are used to build the candidate forecasts for combining. In the simulation, the total time horizon is set to 200. The screening procedures are applied to the first 100 observations, and the remaining models are used to build the candidate forecasts for the latter 100 time points. Different forecast combination methods are applied, and their performances are evaluated using the last 50 observations. The simulation is repeated 100 times, and the normalized average forecast risk (relative to SA) is summarized in Table 2.

Table 2 shows that AFTER outperforms all the other competitors, including SA. This is consistent with our understanding of a typical CFA scenario, under which AFTER is the proper choice of combining method. However, as we decrease $X$ and select smaller sets of candidate forecasts for combining, the performance of SA gradually approaches that of AFTER. Such a result is not entirely surprising considering that when only the top few models are selected, simply averaging them can perform similarly to the optimal results obtained by proper subset selection or combination methods (Wang et al., 2014). LinReg, which is not a proper choice under the CFA scenario, appears to underperform compared to SA. As $X$ decreases, LinReg becomes less subject to weight estimation error, and its performance improves relative to SA. From this example, we can see that the performance of SA is not robust to the degree of screening. Generally, it is a very challenging task to ensure a screening that is optimal for making SA perform well.
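The ABC criterion can be computed directly. The penalty constants below follow the formula as reconstructed here (the exponents and coefficients in the extracted original are partly garbled, so treat them as an assumption):

```python
import numpy as np
from math import comb, log

def abc_criterion(y, yhat, r, p, sigma2):
    """ABC criterion for a size-r subset model out of p variables:
    residual sum of squares, a 2*r*sigma^2 dimension penalty, and a
    sigma^2 * log(C(p, r)) complexity term.  sigma2 may be replaced
    by the estimation mean square error."""
    rss = float(np.sum((np.asarray(y) - np.asarray(yhat)) ** 2))
    return rss + 2 * r * sigma2 + sigma2 * log(comb(p, r))
```

In the screening step, each candidate subset model would be scored this way and only the top X% retained for combining.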
As a result, although SA works relatively well in this particular example under aggressive screening (keeping very few candidates), SA should not be preferred in general. Without a good screening/selection rule, the analyst is left with too much freedom to make poor decisions. We note that a possible solution is to first create new candidate forecasts (e.g., forecasts generated by linear regression methods) before combining.

Table 2: Comparing the normalized average forecast risk of different forecast combination methods after the screening procedure.

Top X%    10%     20%     40%     60%     80%
σ = 2, ρ = 0
AFTER     0.998   0.989   0.966   0.951   0.945
BG        1.000   0.999   0.997   0.997   0.996
LinReg    1.017   1.024   1.056   1.098   1.151
(additional panels: σ = 2, ρ = 0.5; σ = 4, ρ = 0; σ = 4, ρ = 0.5)

9. Real Data Example

In this section, we study the U.S. SPF (Survey of Professional Forecasters) dataset to evaluate SA and the mAFTER strategy. This dataset is a quarterly survey of macroeconomic forecasts in the United States. Lahiri et al. (2013) nicely handled the missing forecasts by adopting two missing-forecast imputation strategies, known as regression imputation (REG-Imputed) and simple average imputation (SA-Imputed), to generate complete panels. As pointed out by Lahiri et al. (2013), the change of data administration agency in 1990 and the subsequently shifting missing data pattern make it difficult to use the entire data period for meaningful evaluation. Therefore, we inherit their missing-forecast imputation and forecast selection strategies, and focus on the period from 1968:Q4 to 1990:Q4 to evaluate the performance of the mAFTER strategy.

Three macroeconomic variables are considered: the seasonally adjusted annual rate of change of the GDP price deflator (PGDP), the growth rate of real GDP (RGDP) and the quarterly average of the monthly unemployment rate (UNEMP). The datasets for RGDP and PGDP have 14 candidate forecasts, and the datasets for UNEMP have 13 candidate forecasts. Each forecast provides $g$-quarter ($g = 1, 2, 3, 4$) ahead forecasting. We apply SA, AFTER, BG, LinReg and mAFTER to each SPF dataset of a macroeconomic variable with a given missing-forecast imputation method. Each forecast combination method uses the first one fourth of the total time horizon to build up the initial weights, and the remaining time points are used to calculate the normalized MSFE of each method relative to SA. By averaging the four MSFEs corresponding to the 1-, 2-, 3- and 4-quarter ahead forecasts, we summarize the performance of the different combining methods in Table 3.

Table 3: Comparing the performance of forecast combination methods with SPF datasets (values shown are normalized MSFEs averaged over 1-, 2-, 3- and 4-quarter ahead forecasting).

Target Variable   SA     LinReg   BG     AFTER   mAFTER
REG-imputed
PGDP              1.00   1.88     0.95   0.90    0.90
RGDP              1.00   1.64     1.00   1.11    1.01
UNEMP             1.00   1.79     0.99   0.98    0.98
SA-imputed
PGDP              1.00   2.17     0.98   0.95    0.95
RGDP              1.00   1.83     1.00   1.13    1.03
UNEMP             1.00   1.69     0.99   0.97    0.98

From Table 3, although AFTER performs quite differently across the target macroeconomic variables, the mAFTER strategy delivers robust overall performance for all three variables. For PGDP, AFTER performs the best and beats SA by as much as 10%; mAFTER successfully maintains this advantage over SA. For RGDP, while SA and BG beat AFTER by up to 13%, mAFTER successfully pulls the performance to within 3% of SA. Finally, for the UNEMP variable, SA, BG and AFTER all perform very similarly, with no more than a 3% difference, and the performance of mAFTER does not deviate much from either SA or AFTER. The LinReg method, which aggressively pursues the optimal weight, performs poorly for all three target variables. It is interesting to note from Figure 6 that for both the PGDP and RGDP variables, the largest performance difference between SA and AFTER occurs in the one-quarter ahead forecasting; in each case, mAFTER robustly matches the better of SA and AFTER.
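The evaluation metric used in Table 3 amounts to the following computation (the function and method names are illustrative):

```python
import numpy as np

def normalized_msfe(y, forecasts, baseline):
    """Normalized MSFE relative to a baseline forecast (here SA), as
    reported in Table 3; values below 1 beat the baseline.  `forecasts`
    maps a method name to its forecast path."""
    y = np.asarray(y, dtype=float)
    base = float(np.mean((y - np.asarray(baseline, dtype=float)) ** 2))
    return {name: float(np.mean((y - np.asarray(f, dtype=float)) ** 2)) / base
            for name, f in forecasts.items()}
```

Averaging the resulting values over the 1- to 4-quarter horizons then gives a single number per method and target variable.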
10. Conclusions
Inspired by the seemingly mysterious FCP, we provide our explanations of why the puzzle often occurs and investigate when a sophisticated combining method can work well compared to the simple average (SA). Our study illustrates that the following reasons can contribute to the puzzle.

First, estimation error is known to be an important source of the FCP. Both theoretical and empirical evidence show that a relatively small sample size may prevent some combining methods from reliably estimating the optimal weight. Second, the FCP can appear if we apply a combining method without consideration of the underlying data scenario. The relative performance of SA may depend heavily on which scenario is more proper for the data. Third, the weighting formula of a combining method is not always appropriate for the data, because structural breaks and shocks can happen unexpectedly. A weighting formula obtained by a sophisticated method may not adjust fast enough to the new reality, resulting in performance less stable than SA. Fourth, candidate forecasts are often screened in some way so that the remaining forecasts used for combining tend to have similar performance, and SA may tend to work well in such cases. However, SA can be sensitive to the screening process, and enlarging the pool of candidates may benefit other combination methods; therefore, empirical observations that SA works well after model screening should be taken with a grain of salt. Fifth, there may be publication bias, in that people tend to report the existence of the FCP when SA gives good empirical results but may not emphasize the performance of SA when it gives mediocre results.

Figure 6: Comparing normalized MSFEs of different forecast combination methods with REG-Imputed SPF datasets. Left panel: PGDP variable. Right panel: RGDP variable. For each method, the bars from left to right represent the 1-, 2-, 3- and 4-quarter ahead forecasting results, respectively. The dashed line represents the SA baseline.

Regarding the first two reasons above, our study shows that it is not hard to find data and build candidate forecasts in a way that favors either a sophisticated or a simple method. Under the CFA scenario, we realize that the heavy estimation price can be avoided by applying combining methods designed to target the performance of the best candidate forecast.
Under the CFI scenario, although past literature has properly pointed out the potentially high cost of estimation error when targeting the optimal weight, it turns out that we do not have to pay that high cost. Indeed, a carefully designed mAFTER strategy can aggressively target the optimal weight when the information is sufficient to support exploiting optimal weighting, and perform conservatively like SA when the degree of estimation error is high. mAFTER can also intelligently adapt to the underlying scenario (CFA or CFI), avoiding the puzzle caused by improperly choosing the combining method.

SA certainly can be the best or among the top combining methods, as observed empirically and reported in the literature. It may be particularly useful when one can legitimately narrow the focus to just a few well-behaving candidate forecasts. However, since the uncertainty of the process used to reach the small set of candidates is not reflected in the showcase examples in the literature, the "conditional" results in favor of SA may not be replicable when one starts from scratch with inhomogeneous raw models/forecasts. For such problems, the performance of SA may span the whole spectrum, from terrible to top of the chart. Also, when information is rich for a stable forecasting problem, SA may lose greatly to a model-based method (e.g., regression). In contrast, when the analyst has little confidence in basic modeling assumptions about the data or in the quality of the available forecasts, perhaps SA (or the like) would be the choice to take.

The puzzle repeatedly reported in the literature tends to give the sentiment that sophisticated methods are not trustworthy and simple methods should be used. Based on our understanding and the numerical results, it seems fair to say that if the sophisticated methods in those studies do not perform well, it is actually because they are not sophisticated enough, not the other way around! In particular, when SA is considered by mAFTER as a candidate, the possible advantage of SA is retained while the non-robustness of SA is avoided. To a large extent, the forecast combination puzzle no longer exists if we are able to move forward intelligently by integrating the strengths of different combining methods.
APPENDIX

A. Assumptions of Proposition 1
The following two assumptions are sufficient regularity conditions for Proposition 1. Note that Assumption A.1 is satisfied if we truncate the candidate forecasts to have certain lower and upper bounds. Assumption A.2 is satisfied if the conditional distributions of the random noise are sub-Gaussian.
Assumption A.1.
There exists a positive constant $M$ such that the candidate forecasts satisfy, with probability 1,
$$\sup_{1 \le i \le K,\ 1 \le t \le T} |m_t - \hat{y}_{t,i}| \le M.$$

Assumption A.2.
There exist a constant $r > 0$ and continuous functions …

Proposition 2. Under the settings of Case 1, the average forecast risk of Forecaster 1 relative to the SA satisfies
$$\frac{R_{T,1}}{R_{T,SA}} \to \frac{\sigma^2}{\sigma^2 + \beta^2\sigma_X^2/4} \quad \text{as } T \to \infty.$$
In addition, if we consider weight vectors in $\mathbb{R}^2$, the asymptotic optimal combination weight $w^*$ satisfies
$$w^* := \arg\min_{w \in \mathbb{R}^2} \Big( \lim_{T\to\infty} R_{T,w} \Big) = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$

Proposition 3. Under the settings of Case 2, if we assume that $\beta_1 = \beta_2 = \beta$ and $\sigma_{X_1} = \sigma_{X_2} = \sigma_X$, the average forecast risk of Forecast $i$ ($i = 1, 2$) relative to the SA satisfies
$$\frac{R_{T,i}}{R_{T,SA}} \to \frac{\sigma_X^2\beta^2(1-\rho^2) + \sigma^2}{\sigma_X^2\beta^2(1-\rho)(1-\rho^2)/2 + \sigma^2} \quad \text{as } T \to \infty. \tag{A.1}$$
In addition, if we further assume $\rho = 0$, the asymptotic optimal combination weight $\tilde{w}^*$ under the restriction $\Theta = \{w : w_1 + w_2 = 1\}$ satisfies
$$\tilde{w}^* := \arg\min_{w \in \Theta} \Big( \lim_{T\to\infty} R_{T,w} \Big) = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}, \tag{A.2}$$
and the asymptotic optimal combination weight $w^*$ without the restriction satisfies
$$w^* := \arg\min_{w \in \mathbb{R}^2} \Big( \lim_{T\to\infty} R_{T,w} \Big) = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \tag{A.3}$$

The proof of Proposition 2 is similar to that of Proposition 3. In the following, we provide a sketch of the proof of Proposition 3.

Proof of Proposition 3. Let $r_{T,1} = E(y_T - \hat{y}_{T,1})^2$, $r_{T,2} = E(y_T - \hat{y}_{T,2})^2$ and $r_{T,w} = E(y_T - \hat{y}_{T,w})^2$ be the point-wise forecast risks at time $T$ for Forecaster 1, Forecaster 2 and the combined forecast, respectively.
We will first verify that, under the restriction $\Theta = \{w : w_1 + w_2 = 1\}$,
$$r_{T+1,1} = \sigma^2\Big(\frac{T-1}{T-2}\Big) + \sigma_{X_2}^2\beta_2^2 + \sigma_{X_1}^2\beta_2^2\, E\Big(\frac{\hat{\rho}^2\hat{\sigma}_{X_2}^2}{\hat{\sigma}_{X_1}^2}\Big) - 2\rho\sigma_{X_1}\sigma_{X_2}\beta_2^2\, E\Big(\frac{\hat{\rho}\hat{\sigma}_{X_2}}{\hat{\sigma}_{X_1}}\Big),$$
$$r_{T+1,2} = \sigma^2\Big(\frac{T-1}{T-2}\Big) + \sigma_{X_1}^2\beta_1^2 + \sigma_{X_2}^2\beta_1^2\, E\Big(\frac{\hat{\rho}^2\hat{\sigma}_{X_1}^2}{\hat{\sigma}_{X_2}^2}\Big) - 2\rho\sigma_{X_1}\sigma_{X_2}\beta_1^2\, E\Big(\frac{\hat{\rho}\hat{\sigma}_{X_1}}{\hat{\sigma}_{X_2}}\Big),$$
and
$$r_{T+1,w} = \sigma^2(1 - w_1^2 - w_2^2) + w_1^2 r_{T+1,1} + w_2^2 r_{T+1,2} + 2w_1 w_2 \Big( \rho\sigma_{X_1}\sigma_{X_2}\beta_1\beta_2\big(1 + E(\hat{\rho}^2)\big) - \sigma_{X_1}^2\beta_1\beta_2\, E\Big(\frac{\hat{\rho}\hat{\sigma}_{X_2}}{\hat{\sigma}_{X_1}}\Big) - \sigma_{X_2}^2\beta_1\beta_2\, E\Big(\frac{\hat{\rho}\hat{\sigma}_{X_1}}{\hat{\sigma}_{X_2}}\Big) + \frac{\rho\sigma_{X_1}\sigma_{X_2}\sigma^2}{T}\, E\Big(\frac{\hat{\rho}}{\hat{\sigma}_{X_1}\hat{\sigma}_{X_2}}\Big) \Big),$$
where $\hat{\sigma}_{X_i} = \sqrt{\sum_{t=1}^{T} x_{t,i}^2/T}$ is the estimated covariate standard deviation ($i = 1, 2$) and $\hat{\rho} = \frac{\sum_{t=1}^{T} x_{t,1}x_{t,2}}{T\hat{\sigma}_{X_1}\hat{\sigma}_{X_2}}$ is the estimated covariate correlation.

First, we have
$$\begin{aligned} r_{T+1,1} &= E(y_{T+1} - x_{T+1,1}\hat{\beta}_{T+1,1})^2 \\ &= E\Big( \varepsilon_{T+1} + x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\frac{\sum_{t=1}^{T} x_{t,1}y_t}{\sum_{t=1}^{T} x_{t,1}^2} \Big)^2 \\ &= \sigma^2 + E\Big( x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\frac{\sum_{t=1}^{T} x_{t,1}(x_{t,1}\beta_1 + x_{t,2}\beta_2 + \varepsilon_t)}{\sum_{t=1}^{T} x_{t,1}^2} \Big)^2 \\ &= \sigma^2 + E(x_{T+1,2}\beta_2)^2 + E\bigg( (x_{T+1,1}\beta_2)^2 \Big(\frac{\sum_{t=1}^{T} x_{t,1}x_{t,2}}{\sum_{t=1}^{T} x_{t,1}^2}\Big)^2 \bigg) + E\bigg( \frac{x_{T+1,1}^2 \big(\sum_{t=1}^{T} x_{t,1}\varepsilon_t\big)^2}{\big(\sum_{t=1}^{T} x_{t,1}^2\big)^2} \bigg) - 2E\bigg( x_{T+1,1}x_{T+1,2}\beta_2^2 \frac{\sum_{t=1}^{T} x_{t,1}x_{t,2}}{\sum_{t=1}^{T} x_{t,1}^2} \bigg) \\ &= \sigma^2 + \sigma_{X_2}^2\beta_2^2 + \sigma_{X_1}^2\beta_2^2\, E\Big(\frac{\hat{\rho}^2\hat{\sigma}_{X_2}^2}{\hat{\sigma}_{X_1}^2}\Big) + \frac{\sigma^2}{T-2} - 2\rho\sigma_{X_1}\sigma_{X_2}\beta_2^2\, E\Big(\frac{\hat{\rho}\hat{\sigma}_{X_2}}{\hat{\sigma}_{X_1}}\Big). \end{aligned}$$
The expression for $r_{T+1,2}$ can be derived similarly.
For $r_{T+1,w}$ with $w_1 + w_2 = 1$, we have
$$\begin{aligned} r_{T+1,w} &= E(y_{T+1} - w_1\hat{y}_{T+1,1} - w_2\hat{y}_{T+1,2})^2 \\ &= \sigma^2 + E\Big( w_1(x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\hat{\beta}_{T+1,1}) + w_2(x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,2}\hat{\beta}_{T+1,2}) \Big)^2 \\ &= \sigma^2(1 - w_1^2 - w_2^2) + w_1^2 r_{T+1,1} + w_2^2 r_{T+1,2} \\ &\quad + 2w_1w_2\, E\Big( (x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\hat{\beta}_{T+1,1})(x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,2}\hat{\beta}_{T+1,2}) \Big) \\ &=: \sigma^2(1 - w_1^2 - w_2^2) + w_1^2 r_{T+1,1} + w_2^2 r_{T+1,2} + 2w_1w_2 A. \end{aligned}$$
With tedious algebra, it is not hard to show that
$$A = \rho\sigma_{X_1}\sigma_{X_2}\beta_1\beta_2\big(1 + E(\hat{\rho}^2)\big) - \sigma_{X_1}^2\beta_1\beta_2\, E\Big(\frac{\hat{\rho}\hat{\sigma}_{X_2}}{\hat{\sigma}_{X_1}}\Big) - \sigma_{X_2}^2\beta_1\beta_2\, E\Big(\frac{\hat{\rho}\hat{\sigma}_{X_1}}{\hat{\sigma}_{X_2}}\Big) + \frac{\rho\sigma_{X_1}\sigma_{X_2}\sigma^2}{T}\, E\Big(\frac{\hat{\rho}}{\hat{\sigma}_{X_1}\hat{\sigma}_{X_2}}\Big).$$
Together with the previous display, we verify the formula for $r_{T+1,w}$. The formulas (A.1) and (A.2) can then be verified straightforwardly by noting that the $x_t$'s are normally distributed and that $r_{T,i}/R_{T,i} \to 1$ as $T \to \infty$ ($i = 1, 2$). For unrestricted $w$, $r_{T+1,w}$ can be derived similarly as above. Then, we can show that when $w = (1, 1)^T$, $\lim_{T\to\infty} R_{T,w} = \sigma^2$, which implies (A.3).

References

Altissimo, F. and Corradi, V. (2003), 'Strong rules for detecting the number of breaks in a time series', Journal of Econometrics (2), 207–244.

Bates, J. M. and Granger, C. W. J. (1969), 'The combination of forecasts', Operational Research Quarterly, 451–468.

Bordignon, S., Bunn, D. W., Lisi, F. and Nan, F. (2013), 'Combining day-ahead forecasts for British electricity prices', Energy Economics, 88–103.

Buckland, S. T., Burnham, K. P. and Augustin, N. H. (1997), 'Model selection: an integral part of inference', Biometrics, 603–618.

Claeskens, G., Magnus, J. R., Vasnev, A. L. and Wang, W. (2014), 'The forecast combination puzzle: A simple theoretical explanation', Tinbergen Institute Discussion Paper 14-127/III.

Clemen, R. T. (1989), 'Combining forecasts: A review and annotated bibliography', International Journal of Forecasting (4), 559–583.

Clemen, R. T., Murphy, A. H. and Winkler, R. L. (1995), 'Screening probability forecasts: contrasts between choosing and combining', International Journal of Forecasting (1), 133–145.

Clemen, R. T. and Winkler, R. L. (1986), 'Combining economic forecasts', Journal of Business & Economic Statistics (1), 39–46.

Davis, R. A., Lee, T. C. M. and Rodriguez-Yam, G. A. (2006), 'Structural break estimation for nonstationary time series models', Journal of the American Statistical Association (473), 223–239.

Diebold, F. X. and Pauly, P. (1990), 'The use of prior information in forecast combination', International Journal of Forecasting (4), 503–508.

Elliott, G. (2011), Averaging and the optimal combination of forecasts, Technical report, UCSD Working Paper.

Garratt, A., Lee, K., Pesaran, M. H. and Shin, Y. (2003), 'Forecast uncertainties in macroeconomic modeling', Journal of the American Statistical Association (464).

Genre, V., Kenny, G., Meyler, A. and Timmermann, A. (2013), 'Combining expert forecasts: Can anything beat the simple average?', International Journal of Forecasting (1), 108–121.

Granger, C. W. J. and Ramanathan, R. (1984), 'Improved methods of combining forecasts', Journal of Forecasting (2), 197–204.

Hendry, D. F. and Clements, M. P. (2004), 'Pooling of forecasts', The Econometrics Journal (1), 1–31.

Hsiao, C. and Wan, S. K. (2014), 'Is there an optimal forecast combination?', Journal of Econometrics, 294–309.

Huang, H. and Lee, T.-H. (2010), 'To combine forecasts or to combine information?', Econometric Reviews (5-6), 534–570.

Lahiri, K., Peng, H. and Zhao, Y. (2013), 'Machine learning and forecast combination in incomplete panels', Available at SSRN: http://ssrn.com/abstract=2359523.

Smith, J. and Wallis, K. F. (2009), 'A simple explanation of the forecast combination puzzle', Oxford Bulletin of Economics and Statistics (3), 331–355.

Stock, J. H. and Watson, M. W. (1998), A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series, Technical report, National Bureau of Economic Research.

Stock, J. H. and Watson, M. W. (2004), 'Combination forecasts of output growth in a seven-country data set', Journal of Forecasting (6), 405–430.

Timmermann, A. (2006), 'Forecast combinations', Handbook of Economic Forecasting, 135–196.

Wang, Z., Paterlini, S., Gao, F. and Yang, Y. (2014), 'Adaptive minimax regression estimation over sparse lq-hulls', Journal of Machine Learning Research (1), 1675–1711.

Winkler, R. L. and Makridakis, S. (1983), 'The combination of forecasts', Journal of the Royal Statistical Society, Series A, 150–157.

Yang, Y. (1999), 'Model selection for nonparametric regression', Statistica Sinica (2), 475–499.

Yang, Y. (2001), 'Adaptive regression by mixing', Journal of the American Statistical Association (454), 574–588.

Yang, Y. (2004), 'Combining forecasting procedures: some theoretical results', Econometric Theory (01), 176–222.

Yuan, Z. and Yang, Y. (2005), 'Combining linear regression models', Journal of the American Statistical Association 100.