Fast and frugal time series forecasting
Fotios Petropoulos (a), Yael Grushka-Cockayne (b)
(a) School of Management, University of Bath, UK
(b) Darden School of Business, University of Virginia, USA
Abstract
Over the years, families of forecasting models, such as the exponential smoothing family and Autoregressive Integrated Moving Average (ARIMA), have expanded to contain multiple possible forms and forecasting profiles. In this paper, we question the need to consider such large families of models. We argue that parsimoniously identifying suitable subsets of models will not decrease the forecasting accuracy, nor will it reduce the ability to estimate the forecast uncertainty. We propose a framework that balances forecasting performance against computational cost, resulting in a set of reduced families of models, and empirically demonstrate this trade-off. We translate computational benefits into monetary cost savings and discuss the implications of our results in the context of large retailers.

Keywords: exponential smoothing, ARIMA, big data, suboptimality, computational cost, retail, forecasting, forecast-value-added
1. Introduction
In modern business settings, the forecasting task usually involves the production of predictions for hundreds of thousands or even millions of time series, which might represent sales of products or services, competitors' data, factors that affect the business environment, and others. The size of the forecasting problem is substantial for large retailers. For instance, Walmart features approximately 200 thousand unique stock keeping units (SKUs) and more than 10 thousand stores, resulting in over 2 billion unique SKU-store combinations. At the most granular level, forecasts are required for every combination of SKU and store. The number of combinations can exceed 1 trillion unique combinations of SKU-ZIP codes for its online marketplace (Seaman, 2018). The number of series to forecast increases further if one considers that these might be arranged in hierarchical structures (e.g., store-level sales aggregating to regional sales), with decisions that need to be made at different levels of the hierarchy. In addition, forecasting practices stretch the demand for scalable solutions that need to balance forecast accuracy with model explainability. Yelland et al. (2019) describe the steps taken by Target for handling the volume of forecasting activities. The need for forecasting at scale is not limited to large retailers. For example, Google needs to produce forecasts for its products (such as Ads, Search, YouTube, Analytics, etc.) to adequately plan its resources and infrastructure.
With the majority of recent forecasting studies focusing on improving forecasting performance as measured by some error measure, the question is: at what cost? Makridakis et al. (2020b) analyze the forecasting performance of the participating methods in their 2018 M4 forecasting competition with regard to their computational cost.

[Email addresses: [email protected] (Fotios Petropoulos), [email protected] (Yael Grushka-Cockayne). Working Paper, March 1, 2021.]
Computational cost is defined as the clock time required to select a forecasting model and generate a set of forecasts and prediction intervals. Given a computing set-up, we can compare the computational cost across different approaches. Makridakis et al. (2020b) show a negative relationship between computational cost and forecast error, i.e., additional resources result in better accuracy. However, the fastest of the top-six participating methods required a few days to produce forecasts for all 100,000 series of the competition on a modern 8-core cloud computer, with the slowest requiring more than a month to complete the task. Hence, users of such forecasting systems may wish to trade off the improvements in forecasting performance against the additional computational cost required, in order to determine their added value (Gilliland, 2020).
Much has been discussed among academics and practitioners on the cost of the forecast error, i.e., the monetary benefits associated with a 1% decrease in the forecast error. In this study, we focus on the cost of the forecast itself and its trade-off with the achieved forecasting performance. The cost of the forecast is associated with several aspects of the forecasting process, including data collection, the cost of fitting models and generating the forecasts, and the resources associated with judgmental interventions on the generated statistical (formal) forecasts. We consider the computational cost of producing statistical forecasts. We address the following research question: can we maintain the quality of statistical forecasts while reducing the cost required to generate them?
Identifying the ‘optimal’ solution for a forecasting task might not be relevant, as the future will not look exactly like the past. While some forecasting models produce very good in-sample fits, their out-of-sample performance is not as good. This phenomenon is known as overfitting the models to the available data.
Goldstein and Gigerenzer (2009) demonstrate that the fast and frugal heuristics that humans often use give remarkably precise estimates. They show that fast and frugal heuristics achieve robustness through simplicity. The authors warn practitioners about the use of overly complex models: "the danger is that complex methods become an end in themselves, a ritual to impress others, and at the same time opportunities to learn how to do things better are missed. Learning requires some form of transparency, which forecasters can best achieve when they understand what they are doing" (Goldstein and Gigerenzer, 2009, p. 770).
Nikolopoulos and Petropoulos (2018) also argue for suboptimal forecasting solutions. They examine how a suboptimal search for the smoothing parameter in the simple exponential smoothing method leads to an average forecasting performance that is statistically indistinguishable from that of the more computationally expensive, optimal search. The authors conclude that accepting suboptimal solutions will not affect the quality of the forecasts but will save significant resources in producing these forecasts. Another attempt to obtain fast (but suboptimal) forecasts was made by Ashouri et al. (2019), who propose a tree-based clustering approach for producing time series forecasts in an aggregate fashion instead of modeling individual series.
In this paper, we shift the focus from the search for (sub)optimal parameters to the search for (sub)optimal models. We focus on two well-studied and most-commonly-used forecasting families: exponential smoothing (Hyndman et al., 2002; Gardner, 2006) and Autoregressive Integrated Moving Average (ARIMA) (Box et al., 2008).
Automatic algorithms based on information criteria have been proposed to select the best model within each of these two families (Hyndman and Khandakar, 2008). Fitting and parameterizing all applicable models, however, can be a computationally expensive task, while the additional performance benefits are not clear.
We empirically explore the performance of subsets of models in terms of the forecasting accuracy achieved and the computational resources required, for both the exponential smoothing and the ARIMA families. We present the value added at each step of the reduction process and explore the differences in the selected model forms. For both model families, we show that significant gains in cost can be achieved without harming – and in some cases even improving – the forecasting performance.
The contributions of this paper are as follows. First, we propose that forecasting approaches should be judged by trading off the achieved forecasting performance (accuracy and estimation of uncertainty) against their computational cost. Second, we empirically identify small and robust pools of models that offer better performance in a fraction of the time compared to the standard pools of two widely-used forecasting families. Third, we demonstrate how to monetize the computational benefits in the context of large retailers.
2. Time series forecasting with exponential smoothing and ARIMA models
In this section, we provide a brief overview of the two families of time series models studied in this paper, highlighting the richness of options offered by each and how model variants are selected. We then consider the need for more efficient reduced forms or subsets of such models.

2.1. Exponential smoothing models

Exponential smoothing models are univariate time series forecasting models that work on the assumption that more recently observed data are more relevant and, as such, should carry higher weight in the calculation of future point forecasts. The exponential smoothing family of models can handle level, trend, and seasonal patterns. Each of these components is exponentially smoothed using separate equations.
The simplest exponential smoothing method (Simple Exponential Smoothing, or SES) was proposed by Brown (1956) and was able to handle level-only data. This method was later extended by Holt (2004) – initially published in 1957 as a report to the Office of Naval Research – and Winters (1960) to include trend and seasonal cases. The assumption of a constant trend in the exponential smoothing method, however, results in forecasts that are significantly and positively biased at the longer horizons. Thus, Gardner and McKenzie (1985) proposed that trends should be dampened towards a flat line, emulating the maturity stage of product life-cycles. Damped-trend exponential smoothing is a very robust forecasting method that is usually hard to outperform. Forecasting researchers use this method even today as an objective forecasting benchmark (Gardner, 2006; Makridakis et al., 2020a,b). Damped models have also been extended to include seasonal cases (Gardner and McKenzie, 1989).
Pegels (1969) and Gardner (2006) proposed a taxonomy for exponential smoothing models containing 12 forecasting profiles that are combinations of possible trend and seasonal components.
They suggest that the observed trend might be non-existent (no trend; level data), linear (additive), exponential (multiplicative), or damped. At the same time, data might be non-seasonal or have seasonality that is additive or multiplicative.
For many years, the forecasting community believed that exponential smoothing methods are not able to capture the data generation process as well as more sophisticated approaches. However, in a series of studies, Makridakis showed that exponential smoothing methods perform on par with (if not better than) more complicated forecasting methods (Makridakis and Hibon, 1979; Makridakis et al., 1982; Makridakis and Hibon, 2000). Given their good performance, simplicity, and intuitive mechanics, it is not a surprise that the methods described above (SES, Holt's method, Holt-Winters' method, and damped-trend exponential smoothing) are among the most popular time series forecasting methods. They are implemented in the majority of commercial forecasting support systems (such as ForecastPro, SAP APO, and SAS) while being the methods most frequently used by companies (Fildes et al., 2009; Weller and Crone, 2012; Petropoulos et al., 2018).
Regardless of their popularity in industry, these methods remained relatively unpopular in academic circles until their revamp into a statistically sound state-space framework (Ord et al., 1997; Hyndman et al., 2002; Taylor, 2003a). The introduction of a state-space framework came with several advantages. It made possible the estimation of the uncertainty of forecasts (through prediction intervals) and of the likelihood of the models. The latter simplified the automatic model selection process across exponential smoothing methods using information criteria (see Section 3.3 for a formal description of information criteria). Finally, Hyndman et al. (2002) introduced models with both additive and multiplicative errors, increasing the forecast profiles of Pegels (1969) to 24.
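The classical methods described above rest on simple recursive updating. As a rough illustration (a Python sketch with hypothetical smoothing parameters alpha, beta, and phi – not the paper's code, which uses R's forecast package), SES and damped-trend point forecasts can be written as:

```python
# Illustrative sketch of SES and damped-trend point forecasts.
# Parameter values (alpha, beta, phi) are hypothetical, not from the paper.

def ses_forecast(y, alpha=0.3):
    """Simple exponential smoothing: the h-step-ahead forecast is flat."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level  # smooth the level
    return level

def damped_trend_forecast(y, h, alpha=0.3, beta=0.1, phi=0.9):
    """Damped additive trend: the trend contribution flattens over the horizon."""
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    # the forecast adds (phi + phi^2 + ... + phi^h) times the final trend
    return level + sum(phi ** i for i in range(1, h + 1)) * trend

series = [10.0, 12.0, 13.0, 15.0, 16.0]
print(ses_forecast(series))
print(damped_trend_forecast(series, h=6))
```

With phi = 1 the damped formula reduces to Holt's linear trend, which is why the damping parameter is what prevents the positively biased long-horizon forecasts noted above.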
Taylor (2003a) further extended these to 30, adding multiplicative damped trend models. Gardner (2006) showed the equivalences between the traditional formulations of exponential smoothing methods (recurrence and error-correction forms) and the state-space formulations.
Hyndman et al. (2008) codified the 30 exponential smoothing models by acronymizing the three components: Error, Trend, and Seasonality (ETS). As such, Holt's exponential smoothing model can be expressed as AAN, meaning a model with additive error, additive trend, and no seasonality. Similarly, MAdM is the model with multiplicative error, damped additive trend, and multiplicative seasonality. The acronyms for all 30 models are given in Table 1.
Out of the 30 models, the 11 models marked with an asterisk in Table 1 sometimes lead to estimation difficulties and infinite forecast variances for long forecast horizons (for more details, see: Hyndman et al., 2008, Chapter 15). For this reason, some modern forecasting software (such as the forecast package for the R statistical software) does not, by default, allow these models. The 19 remaining (unmarked) models in Table 1 will be referred to as "all applicable models" or simply as "all models". Four of these models (MMN, MMdN, MMM, and MMdM) lack analytical expressions, so simulation approaches are used to obtain prediction intervals. This significantly increases calculation times in practice.

Table 1: Exponential smoothing models. Models marked with an asterisk (*) are excluded; the remaining 19 are the applicable models.
         Additive Error                      Multiplicative Error
         Seasonality                         Seasonality
Trend    N       A       M          Trend    N       A       M
N        ANN     ANA     ANM*       N        MNN     MNA     MNM
A        AAN     AAA     AAM*       A        MAN     MAA     MAM
Ad       AAdN    AAdA    AAdM*      Ad       MAdN    MAdA    MAdM
M        AMN*    AMA*    AMM*       M        MMN     MMA*    MMM
Md       AMdN*   AMdA*   AMdM*      Md       MMdN    MMdA*   MMdM

Extensions to exponential smoothing models include models with regressor variables (also known as ETSx: Hyndman et al., 2008), models with multiple seasonal patterns (Taylor, 2003b, 2008), models with Box-Cox transformation, autoregressive and moving-average errors, trend, and seasonal components (De Livera et al., 2011), a model for forecasting product life cycles (Guo et al., 2018), and the complex exponential smoothing model (Svetunkov and Kourentzes, 2018).
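The 30-model taxonomy and the exclusions of Table 1 can be enumerated programmatically. The following sketch (illustrative Python, not the paper's code) encodes the restrictions described by Hyndman et al. (2008, Chapter 15) – additive errors combined with any multiplicative component, and multiplicative error and trend combined with additive seasonality:

```python
# Enumerate the 30 ETS forms and filter them to the 19 applicable ones (Table 1).
from itertools import product

ERRORS = ["A", "M"]                   # additive or multiplicative error
TRENDS = ["N", "A", "Ad", "M", "Md"]  # none, additive, damped, multiplicative, damped mult.
SEASONALS = ["N", "A", "M"]           # none, additive, multiplicative

def is_applicable(error, trend, seasonal):
    """Exclude combinations prone to estimation difficulties or infinite variances."""
    if error == "A" and (trend in ("M", "Md") or seasonal == "M"):
        return False  # additive error with any multiplicative component
    if error == "M" and trend in ("M", "Md") and seasonal == "A":
        return False  # multiplicative error and trend with additive seasonality
    return True

all_models = [e + t + s for e, t, s in product(ERRORS, TRENDS, SEASONALS)]
applicable = [m for m in all_models if is_applicable(m[0], m[1:-1], m[-1])]
print(len(all_models), len(applicable))  # 30 19
```

The filter removes exactly the 11 starred forms of Table 1, leaving the 19 "all models" pool used throughout the paper.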
2.2. ARIMA models

ARIMA models offer an alternative to exponential smoothing models for time series forecasting. In contrast to the exponential smoothing family, which consists of a finite set of 30 models, the ARIMA family consists of a theoretically infinite number of models (Box et al., 2008). The basic mechanism of ARIMA models resembles that of multiple regression models, where the predictors are either lagged values of the data (autoregressive, AR) or lagged values of the forecast errors (moving average, MA). The regression modeling is applied on the differenced, stationary data.
According to Gooijer and Hyndman (2006), the origins of the ARIMA family of models are traced back to the work of Yule (1927), who argued that time series should be regarded not as deterministic but as stochastic processes. Consequently, there were several early attempts to formulate AR and MA models separately. The seminal book by Box and Jenkins (1970) formalized the knowledge around ARIMA modeling. The authors also argued for model parsimony and that simple models should be adequate to capture the time series structure reasonably well. Following this, several researchers worked towards developing approaches for model selection (see, for example: Ozaki, 1977; Hamilton and Watts, 1978; Bhansali, 1999), parameter estimation (see, for example: Newbold, 1974; Ljung and Box, 1979; Newbold et al., 1994; Kim, 2003), and model diagnostics (see, for example: Box and Pierce, 1970; Davies et al., 1977; Hosking, 1980).
Notable extensions to the ARIMA family of models include the multivariate version of ARIMA, or vector ARIMA (VARIMA: Riise and Tjøstheim, 1984), fractional ARIMA models (Granger and Joyeux, 1980; Hosking, 1981), and seasonal decomposition based on ARIMA (X-12-ARIMA: Findley et al., 1998).
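As a small illustration of the differencing step that precedes the AR and MA regression terms (the data and function below are ours, purely illustrative): first differences remove a trend, while seasonal differences at lag s remove a stable seasonal pattern, leaving a stationary series.

```python
# Differencing before fitting AR/MA terms: d = 1 (first differences) and
# D = 1 at lag s = 4 (seasonal differences for quarterly data).

def difference(y, lag=1):
    """Return y_t - y_{t-lag}; the result is `lag` observations shorter."""
    return [y[i] - y[i - lag] for i in range(lag, len(y))]

# A quarterly series with a linear trend plus a repeating seasonal pattern.
quarterly = [10, 20, 30, 40, 14, 24, 34, 44, 18, 28, 38, 48]
once = difference(quarterly, lag=1)   # removes the trend
twice = difference(once, lag=4)       # removes the seasonal pattern
print(twice)  # [0, 0, 0, 0, 0, 0, 0] -- constant, i.e., stationary
```

For this constructed series, one round each of non-seasonal and seasonal differencing (d = D = 1) yields a constant series, which is the stationarity that the AR and MA terms then model.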
Multiple researchers have applied various implementations of ARIMA models in many different contexts; see, for example, Table 1 of Gooijer and Hyndman (2006) for examples of real applications.
A non-seasonal univariate ARIMA model is generally denoted as ARIMA(p, d, q), where p is the order of the autoregressive term, q is the order of the moving average term, and d is the degree of first differencing required to achieve stationarity. Seasonal ARIMA models are obtained by adding seasonal autoregressive and moving average terms, while accounting for seasonal differences if appropriate. A seasonal ARIMA model is denoted as ARIMA(p, d, q)(P, D, Q)_s, with P and Q being the seasonal autoregressive and moving average terms, D the degree of seasonal differences, and s the periodicity of the data. All of p, d, q, P, D, and Q are non-negative integers. For example, ARIMA(0, 1, 2)(1, 1, 0)_4 refers to an ARIMA model for quarterly data (s = 4) with first-order non-seasonal and seasonal differences (d = D = 1), a second-order non-seasonal MA term (q = 2), and a first-order seasonal AR term (P = 1).
ARIMA models are very flexible, handling a remarkably wide range of data generation processes. In fact, in the 1970s ARIMA models were considered the holy grail in forecasting, especially in econometrics circles, as no other family of models was able to produce superior goodness-of-fit. However, this came with two drawbacks. First, the estimation of an appropriate ARIMA model given an observed series was a combination of art and science. As a result, ARIMA models were not particularly popular in industry, often being considered black boxes. Second, despite the good in-sample fits, the out-of-sample forecasting performance of ARIMA models was not as good (Makridakis and Hibon, 1979; Makridakis et al., 1982); effectively, ARIMA models (and especially large ARIMA models) suffered from overfitting.
The same insights were presented in a much later study of the results of the M3 forecasting competition (Makridakis and Hibon, 2000).
Hyndman and Khandakar (2008) offered an automatic algorithm to specify and parameterize an ARIMA model. The algorithm starts by applying repeated unit root tests to determine the degrees of first-order and seasonal-order differencing required to render the series stationary. Then a stepwise selection process is applied as follows. An initial set of candidate models is estimated and the best one is selected based on an information criterion. A new set of candidate models is then considered by varying the values of p, q, P, and Q of the currently selected model by ±1; if a better model is found, it becomes the current model and the process is repeated, stopping when no variation improves on the current model. A maximum order constrains the total order (p + q + P + Q) of the search. Increasing the maximum order increases the search space for an optimal model form but also significantly increases the computational time required to complete the search. In this paper, we investigate the balance between the simplicity of the ARIMA models (effectively the maximum order of the ARIMA models, which directly reflects the computational complexity) and the forecasting performance.
The above discussion focused on the developments of two families of models: exponential smoothing and ARIMA. Especially with the former, we notice an increase in the variety of models over the years. At the same time, the theory of ARIMA models, developed by Box and Jenkins in the 1970s, suggests an infinite number of possible models. Such large families of models are linked with the assumption that capturing every possible forecast profile will increase the forecasting performance. The increase in computing power and speed also contributed towards richer and larger families of models. This, however, raises some fundamental questions. Do we really need all these different candidate models? Would we lose forecasting performance if we only considered small subsets of these families?
Would model selection be easier if we had to select among fewer models?
Before the expansion of the exponential smoothing family of models, Makridakis and Winkler (1983) and Makridakis and Hibon (2000) showed that simple exponential smoothing forms are able to perform as well as the more complex ARIMA family of models, if not better. We extend the studies of Makridakis and Winkler (1983) and Makridakis and Hibon (2000) to search for parsimony within each of the two widely-used families of models. In line with the simplicity argument (Green and Armstrong, 2015), we suggest that parsimony in forecasting should be preferred to complexity, especially if forecasting performance does not deteriorate. Ultimately, we suggest that the cost of the forecast error (or forecast utility) should be balanced against the cost of producing the forecast itself.
(Footnote: the unit root test applied by the automatic ARIMA algorithm of Hyndman and Khandakar (2008), discussed above, is the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.)
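The neighbourhood step of the stepwise ARIMA search discussed above – varying p, q, P, and Q of the current model by ±1 – can be sketched as follows (our own illustrative code, not the implementation of Hyndman and Khandakar, 2008):

```python
# Generate the candidate models reachable from the current ARIMA orders
# (p, q, P, Q) by changing one order by +/-1, subject to a maximum total order.

def neighbours(order, max_order=5):
    candidates = set()
    for i in range(4):          # one position at a time: p, q, P, or Q
        for delta in (-1, 1):
            new = list(order)
            new[i] += delta
            # keep only non-negative orders within the total-order budget
            if min(new) >= 0 and sum(new) <= max_order:
                candidates.add(tuple(new))
    return sorted(candidates)

# From orders (p, q, P, Q) = (1, 1, 0, 0): six feasible one-step perturbations.
print(neighbours((1, 1, 0, 0), max_order=5))
```

The `max_order` budget is what the paper varies: a larger budget enlarges each neighbourhood (and the exhaustive pools of Section 3.2), directly driving up computational cost.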
3. Experimental design
In our experimental design, we consider 98,830 time series. Our goal is to compare the forecasting performance of a model selected from a given pool against the effort of searching for that model within the pool. We construct 5 pools of exponential smoothing models and 8 pools of ARIMA models. We describe the exponential smoothing pools in Subsection 3.1 and the ARIMA pools in Subsection 3.2. We describe the procedure for selecting a model from a considered pool in Subsection 3.3. The metrics used for measuring the performance of the selected model and the cost of searching within a pool are described in Subsection 3.4. Finally, the data used in this study are described in Subsection 3.5.

3.1. Reduced pools of exponential smoothing models

We explore the performance of different subsets of models within the exponential smoothing framework. In total, we examine the following five pools of exponential smoothing models:
• All applicable exponential smoothing models within the ets function of the forecast package for the R statistical software. This includes the 19 applicable models (8 when data are non-seasonal) of Table 1. We will refer to this pool as "all models".
• All applicable exponential smoothing models within the ets function, excluding the models with multiplicative trends. Even if such models are interesting from a theoretical point of view, they lead to unrealistically explosive forecasts. Additionally, as discussed in Section 2.1, these models use simulation to find prediction intervals and are computationally expensive. This leads to a collection of 15 models (6 when data are non-seasonal), with four models being excluded compared to the previous case (MMN, MMdN, MMM, MMdM). We will refer to this pool as "no multiplicative trend".
• All applicable exponential smoothing models within the ets function, excluding non-damped trended models when a trend exists.
Damped exponential smoothing has been found to significantly outperform Holt's linear trend exponential smoothing, which tends to produce positively biased forecasts. Moreover, damped exponential smoothing is used as one of the benchmarks in empirical forecasting evaluations, with Gardner (2006) noting that "it is still difficult to beat the application of a damped trend to every time series". This leads to a collection of 12 models (5 when data are non-seasonal), with seven models being excluded compared to the first case (AAN, AAA, MAN, MAA, MAM, MMN, MMM). We will refer to this pool as "damped trend". (Note that the ets() function of the forecast package for the R statistical software by default excludes, since its 6.0 version of May 2015, models with multiplicative trends; however, their application is still possible through the allow.multiplicative.trend argument.)
• All applicable exponential smoothing models within the ets function, excluding models where the error type does not match the seasonality type. Models with additive error and multiplicative seasonality could lead to numerical difficulties and are already excluded. We suggest also excluding models with multiplicative error and additive seasonality, as they tend to overestimate the variance and lead to very large prediction intervals. This leads to a collection of 16 models (8 when data are non-seasonal), with three models being excluded compared to the first case (MNA, MAA, MAdA). We will refer to this pool as "match error with seasonal type".
• All applicable exponential smoothing models within the ets function, excluding the models excluded in the three cases above. This leads to a collection of 8 models (4 when data are non-seasonal). We will refer to this pool as "reduced". These models are marked in Table 2; the excluded applicable models are unmarked.

Table 2: A reduced exponential smoothing framework. The eight models of the reduced pool are marked with a dagger (†).
         Additive Error                      Multiplicative Error
         Seasonality                         Seasonality
Trend    N       A       M          Trend    N       A       M
N        ANN†    ANA†    ANM        N        MNN†    MNA     MNM†
A        AAN     AAA     AAM        A        MAN     MAA     MAM
Ad       AAdN†   AAdA†   AAdM       Ad       MAdN†   MAdA    MAdM†
M        AMN     AMA     AMM        M        MMN     MMA     MMM
Md       AMdN    AMdA    AMdM       Md       MMdN    MMdA    MMdM

It is worth noting that, of the pools of models described above, only the last one (reduced exponential smoothing) offers a balance with regard to the number of models considered in each of the following four broad categories (profiles): level only, trend only, seasonal only, and trend and seasonal; see Table 3.

Table 3: Exponential smoothing models in the four broad forecast profiles considered by each pool.
Pool                             Level only   Trend only   Seasonal only   Trend and seasonal
All models                       2            6            3               8
No multiplicative trend          2            4            3               6
Damped trend                     2            3            3               4
Match error with seasonal type   2            6            2               6
Reduced                          2            2            2               2

3.2. Reduced pools of ARIMA models

Similarly to the reduced pools of exponential smoothing models, we also explore the forecasting performance of subsets of ARIMA models as a function of their complexity. We focus on constraining the maximum order of the candidate ARIMA models so that p + q ≤ K for non-seasonal data and p + q + P + Q ≤ K for seasonal data, with K ∈ {1, . . . , 8}. For each value of K, we perform an exhaustive search across all possible models of order K or below and select the best one based on information criteria (see Subsection 3.3). Table 4 provides an overview of the ARIMA pools of models considered. For each possible combination of p, q, P, and Q, we also fit a model with a constant if d + D ≤
1, similarly to Hyndman and Khandakar (2008). In contrast to the exponential smoothing family, there is always a balance with regard to the autoregressive (AR) and the moving-average (MA) terms within each ARIMA pool. Also in contrast to the exponential smoothing models, ARIMA pools tend to increase significantly in size as K increases, rendering the balance between performance and computational cost even more important.

Table 4: ARIMA models in each pool.
Maximum order   No. of models   Examples of models in the pool (in addition to previous pools)
K = 1           5               (0,d,0); (1,d,0); (0,d,1); (0,d,0)(1,D,0); (0,d,0)(0,D,1)
K = 2           15              (2,d,0); (0,d,2); (1,d,1); (1,d,0)(0,D,1); (0,d,0)(2,D,0)
K = 3           35              (3,d,0); (0,d,3); (2,d,1); (1,d,0)(1,D,1); (0,d,2)(1,D,0)
K = 4           70              (4,d,0); (2,d,2); (1,d,3); (2,d,1)(0,D,1)
K = 5           126             (0,d,5); (4,d,1); (3,d,0)(1,D,1); (1,d,1)(1,D,2)
K = 6           210             (2,d,4); (1,d,5); (0,d,6); (1,d,2)(2,D,1)
K = 7           330             (5,d,2); (0,d,7); (1,d,2)(3,D,1); (3,d,3)(0,D,1)
K = 8           495             (8,d,0); (2,d,6); (5,d,3); (2,d,2)(2,D,2)

3.3. Model selection

Regardless of the preferred family of models, exponential smoothing or ARIMA, and the size of the respective pool of candidate models, we need to select the best candidate model. The problem of selecting the best model for each series has challenged researchers and practitioners alike. One of the first attempts to choose between exponential smoothing models was made by Gardner and McKenzie (1988), who suggested examining the variances of the differences of the time series. A better approach for selecting among forecasting models is via (cross-)validation, which follows the process of fixed/rolling-origin evaluation (Tashman, 2000). This approach, however, has two significant drawbacks. First, it requires series to have enough observations to facilitate cutting off a hold-out sample. Second, it is computationally expensive, as each model will be fitted at least twice: once for validation (even more times for cross-validation) and once for forecasting. On the other hand, (cross-)validation does not depend on a particular family of models and can be applied to either the exponential smoothing or the ARIMA family.
Arguably, the most popular approach for model selection in forecasting research nowadays is the use of information criteria. Information criteria select the model with the maximum likelihood after a penalization.
Maximizing the likelihood is approximately equivalent to maximizing the in-sample fit. The penalization is based on the complexity/size of the model (i.e., the number of parameters to be estimated). Selecting between models with information criteria is computationally fast and generally robust (Hyndman et al., 2002, 2008; Hyndman and Khandakar, 2008). Popular forecasting packages offer model selection with information criteria by default. While several information criteria exist, the three most widely used are:
• Akaike's information criterion, defined as AIC = −2 log(L) + 2k, where L is the likelihood and k is the total number of parameters.
• The Bayesian information criterion, defined as BIC = AIC + k[log(n) − 2], where n is the sample size (number of in-sample observations).
• Akaike's information criterion corrected for small sample sizes, defined as AICc = AIC + 2k(k + 1)/(n − k − 1).
It should be noted that AIC and AICc have explicit forecasting relevance, being asymptotically optimal for the one-step-ahead out-of-sample mean squared error. This is similar to the concept of one-step-ahead cross-validation. BIC has no such forecasting connection. Furthermore, AICc is the default criterion in the ets function of the forecast package for the R statistical software and is approximately equivalent to AIC as the sample size (number of observations in the series) increases. Thus, it seems logical to use the AICc for model selection (Kolassa, 2011). In this study, for each of the various pools of exponential smoothing and ARIMA models presented in Sections 3.1 and 3.2, we select the 'best' model based on the values of the AICc. The use of AIC or BIC instead of AICc resulted in very similar insights, which confirms the findings of Billah et al. (2006).

3.4. Performance measures

We measure both the forecasting performance and the computational cost of each of the pools of models. With regard to the forecasting performance, both the point forecast accuracy and the prediction interval coverage are considered.
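As a brief illustration of the information criteria described above (a Python sketch; the function names are ours, not the paper's), all three can be computed directly from the maximized log-likelihood:

```python
# Compute AIC, AICc, and BIC from a maximized log-likelihood logL,
# k estimated parameters, and n in-sample observations.
import math

def aic(logL, k):
    return -2 * logL + 2 * k

def aicc(logL, k, n):
    return aic(logL, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(logL, k, n):
    return aic(logL, k) + k * (math.log(n) - 2)  # equals -2*logL + k*log(n)

# The finite-sample correction matters for short series and vanishes as n grows:
print(aicc(-50.0, 3, 20) - aic(-50.0, 3))   # 1.5
print(round(aicc(-50.0, 3, 2000) - aic(-50.0, 3), 4))
```

The example values of logL, k, and n are arbitrary; the point is that AICc penalizes extra parameters more heavily than AIC when n is small, and the two converge for long series, consistent with the discussion above.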
Point forecast accuracy is measured in terms of the Mean Absolute Scaled Error (MASE). MASE is equal to the mean absolute error scaled by the in-sample one-step-ahead mean absolute error of the seasonal naive method. It has better statistical properties than widely-used percentage error measures (Hyndman and Koehler, 2006; Franses, 2016), even if it is not as easily interpretable; we can still compare the MASE values of two approaches. The MASE is defined as

MASE = [ (1/h) Σ_{t=n+1}^{n+h} |y_t − f_t| ] / [ (1/(n−s)) Σ_{i=s+1}^{n} |y_i − y_{i−s}| ],

where y_t is the actual observation at time period t, f_t is the respective forecast, h is the forecasting horizon, and s is the number of periods within a seasonal cycle (e.g., s = 4 for quarterly data). The values of MASE are averaged across the different time series. While forecasting research and practice often use percentage error measures, we decided not to report these for two reasons. First, they suffer from asymmetry (Goodwin and Fildes, 1999). Second, the insights obtained by percentage error measures, such as the symmetric Mean Absolute Percentage Error (sMAPE), were widely in line with the MASE results.
The performance of the prediction intervals is measured in terms of the Mean Scaled Interval Score (MSIS, Makridakis et al., 2020b). The interval score (Gneiting and Raftery, 2007) considers both the width of the intervals and the cases where the actual values fall outside the interval range. It is scaled similarly to the MASE so that it can be averaged across series. The MSIS is defined as

MSIS = [ (1/h) Σ_{t=n+1}^{n+h} W(l_t, u_t, y_t) ] / [ (1/(n−s)) Σ_{i=s+1}^{n} |y_i − y_{i−s}| ],

where l_t and u_t are the lower and upper prediction intervals for period t and

W(l_t, u_t, y_t) = (u_t − l_t)                       if l_t ≤ y_t ≤ u_t,
W(l_t, u_t, y_t) = (u_t − l_t) + (2/α)(l_t − y_t)    if y_t < l_t,
W(l_t, u_t, y_t) = (u_t − l_t) + (2/α)(y_t − u_t)    if y_t > u_t,

where (1 − α) ×
While other platforms (C++ or Python) could provide faster implementations, we do not expect the relative differences in computational cost for the different pools of models (sections 3.1 and 3.2) to change significantly.

The yearly, quarterly, and monthly data from the M, M3 and M4 forecasting competitions (Makridakis et al., 1982; Makridakis and Hibon, 2000; Makridakis et al., 2020b) form the empirical data for this study. In total, we consider an extensive sample of 98,830 real-life series (23,826 yearly; 24,959 quarterly; 50,045 monthly). The series exhibit various lengths, but the forecasting horizon is fixed per frequency: 6, 8, and 18 periods ahead for the yearly, quarterly, and monthly frequencies, respectively. Following the design of the M competitions, we produce forecasts once for each model and series combination, using a fixed-origin evaluation and withholding the appropriate number of observations.
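The fixed-origin evaluation described above amounts to a single train/test split per series, with the horizon determined by the sampling frequency; a minimal sketch (function name is illustrative, horizons taken from the text):

```python
# Fixed forecast horizons per sampling frequency, as stated in the text.
HORIZONS = {"yearly": 6, "quarterly": 8, "monthly": 18}

def fixed_origin_split(series, frequency):
    """Withhold the last h observations of a series as the test set,
    where h is the fixed horizon for the given frequency. Forecasts are
    then produced once, from the single remaining origin."""
    h = HORIZONS[frequency]
    return series[:-h], series[-h:]
```

For a yearly series of length 30, the first 24 observations form the training set and the last 6 the hold-out sample.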
4. Empirical results
The performance and cost results for the exponential smoothing and ARIMA pools considered are presented in Tables 5 and 6, respectively. Each table consists of three panels containing the results for MASE, MSIS, and the computational cost. Within each panel, the different rows present the performance of the different pools of models described in Section 3. Note that only four pools are presented for the yearly data and the exponential smoothing models, as the "match error with seasonal type" pool differs from the "all models" pool only when the data are seasonal. Each column of Tables 5 and 6 presents the performance results for different sampling frequencies (yearly, quarterly, and monthly). For the MSIS, we consider a 95% confidence level (α = 0.05).

Table 5: Average forecasting performance and computational cost results of the five exponential smoothing pools.
Measure      Pool                             Yearly    Quarterly  Monthly
MASE         All models                       4.030     1.231      1.046
             No multiplicative trend          3.431
MSIS         All models                       223.314   12.703     2374.937
             No multiplicative trend          34.970
Cost         All models                       0.122     0.317      1.375
(seconds     No multiplicative trend          0.013     0.169      0.973
per series)  Damped trend                     0.140     0.270      0.923
             Match error with seasonal type   -         0.272      1.072
             Reduced
The empirical results of the exponential smoothing pools (Table 5) suggest that considering "all models" gives the worst forecasting performance and the highest computational times. Recall that lower scores are better. In some cases, the forecasts and prediction intervals produced are explosive, leading to very poor predictions and substantial forecasting errors. This results in very high values of the arithmetic means, especially for the MSIS (across all frequencies). The poor performance of the "all models" pool suggests that identifying the best model from a large pool of candidates is sometimes influenced by overfitting and selecting models which tend to produce non-realistic out-of-sample forecasts.

Table 6: Average forecasting performance and computational cost for different maximum orders of ARIMA models.

Measure    Maximum order  Yearly   Quarterly  Monthly
MASE       K = 1
           K = 2          3.428
           K = 3          3.436    1.173      0.933
           K = 4          3.436    1.183
           K = 5          3.439    1.190      0.932
           K = 6          3.438    1.195      0.935
           K = 7          3.439    1.200      0.938
           K = 8          3.438    1.206      0.941
MSIS       K = 1
           K = 2          45.410   11.033
           K = 3          45.981   11.215     8.745
           K = 4          46.113   11.483     8.762
           K = 5          46.240   11.640     8.800
           K = 6          46.220   11.747     8.891
           K = 7          46.267   11.891     8.970
           K = 8          46.258   12.039     9.111
Cost       K = 1
(seconds   K = 2          0.025    0.098      0.141
per        K = 3          0.048    0.348      0.744
series)    K = 4          0.084    1.098      3.511
           K = 5          0.132    3.132      12.524
           K = 6          0.193    8.112      36.716
           K = 7          0.270    19.207     92.278
           K = 8          0.360    41.989     205.140

The major problem of the "all models" pool is that multiplicative trend models are allowed. If these are excluded ("no multiplicative trend" pool), then the performance across all measures is significantly better. In fact, the second pool of models results in the second-best performance (on average) for the yearly and monthly frequencies, and the best for the quarterly frequency.
More importantly, excluding the four multiplicative trend models leads to very significant reductions in terms of computational time: 89%, 47%, and 29% compared to the time needed for the "all models" pool for the yearly, quarterly, and monthly frequency, respectively.

The next pool of models considered in this study allows only models with a damped trend, when a trend component is included. This pool results in superior performance compared to the "all models" pool and needs less time to produce forecasts for the seasonal frequencies. While some multiplicative trended models are still allowed, the outlying cases are moderated by forcing a damped trend.

Matching the error type with the type of the seasonality essentially does not allow multiplicative error with additive seasonality models to be considered (as the additive error and multiplicative seasonality models were already excluded). This reduction by three models does not have an effect in terms of performance, which is on par with that of the "all models" pool, but it does offer a small decrease in terms of computational time.

Finally, the "reduced" pool of exponential smoothing models is evaluated. This is the smallest of the exponential smoothing pools considered in this study, consisting of just eight models, and the only one to offer a balanced selection of exponential smoothing models with regards to their components (trend and/or seasonality). The reduced pool results in the best performance overall, being ranked first for all measures in the yearly and monthly frequencies and second-best in the quarterly frequency (with small differences from the best pool). Furthermore, the "reduced" pool of models offers computational times that are 31% lower for the yearly frequency and more than 50% lower for the quarterly and monthly frequencies compared to the second-fastest "no multiplicative trend" pool (which already offered significant computational gains).
On balance, the "reduced" pool provides the best combination of forecasting performance and computational cost.

We turn our attention to the ARIMA results, where the various pools of models are defined with regards to the maximum order that is allowed. As discussed in section 3.2, we consider pools of models with maximum order between 1 and 8. The results for the yearly and quarterly frequencies are clear-cut. The smallest pool of models (K = 1) results in the best, on average, performance and the lowest computational cost. The only exception is the accuracy (MASE) for the quarterly frequency, where the best performance is achieved for K = 2, but the difference from K = 1 is small. The very good performance of low-order ARIMA models is an impressive result. The best performance (both in terms of accuracy and uncertainty estimation) for non-seasonal data is achieved by selecting between just three model forms, once the degree of differencing is appropriately defined: ARIMA(0, d, 0), ARIMA(1, d, 0), and ARIMA(0, d, 1).

The results for the monthly frequency are different. For point forecast accuracy, the results follow a U-shaped function and the best accuracy is achieved with K = 4. Further increasing the maximum order does not offer better average accuracy. Decreasing the maximum order to K = 3 or K = 2 results in a very small accuracy deterioration. The uncertainty results (MSIS) show that the best performance is achieved for K = 2, with the uncertainty estimation progressively deteriorating as K increases. Similarly to the yearly and quarterly data, large models do not come with increased forecasting performance, possibly as a result of the inability of information criteria to properly penalize for over-fitting.

Overall, we can see that better performance in the ARIMA family of models is not necessarily correlated with an increase in the maximum order of the candidate models. In fact, allowing for large models leads to poor average performance. This is especially true for the yearly and quarterly frequencies.
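The number of candidate ARIMA models grows quickly with the maximum order, which helps explain the cost figures that follow. The exact definition of the maximum order is given in section 3.2 and is not reproduced here; the sketch below assumes K caps the total order p + q + P + Q, a reading consistent with the three non-seasonal forms available for K = 1:

```python
from itertools import product

def arima_pool(K, seasonal=True):
    """Enumerate candidate ARIMA(p, d, q)(P, D, Q) orders whose total
    order p + q + P + Q does not exceed the maximum order K. The degrees
    of differencing d and D are assumed to be fixed separately (e.g. via
    unit-root tests) and do not count towards the order."""
    seasonal_range = range(K + 1) if seasonal else [0]
    return [(p, q, P, Q)
            for p, q in product(range(K + 1), repeat=2)
            for P, Q in product(seasonal_range, repeat=2)
            if p + q + P + Q <= K]
```

Under this reading, K = 1 for non-seasonal data yields exactly the three forms (0, 0), (1, 0), and (0, 1) for (p, q), while seasonal pools grow combinatorially in K.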
Moreover, the computational cost is also considerably increased: a maximum order of K = 8 is 15, 429, and 1454 times slower compared to K = 2 for the yearly, quarterly, and monthly frequency, respectively. If we were to select a single K across all frequencies, then K = 2 would be considered to offer a good balance in terms of accuracy and cost.

Comparing the results of Tables 5 and 6, we observe that the two families of models end in comparable accuracy (MASE), with exponential smoothing being better in the lower frequencies and ARIMA being better in higher frequencies. There is, however, a more considerable difference with regards to the uncertainty estimation performance. The best performing exponential smoothing pools significantly outperform the ARIMA pools, achieving better MSIS scores. Finally, the reduced exponential smoothing pool is computationally faster than the ARIMA pools with K = 1, 2, and 3 for the yearly, quarterly, and monthly series, respectively.

We also identified the drivers of the increase in the computational cost within each of the two families of models. The need to perform simulations for deriving prediction intervals (due to the lack of analytical expressions in some models) is the most computationally demanding factor for the exponential smoothing models, followed by the inclusion of a seasonality and the inclusion of a trend component. The use of higher seasonal autoregressive or moving average orders (P and Q) is the main factor for increasing the computational cost of an ARIMA model, followed by the non-seasonal orders (p and q). The length of the series had a lesser effect in both model families.

To test the statistical significance of the differences between pools, we use the modified Diebold-Mariano (DM) test for comparing the performance of the various pools of models considered, in order to rule out that the lower scores were obtained due to chance (Harvey et al., 1997; Athanasopoulos et al., 2009).
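The modified DM statistic is straightforward to compute from two series of forecast errors; the sketch below is our own illustrative implementation of the Harvey et al. (1997) correction, not the paper's code:

```python
import math

def modified_dm_statistic(errors1, errors2, h=1, loss=abs):
    """Modified Diebold-Mariano statistic (Harvey et al., 1997) for
    comparing two sets of h-step-ahead forecast errors under a given
    loss (absolute loss by default). Negative values indicate that the
    first set of forecasts incurs the smaller average loss."""
    d = [loss(e1) - loss(e2) for e1, e2 in zip(errors1, errors2)]
    T = len(d)
    d_bar = sum(d) / T
    # Autocovariances of the loss differential up to lag h - 1, accounting
    # for the autocorrelation induced by multi-step-ahead forecasting.
    def gamma(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, T)) / T
    variance = (gamma(0) + 2.0 * sum(gamma(k) for k in range(1, h))) / T
    dm = d_bar / math.sqrt(variance)
    # Small-sample correction; the corrected statistic is compared against
    # a Student-t distribution with T - 1 degrees of freedom.
    correction = math.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
    return correction * dm
```

A significantly negative statistic at a given horizon would count towards the "significantly better" percentages reported below.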
The DM test allows for comparing forecasting models, aligned with comparisons conducted using classical hypothesis testing, accounting for the autocorrelation that exists when forecasting h steps ahead of the origin. From the DM test, one can calculate the percentage of times that a pool of models is significantly better or worse than the other pools considered, for each of the forecast horizons (for yearly time series, this would imply 1 to 6 step-ahead forecasts, for quarterly 1 to 8 step-ahead forecasts, and for monthly 1 to 18).

For the exponential smoothing models, when considering the accuracy performance measured by MASE, the reduced pool is never worse than other pools on the yearly and monthly time series and is significantly better between 50% and 100% of the time. In the case of the quarterly time series, the reduced pool is worse than the no multiplicative trend pool 87.5% of the time and significantly better than the other pools in 87.5% to 100% of the cases.

When considering the uncertainty and the performance as measured by MSIS, the significance of the reduced pool's performance can also be explored. On the yearly time series, it is better than other pools 16.7% to 100% of the time and worse only than the damped trend pool in 33% of the instances. On the quarterly data, the reduced pool is significantly better than other pools 12.5% to 37.5% of the time, but never significantly better than the no multiplicative trend pool. It is significantly worse than three of the other pools in 1 out of 8 horizons. On the monthly time series, the reduced pool is significantly better than other pools anywhere from 28% up to 66.7% of the time and never significantly worse.

For the ARIMA models, the modified DM test supports the results in Table 6, indicating that the lower maximum order models (K = 1 and K = 2) are never significantly worse than higher maximum order models (K = 3, ...,
8) in terms of both MASE and MSIS on the yearly datasets. These lower order models are also significantly better than the higher order models in 50% to 66.7% of the horizons on MASE and 100% of the horizons on MSIS. On the quarterly data, again, lower ordered models (K = 1, 2, and 3) perform significantly better than higher maximum orders. On MASE, K = 2 and K = 3 are never worse than higher order models, and K = 1, 2, or 3 are significantly better than higher maximum orders in most horizons. On MSIS, K = 1, 2, or 3 are never worse than higher order models and they are significantly better 87.5% to 100% of the time. Finally, on the monthly dataset, we see that the lowest maximum order models (K = 1) are statistically significantly worse than higher maximum order models, especially on MASE. Models with a maximum order of K = 4 and 2 are, however, never significantly worse than other order models while also being significantly better than models with higher maximum orders.

In this subsection, we explore the nature of the selected model across each pool of models. We do this by comparing the frequency of the various components that comprise the selected models. For the exponential smoothing family, we first calculate the relative percentage frequencies with which each of the four basic forecast profiles (level only, trend only, seasonal only, and trend and seasonal) is selected as optimal. We compare the relative frequencies with which different model categories are selected when using the "all models" versus the "reduced" pools. The results are presented in Table 7.
A more in-depth analysis reveals that for approximately 18.8% of the quarterly and 20.5% of the monthly series where a trended model was suggested by the "all models" pool, the "reduced" pool suggests a level model as optimal. Similarly, 15.7% and 7.3% of the quarterly and monthly series best fitted with trended and seasonal models under "all models" are now fitted by an only-seasonal model. Overall, we can see that the "reduced" pool results in not only better forecasting performance but also the selection of simpler models.

The respective results for the ARIMA pools of models are reported in Table 8. We present the percentage of times that the selected models in the pool with maximum order K differ from the previous pool (K −
1) in terms of the order of p, q, P, Q, or any of the above. We observe that the percentage of times that a model changes in the yearly frequency quickly drops to less than 5%. Going beyond K = 3 would only result in a different model in a very small percentage of cases, while significantly increasing the computational cost. Even if the percentage of times that a model changes increases for the quarterly frequency, we still see infrequent changes for larger values of K. More frequent differences are recorded for the monthly data. This suggests that in higher frequencies there is more scope to explore higher-degree models.

Table 7: The relative percentage frequencies that exponential smoothing models with different components are selected.

Frequency   Model components     All models  Reduced
Yearly      Level                33.7        48.2
            Trended              66.3        51.8
Quarterly   Level                25.2        33.4
            Trended              39.5        31.8
            Seasonal             11.7        15.4
            Trend & Seasonality  23.5        19.4
Monthly     Level                25.1        30.9
            Trended              25.8        20.3
            Seasonal             17.0        18.1
            Trend & Seasonality  32.2        30.6

Table 8: The relative percentage frequencies of the differences in the order of the selected ARIMA models compared to the previous (K −
1) pool of models.
Frequency   Orders   Maximum order (K): 2  3  4  5  6  7  8
Yearly      p
            q
Quarterly   p
            q
            P
            Q
Monthly     p
            q
            P
            Q

The MSIS results in Tables 5 and 6 assumed a 95% confidence level (α = 0.05). We now examine how the uncertainty estimation performance varies with the nominal confidence level.
Figure 1: The MSIS performance and calibration for different confidence levels and the exponential smoothing pools of models. The MSIS for the "all models" pool (red line) is not depicted for the quarterly and monthly frequencies due to the very large values. The vertical axes of the MSIS plots are log-scaled.

With regards to the exponential smoothing results (Figure 1), we observe that the MSIS values for the "reduced" pool of models are lower than or equal to those of any other pool of models considered (they are virtually identical with the "no multiplicative trend" pool for the quarterly and monthly frequencies). All pools of models have similar miscalibration issues, especially for the yearly frequency. In any case, the "reduced" pool of models always offers the best calibration compared to all other pools. Overall, we can see that the good performance of the "reduced" pool in estimating the uncertainty around the forecasts holds for various confidence levels.

The ARIMA results (Figure 2) indicate that a maximum order of one offers the best MSIS results for the yearly and the quarterly frequency, regardless of the value of the confidence level. Moreover, a maximum order equal to one results in the best calibration, even if all pools of models are significantly undershooting. We saw in Table 6 that lower maximum orders are outperformed in the monthly data in terms of uncertainty estimation based on MSIS. While this is true for all confidence levels, small pools of ARIMA models are better in terms of calibration compared to larger pools; this difference is amplified as we consider lower nominal confidence levels.

While the various pools of ARIMA models are defined naturally based on their maximum order, the same is not true for the five exponential smoothing pools considered in this study.
Figure 2: The MSIS performance and calibration for different confidence levels and the ARIMA pools of models. The vertical axes of the MSIS plots are log-scaled.

It can be argued that our definition of the reduced pool is arbitrary. In this subsection, we expand our analysis to a much larger number of exponential smoothing pools. The only constraint we enforce is that each pool has at least one model for each of the four basic forecast profiles (two in the case of yearly data), as presented in Table 3: level only, trend only, seasonal only, and trend and seasonal.

In total, we analyze the performance of 189 non-seasonal pools (each containing at least two models with a maximum of 8 models) and 337,365 seasonal pools (each containing at least 4 models with a maximum of 19 models). For each pool size (2 to 8 for yearly data, and 4 to 19 for quarterly and monthly data), we depict in Figure 3 the box-plot of the average MASE performances of the respective pools of models. We also present the performance of the five base pools defined in Section 3.1. For example, we can see that the reduced pool of models (orange dot) performs very competitively against other pools of equal size (at the lower end of the respective box-plots, reflecting a lower MASE), but also robustly across pools of all sizes. In fact, the reduced pool of models is ranked in the top 1%, 5%, and 1% for the yearly, quarterly, and monthly data, respectively, across all pools considered. Similar insights are obtained for the MSIS.
Figure 3: The performance of each possible pool of exponential smoothing models.
5. Discussion
The empirical analysis in Section 4 shows that considering pragmatic subsets of the available forecasting models significantly reduces the required computational time without necessarily having a negative impact on the quality of the forecasts with regards to their accuracy and uncertainty estimation. In particular, for the exponential smoothing family of models, some software packages already suggest the avoidance of multiplicative trend models. We also show that considering only damped trends (when a trended model is needed) and matching the type of the error component with that of the seasonality component (by excluding models with multiplicative errors and additive seasonality) bring additional benefits. In the ARIMA family of models, restricting the maximum order of the models is also beneficial, especially for non-seasonal data, in both performance and cost.

Gilliland (2015) suggests that one should measure the forecast-value-added (FVA) of each step in the forecasting process, when progressing from simple to more complex procedures. Simply put, FVA is the percentage improvement in forecasting performance achieved by each step. FVA can be seen as an underfitting versus overfitting exercise where one attempts to identify the optimal level of complexity in a forecasting process to achieve the desired levels of forecasting performance. If more complex processes add forecasting value, then simpler approaches might be underfitted and require improvement. On the other hand, if more complex approaches do not offer any performance gains, this might be an indication of overfitting and that the simpler approaches might be good enough.

We perform an FVA analysis for both the exponential smoothing and ARIMA families. With regards to the exponential smoothing family, we use three of the five collections of models considered in the previous section: reduced (8 models), no multiplicative trend (15 models), and all models (19 models).
In ARIMA, we consider the pools where the maximum order is lower than or equal to 5. We sort the pools of models within each family based on their size, as we attempt to map the gains of each step of increasing complexity. At the same time, we expand the FVA analysis to include what we refer to as the computational-cost-reduction (CCR) of each step. Our analysis is presented in Tables 9 and 10 for the monthly data frequency, when forecasting performance is measured by MASE. Positive values in FVA suggest that more sophisticated approaches offer performance gains over simpler ones. Positive values in CCR indicate a decrease in the computational cost.

Table 9: Forecast-value-added and computational-cost-reduction analysis for the monthly data frequency and the exponential smoothing family of models.

Pool of models              MASE    FVA     FVA     Cost per series  CCR     CCR
(sorted by complexity)              vs. 1   vs. 2   (seconds)        vs. 1   vs. 2
1: Reduced                  0.942                   0.450
2: No multiplicative trend  0.947

Table 10: Forecast-value-added and computational-cost-reduction analysis for the monthly data frequency and the ARIMA family of models.

Pool of models          MASE    FVA     FVA     FVA     FVA     Cost per series
(sorted by complexity)          vs. 1   vs. 2   vs. 3   vs. 4   (seconds)
1: K = 1                0.962                                   0.029
2: K = 2                0.938   2.5%                            0.141
3: K = 3                0.933   3.1%    0.5%                    0.744
4: K = 4                0.931   3.2%    0.7%    0.1%            3.511
5: K = 5                0.932   3.1%    0.6%    0.1%    0%      12.524

Table 9 shows that expanding the collection of exponential smoothing models considered not only decreases the FVA but also substantially increases the computational cost. Table 10 shows that while the CCR is always negative (suggesting increased cost), there is some forecast value added up until K = 4. While K = 4 performs slightly more accurately than K = 3, this improvement in forecasting performance comes with a significant added computation (more than four times more expensive). We suggest that the FVA must be balanced against the CCR in order to make an informed decision on the value of the maximum order.

Next, we attempt to quantify the monetary gains if a large retailer adopted the proposed reduced exponential smoothing pool, by making some realistic assumptions.
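Both FVA and CCR reduce to simple percentage changes; a minimal sketch (the choice of denominator is our reading of the measures, and small differences from the printed percentages can arise from rounding of the underlying MASE values):

```python
def fva(error_simple, error_complex):
    """Forecast-value-added: percentage improvement in the error measure
    when moving from the simpler to the more complex step. Positive
    values mean the added complexity pays off."""
    return 100.0 * (error_simple - error_complex) / error_simple

def ccr(cost_simple, cost_complex):
    """Computational-cost-reduction: percentage decrease in per-series
    computational cost of the more complex step. Negative values mean
    the more complex step is more expensive."""
    return 100.0 * (cost_simple - cost_complex) / cost_simple
```

For instance, moving from K = 1 (MASE 0.962) to K = 2 (MASE 0.938) on the monthly data gives an FVA of about 2.5%, while the cost rising from 0.029 to 0.141 seconds per series gives a strongly negative CCR.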
Seaman (2017) mentions that the online marketplace of Walmart (walmart.com) offers 50 million SKUs, which may be demanded in 40,000 different ZIP codes. Assuming that just 1% of these 50 million SKUs refer to fast-moving monthly goods, producing forecasts for each combination of items and ZIP codes (0.5M × 40K = 20B SKU-ZIP codes) would require 7.6 million CPU-hours when all exponential smoothing models are considered. (Another real-life example is the case of Target, which requires the generation of about 4B forecasts monthly; Yelland et al., 2019. Note that a CPU-hour is not a direct measure of time, as parallelization can be applied.) If, instead, the reduced pool is used in place of the default pool of the ets() function of the forecast package for the R statistical software (where multiplicative trend models are excluded), then the monetary savings would be $145,000 for each forecast cycle. Assuming that forecasts are produced each month, the reduced computational cost translates to monetary savings of about $1.74 million annually, compared to the second-fastest pool of exponential smoothing models in this paper.

Note that these savings come with an increased forecast accuracy of about 0.5%. Petropoulos et al. (2019) measured the forecasting and inventory performance of various forecasting methods. Their results suggest that a percentage decrease in terms of a forecast error measure is approximately doubled in terms of percentage decrease of the holding inventory cost for the same achieved service level. Reductions in the inventory cost increase with the lead time and the target service level.

On top of the financial savings associated with reduced computational cost, another critical dimension is the energy usage and the environmental footprint. According to Naughton (2019), "the computing power required for today's most-vaunted machine-learning systems has been doubling every 3.4 months". Against the recent trends for using machine (and deep) learning techniques in forecasting tasks, we take one step back and argue for the need for fast and cost-effective solutions.

Our results are relevant to forecasting research. Over the past 20 years, we have observed a trend of increasing the number and complexity of models available in the forecasters' toolboxes. We believe that this points to an attempt to capture the true data generation process.
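The retailer savings quoted earlier can be reproduced with back-of-the-envelope arithmetic. The per-CPU-hour price below is an assumption (it is not stated in the text); $0.05 is the value that makes the reported figures mutually consistent, with per-series times taken from the "all models" and "no multiplicative trend" pools (Table 5, monthly column) and the reduced pool (Table 9):

```python
# Back-of-the-envelope reproduction of the retailer savings arithmetic.
SKU_ZIP_COMBINATIONS = 20e9      # 0.5M fast-moving SKUs x 40K ZIP codes
ALL_MODELS_SEC = 1.375           # seconds per series, "all models" pool
SECOND_FASTEST_SEC = 0.973       # "no multiplicative trend" pool
REDUCED_SEC = 0.450              # "reduced" pool
PRICE_PER_CPU_HOUR = 0.05        # assumed cloud price, USD

cpu_hours_all = SKU_ZIP_COMBINATIONS * ALL_MODELS_SEC / 3600
saved_cpu_hours = SKU_ZIP_COMBINATIONS * (SECOND_FASTEST_SEC - REDUCED_SEC) / 3600
monthly_savings = saved_cpu_hours * PRICE_PER_CPU_HOUR
annual_savings = 12 * monthly_savings
```

Under these assumptions the calculation recovers roughly 7.6 million CPU-hours for the full pool, about $145,000 saved per monthly forecast cycle, and about $1.74 million saved annually.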
In real life, however, there exists considerable randomness, uncertainty, and structural instability. Even if we were able to capture the true data generation process for the observed data, there would be no guarantee that this process would not change in the future. Our study builds on the arguments for suboptimal (Nikolopoulos and Petropoulos, 2018; Ashouri et al., 2019) and simple (Green and Armstrong, 2015) forecasting solutions.

The results and insights are consistent regardless of the criterion used for model selection (AIC, BIC, or AICc). This is in line with past research that has found similar average performances across various information criteria (Billah et al., 2006). While other selection approaches (such as validation or cross-validation) exist, we do not expect their application to provide insights that would differ from the ones presented in this paper, especially given the relevance between AIC/AICc and one-step-ahead cross-validation discussed in section 3.3. Additionally, these approaches are computationally costly. For example, time-series cross-validation would require fitting the same models multiple times for a single series so that the rolling-origin performance of each model could be measured on a hold-out set of observations. As such, it would be practically very hard to apply (cross-)validation in situations when hundreds of thousands of series need to be forecast.
6. Conclusions
In this study, we focused on time series forecasting and, in particular, on the exponential smoothing and ARIMA families of models. We questioned the forecast-value-added of large pools of models when the computational cost is also taken into account. We proposed a reduced exponential smoothing pool that consists of just eight models. This pool excludes models with multiplicative trends, non-damped trended models, and models where the type of the error does not match the seasonality type. We also controlled the size of the ARIMA pools of models by restricting their maximum order.

We empirically tested the performance of small and computationally-efficient pools of forecasting models against that of larger pools on almost 100,000 real-life time series. We noticed that small pools of exponential smoothing or ARIMA models produce results in a fraction of the time compared to larger pools. Crucially, the reduced computational time did not come with inferior forecasting performance. In fact, the reduced exponential smoothing pool led to, on average, more accurate forecasts and better estimation of the forecast uncertainty through prediction intervals. At the same time, low-order ARIMA models offered competitive performance compared to larger models; however, very small models (K = 1) may not be the best choice for monthly data.

We translated the reduced computational cost into monetary savings in the context of large retailers. We showed that the use of reduced pools of models could be associated with significant savings when one has to forecast a vast number of series. We suggest that future studies should consider the cost of the forecast error as an additional performance indicator on top of the traditional forecast error measures.
Finally, as in this study we separately focused on the exponential smoothing and ARIMA families of models, a direction for future research could be the generalization of the findings to other (or even combined) families of forecasting models.

Acknowledgments
We are grateful to Professors Len Tashman, James Taylor, Konstantinos Nikolopoulos, and Rob Hyndman for their valuable comments and feedback.
References
Ashouri, M., Shmueli, G., Sin, C.-Y., 2019. Tree-based methods for clustering time series using domain-relevant attributes. Journal of Business Analytics 2 (1), 1–23.
Athanasopoulos, G., Ahmed, R. A., Hyndman, R. J., 2009. Hierarchical forecasts for Australian domestic tourism. International Journal of Forecasting 25 (1), 146–166.
Bhansali, R. J., 1999. Autoregressive model selection for multistep prediction. Journal of Statistical Planning and Inference 78 (1), 295–305.
Billah, B., King, M. L., Snyder, R. D., Koehler, A. B., 2006. Exponential smoothing model selection for forecasting. International Journal of Forecasting 22 (2), 239–247.
Box, G. E. P., Jenkins, G., 1970. Time Series Analysis: Forecasting and Control, first edition. Holden-Day, San Francisco.
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., 2008. Time Series Analysis: Forecasting and Control, 4th Edition. Wiley, New Jersey.
Box, G. E. P., Pierce, D. A., 1970. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association 65 (332), 1509–1526.
Brown, R. G., 1956. Exponential Smoothing for Predicting Demand. Little, Cambridge, Massachusetts.
Davies, N., Triggs, C. M., Newbold, P., 1977. Significance levels of the Box-Pierce portmanteau statistic in finite samples. Biometrika 64 (3), 517–522.
De Livera, A. M., Hyndman, R. J., Snyder, R., 2011. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association 106 (496), 1513–1527.
Fildes, R., Goodwin, P., Lawrence, M., Nikolopoulos, K., 2009. Effective forecasting and judgmental adjustments: an empirical evaluation and strategies for improvement in supply-chain planning. International Journal of Forecasting 25, 3–23.
Findley, D. F., Monsell, B. C., Bell, W. R., Otto, M. C., Chen, B.-C., 1998. New capabilities and methods of the X-12-ARIMA seasonal-adjustment program.
Journal of Business & Economic Statistics 16 (2), 127–152.
Franses, P. H., 2016. A note on the mean absolute scaled error. International Journal of Forecasting 32 (1), 20–22.
Gardner, E. S., 2006. Exponential smoothing: The state of the art - Part II. International Journal of Forecasting 22 (4), 637–666.
Gardner, E. S., McKenzie, E., 1988. Model identification in exponential smoothing. The Journal of the Operational Research Society 39 (9), 863–867.
Gardner, E. S., McKenzie, E., 1989. Seasonal exponential smoothing with damped trends. Management Science 35 (3), 372–376.
Gardner, E. S., Jr., McKenzie, E., 1985. Forecasting trends in time series. Management Science 31 (10), 1237–1246.
Gilliland, M., 2015. White paper: Forecast Value Added Analysis: Step by Step. Tech. rep., SAS.
Gilliland, M., 2020. The value added by machine learning approaches in forecasting. International Journal of Forecasting 36 (1), 161–166.
Gneiting, T., Raftery, A. E., 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), 359–378.
Goldstein, D. G., Gigerenzer, G., 2009. Fast and frugal forecasting. International Journal of Forecasting 25 (4), 760–772.
Goodwin, P., Fildes, R., 1999. Judgmental forecasts of time series affected by special events: Does providing a statistical forecast improve accuracy? Journal of Behavioral Decision Making 12, 37–53.
Gooijer, J. G. D., Hyndman, R. J., 2006. 25 years of time series forecasting. International Journal of Forecasting 22, 443–473.
Granger, C. W. J., Joyeux, R., 1980. An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis 1 (1), 15–29.
Green, K. C., Armstrong, J. S., 2015. Simple versus complex forecasting: The evidence. Journal of Business Research 68 (8), 1678–1685.
Guo, X., Lichtendahl, K. C., Jr., Grushka-Cockayne, Y., 2018. Quantile forecasts of product life cycles using exponential smoothing.
Harvard Business School Working Paper 19-038.Hamilton, D. C., Watts, D. G., 1978. Interpreting partial autocorrelation functions of seasonal time series models.Biometrika 65 (1), 135–140.Harvey, D., Leybourne, S., Newbold, P., 1997. Testing the equality of prediction mean squared errors. InternationalJournal of Forecasting 13 (2), 281–291.Holt, C. C., 2004. Forecasting seasonals and trends by exponentially weighted moving averages. International Journalof Forecasting 20 (1), 5–10.Hosking, J. R. M., 1980. Lagrange-Multiplier tests of Time-Series models. Journal of the Royal Statistical Society. SeriesB, Statistical methodology 42 (2), 170–181.Hosking, J. R. M., 1981. Fractional di ff erencing. Biometrika 68 (1), 165–176.Hyndman, R. J., Khandakar, Y., 7 2008. Automatic time series forecasting: The forecast package for R . Journal of tatistical Software 27 (3), 1–22.Hyndman, R. J., Koehler, A. B., 2006. Another look at measures of forecast accuracy. International Journal of Forecast-ing 22, 679–688.Hyndman, R. J., Koehler, A. B., Ord, J. K., Snyder, R. D., 2008. Forecasting with Exponential Smoothing: The StateSpace Approach. Springer Verlag, Berlin.Hyndman, R. J., Koehler, A. B., Snyder, R. D., Grose, S., 2002. A state space framework for automatic forecasting usingexponential smoothing methods. International Journal of Forecasting 18 (3), 439–454.Kim, J. H., 2003. Forecasting autoregressive time series with bias-corrected parameter estimators. International Journalof Forecasting 19 (3), 493–502.Kolassa, S., 2011. Combining exponential smoothing forecasts using A kaike weights. International Journal of Forecast-ing 27 (2), 238–251.Ljung, G. M., George E. P. Box, 1979. The likelihood function of stationary Autoregressive-Moving average models.Biometrika 66 (2), 265–270.Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., Newton, J., Parzen, E., Winkler,R., 1982. 
The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal ofForecasting 1 (2), 111–153.Makridakis, S., Hibon, M., 1979. Accuracy of forecasting: An empirical investigation. Journal of the Royal StatisticalSociety. Series A 142 (2), 97–145.Makridakis, S., Hibon, M., 2000. The M ff ord the exorbitant power demands of machine learning? ,accessed: 2019-11-18.Newbold, P., 1974. The exact likelihood function for a mixed Autoregressive-Moving average process. Biometrika61 (3), 423–426.Newbold, P., Agiakloglou, C., Miller, J., 1994. Adventures with ARIMA software. International Journal of Forecasting10 (4), 573–581.Nikolopoulos, K., Petropoulos, F., 2018. Forecasting for big data: Does suboptimality matter? Computers & OperationsResearch 98, 322–329.Ord, J. K., Koehler, A. B., Snyder, R. D., 1997. Estimation and prediction for a class of dynamic nonlinear statisticalmodels. Journal of the American Statistical Association 92 (440), 1621–1629.Ozaki, T., 1977. On the order determination of ARIMA models. Journal of the Royal Statistical Society. Series C, Ap-plied statistics 26 (3), 290–301.Pegels, C. C., 1969. Exponential forecasting: Some new variations. Management Science 15 (5), 311–315.Petropoulos, F., Kourentzes, N., Nikolopoulos, K., Siemsen, E., 2018. Judgmental selection of forecasting models. Jour-nal of Operations Management 60, 34–46.Petropoulos, F., Wang, X., Disney, S. M., 2019. The inventory performance of forecasting methods: Evidence from theM3 competition data. International Journal of Forecasting 35 (1), 251–265.Riise, T., Tjozstheim, D., Jul. 1984. Theory and practice of multivariate arma forecasting. Journal of Forecasting 3 (3),309–317.Seaman, B., 2017. Retail sales forecasting at walmart. International Symposium on Forecasting 2017 (Cairns, Aus-tralia).Seaman, B., 2018. Considerations of a retail forecasting practitioner. International Journal of Forecasting 34 (4), 822–829.Svetunkov, I., Kourentzes, N., 2018. 
Complex exponential smoothing for seasonal time series. Working Paper of De-partment of Management Science, Lancaster University 2018:1, 1–20.Tashman, L. J., 2000. Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal ofForecasting 16 (4), 437–450.Taylor, J. W., 2003a. Exponential smoothing with a damped multiplicative trend. International Journal of Forecasting19 (4), 715–725.Taylor, J. W., 2003b. Short-Term electricity demand forecasting using double seasonal exponential smoothing. The ournal of the Operational Research Society 54 (8), 799–805.Taylor, J. W., 2008. A comparison of univariate time series methods for forecasting intraday arrivals at a call center.Management Science 54 (2), 253–265.Weller, M., Crone, S., 2012. Supply Chain Forecasting: Best Practices & Benchmarking Study. Lancaster Centre ForForecasting, Technical Report.Winters, P. R., 1960. Forecasting sales by exponentially weighted moving averages. Management Science 6, 324–342.Yelland, P., Baz, Z. E., Serafini, D., 2019. Forecasting at scale: The architecture of a modern retail forecasting system.Foresight: The International Journal of Applied Forecasting 55, 10–18.Yule, G. U., 1927. On a method of investigating periodicities in disturbed series, with special reference to wolfer’ssunspot numbers. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of aMathematical or Physical Character 226, 267–298.ournal of the Operational Research Society 54 (8), 799–805.Taylor, J. W., 2008. A comparison of univariate time series methods for forecasting intraday arrivals at a call center.Management Science 54 (2), 253–265.Weller, M., Crone, S., 2012. Supply Chain Forecasting: Best Practices & Benchmarking Study. Lancaster Centre ForForecasting, Technical Report.Winters, P. R., 1960. Forecasting sales by exponentially weighted moving averages. Management Science 6, 324–342.Yelland, P., Baz, Z. E., Serafini, D., 2019. 
Forecasting at scale: The architecture of a modern retail forecasting system.Foresight: The International Journal of Applied Forecasting 55, 10–18.Yule, G. U., 1927. On a method of investigating periodicities in disturbed series, with special reference to wolfer’ssunspot numbers. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of aMathematical or Physical Character 226, 267–298.