The future of forecasting competitions: Design attributes and principles
Spyros Makridakis (a), Chris Fry (b), Fotios Petropoulos (c,∗), Evangelos Spiliotis (d)

(a) Institute for the Future (IFF), University of Nicosia, Nicosia, Cyprus
(b) Google Inc., USA
(c) School of Management, University of Bath, UK
(d) Forecasting and Strategy Unit, School of Electrical and Computer Engineering, National Technical University of Athens, Greece

∗ Correspondence: Fotios Petropoulos, School of Management, University of Bath, Claverton Down, Bath, BA2 7AY, UK.
Email addresses: [email protected] (Spyros Makridakis), [email protected] (Chris Fry), [email protected] (Fotios Petropoulos), [email protected] (Evangelos Spiliotis)
Abstract
Forecasting competitions are the equivalent of the laboratory experimentation widely used in physical and life sciences. They provide useful, objective information to improve the theory and practice of forecasting, advancing the field, expanding its usage and enhancing its value to decision and policy makers. We describe ten design attributes to be considered when organizing forecasting competitions, taking into account trade-offs between optimal choices and practical concerns like costs, as well as the time and effort required to participate in them. Consequently, we map all major past competitions with respect to their design attributes, identifying similarities and differences between them, as well as design gaps, and making suggestions about the principles to be included in future competitions, putting a particular emphasis on learning as much as possible from their implementation in order to help improve forecasting accuracy and uncertainty. We discuss that the task of forecasting often presents a multitude of challenges that can be difficult to capture in a single forecasting contest. To assess the quality of a forecaster, we therefore propose that organizers of future competitions consider a multi-contest approach. We suggest the idea of a forecasting pentathlon, where different challenges of varying characteristics take place.

Keywords: data science, business analytics, competitions, organization, design, forecasting
1. Introduction
In a seminal paper, Hyndman (2020) reviews the history of time series forecasting competitions and discusses what we have learned from them as well as how they have influenced the theory and practice of forecasting.
The first time series forecasting competition, M (Makridakis et al., 1982), [...] offering $100,000 prizes to the winners. There is little doubt, therefore, that forecasting competitions have changed a great deal and have become big events, attracting large numbers of participants from diverse backgrounds and with varying reasons to join. As time passes, however, there is a need to question the way forecasting competitions are structured, and to consider improvements in their design and the objectives they strive to achieve, in order to attain maximum benefits from their implementation. Also, as presented in the encyclopedic overview of Petropoulos et al. (2020), the applications of forecasting expand to many social science areas, such as economics, finance, health care, climate, sports, and politics, among others. As such, there is also the need to consider new application areas for future forecasting competitions beyond operations, supply chains, and energy, which have been the main focus until now.

Forecasting competitions are the equivalent of the laboratory experimentation widely used in physical and life sciences. They are used to evaluate the forecasting performance of various approaches and determine their accuracy and uncertainty. Their purpose is to provide objective, empirical evidence to aid policy and decision makers about the most appropriate forecasting approach to use for their specific needs.

It is the aim of this paper to appraise the past of forecasting competitions and to deliberate about their future and what should be done to improve their value and expand their usefulness across all application domains.

The paper consists of six sections and a conclusion. After this short introduction, section 2 summarizes the conclusions of Hyndman's influential paper about past time series forecasting competitions, of interest to the present discussion, and enumerates his suggestions about the characteristics of future ones. Section 3 describes various types of forecasting competitions, considering their scope, the type of data used in terms of diversity and representativeness, structure, granularity, availability, the length of the forecasting horizon, and several other attributes, including performance measures and the need for benchmarks. Consequently, section 4 identifies the commonalities as well as the design gaps of past forecasting competitions by mapping the designs of indicative, major ones to the attributes described previously and mentioning the advantages and drawbacks of each. Section 5 focuses on outlining the proposed features of some "ideal" forecasting competitions that would avoid the problems of past ones while filling existing gaps, in order to improve their value and gain maximum benefits from their implementation. Section 6 presents some thoughts about institutionalizing the practice of forecasting competitions and systematizing the way they are conducted, moving from running single competitions to structuring them across multiple forecasting challenges, in the way that pentathlons are run, with single winners in each challenge and an overall one across all. Finally, the conclusion summarizes the paper and proposes expanding the competitions beyond business forecasting to cover other social science areas in need of objective information to improve policy and decision making.
2. A brief history of time series forecasting competitions
In his seminal paper, Hyndman (2020) concludes that time series forecasting competitions have played an important role in advancing our knowledge of what forecasting methods work and how their performance is affected by various influencing factors. He believes that, in order to improve objectivity and replicability, the data and the submitted forecasts of competitions must be made publicly available so as to promote research and facilitate the diffusion and usage of their findings in practice. At the same time, their objectives must be clear and the extent to which their findings can be generalized must be stated. According to him, future competitions should carefully define the population of data from which the sample has been drawn and the possible limitations of generalizing their findings to other situations. The usage of instance spaces (Kang et al., 2017) could provide a way to specify the characteristics of the data included and allow comparisons to other competitions or data sets with well-known properties (Fry and Brundage, 2020; Spiliotis et al., 2020a). Moreover, a nice side-effect of time series competitions is that they have introduced popular benchmarks, allowing the evaluation of performance improvements and comparisons among competitions for judging the accuracy and uncertainty of the submitted methods, including the assessment and replication of their findings over time. Furthermore, as new competitions emerge and the benchmarks are regularly updated, the effect of developing methods that overfit published data is mitigated and new, robust forecasting methods can be effectively identified.

On the negative side, Hyndman expresses concerns about the objectivity of the performance measures used, stating that these should be based on well-recognized attributes of the forecast distribution. This is particularly true for the case of prediction intervals: he notes that the widely used Winkler scores (Winkler, 1972) are not scale-free and that their scaled version, used to assess interval performance in the M4 competition (Makridakis et al., 2020b), seems rather ad hoc, with unknown properties. Consequently, he cites the work of Askanazi et al. (2018), who assert that comparisons of interval predictions are problematic in several ways and should be abandoned in favor of density forecasts. Probabilistic forecasts, such as densities, could instead be evaluated using proper scoring rules and scale-free measures like log density scores, as done in M5 and in some energy competitions (Hong et al., 2016, 2019). There is, therefore, a need to reconsider how such probabilistic forecasts will be made and evaluated in future competitions to avoid the criticism that they are inadequate. However, no matter how such evaluations are done, Hyndman suggests that it would be desirable for forecast distributions to be part of all future forecasting competitions. Another issue he raises is whether explanatory/exogenous variables improve forecasting performance over that of time series methods. For instance, in the tourism forecasting competition (Athanasopoulos et al., 2011), explanatory/exogenous variables were helpful only for one-step-ahead forecasts, while in some energy competitions (Hong et al., 2014, 2016, 2019) using temperature forecasts was beneficial for short-term forecasting, where weather forecasts were relatively accurate, with the results being mixed for longer forecasting horizons.
On the other hand, explanatory/exogenous variables whose values can be specified, such as the existence of promotions, the day of the week, holidays, and days of special events like the Super Bowl, are generally considered helpful for improving forecasting performance and should therefore be included in the forecasting process (Makridakis et al., 2020a).

A major suggestion of Hyndman (2020) is that future time series competitions should focus more on the conditions under which different methods work well, rather than simply identifying the methods that perform better than others. Doing so will represent a significant change that will be particularly relevant for breaking the black box of machine and deep learning forecasting methods, as it will be necessary to better understand how their predictions are made and how they can be improved by concentrating on the factors that influence accuracy and uncertainty the most. In addition, he believes that future time series competitions should involve large-scale multivariate forecasting challenges, while focusing on irregularly spaced and high-frequency series, such as the hourly, daily, and weekly data that are nowadays widely recorded by sensors, systems, and the Internet of Things. Finally, Hyndman states that he does not know of any large-scale time series forecasting competition that has been conducted using finance data (e.g., stock and commodity prices and/or returns) and that such a competition would seem to be of great potential interest to the financial industries and investors in general.
3. Design attributes of forecasting competitions
In this section we identify and discuss ten key attributes that should be considered when designing forecasting competitions, even if some of them might not be applicable to every competition. Table 1 provides a summary description of these attributes, which are then discussed in detail in the next subsections.
Table 1: Summary description of the design attributes of forecasting competitions (columns: Design attribute; Description).

3.1. Scope

The first decision in designing a forecasting competition relates to its scope, which can be defined based on (i) the focus of the competition, (ii) the type of the submissions it will attract, and (iii) the format of the required submissions.

Regarding the focus, there is a spectrum of possibilities ranging from generic to specific competitions. Generic competitions feature data from multiple domains that represent various industries and applications, as well as various frequencies. Examples include the M, M3, and M4 forecasting competitions, which include data from different domains (micro, macro, industry, demographic, finance, and others) and various frequencies (yearly, quarterly, monthly, weekly, daily, hourly, and others). While the results of such competitions identify the methods performing best on each data domain/frequency, they typically determine the winners based on their average performance across the complete data set. Thus, although their main findings may not necessarily be applicable to all the domains/frequencies examined, they help us effectively identify best forecasting practices that hold for the diverse types of data on which predictions are made in business firms.

Specific competitions feature data of a particular domain/frequency, a particular industry, or a particular company/organization. Examples of such competitions include the global energy ones and the majority of those hosted on Kaggle (Bojer and Meldgaard, 2020), including M5. Although these competitions may be more valuable for specific industries or organizations, replicating real-world situations, their findings are restricted to the specific data set and cannot be generalized to other situations. Finally, semi-specific competitions feature data that, although referring to a particular domain, includes instances from various applications of that domain, which may therefore require the utilization of significantly different forecasting methods. For example, a semi-specific energy competition may require forecasts for renewable energy production, energy demand, and energy prices, with the winners being determined based on their average performance across these tasks. In this case, factors that influence forecasting performance in the examined domain can be effectively identified, while the key findings of the competition can be applicable to several forecasting tasks of that domain.

Apart from the focus, when deciding on the scope of a competition, organizers will need to think about the types of submissions that they would receive, particularly whether these submissions will be based on automatic statistical algorithms or human judgment. While most competitions do not state this explicitly, the type of submissions is usually implied by the number of inputs required. In a large-scale forecasting competition, where one has to provide many thousands of inputs, automatic algorithms might be the only feasible way. In smaller-scale competitions, judgment could be used in predicting events, while in cases where data is insufficient or even unavailable, judgment may be the only possible way to produce forecasts and estimate uncertainty. Consider, for instance, challenges similar to the ones posed within the Good Judgment project and questions such as "what is the probability that humans will visit Mars before the end of 2030?".
In such cases, the focus of the competition will be the events examined and the required submissions will have to be made judgmentally.

A final decision on the scope of a competition has to do with the format of the submissions requested from the participants. While some forecasting competitions so far have asked only for the submission of the most likely point forecasts, it is preferable that submissions of uncertainty be required too. This can be achieved through the submission of prediction intervals for one or multiple indicative quantiles or, even better, the submission of full predictive distributions that include estimates of fat tails. If the event to be forecast has a discrete number of possible outcomes, uncertainty can be provided in the form of confidence levels (e.g., 90% certainty) or categorical answers (e.g., low, moderate, and high confidence). Note also that it can be the case that a forecasting competition does not ask for forecasts (or estimates of uncertainty) per se, but for the decisions to be made when using such forecasts. Examples include setting the safety stock in an inventory system, the selection of a portfolio of stocks in investing, or betting amounts for future events given their odds.

3.2. Diversity and representativeness

Regardless of whether the focus of a competition is generic or not, it is important that the events considered have a reasonable degree of diversity that will allow for the generalization of the findings and insights obtained. Diversity effectively refers to the heterogeneity of the events to be predicted. In the case of forecasting competitions that provide historical information in the form of time series, diversity is usually determined by visualizing spaces based on time series features (Kang et al., 2017), which may include the strength of predictable patterns (trend, seasonality, autocorrelations, etc.), the degree of predictability (coefficient of variation, signal-to-noise ratio, entropy, etc.), the degree of intermittence and sparseness (fast versus slow-moving items), as well as the length and periodicity of the data, among others. In time series such features can be endogenously measured, while in competitions where past data is not provided, diversity can be appreciated with regard to the intent of the events under investigation and the implicit requirements, from a participant's perspective, in analyzing and producing forecasts/uncertainty for such events.

Diversity could also include the country of origin of the data, the type of data domains, the frequencies considered, the industries or companies investigated, and the time frame covered. For example, the results of a competition like M5, which focused on the sales of ten US stores of a global grocery retailer in 2016, would not necessarily apply to a grocery retailer in China in the same year or to another United States grocery retailer in 2021. Similarly, they may not apply to other types of retailers, such as fashion, pharmaceutical, or technology, to firms operating online, or to firms with different discount and promotion strategies.
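As an illustration of how such a feature space can be populated, the sketch below computes a few of the characteristics mentioned above (strength of trend and seasonality, coefficient of variation, and a crude intermittence proxy) for a set of series. It is a minimal sketch only, assuming monthly data, the common STL-based definitions of trend and seasonal strength, and synthetic series; the function names are ours, for illustration.

```python
# Sketch: extracting indicative time series features for an instance space.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def series_features(y: pd.Series, period: int = 12) -> dict:
    """Compute a few indicative features for a single series."""
    fit = STL(y, period=period).fit()
    var_resid = np.var(fit.resid)
    # Strength of trend: 1 - Var(remainder) / Var(trend + remainder)
    trend_strength = max(0.0, 1.0 - var_resid / np.var(fit.trend + fit.resid))
    # Strength of seasonality: 1 - Var(remainder) / Var(seasonal + remainder)
    seasonal_strength = max(0.0, 1.0 - var_resid / np.var(fit.seasonal + fit.resid))
    return {
        "length": len(y),
        "cov": float(np.std(y) / np.mean(y)),     # coefficient of variation
        "trend_strength": float(trend_strength),
        "seasonal_strength": float(seasonal_strength),
        "pct_zeros": float(np.mean(y == 0)),      # crude intermittence proxy
    }

# Mapping every series of a data set (here, synthetic random walks) into this
# feature space gives the coordinates used to visualize an instance space.
rng = np.random.default_rng(1)
data = [pd.Series(np.cumsum(rng.normal(10, 2, 120)),
                  index=pd.date_range("2010-01", periods=120, freq="MS"))
        for _ in range(5)]
print(pd.DataFrame([series_features(y) for y in data]))
```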
Diversifying the data set of the competition so that multiple events of different attributes are considered is a prerequisite for designing competitions that represent reality and for ensuring that their findings can be safely generalized across the domain(s), frequencies, or application(s) being considered.

Other competitions could be based on forecasting data of unknown or undisclosed sources, as well as on forecasting synthetic time series (i.e., time series data generated through simulations). Such competitions would allow identifying the conditions under which particular forecasting models perform well, including time series characteristics, such as seasonality, trend, noise, and structural changes, as well as decisions like the forecasting horizon considered. These competitions would enable learning more from the results, also linking theory with empirical evidence.

3.3. Data structure

While in some competition settings it is possible that no data is provided at all, in most competitions historical data is made available. Such data may be individual time series that are not connected to one another. In such cases, although series are typically forecast separately, participants may attempt to apply cross-learning techniques to improve the accuracy of their solutions, as was the case with the two top-performing solutions in the M4 competition (Semenoglou et al., 2020). It is also possible that competition data is logically organized to form hierarchical structures (Hyndman et al., 2011). Such structures do not necessarily have to be uniquely defined. For instance, in competitions like M5, the sales of a company may be disaggregated by regions, categories, or both if grouped hierarchies are assumed. Given that in many forecasting applications hierarchies are present and information exchange between the series is possible, deciding on the correlation of the data provided is critical for determining under which circumstances the findings of the competition will apply.

Alternatively, the provided time series data may or may not be supported by additional information. For example, in competitions like M4, where the existence of timestamps may have led to information leakage about the actual future values of the series, dates should not be provided. However, when this information is indeed available, then multivariate settings may also be considered. Also, while data availability might be limited to the variables for which forecasts are required, explanatory/exogenous variables can also be provided. Information for such variables may match the time window of the dependent variables, part of it, or even exceed it. Explanatory/exogenous variables may either be provided directly by the organizers of the competition to its participants or collected by the participants through various external sources. In any case, it is important that the explanatory/exogenous variables used for producing the forecasts only refer to information that would have originally been available at the time the forecasts were produced, and not after that point, to make sure that no information about the actual future is leaked. For example, short-term weather forecasts may be offered as an explanatory variable for predicting wind production, but not actual weather conditions measured either on site or at a nearby meteorological station.

3.4. Data granularity

Data granularity refers to the most disaggregated level at which data will be available and may refer both to cross-sectional and temporal aggregation levels (Spiliotis et al., 2020b).
In most cases, the granularity of the data matches that of the variable to be forecast, but this does not always have to be the case. If, for example, a competition focuses on the sales of a particular product in the European Union, then country-level sales or even store-level sales might be helpful in improving forecasting performance. Similarly, smart-meter data may enhance the predictions of energy consumption at the city level, with hourly measurements also being useful in predicting daily demand. This is particularly true in applications where data appear in mixed frequencies. For instance, in econometric regression, a quarterly time series may be used as an external regressor in forecasting a monthly time series.

Temporal granularity is more relevant when the data under investigation is organized over time (time series data). Increasingly, forecasting competitions have been focusing on higher-frequency data like daily and weekly series, but this should not be considered a panacea. The choice of the frequency needs to be linked with the scope of the competition, as low-frequency data will naturally be used for supporting strategic decisions, while high-frequency data will support operations. For instance, daily data are not available for macroeconomic variables compared to monthly, quarterly, or yearly frequencies. Similarly, daily or hourly data would be more relevant in forecasting the sales of fresh products for store replenishment purposes. Finally, special treatment should be given to instances where the seasonal period is not an integer number, as, for example, when using weekly frequency data.
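To make the cross-frequency point concrete, the following minimal sketch shows how hourly observations can be temporally aggregated to the daily and weekly levels at which forecasts may actually be requested. The series is synthetic and the names are illustrative only; the aggregation itself is standard pandas resampling.

```python
# Sketch: temporal aggregation of hourly data to coarser decision levels.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=24 * 28, freq="h")  # four weeks, hourly
hourly = pd.Series(
    100 + 20 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 5, len(idx)),
    index=idx, name="consumption",
)

daily = hourly.resample("D").sum()   # daily totals for replenishment-type decisions
weekly = hourly.resample("W").sum()  # weekly totals for planning-type decisions
print(daily.head(), weekly.head(), sep="\n")
```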
3.5. Data availability

Data availability refers to the amount of information provided by the organizers for producing the requested forecasts. For time series competitions this would include the number of historical observations available per series as well as the number of series contained in the data set. Note that both dimensions of data availability may be equally important in determining the performance of the submitted forecasts. For instance, in time series competitions, methods can be trained both in a series-by-series fashion, where a large number of historical observations per series is desirable, and in a cross-learning one, where data sets of multiple series are preferable for building appropriate models. In general, relatively large data sets are more advantageous than smaller ones, so that participants are capable of effectively training their models by extracting more information from the data. In addition, the probability of a participant winning the competition by luck rather than skill is effectively reduced. For example, in competitions of the size of the M4, which involved 100,000 series, it is practically impossible to win by making random choices (Spiliotis et al., 2020a).

Data availability can also be driven by the scope of the competition and the type of events to be predicted. For example, if the competition focuses on new product or technological forecasting, data availability will naturally be limited over time. Similarly, if the competition focuses on the sales of a manufacturer that produces a limited number of products, data availability will naturally be bounded over series, requiring more manufacturers of the same industry to be included in the data set to expand its size and improve its representativeness. Moreover, data availability may be influenced by the frequency of the series, especially when multiple periodicities are observed. Hourly electricity consumption data, for instance, may contain three seasonal cycles: daily (every 24 hours), weekly (every 168 hours), and yearly (every 12 months, i.e., roughly every 8,760 hours). In addition, when dealing with seasonal data, it is generally believed that a minimum of three seasonal periods is required in order for the seasonal component of the series to adequately capture the periodic patterns existing across time.

Certain domain-specific future forecasting competitions may not offer any data at all. In the era of big data and instant access to many publicly available sources of information, participants are usually in a position to gather the required data by themselves, but also to complement their forecasts by using any other publicly available information. However, in the case that the organizers decide not to provide data, there is still a benefit to specifying a "default" data set to be used for evaluation purposes. Finally, in non-time-series forecasting competitions, such as the Good Judgment project, quantitative data may not only not be provided but may not be available at all.

3.6. Forecasting horizon

The forecasting horizon may vary from predicting the present situation (also known as nowcasting, especially popular in predicting macroeconomic variables) to immediate, short, medium, and long-term planning horizons. The exact definition of each planning horizon may differ with regard to the frequency of the data under investigation. For instance, for hourly data, 1-24 hours ahead is usually considered short-term forecasting. At the same time, 1-3 months ahead can also be regarded as "short-term" when working with monthly data.
Accordingly, the forecasting horizon can be naturally bounded based on the frequency of the series. For daily data, for example, it is probably unreasonable to produce forecasts for the following three years, a request which is reasonable for quarterly data.

The choice of the appropriate forecasting horizon is a function of various factors that may include the importance of the specific planning horizons for the application, the user of the forecasts, and the hierarchical level of the forecast. Short-term horizons are suitable for operational planning and scheduling; mid-term horizons are appropriate for financial, budgeting, marketing, and employment decisions; while long-term forecasts are associated with strategic decisions that include technological predictions as well as business and capacity planning.

It is not uncommon in forecasting competitions to require forecasts for multiple periods ahead, with the performance usually being measured as the average across all horizons. However, for some applications, like store replenishment or production, it is more relevant to consider the cumulative forecast error (the difference between the sum of the actual values and the sum of the forecasts over the lead time) rather than the average of the forecast errors across all horizons.

3.7. Evaluation setup

In time-series forecasting competitions, the most common design setup is to use historical data and conceal part of it to be used as test data for evaluating the performance of the submitted forecasts. The setup of concealing data may be further expanded to a number of rolling evaluation rounds. In single-origin evaluation, participants do not receive feedback on their performance, which is based on a single time window that may not be representative of the entire series. For example, in electricity load forecasting, where three strong seasonal patterns are typically observed across the year, evaluating submissions by considering a particular day, week, or month is probably not the best option. Similarly, we found this to be a drawback of the evaluation setup used in M5.

To avoid the disadvantage of a single origin, the competition can be rolling (Tashman, 2000), revealing some more of the hidden data each time and asking for new forecasts at each rolling iteration, providing the participants the opportunity to learn and improve their performance over time. A potential disadvantage of rolling-origin competitions is that they require more inputs and energy from the participants, who may wish, or have, to adjust their models at each new round. For this reason, rolling-origin competitions display higher drop-out rates, excluding also participants who are interested in participating but missed some early rounds and those who cannot commit for a long period of time. An alternative could be a rolling-origin evaluation setup where the participants provide the code for their solutions, and then the organizers produce forecasts automatically for multiple origins, as required.
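Such an automated rolling-origin evaluation is straightforward to implement once code is submitted. The sketch below is a minimal illustration: `fit_and_forecast` is a placeholder standing in for a participant's solution, and the naive method and synthetic series are ours, for illustration only.

```python
# Sketch: automated rolling-origin evaluation of a submitted solution.
import numpy as np

def fit_and_forecast(history: np.ndarray, horizon: int) -> np.ndarray:
    """Placeholder solution: a naive forecast repeating the last value."""
    return np.repeat(history[-1], horizon)

def rolling_origin_errors(y: np.ndarray, first_origin: int,
                          horizon: int, step: int = 1) -> np.ndarray:
    """Evaluate a solution over several forecast origins of one series."""
    errors = []
    for origin in range(first_origin, len(y) - horizon + 1, step):
        forecasts = fit_and_forecast(y[:origin], horizon)
        errors.append(np.abs(y[origin:origin + horizon] - forecasts))
    return np.array(errors)  # shape: (n_origins, horizon)

y = np.cumsum(np.random.default_rng(42).normal(0, 1, 60)) + 50
e = rolling_origin_errors(y, first_origin=36, horizon=6)
print("MAE per horizon:", e.mean(axis=0).round(2))
print("MAE overall:", e.mean().round(2))
```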
Instead of concealing data, a competition can be designed to take place on a real-time basis (live competition), with forecasts being evaluated against the actual data once they become available. The major advantage of live competitions is that participants can incorporate current information into their forecasts in real time, meaning that data and external variables can be fetched by the participants themselves based on their preferences and the methods used. Also, information leakage about the actual future values becomes impossible and the competition represents reality perfectly. The disadvantage is that such a competition is much more difficult to run (e.g., data must be collected in real time and evaluations must be updated accordingly), while some time is needed until the actual values become available. A real-time competition may have a single submission origin or multiple, rolling ones. In the latter case, feedback is explicitly provided to the participants in real time, allowing learning with each additional rolling iteration. The major disadvantage is that it is much more difficult to run and requires great motivation to participate, given the considerable effort needed to keep informed and update the forecasts each time.

In some cases, when historical information is not available, concealing data is not an option. In such cases, the real-time design is the only alternative. Examples include elections and sports forecasting, where a single evaluation origin will typically be possible. However, participants may also be allowed to submit multiple forecasts (or revise previously submitted forecasts) until a particular point in time in live submission setups that include, for instance, prediction markets.

3.8. Performance measurement

Another important decision in designing a competition is how performance will be measured and evaluated. It is common for the performance of the (point) forecasts to be evaluated using statistical error measures. The choice of such measures should be based on a variety of factors, such as their theoretical foundation, applicability, and interpretability. Nowadays, relative and scaled error measures are generally preferred to percentage ones (Hyndman and Koehler, 2006); however, the latter are still dominant in practice as they are more intuitive. The evaluation of the estimation of the uncertainty around the forecasts can be performed using interval scores and proper scoring rules (Makridakis et al., 2020e). Proper scoring rules can address both sharpness and calibration, which is relevant for estimating performance under fat tails. In all cases, however, robust measures with well-known statistical properties should be preferred, so as to interpret the results and be confident of their value.
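For concreteness, the sketch below implements one measure from each family discussed above: the percentage-based sMAPE, the scaled MASE (Hyndman and Koehler, 2006), and the pinball (quantile) loss, a proper score for evaluating quantile forecasts. The numbers are illustrative only.

```python
# Sketch: three common families of forecast performance measures.
import numpy as np

def smape(y, f):
    """Symmetric MAPE, in percent."""
    return 100 * np.mean(2 * np.abs(y - f) / (np.abs(y) + np.abs(f)))

def mase(y, f, insample, m=1):
    """Scale by the in-sample MAE of the (seasonal) naive method, period m."""
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(y - f)) / scale

def pinball(y, q_forecast, tau):
    """Proper score for the tau-th quantile; lower is better."""
    diff = y - q_forecast
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

insample = np.array([112., 118, 132, 129, 121, 135, 148, 148, 136, 119])
actuals  = np.array([104., 118, 115, 126, 141])
point    = np.array([119., 119, 119, 119, 119])  # naive forecast
q90      = point * 1.10                          # illustrative 90% quantile

print(f"sMAPE: {smape(actuals, point):.2f}%")
print(f"MASE: {mase(actuals, point, insample):.3f}")
print(f"Pinball (tau=0.9): {pinball(actuals, q90, 0.9):.3f}")
```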
In cases where the importance (volume and value) of the predicted events varies, performance measurements may include weighting schemes that account for such differences. This is especially true when evaluating the performance of hierarchically structured data, where some aggregation levels may be more important than others based on the decisions that the forecasts will support. For instance, product-store forecasts may be considered more important than regional ones when used for supply-chain management purposes, with the opposite being true in cases where forecasts are used for budgeting purposes. Similarly, forecasts that refer to more expensive or perishable products may be weighted more than those that refer to inexpensive, fast-moving ones.

Whenever possible, instead of measuring the performance of the forecasts, one should measure their utility value directly. For instance, if the forecasts refer to investment decisions, the actual profit/loss from such investments could be measured. If the forecasts were to be used in a supply-chain setting, then inventory-related costs, achieved service levels, and/or the variance of the forecast variable can be useful measurements of their utility (Petropoulos et al., 2019). If more than two performance indicators need to be considered, then multicriteria techniques could be used to balance the performance across the chosen criteria. A simpler approach would be to assume equal importance across criteria and apply a root mean square evaluation measure. Care should be taken to address any double-counting that can arise when evaluating hierarchical series with multiple related levels.

Another critical factor in evaluating forecasts is the cost relating to various functions of the forecasting process, including data collection, the computational resources required to produce the forecasts (Nikolopoulos and Petropoulos, 2018), and the personnel time needed to revise/finalize such forecasts when judgment is involved. In standard forecasting competitions, where data is provided and the submission format usually refers to automatic forecasts, the computational cost can be easily measured by sharing the code used for their production and reproducing them. Once the computational cost is determined, it is important to contrast any improvements in performance against any additional costs. Effectively, this becomes a Forecast Value Added (FVA) exercise (Gilliland, 2013, 2019), accepting that computational time is often subject to programming skills and optimization techniques, making its correct estimation a considerable challenge.

3.9. Benchmarking

A decision similarly important to selecting the performance measurements is the choice of appropriate benchmarks. Such benchmarks should include both traditional and state-of-the-art models and algorithms that are suitable for the competition, based on its scope and the particularities of the data. Usually, benchmarks include individual methods that have performed well in previous, similar competitions, are considered standard approaches for the forecasting task at hand, or display a performance which is considered a minimum for such a task. For example, ARIMAX, linear regression, and decision-tree-based models can be used as benchmarks in competitions that involve explanatory/exogenous variables; Croston's method in competitions that refer to inventory forecasting; the winning methods of the first three M competitions for the fourth one; and a random walk model for the performance of a major index such as the S&P 500 or FTSE in a stock market competition. Simple combinations of state-of-the-art methods are also useful benchmarks, especially given the ample evidence on their competitive performance (Makridakis et al., 2020a). It is good practice that the implementation of the benchmark methods is fully specified. This will allow participants to obtain a valid starting point for their investigation and facilitate transparency and reproducibility, indicating the additional value added of a proposed method over that of an appropriate benchmark.
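As an illustration of what "fully specified" benchmarks can look like, the sketch below implements a naive (random walk) method, a seasonal naive method, simple exponential smoothing with a fixed parameter, and their equal-weight combination. These follow common textbook definitions; an actual competition would document its exact configurations in the same spirit.

```python
# Sketch: fully specified, reproducible benchmark methods.
import numpy as np

def naive(history: np.ndarray, horizon: int) -> np.ndarray:
    """Random-walk benchmark: repeat the last observation."""
    return np.repeat(history[-1], horizon)

def seasonal_naive(history: np.ndarray, horizon: int, m: int) -> np.ndarray:
    """Repeat the last full seasonal cycle of length m."""
    last_cycle = history[-m:]
    return np.array([last_cycle[h % m] for h in range(horizon)])

def ses(history: np.ndarray, horizon: int, alpha: float = 0.3) -> np.ndarray:
    """Simple exponential smoothing with a fixed smoothing parameter."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return np.repeat(level, horizon)

def combination(history: np.ndarray, horizon: int, m: int) -> np.ndarray:
    """Equal-weight combination of the three benchmarks above."""
    return np.mean([naive(history, horizon),
                    seasonal_naive(history, horizon, m),
                    ses(history, horizon)], axis=0)

history = np.array([10., 12, 14, 11, 10, 13, 15, 12, 11, 14, 16, 13] * 3)
print(combination(history, horizon=6, m=12))
```

The equal-weight combination reflects the evidence, cited above, that simple combinations are consistently competitive benchmarks.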
3.10. Learning

Regardless of the design of the competition, its objective should not be just to determine the winners of the examined forecasting task, but also to learn how to advance the theory and practice of forecasting by identifying the factors that contribute to improving forecasting accuracy and the estimation of uncertainty. This has been the case for the competitions organized by academics, but not for all others. In order to allow for such learning, sufficient information is required about how the forecasts were made by the participants, with the code used (where applicable) also being published to facilitate the replicability or reproducibility of the results (Boylan et al., 2015; Makridakis et al., 2018). For instance, this was true of the M4 competition, where the vast majority of the methods were effectively reproduced by the organizers, but not of the M5, where this was only done for the winners, who were obliged to provide a clear description of their method along with their code, as well as for a small number of the top 50 submissions that complied with the repeated requests of the organizers to share such information.

Another idea would be for the organizers to state specific hypotheses before launching the competition, in order to test their predictions once the actual results become available, thus learning from their successes and mistakes. Such an approach would highlight the exact expectations of the competition and clarify its objectives, avoiding the problem of rationalizing the findings after the fact and allowing the equivalent of the scientific method, widely used in physical and life sciences, to be utilized in forecasting studies. This practice was followed in the M4 competition with positive results (Makridakis et al., 2020d) and has been repeated with the M5.

Finally, future forecasting competitions should challenge the findings of previous ones, testing the replicability of their results and trying to identify new, better forecasting practices as new, more accurate methods become available. For example, combining the forecasts of more than one method has been a consistent finding of all competitions, and it has also flourished with machine and deep learning methods, where ensembles of numerous individual models are used for producing the final forecasts. Another critical finding, lasting until the M4 competition, was that simple methods were at least as accurate as more sophisticated ones. This finding was reversed with the M4 and M5, as well as the latest Kaggle competitions, indicating the need for dynamic learning, where new findings may reverse previous ones as more accurate methods surpass older ones.

4. Mapping the design attributes of past competitions

In this section we map the design attributes discussed in section 3 to past, major forecasting competitions with the aim of identifying their commonalities and design gaps, while also highlighting the advantages and drawbacks of each. We focus on the major competitions organized by the community of the International Institute of Forecasters (IIF), but also on recent competitions hosted on Kaggle. In total, we consider seventeen forecasting competitions, which are listed in the rows of Tables 2, 3, and 4, with the columns of the tables presenting the various design attributes discussed in the previous section.
Of the seventeen competitions conducted in the last 40 years, seven were hosted by Kaggle, five were M competitions, three were energy ones, while there was a single tourism and a single neural network competition.

Table 2: Mapping the design attributes of past competitions: Scope.
Each row lists: Competition (Leader; Year): focus (Generic / Specific / Semi-specific, with the domain covered); type of submission (Numerical / Judgmental); format of submission (Point forecasts / Uncertainty estimates / Decisions).

- M or M1 (Makridakis; 1982): Generic (Macro, Micro, Industry & Demographic); Numerical; Point forecasts.
- M2 (Makridakis; 1993): Generic (Micro & Macro); Numerical & Judgmental; Point forecasts.
- M3 (Makridakis; 2000): Generic (Micro, Macro, Industry, Demographic, Finance & Other); Numerical; Point forecasts.
- NN3 (Crone; 2006): Specific (Industry); Numerical; Point forecasts.
- Tourism (Athanasopoulos; 2011): Specific (Tourism); Numerical; Point forecasts & Uncertainty estimates.
- GEFCom 2014 (Hong; 2014): Semi-specific (Load / Wind / Solar / Price); Numerical; Uncertainty estimates (99 quantiles).
- GEFCom 2017 (Hong; 2017): Specific (Load); Numerical; Point forecasts & Uncertainty estimates.
- Walmart Recruiting II: Sales in Stormy Weather (2015): Specific (retail sales of weather-sensitive products); Numerical; Point forecasts.
- Rossmann Store Sales (2015): Specific (drug store sales); Numerical; Point forecasts.
- Grupo Bimbo Inventory Demand (2016): Specific (bakery goods sales); Numerical; Point forecasts.
- Web Traffic Time Series Forecasting (2017): Specific (traffic of web pages); Numerical; Point forecasts.
- Corporación Favorita Grocery Sales Forecasting (2018): Specific (grocery store sales); Numerical; Point forecasts.
- Recruit Restaurant Visitor Forecasting (2018): Specific (restaurant visits); Numerical; Point forecasts.

Table 3: Mapping the design attributes of past competitions: Diversity and representativeness (specified based on the origin and size of the data set as well as the length of the period examined); Data structure; Data granularity; Data availability. Each row lists: Competition (Leader; Year): diversity and representativeness; data structure (hierarchies and exogenous variables, noted where present); data granularity (cross-sectional; temporal); data availability (number of events; observations per event, min-median-max).
- M or M1 (Makridakis; 1982): Moderate diversity; cross-sectional: from country to company; temporal: Monthly, Quarterly & Yearly; 1001 events; 30-66-132 (Monthly), 10-40-106 (Quarterly), 9-15-52 (Yearly).
- M2 (Makridakis; 1993): Low diversity; exogenous variables; cross-sectional: country and company; temporal: Monthly & Quarterly; 29 events; 45-82-225 (Monthly) & 167-167-167 (Quarterly).
- M3 (Makridakis; 2000): Moderate diversity; cross-sectional: from country to company; temporal: Monthly, Quarterly, Yearly & Other; 3003 events; 48-115-126 (Monthly), 16-44-64 (Quarterly), 14-19-41 (Yearly), 63-63-96 (Other).
- NN3 (Crone; 2006): Low diversity; cross-sectional: from country to region; temporal: Monthly; 111 events; 50-116-126.
- Tourism (Athanasopoulos; 2011): Moderate diversity; exogenous variables; cross-sectional: from country to company; temporal: Yearly, Quarterly & Monthly; 1311 events; 7-23-43 (Yearly), 22-102-122 (Quarterly) & 67-306-309 (Monthly).
- GEFCom 2012 (Hong; 2012): Moderate diversity; hierarchical / no hierarchy; exogenous variables: only for train data / yes; cross-sectional: utility zone / wind farm; temporal: Hourly; 44397 events; 38070/19033 observations.
- GEFCom 2014 (Hong; 2014): Moderate diversity; exogenous variables; cross-sectional: utility / wind farm / solar plant / zone; temporal: Hourly; 1/10/3/1 events; round-based: 50376-60600 / 6576-16800 / 8760-18984 / 21528-25944.
- GEFCom 2017 (Hong; 2017): Moderate diversity; hierarchical; exogenous variables; cross-sectional: delivery point meters (zones in qualifying match); temporal: Hourly; 161 out of 169 events (8 in qualifying match); 2232-61337 (round based; 119904-122736 in qualifying match).
- M4 (Makridakis; 2018): High diversity; cross-sectional: from country to company; temporal: Monthly, Quarterly, Yearly, Daily, Hourly & Weekly; 100000 events; 42-202-2794 (Monthly), 16-88-866 (Quarterly), 13-29-835 (Yearly), 93-2940-9919 (Daily), 700-960-960 (Hourly) & 80-934-2597 (Weekly).
- M5 (Makridakis; 2020): Moderate diversity; grouped hierarchy; exogenous variables; cross-sectional: store-product; temporal: Daily; 30490 events; 96-1782-1941.
- Walmart Recruiting - Store Sales Forecasting (2014): Moderate diversity; grouped hierarchy; exogenous variables; cross-sectional: store-department; temporal: Weekly; 3331 events; 1-143-143.
- Walmart Recruiting II: Sales in Stormy Weather (2015): Moderate diversity; grouped hierarchy; exogenous variables; cross-sectional: store-product; temporal: Daily; 4995 events; 851-914-1011.
- Rossmann Store Sales (2015): Moderate diversity; grouped hierarchy; exogenous variables; cross-sectional: store; temporal: Daily; 1115 events; 941-942-942.
- Grupo Bimbo Inventory Demand (2016): Moderate diversity; hierarchical; cross-sectional: store-product; temporal: Weekly; 26396648 events; 1-2-7.
- Web Traffic Time Series Forecasting (2017): Moderate diversity; grouped hierarchy; cross-sectional: page and traffic type; temporal: Daily; 145063 events; 803.
- Corporación Favorita Grocery Sales Forecasting (2018): Moderate diversity; grouped hierarchy; exogenous variables; cross-sectional: store-product; temporal: Daily; 174685 events; 1-1687-1688.
- Recruit Restaurant Visitor Forecasting (2018): Moderate diversity; grouped hierarchy; exogenous variables; cross-sectional: restaurant; temporal: Daily; 829 events; 47-296-478.

Table 4: Mapping the design attributes of past competitions: Forecasting horizon; Evaluation setup; Performance measurement; Benchmarking; Learning. Each row lists: Competition (Leader; Year): forecasting horizon; evaluation setup (live and/or number of rounds); performance measurement (all forecast-based; none utility- or cost-based); benchmarks; and whether learning was an explicit objective.
- M or M1 (Makridakis; 1982): horizon 1-18 (Monthly), 1-8 (Quarterly) & 1-6 (Yearly); 1 round; MAPE, MSE, AR, MdAPE & PB; benchmarks: Naive & ES; learning: yes.
- M2 (Makridakis; 1993): horizon 1-15 (Monthly) & 1-5 (Quarterly); live; learning: yes.
- M3 (Makridakis; 2000): horizon 1-18 (Monthly), 1-8 (Quarterly), 1-6 (Yearly) & 1-8 (Other); 1 round; sMAPE; benchmarks: Naive, ES, Combination of ES & ARIMA; learning: yes.
- NN3 (Crone; 2006): horizon 1-18; 1 round; sMAPE; benchmarks: Naive, ES, Theta, Combination of ES, Expert systems, ARIMA, vanilla NNs and SVR; learning: yes.
- Tourism (Athanasopoulos; 2011): horizon 1-4 (Yearly), 1-8 (Quarterly) & 1-24 (Monthly); 1 round; MASE & Coverage; benchmarks: Naive, ES, Theta, ARIMA, Expert systems & models with explanatory variables; learning: yes.
- GEFCom 2012 (Hong; 2012): horizon 1-168 / 1-48; 1 round / 1 of 157 periods; RMSE; benchmarks: Vanilla MLR / Naive; learning: yes.
- GEFCom 2014 (Hong; 2014): horizon 1-24 for Price & (1-31)*24 for the rest; 15 rounds; PL improvement over benchmark, adjusted for simplicity and quality; benchmark: Naive; learning: yes.
- GEFCom 2017 (Hong; 2017): horizon 8784 ((1-31)*24 in qualifying match); live; learning: yes.
- M4 (Makridakis; 2018): horizon 1-18 (Monthly), 1-8 (Quarterly), 1-6 (Yearly), 1-14 (Daily), 1-48 (Hourly) & 1-13 (Weekly); 1 round; OWA & MSIS; benchmarks: Naive, ES, Theta, Combination of ES, ARIMA & vanilla NNs; learning: yes.
- M5 (Makridakis; 2020): horizon 1-28; 1 round; WRMSSE/WSPL; benchmarks: Naive, ES, ARIMA, Croston and variants, Combinations, NNs, RTs; learning: yes.
- Walmart Recruiting - Store Sales Forecasting (2014): horizon 1-39; 1 round; WMAE; benchmark: All zeros.
- Walmart Recruiting II: Sales in Stormy Weather (2015): horizon 1-25 (in an interpolation fashion); 1 round; RMSLE; benchmark: All zeros.
- Rossmann Store Sales (2015): horizon 1-48; 1 round; RMSPE; benchmarks: All zeros, Median day of week.
- Grupo Bimbo Inventory Demand (2016): horizon 1-2; 1 round; RMSLE; benchmark: All sevens.
- Web Traffic Time Series Forecasting (2017): horizon 3-65; live.

There are several common attributes characterizing practically all seventeen competitions. First, the submissions required were all numerical, except for M2, which asked, in addition, for judgmental inputs from the forecasters. Second, the evaluation setup was that of concealing some data for evaluating the forecasts, which was also done a single time in all but three competitions (M2, GEF2012, and GEF2017). Third, there were only three live competitions (M2, GEF2017, and Web Traffic Time Series Forecasting), which were also limited to a small number of evaluation rounds. Fourth, the majority of the competitions (fifteen out of the seventeen) required point forecasts, while five also demanded uncertainty estimates, ranging from 2 quantiles in M4 to 99 in GEF2014. Fifth, while there is a balance between generic, specific, and semi-specific competitions, we observe that specific ones focus on tourism, energy, and retail forecasting applications, with the majority of the specific ones including high-frequency, hierarchically structured series and explanatory/exogenous variables, while the generic ones focus on lower-frequency data, such as yearly, quarterly, and monthly, that were not accompanied by additional information. Moreover, there seems to be a trend towards more detailed data sets, as more recent competitions move from individual time series to hierarchically structured ones that may be influenced by explanatory/exogenous variables. Sixth, none of the competitions required submissions in the form of decisions or evaluated their performance in terms of utility or cost-based measures, utilizing instead various statistical measures that build on absolute, squared, and percentage errors. Finally, with the exception of the competitions organized by academics, little emphasis was given to the element of learning and how to improve forecasting performance, and few, non-competitive benchmarks were considered for evaluating such improvements.
For instance, the M and energy competitions included several variations of naive approaches, combinations of exponential smoothing models, ARIMA, the Theta method, and simple machine learning or statistical regression methods, while the Kaggle ones featured only naive methods and dummy submissions.

These observations reveal both a consensus to apply what has worked in the past and is easy to implement in practice, as well as a desire for experimentation. What is clear from Tables 2, 3, and 4 is the difference between the top twelve competitions organized by the academic community and the last five hosted by Kaggle. In the former, the emphasis is on learning by publishing the results in peer-reviewed journals and providing open access to the data and forecasts so that others can comment on the findings, respond to their value, and suggest improvements for future ones. Thus, it is not surprising that the number of citations received by the former (close to 5,500) is significantly higher than that of the latter (probably limited to less than 100). Citations are an integral part of learning, as other researchers read the cited work, become aware of its findings, and then try to extend them in additional directions. At the same time, the Kaggle approach encourages cooperation among competitors to come up with the best solution to the problem at hand, without concern for the dissemination of the findings to the wider data science community. A clear breakthrough will come from combining the academic and Kaggle approaches by exploiting the advantages of both, as there is no reason that Kaggle scientists will not be willing to share their knowledge so that others can also learn from their experience. In our view, such a breakthrough is inevitable in the near future.
5. Future forecasting competitions
In the previous sections we discussed the design aspects of forecasting competitions and mapped these to the past ones. Then, we elaborated on the design opportunities, i.e., the gaps that past forecasting competitions have left. In this section, we propose some principles for future competitions:
Replicability.
One crucial aspect of any research study is that its results should be replicable. This has been an increasing concern across the sciences (Goodman et al., 2016), including the forecasting field (Boylan et al., 2015; Makridakis et al., 2018). The design of forecasting competitions needs to be transparent and allow for the full replicability of the results. One way to achieve this is through the submission of the source code (or at least an executable file) of the participating solutions, coupled with sufficient descriptions and open libraries for benchmarks and performance measures. Reproducibility will also allow those interested to test whether the results of a forecasting competition hold for other data sets, performance measures, forecasting horizons, and testing periods, while also enabling computational cost comparisons. In addition, it would enable rolling-origin evaluation to be done in an automated fashion, reflecting the realistic situation in which forecasting models are built and then run repeatedly without the opportunity to tweak them each time an output is generated.

Representativeness.
If possible, organizers of forecasting competitions should aim for a diverse and representative set of data. A high degree of representativeness (Spiliotis et al., 2020a) will allow for a fuller analysis of the results, enabling us to understand the conditions under which some methods perform better than others. Moving away from "the overall top-performing solution wins it all", we will be able to effectively understand the importance of particular features (including frequencies) and gain insights into the performance of various methods for specific industries or organizations. One strategy to improve representativeness could be to look at the feature space of the time series included in a competition in comparison with other samples from the relevant population of series.

Forecasting under highly stable conditions offers little challenge. Therefore, competition organizers should consider evaluating forecasts across a range of conditions, including conditions where past patterns/relationships are bound to fail (e.g., structural changes, fat tails, recessions, pandemics), to identify methods that are more robust under such conditions, in order to offer valuable insights and enhance our understanding of managing such situations. Moreover, competitive actions and reactions should be included, as this is the reality in which modern companies operate. Future competitions could also explore the possibility of multivariate (but not hierarchically structured) sets of data that also include information coming directly from online data devices, including nowcasting.

Robust evaluation.
For the results of a competition to be meaningful, robust evaluation strategies must be considered. Such strategies could include rolling evaluation schemes and, for seasonal time series, evaluation periods that cover many different times within the calendar year, if not one or more complete years. This would mitigate the sampling bias caused by evaluating forecasts over a short interval. Organizers could also consider evaluating hold-out sets for representativeness using the principles discussed above. Future competitions could also offer multiple evaluation rounds in a live setup. Undoubtedly, this would add a more pragmatic dimension to forecasting competitions.

Measuring impact on decisions.
Reflecting reality may include examining how a forecasting solution is actually implemented in practice, but also offering metrics of performance measurement that are directly linked to decisions. We argue that future forecasting competitions may need to shift their focus to measuring the utility of the forecasts/uncertainty directly. In many applications, the translation from point and probabilistic forecasts to their decision-making implications is a big step and a formidable challenge, as utility can be not only non-linear but also non-monotonic. Whenever possible, such utility should be expressed in monetary terms, which would allow comparing meaningful trade-offs. Such trade-offs could include conflicting optimization criteria (such as inventory holdings versus service levels) but would also allow for a more systematic value-added analysis of the complexity of the participating solutions and their computational (or other) cost. However, we should be careful to distinguish between evaluating the impact of forecasts on decisions and evaluating the impact of the decisions themselves.
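A minimal sketch of what such a utility-based evaluation could look like is given below, assuming a simple order-up-to inventory rule with illustrative holding and shortage costs; the rule and the cost values are ours, not a prescription.

```python
# Sketch: scoring forecasts by decision utility (inventory cost) rather
# than by statistical error.
import numpy as np

def inventory_cost(actuals, forecasts, safety_stock=5.0,
                   holding_cost=1.0, shortage_cost=9.0):
    """Total cost of ordering up to forecast + safety stock each period."""
    total = 0.0
    for y, f in zip(actuals, forecasts):
        order_up_to = f + safety_stock
        leftover = max(order_up_to - y, 0.0)   # units held, at holding cost
        shortfall = max(y - order_up_to, 0.0)  # lost sales, at penalty cost
        total += holding_cost * leftover + shortage_cost * shortfall
    return total

actuals    = np.array([52., 48, 61, 55, 70])
forecast_a = np.array([50., 50, 50, 50, 50])  # flat forecast
forecast_b = np.array([53., 49, 60, 56, 68])  # more responsive forecast
print("Cost of A:", inventory_cost(actuals, forecast_a))
print("Cost of B:", inventory_cost(actuals, forecast_b))
```

Two forecasts with similar average errors can produce very different costs under such a rule, which is precisely why utility-based evaluation adds information beyond statistical accuracy.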
Showcase forecast-value-added (FVA).

Forecasting competitions need to clearly demonstrate the added value of a proposed solution over state-of-the-art methods and benchmarks. The choice of benchmarks is wide and could include top-performing methods from previous competitions. For instance, a future large-scale generic forecasting competition could have as a benchmark the winning method of Smyl (2020) in the M4 competition, or N-Beats. Also, a future competition on retail forecasting should include as benchmarks the top methods from M5 or other Kaggle competitions. Finally, the inclusion of past winning approaches as benchmarks can act as a way of measuring improvements from new competitions and determining the value they have added to forecasting performance. We suggest that an FVA analysis be multifold and include not only the performance of the point forecasts, but also the performance in estimating uncertainty, dealing with fat tails, and the computational cost and complexity of each method. The last two aspects (complexity and cost) are increasingly important for the acceptance and successful implementation of a method, particularly when millions of forecasts/estimates of uncertainty are needed on a weekly basis.
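In its simplest form, such an FVA comparison could look like the following sketch, where the error values and timings are illustrative placeholders contrasting accuracy gains with computational cost.

```python
# Sketch: a simple forecast-value-added (FVA) comparison against a benchmark.
def fva(benchmark_error: float, method_error: float) -> float:
    """Percentage error reduction relative to the benchmark."""
    return 100 * (benchmark_error - method_error) / benchmark_error

benchmark_mase, method_mase = 1.00, 0.88   # illustrative accuracy scores
benchmark_secs, method_secs = 2.0, 3600.0  # illustrative compute times
print(f"FVA: {fva(benchmark_mase, method_mase):.1f}% accuracy gain "
      f"at {method_secs / benchmark_secs:.0f}x the computational cost")
```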
Enhancing knowledge.
We would like to see future competitions focus on contributing new learning and insights to the forecasting community, moving away from a horse-race exercise towards bridging the gap between theory and practice. They should be able to show how the results can be implemented to improve the baseline and what the consequences of forecasting accuracy/uncertainty are for decision making. Although not an objective of all past forecasting competitions, learning must become an integral part of all future ones to maximize their expected value by making their findings widely known to anyone wishing to utilize them to improve the theory or practice of forecasting. The current trend towards open access to knowledge must be applied to forecasting competitions, as their findings will improve the much-discussed circular economy by eliminating waste and achieving optimal results across a wide variety of operational and strategic areas.
Merging the academic and Kaggle approaches.
There is much to gain and nothing to lose by combining the academic approach of disseminating learning and achieving high citations with the Kaggle approach of encouraging high collaboration and open participation among the participating groups. Facilitating learning by widely disseminating the findings of Kaggle competitions will benefit the entire data science community and avoid concerns about their relevance (see Chawla, 2020). At the same time, stimulating a more supportive, collaborative spirit in academic competitions can encourage innovation and foster team effort, as long as some clever ways of supporting collaborative work can be adopted.

We note that one strategy that enables both replicability and robust evaluation is the use of code-only competitions, where the organizers of the competition use the submitted code to produce forecasts for multiple origins. Participants may be given the option to alter their code at key points; for instance, the participants may resubmit their code every quarter, while forecasts are produced and evaluated every week. Such a strategy also reflects the real-world situation in that a forecasting model used in practice may not be able to benefit from manual tweaking between each subsequent forecast generation, which would also lead to unreasonably high costs in terms of post-performance analysis and re-engineering.

Making and using forecasts has progressed a great deal during the last four decades, based on the findings of forecasting competitions that, as Hyndman (2020) mentions, have contributed a great deal to improving the theory and practice of forecasting and provide considerable value to business firms using such predictions to improve their operations. Forecasting competitions could be expanded beyond business applications to other social science areas to provide objective information and improve policy and decision making. In addition, uncertainty needs to receive attention among academics and practitioners alike. It must be accepted that uncertainty will always exist and cannot be avoided or reduced, no matter how much we would like to live in a world without uncertainty. What we have to do is understand its risk implications and consider what actions to take to minimize the negative consequences involved. Directly linking forecasting competitions with decision-making aspects and the utility of the forecasts is also very important and will allow us to gain further insights into the use of forecasts in practice.

Competitions could also change from featuring a single challenge to multiple ones, with a winner in each challenge and an overall winner for the entire competition. For example, a future forecasting competition could be a pentathlon (or hexathlon, or heptathlon...), where the various challenges could be organized around domain skills, such as (i) forecasting of univariate series with no exogenous information, (ii) forecasting of multivariate series, (iii) forecasting of series with exogenous information (e.g., weather, price, promotion activity, competitor actions, etc.), (iv) long-range forecasting with market or competitor uncertainties, (v) forecasting of intermittent series, and (vi) lifecycle forecasting, among others. We view this structure as valuable for a comprehensive forecasting competition for several reasons: (i) it may be impossible to cover all of the ideal aspects and core forecasting skills in a single challenge; (ii) the use of multiple challenges allows for greater diversity of application domains;
Making and using forecasts has progressed a great deal during the last four decades, based on the findings of forecasting competitions that, as Hyndman (2020) mentions, have contributed substantially to improving the theory and practice of forecasting and provide considerable value to business firms using such predictions to improve their operations. Forecasting competitions could be expanded beyond business applications to other social science areas to provide objective information and improve policy and decision making. In addition, uncertainty needs to receive attention among academics and practitioners alike. It must be accepted that uncertainty will always exist and cannot be avoided or reduced, however much we would like to live in a world without it. What we have to do is understand its risk implications and consider what actions to take to minimize the negative consequences involved. Directly linking forecasting competitions with decision-making aspects and the utility of the forecasts is also very important and will allow us to gain further insights into the use of forecasts in practice.

Competitions could also change from featuring a single challenge to multiple ones, with a winner in each challenge and an overall winner for the entire competition. For example, a future forecasting competition could be a pentathlon (or hexathlon, or heptathlon...), where the various challenges could be organized around domain skills, such as (i) forecasting of univariate series with no exogenous information, (ii) forecasting of multivariate series, (iii) forecasting of series with exogenous information (e.g., weather, price, promotional activity, competitor actions), (iv) long-range forecasting with market or competitor uncertainties, (v) forecasting of intermittent series, and (vi) lifecycle forecasting. We view this structure as valuable for a comprehensive forecasting competition for several reasons:

- It may be impossible to cover all of the ideal aspects and core forecasting skills in a single challenge.
- The use of multiple challenges would allow for greater diversity of application domains.
- This would enable the evaluation of participants in multiple skill domains and would reduce the randomness in the final results and rankings.

A sketch of one possible scoring scheme for such a multi-challenge competition is given at the end of this section.

Another possibility would be the organization of challenges around applications. For example, within a manufacturing company, these could include forecasting for (i) inventory, (ii) scheduling, (iii) budgeting, (iv) cash flows, (v) long-range planning, and (vi) human resources, among others. Such challenges would better reflect reality and showcase forecast value added (FVA) since, in real life, for an organization to thrive, accurate forecasts and correct estimates of uncertainty are required for multiple aspects of its strategy-, planning-, and operations-related decisions.

Future domain-specific competitions could focus on new application areas, covering the economy (gross domestic product, monetary policies, interest rates), finance (stocks, commodities), operations (new products, promotional forecasting, spare parts, predictive maintenance, reverse logistics), healthcare (epidemics, healthcare management, mortality, preventable medical errors), climate, sports, elections, call centers, big projects and megaprojects, transportation, and online commerce, among others. Finally, the increasing role of judgment in various aspects of the forecasting process, such as adjusting/finalizing forecasts or even selecting between models, calls for further investigation into how it could be included in future competitions.

Overall, we foresee that forecasting competitions still have much to offer if they are designed to represent reality even more closely. If forecasting competitions are run systematically and consistently, they will allow for comparisons and for assessing improvements over time, while also covering various application areas and time horizons.
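Returning to the pentathlon idea, the following is a minimal sketch of how per-challenge winners and an overall winner might be determined. The participants, scores, and mean-rank aggregation rule are illustrative assumptions; organizers could equally weight challenges differently or use another rank-aggregation method.

```python
# Sketch of pentathlon-style scoring: one winner per challenge plus an
# overall winner by mean rank across challenges (lower scores are better).

def challenge_ranks(scores):
    """Map each participant to its rank (1 = best) within one challenge."""
    ordered = sorted(scores, key=scores.get)
    return {participant: position + 1 for position, participant in enumerate(ordered)}

# Hypothetical accuracy scores (e.g., scaled errors) for three participants.
challenges = {
    "univariate":   {"A": 0.81, "B": 0.84, "C": 0.90},
    "multivariate": {"A": 0.95, "B": 0.88, "C": 0.91},
    "exogenous":    {"A": 0.75, "B": 0.80, "C": 0.77},
    "long-range":   {"A": 1.20, "B": 1.10, "C": 1.15},
    "intermittent": {"A": 1.10, "B": 0.95, "C": 1.02},
}

ranks = {name: challenge_ranks(scores) for name, scores in challenges.items()}
for name, challenge_rank in ranks.items():
    print(f"{name} winner: {min(challenge_rank, key=challenge_rank.get)}")

# The overall winner is the participant with the smallest mean rank.
participants = list(next(iter(challenges.values())))
mean_rank = {p: sum(ranks[name][p] for name in challenges) / len(challenges)
             for p in participants}
print(f"overall winner: {min(mean_rank, key=mean_rank.get)}")
```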
6. Conclusions
Forecasting competitions, the equivalent of laboratory experimentation in the physical and life sciences, provide useful, objective information to improve the theory and practice of forecasting, advancing the field and enhancing decision and policy making. This paper has described all major past competitions, discussed their design attributes, and identified those of “ideal” competitions, extending their coverage to a multitude of applications and social science areas, echoing Hyndman’s suggestion that the main objective of competitions is learning as much as possible rather than identifying winners.

The main part of the paper described ten design attributes to be considered by the organizers of competitions, who need to decide which are relevant for their own, considering trade-offs between optimal choices and practical concerns like costs, as well as elements related to the time and effort required to participate. Next, the paper mapped all pertinent past competitions with respect to the described design attributes, identifying similarities and differences between the competitions, as well as design gaps, and making suggestions about the attributes that future competitions should consider, putting particular emphasis on learning as much as possible from their implementation in order to help improve forecasting accuracy and uncertainty.

The majority of past competitions concentrated on point forecasts. Our proposal is that all future competitions should also request probabilistic forecasts for a sufficient number of quantiles so that both the main part of the uncertainty distribution and its tails are effectively captured. This is of critical importance since both point forecasts and uncertainty estimates need to be considered in all future-oriented decisions. One standard scoring rule for such quantile submissions is sketched below.
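The pinball (quantile) loss is one widely used way of scoring forecasts submitted as a set of quantiles, penalizing under- and over-prediction asymmetrically according to the quantile level. The quantile levels and toy numbers below are illustrative assumptions, not the scoring design of any particular competition.

```python
# Minimal sketch of the pinball (quantile) loss for probabilistic forecasts.
import numpy as np

def pinball_loss(actuals, quantile_forecasts, q):
    """Penalize under-prediction in proportion to q and
    over-prediction in proportion to (1 - q)."""
    diff = actuals - quantile_forecasts
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

actuals = np.array([100., 120., 90.])
quantile_levels = [0.025, 0.25, 0.5, 0.75, 0.975]  # covers the tails and the centre
forecasts = {
    0.025: np.array([70., 85., 60.]),
    0.25:  np.array([90., 105., 80.]),
    0.5:   np.array([100., 115., 92.]),
    0.75:  np.array([110., 128., 101.]),
    0.975: np.array([135., 150., 125.]),
}

average = np.mean([pinball_loss(actuals, forecasts[q], q) for q in quantile_levels])
print(f"average pinball loss: {average:.3f}")
```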
Another concentration of past competitions has been the single-origin, concealed-data evaluation setup, which is the easiest to implement and requires the least time to participate in. This practice will have to change, first by expanding the evaluation setup to several rolling origins and then by moving to rolling live competitions, which may be the hardest to run but provide the greatest value, as they operate on a real-time basis where all information is currently available and judgmental inputs can be directly incorporated. Clearly, trade-offs will need to be considered between the number of rolling origins used and the amount of effort required to complete the competition, with similar trade-offs deliberated between live and concealed-data competitions. Competitions are costly to run, requiring a considerable amount of effort both to implement and to participate in. Their advantage is the objective evidence they provide to improve the theory and practice of forecasting. As such, they must continue, and perhaps their costs could be financed by a joint industry or specific-group effort in search of solutions to improve the accuracy and uncertainty of their specific predictions. Whatever the solution, the practice of forecasting competitions must expand in the future to gain the maximum benefits from their findings.

The final section of the paper ends with the observation that the task of forecasting presents a multitude of challenges for organizations and societies. Business firms, for instance, must predict the level of their inventories for the large number of items sold in their stores, schedule their production and workforce, and estimate their budget requirements as well as their long-term strategic plans, including competitive and technological forecasts. Moreover, economic forecasting is also necessary at the societal level, as are energy, climate, and health predictions. Such a multitude of challenges cannot be met with a single competition. Instead, a number of them would be needed, as in a pentathlon where different challenges take place, identifying the winner of each but also the overall winner who contributes the most to the overall forecasting effort across various areas or even industries with varying characteristics.

References
Askanazi, R., Diebold, F. X., Schorfheide, F., Shin, M., 2018. On the comparison of interval forecasts. Journal of Time Series Analysis 39 (6), 953–965.
Athanasopoulos, G., Hyndman, R. J., Song, H., Wu, D. C., 2011. The tourism forecasting competition. International Journal of Forecasting 27 (3), 822–844.
Bojer, C. S., Meldgaard, J. P., 2020. Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting.
Boylan, J. E., Goodwin, P., Mohammadipour, M., Syntetos, A. A., 2015. Reproducibility in forecasting research. International Journal of Forecasting 31 (1), 79–90.
Chawla, V., 2020. How much is Kaggle relevant for real-life data science? https://analyticsindiamag.com/how-much-is-kaggle-relevant-for-real-life-data-science/, accessed: 2021-1-5.
Fry, C., Brundage, M., 2020. The M4 forecasting competition – a practitioner's view. International Journal of Forecasting 36 (1), 156–160.
Gilliland, M., 2013. FVA: A reality check on forecasting practices. Foresight: The International Journal of Applied Forecasting (29), 14–18.
Gilliland, M., 2019. The value added by machine learning approaches in forecasting. International Journal of Forecasting.
Goodman, S. N., Fanelli, D., Ioannidis, J. P. A., 2016. What does research reproducibility mean? Science Translational Medicine 8 (341), 341ps12.
Hong, T., Pinson, P., Fan, S., 2014. Global energy forecasting competition 2012. International Journal of Forecasting 30 (2), 357–363.
Hong, T., Pinson, P., Fan, S., Zareipour, H., Troccoli, A., Hyndman, R. J., 2016. Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. International Journal of Forecasting 32 (3), 896–913.
Hong, T., Xie, J., Black, J., 2019. Global energy forecasting competition 2017: Hierarchical probabilistic load forecasting. International Journal of Forecasting 35 (4), 1389–1399.
Hyndman, R. J., 2020. A brief history of forecasting competitions. International Journal of Forecasting 36 (1), 7–14.
Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., Shang, H. L., 2011. Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis 55 (9), 2579–2589.
Hyndman, R. J., Koehler, A. B., 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), 679–688.
Kang, Y., Hyndman, R. J., Smith-Miles, K., 2017. Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting 33 (2), 345–358.
Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., Newton, J., Parzen, E., Winkler, R., 1982. The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting 1 (2), 111–153.
Makridakis, S., Assimakopoulos, V., Spiliotis, E., 2018. Objectivity, reproducibility and replicability in forecasting research. International Journal of Forecasting 34 (4), 835–838.
Makridakis, S., Hyndman, R. J., Petropoulos, F., 2020a. Forecasting in social settings: The state of the art. International Journal of Forecasting 36 (1), 15–28.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., 2020b. The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36 (1), 54–74.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., 2020c. The M5 accuracy competition: Results, findings and conclusions.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., 2020d. Predicting/hypothesizing the findings of the M4 competition. International Journal of Forecasting 36 (1), 29–36.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., Chen, Z., Winkler, R. L., et al., 2020e. The M5 uncertainty competition: Results, findings and conclusions.
Nikolopoulos, K., Petropoulos, F., 2018. Forecasting for big data: Does suboptimality matter? Computers & Operations Research 98, 322–329.
Petropoulos, F., Apiletti, D., Assimakopoulos, V., Babai, M. Z., Barrow, D. K., Bergmeir, C., Bessa, R. J., Boylan, J. E., Browell, J., Carnevale, C., Castle, J. L., Cirillo, P., Clements, M. P., Cordeiro, C., Oliveira, F. L. C., De Baets, S., Dokumentov, A., Fiszeder, P., Franses, P. H., Gilliland, M., Sinan Gönül, M., Goodwin, P., Grossi, L., Grushka-Cockayne, Y., Guidolin, M., Guidolin, M., Gunter, U., Guo, X., Guseo, R., Harvey, N., Hendry, D. F., Hollyman, R., Januschowski, T., Jeon, J., Jose, V. R. R., Kang, Y., Koehler, A. B., Kolassa, S., Kourentzes, N., Leva, S., Li, F., Litsiou, K., Makridakis, S., Martinez, A. B., Meeran, S., Modis, T., Nikolopoulos, K., Önkal, D., Paccagnini, A., Panapakidis, I., Pavía, J. M., Pedio, M., Pedregal, D. J., Pinson, P., Ramos, P., Rapach, D. E., James Reade, J., Rostami-Tabar, B., Rubaszek, M., Sermpinis, G., Shang, H. L., Spiliotis, E., Syntetos, A. A., Talagala, P. D., Talagala, T. S., Tashman, L., Thomakos, D., Thorarinsdottir, T., Todini, E., Arenas, J. R. T., Wang, X., Winkler, R. L., Yusupova, A., Ziel, F., 2020. Forecasting: theory and practice.
Petropoulos, F., Wang, X., Disney, S. M., 2019. The inventory performance of forecasting methods: Evidence from the M3 competition data. International Journal of Forecasting 35 (1), 251–265.
Semenoglou, A.-A., Spiliotis, E., Makridakis, S., Assimakopoulos, V., 2020. Investigating the accuracy of cross-learning time series forecasting methods. International Journal of Forecasting.
Smyl, S., 2020. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36 (1), 75–85.
Spiliotis, E., Kouloumos, A., Assimakopoulos, V., Makridakis, S., 2020a. Are forecasting competitions data representative of the reality? International Journal of Forecasting 36 (1), 37–53.
Spiliotis, E., Petropoulos, F., Kourentzes, N., Assimakopoulos, V., 2020b. Cross-temporal aggregation: Improving the forecast accuracy of hierarchical electricity consumption. Applied Energy 261, 114339.
Tashman, L. J., 2000. Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting 16 (4), 437–450.
Winkler, R. L., 1972. A Decision-Theoretic approach to interval estimation. Journal of the American Statistical Association 67 (337), 187–191.