Meta-learning framework with applications to zero-shot time-series forecasting
Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio
Element AI
Abstract
Can meta-learning discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets? This work provides positive evidence to this using a broad meta-learning framework which we show subsumes many existing meta-learning algorithms. Our theoretical analysis suggests that residual connections act as a meta-learning adaptation mechanism, generating a subset of task-specific parameters based on a given TS input, thus gradually expanding the expressive power of the architecture on-the-fly. The same mechanism is shown via linearization analysis to have the interpretation of a sequential update of the final linear layer. Our empirical results on a wide range of data emphasize the importance of the identified meta-learning mechanisms for successful zero-shot univariate forecasting, suggesting that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, resulting in performance that is at least as good as that of state-of-practice univariate forecasting models.
Time series (TS) forecasting is both a fundamental scientific problem and one of great practical importance. It is central to the actions of intelligent agents: the ability to plan and control, as well as to appropriately react to manifestations of complex, partially or completely unknown systems, often relies on the ability to forecast relevant observations based on past history. Moreover, for most utility-maximizing agents, gains in forecasting accuracy broadly translate into utility gains; as such, improvements in forecasting technology can have wide impacts. Unsurprisingly, forecasting methods have a long history that can be traced back to the very origins of human civilization (Neale 1985) and modern science (Gauss 1809), and have consistently attracted considerable research attention (Yule 1927; Walker 1931; Holt 1957; Winters 1960; Engle 1982; Sezer, Gudelek, and Ozbayoglu 2019). The applications of forecasting span a variety of fields, including high-frequency control (e.g. vehicle and robot control (Tang and Salakhutdinov 2019), data center optimization (Gao 2014)), business planning (supply chain management (Leung 1995), workforce and call center management (Chapados et al. 2014;
Ibrahim et al. 2016)), as well as such critically important areas as precision agriculture (Rodrigues Jr et al. 2019). In business specifically, improved forecasting translates into better production planning (leading to less waste) and less transportation (reducing CO2 emissions) (Kahn 2003; Kerkkänen, Korpela, and Huiskonen 2009; Nguyen, Ni, and Rossetti 2010). The progress made in univariate forecasting in the past four decades is well reflected in the results and methods considered in associated competitions over that period (Makridakis et al. 1982, 1993; Makridakis and Hibon 2000; Athanasopoulos et al. 2011; Makridakis, Spiliotis, and Assimakopoulos 2018a). Recently, growing evidence has started to emerge suggesting that machine learning approaches could improve on classical forecasting methods, in contrast to some earlier assessments (Makridakis, Spiliotis, and Assimakopoulos 2018b). For example, the winner of the 2018 M4 competition (Makridakis, Spiliotis, and Assimakopoulos 2018a) was a neural network designed by Smyl (2020).

On the practical side, the deployment of deep neural time-series models is challenged by the cold start problem. Before a tabula rasa deep neural network provides a useful forecasting output, it should be trained on a large problem-specific time-series dataset. For early adopters, this often implies data collection efforts, changing data handling practices and even changing the existing IT infrastructures on a large scale. In contrast, advanced statistical models can be deployed with significantly less effort as they estimate their parameters on a single time series at a time. In this paper we address the problem of reducing the entry cost of deep neural networks in the industrial practice of TS forecasting. We show that it is viable to train a neural network model on a diversified source dataset and deploy it on a target dataset in a zero-shot regime, i.e. without explicit retraining on that target data, resulting in performance that is at least as good as that of advanced statistical models tailored to the target dataset. We would like to clarify that we use the term "zero-shot" in our work in the sense that the number of history samples available for the target time series is so small that it makes training a deep learning model on this time series infeasible.

Addressing this practical problem provides clues to fundamental questions. Can we learn something general about forecasting and transfer this knowledge across datasets? If so, what kind of mechanisms could facilitate this? The ability to learn and transfer representations across tasks via task adaptation is an advantage of meta-learning (Raghu et al. 2019). We propose here a broad theoretical framework for meta-learning that encompasses several existing meta-learning algorithms. We further show that a recent successful model, N-BEATS (Oreshkin et al. 2020), fits this framework. We identify internal meta-learning adaptation mechanisms that generate new parameters on-the-fly, specific to a given TS, iteratively extending the architecture's expressive power. We empirically confirm that meta-learning mechanisms are key to improving zero-shot TS forecasting performance, and demonstrate results on a wide range of datasets.

The univariate point forecasting problem in discrete time is formulated given a length-$H$ forecast horizon and a length-$T$ observed series history $[y_1, \ldots, y_T] \in \mathbb{R}^T$. The task is to predict the vector of future values $y \in \mathbb{R}^H = [y_{T+1}, y_{T+2}, \ldots, y_{T+H}]$.
For simplicity, we will later consider a lookback window of length $t \le T$ ending with the last observed value $y_T$ to serve as model input, denoted $x \in \mathbb{R}^t = [y_{T-t+1}, \ldots, y_T]$. We denote $\widehat{y}$ the point forecast of $y$. Its accuracy can be evaluated with sMAPE, the symmetric mean absolute percentage error (Makridakis, Spiliotis, and Assimakopoulos 2018a),
$$\text{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{|y_{T+i}| + |\widehat{y}_{T+i}|}. \qquad (1)$$

Other quality metrics (e.g. MAPE, MASE, OWA, ND) are possible and are defined in Appendix A.

Meta-learning or learning-to-learn (Harlow 1949; Schmidhuber 1987; Bengio, Bengio, and Cloutier 1991) is usually linked to being able to (i) accumulate knowledge across tasks (i.e. transfer learning, multi-task learning) and (ii) quickly adapt the accumulated knowledge to the new task (task adaptation) (Ravi and Larochelle 2016; Bengio et al. 1992).

The N-BEATS algorithm has demonstrated outstanding performance on several competition benchmarks (Oreshkin et al. 2020). The model consists of a total of $L$ blocks connected using a doubly residual architecture. Block $\ell$ has input $x_\ell$ and produces two outputs: the backcast $\widehat{x}_\ell$ and the partial forecast $\widehat{y}_\ell$. For the first block we define $x_1 \equiv x$, where $x$ is assumed to be the model-level input from now on. We define the $k$-th fully-connected layer in the $\ell$-th block, having ReLU non-linearity, weights $W_k$, bias $b_k$ and input $h_{\ell,k-1}$, as $\text{FC}_k(h_{\ell,k-1}) \equiv \text{ReLU}(W_k h_{\ell,k-1} + b_k)$. We focus on the configuration that shares all learnable parameters across blocks. With this notation, one block of N-BEATS is described as:

$$h_{\ell,1} = \text{FC}_1(x_\ell), \quad h_{\ell,k} = \text{FC}_k(h_{\ell,k-1}),\ k = 2 \ldots K; \quad \widehat{x}_\ell = Q h_{\ell,K}, \quad \widehat{y}_\ell = G h_{\ell,K}, \qquad (2)$$

where $Q$ and $G$ are linear operators. The N-BEATS parameters included in the FC and linear layers are learned by minimizing a suitable loss function (e.g. sMAPE defined in (1)) across multiple TS. Finally, the doubly residual architecture is described by the following recursion (recalling that $x_1 \equiv x$):

$$x_\ell = x_{\ell-1} - \widehat{x}_{\ell-1}, \qquad \widehat{y} = \sum_{\ell=1}^{L} \widehat{y}_\ell. \qquad (3)$$

From a high-level perspective, there are many links with classical TS modeling: a human-specified classical model is typically designed to generalize well on unseen TS, while we propose to automate that process. The classical models include exponential smoothing with and without seasonal effects (Holt 1957, 2004; Winters 1960), and multi-trace exponential smoothing approaches, e.g.
Theta and its variants (Assimakopoulos and Nikolopoulos 2000; Fiorucci et al. 2016; Spiliotis, Assimakopoulos, and Nikolopoulos 2019). Finally, the state space modeling approach encapsulates most of the above in addition to auto-ARIMA and GARCH (Engle 1982); see Hyndman and Khandakar (2008) for an overview. The state-space approach has also been underlying significant amounts of research in neural TS modeling (Salinas et al. 2019; Wang et al. 2019; Rangapuram et al. 2018). However, those models have not been considered in the zero-shot scenario. In this work we focus on studying the importance of meta-learning for successful zero-shot forecasting. The foundations of meta-learning have been developed by Schmidhuber (1987) and Bengio, Bengio, and Cloutier (1991), among others. More recently, meta-learning research has been expanding, mostly outside of the TS forecasting domain (Ravi and Larochelle 2016; Finn, Abbeel, and Levine 2017; Snell, Swersky, and Zemel 2017; Vinyals et al. 2016; Rusu et al. 2019). In the TS domain, meta-learning has manifested itself via neural models trained over a collection of TS (Smyl 2020; Oreshkin et al. 2020) or via a model trained to predict weights combining outputs of several classical forecasting algorithms (Montero-Manso et al. 2020). Successful application of a neural TS forecasting model trained on a source dataset and fine-tuned on the target dataset was demonstrated by Hooshmand and Sharma (2019) and Ribeiro et al. (2018), as well as in the context of TS classification by Fawaz et al. (2018). Unlike those, we focus on the zero-shot scenario and address the cold start problem.
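To make the doubly residual recursion of (2)-(3) concrete, here is a minimal NumPy sketch of a stack of weight-sharing N-BEATS blocks with randomly initialized (untrained) parameters. The layer sizes and helper names (`nbeats_forward`, `make_block_params`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_block_params(t, h, hidden, rng):
    """Randomly initialized parameters shared by all blocks (illustrative sizes)."""
    W = [rng.standard_normal((hidden, t)) * 0.01] + \
        [rng.standard_normal((hidden, hidden)) * 0.01 for _ in range(2)]
    b = [np.zeros(hidden) for _ in range(3)]
    Q = rng.standard_normal((t, hidden)) * 0.01   # backcast head, eq. (2)
    G = rng.standard_normal((h, hidden)) * 0.01   # forecast head, eq. (2)
    return W, b, Q, G

def nbeats_forward(x, params, n_blocks):
    """Doubly residual stacking of identical blocks, eqs. (2)-(3)."""
    W, b, Q, G = params
    x_l, y_hat = x.copy(), np.zeros(G.shape[0])
    for _ in range(n_blocks):
        h = x_l
        for Wk, bk in zip(W, b):          # FC_k(h) = ReLU(W_k h + b_k)
            h = relu(Wk @ h + bk)
        x_hat, y_l = Q @ h, G @ h         # backcast and partial forecast
        x_l = x_l - x_hat                 # residual input of the next block
        y_hat = y_hat + y_l               # global forecast is the sum of partial forecasts
    return y_hat

rng = np.random.default_rng(0)
lookback, horizon = 12, 6
params = make_block_params(lookback, horizon, hidden=64, rng=rng)
print(nbeats_forward(rng.standard_normal(lookback), params, n_blocks=30).shape)  # (6,)
```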
We define a meta-learning framework with associated equations, and recast within it many existing meta-learning algorithms. We show that N-BEATS follows the same equations. According to our analysis, its residual connections implement the meta-learning inner loop, thereby performing task adaptation without gradient steps at inference time.

We define a novel zero-shot univariate TS forecasting task and make its dataset loaders and evaluation code public, including a new large-scale dataset
(FRED) with 290k TS. We empirically show, for the first time, that deep-learning zero-shot time series forecasting is feasible and that the meta-learning component is important for zero-shot generalization in univariate TS forecasting.
A meta-learning procedure can generally be viewed at two levels: the inner loop and the outer loop. The inner training loop operates within an individual "meta-example" or task $\mathcal{T}$ (fast learning loop improving over the current $\mathcal{T}$) and the outer loop operates across tasks (slow learning loop). A task $\mathcal{T}$ includes task training data $\mathcal{D}^{tr}_{\mathcal{T}}$ and task validation data $\mathcal{D}^{val}_{\mathcal{T}}$, both optionally involving inputs, targets and a task-specific loss: $\mathcal{D}^{tr}_{\mathcal{T}} = \{X^{tr}_{\mathcal{T}}, Y^{tr}_{\mathcal{T}}, \mathcal{L}_{\mathcal{T}}\}$, $\mathcal{D}^{val}_{\mathcal{T}} = \{X^{val}_{\mathcal{T}}, Y^{val}_{\mathcal{T}}, \mathcal{L}_{\mathcal{T}}\}$. Accordingly, a meta-learning set-up can be defined by assuming a distribution $p(\mathcal{T})$ over tasks, a predictor $P_{\theta, w}$ and a meta-learner with meta-parameters $\varphi$. We allow a subset of the predictor's parameters, denoted $w$, to belong to the meta-parameters $\varphi$ and hence not to be task adaptive. The objective is to design a meta-learner that can generalize well on a new task by appropriately choosing the predictor's task-adaptive parameters $\theta$ after observing $\mathcal{D}^{tr}_{\mathcal{T}}$. The meta-learner is trained to do so by being exposed to many tasks in a training dataset $\{\mathcal{T}^{train}_i\}$ sampled from $p(\mathcal{T})$. For each training task $\mathcal{T}^{train}_i$, the meta-learner is requested to produce the solution to the task in the form of $P_{\theta, w}: X^{val}_{\mathcal{T}_i} \mapsto \widehat{Y}^{val}_{\mathcal{T}_i}$, conditioned on $\mathcal{D}^{tr}_{\mathcal{T}_i}$. The meta-parameters $\varphi$ are updated in the outer meta-learning loop so as to obtain good generalization in the inner loop, i.e., by minimizing the expected validation loss $\mathbb{E}_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(\widehat{Y}^{val}_{\mathcal{T}_i}, Y^{val}_{\mathcal{T}_i})$ mapping the ground truth and estimated outputs into the value that quantifies the generalization performance across tasks. Training on multiple tasks enables the meta-learner to produce solutions $P_{\theta, w}$ that generalize well on a set of unseen tasks $\{\mathcal{T}^{test}_i\}$ sampled from $p(\mathcal{T})$.

Consequently, the meta-learning procedure has three distinct ingredients: (i) meta-parameters $\varphi = (t, w, u)$, (ii) an initialization function $\mathcal{I}_t$ and (iii) an update function $\mathcal{U}_u$. The meta-learner's meta-parameters $\varphi$ include the meta-parameters of the meta-initialization function, $t$, the meta-parameters of the predictor shared across tasks, $w$, and the meta-parameters of the update function, $u$. The meta-initialization function $\mathcal{I}_t(\mathcal{D}^{tr}_{\mathcal{T}_i}, c_{\mathcal{T}_i})$ defines the initial values of parameters $\theta_0$ for a given task $\mathcal{T}_i$ based on its meta-initialization parameters $t$, task training dataset $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and task meta-data $c_{\mathcal{T}_i}$. Task meta-data may have, for example, the form of a task ID or a textual task description. The update function $\mathcal{U}_u(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i})$ is parameterized with update meta-parameters $u$. It defines an iterated update to predictor parameters $\theta$ at iteration $\ell$ based on their previous value and the task training set $\mathcal{D}^{tr}_{\mathcal{T}_i}$. The initialization and update functions produce a sequence of predictor parameters, which we compactly write as $\boldsymbol{\theta}_\ell \equiv \{\theta_0, \ldots, \theta_{\ell-1}, \theta_\ell\}$. We let the final predictor be a function of the whole sequence of parameters, written compactly as $P_{\boldsymbol{\theta}_\ell, w}$. One implementation of such a general function could be a Bayesian ensemble or a weighted sum, for example: $P_{\boldsymbol{\theta}_\ell, w}(\cdot) = \sum_{j=0}^{\ell} \omega_j P_{\theta_j, w}(\cdot)$. If we set $\omega_j = 1$ for $j = \ell$ and 0 otherwise, then we get the more common situation $P_{\boldsymbol{\theta}_\ell, w}(\cdot) \equiv P_{\theta_\ell, w}(\cdot)$.
This meta-learning framework is succinctly described by the following set of equations:

Parameters: $\theta$; Meta-parameters: $\varphi = (t, w, u)$.

Inner Loop:
$$\theta_0 \leftarrow \mathcal{I}_t(\mathcal{D}^{tr}_{\mathcal{T}_i}, c_{\mathcal{T}_i}), \qquad \theta_\ell \leftarrow \mathcal{U}_u(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}),\ \forall \ell > 0, \qquad x: P_{\boldsymbol{\theta}_\ell, w}(x). \qquad (4)$$

Outer Loop:
$$\varphi \leftarrow \varphi - \eta \nabla_\varphi \mathcal{L}_{\mathcal{T}_i}\big[P_{\boldsymbol{\theta}_\ell, w}(X^{val}_{\mathcal{T}_i}), Y^{val}_{\mathcal{T}_i}\big]. \qquad (5)$$

In the previous section we laid out a unifying framework for meta-learning. How is it connected to the TS forecasting task? We believe that this question is best answered by answering two questions: "why are the classical statistical TS forecasting models, such as ARIMA and ETS, not doing meta-learning?" and "what does the meta-learning component offer when it is part of a forecasting algorithm?". The first question can be answered by considering the fact that the classical statistical models produce a forecast by estimating their parameters from the history of the target time series using a predefined fixed set of rules, for example, given a model selection and the maximum likelihood parameter estimator for it. Therefore, in terms of our meta-learning framework, a classical statistical model executes only the inner loop (model parameter estimation) encapsulated in equation (4). The outer loop in this case is irrelevant: a human analyst defines what equation (4) is doing, based on experience (for example, "for most slowly varying time series with trend, no seasonality and white residuals, ETS with a Gaussian maximum likelihood estimator will probably work well"). The second question can be answered by considering that a meta-learning based forecasting algorithm replaces the predefined fixed set of rules for model parameter estimation with a learnable parameter estimation strategy. The learnable parameter estimation strategy is trained using the outer loop equation (5) by adjusting the strategy such that it is able to produce parameter estimates that generalize well over multiple TS. It is assumed that there exists a dataset that is representative of the forecasting tasks that will be handled at inference time. Thus the main advantage of meta-learning based forecasting approaches is that they enable learning a data-driven parameter estimator that can be optimized for a particular set of forecasting tasks and forecasting models. On top of that, a meta-learning approach allows for a general learnable predictor in equation (4) that can be optimized for a given forecasting task. So both the predictor (model) and its parameter estimator can be jointly learned for a forecasting task represented by available data. Empirically, we show that this elegant theoretical concept works effectively across multiple datasets and across multiple forecasting tasks (e.g. forecasting yearly, monthly or hourly TS) and even across very loosely related tasks (for example, forecasting hourly electricity demand after training on monthly economic data after appropriate time-scale normalization).
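The following is a minimal Python sketch of the inner/outer structure of (4)-(5) for a toy one-parameter linear predictor. The concrete choices here (the toy task family, the meta-learned scalar initialization and step size, and the finite-difference outer gradient) are illustrative assumptions made only to keep the example self-contained and runnable; they are not the paper's training procedure.

```python
import numpy as np

# Toy instantiation of eqs. (4)-(5): tasks are scalar linear regressions y = a * x,
# the predictor is P_{theta,w}(x) = theta * x (w is empty), I_t copies the meta-learned
# scalar t into theta_0, and U_u is one gradient step with a meta-learned step size u.

def make_task(rng):
    a = rng.uniform(-2.0, 2.0)
    x_tr, x_val = rng.standard_normal(10), rng.standard_normal(10)
    return (x_tr, a * x_tr), (x_val, a * x_val)

def task_loss(theta, data):
    x, y = data
    return np.mean((theta * x - y) ** 2)

def inner_loop(t, u, d_tr, n_steps=3):
    """Eq. (4): theta_0 <- I_t(D_tr); theta_l <- U_u(theta_{l-1}, D_tr)."""
    theta = t                                     # I_t: meta-initialization
    x, y = d_tr
    for _ in range(n_steps):
        grad = np.mean(2 * (theta * x - y) * x)   # d/dtheta of the task training loss
        theta = theta - u * grad                  # U_u: learnable update rule
    return theta

def outer_objective(phi, tasks):
    """Validation loss of eq. (5), averaged over tasks."""
    t, u = phi
    return np.mean([task_loss(inner_loop(t, u, d_tr), d_val) for d_tr, d_val in tasks])

rng = np.random.default_rng(0)
phi, eta, eps = np.array([0.0, 0.05]), 0.02, 1e-4
for step in range(300):                            # outer loop, eq. (5)
    tasks = [make_task(rng) for _ in range(8)]
    grad = np.array([(outer_objective(phi + eps * e, tasks)
                      - outer_objective(phi - eps * e, tasks)) / (2 * eps)
                     for e in np.eye(2)])          # finite-difference meta-gradient
    phi = phi - eta * grad
print("meta-learned (t, u):", phi)
```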
To further illustrate the generality of the proposed framework, we next show how to cast existing meta-learning algorithms within it, before turning to N-BEATS.
MAML and related approaches (Finn, Abbeel, and Levine 2017; Li et al. 2017; Raghu et al. 2019) can be derived from (4) and (5) by (i) setting $\mathcal{I}$ to be the identity map that copies $t$ into $\theta_0$, (ii) setting $\mathcal{U}$ to be the SGD gradient update $\mathcal{U}_u(\theta, \mathcal{D}^{tr}_{\mathcal{T}_i}) = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(P_{\theta, w}(X^{tr}_{\mathcal{T}_i}), Y^{tr}_{\mathcal{T}_i})$, where $u = \{\alpha\}$, and (iii) setting the predictor's meta-parameters to the empty set, $w = \emptyset$. Equation (5) applies with no modifications. MT-net (Lee and Choi 2018) is a variant of MAML in which the predictor's meta-parameter set $w$ is not empty. The part of the predictor parameterized with $w$ is meta-learned across tasks and is fixed during task adaptation.

Optimization as a model for few-shot learning (Ravi and Larochelle 2016) can be derived from (4) and (5) via the following steps (in addition to those of MAML). First, set the update function $\mathcal{U}_u$ to the update equation of an LSTM-like cell of the form ($\ell$ is the LSTM update step index) $\theta_\ell \leftarrow f_\ell\, \theta_{\ell-1} + \alpha_\ell \nabla_{\theta_{\ell-1}} \mathcal{L}_{\mathcal{T}_i}(P_{\theta_{\ell-1}, w}(X^{tr}_{\mathcal{T}_i}), Y^{tr}_{\mathcal{T}_i})$. Second, set $f_\ell$ to be the LSTM forget gate value (Ravi and Larochelle 2016): $f_\ell = \sigma(W_F [\nabla_\theta \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, f_{\ell-1}] + b_F)$, and $\alpha_\ell$ to be the LSTM input gate value: $\alpha_\ell = \sigma(W_\alpha [\nabla_\theta \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, \alpha_{\ell-1}] + b_\alpha)$. Here $\sigma$ is a sigmoid non-linearity. Finally, include all the LSTM parameters into the set of update meta-parameters: $u = \{W_F, b_F, W_\alpha, b_\alpha\}$.

Prototypical Networks (PNs) (Snell, Swersky, and Zemel 2017). Most metric-based meta-learning approaches, including PNs, rely on comparing embeddings of the task training set with those of the validation set. Therefore, it is convenient to consider a composite predictor consisting of the embedding function, $E_w$, and the comparison function, $C_\theta$: $P_{\theta, w}(\cdot) = C_\theta \circ E_w(\cdot)$. PNs can be derived from (4) and (5) by considering a $K$-shot image classification task, a convolutional network $E_w$ shared across tasks, and class prototypes $p_k = \frac{1}{K} \sum_{j: Y^{tr}_j = k} E_w(X^{tr}_j)$ included in $\theta = \{p_k\}_{\forall k}$. The initialization function $\mathcal{I}_t$ with $t = \emptyset$ simply sets $\theta$ to the values of the prototypes, $\mathcal{U}_u$ is an identity map with $u = \emptyset$, and $C_\theta$ is a softmax classifier:

$$\widehat{Y}^{val}_{\mathcal{T}_i} = \arg\max_k\ \text{softmax}\big(-d(E_w(X^{val}_{\mathcal{T}_i}), p_k)\big). \qquad (6)$$

Here $d(\cdot, \cdot)$ is a similarity measure and the softmax is normalized w.r.t. all $p_k$. Finally, define the loss $\mathcal{L}_{\mathcal{T}_i}$ in (5) as the cross-entropy of the softmax classifier described in (6). Interestingly, $\theta = \{p_k\}_{\forall k}$ are nothing else than the dynamically generated weights of the final linear layer fed into the softmax, which is especially apparent when $d(a, b) = -a \cdot b$. The fact that in the prototypical network scenario only the final linear layer weights are dynamically generated based on the task training set resonates very well with the most recent study of MAML (Raghu et al. 2019). It has been shown that most of MAML's gain can be recovered by only adapting the weights of the final linear layer in the inner loop.

In this section, we illustrated that four distinct meta-learning algorithms from two broad categories (optimization- and metric-based) can be derived from our equations (4) and (5). This confirms that our meta-learning framework is general and it can represent existing meta-learning algorithms. The analysis of three additional existing meta-learning algorithms is presented in Appendix C.
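As a concrete illustration of the metric-based case, here is a small NumPy sketch of the Prototypical Networks instantiation above: $\mathcal{I}_t$ builds the class prototypes from the task training set, $\mathcal{U}_u$ is the identity, and the comparison function is the softmax over negative distances from (6). The embedding is a fixed random projection standing in for the meta-learned network $E_w$; the sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_emb = rng.standard_normal((16, 32)) * 0.1       # stands in for the shared embedding E_w

def embed(x):
    return np.tanh(x @ W_emb.T)                   # E_w(x), shape (..., 16)

def init_prototypes(x_tr, y_tr, n_classes):
    """I_t: theta = {p_k}, the per-class means of the embedded task training set."""
    return np.stack([embed(x_tr[y_tr == k]).mean(axis=0) for k in range(n_classes)])

def classify(x_val, prototypes):
    """C_theta: softmax over negative squared distances to the prototypes, eq. (6)."""
    d = ((embed(x_val)[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    logits = -d
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# A toy 3-way, 5-shot task: class k lives around a class-specific center.
centers = rng.standard_normal((3, 32)) * 2.0
y_tr = np.repeat(np.arange(3), 5)
x_tr = centers[y_tr] + 0.3 * rng.standard_normal((15, 32))
y_val = np.repeat(np.arange(3), 10)
x_val = centers[y_val] + 0.3 * rng.standard_normal((30, 32))

theta = init_prototypes(x_tr, y_tr, n_classes=3)  # inner loop: no gradient steps needed
pred, _ = classify(x_val, theta)
print("accuracy:", (pred == y_val).mean())
```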
Let us now focus on the analysis of N-BEATS described by equations (2), (3). We first introduce the following notation: $f: x_\ell \mapsto h_{\ell,K}$; $g: h_{\ell,K} \mapsto \widehat{y}_\ell$; $q: h_{\ell,K} \mapsto \widehat{x}_\ell$. In the original equations, $g$ and $q$ are linear and hence can be represented by equivalent matrices $G$ and $Q$. In the following, we keep the notation as general as possible, transitioning to the linear case only when needed. Then, given the network input $x$ ($x_1 \equiv x$), and noting that $\widehat{x}_{\ell-1} = q \circ f(x_{\ell-1})$, we can write N-BEATS as follows:

$$\widehat{y} = g \circ f(x) + \sum_{\ell > 1} g \circ f\big(x_{\ell-1} - q \circ f(x_{\ell-1})\big). \qquad (7)$$

N-BEATS is now derived from the meta-learning framework of Sec. 2 using two observations: (i) each application of $g \circ f$ in (7) is a predictor and (ii) each block of N-BEATS is an iteration of the inner meta-learning loop. More concretely, we have that $P_{\theta, w}(\cdot) = g_{w_g} \circ f_{w_f, \theta}(\cdot)$. Here $w_g$ and $w_f$ are the parameters of functions $g$ and $f$, included in $w = (w_g, w_f)$ and learned across tasks in the outer loop. The task-specific parameters $\theta$ consist of the sequence of input shift vectors, $\theta \equiv \{\mu_\ell\}_{\ell=1}^{L}$, defined such that the $\ell$-th block input can be written as $x_\ell = x - \mu_{\ell-1}$. This yields a recursive expression for the predictor's task-specific parameters of the form $\mu_\ell \leftarrow \mu_{\ell-1} + \widehat{x}_\ell$, $\mu_0 \equiv 0$, obtained by recursively unrolling eq. (3). These yield the following initialization and update functions: $\mathcal{I}_t$ with $t = \emptyset$ sets $\mu_0$ to zero; $\mathcal{U}_u$, with $u = (w_q, w_f)$, generates the next parameter update based on $\widehat{x}_\ell$:

$$\mu_\ell \leftarrow \mathcal{U}_u(\mu_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}) \equiv \mu_{\ell-1} + q_{w_q} \circ f_{w_f}(x - \mu_{\ell-1}).$$

Interestingly, (i) meta-parameters $w_f$ are shared between the predictor and the update function and (ii) the task training set is limited to the network input, $\mathcal{D}^{tr}_{\mathcal{T}_i} \equiv \{x\}$. Note that the latter makes sense because the data are complete time series, with the inputs $x$ having the same form of internal dependencies as the forecasting targets $y$. Hence, observing $x$ is enough to infer how to predict $y$ from $x$ in a way that is similar to how different parts of $x$ are related to each other.

Finally, according to (7), predictor outputs corresponding to the values of parameters $\theta$ learned at every iteration of the inner loop are combined in the final output. This corresponds to choosing a predictor of the form $P_{\boldsymbol{\mu}_L, w}(\cdot) = \sum_{j=1}^{L} \omega_j P_{\mu_j, w}(\cdot)$, $\omega_j = 1\ \forall j$, in (5). The outer learning loop (5) describes the N-BEATS training procedure across tasks (TS) with no modification.

It is clear that the final output of the architecture depends on the sequence $\boldsymbol{\mu}_L$. Even if predictor parameters $w_g$, $w_f$ are shared across blocks and fixed, the behaviour of $P_{\boldsymbol{\mu}_L, w}(\cdot) = g_{w_g} \circ f_{w_f, \boldsymbol{\mu}_L}(\cdot)$ is governed by an extended space of parameters $(w, \mu_0, \mu_1, \ldots)$. Therefore, the expressive power of the architecture can be expected to grow with the growing number of blocks, in proportion to the growth of the space spanned by $\boldsymbol{\mu}_L$, even if $w_g$, $w_f$ are shared across blocks. Thus, it is reasonable to expect that the addition of identical blocks will improve generalization performance because of the increase in expressive power.
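A quick numerical check of this reinterpretation, under the same kind of illustrative NumPy setup as the earlier block sketch: running the doubly residual stack of eqs. (2)-(3) gives exactly the same forecast as iterating the input shifts $\mu_\ell \leftarrow \mu_{\ell-1} + q \circ f(x - \mu_{\ell-1})$ and summing the per-iteration predictor outputs $g \circ f(x - \mu_{\ell-1})$. The variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t, h, hidden, L = 12, 6, 32, 10
W1, W2 = rng.standard_normal((hidden, t)) * 0.1, rng.standard_normal((hidden, hidden)) * 0.1
Q, G = rng.standard_normal((t, hidden)) * 0.1, rng.standard_normal((h, hidden)) * 0.1

f = lambda x: np.maximum(W2 @ np.maximum(W1 @ x, 0), 0)   # shared non-linear embedding
g = lambda hid: G @ hid                                     # forecast head
q = lambda hid: Q @ hid                                     # backcast head

x = rng.standard_normal(t)

# (a) Doubly residual stacking, eqs. (2)-(3).
x_l, y_residual = x.copy(), np.zeros(h)
for _ in range(L):
    hid = f(x_l)
    x_l, y_residual = x_l - q(hid), y_residual + g(hid)

# (b) Meta-learning view: input-shift sequence mu_l and summed predictor outputs.
mu, y_meta = np.zeros(t), np.zeros(h)
for _ in range(L):
    y_meta = y_meta + g(f(x - mu))          # predictor output for the current theta
    mu = mu + q(f(x - mu))                  # inner-loop update U_u of eq. (4)

print(np.allclose(y_residual, y_meta))      # True
```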
Next, we go a level deeper in the analysis to uncover more intricate task adaptation processes. Using linear approximation analysis, we express N-BEATS' meta-learning operation in terms of the adaptation of the internal weights of the network based on the task input data. In particular, assuming small $\widehat{x}_\ell$, (7) can be approximated using the first-order Taylor series expansion in the vicinity of $x_{\ell-1}$:

$$\widehat{y} = g \circ f(x) + \sum_{\ell > 1} \big[g - J_{g \circ f}(x_{\ell-1})\, q\big] \circ f(x_{\ell-1}) + o(\|q \circ f(x_{\ell-1})\|).$$

Here $J_{g \circ f}(x_{\ell-1}) = J_g(f(x_{\ell-1}))\, J_f(x_{\ell-1})$ is the Jacobian of $g \circ f$. We now consider linear $g$ and $q$, as mentioned earlier, in which case $g$ and $q$ are represented by two matrices of appropriate dimensionality, $G$ and $Q$, and $J_g(f(x_{\ell-1})) = G$. Thus, the above expression can be simplified as:

$$\widehat{y} = G f(x) + \sum_{\ell > 1} G\big[I - J_f(x_{\ell-1}) Q\big] f(x_{\ell-1}) + o(\|Q f(x_{\ell-1})\|).$$

Continuously applying the linear approximation $f(x_\ell) = [I - J_f(x_{\ell-1}) Q] f(x_{\ell-1}) + o(\|Q f(x_{\ell-1})\|)$ until we reach $\ell = 1$, $x_1 \equiv x$, we arrive at the following:

$$\widehat{y} = \sum_{\ell \ge 1} G \left[\prod_{k=1}^{\ell-1} \big[I - J_f(x_{\ell-k}) Q\big]\right] f(x) + o(\|Q f(x_\ell)\|). \qquad (8)$$

Note that $G \left(\prod_{k=1}^{\ell-1} [I - J_f(x_{\ell-k}) Q]\right)$ can be written in an iterative update form. Consider $G'_1 = G$; then the update equation for $G'$ can be written as $G'_\ell = G'_{\ell-1}[I - J_f(x_{\ell-1}) Q]$, $\forall \ell > 1$, and

$$\widehat{y} = \sum_{\ell \ge 1} G'_\ell f(x) + o(\|Q f(x_\ell)\|). \qquad (9)$$

Let us now discuss how (9) can be used to re-interpret N-BEATS as an instance of the meta-learning framework (4) and (5). The predictor can now be represented in a decoupled form $P_{\theta, w}(\cdot) = g_\theta \circ f_{w_f}(\cdot)$. Thus task adaptation is clearly confined to the decision function, $g_\theta$, whereas the embedding function $f_{w_f}$ only relies on fixed meta-parameters $w_f$. The adaptive parameters $\theta$ include the sequence of projection matrices $\{G'_\ell\}$. The meta-initialization function $\mathcal{I}_t$ is parameterized with $t \equiv G$ and it simply sets $G'_1 \leftarrow t$. The main ingredient of the update function $\mathcal{U}_u$ is $Q f_{w_f}(\cdot)$, parameterized as before with $u = (Q, w_f)$. The update function now consists of two equations:

$$G'_\ell \leftarrow G'_{\ell-1}\big[I - J_f(x - \mu_{\ell-1}) Q\big],\ \forall \ell > 1; \qquad \mu_\ell \leftarrow \mu_{\ell-1} + Q f_{w_f}(x - \mu_{\ell-1}),\ \mu_0 = 0. \qquad (10)$$

The first-order analysis results (9) and (10) suggest that, under certain circumstances, the block-by-block manipulation of the input sequence apparent in (7) is equivalent to producing an iterative update of the predictor's final linear layer weights apparent in (10), with the block input being set to the same fixed value. This is very similar to the final linear layer update behaviour identified in other meta-learning algorithms: in LEO it is present by design (Rusu et al. 2019), in MAML it was identified by Raghu et al. (2019), and in PNs it follows from the results of our analysis in Section 2.2.

It is hard to study the form of $Q$ learned from the data in general. However, equipped with the results of the linear approximation analysis presented in Section 3.1, we can study the case of a two-block network, assuming that the $L_2$ norm loss between $y$ and $\widehat{y}$ is used to train the network.
If, in addition, the dataset consists of the set of $N$ pairs $\{x_i, y_i\}_{i=1}^{N}$, the dataset-wise loss $\mathcal{L}$ has the following expression:

$$\mathcal{L} = \sum_i \big\| y_i - 2 G f(x_i) + J_{g \circ f}(x_i) Q f(x_i) + o(\|Q f(x_i)\|) \big\|^2.$$

Introducing $\Delta y_i = y_i - 2 G f(x_i)$, the error between the default forecast $2 G f(x_i)$ and the ground truth $y_i$, and expanding the $L_2$ norm, we obtain the following:

$$\mathcal{L} = \sum_i \Delta y_i^{\mathsf{T}} \Delta y_i + 2\, \Delta y_i^{\mathsf{T}} J_{g \circ f}(x_i) Q f(x_i) + f(x_i)^{\mathsf{T}} Q^{\mathsf{T}} J_{g \circ f}^{\mathsf{T}}(x_i) J_{g \circ f}(x_i) Q f(x_i) + o(\|Q f(x_i)\|).$$

Now, assuming that the rest of the parameters of the network are fixed, we obtain the derivative with respect to $Q$ using matrix calculus (Petersen and Pedersen 2012):

$$\frac{\partial \mathcal{L}}{\partial Q} = \sum_i 2\, J_{g \circ f}^{\mathsf{T}}(x_i) \Delta y_i f(x_i)^{\mathsf{T}} + 2\, J_{g \circ f}^{\mathsf{T}}(x_i) J_{g \circ f}(x_i) Q f(x_i) f(x_i)^{\mathsf{T}} + o(\|Q f(x_i)\|).$$

Using the above expression we conclude that the first-order approximation of the optimal $Q$ satisfies the following equation:

$$\sum_i J_{g \circ f}^{\mathsf{T}}(x_i) \Delta y_i f(x_i)^{\mathsf{T}} = - \sum_i J_{g \circ f}^{\mathsf{T}}(x_i) J_{g \circ f}(x_i) Q f(x_i) f(x_i)^{\mathsf{T}}.$$

Although this does not help to find a closed-form solution for $Q$, it does provide a quite obvious intuition: the LHS and the RHS are equal when the correction term created by the second block, $J_{g \circ f}(x_i) Q f(x_i)$, tends to compensate the default forecast error, $\Delta y_i$. Therefore, a $Q$ satisfying the equation will tend to drive the update to $G$ in (10) in such a way that, on average, the projection of $f(x)$ over the update $J_{g \circ f}(x) Q$ to matrix $G$ will tend to compensate the error $\Delta y$ made by forecasting $y$ using $G$ based on meta-initialization.

Let us now analyze the factors that enable the meta-learning inner loop obvious in (10). First, the meta-learning regime is not viable without having multiple blocks connected via the residual connection (feedback loop): $x_\ell = x_{\ell-1} - q \circ f(x_{\ell-1})$. Second, the meta-learning inner loop is not viable when $f$ is linear: the update of $G$ is extracted from the curvature of $f$ at the point dictated by the input $x$ and the sequence of shifts $\boldsymbol{\mu}_L$. Indeed, suppose $f$ is linear, and denote it by the linear operator $F$. The Jacobian $J_f(x_{\ell-1})$ becomes a constant, $F$. Equation (8) simplifies as (note that for linear $f$, (8) is exact):

$$\widehat{y} = \sum_{\ell \ge 1} G [I - FQ]^{\ell-1} F x.$$

Therefore, $G \sum_{\ell \ge 1} [I - FQ]^{\ell-1}$ may be replaced with an equivalent $G'$ that is not data adaptive. Interestingly, $\sum_{\ell \ge 1} [I - FQ]^{\ell-1}$ happens to be a truncated Neumann series. Denoting the Moore–Penrose pseudo-inverse as $[\cdot]^{+}$, assuming boundedness of $FQ$ and completing the series, $\sum_{\ell=0}^{\infty} [I - FQ]^{\ell}$, results in $\widehat{y} = G [FQ]^{+} F x$. Therefore, under certain conditions, the N-BEATS architecture with linear $f$ and an infinite number of blocks can be interpreted as a linear predictor of a signal in colored noise. Here the $[FQ]^{+}$ part cleans the intermediate space created by the projection $F$ from the components that are undesired for forecasting, and $G$ creates the forecast based on the initial projection $F x$ after it is "sanitized" by $[FQ]^{+}$.

In this section we established that N-BEATS is an instance of a meta-learning algorithm described by equations (4) and (5).
We showed that each block of N-BEATS is an inner meta-learning loop that generates additional shift parameters specific to the input time series. Therefore, the expressive power of the architecture is expected to grow with each additional block, even if all blocks share their parameters. We used linear approximation analysis to show that the input shift in a block is equivalent to the update of the block's final linear layer weights under certain conditions. The key role in this process seems to be encapsulated in the non-linearity of $f$ and in the residual connections.

We evaluate performance on a number of datasets representing a diverse set of univariate time series. For each of them, we evaluate the base N-BEATS performance compared against the best-published approaches. We also evaluate zero-shot transfer from several source datasets, as explained next.

Table 1: Dataset-specific metrics aggregated over each dataset; lower values are better. The bottom three rows represent the zero-shot transfer setup, indicating respectively the core algorithm (DeepAR or N-BEATS) and the source dataset (M4 or FR(ED)). All other model names are explained in Appendix F. † N-BEATS trained on double upsampled monthly data, see Appendix D. ‡ M3/M4 sMAPE definitions differ. ∗ DeepAR trained by us using GluonTS.

| M4, sMAPE          | M3, sMAPE‡     | TOURISM, MAPE | ELECTR / TRAFF, ND     | FRED, sMAPE |
| Pure ML 12.89      | Comb 13.52     | ETS 20.88     | MatFact 0.16 / 0.20    | ETS 14.16   |
| Best STAT 11.99    | ForePro 13.19  | Theta 20.88   | DeepAR 0.07 / 0.17     | Naïve 12.79 |
| ProLogistica 11.85 | Theta 13.01    | ForePro 19.84 | DeepState 0.08 / 0.17  | SES 12.70   |
| Best ML/TS 11.72   | DOTM 12.90     | Strato 19.52  | Theta 0.08 / 0.18      | Theta 12.20 |
| DL/TS hybrid 11.37 | EXP 12.71      | LCBaker 19.35 | ARIMA 0.07 / 0.15      | ARIMA 12.15 |
| N-BEATS 11.14      | 12.37          | 18.52         | 0.07 / 0.11            | 11.49       |
| DeepAR-M4∗ n/a     | 14.76          | 24.79         | 0.15 / 0.36            | n/a         |
| N-BEATS-M4 n/a     | 12.44          | 18.82         | 0.09 / 0.15            | 11.60       |
| N-BEATS-FR 11.70   | 12.69          | 19.94†        | 0.09 / 0.26            | n/a         |
Base datasets. M4 (M4 Team 2018) contains 100k TS representing demographic, finance, industry, macro and micro indicators. Sampling frequencies include yearly, quarterly, monthly, weekly, daily and hourly. M3 (Makridakis and Hibon 2000) contains 3003 TS from domains and sampling frequencies similar to M4. FRED is a dataset introduced in this paper containing 290k US and international economic TS from 89 sources, a subset of the data published by the Federal Reserve Bank of St. Louis (Federal Reserve 2019). TOURISM (Athanasopoulos et al. 2011) includes monthly, quarterly and yearly series of indicators related to tourism activities. ELECTRICITY (Dua and Graff 2017; Yu, Rao, and Dhillon 2016) represents the hourly electricity usage of 370 customers. TRAFFIC (Dua and Graff 2017; Yu, Rao, and Dhillon 2016) tracks hourly occupancy rates of 963 lanes in the Bay Area freeways. Additional details for all datasets appear in Appendix E.
Zero-shot TS forecasting task definition. One of the base datasets, a source dataset, is used to train a machine learning model. The trained model then forecasts a TS in a target dataset. The source and the target datasets are distinct: they do not contain TS whose values are linear transformations of each other. The forecasted TS is split into two non-overlapping pieces: the history and the test. The history is used as model input and the test is used to compute the forecast error metric. We use the history and test splits for the base datasets consistent with their original publication, unless explicitly stated otherwise. To produce forecasts, the model is allowed to access the TS in the target dataset on a one-at-a-time basis. This is to avoid having the model implicitly learn/adapt based on any information contained in the target dataset other than the history of the forecasted TS. If any adjustments of model parameters or hyperparameters are necessary, they are allowed exclusively using the history of the forecasted TS.
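A minimal sketch of this evaluation protocol, assuming a model pre-trained on the source dataset and target series provided as plain arrays; the names `pretrained_model.predict` and `target_series` are hypothetical placeholders, not part of any released code.

```python
import numpy as np

def smape(y_true, y_pred):
    # Symmetric MAPE as in eq. (1).
    return 200.0 * np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

def zero_shot_evaluate(pretrained_model, target_series, horizon, lookback):
    """Forecast each target TS one at a time, using only its own history as input."""
    errors = []
    for series in target_series:                     # list of 1-D numpy arrays
        history, test = series[:-horizon], series[-horizon:]
        window = history[-lookback:]                 # model input from the history only
        forecast = pretrained_model.predict(window)  # no retraining on the target data
        errors.append(smape(test, forecast))
    return float(np.mean(errors))
```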
Training setup.
DeepAR (Salinas et al. 2019) is trained using the GluonTS implementation from its authors (Alexandrov et al. 2019). N-BEATS is trained following the original training setup of Oreshkin et al. (2020). Both N-BEATS and DeepAR are trained with scaling/descaling of the architecture input/output, dividing/multiplying all input/output values by the max value of the input window, computed per target time series. This does not affect the accuracy of the models in the usual train/test scenario. In the zero-shot regime, this operation is intended to prevent catastrophic failure when the scale of the target time series differs significantly from those of the source dataset. Additional training setup details are provided in Appendix D.
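The max-scaling described above amounts to the following simple wrapper (a sketch under the assumption that `model_forward` maps a scaled input window to a scaled forecast; the names are illustrative):

```python
import numpy as np

def forecast_with_max_scaling(model_forward, input_window, eps=1e-8):
    """Divide the input by the max of the input window, forecast, then scale back.

    Assumes positive-valued series, as in the datasets used here.
    """
    scale = np.max(input_window) + eps
    y_scaled = model_forward(input_window / scale)
    return y_scaled * scale
```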
Key results.
For each dataset, we compare our results to 5 representative entries reported in the literature for that dataset, based on dataset-specific metrics (M4, FRED, M3: sMAPE; TOURISM: MAPE; ELECTRICITY, TRAFFIC: ND). We additionally train the popular machine learning TS model DeepAR and evaluate it in the zero-shot regime. Our main results appear in Table 1, with more details provided in Appendix F. In the zero-shot forecasting regime (bottom three rows), N-BEATS consistently outperforms most statistical models tailored to these datasets as well as DeepAR trained on M4 and evaluated in the zero-shot regime on other datasets.
Figure 1: Zero-shot forecasting performance of N-BEATS trained on M4 and applied to M3 (left) and TOURISM (right) target datasets with respect to the number of blocks, L. The mean and one standard deviation interval (based on ensemble bootstrap) with (blue) and without (red) weight sharing across blocks are shown. The extended set of results for all datasets, using FRED as a source dataset and a few metrics, are provided in Appendix G, further reinforcing our findings.
N-BEATS trained on FRED and applied in the zero-shot regime to M4 outperforms the best statistical model selected for its performance on M4 and is at par with the competition's second entry (boosted trees). On M3 and TOURISM the zero-shot forecasting performance of N-BEATS is better than that of the M3 winner, Theta (Assimakopoulos and Nikolopoulos 2000). On ELECTRICITY and TRAFFIC, N-BEATS performs close to or better than other neural models trained on these datasets. The results suggest that a neural model is able to extract general knowledge about TS forecasting and then successfully adapt it to forecast on unseen TS. Our study presents the first successful application of a neural model to solve univariate zero-shot TS point forecasting across a large variety of datasets, and suggests that a pre-trained N-BEATS model can constitute a strong baseline for this task.
Meta-learning Effects. Analysis in Section 3 implies that N-BEATS internally generates a sequence of parameters that dynamically extend the expressive power of the architecture with each newly added block, even if the blocks are identical. To validate this hypothesis, we performed an experiment studying the zero-shot forecasting performance of N-BEATS with an increasing number of blocks, with and without parameter sharing. The architecture was trained on M4 and the performance was measured on the target datasets M3 and TOURISM. The results are presented in Fig. 1. On the two datasets and for the shared-weights configuration, we consistently see performance improvement when the number of blocks increases up to about 30 blocks. In the same scenario, increasing the number of blocks beyond 30 leads to small, but consistent, deterioration in performance. One can view these results as evidence supporting the meta-learning interpretation of N-BEATS, with a possible explanation of this phenomenon as overfitting in the meta-learning inner loop. It would not otherwise be obvious how to explain the generalization dynamics in Fig. 1. Additionally, the performance improvement due to meta-learning alone (shared weights, multiple blocks vs. a single block) is 12.60 to 12.44 (1.2%) and 20.40 to 18.82 (7.8%) for M3 and TOURISM, respectively (see Fig. 1). The performance improvement due to meta-learning and unique weights (unique weights, multiple blocks vs. a single block) is 12.60 to 12.40 (1.6%) and 20.40 to 18.91 (7.4%). Clearly, the majority of the gain is due to meta-learning alone. The introduction of unique block weights sometimes results in marginal gain, but often leads to a loss (see more results in Appendix G).

In this section, we presented empirical evidence that neural networks are able to provide high-quality zero-shot forecasts on unseen TS. We further empirically supported the hypothesis that the meta-learning adaptation mechanisms identified within N-BEATS in Section 3 are instrumental in achieving impressive zero-shot forecasting accuracy results.

Zero-shot transfer learning.
We propose a broad meta-learning framework and explain mechanisms facilitating zero-shot forecasting. Our results show that neural networks can extract generic knowledge about forecasting and apply it in zero-shot transfer. Residual architectures in general are covered by the analysis of Sec. 3, which might explain some of the success of residual architectures, although their deeper study should be the subject of future work. Our theory suggests that residual connections generate, on-the-fly, compact task-specific parameter updates by producing a sequence of input shifts for identical blocks. Sec. 3.1 reinterprets our results, showing that, as a first-order approximation, residual connections produce an iterative update to the predictor's final linear layer.
Memory efficiency and knowledge compression.
Our empirical results imply that N-BEATS is able to compress all the relevant knowledge about a given dataset in a single block, rather than in 10 or 30 blocks with individual weights. From a practical perspective, this could be used to obtain 10–30 times neural network weight compression and is relevant in applications where storing neural networks efficiently is important. Intuitively, the network with unique block weights includes the network with identical weights as a special case; therefore, it is free to combine the effect of meta-learning with the effect of unique block weights based on its training loss.

References
Alexandrov, A.; Benidis, K.; Bohlke-Schneider, M.; Flunkert, V.;Gasthaus, J.; Januschowski, T.; Maddix, D. C.; Rangapuram, S.;Salinas, D.; Schulz, J.; Stella, L.; Türkmen, A. C.; and Wang, Y.2019. GluonTS: Probabilistic Time Series Modeling in Python. arXiv preprint arXiv:1906.05264 .Assimakopoulos, V.; and Nikolopoulos, K. 2000. The theta model:a decomposition approach to forecasting.
International Journal ofForecasting
International Journal of Forecast-ing
International Journalof Forecasting
International Journal of Forecasting
Optimality in Artificialand Biological Neural Networks .Bengio, Y.; Bengio, S.; and Cloutier, J. 1991. Learning a SynapticLearning Rule. In
Proceedings of the International Joint Conferenceon Neural Networks , II–A969. Seattle, USA.Bergmeir, C.; Hyndman, R. J.; and Benítez, J. M. 2016. Baggingexponential smoothing methods using STL decomposition and Box–Cox transformation.
International Journal of Forecasting
European Journal of OperationalResearch
Econo-metrica .Federal Reserve Bank of St. Louis. 2019. FRED Economic Data.Data retrieved from https://fred.stlouisfed.org/ Accessed: 2019-11-01.Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In
ICML , 1126–1135.Fiorucci, J. A.; Pellegrini, T. R.; Louzada, F.; Petropoulos, F.; andKoehler, A. B. 2016. Models for optimising the theta method andtheir relationship to state space models.
International Journal ofForecasting
CoRR abs/1704.04110.Gao, J. 2014. Machine learning applications for data center opti-mization. Technical report, Google.Gauss, C. F. 1809.
Theoria motus corporum coelestium in section-ibus conicis solem ambientium . Hamburg: Frid. Perthes and I. H.Besser. Harlow, H. F. 1949. The Formation of Learning Sets.
PsychologicalReview
International Journal of Forecasting
Proceedings of theTenth ACM International Conference on Future Energy Systems ,e-Energy’19, 12–16.Hyndman, R.; and Koehler, A. B. 2006. Another look at measuresof forecast accuracy.
International Journal of Forecasting
Journal of Statistical Soft-ware
International Journal of Forecasting
The Journal of Business Forecasting Methods &Systems
International Journal of Production Economics
ICML , 2933–2942.Leung, H. C. 1995. Neural networks in supply chain management.In
Proceedings for Operating Research and the Management Sci-ences , 347–352.Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-SGD: Learning toLearn Quickly for Few Shot Learning.
CoRR
Journal of forecasting
International Journal ofForecasting
International Journal of Forecasting
International Journal of Forecasting
PLoS ONE
International Journal of Forecasting
Weather and Climate
Proceedings of the2010 Industrial Engineering Research Conference .Oreshkin, B. N.; Carpov, D.; Chapados, N.; and Bengio, Y. 2020.N-BEATS: Neural basis expansion analysis for interpretable timeseries forecasting. In
ICLR .Oreshkin, B. N.; Rodríguez López, P.; and Lacoste, A. 2018.TADAM: Task dependent adaptive metric for improved few-shotlearning. In
NeurIPS , 721–731.Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; and Courville, A.2018. FiLM: Visual reasoning with a general conditioning layer. In
AAAI .Petersen, K. B.; and Pedersen, M. S. 2012. The Matrix Cookbook.Version 20121115.Raghu, A.; Raghu, M.; Bengio, S.; and Vinyals, O. 2019. RapidLearning or Feature Reuse? Towards Understanding the Effective-ness of MAML.Rangapuram, S. S.; Seeger, M.; Gasthaus, J.; Stella, L.; Wang, Y.;and Januschowski, T. 2018. Deep State Space Models for TimeSeries Forecasting. In
NeurIPS .Ravi, S.; and Larochelle, H. 2016. Optimization as a model forfew-shot learning. In
ICLR .Ribeiro, M.; Grolinger, K.; ElYamany, H. F.; Higashino, W. A.;and Capretz, M. A. 2018. Transfer learning with seasonal andtrend adjustment for cross-building energy forecasting.
Energy andBuildings
Poster Proceedings of the 12th European Conferenceon Precision Agriculture .Rusu, A. A.; Rao, D.; Sygnowski, J.; Vinyals, O.; Pascanu, R.;Osindero, S.; and Hadsell, R. 2019. Meta-Learning with LatentEmbedding Optimization. In
ICLR .Salinas, D.; Flunkert, V.; Gasthaus, J.; and Januschowski, T. 2019.DeepAR: Probabilistic forecasting with autoregressive recurrentnetworks.
International Journal of Forecasting .Schmidhuber, J. 1987.
Evolutionary principles in self-referentiallearning . Master’s thesis, Institut f. Informatik, Tech. Univ. Munich.Sezer, O. B.; Gudelek, M. U.; and Ozbayoglu, A. M. 2019. Finan-cial Time Series Forecasting with Deep Learning : A SystematicLiterature Review: 2005-2019.Smyl, S. 2020. A hybrid method of exponential smoothing andrecurrent neural networks for time series forecasting.
InternationalJournal of Forecasting .Snell, J.; Swersky, K.; and Zemel, R. S. 2017. Prototypical Networksfor Few-shot Learning. In
NIPS , 4080–4090.Spiliotis, E.; Assimakopoulos, V.; and Nikolopoulos, K. 2019. Fore-casting with a hybrid method utilizing data smoothing, a variation ofthe Theta method and shrinkage of seasonal factors.
InternationalJournal of Production Economics
Journal of the OperationalResearch Society
NeurIPS 32 , 15398–15408.Velkoski, A. 2016. Python Client for FRED API. URL https://github.com/avelkoski/FRB.Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wier-stra, D. 2016. Matching Networks for One Shot Learning. In
NIPS ,3630–3638.Walker, G. 1931. On Periodicity in Series of Related Terms.
Proc.R. Soc. Lond. A
ICML .Winters, P. R. 1960. Forecasting Sales by Exponentially WeightedMoving Averages.
Management Science
NIPS .Yule, G. U. 1927. On a Method of Investigating Periodicities in Dis-turbed Series, with Special Reference to Wolfer’s Sunspot Numbers.
Phil. Trans. the R. Soc. Lond. A.

Supplementary Material for
A Strong Meta-Learned Baseline for Zero-Shot Time Series Forecasting
A TS Forecasting Accuracy Metrics
The following metrics are standard scale-free metrics in the practice of point forecasting performance evaluation (Hyndman and Koehler 2006; Makridakis and Hibon 2000; Makridakis, Spiliotis, and Assimakopoulos 2018a; Athanasopoulos et al. 2011): MAPE (Mean Absolute Percentage Error), sMAPE (symmetric MAPE) and MASE (Mean Absolute Scaled Error). Whereas sMAPE scales the error by the average between the forecast and the ground truth, MASE scales by the average error of the naïve predictor that simply copies the observation measured $m$ periods in the past, thereby accounting for seasonality. Here $m$ is the periodicity of the data (e.g., 12 for monthly series). OWA (overall weighted average) is an M4-specific metric used to rank competition entries (M4 Team 2018), where the sMAPE and MASE metrics are normalized such that a seasonally-adjusted naïve forecast obtains OWA = 1.0. ND, being a less standard metric in the traditional TS forecasting literature, is nevertheless quite popular in machine learning TS forecasting papers (Yu, Rao, and Dhillon 2016; Flunkert, Salinas, and Gasthaus 2017; Wang et al. 2019; Rangapuram et al. 2018).

$$\text{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{|y_{T+i}| + |\widehat{y}_{T+i}|}, \qquad \text{MAPE} = \frac{100}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{|y_{T+i}|},$$

$$\text{MASE} = \frac{1}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{\frac{1}{T+H-m} \sum_{j=m+1}^{T+H} |y_j - y_{j-m}|}, \qquad \text{OWA} = \frac{1}{2}\left[\frac{\text{sMAPE}}{\text{sMAPE}_{\text{Naïve2}}} + \frac{\text{MASE}}{\text{MASE}_{\text{Naïve2}}}\right],$$

$$\text{ND} = \frac{\sum_{i,ts} |y_{T+i,ts} - \widehat{y}_{T+i,ts}|}{\sum_{i,ts} |y_{T+i,ts}|}.$$

In these expressions, $y_t$ refers to the observed time series (ground truth) and $\widehat{y}_t$ refers to a point forecast. In the last equation, $y_{T+i,ts}$ refers to a sample $T+i$ from the TS with index $ts$, and the sum $\sum_{i,ts}$ runs over all TS indices and TS samples.

B N-BEATS Details
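For reference, a NumPy sketch of these metrics, written to match the formulas above; the function signatures are our own and not taken from the paper's released code.

```python
import numpy as np

def smape(y, y_hat):
    return 200.0 / len(y) * np.sum(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mape(y, y_hat):
    return 100.0 / len(y) * np.sum(np.abs(y - y_hat) / np.abs(y))

def mase(y_insample, y, y_hat, m):
    # Denominator: mean absolute error of the seasonal naive forecast over the full series.
    full = np.concatenate([y_insample, y])
    scale = np.mean(np.abs(full[m:] - full[:-m]))
    return np.mean(np.abs(y - y_hat)) / scale

def nd(y_all, y_hat_all):
    # y_all, y_hat_all: 2-D arrays (num_series, horizon) stacked over all TS.
    return np.sum(np.abs(y_all - y_hat_all)) / np.sum(np.abs(y_all))
```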
B.1 Architecture Details
N-BEATS, as originally proposed by Oreshkin et al. (2020), optionally has an interpretable hierarchical structure consisting of multiple stacks. In this work, without loss of generality, we focus on the generic model, for which output partitioning is irrelevant. This is depicted in Figure 2, modified from Figure 1 in Oreshkin et al. (2020) accordingly.
Figure 2: N-BEATS architecture, adapted from Figure 1 of Oreshkin et al. (2020).

The final forecast is obtained from the sum of individual forecasts produced by blocks; the blocks are chained together using a doubly residual architecture.
C The analysis of additional existing meta-learning algorithms
Matching networks (Vinyals et al. 2016) are similar to PNs with a few adjustments. In the vanilla matching network architecture, $C_\theta$ is defined, assuming one-hot encoded $Y^{val}_{\mathcal{T}_i}$ and $Y^{tr}_{\mathcal{T}_i}$, as a soft nearest neighbour:

$$\widehat{Y}^{val}_{\mathcal{T}_i} = \sum_{x, y \in \mathcal{D}^{tr}_{\mathcal{T}_i}} \text{softmax}\big(-d(E_w(X^{val}_{\mathcal{T}_i}), E_w(x))\big)\, y.$$

The softmax is normalized w.r.t. $x \in \mathcal{D}^{tr}_{\mathcal{T}_i}$. Predictor parameters, dynamically generated by $\mathcal{I}_t$, include embedding/label pairs: $\theta = \{(E_w(x), y),\ \forall x, y \in \mathcal{D}^{tr}_{\mathcal{T}_i}\}$. In the FCE matching network, validation and training embeddings additionally interact with the task training set via attention LSTMs (Vinyals et al. 2016). To reflect this, the update function $\mathcal{U}_u(\theta, \mathcal{D}^{tr}_{\mathcal{T}_i})$ updates the original embeddings via LSTM equations: $\theta \leftarrow \{(\text{attLSTM}_u[E_w(x), \mathcal{D}^{tr}_{\mathcal{T}_i}], y),\ \forall x, y \in \mathcal{D}^{tr}_{\mathcal{T}_i}\}$. The LSTM parameters are included in $u$. Second, the predictor is augmented with an additional relation module $R_{w_R}$, such that $P_{\theta, w}(\cdot) = C_\theta \circ R_{w_R} \circ E_{w_E}(\cdot)$, with the set of predictor meta-parameters extended accordingly: $w = (w_R, w_E)$. The relation module is again implemented via an LSTM: $R_{w_R}(\cdot) \equiv \text{attLSTM}_{w_R}(\cdot, \mathcal{D}^{tr}_{\mathcal{T}_i})$.

TADAM (Oreshkin, Rodríguez López, and Lacoste 2018) extends PNs by dynamically conditioning the embedding function on the task training data via FiLM layers (Perez et al. 2018). TADAM's predictor has the following form: $P_{\theta, w}(\cdot) = C_{\theta_C} \circ E_{\theta_{\gamma,\beta}, w}(\cdot)$; $\theta = (\theta_{\gamma,\beta}, \theta_C)$. The compare function parameters are as before, $\theta_C = \{p_k\}_{\forall k}$. The embedding function parameters $\theta_{\gamma,\beta}$ include the FiLM layer $\gamma/\beta$ (scale/shift) vectors for each convolutional layer, generated by a separate FC network from the task embedding. The initialization function $\mathcal{I}_t$ sets $\theta_{\gamma,\beta}$ to all zeros, embeds the task training data, and sets the task embedding to the average of class prototypes. The update function $\mathcal{U}_u$, whose meta-parameters include the coefficients of the FC network, $u = w_{FC}$, generates an update to $\theta_{\gamma,\beta}$ from the task embedding. Then it generates an update to the class prototypes $\theta_C$ using $E_{\theta_{\gamma,\beta}, w}(\cdot)$ conditioned with the updated $\theta_{\gamma,\beta}$.

LEO (Rusu et al. 2019) uses a fixed pretrained embedding function. The intermediate low-dimensional latent space $z$ is optimized and is used to generate the predictor's task-adaptive final layer weights $\theta_C$. LEO's predictor, $P_{\theta, w}(\cdot) = C_\theta \circ E(\cdot)$, has final layer and latent space parameters, $\theta = (\theta_C, \theta_z)$, and no meta-parameters, $w = \emptyset$. The initialization function $\mathcal{I}_t$, $t = (w_E, w_R)$, uses a task encoder and a relation network with meta-parameters $w_E$ and $w_R$. It meta-initializes the latent space parameters, $\theta_z$, based on the task training data. The update function $\mathcal{U}_u$, $u = w_D$, uses a decoder with meta-parameters $w_D$ to iteratively decode $\theta_z$ into the final layer weights, $\theta_C$. It optimizes $\theta_z$ by executing gradient descent $\theta_z \leftarrow \theta_z - \alpha \nabla_{\theta_z} \mathcal{L}_{\mathcal{T}_i}(P_\theta(X^{tr}_{\mathcal{T}_i}), Y^{tr}_{\mathcal{T}_i})$ in the inner loop.

D Training setup details
Most of the time, the model trained on a given frequency split of a source dataset is used to forecast the same frequency split on the target dataset. There are a few exceptions to this rule. First, when transferring from M4 to M3, the Others split of M3 is forecasted with the model trained on the Quarterly split of M4. This is because (i) the default horizon length of M4 Quarterly is 8, same as that of M3 Others, and (ii) M4 Others is heterogeneous and contains Weekly, Daily and Hourly data with horizon lengths 13, 14, 48. So M4 Quarterly to M3 Others transfer provided a more natural basis from an implementation standpoint. Second, the transfer from M4 to
ELECTRICITY and
TRAFFIC dataset is done based on amodel trained on M4 Hourly. This is because
ELECTRICITY and
TRAFFIC contain hourly time-series with obvious 24-hour seasonality patterns. It is worth noting that the M4Hourly only contains 414 time-series and we can clearly seepositive zero-shot transfer in Table 1 from the model trainedon this rather small dataset. Third, the transfer from
FRED to ELECTRICITY and
TRAFFIC is done by training the model onthe
FRED
Monthly split, double upsampled using bi-linearinterpolation. This is because
FRED does not have hourlydata. Monthly data naturally provide patterns with seasonalityperiod 12. Upsampling with a factor of two and bi-linearinterpolation provide data with natural seasonality period 24,most often observed in Hourly data, such as
ELECTRICITY and
TRAFFIC . D.1 N-BEATS training setup
We use the same overall training framework, as definedby Oreshkin et al. (2020), including the stratified uniformsampling of TS in the source dataset to train the model. Onemodel is trained per frequency split of a dataset ( e.g.
Yearly,Quarterly, Monthly, Weekly, Daily and Hourly frequenciesin M4 dataset). All reported accuracy results are based on anensemble of 30 models (5 different initializations with 6 dif-ferent lookback periods). One aspect that we found importantin the zero-shot regime, which is different from the originaltraining setup, is the scaling/descaling of the input/output.We scale/descale the architecture input/output by the divid-ing/multiplying all input/output values over the max valueof the input window. We found that this does not affect the Table 2: DeepAR training parameters.
BatchLayers Cells Epochs SizeYearly (M3, M4, Tourism) 3 40 300 32Quarterly (M3, M4, Tourism) 2 20 100 32Monthly (M3, M4, Tourism) 2 40 500 32Others (M3) 2 40 100 32M4 (weekly, daily) 3 20 100 32M4 Hourly 2 20 50 32Electricity (all splits) 2 40 50 64Traffic (2008-01-14) 1 20 5 64Traffic (other splits) 4 40 50 64 accuracy of the model trained and tested on the same datasetin a statistically significant way. In the zero-shot regime, thisoperation prevents catastrophic failure when the target datasetscale (marginal distribution) is significantly different fromthat of the source dataset.
D.2 DeepAR training setup
DeepAR experiments are using the model implementationprovided by GluonTS (Alexandrov et al. 2019) version 1.6.We optimized hyperparameters of DeepAR as the defaultsprovided in GluonTS would often lead to apparently sub-optimal performance on many of the datasets. The train-ing parameters for each dataset are described in Table 2.Weight decay is 0.0, Dropout rate is 0.0 for all experi-ments except Electricity dataset where it is 0.1. The de-fault scaling was replaced by MaxAbs, which improvedand stabilized results. All other parameters are defaultsof gluonts.model.deepar.DeepAREstimator . To reduce vari-ance of performance between experiments we use medianensemble of 30 independent runs. The code for DeepARexperiments can be found at https://github.com/timeseries-zeroshot/deepar_evaluation.
E Dataset Details
E.1 M4 Dataset Details
Table 3 outlines the composition of the M4 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (M4 Team 2018). The M4 dataset is large and diverse: all forecast horizons are composed of heterogeneous TS types (with the exception of Hourly) frequently encountered in business, financial and economic forecasting. Summary statistics on series lengths are also listed, showing wide variability therein, as well as a characterization (smooth vs erratic) that follows Syntetos, Boylan, and Croston (2005) and is based on the squared coefficient of variation of the series. All series have positive observed values at all time steps; as such, none can be considered intermittent or lumpy per Syntetos, Boylan, and Croston (2005).

Table 3: Composition of the M4 dataset: the number of TS based on their sampling frequency and type.

Type          Yearly/6  Qtly/8  Monthly/18  Wkly/13  Daily/14  Hrly/48    Total
Demographic      1,088   1,858       5,728       24        10        0    8,708
Finance          6,519   5,305      10,987      164     1,559        0   24,534
Industry         3,716   4,637      10,017        6       422        0   18,798
Macro            3,903   5,315      10,016       41       127        0   19,402
Micro            6,538   6,020      10,975      112     1,476        0   25,121
Other            1,236     865         277       12       633      414    3,437
Total           23,000  24,000      48,000      359     4,227      414  100,000
Min. Length         19      24          60       93       107      748
Max. Length        841     874        2812     2610      9933     1008
Mean Length       37.3   100.2       234.3   1035.0    2371.4    901.9
SD Length         24.5    51.1       137.4    707.1    1756.6    127.9
% Smooth           82%     89%         94%      84%       98%      83%
% Erratic          18%     11%          6%      16%        2%      17%

E.2 FRED Dataset Details
FRED is a large-scale dataset introduced in this paper, containing around 290k US and international economic TS from 89 sources, a subset of Federal Reserve economic data (Federal Reserve 2019).
FRED is downloaded using a custom download script based on the high-level FRED python API (Velkoski 2016), a python wrapper over the low-level web-based FRED API. For each point in a time series, the raw data published at the time of first release are downloaded. All time series with any NaN entries have been filtered out. We focus our attention on Yearly, Quarterly, Monthly, Weekly and Daily frequency data. Other frequencies are available, for example bi-weekly and five-yearly, but they are skipped because they are only present in small quantities. These factors explain why the size of the dataset we assembled for this study is 290k, while 672k time series are in principle available (Federal Reserve 2019). Hourly data are not available in this dataset. For the data frequencies included in the FRED dataset, we use the same forecasting horizons as for the M4 dataset: Yearly: 6, Quarterly: 8, Monthly: 18, Weekly: 13 and Daily: 14. The dataset download takes approximately 7–10 days because of the bandwidth constraints imposed by the low-level FRED API. The test, validation and train subsets are defined in the usual way: the test set is derived by splitting the full FRED dataset at the left boundary of the last horizon of each time series, and the validation set is similarly derived from the penultimate horizon of each time series. A minimal sketch of this split is given below.
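A minimal sketch of this per-series split, assuming each series is a 1-D array ordered in time (function and variable names are ours):

```python
import numpy as np

def split_series(values: np.ndarray, horizon: int):
    """Derive validation and test pairs from one series, as described above.

    The test target is the last horizon; the validation target is the
    penultimate horizon, with the corresponding histories ending just before.
    """
    test_history, test_target = values[:-horizon], values[-horizon:]
    val_history, val_target = values[:-2 * horizon], values[-2 * horizon:-horizon]
    return (val_history, val_target), (test_history, test_target)

# Example: a monthly series with the 18-step horizon used for FRED Monthly.
series = np.arange(120, dtype=float)
(val_x, val_y), (test_x, test_y) = split_series(series, horizon=18)
```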
E.3 M3 Dataset Details
Table 4 outlines the composition of the M3 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (Makridakis and Hibon 2000). The M3 dataset is smaller than M4, but it is still large and diverse: all forecast horizons are composed of heterogeneous TS types frequently encountered in business, financial and economic forecasting. Over the past 20 years, this dataset has supported significant efforts in the design of advanced statistical models, e.g. Theta and its variants (Assimakopoulos and Nikolopoulos 2000; Fiorucci et al. 2016; Spiliotis, Assimakopoulos, and Nikolopoulos 2019). Summary statistics on series lengths are also listed, showing wide variability in length, as well as a characterization (smooth vs erratic) that follows Syntetos, Boylan, and Croston (2005) and is based on the squared coefficient of variation of the series. All series have positive observed values at all time steps; as such, none can be considered intermittent or lumpy per Syntetos, Boylan, and Croston (2005).

Table 4: Composition of the M3 dataset: the number of TS based on their sampling frequency and type.

Type          Yearly/6  Quarterly/8  Monthly/18  Other/8  Total
Demographic        245           57         111        0    413
Finance             58           76         145       29    308
Industry           102           83         334        0    519
Macro               83          336         312        0    731
Micro              146          204         474        4    828
Other               11            0          52      141    204
Total              645          756       1,428      174  3,003
Min. Length         20           24          66       71
Max. Length         47           72         144      104
Mean Length       28.4         48.9       117.3     76.6
SD Length          9.9         10.6        28.5     10.9
% Smooth           90%          99%         98%     100%
% Erratic          10%           1%          2%       0%

E.4 TOURISM Dataset Details
Table 5 outlines the composition of the TOURISM dataset across forecast horizons by listing the number of TS based on their frequency. Summary statistics on series lengths are listed, showing wide variability in length. All series have positive observed values at all time steps. In contrast to the M4 and M3 datasets, TOURISM includes a much higher fraction of erratic series.

Table 5: Composition of the TOURISM dataset: the number of TS based on their sampling frequency.

               Yearly/4  Quarterly/8  Monthly/24  Total
Number of TS        518          427         366  1,311
Min. Length          11           30          91
Max. Length          47          130         333
Mean Length        24.4         99.6         298
SD Length           5.5         20.3        55.7
% Smooth            77%          61%         49%
% Erratic           23%          39%         51%
E.5 ELECTRICITY and TRAFFIC Dataset Details
The ELECTRICITY and TRAFFIC datasets (Dua and Graff 2017; Yu, Rao, and Dhillon 2016) are both part of the UCI repository. ELECTRICITY (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) represents the hourly electricity usage monitoring of 370 customers over three years. The TRAFFIC dataset (https://archive.ics.uci.edu/ml/datasets/PEMS-SF) tracks the hourly occupancy rates, scaled into the (0,1) range, of 963 lanes in the San Francisco Bay Area freeways over a period of slightly more than a year. Both datasets exhibit strong hourly and daily seasonality patterns.

Both datasets are aggregated to hourly data, but using different aggregation operations: sum for ELECTRICITY and mean for TRAFFIC. The hourly aggregation is done so that all the points available in the interval (h−1:00, h:00] are aggregated to hour h; thus, if the original dataset starts on 2011-01-01 00:15, then the first time point after aggregation will be 2011-01-01 01:00. A minimal sketch of this aggregation is given below.
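A minimal pandas sketch of the hourly aggregation described above; reading the (h−1:00, h:00] convention as a right-closed, right-labeled resampling window is our own interpretation:

```python
import pandas as pd

def aggregate_hourly(raw: pd.Series, how: str) -> pd.Series:
    """Aggregate a sub-hourly series to hourly values.

    Points in (h-1:00, h:00] are assigned to hour h; `how` is "sum" for
    ELECTRICITY and "mean" for TRAFFIC.
    """
    resampled = raw.resample("1H", closed="right", label="right")
    return resampled.sum() if how == "sum" else resampled.mean()

# Toy check: a series starting at 2011-01-01 00:15 with 15-minute steps.
index = pd.date_range("2011-01-01 00:15", periods=8, freq="15min")
raw = pd.Series(range(8), index=index)
hourly = aggregate_hourly(raw, how="sum")
print(hourly.index[0])  # 2011-01-01 01:00:00
```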
For the ELECTRICITY dataset we removed the first year from the training set, to match the training set used in (Yu, Rao, and Dhillon 2016), based on the aggregated dataset downloaded from what is presumably the authors' Github repository (https://github.com/rofuyu/exp-trmf-nips16/blob/master/python/exp-scripts/datasets/download-data.sh). We also made sure that the data points for both the ELECTRICITY and TRAFFIC datasets after aggregation match those used in (Yu, Rao, and Dhillon 2016). The authors of the MatFact model used the last 7 days of each dataset as the test set, but the Amazon DeepAR (Flunkert, Salinas, and Gasthaus 2017), Deep State (Rangapuram et al. 2018) and Deep Factors (Wang et al. 2019) papers use different splits, where the split points are given by a date. Changing split points without a well-grounded reason adds uncertainty to the comparability of model performances and creates challenges for the reproducibility of results, so we tried to match all the different splits in our experiments. This was especially challenging on the TRAFFIC dataset, where we had to use some heuristics to recover record dates; the dataset authors state: "The measurements cover the period from Jan. 1st 2008 to Mar. 30th 2009" and "We remove public holidays from the dataset, as well as two days with anomalies (March 8th 2009 and March 9th 2008) where all sensors were muted between 2:00 and 3:00 AM." In spite of this, we failed to match a part of the provided week-day labels to actual dates. Therefore, we had to assume that the actual list of gaps, which includes holidays and anomalous days, is as follows:
1. Jan. 1, 2008 (New Year's Day)
2. Jan. 21, 2008 (Martin Luther King Jr. Day)
3. Feb. 18, 2008 (Washington's Birthday)
4. Mar. 9, 2008 (Anomaly day)
5. May 26, 2008 (Memorial Day)
6. Jul. 4, 2008 (Independence Day)
7. Sep. 1, 2008 (Labor Day)
8. Oct. 13, 2008 (Columbus Day)
9. Nov. 11, 2008 (Veterans Day)
10. Nov. 27, 2008 (Thanksgiving)
11. Dec. 25, 2008 (Christmas Day)
12. Jan. 1, 2009 (New Year's Day)
13. Jan. 19, 2009 (Martin Luther King Jr. Day)
14. Feb. 16, 2009 (Washington's Birthday)
15. Mar. 8, 2009 (Anomaly day)
The first six gaps were confirmed by the gaps in the labels, but the rest were more than one day apart from any public holiday of the years 2008 and 2009 in San Francisco, California and the US. Moreover, the number of gaps we found in the labels provided by the dataset authors is 10, while the number of days between Jan. 1st 2008 and Mar. 30th 2009 is 455; assuming that Jan. 1st 2008 was skipped from the values and labels, we should end up with either 454 − 10 = 444 days instead of 440, or a different end date.
The metric used to evaluate performance on these datasets is ND (Yu, Rao, and Dhillon 2016), which is equal to the p50 loss used in the DeepAR, Deep State, and Deep Factors papers.
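For reference, a sketch of the ND (normalized deviation) metric in its conventional pooled form, i.e. the sum of absolute errors normalized by the sum of absolute target values; the exact aggregation over series follows (Yu, Rao, and Dhillon 2016):

```python
import numpy as np

def nd(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Normalized deviation: sum |y - y_hat| / sum |y| over all series and time steps."""
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))
```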
E.6 Overlaps Between Datasets
Some of the datasets used in the experiments consist of time series from different domains. Thus, it would be reasonable to suggest that the target dataset, used for transfer-learning performance evaluation, could contain time series from the source dataset. To validate that model performance is not affected by the fact that these datasets may share parts of time series, we performed a sequence-to-sequence comparison between training and testing sets. The searched sequence is constructed from the last horizon of the input provided to the model during test and the test part of the target dataset, forming chunks of two horizons in length. The searched sequence is then compared to every sequence of the source dataset. This method allows us to spot cases where the last part of the input together with the output exactly matches the last two horizons of a time series from the target dataset used for performance evaluation (a minimal sketch is given below). We found that the only datasets which have common sequences are M4 and FRED: 3 in Yearly, 34 in Quarterly and 195 in Monthly. Taking into account the total number of time series in these datasets, the effect of this overlap can be considered insignificant.
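A minimal sketch of this exact-match search, assuming each dataset is a list of 1-D arrays and using a brute-force sliding comparison (function names are ours):

```python
import numpy as np

def has_overlap(source: np.ndarray, chunk: np.ndarray) -> bool:
    """Return True if `chunk` appears verbatim anywhere inside `source`."""
    for start in range(len(source) - len(chunk) + 1):
        if np.array_equal(source[start:start + len(chunk)], chunk):
            return True
    return False

def count_overlapping_series(source_set, target_set, horizon: int) -> int:
    """Count target series whose last two horizons match a sub-sequence of some source series."""
    count = 0
    for target in target_set:
        chunk = np.asarray(target[-2 * horizon:])  # last input horizon + test horizon
        if any(has_overlap(np.asarray(src), chunk) for src in source_set):
            count += 1
    return count
```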
F Empirical Results Details
On all datasets, we consider the original N-BEATS (Oreshkin et al. 2020), i.e. the model trained on a given dataset and applied to this same dataset. This is provided for the purpose of assessing the generalization gap of the zero-shot N-BEATS. We consider four variants of zero-shot N-BEATS: NB-SH-M4, NB-NSH-M4, NB-SH-FR, NB-NSH-FR. The -SH/-NSH option signifies block weight sharing ON/OFF, and the -M4/-FR option signifies the M4/FRED source dataset. The mapping between seasonal patterns of target and source datasets is presented in Table 6. The model architecture and training procedure do not depend on the source dataset, i.e. we used the same parameters to train models from M4 and FRED. The parameter values can be found in Table 7. The results are calculated based on ensembles of 90 models: 6 lookback horizons, 3 loss functions, and 5 repeats. Models were trained using the training parts of the source datasets.
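The 90 ensemble members differ only in lookback horizon, loss function and random initialization. Below is a minimal sketch of combining member forecasts, assuming a median aggregation like the one used for the DeepAR runs in Appendix D.2; the exact N-BEATS ensembling operator follows Oreshkin et al. (2020):

```python
import numpy as np

def ensemble_forecast(member_forecasts) -> np.ndarray:
    """Combine forecasts from individual ensemble members.

    `member_forecasts` is a list of arrays of shape (horizon,), one per model
    (6 lookback horizons x 3 losses x 5 repeats = 90 members here).
    """
    return np.median(np.stack(member_forecasts, axis=0), axis=0)
```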
F.1 Detailed M4 Results
On M4 we compare against five M4 competition entries, each representative of a broad model class. Best pure ML is the submission by B. Trotta, the best entry among the 6 pure ML models. Best statistical is the best pure statistical model by N.Z. Legaki and K. Koutsouri. ProLogistica, the third best M4 participant, is a weighted ensemble of statistical methods. Best ML/TS combination is the model by (Montero-Manso et al. 2020), the second best entry, a gradient boosted tree over a few statistical time series models. Finally, DL/TS hybrid is the winner of the M4 competition (Smyl 2020). Results are presented in Table 8.
F.2 Detailed FRED Results
We compare against well-established off-the-shelf statistical models available from the R forecast package (Hyndman and Khandakar 2008). Those include Naïve (repeating the last value), ARIMA, Theta, SES and ETS. The quality metric is the regular sMAPE defined in (1).

Table 6: Mapping of seasonal patterns between source and target datasets. † The Monthly dataset was linearly interpolated to match the hourly period.

Target                 Source: M4    Source: FRED
FRED Yearly            Yearly        –
FRED Quarterly         Quarterly     –
FRED Monthly           Monthly       –
FRED Weekly            Weekly        –
FRED Daily             Daily         –
M4 Yearly              –             Yearly
M4 Quarterly           –             Quarterly
M4 Monthly             –             Monthly
M4 Weekly              –             Monthly
M4 Daily               –             Monthly
M4 Hourly              –             Monthly†
M3 Yearly              Yearly        Yearly
M3 Quarterly           Quarterly     Quarterly
M3 Monthly             Monthly       Monthly
M3 Others              Quarterly     Quarterly
TOURISM Yearly         Yearly        Yearly
TOURISM Quarterly      Quarterly     Quarterly
TOURISM Monthly        Monthly       Monthly
ELECTRICITY Hourly     Hourly        Monthly†
TRAFFIC Hourly         Hourly        Monthly†

F.3 Detailed M3 Results
We used the original M3 sMAPE metric to be able to compare against the results published in the literature. The sMAPE used for M3 is different from the metric defined in (1) in that it does not take the absolute values in the denominator:

sMAPE = \frac{200}{H} \sum_{i=1}^{H} \frac{| y_{T+i} - \hat{y}_{T+i} |}{y_{T+i} + \hat{y}_{T+i}}.    (11)

The detailed zero-shot transfer results on M3 from FRED and M4 are presented in Table 10. On the M3 dataset (Makridakis and Hibon 2000), we compare against the Theta method (Assimakopoulos and Nikolopoulos 2000), the winner of M3; DOTA, a dynamically optimized Theta model (Fiorucci et al. 2016); EXP, the most recent statistical approach and the previous state-of-the-art on M3 (Spiliotis, Assimakopoulos, and Nikolopoulos 2019); as well as ForecastPro, an off-the-shelf forecasting software that is based on model selection between exponential smoothing, ARIMA and moving average (Athanasopoulos et al. 2011; Assimakopoulos and Nikolopoulos 2000). We also include the DeepAR model trained on M3, denoted 'DeepAR', as well as DeepAR trained on M4 and tested in zero-shot transfer mode on M3, denoted 'DeepAR-M4'. Please see (Makridakis and Hibon 2000) for the details of other models.
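A small numerical sketch of the M3 sMAPE in (11), assuming the conventional factor of 200 so that the metric is reported in percent:

```python
import numpy as np

def smape_m3(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """M3-style sMAPE in percent: no absolute values in the denominator, cf. (11)."""
    return float(200.0 / len(y_true) * np.sum(np.abs(y_true - y_pred) / (y_true + y_pred)))

# Toy example with the yearly horizon H = 6.
y = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 14.0])
y_hat = np.array([9.5, 12.5, 10.0, 13.5, 12.0, 15.0])
print(round(smape_m3(y, y_hat), 2))
```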
Table 7: Model parameters.

Source Datasets       M4, FRED
Loss Functions        MASE, MAPE, sMAPE
Number of Blocks      30
Layers in Block       4
Layer Size            512
Iterations            15 000
Lookback Horizons     2, 3, 4, 5, 6, 7
History size          10 horizons
Learning rate         10−
Batch size            1024
Repeats               5
F.4 Detailed TOURISM Results
On the TOURISM dataset (Athanasopoulos et al. 2011), we compare against three statistical benchmarks: ETS, exponential smoothing with a cross-validated additive/multiplicative model; the Theta method; and ForePro, the same as ForecastPro in M3; as well as the top 2 entries from the TOURISM Kaggle competition (Athanasopoulos and Hyndman 2011): Stratometrics, an unknown technique; and LeeCBaker (Baker and Howard 2011), a weighted combination of Naïve, a linear trend model, and an exponentially weighted least squares regression trend. We also include the DeepAR model trained on TOURISM, denoted 'DeepAR', as well as DeepAR trained on M4 and tested in zero-shot transfer mode on TOURISM, denoted 'DeepAR-M4'. Please see (Athanasopoulos et al. 2011) for the details of other models.
F.5 Detailed ELECTRICITY Results

On ELECTRICITY, we compare against MatFact (Yu, Rao, and Dhillon 2016), DeepAR (Flunkert, Salinas, and Gasthaus 2017), Deep State (Rangapuram et al. 2018) and Deep Factors (Wang et al. 2019). We use the ND metric that was used in those papers. The results are presented in Table 12. We present our results on 3 different splits, as explained in Appendix E.5.

F.6 Detailed TRAFFIC Results

On TRAFFIC, we compare against MatFact (Yu, Rao, and Dhillon 2016), DeepAR (Flunkert, Salinas, and Gasthaus 2017), Deep State (Rangapuram et al. 2018) and Deep Factors (Wang et al. 2019). We use the ND metric that was used in those papers. The results are presented in Table 13. We present our results on 3 different splits, as explained in Appendix E.5.

G The Details of the Study of Meta-learning Effects
Figures 3 and 4 detail the performance across a number of datasets as the number of N-BEATS blocks is varied. Illustrated on the plots are the effects of having the same parameters shared across all blocks (blue curves) or having individual parameters per block (red curves).

Table 8: Performance on the M4 test set, sMAPE. Lower values are better. ∗ DeepAR trained by us using GluonTS.

                              Yearly   Quarterly  Monthly  Others  Average
                              (23k)    (24k)      (48k)    (5k)    (100k)
Best pure ML                  14.397   11.031     13.973   4.566   12.894
Best statistical              13.366   10.155     13.002   4.682   11.986
ProLogistica                  13.943    9.796     12.747   3.365   11.845
Best ML/TS combination        13.528    9.733     12.639   4.118   11.720
DL/TS hybrid, M4 winner       13.176    9.679     12.126   4.014   11.374
DeepAR∗
FRED test set, s
MAPE . Lower values are better.
Yearly Quarterly Monthly Weekly Daily Average(133,554) (57,569) (99,558) (1,348) (17) (292,046)Theta 16.50 14.24 5.35 6.29 10.57 12.20ARIMA 16.21 14.25 5.58 5.51 9.88 12.15SES 16.61 14.58 6.45 5.38 7.75 12.70ETS 16.46 19.34 8.18 5.44 8.07 14.52Naïve 16.59 14.86 6.59 5.41 8.65 12.79N-BEATS 15.79 13.27 4.79 4.63 8.86 11.49NB-SH-M4 15.00 13.36 6.10 5.67 8.57 11.60NB-NSH-M4 15.06 13.48 6.24 5.71 9.21 11.70
Table 10: M3 sMAPE defined in (11). † Numbers from Appendix C.2, Detailed results: M3 Dataset, of (Oreshkin et al. 2020). ∗ DeepAR trained by us using GluonTS.

                                                          Yearly  Quarterly  Monthly  Others  Average
                                                          (645)   (756)      (1428)   (174)   (3003)
Naïve2                                                    17.88    9.95      16.91    6.30    15.47
ARIMA (B–J automatic)                                     17.73   10.26      14.81    5.06    14.01
Comb S-H-D                                                17.07    9.22      14.48    4.56    13.52
ForecastPro                                               17.14    9.77      13.86    4.60    13.19
Theta                                                     16.90    8.96      13.85    4.41    13.01
DOTM (Fiorucci et al. 2016)                               15.94    9.28      13.74    4.58    12.90
EXP (Spiliotis, Assimakopoulos, and Nikolopoulos 2019)    16.39    8.98      13.43    5.46    12.†
LGT (Smyl and Kuber 2016)                                 15.23    n/a        n/a     4.26     n/a
BaggedETS.BC (Bergmeir, Hyndman, and Benítez 2016)        17.49    9.89      13.74    n/a      n/a
DeepAR∗
DeepAR-M4∗

Table 11: TOURISM, MAPE. ∗ DeepAR trained by us using GluonTS.
                        Yearly  Quarterly  Monthly  Average
                        (518)   (427)      (366)    (1311)
Statistical benchmarks
SNaïve                  23.61   16.46      22.56    21.25
Theta                   23.45   16.15      22.11    20.88
ForePro                 26.36   15.72      19.91    19.84
ETS                     27.68   16.05      21.15    20.88
Damped                  28.15   15.56      23.47    22.26
ARIMA                   28.03   16.23      21.13    20.96
Kaggle competitors
SaliMali                 n/a    14.83      19.64     n/a
LeeCBaker               22.73   15.14      20.19    19.35
Stratometrics           23.15   15.14      20.37    19.52
Robert                   n/a    14.96      20.28     n/a
Idalgo                   n/a    15.07      20.55     n/a
DeepAR∗
DeepAR-M4∗
Table 12: ELECTRICITY, ND. † Numbers reported by Flunkert, Salinas, and Gasthaus (2017), different from the originally reported MatFact results, most probably due to a changed split point. ∗ DeepAR trained by us using GluonTS.

MatFact†        n/a     0.255
DeepAR          0.070   0.272   n/a
Deep State      0.083   n/a     n/a
Deep Factors    n/a     0.112   n/a
Theta           0.079   0.080   0.191
ARIMA           0.067   0.068   0.225
ETS             0.083   0.075   0.190
SES             0.372   0.320   0.365
DeepAR∗

Table 13: TRAFFIC, ND. † Numbers reported by Flunkert, Salinas, and Gasthaus (2017), different from the originally reported MatFact results, most probably due to a changed split point. ∗ DeepAR trained by us using GluonTS.

MatFact†        n/a     0.187
DeepAR          0.170   0.296   n/a
Deep State      0.167   n/a     n/a
Deep Factors    n/a     0.225   n/a
Theta           0.178   0.841   0.170
ARIMA           0.145   0.500   0.153
ETS             0.701   1.330   0.720
SES             0.634   1.110   0.637
DeepAR∗
[Figure 3 plots: panels (a) M3 (sMAPE), (b) Tourism (MAPE), (c) Electricity (ND), (d) Traffic (ND); x-axis: Number of Blocks (20 to 100); legend: Shared Weights True/False.]
Figure 3: Evolution of performance metrics as a function of the number of N-BEATS blocks. Each plot combines metrics for both architectures, with shared weights (blue line) and distinct weights (red line), respectively for M3, Tourism, Electricity, and Traffic. Each target dataset has its own performance metric, matching those in its respective literature. The results are based on an ensemble of 30 models (5 different initializations with 6 different lookback periods); the mean and confidence interval (one standard deviation) are calculated based on the performance of 30 different ensembles.
[Figure 4 plots: panels (a) M3, (b) Tourism, (c) Electricity, (d) Traffic, all with sMAPE on the y-axis; x-axis: Number of Blocks (20 to 100); legend: Shared Weights True/False.]