Meta-learning framework with applications to zero-shot time-series forecasting
Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio
Element AI
Abstract
Can meta-learning discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets? This work provides positive evidence to this using a broad meta-learning framework which we show subsumes many existing meta-learning algorithms. Our theoretical analysis suggests that residual connections act as a meta-learning adaptation mechanism, generating a subset of task-specific parameters based on a given TS input, thus gradually expanding the expressive power of the architecture on-the-fly. The same mechanism is shown via linearization analysis to have the interpretation of a sequential update of the final linear layer. Our empirical results on a wide range of data emphasize the importance of the identified meta-learning mechanisms for successful zero-shot univariate forecasting, suggesting that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, resulting in performance that is at least as good as that of state-of-practice univariate forecasting models.
Time series (TS) forecasting is both a fundamental scientific problem and one of great practical importance. It is central to the actions of intelligent agents: the ability to plan and control, as well as to appropriately react to manifestations of complex, partially or completely unknown systems, often relies on the ability to forecast relevant observations based on past history. Moreover, for most utility-maximizing agents, gains in forecasting accuracy broadly translate into utility gains; as such, improvements in forecasting technology can have wide impacts. Unsurprisingly, forecasting methods have a long history that can be traced back to the very origins of human civilization (Neale 1985) and modern science (Gauss 1809), and have consistently attracted considerable research attention (Yule 1927; Walker 1931; Holt 1957; Winters 1960; Engle 1982; Sezer, Gudelek, and Ozbayoglu 2019). The applications of forecasting span a variety of fields, including high-frequency control (e.g. vehicle and robot control (Tang and Salakhutdinov 2019), data center optimization (Gao 2014)), business planning (supply chain management (Leung 1995), workforce and call center management (Chapados et al. 2014;
Ibrahim et al. 2016)), as well as such critically important areas as precision agriculture (Rodrigues Jr et al. 2019). In business specifically, improved forecasting translates into better production planning (leading to less waste) and less transportation (reducing CO2 emissions) (Kahn 2003; Kerkkänen, Korpela, and Huiskonen 2009; Nguyen, Ni, and Rossetti 2010). The progress made in univariate forecasting in the past four decades is well reflected in the results and methods considered in associated competitions over that period (Makridakis et al. 1982, 1993; Makridakis and Hibon 2000; Athanasopoulos et al. 2011; Makridakis, Spiliotis, and Assimakopoulos 2018a). Recently, growing evidence has started to emerge suggesting that machine learning approaches could improve on classical forecasting methods, in contrast to some earlier assessments (Makridakis, Spiliotis, and Assimakopoulos 2018b). For example, the winner of the 2018 M4 competition (Makridakis, Spiliotis, and Assimakopoulos 2018a) was a neural network designed by Smyl (2020).

On the practical side, the deployment of deep neural time-series models is challenged by the cold start problem. Before a tabula rasa deep neural network provides a useful forecasting output, it should be trained on a large problem-specific time-series dataset. For early adopters, this often implies data collection efforts, changing data handling practices and even changing the existing IT infrastructures on a large scale. In contrast, advanced statistical models can be deployed with significantly less effort as they estimate their parameters on a single time series at a time. In this paper we address the problem of reducing the entry cost of deep neural networks in the industrial practice of TS forecasting. We show that it is viable to train a neural network model on a diversified source dataset and deploy it on a target dataset in a zero-shot regime, i.e. without explicit retraining on that target data, resulting in performance that is at least as good as that of advanced statistical models tailored to the target dataset. We would like to clarify that we use the term "zero-shot" in our work in the sense that the number of history samples available for the target time series is so small that it makes training a deep learning model on this time series infeasible.

Addressing this practical problem provides clues to fundamental questions. Can we learn something general about forecasting and transfer this knowledge across datasets? If so, what kind of mechanisms could facilitate this? The ability to learn and transfer representations across tasks via task adaptation is an advantage of meta-learning (Raghu et al. 2019). We propose here a broad theoretical framework for meta-learning that encompasses several existing meta-learning algorithms. We further show that a recent successful model, N-BEATS (Oreshkin et al. 2020), fits this framework. We identify internal meta-learning adaptation mechanisms that generate new parameters on-the-fly, specific to a given TS, iteratively extending the architecture's expressive power. We empirically confirm that meta-learning mechanisms are key to improving zero-shot TS forecasting performance, and demonstrate results on a wide range of datasets.

The univariate point forecasting problem in discrete time is formulated given a length-$H$ forecast horizon and a length-$T$ observed series history $[y_1, \ldots, y_T] \in \mathbb{R}^T$. The task is to predict the vector of future values $y \in \mathbb{R}^H = [y_{T+1}, y_{T+2}, \ldots, y_{T+H}]$.
For simplicity, we will later consider a lookback window of length $t \le T$ ending with the last observed value $y_T$ to serve as model input, denoted $x \in \mathbb{R}^t = [y_{T-t+1}, \ldots, y_T]$. We denote $\widehat{y}$ the point forecast of $y$. Its accuracy can be evaluated with sMAPE, the symmetric mean absolute percentage error (Makridakis, Spiliotis, and Assimakopoulos 2018a),
$$\text{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{|y_{T+i}| + |\widehat{y}_{T+i}|}. \qquad (1)$$

Other quality metrics (e.g. MAPE, MASE, OWA, ND) are possible and are defined in Appendix A.

Meta-learning or learning-to-learn (Harlow 1949; Schmidhuber 1987; Bengio, Bengio, and Cloutier 1991) is usually linked to being able to (i) accumulate knowledge across tasks (i.e. transfer learning, multi-task learning) and (ii) quickly adapt the accumulated knowledge to the new task (task adaptation) (Ravi and Larochelle 2016; Bengio et al. 1992).

The N-BEATS algorithm has demonstrated outstanding performance on several competition benchmarks (Oreshkin et al. 2020). The model consists of a total of $L$ blocks connected using a doubly residual architecture. Block $\ell$ has input $x_\ell$ and produces two outputs: the backcast $\widehat{x}_\ell$ and the partial forecast $\widehat{y}_\ell$. For the first block we define $x_1 \equiv x$, where $x$ is assumed to be the model-level input from now on. We define the $k$-th fully-connected layer in the $\ell$-th block, having ReLU non-linearity, weights $W_k$, bias $b_k$ and input $h_{\ell,k-1}$, as $\text{FC}_k(h_{\ell,k-1}) \equiv \text{ReLU}(W_k h_{\ell,k-1} + b_k)$. We focus on the configuration that shares all learnable parameters across blocks. With this notation, one block of N-BEATS is described as:

$$h_{\ell,1} = \text{FC}_1(x_\ell), \quad h_{\ell,k} = \text{FC}_k(h_{\ell,k-1}),\ k = 2 \ldots K; \quad \widehat{x}_\ell = Q h_{\ell,K}, \quad \widehat{y}_\ell = G h_{\ell,K}, \qquad (2)$$

where $Q$ and $G$ are linear operators. The N-BEATS parameters included in the FC and linear layers are learned by minimizing a suitable loss function (e.g. sMAPE defined in (1)) across multiple TS. Finally, the doubly residual architecture is described by the following recursion (recalling that $x_1 \equiv x$):

$$x_\ell = x_{\ell-1} - \widehat{x}_{\ell-1}, \qquad \widehat{y} = \sum_{\ell=1}^{L} \widehat{y}_\ell. \qquad (3)$$

From a high-level perspective, there are many links with classical TS modeling: a human-specified classical model is typically designed to generalize well on unseen TS, while we propose to automate that process. The classical models include exponential smoothing with and without seasonal effects (Holt 1957, 2004; Winters 1960), and multi-trace exponential smoothing approaches, e.g.
Theta and its variants (Assimakopoulos and Nikolopoulos 2000; Fiorucci et al. 2016; Spiliotis, Assimakopoulos, and Nikolopoulos 2019). Finally, the state space modeling approach encapsulates most of the above in addition to auto-ARIMA and GARCH (Engle 1982); see Hyndman and Khandakar (2008) for an overview. The state-space approach has also been underlying significant amounts of research in neural TS modeling (Salinas et al. 2019; Wang et al. 2019; Rangapuram et al. 2018). However, those models have not been considered in the zero-shot scenario. In this work we focus on studying the importance of meta-learning for successful zero-shot forecasting. The foundations of meta-learning have been developed by Schmidhuber (1987) and Bengio, Bengio, and Cloutier (1991), among others. More recently, meta-learning research has been expanding, mostly outside of the TS forecasting domain (Ravi and Larochelle 2016; Finn, Abbeel, and Levine 2017; Snell, Swersky, and Zemel 2017; Vinyals et al. 2016; Rusu et al. 2019). In the TS domain, meta-learning has manifested itself via neural models trained over a collection of TS (Smyl 2020; Oreshkin et al. 2020) or via a model trained to predict weights combining outputs of several classical forecasting algorithms (Montero-Manso et al. 2020). Successful application of a neural TS forecasting model trained on a source dataset and fine-tuned on the target dataset was demonstrated by Hooshmand and Sharma (2019) and Ribeiro et al. (2018), as well as in the context of TS classification by Fawaz et al. (2018). Unlike those, we focus on the zero-shot scenario and address the cold start problem.
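To make the doubly residual recursion of (2)-(3) concrete, here is a minimal NumPy sketch of a stack of weight-sharing N-BEATS blocks with randomly initialized (untrained) parameters. The layer sizes and helper names (`nbeats_forward`, `make_block_params`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_block_params(t, h, hidden, rng):
    """Randomly initialized parameters shared by all blocks (illustrative sizes)."""
    W = [rng.standard_normal((hidden, t)) * 0.01] + \
        [rng.standard_normal((hidden, hidden)) * 0.01 for _ in range(2)]
    b = [np.zeros(hidden) for _ in range(3)]
    Q = rng.standard_normal((t, hidden)) * 0.01   # backcast head, eq. (2)
    G = rng.standard_normal((h, hidden)) * 0.01   # forecast head, eq. (2)
    return W, b, Q, G

def nbeats_forward(x, params, n_blocks):
    """Doubly residual stacking of identical blocks, eqs. (2)-(3)."""
    W, b, Q, G = params
    x_l, y_hat = x.copy(), np.zeros(G.shape[0])
    for _ in range(n_blocks):
        h = x_l
        for Wk, bk in zip(W, b):          # FC_k(h) = ReLU(W_k h + b_k)
            h = relu(Wk @ h + bk)
        x_hat, y_l = Q @ h, G @ h         # backcast and partial forecast
        x_l = x_l - x_hat                 # residual input of the next block
        y_hat = y_hat + y_l               # global forecast is the sum of partial forecasts
    return y_hat

rng = np.random.default_rng(0)
lookback, horizon = 12, 6
params = make_block_params(lookback, horizon, hidden=64, rng=rng)
print(nbeats_forward(rng.standard_normal(lookback), params, n_blocks=30).shape)  # (6,)
```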
We define a meta-learning framework with associated equations, and recast within it many existing meta-learning algorithms. We show that N-BEATS follows the same equations. According to our analysis, its residual connections implement the meta-learning inner loop, thereby performing task adaptation without gradient steps at inference time.

We define a novel zero-shot univariate TS forecasting task and make its dataset loaders and evaluation code public, including a new large-scale dataset
(FRED) with 290k TS. We empirically show, for the first time, that deep-learning zero-shot time series forecasting is feasible and that the meta-learning component is important for zero-shot generalization in univariate TS forecasting.
A meta-learning procedure can generally be viewed at two levels: the inner loop and the outer loop. The inner training loop operates within an individual "meta-example" or task $\mathcal{T}$ (fast learning loop improving over the current $\mathcal{T}$) and the outer loop operates across tasks (slow learning loop). A task $\mathcal{T}$ includes task training data $\mathcal{D}^{tr}_{\mathcal{T}}$ and task validation data $\mathcal{D}^{val}_{\mathcal{T}}$, both optionally involving inputs, targets and a task-specific loss: $\mathcal{D}^{tr}_{\mathcal{T}} = \{X^{tr}_{\mathcal{T}}, Y^{tr}_{\mathcal{T}}, \mathcal{L}_{\mathcal{T}}\}$, $\mathcal{D}^{val}_{\mathcal{T}} = \{X^{val}_{\mathcal{T}}, Y^{val}_{\mathcal{T}}, \mathcal{L}_{\mathcal{T}}\}$. Accordingly, a meta-learning set-up can be defined by assuming a distribution $p(\mathcal{T})$ over tasks, a predictor $P_{\theta, w}$ and a meta-learner with meta-parameters $\varphi$. We allow a subset of the predictor's parameters, denoted $w$, to belong to the meta-parameters $\varphi$ and hence not to be task adaptive. The objective is to design a meta-learner that can generalize well on a new task by appropriately choosing the predictor's task-adaptive parameters $\theta$ after observing $\mathcal{D}^{tr}_{\mathcal{T}}$. The meta-learner is trained to do so by being exposed to many tasks in a training dataset $\{\mathcal{T}^{train}_i\}$ sampled from $p(\mathcal{T})$. For each training task $\mathcal{T}^{train}_i$, the meta-learner is requested to produce the solution to the task in the form of $P_{\theta, w}: X^{val}_{\mathcal{T}_i} \mapsto \widehat{Y}^{val}_{\mathcal{T}_i}$, conditioned on $\mathcal{D}^{tr}_{\mathcal{T}_i}$. The meta-parameters $\varphi$ are updated in the outer meta-learning loop so as to obtain good generalization in the inner loop, i.e., by minimizing the expected validation loss $\mathbb{E}_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(\widehat{Y}^{val}_{\mathcal{T}_i}, Y^{val}_{\mathcal{T}_i})$ mapping the ground truth and estimated outputs into the value that quantifies the generalization performance across tasks. Training on multiple tasks enables the meta-learner to produce solutions $P_{\theta, w}$ that generalize well on a set of unseen tasks $\{\mathcal{T}^{test}_i\}$ sampled from $p(\mathcal{T})$.

Consequently, the meta-learning procedure has three distinct ingredients: (i) meta-parameters $\varphi = (t, w, u)$, (ii) an initialization function $\mathcal{I}_t$ and (iii) an update function $\mathcal{U}_u$. The meta-learner's meta-parameters $\varphi$ include the meta-parameters of the meta-initialization function, $t$, the meta-parameters of the predictor shared across tasks, $w$, and the meta-parameters of the update function, $u$. The meta-initialization function $\mathcal{I}_t(\mathcal{D}^{tr}_{\mathcal{T}_i}, c_{\mathcal{T}_i})$ defines the initial values of parameters $\theta_0$ for a given task $\mathcal{T}_i$ based on its meta-initialization parameters $t$, task training dataset $\mathcal{D}^{tr}_{\mathcal{T}_i}$ and task meta-data $c_{\mathcal{T}_i}$. Task meta-data may have, for example, the form of a task ID or a textual task description. The update function $\mathcal{U}_u(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i})$ is parameterized with update meta-parameters $u$. It defines an iterated update to predictor parameters $\theta$ at iteration $\ell$ based on their previous value and the task training set $\mathcal{D}^{tr}_{\mathcal{T}_i}$. The initialization and update functions produce a sequence of predictor parameters, which we compactly write as $\boldsymbol{\theta}_\ell \equiv \{\theta_0, \ldots, \theta_{\ell-1}, \theta_\ell\}$. We let the final predictor be a function of the whole sequence of parameters, written compactly as $P_{\boldsymbol{\theta}_\ell, w}$. One implementation of such a general function could be a Bayesian ensemble or a weighted sum, for example: $P_{\boldsymbol{\theta}_\ell, w}(\cdot) = \sum_{j=0}^{\ell} \omega_j P_{\theta_j, w}(\cdot)$. If we set $\omega_j = 1$ for $j = \ell$ and 0 otherwise, then we get the more common situation $P_{\boldsymbol{\theta}_\ell, w}(\cdot) \equiv P_{\theta_\ell, w}(\cdot)$.
This meta-learning framework is succinctly described by the following set of equations:

Parameters: $\theta$; Meta-parameters: $\varphi = (t, w, u)$.

Inner Loop:
$$\theta_0 \leftarrow \mathcal{I}_t(\mathcal{D}^{tr}_{\mathcal{T}_i}, c_{\mathcal{T}_i}), \qquad \theta_\ell \leftarrow \mathcal{U}_u(\theta_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}),\ \forall \ell > 0, \qquad x: P_{\boldsymbol{\theta}_\ell, w}(x). \qquad (4)$$

Outer Loop:
$$\varphi \leftarrow \varphi - \eta \nabla_\varphi \mathcal{L}_{\mathcal{T}_i}\big[P_{\boldsymbol{\theta}_\ell, w}(X^{val}_{\mathcal{T}_i}), Y^{val}_{\mathcal{T}_i}\big]. \qquad (5)$$

In the previous section we laid out a unifying framework for meta-learning. How is it connected to the TS forecasting task? We believe that this question is best answered by answering two questions: "why are the classical statistical TS forecasting models, such as ARIMA and ETS, not doing meta-learning?" and "what does the meta-learning component offer when it is part of a forecasting algorithm?". The first question can be answered by considering the fact that the classical statistical models produce a forecast by estimating their parameters from the history of the target time series using a predefined fixed set of rules, for example, given a model selection and the maximum likelihood parameter estimator for it. Therefore, in terms of our meta-learning framework, a classical statistical model executes only the inner loop (model parameter estimation) encapsulated in equation (4). The outer loop in this case is irrelevant: a human analyst defines what equation (4) is doing, based on experience (for example, "for most slowly varying time series with trend, no seasonality and white residuals, ETS with a Gaussian maximum likelihood estimator will probably work well"). The second question can be answered by considering that a meta-learning based forecasting algorithm replaces the predefined fixed set of rules for model parameter estimation with a learnable parameter estimation strategy. The learnable parameter estimation strategy is trained using the outer loop equation (5) by adjusting the strategy such that it is able to produce parameter estimates that generalize well over multiple TS. It is assumed that there exists a dataset that is representative of the forecasting tasks that will be handled at inference time. Thus the main advantage of meta-learning based forecasting approaches is that they enable learning a data-driven parameter estimator that can be optimized for a particular set of forecasting tasks and forecasting models. On top of that, a meta-learning approach allows for a general learnable predictor in equation (4) that can be optimized for a given forecasting task. So both the predictor (model) and its parameter estimator can be jointly learned for a forecasting task represented by available data. Empirically, we show that this elegant theoretical concept works effectively across multiple datasets and across multiple forecasting tasks (e.g. forecasting yearly, monthly or hourly TS) and even across very loosely related tasks (for example, forecasting hourly electricity demand after training on monthly economic data after appropriate time-scale normalization).
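The following is a minimal Python sketch of the inner/outer structure of (4)-(5) for a toy one-parameter linear predictor. The concrete choices here (the toy task family, the meta-learned scalar initialization and step size, and the finite-difference outer gradient) are illustrative assumptions made only to keep the example self-contained and runnable; they are not the paper's training procedure.

```python
import numpy as np

# Toy instantiation of eqs. (4)-(5): tasks are scalar linear regressions y = a * x,
# the predictor is P_{theta,w}(x) = theta * x (w is empty), I_t copies the meta-learned
# scalar t into theta_0, and U_u is one gradient step with a meta-learned step size u.

def make_task(rng):
    a = rng.uniform(-2.0, 2.0)
    x_tr, x_val = rng.standard_normal(10), rng.standard_normal(10)
    return (x_tr, a * x_tr), (x_val, a * x_val)

def task_loss(theta, data):
    x, y = data
    return np.mean((theta * x - y) ** 2)

def inner_loop(t, u, d_tr, n_steps=3):
    """Eq. (4): theta_0 <- I_t(D_tr); theta_l <- U_u(theta_{l-1}, D_tr)."""
    theta = t                                     # I_t: meta-initialization
    x, y = d_tr
    for _ in range(n_steps):
        grad = np.mean(2 * (theta * x - y) * x)   # d/dtheta of the task training loss
        theta = theta - u * grad                  # U_u: learnable update rule
    return theta

def outer_objective(phi, tasks):
    """Validation loss of eq. (5), averaged over tasks."""
    t, u = phi
    return np.mean([task_loss(inner_loop(t, u, d_tr), d_val) for d_tr, d_val in tasks])

rng = np.random.default_rng(0)
phi, eta, eps = np.array([0.0, 0.05]), 0.02, 1e-4
for step in range(300):                            # outer loop, eq. (5)
    tasks = [make_task(rng) for _ in range(8)]
    grad = np.array([(outer_objective(phi + eps * e, tasks)
                      - outer_objective(phi - eps * e, tasks)) / (2 * eps)
                     for e in np.eye(2)])          # finite-difference meta-gradient
    phi = phi - eta * grad
print("meta-learned (t, u):", phi)
```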
To further illustrate the generality of the proposed framework, we next show how to cast existing meta-learning algorithms within it, before turning to N-BEATS.
MAML and related approaches (Finn, Abbeel, and Levine 2017; Li et al. 2017; Raghu et al. 2019) can be derived from (4) and (5) by (i) setting $\mathcal{I}$ to be the identity map that copies $t$ into $\theta_0$, (ii) setting $\mathcal{U}$ to be the SGD gradient update $\mathcal{U}_u(\theta, \mathcal{D}^{tr}_{\mathcal{T}_i}) = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(P_{\theta, w}(X^{tr}_{\mathcal{T}_i}), Y^{tr}_{\mathcal{T}_i})$, where $u = \{\alpha\}$, and (iii) setting the predictor's meta-parameters to the empty set, $w = \emptyset$. Equation (5) applies with no modifications. MT-net (Lee and Choi 2018) is a variant of MAML in which the predictor's meta-parameter set $w$ is not empty. The part of the predictor parameterized with $w$ is meta-learned across tasks and is fixed during task adaptation.

Optimization as a model for few-shot learning (Ravi and Larochelle 2016) can be derived from (4) and (5) via the following steps (in addition to those of MAML). First, set the update function $\mathcal{U}_u$ to the update equation of an LSTM-like cell of the form ($\ell$ is the LSTM update step index) $\theta_\ell \leftarrow f_\ell\, \theta_{\ell-1} + \alpha_\ell \nabla_{\theta_{\ell-1}} \mathcal{L}_{\mathcal{T}_i}(P_{\theta_{\ell-1}, w}(X^{tr}_{\mathcal{T}_i}), Y^{tr}_{\mathcal{T}_i})$. Second, set $f_\ell$ to be the LSTM forget gate value (Ravi and Larochelle 2016): $f_\ell = \sigma(W_F [\nabla_\theta \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, f_{\ell-1}] + b_F)$, and $\alpha_\ell$ to be the LSTM input gate value: $\alpha_\ell = \sigma(W_\alpha [\nabla_\theta \mathcal{L}_{\mathcal{T}_i}, \mathcal{L}_{\mathcal{T}_i}, \theta_{\ell-1}, \alpha_{\ell-1}] + b_\alpha)$. Here $\sigma$ is a sigmoid non-linearity. Finally, include all the LSTM parameters into the set of update meta-parameters: $u = \{W_F, b_F, W_\alpha, b_\alpha\}$.

Prototypical Networks (PNs) (Snell, Swersky, and Zemel 2017). Most metric-based meta-learning approaches, including PNs, rely on comparing embeddings of the task training set with those of the validation set. Therefore, it is convenient to consider a composite predictor consisting of the embedding function, $E_w$, and the comparison function, $C_\theta$: $P_{\theta, w}(\cdot) = C_\theta \circ E_w(\cdot)$. PNs can be derived from (4) and (5) by considering a $K$-shot image classification task, a convolutional network $E_w$ shared across tasks, and class prototypes $p_k = \frac{1}{K} \sum_{j: Y^{tr}_j = k} E_w(X^{tr}_j)$ included in $\theta = \{p_k\}_{\forall k}$. The initialization function $\mathcal{I}_t$ with $t = \emptyset$ simply sets $\theta$ to the values of the prototypes, $\mathcal{U}_u$ is an identity map with $u = \emptyset$, and $C_\theta$ is a softmax classifier:

$$\widehat{Y}^{val}_{\mathcal{T}_i} = \arg\max_k\ \text{softmax}\big(-d(E_w(X^{val}_{\mathcal{T}_i}), p_k)\big). \qquad (6)$$

Here $d(\cdot, \cdot)$ is a similarity measure and the softmax is normalized w.r.t. all $p_k$. Finally, define the loss $\mathcal{L}_{\mathcal{T}_i}$ in (5) as the cross-entropy of the softmax classifier described in (6). Interestingly, $\theta = \{p_k\}_{\forall k}$ are nothing else than the dynamically generated weights of the final linear layer fed into the softmax, which is especially apparent when $d(a, b) = -a \cdot b$. The fact that in the prototypical network scenario only the final linear layer weights are dynamically generated based on the task training set resonates very well with the most recent study of MAML (Raghu et al. 2019). It has been shown that most of MAML's gain can be recovered by only adapting the weights of the final linear layer in the inner loop.

In this section, we illustrated that four distinct meta-learning algorithms from two broad categories (optimization- and metric-based) can be derived from our equations (4) and (5). This confirms that our meta-learning framework is general and it can represent existing meta-learning algorithms. The analysis of three additional existing meta-learning algorithms is presented in Appendix C.
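As a concrete illustration of the metric-based case, here is a small NumPy sketch of the Prototypical Networks instantiation above: $\mathcal{I}_t$ builds the class prototypes from the task training set, $\mathcal{U}_u$ is the identity, and the comparison function is the softmax over negative distances from (6). The embedding is a fixed random projection standing in for the meta-learned network $E_w$; the sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_emb = rng.standard_normal((16, 32)) * 0.1       # stands in for the shared embedding E_w

def embed(x):
    return np.tanh(x @ W_emb.T)                   # E_w(x), shape (..., 16)

def init_prototypes(x_tr, y_tr, n_classes):
    """I_t: theta = {p_k}, the per-class means of the embedded task training set."""
    return np.stack([embed(x_tr[y_tr == k]).mean(axis=0) for k in range(n_classes)])

def classify(x_val, prototypes):
    """C_theta: softmax over negative squared distances to the prototypes, eq. (6)."""
    d = ((embed(x_val)[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    logits = -d
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# A toy 3-way, 5-shot task: class k lives around a class-specific center.
centers = rng.standard_normal((3, 32)) * 2.0
y_tr = np.repeat(np.arange(3), 5)
x_tr = centers[y_tr] + 0.3 * rng.standard_normal((15, 32))
y_val = np.repeat(np.arange(3), 10)
x_val = centers[y_val] + 0.3 * rng.standard_normal((30, 32))

theta = init_prototypes(x_tr, y_tr, n_classes=3)  # inner loop: no gradient steps needed
pred, _ = classify(x_val, theta)
print("accuracy:", (pred == y_val).mean())
```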
Let us now focus on the analysis of N-BEATS described by equations (2), (3). We first introduce the following notation: $f: x_\ell \mapsto h_{\ell,K}$; $g: h_{\ell,K} \mapsto \widehat{y}_\ell$; $q: h_{\ell,K} \mapsto \widehat{x}_\ell$. In the original equations, $g$ and $q$ are linear and hence can be represented by equivalent matrices $G$ and $Q$. In the following, we keep the notation as general as possible, transitioning to the linear case only when needed. Then, given the network input $x$ ($x_1 \equiv x$), and noting that $\widehat{x}_{\ell-1} = q \circ f(x_{\ell-1})$, we can write N-BEATS as follows:

$$\widehat{y} = g \circ f(x) + \sum_{\ell > 1} g \circ f\big(x_{\ell-1} - q \circ f(x_{\ell-1})\big). \qquad (7)$$

N-BEATS is now derived from the meta-learning framework of Sec. 2 using two observations: (i) each application of $g \circ f$ in (7) is a predictor and (ii) each block of N-BEATS is an iteration of the inner meta-learning loop. More concretely, we have that $P_{\theta, w}(\cdot) = g_{w_g} \circ f_{w_f, \theta}(\cdot)$. Here $w_g$ and $w_f$ are the parameters of functions $g$ and $f$, included in $w = (w_g, w_f)$ and learned across tasks in the outer loop. The task-specific parameters $\theta$ consist of the sequence of input shift vectors, $\theta \equiv \{\mu_\ell\}_{\ell=1}^{L}$, defined such that the $\ell$-th block input can be written as $x_\ell = x - \mu_{\ell-1}$. This yields a recursive expression for the predictor's task-specific parameters of the form $\mu_\ell \leftarrow \mu_{\ell-1} + \widehat{x}_\ell$, $\mu_0 \equiv 0$, obtained by recursively unrolling eq. (3). These yield the following initialization and update functions: $\mathcal{I}_t$ with $t = \emptyset$ sets $\mu_0$ to zero; $\mathcal{U}_u$, with $u = (w_q, w_f)$, generates the next parameter update based on $\widehat{x}_\ell$:

$$\mu_\ell \leftarrow \mathcal{U}_u(\mu_{\ell-1}, \mathcal{D}^{tr}_{\mathcal{T}_i}) \equiv \mu_{\ell-1} + q_{w_q} \circ f_{w_f}(x - \mu_{\ell-1}).$$

Interestingly, (i) meta-parameters $w_f$ are shared between the predictor and the update function and (ii) the task training set is limited to the network input, $\mathcal{D}^{tr}_{\mathcal{T}_i} \equiv \{x\}$. Note that the latter makes sense because the data are complete time series, with the inputs $x$ having the same form of internal dependencies as the forecasting targets $y$. Hence, observing $x$ is enough to infer how to predict $y$ from $x$ in a way that is similar to how different parts of $x$ are related to each other.

Finally, according to (7), predictor outputs corresponding to the values of parameters $\theta$ learned at every iteration of the inner loop are combined in the final output. This corresponds to choosing a predictor of the form $P_{\boldsymbol{\mu}_L, w}(\cdot) = \sum_{j=1}^{L} \omega_j P_{\mu_j, w}(\cdot)$, $\omega_j = 1\ \forall j$, in (5). The outer learning loop (5) describes the N-BEATS training procedure across tasks (TS) with no modification.

It is clear that the final output of the architecture depends on the sequence $\boldsymbol{\mu}_L$. Even if predictor parameters $w_g$, $w_f$ are shared across blocks and fixed, the behaviour of $P_{\boldsymbol{\mu}_L, w}(\cdot) = g_{w_g} \circ f_{w_f, \boldsymbol{\mu}_L}(\cdot)$ is governed by an extended space of parameters $(w, \mu_0, \mu_1, \ldots)$. Therefore, the expressive power of the architecture can be expected to grow with the growing number of blocks, in proportion to the growth of the space spanned by $\boldsymbol{\mu}_L$, even if $w_g$, $w_f$ are shared across blocks. Thus, it is reasonable to expect that the addition of identical blocks will improve generalization performance because of the increase in expressive power.
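A quick numerical check of this reinterpretation, under the same kind of illustrative NumPy setup as the earlier block sketch: running the doubly residual stack of eqs. (2)-(3) gives exactly the same forecast as iterating the input shifts $\mu_\ell \leftarrow \mu_{\ell-1} + q \circ f(x - \mu_{\ell-1})$ and summing the per-iteration predictor outputs $g \circ f(x - \mu_{\ell-1})$. The variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t, h, hidden, L = 12, 6, 32, 10
W1, W2 = rng.standard_normal((hidden, t)) * 0.1, rng.standard_normal((hidden, hidden)) * 0.1
Q, G = rng.standard_normal((t, hidden)) * 0.1, rng.standard_normal((h, hidden)) * 0.1

f = lambda x: np.maximum(W2 @ np.maximum(W1 @ x, 0), 0)   # shared non-linear embedding
g = lambda hid: G @ hid                                     # forecast head
q = lambda hid: Q @ hid                                     # backcast head

x = rng.standard_normal(t)

# (a) Doubly residual stacking, eqs. (2)-(3).
x_l, y_residual = x.copy(), np.zeros(h)
for _ in range(L):
    hid = f(x_l)
    x_l, y_residual = x_l - q(hid), y_residual + g(hid)

# (b) Meta-learning view: input-shift sequence mu_l and summed predictor outputs.
mu, y_meta = np.zeros(t), np.zeros(h)
for _ in range(L):
    y_meta = y_meta + g(f(x - mu))          # predictor output for the current theta
    mu = mu + q(f(x - mu))                  # inner-loop update U_u of eq. (4)

print(np.allclose(y_residual, y_meta))      # True
```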
Next, we go a level deeper in the analysis to uncover more intricate task adaptation processes. Using linear approximation analysis, we express N-BEATS' meta-learning operation in terms of the adaptation of the internal weights of the network based on the task input data. In particular, assuming small $\widehat{x}_\ell$, (7) can be approximated using the first-order Taylor series expansion in the vicinity of $x_{\ell-1}$:

$$\widehat{y} = g \circ f(x) + \sum_{\ell > 1} \big[g - J_{g \circ f}(x_{\ell-1})\, q\big] \circ f(x_{\ell-1}) + o(\|q \circ f(x_{\ell-1})\|).$$

Here $J_{g \circ f}(x_{\ell-1}) = J_g(f(x_{\ell-1}))\, J_f(x_{\ell-1})$ is the Jacobian of $g \circ f$. We now consider linear $g$ and $q$, as mentioned earlier, in which case $g$ and $q$ are represented by two matrices of appropriate dimensionality, $G$ and $Q$, and $J_g(f(x_{\ell-1})) = G$. Thus, the above expression can be simplified as:

$$\widehat{y} = G f(x) + \sum_{\ell > 1} G\big[I - J_f(x_{\ell-1}) Q\big] f(x_{\ell-1}) + o(\|Q f(x_{\ell-1})\|).$$

Continuously applying the linear approximation $f(x_\ell) = [I - J_f(x_{\ell-1}) Q] f(x_{\ell-1}) + o(\|Q f(x_{\ell-1})\|)$ until we reach $\ell = 1$, $x_1 \equiv x$, we arrive at the following:

$$\widehat{y} = \sum_{\ell \ge 1} G \left[\prod_{k=1}^{\ell-1} \big[I - J_f(x_{\ell-k}) Q\big]\right] f(x) + o(\|Q f(x_\ell)\|). \qquad (8)$$

Note that $G \left(\prod_{k=1}^{\ell-1} [I - J_f(x_{\ell-k}) Q]\right)$ can be written in an iterative update form. Consider $G'_1 = G$; then the update equation for $G'$ can be written as $G'_\ell = G'_{\ell-1}[I - J_f(x_{\ell-1}) Q]$, $\forall \ell > 1$, and

$$\widehat{y} = \sum_{\ell \ge 1} G'_\ell f(x) + o(\|Q f(x_\ell)\|). \qquad (9)$$

Let us now discuss how (9) can be used to re-interpret N-BEATS as an instance of the meta-learning framework (4) and (5). The predictor can now be represented in a decoupled form $P_{\theta, w}(\cdot) = g_\theta \circ f_{w_f}(\cdot)$. Thus task adaptation is clearly confined to the decision function, $g_\theta$, whereas the embedding function $f_{w_f}$ only relies on fixed meta-parameters $w_f$. The adaptive parameters $\theta$ include the sequence of projection matrices $\{G'_\ell\}$. The meta-initialization function $\mathcal{I}_t$ is parameterized with $t \equiv G$ and it simply sets $G'_1 \leftarrow t$. The main ingredient of the update function $\mathcal{U}_u$ is $Q f_{w_f}(\cdot)$, parameterized as before with $u = (Q, w_f)$. The update function now consists of two equations:

$$G'_\ell \leftarrow G'_{\ell-1}\big[I - J_f(x - \mu_{\ell-1}) Q\big],\ \forall \ell > 1; \qquad \mu_\ell \leftarrow \mu_{\ell-1} + Q f_{w_f}(x - \mu_{\ell-1}),\ \mu_0 = 0. \qquad (10)$$

The first-order analysis results (9) and (10) suggest that, under certain circumstances, the block-by-block manipulation of the input sequence apparent in (7) is equivalent to producing an iterative update of the predictor's final linear layer weights apparent in (10), with the block input being set to the same fixed value. This is very similar to the final linear layer update behaviour identified in other meta-learning algorithms: in LEO it is present by design (Rusu et al. 2019), in MAML it was identified by Raghu et al. (2019), and in PNs it follows from the results of our analysis in Section 2.2.

It is hard to study the form of $Q$ learned from the data in general. However, equipped with the results of the linear approximation analysis presented in Section 3.1, we can study the case of a two-block network, assuming that the $L_2$ norm loss between $y$ and $\widehat{y}$ is used to train the network.
If, in addition, the dataset consists of the set of $N$ pairs $\{x_i, y_i\}_{i=1}^{N}$, the dataset-wise loss $\mathcal{L}$ has the following expression:

$$\mathcal{L} = \sum_i \big\| y_i - 2 G f(x_i) + J_{g \circ f}(x_i) Q f(x_i) + o(\|Q f(x_i)\|) \big\|^2.$$

Introducing $\Delta y_i = y_i - 2 G f(x_i)$, the error between the default forecast $2 G f(x_i)$ and the ground truth $y_i$, and expanding the $L_2$ norm, we obtain the following:

$$\mathcal{L} = \sum_i \Delta y_i^{\mathsf{T}} \Delta y_i + 2\, \Delta y_i^{\mathsf{T}} J_{g \circ f}(x_i) Q f(x_i) + f(x_i)^{\mathsf{T}} Q^{\mathsf{T}} J_{g \circ f}^{\mathsf{T}}(x_i) J_{g \circ f}(x_i) Q f(x_i) + o(\|Q f(x_i)\|).$$

Now, assuming that the rest of the parameters of the network are fixed, we obtain the derivative with respect to $Q$ using matrix calculus (Petersen and Pedersen 2012):

$$\frac{\partial \mathcal{L}}{\partial Q} = \sum_i 2\, J_{g \circ f}^{\mathsf{T}}(x_i) \Delta y_i f(x_i)^{\mathsf{T}} + 2\, J_{g \circ f}^{\mathsf{T}}(x_i) J_{g \circ f}(x_i) Q f(x_i) f(x_i)^{\mathsf{T}} + o(\|Q f(x_i)\|).$$

Using the above expression we conclude that the first-order approximation of the optimal $Q$ satisfies the following equation:

$$\sum_i J_{g \circ f}^{\mathsf{T}}(x_i) \Delta y_i f(x_i)^{\mathsf{T}} = - \sum_i J_{g \circ f}^{\mathsf{T}}(x_i) J_{g \circ f}(x_i) Q f(x_i) f(x_i)^{\mathsf{T}}.$$

Although this does not help to find a closed-form solution for $Q$, it does provide a quite obvious intuition: the LHS and the RHS are equal when the correction term created by the second block, $J_{g \circ f}(x_i) Q f(x_i)$, tends to compensate the default forecast error, $\Delta y_i$. Therefore, a $Q$ satisfying the equation will tend to drive the update to $G$ in (10) in such a way that, on average, the projection of $f(x)$ over the update $J_{g \circ f}(x) Q$ to matrix $G$ will tend to compensate the error $\Delta y$ made by forecasting $y$ using $G$ based on meta-initialization.

Let us now analyze the factors that enable the meta-learning inner loop obvious in (10). First, the meta-learning regime is not viable without having multiple blocks connected via the residual connection (feedback loop): $x_\ell = x_{\ell-1} - q \circ f(x_{\ell-1})$. Second, the meta-learning inner loop is not viable when $f$ is linear: the update of $G$ is extracted from the curvature of $f$ at the point dictated by the input $x$ and the sequence of shifts $\boldsymbol{\mu}_L$. Indeed, suppose $f$ is linear, and denote it by the linear operator $F$. The Jacobian $J_f(x_{\ell-1})$ becomes a constant, $F$. Equation (8) simplifies as (note that for linear $f$, (8) is exact):

$$\widehat{y} = \sum_{\ell \ge 1} G [I - FQ]^{\ell-1} F x.$$

Therefore, $G \sum_{\ell \ge 1} [I - FQ]^{\ell-1}$ may be replaced with an equivalent $G'$ that is not data adaptive. Interestingly, $\sum_{\ell \ge 1} [I - FQ]^{\ell-1}$ happens to be a truncated Neumann series. Denoting the Moore–Penrose pseudo-inverse as $[\cdot]^{+}$, assuming boundedness of $FQ$ and completing the series, $\sum_{\ell=0}^{\infty} [I - FQ]^{\ell}$, results in $\widehat{y} = G [FQ]^{+} F x$. Therefore, under certain conditions, the N-BEATS architecture with linear $f$ and an infinite number of blocks can be interpreted as a linear predictor of a signal in colored noise. Here the $[FQ]^{+}$ part cleans the intermediate space created by the projection $F$ from the components that are undesired for forecasting, and $G$ creates the forecast based on the initial projection $F x$ after it is "sanitized" by $[FQ]^{+}$.

In this section we established that N-BEATS is an instance of a meta-learning algorithm described by equations (4) and (5).
We showed that each block of N-BEATS is an inner meta-learning loop that generates additional shift parameters specific to the input time series. Therefore, the expressive power of the architecture is expected to grow with each additional block, even if all blocks share their parameters. We used linear approximation analysis to show that the input shift in a block is equivalent to the update of the block's final linear layer weights under certain conditions. The key role in this process seems to be encapsulated in the non-linearity of $f$ and in the residual connections.

We evaluate performance on a number of datasets representing a diverse set of univariate time series. For each of them, we evaluate the base N-BEATS performance compared against the best-published approaches. We also evaluate zero-shot transfer from several source datasets, as explained next.

Table 1: Dataset-specific metrics aggregated over each dataset; lower values are better. The bottom three rows represent the zero-shot transfer setup, indicating respectively the core algorithm (DeepAR or N-BEATS) and the source dataset (M4 or FR(ED)). All other model names are explained in Appendix F. † N-BEATS trained on double upsampled monthly data, see Appendix D. ‡ M3/M4 sMAPE definitions differ. ∗ DeepAR trained by us using GluonTS.

| M4, sMAPE          | M3, sMAPE‡     | TOURISM, MAPE | ELECTR / TRAFF, ND     | FRED, sMAPE |
| Pure ML 12.89      | Comb 13.52     | ETS 20.88     | MatFact 0.16 / 0.20    | ETS 14.16   |
| Best STAT 11.99    | ForePro 13.19  | Theta 20.88   | DeepAR 0.07 / 0.17     | Naïve 12.79 |
| ProLogistica 11.85 | Theta 13.01    | ForePro 19.84 | DeepState 0.08 / 0.17  | SES 12.70   |
| Best ML/TS 11.72   | DOTM 12.90     | Strato 19.52  | Theta 0.08 / 0.18      | Theta 12.20 |
| DL/TS hybrid 11.37 | EXP 12.71      | LCBaker 19.35 | ARIMA 0.07 / 0.15      | ARIMA 12.15 |
| N-BEATS 11.14      | 12.37          | 18.52         | 0.07 / 0.11            | 11.49       |
| DeepAR-M4∗ n/a     | 14.76          | 24.79         | 0.15 / 0.36            | n/a         |
| N-BEATS-M4 n/a     | 12.44          | 18.82         | 0.09 / 0.15            | 11.60       |
| N-BEATS-FR 11.70   | 12.69          | 19.94†        | 0.09 / 0.26            | n/a         |
Base datasets. M4 (M4 Team 2018) contains 100k TS representing demographic, finance, industry, macro and micro indicators. Sampling frequencies include yearly, quarterly, monthly, weekly, daily and hourly. M3 (Makridakis and Hibon 2000) contains 3003 TS from domains and sampling frequencies similar to M4. FRED is a dataset introduced in this paper containing 290k US and international economic TS from 89 sources, a subset of the data published by the Federal Reserve Bank of St. Louis (Federal Reserve 2019). TOURISM (Athanasopoulos et al. 2011) includes monthly, quarterly and yearly series of indicators related to tourism activities. ELECTRICITY (Dua and Graff 2017; Yu, Rao, and Dhillon 2016) represents the hourly electricity usage of 370 customers. TRAFFIC (Dua and Graff 2017; Yu, Rao, and Dhillon 2016) tracks hourly occupancy rates of 963 lanes in the Bay Area freeways. Additional details for all datasets appear in Appendix E.
Zero-shot TS forecasting task definition. One of the base datasets, a source dataset, is used to train a machine learning model. The trained model then forecasts a TS in a target dataset. The source and the target datasets are distinct: they do not contain TS whose values are linear transformations of each other. The forecasted TS is split into two non-overlapping pieces: the history and the test. The history is used as model input and the test is used to compute the forecast error metric. We use the history and test splits for the base datasets consistent with their original publication, unless explicitly stated otherwise. To produce forecasts, the model is allowed to access the TS in the target dataset on a one-at-a-time basis. This is to avoid having the model implicitly learn/adapt based on any information contained in the target dataset other than the history of the forecasted TS. If any adjustments of model parameters or hyperparameters are necessary, they are allowed exclusively using the history of the forecasted TS.
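A minimal sketch of this evaluation protocol, assuming a model pre-trained on the source dataset and target series provided as plain arrays; the names `pretrained_model.predict` and `target_series` are hypothetical placeholders, not part of any released code.

```python
import numpy as np

def smape(y_true, y_pred):
    # Symmetric MAPE as in eq. (1).
    return 200.0 * np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

def zero_shot_evaluate(pretrained_model, target_series, horizon, lookback):
    """Forecast each target TS one at a time, using only its own history as input."""
    errors = []
    for series in target_series:                     # list of 1-D numpy arrays
        history, test = series[:-horizon], series[-horizon:]
        window = history[-lookback:]                 # model input from the history only
        forecast = pretrained_model.predict(window)  # no retraining on the target data
        errors.append(smape(test, forecast))
    return float(np.mean(errors))
```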
Training setup.
DeepAR (Salinas et al. 2019) is trained using the GluonTS implementation from its authors (Alexandrov et al. 2019). N-BEATS is trained following the original training setup of Oreshkin et al. (2020). Both N-BEATS and DeepAR are trained with scaling/descaling of the architecture input/output, dividing/multiplying all input/output values by the max value of the input window, computed per target time series. This does not affect the accuracy of the models in the usual train/test scenario. In the zero-shot regime, this operation is intended to prevent catastrophic failure when the scale of the target time series differs significantly from those of the source dataset. Additional training setup details are provided in Appendix D.
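The max-scaling described above amounts to the following simple wrapper (a sketch under the assumption that `model_forward` maps a scaled input window to a scaled forecast; the names are illustrative):

```python
import numpy as np

def forecast_with_max_scaling(model_forward, input_window, eps=1e-8):
    """Divide the input by the max of the input window, forecast, then scale back.

    Assumes positive-valued series, as in the datasets used here.
    """
    scale = np.max(input_window) + eps
    y_scaled = model_forward(input_window / scale)
    return y_scaled * scale
```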
Key results.
For each dataset, we compare our results to 5 representative entries reported in the literature for that dataset, based on dataset-specific metrics (M4, FRED, M3: sMAPE; TOURISM: MAPE; ELECTRICITY, TRAFFIC: ND). We additionally train the popular machine learning TS model DeepAR and evaluate it in the zero-shot regime. Our main results appear in Table 1, with more details provided in Appendix F. In the zero-shot forecasting regime (bottom three rows), N-BEATS consistently outperforms most statistical models tailored to these datasets as well as DeepAR trained on M4 and evaluated in the zero-shot regime on other datasets.
Figure 1: Zero-shot forecasting performance of N-BEATS trained on M4 and applied to M3 (left) and TOURISM (right) target datasets with respect to the number of blocks, L. The mean and one standard deviation interval (based on ensemble bootstrap) with (blue) and without (red) weight sharing across blocks are shown. The extended set of results for all datasets, using FRED as a source dataset and a few metrics, are provided in Appendix G, further reinforcing our findings.
N-BEATS trained on FRED and applied in the zero-shot regime to M4 outperforms the best statistical model selected for its performance on M4 and is at par with the competition's second entry (boosted trees). On M3 and TOURISM the zero-shot forecasting performance of N-BEATS is better than that of the M3 winner, Theta (Assimakopoulos and Nikolopoulos 2000). On ELECTRICITY and TRAFFIC, N-BEATS performs close to or better than other neural models trained on these datasets. The results suggest that a neural model is able to extract general knowledge about TS forecasting and then successfully adapt it to forecast on unseen TS. Our study presents the first successful application of a neural model to solve univariate zero-shot TS point forecasting across a large variety of datasets, and suggests that a pre-trained N-BEATS model can constitute a strong baseline for this task.
Meta-learning Effects. Analysis in Section 3 implies that N-BEATS internally generates a sequence of parameters that dynamically extend the expressive power of the architecture with each newly added block, even if the blocks are identical. To validate this hypothesis, we performed an experiment studying the zero-shot forecasting performance of N-BEATS with an increasing number of blocks, with and without parameter sharing. The architecture was trained on M4 and the performance was measured on the target datasets M3 and TOURISM. The results are presented in Fig. 1. On the two datasets and for the shared-weights configuration, we consistently see performance improvement when the number of blocks increases up to about 30 blocks. In the same scenario, increasing the number of blocks beyond 30 leads to small, but consistent, deterioration in performance. One can view these results as evidence supporting the meta-learning interpretation of N-BEATS, with a possible explanation of this phenomenon as overfitting in the meta-learning inner loop. It would not otherwise be obvious how to explain the generalization dynamics in Fig. 1. Additionally, the performance improvement due to meta-learning alone (shared weights, multiple blocks vs. a single block) is 12.60 to 12.44 (1.2%) and 20.40 to 18.82 (7.8%) for M3 and TOURISM, respectively (see Fig. 1). The performance improvement due to meta-learning and unique weights (unique weights, multiple blocks vs. a single block) is 12.60 to 12.40 (1.6%) and 20.40 to 18.91 (7.4%). Clearly, the majority of the gain is due to meta-learning alone. The introduction of unique block weights sometimes results in marginal gain, but often leads to a loss (see more results in Appendix G).

In this section, we presented empirical evidence that neural networks are able to provide high-quality zero-shot forecasts on unseen TS. We further empirically supported the hypothesis that the meta-learning adaptation mechanisms identified within N-BEATS in Section 3 are instrumental in achieving impressive zero-shot forecasting accuracy results.

Zero-shot transfer learning.
We propose a broad meta-learning framework and explain mechanisms facilitating zero-shot forecasting. Our results show that neural networks can extract generic knowledge about forecasting and apply it in zero-shot transfer. Residual architectures in general are covered by the analysis of Sec. 3, which might explain some of the success of residual architectures, although their deeper study should be the subject of future work. Our theory suggests that residual connections generate, on-the-fly, compact task-specific parameter updates by producing a sequence of input shifts for identical blocks. Sec. 3.1 reinterprets our results, showing that, as a first-order approximation, residual connections produce an iterative update to the predictor's final linear layer.
Memory efficiency and knowledge compression.
Our empirical results imply that N-BEATS is able to compress all the relevant knowledge about a given dataset in a single block, rather than in 10 or 30 blocks with individual weights. From a practical perspective, this could be used to obtain 10–30 times neural network weight compression and is relevant in applications where storing neural networks efficiently is important. Intuitively, the network with unique block weights includes the network with identical weights as a special case; therefore, it is free to combine the effect of meta-learning with the effect of unique block weights based on its training loss.

References
Alexandrov, A.; Benidis, K.; Bohlke-Schneider, M.; Flunkert, V.;Gasthaus, J.; Januschowski, T.; Maddix, D. C.; Rangapuram, S.;Salinas, D.; Schulz, J.; Stella, L.; Türkmen, A. C.; and Wang, Y.2019. GluonTS: Probabilistic Time Series Modeling in Python. arXiv preprint arXiv:1906.05264 .Assimakopoulos, V.; and Nikolopoulos, K. 2000. The theta model:a decomposition approach to forecasting.
International Journal ofForecasting
International Journal of Forecast-ing
International Journalof Forecasting
International Journal of Forecasting
Optimality in Artificialand Biological Neural Networks .Bengio, Y.; Bengio, S.; and Cloutier, J. 1991. Learning a SynapticLearning Rule. In
Proceedings of the International Joint Conferenceon Neural Networks , II–A969. Seattle, USA.Bergmeir, C.; Hyndman, R. J.; and Benítez, J. M. 2016. Baggingexponential smoothing methods using STL decomposition and Box–Cox transformation.
International Journal of Forecasting
European Journal of OperationalResearch
Econo-metrica .Federal Reserve Bank of St. Louis. 2019. FRED Economic Data.Data retrieved from https://fred.stlouisfed.org/ Accessed: 2019-11-01.Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In
ICML , 1126–1135.Fiorucci, J. A.; Pellegrini, T. R.; Louzada, F.; Petropoulos, F.; andKoehler, A. B. 2016. Models for optimising the theta method andtheir relationship to state space models.
International Journal ofForecasting
CoRR abs/1704.04110.Gao, J. 2014. Machine learning applications for data center opti-mization. Technical report, Google.Gauss, C. F. 1809.
Theoria motus corporum coelestium in section-ibus conicis solem ambientium . Hamburg: Frid. Perthes and I. H.Besser. Harlow, H. F. 1949. The Formation of Learning Sets.
PsychologicalReview
International Journal of Forecasting
Proceedings of theTenth ACM International Conference on Future Energy Systems ,e-Energy’19, 12–16.Hyndman, R.; and Koehler, A. B. 2006. Another look at measuresof forecast accuracy.
International Journal of Forecasting
Journal of Statistical Soft-ware
International Journal of Forecasting
The Journal of Business Forecasting Methods &Systems
International Journal of Production Economics
ICML , 2933–2942.Leung, H. C. 1995. Neural networks in supply chain management.In
Proceedings for Operating Research and the Management Sci-ences , 347–352.Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-SGD: Learning toLearn Quickly for Few Shot Learning.
CoRR
Journal of forecasting
International Journal ofForecasting
International Journal of Forecasting
International Journal of Forecasting
PLoS ONE
International Journal of Forecasting
Weather and Climate
Proceedings of the2010 Industrial Engineering Research Conference .Oreshkin, B. N.; Carpov, D.; Chapados, N.; and Bengio, Y. 2020.N-BEATS: Neural basis expansion analysis for interpretable timeseries forecasting. In
ICLR .Oreshkin, B. N.; Rodríguez López, P.; and Lacoste, A. 2018.TADAM: Task dependent adaptive metric for improved few-shotlearning. In
NeurIPS , 721–731.Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; and Courville, A.2018. FiLM: Visual reasoning with a general conditioning layer. In
AAAI .Petersen, K. B.; and Pedersen, M. S. 2012. The Matrix Cookbook.Version 20121115.Raghu, A.; Raghu, M.; Bengio, S.; and Vinyals, O. 2019. RapidLearning or Feature Reuse? Towards Understanding the Effective-ness of MAML.Rangapuram, S. S.; Seeger, M.; Gasthaus, J.; Stella, L.; Wang, Y.;and Januschowski, T. 2018. Deep State Space Models for TimeSeries Forecasting. In
NeurIPS .Ravi, S.; and Larochelle, H. 2016. Optimization as a model forfew-shot learning. In
ICLR .Ribeiro, M.; Grolinger, K.; ElYamany, H. F.; Higashino, W. A.;and Capretz, M. A. 2018. Transfer learning with seasonal andtrend adjustment for cross-building energy forecasting.
Energy andBuildings
Poster Proceedings of the 12th European Conferenceon Precision Agriculture .Rusu, A. A.; Rao, D.; Sygnowski, J.; Vinyals, O.; Pascanu, R.;Osindero, S.; and Hadsell, R. 2019. Meta-Learning with LatentEmbedding Optimization. In
ICLR .Salinas, D.; Flunkert, V.; Gasthaus, J.; and Januschowski, T. 2019.DeepAR: Probabilistic forecasting with autoregressive recurrentnetworks.
International Journal of Forecasting .Schmidhuber, J. 1987.
Evolutionary principles in self-referentiallearning . Master’s thesis, Institut f. Informatik, Tech. Univ. Munich.Sezer, O. B.; Gudelek, M. U.; and Ozbayoglu, A. M. 2019. Finan-cial Time Series Forecasting with Deep Learning : A SystematicLiterature Review: 2005-2019.Smyl, S. 2020. A hybrid method of exponential smoothing andrecurrent neural networks for time series forecasting.
InternationalJournal of Forecasting .Snell, J.; Swersky, K.; and Zemel, R. S. 2017. Prototypical Networksfor Few-shot Learning. In
NIPS , 4080–4090.Spiliotis, E.; Assimakopoulos, V.; and Nikolopoulos, K. 2019. Fore-casting with a hybrid method utilizing data smoothing, a variation ofthe Theta method and shrinkage of seasonal factors.
InternationalJournal of Production Economics
Journal of the OperationalResearch Society
NeurIPS 32 , 15398–15408.Velkoski, A. 2016. Python Client for FRED API. URL https://github.com/avelkoski/FRB.Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wier-stra, D. 2016. Matching Networks for One Shot Learning. In
NIPS ,3630–3638.Walker, G. 1931. On Periodicity in Series of Related Terms.
Proc.R. Soc. Lond. A
ICML .Winters, P. R. 1960. Forecasting Sales by Exponentially WeightedMoving Averages.
Management Science
NIPS .Yule, G. U. 1927. On a Method of Investigating Periodicities in Dis-turbed Series, with Special Reference to Wolfer’s Sunspot Numbers.
Phil. Trans. the R. Soc. Lond. A.

Supplementary Material for
A Strong Meta-Learned Baseline for Zero-Shot Time Series Forecasting
A TS Forecasting Accuracy Metrics
The following metrics are standard scale-free metrics in the practice of point forecasting performance evaluation (Hyndman and Koehler 2006; Makridakis and Hibon 2000; Makridakis, Spiliotis, and Assimakopoulos 2018a; Athanasopoulos et al. 2011): MAPE (Mean Absolute Percentage Error), sMAPE (symmetric MAPE) and MASE (Mean Absolute Scaled Error). Whereas sMAPE scales the error by the average between the forecast and the ground truth, MASE scales by the average error of the naïve predictor that simply copies the observation measured $m$ periods in the past, thereby accounting for seasonality. Here $m$ is the periodicity of the data (e.g., 12 for monthly series). OWA (overall weighted average) is an M4-specific metric used to rank competition entries (M4 Team 2018), where the sMAPE and MASE metrics are normalized such that a seasonally-adjusted naïve forecast obtains OWA = 1.0. ND, being a less standard metric in the traditional TS forecasting literature, is nevertheless quite popular in machine learning TS forecasting papers (Yu, Rao, and Dhillon 2016; Flunkert, Salinas, and Gasthaus 2017; Wang et al. 2019; Rangapuram et al. 2018).

$$\text{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{|y_{T+i}| + |\widehat{y}_{T+i}|}, \qquad \text{MAPE} = \frac{100}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{|y_{T+i}|},$$

$$\text{MASE} = \frac{1}{H} \sum_{i=1}^{H} \frac{|y_{T+i} - \widehat{y}_{T+i}|}{\frac{1}{T+H-m} \sum_{j=m+1}^{T+H} |y_j - y_{j-m}|}, \qquad \text{OWA} = \frac{1}{2}\left[\frac{\text{sMAPE}}{\text{sMAPE}_{\text{Naïve2}}} + \frac{\text{MASE}}{\text{MASE}_{\text{Naïve2}}}\right],$$

$$\text{ND} = \frac{\sum_{i,ts} |y_{T+i,ts} - \widehat{y}_{T+i,ts}|}{\sum_{i,ts} |y_{T+i,ts}|}.$$

In these expressions, $y_t$ refers to the observed time series (ground truth) and $\widehat{y}_t$ refers to a point forecast. In the last equation, $y_{T+i,ts}$ refers to a sample $T+i$ from the TS with index $ts$, and the sum $\sum_{i,ts}$ runs over all TS indices and TS samples.

B N-BEATS Details
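For reference, a NumPy sketch of these metrics, written to match the formulas above; the function signatures are our own and not taken from the paper's released code.

```python
import numpy as np

def smape(y, y_hat):
    return 200.0 / len(y) * np.sum(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mape(y, y_hat):
    return 100.0 / len(y) * np.sum(np.abs(y - y_hat) / np.abs(y))

def mase(y_insample, y, y_hat, m):
    # Denominator: mean absolute error of the seasonal naive forecast over the full series.
    full = np.concatenate([y_insample, y])
    scale = np.mean(np.abs(full[m:] - full[:-m]))
    return np.mean(np.abs(y - y_hat)) / scale

def nd(y_all, y_hat_all):
    # y_all, y_hat_all: 2-D arrays (num_series, horizon) stacked over all TS.
    return np.sum(np.abs(y_all - y_hat_all)) / np.sum(np.abs(y_all))
```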
B.1 Architecture Details
N-BEATS, as originally proposed by Oreshkin et al. (2020), optionally has an interpretable hierarchical structure consisting of multiple stacks. In this work, without loss of generality, we focus on the generic model, for which output partitioning is irrelevant. This is depicted in Figure 2, modified from Figure 1 in Oreshkin et al. (2020) accordingly.
Figure 2: N-BEATS architecture, adapted from Figure 1 of Oreshkin et al. (2020).

The final forecast is obtained from the sum of individual forecasts produced by blocks; the blocks are chained together using a doubly residual architecture.
C The analysis of additional existing meta-learning algorithms
Matching networks (Vinyals et al. 2016) are similar to PNs with a few adjustments. In the vanilla matching network architecture, $C_\theta$ is defined, assuming one-hot encoded $Y^{val}_{\mathcal{T}_i}$ and $Y^{tr}_{\mathcal{T}_i}$, as a soft nearest neighbour:

$$\widehat{Y}^{val}_{\mathcal{T}_i} = \sum_{x, y \in \mathcal{D}^{tr}_{\mathcal{T}_i}} \text{softmax}\big(-d(E_w(X^{val}_{\mathcal{T}_i}), E_w(x))\big)\, y.$$

The softmax is normalized w.r.t. $x \in \mathcal{D}^{tr}_{\mathcal{T}_i}$. Predictor parameters, dynamically generated by $\mathcal{I}_t$, include embedding/label pairs: $\theta = \{(E_w(x), y),\ \forall x, y \in \mathcal{D}^{tr}_{\mathcal{T}_i}\}$. In the FCE matching network, validation and training embeddings additionally interact with the task training set via attention LSTMs (Vinyals et al. 2016). To reflect this, the update function $\mathcal{U}_u(\theta, \mathcal{D}^{tr}_{\mathcal{T}_i})$ updates the original embeddings via LSTM equations: $\theta \leftarrow \{(\text{attLSTM}_u[E_w(x), \mathcal{D}^{tr}_{\mathcal{T}_i}], y),\ \forall x, y \in \mathcal{D}^{tr}_{\mathcal{T}_i}\}$. The LSTM parameters are included in $u$. Second, the predictor is augmented with an additional relation module $R_{w_R}$, such that $P_{\theta, w}(\cdot) = C_\theta \circ R_{w_R} \circ E_{w_E}(\cdot)$, with the set of predictor meta-parameters extended accordingly: $w = (w_R, w_E)$. The relation module is again implemented via an LSTM: $R_{w_R}(\cdot) \equiv \text{attLSTM}_{w_R}(\cdot, \mathcal{D}^{tr}_{\mathcal{T}_i})$.

TADAM (Oreshkin, Rodríguez López, and Lacoste 2018) extends PNs by dynamically conditioning the embedding function on the task training data via FiLM layers (Perez et al. 2018). TADAM's predictor has the following form: $P_{\theta, w}(\cdot) = C_{\theta_C} \circ E_{\theta_{\gamma,\beta}, w}(\cdot)$; $\theta = (\theta_{\gamma,\beta}, \theta_C)$. The compare function parameters are as before, $\theta_C = \{p_k\}_{\forall k}$. The embedding function parameters $\theta_{\gamma,\beta}$ include the FiLM layer $\gamma/\beta$ (scale/shift) vectors for each convolutional layer, generated by a separate FC network from the task embedding. The initialization function $\mathcal{I}_t$ sets $\theta_{\gamma,\beta}$ to all zeros, embeds the task training data, and sets the task embedding to the average of class prototypes. The update function $\mathcal{U}_u$, whose meta-parameters include the coefficients of the FC network, $u = w_{FC}$, generates an update to $\theta_{\gamma,\beta}$ from the task embedding. Then it generates an update to the class prototypes $\theta_C$ using $E_{\theta_{\gamma,\beta}, w}(\cdot)$ conditioned with the updated $\theta_{\gamma,\beta}$.

LEO (Rusu et al. 2019) uses a fixed pretrained embedding function. The intermediate low-dimensional latent space $z$ is optimized and is used to generate the predictor's task-adaptive final layer weights $\theta_C$. LEO's predictor, $P_{\theta, w}(\cdot) = C_\theta \circ E(\cdot)$, has final layer and latent space parameters, $\theta = (\theta_C, \theta_z)$, and no meta-parameters, $w = \emptyset$. The initialization function $\mathcal{I}_t$, $t = (w_E, w_R)$, uses a task encoder and a relation network with meta-parameters $w_E$ and $w_R$. It meta-initializes the latent space parameters, $\theta_z$, based on the task training data. The update function $\mathcal{U}_u$, $u = w_D$, uses a decoder with meta-parameters $w_D$ to iteratively decode $\theta_z$ into the final layer weights, $\theta_C$. It optimizes $\theta_z$ by executing gradient descent $\theta_z \leftarrow \theta_z - \alpha \nabla_{\theta_z} \mathcal{L}_{\mathcal{T}_i}(P_\theta(X^{tr}_{\mathcal{T}_i}), Y^{tr}_{\mathcal{T}_i})$ in the inner loop.

D Training setup details
Most of the time, the model trained on a given frequency split of a source dataset is used to forecast the same frequency split on the target dataset. There are a few exceptions to this rule. First, when transferring from M4 to M3, the Others split of M3 is forecasted with the model trained on the Quarterly split of M4. This is because (i) the default horizon length of M4 Quarterly is 8, same as that of M3 Others, and (ii) M4 Others is heterogeneous and contains Weekly, Daily and Hourly data with horizon lengths 13, 14, 48. So M4 Quarterly to M3 Others transfer provided a more natural basis from an implementation standpoint. Second, the transfer from M4 to
ELECTRICITY and
TRAFFIC dataset is done based on amodel trained on M4 Hourly. This is because
ELECTRICITY and
TRAFFIC contain hourly time-series with obvious 24-hour seasonality patterns. It is worth noting that the M4Hourly only contains 414 time-series and we can clearly seepositive zero-shot transfer in Table 1 from the model trainedon this rather small dataset. Third, the transfer from
FRED to ELECTRICITY and
TRAFFIC is done by training the model onthe
FRED
Monthly split, double upsampled using bi-linearinterpolation. This is because
FRED does not have hourlydata. Monthly data naturally provide patterns with seasonalityperiod 12. Upsampling with a factor of two and bi-linearinterpolation provide data with natural seasonality period 24,most often observed in Hourly data, such as
ELECTRICITY and
TRAFFIC . D.1 N-BEATS training setup
We use the same overall training framework, as definedby Oreshkin et al. (2020), including the stratified uniformsampling of TS in the source dataset to train the model. Onemodel is trained per frequency split of a dataset ( e.g.
Yearly,Quarterly, Monthly, Weekly, Daily and Hourly frequenciesin M4 dataset). All reported accuracy results are based on anensemble of 30 models (5 different initializations with 6 dif-ferent lookback periods). One aspect that we found importantin the zero-shot regime, which is different from the originaltraining setup, is the scaling/descaling of the input/output.We scale/descale the architecture input/output by the divid-ing/multiplying all input/output values over the max valueof the input window. We found that this does not affect the Table 2: DeepAR training parameters.
BatchLayers Cells Epochs SizeYearly (M3, M4, Tourism) 3 40 300 32Quarterly (M3, M4, Tourism) 2 20 100 32Monthly (M3, M4, Tourism) 2 40 500 32Others (M3) 2 40 100 32M4 (weekly, daily) 3 20 100 32M4 Hourly 2 20 50 32Electricity (all splits) 2 40 50 64Traffic (2008-01-14) 1 20 5 64Traffic (other splits) 4 40 50 64 accuracy of the model trained and tested on the same datasetin a statistically significant way. In the zero-shot regime, thisoperation prevents catastrophic failure when the target datasetscale (marginal distribution) is significantly different fromthat of the source dataset.
D.2 DeepAR training setup
DeepAR experiments are using the model implementationprovided by GluonTS (Alexandrov et al. 2019) version 1.6.We optimized hyperparameters of DeepAR as the defaultsprovided in GluonTS would often lead to apparently sub-optimal performance on many of the datasets. The train-ing parameters for each dataset are described in Table 2.Weight decay is 0.0, Dropout rate is 0.0 for all experi-ments except Electricity dataset where it is 0.1. The de-fault scaling was replaced by MaxAbs, which improvedand stabilized results. All other parameters are defaultsof gluonts.model.deepar.DeepAREstimator . To reduce vari-ance of performance between experiments we use medianensemble of 30 independent runs. The code for DeepARexperiments can be found at https://github.com/timeseries-zeroshot/deepar_evaluation.
E Dataset Details
E.1 M4 Dataset Details
Table 3 outlines the composition of the M4 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (M4 Team 2018). The M4 dataset is large and diverse: all forecast horizons are composed of heterogeneous TS types (with the exception of Hourly) frequently encountered in business, financial and economic forecasting. Summary statistics on series lengths are also listed, showing wide variability therein, as well as a characterization (smooth vs erratic) that follows Syntetos, Boylan, and Croston (2005) and is based on the squared coefficient of variation of the series. All series have positive observed values at all time steps; as such, none can be considered intermittent or lumpy per Syntetos, Boylan, and Croston (2005).

Table 3: Composition of the M4 dataset: the number of TS based on their sampling frequency and type.

Type          Yearly/6  Qtly/8  Monthly/18  Wkly/13  Daily/14  Hrly/48    Total
Demographic      1,088   1,858       5,728       24        10        0    8,708
Finance          6,519   5,305      10,987      164     1,559        0   24,534
Industry         3,716   4,637      10,017        6       422        0   18,798
Macro            3,903   5,315      10,016       41       127        0   19,402
Micro            6,538   6,020      10,975      112     1,476        0   25,121
Other            1,236     865         277       12       633      414    3,437
Total           23,000  24,000      48,000      359     4,227      414  100,000
Min. Length         19      24          60       93       107      748
Max. Length        841     874        2812     2610      9933     1008
Mean Length       37.3   100.2       234.3   1035.0    2371.4    901.9
SD Length         24.5    51.1       137.4    707.1    1756.6    127.9
% Smooth           82%     89%         94%      84%       98%      83%
% Erratic          18%     11%          6%      16%        2%      17%

E.2 FRED Dataset Details
FRED is a large-scale dataset introduced in this paper, containing around 290k US and international economic TS from 89 sources, a subset of Federal Reserve economic data (Federal Reserve 2019).
FRED is downloaded using a custom download script based on the high-level FRED python API (Velkoski 2016), a python wrapper over the low-level web-based FRED API. For each point in a time series, the raw data published at the time of first release are downloaded. All time series with any NaN entries have been filtered out. We focus our attention on Yearly, Quarterly, Monthly, Weekly and Daily frequency data. Other frequencies are available, for example bi-weekly and five-yearly, but they are skipped because they are only present in small quantities. These factors explain why the size of the dataset we assembled for this study is 290k, while 672k time series are in principle available (Federal Reserve 2019). Hourly data are not available in this dataset. For the data frequencies included in the FRED dataset, we use the same forecasting horizons as for the M4 dataset: Yearly: 6, Quarterly: 8, Monthly: 18, Weekly: 13 and Daily: 14. The dataset download takes approximately 7–10 days because of the bandwidth constraints imposed by the low-level FRED API. The test, validation and train subsets are defined in the usual way: the test set is derived by splitting the full FRED dataset at the left boundary of the last horizon of each time series, and the validation set is similarly derived from the penultimate horizon of each time series. A minimal sketch of this split is given below.
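A minimal sketch of this per-series split, assuming each series is a 1-D array ordered in time (function and variable names are ours):

```python
import numpy as np

def split_series(values: np.ndarray, horizon: int):
    """Derive validation and test pairs from one series, as described above.

    The test target is the last horizon; the validation target is the
    penultimate horizon, with the corresponding histories ending just before.
    """
    test_history, test_target = values[:-horizon], values[-horizon:]
    val_history, val_target = values[:-2 * horizon], values[-2 * horizon:-horizon]
    return (val_history, val_target), (test_history, test_target)

# Example: a monthly series with the 18-step horizon used for FRED Monthly.
series = np.arange(120, dtype=float)
(val_x, val_y), (test_x, test_y) = split_series(series, horizon=18)
```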
E.3 M3 Dataset Details
Table 4 outlines the composition of the M3 dataset across domains and forecast horizons by listing the number of TS based on their frequency and type (Makridakis and Hibon 2000). The M3 dataset is smaller than M4, but it is still large and diverse: all forecast horizons are composed of heterogeneous TS types frequently encountered in business, financial and economic forecasting. Over the past 20 years, this dataset has supported significant efforts in the design of advanced statistical models, e.g. Theta and its variants (Assimakopoulos and Nikolopoulos 2000; Fiorucci et al. 2016; Spiliotis, Assimakopoulos, and Nikolopoulos 2019). Summary statistics on series lengths are also listed, showing wide variability in length, as well as a characterization (smooth vs erratic) that follows Syntetos, Boylan, and Croston (2005) and is based on the squared coefficient of variation of the series. All series have positive observed values at all time steps; as such, none can be considered intermittent or lumpy per Syntetos, Boylan, and Croston (2005).

Table 4: Composition of the M3 dataset: the number of TS based on their sampling frequency and type.

Type          Yearly/6  Quarterly/8  Monthly/18  Other/8  Total
Demographic        245           57         111        0    413
Finance             58           76         145       29    308
Industry           102           83         334        0    519
Macro               83          336         312        0    731
Micro              146          204         474        4    828
Other               11            0          52      141    204
Total              645          756       1,428      174  3,003
Min. Length         20           24          66       71
Max. Length         47           72         144      104
Mean Length       28.4         48.9       117.3     76.6
SD Length          9.9         10.6        28.5     10.9
% Smooth           90%          99%         98%     100%
% Erratic          10%           1%          2%       0%

E.4 TOURISM Dataset Details
Table 5 outlines the composition of the TOURISM dataset across forecast horizons by listing the number of TS based on their frequency. Summary statistics on series lengths are listed, showing wide variability in length. All series have positive observed values at all time steps. In contrast to the M4 and M3 datasets, TOURISM includes a much higher fraction of erratic series.

Table 5: Composition of the TOURISM dataset: the number of TS based on their sampling frequency.

               Yearly/4  Quarterly/8  Monthly/24  Total
Number of TS        518          427         366  1,311
Min. Length          11           30          91
Max. Length          47          130         333
Mean Length        24.4         99.6         298
SD Length           5.5         20.3        55.7
% Smooth            77%          61%         49%
% Erratic           23%          39%         51%
E.5 ELECTRICITY and TRAFFIC Dataset Details
The ELECTRICITY and TRAFFIC datasets (Dua and Graff 2017; Yu, Rao, and Dhillon 2016) are both part of the UCI repository. ELECTRICITY (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) represents the hourly electricity usage monitoring of 370 customers over three years. The TRAFFIC dataset (https://archive.ics.uci.edu/ml/datasets/PEMS-SF) tracks the hourly occupancy rates, scaled into the (0,1) range, of 963 lanes in the San Francisco Bay Area freeways over a period of slightly more than a year. Both datasets exhibit strong hourly and daily seasonality patterns.

Both datasets are aggregated to hourly data, but using different aggregation operations: sum for ELECTRICITY and mean for TRAFFIC. The hourly aggregation is done so that all the points available in the interval (h−1:00, h:00] are aggregated to hour h; thus, if the original dataset starts on 2011-01-01 00:15, then the first time point after aggregation will be 2011-01-01 01:00. A minimal sketch of this aggregation is given below.
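A minimal pandas sketch of the hourly aggregation described above; reading the (h−1:00, h:00] convention as a right-closed, right-labeled resampling window is our own interpretation:

```python
import pandas as pd

def aggregate_hourly(raw: pd.Series, how: str) -> pd.Series:
    """Aggregate a sub-hourly series to hourly values.

    Points in (h-1:00, h:00] are assigned to hour h; `how` is "sum" for
    ELECTRICITY and "mean" for TRAFFIC.
    """
    resampled = raw.resample("1H", closed="right", label="right")
    return resampled.sum() if how == "sum" else resampled.mean()

# Toy check: a series starting at 2011-01-01 00:15 with 15-minute steps.
index = pd.date_range("2011-01-01 00:15", periods=8, freq="15min")
raw = pd.Series(range(8), index=index)
hourly = aggregate_hourly(raw, how="sum")
print(hourly.index[0])  # 2011-01-01 01:00:00
```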
For the ELECTRICITY dataset we removed the first year from the training set, to match the training set used in (Yu, Rao, and Dhillon 2016), based on the aggregated dataset downloaded from what is presumably the authors' Github repository (https://github.com/rofuyu/exp-trmf-nips16/blob/master/python/exp-scripts/datasets/download-data.sh). We also made sure that the data points for both the ELECTRICITY and TRAFFIC datasets after aggregation match those used in (Yu, Rao, and Dhillon 2016). The authors of the MatFact model used the last 7 days of each dataset as the test set, but the Amazon DeepAR (Flunkert, Salinas, and Gasthaus 2017), Deep State (Rangapuram et al. 2018) and Deep Factors (Wang et al. 2019) papers use different splits, where the split points are given by a date. Changing split points without a well-grounded reason adds uncertainty to the comparability of model performances and creates challenges for the reproducibility of results, so we tried to match all the different splits in our experiments. This was especially challenging on the TRAFFIC dataset, where we had to use some heuristics to recover record dates; the dataset authors state: "The measurements cover the period from Jan. 1st 2008 to Mar. 30th 2009" and "We remove public holidays from the dataset, as well as two days with anomalies (March 8th 2009 and March 9th 2008) where all sensors were muted between 2:00 and 3:00 AM." In spite of this, we failed to match a part of the provided week-day labels to actual dates. Therefore, we had to assume that the actual list of gaps, which includes holidays and anomalous days, is as follows:
1. Jan. 1, 2008 (New Year's Day)
2. Jan. 21, 2008 (Martin Luther King Jr. Day)
3. Feb. 18, 2008 (Washington's Birthday)
4. Mar. 9, 2008 (Anomaly day)
5. May 26, 2008 (Memorial Day)
6. Jul. 4, 2008 (Independence Day)
7. Sep. 1, 2008 (Labor Day)
8. Oct. 13, 2008 (Columbus Day)
9. Nov. 11, 2008 (Veterans Day)
10. Nov. 27, 2008 (Thanksgiving)
11. Dec. 25, 2008 (Christmas Day)
12. Jan. 1, 2009 (New Year's Day)
13. Jan. 19, 2009 (Martin Luther King Jr. Day)
14. Feb. 16, 2009 (Washington's Birthday)
15. Mar. 8, 2009 (Anomaly day)
The first six gaps were confirmed by the gaps in the labels, but the rest were more than one day apart from any public holiday of the years 2008 and 2009 in San Francisco, California and the US. Moreover, the number of gaps we found in the labels provided by the dataset authors is 10, while the number of days between Jan. 1st 2008 and Mar. 30th 2009 is 455; assuming that Jan. 1st 2008 was skipped from the values and labels, we should end up with either 454 − 10 = 444 days instead of 440, or a different end date.
The metric used to evaluate performance on these datasets is ND (Yu, Rao, and Dhillon 2016), which is equal to the p50 loss used in the DeepAR, Deep State, and Deep Factors papers.
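For reference, a sketch of the ND (normalized deviation) metric in its conventional pooled form, i.e. the sum of absolute errors normalized by the sum of absolute target values; the exact aggregation over series follows (Yu, Rao, and Dhillon 2016):

```python
import numpy as np

def nd(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Normalized deviation: sum |y - y_hat| / sum |y| over all series and time steps."""
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))
```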
E.6 Overlaps Between Datasets
Some of the datasets used in the experiments consist of time series from different domains. Thus, it would be reasonable to suggest that the target dataset, used for transfer-learning performance evaluation, could contain time series from the source dataset. To validate that model performance is not affected by the fact that these datasets may share parts of time series, we performed a sequence-to-sequence comparison between training and testing sets. The searched sequence is constructed from the last horizon of the input provided to the model during test and the test part of the target dataset, forming chunks of two horizons in length. The searched sequence is then compared to every sequence of the source dataset. This method allows us to spot cases where the last part of the input together with the output exactly matches the last two horizons of a time series from the target dataset used for performance evaluation (a minimal sketch is given below). We found that the only datasets which have common sequences are M4 and FRED: 3 in Yearly, 34 in Quarterly and 195 in Monthly. Taking into account the total number of time series in these datasets, the effect of this overlap can be considered insignificant.
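A minimal sketch of this exact-match search, assuming each dataset is a list of 1-D arrays and using a brute-force sliding comparison (function names are ours):

```python
import numpy as np

def has_overlap(source: np.ndarray, chunk: np.ndarray) -> bool:
    """Return True if `chunk` appears verbatim anywhere inside `source`."""
    for start in range(len(source) - len(chunk) + 1):
        if np.array_equal(source[start:start + len(chunk)], chunk):
            return True
    return False

def count_overlapping_series(source_set, target_set, horizon: int) -> int:
    """Count target series whose last two horizons match a sub-sequence of some source series."""
    count = 0
    for target in target_set:
        chunk = np.asarray(target[-2 * horizon:])  # last input horizon + test horizon
        if any(has_overlap(np.asarray(src), chunk) for src in source_set):
            count += 1
    return count
```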
F Empirical Results Details
On all datasets, we consider the original N-BEATS (Oreshkin et al. 2020), i.e. the model trained on a given dataset and applied to this same dataset. This is provided for the purpose of assessing the generalization gap of the zero-shot N-BEATS. We consider four variants of zero-shot N-BEATS: NB-SH-M4, NB-NSH-M4, NB-SH-FR, NB-NSH-FR. The -SH/-NSH option signifies block weight sharing ON/OFF, and the -M4/-FR option signifies the M4/FRED source dataset. The mapping between seasonal patterns of target and source datasets is presented in Table 6. The model architecture and training procedure do not depend on the source dataset, i.e. we used the same parameters to train models from M4 and FRED. The parameter values can be found in Table 7. The results are calculated based on ensembles of 90 models: 6 lookback horizons, 3 loss functions, and 5 repeats. Models were trained using the training parts of the source datasets.
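The 90 ensemble members differ only in lookback horizon, loss function and random initialization. Below is a minimal sketch of combining member forecasts, assuming a median aggregation like the one used for the DeepAR runs in Appendix D.2; the exact N-BEATS ensembling operator follows Oreshkin et al. (2020):

```python
import numpy as np

def ensemble_forecast(member_forecasts) -> np.ndarray:
    """Combine forecasts from individual ensemble members.

    `member_forecasts` is a list of arrays of shape (horizon,), one per model
    (6 lookback horizons x 3 losses x 5 repeats = 90 members here).
    """
    return np.median(np.stack(member_forecasts, axis=0), axis=0)
```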
F.1 Detailed M4 Results
On M4 we compare against five M4 competition entries, each representative of a broad model class. Best pure ML is the submission by B. Trotta, the best entry among the 6 pure ML models. Best statistical is the best pure statistical model by N.Z. Legaki and K. Koutsouri. ProLogistica, the third best M4 participant, is a weighted ensemble of statistical methods. Best ML/TS combination is the model by (Montero-Manso et al. 2020), the second best entry, a gradient boosted tree over a few statistical time series models. Finally, DL/TS hybrid is the winner of the M4 competition (Smyl 2020). Results are presented in Table 8.
F.2 Detailed FRED Results
We compare against well-established off-the-shelf statistical models available from the R forecast package (Hyndman and Khandakar 2008). Those include Naïve (repeating the last value), ARIMA, Theta, SES and ETS. The quality metric is the regular sMAPE defined in (1).

Table 6: Mapping of seasonal patterns between source and target datasets. † The Monthly dataset was linearly interpolated to match the hourly period.

Target                 Source: M4    Source: FRED
FRED Yearly            Yearly        –
FRED Quarterly         Quarterly     –
FRED Monthly           Monthly       –
FRED Weekly            Weekly        –
FRED Daily             Daily         –
M4 Yearly              –             Yearly
M4 Quarterly           –             Quarterly
M4 Monthly             –             Monthly
M4 Weekly              –             Monthly
M4 Daily               –             Monthly
M4 Hourly              –             Monthly†
M3 Yearly              Yearly        Yearly
M3 Quarterly           Quarterly     Quarterly
M3 Monthly             Monthly       Monthly
M3 Others              Quarterly     Quarterly
TOURISM Yearly         Yearly        Yearly
TOURISM Quarterly      Quarterly     Quarterly
TOURISM Monthly        Monthly       Monthly
ELECTRICITY Hourly     Hourly        Monthly†
TRAFFIC Hourly         Hourly        Monthly†

F.3 Detailed M3 Results
We used the original M3 sMAPE metric to be able to compare against the results published in the literature. The sMAPE used for M3 is different from the metric defined in (1) in that it does not take the absolute values in the denominator:

sMAPE = \frac{200}{H} \sum_{i=1}^{H} \frac{| y_{T+i} - \hat{y}_{T+i} |}{y_{T+i} + \hat{y}_{T+i}}.    (11)

The detailed zero-shot transfer results on M3 from FRED and M4 are presented in Table 10. On the M3 dataset (Makridakis and Hibon 2000), we compare against the Theta method (Assimakopoulos and Nikolopoulos 2000), the winner of M3; DOTA, a dynamically optimized Theta model (Fiorucci et al. 2016); EXP, the most recent statistical approach and the previous state-of-the-art on M3 (Spiliotis, Assimakopoulos, and Nikolopoulos 2019); as well as ForecastPro, an off-the-shelf forecasting software that is based on model selection between exponential smoothing, ARIMA and moving average (Athanasopoulos et al. 2011; Assimakopoulos and Nikolopoulos 2000). We also include the DeepAR model trained on M3, denoted 'DeepAR', as well as DeepAR trained on M4 and tested in zero-shot transfer mode on M3, denoted 'DeepAR-M4'. Please see (Makridakis and Hibon 2000) for the details of other models.
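A small numerical sketch of the M3 sMAPE in (11), assuming the conventional factor of 200 so that the metric is reported in percent:

```python
import numpy as np

def smape_m3(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """M3-style sMAPE in percent: no absolute values in the denominator, cf. (11)."""
    return float(200.0 / len(y_true) * np.sum(np.abs(y_true - y_pred) / (y_true + y_pred)))

# Toy example with the yearly horizon H = 6.
y = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 14.0])
y_hat = np.array([9.5, 12.5, 10.0, 13.5, 12.0, 15.0])
print(round(smape_m3(y, y_hat), 2))
```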
Table 7: Model parameters.

Source Datasets       M4, FRED
Loss Functions        MASE, MAPE, sMAPE
Number of Blocks      30
Layers in Block       4
Layer Size            512
Iterations            15 000
Lookback Horizons     2, 3, 4, 5, 6, 7
History size          10 horizons
Learning rate         10−
Batch size            1024
Repeats               5
F.4 Detailed TOURISM Results
On the TOURISM dataset (Athanasopoulos et al. 2011), we compare against three statistical benchmarks: ETS, exponential smoothing with a cross-validated additive/multiplicative model; the Theta method; and ForePro, the same as ForecastPro in M3; as well as the top 2 entries from the TOURISM Kaggle competition (Athanasopoulos and Hyndman 2011): Stratometrics, an unknown technique; and LeeCBaker (Baker and Howard 2011), a weighted combination of Naïve, a linear trend model, and an exponentially weighted least squares regression trend. We also include the DeepAR model trained on TOURISM, denoted 'DeepAR', as well as DeepAR trained on M4 and tested in zero-shot transfer mode on TOURISM, denoted 'DeepAR-M4'. Please see (Athanasopoulos et al. 2011) for the details of other models.
F.5 Detailed ELECTRICITY Results

On ELECTRICITY, we compare against MatFact (Yu, Rao, and Dhillon 2016), DeepAR (Flunkert, Salinas, and Gasthaus 2017), Deep State (Rangapuram et al. 2018) and Deep Factors (Wang et al. 2019). We use the ND metric that was used in those papers. The results are presented in Table 12. We present our results on 3 different splits, as explained in Appendix E.5.

F.6 Detailed TRAFFIC Results

On TRAFFIC, we compare against MatFact (Yu, Rao, and Dhillon 2016), DeepAR (Flunkert, Salinas, and Gasthaus 2017), Deep State (Rangapuram et al. 2018) and Deep Factors (Wang et al. 2019). We use the ND metric that was used in those papers. The results are presented in Table 13. We present our results on 3 different splits, as explained in Appendix E.5.

G The Details of the Study of Meta-learning Effects
Figures 3 and 4 detail the performance across a number of datasets as the number of N-BEATS blocks is varied. Illustrated on the plots are the effects of having the same parameters shared across all blocks (blue curves) or having individual parameters per block (red curves).

Table 8: Performance on the M4 test set, sMAPE. Lower values are better. ∗ DeepAR trained by us using GluonTS.

                              Yearly   Quarterly  Monthly  Others  Average
                              (23k)    (24k)      (48k)    (5k)    (100k)
Best pure ML                  14.397   11.031     13.973   4.566   12.894
Best statistical              13.366   10.155     13.002   4.682   11.986
ProLogistica                  13.943    9.796     12.747   3.365   11.845
Best ML/TS combination        13.528    9.733     12.639   4.118   11.720
DL/TS hybrid, M4 winner       13.176    9.679     12.126   4.014   11.374
DeepAR∗
FRED test set, s
MAPE . Lower values are better.
Yearly Quarterly Monthly Weekly Daily Average(133,554) (57,569) (99,558) (1,348) (17) (292,046)Theta 16.50 14.24 5.35 6.29 10.57 12.20ARIMA 16.21 14.25 5.58 5.51 9.88 12.15SES 16.61 14.58 6.45 5.38 7.75 12.70ETS 16.46 19.34 8.18 5.44 8.07 14.52Naïve 16.59 14.86 6.59 5.41 8.65 12.79N-BEATS 15.79 13.27 4.79 4.63 8.86 11.49NB-SH-M4 15.00 13.36 6.10 5.67 8.57 11.60NB-NSH-M4 15.06 13.48 6.24 5.71 9.21 11.70
Table 10: M3 sMAPE defined in (11). † Numbers from Appendix C.2, Detailed results: M3 Dataset, of (Oreshkin et al. 2020). ∗ DeepAR trained by us using GluonTS.

                                                          Yearly  Quarterly  Monthly  Others  Average
                                                          (645)   (756)      (1428)   (174)   (3003)
Naïve2                                                    17.88    9.95      16.91    6.30    15.47
ARIMA (B–J automatic)                                     17.73   10.26      14.81    5.06    14.01
Comb S-H-D                                                17.07    9.22      14.48    4.56    13.52
ForecastPro                                               17.14    9.77      13.86    4.60    13.19
Theta                                                     16.90    8.96      13.85    4.41    13.01
DOTM (Fiorucci et al. 2016)                               15.94    9.28      13.74    4.58    12.90
EXP (Spiliotis, Assimakopoulos, and Nikolopoulos 2019)    16.39    8.98      13.43    5.46    12.†
LGT (Smyl and Kuber 2016)                                 15.23    n/a        n/a     4.26     n/a
BaggedETS.BC (Bergmeir, Hyndman, and Benítez 2016)        17.49    9.89      13.74    n/a      n/a
DeepAR∗
DeepAR-M4∗

Table 11: TOURISM, MAPE. ∗ DeepAR trained by us using GluonTS.
                        Yearly  Quarterly  Monthly  Average
                        (518)   (427)      (366)    (1311)
Statistical benchmarks
SNaïve                  23.61   16.46      22.56    21.25
Theta                   23.45   16.15      22.11    20.88
ForePro                 26.36   15.72      19.91    19.84
ETS                     27.68   16.05      21.15    20.88
Damped                  28.15   15.56      23.47    22.26
ARIMA                   28.03   16.23      21.13    20.96
Kaggle competitors
SaliMali                 n/a    14.83      19.64     n/a
LeeCBaker               22.73   15.14      20.19    19.35
Stratometrics           23.15   15.14      20.37    19.52
Robert                   n/a    14.96      20.28     n/a
Idalgo                   n/a    15.07      20.55     n/a
DeepAR∗
DeepAR-M4∗
Table 12: ELECTRICITY, ND. † Numbers reported by Flunkert, Salinas, and Gasthaus (2017), different from the originally reported MatFact results, most probably due to a changed split point. ∗ DeepAR trained by us using GluonTS.

MatFact†        n/a     0.255
DeepAR          0.070   0.272   n/a
Deep State      0.083   n/a     n/a
Deep Factors    n/a     0.112   n/a
Theta           0.079   0.080   0.191
ARIMA           0.067   0.068   0.225
ETS             0.083   0.075   0.190
SES             0.372   0.320   0.365
DeepAR∗

Table 13: TRAFFIC, ND. † Numbers reported by Flunkert, Salinas, and Gasthaus (2017), different from the originally reported MatFact results, most probably due to a changed split point. ∗ DeepAR trained by us using GluonTS.

MatFact†        n/a     0.187
DeepAR          0.170   0.296   n/a
Deep State      0.167   n/a     n/a
Deep Factors    n/a     0.225   n/a
Theta           0.178   0.841   0.170
ARIMA           0.145   0.500   0.153
ETS             0.701   1.330   0.720
SES             0.634   1.110   0.637
DeepAR∗
[Figure 3 plots: panels (a) M3 (sMAPE), (b) Tourism (MAPE), (c) Electricity (ND), (d) Traffic (ND); x-axis: Number of Blocks (20 to 100); legend: Shared Weights True/False.]
Figure 3: Evolution of performance metrics as a function of the number of N-BEATS blocks. Each plot combines metrics for both architectures, with shared weights (blue line) and distinct weights (red line), respectively for M3, Tourism, Electricity, and Traffic. Each target dataset has its own performance metric, matching those in its respective literature. The results are based on an ensemble of 30 models (5 different initializations with 6 different lookback periods); the mean and confidence interval (one standard deviation) are calculated based on the performance of 30 different ensembles.
[Figure 4 plots: panels (a) M3, (b) Tourism, (c) Electricity, (d) Traffic, all with sMAPE on the y-axis; x-axis: Number of Blocks (20 to 100); legend: Shared Weights True/False.]