MQTransformer: Multi-Horizon Forecasts with Context Dependent and Feedback-Aware Attention
Carson Eisenach∗, Yagna Patel†, Dhruv Madeka∗
∗ Demand Forecasting, Amazon. Correspondence to: [email protected].
† Work done while at Amazon.
December 4, 2020
Abstract
Recent advances in neural forecasting have produced major improvements in accuracy for probabilistic demand prediction. In this work, we propose novel improvements to the current state of the art by incorporating changes inspired by recent advances in Transformer architectures for Natural Language Processing. We develop a novel decoder-encoder attention for context-alignment, improving forecasting accuracy by allowing the network to study its own history based on the context for which it is producing a forecast. We also present a novel positional encoding that allows the neural network to learn context-dependent seasonality functions as well as arbitrary holiday distances. Finally, we show that the current state-of-the-art MQ-Forecaster (Wen et al., 2017) models display excess variability by failing to leverage previous errors in the forecast to improve accuracy. We propose a novel decoder self-attention scheme for forecasting that produces significant improvements in the excess variation of the forecast.
1 Introduction

Time series forecasting is a fundamental problem in machine learning with relevance to many application domains including supply chain management, finance, healthcare analytics, and more. Modern forecasting applications require predictions of many correlated time series over multiple horizons. In multi-horizon forecasting, the learning objective is to produce forecasts for multiple future horizons at each time step. Beyond simple point estimation, decision making problems require a measure of uncertainty about the forecasted quantity. Access to the full distribution is usually unnecessary, and several quantiles are sufficient (many problems in Operations Research use the 50th and 90th percentiles, for example).

As a motivating example, consider a large e-commerce retailer with a system to produce forecasts of the demand distribution for a set of products at a target time T. Using these forecasts as an input, the retailer can then optimize buying and placement decisions to maximize revenue and/or customer value. Accurate forecasts are important, but – perhaps less obviously – forecasts that don't exhibit excess volatility as a target date approaches minimize costly bull-whip effects in a supply chain (Chen et al., 2000; Bray and Mendelson, 2012).

Recent work applying deep learning to time-series forecasting focuses primarily on the use of recurrent and convolutional architectures (Nascimento et al., 2019; Yu et al., 2017; Gasparin et al., 2019; Mukhoty et al., 2019; Wen et al., 2017); for a complete overview see Benidis et al. (2020). These are Seq2Seq architectures (Sutskever et al., 2014), which consist of an encoder that takes an input sequence and summarizes it into a fixed-length context vector, and a decoder that produces an output sequence. It is well known that Seq2Seq models suffer from an information bottleneck by transmitting information from encoder to decoder via a single hidden state. To address this, Bahdanau et al. (2014) introduced a method called attention, allowing the decoder to take as input a weighted combination of relevant latent encoder states at each output time step, rather than using a single context to produce all decoder outputs. While NLP is the predominant application of attention architectures, in this paper we show how novel attention modules and positional embeddings can be used to introduce proper inductive biases for probabilistic time-series forecasting.

Even with these shortcomings, this line of work has led to major advances in forecast accuracy for complex problems, and real-world forecasting systems increasingly rely on neural nets. Accordingly, a need for black-box forecasting system diagnostics has arisen. Stine and Foster (2020b,a) use probabilistic martingales to study the dynamics of forecasts produced by an arbitrary forecasting system. These diagnostics can be used to detect the degree to which forecasts adhere to the martingale model of forecast evolution (Heath and Jackson, 1994) and to detect unnecessary volatility (above and beyond any inherent uncertainty) in the forecasts produced.
Thus, Stine and Foster (2020b,a) describe a way to connect the excess variation of a forecast to accuracy misses against the realized target. While multi-horizon forecasting networks such as Wen et al. (2017) and Madeka et al. (2018) minimize quantile loss, the network architectures do not explicitly handle excess variation, since forecasts on any particular date are not made aware of errors in the forecasts for previous dates. In short, such tools can be used to detect flaws in forecasts, but the question of how to incorporate that information into model design is unexplored.

Our Contributions
In this paper, we are concerned with both improving forecast accuracy and reducing excess forecast volatility. We present a set of novel architectures that seek to remedy some of the inductive biases that are currently missing in state-of-the-art MQ-Forecasters (Wen et al., 2017). The major contributions of this paper are:
1. Positional Encoding from Event Indicators: Current MQ-Forecasters use explicitly engineered holiday "distances" to provide the model with information about the seasonality of the time series. We introduce a novel positional encoding mechanism that allows the network to learn a seasonality function depending on other information of the time series being forecasted, and demonstrate that it is a strict generalization of conventional position encoding schemes.
2. Horizon-Specific Decoder-Encoder Attention: Wen et al. (2017), Madeka et al. (2018) and other MQ-Forecasters learn a single encoder representation for all future dates and periods being forecasted. We introduce a horizon-specific decoder-encoder attention that aligns the encoded history with each horizon being forecast, producing horizon-specific contexts.
3. Decoder Self-Attention for Forecast Evolution: To the best of our knowledge, this is the first work to consider the impacts of network architecture design on forecast evolution. Importantly, we accomplish this by using attention mechanisms to introduce the right inductive biases, and not by explicitly penalizing a measure of forecast variability. This allows us to maintain a single objective function without needing to make trade-offs between accuracy and volatility.

By providing MQ-Forecasters with the structure necessary to learn context-dependent encodings, we observe major increases in accuracy (3.9% in overall P90 quantile loss throughout the year, and up to 60% during peak periods) on our demand forecasting application, along with a significant reduction in excess volatility (a 52% reduction at P50 and 30% at P90). We also apply MQTransformer to four public datasets, and show parity with the state of the art on simple, univariate tasks. On a substantially more complex public dataset (retail forecasting) we demonstrate a
38% improvement over the previously reported state of the art, and a 5% improvement in P50 QL and an 11% improvement in P90 QL versus our baseline. Because our innovations are compatible with efficient training schemes, our architecture also achieves a significant speedup (several orders of magnitude greater throughput) over earlier transformer models for time-series forecasting.
2 Background

Formally, the task considered in our work is the high-dimensional regression problem

p(y_{t+1,i}, …, y_{t+H,i} | y_{:t,i}, x^{(h)}_{:t,i}, x^{(f)}_{t:,i}, x^{(s)}_i),   (1)

where y_{t+s,i}, y_{:t,i}, x^{(h)}_{:t,i}, x^{(f)}_{t:,i}, and x^{(s)}_i denote future observations of the target time series i, observations of the target time series observed up until time t, the past covariates, known future information, and static covariates, respectively.

For sequence modeling problems, Seq2Seq (Sutskever et al., 2014) is the canonical deep learning framework; although originally applied to neural machine translation (NMT) tasks, it has since been adapted to time series forecasting (Nascimento et al., 2019; Yu et al., 2017; Gasparin et al., 2019; Mukhoty et al., 2019; Wen et al., 2017; Salinas et al., 2020; Wen and Torkkola, 2019). The MQ-Forecaster framework (Wen et al., 2017) solves (1) by treating each series i as a sample from a joint stochastic process and feeding it into a neural network which predicts Q quantiles for each horizon. These models, however, inherit from the Seq2Seq architecture the limited contextual information available to the decoder as it produces each estimate ŷ^{(q)}_{t+s,i}, the q-th quantile of the distribution of the target at time t+s, y_{t+s,i}. Seq2Seq models rely on a single encoded context to produce forecasts for all horizons, imposing an information bottleneck and making it difficult for the model to understand long-term dependencies.

Our MQTransformer architecture, like other MQ-Forecasters, uses the direct strategy: the model outputs the quantiles of interest directly, rather than the parameters of a distribution from which samples are to be generated. This has been shown (Wen et al., 2017) to outperform parametric models, like DeepAR (Salinas et al., 2020), on a wide variety of tasks. Recently, Lim et al. (2019) consider an application of attention to multi-horizon forecasting, but their method still produces a single context for all horizons. Furthermore, by using an RNN decoder their models do not enjoy the same scaling properties as MQ-Forecaster models. To the best of our knowledge, our work is the first to devise attention mechanisms for this problem that readily scale.

Attention Mechanisms

Bahdanau et al. (2014) introduced the concept of an attention mechanism to solve the information bottleneck and sequence alignment problems in Seq2Seq architectures for NMT. Recently, attention has enjoyed success across a diverse range of applications including natural language processing (NLP), computer vision (CV) and time-series forecasting tasks (Galassi et al., 2019; Xu et al., 2015; Shih et al., 2019; Kim and Kang, 2019; Cinar et al., 2017; Li et al., 2019; Lim et al., 2019). Many variants have been proposed, including self-attention and dot-product attention (Luong et al., 2015; Cheng et al., 2016; Vaswani et al., 2017; Devlin et al., 2019), and transformer architectures (end-to-end attention with no recurrent layers) achieve state-of-the-art performance on most NLP tasks.

Time series forecasting applications exhibit seasonal trends, and the absolute position encodings commonly used in the literature cannot be applied. Our work differs from previous work on relative position encodings (Dai et al., 2019; Huang et al., 2018; Shaw et al., 2018) in that we learn a representation from a time series of indicator variables which encode events relevant to the target application (such as holidays and promotions).
If event indicators relevant to the application are provided, then this imposes a strong inductive bias that will allow the model to generalize well to future observations. Existing encoding schemes either involve feature engineering (e.g. sinusoidal encodings) or have a maximum input sequence length; ours requires no feature engineering – the model learns the encoding directly from raw data – and it extends to arbitrarily long sequences.

In the vanilla transformer (Vaswani et al., 2017), a sinusoidal position embedding is added to the network input and each encoder layer consists of a multi-headed attention block followed by a feed-forward sub-layer. For each head h, the attention score between query q_s and key k_t is defined as follows for the input layer:

A^h_{s,t} = (x_s + r_s)^⊤ W^{h,⊤}_q W^h_k (x_t + r_t),   (2)

where x_s, r_s are the observation of the time series and the position encoding, respectively, at time s. Section 3 introduces attention mechanisms that differ in their treatment of the position-dependent biases. See Appendix A for additional discussion of attention mechanisms.

The Martingale Model of Forecast Evolution

Originally, the martingale model of forecast evolution (MMFE) was conceived as a way to simulate demand forecasts used in inventory planning problems (Heath and Jackson, 1994). Denoting by Ŷ_{T|t} the forecast of Y_T made at time t ≤ T, the MMFE assumes that the forecast process {Ŷ_{T|t}}_t is a martingale. Informally, a martingale captures the notion that a forecast should use all information available to the forecasting system at time t. Mathematically, a discrete-time martingale is a stochastic process {X_t} such that E[X_{t+1} | X_t, …, X_1] = X_t. We assume a working knowledge of martingales and direct the reader to Williams (1991) for a thorough coverage in discrete time.

Taleb (2018) describes how martingale forecasts correspond to rational updating, an idea expanded by Augenblick and Rabin (2019). Taleb (2018), Taleb and Madeka (2019) and Augenblick and Rabin (2019) go on to develop tests for forecasts that rule out martingality and indicate irrational or predictable updating for binary bets. Stine and Foster (2020a,b) further extend these ideas to quantile forecasts. Specifically, they consider the coverage probability process p_t := P[Y_T ≤ τ | Y_s, s ≤ t] = E[I(Y_T ≤ τ) | Y_s, s ≤ t], where τ denotes the forecast announced in the first period t = 0. Because {p_t} is also a martingale, the authors show that

E[(p_T − π)^2] = Σ_{t=1}^T E[(p_t − p_{t−1})^2] = π(1 − π),

where π = p_0 is the expected value of p_T, a Bernoulli random variable, across realizations of the coverage process. In the context of quantile forecasting, π is simply the quantile forecasted. The measure of excess volatility proposed is the quadratic variation process associated with {p_t}, Q_s := Σ_{t=1}^s (p_t − p_{t−1})^2. While this process is not a martingale, we do know that under the MMFE assumption, E[Q_T] = π(1 − π).

A second quantity of interest is the martingale V_t := Q_t − (p_t − π)^2, which follows the typical structure of subtracting the compensator to turn a sub-martingale into a martingale. In Section 4 we leverage the properties of {V_t} and {Q_t} to compare the dynamics of forecasts produced by a variety of models, demonstrating that our feedback-aware decoder self-attention units reduce excess forecast volatility. A minimal sketch of these diagnostic quantities follows.
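To make the diagnostics concrete, here is a minimal sketch of computing {Q_t} and {V_t}; it assumes the coverage-probability trajectories have already been estimated, and the function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def martingale_diagnostics(p, pi):
    """Compute the quadratic-variation and compensated processes from
    Section 2, given estimated coverage-probability trajectories.

    p  : array of shape (num_series, T+1); p[n, t] estimates
         P[Y_T <= tau | information up to t] for trajectory n.
    pi : the quantile being forecast (e.g. 0.9), so that under the MMFE
         E[Q_T] = pi * (1 - pi).
    """
    increments = np.diff(p, axis=1)            # p_t - p_{t-1}
    q = np.cumsum(increments ** 2, axis=1)     # Q_t, quadratic variation
    v = q - (p[:, 1:] - pi) ** 2               # V_t = Q_t - (p_t - pi)^2
    # Averaged over trajectories, V_t should be flat in t under the MMFE;
    # a rising trend indicates excess forecast volatility.
    return q.mean(axis=0), v.mean(axis=0)
```

Figure 2 plots the trajectory-averaged {V_t} computed in this manner; deviation from a horizontal line indicates excess volatility.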
Notation

We denote by H and Q the number of horizons and quantiles being forecast, respectively. Bolded characters are used to indicate vector and matrix values. The concatenation of two vectors v and u is denoted [u; v].

3 Model Architecture

As mentioned, this work is motivated in part by the needs of the consumers of forecasting systems. We therefore care about whether or not our innovations can be used in practice. Our methodology must scale to forecasting tens of thousands or millions of signals, at hundreds of horizons. We extend the MQ-Forecaster family of models (Wen et al., 2017) because it, unlike many other architectures considered in the literature, can be applied at a large scale (millions of samples) due to its use of forking sequences – a technique to dramatically increase the effective batch size during training and avoid expensive data augmentation. In this section we present our MQTransformer architecture, building upon the MQ-Forecaster framework.

For ease of exposition, we reformulate the generic probabilistic forecasting problem in (1) as

p(y_{t+1,i}, …, y_{t+H,i} | y_{:t,i}, x_{:t,i}, x^{(l)}_i, x^{(g)}, s_i),

where x_{:t,i} are past observations of all covariates, x^{(l)}_i = {x^{(l)}_{s,i}}_{s=1}^∞ are known covariates specific to time series i, and x^{(g)} = {x^{(g)}_s}_{s=1}^∞ are the global, known covariates. In this setting, known signifies that the model has access to (potentially noisy) observations of past and future values. Note that this formulation is equivalent to (1), and that known covariates can be included in the past covariates x_{:t}. When it can be inferred from context, the time series index i is omitted.

Training Objective

We train a quantile regression model to minimize the quantile loss, summed over all forecast creation dates and quantiles,

Σ_t Σ_q Σ_k L_q(y_{t+k}, ŷ^{(q)}_{t+k}),

where L_q(y, ŷ) = q(y − ŷ)_+ + (1 − q)(ŷ − y)_+, (·)_+ is the positive-part operator, q denotes a quantile, and k denotes the horizon. A minimal sketch of this objective appears at the end of this subsection.

Architecture Overview

The design of the architecture is similar to MQ-RNN (Wen et al., 2017), and consists of encoder, decoder and position encoding blocks (see Figure 4 in Appendix B). The position encoding outputs, for each time step t, are a representation of global position information, r^{(g)}_t = PE^{(g)}_t(x^{(g)}), as well as time-series specific context information, r^{(l)}_t = PE^{(l)}_t(x^{(l)}_i). Intuitively, r^{(g)}_t captures position information that is independent of the time series i (such as holidays), whereas r^{(l)}_t encodes time-series specific context information (such as promotions). In both cases, the inputs are a time series of indicator variables and require no feature engineering or handcrafted functions.

The encoder then summarizes past observations of the covariates into a sequence of hidden states h_t := encoder(y_{:t}, x_{:t}, r^{(g)}_{:t}, r^{(l)}_{:t}, s). Using these representations, the decoder produces an H × Q matrix of forecasts Ŷ_t = decoder(h_{:t}, r^{(g)}, r^{(l)}). Note that in the decoder, the model has access to the position encodings (including those for future time steps).

In this section we focus on our novel attention blocks and position encodings; the reader is directed to Appendix B for the other architecture details.
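For concreteness, here is a minimal sketch of the training objective above, with forecasts laid out over all forecast creation dates as in forking-sequences training; the tensor layout and names are our assumptions, not the authors' code.

```python
import torch

def quantile_loss(y_true, y_pred, quantiles):
    """Quantile loss summed over forecast creation dates (FCTs),
    horizons and quantiles, as in Section 3.

    y_true    : (batch, num_fct, horizons) realized targets y_{t+k}
    y_pred    : (batch, num_fct, horizons, num_q) forecasts yhat^{(q)}_{t+k}
    quantiles : iterable of quantile levels, e.g. (0.5, 0.9)
    """
    q = torch.tensor(quantiles).view(1, 1, 1, -1)   # broadcast over quantiles
    diff = y_true.unsqueeze(-1) - y_pred            # y - yhat
    # L_q(y, yhat) = q * (y - yhat)_+ + (1 - q) * (yhat - y)_+
    loss = q * diff.clamp(min=0) + (1 - q) * (-diff).clamp(min=0)
    return loss.sum()
```

Because the loss is summed over every forecast creation date within a sequence, each training series contributes many (FCT, horizon) pairs per pass, which is what makes the forking-sequences scheme so sample-efficient.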
MQTransformer

Now we describe a design, evaluated in Section 4, following the generic pattern given above. We define the combined position encoding as r := [r^{(g)}; r^{(l)}]. In the encoder we use a stack of dilated
temporal convolutions (van den Oord et al., 2016; Wen et al., 2017) to encode the historical time series, and a multi-layer perceptron to encode the static features, as in (3).

Figure 1: Position encoding learned from daily-grain event indicators

Table 1: MQTransformer encoder and decoder
Encoder (3):
  h^{(1)}_t = TemporalConv(y_{:t}, x_{:t}, r_{:t})
  h^{(2)}_t = FeedForward(s)
  h_t = [h^{(1)}_t; h^{(2)}_t]

Decoder contexts (4):
  c^{hs}_{t,h} = HSAttention(h_{:t}, r)
  c^{a}_t = FeedForward(h_t, r)
  c_t = [c^{hs}_{t,1}; ···; c^{hs}_{t,H}; c^{a}_t]
  c̃_{t,h} = DSAttention(c_{:t}, h_{:t}, r)

Our decoder incorporates our horizon-specific and decoder self-attention blocks, and consists of two branches. The first (global) branch summarizes the encoded representations into horizon-specific (c^{hs}_{t,h}) and horizon-agnostic (c^{a}_t) contexts. Formally, the global branch c_t := m_G(·) is given by (4). The output branch consists of a self-attention block followed by a local MLP, which produces outputs using the same weights for each horizon. For FCT t and horizon h, the output is given by (ŷ^{(1)}_{t+h}, …, ŷ^{(Q)}_{t+h}) = m_L(c^{a}_t, c^{hs}_{t,h}, c̃_{t,h}, r_{t+h}), where c_{:t} denotes the output of the global branch, up through the FCT t. Next we describe the specifics of our position encoding and attention blocks.

Position Encoding from Event Indicators

Prior work typically uses a variant of one of two approaches to provide attention blocks with position information: (1) a handcrafted representation (such as sinusoidal encodings), or (2) a matrix M ∈ R^{L×d} of position encodings, where L is the maximum sequence length and each row corresponds to the position encoding for one time point.

In contrast, our novel encoding scheme maps sequences of indicator variables to d-dimensional representations. For demand forecasting, this enables our model to learn an arbitrary function of events (like holidays and promotions) to encode position information. As noted above, our model includes two position encodings: r^{(g)}_t := PE^{(g)}_t(x^{(g)}) and r^{(l)}_t := PE^{(l)}_t(x^{(l)}), one that is shared among all time series i and one that is series-specific. For the design we use in Section 4, PE^{(g)} is implemented as a bidirectional 1-D convolution (looking both forward and backward in time) and PE^{(l)} is an MLP applied separately at each time step; a minimal sketch follows below. Figure 1 shows an example of PE^{(g)} learned from holiday indicator variables. For reference, MQ-(C)RNN (Wen et al., 2017) uses linear holiday and promotion distances to represent position information.
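The following sketch illustrates the two position-encoding blocks just described – a bidirectional dilated convolution for PE^{(g)} and a per-time-step MLP for PE^{(l)}. The class names, kernel size and dilation rates are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class GlobalPositionEncoding(nn.Module):
    """PE^(g): bidirectional (non-causal) stack of dilated 1-D convolutions
    over global event indicators such as holiday flags."""
    def __init__(self, num_events, d_h, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers, in_ch = [], num_events
        for d in dilations:
            # padding = d keeps the sequence length fixed while the kernel
            # looks both backward and forward in time.
            layers += [nn.Conv1d(in_ch, d_h, kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
            in_ch = d_h
        self.net = nn.Sequential(*layers)

    def forward(self, x):              # x: (batch, time, num_events)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, d_h)

class LocalPositionEncoding(nn.Module):
    """PE^(l): an MLP applied independently at each time step to
    series-specific indicators such as promotion flags."""
    def __init__(self, num_events, d_h):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_events, d_h), nn.ReLU())

    def forward(self, x):              # x: (batch, time, num_events)
        return self.net(x)
```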
Connection to matrix embeddings

Another way to view our position encoding scheme is as a form of set-based indexing into the rows of an infinite-dimensional matrix. We note that the traditional method of learning a matrix embedding M can be recovered as a special case of our approach. Consider a sequence of length L, and take x^{(g)} := [e_1, …, e_L], where e_s denotes the vector in R^L with a 1 in the s-th position and 0s elsewhere. To recover the matrix embedding scheme, we define PE^{matrix}_t(x^{(g)}) := x^{(g),⊤}_t M. Thus our scheme is a strict generalization of the matrix embedding approach commonly used in the NLP community.
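A quick numerical check of this reduction (illustrative only): feeding one-hot position indicators through the linear map x_t^⊤ M recovers exactly the rows of a conventional embedding matrix.

```python
import torch

L, d = 10, 4
M = torch.randn(L, d)       # a conventional learned position-embedding matrix
x_g = torch.eye(L)          # x^(g) = [e_1, ..., e_L], one indicator per position
# PE^matrix_t(x^(g)) = x_t^T M recovers row t of M exactly.
assert torch.equal(x_g @ M, M)
```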
Table 2: Attention weight and output computations for the blocks introduced in Section 3.4

Decoder-Encoder Attention:
  A^h_{t,s} = q^{h,⊤}_t W^⊤_q W_k k_s,  where q^h_t = [h_t; r_t; r_{t+h}], k_s = [h_s; r_s], v_s = h_s   (5)
  c^{hs}_{t,h} = Σ_{s=t−L}^{t} A^h_{t,s} v_s   (6)

Decoder Self-Attention:
  A^h_{t,s,r} = q^⊤_{t,h} W^{h,⊤}_q W^h_k k_{s,r},  where q_{t,h} = [h_t; c^{hs}_{t,h}; r_t; r_{t+h}], k_{s,r} = [c^{hs}_{s,r}; r_s; r_{s+r}], v_{s,r} = c^{hs}_{s,r}   (7)
  c̃^{hs}_{t,h} = Σ_{(s,r)∈H(t,h)} A^h_{s,t,r} v_{s,r},  where H(t,h) := {(s,r) | s + r = t + h}   (8)

Horizon-Specific Decoder-Encoder Attention
Our horizon-specific attention mechanism is a multi-headed attention mechanism where the projection weights are shared across all horizons. Each head corresponds to a different horizon. It differs from a traditional multi-headed attention mechanism in that its purpose is to attend over representations of past time points to produce a representation specific to the target period. In our architecture, the inputs to the block are the encoder hidden states and position encodings. Mathematically, for FCT t and horizon h, the attention weight for the value at time s is computed as in (5).

Observe that there are two key differences between these attention scores and those in the vanilla transformer architecture: (a) the projection weights are shared by all H heads, and (b) the position encoding of the target horizon h is added to the query. The output of our horizon-specific decoder-encoder attention block, c^{hs}_{t,h}, is obtained by taking a weighted sum of the encoder hidden contexts, up to a maximum look-back of L periods, as in (6). A minimal sketch of this block follows.
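Below is a minimal single-FCT sketch of this block following (5)–(6); the softmax normalization and the module's name and arguments are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizonSpecificAttention(nn.Module):
    """Sketch of horizon-specific decoder-encoder attention (Eqs. (5)-(6)):
    one head per horizon, projection weights shared across all heads, and
    the target horizon's position encoding appended to the query."""
    def __init__(self, d_enc, d_pos, d_att):
        super().__init__()
        self.w_q = nn.Linear(d_enc + 2 * d_pos, d_att, bias=False)  # shared W_q
        self.w_k = nn.Linear(d_enc + d_pos, d_att, bias=False)      # shared W_k

    def forward(self, h, r, r_future):
        # h:        (batch, T, d_enc)  encoder hidden states h_s up to the FCT
        # r:        (batch, T, d_pos)  position encodings r_s
        # r_future: (batch, H, d_pos)  encodings r_{t+h} for each horizon h
        H = r_future.size(1)
        q_t = torch.cat([h[:, -1:], r[:, -1:]], dim=-1)           # [h_t; r_t]
        q = torch.cat([q_t.expand(-1, H, -1), r_future], dim=-1)  # [h_t; r_t; r_{t+h}]
        k = self.w_k(torch.cat([h, r], dim=-1))                   # [h_s; r_s]
        scores = torch.einsum('bhd,bsd->bhs', self.w_q(q), k)     # A^h_{t,s}
        weights = F.softmax(scores, dim=-1)
        return torch.einsum('bhs,bsd->bhd', weights, h)           # c^{hs}_{t,h}
```

In training with forking sequences, the same computation is carried out for every FCT t in the sequence; the sketch shows only the last time step for clarity.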
Decoder Self-Attention

The martingale diagnostic tools developed in Stine and Foster (2020b) indicate a deep connection between accuracy and volatility. We leverage this connection to develop a novel decoder self-attention scheme for multi-horizon forecasting. To motivate the development, consider a model which alternately forecasts values of 40 and 60 when the demand has constantly been 50 units. We would consider this model to have excess volatility. Similarly, a model forecasting 40 and 60 when demand jumps between 40 and 60 units would not be considered to have excess volatility. The difference is that the first model fails to learn from its past forecasts – it continues jumping between 40 and 60 when the demand is 50 units.

To ameliorate this, we need to pass information about previous forecast errors into the current forecast. For each FCT t and horizon h, the model attends on the previous forecasts using a query containing the demand information for that period. The attention mechanism has a separate head for each horizon.

Rather than attend on the demand information and prior outputs directly, a richer representation of the same information is used: the demand information at time t is incorporated via the encoded context h_t, and previous forecasts are represented via the corresponding horizon-specific context c^{hs}_{s,r} – in the absence of decoder self-attention, c^{hs}_{s,r} would be passed through the local MLP to generate the forecasts. Formally, the attention scores are given by (7). The horizon-specific and feedback-aware outputs, c̃^{hs}_{t,h}, are given by (8). Note how we sum only over previous forecasts of the same target period; a minimal sketch follows.
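A minimal sketch of this block following (7)–(8), for a single FCT and horizon; q_proj and k_proj stand for the linear projections W_q and W_k (e.g. torch.nn.Linear instances), and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def decoder_self_attention(q_proj, k_proj, c_hs, h, r, t, horizon):
    """Sketch of feedback-aware decoder self-attention (Eqs. (7)-(8)):
    attend only over earlier contexts c^{hs}_{s,r} whose target period
    matches, i.e. H(t, h) = {(s, r) | s + r = t + horizon}.

    c_hs : (T, H, d)  horizon-specific contexts c^{hs}_{s,r} for all FCTs s
    h    : (T, d)     encoder hidden states
    r    : (T + H, p) position encodings, indexed by absolute time
    """
    target = t + horizon
    # Enumerate the prior forecasts of the same target period.
    pairs = [(s, target - s) for s in range(t + 1)
             if 1 <= target - s <= c_hs.size(1)]
    # Query carries demand information for FCT t: [h_t; c^{hs}_{t,h}; r_t; r_{t+h}].
    query = q_proj(torch.cat([h[t], c_hs[t, horizon - 1], r[t], r[target]]))
    keys = torch.stack([k_proj(torch.cat([c_hs[s, rr - 1], r[s], r[s + rr]]))
                        for s, rr in pairs])
    values = torch.stack([c_hs[s, rr - 1] for s, rr in pairs])
    weights = F.softmax(keys @ query, dim=0)   # attention over H(t, horizon)
    return weights @ values                    # feedback-aware context
```

The constraint s + r = t + h is what gives the block its feedback-aware character: every attended value is an earlier representation of a forecast for the same target period.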
4 Experiments

First, we evaluate our architecture on a demand forecasting problem for a large-scale e-commerce retailer, with the objective of producing multi-horizon forecasts that span up to one year. Each horizon is specified by a combination of a lead time (LT), the number of periods from the FCT to the start of the horizon, and a span (SP), the number of periods covered by the forecast. To assess the effects of each innovation, we ablate by removing components one at a time. MQTransformer is denoted Dec-Enc & Dec-Self Att; Dec-Enc Att contains only the horizon-specific decoder-encoder unit; and Baseline is the vanilla MQ-CNN model. MQ-CNN is selected as the baseline since prior work demonstrates that MQ-CNN outperforms MQ-RNN and DeepAR on this dataset (in Wen et al. (2017), Figure 3 shows MQ-CNN, labeled "MQ CNN wave", outperforming all MQ-RNN variants and DeepAR, labeled "Seq2SeqC", on the test set), and, as can be seen in Table 4, MQ-CNN similarly outperforms MQ-RNN and DeepAR on public datasets. We conduct our experiments on a subset of products.
Table 3: Relative quantile loss on key slices of the demand forecasting task (Baseline = 1.000; lower is better)

Model     All LTSP   LTSP 0/4   Seasonal Peak 1   Post-Peak Rampdown   Promotion Type 1
Baseline  1.000      1.000      1.000             1.000                1.000
Dec-Enc   —          —          —                 —                    —

Figure 2: Martingale diagnostic process {V_t} averaged over all weeks in the test period (2018–2019)

Forecast Accuracy
Table 3 summarizes several key metrics that demonstrate the accuracy improvements achieved by adding our proposed attention mechanisms to the MQ-CNN architecture; the full set of results can be found in Appendix C. Introducing the horizon-specific decoder-encoder attention alone yields improvements along all metrics evaluated. Overall we see a 1.6% improvement in P50 QL and a 3.9% improvement in P90 QL. Notably, the attention mechanism yields significant improvements on short LT-SPs (LT-SP 0/4).

Further, Table 3 demonstrates improved performance on seasonal peaks and promotions. Observe that while MQ-CNN performs well on some seasonal peaks, it is also misaligned and fails to ramp down post-peak – post-peak ramp-down issues occur when the model continues to forecast high for target weeks after the peak week. By including MQTransformer's attention mechanisms in the architecture, we see a 43% improvement for Seasonal Peak 1 and a 56% improvement on Post-Peak Rampdown. In retail, promotions are used to provide a demand lift for products. Accordingly, a model should be able to react to an upcoming promotion and forecast an accurate lift in demand for the target weeks in which the promotion is placed. From Table 3 we see that MQTransformer achieves a 49% improvement on items with Promotion Type 1 versus the baseline.

Figure 3: Forecast evolution analysis on the retail dataset. Left: martingale diagnostic process {V_t}. Right: QL by lead time, averaged over target dates from 2016-03-01 through 2016-05-01; QL trajectories are centered around 0.

Forecast Volatility
We study the effect of our proposed attention mechanisms on excess forecast volatility using the diagnostic tools recently proposed by Stine and Foster (2020b,a). Figure 2 plots the process {V_t} (see Section 2). In the plot, the lines should appear horizontal under the MMFE; any deviation above this (on an aggregate level) indicates excess volatility in the forecast evolution. We observe that while none of the models produce ideal forecasts, both attention models outperform the Baseline, with the model containing both proposed attention mechanisms performing best in terms of these evolution diagnostics.

The green line corresponds to the attention model with only the horizon-specific decoder-encoder attention. Compared to the baseline, this model achieves up to a 27% reduction in excess volatility at P50 and 7% at P90. By also adding decoder self-attention, we see a further reduction in excess volatility of an additional 20% at P50 and 21% at P90.

Public Datasets

Following Lim et al. (2019), we consider applications to brick-and-mortar retail sales, electricity load, securities volatility and traffic forecasting. For the retail task, we predict the next 30 days of sales given the previous 90 days of history. This dataset contains a rich set of static, time series, and known features. At the other end of the spectrum, the electricity load dataset is univariate. See Appendix D for additional information about these tasks. Table 4 compares MQTransformer's performance with other recent works – DeepAR (Salinas et al., 2020), ConvTrans (Li et al., 2019), MQ-RNN (Wen et al., 2017), and TFT (Lim et al., 2019). Results for TFT, MQ-RNN, DeepAR and ConvTrans are from Lim et al. (2019); for MQ-CNN and MQTransformer, we use their pre-processing and evaluation code to ensure parity.

Our MQTransformer architecture is competitive with or beats the state of the art on the electricity load, volatility and traffic prediction tasks, as shown in Table 4. On the most challenging task, it dramatically outperforms the previously reported state of the art by 38%, and the MQ-CNN baseline by 5% at P50 and 11% at P90. Because MQ-CNN and MQTransformer are trained using forking sequences, we can use the entire training population, rather than downsampling as is required to train TFT (Lim et al., 2019) – see Appendix D.
Table 4: Quantile loss on public datasets; parenthesized values are for MQTransformer retrained on the TFT sub-sample without forking sequences

P50 QL
Task         DeepAR   ConvTrans   MQ-RNN   MQ-CNN   TFT     MQTransformer
Electricity  0.075    0.059       0.077    0.076    —       (0.2645)
Volatility   0.050    0.047       0.042    0.042    —       —
Traffic      0.161    0.122       0.117    0.115    —       —
Retail       0.230    0.192       0.152    0.118    0.147   (0.109)

P90 QL
Task         DeepAR   ConvTrans   MQ-RNN   MQ-CNN   TFT     MQTransformer
Volatility   0.024    0.024       0.021    0.020    0.020   —
Traffic      0.099    0.081       0.082    0.077    0.070   —

To ascertain what portion of the gain is due to learning from more trajectories, versus our innovations alone, we retrain the optimal MQTransformer architecture using a random sub-sample of 450K trajectories (the same sampling procedure as TFT) and without using forking sequences – the results are indicated in parentheses in Table 4. We observe that MQTransformer still dramatically outperforms TFT, and its performance is similar to the MQ-CNN baseline trained on all trajectories.

Furthermore, on the Favorita retail forecasting task, Figure 3 shows that, as expected, MQTransformer substantially reduces excess volatility in the forecast evolution compared to the MQ-CNN baseline. Somewhat surprisingly, TFT exhibits much lower volatility than does MQTransformer. In Figure 3, the right-hand plot displays quantile loss as the target date approaches – trajectories for each model are zero-centered to emphasize the trends exhibited. While TFT is less volatile, it is also less accurate, as it fails to incorporate newly available information. By contrast, MQTransformer is both less volatile and more accurate when compared with MQ-CNN. See Appendix D for more details on the experiment setup and training procedure used.
Computational Efficiency
For purposes of illustration, consider the Retail dataset. Prior work (Lim et al., 2019) was only able to make use of 450K out of 20M trajectories, and the optimal TFT architecture required 13 minutes per epoch (minimum validation error at epoch 6) using a single V100 GPU. By contrast, our innovations are compatible with forking sequences, and thus our architecture can make use of all available trajectories. To train, MQTransformer requires only 5 minutes per epoch (on all 20M trajectories) using a single V100 GPU. Timing results were obtained by running the source code provided by Lim et al. (2019).
5 Conclusion

In this work, we presented three novel architecture enhancements that address bottlenecks in state-of-the-art MQ-Forecasters: a novel context-alignment decoder-encoder attention, a decoder self-attention scheme tailored to the problem of multi-horizon forecasting, and position embeddings learned directly from domain-specific event indicators. To the best of our knowledge, this is the first work to consider the impact of model architecture on forecast evolution. We also demonstrated that horizon-specific contexts can improve performance on difficult sub-problems such as promotions and seasonal peaks. Together, these innovations produced significant improvements in both the excess variation of the forecast and accuracy across different dimensions. Finally, we applied our model to several public datasets, where it outperformed the baseline architecture by 5% at P50 and 11% at P90, and the previously reported state of the art (TFT) by 38% on the most complex task. On the three less complex public datasets, our architecture achieved parity with or slightly exceeded previous state-of-the-art results. Beyond accuracy gains, by making our architecture innovations compatible with forking sequences, our model achieves massive increases in throughput compared to existing transformer architectures for time-series forecasting. An interesting direction we intend to explore in future work is incorporating encoder self-attention so that the model can leverage arbitrarily long historical time series, rather than the fixed length consumed by the convolutional encoder.
Acknowledgements
We would like to thank Dean Foster, Robert Stine, and Kari Torkkola for their insightful feedback and support.

References
Augenblick, N. and Rabin, M. (2019). Belief movement, uncertainty reduction, and rational updating. Tech. rep., Haas School of Business, University of California, Berkeley.

Bahdanau, D., Cho, K. and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.

Benidis, K., Rangapuram, S. S., Flunkert, V., Wang, B., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-Schneider, M., Salinas, D., Stella, L. et al. (2020). Neural forecasting: Introduction and literature overview. arXiv:2004.10240.

Bray, R. L. and Mendelson, H. (2012). Information transmission and the bullwhip effect: An empirical investigation. Management Science.

Chen, F., Drezner, Z., Ryan, J. K. and Simchi-Levi, D. (2000). Quantifying the bullwhip effect in a simple supply chain: The impact of forecasting, lead times, and information. Management Science.

Cheng, J., Dong, L. and Lapata, M. (2016). Long short-term memory-networks for machine reading. arXiv:1601.06733.

Cinar, Y. G., Mirisaee, H., Goswami, P., Gaussier, E., Ait-Bachir, A. and Strijov, V. (2017). Position-based content attention for time series forecasting with sequence-to-sequence RNNs. In ICONIP.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V. and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In ACL.

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Galassi, A., Lippi, M. and Torroni, P. (2019). Attention, please! A critical review of neural attention models in natural language processing. arXiv:1902.02181.

Gasparin, A., Lukovic, S. and Alippi, C. (2019). Deep learning for time series forecasting: The electric load case. arXiv:1907.09207.

Heath, D. C. and Jackson, P. L. (1994). Modeling the evolution of demand forecasts with application to safety stock analysis in production/distribution systems. IIE Transactions.

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A. M., Hoffman, M. D., Dinculescu, M. and Eck, D. (2018). Music Transformer: Generating music with long-term structure. arXiv:1809.04281.

Kim, S. and Kang, M. (2019). Financial series prediction using attention LSTM. arXiv:1902.10877.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.

Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X. and Yan, X. (2019). Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In NIPS.

Lim, B., Arik, S. O., Loeff, N. and Pfister, T. (2019). Temporal fusion transformers for interpretable multi-horizon time series forecasting. arXiv:1912.09363.

Luong, M.-T., Pham, H. and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv:1508.04025.

Madeka, D., Swiniarski, L., Foster, D., Razoumov, L., Torkkola, K. and Wen, R. (2018). Sample path generation for probabilistic demand forecasting. In ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models.

Mukhoty, B. P., Maurya, V. and Shukla, S. K. (2019). Sequence to sequence deep learning models for solar irradiation forecasting. In IEEE Milan PowerTech.

Nascimento, R. C., Souto, Y. M., Ogasawara, E., Porto, F. and Bezerra, E. (2019). STConvS2S: Spatiotemporal convolutional sequence to sequence network for weather forecasting. arXiv:1912.00134.

Salinas, D., Flunkert, V., Gasthaus, J. and Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting.

Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). Self-attention with relative position representations. In NAACL.

Shih, S.-Y., Sun, F.-K. and Lee, H.-y. (2019). Temporal pattern attention for multivariate time series forecasting. Machine Learning.

Stine, R. and Foster, D. (2020a). Martingale descriptions of the evolution of forecasts. Tech. rep., manuscript in preparation.

Stine, R. and Foster, D. (2020b). Martingale diagnostics. Tech. rep., manuscript in preparation.

Sutskever, I., Vinyals, O. and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In NIPS.

Taleb, N. N. (2018). Election predictions as martingales: An arbitrage approach. Quantitative Finance.

Taleb, N. N. and Madeka, D. (2019). All roads lead to quantitative finance. Quantitative Finance.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv:1609.03499.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. and Polosukhin, I. (2017). Attention is all you need. In NIPS.

Wen, R. and Torkkola, K. (2019). Deep generative quantile-copula models for probabilistic forecasting. In ICML Time Series Workshop.

Wen, R., Torkkola, K., Narayanaswamy, B. and Madeka, D. (2017). A multi-horizon quantile recurrent forecaster. In NIPS Time Series Workshop.

Williams, D. (1991). Probability with Martingales. Cambridge University Press.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R. and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In ICML.

Yu, R., Zheng, S., Anandkumar, A. and Yue, Y. (2017). Long-term forecasting using higher order tensor RNNs. arXiv:1711.00073.
A Additional Background and Related Work
A.1 Attention Mechanisms
Attention mechanisms can be viewed as a form of content-based addressing that computes an alignment between a set of queries and keys to extract a value. Formally, let q_1, …, q_t, k_1, …, k_t and v_1, …, v_t be a series of queries, keys and values, respectively. The s-th attended value is defined as c_s = Σ_{i=1}^t score(q_s, k_i) v_i, where score is a scoring function – commonly score(u, v) := u^⊤ v. In the vanilla transformer model, q_s = k_s = v_s = h_s, where h_s is the hidden state at time s. Because attention mechanisms have no concept of absolute or relative position, some sort of position information must be provided. Vaswani et al. (2017) use a sinusoidal positional encoding added to the input of an attention block, providing each token's position in the input time series.
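As a concrete reference, here is a minimal sketch of this addressing scheme with the common dot-product score; the softmax normalization is the usual practical choice rather than part of the definition above.

```python
import torch
import torch.nn.functional as F

def attended_values(q, k, v):
    """Content-based addressing: c_s = sum_i score(q_s, k_i) v_i with the
    dot-product score and a softmax normalization over the key index i.

    q, k, v: (T, d) stacked queries, keys and values.
    """
    scores = q @ k.T                      # score(q_s, k_i) = q_s^T k_i
    weights = F.softmax(scores, dim=-1)   # normalize over the key index i
    return weights @ v                    # (T, d) attended values c_s

# Self-attention in the vanilla transformer sets q = k = v = h.
h = torch.randn(12, 8)
c = attended_values(h, h, h)
```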
B MQTransformer Architecture Details

In this section we describe in detail the layers in our MQTransformer architecture, which is based on the MQ-Forecaster framework (Wen et al., 2017) and uses a WaveNet encoder (van den Oord et al., 2016) for time-series covariates. Figure 4 outlines the MQTransformer architecture. On different datasets, we consider the following variations: the choice of encoding for categorical variables, a tunable parameter d_h (the dimension of hidden layers), the dropout rate p_drop, a list of dilation rates for the WaveNet encoder, and a list of dilation rates for the position encoding. The ReLU activation function is used throughout the network.
Static categorical variables are encoded using either one-hot encoding or an embedding layer. Time-series categorical variables are one-hot encoded and then passed through a single feed-forward layer of dimension d_h.

The global position encoding module takes as input the known time-series covariates and consists of a stack of dilated, bidirectional 1-D convolution layers with d_h filters. After each convolution is a ReLU activation followed by a dropout layer with rate p_drop. The local position encoding is implemented as a single dense layer of dimension d_h.
After the categorical encodings are applied, the inputs are passed through the encoder block. The encoder consists of two components: a single dense layer to encode the static features, and a stack of dilated, temporal convolutions. The time-series covariates are concatenated with the position encoding to form the input to the convolution stack. The output of the encoder block is produced by replicating the encoded static features across all time steps and concatenating with the output of the convolution; a minimal sketch follows.
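A minimal sketch of this encoder, assuming a causal dilated-convolution stack; the class name, kernel size and default dilation rates are illustrative, not the paper's released code.

```python
import torch
import torch.nn as nn

class MQTEncoder(nn.Module):
    """Sketch of the encoder in Appendix B.2: a causal dilated-convolution
    stack over [time-series covariates; position encodings], with encoded
    static features replicated across time and concatenated to the output."""
    def __init__(self, d_in, d_static, d_h, dilations=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        self.static_enc = nn.Sequential(nn.Linear(d_static, d_h), nn.ReLU())
        convs, in_ch = [], d_in
        for d in dilations:
            # Left-pad by 2*d so each output at time t sees only s <= t (causal).
            convs += [nn.ConstantPad1d((2 * d, 0), 0.0),
                      nn.Conv1d(in_ch, d_h, kernel_size=3, dilation=d),
                      nn.ReLU()]
            in_ch = d_h
        self.conv_stack = nn.Sequential(*convs)

    def forward(self, x, r, s):
        # x: (batch, T, d_x) covariates; r: (batch, T, d_r) position encodings
        # s: (batch, d_static) static features; d_in = d_x + d_r
        z = torch.cat([x, r], dim=-1).transpose(1, 2)   # (batch, d_in, T)
        h_conv = self.conv_stack(z).transpose(1, 2)     # (batch, T, d_h)
        h_static = self.static_enc(s).unsqueeze(1)      # (batch, 1, d_h)
        return torch.cat([h_conv,
                          h_static.expand(-1, h_conv.size(1), -1)], dim=-1)
```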
B.3 Decoder

Please see Table 1 for a description of the blocks in the decoder. The dimension of each head in both the horizon-specific and decoder self-attention blocks scales with d_h (see, e.g., Table 5). The dense layer used to compute c^{a}_t has dimension d_h. The output block is a two-layer MLP with hidden-layer dimension d_h, whose weights are shared across all time points and horizons. The output layer has one output per (horizon, quantile) pair.

C Large Scale Demand Forecasting Experiments
C.1 Experiment Setup
In this section we describe the details of the model architecture and training procedure used in the experiments on the large-scale demand forecasting application.
Training Procedure
Because we did not have enough history available to set aside a true holdout set, all models are trained for 100 epochs, and the final model is evaluated on the test set. For the same reason, no hyperparameter tuning was performed.
Architecture and Hyperparameters
The categorical variables consist of static features of the item, and the time-series categorical variables are event indicators (e.g. holidays). The parameters are summarized in Table 5.

Table 5: Parameter settings for the large-scale demand forecasting experiments

Parameter                            Value
Encoder Convolution Dilation Rates   [1, 2, 4, 8, 16, 32]
Position Encoding Dilation Rates     [1, 2, 4, 8, 16, 20]
Static Categorical                   One-Hot
Time-Series Categorical              One-Hot
Static Encoder Dimension             64
Convolution Filters                  64
Attention Block Heads                52
Attention Block Head Dimension       8
Dropout Rate                         0.15
Activation Function                  ReLU

C.2 Results
Tables 6, 7, and 8 contain the full set of results on the large-scale demand forecasting task. Per the discussion in Section 4, we use MQ-CNN as the baseline model; we did not compare to TFT due to scaling issues and the fact that on the most similar public task (retail), MQ-CNN significantly outperformed TFT.
Quantile loss by horizon
Table 6 demonstrates how the attention mechanism yields significant improvements for shorter LTSPs (e.g. LTSP 3/1 and LTSP 0/4): a 7% improvement in P90 QL for LTSP 3/1 and a 7.6% improvement in P90 QL for LTSP 0/4. We still see improvements for longer LTSPs, but they are less substantial: a 3.8% improvement in P90 QL for LTSP 12/3 and a 3.9% improvement in P90 QL for LTSP 0/33. By also adding decoder self-attention, we continue to see improved results for shorter LTSPs compared to decoder-encoder attention alone, but we do see slight degradations for longer LTSPs when comparing to decoder-encoder attention.
Promotions Performance
In Table 8 we see that MQTransformer outperforms the prior state of the art on all promotion types. After adding the horizon-specific decoder-encoder and decoder self-attentions, versus the baseline, we see a 49% improvement for Promotion Type 1 products, a 31% improvement for Promotion Type 2 products, and a 17% improvement for Promotion Type 3 products.
Peak Performance
Table 3 illustrates that while MQ-CNN performs well on some seasonal peaks, it is also misaligned and fails to ramp down post-peak – ramp-down issues occur when the model continues to forecast high for target weeks after the peak week. By including MQTransformer's attention mechanisms in the architecture, we see a 43% improvement for Seasonal Peak 1, a 21% improvement for Seasonal Peak 2, a 7% improvement for Seasonal Peak 3, and a 56% improvement on Post-Peak Rampdown.

Table 6: 52-week aggregate quantile loss metrics for a set of representative lead times and spans
Model               All LTSP        LTSP 3/1        LTSP 0/4
                    P50     P90     P50     P90     P50     P90
Baseline            1.000   1.000   1.000   1.000   1.000   1.000
Dec-Enc             —       —       —       —       —       —
Dec-Enc & Dec-Self  —       —       —       —       —       —

Model               LTSP 12/3       LTSP 0/33
                    P50     P90     P50     P90
Baseline            1.000   1.000   1.000   1.000
Dec-Enc             0.975   —       —       —
Dec-Enc & Dec-Self  —       —       —       —

Table 7: Relative quantile loss on seasonal peaks and post-peak ramp-down

Model               Seasonal Peak 1   Seasonal Peak 2   Seasonal Peak 3   Post-Peak Rampdown
Baseline            1.000             1.000             1.000             1.000
Dec-Enc             0.748             —                 —                 —
Dec-Enc & Dec-Self  —                 —                 —                 —
Table 8: P90 quantile loss metrics on (item, week) pairs with promotions
Model               Promotion Type 1   Promotion Type 2   Promotion Type 3
Baseline            1.000              1.000              1.000
Dec-Enc             0.706              0.769              0.865
Dec-Enc & Dec-Self  —                  —                  —
D Experiments on Public Datasets
In this section we describe the experiment setup used for the public datasets in Section 4.
D.1 Datasets
We evaluate our MQTransformer on four public datasets. We summarize the datasets and preprocessing logic below; the reader is referred to Lim et al. (2019) for more details.
Retail
This dataset is provided by Corporación Favorita (a major grocery chain in Ecuador) as part of a Kaggle competition to predict sales for thousands of items at multiple brick-and-mortar locations. In total there are 135K items (item, store combinations are treated as distinct entities), and the dataset contains a variety of features including local, regional and national holidays; static features about each item; and total sales volume at each location. The task is to predict log-sales for each (item, store) combination over the next 30 days, using the previous 90 days of history. The training period is January 1, 2015 through December 1, 2015. The following 30 days are used as a validation set, and the 30 days after that as the test set. These 30-day windows correspond to a single forecast creation time. While Lim et al. (2019) extract only 450K samples from the histories during the train window, there are in fact 20M trajectories available for training – because our models can produce forecasts for multiple trajectories (FCDs) simultaneously, we train using all available data from the training window. The original competition can be found on Kaggle.
Electricity
This dataset consists of hourly load time series for 370 customers. The univariate data is augmented with day-of-week, hour-of-day and an offset from a fixed time point. The task is to predict the hourly load over the next 24 hours for each customer, given the past seven days of usage. From the training period (January 1, 2014 through September 1, 2014), 500K samples are extracted.
Traffic
This dataset consists of lane occupancy information for 440 San Francisco area freeways. The data is aggregated to an hourly grain, and the task is to predict the hourly occupancy over the next 24 hours given the past seven days. The training period consists of all data before 2008-06-15, with the final 7 days used as a validation set. The 7 days immediately following the training window are used for evaluation. The model takes as input lane occupancy, hour of day, day of week, hours from start, and an entity identifier.
Volatility
The volatility dataset consists of 5-minute sub-sampled realized volatility measurements from 2000-01-03 to 2019-06-28. Using the past one year's worth of daily measurements, the goal is to predict the next week's (5 business days) volatility. The period ending on 2015-12-31 is used as the training set, 2016–2017 as the validation set, and 2018-01-01 through 2019-06-28 as the evaluation set. A region identifier is provided as a static covariate, along with the time-varying covariates daily returns, day-of-week, week-of-year and month. A log transformation is applied to the target.
D.2 Training Procedure
We tune only two hyper-parameters, the hidden layer dimension d_h and the learning rate α, each over a grid of three candidate values (the selected values are reported in Table 9). The model is trained using the Adam optimizer (Kingma and Ba, 2015).
The MQTransformer architecture used for these experiments contains a single tunable hyperparameter – the hidden layer dimension d_h. Dataset-specific settings are used for the dilation rates. For static categorical covariates we use an embedding layer with dimension d_h, and we use one-hot encoding for time-series covariates. A dropout rate of 0.15 and ReLU activations are used throughout the network. The only difference between this variant and the one used for the non-public large-scale demand forecasting task is the use of an embedding layer for static categorical covariates rather than one-hot encoding.

D.4 Reported Model Parameters
The optimal parameters for each task are given in Table 9.

Table 9: Parameter settings of the reported MQTransformer model on each public dataset

Name         d_h   α          Enc. Dilation Rates        Pos. Dilation Rates
Electricity  128   —          [1, 2, 4, 8, 16, 32]       [1, 2, 4, 8, 8]
Traffic      64    —          [1, 2, 4, 8, 16, 32]       [1, 2, 4, 8, 8]
Volatility   128   —          [1, 2, 4, 8, 16, 32, 64]   [1, 1, 2]
Retail       64    1 × 10⁻⁴   —                          —