Stanza: A Nonlinear State Space Model for Probabilistic Inference in Non-Stationary Time Series
Anna K. Yanchenko
Department of Statistical Science, Duke University, Durham, NC 27708
[email protected]
Sayan Mukherjee
Department of Statistical Science, Duke University, Durham, NC 27708
[email protected]
Abstract
Time series with long-term structure arise in a variety of contexts, and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting settings. Traditionally, state space models have been successful in providing uncertainty estimates of trajectories in the latent space. More recently, deep learning, attention-based approaches have achieved state-of-the-art performance for sequence modeling, though they often require large amounts of data and parameters to do so. We propose Stanza, a nonlinear, non-stationary state space model, as an intermediate approach to fill the gap between traditional models and modern deep learning approaches for complex time series. Stanza strikes a balance between competitive forecasting accuracy and probabilistic, interpretable inference for highly structured time series. In particular, Stanza achieves forecasting accuracy competitive with deep LSTMs on real-world datasets, especially for multi-step ahead forecasting.
1 Introduction

Time series with long-term structure arise in a variety of contexts, and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting. Traditionally, state space models like Kalman Filters [15], Dynamic Linear Models (DLMs) [35] and their numerous extensions [21, 8] have been utilized for probabilistic inference in time series analysis. This general class of state space models has been very successful for modeling systems where the dynamics are approximately linear and relatively simple. Alternatively, deep learning, attention-based models [33, 5] have achieved state-of-the-art forecasting performance in modeling longer-term structure in time series and sequences, especially for applications relating to language. However, these methods often require large amounts of training data.

Traditional Kalman Filters and DLMs are very good for inference tasks, especially in settings where probabilistic filtering and smoothing of the state space is important. However, Kalman Filters often are not competitive for forecasting, especially for data that does not meet the model specifications. On the other hand, deep learning models like Long Short-Term Memory (LSTM) models [14] achieve excellent forecasting accuracy in many settings, but cannot be used for inference tasks, as filtering and smoothing are done implicitly and deterministically in the model. Thus, there remains a gap between time series models designed for inference and those that achieve competitive forecasting accuracy.

We propose Stanza, a State Space Model with Nonlinear Activations and Dynamically Weighted Latent (Z) States, as an intermediate approach to alleviate this trade-off between inference and forecasting. Our motivation is to fill the gap between large deep learning, attention-based models and traditional state space models. Stanza achieves forecasting accuracy comparable to deep LSTMs on three real-world datasets and is able to perform inference for complex, nonlinear, non-stationary time series. Furthermore, Stanza is fully probabilistic in the state space, allowing for straightforward multi-step ahead forecasts out to long time horizons, and provides forecast coverage estimates that are not possible with traditional LSTMs. Stanza incorporates ideas from deep learning into traditional state space models, resulting in a nonlinear, non-stationary model capable of performing inference in more general and complex settings than classical state space models. We design Stanza specifically for settings where there is not enough data for a full attention-based model (i.e., the small to medium data regime) and/or where the model needs to be interpretable and uncertainty in the inferential task needs to be quantified.

The optimal setting for Stanza is quite general and occurs in many different application areas. For example, in many medical settings, the quantity of data may be limited, yet it is critical to have an interpretable and probabilistic model for inference tasks, and simple DLMs may not be rich enough to capture the complexity in the data, for example, with EEG signals [36]. In retail sales settings, there may be an abundance of data, yet the goal may be to understand seasonality patterns in consumer behavior, necessitating a model with interpretable parameters that can either uncover structure in the data or specifically incorporate known structure into the model [3]. Finally, even in financial settings where forecasting is more important than inference, there may not be enough data to train a full deep learning, attention-based model, yet competitive forecasting accuracy is still required (for example, in the case of the Exchange Rate data considered in [20]). Stanza is motivated by a need for powerful inference and forecasting.

Our contributions are as follows:
1. Stanza: We propose Stanza as a nonlinear, non-stationary, interpretable time series model that gives full uncertainty estimates for parameters.

2. Versatility: We demonstrate Stanza's ability to model complex and highly structured time series and to provide interpretable parameters with uncertainty estimates for inference. We also demonstrate that Stanza is competitive with deep LSTMs in terms of forecasting accuracy on real-world datasets, particularly for multi-step ahead forecasts.

3. Intermediate Approach: Stanza incorporates ideas from deep learning based models directly into the state space model framework.
2 Stanza

Stanza is designed for competitive performance on both inference and forecasting tasks, specifically for settings with limited data and/or where parameter uncertainty estimates and interpretable models are critical. Building off ideas from traditional state space models, particularly Time-Varying Autoregressive (TVAR) models [36], Stanza combines ideas from deep learning approaches with the state space model framework for an intermediate approach that balances the advantages of probabilistic inference and competitive forecasting accuracy.
2.1 Model

Let {X_t} be a univariate or multivariate time series of length T centered about 0, and {Z_t} a sequence of real-valued latent states. The Stanza model is specified in Equation 1:

Z_t ∼ N( f( W_t^T Z_{t−r:t−1} ), Q_t ),
X_t ∼ N( H Z_t, R ),
W_t ∼ N( W_{t−1}, Λ_t ),        (1)

where Z_{t−r:t−1} = [ Z_{t−r}, . . . , Z_{t−1} ]^T. The dynamic weights, W_t, allow for non-stationarity in the modeling of the time series and can directly capture structure in the observed time series. The hyper-parameter r determines the order of the lag dependence in the state space and can be specified directly from domain knowledge; for example, for a daily series with weekly seasonality, r can be set to 7, and the dynamic weights W_{t−7} are expected to play an important role in the modeling of the time series. (Note: for clarity of notation, when t < r, there are 0s in the weight vector W_t for the elements corresponding to times t < 1; for example, if r = 3 and t = 2, the first element of W_2 is 0.) The lag dependence can also be specified via cross-validation, allowing for structure to be learned from the time series. The lagged dependence enables series with longer-term structure to be modeled, as compared to the standard DLM or Kalman Filter, which utilize only a first-order Markov assumption in the latent space. The function f is the hyperbolic tangent, which helps prevent the values of Z_t from increasing without bound and also aids in constraining the model for identifiability. This specification extends the basic TVAR framework of [36] with ideas from attention-based models, such as [1], where the complexity of the model is captured in the state space, rather than the observed space, via dynamic weights. In the case of Stanza, these dynamic weights W_t can be directly interpreted in terms of structure in the observed time series.

Similar to deep models, Stanza specifies the model complexity in the latent space, rather than in the emission distribution, which is kept simple. This setup allows the flexibility to model either univariate or multivariate series {X_t} within the same framework, due to the simplicity of the emission distribution. This is in contrast to TVAR models, which would require matrix-normal distributions to model multivariate sequences. Additionally, in contrast to LSTMs, Stanza allows for probabilistic, interpretable inference, while still allowing for long-term dependencies, specified by r. An external latent sequence can optionally be added to the state space to serve as an external input signal, as often used in traditional Kalman Filters [6], or to be used as a multi-scale latent factor in a decouple-recouple setting [3, 9, 35].
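To make the generative process in Equation 1 concrete, the following is a minimal simulation sketch for a univariate series; the variance and emission values (q, lam, rvar, h) and the initialization are illustrative assumptions, not fitted or recommended settings.

import numpy as np

def simulate_stanza(T=500, r=7, q=0.1, lam=0.01, h=1.0, rvar=0.5, seed=0):
    """Sample a univariate series from the Stanza generative model (Eq. 1).
    Parameter values here are illustrative assumptions only."""
    rng = np.random.default_rng(seed)
    Z, X = np.zeros(T), np.zeros(T)
    W = rng.normal(0.0, 0.1, size=r)                       # W_0 ~ N(m_0, C_0)
    for t in range(T):
        W = W + rng.normal(0.0, np.sqrt(lam), size=r)      # W_t ~ N(W_{t-1}, lam I)
        pad = max(r - t, 0)                                # 0s for times before t = 1
        lags = np.concatenate([np.zeros(pad), Z[max(t - r, 0):t]])
        Z[t] = rng.normal(np.tanh(W @ lags), np.sqrt(q))   # nonlinear state update
        X[t] = rng.normal(h * Z[t], np.sqrt(rvar))         # simple emission
    return X, Z

X, Z = simulate_stanza()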
2.2 Inference

Filtering and smoothing of the latent space are critical to inference in DLMs and Kalman Filters, and inference in Stanza proceeds similarly. The Extended Kalman Filter [21] and Unscented Kalman Filter [8] extend the forward filtering algorithm of the standard Kalman Filter to the nonlinear setting, and we utilize both approaches in the filtering of parameters for Stanza. The Unscented Kalman Filter uses a deterministic sampling approach to approximate the mean and variance of the state variables and can capture the mean and variance of the filtered latent state variable out to the third order [8]. The Extended Kalman Filter uses a first-order Taylor approximation of the nonlinearity to update the traditional Kalman Filter [21]. Stanza is motivated as a competitive alternative to existing deep and state space models, so we want inference to be relatively fast and efficient; we thus avoid particle filters due to their computational cost. A comparison of the Unscented and Extended Kalman Filters is given in [13]. Smoothing results in posterior distributions p(Z_t | X_{1:T}) for each time t.

The inference procedure for Stanza is outlined in Algorithm 1 and mainly consists of nonlinear filtering and smoothing of the latent variables, followed by EM algorithm updates for the remaining parameters. First, we perform Unscented Filtering [8, 6] and Smoothing [29, 6] for the latent sequence {Z_t}. Specifically, we specify Sigma Points, a minimal set of sample points, for all lagged Z_t values for use in the Unscented Filter and Smoother, to increase the fidelity of the approximation. We approximate the covariance matrix of these lagged values with a diagonal matrix of the filtered variances at previous time points. Then, we perform Extended Kalman Filtering [21, 6] and regular RTS Smoothing [26] for the latent weight vectors W_t. There are no nonlinearities in the state equation for W_t, so the Unscented Kalman Filter results in degenerate cross-covariance matrices, as the Sigma Points used in inference expect nonlinearities in the state space and observation space. Thus, we use the Extended Kalman Filter combined with the TVAR inference algorithm from [36], with corresponding first-derivative corrections due to the nonlinearity.

We initialize these latent variables via a "warm start" approach, which ignores the nonlinearity in the state space. As filtering and smoothing depend on both Z_t and W_t, it is important to start with reasonable values of the latent variables for fast and accurate inference in the Stanza model. We first initialize the Z_t sequence by the traditional, linear Kalman Filter and RTS Smoother with randomly sampled, time-invariant parameters. Then, we sample a latent trajectory from the smoothed estimates to initialize the W_t sequence via the TVAR inference algorithm [36], which is similar and fully linear. This warm start gives reasonable trajectories of the state variables {Z_t} and {W_t} from which to start the Stanza inference procedure.

As it can be difficult to learn or to fully specify time-varying variances like {Q_t} and {Λ_t}, we use discount factors [35] to aid in the specification of these latent variances. For a discount factor δ between 0 and 1 (for most practical models, δ ∈ [0.9, 1)), we specify Q_t = ((1 − δ)/δ) V_t, where V_t is the variance of p(Z_t | X_{1:t}) from forward filtering [35]. The discount factor thus inflates the variance Q_t relative to the filtered uncertainty V_t and represents a decay of information over time. A similar discount factor setup can be specified for {Λ_t}. Lower discount factors allow for more volatility and adaptability of the filtered parameters (and hence forecast distributions), while higher discount factors result in much less inflation of the variance and less increase in the uncertainty from one time step to the next. Discount factors are thus directly interpretable in terms of the increase in uncertainty and adaptability of the parameters from one time step to the next, and can either be specified by domain knowledge or by cross-validation. Discount factors are very general and can be extended to many settings; for a full discussion, see [35].
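As a brief numerical illustration (with illustrative δ values): for δ = 0.95, Q_t = (0.05/0.95) V_t ≈ 0.053 V_t, so the one-step-ahead prior variance is V_t + Q_t = V_t/0.95 ≈ 1.05 V_t; for δ = 0.8, Q_t = 0.25 V_t and the prior variance inflates to 1.25 V_t, allowing noticeably more adaptability.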
We specify discount factors δ_Z and δ_W for {Q_t} and {Λ_t}, respectively.

Finally, we learn the parameters H and R via the EM algorithm, following [17], where these updates have closed forms and use the mean and variance of the smoothed {Z_t} from the Unscented Kalman Smoother. Alternatively, priors could be specified for these parameters, and H and R could be learned via Gibbs sampling. The simplicity of the emission distribution in Equation 1 allows for simple updates of H and R and enables equally simple inference for these parameters whether X_t is a univariate or multivariate sequence. The convergence criterion of Algorithm 1 can be based either on the log-likelihood of the model or on the change in one of the EM parameters, H or R. Inference in Stanza is relatively fast and efficient; a full timing comparison with related approaches is discussed in Section 4, and full inference details are given in the Supplementary Materials.

There are three hyper-parameters to specify for Stanza: r, the lagged dependence in the latent space; δ_Z, the discount factor for {Q_t}; and δ_W, the discount factor for {Λ_t}. All of these parameters are directly interpretable and can either be specified directly via domain knowledge or selected using cross-validation.
Algorithm 1: Stanza Inference
Result: Posterior distributions for Z_{1:T} and W_{1:T}; estimates of H, R, Q_{1:T}, Λ_{1:T}.
Specify r, δ_Z and δ_W;
Warm start: filter and smooth Z_{1:T} via the Kalman Filter and RTS Smoother; filter and smooth W_{1:T} via the TVAR inference algorithm;
while not converged do
    Filter Z_{1:T} via the Unscented Kalman Filter;
    Smooth Z_{1:T} via the Unscented Kalman Smoother;
    Filter W_{1:T} via the Extended Kalman Filter; update Q_t and Λ_t using discount factors;
    Smooth W_{1:T} via the RTS Smoother;
    Update H and R via the EM algorithm.
end

2.3 Forecasting

As in the basic DLM setting, forecasting in Stanza proceeds naturally via simulation. At a high level, forecasting via simulation proceeds similarly to the regular filtering procedure in Algorithm 1, but values Z_{t+1} and X_{t+1} are sampled from the one-step-ahead forecast distribution and used as the "observed" values in propagating the state variables forwards in time. Specifically, for one-step-ahead forecasting at time t, we first propagate the estimate for W_t through the state update equation via the Extended Kalman Filter, where the value for Z_t at time t is sampled from the prior distribution. Then, after completing the update from time t − 1 to time t for W_t, we can sample a value of W_t from the Extended Kalman Filter update, which is used to update the estimate for Z_t via the state update equation and the Unscented Kalman Filter. After propagating both state space sequences forwards in time, a value of X_t can easily be sampled from the measurement update equations in the Unscented Kalman Filter. This facilitates probabilistic forecasting of X_t and can be repeated to forecast k steps ahead in time [35]. It is also possible to propagate only the forecast mean, rather than the full forecast distribution, for computational efficiency.
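A simplified sketch of this simulation-based k-step-ahead forecasting loop follows; it propagates sampled trajectories directly through the state equations of Equation 1 rather than through the full Extended/Unscented Kalman updates, and all argument names are ours, not the paper's.

import numpy as np

def forecast_stanza(Z_hist, W_last, H, R, Q, Lam, k=10, n_sims=200, seed=0):
    """Simulation-based k-step-ahead forecasting (simplified sketch).
    Z_hist: filtered latent means (length >= r, oldest first); W_last:
    current weight estimate (length r). Illustrative, not the full filter."""
    rng = np.random.default_rng(seed)
    r = len(W_last)
    sims = np.zeros((n_sims, k))
    for s in range(n_sims):
        z = list(Z_hist[-r:])                               # most recent states
        W = np.array(W_last, dtype=float)
        for j in range(k):
            W = W + rng.normal(0.0, np.sqrt(Lam), size=r)   # weight random walk
            mean = np.tanh(W @ np.asarray(z[-r:]))          # nonlinear state mean
            z.append(rng.normal(mean, np.sqrt(Q)))          # sample next Z
            sims[s, j] = rng.normal(H * z[-1], np.sqrt(R))  # sample next X
    return sims

Quantiles of the simulated paths at each horizon then give the forecast credible intervals used for the coverage results in Section 4.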
3 Related Work

Long Short-Term Memory (LSTM) models [14] are widely used deep learning models for sequences or time series; they address several of the training issues associated with recurrent neural networks [11] and have been applied in many different settings. However, LSTMs can still struggle to capture long-term structure in sequences and time series [12], and more recently, attention-based approaches [33, 5] have achieved state-of-the-art results for sequence modeling. While these models are capable of capturing long-term structure in time series, they require large amounts of data to train and are not necessarily suited to settings where probabilistic, interpretable inference is important, or to the small to medium data regime. Autoregressive models such as [30] are likewise well suited to large amounts of data, but not necessarily to settings where interpretable, probabilistic inference is required.

Dynamic Linear Models (DLMs) [35], which include Kalman Filters [15], are linear models that apply probabilistic filtering and smoothing. Extensions include Dynamic Generalized Linear Models, which model {X_t} from exponential family distributions [35, 3], and models that include nonlinearities in the state space and observation equations [6, 8, 32]. Stanza extends ideas from TVAR models, in particular [36]. DLMs and various extensions are discussed in detail in [27] and in a Bayesian time series setting in [35].

Finally, there are several related approaches that address the range between full deep learning approaches and linear state space models, either by including probabilistic inference in deep learning approaches or by extending state space models to more complex settings. Deep learning focused models include Deep Kalman Filters [18], which use neural networks for the emission and transition distributions in the DLM specification; probabilistic recurrent state space models [7], which combine RNN training with Gaussian processes; and Gaussian process dynamical models [34] for motion capture data. Additional prior work for probabilistic time series focuses on different aspects of improving inference in complex models [2, 16, 19]. There is also recent work focused specifically on improving forecasting by using deep models to extend state space models, such as [25, 28].

Related approaches that extend state space models directly to nonlinear or more complex settings include [24], which applies the variational inference approach proposed in [32] to learn nonlinear state space models for model predictive control, rather than inference or forecasting, and [1], which uses a discrete state space and an attention mechanism learned via a sequence-to-sequence model to model disease progression.

Stanza, however, is distinct from these previously proposed approaches. Unlike [24, 25], Stanza is motivated by a desire to perform competitively in both inference and forecasting settings; it is designed to be a versatile model that performs well under a variety of circumstances. Additionally, rather than applying state space ideas to deep learning approaches [18, 7, 25, 2], Stanza incorporates ideas from deep learning into a state space framework. Like [1], Stanza creates dynamic dependencies in the latent space, but it uses dynamic weighting rather than an attention mechanism with a discrete state space. Additionally, Stanza specifies complexity in the latent space and adds nonlinearities to improve performance.
4 Experiments

Stanza is designed for competitive performance in inference and forecasting settings. We demonstrate Stanza's inference utility on two simulation experiments, both of which consider settings where Stanza is able to capture relevant features in the data while traditional DLMs are not sufficiently rich to do so. We also demonstrate Stanza's competitive forecasting accuracy on three real-world datasets.
4.1 Simulation Experiments

We first demonstrate the ability of Stanza to recover structure in time series. We consider a highly structured series (Figure 1a, top), with autocorrelation function (ACF) and partial autocorrelation function (PACF) plots in Figure 1b. There is clear seasonality out to lag 6, so we can specify r = 6 directly in the Stanza model. The resulting posterior mean weight matrix, W_t, is shown in the bottom of Figure 1a. The weight corresponding to time t − 6 is the largest across time, reflecting the importance of the seasonality evident in the original series. Additionally, the structure of the time series changes slightly between times 150-250 and 300-400, and this change is evident in decreased values of the W_{t−6} weights at these same times (Figure 1a, bottom). We also reproduce this structure when we simulate from the learned Stanza model; see the ACF/PACF plots in Figure 1b, right. Thus, for a highly structured time series, Stanza is able to recover key features of the structure and dependence of the series over time in the dynamic weight matrix W_t, which can be directly interpreted in the context of the application.

Figure 1: For a highly structured, periodic time series Y (a, upper) with strong seasonality, as evidenced in the ACF/PACF plots (b, left), Stanza is able to recover this structure in the W_t parameters; i.e., W_{t−6} is important and changes when the pattern in Y changes (a, lower). Additionally, when generating from the learned Stanza model, the generated series exhibits the same structure as the original series in terms of the ACF and PACF (b, right).

To demonstrate Stanza's ability to model complex, nonlinear time series, we generate data from a Lorenz attractor [22], following [10]. A Lorenz attractor is defined by a system of three differential equations and is an example of a chaotic system. We simulate 1000 time points from a Lorenz attractor with a time delta of 0.01 and consider the trajectory of the first variable, x. We then fit a Stanza(5) (i.e., r = 5) model with high discount factors of 0.99 for δ_W and δ_Z, indicating little volatility. The posterior predictive distribution for the learned Stanza model is shown in Figure 2a. There is excellent correspondence between the posterior predictive mean and the observed Lorenz attractor trajectory, as well as good coverage for multi-step ahead forecasting (Figure 2b).

Figure 2: (a) Posterior predictive distribution for a Stanza(5) model fit to the x trajectory of a Lorenz attractor. The posterior predictive mean, μ̃, agrees very well with the observed trajectory. (b) Forecast distribution coverage for k-step-ahead forecasts from the Stanza(5) model fit to the Lorenz attractor trajectory.

As Stanza performs probabilistic forecasting, we can calculate the coverage of the forecast distribution as the percentage of observed values that fall within specific percentiles of the forecast distribution. That is, we expect approximately 95% of the observed values to fall within the 95% credible intervals of the forecast distribution.
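A minimal sketch of this empirical coverage computation, assuming simulated forecast paths such as those produced by the forecasting sketch in Section 2.3 (function and argument names are ours):

import numpy as np

def empirical_coverage(y_obs, forecast_sims, level=0.95):
    """Fraction of observed values inside the central `level` forecast interval.
    y_obs: (T,) observed test values; forecast_sims: (n_sims, T) paths."""
    alpha = (1.0 - level) / 2.0
    lo = np.percentile(forecast_sims, 100 * alpha, axis=0)
    hi = np.percentile(forecast_sims, 100 * (1 - alpha), axis=0)
    return np.mean((y_obs >= lo) & (y_obs <= hi))   # ~0.95 if well calibrated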
4.2 Forecasting Experiments

We additionally demonstrate Stanza's competitive forecasting performance on three real-world datasets. Following [28, 20, 30], we compare Stanza's forecasting performance on the following publicly available datasets: Exchange Rate [20], the daily exchange rate for 8 different currencies from 1990-2011, and Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014), hourly electricity consumption for 370 customers (subset to the first 50 customers and the last 10000 time points). We additionally analyze the Weather dataset from [4, 31], which contains 13 different weather measurements; we subset the Weather dataset to the data from 10/01/2016 to 12/31/2016. For each forecasting experiment considered below, the last 1000 time points of each series are treated as test data and the preceding points as the training data. All series are individually normalized using the mean and standard deviation of the training set.

We fit several models independently to each series in each dataset and compare forecasting accuracy with the root-mean squared error (RMSE) for k = 1, k = 5 and k = 10 step-ahead forecasting, where k is the forecast horizon. We compare Stanza to two baseline models, a normal DLM with time-invariant parameters [35] and a TVAR model [36, 23], as well as two LSTMs, one with 2 hidden layers (LSTM(2)) and one with 5 hidden layers (LSTM(5)). For a direct comparison across series, we use cross-validation to tune the three hyper-parameters for Stanza (using one-step ahead forecasting RMSE as a metric) on each dataset and use the same hyper-parameters for all series for both the Stanza and TVAR models for consistency. As Exchange Rate is a daily dataset, the optimal value of r = 7 for Stanza represents weekly seasonality, and Stanza is able to capture relevant structure in the data. To train the LSTMs, we use a rolling window approach, where the previous 50 data points are used to predict the next 10. The LSTMs are trained with the Adam optimizer on MSE loss. Additionally, as all models except for the LSTMs make probabilistic forecasts, we compare the 95% forecast coverage of the baseline models to Stanza.

Figure 3: Coverage for k = 10 step-ahead forecasting on the (a) Exchange Rate and (b) Electricity datasets. Stanza achieves excellent coverage on both datasets.

The forecasting results for the individual series are given in Table 1. For all three datasets and across all three forecast horizons, Stanza achieves a good balance of high forecast accuracy and good empirical coverage (Figure 3). As compared to competing approaches, Stanza's advantages are the ability to model multivariate time series (unlike TVAR), to produce probabilistic forecasts and thus coverage estimates (unlike deep LSTMs), and to model non-stationary time series (unlike DLMs). The forecasting accuracy of Stanza does not decay over longer forecasting horizons and is competitive with the performance of the deep LSTMs across all forecast horizons. Additionally, LSTMs do not provide probabilistic forecast distributions and thus cannot be used to calculate coverage of the observed time series, while Stanza achieves excellent coverage (Figure 3), even compared to the DLM and TVAR models.
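The preprocessing and rolling-window setup described above can be sketched as follows; the stand-in series and helper names are illustrative assumptions.

import numpy as np

def make_windows(series, context=50, horizon=10):
    """Rolling-window pairs for the LSTM baselines: the previous `context`
    points are used to predict the next `horizon` points."""
    X, Y = [], []
    for i in range(len(series) - context - horizon + 1):
        X.append(series[i:i + context])
        Y.append(series[i + context:i + context + horizon])
    return np.array(X), np.array(Y)

# Train/test split and normalization by training statistics only:
series = np.random.default_rng(0).normal(size=5000)   # stand-in for one real series
train, test = series[:-1000], series[-1000:]
mu, sd = train.mean(), train.std()
train_n, test_n = (train - mu) / sd, (test - mu) / sd
X_train, Y_train = make_windows(train_n)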
Table 1: Forecasting results (RMSE and 95% coverage) on the three real-world datasets; k denotes the forecast horizon. The numbers in parentheses refer to the order of the model; Stanza(7) is a Stanza model where r = 7 and LSTM(2) is an LSTM with two layers.

                                     RMSE                                 95% Coverage
Dataset         Model        k = 1          k = 5          k = 10         k = 1   k = 5   k = 10
Exchange Rate   DLM          –.– ± –.34     0.– ± –.34     0.– ± –.34     98      98      98
                TVAR(7)      –.– ± –.02     0.– ± –.17     11.– ± –.32    94      96      98
                LSTM(2)      –.– ± –.22     0.– ± –.21     0.– ± –.–      —       —       —
                LSTM(5)      –.– ± –.23     0.– ± –.24     0.– ± –.–      —       —       —
                Stanza(7)    –.– ± –.22     0.– ± –.28     0.– ± –.37     100     100     97
Electricity     DLM          –.– ± –.24     1.– ± –.24     1.– ± –.24     93      93      93
                TVAR(5)      –.– ± –.12     1.– ± –.25     1.– ± –.26     93      92      94
                LSTM(2)      –.– ± –.24     0.– ± –.20     0.– ± –.–      —       —       —
                LSTM(5)      –.– ± –.18     0.– ± –.21     0.– ± –.–      —       —       —
                Stanza(5)    –.– ± –.25     1.– ± –.27     1.– ± –.27     93      90      90
Weather         DLM          –.– ± –.22     1.– ± –.22     1.– ± –.–      97      97      97
                TVAR(6)      –.– ± –.15     0.– ± –.17     0.– ± –.–      95      96      96
                LSTM(2)      –.– ± –.17     0.– ± –.15     0.– ± –.–      —       —       —
                LSTM(5)      –.– ± –.37     0.– ± –.34     0.– ± –.–      —       —       —
                Stanza(6)    –.– ± –.18     0.– ± –.19     0.– ± –.–      99      98      97
The TVAR model achieves very good forecasting accuracy across the datasets. However, the TVAR model is not designed for multivariate series, while Stanza is easily extended to multivariate X_t sequences. We fit Stanza to the full multivariate Exchange Rate and Weather datasets, due to their relatively low dimension. These results are given in Table 2. This extension to the full multivariate sequence further improves the forecasting ability of Stanza, especially compared to the TVAR models and the deep LSTMs in Table 1, on the Exchange Rate data.

Finally, we consider the computational efficiency of the Stanza model. For the Exchange Rate data, the average training time for Stanza is 41 seconds, while for the LSTM(5) model it is 222 seconds (averaged across series, run on a single core). While forecasting via simulation for Stanza is much slower than for the LSTM, this computational time could be reduced by propagating the mean only or via parallelization.

Table 2: Forecast performance of Stanza on multivariate sequences.

                                     RMSE
Dataset         Model        k = 1   k = 5   k = 10
Exchange Rate   Stanza(7)    –.–     –.–     –.–
Weather         Stanza(6)    –.–     –.–     –.–
5 Discussion

Stanza is motivated by the general goal of integrating modern machine learning approaches with traditional stochastic models. Specifically, we propose Stanza as a versatile, non-stationary state space model for competitive performance on inference and forecasting tasks. Stanza achieves forecasting accuracy competitive with deep LSTMs, while still maintaining probabilistic estimates of the state space and interpretability of parameters. Stanza is suited to settings where parameter interpretability is important for modeling complex time series, and it accommodates either univariate or multivariate modeling. There is a long history of simple machine learning models, like random forests, outperforming deep models on certain tasks, and Stanza fits into this paradigm. Future directions include extending these ideas further in the direction of deep learning models, for example by adding additional depth to the latent space of the proposed Stanza model, or extensions to higher-dimensional time series, for example with ideas from [30, 28]. Stanza is a step towards a powerful, general framework for incorporating deep learning ideas directly into traditional probabilistic models.

References

[1] A. M. Alaa and M. van der Schaar. Attentive state-space modeling of disease progression. In Advances in Neural Information Processing Systems 32, pages 11334–11344, 2019.
[2] E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski. Black box variational inference for state space models. CoRR, abs/1511.07367, 2015. URL https://arxiv.org/abs/1511.07367.
[3] L. R. Berry and M. West. Bayesian forecasting of many count-valued time series. Journal of Business and Economic Statistics, 2019.
[4] F. Chollet. Deep Learning with Python. Manning Publications, 2017.
[5] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems 28, pages 2980–2988, 2015.
[6] D. Simon. Optimal State Estimation. Wiley, 2006.
[7] A. Doerr, C. Daniel, M. Schiegg, D. Nguyen-Tuong, S. Schaal, M. Toussaint, and S. Trimpe. Probabilistic recurrent state-space models. CoRR, abs/1801.10395, 2018. URL http://arxiv.org/abs/1801.10395.
[8] E. A. Wan and R. van der Merwe. The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications and Control Symposium, pages 153–158, 2000.
[9] M. A. R. Ferreira, M. West, H. K. H. Lee, and D. M. Higdon. Multiscale and hidden resolution time series models. Bayesian Analysis, 2:294–314, 2006.
[10] V. Garcia Satorras, Z. Akata, and M. Welling. Combining generative and discriminative models for hybrid inference. In Advances in Neural Information Processing Systems 32, pages 13802–13812, 2019.
[11] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[12] A. Greaves-Tunnell and Z. Harchaoui. A statistical investigation of long memory in language and music. CoRR, abs/1904.03834, 2019. URL http://arxiv.org/abs/1904.03834.
[13] F. Gustafsson and G. Hendeby. Some relations between extended and unscented Kalman filters. IEEE Transactions on Signal Processing, 60(2):545–555, 2012.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[16] M. Karl, M. Sölch, J. Bayer, and P. van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. In International Conference on Learning Representations. OpenReview.net, 2017. URL https://openreview.net/forum?id=HyTqHL5xg.
[17] M. E. Khan and D. N. Dutt. An expectation-maximization algorithm based Kalman smoother approach for event-related desynchronization (ERD) estimation from EEG. IEEE Transactions on Biomedical Engineering, 54(7):1191–1198, July 2007.
[18] R. G. Krishnan, U. Shalit, and D. Sontag. Deep Kalman filters. CoRR, abs/1511.05121, 2015. URL http://arxiv.org/abs/1511.05121.
[19] R. G. Krishnan, U. Shalit, and D. Sontag. Structured inference networks for nonlinear state space models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 2101–2109. AAAI Press, 2017.
[20] G. Lai, W. Chang, Y. Yang, and H. Liu. Modeling long- and short-term temporal patterns with deep neural networks. CoRR, abs/1703.07015, 2017. URL http://arxiv.org/abs/1703.07015.
[21] L. Ljung. Asymptotic behavior of the extended Kalman filter as a parameter estimate for linear systems. IEEE Transactions on Automatic Control, 24(1):36–50, 1979.
[22] E. N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20:130–141, 1963.
[23] R. Prado and M. West. Time Series: Modeling, Computation and Inference. CRC Press, 2010.
[24] T. Raiko and M. Tornio. Variational Bayesian learning of nonlinear hidden state-space models for model predictive control. Neurocomputing, 72(16-18):3704–3712, 2009.
[25] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems 31, pages 7785–7794, 2018.
[26] H. E. Rauch, C. Striebel, and F. Tung. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445–1450, 1965.
[27] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[28] D. Salinas, M. Bohlke-Schneider, L. Callot, R. Medico, and J. Gasthaus. High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems 32, pages 6824–6834, 2019.
[29] S. Särkkä. Unscented Rauch-Tung-Striebel smoother. IEEE Transactions on Automatic Control, 53(3):845–849, 2008.
[30] R. Sen, H.-F. Yu, and I. Dhillon. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. CoRR, abs/1905.03806, 2019. URL https://arxiv.org/abs/1905.03806.
[31] TensorFlow Tutorials. Time series forecasting, May 2020.
[32] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11):2647–2692, 2002.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, 2017.
[34] J. Wang, A. Hertzmann, and D. J. Fleet. Gaussian process dynamical models. In Advances in Neural Information Processing Systems 18, pages 1441–1448, 2006.
[35] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1997.
[36] M. West, R. Prado, and A. D. Krystal. Evaluation and comparison of EEG traces: Latent structure in nonstationary time series. Journal of the American Statistical Association, 94(448):1083–1095, 1999.
A Inference
Let {X_t} be a univariate or multivariate time series of length T centered about 0, and {Z_t} a sequence of real-valued latent states. The Stanza model is specified in Equation 2:

Z_t ∼ N( f( W_t^T Z_{t−r:t−1} ), Q_t ),
X_t ∼ N( H Z_t, R ),
W_t ∼ N( W_{t−1}, Λ_t ),        (2)

where Z_{t−r:t−1} = [ Z_{t−r}, . . . , Z_{t−1} ]^T. Additionally, Z_0 ∼ N(μ_0, V_0) and W_0 ∼ N(m_0, C_0), where we specify μ_0 = 0, V_0 = 1, m_0 = 0 and C_0 = I_r, with I_r the identity matrix of dimension r. Additionally, in the TVAR model framework [36], we specify k = 1 and d = 0.– for the volatility for Z_t; see [36, 35] for details.

Filtering and smoothing of {Z_t} proceeds via the Unscented Kalman Filter [8] and Unscented Smoother [29, 6]. Inference for {W_t} proceeds via Extended Kalman Filtering [21, 6] and regular RTS Smoothing [26]. The EM updates for the parameters H and R are given in Equation 3 for univariate X_t and Equation 4 for multivariate X_t, where the expectations E_{Z|X} are obtained from the smoothing distributions p(Z_t | X_{1:T}) for each time t:

Ĥ_EM = ( Σ_{t=1}^T X_t E_{Z|X}(Z_t) ) / ( Σ_{t=1}^T E_{Z|X}(Z_t^2) ),
R̂_EM = (1/T) Σ_{t=1}^T ( X_t^2 − 2 Ĥ_EM X_t E_{Z|X}(Z_t) + Ĥ_EM^2 E_{Z|X}(Z_t^2) ).        (3)

Ĥ_EM = ( Σ_{t=1}^T X_t E_{Z|X}(Z_t) ) / ( Σ_{t=1}^T E_{Z|X}(Z_t^2) ),
R̂_EM = (1/T) Σ_{t=1}^T [ X_t X_t^T − 2 Ĥ_EM X_t^T E_{Z|X}(Z_t) + Ĥ_EM Ĥ_EM^T E_{Z|X}(Z_t^2) ].        (4)
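As a small illustration, the following is a direct implementation of the univariate updates in Equation 3, assuming the smoothed means and second moments from the Unscented Kalman Smoother are available (argument names are ours):

import numpy as np

def em_update_H_R(X, Ez, Ez2):
    """Closed-form EM updates for H and R (univariate X_t, Eq. 3).
    X: (T,) observations; Ez: (T,) smoothed means E_{Z|X}(Z_t);
    Ez2: (T,) smoothed second moments E_{Z|X}(Z_t^2)."""
    H = np.sum(X * Ez) / np.sum(Ez2)
    R = np.mean(X**2 - 2.0 * H * X * Ez + H**2 * Ez2)
    return H, R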
B Simulation Experiments

Additional details about the simulation experiments in the main paper are discussed in this section.
B.1 Kalman Filter Recovery
Stanza can also be fit to time series that follow dynamics simpler than those assumed in the Stanza model. We generate data from a regular Kalman Filter with time-invariant parameters and then fit a Stanza(3) model with discount factors δ_Z = δ_W = 0.–. The generated series {X_t} and the generating latent sequence are shown in Figure 4. A time-invariant Kalman Filter can be expressed as in Equation 5 [15, 6]:

Z_t ∼ N( F Z_{t−1}, Q ),
X_t ∼ N( H Z_t, R ).        (5)

The filtered (μ_F) and smoothed (μ_S) means from Stanza are shown in Figure 5 and correspond well to the generating latent sequence {Z_t}. Finally, the posterior predictive mean for {X_t} is shown in Figure 6 with credible intervals. There is good correspondence between the observed, generated sequence {X_t} and the Stanza model fit. Thus, Stanza is able to model dynamics simpler than those specified in the full model.

Figure 4: Latent ({Z_t}) and observed ({X_t}) sequences generated from a time-invariant Kalman Filter. The parameters for this Kalman Filter are F = 0.–, Q = 0.–, H = 0.– and R = 1.

Figure 5: Filtered (μ_F) and smoothed (μ_S) means from Stanza for the Kalman Filter generated data. The generating latent sequence is shown in black. The shaded region shows the 95% credible intervals using the smoothed mean and variance from the Unscented Kalman Filter.

Figure 6: True {X_t} sequence and posterior predictive mean from the Stanza(3) model on the Kalman Filter data. The shaded regions represent 95% credible intervals. There is good correspondence and coverage of the Stanza model fit.
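A sketch of generating such a sequence from Equation 5 follows; the values of F, Q and H below are illustrative assumptions (R = 1 as stated above):

import numpy as np

def simulate_kf(T=500, F=0.9, Q=0.1, H=0.5, R=1.0, seed=0):
    """Generate (Z, X) from the time-invariant Kalman Filter in Eq. 5.
    F, Q and H are illustrative values, not those used in the experiment."""
    rng = np.random.default_rng(seed)
    Z, X = np.zeros(T), np.zeros(T)
    z = 0.0
    for t in range(T):
        z = rng.normal(F * z, np.sqrt(Q))        # linear latent evolution
        Z[t] = z
        X[t] = rng.normal(H * z, np.sqrt(R))     # linear emission
    return Z, X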
B.2 Structure Recovery

A Stanza(6) model with δ_Z = δ_W = 0.– is fit to highly periodic, structured data. The learned model parameters are then used to generate new data, which corresponds well in terms of temporal structure to the original data (Figure 7).

Figure 7: Left, the original, highly periodic series plotted with ACF and PACF plots; there is clear seasonal structure, with period 6. Right, a time series generated by a Stanza(6) model fit to the series on the left. The structure in the ACF and PACF plots is recovered well by the Stanza(6) generated series.

B.3 Lorenz Attractor Details
To demonstrate Stanza's ability to model complex, nonlinear time series, we generate data from a Lorenz attractor [22], following [10]. A Lorenz attractor is defined by a system of three differential equations and is an example of a chaotic system. The Lorenz equations used to generate the data for this simulation are given in Equation 6. We simulate 1000 time points from a Lorenz attractor with a time delta of 0.01, and consider the trajectory of the first variable, x. We then fit a Stanza(5) model with high discount factors of 0.99 for δ_W and δ_Z, indicating little volatility.

∂x/∂t = 10(y − x),   ∂y/∂t = x(28 − z) − y,   ∂z/∂t = xy − (8/3)z        (6)
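A minimal sketch of generating this trajectory by forward-Euler integration of Equation 6 with dt = 0.01; the integration scheme and initial condition are our assumptions:

import numpy as np

def simulate_lorenz(T=1000, dt=0.01, x0=(1.0, 1.0, 1.0)):
    """Euler integration of the Lorenz system in Eq. 6; returns the x
    trajectory. The initial condition is an illustrative assumption."""
    xyz = np.array(x0, dtype=float)
    xs = np.zeros(T)
    for t in range(T):
        x, y, z = xyz
        xyz = xyz + dt * np.array([10.0 * (y - x),
                                   x * (28.0 - z) - y,
                                   x * y - (8.0 / 3.0) * z])
        xs[t] = xyz[0]
    return xs

x_traj = simulate_lorenz()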
C Forecasting Experiments
Additional details about the forecasting experiments considered in the main paper are included here.
C.1 Datasets
We demonstrate Stanza's competitive forecasting ability on three real-world, publicly available datasets, following [28, 20, 30]. Additional details about each dataset are given in Table 3. All subsetting is performed for computational efficiency only. The Electricity data is subset to the first 50 customers and the last 10000 time points. The Weather data is subset to the time range 10/01/2016 to 12/31/2016. For all datasets and for each forecasting experiment considered below, the last 1000 time points of each series are treated as test data and the preceding points as the training data. All series are individually normalized using the mean and standard deviation of the training set. For the multivariate modeling, the Weather dataset is subset to 9 series.

Table 3: Dataset details for the forecasting experiments.

Dataset              Frequency   Dimension   Number of Time Points
Exchange Rate [20]   Daily       8           7588
Electricity          Hourly      50          10000
Weather [4, 31]      10 mins     14          12804

C.2 Forecasting Experiment Hyper-Parameter Details
For the univariate forecasting experiment, the hyper-parameters for the Stanza models are tuned on an aggregate series for the Electricity and Exchange Rate datasets, and then the same hyper-parameters are used for modeling all the individual series for the Stanza and TVAR models. The hyper-parameters are selected based on which combination gives the best 1-step ahead forecasting RMSE, using a grid of values. The hyper-parameters for the Electricity data are r = 5, δ_Z = δ_W = 0.–, and for the Exchange Rate data are r = 7, δ_Z = δ_W = 0.–. For the Weather dataset, as an aggregate series does not make sense (each individual series measures a different physical quantity on a different scale), we manually select r = 6 (corresponding to hourly structure in the data) and δ_Z = δ_W = 0.–, allowing for a relatively large amount of volatility in the measurements.
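A skeleton of this grid search follows; `score` stands in for fitting Stanza with the given hyper-parameters and returning the one-step-ahead validation RMSE, and the grid values are illustrative assumptions:

import itertools

def tune_stanza(series, score, r_grid=(5, 6, 7), delta_grid=(0.9, 0.95, 0.99)):
    """Grid search over (r, delta_Z, delta_W) by one-step-ahead RMSE.
    `score(series, r, delta_z, delta_w)` is a stand-in for the paper's
    fit-and-forecast loop; grid values here are illustrative only."""
    best, best_rmse = None, float("inf")
    for r, dz, dw in itertools.product(r_grid, delta_grid, delta_grid):
        rmse = score(series, r, dz, dw)
        if rmse < best_rmse:
            best, best_rmse = (r, dz, dw), rmse
    return best, best_rmse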
C.3 Timing

Full timing details for training and forecasting on the Exchange Rate data are given in Table 4.

Table 4: Time in seconds for the training and forecasting of each model on a single core.
Dataset         Model       Training   Forecasting
Exchange Rate   DLM         1.9        37.4
                TVAR(7)     0.57       124.9
                LSTM(2)     214.8      1.4
                LSTM(5)     222.1      1.4
                Stanza(7)   41         –