Temporally-Reweighted Chinese Restaurant Process Mixtures for Clustering, Imputing, and Forecasting Multivariate Time Series
Feras A. Saad Vikash K. Mansinghka
Probabilistic Computing Project, Massachusetts Institute of Technology
Abstract
This article proposes a Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data. The method is appropriate for jointly modeling hundreds of time series with widely varying, non-stationary dynamics. Given a collection of N time series, the Bayesian model first partitions them into independent clusters using a Chinese restaurant process prior. Within a cluster, all time series are modeled jointly using a novel "temporally-reweighted" extension of the Chinese restaurant process mixture. Markov chain Monte Carlo techniques are used to obtain samples from the posterior distribution, which are then used to form predictive inferences. We apply the technique to challenging forecasting and imputation tasks using seasonal flu data from the US Center for Disease Control and Prevention, demonstrating superior forecasting accuracy and competitive imputation accuracy as compared to multiple widely used baselines. We further show that the model discovers interpretable clusters in datasets with hundreds of time series, using macroeconomic data from the Gapminder Foundation.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Multivariate time series data is ubiquitous, arising in domains such as macroeconomics, neuroscience, and public health. Unfortunately, forecasting, imputation, and clustering problems can be difficult to solve when there are tens or hundreds of time series. One challenge in these settings is that the data may reflect underlying processes with widely varying, non-stationary dynamics [13]. Another challenge is that standard parametric approaches such as state-space models and vector autoregression often become statistically and numerically unstable in high dimensions [20]. Models from these families further require users to perform significant custom modeling on a per-dataset basis, or to search over a large set of possible parameter settings and model configurations. In econometrics and finance, there is an increasing need for multivariate methods that exploit sparsity, are computationally efficient, and can accurately model hundreds of time series (see the introduction of [15], and references therein).

This paper presents a nonparametric Bayesian method for multivariate time series that aims to address some of the above challenges. The model is based on two extensions to Dirichlet process mixtures. First, we introduce a recurrent version of the Chinese restaurant process mixture to capture temporal dependences. Second, we add a hierarchical prior to discover groups of time series whose underlying dynamics are modeled jointly. Unlike autoregressive models, our approach is designed to interpolate in regimes where it has seen similar history before, and reverts to a broad prior in previously unseen regimes. This approach does not sacrifice predictive accuracy when there is sufficient signal to make a forecast or impute missing data. We apply the method to forecasting flu rates in 10 US regions using flu, weather, and Twitter data from the US Center for Disease Control and Prevention.
Quantitative results show that the method outperforms several Bayesian and non-Bayesian baselines, including Facebook Prophet, multi-output Gaussian processes, seasonal ARIMA, and the HDP-HMM. We also show competitive imputation accuracy with widely used statistical techniques. Finally, we apply the method to clustering hundreds of macroeconomic time series from Gapminder, detecting meaningful clusters of countries whose data exhibit coherent temporal patterns.

The temporally-reweighted Chinese restaurant process (TRCRP) mixture we introduce in Section 3 can be directly seen as a time series extension to a family of nonparametric Bayesian regression models for cross-sectional data [18, 35, 28, 22, 23]. These methods operate on an exchangeable data sequence {x_i} with exogenous covariates {y_i}; the prior CRP cluster probability p(z_i = k) for each observation x_i is reweighted based on y_i. Our method extends this idea to a time series {x_t}; the prior CRP cluster probability p(z_t = k) for x_t is now reweighted based on the p previous values x_{t-p:t-1}. Moreover, the hierarchical extension in Section 3.4 coincides with CrossCat [21] when all temporal dependencies are removed (by setting p = 0). Temporal extensions to the Dirichlet process have been previously used in the context of dynamic clustering [38, 1]. The latter work derives a recurrent CRP as the limit of a finite dynamic mixture model.
Unlike the method in this paper, those models are used for clustering batched data and dynamic topic modeling [7], rather than data analysis tasks such as forecasting or imputation in real-valued, multivariate time series. For multivariate time series, recent nonparametric Bayesian methods include using the dependent Dirichlet process for dynamic density estimation [31]; hierarchical DP priors over the state in hidden Markov models [HDP-HMM; 12, 19]; Pitman-Yor mixtures of non-linear state-space models for clustering [26]; and DP mixtures [8] and Polya trees [27] for modeling noise distributions. As nonparametric Bayesian extensions of state-space models, all of these approaches specify priors that fall under distinct model classes to the one developed in this paper. They typically encode parametric assumptions (such as linear autoregression and hidden-state transition matrices), or integrate explicit specifications of underlying temporal dynamics such as seasonality, trends, and time-varying functionals. Our method instead builds purely empirical models and uses simple infinite mixtures to detect patterns in the data, without relying on dataset-specific customizations. As a multivariate interpolator, the TRCRP mixture is best applied to time series where there is no structural theory of the temporal dynamics, and where there is sufficient statistical signal in the history of the time series to inform probable future values. To the best of our knowledge, this paper presents the first multivariate, nonparametric Bayesian model that provides strong baseline results without specifying custom dynamics on a problem-specific basis, and that has been benchmarked against multiple Bayesian and non-Bayesian techniques to cluster, impute, and forecast sparsely observed real-world time series data.

We first outline the notation and basic setup assumed throughout this paper. Let {x^n : n = 1, ..., N} denote a collection of N discrete-time series, where the first T variables of the nth time series are x^n_{1:T} = (x^n_1, x^n_2, ..., x^n_T). Slice notation is used to index subsequences of variables, so that x^n_{t1:t2} = (x^n_{t1}, ..., x^n_{t2}) for t1 < t2. Superscript n will often be omitted when discussing a single time series. The remainder of this section develops a generative process for the joint distribution of all random variables {x^n_t : t >= 1, n = 1, ..., N} in the N time series, which we proceed to describe in stages.

Our approach is based on a temporal extension of the standard Dirichlet process mixture (DPM), which we review briefly. First consider the standard DPM in the non-temporal setting [11], with concentration alpha and base measure pi_Theta. The joint distribution of a sequence of m exchangeable random variables (x_1, ..., x_m) is:

    P ~ DP(alpha, pi_Theta),    theta*_j | P ~ P,    x_j | theta*_j ~ F(. | theta*_j).

The DPM can be represented in terms of the Chinese restaurant process [2]. As P is almost-surely discrete, the m draws {theta*_j} ~ P contain repeated values, thereby inducing a clustering among the data x_j. Let lambda_F be the hyperparameters of pi_Theta, {theta_k} be the unique values among the {theta*_j}, and z_j denote the cluster assignment of x_j, which satisfies theta*_j = theta_{z_j}. Define n_{jk} to be the number of observations x_i with z_i = k for i < j. Using the conditional distribution of z_j given previous cluster assignments z_{1:j-1}, the joint distribution of the exchangeable data sequence (x_1, x_2, ...) in the CRP mixture model can be described sequentially:

    {theta_k} iid~ pi_Theta(. | lambda_F)
    Pr[z_j = k | z_{1:j-1}; alpha] proportional to:  n_{jk}  if 1 <= k <= max(z_{1:j-1});
                                                     alpha   if k = max(z_{1:j-1}) + 1        (1)
    x_j | z_j, {theta_k} ~ F(. | theta_{z_j})

The CRP mixture model (1), and algorithms for posterior inference, have been studied extensively for nonparametric modeling in a variety of statistical applications (for a survey see [37], and references therein).

[Figure 1 graphic: a chain of cluster assignments z_1, z_2, ... with observations x_t, per-cluster parameters theta_k (k = 1, 2, ...), and hyperparameters alpha, lambda_G, lambda_F.]

Figure 1:
Graphical model for the TRCRP mixture in a single time series x = (x_1, x_2, ...) with lagged window size p = 1.

Our objective is to define a CRP-like process for a non-exchangeable discrete-time series (x_1, x_2, ...), where there is now a temporal ordering and a temporal dependence among the variables. Instead of having (x_t, z_t) be conditionally independent of all other data given z_{1:t-1} as in the CRP mixture (1), we instead consider using previous observations x_{1:t-1} when simulating z_t. The main idea in our approach is to modify the CRP prior by having the cluster probability Pr[z_t = k | z_{1:t-1}] at step t additionally account for (i) the p most recent observations x_{t-p:t-1}, and (ii) the collection of lagged values D_{tk} := {x_{t'-p:t'-1} | z_{t'} = k, 1 <= t' < t} of earlier data points x_{t'} assigned to cluster k. The distribution of the time series (x_1, x_2, ...) in the temporally-reweighted CRP (TRCRP) mixture is therefore:

    {theta_k} iid~ pi_Theta(. | lambda_F)
    Pr[z_t = k | z_{1:t-1}, x_{t-p:t-1}; alpha, lambda_G] proportional to:
        n_{tk} G(x_{t-p:t-1}; D_{tk}, lambda_G)  if 1 <= k <= max(z_{1:t-1});
        alpha G(x_{t-p:t-1}; lambda_G)           if k = max(z_{1:t-1}) + 1                    (2)
    x_t | z_t, {theta_k} ~ F(. | theta_{z_t})

The main difference between the TRCRP mixture (2) and the standard CRP mixture (1) is the term G(x_{t-p:t-1}; D_{tk}, lambda_G), which acts as a non-negative "cohesion" function R^p -> R^+, parametrized by D_{tk} and a bundle of real values lambda_G. This term measures how well the current lagged values x_{t-p:t-1} match the collection of lagged values of earlier data D_{tk} in each cluster k, thereby introducing temporal dependence to the model. The smoothness of the process depends on the choice of the window size p: if t1 and t2 are close in time (relative to p) then they have overlapping lagged values x_{t1-p:t1-1} and x_{t2-p:t2-1}, so G increases the prior probability that {z_{t1} = z_{t2}}.
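The reweighted assignment step in (2) can be sketched in code. The following is an illustrative stand-in, not the authors' implementation: the cohesion here is a per-lag Gaussian posterior-predictive density rather than the collapsed Student-T of Eq. (4), and the names `cohesion` and `sample_regime` are ours.

```python
import math
import random

def cohesion(x_lagged, cluster_lags, m0=0.0, V=1.0, a=1.0, b=1.0):
    """Score how well the lagged vector x_lagged matches the lagged vectors
    previously assigned to a cluster (simplified Gaussian stand-in for G)."""
    score = 1.0
    for i, x in enumerate(x_lagged):
        xs = [lags[i] for lags in cluster_lags]
        n = len(xs)
        mean = sum(xs) / n if n else m0
        # Posterior-predictive-style mean and variance after observing xs.
        V_n = 1.0 / (1.0 / V + n)
        m_n = V_n * (m0 / V + n * mean)
        ss = sum((xj - mean) ** 2 for xj in xs)
        var = ((b + 0.5 * ss) / (a + n / 2.0)) * (1.0 + V_n)
        score *= math.exp(-0.5 * (x - m_n) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return score

def sample_regime(z_prev, lags_by_cluster, x_lagged, alpha=1.0, rng=random):
    """Sample z_t as in Eq. (2): CRP counts reweighted by the cohesion term."""
    K = max(z_prev, default=0)
    weights = [sum(1 for z in z_prev if z == k) *            # n_tk
               cohesion(x_lagged, lags_by_cluster.get(k, []))
               for k in range(1, K + 1)]
    weights.append(alpha * cohesion(x_lagged, []))           # new cluster
    return rng.choices(range(1, K + 2), weights=weights)[0]
```

Under this sketch, a history whose recent lags resemble cluster k's past lags multiplicatively boosts the count n_{tk}, which is exactly the reweighting role G plays in (2).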
More generally, any pair of time points t1 and t2 that share similar lagged values are a priori more likely to have similar distributions for generating x_{t1} and x_{t2}, because G increases the probability that {z_{t1} = z_{t2} = k}, so that x_{t1} and x_{t2} are both drawn from F(. | theta_k).

Figure 1 shows a graphical model for the TRCRP mixture (2) with window size p = 1. The model proceeds as follows: first assume the initial p observations (x_{-p+1}, ..., x_0) are fixed or have a known joint distribution. At step t, the generative process samples a cluster assignment z_t, whose probability of joining cluster k is a product of (i) the CRP probability for {z_t = k} given all previous cluster assignments z_{1:t-1}, and (ii) the "cohesion" term G(x_{t-p:t-1}; D_{tk}, lambda_G). In Figure 1, edges between the z_t's denote the CRP probabilities, while edges from x_{t-1} up to z_t represent reweighting the CRP by G. Cluster assignment z_t identifies the temporal regime that dictates the distribution of x_t ~ F(. | theta_{z_t}). Observe that if p = 0 or G is proportional to 1, then the model reduces to a standard CRP mixture (1) with no temporal dependence, since (z_t, x_t) are conditionally independent of the entire time series history x_{1:t-1} given z_{1:t-1}. Also note that the model is not Markovian, due to the infinite coupling among the latent z_t (compare to the recurrent switching linear dynamical system of [4]).

The data distribution F in (2) is a Normal distribution with Normal-InverseGamma prior pi_Theta:

    pi_Theta(mu_k, sigma_k | m, V, a, b) = Normal(mu_k | m, sigma_k V) InvGamma(sigma_k | a, b)
    F(x_t | mu_k, sigma_k) = Normal(x_t | mu_k, sigma_k),                                      (3)

where theta_k = (mu_k, sigma_k) are the per-cluster parameters of F, and lambda_F = (m, V, a, b) the hyperparameters of pi_Theta. Conjugacy of F and pi_Theta [5] implies that theta_k can be marginalized out of the generative model (2) (see Appendix B). As for G, it may in general be any non-negative weighting function which assigns a high value to lagged data vectors that are "similar" to one another. Previous approaches to Bayesian nonparametric regression constructed covariate-dependent probability measures using kernel-based reweighting [10]. Our method defines G as a product of p Student-T distributions whose location, scale, and degrees of freedom depend on the lagged data D_{tk} in cluster k:

    G(x_{t-p:t-1}; D_{tk}, lambda_G) = prod_{i=1}^{p} G_i(x_{t-i}; D_{tki}, lambda_{Gi})
                                     = prod_{i=1}^{p} T_{a_{tki}}(x_{t-i}; m_{tki}, b_{tki} V_{tki} / a_{tki}),    (4)

where hyperparameter lambda_{Gi} = (m_i, V_i, a_i, b_i) and data D_{tki} = {x_{t'-i} : z_{t'} = k, 1 <= t' < t}. Equations for the data-dependent terms (m_{tki}, V_{tki}, a_{tki}, b_{tki}) are given in Appendix A. We emphasize that G itself is used for reweighting only; it does not define a probability distribution over lagged data. Mathematically, G attracts x_t towards a cluster k that assigns x_{t-p:t-1} a high density, under the posterior predictive of an axis-aligned Gaussian having observed D_{tk} [24].

1. Sample concentration parameter of CRP: alpha ~ Gamma(1, 1)
2. Sample model hyperparameters (n = 1, 2, ..., N): lambda^n_G ~ H^n_G; lambda^n_F ~ H^n_F
3. Sample distribution parameters of F (n = 1, 2, ..., N): theta^n_1, theta^n_2, ... iid~ pi_Theta(. | lambda^n_F)
4. Assume the first p values are known (n = 1, 2, ..., N): x^n_{-p+1:0} := (x^n_{-p+1}, ..., x^n_0)
5. Sample time series observations (t = 1, 2, ...):
   5.1 Sample temporal cluster assignment z_t:
       Pr[z_t = k | z_{1:t-1}, x^{1:N}_{t-p:t-1}, alpha, lambda^{1:N}_G]
           proportional to CRP(k | alpha, z_{1:t-1}) prod_{n=1}^{N} G(x^n_{t-p:t-1}; D^n_{tk}, lambda^n_G),
       where D^n_{tk} := {x^n_{t'-p:t'-1} | z_{t'} = k, 1 <= t' < t} and k = 1, ..., max(z_{1:t-1}) + 1
   5.2 Sample data x^n_t (n = 1, 2, ..., N): x^n_t | z_t, {theta^n_k} ~ F(. | theta^n_{z_t})

(a) Generative process for the multivariate TRCRP mixture

[(b) Discovering flu season dynamics with the method: panels showing incidence of flu (% population), messages about flu on Twitter (1000s), and minimum temperature (degrees F), segmented into regimes z_t = 1, ..., 6.]

Figure 2: (a) Generative model describing the joint distribution of N dependent time series {x^n} in the multivariate temporally-reweighted CRP mixture. Lagged values for all time series are used for reweighting the CRP by G in step 5.1. Dependencies between time series are mediated by the shared temporal regime assignment z_t, which ensures that all the time series have the same segmentation of the time course into the different temporal regimes. (b) Applying the TRCRP mixture with p = 10 weeks to model x^flu, x^tweet, and x^temp in US Region 4. Six regimes describing the seasonal behavior shared among the three time series are detected in this posterior sample. Purple, gray, and red are the pre-peak rise, peak, and post-peak decline during the flu season; and yellow, brown, and green represent the rebound in between successive seasons. In 2012, the model reports no red post-peak regime, reflecting the season's mild flu peak. See Section 5 for quantitative experiments.

This section generalizes the univariate TRCRP mixture (2) to handle a collection of N time series {x^n : n = 1, ..., N}, assumed for now to all be dependent. At time t, we let the temporal regime assignment z_t be shared among all the time series, and use lagged values of all N time series when reweighting the CRP probabilities by the cohesion term G. Figure 2a contains a step-by-step description of the multivariate TRCRP mixture, with an illustrative application in Figure 2b. It is informative to consider how z_t mediates dependences between x^{1:N}. First, the model requires all time series to be in the same regime z_t at time t. However, each time series has its own set of per-cluster parameters {theta^n_k}.
Therefore, all the time series share the same segmentation z_{1:T} of the time course into the various temporal regimes, even though the parametric distributions F(. | theta^n_k), n = 1, ..., N, within each temporal regime k differ. Second, the model makes the "naive Bayes" assumption that the data {x^n_t}_{n=1}^{N} at time t are independent given z_t, and that the reweighting term G in step 5.1 factors as a product. This characteristic is essential for numerical stability of the method in high dimensional and sparse regimes, while still maintaining the ability to recover complex distributions due to the infinite CRP mixture.

The TRCRP mixture in Figure 2a makes the restrictive assumption that all time series x^{1:N} are dependent with one another. However, with dozens or hundreds of time series whose temporal regimes are not well-aligned, forcing a single segmentation sequence z_{1:T} to apply to all N time series will result in a poor fit to the data. We relax this assumption by introducing a hierarchical prior that allows the model to determine which subsets of the N time series are probably well-described by a joint TRCRP model. The prior induces sparsity in the dependencies between the N time series by first nonparametrically partitioning them using an "outer" CRP. Within a cluster, all time series are modeled jointly using the multivariate TRCRP mixture described in Figure 2a:

    (c_1, c_2, ..., c_N) ~ CRP(. | alpha_0)                                                    (5)
    {x^n : c_n = k} ~ TRCRP Mixture,    k = 1, ..., max(c_{1:N}),

where c_n is the cluster assignment of x^n. Figure 3 shows an example of this structure learning prior applied to five EEG time series. In the second cluster of panel (c), the final yellow segment illustrates two time series sharing the latent regime at each time step, but having different distributions within each regime.

[Figure 3 panels: (a) original N = 5 time series; (b) CRP over time series clusters, with assignment vector (c_1, ..., c_5) ~ CRP and some assignments shared; (c) TRCRP mixture within each cluster.]

Figure 3: Hierarchical prior for learning the dependence structure between multiple time series. Given N EEG time series, we first nonparametrically partition them by sampling an assignment vector c_{1:N} from an "outer" CRP. Time series assigned to the same cluster are jointly generated using the TRCRP mixture. Colored segments of each curve indicate the hidden states at each time step (the shared latent variables within the cluster).

In this section, we give the full model likelihood and briefly describe MCMC algorithms for inference in the hierarchical TRCRP mixture (5). Since the model learns M = max(c_{1:N}) separate TRCRP mixtures (one for each time series cluster), we superscript the latent variables of Figure 2a by m = 1, ..., M. Namely, alpha_m is the CRP concentration, and z^m_{1:T} the latent regime vector, shared by all time series in cluster m.
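The "outer" CRP partition in (5) can be sketched directly from its sequential (table-seating) representation. This is an illustrative sketch, with `crp_partition` a name of our choosing:

```python
import random

def crp_partition(n_series, alpha=1.0, seed=42):
    """Sequentially sample (c_1, ..., c_N) ~ CRP(alpha): each series joins an
    existing cluster with probability proportional to its size, or opens a
    new cluster with probability proportional to alpha."""
    rng = random.Random(seed)
    counts = []        # counts[k-1] = number of series assigned to cluster k
    assignments = []
    for _ in range(n_series):
        weights = counts + [alpha]
        k = rng.choices(range(1, len(weights) + 1), weights=weights)[0]
        if k > len(counts):
            counts.append(1)       # opened a new cluster
        else:
            counts[k - 1] += 1
        assignments.append(k)
    return assignments

# Series sharing a cluster would then be modeled jointly by one TRCRP mixture:
c = crp_partition(5)
clusters = {k: [n + 1 for n, ck in enumerate(c) if ck == k] for k in sorted(set(c))}
```

Each cluster produced this way gets its own concentration alpha_m and regime vector z^m_{1:T}, as described above.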
Further, let K_m = max(z^m_{1:T}) denote the number of unique regimes in z^m_{1:T}. Given window size p and initial observations {x^n_{-p+1:0} : n = 1, ..., N}, we have:

    P(alpha_0, c_{1:N}, alpha_{1:M}, lambda^{1:N}_G, lambda^{1:N}_F, {theta^n_j : 1 <= j <= K_{c_n}}_{n=1}^{N}, z^{1:M}_{1:T}, x^{1:N}_{1:T}; x^{1:N}_{-p+1:0}, p)
    = Gamma(alpha_0; 1, 1) CRP(c_{1:N} | alpha_0)
      (prod_{n=1}^{N} H^n_G(lambda^n_G)) (prod_{n=1}^{N} H^n_F(lambda^n_F)) (prod_{n=1}^{N} prod_{j=1}^{K_{c_n}} pi^n_Theta(theta^n_j))
      x prod_{m=1}^{M} ( Gamma(alpha_m; 1, 1) prod_{t=1}^{T} (1 / b^m_t) [ CRP(z^m_t | z^m_{1:t-1}, alpha_m)
          prod_{n : c_n = m} G(x^n_{t-p:t-1}; D^n_{t,z^m_t}, lambda^n_G) F(x^n_t | theta^n_{z^m_t}) ] ),            (6)

where b^m_t normalizes the term between the square brackets, summed over z^m_t = 1, ..., max(z^m_{1:t-1}) + 1. Eq. (6) defines the unnormalized posterior distribution of all latent variables given the data. Appendix B contains detailed algorithms for posterior inference. Briefly, temporal regime assignments (z^m_t | z^m_{1:T} \ z^m_t, ...) are sampled using a variant of Algorithm 3 from [25], taking care to handle the temporal-coupling term b^m_t, which is not found in traditional DPM samplers. We also outline an alternative particle-learning scheme [9] to sample (z^m_{1:T} | ...) jointly as a block. Time series cluster assignments (c_n | c_{1:N} \ c_n, ...) are transitioned by proposing to move x^n to either an existing or a new cluster, and computing the appropriate MH acceptance ratio for each case. Model hyperparameters are sampled using an empirical Bayes approach [30] and the "griddy Gibbs" [29] sampler.

Given a collection of approximate posterior samples {xi_1, ..., xi_S} of all latent variables produced by S independent runs of MCMC, we can draw a variety of predictive inferences about the time series x^{1:N}, which form the basis of the applications in Section 5.

[Figure 4 panels. Top: all 170 GDP per capita time series from 1960 to 2010 in the Gapminder dataset. Below, nine representative clusters: GDP cluster 1 (USA, Canada, France, Italy, Japan); GDP cluster 2 (China, Bangladesh, Nepal, India, Vietnam); GDP cluster 3 (Russia, Romania, Serbia, Ukraine); GDP cluster 4 (Libya, Togo, Cote d'Ivoire, Gambia); GDP cluster 5 (Brazil, Ecuador, Honduras, Algeria); GDP cluster 6 (Niger, Madagascar, Central African Rep.); GDP cluster 7 (Poland, Slovenia, Slovakia, Belarus); GDP cluster 8 (Equatorial Guinea, Samoa); GDP cluster 9 (North Korea).]

Figure 4: Given GDP per capita data for 170 countries from 1960-2010, the hierarchical TRCRP mixture (5) detects qualitatively distinct temporal patterns. The top panel shows an overlay of all the time series; nine representative clusters averaged over 60 posterior samples are shown below. Countries within each cluster, of which a subset are labeled, share similar political, economic, and/or geographic characteristics. For instance, cluster 1 contains Western democracies with stable economic growth over 50 years (the slight dip in 2008 is the financial crash). Cluster 2 includes China and India, whose GDP growth rates have outpaced those of industrialized nations since the 1990s. Cluster 3 contains former communist nations, whose economies tanked after the fall of the Soviet Union. Outliers such as Samoa, Equatorial Guinea, and North Korea can be seen in clusters 8 and 9.

Forecasting
For out-of-sample time points, a forecast over an h-step horizon T < t <= T + h is generated by ancestral sampling: first draw a chain s ~ Uniform[1 ... S], then simulate step 5 of Figure 2a using the latent variables in chain xi_s for t = T + 1, ..., T + h.

Clustering
For a pair of time series (x^i, x^k), the posterior probability that they are dependent is the fraction of samples in which they are in the same cluster:

    P[c_i = c_k | x^{1:N}] approx (1/S) sum_{s=1}^{S} I[c_{i,s} = c_{k,s}].        (7)

Imputation
Posterior inference yields samples of each temporal regime z^{.,s}_t for all in-sample time points 1 <= t <= T; the posterior distribution of a missing value is:

    P[x^n_t in B | x^{1:N} \ {x^n_t}] approx (1/S) sum_{s=1}^{S} F(B | theta^{n,s} at regime z^{c_n,s}_t).        (8)

In this section, we apply the TRCRP mixture to clustering hundreds of time series using macroeconomic data from the Gapminder Foundation, as well as imputation and forecasting tasks on seasonal flu data from the US Center for Disease Control and Prevention (CDC). We describe the setup in the text below, with further commentary given in Figures 4, 5, and 6. Experimental methods are detailed in Appendix C. An implementation of the hierarchical TRCRP mixture is available at https://github.com/probcomp/trcrpm .

We first applied the TRCRP mixture with hierarchical prior to cluster countries in the Gapminder dataset, which contains dozens of macroeconomic time series for 170 countries spanning 50 years. Because fluctuations due to events such as natural disasters, financial crises, or healthcare epidemics are poorly described by parametric or hand-designed causal models, a key objective is to automatically discover the number and kinds of patterns underlying the temporal structure. Figure 4 shows the outcome of structure discovery in GDP time series using the model with p = 5 years. Several common-sense, qualitatively distinct clusters are detected. Note that countries within each cluster share similar political, economic, and/or geographic characteristics; see the caption for additional details. Appendix C.5 gives an expanded set of clusterings showing changepoint detection in cell phone subscription time series, and compares to a baseline using k-medoids clustering.

Predicting flu rates is a fundamental objective in public health policy. The CDC has an extensive dataset of flu rates and associated time series such as weather and vaccinations. Measurements are taken weekly from January 1998 to June 2015. Figure 2b shows the influenza-like-illness rate (ILI, or flu), tweets, and minimum temperature time series in US Region 4, as well as six temporal regimes detected by one posterior sample of the TRCRP mixture model (p = 10 weeks). We first investigated the performance of the proposed model on a multivariate imputation task. Windows of length 10 were dropped at a rate of 5% from flu series in US Regions 1-10.

[Figure 5 panels: (a) four representative flu time series (%ILI) imputed jointly; (b) example imputations in R09, comparing linear interpolation and TRCRP imputation of R09 %ILI in year 2013, each showing true data and imputations; (c) table of mean absolute imputation errors (with standard errors) in ten United States flu regions R01-R10, for Mean Imputation, Linear Interpolation, Cubic Splines, Multi-output GP, Amelia II, and the TRCRP Mixture (numeric entries not recoverable).]

Figure 5: Jointly imputing missing data in ten flu populations over eight seasons. (a) Imputations and standard errors in four of the time series. The TRCRP mixture accurately captures both seasonal behavior as well as non-recurrent characteristics, such as the very mild flu season in 2012. (c) Comparing imputation quality with several baseline methods. The TRCRP mixture (p = 10 weeks) achieves comparable performance to Amelia II. Cubic splines are completely ineffective due to long sequences without any observations. (b) While linear interpolation may seem to be a good performer given its simplicity and mean errors, unlike the TRCRP it cannot predict non-linear behavior when an entire flu season is unobserved, and it entirely misses seasonality.
The top panel of Figure 5a shows flu time series for US Regions 2, 4, 7, and 9, as well as joint imputations (and two standard deviations) obtained from the TRCRP mixture using (8). Quantitative comparisons of imputation accuracy to baselines are given in Table 5c. In this application, the TRCRP mixture achieves comparable accuracy to the widely used Amelia II [16] baseline, although neither method is uniformly more accurate. A sensitivity analysis showing imputation performance with varying p is given in Appendix C.3.

To quantitatively investigate the forecasting abilities of the model, we next held out the 2015 season for 10 US regions and generated forecasts on a rolling basis. Namely, for each week t in the held-out season we forecast x^flu_{t:t+h} given x^flu_{1:t-1} and all available covariate data up to time t, with horizon h = 10. A key challenge is that when forecasting x^flu_{t:t+h}, the most recent flu measurement is two weeks old (x^flu_{t-2}). Moreover, covariate time series are themselves sparsely observed in the training data (for instance, all Twitter data is missing before June 2013; top panel of Figure 2b).

[Figure 6 panels. Top rows: forecasts of R06 %ILI for week 2014.51 and week 2015.10, showing observed data, held-out data, mean forecast, and 95% forecast interval for a Gaussian Process, Facebook Prophet, Seasonal ARIMA, HDP-HSMM, Univariate TRCRP, and Multivariate TRCRP. Bottom: table of mean absolute flu prediction error for 10 forecast horizons h = 1, ..., 10 (in weeks), averaged over 10 United States flu regions, for Linear Extrapolation (flu only), GP(SE+PER+WN) (flu only), GP(SE x PER+WN) (flu only), Facebook Prophet (flu only), Seasonal ARIMA (flu only), TRCRP Mixture (flu only), HDP-HSMM (flu+weather), Multi-output GP (flu+weather+tweets), and TRCRP Mixture (flu+weather+tweets); numeric entries not recoverable.]

Figure 6:
Quantitative evaluation of forecasting performance on the 2015 flu season. The table shows mean prediction errors (and one standard error) of the flu rate, for various forecast horizons, averaged over US Regions 1-10. Available covariate time series include minimum temperature and Twitter messages about the flu (not shown; see Figure 2b). The predictive improvement of the multivariate TRCRP mixture over baselines is especially apparent at longer horizons. The top two panels show sample forecasts in US Region 6 for week 2014.51 (pre-peak) and week 2015.10 (post-peak). The TRCRP mixture accurately forecasts seasonal dynamics in both cases, whereas baseline methods produce inaccurate forecasts and/or miscalibrated uncertainties.

Figure 6 shows the forecasting accuracy of several widely-used, domain-general baselines that do not require detailed custom modeling for obtaining forecasts, and that have varying ability to make use of covariate data (weather and tweet signals). The TRCRP mixture consistently produces the most accurate forecasts for all horizons (last row). Methods such as seasonal ARIMA [17] can handle covariate data in principle, but cannot handle missing covariates in the training set or over the course of the forecast horizon. Both Facebook Prophet [36] and ARIMA incorrectly forecast the peak behavior (Figure 6, top row), and are biased in the post-peak regime (bottom row). The HDP-HSMM [19] also accounts for weather data, but fails to detect flu peaks. The univariate TRCRP (modeling only the flu) performs similarly to periodic Gaussian processes, although the latter give wider posterior error bars, even in the relatively noiseless post-peak regime. The multi-output GP [3] uses both weather and tweet covariates, but they do not result in an improvement in predictive accuracy over univariate methods.
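The cluster-based inferences reported in these experiments rest on the Monte Carlo estimator of Eq. (7), which reduces to counting co-assignments across posterior samples. A minimal sketch, with hypothetical sampled assignment vectors standing in for real MCMC output:

```python
def dependence_probability(samples, i, k):
    """Estimate P[c_i = c_k | x] as the fraction of posterior samples (Eq. 7)
    in which time series i and k occupy the same cluster."""
    return sum(c[i] == c[k] for c in samples) / len(samples)

# Hypothetical cluster-assignment samples from S = 4 MCMC chains, N = 4 series.
samples = [
    [1, 1, 2, 2],
    [1, 1, 1, 2],
    [1, 1, 2, 3],
    [1, 2, 2, 2],
]
print(dependence_probability(samples, 0, 1))  # 0.75
print(dependence_probability(samples, 2, 3))  # 0.5
```

The same pattern of averaging a per-sample quantity over chains underlies the imputation estimator (8) and the ancestral-sampling forecasts.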
This paper has presented the temporally-reweighted CRP mixture, a domain-general nonparametric Bayesian method for multivariate time series. Experiments show strong quantitative and qualitative results on multiple real-world multivariate data analysis tasks, using little to no custom modeling. For certain application domains, however, predictive performance may improve by extending the model to include custom knowledge such as time-varying functionals. Further avenues for research include guidelines for selecting the window size; greater empirical validation; a stick-breaking representation; improving inference scalability; and establishing theoretical conditions for posterior consistency. It could also be fruitful to integrate this method into a probabilistic programming platform [33], such as BayesDB. This integration would make it easy to query mutual information between time series [32], identify data that is unlikely under the model, and make the method accessible to a broader audience.
Acknowledgments
This research was supported by the DARPA PPAML program, contract number FA8750-14-2-0004. The authors wish to thank Max Orhai from Galois, Inc. for assembling the CDC flu dataset.
References

[1] A. Ahmed and E. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 219–230. SIAM, 2008.
[2] D. J. Aldous. Exchangeability and related topics. In P. L. Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIII, pages 1–198. Springer, 1985.
[3] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 21, pages 57–64. Curran Associates, Inc., 2009.
[4] D. Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7(Nov):2515–2540, 2006.
[5] J. Bernardo and A. Smith. Bayesian Theory. Wiley Series in Probability & Statistics. Wiley, 1994. ISBN 9780471924166.
[6] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In Workshop on Knowledge Discovery in Databases, AAAIWS-94, pages 359–370. AAAI Press, 1994.
[7] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120. ACM, 2006.
[8] F. Caron, M. Davy, A. Doucet, E. Duflos, and P. Vanheeghe. Bayesian inference for linear dynamic models with Dirichlet process mixtures. IEEE Transactions on Signal Processing, 56(1):71–84, 2008.
[9] C. M. Carvalho, M. S. Johannes, H. F. Lopes, and N. G. Polson. Particle learning and smoothing. Statistical Science, 25(1):88–106, 2010.
[10] D. B. Dunson, N. Pillai, and J.-H. Park. Bayesian density regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):163–183, 2007.
[11] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
[12] E. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Nonparametric Bayesian learning of switching linear dynamical systems. In Advances in Neural Information Processing Systems 21, pages 457–464. Curran Associates, Inc., 2009.
[13] B. D. Fulcher and N. S. Jones. Highly comparative feature-based time-series classification. IEEE Transactions on Knowledge and Data Engineering, 26(12):3026–3037, 2014.
[14] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
[15] L. F. Gruber and M. West. Bayesian online variable selection and scalable multivariate volatility forecasting in simultaneous graphical dynamic linear models. Econometrics and Statistics, 3(C):3–22, 2017.
[16] J. Honaker, G. King, and M. Blackwell. Amelia II: A program for missing data. Journal of Statistical Software, 45(7):1–47, 2011.
[17] R. Hyndman and Y. Khandakar. Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3):1–22, 2008. ISSN 1548-7660.
[18] H. Ishwaran and L. F. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13(4):1211–1235, 2003.
[19] M. J. Johnson and A. S. Willsky. Bayesian nonparametric hidden semi-Markov models. Journal of Machine Learning Research, 14(Feb):673–701, 2013.
[20] G. M. Koop. Forecasting with medium and large Bayesian VARs. Journal of Applied Econometrics, 28(2):177–203, 2013.
[21] V. Mansinghka, P. Shafto, E. Jonas, C. Petschulat, M. Gasner, and J. B. Tenenbaum. CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high dimensional data. Journal of Machine Learning Research, 17(138):1–49, 2016.
[22] P. Mueller and F. Quintana. Random partition models with regression on covariates. Journal of Statistical Planning and Inference, 140(10):2801–2808, 2010.
[23] P. Mueller, F. Quintana, and G. L. Rosner. A product partition model with regression on covariates. Journal of Computational and Graphical Statistics, 20(1):260–278, 2011.
[24] K. P. Murphy. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, University of British Columbia, 2007.
[25] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[26] L. E. Nieto-Barajas and A. Contreras-Cristan. A Bayesian nonparametric approach for time series clustering. Bayesian Analysis, 9(1):147–170, 2014.
[27] L. E. Nieto-Barajas and F. A. Quintana. A Bayesian non-parametric dynamic AR model for multiple time series analysis. Journal of Time Series Analysis, 37(5):675–689, 2016.
[28] J.-H. Park and D. B. Dunson. Bayesian generalized product partition model. Statistica Sinica, 20(3):1203–1226, 2010.
[29] C. Ritter and M. Tanner. The griddy Gibbs sampler. Technical Report 878, University of Wisconsin-Madison, 1991.
[30] H. Robbins. The empirical Bayes approach to statistical decision problems. The Annals of Mathematical Statistics, 35(1):1–20, 1964.
[31] A. Rodriguez and E. ter Horst. Bayesian dynamic density estimation. Bayesian Analysis, 3(2):339–365, 2008.
[32] F. Saad and V. Mansinghka. Detecting dependencies in sparse, multivariate databases using probabilistic programming and non-parametric Bayes. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 632–641. PMLR, 2017.
[33] F. Saad and V. K. Mansinghka. A probabilistic programming approach to probabilistic data analysis. In Advances in Neural Information Processing Systems 29, pages 2011–2019. Curran Associates, Inc., 2016.
[34] U. Schaechtle, B. Zinberg, A. Radul, K. Stathis, and V. Mansinghka. Probabilistic programming with Gaussian process memoization. arXiv preprint, arXiv:1512.05665, 2015.
[35] B. Shahbaba and R. Neal. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research, 10(Aug):1829–1850, 2009.
[36] S. J. Taylor and B. Letham. Forecasting at scale. The American Statistician, to appear, 2017.
[37] Y. W. Teh. Dirichlet process. In Encyclopedia of Machine Learning, pages 280–287. Springer, 2011.
[38] X. Zhu, Z. Ghahramani, and J. Lafferty. Time-sensitive Dirichlet process mixture models. Technical Report CMU-CALD-05-104, Carnegie Mellon University, School of Computer Science, 2005.
Appendices
A Data-dependent parameters for the Student-T reweighting function
Following (4), the reweighting function G is a product of p Student-T distributions whose location, scale, and degrees of freedom are data-dependent [24]:

G(x_{t-p:t-1}; D_{tk}, λ_G) = ∏_{i=1}^{p} G_i(x_{t-i}; D_{tki}, λ_{Gi}) = ∏_{i=1}^{p} T_{2a_{tki}}( x_{t-i}; m_{tki}, b_{tki}(V_{tki} + 1)/a_{tki} ),    (9)

where λ_{Gi} = (m_i, V_i, a_i, b_i) and D_{tki} = { x_{t'-i} : z_{t'} = k, 1 ≤ t' < t }, with:

n_{tki} = |D_{tki}|,    x̄_{tki} = (1/n_{tki}) ∑_{t' ∈ D_{tki}} x_{t'-i},    (10)
V_{tki} = 1 / (V_i^{-1} + n_{tki}),
m_{tki} = V_{tki} (V_i^{-1} m_i + n_{tki} x̄_{tki}),
a_{tki} = a_i + n_{tki}/2,
b_{tki} = b_i + (1/2) ( m_i^2 V_i^{-1} + ∑_{t'} x_{t'-i}^2 − m_{tki}^2 V_{tki}^{-1} ).

B Markov chain Monte Carlo methods for posterior inference
Here, we provide the details of the MCMC methods for posterior simulation from the nonparametric mixture model developed in Section 3. As discussed in the main text, conjugacy of F and π_Θ in (3) means we can analytically marginalize the parameters {θ_{nk}} when defining the generative process of the TRCRP mixture. The model in Figure 2a therefore becomes:

α ∼ Gamma(1, 1)    (11)
λ_G^n ∼ H_G^n    (n = 1, 2, ..., N)
λ_F^n ∼ H_F^n    (n = 1, 2, ..., N)
x^n_{-p+1:0} := (x^n_{-p+1}, ..., x^n_0)    (n = 1, 2, ..., N)
Pr[z_t = k | z_{1:t-1}, x^N_{t-p:t-1}, α, λ_G^N] ∝ CRP(k | α, z_{1:t-1}) ∏_{n=1}^{N} G(x^n_{t-p:t-1}; D^n_{tk}, λ_G^n)    (t = 1, 2, ..., T),
  where D^n_{tk} := { x^n_{t'-p:t'-1} | z_{t'} = k, 1 ≤ t' < t } and k = 1, ..., max(z_{1:t-1}) + 1
x^n_t | { z_t = k, x^N_{1:t-1} } ∼ ∫_θ F(· | θ) π_Θ(θ | D'^n_{tk}, λ_F^n) dθ    (n = 1, 2, ..., N),
  where D'^n_{tk} := { x^n_{t'} | z_{t'} = k, 1 ≤ t' < t }.

The integration of F against π_Θ(θ | D'^n_{t z_t}) on the right-hand side of the final line evaluates to a Student-T distribution as in (9), whose updates given D'^n_{t z_t} and λ_F^n are identical to those in (10) with i = 0.

Inference on temporal regime assignments (z_t | z_{1:T\t}, ...). We first describe how to transition z_{1:T}, assuming the collapsed version of the TRCRP (11) with N time series. Note that since the hierarchical prior (5) for structure learning results in M = max(c^N) independent TRCRP mixtures (conditioned on the assignment vector), it suffices to describe inference on z_{1:T} in one of the mixtures, which keeps notation significantly simpler.
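The collapsed predictive referenced above follows the standard Normal-Inverse-Gamma conjugate updates of [24], as laid out in (10). The following minimal Python sketch (variable names are ours, not from the paper's code) computes the resulting Student-T predictive parameters from a cluster's data:

```python
def student_t_predictive(data, m=0.0, V=1.0, a=1.0, b=1.0):
    """Normal-Inverse-Gamma posterior update for prior (m, V, a, b) given
    observed cluster data; returns the Student-T predictive parameters
    (degrees of freedom, location, scale) as in (9)-(10)."""
    n = len(data)
    xbar = sum(data) / n if n else 0.0
    Vn = 1.0 / (1.0 / V + n)                      # posterior V
    mn = Vn * (m / V + n * xbar)                  # posterior location
    an = a + n / 2.0                              # posterior shape
    bn = b + 0.5 * (m * m / V + sum(x * x for x in data) - mn * mn / Vn)
    # Predictive: Student-T with 2*an degrees of freedom,
    # location mn, and scale bn * (Vn + 1) / an.
    return 2.0 * an, mn, bn * (Vn + 1.0) / an
```

With no data the prior predictive is recovered; each additional observation raises the degrees of freedom and shifts the location toward the cluster mean.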
Given observations x^N_{-p+1:T}, the joint likelihood of model (11) is:

P(α, λ_G^N, λ_F^N, z_{1:T}, x^N_{1:T}; x^N_{-p+1:0}, p) = Gamma(α; 1, 1) ( ∏_{n=1}^{N} H_G^n(λ_G^n) ) ( ∏_{n=1}^{N} H_F^n(λ_F^n) ) ∏_{t=1}^{T} [ b_t CRP(z_t | z_{1:t-1}, α) ∏_{n=1}^{N} G(x^n_{t-p:t-1}; D^n_{t z_t}, λ_G^n) F(x^n_t | D'^n_{t z_t}, λ_F^n) ].    (12)

The normalizer at time t is given by:

b_t(x^N_{1:t-1}, z_{1:t-1}) = ( ∑_{k=1}^{K_t} CRP(k | α, z_{1:t-1}) ∏_{n=1}^{N} G(x^n_{t-p:t-1}; D^n_{tk}, λ_G^n) )^{-1},    (13)

where K_t = max(z_{1:t-1}) + 1. Note that the normalizer b_t(x^N_{1:t-1}, z_{1:t-1}) ensures the reweighted cluster probabilities sum to one. It will also be convenient to define the predictive density q_t at time t of the data x^N_t, which sums out all possible values of z_t:

q_t(x^N_{1:t}, z_{1:t-1}) = b_t(x^N_{1:t-1}, z_{1:t-1}) ( ∑_{k=1}^{K_t} CRP(k | α, z_{1:t-1}) ∏_{n=1}^{N} G(x^n_{t-p:t-1}; D^n_{tk}, λ_G^n) F(x^n_t | D'^n_{tk}, λ_F^n) ).    (14)

Let the current state of the Markov chain be (α, λ_G^N, λ_F^N, z_{1:T}). We present two algorithms for sampling the latent regime assignments. Algorithm 1 is a single-site Metropolis-Hastings procedure that targets (z_t | z_{1:T\t}, ...) at each step, where we assume that all data in x^N_{1:T} are fully observed. Algorithm 2 is an SMC scheme to block sample (z_{1:T} | ...) using particle learning [9]. Arbitrary observations may be missing, as they are imputed over the course of inference.

Algorithm 1: single-site Metropolis-Hastings. This algorithm proposes (z_t | z_{1:T\t}, ...) at each step, assuming fully observed data x^N_{1:T}. Repeat for t = 1, 2, ..., T:

1. Propose z'_t from the multinomial distribution:

Pr[z'_t = k | z_{1:T\t}, x^N_{1:T}, α] ∝ CRP(k | α, z_{1:T\t}) ∏_{n=1}^{N} G(x^n_{t-p:t-1}; D^n_{Tk} \ {x^n_t}, λ_G^n) F(x^n_t | D'^n_{Tk} \ {x^n_t}, λ_F^n),    (15)

for k ∈ unique(z_{1:T\t}) ∪ { max(z_{1:T\t}) + 1 }.

2. Compute the MH acceptance ratio r(z_t → z'_t), using b_t defined in (13):

r(z_t → z'_t) = ∏_{t'>t} b_{t'}(z_{1:t'-1\t} ∪ z'_t, x^N_{1:t'-1}) / ∏_{t'>t} b_{t'}(z_{1:t'-1}, x^N_{1:t'-1}).    (16)

3. Set z_t ← z'_t with probability min(1, r); otherwise leave z_t unchanged.

Algorithm 2: block sampling with particle learning. This algorithm block samples z_{1:T} without any assumptions on missingness of observations. Let o^n_t be the "observation indicator", so that o^n_t = 1 if x^n_t is observed and 0 if it is missing (n = 1, ..., N and t = 1, ..., T). Let J ≥ 1 be the number of particles, with superscript j indicating the inclusion of any imputed values by particle j.

1. Set w^j ← 1, for j = 1, 2, ..., J.
2. Repeat for t = 1, 2, ..., T:
  2.1. For each particle j = 1, 2, ..., J:
    2.1.1. Sample z^j_t from the multinomial distribution:

    Pr[z^j_t = k | z^j_{1:t-1}, x^{N,j}_{1:t-1}, α] ∝ CRP(k | α, z^j_{1:t-1}) ∏_{n=1}^{N} G(x^{n,j}_{t-p:t-1}; D^{n,j}_{tk}, λ_G^n) ∏_{n=1}^{N} ( F(x^n_t | D'^{n,j}_{tk}, λ_F^n) )^{o^n_t},    (17)

    for k = 1, 2, ..., max(z^j_{1:t-1}) + 1.
    2.1.2. Update the weight using q_t defined in (14):

    w^j ← w^j · q_t( x^{N,j}_{1:t-1} ∪ { x^n_t | o^n_t = 1 }, z^j_{1:t-1} ).    (18)

    2.1.3. For each n such that o^n_t = 0, simulate a value x^{n,j}_t ∼ F(· | D'^{n,j}_{t z^j_t}, λ_F^n).
  2.2. If the resampling criterion is met, then:
    2.2.1. Resample (z^j_{1:t}, x^{N,j}_{1:t}) proportionally to w^j, for j = 1, 2, ..., J.
    2.2.2. Renormalize weights w^j ← w^j / ∑_{j'} w^{j'}, for j = 1, 2, ..., J.
3. Resample j ∼ Categorical(w^1, ..., w^J) and return (z^j_{1:T}, x^{N,j}_{1:T}).

It is worth discussing the computational trade-offs between MH Algorithm 1 and SMC Algorithm 2. In step 1 of Algorithm 1, (15) is recomputed K = O(max(z_{1:T})) times. Each assessment requires O(Np) computations, where the factor of N is the product over the time series, and the factor of p is the cost of assessing G per (4). In step 2, computing the terms b_{t'} in the acceptance ratio (16) requires revisiting O(T) data points. A single iteration therefore requires O(TKNp) computations, so the cost of a full sweep over all T time points is O(T^2 KNp). Note that it is not necessary to sum over all K_t clusters in (13) when computing the b_{t'} terms in (16), since the data in at most two clusters change when proposing z_t → z'_t. The sufficient statistics can be updated in constant time using a simple dynamic programming approach.

In practice, we consider several computational approximations that simplify the scaling properties of the single-site MH Algorithm 1. For missing data, rather than evaluate the full model likelihood (12) on imputed data for each t = 1, ..., T, we instead adopt a "data-dependent" prior, similar to the strategy described by [10] in the context of Bayesian density regression. Namely, letting o^n_t be the indicator for having observed x^n_t, we let the reweighting function G consider only those data points that have actually been observed. Therefore, (4) becomes:

G(x^n_{t-p:t-1}; D^n_{tk}, λ_G^n) = ∏_{i=1}^{p} ( G_i(x^n_{t-i}; D^n_{tki}, λ_{Gi}^n) )^{o^n_{t-i}}.    (19)

Second, note that the MH proposal (15) is very similar to the Gibbs proposal from Algorithm 3 of [25], except that we must account for the temporal coupling so that the transition is guaranteed to leave (12) invariant. Empirical evidence suggests that, when using the proposal (15), acceptance ratios center around one. This observation suggests a good initialization strategy for the Markov chain (prior to running the full MH algorithm): run several rounds of step 1, always accepting the proposal z_t → z'_t without computing (16), which eliminates the additional O(T) factor.

Unlike the MH Algorithm 1, the SMC Algorithm 2 requires O(KNp) computations to assess (17) in step 2.1.1; the total cost of a complete pass through all T data points (step 2) and all J particles (step 2.1) is therefore O(JTKNp). Note that in SMC the normalizers b_t need not be retroactively computed, which is the key overhead of MH. In addition to its linear scaling in T, SMC is able to (i) more tractably handle missing data, and (ii) use a posterior particle filter by sampling from the conditionally optimal proposal distribution in step 2.1.1, resulting in significantly lower variance of the weights [9].

Inference on time series cluster assignments (c_n | c^{N\n}, ...). This section describes an MCMC algorithm for sampling the time series cluster assignments when using the hierarchical CRP structure prior (5).
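The "resampling criterion" in step 2.2 of Algorithm 2 is left unspecified; a standard SMC choice (our assumption here, not a detail stated in the text) is to trigger resampling when the effective sample size of the particle weights falls below half the number of particles:

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2: equals J for uniform weights and
    approaches 1 when a single particle dominates."""
    s = sum(weights)
    return (s * s) / sum(w * w for w in weights)

def should_resample(weights, frac=0.5):
    """Common SMC heuristic: resample when ESS < frac * J."""
    return effective_sample_size(weights) < frac * len(weights)
```

Resampling only when the ESS degrades avoids the extra variance introduced by resampling at every step.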
For notational simplicity, let B ⊆ [N] and define:

L_m(z_{1:T}, x^B_{1:T}) = ∏_{t=1}^{T} [ b_t CRP(z_t | z_{1:t-1}, α_m) ∏_{n ∈ B} G(x^n_{t-p:t-1}; D^n_{t z_t}, λ_G^n) F(x^n_t | D'^n_{t z_t}, λ_F^n) ].    (20)

The term L_m is shorthand for the product from t = 1 to T in the full model likelihood (12) for a single TRCRP mixture, with latent sequence z_{1:T}, data x^B_{1:T}, and CRP concentration α_m. Second, let A_m = { n | c_n = m } be the indices of the time series currently assigned to cluster m.

Algorithm 3: sampling time series cluster assignments. Let the current state of the Markov chain be (α_0, c^N, α^M, λ_G^N, λ_F^N, z^M_{1:T}), with observations x^N_{1:T}. This algorithm resamples (c_n | c^{N\n}, ...). Repeat for n = 1, 2, ..., N:

1. If c_n is not a singleton cluster, i.e. |A_{c_n}| > 1, sample a fresh latent regime sequence z^{M+1}_{1:T} from the model prior (11), holding the data x^n_{1:T} fixed at the observed values.
2. If c_n is a singleton, i.e. |A_{c_n}| = 1, re-use the current latent regime sequence by setting z^{M+1}_{1:T} = z^{c_n}_{1:T}.
3. For m ∈ unique(c^{N\n}), compute:

p_m = |A_m| · L_m(z^m_{1:T}, x^n_{1:T})    if c_n ≠ m,
p_m = (|A_m| − 1) · L_m(z^m_{1:T}, x^n_{1:T})    if c_n = m.
4. Compute the singleton proposal probability: p_{M+1} = α_0 · L_{M+1}(z^{M+1}_{1:T}, x^n_{1:T}).
5. Sample c' ∼ Categorical({p_m}).
6. Compute the MH acceptance ratio:

r(c_n → c') = [ L_{c'}(z^{c'}_{1:T}, x^{A_{c'}}_{1:T} ∪ x^n_{1:T}) · L_{c_n}(z^{c_n}_{1:T}, x^{A_{c_n}}_{1:T} \ x^n_{1:T}) / ( L_{c'}(z^{c'}_{1:T}, x^{A_{c'}}_{1:T}) · L_{c_n}(z^{c_n}_{1:T}, x^{A_{c_n}}_{1:T}) ) ] × [ L_{c_n}(z^{c_n}_{1:T}, x^n_{1:T}) / L_{c'}(z^{c'}_{1:T}, x^n_{1:T}) ].    (21)

7. Set c_n ← c' with probability min(1, r); otherwise leave c_n unchanged.

By proposing the latent regime singleton from the (conditional) prior in step 2 of Algorithm 3, transdimensional adjustments such as reversible jump MCMC [14] need not be considered. Moreover, when computing the MH acceptance ratio (21) in step 6, it is not necessary to recompute all the L_m terms at each iteration. First, writing out the full products (20) results in the cancellation of several terms in the numerator and denominator of (21). Second, the b^m_t terms that do not cancel contain several duplicated components, which can be reused from one transition to the next.

In practice, we find that a heuristic similar to the one described for Algorithm 1 provides good transitions in the state space, given the similarities between Algorithm 3 and the Gibbs Algorithm 8 from [25].

Inference on model hyperparameters (α_0, {α_m}, {λ_G^n}, {λ_F^n} | ...). This section describes the empirical Bayes approach [30] for transitioning model hyperparameters, using the "griddy Gibbs" approach from [29].
For each hyperparameter, we construct a grid of 30 data-dependent, logarithmically-spaced bins as follows:

grid(α_0) = logspace(1/N, N)    (outer CRP concentration)
grid(α_m) = logspace(1/T, T)    (TRCRP concentration)
grid(m_n) = logspace(min(x^n_{1:T}) − 5, max(x^n_{1:T}) + 5)    (Normal-InverseGamma)
grid(V_n) = logspace(1/T, T)
grid(a_n) = logspace(ssqdev(x^n_{1:T})/…, ssqdev(x^n_{1:T}))
grid(b_n) = logspace(1, T)

Grids for the Normal-InverseGamma hyperparameters apply to both λ_F^n (n = 1, 2, ..., N) and λ_G^n (windows i = 1, ..., p). We cycle through the grid points of each hyperparameter, and assess the conditional likelihood at each bin using (6). We find that this method is computationally reasonable and finds good hyperparameter settings. However, approaches based on slice sampling offer a promising alternative for achieving fully Bayesian inference over the hyperparameters.
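A minimal sketch of one griddy-Gibbs sweep over a single hyperparameter; the grid construction follows the logspace recipe above, loglik stands in for whatever conditional likelihood (6) supplies, and all names are ours:

```python
import math
import random

def hyper_grid(lo, hi, num=30):
    """Logarithmically spaced grid of `num` bins on [lo, hi]."""
    ratio = (hi / lo) ** (1.0 / (num - 1))
    return [lo * ratio ** i for i in range(num)]

def griddy_gibbs_step(grid, loglik, rng=random.random):
    """Score each grid point under its conditional log-likelihood and
    sample one proportionally (the griddy Gibbs sampler of [29])."""
    scores = [loglik(g) for g in grid]
    mx = max(scores)                           # stabilize the exponentials
    weights = [math.exp(s - mx) for s in scores]
    r, acc = rng() * sum(weights), 0.0
    for g, w in zip(grid, weights):
        acc += w
        if r <= acc:
            return g
    return grid[-1]
```

Cycling this step over each hyperparameter's grid gives the empirical-Bayes transition described above.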
This section describes the quantitative experimental methods used for the forecasting, clustering, and imputation pipelines in Section 5. Access to experimental pipeline code is available upon request.
C.1 Flu forecasting
The full CDC flu datasets used in this paper are available at https://github.com/GaloisInc/ppaml-cp7/tree/master/data. Flu populations were constructed from the following csv files: USA-flu.csv, USA-tweets.csv, and
USA-weather.csv. In each of US Regions 1 through 10, we held out data from weeks 2014.40 through 2015.20, and produced forecasts with a 10-week horizon on a rolling basis. Tweet and minimum temperature covariates were used. More precisely, for a region r (such as US Region 10), a forecaster F for week t extending h weeks into the future is a function:

F_{r,t,h}: { x^{flu,r}_{1:t-2}, x^{cov,r}_{1:t} } ↦ { x^{flu,r}_{t:t+h} }.    (22)

The forecasters iterated over regions r = 1, 2, ..., 10, weeks t = 2014.40 through 2015.20, and horizons h = 1, 2, ..., 10. Note that the two-week delay in the latest flu data is expressed by only having flu data up to week t − 2 when forecasting at week t. Second, x^cov contains arbitrary missing values (see, for example, the tweets time series from Figure 2b). When forecasting, covariate values are only available up to the current week t, not over the entire course of the forecast horizon. Nine forecasting methods were used in the paper, shown in Figure 6. Below are further details on each forecaster:

Constant. This method returns a constant prediction based on the most recently observed flu value x^flu_{t-2} over the entire course of the horizon.

Linear extrapolation. This method fits a straight line through the three most recently observed flu values, x^flu_{t-4:t-2}, and returns predictions by extrapolating the line for h weeks.

GP (SE+PER+WN). This method is a Gaussian process whose covariance kernel is a sum of squared exponential, periodic, and white noise components. Hyperparameter inference was conducted using the open source implementation from the Venture platform [34; https://github.com/probcomp/Venturecxx]. MH sampling on data-dependent hyperparameter grids was run for a burn-in period of 10000 iterations. Predictions were obtained by drawing 500 independent curves from the posterior predictive distribution, evaluated jointly at the forecast weeks.
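The constant and linear-extrapolation baselines are simple enough to sketch directly. A minimal version of the latter, fitting a least-squares line through the three most recent observed values and extrapolating h steps past the last one (an illustrative sketch, not the paper's code):

```python
def linear_extrapolation_forecast(history, h):
    """Fit a least-squares line through the last three values of
    `history` (treated as positions 0, 1, 2) and extrapolate it
    h steps beyond the final value."""
    y = history[-3:]
    slope = (y[2] - y[0]) / 2.0       # closed-form LS slope for x = 0, 1, 2
    intercept = sum(y) / 3.0 - slope  # the fitted line passes through (1, mean)
    return [intercept + slope * (2 + k) for k in range(1, h + 1)]
```

For evenly spaced points the least-squares slope over three values reduces to half the difference between the outer two, which is why no matrix solve is needed.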
GP (SE × PER+WN). Identical to the above, except using a covariance kernel with a product of squared exponential and periodic components, plus white noise. The change in covariance kernel resulted in little quantitative or qualitative difference.
Facebook Prophet. We used the open-source Python implementation of Facebook Prophet [36; https://facebook.github.io/prophet]. We specified the data sampling rate as weekly. The method requires no additional specification or tuning. The predictor returns point estimates, as well as upper and lower confidence intervals, at the held-out weeks.
Seasonal ARIMA. We used the R implementation of seasonal ARIMA from the forecast package [17; https://cran.r-project.org/web/packages/forecast]. The model is parameterized as ARIMA(p, d, q)(P, D, Q)_m, where p is the non-seasonal AR order, d is the non-seasonal differencing, q is the non-seasonal MA order, P is the seasonal AR order, D is the seasonal differencing, Q is the seasonal MA order, and m is the sampling frequency per period. For each of the 10 flu seasons, we used auto.arima to perform model selection. We manually specified the weekly sampling rate by setting m = 52, and set D = 1 to specify one flu season per year. The program optimizes all other parameters using non-stepwise grid search, which is significantly slower to fit than stepwise search, but is both more extensive and more appropriate for data with seasonal behavior (according to the package documentation). While auto.arima can in principle support covariate data using the xreg parameter, we were unable to successfully use xreg due to missing data in the matrix of external regressors (tweets and weather) at the held-out weeks. The predictor returns point estimates, as well as upper and lower confidence intervals, at the held-out weeks.
Multi-output GP. This method is a single-input (time), multiple-output (flu, tweets, and weather data) Gaussian process. We used the open source MATLAB implementation of sparse convolved Gaussian processes for multi-output regression from the multigp package [3; https://github.com/SheffieldML/multigp]. We used the following configuration options:

(i) options = multigpOptions('ftc');
(ii) options.kernType = 'ggwhite';
(iii) options.optimizer = 'scg';
(iv) options.nlf = 1;

to specify (i) full estimation without running likelihood approximations; (ii) a Gaussian-Gaussian kernel with white noise; (iii) scaled conjugate gradient optimization; and (iv) one latent function. Moreover, the options.bias and options.scale parameters were initialized to their empirical values from the training set. Optimization was run until convergence for all forecasters. This method is the only baseline which can handle arbitrary patterns of missing data, thereby making use of the weather and tweet signals when forecasting at time t. However, the absence of a periodic kernel in the convolved GP implementation made it difficult to capture the seasonal dynamics. Predictions were obtained by sampling 500 independent normal random variables from the posterior predictive distribution evaluated at the forecast weeks.

HDP-HSMM. This method is the hierarchical Dirichlet process hidden semi-Markov model; experiments were run using the open-source Python package pyhsmm [19; https://github.com/mattjj/pyhsmm]. While the HDP-HSMM cannot handle missing values in the training data, it can handle missing data over the course of the prediction horizon. Therefore, flu and weather time series were modeled jointly, leaving out the tweets. We used the
WeakLimitHDPHSMM model, with a Poisson duration distribution and a Gaussian observation distribution. Default configurations of all hyperparameters of these distributions and of the HDP-HSMM concentration were taken from examples made available by the authors. MCMC inference with 1000 steps of burn-in was used. Predictions were obtained by drawing 100 independent curves from the posterior predictive evaluated at the forecast weeks.
Univariate TRCRP mixture. This method modeled only the flu time series, using model (2). We used a window size of p = 10 weeks, and S = 64 parallel MCMC runs with a burn-in period of 5000 iterations. Predictions were obtained by drawing 500 independent curves from the posterior predictive distribution evaluated at the forecast weeks.

Multivariate TRCRP mixture. This method modeled the flu, weather, and tweet time series jointly, using the model in Figure 2a. We used a window size of p = 10 weeks, and S = 64 parallel MCMC runs with a burn-in period of 5000 iterations. Missing covariate data was handled using the approximation given in (19). Using the hierarchical structure prior (5) resulted in little to no quantitative difference. The three time series are dependent, which was reflected in their posterior dependence probability (7) being 1 across all 64 independent chains. Predictions were obtained by sampling 500 independent curves from the posterior predictive distribution evaluated at the forecast weeks. An open-source implementation of the method used in this paper is available at https://github.com/probcomp/trcrpm.

C.2 Flu imputation
We constructed a single population of 10 flu time series for US Regions 1 through 10. Missing data was dropped independently in each time series by removing consecutive windows of length 10 at a rate of 5%. The full and dropped datasets used for benchmarking are shown in Figure 8. Below are further details on each of the five imputation methods:
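The dropout scheme can be sketched as follows; the exact mechanism for choosing window starts is not specified in the text, so treating each possible start position as an independent low-probability event is our assumption:

```python
import random

def drop_windows(series, width=10, rate=0.05, seed=0):
    """Replace consecutive windows of `width` points with None. Each
    window start fires with probability rate/width, so that roughly
    `rate` of all points end up inside some dropped window."""
    rng = random.Random(seed)
    out = list(series)
    for start in range(len(series) - width + 1):
        if rng.random() < rate / width:
            for i in range(start, start + width):
                out[i] = None
    return out
```

Fixing the seed makes the benchmark reproducible, and the same mask can then be applied to every imputation method under comparison.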
Mean imputation. This method returns the per-series mean as the imputed value for each data point.
Linear interpolation. This method constructs a straight line between every pair of time points t1 < t2 which have at least one missing observation between them. The interpolation routine used was pandas.Series.interpolate from the Python pandas package at https://pandas.pydata.org.

Cubic interpolation. The cubic interpolation routine used was scipy.interpolate.interp1d from the Python scipy package at https://scipy.org.

Amelia II. This method uses the R package amelia [16; http://cran.r-project.org/web/packages/Amelia] for multiple imputation. We used 100 samples per missing data point. Imputation errors were averaged over the multiple imputations.
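For reference, the linear interpolation rule (a straight line between the nearest observed neighbors of each gap) reduces to a few lines; the experiments used pandas.Series.interpolate, so this is only an illustrative pure-Python equivalent:

```python
def linear_interpolate(series):
    """Fill None entries by linear interpolation between the nearest
    observed neighbors; leading/trailing gaps are left as None."""
    out = list(series)
    obs = [i for i, v in enumerate(series) if v is not None]
    for a, b in zip(obs, obs[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            out[i] = series[a] + frac * (series[b] - series[a])
    return out
```

Note that, unlike the model-based methods, this rule cannot exploit the other nine regional series when filling a gap.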
Multivariate TRCRP mixture. A window of p = 10 weeks was used, with S = 64 parallel MCMC runs and a burn-in period of 5000 iterations. 100 predictive samples from each of the chains were obtained using (8), and imputation errors were averaged over the multiple imputations. Joint imputations of Regions 1 through 10 are shown in Figure 8.

C.3 Sensitivity of imputation performance to the TRCRP mixture window size
We further studied how the imputation performance of the TRCRP mixture varied as we changed the window size p. Figure 7 shows the outcome of this sensitivity analysis. In all cases, the sampler was run for a burn-in of 5000 iterations with S = 16 chains. While imputation is generally not highly sensitive to p, the median imputation error degrades slightly with increasing p, and the variance of imputation errors increases. (At higher p, the MCMC chains need a significantly larger number of iterations to mix well than at lower p.)

The reason that small p works well for jointly imputing the 10 time series in Figure 8 is that the multivariate TRCRP mixture shares statistical strength across time series. Namely, when imputing a missing value x^n_t at time t for time series n, the relevant variables for predicting the hidden state z_t are (i) the history x^n_{t-p:t-1} of the current time series, and (ii) the values { x^{n'}_t | n' ≠ n } of the other time series at time t. The latter effect is the dominant one in this imputation problem, leading to less sensitivity to p than might be expected.

C.4 Clustering GDP time series
The clustering results from Figure 4 were obtained using a TRCRP with a window of p = 5 years. The nine clusters shown were obtained by averaging dependence probabilities over S = 60 posterior samples (using a burn-in of 5000 iterations), and extracting groups of variables whose dependence probabilities (7) exceeded 80%. All time series in Figure 4 are linearly rescaled to [0, 1] for plotting purposes only. While clustering is an unsupervised task that is challenging to evaluate quantitatively (especially for real-world data, where there is no "ground truth"), qualitative comparisons to k-medoids clustering with the dynamic time warping metric on the same GDP time series are shown and discussed in Figure 9.

[Figure 7: Sensitivity of imputation performance to TRCRP window size p. Box plots of absolute imputation errors for window sizes p ∈ {1, 2, 4, 6, 8, 12, 14, 18, 22, 24, 32}.]
C.5 Expanded results on clustering cellphone subscription time series
In addition to clustering the GDP time series from Figure 4, we applied the TRCRP prior with the hierarchical extension (5) to cluster historical cell phone subscription data. The outcome of the clustering is shown in Figure 10, where we show all 170 time series in the leftmost panel, along with three representative clusters from one posterior sample. Each cluster corresponds to countries whose change point in cell phone subscribers from zero to non-zero fell in a distinct window: 1985–1995 in cluster 1, 1995–2000 in cluster 2, and 2000–2005 in cluster 3. We also compare renderings of the pairwise dependence probability matrix with the pairwise cross-correlation matrix. Refer to the caption of Figure 10 for additional details.
Figure 8: Full, missing, and imputed flu time series (R01.%ILI through R10.%ILI) over eight years (2008-2015) in US Regions 1 through 10. (a) Original flu time series. (b) Time series after dropping data. (c) Jointly imputed time series using the TRCRP mixture (p = 10).
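To illustrate the cross-series effect described in Section C.3 (imputing a missing x^n_t from the other series' values at the same time t), here is a deliberately simplified two-regime Gaussian mixture, not the TRCRP model itself; all regime parameters are invented for the sketch.

```python
import numpy as np

# Two latent regimes ("flu season" vs "off season") with one mean per
# series; these parameters are made up purely for illustration.
means = np.array([[5.0, 6.0, 7.0],    # regime 0: high ILI rates
                  [1.0, 1.5, 2.0]])   # regime 1: low ILI rates
weights = np.array([0.5, 0.5])
sigma = 0.5                           # shared observation noise

def impute(x_obs):
    """Fill NaN entries of the length-3 observation x_obs with the
    posterior-weighted regime means, conditioning only on the
    observed entries (the cross-series effect)."""
    obs = ~np.isnan(x_obs)
    # Unnormalized log p(z | x_obs[obs]); the constant terms of the
    # Gaussian density cancel because sigma is shared.
    ll = -0.5 * ((x_obs[obs] - means[:, obs]) / sigma) ** 2
    logp = np.log(weights) + ll.sum(axis=1)
    resp = np.exp(logp - logp.max())
    resp /= resp.sum()
    x = x_obs.copy()
    x[~obs] = resp @ means[:, ~obs]
    return x

# Series 0 and 2 look like flu season, so the missing series 1 is
# imputed near the regime-0 mean of 6.0.
print(impute(np.array([4.8, np.nan, 7.2])))
```

The TRCRP mixture additionally conditions on each series' own lagged history x^n_{t-p:t-1} when inferring the hidden state, but as noted above, the contemporaneous values of the other series dominate in this imputation problem.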
Figure 9: Outputs of k-medoids clustering on the GDP per capita time series for all 170 countries in the Gapminder dataset, with k = 1, 2, ..., 10. Distances are computed using the dynamic time warping (DTW) metric, a common similarity measure between a pair of time series [6]. For each k, we randomly initialized the medoids and ran the algorithm to convergence (medoids are shown in red, and time series assigned to each medoid in gray). Using k-medoids requires hand-tuning the number of latent clusters k, whereas the proposed method (whose posterior clustering is shown in Figure 4 of the main text) places a nonparametric Bayesian prior over this parameter. Moreover, when compared to the clusters detected by the proposed method, those detected by k-medoids with DTW appear qualitatively less distinct, and have more repetitive and duplicated temporal patterns (especially apparent at higher k). Finally, k-medoids outputs a fixed cluster assignment for each time series in the population; these assignments are sensitive to the random initialization and cannot be aggregated in a principled way. In contrast, inference in the proposed method delivers probabilistic cluster assignments that can be averaged coherently using (7) to express posterior uncertainty.
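For concreteness, here is a minimal sketch of the k-medoids-with-DTW baseline: a simplified reimplementation on toy data, not the exact code or data used for Figure 9.

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance with
    absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def k_medoids(X, k, n_iter=20, seed=0):
    """PAM-style k-medoids on a precomputed DTW distance matrix.
    Assumes the series in X are distinct, so no cluster goes empty
    (each medoid is at distance zero from itself)."""
    N = len(X)
    D = np.array([[dtw(X[i], X[j]) for j in range(N)] for i in range(N)])
    rng = np.random.default_rng(seed)
    medoids = rng.choice(N, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            within = D[np.ix_(members, members)]
            # New medoid: member minimizing total within-cluster distance.
            new.append(members[np.argmin(within.sum(axis=1))])
        new = np.array(new)
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)

# Toy data: two rising series and two flat series.
X = [np.array([0., 0., 1., 2., 3.]), np.array([0., 1., 2., 3., 4.]),
     np.array([5., 5., 5., 5., 5.]), np.array([5., 5., 4., 5., 5.])]
_, labels = k_medoids(X, k=2)
groups = {}
for i, l in enumerate(labels):
    groups.setdefault(int(l), []).append(i)
print(sorted(groups.values()))  # [[0, 1], [2, 3]]
```

Note the two tuning burdens the caption highlights: k must be fixed in advance, and the result depends on the random initialization through `seed`.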
Figure 10: Clustering 170 cell phone subscription time series. (a) All 170 cellphone time series (leftmost), and three posterior clusters in the TRCRP mixture corresponding to three non-overlapping change point windows: cluster 1 (change points 1985-1995: Canada, France, Italy, Japan, South Korea, USA); cluster 2 (change points 1995-2000: Brazil, China, Jordan, Poland, Romania, Uruguay); cluster 3 (change points 2000-2005: Afghanistan, Bangladesh, Chad, Ghana, Nepal, Sudan). (b) Pairwise dependence probability heatmap. (c) Pairwise cross-correlation heatmap.