[PDF] Innovative And Additive Outlier Robust Kalman Filtering With A Robust Particle Filter

Abstract

In this paper, we propose CE-BASS, a particle mixture Kalman filter which is robust to both innovative and additive outliers, and able to fully capture multi-modality in the distribution of the hidden state. Furthermore, the particle sampling approach re-samples past states, which enables CE-BASS to handle innovative outliers which are not immediately visible in the observations, such as trend changes. The filter is computationally efficient as we derive new, accurate approximations to the optimal proposal distributions for the particles. The proposed algorithm is shown to compare well with existing approaches and is applied to both machine temperature and server data.

Full PDF

II NNOVATIVE A ND A DDITIVE O UTLIER R OBUST K ALMAN F ILTERING W ITH

A R

OBUST P ARTICLE F ILTER

A P

REPRINT

Alexander T. M. Fisch

Lancaster UniversityLancaster, United Kingdom

Idris A. Eckley

Lancaster UniversityLancaster, United Kingdom

Paul Fearnhead

Lancaster UniversityLancaster, United KingdomJuly 8, 2020 A BSTRACT

In this paper, we propose CE-BASS, a particle mixture Kalman ﬁlter which is robust to both innovativeand additive outliers, and able to fully capture multi-modality in the distribution of the hidden state.Furthermore, the particle sampling approach re-samples past states, which enables CE-BASS tohandle innovative outliers which are not immediately visible in the observations, such as trendchanges. The ﬁlter is computationally efﬁcient as we derive new, accurate approximations to theoptimal proposal distributions for the particles. The proposed algorithm is shown to compare wellwith existing approaches and is applied to both machine temperature and server data. K eywords Kalman Filter · Anomaly Detection · Particle Filtering · Robust Filtering

Anomaly detection is an area of considerable importance and has been subject to increasing attention in recent years.Comprehensive reviews of the area can be found in [1, 2]. The ﬁeld’s growing importance arises from the increasingrange of applications to which anomaly detection lends itself: from fraud prevention [1, 2], to fault detection [1, 2], andeven the detection of exoplanets [3]. More recently, the emergence of internet of things and the ubiquity of sensors hasled to emergence of the online detection of anomalies as an important statistical challenge.Kalman ﬁlters [4] provide a convenient framework to detect anomalies within a streaming data context. In particular,they can be updated in a fully online fashion at a ﬁxed computational cost. At each time point, Kalman ﬁlters alsoprovide an estimate both for the expectation and variance of the next observation. These can be used to determinewhether that observation is anomalous or not. However, the major drawback of Kalman ﬁlters is their lack of robustnessto outliers: once the ﬁlter has encountered an outlier, it will often produce inaccurate predictions for many future timepoints.The anomaly detection literature distinguishes between two types of outliers. The ﬁrst are additive outliers, sometimesreferred to as observational outliers [5], which affect the observational noise only. The other type of outliers are theinnovative, or process [6], outliers. These affect the updates of the hidden states. In practice, both have a similar effecton the next observation, but quite different effects on subsequent observations. Moreover, some innovative outlierscannot be detected immediately as their inﬂuence on the observations is only noticeable after, or over, a period of time.A range of robust Kalman ﬁlters has been proposed to date. Many side-step the problem of distinguishing between thetwo outlier types. By far the largest class of ﬁlters aims to be robust against heavy tailed additive outliers. Examples ofsuch ﬁlters include [7, 8], which assume t -distributed additive noise and perform inference using variational Bayes, [9],who use Huberised residuals, and [10] inﬂate the noise covariance matrix whenever an outlier is encountered. A few a r X i v : . [ s t a t . M E ] J u l E-BASS - J

ULY

8, 2020ﬁlters have also been developed with the aim of achieving robustness against innovative outliers [9]. The problem withsuch ﬁlters is that they exacerbate the shortcomings of the Kalman ﬁlter when they encounter the other type of anomaly:additive outlier robust Kalman ﬁlters, for example, update their hidden states even less than the classical Kalman ﬁlterwhen encountering innovative outliers.In principle, it seems straightforward to combine the ideas of these two types of robust Kalman ﬁlter. One body ofliterature proposes to use Huberisation of both innovative and additive residuals [5, 10]. Others [6, 11] have modelledboth additive and innovative outliers using t -distributions, by imposing Wishart priors on the precision matrix of boththe innovations and additions and maintaining the posterior by using variational Bayes approaches. The issue with theseﬁlters comes from how they approximate the ﬁltering distribution of the state. Both return uni-modal posteriors afterencountering an anomaly. This is a shortcoming given that the posterior after an anomaly is likely to be multi-modal: ifthe outlying observation was caused by an additive anomaly, the state will be close to the prior, whereas if it was causedby an innovative anomaly, the state would be far from it.The ideal approach to constructing a robust ﬁlter would be to model the possibility of outliers in both the observation andsystem noise, and then use a ﬁlter algorithm that attempts to calculate, or approximate, the true ﬁltering distribution forthe model. An early attempt to do this was the spline based approach [12], but the computational complexity increasesvery quickly with the number of dimensions and such a ﬁlter becomes impracticable when the state dimension is greaterthan 3. As a result we consider using particle ﬁlters [13, 14]. These are able to produce Monte Carlo approximationsto the ﬁltering distribution for an appropriate model that allows for outliers, and, in principle, can work even if theﬁltering distribution is multi-modal. However the Monte Carlo error of standard implementations of the particle can beprohibitively large [10].In this paper, we develop an efﬁcient particle ﬁlter by using a combination of Rao-Blackwellisation and well-designedproposal distributions. The idea of Rao-Blackwellisation is to integrate out part of the state so that the particle ﬁlterapproximates the ﬁltering distribution of a lower-dimensional projection of the state. In our application this projectionis whether each component of the additive and innovative noise is an outlier, and if it is how much the variance of thenoise has been inﬂated. Conditional on this information, the state space model becomes linear-Gaussian and we canimplement a Kalman Filter to calculate exactly the conditional ﬁltering distribution, while being able to fully capturemulti modal posteriors. This idea is similar to that which underpins the Mixture Kalman Filter [15].Whilst Rao-Blackwellisation improves the Monte Carlo accuracy of the ﬁlter, such a ﬁlter can still have the shortcomingsnoted by [10] and perform poorly without good proposal distributions for the information we condition on. One of themain contributions of this work is a proposal distribution that accurately approximates the conditional distribution ofthe variance inﬂation for each component of the noise, and hence approximates the optimal proposal distribution [16].As a result of this proposal, we ﬁnd that accurate results can be obtained even with only a few particles.Another important challenge addressed by this paper is that certain innovative outliers can not immediately be detected.An innovative outlier in a latent trend component for instance can cause a trend changes which may only becomeapparent – i.e. produce a visible outlier in the observations – many observations after the innovative outlier in the trendoccurred. It is nevertheless important to capture such outliers as they can affect a potentially unlimited number ofobservations to come. The proposed particle ﬁlter includes the possibility to back-sample the variance inﬂation particlesin light of more recent observations, which enables it to capture these important anomalies.The remainder of this paper is organised as follows: We discuss our robust noise model, consisting of a mixturedistribution of Gaussian noise, representing typical behaviour, and heavy tailed noise, representing atypical behaviour,for both the additive (observational) and innovative (system) noise process in Section 2. The model is shown to be verysimilar to that considered by [11]. We then introduce the proposal distribution for the scale of the noise in Section 3,before extending it to anomalies which are not immediately identiﬁable in Section 4. The proposed ﬁlter is compared toothers in Section 5 and applied to router data and a benchmark machine temperature data-set in Section 6. The proposedmethodology, which we call Computationally Efﬁcient Bayesian Anomaly detection by Sequential Sampling (CE-BASS) has been implemented in the the R package RobKF available from https://github.com/Fisch-Alex/Robkf .Derivations of theoretical results and complete pseudocode are available in the appendix.

Throughout this paper, we will consider inference about a latent state, X t , through partial observations, Y t , modelled as Y t = CX t + V t Σ A (cid:15) t , X t = AX t − + W t Σ I ν t . (1)2E-BASS - J ULY

8, 2020 (a) Random walk (b) Random walk with trends

Figure 1: Two examples of time series which are realisations of outlier infested Kalman models. (a) was simulatedusing the setup deﬁned in Equation (2), with σ A = 1 , σ I = 0 . , and outliers deﬁned by W = 3600 , V = 100 , and W = 10000 . Conversely (b) second example was simulated using the model deﬁned in Equation (3) using σ A = 1 , σ (1) I = 0 . , σ (2) I = 0 . and outliers deﬁned by W (1)100 = 3600 , V = 100 , and W (2)700 = 40000 .Here the additive noise, (cid:15) t ∈ R p , and the innovations ν t ∈ R q are both i.i.d. standard multivariate Gaussian. Thediagonal matrices Σ A and Σ I denote the covariance of the additive and innovation noise respectively. The diagonalmatrices V t and W t are used to capture additive and innovative outliers respectively, with large diagonal entries of V t corresponding to additive outliers and large diagonal entries of W t corresponding to innovative outliers. The classicalKalman model is recovered by setting W t = I and V t = I for all times t .The model in Equation (1) can be used to model a range of time series behaviours. We will use the following twoexamples throughout the paper: Example 1 : The random walk model with both changepoints and outliers, similar to the problem considered by [17]. Itcan be formulated as Y t = X t + V t σ A (cid:15) t , X t = X t − + W t σ I ν t . (2)Here atypically large values of V t correspond to outliers, whilst atypically large values of W t correspond to changes. Arealisation of this model can be found in Figure 1a. Example 2 : A time series with changes in trend, level shifts, as well as outliers, similar to the model considered by[18]. It can be formulated as Y t = X (1) t + V t σ A (cid:15) t X (1) t = X (1) t − + X (2) t − + (cid:16) W (1) t (cid:17) σ (1) I ν (1) t ,X (2) t = X (2) t − + (cid:16) W (2) t (cid:17) σ (2) I ν (2) t , (3) with the ﬁrst component of the hidden state denoting the current position and the second indicating the trend. Here,outliers are modelled by large values of V t whilst level shift and changes in trend are modelled by atypically largevalues of W (1) t and W (2) t respectively. A realisation of this model can be found in Figure 1b.A key feature of this second model is that an outlier in the trend component, X (2) t , may only become detectable manyobservations after the outlier – this challenging issue mentioned in the introduction is addressed via the methods inSection 4. A wide rage of other commonly used time series features, such as auto-correlation, moving averages, etc.can be incorporated in the model.To infer the locations of anomalies we use the model V ( i,i ) t = 1 + λ ( i ) t V ( i,i ) t W ( j,j ) t = 1 + γ ( j ) t W ( j,j ) t (4)3E-BASS - J ULY

8, 2020for ≤ i ≤ p and ≤ j ≤ q . The random variables λ ( i ) t ∼ Ber ( r i ) and γ ( j ) t ∼ Ber ( s j ) are indicators that determinewhether an anomaly is present or not for ≤ i ≤ p and ≤ j ≤ q respectively. For additional interpretability, weimpose that at most one anomaly is present at any given time t , and deﬁne r i and s j to be the probabilities that λ ( i ) t = 1 and γ ( j ) t = 1 respectively. The inverse scale, or precision, of an anomaly (if present) is given by the random variables ˜ V ( i,i ) t ∼ ˜ σ i Γ( a i , a i ) and ˜ W ( j,j ) t ∼ ˆ σ j Γ( b j , b j ) for ≤ i ≤ p and ≤ j ≤ q respectively.The proposed model bears similarities to the model used by [11]. Both use a mixture of Gaussian and heavy tailed noise.The main difference is that the anomalous behaviour is characterised by noise which is the sum of a Gaussian and a t -distribution in our model as opposed to just a t -distribution in the model used by [11]. This ensures that anomaliescoincide with strictly greater noise and makes the result more interpretable. In practice, however, the noise distributionconsidered in this paper and in [11] are likely to be of very similar shape. We now turn to ﬁltering the model deﬁned by Equations (1) and (4). The main feature we exploit is the fact that if weknew the value of ( V t , W t ) at all times t , we could just run the classical Kalman ﬁlter over the data. Consequently, ourapproach will consist of sampling particles for ( V t , W t ) , conditional on which the classical Kalman update equationsfor the hidden state x t can be used. This approach, very similar to the mixture Kalman ﬁlter [15, 19] is summarised bythe pseudocode in Algorithm 1.For each time, t , the code loops over the existing particles, ( V t , W t ) , and simulates M (cid:48) descendants for each of them instep 4. They are stored in a set of candidate particles. If we have N particles at time t , keeping all candidates wouldproduce N M (cid:48) particles at time t + 1 . To avoid growing the number of particles exponentially with t , Step 7 resamplesthe candidates to keep just N particles. The ﬁltering distribution for each of these particles is then calculated using theKalman Filter updates in step 10. Algorithm 1

Basic Particle Filter (No Back-sampling)

Input:

An initial state estimate ( µ , Σ ) A number of descendants, M (cid:48) = M ( p + q ) + 1 A number of particles to be maintained, N .A stream of observations Y , Y , ... Initialise:

Set

P articles (0) = { ( µ , Σ ) } for t ∈ N + do Candidates ← {} for ( µ , Σ ) ∈ P articles ( t − do ( V , W , prob ) ← Sample_Particles ( M (cid:48) , µ , Σ , Y t , A , C , Σ A , Σ I ) Candidates ← Candidates ∪ { ( µ , Σ , V , W , prob ) } end for Descendants ← Subsample ( N, Candidates ) P articles ( t ) ← {} for ( µ , Σ , V , W , prob ) ∈ Descendants do ( µ new , Σ new ) ← KF_Upd ( Y t , µ , Σ , C , A , V / Σ A , W / Σ I ) P articles ( t ) ← P articles ( t ) ∪ { ( µ new , Σ new ) } end for end for The main challenge in the above approach consists of selecting a good sampling procedure for the particles. Whilstit may be a natural choice to sample particles ( V t +1 , W t +1 ) from their prior distribution, this is not suitable for theproblem considered in this paper. In particular, this sampling procedure would not be robust to outliers: the stronger ananomaly was, the less likely we would be to sample a particle with an appropriate value of ( V t +1 , W t +1 ) , as discussedby [10].Adopting ideas from [16] and [20], we overcome the above challenge by sampling particles from an approximationto the conditional distribution of ( V t +1 , W t +1 ) given observation Y t +1 . Denote the model’s prior distribution for ( V t +1 , W t +1 ) in (4) by π ( · ) . The conditional distribution π ( W t +1 , V t +1 | Y t +1 ) for the descendants of a particle whoseﬁltering distribution for x t is N ( µ , Σ ) is then proportional to π ( W , V ) L (cid:16) Y , CA , CA Σ A T C T + Σ A V + C Σ I WC T (cid:17) . Here we have dropped time indices for convenience, and L ( x , µ , Σ ) denotes the likelihood of an observation x undera N ( µ , Σ ) -model. Since at most one component is anomalous, we can re-write this as a sum over which, if any,4E-BASS - J ULY

8, 2020component is anomalous I { W = I , V = I } π ( I , I | Y ) + q (cid:88) j =1 I (cid:26) W = I + I ( j )˜ W ( j,j ) , V = I (cid:27) ˆ π j (cid:16) ˜ W ( j,j ) (cid:17) + p (cid:88) i =1 I (cid:26) W = I , V = I + I ( i )˜ V ( i,i ) (cid:27) ˜ π i (cid:16) ˜ V ( i,i ) (cid:17) . Here, we use the shorthand ˜ π i (cid:16) ˜ V ( i,i ) (cid:17) = π (cid:18) I , I + I ( i ) ˜ V ( i,i ) | Y (cid:19) and ˆ π j (cid:16) ˜ W ( j,j ) (cid:17) = π (cid:18) I + I ( j ) ˜ W ( j,j ) , I | Y (cid:19) . Since the target distribution π ( W , V | Y ) is intractable, we construct an approximation to it, which we denote q ( W , V | Y ) ,and use this as our proposal distribution. This proposal is proportional to I { W = I , V = I } β + q (cid:88) j =1 I (cid:26) W = I + I ( j )˜ W ( j,j ) , V = I (cid:27) ˆ β j ˆ q j (cid:16) ˜ W ( j,j ) (cid:17) + p (cid:88) i =1 I (cid:26) W = I , V = I + I ( i )˜ V ( i,i ) (cid:27) ˜ β i ˜ q i (cid:16) ˜ V ( i,i ) (cid:17) . Clearly, there is no beneﬁt in simulating multiple identical descendants, so we wish to sample precisely one dependentthat corresponds to no outliers. To do this, and also to have the same number of descendant particles for each possibletype of outlier, we set β = M ( p + q ) , ˜ β i = M M ( p + q ) , and ˆ β j = M M ( p + q ) , and use stratiﬁed subsampling as in [19].This leads to M (cid:48) = M ( p + q ) + 1 total descendants per particle, M for each of the p additive and q innovative outliers,and one for no outlier. Each of these particles is then given a weight proportional to π ( W t +1 , V t +1 | Y t +1 ) q ( W t +1 , V t +1 | Y t +1 ) . The main challenge now consists of obtaining proposal distributions ˜ q i ( · ) for ≤ i ≤ p and ˆ q j ( · ) for ≤ j ≤ q that provide good approximations to the conditional posteriors which are proportional to ˜ π i ( · ) and ˆ π j ( · ) respectively.In the next subsection, we therefore derive proposal distributions that provide leading order approximations to theconditional posteriors. To simplify notation, we deﬁne the predictive variance ˆ Σ = CA Σ A T C T + Σ A + C Σ I C T anduse it throughout the remainder of this paper. We also begin by assuming that C contains no -columns. The proposalintroduced in the following subsection also forms the basis of back-sampling introduced in Section 4, which allows torelax this on C . For ≤ i ≤ p , we would like the proposal distribution ˜ q i (cid:16) ˜ V ( i,i ) (cid:17) for the precision, ˜ V ( i,i ) , to be as close as possible to ˜ π i (cid:16) ˜ V ( i,i ) (cid:17) or, equivalently, proportional to f i (cid:16) ˜ V ( i,i ) (cid:17) exp (cid:32) − ( Y − CA µ ) T (cid:18) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) I ( i ) (cid:19) − ( Y − CA µ ) (cid:33)(cid:115)(cid:12)(cid:12)(cid:12)(cid:12) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) I ( i ) (cid:12)(cid:12)(cid:12)(cid:12) , where f i () denotes the PDF of the ˜ σ i Γ( a i , a i ) -distributed prior of ˜ V ( i,i ) .It should be noted that the intractable terms, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) I ( i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) and (cid:32) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) I ( i ) (cid:33) − (5)can both be expanded using the matrix determinant lemma and the Sherman Morrison formula respectively, as they arerank 1 updates of a determinant and inverse respectively. Indeed, by the matrix determinant lemma, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) I ( i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ˆ Σ (cid:12)(cid:12)(cid:12) ˜ V ( i,i ) (cid:18) Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) + O (cid:16) ˜ V ( i,i ) (cid:17)(cid:19) , ULY

8, 2020the leading order term is conjugate to the prior of ˜ V ( i,i ) . Moreover, by the Sherman Morrison formula the second termin Equation (5) is equal to ˆ Σ − − ˆ Σ − I ( i ) ˆ Σ −  (cid:16) ˆ Σ − (cid:17) ( i,i ) −  (cid:16) ˆ Σ − (cid:17) ( i,i )  ˜ V ( i,i ) Σ ( i,i ) A  , up to O (cid:18)(cid:16) ˜ V ( i,i ) (cid:17) (cid:19) . Crucially, the ﬁrst two terms are constant in ˜ V ( i,i ) , while the third is linear in ˜ V ( i,i ) and thereforereturns a term which is conjugate to the prior of ˜ V ( i,i ) . Furthermore, we are most concerned about accurately samplingthe particle when an anomaly occurs in the i th component, which happens when the precision, ˜ V ( i,i ) , and the higherorder terms, become small.Keeping only the leading order terms in the determinant and the exponential term results in the proposal distribution ˜ V ( i,i ) ∼ ˜ σ i Γ  a i + 12 , a i + ˜ σ i Σ ( i,i ) A  (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:16) ˆ Σ − (cid:17) ( i,i )   for ˜ V ( i,i ) . More detailed derivations, including the associated weight are given by Theorem 1 in the appendix. Thisproposal has the property that as the observed anomaly in the i th component becomes larger, i.e. as Σ ( i,i ) A  (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:16) ˆ Σ − (cid:17) ( i,i )  increases, the mean of the proposal for ˜ V ( i,i ) diverges from the prior mean and behaves asymptotically like (2 a i + 1) Σ ( i,i ) A  (cid:16) ˆ Σ − (cid:17) ( i,i ) (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ )  . Consequently, the variance and the squared residual will be on the same scale, thus achieving computational robustness.A very similar approach can be used to obtain a proposal distribution ˆ q j (cid:16) ˜ W ( j,j ) (cid:17) which provides a leading orderapproximation for the distribution proportional to π (cid:16) I + W ( j,j ) I ( j ) , I | Y (cid:17) . The proposal consists of sampling ˜ W ( j,j ) ∼ ˆ σ j Γ  b j + 12 , b j + ˆ σ i Σ ( j,j ) I  (cid:0) C T (cid:1) ( j, :) ˆ Σ − ( Y − CA µ ) (cid:16) C T ˆ Σ − C (cid:17) ( j,j )   and is of very similar form to the proposal distribution for particles with an additive outlier and well deﬁned if C has no -columns. Further details, including the associated weight, are given in Theorem 2 in the appendix. Like the proposaldistribution for particles with an additive anomaly this proposal is computationally robust: it ensures that the squaredresidual and the variance will be on the same scale as the anomaly in the j th innovative component becomes stronger.Finally, the “proposal" for particles without anomalies consists of deterministically setting V = I and W = I . Theweight associated with this particle is proportional to the likelihood, the closed form of which is given in Theorem 3 inthe appendix. The choice of hyper-parameters, particularly ˆ σ i and ˜ σ i , has a signiﬁcant effect of the performance of the proposedﬁlter. One reason for this is that an outlier observation could be the result of either an additive or an innovative outlier.It may be that the root cause can only be determined after further observations are made. Thus, we wish to choosehyper-parameters in such a way as to ensure that observed anomalies, which are equally well explained by differentclasses of anomalies, are given similar importance weights. The following result describes such a choice:6E-BASS - J ULY

8, 2020 (a) t=100 (b) t=101 (c) Full data

Figure 2: Robust particle ﬁlter output at various times. Additive anomalies are denoted by red points, innovativeanomalies by blue lines. Grey observations are yet to be observed.

Theorem 4

Let the prior for the hidden state X t be N ( µ , Σ ) and an observation Y t +1 := Y be available. When ˜ σ i = Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) and ˆ σ j = Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) , and a = ... = a p = b = ... = b q = c , the weights of additive and innovative anomalies are asymptoticallyproportional to c c M r i Γ( c + )Γ( c ) exp (cid:0) δ (cid:1)(cid:0) δ (cid:1) c and c c M s j Γ( c + )Γ( c ) exp (cid:0) δ (cid:1)(cid:0) δ (cid:1) c when Y − CA µ = δ e i (cid:114)(cid:16) ˆ Σ − (cid:17) ( i,i ) and Y − CA µ = δ C (: ,j ) (cid:114)(cid:16) C T ˆ Σ − C (cid:17) ( j,j ) , respectively, as δ → ∞ The above choice of hyper-parameters therefore leads to all components being given equal asymptotic importance weightunder an anomaly they are able to account for. I.e. one which satisﬁes C (: ,j ) (cid:113) ( C T ˆ Σ − C ) ( j,j ) δ = Y − CA µ = δ e i (cid:113) ( ˆ Σ − ) ( i,i ) .Setting all the a i s and b j s to the same constant is advisable due to the fact that the convolution of two t -distributionswhose means drift further and further apart yields two stable, i.e. non-vanishing modes if and only if they have the samescale parameter.While, ˆ Σ − is not ﬁxed but time dependent, it nevertheless converges to a limit under an observable Kalman ﬁltermodel. In practice, we therefore use this limit to set ˜ σ i and ˆ σ j . The proposed ﬁlter can be applied to the data displayed in Figure 1a to detect anomalies in an online fashion. It is worthpointing out that the ﬁlter re-evaluates past anomalies as more data becomes available. This can be seen in Figure 2:When initially encountering the anomaly at time t = 100 the ﬁlter gives approximately equal weight to the possibilityof it being an additive outlier and to it being an innovative one. It is only when the next observation becomes available,that the ﬁlter (correctly) classiﬁes it as an innovative anomaly. Note that only N = 20 particles were used and only M = 1 descendent of each anomaly type was sampled per particle. As mentioned in the introduction, it is possible that innovative outliers may not immediately be observed. One suchexample are innovative outliers in the trend component of the model described in (3). The ﬁlter as described inAlgorithm 1 can not deal with such anomalies as it only inﬂates the variance of the innovative process at time t whenthere is evidence in the observation at the same time t that an outlier occurred. This can be remedied by back-sampling7E-BASS - J ULY

8, 2020particles representing innovative outliers at a later time, t + k , once more observations and therefore evidence for ananomaly are available. This can be done using nearly identical approximation strategies as used in the previous sectionand allows to relax the assumptions made in the previous section from C not having any -columns to requiring that thesystem be observable. k + 1 Observations

The proposed back-sampling strategy at time t consists of sampling particles for ( V t +1 − k , ... V t +1 , W t +1 − k , ..., W t +1 ) given a N ( µ t − k , Σ t − k ) ﬁltering distribution for x t − k and observations Y t − k +1 , ..., Y t − k . Speciﬁcally, we sampleparticles with a innovative single anomaly in W t +1 − k assuming no other innovative anomalies or additive anomalies.Conditional on these augmented particles classical Kalman updates can once more be used as shown in Algorithm 2. Itshould be noted that Algorithm 1 is a special case of Algorithm 2 which arises from setting B = ... = B q = { } . Algorithm 2

Particle Filter (With Back Sampling) – CE-BASS

Input:

An initial state estimate ( µ , Σ ) .A number of descendants, M (cid:48) = M ( p + q ) + 1 .A number of particles to be maintained, N .A stream of observations Y , Y , ... Initialise:

Set

P articles (0) = { ( µ , Σ , } Set max _ horizon = max ( ∪ qi =1 B i ) for t ∈ N + do Cand ← {} (cid:46)

To Store Candidates3: for ( µ , Σ , prob prev ) ∈ P articles ( t − do ( V , W , prob ) ← Sample_typical ( µ , Σ , Y t , A , C , Σ A , Σ I ) Cand ← Cand ∪ { ( µ , Σ , V , W , prob · prob prev , } Add _ Des ← Sample_additive ( µ , Σ , Y t , A , C , Σ A , Σ I , M ) for ( V , W , prob ) ∈ Add _ Des do Cand ← Cand ∪ { ( µ , Σ , V , W , prob · prob prev , } end for end for for hor ∈ { , ..., max _ horizon } do for ( µ , Σ , prob prev ) ∈ P articles ( t − hor ) do ˜ Y ← (cid:2) Y Tt − hor +1 , ..., Y Tt (cid:3) T Inn _ Des ← BS_inn ( µ , Σ , ˜ Y , A , C , Σ A , Σ I , M, hor ) for ( V , W , prob ) ∈ Inn _ Des do Cand ← Cand ∪ { ( µ , Σ , V , W , prob · prob prev , hor ) } end for end for end for Desc ← Subsample ( N, Cand ) (cid:46) Sampling proportional to prob

P articles ( t ) ← {} for ( µ , Σ , V , W , prob, hor ) ∈ Descendants do ( µ , Σ ) ← KF_Upd ( Y t +1 − hor , µ , Σ , C , A , V / Σ A , W / Σ I ) if hor > then for i ∈ { , ..., hor } do ( µ , Σ ) ← KF_Upd ( Y t + i − hor , µ , Σ , C , A , Σ A , Σ I ) end for end if P articles ( t ) ← P articles ( t ) ∪ { ( µ , Σ , prob · | Cand || Desc | ) } end for end for To sample a particle with an innovative anomaly in the j th component of W t +1 − k , we deﬁne an augmented observationvector ˜ Y ( k ) t +1 − k = ( Y Tt +1 − k , ..., Y Tt +1 ) T . This is normally distributed with mean ˜ C ( k ) A µ t − k and variance ˜ C ( k ) (cid:16) A Σ t − k A T + ˜ Q ( k ) (cid:17) (cid:16) ˜ C ( k ) (cid:17) T + ˜ R ( k ) , ULY

8, 2020where ˜ C ( k ) = C (cid:18)(cid:0) A (cid:1) T , ..., (cid:16) A k (cid:17) T (cid:19) T denotes the augmented matrix mapping the hidden states to the observations, ˜ R ( k ) =  V − t +1 − k Σ A . . . . . . . . . V − t +1 Σ A  and ˜ Q ( k ) =  W − t +1 − k Σ I . . . . . . . . . W − t +1 Σ I  In a similar spirit, we deﬁne the augmented predictive variance to be ˆ Σ ( k ) = ˜ C ( k ) (cid:16) A Σ t − k A T + I k +1 ⊗ Σ I (cid:17) (cid:16) ˜ C ( k ) (cid:17) T + I k +1 ⊗ Σ A . As a result of this reformulation, we retrieve update equations consisting of a single Kalman step, albeit with slightlydifferent dimensions of the observation, ( k + 1) p instead of p . It is therefore possible to use the sampling procedure forinnovative outliers introduced in Section 3.1. This consists of sampling particles for ˜ W ( j,j ) t +1 − k from ˆ σ j Γ  b j + 12 , b j + ˆ σ j Σ ( j,j ) I  (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:19) ( j, :) (cid:16) ˆ Σ ( k ) (cid:17) − ˜ z ( k ) t +1 − k (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − ˜ C ( k ) (cid:19) ( j,j )   . for the residual ˜ z ( k ) t +1 − k ˜ Y ( k ) t +1 − k − ˜ C ( k ) A µ t − k . The associated weight is given in Theorem 5 in the appendix.As in Section 3.2, we want to give different particles equal weights if they explain anomalies equally well. In particular,we therefore want to balance out the weights given to the back-sampled particles and the descendants of particles withan anomaly sampled at time t − k + 1 using just Y t +1 − k . In order to do so, consider observations Y t +1 , ..., Y t +1 − k which are such that they perfectly ﬁt an innovative outlier in the i th innovative component at time t − k + 1 , i.e. ˜ Y ( k ) t +1 − k − (cid:16) ˜ C ( k ) (cid:17) A µ t − k = (cid:16) ˜ C ( k ) (cid:17) (: ,j ) (cid:115)(cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j ) δ. As δ grows, the importance weight behaves as b b j j M s j Γ( b j + )Γ( b j ) exp (cid:0) − δ (cid:1) ˆ σ j Σ ( j,j ) I (cid:18)(cid:16) ˜ C ( k ) (cid:17) T ( ˆ Σ ( k ) ) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j ) δ  b j , up to the likelihood term and the (cid:16) − (cid:80) pi =1 r i − (cid:80) qj =1 s j (cid:17) k factor. However, these terms are also present inthe weights of the descendants of the particles sampled at t + 1 − k if no further anomaly was sampled at times t + 2 − k, ..., t + 1 . Therefore, setting ˆ σ j = Σ ( j,j ) I (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j ) results in the same asymptotic probabilities as the one obtained in Section 3.2. Given ˆ σ j can only take a single value weset ˆ σ j = max k ∈B j (cid:32) Σ ( j,j ) I (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j ) (cid:33) , ULY

8, 2020 (a) t=820 (b) t=821 (c) Full data

Figure 3: Robust particle ﬁlter output at various times. Additive anomalies are denoted by red points, innovativeanomalies by blue lines. Grey observations are yet to be observed.where B j ⊂ N denotes the set of horizons used to back-sample the j th component of the W t .A range of observations guide the choice of the sets B j for ≤ j ≤ q . We assume that the Kalman model is observable,i.e. that there exists a k such that the matrix (cid:20) ( C ) T , ( CA ) T , ..., (cid:16) CA k (cid:17) T (cid:21) has full column rank. Let k ∗ denote thelowest such k . It is advisable to choose the set B j such that it contains at least one element greater or equal to k ∗ . Thereason for this being that any innovative anomaly capable of eventually inﬂuencing the observations must do so within k ∗ observations from occurring. It should also be noted that a horizon h can only be in the set B j if the j th column ofthe augmented mapping from the hidden states to the observations, ˜ C ( h ) , is non-zero as this is required by the proposal.Consequently, setting B j = (cid:26) k ∈ { , ..., k ∗ } : (cid:16) ˜ C ( k ) (cid:17) (: ,j ) (cid:54) = (cid:27) is a natural choice. With back-sampling, we are now able to tackle the example from Figure 1b. We used B = { , ..., } , B = { , ..., } ,to sample back up to 40 observations. We maintained N = 40 particles and sampled M = 1 descendants of each type.The output of the particle ﬁlter can be seen in Figure 3. As before, the ﬁlter updates its output as new observationsbecome available. Whilst the trend innovation occurs at time t = 800 , the anomaly is ﬁrst detected around time t = 820 .Even then, there is a large amount of uncertainty regarding the precise location of the anomaly which only gets resolvedat a later time. We now turn to comparing CE-BASS against other methods. In particular, we compare against the t -distribution basedadditive outlier robust ﬁlter by [8], the Huberisation based additive outlier robust ﬁlter by [9], the Huberisation basedinnovative outlier robust ﬁlter by [9], and the classical Kalman Filter [4]. All these algorithms are implemented in theaccompanying package.We consider four different models and generate 1000 observations for each. For each of the four models, we considera case in which no anomalies are present, a case in which only additive anomalies are present, a case in which onlyinnovative anomalies are present, and a case in which both additive and innovative anomalies are present. Whenanomalies are added, they are added at times t = 100 , t = 300 , t = 600 , and t = 900 . Speciﬁcally we considered thefollowing three models:1. The model of Example 1 with σ A = 1 and σ I = 0 . . We consider a case with only additive outliers, a case withonly innovative outliers, and a case where an additive outlier at t = 100 , is followed by two innovative outliersat times t = 300 and t = 600 , which were then followed by an additive outlier at time t = 900 . To simulateadditive anomalies, we set V t σ A (cid:15) t = 10 and to simulate the innovative outliers we set W t σ I ν t = 10 .10E-BASS - J ULY

8, 2020 (a) Case 1 (b) Case 1, IOs (c) Case 1, AOs (d) Case 1, Both(e) Case 2 (f) Case 2, IOs (g) Case 2, AOs (h) Case 2, Both(i) Case 3 (j) Case 3, IOs (k) Case 3, AOs (l) Case 3, Both(m) Case 4 (n) Case 4, IOs (o) Case 4, AOs (p) Case 4, Both

Figure 4: Violin plots for the average predictive log-likelihood of the ﬁve ﬁlters (IOAO: CE-BASS, KF: The classicalKalman Filter, AO T: [8], AO H: [9], IO H: [9]) over the four different scenarios under a range of models. Higher valuescorrespond to better performance. Methods are omitted on the graphs if they can not be applied to the setting or if theirperformance is too poor. 11E-BASS - J

ULY

8, 20202. The random walk model with two measurements Y (1) t = X t + (cid:16) V (1) t (cid:17) σ (1) A (cid:15) (1) t , X t = X t − + W t σ I ν t Y (2) t = X t + (cid:16) V (2) t (cid:17) σ (2) A (cid:15) (2) t , where σ (1) A = σ (2) A = 1 for i = 1 , and σ I = 0 . . We consider a case with only additive outliers (one in theﬁrst component, then two in the second, then one in the ﬁrst), a case with only innovative outliers, and a casewhere an additive outlier in the ﬁrst component at time t = 100 is followed by two innovative outliers at times t = 300 and t = 600 , which are then followed by an additive outlier in the second component at time t = 900 .For additive anomalies, we set (cid:16) V (1) t (cid:17) σ (1) A (cid:15) (1) t = 10 or (cid:16) V (2) t (cid:17) σ (2) A (cid:15) (2) t = 10 and for innovative outliers,we set W t σ I ν t = 10 .3. The model of Example 2 with σ A = 1 , σ (1) I = 0 . and σ (2) I = 0 . . We consider a case with only additiveoutliers, a case with only innovative outliers (one in the second component, then one in the ﬁrst, then one inthe second, then one in the ﬁrst), and a case with an additive outlier at t = 100 , followed by an innovativeoutlier affecting the ﬁrst component of the hidden state at times t = 300 , followed by an innovative outlieraffecting the second component of the hidden state at times t = 600 , followed by an additive outlier at time t = 900 . The additive anomalies were instances where we set V t (cid:15) t = 30 and the innovative outliers wereinstances where we set (cid:16) W (1) t (cid:17) η (1) t = 100 or (cid:16) W (2) t (cid:17) η (2) t = 500 .4. An extension of Example 2 where the position is also observed. The equations governing the hidden state areas before whilst the equations governing the observations are Y (1) t = X (1) t + (cid:16) V (1) t (cid:17) σ (1) A (cid:15) (1) t ,Y (2) t = X (2) t + (cid:16) V (2) t (cid:17) σ (2) A (cid:15) (2) t , where σ (1) A = σ (2) A = 1 . We consider a case with only additive outliers (in the ﬁrst component only), a casewith only innovative outliers (one in the second component, then one in the ﬁrst, then one in the second, thenone in the ﬁrst), and a case with an additive outlier at time t = 100 , followed by an innovative outlier affectingthe ﬁrst component of the hidden state at time t = 300 , followed by an innovative outlier affecting the secondcomponent of the hidden state at time t = 600 , followed by an additive outlier at time t = 900 . For additiveanomalies, we set (cid:16) V (1) t (cid:17) σ (1) A (cid:15) (1) t = 30 and for innovative outliers, we set (cid:16) W (1) t (cid:17) σ (1) I η (1) t = 100 or (cid:16) W (2) t (cid:17) σ (2) I η (2) t = 500 .We evaluate the different methods based on average predictive log-likelihood and average predictive mean squarederror. We exclude all observations corresponding to anomalies from the calculation of these averages since the ﬁlterscan not be expected to predict them. When calculating the average mean squared error we additionally remove oneobservation after the anomaly in the ﬁrst setting and two observations in the third setting from the performance metric.This is to give the ﬁlter enough information to determine which type of anomaly the outlier corresponds to and return toa unimodal posterior, as the MSE is only an appropriate metric for unimodal posteriors.The average log-likelihoods across all models can be found in Figure 4, while the qualitatively very similar results forthe mean squared error can be found in the appendix. We see that the performance of CE-BASS compares favourablywith that of the competing methods. In particular it is as accurate as the Kalman ﬁlter in the absence of anomalies andis more accurate than the additive outlier and innovative outlier robust ﬁlters even when only additive or innovativeoutliers are present, i.e. the settings for which these algorithms were designed. In this section, we apply CE-BASS to two real datasets. We will use different types of models for the two applicationsto illustrate the way in which CE-BASS can be used. The ﬁrst dataset is a labelled benchmark dataset which consistsof temperature readings on a large industrial machine. Here, we will use a model which considerably restricts themovements of the hidden states when no anomalies are present, and thus emulates a changepoint model. The second is12E-BASS - J

ULY

8, 2020 (a) Raw data with labels (b) CE-BASS output

Figure 5: Machine temperature dataset. The labelled anomalies are: a planned shutdown, an early warning sign of aproblem, and the catastrophic system failure caused by the problem.an unlabelled dataset which consist of repeated throughput measurements on a router. For that application we will use amodel which has a considerable amount of ﬂexibility and where the hidden states tend to follow the observations andtherefore detect localised anomalies.

We now apply CE-BASS to the machine temperature data taken from the Numenta Anomaly Benchmark (NAB, [21])which can be accessed at https://github.com/numenta/NAB . The data consists of over 20000 readings from a temperaturesensor on a large industrial machine and is displayed in Figure 5a along the three periods of anomalous behaviourlabelled by an engineer. The ﬁrst corresponds to a planned shutdown and the second to an early warning sign of thethird anomaly – a catastrophic failure.In order to do so, we use the random walk model from Example 1 with the aim of detecting persistent changes inmean. We therefore use a maximum backsampling horizon of 250 by setting B = { , , , , , , , } andﬁx σ I = 1 / σ A to ensure that long and weak anomalies will not be interpreted as a persistent shift in the typicalstate. We use the ﬁrst 15% of the data, marked by [21] as train data, to estimate the standard deviation σ A as well as theinitial mean µ using the median absolute deviation and the median respectively. Using robust covariance methods wealso detect very strong auto-correlation ( ρ = 0 . ) and therefore took the default probabilities for anomalies to thepower of − ρ .The results of this analysis can be seen in Figure 5b. We note that all anomalies ﬂagged by the engineer are also beingdetected by CE-BASS. Two additional innovative anomalies around a prolonged drop which preceded the plannedshutdown are also detected. They could be a false positive or an early warning sign of an anomaly prevented by theshutdown which has not been noticed by the engineer. The online analysis of aggregated trafﬁc data on servers is an important challenge in both predictive maintenance andcyber security. This is because anomalies in throughput can point towards problems in the network such as malfunctionsor malicious behaviour. Detecting anomalies as soon as possible therefore means that the root cause can be addressedmore quickly – potentially even before user experience is affected or harm caused.13E-BASS - J

ULY

8, 2020 (a) Day 11 (b) Day 12 (c) Day 13(d) Day 14 (e) Day 15 (f) Day 16(g) Day 17 (h) Day 18 (i) Day 19

Figure 6: CE-BASS applied to 9 days of de-seasonalised router data. Lines correspond to innovative anomalies, i.e.spikes or level shifts. 14E-BASS - J

ULY

8, 2020In this section, we consider 19 days worth of data from a network IP router which has been gathered at a frequency ofone observation every 30 seconds. To preserve conﬁdentiality, we de-seasonalised the data for days 11 to 19 using aseasonality model trained on days 1 to 10 and, for the purpose of this paper, consider only the de-seasonalised datafor days 11 to 19 which can be found in Figures 6a to 6i. The main features apparent in the daily series are spikes,outliers, and changepoints. In order to capture these, we use an AR(1) model with slowly changing mean to model theobservations Y t . Formally, we used the model Y t = X (1) t + X (2) t + V t σ A (cid:15) t , X (1) t = X (1) t − + W (1) t σ (1) I η (1) t ,X (2) t = ρX (2) t − + W (2) t σ (2) I η (2) t . Here, anomalies in (cid:15) t correspond to isolated outliers, anomalies in η (1) t correspond to level shifts and outliers in η (2) t correspond to spikes.We use the ﬁrst 1000 observations of the ﬁrst day, to obtain the estimates σ A = 0 . , σ (1) I = 0 . , σ (2) I = 0 . ,and ρ = 0 . . The result obtained from running CE-BASS with these parameters on the daily router data is displayedin Figures 6a to 6i. We note that very few of the anomalies returned can be classed as false positives. At the same time,a large number of anomalies are ﬂagged, including a large number of outliers and spikes, but also some level shifts(Day 14). Discussion with engineers highlighted that the anomalies detected matched well with their knowledge ofthe data. This shows CE-BASS’s ability to return a large number of diverse features which can be used as inputs to asupervised algorithm should labels become available. This work was supported by EPSRC grant numbers EP/N031938/1 (StatScale) and EP/L015692/1 (STOR-i). Theauthors also acknowledge British Telecommunications plc (BT) for ﬁnancial support, David Yearling and TrevorBurbridge in BT Research for discussions.

References [1] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey.

ACM computing surveys(CSUR) , 41(3):15, 2009.[2] Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection.

SignalProcessing , 99:215–249, 2014.[3] Alexander T M Fisch, Idris A Eckley, and Paul Fearnhead. A linear time method for the detection of point andcollective anomalies. arXiv preprint arXiv:1806.01947 , 2018.[4] Rudolph Emil Kalman. A new approach to linear ﬁltering and prediction problems.

Transactions of the ASME–Journal of Basic Engineering , 82(Series D):35–45, 1960.[5] Mital A Gandhi and Lamine Mili. Robust Kalman ﬁlter based on a generalized maximum-likelihood-typeestimator.

IEEE Transactions on Signal Processing , 58(5):2509–2520, 2009.[6] Yulong Huang, Yonggang Zhang, Ning Li, Zhemin Wu, and Jonathon A Chambers. A novel robust student’st-based Kalman ﬁlter.

IEEE Transactions on Aerospace and Electronic Systems , 53(3):1545–1554, 2017.[7] Jo-Anne Ting, Evangelos Theodorou, and Stefan Schaal. Learning an outlier-robust Kalman ﬁlter. In

EuropeanConference on Machine Learning , pages 748–756. Springer, 2007.[8] Gabriel Agamennoni, Juan I Nieto, and Eduardo M Nebot. An outlier-robust Kalman ﬁlter. In , pages 1551–1558. IEEE, 2011.[9] Peter Ruckdeschel, Bernhard Spangl, and Daria Pupashenko. Robust Kalman tracking and smoothing withpropagating and non-propagating outliers.

Statistical Papers , 55(1):93–123, 2014.[10] Guobin Chang. Robust Kalman ﬁltering based on Mahalanobis distance as outlier judging criterion.

Journal ofGeodesy , 88(4):391–401, 2014.[11] Yulong Huang, Yonggang Zhang, Yuxin Zhao, and Jonathon A Chambers. A novel robust gaussian-student’s tmixture distribution based Kalman ﬁlter.

IEEE Transactions on Signal Processing , 2019.[12] Genshiro Kitagawa. Non-gaussian state—space modeling of nonstationary time series.

Journal of the Americanstatistical association , 82(400):1032–1041, 1987. 15E-BASS - J

ULY

8, 2020[13] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-gaussian bayesian stateestimation.

IEE Proceedings F - Radar and Signal Processing , 140(2):107–113, 1993.[14] Paul Fearnhead and Hans R. Künsch. Particle ﬁlters and data assimilation.

Annual Review of Statistics and ItsApplication , 5(1):421–449, 2018.[15] Rong Chen and Jun S Liu. Mixture Kalman ﬁlters.

Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 62(3):493–508, 2000.[16] Michael K Pitt and Neil Shephard. Filtering via simulation: Auxiliary particle ﬁlters.

Journal of the Americanstatistical association , 94(446):590–599, 1999.[17] Paul Fearnhead and Guillem Rigaill. Changepoint detection in the presence of outliers.

Journal of the AmericanStatistical Association , 114(525):169–183, 2019.[18] Hyeyoung Maeng and Piotr Fryzlewicz. Detecting linear trend changes and point anomalies in data sequences. arXiv preprint arXiv:1906.01939 , 2019.[19] Paul Fearnhead and Peter Clifford. On-line inference for hidden Markov models via particle ﬁlters.

Journal of theRoyal Statistical Society: Series B (Statistical Methodology) , 65(4):887–899, 2003.[20] M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle ﬁlters for onlinenonlinear/non-gaussian bayesian tracking.

IEEE Transactions on signal processing , 50(2):174–188, 2002.[21] Alexander Lavin and Subutai Ahmad. Evaluating real-time anomaly detection algorithms–the numenta anomalybenchmark. In , pages38–44. IEEE, 2015. 16E-BASS - J

ULY

8, 2020

Let the prior for the hidden state X t be N ( µ , Σ ) and an observation Y t +1 := Y be available. Then thesamples for ˜ V ( i,i ) t +1 from ˜ σ i Γ  a i + 12 , a i + ˜ σ i Σ ( i,i ) A  (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:16) ˆ Σ − (cid:17) ( i,i )   have associated weight M r i Γ( a i + )Γ( a i ) √ ˜ σ i a a i i (cid:32) a i + ˜ σ i Σ ( i,i ) A (cid:18) ( ˆ Σ − ) ( i, :) ( Y − CA µ ) ( ˆ Σ − ) ( i,i ) (cid:19) (cid:33) a i + exp (cid:16) − ( Y − CA µ ) T ˆ Σ − ( Y − CA µ ) (cid:17)(cid:113) | ˆ Σ | (cid:115)(cid:18) ˜ V ( i,i ) + Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) (cid:19) exp    ˜ V ( i,i ) t +1 Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i )  Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) + ˜ V ( i,i ) t +1   (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:114)(cid:16) ˆ Σ − (cid:17) ( i,i )   . Proof : We wish to sample from the posterior distribution of ˜ V ( i,i ) t +1 which is proportional to r i f i (cid:16) ˜ V ( i,i ) t +1 (cid:17) exp (cid:32) − ( Y − CA µ ) T (cid:18) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) t +1 I ( i ) (cid:19) − ( Y − CA µ ) (cid:33)(cid:115)(cid:12)(cid:12)(cid:12)(cid:12) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) t +1 I ( i ) (cid:12)(cid:12)(cid:12)(cid:12) , (6) where f i () denotes the PDF of a ˜ σ i Γ( a i , a i ) -distribution. The intractable part in the above consists of  ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) t +1 I ( i )  − , where I ( i ) = e i e Ti is a matrix which is 0 everywhere with the exception of the i th entry of the i th row, which is 1. Notethat I ( i ) has rank 1 and therefore, by the Sherman Morrison formula,  ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) t +1 I ( i )  − = ˆ Σ − − ˆ Σ − I ( i ) ˆ Σ − tr ( ˆ Σ − I ( i ) ) Σ ( i,i ) A ˜ V ( i,i ) t +1 Σ ( i,i ) A ˜ V ( i,i ) t +1 = ˆ Σ − − tr ( ˆ Σ − I ( i ) ) ˆ Σ − I ( i ) ˆ Σ − tr ( ˆ Σ − I ( i ) ) Σ ( i,i ) A ˜ V ( i,i ) t +1 . Furthermore, given tr ( ˆ Σ − I ( i ) ) = (cid:16) ˆ Σ − (cid:17) ( i,i ) , the above is equal to ˆ Σ − − ˆ Σ − I ( i ) ˆ Σ −  (cid:16) ˆ Σ − (cid:17) ( i,i ) −  (cid:16) ˆ Σ − (cid:17) ( i,i )  ˜ V ( i,i ) t +1 Σ ( i,i ) A +  ˜ V ( i,i ) t +1 Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i )  (cid:16) ˆ Σ − (cid:17) ( i,i ) + Σ ( i,i ) A ˜ V ( i,i ) t +1  . Crucially, the ﬁrst term is constant in ˜ V ( i,i ) t +1 , while the second is linear in ˜ V ( i,i ) t +1 and therefore conjugate to the prior of ˜ V ( i,i ) t +1 . The last term is quadratic in ˜ V ( i,i ) t +1 and therefore vanishing much faster than the other two terms as ˜ V ( i,i ) t +1 goes to0, i.e. as the anomaly becomes stronger. 17E-BASS - J ULY

8, 2020A very similar result for rank 1 updates of determinants, the matrix determinant Lemma, can be used to show that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ Σ + Σ ( i,i ) A ˜ V ( i,i ) t +1 I ( i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ˆ Σ (cid:12)(cid:12)(cid:12)  Σ ( i,i ) A ˜ V ( i,i ) t +1 (cid:16) ˆ Σ − (cid:17) ( i,i )  . Furthermore, given that −

12 ( Y − CA µ ) T ˆ Σ − I ( j ) ˆ Σ − ( Y − CA µ ) is equal to − (cid:18)(cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:19) , we can rewrite the posterior of ˜ V ( i,i ) t +1 in Equation (6) as r i f ( V ( i,i ) t +1 ) (cid:113) | ˜ V ( i,i ) t +1 | exp  − ˜ V ( i,i ) t +1 Σ ( i,i ) A  (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:16) ˆ Σ − (cid:17) ( i,i )   exp (cid:16) − ( Y − CA µ ) T ˆ Σ − ( Y − CA µ ) (cid:17)(cid:113) | ˆ Σ | (cid:115)(cid:18) ˜ V ( i,i ) + Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) (cid:19) exp    ˜ V ( i,i ) t +1 Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i )  Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) + ˜ V ( i,i ) t +1   (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:114)(cid:16) ˆ Σ − (cid:17) ( i,i )   Using conjugacy, we can therefore sample M particles for ˜ V ( i,i ) from ˜ σ i Γ  a i + 12 , a i + ˜ σ i Σ ( i,i ) A  (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:16) ˆ Σ − (cid:17) ( i,i )   and give each particle an importance weight proportional to M r i Γ( a i + )Γ( a i ) √ ˜ σ i a a i i (cid:32) a i + ˜ σ i Σ ( i,i ) A (cid:18) ( ˆ Σ − ) ( i, :) ( Y − CA µ ) ( ˆ Σ − ) ( i,i ) (cid:19) (cid:33) a i + exp (cid:16) − ( Y − CA µ ) T ˆ Σ − ( Y − CA µ ) (cid:17)(cid:113) | ˆ Σ | (cid:115)(cid:18) ˜ V ( i,i ) + Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) (cid:19) exp    ˜ V ( i,i ) t +1 Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i )  Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) + ˜ V ( i,i ) t +1   (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:114)(cid:16) ˆ Σ − (cid:17) ( i,i )   . Let the prior for the hidden state X t be N ( µ , Σ ) and an observation Y t +1 := Y be available. Then thesamples for ˜ W ( j,j ) from ˆ σ i Γ  b j + 12 , b j + ˆ σ j Σ ( j,j ) I  (cid:16) C T (cid:17) ( j, :) ˆ Σ − ( Y − CA µ ) (cid:16) C T ˆ Σ − C (cid:17) ( j,j )   have associated weight M s j Γ( b i + )Γ( b j ) (cid:112) ˆ σ j b b j j (cid:32) b j + ˆ σ i Σ ( j,j ) I (cid:18) ( C T ) ( j, :) ˆ Σ − ( Y − CA µ ) ( C T ˆ Σ − C ) ( j,j ) (cid:19) (cid:33) b i + exp (cid:16) − ( Y − CA µ ) T ˆ Σ − ( Y − CA µ ) (cid:17)(cid:113) | ˆ Σ | (cid:115)(cid:18) ˜ W ( j,j ) + Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) (cid:19) exp  (cid:32)  ˜ W ( j,j ) Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j )  Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) + ˜ W ( j,j ) t +1 (cid:33)  (cid:0) C T (cid:1) ( j, :) ˆ Σ − ( Y − CA µ ) (cid:114)(cid:16) C T ˆ Σ − C (cid:17) ( j,j )   ULY

8, 2020The proof is almost identical to that of Theorem 1 and has been omitted.

Let the prior for the hidden state X t be N ( µ , Σ ) and an observation Y t +1 := Y be available. Then theproposal particle ( I p , I q ) for ( V t , W t ) has weight proportional to (1 − p (cid:88) i =1 r i − q (cid:88) j =1 s j ) exp (cid:16) − ( Y − CA µ ) T ˆ Σ − ( Y − CA µ ) (cid:17)(cid:113) | ˆ Σ | . This is immediate from the Gaussian likelihood and the Bernoulli priors for λ ( i ) t and γ ( j ) t . Let the prior for the hidden state X t be N ( µ , Σ ) and an observation Y t +1 := Y be available. When ˜ σ i = Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) and ˆ σ j = Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) , and a = ... = a p = b = ... = b q = c , the weights of additive and innovative anomalies are asymptoticallyproportional to c c M r i Γ( c + )Γ( c ) exp (cid:0) δ (cid:1)(cid:0) δ (cid:1) c and c c M s j Γ( c + )Γ( c ) exp (cid:0) δ (cid:1)(cid:0) δ (cid:1) c when Y − CA µ = δ e i (cid:114)(cid:16) ˆ Σ − (cid:17) ( i,i ) and Y − CA µ = δ C (: ,j ) (cid:114)(cid:16) C T ˆ Σ − C (cid:17) ( j,j ) , respectively, as δ → ∞ Proof : Removing the likelihood term common to all particles the importance weights can be summarised as being M r i Γ( a i + )Γ( a i ) √ ˜ σ i a a i i (cid:32) a i + ˜ σ i Σ ( i,i ) A (cid:18) ( ˆ Σ − ) ( i, :) ( Y − CA µ ) ( ˆ Σ − ) ( i,i ) (cid:19) (cid:33) a i + (cid:115)(cid:18) ˜ V ( i,i ) + Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) (cid:19) exp    ˜ V ( i,i ) t +1 Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i )  Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) + ˜ V ( i,i ) t +1   (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ ) (cid:114)(cid:16) ˆ Σ − (cid:17) ( i,i )   . for the particles containing an anomaly in the i th additive component, and M s j Γ( b i + )Γ( b j ) (cid:112) ˆ σ j b b j j (cid:32) b j + ˆ σ i Σ ( j,j ) I (cid:18) ( C T ) ( j, :) ˆ Σ − ( Y − CA µ ) ( C T ˆ Σ − C ) ( j,j ) (cid:19) (cid:33) b i + (cid:115)(cid:18) ˜ W ( j,j ) + Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) (cid:19) exp (cid:32)   ˜ W ( j,j ) Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j )  Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) + ˜ W ( j,j ) t +1 (cid:33)  (cid:0) C T (cid:1) ( j, :) ˆ Σ − ( Y − CA µ ) (cid:114)(cid:16) C T ˆ Σ − C (cid:17) ( j,j )   for the particles containing an anomaly in the j th innovative component.As mentioned in Section II that the mean of the proposal of the i th additive component behaves asymptotically as (2 a i + 1) Σ ( i,i ) A  (cid:16) ˆ Σ − (cid:17) ( i,i ) (cid:16) ˆ Σ − (cid:17) ( i, :) ( Y − CA µ )  . ULY

8, 2020Furthermore, the standard deviation is on the same scale. We therefore have that ˜ V ( i,i ) t +1 ∼ δ as δ → ∞ . The weight of an anomaly in the i th additive component therefore asymptotically behaves as a a i i M r i Γ( a i + )Γ( a i ) exp (cid:0) δ (cid:1)(cid:18) ˜ σ i Σ ( i,i ) A ( ˆ Σ − ) ( i,i ) δ (cid:19) a i when Y − CA µ = (cid:113) ( ˆ Σ − ) ( i,i ) δ e i as δ → ∞ . A very similar reasoning can be used to show that the weight of ananomaly in the j th innovative component converges to b b j j M s j Γ( b j + )Γ( b j ) exp (cid:0) δ (cid:1)(cid:18) ˆ σ j Σ ( j,j ) I ( C T ˆ Σ − C ) ( j,j ) δ (cid:19) b j when Y − CA µ = C (: ,ij (cid:113) ( C T ˆ Σ − C ) ( j,j ) δ as δ → ∞ .The result then follows when all the b j s and the a i s are equal to the same constant c and ˜ σ i = Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) and ˆ σ j = Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) . Let the prior for the hidden state X t − k be N ( µ , Σ ) . Then the samples for ˜ W ( j,j ) t − k +1 from ˆ σ j Γ  b j + 12 , b j + ˆ σ j Σ ( j,j ) I  (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:19) ( j, :) (cid:16) ˆ Σ ( k ) (cid:17) − ˜ z ( k ) t +1 − k (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − ˜ C ( k ) (cid:19) ( j,j )   , where ˜ z ( k ) t +1 − k = ˜ Y ( k ) t +1 − k − ˜ C ( k ) A µ have associated weight M s i (cid:16) − (cid:80) pi (cid:48) =1 r i (cid:48) − (cid:80) qj (cid:48) =1 s j (cid:48) (cid:17) k Γ( bj + 12 )Γ( bj ) (cid:112) ˆ σ j b bjj  b i + ˆ σj Σ ( j,j ) I  (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:19) ( j, :) (cid:16) ˆ Σ ( k ) (cid:17) − (cid:18) ˜ z ( k ) t +1 − k (cid:19)(cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − C ( k ) (cid:19) ( j,j )   bj + 12 exp (cid:18) − (cid:16) ˜ z ( k ) t +1 − k (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ z ( k ) t +1 − k (cid:17)(cid:19)(cid:114)(cid:12)(cid:12)(cid:12) ˆ Σ ( k ) (cid:12)(cid:12)(cid:12)(cid:118)(cid:117)(cid:117)(cid:116)(cid:32) W ( j,j ) + Σ ( j,j ) I (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j ) (cid:33) exp (cid:32)   W ( j,j ) t +1 Σ ( j,j ) I (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j )  Σ ( j,j ) I (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j ) Σ ( j,j ) I (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j ) + W ( j,j ) t +1  (cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:19) ( j, :) (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ Y ( k ) t +1 − k − (cid:16) ˜ C ( k ) (cid:17) A µ t − k (cid:17)(cid:115)(cid:18)(cid:16) ˜ C ( k ) (cid:17) T (cid:16) ˆ Σ ( k ) (cid:17) − (cid:16) ˜ C ( k ) (cid:17)(cid:19) ( j,j )  (cid:33) Proof : Identical (up to variable names) to that of Theorem 2.

Violin plots for the predictive mean squared error are displayed in Figure 7

ULY

Figure 7: Violin plots for the average predictive mean squared error of the ﬁve ﬁlters over the four different scenariosunder a range of models. Lower values correspond to better performance. Methods are omitted if they can not beapplied to the setting or if their performance is too poor. 21E-BASS - J

ULY

8, 2020

Algorithm 3

KF_Upd( Y , µ , Σ , C , A , Σ A , Σ I ) µ p ← A µ Σ p ← A Σ A T + Σ I z = Y − µ p ˆ Σ ← C Σ p C T + Σ A K ← Σ p C T ˆ Σ − µ new ← µ p + Kz Σ new ← ( I − KC ) Σ p Output: ( µ new , Σ new ) Algorithm 4

Sample_typical( µ , Σ , Y , A , C , Σ A , Σ I ) V ← I p W ← I q ˆ Σ ← C (cid:0) A Σ A T + Σ I (cid:1) C T + Σ A z ← Y − CA µ prob ← (cid:16) − (cid:80) pi =1 r i − (cid:80) qj =1 s j (cid:17) exp (cid:16) − z T ˆ Σ − z (cid:17) / (cid:114)(cid:12)(cid:12)(cid:12) ˆ Σ (cid:12)(cid:12)(cid:12) Output: ( V , W , prob ) Algorithm 5

Sample_add_comp( i, z , ˆ Σ , Σ A , M ) V ← I p V ← I q V ( i,i ) ← ˜ σ i Γ (cid:32) a i + , a i + ˜ σ i Σ ( i,i ) A (cid:18) ( ˆ Σ − ) ( i, :) z ( ˆ Σ − ) ( i,i ) (cid:19) (cid:33) prob ← M r i Γ( a i + )Γ( a i ) a a i i (cid:32) a i + ˜ σ i Σ ( i,i ) A (cid:18) ( ˆ Σ − ) ( i, :) z ( ˆ Σ − ) ( i,i ) (cid:19) (cid:33) a i + √ ˜ σ i exp (cid:16) − z T ˆ Σ − z (cid:17)(cid:113) | ˆ Σ | (cid:115)(cid:18) ˜ V ( i,i ) + Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) (cid:19) exp    ˜ V ( i,i ) t +1 Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i )  Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) Σ ( i,i ) A (cid:16) ˆ Σ − (cid:17) ( i,i ) + ˜ V ( i,i ) t +1   (cid:16) ˆ Σ − (cid:17) ( i, :) z (cid:114)(cid:16) ˆ Σ − (cid:17) ( i,i )   . Output: ( V , W , prob ) Algorithm 6

Sample_add( µ , Σ , Y , A , C , Σ A , Σ I , M ) ˆ Σ ← C (cid:0) A Σ A T + Σ I (cid:1) C T + Σ A z ← Y − CA µ Add _ P t ← {} (cid:46)

Additive Anom. Particles4: for i ∈ { , ..., p } do Add _ P t ← Add _ P t ∪ {

Sample_add_comp ( i, z , ˆ Σ , Σ A , M ) } end forOutput: Add _ P t

ULY

8, 2020

Algorithm 7

Sample_inn_comp( j, z , ˆ Σ , Σ I , M ) V ← I p V ← I q W ( i,i ) ← ˆ σ i Γ (cid:32) b i + , b i + ˆ σ i Σ ( i,i ) I (cid:18) ( C T ) ( i, :) ˆ Σ − z ( C T ˆ Σ − C ) ( i,i ) (cid:19) (cid:33) prob ← M s j Γ( b i + )Γ( b j ) b b j j (cid:32) b j + ˆ σ i Σ ( j,j ) I (cid:18) ( C T ) ( j, :) ˆ Σ − z ( C T ˆ Σ − C ) ( j,j ) (cid:19) (cid:33) b i + (cid:112) ˆ σ j exp (cid:16) − z T ˆ Σ − z (cid:17)(cid:113) | ˆ Σ | (cid:115)(cid:18) ˜ W ( j,j ) + Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) (cid:19) exp  (cid:32)  ˜ W ( j,j ) Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j )  Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) Σ ( j,j ) I (cid:16) C T ˆ Σ − C (cid:17) ( j,j ) + ˜ W ( j,j ) t +1 (cid:33)  (cid:0) C T (cid:1) ( j, :) ˆ Σ − z (cid:114)(cid:16) C T ˆ Σ − C (cid:17) ( j,j )   Output: ( V , W , prob ) Algorithm 8

Sample_inn( µ , Σ , Y , A , C , Σ A , Σ I , M ) ˆ Σ ← C (cid:0) A Σ A T + Σ I (cid:1) C T + Σ A z ← Y − CA µ Inn _ P t ← {} (cid:46)

Innovative Anom. Particles4: for i ∈ { , ..., q } do Inn _ P t ← Inn _ P t ∪ {

Sample_inn_comp ( i, z , ˆ Σ , Σ I , M ) } end forOutput: Inn _ P t

Algorithm 9

Sample_Particles( M, µ , Σ , Y , A , C , Σ A , Σ I ) Desc ← {} (cid:46)

To store Descendants2:

Desc ← Desc ∪ Sample_typical ( µ , Σ , Y , A , C , Σ A , Σ I ) for i ∈ , ..., M do Desc ← Desc ∪ Sample_add ( µ , Σ , Y , A , C , Σ A , Σ I , M ) end for for i ∈ , ..., M do Desc ← Desc ∪ Sample_inn ( µ , Σ , Y , A , C , Σ A , Σ I , M ) end forOutput: Desc

Algorithm 10

BS_inn ( µ , Σ , ˜ Y , A , C , Σ A , Σ I , M, horizon ) ˜ C ← C (cid:104)(cid:0) A (cid:1) T , ..., (cid:0) A horizon (cid:1) T (cid:105) T ˜ z ← ˜ Y − ˜ CA µ ˜ Σ ← ˜ C (cid:0) A Σ A T + I horizon ⊗ Σ I (cid:1) ˜ C T + I horizon ⊗ Σ A Cd ← {} (cid:46) To store Candidates.5: for i ∈ { , .., q } do if horizon ∈ B i then for j ∈ { , ..., M } do Cd ← Cd ∪ { Sample_inn_comp ( i, ˜ z , ˜ Σ , A , ˜ C , Σ I , M · |B i | ) } end for end if end forOutput: Cand

ULY

8, 2020

Algorithm 1

Basic Particle Filter (No Back-sampling)

Input:

An initial state estimate ( µ , Σ ) A number of descendants, M (cid:48) = M ( p + q ) + 1 A number of particles to be maintained, N .A stream of observations Y , Y , ... Initialise:

Set

P articles (0) = { ( µ , Σ ) } for t ∈ N + do Candidates ← {} for ( µ , Σ ) ∈ P articles ( t − do ( V , W , prob ) ← Sample_Particles ( M, µ , Σ , Y t , A , C , Σ A , Σ I ) Candidates ← Candidates ∪ { ( µ , Σ , V , W , prob ) } end for Descendants ← Subsample ( N, Candidates ) P articles ( t ) ← {} for ( µ , Σ , V , W , prob ) ∈ Descendants do ( µ new , Σ new ) ← KF_Upd ( Y t , µ , Σ , C , A , V / Σ A , W / Σ I ) P articles ( t ) ← P articles ( t ) ∪ { ( µ new , Σ new ) } end for end for Algorithm 2

Particle Filter (With Back Sampling) – CE-BASS

Input:

An initial state estimate ( µ , Σ ) .A number of descendants, M (cid:48) = M ( p + q ) + 1 .A number of particles to be maintained, N .A stream of observations Y , Y , ... Initialise:

Set

P articles (0) = { ( µ , Σ , } Set max _ horizon = max ( ∪ qi =1 B i ) for t ∈ N + do Cand ← {} (cid:46)

To Store Candidates3: for ( µ , Σ , prob prev ) ∈ P articles ( t − do ( V , W , prob ) ← Sample_typical ( µ , Σ , Y t , A , C , Σ A , Σ I ) Cand ← Cand ∪ { ( µ , Σ , V , W , prob · prob prev , } Add _ Des ← Sample_add ( µ , Σ , Y t , A , C , Σ A , Σ I , M ) for ( V , W , prob ) ∈ Add _ Des do Cand ← Cand ∪ { ( µ , Σ , V , W , prob · prob prev , } end for end for for hor ∈ { , ..., max _ horizon } do for ( µ , Σ , prob prev ) ∈ P articles ( t − hor ) do ˜ Y ← (cid:2) Y Tt − hor +1 , ..., Y Tt (cid:3) T Inn _ Des ← BS_inn ( µ , Σ , ˜ Y , A , C , Σ A , Σ I , M, hor ) for ( V , W , prob ) ∈ Inn _ Des do Cand ← Cand ∪ { ( µ , Σ , V , W , prob · prob prev , hor ) } end for end for end for Desc ← Subsample ( N, Cand ) (cid:46) Sampling proportional to prob

P articles ( t ) ← {} for ( µ , Σ , V , W , prob, hor ) ∈ Desc do ( µ , Σ ) ← KF_Upd ( Y t +1 − hor , µ , Σ , C , A , V / Σ A , W / Σ I ) if hor > then for i ∈ { , ..., hor } do ( µ , Σ ) ← KF_Upd ( Y t + i − hor , µ , Σ , C , A , Σ A , Σ I ) end for end if P articles ( t ) ← P articles ( t ) ∪ { ( µ , Σ , prob · | Cand || Desc | ) } end for end forend for