Generalized Bayesian Filtering via Sequential Monte Carlo
Ayman Boustati, Ömer Deniz Akyildiz, Theodoros Damoulas, Adam M. Johansen
Ayman Boustati∗
University of Warwick
[email protected]

Ömer Deniz Akyildiz∗
The Alan Turing Institute & University of Warwick
[email protected]

Theodoros Damoulas
The Alan Turing Institute & University of Warwick
[email protected]

Adam M. Johansen
University of Warwick & The Alan Turing Institute
[email protected]
Abstract
We introduce a framework for inference in general state-space hidden Markov models (HMMs) under likelihood misspecification. In particular, we leverage the loss-theoretic perspective of Generalized Bayesian Inference (GBI) to define generalised filtering recursions in HMMs that can tackle the problem of inference under model misspecification. In doing so, we arrive at principled procedures for robust inference against observation contamination by utilising the β-divergence. Operationalising the proposed framework is made possible via sequential Monte Carlo (SMC) methods, where most standard particle methods, and their associated convergence results, are readily adapted to the new setting. We apply our approach to object tracking and Gaussian process regression problems, and observe improved performance over both standard filtering algorithms and other robust filters.

∗ Equal contribution.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

1 Introduction

Estimating the hidden states in dynamical systems is a long-standing problem in many fields of science and engineering. This can be formulated as an inference problem of a general state-space hidden Markov model (HMM) defined via two processes, the hidden process $(x_t)_{t \ge 0}$ and the observation process $(y_t)_{t \ge 1}$. More precisely, we consider general state-space hidden Markov models of the form

$$x_0 \sim \pi_0(x_0), \quad (1)$$
$$x_t \mid x_{t-1} \sim f_t(x_t \mid x_{t-1}), \quad (2)$$
$$y_t \mid x_t \sim g_t(y_t \mid x_t), \quad (3)$$

where $x_t \in \mathsf{X}$ for $t \ge 0$, $y_t \in \mathsf{Y}$ for $t \ge 1$, $f_t$ is a Markov kernel on $\mathsf{X}$ and $g_t : \mathsf{Y} \times \mathsf{X} \to \mathbb{R}_+$ is the likelihood function. We assume $\mathsf{X} \subseteq \mathbb{R}^{d_x}$ and $\mathsf{Y} \subseteq \mathbb{R}^{d_y}$ for convenience; however, the extension to general Polish spaces follows directly. The key inference problem in this model class is estimating the filtering distributions, i.e. the posterior distributions of the hidden states $(x_t)_{t \ge 1}$ given the observations $y_{1:t}$, denoted $(\pi_t(x_t \mid y_{1:t}))_{t \ge 1}$, a problem commonly known as Bayesian filtering [1, 2].

Under assumptions of linearity and Gaussianity, the inference problem for the hidden states of HMMs can be solved analytically via the Kalman filter [3]. However, inference for general HMMs of the form (1)–(3) with nonlinear, non-Gaussian transitions and likelihoods lacked a general, principled solution until the arrival of particle filtering schemes [4]. Particle filters (PFs) have become ubiquitous for Bayesian filtering in the general setting. In short, PFs retain a weighted collection of Monte Carlo samples representing the filtering distribution $\pi_t(x_t \mid y_{1:t})$ and recursively approximate the sequence of distributions $(\pi_t)_{t \ge 1}$ using a particle mutation-selection scheme [5].

While PFs (and other inference schemes for HMMs) implicitly assume that the assumed model is well-specified, it is important to consider whether the proposed model class includes the true data-generating mechanism (DGM). In particular, for general state-space HMMs, misspecification can occur if the true dynamics of the hidden process significantly differ from the assumed model $f_t$, or if the true observation model is markedly different from the assumed likelihood model $g_t$, e.g. corruption by heavy-tailed noise. The latter case is of widespread interest within the field of robust statistics [6] and has recently attracted significant interest in the machine learning community [7].
It is this setting that this paper seeks to address.

When the true DGM cannot be modelled, one principled approach to address misspecification is Generalized Bayesian Inference (GBI) [8]. This approach views classical Bayesian inference as a loss minimisation procedure in the space of probability measures, a view first developed by [9]. In particular, the standard Bayesian update can be derived from this view, where a loss function is constructed using the Kullback-Leibler (KL) divergence from the empirical distribution of the observations to the assumed likelihood [8]. The KL divergence is sensitive to outliers [10], hence the overall inference procedure is not robust to observations that are incompatible with the assumed model. A principled remedy is to replace the KL divergence with an alternative discrepancy, such as the β-divergence, which makes the overall procedure more robust [11] while retaining interpretability.

Previous work on robust particle filters has handled outliers, sensor failures and misspecification of the transition model [12, 13, 14, 15, 16, 17, 18, 19]. However, these approaches are either based on problem-specific heuristic outlier detection schemes, or make strong assumptions about the DGM in order to justify the use of heavy-tailed distributions [15]. This requires knowledge of the contamination mechanism that is implicitly embedded in the likelihood. Thus, this work considers the challenging M-open setting: we do not assume access to a family of models which includes the true generative model. This is qualitatively different from classical parameter estimation approaches [20]; consequently, model selection schemes cannot generally be used to correct for misspecification (note the additional complications associated with parameter estimation in misspecified scenarios, see, e.g. [21]: not only does estimating parameters not address misspecification, but even the interpretation of estimated parameters is difficult). Furthermore, this case is not addressed by sequential Monte Carlo (SMC) algorithms under model uncertainty (see, e.g., [22]), where the true model is assumed to be available among many candidate models. For instance, in [22], information from many candidate models is fused according to their predictive performance, which is a pragmatic solution with good empirical performance when a good suite of candidates is available. In contrast, we assume that we do not have any access at all to the true underlying generating mechanism.

In this work we propose a principled approach to robust filtering that does not impose additional modelling assumptions. We adapt the GBI approach of [8] to the Bayesian filtering setting and develop sequential Monte Carlo methods for inference. We illustrate the performance of this approach, using the β-divergence, to mitigate the effect of outliers. We show that this approach significantly improves PF performance in settings with contaminated data, while retaining a general and principled approach to inference. We provide empirical results that demonstrate improvement over Kalman and particle filters for both linear and non-linear HMMs. We further provide comparisons with various robust schemes against heavy-tailed noise, including t-based likelihoods [15] and auxiliary particle filters (APFs) [12]. Finally, exploiting the state-space representations of Gaussian processes (GPs) [23], we demonstrate our framework on London air pollution data using robust GP regression, which has linear time-complexity in the number of observations.

Notation.
We denote the space of bounded, Borel-measurable functions on $\mathsf{X}$ as $B(\mathsf{X})$. We denote the Dirac measure located at $y$ as $\delta_y(\mathrm{d}x)$ and note that $f(y) = \int f(x)\,\delta_y(\mathrm{d}x)$ for $f \in B(\mathsf{X})$. We denote the Borel subsets of $\mathsf{X}$ as $\mathcal{B}(\mathsf{X})$ and the set of probability measures on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ as $\mathcal{P}(\mathsf{X})$. For a probability measure $\mu \in \mathcal{P}(\mathsf{X})$ and $\varphi \in B(\mathsf{X})$, we write $\mu(\varphi) := \int \varphi(x)\,\mu(\mathrm{d}x)$. Given a probability measure $\mu$, we abuse notation by denoting its density with respect to the Lebesgue measure as $\mu(x)$.

2 Background
2.1 Generalized Bayesian Inference

Bayesian inference implicitly assumes that the generative model is well-specified; in particular, that the observations are generated from the assumed likelihood model. When this assumption is not expected to hold in real-world scenarios, one may wish to take into account the discrepancy between the true DGM and the assumed likelihood. GBI [8] is an approach to deal with such cases. Here, we present the main idea of GBI and refer the reader to the appendix for a more detailed description and to the original reference for a full treatment.

For the simple Bayesian updating setup, consider a prior $\pi_0$ and the assumed likelihood function $g(y \mid x)$. The posterior $\pi(x \mid y) =: \pi(x)$ is given by Bayes' rule: $\pi(x) = \pi_0(x) g(y \mid x) / Z$, where $Z := \int g(y \mid x)\,\pi_0(x)\,\mathrm{d}x$. [9] and [8] showed that this update can be seen as a special case of a more general update rule, which can be described as the solution of an optimisation problem in the space of measures. This leads to a more general belief updating rule given by

$$\pi(x) = \frac{\pi_0(x)\, G(y \mid x)}{Z}, \quad (4)$$

with $G(y \mid x) := \exp(-\ell(x, y))$, where $\ell(x, y)$ is a loss function connecting the observations to the model parameters. Specifying $\ell(x, y)$ as the cross-entropy (from the KL divergence) of the assumed likelihood relative to the empirical distribution of the data recovers the standard Bayes update.

As noted before, the standard Bayes update is not robust to outliers due to the properties of the KL divergence [10]. Hence, substituting the cross-entropy with a more robust loss, such as the β-cross-entropy [7] based on the β-divergence, can make the inference more robust. Specifically, in this setting the generalised Bayes update for the likelihood $g(y \mid x)$ is written as $\pi(x) = \pi_0(x)\, G^\beta(y \mid x) / Z_\beta$, where

$$G^\beta(y \mid x) = \exp\left( \frac{1}{\beta}\, g(y \mid x)^\beta - \frac{1}{\beta+1} \int g(y' \mid x)^{\beta+1}\,\mathrm{d}y' \right). \quad (5)$$

One can consider $G^\beta(y \mid x)$ as a generalised likelihood, resulting from the use of a different loss function compared to the standard Bayes procedure. Here β is a hyperparameter that needs to be selected depending on the degree of misspecification. More precisely, it is a parameter of a specified loss function: a subjective (generalised) Bayesian choice characterising confidence in model correctness. In general $\beta \in (0, 1]$ and $\lim_{\beta \to 0} G^\beta(y \mid x) = g(y \mid x)$ (up to a normalising constant). Thus, intuitively, small β values are suitable for mild model misspecification and large β values are suitable when the assumed model is expected to significantly deviate from the true model. In the experimental section, we devote some attention to the selection of β and sensitivity analysis.

Generalised Bayesian updating is more robust against outliers if a suitable divergence is chosen [24, 25, 10]. We note that GBI is conceptually different from approximate Bayesian methods with alternative divergences such as [26, 27, 28, 29]. While these methods target approximate posteriors that minimise some discrepancy from the true posterior and are not necessarily robust, GBI methods change the inference target from the standard Bayesian posterior (obtained by setting $\ell(x, y)$ to the KL divergence) to a different target distribution with more desirable properties, such as robustness to outliers. We also remark that the qualitative behaviour of this robustness is different from simply inflating the variance of the likelihood (see Appendix B for more discussion from the perspective of influence functions).
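For a Gaussian likelihood the integral in (5) is available in closed form, since $\int \mathcal{N}(y'; \mu, \sigma^2)^{\beta+1}\,\mathrm{d}y' = (2\pi\sigma^2)^{-\beta/2}(1+\beta)^{-1/2}$. The following sketch (our illustration, not the paper's released code) evaluates $\log G^\beta(y \mid x)$ for a univariate Gaussian:

```python
import numpy as np
from scipy.stats import norm

def beta_gaussian_log_G(y, mean, sigma, beta):
    """log G_beta(y | x) of Eq. (5) for a univariate Gaussian g(y|x) = N(y; mean, sigma^2).

    Uses the closed-form integral
      int N(y'; mean, sigma^2)^(beta+1) dy' = (2*pi*sigma^2)^(-beta/2) * (1+beta)^(-1/2).
    """
    # (1/beta) * g(y|x)^beta, evaluated via the log-density for numerical stability
    data_term = np.exp(beta * norm.logpdf(y, loc=mean, scale=sigma)) / beta
    # (1/(beta+1)) * int g(y'|x)^(beta+1) dy'
    integral_term = (2 * np.pi * sigma**2) ** (-beta / 2) / (np.sqrt(1 + beta) * (beta + 1))
    return data_term - integral_term

# As beta shrinks, generalised weights approach the standard likelihood ordering:
ys = np.array([0.1, 1.0, 5.0])   # an inlier, a moderate point, an outlier
print(beta_gaussian_log_G(ys, mean=0.0, sigma=1.0, beta=0.1))
```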
Later, we demonstrate how the GBI approach can be used to construct robust PF procedures.

2.2 Bayesian filtering and particle filters

Let $x_{0:T}$ be a hidden process with $x_t \in \mathsf{X}$ and $y_{1:T}$ an observation process with $y_t \in \mathsf{Y}$. Our goal is to conduct inference in HMMs of the form (1)–(3), where $\pi_0(\cdot)$ is a prior probability distribution on the initial state $x_0$, $f_t(x \mid x')$ is a Markov transition kernel on $\mathsf{X}$ and $g_t(y_t \mid x_t)$ is the likelihood for observation $y_t$. The observation sequence $y_{1:T}$ is assumed to be fixed but otherwise arbitrary.

The typical interest in probabilistic models is the estimation of expectations of general test functions with respect to the posterior distribution, in this case the filtering distributions $\pi_t(x_t \mid y_{1:t})$ and the associated joint distributions $p_t(x_{0:t} \mid y_{1:t})$. More precisely, given a bounded test function $\varphi \in B(\mathsf{X})$, we are interested in estimating integrals of the form

$$\pi_t(\varphi) = \int \varphi(x_t)\, \pi_t(x_t \mid y_{1:t})\,\mathrm{d}x_t. \quad (6)$$

Kalman filtering [3, 1] can be used to obtain closed-form expressions for $(\pi_t, p_t)_{t \ge 1}$ if $f_t$ and $g_t$ are linear-Gaussian. However, for non-linear or non-Gaussian cases, the target distributions are almost always intractable, requiring an alternative approach, such as SMC methods [5, 30]. Known as particle filters (PFs) when employed in the HMM setting, SMC methods combine importance sampling and resampling algorithms tailored to approximate the solution of the filtering and smoothing problems. In a typical iteration, a PF proceeds as follows: given a collection of samples $\{x_{t-1}^{(i)}\}_{i=1}^N$ representing the posterior $\pi_{t-1}(x_{t-1} \mid y_{1:t-1})$, it first samples from a (possibly observation-dependent) proposal, $\bar{x}_t^{(i)} \sim q_t(x_t \mid x_{1:t-1}^{(i)}, y_{1:t})$. It then computes a weight for each sample (particle) $\bar{x}_t^{(i)}$ in the collection for a given observation $y_t$, evaluating its fitness with respect to the likelihood $g_t$, as

$$w_t^{(i)} \propto \frac{g_t(y_t \mid \bar{x}_t^{(i)})\, f_t(\bar{x}_t^{(i)} \mid x_{t-1}^{(i)})}{q_t(\bar{x}_t^{(i)} \mid x_{1:t-1}^{(i)}, y_{1:t})}, \quad \text{where } \sum_{i=1}^N w_t^{(i)} = 1.$$

Finally, an optional resampling step is used to prevent degeneracy, leading to $x_t^{(i)} \sim \sum_{i=1}^N w_t^{(i)} \delta_{\bar{x}_t^{(i)}}(\mathrm{d}x_t)$. One can then construct the empirical measure $\pi_t^N(\mathrm{d}x_t \mid y_{1:t}) = \frac{1}{N}\sum_{i=1}^N \delta_{x_t^{(i)}}(\mathrm{d}x_t)$, and the estimate of $\pi_t(\varphi)$ in (6) is given by

$$\pi_t^N(\varphi) = \frac{1}{N} \sum_{i=1}^N \varphi(x_t^{(i)}). \quad (7)$$

If the proposal is chosen as the transition density, i.e., $q_t(x_t \mid x_{1:t-1}^{(i)}, y_{1:t}) = f_t(x_t \mid x_{t-1}^{(i)})$, we obtain the bootstrap particle filter (BPF) [4]. This corresponds to the simple procedure of sampling $\bar{x}_t^{(i)}$ from $f_t(x_t \mid x_{t-1}^{(i)})$ and setting its weight $w_t^{(i)} \propto g_t(y_t \mid \bar{x}_t^{(i)})$.

3 Generalised Bayesian filtering

As explained in Section 2.1, given a standard probability model comprised of the prior $\pi_0(x_0)$ and a likelihood $g(y \mid x)$, the general Bayes update defines an alternative, generalised likelihood $G(y \mid x)$. The sequence of generalised likelihoods, denoted $G_t(y_t \mid x_t)$ for $t \ge 1$, in an HMM yields a joint generalised posterior density which factorises as

$$p_t(x_{0:t} \mid y_{1:t}) \propto \pi_0(x_0) \prod_{k=1}^t f_k(x_k \mid x_{k-1})\, G_k(y_k \mid x_k), \quad (8)$$

where $G_t(y_t \mid x_t) := \exp(-\ell_t(x_t, y_t))$. Inference can be done via SMC applied to this sequence of twisted probabilities, defining a Feynman-Kac flow in the terminology of [32].

Comparing the update rule in (4) to the standard Bayes update suggests a generalisation of the particle filter.
3.1 The generalised particle filter

In particular, under the model in (1)–(3), one can perform generalised inference using $(f_t)_{t \ge 1}$ as usual, but replacing the likelihood with $(G_t)_{t \ge 1}$. Hence, a generalised sequential importance resampling PF (given fully in Algorithm 1) keeps the sampling step intact, but applies a different weight computation step:

$$w_t^{(i)} \propto \exp(-\ell(\bar{x}_t^{(i)}, y_t))\, \frac{f_t(\bar{x}_t^{(i)} \mid x_{t-1}^{(i)})}{q_t(\bar{x}_t^{(i)} \mid x_{1:t-1}^{(i)}, y_{1:t})}.$$

Indeed, most PFs (including the APF, see Algorithm 3 in the appendix) and related algorithms can be adapted to the GBI context. (In its simplest form, resampling draws $N$ times with replacement from the weighted empirical measure to obtain an unweighted sample whose empirical distribution approximates the same target; see [31] for an overview of resampling schemes and their properties.)
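In log-space, the only change from a standard PF is therefore the weight step. A minimal sketch of that step, assuming vectorised helpers ell, log_f and log_q (hypothetical names, not from the paper):

```python
import numpy as np

def generalised_log_weights(x_prop, x_prev, y, ell, log_f, log_q):
    """Generalised importance weights: -ell(x, y) + log f(x | x_prev) - log q(x | x_prev, y)."""
    log_w = -ell(x_prop, y) + log_f(x_prop, x_prev) - log_q(x_prop, x_prev, y)
    log_w -= np.max(log_w)        # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()            # self-normalised weights
```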
Algorithm 1 The generalised particle filter
Input: Observation sequence $y_{1:T}$, number of samples $N$, proposal distributions $q_{1:T}(\cdot)$.
Initialise: Sample $\{\bar{x}_0^{(i)}\}_{i=1}^N$ from the prior $\pi_0(x_0)$.
for $t = 1$ to $T$ do
    Sample: $\bar{x}_t^{(i)} \sim q_t(x_t \mid x_{1:t-1}^{(i)}, y_{1:t})$, for $i = 1, \dots, N$.
    Weight: $w_t^{(i)} \propto \exp(-\ell(\bar{x}_t^{(i)}, y_t))\, \dfrac{f_t(\bar{x}_t^{(i)} \mid x_{t-1}^{(i)})}{q_t(\bar{x}_t^{(i)} \mid x_{1:t-1}^{(i)}, y_{1:t})}$, for $i = 1, \dots, N$.
    Resample: $x_t^{(i)} \sim \sum_{i=1}^N w_t^{(i)} \delta_{\bar{x}_t^{(i)}}(\mathrm{d}x_t)$, for $i = 1, \dots, N$.
end for

3.2 The β-BPF and the β-APF

The β-BPF is derived by selecting $\ell_t(x_t, y_t)$ as the β-divergence-based loss and applying the BPF procedure with the associated generalised likelihood. In this case, the loss is

$$\ell_t^\beta(x_t, y_t) = \frac{1}{\beta+1} \int g_t(y_t' \mid x_t)^{\beta+1}\,\mathrm{d}y_t' - \frac{1}{\beta}\, g_t(y_t \mid x_t)^\beta. \quad (9)$$

We can then construct the generalised β-likelihood as

$$G_t^\beta(y_t \mid x_t) \propto \exp(-\ell_t^\beta(x_t, y_t)). \quad (10)$$

In this instance, the use of the β-divergence endows the sampler with robustness properties [11]. This can informally be seen from the form of the loss function in (9), where small values of β temper the likelihood, extending its tails and making the loss more forgiving to outliers. The β-BPF procedure is given in Algorithm 2 in the appendix. The β-APF (Algorithm 3 in the appendix) is an Auxiliary Particle Filter [12, 33] adapted to the GBI setting, and is derived similarly to the β-BPF.

Note that the integral term in (9) is independent of $x_t$ and can be absorbed, without evaluation, into the normalising constant when $x_t$ is a location parameter for a symmetric $g_t(\cdot)$ and $\mathsf{Y}$ is a linear subspace of $\mathbb{R}^{d_y}$. More generally, if $g_t(\cdot)$ is a member of the exponential family, the integral can be computed by identifying $g_t^\beta(\cdot)$ with the kernel of another member of the same family with canonical parameters scaled by β. The overhead of computing $G_t^\beta(\cdot)$ is negligible in this instance, which is not too restrictive in the context of misspecified models. For other likelihoods, unbiased estimators of $G_t^\beta(\cdot)$, e.g. the Poisson estimator [34], can be used in a random-weight particle filter framework [35]; there the overhead of computing $G_t^\beta(\cdot)$ will depend on the variance of the estimator, the convergence results from that setting apply, and, as [35] demonstrate, this cost need not be prohibitive.

3.3 Selecting β

It is often the case that the primary goal of inference, particularly in the presence of model misspecification, is prediction. Hence, we propose choosing divergence parameters that lead to maximally predictive posterior belief distributions. In particular, for the β-BPF and β-APF, define $L_\beta(y_t, \hat{y}_t)$ as a loss function of the observations $y_t$ and the predictions $\hat{y}_t$. We propose to choose β as the solution of the following decision-theoretic optimisation problem:

$$\min_\beta\ \operatorname{agg}_{t=1}^T \left( \mathbb{E}_{p(\hat{y}_t \mid y_{1:t-1})}\, L_\beta(y_t, \hat{y}_t) \right), \quad (11)$$

where agg denotes an aggregating function. This approach requires some training data to allow the selection of β. In filtering contexts, this can be historical data from the same setting or other available proxies. For offline inference one could also employ the actual data within this framework. However, this proposal relies on the quality of the observations, which is, by definition, compromised under outlier contamination. To remedy this, we propose choosing robust versions of agg and $L$, e.g. the median and the (standardised) absolute error, respectively.
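One simple way to operationalise (11), under the robust choices above (agg as the median, L as the absolute error), is a grid search; run_filter is a hypothetical routine returning one-step-ahead predictive samples:

```python
import numpy as np

def predictive_score(y, y_pred_samples):
    """Median over time of the expected absolute prediction error (Eq. (11) with
    agg = median and L = absolute error); y: (T, d), y_pred_samples: (T, N, d)."""
    per_t = np.abs(y_pred_samples - y[:, None, :]).mean(axis=1)  # approx. E|y_t - y_hat_t|
    return np.median(per_t.sum(axis=-1))

def select_beta(y, run_filter, grid=(0.01, 0.05, 0.1, 0.2, 0.5)):
    """Grid search for beta; run_filter(y, beta) must return predictive samples (T, N, d)."""
    scores = {beta: predictive_score(y, run_filter(y, beta)) for beta in grid}
    return min(scores, key=scores.get)
```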
4 Theoretical guarantees

Theoretical guarantees for SMC methods can be extended to the generalised Bayesian filtering setting. Since generalised Bayesian filters can be seen as standard SMC methods with modified likelihoods, the same analytical tools can be used in this setting. We provide guarantees for the β-BPF but emphasise that essentially the same results can be obtained much more broadly (including for the β-APF via the approach of [33]) using standard arguments from the SMC literature.

We denote the generalised filters and generalised posteriors for the HMM in the β-divergence setting as $\pi_t^\beta$ and $p_t^\beta$, respectively. Consequently, the corresponding quantities constructed by the β-BPF are denoted $\pi_t^{\beta,N}$ and $p_t^{\beta,N}$. Although the generalised likelihoods $G_t^\beta(y_t \mid x_t)$ are not normalised, they can be considered as potential functions [32]. Since $G_t^\beta(y_t \mid x_t) < \infty$ whenever $g_t(y_t \mid x_t) < \infty$ and β is fixed, we can adapt the standard convergence results to the generalised case.

Assumption 1.
For a fixed, arbitrary observation sequence $y_{1:T} \in \mathsf{Y}^{\otimes T}$, the potential functions $(G_t^\beta)_{t \ge 1}$ are bounded, and $G_t^\beta(y_t \mid x_t) > 0$ for all $t \in \{1, \dots, T\}$ and $x_t \in \mathsf{X}$.

This assumption holds for most commonly used likelihood functions and their generalised extensions.
Theorem 1.
For any $\varphi \in B(\mathsf{X})$ and $p \ge 1$,

$$\left\| \pi_t^{\beta,N}(\varphi) - \pi_t^\beta(\varphi) \right\|_p \le \frac{c_{t,p,\beta}\, \|\varphi\|_\infty}{\sqrt{N}},$$

where $c_{t,p,\beta} < \infty$ is a constant independent of $N$.

The proof sketch and the constant $c_{t,p,\beta}$ are in the supplement. This $L_p$ bound provides a theoretical guarantee on the convergence of particle approximations to generalised posteriors. The special case $p = 2$ also provides the bound for the mean-squared error. It is well known that Theorem 1 with $p > 2$ leads to a law of large numbers via Markov's inequality and a Borel-Cantelli argument:

Corollary 1.
Under the setting of Theorem 1, $\lim_{N \to \infty} \pi_t^{\beta,N}(\varphi) = \pi_t^\beta(\varphi)$ a.s., for $t \ge 0$.

Finally, a central limit theorem for estimates of expectations with respect to the smoothing distributions can be obtained by considering the path space $\mathsf{X}^{\otimes t}$. Recall the joint posterior $p_t^\beta(x_{1:t} \mid y_{1:t})$ and consider a test function $\varphi_t : \mathsf{X}^{\otimes t} \to \mathbb{R}$. We denote $\bar{\varphi}_t^\beta := \int \varphi_t(x_{1:t})\, p_t^\beta(x_{1:t} \mid y_{1:t})\,\mathrm{d}x_{1:t}$ and denote the β-BPF estimate of $\bar{\varphi}_t^\beta$ by $\varphi_t^{\beta,N} := \int \varphi_t(x_{1:t})\, p_t^{\beta,N}(\mathrm{d}x_{1:t})$.

Theorem 2.
Under the regularity conditions given in [36, Theorem 1],

$$\sqrt{N}\left( \varphi_t^{\beta,N} - \bar{\varphi}_t^\beta \right) \xrightarrow{d} \mathcal{N}\left(0,\ \sigma_{t,\beta}^2(\varphi_t)\right) \quad \text{as } N \to \infty,$$

where $\sigma_{t,\beta}^2(\varphi_t) < \infty$.

The expression for $\sigma_{t,\beta}^2(\varphi_t)$ can be found in the appendix. These results illustrate that the standard guarantees for generic particle filtering methods extend to our case.

5 Experiments

In this section, we focus on the β-BPF, illustrating its properties and empirically verifying its robustness. We include three experiments in the main text and another in Appendix E. Furthermore, we specifically investigate the β-APF in Section 5.2, comparing its behaviour to that of the β-BPF. Throughout, we report the normalised mean squared error (NMSE) and the
90% empirical coverage as goodness-of-fit measures. The NMSE scores indicate the mean fit of the inferred posterior distribution, and the empirical coverage measures the quality of its uncertainty quantification. We note that any claimed performance difference is supported by a Wilcoxon signed-rank test. Further results and in-depth details on the experimental setup are given in the supplementary material.
5.1 Wiener velocity model

The Wiener velocity model [37] is a standard model in the target-tracking literature, in which the velocity of a particle is modelled as a Wiener process. The discretised version of this model can be represented as a linear-Gaussian state-space model (LGSSM):

$$x_t = A x_{t-1} + \nu_{t-1}, \quad \nu_t \sim \mathcal{N}(0, Q), \quad (12)$$
$$y_t = H x_t + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, \Sigma). \quad (13)$$
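A sketch of this data-generating process in two dimensions, using the standard Wiener-velocity discretisation for A and Q spelled out in Appendix F.3; the contamination scale below is an illustrative assumption:

```python
import numpy as np

def simulate_wiener_velocity(T=1000, dt=0.1, p_c=0.1, contam_scale=10.0, seed=0):
    rng = np.random.default_rng(seed)
    I2 = np.eye(2)
    A = np.block([[I2, dt * I2], [np.zeros((2, 2)), I2]])      # Eq. (12) transition
    Q = np.block([[dt**3 / 3 * I2, dt**2 / 2 * I2],
                  [dt**2 / 2 * I2, dt * I2]])                  # process noise covariance
    H = np.hstack([I2, np.zeros((2, 2))])                      # observe positions only
    x = np.zeros(4)
    xs, ys = [], []
    for _ in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(4), Q)
        y = H @ x + rng.standard_normal(2)                     # Sigma = I
        if rng.random() < p_c:                                 # contaminate with prob. p_c
            y += contam_scale * rng.standard_normal(2)
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
```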
Figure 1:
The mean metrics over state dimensions for the Wiener velocity example with $p_c = 0.1$. The top panel presents the NMSE results (lower is better) and the bottom panel presents the 90% empirical coverage results (higher is better), on 100 runs. The vertical dashed lines in gold indicate the value of β chosen by the selection criterion in Section 3.3. The horizontal dashed line in black in the lower panel indicates the 90% mark for the coverage.

Here, $A$ and $Q$ are state-transition parameters dictated by the continuous-time model and $H$ is the observation matrix (see Appendix). We simulate this model in two dimensions with $\Sigma = I$, contaminating the observations with a large-scale, zero-mean Gaussian with probability $p_c$. Our aim is to obtain the filtering density under the heavily-contaminated setting where optimal filters struggle to perform. We compare the proposed β-BPF, for a range of values of β, to the standard BPF with a Gaussian likelihood (BPF), the (optimal) Kalman filter and an Oracle-BPF with likelihood corresponding to the true generative model, i.e., with a Gaussian mixture likelihood whose mixture components match the noise processes and whose mixture probabilities match the contamination probability.

We shed light on four questions with this simple setup: (a) Does the β-BPF produce accurate and well-calibrated posterior distributions in the presence of contaminated data? (b) Is it sensitive to the choice of β? (c) Does the method described in Section 3.3 for selecting β return a near-optimal result? (d) How does the robustification procedure compare to inference with knowledge of the true model?

Figure 1 shows the results for $p_c = 0.1$. We observe that (a) the β-BPF outperforms the Kalman filter and the standard BPF for all but the largest tested values of β, while producing well-calibrated posteriors that account for the uncertainty (for intermediate values of β the coverage approaches the 90% threshold); (b) we see drastic performance gains, with median NMSE scores many times smaller than those of the BPF and the Kalman filter, for a large range of β values; (c) the β-choice heuristic selects a well-performing β (gold vertical lines in Figure 1); and (d) the performance of the β-BPF is very close to that of the Oracle (with knowledge of the true model) for a range of β values. Note that, for most values of β, the β-BPF also significantly outperforms both the Kalman filter and the standard BPF predictively. The full set of results for the predictive performance is presented in Table 3 in Appendix G.1.

5.2 Terrain aided navigation

Terrain Aided Navigation (TAN) is a challenging estimation problem, where the state evolution is defined as in (12) (in three dimensions), but with a highly non-linear observation model, $y_t = h(x_t) + \epsilon_t$, where $h(\cdot)$ is a non-linear function, typically including a non-analytic Digital Elevation Map (DEM). This problem simulates the trajectory of an aeroplane or a drone over a terrain map, where we observe its elevation over the terrain and its distance from its take-off hub from on-board sensors (see supplement for more details). We simulate transmission failure of the measurement system as impulsive noise on the observations: with probability $p_c$ the nominal Gaussian observation noise is replaced by an i.i.d. draw from a Student's t distribution with $\nu = 1$ degrees of freedom.

We apply both the β-BPF and the β-APF to this problem and compare them to the standard BPF with a Gaussian likelihood (BPF). We also compare against two other robust PF methods from the literature: the Student's t BPF (t-BPF) [15] and the APF [12].
We set the degrees of freedom for the t-BPF to the same value as the contamination, $\nu = 1$. (The β values were chosen with the criterion of Section 3.3, applied to an alternative dataset obtained from the same simulation but with 90% fewer observations.)

Figure 2 shows that the β-BPF and the β-APF outperform the standard Gaussian BPF, the t-BPF and the APF. This shows that the use of the t-distribution in the low-contamination setting is inappropriate. This gap in performance tightens, naturally, as $p_c$ grows, since the t-distribution becomes a good model for the observations. Notably, the performance gaps between the standard PFs and their β-robustified counterparts are similar, indicating that the use of the β-divergence in a particle filtering procedure does indeed robustify the inference.
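A sketch of this transmission-failure noise model; the nominal scale is an assumption for illustration:

```python
import numpy as np

def impulsive_noise(size, p_c, scale=1.0, rng=None):
    """epsilon_t: Gaussian with prob. 1 - p_c, Student's t with nu = 1 with prob. p_c."""
    rng = rng or np.random.default_rng()
    eps = scale * rng.standard_normal(size)
    failures = rng.random(size) < p_c
    eps[failures] = scale * rng.standard_t(df=1, size=failures.sum())
    return eps
```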
Figure 2:
The mean metrics over state dimensions for the TAN example for different $p_c$. The top panel presents the NMSE results (lower is better) and the bottom panel presents the 90% empirical coverage results (higher is better), both evaluated on 50 runs. The horizontal dashed line in black in the lower panel indicates the 90% mark for the coverage.

In Figure 3, we plot the filtering distributions for the sixth state dimension (vertical velocity) obtained from an illustrative run. The left panel shows the filtering distributions from the (Gaussian) BPF (top) and the β-BPF (bottom). The locations of the most prominent outliers are marked with dashed vertical lines in black. Figure 3 displays the significant difference between the two approaches: while the uncertainty for the standard BPF collapses when it meets the outliers, e.g. around $t = 1700$, the β-BPF does not suffer from this problem. This performance difference is partly related to the stability of the weights. The right panel in Figure 3 shows the effective sample size (ESS) over time for the two filters, demonstrating that the β-BPF consistently exhibits larger ESS values, avoiding particle degeneracy. The ESS values for the BPF, on the other hand, decline sharply when it meets outliers. A similar result is observed for the APF versus the β-APF in the figures in Appendix G.2. Further results on predictive performance can be found in Appendix G.2.
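The ESS reported here is the usual normalised-weight statistic; a minimal sketch:

```python
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_i w_i^2 for normalised weights; equals N for uniform weights, 1 on collapse."""
    w = np.asarray(w) / np.sum(w)
    return 1.0 / np.sum(w**2)
```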
Figure 3:
The left panel shows the inferred marginal filtering distributions for the velocity in the z direction for the BPF and the β-BPF with $\beta = 0.1$ (BPF: NMSE = 0.1511, 90% coverage = 0.691; β-BPF: NMSE = 0.0944, 90% coverage = 0.856). The right panel shows the effective sample size over time. The locations of the most prominent (largest-deviation) outliers are shown as dashed vertical lines in black in both panels.

5.3 London air quality

To measure air quality, London authorities use a network of sensors around the city recording pollutant measurements. Sensor measurements are susceptible to significant outliers due to environmental effects, manual calibration and sensor deterioration. In this experiment, we use Gaussian process (GP) regression to infer the underlying signal from a PM2.5 sensor.

For 1-D time-series data, GP inference [38] can be accelerated to linear time in the number of observations by formulating an equivalent stochastic differential equation whose solution precisely matches the GP under consideration [23]. The resulting model is an LGSSM of the form (12)–(13) whose smoothing distribution matches the GP marginals at discrete times. One can then apply smoothing algorithms, such as Rauch-Tung-Striebel (RTS) [39] or Forward Filtering Backward Smoothing (FFBS) [40], to obtain the GP posterior. These require a forward filtering step: the Kalman filter for RTS, or a PF for FFBS. Here, we fit a Matérn 5/2 GP with known hyperparameters to a time series from one of the sensors. We plot the median of the signals from the wider sensor network to obtain a simple approximation of the ground truth.

Table 1:
GP regression NMSE (lower is better) and 90% empirical coverage (higher is better) for the credible intervals of the posterior predictive distribution, on 100 runs.
Bold indicates the statistically significant best result under a Wilcoxon signed-rank test. All presented results are statistically different from each other according to the test.

Filter (Smoother)       NMSE, median (IQR)   EC, median (IQR)
Kalman (RTS)            –                    –
BPF (FFBS)              –                    –
β = 0.1-BPF (FFBS)      –                    –
β-BPF (FFBS), best β    –                    –

To further investigate the GP solution of the β-BPF (FFBS), we show the fit for $\beta = 0.1$ and compare it with Kalman (RTS) smoothing. In Figure 4 (and Figure 26 in the appendix) we see that the latter is sensitive to outliers, which force the GP mean towards them, while the β-BPF is robust and ignores them.

Table 1 compares results with a Gaussian likelihood for GP regression with Kalman (RTS) smoothing, the standard BPF (FFBS) and two runs of the β-BPF (FFBS): $\beta = 0.1$ by the predictive selection of Section 3.3, and a second value of β chosen by overall best performance. For both choices of β, the β-BPF outperforms all other methods on both metrics.
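A sketch of the standard companion-form construction of the Matérn 5/2 state-space model (cf. [23, 37]); the spectral-density constant follows the usual parameterisation with $\lambda = \sqrt{5}/\ell$ from that literature, and is an assumption here rather than a detail stated in the paper:

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

def matern52_state_space(lengthscale, variance):
    """Companion-form SDE representation of a Matern 5/2 GP prior."""
    lam = np.sqrt(5.0) / lengthscale
    F = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [-lam**3, -3.0 * lam**2, -3.0 * lam]])   # drift matrix
    L = np.array([[0.0], [0.0], [1.0]])                    # noise loading
    q = 16.0 / 3.0 * variance * lam**5                     # white-noise spectral density
    H = np.array([[1.0, 0.0, 0.0]])                        # observe the function value
    Pinf = solve_continuous_lyapunov(F, -q * L @ L.T)      # stationary covariance
    return F, L, q, H, Pinf

def discretise(F, Pinf, dt):
    """Exact discretisation into x_k = A x_{k-1} + nu, nu ~ N(0, Q), as in (12)."""
    A = expm(F * dt)
    Q = Pinf - A @ Pinf @ A.T
    return A, Q
```

The resulting (A, Q, H) then feed directly into a Kalman filter (for RTS) or a β-BPF (for FFBS).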
Figure 4:
The GP fit on the measurement time series for one of the London air quality sensors. The top panel shows the posterior from Kalman (RTS) smoothing (NMSE = 0.1435, 90% coverage = 0.685). The bottom panel shows the posterior from the β-BPF (FFBS) for $\beta = 0.1$ (NMSE = 0.0597, 90% coverage = 0.760).

6 Conclusion

We provided a generalised filtering framework based on GBI, which tackles likelihood misspecification in general state-space HMMs. Our approach leverages SMC methods, for which we extended some analytical results to the generalised case. We presented the β-BPF, a simple instantiation of our approach based on the β-divergence, developed an APF for this setting, and showed performance gains compared to other standard algorithms on a variety of problems and contamination settings.

This work opens up many exciting avenues for future research. Principal among these is online learning of model parameters (system identification) in the presence of misspecification. Our framework can directly incorporate most estimators found in the SMC literature, and the computation of derivatives can be tackled with automatic differentiation tools.

The SDE representation of a GP depends on the form of the covariance function. In this paper we use a GP with the Matérn 5/2 kernel, which admits a dual SDE representation. The code for this project is publicly available at https://github.com/aboustati/robust-smc .

Acknowledgements

This work was supported by the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project; The Alan Turing Institute for Data Science and AI under EPSRC grant EP/N510129/1; and the EPSRC under grants EP/R034710/1 and EP/T004134/1.
Broader Impact
Robust inference in the context of misspecified models is a topic of broad interest. However, there are few robust, generally-applicable methods which can be employed in the context of online inference in time-series settings. This paper provides a principled solution to this problem within a formal framework backed by theoretical guarantees, opening up the benefits to multiple application domains. The illustrative applications demonstrate the potential improvements in settings including navigation and Gaussian process regression, which, if realised more widely, could have wide-reaching impact. We hope that this inspires the community to build on or apply our work to other challenging real-world scenarios.

Of particular interest is the application of robust SMC methods, like the β-BPF and its auxiliary counterpart developed in this work, to impactful data-streaming applications in environmental monitoring and forecasting. Indeed, our research in this area was motivated by a real-world application in which existing techniques were inadequate (see Section 5.3 for more details). We have demonstrated the benefits of such methods in proof-of-concept work and are incorporating the resulting algorithms into a fully-developed platform that has been in development for approximately three years. We are partnering with local authorities to help in directly informing policy makers and ultimately the general public.

More widely, this work provides an additional illustration that the GBI framework can provide good solutions to challenging problems in the world of misspecified models, and hence provides additional motivation to further investigate this extremely promising but rather new direction.

References

[1] Brian D. O. Anderson and John B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, NJ, 1979.
[2] Simo Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[3] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35–45, 1960.
[4] Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 140(2):107–113, 1993.
[5] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
[6] Peter J. Huber. Robust Statistics. Springer, 2011.
[7] Futoshi Futami, Issei Sato, and Masashi Sugiyama. Variational inference based on robust divergences. In International Conference on Artificial Intelligence and Statistics, pages 813–822, 2018.
[8] Pier Giovanni Bissiri, Chris C. Holmes, and Stephen G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
[9] Arnold Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4):278–280, 1988.
[10] Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063, 2019.
[11] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
[12] Michael K. Pitt and Neil Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
[13] Cristina S. Maiz, Joaquin Miguez, and Petar M. Djuric. Particle filtering in the presence of outliers. In IEEE/SP 15th Workshop on Statistical Signal Processing, pages 33–36. IEEE, 2009.
[14] Cristina S. Maiz, Elisa M. Molanes-Lopez, Joaquín Miguez, and Petar M. Djuric. A particle filtering scheme for processing time series corrupted by outliers. IEEE Transactions on Signal Processing, 60(9):4611–4627, 2012.
[15] Dingjie Xu, Chen Shen, and Feng Shen. A robust particle filtering algorithm with non-Gaussian measurement noise using Student-t distribution. IEEE Signal Processing Letters, 21(1):30–34, 2013.
[16] Laurent E. Calvet, Veronika Czellar, and Elvezio Ronchetti. Robust filtering. Journal of the American Statistical Association, 110(512):1591–1606, 2015.
[17] Francisco Curado Teixeira, João Quintas, Pramod Maurya, and António Pascoal. Robust particle filter formulations with application to terrain-aided navigation. International Journal of Adaptive Control and Signal Processing, 31(4):608–651, 2017.
[18] Xiao-Li Hu, Thomas B. Schön, and Lennart Ljung. A robust particle filter for state estimation—with convergence results. In 46th IEEE Conference on Decision and Control, pages 312–317. IEEE, 2007.
[19] Ömer Deniz Akyildiz and Joaquín Míguez. Nudging the particle filter. Statistics and Computing, 30:305–330, 2020.
[20] Nikolas Kantas, Arnaud Doucet, Sumeetpal S. Singh, Jan Maciejowski, and Nicolas Chopin. On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351, 2015.
[21] Jenný Brynjarsdóttir and Anthony O'Hagan. Learning about physical parameters: The importance of model discrepancy. Inverse Problems, 30(11):114007, 2014.
[22] Iñigo Urteaga, Mónica F. Bugallo, and Petar M. Djurić. Sequential Monte Carlo methods under model uncertainty. In 2016 IEEE Statistical Signal Processing Workshop (SSP), pages 1–5. IEEE, 2016.
[23] Simo Särkkä, Arno Solin, and Jouni Hartikainen. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering. IEEE Signal Processing Magazine, 30(4):51–61, 2013.
[24] Abhik Ghosh and Ayanendranath Basu. Robust Bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics, 68(2):413–437, 2016.
[25] Jeremias Knoblauch, Jack E. Jewson, and Theodoros Damoulas. Doubly robust Bayesian inference for non-stationary streaming data with β-divergences. In Advances in Neural Information Processing Systems, pages 64–75, 2018.
[26] Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[27] Yingzhen Li and Richard E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
[28] Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504, 2016.
[29] Dilin Wang, Hao Liu, and Qiang Liu. Variational inference with tail-adaptive f-divergence. In Advances in Neural Information Processing Systems, pages 5737–5747, 2018.
[30] Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovskiĭ, editors, The Oxford Handbook of Nonlinear Filtering, pages 656–704. Oxford University Press, 2011.
[31] Mathieu Gerber, Nicolas Chopin, and Nick Whiteley. Negative association, ordering and convergence of resampling methods. Annals of Statistics, 47(4):2236–2260, 2019.
[32] Pierre Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer, 2004.
[33] Adam M. Johansen and Arnaud Doucet. A note on the auxiliary particle filter. Statistics and Probability Letters, 78(12):1498–1504, September 2008.
[34] Alexandros Beskos, Omiros Papaspiliopoulos, Gareth O. Roberts, and Paul Fearnhead. Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes. Journal of the Royal Statistical Society, Series B, 68(3):333–382, 2006.
[35] Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O. Roberts. Particle filters for partially-observed diffusions. Journal of the Royal Statistical Society, Series B, 70(4):755–777, 2008.
[36] Nicolas Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. The Annals of Statistics, 32(6):2385–2411, 2004.
[37] Simo Särkkä and Arno Solin. Applied Stochastic Differential Equations, volume 10. Cambridge University Press, 2019.
[38] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, MA, 2006.
[39] Herbert E. Rauch, F. Tung, and Charlotte T. Striebel. Maximum likelihood estimates of linear dynamic systems. American Institute of Aeronautics and Astronautics Journal, 3(8):1445–1450, 1965.
[40] Mark Briers, Arnaud Doucet, and Simon Maskell. Smoothing algorithms for state space models. Annals of the Institute of Statistical Mathematics, 62(1):61–89, 2010.
[41] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1947.
[42] Jack Jewson, Jim Q. Smith, and Chris Holmes. Principles of Bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.
[43] Pieralberto Guarniero, Adam M. Johansen, and Anthony Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
[44] Joaquín Míguez, Dan Crisan, and Petar M. Djurić. On the convergence of two sequential Monte Carlo methods for maximum a posteriori sequence estimation and stochastic global optimization. Statistics and Computing, 23(1):91–107, 2013.
[45] Albert N. Shiryaev. Probability. Springer, 1996.
Supplementary Material
A Generalized Bayesian Inference
Parametric Bayesian inference implicitly assumes that the generative model is well-specified; in particular, that the observations are generated from the assumed likelihood model. In general, this assumption may not hold in real-world scenarios. Hence, one may wish to take into account the discrepancy between the true DGM and the assumed likelihood. Generalized Bayesian inference (GBI) is an approach proposed in [8] to deal with such cases.

For the simple Bayesian updating setup, consider a prior $\pi_0$ and the assumed likelihood function $g(y \mid x)$. The posterior $\pi(x \mid y) =: \pi(x)$ is given by Bayes' rule:

$$\pi(x) = \frac{\pi_0(x)\, g(y \mid x)}{Z}, \quad (14)$$

where $Z := \int g(y \mid x)\,\pi_0(x)\,\mathrm{d}x$. [9] and [8] showed that (14) can be seen as a special case of a more general update rule, which can be described as the solution of an optimisation problem in the space of measures. In particular, let $L(\nu; \pi_0, y)$ be a loss function, where $\nu$ is a probability measure and $\pi_0$ is the prior; a belief distribution over $x$ can be constructed by solving

$$\hat{\nu} = \operatorname*{arg\,min}_\nu L(\nu; \pi_0, y). \quad (15)$$

To obtain Bayes-type updating rules, one needs to specify this loss function as a sum of a "data term" and a "regularisation term" [8], given as

$$L(\nu; \pi_0, y) = \lambda_1(\nu, y) + \lambda_2(\nu, \pi_0), \quad (16)$$

where $\lambda_1$ defines a data-dependent "loss" and $\lambda_2$ controls the discrepancy between the prior and the final belief distribution $\hat{\nu}$. [8] show that the form of (16) that satisfies the von Neumann-Morgenstern utility theorem [41] and Bayesian additivity is given by

$$L(\nu; \pi_0, y) = \int \ell(x, y)\,\nu(\mathrm{d}x) + \mathrm{KL}(\nu \,\|\, \pi_0), \quad (17)$$

which leads to a Bayes-type update [8, 42], given by

$$\pi(x) = \frac{\pi_0(x)\, G(y \mid x)}{Z}, \quad (18)$$

with $G(y \mid x) := \exp(-\ell(x, y))$, where $\ell(x, y)$ is some divergence measuring the discrepancy between the observed information and the assumed model. In particular, if one assumes the real-world likelihood, i.e. the DGM, $h$, is different from the model likelihood $g$, and defines $\ell(x, y)$ as a Kullback-Leibler (KL) divergence between the empirical likelihood $\tilde{h}$ (an empirical measure constructed using the observations) and the assumed likelihood $g(y \mid x)$, the standard Bayes rule (14) arises as a solution. To see this, we can employ the KL divergence as a loss,

$$\mathrm{KL}(h \,\|\, g) = \int \log h(y')\, h(\mathrm{d}y') - \int \log g(y' \mid x)\, h(\mathrm{d}y'),$$

and note that the first term does not affect the solution of the optimisation problem (15). Hence we arrive at the integrated loss function

$$\tilde{\ell}(x) = -\int \log g(y' \mid x)\, h(\mathrm{d}y'). \quad (19)$$

By replacing the true likelihood $h$ with its empirical approximation upon observing $y$, i.e., setting $h(\mathrm{d}y') \approx \delta_y(\mathrm{d}y')$, we obtain $\tilde{\ell}(x) \approx \ell(x, y) = -\log g(y \mid x)$, which can be plugged into (18), resulting in the standard Bayes update (14).

(Bayesian additivity, also referred to as coherence, says that applying a sequence of updates with subsets of the data should give rise to the same posterior distribution as a single update employing all of the data.)
As previously mentioned, due to the properties of the KL divergence, the standard Bayes update is not robust to outliers [10]. Hence, substituting the KL with a more robust divergence, such as the β-divergence, can endow inference with more robustness. Specifically, if $\ell$ is chosen as a β-divergence-based loss, the one-step Bayes update for the likelihood $g(y \mid x)$ can be written as

$$\pi(x) = \frac{\pi_0(x)\, G^\beta(y \mid x)}{Z_\beta}, \quad (20)$$

where

$$G^\beta(y \mid x) = \exp\left( \frac{1}{\beta}\, g(y \mid x)^\beta - \frac{1}{\beta+1} \int g(y' \mid x)^{\beta+1}\,\mathrm{d}y' \right). \quad (21)$$

One can then see $G^\beta(y \mid x)$ as a generalised likelihood, resulting from the use of a different loss function compared to the standard Bayes procedure. Here β is a hyperparameter that needs to be selected depending on the degree of misspecification. In general $\beta \in (0, 1]$ and $\lim_{\beta \to 0} G^\beta(y \mid x) = g(y \mid x)$ (up to a normalising constant). Thus, intuitively, small β values are suitable for mild model misspecification and large β values are suitable when the assumed model is expected to significantly deviate from the true model.

B Influence Figure
The use of the β-divergence for updating the particle filter weights can be further motivated by studying the influence profile of the resulting weight update. Figure 5 shows the influence that an observation exerts on the weights as a function of the number of standard deviations away from the mean. The figure compares the standard Gaussian likelihood, a Gaussian likelihood with an inflated variance, Student's t likelihood with 1 degree of freedom, and a standard Gaussian warped by the β-divergence for 4 values of β. The plot shows that, with the β-divergence, observations that are close to the mean exert similar influence to the original standard Gaussian; however, the influence decreases away from the mean. This decrease is dependent on the value of β. For the case of an inflated Gaussian, the influence of close observations is diminished compared to the original standard Gaussian; hence, this is not a suitable substitute to robustify the weight update, since it deviates significantly from the properties of the assumed model near the mean. Finally, Student's t likelihood exerts higher influence on the inlying observations near the mean, which is also different from the assumed model.

C β-PF

C.1 Outline derivation of the loss in (9)

To arrive at the expression of the loss in (9), recall the formula for the β-divergence [11]:

$$D_B^\beta(P \,\|\, Q) = \frac{1}{\beta(\beta+1)} \int \left( p^{\beta+1}(x) + \beta\, q^{\beta+1}(x) - (\beta+1)\, p(x)\, q^\beta(x) \right) \mathrm{d}\mu(x)$$
$$= C_P + \frac{1}{\beta+1} \int q^{\beta+1}(x)\,\mathrm{d}\mu(x) - \frac{1}{\beta} \int q^\beta(x)\, P(\mathrm{d}x),$$

where $P$ and $Q$ are probability measures on the measurable space $(X, \mathcal{A})$, $\mu$ is a finite or σ-finite measure on this space such that $P \ll \mu$ and $Q \ll \mu$ are absolutely continuous w.r.t. $\mu$, and $C_P$ is a constant independent of $Q$. Finally, $p = \mathrm{d}P/\mathrm{d}\mu$ and $q = \mathrm{d}Q/\mathrm{d}\mu$ are the densities (Radon-Nikodym derivatives) of $P$ and $Q$ w.r.t. $\mu$. Substituting the empirical measure $\delta_{y_t}$ for $P$ and the likelihood $g_t(\cdot \mid x_t)$ for $q$, dropping the $Q$-independent constant $C_P$, and comparing with (17) yields (21) directly.

C.2 β-BPF

Here, we provide in Algorithm 2 the algorithmic procedure for the β-BPF that is investigated in the main text.

Figure 5:
This figure depicts the influence of a single observation on the particle weights for different likelihoods or generalised likelihoods (Gaussian; Gaussian with inflated variance; Student's t; β-warped Gaussian for β ∈ {0.001, 0.01, 0.1, 0.2}).
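The curves in Figure 5 can be reproduced from the gradient of the loss with respect to the observation: for a unit-variance location-Gaussian likelihood this is $(y - \mu)$ for the standard update and $g(y \mid x)^\beta (y - \mu)$ for the β-loss, so the β curves track the Gaussian near the mean and decay in the tails. A sketch (our own, under these assumptions):

```python
import numpy as np
from scipy.stats import norm, t

z = np.linspace(0.0, 8.0, 200)            # standard deviations from the mean

gaussian_influence = z                    # d/dy of -log N(y; 0, 1)
beta_influence = {b: np.exp(b * norm.logpdf(z)) * z for b in (0.001, 0.01, 0.1, 0.2)}
student_influence = -np.gradient(t.logpdf(z, df=1), z)   # numerical influence for t_1

# The beta curves coincide with the Gaussian near z = 0 and re-descend for large z.
```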
Algorithm 2 β-Bootstrap Particle Filter
Input: Observation sequence $y_{1:T}$, number of samples $N$.
Initialise: Sample $\{\bar{x}_0^{(i)}\}_{i=1}^N$ from the prior $\pi_0(x_0)$.
for $t = 1$ to $T$ do
    Sample: $\tilde{x}_t^{(i)} \sim f_t(x_t \mid \bar{x}_{t-1}^{(i)})$, for $i = 1, \dots, N$.
    Weight: $w_t^{(i)} \propto G_t^\beta(y_t \mid \tilde{x}_t^{(i)})$, for $i = 1, \dots, N$.
    Resample: $\bar{x}_t^{(i)} \sim \sum_{i=1}^N w_t^{(i)} \delta_{\tilde{x}_t^{(i)}}(\mathrm{d}x_t)$, for $i = 1, \dots, N$.
end for

C.3 β-APF

Here, we provide in Algorithm 3 the algorithmic procedure for the β-APF. Here $q_t$ denotes the proposal distribution at time $t$, which in the case of the fully-adapted APF would be chosen as the conditional density of $x_t$ given $x_{t-1}$ and $y_t$, but in general would be chosen as some approximation of this distribution; $\tilde{G}_t^\beta(x_{t-1})$ is chosen as an approximation of the predictive generalised likelihood, i.e. $\tilde{G}_t^\beta(x_{t-1}) \approx \int G_t^\beta(y_t \mid x_t)\, f_t(x_t \mid x_{t-1})\,\mathrm{d}x_t$.

Algorithm 3 β-Auxiliary Particle Filter
Input: Observation sequence $y_{1:T}$, number of samples $N$.
Initialise: Sample $\{\bar{x}_0^{(i)}\}_{i=1}^N$ independently from the prior $\pi_0(x_0)$.
for $t = 1$ to $T$ do
    Sample: $k^{(i)} \sim \mathbb{P}(k^{(i)} = k \mid y_{1:t}) \propto w_{t-1}^{(k)}\, \tilde{G}_t^\beta(\bar{x}_{t-1}^{(k)})$, for $i = 1, \dots, N$;
            $\bar{x}_t^{(i)} \sim q_t(\bar{x}_t \mid \bar{x}_{t-1}^{k^{(i)}})$, for $i = 1, \dots, N$.
    Weight: $w_t^{(i)} \propto \dfrac{f_t(\bar{x}_t^{(i)} \mid \bar{x}_{t-1}^{k^{(i)}})\, G_t^\beta(y_t \mid \bar{x}_t^{(i)})}{q_t(\bar{x}_t^{(i)} \mid \bar{x}_{t-1}^{k^{(i)}})\, \tilde{G}_t^\beta(\bar{x}_{t-1}^{k^{(i)}})}$, for $i = 1, \dots, N$.
end for

As in the case of the standard APF, the use of reference points obtained from the current states, in which one sets $\tilde{G}_t^\beta(x_{t-1}) = G_t^\beta(y_t \mid \mu_t(x_{t-1}))$ with $\mu_t(x_{t-1}) = \int x_t\, f_t(x_t \mid x_{t-1})\,\mathrm{d}x_t$, is one simple approach to this, but one which does not work well in full generality because it is underdispersed with respect to the true predictive generalised likelihood (cf. [33]). In general, better performance will of course be obtained by developing a good bespoke approximation to the predictive generalised likelihood and the locally-optimal proposal density for any given application; but in order to provide a simple, generically-applicable strategy which is reasonably robust, we suggest setting the proposal equal to the transition density, $q_t = f_t$, and using a stabilised version of the simple approximation to the predictive likelihood, provided by $\tilde{G}_t^\beta(x_{t-1}) = G_t^\beta(y_t \mid \mu_t(x_{t-1})) + c_t$, where $c_t$ is a constant chosen, e.g., as $0.05 \sup_x G_t^\beta(y_t \mid x)$, to avoid any instability in the weighting step. Such a strategy was advocated in the iterated version of this algorithm described by [43], which could in principle also be adapted to the GBI setting.
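The following is a self-contained sketch of Algorithm 2 for a one-dimensional linear-Gaussian model, using the closed-form Gaussian integral from Section 2.1. It is a minimal illustration under assumed model parameters, not the released robust-smc implementation:

```python
import numpy as np
from scipy.stats import norm

def log_G_beta(y, mean, sigma, beta):
    """log of the generalised likelihood (10) for a Gaussian g; see Eq. (9)."""
    data = np.exp(beta * norm.logpdf(y, loc=mean, scale=sigma)) / beta
    integral = (2 * np.pi * sigma**2) ** (-beta / 2) / ((beta + 1) * np.sqrt(beta + 1))
    return data - integral

def beta_bpf(ys, beta, N=500, a=0.9, q_std=1.0, r_std=1.0, seed=0):
    """beta-BPF for x_t = a x_{t-1} + q_t, y_t = x_t + r_t (both noises Gaussian)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)                       # samples from the prior pi_0
    means = []
    for y in ys:
        x = a * x + q_std * rng.standard_normal(N)   # Sample: propagate through f_t
        log_w = log_G_beta(y, mean=x, sigma=r_std, beta=beta)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                                 # Weight: generalised likelihood (10)
        means.append(np.sum(w * x))
        x = rng.choice(x, size=N, p=w)               # Resample (multinomial)
    return np.array(means)
```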
D Theoretical analysis

D.1 Proof of Theorem 1
This is an adaptation of a well-known proof, hence we only sketch the results and provide the constant $c_{t,p,\beta}$.

The result is proved via induction. For $t = 0$, we have the result in the theorem trivially, as it corresponds to the i.i.d. case and, e.g., [32, Lemma 7.3.3] provides an explicit constant. Hence, as an induction hypothesis, we assume

$$\left\| \pi_{t-1}^{\beta,N}(\varphi) - \pi_{t-1}^\beta(\varphi) \right\|_p \le \frac{c_{t-1,p,\beta}\, \|\varphi\|_\infty}{\sqrt{N}}, \quad (22)$$

where $c_{t-1,p,\beta} < \infty$ is independent of $N$. After the sampling step, we obtain the predictive particles $\bar{x}_t^{(i)}$ and form the predictive measure

$$\pi_t^{\beta,N}(\mathrm{d}x_t \mid y_{1:t-1}) = \frac{1}{N} \sum_{i=1}^N \delta_{\bar{x}_t^{(i)}}(\mathrm{d}x_t),$$

and one can then show [44, Lemma 1] that

$$\left\| \pi_t^{\beta,N}(\varphi \mid y_{1:t-1}) - \pi_t^\beta(\varphi \mid y_{1:t-1}) \right\|_p \le \frac{c_{1,t,p,\beta}\, \|\varphi\|_\infty}{\sqrt{N}}, \quad (23)$$

where $c_{1,t,p,\beta} < \infty$ is a constant independent of $N$. After the computation of weights, we construct

$$\tilde{\pi}_t^{\beta,N}(\mathrm{d}x_t) = \sum_{i=1}^N w_t^{(i)}\, \delta_{\bar{x}_t^{(i)}}(\mathrm{d}x_t). \quad (24)$$

Following again [44, Lemma 1], one readily obtains

$$\left\| \pi_t^\beta(\varphi) - \tilde{\pi}_t^{\beta,N}(\varphi) \right\|_p \le \frac{c_{2,t,p,\beta}\, \|\varphi\|_\infty}{\sqrt{N}}, \quad (25)$$

where

$$c_{2,t,p,\beta} = \frac{2\, \|G_t^\beta\|_\infty\, c_{1,t,p,\beta}}{\pi_t(G_t^\beta)} < \infty,$$

where we note $\pi_t(G_t^\beta) > 0$. Finally, performing multinomial resampling leads to a conditionally-i.i.d. sampling case, which yields

$$\left\| \tilde{\pi}_t^{\beta,N}(\varphi) - \pi_t^{\beta,N}(\varphi) \right\|_p \le \frac{c_{3,t,p,\beta}\, \|\varphi\|_\infty}{\sqrt{N}}. \quad (26)$$

Combining (25) and (26) yields the result with $c_{t,p,\beta} = c_{2,t,p,\beta} + c_{3,t,p,\beta}$.

D.2 Proof of Corollary 1
We sketch here a standard argument for obtaining a strong law of large numbers from $L_p$ error bounds. Let us write, for simplicity,

$$\xi_N = \pi_t^{\beta,N}(\varphi) \quad \text{and} \quad \xi = \pi_t^\beta(\varphi). \quad (27)$$

The strategy is to note that

$$\left\{ \lim_{k \to \infty} |\xi_k - \xi| = 0 \right\} = \bigcap_{l=1}^\infty \left\{ \lim_{k \to \infty} |\xi_k - \xi| < 1/l \right\},$$

and hence, if it can be shown that $\mathbb{P}(\{|\xi_k - \xi| < 1/l\}) \to 1$ for every $l \in \mathbb{N}$, then the event that $\xi_k \to \xi$ is a countable intersection of events of probability 1 and hence itself has probability 1.

Using the Borel-Cantelli lemma (see, e.g., [45, p. 255]), to show that $\mathbb{P}(|\xi_k - \xi| \ge \varepsilon) \to 0$ as $k \to \infty$ it suffices to demonstrate that

$$\sum_{k=1}^\infty \mathbb{P}(|\xi_k - \xi| \ge \varepsilon) < \infty.$$

We do this via the generalised Markov inequality:

$$\mathbb{P}(|\xi_k - \xi| \ge \varepsilon) \le \frac{\mathbb{E}\left[|\xi_k - \xi|^p\right]}{\varepsilon^p},$$

which combined with Theorem 1 yields

$$\mathbb{P}(|\xi_k - \xi| \ge \varepsilon) \le \frac{c^p\, \|\varphi\|_\infty^p}{k^{p/2}\, \varepsilon^p}.$$

Choosing any $p > 2$ ensures that the right-hand side is summable, and hence that $\mathbb{P}(|\xi_k - \xi| < \varepsilon) \to 1$ as $k \to \infty$ for any $\varepsilon > 0$; by taking $\varepsilon = 1/l$ for each $l \in \mathbb{N}$, the proof is complete.

D.3 Proof of Theorem 2
We refer to the Proposition in [33], which provides explicit expressions for sequential importance resampling based particle filters within the general frameworks of [32, 36]; the same argument holds mutatis mutandis in the context of the β-BPF. We note that the asymptotic variance expression $\sigma_{t,\beta}^2(\varphi)$ is given as follows. For $t = 1$, we obtain [33]

$$\sigma_{1,\beta}^2(\varphi) = \int \frac{p_1^\beta(x_1 \mid y_1)^2}{\bar{f}_1(x_1)} \left( \varphi(x_1) - \bar{\varphi}_1 \right)^2 \mathrm{d}x_1,$$

where $\bar{f}_1(x_1) := \int \pi_0(x_0)\, f_1(x_1 \mid x_0)\,\mathrm{d}x_0$. Then, for $t > 1$ [33],

$$\sigma_{t,\beta}^2(\varphi_t) = \int \frac{p_t^\beta(x_1 \mid y_{1:t})^2}{\bar{f}_1(x_1)} \left( \int \varphi_t(x_{1:t})\, p_t^\beta(x_{2:t} \mid y_{2:t}, x_1)\,\mathrm{d}x_{2:t} - \bar{\varphi}_t \right)^2 \mathrm{d}x_1$$
$$+ \sum_{k=2}^{t-1} \int \frac{p_k^\beta(x_{1:k} \mid y_{1:t})^2}{p_{k-1}^\beta(x_{1:k-1} \mid y_{1:k-1})\, f_k(x_k \mid x_{k-1})} \left( \int \varphi_t(x_{1:t})\, p_t^\beta(x_{k+1:t} \mid y_{k+1:t}, x_k)\,\mathrm{d}x_{k+1:t} - \bar{\varphi}_t \right)^2 \mathrm{d}x_{1:k}$$
$$+ \int \frac{p_t^\beta(x_{1:t} \mid y_{1:t})^2}{p_{t-1}^\beta(x_{1:t-1} \mid y_{1:t-1})\, f_t(x_t \mid x_{t-1})} \left( \varphi_t(x_{1:t}) - \bar{\varphi}_t \right)^2 \mathrm{d}x_{1:t}.$$

E Asymmetric Wiener velocity
In the case of simple, symmetric noise settings with additive contamination the use of heavy-tailedlikelihoods such as Student’s t may be still seen as a viable alternative to robustify the inference.However, there are some realistic settings in which such off-the-shelf heavy-tailed replacementsare not feasible or require considerable model-specific work. Consider, as a simple illustration,the Wiener velocity example in Section 5.1, where the observation noise in (13) is replaced with (cid:15) t ∼ [ −∞ , N (0 ,
1) + [0 , + ∞ ] N (0 , ) . This simulates an asymmetric noise scenario. Theobservations are further contaminated with multiplicative exponential noise, i.e. (cid:15) t ← ξ (cid:15) t , for ξ ∼ Exp (1000) with probability p c . This sums up to a multiplicatively corrupted asymmetric noisedistribution which could, for example, represent a sensor with asymmetric noise profile in a failingregime which occasionally exhibits excessive gain.For this example, it is easy to derive a BPF with the assymetric likelihood. It is also easy to extendthis likelihood to the β -BPF case. We test BPF and the β -BPF ( β = 0 . ) versus two versions of thet-BPF, in which the likelihood is replaced with a heavy-tailed symmetric one, one set to a short scale σ = 1 and the other set to a long scale σ = 10 .Figure 6 shows the results for this experiment. The BPF is unable to handle the multiplicativeexponential contamination, as can be seen by the NMSE values. It also provides poor posteriorcoverage. The t-BPF fairs better with this type of contamination where we can see a trade-off between18 N M S E % E C BPFt-BPF-BPF
Figure 6 shows the results for this experiment. The BPF is unable to handle the multiplicative exponential contamination, as can be seen from the NMSE values; it also provides poor posterior coverage. The t-BPF fares better with this type of contamination, but exhibits a trade-off between accuracy and coverage depending on the chosen scale of the likelihood. This is due to the symmetry of the t-distribution, which overestimates one of the tails depending on the scale. The β-BPF does not suffer from this trade-off and outperforms the t-BPF on both metrics.

Figure 6:
The mean metrics over state dimensions for the asymmetric Wiener velocity example with p_c = 0.1. The left panel presents the NMSE results (lower is better) and the right panel presents the 90% empirical coverage results (higher is better), evaluated on 100 runs. The x-axis ticks indicate the scale used for the Student's t likelihood. The horizontal dashed line in black in the right panel indicates the 90% mark for the coverage.

While one might attempt to model the noise with an asymmetric construction of t-distributions that approximates the noise structure, we argue that in more general settings using heavy-tailed distributions requires approximating the noise structure and making modelling choices that could be arbitrarily complex. This is in contrast to specifying a single tuning parameter, as in the β-divergence case. The β-BPF requires no further modelling than the original problem and can be used as a drop-in replacement for nearly all types of likelihood structures.

F Experiment Details
F.1 Evaluation Metrics
The following metrics are used to evaluate the experiments:
The Normalised Mean Squared Error (NMSE) is computed per state dimension $j$ as
\[
\mathrm{NMSE}_j = \frac{\sum_{t=1}^{T} (x_{tj} - \hat{x}_{tj})^2}{\sum_{t=1}^{T} x_{tj}^2}, \tag{28}
\]
with $\hat{x}_{tj} = \frac{1}{N}\sum_{i=1}^{N} \bar{x}_{tj}^{(i)}$, i.e. the mean over resampled particles (trajectories).

The 90% Empirical Coverage (EC) is computed per state dimension $j$ as
\[
\mathrm{EC}_j = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}_{C_t}(x_{tj}), \tag{29}
\]
with $C_t = \big[ q_{0.05}(\{\bar{x}_{tj}^{(i)}\}_{i=1}^N),\; q_{0.95}(\{\bar{x}_{tj}^{(i)}\}_{i=1}^N) \big]$, where $q_\alpha$ is the quantile function.

The Predictive Median Absolute Error (MedAE) is computed per observation dimension $j$ as
\[
\mathrm{MedAE}_j = \mathrm{MEDIAN}_{t \in \{1,\ldots,T\}}\big( |\hat{y}_{tj} - y_{tj}| \big), \tag{30}
\]
where $\hat{y}_t \sim \sum_{i=1}^{N} w_t^i\, g_t(y \mid x_t^{(i)})$.

Aggregation: Metrics are often presented as aggregates over the state dimensions, which are simply the mean of the metric across the state dimensions.
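For concreteness, the following is a sketch of these metrics for a single state (or observation) dimension, assuming `x` holds the true trajectory of shape (T,), `xbar` the resampled particles of shape (T, N), and `y`, `yhat` the true and predicted observations; the array names are illustrative, not taken from the paper's code.

```python
import numpy as np

def nmse(x, xbar):
    # x: (T,) true states; xbar: (T, N) resampled particles
    xhat = xbar.mean(axis=1)  # mean over resampled particles, Eq. (28)
    return np.sum((x - xhat) ** 2) / np.sum(x ** 2)

def empirical_coverage(x, xbar, level=0.90):
    # fraction of time steps whose truth falls in the central interval, Eq. (29)
    lo = np.quantile(xbar, (1.0 - level) / 2.0, axis=1)
    hi = np.quantile(xbar, (1.0 + level) / 2.0, axis=1)
    return np.mean((x >= lo) & (x <= hi))

def median_abs_error(y, yhat):
    # MedAE over time for one observation dimension, Eq. (30)
    return np.median(np.abs(yhat - y))
```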
F.2 Details on the implementation of the selection criterion in Section 3.3
From (11), we choose agg as the median and L as the absolute error. When the observations are multidimensional, we take the average loss weighted by the inverse of the median of each dimension. We compute the score for different values of β from a grid and choose the β that minimises the score. For multiple runs, we report the modal value of the β's over all the runs.

In the interest of simplicity, we use the entire observation sequence from an alternative realisation of the same simulation to compute the score. However, in practice one might want to tune β on a sub-sequence to avoid extra computation.
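As a sketch of this procedure: for observations `y` of shape (T, d_y), the score uses agg = median and inverse-median dimension weights as described above; `predict_fn` stands in for whatever routine produces one-step-ahead predictions for a given β, and the grid values are illustrative.

```python
import numpy as np

def score(y, yhat):
    # weighted absolute error: each dimension scaled by the inverse of its median
    w = 1.0 / np.median(np.abs(y), axis=0)          # (d_y,)
    losses = np.mean(np.abs(yhat - y) * w, axis=1)  # weighted loss per time step
    return np.median(losses)                        # agg = median over time

def select_beta(y, predict_fn, grid=(0.001, 0.005, 0.01, 0.05, 0.1, 0.2)):
    # predict_fn(y, beta) -> one-step-ahead predictions with the same shape as y
    scores = {b: score(y, predict_fn(y, b)) for b in grid}
    return min(scores, key=scores.get)
```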
F.3 Wiener velocity model experiment details (Section 5.1)

In this section, we detail the experimental setup used to obtain the results for Section 5.1.
Simulator settings
We synthesise the data with a Python simulator utilising NumPy. We discretise the system with $\Delta\tau = 0.1$ and simulate it for 100 time units, i.e. we obtain 1000 time points in total. For the state evolution process in Equation (12), we set the transition matrix
\[
A = \begin{bmatrix} 1 & 0 & \Delta\tau & 0 \\ 0 & 1 & 0 & \Delta\tau \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\quad \text{and the transition covariance matrix} \quad
Q = \begin{bmatrix} \Delta\tau^3/3 & 0 & \Delta\tau^2/2 & 0 \\ 0 & \Delta\tau^3/3 & 0 & \Delta\tau^2/2 \\ \Delta\tau^2/2 & 0 & \Delta\tau & 0 \\ 0 & \Delta\tau^2/2 & 0 & \Delta\tau \end{bmatrix}.
\]
For the observation process in Equation (13), we set the observation matrix
\[
H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\]
and the noise covariance $\Sigma = I$. The initial state of the simulator is set to $x_0 = [140, \ldots]^\top$.
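A minimal sketch of this simulator is given below; the process-noise spectral density `q = 1.0` and the trailing components of the initial state are assumptions, since they are truncated in the text.

```python
import numpy as np

dt, q = 0.1, 1.0  # step size; q (process-noise spectral density) is assumed
A = np.array([[1.0, 0.0, dt, 0.0],
              [0.0, 1.0, 0.0, dt],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
Q = q * np.array([[dt**3 / 3, 0.0,       dt**2 / 2, 0.0      ],
                  [0.0,       dt**3 / 3, 0.0,       dt**2 / 2],
                  [dt**2 / 2, 0.0,       dt,        0.0      ],
                  [0.0,       dt**2 / 2, 0.0,       dt       ]])
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
x = np.array([140.0, 140.0, 0.0, 0.0])  # initial state; trailing entries assumed
xs, ys = [], []
for _ in range(1000):
    x = A @ x + rng.multivariate_normal(np.zeros(4), Q)  # Eq. (12)
    ys.append(H @ x + rng.standard_normal(2))            # Eq. (13), Sigma = I
    xs.append(x)
```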
Contamination
To simulate contaminated observations, we apply extra i.i.d. Gaussian noise with a standard deviation of 100.0 to the observation sequence with probability p_c per observation.
Sampler settings
We initialise the samplers by sampling from the prior given by N(x_0, Q), with x_0 being the initial state of the simulator and Q as above. We set the likelihood covariance to the simulator noise covariance and the number of samples to 1000.
Experiment settings
Each experiment consists of 100 runs, where all samplers are seeded with the same seed per run; however, the seeds vary across runs. We use the same state sequence, obtained from the simulator as above, for all runs. However, each run simulates a new observation sequence (i.e. the observation noise changes per run).
F.4 Terrain Aided Navigation (TAN) experiment details (Section 5.2)
In this section, we detail the experimental setup used to obtain the results for Section 5.2.
Simulator settings
We synthesise the data with a Python simulator utilising NumPy. We discretise the system with $\Delta\tau = 0.1$ and simulate it for 200 time units, i.e. we obtain 2000 time points in total. For the state evolution process in Equation (12), we set the transition matrix in block form,
\[
A = \begin{bmatrix} I_3 & \Delta\tau\, I_3 \\ 0_3 & I_3 \end{bmatrix},
\]
for a state comprising three position and three velocity components, and a diagonal transition covariance matrix $Q$. For the observation process, we set the non-linear observation function
\[
h(x_t) = \begin{bmatrix} x_{t,3} - \mathrm{DEM}(x_{t,1}, x_{t,2}) \\ \sqrt{(x_{t,1} - x_{0,1})^2 + (x_{t,2} - x_{0,2})^2} \end{bmatrix},
\]
where DEM is a non-analytic Digital Elevation Map. For our simulation we set DEM to
\[
\mathrm{DEM}(a, b) = \mathrm{peaks}(q \cdot a, q \cdot b) + \sum_{i} \alpha_i \sin(\omega_i \cdot q \cdot a) \cos(\psi_i \cdot q \cdot b),
\]
with
\[
\mathrm{peaks}(c, d) = 200\Big(3(1 - c)^2 e^{-c^2 - (d+1)^2} - 10\big(\tfrac{c}{5} - c^3 - d^5\big) e^{-c^2 - d^2} - \tfrac{1}{3} e^{-(c+1)^2 - d^2}\Big),
\]
$\alpha = [300, \ldots]$, $\omega = [5, \ldots]$, $\psi = [4, \ldots]$ and a scaling constant $q$. The noise covariance is $\Sigma = \sigma^2 I$ with $\sigma^2 = 400$. The initial state of the simulator is set to a fixed six-dimensional vector $x_0$.
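The following sketch reproduces the structure of the terrain map; the trailing entries of `alpha`, `omega` and `psi` and the scaling constant `q` are illustrative stand-ins, since only their leading entries survive in the text.

```python
import numpy as np

alpha = np.array([300.0, 250.0, 200.0, 150.0, 100.0, 50.0])  # only 300 is given
omega = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])             # only 5 is given
psi   = np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.25])            # only 4 is given
q     = 1.0e-4                                               # scaling constant, assumed

def peaks(c, d):
    # MATLAB-style peaks surface, scaled by 200 as in the text
    return 200.0 * (3.0 * (1.0 - c) ** 2 * np.exp(-c**2 - (d + 1.0) ** 2)
                    - 10.0 * (c / 5.0 - c**3 - d**5) * np.exp(-c**2 - d**2)
                    - np.exp(-(c + 1.0) ** 2 - d**2) / 3.0)

def dem(a, b):
    # terrain height at horizontal position (a, b)
    return peaks(q * a, q * b) + np.sum(
        alpha * np.sin(omega * q * a) * np.cos(psi * q * b))
```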
Contamination
To simulate contaminated observations, we apply extra i.i.d. Student's t noise with 1 degree of freedom and scale σ, where σ is given as above. The contamination is applied to observation instances with probability p_c per observation.
Sampler settings
We initialise the samplers by sampling from the prior given by N(x_0, Q), with x_0 being the initial state of the simulator and Q as above. We set the likelihood covariance to the simulator noise covariance and the number of samples to 3000. For the APFs, we make the same design choices outlined in Appendix C.3, i.e. setting the proposal density to the transition density and stabilising the predictive likelihood approximation with the given additive constant.
Experiment settings
Each experiment consists of 50 runs, where all samplers are seeded with the same seed per run; however, the seeds vary across runs. We use the same state sequence, obtained from the simulator as above, for all runs. However, each run simulates a new observation sequence (i.e. the observation noise changes per run).
F.5 Asymmetric Wiener velocity model experiment details (Appendix E)
In this section, we detail the experimental setup used to obtain the results for Appendix E.
Simulator settings
We use the same simulator settings as in Appendix F.3, but change the observation noise to the split-Gaussian
\[
\epsilon_t \sim \mathbb{1}_{(-\infty,0]}(\epsilon_t)\, \mathcal{N}(0, 1) + \mathbb{1}_{[0,+\infty)}(\epsilon_t)\, \mathcal{N}(0, \sigma_+^2)
\]
of Appendix E.

Contamination
To simulate contaminated observations, we apply i.i.d. multiplicative Exponential noise with a scale of 1000, with probability p_c = 0.1 per observation.
Sampler settings
We initialise the samplers by sampling from the prior given by N(x_0, Q), with x_0 being the initial state of the simulator and Q as above. We set the number of samples to 1000.
Experiment settings
We use the same settings as in Appendix F.3.
F.6 Air quality experiment details (Section 5.3)
In this section, we detail the setup used to obtain the results for Section 5.3.
Data
The data consists of measurements from one of the London air quality sensors. We select a time window of 200 hours. No preprocessing was performed on the data.

Kernel
We use the Matérn 5/2 kernel and set the lengthscale $l$ and the signal variance $\sigma_s^2 = 32$. We discretise the SDE representation of the Matérn 5/2 GP with stepsize $\Delta\tau$ to obtain an LGSSM of the form (12)–(13), with transition matrix
\[
A = \exp(\Delta\tau F) = \exp\left(\Delta\tau \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -\lambda^3 & -3\lambda^2 & -3\lambda \end{bmatrix}\right),
\]
where $\lambda = \sqrt{5}/l$, and transition covariance matrix $Q = P_\infty - A P_\infty A^\top$, with
\[
P_\infty = \begin{bmatrix} \sigma_s^2 & 0 & -\kappa \\ 0 & \kappa & 0 \\ -\kappa & 0 & \sigma_s^2 \lambda^4 \end{bmatrix},
\]
where $\kappa = \sigma_s^2 \lambda^2 / 3$. For the observation process in (13), the observation matrix is set to $H = [1, 0, 0]$ and the noise variance $\sigma^2 = 1$. The prior on the initial state $x_0$ is given as $\mathcal{N}(m, S)$, where $m^\top = [0, 0, 0]$ and $S = P_\infty$.
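A sketch of this construction is given below; the lengthscale, signal variance and step size are placeholders where the text truncates them, and `scipy.linalg.expm` computes the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

l, sigma2, dt = 0.5, 32.0, 0.1  # lengthscale, signal variance, step (assumed)
lam = np.sqrt(5.0) / l
F = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [-lam**3, -3.0 * lam**2, -3.0 * lam]])
kappa = sigma2 * lam**2 / 3.0
P_inf = np.array([[sigma2, 0.0, -kappa],
                  [0.0, kappa, 0.0],
                  [-kappa, 0.0, sigma2 * lam**4]])
A = expm(dt * F)               # transition matrix of the LGSSM
Q = P_inf - A @ P_inf @ A.T    # transition covariance
H = np.array([1.0, 0.0, 0.0])  # observe the GP value; noise variance 1
```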
Sampler settings
We initialise the samplers by sampling from the prior N(m, S). We set the number of samples to 1000.
Smoother settings
We set the number of samples to 1000 for the FFBS smoother.
Experiment settings
We repeat the sampling procedure for 100 runs, where the samplers are seeded differently for each run. The seeds are shared among samplers for each run. The Kalman filter does not require multiple runs as its solution is deterministic.
G Further results
G.1 Wiener velocity experiment

[Figure 7 comprises one panel pair per contamination level p_c ∈ {0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4}, each showing aggregate NMSE and 90% EC over a grid of β values for the Kalman filter, the BPF, the Oracle β-BPF (for p_c > 0) and the Predictive Selection β-BPF.]
Figure 7:
The mean metrics over state dimensions for the Wiener velocity example. The top panel presents the NMSE results (lower is better) and the bottom panel presents the 90% empirical coverage results (higher is better), on 100 runs. The vertical dashed line in gold indicates the value of β chosen by the selection criterion in Section 3.3. The horizontal dashed line in black in the lower panel indicates the 90% mark for the coverage.
Figure 8:
Marginal filtering distributions for the Kalman filter, the BPF and the β-BPF (displacement in the x direction, in metres).

Figure 9:
Marginal filtering distributions for the Kalman filter, the BPF and the β-BPF (displacement in the y direction, in metres).

Figure 10:
Marginal filtering distributions for the Kalman filter, the BPF and the β-BPF (velocity in the x direction, in metres per second).

Figure 11:
Marginal filtering distributions for the Kalman filter, the BPF and the β-BPF (velocity in the y direction, in metres per second).

β        MedAE (s.e.)
0.0001   0.97 (0.00)
0.0005   0.97 (0.00)
0.001    0.97 (0.00)
0.005    0.90 (0.00)
0.01     0.90 (0.00)
0.05     0.90 (0.00)
0.1      0.90 (0.00)
0.2      0.92 (0.00)
0.5      72.22 (12.34)
0.8      226.61 (11.62)

Table 2:
Predictive results on the Wiener velocity example for a fixed contamination probability p_c. The one-step-ahead predictive performance is measured by the median absolute error. The figures are averaged across 100 runs and the standard error on the average score is provided.

G.2 TAN experiment

[Figure 12 panels: BPF, displacement in the x direction, NMSE = 0.0035, 90% coverage = 0.115 (top); β-BPF, displacement in the x direction, NMSE = 0.0001, 90% coverage = 0.734 (bottom); showing the true trajectory, the filtering distributions and the prominent outliers.]
Figure 12:
Marginal filtering distributions for the BPF (top) and β-BPF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 13 panels: BPF, displacement in the y direction, NMSE = 0.0070, 90% coverage = 0.115; β-BPF, displacement in the y direction, NMSE = 0.0002, 90% coverage = 0.743.]
Figure 13:
Marginal filtering distributions for the BPF (top) and β-BPF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 14 panels: BPF, displacement in the z direction, NMSE = 0.0044, 90% coverage = 0.393; β-BPF, displacement in the z direction, NMSE = 0.0003, 90% coverage = 0.931.]
Figure 14:
Marginal filtering distributions for the BPF (top) and β-BPF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 15 panels: BPF, velocity in the x direction, NMSE = 0.0019, 90% coverage = 0.614; β-BPF, velocity in the x direction, NMSE = 0.0016, 90% coverage = 0.860.]
Figure 15:
Marginal filtering distributions for the BPF (top) and β-BPF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 16 panels: BPF, velocity in the y direction, NMSE = 0.0115, 90% coverage = 0.532; β-BPF, velocity in the y direction, NMSE = 0.0090, 90% coverage = 0.793.]
Figure 16:
Marginal filtering distributions for the BPF (top) and β-BPF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 17 panels: BPF, velocity in the z direction, NMSE = 0.1511, 90% coverage = 0.691; β-BPF, velocity in the z direction, NMSE = 0.0944, 90% coverage = 0.856.]
Figure 17:
Marginal filtering distributions for the BPF (top) and β-BPF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 18: effective sample size over time for the BPF and the β-BPF, with prominent outliers marked.]
Figure 18:
Effective sample size with time for the BPF (top) and the β-BPF with β = 0.1.

[Figure 19 panels: APF, displacement in the x direction, NMSE = 0.0013, 90% coverage = 0.171; β-APF, displacement in the x direction, NMSE = 0.0000, 90% coverage = 0.860.]
Figure 19:
Marginal filtering distributions for the APF (top) and β-APF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 20 panels: APF, displacement in the y direction, NMSE = 0.0025, 90% coverage = 0.169; β-APF, displacement in the y direction, NMSE = 0.0001, 90% coverage = 0.884.]
Figure 20:
Marginal filtering distributions for the APF (top) and β-APF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 21 panels: APF, displacement in the z direction, NMSE = 0.0024, 90% coverage = 0.487; β-APF, displacement in the z direction, NMSE = 0.0002, 90% coverage = 0.943.]
Figure 21:
Marginal filtering distributions for the APF (top) and β-APF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 22 panels: APF, velocity in the x direction, NMSE = 0.0034, 90% coverage = 0.499; β-APF, velocity in the x direction, NMSE = 0.0009, 90% coverage = 0.947.]
Figure 22:
Marginal filtering distributions for the APF (top) and β-APF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 23 panels: APF, velocity in the y direction, NMSE = 0.0128, 90% coverage = 0.454; β-APF, velocity in the y direction, NMSE = 0.0044, 90% coverage = 0.898.]
Figure 23:
Marginal filtering distributions for the APF (top) and β-APF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 24 panels: APF, velocity in the z direction, NMSE = 0.1124, 90% coverage = 0.761; β-APF, velocity in the z direction, NMSE = 0.0910, 90% coverage = 0.864.]
Figure 24:
Marginal filtering distributions for the APF (top) and β-APF (bottom) with β = 0.1. The locations of the most prominent (largest deviation) outliers are shown as dashed vertical lines in black.

[Figure 25: effective sample size over time for the APF and the β-APF, with prominent outliers marked.]
Figure 25:
Effective sample size with time for the APF (top) and the β-APF with β = 0.1.

Filter           p_c = 0.05   0.1          0.15         0.2          0.25         0.3          0.35         0.4
BPF              16.63(0.06)  17.67(0.05)  17.88(0.06)  18.66(0.07)  19.68(0.08)  20.12(0.09)  20.96(0.08)  21.55(0.09)
t-BPF            16.33(0.05)  17.15(0.05)  17.15(0.05)  17.92(0.05)  18.72(0.06)  18.95(0.07)  19.71(0.09)  20.11(0.08)
β-BPF (β=0.005)  16.26(0.05)  17.01(0.05)  16.96(0.06)  17.60(0.05)  18.34(0.07)  18.48(0.06)  19.24(0.06)  19.60(0.07)
β-BPF (β=0.01)   16.23(0.04)  16.91(0.05)  16.65(0.05)  17.06(0.05)  17.74(0.05)  17.86(0.06)  18.43(0.05)  18.61(0.06)
β-BPF (β=0.05)   16.39(0.04)  16.97(0.05)  16.70(0.06)  17.23(0.06)  18.03(0.06)  17.84(0.06)  18.45(0.07)  18.78(0.08)
β-BPF (β=0.1)    17.46(0.05)  17.92(0.06)  17.90(0.11)  18.61(0.12)  19.49(0.11)  19.15(0.10)  19.76(0.10)  20.24(0.11)
β-BPF (β=0.2)    16.56(0.04)  17.07(0.05)  16.58(0.04)  17.43(0.04)  17.87(0.06)  17.85(0.05)  18.56(0.06)  18.84(0.06)
APF              15.96(0.05)  17.09(0.04)  17.34(0.05)  18.13(0.05)  19.04(0.08)  19.51(0.06)  20.67(0.07)  21.15(0.09)
β-APF (β=0.005)  15.71(0.04)  16.49(0.05)  16.57(0.05)  17.19(0.04)  17.80(0.05)  18.15(0.04)  18.96(0.07)  19.19(0.06)
β-APF (β=0.01)   15.69(0.04)  16.31(0.04)  16.31(0.04)  16.85(0.04)  17.47(0.05)  17.66(0.05)  18.46(0.05)  18.66(0.05)
β-APF (β=0.05)   15.69(0.04)  16.26(0.04)  16.01(0.04)  16.53(0.03)  17.17(0.05)  17.14(0.06)  17.83(0.05)  17.92(0.05)
β-APF (β=0.1)    15.84(0.04)  16.46(0.05)  16.16(0.04)  16.56(0.04)  17.30(0.05)  17.16(0.04)  17.89(0.05)  18.09(0.05)
β-APF (β=0.2)    16.90(0.06)  17.35(0.06)  17.32(0.09)  17.68(0.06)  18.78(0.08)  18.40(0.06)  18.87(0.06)  19.28(0.08)

Table 3:
Predictive results on the TAN example. The one-step-ahead predictive performance is measured by the median absolute error. The figures are averaged across 50 runs and the standard error on the average score is provided.
G.3 London air quality experiment
Table 4:
GP regression NMSE (lower is better) and 90% empirical coverage (higher is better) for the credible intervals of the posterior predictive distribution, on 100 runs. The bold font indicates the statistically significant best result according to the Wilcoxon signed-rank test. All presented results are statistically different from each other according to the test. Results are reported as median (IQR) per filter (smoother): Kalman (RTS), BPF (FFBS), and β-BPF (FFBS) over a grid of β values.

[Figure 26 panels: Matérn 5/2 GP with Kalman (RTS), NMSE = 0.1435, 90% coverage = 0.685; with BPF (FFBS), NMSE = 0.1125, 90% coverage = 0.640; with β-BPF (FFBS), NMSE = 0.0614, 90% coverage = 0.765; showing ground truth, training data and the smoothing distributions.]
Figure 26:
The GP fit on the measurement time series for one of the London air quality sensors. The top panel shows the posterior from the Kalman (RTS) smoothing. The middle panel shows the posterior from the BPF (FFBS). The bottom panel shows the posterior from the β-BPF (FFBS) for β = 0.1.