A scalable optimal-transport based local particle filter
Matthew M. Graham* and Alexandre H. Thiery**
Department of Statistics and Applied Probability, National University of Singapore
Abstract.
Filtering in spatially-extended dynamical systems is a challenging problem with significant practical applications such as numerical weather prediction. Particle filters allow asymptotically consistent inference but require infeasibly large ensemble sizes for accurate estimates in complex spatial models. Localisation approaches, which perform local state updates by exploiting low dependence between variables at distant points, have been suggested as a potential resolution to this issue. Naively applying the resampling step of the particle filter locally however produces implausible spatially discontinuous states. The ensemble transform particle filter replaces resampling with an optimal-transport map and can be localised by computing maps for every spatial mesh node. The resulting local ensemble transform particle filter is however computationally intensive for dense meshes. We propose a new optimal-transport based local particle filter which computes a fixed number of maps independent of the mesh resolution and interpolates these maps across space, reducing the computation required and making it possible to ensure particles remain spatially smooth. We numerically illustrate that, at a reduced computational cost, we are able to achieve the same accuracy as the local ensemble transform particle filter, and retain its improved robustness to non-Gaussianity and ability to quantify uncertainty when compared to local ensemble Kalman filters.
MSC 2010 subject classifications:
Primary 65C35; secondary 86A22.
Key words and phrases: particle filtering, Bayesian filtering, spatial models, inverse problems, localisation, optimal transport.
1. INTRODUCTION
A natural paradigm for modelling geophysical systems such as the atmosphere is as spatially-extended dynamical systems: one or more state variables defined over a spatial domain are evolved through time according to a set of stochastic partial differential equations (SPDEs). In this article we consider the problem of inferring the distribution of the unknown state of such a system given noisy observations at a sequence of time points. As well as being an important problem in its own right, state inference is also a vital sub-component of tasks such as forecasting the future state of a system and inferring values for any free parameters in the numerical model used (Fearnhead and Künsch, 2018). (e-mail: * [email protected]; ** [email protected])
A key issue in performing state inference in spatially-extended systems is the typically high dimension of the state space. To allow numerical simulation of the SPDE model the spatial domain is discretised into a mesh (also known as a grid); the system state can then be represented as a finite-dimensional vector consisting of the concatenated values of the state variables at the nodes of the mesh. The resulting state dimension is therefore a multiple of the number of mesh nodes, which can be very large. For example, in the global atmospheric models used in current operational numerical weather prediction (NWP) systems the mesh size can be of the order of $10^8$ or higher (Bauer, Thorpe and Brunet, 2015).

For large state dimensions, even inference in linear-Gaussian models using the Kalman filter (KF) (Kalman, 1960) is computationally infeasible due to the high processing and memory costs of operations involving the full covariance matrix of the state distribution. This motivated the development of ensemble Kalman filter (EnKF) methods (Evensen, 1994; Burgers, van Leeuwen and Evensen, 1998) which use an ensemble of particles to represent the state distribution rather than the full mean and covariance statistics. As the ensemble sizes used are typically much smaller than the state dimension the computational savings can be considerable. Although EnKF methods are only consistent in an infinite ensemble limit for linear-Gaussian models (Furrer and Bengtsson, 2007; Le Gland, Monbet and Tran, 2011), they have been empirically found to perform well in models with weakly non-linear state update and observation operators, even when using relatively small ensembles of size much less than the state dimension (Evensen, 2009); the performance of the EnKF in non-asymptotic regimes has been theoretically investigated in several recent works (Kelly, Law and Stuart, 2014; Del Moral and Tugaut, 2018; Bishop and Del Moral, 2018; Tong, Majda and Kelly, 2016).
A key aspect in allowing EnKF methods to be scaled to large spatially-extended geophysical models is the use of spatial localisation (Houtekamer and Mitchell, 1998; Hamill, Whitaker and Snyder, 2001). Localisation exploits the observation that there is often low statistical dependence between state variables at distant points in spatially-extended systems. In EnKF methods this property is used to improve the noisy covariance estimates resulting from the small ensemble sizes used by removing spurious correlations between distant state variables.

EnKF methods have been successfully applied in a variety of settings, including operational NWP systems (Bonavita, Torrisi and Marcucci, 2008; Clayton, Lorenc and Barker, 2013); however the quality of the state distribution estimates is fundamentally limited by the linear-Gaussian assumptions made by the underlying KF updates. For models with non-Gaussian noise processes or strongly non-linear state update or observation operators, EnKF methods tend to produce poor estimates of the state distribution (Lei, Bickel and Snyder, 2010).

Particle filters (PFs) (Gordon, Salmond and Smith, 1993; Del Moral, 1996) offer an alternative ensemble-based approach to sequential state inference that, unlike EnKF methods, provides consistent estimates for non-Gaussian distributions. The simplest variant, the bootstrap PF, alternates propagating the ensemble members forward in time under the model dynamics with resampling according to weights calculated from the likelihood of the particles given the observed data.

(Throughout this article we will for brevity refer to dynamical models with linear state update and observation operators and additive Gaussian noise processes as linear-Gaussian.)

(Current operational NWP ensemble systems are limited to ∼50 particles due to the high computational cost of numerically integrating the particles forward in time (Buizza et al., 2005).)

Fig 1: Examples of local PF assimilation updates applied to a Gaussian process model. The smooth true state field is shown in panel (a) and corresponding noisy observations in (b). Panel (c) shows prior samples and (d)-(f) approximate posterior samples after applying different local PF assimilation updates: (d) block PF, (e) local ETPF, (f) the method proposed in this article. In each of (c)-(f) 2 out of 40 samples are shown.

While PFs offer asymptotically consistent inference for general state space models, in practice they typically suffer from weight degeneracy in high-dimensional systems: after propagation only a single particle has non-negligible weight. For even simple linear-Gaussian models, PFs have been shown to require an ensemble size which scales exponentially with the number of observations to avoid degeneracy (Snyder et al., 2008; Bengtsson, Bickel and Li, 2008; Snyder, 2011).

Given the importance of localisation in scaling EnKF methods to large spatial systems, it is natural to consider whether PF methods can be localised to overcome weight-degeneracy issues (Snyder et al., 2008; Van Leeuwen, 2009). Rebeschini and van Handel (2015) analysed a simple local PF scheme in which the spatial domain is partitioned into disjoint blocks and independent PFs run for each block, with local particle weights computed from the observations within each block. The authors demonstrate this block PF algorithm can overcome the need to exponentially scale the ensemble size with dimension to prevent degeneracy.
However, as the variables in each block are resampled independently from those in other blocks, dependencies between blocks are ignored; this introduces a systematic bias that is difficult to control (Bertoli and Bishop, 2014).

This issue is illustrated for a two-dimensional Gaussian process model in Fig. 1. The smooth true state field, shown in Fig. 1a, is partially and noisily observed (Fig. 1b). While the samples in the prior ensemble (Fig. 1c) reflect the smoothness of the true state field, the posterior samples shown in Fig. 1d, computed using a block PF assimilation update, show spatial discontinuities at the block boundaries. Such discontinuities can cause numerical instabilities in the computation of spatial derivatives when integrating the SPDE model to forward propagate the particles.

The ensemble transform particle filter (ETPF) (Reich, 2013) uses an optimal transport (OT) map to linearly transform an ensemble instead of resampling. The ETPF can be localised by computing OT maps for each mesh node using local particle weights (Cheng and Reich, 2015); updating the particles using the resulting spatially varying maps significantly reduces the introduction of spatial discontinuities compared to independent resampling. This can be seen in the samples computed using the local ETPF shown in Fig. 1e, which show greater spatial regularity than the block PF samples in Fig. 1d, though they remain less smooth than the true state field.

The requirement in the local ETPF to solve an OT problem at every node can be computationally burdensome when the mesh size is large. Solving each OT problem has complexity $\tilde{O}(P^3)$ where $P$ is the ensemble size ($\tilde{O}$ indicates limiting complexity excluding polylogarithmic factors); although solvers can be run in parallel this still represents a large computational overhead.

In this article we propose an alternative smooth and computationally scalable local ETPF scheme.
A finite set of patches which cover the spatial domain is defined, each with a non-negative bump function supported on the patch. The set of bump functions is constrained to be a partition of unity (POU): the functions sum to unity at all points in the spatial domain. A single OT map is calculated for each spatial patch. The POU is then used to interpolate these local per-patch maps across the spatial domain, defining maps for all nodes in the spatial mesh. Through an appropriate choice of bump functions this scheme can maintain a prescribed level of smoothness in the transformed state fields while also significantly reducing the number of OT problems needing to be solved. Example posterior samples computed using the proposed scheme are shown in Fig. 1f. Here the POU is a set of smooth bump functions tiled in an 8 × 8 grid; compared to the local ETPF, in this example the number of OT problems solved was reduced from 16 384 to 64.

The remainder of the article is structured as follows. In Section 2 we briefly introduce our notation and some preliminaries on the filtering problem and ensemble methods, followed by a review of SPDE models and existing local filtering approaches in Section 3. The new method we propose is described in Section 4 and a numerical study comparing the approach to existing local ensemble filters is presented in Section 5, with a concluding discussion in Section 6.
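As an illustrative sketch of the interpolation idea (not the authors' implementation), the snippet below blends per-patch transform coefficient matrices across a one-dimensional periodic mesh using smooth bump weights normalised to form a partition of unity. The patch coefficient matrices are random row-stochastic placeholders standing in for per-patch OT maps; all sizes and the bump shape are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, J, P = 64, 4, 8                                     # mesh nodes, patches, particles
nodes = np.linspace(0.0, 1.0, M, endpoint=False)       # node coordinates on [0, 1)
centres = np.linspace(0.0, 1.0, J, endpoint=False)     # patch centres

# Smooth periodic bumps, normalised so they sum to one at every node (a POU).
raw = np.exp(4.0 * np.cos(2 * np.pi * (nodes[:, None] - centres[None, :])))
pou = raw / raw.sum(axis=1, keepdims=True)             # shape (M, J), rows sum to 1

# Placeholder per-patch P x P coefficient matrices with rows summing to one,
# standing in for the per-patch OT maps computed by the proposed scheme.
a_patch = rng.dirichlet(np.ones(P), size=(J, P))       # shape (J, P, P)
x_pred = rng.standard_normal((P, M))                   # predictive ensemble fields

# Interpolate the patch maps across the mesh and apply them node by node.
x_filt = np.empty_like(x_pred)
for m in range(M):
    a_m = np.tensordot(pou[m], a_patch, axes=1)        # POU-weighted blend, P x P
    x_filt[:, m] = a_m @ x_pred[:, m]
```

Because each blended matrix is a convex combination of row-stochastic matrices it is itself row-stochastic, and the smoothness of the node-to-node variation in the applied map is inherited from the smoothness of the bump functions.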
2. ENSEMBLE APPROACHES TO FILTERING

2.1 Notation
Random variables are denoted by sans-serif symbols, e.g. $\mathsf{x}$, and $\mathsf{x} \sim \mu$ indicates $\mathsf{x}$ has distribution $\mu$. The probability of an event $\mathsf{x}$ taking a value in a set $A$ is $\mathbb{P}(\mathsf{x} \in A)$ and the expected value of $\mathsf{x}$ is $\mathbb{E}[\mathsf{x}]$. The conditional probability of $\mathsf{x} \in A$ given $\mathsf{y} = y$ is denoted $\mathbb{P}(\mathsf{x} \in A \,|\, \mathsf{y} = y)$ and likewise the conditional expectation of $\mathsf{x}$ given $\mathsf{y} = y$ is $\mathbb{E}[\mathsf{x} \,|\, \mathsf{y} = y]$. A Gaussian distribution with mean $m$ and covariance $C$ is denoted $\mathcal{N}(m, C)$. The set of integers from $A$ to $B$ inclusive is $A{:}B$ and quantities sub- or superscripted by an integer range indicate an indexed set, e.g. $\phi_{1:M} = \{\phi_m\}_{m \in 1:M}$. The length-$D$ vector of ones is $\mathbf{1}_D$ and the $D \times D$ identity matrix $\mathbf{I}_D$, with the subscript omitted when unambiguous. The indicator function on a set $S$ is $\mathbb{1}_S$. The set of real numbers is $\mathbb{R}$, non-negative reals $\mathbb{R}_{\geq 0}$ and complex numbers $\mathbb{C}$. For $z \in \mathbb{C}$, $\Re(z)$ and $\Im(z)$ indicate its real and imaginary parts.

2.2 State space models

The class of models we aim to perform inference in is state-space models (SSMs). Let $\mathcal{X}$ be a vector space representing the state space of the system of interest. We assume observations of the system are available at a set of $T$ times, with the observations at each discrete time index $t \in 1{:}T$ belonging to a common vector space $\mathcal{Y}$. We denote the unknown system state at each time index as a random variable $\mathsf{x}_t \in \mathcal{X}$ and the corresponding observations as a random variable $\mathsf{y}_t \in \mathcal{Y}$. The modelled state dynamics are assumed to be Markovian and specified by a set of state-update operators $F_{1:T}$ such that

$$\mathsf{x}_1 = F_1(\mathsf{u}_1), \quad \mathsf{u}_1 \sim \mu_1; \qquad \mathsf{x}_t = F_t(\mathsf{x}_{t-1}, \mathsf{u}_t), \quad \mathsf{u}_t \sim \mu_t \quad \forall t \in 2{:}T, \tag{2.1}$$

with each $\mathsf{u}_t \in \mathcal{U}$ a state noise variable drawn from a distribution $\mu_t$, representing the stochasticity in the state initialisation and dynamics at each time step. The observations $\mathsf{y}_t$ at each time index $t$ are assumed to depend only on the current state $\mathsf{x}_t$ and are generated via a set of observation operators $G_{1:T}$,

$$\mathsf{y}_t = G_t(\mathsf{x}_t, \mathsf{v}_t), \quad \mathsf{v}_t \sim \nu_t \quad \forall t \in 1{:}T. \tag{2.2}$$

Any stochasticity in the observation process at each time index is introduced by the observation noise variable $\mathsf{v}_t \in \mathcal{V}$ with distribution $\nu_t$. In SSMs where the operators $F_{1:T}$ and $G_{1:T}$ are all linear and the distributions $\mu_{1:T}$ and $\nu_{1:T}$ are all Gaussian (the aforementioned linear-Gaussian case) the joint distribution on all states $\mathsf{x}_{1:T}$ and observations $\mathsf{y}_{1:T}$ is Gaussian and a KF can be used to perform exact inference. In this article we will focus on approximate inference methods for SSMs outside this class where exact inference is intractable.

We require that the conditional distributions on $\mathsf{y}_t$ given $\mathsf{x}_t$ have known densities $g_{1:T}$ with respect to a common dominating measure $\upsilon$ on $\mathcal{Y}$, i.e.

$$\mathbb{P}(\mathsf{y}_t \in \mathrm{d}y \,|\, \mathsf{x}_t = x_t) = g_t(y \,|\, x_t)\, \upsilon(\mathrm{d}y) \quad \forall t \in 1{:}T. \tag{2.3}$$

For the state updates we assume only that the state-update operators $F_t$ can be computed for any set of inputs and that we can generate samples from the state noise distributions $\mu_t$; the resulting state transition distributions will not necessarily have tractable densities.

Our main objects of interest from an inference perspective are the filtering distributions: the conditional distributions on the state at time index $t \in 1{:}T$ given the observations at time indices up to and including $t$. We will denote the filtering distribution at each time index $t$ as

$$\pi_t(\mathrm{d}x) = \mathbb{P}(\mathsf{x}_t \in \mathrm{d}x \,|\, \mathsf{y}_{1:t} = y_{1:t}). \tag{2.4}$$

The filtering problem is then the task of inferring the filtering distributions $\pi_{1:T}$ given a SSM for the system and a sequence of observations $y_{1:T}$.

A further concept that will be important for our discussion of inference methods is the predictive distribution on the state at the next time index $t+1$ given the observations up to the current time index $t$. We will denote the predictive distribution at time index $t$ as

$$\tilde{\pi}_{t+1}(\mathrm{d}x) = \mathbb{P}(\mathsf{x}_{t+1} \in \mathrm{d}x \,|\, \mathsf{y}_{1:t} = y_{1:t}). \tag{2.5}$$
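For concreteness, the generative structure of Eqs. (2.1) and (2.2) can be sketched for a hypothetical linear-Gaussian SSM; the operators, dimensions and noise scales below are illustrative assumptions, not part of the models studied in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_y, T = 4, 2, 10
A = 0.9 * np.eye(dim_x)            # assumed linear state-update operator
H = np.eye(dim_y, dim_x)           # assumed linear observation operator

def F(x, u):
    return A @ x + u               # Markovian state update, Eq. (2.1)

def G(x, v):
    return H @ x + v               # observation operator, Eq. (2.2)

x = rng.standard_normal(dim_x)     # initial state draw, x_1 = F_1(u_1) with F_1 = id
states, observations = [], []
for t in range(T):
    u = 0.1 * rng.standard_normal(dim_x)   # u_t ~ mu_t (state noise)
    v = 0.1 * rng.standard_normal(dim_y)   # v_t ~ nu_t (observation noise)
    x = F(x, u)
    states.append(x)
    observations.append(G(x, v))
```

The filtering problem is then to recover the distribution of each `states[t]` from `observations[:t+1]` alone.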
A key property for filtering algorithms is that the filtering distribution at any time index can be expressed recursively in terms of the distributions at the previous time indices. Generally this recursion is split into two steps, here termed the prediction and assimilation updates.

The prediction update transforms the filtering distribution $\pi_t$ to the predictive distribution $\tilde{\pi}_{t+1}$. This update corresponds to propagating the state distribution forward in time according to the modelled dynamics, with no new observations introduced. Denoting the Dirac measure at a point $x \in \mathcal{X}$ by $\delta_x$, the prediction update can be expressed as

$$\tilde{\pi}_{t+1}(\mathrm{d}x) = \int_{\mathcal{U}} \int_{\mathcal{X}} \delta_{F_{t+1}(x', u)}(\mathrm{d}x)\, \pi_t(\mathrm{d}x')\, \mu_{t+1}(\mathrm{d}u). \tag{2.6}$$

The assimilation update then relates the predictive distribution $\tilde{\pi}_{t+1}$ to the filtering distribution at the next time step $\pi_{t+1}$. It corresponds to an application of Bayes' theorem, with the predictive distribution forming the prior and the filtering distribution at the next time index the posterior after a new observed data point has been assimilated. The observation density $g_{t+1}$ defines the likelihood term, with the assimilation update then

$$\pi_{t+1}(\mathrm{d}x) = \frac{g_{t+1}(y_{t+1} \,|\, x)}{\int_{\mathcal{X}} g_{t+1}(y_{t+1} \,|\, x')\, \tilde{\pi}_{t+1}(\mathrm{d}x')}\, \tilde{\pi}_{t+1}(\mathrm{d}x). \tag{2.7}$$

The combination of prediction and assimilation updates together defines a map from the filtering distribution at time index $t$ to the distribution at $t+1$:

$$\cdots \longrightarrow \pi_t \xrightarrow{\text{prediction}} \tilde{\pi}_{t+1} \xrightarrow{\text{assimilation}} \pi_{t+1} \longrightarrow \cdots;$$

sequentially alternating prediction and assimilation updates is in theory therefore all that is needed to compute the filtering distributions at all time indices. In practice however, for most SSMs the integrals in Eqs. (2.6) and (2.7) will be intractable to solve exactly, necessitating some form of approximation.

A particularly common approximation is to use an ensemble of state particles to represent the filtering distribution at each time index.
Specifically, the filtering distribution $\pi_t$ at time index $t$ is represented by an empirical measure defined by placing point masses at the values of a set of $P$ state particles $\mathsf{x}_t^{1:P}$,

$$\pi_t(\mathrm{d}x) \approx \frac{1}{P} \sum_{p=1}^{P} \delta_{\mathsf{x}_t^p}(\mathrm{d}x). \tag{2.8}$$

A key advantage of using an ensemble representation of the filtering distribution is that a simple algorithm can be used to implement a prediction update consistent with Eq. (2.6). Specifically, if a set of $P$ independent state noise samples $\mathsf{u}_{t+1}^{1:P}$ are generated from $\mu_{t+1}$, then given particles $\mathsf{x}_t^{1:P}$ approximating $\pi_t$, a new set of $P$ particles can be computed as

$$\tilde{\mathsf{x}}_{t+1}^p = F_{t+1}(\mathsf{x}_t^p, \mathsf{u}_{t+1}^p) \quad \forall p \in 1{:}P. \tag{2.9}$$

This new particle ensemble can then be used to form an empirical measure approximation to the predictive distribution $\tilde{\pi}_{t+1}$,

$$\tilde{\pi}_{t+1}(\mathrm{d}x) \approx \frac{1}{P} \sum_{p=1}^{P} \delta_{\tilde{\mathsf{x}}_{t+1}^p}(\mathrm{d}x). \tag{2.10}$$

Although Eq. (2.9) specifies an approach for performing a prediction update, a method for approximating the assimilation update in Eq. (2.7) to account for the observed data is also required. One possibility is to require that the filtering ensemble $\mathsf{x}_t^{1:P}$ is formed as a linear combination of the predictive ensemble $\tilde{\mathsf{x}}_t^{1:P}$,

$$\mathsf{x}_t^p = \sum_{q=1}^{P} a_t^{p,q}\, \tilde{\mathsf{x}}_t^q, \tag{2.11}$$

where $a_t^{1:P,1:P} \in \mathbb{R}^{P \times P}$ are a set of coefficients describing the transformation. In general the coefficients may depend non-linearly on both the observation $y_t$ and predictive ensemble particles $\tilde{\mathsf{x}}_t^{1:P}$; however, the form of the update constrains the filtering ensemble $\mathsf{x}_t^{1:P}$ to lie in the linear subspace spanned by the predictive ensemble members. The class of ensemble filters using an assimilation update of the form in Eq. (2.11) was termed linear ensemble transform filters (LETFs) in Cheng and Reich (2015), and encompasses both ensemble Kalman and particle filtering methods, as will be discussed in the following subsections.
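The ensemble prediction update of Eq. (2.9) and the generic LETF transform of Eq. (2.11) reduce to a few array operations. In this minimal sketch the state-update operator and noise scale are hypothetical, and the transform coefficients are a uniform-averaging placeholder (a deliberately degenerate choice; real LETF methods compute the coefficients from the observation).

```python
import numpy as np

rng = np.random.default_rng(1)
P, dim_x = 40, 4

def F(x, u):                       # hypothetical state-update operator
    return 0.9 * x + u

# Prediction update, Eq. (2.9): propagate each particle with independent noise.
x_particles = rng.standard_normal((P, dim_x))       # ensemble approximating pi_t
u_samples = 0.1 * rng.standard_normal((P, dim_x))   # u_{t+1}^{1:P} ~ mu_{t+1}
x_pred = F(x_particles, u_samples)                  # ensemble for pi~_{t+1}, Eq. (2.10)

# LETF assimilation update, Eq. (2.11): a linear map of the predictive ensemble.
# Uniform coefficients collapse every particle onto the ensemble mean, which
# illustrates why the choice of a_t^{p,q} matters.
a = np.full((P, P), 1.0 / P)
x_filt = a @ x_pred
```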
In a linear-Gaussian SSM the predictive and filtering distributions are Gaussian at all time indices: $\pi_t = \mathcal{N}(m_t, C_t)$ and $\tilde{\pi}_t = \mathcal{N}(\tilde{m}_t, \tilde{C}_t)$ for all $t \in 1{:}T$, and so can be fully described by the mean and covariance parameters. The Kalman filter (KF) (Kalman, 1960) gives an efficient scheme for performing exact inference in linear-Gaussian SSMs by iteratively updating the mean and covariance parameters. For an observation operator and noise distribution

$$G_t(x, v) = H_t x + v, \quad \mathsf{v}_t \sim \mathcal{N}(0, R_t), \tag{2.12}$$

the KF assimilation update can be written

$$C_t = \tilde{C}_t - \tilde{C}_t H_t^\mathsf{T} (R_t + H_t \tilde{C}_t H_t^\mathsf{T})^{-1} H_t \tilde{C}_t, \tag{2.13a}$$

$$m_t = \tilde{m}_t + C_t H_t^\mathsf{T} R_t^{-1} (y_t - H_t \tilde{m}_t). \tag{2.13b}$$

Ensemble Kalman filter (EnKF) methods are a class of LETFs which use an assimilation update consistent with the KF updates in Eq. (2.13) for linear-Gaussian SSMs in the limit of an infinite ensemble, in effect replacing the predictive mean $\tilde{m}_t$ and covariance $\tilde{C}_t$ with ensemble estimates. The use of an ensemble representation rather than the full means and covariances used in the KF both gives a significant computational gain (by avoiding the need to store and perform operations on the full covariance matrices) while also allowing application of the approach to SSMs with non-linear state updates via the prediction update in Eq. (2.9).

The originally proposed EnKF method (Evensen, 1994; Burgers, van Leeuwen and Evensen, 1998) generates simulated observations from the observation model in Eq. (2.12) for each predictive ensemble member to form a Monte Carlo estimate of the $R_t + H_t \tilde{C}_t H_t^\mathsf{T}$ term in Eq. (2.13a). Although simple to implement, the introduction of artificial observation noise adds an additional source of variance which can be significant for small ensemble sizes.
This additional variance can be eliminated by the use of square-root EnKF variants (Anderson, 2001; Bishop, Etherton and Majumdar, 2001; Whitaker and Hamill, 2002) which typically give more stable and accurate filtering for small ensemble sizes.
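The KF assimilation update of Eq. (2.13) is short to implement directly; the sketch below uses the standard Kalman-gain form, with the gain $K = \tilde{C}_t H_t^\mathsf{T}(R_t + H_t \tilde{C}_t H_t^\mathsf{T})^{-1}$, which by a standard identity equals the $C_t H_t^\mathsf{T} R_t^{-1}$ factor appearing in Eq. (2.13b).

```python
import numpy as np

def kf_assimilation(m_pred, C_pred, H, R, y):
    """Kalman filter assimilation update of Eqs. (2.13a)-(2.13b)."""
    S = R + H @ C_pred @ H.T                 # innovation covariance
    K = C_pred @ H.T @ np.linalg.inv(S)      # Kalman gain (equals C_t H^T R^-1)
    C = C_pred - K @ H @ C_pred              # Eq. (2.13a)
    m = m_pred + K @ (y - H @ m_pred)        # Eq. (2.13b)
    return m, C

# With identity operators and unit covariances the update halves the prior
# covariance and moves the mean halfway towards the observation.
m, C = kf_assimilation(np.zeros(2), np.eye(2), np.eye(2), np.eye(2),
                       np.array([2.0, 0.0]))
```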
Of particular interest here is the ensemble transform Kalman filter (ETKF) proposed by Bishop, Etherton and Majumdar (2001), with this approach particularly efficient in the regime of interest where the ensemble size $P$ is much smaller than the state and observation dimensionalities. As we will use a localised variant of the ETKF as a baseline in the numerical experiments in Section 5, we outline the ETKF algorithm in Appendix A and show how it can be expressed in the form of the LETF assimilation update in Eq. (2.11).

Particle filtering offers an alternative LETF approach that gives consistent estimates of the filtering distributions as $P \to \infty$ in the non-Gaussian case. The PF assimilation update transforms the empirical approximation to the predictive distribution $\tilde{\pi}_t$ in Eq. (2.10) to an empirical approximation of the filtering distribution $\pi_t$ by attaching importance weights to the predictive ensemble,

$$\tilde{w}_t^p = g_t(y_t \,|\, \tilde{\mathsf{x}}_t^p), \quad w_t^p = \frac{\tilde{w}_t^p}{\sum_{q=1}^{P} \tilde{w}_t^q} \quad \forall p \in 1{:}P, \qquad \pi_t(\mathrm{d}x) \approx \sum_{p=1}^{P} w_t^p\, \delta_{\tilde{\mathsf{x}}_t^p}(\mathrm{d}x). \tag{2.14}$$

Directly iterating this importance weighting scheme, at each time index propagating the ensemble forward in time according to Eq. (2.9) and incrementally updating a set of (unnormalised) importance weights, gives an algorithm termed sequential importance sampling. While appealingly simple, sequential importance sampling requires an exponentially growing ensemble size as the number of observation times $T$ increases. The key additional step in particle filtering is to resample the particle ensemble according to the importance weights between prediction updates. That is, the filtering distribution ensemble at time index $t$ is defined in terms of the corresponding predictive distribution ensemble as

$$\mathsf{x}_t^p = \sum_{q=1}^{P} \mathsf{r}_t^{p,q}\, \tilde{\mathsf{x}}_t^q \quad \forall p \in 1{:}P, \tag{2.15}$$

where $\mathsf{r}_t^{1:P,1:P} \in \{0,1\}^{P \times P}$ are a set of binary random variables satisfying

$$\sum_{q=1}^{P} \mathsf{r}_t^{p,q} = 1, \qquad \mathbb{E}\bigg[\sum_{q=1}^{P} \mathsf{r}_t^{q,p} \,\Big|\, \mathsf{w}_t^p = w_t^p\bigg] = P w_t^p \quad \forall p \in 1{:}P. \tag{2.16}$$

This has the effect of removing particles with low weights from the ensemble and so ensures computational effort is concentrated on the most plausible particles. There are multiple algorithms available for generating random variables $\mathsf{r}_t^{1:P,1:P}$ satisfying Eq. (2.16); see for example the reviews in Douc and Cappé (2005), Hol, Schön and Gustafsson (2006) and Gerber, Chopin and Whiteley (2019). Distributed versions of particle filters have recently been proposed and analysed (Bolic, Djuric and Hong, 2005; Vergé et al., 2015; Whiteley, Lee and Heine, 2016; Sen and Thiery, 2019; Lee and Whiteley, 2015).

The iterated application of prediction updates according to Eq. (2.9) and resampling assimilation updates according to Eq. (2.15) together defines the bootstrap PF algorithm. Although simple, the bootstrap PF algorithm does not exploit all the information available at each time index; specifically, the prediction update in Eq. (2.9) does not take into account future observations. Alternative PF schemes can be employed which use prediction updates that take future observations into account. Although such schemes typically express the resulting particle weights in terms of the state transition densities, we describe in Appendix B how they can be implemented in SSMs with intractable transition densities.

While adjusting the prediction update can significantly improve performance compared to the bootstrap PF for a fixed ensemble size, when applied to systems with high state and observation dimensionalities these PF methods will still tend to suffer from weight degeneracy. In particular, even when using 'locally optimal' updates in a simple linear-Gaussian model, the resulting PF has been shown to generally require an ensemble size which grows exponentially with the dimension of the observation space to avoid weight degeneracy (Snyder et al., 2008; Snyder, Bengtsson and Morzfeld, 2015).
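One standard scheme for generating ancestor indices whose counts satisfy the unbiasedness condition of Eq. (2.16) is systematic resampling, which uses a single uniform draw with stratified offsets; this is one of the algorithms surveyed in the reviews cited above, sketched here in a minimal form.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Draw P ancestor indices; the counts of each index q are unbiased in the
    sense of Eq. (2.16), i.e. have expectation P * weights[q]."""
    P = len(weights)
    # One uniform draw shifted by stratified offsets gives P positions in [0, 1).
    positions = (rng.uniform() + np.arange(P)) / P
    return np.searchsorted(np.cumsum(weights), positions)

rng = np.random.default_rng(2)
w = np.array([0.1, 0.2, 0.3, 0.4])
idx = systematic_resample(w, rng)
# Setting r_t^{p,q} = 1{idx[p] == q} then gives binary variables with
# sum_q r^{p,q} = 1 for every p, as required by Eq. (2.16).
```

Relative to independent multinomial draws, systematic resampling also reduces the variance of the ancestor counts, which is why it is a common default in practice.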
Although typically the resampling variables in PF assimilation updates are generated independently of the predictive ensemble particle values given the weights, this is not required. Reich (2013) exploited this flexibility to propose an alternative particle filtering approach termed the ensemble transform particle filter (ETPF) which uses OT methods to compute a resampling scheme which minimises the expected distances between the particles before and after resampling.

A valid resampling scheme can be parametrised by a set of resampling probabilities $\rho_t^{1:P,1:P} \in [0,1]^{P \times P}$ with $\rho_t^{p,q} = \mathbb{P}(\mathsf{r}_t^{p,q} = 1 \,|\, \mathsf{w}_t^q = w_t^q)$ satisfying

$$\sum_{q=1}^{P} \rho_t^{p,q} = 1, \qquad \sum_{q=1}^{P} \rho_t^{q,p} = P w_t^p \quad \forall p \in 1{:}P. \tag{2.17}$$

A simple choice satisfying Eq. (2.17) is $\rho_t^{p,q} = w_t^q\ \forall p \in 1{:}P,\ q \in 1{:}P$, with this corresponding to the probabilities used in standard PF resampling schemes.

If we denote the set of resampling probabilities satisfying Eq. (2.17) for a given set of weights $w_t^{1:P}$ by $\mathcal{R}(w_t^{1:P})$ and the realisations of the predictive particles $\tilde{\mathsf{x}}_t^{1:P}$ at time index $t$ by $\tilde{x}_t^{1:P}$, Reich (2013) instead proposed to compute the resampling probabilities as the solution to the optimal transport problem

$$\rho_t^{1:P,1:P} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w_t^{1:P})} \sum_{p=1}^{P} \sum_{q=1}^{P} \varrho^{p,q}\, \|\tilde{x}_t^p - \tilde{x}_t^q\|^2. \tag{2.18}$$

The optimal transport problem can be posed as a linear program and efficiently solved using the network simplex algorithm (Orlin, 1997) with a computational complexity of order $\tilde{O}(P^3)$. While the resulting resampling probabilities could then be used to generate binary variables $\mathsf{r}_t^{1:P,1:P}$ and the standard PF resampling assimilation update in Eq. (2.15) applied, Reich (2013) instead proposes to use the resampling probabilities to directly update the particles as follows:

$$\mathsf{x}_t^p = \sum_{q=1}^{P} \rho_t^{p,q}\, \tilde{\mathsf{x}}_t^q \quad \forall p \in 1{:}P. \tag{2.19}$$

As $P \to \infty$ this assimilation update remains consistent as, due to properties of the optimal transport problem solution, the resampling probabilities tend to binary $\{0,1\}$ values (Reich, 2013, Theorem 1) and thus Eq. (2.19) becomes equivalent to updating using realisations of the binary random variables.

While the ETPF does not in itself help overcome the weight degeneracy issue, the deterministic and distance-minimising nature of the ETPF update naturally lends itself to spatial localisation approaches which can help overcome the poor scaling of PFs with dimensionality, as will be discussed in the following section.
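An ETPF-style update can be sketched without a linear-program solver by using an entropy-regularised (Sinkhorn) approximation to the exact OT problem of Eq. (2.18); this is our substitution for illustration only, as the ETPF itself solves the linear program exactly, e.g. with a network simplex solver. The regularisation strength and iteration count below are assumptions.

```python
import numpy as np

def etpf_sinkhorn_update(x_pred, w, eps=0.05, n_iter=100):
    """Approximate the OT problem of Eq. (2.18) by Sinkhorn iterations, then
    apply the deterministic transform of Eq. (2.19).
    x_pred: (P, d) array of predictive particles; w: normalised weights."""
    P = len(w)
    diff = x_pred[:, None, :] - x_pred[None, :, :]
    cost = np.einsum('pqd,pqd->pq', diff, diff)      # squared Euclidean distances
    K = np.exp(-cost / eps)                          # Gibbs kernel
    a = np.full(P, 1.0 / P)                          # uniform source marginals
    u, v = np.ones(P), np.ones(P)
    for _ in range(n_iter):                          # alternate marginal scalings
        u = a / (K @ v)
        v = w / (K.T @ u)
    plan = u[:, None] * K * v[None, :]               # coupling of marginals a and w
    rho = P * plan                                   # probabilities as in Eq. (2.17)
    return rho @ x_pred, rho                         # transformed particles, Eq. (2.19)
```

For small `eps` and well-separated particles with uniform weights the approximate plan is close to the identity, so the update leaves the ensemble essentially unchanged, matching the exact OT solution in that case.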
3. SPATIAL MODELS AND LOCAL ENSEMBLE FILTERS
Our particular focus in this article is on filtering in models of spatially-extended dynamical systems. Let $\mathcal{S}$ be a $D$-dimensional compact metric space equipped with a distance function $d : \mathcal{S} \times \mathcal{S} \to [0, \infty)$, representing the spatial domain the state of the modelled system is defined over, and $\mathcal{Z} \subseteq \mathbb{R}^N$ be the space the state variables at each spatial coordinate in $\mathcal{S}$ take values in. The state space of the system is then a function space $\mathcal{Z}^{\mathcal{S}}$ with the state at each time index $z_t : \mathcal{S} \to \mathcal{Z}$ a spatial field. The dynamics of the system will typically be modelled by a set of SPDEs, with $z_{1:T}$ then corresponding to a solution of these equations at $T$ times, given an initial state sampled from some distribution.

In practice in most problems we cannot solve the SPDE model exactly and instead use numerical integration schemes to generate approximate solutions. The states are assumed to be restricted to a function space with a fixed dimensional representation, with typically a state field $z_t : \mathcal{S} \to \mathcal{Z}$ represented as a linear combination of a finite set of $M$ basis functions $\beta_{1:M} : \mathcal{S} \to \mathbb{R}$,

$$z_t(s) = \sum_{m=1}^{M} x_{t,m}\, \beta_m(s) \quad \forall t \in 1{:}T,\ s \in \mathcal{S}, \tag{3.1}$$

with coefficients $x_{t,m} \in \mathcal{Z}\ \forall m \in 1{:}M$. For the purposes of inference we will therefore consider the state space to be a vector space $\mathcal{X} = \mathcal{Z}^M \subseteq \mathbb{R}^{MN}$ with state vectors consisting of the concatenation of the basis function coefficients.

Typically the basis functions will be defined by partitioning the spatial domain $\mathcal{S}$ into a mesh of polytopic spatial elements, for example triangles or quadrilaterals for $D = 2$. The vertices of these polytopes (and potentially additional points such as the midpoints of edges) define a collection of $M$ nodes with spatial locations $s_{1:M}$. Typically each node is associated with a basis function $\beta_m$ satisfying

$$\beta_m(s_m) = 1, \quad \beta_m(s_n) = 0 \quad \forall m \in 1{:}M,\ n \in 1{:}M,\ n \neq m, \tag{3.2}$$

which combined with Eq. (3.1) implies that $z_t(s_m) = x_{t,m}\ \forall m \in 1{:}M$.

We will assume that there are $L$ observations $y_{t,1:L}$ at every time point, each of dimension $K$, with the overall observation vector $y_t$ then a length $KL$ vector

$$y_t^\mathsf{T} = [\,y_{t,1}^\mathsf{T}\ y_{t,2}^\mathsf{T} \cdots y_{t,L}^\mathsf{T}\,] \quad \forall t \in 1{:}T. \tag{3.3}$$

We also assume that $\mathsf{y}_{t,l} \perp \mathsf{y}_{t,m} \,|\, \mathsf{x}_t\ \forall l \neq m$, i.e. the observations are conditionally independent given the state, and that each observation $\mathsf{y}_{t,l}$ depends only on the value of the state field $z_t$ at a fixed spatial location $s_l^o$. Together these two assumptions mean we can express the logarithm of the observation density as

$$\log g_t(y_t \,|\, x_t) = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \,|\, z_t(s_l^o)) \quad \forall t \in 1{:}T. \tag{3.4}$$

The combination of high state and observation space dimensionalities, and low feasible ensemble sizes, makes filtering in spatial SSMs a significant computational challenge. Fortunately SSMs of spatially extended systems often also exhibit a favourable decay of spatial correlations property which can be exploited to make approximate filtering more tractable by performing local updates to the particles.
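The additive per-location structure of Eq. (3.4) is what later allows observation terms to be weighted locally; as a minimal sketch, assuming (our illustrative choice) independent Gaussian observation noise with common standard deviation at each of the $L$ locations:

```python
import numpy as np

def log_obs_density(y, z_obs, sigma):
    """Evaluate log g_t(y_t | x_t) as the sum over observation locations in
    Eq. (3.4), assuming i.i.d. Gaussian noise with std sigma at each location.
    y, z_obs: arrays of shape (L, K), with z_obs the state field at s^o_{1:L}."""
    per_location = (-0.5 * ((y - z_obs) / sigma) ** 2
                    - np.log(sigma * np.sqrt(2.0 * np.pi))).sum(axis=1)
    # per_location[l] = log g_{t,l}(y_{t,l} | z_t(s^o_l)); the total is their sum.
    return per_location.sum()
```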
SCALABLE OPTIMAL-TRANSPORT BASED LOCAL PARTICLE FILTER If we assume the spatial field z t is defined as in Eq. (3.1) and x t is distributedaccording to the filtering distribution π t then the spatial correlation function c t,f : S × S → [0 ,
1] of a square integrable function f ∈ L is defined as c t,f ( s, s ) = E [ f ( z t ( s )) f ( z t ( s ))] − E [ f ( z t ( s ))] E [ f ( z t ( s ))]( E [( f ( z t ( s )) − E [ f ( z t ( s ))]) ] E [( f ( z t ( s )) − E [ f ( z t ( s ))]) ]) , (3.5)and the maximal spatial correlation function as ¯ c t ( s, s ) = sup f ∈L c t,f ( s, s ).The decay of spatial correlations property can then be stated as¯ c t ( s, s ) → d ( s, s ) → ∞ ∀ s ∈ S , s ∈ S , (3.6)which indicates that the dependence between state variables at distinct spatiallocations decays to zero as the distance between the locations increases.While it will typically not be possible to analytically verify Eq. (3.6) holds ex-actly, it has been empirically observed that models of spatially extended systemsin which the underlying dynamics are governed by local interactions between thestate variables exhibit an approximate decay of correlations property. In particu-lar weak long-range spatial correlations are a defining feature of spatio-temporalchaos (Hunt, Kostelich and Szunyogh, 2007) with many spatial models of interest,such as the atmospheric models used in nwp , exhibiting such behaviour. For ssm s exhibiting a decay of spatial correlations property, localising the letf assimilation update in Eq. (2.11), as proposed by Cheng and Reich (2015), canoffer significant performance gains compared to algorithms employing global up-dates. Rather than using a single set of transform coefficients a P , P t for theassimilation update, M sets of coefficients a P , P t, M are defined, one for each spatialmesh node location s M with the assimilation update then x pt,m = X q ∈ P a p,qt,m ~ x qt,m ∀ p ∈ P , ∀ m ∈ M . (3.7)As previously mentioned, the global letf update in Eq. (2.11) restricts the fil-tering ensemble members x P t to lie in the P dimensional linear subspace of X spanned by the predictive ensemble ~ x P t . 
When X is high-dimensional, as is generally the case in spatially extended models, this can be highly restrictive. The local letf update in Eq. (3.7) overcomes this restriction of the global letf update, with the filtering ensemble members now formed from local linear combinations of the predictive ensemble members and thus no longer constrained to a P-dimensional linear subspace. In particular, for models exhibiting a decay of correlations property, the state variables at each mesh node can be updated using coefficients computed using only the subset of observations which are within some localisation radius of the mesh node, while still retaining accuracy.

Local variants of the enkf (Houtekamer and Mitchell, 1998; Hamill, Whitaker and Snyder, 2001) are the prototypical examples of local letfs, and have been successfully used to perform filtering in large complex spatio-temporal models including operational ensemble nwp systems (Bowler et al., 2009). In Appendix A we briefly introduce a local variant of the etkf (Hunt, Kostelich and Szunyogh, 2007) which we use as a baseline in the numerical experiments.
It has been speculated that spatial localisation may be key to achieving useful results from pfs in large spatio-temporal models (Morzfeld, Hodyss and Snyder, 2017), based on its importance to the success of enkf methods in such models. In Farchi and Bocquet (2018) the authors systematically compare a wide range of localised pf and related algorithms which have been proposed in the literature, including localised variants of the etpf which we will discuss in the following subsection. Below we briefly introduce concepts from a local pf algorithm proposed by Penny and Miyoshi (2015) which are relevant to this article; we refer readers to Farchi and Bocquet (2018) for a much more extensive review.

For the standard pf, the logarithms of the unnormalised particle weights are

\[
\log \tilde{w}^p_t = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \mid \tilde{z}^p_t(s^o_l)) \quad \forall p \in 1{:}P, \tag{3.8}
\]

i.e. a summation of contributions due to the observations at all locations s^o_{1:L}. For a model exhibiting a decay of spatial correlations property we would expect that only a local subset of observations should have a strong influence on the distribution of the state variables at each mesh node. We can formalise this intuition into a concrete approach for computing local particle weights via the use of a localisation function ℓ_r : [0, ∞) → [0, 1] and localisation radius r satisfying

\[
\ell_r(0) = 1, \qquad \ell_r(d) = 0 \quad \forall d \geq r > 0. \tag{3.9}
\]

Local unnormalised weights for each mesh node can then be defined

\[
\log \tilde{w}^p_{t,m} = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \mid \tilde{z}^p_t(s^o_l))\, \ell_r(d(s_m, s^o_l)) \quad \forall p \in 1{:}P,\; m \in 1{:}M, \tag{3.10}
\]

and corresponding local normalised weights

\[
w^p_{t,m} = \frac{\tilde{w}^p_{t,m}}{\sum_{q=1}^{P} \tilde{w}^q_{t,m}} \quad \forall p \in 1{:}P,\; m \in 1{:}M. \tag{3.11}
\]

This formulation for the local particle weights has the desired property of using only a local subset of observations to update the state variables at each mesh node (with the terms in the sum zero when d(s_m, s^o_l) ≥ r).

Typical choices for the localisation function include the uniform or top-hat function ℓ_r(d) = 1_{[0,r]}(d) and the triangular function ℓ_r(d) = (1 − d/r) 1_{[0,r]}(d). In this article we exclusively use the smooth and compactly supported fifth-order piecewise rational function proposed by Gaspari and Cohn (1999), which, writing u = 2d/r, can be defined as

\[
\ell_r(d) = \begin{cases}
-\tfrac{1}{4} u^5 + \tfrac{1}{2} u^4 + \tfrac{5}{8} u^3 - \tfrac{5}{3} u^2 + 1 & 0 \leq d < \tfrac{r}{2}, \\[2pt]
\tfrac{1}{12} u^5 - \tfrac{1}{2} u^4 + \tfrac{5}{8} u^3 + \tfrac{5}{3} u^2 - 5u + 4 - \tfrac{2}{3} u^{-1} & \tfrac{r}{2} \leq d < r.
\end{cases} \tag{3.12}
\]

Penny and Miyoshi (2015) propose a local pf algorithm which uses local particle weights defined as in Eq. (3.11) for the specific case of a Gaussian observation density and uniform localisation function ℓ_r(d) = 1_{[0,r]}(d). The local weights are used to generate binary resampling variables r^{1:P,1:P}_{t,1:M} for each mesh node satisfying

\[
\sum_{q=1}^{P} r^{p,q}_{t,m} = 1, \qquad \mathbb{E}\bigg[ \sum_{q=1}^{P} r^{q,p}_{t,m} \,\Big|\, w^p_{t,m} \bigg] = P w^p_{t,m} \quad \forall p \in 1{:}P,\; m \in 1{:}M. \tag{3.13}
\]

Fig 2: Examples of local pf assimilation updates applied to the same spatial Gaussian process model as Figure 1. (a) Independent resampling at each node. (b) Coupled resampling and smoothing.

Generating the resampling variables for each mesh node independently means the state variables at adjacent mesh nodes for a post-resampling particle will typically originate from different prior particles, tending to lead to highly discontinuous and noisy spatial fields.
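As a concrete illustration, the per-node local weight computation of Eqs. (3.10) to (3.12) can be sketched in a few lines of numpy. This is a hedged sketch, not the authors' implementation: the Gaussian observation density, array layouts and function names are our own illustrative assumptions.

```python
import numpy as np

def gaspari_cohn(d, r):
    """Fifth-order piecewise rational localisation function of
    Gaspari & Cohn (1999), with support [0, r) and breakpoint at r / 2."""
    u = 2 * np.asarray(d, dtype=float) / r
    out = np.zeros_like(u)
    inner = u < 1
    outer = (u >= 1) & (u < 2)
    ui, uo = u[inner], u[outer]
    out[inner] = -0.25 * ui**5 + 0.5 * ui**4 + 0.625 * ui**3 - (5 / 3) * ui**2 + 1
    out[outer] = (uo**5 / 12 - 0.5 * uo**4 + 0.625 * uo**3 + (5 / 3) * uo**2
                  - 5 * uo + 4 - (2 / 3) / uo)
    return out

def local_log_weights(obs, pred_obs, obs_std, node_obs_dists, r):
    """Per-node unnormalised local log-weights as in Eq. (3.10), assuming a
    Gaussian observation density (up to an additive constant).

    obs:            (L,) observed values
    pred_obs:       (P, L) particle-predicted observations
    obs_std:        observation noise standard deviation
    node_obs_dists: (M, L) node-to-observation distances d(s_m, s^o_l)
    r:              localisation radius
    """
    log_g = -0.5 * ((obs[None, :] - pred_obs) / obs_std) ** 2  # (P, L)
    loc = gaspari_cohn(node_obs_dists, r)                      # (M, L)
    return log_g @ loc.T                                       # (P, M)

def normalise_weights(log_w):
    """Self-normalised per-node weights, Eq. (3.11), computed stably
    by subtracting the per-node maximum before exponentiating."""
    log_w = log_w - log_w.max(axis=0, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=0, keepdims=True)
```

Each column of the returned weight array then sums to one, giving the per-node normalised weights used by the local resampling schemes discussed below.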
An example of this is shown in Fig. 2a, which shows posterior state field samples generated using independent resampling at each mesh node with local weights, for the smooth spatial Gaussian process example encountered previously in Figure 1.

To ameliorate the issues associated with using independent resampling variables, Penny and Miyoshi (2015) propose to use a variant of the systematic resampling scheme (Douc and Cappé, 2005) often used as a variance reduction method in standard pf algorithms. A single random standard uniform variable is used to generate the resampling variables for all mesh nodes, resulting in per-node sets of resampling variables which each satisfy the marginal requirements in Eq. (3.13) while also being strongly correlated with the resampling variables for other nodes. The correlation introduced between the resampling variables when using this 'coupled resampling' scheme significantly reduces, but does not eliminate, the introduction of discontinuities into the resampled fields.

Rather than directly use these resampling variables in a local equivalent of the pf assimilation update in Eq. (2.15), Penny and Miyoshi (2015) instead propose to use a 'smoothed' update which uses a weighted average of the resampling variables at the current mesh node and all neighbouring nodes to update the particle values at each node. Fig. 2b shows examples of posterior state field samples computed using this smoothed assimilation update with the resampling variables generated using the coupled scheme. The previously observed discontinuities are now removed; however, the samples still remain significantly less smooth than the true state used to generate the observations (Fig. 1a) and the prior samples (Fig. 1c).
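A minimal sketch of the coupled resampling idea described above: a single shared uniform variate drives systematic resampling at every mesh node, so nodes with similar local weights receive near-identical resampling indices. The function names and array conventions here are our own, not Penny and Miyoshi's.

```python
import numpy as np

def systematic_resample(weights, u):
    """Systematic resampling indices for one node given a uniform u in [0, 1)."""
    P = weights.shape[0]
    positions = (u + np.arange(P)) / P
    cumsum = np.cumsum(weights)
    cumsum[-1] = 1.0  # guard against floating point rounding
    return np.searchsorted(cumsum, positions)

def coupled_local_resample(node_weights, rng=None):
    """Per-node systematic resampling driven by one shared uniform variate,
    coupling the resampling indices across nodes as in the scheme of
    Penny and Miyoshi (2015).

    node_weights: (P, M) normalised local weights
    returns:      (P, M) resampled particle indices, one column per node
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random()  # single uniform shared by all mesh nodes
    P, M = node_weights.shape
    idx = np.empty((P, M), dtype=int)
    for m in range(M):
        idx[:, m] = systematic_resample(node_weights[:, m], u)
    return idx
```

With uniform weights at every node this reproduces each particle exactly once per node, and nodes with nearby weight vectors yield strongly correlated index columns, which is precisely the coupling that reduces (but does not eliminate) resampling discontinuities.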
While techniques such as the smoothed and coupled resampling update used in Penny and Miyoshi (2015) can help reduce the introduction of spatial discontinuities, the resampling variables r^{1:P,1:P}_{t,1:M} are still calculated without taking into account the values of the predictive particles x̃^{1:P}_t other than via the local particle weights. The etpf assimilation update discussed in Section 2.9 explicitly tries to minimise a distance between the values of the transformed and pre-update particles and does not require introducing any randomness, and so is a natural candidate for a local pf with improved spatial smoothness properties.
Cheng and Reich (2015) proposed a localised variant of the etpf as a particular instance of their letf framework. Local particle weights are calculated as in Eqs. (3.10) and (3.11) for each mesh node, and a set of ot problems solved

\[
\rho^{1:P,1:P}_{t,m} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w^{1:P}_{t,m})} \sum_{p=1}^{P} \sum_{q=1}^{P} \varrho^{p,q} c^{p,q}_{t,m} \quad \forall m \in 1{:}M. \tag{3.14}
\]

Here the transport cost terms c^{1:P,1:P}_{t,1:M} are analogous to the inter-particle Euclidean distances used in Eq. (2.18). Rather than compute global transport costs based on distances between the state variable values at points across the full spatial domain, Cheng and Reich (2015) proposed to compute localised transport costs for each mesh node index m ∈ 1:M by integrating a distance between the state variable values against a localisation function centred at the mesh node location s_m and with support on points {s ∈ S : d(s, s_m) < r′}:

\[
c^{p,q}_{t,m} = \int_{\mathcal{S}} |\tilde{z}^p_t(s) - \tilde{z}^q_t(s)|^2\, \ell'_{r'}(d(s_m, s))\, \mathrm{d}s \quad \forall m \in 1{:}M,\; p \in 1{:}P,\; q \in 1{:}P. \tag{3.15}
\]

The localisation function ℓ′ and localisation radius r′ are denoted with primes here to emphasise they may be different from those used for the local weight computation. A more pragmatic definition of the localised transport costs is

\[
c^{p,q}_{t,m} = \sum_{n=1}^{M} |\tilde{x}^p_{t,n} - \tilde{x}^q_{t,n}|^2\, \ell'_{r'}(d(s_m, s_n)) \quad \forall m \in 1{:}M,\; p \in 1{:}P,\; q \in 1{:}P. \tag{3.16}
\]

In the common case of a rectilinear mesh with equal spacing between the nodes across the domain, the summation in Eq. (3.16) can be seen as a quadrature approximation to the integral in Eq. (3.15), up to a constant multiplier which does not affect the ot solutions.

If the localisation functions ℓ_r and ℓ′_{r′} are smooth, then both the local weights w^{1:P}_{t,1:M} and local transport costs c^{1:P,1:P}_{t,1:M} will vary smoothly as functions of the mesh node locations s_{1:M}. However, the solutions ρ^{1:P,1:P}_{t,1:M} to the linear programs defined by the local optimal transport problems in Eq. (3.14) will not vary smoothly with the mesh node locations s_{1:M} even if the local weights and transport costs do. This can be seen in the spatial Gaussian process example in Fig. 1, with the local etpf scheme used to compute the posterior samples illustrated in Fig. 1e. Although less apparent than the discontinuities in Fig. 1d, the fields in Fig. 1e still show spatial artefacts due to the non-smooth variation of the ot solutions.

One option to increase the smoothness of the update is to regularise the ot problems. In particular, the entropically regularised ot problems defined by

\[
\rho^{1:P,1:P}_{t,m} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w^{1:P}_{t,m})} \sum_{p=1}^{P} \sum_{q=1}^{P} \big( \varrho^{p,q} c^{p,q}_{t,m} + \lambda\, \varrho^{p,q} (\log \varrho^{p,q} - 1) \big), \tag{3.17}
\]

for some positive regularisation coefficient λ, have a unique optimal solution which varies smoothly as a function of the local weights w^{1:P}_{t,m} and transport costs c^{1:P,1:P}_{t,m} (Peyré and Cuturi, 2019), and tends to the solution of the non-regularised problem with the highest entropy as λ → 0. Further, the entropically regularised problems can be efficiently iteratively solved using Sinkhorn–Knopp iteration (Sinkhorn and Knopp, 1967; Cuturi, 2013), with complexity Õ(P²) per problem (Altschuler, Weed and Rigollet, 2017).

Fig 3: Examples of the Cheng and Reich (2015) local etpf assimilation update applied to the same spatial Gaussian process model as Figure 1, using entropically regularised ot maps for two values of the regularisation coefficient λ.

Figs. 3a and 3b show examples of posterior field samples computed using entropically regularised local etpf updates for two regularisation coefficients λ. It can be seen that introducing entropic regularisation increases the smoothness of the updated fields compared to the unregularised samples shown in Fig. 1e, and that the level of smoothness increases with the regularisation coefficient λ.

However, the increase in smoothness comes at the cost of a decreased diversity in the post-update particles as λ increases; in particular, for the larger regularisation coefficient shown in Fig. 3b, the four samples shown appear almost identical. This is a consequence of the assimilation updates in the local etpf linearly transforming by the ot maps as in Eq. (2.19), as opposed to resampling using binary random variables generated according to the resampling probabilities encoded by the ot maps. For the regularised ot problems in Eq. (3.17), as the regularisation coefficient λ → ∞ we have that ρ^{p,q}_{t,m} → w^q_{t,m} ∀ p ∈ 1:P, q ∈ 1:P, m ∈ 1:M.
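The Sinkhorn–Knopp fixed-point iteration for an entropically regularised problem of the form of Eq. (3.17) can be sketched as follows. This is a generic sketch rather than the authors' code: we assume the feasible set couples the uniform distribution over the P predictive particles with a vector of local weights w, and for brevity no log-domain stabilisation is used, so very small λ may underflow.

```python
import numpy as np

def sinkhorn_plan(cost, w, lam, n_iter=1000, tol=1e-10):
    """Sinkhorn-Knopp iteration for an entropically regularised ot problem:
    couple the uniform distribution (1/P, ..., 1/P) over predictive particles
    with target weights w, under cost matrix `cost` and regularisation lam.

    Returns a (P, P) plan rho with rho.sum(axis=1) ~= 1/P, rho.sum(axis=0) = w.
    """
    P = len(w)
    K = np.exp(-cost / lam)        # Gibbs kernel
    v = np.ones(P)                 # column scaling vector
    row_marg = np.full(P, 1.0 / P)
    for _ in range(n_iter):
        u = row_marg / (K @ v)     # enforce row marginals
        v_new = w / (K.T @ u)      # enforce column marginals
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return u[:, None] * K * v[None, :]
```

Under the row-marginal convention assumed here, transform coefficients with rows summing to one (as required of letf coefficients) would then be recovered by scaling the plan, e.g. a = P * rho.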
In the λ → ∞ limit, applying the local etpf assimilation update will tend to assign the weighted mean of the state variables at each mesh node to the post-update particles, and thus lead to a lack of diversity, or under-dispersion, in the post-update particles.

Acevedo, de Wiljes and Reich (2017) proposed a variant of the etpf which overcomes this under-dispersion issue when using entropically regularised ot maps. For each ot map a correction term is computed which ensures the empirical covariance of the updated particles matches the value that would be obtained using the standard pf update. Although this second-order accurate etpf scheme overcomes the under-dispersion issues when using entropically regularised ot maps, in localised variants the correction factors must be computed separately for the ot map associated with each mesh node, with the computation of each correction factor having a O(P³) complexity, potentially negating any gains from using a cheaper Sinkhorn solver for the regularised ot problems.

In the review article of Farchi and Bocquet (2018) a local etpf variant is proposed which computes ot maps for blocks of state variables rather than for each mesh node individually. Computing ot maps per-block rather than per-node can give significant computational savings in higher spatial dimensions: for instance, for three-dimensional domains, even using cubic blocks which cover just two mesh nodes in each dimension would reduce the number of ot problems needing to be solved by a factor of eight. In the numerical experiments in Farchi and Bocquet (2018) it was found however that the accuracy of the local block etpf method was highest when using blocks containing just one mesh node, i.e. corresponding to the local etpf scheme of Cheng and Reich (2015). As the state variables in each block are updated independently given the computed per-block ot maps, the poorer performance with larger blocks may be at least in part due to the spatially inhomogeneous error introduced at the block boundaries. Fig. 4 shows examples of posterior state field samples computed using this block etpf scheme for the earlier spatial Gaussian process example from Fig. 1 for two different block sizes; in both cases the boundaries of the blocks are clearly visible due to the discontinuities introduced into the fields.

Fig 4: Examples of the Farchi and Bocquet (2018) local block etpf assimilation update applied to the same Gaussian process model as Figure 1. (a) Block size 4 × 4. (b) Block size 8 × 8.

Fig 5: Example smooth partition of unity of a one-dimensional spatial domain S = [0, 1] with nine bump functions. The patches Ŝ_{1:9} covering S, on which the bump functions have support, are visualised below the plot axes.
4. SMOOTH AND SCALABLE LOCAL PARTICLE FILTERING
Grouping mesh nodes into spatially contiguous blocks and computing ot maps per-block rather than per-node, as proposed in Farchi and Bocquet (2018), is a natural way to reduce the computational cost of the local etpf assimilation update. However this approach further decreases the smoothness of the updated fields. Here we propose an alternative approach. Rather than computing ot maps for disjoint blocks defining a partition of the spatial domain S, we instead 'softly' partition S into patches with overlapping support, computing an ot map for each patch and smoothly interpolating between the ot maps associated with different patches in the overlaps. The construct we will use to both define the soft partitioning of the domain and the interpolation across it is a partition of unity.

Let Ŝ_{1:B} be a cover of the spatial domain S such that \(\bigcup_{b=1}^{B} \hat{\mathcal{S}}_b = \mathcal{S}\), with each Ŝ_b termed a patch. We associate a bump function φ_b : S → [0, 1] ∀ b ∈ 1:B with each patch, with φ_b(s) = 0 ∀ s ∉ Ŝ_b, b ∈ 1:B, and require that

\[
\sum_{b=1}^{B} \phi_b(s) = 1 \quad \forall s \in \mathcal{S}. \tag{4.1}
\]

The set of bump functions φ_{1:B} is then termed a partition of unity (pou) of S. pous are typically used to allow local constructions to be extended globally across a space, for instance an atlas of local charts of a manifold. Generally in such applications the bump functions will be required to be infinitely differentiable. Here we will not require such stringent differentiability conditions; we will informally refer to a smooth pou for the case where each bump function is of at least class C¹ with continuous derivatives, and to a hard pou for the case where the cover Ŝ_{1:B} is exact, i.e. the patches are pairwise disjoint, so that the bump functions are indicators on the patches, φ_b(s) = 1_{Ŝ_b}(s) ∀ b ∈ 1:B.

A useful method for constructing a pou with specified smoothness properties on an arbitrary spatial domain is via convolution. Specifically, if S_{1:B} is a partition of S and ϕ : [0, ∞) → [0, ∞) is a non-negative mollifier function satisfying

\[
\int_{\mathcal{S}} \varphi(d(s, s'))\, \mathrm{d}s' = 1 \quad \forall s \in \mathcal{S}, \tag{4.2}
\]

then we can define a pou φ_{1:B} on S by convolving ϕ with the indicators on S_{1:B},

\[
\phi_b(s) = \int_{\mathcal{S}} \mathbb{1}_{\mathcal{S}_b}(s')\, \varphi(d(s, s'))\, \mathrm{d}s' \quad \forall b \in 1{:}B,\; s \in \mathcal{S}. \tag{4.3}
\]

The bump functions will then inherit any smoothness properties of the mollifier. Figure 5 shows an example of a smooth pou constructed in this manner.

We can use a pou to define a local letf that uses transform coefficients computed for each patch rather than for each mesh node. We define the per-node transform coefficients a^{1:P,1:P}_{t,m} in Eq. (3.7) in terms of a set of per-patch coefficients â^{1:P,1:P}_{t,1:B} by

\[
a^{p,q}_{t,m} = \sum_{b=1}^{B} \hat{a}^{p,q}_{t,b}\, \phi_b(s_m) \quad \forall p \in 1{:}P,\; q \in 1{:}P,\; m \in 1{:}M. \tag{4.4}
\]

If the set of coefficients â^{1:P,1:P}_{t,b} for each patch index b ∈ 1:B corresponds to the elements of a left-stochastic matrix such that

\[
\hat{a}^{p,q}_{t,b} \in [0, 1] \;\; \forall p, q \in 1{:}P \quad \text{and} \quad \sum_{q=1}^{P} \hat{a}^{p,q}_{t,b} = 1 \;\; \forall p \in 1{:}P, \tag{4.5}
\]

then due to the non-negativity and sum-to-unity properties of the pou we have that a^{p,q}_{t,m} ∈ [0, 1] ∀ p, q ∈ 1:P, m ∈ 1:M and

\[
\sum_{q=1}^{P} a^{p,q}_{t,m} = \sum_{b=1}^{B} \sum_{q=1}^{P} \hat{a}^{p,q}_{t,b}\, \phi_b(s_m) = \sum_{b=1}^{B} \phi_b(s_m) = 1 \quad \forall p \in 1{:}P,\; m \in 1{:}M, \tag{4.6}
\]

and so a^{1:P,1:P}_{t,1:M} also correspond to the elements of left-stochastic matrices.
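The convolution construction of Eqs. (4.2) and (4.3) above is straightforward to implement on a regular periodic grid. The sketch below builds a smooth pou on a discretised one-dimensional domain [0, 1); the cosine-squared mollifier and equal-block partition are illustrative choices of ours, not prescribed by the text.

```python
import numpy as np

def smooth_pou_1d(n_grid, n_patch, kernel_width):
    """Smooth partition of unity on a periodic 1D grid over [0, 1), built by
    circularly convolving indicator functions of an exact partition with a
    discretely normalised compactly supported mollifier (Eqs. (4.2)-(4.3)).

    Returns a (n_patch, n_grid) array phi with phi.sum(axis=0) == 1 everywhere.
    """
    s = np.arange(n_grid) / n_grid
    d = np.minimum(s, 1 - s)  # circular distance to the origin
    # compactly supported C^1 bump; any smooth non-negative kernel would do
    moll = np.where(d < kernel_width,
                    np.cos(np.pi * d / (2 * kernel_width)) ** 2, 0.0)
    moll /= moll.sum()  # discrete analogue of the normalisation in Eq. (4.2)
    # exact partition of the grid into n_patch equal contiguous blocks
    indicators = np.zeros((n_patch, n_grid))
    block = n_grid // n_patch
    for b in range(n_patch):
        indicators[b, b * block:(b + 1) * block] = 1.0
    # circular convolution of each indicator with the mollifier via the FFT
    phi = np.real(np.fft.ifft(
        np.fft.fft(indicators, axis=1) * np.fft.fft(moll)[None, :], axis=1))
    return phi
```

Because the indicators sum to one at every grid point and the mollifier is normalised, the convolved bump functions automatically satisfy the sum-to-unity requirement of Eq. (4.1) up to floating point error.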
The resulting assimilation update, in terms of the values of the predictive and filtering distribution state field particles z̃^{1:P}_t and z^{1:P}_t at the mesh nodes s_{1:M}, is

\[
z^p_t(s_m) = \sum_{q=1}^{P} \sum_{b=1}^{B} \hat{a}^{p,q}_{t,b}\, \phi_b(s_m)\, \tilde{z}^q_t(s_m) \quad \forall p \in 1{:}P,\; m \in 1{:}M. \tag{4.7}
\]

For a smooth pou φ_{1:B}, the P × B spatial fields defined by the pointwise products φ_b(s) z̃^q_t(s) ∀ b ∈ 1:B, q ∈ 1:P will be smooth functions of the spatial coordinate s ∈ S if the predictive distribution state field particles z̃^{1:P}_t are themselves smooth. Each filtering distribution state field particle is then formed as a convex combination of these pointwise product fields, and so will also be smooth if the pou and predictive distribution state fields are. This is illustrated for a one-dimensional example in Fig. C.1 in Appendix C.

We now consider the specific application of the smooth local letf scheme to define a smooth localisation of the etpf, with in this case the coefficients â^{1:P,1:P}_{t,1:B} corresponding to ot maps computed for each patch. We first define the following notation for the distance between a subset of the spatial domain and a point:

\[
d(\mathcal{S}', s) = \inf_{s' \in \mathcal{S}'} d(s', s) \quad \forall s \in \mathcal{S},\; \mathcal{S}' \subseteq \mathcal{S}. \tag{4.8}
\]

Analogously to the per-node case in Eq. (3.10), the logarithms of the per-patch (unnormalised) particle weights can then be defined by

\[
\log \tilde{w}^p_{t,b} = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \mid \tilde{z}^p_t(s^o_l))\, \ell_r(d(\hat{\mathcal{S}}_b, s^o_l)) \quad \forall b \in 1{:}B,\; p \in 1{:}P. \tag{4.9}
\]

As d(Ŝ_b, s^o_l) = 0 if s^o_l ∈ Ŝ_b and ℓ_r(0) = 1, the weighted summation of log observation density terms in Eq. (4.9) gives weight one to all the terms corresponding to observations located within a patch. Observations outside a patch but within a distance of less than r are given weights between zero and one, and all observations more than a distance r from a patch are given zero weight.

Taking inspiration from the per-node case in Eq. (3.15), we could define per-patch transport costs directly in terms of the predictive state fields z̃^{1:P}_t,

\[
c^{p,q}_{t,b} = \int_{\mathcal{S}} |\tilde{z}^p_t(s) - \tilde{z}^q_t(s)|^2\, \ell'_{r'}(d(\hat{\mathcal{S}}_b, s))\, \mathrm{d}s \quad \forall b \in 1{:}B,\; p \in 1{:}P,\; q \in 1{:}P. \tag{4.10}
\]

Although this is defined independently of the spatial discretisation used, evaluating the integrals exactly will often be intractable. Assuming the common case of equally spaced mesh nodes, we propose to define per-patch transport costs as

\[
c^{p,q}_{t,b} = \sum_{m \in \mathcal{M}'} |\tilde{x}^p_{t,m} - \tilde{x}^q_{t,m}|^2\, \mathbb{1}_{\hat{\mathcal{S}}_b}(s_m) \quad \forall b \in 1{:}B,\; p \in 1{:}P,\; q \in 1{:}P, \tag{4.11}
\]

where M′ ⊆ 1:M corresponds to a spatial subsampling of the mesh nodes, e.g. corresponding to every Kth node in each spatial dimension, such that |M′| ≈ M/K^D. This spatial subsampling is motivated by the observation that if the state fields are spatially smooth then the values at immediately adjacent mesh nodes will typically be very similar, and there is therefore minimal loss of information in computing pointwise differences over a subset of, rather than all, mesh nodes. In addition to spatial subsampling, we also define the transport costs in Eq. (4.11) with the fixed choice of a uniform localisation function ℓ′_{r′} with r′ = 0. Empirically we found varying the choice of ℓ′_{r′} and r′ for the transport costs had little discernible effect on filtering performance.

Given per-patch weights w^{1:P}_{t,1:B} and transport costs c^{1:P,1:P}_{t,1:B} computed as described above, the per-patch linear transform coefficients â^{1:P,1:P}_{t,1:B} are then computed as solutions to the B corresponding ot problems

\[
\hat{a}^{1:P,1:P}_{t,b} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w^{1:P}_{t,b})} \sum_{p=1}^{P} \sum_{q=1}^{P} \varrho^{p,q} c^{p,q}_{t,b} \quad \forall b \in 1{:}B. \tag{4.12}
\]

We will subsequently refer to instances of this framework as smooth local ensemble transform particle filters (sletpfs).
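Putting the pieces together, the per-patch costs of Eq. (4.11) and the interpolated update of Eqs. (4.4) and (4.7) reduce to a pair of array contractions. This is an illustrative numpy sketch with our own naming conventions, assuming scalar state values at each mesh node.

```python
import numpy as np

def patch_transport_costs(particles, patch_indicators, subsample=4):
    """Per-patch transport costs as in Eq. (4.11): squared differences between
    particle values summed over a subsampled set of mesh nodes in each patch.

    particles:        (P, M) particle state values at the mesh nodes
    patch_indicators: (B, M) 0/1 membership of mesh nodes in each patch
    subsample:        keep every `subsample`-th mesh node (the set M')
    """
    x = particles[:, ::subsample]                   # (P, M')
    ind = patch_indicators[:, ::subsample]          # (B, M')
    sq_diff = (x[:, None, :] - x[None, :, :]) ** 2  # (P, P, M')
    return np.einsum('pqm,bm->bpq', sq_diff, ind)   # (B, P, P)

def smooth_local_update(particles, patch_plans, pou):
    """Interpolated per-patch assimilation update, Eqs. (4.4) and (4.7).

    particles:   (P, M) predictive particle values at the mesh nodes
    patch_plans: (B, P, P) per-patch transform coefficients (rows sum to one)
    pou:         (B, M) partition-of-unity bump function values at the nodes
    """
    # z^p_m = sum_q sum_b a-hat^{pq}_b phi_b(s_m) x-tilde^q_m
    return np.einsum('bpq,bm,qm->pm', patch_plans, pou, particles)
```

Because the pou values sum to one over patches at every node, feeding identity transform coefficients for every patch returns the predictive particles unchanged, a convenient sanity check on any implementation.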
To define a sletpf method for a given spatial ssm, we need to specify: a localisation function ℓ_r and radius r to compute the local weights; the set of mesh nodes M′ to use in computing the local transport costs; and a pou of the spatial domain.

For the sletpf local weight calculation in Eq. (4.9), the number of non-zero log observation density terms in the sum depends on both the localisation function and the size of the patches Ŝ_{1:B} used to define the pou. We can define an effective number of observations considered per patch as

\[
n_b = \sum_{l=1}^{L} \ell_r(d(\hat{\mathcal{S}}_b, s^o_l)) \quad \forall b \in 1{:}B. \tag{4.13}
\]

To avoid weight degeneracy we will typically need to control the n_{1:B} values through the choice of pou and localisation radius r, with the results of Rebeschini and van Handel (2015) suggesting n_{1:B} should roughly scale with log P. To approximately minimise max(n_{1:B}) for a given number of patches B, as a heuristic we suggest the patches should be chosen such that each contains a roughly equal number of observations. We discuss approaches for defining a partition of the spatial domain based on the observation locations to achieve this in Appendix D.

The choice of the number of patches B will typically be based on a tradeoff between several factors. Reducing computational cost favours using fewer patches, while the need to control max(n_{1:B}), and so the tendency for weight degeneracy, favours using a greater number of smaller patches. More complex is the dependency of the approximation error introduced by localisation. Using larger patches and a greater number of observations to update the state variables within each patch should reduce the approximation error for the updates within each patch. However, for a fixed r, using larger patches will also lead to greater disparities in the local weights calculated for each patch using Eq. (4.9), and so in the transform coefficients for adjacent patches. If using a hard pou this will typically lead to spatial discontinuities in the state particles across patch boundaries after applying the assimilation update, with the downstream effect of such discontinuities potentially negating any reduction in the approximation error within the patches. If using a smooth pou, the mesh nodes in the overlaps between patches will be updated using an interpolation of the transform coefficients for each of the patches, allowing smaller numbers of patches B to be used while still retaining smoothness. In the numerical experiments in Section 5 we show that using a smooth pou allows use of a number of patches B less than the number of mesh nodes M while still retaining accurate filtering distribution estimates.
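The effective-observation diagnostic in Eq. (4.13) is cheap to compute and can guide the choice of B and r. A minimal sketch follows; the triangular localisation function here is just one admissible choice, and the function names are our own.

```python
import numpy as np

def triangular(d, r):
    """Triangular localisation function l_r(d) = max(1 - d / r, 0)."""
    return np.clip(1.0 - np.asarray(d, dtype=float) / r, 0.0, None)

def effective_obs_per_patch(patch_obs_dists, loc_fn, r):
    """Effective number of observations per patch, Eq. (4.13):
    n_b = sum_l l_r(d(S_b, s^o_l)), given the (B, L) matrix of
    patch-to-observation distances d(S_b, s^o_l)."""
    return loc_fn(patch_obs_dists, r).sum(axis=1)
```

In line with the heuristic above, one would adjust the patch boundaries or the radius r until the n_b values are roughly equal across patches and scale like log P.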
The computational cost of the per-node local etpf assimilation updates proposed in Cheng and Reich (2015) is dominated by solving the M ot problems, leading to an overall Õ(MP³) scaling for the computational cost. For the sletpf, the number of ot problems is determined by the number of patches B, and so the cost of solving the ot problems is Õ(BP³). When B ≪ M the relative cost of the other computations in the overall assimilation update can become significant however. To derive a relationship for the overall scaling of the computational cost of the proposed sletpf we make the following assumptions.

Assumption 1. The maximum number of patches covering any mesh node is independent of and much smaller than B, and so the sum across all patches of the number of mesh nodes within each patch scales independently of B, i.e.

\[
\sum_{b=1}^{B} \sum_{m=1}^{M} \mathbb{1}_{\hat{\mathcal{S}}_b}(s_m) = O(M). \tag{4.14}
\]

For pous in which each patch overlaps with only a fixed number of 'neighbour' patches this will hold. If a uniform subsampling scheme is used to define the set of mesh node indices M′ used in computing the transport costs, then as a corollary we will also have that the total number of subsampled mesh nodes contained within all patches scales independently of B, i.e.

\[
\sum_{b=1}^{B} \sum_{m \in \mathcal{M}'} \mathbb{1}_{\hat{\mathcal{S}}_b}(s_m) = O(|\mathcal{M}'|). \tag{4.15}
\]

Assumption 2. The sum across all patches of the number of observations within a distance r of a patch is less than the number of mesh nodes M, i.e.

\[
\sum_{b=1}^{B} \sum_{l=1}^{L} \mathbb{1}_{[0,r]}(d(\hat{\mathcal{S}}_b, s^o_l)) < M. \tag{4.16}
\]

We will typically have that the number of observation locations L is small compared to the number of mesh nodes M, and the localisation radius r will be set to limit the number of observations considered per patch to a small subset of all observations, so this will usually hold.

Under Assumption 1, the cost of calculating the BP² transport costs using Eq. (4.11) is O(|M′|P²), as we need to evaluate the distances between the particle pairs at the |M′| subsampled mesh nodes, and from Eq. (4.15) only O(|M′|) terms in the summations for each of the particle pairs need to be evaluated.

The update to the particles in Eq. (4.7) for a general set of per-patch linear transform coefficients â^{1:P,1:P}_{t,1:B} will have a cost of O(MP²) under Assumption 1. However, for transform coefficients computed as the solution to discrete ot problems, at most 2P − 1 coefficients for each patch are non-zero (Reich, 2013). In this case the assimilation update in Eq. (4.7) therefore has a O(MP) cost.

Under Assumption 2, the computation using Eq. (4.9) of the BP per-patch weights will cost less than O(MP), as we need to evaluate LP < MP log observation density factors, and from Eq. (4.16) fewer than M terms in the summations for each of the P particles will be non-zero and so need to be evaluated.

Under these assumptions, the overall computational cost of each sletpf assimilation step therefore scales as Õ(BP³ + |M′|P² + MP).
5. NUMERICAL EXPERIMENTS
To evaluate the performance of the proposed approach, we perform filtering in two spde test models, comparing our proposed scheme to the local etpf (Cheng and Reich, 2015) and local etkf (Hunt, Kostelich and Szunyogh, 2007). Rather than measure performance in terms of the distance between the estimated mean of the filtering distribution and the true state used to generate the observations, as is common in similar work, e.g. Farchi and Bocquet (2018), here we measure the errors in the ensemble estimates of expectations with respect to the true filtering distributions. This gives more directly interpretable results, as a filter which exactly computes the expectations would give a zero error, unlike the difference between the mean and true state, which will in general be non-zero even if the mean is computed exactly. We are also able to assess the accuracy of a broader range of features of the filtering distribution estimates, for example their quantification of uncertainty via measures of dispersion.

To allow such comparisons, we require models for which ground truth values for expectations with respect to the filtering distributions can be computed. To this end our first model is based on a linear-Gaussian spde model for which the true filtering distribution can be exactly computed using a Kalman filter. For the second model, we use a more challenging spde model with non-linear state dynamics. Here our 'ground truth' for the filtering distributions is based on long runs of a Markov chain Monte Carlo (mcmc) method.

For both models we consider several metrics for evaluating the accuracy of the different local ensemble filters' estimates of the filtering distributions. The first two metrics we consider are the time- and space-averaged root mean squared errors (rmses) of the ensemble estimates of the filtering distribution means and standard deviations, reflecting respectively the filters' accuracy in estimating the central tendencies and dispersions of the filtering distributions. Denote by μ_{1:T} and σ_{1:T} the true means and standard deviations under π_{1:T},

\[
\mu_t = \int_{\mathcal{X}} x\, \pi_t(\mathrm{d}x) \quad \text{and} \quad \sigma_t^2 = \int_{\mathcal{X}} (x - \mu_t) \odot (x - \mu_t)\, \pi_t(\mathrm{d}x) \quad \forall t \in 1{:}T, \tag{5.1}
\]

and by μ̂_{1:T} and σ̂_{1:T} the corresponding means and standard deviations under the empirical ensemble estimates of the filtering distributions \(\hat{\pi}_t(\mathrm{d}x) = \frac{1}{P} \sum_{p=1}^{P} \delta_{x^p_t}(\mathrm{d}x)\),

\[
\hat{\mu}_t = \int_{\mathcal{X}} x\, \hat{\pi}_t(\mathrm{d}x) \quad \text{and} \quad \hat{\sigma}_t^2 = \int_{\mathcal{X}} (x - \hat{\mu}_t) \odot (x - \hat{\mu}_t)\, \hat{\pi}_t(\mathrm{d}x) \quad \forall t \in 1{:}T. \tag{5.2}
\]

We then define the time- and space-averaged rmses of the estimates as

\[
\mathrm{rmse}(\hat{\mu}_{1:T}, \mu_{1:T}) = \sqrt{\frac{1}{TM} \sum_{t=1}^{T} \sum_{m=1}^{M} (\hat{\mu}_{t,m} - \mu_{t,m})^2}, \tag{5.3}
\]

and

\[
\mathrm{rmse}(\hat{\sigma}_{1:T}, \sigma_{1:T}) = \sqrt{\frac{1}{TM} \sum_{t=1}^{T} \sum_{m=1}^{M} (\hat{\sigma}_{t,m} - \sigma_{t,m})^2}. \tag{5.4}
\]

In both cases lower values of these metrics are better, with a value of zero indicating the mean or standard deviation estimates exactly match the true values.
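The mean and standard-deviation metrics of Eqs. (5.1) to (5.4) amount to simple array reductions over a (T, P, M) array of particles. A short sketch with illustrative names of our own:

```python
import numpy as np

def ensemble_mean_std(ensembles):
    """Ensemble estimates of filtering means and standard deviations,
    Eq. (5.2), for a (T, P, M) array of particles over time."""
    return ensembles.mean(axis=1), ensembles.std(axis=1)

def time_space_rmse(estimate, truth):
    """Time- and space-averaged rmse, as in Eqs. (5.3)-(5.4):
    sqrt of the mean over t and m of the squared errors, for (T, M) arrays."""
    return np.sqrt(np.mean((estimate - truth) ** 2))
```

An exact filter would give a zero value of `time_space_rmse` for both the mean and standard deviation estimates, which is the interpretability property motivating these metrics.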
The two metrics discussed so far concentrate on the accuracy of estimates oflocal properties of the states, but do not reflect more global properties such aswhether the ensemble filters correctly estimate the smoothness of the state fields.As a proxy measure for smoothness we use the expectation under the true filteringdistributions of a finite-difference approximation of the integral across space ofthe magnitude of the spatial gradients of the state fields: γ t = Z X X m ∈ M | x t,m − x t,m ⊕ | π t (d x ) ≈ E (cid:20)Z S | ∂ s z t ( s ) | d s (cid:21) ∀ t ∈ T , (5.5)with m ⊕ m + 1 mod M , with one-dimensional periodic spatialdomains being used in both models considered. Defining the estimates ˆ γ T of these smoothness coefficients under the ensemble filtering distributions equivalently asˆ γ t = Z X X m ∈ M | x t,m − x t,m ⊕ | ˆ π t (d x ) ∀ t ∈ T , (5.6)we then define an overall measure of the accuracy of the ensemble estimates’spatial smoothness as the following time-averaged rmsermse (ˆ γ T , γ T ) = s T X t ∈ T (ˆ γ t − γ t ) . (5.7) As our first example we use a linear-Gaussian ssm derived from a spde modelfor turbulent signals by Majda and Harlim (2012, Ch. 5). The governing spde isd ζ ( s, τ ) = (cid:16) θ ∂ s + θ ∂ s − θ (cid:17) ζ ( s, τ )d τ + ( κ (cid:126) s d η )( s, τ ) , (5.8)where ζ : S × R ≥ → R is a real-valued space-time varying process, θ ∈ R ≥ is a non-negative parameter controlling dissipation due to diffusion, θ ∈ R isa parameter governing the direction and magnitude of the constant advection, θ ∈ R ≥ is a non-negative parameter controlling dissipation due to damping, κ : S → R ≥ is a spatial kernel function which governs the spatial smoothness of theadditive noise in the dynamics and η : S × R ≥ → R is a space-time varying noiseprocess. The spatial domain is a one-dimensional interval S = [0 ,
1) with periodic boundary conditions and a distance function d(s, s′) = min(|s − s′|, 1 − |s − s′|), and ⊛_s represents circular convolution in space.

We use a spectral approach to define basis function expansions of the processes ζ and η and kernel κ using M = 512 mesh nodes. This results in a linear system of stochastic differential equations (sdes) for which the Gaussian state transition and stationary distributions can be solved exactly. We assume a linear-Gaussian observation model with the state noisily observed at L = 64 locations and T = 200 time points. Full details of the model are given in Appendix F.1.

The resulting stochastic turbulence (st) ssm is linear-Gaussian. We consider two cases in our experiments: inference in the original linear-Gaussian ssm, and inference in a transformed ssm using this linear-Gaussian model as the base ssm. The specific definition we use for a transformed ssm is given in Appendix E; in brief, by applying a non-linear transformation to the state of a linear-Gaussian ssm we can construct a ssm with non-Gaussian filtering distributions for which we can tractably estimate expectations with respect to the true filtering distributions with arbitrary accuracy. Here the non-linear transformation is chosen as T(x) = sinh⁻¹(θx) (with sinh⁻¹ evaluated elementwise on vector arguments). As |sinh⁻¹(θx)| ≈ log(2θ|x|) for |θx| ≫ 1, the transformation tends to induce bimodality in the marginals of the transformed filtering distributions.

For both the transformed and linear-Gaussian cases we use the model parameter settings given in Table F.1, and use simulated noisy observations y₁:T generated from the models using a shared set of Gaussian state and observation noise variable samples generated using a pseudo-random number generator. The resulting observation sequence y₁:T (which is the same for both models) is shown in Fig. F.1 along with the corresponding true state sequences used to generate the observations under the linear-Gaussian and transformed ssms respectively.

We compare the performance of the local etkf, local etpf and our proposed sletpf algorithm in estimating the filtering distributions for both the linear-Gaussian and transformed ssms. The mesh size M = 512 and number of observations L = 64 are sufficiently large that non-local pf methods suffer from weight degeneracy even with large ensembles, for both the linear-Gaussian and transformed ssms. While non-local variants of the enkf do not suffer from weight degeneracy and can give relatively accurate filtering distribution estimates for sufficiently large ensemble sizes, these sizes are still much larger than those typically used in, for example, nwp ensemble filter systems. For an ensemble size P = 10 we found the local etkf significantly outperformed the non-local etkf on all the metrics we consider in both the linear-Gaussian and transformed ssms.

Table 1: Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the linear-Gaussian st model.
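As an illustration, the smoothness coefficient of Eqs. (5.5)–(5.6) and the time-averaged rmse of Eq. (5.7) can be estimated from an equally weighted ensemble in a few lines of NumPy. This is a minimal sketch under our own naming, not the implementation used for the experiments in the paper.

```python
import numpy as np

def smoothness_coefficient(ensemble):
    """Monte Carlo estimate of the smoothness coefficient in Eqs. (5.5)/(5.6):
    the ensemble average of the summed absolute differences between values at
    neighbouring mesh nodes, with periodic wrap-around (m + 1 mod M)."""
    # ensemble: array of shape (P, M), particle state fields at one time index
    diffs = np.abs(ensemble - np.roll(ensemble, -1, axis=1))
    return diffs.sum(axis=1).mean()

def smoothness_rmse(ensembles, true_gammas):
    """Time-averaged RMSE of Eq. (5.7) between ensemble smoothness estimates
    and reference smoothness coefficients."""
    gamma_hats = np.array([smoothness_coefficient(x) for x in ensembles])
    return np.sqrt(np.mean((gamma_hats - np.asarray(true_gammas)) ** 2))
```

For a single 'sawtooth' particle on four mesh nodes, every neighbouring pair differs by one, so the coefficient is four.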
We used P = 10 for all methods in the experiments here. For the local etkf we use the smooth compactly supported Gaspari and Cohn localisation function ℓ_r defined in Eq. (3.12). We conducted a grid search over a fixed set of localisation radii r, for each r performing five independent runs of the local etkf and recording the performance on the three metrics described in Section 5.1. The results for the linear-Gaussian st model are summarised in Table 1 and for the transformed st model in Table 2. For each metric the minimum, median and maximum values recorded across the five runs are shown, for the value of r which gave the minimum median value of that particular metric. The results for all r values are shown in Fig. G.1 in the Appendix.

The performance on all metrics for both models was relatively stable across the multiple runs. Unsurprisingly the local etkf performs significantly better on the linear-Gaussian st model than on the transformed st model. While for the linear-Gaussian st model the optimal r for each metric are relatively similar, for the transformed st model the optimal r differs significantly across the metrics, meaning any choice of r will incur a performance penalty on some metrics.

Table 2:
Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the transformed st model.

For our proposed sletpf framework we need to choose a pou. Here we construct the pous by (discretely) convolving a mollifier function with the indicator functions on a partition of the spatial domain. As the observations are located on a regular grid, we partition the domain into B equally sized intervals S_b = [(b − 1)/B, b/B) ∀ b ∈ 1:B. For the mollifier function we use a normalised variant of the compactly supported Gaspari and Cohn localisation function ℓ_r in Eq. (3.12), the bump functions then being defined as

$$\phi_b(s_n) = \frac{\sum_{m \in M} \mathbb{1}_{S_b}(s_m) \, \ell_w \circ d(s_n, s_m)}{\sum_{m \in M} \ell_w \circ d(s_n, s_m)} \quad \forall n \in M,\; b \in 1{:}B, \tag{5.9}$$

with w a kernel width parameter determining how many mesh nodes the effective smoothing kernel being discretely convolved with the indicators has support on. For w = M⁻¹ the kernel is non-zero at only one mesh node and no smoothing is applied, corresponding to a hard partition of the space. For w > M⁻¹, the amount of smoothing and overlap between the patches increases with w.

For the experiments with the st models we performed runs with sletpfs using pous with five different numbers of patches B ∈ {32, 64, 128, 256, 512} and four different kernel widths w ∈ {512⁻¹, 256⁻¹, 128⁻¹, 64⁻¹}. We used a Gaspari and Cohn localisation function for the local weight calculation, for each (B, w) pair performing five independent runs for all localisation radii r in the search grid for which median(n_B) lay within a fixed range. The per-node local etpf of Cheng and Reich (2015) can be considered a particular instance of the sletpf framework, here corresponding to the runs with a pou with B = 512 patches and w = 512⁻¹. The set of mesh nodes M̃ used to calculate the per-patch transport costs as in Eq. (4.11) was constructed by subsampling 1:M by a factor min(4, p_n) with p_n = M(B⁻¹ + 2w). The results of the sletpf runs for each of the (B, w, r) parameter combinations are shown for the linear-Gaussian st model in Fig.
6 and for the transformed st model in Fig. 7. In each figure, the rows of plots correspond to different kernel widths w and the three columns to different metrics. On each plot the value of the relevant metric on the vertical axis is plotted against the median number of effective observations per patch on the horizontal axis (we plot against median(n_B) rather than r as it is more directly comparable across different values of B and w). The median values across the five independent runs for each of the numbers of patches B are shown by the coloured curves (see colour key at top of figures) and the surrounding lighter coloured regions indicate the minimum to maximum range of values recorded across the runs (in many cases the across-run variation is too small to be visible). For each metric the best value achieved by the local etkf (as given in Tables 1 and 2) is indicated by the black horizontal dashed line.
Fig 6: Comparison of accuracy of sletpf estimates on linear-Gaussian st ssm.

Considering first the linear-Gaussian st model results, we see that across all parameter combinations and metrics the local etpf methods are outperformed by the best local etkf results. This is as expected, as the linear-Gaussian assumptions made by the etkf are correct in this case, and by better exploiting this model structure we expect the local etkf to outperform the more generic local etpf. Concentrating on the results for filters with hard pous without smoothing in the first row (w = 512⁻¹), we see that the filters with B = M = 512 patches in the pou, corresponding to the Cheng and Reich (2015) scheme, outperform filters using pous with smaller numbers of patches across virtually all median(n_B) values and metrics. This tallies with the findings of Farchi and Bocquet (2018), who found that for an equivalent 'block'-based local etpf scheme the best performance was always achieved with blocks of size one. Considering specifically the rmse(μ̂₁:T, μ₁:T) metric, we see that as the number of patches B decreases the value of the metric across all values of median(n_B) monotonically increases (corresponding to poorer performance). The behaviours for the rmse(σ̂₁:T, σ₁:T) and rmse(γ̂₁:T, γ₁:T) metrics are more complex. For the smoothness coefficient we see that the accuracy of the filter estimates initially decreases as the number of patches is decreased from B = 512 to B = 256 and B = 128. The accuracy of the smoothness estimates however then increases on decreasing the number of patches further to B = 64, and increases again on decreasing the number of patches to B = 32. We believe this non-monotonic relationship between the accuracy of the smoothness estimates and the number of patches in the pou may be explained by the spatial averaging in the computation of the smoothness coefficient: while using fewer larger patches in the pou would be expected to introduce stronger discontinuities at the patch boundaries, due to larger differences in the local weights assigned to each patch, there is a competing effect that as fewer patches are used there are fewer boundaries, and so the spatially averaged error becomes lower despite the individual discontinuities at each patch boundary being larger.

Now comparing the results as the kernel width w, and so the smoothness of the pou, is increased, there are two main trends apparent. Most prominently, the variation in performance across different numbers of patches B decreases as the smoothness of the pou increases, with many of the curves overlapping over much of their ranges for w = 128⁻¹ and w = 64⁻¹, while the optimal performance on each metric remains similar. This suggests using a smooth pou allows a smaller number of patches to be used (and thus a lower computational cost of the assimilation update) while maintaining performance, contrary to what was observed in the hard pou case where using fewer patches always decreased performance.

A second, less obvious effect is that as the kernel width w is increased the lower limit for median(n_B) is increased (similarly, using fewer larger patches also increases the lower limit for median(n_B)). This is the reason for the curves starting at higher median(n_B) as the kernel width increases, corresponding to the values achieved with the smallest r tested. For w = 64⁻¹ we see that all the curves start to the right of the point at which the optimal performance is reached for the other smaller w.
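The partition-of-unity construction used here (indicator functions of equal intervals discretely convolved with a normalised mollifier, Eq. (5.9)) can be sketched as follows. The helper names, and the use of the standard piecewise-rational Gaspari and Cohn (1999) function as the kernel, are our assumptions for illustration, not code from the paper.

```python
import numpy as np

def circular_distance(s1, s2):
    """Distance on the periodic unit interval, d(s, s') = min(|s-s'|, 1-|s-s'|)."""
    d = np.abs(s1 - s2)
    return np.minimum(d, 1.0 - d)

def gaspari_cohn(d, r):
    """Standard compactly supported Gaspari-Cohn correlation function,
    non-zero for distances d < 2r (piecewise fifth-order rational form)."""
    a = d / r
    f = np.zeros_like(a)
    m1 = a <= 1.0
    m2 = (a > 1.0) & (a <= 2.0)
    f[m1] = 1 - (5/3)*a[m1]**2 + (5/8)*a[m1]**3 + (1/2)*a[m1]**4 - (1/4)*a[m1]**5
    f[m2] = (4 - 5*a[m2] + (5/3)*a[m2]**2 + (5/8)*a[m2]**3
             - (1/2)*a[m2]**4 + (1/12)*a[m2]**5 - 2 / (3 * a[m2]))
    return f

def bump_functions(mesh, n_patches, w):
    """Partition-of-unity bump functions as in Eq. (5.9): indicators of B
    equal intervals discretely convolved with a kernel of width w, then
    normalised so the bumps sum to one at every mesh node."""
    M = mesh.shape[0]
    K = gaspari_cohn(circular_distance(mesh[:, None], mesh[None, :]), w)
    phi = np.zeros((n_patches, M))
    for b in range(n_patches):
        ind = ((mesh >= b / n_patches) & (mesh < (b + 1) / n_patches)).astype(float)
        phi[b] = (K * ind[None, :]).sum(axis=1)
    phi /= phi.sum(axis=0, keepdims=True)  # normalise: bumps sum to unity
    return phi
```

Because the interval indicators partition the mesh, the normalisation makes the bump values at each node a convex combination over patches, which is exactly what the interpolation of per-patch updates requires.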
This suggests there is a drawback to making w too large, as it limits how far the number of observations per patch, and so the tendency to local weight degeneracy, can be controlled; in this case it seems the best tradeoff is reached for either w = 256⁻¹ or w = 128⁻¹. Interestingly we also see that the accuracy of the smoothness and standard deviation estimates is poorer for w = 64⁻¹ compared to w = 128⁻¹ even when comparing at the same median(n_B). This could be due to the greater overlap between the patches in this case, with the averaging of the particle values at the overlaps potentially acting to artificially oversmooth and reduce variation in the particles, again suggesting that the appropriate level of smoothing is a tradeoff between several factors.

The results on the transformed st model shown in Fig. 7 show for the most part very similar trends as for the linear-Gaussian st model. The most significant difference is the relative performance of the local etkf and local etpf methods, with in this case the local etpf approaches outperforming the best local etkf results across all parameter values for the rmse(σ̂₁:T, σ₁:T) metric and across a majority of the parameter values tested for the rmse(μ̂₁:T, μ₁:T) metric. As the only difference between these two models is the non-Gaussianity in the filtering distributions introduced by the transformation, these results support the earlier claims that pf-based methods, such as the local etpf and the sletpf proposed in this article, are more robust to non-Gaussianity than enkf methods such as the local etkf.
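The local weight degeneracy referred to above is conventionally monitored via the effective sample size of the self-normalised importance weights within each patch. A generic sketch (our own diagnostic helper, not code from the paper):

```python
import numpy as np

def effective_sample_size(log_weights):
    """Effective sample size 1 / sum(w_i^2) of self-normalised importance
    weights, computed stably from unnormalised log-weights. Values much
    smaller than the ensemble size P indicate weight degeneracy."""
    w = np.exp(log_weights - np.max(log_weights))  # subtract max for stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```

Equal weights give an effective sample size equal to the ensemble size, while a single dominant weight drives it towards one; restricting the number of effective observations per patch keeps the per-patch log-weights from becoming too spread out.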
Interestingly the relative performance loss in the local etkf on introducing non-Gaussianity seems to be most severe in the rmse(σ̂₁:T, σ₁:T) metric, suggesting that uncertainty estimates provided by local etkf methods on non-linear-Gaussian models should be treated with particular caution.

Fig 7: Comparison of accuracy of sletpf estimates on transformed st ssm.

In addition to the accuracy of the filter estimates, we are also interested in the relative computational cost of the different methods. Fig. 8 shows the values of the performance metrics achieved by the different sletpf configurations tested, against the corresponding assimilation time (i.e. total filtering time minus the time taken to integrate the model dynamics in the prediction updates) for the transformed st ssm. Each of the three plots corresponds to one of the performance metrics, the vertical coordinate of each marker indicates the minimum value of the metric achieved across all localisation radii r for a particular (B, w) combination, with the marker colour indicating the number of patches B and the marker symbol the kernel width w. The horizontal coordinate of each marker indicates the median assimilation time across the five independent runs for the corresponding (B, w, r) values. For the pous with B = 512 patches, only the case without smoothing (w = 512⁻¹), corresponding to the Cheng and Reich (2015) local etpf, is shown, with the smoother pous in this case substantially increasing the assimilation times without any gain in accuracy.

Fig 8: Accuracy versus run time for sletpf in transformed st ssm.

As would be expected due to the lower number of ot problems that need to be solved, in general the assimilation time decreases as the number of patches B in the pou is decreased for a fixed smoothing kernel width w. Note however that the assimilation time increases with the smoothing kernel width w (primarily due to the increased number of non-zero terms in the summation in Eq. (4.7)), which results for example in the assimilation time for the scheme with B = 256 and w = 64⁻¹ being slightly larger than for the runs under the Cheng and Reich (2015) settings of B = 512 and w = 512⁻¹. Although there is therefore a tradeoff in assimilation time between decreasing the number of patches B and increasing the kernel width w, we still find that there are combinations of (B, w) values which maintain the accuracy of the Cheng and Reich (2015) scheme while giving substantial reductions in assimilation time. In particular the runs with B = 128 and w = 256⁻¹ and with B = 128 and w = 128⁻¹ achieve nearly identical accuracies on the mean and standard deviation rmse metrics as B = 512 and w = 512⁻¹ (and a substantially improved smoothness coefficient rmse) while reducing the assimilation time by slightly more than a factor of two. At the cost of around a 10% increase in the mean and standard deviation rmses, a more substantial reduction in the assimilation time by a factor of four can be achieved by using a pou with B = 64 patches, for a range of the kernel widths w tested.

Although the absolute values of the assimilation times in Fig. 8 are dependent on the computational environment used to run the experiments, the relative timings should still be informative as the same sletpf implementation was used to run all the experiments. We purposefully did not include the local etkf runs on the plots, as any differences in the assimilation times for the local etkf versus sletpf approaches are likely to be as much due to the particulars of the software implementations and hardware used as to any fundamental differences in performance. In particular, more time was spent optimising the implementation of the sletpf algorithm than our local etkf implementation, so the relative timings are likely to unfairly favour the sletpf runs.
The computational complexity for the local etkf however is O(MP), which is the same as for the local etpf scheme of Cheng and Reich (2015), so it would be expected that there are regimes in which the sletpf assimilation updates (with complexity Õ(BP + |M̃|P + MP)) will have a computational advantage over the local etkf updates.

As our second test model we consider a stochastic variant of a fourth-order non-linear partial differential equation (pde), often termed the Kuramoto–Sivashinsky (ks) equation, which has been independently derived as a model of various physical phenomena (Kuramoto and Tsuzuki, 1976; Sivashinsky, 1977) and studied as an example of a relatively simple pde system exhibiting spatio-temporal chaos (Hyman and Nicolaenko, 1986). On a spatial domain S = [0,
1) with a distance function d(s, s′) = min(|s − s′|, 1 − |s − s′|) and periodic boundary conditions, the deterministic dynamics of the ks pde model can be described by

$$\partial_\tau \zeta(s, \tau) = -\left( \frac{\partial_s^2}{\theta_1^2} + \frac{\partial_s^4}{\theta_1^4} \right) \zeta(s, \tau) - \frac{\partial_s}{2\theta_1} \left( \zeta(s, \tau)^2 \right), \tag{5.10}$$

where θ₁ is a length-scale parameter, with the system dynamics becoming chaotic for large values of θ₁ (Hyman and Nicolaenko, 1986).

As our focus is on filtering in models with stochastic dynamics, we use a related spde model on the same spatial domain, described by

$$\mathrm{d}\zeta(s, \tau) = \left( -\left( \frac{\partial_s^2}{\theta_1^2} + \frac{\partial_s^4}{\theta_1^4} + \theta_2 \right) \zeta(s, \tau) - \frac{\partial_s}{2\theta_1} \left( \zeta(s, \tau)^2 \right) \right) \mathrm{d}\tau + (\kappa \circledast_s \mathrm{d}\eta)(s, \tau), \tag{5.11}$$

where ζ : S × T → R is a real-valued space-time varying process, θ₁ ∈ R≥0 is the non-negative length-scale parameter, θ₂ ∈ R≥0 is a non-negative parameter controlling dissipation due to damping, κ : S → R≥0 is a spatial kernel function and η : S × T → R is a space-time varying noise process. In addition to the introduction of the additive noise process, we also introduce a linear damping component controlled in magnitude by θ₂. This is motivated by our empirical observation in simulations that the stochastic system can become unstable when numerically integrated over long time periods without additional damping.

We use a similar spectral approach to define the spatial basis function expansions of the state and noise processes ζ and η and kernel κ as for the st model, again using M = 512 mesh nodes. Full details of the discretisation used are given in Appendix F.2 and the values of all the parameters used in Table F.2.
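As a concrete illustration of the spectral treatment of the deterministic ks dynamics of Eq. (5.10), a minimal pseudo-spectral integrator with a semi-implicit Euler step can be written as below. The parameter value, step size and scheme are illustrative choices of ours, and not necessarily those used for the experiments detailed in Appendix F.2.

```python
import numpy as np

def ks_step(z, dt, theta, k):
    """One semi-implicit Euler step for the deterministic KS dynamics: the
    stiff linear term is treated implicitly in Fourier space and the
    non-linear advection term explicitly."""
    lin = k**2 / theta**2 - k**4 / theta**4        # Fourier symbol of linear operator
    z_hat = np.fft.rfft(z)
    nonlin_hat = -(1j * k / (2 * theta)) * np.fft.rfft(z**2)
    z_hat = (z_hat + dt * nonlin_hat) / (1 - dt * lin)
    return np.fft.irfft(z_hat, n=z.shape[0])

M = 512                                      # mesh nodes on S = [0, 1)
k = 2 * np.pi * np.fft.rfftfreq(M, d=1 / M)  # wavenumbers on unit periodic domain
theta = 32 * np.pi                           # illustrative length-scale parameter
rng = np.random.default_rng(0)
z = 0.1 * rng.standard_normal(M)
for _ in range(100):
    z = ks_step(z, dt=1e-4, theta=theta, k=k)
```

Treating the linear term implicitly keeps the step stable despite the stiff fourth-order dissipation at high wavenumbers; the stochastic variant of Eq. (5.11) would additionally add the damping term θ₂ to the linear symbol and a spatially correlated noise increment at each step.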
This results in a coupled non-linear system of sdes which governs the evolution of the state Fourier coefficients; unlike the linear-Gaussian dynamics of the st model, these sdes do not have an analytic solution and so need to be numerically integrated. We assume the state is observed at T = 200 time points, with S = 10 integrator steps performed between each observation time; the resulting state transition operators F₁:T are non-linear and do not admit closed-form transition densities.

We consider ssms in which these ks state dynamics are noisily observed via both linear and non-linear observation operators. In both cases the state is assumed to be observed at L = 64 equispaced points in the spatial domain, with direct observations of the state values at these points in the linear case, and via a hyperbolic tangent (tanh) function in the non-linear case. The simulated state and observation sequences used in the experiments for both the linearly and non-linearly observed ks ssms are shown in Fig. F.2 (with the same simulated state sequence being used in both cases, only the generated observations differing). Compared to the st model, the ks model exhibits more complex and unpredictable state dynamics and thus can be seen as a more challenging test case for the local ensemble filtering methods.

Both the linearly and non-linearly observed ks ssms have non-Gaussian filtering distributions which cannot be exactly inferred, unlike in the linear-Gaussian st model. We therefore used a mcmc method to generate proxy ground-truths for the filtering distributions, constructing Markov chains which leave invariant the joint distribution across the M = 512 dimensional state vectors at all T = 200 time points given the observed sequence, i.e. P(x₁:T ∈ dx₁:T | y₁:T = y₁:T), with the filtering distributions corresponding to marginals of this joint smoothing distribution. Due to the large overall state dimension MT ≈ 10⁵ we use a gradient-based Hamiltonian Monte Carlo algorithm (Duane et al., 1987) to generate the chains. For each of the linearly and non-linearly observed cases we ran five parallel chains of 200 samples each, with each chain using an independently seeded pseudo-random number generator. Details of the set up used for the mcmc runs are given in Appendix H. The 'ground-truth' values for the filtering distribution means μ₁:T, standard deviations σ₁:T and smoothness coefficients γ₁:T were estimated using the combination of the final 100 x₁:T samples of each of the five chains for each ssm, i.e. a total of 500 samples per ssm.

As for the st ssms, we used P = 10 particles for all the local ensemble filter runs on the ks ssms. For the local etkf we performed a grid search equivalent to that for the st models, performing five independent runs for each localisation radius r in the search grid for both the linearly and non-linearly observed ks ssms. The results are summarised in Tables 3 and 4, with plots of the full grid search results shown in Fig. G.1 in Appendix G.

Table 3: Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the linearly observed ks model.

Table 4: Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the non-linearly observed ks model.

Although the absolute values of the rmse metrics in Table 3 are higher than for the local etkf runs on the linear-Gaussian st model, given that the non-linear state dynamics in the ks model mean the filtering distributions are no longer constrained to remain Gaussian, the local etkf performs remarkably well on the linearly observed ks ssm, recovering relatively accurate estimates of the filtering distribution means, standard deviations and smoothness coefficients.
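A single transition of the Hamiltonian Monte Carlo algorithm used to generate the ground-truth chains has the familiar leapfrog-plus-Metropolis structure sketched below. This is a generic textbook sketch with a unit mass matrix and our own naming; it omits the tuning details given in Appendix H.

```python
import numpy as np

def hmc_step(q, log_prob, grad_log_prob, step_size, n_steps, rng):
    """One Hamiltonian Monte Carlo transition (Duane et al., 1987): resample
    a Gaussian momentum, integrate Hamiltonian dynamics with the leapfrog
    scheme, then apply a Metropolis accept/reject correction."""
    p = rng.standard_normal(q.shape)
    current_h = -log_prob(q) + 0.5 * p @ p
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(q_new)   # initial momentum half-step
    for _ in range(n_steps - 1):
        q_new += step_size * p_new                    # full position step
        p_new += step_size * grad_log_prob(q_new)     # full momentum step
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(q_new)   # final momentum half-step
    new_h = -log_prob(q_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < current_h - new_h:     # Metropolis correction
        return q_new
    return q
```

The gradient information lets the chain make large coherent moves in the roughly 10⁵-dimensional joint smoothing distribution, where random-walk proposals would mix far too slowly.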
This is concordant with the widespread empirical success of local enkf approaches even when applied to models with non-linear state dynamics (Evensen, 2009), but also suggests that the filtering distributions in this case may have remained close to Gaussian despite the non-linear dynamics.

Swapping the linear observations for a non-linear observation operator however can be seen to have a detrimental effect on the accuracy of the local etkf estimates of the filtering distributions. The optimal values achieved for each of the three rmse metrics shown for the non-linearly observed case in Table 4 show significant increases over the corresponding figures for the linearly observed case in Table 3, with the errors in the standard deviation estimates showing the largest increase. This highlights that although local enkf methods are robust to some degree of non-linearity in the dynamics or observation model, performance is still sensitive to strong departures from Gaussianity.

The effect of the non-Gaussianity induced by the non-linear observation operator can also be seen by comparing rank histograms (i.e. the ranks of the true state values within the ensemble across all time and spatial indices) for single runs of the local etkf on the linearly and non-linearly observed ks ssms, as shown in Fig. 9a and Fig. 9b respectively. The localisation radius r was set to the value found in the grid searches to give the lowest mean estimate rmse. For a well calibrated ensemble the rank histograms should be close to uniform (indicated by the dashed black line on the plots). While for the linearly observed case the minor departures from uniformity can be plausibly attributed to sampling noise, the histogram for the non-linearly observed case has a clear 'double-humped' non-uniform shape, suggesting the ensemble estimates of the filter distributions have greater kurtosis than the true filtering distributions.

Fig 9: Rank histograms for single local etkf runs on ks ssms: (a) linearly observed; (b) non-linearly observed.

For the sletpf runs we use the same method to construct the pous as described in the preceding section for the st model experiments. We again considered pous with B ∈ {32, 64, 128, 256, 512} patches and smoothing kernel widths of w ∈ {512⁻¹, 256⁻¹, 128⁻¹, 64⁻¹}. For each (B, w) pair we tested all localisation radii r in the search grid for which median(n_B) was in the range [1, 5] for the linearly observed ks ssm and in the range [2,
6] for the non-linearly observed ks ssm. For each (B, w, r) parameter triple tested, we performed five independent filtering runs, with the median values recorded for the three metrics shown by the coloured curves in Fig. 10 for the linearly observed ssm and in Fig. 11 for the non-linearly observed ssm, along with the best values achieved by the local etkf on each metric shown by the dashed horizontal lines. The plots in Figs. 10 and 11 have the same format as Figs. 6 and 7 for the st model experiments.

Fig 10: Comparison of accuracy of sletpf estimates on linearly observed ks ssm.

From the linearly observed ks ssm results in Fig. 10 we see that the sletpf was outperformed across all parameter settings and metrics by the best local etkf results. This reinforces the point that local enkf methods are a strongly performant approach and can often be the best choice even in models with non-linear dynamics, where the Gaussianity assumptions are not valid, due to their robust performance when using small ensemble sizes. A further advantage of local enkf methods over local pfs is that they naturally maintain smoothness properties of the state field particles, as evidenced by the low smoothness coefficient errors achieved by the local etkf across all model configurations. Local pf type approaches, such as the sletpf algorithm proposed here, should generally therefore be considered as a fallback solution for cases where local enkf methods are known, or at least suspected, to give poor accuracy.

Considering the performance of the sletpf on the linearly observed ks ssm for different (B, w) parameter settings, we see similar trends as observed in the st model experiments, though with some differences in the details. The performances on the mean and standard deviation rmse metrics for pous with different numbers of patches B at a fixed smoothing kernel width w show less variation than seen in the st model experiments. Even for the hard pou case without smoothing (w = 512⁻¹, top row of Fig. 10), only the runs with a pou with B = 32 patches show a noticeable drop in mean and standard deviation estimate accuracies across most median(n_B) values, and even in the B = 32 case the relative drops in accuracy are still quite minor. The most obvious effect of increasing the kernel width w in this model is therefore the improved accuracy of the smoothness coefficient estimates for larger w values. This suggests that in the ks model, although using a smoother pou does reduce the introduction of artificial discontinuities into the state field particles, these discontinuities have less of a negative effect on filtering performance than for the st model, perhaps due to a stronger diffusive smoothing element in the model dynamics.

The results for the non-linearly observed ks ssm in Fig. 11 show similar relative performances for the different sletpf configurations as for the linearly observed case, with a general increase in the absolute rmse values across the board.
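The rank-histogram diagnostic used in Fig. 9, and again for the sletpf ensembles below, pools the rank of each true state value within the ensemble across all time and spatial indices. A minimal sketch, with our own naming and array layout:

```python
import numpy as np

def rank_histogram(truths, ensembles):
    """Frequencies of the ranks of true state values within the ensemble,
    pooled across all time and spatial indices; a well calibrated ensemble
    gives approximately uniform rank frequencies."""
    # truths: (T, M) true state fields; ensembles: (T, P, M) particle values.
    T, P, M = ensembles.shape
    ranks = (ensembles < truths[:, None, :]).sum(axis=1)   # each rank in 0..P
    counts = np.bincount(ranks.ravel(), minlength=P + 1)
    return counts / counts.sum()
```

Peaks at the extreme ranks indicate under-dispersion of the ensemble relative to the truth, while a double-humped interior shape indicates a mismatch in the shape of the estimated distribution, as seen for the local etkf on the non-linearly observed ks ssm.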
The corresponding increases in the rmse values for optimal tunings of the local etkf are however significantly larger, meaning that for this model the sletpf approaches show a minor improvement in the accuracy of the mean estimates compared to the local etkf across virtually all configurations, and perform comparably in terms of the accuracy of the standard deviation estimates, with slightly better performance for some configurations and slightly poorer for others. Again the smoothness of the pou used does not seem to have a strong effect on performance in terms of the mean and standard deviation estimates here, with the main change as the smoothing kernel width w is increased being the improved accuracy of the smoothness coefficient estimates, corresponding to improved reproduction of the smoothness of the fields under the true filtering distributions.

Fig 11: Comparison of accuracy of sletpf estimates on non-linearly observed ks ssm.

As for the local etkf ensemble estimates of the ks ssm filtering distributions, we can also use rank histograms for the sletpf ensembles as an alternative check of the calibration of the filtering distribution estimates. The rank histogram for an ensemble generated for the non-linearly observed ks ssm by a sletpf with a pou with B = 512 patches and kernel width w = 512⁻¹ (i.e. corresponding to the per-node local etpf) is shown in Fig. 12a, and for an ensemble generated by a sletpf with a pou with B = 64 patches and kernel width w = 128⁻¹ in Fig. 12b. In both cases the localisation radius r was set to the value from the grid search giving the lowest mean estimate rmse. Compared to the corresponding rank histogram for the local etkf in Fig. 9b, the histograms for both sletpf configurations are much closer to uniform. The peaks at the extreme ranks in both histograms are characteristic of the ensembles underestimating the dispersion of the filtering distribution in the tails, with this discrepancy appearing to be stronger for the sletpf using fewer patches here.

Fig 12: Rank histograms for single sletpf runs on non-linearly observed ks ssm: (a) B = 512, w = 512⁻¹; (b) B = 64, w = 128⁻¹.

As in Fig. 8 for the transformed st model runs, it is instructive to also compare the relative computational cost of the different sletpf configurations versus their performance on the three filtering accuracy metrics. Fig. 13 shows the time taken to perform the assimilation updates (horizontal axes) versus the value recorded for each of the three rmse metrics (vertical axes), for each of the (B, w) pou configurations. The markers show the median values across the five runs for the localisation radius r which achieved the minimum value for that particular metric for the (B, w) values in question.

Fig 13: Accuracy versus run time for sletpf in non-linearly observed ks ssm.
Due to the smaller drop-off in filtering accuracy for pous with fewer patches B compared to the st models, here we see we are able to achieve even larger improvements in computational efficiency compared to the local etpf scheme of Cheng and Reich (2015) (corresponding to B = 512, w = 512⁻¹) while retaining the same filtering accuracy. In particular the sletpfs with B = 64 patches in the pou are able to achieve the same mean estimate accuracy, a slight improvement in the accuracy of the standard deviation estimates, and a substantial improvement in the accuracy of the smoothness coefficient estimates, while having an assimilation time that is around a quarter of that of the sletpf which computes separate ot transport maps for each mesh node (B = 512). Further, in this case the filtering accuracy is largely unaffected by the choice of smoothing kernel width w, other than an improvement in the smoothness coefficient estimates for larger w values. At the cost of a slight increase in all three rmses, the sletpfs with pous with B = 32 patches give a further approximate factor-two decrease in assimilation time, leading to around an eight-fold decrease in assimilation time compared to the per-node local etpf.
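Rank histograms like those in Fig. 12 can be computed with a short routine: for each observed time and location, record the rank of the true value within the ensemble, here breaking ties randomly (one common convention; the exact tie treatment used for the figures is not specified here). A minimal numpy sketch:

```python
import numpy as np

def rank_histogram(truth, ensembles, rng=None):
    """Compute rank-histogram counts for an ensemble filter run.

    truth:     array of shape (T, M) of true state values.
    ensembles: array of shape (T, P, M) of ensemble member values.
    Returns counts over the P + 1 possible ranks of the truth.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, P, M = ensembles.shape
    # Rank = number of ensemble members strictly below the truth,
    # with ties broken randomly to avoid spurious histogram peaks.
    below = (ensembles < truth[:, None, :]).sum(axis=1)
    ties = (ensembles == truth[:, None, :]).sum(axis=1)
    ranks = below + rng.integers(0, ties + 1)
    return np.bincount(ranks.ravel(), minlength=P + 1)
```

A near-uniform histogram indicates a calibrated ensemble, while the U-shape noted above (peaks at the extreme ranks) indicates underdispersion in the tails.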
6. DISCUSSION
In this article we have proposed a new scheme for constructing local particle filters for state inference in spde models of spatially-extended dynamical systems. The local etpf (Cheng and Reich, 2015), although having the desirable property of improved robustness to non-Gaussianity in the filtering distributions compared to local enkf approaches, has two key shortcomings: (i) the state fields produced by the assimilation step fail to maintain the smoothness properties of the predictive ensemble members, potentially leading to numerical instabilities when used to filter spde models, and (ii) as an ot problem must be solved for every node in the spatial mesh, the assimilation updates can be costly for dense meshes.

Our approach to solving both issues is to softly partition the spatial domain using a partition of unity: a finite set of non-negative bump functions which tile the domain and sum to unity at all points. By computing an ot map for the patch of the spatial domain associated with each bump function and then using the bump functions to smoothly interpolate these maps across the domain, we are able to smoothly combine different regions of the predictive ensemble particles. As well as allowing the smoothness of the spatial fields to be maintained during the assimilation step, the proposed approach reduces the $\tilde{\mathcal{O}}(MP^3)$ cost of the per-node local etpf assimilation updates to $\tilde{\mathcal{O}}(BP^3 + |\tilde{\mathcal{M}}|P^2 + MP^2)$.
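A minimal numpy sketch of this blended-transport construction is given below. It uses entropy-regularised Sinkhorn iterations as a stand-in for the exact ot solves, and a bump-weighted squared-distance cost between particles; the function names, the cost choice and the regularisation are our own illustrative assumptions, not the exact scheme used in the experiments.

```python
import numpy as np

def sinkhorn_coupling(w, cost, eps=0.1, n_iter=300):
    """Entropy-regularised coupling between weights w and uniform 1/P;
    a stand-in for an exact ot solve. Columns sum exactly to 1/P."""
    P = len(w)
    K = np.exp(-cost / (eps * cost.mean() + 1e-12))
    v = np.ones(P)
    for _ in range(n_iter):
        u = w / (K @ v)
        v = (1.0 / P) / (K.T @ u)
    return u[:, None] * K * v[None, :]

def smooth_local_etpf_update(X, phi, local_weights):
    """Blend per-patch etpf transports with a partition of unity.

    X:             (P, M) predictive ensemble values at the M mesh nodes.
    phi:           (B, M) bump function values, summing to one over b.
    local_weights: (B, P) normalised importance weights for each patch.
    Returns the (P, M) filtering ensemble.
    """
    P, M = X.shape
    X_new = np.zeros_like(X)
    for b in range(phi.shape[0]):
        # Squared-distance cost between particles restricted to patch b
        # (weighted by the bump function -- an illustrative localised cost).
        Xb = X * phi[b][None, :]
        cost = ((Xb[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
        T = sinkhorn_coupling(local_weights[b], cost)
        # etpf-style transform: each new particle is a convex combination
        # of the predictive particles, blended across patches by phi.
        X_new += phi[b][None, :] * ((P * T).T @ X)
    return X_new
```

Because the bump functions sum to one and each per-patch transform applies convex-combination coefficients, the blended update interpolates smoothly between the patch transports wherever the bump functions are smooth.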
If we increase the mesh resolution by using a larger number of nodes M, while keeping the number of patches B and number of subsampled nodes $|\tilde{\mathcal{M}}|$ fixed, the computational cost of the assimilation update only needs to scale at rate $\tilde{\mathcal{O}}(MP^2)$ with M, which could be considered as the lower bound for an ensemble transform update to P particles of $\tilde{\mathcal{O}}(M)$ dimension. We demonstrated in the numerical experiments that the resulting scheme is able to produce, at often significantly reduced computational cost, ensemble estimates of the filtering distributions for state space models with equivalent accuracy and improved smoothness compared to the local etpf of Cheng and Reich (2015). Although the experiments were restricted to models on one-dimensional spatial domains, in most applications of interest the spatial domain will be two- or three-dimensional. Our proposed scheme naturally carries over to this setting, and as the mesh sizes in such models will tend to be significantly larger, the potential computational savings are even greater. Further, while we concentrated here on filtering in spatial models which are observed at point locations, our scheme could be extended to models with spatially distributed observations by partitioning the spatial domain according to the geometry of the observation processes.

The localisation approach to overcoming weight degeneracy when applying pfs to spatial models considered here could also be combined with other methods for improving pf performance in high-dimensional ssms. In particular, tempering approaches split the usual single prediction and assimilation update per observation time into multiple updates which target a sequence of distributions bridging between the filtering distributions at adjacent observation times (Frei and Künsch, 2013; Johansen, 2015; Beskos et al., 2017; Svensson, Schön and Lindsten, 2018; Herbst and Schorfheide, 2019).
Tempering could be paired with our framework to further improve its robustness to high-dimensional and strongly informative observations, with the use of multiple assimilation updates per observation time when tempering making the reduced computational cost and improved smoothness preservation of our approach particularly important.
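As an illustration of the tempering idea, the next inverse temperature in a bridging sequence is commonly chosen adaptively so that the tempered incremental weights retain a target effective sample size; the bisection helper below is a generic sketch of this adaptive choice, not part of the proposed scheme.

```python
import numpy as np

def next_temperature(log_g, beta, target_ess_frac=0.5):
    """Choose the next inverse temperature by bisection so the tempered
    incremental weights w_p proportional to g_p^(beta' - beta) retain a
    target effective sample size fraction.

    log_g: (P,) log observation densities of the current particles.
    beta:  current inverse temperature in [0, 1).
    """
    def ess_frac(db):
        lw = db * (log_g - log_g.max())   # stabilised log weights
        w = np.exp(lw)
        w /= w.sum()
        return 1.0 / (len(w) * (w ** 2).sum())

    lo, hi = 0.0, 1.0 - beta
    if ess_frac(hi) >= target_ess_frac:
        return 1.0                        # can jump straight to the target
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ess_frac(mid) >= target_ess_frac else (lo, mid)
    return beta + lo
```

Each tempering stage then reweights (and resamples or transports) the ensemble with exponent `beta' - beta`, repeating until the inverse temperature reaches one.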
APPENDIX A: ENSEMBLE TRANSFORM KALMAN FILTER
In this appendix we describe the details of the etkf assimilation update (Bishop, Etherton and Majumdar, 2001) and show how it can be expressed in the form of the letf framework discussed in Section 2.6. We first introduce predictive and filtering ensemble matrices, respectively defined as

$$\tilde{X}_t = \begin{bmatrix} \tilde{x}^1_t & \tilde{x}^2_t & \cdots & \tilde{x}^P_t \end{bmatrix}^{\mathsf{T}} \quad\text{and}\quad X_t = \begin{bmatrix} x^1_t & x^2_t & \cdots & x^P_t \end{bmatrix}^{\mathsf{T}}. \tag{A.1}$$

Using the following linear operators

$$\varepsilon = \tfrac{1}{P} \mathbf{1}_P^{\mathsf{T}}, \qquad \Delta = \tfrac{1}{\sqrt{P-1}} \left( \mathrm{I}_P - \mathbf{1}_P \varepsilon \right), \tag{A.2}$$

the predictive and filtering ensemble means can then be compactly expressed as

$$\tilde{m}_t^{\mathsf{T}} = \varepsilon \tilde{X}_t \quad\text{and}\quad m_t^{\mathsf{T}} = \varepsilon X_t, \tag{A.3}$$

and similarly the predictive and filtering ensemble covariances can be written as

$$\tilde{C}_t = (\Delta \tilde{X}_t)^{\mathsf{T}} (\Delta \tilde{X}_t) \quad\text{and}\quad C_t = (\Delta X_t)^{\mathsf{T}} (\Delta X_t). \tag{A.4}$$

Assuming initially linear-Gaussian observations as in Eq. (2.12), then by substituting the expressions for the empirical covariances in Eq. (A.4) into the Kalman filter covariance assimilation update in Eq. (2.13a) and applying the identity

$$\mathrm{I}_P - \Delta \tilde{Y}_t \left( R_t + \tilde{Y}_t^{\mathsf{T}} \Delta^2 \tilde{Y}_t \right)^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta = \left( \mathrm{I}_P + \Delta \tilde{Y}_t R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1}$$

with $\tilde{Y}_t = \tilde{X}_t H_t^{\mathsf{T}}$, we have

$$(\Delta X_t)^{\mathsf{T}} (\Delta X_t) = (\Delta \tilde{X}_t)^{\mathsf{T}} \left( \mathrm{I}_P + \Delta \tilde{Y}_t R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1} (\Delta \tilde{X}_t). \tag{A.5}$$

Defining $S_t$ as the symmetric matrix square root of the central term on the right-hand side of Eq. (A.5), i.e.

$$S_t^2 = S_t S_t = \left( \mathrm{I}_P + \Delta \tilde{Y}_t R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1}, \tag{A.6}$$

we can then compute a family of solutions of Eq. (A.5) for the filtering ensemble projection $\Delta X_t$ in terms of the predictive ensemble projection $\Delta \tilde{X}_t$ as

$$\Delta X_t = Q S_t \Delta \tilde{X}_t, \tag{A.7}$$

where $Q$ is an arbitrary $P \times P$ orthogonal matrix. For the etkf generally $Q = \mathrm{I}_P$ is chosen, corresponding to directly transforming by the symmetric square root. Now considering the Kalman assimilation update for the mean in Eq. (2.13b), substituting the expressions for the ensemble empirical means and covariances in Eqs. (A.3) and (A.4) and using the definition of the square-root matrix $S_t$ in Eq. (A.6), we have that

$$\varepsilon X_t = \varepsilon \tilde{X}_t + (y_t^{\mathsf{T}} - \varepsilon \tilde{Y}_t) R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta S_t^2 \Delta \tilde{X}_t. \tag{A.8}$$

From the definition of $\Delta$ in Eq. (A.2) we have that $\mathrm{I}_P = \mathbf{1}_P \varepsilon + \sqrt{P-1}\, \Delta$ and so

$$X_t = \mathbf{1}_P \varepsilon X_t + \sqrt{P-1}\, \Delta X_t \tag{A.9}$$
$$\phantom{X_t} = \left( \mathbf{1}_P \varepsilon + \mathbf{1}_P (y_t^{\mathsf{T}} - \varepsilon \tilde{Y}_t) R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta S_t^2 \Delta + \sqrt{P-1}\, S_t \Delta \right) \tilde{X}_t, \tag{A.10}$$

with the matrix term in parentheses defining the coefficients $a_t^{1:P,1:P}$ of an letf assimilation update as in Eq. (2.11).

In the above it was assumed the observation model is linear-Gaussian. In the case of a more general observation model of the form

$$G_t(x, v) = H_t(x) + v, \quad v_t \sim \mathcal{N}(0, R_t), \tag{A.11}$$

where now $H_t$ is a potentially non-linear operator, then by observing that all occurrences of $H_t$ in Eqs. (A.6) and (A.10) are via $\tilde{Y}_t = \tilde{X}_t H_t^{\mathsf{T}}$, for non-linear $H_t$ we can instead define the predictive observation ensemble matrix $\tilde{Y}_t$ as

$$\tilde{Y}_t = \begin{bmatrix} H_t(\tilde{x}^1_t) & H_t(\tilde{x}^2_t) & \cdots & H_t(\tilde{x}^P_t) \end{bmatrix}^{\mathsf{T}}. \tag{A.12}$$

The etkf formulation of a square-root enkf has the advantage of only requiring cubic-cost matrix operations for matrices of size $P \times P$ (due to the conditional independence assumptions $R_t$ is block diagonal, and so the cost of computing $R_t^{-1}$ is at worst $\mathcal{O}(LK^3)$ with in general $K \ll P$).

For all enkf methods, the assimilation updates are only consistent with the analytic assimilation update in Eq. (2.7) as $P \to \infty$ for linear-Gaussian models. In models where the state update and observation operators are only weakly non-linear, the filtering distribution at each time index $\pi_t$ can remain 'close' to Gaussian and the enkf updates will often give reasonable estimates of the filtering distribution (Evensen, 2009). For models with highly non-Gaussian filtering distributions, however, enkf methods will typically perform poorly.

A local version of the etkf algorithm was proposed in Hunt, Kostelich and Szunyogh (2007). In the global etkf assimilation update summarised in Eq. (A.10) the linear transform coefficients depend on the current predictive state ensemble values $\tilde{x}^{1:P}_t$ only via the $P \times KL$ observation ensemble matrix $\tilde{Y}_t$.
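A minimal numpy sketch of the global etkf update in Eq. (A.10) (symmetric square root, $Q = \mathrm{I}_P$; the array shapes and helper names are our own):

```python
import numpy as np

def etkf_update(X_pred, obs_op, y, R):
    """Global etkf assimilation update of Eq. (A.10) with Q = I_P.

    X_pred: (P, D) predictive ensemble matrix (rows are particles).
    obs_op: maps a (D,) state vector to its (K,) predicted observation.
    y:      (K,) observation vector;  R: (K, K) noise covariance.
    Returns the (P, D) filtering ensemble matrix.
    """
    P = X_pred.shape[0]
    eps = np.full(P, 1.0 / P)                          # mean operator (A.2)
    Delta = (np.eye(P) - np.outer(np.ones(P), eps)) / np.sqrt(P - 1)
    Y = np.stack([obs_op(x) for x in X_pred])          # obs ensemble (A.12)
    Rinv = np.linalg.inv(R)
    # Symmetric square root S with S @ S = (I + Delta Y R^-1 Y^T Delta)^-1.
    evals, evecs = np.linalg.eigh(np.eye(P) + Delta @ Y @ Rinv @ Y.T @ Delta)
    S = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # Row vector of mean-update coefficients, then full letf coefficients.
    row = (y - eps @ Y) @ Rinv @ Y.T @ Delta @ (S @ S) @ Delta
    coeffs = np.outer(np.ones(P), eps + row) + np.sqrt(P - 1) * (S @ Delta)
    return coeffs @ X_pred
```

For a linear observation operator the resulting ensemble mean and covariance coincide exactly with the Kalman filter update applied to the ensemble empirical moments, as the derivation above guarantees.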
The local etkf algorithm scales the dependence of the update coefficients at each mesh node on the columns of $\tilde{Y}_t$ via a localisation function $\ell_r : [0, \infty) \to [0, 1]$ satisfying the conditions in Eq. (3.9) for some localisation radius $r > 0$, such that observations at a distance more than $r$ from the mesh node are ignored in the corresponding local assimilation update. For each of the $M$ mesh nodes a localisation kernel is then defined by applying $\ell_r$ to the distances between the mesh nodes and the observation locations

$$k_m^{\mathsf{T}} = \left[ \ell_r(d(s_m, s^{\mathrm{o}}_1))\, \mathbf{1}_K^{\mathsf{T}}, \cdots, \ell_r(d(s_m, s^{\mathrm{o}}_L))\, \mathbf{1}_K^{\mathsf{T}} \right] \quad \forall m \in 1{:}M. \tag{A.13}$$

We can then define local effective observation noise precision matrices $\tilde{R}^{-1}_{t,1:M}$

$$\tilde{R}^{-1}_{t,m} = R^{-1}_t \odot (k_m k_m^{\mathsf{T}}) \quad \forall m \in 1{:}M, \tag{A.14}$$

where $\odot$ indicates the elementwise or Hadamard product between equal-sized tensors. The local etkf assimilation update is then

$$X_{t,m} = \left( \mathbf{1}_P \varepsilon + \mathbf{1}_P (y_t^{\mathsf{T}} - \varepsilon \tilde{Y}_t) \tilde{R}^{-1}_{t,m} \tilde{Y}_t^{\mathsf{T}} \Delta \tilde{S}_{t,m}^2 \Delta + \sqrt{P-1}\, \tilde{S}_{t,m} \Delta \right) \tilde{X}_{t,m}, \tag{A.15}$$

where the local square-root matrix $\tilde{S}_{t,m}$ is defined by

$$\tilde{S}_{t,m}^2 = \tilde{S}_{t,m} \tilde{S}_{t,m} = \left( \mathrm{I}_P + \Delta \tilde{Y}_t \tilde{R}^{-1}_{t,m} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1}. \tag{A.16}$$
This local assimilation update is equivalent to replacing each observation ensemble matrix term $\tilde{Y}_t$ and observation vector term $y_t$ in the global assimilation update in Eq. (A.10) with $\tilde{Y}_t \odot (\mathbf{1}_P k_m^{\mathsf{T}})$ and $y_t \odot k_m$ respectively. As $k_m$ has zero entries for all indices corresponding to observation locations more than $r$ in distance from $s_m$, in practice when implementing the local etkf assimilation update the computations can be performed with only the non-zero submatrices of $\tilde{Y}_t \odot (\mathbf{1}_P k_m^{\mathsf{T}})$ and $y_t \odot k_m$ and the corresponding submatrix of $R^{-1}_t$.

As separate assimilation updates need to be computed for each mesh node, the computational cost of the local etkf scales linearly with the number of mesh nodes $M$. The computation for each mesh node is of order $\mathcal{O}(P^3)$ due to the requirement to calculate a matrix decomposition of the $P \times P$ matrix inside the parentheses on the right-hand side of Eq. (A.16). On a sequential architecture the overall computation time will therefore have an $\mathcal{O}(MP^3)$ scaling. However, as each of the local assimilation updates can be independently computed in parallel, with a large number of parallel compute nodes the assimilation update can still be computed efficiently for models with large mesh sizes $M$, as shown in the numerical experiments in Hunt, Kostelich and Szunyogh (2007).

APPENDIX B: ALTERNATIVE PARTICLE FILTER PROPOSALS
Rather than propagating according to the forward dynamics of the generative model, it is possible to instead propose new particle values from different conditional distributions (which may depend on future observed values) and adjust the expression for the importance weights in Eq. (2.14) accordingly. Typically the resulting expression for the importance weights is given in terms of the transition density of the state updates; however, as noted previously, this density will often be intractable to compute. Alternative state proposals can instead be formulated by changing the distribution the state noise variables are drawn from. If each state noise vector $u^p_t$ is sampled from a distribution with a known density $d^p_t$ with respect to $\mu_t$ and the predictive ensemble particles are computed as in Eq. (2.9), then unnormalised importance weights for the propagated particles can be computed as

$$\tilde{w}^p_t = g_t\!\left( y_t \mid F_t(x^p_{t-1}, u^p_t) \right) d^p_t(u^p_t)^{-1} \quad \forall p \in 1{:}P. \tag{B.1}$$

The corresponding normalised weights can then be used in the empirical filtering distribution approximation in Eq. (2.14) and resampling update in Eq. (2.15). If we restrict the state noise proposal density $d^p_t$ to depend only on the previous particle $x^p_{t-1}$ and current observation $y_t$, in order to maintain the online nature of the algorithm, then the proposal distributions which minimise the variance of the importance weights have densities with respect to $\mu_t$

$$d^p_t(u) = \frac{g_t\!\left( y_t \mid F_t(x^p_{t-1}, u) \right)}{\int_U g_t\!\left( y_t \mid F_t(x^p_{t-1}, u') \right) \mu_t(\mathrm{d}u')} \quad \forall p \in 1{:}P. \tag{B.2}$$

In this case the unnormalised weights in Eq. (B.1) are independent of the state noise variables $u^{1:P}_t$. Although this 'optimal' proposal is more typically expressed as a conditional distribution on $\tilde{x}^p_t$ given $x^p_{t-1}$, this alternative formulation is equivalent. In general it will not be possible to generate samples from the optimal proposal; however, it may be possible, for example, to find a tractable approximation to use as a proxy. In cases where the optimal proposal is tractable or can be well approximated, the resulting pf algorithm can significantly outperform the basic bootstrap pf in terms of the ensemble size required for a given accuracy in the filtering distribution estimates.
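The weight computation in Eq. (B.1) can be sketched as follows, working in log space for numerical stability (the function names and interfaces are illustrative assumptions):

```python
import numpy as np

def propagate_and_weight(x_prev, u, F, log_g, log_d):
    """Propagate particles with proposed noise draws and compute the
    normalised importance weights of Eq. (B.1) in log space.

    x_prev: (P, D) previous particles;  u: (P, Du) proposed noise draws.
    F(x, u): state update operator.
    log_g(x): log observation density at the current observation.
    log_d(u): log density of the noise proposal with respect to the
              original state noise distribution.
    """
    X = np.stack([F(x, un) for x, un in zip(x_prev, u)])
    log_w = np.array([log_g(x) for x in X]) - np.array([log_d(un) for un in u])
    log_w -= log_w.max()                  # guard against exp underflow
    w = np.exp(log_w)
    return X, w / w.sum()
```

Setting `log_d` identically to zero recovers the bootstrap pf weights, so the same routine covers both the basic and proposal-adjusted cases.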
APPENDIX C: VISUALISATION OF SMOOTH LOCAL LETF ASSIMILATION UPDATE

Fig C.1: Example of applying the smooth local letf assimilation update in Eq. (4.7) on a one-dimensional spatial domain S = [0, 1] using the pou from Fig. 5 with P = 2 particles. (Panels, top to bottom: the predictive particles $\tilde{z}_p(s)$; the products $\phi_b(s)\tilde{z}_p(s)$ for odd-indexed $b$; the products $\phi_b(s)\tilde{z}_p(s)$ for even-indexed $b$; the filtering particles $z_p(s)$.)

Consider a spatial domain which is the same unit interval S = [0, 1] as used in Fig. 5 and a pou chosen as the smooth bump functions $\phi_{1:B}$ shown there. The top panel in Fig. C.1 shows two smooth predictive distribution particle realisations $\tilde{z}_{1:2}$. The central two panels show the products $\phi_b(s)\tilde{z}_p(s)$, smooth in $s$ and compactly supported on the patches $\hat{S}_{1:B}$. The bottom panel shows the filtering distribution particle realisations $z_{1:2}$ computed using the assimilation update in Eq. (4.7) for a randomly generated set of coefficients $\hat{a}$ satisfying the conditions in Eq. (4.5), with these post-assimilation fields maintaining the smoothness of the predictive fields. The separation of products with odd and even indexed bump functions on to separate panels in Fig. C.1 is simply for visual clarity.

APPENDIX D: PARTITIONING THE SPATIAL DOMAIN
Fig D.1: Examples of partitioning a space based on observation locations for a two-dimensional spatial domain. Panel (a) shows a partition (indicated by coloured regions) for observations located on an equispaced rectilinear grid (shown by circular markers). Panel (b) shows a partition for an irregularly located set of observations, with the observation locations initially clustered (indicated by the colours of the markers) before partitioning based on the Voronoi cells associated with each cluster of observation locations (cells shown by bordered polygonal regions).

In order to control the number of observations used to compute each local weight in the sletpf scheme, we recommend choosing the partition of the spatial domain used to define the pou such that each patch contains roughly the same number of observations. For observations located on a rectilinear grid this can easily be achieved by partitioning the space into rectilinear blocks aligned with the observation grid, each containing the same number of observations (an example is shown in Fig. D.1a). For irregularly spaced observations, one option is to first group the observation locations into similarly sized clusters using, for example, a k-means algorithm. The spatial domain can then be partitioned using a Voronoi diagram generated from the observation locations, with all the cells corresponding to observations in a single cluster then merged to form a single contiguous region. This leads to a partition of the spatial domain into a set of regions which each contain a roughly equal number of observations and such that the number of additional observations close to the region boundaries is minimised. An example of applying this scheme to a set of irregularly located observation points is shown in Fig. D.1b. In both the rectilinear and irregular spacing cases, a soft pou can then be generated from the resulting partition by convolving with a mollifier function as described in Section 4.1.
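As an illustration of the final step, a soft pou can be generated from a hard partition by convolution with a smoothing kernel; the sketch below uses equal blocks on a periodic one-dimensional domain and a triangular kernel as a simple stand-in for the mollifier of Section 4.1 (both assumptions of this example, not the exact construction used in the experiments).

```python
import numpy as np

def smooth_pou(mesh, n_patches, width):
    """Construct a smooth partition of unity on the periodic domain [0, 1)
    by mollifying the indicator functions of B equal blocks with a
    triangular kernel of half-width `width`.

    mesh: (M,) equispaced node coordinates in [0, 1).
    Returns a (B, M) array of bump function values summing to one.
    """
    M = len(mesh)
    block = np.floor(mesh * n_patches).astype(int) % n_patches
    hard = np.eye(n_patches)[block].T          # (B, M) hard indicators
    half = max(int(round(width * M)), 1)
    kernel = 1.0 - np.abs(np.arange(-half, half + 1)) / (half + 1)
    kernel = kernel / kernel.sum()
    # Tiling three copies before convolving implements the periodic
    # (circular) convolution appropriate for a periodic spatial domain.
    soft = np.stack([
        np.convolve(np.tile(h, 3), kernel, mode="same")[M:2 * M]
        for h in hard
    ])
    return soft / soft.sum(axis=0, keepdims=True)
```

Because the hard indicators already sum to one at every node and convolution is linear, the mollified functions also sum to one (the final renormalisation only absorbs floating-point error).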
APPENDIX E: TRANSFORMED STATE-SPACE MODELS
One of our primary motivations for considering pf-based methods was the claim that they are more robust to non-Gaussianity in the filtering distributions compared to enkf methods. While this can be shown to be the case in the large ensemble limit for non-localised pf algorithms (including the etpf) compared to enkf methods, it does not necessarily follow that, when using small ensemble sizes, a local etpf would be expected to outperform a local enkf in models with non-Gaussian filtering distributions. Further, even if there is a benefit to using the local etpf compared to the local enkf, this does not necessarily carry over to our proposed smooth and scalable local etpf scheme.

Therefore, to assess the effect of non-Gaussianity in the filtering distributions on the relative performance of the local ensemble filter methods being considered, while controlling as far as possible for other factors which might affect performance, we use a simple scheme to map a tractable linear-Gaussian ssm to a transformed ssm with non-Gaussian filtering distributions. In particular, let $T : \mathcal{X} \to \mathcal{X}$ be a diffeomorphism on the state space, with $T^{-1}$ denoting its inverse, which we assume we can also compute. If we define $\bar{x}_t = T(x_t)$ for all time indices $t$, then the conditional distribution on $\bar{x}_t$ given the observations will be $T_\sharp \pi_t$ for any time index $t$, i.e. the push-forward of the filtering distribution $\pi_t$ under the map $T$. If $T$ is non-linear then, even if $\pi_t$ is Gaussian, $T_\sharp \pi_t$ will in general be non-Gaussian. Importantly for our purposes, we can construct a ssm acting directly on the transformed states $\bar{x}_{0:T}$.
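This construction amounts to conjugating the base operators with the map T, as made precise in Eqs. (E.1)–(E.3); a minimal sketch, using an elementwise sinh diffeomorphism as an assumed example of T:

```python
import numpy as np

def transform_ssm(F0, F, G, T, T_inv):
    """Wrap a base ssm's operators into the T-transformed ssm of
    Eqs. (E.1)-(E.3), for a diffeomorphism T with inverse T_inv."""
    F0_bar = lambda u: T(F0(u))
    F_bar = lambda x_bar, u: T(F(T_inv(x_bar), u))
    G_bar = lambda x_bar, v: G(T_inv(x_bar), v)
    return F0_bar, F_bar, G_bar

# Assumed example: a scalar linear-Gaussian base model and elementwise
# sinh as T, giving non-Gaussian transformed filtering distributions.
F0 = lambda u: u
F = lambda x, u: 0.9 * x + u
G = lambda x, v: x + v
F0_bar, F_bar, G_bar = transform_ssm(F0, F, G, np.sinh, np.arcsinh)
```

Running an ensemble filter with the wrapped operators then targets exactly the push-forward of the base filtering distributions under T, while reference moments remain computable by pushing exact samples from the base model through T.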
In particular, for a base ssm with state update and observation operators $F_{0:T}$ and $G_{1:T}$, we can define a $T$-transformed ssm with state update and observation operators $\bar{F}_{0:T}$ and $\bar{G}_{1:T}$ given by

$$\bar{x}_0 = \bar{F}_0(u_0) = T \circ F_0(u_0), \quad u_0 \sim \mu_0, \tag{E.1}$$
$$\bar{x}_t = \bar{F}_t(\bar{x}_{t-1}, u_t) = T \circ F_t(T^{-1}(\bar{x}_{t-1}), u_t), \quad u_t \sim \mu_t \ \forall t \in \mathcal{T}, \tag{E.2}$$
$$y_t = \bar{G}_t(\bar{x}_t, v_t) = G_t(T^{-1}(\bar{x}_t), v_t), \quad v_t \sim \nu_t \ \forall t \in \mathcal{T}, \tag{E.3}$$

and with observation densities $\bar{g}_{1:T}$ defined by

$$\bar{g}_t(y_t \mid \bar{x}_t) = g_t(y_t \mid T^{-1}(\bar{x}_t)) \quad \forall t \in \mathcal{T}. \tag{E.4}$$

We can therefore run ensemble filter algorithms on the $T$-transformed ssm to directly compute ensemble estimates of the transformed filtering distributions $\bar{\pi}_{1:T}$, with by construction $\bar{\pi}_t = T_\sharp \pi_t$ for all $t \in \mathcal{T}$. If the base ssm is linear-Gaussian, and so a kf can be used to exactly compute the Gaussian filtering distributions $\pi_{1:T}$, we can compute accurate unbiased Monte Carlo estimates of expectations under the transformed filtering distributions $\bar{\pi}_{1:T}$, as we can generate $N$ independent samples from each $\bar{\pi}_t$ by generating $N$ independent samples from the Gaussian filtering distribution $\pi_t$ and pushing each of the samples through the map $T$.

This scheme therefore provides a method for constructing a non-Gaussian ssm for which we can easily compute accurate Monte Carlo estimates of the true filtering distribution means $\bar{\mu}_{1:T}$, standard deviations $\bar{\sigma}_{1:T}$ and smoothness coefficients $\bar{\gamma}_{1:T}$ as defined in Eqs. (5.1) and (5.5), and so evaluate the rmse accuracy metrics described in the preceding section for ensemble estimates of the filtering distributions. By using a large number of independent samples $N$ in the Monte Carlo estimates we can ensure the $\mathcal{O}(N^{-1/2})$ Monte Carlo error is negligible compared to the error in the ensemble estimates.

APPENDIX F: MODEL DETAILS

F.1 Stochastic turbulence model
Table F.1: Stochastic turbulence model parameter settings

Number of mesh nodes: M = 512
Number of observation times: T = 200
Number of observation locations: L = 64
Time step: δ = 2.…
θ₁ = 4 × 10⁻…
Advection coefficient: θ₂ = 0.…
θ₃ = 0.…
θ₄ = 5
State noise kernel length scale: ϑ = 4 × 10⁻…
State noise kernel amplitude: α = 0.…
Observation noise standard deviation: ς = 0.…
We define a regular mesh of nodes $s_{1:M}$ and basis functions $\beta_{1:M}$ with

$$s_m = \frac{m-1}{M} \quad\text{and}\quad \beta_m(s) = \frac{\operatorname{sinc}(2\pi M (s - s_m)) \cos(\pi (s - s_m))}{\operatorname{sinc}(\pi (s - s_m))} \quad \forall m \in 1{:}M, \tag{F.1}$$

with the space-time varying processes $\zeta$ and $\eta$ and kernel function $\kappa$ then being defined respectively in terms of the finite set of time-varying processes $\chi_{1:M}$ and $\upsilon_{1:M}$ and coefficients $\lambda_{1:M}$ as

$$\zeta(s, \tau) = \sum_{m \in 1:M} \chi_m(\tau)\, \beta_m(s), \tag{F.2}$$
$$\eta(s, \tau) = \sum_{m \in 1:M} \upsilon_m(\tau)\, \beta_m(s), \tag{F.3}$$
$$\kappa(s) = \sum_{m \in 1:M} \lambda_m\, \beta_m(s). \tag{F.4}$$

The basis functions $\beta_{1:M}$ and nodes $s_{1:M}$ satisfy Eq. (3.2) such that $\chi_m(\tau)$, $\upsilon_m(\tau)$ and $\lambda_m$ correspond to the values of respectively $\zeta(s_m, \tau)$, $\eta(s_m, \tau)$ and $\kappa(s_m)$ for any mesh node $s_m$. We define $\tilde{\chi}_{0:K}(\tau) = \mathrm{dft}(\chi_{1:M}(\tau))$, $\tilde{\upsilon}_{0:K}(\tau) = \mathrm{dft}(\upsilon_{1:M}(\tau))$ and $\tilde{\lambda}_{0:K} = \mathrm{dft}(\lambda_{1:M})$ with $K = \lfloor M/2 \rfloor$ and $\mathrm{dft}$ indicating the discrete Fourier transform, with the Fourier coefficient $\tilde{x}_k$ for a real sequence $x_{1:M}$ being computed as

$$\tilde{x}_k = \mathrm{dft}_k(x_{1:M}) = \frac{1}{M} \sum_{m \in 1:M} x_m \exp\!\left( -\frac{\mathrm{i}\, 2\pi k (m-1)}{M} \right) \in \begin{cases} \mathbb{R} & \text{if } k \in \{0, \tfrac{M}{2}\}, \\ \mathbb{C} & \text{if } k \in 1{:}\lceil\tfrac{M}{2}\rceil - 1. \end{cases} \tag{F.5}$$

Then we have the following equivalent spectral expansions for $\zeta$, $\eta$ and $\kappa$:

$$\zeta(s, \tau) = \sum_{k \in -K:K} \alpha_k\, \tilde{\chi}_k(\tau) \exp(\mathrm{i}\omega_k s), \tag{F.6}$$
$$\eta(s, \tau) = \sum_{k \in -K:K} \alpha_k\, \tilde{\upsilon}_k(\tau) \exp(\mathrm{i}\omega_k s), \tag{F.7}$$
$$\kappa(s) = \sum_{k \in -K:K} \alpha_k\, \tilde{\lambda}_k \exp(\mathrm{i}\omega_k s), \tag{F.8}$$
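The normalisation and sign conventions of the dft in Eq. (F.5) match numpy's real fft up to the 1/M factor, which the helpers below apply (our own convenience wrappers):

```python
import numpy as np

def dft_coeffs(x):
    """Fourier coefficients with the 1/M normalisation of Eq. (F.5);
    np.fft.rfft uses the same exp(-i 2 pi k m / M) sign convention (with
    zero-based m) and returns exactly the k = 0, ..., floor(M/2) terms."""
    return np.fft.rfft(x) / len(x)

def idft_coeffs(c, M):
    """Recover the M real nodal values from the k = 0:K coefficients,
    consistent with the two-sided expansions in Eqs. (F.6)-(F.8)."""
    return np.fft.irfft(c * M, n=M)
```

For a real input sequence the zero- and Nyquist-frequency coefficients are real and the remaining coefficients complex, matching the cases in Eq. (F.5).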
Fig F.1: Simulated sequences used in experiments for st ssms (spatial coordinate $s$ against time $\tau$). (a) Noisy observation sequence $y_{1:T}$. (b) State sequence $z_{1:T}$ for the linear-Gaussian ssm. (c) State sequence $z_{1:T}$ for the transformed ssm.

with the convention that negative indices to the Fourier coefficients indicate complex conjugation, e.g. $\tilde{\lambda}_{-k} = \tilde{\lambda}_k^*$, and $\alpha_{-K:K}$ and $\omega_{-K:K}$ are defined as

$$\alpha_k = \begin{cases} 1 & \text{if } k = 0 \text{ or } |k| \in 1{:}\lceil\tfrac{M}{2}\rceil - 1, \\ \tfrac{1}{2} & \text{if } |k| = \tfrac{M}{2}, \end{cases} \quad\text{and}\quad \omega_k = 2\pi k \quad \forall k \in -K{:}K. \tag{F.9}$$

Using Eq. (F.6) we then have that spatial derivatives of $\zeta$ can be computed as

$$\partial_s^n \zeta(s, \tau) = \sum_{k \in -K:K} \alpha_k (\mathrm{i}\omega_k)^n\, \tilde{\chi}_k(\tau) \exp(\mathrm{i}\omega_k s) \quad \forall n \in \mathbb{N}. \tag{F.10}$$

Substituting the expansions in Eqs. (F.6), (F.7) and (F.10) for the processes and spatial derivatives into Eq. (5.8) and using the convolution theorem gives

$$\sum_{k \in -K:K} \alpha_k \left( \mathrm{d}\tilde{\chi}_k - (-\theta_1 \omega_k^2 + \mathrm{i}\theta_2 \omega_k - \theta_3)\, \tilde{\chi}_k \,\mathrm{d}\tau - \tilde{\lambda}_k \,\mathrm{d}\tilde{\upsilon}_k \right) \exp(\mathrm{i}\omega_k s) = 0. \tag{F.11}$$

Integrating both sides over $S$ against a suitable orthogonal set of test functions

$$h_j(s) = \exp(-\mathrm{i}\omega_j s) \ \forall j \in 0{:}\lceil\tfrac{M}{2}\rceil - 1 \quad\text{and}\quad h_{M/2}(s) = \cos(M\pi s) \text{ if } M \text{ is even}, \tag{F.12}$$

we arrive at the following system of sdes

$$\mathrm{d}\tilde{\chi}_k(\tau) = (-\theta_1 \omega_k^2 + \mathrm{i}\theta_2 \omega_k - \theta_3)\, \tilde{\chi}_k(\tau) \,\mathrm{d}\tau + \tilde{\lambda}_k \,\mathrm{d}\tilde{\upsilon}_k(\tau) \quad \forall k \in 0{:}\lceil\tfrac{M}{2}\rceil - 1, \tag{F.13}$$
$$\mathrm{d}\tilde{\chi}_{M/2}(\tau) = (-\theta_1 \omega_{M/2}^2 - \theta_3)\, \tilde{\chi}_{M/2}(\tau) \,\mathrm{d}\tau + \tilde{\lambda}_{M/2} \,\mathrm{d}\tilde{\upsilon}_{M/2}(\tau) \quad \text{if } M \text{ is even}. \tag{F.14}$$

Assuming that the noise Fourier coefficients $\tilde{\upsilon}_{0:K}$ are independent Wiener processes, real-valued for the zero- and Nyquist-frequency coefficients ($\tilde{\upsilon}_0$ and $\tilde{\upsilon}_{M/2}$) and complex-valued for the remaining coefficients, the transition distributions for this system have analytic solutions

$$\tilde{\chi}_k(\tau) \mid \tilde{\chi}_k(0) \sim \mathcal{N}\!\left( \exp(\xi_k \tau)\, \tilde{\chi}_k(0),\ \frac{\tilde{\lambda}_k^2}{2\psi_k} \left( 1 - \exp(-2\psi_k \tau) \right) \right), \\ \text{with } \psi_k = \theta_1 \omega_k^2 + \theta_3 \ \text{and}\ \xi_k = \begin{cases} \mathrm{i}\theta_2 \omega_k - \psi_k & \text{if } k \neq \tfrac{M}{2}, \\ -\psi_k & \text{if } k = \tfrac{M}{2}, \end{cases} \quad \forall k \in 0{:}K, \tag{F.15}$$

where we have overloaded the notation for a Gaussian distribution $\mathcal{N}$ to extend to complex-valued variables, with the convention that for a complex-valued random variable $z \in \mathbb{C}$, complex mean parameter $\mu \in \mathbb{C}$ and real variance $\sigma^2 \in \mathbb{R}_{>0}$,

$$z \sim \mathcal{N}(\mu, \sigma^2) \implies \Re(z) \sim \mathcal{N}\!\left( \Re(\mu), \tfrac{\sigma^2}{2} \right),\ \Im(z) \sim \mathcal{N}\!\left( \Im(\mu), \tfrac{\sigma^2}{2} \right) \ \text{and}\ \Re(z) \perp \Im(z). \tag{F.16}$$

The Fourier coefficients $\tilde{\chi}_{0:K}$ then also have Gaussian stationary distributions

$$\tilde{\chi}_k(\infty) \sim \mathcal{N}\!\left( 0, \frac{\tilde{\lambda}_k^2}{2\psi_k} \right) \quad \forall k \in 0{:}K. \tag{F.17}$$

We assume the system is observed at $T$ time points with $\tau_t = (t-1)\delta$ for all $t \in \mathcal{T}$ and that the Fourier coefficients of the initial state at time $\tau = 0$ are generated from the stationary distributions in Eq. (F.17). Identifying

$$z_t(s) = \zeta(s, \tau_t) \quad\text{and}\quad x_{t,1:M} = \chi_{1:M}(\tau_t) \quad \forall t \in \mathcal{T}, \tag{F.18}$$

we have that the state update operators can be written

$$x_{0,1:M} = \mathrm{dft}^{-1}(a_{0:K} \odot u_{0,0:K}), \tag{F.19}$$
$$x_{t,1:M} = \mathrm{dft}^{-1}(b_{0:K} \odot \mathrm{dft}(x_{t-1,1:M}) + c_{0:K} \odot u_{t,0:K}) \quad \forall t \in \mathcal{T}, \tag{F.20}$$

where $a_{0:K}$, $b_{0:K}$ and $c_{0:K}$ are length $K + 1$ vectors with

$$a_k = \frac{\tilde{\lambda}_k}{\sqrt{2\psi_k}}, \quad b_k = \exp(\xi_k \delta), \quad c_k = a_k \sqrt{1 - \exp(-2\psi_k \delta)} \quad \forall k \in 0{:}K, \tag{F.21}$$

and the state noise variables $u_{1:T,0:K}$ are real-valued for the zero- and Nyquist-frequency components and complex otherwise, and have Gaussian distributions

$$u_{t,k} \in \begin{cases} \mathbb{R} & \text{if } k \in \{0, \tfrac{M}{2}\}, \\ \mathbb{C} & \text{if } k \in 1{:}\lceil\tfrac{M}{2}\rceil - 1, \end{cases} \qquad u_{t,k} \sim \mathcal{N}(0, 1) \quad \forall t \in \mathcal{T},\ k \in 0{:}K. \tag{F.22}$$

The system is observed at $L$ equispaced mesh nodes with $s^{\mathrm{o}}_l = s_{(M/L)(l-1)+1}$ for all $l \in 1{:}L$ and a simple linear-Gaussian observation model assumed

$$y_{t,l} = z_t(s^{\mathrm{o}}_l) + v_{t,l} = x_{t,(M/L)(l-1)+1} + v_{t,l}, \quad v_{t,l} \sim \mathcal{N}(0, \varsigma^2) \quad \forall t \in \mathcal{T},\ l \in 1{:}L. \tag{F.23}$$

The state noise kernel Fourier coefficients $\tilde{\lambda}_{0:K}$ are chosen to represent a squared-exponential kernel with length-scale parameter $\vartheta$ and amplitude parameter $\alpha$:

$$\tilde{\lambda}_k = \alpha \exp(-\omega_k^2 \vartheta^2) \quad \forall k \in 0{:}K. \tag{F.24}$$

F.2 Kuramoto–Sivashinsky model

Table F.2: Kuramoto–Sivashinsky model parameter settings

Number of mesh nodes: M = 512
Number of observation times: T = 200
Number of observation locations: L = 64
Number of integrator steps between observations: S = 10
Integrator time step: δ = 0.…
θ₁ = 32π
Damping coefficient: θ₂ = …
State noise kernel length scale: θ₃ = θ₁⁻…
State noise kernel amplitude: θ₄ = θ₁⁻…
Observation noise standard deviation: ς = 0.…
We use the same spectral approach as in the st model to define the basis function expansions of the processes $\zeta$ and $\eta$ and kernel $\kappa$ in terms of coefficients $\chi_{1:M}$, $\upsilon_{1:M}$ and $\lambda_{1:M}$ (see Eqs. (F.2) to (F.4)). The non-linear $\zeta$ term in the drift component of the spde cannot be exactly expressed as a linear combination of the basis functions $\beta_{1:M}$, and so we cannot directly form a system of sdes to solve as in the st model. We make the approximation

$$\zeta(s, \tau)^2 = \sum_{m \in 1:M} \sum_{n \in 1:M} \chi_m(\tau) \chi_n(\tau)\, \beta_m(s) \beta_n(s) \approx \sum_{m \in 1:M} \chi_m(\tau)^2\, \beta_m(s). \tag{F.25}$$

At the mesh nodes $s_{1:M}$ this gives the correct values but gives a different interpolation at points between the nodes; for dense meshes, however, the error introduced is small. Using this approximation the following system of sdes can be derived in the Fourier coefficients $\tilde{\chi}_{0:K}$, $\tilde{\upsilon}_{0:K}$ and $\tilde{\lambda}_{0:K}$

$$\mathrm{d}\tilde{\chi}_k(\tau) = \left( \left( \frac{\omega_k^2}{\theta_1^2} - \frac{\omega_k^4}{\theta_1^4} - \theta_2 \right) \tilde{\chi}_k(\tau) + N_k(\tilde{\chi}_{0:K}) \right) \mathrm{d}\tau + \tilde{\lambda}_k \,\mathrm{d}\tilde{\upsilon}_k(\tau) \quad \forall k \in 0{:}K, \tag{F.26}$$

with the noise Fourier coefficients $\tilde{\upsilon}_{0:K}$ again assumed to be (complex-valued) Wiener processes and the non-linear $N_k$ terms in the drift defined by

$$N_k(\tilde{\chi}_{0:K}) = \begin{cases} -\dfrac{\mathrm{i}\omega_k}{2\theta_1}\, \mathrm{dft}_k\!\left( \mathrm{dft}^{-1}(\tilde{\chi}_{0:K}(\tau))^2 \right) & \text{if } k \in 0{:}\lceil\tfrac{M}{2}\rceil - 1, \\ 0 & \text{if } k = \tfrac{M}{2}. \end{cases} \tag{F.27}$$

Fig F.2: Simulated sequences used in experiments with stochastic ks ssms (spatial coordinate $s$ against time $\tau$). (a) Noisy observation sequence $y_{1:T}$ with linear observation operator. (b) Noisy observation sequence $y_{1:T}$ with non-linear observation operator. (c) True state sequence $z_{1:T}$ used to generate observations.

The state noise kernel Fourier coefficients $\tilde{\lambda}_{0:K}$ are, as in the st model, chosen to represent a squared-exponential kernel as defined in Eq. (F.24). Due to the non-linear terms, the system of sdes in Eq. (F.26) does not have an analytic solution.
Therefore we numerically integrate the system using a heuristic combination of an exponential time-differencing fourth-order Runge–Kutta scheme (Cox and Matthews, 2002) to time step forward according to the drift term and an Euler–Maruyama discretisation to account for the diffusion term. To reduce the time discretisation error we use $S$ integrator steps with time step $\delta$ between each of the $T$ observation times $\tau_t = (t-1)S\delta$ for all $t \in \mathcal{T}$. The state transition operator $F_t$ then corresponds to the map from a previous state vector $x_{t-1}$ and state noise variable $u_t$ (consisting of the concatenation of $S$ simulated Wiener process increments) to the state vector $x_t$ obtained by performing $S$ integrator steps. The state transition operators are non-linear and the density of the corresponding state transition distribution does not have a closed-form solution.

For the observation operators we considered two cases: a linear-Gaussian observation model and a non-linear observation operator. Although, due to the non-linear state transition operators, the filtering distributions are non-Gaussian irrespective of the observation operator used, in practice we found the local enkf was able to generate accurate ensemble estimates of the filtering distributions when using a simple linear-Gaussian observation model, suggesting the filtering distributions remain close to Gaussian despite the non-linear state dynamics. As our focus is on inference in ssms for which existing local enkf approaches perform poorly, we also considered an alternative model configuration in which a non-linear function of the model state is noisily observed.

In both the linear and non-linear cases the system is assumed to be observed at $L$ equispaced mesh nodes with $s^{\mathrm{o}}_l = s_{(M/L)(l-1)+1}$ for all $l \in 1{:}L$. For the linear case the observation model is assumed to be equivalent to that assumed for the st model,

$$y_{t,l} = z_t(s^{\mathrm{o}}_l) + v_{t,l} = x_{t,(M/L)(l-1)+1} + v_{t,l}, \quad v_{t,l} \sim \mathcal{N}(0, \varsigma^2) \quad \forall t \in \mathcal{T},\ l \in 1{:}L. \tag{F.28}$$

The non-linear case is directly analogous, other than the state values being observed via a hyperbolic tangent non-linearity:

$$y_{t,l} = \tanh(x_{t,(M/L)(l-1)+1}) + v_{t,l}, \quad v_{t,l} \sim \mathcal{N}(0, \varsigma^2) \quad \forall t \in \mathcal{T},\ l \in 1{:}L. \tag{F.29}$$

Although a seemingly minor change to the model, as illustrated in the experimental results, introducing this non-linearity was sufficient to significantly degrade the filtering performance of the local etkf.

APPENDIX G: FULL GRID SEARCH RESULTS FOR LOCAL ETKF
Fig G.1: Values of the three rmse metrics (for the mean, standard deviation and smoothness coefficient estimates, plotted against localisation radius r) for the local etkf on the four ssms considered in the experiments: (a) linear-Gaussian st ssm; (b) transformed st ssm; (c) linearly observed ks ssm; (d) non-linearly observed ks ssm. In all cases the curve shows the median value across five independent runs and the filled region the minimum to maximum range.
APPENDIX H: DETAILS OF MCMC RUNS FOR KS MODELS
A non-centred parametrisation was used for the Hamiltonian Monte Carlo chains for the two KS SSMs (Papaspiliopoulos, Roberts and Sköld, 2007), with the target smoothing distribution formulated in terms of the $MTS$-dimensional set of state noise variables $u_{1:T}$, which are independently and identically distributed standard normal variables under the prior, with the observation sequence $y_{1:T}$ then having a Gaussian conditional distribution given $u_{1:T}$. The step size for the integrator of the Hamiltonian dynamics was manually tuned once for each SSM using short pilot chains with a fixed number of integrator steps, so as to achieve an average acceptance probability in the range $[0.\,, .9]$ (Betancourt, Byrne and Girolami, 2014).
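Under this non-centred parametrisation the unnormalised log density targeted by the chains decomposes into a standard normal log prior on the noise variables plus a Gaussian log likelihood of the observations given the states generated from them. A sketch, in which `generate_states`, `observe` and `obs_std` are hypothetical stand-ins for the model's deterministic state generation map, observation operator and observation noise scale:

```python
import numpy as np

def log_posterior(u, y, generate_states, observe, obs_std):
    """Unnormalised log density of the non-centred target: u has an
    iid standard normal prior, and y given u is Gaussian with mean
    observe(generate_states(u)) and standard deviation obs_std."""
    log_prior = -0.5 * np.sum(u ** 2)
    residual = y - observe(generate_states(u))
    log_lik = -0.5 * np.sum(residual ** 2) / obs_std ** 2
    return log_prior + log_lik
```

The gradients of this density required for HMC follow by differentiating through the deterministic integrator, e.g. with automatic differentiation.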
In both SSMs the same manually tuned step size was found to give an acceptance rate in the target range. The integrator used was a variant of the standard leapfrog / Störmer–Verlet integrator, which uses an alternative splitting of the Hamiltonian to leverage an exact analytic solution for the Hamiltonian dynamics under the quadratic potential energy component arising from the Gaussian prior (Shahbaba et al., 2014). The number of integrator steps used to generate the Hamiltonian dynamics trajectory in each chain transition was set dynamically on each iteration using a variant of the No-U-Turn sampler scheme (Hoffman and Gelman, 2014; Betancourt, 2017), with the chains for both SSMs performing approximately 2 × steps per transition on average. For each SSM the total wall-clock time to run the five chains in parallel on an Intel Xeon E5-2620 v4 8-core CPU was around one week.

All chains were initialised from the true state noise sequence $u_{1:T}$ used to generate the observations, which corresponds to a single exact sample from the target distribution $P(u_{1:T} \in \mathrm{d}u \mid y_{1:T} = y_{1:T})$, as the $(u_{1:T}, y_{1:T})$ pair was originally generated from the corresponding joint distribution $P(u_{1:T} \in \mathrm{d}u, y_{1:T} \in \mathrm{d}y)$. Although it would typically be preferable, for the robustness of convergence diagnostics based on comparisons between chains, to initialise each chain independently from a distribution over-dispersed relative to the target, such as the prior, here we found the step size required to robustly achieve an average acceptance probability in the range $[0.\,, .9]$
for chains initialised from the prior to be much smaller than for chains initialised from the ‘true’ noise sequence $u_{1:T}$, likely due to the differing geometry of the target distribution in the tails (where initialisations from the prior are likely to fall) and in the typical set, which $u_{1:T}$, as an exact sample from the target, should lie within. Given the long chain run times even when using the larger step size, a pragmatic choice was therefore made to use a common initialisation. This initialisation scheme and the relatively small number of samples in each chain mean there is a risk that the chains only explored a subset of the target distributions' typical sets. As partial evidence against this being the case, visual checks of the estimates of the first and second moments of a subset of the filtering distributions $\pi_{1:T}$, using the final 100 samples from each of the chains, suggest that the estimates from the different chains are consistent with each other (see examples in Figs. H.1 and H.2).

Fig H.1: Comparison of estimates of first and second moments of the filtering distributions at times $\tau = 250$ and $\tau = 500$ for the linearly observed KS SSM, using the final 100 samples from each of 5 chains (curves show the estimated mean and the filled regions the mean $\pm$ two standard deviations).
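The visual cross-chain checks behind Figs. H.1 and H.2 amount to comparing per-chain estimates of the filtering moments. A minimal numerical sketch, assuming (hypothetically) that the retained samples are stacked into an array of shape (chains, samples, mesh nodes); the scalar spread summary is an illustrative addition, as the checks reported here were purely visual:

```python
import numpy as np

def chain_moment_estimates(chain_samples):
    """Per-chain estimates of the filtering mean and standard deviation
    from an array of shape (C, N, M): C chains, N retained samples per
    chain, M mesh nodes."""
    return chain_samples.mean(axis=1), chain_samples.std(axis=1, ddof=1)

def max_between_chain_spread(chain_samples):
    """Largest disagreement across chains in the estimated mean at any
    mesh node: a crude scalar summary of between-chain consistency."""
    means, _ = chain_moment_estimates(chain_samples)
    return np.max(means.max(axis=0) - means.min(axis=0))
```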
Fig H.2: Comparison of estimates of first and second moments of the filtering distributions at times $\tau = 250$ and $\tau = 500$ for the non-linearly observed KS SSM, using the final 100 samples from each of 5 chains (curves show the estimated mean and the filled regions the mean $\pm$ two standard deviations).

REFERENCES
Acevedo, W., de Wiljes, J. and Reich, S. (2017). Second-order accurate ensemble transform particle filters. SIAM Journal on Scientific Computing, A1834–A1850.
Altschuler, J., Weed, J. and Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30.
Anderson, J. L. (2001). An ensemble adjustment Kalman filter for data assimilation. Monthly Weather Review.
Bauer, P., Thorpe, A. and Brunet, G. (2015). The quiet revolution of numerical weather prediction. Nature.
Bengtsson, T., Bickel, P. and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman.
Bertoli, F. and Bishop, A. N. (2014). Adaptively blocked particle filtering with spatial smoothing in large-scale dynamic random fields. arXiv:1406.0136.
Beskos, A., Crisan, D., Jasra, A., Kamatani, K. and Zhou, Y. (2017). A stable particle filter for a class of high-dimensional state-space models. Advances in Applied Probability.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
Betancourt, M., Byrne, S. and Girolami, M. (2014). Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv:1411.6669.
Bishop, A. N. and Del Moral, P. (2018). On the stability of matrix-valued Riccati diffusions. arXiv:1808.00235.
Bishop, C. H., Etherton, B. J. and Majumdar, S. J. (2001). Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects. Monthly Weather Review.
Bolic, M., Djuric, P. M. and Hong, S. (2005). Resampling algorithms and architectures for distributed particle filters. IEEE Transactions on Signal Processing.
Bonavita, M., Torrisi, L. and Marcucci, F. (2008). The ensemble Kalman filter in an operational regional NWP system: Preliminary results with real observations. Quarterly Journal of the Royal Meteorological Society.
Bowler, N. E., Arribas, A., Beare, S. E., Mylne, K. R. and Shutts, G. J. (2009). The local ETKF and SKEB: Upgrades to the MOGREPS short-range ensemble prediction system. Quarterly Journal of the Royal Meteorological Society.
Buizza, R., Houtekamer, P., Pellerin, G., Toth, Z., Zhu, Y. and Wei, M. (2005). A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Monthly Weather Review.
Burgers, G., van Leeuwen, P. J. and Evensen, G. (1998). Analysis scheme in the ensemble Kalman filter. Monthly Weather Review.
Cheng, Y. and Reich, S. (2015). Assimilating data into scientific models: An optimal coupling perspective. In Nonlinear Data Assimilation.
Clayton, A. M., Lorenc, A. C. and Barker, D. M. (2013). Operational implementation of a hybrid ensemble/4D-Var global data assimilation system at the Met Office. Quarterly Journal of the Royal Meteorological Society.
Cox, S. M. and Matthews, P. C. (2002). Exponential time differencing for stiff systems. Journal of Computational Physics.
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26.
Del Moral, P. (1996). Non-linear filtering: interacting particle resolution. Markov Processes and Related Fields.
Del Moral, P. and Tugaut, J. (2018). On the stability and the uniform propagation of chaos properties of ensemble Kalman–Bucy filters. The Annals of Applied Probability.
Douc, R. and Cappé, O. (2005). Comparison of resampling schemes for particle filtering. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 2005.
Duane, S., Kennedy, A. D., Pendleton, B. J. and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B.
Evensen, G. (1994). Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research: Oceans.
Evensen, G. (2009). Data Assimilation: The Ensemble Kalman Filter, 2nd ed. Springer.
Farchi, A. and Bocquet, M. (2018). Comparison of local particle filters and new implementations. Nonlinear Processes in Geophysics Discussions.
Fearnhead, P. and Künsch, H. (2018). Particle filters and data assimilation. Annual Review of Statistics and Its Application.
Frei, M. and Künsch, H. R. (2013). Bridging the ensemble Kalman and particle filters. Biometrika.
Furrer, R. and Bengtsson, T. (2007). Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. Journal of Multivariate Analysis.
Gaspari, G. and Cohn, S. E. (1999). Construction of correlation functions in two and three dimensions. Quarterly Journal of the Royal Meteorological Society.
Gerber, M., Chopin, N. and Whiteley, N. (2019). Negative association, ordering and convergence of resampling methods. The Annals of Statistics.
Gordon, N. J., Salmond, D. J. and Smith, A. F. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing).
Hamill, T. M., Whitaker, J. S. and Snyder, C. (2001). Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter. Monthly Weather Review.
Herbst, E. and Schorfheide, F. (2019). Tempered particle filtering. Journal of Econometrics.
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research.
Hol, J. D., Schön, T. B. and Gustafsson, F. (2006). On resampling algorithms for particle filters. In Nonlinear Statistical Signal Processing Workshop, 2006 IEEE.
Houtekamer, P. L. and Mitchell, H. L. (1998). Data assimilation using an ensemble Kalman filter technique. Monthly Weather Review.
Hunt, B. R., Kostelich, E. J. and Szunyogh, I. (2007). Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D: Nonlinear Phenomena.
Hyman, J. M. and Nicolaenko, B. (1986). The Kuramoto–Sivashinsky equation: a bridge between PDEs and dynamical systems. Physica D: Nonlinear Phenomena.
Johansen, A. M. (2015). On blocks, tempering and particle MCMC for systems identification. IFAC-PapersOnLine.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering.
Kelly, D. T., Law, K. and Stuart, A. M. (2014). Well-posedness and accuracy of the ensemble Kalman filter in discrete and continuous time. Nonlinearity.
Kuramoto, Y. and Tsuzuki, T. (1976). Persistent propagation of concentration waves in dissipative media far from thermal equilibrium. Progress of Theoretical Physics.
Le Gland, F., Monbet, V. and Tran, V.-D. (2011). Large sample asymptotics for the ensemble Kalman filter. In The Oxford Handbook of Nonlinear Filtering (D. Crisan and B. Rozovskii, eds.) 598–631. Oxford University Press.
Lee, A. and Whiteley, N. (2015). Forest resampling for distributed sequential Monte Carlo. Statistical Analysis and Data Mining: The ASA Data Science Journal.
Lei, J., Bickel, P. and Snyder, C. (2010). Comparison of ensemble Kalman filters under non-Gaussianity. Monthly Weather Review.
Majda, A. J. and Harlim, J. (2012). Filtering Complex Turbulent Systems. Cambridge University Press.
Morzfeld, M., Hodyss, D. and Snyder, C. (2017). What the collapse of the ensemble Kalman filter tells us about particle filters. Tellus A: Dynamic Meteorology and Oceanography.
Orlin, J. B. (1997). A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming.
Papaspiliopoulos, O., Roberts, G. O. and Sköld, M. (2007). A general framework for the parametrization of hierarchical models. Statistical Science.
Penny, S. G. and Miyoshi, T. (2015). A local particle filter for high dimensional geophysical systems. Nonlinear Processes in Geophysics Discussions.
Peyré, G. and Cuturi, M. (2019). Computational Optimal Transport. Now Publishers.
Rebeschini, P. and van Handel, R. (2015). Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability.
Reich, S. (2013). A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing, A2013–A2024.
Sen, D. and Thiery, A. H. (2019). Particle filter efficiency under limited communication. arXiv:1904.09623.
Shahbaba, B., Lan, S., Johnson, W. O. and Neal, R. M. (2014). Split Hamiltonian Monte Carlo. Statistics and Computing.
Sinkhorn, R. and Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics.
Sivashinsky, G. (1977). Nonlinear analysis of hydrodynamic instability in laminar flames—I. Derivation of basic equations. Acta Astronautica.
Snyder, C. (2011). Particle filters, the ‘optimal’ proposal and high-dimensional systems. In Proceedings of the ECMWF Seminar on Data Assimilation for Atmosphere and Ocean.
Snyder, C., Bengtsson, T. and Morzfeld, M. (2015). Performance bounds for particle filters using the optimal proposal. Monthly Weather Review.
Snyder, C., Bengtsson, T., Bickel, P. and Anderson, J. (2008). Obstacles to high-dimensional particle filtering. Monthly Weather Review.
Svensson, A., Schön, T. B. and Lindsten, F. (2018). Learning of state-space models with highly informative observations: A tempered sequential Monte Carlo solution. Mechanical Systems and Signal Processing.
Tong, X. T., Majda, A. J. and Kelly, D. (2016). Nonlinear stability and ergodicity of ensemble based Kalman filters. Nonlinearity.
Van Leeuwen, P. J. (2009). Particle filtering in geophysical systems. Monthly Weather Review.
Vergé, C., Dubarry, C., Del Moral, P. and Moulines, E. (2015). On parallel implementation of sequential Monte Carlo methods: the island particle model. Statistics and Computing.
Whitaker, J. S. and Hamill, T. M. (2002). Ensemble data assimilation without perturbed observations. Monthly Weather Review.
Whiteley, N., Lee, A. and Heine, K. (2016). On the role of interaction in sequential Monte Carlo algorithms. Bernoulli 22.