A scalable optimal-transport based local particle filter
Matthew M. Graham* and Alexandre H. Thiery**
Department of Statistics and Applied Probability, National University of Singapore
Abstract.
Filtering in spatially-extended dynamical systems is a challenging problem with significant practical applications such as numerical weather prediction. Particle filters allow asymptotically consistent inference but require infeasibly large ensemble sizes for accurate estimates in complex spatial models. Localisation approaches, which perform local state updates by exploiting low dependence between variables at distant points, have been suggested as a potential resolution to this issue. Naively applying the resampling step of the particle filter locally however produces implausible spatially discontinuous states. The ensemble transform particle filter replaces resampling with an optimal-transport map and can be localised by computing maps for every spatial mesh node. The resulting local ensemble transform particle filter is however computationally intensive for dense meshes. We propose a new optimal-transport based local particle filter which computes a fixed number of maps independent of the mesh resolution and interpolates these maps across space, reducing the computation required and making it possible to ensure particles remain spatially smooth. We numerically illustrate that, at a reduced computational cost, we are able to achieve the same accuracy as the local ensemble transform particle filter, and retain its improved robustness to non-Gaussianity and ability to quantify uncertainty when compared to local ensemble Kalman filters.
MSC 2010 subject classifications:
Primary 65C35; secondary 86A22.
Key words and phrases: particle filtering, Bayesian filtering, spatial models, inverse problems, localisation, optimal transport.
1. INTRODUCTION
A natural paradigm for modelling geophysical systems such as the atmosphere is as spatially-extended dynamical systems: one or more state variables defined over a spatial domain are evolved through time according to a set of stochastic partial differential equations (SPDEs). In this article we consider the problem of inferring the distribution of the unknown state of such a system given noisy observations at a sequence of time points. As well as being an important problem in its own right, state inference is also a vital sub-component of tasks such as forecasting the future state of a system and inferring values for any free parameters in the numerical model used (Fearnhead and Künsch, 2018). (e-mail: * [email protected]; ** [email protected])
A key issue in performing state inference in spatially-extended systems is the typically high dimension of the state space. To allow numerical simulation of the SPDE model the spatial domain is discretised into a mesh (also known as a grid); the system state can then be represented as a finite-dimensional vector consisting of the concatenated values of the state variables at the nodes of the mesh. The resulting state dimension is therefore a multiple of the number of mesh nodes, which can be very large. For example, in the global atmospheric models used in current operational numerical weather prediction (NWP) systems the mesh size can be of the order of $10^8$ or higher (Bauer, Thorpe and Brunet, 2015).

For large state dimensions, even inference in linear-Gaussian models using the Kalman filter (KF) (Kalman, 1960) is computationally infeasible due to the high processing and memory costs of operations involving the full covariance matrix of the state distribution. This motivated the development of ensemble Kalman filter (EnKF) methods (Evensen, 1994; Burgers, van Leeuwen and Evensen, 1998) which use an ensemble of particles to represent the state distribution rather than the full mean and covariance statistics. As the ensemble sizes used are typically much smaller than the state dimension the computational savings can be considerable. Although EnKF methods are only consistent in an infinite ensemble limit for linear-Gaussian models (Furrer and Bengtsson, 2007; Le Gland, Monbet and Tran, 2011), they have been empirically found to perform well in models with weakly non-linear state update and observation operators, even when using relatively small ensembles of size much less than the state dimension (Evensen, 2009); the performance of the EnKF in non-asymptotic regimes has been theoretically investigated in several recent works (Kelly, Law and Stuart, 2014; Del Moral and Tugaut, 2018; Bishop and Del Moral, 2018; Tong, Majda and Kelly, 2016).
A key aspect in allowing EnKF methods to be scaled to large spatially-extended geophysical models is the use of spatial localisation (Houtekamer and Mitchell, 1998; Hamill, Whitaker and Snyder, 2001). Localisation exploits the observation that there is often low statistical dependence between state variables at distant points in spatially-extended systems. In EnKF methods this property is used to improve the noisy covariance estimates resulting from the small ensemble sizes used by removing spurious correlations between distant state variables.

EnKF methods have been successfully applied in a variety of settings, including operational NWP systems (Bonavita, Torrisi and Marcucci, 2008; Clayton, Lorenc and Barker, 2013); however the quality of the state distribution estimates is fundamentally limited by the linear-Gaussian assumptions made by the underlying KF updates. For models with non-Gaussian noise processes or strongly non-linear state update or observation operators, EnKF methods tend to produce poor estimates of the state distribution (Lei, Bickel and Snyder, 2010).

Particle filters (PFs) (Gordon, Salmond and Smith, 1993; Del Moral, 1996) offer an alternative ensemble-based approach to sequential state inference that, unlike EnKF methods, provides consistent estimates for non-Gaussian distributions. The simplest variant, the bootstrap PF, alternates propagating the ensemble members forward in time under the model dynamics with resampling according to weights calculated from the likelihood of the particles given the observed data.

(Throughout this article we will for brevity refer to dynamical models with linear state update and observation operators and additive Gaussian noise processes as linear-Gaussian.)

(Current operational NWP ensemble systems are limited to ∼50 particles due to the high computational cost of numerically integrating the particles forward in time (Buizza et al., 2005).)

Fig 1: Examples of local PF assimilation updates applied to a Gaussian process model. The smooth true state field is shown in panel (a) and corresponding noisy observations in (b). Panel (c) shows prior samples and (d)-(f) approximate posterior samples after applying different local PF assimilation updates: (d) block PF, (e) local ETPF, (f) the method proposed in this article. In each of (c)-(f) 2 out of 40 samples are shown.

While PFs offer asymptotically consistent inference for general state space models, in practice they typically suffer from weight degeneracy in high-dimensional systems: after propagation only a single particle has non-negligible weight. For even simple linear-Gaussian models, PFs have been shown to require an ensemble size which scales exponentially with the number of observations to avoid degeneracy (Snyder et al., 2008; Bengtsson, Bickel and Li, 2008; Snyder, 2011).

Given the importance of localisation in scaling EnKF methods to large spatial systems, it is natural to consider whether PF methods can be localised to overcome weight-degeneracy issues (Snyder et al., 2008; Van Leeuwen, 2009). Rebeschini and van Handel (2015) analysed a simple local PF scheme in which the spatial domain is partitioned into disjoint blocks and independent PFs run for each block, with local particle weights computed from the observations within each block. The authors demonstrate this block PF algorithm can overcome the need to exponentially scale the ensemble size with dimension to prevent degeneracy.
However, as the variables in each block are resampled independently from those in other blocks, dependencies between blocks are ignored; this introduces a systematic bias that is difficult to control (Bertoli and Bishop, 2014).

This issue is illustrated for a two-dimensional Gaussian process model in Fig. 1. The smooth true state field, shown in Fig. 1a, is partially and noisily observed (Fig. 1b). While the samples in the prior ensemble (Fig. 1c) reflect the smoothness of the true state field, the posterior samples shown in Fig. 1d, computed using a block PF assimilation update, show spatial discontinuities at the block boundaries. Such discontinuities can cause numerical instabilities in the computation of spatial derivatives when integrating the SPDE model to forward propagate the particles.

The ensemble transform particle filter (ETPF) (Reich, 2013) uses an optimal transport (OT) map to linearly transform an ensemble instead of resampling. The ETPF can be localised by computing OT maps for each mesh node using local particle weights (Cheng and Reich, 2015); updating the particles using the resulting spatially varying maps significantly reduces the introduction of spatial discontinuities compared to independent resampling. This can be seen in the samples computed using the local ETPF shown in Fig. 1e, which show greater spatial regularity than the block PF samples in Fig. 1d, though they remain less smooth than the true state field.

The requirement in the local ETPF to solve an OT problem at every node can be computationally burdensome when the mesh size is large. Solving each OT problem has complexity $\tilde{O}(P^3)$ where $P$ is the ensemble size ($\tilde{O}$ indicates limiting complexity excluding polylogarithmic factors); although solvers can be run in parallel this still represents a large computational overhead.

In this article we propose an alternative smooth and computationally scalable local ETPF scheme.
A finite set of patches which cover the spatial domain is defined, each with a non-negative bump function supported on the patch. The set of bump functions is constrained to be a partition of unity (POU): the functions sum to unity at all points in the spatial domain. A single OT map is calculated for each spatial patch. The POU is then used to interpolate these local per-patch maps across the spatial domain, defining maps for all nodes in the spatial mesh. Through an appropriate choice of bump functions this scheme can maintain a prescribed level of smoothness in the transformed state fields while also significantly reducing the number of OT problems needing to be solved. Example posterior samples computed using the proposed scheme are shown in Fig. 1f. Here the POU is a set of smooth bump functions tiled in an 8 × 8 grid; compared to the local ETPF, in this example the number of OT problems solved was reduced from 16 384 to 64.

The remainder of the article is structured as follows. In Section 2 we briefly introduce our notation and some preliminaries on the filtering problem and ensemble methods, followed by a review of SPDE models and existing local filtering approaches in Section 3. The new method we propose is described in Section 4 and a numerical study comparing the approach to existing local ensemble filters is presented in Section 5, with a concluding discussion in Section 6.
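As an illustrative sketch of the interpolation idea (not the authors' implementation), the snippet below blends per-patch transform coefficient matrices across a one-dimensional periodic mesh using smooth bump weights normalised to form a partition of unity. The patch coefficient matrices are random row-stochastic placeholders standing in for per-patch OT maps; all sizes and the bump shape are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, J, P = 64, 4, 8                                     # mesh nodes, patches, particles
nodes = np.linspace(0.0, 1.0, M, endpoint=False)       # node coordinates on [0, 1)
centres = np.linspace(0.0, 1.0, J, endpoint=False)     # patch centres

# Smooth periodic bumps, normalised so they sum to one at every node (a POU).
raw = np.exp(4.0 * np.cos(2 * np.pi * (nodes[:, None] - centres[None, :])))
pou = raw / raw.sum(axis=1, keepdims=True)             # shape (M, J), rows sum to 1

# Placeholder per-patch P x P coefficient matrices with rows summing to one,
# standing in for the per-patch OT maps computed by the proposed scheme.
a_patch = rng.dirichlet(np.ones(P), size=(J, P))       # shape (J, P, P)
x_pred = rng.standard_normal((P, M))                   # predictive ensemble fields

# Interpolate the patch maps across the mesh and apply them node by node.
x_filt = np.empty_like(x_pred)
for m in range(M):
    a_m = np.tensordot(pou[m], a_patch, axes=1)        # POU-weighted blend, P x P
    x_filt[:, m] = a_m @ x_pred[:, m]
```

Because each blended matrix is a convex combination of row-stochastic matrices it is itself row-stochastic, and the smoothness of the node-to-node variation in the applied map is inherited from the smoothness of the bump functions.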
2. ENSEMBLE APPROACHES TO FILTERING

2.1 Notation
Random variables are denoted by sans-serif symbols, e.g. $\mathsf{x}$, and $\mathsf{x} \sim \mu$ indicates $\mathsf{x}$ has distribution $\mu$. The probability of an event $\mathsf{x}$ taking a value in a set $A$ is $\mathbb{P}(\mathsf{x} \in A)$ and the expected value of $\mathsf{x}$ is $\mathbb{E}[\mathsf{x}]$. The conditional probability of $\mathsf{x} \in A$ given $\mathsf{y} = y$ is denoted $\mathbb{P}(\mathsf{x} \in A \,|\, \mathsf{y} = y)$ and likewise the conditional expectation of $\mathsf{x}$ given $\mathsf{y} = y$ is $\mathbb{E}[\mathsf{x} \,|\, \mathsf{y} = y]$. A Gaussian distribution with mean $m$ and covariance $C$ is denoted $\mathcal{N}(m, C)$. The set of integers from $A$ to $B$ inclusive is $A{:}B$ and quantities sub- or superscripted by an integer range indicate an indexed set, e.g. $\phi_{1:M} = \{\phi_m\}_{m \in 1:M}$. The length-$D$ vector of ones is $\mathbf{1}_D$ and the $D \times D$ identity matrix $\mathbf{I}_D$, with the subscript omitted when unambiguous. The indicator function on a set $S$ is $\mathbb{1}_S$. The set of real numbers is $\mathbb{R}$, non-negative reals $\mathbb{R}_{\geq 0}$ and complex numbers $\mathbb{C}$. For $z \in \mathbb{C}$, $\Re(z)$ and $\Im(z)$ indicate its real and imaginary parts.

2.2 State space models

The class of models we aim to perform inference in is state-space models (SSMs). Let $\mathcal{X}$ be a vector space representing the state space of the system of interest. We assume observations of the system are available at a set of $T$ times, with the observations at each discrete time index $t \in 1{:}T$ belonging to a common vector space $\mathcal{Y}$. We denote the unknown system state at each time index as a random variable $\mathsf{x}_t \in \mathcal{X}$ and the corresponding observations as a random variable $\mathsf{y}_t \in \mathcal{Y}$. The modelled state dynamics are assumed to be Markovian and specified by a set of state-update operators $F_{1:T}$ such that

$$\mathsf{x}_1 = F_1(\mathsf{u}_1), \quad \mathsf{u}_1 \sim \mu_1; \qquad \mathsf{x}_t = F_t(\mathsf{x}_{t-1}, \mathsf{u}_t), \quad \mathsf{u}_t \sim \mu_t \quad \forall t \in 2{:}T, \tag{2.1}$$

with each $\mathsf{u}_t \in \mathcal{U}$ a state noise variable drawn from a distribution $\mu_t$, representing the stochasticity in the state initialisation and dynamics at each time step. The observations $\mathsf{y}_t$ at each time index $t$ are assumed to depend only on the current state $\mathsf{x}_t$ and are generated via a set of observation operators $G_{1:T}$,

$$\mathsf{y}_t = G_t(\mathsf{x}_t, \mathsf{v}_t), \quad \mathsf{v}_t \sim \nu_t \quad \forall t \in 1{:}T. \tag{2.2}$$

Any stochasticity in the observation process at each time index is introduced by the observation noise variable $\mathsf{v}_t \in \mathcal{V}$ with distribution $\nu_t$. In SSMs where the operators $F_{1:T}$ and $G_{1:T}$ are all linear and the distributions $\mu_{1:T}$ and $\nu_{1:T}$ are all Gaussian (the aforementioned linear-Gaussian case) the joint distribution on all states $\mathsf{x}_{1:T}$ and observations $\mathsf{y}_{1:T}$ is Gaussian and a KF can be used to perform exact inference. In this article we will focus on approximate inference methods for SSMs outside this class where exact inference is intractable.

We require that the conditional distributions on $\mathsf{y}_t$ given $\mathsf{x}_t$ have known densities $g_{1:T}$ with respect to a common dominating measure $\upsilon$ on $\mathcal{Y}$, i.e.

$$\mathbb{P}(\mathsf{y}_t \in \mathrm{d}y \,|\, \mathsf{x}_t = x_t) = g_t(y \,|\, x_t)\, \upsilon(\mathrm{d}y) \quad \forall t \in 1{:}T. \tag{2.3}$$

For the state updates we assume only that the state-update operators $F_t$ can be computed for any set of inputs and that we can generate samples from the state noise distributions $\mu_t$; the resulting state transition distributions will not necessarily have tractable densities.

Our main objects of interest from an inference perspective are the filtering distributions: the conditional distributions on the state at time index $t \in 1{:}T$ given the observations at time indices up to and including $t$. We will denote the filtering distribution at each time index $t$ as

$$\pi_t(\mathrm{d}x) = \mathbb{P}(\mathsf{x}_t \in \mathrm{d}x \,|\, \mathsf{y}_{1:t} = y_{1:t}). \tag{2.4}$$

The filtering problem is then the task of inferring the filtering distributions $\pi_{1:T}$ given a SSM for the system and a sequence of observations $y_{1:T}$.

A further concept that will be important for our discussion of inference methods is the predictive distribution on the state at the next time index $t+1$ given the observations up to the current time index $t$. We will denote the predictive distribution at time index $t$ as

$$\tilde{\pi}_{t+1}(\mathrm{d}x) = \mathbb{P}(\mathsf{x}_{t+1} \in \mathrm{d}x \,|\, \mathsf{y}_{1:t} = y_{1:t}). \tag{2.5}$$
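For concreteness, the generative structure of Eqs. (2.1) and (2.2) can be sketched for a hypothetical linear-Gaussian SSM; the operators, dimensions and noise scales below are illustrative assumptions, not part of the models studied in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_y, T = 4, 2, 10
A = 0.9 * np.eye(dim_x)            # assumed linear state-update operator
H = np.eye(dim_y, dim_x)           # assumed linear observation operator

def F(x, u):
    return A @ x + u               # Markovian state update, Eq. (2.1)

def G(x, v):
    return H @ x + v               # observation operator, Eq. (2.2)

x = rng.standard_normal(dim_x)     # initial state draw, x_1 = F_1(u_1) with F_1 = id
states, observations = [], []
for t in range(T):
    u = 0.1 * rng.standard_normal(dim_x)   # u_t ~ mu_t (state noise)
    v = 0.1 * rng.standard_normal(dim_y)   # v_t ~ nu_t (observation noise)
    x = F(x, u)
    states.append(x)
    observations.append(G(x, v))
```

The filtering problem is then to recover the distribution of each `states[t]` from `observations[:t+1]` alone.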
A key property for filtering algorithms is that the filtering distribution at any time index can be expressed recursively in terms of the distributions at the previous time indices. Generally this recursion is split into two steps, here termed the prediction and assimilation updates.

The prediction update transforms the filtering distribution $\pi_t$ to the predictive distribution $\tilde{\pi}_{t+1}$. This update corresponds to propagating the state distribution forward in time according to the modelled dynamics, with no new observations introduced. Denoting the Dirac measure at a point $x \in \mathcal{X}$ by $\delta_x$, the prediction update can be expressed as

$$\tilde{\pi}_{t+1}(\mathrm{d}x) = \int_{\mathcal{U}} \int_{\mathcal{X}} \delta_{F_{t+1}(x', u)}(\mathrm{d}x)\, \pi_t(\mathrm{d}x')\, \mu_{t+1}(\mathrm{d}u). \tag{2.6}$$

The assimilation update then relates the predictive distribution $\tilde{\pi}_{t+1}$ to the filtering distribution at the next time step $\pi_{t+1}$. It corresponds to an application of Bayes' theorem, with the predictive distribution forming the prior and the filtering distribution at the next time index the posterior after a new observed data point has been assimilated. The observation density $g_{t+1}$ defines the likelihood term, with the assimilation update then

$$\pi_{t+1}(\mathrm{d}x) = \frac{g_{t+1}(y_{t+1} \,|\, x)}{\int_{\mathcal{X}} g_{t+1}(y_{t+1} \,|\, x')\, \tilde{\pi}_{t+1}(\mathrm{d}x')}\, \tilde{\pi}_{t+1}(\mathrm{d}x). \tag{2.7}$$

The combination of prediction and assimilation updates together defines a map from the filtering distribution at time index $t$ to the distribution at $t+1$:

$$\cdots \longrightarrow \pi_t \xrightarrow{\text{prediction}} \tilde{\pi}_{t+1} \xrightarrow{\text{assimilation}} \pi_{t+1} \longrightarrow \cdots;$$

sequentially alternating prediction and assimilation updates is in theory therefore all that is needed to compute the filtering distributions at all time indices. In practice however, for most SSMs the integrals in Eqs. (2.6) and (2.7) will be intractable to solve exactly, necessitating some form of approximation.

A particularly common approximation is to use an ensemble of state particles to represent the filtering distribution at each time index.
Specifically, the filtering distribution $\pi_t$ at time index $t$ is represented by an empirical measure defined by placing point masses at the values of a set of $P$ state particles $\mathsf{x}_t^{1:P}$,

$$\pi_t(\mathrm{d}x) \approx \frac{1}{P} \sum_{p=1}^{P} \delta_{\mathsf{x}_t^p}(\mathrm{d}x). \tag{2.8}$$

A key advantage of using an ensemble representation of the filtering distribution is that a simple algorithm can be used to implement a prediction update consistent with Eq. (2.6). Specifically, if a set of $P$ independent state noise samples $\mathsf{u}_{t+1}^{1:P}$ are generated from $\mu_{t+1}$, then given particles $\mathsf{x}_t^{1:P}$ approximating $\pi_t$, a new set of $P$ particles can be computed as

$$\tilde{\mathsf{x}}_{t+1}^p = F_{t+1}(\mathsf{x}_t^p, \mathsf{u}_{t+1}^p) \quad \forall p \in 1{:}P. \tag{2.9}$$

This new particle ensemble can then be used to form an empirical measure approximation to the predictive distribution $\tilde{\pi}_{t+1}$,

$$\tilde{\pi}_{t+1}(\mathrm{d}x) \approx \frac{1}{P} \sum_{p=1}^{P} \delta_{\tilde{\mathsf{x}}_{t+1}^p}(\mathrm{d}x). \tag{2.10}$$

Although Eq. (2.9) specifies an approach for performing a prediction update, a method for approximating the assimilation update in Eq. (2.7) to account for the observed data is also required. One possibility is to require that the filtering ensemble $\mathsf{x}_t^{1:P}$ is formed as a linear combination of the predictive ensemble $\tilde{\mathsf{x}}_t^{1:P}$,

$$\mathsf{x}_t^p = \sum_{q=1}^{P} a_t^{p,q}\, \tilde{\mathsf{x}}_t^q, \tag{2.11}$$

where $a_t^{1:P,1:P} \in \mathbb{R}^{P \times P}$ are a set of coefficients describing the transformation. In general the coefficients may depend non-linearly on both the observation $y_t$ and predictive ensemble particles $\tilde{\mathsf{x}}_t^{1:P}$; however, the form of the update constrains the filtering ensemble $\mathsf{x}_t^{1:P}$ to lie in the linear subspace spanned by the predictive ensemble members. The class of ensemble filters using an assimilation update of the form in Eq. (2.11) was termed linear ensemble transform filters (LETFs) in Cheng and Reich (2015), and encompasses both ensemble Kalman and particle filtering methods, as will be discussed in the following subsections.
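The ensemble prediction update of Eq. (2.9) and the generic LETF transform of Eq. (2.11) reduce to a few array operations. In this minimal sketch the state-update operator and noise scale are hypothetical, and the transform coefficients are a uniform-averaging placeholder (a deliberately degenerate choice; real LETF methods compute the coefficients from the observation).

```python
import numpy as np

rng = np.random.default_rng(1)
P, dim_x = 40, 4

def F(x, u):                       # hypothetical state-update operator
    return 0.9 * x + u

# Prediction update, Eq. (2.9): propagate each particle with independent noise.
x_particles = rng.standard_normal((P, dim_x))       # ensemble approximating pi_t
u_samples = 0.1 * rng.standard_normal((P, dim_x))   # u_{t+1}^{1:P} ~ mu_{t+1}
x_pred = F(x_particles, u_samples)                  # ensemble for pi~_{t+1}, Eq. (2.10)

# LETF assimilation update, Eq. (2.11): a linear map of the predictive ensemble.
# Uniform coefficients collapse every particle onto the ensemble mean, which
# illustrates why the choice of a_t^{p,q} matters.
a = np.full((P, P), 1.0 / P)
x_filt = a @ x_pred
```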
In a linear-Gaussian SSM the predictive and filtering distributions are Gaussian at all time indices: $\pi_t = \mathcal{N}(m_t, C_t)$ and $\tilde{\pi}_t = \mathcal{N}(\tilde{m}_t, \tilde{C}_t)$ for all $t \in 1{:}T$, and so can be fully described by the mean and covariance parameters. The Kalman filter (KF) (Kalman, 1960) gives an efficient scheme for performing exact inference in linear-Gaussian SSMs by iteratively updating the mean and covariance parameters. For an observation operator and noise distribution

$$G_t(x, v) = H_t x + v, \quad \mathsf{v}_t \sim \mathcal{N}(0, R_t), \tag{2.12}$$

the KF assimilation update can be written

$$C_t = \tilde{C}_t - \tilde{C}_t H_t^\mathsf{T} (R_t + H_t \tilde{C}_t H_t^\mathsf{T})^{-1} H_t \tilde{C}_t, \tag{2.13a}$$

$$m_t = \tilde{m}_t + C_t H_t^\mathsf{T} R_t^{-1} (y_t - H_t \tilde{m}_t). \tag{2.13b}$$

Ensemble Kalman filter (EnKF) methods are a class of LETFs which use an assimilation update consistent with the KF updates in Eq. (2.13) for linear-Gaussian SSMs in the limit of an infinite ensemble, in effect replacing the predictive mean $\tilde{m}_t$ and covariance $\tilde{C}_t$ with ensemble estimates. The use of an ensemble representation rather than the full means and covariances used in the KF both gives a significant computational gain (by avoiding the need to store and perform operations on the full covariance matrices) while also allowing application of the approach to SSMs with non-linear state updates via the prediction update in Eq. (2.9).

The originally proposed EnKF method (Evensen, 1994; Burgers, van Leeuwen and Evensen, 1998) generates simulated observations from the observation model in Eq. (2.12) for each predictive ensemble member to form a Monte Carlo estimate of the $R_t + H_t \tilde{C}_t H_t^\mathsf{T}$ term in Eq. (2.13a). Although simple to implement, the introduction of artificial observation noise adds an additional source of variance which can be significant for small ensemble sizes.
This additional variance can be eliminated by the use of square-root EnKF variants (Anderson, 2001; Bishop, Etherton and Majumdar, 2001; Whitaker and Hamill, 2002) which typically give more stable and accurate filtering for small ensemble sizes.
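The KF assimilation update of Eq. (2.13) is short to implement directly; the sketch below uses the standard Kalman-gain form, with the gain $K = \tilde{C}_t H_t^\mathsf{T}(R_t + H_t \tilde{C}_t H_t^\mathsf{T})^{-1}$, which by a standard identity equals the $C_t H_t^\mathsf{T} R_t^{-1}$ factor appearing in Eq. (2.13b).

```python
import numpy as np

def kf_assimilation(m_pred, C_pred, H, R, y):
    """Kalman filter assimilation update of Eqs. (2.13a)-(2.13b)."""
    S = R + H @ C_pred @ H.T                 # innovation covariance
    K = C_pred @ H.T @ np.linalg.inv(S)      # Kalman gain (equals C_t H^T R^-1)
    C = C_pred - K @ H @ C_pred              # Eq. (2.13a)
    m = m_pred + K @ (y - H @ m_pred)        # Eq. (2.13b)
    return m, C

# With identity operators and unit covariances the update halves the prior
# covariance and moves the mean halfway towards the observation.
m, C = kf_assimilation(np.zeros(2), np.eye(2), np.eye(2), np.eye(2),
                       np.array([2.0, 0.0]))
```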
Of particular interest here is the ensemble transform Kalman filter (ETKF) proposed by Bishop, Etherton and Majumdar (2001), with this approach particularly efficient in the regime of interest where the ensemble size $P$ is much smaller than the state and observation dimensionalities. As we will use a localised variant of the ETKF as a baseline in the numerical experiments in Section 5, we outline the ETKF algorithm in Appendix A and show how it can be expressed in the form of the LETF assimilation update in Eq. (2.11).

Particle filtering offers an alternative LETF approach that gives consistent estimates of the filtering distributions as $P \to \infty$ in the non-Gaussian case. The PF assimilation update transforms the empirical approximation to the predictive distribution $\tilde{\pi}_t$ in Eq. (2.10) to an empirical approximation of the filtering distribution $\pi_t$ by attaching importance weights to the predictive ensemble,

$$\tilde{w}_t^p = g_t(y_t \,|\, \tilde{\mathsf{x}}_t^p), \quad w_t^p = \frac{\tilde{w}_t^p}{\sum_{q=1}^{P} \tilde{w}_t^q} \quad \forall p \in 1{:}P, \qquad \pi_t(\mathrm{d}x) \approx \sum_{p=1}^{P} w_t^p\, \delta_{\tilde{\mathsf{x}}_t^p}(\mathrm{d}x). \tag{2.14}$$

Directly iterating this importance weighting scheme, at each time index propagating the ensemble forward in time according to Eq. (2.9) and incrementally updating a set of (unnormalised) importance weights, gives an algorithm termed sequential importance sampling. While appealingly simple, sequential importance sampling requires an exponentially growing ensemble size as the number of observation times $T$ increases. The key additional step in particle filtering is to resample the particle ensemble according to the importance weights between prediction updates. That is, the filtering distribution ensemble at time index $t$ is defined in terms of the corresponding predictive distribution ensemble as

$$\mathsf{x}_t^p = \sum_{q=1}^{P} \mathsf{r}_t^{p,q}\, \tilde{\mathsf{x}}_t^q \quad \forall p \in 1{:}P, \tag{2.15}$$

where $\mathsf{r}_t^{1:P,1:P} \in \{0,1\}^{P \times P}$ are a set of binary random variables satisfying

$$\sum_{q=1}^{P} \mathsf{r}_t^{p,q} = 1, \qquad \mathbb{E}\bigg[\sum_{q=1}^{P} \mathsf{r}_t^{q,p} \,\Big|\, \mathsf{w}_t^p = w_t^p\bigg] = P w_t^p \quad \forall p \in 1{:}P. \tag{2.16}$$

This has the effect of removing particles with low weights from the ensemble and so ensures computational effort is concentrated on the most plausible particles. There are multiple algorithms available for generating random variables $\mathsf{r}_t^{1:P,1:P}$ satisfying Eq. (2.16); see for example the reviews in Douc and Cappé (2005), Hol, Schön and Gustafsson (2006) and Gerber, Chopin and Whiteley (2019). Distributed versions of particle filters have recently been proposed and analysed (Bolic, Djuric and Hong, 2005; Vergé et al., 2015; Whiteley, Lee and Heine, 2016; Sen and Thiery, 2019; Lee and Whiteley, 2015).

The iterated application of prediction updates according to Eq. (2.9) and resampling assimilation updates according to Eq. (2.15) together defines the bootstrap PF algorithm. Although simple, the bootstrap PF algorithm does not exploit all the information available at each time index; specifically, the prediction update in Eq. (2.9) does not take into account future observations. Alternative PF schemes can be employed which use prediction updates that take future observations into account. Although such schemes typically express the resulting particle weights in terms of the state transition densities, we describe in Appendix B how they can be implemented in SSMs with intractable transition densities.

While adjusting the prediction update can significantly improve performance compared to the bootstrap PF for a fixed ensemble size, when applied to systems with high state and observation dimensionalities these PF methods will still tend to suffer from weight degeneracy. In particular, even when using 'locally optimal' updates in a simple linear-Gaussian model, the resulting PF has been shown to generally require an ensemble size which grows exponentially with the dimension of the observation space to avoid weight degeneracy (Snyder et al., 2008; Snyder, Bengtsson and Morzfeld, 2015).
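One standard scheme for generating ancestor indices whose counts satisfy the unbiasedness condition of Eq. (2.16) is systematic resampling, which uses a single uniform draw with stratified offsets; this is one of the algorithms surveyed in the reviews cited above, sketched here in a minimal form.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Draw P ancestor indices; the counts of each index q are unbiased in the
    sense of Eq. (2.16), i.e. have expectation P * weights[q]."""
    P = len(weights)
    # One uniform draw shifted by stratified offsets gives P positions in [0, 1).
    positions = (rng.uniform() + np.arange(P)) / P
    return np.searchsorted(np.cumsum(weights), positions)

rng = np.random.default_rng(2)
w = np.array([0.1, 0.2, 0.3, 0.4])
idx = systematic_resample(w, rng)
# Setting r_t^{p,q} = 1{idx[p] == q} then gives binary variables with
# sum_q r^{p,q} = 1 for every p, as required by Eq. (2.16).
```

Relative to independent multinomial draws, systematic resampling also reduces the variance of the ancestor counts, which is why it is a common default in practice.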
Although typically the resampling variables in PF assimilation updates are generated independently of the predictive ensemble particle values given the weights, this is not required. Reich (2013) exploited this flexibility to propose an alternative particle filtering approach termed the ensemble transform particle filter (ETPF) which uses OT methods to compute a resampling scheme which minimises the expected distances between the particles before and after resampling.

A valid resampling scheme can be parametrised by a set of resampling probabilities $\rho_t^{1:P,1:P} \in [0,1]^{P \times P}$ with $\rho_t^{p,q} = \mathbb{P}(\mathsf{r}_t^{p,q} = 1 \,|\, \mathsf{w}_t^q = w_t^q)$ satisfying

$$\sum_{q=1}^{P} \rho_t^{p,q} = 1, \qquad \sum_{q=1}^{P} \rho_t^{q,p} = P w_t^p \quad \forall p \in 1{:}P. \tag{2.17}$$

A simple choice satisfying Eq. (2.17) is $\rho_t^{p,q} = w_t^q\ \forall p \in 1{:}P,\ q \in 1{:}P$, with this corresponding to the probabilities used in standard PF resampling schemes.

If we denote the set of resampling probabilities satisfying Eq. (2.17) for a given set of weights $w_t^{1:P}$ by $\mathcal{R}(w_t^{1:P})$ and the realisations of the predictive particles $\tilde{\mathsf{x}}_t^{1:P}$ at time index $t$ by $\tilde{x}_t^{1:P}$, Reich (2013) instead proposed to compute the resampling probabilities as the solution to the optimal transport problem

$$\rho_t^{1:P,1:P} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w_t^{1:P})} \sum_{p=1}^{P} \sum_{q=1}^{P} \varrho^{p,q}\, \|\tilde{x}_t^p - \tilde{x}_t^q\|^2. \tag{2.18}$$

The optimal transport problem can be posed as a linear program and efficiently solved using the network simplex algorithm (Orlin, 1997) with a computational complexity of order $\tilde{O}(P^3)$. While the resulting resampling probabilities could then be used to generate binary variables $\mathsf{r}_t^{1:P,1:P}$ and the standard PF resampling assimilation update in Eq. (2.15) applied, Reich (2013) instead proposes to use the resampling probabilities to directly update the particles as follows:

$$\mathsf{x}_t^p = \sum_{q=1}^{P} \rho_t^{p,q}\, \tilde{\mathsf{x}}_t^q \quad \forall p \in 1{:}P. \tag{2.19}$$

As $P \to \infty$ this assimilation update remains consistent as, due to properties of the optimal transport problem solution, the resampling probabilities tend to binary $\{0,1\}$ values (Reich, 2013, Theorem 1) and thus Eq. (2.19) becomes equivalent to updating using realisations of the binary random variables.

While the ETPF does not in itself help overcome the weight degeneracy issue, the deterministic and distance-minimising nature of the ETPF update naturally lends itself to spatial localisation approaches which can help overcome the poor scaling of PFs with dimensionality, as will be discussed in the following section.
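An ETPF-style update can be sketched without a linear-program solver by using an entropy-regularised (Sinkhorn) approximation to the exact OT problem of Eq. (2.18); this is our substitution for illustration only, as the ETPF itself solves the linear program exactly, e.g. with a network simplex solver. The regularisation strength and iteration count below are assumptions.

```python
import numpy as np

def etpf_sinkhorn_update(x_pred, w, eps=0.05, n_iter=100):
    """Approximate the OT problem of Eq. (2.18) by Sinkhorn iterations, then
    apply the deterministic transform of Eq. (2.19).
    x_pred: (P, d) array of predictive particles; w: normalised weights."""
    P = len(w)
    diff = x_pred[:, None, :] - x_pred[None, :, :]
    cost = np.einsum('pqd,pqd->pq', diff, diff)      # squared Euclidean distances
    K = np.exp(-cost / eps)                          # Gibbs kernel
    a = np.full(P, 1.0 / P)                          # uniform source marginals
    u, v = np.ones(P), np.ones(P)
    for _ in range(n_iter):                          # alternate marginal scalings
        u = a / (K @ v)
        v = w / (K.T @ u)
    plan = u[:, None] * K * v[None, :]               # coupling of marginals a and w
    rho = P * plan                                   # probabilities as in Eq. (2.17)
    return rho @ x_pred, rho                         # transformed particles, Eq. (2.19)
```

For small `eps` and well-separated particles with uniform weights the approximate plan is close to the identity, so the update leaves the ensemble essentially unchanged, matching the exact OT solution in that case.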
3. SPATIAL MODELS AND LOCAL ENSEMBLE FILTERS
Our particular focus in this article is on filtering in models of spatially-extended dynamical systems. Let $\mathcal{S}$ be a $D$-dimensional compact metric space equipped with a distance function $d : \mathcal{S} \times \mathcal{S} \to [0, \infty)$, representing the spatial domain the state of the modelled system is defined over, and $\mathcal{Z} \subseteq \mathbb{R}^N$ be the space the state variables at each spatial coordinate in $\mathcal{S}$ take values in. The state space of the system is then a function space $\mathcal{Z}^{\mathcal{S}}$ with the state at each time index $z_t : \mathcal{S} \to \mathcal{Z}$ a spatial field. The dynamics of the system will typically be modelled by a set of SPDEs, with $z_{1:T}$ then corresponding to a solution of these equations at $T$ times, given an initial state sampled from some distribution.

In practice in most problems we cannot solve the SPDE model exactly and instead use numerical integration schemes to generate approximate solutions. The states are assumed to be restricted to a function space with a fixed dimensional representation, with typically a state field $z_t : \mathcal{S} \to \mathcal{Z}$ represented as a linear combination of a finite set of $M$ basis functions $\beta_{1:M} : \mathcal{S} \to \mathbb{R}$,

$$z_t(s) = \sum_{m=1}^{M} x_{t,m}\, \beta_m(s) \quad \forall t \in 1{:}T,\ s \in \mathcal{S}, \tag{3.1}$$

with coefficients $x_{t,m} \in \mathcal{Z}\ \forall m \in 1{:}M$. For the purposes of inference we will therefore consider the state space to be a vector space $\mathcal{X} = \mathcal{Z}^M \subseteq \mathbb{R}^{MN}$ with state vectors consisting of the concatenation of the basis function coefficients.

Typically the basis functions will be defined by partitioning the spatial domain $\mathcal{S}$ into a mesh of polytopic spatial elements, for example triangles or quadrilaterals for $D = 2$. The vertices of these polytopes (and potentially additional points such as the midpoints of edges) define a collection of $M$ nodes with spatial locations $s_{1:M}$. Typically each node is associated with a basis function $\beta_m$ satisfying

$$\beta_m(s_m) = 1, \quad \beta_m(s_n) = 0 \quad \forall m \in 1{:}M,\ n \in 1{:}M,\ n \neq m, \tag{3.2}$$

which combined with Eq. (3.1) implies that $z_t(s_m) = x_{t,m}\ \forall m \in 1{:}M$.

We will assume that there are $L$ observations $y_{t,1:L}$ at every time point, each of dimension $K$, with the overall observation vector $y_t$ then a length $KL$ vector

$$y_t^\mathsf{T} = [\,y_{t,1}^\mathsf{T}\ y_{t,2}^\mathsf{T} \cdots y_{t,L}^\mathsf{T}\,] \quad \forall t \in 1{:}T. \tag{3.3}$$

We also assume that $\mathsf{y}_{t,l} \perp \mathsf{y}_{t,m} \,|\, \mathsf{x}_t\ \forall l \neq m$, i.e. the observations are conditionally independent given the state, and that each observation $\mathsf{y}_{t,l}$ depends only on the value of the state field $z_t$ at a fixed spatial location $s_l^o$. Together these two assumptions mean we can express the logarithm of the observation density as

$$\log g_t(y_t \,|\, x_t) = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \,|\, z_t(s_l^o)) \quad \forall t \in 1{:}T. \tag{3.4}$$

The combination of high state and observation space dimensionalities, and low feasible ensemble sizes, makes filtering in spatial SSMs a significant computational challenge. Fortunately SSMs of spatially extended systems often also exhibit a favourable decay of spatial correlations property which can be exploited to make approximate filtering more tractable by performing local updates to the particles.
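The additive per-location structure of Eq. (3.4) is what later allows observation terms to be weighted locally; as a minimal sketch, assuming (our illustrative choice) independent Gaussian observation noise with common standard deviation at each of the $L$ locations:

```python
import numpy as np

def log_obs_density(y, z_obs, sigma):
    """Evaluate log g_t(y_t | x_t) as the sum over observation locations in
    Eq. (3.4), assuming i.i.d. Gaussian noise with std sigma at each location.
    y, z_obs: arrays of shape (L, K), with z_obs the state field at s^o_{1:L}."""
    per_location = (-0.5 * ((y - z_obs) / sigma) ** 2
                    - np.log(sigma * np.sqrt(2.0 * np.pi))).sum(axis=1)
    # per_location[l] = log g_{t,l}(y_{t,l} | z_t(s^o_l)); the total is their sum.
    return per_location.sum()
```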
SCALABLE OPTIMAL-TRANSPORT BASED LOCAL PARTICLE FILTER If we assume the spatial field z t is defined as in Eq. (3.1) and x t is distributedaccording to the filtering distribution π t then the spatial correlation function c t,f : S × S → [0 ,
1] of a square integrable function f ∈ L is defined as c t,f ( s, s ) = E [ f ( z t ( s )) f ( z t ( s ))] − E [ f ( z t ( s ))] E [ f ( z t ( s ))]( E [( f ( z t ( s )) − E [ f ( z t ( s ))]) ] E [( f ( z t ( s )) − E [ f ( z t ( s ))]) ]) , (3.5)and the maximal spatial correlation function as ¯ c t ( s, s ) = sup f ∈L c t,f ( s, s ).The decay of spatial correlations property can then be stated as¯ c t ( s, s ) → d ( s, s ) → ∞ ∀ s ∈ S , s ∈ S , (3.6)which indicates that the dependence between state variables at distinct spatiallocations decays to zero as the distance between the locations increases.While it will typically not be possible to analytically verify Eq. (3.6) holds ex-actly, it has been empirically observed that models of spatially extended systemsin which the underlying dynamics are governed by local interactions between thestate variables exhibit an approximate decay of correlations property. In particu-lar weak long-range spatial correlations are a defining feature of spatio-temporalchaos (Hunt, Kostelich and Szunyogh, 2007) with many spatial models of interest,such as the atmospheric models used in nwp , exhibiting such behaviour. For ssm s exhibiting a decay of spatial correlations property, localising the letf assimilation update in Eq. (2.11), as proposed by Cheng and Reich (2015), canoffer significant performance gains compared to algorithms employing global up-dates. Rather than using a single set of transform coefficients a P , P t for theassimilation update, M sets of coefficients a P , P t, M are defined, one for each spatialmesh node location s M with the assimilation update then x pt,m = X q ∈ P a p,qt,m ~ x qt,m ∀ p ∈ P , ∀ m ∈ M . (3.7)As previously mentioned, the global letf update in Eq. (2.11) restricts the fil-tering ensemble members x P t to lie in the P dimensional linear subspace of X spanned by the predictive ensemble ~ x P t . 
When X is high-dimensional, as is generally the case in spatially extended models, this can be highly restrictive. The local letf update in Eq. (3.7) overcomes this restriction of the global letf update, with the filtering ensemble members now formed from local linear combinations of the predictive ensemble members and thus no longer constrained to a P-dimensional linear subspace. In particular, for models exhibiting a decay of correlations property, the state variables at each mesh node can be updated using coefficients computed using only the subset of observations which are within some localisation radius of the mesh node, while still retaining accuracy.

Local variants of the enkf (Houtekamer and Mitchell, 1998; Hamill, Whitaker and Snyder, 2001) are the prototypical examples of local letfs, and have been successfully used to perform filtering in large complex spatio-temporal models including operational ensemble nwp systems (Bowler et al., 2009). In Appendix A we briefly introduce a local variant of the etkf (Hunt, Kostelich and Szunyogh, 2007) which we use as a baseline in the numerical experiments.
It has been speculated that spatial localisation may be key to achieving useful results from pfs in large spatio-temporal models (Morzfeld, Hodyss and Snyder, 2017), based on its importance to the success of enkf methods in such models. In Farchi and Bocquet (2018) the authors systematically compare a wide range of localised pf and related algorithms which have been proposed in the literature, including localised variants of the etpf which we will discuss in the following subsection. Below we briefly introduce concepts from a local pf algorithm proposed by Penny and Miyoshi (2015) which are relevant to this article; we refer readers to Farchi and Bocquet (2018) for a much more extensive review.

For the standard pf, the logarithms of the unnormalised particle weights are

\[
\log \tilde{w}^p_t = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \mid \tilde{z}^p_t(s^o_l)) \quad \forall p \in 1{:}P, \tag{3.8}
\]

i.e. a summation of contributions due to the observations at all locations s^o_{1:L}. For a model exhibiting a decay of spatial correlations property we would expect that only a local subset of observations should have a strong influence on the distribution of the state variables at each mesh node. We can formalise this intuition into a concrete approach for computing local particle weights via the use of a localisation function ℓ_r : [0, ∞) → [0, 1] and localisation radius r satisfying

\[
\ell_r(0) = 1, \qquad \ell_r(d) = 0 \quad \forall d \geq r > 0. \tag{3.9}
\]

Local unnormalised weights for each mesh node can then be defined

\[
\log \tilde{w}^p_{t,m} = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \mid \tilde{z}^p_t(s^o_l))\, \ell_r(d(s_m, s^o_l)) \quad \forall p \in 1{:}P,\; m \in 1{:}M, \tag{3.10}
\]

and corresponding local normalised weights

\[
w^p_{t,m} = \frac{\tilde{w}^p_{t,m}}{\sum_{q=1}^{P} \tilde{w}^q_{t,m}} \quad \forall p \in 1{:}P,\; m \in 1{:}M. \tag{3.11}
\]

This formulation for the local particle weights has the desired property of using only a local subset of observations to update the state variables at each mesh node (with the terms in the sum zero when d(s_m, s^o_l) ≥ r).

Typical choices for the localisation function include the uniform or top-hat function ℓ_r(d) = 1_{[0,r]}(d) and the triangular function ℓ_r(d) = (1 − d/r) 1_{[0,r]}(d). In this article we exclusively use the smooth and compactly supported fifth-order piecewise rational function proposed by Gaspari and Cohn (1999), which, writing u = 2d/r, can be defined as

\[
\ell_r(d) = \begin{cases}
-\tfrac{1}{4} u^5 + \tfrac{1}{2} u^4 + \tfrac{5}{8} u^3 - \tfrac{5}{3} u^2 + 1 & 0 \leq d < \tfrac{r}{2}, \\[2pt]
\tfrac{1}{12} u^5 - \tfrac{1}{2} u^4 + \tfrac{5}{8} u^3 + \tfrac{5}{3} u^2 - 5u + 4 - \tfrac{2}{3} u^{-1} & \tfrac{r}{2} \leq d < r.
\end{cases} \tag{3.12}
\]

Penny and Miyoshi (2015) propose a local pf algorithm which uses local particle weights defined as in Eq. (3.11) for the specific case of a Gaussian observation density and uniform localisation function ℓ_r(d) = 1_{[0,r]}(d). The local weights are used to generate binary resampling variables r^{1:P,1:P}_{t,1:M} for each mesh node satisfying

\[
\sum_{q=1}^{P} r^{p,q}_{t,m} = 1, \qquad \mathbb{E}\bigg[ \sum_{q=1}^{P} r^{q,p}_{t,m} \,\Big|\, w^p_{t,m} \bigg] = P w^p_{t,m} \quad \forall p \in 1{:}P,\; m \in 1{:}M. \tag{3.13}
\]

Fig 2: Examples of local pf assimilation updates applied to the same spatial Gaussian process model as Figure 1. (a) Independent resampling at each node. (b) Coupled resampling and smoothing.

Generating the resampling variables for each mesh node independently means the state variables at adjacent mesh nodes for a post-resampling particle will typically originate from different prior particles, tending to lead to highly discontinuous and noisy spatial fields.
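As a concrete illustration, the per-node local weight computation of Eqs. (3.10) to (3.12) can be sketched in a few lines of numpy. This is a hedged sketch, not the authors' implementation: the Gaussian observation density, array layouts and function names are our own illustrative assumptions.

```python
import numpy as np

def gaspari_cohn(d, r):
    """Fifth-order piecewise rational localisation function of
    Gaspari & Cohn (1999), with support [0, r) and breakpoint at r / 2."""
    u = 2 * np.asarray(d, dtype=float) / r
    out = np.zeros_like(u)
    inner = u < 1
    outer = (u >= 1) & (u < 2)
    ui, uo = u[inner], u[outer]
    out[inner] = -0.25 * ui**5 + 0.5 * ui**4 + 0.625 * ui**3 - (5 / 3) * ui**2 + 1
    out[outer] = (uo**5 / 12 - 0.5 * uo**4 + 0.625 * uo**3 + (5 / 3) * uo**2
                  - 5 * uo + 4 - (2 / 3) / uo)
    return out

def local_log_weights(obs, pred_obs, obs_std, node_obs_dists, r):
    """Per-node unnormalised local log-weights as in Eq. (3.10), assuming a
    Gaussian observation density (up to an additive constant).

    obs:            (L,) observed values
    pred_obs:       (P, L) particle-predicted observations
    obs_std:        observation noise standard deviation
    node_obs_dists: (M, L) node-to-observation distances d(s_m, s^o_l)
    r:              localisation radius
    """
    log_g = -0.5 * ((obs[None, :] - pred_obs) / obs_std) ** 2  # (P, L)
    loc = gaspari_cohn(node_obs_dists, r)                      # (M, L)
    return log_g @ loc.T                                       # (P, M)

def normalise_weights(log_w):
    """Self-normalised per-node weights, Eq. (3.11), computed stably
    by subtracting the per-node maximum before exponentiating."""
    log_w = log_w - log_w.max(axis=0, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=0, keepdims=True)
```

Each column of the returned weight array then sums to one, giving the per-node normalised weights used by the local resampling schemes discussed below.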
An example of this is shown in Fig. 2a, which shows posterior state field samples generated using independent resampling at each mesh node with local weights, for the smooth spatial Gaussian process example encountered previously in Figure 1.

To ameliorate the issues associated with using independent resampling variables, Penny and Miyoshi (2015) propose to use a variant of the systematic resampling scheme (Douc and Cappé, 2005) often used as a variance reduction method in standard pf algorithms. A single random standard uniform variable is used to generate the resampling variables for all mesh nodes, resulting in per-node sets of resampling variables which each satisfy the marginal requirements in Eq. (3.13) while also being strongly correlated with the resampling variables for other nodes. The correlation introduced between the resampling variables when using this 'coupled resampling' scheme significantly reduces, but does not eliminate, the introduction of discontinuities into the resampled fields.

Rather than directly use these resampling variables in a local equivalent of the pf assimilation update in Eq. (2.15), Penny and Miyoshi (2015) instead propose to use a 'smoothed' update which uses a weighted average of the resampling variables at the current mesh node and all neighbouring nodes to update the particle values at each node. Fig. 2b shows examples of posterior state field samples computed using this smoothed assimilation update with the resampling variables generated using the coupled scheme. The previously observed discontinuities are now removed; however, the samples still remain significantly less smooth than the true state used to generate the observations (Fig. 1a) and the prior samples (Fig. 1c).
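A minimal sketch of the coupled resampling idea described above: a single shared uniform variate drives systematic resampling at every mesh node, so nodes with similar local weights receive near-identical resampling indices. The function names and array conventions here are our own, not Penny and Miyoshi's.

```python
import numpy as np

def systematic_resample(weights, u):
    """Systematic resampling indices for one node given a uniform u in [0, 1)."""
    P = weights.shape[0]
    positions = (u + np.arange(P)) / P
    cumsum = np.cumsum(weights)
    cumsum[-1] = 1.0  # guard against floating point rounding
    return np.searchsorted(cumsum, positions)

def coupled_local_resample(node_weights, rng=None):
    """Per-node systematic resampling driven by one shared uniform variate,
    coupling the resampling indices across nodes as in the scheme of
    Penny and Miyoshi (2015).

    node_weights: (P, M) normalised local weights
    returns:      (P, M) resampled particle indices, one column per node
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random()  # single uniform shared by all mesh nodes
    P, M = node_weights.shape
    idx = np.empty((P, M), dtype=int)
    for m in range(M):
        idx[:, m] = systematic_resample(node_weights[:, m], u)
    return idx
```

With uniform weights at every node this reproduces each particle exactly once per node, and nodes with nearby weight vectors yield strongly correlated index columns, which is precisely the coupling that reduces (but does not eliminate) resampling discontinuities.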
While techniques such as the smoothed and coupled resampling update used in Penny and Miyoshi (2015) can help reduce the introduction of spatial discontinuities, the resampling variables r^{1:P,1:P}_{t,1:M} are still calculated without taking into account the values of the predictive particles x̃^{1:P}_t other than via the local particle weights. The etpf assimilation update discussed in Section 2.9 explicitly tries to minimise a distance between the values of the transformed and pre-update particles and does not require introducing any randomness, and so is a natural candidate for a local pf with improved spatial smoothness properties.
Cheng and Reich (2015) proposed a localised variant of the etpf as a particular instance of their letf framework. Local particle weights are calculated as in Eqs. (3.10) and (3.11) for each mesh node, and a set of ot problems solved

\[
\rho^{1:P,1:P}_{t,m} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w^{1:P}_{t,m})} \sum_{p=1}^{P} \sum_{q=1}^{P} \varrho^{p,q} c^{p,q}_{t,m} \quad \forall m \in 1{:}M. \tag{3.14}
\]

Here the transport cost terms c^{1:P,1:P}_{t,1:M} are analogous to the inter-particle Euclidean distances used in Eq. (2.18). Rather than compute global transport costs based on distances between the state variable values at points across the full spatial domain, Cheng and Reich (2015) proposed to compute localised transport costs for each mesh node index m ∈ 1:M by integrating a distance between the state variable values against a localisation function centred at the mesh node location s_m and with support on points {s ∈ S : d(s, s_m) < r′}:

\[
c^{p,q}_{t,m} = \int_{\mathcal{S}} |\tilde{z}^p_t(s) - \tilde{z}^q_t(s)|^2\, \ell'_{r'}(d(s_m, s))\, \mathrm{d}s \quad \forall m \in 1{:}M,\; p \in 1{:}P,\; q \in 1{:}P. \tag{3.15}
\]

The localisation function ℓ′ and localisation radius r′ are denoted with primes here to emphasise they may be different from those used for the local weight computation. A more pragmatic definition of the localised transport costs is

\[
c^{p,q}_{t,m} = \sum_{n=1}^{M} |\tilde{x}^p_{t,n} - \tilde{x}^q_{t,n}|^2\, \ell'_{r'}(d(s_m, s_n)) \quad \forall m \in 1{:}M,\; p \in 1{:}P,\; q \in 1{:}P. \tag{3.16}
\]

In the common case of a rectilinear mesh with equal spacing between the nodes across the domain, the summation in Eq. (3.16) can be seen as a quadrature approximation to the integral in Eq. (3.15), up to a constant multiplier which does not affect the ot solutions.

If the localisation functions ℓ_r and ℓ′_{r′} are smooth, then both the local weights w^{1:P}_{t,1:M} and local transport costs c^{1:P,1:P}_{t,1:M} will vary smoothly as functions of the mesh node locations s_{1:M}. However, the solutions ρ^{1:P,1:P}_{t,1:M} to the linear programs defined by the local optimal transport problems in Eq. (3.14) will not vary smoothly with the mesh node locations s_{1:M} even if the local weights and transport costs do. This can be seen in the spatial Gaussian process example in Fig. 1, with the local etpf scheme used to compute the posterior samples illustrated in Fig. 1e. Although less apparent than the discontinuities in Fig. 1d, the fields in Fig. 1e still show spatial artefacts due to the non-smooth variation of the ot solutions.

One option to increase the smoothness of the update is to regularise the ot problems. In particular, the entropically regularised ot problems defined by

\[
\rho^{1:P,1:P}_{t,m} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w^{1:P}_{t,m})} \sum_{p=1}^{P} \sum_{q=1}^{P} \big( \varrho^{p,q} c^{p,q}_{t,m} + \lambda\, \varrho^{p,q} (\log \varrho^{p,q} - 1) \big), \tag{3.17}
\]

for some positive regularisation coefficient λ, have a unique optimal solution which varies smoothly as a function of the local weights w^{1:P}_{t,m} and transport costs c^{1:P,1:P}_{t,m} (Peyré and Cuturi, 2019), and tends to the solution of the non-regularised problem with the highest entropy as λ → 0. Further, the entropically regularised problems can be efficiently iteratively solved using Sinkhorn–Knopp iteration (Sinkhorn and Knopp, 1967; Cuturi, 2013), with complexity Õ(P²) per problem (Altschuler, Weed and Rigollet, 2017).

Fig 3: Examples of the Cheng and Reich (2015) local etpf assimilation update applied to the same spatial Gaussian process model as Figure 1, using entropically regularised ot maps for two values of the regularisation coefficient λ.

Figs. 3a and 3b show examples of posterior field samples computed using entropically regularised local etpf updates for two regularisation coefficients λ. It can be seen that introducing entropic regularisation increases the smoothness of the updated fields compared to the unregularised samples shown in Fig. 1e, and that the level of smoothness increases with the regularisation coefficient λ.

However, the increase in smoothness comes at the cost of a decreased diversity in the post-update particles as λ increases; in particular, for the larger regularisation coefficient shown in Fig. 3b, the four samples shown appear almost identical. This is a consequence of the assimilation updates in the local etpf linearly transforming by the ot maps as in Eq. (2.19), as opposed to resampling using binary random variables generated according to the resampling probabilities encoded by the ot maps. For the regularised ot problems in Eq. (3.17), as the regularisation coefficient λ → ∞ we have that ρ^{p,q}_{t,m} → w^q_{t,m} ∀ p ∈ 1:P, q ∈ 1:P, m ∈ 1:M.
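The Sinkhorn–Knopp fixed-point iteration for an entropically regularised problem of the form of Eq. (3.17) can be sketched as follows. This is a generic sketch rather than the authors' code: we assume the feasible set couples the uniform distribution over the P predictive particles with a vector of local weights w, and for brevity no log-domain stabilisation is used, so very small λ may underflow.

```python
import numpy as np

def sinkhorn_plan(cost, w, lam, n_iter=1000, tol=1e-10):
    """Sinkhorn-Knopp iteration for an entropically regularised ot problem:
    couple the uniform distribution (1/P, ..., 1/P) over predictive particles
    with target weights w, under cost matrix `cost` and regularisation lam.

    Returns a (P, P) plan rho with rho.sum(axis=1) ~= 1/P, rho.sum(axis=0) = w.
    """
    P = len(w)
    K = np.exp(-cost / lam)        # Gibbs kernel
    v = np.ones(P)                 # column scaling vector
    row_marg = np.full(P, 1.0 / P)
    for _ in range(n_iter):
        u = row_marg / (K @ v)     # enforce row marginals
        v_new = w / (K.T @ u)      # enforce column marginals
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return u[:, None] * K * v[None, :]
```

Under the row-marginal convention assumed here, transform coefficients with rows summing to one (as required of letf coefficients) would then be recovered by scaling the plan, e.g. a = P * rho.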
In the λ → ∞ limit, applying the local etpf assimilation update will tend to assign the weighted mean of the state variables at each mesh node to the post-update particles, and thus lead to a lack of diversity, or under-dispersion, in the post-update particles.

Acevedo, de Wiljes and Reich (2017) proposed a variant of the etpf which overcomes this under-dispersion issue when using entropically regularised ot maps. For each ot map a correction term is computed which ensures the empirical covariance of the updated particles matches the value that would be obtained using the standard pf update. Although this second-order accurate etpf scheme overcomes the under-dispersion issues when using entropically regularised ot maps, in localised variants the correction factors must be computed separately for the ot map associated with each mesh node, with the computation of each correction factor having a O(P³) complexity, potentially negating any gains from using a cheaper Sinkhorn solver for the regularised ot problems.

In the review article of Farchi and Bocquet (2018) a local etpf variant is proposed which computes ot maps for blocks of state variables rather than for each mesh node individually. Computing ot maps per-block rather than per-node can give significant computational savings in higher spatial dimensions: for instance, for three-dimensional domains, even using cubic blocks which cover just two mesh nodes in each dimension would reduce the number of ot problems needing to be solved by a factor of eight. In the numerical experiments in Farchi and Bocquet (2018) it was found however that the accuracy of the local block etpf method was highest when using blocks containing just one mesh node, i.e. corresponding to the local etpf scheme of Cheng and Reich (2015). As the state variables in each block are updated independently given the computed per-block ot maps, the poorer performance with larger blocks may be at least in part due to the spatially inhomogeneous error introduced at the block boundaries. Fig. 4 shows examples of posterior state field samples computed using this block etpf scheme for the earlier spatial Gaussian process example from Fig. 1 for two different block sizes; in both cases the boundaries of the blocks are clearly visible due to the discontinuities introduced into the fields.

Fig 4: Examples of the Farchi and Bocquet (2018) local block etpf assimilation update applied to the same Gaussian process model as Figure 1. (a) Block size 4 × 4. (b) Block size 8 × 8.

Fig 5: Example smooth partition of unity of a one-dimensional spatial domain S = [0, 1] with nine bump functions. The patches Ŝ_{1:9} covering S, on which the bump functions have support, are visualised below the plot axes.
4. SMOOTH AND SCALABLE LOCAL PARTICLE FILTERING
Grouping mesh nodes into spatially contiguous blocks and computing ot maps per-block rather than per-node, as proposed in Farchi and Bocquet (2018), is a natural way to reduce the computational cost of the local etpf assimilation update. However this approach further decreases the smoothness of the updated fields. Here we propose an alternative approach. Rather than computing ot maps for disjoint blocks defining a partition of the spatial domain S, we instead 'softly' partition S into patches with overlapping support, computing an ot map for each patch and smoothly interpolating between the ot maps associated with different patches in the overlaps. The construct we will use to both define the soft partitioning of the domain and the interpolation across it is a partition of unity.

Let Ŝ_{1:B} be a cover of the spatial domain S such that \(\bigcup_{b=1}^{B} \hat{\mathcal{S}}_b = \mathcal{S}\), with each Ŝ_b termed a patch. We associate a bump function φ_b : S → [0, 1] ∀ b ∈ 1:B with each patch, with φ_b(s) = 0 ∀ s ∉ Ŝ_b, b ∈ 1:B, and require that

\[
\sum_{b=1}^{B} \phi_b(s) = 1 \quad \forall s \in \mathcal{S}. \tag{4.1}
\]

The set of bump functions φ_{1:B} is then termed a partition of unity (pou) of S. pous are typically used to allow local constructions to be extended globally across a space, for instance an atlas of local charts of a manifold. Generally in such applications the bump functions will be required to be infinitely differentiable. Here we will not require such stringent differentiability conditions; we will informally refer to a smooth pou for the case where each bump function is of at least class C¹ with continuous derivatives, and to a hard pou for the case where the cover Ŝ_{1:B} is exact, i.e. the patches are pairwise disjoint, so that the bump functions are indicators on the patches, φ_b(s) = 1_{Ŝ_b}(s) ∀ b ∈ 1:B.

A useful method for constructing a pou with specified smoothness properties on an arbitrary spatial domain is via convolution. Specifically, if S_{1:B} is a partition of S and ϕ : [0, ∞) → [0, ∞) is a non-negative mollifier function satisfying

\[
\int_{\mathcal{S}} \varphi(d(s, s'))\, \mathrm{d}s' = 1 \quad \forall s \in \mathcal{S}, \tag{4.2}
\]

then we can define a pou φ_{1:B} on S by convolving ϕ with the indicators on S_{1:B},

\[
\phi_b(s) = \int_{\mathcal{S}} \mathbb{1}_{\mathcal{S}_b}(s')\, \varphi(d(s, s'))\, \mathrm{d}s' \quad \forall b \in 1{:}B,\; s \in \mathcal{S}. \tag{4.3}
\]

The bump functions will then inherit any smoothness properties of the mollifier. Figure 5 shows an example of a smooth pou constructed in this manner.

We can use a pou to define a local letf that uses transform coefficients computed for each patch rather than for each mesh node. We define the per-node transform coefficients a^{1:P,1:P}_{t,m} in Eq. (3.7) in terms of a set of per-patch coefficients â^{1:P,1:P}_{t,1:B} by

\[
a^{p,q}_{t,m} = \sum_{b=1}^{B} \hat{a}^{p,q}_{t,b}\, \phi_b(s_m) \quad \forall p \in 1{:}P,\; q \in 1{:}P,\; m \in 1{:}M. \tag{4.4}
\]

If the set of coefficients â^{1:P,1:P}_{t,b} for each patch index b ∈ 1:B corresponds to the elements of a left-stochastic matrix such that

\[
\hat{a}^{p,q}_{t,b} \in [0, 1] \;\; \forall p, q \in 1{:}P \quad \text{and} \quad \sum_{q=1}^{P} \hat{a}^{p,q}_{t,b} = 1 \;\; \forall p \in 1{:}P, \tag{4.5}
\]

then due to the non-negativity and sum-to-unity properties of the pou we have that a^{p,q}_{t,m} ∈ [0, 1] ∀ p, q ∈ 1:P, m ∈ 1:M and

\[
\sum_{q=1}^{P} a^{p,q}_{t,m} = \sum_{b=1}^{B} \sum_{q=1}^{P} \hat{a}^{p,q}_{t,b}\, \phi_b(s_m) = \sum_{b=1}^{B} \phi_b(s_m) = 1 \quad \forall p \in 1{:}P,\; m \in 1{:}M, \tag{4.6}
\]

and so a^{1:P,1:P}_{t,1:M} also correspond to the elements of left-stochastic matrices.
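The convolution construction of Eqs. (4.2) and (4.3) above is straightforward to implement on a regular periodic grid. The sketch below builds a smooth pou on a discretised one-dimensional domain [0, 1); the cosine-squared mollifier and equal-block partition are illustrative choices of ours, not prescribed by the text.

```python
import numpy as np

def smooth_pou_1d(n_grid, n_patch, kernel_width):
    """Smooth partition of unity on a periodic 1D grid over [0, 1), built by
    circularly convolving indicator functions of an exact partition with a
    discretely normalised compactly supported mollifier (Eqs. (4.2)-(4.3)).

    Returns a (n_patch, n_grid) array phi with phi.sum(axis=0) == 1 everywhere.
    """
    s = np.arange(n_grid) / n_grid
    d = np.minimum(s, 1 - s)  # circular distance to the origin
    # compactly supported C^1 bump; any smooth non-negative kernel would do
    moll = np.where(d < kernel_width,
                    np.cos(np.pi * d / (2 * kernel_width)) ** 2, 0.0)
    moll /= moll.sum()  # discrete analogue of the normalisation in Eq. (4.2)
    # exact partition of the grid into n_patch equal contiguous blocks
    indicators = np.zeros((n_patch, n_grid))
    block = n_grid // n_patch
    for b in range(n_patch):
        indicators[b, b * block:(b + 1) * block] = 1.0
    # circular convolution of each indicator with the mollifier via the FFT
    phi = np.real(np.fft.ifft(
        np.fft.fft(indicators, axis=1) * np.fft.fft(moll)[None, :], axis=1))
    return phi
```

Because the indicators sum to one at every grid point and the mollifier is normalised, the convolved bump functions automatically satisfy the sum-to-unity requirement of Eq. (4.1) up to floating point error.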
The resulting assimilation update, in terms of the values of the predictive and filtering distribution state field particles z̃^{1:P}_t and z^{1:P}_t at the mesh nodes s_{1:M}, is

\[
z^p_t(s_m) = \sum_{q=1}^{P} \sum_{b=1}^{B} \hat{a}^{p,q}_{t,b}\, \phi_b(s_m)\, \tilde{z}^q_t(s_m) \quad \forall p \in 1{:}P,\; m \in 1{:}M. \tag{4.7}
\]

For a smooth pou φ_{1:B}, the P × B spatial fields defined by the pointwise products φ_b(s) z̃^q_t(s) ∀ b ∈ 1:B, q ∈ 1:P will be smooth functions of the spatial coordinate s ∈ S if the predictive distribution state field particles z̃^{1:P}_t are themselves smooth. Each filtering distribution state field particle is then formed as a convex combination of these pointwise product fields, and so will also be smooth if the pou and predictive distribution state fields are. This is illustrated for a one-dimensional example in Fig. C.1 in Appendix C.

We now consider the specific application of the smooth local letf scheme to define a smooth localisation of the etpf, with in this case the coefficients â^{1:P,1:P}_{t,1:B} corresponding to ot maps computed for each patch. We first define the following notation for the distance between a subset of the spatial domain and a point:

\[
d(\mathcal{S}', s) = \inf_{s' \in \mathcal{S}'} d(s', s) \quad \forall s \in \mathcal{S},\; \mathcal{S}' \subseteq \mathcal{S}. \tag{4.8}
\]

Analogously to the per-node case in Eq. (3.10), the logarithms of the per-patch (unnormalised) particle weights can then be defined by

\[
\log \tilde{w}^p_{t,b} = \sum_{l=1}^{L} \log g_{t,l}(y_{t,l} \mid \tilde{z}^p_t(s^o_l))\, \ell_r(d(\hat{\mathcal{S}}_b, s^o_l)) \quad \forall b \in 1{:}B,\; p \in 1{:}P. \tag{4.9}
\]

As d(Ŝ_b, s^o_l) = 0 if s^o_l ∈ Ŝ_b and ℓ_r(0) = 1, the weighted summation of log observation density terms in Eq. (4.9) gives weight one to all the terms corresponding to observations located within a patch. Observations outside a patch but within a distance of less than r are given weights between zero and one, and all observations more than a distance r from a patch are given zero weight.

Taking inspiration from the per-node case in Eq. (3.15), we could define per-patch transport costs directly in terms of the predictive state fields z̃^{1:P}_t,

\[
c^{p,q}_{t,b} = \int_{\mathcal{S}} |\tilde{z}^p_t(s) - \tilde{z}^q_t(s)|^2\, \ell'_{r'}(d(\hat{\mathcal{S}}_b, s))\, \mathrm{d}s \quad \forall b \in 1{:}B,\; p \in 1{:}P,\; q \in 1{:}P. \tag{4.10}
\]

Although this is defined independently of the spatial discretisation used, evaluating the integrals exactly will often be intractable. Assuming the common case of equally spaced mesh nodes, we propose to define per-patch transport costs as

\[
c^{p,q}_{t,b} = \sum_{m \in \mathcal{M}'} |\tilde{x}^p_{t,m} - \tilde{x}^q_{t,m}|^2\, \mathbb{1}_{\hat{\mathcal{S}}_b}(s_m) \quad \forall b \in 1{:}B,\; p \in 1{:}P,\; q \in 1{:}P, \tag{4.11}
\]

where M′ ⊆ 1:M corresponds to a spatial subsampling of the mesh nodes, e.g. corresponding to every Kth node in each spatial dimension, such that |M′| ≈ M/K^D. This spatial subsampling is motivated by the observation that if the state fields are spatially smooth then the values at immediately adjacent mesh nodes will typically be very similar, and there is therefore minimal loss of information in computing pointwise differences over a subset of, rather than all, mesh nodes. In addition to spatial subsampling, we also define the transport costs in Eq. (4.11) with the fixed choice of a uniform localisation function ℓ′_{r′} with r′ = 0. Empirically we found varying the choice of ℓ′_{r′} and r′ for the transport costs had little discernible effect on filtering performance.

Given per-patch weights w^{1:P}_{t,1:B} and transport costs c^{1:P,1:P}_{t,1:B} computed as described above, the per-patch linear transform coefficients â^{1:P,1:P}_{t,1:B} are then computed as solutions to the B corresponding ot problems

\[
\hat{a}^{1:P,1:P}_{t,b} = \operatorname*{argmin}_{\varrho^{1:P,1:P} \in \mathcal{R}(w^{1:P}_{t,b})} \sum_{p=1}^{P} \sum_{q=1}^{P} \varrho^{p,q} c^{p,q}_{t,b} \quad \forall b \in 1{:}B. \tag{4.12}
\]

We will subsequently refer to instances of this framework as smooth local ensemble transform particle filters (sletpfs).
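Putting the pieces together, the per-patch costs of Eq. (4.11) and the interpolated update of Eqs. (4.4) and (4.7) reduce to a pair of array contractions. This is an illustrative numpy sketch with our own naming conventions, assuming scalar state values at each mesh node.

```python
import numpy as np

def patch_transport_costs(particles, patch_indicators, subsample=4):
    """Per-patch transport costs as in Eq. (4.11): squared differences between
    particle values summed over a subsampled set of mesh nodes in each patch.

    particles:        (P, M) particle state values at the mesh nodes
    patch_indicators: (B, M) 0/1 membership of mesh nodes in each patch
    subsample:        keep every `subsample`-th mesh node (the set M')
    """
    x = particles[:, ::subsample]                   # (P, M')
    ind = patch_indicators[:, ::subsample]          # (B, M')
    sq_diff = (x[:, None, :] - x[None, :, :]) ** 2  # (P, P, M')
    return np.einsum('pqm,bm->bpq', sq_diff, ind)   # (B, P, P)

def smooth_local_update(particles, patch_plans, pou):
    """Interpolated per-patch assimilation update, Eqs. (4.4) and (4.7).

    particles:   (P, M) predictive particle values at the mesh nodes
    patch_plans: (B, P, P) per-patch transform coefficients (rows sum to one)
    pou:         (B, M) partition-of-unity bump function values at the nodes
    """
    # z^p_m = sum_q sum_b a-hat^{pq}_b phi_b(s_m) x-tilde^q_m
    return np.einsum('bpq,bm,qm->pm', patch_plans, pou, particles)
```

Because the pou values sum to one over patches at every node, feeding identity transform coefficients for every patch returns the predictive particles unchanged, a convenient sanity check on any implementation.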
To define a sletpf method for a given spatial ssm, we need to specify: a localisation function ℓ_r and radius r to compute the local weights; the set of mesh nodes M′ to use in computing the local transport costs; and a pou of the spatial domain.

For the sletpf local weight calculation in Eq. (4.9), the number of non-zero log observation density terms in the sum depends on both the localisation function and the size of the patches Ŝ_{1:B} used to define the pou. We can define an effective number of observations considered per patch as

\[
n_b = \sum_{l=1}^{L} \ell_r(d(\hat{\mathcal{S}}_b, s^o_l)) \quad \forall b \in 1{:}B. \tag{4.13}
\]

To avoid weight degeneracy we will typically need to control the n_{1:B} values through the choice of pou and localisation radius r, with the results of Rebeschini and van Handel (2015) suggesting n_{1:B} should roughly scale with log P. To approximately minimise max(n_{1:B}) for a given number of patches B, as a heuristic we suggest the patches should be chosen such that each contains a roughly equal number of observations. We discuss approaches for defining a partition of the spatial domain based on the observation locations to achieve this in Appendix D.

The choice of the number of patches B will typically be based on a tradeoff between several factors. Reducing computational cost favours using fewer patches, while the need to control max(n_{1:B}), and so the tendency for weight degeneracy, favours using a greater number of smaller patches. More complex is the dependency of the approximation error introduced by localisation. Using larger patches and a greater number of observations to update the state variables within each patch should reduce the approximation error for the updates within each patch. However, for a fixed r, using larger patches will also lead to greater disparities in the local weights calculated for each patch using Eq. (4.9), and so in the transform coefficients for adjacent patches. If using a hard pou this will typically lead to spatial discontinuities in the state particles across patch boundaries after applying the assimilation update, with the downstream effect of such discontinuities potentially negating any reduction in the approximation error within the patches. If using a smooth pou, the mesh nodes in the overlaps between patches will be updated using an interpolation of the transform coefficients for each of the patches, allowing smaller numbers of patches B to be used while still retaining smoothness. In the numerical experiments in Section 5 we show that using a smooth pou allows use of a number of patches B less than the number of mesh nodes M while still retaining accurate filtering distribution estimates.
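The effective-observation diagnostic in Eq. (4.13) is cheap to compute and can guide the choice of B and r. A minimal sketch follows; the triangular localisation function here is just one admissible choice, and the function names are our own.

```python
import numpy as np

def triangular(d, r):
    """Triangular localisation function l_r(d) = max(1 - d / r, 0)."""
    return np.clip(1.0 - np.asarray(d, dtype=float) / r, 0.0, None)

def effective_obs_per_patch(patch_obs_dists, loc_fn, r):
    """Effective number of observations per patch, Eq. (4.13):
    n_b = sum_l l_r(d(S_b, s^o_l)), given the (B, L) matrix of
    patch-to-observation distances d(S_b, s^o_l)."""
    return loc_fn(patch_obs_dists, r).sum(axis=1)
```

In line with the heuristic above, one would adjust the patch boundaries or the radius r until the n_b values are roughly equal across patches and scale like log P.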
The computational cost of the per-node local etpf assimilation updates proposed in Cheng and Reich (2015) is dominated by solving the M ot problems, leading to an overall Õ(MP³) scaling for the computational cost. For the sletpf, the number of ot problems is determined by the number of patches B, and so the cost of solving the ot problems is Õ(BP³). When B ≪ M the relative cost of the other computations in the overall assimilation update can become significant however. To derive a relationship for the overall scaling of the computational cost of the proposed sletpf we make the following assumptions.

Assumption 1. The maximum number of patches covering any mesh node is independent of and much smaller than B, and so the sum across all patches of the number of mesh nodes within each patch scales independently of B, i.e.

\[
\sum_{b=1}^{B} \sum_{m=1}^{M} \mathbb{1}_{\hat{\mathcal{S}}_b}(s_m) = O(M). \tag{4.14}
\]

For pous in which each patch overlaps with only a fixed number of 'neighbour' patches this will hold. If a uniform subsampling scheme is used to define the set of mesh node indices M′ used in computing the transport costs, then as a corollary we will also have that the total number of subsampled mesh nodes contained within all patches scales independently of B, i.e.

\[
\sum_{b=1}^{B} \sum_{m \in \mathcal{M}'} \mathbb{1}_{\hat{\mathcal{S}}_b}(s_m) = O(|\mathcal{M}'|). \tag{4.15}
\]

Assumption 2. The sum across all patches of the number of observations within a distance r of a patch is less than the number of mesh nodes M, i.e.

\[
\sum_{b=1}^{B} \sum_{l=1}^{L} \mathbb{1}_{[0,r]}(d(\hat{\mathcal{S}}_b, s^o_l)) < M. \tag{4.16}
\]

We will typically have that the number of observation locations L is small compared to the number of mesh nodes M, and the localisation radius r will be set to limit the number of observations considered per patch to a small subset of all observations, so this will usually hold.

Under Assumption 1, the cost of calculating the BP² transport costs using Eq. (4.11) is O(|M′|P²), as we need to evaluate the distances between the particle pairs at the |M′| subsampled mesh nodes, and from Eq. (4.15) only O(|M′|) terms in the summations for each of the particle pairs need to be evaluated.

The update to the particles in Eq. (4.7) for a general set of per-patch linear transform coefficients â^{1:P,1:P}_{t,1:B} will have a cost of O(MP²) under Assumption 1. However, for transform coefficients computed as the solution to discrete ot problems, at most 2P − 1 coefficients for each patch are non-zero (Reich, 2013). In this case the assimilation update in Eq. (4.7) therefore has a O(MP) cost.

Under Assumption 2, the computation using Eq. (4.9) of the BP per-patch weights will cost less than O(MP), as we need to evaluate LP < MP log observation density factors, and from Eq. (4.16) fewer than M terms in the summations for each of the P particles will be non-zero and so need to be evaluated.

Under these assumptions, the overall computational cost of each sletpf assimilation step therefore scales as Õ(BP³ + |M′|P² + MP).
5. NUMERICAL EXPERIMENTS
To evaluate the performance of the proposed approach, we perform filtering in two spde test models, comparing our proposed scheme to the local etpf (Cheng and Reich, 2015) and local etkf (Hunt, Kostelich and Szunyogh, 2007). Rather than measure performance in terms of the distance between the estimated mean of the filtering distribution and the true state used to generate the observations, as is common in similar work, e.g. Farchi and Bocquet (2018), here we measure the errors in the ensemble estimates of expectations with respect to the true filtering distributions. This gives more directly interpretable results, as a filter which exactly computes the expectations would give a zero error, unlike the difference between the mean and true state, which will in general be non-zero even if the mean is computed exactly. We are also able to assess the accuracy of a broader range of features of the filtering distribution estimates, for example their quantification of uncertainty via measures of dispersion.

To allow such comparisons, we require models for which ground truth values for expectations with respect to the filtering distributions can be computed. To this end our first model is based on a linear-Gaussian spde model for which the true filtering distribution can be exactly computed using a Kalman filter. For the second model, we use a more challenging spde model with non-linear state dynamics. Here our 'ground truth' for the filtering distributions is based on long runs of a Markov chain Monte Carlo (mcmc) method.

For both models we consider several metrics for evaluating the accuracy of the different local ensemble filters' estimates of the filtering distributions. The first two metrics we consider are the time- and space-averaged root mean squared errors (rmses) of the ensemble estimates of the filtering distribution means and standard deviations, reflecting respectively the filters' accuracy in estimating the central tendencies and dispersions of the filtering distributions. Denote by μ_{1:T} and σ_{1:T} the true means and standard deviations under π_{1:T},

\[
\mu_t = \int_{\mathcal{X}} x\, \pi_t(\mathrm{d}x) \quad \text{and} \quad \sigma_t^2 = \int_{\mathcal{X}} (x - \mu_t) \odot (x - \mu_t)\, \pi_t(\mathrm{d}x) \quad \forall t \in 1{:}T, \tag{5.1}
\]

and by μ̂_{1:T} and σ̂_{1:T} the corresponding means and standard deviations under the empirical ensemble estimates of the filtering distributions \(\hat{\pi}_t(\mathrm{d}x) = \frac{1}{P} \sum_{p=1}^{P} \delta_{x^p_t}(\mathrm{d}x)\),

\[
\hat{\mu}_t = \int_{\mathcal{X}} x\, \hat{\pi}_t(\mathrm{d}x) \quad \text{and} \quad \hat{\sigma}_t^2 = \int_{\mathcal{X}} (x - \hat{\mu}_t) \odot (x - \hat{\mu}_t)\, \hat{\pi}_t(\mathrm{d}x) \quad \forall t \in 1{:}T. \tag{5.2}
\]

We then define the time- and space-averaged rmses of the estimates as

\[
\mathrm{rmse}(\hat{\mu}_{1:T}, \mu_{1:T}) = \sqrt{\frac{1}{TM} \sum_{t=1}^{T} \sum_{m=1}^{M} (\hat{\mu}_{t,m} - \mu_{t,m})^2}, \tag{5.3}
\]

and

\[
\mathrm{rmse}(\hat{\sigma}_{1:T}, \sigma_{1:T}) = \sqrt{\frac{1}{TM} \sum_{t=1}^{T} \sum_{m=1}^{M} (\hat{\sigma}_{t,m} - \sigma_{t,m})^2}. \tag{5.4}
\]

In both cases lower values of these metrics are better, with a value of zero indicating the mean or standard deviation estimates exactly match the true values.
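The mean and standard-deviation metrics of Eqs. (5.1) to (5.4) amount to simple array reductions over a (T, P, M) array of particles. A short sketch with illustrative names of our own:

```python
import numpy as np

def ensemble_mean_std(ensembles):
    """Ensemble estimates of filtering means and standard deviations,
    Eq. (5.2), for a (T, P, M) array of particles over time."""
    return ensembles.mean(axis=1), ensembles.std(axis=1)

def time_space_rmse(estimate, truth):
    """Time- and space-averaged rmse, as in Eqs. (5.3)-(5.4):
    sqrt of the mean over t and m of the squared errors, for (T, M) arrays."""
    return np.sqrt(np.mean((estimate - truth) ** 2))
```

An exact filter would give a zero value of `time_space_rmse` for both the mean and standard deviation estimates, which is the interpretability property motivating these metrics.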
The two metrics discussed so far concentrate on the accuracy of estimates oflocal properties of the states, but do not reflect more global properties such aswhether the ensemble filters correctly estimate the smoothness of the state fields.As a proxy measure for smoothness we use the expectation under the true filteringdistributions of a finite-difference approximation of the integral across space ofthe magnitude of the spatial gradients of the state fields: γ t = Z X X m ∈ M | x t,m − x t,m ⊕ | π t (d x ) ≈ E (cid:20)Z S | ∂ s z t ( s ) | d s (cid:21) ∀ t ∈ T , (5.5)with m ⊕ m + 1 mod M , with one-dimensional periodic spatialdomains being used in both models considered. Defining the estimates ˆ γ T of these smoothness coefficients under the ensemble filtering distributions equivalently asˆ γ t = Z X X m ∈ M | x t,m − x t,m ⊕ | ˆ π t (d x ) ∀ t ∈ T , (5.6)we then define an overall measure of the accuracy of the ensemble estimates’spatial smoothness as the following time-averaged rmsermse (ˆ γ T , γ T ) = s T X t ∈ T (ˆ γ t − γ t ) . (5.7) As our first example we use a linear-Gaussian ssm derived from a spde modelfor turbulent signals by Majda and Harlim (2012, Ch. 5). The governing spde isd ζ ( s, τ ) = (cid:16) θ ∂ s + θ ∂ s − θ (cid:17) ζ ( s, τ )d τ + ( κ (cid:126) s d η )( s, τ ) , (5.8)where ζ : S × R ≥ → R is a real-valued space-time varying process, θ ∈ R ≥ is a non-negative parameter controlling dissipation due to diffusion, θ ∈ R isa parameter governing the direction and magnitude of the constant advection, θ ∈ R ≥ is a non-negative parameter controlling dissipation due to damping, κ : S → R ≥ is a spatial kernel function which governs the spatial smoothness of theadditive noise in the dynamics and η : S × R ≥ → R is a space-time varying noiseprocess. The spatial domain is a one-dimensional interval S = [0 ,
1) with periodic boundary conditions and a distance function d(s, s′) = min(|s − s′|, 1 − |s − s′|), and ⊛_s represents circular convolution in space.

We use a spectral approach to define basis function expansions of the processes ζ and η and kernel κ using M = 512 mesh nodes. This results in a linear system of stochastic differential equations (sdes) for which the Gaussian state transition and stationary distributions can be solved exactly. We assume a linear-Gaussian observation model with the state noisily observed at L = 64 locations and T = 200 time points. Full details of the model are given in Appendix F.1.

The resulting stochastic turbulence (st) ssm is linear-Gaussian. We consider two cases in our experiments: inference in the original linear-Gaussian ssm, and inference in a transformed ssm using this linear-Gaussian model as the base ssm. The specific definition we use for a transformed ssm is given in Appendix E; in brief, by applying a non-linear transformation to the state of a linear-Gaussian ssm we can construct a ssm with non-Gaussian filtering distributions for which we can tractably estimate expectations with respect to the true filtering distributions with arbitrary accuracy. Here the non-linear transformation is chosen as T(x) = sinh⁻¹(θx) (with sinh⁻¹ evaluated elementwise on vector arguments). As |sinh⁻¹(θx)| ≈ log(2θ|x|) for |θx| ≫ 1, the transformation tends to induce bimodality in the marginals of the transformed filtering distributions.

For both the transformed and linear-Gaussian cases we use the model parameter settings given in Table F.1, and use simulated noisy observations y₁:T generated from the models using a shared set of Gaussian state and observation noise variable samples generated using a pseudo-random number generator. The resulting observation sequence y₁:T (which is the same for both models) is shown in Fig. F.1 along with the corresponding true state sequences used to generate the observations under the linear-Gaussian and transformed ssms respectively.

We compare the performance of the local etkf, local etpf and our proposed sletpf algorithm in estimating the filtering distributions for both the linear-Gaussian and transformed ssms. The mesh size M = 512 and number of observations L = 64 are sufficiently large that non-local pf methods suffer from weight degeneracy even with large ensembles, for both the linear-Gaussian and transformed ssms. While non-local variants of the enkf do not suffer from weight degeneracy and can give relatively accurate filtering distribution estimates for sufficiently large ensemble sizes, these sizes are still much larger than those typically used in, for example, nwp ensemble filter systems. For an ensemble size P = 10 we found the local etkf significantly outperformed the non-local etkf on all the metrics we consider in both the linear-Gaussian and transformed ssms.

Table 1: Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the linear-Gaussian st model.
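As an illustration, the smoothness coefficient of Eqs. (5.5)–(5.6) and the time-averaged rmse of Eq. (5.7) can be estimated from an equally weighted ensemble in a few lines of NumPy. This is a minimal sketch under our own naming, not the implementation used for the experiments in the paper.

```python
import numpy as np

def smoothness_coefficient(ensemble):
    """Monte Carlo estimate of the smoothness coefficient in Eqs. (5.5)/(5.6):
    the ensemble average of the summed absolute differences between values at
    neighbouring mesh nodes, with periodic wrap-around (m + 1 mod M)."""
    # ensemble: array of shape (P, M), particle state fields at one time index
    diffs = np.abs(ensemble - np.roll(ensemble, -1, axis=1))
    return diffs.sum(axis=1).mean()

def smoothness_rmse(ensembles, true_gammas):
    """Time-averaged RMSE of Eq. (5.7) between ensemble smoothness estimates
    and reference smoothness coefficients."""
    gamma_hats = np.array([smoothness_coefficient(x) for x in ensembles])
    return np.sqrt(np.mean((gamma_hats - np.asarray(true_gammas)) ** 2))
```

For a single 'sawtooth' particle on four mesh nodes, every neighbouring pair differs by one, so the coefficient is four.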
We used P = 10 for all methods in the experiments here. For the local etkf we use the smooth compactly supported Gaspari and Cohn localisation function ℓ_r defined in Eq. (3.12). We conducted a grid search over a fixed set of localisation radii r, for each r performing five independent runs of the local etkf and recording the performance on the three metrics described in Section 5.1. The results for the linear-Gaussian st model are summarised in Table 1 and for the transformed st model in Table 2. For each metric the minimum, median and maximum values recorded across the five runs are shown, for the value of r which gave the minimum median value of that particular metric. The results for all r values are shown in Fig. G.1 in the Appendix.

The performance on all metrics for both models was relatively stable across the multiple runs. Unsurprisingly the local etkf performs significantly better on the linear-Gaussian st model than on the transformed st model. While for the linear-Gaussian st model the optimal r for each metric are relatively similar, for the transformed st model the optimal r differs significantly across the metrics, meaning any choice of r will incur a performance penalty on some metrics.

Table 2:
Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the transformed st model.

For our proposed sletpf framework we need to choose a pou. Here we construct the pous by (discretely) convolving a mollifier function with the indicator functions on a partition of the spatial domain. As the observations are located on a regular grid, we partition the domain into B equally sized intervals S_b = [(b − 1)/B, b/B) ∀ b ∈ 1:B. For the mollifier function we use a normalised variant of the compactly supported Gaspari and Cohn localisation function ℓ_r in Eq. (3.12), the bump functions then being defined as

$$\phi_b(s_n) = \frac{\sum_{m \in M} \mathbb{1}_{S_b}(s_m) \, \ell_w \circ d(s_n, s_m)}{\sum_{m \in M} \ell_w \circ d(s_n, s_m)} \quad \forall n \in M,\; b \in 1{:}B, \tag{5.9}$$

with w a kernel width parameter determining how many mesh nodes the effective smoothing kernel being discretely convolved with the indicators has support on. For w = M⁻¹ the kernel is non-zero at only one mesh node and no smoothing is applied, corresponding to a hard partition of the space. For w > M⁻¹, the amount of smoothing and overlap between the patches increases with w.

For the experiments with the st models we performed runs with sletpfs using pous with five different numbers of patches B ∈ {32, 64, 128, 256, 512} and four different kernel widths w ∈ {512⁻¹, 256⁻¹, 128⁻¹, 64⁻¹}. We used a Gaspari and Cohn localisation function for the local weight calculation, for each (B, w) pair performing five independent runs for all localisation radii r in the search grid for which median(n_B) lay within a fixed range. The per-node local etpf of Cheng and Reich (2015) can be considered a particular instance of the sletpf framework, here corresponding to the runs with a pou with B = 512 patches and w = 512⁻¹. The set of mesh nodes M̃ used to calculate the per-patch transport costs as in Eq. (4.11) was constructed by subsampling 1:M by a factor min(4, p_n) with p_n = M(B⁻¹ + 2w). The results of the sletpf runs for each of the (B, w, r) parameter combinations are shown for the linear-Gaussian st model in Fig.
6 and for the transformed st model in Fig. 7. In each figure, the rows of plots correspond to different kernel widths w and the three columns to different metrics. On each plot the value of the relevant metric on the vertical axis is plotted against the median number of effective observations per patch on the horizontal axis (we plot against median(n_B) rather than r as it is more directly comparable across different values of B and w). The median values across the five independent runs for each of the numbers of patches B are shown by the coloured curves (see colour key at top of figures) and the surrounding lighter coloured regions indicate the minimum to maximum range of values recorded across the runs (in many cases the across-run variation is too small to be visible). For each metric the best value achieved by the local etkf (as given in Tables 1 and 2) is indicated by the black horizontal dashed line.
Fig 6: Comparison of accuracy of sletpf estimates on linear-Gaussian st ssm.

Considering first the linear-Gaussian st model results, we see that across all parameter combinations and metrics the local etpf methods are outperformed by the best local etkf results. This is as expected, as the linear-Gaussian assumptions made by the etkf are correct in this case, and by better exploiting this model structure we expect the local etkf to outperform the more generic local etpf. Concentrating on the results for filters with hard pous without smoothing in the first row (w = 512⁻¹), we see that the filters with B = M = 512 patches in the pou, corresponding to the Cheng and Reich (2015) scheme, outperform filters using pous with smaller numbers of patches across virtually all median(n_B) values and metrics. This tallies with the findings of Farchi and Bocquet (2018), who found that for an equivalent 'block'-based local etpf scheme the best performance was always achieved with blocks of size one. Considering specifically the rmse(μ̂₁:T, μ₁:T) metric, we see that as the number of patches B decreases the value of the metric across all values of median(n_B) monotonically increases (corresponding to poorer performance). The behaviours for the rmse(σ̂₁:T, σ₁:T) and rmse(γ̂₁:T, γ₁:T) metrics are more complex. For the smoothness coefficient we see that the accuracy of the filter estimates initially decreases as the number of patches is decreased from B = 512 to B = 256 and B = 128. The accuracy of the smoothness estimates however then increases on decreasing the number of patches further to B = 64, and increases again on decreasing the number of patches to B = 32. We believe this non-monotonic relationship between the accuracy of the smoothness estimates and the number of patches in the pou may be explained by the spatial averaging in the computation of the smoothness coefficient: while using fewer larger patches in the pou would be expected to introduce stronger discontinuities at the patch boundaries, due to larger differences in the local weights assigned to each patch, there is a competing effect that as fewer patches are used there are fewer boundaries, and so the spatially averaged error becomes lower despite the individual discontinuities at each patch boundary being larger.

Now comparing the results as the kernel width w, and so the smoothness of the pou, is increased, there are two main trends apparent. Most prominently, the variation in performance across different numbers of patches B decreases as the smoothness of the pou increases, with many of the curves overlapping over much of their ranges for w = 128⁻¹ and w = 64⁻¹, while the optimal performance on each metric remains similar. This suggests using a smooth pou allows a smaller number of patches to be used (and thus a lower computational cost of the assimilation update) while maintaining performance, contrary to what was observed in the hard pou case where using fewer patches always decreased performance.

A second, less obvious effect is that as the kernel width w is increased the lower limit for median(n_B) is increased (similarly, using fewer larger patches also increases the lower limit for median(n_B)). This is the reason for the curves starting at higher median(n_B) as the kernel width increases, corresponding to the values achieved with the smallest r tested. For w = 64⁻¹ we see that all the curves start to the right of the point at which the optimal performance is reached for the other smaller w.
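The partition-of-unity construction used here (indicator functions of equal intervals discretely convolved with a normalised mollifier, Eq. (5.9)) can be sketched as follows. The helper names, and the use of the standard piecewise-rational Gaspari and Cohn (1999) function as the kernel, are our assumptions for illustration, not code from the paper.

```python
import numpy as np

def circular_distance(s1, s2):
    """Distance on the periodic unit interval, d(s, s') = min(|s-s'|, 1-|s-s'|)."""
    d = np.abs(s1 - s2)
    return np.minimum(d, 1.0 - d)

def gaspari_cohn(d, r):
    """Standard compactly supported Gaspari-Cohn correlation function,
    non-zero for distances d < 2r (piecewise fifth-order rational form)."""
    a = d / r
    f = np.zeros_like(a)
    m1 = a <= 1.0
    m2 = (a > 1.0) & (a <= 2.0)
    f[m1] = 1 - (5/3)*a[m1]**2 + (5/8)*a[m1]**3 + (1/2)*a[m1]**4 - (1/4)*a[m1]**5
    f[m2] = (4 - 5*a[m2] + (5/3)*a[m2]**2 + (5/8)*a[m2]**3
             - (1/2)*a[m2]**4 + (1/12)*a[m2]**5 - 2 / (3 * a[m2]))
    return f

def bump_functions(mesh, n_patches, w):
    """Partition-of-unity bump functions as in Eq. (5.9): indicators of B
    equal intervals discretely convolved with a kernel of width w, then
    normalised so the bumps sum to one at every mesh node."""
    M = mesh.shape[0]
    K = gaspari_cohn(circular_distance(mesh[:, None], mesh[None, :]), w)
    phi = np.zeros((n_patches, M))
    for b in range(n_patches):
        ind = ((mesh >= b / n_patches) & (mesh < (b + 1) / n_patches)).astype(float)
        phi[b] = (K * ind[None, :]).sum(axis=1)
    phi /= phi.sum(axis=0, keepdims=True)  # normalise: bumps sum to unity
    return phi
```

Because the interval indicators partition the mesh, the normalisation makes the bump values at each node a convex combination over patches, which is exactly what the interpolation of per-patch updates requires.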
This suggests there is a drawback to making w too large, as it limits how far the number of observations per patch, and so the tendency to local weight degeneracy, can be controlled; in this case it seems the best tradeoff is reached for either w = 256⁻¹ or w = 128⁻¹. Interestingly we also see that the accuracy of the smoothness and standard deviation estimates is poorer for w = 64⁻¹ compared to w = 128⁻¹ even when comparing at the same median(n_B). This could be due to the greater overlap between the patches in this case, with the averaging of the particle values at the overlaps potentially acting to artificially oversmooth and reduce variation in the particles, again suggesting that the appropriate level of smoothing is a tradeoff between several factors.

The results on the transformed st model shown in Fig. 7 show for the most part very similar trends as for the linear-Gaussian st model. The most significant difference is the relative performance of the local etkf and local etpf methods, with in this case the local etpf approaches outperforming the best local etkf results across all parameter values for the rmse(σ̂₁:T, σ₁:T) metric and across a majority of the parameter values tested for the rmse(μ̂₁:T, μ₁:T) metric. As the only difference between these two models is the non-Gaussianity in the filtering distributions introduced by the transformation, these results support the earlier claims that pf-based methods, such as the local etpf and the sletpf proposed in this article, are more robust to non-Gaussianity than enkf methods such as the local etkf.
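The local weight degeneracy referred to above is conventionally monitored via the effective sample size of the self-normalised importance weights within each patch. A generic sketch (our own diagnostic helper, not code from the paper):

```python
import numpy as np

def effective_sample_size(log_weights):
    """Effective sample size 1 / sum(w_i^2) of self-normalised importance
    weights, computed stably from unnormalised log-weights. Values much
    smaller than the ensemble size P indicate weight degeneracy."""
    w = np.exp(log_weights - np.max(log_weights))  # subtract max for stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)
```

Equal weights give an effective sample size equal to the ensemble size, while a single dominant weight drives it towards one; restricting the number of effective observations per patch keeps the per-patch log-weights from becoming too spread out.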
Interestingly the relative performance loss in the local etkf on introducing non-Gaussianity seems to be most severe in the rmse(σ̂₁:T, σ₁:T) metric, suggesting that uncertainty estimates provided by local etkf methods on non-linear-Gaussian models should be treated with particular caution.

Fig 7: Comparison of accuracy of sletpf estimates on transformed st ssm.

In addition to the accuracy of the filter estimates, we are also interested in the relative computational cost of the different methods. Fig. 8 shows the values of the performance metrics achieved by the different sletpf configurations tested, against the corresponding assimilation time (i.e. total filtering time minus the time taken to integrate the model dynamics in the prediction updates) for the transformed st ssm. Each of the three plots corresponds to one of the performance metrics, the vertical coordinate of each marker indicates the minimum value of the metric achieved across all localisation radii r for a particular (B, w) combination, with the marker colour indicating the number of patches B and the marker symbol the kernel width w. The horizontal coordinate of each marker indicates the median assimilation time across the five independent runs for the corresponding (B, w, r) values. For the pous with B = 512 patches, only the case without smoothing (w = 512⁻¹), corresponding to the Cheng and Reich (2015) local etpf, is shown, with the smoother pous in this case substantially increasing the assimilation times without any gain in accuracy.

Fig 8: Accuracy versus run time for sletpf in transformed st ssm.

As would be expected due to the lower number of ot problems that need to be solved, in general the assimilation time decreases as the number of patches B in the pou is decreased for a fixed smoothing kernel width w. Note however that the assimilation time increases with the smoothing kernel width w (primarily due to the increased number of non-zero terms in the summation in Eq. (4.7)), which results for example in the assimilation time for the scheme with B = 256 and w = 64⁻¹ being slightly larger than for the runs under the Cheng and Reich (2015) settings of B = 512 and w = 512⁻¹. Although there is therefore a tradeoff in assimilation time between decreasing the number of patches B and increasing the kernel width w, we still find that there are combinations of (B, w) values which maintain the accuracy of the Cheng and Reich (2015) scheme while giving substantial reductions in assimilation time. In particular the runs with B = 128 and w = 256⁻¹ and with B = 128 and w = 128⁻¹ achieve nearly identical accuracies on the mean and standard deviation rmse metrics as B = 512 and w = 512⁻¹ (and a substantially improved smoothness coefficient rmse) while reducing the assimilation time by slightly more than a factor of two. At the cost of around a 10% increase in the mean and standard deviation rmses, a more substantial reduction in the assimilation time by a factor of four can be achieved by using a pou with B = 64 patches, for a range of the kernel widths w tested.

Although the absolute values of the assimilation times in Fig. 8 are dependent on the computational environment used to run the experiments, the relative timings should still be informative as the same sletpf implementation was used to run all the experiments. We purposefully did not include the local etkf runs on the plots, as any differences in the assimilation times for the local etkf versus sletpf approaches are likely to be as much due to the particulars of the software implementations and hardware used as to any fundamental differences in performance. In particular, more time was spent optimising the implementation of the sletpf algorithm than our local etkf implementation, so the relative timings are likely to unfairly favour the sletpf runs.
The computational complexity for the local etkf however is O(MP), which is the same as for the local etpf scheme of Cheng and Reich (2015), so it would be expected that there are regimes in which the sletpf assimilation updates (with complexity Õ(BP + |M̃|P + MP)) will have a computational advantage over the local etkf updates.

As our second test model we consider a stochastic variant of a fourth-order non-linear partial differential equation (pde), often termed the Kuramoto–Sivashinsky (ks) equation, which has been independently derived as a model of various physical phenomena (Kuramoto and Tsuzuki, 1976; Sivashinsky, 1977) and studied as an example of a relatively simple pde system exhibiting spatio-temporal chaos (Hyman and Nicolaenko, 1986). On a spatial domain S = [0,
1) with a distance function d(s, s′) = min(|s − s′|, 1 − |s − s′|) and periodic boundary conditions, the deterministic dynamics of the ks pde model can be described by

$$\partial_\tau \zeta(s, \tau) = -\left( \frac{\partial_s^2}{\theta_1^2} + \frac{\partial_s^4}{\theta_1^4} \right) \zeta(s, \tau) - \frac{\partial_s}{2\theta_1} \left( \zeta(s, \tau)^2 \right), \tag{5.10}$$

where θ₁ is a length-scale parameter, with the system dynamics becoming chaotic for large values of θ₁ (Hyman and Nicolaenko, 1986).

As our focus is on filtering in models with stochastic dynamics, we use a related spde model on the same spatial domain, described by

$$\mathrm{d}\zeta(s, \tau) = \left( -\left( \frac{\partial_s^2}{\theta_1^2} + \frac{\partial_s^4}{\theta_1^4} + \theta_2 \right) \zeta(s, \tau) - \frac{\partial_s}{2\theta_1} \left( \zeta(s, \tau)^2 \right) \right) \mathrm{d}\tau + (\kappa \circledast_s \mathrm{d}\eta)(s, \tau), \tag{5.11}$$

where ζ : S × T → R is a real-valued space-time varying process, θ₁ ∈ R≥0 is the non-negative length-scale parameter, θ₂ ∈ R≥0 is a non-negative parameter controlling dissipation due to damping, κ : S → R≥0 is a spatial kernel function and η : S × T → R is a space-time varying noise process. In addition to the introduction of the additive noise process, we also introduce a linear damping component controlled in magnitude by θ₂. This is motivated by our empirical observation in simulations that the stochastic system can become unstable when numerically integrated over long time periods without additional damping.

We use a similar spectral approach to define the spatial basis function expansions of the state and noise processes ζ and η and kernel κ as for the st model, again using M = 512 mesh nodes. Full details of the discretisation used are given in Appendix F.2 and the values of all the parameters used in Table F.2.
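As a concrete illustration of the spectral treatment of the deterministic ks dynamics of Eq. (5.10), a minimal pseudo-spectral integrator with a semi-implicit Euler step can be written as below. The parameter value, step size and scheme are illustrative choices of ours, and not necessarily those used for the experiments detailed in Appendix F.2.

```python
import numpy as np

def ks_step(z, dt, theta, k):
    """One semi-implicit Euler step for the deterministic KS dynamics: the
    stiff linear term is treated implicitly in Fourier space and the
    non-linear advection term explicitly."""
    lin = k**2 / theta**2 - k**4 / theta**4        # Fourier symbol of linear operator
    z_hat = np.fft.rfft(z)
    nonlin_hat = -(1j * k / (2 * theta)) * np.fft.rfft(z**2)
    z_hat = (z_hat + dt * nonlin_hat) / (1 - dt * lin)
    return np.fft.irfft(z_hat, n=z.shape[0])

M = 512                                      # mesh nodes on S = [0, 1)
k = 2 * np.pi * np.fft.rfftfreq(M, d=1 / M)  # wavenumbers on unit periodic domain
theta = 32 * np.pi                           # illustrative length-scale parameter
rng = np.random.default_rng(0)
z = 0.1 * rng.standard_normal(M)
for _ in range(100):
    z = ks_step(z, dt=1e-4, theta=theta, k=k)
```

Treating the linear term implicitly keeps the step stable despite the stiff fourth-order dissipation at high wavenumbers; the stochastic variant of Eq. (5.11) would additionally add the damping term θ₂ to the linear symbol and a spatially correlated noise increment at each step.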
This results in a coupled non-linear system of sdes which governs the evolution of the state Fourier coefficients; unlike the linear-Gaussian dynamics of the st model, these sdes do not have an analytic solution and so need to be numerically integrated. We assume the state is observed at T = 200 time points, with S = 10 integrator steps performed between each observation time; the resulting state transition operators F₁:T are non-linear and do not admit closed-form transition densities.

We consider ssms in which these ks state dynamics are noisily observed via both linear and non-linear observation operators. In both cases the state is assumed to be observed at L = 64 equispaced points in the spatial domain, with direct observations of the state values at these points in the linear case, and via a hyperbolic tangent (tanh) function in the non-linear case. The simulated state and observation sequences used in the experiments for both the linearly and non-linearly observed ks ssms are shown in Fig. F.2 (with the same simulated state sequence being used in both cases, only the generated observations differing). Compared to the st model, the ks model exhibits more complex and unpredictable state dynamics and thus can be seen as a more challenging test case for the local ensemble filtering methods.

Both the linearly and non-linearly observed ks ssms have non-Gaussian filtering distributions which cannot be exactly inferred, unlike in the linear-Gaussian st model. We therefore used a mcmc method to generate proxy ground-truths for the filtering distributions, constructing Markov chains which leave invariant the joint distribution across the M = 512 dimensional state vectors at all T = 200 time points given the observed sequence, i.e. P(x₁:T ∈ dx₁:T | y₁:T = y₁:T), with the filtering distributions corresponding to marginals of this joint smoothing distribution. Due to the large overall state dimension MT ≈ 10⁵ we use a gradient-based Hamiltonian Monte Carlo algorithm (Duane et al., 1987) to generate the chains. For each of the linearly and non-linearly observed cases we ran five parallel chains of 200 samples each, with each chain using an independently seeded pseudo-random number generator. Details of the set up used for the mcmc runs are given in Appendix H. The 'ground-truth' values for the filtering distribution means μ₁:T, standard deviations σ₁:T and smoothness coefficients γ₁:T were estimated using the combination of the final 100 x₁:T samples of each of the five chains for each ssm, i.e. a total of 500 samples per ssm.

As for the st ssms, we used P = 10 particles for all the local ensemble filter runs on the ks ssms. For the local etkf we performed a grid search equivalent to that for the st models, performing five independent runs for each localisation radius r in the search grid for both the linearly and non-linearly observed ks ssms. The results are summarised in Tables 3 and 4, with plots of the full grid search results shown in Fig. G.1 in Appendix G.

Table 3: Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the linearly observed ks model.

Table 4: Values of metrics (minimum, median and maximum across five runs) at optimal localisation radius for local etkf on the non-linearly observed ks model.

Although the absolute values of the rmse metrics in Table 3 are higher than for the local etkf runs on the linear-Gaussian st model, given that the non-linear state dynamics in the ks model mean the filtering distributions are no longer constrained to remain Gaussian, the local etkf performs remarkably well on the linearly observed ks ssm, recovering relatively accurate estimates of the filtering distribution means, standard deviations and smoothness coefficients.
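A single transition of the Hamiltonian Monte Carlo algorithm used to generate the ground-truth chains has the familiar leapfrog-plus-Metropolis structure sketched below. This is a generic textbook sketch with a unit mass matrix and our own naming; it omits the tuning details given in Appendix H.

```python
import numpy as np

def hmc_step(q, log_prob, grad_log_prob, step_size, n_steps, rng):
    """One Hamiltonian Monte Carlo transition (Duane et al., 1987): resample
    a Gaussian momentum, integrate Hamiltonian dynamics with the leapfrog
    scheme, then apply a Metropolis accept/reject correction."""
    p = rng.standard_normal(q.shape)
    current_h = -log_prob(q) + 0.5 * p @ p
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(q_new)   # initial momentum half-step
    for _ in range(n_steps - 1):
        q_new += step_size * p_new                    # full position step
        p_new += step_size * grad_log_prob(q_new)     # full momentum step
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(q_new)   # final momentum half-step
    new_h = -log_prob(q_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < current_h - new_h:     # Metropolis correction
        return q_new
    return q
```

The gradient information lets the chain make large coherent moves in the roughly 10⁵-dimensional joint smoothing distribution, where random-walk proposals would mix far too slowly.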
This is concordant with the widespread empirical success of local enkf approaches even when applied to models with non-linear state dynamics (Evensen, 2009), but also suggests that the filtering distributions in this case may have remained close to Gaussian despite the non-linear dynamics.

Swapping the linear observations for a non-linear observation operator however can be seen to have a detrimental effect on the accuracy of the local etkf estimates of the filtering distributions. The optimal values achieved for each of the three rmse metrics shown for the non-linearly observed case in Table 4 show significant increases over the corresponding figures for the linearly observed case in Table 3, with the errors in the standard deviation estimates showing the largest increase. This highlights that although local enkf methods are robust to some degree of non-linearity in the dynamics or observation model, performance is still sensitive to strong departures from Gaussianity.

The effect of the non-Gaussianity induced by the non-linear observation operator can also be seen by comparing rank histograms (i.e. the ranks of the true state values within the ensemble across all time and spatial indices) for single runs of the local etkf on the linearly and non-linearly observed ks ssms, as shown in Fig. 9a and Fig. 9b respectively. The localisation radius r was set to the value found in the grid searches to give the lowest mean estimate rmse. For a well calibrated ensemble the rank histograms should be close to uniform (indicated by the dashed black line on the plots). While for the linearly observed case the minor departures from uniformity can be plausibly attributed to sampling noise, the histogram for the non-linearly observed case has a clear 'double-humped' non-uniform shape, suggesting the ensemble estimates of the filter distributions have greater kurtosis than the true filtering distributions.

Fig 9: Rank histograms for single local etkf runs on ks ssms: (a) linearly observed; (b) non-linearly observed.

For the sletpf runs we use the same method to construct the pous as described in the preceding section for the st model experiments. We again considered pous with B ∈ {32, 64, 128, 256, 512} patches and smoothing kernel widths of w ∈ {512⁻¹, 256⁻¹, 128⁻¹, 64⁻¹}. For each (B, w) pair we tested all localisation radii r in the search grid for which median(n_B) was in the range [1, 5] for the linearly observed ks ssm and in the range [2,
6] for the non-linearly observed ks ssm. For each (B, w, r) parameter triple tested, we performed five independent filtering runs, with the median values recorded for the three metrics shown by the coloured curves in Fig. 10 for the linearly observed ssm and in Fig. 11 for the non-linearly observed ssm, along with the best values achieved by the local etkf on each metric shown by the dashed horizontal lines. The plots in Figs. 10 and 11 have the same format as Figs. 6 and 7 for the st model experiments.

Fig 10: Comparison of accuracy of sletpf estimates on linearly observed ks ssm.

From the linearly observed ks ssm results in Fig. 10 we see that the sletpf was outperformed across all parameter settings and metrics by the best local etkf results. This reinforces the point that local enkf methods are a strongly performant approach and can often be the best choice even in models with non-linear dynamics, where the Gaussianity assumptions are not valid, due to their robust performance when using small ensemble sizes. A further advantage of local enkf methods over local pfs is that they naturally maintain smoothness properties of the state field particles, as evidenced by the low smoothness coefficient errors achieved by the local etkf across all model configurations. Local pf type approaches, such as the sletpf algorithm proposed here, should generally therefore be considered as a fallback solution for cases where local enkf methods are known, or at least suspected, to give poor accuracy.

Considering the performance of the sletpf on the linearly observed ks ssm for different (B, w) parameter settings, we see similar trends as observed in the st model experiments, though with some differences in the details. The performances on the mean and standard deviation rmse metrics for pous with different numbers of patches B at a fixed smoothing kernel width w show less variation than seen in the st model experiments. Even for the hard pou case without smoothing (w = 512⁻¹, top row of Fig. 10), only the runs with a pou with B = 32 patches show a noticeable drop in mean and standard deviation estimate accuracies across most median(n_B) values, and even in the B = 32 case the relative drops in accuracy are still quite minor. The most obvious effect of increasing the kernel width w in this model is therefore the improved accuracy of the smoothness coefficient estimates for larger w values. This suggests that in the ks model, although using a smoother pou does reduce the introduction of artificial discontinuities into the state field particles, these discontinuities have less of a negative effect on filtering performance than for the st model, perhaps due to a stronger diffusive smoothing element in the model dynamics.

The results for the non-linearly observed ks ssm in Fig. 11 show similar relative performances for the different sletpf configurations as for the linearly observed case, with a general increase in the absolute rmse values across the board.
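The rank-histogram diagnostic used in Fig. 9, and again for the sletpf ensembles below, pools the rank of each true state value within the ensemble across all time and spatial indices. A minimal sketch, with our own naming and array layout:

```python
import numpy as np

def rank_histogram(truths, ensembles):
    """Frequencies of the ranks of true state values within the ensemble,
    pooled across all time and spatial indices; a well calibrated ensemble
    gives approximately uniform rank frequencies."""
    # truths: (T, M) true state fields; ensembles: (T, P, M) particle values.
    T, P, M = ensembles.shape
    ranks = (ensembles < truths[:, None, :]).sum(axis=1)   # each rank in 0..P
    counts = np.bincount(ranks.ravel(), minlength=P + 1)
    return counts / counts.sum()
```

Peaks at the extreme ranks indicate under-dispersion of the ensemble relative to the truth, while a double-humped interior shape indicates a mismatch in the shape of the estimated distribution, as seen for the local etkf on the non-linearly observed ks ssm.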
The corresponding increases in the rmse values for optimal tunings of the local etkf are however significantly larger, meaning that for this model the sletpf approaches show a minor improvement in the accuracy of the mean estimates compared to the local etkf across virtually all configurations, and perform comparably in terms of the accuracy of the standard deviation estimates, with slightly better performance for some configurations and slightly poorer for others. Again the smoothness of the pou used does not seem to have a strong effect on performance in terms of the mean and standard deviation estimates here, with the main change as the smoothing kernel width w is increased being the improved accuracy of the smoothness coefficient estimates, corresponding to improved reproduction of the smoothness of the fields under the true filtering distributions.

Fig 11: Comparison of accuracy of sletpf estimates on non-linearly observed ks ssm.

As for the local etkf ensemble estimates of the ks ssm filtering distributions, we can also use rank histograms for the sletpf ensembles as an alternative check of the calibration of the filtering distribution estimates. The rank histogram for an ensemble generated for the non-linearly observed ks ssm by a sletpf with a pou with B = 512 patches and kernel width w = 512⁻¹ (i.e. corresponding to the per-node local etpf) is shown in Fig. 12a, and for an ensemble generated by a sletpf with a pou with B = 64 patches and kernel width w = 128⁻¹ in Fig. 12b. In both cases the localisation radius r was set to the value from the grid search giving the lowest mean estimate rmse. Compared to the corresponding rank histogram for the local etkf in Fig. 9b, the histograms for both sletpf configurations are much closer to uniform. The peaks at the extreme ranks in both histograms are characteristic of the ensembles underestimating the dispersion of the filtering distribution in the tails, with this discrepancy appearing to be stronger for the sletpf using fewer patches here.

Fig 12: Rank histograms for single sletpf runs on non-linearly observed ks ssm: (a) B = 512, w = 512⁻¹; (b) B = 64, w = 128⁻¹.

As in Fig. 8 for the transformed st model runs, it is instructive to also compare the relative computational cost of the different sletpf configurations versus their performance on the three filtering accuracy metrics. Fig. 13 shows the time taken to perform the assimilation updates (horizontal axes) versus the value recorded for each of the three rmse metrics (vertical axes), for each of the (B, w) pou configurations. The markers show the median values across the five runs for the localisation radius r which achieved the minimum value for that particular metric for the (B, w) values in question.

Fig 13: Accuracy versus run time for sletpf in non-linearly observed ks ssm.
Due to the smaller drop-off in filtering accuracy for pous with fewer patches B compared to the st models, here we see we are able to achieve even larger improvements in computational efficiency compared to the local etpf scheme of Cheng and Reich (2015) (corresponding to B = 512, w = 512⁻¹) while retaining the same filtering accuracy. In particular the sletpfs with B = 64 patches in the pou are able to achieve the same mean estimate accuracy, a slight improvement in the accuracy of the standard deviation estimates, and a substantial improvement in the accuracy of the smoothness coefficient estimates, while having an assimilation time that is around a quarter of that of the sletpf which computes separate ot transport maps for each mesh node (B = 512). Further, in this case the filtering accuracy is largely unaffected by the choice of smoothing kernel width w, other than an improvement in the smoothness coefficient estimates for larger w values. At the cost of a slight increase in all three rmses, the sletpfs with pous with B = 32 patches give a further approximate factor-two decrease in assimilation time, leading to around an eight-fold decrease in assimilation time compared to the per-node local etpf.
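Rank histograms like those in Fig. 12 can be computed with a short routine: for each observed time and location, record the rank of the true value within the ensemble, here breaking ties randomly (one common convention; the exact tie treatment used for the figures is not specified here). A minimal numpy sketch:

```python
import numpy as np

def rank_histogram(truth, ensembles, rng=None):
    """Compute rank-histogram counts for an ensemble filter run.

    truth:     array of shape (T, M) of true state values.
    ensembles: array of shape (T, P, M) of ensemble member values.
    Returns counts over the P + 1 possible ranks of the truth.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, P, M = ensembles.shape
    # Rank = number of ensemble members strictly below the truth,
    # with ties broken randomly to avoid spurious histogram peaks.
    below = (ensembles < truth[:, None, :]).sum(axis=1)
    ties = (ensembles == truth[:, None, :]).sum(axis=1)
    ranks = below + rng.integers(0, ties + 1)
    return np.bincount(ranks.ravel(), minlength=P + 1)
```

A near-uniform histogram indicates a calibrated ensemble, while the U-shape noted above (peaks at the extreme ranks) indicates underdispersion in the tails.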
6. DISCUSSION
In this article we have proposed a new scheme for constructing local particle filters for state inference in spde models of spatially-extended dynamical systems. The local etpf (Cheng and Reich, 2015), although having the desirable property of improved robustness to non-Gaussianity in the filtering distributions compared to local enkf approaches, has two key shortcomings: (i) the state fields produced by the assimilation step fail to maintain the smoothness properties of the predictive ensemble members, potentially leading to numerical instabilities when used to filter spde models, and (ii) as an ot problem must be solved for every node in the spatial mesh, the assimilation updates can be costly for dense meshes.

Our approach to solving both issues is to softly partition the spatial domain using a partition of unity: a finite set of non-negative bump functions which tile the domain and sum to unity at all points. By computing an ot map for the patch of the spatial domain associated with each bump function and then using the bump functions to smoothly interpolate these maps across the domain, we are able to smoothly combine different regions of the predictive ensemble particles. As well as allowing the smoothness of the spatial fields to be maintained during the assimilation step, the proposed approach reduces the $\tilde{\mathcal{O}}(MP^3)$ cost of the per-node local etpf assimilation updates to $\tilde{\mathcal{O}}(BP^3 + |\tilde{\mathcal{M}}|P^2 + MP^2)$.
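A minimal numpy sketch of this blended-transport construction is given below. It uses entropy-regularised Sinkhorn iterations as a stand-in for the exact ot solves, and a bump-weighted squared-distance cost between particles; the function names, the cost choice and the regularisation are our own illustrative assumptions, not the exact scheme used in the experiments.

```python
import numpy as np

def sinkhorn_coupling(w, cost, eps=0.1, n_iter=300):
    """Entropy-regularised coupling between weights w and uniform 1/P;
    a stand-in for an exact ot solve. Columns sum exactly to 1/P."""
    P = len(w)
    K = np.exp(-cost / (eps * cost.mean() + 1e-12))
    v = np.ones(P)
    for _ in range(n_iter):
        u = w / (K @ v)
        v = (1.0 / P) / (K.T @ u)
    return u[:, None] * K * v[None, :]

def smooth_local_etpf_update(X, phi, local_weights):
    """Blend per-patch etpf transports with a partition of unity.

    X:             (P, M) predictive ensemble values at the M mesh nodes.
    phi:           (B, M) bump function values, summing to one over b.
    local_weights: (B, P) normalised importance weights for each patch.
    Returns the (P, M) filtering ensemble.
    """
    P, M = X.shape
    X_new = np.zeros_like(X)
    for b in range(phi.shape[0]):
        # Squared-distance cost between particles restricted to patch b
        # (weighted by the bump function -- an illustrative localised cost).
        Xb = X * phi[b][None, :]
        cost = ((Xb[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
        T = sinkhorn_coupling(local_weights[b], cost)
        # etpf-style transform: each new particle is a convex combination
        # of the predictive particles, blended across patches by phi.
        X_new += phi[b][None, :] * ((P * T).T @ X)
    return X_new
```

Because the bump functions sum to one and each per-patch transform applies convex-combination coefficients, the blended update interpolates smoothly between the patch transports wherever the bump functions are smooth.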
If we increase the mesh resolution by using a larger number of nodes M, while keeping the number of patches B and number of subsampled nodes $|\tilde{\mathcal{M}}|$ fixed, the computational cost of the assimilation update only needs to scale at rate $\tilde{\mathcal{O}}(MP^2)$ with M, which could be considered as the lower bound for an ensemble transform update to P particles of $\tilde{\mathcal{O}}(M)$ dimension. We demonstrated in the numerical experiments that the resulting scheme is able to produce, at often significantly reduced computational cost, ensemble estimates of the filtering distributions for state space models with equivalent accuracy and improved smoothness compared to the local etpf of Cheng and Reich (2015). Although the experiments were restricted to models on one-dimensional spatial domains, in most applications of interest the spatial domain will be two- or three-dimensional. Our proposed scheme naturally carries over to this setting, and as the mesh sizes in such models will tend to be significantly larger, the potential computational savings are even greater. Further, while we concentrated here on filtering in spatial models which are observed at point locations, our scheme could be extended to models with spatially distributed observations by partitioning the spatial domain according to the geometry of the observation processes.

The localisation approach to overcoming weight degeneracy when applying pfs to spatial models considered here could also be combined with other methods for improving pf performance in high-dimensional ssms. In particular, tempering approaches split the usual single prediction and assimilation update per observation time into multiple updates which target a sequence of distributions bridging between the filtering distributions at adjacent observation times (Frei and Künsch, 2013; Johansen, 2015; Beskos et al., 2017; Svensson, Schön and Lindsten, 2018; Herbst and Schorfheide, 2019).
Tempering could be paired with our framework to further improve its robustness to high-dimensional and strongly informative observations, with the use of multiple assimilation updates per observation time when tempering making the reduced computational cost and improved smoothness preservation of our approach particularly important.
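As an illustration of the tempering idea, the next inverse temperature in a bridging sequence is commonly chosen adaptively so that the tempered incremental weights retain a target effective sample size; the bisection helper below is a generic sketch of this adaptive choice, not part of the proposed scheme.

```python
import numpy as np

def next_temperature(log_g, beta, target_ess_frac=0.5):
    """Choose the next inverse temperature by bisection so the tempered
    incremental weights w_p proportional to g_p^(beta' - beta) retain a
    target effective sample size fraction.

    log_g: (P,) log observation densities of the current particles.
    beta:  current inverse temperature in [0, 1).
    """
    def ess_frac(db):
        lw = db * (log_g - log_g.max())   # stabilised log weights
        w = np.exp(lw)
        w /= w.sum()
        return 1.0 / (len(w) * (w ** 2).sum())

    lo, hi = 0.0, 1.0 - beta
    if ess_frac(hi) >= target_ess_frac:
        return 1.0                        # can jump straight to the target
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ess_frac(mid) >= target_ess_frac else (lo, mid)
    return beta + lo
```

Each tempering stage then reweights (and resamples or transports) the ensemble with exponent `beta' - beta`, repeating until the inverse temperature reaches one.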
APPENDIX A: ENSEMBLE TRANSFORM KALMAN FILTER
In this appendix we describe the details of the etkf assimilation update (Bishop, Etherton and Majumdar, 2001) and show how it can be expressed in the form of the letf framework discussed in Section 2.6. We first introduce predictive and filtering ensemble matrices, respectively defined as

$$\tilde{X}_t = \begin{bmatrix} \tilde{x}^1_t & \tilde{x}^2_t & \cdots & \tilde{x}^P_t \end{bmatrix}^{\mathsf{T}} \quad\text{and}\quad X_t = \begin{bmatrix} x^1_t & x^2_t & \cdots & x^P_t \end{bmatrix}^{\mathsf{T}}. \tag{A.1}$$

Using the following linear operators

$$\varepsilon = \tfrac{1}{P} \mathbf{1}_P^{\mathsf{T}}, \qquad \Delta = \tfrac{1}{\sqrt{P-1}} \left( \mathrm{I}_P - \mathbf{1}_P \varepsilon \right), \tag{A.2}$$

the predictive and filtering ensemble means can then be compactly expressed as

$$\tilde{m}_t^{\mathsf{T}} = \varepsilon \tilde{X}_t \quad\text{and}\quad m_t^{\mathsf{T}} = \varepsilon X_t, \tag{A.3}$$

and similarly the predictive and filtering ensemble covariances can be written as

$$\tilde{C}_t = (\Delta \tilde{X}_t)^{\mathsf{T}} (\Delta \tilde{X}_t) \quad\text{and}\quad C_t = (\Delta X_t)^{\mathsf{T}} (\Delta X_t). \tag{A.4}$$

Assuming initially linear-Gaussian observations as in Eq. (2.12), then by substituting the expressions for the empirical covariances in Eq. (A.4) into the Kalman filter covariance assimilation update in Eq. (2.13a) and applying the identity

$$\mathrm{I}_P - \Delta \tilde{Y}_t \left( R_t + \tilde{Y}_t^{\mathsf{T}} \Delta^2 \tilde{Y}_t \right)^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta = \left( \mathrm{I}_P + \Delta \tilde{Y}_t R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1}$$

with $\tilde{Y}_t = \tilde{X}_t H_t^{\mathsf{T}}$, we have

$$(\Delta X_t)^{\mathsf{T}} (\Delta X_t) = (\Delta \tilde{X}_t)^{\mathsf{T}} \left( \mathrm{I}_P + \Delta \tilde{Y}_t R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1} (\Delta \tilde{X}_t). \tag{A.5}$$

Defining $S_t$ as the symmetric matrix square root of the central term on the right-hand side of Eq. (A.5), i.e.

$$S_t^2 = S_t S_t = \left( \mathrm{I}_P + \Delta \tilde{Y}_t R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1}, \tag{A.6}$$

we can then compute a family of solutions of Eq. (A.5) for the filtering ensemble projection $\Delta X_t$ in terms of the predictive ensemble projection $\Delta \tilde{X}_t$ as

$$\Delta X_t = Q S_t \Delta \tilde{X}_t, \tag{A.7}$$

where $Q$ is an arbitrary $P \times P$ orthogonal matrix. For the etkf generally $Q = \mathrm{I}_P$ is chosen, corresponding to directly transforming by the symmetric square root. Now considering the Kalman assimilation update for the mean in Eq. (2.13b), substituting the expressions for the ensemble empirical means and covariances in Eqs. (A.3) and (A.4) and using the definition of the square-root matrix $S_t$ in Eq. (A.6), we have that

$$\varepsilon X_t = \varepsilon \tilde{X}_t + (y_t^{\mathsf{T}} - \varepsilon \tilde{Y}_t) R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta S_t^2 \Delta \tilde{X}_t. \tag{A.8}$$

From the definition of $\Delta$ in Eq. (A.2) we have that $\mathrm{I}_P = \mathbf{1}_P \varepsilon + \sqrt{P-1}\, \Delta$ and so

$$X_t = \mathbf{1}_P \varepsilon X_t + \sqrt{P-1}\, \Delta X_t \tag{A.9}$$
$$\phantom{X_t} = \left( \mathbf{1}_P \varepsilon + \mathbf{1}_P (y_t^{\mathsf{T}} - \varepsilon \tilde{Y}_t) R_t^{-1} \tilde{Y}_t^{\mathsf{T}} \Delta S_t^2 \Delta + \sqrt{P-1}\, S_t \Delta \right) \tilde{X}_t, \tag{A.10}$$

with the matrix term in parentheses defining the coefficients $a_t^{1:P,1:P}$ of an letf assimilation update as in Eq. (2.11).

In the above it was assumed the observation model is linear-Gaussian. In the case of a more general observation model of the form

$$G_t(x, v) = H_t(x) + v, \quad v_t \sim \mathcal{N}(0, R_t), \tag{A.11}$$

where now $H_t$ is a potentially non-linear operator, then by observing that all occurrences of $H_t$ in Eqs. (A.6) and (A.10) are via $\tilde{Y}_t = \tilde{X}_t H_t^{\mathsf{T}}$, for non-linear $H_t$ we can instead define the predictive observation ensemble matrix $\tilde{Y}_t$ as

$$\tilde{Y}_t = \begin{bmatrix} H_t(\tilde{x}^1_t) & H_t(\tilde{x}^2_t) & \cdots & H_t(\tilde{x}^P_t) \end{bmatrix}^{\mathsf{T}}. \tag{A.12}$$

The etkf formulation of a square-root enkf has the advantage of only requiring cubic-cost matrix operations for matrices of size $P \times P$ (due to the conditional independence assumptions $R_t$ is block diagonal, and so the cost of computing $R_t^{-1}$ is at worst $\mathcal{O}(LK^3)$ with in general $K \ll P$).

For all enkf methods, the assimilation updates are only consistent with the analytic assimilation update in Eq. (2.7) as $P \to \infty$ for linear-Gaussian models. In models where the state update and observation operators are only weakly non-linear, the filtering distribution at each time index $\pi_t$ can remain 'close' to Gaussian and the enkf updates will often give reasonable estimates of the filtering distribution (Evensen, 2009). For models with highly non-Gaussian filtering distributions, however, enkf methods will typically perform poorly.

A local version of the etkf algorithm was proposed in Hunt, Kostelich and Szunyogh (2007). In the global etkf assimilation update summarised in Eq. (A.10) the linear transform coefficients depend on the current predictive state ensemble values $\tilde{x}^{1:P}_t$ only via the $P \times KL$ observation ensemble matrix $\tilde{Y}_t$.
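A minimal numpy sketch of the global etkf update in Eq. (A.10) (symmetric square root, $Q = \mathrm{I}_P$; the array shapes and helper names are our own):

```python
import numpy as np

def etkf_update(X_pred, obs_op, y, R):
    """Global etkf assimilation update of Eq. (A.10) with Q = I_P.

    X_pred: (P, D) predictive ensemble matrix (rows are particles).
    obs_op: maps a (D,) state vector to its (K,) predicted observation.
    y:      (K,) observation vector;  R: (K, K) noise covariance.
    Returns the (P, D) filtering ensemble matrix.
    """
    P = X_pred.shape[0]
    eps = np.full(P, 1.0 / P)                          # mean operator (A.2)
    Delta = (np.eye(P) - np.outer(np.ones(P), eps)) / np.sqrt(P - 1)
    Y = np.stack([obs_op(x) for x in X_pred])          # obs ensemble (A.12)
    Rinv = np.linalg.inv(R)
    # Symmetric square root S with S @ S = (I + Delta Y R^-1 Y^T Delta)^-1.
    evals, evecs = np.linalg.eigh(np.eye(P) + Delta @ Y @ Rinv @ Y.T @ Delta)
    S = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # Row vector of mean-update coefficients, then full letf coefficients.
    row = (y - eps @ Y) @ Rinv @ Y.T @ Delta @ (S @ S) @ Delta
    coeffs = np.outer(np.ones(P), eps + row) + np.sqrt(P - 1) * (S @ Delta)
    return coeffs @ X_pred
```

For a linear observation operator the resulting ensemble mean and covariance coincide exactly with the Kalman filter update applied to the ensemble empirical moments, as the derivation above guarantees.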
The local etkf algorithm scales the dependence of the update coefficients at each mesh node on the columns of $\tilde{Y}_t$ via a localisation function $\ell_r : [0, \infty) \to [0, 1]$ satisfying the conditions in Eq. (3.9) for some localisation radius $r > 0$, such that observations at a distance more than $r$ from the mesh node are ignored in the corresponding local assimilation update. For each of the $M$ mesh nodes a localisation kernel is then defined by applying $\ell_r$ to the distances between the mesh nodes and the observation locations

$$k_m^{\mathsf{T}} = \left[ \ell_r(d(s_m, s^{\mathrm{o}}_1))\, \mathbf{1}_K^{\mathsf{T}}, \cdots, \ell_r(d(s_m, s^{\mathrm{o}}_L))\, \mathbf{1}_K^{\mathsf{T}} \right] \quad \forall m \in 1{:}M. \tag{A.13}$$

We can then define local effective observation noise precision matrices $\tilde{R}^{-1}_{t,1:M}$

$$\tilde{R}^{-1}_{t,m} = R^{-1}_t \odot (k_m k_m^{\mathsf{T}}) \quad \forall m \in 1{:}M, \tag{A.14}$$

where $\odot$ indicates the elementwise or Hadamard product between equal-sized tensors. The local etkf assimilation update is then

$$X_{t,m} = \left( \mathbf{1}_P \varepsilon + \mathbf{1}_P (y_t^{\mathsf{T}} - \varepsilon \tilde{Y}_t) \tilde{R}^{-1}_{t,m} \tilde{Y}_t^{\mathsf{T}} \Delta \tilde{S}_{t,m}^2 \Delta + \sqrt{P-1}\, \tilde{S}_{t,m} \Delta \right) \tilde{X}_{t,m}, \tag{A.15}$$

where the local square-root matrix $\tilde{S}_{t,m}$ is defined by

$$\tilde{S}_{t,m}^2 = \tilde{S}_{t,m} \tilde{S}_{t,m} = \left( \mathrm{I}_P + \Delta \tilde{Y}_t \tilde{R}^{-1}_{t,m} \tilde{Y}_t^{\mathsf{T}} \Delta \right)^{-1}. \tag{A.16}$$
This local assimilation update is equivalent to replacing each observation ensemble matrix term $\tilde{Y}_t$ and observation vector term $y_t$ in the global assimilation update in Eq. (A.10) with $\tilde{Y}_t \odot (\mathbf{1}_P k_m^{\mathsf{T}})$ and $y_t \odot k_m$ respectively. As $k_m$ has zero entries for all indices corresponding to observation locations more than $r$ in distance from $s_m$, in practice when implementing the local etkf assimilation update the computations can be performed with only the non-zero submatrices of $\tilde{Y}_t \odot (\mathbf{1}_P k_m^{\mathsf{T}})$ and $y_t \odot k_m$ and the corresponding submatrix of $R^{-1}_t$.

As separate assimilation updates need to be computed for each mesh node, the computational cost of the local etkf scales linearly with the number of mesh nodes $M$. The computation for each mesh node is of order $\mathcal{O}(P^3)$ due to the requirement to calculate a matrix decomposition of the $P \times P$ matrix inside the parentheses on the right-hand side of Eq. (A.16). On a sequential architecture the overall computation time will therefore have an $\mathcal{O}(MP^3)$ scaling. However, as each of the local assimilation updates can be independently computed in parallel, with a large number of parallel compute nodes the assimilation update can still be computed efficiently for models with large mesh sizes $M$, as shown in the numerical experiments in Hunt, Kostelich and Szunyogh (2007).

APPENDIX B: ALTERNATIVE PARTICLE FILTER PROPOSALS
Rather than propagating according to the forward dynamics of the generative model, it is possible to instead propose new particle values from different conditional distributions (which may depend on future observed values) and adjust the expression for the importance weights in Eq. (2.14) accordingly. Typically the resulting expression for the importance weights is given in terms of the transition density of the state updates; however, as noted previously, this density will often be intractable to compute. Alternative state proposals can instead be formulated by changing the distribution the state noise variables are drawn from. If each state noise vector $u^p_t$ is sampled from a distribution with a known density $d^p_t$ with respect to $\mu_t$ and the predictive ensemble particles are computed as in Eq. (2.9), then unnormalised importance weights for the propagated particles can be computed as

$$\tilde{w}^p_t = g_t\!\left( y_t \mid F_t(x^p_{t-1}, u^p_t) \right) d^p_t(u^p_t)^{-1} \quad \forall p \in 1{:}P. \tag{B.1}$$

The corresponding normalised weights can then be used in the empirical filtering distribution approximation in Eq. (2.14) and resampling update in Eq. (2.15). If we restrict the state noise proposal density $d^p_t$ to depend only on the previous particle $x^p_{t-1}$ and current observation $y_t$, in order to maintain the online nature of the algorithm, then the proposal distributions which minimise the variance of the importance weights have densities with respect to $\mu_t$

$$d^p_t(u) = \frac{g_t\!\left( y_t \mid F_t(x^p_{t-1}, u) \right)}{\int_U g_t\!\left( y_t \mid F_t(x^p_{t-1}, u') \right) \mu_t(\mathrm{d}u')} \quad \forall p \in 1{:}P. \tag{B.2}$$

In this case the unnormalised weights in Eq. (B.1) are independent of the state noise variables $u^{1:P}_t$. Although this 'optimal' proposal is more typically expressed as a conditional distribution on $\tilde{x}^p_t$ given $x^p_{t-1}$, this alternative formulation is equivalent. In general it will not be possible to generate samples from the optimal proposal; however, it may be possible, for example, to find a tractable approximation to use as a proxy. In cases where the optimal proposal is tractable or can be well approximated, the resulting pf algorithm can significantly outperform the basic bootstrap pf in terms of the ensemble size required for a given accuracy in the filtering distribution estimates.
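The weight computation in Eq. (B.1) can be sketched as follows, working in log space for numerical stability (the function names and interfaces are illustrative assumptions):

```python
import numpy as np

def propagate_and_weight(x_prev, u, F, log_g, log_d):
    """Propagate particles with proposed noise draws and compute the
    normalised importance weights of Eq. (B.1) in log space.

    x_prev: (P, D) previous particles;  u: (P, Du) proposed noise draws.
    F(x, u): state update operator.
    log_g(x): log observation density at the current observation.
    log_d(u): log density of the noise proposal with respect to the
              original state noise distribution.
    """
    X = np.stack([F(x, un) for x, un in zip(x_prev, u)])
    log_w = np.array([log_g(x) for x in X]) - np.array([log_d(un) for un in u])
    log_w -= log_w.max()                  # guard against exp underflow
    w = np.exp(log_w)
    return X, w / w.sum()
```

Setting `log_d` identically to zero recovers the bootstrap pf weights, so the same routine covers both the basic and proposal-adjusted cases.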
APPENDIX C: VISUALISATION OF SMOOTH LOCAL LETF ASSIMILATION UPDATE

Fig C.1: Example of applying the smooth local letf assimilation update in Eq. (4.7) on a one-dimensional spatial domain S = [0, 1] using the pou from Fig. 5 with P = 2 particles. (Panels, top to bottom: the predictive particles $\tilde{z}_p(s)$; the products $\phi_b(s)\tilde{z}_p(s)$ for odd-indexed $b$; the products $\phi_b(s)\tilde{z}_p(s)$ for even-indexed $b$; the filtering particles $z_p(s)$.)

Consider a spatial domain which is the same unit interval S = [0, 1] as used in Fig. 5 and a pou chosen as the smooth bump functions $\phi_{1:B}$ shown there. The top panel in Fig. C.1 shows two smooth predictive distribution particle realisations $\tilde{z}_{1:2}$. The central two panels show the products $\phi_b(s)\tilde{z}_p(s)$, smooth in $s$ and compactly supported on the patches $\hat{S}_{1:B}$. The bottom panel shows the filtering distribution particle realisations $z_{1:2}$ computed using the assimilation update in Eq. (4.7) for a randomly generated set of coefficients $\hat{a}$ satisfying the conditions in Eq. (4.5), with these post-assimilation fields maintaining the smoothness of the predictive fields. The separation of products with odd and even indexed bump functions on to separate panels in Fig. C.1 is simply for visual clarity.

APPENDIX D: PARTITIONING THE SPATIAL DOMAIN
Fig D.1: Examples of partitioning a space based on observation locations for a two-dimensional spatial domain. Panel (a) shows a partition (indicated by coloured regions) for observations located on an equispaced rectilinear grid (shown by circular markers). Panel (b) shows a partition for an irregularly located set of observations, with the observation locations initially clustered (indicated by the colours of the markers) before partitioning based on the Voronoi cells associated with each cluster of observation locations (cells shown by bordered polygonal regions).

In order to control the number of observations used to compute each local weight in the sletpf scheme, we recommend choosing the partition of the spatial domain used to define the pou such that each patch contains roughly the same number of observations. For observations located on a rectilinear grid this can easily be achieved by partitioning the space into rectilinear blocks aligned with the observation grid, each containing the same number of observations (an example is shown in Fig. D.1a). For irregularly spaced observations, one option is to first group the observation locations into similarly sized clusters using, for example, a k-means algorithm. The spatial domain can then be partitioned using a Voronoi diagram generated from the observation locations, with all the cells corresponding to observations in a single cluster then merged to form a single contiguous region. This leads to a partition of the spatial domain into a set of regions which each contain a roughly equal number of observations and such that the number of additional observations close to the region boundaries is minimised. An example of applying this scheme to a set of irregularly located observation points is shown in Fig. D.1b. In both the rectilinear and irregular spacing cases, a soft pou can then be generated from the resulting partition by convolving with a mollifier function as described in Section 4.1.
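As an illustration of the final step, a soft pou can be generated from a hard partition by convolution with a smoothing kernel; the sketch below uses equal blocks on a periodic one-dimensional domain and a triangular kernel as a simple stand-in for the mollifier of Section 4.1 (both assumptions of this example, not the exact construction used in the experiments).

```python
import numpy as np

def smooth_pou(mesh, n_patches, width):
    """Construct a smooth partition of unity on the periodic domain [0, 1)
    by mollifying the indicator functions of B equal blocks with a
    triangular kernel of half-width `width`.

    mesh: (M,) equispaced node coordinates in [0, 1).
    Returns a (B, M) array of bump function values summing to one.
    """
    M = len(mesh)
    block = np.floor(mesh * n_patches).astype(int) % n_patches
    hard = np.eye(n_patches)[block].T          # (B, M) hard indicators
    half = max(int(round(width * M)), 1)
    kernel = 1.0 - np.abs(np.arange(-half, half + 1)) / (half + 1)
    kernel = kernel / kernel.sum()
    # Tiling three copies before convolving implements the periodic
    # (circular) convolution appropriate for a periodic spatial domain.
    soft = np.stack([
        np.convolve(np.tile(h, 3), kernel, mode="same")[M:2 * M]
        for h in hard
    ])
    return soft / soft.sum(axis=0, keepdims=True)
```

Because the hard indicators already sum to one at every node and convolution is linear, the mollified functions also sum to one (the final renormalisation only absorbs floating-point error).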
APPENDIX E: TRANSFORMED STATE-SPACE MODELS
One of our primary motivations for considering pf-based methods was the claim that they are more robust to non-Gaussianity in the filtering distributions compared to enkf methods. While this can be shown to be the case in the large ensemble limit for non-localised pf algorithms (including the etpf) compared to enkf methods, it does not necessarily follow that, when using small ensemble sizes, a local etpf would be expected to outperform a local enkf in models with non-Gaussian filtering distributions. Further, even if there is a benefit to using the local etpf compared to the local enkf, this does not necessarily carry over to our proposed smooth and scalable local etpf scheme.

Therefore, to assess the effect of non-Gaussianity in the filtering distributions on the relative performance of the local ensemble filter methods being considered, while controlling as far as possible for other factors which might affect performance, we use a simple scheme to map a tractable linear-Gaussian ssm to a transformed ssm with non-Gaussian filtering distributions. In particular, let $T : \mathcal{X} \to \mathcal{X}$ be a diffeomorphism on the state space, with $T^{-1}$ denoting its inverse, which we assume we can also compute. If we define $\bar{x}_t = T(x_t)$ for all time indices $t$, then the conditional distribution on $\bar{x}_t$ given the observations will be $T_\sharp \pi_t$ for any time index $t$, i.e. the push-forward of the filtering distribution $\pi_t$ under the map $T$. If $T$ is non-linear then, even if $\pi_t$ is Gaussian, $T_\sharp \pi_t$ will in general be non-Gaussian. Importantly for our purposes, we can construct a ssm acting directly on the transformed states $\bar{x}_{0:T}$.
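This construction amounts to conjugating the base operators with the map T, as made precise in Eqs. (E.1)–(E.3); a minimal sketch, using an elementwise sinh diffeomorphism as an assumed example of T:

```python
import numpy as np

def transform_ssm(F0, F, G, T, T_inv):
    """Wrap a base ssm's operators into the T-transformed ssm of
    Eqs. (E.1)-(E.3), for a diffeomorphism T with inverse T_inv."""
    F0_bar = lambda u: T(F0(u))
    F_bar = lambda x_bar, u: T(F(T_inv(x_bar), u))
    G_bar = lambda x_bar, v: G(T_inv(x_bar), v)
    return F0_bar, F_bar, G_bar

# Assumed example: a scalar linear-Gaussian base model and elementwise
# sinh as T, giving non-Gaussian transformed filtering distributions.
F0 = lambda u: u
F = lambda x, u: 0.9 * x + u
G = lambda x, v: x + v
F0_bar, F_bar, G_bar = transform_ssm(F0, F, G, np.sinh, np.arcsinh)
```

Running an ensemble filter with the wrapped operators then targets exactly the push-forward of the base filtering distributions under T, while reference moments remain computable by pushing exact samples from the base model through T.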
In particular, for a base ssm with state update and observation operators $F_{0:T}$ and $G_{1:T}$, we can define a $T$-transformed ssm with state update and observation operators $\bar{F}_{0:T}$ and $\bar{G}_{1:T}$ given by

$$\bar{x}_0 = \bar{F}_0(u_0) = T \circ F_0(u_0), \quad u_0 \sim \mu_0, \tag{E.1}$$
$$\bar{x}_t = \bar{F}_t(\bar{x}_{t-1}, u_t) = T \circ F_t(T^{-1}(\bar{x}_{t-1}), u_t), \quad u_t \sim \mu_t \ \forall t \in \mathcal{T}, \tag{E.2}$$
$$y_t = \bar{G}_t(\bar{x}_t, v_t) = G_t(T^{-1}(\bar{x}_t), v_t), \quad v_t \sim \nu_t \ \forall t \in \mathcal{T}, \tag{E.3}$$

and with observation densities $\bar{g}_{1:T}$ defined by

$$\bar{g}_t(y_t \mid \bar{x}_t) = g_t(y_t \mid T^{-1}(\bar{x}_t)) \quad \forall t \in \mathcal{T}. \tag{E.4}$$

We can therefore run ensemble filter algorithms on the $T$-transformed ssm to directly compute ensemble estimates of the transformed filtering distributions $\bar{\pi}_{1:T}$, with by construction $\bar{\pi}_t = T_\sharp \pi_t$ for all $t \in \mathcal{T}$. If the base ssm is linear-Gaussian, and so a kf can be used to exactly compute the Gaussian filtering distributions $\pi_{1:T}$, we can compute accurate unbiased Monte Carlo estimates of expectations under the transformed filtering distributions $\bar{\pi}_{1:T}$, as we can generate $N$ independent samples from each $\bar{\pi}_t$ by generating $N$ independent samples from the Gaussian filtering distribution $\pi_t$ and pushing each of the samples through the map $T$.

This scheme therefore provides a method for constructing a non-Gaussian ssm for which we can easily compute accurate Monte Carlo estimates of the true filtering distribution means $\bar{\mu}_{1:T}$, standard deviations $\bar{\sigma}_{1:T}$ and smoothness coefficients $\bar{\gamma}_{1:T}$ as defined in Eqs. (5.1) and (5.5), and so evaluate the rmse accuracy metrics described in the preceding section for ensemble estimates of the filtering distributions. By using a large number of independent samples $N$ in the Monte Carlo estimates we can ensure the $\mathcal{O}(N^{-1/2})$ Monte Carlo error is negligible compared to the error in the ensemble estimates.

APPENDIX F: MODEL DETAILS

F.1 Stochastic turbulence model
Table F.1: Stochastic turbulence model parameter settings

Number of mesh nodes: M = 512
Number of observation times: T = 200
Number of observation locations: L = 64
Time step: δ = 2.…
θ₁ = 4 × 10⁻…
Advection coefficient: θ₂ = 0.…
θ₃ = 0.…
θ₄ = 5
State noise kernel length scale: ϑ = 4 × 10⁻…
State noise kernel amplitude: α = 0.…
Observation noise standard deviation: ς = 0.…
We define a regular mesh of nodes $s_{1:M}$ and basis functions $\beta_{1:M}$ with

$$s_m = \frac{m-1}{M} \quad\text{and}\quad \beta_m(s) = \frac{\operatorname{sinc}(2\pi M (s - s_m)) \cos(\pi (s - s_m))}{\operatorname{sinc}(\pi (s - s_m))} \quad \forall m \in 1{:}M, \tag{F.1}$$

with the space-time varying processes $\zeta$ and $\eta$ and kernel function $\kappa$ then being defined respectively in terms of the finite set of time-varying processes $\chi_{1:M}$ and $\upsilon_{1:M}$ and coefficients $\lambda_{1:M}$ as

$$\zeta(s, \tau) = \sum_{m \in 1:M} \chi_m(\tau)\, \beta_m(s), \tag{F.2}$$
$$\eta(s, \tau) = \sum_{m \in 1:M} \upsilon_m(\tau)\, \beta_m(s), \tag{F.3}$$
$$\kappa(s) = \sum_{m \in 1:M} \lambda_m\, \beta_m(s). \tag{F.4}$$

The basis functions $\beta_{1:M}$ and nodes $s_{1:M}$ satisfy Eq. (3.2) such that $\chi_m(\tau)$, $\upsilon_m(\tau)$ and $\lambda_m$ correspond to the values of respectively $\zeta(s_m, \tau)$, $\eta(s_m, \tau)$ and $\kappa(s_m)$ for any mesh node $s_m$. We define $\tilde{\chi}_{0:K}(\tau) = \mathrm{dft}(\chi_{1:M}(\tau))$, $\tilde{\upsilon}_{0:K}(\tau) = \mathrm{dft}(\upsilon_{1:M}(\tau))$ and $\tilde{\lambda}_{0:K} = \mathrm{dft}(\lambda_{1:M})$ with $K = \lfloor M/2 \rfloor$ and $\mathrm{dft}$ indicating the discrete Fourier transform, with the Fourier coefficient $\tilde{x}_k$ for a real sequence $x_{1:M}$ being computed as

$$\tilde{x}_k = \mathrm{dft}_k(x_{1:M}) = \frac{1}{M} \sum_{m \in 1:M} x_m \exp\!\left( -\frac{\mathrm{i}\, 2\pi k (m-1)}{M} \right) \in \begin{cases} \mathbb{R} & \text{if } k \in \{0, \tfrac{M}{2}\}, \\ \mathbb{C} & \text{if } k \in 1{:}\lceil\tfrac{M}{2}\rceil - 1. \end{cases} \tag{F.5}$$

Then we have the following equivalent spectral expansions for $\zeta$, $\eta$ and $\kappa$:

$$\zeta(s, \tau) = \sum_{k \in -K:K} \alpha_k\, \tilde{\chi}_k(\tau) \exp(\mathrm{i}\omega_k s), \tag{F.6}$$
$$\eta(s, \tau) = \sum_{k \in -K:K} \alpha_k\, \tilde{\upsilon}_k(\tau) \exp(\mathrm{i}\omega_k s), \tag{F.7}$$
$$\kappa(s) = \sum_{k \in -K:K} \alpha_k\, \tilde{\lambda}_k \exp(\mathrm{i}\omega_k s), \tag{F.8}$$
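The normalisation and sign conventions of the dft in Eq. (F.5) match numpy's real fft up to the 1/M factor, which the helpers below apply (our own convenience wrappers):

```python
import numpy as np

def dft_coeffs(x):
    """Fourier coefficients with the 1/M normalisation of Eq. (F.5);
    np.fft.rfft uses the same exp(-i 2 pi k m / M) sign convention (with
    zero-based m) and returns exactly the k = 0, ..., floor(M/2) terms."""
    return np.fft.rfft(x) / len(x)

def idft_coeffs(c, M):
    """Recover the M real nodal values from the k = 0:K coefficients,
    consistent with the two-sided expansions in Eqs. (F.6)-(F.8)."""
    return np.fft.irfft(c * M, n=M)
```

For a real input sequence the zero- and Nyquist-frequency coefficients are real and the remaining coefficients complex, matching the cases in Eq. (F.5).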
Fig F.1: Simulated sequences used in experiments for st ssms (spatial coordinate $s$ against time $\tau$). (a) Noisy observation sequence $y_{1:T}$. (b) State sequence $z_{1:T}$ for the linear-Gaussian ssm. (c) State sequence $z_{1:T}$ for the transformed ssm.

with the convention that negative indices to the Fourier coefficients indicate complex conjugation, e.g. $\tilde{\lambda}_{-k} = \tilde{\lambda}_k^*$, and $\alpha_{-K:K}$ and $\omega_{-K:K}$ are defined as

$$\alpha_k = \begin{cases} 1 & \text{if } k = 0 \text{ or } |k| \in 1{:}\lceil\tfrac{M}{2}\rceil - 1, \\ \tfrac{1}{2} & \text{if } |k| = \tfrac{M}{2}, \end{cases} \quad\text{and}\quad \omega_k = 2\pi k \quad \forall k \in -K{:}K. \tag{F.9}$$

Using Eq. (F.6) we then have that spatial derivatives of $\zeta$ can be computed as

$$\partial_s^n \zeta(s, \tau) = \sum_{k \in -K:K} \alpha_k (\mathrm{i}\omega_k)^n\, \tilde{\chi}_k(\tau) \exp(\mathrm{i}\omega_k s) \quad \forall n \in \mathbb{N}. \tag{F.10}$$

Substituting the expansions in Eqs. (F.6), (F.7) and (F.10) for the processes and spatial derivatives into Eq. (5.8) and using the convolution theorem gives

$$\sum_{k \in -K:K} \alpha_k \left( \mathrm{d}\tilde{\chi}_k - (-\theta_1 \omega_k^2 + \mathrm{i}\theta_2 \omega_k - \theta_3)\, \tilde{\chi}_k \,\mathrm{d}\tau - \tilde{\lambda}_k \,\mathrm{d}\tilde{\upsilon}_k \right) \exp(\mathrm{i}\omega_k s) = 0. \tag{F.11}$$

Integrating both sides over $S$ against a suitable orthogonal set of test functions

$$h_j(s) = \exp(-\mathrm{i}\omega_j s) \ \forall j \in 0{:}\lceil\tfrac{M}{2}\rceil - 1 \quad\text{and}\quad h_{M/2}(s) = \cos(M\pi s) \text{ if } M \text{ is even}, \tag{F.12}$$

we arrive at the following system of sdes

$$\mathrm{d}\tilde{\chi}_k(\tau) = (-\theta_1 \omega_k^2 + \mathrm{i}\theta_2 \omega_k - \theta_3)\, \tilde{\chi}_k(\tau) \,\mathrm{d}\tau + \tilde{\lambda}_k \,\mathrm{d}\tilde{\upsilon}_k(\tau) \quad \forall k \in 0{:}\lceil\tfrac{M}{2}\rceil - 1, \tag{F.13}$$
$$\mathrm{d}\tilde{\chi}_{M/2}(\tau) = (-\theta_1 \omega_{M/2}^2 - \theta_3)\, \tilde{\chi}_{M/2}(\tau) \,\mathrm{d}\tau + \tilde{\lambda}_{M/2} \,\mathrm{d}\tilde{\upsilon}_{M/2}(\tau) \quad \text{if } M \text{ is even}. \tag{F.14}$$

Assuming that the noise Fourier coefficients $\tilde{\upsilon}_{0:K}$ are independent Wiener processes, real-valued for the zero- and Nyquist-frequency coefficients ($\tilde{\upsilon}_0$ and $\tilde{\upsilon}_{M/2}$) and complex-valued for the remaining coefficients, the transition distributions for this system have analytic solutions

$$\tilde{\chi}_k(\tau) \mid \tilde{\chi}_k(0) \sim \mathcal{N}\!\left( \exp(\xi_k \tau)\, \tilde{\chi}_k(0),\ \frac{\tilde{\lambda}_k^2}{2\psi_k} \left( 1 - \exp(-2\psi_k \tau) \right) \right), \\ \text{with } \psi_k = \theta_1 \omega_k^2 + \theta_3 \ \text{and}\ \xi_k = \begin{cases} \mathrm{i}\theta_2 \omega_k - \psi_k & \text{if } k \neq \tfrac{M}{2}, \\ -\psi_k & \text{if } k = \tfrac{M}{2}, \end{cases} \quad \forall k \in 0{:}K, \tag{F.15}$$

where we have overloaded the notation for a Gaussian distribution $\mathcal{N}$ to extend to complex-valued variables, with the convention that for a complex-valued random variable $z \in \mathbb{C}$, complex mean parameter $\mu \in \mathbb{C}$ and real variance $\sigma^2 \in \mathbb{R}_{>0}$,

$$z \sim \mathcal{N}(\mu, \sigma^2) \implies \Re(z) \sim \mathcal{N}\!\left( \Re(\mu), \tfrac{\sigma^2}{2} \right),\ \Im(z) \sim \mathcal{N}\!\left( \Im(\mu), \tfrac{\sigma^2}{2} \right) \ \text{and}\ \Re(z) \perp \Im(z). \tag{F.16}$$

The Fourier coefficients $\tilde{\chi}_{0:K}$ then also have Gaussian stationary distributions

$$\tilde{\chi}_k(\infty) \sim \mathcal{N}\!\left( 0, \frac{\tilde{\lambda}_k^2}{2\psi_k} \right) \quad \forall k \in 0{:}K. \tag{F.17}$$

We assume the system is observed at $T$ time points with $\tau_t = (t-1)\delta$ for all $t \in \mathcal{T}$ and that the Fourier coefficients of the initial state at time $\tau = 0$ are generated from the stationary distributions in Eq. (F.17). Identifying

$$z_t(s) = \zeta(s, \tau_t) \quad\text{and}\quad x_{t,1:M} = \chi_{1:M}(\tau_t) \quad \forall t \in \mathcal{T}, \tag{F.18}$$

we have that the state update operators can be written

$$x_{0,1:M} = \mathrm{dft}^{-1}(a_{0:K} \odot u_{0,0:K}), \tag{F.19}$$
$$x_{t,1:M} = \mathrm{dft}^{-1}(b_{0:K} \odot \mathrm{dft}(x_{t-1,1:M}) + c_{0:K} \odot u_{t,0:K}) \quad \forall t \in \mathcal{T}, \tag{F.20}$$

where $a_{0:K}$, $b_{0:K}$ and $c_{0:K}$ are length $K + 1$ vectors with

$$a_k = \frac{\tilde{\lambda}_k}{\sqrt{2\psi_k}}, \quad b_k = \exp(\xi_k \delta), \quad c_k = a_k \sqrt{1 - \exp(-2\psi_k \delta)} \quad \forall k \in 0{:}K, \tag{F.21}$$

and the state noise variables $u_{1:T,0:K}$ are real-valued for the zero- and Nyquist-frequency components and complex otherwise, and have Gaussian distributions

$$u_{t,k} \in \begin{cases} \mathbb{R} & \text{if } k \in \{0, \tfrac{M}{2}\}, \\ \mathbb{C} & \text{if } k \in 1{:}\lceil\tfrac{M}{2}\rceil - 1, \end{cases} \qquad u_{t,k} \sim \mathcal{N}(0, 1) \quad \forall t \in \mathcal{T},\ k \in 0{:}K. \tag{F.22}$$

The system is observed at $L$ equispaced mesh nodes with $s^{\mathrm{o}}_l = s_{(M/L)(l-1)+1}$ for all $l \in 1{:}L$ and a simple linear-Gaussian observation model assumed

$$y_{t,l} = z_t(s^{\mathrm{o}}_l) + v_{t,l} = x_{t,(M/L)(l-1)+1} + v_{t,l}, \quad v_{t,l} \sim \mathcal{N}(0, \varsigma^2) \quad \forall t \in \mathcal{T},\ l \in 1{:}L. \tag{F.23}$$

The state noise kernel Fourier coefficients $\tilde{\lambda}_{0:K}$ are chosen to represent a squared-exponential kernel with length-scale parameter $\vartheta$ and amplitude parameter $\alpha$:

$$\tilde{\lambda}_k = \alpha \exp(-\omega_k^2 \vartheta^2) \quad \forall k \in 0{:}K. \tag{F.24}$$

F.2 Kuramoto–Sivashinsky model

Table F.2: Kuramoto–Sivashinsky model parameter settings

Number of mesh nodes: M = 512
Number of observation times: T = 200
Number of observation locations: L = 64
Number of integrator steps between observations: S = 10
Integrator time step: δ = 0.…
θ₁ = 32π
Damping coefficient: θ₂ = …
State noise kernel length scale: θ₃ = θ₁⁻…
State noise kernel amplitude: θ₄ = θ₁⁻…
Observation noise standard deviation: ς = 0.…
We use the same spectral approach as in the st model to define the basis function expansions of the processes $\zeta$ and $\eta$ and kernel $\kappa$ in terms of coefficients $\chi_{1:M}$, $\upsilon_{1:M}$ and $\lambda_{1:M}$ (see Eqs. (F.2) to (F.4)). The non-linear $\zeta$ term in the drift component of the spde cannot be exactly expressed as a linear combination of the basis functions $\beta_{1:M}$, and so we cannot directly form a system of sdes to solve as in the st model. We make the approximation

$$\zeta(s, \tau)^2 = \sum_{m \in 1:M} \sum_{n \in 1:M} \chi_m(\tau) \chi_n(\tau)\, \beta_m(s) \beta_n(s) \approx \sum_{m \in 1:M} \chi_m(\tau)^2\, \beta_m(s). \tag{F.25}$$

At the mesh nodes $s_{1:M}$ this gives the correct values but gives a different interpolation at points between the nodes; for dense meshes, however, the error introduced is small. Using this approximation the following system of sdes can be derived in the Fourier coefficients $\tilde{\chi}_{0:K}$, $\tilde{\upsilon}_{0:K}$ and $\tilde{\lambda}_{0:K}$

$$\mathrm{d}\tilde{\chi}_k(\tau) = \left( \left( \frac{\omega_k^2}{\theta_1^2} - \frac{\omega_k^4}{\theta_1^4} - \theta_2 \right) \tilde{\chi}_k(\tau) + N_k(\tilde{\chi}_{0:K}) \right) \mathrm{d}\tau + \tilde{\lambda}_k \,\mathrm{d}\tilde{\upsilon}_k(\tau) \quad \forall k \in 0{:}K, \tag{F.26}$$

with the noise Fourier coefficients $\tilde{\upsilon}_{0:K}$ again assumed to be (complex-valued) Wiener processes and the non-linear $N_k$ terms in the drift defined by

$$N_k(\tilde{\chi}_{0:K}) = \begin{cases} -\dfrac{\mathrm{i}\omega_k}{2\theta_1}\, \mathrm{dft}_k\!\left( \mathrm{dft}^{-1}(\tilde{\chi}_{0:K}(\tau))^2 \right) & \text{if } k \in 0{:}\lceil\tfrac{M}{2}\rceil - 1, \\ 0 & \text{if } k = \tfrac{M}{2}. \end{cases} \tag{F.27}$$

Fig F.2: Simulated sequences used in experiments with stochastic ks ssms (spatial coordinate $s$ against time $\tau$). (a) Noisy observation sequence $y_{1:T}$ with linear observation operator. (b) Noisy observation sequence $y_{1:T}$ with non-linear observation operator. (c) True state sequence $z_{1:T}$ used to generate observations.

The state noise kernel Fourier coefficients $\tilde{\lambda}_{0:K}$ are, as in the st model, chosen to represent a squared-exponential kernel as defined in Eq. (F.24). Due to the non-linear terms, the system of sdes in Eq. (F.26) does not have an analytic solution.
Therefore we numerically integrate the system using a heuristic combination of an exponential time-differencing fourth-order Runge–Kutta scheme (Cox and Matthews, 2002) to time step forward according to the drift term and an Euler–Maruyama discretisation to account for the diffusion term. To reduce the time discretisation error we use $S$ integrator steps with time step $\delta$ between each of the $T$ observation times $\tau_t = (t-1)S\delta$ for all $t \in \mathcal{T}$. The state transition operator $F_t$ then corresponds to the map from a previous state vector $x_{t-1}$ and state noise variable $u_t$ (consisting of the concatenation of $S$ simulated Wiener process increments) to the state vector $x_t$ obtained by performing $S$ integrator steps. The state transition operators are non-linear and the density of the corresponding state transition distribution does not have a closed-form solution.

For the observation operators we considered two cases: a linear-Gaussian observation model and a non-linear observation operator. Although, due to the non-linear state transition operators, the filtering distributions are non-Gaussian irrespective of the observation operator used, in practice we found the local enkf was able to generate accurate ensemble estimates of the filtering distributions when using a simple linear-Gaussian observation model, suggesting the filtering distributions remain close to Gaussian despite the non-linear state dynamics. As our focus is on inference in ssms for which existing local enkf approaches perform poorly, we also considered an alternative model configuration in which a non-linear function of the model state is noisily observed.

In both the linear and non-linear cases the system is assumed to be observed at $L$ equispaced mesh nodes with $s^{\mathrm{o}}_l = s_{(M/L)(l-1)+1}$ for all $l \in 1{:}L$. For the linear case the observation model is assumed to be equivalent to that assumed for the st model,

$$y_{t,l} = z_t(s^{\mathrm{o}}_l) + v_{t,l} = x_{t,(M/L)(l-1)+1} + v_{t,l}, \quad v_{t,l} \sim \mathcal{N}(0, \varsigma^2) \quad \forall t \in \mathcal{T},\ l \in 1{:}L. \tag{F.28}$$

The non-linear case is directly analogous, other than the state values being observed via a hyperbolic tangent non-linearity:

$$y_{t,l} = \tanh(x_{t,(M/L)(l-1)+1}) + v_{t,l}, \quad v_{t,l} \sim \mathcal{N}(0, \varsigma^2) \quad \forall t \in \mathcal{T},\ l \in 1{:}L. \tag{F.29}$$

Although a seemingly minor change to the model, as illustrated in the experimental results, introducing this non-linearity was sufficient to significantly degrade the filtering performance of the local etkf.

APPENDIX G: FULL GRID SEARCH RESULTS FOR LOCAL ETKF
Fig G.1: Values of the three rmse metrics (for the mean, standard deviation and smoothness coefficient estimates, plotted against localisation radius r) for the local etkf on the four ssms considered in the experiments: (a) linear-Gaussian st ssm; (b) transformed st ssm; (c) linearly observed ks ssm; (d) non-linearly observed ks ssm. In all cases the curve shows the median value across five independent runs and the filled region the minimum to maximum range.
APPENDIX H: DETAILS OF MCMC RUNS FOR KS MODELS
A non-centred parametrisation was used for the Hamiltonian Monte Carlo chains for the two KS SSMs (Papaspiliopoulos, Roberts and Sköld, 2007), with the target smoothing distribution formulated in terms of the $MTS$-dimensional set of state noise variables $u_{1:T}$, which are independently and identically distributed standard normal variables under the prior, with the observation sequence $y_{1:T}$ then having a Gaussian conditional distribution given $u_{1:T}$. The step size for the integrator of the Hamiltonian dynamics was manually tuned once for each SSM using short pilot chains with a fixed number of integrator steps, so as to achieve an average acceptance probability in the range $[0.\,, .9]$ (Betancourt, Byrne and Girolami, 2014).
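Under this non-centred parametrisation the unnormalised log density targeted by the chains decomposes into a standard normal log prior on the noise variables plus a Gaussian log likelihood of the observations given the states generated from them. A sketch, in which `generate_states`, `observe` and `obs_std` are hypothetical stand-ins for the model's deterministic state generation map, observation operator and observation noise scale:

```python
import numpy as np

def log_posterior(u, y, generate_states, observe, obs_std):
    """Unnormalised log density of the non-centred target: u has an
    iid standard normal prior, and y given u is Gaussian with mean
    observe(generate_states(u)) and standard deviation obs_std."""
    log_prior = -0.5 * np.sum(u ** 2)
    residual = y - observe(generate_states(u))
    log_lik = -0.5 * np.sum(residual ** 2) / obs_std ** 2
    return log_prior + log_lik
```

The gradients of this density required for HMC follow by differentiating through the deterministic integrator, e.g. with automatic differentiation.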
In both SSMs the same manually tuned step size was found to give an acceptance rate in the target range. The integrator used was a variant of the standard leapfrog / Störmer–Verlet integrator, which uses an alternative splitting of the Hamiltonian to leverage an exact analytic solution for the Hamiltonian dynamics under the quadratic potential energy component arising from the Gaussian prior (Shahbaba et al., 2014). The number of integrator steps used to generate the Hamiltonian dynamics trajectory in each chain transition was set dynamically on each iteration using a variant of the No-U-Turn sampler scheme (Hoffman and Gelman, 2014; Betancourt, 2017), with the chains for both SSMs performing approximately 2 × steps per transition on average. For each SSM the total wall-clock time to run the five chains in parallel on an Intel Xeon E5-2620 v4 8-core CPU was around one week.

All chains were initialised from the true state noise sequence $u_{1:T}$ used to generate the observations, which corresponds to a single exact sample from the target distribution $P(u_{1:T} \in \mathrm{d}u \mid y_{1:T} = y_{1:T})$, as the $(u_{1:T}, y_{1:T})$ pair was originally generated from the corresponding joint distribution $P(u_{1:T} \in \mathrm{d}u, y_{1:T} \in \mathrm{d}y)$. Although it would typically be preferable, for the robustness of convergence diagnostics based on comparisons between chains, to initialise each chain independently from a distribution over-dispersed relative to the target, such as the prior, here we found the step size required to robustly achieve an average acceptance probability in the range $[0.\,, .9]$
for chains initialised from the prior to be much smaller than for chains initialised from the ‘true’ noise sequence $u_{1:T}$, likely due to the differing geometry of the target distribution in the tails (where initialisations from the prior are likely to fall) and in the typical set, which $u_{1:T}$, as an exact sample from the target, should lie within. Given the long chain run times even when using the larger step size, a pragmatic choice was therefore made to use a common initialisation. This initialisation scheme and the relatively small number of samples in each chain mean there is a risk that the chains only explored a subset of the target distributions' typical sets. As partial evidence against this being the case, visual checks of the estimates of the first and second moments of a subset of the filtering distributions $\pi_{1:T}$, using the final 100 samples from each of the chains, suggest that the estimates from the different chains are consistent with each other (see examples in Figs. H.1 and H.2).

Fig H.1: Comparison of estimates of first and second moments of the filtering distributions at times $\tau = 250$ and $\tau = 500$ for the linearly observed KS SSM, using the final 100 samples from each of 5 chains (curves show the estimated mean and the filled regions the mean $\pm$ two standard deviations).
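The visual cross-chain checks behind Figs. H.1 and H.2 amount to comparing per-chain estimates of the filtering moments. A minimal numerical sketch, assuming (hypothetically) that the retained samples are stacked into an array of shape (chains, samples, mesh nodes); the scalar spread summary is an illustrative addition, as the checks reported here were purely visual:

```python
import numpy as np

def chain_moment_estimates(chain_samples):
    """Per-chain estimates of the filtering mean and standard deviation
    from an array of shape (C, N, M): C chains, N retained samples per
    chain, M mesh nodes."""
    return chain_samples.mean(axis=1), chain_samples.std(axis=1, ddof=1)

def max_between_chain_spread(chain_samples):
    """Largest disagreement across chains in the estimated mean at any
    mesh node: a crude scalar summary of between-chain consistency."""
    means, _ = chain_moment_estimates(chain_samples)
    return np.max(means.max(axis=0) - means.min(axis=0))
```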
Fig H.2: Comparison of estimates of first and second moments of the filtering distributions at times $\tau = 250$ and $\tau = 500$ for the non-linearly observed KS SSM, using the final 100 samples from each of 5 chains (curves show the estimated mean and the filled regions the mean $\pm$ two standard deviations).

REFERENCES
Acevedo, W., de Wiljes, J. and Reich, S. (2017). Second-order accurate ensemble transform particle filters. SIAM Journal on Scientific Computing, A1834–A1850.
Altschuler, J., Weed, J. and Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30.
Anderson, J. L. (2001). An ensemble adjustment Kalman filter for data assimilation. Monthly Weather Review.
Bauer, P., Thorpe, A. and Brunet, G. (2015). The quiet revolution of numerical weather prediction. Nature.
Bengtsson, T., Bickel, P. and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman.
Bertoli, F. and Bishop, A. N. (2014). Adaptively blocked particle filtering with spatial smoothing in large-scale dynamic random fields. arXiv:1406.0136.
Beskos, A., Crisan, D., Jasra, A., Kamatani, K. and Zhou, Y. (2017). A stable particle filter for a class of high-dimensional state-space models. Advances in Applied Probability.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
Betancourt, M., Byrne, S. and Girolami, M. (2014). Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv:1411.6669.
Bishop, A. N. and Del Moral, P. (2018). On the stability of matrix-valued Riccati diffusions. arXiv:1808.00235.
Bishop, C. H., Etherton, B. J. and Majumdar, S. J. (2001). Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects. Monthly Weather Review.
Bolic, M., Djuric, P. M. and Hong, S. (2005). Resampling algorithms and architectures for distributed particle filters. IEEE Transactions on Signal Processing.
Bonavita, M., Torrisi, L. and Marcucci, F. (2008). The ensemble Kalman filter in an operational regional NWP system: Preliminary results with real observations. Quarterly Journal of the Royal Meteorological Society.
Bowler, N. E., Arribas, A., Beare, S. E., Mylne, K. R. and Shutts, G. J. (2009). The local ETKF and SKEB: Upgrades to the MOGREPS short-range ensemble prediction system. Quarterly Journal of the Royal Meteorological Society.
Buizza, R., Houtekamer, P., Pellerin, G., Toth, Z., Zhu, Y. and Wei, M. (2005). A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Monthly Weather Review.
Burgers, G., van Leeuwen, P. J. and Evensen, G. (1998). Analysis scheme in the ensemble Kalman filter. Monthly Weather Review.
Cheng, Y. and Reich, S. (2015). Assimilating data into scientific models: An optimal coupling perspective. In Nonlinear Data Assimilation.
Clayton, A. M., Lorenc, A. C. and Barker, D. M. (2013). Operational implementation of a hybrid ensemble/4D-Var global data assimilation system at the Met Office. Quarterly Journal of the Royal Meteorological Society.
Cox, S. M. and Matthews, P. C. (2002). Exponential time differencing for stiff systems. Journal of Computational Physics.
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26.
Del Moral, P. (1996). Non-linear filtering: interacting particle resolution. Markov Processes and Related Fields.
Del Moral, P. and Tugaut, J. (2018). On the stability and the uniform propagation of chaos properties of ensemble Kalman–Bucy filters. The Annals of Applied Probability.
Douc, R. and Cappé, O. (2005). Comparison of resampling schemes for particle filtering. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 2005.
Duane, S., Kennedy, A. D., Pendleton, B. J. and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B.
Evensen, G. (1994). Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research: Oceans.
Evensen, G. (2009). Data Assimilation: The Ensemble Kalman Filter, 2nd ed. Springer.
Farchi, A. and Bocquet, M. (2018). Comparison of local particle filters and new implementations. Nonlinear Processes in Geophysics Discussions.
Fearnhead, P. and Künsch, H. (2018). Particle filters and data assimilation. Annual Review of Statistics and Its Application.
Frei, M. and Künsch, H. R. (2013). Bridging the ensemble Kalman and particle filters. Biometrika.
Furrer, R. and Bengtsson, T. (2007). Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. Journal of Multivariate Analysis.
Gaspari, G. and Cohn, S. E. (1999). Construction of correlation functions in two and three dimensions. Quarterly Journal of the Royal Meteorological Society.
Gerber, M., Chopin, N. and Whiteley, N. (2019). Negative association, ordering and convergence of resampling methods. The Annals of Statistics.
Gordon, N. J., Salmond, D. J. and Smith, A. F. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing).
Hamill, T. M., Whitaker, J. S. and Snyder, C. (2001). Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter. Monthly Weather Review.
Herbst, E. and Schorfheide, F. (2019). Tempered particle filtering. Journal of Econometrics.
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research.
Hol, J. D., Schön, T. B. and Gustafsson, F. (2006). On resampling algorithms for particle filters. In Nonlinear Statistical Signal Processing Workshop, 2006 IEEE.
Houtekamer, P. L. and Mitchell, H. L. (1998). Data assimilation using an ensemble Kalman filter technique. Monthly Weather Review.
Hunt, B. R., Kostelich, E. J. and Szunyogh, I. (2007). Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D: Nonlinear Phenomena.
Hyman, J. M. and Nicolaenko, B. (1986). The Kuramoto–Sivashinsky equation: a bridge between PDEs and dynamical systems. Physica D: Nonlinear Phenomena.
Johansen, A. M. (2015). On blocks, tempering and particle MCMC for systems identification. IFAC-PapersOnLine.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering.
Kelly, D. T., Law, K. and Stuart, A. M. (2014). Well-posedness and accuracy of the ensemble Kalman filter in discrete and continuous time. Nonlinearity.
Kuramoto, Y. and Tsuzuki, T. (1976). Persistent propagation of concentration waves in dissipative media far from thermal equilibrium. Progress of Theoretical Physics.
Le Gland, F., Monbet, V. and Tran, V.-D. (2011). Large sample asymptotics for the ensemble Kalman filter. In The Oxford Handbook of Nonlinear Filtering (D. Crisan and B. Rozovskii, eds.) 598–631. Oxford University Press.
Lee, A. and Whiteley, N. (2015). Forest resampling for distributed sequential Monte Carlo. Statistical Analysis and Data Mining: The ASA Data Science Journal.
Lei, J., Bickel, P. and Snyder, C. (2010). Comparison of ensemble Kalman filters under non-Gaussianity. Monthly Weather Review.
Majda, A. J. and Harlim, J. (2012). Filtering Complex Turbulent Systems. Cambridge University Press.
Morzfeld, M., Hodyss, D. and Snyder, C. (2017). What the collapse of the ensemble Kalman filter tells us about particle filters. Tellus A: Dynamic Meteorology and Oceanography.
Orlin, J. B. (1997). A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming.
Papaspiliopoulos, O., Roberts, G. O. and Sköld, M. (2007). A general framework for the parametrization of hierarchical models. Statistical Science.
Penny, S. G. and Miyoshi, T. (2015). A local particle filter for high dimensional geophysical systems. Nonlinear Processes in Geophysics Discussions.
Peyré, G. and Cuturi, M. (2019). Computational Optimal Transport. Now Publishers.
Rebeschini, P. and van Handel, R. (2015). Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability.
Reich, S. (2013). A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing, A2013–A2024.
Sen, D. and Thiery, A. H. (2019). Particle filter efficiency under limited communication. arXiv:1904.09623.
Shahbaba, B., Lan, S., Johnson, W. O. and Neal, R. M. (2014). Split Hamiltonian Monte Carlo. Statistics and Computing.
Sinkhorn, R. and Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics.
Sivashinsky, G. (1977). Nonlinear analysis of hydrodynamic instability in laminar flames—I. Derivation of basic equations. Acta Astronautica.
Snyder, C. (2011). Particle filters, the ‘optimal’ proposal and high-dimensional systems. In Proceedings of the ECMWF Seminar on Data Assimilation for Atmosphere and Ocean.
Snyder, C., Bengtsson, T. and Morzfeld, M. (2015). Performance bounds for particle filters using the optimal proposal. Monthly Weather Review.
Snyder, C., Bengtsson, T., Bickel, P. and Anderson, J. (2008). Obstacles to high-dimensional particle filtering. Monthly Weather Review.
Svensson, A., Schön, T. B. and Lindsten, F. (2018). Learning of state-space models with highly informative observations: A tempered sequential Monte Carlo solution. Mechanical Systems and Signal Processing.
Tong, X. T., Majda, A. J. and Kelly, D. (2016). Nonlinear stability and ergodicity of ensemble based Kalman filters. Nonlinearity.
Van Leeuwen, P. J. (2009). Particle filtering in geophysical systems. Monthly Weather Review.
Vergé, C., Dubarry, C., Del Moral, P. and Moulines, E. (2015). On parallel implementation of sequential Monte Carlo methods: the island particle model. Statistics and Computing.
Whitaker, J. S. and Hamill, T. M. (2002). Ensemble data assimilation without perturbed observations. Monthly Weather Review.
Whiteley, N., Lee, A. and Heine, K. (2016). On the role of interaction in sequential Monte Carlo algorithms. Bernoulli 22.