Optimal Transportation Methods in Nonlinear Filtering: The feedback particle filter
Amirhossein Taghvaei and Prashant G. Mehta

February 23, 2021
"How Data Became One of the Most Powerful Tools to Fight an Epidemic" is a question that a recent (Jun 10, 2020) NYT magazine article poses in its title and addresses in its content. Indeed, the spread of COVID-19 involves dynamically evolving hidden data (e.g., number of infected, number of asymptomatic, etc.) that must be deduced from noisy and partially observed data (e.g., number of daily deaths, number of daily hospitalized, number of daily tested positive, etc.). The underlying mathematics for posing and solving this and several other partially observed dynamic problems is familiar to most control theorists.

A mathematical abstraction of these types of problems commonly involves the definition of two stochastic processes $(X, Z)$, where in continuous-time settings $X = \{X_t \in \mathbb{S} : t \geq 0\}$ is the hidden signal process and $Z = \{Z_t \in \mathbb{O} : t \geq 0\}$ is the observed or measured process. For the sake of exposition, the state-space $\mathbb{S}$ and the observation-space $\mathbb{O}$ are assumed to be Euclidean spaces, and the two processes are modeled as solutions of stochastic differential equations (SDEs):
$$\mathrm{d}X_t = a(X_t)\,\mathrm{d}t + \sigma(X_t)\,\mathrm{d}B_t, \quad X_0 \sim \text{(prior)}, \tag{1}$$
$$\mathrm{d}Z_t = h(X_t)\,\mathrm{d}t + \mathrm{d}W_t, \quad Z_0 = 0, \tag{2}$$
where $a(\cdot)$, $\sigma(\cdot)$, $h(\cdot)$ are given smooth functions of their arguments, and the signal (or process) noise $B = \{B_t : t \geq 0\}$ and the measurement (or observation) noise $W = \{W_t : t \geq 0\}$ are assumed to be independent Wiener processes (w.p.). For example, in the models of disease spread, $Z_t$ may indicate the cumulative number of tested positive up to time $t$. In this case, $Z_{t_2} - Z_{t_1}$ is the number (increment) of tested positive during the time interval $[t_1, t_2]$, and "$\mathrm{d}Z_t$" can be thought of as the infinitesimal increment over the infinitesimal time "$\mathrm{d}t$".

Given models for the stochastic processes $(X, Z)$, the mathematical problem of stochastic filtering is to estimate the conditional distribution of the state $X_t$ given observations up to time $t$.
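Before turning to filtering algorithms, it may help to see the model (1)-(2) in simulation form. The sketch below is a simple Euler-Maruyama discretization of the signal-observation pair; the scalar choices $a(x) = -x$, $\sigma(x) = 0.5$, $h(x) = x$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate(a, sigma, h, x0, T=1.0, dt=1e-3, seed=0):
    """Euler-Maruyama simulation of the signal-observation pair (1)-(2):
        dX_t = a(X_t) dt + sigma(X_t) dB_t
        dZ_t = h(X_t) dt + dW_t
    """
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    X = np.zeros(n + 1)
    Z = np.zeros(n + 1)
    X[0] = x0
    for k in range(n):
        dB = rng.normal(0.0, np.sqrt(dt))   # signal noise increment
        dW = rng.normal(0.0, np.sqrt(dt))   # observation noise increment
        X[k + 1] = X[k] + a(X[k]) * dt + sigma(X[k]) * dB
        Z[k + 1] = Z[k] + h(X[k]) * dt + dW
    return X, Z

# Illustrative scalar model: stable linear drift, constant diffusion.
X, Z = simulate(a=lambda x: -x, sigma=lambda x: 0.5, h=lambda x: x, x0=1.0)
```

The path $Z$ plays the role of the measured data from which $X$ must be estimated.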
The conditional distribution $\mathrm{P}(X_t \mid \mathcal{Z}_t)$ is referred to as the posterior distribution, where $\mathcal{Z}_t$ is the time-history (filtration) of observations up to time $t$.

Owing to the widespread importance of this problem, there are a host of solution approaches under different modeling assumptions. The most classical of these approaches is the method of least squares. The method was invented at the turn of the 19th century but remains popular to date, for instance, in identification of the (static) model parameters (see [1] for an application of these methods to parameter estimation in disease modeling). For the dynamic case, when the models are linear (i.e., $a(x) = Ax$, $\sigma(x) = \sigma$, and $h(x) = Hx$) and the distributions (of the noise processes and the prior) are Gaussian, Kalman and Bucy derived a recursive algorithm [2] known as the Kalman-Bucy filter:
$$\mathrm{d}\hat{X}_t = \underbrace{A\hat{X}_t\,\mathrm{d}t}_{\text{dynamics}} + \underbrace{K_t(\mathrm{d}Z_t - H\hat{X}_t\,\mathrm{d}t)}_{\text{control}},$$
where $\hat{X}_t := \mathrm{E}(X_t \mid \mathcal{Z}_t)$ is the conditional mean and $K_t$ is the Kalman gain. Each of the two terms on the right-hand side has an intuitive explanation. The first term accounts for the effect of the dynamics due to the signal model. The second term implements the effect of conditioning on the most recent observation (increment) "$\mathrm{d}Z_t$". The second term is referred to as the correction or the Bayes' update step of the Kalman filter.

It is remarkable that the Bayes' update formula in the Kalman filter takes the form of a feedback control law, where
$$[\text{control}] = [\text{gain}] \cdot [\text{error}], \quad \text{and} \quad [\text{error}] = [\text{observation}] - [\text{prediction}].$$
Note that "$H\hat{X}_t\,\mathrm{d}t$" is the filter prediction of the new observation "$\mathrm{d}Z_t$". The formula is so simple that it can and should be part of any introductory undergraduate controls class – as an example of a proportional gain feedback control law!
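To make the $[\text{gain}] \cdot [\text{error}]$ structure concrete, here is a minimal scalar Kalman-Bucy sketch, assuming unit observation-noise intensity so that $K_t = \Sigma_t H^\top$; the model parameters and the data generation below are illustrative, not from the paper.

```python
import numpy as np

def kalman_bucy(Z, A, H, Q, m0, S0, dt):
    """Scalar Kalman-Bucy filter for dX = A X dt + dB (Var dB = Q dt) and
    dZ = H X dt + dW (unit observation noise). The mean update is the
    feedback control law dm = A m dt + K (dZ - H m dt) with K = S H."""
    m, S = m0, S0
    means = [m]
    for k in range(len(Z) - 1):
        dZ = Z[k + 1] - Z[k]
        K = S * H                                    # Kalman gain
        m = m + A * m * dt + K * (dZ - H * m * dt)   # gain times error
        S = S + (2 * A * S + Q - (S * H) ** 2) * dt  # Riccati equation
        means.append(m)
    return np.array(means)

# Synthetic data from the matching linear model (illustrative parameters).
rng = np.random.default_rng(1)
dt, n, A, H, Q = 1e-3, 2000, -1.0, 1.0, 0.25
X = np.zeros(n + 1)
Z = np.zeros(n + 1)
X[0] = 1.0
for k in range(n):
    X[k + 1] = X[k] + A * X[k] * dt + np.sqrt(Q * dt) * rng.normal()
    Z[k + 1] = Z[k] + H * X[k] * dt + np.sqrt(dt) * rng.normal()

m = kalman_bucy(Z, A, H, Q, m0=0.0, S0=1.0, dt=dt)
```

The conditional mean `m` tracks the hidden state `X` using only the observation increments of `Z`.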
Of course, this simple formula has had an enormous impact in many applications such as target tracking and surveillance, air traffic management, weather surveillance, ground mapping, geophysical surveys, remote sensing, autonomous navigation, and robotics.

The Kalman filter has many extensions, for example, to problems involving additional uncertainties in the signal and the observation models. The resulting algorithms are referred to as the interacting multiple model (IMM) filter [3] and the probabilistic data association (PDA) filter [4], respectively. In the PDA filter, the Kalman gain is allowed to vary based on an estimate of the instantaneous uncertainty in the observations. In the IMM filter, multiple Kalman filters are run in parallel and their outputs combined to form an estimate.

Arguably, the structural aspects of the Kalman filter have been as important as the algorithm itself in design, integration, testing, and operation of the overall system. As a simple illustration of this, consider for example the Kalman filter gain. The gain is known to scale proportionally to the signal-to-noise ratio (SNR) of the observations. In practice, the gain is often tuned or adapted in an online manner to trade off performance for robustness. Without such structural features, it is a challenge to create scalable, cost-effective, robust solutions.

A limitation of the Kalman filter is that it gives the exact solution only in linear Gaussian settings. Beginning in the early 1990s, spurred in part by computational advances, simulation-based Monte-Carlo (MC) algorithms became popular for the purposes of numerically approximating the posterior distribution in more general settings. This class of algorithms, referred to as particle filters, approximates the posterior distribution using a population of $N$ particles $\{X_t^i : t \geq 0,\ 1 \leq i \leq N\}$.
One can interpret each of the particles as independent samples drawn from the posterior. Alternatively, one can interpret the empirical distribution of the population as approximating the posterior distribution.

Like the Kalman filter, the particle filter too is a recursive algorithm. The signal model is used to simulate the effect of the dynamics. The Bayesian update step is implemented using techniques such as importance sampling and resampling. Although these techniques are easily described, they bear little resemblance to the feedback control structure of the Kalman filter.

The focus of the present paper is on the feedback particle filter (FPF) algorithm. FPF represents an exact solution of the nonlinear non-Gaussian filtering problem (1)-(2), where the state-space $\mathbb{S}$ can in general be a Riemannian manifold. (In applications, Euclidean spaces and matrix Lie groups are most common.) The distinguishing feature of the FPF is that the Bayesian update step is implemented via a feedback control law of the form
$$[\text{control}] = [\text{gain}] \cdot [\text{error}], \quad \text{where} \quad [\text{error}] = [\text{observation}] - \left(\tfrac{1}{2}[\text{part. predict.}] + \tfrac{1}{2}[\text{pop. predict.}]\right).$$
The terms [part. predict.] and [pop. predict.] refer to the prediction – regarding the next [observation] "$\mathrm{d}Z_t$" – as made by the particle and by the population, respectively (see Figure 1). Because the control for each particle depends also on the population (and thus the empirical distribution), this is an example of a mean-field type control law; cf., [5], [6]. It turns out that for the linear Gaussian problem in the Euclidean state-space, the $[\text{gain}]$ of the FPF is exactly the Kalman gain. In non-Gaussian settings, the gain solves a certain linear partial differential equation (PDE) known as the weighted Poisson equation. The exact formula for the FPF control and gain appears in the main body of the paper.

At the turn of the decade beginning 2010, the FPF algorithm was introduced by our research group at the University of Illinois [7], [8], [9]. The algorithm can be viewed as a modern extension of the Kalman filter, a viewpoint stressed in a prior review paper [10]. Like the Kalman filter, the FPF is easily extended to handle additional uncertainties in the signal and measurement models: These extensions, namely, the joint probabilistic data association (JPDA)-FPF and the interacting multiple model (IMM)-FPF, appear in our prior works [11], [12].

From a historical perspective, FPF is part of a broader class of exact and approximate interacting particle algorithms, in particular, the ensemble Kalman filter (EnKF), which is widely used for data assimilation in weather prediction and other types of geophysical applications [13]. Closely related to and pre-dating our work on FPF, the first interacting particle representation of the continuous-time nonlinear filtering problem (1)-(2) appeared in [14].
In linear Gaussian settings, the update formula for FPF is known as the square-root form of the EnKF [15], [16].

The objective of this paper is to situate the development of FPF and related controlled interacting particle system algorithms (e.g., EnKF) within the framework of optimal transportation theory. The key notion is that of a "coupling" between two distributions – prior and posterior in the Bayesian settings of this paper. Optimal transportation theory is then applied to design the optimal coupling. Of course, in practice, this requires the solution of certain PDEs – such as the Poisson equation that arises in the FPF algorithm. The coupling viewpoint has several advantages, which are described in the main body of this paper.

The coupling viewpoint
The heart of any simulation-based recursive particle filter algorithm is the Bayes' update formula
$$[\text{posterior}] \propto [\text{likelihood}] \cdot [\text{prior}].$$
The notation $p_0(\cdot)$, $p_1(\cdot)$, and $\ell(\cdot)$ is used to denote the prior, posterior, and likelihood distributions, respectively. The expressions for these in the linear Gaussian example appear in "Optimal coupling for Gaussian distributions".

In any particle filter algorithm, one also needs to simulate the effect of the dynamics. This step is straightforward using the signal model directly. Therefore, we focus here only on the update formula.

In simulation settings, one of the challenges is that an analytic expression of the prior distribution is not available. Instead, the prior is approximated in terms of $N$ independent samples $\{X_0^i : 1 \leq i \leq N\}$:
$$p_0(x) \approx \frac{1}{N} \sum_{i=1}^N \delta_{X_0^i}(x),$$
where $\delta_z$ is the Dirac distribution at $z \in \mathbb{S}$. The expression on the right-hand side is referred to as the empirical distribution of the population. Alternatively, one can think of $X_0^i$ as independent samples drawn from the prior.

In a numerical implementation of the update formula, the problem is to convert the sample of $N$ particles $\{X_0^i : 1 \leq i \leq N\}$ from the prior distribution $p_0(\cdot)$ to a sample of $N$ particles $\{X_1^i : 1 \leq i \leq N\}$ from the posterior distribution $p_1(\cdot)$. The algorithmic problem is depicted in Figure 2 and expressed as follows:
Input: samples $\{X_0^i : 1 \leq i \leq N\} \sim p_0$, likelihood function $\ell$,
Output: samples $\{X_1^i : 1 \leq i \leq N\} \sim p_1$.
The task of converting samples from one distribution $p_0(\cdot)$ to samples from another distribution $p_1(\cdot)$ is viewed here as the problem of finding a coupling $\pi(\cdot, \cdot)$; cf., [17], [18], [19]. By definition, a coupling is any joint probability distribution that satisfies the marginal constraints $\int \pi(x, x')\,\mathrm{d}x' = p_1(x)$ and $\int \pi(x, x')\,\mathrm{d}x = p_0(x')$.
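In a discrete sketch (an illustration, not from the paper), a coupling is just a joint probability matrix whose row and column sums reproduce the two marginals, and a kernel is obtained by normalizing its columns:

```python
import numpy as np

# Two discrete distributions playing the roles of prior p0 and posterior p1.
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.6, 0.3, 0.1])

# The independent coupling pi(x, x') = p1(x) * p0(x'). Its row sums give
# p1 and its column sums give p0: the defining marginal constraints.
pi = np.outer(p1, p0)

# Normalizing each column by p0(x') yields T(x | x') = pi(x, x') / p0(x'),
# a probability distribution over the new sample given the old one.
T = pi / p0[None, :]
```

Drawing the new sample from the column `T[:, j]` corresponding to the old sample produces, in aggregate, a population distributed according to `p1`.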
It is convenient to express $\pi(x, x') = T(x \mid x')\,p_0(x')$, where $T(\cdot \mid \cdot)$ is referred to as the transition kernel. Once the coupling is at hand, new samples are generated by using the transition kernel.

Given this viewpoint, the MC algorithm simulates the following stochastic update law for the system of particles:
$$X_1^i \sim T(\cdot \mid X_0^i). \tag{3}$$
This means that a new sample $X_1^i$ is generated by sampling from the distribution $T(\cdot \mid X_0^i)$. The sampling algorithm (3) ensures that if the probability distribution of $X_0^i$ is $p_0$ then the probability distribution of $X_1^i$ is $p_1$. The associated algorithmic task is expressed as follows:
Input: samples $\{X_0^i : 1 \leq i \leq N\} \sim p_0$, likelihood function $\ell$,
Output: coupling between $p_0$ and $p_1$.

SIR particle filter
There are infinitely many couplings between two distributions. The simplest possible choice is an independent coupling, where $\pi(x, x') = p_1(x)\,p_0(x')$. For the independent coupling, the transition kernel $T(x \mid x') = p_1(x)$. Sequential importance resampling (SIR) particle filters [20], [21] numerically implement the independent coupling in two steps:

Step 1: A weighted distribution of the particles is first formed according to $\sum_{i=1}^N w^i \delta_{X_0^i}$, where the weights $w^i = \ell(X_0^i) / \sum_{j=1}^N \ell(X_0^j)$. The weighted distribution is an approximation of the posterior distribution $p_1$. This step is called importance sampling.

Step 2: Next, $N$ particles are independently sampled from the weighted distribution, $X_1^i \sim \sum_{i=1}^N w^i \delta_{X_0^i}$, by sampling from a multinomial distribution with parameter vector $(N, \{w^i\}_{i=1}^N)$. This step is called resampling.

Theoretically, it is shown that the empirical approximation with particles becomes exact in the limit as $N \to \infty$, with error rate $O(N^{-1/2})$ [22], [23]. However, both empirically and theoretically, it was discovered that particle filters can suffer from large simulation variance that progressively gets worse as the dimension of the problem increases [24], [25], [26]. To maintain the same mean-squared error, a particle filter is known to require a number of particles that scales exponentially with the dimension. This issue is referred to as the curse of dimensionality (CoD).

The issues with the stochastic independent coupling in SIR filters have motivated the investigation of other forms of coupling that are also optimal in some sense [27], [28]. In the simulation literature, this is referred to as the design of proposal distributions.
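The two SIR steps admit a very short implementation. The sketch below applies one importance-sampling/resampling update to a Gaussian prior with a Gaussian likelihood; these are illustrative choices, for which the exact posterior is $\mathcal{N}(1/2, 1/2)$.

```python
import numpy as np

def sir_update(particles, likelihood, rng):
    """One Bayes update of the SIR particle filter.
    Step 1 (importance sampling): weight particles by their likelihood.
    Step 2 (resampling): draw N new particles from the weighted ensemble
    via the multinomial distribution."""
    w = likelihood(particles)
    w = w / w.sum()                                   # normalized weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=50_000)                # prior N(0, 1)
ell = lambda x: np.exp(-0.5 * (1.0 - x) ** 2)         # observation y = 1
X1 = sir_update(X0, ell, rng)
```

The resampled ensemble `X1` has empirical mean and variance close to the exact posterior values 1/2 and 1/2.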
Optimal transport coupling

Optimal transportation theory provides a principled approach to identify a coupling. Given two distributions $p_0$ and $p_1$, the optimal transportation problem is
$$\min_{\pi \in \Pi(p_0, p_1)} \mathrm{E}_{(X_0, X_1) \sim \pi}\left[|X_0 - X_1|^2\right], \tag{4}$$
where $\Pi(p_0, p_1)$ is the set of all couplings with marginals fixed to $p_0$ and $p_1$. The optimal cost is referred to as the $L^2$-Wasserstein distance between $p_0$ and $p_1$ [29].

The optimization problem (4) is known as the Kantorovich formulation of the optimal transportation problem. By a famous result due to Brenier, if the two distributions admit densities with respect to the Lebesgue measure, then the optimal coupling is unique and deterministic, of the form $\pi(x, x') = \delta_{x = \nabla\Phi(x')}\,p_0(x')$, where $\Phi$ is a convex function [29, Thm. 2.12]. The function $\Phi$ is obtained by solving the Monge-Ampère PDE [30]. A numerical solution to the Monge-Ampère PDE based only on the samples is a challenging problem. In the following, we discuss the special setting where the prior and posterior distributions are Gaussian.

Optimal coupling for Gaussian distributions

In Gaussian settings, the prior and the likelihood are both Gaussian distributions:
$$p_0(x) \propto \exp\left(-\tfrac{1}{2}(x - m_0)^\top \Sigma_0^{-1} (x - m_0)\right), \quad \ell(x) \propto \exp\left(-\tfrac{1}{2}|y - Hx|^2\right).$$
In this case, a simple completion of squares shows that the posterior is also a Gaussian distribution:
$$p_1(x) \propto \exp\left(-\tfrac{1}{2}(x - m_1)^\top \Sigma_1^{-1} (x - m_1)\right).$$
This yields the following update formula for the mean and variance:
$$m_1 = m_0 + K(Y - Hm_0), \quad \Sigma_1 = \Sigma_0 - KH\Sigma_0,$$
where $K = \Sigma_0 H^\top (H\Sigma_0 H^\top + I)^{-1}$ is the Kalman gain. This is in fact the update formula for the discrete-time Kalman filter.

The coupling design problem is to couple the Gaussian prior $p_0$ and the Gaussian posterior $p_1$. The optimal coupling, the solution to the optimal transportation problem (4) between two Gaussian distributions, is explicitly known and is an affine map of the form
$$T(x) = F(x - m_0) + m_1, \tag{5}$$
where $F$ is the (unique such) symmetric matrix solution to the matrix equation $F\Sigma_0 F = \Sigma_1$. The explicit form of the solution is
$$F = \Sigma_0^{-1/2}\left(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\right)^{1/2}\Sigma_0^{-1/2}.$$
Note that if $X_0 \sim \mathcal{N}(m_0, \Sigma_0)$ then $T(X_0) \sim \mathcal{N}(m_1, \Sigma_1)$ because: (i) the mean $\mathrm{E}[T(X_0)] = m_1$; (ii) the variance $\mathrm{E}[(T(X_0) - m_1)(T(X_0) - m_1)^\top] = F\Sigma_0 F = \Sigma_1$; and (iii) an affine transformation of a Gaussian random variable is again Gaussian.

The optimal transport map (5) yields the following algorithm for sampling: $X_1^i = T(X_0^i) = F(X_0^i - m_0) + m_1$. Given $X_0^i \sim p_0$, we have $X_1^i \sim p_1$.

The optimal coupling depends upon the statistics of both the prior and the posterior distributions. In simulation-based settings, one only has a population of particles $\{X_0^i : 1 \leq i \leq N\}$. So, the transport map needs also to be approximated from the particles. One such approximation is as follows:
$$\text{particle update:} \quad X_1^i = F^{(N)}(X_0^i - m_0^{(N)}) + m_0^{(N)} + K^{(N)}(Y - Hm_0^{(N)}), \tag{6}$$
where $m_0^{(N)} = \frac{1}{N}\sum_{i=1}^N X_0^i$ is an empirical approximation of the mean $m_0$, $\Sigma_0^{(N)} = \frac{1}{N-1}\sum_{i=1}^N (X_0^i - m_0^{(N)})(X_0^i - m_0^{(N)})^\top$ is an empirical approximation of the variance, $F^{(N)}$ is the unique symmetric matrix solution to the matrix equation $F^{(N)}\Sigma_0^{(N)}F^{(N)} = \Sigma_0^{(N)} - K^{(N)}H\Sigma_0^{(N)}$, and $K^{(N)} = \Sigma_0^{(N)} H^\top (H\Sigma_0^{(N)} H^\top + I)^{-1}$.

The update formula (6) is compared to the discrete-time EnKF update:
$$\text{particle update (EnKF):} \quad X_1^i = X_0^i + K^{(N)}(Y - HX_0^i + W^i), \tag{7}$$
where $\{W^i : 1 \leq i \leq N\}$ are independent copies of the observation noise. The EnKF update is an example of a stochastic coupling, in contrast to the deterministic optimal coupling (6). The EnKF update does not require solving for $F^{(N)}$. This makes it simpler to implement numerically. However, the presence of the noise $W^i$ in the update law introduces an additional source of error in any finite-$N$ implementation [31].

Besides (7), there are several other forms of the EnKF update. One particular update – that has been crucial in the successful application of EnKF in the geosciences – is the ensemble square-root Kalman filter (EnSRKF) [32]. This update is an example of a deterministic coupling that also avoids the need to compute $F^{(N)}$ [17, Sec. 7.1]. In the sequel, the continuous-time version of the EnSRKF is shown to arise as a special case of the FPF.

The systems (6) and (7) are both examples of an interacting particle system. The interaction arises because of the terms involving the empirical quantities $m_0^{(N)}$ and $\Sigma_0^{(N)}$. In the limit as $N \to \infty$, these converge to their respective statistics and the particles become independent of each other [33], [34], [35], [36], [37]. This phenomenon is referred to as the propagation of chaos [38], [39].
The ($N = \infty$) limit is referred to as the mean-field limit, and for (6) it is identified with a single equation:
$$\text{mean-field update:} \quad \bar{X}_1 = F(\bar{X}_0 - m_0) + m_0 + K(Y - Hm_0). \tag{8}$$
As a practical matter, one first designs a coupling (5), which is used to define the mean-field process (8). Subsequently, the mean-field terms are approximated empirically to define a particle system (6). A finite-$N$ implementation of the particle system yields a practical algorithm to solve the filtering task.

A summary of this subsection is presented in the sidebar "Summary of the linear Gaussian example". An excellent exposition of the coupling approach to discrete-time filtering appears in the monograph [17] and the tutorial-style review article [18]. Other examples of couplings include the approximation of the Rosenblatt transport map [40], [41] and the Gibbs flow [42].

Feedback particle filter

The coupling viewpoint is next employed to introduce and describe the feedback particle filter (FPF) algorithm. FPF is a MC solution to the continuous-time nonlinear filtering problem (1)-(2).

A construction of FPF proceeds in the following two steps:
Step 1: Construct a stochastic process $\bar{X} = \{\bar{X}_t \in \mathbb{S} : t \geq 0\}$ such that the conditional distribution (given $\mathcal{Z}_t$) of $\bar{X}_t$ is equal to the posterior distribution of $X_t$;
Step 2: Simulate $N$ stochastic processes $\{X_t^i \in \mathbb{S} : t \geq 0,\ 1 \leq i \leq N\}$ to empirically approximate the conditional distribution of $\bar{X}_t$:
$$\mathrm{E}[f(X_t) \mid \mathcal{Z}_t] \underbrace{=}_{\text{Step 1: exactness condition}} \mathrm{E}[f(\bar{X}_t) \mid \mathcal{Z}_t] \underbrace{\approx}_{\text{Step 2}} \frac{1}{N}\sum_{i=1}^N f(X_t^i),$$
for all bounded functions $f$. The process $\bar{X}$ is referred to as the mean-field process, and the $N$ processes are referred to as particles. The construction ensures that the filter is exact in the mean-field ($N = \infty$) limit.

Mean-field process:
The mean-field process $\bar{X}$ is modeled as a solution of a controlled SDE:
$$\mathrm{d}\bar{X}_t = \underbrace{a(\bar{X}_t)\,\mathrm{d}t + \sigma(\bar{X}_t)\,\mathrm{d}\bar{B}_t}_{\text{dynamics}} + \underbrace{u_t(\bar{X}_t)\,\mathrm{d}t + K_t(\bar{X}_t)\,\mathrm{d}Z_t}_{\text{control}}, \quad \bar{X}_0 \overset{d}{=} X_0, \tag{9}$$
where $\bar{B} = \{\bar{B}_t : t \geq 0\}$ is a w.p. independent of $\bar{X}_0$. The first two terms in the SDE (9) are a copy of the dynamics (1). The other two terms are control laws (transition kernels) that need to be designed to implement the filtering update step: The mathematical control objective is to design $\{u_t(\cdot) : t \geq 0\}$ and $\{K_t(\cdot) : t \geq 0\}$ such that the conditional distribution (given $\mathcal{Z}_t$) of $\bar{X}_t$ is equal to the posterior distribution of $X_t$.

The control is regarded as implementing the transition kernel of a coupling. As in the simpler discrete-time setting of the preceding section, there are infinitely many couplings and associated transition kernels that satisfy the exactness criteria. This is not surprising: The exactness condition specifies only the marginal distribution of $\bar{X}_t$ at times $t \geq 0$. This is clearly not enough to uniquely identify a stochastic process; for instance, the joint distributions at two time instants are not specified.

The procedure from the preceding section is suitably adapted to design the optimal coupling. The optimality criterion is the Kantorovich form (4) of the optimal transportation problem. The details appear in the sidebar "Optimal transport construction of stochastic processes", where it is shown that the optimal $K_t$ is of the gradient form $K_t(x) = \nabla\phi_t(x)$, where $\phi_t$ solves the PDE
$$-\nabla\cdot(p_t \nabla\phi_t) = p_t\,(h - \hat{h}_t),$$
where $\nabla$ is the gradient operator, $\nabla\cdot$ is the divergence operator, $\hat{h}_t := \mathrm{E}[h(\bar{X}_t) \mid \mathcal{Z}_t] = \int h(x)\,p_t(x)\,\mathrm{d}x$, and $p_t$ is the conditional density of $\bar{X}_t$.

The optimal solution for $u_t$ is as follows:
$$u_t(x) = -\tfrac{1}{2}\left(h(x) + \hat{h}_t\right)K_t(x) + \tfrac{1}{2}K_t\nabla K_t(x) + \xi_t(x),$$
where $\xi_t$ is the (unique such) divergence-free vector field (i.e., $\nabla\cdot(p_t\,\xi_t) = 0$) such that $u_t$ is of a gradient form. An intuitive explanation of the three terms is as follows: The first term is the gain times the average of the 'particle' prediction $h(x)$ and the population prediction $\hat{h}_t$. Together with the stochastic term $K_t\,\mathrm{d}Z_t$, the first term yields the gain-times-error structure of the FPF. The second term is the so-called Wong-Zakai correction term, from which it follows that the gain-times-error formula is expressed in its Stratonovich form. (The geometric significance of the Stratonovich form is described after the FPF formula has been formally presented.) The significance of the third term is the subject of the next paragraph.

It turns out that the divergence-free choices of $\xi_t$ parametrize a manifold of couplings, all of which yield the same (exact) posterior. Therefore, the choice of $\xi_t$ affects only the (Wasserstein) optimality but not the exactness property of the filter. In FPF, we simply take $\xi_t \equiv 0$ for all $t \geq 0$. Although such a solution is optimal only in the scalar (one-dimensional) settings, it avoids the need to solve an additional PDE to compute the optimal $\xi_t$. The resulting algorithm is referred to as the feedback particle filter. It is described next.

Feedback particle filter:
The mean-field process $\bar{X}$ evolves according to the SDE:
$$\mathrm{d}\bar{X}_t = \underbrace{a(\bar{X}_t)\,\mathrm{d}t + \sigma(\bar{X}_t)\,\mathrm{d}\bar{B}_t}_{\text{dynamics}} + \underbrace{K_t(\bar{X}_t) \circ \left(\mathrm{d}Z_t - \frac{h(\bar{X}_t) + \hat{h}_t}{2}\,\mathrm{d}t\right)}_{\text{Bayes' update: feedback control law}}, \quad \bar{X}_0 \sim p_0, \tag{10}$$
where the symbol $\circ$ is used to denote the fact that the SDE is expressed in its Stratonovich form [43, Sec. 3.3]. The Itô form of the FPF includes the standard Wong-Zakai correction term that arises on account of the dependence of the gain $K_t(\cdot)$ on the state $\bar{X}_t$ [44, Eq. 2]. Because the gain also depends upon the density, the interpretation of the Stratonovich form in the general case is more involved, as discussed at length in [45].

The gain function is $K_t(x) := \nabla\phi_t(x)$, where $\phi_t$ is the solution of the Poisson equation:
$$\text{Poisson equation:} \quad -\frac{1}{p_t(x)}\nabla\cdot\left(p_t(x)\nabla\phi_t(x)\right) = (h(x) - \hat{h}_t), \quad \forall x \in \mathbb{R}^d. \tag{11}$$
The operator on the left-hand side of (11) is the probability-weighted Laplacian. It is denoted as $\Delta_\rho := \frac{1}{\rho}\nabla\cdot(\rho\nabla)$, where at time $t$ the probability density $\rho$ is the conditional density $p_t$. The equation (11) is referred to as the Poisson equation of nonlinear filtering [46].

The Stratonovich form of the update formula provides for an intrinsic (i.e., coordinate-independent) description of the filter. It was shown that the FPF is an exact filter not only in the Euclidean settings but also when the state-space $\mathbb{S}$ is a Riemannian manifold, e.g., a matrix Lie group [47]. For a manifold with boundaries, the Poisson equation is supplemented with Neumann boundary conditions.

Particles:
A finite-$N$ algorithm is obtained by empirically approximating the mean-field control law. The particles $\{X_t^i : t \geq 0,\ 1 \leq i \leq N\}$ evolve according to:
$$\mathrm{d}X_t^i = a(X_t^i)\,\mathrm{d}t + \sigma(X_t^i)\,\mathrm{d}B_t^i + K_t^{(N)}(X_t^i) \circ \left(\mathrm{d}Z_t - \frac{h(X_t^i) + \hat{h}_t^{(N)}}{2}\,\mathrm{d}t\right), \quad X_0^i \overset{\text{i.i.d.}}{\sim} p_0,$$
for $i = 1, \ldots, N$, where $\{B_t^i : t \geq 0,\ 1 \leq i \leq N\}$ are mutually independent standard w.p., $\hat{h}_t^{(N)} := \frac{1}{N}\sum_{i=1}^N h(X_t^i)$ is the empirical approximation of $\hat{h}_t$, and $K_t^{(N)}$ is the output of an algorithm that approximates the solution to the Poisson equation (11) at each fixed time $t$:
$$\text{Gain function approximation:} \quad K_t^{(N)} := \text{Algorithm}\left(\{X_t^i : 1 \leq i \leq N\};\ h\right).$$
The notation is suggestive of the fact that the algorithm is adapted to the ensemble $\{X_t^i : 1 \leq i \leq N\}$ and the function $h$; the density $p_t(x)$ is not known in an explicit manner.

Two examples are presented in the sidebars to illustrate the FPF algorithm in practice: In the first example, "Benefits of feedback", analytical and numerical comparisons are provided to show how feedback control can help ameliorate the CoD. In the second example, "FPF for SIR models", the FPF algorithm is applied to an epidemiological SIR model.

At this point, it is instructive to specialize FPF to the linear Gaussian case, where the solution of the Poisson equation is explicitly known.

FPF for Linear Gaussian setting
Suppose $a(x) = Ax$, $h(x) = Hx$, and $p_t$ is a Gaussian density with mean $\bar{m}_t$ and variance $\bar{\Sigma}_t$. Then the solution of the Poisson equation is known in an explicit form [44, Sec. D]. The resulting gain function is constant and equal to the Kalman gain:
$$K_t(x) \equiv \bar{\Sigma}_t H^\top, \quad \forall x \in \mathbb{R}^d. \tag{12}$$
Therefore, the mean-field process (10) for the linear Gaussian problem is given as
$$\mathrm{d}\bar{X}_t = A\bar{X}_t\,\mathrm{d}t + \mathrm{d}\bar{B}_t + \bar{\Sigma}_t H^\top \left(\mathrm{d}Z_t - \frac{H\bar{X}_t + H\bar{m}_t}{2}\,\mathrm{d}t\right), \quad \bar{X}_0 \sim p_0.$$
The gain is empirically approximated as $K_t^{(N)} = \Sigma_t^{(N)} H^\top$, where $\Sigma_t^{(N)}$ is the empirical covariance of the particles. Therefore, the evolution of the particles is:
$$\mathrm{d}X_t^i = AX_t^i\,\mathrm{d}t + \mathrm{d}B_t^i + K_t^{(N)}\left(\mathrm{d}Z_t - \frac{HX_t^i + Hm_t^{(N)}}{2}\,\mathrm{d}t\right), \quad X_0^i \overset{\text{i.i.d.}}{\sim} p_0, \tag{13}$$
for $i = 1, \ldots, N$, where $m_t^{(N)}$ is the empirical mean of the particles. The empirical quantities are computed as:
$$m_t^{(N)} := \frac{1}{N}\sum_{i=1}^N X_t^i, \quad \Sigma_t^{(N)} := \frac{1}{N-1}\sum_{i=1}^N (X_t^i - m_t^{(N)})(X_t^i - m_t^{(N)})^\top.$$
The linear Gaussian FPF (13) is identical to the square-root form of the ensemble Kalman filter (EnKF) [16, Eq. 3.3].

The main difficulty in implementing an FPF in general settings is the gain function approximation. Two algorithms for this problem are presented in the following section.
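A finite-$N$ implementation of the linear Gaussian FPF (13) takes only a few lines. The sketch below uses a scalar model with $A = -1$, $H = 1$ (illustrative choices), an Euler discretization, and the innovation $\mathrm{d}Z_t - \tfrac{1}{2}(HX_t^i + Hm_t^{(N)})\,\mathrm{d}t$ with the empirical gain $K_t^{(N)} = \Sigma_t^{(N)} H^\top$.

```python
import numpy as np

def linear_fpf_step(X, dZ, A, H, dB, dt):
    """One Euler step of the linear Gaussian FPF (13): empirical Kalman
    gain times the averaged particle/population innovation."""
    mN = X.mean()                                # empirical mean m^(N)
    SN = X.var(ddof=1)                           # empirical variance
    K = SN * H                                   # K^(N) = Sigma^(N) H^T
    return X + A * X * dt + dB + K * (dZ - H * (X + mN) / 2.0 * dt)

# Illustrative scalar run: dX = -X dt + dB, dZ = X dt + dW.
rng = np.random.default_rng(2)
dt, n, N = 1e-2, 500, 500
x = 1.0
X = rng.normal(0.0, 1.0, size=N)                 # particles ~ prior
for _ in range(n):
    x += -x * dt + np.sqrt(dt) * rng.normal()            # signal (1)
    dZ = x * dt + np.sqrt(dt) * rng.normal()             # observation (2)
    dB = np.sqrt(dt) * rng.normal(size=N)                # particle noise
    X = linear_fpf_step(X, dZ, A=-1.0, H=1.0, dB=dB, dt=dt)
```

After the run, the ensemble mean tracks the hidden state and the ensemble spread approximates the stationary posterior variance of this model.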
Gain function approximation
The exact gain function is a solution of the Poisson equation (11). In practice, this problem is solved numerically:
Input: samples $\{X^i : 1 \leq i \leq N\} \overset{\text{i.i.d.}}{\sim} \rho$, observation function $h(\cdot)$,
Output: gain function $\{K(X^i) : 1 \leq i \leq N\}$,
where in filtering $\rho$ is the (posterior) density at time $t$. The explicit dependence on time $t$ is suppressed in this section. The problem is illustrated in Fig. 3.

Constant gain approximation
The simplest approximation is the constant gain approximation formula, where the gain $K_t$ is approximated by its expected value (which represents the best least-squares approximation of the gain by a constant). Remarkably, the expected value admits a closed-form expression, which is then readily approximated empirically using the particles:
$$\text{Const. gain approx:} \quad \mathrm{E}[K_t(X_t) \mid \mathcal{Z}_t] = \int_{\mathbb{R}^d} (h(x) - \hat{h}_t)\,x\,p_t(x)\,\mathrm{d}x \approx \frac{1}{N}\sum_{i=1}^N \left(h(X_t^i) - \hat{h}_t^{(N)}\right) X_t^i. \tag{14}$$
(See Figure 4 for an illustration of the constant gain approximation.) With the constant gain approximation, the FPF algorithm simplifies to an EnKF algorithm [10]. The constant gain formula (14) was known in the EnKF literature prior to the FPF derivation [48], [16].

There have been a number of studies to improve upon this formula [44], [9], [49], [50], [51], [52], [53]. In the following, we describe the diffusion map approximation, which appears to be the most promising approach in general settings.
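A sketch of the constant gain formula (14) follows. In the linear Gaussian case the formula recovers the Kalman gain $\Sigma H^\top$, which gives a convenient sanity check; the standard-normal ensemble and $h(x) = x_1$ below are illustrative assumptions.

```python
import numpy as np

def constant_gain(X, h):
    """Constant gain approximation (14) of the FPF gain:
        K ~ (1/N) sum_i (h(X^i) - h_bar) X^i,
    where X has shape (N, d) and h maps the ensemble to N scalars."""
    hX = h(X)
    hbar = hX.mean()
    return ((hX - hbar)[:, None] * X).mean(axis=0)

# Sanity check: particles ~ N(0, I) in d = 2 and h(x) = H x with
# H = [1, 0]; the formula should recover Sigma H^T = [1, 0].
rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 2))
K = constant_gain(X, lambda X: X @ np.array([1.0, 0.0]))
```

The same function applies unchanged to nonlinear $h$, in which case the constant vector it returns is the best least-squares constant approximation of the exact gain.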
Diffusion map-based algorithm

The notation $e^{\epsilon\Delta_\rho}$ is used to denote the semigroup associated with the probability-weighted Laplacian $\Delta_\rho$ [54]. As explained in the accompanying sidebar "Poisson equation and its approximations" (and more fully in [55]), the Poisson equation (11) is equivalently expressed as the fixed-point equation:
$$\phi = e^{\epsilon\Delta_\rho}\phi + \int_0^\epsilon e^{s\Delta_\rho}(h - \hat{h})\,\mathrm{d}s, \tag{15}$$
where $\epsilon > 0$ is arbitrary. For small values of $\epsilon$, there is a well-known approximation of the exact semigroup $e^{\epsilon\Delta_\rho}$ in terms of the so-called diffusion map:
$$T_\epsilon f(x) := \frac{1}{n_\epsilon(x)} \int_{\mathbb{R}^d} \frac{g_\epsilon(x - y)}{\sqrt{\int g_\epsilon(y - z)\,\rho(z)\,\mathrm{d}z}}\, f(y)\,\rho(y)\,\mathrm{d}y,$$
where $g_\epsilon(x) := e^{-|x|^2/(4\epsilon)}$ is the Gaussian kernel in $\mathbb{R}^d$ and $n_\epsilon(x)$ is the normalization factor chosen so that the kernel integrates to one (i.e., $T_\epsilon 1 = 1$) [56].

It is straightforward to approximate the diffusion map empirically in terms of the particles:
$$T_\epsilon^{(N)} f(x) = \frac{1}{n_\epsilon^{(N)}(x)} \sum_{i=1}^N \frac{g_\epsilon(x - X^i)}{\sqrt{\sum_{j=1}^N g_\epsilon(X^i - X^j)}}\, f(X^i),$$
where $n_\epsilon^{(N)}(x)$ is the normalization factor.

Upon approximating the fixed-point equation (15) using the empirical approximation $T_\epsilon^{(N)}$ for $e^{\epsilon\Delta_\rho}$, one obtains the diffusion map-based algorithm. The algorithm is summarized in the sidebar "Diffusion map-based algorithm for gain function approximation".

Error: bias-variance trade-off
The error in the diffusion map approximation comes from two sources: (i) the bias error due to the diffusion map approximation of the semigroup; (ii) the variance error due to the empirical approximation in terms of particles. The error is analyzed in [55], where it is shown that
$$\text{r.m.s.e.} = \left(\mathrm{E}\left[\frac{1}{N}\sum_{i=1}^N \left|K(X^i) - K_{\text{exact}}(X^i)\right|^2\right]\right)^{1/2} \leq \underbrace{O(\epsilon)}_{\text{bias}} + \underbrace{O\!\left(\frac{1}{\epsilon^{1 + d/4}\sqrt{N}}\right)}_{\text{variance}}. \tag{16}$$
The error due to bias converges to zero as $\epsilon \to 0$, and the error due to variance converges to zero as $N \to \infty$. There is a trade-off between the two errors: To reduce the bias, one must reduce $\epsilon$. However, for any fixed value of $N$, one can reduce $\epsilon$ only up to a point where the variance starts increasing. The bias-variance trade-off is illustrated in Fig. 5: If $\epsilon$ is large, the error due to bias dominates, while if $\epsilon$ is small, the error due to variance dominates.

As a final point, there is a remarkable and somewhat unexpected relationship between the diffusion map and the constant gain approximations. In particular, in the limit as $\epsilon \to \infty$, the diffusion map gain converges to the constant gain. This suggests a systematic procedure to improve upon the constant gain by de-tuning the value of $\epsilon$ away from the $[\epsilon = \infty]$ limit. For any fixed $N$, a finite value of $\epsilon$ is chosen to minimize the r.m.s.e. according to the bias-variance trade-off. Based on this, a rule of thumb for choosing the $\epsilon$ value appears in [55, Remark 5.1].

Some Final Remarks
In the past decade, the coupling perspective on data assimilation problems has proved enormously valuable, with outstanding theoretical contributions and application impact. Given the limited scope of the present article, with its narrow focus on the FPF algorithm, it is not possible to do justice to the depth and breadth of this exciting new area in one article. The reader is referred to the recent monograph [17] and the tutorial-style review article [18] for an excellent introduction to the subject.

A few important remarks are also necessary at this point: The continuous-time formulation is stressed in this paper for reasons of mathematical elegance and beauty. In practice, discrete-time formulations are much more common. The coupling viewpoint also applies to these settings [17] and was in fact also used in this paper to introduce the main ideas. Next, optimal couplings are almost always difficult to compute. The most popular forms of couplings used in practice are sub-optimal. This is true of the classical EnKF algorithm and also of the FPF algorithm. (A discussion of exactness and optimality for the FPF appears in the sidebar.) As a final point, closely related to the coupling viewpoint is the gradient flow interpretation of the Bayes' update formula – see [46] for an FPF-specific exposition and also [57], [58] for related algorithms.

There are several directions for future work: It is an open problem to fully carry out the stability and error analysis of the finite-N FPF particle system with the diffusion map-based gain function approximation. It will be very useful to be able to characterize the CoD in these general settings. It is also of interest to construct optimization-type formulations that directly yield a finite-N algorithm without the need for empirical approximation as an intermediate step. Such constructions may lead to better error properties by design.
Finally, apart from the optimal transportation formulation stressed in this paper, one may consider alternative approaches for control design. One possible direction is based on the Schrödinger bridge problem [59], [18].

References

[1] S. B. Bastos and D. O. Cajueiro, “Modeling and forecasting the early evolution of the COVID-19 pandemic in Brazil,” arXiv preprint arXiv:2003.14288, 2020.
[2] R. E. Kalman and R. S. Bucy, “New results in linear filtering and prediction theory,”
Journal of Basic Engineering, vol. 83, no. 1, pp. 95–108, 1961.
[3] H. A. P. Blom, “The continuous time roots of the interacting multiple model filter,” in Proc. of the 51st IEEE Conf. Decision and Control, Maui, Hawaii, Dec 2012, pp. 6015–6021.
[4] Y. Bar-Shalom, F. Daum, and J. Huang, “The probabilistic data association filter,” IEEE Control Syst. Mag., vol. 29, no. 6, pp. 82–100, Dec 2009.
[5] A. Bensoussan, J. Frehse, P. Yam et al., Mean Field Games and Mean Field Type Control Theory. Springer, 2013, vol. 101.
[6] R. Carmona and F. Delarue, Probabilistic Theory of Mean Field Games with Applications I–II. Springer, 2018.
[7] T. Yang, P. G. Mehta, and S. P. Meyn, “A mean-field control-oriented approach for particle filtering,” in Proc. of the 2011 American Control Conference, San Francisco, June 2011, pp. 2037–2043.
[8] ——, “Feedback particle filter with mean-field coupling,” in Proc. of the 50th IEEE Conf. Decision and Control, Orlando, FL, December 2011, pp. 7909–7916.
[9] ——, “Feedback particle filter,” IEEE Transactions on Automatic Control, vol. 58, no. 10, pp. 2465–2480, 2013.
[10] A. Taghvaei, J. De Wiljes, P. G. Mehta, and S. Reich, “Kalman filter and its modern extensions for the continuous-time nonlinear filtering problem,” Journal of Dynamic Systems, Measurement, and Control, vol. 140, no. 3, p. 030904, 2018.
[11] T. Yang, G. Huang, and P. G. Mehta, “Joint probabilistic data association-feedback particle filter for multiple target tracking applications,” IEEE, 2012, pp. 820–826.
[12] T. Yang, H. A. Blom, and P. G. Mehta, “Interacting multiple model-feedback particle filter for stochastic hybrid systems,” IEEE, 2013, pp. 7065–7070.
[13] G. Evensen, “Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics,”
Journal of Geophysical Research: Oceans, vol. 99, no. C5, pp. 10143–10162, 1994.
[14] D. Crisan and J. Xiong, “Approximate McKean-Vlasov representations for a class of SPDEs,” Stochastics, vol. 82, no. 1, pp. 53–68, 2010.
[15] S. Reich, “A dynamical systems framework for intermittent data assimilation,” BIT Numerical Mathematics, vol. 51, pp. 235–249, 2011.
[16] K. Bergemann and S. Reich, “An ensemble Kalman-Bucy filter for continuous data assimilation,” Meteorologische Zeitschrift, vol. 21, no. 3, pp. 213–219, 2012.
[17] S. Reich and C. Cotter, Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press, 2015.
[18] S. Reich, “Data assimilation: The Schrödinger perspective,” Acta Numerica, vol. 28, pp. 635–711, 2019.
[19] Y. Cheng and S. Reich, “A McKean optimal transportation perspective on Feynman-Kac formulae with application to data assimilation,” arXiv preprint arXiv:1311.6300, 2013. [Online]. Available: https://arxiv.org/abs/1311.6300
[20] N. J. Gordon, D. J. Salmond, and A. F. Smith, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” in IEE Proceedings F (Radar and Signal Processing), vol. 140, 1993, pp. 107–113.
[21] A. Doucet and A. M. Johansen, “A tutorial on particle filtering and smoothing: Fifteen years later,” Handbook of Nonlinear Filtering.
[22] P. Del Moral and A. Guionnet, “On the stability of interacting processes with applications to filtering and genetic algorithms,” in Annales de l’Institut Henri Poincaré (B) Probability and Statistics, vol. 37, no. 2. Elsevier, 2001, pp. 155–194.
[23] O. Cappé, E. Moulines, and T. Rydén, “Inference in hidden Markov models,” in
Proceedings of EUSFLAT Conference, 2009, pp. 14–16.
[24] T. Bengtsson, P. Bickel, and B. Li, “Curse of dimensionality revisited: Collapse of the particle filter in very large scale systems,” in IMS Lecture Notes – Monograph Series in Probability and Statistics: Essays in Honor of David F. Freedman. Institute of Mathematical Sciences, 2008, vol. 2, pp. 316–334.
[25] A. Beskos, D. Crisan, A. Jasra, and N. Whiteley, “Error bounds and normalising constants for sequential Monte Carlo samplers in high dimensions,” Advances in Applied Probability, vol. 46, no. 1, pp. 279–306, 2014.
[26] P. Rebeschini, R. Van Handel et al., “Can local particle filters beat the curse of dimensionality?” The Annals of Applied Probability, vol. 25, no. 5, pp. 2809–2866, 2015.
[27] P. Del Moral, “Feynman-Kac formulae,” in Feynman-Kac Formulae. Springer, 2004, pp. 47–93.
[28] A. Bain and D. Crisan, Fundamentals of Stochastic Filtering. Springer, 2009, vol. 3.
[29] C. Villani, Topics in Optimal Transportation. American Mathematical Soc., 2003, no. 58.
[30] L. C. Evans, “Partial differential equations and Monge-Kantorovich mass transfer,” Current Developments in Mathematics, pp. 65–126, 1997.
[31] A. Taghvaei and P. G. Mehta, “An optimal transport formulation of the linear feedback particle filter,” in
American Control Conference (ACC), 2016. IEEE, 2016, pp. 3614–3619.
[32] J. Whitaker and T. M. Hamill, “Ensemble data assimilation without perturbed observations,” Monthly Weather Review, vol. 130, no. 7, pp. 1913–1924, 2002.
[33] J. Mandel, L. Cobb, and J. D. Beezley, “On the convergence of the ensemble Kalman filter,” Applications of Mathematics, vol. 56, no. 6, pp. 533–541, 2011.
[34] P. Del Moral and J. Tugaut, “On the stability and the uniform propagation of chaos properties of ensemble Kalman–Bucy filters,” Ann. Appl. Probab., vol. 28, no. 2, pp. 790–850, 2018. [Online]. Available: https://doi.org/10.1214/17-AAP1317
[35] P. Del Moral, A. Kurtzmann, and J. Tugaut, “On the stability and the uniform propagation of chaos of a class of extended ensemble Kalman–Bucy filters,” SIAM Journal on Control and Optimization, vol. 55, no. 1, pp. 119–155, 2017.
[36] J. de Wiljes, S. Reich, and W. Stannat, “Long-time stability and accuracy of the ensemble Kalman–Bucy filter for fully observed processes and small measurement noise,” SIAM Journal on Applied Dynamical Systems, vol. 17, no. 2, pp. 1152–1181, 2018.
[37] A. Taghvaei and P. G. Mehta, “An optimal transport formulation of the ensemble Kalman filter,” arXiv preprint arXiv:1910.02338 (to appear in IEEE Transactions on Automatic Control), 2019.
[38] A. Sznitman, “Topics in propagation of chaos,” École d’Été de Probabilités de Saint-Flour XIX—1989, pp. 165–251, 1991.
[39] S. T. Rachev and L. Rüschendorf,
Mass Transportation Problems: Volume I: Theory. Springer Science & Business Media, 1998, vol. 1.
[40] T. A. El Moselhy and Y. M. Marzouk, “Bayesian inference with optimal maps,” Journal of Computational Physics, vol. 231, no. 23, pp. 7815–7850, 2012.
[41] D. A. Mesa, J. Tantiongloc, M. Mendoza, S. Kim, and T. P. Coleman, “A distributed framework for the construction of transport maps,” Neural Computation, vol. 31, no. 4, pp. 613–652, 2019.
[42] J. Heng, A. Doucet, and Y. Pokern, “Gibbs flow for approximate transport with applications to Bayesian computation,” arXiv preprint arXiv:1509.08787, 2015. [Online]. Available: https://arxiv.org/abs/1509.08787
[43] B. Oksendal, Stochastic Differential Equations: An Introduction with Applications. Springer Science & Business Media, 2013.
[44] T. Yang, R. S. Laugesen, P. G. Mehta, and S. P. Meyn, “Multivariable feedback particle filter,” Automatica, vol. 71, pp. 10–23, 2016.
[45] S. Pathiraja, S. Reich, and W. Stannat, “McKean-Vlasov SDEs in nonlinear filtering,” arXiv preprint arXiv:2007.12658, 2020.
[46] R. S. Laugesen, P. G. Mehta, S. P. Meyn, and M. Raginsky, “Poisson’s equation in nonlinear filtering,” SIAM Journal on Control and Optimization, vol. 53, no. 1, pp. 501–525, 2015. [Online]. Available: 10.1137/13094743X
[47] C. Zhang, A. Taghvaei, and P. G. Mehta, “Feedback particle filter on Riemannian manifolds and matrix Lie groups,” IEEE Transactions on Automatic Control, vol. 63, no. 8, pp. 2465–2480, 2017.
[48] G. Evensen, Data Assimilation: The Ensemble Kalman Filter. New York: Springer-Verlag, 2006.
[49] K. Berntorp and P. Grover, “Data-driven gain computation in the feedback particle filter,” 2016, pp. 2711–2716.
[50] Y. Matsuura, R. Ohata, K. Nakakuki, and R. Hirokawa, “Suboptimal gain functions of feedback particle filter derived from continuation method,” in
AIAA Guidance, Navigation, and Control Conference, 2016, p. 1620.
[51] A. Radhakrishnan, A. Devraj, and S. Meyn, “Learning techniques for feedback particle filter design,” in Conference on Decision and Control (CDC), 2016. IEEE, 2016, pp. 648–653.
[52] A. Radhakrishnan and S. Meyn, “Feedback particle filter design using a differential-loss reproducing kernel Hilbert space,” IEEE, 2018, pp. 329–336.
[53] K. Berntorp, “Comparison of gain function approximation methods in the feedback particle filter,” IEEE, 2018, pp. 123–130.
[54] D. Bakry, I. Gentil, and M. Ledoux, Analysis and Geometry of Markov Diffusion Operators. Springer Science & Business Media, 2013, vol. 348.
[55] A. Taghvaei, P. G. Mehta, and S. P. Meyn, “Diffusion map-based algorithm for gain function approximation in the feedback particle filter,” SIAM/ASA Journal on Uncertainty Quantification, vol. 8, no. 3, pp. 1090–1117, 2020.
[56] R. R. Coifman and S. Lafon, “Diffusion maps,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.
[57] A. Halder and T. T. Georgiou, “Gradient flows in filtering and Fisher-Rao geometry,” IEEE, 2018, pp. 4281–4286.
[58] A. Garbuno-Inigo, F. Hoffmann, W. Li, and A. M. Stuart, “Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler,” SIAM Journal on Applied Dynamical Systems, vol. 19, no. 1, pp. 412–441, 2020.
[59] Y. Chen, T. T. Georgiou, and M. Pavon, “On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint,” Journal of Optimization Theory and Applications, vol. 169, no. 2, pp. 671–691, 2016.
[60] P. Bickel, B. Li, T. Bengtsson et al., “Sharp failure rates for the bootstrap particle filter in high dimensions,” in Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh. Institute of Mathematical Statistics, 2008, pp. 318–329.
[61] T. Bengtsson, P. Bickel, and B. Li, “Curse of dimensionality revisited: Collapse of the particle filter in very large scale systems,” in
IMS Lecture Notes – Monograph Series in Probability and Statistics: Essays in Honor of David F. Freedman. Institute of Mathematical Sciences, 2008, vol. 2, pp. 316–334.
[62] C. Snyder, T. Bengtsson, P. Bickel, and J. Anderson, “Obstacles to high-dimensional particle filtering,” Monthly Weather Review, vol. 136, no. 12, pp. 4629–4640, 2008.
[63] S. C. Surace, A. Kutschireiter, and J.-P. Pfister, “How to avoid the curse of dimensionality: scalability of particle filters with and without importance weights,” SIAM Review, vol. 61, no. 1, pp. 79–91, 2019.
[64] S. Y. Olmez, A. Taghvaei, and P. G. Mehta, “Deep FPF: Gain function approximation in high-dimensional setting,” arXiv preprint arXiv:2010.01183, 2020.
[65] E. Weinan and B. Yu, “The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems,” Communications in Mathematics and Statistics, vol. 6, no. 1, pp. 1–12, 2018.
[66] W. C. Roda, M. B. Varughese, D. Han, and M. Y. Li, “Why is it difficult to accurately predict the COVID-19 epidemic?” Infectious Disease Modelling, 2020.
[67] S. L. Wu, A. N. Mertens, Y. S. Crider, A. Nguyen, N. N. Pokpongkiat, S. Djajadi, A. Seth, M. S. Hsiang, J. M. Colford, A. Reingold et al., “Substantial underestimation of SARS-CoV-2 infection in the United States,” Nature Communications, vol. 11, no. 1, pp. 1–10, 2020.
[68] R. Li, S. Pei, B. Chen, Y. Song, T. Zhang, W. Yang, and J. Shaman, “Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2),” Science, vol. 368, no. 6490, pp. 489–493, 2020.
[69] M. G. M. Gomes, R. Aguas, R. M. Corder, J. G. King, K. E. Langwig, C. Souto-Maior, J. Carneiro, M. U. Ferreira, and C. Penha-Goncalves, “Individual variation in susceptibility or exposure to SARS-CoV-2 lowers the herd immunity threshold,” medRxiv, 2020.
[70] P. E. Paré, C. L. Beck, and T. Basar, “Modeling, estimation, and analysis of epidemics over networks: An overview,” Annual Reviews in Control: Special Issue on Systems and Control Research Efforts Against COVID-19 and Future Pandemics, 2020.
[71] S. Y. Olmez, J. Mori, E. Miehling, T. Basar, R. L. Smith, M. West, and P. Mehta, “A data-informed approach for analysis, validation, and identification of COVID-19 models,” medRxiv, 2020.
[73] G. Evensen, J. Amezcua, M. Bocquet, A. Carrassi, A. Farchi, A. Fowler, P. Houtekamer, C. K. Jones, R. de Moraes, M. Pulido et al., “An international assessment of the COVID-19 pandemic using ensemble data assimilation,” medRxiv, 2020.

(a) Kalman filter. (b) Feedback particle filter.
Figure 1: Feedback control structure of the Kalman filter (KF) and the feedback particle filter (FPF). X̂_t in the KF is the estimate (conditional mean) of the hidden state. X^i_t in the FPF is a sample from the posterior (conditional distribution) of the hidden state. In either algorithm, the Bayesian update is implemented via a gain times error control law.

Figure 2: Coupling viewpoint of the filtering problem. The task of a particle filter is to convert a sample of N particles from the prior distribution into a sample of N particles from the posterior distribution. This task is viewed as finding a coupling between the prior and the posterior distributions.

Figure 3: Gain function approximation problem in the feedback particle filter. The exact gain function K(x) = ∇φ(x), where φ solves the Poisson equation (11). The numerical problem is to approximate ∇φ(x)|_{x=X^i} using only the particles {X^i : 1 ≤ i ≤ N} sampled from the density ρ (depicted as a shaded region).

Figure 4: Constant gain approximation in the feedback particle filter. The gain function is approximated by its expected value according to (14).

(a) Approximate gain function (for ε = 0.02, 0.10, 0.50, 2.00) compared to the exact gain function (solid line). (b) The root mean squared error as a function of the parameter ε.

Figure 5: Bias-variance trade-off in the diffusion map-based gain function approximation. (a) The dashed line is the constant gain solution (14). As ε → ∞, the diffusion map gain converges to the constant gain. (The shaded area in the background is the density function ρ, taken as a sum of two Gaussians N(−1, σ²) and N(+1, σ²). The exact gain function K(x) is computed for h(x) = x by using an integral formula [55, Eq. 4.6].) (b) The r.m.s.e.
is computed as an empirical approximation of (16) by averaging over independent simulations with N = 200 particles.

Summary

The feedback particle filter (FPF) is a Monte Carlo (MC) algorithm to approximate the solution of a stochastic filtering problem. In contrast to conventional particle filters, the Bayesian update step in the FPF is implemented via a mean-field type feedback control law.

The objective of this paper is to situate the development of the FPF and related controlled interacting particle system algorithms within the framework of optimal transportation theory. Starting from the simplest setting of the Bayes' update formula, a coupling viewpoint is introduced to construct particle filters. It is shown that the conventional importance sampling resampling particle filter implements an independent coupling. The design of optimal couplings is introduced first for the simple Gaussian setting and subsequently extended to derive the FPF algorithm. The final half of the paper provides a review of some of the salient aspects of the FPF algorithm, including the feedback structure, algorithms for gain function design, and comparison with conventional particle filters. The comparison serves to illustrate the benefit of feedback in particle filtering.

Summary of the linear Gaussian example
The formulae for the linear Gaussian example are summarized as follows:

Prior: Gaussian N(m_0, Σ_0)
Observation model: Y = HX + W
Likelihood function: l(x) = exp(−½|y − Hx|²)
Posterior: Gaussian N(m_1, Σ_1)
Optimal transport map: T(x) = F(x − m_0) + m_1
Mean-field process: X̄_1 = F(X̄_0 − m_0) + m_0 + K(Y − H m_0)
Particle system: X^i_1 = F^{(N)}(X^i_0 − m_0^{(N)}) + m_0^{(N)} + K^{(N)}(Y − H m_0^{(N)}), for i = 1, ..., N

Optimal transport construction of stochastic processes

Deterministic path
Let P(R^d) be the space of everywhere positive probability densities on R^d with finite second moment. Given a smooth path {p_t ∈ P(R^d) : t ≥ 0}, the problem is to construct a stochastic process {X̄_t : t ≥ 0} such that the probability density of X̄_t, denoted as p̄_t, equals p_t for all t ≥ 0. The exactness condition is expressed as

p̄_t = p_t, ∀ t ≥ 0. (S1)

Now, there are infinitely many stochastic processes that satisfy the exactness condition. This is because the exactness condition specifies only the one-time marginal distribution, which is clearly not enough to uniquely identify the stochastic process; e.g., the two-time joint distributions are not specified. A unique choice is made by prescribing an additional optimality criterion based on optimal transportation theory.

To make these considerations concrete, assume that the given path {p_t : t ≥ 0} evolves according to the PDE

∂p_t/∂t = V(p_t),

where V(·) is an operator (e.g., the Laplacian) that acts on probability densities. (This necessarily restricts the operator V; e.g., ∫ V(p)(x) dx = 0 for all p ∈ P(R^d).) The following model is assumed for the process {X̄_t : t ≥ 0}:

d/dt X̄_t = u_t(X̄_t), X̄_0 ∼ p_0, (S2)

where u_t(·) is a control law that needs to be designed. Using the continuity equation, the exactness condition (S1) is satisfied if

−∇·(p̄_t u_t) = V(p̄_t), ∀ t ≥ 0. (S3)

The non-uniqueness issue is now readily seen: the first-order PDE (S3) admits infinitely many solutions. A unique solution u_t is picked by optimizing the coupling between X̄_t and X̄_{t+∆t} in the limit as ∆t → 0. The leading term in the transportation cost E[|X̄_{t+∆t} − X̄_t|²] is of order O(∆t²), whereby

lim_{∆t→0} (1/∆t²) E[|X̄_{t+∆t} − X̄_t|²] = ∫_{R^d} |u_t(x)|² p̄_t(x) dx.
Therefore, for each fixed t ≥ 0, the control law u_t is obtained by solving the constrained optimization problem

min_{u_t} ∫_{R^d} |u_t(x)|² p̄_t(x) dx, s.t. −∇·(p̄_t u_t) = V(p̄_t). (S4)

The cost is the infinitesimal form of the L²-Wasserstein distance and the constraint expresses the exactness condition.

By a standard calculus of variations argument, the solution of the optimization problem (S4) is obtained as u*_t = ∇φ_t, where φ_t solves the second-order PDE −∇·(p̄_t ∇φ_t) = V(p̄_t). The resulting stochastic process X̄_t evolves according to

d/dt X̄_t = ∇φ_t(X̄_t), X̄_0 ∼ p_0,

where φ_t solves the PDE −∇·(p̄_t ∇φ_t) = V(p̄_t).

As a concrete example, suppose the given path is a solution of the heat equation ∂p_t/∂t = ∆p_t. (So V(·) is the Laplacian operator.) The solution of the second-order PDE is easily obtained as φ_t = −log(p̄_t). The optimal transport process X̄_t then evolves according to

d/dt X̄_t = −∇ log(p_t(X̄_t)), X̄_0 ∼ p_0.

This process should be compared to the SDE

dX_t = √2 dB_t, X_0 ∼ p_0, (S5)

where {B_t : t ≥ 0} is a Wiener process. The SDE (S5) is a well-known stochastic coupling whose one-time marginal evolves according to the solution of the heat equation.

Stochastic path
In the filtering problem, the path of the posterior probability densities is stochastic (because it depends upon the random observations {Z_t : t ≥ 0}). Therefore, the preceding discussion is not directly applicable. Suppose the stochastic path {p_t(·) ∈ P(R^d) : t ≥ 0} is governed by a stochastic PDE

dp_t = H(p_t) dI_t,

where H(·) is an operator that acts on probability densities and {I_t : t ≥ 0} is a Wiener process. Consider the following SDE model:

dX̄_t = u_t(X̄_t) dt + K_t(X̄_t) dI_t, X̄_0 ∼ p_0,

where, compared to the deterministic form of (S2), an additional stochastic term is now included. The problem is to identify control laws u_t(·) and K_t(·) such that the conditional distribution of X̄_t equals p_t. This exactness condition, the counterpart of (S3), is now

−∇·(p̄_t K_t) = H(p̄_t), (S6a)
−∇·(p̄_t u_t) + ½ ∇·( ∇·(p̄_t K_t) K_t + p̄_t K_t ∇K_t ) = 0. (S6b)

These equations are obtained by writing the time-evolution of the conditional probability density of X̄_t [44, Prop. 1]. As in the deterministic setting, the solution is not unique.

The unique optimal control law is obtained by requiring that the coupling between X̄_t and X̄_{t+∆t} is optimal in the limit as ∆t → 0. In contrast to the deterministic setting, the leading term in the transportation cost E[|X̄_{t+∆t} − X̄_t|²] is of order O(∆t), whereby

lim_{∆t→0} (1/∆t) E[|X̄_{t+∆t} − X̄_t|²] = ∫_{R^d} |K_t(x)|² p̄_t(x) dx. (S7)

Therefore, for each fixed t ≥ 0, the control law K_t is obtained by solving the constrained optimization problem

min_{K_t} ∫_{R^d} |K_t(x)|² p̄_t(x) dx, s.t. −∇·(p̄_t K_t) = H(p̄_t). (S8)

As before, the solution of the optimization problem (S8) is given by K*_t = ∇φ_t, where φ_t solves the second-order PDE −∇·(p̄_t ∇φ_t) = H(p̄_t).

It remains to identify the control law u_t.
For this purpose, the second-order term in the infinitesimal Wasserstein cost is used:

lim_{∆t→0} (1/∆t²) ( E[|X̄_{t+∆t} − X̄_t|²] − ∆t ∫_{R^d} |K*_t|² p̄_t dx ) = ∫_{R^d} |u_t|² p̄_t dx.

The right-hand side is minimized subject to the constraint (S6b). Remarkably, the optimal solution is expressed as

u*_t = −(1/(2p̄_t)) H(p̄_t) ∇φ_t + ½ (∇²φ_t) ∇φ_t + ξ_t,

where ∇²φ_t denotes the Hessian of φ_t and ξ_t is the (unique such) divergence-free vector field (i.e., ∇·(p̄_t ξ_t) = 0) such that u_t is of a gradient form. The resulting optimal transport process is

dX̄_t = ∇φ_t(X̄_t) ∘ (dI_t − (1/(2p̄_t)) H(p̄_t) dt) + ξ_t(X̄_t) dt, X̄_0 ∼ p_0. (S9)

It is also readily shown that the process {X̄_t : t ≥ 0} is in fact exact for any choice of divergence-free vector field ξ_t. The most convenient such choice is obtained by simply setting ξ_t ≡ 0. The resulting filter is exact and furthermore also (infinitesimally) optimal to the first order (see (S7)).

For the special case of the nonlinear filtering problem, H(p) = (h − ĥ)p, where ĥ = ∫ h(x) p(x) dx, and dI_t = dZ_t − ĥ_t dt is the increment of the innovation process. The optimal transport stochastic process (S9) is then given by the Stratonovich form

dX̄_t = ∇φ_t(X̄_t) ∘ (dZ_t − ½(h + ĥ_t) dt) + ξ_t(X̄_t) dt, X̄_0 ∼ p_0.

The control law in the FPF algorithm (10) represents the particular sub-optimal choice ξ_t ≡ 0.

Benefit of Feedback

In this sidebar, we consider a simple example to highlight the phenomenon of the curse of dimensionality (CoD) in particle filters and provide a comparison with the FPF to see how the curse is mitigated by using feedback control. The example we consider is as follows:

dX_t = 0, X_0 ∼ N(0, σ² I_d), (S10a)
dZ_t = X_t dt + σ_w dW_t, (S10b)

for t ∈ [0, 1], where X_t is a d-dimensional process, σ_w, σ > 0, and I_d is the d × d identity matrix. The posterior distribution at time t = 1 is a Gaussian N(m_1, Σ_1) with mean m_1 = (σ²/(σ² + σ_w²)) Z_1 and variance Σ_1 = (σ² σ_w²/(σ² + σ_w²)) I_d.

We consider the following MC approaches to approximate the posterior distribution:

1) SIR particle filter (PF): Sample {X^i_0 : 1 ≤ i ≤ N} from the initial distribution. Form the weighted distribution and generate new samples from the weighted distribution:

X^i_1 ∼ Σ_{i=1}^N w^i δ_{X^i_0}, w^i = e^{−|Z_1 − X^i_0|²/(2σ_w²)} / E[ e^{−|Z_1 − X^i_0|²/(2σ_w²)} | Z ], X^i_0 ∼ N(0, σ² I_d). (S11)

2) Feedback particle filter (FPF): Simulate the particles according to

dX^i_t = (1/σ_w²) Σ^{(N)}_t ( dZ_t − ((X^i_t + m^{(N)}_t)/2) dt ), X^i_0 ∼ N(0, σ² I_d), (S12)

for t ∈ [0, 1], where m^{(N)}_t is the empirical mean of the particles and Σ^{(N)}_t is the empirical variance of the particles.

We compare the mean squared error (m.s.e.) in approximating the exact conditional mean m_1 of X_1. The m.s.e. is defined as

m.s.e. := E[ | (1/N) Σ_{i=1}^N X^i_1 − m_1 |² ].

In [37], the following is proved:
Proposition 1:
Consider the filtering problem (S10) with state dimension d. Then:

(i) For the PF (S11), the m.s.e. grows exponentially with the dimension: m.s.e._PF ≥ (σ²/N) c^{d+1} for a constant c > 1 (an explicit formula appears in [37]).

(ii) For the FPF (S12), m.s.e._FPF ≤ (σ²/N)(3d² + 2d).

The result above is consistent with the extensive studies of importance sampling-based particle filters [60], [61], [62], [26]. In these papers, it is shown that if (log N)/d → 0, then the largest importance weight max_{1≤i≤N} w^i → 1 in probability. Consequently, in order to prevent the weight collapse, the number of particles must grow exponentially with the dimension. This phenomenon is referred to as the curse of dimensionality (CoD) for particle filters.

A numerical comparison of the m.s.e. as a function of N and d is depicted in Figure S1 (a)-(b). The expectation is approximated by averaging over M = 1000 independent simulations. It is observed that, in order to have the same error, the importance sampling-based approach requires the number of samples N to grow exponentially with the dimension d, whereas the growth using the FPF for this numerical example is O(d). The scaling with dimension depicted in Figure S1 (b) suggests that the O(d²) bound for the m.s.e. in the linear FPF is loose. This is because of the conservative nature of the approximations used in deriving the inequality [37]. The overall conclusions of the study are consistent with other numerical results reported in the literature [63].

(a) Importance sampling particle filter (PF) (S11). (b) Feedback particle filter (FPF) (S12).

Figure S1: Overcoming the curse of dimensionality of particle filters. The solid lines correspond to the level sets of the mean squared error for the filtering problem (S10).
In order to have the same error, the PF requires the number of samples N to grow exponentially with the dimension d, whereas the growth using the FPF for this numerical example is O(d).

Poisson equation and its approximations

The Poisson equation (11) of nonlinear filtering is a linear PDE. Its finite-dimensional counterpart is the familiar linear problem

Ax = b, (S13)

where A is an n × n (strictly) positive-definite symmetric matrix and the right-hand side b is a given n × 1 vector. The problem is to obtain the unknown n × 1 vector x. For this purpose, the following equivalent formulations of the finite-dimensional problem are first introduced:

1) x is the solution of the weak form

yᵀAx = yᵀb, ∀ y ∈ Rⁿ.
2) For any t > 0, x is the solution of the fixed-point equation

x = e^{−tA} x + ∫_0^t e^{−sA} b ds.

3) x is the solution of the optimization problem

min_{x ∈ Rⁿ} ½ xᵀAx − xᵀb.

When n is large, these formulations are useful to numerically approximate the solution of (S13):

1) For each fixed y ∈ Rⁿ, the weak form is a single equation. By restricting y to a suitable low-dimensional subspace S ⊂ Rⁿ, the number of linear equations is reduced for the purposes of obtaining an approximate solution (possibly also in S).

2) The fixed-point equation form is useful because e^{−tA} is a contraction for positive-definite A. So, a good initial guess for x can readily be improved by using the Banach iteration.

3) The optimization form is useful to develop alternative (e.g., search-type) algorithms to obtain the solution of (S13).

We turn our attention next to the Poisson equation (11), expressed succinctly as

−∆_ρ φ = h − ĥ,

where ∆_ρ φ := (1/ρ) ∇·(ρ ∇φ) is the probability-weighted Laplacian. Functional-analytic considerations require the introduction of the function spaces: L²(ρ) is the space of square-integrable functions with respect to ρ, with inner product ⟨f, g⟩_{L²} = ∫ f(x) g(x) ρ(x) dx; H¹(ρ) is the Hilbert space of functions in L²(ρ) whose first derivative, defined in the weak sense, is also in L²(ρ); and H¹_0(ρ) = {ψ ∈ H¹(ρ) : ∫ ψ(x) ρ(x) dx = 0}.

These definitions are important because H¹_0(ρ) is the natural space for the solution φ of the Poisson equation (11). The operator −∆_ρ is symmetric (self-adjoint) and positive definite because

−⟨f, ∆_ρ g⟩_{L²} = ⟨∇f, ∇g⟩_{L²} = −⟨∆_ρ f, g⟩_{L²}, ∀ f, g ∈ H¹(ρ).

One requires an additional technical condition – the so-called Poincaré inequality – to conclude that the operator is in fact strictly positive-definite.
Assuming the Poincaré inequality holds, it is also readily shown that ∆_ρ⁻¹ is well defined, i.e., a unique solution φ ∈ H¹_0(ρ) exists for a given h ∈ L²(ρ) [44, Thm. 2].

For the purposes of numerical approximation, entirely analogous to the finite-dimensional case, the following equivalent formulations of the Poisson equation are introduced:

1) φ is a solution of the weak form

⟨∇ψ, ∇φ⟩_{L²} = ⟨ψ, h − ĥ⟩_{L²}, ∀ ψ ∈ H¹(ρ). (S14)

2) φ is a solution of the fixed-point equation

φ = e^{t∆_ρ} φ + ∫_0^t e^{s∆_ρ} (h − ĥ) ds.

3) φ is the solution of the optimization problem

min_{φ ∈ H¹_0(ρ)} ½ ⟨∇φ, ∇φ⟩_{L²} − ⟨φ, h − ĥ⟩_{L²}. (S15)

These formulations have been used to develop numerical algorithms for gain function approximation:

1) Instead of ψ ∈ H¹(ρ) in the weak form (S14), a relaxation is considered whereby ψ ∈ S = span{ψ₁, ..., ψ_M}, a finite-dimensional subspace of H¹(ρ). The resulting algorithm is referred to as the Galerkin algorithm for gain function approximation [44]. The constant gain formula (14) is obtained by considering S to be the subspace spanned by the coordinate functions.

2) The semigroup e^{t∆_ρ} is approximated with the diffusion map operator T_ε, as described in the main body of the paper. This approximation yields the diffusion map-based algorithm for gain function approximation, tabulated in the sidebar.

3) The optimization formulation (S15) is useful to explore nonlinear parametrizations of the gain function, e.g., using neural networks. A preliminary investigation of this appears in our paper [64]. Related deep learning-inspired techniques for solving PDEs using neural networks appear in [65].
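The Banach iteration in formulation 2) is easy to demonstrate numerically. The following is a minimal numpy sketch (not from the paper; the matrix, the value of t, and the iteration count are arbitrary illustrative choices). For a symmetric positive-definite A, both e^{−tA} and the integral ∫_0^t e^{−sA} b ds are computed exactly from the eigendecomposition of A:

```python
import numpy as np

def fixed_point_solve(A, b, t=0.5, iters=200):
    """Solve Ax = b via the fixed-point iteration x <- e^{-tA} x + int_0^t e^{-sA} b ds.

    A is assumed symmetric positive-definite, so with A = V diag(lam) V^T:
      e^{-tA}             = V diag(e^{-t lam}) V^T,
      int_0^t e^{-sA} b ds = V diag((1 - e^{-t lam}) / lam) V^T b.
    """
    lam, V = np.linalg.eigh(A)
    E = V @ np.diag(np.exp(-t * lam)) @ V.T               # contraction e^{-tA}
    c = V @ ((1.0 - np.exp(-t * lam)) / lam * (V.T @ b))  # constant term of the iteration
    x = np.zeros_like(b)                                  # initial guess
    for _ in range(iters):
        x = E @ x + c                                     # Banach iteration
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)          # symmetric positive-definite
b = rng.standard_normal(5)
x = fixed_point_solve(A, b)
print(np.allclose(x, np.linalg.solve(A, b)))   # True
```

Because the spectral radius of e^{−tA} is e^{−t λ_min} < 1, the iteration contracts to the unique fixed point, which is precisely the solution A⁻¹b.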
Diffusion map-based algorithm for gain function approximation
Input: $\{X^i : 1 \le i \le N\}$, $\{h(X^i) : 1 \le i \le N\}$, kernel bandwidth $\epsilon$
Output: $\{\mathsf{K}^i : 1 \le i \le N\}$
1) $g_{ij} := e^{-|X^i - X^j|^2/(4\epsilon)}$ for $i, j = 1$ to $N$
2) $k_{ij} := \dfrac{g_{ij}}{\sqrt{\sum_l g_{il}}\,\sqrt{\sum_l g_{jl}}}$ for $i, j = 1$ to $N$
3) $d_i = \sum_j k_{ij}$ for $i = 1$ to $N$
4) $T_{ij} := \dfrac{k_{ij}}{d_i}$ for $i, j = 1$ to $N$
5) $\pi_i = \dfrac{d_i}{\sum_j d_j}$ for $i = 1$ to $N$
6) $\hat{h} = \sum_{i=1}^N \pi_i h(X^i)$
7) $\Phi_0 = (0, \ldots, 0) \in \mathbb{R}^N$
8) Solve the fixed-point problem $\Phi = T\Phi + \epsilon(h - \hat{h})$ iteratively
9) $r_i = \Phi_i + \epsilon h_i$ for $i = 1$ to $N$
10) $s_{ij} = \dfrac{1}{2\epsilon} T_{ij} \big( r_j - \sum_{k=1}^N T_{ik} r_k \big)$ for $i, j = 1$ to $N$
11) $\mathsf{K}^i = \sum_j s_{ij} X^j$ for $i = 1$ to $N$

Example: FPF for SIR models
The basic mathematical model of epidemiological disease propagation is the SIR ODE model:
$$\dot{S}_t = -\beta S_t I_t, \qquad \dot{I}_t = \beta S_t I_t - \alpha I_t, \qquad \dot{R}_t = \alpha I_t,$$
where $S_t$, $I_t$, and $R_t$ are the susceptible, infected, and recovered population fractions, respectively, at time $t$. The parameters $\beta$ and $\alpha$ are the transmission rate and the recovery rate, respectively. In an epidemic, one observes the number of newly infected over a time increment (daily). For our study, this is modeled as
$$\mathrm{d}Z_t = (\beta I_t S_t)\,\mathrm{d}t + \sigma_W \,\mathrm{d}W_t, \tag{S16}$$
where $W = \{W_t : t \ge 0\}$ is a standard Wiener process (w.p.) and $\sigma_W$ is the standard deviation (std. dev.) parameter. Given the observations, the filtering objective is to estimate the population sizes and possibly also the model parameters. In this study, the recovery rate parameter $\alpha$ is assumed known while the transmission rate parameter $\beta$ is estimated. In a filtering setup, this requires a model, which is assumed to be of the following form:
$$\mathrm{d}\beta_t = \sigma_B \,\mathrm{d}B_t,$$
where $B = \{B_t : t \ge 0\}$ is a standard w.p. and $\sigma_B$ is the standard deviation parameter.
The model and the filter are simulated using the Euler discretization scheme for time integration. The simulation parameters are as follows: time discretization step-size $\Delta t = 1$; std. dev. for the observation noise $\sigma_W = 0.$; std. dev. for the process noise $\sigma_B = 0.$; initial distribution $I(0) \sim \mathrm{unif}[0, 0.]$ and $S(0) = 1 - I(0)$; recovery rate $\alpha = 0.$; the transmission rate $\beta$ is fixed to a constant value but assumed unknown to the filtering algorithm. The FPF is simulated using $N = 100$ particles. Two gain function approximation algorithms are implemented: the constant-gain approximation and the diffusion map approximation.
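A minimal NumPy sketch of the diffusion map-based gain function approximation tabulated above is as follows; the Gaussian test case at the end is a hypothetical illustration (not part of the SIR study), chosen because the exact gain there is the constant 1:

```python
import numpy as np

def diffusion_map_gain(X, h, eps, n_iter=200):
    """Gain function approximation following the diffusion map-based
    algorithm tabulated above (scalar particles X^i, scalar h)."""
    N = len(X)
    # Steps 1-2: Gaussian kernel and its symmetric normalization.
    g = np.exp(-(X[:, None] - X[None, :]) ** 2 / (4 * eps))
    sq = np.sqrt(g.sum(axis=1))
    k = g / (sq[:, None] * sq[None, :])
    # Steps 3-5: Markov matrix T and its stationary weights pi.
    d = k.sum(axis=1)
    T = k / d[:, None]
    pi = d / d.sum()
    h_hat = pi @ h
    # Step 8: solve Phi = T Phi + eps (h - h_hat) by fixed-point iteration.
    Phi = np.zeros(N)
    for _ in range(n_iter):
        Phi = T @ Phi + eps * (h - h_hat)
    # Steps 9-11: recover the gain K^i at each particle.
    r = Phi + eps * h
    s = (1.0 / (2 * eps)) * T * (r[None, :] - (T @ r)[:, None])
    return s @ X

# Hypothetical test case: Gaussian particles with h(x) = x, for which
# the exact gain function is the constant 1.
rng = np.random.default_rng(1)
X = rng.normal(size=500)
K = diffusion_map_gain(X, h=X.copy(), eps=0.5)
print(K.mean())   # close to 1 for this linear-Gaussian case
```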
For the diffusion map approximation, the heuristic
$$\epsilon = 10\, \mathrm{med}\big( \{ |X^i - X^j|^2 : 1 \le i, j \le N \} \big) (\log N)^{-1}$$
is used, where $\mathrm{med}(\cdot)$ denotes the statistical median. The simulation parameters and their values are tabulated in Table S1.
Figure S2 depicts the numerical results for the synthetic observation data generated using the model. Although the results depicted in the figure are illustrative as an application of FPF to the SIR models, additional work is necessary for its use in prediction with real COVID-19 data, for the following reasons:
1) The observation model (S16) is not accurate. In real-world settings, one only observes a certain unknown (and possibly time-varying and delayed) fraction of the newly infected population. This leads to fundamental issues with the identifiability of the transmission rate parameter $\beta$ [66], [67]. Accurate estimation of $\beta$ (or the closely associated non-dimensional reproduction number $R_0$) is important to capture the initial growth of the epidemic [68].
2) The 3-state SIR dynamic model is rather simplistic, for several reasons: (i) the model assumes a homogeneous, well-mixed population, while in practice there is strong evidence of heterogeneities [69] as well as spatial network effects [70]; (ii) the model is based on the underlying assumption of Markovian transitions between the epidemiological states, an assumption contradicted by experimental data on delay distributions [71]; and (iii) even in the simplistic setting of the SIR model, the transmission rate parameter $\beta$ is strongly time-varying: it is affected both by the individual choices (e.g., mask wearing) of a large number of agents and by population-level government mandates (e.g., lockdowns).
These difficulties notwithstanding, EnKF-based solutions to the COVID-19 data assimilation problem appear in [72], [73]. However, much work remains to be done on this problem of immense societal importance.
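For concreteness, the synthetic-data generation step of the example (Euler discretization of the SIR dynamics together with the observation model (S16)) can be sketched as follows; the parameter values below are illustrative assumptions for this sketch, not necessarily those of Table S1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Euler time-discretization of the SIR model and observation model (S16).
# All parameter values below are illustrative assumptions.
alpha = 0.1     # recovery rate (assumed)
beta = 0.3      # transmission rate (assumed)
sigma_W = 0.1   # observation noise std. dev. (assumed)
dt = 1.0        # time step (one day)
n_steps = 100

S, I = 0.99, 0.01           # initial susceptible / infected fractions
trajectory = []
for _ in range(n_steps):
    # Deterministic SIR dynamics (R = 1 - S - I is implied).
    dS = -beta * S * I * dt
    dI = (beta * S * I - alpha * I) * dt
    # Observation increment dZ: newly infected over [t, t + dt] plus noise.
    dZ = beta * S * I * dt + sigma_W * np.sqrt(dt) * rng.normal()
    S, I = S + dS, I + dI
    trajectory.append((S, I, dZ))
```

A filtering algorithm would then process the increments $\mathrm{d}Z$ to estimate the hidden states $(S_t, I_t)$ and the unknown parameter $\beta$.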
In the post-COVID reality, it is not inconceivable that surveillance and monitoring of infectious diseases such as the seasonal flu will be as pervasive and commonplace as weather tracking is today.

TABLE S1: Simulation parameters for the application of the feedback particle filter to the epidemiological example.
parameter                      notation      value
time step-size                 $\Delta t$    1
observation noise std. dev.    $\sigma_W$    0.
process noise std. dev.        $\sigma_B$    0.
number of particles            $N$           100
recovery rate                  $\alpha$      0.
transmission rate              $\beta$       0.

Figure S2: The daily increment $\Delta I$ of the infected population (top); the size of the infected population $I(t)$ and the size of the susceptible population $S(t)$; and the infection transmission rate $\beta$.