Inverse Problems and Data Assimilation
Daniel Sanz-Alonso (Department of Statistics, University of Chicago), Andrew Stuart (Department of Computational and Mathematical Sciences, Caltech) and Armeen Taeb (Department of Electrical Engineering, Caltech)

Introduction
Overview of the Notes
These notes were first developed out of the Caltech course ACM 159 in Fall 2017, and then substantially modified for the University of Chicago course STAT 31550 in Winter 2019. The aim of the notes is to provide a clear and concise introduction to the subjects of Inverse Problems and Data Assimilation, and their inter-relations, together with citations to some relevant literature in this area.

In its most basic form, inverse problem theory is the study of how to estimate model parameters from data. Often the data provides indirect information about these parameters, corrupted by noise. The theory of inverse problems, however, is much richer than just parameter estimation. For example, the underlying theory can be used to determine the effects of noisy data on the accuracy of the solution; it can be used to determine what kind of observations are needed to accurately determine a parameter; and it can be used to study the uncertainty in a parameter estimate and, relatedly, is useful, for example, in the design of strategies for control or optimization under uncertainty, and for risk analysis. The theory thus has applications in many fields of science and engineering.

To apply the ideas in these notes, the starting point is a mathematical model mapping the unknown parameters to the observations: termed the "forward" or "direct" problem, and often a subject of research in its own right. A good forward model will not only identify how the data depends on the parameters, but also what sources of noise or model uncertainty are present in the postulated relationship between unknown parameters and data. For example, if the desired forward problem cannot be solved analytically, then the forward model may be approximated by a simulation; in this case discretization may be considered as a source of error. Once a relationship between model parameters, sources of error, and data is clearly defined, the inverse problem of estimating parameters from data can be addressed.
The theory of inverse problems can be separated into two cases: (1) the ideal case where data is not corrupted by noise and is derived from a known perfect model; and (2) the practical case where data is incomplete and imprecise. The first case is useful for classifying inverse problems and determining if a given set of observations can, in principle, provide exact solutions; this provides insight into conditions needed for existence, uniqueness, and stability of a solution. The second case is useful for the formulation of practical algorithms to learn about parameters, and uncertainties in their estimates, and will be the focus of these notes.

A model with the properties that (a) a solution exists, (b) the solution is unique, and (c) the solution changes continuously with the input (stability) is termed "well-posed". Conversely, a model lacking any of these properties is termed "ill-posed". Ill-posedness is present in many inverse problems, and mitigating it is an extensive part of the subject. Out of the different approaches for formulating an inverse problem, the Bayesian framework naturally offers the ability to assess quality in parameter estimation, and also leads to a form of well-posedness at the level of probability distributions describing the solution. The goal of the Bayesian framework is to find a probability measure that assigns a probability to each possible solution for a parameter $u$, given the data $y$. Bayes formula states that
$$\mathbb{P}(u|y) = \frac{1}{\mathbb{P}(y)}\mathbb{P}(y|u)\mathbb{P}(u).$$
It enables calculation of the posterior probability on $u|y$, $\mathbb{P}(u|y)$, in terms of the product of the data likelihood $\mathbb{P}(y|u)$ and the prior information on the parameter encoded in $\mathbb{P}(u)$. The likelihood describes the probability of the observed data $y$, if the input parameter were set to be $u$; it is determined by the forward model and the structure of the noise.
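As a toy numerical illustration of Bayes formula (not taken from the notes: the grid, the Gaussian prior, the data value, and the noise level are all assumed), the update can be carried out on a grid of parameter values by multiplying prior and likelihood pointwise and normalizing:

```python
import numpy as np

# Discrete sketch of Bayes formula: posterior ∝ likelihood × prior.
# Grid, prior, data y, and noise level gamma are illustrative assumptions.
u_grid = np.linspace(-3.0, 3.0, 601)
du = u_grid[1] - u_grid[0]

prior = np.exp(-0.5 * u_grid**2)                 # unnormalized N(0, 1) prior
prior /= prior.sum() * du                        # normalize to a pdf on the grid

y, gamma = 0.7, 0.5                              # data and noise std (assumed)
likelihood = np.exp(-0.5 * ((y - u_grid) / gamma) ** 2)   # P(y|u) up to a constant

posterior = likelihood * prior                   # numerator of Bayes formula
Z = posterior.sum() * du                         # normalizing constant P(y)
posterior /= Z

posterior_mean = (u_grid * posterior).sum() * du
```

For this conjugate Gaussian pair the posterior mean has the closed form $y/(1+\gamma^2) = 0.56$, which the quadrature reproduces.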
The normalization constant $\mathbb{P}(y)$ ensures that $\mathbb{P}(u|y)$ is a probability measure. There are five primary benefits to this framework: (a) it provides a clear theoretical setting in which the forward model choice, noise model and a priori information are explicit; (b) it provides information about the entire solution space for possible input parameter choices; (c) it naturally leads to quantification of uncertainty and risk in parameter estimates; (d) it is generalizable to a wide class of inverse problems, in finite and infinite dimension; and (e) it comes with a well-posedness theory useful in these contexts.

The first half of the notes is dedicated to studying the Bayesian framework for inverse problems. Techniques such as importance sampling and Markov Chain Monte Carlo (MCMC) methods are introduced; these methods have the desirable property that in the limit of an infinite number of samples they reproduce the full posterior distribution. Since it is often computationally intensive to implement these methods, especially in high dimensional problems, approximate techniques, such as approximating the posterior by a Dirac or a Gaussian distribution, are discussed.

The second half of the notes covers data assimilation. This refers to a particular class of inverse problems in which the unknown parameter is the initial condition of a dynamical system (and, in the stochastic dynamics case, the subsequent states of the system), and the data comprises partial and noisy observations of that (possibly stochastic) dynamical system. A primary use of data assimilation is in forecasting, where the purpose is to provide better future estimates than can be obtained using either the data or the model alone. All the methods from the first half of the course may be applied directly, but there are other new methods which exploit the Markovian structure to update the state of the system sequentially, rather than to learn about the initial condition.
(But of course knowledge of the initial condition may be used to inform the state of the system at later times.) We will also demonstrate that methods developed in data assimilation may be employed to study generic inverse problems, by introducing an artificial time to generate a sequence of probability measures interpolating from the prior to the posterior.

Notation
Throughout the notes we use $\mathbb{N}$ to denote the positive integers $\{1, 2, 3, \cdots\}$ and $\mathbb{Z}^+$ to denote the non-negative integers $\mathbb{N}\cup\{0\} = \{0, 1, 2, 3, \cdots\}$. The matrix $I_d$ denotes the identity on $\mathbb{R}^d$. We use $|\cdot|$ to denote the Euclidean norm corresponding to the inner-product $\langle\cdot,\cdot\rangle$. A square matrix $A$ is positive definite (resp. positive semi-definite) if the quadratic form $\langle u, Au\rangle$ is positive (resp. non-negative) for all $u \neq 0$. By $|\cdot|_A$ we denote the weighted norm defined by $|v|_A^2 = v^T A^{-1} v$. The corresponding weighted Euclidean inner-product is given by $\langle\cdot,\cdot\rangle_A = \langle\cdot, A^{-1}\cdot\rangle$. We use $\otimes$ to denote the outer product between two vectors: $(a\otimes b)c = \langle b, c\rangle a$. We let $B(u,\delta)$ denote the open ball of radius $\delta$ at $u$, in the Euclidean norm.

Throughout we denote by $\mathbb{P}(\cdot)$, $\mathbb{P}(\cdot|\cdot)$ the pdf of a random variable and its conditional pdf, respectively. We write
$$\rho(f) = \mathbb{E}^\rho[f] = \int_{\mathbb{R}^d} f(u)\rho(u)\,du$$
to denote the expectation of $f:\mathbb{R}^d\mapsto\mathbb{R}$ with respect to the probability measure with probability density function (pdf) $\rho$ on $\mathbb{R}^d$. The distribution of the random variables in this book will often have density with respect to Lebesgue measure, but occasional use of Dirac masses will be required; we will use the notational convention that the Dirac mass at a point $v$ has "density" $\delta(\cdot - v)$ or $\delta_v(\cdot)$. When a random variable $u$ has pdf $\rho$ we will write $u\sim\rho$. We use $\Rightarrow$ to denote weak convergence of probability measures; that is, $\rho_n\Rightarrow\rho$ if $\rho_n(f)\to\rho(f)$ for all bounded and continuous $f:\mathbb{R}^d\mapsto\mathbb{R}$.

Acknowledgements
These notes were first created out of Caltech course ACM 159 in Fall 2017 and were further developed for the University of Chicago course STAT 31550 in Winter 2019. The notes were created in LaTeX by the students in ACM 159, based on lectures presented by the instructor Andrew Stuart, and on input from the course TA Armeen Taeb. The individuals responsible for the notes, listed in alphabetic order, are: Blancquart, Paul; Cai, Karena; Chen, Jiajie; Cheng, Richard; Cheng, Rui; Feldstein, Jonathan; Huang, De; Idíni, Benjamin; Kovachki, Nikola; Lee, Marcus; Levy, Gabriel; Li, Liuchi; Muir, Jack; Ren, Cindy; Seylabi, Elnaz; Schäfer, Florian; Singhal, Vipul; Stephenson, Oliver; Song, Yichuan; Su, Yu; Teke, Oguzhan; Williams, Ethan; Wray, Parker; Zhan, Eric; Zhang, Shumao; Xiao, Fangzhou. Furthermore, the following students have added content beyond the class materials: Parker Wray – the Overview; Jiajie Chen – an alternative proof of Theorem 1.10 and the proof idea for Theorem 14.3; Fangzhou Xiao – numerical simulation of prior, likelihood & posterior; Elnaz Seylabi & Fangzhou Xiao – catching typographical errors in a draft of these notes; Cindy Ren – numerical simulations to enhance understanding of importance sampling in Examples 6.2 and 6.5; Cindy Ren & De Huang – improving the constants in Theorem 6.3 regarding the approximation error of importance sampling; Richard Cheng & Florian Schäfer – illustrations to enhance understanding of the coupling argument used to study convergence of MCMC algorithms by presenting the finite state-space case; and Ethan Williams & Jack Muir – numerical simulations and illustrations of the Ensemble Kalman Filter and Extended Kalman Filter. We would also like to thank Tapio Helin (Helsinki) who used the notes in his own course and provided very helpful feedback on an early draft.
The work of AS has been funded by the EPSRC (UK), the ERC (Europe) and by AFOSR, ARL, NIH, NSF and ONR (USA). The work of DSA is funded by NSF. This funded research has helped to shape the presentation of the material here and is gratefully acknowledged.
Warning
These are rough notes, far from being perfected. They are likely to contain mathematical errors, incomplete bibliographical information, inconsistencies in notation and typographical errors. We hope that the notes are nonetheless useful. Please contact the authors with any feedback, from typos, through mathematical errors and bibliographical omissions, to comments on the structural organization of the material.
Contents

Bayesian Inverse Problems and Well-Posedness
The Linear-Gaussian Setting
Optimization Perspective
The Gaussian Approximation
  4.2 Best Gaussian Fit By Minimizing $d_{\rm KL}(p\|\pi)$
  4.3 Best Gaussian Fit By Minimizing $d_{\rm KL}(\pi\|p)$
  4.4 Comparison between $d_{\rm KL}(\pi\|p)$ and $d_{\rm KL}(p\|\pi)$
  4.5 Variational Formulation of Bayes Theorem
  4.6 Discussion and Bibliography
Monte Carlo and Importance Sampling
Markov Chain Monte Carlo
  6.3.1 Detailed Balance and its Implication
  6.3.2 Detailed Balance and the Metropolis-Hastings Algorithm
Filtering and Smoothing Problems and Well-Posedness
The Linear-Gaussian Setting
Optimization for Filtering and Smoothing: 3DVAR and 4DVAR
The Extended and Ensemble Kalman Filters
Particle Filter
Optimal Particle Filter
Filtering Approach to the Inverse Problem
Bayesian Inverse Problems and Well-Posedness

In this chapter we introduce the Bayesian approach to inverse problems. We will show that the Bayesian formulation leads to a form of well-posedness: small perturbations of the forward model or the observed data translate into small perturbations of the Bayesian solution. Well-posedness will be established in the Hellinger distance. The total variation and Hellinger distances between probability densities play an important role in the analysis of Bayesian methodology. We introduce both distances in this chapter, as well as some characterizations and bounds between them that will be used throughout this book.
We consider the following setting. We let $G:\mathbb{R}^d\to\mathbb{R}^k$ be a forward model, and aim to recover an unknown parameter $u\in\mathbb{R}^d$ from data $y\in\mathbb{R}^k$ given by
$$y = G(u) + \eta, \quad (1.1)$$
where $\eta\in\mathbb{R}^k$ represents observation noise. We view $(u, y)\in\mathbb{R}^d\times\mathbb{R}^k$ as a random variable, whose distribution is specified by means of the following assumption.

Assumption 1.1.
We make the following probabilistic assumptions:
∙ $u\sim\rho(u)$, $u\in\mathbb{R}^d$;
∙ $\eta\sim\nu(\eta)$, $\eta\in\mathbb{R}^k$;
∙ $u$ and $\eta$ are independent, written $u\perp\eta$.

Here $\rho$ and $\nu$ denote the (Lebesgue) probability density functions (pdfs) of the random variables $u$ and $\eta$, respectively. Then $\rho(u)$ is called the prior pdf, and $y|u\sim\nu\bigl(y - G(u)\bigr)$, for each fixed $u\in\mathbb{R}^d$, determines the likelihood function. In this probabilistic perspective, the solution to the inverse problem is the conditional distribution of $u$ given $y$, which is called the posterior distribution and will be denoted by $u|y\sim\pi^y(u)$. From the posterior pdf one can infer parameter values that are consistent with both the data and the prior pdf. The posterior also contains information on the uncertainty remaining in the parameter recovery: for instance, a large posterior covariance may indicate that the data contains insufficient information to recover the input parameter.
Bayes theorem is a bridge connecting the prior, the likelihood and the posterior.
Theorem 1.2 (Bayes theorem). Let Assumption 1.1 hold and assume that
$$Z = Z(y) := \int_{\mathbb{R}^d}\nu\bigl(y - G(u)\bigr)\rho(u)\,du > 0.$$
Then $u|y\sim\pi^y(u)$, where
$$\pi^y(u) = \frac{1}{Z}\nu\bigl(y - G(u)\bigr)\rho(u). \quad (1.2)$$

Proof.
Denote by $\mathbb{P}(\cdot)$ the pdf of a random variable and by $\mathbb{P}(\cdot|\cdot)$ its conditional pdf. We have
$$\mathbb{P}(u, y) = \mathbb{P}(u|y)\mathbb{P}(y), \quad \text{if } \mathbb{P}(y) > 0,$$
$$\mathbb{P}(u, y) = \mathbb{P}(y|u)\mathbb{P}(u), \quad \text{if } \mathbb{P}(u) > 0.$$
Note that the marginal pdf on $y$ is given by
$$\mathbb{P}(y) = \int_{\mathbb{R}^d}\mathbb{P}(u, y)\,du,$$
and similarly for $\mathbb{P}(u)$. Assume $\mathbb{P}(y) > 0$. Then
$$\mathbb{P}(u|y) = \frac{1}{\mathbb{P}(y)}\mathbb{P}(y|u)\mathbb{P}(u) = \frac{1}{Z}\nu\bigl(y - G(u)\bigr)\rho(u) \quad (1.3)$$
for both $\mathbb{P}(u) > 0$ and $\mathbb{P}(u) = 0$. Here we remark that
$$\mathbb{P}(y) = Z = \int_{\mathbb{R}^d} Z\,\mathbb{P}(u|y)\,du = \int_{\mathbb{R}^d}\nu\bigl(y - G(u)\bigr)\rho(u)\,du > 0.$$
Thus the assumption that $\mathbb{P}(y) > 0$ is equivalent to the assumption that $Z > 0$.

In what follows we will often use the shorthand notation $g(u) := \nu\bigl(y - G(u)\bigr)$ for the likelihood; we then write
$$\pi(u) = \frac{1}{Z}g(u)\rho(u),$$
omitting the data $y$ in both the likelihood function and the posterior pdf.

Remark 1.3.
The proof of Theorem 1.2 shows that in order to apply Bayes formula (1.2) one needs to guarantee that the normalizing constant $\mathbb{P}(y) = Z$ is positive. In other words, the marginal density of the observed data $y$ needs to be positive, i.e. the observed data needs to be consistent with the probabilistic assumptions (1.1). From now on it will be assumed that $\mathbb{P}(y) = Z > 0$.

The posterior pdf $\pi^y(u)$ contains all the knowledge on the parameter $u$ available in the prior and the data. In applications it is often useful, however, to summarize the posterior distribution through a few numerical values or through parameter regions of prescribed posterior probability (known as credible intervals). Summarizing the posterior is particularly important if the parameter is high dimensional, since then visualizing the posterior or detecting regions of high posterior probability is nontrivial. Two natural numerical summaries are the posterior mean and the posterior mode.

Definition 1.4.
The posterior mean estimator of $u$ given data $y$ is the mean of the posterior distribution:
$$u_{\rm PM} = \int_{\mathbb{R}^d} u\,\pi^y(u)\,du.$$
The maximum a posteriori (MAP) estimator of $u$ given data $y$ is the mode of the posterior distribution $\pi^y(u)$, defined as
$$u_{\rm MAP} = \arg\max_{u\in\mathbb{R}^d}\pi^y(u).$$

The posterior mean and the MAP estimators already suggest the importance of computing maxima and integrals in the practical implementation of Bayesian formulations of inverse problems and data assimilation. For this reason, optimization and sampling will play an important role in this book. An alternative way to make Bayesian formulations tractable is to approximate the posterior by a simple distribution, often a Gaussian or a combination of Dirac masses. An optimization perspective for inverse problems and data assimilation will be studied in Chapters 3 and 7, respectively, and the use of Gaussian approximations will be discussed in Chapters 4 and 10; Dirac approximations constructed via sampling will be studied in Chapters 5 and 6 (inverse problems) and in Chapters 11 and 12 (data assimilation).

We next consider two simple examples of a direct application of Bayes formula.
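Before the examples, here is a small numerical sketch of both summaries for a hypothetical one-dimensional posterior (the uniform prior on $(-1,1)$, the data $y$, and the noise level $\gamma$ are assumed for illustration):

```python
import numpy as np

# Posterior mean and MAP for an assumed 1D posterior: uniform prior on (-1, 1),
# y = u + eta with eta ~ N(0, gamma^2); computed by quadrature on a grid.
y, gamma = 0.5, 0.3
u = np.linspace(-1.0, 1.0, 20001)
du = u[1] - u[0]

g = np.exp(-0.5 * ((y - u) / gamma) ** 2)   # likelihood, up to a constant
post = g / (g.sum() * du)                   # posterior pdf (prior constant cancels)

u_map = u[np.argmax(post)]                  # posterior mode
u_pm = (u * post).sum() * du                # posterior mean
```

Here the mode equals the data $y$, since $y$ lies inside the prior support, while the mean is pulled toward the interior by the truncation of the support at $u = 1$.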
Example 1.5.
Let $d = k = 1$, $\eta\sim\nu = \mathcal{N}(0,\gamma^2)$ and
$$\rho(u) = \begin{cases}\frac{1}{2}, & u\in(-1, 1),\\ 0, & u\in(-1, 1)^c.\end{cases}$$
Suppose that the observation is generated by $y = u + \eta$. Using Bayes Theorem 1.2, we derive the posterior
$$\pi^y(u) = \begin{cases}\frac{1}{2Z}\exp\bigl(-\frac{1}{2\gamma^2}|y - u|^2\bigr), & u\in(-1, 1),\\ 0, & u\in(-1, 1)^c,\end{cases}$$
where $Z$ is a normalizing constant ensuring $\int_{\mathbb{R}}\pi^y(u)\,du = 1$. The support of $\pi^y$, i.e. $(-1, 1)$, agrees with the support of the prior $\rho$. Now we find the MAP estimator. From the explicit formula for $\pi^y$, we have
$$u_{\rm MAP} = \arg\max_{u\in\mathbb{R}}\pi^y(u) = \begin{cases} y & \text{if } y\in(-1, 1),\\ -1 & \text{if } y\le -1,\\ 1 & \text{if } y\ge 1.\end{cases}$$
In this example, the prior on $u$ is supported on $(-1, 1)$ and the posterior on $u|y$ is supported on $(-1, 1)$. If $y\in(-1, 1)$ then the MAP estimator is the data itself; otherwise it is the extremal point of the prior support which matches the sign of the data. The posterior mean is
$$u_{\rm PM} = \frac{1}{2Z}\int_{-1}^{1} u\exp\Bigl(-\frac{1}{2\gamma^2}|y - u|^2\Bigr)du,$$
which may be approximated, for instance, by using the sampling methods described in Chapters 5 and 6.

The following example illustrates once again the application of Bayes formula, and shows that the posterior may be concentrated near a low-dimensional manifold of the input parameter space $\mathbb{R}^d$. In such a case it is important to understand the geometry of the support of the posterior density, which cannot be captured by point estimation or Gaussian approximations.

Example 1.6.
Let $d = 2$, $k = 1$, $\rho\in C(\mathbb{R}^2,\mathbb{R}^+)$ with $0 < \rho(u)\le\rho_{\max} < +\infty$ for all $u\in\mathbb{R}^2$, and
$$y = G(u) + \eta = u_1^2 + u_2^2 + \eta, \qquad \eta\sim\nu = \mathcal{N}(0,\gamma^2), \quad 0 < \gamma\ll 1.$$
Assume that $y > 0$. Using Bayes theorem, we obtain the posterior distribution
$$\pi^y(u) = \frac{1}{Z}\exp\Bigl(-\frac{1}{2\gamma^2}\bigl|u_1^2 + u_2^2 - y\bigr|^2\Bigr)\rho(u).$$
We now show that the posterior concentrates on a manifold: the circumference $\{u\in\mathbb{R}^2 : u_1^2 + u_2^2 = y\}$. Denote
$$A_\pm := \bigl\{u\in\mathbb{R}^2 : |u_1^2 + u_2^2 - y|\le\gamma^{1\pm\delta}\bigr\},$$
for some fixed $\delta\in(0, 1)$, and let $\rho_{\min} = \inf_{u\in B}\rho(u)$, where $B$ is the closed ball of radius $2\sqrt{y}$ centered at the origin. Since $\rho(u)$ is positive and continuous and $B$ is compact, $\rho_{\min} > 0$. Let $u^+\in A_+\subset B$ (the inclusion holds for $\gamma$ small enough) and $u^-\in(A_-)^c$. Taking the small noise limit yields
$$\frac{\pi^y(u^+)}{\pi^y(u^-)}\ge\exp\Bigl(-\frac{1}{2}\gamma^{2\delta} + \frac{1}{2}\gamma^{-2\delta}\Bigr)\frac{\rho_{\min}}{\rho_{\max}}\to\infty, \quad \text{as } \gamma\to 0^+.$$
Therefore, conditional on $y > 0$, the posterior $\pi^y$ concentrates, as $\gamma\to 0$, on the circumference with radius $\sqrt{y}$.

Figure 1:
The posterior measure concentrates on a circumference with radius $\sqrt{y}$. Here, the blue shaded area is $A_+$ and the green shaded area is $(A_-)^c$.

In this section we show that the Bayesian formulation of inverse problems leads to a form of well-posedness. More precisely, we study the sensitivity of the posterior pdf to perturbations of the forward model $G$. In many inverse problems the ideal forward model $G$ is not accessible, but can be approximated by some computable $G^\delta$; consequently $\pi^y$ is replaced by $\pi^y_\delta$. An example that is often found in applications, to which the theory contained herein may be generalized, is when $G$ is an operator acting on an infinite-dimensional space which is approximated, for the purposes of computation, by some finite-dimensional operator $G^\delta$. We seek to prove that, under certain assumptions, a small difference between $G$ and $G^\delta$ (forward error) leads to a similarly small difference between $\pi^y$ and $\pi^y_\delta$ (inverse error):

Meta Theorem (Well-posedness): $|G - G^\delta| = O(\delta) \implies d(\pi^y, \pi^y_\delta) = O(\delta)$, for small enough $\delta > 0$ and some metric $d(\cdot,\cdot)$ on probability densities.

This result will be formalized in Theorem 1.14 below, which shows that the $O(\delta)$-convergence of $\pi^y_\delta$ with respect to some distance $d(\cdot,\cdot)$ can be guaranteed under certain assumptions on the likelihood. We will conclude the chapter by showing an example where these assumptions hold true. In order to discuss these issues we will need to introduce metrics on probability densities.

Here we introduce the total variation and the Hellinger distances, both of which have been used to show well-posedness results. In this chapter we will use the Hellinger distance to establish well-posedness of Bayesian inverse problems, and in Chapter 7 we will employ the total variation distance to establish well-posedness of Bayesian formulations of filtering and smoothing in data assimilation.
Definition 1.7.
The total variation distance between two probability densities $\pi, \pi'$ is defined by
$$d_{\rm TV}(\pi,\pi') := \frac{1}{2}\int|\pi(u) - \pi'(u)|\,du = \frac{1}{2}\|\pi - \pi'\|_{L^1}.$$
The Hellinger distance between two probability densities $\pi, \pi'$ is defined by
$$d_{\rm H}(\pi,\pi') := \Bigl(\frac{1}{2}\int\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|^2 du\Bigr)^{1/2} = \frac{1}{\sqrt{2}}\bigl\|\sqrt{\pi} - \sqrt{\pi'}\bigr\|_{L^2}.$$

In the rest of this subsection we will establish bounds between the Hellinger and total variation distances, and show how both distances can be used to bound the difference of expected values computed with two different densities; these results will be used in subsequent chapters. Before doing so, the next lemma motivates our choice of the normalization constants $1/2$ and $1/\sqrt{2}$.

Lemma 1.8.
For any probability densities $\pi, \pi'$,
$$0\le d_{\rm TV}(\pi,\pi')\le 1, \qquad 0\le d_{\rm H}(\pi,\pi')\le 1.$$

Proof.
The lower bounds follow immediately from the definitions. We only need to prove the upper bounds:
$$d_{\rm TV}(\pi,\pi') = \frac{1}{2}\int|\pi(u) - \pi'(u)|\,du\le\frac{1}{2}\int\pi(u)\,du + \frac{1}{2}\int\pi'(u)\,du = 1,$$
$$d_{\rm H}(\pi,\pi') = \Bigl(\frac{1}{2}\int\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|^2 du\Bigr)^{1/2} = \Bigl(\frac{1}{2}\int\Bigl(\pi(u) + \pi'(u) - 2\sqrt{\pi(u)\pi'(u)}\Bigr)du\Bigr)^{1/2}\le\Bigl(\frac{1}{2}\int\bigl(\pi(u) + \pi'(u)\bigr)du\Bigr)^{1/2} = 1.$$

The proofs also show that $\pi$ and $\pi'$ have total variation and Hellinger distance equal to one if and only if they have disjoint supports, that is, if $\int\sqrt{\pi(u)\pi'(u)}\,du = 0$. The following result gives bounds between the total variation and Hellinger distances.
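These definitions are easy to sanity-check numerically; the following sketch (two assumed Gaussian densities, discretized by quadrature on a grid) confirms that both distances lie in $[0,1]$ and approach one when the supports are nearly disjoint. The last two assertions in the accompanying check also preview the bounds of the next lemma.

```python
import numpy as np

# TV and Hellinger distances between assumed 1D Gaussian densities, by quadrature.
x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]

def gaussian_pdf(m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def d_tv(p, q):
    return 0.5 * np.sum(np.abs(p - q)) * dx

def d_h(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)

p, q = gaussian_pdf(0.0, 1.0), gaussian_pdf(1.5, 1.0)
far = gaussian_pdf(30.0, 1.0)        # nearly disjoint support from p

tv, h = d_tv(p, q), d_h(p, q)
tv_far, h_far = d_tv(p, far), d_h(p, far)
```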
Lemma 1.9.
For any probability densities $\pi, \pi'$,
$$\frac{1}{\sqrt{2}}\,d_{\rm TV}(\pi,\pi')\le d_{\rm H}(\pi,\pi')\le\sqrt{d_{\rm TV}(\pi,\pi')}.$$

Proof.
Using the Cauchy–Schwarz inequality,
$$d_{\rm TV}(\pi,\pi') = \frac{1}{2}\int\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|\bigl|\sqrt{\pi(u)} + \sqrt{\pi'(u)}\bigr|\,du\le\Bigl(\frac{1}{2}\int\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|^2 du\Bigr)^{1/2}\Bigl(\frac{1}{2}\int\bigl|\sqrt{\pi(u)} + \sqrt{\pi'(u)}\bigr|^2 du\Bigr)^{1/2}\le d_{\rm H}(\pi,\pi')\Bigl(\frac{1}{2}\int\bigl(2\pi(u) + 2\pi'(u)\bigr)du\Bigr)^{1/2} = \sqrt{2}\,d_{\rm H}(\pi,\pi').$$
Notice that $\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|\le\bigl|\sqrt{\pi(u)} + \sqrt{\pi'(u)}\bigr|$ since $\sqrt{\pi(u)}, \sqrt{\pi'(u)}\ge 0$. Thus we have
$$d_{\rm H}(\pi,\pi') = \Bigl(\frac{1}{2}\int\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|^2 du\Bigr)^{1/2}\le\Bigl(\frac{1}{2}\int\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|\bigl|\sqrt{\pi(u)} + \sqrt{\pi'(u)}\bigr|\,du\Bigr)^{1/2} = \Bigl(\frac{1}{2}\int\bigl|\pi(u) - \pi'(u)\bigr|\,du\Bigr)^{1/2} = \sqrt{d_{\rm TV}(\pi,\pi')}.$$

The following two lemmas show that if two densities are close in total variation or in Hellinger distance, expectations computed with respect to both densities are also close. In addition, the first of them provides a useful characterization of the total variation distance which will be used repeatedly throughout this book.

Lemma 1.10.
Let $f$ be a function such that $\sup_{u\in\mathbb{R}^d}|f(u)| =: |f|_\infty < \infty$. Then
$$\bigl|\mathbb{E}^\pi[f] - \mathbb{E}^{\pi'}[f]\bigr|\le 2|f|_\infty\, d_{\rm TV}(\pi,\pi').$$
Moreover, the following variational characterization of the total variation distance holds:
$$d_{\rm TV}(\pi,\pi') = \frac{1}{2}\sup_{|f|_\infty\le 1}\bigl|\mathbb{E}^\pi[f] - \mathbb{E}^{\pi'}[f]\bigr|. \quad (1.4)$$

Proof.
For the first part of the lemma, note that
$$\bigl|\mathbb{E}^\pi[f] - \mathbb{E}^{\pi'}[f]\bigr| = \Bigl|\int_{\mathbb{R}^d} f(u)\bigl(\pi(u) - \pi'(u)\bigr)du\Bigr|\le|f|_\infty\int_{\mathbb{R}^d}|\pi(u) - \pi'(u)|\,du = 2|f|_\infty\, d_{\rm TV}(\pi,\pi').$$
This in particular shows that, for any $f$ with $|f|_\infty\le 1$,
$$d_{\rm TV}(\pi,\pi')\ge\frac{1}{2}\bigl|\mathbb{E}^\pi[f] - \mathbb{E}^{\pi'}[f]\bigr|.$$
Our goal now is to exhibit a choice of $f$ with $|f|_\infty = 1$ that achieves equality. Define
$$f(u) := {\rm sign}\bigl(\pi(u) - \pi'(u)\bigr),$$
so that $f(u)\bigl(\pi(u) - \pi'(u)\bigr) = |\pi(u) - \pi'(u)|$. Then $|f|_\infty = 1$ and
$$d_{\rm TV}(\pi,\pi') = \frac{1}{2}\int_{\mathbb{R}^d}|\pi(u) - \pi'(u)|\,du = \frac{1}{2}\int_{\mathbb{R}^d} f(u)\bigl(\pi(u) - \pi'(u)\bigr)du = \frac{1}{2}\bigl|\mathbb{E}^\pi[f] - \mathbb{E}^{\pi'}[f]\bigr|.$$
This completes the proof of the variational characterization.
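The variational characterization (1.4) can also be checked numerically: for two assumed densities, the sign test function from the proof attains the supremum.

```python
import numpy as np

# Check that f = sign(pi - pi') attains the supremum in the variational
# characterization of d_TV; the two densities are illustrative assumptions.
x = np.linspace(-8.0, 8.0, 100001)
dx = x[1] - x[0]

pi1 = np.exp(-0.5 * x**2)
pi1 /= pi1.sum() * dx
pi2 = np.exp(-0.5 * (x - 1.5) ** 2)
pi2 /= pi2.sum() * dx

d_tv = 0.5 * np.sum(np.abs(pi1 - pi2)) * dx
f = np.sign(pi1 - pi2)                       # |f|_inf = 1
gap = np.abs(np.sum(f * (pi1 - pi2)) * dx)   # |E^pi1[f] - E^pi2[f]|
```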
Lemma 1.11.
Let $f$ be a function such that $\mathbb{E}^\pi[|f|^2] + \mathbb{E}^{\pi'}[|f|^2] =: f_2^2 < \infty$. Then
$$\bigl|\mathbb{E}^\pi[f] - \mathbb{E}^{\pi'}[f]\bigr|\le 2 f_2\, d_{\rm H}(\pi,\pi').$$

Proof.
$$\bigl|\mathbb{E}^\pi[f] - \mathbb{E}^{\pi'}[f]\bigr| = \Bigl|\int_{\mathbb{R}^d} f(u)\Bigl(\sqrt{\pi(u)} - \sqrt{\pi'(u)}\Bigr)\Bigl(\sqrt{\pi(u)} + \sqrt{\pi'(u)}\Bigr)du\Bigr|\le\Bigl(\int\bigl|\sqrt{\pi(u)} - \sqrt{\pi'(u)}\bigr|^2 du\Bigr)^{1/2}\Bigl(\int|f(u)|^2\bigl|\sqrt{\pi(u)} + \sqrt{\pi'(u)}\bigr|^2 du\Bigr)^{1/2}\le\sqrt{2}\,d_{\rm H}(\pi,\pi')\Bigl(2\int|f(u)|^2\bigl(\pi(u) + \pi'(u)\bigr)du\Bigr)^{1/2} = 2 f_2\, d_{\rm H}(\pi,\pi').$$

Note that the result for the Hellinger distance only assumes that $f$ is square integrable with respect to $\pi$ and $\pi'$. In contrast, the result for the total variation distance assumes that $f$ is bounded, which is a stronger condition.

We denote by $g(u) = \nu\bigl(y - G(u)\bigr)$ and $g^\delta(u) = \nu\bigl(y - G^\delta(u)\bigr)$ the likelihoods associated with $G(u)$ and $G^\delta(u)$, so that
$$\pi^y(u) = \frac{1}{Z}g(u)\rho(u) \quad\text{and}\quad \pi^y_\delta(u) = \frac{1}{Z^\delta}g^\delta(u)\rho(u),$$
where $Z, Z^\delta > 0$.

Assumption 1.12.
There exist $\delta^+ > 0$ and $K_1, K_2 < \infty$ such that, for all $\delta\in(0,\delta^+)$:
(i) $\bigl|\sqrt{g(u)} - \sqrt{g^\delta(u)}\bigr|\le\varphi(u)\delta$, for some $\varphi(u)$ such that $\mathbb{E}^\rho[\varphi(u)^2]\le K_1$;
(ii) $\sup_{u\in\mathbb{R}^d}\bigl(\bigl|\sqrt{g(u)}\bigr| + \bigl|\sqrt{g^\delta(u)}\bigr|\bigr)\le K_2$.

Remark 1.13.
Assumption 1.12 only involves conditions on the likelihood $g$ and the approximate likelihood $g^\delta$. While our presentation in this chapter emphasizes that this approximation may arise due to the need to approximate the forward model $G$, another important scenario that the theory covers is approximation due to perturbations of the data $y$. Well-posedness results that guarantee the stability of Bayesian data assimilation under data perturbations will be established in Chapter 7.

Now we state the main result of this section:
Theorem 1.14 (Well-posedness of Posterior). Under Assumption 1.12 we have
$$d_{\rm H}(\pi^y, \pi^y_\delta)\le c\delta, \quad \delta\in(0,\tilde\delta^+),$$
for some $\tilde\delta^+ > 0$ and some $c\in(0, +\infty)$ independent of $\delta$.

Notice that this theorem, together with Lemma 1.11, guarantees that expectations computed with respect to $\pi^y$ and $\pi^y_\delta$ are order $\delta$ apart. To prove Theorem 1.14, we first show a lemma which characterizes the normalization constant $Z^\delta$ in the small $\delta$ limit.

Lemma 1.15.
Under Assumption 1.12 there exist $\tilde\delta^+ > 0$ and $c_1, c_2\in(0, +\infty)$ such that
$$|Z - Z^\delta|\le c_1\delta \quad\text{and}\quad Z, Z^\delta\ge c_2, \quad\text{for } \delta\in(0,\tilde\delta^+).$$

Proof.
Since $Z = \int g(u)\rho(u)\,du$ and $Z^\delta = \int g^\delta(u)\rho(u)\,du$, we have
$$|Z - Z^\delta| = \Bigl|\int\bigl(g(u) - g^\delta(u)\bigr)\rho(u)\,du\Bigr|\le\Bigl(\int\bigl|\sqrt{g(u)} - \sqrt{g^\delta(u)}\bigr|^2\rho(u)\,du\Bigr)^{1/2}\Bigl(\int\bigl|\sqrt{g(u)} + \sqrt{g^\delta(u)}\bigr|^2\rho(u)\,du\Bigr)^{1/2}\le\Bigl(\int\delta^2\varphi(u)^2\rho(u)\,du\Bigr)^{1/2}\Bigl(\int K_2^2\,\rho(u)\,du\Bigr)^{1/2}\le\sqrt{K_1}\,K_2\,\delta, \quad \delta\in(0,\delta^+).$$
And when $\delta\le\tilde\delta^+ := \min\Bigl\{\frac{Z}{2\sqrt{K_1}K_2},\,\delta^+\Bigr\}$, we have
$$Z^\delta\ge Z - |Z - Z^\delta|\ge\frac{1}{2}Z.$$
The lemma follows by taking $c_1 = \sqrt{K_1}\,K_2$ and $c_2 = \frac{1}{2}Z$.

Proof of Theorem 1.14.
We break the distance into two error parts, one caused by the difference between $Z$ and $Z^\delta$, the other caused by the difference between $g$ and $g^\delta$:
$$d_{\rm H}(\pi^y,\pi^y_\delta) = \frac{1}{\sqrt{2}}\Bigl\|\sqrt{\pi^y} - \sqrt{\pi^y_\delta}\Bigr\|_{L^2} = \frac{1}{\sqrt{2}}\Bigl\|\sqrt{\tfrac{g\rho}{Z}} - \sqrt{\tfrac{g\rho}{Z^\delta}} + \sqrt{\tfrac{g\rho}{Z^\delta}} - \sqrt{\tfrac{g^\delta\rho}{Z^\delta}}\Bigr\|_{L^2}\le\frac{1}{\sqrt{2}}\Bigl\|\sqrt{\tfrac{g\rho}{Z}} - \sqrt{\tfrac{g\rho}{Z^\delta}}\Bigr\|_{L^2} + \frac{1}{\sqrt{2}}\Bigl\|\sqrt{\tfrac{g\rho}{Z^\delta}} - \sqrt{\tfrac{g^\delta\rho}{Z^\delta}}\Bigr\|_{L^2}.$$
Using Lemma 1.15, for $\delta\in(0,\tilde\delta^+)$ we have
$$\Bigl\|\sqrt{\tfrac{g\rho}{Z}} - \sqrt{\tfrac{g\rho}{Z^\delta}}\Bigr\|_{L^2} = \Bigl|\frac{1}{\sqrt{Z}} - \frac{1}{\sqrt{Z^\delta}}\Bigr|\Bigl(\int g(u)\rho(u)\,du\Bigr)^{1/2} = \frac{|Z - Z^\delta|}{(\sqrt{Z} + \sqrt{Z^\delta})\sqrt{Z^\delta}}\le\frac{c_1}{c_2}\delta,$$
and
$$\Bigl\|\sqrt{\tfrac{g\rho}{Z^\delta}} - \sqrt{\tfrac{g^\delta\rho}{Z^\delta}}\Bigr\|_{L^2} = \frac{1}{\sqrt{Z^\delta}}\Bigl(\int\bigl|\sqrt{g(u)} - \sqrt{g^\delta(u)}\bigr|^2\rho(u)\,du\Bigr)^{1/2}\le\sqrt{\frac{K_1}{c_2}}\,\delta.$$
Therefore
$$d_{\rm H}(\pi^y,\pi^y_\delta)\le\frac{1}{\sqrt{2}}\frac{c_1}{c_2}\delta + \frac{1}{\sqrt{2}}\sqrt{\frac{K_1}{c_2}}\,\delta = c\delta,$$
with $c = \frac{1}{\sqrt{2}}\Bigl(\frac{c_1}{c_2} + \sqrt{\frac{K_1}{c_2}}\Bigr)$ independent of $\delta$.

Many inverse problems arise from differential equations with unknown input parameters. Here we consider a simple but typical example where $G(u)$ comes from the solution of an ODE, which needs to be solved numerically. Let $x(t)$ be the solution of the initial value problem
$$\frac{dx}{dt} = F(x; u), \quad x(0) = 0, \quad (1.5)$$
where $F:\mathbb{R}^k\times\mathbb{R}^d\to\mathbb{R}^k$ is a function such that $F(x; u)$ and the partial Jacobian $D_x F(x; u)$ are uniformly bounded with respect to $(x, u)$, i.e.
$$|F(x; u)|,\ |D_x F(x; u)| < F_{\max}, \quad\text{for all } (x, u)\in\mathbb{R}^k\times\mathbb{R}^d,$$
for some constant $F_{\max}$; thus $F(x; u)$ is Lipschitz in $x$, in that
$$|F(x; u) - F(x'; u)|\le F_{\max}|x - x'|, \quad\text{for all } x, x'\in\mathbb{R}^k.$$
Now consider the inverse problem setting
$$y = G(u) + \eta, \quad\text{where } G(u) := x(1) = x(t)|_{t=1},$$
and $\eta\sim\mathcal{N}(0,\gamma^2 I_k)$. We assume that the exact mapping $G(u)$ is replaced by some numerical approximation $G^\delta(u)$. In particular, $G^\delta(u)$ is given by using the forward Euler method to solve the ODE (1.5). Define $X_0 = 0$ and
$$X_{\ell+1} = X_\ell + \delta F(X_\ell; u), \quad \ell\ge 0,$$
where $\delta = \frac{1}{L}$ for some large integer $L$.
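In code, iterating this scheme up to $\ell = L$ gives the Euler approximation of $x(1)$. The right-hand side $F(x; u) = \sin(x) + u$ below is an assumed illustration (bounded in $x$, with bounded derivative in $x$, for the fixed $u$ used); halving $\delta$ roughly halves the error, consistent with the $O(\delta)$ bound proved in Lemma 1.16 below.

```python
import numpy as np

# Forward Euler approximation of G(u) = x(1) for dx/dt = F(x; u), x(0) = 0.
# F(x; u) = sin(x) + u is an illustrative assumption, not from the notes.
def G_delta(u, L):
    X, delta = 0.0, 1.0 / L
    for _ in range(L):
        X += delta * (np.sin(X) + u)   # X_{l+1} = X_l + delta * F(X_l; u)
    return X

u = 0.7
G_ref = G_delta(u, 2**20)            # fine-grid surrogate for the exact G(u)
e1 = abs(G_delta(u, 100) - G_ref)    # error at delta = 1/100
e2 = abs(G_delta(u, 200) - G_ref)    # error at delta = 1/200
```

The ratio `e1 / e2` is close to 2, reflecting the first-order convergence of forward Euler.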
Finally, define $G^\delta(u) := X_L$. In what follows, we will prove that $G^\delta(u)$ is uniformly bounded and close to $G(u)$ when $\delta$ is small, and we will then use these results to show that Assumption 1.12 is satisfied. Therefore, we can apply Theorem 1.14 to this example and conclude that the approximate posterior $\pi^y_\delta$ is close to the unperturbed one $\pi^y$.

Define $t_\ell = \ell\delta$ and $x_\ell = x(t_\ell)$. The following lemma gives an estimate of the error generated by using the forward Euler method.

Lemma 1.16.
Let $E_\ell := x_\ell - X_\ell$. Then there is $c < \infty$, independent of $\delta$, such that
$$|E_\ell|\le c\delta, \quad 0\le\ell\le L.$$
In particular, $|G(u) - G^\delta(u)| = |E_L|\le c\delta$.

Proof.
For simplicity of exposition we consider the case $k = 1$; the case $k > 1$ is similar. By Taylor expansion, for $k = 1$ there is $\xi_\ell\in[t_\ell, t_{\ell+1}]$ such that
$$x_{\ell+1} = x_\ell + \delta\frac{dx}{dt}(t_\ell) + \frac{\delta^2}{2}\frac{d^2x}{dt^2}(\xi_\ell) = x_\ell + \delta F(x_\ell; u) + \frac{\delta^2}{2}D_x F\bigl(x(\xi_\ell); u\bigr)F\bigl(x(\xi_\ell); u\bigr).$$
Thus we have
$$|E_{\ell+1}| = |x_{\ell+1} - X_{\ell+1}| = \Bigl|x_\ell - X_\ell + \delta\bigl(F(x_\ell; u) - F(X_\ell; u)\bigr) + \frac{\delta^2}{2}D_x F\bigl(x(\xi_\ell); u\bigr)F\bigl(x(\xi_\ell); u\bigr)\Bigr|\le|x_\ell - X_\ell| + \delta\bigl|F(x_\ell; u) - F(X_\ell; u)\bigr| + \frac{\delta^2}{2}\bigl|D_x F\bigl(x(\xi_\ell); u\bigr)\bigr|\bigl|F\bigl(x(\xi_\ell); u\bigr)\bigr|\le|E_\ell| + \delta F_{\max}|E_\ell| + \frac{\delta^2}{2}F_{\max}^2.$$
Noticing that $|E_0| = 0$, the discrete Gronwall inequality gives
$$|E_\ell|\le(1 + \delta F_{\max})^\ell|E_0| + \frac{(1 + \delta F_{\max})^\ell - 1}{\delta F_{\max}}\cdot\frac{\delta^2}{2}F_{\max}^2\le\Bigl(\Bigl(1 + \frac{F_{\max}}{L}\Bigr)^L - 1\Bigr)\cdot\frac{F_{\max}}{2}\,\delta\le(e^{F_{\max}} - 1)\frac{F_{\max}}{2}\,\delta.$$
The lemma follows by taking $c = (e^{F_{\max}} - 1)\frac{F_{\max}}{2}$.

Lemma 1.17.
For any $u\in\mathbb{R}^d$, $|G(u)|, |G^\delta(u)|\le F_{\max}$.

Proof.
For $G(u)$ we use that $F(x; u)$ is uniformly bounded, so that
$$|G(u)| = |x(1)| = \Bigl|\int_0^1 F(x(t); u)\,dt\Bigr|\le\int_0^1\bigl|F(x(t); u)\bigr|\,dt\le F_{\max}.$$
As for $G^\delta(u)$, we first notice that
$$|X_{\ell+1}| = |X_\ell + \delta F(X_\ell; u)|\le|X_\ell| + \delta|F(X_\ell; u)|\le|X_\ell| + \delta F_{\max},$$
and by induction $|X_\ell|\le|X_0| + \ell\delta F_{\max} = \ell\delta F_{\max}$. In particular, $|G^\delta(u)| = |X_L|\le L\delta F_{\max} = F_{\max}$.

To conclude this chapter we show that in this example Assumption 1.12 is satisfied. Recall that $\eta\sim\mathcal{N}(0,\gamma^2 I_k)$, and thus
$$\sqrt{g(u)} = \sqrt{\nu\bigl(y - G(u)\bigr)} = \frac{1}{(2\pi)^{k/4}\gamma^{k/2}}\exp\Bigl(-\frac{1}{4\gamma^2}|y - G(u)|^2\Bigr),$$
$$\sqrt{g^\delta(u)} = \sqrt{\nu\bigl(y - G^\delta(u)\bigr)} = \frac{1}{(2\pi)^{k/4}\gamma^{k/2}}\exp\Bigl(-\frac{1}{4\gamma^2}|y - G^\delta(u)|^2\Bigr).$$

∙ For Assumption 1.12(i), notice that the function $e^{-w}$ is Lipschitz for $w > 0$, with Lipschitz constant 1. Therefore we have
$$\Bigl|\sqrt{g(u)} - \sqrt{g^\delta(u)}\Bigr|\le\frac{1}{(2\pi)^{k/4}\gamma^{k/2}}\cdot\frac{1}{4\gamma^2}\cdot\Bigl||y - G(u)|^2 - |y - G^\delta(u)|^2\Bigr|\le\frac{1}{(2\pi)^{k/4}\gamma^{k/2}}\cdot\frac{1}{4\gamma^2}\cdot\bigl(|y - G(u)| + |y - G^\delta(u)|\bigr)\bigl|G(u) - G^\delta(u)\bigr|\le\frac{1}{(2\pi)^{k/4}\gamma^{k/2}}\cdot\frac{1}{4\gamma^2}\cdot(2|y| + 2F_{\max})\,c\delta =: \tilde c\,\delta.$$
That is to say, Assumption 1.12(i) is satisfied with $\varphi(u) = \tilde c$, and $\mathbb{E}^\rho[\varphi(u)^2] = \tilde c^2 < \infty$.

∙ Assumption 1.12(ii) is satisfied since
$$\sqrt{g(u)} = \frac{1}{(2\pi)^{k/4}\gamma^{k/2}}\exp\Bigl(-\frac{1}{4\gamma^2}|y - G(u)|^2\Bigr)\le\frac{1}{(2\pi)^{k/4}\gamma^{k/2}}, \qquad \sqrt{g^\delta(u)}\le\frac{1}{(2\pi)^{k/4}\gamma^{k/2}}.$$

The book by Kaipio and Somersalo [63] provides an introduction to the Bayesian approach to inverse problems, especially in the context of differential equations. An overview of the subject of Bayesian inverse problems in differential equations, with a perspective informed by the geophysical sciences, is the book by Tarantola [108] (see, especially, Chapter 5). In the paper [105] the Bayesian approach to regularization is reviewed, developing a function space viewpoint on the subject; a well-posedness theory and some algorithmic approaches which are used when adopting the Bayesian approach to inverse problems are introduced. The function space viewpoint on the subject is developed in more detail in the chapter notes of Dashti and Stuart [23]. An application of this function space methodology to a large-scale geophysical inverse problem is considered in [83]. The paper [76] demonstrates the potential for the use of dimension reduction techniques from control theory within statistical inverse problems. See [46] for more detail on the subject of metrics, and other distance-like functions, on probability measures. See [105, 23] for more detailed discussions on the well-posedness of Bayesian inverse problems with respect to perturbations in the data; and see [20] for applications concerning numerical approximation of partial differential equations appearing in the forward model.
Related results, but using divergences rather than the Hellinger metric, may be found in [84]. The paper [59] contains an interesting set of examples where the Meta Theorem stated in this chapter fails in the sense that, whilst well-posedness holds, the posterior is Hölder with exponent less than one, rather than Lipschitz, with respect to perturbations.

Recall the inverse problem of estimating an unknown parameter $u \in \mathbb{R}^d$ from data $y \in \mathbb{R}^k$ under the model assumption

$$y = G(u) + \eta. \qquad (2.1)$$

In this chapter we study the linear-Gaussian setting, where the forward map $G(\cdot)$ is linear and both the prior on $u$ and the distribution of the observation noise $\eta$ are Gaussian. This setting is highly amenable to analysis and arises frequently in applications. Moreover, as we will see throughout this book, many methods employed in nonlinear or non-Gaussian settings build on ideas from the linear-Gaussian case by performing linearization or invoking Gaussian approximations. Having established a formula for the posterior, we investigate the effect that the choice of prior has on our solution. We do this by quantifying the spread of the posterior distribution in the small noise (approaching zero) limit. This provides intuitive understanding concerning the impact of the prior for overdetermined, determined, and underdetermined regimes, corresponding to $d < k$, $d = k$, and $d > k$, respectively. The following will be assumed throughout this chapter. Assumption 2.1.
The relationship between unknown $u \in \mathbb{R}^d$ and data $y \in \mathbb{R}^k$ defined by equation (2.1) holds. Moreover,
∙ Linearity of the forward model: $G(u) = Au$, for some $A \in \mathbb{R}^{k\times d}$;
∙ Gaussian prior: $u \sim \rho(u) = \mathcal{N}(\hat{m}, \hat{C})$, where $\hat{C}$ is positive definite;
∙ Gaussian noise: $\eta \sim \nu(\eta) = \mathcal{N}(0, \Gamma)$, where $\Gamma$ is positive definite;
∙ $u$ and $\eta$ are independent.

Under Assumption 2.1 the likelihood of $y$ given $u$ is Gaussian,

$$y \mid u \sim \mathcal{N}(Au, \Gamma). \qquad (2.2)$$

Therefore, using Bayes' formula (1.2) we see that the posterior $\pi^y(u)$ is given by

$$\pi^y(u) = \frac{1}{Z}\nu(y - Au)\rho(u) = \frac{1}{Z}\exp\Big(-\frac{1}{2}|y - Au|^2_\Gamma\Big)\exp\Big(-\frac{1}{2}|u - \hat{m}|^2_{\hat{C}}\Big) = \frac{1}{Z}\exp\Big(-\frac{1}{2}|y - Au|^2_\Gamma - \frac{1}{2}|u - \hat{m}|^2_{\hat{C}}\Big) = \frac{1}{Z}\exp\big(-\mathsf{J}(u)\big),$$

with

$$\mathsf{J}(u) = \frac{1}{2}|y - Au|^2_\Gamma + \frac{1}{2}|u - \hat{m}|^2_{\hat{C}}. \qquad (2.3)$$

Since the posterior distribution can be written as the exponential of a quadratic in $u$ it follows that the posterior is Gaussian. Its mean and covariance are given in the following result.

Theorem 2.2 (Posterior is Gaussian). Under Assumption 2.1 the posterior distribution is Gaussian,

$$u \mid y \sim \pi^y(u) = \mathcal{N}(m, C). \qquad (2.4)$$

The posterior mean $m$ and covariance $C$ are given by the following formulae:

$$m = (A^T\Gamma^{-1}A + \hat{C}^{-1})^{-1}(A^T\Gamma^{-1}y + \hat{C}^{-1}\hat{m}), \qquad (2.5)$$
$$C = (A^T\Gamma^{-1}A + \hat{C}^{-1})^{-1}. \qquad (2.6)$$

Proof.
Since $\pi^y(u) = \frac{1}{Z}\exp\big(-\mathsf{J}(u)\big)$ with $\mathsf{J}(u)$ given by (2.3), a quadratic function of $u$, it follows that the posterior distribution $\pi^y(u)$ is Gaussian. Denoting the mean and covariance of $\pi^y(u)$ by $m$ and $C$, we can write $\mathsf{J}(u)$ in the following form

$$\mathsf{J}(u) = \frac{1}{2}|u - m|^2_C + q, \qquad (2.7)$$

where the term $q$ does not depend on $u$. Now matching the coefficients of the quadratic and linear terms in equations (2.3) and (2.7), we get

$$C^{-1} = A^T\Gamma^{-1}A + \hat{C}^{-1}, \qquad C^{-1}m = A^T\Gamma^{-1}y + \hat{C}^{-1}\hat{m}.$$

Therefore equations (2.5) and (2.6) follow.

Equation (2.7) shows that the posterior mean $m$ minimizes $\mathsf{J}(u)$ given by (2.3). This demonstrates that the posterior mean is found by compromising between maximizing the likelihood (by making small the loss term $\frac{1}{2}|y - Au|^2_\Gamma$) and minimizing deviations from the prior mean (by making small the regularization term $\frac{1}{2}|u - \hat{m}|^2_{\hat{C}}$). The relative importance given to both objectives is determined by the relative size of the prior covariance $\hat{C}$ and the noise covariance $\Gamma$. An important feature of the linear-Gaussian setting is that the posterior covariance $C$ does not depend on the data $y$; this is not true in general.

As we saw in the previous chapter, the posterior mean estimator and the MAP estimator are typically different. However, in the linear-Gaussian setting the posterior is Gaussian, and therefore both estimators agree.

Corollary 2.3 (Characterization of Bayes Estimators). The posterior mean and MAP estimators under Assumption 2.1 agree, and are given by $u_{\mathrm{MAP}} = u_{\mathrm{PM}} = m$ defined in equation (2.5). Example 2.4.
Let $\Gamma = \gamma^2 I$, $\hat{C} = \sigma^2 I$, $\hat{m} = 0$, and set $\lambda = \gamma^2/\sigma^2$. Then

$$\mathsf{J}_\lambda(u) := \gamma^2\mathsf{J}(u) = \frac{1}{2}|y - Au|^2 + \frac{\lambda}{2}|u|^2.$$

Since $m$ minimizes $\mathsf{J}_\lambda(\cdot)$ it follows that

$$(A^TA + \lambda I)m = A^Ty. \qquad (2.8)$$

Example 2.4 provides a link between Bayesian inversion and optimization approaches to inversion: $\mathsf{J}_\lambda(u)$ can be seen as the objective functional in a linear regression model with a regularizer $\frac{\lambda}{2}|u|^2$, as used in ridge regression. Equation (2.8) for $m$ is exactly the regularized normal equation of the least-squares problem. In fact, in the general case, equation (2.5) can also be viewed as a generalized normal equation. This point of view helps us understand the structure of Bayesian regularization by linking it to the deep understanding of optimization approaches to inverse problems. A more extensive account of the optimization perspective and its interplay with Bayesian formulations will be given in the following chapter.

In this section we study the small observational noise limit of the posterior distribution in the linear-Gaussian setting. While most of the ideas can be extended beyond this setting, the explicit calculations that the linear-Gaussian setting allows for provide helpful intuition. Throughout this section we assume the following.
Assumption 2.5.
In addition to the linear-Gaussian setting described in Assumption 2.1, $\eta := \gamma\eta_0$, where $\eta_0 \sim \mathcal{N}(0, \Gamma_0)$, so that $\Gamma = \gamma^2\Gamma_0$.

Note that substituting $\Gamma = \gamma^2\Gamma_0$ into (2.5) and (2.6) we obtain

$$m = (A^T\Gamma_0^{-1}A + \gamma^2\hat{C}^{-1})^{-1}(A^T\Gamma_0^{-1}y + \gamma^2\hat{C}^{-1}\hat{m}), \qquad (2.9)$$
$$C = \gamma^2(A^T\Gamma_0^{-1}A + \gamma^2\hat{C}^{-1})^{-1}. \qquad (2.10)$$

In the next three subsections we study the small noise limiting behavior as $\gamma \to 0$. Here $\Rightarrow$ denotes weak convergence of probability measures. We will use repeatedly that weak convergence of Gaussian distributions is equivalent to the convergence of their means and covariances. In particular, the weak limit of a sequence of Gaussians with means converging to $m_+$ and covariance matrices converging to zero is a Dirac mass $\delta_{m_+}$. We start with the overdetermined case $d < k$.

Theorem 2.6 (Small Noise Limit of Posterior Distribution - Overdetermined). Suppose that Assumption 2.5 holds, that $\mathrm{Null}(A) = \{0\}$ and that $d < k$. Then in the limit $\gamma \to 0$, $\pi^y \Rightarrow \delta_{m_+}$, where $m_+$ is the solution of the least-squares problem

$$m_+ = \arg\min_{u\in\mathbb{R}^d} \frac{1}{2}\big|\Gamma_0^{-1/2}(y - Au)\big|^2. \qquad (2.11)$$

Proof.
Since $\mathrm{Null}(A) = \{0\}$ and $\Gamma_0$ is invertible we deduce that there is $\alpha > 0$ such that, for all $u \in \mathbb{R}^d$,

$$\langle u, A^T\Gamma_0^{-1}Au\rangle = \big|\Gamma_0^{-1/2}Au\big|^2 \ge \alpha|u|^2.$$

Thus $A^T\Gamma_0^{-1}A$ is positive definite and invertible. It follows that as $\gamma \to 0$, the posterior covariance converges to the zero matrix, $C \to 0$, and the posterior mean satisfies the limit

$$m \to m_* = (A^T\Gamma_0^{-1}A)^{-1}A^T\Gamma_0^{-1}y.$$

This proves the weak convergence of $\pi^y$ to $\delta_{m_*}$. It remains to characterize $m_*$. Since $\mathrm{Null}(A) = \{0\}$, the minimizers of

$$\mathsf{L}(u) := \frac{1}{2}\big|\Gamma_0^{-1/2}(y - Au)\big|^2$$

are unique and satisfy the normal equations $A^T\Gamma_0^{-1}Au = A^T\Gamma_0^{-1}y$. Hence $m_*$ solves the desired least-squares problem and coincides with $m_+$ given in (2.11). Remark 2.7.
In the overdetermined case, where $A^T\Gamma_0^{-1}A$ is invertible, the small observational noise limit leads to a posterior which is a Dirac, centered at the solution of the least-squares problem (2.11). Therefore, in this limit the prior plays no role in the Bayesian inference.

Theorem 2.8 (Posterior Consistency). Suppose that the assumptions of Theorem 2.6 hold and that the data satisfies

$$y = Au^\dagger + \gamma\eta^\dagger, \quad \text{for fixed } u^\dagger \in \mathbb{R}^d,\ \eta^\dagger \in \mathbb{R}^k. \qquad (2.12)$$

Then, for any sequence $M(\gamma) \to \infty$ as $\gamma \to 0$,

$$\mathbb{P}^{\pi^y}\big\{|u - u^\dagger| > M(\gamma)\gamma\big\} \to 0, \qquad (2.13)$$

where $\mathbb{P}^{\pi^y}$ denotes probability under the posterior distribution. Remark 2.9.
For any $\epsilon > 0$, set $M(\gamma) = \epsilon/\gamma$ in Theorem 2.8 to obtain

$$\mathbb{P}^{\pi^y}\big\{|u - u^\dagger| > \epsilon\big\} \to 0.$$

Proof. (Theorem 2.8) Throughout this proof we let $c$ be a constant independent of $\gamma$ that may change from line to line, and we denote by $\mathbb{E}$ expectation with respect to the posterior distribution, which is Gaussian with mean $m$ and covariance $C$ given by equations (2.9) and (2.10). Denote $m_* = (A^T\Gamma_0^{-1}A)^{-1}A^T\Gamma_0^{-1}y$ as in the proof of the previous theorem. We have that

$$\mathbb{E}|u - u^\dagger|^2 \le c\Big(\mathbb{E}|u - m|^2 + |m - m_*|^2 + |m_* - u^\dagger|^2\Big). \qquad (2.14)$$

We now bound each of the three terms on the right-hand side. For the first one,

$$\mathbb{E}|u - m|^2 = \mathbb{E}\big[(u-m)^T(u-m)\big] = \mathbb{E}\big[\mathrm{Tr}[(u-m)\otimes(u-m)]\big] = \mathrm{Tr}\,\mathbb{E}\big[(u-m)\otimes(u-m)\big] = \mathrm{Tr}(C) \le \gamma^2\,\mathrm{Tr}\big((A^T\Gamma_0^{-1}A)^{-1}\big).$$

For the second term, note that

$$(A^T\Gamma_0^{-1}A)m_* = A^T\Gamma_0^{-1}y, \qquad (A^T\Gamma_0^{-1}A + \gamma^2\hat{C}^{-1})m = A^T\Gamma_0^{-1}y + \gamma^2\hat{C}^{-1}\hat{m}.$$

Therefore

$$m - m_* = \gamma^2(A^T\Gamma_0^{-1}A)^{-1}\big(\hat{C}^{-1}\hat{m} - \hat{C}^{-1}m\big).$$

Since $m$ converges it is bounded, and so there is $c > 0$ such that $|m - m_*| \le c\gamma^2$. Finally, for the third term we write

$$m_* = (A^T\Gamma_0^{-1}A)^{-1}A^T\Gamma_0^{-1}Au^\dagger + \gamma(A^T\Gamma_0^{-1}A)^{-1}A^T\Gamma_0^{-1}\eta^\dagger = u^\dagger + \gamma(A^T\Gamma_0^{-1}A)^{-1}A^T\Gamma_0^{-1}\eta^\dagger,$$

which gives $|m_* - u^\dagger| \le c\gamma$. Using Markov's inequality and the three bounds above,

$$\mathbb{P}^{\pi^y}\big\{|u - u^\dagger| > M(\gamma)\gamma\big\} \le \frac{\mathbb{E}\big(|u - u^\dagger|^2\big)}{M(\gamma)^2\gamma^2} \le \frac{c}{M(\gamma)^2} \to 0, \quad \text{as } \gamma \to 0.$$

As a byproduct of the proof of Theorem 2.6, we can determine the limiting behavior of $\pi^y$ in the boundary case $d = k$.

Theorem 2.10 (Small Noise Limit of Posterior Distribution - Determined). Suppose that Assumption 2.5 holds, $\mathrm{Null}(A) = \{0\}$, and $d = k$. Then in the small noise limit $\gamma \to 0$, $\pi^y \Rightarrow \delta_{A^{-1}y}$. Proof.
In the proof of Theorem 2.6, the assumption $d < k$ is used only in that $A$ is not a square matrix and thus $A, A^T$ are not invertible. Denote by $(m, C)$ the mean and covariance of the posterior $u \mid y$. Using the same argument, we have $C \to 0$ and

$$m \to m_* = (A^T\Gamma_0^{-1}A)^{-1}A^T\Gamma_0^{-1}y.$$

Using that $A, A^T$ are square invertible matrices we obtain

$$m_* = \big(A^{-1}\Gamma_0(A^T)^{-1}\big)A^T\Gamma_0^{-1}y = A^{-1}y.$$

Therefore, $\pi^y(u) \Rightarrow \delta_{m_*} = \delta_{A^{-1}y}$.

Note that here, as in the overdetermined case, the prior plays no role in the small noise limit. Moreover, it can be shown as above that posterior consistency holds. The proof is identical and therefore omitted.

Theorem 2.11 (Posterior Consistency). Suppose that the assumptions of Theorem 2.10 hold, and that the data satisfies

$$y = Au^\dagger + \gamma\eta^\dagger, \quad \text{for fixed } u^\dagger, \eta^\dagger \in \mathbb{R}^d. \qquad (2.15)$$

Then for any sequence $M(\gamma) \to \infty$ as $\gamma \to 0$,

$$\mathbb{P}^{\pi^y}\big\{|u - u^\dagger| > M(\gamma)\gamma\big\} \to 0. \qquad (2.16)$$

Finally we consider the underdetermined case $d > k$. We assume that $A \in \mathbb{R}^{k\times d}$ with $\mathrm{Rank}(A) = k$ and write

$$A = (A_0\ \ 0)\,Q^T = (A_0\ \ 0)\,(Q_1\ \ Q_2)^T = A_0Q_1^T, \qquad (2.17)$$

with $A_0 \in \mathbb{R}^{k\times k}$ an invertible matrix, $Q = (Q_1\ \ Q_2) \in \mathbb{R}^{d\times d}$ an orthogonal matrix so that $Q^TQ = I_d$, $Q_1 \in \mathbb{R}^{d\times k}$, $Q_2 \in \mathbb{R}^{d\times(d-k)}$. We have the following result:

Theorem 2.12 (Small Noise Limit of Posterior Distribution - Underdetermined). Suppose that Assumption 2.5 holds, that $\mathrm{Rank}(A) = k$, and $d > k$. In the small noise limit $\gamma \to 0$, $\pi^y \Rightarrow \mathcal{N}(m_+, C_+)$, where

$$m_+ = \hat{C}Q_1(Q_1^T\hat{C}Q_1)^{-1}A_0^{-1}y + Q_2(Q_2^T\hat{C}^{-1}Q_2)^{-1}Q_2^T\hat{C}^{-1}\hat{m},$$
$$C_+ = Q_2(Q_2^T\hat{C}^{-1}Q_2)^{-1}Q_2^T.$$

Since $\mathrm{Rank}(C_+) = \mathrm{Rank}(Q_2) = d - k < d$, this theorem demonstrates that, in the small observational noise limit, the posterior retains uncertainty in a subspace of dimension $d - k$, and has no uncertainty in a subspace of dimension $k$. As a consequence there is no posterior consistency in the underdetermined case. Example 2.13.
To help understand the result in Theorem 2.12, we consider a simple explicit example. Assume that $A = (A_0\ \ 0) \in \mathbb{R}^{k\times d}$, $\Gamma = \gamma^2\Gamma_0 = \gamma^2 I_k$, $\hat{C} = I_d$, $\hat{m} = 0$. Let

$$u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim \mathcal{N}(0, I_d), \qquad u_1 \in \mathbb{R}^k,\ u_2 \in \mathbb{R}^{d-k}.$$

The data then satisfies

$$y = Au + \eta = A_0u_1 + \eta, \qquad \eta \sim \mathcal{N}(0, \gamma^2 I_k).$$

The posterior $u \mid y$ is $\pi^y_\gamma(u) = \frac{1}{Z_\gamma}\exp\big(-\mathsf{J}_\gamma(u)\big)$, where

$$\mathsf{J}_\gamma(u) = \frac{1}{2\gamma^2}|y - A_0u_1|^2 + \frac{1}{2}|u|^2 = \Big(\frac{1}{2\gamma^2}|y - A_0u_1|^2 + \frac{1}{2}|u_1|^2\Big) + \frac{1}{2}|u_2|^2. \qquad (2.18)$$

It is clear that $\pi^y_\gamma(u_1) \Rightarrow \delta_{A_0^{-1}y}(u_1)$. Once $u_1$ is fixed as $A_0^{-1}y$, the first term in (2.18) is the constant $\frac{1}{2}|A_0^{-1}y|^2$. Since $u_1$ and $u_2$ are independent we can derive, formally, the limiting posterior as follows:

$$\pi^y_\gamma(u) \Rightarrow \delta_{A_0^{-1}y}(u_1) \otimes \frac{1}{Z}\exp\Big(-\frac{1}{2}|u_2|^2\Big) = \delta_{A_0^{-1}y}(u_1) \otimes \mathcal{N}(0, I_{d-k}),$$

where $Z = \int_{\mathbb{R}^{d-k}}\exp(-\frac{1}{2}|u_2|^2)\,du_2$. In fact, this is exactly the limiting posterior measure given in Theorem 2.12. To prove Theorem 2.12, we use the following decomposition of the identity $I_d$. Lemma 2.14.
Let $\hat{C} \in \mathbb{R}^{d\times d}$ be invertible and $Q = (Q_1\ \ Q_2)$ be an orthogonal matrix with $Q_1 \in \mathbb{R}^{d\times k}$, $Q_2 \in \mathbb{R}^{d\times(d-k)}$. We have the following decomposition of $I_d$:

$$I_d = \hat{C}Q_1(Q_1^T\hat{C}Q_1)^{-1}Q_1^T + Q_2(Q_2^T\hat{C}^{-1}Q_2)^{-1}Q_2^T\hat{C}^{-1}. \qquad (2.19)$$

Proof.
Denote by $R$ the right-hand side of (2.19). Since $Q$ is orthogonal, we have $Q_1^TQ_2 = 0$, $Q_2^TQ_1 = 0$ and thus

$$Q_1^T(R - I) = 0, \qquad Q_2^T\hat{C}^{-1}(R - I) = 0.$$

If $B := (Q_1\ \ \hat{C}^{-1}Q_2)$ is full rank, the above identities imply that $B^T(R - I) = 0$ and thus $R = I$. Note that

$$Q^TB = \begin{pmatrix} Q_1^T \\ Q_2^T \end{pmatrix}(Q_1\ \ \hat{C}^{-1}Q_2) = \begin{pmatrix} I_k & Q_1^T\hat{C}^{-1}Q_2 \\ 0 & Q_2^T\hat{C}^{-1}Q_2 \end{pmatrix}.$$

Since the last matrix is invertible, $B$ is invertible and the proof is complete. Proof of Theorem 2.12.
Using (2.19), we can decompose $u$ as follows:

$$u = \underbrace{\hat{C}Q_1(Q_1^T\hat{C}Q_1)^{-1}}_{S}\,\underbrace{Q_1^Tu}_{u_1} + \underbrace{Q_2(Q_2^T\hat{C}^{-1}Q_2)^{-1}}_{T}\,\underbrace{Q_2^T\hat{C}^{-1}u}_{u_2} = Su_1 + Tu_2.$$

Here $u_1$ and $u_2$ are Gaussian, with $u_2 \sim \mathcal{N}(Q_2^T\hat{C}^{-1}\hat{m},\ Q_2^T\hat{C}^{-1}Q_2)$. The identity

$$\mathrm{Cov}(u_1, u_2) = Q_1^T\,\mathrm{Cov}(u,u)\,\hat{C}^{-1}Q_2 = Q_1^T\hat{C}\hat{C}^{-1}Q_2 = Q_1^TQ_2 = 0$$

shows that $u_1$ and $u_2$ are independent, written $u_1 \perp u_2$. From (2.17), we have

$$y = Au + \eta = A_0Q_1^Tu + \eta = A_0u_1 + \eta. \qquad (2.20)$$

Since $u \perp \eta$ and $u_1 \perp u_2$, we have that $u_2 \perp (y, u_1)$. We apply conditional probability to yield

$$\pi^y(u_1, u_2) := \mathbb{P}(u_1, u_2 \mid y) = \mathbb{P}(u_2)\,\mathbb{P}(u_1 \mid y).$$

Equation (2.20) shows that $(y, u_1, \eta)$, with $A_0 \in \mathbb{R}^{k\times k}$ invertible, fits the determined linear-Gaussian setting, and $\mathbb{P}(u_1 \mid y)$ is the corresponding posterior. Theorem 2.10 shows that $\mathbb{P}(u_1 \mid y) \Rightarrow \delta_{A_0^{-1}y}(u_1)$ as the noise vanishes, that is, as $\gamma \to 0$. Note that $u_2 \perp u_1$ and $u_2 \perp y$. The limiting posterior measure of $(u_1, u_2) \mid y$ is therefore

$$\pi^y(u_1, u_2) \Rightarrow \mathbb{P}(u_2) \otimes \delta_{A_0^{-1}y}(u_1) \qquad (2.21)$$

as $\gamma \to 0$. Recall $u = Su_1 + Tu_2$ and $u_2 \sim \mathcal{N}(Q_2^T\hat{C}^{-1}\hat{m},\ Q_2^T\hat{C}^{-1}Q_2)$. The mean and covariance of the limiting posterior measure of $u \mid y$ are

$$m_+ = \mathbb{E}(Su_1 + Tu_2 \mid y) = SA_0^{-1}y + T\,\mathbb{E}(u_2) = SA_0^{-1}y + TQ_2^T\hat{C}^{-1}\hat{m},$$
$$C_+ = \mathrm{Var}(Su_1 + Tu_2 \mid y) = \mathrm{Var}(Tu_2) = T\big(Q_2^T\hat{C}^{-1}Q_2\big)T^T = Q_2(Q_2^T\hat{C}^{-1}Q_2)^{-1}Q_2^T.$$

We have thus completed the proof.
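As a numerical sanity check (not part of the original argument), the identity of Lemma 2.14 and the limit asserted in Theorem 2.12 can be verified on a randomly generated instance, using the exact posterior formulas (2.9)-(2.10) with a small value of $\gamma$. A sketch assuming NumPy; all matrices below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
gamma = 1e-4  # small observational noise level

# Random instance of the decomposition (2.17): A = A0 Q1^T, A0 invertible.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q1, Q2 = Q[:, :k], Q[:, k:]
A0 = rng.standard_normal((k, k)) + 3.0 * np.eye(k)  # safely invertible
A = A0 @ Q1.T

M = rng.standard_normal((d, d))
Chat = M @ M.T + np.eye(d)       # prior covariance (positive definite)
mhat = rng.standard_normal(d)    # prior mean
Gamma0 = np.eye(k)
y = rng.standard_normal(k)

Cinv = np.linalg.inv(Chat)

# Identity of Lemma 2.14: the right-hand side of (2.19) equals I_d.
R = (Chat @ Q1 @ np.linalg.inv(Q1.T @ Chat @ Q1) @ Q1.T
     + Q2 @ np.linalg.inv(Q2.T @ Cinv @ Q2) @ Q2.T @ Cinv)

# Exact posterior mean and covariance, equations (2.9)-(2.10).
P = A.T @ np.linalg.inv(Gamma0) @ A + gamma**2 * Cinv
m = np.linalg.solve(P, A.T @ np.linalg.inv(Gamma0) @ y + gamma**2 * Cinv @ mhat)
C = gamma**2 * np.linalg.inv(P)

# Limit predicted by Theorem 2.12.
m_plus = (Chat @ Q1 @ np.linalg.inv(Q1.T @ Chat @ Q1) @ np.linalg.inv(A0) @ y
          + Q2 @ np.linalg.inv(Q2.T @ Cinv @ Q2) @ Q2.T @ Cinv @ mhat)
C_plus = Q2 @ np.linalg.inv(Q2.T @ Cinv @ Q2) @ Q2.T

print(np.linalg.norm(R - np.eye(d)))                           # ~ 0
print(np.linalg.norm(m - m_plus), np.linalg.norm(C - C_plus))  # small for small gamma
```

For this instance the deviations from the limiting mean and covariance are of order $\gamma^2$, consistent with the singular-perturbation structure of (2.9)-(2.10).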
Remark 2.15.
Equation (2.21) shows that in the limit of zero observational noise, the uncertainty is only in the variable $u_2$. Since $\mathrm{Span}(T) = \mathrm{Span}(Q_2)$ and $u = SA_0^{-1}y + Tu_2$, the uncertainty we observe is in $\mathrm{Span}(Q_2)$. The prior plays a role in the posterior measure, in the limit of zero observational noise, but only in the variable $u_2$.

The linear-Gaussian setting plays, for several reasons, a central role in the study of inverse problems. One is that it allows explicit solutions which can be used to give insight into the subject area more generally. The second is that in the large data limit many Bayesian posteriors are approximately Gaussian. The paper [39], which is in the linear-Gaussian setting, plays an important role in the history of Bayesian inversion as it was arguably the first to formulate Bayesian inversion in function space. We have also employed the Gaussian setting to present a basic form of posterior consistency. For a treatment in infinite dimensions see [71, 2, 89]. For the consistency problem in the classical statistical setting, see the books [48, 27]. The book [27] also contains the definition and properties of convergence in probability as used here. For the non-statistical approach to inverse problems, and consistency results, see [31] and the references therein.

In this chapter we explore the properties of Bayesian inversion from the perspective of an optimization problem which corresponds to maximizing the posterior probability, in a sense which we will make precise. We demonstrate the properties of the point estimator resulting from this optimization problem, showing its positive and negative attributes, the latter motivating our work in the following chapter.
Once again we work in the inverse problem setting of finding $u \in \mathbb{R}^d$ from $y \in \mathbb{R}^k$ given by

$$y = G(u) + \eta$$

with noise $\eta \sim \nu(\cdot)$ and prior $u \sim \rho$, as in Assumption 1.1. The posterior $\pi^y(u)$ on $u \mid y$ is given by Theorem 1.2 and has the form

$$\pi^y(u) = \frac{1}{Z}\nu\big(y - G(u)\big)\rho(u).$$

We may define a loss

$$\mathsf{L}(u) = -\log\nu\big(y - G(u)\big),$$

and a regularizer

$$\mathsf{R}(u) = -\log\rho(u).$$

When added together these two functions of $u$ comprise an objective function of the form

$$\mathsf{J}(u) = \mathsf{L}(u) + \mathsf{R}(u).$$

Furthermore

$$\pi^y(u) = \frac{1}{Z}\nu\big(y - G(u)\big)\rho(u) \propto e^{-\mathsf{J}(u)}.$$

We see that minimizing the objective function $\mathsf{J}(\cdot)$ is equivalent to maximizing the posterior $\pi^y(\cdot)$. Therefore, recalling Definition 1.4, the MAP estimator can be rewritten in terms of $\mathsf{J}$ as follows:

$$u_{\mathrm{MAP}} = \arg\max_{u\in\mathbb{R}^d}\pi^y(u) = \arg\min_{u\in\mathbb{R}^d}\mathsf{J}(u).$$

We will provide conditions under which the MAP estimator is attained in Theorem 3.5 and we will give an interpretation of MAP estimators in terms of maximizing the probability of infinitesimal balls in Theorem 3.8. This interpretation allows us to generalize the definition of MAP estimators to measures that do not possess a Lebesgue density.
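Computationally, the reformulation of MAP estimation as minimization of $\mathsf{J} = \mathsf{L} + \mathsf{R}$ means any generic optimizer can be applied. As an illustrative sketch (assuming NumPy/SciPy; the matrices and dimensions below are arbitrary stand-ins, not from the notes), we minimize $\mathsf{J}$ for a Gaussian likelihood and prior with a linear forward map, which lets us check the result against the closed-form posterior mean (2.5) of Theorem 2.2:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d, k = 4, 6
A = rng.standard_normal((k, d))
Gamma = 0.1 * np.eye(k)          # noise covariance
Chat = np.eye(d)                 # prior covariance
mhat = np.zeros(d)               # prior mean
y = A @ rng.standard_normal(d) + rng.multivariate_normal(np.zeros(k), Gamma)

def G(u):
    """Forward map; linear here, but any smooth G works with this code."""
    return A @ u

def J(u):
    """Objective J = L + R: negative log-likelihood plus negative log-prior."""
    r = y - G(u)
    loss = 0.5 * r @ np.linalg.solve(Gamma, r)
    reg = 0.5 * (u - mhat) @ np.linalg.solve(Chat, u - mhat)
    return loss + reg

u_map = minimize(J, np.zeros(d), method="BFGS").x

# In the linear-Gaussian case the MAP estimator is the posterior mean (2.5).
m = np.linalg.solve(A.T @ np.linalg.solve(Gamma, A) + np.linalg.inv(Chat),
                    A.T @ np.linalg.solve(Gamma, y) + np.linalg.inv(Chat) @ mhat)
print(u_map, m)
```

For nonlinear $G$ the same `minimize(J, ...)` call applies, but $\mathsf{J}$ may be nonconvex and the optimizer may only find a local minimizer.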
Example 3.1.
Consider the Gaussian setting of Assumption 2.1. Then, since the posterior is Gaussian, its mode agrees with its mean, which is given by $m$ as defined in Theorem 2.2.

Example 3.2. If $\eta \sim \mathcal{N}(0, \Gamma)$, then $\nu\big(y - G(u)\big) \propto \exp\big(-\frac{1}{2}|y - G(u)|^2_\Gamma\big)$. So the loss in this case is $\mathsf{L}(u) = \frac{1}{2}|y - G(u)|^2_\Gamma$, a $\Gamma$-weighted $L^2$ loss. Example 3.3.
If we have prior $\rho(u) = \mathcal{N}(0, \hat{C})$, then, ignoring $u$-independent normalization factors, which appear as constant shifts in $\mathsf{J}(\cdot)$, we may take the regularizer as $\mathsf{R}(u) = \frac{1}{2}|u|^2_{\hat{C}}$. In particular, if $\hat{C} = \lambda^{-1}I$, then $\mathsf{R}(u) = \frac{\lambda}{2}|u|^2$, an $L^2$ regularizer. If we combine Example 3.2 and Example 3.3, we obtain a canonical objective function

$$\mathsf{J}(u) = \frac{1}{2}|y - G(u)|^2_\Gamma + \frac{\lambda}{2}|u|^2.$$

To connect with future discussions, here $\lambda$ corresponds to the prior precision, and may be learned from data, as in hierarchical methods. Example 3.4.
As an alternative to the $L^2$ regularizer, consider $u = (u_1, \ldots, u_d)$ with the $u_i$ having i.i.d. Laplace prior distributions. Then $\rho(u) \propto \exp\big(-\lambda\sum_{i=1}^d|u_i|\big) = \exp(-\lambda|u|_1)$. In this case $\mathsf{R}(u) = \lambda|u|_1$, an $L^1$ regularizer. If we combine this prior with the weighted $L^2$ loss above then we have the objective function

$$\mathsf{J}(u) = \frac{1}{2}|y - G(u)|^2_\Gamma + \lambda|u|_1.$$

Even though this objective function promotes sparse solutions, samples from the underlying posterior distribution are not typically sparse.
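The last observation — that the $L^1$-regularized MAP estimate is sparse while posterior samples are not — can be illustrated in one dimension, where the scalar lasso problem has the classical soft-thresholding solution. A sketch assuming NumPy ($y$ and $\lambda$ are arbitrary illustrative values); posterior samples are drawn by inverse-CDF sampling on a grid:

```python
import numpy as np

rng = np.random.default_rng(2)
y, lam = 0.5, 1.0   # scalar data and L1 regularization weight (illustrative)

# MAP estimator: minimizer of J(u) = 0.5*(y - u)**2 + lam*|u|,
# given in closed form by soft-thresholding.
u_map = np.sign(y) * max(abs(y) - lam, 0.0)

# Posterior ~ exp(-J(u)); sample it by inverse-CDF sampling on a fine grid.
grid = np.linspace(-5, 5, 20001)
dens = np.exp(-0.5 * (y - grid)**2 - lam * np.abs(grid))
cdf = np.cumsum(dens)
cdf /= cdf[-1]
samples = np.interp(rng.uniform(size=5000), cdf, grid)

print(u_map)                      # the MAP is exactly zero here
print(np.mean(samples == 0.0))    # essentially no sample is exactly zero
```

The MAP sits exactly at zero because $|y| < \lambda$, yet the posterior is a continuous distribution, so the event $\{u = 0\}$ has probability zero under it.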
For any optimization problem for an objective function with a finite infimum, it is of interest to determine whether the infimum is attained. We have the following result.
Theorem 3.5 (Attainable MAP Estimator). Assume that $\mathsf{J}$ is non-negative and continuous, and that $\mathsf{J}(u) \to \infty$ as $|u| \to \infty$. Then $\mathsf{J}$ attains its infimum. Therefore, the MAP estimator of $u$ based on the posterior $\pi^y(u) \propto \exp\big(-\mathsf{J}(u)\big)$ is attained.

Proof. By the assumed growth and non-negativity of $\mathsf{J}$, there is $R > 0$ such that $\inf_{u\in\mathbb{R}^d}\mathsf{J}(u) = \inf_{u\in B(0,R)}\mathsf{J}(u)$, where $B(0,R)$ denotes the closed ball of radius $R$ around the origin. Since $\mathsf{J}$ is assumed to be continuous, its infimum over the compact set $B(0,R)$ is attained and the proof is complete. Remark 3.6.
Suppose that:
1. $G \in C(\mathbb{R}^d, \mathbb{R}^k)$, i.e. $G$ is a continuous function;
2. the objective function $\mathsf{J}(u)$ has the $L^2$ loss defined in Example 3.2 and $L^p$ regularizer $\mathsf{R}(u) = \frac{\lambda}{p}|u|_p^p$, $p \in (0,\infty)$.
Then the assumptions on $\mathsf{J}$ in Theorem 3.5 are satisfied. This shows that if $G$ is continuous, the infimum of $\mathsf{J}$ defined with $L^2$ loss and $L^p$ regularizer is attained at the MAP estimator of the corresponding Bayesian problem with posterior pdf proportional to $\exp\big(-\mathsf{J}(u)\big)$. Remark 3.7.
Notice that the assumption that $\mathsf{J}(u) \to \infty$ is not restrictive: this condition needs to hold in order to be able to normalize $\pi^y(u) \propto \exp\big(-\mathsf{J}(u)\big)$ into a pdf, which is implicitly assumed in the second part of the theorem statement.

Intuitively the MAP estimator maximizes posterior probability. We make this precise in the following theorem, which links the objective function $\mathsf{J}(\cdot)$ to small ball probabilities.

Theorem 3.8 (Objective Function and Posterior Probability). Making the same assumptions as in Theorem 3.5, let

$$\alpha(u,\delta) := \int_{v\in B(u,\delta)}\pi^y(v)\,dv = \mathbb{P}^{\pi^y}\big(B(u,\delta)\big)$$

be the posterior probability of a ball with radius $\delta$ centered at $u$. Then, for all $u_1, u_2 \in \mathbb{R}^d$, we have

$$\lim_{\delta\to 0}\frac{\alpha(u_1,\delta)}{\alpha(u_2,\delta)} = e^{\mathsf{J}(u_2) - \mathsf{J}(u_1)}.$$

Proof.
Let $u_1, u_2 \in \mathbb{R}^d$ and let $\epsilon > 0$. By continuity of $\mathsf{J}$ we have that, for all $\delta$ sufficiently small,

$$e^{-\mathsf{J}(u_1)-\epsilon} \le e^{-\mathsf{J}(v)} \le e^{-\mathsf{J}(u_1)+\epsilon} \quad \text{for all } v \in B(u_1,\delta),$$
$$e^{-\mathsf{J}(u_2)-\epsilon} \le e^{-\mathsf{J}(v)} \le e^{-\mathsf{J}(u_2)+\epsilon} \quad \text{for all } v \in B(u_2,\delta).$$

Therefore, for all $\delta$ sufficiently small,

$$B_\delta e^{-\mathsf{J}(u_1)-\epsilon} \le \int_{v\in B(u_1,\delta)}e^{-\mathsf{J}(v)}\,dv \le B_\delta e^{-\mathsf{J}(u_1)+\epsilon},$$
$$B_\delta e^{-\mathsf{J}(u_2)-\epsilon} \le \int_{v\in B(u_2,\delta)}e^{-\mathsf{J}(v)}\,dv \le B_\delta e^{-\mathsf{J}(u_2)+\epsilon},$$

where $B_\delta$ is the Lebesgue measure of a ball with radius $\delta$. Taking the ratio of the $\alpha$'s and using the above bounds we obtain that, for all $\delta$ sufficiently small,

$$e^{\mathsf{J}(u_2)-\mathsf{J}(u_1)-2\epsilon} \le \frac{\alpha(u_1,\delta)}{\alpha(u_2,\delta)} \le e^{\mathsf{J}(u_2)-\mathsf{J}(u_1)+2\epsilon}.$$

Since $\epsilon$ was arbitrary the desired result follows. Remark 3.9.
This theorem shows that maximizing the probability of an infinitesimally small ball is the same as minimizing the objective function $\mathsf{J}(\cdot)$. This is obvious in finite dimensions, but the proof above generalizes beyond measures which possess a Lebesgue density, and may be used in infinite dimensions.

Figure 2: Posterior (left) and objective function (right) for a $\mathcal{N}(0,1)$ posterior (orange) and a Laplace$(0,1)$ posterior (blue).
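The small-ball characterization of Theorem 3.8 is easy to check numerically in one dimension. For a standard Gaussian posterior, $\mathsf{J}(u) = u^2/2$ up to an additive constant, and the ratio of ball probabilities should approach $e^{\mathsf{J}(u_2)-\mathsf{J}(u_1)}$ as $\delta \to 0$; a sketch assuming SciPy:

```python
import math
from scipy.stats import norm

# Standard Gaussian posterior: J(u) = u**2 / 2 up to an additive constant.
def alpha(u, delta):
    """Posterior probability of the ball B(u, delta)."""
    return norm.cdf(u + delta) - norm.cdf(u - delta)

u1, u2 = 0.0, 1.0
for delta in [0.5, 0.1, 0.01, 0.001]:
    print(delta, alpha(u1, delta) / alpha(u2, delta))
# As delta -> 0 the ratio approaches exp(J(u2) - J(u1)) = exp(0.5).
```

The convergence is visible already at moderate $\delta$, with the error decaying like $\delta^2$ for smooth densities.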
By means of examples we now probe the question of whether or not the MAP estimator captures useful information about the posterior distribution.
Example 3.10.
If the posterior is single-peaked, such as a Gaussian or a Laplace distribution, as shown in Figure 2, the MAP estimator, i.e. the minimizer of the objective function, reasonably summarizes the most likely value of the unknown parameter. We next consider several examples where a point estimator —or a $\delta$-radius ball with $\delta \to 0$— fails to summarize adequately the posterior distribution.
Example 3.11.
If the posterior is rather unevenly distributed, such as a slab-and-spike distribution, as shown in Figure 3, then it is less clear that the MAP estimator usefully summarizes the posterior. For example, for the case in Figure 3 we may want the solution output of our Bayesian problem to be a weighted average of two Gaussian distributions, or two point estimators, each with a separate mean located at one of the two minima of the objective function, and a weight describing the probability mass associated with each of those two points.
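To make this concrete, consider a hypothetical bimodal posterior (the mixture weights and component parameters below are illustrative choices, not those of Figure 3). The MAP estimator reports only the taller mode, while the posterior mean falls between the modes in a region of negligible probability; a sketch assuming NumPy/SciPy:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical bimodal posterior: mixture of N(0, 1) (weight 0.7)
# and N(10, 0.5**2) (weight 0.3).
w1, w2 = 0.7, 0.3

def density(u):
    return w1 * norm.pdf(u, 0, 1) + w2 * norm.pdf(u, 10, 0.5)

grid = np.linspace(-5, 15, 200001)
u_map = grid[np.argmax(density(grid))]   # MAP: the taller mode, near 0
post_mean = w1 * 0.0 + w2 * 10.0         # mean of the mixture: 3.0

print(u_map, post_mean, density(post_mean))
# The mean lies between the modes, where the density is tiny; the MAP
# reports one mode and ignores the mass at the other entirely.
```

Neither point summarizes the posterior: a weighted pair of point estimates, or the full mixture, carries much more of the relevant information.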
Example 3.12.
In addition to multiple-peaked posteriors, there are cases where the objective function and the associated posterior density are simply very rough. In these cases, the small-scale roughness should be ignored, while the large-scale variation should be captured. For example, in Figure 4 the objective function is very rough and has a unique minimizer at a point far from 0. However, it also has a larger pattern: it tends to be smaller around 0, and larger away from 0. The MAP estimator cannot capture this large-scale pattern, as it is found by minimizing the objective function. It is arguably the case that $u = 0$ is a better point estimate. An alternative way to interpret this phenomenon is that there is a natural "temperature" to this problem, in the sense that variations below this temperature could be viewed as random noise that does not carry meaningful information.

Figure 3: Posterior (left) and objective function (right) for a posterior that is a weighted mixture of two Gaussian distributions, one centered at 0 and one centered at 10.

Figure 4: Posterior (left) and objective function (right) from an objective function that is very rough on the small scale, but contains a regular pattern on the larger scale. This specific example is generated by summing white noise with a quadratic function to form the objective function; the posterior is computed from the objective function.

The preceding examples suggest that multi-peaked distributions, or multi-minimum objective functions, can cause problems for MAP estimation as a point estimator. Next we illustrate that if the dimension $d$ of the parameter $u \in \mathbb{R}^d$ is high, then a single point estimator, even if a MAP estimator, is typically not a good summary of the posterior.

Figure 5: Empirical density of the $\ell^2$ norm of $\mathcal{N}(0, I_d)$ random vectors for various dimensions: $d = 1$ (blue), $d = 5$ (orange), $d = 10$ (green), $d = 50$ (red), and $d = 100$ (purple). The empirical density is obtained from 10000 samples for each distribution. Example 3.13.
We consider the "typical size" of a vector $u$ drawn from the standard Gaussian distribution $\mathcal{N}(0, I_d)$ as the dimension $d$ increases. In Figure 5 we display the empirical density of the norm of such random vectors. We can see that at low dimensions, such as when $d = 1$, obtaining a value close to the mode $u = 0$ is highly likely. In higher dimensions, however, the probability for a vector from this distribution to have a small $\ell^2$-norm becomes increasingly small as $d$ grows. For example, consider the probability that the norm is less than 5: $\mathbb{P}(|u| < 5)$ is essentially 1 when $d = 1$ and $d = 5$, but it decays rapidly with $d$ and is vanishingly small by $d = 100$. So we see that, as the dimension increases, with probability close to 1 a sample from the posterior has a norm far from 0. Indeed, for $d = 1000$ the norm concentrates tightly around $\sqrt{1000} \approx 31.6$: we most likely will find a vector with size around 31, not 0. Another way to see this is that, since the components $u_i$ of $u$ are i.i.d. standard unit Gaussians, we have that, by the strong law of large numbers,

$$\frac{1}{d}\sum_{i=1}^d u_i^2 \to 1 \quad \text{as } d \to \infty, \text{ almost surely}.$$

Thus, with high probability, the $\ell^2$-norm is of size $\sqrt{d}$. This example suggests that in high dimensions a point estimator may not capture enough information about the density. The preceding examples demonstrate that MAP estimators should be treated with caution, as they may not capture the desired posterior information in many cases. This motivates the study of alternative ways —beyond MAP estimators— to capture information from the posterior distribution. One such approach is to fit one or several Gaussian distributions to the posterior by minimizing an appropriate distance-like measure between distributions. This will be discussed in the next chapter.
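The concentration phenomenon of Example 3.13 is straightforward to reproduce by simulation; a sketch assuming NumPy (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000  # samples per dimension

for d in [1, 10, 100, 1000]:
    u = rng.standard_normal((n, d))
    norms = np.linalg.norm(u, axis=1)
    print(d, np.mean(norms < 5), np.mean(norms), np.sqrt(d))
# As d grows, |u| concentrates near sqrt(d): samples live far from the mode 0.
```

For $d = 1000$ every sampled norm is close to $\sqrt{1000} \approx 31.6$ and none is below 5, illustrating why the mode is an unrepresentative point in high dimensions.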
The optimization perspective on inversion predates the development of the Bayesian approach as a computational tool, because it is typically far cheaper to implement. The subject of classical regularization techniques for inversion is discussed in [31]. The concept of MAP estimators, which links probability to optimization, is discussed in the books [63, 108] in the finite dimensional setting. The paper [24] studies this connection precisely: it defines the MAP estimator for infinite dimensional Bayesian inverse problems, and the corresponding variational formulation, in a Gaussian setting. The paper [55] studies related ideas, but in the non-Gaussian setting, and [3] generalizes the variational formulation of MAP estimators to non-Gaussian priors that are sparsity promoting. The paper [109] shows an application of optimization based inversion to a large-scale geophysical application.

Recall the inverse problem of finding $u$ from $y$ given by (1.1), and the Bayesian formulation which follows from Assumption 1.1. In the previous chapter we explored the idea of obtaining a point estimator using an optimization perspective arising from maximizing the posterior pdf. We related this idea to finding the center of a ball of radius $\delta$ with maximal probability in the limit $\delta \to 0$. Whilst the idea is intuitively appealing, and reduces the complexity of Bayesian inference from determination of an entire distribution to determination of a single point, the approach has a number of limitations, in particular for noisy, multi-peaked or high dimensional posterior distributions; the examples at the end of the previous chapter illustrated this. In this chapter we again adopt an optimization approach to the problem of Bayesian inference, but instead seek a Gaussian distribution $p = \mathcal{N}(\mu, \Sigma)$ that minimizes the Kullback-Leibler divergence from the posterior $\pi^y(u)$; since the Kullback-Leibler divergence is not symmetric this leads to two distinct problems, both of which we will study. Definition 4.1.
Let $\pi, \pi'$ be pdfs on $\mathbb{R}^d$ with $\pi' > 0$. The Kullback-Leibler (K-L) divergence, or relative entropy, of $\pi$ with respect to $\pi'$ is defined by

$$d_{\mathrm{KL}}(\pi\|\pi') := \int_{\mathbb{R}^d}\log\Big(\frac{\pi(u)}{\pi'(u)}\Big)\pi(u)\,du = \mathbb{E}^\pi\Big[\log\Big(\frac{\pi}{\pi'}\Big)\Big] = \mathbb{E}^{\pi'}\Big[\log\Big(\frac{\pi}{\pi'}\Big)\frac{\pi}{\pi'}\Big].$$

Kullback-Leibler is a divergence in that $d_{\mathrm{KL}}(\pi\|\pi') \ge 0$, with equality if and only if $\pi = \pi'$. However, unlike Hellinger and total variation, it is not a distance. In particular, the K-L divergence is not symmetric: in general

$$d_{\mathrm{KL}}(\pi\|\pi') \ne d_{\mathrm{KL}}(\pi'\|\pi),$$

a fact that will be important in this chapter. Nevertheless, it is useful for at least four reasons: (1) it provides an upper bound for many distances; (2) its logarithmic structure allows explicit computations that are difficult using actual distances; (3) it satisfies many convenient analytical properties, such as being convex in both arguments and lower-semicontinuous in the topology of weak convergence; (4) it has an information theoretic and physical interpretation. Lemma 4.2.
The K-L divergence provides the following upper bounds for the Hellinger and total variation distances:

$$d_{\mathrm{H}}(\pi,\pi')^2 \le \frac{1}{2}\,d_{\mathrm{KL}}(\pi\|\pi'), \qquad d_{\mathrm{TV}}(\pi,\pi')^2 \le d_{\mathrm{KL}}(\pi\|\pi').$$

Proof.
The second inequality follows from the first one by Lemma 1.9; thus we prove only the first inequality. Consider the function $\phi : \mathbb{R}^+ \to \mathbb{R}$ defined by

$$\phi(x) = x - 1 - \log x.$$

Note that

$$\phi'(x) = 1 - \frac{1}{x}, \qquad \phi''(x) = \frac{1}{x^2}, \qquad \phi(0^+) = \phi(\infty) = \infty.$$

Thus the function is convex on its domain. As the minimum of $\phi$ is attained at $x = 1$, and as $\phi(1) = 0$, we deduce that

$$\phi(x) \ge 0 \quad \text{for all } x \in (0,\infty).$$

Hence, $x - 1 \ge \log x$ for all $x \ge 0$, and

$$\sqrt{x} - 1 \ge \frac{1}{2}\log x \quad \text{for all } x \ge 0.$$

We can use this last inequality to bound the Hellinger distance:

$$d_{\mathrm{H}}(\pi,\pi')^2 = \frac{1}{2}\int\Big(1 - \sqrt{\frac{\pi'}{\pi}}\Big)^2\pi\,du = \frac{1}{2}\int\Big(1 + \frac{\pi'}{\pi} - 2\sqrt{\frac{\pi'}{\pi}}\Big)\pi\,du = \int\Big(1 - \sqrt{\frac{\pi'}{\pi}}\Big)\pi\,du \le -\frac{1}{2}\int\log\Big(\frac{\pi'}{\pi}\Big)\pi\,du = \frac{1}{2}\,d_{\mathrm{KL}}(\pi\|\pi').$$

Minimizing $d_{\mathrm{KL}}(p\|\pi)$. In this section we prove the existence of an approximating Gaussian. To streamline the exposition we work under the following assumption, which can be relaxed:
Assumption 4.3.
The posterior distribution is constructed under the following assumptions:
∙ The loss function $\mathsf{L}(u) := -\log\nu\big(y - G(u)\big)$ is non-negative and bounded above.
∙ The prior is a centered isotropic Gaussian: $\rho \sim \mathcal{N}(0, \lambda^{-1}I)$.

Let $\mathcal{A}$ be the set of Gaussian distributions on $\mathbb{R}^d$ with positive definite covariance,

$$\mathcal{A} = \big\{\mathcal{N}(\mu,\Sigma) : \mu \in \mathbb{R}^d,\ \Sigma \in \mathbb{R}^{d\times d} \text{ positive-definite symmetric}\big\}.$$

We have the following theorem:

Theorem 4.4 (Best Gaussian Approximation). Under Assumption 4.3, there exists at least one probability distribution $p \in \mathcal{A}$ at which the infimum $\inf_{p\in\mathcal{A}} d_{\mathrm{KL}}(p\|\pi)$ is attained.

Proof. The K-L divergence can be computed explicitly as

$$d_{\mathrm{KL}}(p\|\pi) = \mathbb{E}^p\log p - \mathbb{E}^p\log\pi = \mathbb{E}^p\Big(-\frac{1}{2}|u-\mu|^2_\Sigma - \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \mathsf{L}(u) + \frac{\lambda}{2}|u|^2 + \log Z\Big).$$

Note that $Z$ is the normalization constant for $\pi$ and is independent of $p$, and hence of $\mu$ and $\Sigma$. We can represent a given random variable $u \sim p$ by writing $u = \mu + \Sigma^{1/2}\xi$, where $\xi \sim \mathcal{N}(0, I)$, and hence $\mathbb{E}^p|u|^2 = |\mu|^2 + \mathrm{tr}(\Sigma)$, to obtain

$$d_{\mathrm{KL}}(p\|\pi) = -\frac{d}{2} - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma + \mathbb{E}^p\mathsf{L}(u) + \frac{\lambda}{2}|\mu|^2 + \frac{\lambda}{2}\mathrm{tr}(\Sigma) + \log Z.$$

Define $\mathsf{I}(\mu,\Sigma) = \mathbb{E}^p\mathsf{L}(u) + \frac{\lambda}{2}|\mu|^2 + \frac{\lambda}{2}\mathrm{tr}(\Sigma) - \frac{1}{2}\log\det\Sigma$. Clearly, there is a correspondence between minimizing $d_{\mathrm{KL}}(p\|\pi)$ over $p \in \mathcal{A}$ and minimizing $\mathsf{I}(\mu,\Sigma)$ over $\mu \in \mathbb{R}^d$ and positive definite $\Sigma$. Note that:
∙ $\mathsf{I}(0, I) < \infty$.
∙ For any $\Sigma$, $\mathsf{I}(\mu,\Sigma) \to \infty$ as $|\mu| \to \infty$.
∙ For any $\mu$, $\mathsf{I}(\mu,\Sigma) \to \infty$ as $\mathrm{tr}(\Sigma) \to 0$ or $\mathrm{tr}(\Sigma) \to \infty$.
Therefore, there are $M, r, R > 0$ such that the infimum of $\mathsf{I}(\mu,\Sigma)$ over $\mu \in \mathbb{R}^d$ and positive definite $\Sigma$ is equal to the infimum of $\mathsf{I}(\mu,\Sigma)$ over

$$\tilde{\mathcal{A}} := \big\{(\mu,\Sigma) : \mu\in\mathbb{R}^d,\ \Sigma\in\mathbb{R}^{d\times d}\text{ positive-definite symmetric},\ |\mu| \le M,\ r \le \mathrm{tr}(\Sigma) \le R\big\}.$$

Since $\mathsf{I}$ is continuous in $\tilde{\mathcal{A}}$ it achieves its infimum and the proof is complete.

We remark that the theorem establishes the existence of a best Gaussian approximation. However, minimizers need not be unique.

Minimizing $d_{\mathrm{KL}}(\pi\|p)$. In this section we show that the best Gaussian approximation in Kullback-Leibler with respect to its second argument is unique and given by moment matching.

Theorem 4.5 (Best Gaussian by Moment Matching). Assume that $\bar{\mu} := \mathbb{E}^\pi[u]$ is finite and that $\bar{\Sigma} := \mathbb{E}^\pi[(u-\bar{\mu})\otimes(u-\bar{\mu})]$ is positive-definite. Then the infimum

$$\inf_{p\in\mathcal{A}} d_{\mathrm{KL}}(\pi\|p)$$

is attained at the element in $\mathcal{A}$ with mean $\bar{\mu}$ and covariance $\bar{\Sigma}$. Proof.
By definition
𝑑_KL(𝜋‖𝑝) = −E^𝜋[log 𝑝] + E^𝜋[log 𝜋].    (4.1)
Since the second term does not involve 𝑝, we study minimization of the first:
−E^𝜋[log 𝑝] = −E^𝜋[ log( (2𝜋)^{−𝑑/2}(det Σ)^{−1/2} exp( −(1/2)|Σ^{−1/2}(𝑢 − 𝜇)|² ) ) ] = (1/2) E^𝜋[ |Σ^{−1/2}(𝑢 − 𝜇)|² ] + (1/2) log det Σ + (𝑑/2) log 2𝜋.
Let Ω = Σ⁻¹. Then our task is equivalent to minimizing the following function of 𝜇 and Ω:
I(𝜇, Ω) = (1/2) E^𝜋[ ⟨𝑢 − 𝜇, Ω(𝑢 − 𝜇)⟩ ] − (1/2) log det Ω.
First we find the critical points of I by taking its first order partial derivatives with respect to 𝜇 and Ω and setting both to zero:
∂_𝜇 I = −E^𝜋[Ω(𝑢 − 𝜇)] = 0;
∂_Ω I = (1/2) ∂_Ω( E^𝜋[ (𝑢 − 𝜇) ⊗ (𝑢 − 𝜇) : Ω ] ) − (1/2) ∂_Ω log det Ω = (1/2) E^𝜋[ (𝑢 − 𝜇) ⊗ (𝑢 − 𝜇) ] − (1/2) Ω⁻¹ = 0;
where we have used the relation ∂_Ω det Ω = det Ω · Ω⁻¹. Solving the above two equations gives us the critical point, expressed in terms of mean and covariance,
(𝜇̄, Σ̄) = ( E^𝜋[𝑢], E^𝜋[ (𝑢 − 𝜇̄) ⊗ (𝑢 − 𝜇̄) ] ).
Next we will show that (𝜇̄, Σ̄⁻¹) is a minimizer of I or, equivalently, that (𝜇̄, Σ̄) is a minimizer of the expression given in (4.1). To this end, we parametrize the distribution 𝑝 by a vector 𝜃 and show that the Hessian of (4.1) with respect to 𝜃 is positive semi-definite. Recalling the label 𝑑 for the dimension of the unknown 𝑢, we have:
𝑝_𝜃(𝑢) = √(det Ω/(2𝜋)^𝑑) exp( −(1/2)(𝑢 − 𝜇)^𝑇 Ω (𝑢 − 𝜇) )
= √(det Ω/(2𝜋)^𝑑) exp( −(1/2)𝑢^𝑇Ω𝑢 + 𝜇^𝑇Ω𝑢 − (1/2)𝜇^𝑇Ω𝜇 )
= √(det Ω/(2𝜋)^𝑑) exp( −(1/2)𝜇^𝑇Ω𝜇 ) exp( −(1/2)𝑢^𝑇Ω𝑢 + 𝜇^𝑇Ω𝑢 ).
Utilizing vectorization of matrices we may write:
(1/2)𝑢^𝑇Ω𝑢 = (1/2) Ω : 𝑢𝑢^𝑇 = (1/2)[vec(Ω)]^𝑇 vec(𝑢𝑢^𝑇), 𝜇^𝑇Ω𝑢 = (Ω𝜇)^𝑇𝑢
⇒ −(1/2)𝑢^𝑇Ω𝑢 + 𝜇^𝑇Ω𝑢 = [ Ω𝜇; −(1/2)vec(Ω) ]^𝑇 [ 𝑢; vec(𝑢𝑢^𝑇) ].
Then we let:
𝜃 = [ Ω𝜇; −(1/2)vec(Ω) ], 𝑇(𝑢) = [ 𝑢; vec(𝑢𝑢^𝑇) ].
The pdf 𝑝_𝜃(𝑢) can then be written in the following form:
𝑝_𝜃(𝑢) = (1/𝑍(𝜃)) exp( 𝜃^𝑇𝑇(𝑢) )    (4.2)
with
𝑍(𝜃) = ∫_{R^𝑑} exp( 𝜃^𝑇𝑇(𝑢) ) 𝑑𝑢.    (4.3)
Using equation (4.2) we can rewrite equation (4.1) as:
𝐻(𝜃) = 𝑑_KL(𝜋‖𝑝_𝜃) = −𝜃^𝑇 E^𝜋[𝑇(𝑢)] + log 𝑍(𝜃) + E^𝜋[log 𝜋(𝑢)].    (4.4)
Notice that
∇_𝜃 log 𝑍(𝜃) = (1/𝑍(𝜃)) ∫_{R^𝑑} ∇_𝜃[ exp( 𝜃^𝑇𝑇(𝑢) ) ] 𝑑𝑢 = (1/𝑍(𝜃)) ∫_{R^𝑑} 𝑇(𝑢) exp( 𝜃^𝑇𝑇(𝑢) ) 𝑑𝑢 = E^{𝑝_𝜃}[𝑇(𝑢)].
Therefore we can calculate the gradient and Hessian of 𝐻(𝜃) as follows:
∇_𝜃 𝐻(𝜃) = −E^𝜋[𝑇(𝑢)] + E^{𝑝_𝜃}[𝑇(𝑢)],
[∇²_𝜃 𝐻(𝜃)]_{𝑖𝑗} = ∂² log 𝑍(𝜃)/∂𝜃_𝑖∂𝜃_𝑗 = ∂/∂𝜃_𝑗( (1/𝑍(𝜃)) ∫_{R^𝑑} 𝑇_𝑖(𝑢) 𝑒^{𝜃^𝑇𝑇(𝑢)} 𝑑𝑢 )
= (1/𝑍(𝜃)) ∫_{R^𝑑} 𝑇_𝑖(𝑢)𝑇_𝑗(𝑢) 𝑒^{𝜃^𝑇𝑇(𝑢)} 𝑑𝑢 − (1/𝑍(𝜃)²) ∫_{R^𝑑} 𝑇_𝑖(𝑢) 𝑒^{𝜃^𝑇𝑇(𝑢)} 𝑑𝑢 · ∫_{R^𝑑} 𝑇_𝑗(𝑢) 𝑒^{𝜃^𝑇𝑇(𝑢)} 𝑑𝑢
= E^{𝑝_𝜃}[𝑇_𝑖𝑇_𝑗] − E^{𝑝_𝜃}[𝑇_𝑖] E^{𝑝_𝜃}[𝑇_𝑗] = [Cov_{𝑝_𝜃}(𝑇)]_{𝑖𝑗}.
Therefore the Hessian of the objective function is the covariance matrix of 𝑇(𝑢) under the distribution 𝑝_𝜃 and, by construction, is positive semi-definite. Therefore 𝑑_KL(𝜋‖𝑝_𝜃) is convex in 𝜃 and the critical point (𝜇̄, Σ̄) is a corresponding minimizer. Remark 4.6.
Notice that the preceding proof of convexity holds for any distribution 𝑝 that can be parametrized by a vector 𝜃 in the general form of equation (4.2). In particular, for such problems the convexity of 𝑑_KL(𝜋‖𝑝_𝜃) follows, ensuring that any critical point is a global minimizer. In fact, equation (4.2) is a special case of the following more general expression:
𝑝_𝜃(𝑢) = ℎ(𝑢) exp( 𝜃^𝑇𝑇(𝑢) − 𝐴(𝜃) )    (4.5)
with
𝐴(𝜃) = log[ ∫_{R^𝑑} ℎ(𝑢) exp( 𝜃^𝑇𝑇(𝑢) ) 𝑑𝑢 ].    (4.6)
Since ℎ(𝑢) is independent of 𝜃, the conclusion of the previous theorem carries over to distributions of the form (4.5). Such distributions are said to belong to the exponential family in the statistics literature; 𝜃 is called the natural parameter, 𝑇(𝑢) the sufficient statistic, ℎ(𝑢) the base measure, and 𝐴(𝜃) the log-partition function. The Gaussian distribution is a special case in which ℎ(𝑢) is constant with respect to 𝑢. In what follows we show some examples of other exponential family members. Example 4.7.
The Bernoulli distribution
Let 𝑢 be a random variable following the Bernoulli distribution with P(𝑢 = 1) = 𝑞 and P(𝑢 = 0) = 1 − 𝑞. Then the probability mass function describing the distribution of 𝑢 is:
𝑝(𝑢; 𝑞) = 𝑞^𝑢 (1 − 𝑞)^{1−𝑢} = exp[ 𝑢 log( 𝑞/(1 − 𝑞) ) + log(1 − 𝑞) ] = (1 − 𝑞) exp[ 𝑢 log( 𝑞/(1 − 𝑞) ) ].
Therefore 𝜃 = log( 𝑞/(1 − 𝑞) ), 𝑇(𝑢) = 𝑢, 𝐴(𝜃) = log(1 + 𝑒^𝜃), and ℎ(𝑢) = 1. Example 4.8.
The Poisson distribution
The probability mass function describing the distribution of 𝑢 is, in this case,
𝑝(𝑢; 𝜆) = 𝜆^𝑢 𝑒^{−𝜆}/𝑢! = (1/𝑢!) 𝑒^{𝑢 log 𝜆 − 𝜆}.
Therefore 𝜃 = log 𝜆, 𝑇(𝑢) = 𝑢, 𝐴(𝜃) = 𝑒^𝜃 and ℎ(𝑢) = 1/𝑢!.
Comparing 𝑑_KL(𝜋‖𝑝) and 𝑑_KL(𝑝‖𝜋)
It is instructive to compare the two different minimization problems, both leading to a "best Gaussian", that we described in the preceding two sections. We write the two relevant divergences as follows and then explain the nomenclature:
𝑑_KL(𝑝‖𝜋) = E^𝑝[ log(𝑝/𝜋) ] = E^𝑝[log 𝑝] − E^𝑝[log 𝜋]  ("mode-seeking")
𝑑_KL(𝜋‖𝑝) = E^𝜋[ log(𝜋/𝑝) ] = E^𝜋[log 𝜋] − E^𝜋[log 𝑝]  ("mean-seeking")
Figure 6
(a) Minimizing 𝑑_KL(𝑝‖𝜋) can lead to serious information loss, while (b) minimizing 𝑑_KL(𝜋‖𝑝) ensures a comprehensive consideration of all components of 𝜋.
Note that when minimizing 𝑑_KL(𝑝‖𝜋) we want log(𝑝/𝜋) to be small, which can happen when 𝑝 ≃ 𝜋 or 𝑝 ≪ 𝜋. This illustrates the fact that minimizing 𝑑_KL(𝑝‖𝜋) may miss components of 𝜋. For example, in Figure 6(a) 𝜋 is a bimodal distribution, but minimizing 𝑑_KL(𝑝‖𝜋) over Gaussians 𝑝 can only give a single-mode approximation, achieved by matching one of the modes; we may think of this as "mode-seeking". In contrast, when minimizing 𝑑_KL(𝜋‖𝑝) over Gaussians 𝑝 we want log(𝜋/𝑝) to be small, where 𝑝 appears in the denominator. This implies that wherever 𝜋 has some mass we must let 𝑝 also have some mass there, in order to keep 𝜋/𝑝 as close as possible to one. The minimization is therefore carried out by allocating the mass of 𝑝 in such a way that, on average over 𝜋, the divergence between 𝑝 and 𝜋 attains its minimum, as shown in Figure 6(b); hence the label "mean-seeking". Different applications will favor different choices between the mean- and mode-seeking approaches to Gaussian approximation.
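The mode-seeking versus mean-seeking behavior just described can be observed numerically. The sketch below is illustrative only: the bimodal target, its component parameters, and the grid are assumptions, not values taken from the text. It compares a moment-matched Gaussian (the minimizer of 𝑑_KL(𝜋‖𝑝) over Gaussians, by Theorem 4.5) against a Gaussian placed on a single mode, evaluating both K-L divergences by quadrature.

```python
import numpy as np

# Bimodal target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
# All parameters here are illustrative choices, not taken from the text.
grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]

def gauss(u, mu, sigma):
    return np.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

target = 0.5 * gauss(grid, -3, 0.5) + 0.5 * gauss(grid, 3, 0.5)

def kl(p, q):
    # d_KL(p || q) by quadrature; the integrand vanishes where p does.
    mask = p > 1e-300
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

# Mean-seeking: minimizing d_KL(pi || p) gives the moment-matched Gaussian.
mean = np.sum(grid * target) * dx
var = np.sum((grid - mean) ** 2 * target) * dx
moment_matched = gauss(grid, mean, np.sqrt(var))

# Mode-seeking: a Gaussian sitting on one of the two modes.
mode_matched = gauss(grid, 3, 0.5)

# The moment-matched Gaussian wins under d_KL(pi || p) ...
assert kl(target, moment_matched) < kl(target, mode_matched)
# ... while the single-mode Gaussian wins under d_KL(p || pi).
assert kl(mode_matched, target) < kl(moment_matched, target)
```

Consistent with the discussion above, each criterion selects a different "best Gaussian": 𝑑_KL(𝜋‖𝑝) spreads mass over both modes, while 𝑑_KL(𝑝‖𝜋) locks onto one.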
This chapter has been concerned with finding the best Gaussian approximation to a measure with respect to KL divergences. Bayes Theorem 1.2 itself can be formulated through a closely related minimization principle. Consider a posterior 𝜋^𝑦(𝑢) of the following form:
𝜋^𝑦(𝑢) = (1/𝑍) exp( −L(𝑢) ) 𝜌(𝑢),
where 𝜌(𝑢) is the prior, L(𝑢) is the negative log-likelihood, and 𝑍 the normalization constant. We assume here, for ease of exposition, that all densities are positive. Then we can express 𝑑_KL(𝑝‖𝜋) in terms of the prior as follows:
𝑑_KL(𝑝‖𝜋) = ∫_{R^𝑑} log(𝑝/𝜋) 𝑝 𝑑𝑢 = ∫_{R^𝑑} log( (𝑝/𝜌)(𝜌/𝜋) ) 𝑝 𝑑𝑢 = ∫_{R^𝑑} log( (𝑝/𝜌) exp(L(𝑢)) 𝑍 ) 𝑝 𝑑𝑢 = 𝑑_KL(𝑝‖𝜌) + E^𝑝[L(𝑢)] + log 𝑍.
If we define 𝒥(𝑝) = 𝑑_KL(𝑝‖𝜌) + E^𝑝[L(𝑢)] then we have the following:
Theorem 4.9 (Bayes Theorem as an Optimization Principle). The posterior distribution 𝜋 is given by the following minimization principle:
𝜋 = argmin_{𝑝∈P} 𝒥(𝑝),
where P contains all probability densities on R^𝑑.
Proof. Since 𝑍 is the normalization constant for 𝜋 and is independent of 𝑝, the minimizer of 𝑑_KL(𝑝‖𝜋) is also the minimizer of 𝒥(𝑝). Since the global minimum of 𝑑_KL(𝑝‖𝜋) is attained at 𝑝 = 𝜋, the result follows.
Why is it useful to view the posterior as the minimizer of an energy? There are at least three advantages of this viewpoint. First, the variational formulation provides a natural way to approximate the posterior by restricting the minimization problem to distributions satisfying some computationally desirable property. For instance, variational Bayes methods often restrict the minimization to densities with product structure, and in this chapter we have studied the restriction to the class of Gaussian distributions. Second, variational formulations allow one to show convergence of posterior distributions indexed by some parameter of interest by studying the Gamma-convergence of the associated objective functionals.
Third, variational formulations provide natural paths, defined by a gradient flow, towards the posterior. Understanding these flows and their rates of convergence is helpful in the choice of sampling algorithms.
The definition of the K-L divergence, and upper-bounds in terms of probability metrics, can be found in [46]. For a basic introduction to variational methods, and the moment-matching version of Gaussian approximation, see [13]. The problem of finding a Gaussian approximation of a general finite dimensional probability distribution is studied in [80], and infinite dimensional formulations are considered in [94] and the companion paper [93]. The approximation in Theorem 4.4 consists of a single Gaussian distribution. If the posterior has more than one mode, a single Gaussian may not be appropriate. For an approximation composed of Gaussian mixtures, the reader is referred to [80]. The paper [44] highlights how minimization of K-L divergence arises naturally in the optimization of local entropy and heat regularized costs in deep learning. The formulation of Bayes Theorem as an optimization principle is well-known and widely used in the machine learning community, where it goes under the name "variational Bayes"; see the book [81] and the paper [8] for clear expositions of this subject. The variational formulation has been used to establish convergence of Bayesian procedures in [43] and [41], and to define gradient flows that converge to the posterior as a way to develop and analyze Markov chain Monte Carlo proposals in [42]. For more information about the properties and examples of exponential family distributions we refer to [90], and for background on the matrix calculations used in this chapter we refer to [92].
In this chapter we introduce Monte Carlo sampling and importance sampling.
They are two general techniques for estimating expectations computed with respect to a particular distribution by generating samples from (a possibly different) distribution; they may also be viewed as approximations of a given measure via sums of Dirac measures.
Throughout this chapter we focus on computing expectations with respect to a probability distribution with pdf 𝜋 given in the form
𝜋(𝑢) = (1/𝑍) 𝑔(𝑢) 𝜌(𝑢),    (5.1)
where 𝑍 is a normalizing constant. Monte Carlo sampling will use samples from 𝜋 itself; importance sampling will use samples from 𝜌. A particular application of this setting is where 𝜌 is the prior, 𝜋 the posterior and 𝑔 the likelihood. Let 𝑓 : R^𝑑 → R denote the function whose expectation we are interested in computing. For any pdf 𝑝 on R^𝑑 we write
𝑝(𝑓) = E^𝑝[𝑓(𝑢)] = ∫_{R^𝑑} 𝑓(𝑢) 𝑝(𝑢) 𝑑𝑢.    (5.2)
In this chapter we will generalize the concept of pdf to include Dirac mass distributions. A Dirac mass at 𝑣 will be viewed as having pdf 𝛿(· − 𝑣), where 𝛿(·) integrates to one and takes the value zero everywhere except at the origin.
If we have 𝑁 random samples 𝑢^(1), . . . , 𝑢^(𝑁), generated i.i.d. according to 𝜋, then we can estimate 𝜋 by the Monte Carlo estimator 𝜋^𝑁_MC, which is defined as
𝜋^𝑁_MC := (1/𝑁) Σ_{𝑛=1}^{𝑁} 𝛿(𝑢 − 𝑢^(𝑛)).    (5.3)
This gives rise to the following estimator of 𝜋(𝑓):
𝜋^𝑁_MC(𝑓) = (1/𝑁) Σ_{𝑛=1}^{𝑁} 𝑓(𝑢^(𝑛)), 𝑢^(𝑛) ∼ 𝜋 i.i.d.
Theorem 5.1 (Monte Carlo Error). For 𝑓 : R^𝑑 → R denote |𝑓|_∞ := sup_{𝑢∈R^𝑑} |𝑓(𝑢)|. We have
sup_{|𝑓|_∞≤1} | E[ 𝜋^𝑁_MC(𝑓) − 𝜋(𝑓) ] | = 0,
sup_{|𝑓|_∞≤1} | E[ ( 𝜋^𝑁_MC(𝑓) − 𝜋(𝑓) )² ] | ≤ 1/𝑁.
Proof.
Define 𝑓̄(𝑢) = 𝑓(𝑢) − 𝜋(𝑓). To prove the first result, namely that the estimator is unbiased, note that
E[ 𝜋^𝑁_MC(𝑓) − 𝜋(𝑓) ] = (1/𝑁) Σ_{𝑛=1}^{𝑁} E[ 𝑓(𝑢^(𝑛)) − 𝜋(𝑓) ] = (1/𝑁) Σ_{𝑛=1}^{𝑁} ( 𝜋(𝑓) − 𝜋(𝑓) ) = 0.
Therefore the supremum of its absolute value is also zero. For the second result, which bounds the mean squared error of the estimator, we observe that E[𝑓̄] = 0 and, then,
E[ ( 𝜋^𝑁_MC(𝑓) − 𝜋(𝑓) )² ] = (1/𝑁²) Σ_{𝑛=1}^{𝑁} Σ_{𝑚=1}^{𝑁} E[ 𝑓̄(𝑢^(𝑛)) 𝑓̄(𝑢^(𝑚)) ] = (1/𝑁²) Σ_{𝑛=1}^{𝑁} E[ 𝑓̄(𝑢^(𝑛))² ] = (1/𝑁) E[ 𝑓̄(𝑢^(1))² ] = (1/𝑁) Var_𝜋[𝑓],
since the 𝑢^(𝑛) are i.i.d. In particular we have
E[ ( 𝜋^𝑁_MC(𝑓) − 𝜋(𝑓) )² ] = (1/𝑁) Var_𝜋[𝑓] ≤ (1/𝑁) 𝜋(𝑓²)    (5.4)
since Var_𝜋[𝑓] = 𝜋(𝑓²) − 𝜋(𝑓)² ≤ 𝜋(𝑓²). Therefore
sup_{|𝑓|_∞≤1} | E[ ( 𝜋^𝑁_MC(𝑓) − 𝜋(𝑓) )² ] | = sup_{|𝑓|_∞≤1} (1/𝑁) Var_𝜋[𝑓] ≤ 1/𝑁.
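The 1/𝑁 scaling in Theorem 5.1 is easy to observe empirically. The sketch below is illustrative: the bounded test function (tanh) and the target (a standard Gaussian, which can be sampled directly) are assumptions chosen so the exact expectation is known.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    # Monte Carlo estimator pi_MC^N(f) from n i.i.d. samples.
    return np.mean(f(sampler(n)))

f = np.tanh                        # bounded: |f|_inf <= 1
sampler = rng.standard_normal      # pi = N(0, 1)
exact = 0.0                        # tanh is odd, so pi(f) = 0

def mse(n, reps=2000):
    # Mean squared error over repeated independent experiments.
    errs = [mc_estimate(f, sampler, n) - exact for _ in range(reps)]
    return np.mean(np.square(errs))

m100, m1600 = mse(100), mse(1600)
assert m100 < 1 / 100              # the bound E[(pi_MC^N(f) - pi(f))^2] <= 1/N
assert m1600 < 1 / 1600
assert 8 < m100 / m1600 < 32       # error decays like 1/N: ratio near 16
```

The observed ratio of mean squared errors tracks the ratio of sample sizes, independently of any smoothness of 𝑓.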
The theorem shows that the Monte Carlo estimator 𝜋^𝑁_MC is an unbiased approximation of 𝜋 and that, by choosing 𝑁 large enough, the expectation of any bounded function 𝑓 can in principle be approximated by Monte Carlo sampling to arbitrary accuracy. Furthermore, although the convergence is slow with respect to 𝑁, there is no dependence on the dimension of the problem or on the properties of 𝑓, other than its supremum.
Example 5.2 (Approximation of an Integral). Let 𝑓 : R → R be a sigmoid function defined on R and shown in Figure 7(a) below as the blue solid curve. Let 𝑢 ∼ 𝜋, where 𝜋 is an equally weighted mixture of two Gaussian distributions, one of which is 𝒩(5, 1). We wish to approximate the expected value of 𝑓(𝑢) × I_{[𝑎,𝑏]}(𝑢), where
I_{[𝑎,𝑏]}(𝑢) = 1 if 𝑢 ∈ [𝑎, 𝑏], and 0 otherwise.
We use Monte Carlo sampling to generate 𝑁 random samples 𝑢^(1), . . . , 𝑢^(𝑁) and compute the error between the actual integral and the Monte Carlo estimator. The integral and estimator take the form:
𝜋(𝑓) = ∫_𝑎^𝑏 𝑓(𝑢) 𝜋(𝑢) d𝑢,  𝜋^𝑁_MC(𝑓) = (1/𝑁) Σ_{𝑛=1}^{𝑁} 𝑓(𝑢^(𝑛)) I_{[𝑎,𝑏]}(𝑢^(𝑛)).
The results of a set of numerical experiments with fixed 𝑎 < 0, 𝑏 = 5 and varying 𝑁 are shown in Figure 7(b). A randomly chosen subset of the samples used when 𝑁 = 100 is displayed in Figure 7(a). Figure 7
Increasing the number of samples 𝑁 reduces the error of the Monte Carlo estimator.
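A minimal version of Example 5.2 can be sketched as follows. Since the precise mixture parameters and interval endpoints are not reproduced here, the values below (components 𝒩(−3, 1) and 𝒩(3, 1), interval [−5, 5]) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for Example 5.2; the mixture parameters and the
# interval [a, b] are assumed values, not those of the original example.
a, b = -5.0, 5.0
f = lambda u: 1.0 / (1.0 + np.exp(-u))          # a sigmoid

def sample_mixture(n):
    # Equal-weight mixture of N(-3, 1) and N(3, 1).
    comp = rng.integers(0, 2, size=n)
    return rng.normal(np.where(comp == 0, -3.0, 3.0), 1.0)

def mc_restricted(n):
    # Monte Carlo estimate of E[f(u) * I_[a,b](u)] under the mixture.
    u = sample_mixture(n)
    return np.mean(f(u) * ((u >= a) & (u <= b)))

# Reference value from a very large run; the error shrinks like 1/sqrt(N).
ref = mc_restricted(2_000_000)
err_small = abs(mc_restricted(100) - ref)
err_large = abs(mc_restricted(100_000) - ref)
```

Repeating the experiment over a range of 𝑁 reproduces the qualitative behavior of Figure 7(b): the error decays like 𝑁^{−1/2}.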
Standard Monte Carlo sampling can only be used when it is possible to sample from the desired target distribution 𝜋. When it is not possible to sample from 𝜋, we can draw samples from another, proposal, distribution 𝜌 instead. We then need to evaluate 𝑔 in (5.1) at each sample and use it as an importance weight in the approximation of the desired distribution. This is the idea of importance sampling.
Consider 𝜋 as in equation (5.1). Given a test function 𝑓 : R^𝑑 → R we can rewrite its expectation with respect to 𝜋 in terms of expected values with respect to 𝜌 as follows:
𝜋(𝑓) = 𝜌(𝑓𝑔)/𝜌(𝑔),
where we used that 𝑍 = 𝜌(𝑔). Approximating the numerator and the denominator with standard Monte Carlo gives
𝜋(𝑓) ≈ Σ_{𝑛=1}^{𝑁} 𝑤^(𝑛) 𝑓(𝑢^(𝑛)) =: 𝜋^𝑁_IS(𝑓), 𝑢^(𝑛) ∼ 𝜌 i.i.d.,
where
𝑤^(𝑛) := 𝑔(𝑢^(𝑛)) / Σ_{𝑚=1}^{𝑁} 𝑔(𝑢^(𝑚)), 𝜋^𝑁_IS := Σ_{𝑛=1}^{𝑁} 𝑤^(𝑛) 𝛿(𝑢 − 𝑢^(𝑛)).
Thus, given 𝑁 samples 𝑢^(1), . . . , 𝑢^(𝑁) generated i.i.d. according to 𝜌, we can estimate 𝜋 with the particle approximation measure 𝜋^𝑁_IS. The quality of this estimator may be assessed by considering the worst-case bias and mean squared error when using the particle approximation measure to estimate expectations over the class of bounded test functions {𝑓 : R^𝑑 → R : |𝑓|_∞ ≤ 1}. We will show that worst-case error upper-bounds can be obtained in terms of the 𝜒² divergence between the target and the proposal, suggesting that the performance of importance sampling depends on the closeness between target and proposal. Definition 5.3.
Let 𝜋, 𝜋′ be pdfs on R^𝑑. The 𝜒² divergence of 𝜋 with respect to 𝜋′ is
𝑑_{𝜒²}(𝜋‖𝜋′) := ∫_{R^𝑑} ( 𝜋(𝑢)/𝜋′(𝑢) − 1 )² 𝜋′(𝑢) 𝑑𝑢.    (5.5)
The 𝜒² is a divergence but not a distance, as it is, in general, not symmetric. The next result shows that, similarly to standard Monte Carlo, the mean squared error of 𝜋^𝑁_IS(𝑓) as an estimator of 𝜋(𝑓) is of order 𝑁⁻¹. However, there are two main differences: the estimator is now biased, and the constant in the mean squared error depends on the 𝜒² divergence between the target and the proposal.
Theorem 5.4 (Importance Sampling Error). We have
sup_{|𝑓|_∞≤1} | E[ 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) ] | ≤ (2/𝑁)( 𝑑_{𝜒²}(𝜋‖𝜌) + 1 ),
sup_{|𝑓|_∞≤1} | E[ ( 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) )² ] | ≤ (4/𝑁)( 𝑑_{𝜒²}(𝜋‖𝜌) + 1 ).
Proof.
Given 𝜋(𝑢) = (1/𝑍)𝑔(𝑢)𝜌(𝑢) = (1/𝜌(𝑔))𝑔(𝑢)𝜌(𝑢), we have that
𝑑_{𝜒²}(𝜋‖𝜌) = 𝜌(𝑔²)/𝜌(𝑔)² − 1.
To ease the notation we denote 𝜁 := 𝜌(𝑔²)/𝜌(𝑔)², so that 𝜁 = 𝑑_{𝜒²}(𝜋‖𝜌) + 1. We rewrite
𝜋(𝑓) = 𝜌(𝑔𝑓)/𝜌(𝑔) ≃ 𝜌^𝑁_MC(𝑔𝑓)/𝜌^𝑁_MC(𝑔) = 𝜋^𝑁_IS(𝑓).
Then we have
𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) = 𝜋^𝑁_IS(𝑓) − 𝜌(𝑔𝑓)/𝜌(𝑔) = 𝜋^𝑁_IS(𝑓)( 𝜌(𝑔) − 𝜌^𝑁_MC(𝑔) )/𝜌(𝑔) − ( 𝜌(𝑔𝑓) − 𝜌^𝑁_MC(𝑔𝑓) )/𝜌(𝑔).    (5.6)
The expectation of the second term is zero and hence
| E[ 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) ] | = (1/𝜌(𝑔)) | E[ 𝜋^𝑁_IS(𝑓)( 𝜌(𝑔) − 𝜌^𝑁_MC(𝑔) ) ] | = (1/𝜌(𝑔)) | E[ ( 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) )( 𝜌(𝑔) − 𝜌^𝑁_MC(𝑔) ) ] |,
since E[ 𝜌(𝑔) − 𝜌^𝑁_MC(𝑔) ] = 0. Using the Cauchy-Schwarz inequality, the second result of this theorem (whose proof follows) and (5.4) from the proof of Theorem 5.1, we have, for all |𝑓|_∞ ≤ 1,
| E[ 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) ] | ≤ (1/𝜌(𝑔)) ( E[ ( 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) )² ] )^{1/2} ( E[ ( 𝜌(𝑔) − 𝜌^𝑁_MC(𝑔) )² ] )^{1/2} ≤ (1/𝜌(𝑔)) ( 4𝜁/𝑁 )^{1/2} ( 𝜌(𝑔²)/𝑁 )^{1/2} = 2𝜁/𝑁.
We now prove the second result. We use the splitting of 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) into the sum of two terms as derived in equation (5.6). Using (5.4), the basic inequality (𝑎 − 𝑏)² ≤ 2(𝑎² + 𝑏²), and that |𝜋^𝑁_IS(𝑓)| ≤ 1 for all |𝑓|_∞ ≤ 1, we have, for all |𝑓|_∞ ≤ 1,
E[ ( 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) )² ] ≤ (2/𝜌(𝑔)²)( E[ ( 𝜋^𝑁_IS(𝑓) )²( 𝜌(𝑔) − 𝜌^𝑁_MC(𝑔) )² ] + E[ ( 𝜌(𝑔𝑓) − 𝜌^𝑁_MC(𝑔𝑓) )² ] )
≤ (2/𝜌(𝑔)²)( E[ ( 𝜌(𝑔) − 𝜌^𝑁_MC(𝑔) )² ] + E[ ( 𝜌(𝑔𝑓) − 𝜌^𝑁_MC(𝑔𝑓) )² ] )
= (2/(𝜌(𝑔)²𝑁))( Var_𝜌[𝑔] + Var_𝜌[𝑔𝑓] )
≤ (2/(𝜌(𝑔)²𝑁))( 𝜌(𝑔²) + 𝜌(𝑔²𝑓²) )
≤ 4𝜌(𝑔²)/(𝜌(𝑔)²𝑁) = 4𝜁/𝑁.
Therefore, sup_{|𝑓|_∞≤1} | E[ ( 𝜋^𝑁_IS(𝑓) − 𝜋(𝑓) )² ] | ≤ 4𝜁/𝑁.
This theorem shows that, unlike the Monte Carlo estimator, the importance sampling estimator 𝜋^𝑁_IS is biased for 𝜋. The rate of convergence of the bias is, however, twice that of the standard deviation. As for Monte Carlo, the rate of convergence of the mean squared error is governed by the inverse of 𝑁, and is independent of the dimension 𝑑 of the state space. However, for importance sampling to be accurate with a limited number of samples 𝑁, it is important that the 𝜒² divergence between the target and the proposal is not too large.
Example 5.5 (Change of Measure). We consider a similar set-up as in Example 5.2, integrating a sigmoid function, shown in blue in Figure 8, with respect to a bimodal probability measure 𝜋, shown in red in Figure 8; we again restrict the support of the desired integral. We estimate the integral using importance sampling based on 𝑁 random samples 𝑢^(1), . . . , 𝑢^(𝑁) from the measure 𝜌 = 𝒩(𝜇, 𝜎²), shown in green in Figure 8. The estimator of the integral is given by
𝜋^𝑁_IS(𝑓) = Σ_{𝑛=1}^{𝑁} 𝑤^(𝑛) 𝑓(𝑢^(𝑛)) I_{[𝑎,𝑏]}(𝑢^(𝑛)), 𝑤^(𝑛) = 𝑔(𝑢^(𝑛)) / Σ_{𝑚=1}^{𝑁} 𝑔(𝑢^(𝑚)).
Here 𝑔 is a function proportional to the ratio of the densities of 𝜋 and 𝜌. If 𝜋(𝑢^(𝑛)) > 𝜌(𝑢^(𝑛)), the samples should have been denser, so we raise the weight on 𝑓(𝑢^(𝑛)) in proportion to 𝜋(𝑢^(𝑛))/𝜌(𝑢^(𝑛)) > 1. If 𝜋(𝑢^(𝑛)) < 𝜌(𝑢^(𝑛)), the samples should have been less dense, so we lower the weight on 𝑓(𝑢^(𝑛)) in proportion to 𝜋(𝑢^(𝑛))/𝜌(𝑢^(𝑛)) < 1.
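The self-normalized importance sampling estimator described above can be sketched in a few lines. The proposal and the unnormalized likelihood 𝑔 below are illustrative assumptions, chosen by Gaussian conjugacy so that the target is a Gaussian with a known mean, which serves as a check on the estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Self-normalized importance sampling for pi(u) ∝ g(u) rho(u).
# All specific densities here are illustrative assumptions.
def rho_sample(n):
    return rng.normal(0.0, 2.0, size=n)          # proposal rho = N(0, 4)

def g(u):
    # Unnormalized likelihood g(u) = exp(-2 (u - 1)^2).
    return np.exp(-2.0 * (u - 1.0) ** 2)

def is_estimate(f, n):
    u = rho_sample(n)
    w = g(u)
    w = w / np.sum(w)                            # normalized importance weights
    return np.sum(w * f(u))                      # pi_IS^N(f)

# By Gaussian conjugacy, pi = N(mu, s2) with precision 2*2 + 1/4 and
# mean mu = s2 * (2*2*1), so the exact posterior mean is known.
s2 = 1.0 / (4.0 + 0.25)
mu = 4.0 * s2                                    # = 16/17
est = is_estimate(lambda u: u, 200_000)
```

Since the proposal overlaps the target well here, the 𝜒² divergence is moderate and `est` lands close to the exact mean `mu`; a poorly overlapping proposal would need far more samples, as Theorem 5.4 suggests.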
Figure 8
Importance sampling performs a change of measure via the importance weights. The red curve shows a bimodal distribution 𝜋 and the green curve shows a Gaussian distribution 𝜌. The blue curve is the function to be integrated, restricted to its support [𝑎, 𝑏]. The upper figure shows samples from the posterior 𝜋 itself; these would be used for Monte Carlo sampling. The lower figure shows samples from the prior 𝜌, as used for importance sampling. The importance weights capture and compensate for the difference between sampling from these two distributions.
A classic reference on the Monte Carlo method, which includes discussion of several variance-reduction techniques, is [52]. The chapter notes [7] give a comparison of Monte Carlo and importance sampling with examples. The paper [68] further explores advanced importance sampling via adaptive algorithms. When 𝑁 is large enough, the measure 𝜋^𝑁_MC arising from a Monte Carlo simulation should be close to 𝜋 [36]. In practice, all probabilities, integrals and summations can be approximated by the Monte Carlo method [110]. A review of importance sampling, from the perspective of filtering and sequential importance resampling, may be found in [4]; the proofs in this chapter closely follow the presentation in that paper. Necessary sample size results for importance sampling in terms of several divergences between target and proposal were established in [100]. The subject of multilevel Monte Carlo (MLMC) has made the use of Monte Carlo methods practical in new areas of application; see [47] for an overview. The methodology applies when approximating expectations over infinite dimensional spaces, and distributes the computational budget over different levels of approximation, with the goal of optimizing the cost per unit error, noting that the latter balances sampling- and approximation-based sources of error.
In this chapter we study Markov chain Monte Carlo (MCMC), a methodology that allows one to obtain approximate samples from a given target distribution 𝜋.
As with Monte Carlo and importance sampling, MCMC may be viewed as approximating the target distribution by a sum of Dirac measures, thus allowing one to approximate expectations with respect to the target. Implementation of Monte Carlo presupposes that independent samples from the target can be obtained. Importance sampling and MCMC bypass this restrictive assumption: importance sampling by appropriately weighting independent samples from a proposal distribution, and MCMC by drawing correlated samples from a Markov kernel that has the target as its invariant distribution. After some discussion of the general MCMC methodology, we will specialize to the case where 𝜋 is a posterior distribution given via Bayes theorem from the product of the likelihood function and the prior distribution. In this context, we will analyze the convergence of the pCN algorithm, a popular MCMC method for computing posterior expectations in high dimensional inverse problems which uses these two ingredients, the prior and the likelihood, as part of its design.
The idea of MCMC is that, given a target distribution 𝜋, it is possible to construct a Markov kernel that can be sampled from and has 𝜋 as its invariant distribution (a formal definition of the invariance property will be given below). Samples {𝑢^(𝑛)}_{𝑛=1}^{𝑁} drawn iteratively from the kernel may be used to approximate posterior expectations:
𝜋(𝑓) ≈ (1/𝑁) Σ_{𝑛=1}^{𝑁} 𝑓(𝑢^(𝑛)).
The samples are given uniform weights 1/𝑁 but, in contrast to standard Monte Carlo, they are not independent and they are not drawn from the target 𝜋. However, if the chain is guaranteed to satisfy a suitable form of sample path ergodicity, then the estimator on the right-hand side is asymptotically unbiased for 𝜋(𝑓) and satisfies a central limit theorem for suitable test functions 𝑓.
Addressing the design and analysis of MCMC methods in generality and depth is beyond the scope of a single chapter; entire books are devoted to this subject.
We will restrict our discussion to a particular class of MCMC methods, known as Metropolis-Hastings algorithms. We will prove that the Metropolis-Hastings kernel is invariant with respect to the desired target distribution, and we will show geometric ergodicity of the pCN Metropolis-Hastings algorithm, meaning that the distribution 𝜋_𝑛 of the 𝑛-th sample approaches the invariant distribution exponentially fast in total variation distance. The idea is illustrated in Figure 9: after an initial number of burn-in steps, the samples from the chain start to concentrate in regions where the target distribution has greatest mass. We will not discuss sample path ergodicity, noting simply that a general abstract theory exists to deduce it from geometric ergodicity. Figure 9
The Markov chain samples points from distribution 𝜋_𝑛 at step 𝑛, and the sampling distribution converges towards the target distribution 𝜋, whose high density regions are represented by the dashed circles.
Here we outline the Metropolis-Hastings algorithm. The algorithm has two ingredients: a proposal distribution 𝑞(𝑢, 𝑣), which is a Markov transition kernel; and an acceptance probability 𝑎(𝑢, 𝑣) that will be used to convert the proposal kernel into a kernel 𝑝_MH(𝑢, 𝑣) that is invariant with respect to the given target 𝜋. Given the 𝑛th sample 𝑢^(𝑛), we generate 𝑢^(𝑛+1) by drawing 𝑣⋆ from the distribution 𝑞(𝑢^(𝑛), ·) and accepting the result, which means setting 𝑢^(𝑛+1) = 𝑣⋆, with probability 𝑎(𝑢^(𝑛), 𝑣⋆), or instead setting 𝑢^(𝑛+1) = 𝑢^(𝑛) with the remaining probability 1 − 𝑎(𝑢^(𝑛), 𝑣⋆). The acceptance probability is given by
𝑎(𝑢, 𝑣) = min( 𝜋(𝑣)𝑞(𝑣, 𝑢) / (𝜋(𝑢)𝑞(𝑢, 𝑣)), 1 ).    (6.1)
Algorithm 6.1 makes precise the steps just described: Algorithm 6.1
Metropolis-Hastings Algorithm
Input: Target distribution 𝜋, initial distribution 𝜋₀, Markov kernel 𝑞(𝑢, 𝑣), number of samples 𝑁.
Initial Draw: Draw initial sample 𝑢^(0) ∼ 𝜋₀.
Subsequent Samples: For 𝑛 = 0, 1, . . . , 𝑁 −
1, perform the following steps:1. Sample 𝑣 ⋆ ∼ 𝑞 ( 𝑢 ( 𝑛 ) , · ) .
2. Calculate the acceptance probability 𝑎 𝑛 := 𝑎 ( 𝑢 ( 𝑛 ) , 𝑣 ⋆ ) .
3. Update 𝑢 ( 𝑛 +1) = {︃ 𝑣 ⋆ , w.p. 𝑎 𝑛 ,𝑢 ( 𝑛 ) , otherwise . Output
Samples 𝑢^(1), 𝑢^(2), . . . , 𝑢^(𝑁).
The Metropolis-Hastings algorithm implicitly defines a Markov kernel 𝑝_MH(𝑢, ·), which specifies the density of the (𝑛+1)th sample given that the 𝑛th sample is located at 𝑢. For 𝑢 ≠ 𝑣, the Metropolis-Hastings kernel has the following simple expression in terms of the proposal kernel and the acceptance probability:
𝑝_MH(𝑢, 𝑣) = 𝑎(𝑢, 𝑣)𝑞(𝑢, 𝑣);    (6.2)
this expression may be deduced by noting that in order to move to a new location, the proposal needs to be accepted. Remark 6.1.
∙ In order to implement the Metropolis-Hastings algorithm one needs to be able to sample from the proposal kernel 𝑞(𝑢, ·) and evaluate the acceptance probability 𝑎(𝑢, 𝑣). Importantly, the target distribution only appears in the acceptance probability 𝑎(𝑢, 𝑣), and only through the ratio 𝜋(𝑣)/𝜋(𝑢). Therefore the Metropolis-Hastings algorithm may be implemented for target distributions that are only specified up to an unknown normalizing constant.
∙ If 𝑞(𝑢, 𝑣) = 𝑞(𝑣, 𝑢), the acceptance probability simplifies to min( 𝜋(𝑣)/𝜋(𝑢), 1 ). In such a case, moves to regions of higher target density are always accepted, while moves to regions of smaller but non-zero target density are accepted with positive probability in order to ensure exploration of the target space. If 𝑞 is not symmetric, the method favors moves that are easier to reverse, namely moves for which 𝑞(𝑣, 𝑢) > 𝑞(𝑢, 𝑣).
∙ The Metropolis-Hastings algorithm is extremely flexible due to the freedom in the choice of proposal kernel 𝑞(𝑢, 𝑣). The ergodic behavior of the algorithm is heavily dependent on this choice.
∙ The accept-reject step may be implemented by drawing, independently from the proposal, a uniformly distributed random variable 𝜃_𝑛 in the interval [0, 1]. If 𝜃_𝑛 ∈ [0, 𝑎_𝑛) then the proposal is accepted (𝑢^(𝑛+1) = 𝑣⋆); it is rejected (𝑢^(𝑛+1) = 𝑢^(𝑛)) otherwise.
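A minimal random-walk Metropolis implementation of Algorithm 6.1 follows, using the symmetric-proposal simplification from Remark 6.1. The target and step size are illustrative assumptions; note that, as the remark emphasizes, the chain only ever sees the target up to its normalizing constant.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random-walk Metropolis: symmetric proposal q(u, .) = N(u, step^2),
# so the acceptance probability reduces to min(pi(v)/pi(u), 1).
def unnormalized_pi(u):
    # Illustrative target pi = N(2, 1), passed without its constant Z.
    return np.exp(-0.5 * (u - 2.0) ** 2)

def metropolis_hastings(n_samples, step=1.0, u0=0.0):
    samples = np.empty(n_samples)
    u = u0
    for n in range(n_samples):
        v = u + step * rng.standard_normal()          # 1. propose v* ~ q(u, .)
        a = min(unnormalized_pi(v) / unnormalized_pi(u), 1.0)  # 2. acceptance prob.
        if rng.uniform() < a:                         # 3. accept-reject step
            u = v
        samples[n] = u
    return samples

samples = metropolis_hastings(50_000)
burned = samples[1000:]                               # discard burn-in steps
```

The retained samples are correlated, but their empirical mean and variance approach those of the target, illustrating the ergodic behavior analyzed in the following sections.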
Invariance with Respect to 𝜋
In this section we show that the Metropolis-Hastings kernel is invariant with respect to the target 𝜋. We start by introducing formal definitions of detailed balance and invariance of a Markov kernel, and showing that the former implies the latter. We then prove that the Metropolis-Hastings kernel satisfies detailed balance with respect to 𝜋, and hence that it is invariant. The following definition makes precise the notions of detailed balance and of an invariant distribution of a Markov kernel.
Definition 6.2.
A Markov kernel 𝑝(𝑢, 𝑣) satisfies detailed balance with respect to 𝜋 if, for any 𝑢, 𝑣 ∈ R^𝑑,
𝜋(𝑢)𝑝(𝑢, 𝑣) = 𝜋(𝑣)𝑝(𝑣, 𝑢).
We say that 𝜋 is an invariant distribution of the Markov kernel 𝑝(𝑢, 𝑣) if, for any 𝑣 ∈ R^𝑑,
∫_{R^𝑑} 𝜋(𝑢)𝑝(𝑢, 𝑣) 𝑑𝑢 = 𝜋(𝑣).    (6.3)
Detailed balance of 𝑝(𝑢, 𝑣) with respect to 𝜋 implies that 𝜋 is invariant for 𝑝(𝑢, 𝑣). To see this, note that if 𝑝(𝑢, 𝑣) satisfies detailed balance with respect to 𝜋 then
∫_{R^𝑑} 𝜋(𝑢)𝑝(𝑢, 𝑣) 𝑑𝑢 = 𝜋(𝑣) ∫_{R^𝑑} 𝑝(𝑣, 𝑢) 𝑑𝑢 = 𝜋(𝑣).
Invariance guarantees that, if the chain is distributed according to 𝜋 at a given step, then it will also be distributed according to 𝜋 at the following step. Detailed balance guarantees that the probability flow between any two states is balanced, which is a stronger condition. The concept of detailed balance is illustrated through the flow diagram shown in Figure 10. Under detailed balance the flow to and from any node must be the same, so the density at each node remains constant at each step, as shown in Figure 10(b); Figure 10(a) fails to satisfy this. Consider a generic Markov kernel 𝑞(𝑢, 𝑣), as shown in Figure 10(a), which does not satisfy detailed balance, to which we apply the Metropolis accept-reject mechanism. Intuitively, the resulting Metropolis-Hastings kernel limits the "flow" between nodes in such a way that detailed balance is satisfied with respect to the distribution 𝜋, as shown in Figure 10(b). There the red-highlighted numbers represent where the transition probabilities of the kernel have been reduced. The following theorem establishes the detailed balance of the Metropolis-Hastings kernel with respect to the target 𝜋; it implies, as a consequence, that the Metropolis-Hastings kernel is invariant with respect to the target.
Theorem 6.3 (Metropolis-Hastings and Detailed Balance). The Metropolis-Hastings kernel satisfies detailed balance with respect to the distribution 𝜋.
Proof.
We need to show that, for any 𝑢, 𝑣 ∈ R^𝑑,
𝜋(𝑢)𝑝_MH(𝑢, 𝑣) = 𝜋(𝑣)𝑝_MH(𝑣, 𝑢).
If 𝑢 = 𝑣 the equality is trivial. Let us assume henceforth that 𝑢 ≠ 𝑣, so that by equation (6.2) we have 𝑝_MH(𝑢, 𝑣) = 𝑎(𝑢, 𝑣)𝑞(𝑢, 𝑣). We rewrite
𝑝_MH(𝑢, 𝑣) = min( 𝜋(𝑣)𝑞(𝑣, 𝑢) / (𝜋(𝑢)𝑞(𝑢, 𝑣)), 1 ) 𝑞(𝑢, 𝑣) = (1/𝜋(𝑢)) min( 𝜋(𝑢)𝑞(𝑢, 𝑣), 𝜋(𝑣)𝑞(𝑣, 𝑢) ).
Thus, invoking the symmetry of the min expression,
𝜋(𝑢)𝑝_MH(𝑢, 𝑣) = min( 𝜋(𝑢)𝑞(𝑢, 𝑣), 𝜋(𝑣)𝑞(𝑣, 𝑢) ) = 𝜋(𝑣)𝑝_MH(𝑣, 𝑢).
The invariance of the Metropolis-Hastings kernel 𝑝_MH with respect to 𝜋 implies that if the initial sample is drawn from the target (𝜋₀ = 𝜋), then all subsequent samples are also distributed according to the target (𝜋_𝑛 = 𝜋). Figure 10
Toy representation of MCMC with 3 states (𝑢, 𝑣, 𝑤). The numbers associated with each transition represent an example case. (a) The top chain represents a Markov chain with Markov kernel 𝑞 which does not satisfy detailed balance with respect to 𝜋. (b) The bottom chain utilizes the Markov kernel 𝑝, which satisfies detailed balance by modifying the transition probabilities of 𝑞.
In the previous section we showed that if we initialize the Metropolis-Hastings algorithm with distribution 𝜋, all samples produced by the algorithm will be distributed according to 𝜋. But our original problem was exactly that we were not able to sample from 𝜋. Our aim in this section is to show that, for certain Metropolis-Hastings methods, the law 𝜋_𝑛 of the 𝑛-th sample converges to 𝜋 regardless of the initial distribution 𝜋₀. This is a strong form of ergodic behavior which does not hold in general, as illustrated by the chain depicted in Figure 11. In order to understand the mechanisms behind ergodicity we will first consider Markov chains with finite state space. We then study a specific Metropolis-Hastings algorithm, known as the preconditioned Crank-Nicolson (pCN) method, which applies to targets 𝜋 defined by their density with respect to a Gaussian distribution.
We consider a Markov chain on the finite state space 𝑆 = {1, . . . , 𝑑}. We illustrate a coupling approach to proving ergodicity and then, in the next subsection, use it to study the pCN method on a continuous state space.
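For a finite state space, the detailed balance property of the Metropolis-Hastings kernel can be verified directly, in the spirit of Figure 10. In the sketch below the target 𝜋 and proposal kernel 𝑞 are arbitrary illustrative choices; the code metropolizes 𝑞 via (6.1)-(6.2) and checks both detailed balance and invariance.

```python
import numpy as np

# A 3-state toy chain: metropolize an arbitrary proposal kernel q so that
# the resulting kernel p satisfies detailed balance with respect to pi.
pi = np.array([0.2, 0.3, 0.5])
q = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])      # rows sum to one; chosen arbitrarily

d = len(pi)
p = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        if i != j:
            a = min(pi[j] * q[j, i] / (pi[i] * q[i, j]), 1.0)  # eq. (6.1)
            p[i, j] = a * q[i, j]                              # eq. (6.2)
    p[i, i] = 1.0 - p[i].sum()       # rejected proposals keep the chain at i

# Detailed balance pi_i p_ij = pi_j p_ji, hence pi is invariant: pi p = pi.
assert np.allclose(pi[:, None] * p, (pi[:, None] * p).T)
assert np.allclose(pi @ p, pi)
```

The matrix `pi[:, None] * p` holds the probability flows between states; symmetrizing it is exactly the "flow balancing" role the accept-reject step plays in Figure 10(b).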
Theorem 6.4 (Ergodicity in Finite State Spaces). Let $u_n$ be a Markov chain with state space $S$, transition kernel $p$, and initial distribution $\pi_0$. Assume that
\[
\varepsilon := \min_{i,j \in S} p(i,j) > 0. \tag{6.4}
\]

Figure 11
The arrows represent transitions with probability one in a four-state Markov chain. The invariant distribution is the uniform distribution, but for $\pi_0 = \delta_A$ we do not have ergodicity.

Then $p$ has a unique invariant distribution $\pi$, and the following convergence result holds:
\[
d_{\mathrm{TV}}(\pi_n, \pi) \le (1 - \varepsilon)^n, \tag{6.5}
\]
where $\pi_n$ is the law of $u_n$.

Proof.
First note that the Markov kernel $p$ maps a probability distribution on $S$ into another probability distribution on $S$; it thus maps a compact, convex set into itself. By Brouwer's fixed point theorem it follows that $p$ has a fixed point in this space, ensuring that an invariant distribution exists. We will now show that equation (6.5) holds for any invariant distribution $\pi$, which then implies the uniqueness of the invariant distribution.

Let $\pi$ be an invariant distribution, a probability vector on $S$. Proving convergence to equilibrium amounts to "forgetting the past": showing that the long-time behavior of the Markov chain does not depend on the initial distribution $\pi_0$ and in fact converges to $\pi$. In general, $u_{n+1}$ will be strongly dependent on $u_n$, but condition (6.4) implies that there is always some residual chance that the chain jumps to any new state, at each step, independently of its current location $u_n$. We will show that this residual probability of the chain making a "totally random" move diminishes the stochastic dependence on $u_0$ as $n$ increases.

To formalize this idea, let $b_n$ be i.i.d. Bernoulli random variables with $\mathbb{P}(b_n = 1) = \varepsilon$ and $\mathbb{P}(b_n = 0) = 1 - \varepsilon$; furthermore assume that the sequence $\{b_n\}$ is independent of the randomness defining draws from $\{p(u_n, \cdot)\}$. Because of the lower bound on $p$ we may define a new Markov chain as follows:
\[
w_{n+1} \sim \begin{cases} s(w_n, \cdot), & \text{for } b_n = 0, \\ r(w_n, \cdot), & \text{for } b_n = 1, \end{cases} \tag{6.6}
\]
where
\[
s(i,j) := \frac{p(i,j) - \varepsilon\, r(i,j)}{1 - \varepsilon}
\]
and $r$ is the uniform transition kernel, with equal probability of transitioning to each state in $S$: $r(i,j) = d^{-1}$ for all $(i,j) \in S \times S$. We may now compute
\[
\mathbb{P}(w_{n+1} = j \mid w_n = i) = \varepsilon\, \mathbb{P}(w_{n+1} = j \mid w_n = i, b_n = 1) + (1 - \varepsilon)\, \mathbb{P}(w_{n+1} = j \mid w_n = i, b_n = 0)
= \varepsilon\, r(i,j) + p(i,j) - \varepsilon\, r(i,j) = p(i,j).
\]
Thus the kernel defined by (6.6) is equivalent in law to $p$.
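As an aside, the geometric bound (6.5) is easy to check numerically. The sketch below (illustrative, not from the notes) builds a random strictly positive transition kernel on a small state space, recovers its invariant distribution from the leading left eigenvector, and verifies that $d_{\mathrm{TV}}(\pi_n, \pi) \le (1-\varepsilon)^n$ along the chain.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
d = 4
P = rng.uniform(0.1, 1.0, size=(d, d))    # strictly positive kernel
P /= P.sum(axis=1, keepdims=True)         # rows sum to one
eps = P.min()                             # epsilon in condition (6.4)

# invariant distribution: left eigenvector of P with eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

pi_n = np.zeros(d)
pi_n[0] = 1.0                             # start from a point mass
for n in range(1, 30):
    pi_n = pi_n @ P                       # one step of the Markov chain
    assert tv_distance(pi_n, pi) <= (1 - eps) ** n + 1e-12
```

In practice the observed decay is usually much faster than the worst-case rate $(1-\varepsilon)^n$; the theorem only guarantees an upper bound.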
However, by introducing the ancillary random variables $b_n$, we have made explicit the concept of "forgetting the past entirely, with a small probability" at every step. We may now use this to complete the proof. Let $f : S \to \mathbb{R}$ be an arbitrary test function with $|f|_\infty \le 1$, and define $\tau := \min\{n \in \mathbb{N} : b_n = 1\}$. Then, regardless of how $w_0$ is initialized,
\[
\mathbb{E}[f(w_n)] = \underbrace{\mathbb{E}[f(w_n) \mid \tau \ge n]\, \mathbb{P}(\tau \ge n)}_{|\cdot| \le (1-\varepsilon)^n} + \underbrace{\sum_{l=0}^{n-1} \mathbb{E}_{u \sim \mathsf{u}(\cdot)}[f(w_{n-l})]\, \mathbb{P}(\tau = l)}_{\text{independent of original initial distribution}},
\]
where $\mathsf{u}$ denotes the uniform distribution on $S$.

Now consider two Markov chains $w_n$ and $w'_n$ with kernel (6.6), the first initialized from $\pi_0$ and the second from an invariant distribution $\pi$. The law of $w_n$ agrees with the law $\pi_n$ of the original chain $u_n$, while for the second chain it follows from invariance that $\pi'_n = \pi$. We will use the variational characterization of the total variation distance established in Lemma 1.10. Employing the preceding identity, and noting that the contribution which is independent of the initial distribution cancels between the two Markov chains, we obtain
\[
d_{\mathrm{TV}}(\pi_n, \pi'_n) = \frac{1}{2} \sup_{|f|_\infty \le 1} \Bigl| \mathbb{E}^{\pi_n}[f(u)] - \mathbb{E}^{\pi'_n}[f(u)] \Bigr| \le (1-\varepsilon)^n.
\]
Since $\pi'_n = \pi$ the desired result follows.

Before extending the above argument to a setting with continuous state space, we make two remarks:

Remark 6.5.
The coupling proof we have just exhibited may be generalized in a number of ways; in particular:

∙ The distribution $r$ need not be uniform; it was only chosen so for convenience. What is important is that $r(i,j)$ is lower bounded, independently of $i$, for all $j$. Adapting $r$ to the $p$ at hand might in some cases greatly improve the above bound: a larger $\varepsilon$ might be identified.

∙ Convergence to equilibrium can also be shown if condition (6.4) holds with $p$ replaced by the $n$-step transition kernel $p^n$. Again, for some chains this may yield sharper bounds on the convergence to equilibrium.

The coupling argument used in the previous subsection for Markov chains with finite state space may also be employed to study ergodicity of Markov chains on a continuous state space. To illustrate this we consider a particular Metropolis-Hastings algorithm, the pCN method, applied to a specific Bayesian inverse problem. Before we get into the details of the inverse problem we describe the idea behind the pCN method at a high level. The idea is this: if the desired target distribution has the form
\[
\pi(u) = \frac{1}{Z}\, \tilde{g}(u)\, \tilde{\rho}(u) \tag{6.7}
\]
and if the Metropolis-Hastings proposal kernel $q$ satisfies detailed balance with respect to $\tilde{\rho}$, then (6.1) simplifies to give
\[
a(u,v) = \min\Bigl( \frac{\tilde{\rho}(v)\, \tilde{g}(v)\, q(v,u)}{\tilde{\rho}(u)\, \tilde{g}(u)\, q(u,v)},\, 1 \Bigr) = \min\Bigl( \frac{\tilde{g}(v)}{\tilde{g}(u)},\, 1 \Bigr). \tag{6.8}
\]
We will apply and study this idea in the case where $\tilde{\rho}$ is a Gaussian distribution, in which case it is straightforward to construct a proposal kernel that satisfies detailed balance with respect to $\tilde{\rho}$. This scenario arises naturally in Bayesian inverse problems where the prior is defined in terms of a Gaussian. We now formalize the inverse problem setting that we consider by imposing certain assumptions on the likelihood and the prior, and then relate both to the functions $\tilde{g}$ and $\tilde{\rho}$ in equation (6.7).

Assumption 6.6.
We make the following assumptions on the Bayesian inverse problem:

∙ Bounded likelihood: there are $g_-, g_+ > 0$ such that, for all $u \in \mathbb{R}^d$, $0 < g_- < g(u) < g_+$.

∙ Truncated Gaussian prior: there is a compact set $B \subset \mathbb{R}^d$ with positive Lebesgue measure such that $\rho(u) \propto \mathbf{1}_B(u)\, z(u)$, where $z$ is the density of $\mathcal{N}(0, \widehat{C})$.

Under Assumption 6.6 we obtain for the posterior density
\[
\pi(u) \propto g(u)\, \mathbf{1}_B(u)\, z(u),
\]
which is of the form (6.7) with $\tilde{g}(u) = g(u)\, \mathbf{1}_B(u)$ and $\tilde{\rho}(u) = z(u)$.

The pCN method has a proposal kernel which is invariant with respect to the Gaussian $z = \mathcal{N}(0, \widehat{C})$, defined by
\[
q(u, \cdot) = \mathcal{N}\bigl( (1-\beta^2)^{1/2} u,\, \beta^2 \widehat{C} \bigr). \tag{6.9}
\]
Thus, given the sample $u^{(n)}$, pCN proposes a new sample
\[
v^\star = (1-\beta^2)^{1/2} u^{(n)} + \beta\, \xi^{(n)}, \qquad \xi^{(n)} \sim \mathcal{N}(0, \widehat{C}),
\]
which only requires sampling a Gaussian.

Theorem 6.7 (Ergodicity for pCN Method). Assume that we apply the pCN method to sample from a posterior density $\pi$ arising from Assumption 6.6, with initial condition drawn from any density supported on $B$. Then there exists a constant $\varepsilon \in (0,1)$ such that
\[
d_{\mathrm{TV}}(\pi_n, \pi) \le (1-\varepsilon)^n,
\]
where $\pi_n$ is the law of the $n$-th sample of the pCN Metropolis-Hastings algorithm.

Recall the notation for the covariance-weighted inner product and resulting norm described in the introduction to these notes. Before proving the above theorem, we will prove an auxiliary lemma.
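As an illustration of the proposal (6.9) combined with the simplified acceptance probability (6.8), here is a minimal sketch (illustrative, not from the notes; `log_g`, the membership test `in_B`, and `C_hat` are user-supplied placeholders) of a pCN chain for a target $\pi(u) \propto g(u)\,\mathbf{1}_B(u)\,z(u)$:

```python
import numpy as np

def pcn(log_g, in_B, C_hat, u0, n_samples, beta=0.2, rng=None):
    """pCN sampler for pi(u) ∝ g(u) 1_B(u) z(u), z = N(0, C_hat).
    The proposal v = sqrt(1 - beta^2) u + beta xi, xi ~ N(0, C_hat),
    is reversible w.r.t. z, so a(u, v) = min(g(v) 1_B(v) / g(u), 1)."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(C_hat)            # draw xi ~ N(0, C_hat) via Cholesky
    u = np.asarray(u0, dtype=float)
    samples = np.empty((n_samples, u.size))
    for n in range(n_samples):
        xi = L @ rng.standard_normal(u.size)
        v = np.sqrt(1.0 - beta**2) * u + beta * xi
        # proposals leaving B are rejected; otherwise accept w.p. min(g(v)/g(u), 1)
        if in_B(v) and np.log(rng.uniform()) < log_g(v) - log_g(u):
            u = v
        samples[n] = u
    return samples
```

Note that if $u_0 \in B$ the whole chain remains in $B$, since any move out of $B$ is rejected; this fact is used in the proof of Theorem 6.7.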
Lemma 6.8.
Let the assumptions of Theorem 6.7 hold. The proposal kernel (6.9) satisfies detailed balance with respect to the Gaussian $z = \mathcal{N}(0, \widehat{C})$, and we have the following expression for the pCN accept-reject probability:
\[
a(u,v) = \min\Bigl( \frac{g(v)}{g(u)}\, \mathbf{1}_B(v),\, 1 \Bigr).
\]

Proof.
We need to show that $z(v)\, q(v,u)$ is symmetric in $u$ and $v$. By direct calculation, up to additive constants independent of $u$ and $v$,
\[
-\log\bigl( z(v)\, q(v,u) \bigr) = \frac{1}{2} |v|^2_{\widehat{C}} + \frac{1}{2\beta^2} \Bigl| u - (1-\beta^2)^{1/2} v \Bigr|^2_{\widehat{C}}
= \Bigl( \frac{1}{2} + \frac{1-\beta^2}{2\beta^2} \Bigr) |v|^2_{\widehat{C}} + \frac{1}{2\beta^2} |u|^2_{\widehat{C}} - \frac{(1-\beta^2)^{1/2}}{\beta^2} \langle u, v \rangle_{\widehat{C}}
= \frac{1}{2\beta^2} \bigl( |v|^2_{\widehat{C}} + |u|^2_{\widehat{C}} \bigr) - \frac{(1-\beta^2)^{1/2}}{\beta^2} \langle u, v \rangle_{\widehat{C}},
\]
which is indeed symmetric in $u$ and $v$. For the expression for the acceptance probability, note that since $u_0 \in B$ we have $u_n \in B$ for all $n$, as any proposed move out of $B$ will be rejected. Thus $\mathbf{1}_B(u)$ may be dropped from the formula for the acceptance probability.

Using this lemma, we can prove ergodicity much as we did in the previous subsection in the finite state space setting. The main idea is that, restricted to the bounded set $B$, the probability density of the transition kernel will be bounded away from zero by some $\varepsilon$. Splitting off a "forgetful part" that is triggered with probability $\varepsilon$ will then yield the result.

Proof of Theorem 6.7.
Note again that since $u_0 \in B$ we have $u_n \in B$ for all $n \ge 0$. Note further that since $B$ is compact and $q$ is continuous in both of its arguments, there is $q_- > 0$ such that, for all $u, v \in B$,
\[
q(u,v) \ge q_-.
\]
Let $p$ be the Markov kernel defined by the pCN Metropolis-Hastings algorithm. It follows that, for $u, v \in B$,
\[
p(u,v) \ge q(u,v)\, a(u,v) \ge q_-\, \frac{g_-}{g_+} =: \frac{\varepsilon}{\mathrm{Leb}(B)},
\]
where the last identity defines $\varepsilon$ and $\mathrm{Leb}(B)$ denotes the Lebesgue measure of $B$ (which is assumed to be positive). Analogously to the discrete proof, we now define $b_n$ to be i.i.d. Bernoulli random variables with $\mathbb{P}(b_n = 1) = \varepsilon$, independently of all other randomness, and consider the transition rule
\[
u_{n+1} \sim \begin{cases} s(u_n, \cdot), & \text{for } b_n = 0, \\ r(\cdot), & \text{for } b_n = 1, \end{cases}
\]
where $r$ denotes the uniform distribution on $B$ and, for $A \subset B$ and $u \in B$,
\[
s(u, A) := \frac{p(u, A) - \varepsilon\, r(A)}{1 - \varepsilon}.
\]
Just as in the discrete case, one can check that the resulting Markov kernel is equal to the pCN Metropolis-Hastings kernel $p(\cdot, \cdot)$. Exponential convergence is then concluded in exactly the same way as in the discrete case.

The book [40] is a useful basic introduction to MCMC and the book [16] presents the state of the art as of 2010. The paper [21] overviews the pCN method and related MCMC algorithms specifically designed for inverse problems and other sampling problems in high-dimensional state spaces. The book [77] describes the coupling method in a general setting. The book [86] contains a wide-ranging presentation of Markov chains and their long-time behavior, including ergodicity and coupling. Furthermore, that book describes the general methodology for going from convergence of expectations in (possibly weighted) total variation metrics to sample path ergodicity and almost sure convergence of time averages, a topic we did not cover in this chapter. The paper [85] describes the coupling methodology in the context of stochastic differential equations and their approximations.
In this chapter we introduce data assimilation problems in which the model of interest, and the data associated with it, have a time-ordered nature. We distinguish between the filtering problem (on-line), in which the data is incorporated sequentially as it comes in, and the smoothing problem (off-line), which is a specific instance of the inverse problems that have been the subject of the preceding chapters.
Consider the stochastic dynamics model given by
\[
v_{j+1} = \Psi(v_j) + \xi_j, \quad j \in \mathbb{Z}^+, \qquad
v_0 \sim \mathcal{N}(m_0, C_0), \quad \xi_j \sim \mathcal{N}(0, \Sigma) \text{ i.i.d.},
\]
where we assume that $v_0$ is independent of the sequence $\{\xi_j\}$; this is often written as $v_0 \perp \{\xi_j\}$. Now we add the data model given by
\[
y_{j+1} = h(v_{j+1}) + \eta_{j+1}, \quad j \in \mathbb{Z}^+, \qquad \eta_j \sim \mathcal{N}(0, \Gamma) \text{ i.i.d.},
\]
where we assume that $\{\eta_j\} \perp v_0$ and $\eta_k \perp \xi_j$ for all $j, k$. The following will be assumed in the remainder of this book.

Assumption 7.1.
The matrices $C_0$, $\Sigma$ and $\Gamma$ are positive-definite. Further, we have $\Psi \in C(\mathbb{R}^d, \mathbb{R}^d)$ and $h \in C(\mathbb{R}^d, \mathbb{R}^k)$.

We define $V := \{v_0, \dots, v_J\}$, $Y := \{y_1, \dots, y_J\}$, and $Y_j := \{y_1, \dots, y_j\}$. The sequence $V$ is often termed the signal and the sequence $Y$ the data.

Definition 7.2 (The Smoothing Problem). The smoothing problem is to find the probability density $\Pi(V) := \mathbb{P}(V \mid Y) = \mathbb{P}(\{v_0, \dots, v_J\} \mid \{y_1, \dots, y_J\})$ on $\mathbb{R}^{d(J+1)}$ for some fixed integer $J$. We refer to $\Pi$ as the smoothing distribution.

Definition 7.3 (The Filtering Problem). The filtering problem is to find the probability densities $\pi_j(v_j) := \mathbb{P}(v_j \mid Y_j)$ on $\mathbb{R}^d$ for $j = 1, \dots, J$. We refer to $\pi_j$ as the filtering distribution at time $j$.

The key conceptual issue to appreciate concerning the filtering problem, in comparison with the smoothing problem, is that interest is focused on characterizing, or approximating, a sequence of probability distributions, defined in an iterative fashion as the data is acquired sequentially.
Remark 7.4.
We note the following identity:
\[
\int \Pi(v_0, \dots, v_J)\, dv_0\, dv_1 \dots dv_{J-1} = \pi_J(v_J).
\]
This expresses the fact that the marginal of the smoothing distribution at time $J$ corresponds to the filtering distribution at time $J$. Note also that, in general, for $j < J$,
\[
\int \Pi(v_0, \dots, v_J)\, dv_0 \dots dv_{j-1}\, \widehat{dv_j}\, dv_{j+1} \dots dv_J \neq \pi_j(v_j),
\]
where the hat indicates that integration over $v_j$ is omitted, since the expression on the left-hand side of the equation depends on the data $Y_J$, whereas that on the right-hand side depends only on $Y_j$, and $j < J$.

The smoothing distribution can be found by combining a prior on $V$ and a likelihood function using Bayes formula. The prior is the probability distribution on $V$ implied by the distribution of $v_0$ and the stochastic dynamics model; the likelihood function is defined by the data model. We now derive the prior and the likelihood separately. The prior distribution can be derived as follows:
\[
\mathbb{P}(V) = \mathbb{P}(v_J, v_{J-1}, \dots, v_0)
= \mathbb{P}(v_J \mid v_{J-1}, \dots, v_0)\, \mathbb{P}(v_{J-1}, \dots, v_0)
= \mathbb{P}(v_J \mid v_{J-1})\, \mathbb{P}(v_{J-1}, \dots, v_0).
\]
The third equality comes from the Markov, or memoryless, property, which follows from the independence of the elements of the sequence $\{\xi_j\}$. By induction, we have:
\[
\mathbb{P}(V) = \prod_{j=0}^{J-1} \mathbb{P}(v_{j+1} \mid v_j)\, \mathbb{P}(v_0)
= \frac{1}{Z_\rho} \exp\bigl( -\mathsf{R}(V) \bigr) =: \rho(V),
\]
where $Z_\rho > 0$ is a normalization constant and
\[
\mathsf{R}(V) := \frac{1}{2} |v_0 - m_0|^2_{C_0} + \frac{1}{2} \sum_{j=0}^{J-1} |v_{j+1} - \Psi(v_j)|^2_{\Sigma}.
\]
The likelihood function, which incorporates the measurements gathered from observing the system, depends only on the measurement model and may be derived as follows:
\[
\mathbb{P}(Y \mid V) = \prod_{j=0}^{J-1} \mathbb{P}(y_{j+1} \mid v_0, \dots, v_J)
= \prod_{j=0}^{J-1} \mathbb{P}(y_{j+1} \mid v_{j+1})
\propto \exp\bigl( -\mathsf{L}(V; Y) \bigr),
\]
where
\[
\mathsf{L}(V; Y) := \frac{1}{2} \sum_{j=0}^{J-1} |y_{j+1} - h(v_{j+1})|^2_{\Gamma}.
\]
The factorization of $\mathbb{P}(Y \mid V)$ in terms of the product of the $\mathbb{P}(y_{j+1} \mid v_{j+1})$ follows from the independence of the elements of $\{\eta_j\}$ and the fact that the observation at time $j+1$ depends only on the state at time $j+1$.
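The two functionals $\mathsf{R}$ and $\mathsf{L}$ are straightforward to evaluate in code. The sketch below (illustrative; $\Psi$, $h$ and the covariances are placeholders) computes the negative log-density $\mathsf{R}(V) + \mathsf{L}(V;Y)$ of the smoothing distribution, up to an additive constant, using the covariance-weighted norm $|x|^2_A = x^\top A^{-1} x$:

```python
import numpy as np

def weighted_sq_norm(x, A):
    """Covariance-weighted squared norm |x|_A^2 = x^T A^{-1} x."""
    return float(x @ np.linalg.solve(A, x))

def neg_log_posterior(V, Y, Psi, h, m0, C0, Sigma, Gamma):
    """R(V) + L(V; Y): negative log smoothing density up to a constant.
    V has shape (J+1, d); Y has shape (J, k)."""
    R = 0.5 * weighted_sq_norm(V[0] - m0, C0)
    for j in range(len(Y)):
        R += 0.5 * weighted_sq_norm(V[j + 1] - Psi(V[j]), Sigma)   # dynamics misfit
    L = sum(0.5 * weighted_sq_norm(Y[j] - h(V[j + 1]), Gamma)      # data misfit
            for j in range(len(Y)))
    return R + L
```

Combined with a Metropolis-Hastings sampler, `-neg_log_posterior` supplies a log-target for sampling the smoothing distribution.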
Using Bayes Theorem 1.2 we find the smoothing distribution by combining the likelihood and the prior:
\[
\Pi(V) \propto \mathbb{P}(Y \mid V)\, \mathbb{P}(V) = \frac{1}{Z} \exp\bigl( -\mathsf{R}(V) - \mathsf{L}(V; Y) \bigr).
\]
Note that $V \in \mathbb{R}^{d(J+1)}$ and $Y \in \mathbb{R}^{kJ}$.

Now we study the well-posedness of the smoothing problem with respect to perturbations in the data. To this end we consider two smoothing distributions corresponding to different observed data sequences
$Y, Y'$:
\[
\Pi(V) := \mathbb{P}(V \mid Y) = \frac{1}{Z} \exp\bigl( -\mathsf{R}(V) - \mathsf{L}(V; Y) \bigr), \qquad
\Pi'(V) := \mathbb{P}(V \mid Y') = \frac{1}{Z'} \exp\bigl( -\mathsf{R}(V) - \mathsf{L}(V; Y') \bigr).
\]
We make the following assumptions:
Assumption 7.5.
There is a finite non-negative constant $R$ such that the data $Y, Y'$ and the observation function $h$ satisfy:

∙ $|Y|, |Y'| \le R$;

∙ Letting $\phi(V) := \bigl( \sum_{j=1}^{J} \bigl( 1 + |h(v_j)|^2_{\Gamma} \bigr) \bigr)^{1/2}$, it holds that $\mathbb{E}^{\rho}[\phi(V)] < \infty$.

The following theorem shows well-posedness of the smoothing problem.
Theorem 7.6 (Well-Posedness of Smoothing). Under Assumption 7.5, there is $\kappa \in [0, \infty)$, independent of $Y$ and $Y'$, such that
\[
d_{\mathrm{H}}(\Pi, \Pi') \le \kappa\, |Y - Y'|.
\]

Proof.
We show that the proof of Theorem 1.14, which established well-posedness for Bayesian inverse problems under Assumption 1.12, applies in the smoothing context as well. To do so we rewrite the problem in the notation used in Chapter 1, and show that Assumption 7.5 above implies Assumption 1.12. Write
\[
\Pi(V) = \frac{1}{Z} \exp\bigl( -\mathsf{L}(V; Y) \bigr)\, \rho(V) = \frac{1}{Z}\, g(V; Y)\, \rho(V), \qquad
\Pi'(V) = \frac{1}{Z'} \exp\bigl( -\mathsf{L}(V; Y') \bigr)\, \rho(V) = \frac{1}{Z'}\, g(V; Y')\, \rho(V),
\]
where $Z, Z' > 0$ are normalization constants and $|Y - Y'|$ plays the role of $\delta$ in Theorem 1.14. Since $g(V; Y) = \exp(-\mathsf{L}(V; Y))$ and $\mathsf{L}(V; Y)$ is non-negative, we have that
\[
\sup_V \Bigl| \sqrt{g(V; Y)} \Bigr| + \Bigl| \sqrt{g(V; Y')} \Bigr| \le 2,
\]
and so Assumption 1.12 (ii) is satisfied. To see that Assumption 1.12 (i) is also satisfied, note that $e^{-x}$ is Lipschitz with constant 1 for $x \ge 0$. Therefore, using the Cauchy-Schwarz inequality and some algebraic manipulations, there is $\kappa$ independent of $Y$ and $Y'$ such that
\[
\Bigl| \sqrt{g(V; Y)} - \sqrt{g(V; Y')} \Bigr| \le \bigl| \mathsf{L}(V; Y) - \mathsf{L}(V; Y') \bigr|
= \frac{1}{2} \Bigl| \sum_{j=0}^{J-1} \bigl( |y_{j+1} - h(v_{j+1})|^2_{\Gamma} - |y'_{j+1} - h(v_{j+1})|^2_{\Gamma} \bigr) \Bigr|
\le \kappa \sum_{j=0}^{J-1} |y_{j+1} - y'_{j+1}|_{\Gamma}\, \bigl| y_{j+1} + y'_{j+1} - 2 h(v_{j+1}) \bigr|_{\Gamma}
\le \kappa\, |Y - Y'|\, \phi(V),
\]
where $\phi$ is defined in Assumption 7.5. This shows that, under Assumption 7.5, the likelihood function of the smoothing problem satisfies Assumption 1.12 with $\delta = |Y - Y'|$; Theorem 7.6 follows from Theorem 1.14.

Filtering consists of the iterative updating of a sequence of probability distributions as new data arrives. This rests on the formula
\[
\pi_{j+1} = \mathcal{A}_j \mathcal{P} \pi_j,
\]
where $\mathcal{P}$ is a linear Markov map and $\mathcal{A}_j$ is a nonlinear likelihood map (Bayes theorem). We introduce $\widehat{\pi}_{j+1} = \mathbb{P}(v_{j+1} \mid Y_j)$ and recall that $\pi_j = \mathbb{P}(v_j \mid Y_j)$. Then the maps $\mathcal{P}$ and $\mathcal{A}_j$ are defined by
\[
\text{Prediction Step: } \widehat{\pi}_{j+1} = \mathcal{P} \pi_j. \qquad
\text{Analysis Step: } \pi_{j+1} = \mathcal{A}_j \widehat{\pi}_{j+1}. \tag{7.3}
\]
The map $\mathcal{P}$ is sometimes termed prediction; the map $\mathcal{A}_j$ is termed analysis. Throughout this section, all integrals are over $\mathbb{R}^d$.
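For low-dimensional problems, the maps $\mathcal{P}$ and $\mathcal{A}_j$ in (7.3) can be approximated directly on a grid. The sketch below (illustrative, for scalar state and data, $d = k = 1$; $\Psi$ and $h$ are placeholders) represents each density by its values at grid points, applies the Gaussian transition density for the prediction step, and reweights by the likelihood and renormalizes for the analysis step:

```python
import numpy as np

def grid_filter(Psi, h, Sigma, Gamma, grid, pi0, ys):
    """One-dimensional filtering on a grid: pi_{j+1} = A_j P pi_j.
    grid: (n,) grid points; pi0: (n,) values of the density pi_0;
    ys: the data y_1, ..., y_J; Sigma, Gamma: scalar variances."""
    dv = grid[1] - grid[0]
    # transition density matrix T[i, l] = P(v_{j+1} = grid[i] | v_j = grid[l])
    T = np.exp(-0.5 * (grid[:, None] - Psi(grid)[None, :])**2 / Sigma)
    T /= np.sqrt(2 * np.pi * Sigma)
    pi = pi0.copy()
    history = []
    for y in ys:
        pi_hat = T @ pi * dv                                     # prediction: P pi_j
        pi = np.exp(-0.5 * (y - h(grid))**2 / Gamma) * pi_hat    # analysis: reweight
        pi /= pi.sum() * dv                                      # ... and normalize
        history.append(pi)
    return history
```

Each returned density integrates to one by construction; the cost grows rapidly with the grid size, which is why grid methods are restricted to very low dimension.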
First we derive the linear Markov map $\mathcal{P}$. By the Markov property of the stochastic dynamics model we have
\[
\widehat{\pi}_{j+1}(v_{j+1}) = \mathbb{P}(v_{j+1} \mid Y_j) \tag{7.4}
= \int \mathbb{P}(v_{j+1} \mid Y_j, v_j)\, \mathbb{P}(v_j \mid Y_j)\, dv_j
= \int \mathbb{P}(v_{j+1} \mid v_j)\, \mathbb{P}(v_j \mid Y_j)\, dv_j
= \int \mathbb{P}(v_{j+1} \mid v_j)\, \pi_j(v_j)\, dv_j
= \frac{1}{(2\pi)^{d/2} (\det \Sigma)^{1/2}} \int \exp\Bigl( -\frac{1}{2} |v_{j+1} - \Psi(v_j)|^2_{\Sigma} \Bigr)\, \pi_j(v_j)\, dv_j.
\]
This defines the linear integral operator $\mathcal{P}$.

Now we derive the nonlinear likelihood map $\mathcal{A}_j$ by using Bayes theorem: $\pi_{j+1} = \mathcal{A}_j \widehat{\pi}_{j+1}$. We note that
\[
\pi_{j+1}(v_{j+1}) = \mathbb{P}(v_{j+1} \mid Y_{j+1}) \tag{7.5}
= \mathbb{P}(v_{j+1} \mid Y_j, y_{j+1})
= \frac{\mathbb{P}(y_{j+1} \mid v_{j+1}, Y_j)\, \mathbb{P}(v_{j+1} \mid Y_j)}{\mathbb{P}(y_{j+1} \mid Y_j)}
= \frac{\mathbb{P}(y_{j+1} \mid v_{j+1})\, \mathbb{P}(v_{j+1} \mid Y_j)}{\mathbb{P}(y_{j+1} \mid Y_j)}
= \frac{\exp\bigl( -\frac{1}{2} |y_{j+1} - h(v_{j+1})|^2_{\Gamma} \bigr)\, \widehat{\pi}_{j+1}(v_{j+1})}{\int \exp\bigl( -\frac{1}{2} |y_{j+1} - h(v_{j+1})|^2_{\Gamma} \bigr)\, \widehat{\pi}_{j+1}(v_{j+1})\, dv_{j+1}}.
\]
This defines the nonlinear map $\mathcal{A}_j$ through multiplication by the likelihood, followed by normalization to a probability measure.

Now we establish the well-posedness of the filtering problem. We let
\[
\pi_J = \mathbb{P}(v_J \mid Y), \qquad \pi'_J = \mathbb{P}(v_J \mid Y')
\]
be two filtering distributions arising from observed data $Y = Y_J$ and $Y' = Y'_J$. As noted in Remark 7.4, the filtering distribution at time $J$ is the $J$-th marginal of the smoothing distribution; using this observation, the well-posedness of the filtering problem is a direct consequence of the well-posedness of the smoothing problem.

Corollary 7.7 (Well-Posedness of Filtering). Under Assumption 7.5, there exists $\kappa = \kappa(R)$ such that
\[
d_{\mathrm{TV}}(\pi_J, \pi'_J) \le \kappa\, |Y - Y'|.
\]

Proof. Let $\Pi, \Pi'$ be the posterior distributions of the smoothing problems, $\Pi = \mathbb{P}(V \mid Y)$ and $\Pi' = \mathbb{P}(V \mid Y')$. We note that there exists $\kappa$ such that $d_{\mathrm{TV}}(\Pi, \Pi') \le \kappa |Y - Y'|$, by Theorem 7.6 and the fact that the Hellinger metric bounds the total variation metric (Lemma 1.9). Let $f : \mathbb{R}^d \to \mathbb{R}$ and $F : \mathbb{R}^{d(J+1)} \to \mathbb{R}$.
Then
\[
d_{\mathrm{TV}}(\pi_J, \pi'_J) = \frac{1}{2} \sup_{|f|_\infty \le 1} \Bigl| \mathbb{E}^{\pi_J}[f(v_J)] - \mathbb{E}^{\pi'_J}[f(v_J)] \Bigr|
= \frac{1}{2} \sup_{|f|_\infty \le 1} \Bigl| \mathbb{E}^{\Pi}[f(v_J)] - \mathbb{E}^{\Pi'}[f(v_J)] \Bigr|
\le \frac{1}{2} \sup_{|F|_\infty \le 1} \Bigl| \mathbb{E}^{\Pi}[F(V)] - \mathbb{E}^{\Pi'}[F(V)] \Bigr|
= d_{\mathrm{TV}}(\Pi, \Pi') \le \kappa\, |Y - Y'|.
\]
Here the inequality follows from the fact that $\{f : |f|_\infty \le 1\}$ can be viewed as a subset of $\{F : |F|_\infty \le 1\}$.

The book [73] gives a mathematical introduction to data assimilation; for further information on the smoothing problem as presented here, see Section 2.3 of that book; for further information on the filtering problem as presented here, see Section 2.4. The books [1, 99] give alternative foundational presentations of the subject. The books [66, 91, 82, 17] study data assimilation in the contexts of weather forecasting, oil reservoir simulation, turbulence modeling and the geophysical sciences, respectively.

Recall the stochastic dynamics and data models introduced in the previous chapter:
\[
v_{j+1} = \Psi(v_j) + \xi_j, \quad \xi_j \sim \mathcal{N}(0, \Sigma) \text{ i.i.d.}, \qquad
y_{j+1} = h(v_{j+1}) + \eta_{j+1}, \quad \eta_j \sim \mathcal{N}(0, \Gamma) \text{ i.i.d.}, \tag{8.1}
\]
with $v_0 \sim \mathcal{N}(m_0, C_0)$; $C_0$, $\Sigma$ and $\Gamma$ positive-definite; and $v_0 \perp \{\xi_j\} \perp \{\eta_j\}$. Here we study the filtering and smoothing problems under the assumption that both the state-transition operator $\Psi(\cdot)$ and the observation operator $h(\cdot)$ are linear. Throughout, we will assume the following:
The stochastic dynamics and data models defined by equation (8.1) hold. Moreover:

∙ Linear dynamics: $v_{j+1} = M v_j + \xi_j$ for some $M \in \mathbb{R}^{d \times d}$;

∙ Linear observation: $y_{j+1} = H v_{j+1} + \eta_{j+1}$ for some $H \in \mathbb{R}^{k \times d}$.

We will be mostly concerned with the case where $d > k$.
Under the linear-Gaussian assumption, the filtering and smoothing distributions are Gaussian and are therefore fully characterized by their mean and covariance. We consider first the
Kalman filter ,which gives explicit formulae for the iterative update of the mean and covariance of thefiltering distribution, and then the
Kalman smoother , which characterizes the smooth-ing distribution. While the Kalman filter and the Kalman smoother only characterizethe filtering and smoothing distributions in the linear-Gaussian setting, their impor-tance extends beyond this setting, as will be demonstrated in the next two chapters.
The filtering problem is to estimate the state at time $j$ given the data from the past up to the present time $j$. That is, we want to determine the pdf $\pi_j = \mathbb{P}(v_j \mid Y_j)$, where $Y_j := \{y_1, \dots, y_j\}$. We define $\widehat{\pi}_{j+1} = \mathbb{P}(v_{j+1} \mid Y_j)$ and recall the evolution
\[
\pi_{j+1} = \mathcal{A}_j \mathcal{P} \pi_j, \qquad \pi_0 = \mathcal{N}(m_0, C_0),
\]
which can be decomposed in terms of the prediction and analysis steps (7.3). Note that $\mathcal{P}$ does not depend on $j$ because the same Markov process defined by the state dynamics governs each prediction step, whereas $\mathcal{A}_j$ depends on $j$ because at each step $j$ the likelihood sees different data. The linear dynamics assumption implies that applying the operator $\mathcal{P}$ to a Gaussian distribution gives again a Gaussian, and the linear observation assumption implies that applying the operator $\mathcal{A}_j$ to a Gaussian gives again a Gaussian. Therefore we have the following:

Theorem 8.2 (Gaussianity of Filtering Distributions). Under Assumption 8.1, $\pi_0$, $\{\pi_{j+1}\}_{j \in \mathbb{Z}^+}$ and $\{\widehat{\pi}_{j+1}\}_{j \in \mathbb{Z}^+}$ are all Gaussian distributions.

As a consequence, the filtering distributions can be entirely characterized by their mean and covariance. We write
\[
\widehat{\pi}_{j+1} = \mathbb{P}(v_{j+1} \mid Y_j) = \mathcal{N}(\widehat{m}_{j+1}, \widehat{C}_{j+1}) \quad \text{(prediction)}, \qquad
\pi_{j+1} = \mathbb{P}(v_{j+1} \mid Y_{j+1}) = \mathcal{N}(m_{j+1}, C_{j+1}) \quad \text{(analysis)},
\]
and aim to find update formulae for these means and covariances. The Kalman filter algorithm achieves this.

Theorem 8.3 (Kalman Filter). Under Assumption 8.1, for all $j \in \mathbb{Z}^+$, $C_j$ is positive-definite and
\[
\widehat{m}_{j+1} = M m_j, \qquad \widehat{C}_{j+1} = M C_j M^T + \Sigma,
\]
\[
C_{j+1}^{-1} = (M C_j M^T + \Sigma)^{-1} + H^T \Gamma^{-1} H, \qquad
C_{j+1}^{-1} m_{j+1} = (M C_j M^T + \Sigma)^{-1} M m_j + H^T \Gamma^{-1} y_{j+1}.
\]

Proof.
The proof proceeds by breaking the Kalman filter step above into the prediction and the analysis steps.
Prediction: The mean and covariance of the prediction step may be calculated as follows. The mean is given by:
\[
\widehat{m}_{j+1} = \mathbb{E}[v_{j+1} \mid Y_j] = \mathbb{E}[M v_j + \xi_j \mid Y_j] = M\, \mathbb{E}[v_j \mid Y_j] + \mathbb{E}[\xi_j \mid Y_j] = M m_j,
\]
where we used that $\xi_j$ and $Y_j$ are independent. The covariance is given by:
\[
\widehat{C}_{j+1} = \mathbb{E}\bigl[ (v_{j+1} - \widehat{m}_{j+1}) \otimes (v_{j+1} - \widehat{m}_{j+1}) \mid Y_j \bigr]
= \mathbb{E}\bigl[ M(v_j - m_j) \otimes M(v_j - m_j) \mid Y_j \bigr] + \mathbb{E}[\xi_j \otimes \xi_j \mid Y_j] + \mathbb{E}\bigl[ \xi_j \otimes M(v_j - m_j) \mid Y_j \bigr] + \mathbb{E}\bigl[ M(v_j - m_j) \otimes \xi_j \mid Y_j \bigr]
= M\, \mathbb{E}\bigl[ (v_j - m_j) \otimes (v_j - m_j) \mid Y_j \bigr] M^T + \Sigma
= M C_j M^T + \Sigma,
\]
where we used that $\xi_j$ and $v_j$ are independent. Thus, in the linear-Gaussian setting, the prediction operator $\mathcal{P}$ from $\pi_j = \mathcal{N}(m_j, C_j)$ to $\widehat{\pi}_{j+1} = \mathcal{N}(\widehat{m}_{j+1}, \widehat{C}_{j+1})$ is given by
\[
\widehat{m}_{j+1} = M m_j, \qquad \widehat{C}_{j+1} = M C_j M^T + \Sigma.
\]

Analysis: The analysis step may be derived as follows, using Bayes theorem:
\[
\mathbb{P}(v_{j+1} \mid Y_{j+1}) = \mathbb{P}(v_{j+1} \mid y_{j+1}, Y_j) \propto \mathbb{P}(y_{j+1} \mid v_{j+1}, Y_j)\, \mathbb{P}(v_{j+1} \mid Y_j) = \mathbb{P}(y_{j+1} \mid v_{j+1})\, \mathbb{P}(v_{j+1} \mid Y_j).
\]
This gives
\[
\mathbb{P}(v_{j+1} \mid Y_{j+1}) \propto \exp\Bigl( -\frac{1}{2} |v_{j+1} - m_{j+1}|^2_{C_{j+1}} \Bigr)
\propto \exp\Bigl( -\frac{1}{2} |y_{j+1} - H v_{j+1}|^2_{\Gamma} \Bigr) \exp\Bigl( -\frac{1}{2} |v_{j+1} - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}} \Bigr)
= \exp\Bigl( -\frac{1}{2} |y_{j+1} - H v_{j+1}|^2_{\Gamma} - \frac{1}{2} |v_{j+1} - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}} \Bigr). \tag{8.2}
\]
Taking logarithms and matching quadratic and linear terms in $v_{j+1}$ on either side of this identity gives the update operator $\mathcal{A}_j$ from $\widehat{\pi}_{j+1} = \mathcal{N}(\widehat{m}_{j+1}, \widehat{C}_{j+1})$ to $\pi_{j+1} = \mathcal{N}(m_{j+1}, C_{j+1})$:
\[
C_{j+1}^{-1} = \widehat{C}_{j+1}^{-1} + H^T \Gamma^{-1} H, \qquad
C_{j+1}^{-1} m_{j+1} = \widehat{C}_{j+1}^{-1} \widehat{m}_{j+1} + H^T \Gamma^{-1} y_{j+1}.
\]
Combining the prediction operator $\mathcal{P}$ and the update operator $\mathcal{A}_j$ yields the desired update formulae.

Positive-definiteness of $C_j$: It remains to show that $C_j > 0$ for all $j \in \mathbb{Z}^+$. We will use induction. By assumption the result holds true for $j = 0$. Assume that it is true for $C_j$.
For the prediction operator $\mathcal{P}$ we have, for $u \neq 0$,
\[
\langle u, \widehat{C}_{j+1} u \rangle = \langle u, M C_j M^T u \rangle + \langle u, \Sigma u \rangle = \langle M^T u, C_j M^T u \rangle + \langle u, \Sigma u \rangle \ge \langle u, \Sigma u \rangle > 0,
\]
where we used that $C_j > 0$ and $\Sigma > 0$. Therefore $\widehat{C}_{j+1}, \widehat{C}_{j+1}^{-1} > 0$. Then, for the update operator $\mathcal{A}_j$:
\[
\langle u, C_{j+1}^{-1} u \rangle = \langle u, \widehat{C}_{j+1}^{-1} u \rangle + \langle u, H^T \Gamma^{-1} H u \rangle = \langle u, \widehat{C}_{j+1}^{-1} u \rangle + \langle H u, \Gamma^{-1} H u \rangle \ge \langle u, \widehat{C}_{j+1}^{-1} u \rangle > 0,
\]
where we used that $\Gamma > 0$. Therefore $C_{j+1}, C_{j+1}^{-1} > 0$, which concludes the proof.
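One prediction-analysis cycle in the precision form of Theorem 8.3 can be sketched as follows (illustrative, not from the notes):

```python
import numpy as np

def kalman_step(m, C, y, M, H, Sigma, Gamma):
    """One Kalman filter cycle in the precision form of Theorem 8.3:
    predict (m_hat, C_hat), then update the mean and covariance
    through the precision C_{j+1}^{-1} = C_hat^{-1} + H^T Gamma^{-1} H."""
    m_hat = M @ m                                   # prediction mean
    C_hat = M @ C @ M.T + Sigma                     # prediction covariance
    prec = np.linalg.inv(C_hat) + H.T @ np.linalg.solve(Gamma, H)
    C_new = np.linalg.inv(prec)
    m_new = C_new @ (np.linalg.solve(C_hat, m_hat) + H.T @ np.linalg.solve(Gamma, y))
    return m_new, C_new
```

Note that this form inverts matrices in the $d$-dimensional state space; the standard form in terms of the innovation and the Kalman gain, discussed next, requires only an inversion in the $k$-dimensional data space.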
Remark 8.4.
The previous proof reveals two interesting facts about the structure of the Kalman filter updates. The first is that the covariance update does not involve the observed data; this can be thought of as a consequence of the fact that the posterior covariance in the linear-Gaussian setting for inverse problems does not depend on the observed data, as noted in Chapter 2. The second is that the update formulae for the covariance are linear in the prediction step, but nonlinear in the analysis step; specifically, the analysis step is linear in the precisions (inverse covariances).

We now rewrite the Kalman filter in an alternative form. This formulation is written in terms of the covariances directly, and does not involve the precisions. Furthermore, the formulation in Theorem 8.3 involves a matrix inversion in the state space, while the one that we present here requires only an inversion in the data space, namely $S_{j+1}^{-1}$ in what follows. In many applications the observation space dimension is much smaller than the state space dimension ($k \ll d$), and then the formulation given in this section is much cheaper to compute than the one given in Theorem 8.3.

Corollary 8.5 (Standard Form of Kalman Filter). Under Assumption 8.1, the Kalman update formulae may be written as
\[
m_{j+1} = \widehat{m}_{j+1} + K_{j+1} d_{j+1}, \qquad C_{j+1} = (I - K_{j+1} H)\, \widehat{C}_{j+1}, \tag{8.3}
\]
where
\[
d_{j+1} = y_{j+1} - H \widehat{m}_{j+1}, \qquad
S_{j+1} = H \widehat{C}_{j+1} H^T + \Gamma, \qquad
K_{j+1} = \widehat{C}_{j+1} H^T S_{j+1}^{-1},
\]
and where $\widehat{m}_{j+1}, \widehat{C}_{j+1}$ are defined as in the proof of Theorem 8.3.

Remark 8.6.
The vector $d_{j+1}$ is known as the innovation and the matrix $K_{j+1}$ as the Kalman gain. Note that $d_{j+1}$ measures the mismatch between the predicted state and the given data. The corollary may be proved by an application of the following:

Lemma 8.7 (Woodbury Matrix Identity). Let $A \in \mathbb{R}^{p \times p}$, $U \in \mathbb{R}^{p \times q}$, $B \in \mathbb{R}^{q \times q}$, $V \in \mathbb{R}^{q \times p}$. If $A, B > 0$, then $A + UBV$ is invertible and
\[
(A + UBV)^{-1} = A^{-1} - A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}.
\]

Combining the form of $d_{j+1}$ and $\widehat{m}_{j+1}$ shows that the update formula for the Kalman mean can be written as
\[
m_{j+1} = (I - K_{j+1} H)\, \widehat{m}_{j+1} + K_{j+1} y_{j+1}, \qquad \widehat{m}_{j+1} = M m_j. \tag{8.4}
\]
This update formula has the very natural interpretation that the mean update is formed as a linear combination of the evolution of the noise-free dynamics and of the data. Equations (8.3) and (8.4) show that the Kalman gain $K_{j+1}$ determines the weight given to the new observation $y_{j+1}$ in the state estimate. The update formula (8.4) may also be derived from an optimization perspective, the topic of the next subsection.

Since $\pi_{j+1}$ is Gaussian, its mean agrees with its mode. Thus, formula (8.2) implies that
\[
m_{j+1} = \operatorname*{argmax}_v\, \pi_{j+1}(v) = \operatorname*{argmin}_v\, \mathsf{J}(v),
\]
where
\[
\mathsf{J}(v) := \frac{1}{2} |y_{j+1} - H v|^2_{\Gamma} + \frac{1}{2} |v - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}}.
\]
In other words, $m_{j+1}$ is chosen to fit both the observed data $y_{j+1}$ and the prediction $\widehat{m}_{j+1}$ as well as possible. The covariances $\Gamma$ and $\widehat{C}_{j+1}$ determine the relative weighting of the two quadratic terms. The solution of the minimization problem is given by (8.4), as may be verified by direct differentiation of $\mathsf{J}$.

An alternative derivation, which is helpful in more sophisticated contexts, is to cast the problem in terms of constrained minimization. Write $v' = v - \widehat{m}_{j+1}$, $y' = y_{j+1} - H \widehat{m}_{j+1}$ and $C' = \widehat{C}_{j+1}$. Then minimization of $\mathsf{J}$ may be reformulated as
\[
m_{j+1} = \widehat{m}_{j+1} + \operatorname*{argmin}_{v'} \Bigl( \frac{1}{2} |y' - H v'|^2_{\Gamma} + \frac{1}{2} \langle v', b \rangle \Bigr),
\]
where the minimization is now subject to the constraint $C' b = v'$.
Using Lagrange multipliers we write
\[
\mathsf{I}(v') = \frac{1}{2} |y' - H v'|^2_{\Gamma} + \frac{1}{2} \langle v', b \rangle + \langle \lambda, C' b - v' \rangle; \tag{8.5}
\]
computing the derivatives with respect to $v'$, $b$ and $\lambda$ and setting them to zero gives
\[
-H^T \Gamma^{-1} (y' - H v') + \frac{1}{2} b - \lambda = 0, \qquad
\frac{1}{2} v' + C' \lambda = 0, \qquad
v' - C' b = 0.
\]
The last two equations imply that $C'(2\lambda + b) = 0$. Thus we set $\lambda = -\frac{1}{2} b$ and drop the second equation, replacing the first by
\[
-H^T \Gamma^{-1} (y' - H C' b) + b = 0.
\]
Solving for $b$ gives
\[
v = \widehat{m}_{j+1} + v' = \widehat{m}_{j+1} + C' b
= \widehat{m}_{j+1} + C' (H^T \Gamma^{-1} H C' + I)^{-1} H^T \Gamma^{-1} y'
= \widehat{m}_{j+1} + C' (H^T \Gamma^{-1} H C' + I)^{-1} H^T \Gamma^{-1} (y_{j+1} - H \widehat{m}_{j+1})
= (I - K_{j+1} H)\, \widehat{m}_{j+1} + K_{j+1} y_{j+1},
\]
where we have defined $K_{j+1} = C' (H^T \Gamma^{-1} H C' + I)^{-1} H^T \Gamma^{-1}$. It remains to show that $K_{j+1}$ agrees with the definition given in Corollary 8.5. To see this we note that if we choose $S$ to be any matrix satisfying $K_{j+1} = C' H^T S^{-1}$, then
\[
H^T S^{-1} = (H^T \Gamma^{-1} H C' + I)^{-1} H^T \Gamma^{-1},
\]
so that
\[
(H^T \Gamma^{-1} H C' + I) H^T = H^T \Gamma^{-1} S.
\]
Thus
\[
H^T \Gamma^{-1} H C' H^T + H^T = H^T \Gamma^{-1} S,
\]
which may be achieved by choosing any $S$ so that $\Gamma^{-1}(H C' H^T + \Gamma) = \Gamma^{-1} S$; multiplication by $\Gamma$ gives the desired formula for $S$.

The following theorem states that the Kalman filter gives the best estimate of the mean in an online setting. In what follows, $\mathbb{E}$ denotes expectation with respect to all randomness present in the problem statement: the initial condition, the noisy dynamical evolution, and the noisy data. Furthermore, $\mathbb{E}[\cdot \mid Y_j]$ denotes conditional expectation given the data $Y_j$ up to time $j$.

Theorem 8.8 (Optimality of Kalman Filter). Let $\{m_j\}$ be the sequence computed using the Kalman filter, and let $\{z_j\}$ be any sequence in $\mathbb{R}^d$ such that $z_j$ is $Y_j$-measurable. Then, for all $j \in \mathbb{N}$,
\[
\mathbb{E}\bigl[ |v_j - m_j|^2 \mid Y_j \bigr] \le \mathbb{E}\bigl[ |v_j - z_j|^2 \mid Y_j \bigr].
\]

Proof.
Note that $m_j$ and $z_j$ are fixed and non-random, given $Y_j$; for practical purposes, $Y_j$-measurability of $z_j$ means that $z_j$ is a fixed, non-random function of the observed data $Y_j$. Thus we have:
\[
\mathbb{E}\bigl[ |v_j - z_j|^2 \mid Y_j \bigr] = \mathbb{E}\bigl[ |v_j - m_j + m_j - z_j|^2 \mid Y_j \bigr]
= \mathbb{E}\bigl[ |v_j - m_j|^2 + 2 \langle v_j - m_j, m_j - z_j \rangle + |m_j - z_j|^2 \mid Y_j \bigr]
= \mathbb{E}\bigl[ |v_j - m_j|^2 \mid Y_j \bigr] + 2 \bigl\langle \mathbb{E}[v_j - m_j \mid Y_j],\, m_j - z_j \bigr\rangle + |m_j - z_j|^2
= \mathbb{E}\bigl[ |v_j - m_j|^2 \mid Y_j \bigr] + 2 \bigl\langle \mathbb{E}[v_j \mid Y_j] - m_j,\, m_j - z_j \bigr\rangle + |m_j - z_j|^2
= \mathbb{E}\bigl[ |v_j - m_j|^2 \mid Y_j \bigr] + 0 + |m_j - z_j|^2
\ge \mathbb{E}\bigl[ |v_j - m_j|^2 \mid Y_j \bigr].
\]
The fifth step follows since $m_j = \mathbb{E}[v_j \mid Y_j]$.

We next discuss the Kalman smoother, which refers to the smoothing problem in the linear-Gaussian setting. As with the Kalman filter, it is possible to solve the problem explicitly because the smoothing distribution is itself Gaussian. The explicit formulae computed help to build intuition about the smoothing distribution more generally. We recall Remark 7.4, which implies that the filtering distribution at time $j = J$ determines the marginal of the Kalman smoother on its last coordinate. However, the filtering distributions do not determine the Kalman smoother in its entirety.

Let $V = \{v_0, \dots, v_J\}$ and $Y = \{y_1, \dots, y_J\}$. Using Bayes theorem and the fact that $\{\xi_j\}, \{\eta_j\}$ are mutually independent i.i.d. sequences, independent of $v_0$, we have
\[
\mathbb{P}(V \mid Y) \propto \mathbb{P}(Y \mid V)\, \mathbb{P}(V) = \prod_{j=1}^{J} \mathbb{P}(y_j \mid v_j) \times \prod_{j=0}^{J-1} \mathbb{P}(v_{j+1} \mid v_j) \times \mathbb{P}(v_0).
\]
Noting that
\[
v_{j+1} \mid v_j \sim \mathcal{N}(M v_j, \Sigma), \qquad y_j \mid v_j \sim \mathcal{N}(H v_j, \Gamma),
\]
the Kalman smoothing distribution can be expressed as
\[
\mathbb{P}(V \mid Y) \propto \exp\bigl( -\mathsf{J}(V) \bigr), \tag{8.7}
\]
where
\[
\mathsf{J}(V) := \frac{1}{2} |v_0 - m_0|^2_{C_0} + \frac{1}{2} \sum_{j=0}^{J-1} |v_{j+1} - M v_j|^2_{\Sigma} + \frac{1}{2} \sum_{j=0}^{J-1} |y_{j+1} - H v_{j+1}|^2_{\Gamma}. \tag{8.8}
\]

Theorem 8.9 (Characterization of the Kalman Smoother).
Suppose that Assumption 8.1 holds. Then $\mathbb{P}(V|Y)$ is Gaussian with a block tridiagonal precision matrix $\Omega > 0$ and mean $m$ solving $\Omega m = r$, where
\[
\Omega = \begin{bmatrix}
\Omega_{0,0} & \Omega_{0,1} & & \\
\Omega_{1,0} & \Omega_{1,1} & \ddots & \\
& \ddots & \ddots & \Omega_{J-1,J} \\
& & \Omega_{J,J-1} & \Omega_{J,J}
\end{bmatrix} \qquad (8.9)
\]
with
\[
\Omega_{0,0} = C_0^{-1} + M^T\Sigma^{-1}M, \qquad
\Omega_{j,j} = \Sigma^{-1} + M^T\Sigma^{-1}M + H^T\Gamma^{-1}H, \quad 1 \le j \le J-1,
\]
\[
\Omega_{J,J} = \Sigma^{-1} + H^T\Gamma^{-1}H, \qquad
\Omega_{j,j+1} = \Omega_{j+1,j}^T = -M^T\Sigma^{-1}, \quad 0 \le j \le J-1,
\]
\[
r_0 = C_0^{-1}m_0, \qquad r_j = H^T\Gamma^{-1}y_j, \quad 1 \le j \le J.
\]
Proof.
We may write $\mathsf{J}(V) = \tfrac12|\Omega^{1/2}(V - m)|^2 + q$ with $q$ independent of $V$, by definition. Note that $\Omega$ is then the Hessian of $\mathsf{J}(V)$, and differentiating in equation (8.8) we obtain
\[
\Omega_{0,0} = \partial^2_{v_0}\mathsf{J}(V) = C_0^{-1} + M^T\Sigma^{-1}M,
\qquad
\Omega_{j,j} = \partial^2_{v_j}\mathsf{J}(V) = \Sigma^{-1} + M^T\Sigma^{-1}M + H^T\Gamma^{-1}H,
\]
\[
\Omega_{J,J} = \partial^2_{v_J}\mathsf{J}(V) = \Sigma^{-1} + H^T\Gamma^{-1}H,
\qquad
\Omega_{j-1,j} = \Omega_{j,j-1}^T = \partial^2_{v_{j-1},v_j}\mathsf{J}(V) = -M^T\Sigma^{-1}.
\]
Otherwise, for all other index pairs $\{k, l\}$, $\Omega_{k,l} = 0$. This proves that the matrix $\Omega$ has a block tridiagonal structure.

Now we focus on finding $m$. We have that $\nabla_V\mathsf{J}(V) = \Omega(V - m)$, so that $-\nabla_V\mathsf{J}(V)|_{V=0} = \Omega m = r$. Thus we find $r$ as
\[
r_0 = -\nabla_{v_0}\mathsf{J}(V)|_{V=0} = -(-C_0^{-1}m_0) = C_0^{-1}m_0,
\qquad
r_j = -\nabla_{v_j}\mathsf{J}(V)|_{V=0} = -(-H^T\Gamma^{-1}y_j) = H^T\Gamma^{-1}y_j.
\]
We have shown that $\Omega$ is symmetric and that $\Omega \ge 0$; to prove that $\Omega$ is a precision matrix, we need to show that $\Omega > 0$. Take, for the sake of argument, $Y = 0$ and $m_0 = 0$ in equation (8.8), so that every term in the expansion of $\mathsf{J}(V)$ involves $V$. It is evident that in this case $\mathsf{J}(V) = \tfrac12 V^T\Omega V$. Suppose that $V^T\Omega V = 0$ for some nonzero $V$. Then, by positive-definiteness of $C_0$, $\Sigma$, and $\Gamma$, it must be that $v_0 = 0$ and $v_{j+1} = Mv_j$ for $j = 0, 1, \dots, J-1$. Thus we must have $V = 0$, a contradiction. This proves that $\Omega$ is positive-definite.

Remark 8.10.
We note once again that since the smoothing distribution in the linear-Gaussian setting is itself Gaussian, its mean agrees with its mode. Therefore, the posterior mean found above is the unique minimizer of $\mathsf{J}(V)$, that is, the MAP estimator.

We may also obtain $m$ via Gaussian elimination, using the block tridiagonal structure of $\Omega$, as follows. First we form the matrix sequence $\{\Omega_j\}$:
\[
\Omega_0 = \Omega_{0,0}, \qquad
\Omega_{j+1} = \Omega_{j+1,j+1} - \Omega_{j+1,j}\Omega_j^{-1}\Omega_{j,j+1}
            = \Omega_{j+1,j+1} - \Sigma^{-1}M\,\Omega_j^{-1}M^T\Sigma^{-1}, \quad j = 0, \dots, J-1; \qquad (8.10)
\]
and we form the vector sequence $\{z_j\}$:
\[
z_0 = C_0^{-1}m_0, \qquad
z_{j+1} = H^T\Gamma^{-1}y_{j+1} - \Omega_{j+1,j}\Omega_j^{-1}z_j
        = H^T\Gamma^{-1}y_{j+1} + \Sigma^{-1}M\,\Omega_j^{-1}z_j.
\]
We may then read off $m_J$ from the equation $\Omega_J m_J = z_J$, and finally we perform back-substitution to obtain
\[
\Omega_j m_j = z_j - \Omega_{j,j+1}m_{j+1}, \quad j = J-1, \dots, 0.
\]
Note that $m_J$ found this way coincides with the mean of the Kalman filter at $j = J$.

Proposition 8.11.
The matrices $\{\Omega_j\}$ in (8.10) are positive definite.
Proof. The proof relies on the following two lemmas:

Lemma 8.12. If
\[
X := \begin{bmatrix}
X_1 & \times & \times & \times \\
\times & X_2 & \times & \times \\
\times & \times & \ddots & \times \\
\times & \times & \times & X_d
\end{bmatrix}
\]
is positive-definite symmetric, then each diagonal block $X_i$ is positive-definite symmetric, for all $i \in \{1, \dots, d\}$.

Lemma 8.13. Let $B$ be a block lower (or upper) triangular matrix with identity blocks on the diagonal. Then $B$ is an invertible matrix.

Using Lemma 8.12, we deduce that $\Omega_0 = \Omega_{0,0}$ is positive-definite symmetric. Consider the matrix $B_0 \in \mathbb{R}^{d(J+1) \times d(J+1)}$ defined by
\[
B_0 = \begin{bmatrix}
I & & & \\
-\Omega_{1,0}\Omega_0^{-1} & I & & \\
& & \ddots & \\
& & & I
\end{bmatrix},
\]
which is invertible by Lemma 8.13. We compute
\[
B_0\Omega B_0^T = \begin{bmatrix}
\Omega_0 & & & \\
& \Omega_1 & \Omega_{1,2} & \\
& \Omega_{2,1} & \ddots & \ddots \\
& & \ddots & \Omega_{J,J}
\end{bmatrix},
\]
where the $(0,1)$ and $(1,0)$ blocks have been eliminated and the $(1,1)$ block is $\Omega_1 = \Omega_{1,1} - \Omega_{1,0}\Omega_0^{-1}\Omega_{0,1}$. Since $B_0$ is invertible, $B_0\Omega B_0^T$ is positive definite, and so, by Lemma 8.12, the matrix
\[
\widetilde\Omega = \begin{bmatrix}
\Omega_1 & \Omega_{1,2} & & \\
\Omega_{2,1} & \Omega_{2,2} & \ddots & \\
& \ddots & \ddots & \Omega_{J-1,J} \\
& & \Omega_{J,J-1} & \Omega_{J,J}
\end{bmatrix}
\]
is positive definite; in particular $\Omega_1$ is positive definite. Therefore, by Lemma 8.13, the matrix
\[
B_1 = \begin{bmatrix}
I & & & \\
-\Omega_{2,1}\Omega_1^{-1} & I & & \\
& & \ddots & \\
& & & I
\end{bmatrix}
\]
is invertible, and we have
\[
B_1\widetilde\Omega B_1^T = \begin{bmatrix}
\Omega_1 & & & \\
& \Omega_2 & \Omega_{2,3} & \\
& \Omega_{3,2} & \ddots & \ddots \\
& & \ddots & \Omega_{J,J}
\end{bmatrix},
\]
giving the positive definiteness of $\Omega_2$. Iterating the argument shows that all the $\Omega_j$ are positive definite.

The original paper of Kalman, which is arguably the first systematic presentation of a methodology to combine models with data, is [64]. The continuous time analogue of that work may be found in [65]. The book [53] overviews the subject in the context of time-series analysis and economics. The proof of Corollary 8.5 may be found in [73]. The paper [101] contains an application of the optimality property of the Kalman filter, a property which applies beyond the linear-Gaussian setting to the mean of the filtering distribution in quite general settings. The subject of Kalman smoothing is overviewed in the text [5]; see also [96]. A link between the standard implementation of the smoother and Gauss-Newton methods for MAP estimation is made in [9]. For further details on the Kalman smoother, in both discrete and continuous time, see [73] and [51].

In the previous chapter we showed how the mean of the Kalman filter could be derived through an optimization principle, once the predictive covariance is known.
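As a concrete recap of that linear-Gaussian machinery, the following sketch assembles the block tridiagonal precision matrix of Theorem 8.9 and runs the elimination sweep of Remark 8.10, checking it against a direct solve of $\Omega m = r$ and against Proposition 8.11. All numerical values are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, J = 2, 2, 4

# Illustrative model matrices for a small linear-Gaussian system.
M = 0.9 * np.eye(d)
H = np.eye(k, d)
Sigma = 0.5 * np.eye(d)
Gamma = 0.2 * np.eye(k)
C0 = np.eye(d)
m0 = rng.standard_normal(d)
Y = rng.standard_normal((J, k))          # data y_1, ..., y_J

Si, Gi, C0i = np.linalg.inv(Sigma), np.linalg.inv(Gamma), np.linalg.inv(C0)
blk = lambda j: slice(j * d, (j + 1) * d)

# Assemble the block tridiagonal precision matrix Omega and vector r (Theorem 8.9).
n = d * (J + 1)
Omega = np.zeros((n, n))
r = np.zeros(n)
Omega[blk(0), blk(0)] = C0i + M.T @ Si @ M
for j in range(1, J):
    Omega[blk(j), blk(j)] = Si + M.T @ Si @ M + H.T @ Gi @ H
Omega[blk(J), blk(J)] = Si + H.T @ Gi @ H
for j in range(J):
    Omega[blk(j), blk(j + 1)] = -M.T @ Si
    Omega[blk(j + 1), blk(j)] = -Si @ M
r[blk(0)] = C0i @ m0
for j in range(1, J + 1):
    r[blk(j)] = H.T @ Gi @ Y[j - 1]

# Forward elimination on the blocks (8.10), then back substitution (Remark 8.10).
Om = [Omega[blk(0), blk(0)]]
z = [r[blk(0)]]
for j in range(J):
    L = Omega[blk(j + 1), blk(j)] @ np.linalg.inv(Om[j])
    Om.append(Omega[blk(j + 1), blk(j + 1)] - L @ Omega[blk(j), blk(j + 1)])
    z.append(r[blk(j + 1)] - L @ z[j])
m = [None] * (J + 1)
m[J] = np.linalg.solve(Om[J], z[J])
for j in range(J - 1, -1, -1):
    m[j] = np.linalg.solve(Om[j], z[j] - Omega[blk(j), blk(j + 1)] @ m[j + 1])
m_rec = np.concatenate(m)

# Sanity check: the sweep reproduces a direct solve of Omega m = r.
m_direct = np.linalg.solve(Omega, r)
```

The cost of the sweep scales linearly in $J$, in contrast with a dense solve, which is the point of exploiting the tridiagonal structure.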
In this chapter we discuss two optimization-based approaches to filtering and smoothing, namely the 3DVAR and 4DVAR methodologies. We emphasize that the methods we present in this chapter do not provide approximations of the filtering and smoothing probability distributions; they simply provide estimates of the signal, given data, in the filtering (online) and smoothing (offline) data scenarios.
\[
\text{Dynamics Model: } v_{j+1} = \Psi(v_j) + \xi_j, \quad j \in \mathbb{Z}^+.
\]
\[
\text{Data Model: } y_{j+1} = Hv_{j+1} + \eta_{j+1}, \quad j \in \mathbb{Z}^+, \text{ for some } H \in \mathbb{R}^{k \times d}.
\]
\[
\text{Probabilistic Structure: } v_0 \sim \mathcal{N}(m_0, C_0), \quad \xi_j \sim \mathcal{N}(0, \Sigma), \quad \eta_j \sim \mathcal{N}(0, \Gamma),
\]
with $v_0$, $\{\xi_j\}$ and $\{\eta_j\}$ mutually independent.

We introduce 3DVAR by analogy with the update formula (8.4) for the Kalman filter, and its derivation through optimization from section 8.1.2. The primary differences between 3DVAR and the Kalman filter mean update are that $\Psi(\cdot)$ can be nonlinear for 3DVAR, and that for 3DVAR we have no closed update formula for the covariances. To deal with this second issue 3DVAR uses a fixed predicted covariance, independent of time $j$ and pre-specified. The resulting minimization problem, and its solution, is described in Table 9.1, making the analogy with the Kalman filter. Note that the minimization itself is of a quadratic functional, and so may be solved by means of linear algebra. The constraint formulation used for the Kalman filter, in section 8.1.2, may also be applied and used to derive the mean update formula.

Kalman Filter:
\[
m_{j+1} = \arg\min_v \mathsf{J}(v), \quad
\mathsf{J}(v) = \tfrac12|y_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}}, \quad
\widehat{m}_{j+1} = Mm_j, \quad
m_{j+1} = (I - K_{j+1}H)\widehat{m}_{j+1} + K_{j+1}y_{j+1}.
\]
3DVAR:
\[
m_{j+1} = \arg\min_v \mathsf{J}(v), \quad
\mathsf{J}(v) = \tfrac12|y_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{m}_{j+1}|^2_{\widehat{C}}, \quad
\widehat{m}_{j+1} = \Psi(m_j), \quad
m_{j+1} = (I - KH)\widehat{m}_{j+1} + Ky_{j+1}.
\]
Table 9.1
Comparison of Kalman Filter and 3DVAR update formulae.
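The 3DVAR scheme of Table 9.1 is simple to implement once the fixed gain is chosen. The following sketch uses an illustrative nonlinear map and a fixed predicted covariance; all numerical choices are assumptions for the purpose of the example:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 3, 2

# Illustrative choices: a contractive nonlinear dynamics map Psi, a fixed
# predicted covariance C_hat, and the resulting fixed gain K.
Psi = lambda v: 0.8 * v + 0.1 * np.sin(v)
H = np.eye(k, d)                 # observe the first two components
C_hat = np.eye(d)
Gamma = 0.1 * np.eye(k)

S = H @ C_hat @ H.T + Gamma
K = C_hat @ H.T @ np.linalg.solve(S, np.eye(k))

def threedvar_step(m, y):
    """One 3DVAR update: predict with Psi, correct with the fixed gain K."""
    m_hat = Psi(m)
    return (np.eye(d) - K @ H) @ m_hat + K @ y

# Track a true signal observed through H with small noise.
v = rng.standard_normal(d)
m = np.zeros(d)
for _ in range(50):
    v = Psi(v)
    y = H @ v + 0.01 * rng.standard_normal(k)
    m = threedvar_step(m, y)
```

Because the gain is fixed, each step costs only one dynamics evaluation and one matrix-vector correction; no covariances are propagated online.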
The Kalman gain $K$ for 3DVAR is fixed, because the predicted covariance $\widehat{C}$ is fixed. By analogy with Corollary 8.5 we have the following formulae for the 3DVAR gain matrix $K$, and the update formula for the estimator $m_j$:
\[
S = H\widehat{C}H^T + \Gamma, \qquad
K = \widehat{C}H^TS^{-1}, \qquad
m_{j+1} = (I - KH)\Psi(m_j) + Ky_{j+1}.
\]
The method also delivers an implied analysis covariance $C = (I - KH)\widehat{C}$. Note that the resulting algorithm mapping $m_j$ to $m_{j+1}$ may be specified directly in terms of the gain $K$, without need to introduce $\widehat{C}$, $C$ and $S$. In the remainder of this section we simply view $K$ as fixed and given. In this setting we show that the 3DVAR algorithm produces accurate state estimation under vanishing noise assumptions in the dynamics/data model. The governing assumptions concerning the dynamics/data model are encapsulated in:

Assumption 9.1.
Consider the dynamics/data model under the assumptions that $\xi_j \equiv 0$ and $\Gamma = \gamma^2\Gamma_0$ with $|\Gamma_0| = 1$, and assume that the data $y_{j+1}$ used in the 3DVAR algorithm is found from observing a true signal $\{v^\dagger_j\}$ given by
\[
\text{Dynamics Model: } v^\dagger_{j+1} = \Psi(v^\dagger_j), \quad j \in \mathbb{Z}^+.
\]
\[
\text{Data Model: } y_{j+1} = Hv^\dagger_{j+1} + \gamma\eta^\dagger_{j+1,0}, \quad j \in \mathbb{Z}^+.
\]

With this assumption of noise-free dynamics ($\xi_j \equiv 0$) we deduce that the 3DVAR filter produces output which, asymptotically, has an error of the same size as the observational noise level $\gamma$. The key additional assumption in the theorem that allows this deduction is a relationship between the Kalman gain $K$ and the derivative $D\Psi(\cdot)$ of the dynamics model. Encoded in the assumption are two ingredients: that the observation operator $H$ is rich enough, in principle, to learn enough components of the system to synchronize the whole system; and that $K$ is designed cleverly enough to effect this synchronization. The proof of the theorem simply uses these two ingredients and then controls the small stochastic perturbations arising from Assumption 9.1.

Theorem 9.2 (Accuracy of 3DVAR). Let Assumption 9.1 hold with $\{\eta^\dagger_{j,0}\} \sim \mathcal{N}(0, \Gamma_0)$ an i.i.d. sequence. Assume that, for the gain matrix $K$ appearing in the 3DVAR method, there exist a norm $\|\cdot\|$ on $\mathbb{R}^d$ and a constant $\lambda \in (0, 1)$ such that, for all $v \in \mathbb{R}^d$,
\[
\|(I - KH)D\Psi(v)\| \le \lambda.
\]
Then there is a constant $c > 0$ such that the 3DVAR algorithm satisfies the following large-time asymptotic error bound:
\[
\limsup_{j \to \infty} \mathbb{E}\bigl[\|m_j - v^\dagger_j\|\bigr] \le \frac{c\gamma}{1 - \lambda},
\]
where the expectation is taken with respect to the sequence $\{\eta^\dagger_{j,0}\}$.
Proof.
We have
\[
v^\dagger_{j+1} = \Psi(v^\dagger_j), \qquad m_{j+1} = (I - KH)\Psi(m_j) + Ky_{j+1},
\]
and hence that
\[
v^\dagger_{j+1} = (I - KH)\Psi(v^\dagger_j) + KH\Psi(v^\dagger_j), \qquad
m_{j+1} = (I - KH)\Psi(m_j) + KH\Psi(v^\dagger_j) + \gamma K\eta^\dagger_{j+1,0}.
\]
Define $e_j = m_j - v^\dagger_j$. Subtracting the evolution equation for $v^\dagger_j$ from that for $m_j$ we obtain, using the mean value theorem,
\[
e_{j+1} = m_{j+1} - v^\dagger_{j+1}
= (I - KH)\bigl(\Psi(m_j) - \Psi(v^\dagger_j)\bigr) + \gamma K\eta^\dagger_{j+1,0}
= \Bigl((I - KH)\int_0^1 D\Psi\bigl(sm_j + (1-s)v^\dagger_j\bigr)\,ds\Bigr)e_j + \gamma K\eta^\dagger_{j+1,0}.
\]
As a result, by the triangle inequality,
\[
\|e_{j+1}\| \le \Bigl(\int_0^1 \bigl\|(I - KH)D\Psi\bigl(sm_j + (1-s)v^\dagger_j\bigr)\bigr\|\,ds\Bigr)\|e_j\| + \gamma\|K\eta^\dagger_{j+1,0}\|
\le \lambda\|e_j\| + \gamma\|K\eta^\dagger_{j+1,0}\|.
\]
Taking expectations on both sides, we obtain, for $c := \mathbb{E}[\|K\eta^\dagger_{j+1,0}\|] > 0$,
\[
\mathbb{E}[\|e_{j+1}\|] \le \lambda\mathbb{E}[\|e_j\|] + \gamma c. \qquad (9.1)
\]
Using the discrete Gronwall inequality we have that
\[
\mathbb{E}[\|e_j\|] \le \lambda^j\mathbb{E}[\|e_0\|] + \sum_{i=0}^{j-1}c\lambda^i\gamma
\le \lambda^j\mathbb{E}[\|e_0\|] + c\gamma\,\frac{1 - \lambda^j}{1 - \lambda}, \qquad (9.2)
\]
where $e_0 = m_0 - v^\dagger_0$. As $\lambda < 1$, the desired statement follows.

Recall that 3DVAR differs from 4DVAR because, whilst also based on an optimization principle, 4DVAR is applied in a distributed fashion over all data in the time interval $j = 1, \dots, J$; in contrast 3DVAR is applied sequentially from time $j-1$ to time $j$, for $j = 1, \dots, J$. We consider two forms of the methodology: weak constraint 4DVAR (w4DVAR), in which the fact that the dynamics model contains randomness is allowed for in the optimization; and 4DVAR (sometimes known as strong constraint 4DVAR), which can be derived from w4DVAR in the limit $\Sigma \to 0$. The w4DVAR estimator is found by minimizing
\[
\mathsf{J}(V) = \tfrac12|v_0 - m_0|^2_{C_0} + \tfrac12\sum_{j=0}^{J-1}|v_{j+1} - \Psi(v_j)|^2_{\Sigma} + \tfrac12\sum_{j=0}^{J-1}|y_{j+1} - Hv_{j+1}|^2_{\Gamma}, \qquad (9.3)
\]
where $V = \{v_j\}_{j=0}^J \in \mathbb{R}^{d(J+1)}$, $Y = \{y_j\}_{j=1}^J \in \mathbb{R}^{kJ}$, $v_j \in \mathbb{R}^d$, $y_j \in \mathbb{R}^k$, $H$ is the observation operator, $\Sigma$ is the random dynamical system covariance, $\Gamma$ is the data noise covariance, and $m_0$ and $C_0$ are the mean and covariance of the initial state. The three terms in the objective function enforce, in turn, the initial condition $v_0$, the dynamics model and the data model. Note that, because $\Psi$ is nonlinear, the objective is not quadratic and cannot be optimized in closed form. In contrast, each step of 3DVAR required solution of a quadratic optimization problem, tractable in closed form.

Theorem 9.3 (Minimizer Exists for w4DVAR). Assume that $\Psi$ is bounded and continuous. Then $\mathsf{J}$ has a minimizer, which is a MAP estimator for the smoothing problem.
Proof. Recall Theorem 3.5, which shows that the MAP estimator based on the smoothing distribution $\mathbb{P}(V|Y) \propto \exp(-\mathsf{J}(V))$ is attained provided that $\mathsf{J}$ is non-negative, continuous, and satisfies $\mathsf{J}(V) \to \infty$ as $|V| \to \infty$. Now, the objective $\mathsf{J}$ defined by equation (9.3) is clearly non-negative, and it is continuous since $\Psi$ is assumed to be continuous. It remains to show that $\mathsf{J}(V) \to \infty$ as $|V| \to \infty$. Let $R$ be a bound for $\Psi$, so that $|\Psi(v_j)|_{\Sigma} \le R$ for all $v_j \in \mathbb{R}^d$.
Then, since
\[
\mathsf{J}(V) \ge \tfrac12|v_0|^2_{C_0} - |v_0|_{C_0}|m_0|_{C_0} + \sum_{j=0}^{J-1}\Bigl(\tfrac12|v_{j+1}|^2_{\Sigma} - R|v_{j+1}|_{\Sigma}\Bigr),
\]
it follows that $\mathsf{J}(V) \to \infty$ as $|V| \to \infty$, and the proof is complete.

We now consider the vanishing dynamical noise limit of w4DVAR. This is to minimize
\[
\mathsf{J}(V) = \tfrac12|v_0 - m_0|^2_{C_0} + \tfrac12\sum_{j=0}^{J-1}|y_{j+1} - Hv_{j+1}|^2_{\Gamma}
\]
subject to the hard constraint that $v_{j+1} = \Psi(v_j)$, $j = 0, \dots, J-1$. This is 4DVAR. Note that by using the constraint, 4DVAR can be written as a minimization over $v_0$ alone, rather than over the entire sequence $\{v_j\}_{j=0}^J$ as is required in w4DVAR. We let $\mathsf{J}_\sigma$ denote the objective function $\mathsf{J}$ from w4DVAR in the case where $\Sigma = \sigma^2\Sigma_0$. Roughly speaking, the following result shows that minimizers of $\mathsf{J}_\sigma$ converge, as $\sigma \to 0$, to points in $\mathbb{R}^{d(J+1)}$ which satisfy the hard constraint associated with 4DVAR.

Theorem 9.4 (Small Signal Noise Limit of w4DVAR). Suppose that $\Psi$ is bounded and continuous and let $V^\sigma$ be a minimizer of $\mathsf{J}_\sigma$. Then as $\sigma \to 0$ there is a convergent subsequence of $V^\sigma$ with limit $V^*$ satisfying $v^*_{j+1} = \Psi(v^*_j)$.
Proof. Throughout this proof $c$ is a constant which may change from instance to instance, but is independent of $\sigma$. Consider $V \in \mathbb{R}^{d(J+1)}$ defined by $v_0 = m_0$ and $v_{j+1} = \Psi(v_j)$. Then $V$ is bounded, as $\Psi(\cdot)$ is bounded, and the bound is independent of $\sigma$. Furthermore
\[
\mathsf{J}_\sigma(V) = \tfrac12\sum_{j=0}^{J-1}|y_{j+1} - Hv_{j+1}|^2_{\Gamma} \le c,
\]
where $c$ is independent of $\sigma$. It follows that $\mathsf{J}_\sigma(V^\sigma) \le \mathsf{J}_\sigma(V) \le c$. Thus
\[
\tfrac12|v^\sigma_{j+1} - \Psi(v^\sigma_j)|^2_{\Sigma_0} = \sigma^2 \cdot \tfrac12|v^\sigma_{j+1} - \Psi(v^\sigma_j)|^2_{\Sigma} \le \sigma^2\mathsf{J}_\sigma(V^\sigma) \le \sigma^2 c,
\qquad
\tfrac12|v^\sigma_0 - m_0|^2_{C_0} \le \mathsf{J}_\sigma(V^\sigma) \le c.
\]
Since $\Psi$ is bounded, these bounds imply that $|V^\sigma| \le c$. Therefore there is a limit point $V^*$, with $V^\sigma \to V^*$ along a subsequence. By continuity,
\[
0 \le \tfrac12|v^*_{j+1} - \Psi(v^*_j)|^2_{\Sigma_0} \leftarrow \tfrac12|v^\sigma_{j+1} - \Psi(v^\sigma_j)|^2_{\Sigma_0} \le \sigma^2 c.
\]
Letting $\sigma \to 0$ along the subsequence gives $v^*_{j+1} = \Psi(v^*_j)$.

The 3DVAR and 4DVAR methodologies, in the context of weather forecasting, are discussed in [78] and [38] respectively.
The accuracy analysis presented here is similar to that which first appeared in the papers [14], [87] and was developed further in [72, 74]. It arises from considering stochastic perturbations of the seminal work of Titi and co-workers, exemplified by the paper [54]. For an overview of variational methods, and their links to problems in physics and mechanics, see the book [1] and the references therein; see also the paper [15].
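Since the w4DVAR objective (9.3) is a plain nonlinear least-squares functional, it can be minimized with any generic optimizer. The following sketch does so for a scalar model; the dynamics map, noise levels, and use of SciPy's BFGS routine are all illustrative assumptions, and the result is only a candidate (possibly local) MAP estimate:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
J_steps = 5

# Bounded continuous scalar dynamics (illustrative), linear observation y = v + noise.
Psi = np.sin
m0, C0, Sigma, Gamma = 0.5, 1.0, 0.1, 0.05

# Synthetic data from a noisy trajectory of the model.
v = m0
ys = []
for _ in range(J_steps):
    v = Psi(v) + np.sqrt(Sigma) * rng.standard_normal()
    ys.append(v + np.sqrt(Gamma) * rng.standard_normal())

def objective(V):
    """w4DVAR objective (9.3): initial-condition, dynamics, and data misfits."""
    val = 0.5 * (V[0] - m0) ** 2 / C0
    for j in range(J_steps):
        val += 0.5 * (V[j + 1] - Psi(V[j])) ** 2 / Sigma
        val += 0.5 * (ys[j] - V[j + 1]) ** 2 / Gamma
    return val

res = minimize(objective, np.zeros(J_steps + 1), method="BFGS")
V_map = res.x   # candidate MAP estimate of the whole trajectory v_0, ..., v_J
```

Because the objective is non-convex in general (as noted above for nonlinear $\Psi$), different initial guesses may yield different local minimizers.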
10 The Extended and Ensemble Kalman Filters
In this chapter we describe the Extended Kalman Filter (ExKF) and the Ensemble Kalman Filter (EnKF). The ExKF approximates the predictive covariance by linearization, while the EnKF approximates it by the empirical covariance of a collection of particles. The status of the two methods in relation to the true filtering distribution is as follows. The ExKF is a provably accurate approximation of the filtering distribution in situations in which small noise is present in both signal and data, and the filtering distribution is well-approximated by a Gaussian. The EnKF is also in principle a good approximation of the filtering distribution in this situation, if a large number of particles is used. However, the EnKF is typically deployed for problems in which the use of a sufficiently large number of particles is impractical; it is then better viewed as an online optimizer, in the spirit of 3DVAR, but using multiple particles to better estimate the covariances appearing in the quadratic objective function which is minimized to find particle updates.

Throughout this chapter we consider the setting in which 3DVAR was introduced and may be applied: the dynamics model is nonlinear, but the observation operator is linear. For purposes of exposition we summarize it again here:
\[
v_{j+1} = \Psi(v_j) + \xi_j, \qquad \xi_j \sim \mathcal{N}(0, \Sigma) \text{ i.i.d.},
\]
\[
y_{j+1} = Hv_{j+1} + \eta_{j+1}, \qquad \eta_j \sim \mathcal{N}(0, \Gamma) \text{ i.i.d.},
\]
with $v_0 \sim \mathcal{N}(m_0, C_0)$ independent of the i.i.d. sequences $\{\xi_j\}$ and $\{\eta_j\}$. Throughout this chapter we assume that $v_j \in \mathbb{R}^d$ and $y_j \in \mathbb{R}^k$.

The ExKF is derived by applying the Kalman methodology, using linearization to propagate the covariance $C_j$ to the predictive covariance $\widehat{C}_{j+1}$. Table 10.1 summarizes the idea, after which we calculate the formulae required in full detail. (The extended Kalman filter is often termed the EKF in the literature, a terminology introduced before the existence of the EnKF; we find it useful to write ExKF to unequivocally distinguish it from the EnKF.)
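Anticipating the formulae (10.4)-(10.6) derived below, a single ExKF step for a scalar model can be sketched as follows; the dynamics map and noise levels are illustrative assumptions:

```python
import numpy as np

# Scalar dynamics Psi(v) = sin(v) (illustrative), with Jacobian DPsi,
# and linear observation y = H v + noise.
Psi = np.sin
DPsi = np.cos
H, Sigma, Gamma = 1.0, 0.01, 0.01

def exkf_step(m, C, y):
    # Prediction (10.4), (10.5): propagate the mean through Psi,
    # and the covariance by linearization at the mean.
    m_hat = Psi(m)
    C_hat = DPsi(m) * C * DPsi(m) + Sigma
    # Analysis (10.6): the standard Kalman update with the linear observation.
    S = H * C_hat * H + Gamma
    K = C_hat * H / S
    m_new = (1 - K * H) * m_hat + K * y
    C_new = (1 - K * H) * C_hat
    return m_new, C_new

m, C = 0.3, 1.0
m, C = exkf_step(m, C, y=0.25)
```

In higher dimensions the scalar products become matrix products, and the Jacobian `DPsi` must be evaluated (or approximated) at the current mean, which is the expensive step the EnKF is designed to avoid.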
Kalman Filter:
\[
m_{j+1} = \arg\min_v \mathsf{J}(v), \quad
\mathsf{J}(v) = \tfrac12|y_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}}, \quad
\widehat{m}_{j+1} = Mm_j, \quad
\widehat{C}_{j+1} \text{ update exact}, \quad
m_{j+1} = (I - K_{j+1}H)\widehat{m}_{j+1} + K_{j+1}y_{j+1}.
\]
ExKF:
\[
m_{j+1} = \arg\min_v \mathsf{J}(v), \quad
\mathsf{J}(v) = \tfrac12|y_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}}, \quad
\widehat{m}_{j+1} = \Psi(m_j), \quad
\widehat{C}_{j+1} \text{ update by linearization}, \quad
m_{j+1} = (I - K_{j+1}H)\widehat{m}_{j+1} + K_{j+1}y_{j+1}.
\]
Table 10.1
Comparison of Kalman Filter and ExKF update formulae.

We first recall the Kalman filter update formulae and their derivation. We have
\[
\widehat{v}_{j+1} = Mv_j + \xi_j, \qquad v_j \sim \mathcal{N}(m_j, C_j), \qquad \xi_j \sim \mathcal{N}(0, \Sigma). \qquad (10.1)
\]
From this we deduce, by taking expectations, that
\[
\widehat{m}_{j+1} = \mathbb{E}[\widehat{v}_{j+1}|Y_j] = \mathbb{E}[Mv_j + \xi_j|Y_j] = \mathbb{E}[Mv_j|Y_j] + \mathbb{E}[\xi_j|Y_j] = Mm_j. \qquad (10.2)
\]
The covariance update is derived as follows:
\[
\widehat{C}_{j+1} = \mathbb{E}\bigl[(\widehat{v}_{j+1} - \widehat{m}_{j+1}) \otimes (\widehat{v}_{j+1} - \widehat{m}_{j+1})\,\big|\,Y_j\bigr]
= \mathbb{E}\bigl[(M(v_j - m_j) + \xi_j) \otimes (M(v_j - m_j) + \xi_j)\,\big|\,Y_j\bigr]
\]
\[
= \mathbb{E}\bigl[(M(v_j - m_j)) \otimes (M(v_j - m_j))\,\big|\,Y_j\bigr] + \mathbb{E}[\xi_j \otimes \xi_j|Y_j]
+ \mathbb{E}\bigl[(M(v_j - m_j)) \otimes \xi_j\,\big|\,Y_j\bigr] + \mathbb{E}\bigl[\xi_j \otimes (M(v_j - m_j))\,\big|\,Y_j\bigr] \qquad (10.3)
\]
\[
= M\,\mathbb{E}\bigl[(v_j - m_j) \otimes (v_j - m_j)\,\big|\,Y_j\bigr]M^T + \Sigma
= MC_jM^T + \Sigma.
\]
For the ExKF, the prediction map $\Psi$ is no longer linear. But since $\xi_j$ is independent of $Y_j$ and $v_j$, we obtain
\[
\widehat{m}_{j+1} = \mathbb{E}[\Psi(v_j) + \xi_j|Y_j] = \mathbb{E}[\Psi(v_j)|Y_j] + \mathbb{E}[\xi_j|Y_j] = \mathbb{E}[\Psi(v_j)|Y_j].
\]
If we assume that the fluctuations of $v_j$ around its mean $m_j$ (conditional on data) are small, then a reasonable approximation is to take $\Psi(v_j) \approx \Psi(m_j)$, so that
\[
\widehat{m}_{j+1} = \Psi(m_j). \qquad (10.4)
\]
For the predictive covariance we use linearization; we have
\[
\widehat{C}_{j+1} = \mathbb{E}\bigl[(\widehat{v}_{j+1} - \widehat{m}_{j+1}) \otimes (\widehat{v}_{j+1} - \widehat{m}_{j+1})\,\big|\,Y_j\bigr]
= \mathbb{E}\bigl[(\Psi(v_j) - \Psi(m_j) + \xi_j) \otimes (\Psi(v_j) - \Psi(m_j) + \xi_j)\,\big|\,Y_j\bigr]
\]
\[
= \mathbb{E}\bigl[(\Psi(v_j) - \Psi(m_j)) \otimes (\Psi(v_j) - \Psi(m_j))\,\big|\,Y_j\bigr] + \Sigma
\approx D\Psi(m_j)\,\mathbb{E}\bigl[(v_j - m_j) \otimes (v_j - m_j)\,\big|\,Y_j\bigr]\,D\Psi(m_j)^T + \Sigma,
\]
and so, again assuming that fluctuations of $v_j$ around its mean $m_j$ (conditional on data) are small, we invoke the approximation
\[
\widehat{C}_{j+1} = D\Psi(m_j)C_jD\Psi(m_j)^T + \Sigma. \qquad (10.5)
\]
To be self-consistent, $\Sigma$ itself should be small. The analysis step is the same as for the Kalman filter:
\[
S_{j+1} = H\widehat{C}_{j+1}H^T + \Gamma, \qquad
K_{j+1} = \widehat{C}_{j+1}H^TS_{j+1}^{-1}, \qquad
m_{j+1} = (I - K_{j+1}H)\widehat{m}_{j+1} + K_{j+1}y_{j+1}, \qquad
C_{j+1} = (I - K_{j+1}H)\widehat{C}_{j+1}. \qquad (10.6)
\]
Thus the overall ExKF comprises equations (10.4), (10.5) and (10.6). Unlike the Kalman filter, for the extended Kalman filter the maps $C_j \mapsto \widehat{C}_{j+1} \mapsto C_{j+1}$ depend on the observed data, through the dependence of the predictive covariance on the filter mean. To be self-consistent with the "small fluctuations around the mean" assumptions made in the derivation of the ExKF, $\Sigma$ and $\Gamma$ should both be small. The analysis step can also be defined by
\[
C_{j+1}^{-1} = \widehat{C}_{j+1}^{-1} + H^T\Gamma^{-1}H, \qquad m_{j+1} = \arg\min_v \mathsf{J}(v),
\]
where
\[
\mathsf{J}(v) = \tfrac12|y_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}} \qquad (10.7)
\]
and $\widehat{m}_{j+1}$, $\widehat{C}_{j+1}$ are calculated as above in the prediction steps (10.4), (10.5). The constraint formulation of the minimization problem, derived for the Kalman filter in section 8.1.2, may also be used to derive the update formulae above.

When the dynamical system is in high dimension, evaluation and storage of the predicted covariance, and in particular of the Jacobian required for the update formula (10.5), becomes computationally expensive for the ExKF. The EnKF was developed to overcome this issue. The basic idea is to maintain an ensemble of particles, and to use their empirical covariance within a Kalman-type update. The method is summarized in Table 10.2.
It may be thought of as an ensemble 3DVAR technique in which a collection of particles is generated similarly to 3DVAR, but the particles interact through an ensemble estimate of their covariance.

Kalman Filter:
\[
m_{j+1} = \arg\min_v \mathsf{J}(v), \quad
\mathsf{J}(v) = \tfrac12|y_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{m}_{j+1}|^2_{\widehat{C}_{j+1}}, \quad
\widehat{m}_{j+1} = Mm_j, \quad
\widehat{C}_{j+1} \text{ update exact}, \quad
m_{j+1} = (I - K_{j+1}H)\widehat{m}_{j+1} + K_{j+1}y_{j+1}.
\]
EnKF:
\[
v^{(n)}_{j+1} = \arg\min_v \mathsf{J}_n(v), \quad
\mathsf{J}_n(v) = \tfrac12|y^{(n)}_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{v}^{(n)}_{j+1}|^2_{\widehat{C}_{j+1}}, \quad
\widehat{v}^{(n)}_{j+1} = \Psi(v^{(n)}_j) + \xi^{(n)}_j, \quad
\widehat{C}_{j+1} \text{ update by ensemble estimate}, \quad
v^{(n)}_{j+1} = (I - K_{j+1}H)\widehat{v}^{(n)}_{j+1} + K_{j+1}y^{(n)}_{j+1}.
\]
Table 10.2
Comparison of Kalman Filter and EnKF update formulae.

In the basic form which we present here, the EnKF is applied when $\Psi$ is nonlinear, while the observation operator $H$ is linear. The $N$ particles used at step $j$ are denoted $\{v^{(n)}_j\}_{n=1}^N$. They are all given equal weight, so it is possible, in principle, to make an approximation of the filtering distribution of the form
\[
\pi^N_j(v_j) \approx \frac{1}{N}\sum_{n=1}^N \delta\bigl(v_j - v^{(n)}_j\bigr).
\]
However, the EnKF is typically used with a relatively small number $N$ of particles and may be far from approximating the desired distribution in $\mathbb{R}^d$. It is then better understood as a sequential optimization method, similar in spirit to 3DVAR, as described above; this is our perspective.

The states of all the particles at time $j+1$ are predicted to give $\{\widehat{v}^{(n)}_{j+1}\}_{n=1}^N$ using the dynamical model. The resulting empirical covariance is then used to define the per-particle objective function (10.11) which is minimized in order to perform the analysis step and obtain $\{v^{(n)}_{j+1}\}_{n=1}^N$. The updates are denoted schematically by
\[
\{v^{(n)}_j\}_{n=1}^N \;\xrightarrow{\,(p)\,}\; \{\widehat{v}^{(n)}_{j+1}\}_{n=1}^N \;\xrightarrow{\,(a)\,}\; \{v^{(n)}_{j+1}\}_{n=1}^N.
\]
We now detail these two steps.

(p) prediction:
\[
\widehat{v}^{(n)}_{j+1} = \Psi(v^{(n)}_j) + \xi^{(n)}_j, \quad n = 1, \dots, N, \qquad (10.8a)
\]
\[
\widehat{m}_{j+1} = \frac{1}{N}\sum_{n=1}^N \widehat{v}^{(n)}_{j+1}, \qquad (10.8b)
\]
\[
\widehat{C}_{j+1} = \frac{1}{N}\sum_{n=1}^N \bigl(\widehat{v}^{(n)}_{j+1} - \widehat{m}_{j+1}\bigr) \otimes \bigl(\widehat{v}^{(n)}_{j+1} - \widehat{m}_{j+1}\bigr). \qquad (10.8c)
\]
Here $\xi^{(n)}_j \sim \mathcal{N}(0, \Sigma)$ i.i.d.

(a) analysis:
\[
S_{j+1} = H\widehat{C}_{j+1}H^T + \Gamma, \qquad (10.9a)
\]
\[
K_{j+1} = \widehat{C}_{j+1}H^TS_{j+1}^{-1}, \qquad (10.9b)
\]
\[
y^{(n)}_{j+1} = y_{j+1} + s\eta^{(n)}_{j+1}, \quad n = 1, \dots, N, \qquad (10.9c)
\]
\[
v^{(n)}_{j+1} = (I - K_{j+1}H)\widehat{v}^{(n)}_{j+1} + K_{j+1}y^{(n)}_{j+1}, \quad n = 1, \dots, N. \qquad (10.9d)
\]
Here we take $\eta^{(n)}_j \sim \mathcal{N}(0, \Gamma)$ i.i.d. The constant $s$ takes value 0 or 1. When $s = 1$ the $y^{(n)}_{j+1}$ are referred to as perturbed observations. The analysis step may be written as
\[
v^{(n)}_{j+1} = \arg\min_v \mathsf{J}_n(v), \qquad (10.10)
\]
where
\[
\mathsf{J}_n(v) := \tfrac12|y^{(n)}_{j+1} - Hv|^2_{\Gamma} + \tfrac12|v - \widehat{v}^{(n)}_{j+1}|^2_{\widehat{C}_{j+1}} \qquad (10.11)
\]
and the predicted mean and covariance are given by step (p). Note that $\widehat{C}_{j+1}$ is typically not invertible: it is a matrix of rank at most $N$, and $N$ is usually less than the dimension $d$ of the space on which $\widehat{C}_{j+1}$ acts, since the typical use of ensemble methods is for high dimensional state estimation with a small ensemble size. The minimizing solution can be found by regularizing $\widehat{C}_{j+1}$ through addition of $\epsilon I$ for $\epsilon > 0$, deriving the update equations as above for step (a), and then letting $\epsilon \to 0$. Alternatively, the constraint formulation of the minimization problem, derived for the Kalman filter in section 8.1.2, may also be used to derive the update formulae above.
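One prediction/analysis cycle (10.8)-(10.9), with perturbed observations ($s = 1$), can be sketched as follows; the dynamics map, dimensions, and noise levels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, N = 10, 5, 20

# Illustrative nonlinear dynamics and linear observation operator.
Psi = lambda v: 0.9 * v + 0.05 * np.tanh(v)
H = np.eye(k, d)
Sigma = 0.01 * np.eye(d)
Gamma = 0.1 * np.eye(k)

def enkf_step(ensemble, y, s=1):
    # Prediction (10.8): propagate each particle, form mean and empirical covariance.
    V_hat = Psi(ensemble) + rng.multivariate_normal(np.zeros(d), Sigma, size=N)
    m_hat = V_hat.mean(axis=0)
    E = V_hat - m_hat
    C_hat = E.T @ E / N
    # Analysis (10.9): Kalman-type update with perturbed observations (s = 1).
    S = H @ C_hat @ H.T + Gamma
    K = C_hat @ H.T @ np.linalg.inv(S)
    Y_pert = y + s * rng.multivariate_normal(np.zeros(k), Gamma, size=N)
    return V_hat @ (np.eye(d) - K @ H).T + Y_pert @ K.T

ensemble = rng.standard_normal((N, d))   # rows are particles
y = rng.standard_normal(k)
ensemble = enkf_step(ensemble, y)
```

Note that $S_{j+1}$ is invertible because $\Gamma > 0$, even though the rank-$N$ matrix $\widehat{C}_{j+1}$ is not; the update never requires inverting $\widehat{C}_{j+1}$ itself.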
We now give another way to think of, and exploit in algorithms, the low rank property of $\widehat{C}_{j+1}$. Note that $\mathsf{J}_n(v)$ is undefined unless $v - \widehat{v}^{(n)}_{j+1} = \widehat{C}_{j+1}a$ for some $a \in \mathbb{R}^d$. From the structure of $\widehat{C}_{j+1}$ given in (10.8c) it follows that
\[
v = \widehat{v}^{(n)}_{j+1} + \frac{1}{N}\sum_{m=1}^N b_m\bigl(\widehat{v}^{(m)}_{j+1} - \widehat{m}_{j+1}\bigr) \qquad (10.12)
\]
for some unknown parameters $\{b_m\}_{m=1}^N$, to be determined. (Note that the vector $\{b_m\}_{m=1}^N$ depends on the ensemble index $n$; we have suppressed this dependence for notational convenience.) This form for $v$ can be substituted into (10.11) to obtain a functional $\mathsf{I}_n(b)$ to be minimized over $b \in \mathbb{R}^N$. We re-emphasize that $N$ will typically be much smaller than $d$, the state-space dimension. Once $b$ is determined it may be substituted back into (10.12) to obtain the solution to the minimization problem.

To dig a little deeper into this calculation we define $e^{(m)} = \widehat{v}^{(m)}_{j+1} - \widehat{m}_{j+1}$ and note that then
\[
\widehat{C}_{j+1} = \frac{1}{N}\sum_{m=1}^N e^{(m)} \otimes e^{(m)}.
\]
Since
\[
\widehat{C}_{j+1}a = \frac{1}{N}\sum_{m=1}^N \langle e^{(m)}, a\rangle\, e^{(m)},
\]
we deduce that the equation $v - \widehat{v}^{(n)}_{j+1} = \widehat{C}_{j+1}a$ is solved by taking $b_m = \langle e^{(m)}, a\rangle$. Now note that
\[
\tfrac12|v - \widehat{v}^{(n)}_{j+1}|^2_{\widehat{C}_{j+1}} = \tfrac12\langle a, \widehat{C}_{j+1}a\rangle = \frac{1}{2N}\sum_{m=1}^N b_m^2.
\]
We define
\[
\mathsf{I}_n(b) := \tfrac12\Bigl|y^{(n)}_{j+1} - H\widehat{v}^{(n)}_{j+1} - \frac{1}{N}\sum_{m=1}^N b_mH\bigl(\widehat{v}^{(m)}_{j+1} - \widehat{m}_{j+1}\bigr)\Bigr|^2_{\Gamma} + \frac{1}{2N}\sum_{m=1}^N b_m^2. \qquad (10.13)
\]
We have shown:

Theorem 10.1 (Implementation of EnKF in $N$-Dimensional Subspace). Given the prediction step (p) defined by (10.8a), the Kalman update formulae (a) may be found by minimizing $\mathsf{I}_n(b)$ with respect to $b$ and substituting into (10.12).

The development and theory of the extended Kalman filter is documented in the text [61]. A methodology for analyzing evolving probability distributions with small variance, and establishing the validity of the Gaussian approximation, is described in [102]. The use of the ExKF for weather forecasting was proposed in [45].
However, the dimension of the state space in most geophysical applications renders the extended Kalman filter impractical. The ensemble Kalman filter provided an innovation with far-reaching consequences in geophysical applications, because it allowed for the use of partial empirical correlation information, without the computation of the full covariance. An overview of ensemble Kalman methods may be found in the book [37], including a historical perspective on the subject, originating from papers of Evensen and Van Leeuwen in the mid 1990s [33, 35]; a similar idea was also developed by Houtekamer within the Canadian meteorological service around the same time [34, 56]. The presentation of the ensemble Kalman filter as a smart optimization tool is also developed in [73], but the derivation of the update equations in a space whose dimension is that of the ensemble is not described there. The analysis of ensemble methods is difficult and theory is only just starting to emerge. In the linear case the method converges in the large ensemble limit to the Kalman filter [49], but in the nonlinear case the limit does not reproduce the filtering distribution [32]. In any case the primary advantage of ensemble methods is that they can provide good state estimation when the number of particles is not large; this subject is discussed in [50, 69, 111, 112].
11 Particle Filter
This chapter is devoted to the particle filter, a method that approximates the filtering distribution by a sum of Dirac measures. Particle filters provably converge to the filtering distribution as the number of Dirac measures approaches infinity. We focus on the bootstrap particle filter, also known as sequential importance resampling; it is linked to the material on Monte Carlo and importance sampling described in Chapter 5. We note that the Kalman filter completely characterizes the filtering distribution, but only in the linear-Gaussian setting; the Kalman-based methods introduced in the two previous chapters for use outside the linear-Gaussian setting are built by approximating the predictive distribution via a Gaussian ansatz, and then using the Kalman formulae for the analysis step. Similarly, particle filters approximate the predictive distribution, in their case by a particle approximation measure, and then solve the analysis step exactly. In this light, both the Kalman-based methods (with linear observations) and the bootstrap particle filter perform an exact application of Bayes formula, but starting from approximate ("wrong") priors. However, particle filters have the potential to recover an accurate approximation of the filtering distribution in nonlinear, non-Gaussian settings, provided that the number of particles is large enough; their drawback is that they tend to struggle in high dimensional problems.
Let us return to the setting in which we introduced filtering and smoothing, with a nonlinear stochastic dynamical system and a nonlinear observation operator, namely the model
\[
v_{j+1} = \Psi(v_j) + \xi_j, \qquad \xi_j \sim \mathcal{N}(0, \Sigma) \text{ i.i.d.},
\]
\[
y_{j+1} = h(v_{j+1}) + \eta_{j+1}, \qquad \eta_j \sim \mathcal{N}(0, \Gamma) \text{ i.i.d.},
\]
with $v_0 \sim \pi_0 := \mathcal{N}(m_0, C_0)$ independent of the i.i.d. sequences $\{\xi_j\}$ and $\{\eta_j\}$. Here $\Psi(\cdot)$ drives the dynamical system and $h(\cdot)$ is the observation operator. Recall that we denote by $Y_j = \{y_1, \dots, y_j\}$ all the data up to time $j$ and by $\pi_j$ the pdf of $v_j|Y_j$, that is, $\pi_j = \mathbb{P}(v_j|Y_j)$. The filtering problem is to determine $\pi_{j+1}$ from $\pi_j$. We may do so in two steps: first, we run forward the Markov chain generated by the stochastic dynamical system (prediction), and second, we incorporate the data by an application of Bayes theorem (analysis).

For the prediction step, we define the operator $\mathcal{P}$ acting on a pdf $\pi$ as an application of the Markov kernel:
\[
(\mathcal{P}\pi)(v) = \int_{\mathbb{R}^d} p(u, v)\pi(u)\,du, \qquad (11.1)
\]
where $p(u, v)$ is the transition pdf of the stochastic dynamics, so that
\[
p(u, v) = \frac{1}{\sqrt{(2\pi)^d\det\Sigma}}\exp\Bigl(-\tfrac12|v - \Psi(u)|^2_{\Sigma}\Bigr).
\]
Thus we obtain $\mathbb{P}(v_{j+1}|Y_j) = \widehat{\pi}_{j+1} = \mathcal{P}\pi_j$. We then define the analysis operator $\mathcal{A}_j$ acting on a pdf $\pi$ to correspond to an application of Bayes theorem, namely
\[
(\mathcal{A}_j\pi)(v) = \frac{\exp\bigl(-\tfrac12|y_{j+1} - h(v)|^2_{\Gamma}\bigr)\pi(v)}{\int_{\mathbb{R}^d}\exp\bigl(-\tfrac12|y_{j+1} - h(v)|^2_{\Gamma}\bigr)\pi(v)\,dv}, \qquad (11.2)
\]
and so finally we obtain
\[
\pi_{j+1} = \mathcal{A}_j\widehat{\pi}_{j+1} = \mathcal{A}_j\mathcal{P}\pi_j.
\]
We now describe a way to numerically approximate, and update, the pdfs $\pi_j$.

The bootstrap particle filter (BPF) can be thought of as performing sequential importance resampling. Let $S^N$ be an operator acting on a pdf $\pi$ by producing an $N$-sample Dirac approximation of $\pi$, that is,
\[
(S^N\pi)(u) = \sum_{n=1}^N w_n\delta\bigl(u - u^{(n)}\bigr),
\]
where $u^{(1)}, \dots, u^{(N)}$ are i.i.d. samples from $\pi$ that are weighted uniformly, i.e. $w_n = \frac{1}{N}$. Note that $S^N\pi = \pi^N_{MC}$, as introduced in Chapter 5, equation (5.3). We will use the operator $S^N$ to approximate the measure produced by the Markov kernel step $\mathcal{P}$ within the overall filtering map $\mathcal{A}_j\mathcal{P}$. Note that $S^N$ is a random map taking pdfs into pdfs, if we interpret weighted sums of Diracs as a pdf.

Let $\pi^N_0 = \pi_0 = \mathcal{N}(m_0, C_0)$ and let $\pi^N_j$ denote a particle approximation of the pdf $\pi_j$, to be determined in what follows. We define
\[
\widehat{\pi}^N_{j+1} = S^N\mathcal{P}\pi^N_j;
\]
this is an approximation of $\widehat{\pi}_{j+1}$ from the previous section. We then apply the operator $\mathcal{A}_j$ to act on $\widehat{\pi}^N_{j+1}$ by appropriately reconfiguring the weights $w_n$ according to the data. To understand this reconfiguration of the weights we use the fact that, if
\[
(\mathcal{A}\pi)(v) = \frac{g(v)\pi(v)}{\int_{\mathbb{R}^d}g(v)\pi(v)\,dv}
\]
and if
\[
\pi(v) = \frac{1}{N}\sum_{n=1}^N\delta\bigl(v - v^{(n)}\bigr),
\]
then
\[
(\mathcal{A}\pi)(v) = \sum_{n=1}^N w^{(n)}\delta\bigl(v - v^{(n)}\bigr),
\]
where $\bar{w}^{(n)} = g(v^{(n)})$ and the $w^{(n)}$ are found from the $\bar{w}^{(n)}$ by renormalizing them to sum to one. We use this calculation concerning the application of Bayes formula to sums of Diracs within the following desired approximation of the filtering update formula:
\[
\pi_{j+1} \approx \pi^N_{j+1} = \mathcal{A}_j\widehat{\pi}^N_{j+1} = \mathcal{A}_jS^N\mathcal{P}\pi^N_j.
\]
The steps for the method are summarized in Algorithm 11.1.
Algorithm 11.1
Bootstrap Particle Filter
Input: Initial distribution $\pi^N_0 = \pi_0$, observations $Y_J$, number of particles $N$.
Particle Generation: For $j = 0, 1, \dots, J-1$, perform:
1. Draw $v^{(n)}_j \sim \pi^N_j$ for $n = 1, \dots, N$ i.i.d.
2. Set $\widehat{v}^{(n)}_{j+1} = \Psi(v^{(n)}_j) + \xi^{(n)}_j$ with $\xi^{(n)}_j$ i.i.d. $\mathcal{N}(0, \Sigma)$.
3. Set $\bar{w}^{(n)}_{j+1} = \exp\bigl(-\tfrac12|y_{j+1} - h(\widehat{v}^{(n)}_{j+1})|^2_{\Gamma}\bigr)$.
4. Set $w^{(n)}_{j+1} = \bar{w}^{(n)}_{j+1}\big/\sum_{m=1}^N \bar{w}^{(m)}_{j+1}$.
5. Set $\pi^N_{j+1}(u) = \sum_{n=1}^N w^{(n)}_{j+1}\delta\bigl(u - \widehat{v}^{(n)}_{j+1}\bigr)$.
Output: pdf $\pi^N_J$ that approximates the distribution $\mathbb{P}(v_J|Y_J)$.

We will now show that, under certain conditions, the BPF converges to the true filtering distribution in the limit $N \to \infty$. The proof is similar to that of the Lax Equivalence Theorem from the numerical approximation of evolution equations, part of which is the statement that consistency and stability together imply convergence. For the BPF, consistency refers to a Monte Carlo error estimate, similar to that derived in the chapter on importance sampling, and stability manifests in bounds on the Lipschitz constants of the operators $\mathcal{P}$ and $\mathcal{A}_j$.

Our first step is to define what we mean by convergence; that is, we need a metric on probability measures. Notice that the operators $\mathcal{P}$ and $\mathcal{A}_j$ are deterministic, but the operator $S^N$ is random since it involves sampling. As a consequence the approximate pdfs $\pi^N_j$ are also random. Thus, in fact, we need a metric on random probability measures. To this end, for random probability measures $\pi$ and $\pi'$, we define
\[
d(\pi, \pi') = \sup_{|f|_\infty \le 1}\Bigl(\mathbb{E}\bigl[\bigl(\pi(f) - \pi'(f)\bigr)^2\bigr]\Bigr)^{1/2},
\]
where the expectation is taken over the randomness, in our case the randomness from sampling with $S^N$, and the supremum is taken over all functions $f: \mathbb{R}^d \to [-1, 1]$. Here we have used the notation defined in equation (5.2). The following lemma is straightforward to prove, and provides some useful intuition about the metric.

Lemma 11.1. $d(\cdot, \cdot)$ as defined above does indeed define a metric on random probability measures. Furthermore, when $\pi, \pi'$ are deterministic, we have $d(\pi, \pi') = 2d_{\mathrm{TV}}(\pi, \pi')$.

We now prove three lemmas which together will enable us to prove convergence of the BPF. The first shows consistency; the second and third show stability estimates for $\mathcal{P}$ and $\mathcal{A}_j$ respectively.

Lemma 11.2.
Let $\mathsf{P}$ be the set of probability densities on $\mathbb{R}^d$. Then
$$\sup_{\pi \in \mathsf{P}} d(\pi, S^N \pi) \le \frac{1}{\sqrt{N}}.$$
Proof.
Note that $S^N \pi$ agrees with $\pi^N_{\mathrm{MC}}$ as defined in Chapter 5. Therefore Theorem 5.1 shows that for any $f$ with $|f|_\infty \le 1$ it holds that
$$\mathbb{E}\bigl[ \bigl( S^N \pi(f) - \pi(f) \bigr)^2 \bigr] \le \frac{1}{N}.$$
Taking the square root and the supremum over $f$ and $\pi$ gives the desired result.

Now we prove a stability bound for the operator $\mathcal{P}$ defined in equation (11.1).

Lemma 11.3. $d(\mathcal{P}\pi, \mathcal{P}\pi') \le d(\pi, \pi')$.

Proof.
For $|f|_\infty \le 1$ define $q$ on $\mathbb{R}^d$ by
$$q(v') = \int_{\mathbb{R}^d} p(v', v) f(v) \, dv,$$
where, recall, $p$ denotes the transition pdf associated with the stochastic dynamics. Note that
$$|q(v')| \le \int_{\mathbb{R}^d} p(v', v) \, dv = 1,$$
so that $|q|_\infty \le 1$, and
$$\pi(q) = \int_{\mathbb{R}^d} q(v') \pi(v') \, dv' = \int_{\mathbb{R}^d} \Bigl[ \int_{\mathbb{R}^d} p(v', v) f(v) \, dv \Bigr] \pi(v') \, dv' = \int_{\mathbb{R}^d} \Bigl[ \int_{\mathbb{R}^d} p(v', v) \pi(v') \, dv' \Bigr] f(v) \, dv = \int_{\mathbb{R}^d} (\mathcal{P}\pi)(v) f(v) \, dv$$
by exchanging the order of integration. Consequently, we have $\pi(q) = (\mathcal{P}\pi)(f)$. Finally,
$$d(\mathcal{P}\pi, \mathcal{P}\pi') = \sup_{|f|_\infty \le 1} \Bigl( \mathbb{E}\bigl[ \bigl( (\mathcal{P}\pi)(f) - (\mathcal{P}\pi')(f) \bigr)^2 \bigr] \Bigr)^{1/2} \le \sup_{|q|_\infty \le 1} \Bigl( \mathbb{E}\bigl[ \bigl( \pi(q) - \pi'(q) \bigr)^2 \bigr] \Bigr)^{1/2} = d(\pi, \pi').$$

To prove the next lemma and the main convergence theorem of the BPF below, we will make the following assumption, which encodes the idea of a bound on the observation operator:
Assumption 11.4.
There exists $\kappa \in (0, 1]$ such that, for all $v \in \mathbb{R}^d$ and $j \in \{0, \dots, J-1\}$,
$$\kappa \le g_j(v) \le \kappa^{-1}.$$

It may initially appear strange to use the same constant $\kappa$ in the upper and lower bounds, but recall that $g_j$ is defined only up to a multiplicative constant. Consequently, given any upper and lower bounds, $g_j$ can be scaled to achieve the bound as stated. Relatedly, it is $\kappa^{-2}$ which appears in the stability constant in the next lemma; if $g_j$ is not scaled to produce the same constant $\kappa$ in the upper and lower bounds in Assumption 11.4, then it is the ratio of the upper and lower bounds which would appear in the stability bound.

Lemma 11.5. Let Assumption 11.4 hold. Then $d(\mathcal{A}_j \pi, \mathcal{A}_j \pi') \le \frac{2}{\kappa^2}\, d(\pi, \pi')$.

Proof. Dropping the index $j$ for brevity, and writing $g$ for $g_j$, for any $f$ with $|f|_\infty \le 1$ we have
$$(\mathcal{A}\pi)(f) - (\mathcal{A}\pi')(f) = \frac{\pi(fg)}{\pi(g)} - \frac{\pi'(fg)}{\pi'(g)} = \frac{\pi(fg)}{\pi(g)} - \frac{\pi'(fg)}{\pi(g)} + \frac{\pi'(fg)}{\pi(g)} - \frac{\pi'(fg)}{\pi'(g)} = \frac{1}{\kappa} \biggl( \frac{\pi(\kappa f g) - \pi'(\kappa f g)}{\pi(g)} + \frac{\pi'(fg)}{\pi'(g)} \cdot \frac{\pi'(\kappa g) - \pi(\kappa g)}{\pi(g)} \biggr).$$
Applying Bayes theorem we obtain $\bigl| \pi'(fg)/\pi'(g) \bigr| = |(\mathcal{A}\pi')(f)| \le 1$. Since $g \ge \kappa$ we also have $\pi(g) \ge \kappa$. Therefore,
$$\bigl| (\mathcal{A}\pi)(f) - (\mathcal{A}\pi')(f) \bigr| \le \frac{1}{\kappa^2} \Bigl( \bigl| \pi(\kappa f g) - \pi'(\kappa f g) \bigr| + \bigl| \pi'(\kappa g) - \pi(\kappa g) \bigr| \Bigr).$$
It follows that
$$\mathbb{E}\bigl[ \bigl( (\mathcal{A}\pi)(f) - (\mathcal{A}\pi')(f) \bigr)^2 \bigr] \le \frac{2}{\kappa^4} \Bigl( \mathbb{E}\bigl[ \bigl( \pi(\kappa f g) - \pi'(\kappa f g) \bigr)^2 \bigr] + \mathbb{E}\bigl[ \bigl( \pi'(\kappa g) - \pi(\kappa g) \bigr)^2 \bigr] \Bigr).$$
Since $|\kappa f g|_\infty \le 1$ and $|\kappa g|_\infty \le 1$, each expectation on the right-hand side is bounded by $\sup_{|f|_\infty \le 1} \mathbb{E}\bigl[ ( \pi(f) - \pi'(f) )^2 \bigr] = d(\pi, \pi')^2$, so that
$$\mathbb{E}\bigl[ \bigl( (\mathcal{A}\pi)(f) - (\mathcal{A}\pi')(f) \bigr)^2 \bigr] \le \frac{4}{\kappa^4}\, d(\pi, \pi')^2,$$
and hence $d(\mathcal{A}_j \pi, \mathcal{A}_j \pi') \le \frac{2}{\kappa^2}\, d(\pi, \pi')$.

Theorem 11.6 (Convergence of the BPF). Let Assumption 11.4 hold. Then there exists $c = c(J, \kappa)$ such that, for all $j = 1, \dots, J$,
$$d(\pi_j, \pi^N_j) \le \frac{c}{\sqrt{N}}.$$
Proof.
Let $e_j = d(\pi_j, \pi^N_j)$. Then
$$e_{j+1} = d(\pi_{j+1}, \pi^N_{j+1}) = d(\mathcal{A}_j \mathcal{P} \pi_j, \mathcal{A}_j S^N \mathcal{P} \pi^N_j) \le d(\mathcal{A}_j \mathcal{P} \pi_j, \mathcal{A}_j \mathcal{P} \pi^N_j) + d(\mathcal{A}_j \mathcal{P} \pi^N_j, \mathcal{A}_j S^N \mathcal{P} \pi^N_j)$$
by the triangle inequality. Applying the stability bound for $\mathcal{A}_j$, we have
$$e_{j+1} \le \frac{2}{\kappa^2} \Bigl[ d(\mathcal{P}\pi_j, \mathcal{P}\pi^N_j) + d(\widehat{\pi}^N_j, S^N \widehat{\pi}^N_j) \Bigr],$$
where $\widehat{\pi}^N_j = \mathcal{P}\pi^N_j$. By the stability bound for $\mathcal{P}$,
$$d(\mathcal{P}\pi_j, \mathcal{P}\pi^N_j) \le d(\pi_j, \pi^N_j),$$
and by the consistency bound for $S^N$,
$$d(\widehat{\pi}^N_j, S^N \widehat{\pi}^N_j) \le \frac{1}{\sqrt{N}}.$$
Therefore,
$$e_{j+1} \le \frac{2}{\kappa^2} \Bigl( d(\pi_j, \pi^N_j) + \frac{1}{\sqrt{N}} \Bigr) = \frac{2}{\kappa^2} \Bigl( e_j + \frac{1}{\sqrt{N}} \Bigr).$$
We let $\lambda = 2/\kappa^2$ and note that $\lambda \ge 2$ since $\kappa \in (0, 1]$. Iterating gives
$$e_j \le \lambda^j e_0 + \frac{\lambda}{\sqrt{N}} \cdot \frac{1 - \lambda^j}{1 - \lambda}.$$
Recall that $\pi^N_0 = \pi_0$, hence $e_0 = 0$. Thus letting
$$c = \frac{\lambda(1 - \lambda^J)}{1 - \lambda}$$
completes the proof, since $\lambda(1 - \lambda^j)/(1 - \lambda)$ is increasing in $j$.

A nice interpretation of the BPF is to view it as a random dynamical system for a set of interacting particles $\{v^{(n)}_j\}_{n=1}^N$. To this end, a measure
$$\bar{\pi}^N_j(u) = \frac{1}{N} \sum_{n=1}^N \delta(u - v^{(n)}_j) \approx \pi^N_j(u) \approx \pi_j(u)$$
with equally weighted particles may be naturally defined after the resampling step from $\pi^N_j$. It can then be seen that the BPF updates the particle positions $\{v^{(n)}_j\}_{n=1}^N \mapsto \{v^{(n)}_{j+1}\}_{n=1}^N$ via the random map
$$\widehat{v}^{(n)}_{j+1} = \Psi(v^{(n)}_j) + \xi^{(n)}_j, \qquad \xi^{(n)}_j \sim \mathcal{N}(0, \Sigma) \text{ i.i.d.},$$
$$v^{(n)}_{j+1} = \sum_{m=1}^N I^{(m)}_{j+1}\bigl( r^{(n)}_{j+1} \bigr)\, \widehat{v}^{(m)}_{j+1}, \qquad r^{(n)}_{j+1} \sim \mathrm{Uniform}(0, 1) \text{ i.i.d.}$$
Here the $I^{(m)}_{j+1}$ are indicator functions of intervals whose widths are given by the weights appearing in $\pi^N_{j+1}(u)$. Specifically we have
$$I^{(m)}_{j+1} = \mathbb{1}_{[\alpha^{(m-1)}_{j+1},\, \alpha^{(m)}_{j+1})}, \qquad \alpha^{(m)}_{j+1} = \alpha^{(m-1)}_{j+1} + w^{(m)}_{j+1}, \qquad \alpha^{(0)}_{j+1} = 0.$$
Note that, by construction, $\alpha^{(N)}_{j+1} = 1$.

Thus the underlying dynamical system comprises $N$ particles governed by two steps: (i) the underlying stochastic dynamics model, in which the particles do not interact; (ii) a resampling of the resulting collection of particles, to reflect the different weights associated with them, in which the particles do then interact. The interaction is driven by the weights, which see all the particle positions and measure their goodness of fit to the data.

Particle filters are overviewed from an algorithmic viewpoint in [29, 28], and from a more mathematical perspective in [25]. The convergence of particle filters was addressed in [22]; the clean proof presented here originates in [97] and may also be found in [73]. For problems in which the dynamics evolve in relatively low dimensional spaces they have been enormously successful. Generalizing them so that they work for the high dimensional problems that arise, for example, in geophysical applications, provides a major challenge [75].
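The random map formulation above translates directly into code. Below is a minimal sketch of the BPF for a scalar state; the dynamics $\Psi$, observation map $h$, noise levels and the synthetic data are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def bootstrap_pf(y, Psi, h, Sigma, Gamma, v0, J, N, rng):
    """Bootstrap particle filter sketch for a scalar state (d = k = 1).

    y: observations y_1..y_J; Psi, h: dynamics/observation maps;
    Sigma, Gamma: noise variances; v0: N initial particles.
    Returns the equally weighted ensemble approximating P(v_J | Y_J).
    """
    v = np.array(v0, dtype=float)
    for j in range(J):
        # Prediction: propagate each particle through the stochastic dynamics.
        v_hat = Psi(v) + rng.normal(0.0, np.sqrt(Sigma), size=N)
        # Weighting: likelihood exp(-0.5 |y - h(v_hat)|^2 / Gamma), stabilized in log space.
        logw = -0.5 * (y[j] - h(v_hat))**2 / Gamma
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resampling: draw N particles with replacement according to the weights.
        idx = rng.choice(N, size=N, p=w)
        v = v_hat[idx]
    return v

rng = np.random.default_rng(0)
N, J = 2000, 20
Psi = lambda v: 0.9 * v          # illustrative linear dynamics
h = lambda v: v                  # identity observation map
# synthesize a truth trajectory and data from the same model
truth = np.zeros(J + 1)
y = np.zeros(J)
for j in range(J):
    truth[j + 1] = Psi(truth[j]) + rng.normal(0, 0.1)
    y[j] = truth[j + 1] + rng.normal(0, 0.1)
v = bootstrap_pf(y, Psi, h, Sigma=0.01, Gamma=0.01,
                 v0=rng.normal(0, 1, N), J=J, N=N, rng=rng)
print(v.mean())  # particle estimate of E[v_J | Y_J]
```

Multinomial resampling via `rng.choice` plays the role of the indicator functions $I^{(m)}_{j+1}$ evaluated at uniform draws.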
12 Optimal Particle Filter
This chapter is devoted to the optimal particle filter. Like the bootstrap filter from the previous chapter, the optimal particle filter approximates the filtering distribution by a sum of Dirac measures. The setting will initially be the same (nonlinear stochastic dynamics and nonlinear observations), but we will see that the optimal particle filter cannot, in general, be implemented in the fully nonlinear case. For this reason we will specialize to the case of a linear observation operator, where the optimal particle filter can be implemented in a straightforward fashion and may be characterized as a set of interacting 3DVAR filters. We conclude this chapter by summarizing and comparing all of the filtering methods introduced in this and the preceding chapters, highlighting the settings in which they may be practically applied.
The bootstrap particle filter from the previous chapter is based on approximating the two components of a specific factorization of the filtering step. The factorization is
$$\mathbb{P}(v_{j+1}|Y_j) = \mathcal{P}\,\mathbb{P}(v_j|Y_j), \qquad \mathbb{P}(v_{j+1}|Y_{j+1}) = \mathcal{A}_j\,\mathbb{P}(v_{j+1}|Y_j).$$
This gives the factorization
$$\mathbb{P}(v_{j+1}|Y_{j+1}) = \mathcal{A}_j \mathcal{P}\,\mathbb{P}(v_j|Y_j),$$
which is the basis for the bootstrap particle filter. It is natural to ask if there are other factorizations of the filtering update and whether they might lead to improved particle filters. In this chapter we derive the Optimal Particle Filter (OPF), which does just that. We demonstrate a connection with 3DVAR, and we discuss the sense in which the OPF has desirable properties in comparison with the BPF.

As usual we denote by $\pi_j$ the filtering distribution $\mathbb{P}(v_j|Y_j)$. The fundamental filtering problem that we are interested in is the determination of $\mathbb{P}(v_{j+1}|Y_{j+1})$ from $\mathbb{P}(v_j|Y_j)$. In the BPF, we are approximating the following manipulation:
$$\mathbb{P}(v_{j+1}|Y_{j+1}) = \mathbb{P}(v_{j+1}|y_{j+1}, Y_j) = \frac{\mathbb{P}(y_{j+1}|v_{j+1}, Y_j)\,\mathbb{P}(v_{j+1}|Y_j)}{\mathbb{P}(y_{j+1}|Y_j)} = \frac{\mathbb{P}(y_{j+1}|v_{j+1}, Y_j)}{\mathbb{P}(y_{j+1}|Y_j)} \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j, Y_j)\,\mathbb{P}(v_j|Y_j)\,dv_j = \frac{\mathbb{P}(y_{j+1}|v_{j+1})}{\mathbb{P}(y_{j+1}|Y_j)} \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j)\,\mathbb{P}(v_j|Y_j)\,dv_j = \mathcal{A}_j \mathcal{P}\,\mathbb{P}(v_j|Y_j).$$
The Markov kernel $\mathcal{P}$ acts on an arbitrary density $\pi$ by
$$\mathcal{P}\pi(v_{j+1}) = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j)\,\pi(v_j)\,dv_j,$$
and $\mathcal{A}_j$ acts on an arbitrary density $\pi$ by application of Bayes theorem, taking into account the likelihood of the data:
$$\mathcal{A}_j \pi(v_{j+1}) = \frac{1}{Z}\,\mathbb{P}(y_{j+1}|v_{j+1})\,\pi(v_{j+1}),$$
with $Z$ the constant which normalizes the right-hand side to a probability density. The above manipulations are summarized by the relationship
$$\pi_{j+1} = \mathcal{A}_j \mathcal{P} \pi_j. \tag{12.1}$$
Note that in this factorization we apply a Markov kernel and then Bayes theorem.
In contrast, to derive the OPF we perform the following manipulation:
$$\mathbb{P}(v_{j+1}|Y_{j+1}) = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}, v_j|Y_{j+1})\,dv_j = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j, Y_{j+1})\,\mathbb{P}(v_j|Y_{j+1})\,dv_j = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j, y_{j+1}, Y_j)\,\mathbb{P}(v_j|y_{j+1}, Y_j)\,dv_j = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j, y_{j+1})\,\mathbb{P}(v_j|y_{j+1}, Y_j)\,dv_j = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j, y_{j+1})\,\frac{\mathbb{P}(y_{j+1}|v_j, Y_j)}{\mathbb{P}(y_{j+1}|Y_j)}\,\mathbb{P}(v_j|Y_j)\,dv_j = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j, y_{j+1})\,\frac{\mathbb{P}(y_{j+1}|v_j)}{\mathbb{P}(y_{j+1}|Y_j)}\,\mathbb{P}(v_j|Y_j)\,dv_j = \mathcal{P}^{\mathrm{OPF}}_j \mathcal{A}^{\mathrm{OPF}}_j\,\mathbb{P}(v_j|Y_j),$$
with Markov kernel for the particle update
$$\mathcal{P}^{\mathrm{OPF}}_j \pi(v_{j+1}) = \int_{\mathbb{R}^d} \mathbb{P}(v_{j+1}|v_j, y_{j+1})\,\pi(v_j)\,dv_j$$
and application of Bayes theorem to include the likelihood
$$\mathcal{A}^{\mathrm{OPF}}_j \pi(v_j) = \frac{1}{Z}\,\mathbb{P}(y_{j+1}|v_j)\,\pi(v_j).$$
Thus we have
$$\pi_{j+1} = \mathcal{P}^{\mathrm{OPF}}_j \mathcal{A}^{\mathrm{OPF}}_j \pi_j. \tag{12.2}$$
Note that in the factorization given by the OPF we apply Bayes theorem and then a Markov kernel, the opposite order to the BPF. Moreover the propagation mechanism is different: it sees the data through the Markov kernel $\mathcal{P}^{\mathrm{OPF}}_j$. Hence the weighting of the particles is also different: the BPF weights are proportional to the likelihood $\mathbb{P}(y_{j+1}|v_{j+1})$, whereas the OPF weights are proportional to $\mathbb{P}(y_{j+1}|v_j)$, which may not, in general, be available in closed form. We will see that the particle updates use a 3DVAR procedure. In the BPF, the evolution of the particles and the observation of the data are kept separate from each other: the Markov kernel $\mathcal{P}$ depends only on the dynamics and not on the observed data, and is thus independent of $j$.

In general it is not possible to implement the OPF in the fully nonlinear setting because of two computational bottlenecks:

∙ There may not be a closed formula for evaluating the likelihood $\mathbb{P}(y_{j+1}|v_j)$, making the computation of the particle weights infeasible.
∙ It may not be possible to sample from the Markov kernel $\mathbb{P}(v_{j+1}|v_j, y_{j+1})$, making the propagation of particles infeasible.

However, when the observation function $h(\cdot)$ is linear, i.e. $h(\cdot) = H\cdot$ for some $H \in \mathbb{R}^{k \times d}$, both bottlenecks are overcome. We thus consider the following setting, which arises in many applications:
$$v_{j+1} = \Psi(v_j) + \xi_j, \qquad \xi_j \sim \mathcal{N}(0, \Sigma) \text{ i.i.d.},$$
$$y_{j+1} = H v_{j+1} + \eta_{j+1}, \qquad \eta_{j+1} \sim \mathcal{N}(0, \Gamma) \text{ i.i.d.},$$
with $v_0 \sim \mathcal{N}(m_0, C_0)$ and $v_0, \{\xi_j\}, \{\eta_j\}$ independent. We will now show that in this linear observation setting both bottlenecks are overcome. First, note that combining the dynamics and data models we may write
$$y_{j+1} = H\Psi(v_j) + H\xi_j + \eta_{j+1},$$
which shows that the conditional distribution of $y_{j+1}$ given $v_j$ is
$$\mathbb{P}(y_{j+1}|v_j) = \mathcal{N}(H\Psi(v_j), S), \qquad S = H\Sigma H^T + \Gamma.$$
We will use this formula to compute the weights, thus overcoming the first computational bottleneck.

We now show that under the linear observation assumption $\mathcal{P}^{\mathrm{OPF}}_j$ is a Gaussian kernel, and hence the second computational bottleneck is overcome too. We have
$$\mathbb{P}(v_{j+1}|v_j, y_{j+1}) \propto \mathbb{P}(y_{j+1}|v_{j+1}, v_j)\,\mathbb{P}(v_{j+1}|v_j) = \mathbb{P}(y_{j+1}|v_{j+1})\,\mathbb{P}(v_{j+1}|v_j) \propto \exp\Bigl( -\tfrac12 |y_{j+1} - Hv_{j+1}|^2_\Gamma - \tfrac12 |v_{j+1} - \Psi(v_j)|^2_\Sigma \Bigr) = \exp\bigl( -\mathsf{J}_{\mathrm{opt}}(v_{j+1}) \bigr).$$
This is a Gaussian distribution for $v_{j+1}$, as $\mathsf{J}_{\mathrm{opt}}(v_{j+1})$ is quadratic with respect to $v_{j+1}$. Consequently, we can compute the mean $m_{j+1}$ and covariance $C$ (which, note, is independent of $j$) of this Gaussian by matching the linear and quadratic terms in the relevant quadratic forms:
$$C^{-1} = H^T \Gamma^{-1} H + \Sigma^{-1},$$
$$C^{-1} m_{j+1} = \Sigma^{-1} \Psi(v_j) + H^T \Gamma^{-1} y_{j+1}.$$
$\mathsf{J}_{\mathrm{opt}}$ is identical to $\mathsf{J}$ on the right-hand side of Table 9.1, with $\widehat{C}$ replaced by $\Sigma$. Then $\mathbb{P}(v_{j+1}|y_{j+1}, v_j) = \mathcal{N}(m_{j+1}, C)$. This is hence a special case of 3DVAR in which the analysis covariance is fixed at $C$; note that when we derived 3DVAR we fixed the predictive covariance $\widehat{C}$, which here is fixed at $\Sigma$.
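The precision form $C^{-1} = H^T\Gamma^{-1}H + \Sigma^{-1}$ derived above is equivalent, by the standard Woodbury-type Kalman identity, to the gain form $C = (I - KH)\Sigma$ with $K = \Sigma H^T(H\Sigma H^T + \Gamma)^{-1}$, which requires inversion only in data space. A small numpy check of this identity, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 2                      # illustrative state/data dimensions
H = rng.standard_normal((k, d))
A = rng.standard_normal((d, d)); Sigma = A @ A.T + d * np.eye(d)   # SPD model-noise covariance
B = rng.standard_normal((k, k)); Gamma = B @ B.T + k * np.eye(k)   # SPD observation-noise covariance

# Precision form: C^{-1} = H^T Gamma^{-1} H + Sigma^{-1}
C_precision = np.linalg.inv(H.T @ np.linalg.inv(Gamma) @ H + np.linalg.inv(Sigma))

# Kalman-gain form: K = Sigma H^T S^{-1}, S = H Sigma H^T + Gamma, C = (I - K H) Sigma
S = H @ Sigma @ H.T + Gamma
K = Sigma @ H.T @ np.linalg.inv(S)
C_gain = (np.eye(d) - K @ H) @ Sigma

print(np.allclose(C_precision, C_gain))  # True: the two formulas agree
```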
As with the Kalman filter, and with 3DVAR, it is possible to implement the particle update through the following mean and covariance formulae, which avoid inversion in state space and require inversion only in data space:
$$m_{j+1} = (I - KH)\Psi(v_j) + K y_{j+1},$$
$$C = (I - KH)\Sigma,$$
$$K = \Sigma H^T S^{-1},$$
$$S = H\Sigma H^T + \Gamma.$$
Furthermore, as for 3DVAR, the inversion of $S$ need only be performed once, in a pre-processing step before the algorithm is run. Since the expression for $\mathbb{P}(v_{j+1}|v_j, y_{j+1})$ is Gaussian, we now have the ability to sample directly from $\mathcal{P}^{\mathrm{OPF}}_j$. The OPF is thus given by the following update algorithm for approximations $\pi^N_j \approx \pi_j$, in which we generalize the notational conventions used in the previous chapter to formulate particle filters as random dynamical systems:

Algorithm 12.1
Algorithm for the Optimal Particle Filter with linear observation map
Input: Initial distribution $\mathbb{P}(v_0) = \pi_0$, observations $Y_J$, number of particles $N$.
Initial Sampling: Draw $N$ particles $v^{(n)}_0 \sim \pi_0$ so that $\pi^N_0 = S^N \pi_0$.
Subsequent Sampling: For $j = 0, 1, \dots, J-1$, perform:
1. Set $\widehat{v}^{(n)}_{j+1} = (I - KH)\Psi(v^{(n)}_j) + K y_{j+1} + \zeta^{(n)}_{j+1}$, with $\zeta^{(n)}_{j+1}$ i.i.d. $\mathcal{N}(0, C)$.
2. Set $\bar{w}^{(n)}_{j+1} = \exp\bigl( -\tfrac12 |y_{j+1} - H\Psi(v^{(n)}_j)|^2_S \bigr)$.
3. Set $w^{(n)}_{j+1} = \bar{w}^{(n)}_{j+1} \big/ \sum_{n=1}^N \bar{w}^{(n)}_{j+1}$.
4. Set $v^{(n)}_{j+1} = \sum_{m=1}^N I^{(m)}_{j+1}\bigl( r^{(n)}_{j+1} \bigr)\, \widehat{v}^{(m)}_{j+1}$.
5. Set $\pi^N_{j+1}(v_{j+1}) = \frac{1}{N} \sum_{n=1}^N \delta\bigl( v_{j+1} - v^{(n)}_{j+1} \bigr)$.
Output: $N$ particles $v^{(1)}_J, v^{(2)}_J, \dots, v^{(N)}_J$.

It would be desirable to interpret this algorithm as an approximation of the filter update (12.2) in the form
$$\pi^N_{j+1} = S^N \mathcal{P}^{\mathrm{OPF}}_j \mathcal{A}^{\mathrm{OPF}}_j \pi^N_j, \qquad \pi^N_0 = S^N \pi_0.$$
However, the order in which the resampling and the particle propagation occur implies that this is not possible. The following slight modification of the OPF, however, may indeed be thought of as an approximation of this form; we simply reorder the resampling and the propagation. We refer to the resulting algorithm as the Gaussianized Optimal Particle Filter (GOPF). We may write the resulting algorithm as follows:

Algorithm 12.2
Algorithm for the Gaussianized Optimal Particle Filter
Input: Initial distribution $\mathbb{P}(v_0) = \pi_0$, observations $Y_J$, number of particles $N$.
Initial Sampling: Draw $N$ particles $v^{(n)}_0 \sim \pi_0$ so that $\pi^N_0 = S^N \pi_0$.
Subsequent Sampling: For $j = 0, 1, \dots, J-1$, perform:
1. Set $\bar{w}^{(n)}_{j+1} = \exp\bigl( -\tfrac12 |y_{j+1} - H\Psi(v^{(n)}_j)|^2_S \bigr)$.
2. Set $w^{(n)}_{j+1} = \bar{w}^{(n)}_{j+1} \big/ \sum_{n=1}^N \bar{w}^{(n)}_{j+1}$.
3. Set $\widehat{v}^{(n)}_j = \sum_{m=1}^N I^{(m)}_{j+1}\bigl( r^{(n)}_{j+1} \bigr)\, v^{(m)}_j$.
4. Set $v^{(n)}_{j+1} = (I - KH)\Psi(\widehat{v}^{(n)}_j) + K y_{j+1} + \zeta^{(n)}_{j+1}$, with $\zeta^{(n)}_{j+1}$ i.i.d. $\mathcal{N}(0, C)$.
5. Set $\pi^N_{j+1}(v_{j+1}) = \frac{1}{N} \sum_{n=1}^N \delta\bigl( v_{j+1} - v^{(n)}_{j+1} \bigr)$.
Output: $N$ particles $v^{(1)}_J, v^{(2)}_J, \dots, v^{(N)}_J$.

Particle filter methods rely on approximating the filtering distribution by a swarm of Dirac masses; it is clear that the distribution will not be well approximated by only a small number of particles in most cases. Consequently, a performance requirement for particle filter methods is that they do not lead to degeneracy of the particles. Resampling leads to degeneracy if a small number of particles carry almost all of the weight. Conversely, non-degeneracy may be promoted by ensuring that the weights $w^{(n)}_j$ are similar in magnitude, so that a small number of particles are not overly favoured during the resampling step. This condition can be formulated as a requirement that the variance of the weights be minimized; doing so results in the OPF.

To understand this perspective we consider an arbitrary particle update kernel of the form $\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})$ and we study the resulting particle filter without resampling. It is then the case that the particle weights are updated according to the formula
$$\bar{w}^{(n)}_{j+1} = \bar{w}^{(n)}_j\,\frac{\mathbb{P}(y_{j+1}|v_{j+1})\,\mathbb{P}(v_{j+1}|v^{(n)}_j)}{\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})}.$$

Theorem 12.1 (Meaning of Optimality). The choice of $\mathbb{P}(v_{j+1}|v^{(n)}_j, y_{j+1})$ as the particle update kernel $\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})$ results in the minimal variance of the weight $\bar{w}^{(n)}_{j+1}$ with respect to all possible choices of the particle update kernel.

Proof.
We calculate the variance of the unnormalized weights (treated as random variables) $\bar{w}^{(n)}_{j+1}$ with respect to the transition density $\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})$ and obtain
$$\mathrm{Var}_{\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})}\bigl[ \bar{w}^{(n)}_{j+1} \bigr] = \int_{\mathbb{R}^d} \bigl( \bar{w}^{(n)}_{j+1} \bigr)^2 \pi(v_{j+1}|v^{(n)}_j, Y_{j+1})\,dv_{j+1} - \biggl[ \int_{\mathbb{R}^d} \bar{w}^{(n)}_{j+1}\,\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})\,dv_{j+1} \biggr]^2 = \bigl( \bar{w}^{(n)}_j \bigr)^2 \int_{\mathbb{R}^d} \frac{\bigl( \mathbb{P}(y_{j+1}|v_{j+1})\,\mathbb{P}(v_{j+1}|v^{(n)}_j) \bigr)^2}{\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})}\,dv_{j+1} - \bigl( \bar{w}^{(n)}_j \bigr)^2 \biggl[ \int_{\mathbb{R}^d} \mathbb{P}(y_{j+1}|v_{j+1})\,\mathbb{P}(v_{j+1}|v^{(n)}_j)\,dv_{j+1} \biggr]^2 = \bigl( \bar{w}^{(n)}_j \bigr)^2 \Biggl[ \int_{\mathbb{R}^d} \frac{\bigl( \mathbb{P}(y_{j+1}|v_{j+1})\,\mathbb{P}(v_{j+1}|v^{(n)}_j) \bigr)^2}{\pi(v_{j+1}|v^{(n)}_j, Y_{j+1})}\,dv_{j+1} - \mathbb{P}(y_{j+1}|v^{(n)}_j)^2 \Biggr].$$
Choosing $\pi(v_{j+1}|v^{(n)}_j, Y_{j+1}) = \mathbb{P}(v_{j+1}|v^{(n)}_j, y_{j+1})$, as in the OPF, and using the identity $\mathbb{P}(y_{j+1}|v_{j+1})\,\mathbb{P}(v_{j+1}|v^{(n)}_j) = \mathbb{P}(v_{j+1}|v^{(n)}_j, y_{j+1})\,\mathbb{P}(y_{j+1}|v^{(n)}_j)$, we obtain
$$\mathrm{Var}_{\mathbb{P}(v_{j+1}|v^{(n)}_j, y_{j+1})}\bigl[ \bar{w}^{(n)}_{j+1} \bigr] = \bigl( \bar{w}^{(n)}_j \bigr)^2 \Bigl[ \mathbb{P}(y_{j+1}|v^{(n)}_j)^2 - \mathbb{P}(y_{j+1}|v^{(n)}_j)^2 \Bigr] = 0.$$

Remark 12.2. The optimal particle filter is optimal in the very precise sense of the theorem. Note in particular that no optimality criterion is asserted by this theorem with respect to iterating the particle updates, in particular when resampling is included. The nomenclature "optimal" should thus be treated with caution.
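One step of the GOPF (resample with the $\mathbb{P}(y_{j+1}|v_j)$ weights, then propagate through the data-informed Gaussian kernel) can be sketched as follows for a scalar state; the map $\Psi$ and all parameter values are illustrative assumptions.

```python
import numpy as np

def gopf_step(v, y_next, Psi, H, Sigma, Gamma, rng):
    """One step of the Gaussianized optimal particle filter (scalar state, h(v) = H v).

    Weights use p(y_{j+1} | v_j) = N(H Psi(v_j), S) with S = H Sigma H + Gamma;
    after resampling, particles move by the 3DVAR-like map with gain K = Sigma H / S.
    """
    N = len(v)
    S = H * Sigma * H + Gamma
    K = Sigma * H / S
    C = (1.0 - K * H) * Sigma            # analysis covariance, independent of j
    # Steps 1-2: weights proportional to exp(-0.5 |y - H Psi(v)|^2 / S)
    logw = -0.5 * (y_next - H * Psi(v))**2 / S
    w = np.exp(logw - logw.max()); w /= w.sum()
    # Step 3: resample first ...
    v_hat = v[rng.choice(N, size=N, p=w)]
    # Step 4: ... then propagate through the data-informed Gaussian kernel
    return (1.0 - K * H) * Psi(v_hat) + K * y_next + rng.normal(0, np.sqrt(C), N)

rng = np.random.default_rng(2)
Psi = lambda v: 0.9 * v                  # illustrative dynamics
v = rng.normal(0, 1, 1000)               # initial ensemble
v = gopf_step(v, y_next=0.5, Psi=Psi, H=1.0, Sigma=0.01, Gamma=0.01, rng=rng)
```

Note how, in contrast to the BPF sketch of the previous chapter, the observation $y_{j+1}$ enters the propagation itself, not only the weights.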
Particle filters often perform poorly for high-dimensional systems due to a collapse of the particle weights: only a few particles carry most of the weight. The optimal particle filter can ameliorate this issue because the proposal uses the data, meaning that particle predictions are more likely to be weighted highly. There have been attempts to formulate update steps that help to mitigate this weight collapse. Essentially, these methods aim to push the particles towards the region of high likelihood, so that all the particles are representative of the distribution. As we will show in the next chapter, there are interesting particle-based methods which, whilst not statistically consistent, do perform well as signal estimators. This perspective on particle-filter-type methods, namely to use them for smart signal estimation rather than as estimators of the filtering distribution, may become increasingly useful in high dimensional systems.

Having introduced a range of filtering methods in this and the preceding four chapters, it is helpful to summarize what these methods achieve, in a comparative fashion.
The stochastic dynamics model is defined by
$$v_{j+1} = \Psi(v_j) + \xi_j, \qquad j \ge 0,$$
where $\xi_j \sim \mathcal{N}(0, \Sigma)$ are independent and identically distributed random variables, also independent of the initial condition $v_0 \sim \mathcal{N}(m_0, C_0)$.

Figure 12: Prediction step.
The filtering distribution of $v_j$ is denoted by $\pi_j(v_j) = \mathbb{P}(v_j|Y_j)$. This is propagated by the stochastic dynamics model according to the formula $\widehat{\pi}_{j+1} = \mathcal{P}\pi_j$, where $\mathcal{P}$ is defined in (11.1). This is shown schematically in Figure 12. Note that $\mathcal{P}$ is independent of the step $j$ because the Markov chain defined by the stochastic dynamics model is time-homogeneous. In the absence of data the probability distribution simply evolves through repeated application of $\mathcal{P}$.

The data model is given by
$$y_{j+1} = h(v_{j+1}) + \eta_{j+1},$$
where $\eta_{j+1} \sim \mathcal{N}(0, \Gamma)$ are independent and identically distributed random variables, independent of both $v_0$ and the i.i.d. sequence $\{\xi_j\}$. Given data, we now view the prediction from the dynamical model as a prior which we condition on the data. We write this prior at time $j+1$ as $\widehat{\pi}_{j+1}$, and prediction using the stochastic dynamics model gives $\widehat{\pi}_{j+1} = \mathcal{P}\pi_j$. Using the data to update this prior by Bayes theorem gives an improved estimate of the distribution $\pi_{j+1}(v_{j+1})$ via the formula
$$\pi_{j+1} = \mathcal{A}_j \widehat{\pi}_{j+1},$$
where $\mathcal{A}_j$ is given by (11.2). This is shown schematically in Figure 13. Note that $\mathcal{A}_j$ is nonlinear because $\widehat{\pi}_{j+1}$ appears in both the numerator and the denominator. It depends on $j$ because the data $y_{j+1}$ appears in the equation, and this will change with each set of measurements.

Figure 13: Update step.
Recall the notation $Y_j = \{y_1, \dots, y_j\}$ for the collection of all data up to, and including, time $j$. The dynamics and data models described in the two previous subsections can be combined as follows. Using the dynamics, the probability distribution is propagated forward in time by the prediction step. The data is then used to update the probability distribution at that time in the analysis step. The two can be combined to obtain
$$\pi_{j+1} = \mathcal{A}_j \mathcal{P} \pi_j.$$
The combination of the two steps is shown schematically in Figure 14.

Figure 14: Prediction and update step combined.
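For a scalar model the operators $\mathcal{P}$ and $\mathcal{A}_j$ can be realized exactly on a grid, which makes the combined update $\pi_{j+1} = \mathcal{A}_j\mathcal{P}\pi_j$ concrete; a sketch with illustrative parameter values:

```python
import numpy as np

# Grid realization of one filtering cycle pi_{j+1} = A_j P pi_j for a scalar model
# v_{j+1} = Psi(v_j) + xi_j, y_{j+1} = h(v_{j+1}) + eta_{j+1}.
grid = np.linspace(-5, 5, 401)
dv = grid[1] - grid[0]
Sigma, Gamma = 0.25, 0.25              # illustrative noise variances
Psi = lambda v: 0.9 * v
h = lambda v: v

# current filtering density pi_j (standard normal, say)
pi = np.exp(-0.5 * grid**2); pi /= pi.sum() * dv

# Prediction step (operator P): pi_hat(v') = integral of p(v, v') pi(v) dv,
# with Gaussian transition density p(v, v') = N(v'; Psi(v), Sigma).
p = np.exp(-0.5 * (grid[None, :] - Psi(grid)[:, None])**2 / Sigma)
p /= p.sum(axis=1, keepdims=True) * dv
pi_hat = (p * pi[:, None] * dv).sum(axis=0)

# Analysis step (operator A_j): multiply by the likelihood and renormalize.
y_next = 1.0                           # illustrative datum
like = np.exp(-0.5 * (y_next - h(grid))**2 / Gamma)
pi_next = like * pi_hat
pi_next /= pi_next.sum() * dv          # normalization (the nonlinearity in A_j)
```

The prediction step is a discretized integral against the transition density; the analysis step is pointwise multiplication by the likelihood followed by renormalization, the source of the nonlinearity of $\mathcal{A}_j$.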
There are several filtering methods for performing the prediction and analysis steps. Some methods, such as the bootstrap particle filter, can be applied generally to nonlinear problems. However, others require a linear model ($\Psi(\cdot) = M\cdot$) and/or linear observations ($h(\cdot) = H\cdot$). Some of the methods we have described provably approximate the probability distribution updates. Some just estimate the state and simply use covariance information to weight the relative importance of the predictions from the model and of the data.

The applicability of the methods introduced is summarized in the table below, with respect to linearity/nonlinearity of the dynamics and of the observation model. Furthermore, P is used to denote methods which provably approximate the filtering distributions $\pi_j$ in the large particle limit; S denotes methods which only attempt to estimate the state, using the data.

Kalman Filter: $\Psi(\cdot) = M\cdot$, $h(\cdot) = H\cdot$. P.
3DVAR: general $\Psi$, $h(\cdot) = H\cdot$. S.
Bootstrap Particle Filter: general $\Psi$, $h$. P.
Optimal Particle Filter: general $\Psi$, $h(\cdot) = H\cdot$. P.
Extended Kalman Filter: general $\Psi$, $h(\cdot) = H\cdot$. S.
Ensemble Kalman Filter: general $\Psi$, $h(\cdot) = H\cdot$. S.

Some of these constraints on the setting in which the methods apply can be relaxed, but the list above describes the methods as we present them in these notes. Furthermore, the last two methods can accurately predict probability distributions in situations where approximate Gaussianity holds; this may be induced by small noise and by large data. However it is important to appreciate that the raison d'être of the EnKF is to facilitate the solution of problems with high dimensional state space; typically small ensemble sizes are used and the algorithm is employed far from the regime in which it is able to provably approximate the distributions, even if they are close to Gaussian. It is for this reason that we prefer to think of the EnKF as an ensemble state estimator.
Particle filters often perform poorly for high-dimensional systems due to the fact that the particle weight typically concentrates on one, or a small number, of particles; see the work of Bickel and Snyder in [12, 107, 106]. This is the issue that the optimal particle filter tries to ameliorate; the paper [4] shows calculations which demonstrate the extent to which this amelioration is manifest in theory. The optimal particle filter is discussed, and further references given, in the very clear paper [28]; see Section IID. Throughout much of this chapter we considered the case of Gaussian additive noise and linear observation operator, in which case the prediction step is tractable; the paper [28] discusses the more general setting. The order in which the prediction and resampling are performed can be commuted in this case, and a discussion of this fact may be found in [95]; this leads to the distinction between what we term the GOPF and the OPF. The convergence of the optimal particle filter is studied in [62]. The formulation of the bootstrap and optimal particle filters as random dynamical systems may be found in [70]. An attempt to alleviate weight collapse by introducing alternative update steps can be found in [106].
13 Filtering Approach to the Inverse Problem
In this final chapter we demonstrate how the two separate themes that underpin this course, inverse problems and data assimilation, may be linked. This opens up the possibility of transferring ideas from filtering into the setting of quite general inverse problems. In the first section we describe the general, abstract connection and introduce the revolutionary idea of sequential Monte Carlo (SMC) methods. In the second section we analyze the concrete case of applying the EnKF to solve an inverse problem, leading to ensemble Kalman inversion (EKI). In the final section we link the EKI methodology to SMC.
Recall the inverse problem of finding $u \in \mathbb{R}^d$ from $y \in \mathbb{R}^k$ where
$$y = G(u) + \eta, \qquad \eta \sim \mathcal{N}(0, \Gamma_0), \tag{13.1}$$
and the related loss function
$$\mathsf{L}(u) = \frac12 \bigl| y - G(u) \bigr|^2_{\Gamma_0}.$$
The reason for writing $\Gamma_0$, rather than $\Gamma$, will become apparent below, and will also be exploited in the third section of this chapter, where we study EKI. If we put a prior $\rho$ on the unknown $u$ then the posterior takes the form
$$\pi(u) = \frac{1}{Z} \exp\bigl( -\mathsf{L}(u) \bigr) \rho(u).$$
Let $J \in \mathbb{N}$ and choose $h$ so that $Jh = 1$. Then define the family of pdfs $\{\pi_j\}_{j=0}^J$ by
$$\pi_j(u) = \frac{1}{Z_j} \exp\bigl( -jh\,\mathsf{L}(u) \bigr) \rho(u).$$
It follows that $\pi_0 = \rho$ and $\pi_J = \pi$ and, furthermore, we may update the sequence $\{\pi_j\}_{j=0}^J$ sequentially using the formula
$$\pi_{j+1}(u) = \frac{Z_j}{Z_{j+1}} \exp\bigl( -h\,\mathsf{L}(u) \bigr) \pi_j(u).$$
This simply corresponds to application of Bayes theorem to the inverse problem
$$y = G(u) + \eta, \qquad \eta \sim \mathcal{N}(0, \Gamma), \tag{13.2}$$
with $\Gamma = h^{-1}\Gamma_0$ and with prior $\pi_j$. With this choice of $\Gamma$ the identity (13.2) gives likelihood proportional to $\exp(-h\,\mathsf{L}(u))$. The update may be written
$$\pi_{j+1} = \mathcal{A}\pi_j,$$
noting that the analysis operator $\mathcal{A}$ is nonlinear, corresponding to multiplication by $\exp(-h\,\mathsf{L}(u))$ and then normalization to a pdf. If we let $\mathcal{P}_j$ denote any Markov kernel for which $\pi_j$ is invariant then we obtain
$$\pi_{j+1} = \mathcal{A}\mathcal{P}_j \pi_j.$$
This update formula should be compared with the filtering update formula $\pi_{j+1} = \mathcal{A}_j\mathcal{P}\pi_j$ introduced and used in earlier chapters. Whilst the $j$-dependence of the Bayes rule and of the Markov kernel is interchanged between the inverse problem and the filtering problem, this makes little material difference to implementation. Firstly, once a Markov kernel $\mathcal{P}$ is identified under which $\pi = \pi_J$ is invariant, it can easily be adapted to find a family of kernels $\mathcal{P}_j$ under which $\pi_j$ is invariant, simply by rescaling the observation covariance. Secondly, the fact that the data $y$ is fixed, rather than changing at each step, makes little difference to the implementation.
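The tempering sequence $\pi_j \propto \exp(-jh\,\mathsf{L}(u))\rho(u)$ can be illustrated numerically with plain importance weighting from the prior; a full SMC sampler would interleave resampling and moves with the $\pi_j$-invariant kernels $\mathcal{P}_j$, both omitted here for brevity. The forward map and noise level are illustrative assumptions.

```python
import numpy as np

# Tempering sketch: move from the prior rho = N(0, 1) to the posterior
# pi proportional to exp(-L(u)) rho(u) through pi_j proportional to
# exp(-j h L(u)) rho(u), accumulating one Bayes update per step
# (each with observation covariance h^{-1} Gamma0).
rng = np.random.default_rng(3)
G = lambda u: u**3                    # illustrative nonlinear forward map
y, Gamma0 = 1.0, 0.1
L = lambda u: 0.5 * (y - G(u))**2 / Gamma0

J = 10; hstep = 1.0 / J
u = rng.normal(0, 1, 5000)            # draws from the prior rho
logw = np.zeros_like(u)
for j in range(J):
    logw += -hstep * L(u)             # one tempered Bayes update
w = np.exp(logw - logw.max()); w /= w.sum()
posterior_mean = (w * u).sum()        # self-normalized importance estimate
```

Without the interleaved moves this collapses to ordinary importance sampling with total weight $\exp(-\mathsf{L}(u))$; the point of the factorization is precisely that resampling and rejuvenation can be inserted between the tempered updates.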
This perspective opens up an entire field, known as sequential Monte Carlo, in which ideas from filtering may be transferred to other quite different problems, including Bayesian inversion, as explained here. Note however (see the discussion at the end of Section 12.6) that of the filtering methods we have introduced so far, only the bootstrap filter applies directly to the case of nonlinear observation operators; since $G$ is in general nonlinear, this means that extra ideas are required to implement the optimal particle filter, 3DVAR, ExKF and EnKF. One approach is to replace the sequential optimization principles by the non-quadratic optimization problem required for nonlinear $G$; we do not discuss this idea in any detail but it is a viable option. Another approach is to use the linearization technique which we now outline in the context of EKI.

We now make a detour into the subject of how to use filters to estimate parameters $u$ from data $y$ satisfying (13.2). The approach we study, and which we will relate to sequential Monte Carlo in the next section, is to introduce an artificial time dynamic. This can be done quite simply as follows: we write
$$u_{j+1} = u_j,$$
$$y_{j+1} = G(u_{j+1}) + \eta_{j+1},$$
and we can think of finding the filtering distribution of $u_j|Y_j$. We discuss how to relate the $y_j$ to the single data point $y$ below. For now we take $\eta_j \sim \mathcal{N}(0, \Gamma)$, but we will revisit this choice in the next section, linking the problem to the solution of (13.1). Because the observation operator $G$ is, in general, nonlinear, this does not render our system in a form where we can readily apply the EnKF. To this end we introduce a new variable $w_j$ and rewrite the system as:
$$u_{j+1} = u_j,$$
$$w_{j+1} = G(u_j),$$
$$y_{j+1} = w_{j+1} + \eta_{j+1}.$$
We introduce the new variable $v = (u, w)^T$, the nonlinear map $\Psi(v) = \bigl( u, G(u) \bigr)^T$, and the linear operators $H = [0, I]$, $H^\perp = [I, 0]$. Then, writing $v_j = (u_j, w_j)^T$, we may write the dynamical system in the form
$$v_{j+1} = \Psi(v_j), \tag{13.3a}$$
$$y_{j+1} = H v_{j+1} + \eta_{j+1}. \tag{13.3b}$$
We note that $Hv = w$ and $H^\perp v = u$.

Remark 13.1.
Typically one of the following two constructions is used to produce artificial data $\{y_j\}$ for the filtering algorithm, given a single instance of the data $y$:
$$y_{j+1} = \begin{cases} y & \text{(unperturbed observations)}, \\ y + \bar{\eta}_{j+1}, \quad \bar{\eta}_{j+1} \sim \mathcal{N}(0, \Gamma) & \text{(perturbed observations)}. \end{cases}$$
The first choice is natural if viewing the algorithm as a sequential optimizer; the latter, in the linear case, is natural when seeking to draw samples from the posterior.

We now apply the EnKF to the dynamics/data model (13.3). We obtain, for $n = 1, \dots, N$,
$$\widehat{v}^{(n)}_{j+1} = \Psi(v^{(n)}_j), \tag{13.4}$$
$$\bar{v}_{j+1} = \frac{1}{N}\sum_{n=1}^N \widehat{v}^{(n)}_{j+1}, \tag{13.5}$$
$$\widehat{C}_{j+1} = \frac{1}{N}\sum_{n=1}^N \bigl( \widehat{v}^{(n)}_{j+1} - \bar{v}_{j+1} \bigr) \otimes \bigl( \widehat{v}^{(n)}_{j+1} - \bar{v}_{j+1} \bigr), \tag{13.6}$$
$$v^{(n)}_{j+1} = (I - K_{j+1}H)\,\widehat{v}^{(n)}_{j+1} + K_{j+1}\, y^{(n)}_{j+1}, \tag{13.7}$$
with the Kalman gain
$$K_{j+1} = \widehat{C}_{j+1} H^T S_{j+1}, \qquad S_{j+1} = \bigl( H \widehat{C}_{j+1} H^T + \Gamma \bigr)^{-1}.$$
Now we may simplify these expressions by using the specific $\Psi$, $v$, $H$ arising in the inverse problem. Writing
$$\widehat{C}_{j+1} = \begin{bmatrix} C^{uu}_{j+1} & C^{uw}_{j+1} \\ (C^{uw}_{j+1})^T & C^{ww}_{j+1} \end{bmatrix}, \qquad \bar{v}_{j+1} = \begin{pmatrix} \bar{u}_{j+1} \\ \bar{w}_{j+1} \end{pmatrix},$$
we have
$$\bar{u}_{j+1} = \frac{1}{N}\sum_{n=1}^N u^{(n)}_j, \qquad \bar{w}_{j+1} = \frac{1}{N}\sum_{n=1}^N G(u^{(n)}_j) =: \bar{G}_j,$$
and
$$C^{uw}_{j+1} = \frac{1}{N}\sum_{n=1}^N \bigl( u^{(n)}_j - \bar{u}_{j+1} \bigr) \otimes \bigl( G(u^{(n)}_j) - \bar{G}_j \bigr), \qquad C^{ww}_{j+1} = \frac{1}{N}\sum_{n=1}^N \bigl( G(u^{(n)}_j) - \bar{G}_j \bigr) \otimes \bigl( G(u^{(n)}_j) - \bar{G}_j \bigr).$$
There is a similar expression for $C^{uu}_{j+1}$, but as we show in what follows it is not needed for the update formula of the unknown parameter $u$. Noting that, because of the structure of $H$,
$$S_{j+1} = \bigl( C^{ww}_{j+1} + \Gamma \bigr)^{-1},$$
we obtain
$$K_{j+1} = \begin{pmatrix} C^{uw}_{j+1}\bigl( C^{ww}_{j+1} + \Gamma \bigr)^{-1} \\ C^{ww}_{j+1}\bigl( C^{ww}_{j+1} + \Gamma \bigr)^{-1} \end{pmatrix}. \tag{13.8}$$
Combining equation (13.8) with the update equation (13.7), it follows that the map $\{v^{(n)}_j\}_{n=1}^N \mapsto \{v^{(n)}_{j+1}\}_{n=1}^N$ induces a self-contained map $\{H^\perp v^{(n)}_j\}_{n=1}^N \mapsto \{H^\perp v^{(n)}_{j+1}\}_{n=1}^N$ on the $u$-components, and hence that
$$u^{(n)}_{j+1} = H^\perp v^{(n)}_{j+1} = u^{(n)}_j + C^{uw}_{j+1}\bigl( C^{ww}_{j+1} + \Gamma \bigr)^{-1}\Bigl( y^{(n)}_{j+1} - G(u^{(n)}_j) \Bigr).$$
Thus we have derived the EKI step
$$u^{(n)}_{j+1} = u^{(n)}_j + C^{uw}_{j+1}\bigl( C^{ww}_{j+1} + \Gamma \bigr)^{-1}\Bigl( y^{(n)}_{j+1} - G(u^{(n)}_j) \Bigr). \tag{13.9}$$
The full algorithm is described below:

Algorithm 13.1
Algorithm for Ensemble Kalman Inversion
Input: Initial distribution $\pi_0 = \rho$, observations $Y_J$, number of particles $N$.
Initial Sampling: Draw $N$ particles $u^{(n)}_0 \sim \rho$ so that $\pi^N_0 = S^N \rho$.
Subsequent Sampling: For $j = 0, 1, \dots, J-1$, perform:
1. Set $\bar{u}_{j+1} = \frac{1}{N}\sum_{n=1}^N u^{(n)}_j$.
2. Set $\bar{G}_j = \frac{1}{N}\sum_{n=1}^N G(u^{(n)}_j)$.
3. Set $C^{uw}_{j+1} = \frac{1}{N}\sum_{n=1}^N \bigl( u^{(n)}_j - \bar{u}_{j+1} \bigr) \otimes \bigl( G(u^{(n)}_j) - \bar{G}_j \bigr)$.
4. Set $C^{ww}_{j+1} = \frac{1}{N}\sum_{n=1}^N \bigl( G(u^{(n)}_j) - \bar{G}_j \bigr) \otimes \bigl( G(u^{(n)}_j) - \bar{G}_j \bigr)$.
5. Set $u^{(n)}_{j+1} = u^{(n)}_j + C^{uw}_{j+1}\bigl( C^{ww}_{j+1} + \Gamma \bigr)^{-1}\bigl( y^{(n)}_{j+1} - G(u^{(n)}_j) \bigr)$.
Output: $N$ particles $u^{(1)}_J, u^{(2)}_J, \dots, u^{(N)}_J$.

Remark 13.2.
This algorithm may be viewed as a derivative-free optimization method, within the broad class that contains genetic, swarm and ant colony optimization. The algorithm has the following invariant subspace property.
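A minimal EKI sketch (perturbed observations, illustrative nonlinear forward map and dimensions) also allows a numerical check of the invariant subspace property established in Theorem 13.3 below: with $N < d$, every iterate remains in the span of the initial ensemble.

```python
import numpy as np

# Minimal EKI sketch; also checks the invariant subspace property:
# every iterate stays in the span of the initial ensemble.
rng = np.random.default_rng(4)
d, k, N, J = 10, 3, 5, 15             # illustrative dimensions; note N < d
A = rng.standard_normal((k, d))
G = lambda u: A @ np.tanh(u)          # illustrative nonlinear forward map
Gamma = 0.1 * np.eye(k)
u_true = rng.standard_normal(d)
y = G(u_true) + rng.multivariate_normal(np.zeros(k), Gamma)

U = rng.standard_normal((N, d))       # initial ensemble, drawn from the prior
U0 = U.copy()
for j in range(J):
    Gu = np.array([G(u) for u in U])                     # (N, k) forward evaluations
    ubar, Gbar = U.mean(axis=0), Gu.mean(axis=0)
    Cuw = (U - ubar).T @ (Gu - Gbar) / N                 # (d, k) cross covariance
    Cww = (Gu - Gbar).T @ (Gu - Gbar) / N                # (k, k) data-space covariance
    Kgain = Cuw @ np.linalg.inv(Cww + Gamma)
    Y = y + rng.multivariate_normal(np.zeros(k), Gamma, size=N)  # perturbed observations
    U = U + (Y - Gu) @ Kgain.T                           # EKI step (13.9)

# Check: residual of the least-squares projection of the final ensemble
# onto the span of the initial ensemble should vanish.
coeff, *_ = np.linalg.lstsq(U0.T, U.T, rcond=None)
print(np.allclose(U0.T @ coeff, U.T, atol=1e-6))  # True
```

The final `lstsq` projection succeeds exactly (up to floating-point error) because, as the proof of Theorem 13.3 shows, each update is a linear combination of the current particles.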
Theorem 13.3 (Ensemble Subspace Property). Define $\mathcal{A} = \operatorname{span}\bigl\{u^{(n)}_0 : n = 1, \dots, N\bigr\}$. Then for all $0 \le j \le J$ and all $1 \le n \le N$, the $u^{(n)}_j$ defined by the iteration (13.9) lie in $\mathcal{A}$.

Proof. The proof proceeds by induction. We let $\mathbf{u}_j \in (\mathbb{R}^d)^N$ denote the collection $\{u^{(n)}_j\}_{n=1}^N$. Fix $0 \le j \le J$ and $1 \le m, n \le N$ and define the following two quantities:
\begin{align*}
d^{(n)}_j &:= \bigl(C^{ww}_{j+1} + \Gamma\bigr)^{-1}\Bigl(y^{(n)}_{j+1} - G\bigl(u^{(n)}_j\bigr)\Bigr),\\
D_{mn}(\mathbf{u}_j) &:= -\bigl\langle d^{(n)}_j,\, G\bigl(u^{(m)}_j\bigr) - \bar{G}_j \bigr\rangle.
\end{align*}
Notice that the $d^{(n)}_j$ are elements of the data space, whereas the $D_{mn}(\mathbf{u}_j)$ are scalar quantities. Furthermore notice that
\begin{align*}
u^{(n)}_{j+1} &= u^{(n)}_j + C^{uw}_{j+1}\bigl(C^{ww}_{j+1} + \Gamma\bigr)^{-1}\Bigl(y^{(n)}_{j+1} - G\bigl(u^{(n)}_j\bigr)\Bigr)\\
&= u^{(n)}_j + C^{uw}_{j+1}\, d^{(n)}_j\\
&= u^{(n)}_j + \frac{1}{N} \sum_{m=1}^{N} \bigl(u^{(m)}_j - \bar{u}_{j+1}\bigr) \otimes \bigl(G(u^{(m)}_j) - \bar{G}_j\bigr)\, d^{(n)}_j\\
&= u^{(n)}_j - \frac{1}{N} \sum_{m=1}^{N} D_{mn}(\mathbf{u}_j)\bigl(u^{(m)}_j - \bar{u}_{j+1}\bigr).
\end{align*}
From the definition of $\bar{G}_j$ and the bilinearity of the inner product it follows that $\sum_{m=1}^{N} D_{mn}(\mathbf{u}_j) = 0$. Therefore the update expression may be rewritten, for $0 \le j \le J$ and $1 \le n \le N$, as
\[
u^{(n)}_{j+1} = u^{(n)}_j - \frac{1}{N} \sum_{m=1}^{N} D_{mn}(\mathbf{u}_j)\, u^{(m)}_j.
\]
Hence if the property holds for all the particles at time step $j$, it clearly holds for all the particles at time step $j+1$; since the $u^{(n)}_0$ lie in $\mathcal{A}$ by definition, the induction is complete.

Remark 13.4.
The choice of the initial ensemble of particles $\{u^{(n)}_0\}_{n=1}^N$ is thus crucial to the performance of the EnKF, since the algorithm remains in the linear span of the initial ensemble. In the setting of Bayesian inverse problems the initial ensemble is frequently created by drawing from the prior. Alternatively, if the prior is Gaussian, the first $N$ covariance eigenvectors may be used, ordered by decreasing variance contribution. More generally, any truncated basis for the space $\mathbb{R}^d$ is a natural initial ensemble. However, the question of how to adaptively learn a good choice of ensemble subspace, in response to observed data, is unexplored and potentially fruitful.

Remark 13.5.
We describe an alternative way to approach the derivation of the EKI update formulae. We apply Theorem 10.1 with the specific structure arising from the dynamical system used in EKI. To this end we define
\[
\mathsf{I}_n(b) := \frac{1}{2} \Bigl\| y^{(n)}_{j+1} - G\bigl(u^{(n)}_j\bigr) - \frac{1}{N} \sum_{m=1}^{N} b_m \bigl(G(u^{(m)}_j) - \bar{G}_j\bigr) \Bigr\|_\Gamma^2 + \frac{1}{2N} \sum_{m=1}^{N} b_m^2. \tag{13.10}
\]
Once this quadratic form has been minimized with respect to $b$, the update formula (10.12) gives
\begin{align*}
u^{(n)}_{j+1} &= u^{(n)}_j + \frac{1}{N} \sum_{m=1}^{N} b_m \bigl(u^{(m)}_j - \bar{u}_j\bigr),\\
w^{(n)}_{j+1} &= G\bigl(u^{(n)}_j\bigr) + \frac{1}{N} \sum_{m=1}^{N} b_m \bigl(G(u^{(m)}_j) - \bar{G}_j\bigr).
\end{align*}
(Note that the vector $b$ depends on $n$; we have suppressed this dependence for notational convenience.) Theorem 13.3 is an immediate consequence of this structure.

Here we link the two preceding sections. In the first subsection we describe how EKI may be linked with SMC through a particular scaling of $\Gamma$ with $h$. We then take the limit $h \to 0$ in the setting where $G$ is linear, leading to insight into the EKI algorithm in the iterated context.

Although we have emphasized the optimization perspective on ensemble methods, we may think of one step of ensemble Kalman inversion as approximating the filtering mapping $\pi_j = \mathbb{P}(u_j \mid Y_j) \mapsto \pi_{j+1} = \mathbb{P}(u_{j+1} \mid Y_{j+1})$; this will be a good approximation when the measures are close to Gaussian and the number of ensemble members $N$ is large. In order to link this to the mapping $\pi_{j+1} = \mathcal{A}\pi_j$ in the presentation of SMC in the first section of this chapter, we replace $\Gamma$ by $h^{-1}\Gamma$, and we assume unperturbed observations $y^{(n)}_j = y$ for all $j$. Let $\mathbf{u}_j \in (\mathbb{R}^d)^N$ denote, as in the previous section, the collection of particles $\{u^{(n)}_j\}_{n=1}^N$ at discrete time $j$. Now define
\[
D_{mn}(\mathbf{u}_j) := \bigl\langle G\bigl(u^{(n)}_j\bigr) - y,\, G\bigl(u^{(m)}_j\bigr) - \bar{G}_j \bigr\rangle_\Gamma.
\]
Note that then, to leading order in $h \ll 1$,
\[
-\Bigl\langle \bigl(C^{ww}_{j+1} + h^{-1}\Gamma\bigr)^{-1}\bigl(y - G(u^{(n)}_j)\bigr),\, G\bigl(u^{(m)}_j\bigr) - \bar{G}_j \Bigr\rangle \approx h \bigl\langle G\bigl(u^{(n)}_j\bigr) - y,\, G\bigl(u^{(m)}_j\bigr) - \bar{G}_j \bigr\rangle_\Gamma = h\, D_{mn}(\mathbf{u}_j).
\]
It follows that, also to leading order in $h$,
\[
u^{(n)}_{j+1} \approx u^{(n)}_j - \frac{h}{N} \sum_{m=1}^{N} D_{mn}(\mathbf{u}_j)\, u^{(m)}_j.
\]
Then, letting $h \to 0$, we obtain the limiting system of ordinary differential equations
\[
\frac{du^{(n)}}{dt} = -\frac{1}{N} \sum_{m=1}^{N} D_{mn}(\mathbf{u})\, u^{(m)}, \tag{13.11}
\]
where, similarly as above, $\mathbf{u}$ represents the collection of $N$ particles.

Remark 13.6.
Notice that equation (13.11) has families of fixed points where: (i) the particles fit the data exactly (which corresponds to the left-hand entry of the inner product defining $D_{mn}(\mathbf{u})$ being zero); or (ii) all particles collapse onto their mean value (which corresponds to the right-hand entry of the same inner product being zero). This suggests that the system of ordinary differential equations which describes the behaviour of ensemble Kalman inversion is driven by two desirable attributes: matching the data and achieving consensus.

Now suppose $G(\cdot)$ is a linear map denoted $A\,\cdot$, and define
\[
\bar{u} = \frac{1}{N} \sum_{n=1}^{N} u^{(n)}.
\]
In this setting we have
\begin{align*}
\frac{du^{(n)}}{dt} &= -\frac{1}{N} \sum_{m=1}^{N} \bigl\langle A u^{(n)} - y,\, A\bigl(u^{(m)} - \bar{u}\bigr) \bigr\rangle_\Gamma\, u^{(m)}\\
&= -\frac{1}{N} \sum_{m=1}^{N} \bigl\langle A u^{(n)} - y,\, A\bigl(u^{(m)} - \bar{u}\bigr) \bigr\rangle_\Gamma\, \bigl(u^{(m)} - \bar{u}\bigr)\\
&= -\frac{1}{N} \sum_{m=1}^{N} \bigl\langle A^T \Gamma^{-1}\bigl(A u^{(n)} - y\bigr),\, u^{(m)} - \bar{u} \bigr\rangle\, \bigl(u^{(m)} - \bar{u}\bigr)\\
&= -\frac{1}{N} \sum_{m=1}^{N} \bigl(u^{(m)} - \bar{u}\bigr) \otimes \bigl(u^{(m)} - \bar{u}\bigr)\, \bigl(A^T \Gamma^{-1}\bigl(A u^{(n)} - y\bigr)\bigr).
\end{align*}
If we define
\[
C(\mathbf{u}) = \frac{1}{N} \sum_{n=1}^{N} \bigl(u^{(n)} - \bar{u}\bigr) \otimes \bigl(u^{(n)} - \bar{u}\bigr) \tag{13.12}
\]
then we find that, for $n = 1, \dots, N$,
\[
\frac{du^{(n)}}{dt} = -C(\mathbf{u})\, A^T \Gamma^{-1}\bigl(A u^{(n)} - y\bigr) = -C(\mathbf{u})\, \nabla \mathsf{L}\bigl(u^{(n)}\bigr),
\]
where
\[
\mathsf{L}(u) = \frac{1}{2}\bigl| y - A u \bigr|_\Gamma^2.
\]
This corresponds to a gradient descent in the subspace defined by the initial ensemble. In particular
\[
\frac{d}{dt}\mathsf{L}\bigl(u^{(n)}(t)\bigr) = \Bigl\langle \nabla \mathsf{L}\bigl(u^{(n)}(t)\bigr),\, \frac{du^{(n)}}{dt}(t) \Bigr\rangle = -\Bigl| C(\mathbf{u})^{1/2}\, \nabla \mathsf{L}\bigl(u^{(n)}(t)\bigr) \Bigr|^2 \le 0,
\]
demonstrating that the loss function $\mathsf{L}$ is decreasing along the trajectory associated to each ensemble member. We note also that, if we define $e^{(n)} = u^{(n)} - \bar{u}$, then
\[
\frac{de^{(n)}}{dt} = -C(\mathbf{u})\, A^T \Gamma^{-1} A\, e^{(n)}.
\]
Because
\[
C := C(\mathbf{u}) = \frac{1}{N} \sum_{n=1}^{N} e^{(n)} \otimes e^{(n)}
\]
it follows that
\[
\frac{dC}{dt} = -2\, C A^T \Gamma^{-1} A\, C.
\]
Note further that if $C$ is invertible (which requires $N \ge d$) then the inverse $P := C^{-1}$ satisfies
\[
\frac{dP}{dt} = 2\, A^T \Gamma^{-1} A,
\]
so that $\|P\| \to \infty$ as $t \to \infty$.
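This gradient-flow structure is easy to check in simulation. The sketch below integrates $du^{(n)}/dt = -C(\mathbf{u})\nabla\mathsf{L}(u^{(n)})$ with explicit Euler and tracks the mean loss and the ensemble spread; the matrix $A$, the step size, and all dimensions are arbitrary illustrative choices (with $\Gamma = I$ for simplicity), not parameters from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, N = 3, 4, 20
A = rng.normal(size=(k, d))
Gamma_inv = np.eye(k)                     # Gamma = I for simplicity
y = rng.normal(size=k)

def loss(un):                             # L(u) = 0.5 |y - A u|_Gamma^2
    r = y - A @ un
    return 0.5 * r @ Gamma_inv @ r

u = rng.normal(size=(N, d))               # initial ensemble, rows
dt = 0.01                                 # explicit Euler step (illustrative)
losses = [np.mean([loss(un) for un in u])]
spreads = [u.std(axis=0).sum()]
for _ in range(2000):
    u_bar = u.mean(axis=0)
    C = (u - u_bar).T @ (u - u_bar) / N   # empirical covariance (13.12)
    # Rows of grad are grad L(u^(n)) = A^T Gamma^{-1} (A u^(n) - y).
    grad = (u @ A.T - y) @ Gamma_inv @ A
    u = u - dt * grad @ C.T               # Euler step of the gradient flow
    losses.append(np.mean([loss(un) for un in u]))
    spreads.append(u.std(axis=0).sum())
```

Along the flow the mean loss decreases and the spread shrinks, illustrating the two mechanisms identified above: matching the data and achieving consensus.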
This divergence of $\|P\|$, which can be suitably interpreted when $C$ is invertible only on a subspace, suggests that the covariance shrinks with time, causing ensemble collapse (consensus) while at the same time driving minimization of the loss $\mathsf{L}$ over an appropriate subspace.

The idea of using particle filters to sample general distributions, including those arising in Bayesian inversion, may be found in [26]; a recent application to a Bayesian inverse problem, which demonstrated the potential of the methodology in that context, is [67]. A simple proof of convergence of the method may be found in [11]; it is based on the proof described in [97] for the standard bootstrap particle filter.

The use of the ensemble Kalman filter for parameter estimation was introduced in the papers [79, 6], in which a physical dynamical model was appended with trivial dynamics for the parameters in order to estimate them; the idea was extended to learn an entire field of parameter values in [88]. The paper [104] was the first to do what we do here, namely to consider all the data at once and a single mapping of unknown parameters to data. In that paper only one iteration is used; the papers [19, 30] demonstrated how iteration could be useful. See also the book [91] for the use of ensemble Kalman methods in oil reservoir simulation.

Development of the method for general inverse problems was undertaken in [58], and further development of iterative methods is described in [57, 60]. Ensemble inversion was analyzed in [103], including the continuous-time limit and gradient flow structure described here. The potential for ensemble inversion in the context of hierarchical Bayesian methods is demonstrated in [18]. The idea that we explain here, of obtaining an evolution equation for the covariance which is satisfied exactly by ensemble methods, appears in the remarkable papers [98, 10], and in the other papers of Bergemann and Reich referenced therein.
Their work is at the heart of the analysis in [103], which demonstrates the ensemble collapse and approximation properties of the ensemble method when applied to linear inverse problems.

References

[1] H. Abarbanel.
Predicting the Future: Completing Models of Observed Complex Systems. Springer, 2013.
[2] S. Agapiou, S. Larsson, and A. M. Stuart. Posterior contraction rates for the Bayesian approach to linear ill-posed inverse problems. Stochastic Processes and their Applications, 123(10):3828–3860, 2013.
[3] S. Agapiou, M. Burger, M. Dashti, and T. Helin. Sparsity-promoting and edge-preserving maximum a posteriori estimators in non-parametric Bayesian inverse problems. Inverse Problems, 34(4), 2017.
[4] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405–431, 2017.
[5] B. Anderson and J. Moore. Optimal filtering. Englewood Cliffs, 21:22–95, 1979.
[6] C. Anderson. An ensemble adjustment Kalman filter for data assimilation. Monthly Weather Review, 129:2284–2903, 2001.
[7] C. Anderson. Monte Carlo methods and importance sampling. Lecture Notes for Statistical Genetics, 2014.
[8] P. Bassiri, C. Holmes, and S. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
[9] B. Bell. The iterated Kalman smoother as a Gauss–Newton method. SIAM Journal on Optimization, 4(3):626–636, 2001.
[10] K. Bergemann and S. Reich. An ensemble Kalman–Bucy filter for continuous data assimilation. Meteorologische Zeitschrift, 127(5):1417–1440, 2012.
[11] A. Beskos, A. Jasra, K. Law, R. Tempone, and Y. Zhou. Multilevel sequential Monte Carlo samplers. Stochastic Processes and their Applications, 127(5):1417–1440, 2017.
[12] P. Bickel, B. Li, and T. Bengtsson. Sharp failure rates for the bootstrap particle filter in high dimensions. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, Institute of Mathematical Statistics, pages 318–329, 2008.
[13] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[14] C. Brett, K. Lam, K. Law, D. McCormick, M. Scott, and A. M. Stuart. Accuracy and stability of filters for dissipative PDEs. Physica D: Nonlinear Phenomena, 2013.
[15] J. Bröcker. Existence and uniqueness for four-dimensional variational data assimilation in discrete time. SIAM Journal on Applied Dynamical Systems, 16(1):361–374, 2013.
[16] S. Brooks, A. Gelman, G. Jones, and X. Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
[17] A. Carrassi, M. Bocquet, L. Bertino, and G. Geir. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change, 9(5), 2018.
[18] N. Chada, M. Iglesias, L. Roininen, and A. M. Stuart. Parameterizations for ensemble Kalman inversion. Inverse Problems, 34, 2018.
[19] Y. Chen and D. Oliver. Ensemble randomized maximum likelihood method as an iterative ensemble smoother. Mathematical Geosciences, 44(1):1–26, 2012.
[20] S. Cotter, M. Dashti, and A. M. Stuart. Approximation of Bayesian inverse problems for PDEs. SIAM Journal on Numerical Analysis, 48(1):322–345, 2010.
[21] S. Cotter, M. Dashti, and A. M. Stuart. MCMC methods for functions: modifying old algorithms to make them faster. SIAM Journal on Numerical Analysis, 48(1):322–345, 2010.
[22] D. Crisan, P. Moral, and T. Lyons. Discrete filtering using branching and interacting particle systems. Université de Toulouse, Laboratoire de Statistique et Probabilités [LSP], 1998.
[23] M. Dashti and A. M. Stuart. Bayesian approach to inverse problems. Handbook of Uncertainty Quantification, pages 311–428, 2017.
[24] M. Dashti, K. Law, A. M. Stuart, and J. Voss. MAP estimators and their consistency in Bayesian nonparametric inverse problems. Inverse Problems, 29(9), 2013.
[25] P. del Moral. Feynman–Kac Formulae: Genealogical and Interacting Particle Systems with Applications, pages 47–93, 2004.
[26] P. del Moral. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
[27] A. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[28] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3), 2000.
[29] A. Doucet, N. Freitas, and N. Gordon. An Introduction to Sequential Monte Carlo Methods. Springer, 2001.
[30] A. Emerick and A. Reynolds. Investigation of the sampling performance of ensemble-based methods with a simple reservoir model. Computational Geosciences, 17(2):325–350, 2013.
[31] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Springer Science and Business Media, 1996.
[32] O. Ernst, B. Sprungk, and H. Starkloff. Analysis of the ensemble and polynomial chaos Kalman filters in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 3(1):823–851, 2015.
[33] G. Evans. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research: Oceans, 99(C5):10143–10162, 1995.
[34] G. Evans and P. V. Leeuwen. Methods for ensemble prediction. Monthly Weather Review, 123(7):2181–2196, 1995.
[35] G. Evans and P. V. Leeuwen. Assimilation of Geosat altimeter data for the Agulhas current using the ensemble Kalman filter with a quasigeostrophic model. Monthly Weather Review, 124(1):85–96, 1996.
[36] M. Evans and T. Swartz. Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Statistical Science, pages 254–272, 1995.
[37] G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer Science and Business Media, 2009.
[38] M. Fisher, J. Nocedal, Y. Trémolet, and S. Wright. Data assimilation in weather forecasting: a case study in PDE-constrained optimization. Optimization and Engineering, 10(3):409–426, 2009.
[39] J. Franklin. Well-posed stochastic extensions of ill-posed linear problems. Journal of Mathematical Analysis and Applications, 31(3):682–716, 1970.
[40] D. Gamerman and H. Lopes. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. CRC Press, 2006.
[41] N. Garcia Trillos and D. Sanz-Alonso. Continuum limits of posteriors in graph Bayesian inverse problems. SIAM Journal on Mathematical Analysis, 50(4):4020–4040, 2018.
[42] N. Garcia Trillos and D. Sanz-Alonso. The Bayesian update: variational formulations and gradient flows. Bayesian Analysis, 2018.
[43] N. Garcia Trillos, Z. Kaplan, T. Samakhoana, and D. Sanz-Alonso. On the consistency of graph-based Bayesian learning and the scalability of sampling algorithms. arXiv preprint arXiv:1710.07702, 2017.
[44] N. Garcia Trillos, Z. Kaplan, and D. Sanz-Alonso. Variational characterizations of local entropy and heat regularization in deep learning. arXiv preprint arXiv:1901.10082, 2019.
[45] M. Ghil, S. Cohn, J. Tavantzis, and K. Bube. Application of estimation theory to numerical weather prediction. Dynamic Meteorology: Data Assimilation Methods, 1991.
[46] A. Gibbs and F. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
[47] M. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015.
[48] E. Gine and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2015.
[49] F. Gland, V. Monbet, and V. Tran. Large sample asymptotics for the ensemble Kalman filter. PhD Thesis, 2009.
[50] G. Gottwald and A. Majda. A mechanism for catastrophic filter divergence in data assimilation for sparse observation networks. Nonlinear Processes in Geophysics, 20(5):705–712, 2013.
[51] M. Hairer, A. M. Stuart, J. Voss, and P. Wiberg. Analysis of SPDEs arising in path sampling. Part I: The Gaussian case. Communications in Mathematical Sciences, 3(4):587–603, 2013.
[52] J. Hammersley and D. Handscomb. Percolation processes. Monte Carlo Methods, pages 134–141, 1964.
[53] A. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1964.
[54] K. Hayden, E. Olson, and E. Titi. Discrete data assimilation in the Lorenz and 2D Navier–Stokes equations. Physica D: Nonlinear Phenomena, 240(18):1416–1425, 2011.
[55] T. Helin and M. Burger. Maximum a posteriori probability estimates in infinite-dimensional Bayesian inverse problems. Inverse Problems, 31(8), 2015.
[56] P. Houtekamer and H. Mitchell. Data assimilation using an ensemble Kalman filter technique. Monthly Weather Review, 126(3):796–811, 1998.
[57] M. Iglesias. Iterative regularization for ensemble data assimilation in reservoir models. Computational Geosciences, 19(1):177–212, 2015.
[58] M. Iglesias, K. Law, and A. M. Stuart. Ensemble Kalman methods for inverse problems. Inverse Problems, 29(4):134–141, 2014.
[59] M. Iglesias, K. Lin, and A. M. Stuart. Well-posed Bayesian geometric inverse problems arising in subsurface flow. Inverse Problems, 30(11), 2014.
[60] M. Iglesias, K. Law, and A. M. Stuart. A regularizing iterative ensemble Kalman method for PDE-constrained inverse problems. Inverse Problems, 32(2), 2016.
[61] A. Jazwinski. Stochastic Processes and Filtering Theory. Courier Corporation, 2007.
[62] A. Johansen and A. Doucet. A note on auxiliary particle filters. Statistics and Probability Letters, 78(12):1498–1504, 2008.
[63] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Springer Science & Business Media, 160, 2006.
[64] R. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[65] R. Kalman and R. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83(1):95–108, 1961.
[66] E. Kalnay. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 2003.
[67] N. Kantas, A. Beskos, and A. Jasra. Sequential Monte Carlo methods for high-dimensional inverse problems: a case study for the Navier–Stokes equations. SIAM Journal on Uncertainty Quantification, 2(1):464–489, 2014.
[68] R. Kawai. Adaptive importance sampling Monte Carlo simulation for general multivariate probability laws. Journal of Computational and Applied Mathematics, 319:440–459, 2017.
[69] D. Kelly and A. M. Stuart. Well-posedness and accuracy of the ensemble Kalman filter in discrete and continuous time. Nonlinearity, 27(10), 2014.
[70] D. Kelly and A. M. Stuart. Ergodicity and accuracy of optimal particle filters for Bayesian data assimilation. arXiv preprint arXiv:1611.08761, 2014.
[71] B. Knapik, A. van der Vaart, and J. van Zanten. Bayesian inverse problems with Gaussian priors. Annals of Statistics, 39(5):2626–2657, 2011.
[72] K. Law, A. Shukla, and A. M. Stuart. Analysis of the 3DVAR filter for the partially observed Lorenz '63 model. arXiv preprint arXiv:1212.4923, 2012.
[73] K. Law, A. M. Stuart, and K. Zygalakis. Data Assimilation. Springer, 2015.
[74] K. Law, D. Sanz-Alonso, A. Shukla, and A. M. Stuart. Filter accuracy for the Lorenz 96 model: Fixed versus adaptive observation operators. Physica D: Nonlinear Phenomena, 325:1–13, 2016.
[75] P. V. Leeuwen, Y. Cheng, and S. Reich. Nonlinear Data Assimilation. Springer, 2015.
[76] C. Lieberman, C. Willcox, and O. Ghattas. Parameter and state model reduction for large-scale statistical inverse problems. SIAM Journal on Scientific Computing, 32(5):2535–2542, 2010.
[77] T. Lindvall. Lectures on the Coupling Method. Springer, 2002.
[78] A. Lorenc. Analysis methods for numerical weather prediction. Quarterly Journal of the Royal Meteorological Society, 112(474):1177–1194, 1986.
[79] R. Lorentzen, R. Fjelde, J. Frøyen, A. Lage, G. Naevdal, and E. Vefring. Underbalanced and low-head drilling operations: Real time interpretation of measured data and operational support. SPE Annual Technical Conference and Exhibition, 2001.
[80] Y. Lu, A. M. Stuart, and H. Weber. Gaussian approximations for probability measures on $\mathbb{R}^d$. SIAM/ASA Journal on Uncertainty Quantification, 5(1):1136–1165, 2017.
[81] D. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[82] A. Majda and J. Harlim. Filtering Complex Turbulent Systems. Cambridge University Press, 2012.
[83] J. Martin, L. Wilcox, C. Burstedde, and G. Omar. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.
[84] Y. Marzouk and D. Xiu. A stochastic collocation approach to Bayesian inference in inverse problems. Communications in Computational Physics, 6(4):826–847, 2009.
[85] J. Mattingly, A. Stuart, and D. Higham. Ergodicity for PDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Processes and their Applications, 101(2):185–232, 2002.
[86] S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. Springer Science and Business Media, 2012.
[87] A. Moodey, A. Lawless, R. Potthast, and P. V. Leeuwen. Nonlinear error dynamics for cycled data assimilation methods. Inverse Problems, 29(2), 2013.
[88] G. Naevdal, T. Mannseth, and E. Vefring. Near-well reservoir monitoring through ensemble Kalman filter. Proceedings of the SPE Improved Oil Recovery Symposium, 2002.
[89] R. Nickl. Bernstein–von Mises theorems for statistical inverse problems: Schrödinger equation. arXiv preprint arXiv:1707.01764, 2017.
[90] F. Nielsen and V. Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
[91] D. Oliver, A. Reynolds, and N. Liu. Inverse Theory for Petroleum Reservoir Characterization and History Matching. Cambridge University Press, 2008.
[92] K. Petersen and M. Pedersen. The Matrix Cookbook. Technical University of Denmark, 2008.
[93] F. Pinski, F. Simpson, A. M. Stuart, and H. Weber. Algorithms for Kullback–Leibler approximation of probability measures in infinite dimensions. SIAM Journal on Scientific Computing, 37(6):A2733–A2757, 2015.
[94] F. Pinski, F. Simpson, A. M. Stuart, and H. Weber. Kullback–Leibler approximation for probability measures on infinite dimensional spaces. SIAM Journal on Mathematical Analysis, 47(6):4091–4122, 2015.
[95] M. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
[96] H. Rauch, C. Striebel, and F. Tung. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445–1450, 1965.
[97] P. Rebeschini and R. V. Handel. Can local particle filters beat the curse of dimensionality? Annals of Applied Probability, 25(5):2809–2866, 2015.
[98] S. Reich. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249, 2017.
[99] S. Reich and C. Cotter. Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press, 2015.
[100] D. Sanz-Alonso. Importance sampling and necessary sample size: An information theory approach. SIAM/ASA Journal on Uncertainty Quantification, 6(2):867–879, 2018.
[101] D. Sanz-Alonso and A. M. Stuart. Long-time asymptotics of the filtering distribution for partially observed chaotic dynamical systems. SIAM/ASA Journal on Uncertainty Quantification, 3(1):1200–1220, 2015.
[102] D. Sanz-Alonso and A. M. Stuart. Gaussian approximations of small noise diffusions in Kullback–Leibler divergence. Communications in Mathematical Sciences, 15(7):2087–2097, 2017.
[103] C. Schillings and A. M. Stuart. Analysis of the ensemble Kalman filter for inverse problems. SIAM Journal on Numerical Analysis, 55(3), 2017.
[104] J. Skjervheim, J. Evensen, S. Aanonsen, B. Ruud, and T. Johansen. Incorporating 4D seismic data in reservoir simulation models using ensemble Kalman filter. SPE Journal, 12(3):282–292, 2007.
[105] C. Snyder. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
[106] C. Snyder. Particle filters, the optimal proposal and high-dimensional systems. Proceedings of the ECMWF Seminar on Data Assimilation for Atmosphere and Ocean, pages 1–10, 2011.
[107] C. Snyder, T. Bengtsson, P. Bickel, and J. Anderson. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629–4640, 2016.
[108] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2015.
[109] A. Tarantola. Towards adjoint-based inversion for rheological parameters in nonlinear viscous mantle flow. Physics of the Earth and Planetary Interiors, 234:23–34, 2015.
[110] S. Tokdar, S. Kass, and R. Kass. Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):54–60, 2010.
[111] X. Tong, A. Majda, and D. Kelly. Nonlinear stability of the ensemble Kalman filter with adaptive covariance inflation. Nonlinearity, 29(2):54–60, 2015.
[112] X. T. Tong, A. J. Majda, and D. Kelly. Nonlinear stability and ergodicity of ensemble based Kalman filters.