Non-Stationary Multi-layered Gaussian Priors for Bayesian Inversion
Muhammad Emzir, Sari Lasanen, Zenith Purisha, Lassi Roininen, Simo Särkkä
1) Department of Electrical Engineering and Automation, Aalto University, P.O. Box 12200, FI-00076 Aalto, Finland
2) Sodankylä Geophysical Observatory, University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland
3) School of Engineering Science, Lappeenranta-Lahti University of Technology, P.O. Box 20, FI-53851 Lappeenranta, Finland
E-mail: [email protected]
March 2020
Abstract.
In this article, we study Bayesian inverse problems with multi-layered Gaussian priors. We first describe the conditionally Gaussian layers in terms of a system of stochastic partial differential equations. We build the computational inference method using a finite-dimensional Galerkin method. We show that the proposed approximation has a convergence-in-probability property to the solution of the original multi-layered model. We then carry out Bayesian inference using the preconditioned Crank–Nicolson algorithm, which is modified to work with multi-layered Gaussian fields. We show via numerical experiments in signal deconvolution and computerized X-ray tomography problems that the proposed method can offer both smoothing and edge preservation at the same time.
1. Introduction
The Bayesian approach provides a consistent framework for obtaining solutions of inverse problems. By formulating the unknown as a random variable, the degree of information that is available can be encoded as a statistical prior. The ill-posedness of the problem is mitigated by reformulating the inverse problem as a well-posed extension in the space of probability distributions [1]. Among the statistical priors commonly used in Bayesian inverse problems is the Gaussian prior, which is relatively easy to manipulate, has a simple structure, and also has a close relation to traditional Tikhonov regularization. This approach has also attracted growing interest in the machine learning community, where the use of a Gaussian prior for Bayesian inference is known as Gaussian process regression [2]. When the unknown is a multivariate function, it is natural to model it as a random field. There is a vast amount of studies on Gaussian random fields and their applications.

Figure 1. Illustration of a chain of Gaussian fields. Each node (field) has an independent white noise input field and a length-scale parameter $\ell$. The length-scale $\ell$ is obtained as a function evaluated on the fields above it. These Gaussian fields are approximated in finite-dimensional Hilbert space using $\mathbf{L}(\mathbf{u}_{j-1})\mathbf{u}_j = \mathbf{w}_j$, see (4).

In many spatio-temporal inverse problems, it is often useful to start by working in an infinite-dimensional space. There are essentially two approaches to construct a Bayesian inference algorithm. The first approach is to discretize the forward map (e.g., via grid partitions or finite element meshes) and apply a Bayesian inference method in the finite-dimensional setting [1]. The second approach [14, 15] is to directly apply Bayesian inference to the infinite-dimensional problem and afterward apply a discretization method. The latter approach is possible through realizing that the posterior and prior probability distributions can be related via the Radon–Nikodym derivative, which can be generalized to function spaces [14]. For any of the aforementioned approaches, the sampling technique has to be carefully designed; traditional Markov chain Monte Carlo (MCMC) algorithms degenerate under mesh refinement, which motivates function-space methods such as the preconditioned Crank–Nicolson algorithm [16].
We use $u_j$ to denote the random field at layer $j$, where the total number of layers is $J+1$, that is, $J$ is the number of hyperprior layers. The collection of $J+1$ random fields $(u_0, u_1, \ldots, u_J)$ is denoted by $u$. Let $\{\phi_l\}$ be a basis formed from the orthogonal eigenfunctions of the Laplace operator with respect to some domain $\Omega$ with suitable boundary conditions. Let $N$ be the number of basis functions $\{\phi_l\}$ used in the Galerkin method and let $H_N$ denote their span. The Fourier transform of any random field $z$ is denoted by $\hat{z}$, where the Fourier coefficient for index $l$ is given by $\hat{z}(l) = \langle z, \phi_l \rangle$, where $\langle \cdot, \cdot \rangle$ is the standard $L^2$ inner product on the domain $\Omega$. We use bold characters to denote vectors or matrices with elements in $\mathbb{C}$ or $\mathbb{R}$. The Fourier coefficients of the random field $u_j$ for indices $-N$ to $N$ are collected in $\mathbf{u}_j$, that is, $\mathbf{u}_j = (\hat{u}_j(-N), \ldots, \hat{u}_j(N))$. We denote the $J+1$ collections of the Fourier coefficients of the random fields $u$ by $\mathbf{u} = (\mathbf{u}_0, \ldots, \mathbf{u}_J)$. The identity operator is denoted by $I$.
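For concreteness, on the unit interval the coefficients $\hat{z}(l)$ can be approximated from equispaced samples with the FFT. The following is a minimal sketch (the function name and the odd grid-size convention are ours, not from the paper):

```python
import numpy as np

def fourier_coeffs(z_grid):
    """Approximate ẑ(l) = <z, φ_l>, l = -N, ..., N, for φ_l(x) = exp(i 2π l x)
    on [0, 1), from samples of z on a uniform grid of odd size 2N + 1."""
    M = len(z_grid)  # M = 2N + 1
    # np.fft.fft returns coefficients for frequencies 0, ..., N, -N, ..., -1;
    # fftshift reorders them to -N, ..., N, and 1/M normalizes the quadrature.
    return np.fft.fftshift(np.fft.fft(z_grid)) / M
```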
2. Finite-dimensional approximations
Consider a Bayesian inverse problem on a Gaussian field where the unknown is a real-valued random field $\upsilon(x) : \Omega \to \mathbb{R}$ on a bounded domain $\Omega \subset \mathbb{R}^d$. We assume in the inverse problem that $\upsilon$ belongs to a Hilbert space $H$, specifically, $\upsilon \in H := L^2(\Omega)$. To carry out the Bayesian inference, a set of measurements is taken (either direct or indirect, in multiple locations). In this article, the measurement is assumed to be a linear operation on $\upsilon$ corrupted with additive noise, that is, $y_k = \langle \upsilon, h_k \rangle + e_k$, where $h_k$ is an element in $H$ representing a real linear functional on $H$, and $e_k$ is a zero-mean white noise with covariance matrix $\mathbf{E}$.

In the Bayesian framework, the estimation problem is equivalent to exploring the posterior distribution of $\upsilon$ given the measurements $\{y_k\}$. The Bayesian inversion approach for this problem starts with assuming that $\upsilon$ is a Gaussian field with a certain mean (assumed zero for simplicity) and covariance function $C(x, x')$. In the case when $C(x, x')$ is a Matérn covariance function, the Gaussian field $\upsilon$ can be generated from a stochastic partial differential equation of the form [8, 9]
$$ (1 - \ell^2 \Delta)^{\alpha/2}\, \upsilon(x) = \sqrt{\beta \ell^d}\, w(x), \qquad (1) $$
where $\alpha = \nu + d/2$, $d$ is the dimension of the space, $\nu$ is a smoothness parameter, $w(x)$ is a white noise on $\mathbb{R}^d$, $\ell$ is the length-scale constant of the Matérn covariance function $C$, and $\beta = \sigma^2\, 2^d \pi^{d/2}\, \Gamma(\alpha)/\Gamma(\nu)$ with $\sigma$ being a scale parameter.

To obtain a non-stationary field, we modify SPDE (1) so that the length-scale $\ell$ is modeled via another Gaussian field $u$ with Matérn covariance function. Namely, we select $\ell(x) = g(u(x))$, where $g$ is a smooth positive function $g : \mathbb{R} \to \mathbb{R}_+$. As in [9], we also require that $\ell$ satisfies $\sup_{x \in \Omega} \ell(x) < \infty$ and $\inf_{x \in \Omega} \ell(x) > 0$. In this article we restrict our attention to $\alpha = 2$; the results could also be extended to other cases. Introducing the spatially varying length-scale $\ell(x) = g(u(x))$ into (1), and since the length-scale is greater than zero with probability one, using $\kappa = 1/\ell$ we obtain the following SPDE
$$ \big(\kappa(u(x))^2 - \Delta\big)\, \upsilon(x) = \sqrt{\beta}\, \kappa(u(x))^{\nu}\, w(x). \qquad (2) $$

We approximate $\upsilon$ using the eigenfunctions of the Laplacian. With a suitable boundary condition, the Laplace operator can be expressed as $-\Delta \upsilon = \sum_{j=-\infty}^{\infty} \lambda_j \langle \upsilon, \phi_j \rangle \phi_j$, where $\{\phi_j\}$ is a complete set of orthonormal eigenfunctions of $\Delta$ and $\lambda_j > 0$ with $\lim_{j\to\infty} \lambda_j = \infty$ [36]. Observe that, in the sense of (2), $\kappa(u(x)) := 1/g(u(x)) \in L^\infty(\Omega)$ is a multiplication operator acting pointwise, that is, $(\kappa(u)\upsilon)(x) = \kappa(u(x))\upsilon(x)$ for all $x \in \Omega$ [37].

In what follows, we first describe the matrix representation of a chain of Gaussian fields generated from SPDEs of the form (2) on a $d$-dimensional domain. Then, in Section 3, we develop a convergence result of the Galerkin method developed here to the weak solution of (2).

Let us examine a periodic boundary condition on a $d$-dimensional box $\Omega$ with side length 1. Under this boundary condition, it is useful to consider $H$ as a complex Hilbert space, so that we can set $\phi_l = \exp(i c_d\, x^\top k(l))$ as Fourier complex basis functions for $d$ dimensions, for some constant $c_d$ and a multi-index $k(l)$ which is unique for every $l$. Let the finite-dimensional Hilbert subspace $H_N$ of $H$ be the span of $\phi_{-N}, \ldots, \phi_0, \ldots, \phi_N$. In what follows, we explicitly construct the multi-index $k(\cdot)$.

Without loss of generality, let us assume that every entry in $k(l)$ is between $-n$ and $n$. For any $-N \le l \le N$, where $N = ((2n+1)^d - 1)/2$, we would like to construct $k(l) = (k_1(l), \ldots, k_d(l))$ such that it is unique for each $-N \le l \le N$, and $k(l+m) = k(l) + k(m)$ for $-N \le l, m \le N$ with $\max_r |k_r(l+m)| \le n$, $r \le d$. The construction of $k : [-N, N] \to [-n, n]^d$ is as follows. Let the matrix $K(n)$ be given as
$$ K(n) = \begin{pmatrix} z_n \otimes e_n^{\otimes (d-1)} \\ e_n \otimes z_n \otimes e_n^{\otimes (d-2)} \\ \vdots \\ e_n^{\otimes (d-1)} \otimes z_n \end{pmatrix}, $$
where $z_n = (-n, -n+1, \ldots, n-1, n)$, $e_n = (1, 1, \ldots, 1)$, and $e_n^{\otimes d}$ is a Kronecker product of $e_n$ repeated $d$ times. The multi-index $k(l)$ is given by selecting the $(l + N + 1)$-th column of $K$. The linear relation is defined only if $-N \le l + m \le N$ and every element of the sum $k(l) + k(m)$ has values in $[-n, n]$. As an example, let $d = 2$ and $n = 1$. This gives us
$$ K(1) = \begin{pmatrix} -1 & -1 & -1 & 0 & 0 & 0 & 1 & 1 & 1 \\ -1 & 0 & 1 & -1 & 0 & 1 & -1 & 0 & 1 \end{pmatrix}. $$
It can be verified that $k(l)$ selected this way is both unique and linear, given that the summation result is inside the range. For example, $k(1) + k(2) = k(3)$. However, $k(1) + k(1)$ is not defined since the summation is outside the limit. In the following, if the context is clear, we will also use the multi-index $k(l)$ for the Fourier component of $u$, that is, $\hat{u}(k(l)) := \hat{u}(l)$.
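The construction of $K(n)$ is straightforward to implement with Kronecker products; below is a minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def multi_index_matrix(n, d):
    """Build the d x (2n+1)^d multi-index matrix K(n) whose columns are the
    Fourier multi-indices k(l), l = -N, ..., N with N = ((2n+1)**d - 1) // 2.
    Row r is the Kronecker pattern e_n^{(r)} (x) z_n (x) e_n^{(d-r-1)}."""
    z = np.arange(-n, n + 1)           # z_n = (-n, ..., n)
    e = np.ones(2 * n + 1, dtype=int)  # e_n = (1, ..., 1)
    rows = []
    for r in range(d):
        factors = [e] * r + [z] + [e] * (d - r - 1)
        row = factors[0]
        for f in factors[1:]:
            row = np.kron(row, f)
        rows.append(row)
    return np.vstack(rows)

K = multi_index_matrix(1, 2)   # the d = 2, n = 1 example from the text
N = (K.shape[1] - 1) // 2      # N = 4
k = lambda l: K[:, l + N]      # k(l) is the (l + N + 1)-th column
assert np.array_equal(k(1) + k(2), k(3))  # linearity inside the range
```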
Note that for $-N \le l, m \le N$, $\langle \phi_m u, \phi_l \rangle = \hat{u}(l - m)$ when $\max(|k(l) - k(m)|) \le n$, and zero elsewhere. Let us denote by $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{w}$ the finite-dimensional representations of $u$, $\upsilon$, and $w$, respectively. Let $M_N(\mathbf{u})$ be the matrix representation of the multiplication operator $u(x)$ on $H_N$. For a random field $r$ with Fourier coefficients $\mathbf{r} = (\hat{r}(-m) \cdots \hat{r}(m))^\top \in \mathbb{C}^{2m+1}$, let us write a Toeplitz matrix $T \in \mathbb{C}^{(m+1) \times (m+1)}$ with elements from $\mathbf{r}$ as follows:
$$ T(\mathbf{r}) = \begin{pmatrix} \hat{r}(0) & \cdots & \hat{r}(-m) \\ \vdots & \ddots & \vdots \\ \hat{r}(m) & \cdots & \hat{r}(0) \end{pmatrix}. $$
The previous discussion allows us to write, for $d = 1$, $M_N(\mathbf{u}) := T(\tilde{\mathbf{u}}) \in \mathbb{C}^{(2N+1) \times (2N+1)}$, where $\tilde{\mathbf{u}} = (\mathbf{0}_{1 \times N}, \mathbf{u}^\top, \mathbf{0}_{1 \times N})^\top \in \mathbb{C}^{(4N+1) \times 1}$.

It is also possible to construct $M_N(\mathbf{u})$ for $d > 1$. However, instead of working directly on $\mathbf{u}$, we need to work on the frequency indices. Let $\mathbf{J} \in \mathbb{R}^{(2n+1) \times (2n+1)}$ be a square matrix with all entries equal to one. Let also $Z^{(1)} = T(z_n) \otimes \mathbf{J}^{\otimes (d-1)}$, $Z^{(2)} = \mathbf{J} \otimes T(z_n) \otimes \mathbf{J}^{\otimes (d-2)}$, \ldots, $Z^{(d)} = \mathbf{J}^{\otimes (d-1)} \otimes T(z_n)$, respectively. Using these matrices, the $(l,m)$-th entry of $M_N(\mathbf{u})$ is given by
$$ M_N(\mathbf{u})_{l,m} = \hat{u}(\tilde{k}(l,m)), \qquad (3) $$
where $\tilde{k}(l,m) = \big(Z^{(1)}_{l,m}, \cdots, Z^{(d)}_{l,m}\big)$. In this equation, we assign $\hat{u}(\tilde{k}(l,m)) = 0$ when $\max(|\tilde{k}(l,m)|) > n$. The fraction of nonzero entries of $M_N$ tends to $(3/4)^d$ as $n$ approaches infinity.

The weak solution to (2) in the span of $H_N$ is equivalent to the following equation,
$$ \mathbf{L}(\mathbf{u})\, \mathbf{v} = \mathbf{w}, \qquad (4) $$
where $\mathbf{L}(\mathbf{u}) := \beta^{-1/2}\big(M_N(\kappa(u)^{d/2}) - M_N(\kappa(u)^{-\nu})\, \mathbf{D}\big)$ is the square root of the precision operator corresponding to $\upsilon$ in matrix form, $\mathbf{D}$ is a diagonal matrix, and $\mathbf{v}$, $\mathbf{w}$ are complex vectors of appropriate dimensions. The diagonal entries of $\mathbf{D}$ are given by $\mathbf{D}_{i,i} = -\lambda_i$. Upon computing $M_N(\kappa(u)^\gamma)$, we approximate $u$ by its projection onto $H_N$ if $u \notin H_N$.

An important numerical issue to note is that we cannot use an approximation of $M_N(\kappa(u^N)^\gamma)$ obtained by spectral decomposition, that is, $M_N(\kappa(u^N)^\gamma) \approx \mathbf{U}^\top \kappa(\mathbf{D}_u)^\gamma\, \mathbf{U}$ for a diagonal matrix $\mathbf{D}_u$ whose diagonal entries are the eigenvalues of $M_N(\mathbf{u}^N)$ and an orthonormal matrix $\mathbf{U}$. The reason is that the resulting matrix will not be of the form (3). Instead, we obtain the matrix $M_N(\kappa(u^N)^\gamma)$ by applying the Fourier transform directly to $\kappa(u^N)^\gamma$ and making use of (3). Writing (4) as $\mathbf{v} = \mathbf{L}(\mathbf{u})^{-1}\mathbf{w}$, we obtain a composition of a Gaussian field from a unit Gaussian field as given in [13].

In what follows, for simplicity and with a slight abuse of notation for $\mathbf{L}(\mathbf{u})$, if $r \in H_N$ we also use $\mathbf{L}(r) := \mathbf{L}(\sum_{l=-N}^{N} \hat{r}(l)\phi_l) = \mathbf{L}(\mathbf{r})$. Using this notation, the $J$ Gaussian field hyperpriors with zero-mean assumption can be written in the following form:
$$ \mathbf{L}(\mathbf{u}_{j-1})\, \mathbf{u}_j = \mathbf{w}_j. \qquad (5) $$
The hyperprior layers are $u_0, \ldots, u_{J-1}$, where the random field $u_0$ is stationary. Within this multilayer hyperprior setting, the unknown field $\upsilon$ is equivalent to $u_J$. With the assumption that each random field involved is real-valued, the number of elements in $\mathbf{u}_i$ is only $N+1$, since the remaining elements can be obtained by complex conjugation.
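To make the construction concrete, the following sketch assembles $T(\cdot)$, $M_N(\cdot)$, and $\mathbf{L}(\mathbf{u})$ for $d = 1$, assuming $\kappa(u) = \exp(u)$ (that is, $g(x) = \exp(-x)$, as fixed later in Section 4); the helper names are ours:

```python
import numpy as np
from scipy.linalg import toeplitz

def T(r_hat):
    """Toeplitz matrix T(r) with entries T_{l,m} = r̂(l - m), built from the
    Fourier coefficients r̂(-m), ..., r̂(m) of a field r."""
    m = (len(r_hat) - 1) // 2
    col = r_hat[m:]      # r̂(0), r̂(1), ..., r̂(m)
    row = r_hat[m::-1]   # r̂(0), r̂(-1), ..., r̂(-m)
    return toeplitz(col, row)

def M_N(u_hat):
    """M_N(u) for d = 1: coefficients outside [-N, N] are set to zero, so the
    (4N+1)-vector ũ = (0_N, u, 0_N) feeds the Toeplitz construction."""
    N = (len(u_hat) - 1) // 2
    u_tilde = np.concatenate([np.zeros(N, complex), u_hat, np.zeros(N, complex)])
    return T(u_tilde)

def L_matrix(u_hat, nu, beta, lam):
    """Sketch of L(u) = β^{-1/2}(M_N(κ(u)^{d/2}) - M_N(κ(u)^{-ν}) D) for d = 1,
    with κ(u) = exp(u) and D = diag(-λ_i). The powers of κ(u) are evaluated on
    a grid and transformed back to Fourier coefficients, in line with (3)."""
    N = (len(u_hat) - 1) // 2
    x = np.arange(2 * N + 1) / (2 * N + 1)
    freq = np.arange(-N, N + 1)
    u = (np.exp(2j * np.pi * np.outer(x, freq)) @ u_hat).real  # u(x) on the grid
    hat = lambda f: np.fft.fftshift(np.fft.fft(f)) / len(f)    # grid -> coeffs
    kappa = np.exp(u)                                          # κ(u(x)) = 1/ℓ(x)
    D = np.diag(-np.asarray(lam))
    return (M_N(hat(kappa ** 0.5)) - M_N(hat(kappa ** (-nu))) @ D) / np.sqrt(beta)
```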
3. Convergence analysis
We will show that the solution of the original SPDE system can be approximated with Galerkin methods on the $d$-dimensional torus $T^d$. In our convergence analysis, we restrict to $d \le 3$, since we rely on continuity of the sample paths (see Lemma 3.3). We start by carefully presenting the notation of the Galerkin method for analysis purposes. We denote the constant $\kappa_0$ by $\exp(u_{-1})$ for notational ease, that is, $u_{-1} := \ln(\kappa_0)$. The function $\kappa$ is taken to be a smooth function which is bounded from below and above by exponential functions. That is, $c_0 \exp(-a_0|x|) \le \kappa(x), \kappa'(x) \le c_1 \exp(a_1 x)$ for some $c_0, c_1, a_0, a_1 > 0$ and all $x \in \mathbb{R}$.

Almost surely bounded functions $u_0, \ldots, u_J$ are a weak solution of the SPDE system
$$ -\Delta u_i + \kappa^2(u_{i-1})\, u_i = \beta_i^{1/2} \kappa^{\nu}(u_{i-1})\, w_i, \qquad i = 0, \ldots, J, \qquad (6) $$
if they satisfy
$$ -\langle u_i, \Delta \phi \rangle + \langle \kappa^2(u_{i-1}) u_i, \phi \rangle = \beta_i^{1/2} \sum_{p=-\infty}^{\infty} \hat{w}_i(p)\, \langle \kappa^{\nu}(u_{i-1}) \phi_p, \phi \rangle, \qquad i = 0, \ldots, J, $$
for all $\phi \in C^\infty(T^d)$. Assuming that the functions $u_i$ are almost surely bounded guarantees that the inner product of $u_i$ and $\Delta\phi$ is well defined, as compared to the inner product of $\nabla u_i$ and $\nabla \phi$, which may not be well defined. Here we expressed the independent white noises $w_i$ on $T^d$ with the help of an orthonormal basis $\{\phi_p\}$ of $L^2(T^d)$ as random series $w_i = \sum_{p=-\infty}^{\infty} \hat{w}_i(p) \phi_p$, where the random coefficients $\hat{w}_i(p) \sim N(0, 1)$ are independent.

In the first approximation step for the SPDE system (6), we approximate the white noise $w_i$ by its projection $w_i^N(x)$ onto the subspace $H_N$. That is, $w_i^N(x) = \sum_{p=-N}^{N} \hat{w}_i(p) \phi_p(x)$. Then the Galerkin approximations $u_i^N$ of $u_i$ satisfy the system
$$ \langle \nabla u_i^N, \nabla \phi \rangle + \langle \kappa^2(u_{i-1}^N)\, u_i^N, \phi \rangle = \beta_i^{1/2} \langle \kappa^{\nu}(u_{i-1}^N)\, w_i^N, \phi \rangle, \qquad i = 0, \ldots, J, \qquad (7) $$
for all $\phi \in H_N$, where $u_{-1}^N(x) := \ln(\kappa_0)$. Here we are allowed to use inner products of $\nabla u_i^N$ and $\nabla \phi$, since the approximated white noise $w_i^N$ belongs to $H_N$.

We denote by $L(u_{i-1})^{-1}$ the solution operator which maps $f$ from the negatively indexed Sobolev space $H^{-1}(T^d)$ to the weak solution of $-\Delta u + \kappa^2(u_{i-1}) u = \beta_i^{1/2} \kappa^{\nu}(u_{i-1}) f$. Similarly, we denote by $L_N(u_{i-1}^N)^{-1}$ the solution operator which maps $f \in H^{-1}(T^d)$ to the Galerkin approximation $u^N \in H_N$ of the equation $-\Delta u + \kappa^2(u_{i-1}^N) u = \beta_i^{1/2} \kappa^{\nu}(u_{i-1}^N) f$. The matrix form of $L_N(u_{i-1}^N)^{-1}$ is given by $\mathbf{L}(\mathbf{u}_{i-1})^{-1}$ from Equation (5). The solution operators $L(u_{i-1})^{-1}$ and $L_N(u_{i-1}^N)^{-1}$ satisfy the norm estimates below; for notational simplicity, we set $\beta_i = 1$ from now on.
Lemma 3.1. Let $u_{i-1}$ and $u_{i-1}^N$ be bounded functions and let $\kappa$ be a positive continuous function. The mappings $L(u_{i-1})^{-1} : L^2(T^d) \to H^2(T^d)$ and $L_N(u_{i-1}^N)^{-1} : L^2(T^d) \to H^2(T^d)$ satisfy the norm estimates
$$ \| L(u_{i-1})^{-1} \|_{L^2, H^2} \le \frac{C \|\kappa(u_{i-1})\|_\infty^{\nu} \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))^2} $$
and
$$ \| L_N(u_{i-1}^N)^{-1} \|_{L^2, H^2} \le \frac{C \|\kappa(u_{i-1}^N)\|_\infty^{\nu} \max(1, \|\kappa^2(u_{i-1}^N)\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}^N(x)))^2}, $$
respectively.

Proof. By the Lax–Milgram theorem [38], $\|(-\Delta + \kappa^2(u_{i-1}) I)^{-1}\|_{H^{-1}, H^1} \le C / \min(1, \inf \kappa^2(u_{i-1}))$, and the multiplication operator has norm $\|\kappa^\nu(u_{i-1})\|_{L^2, L^2} \le \|\kappa(u_{i-1})\|_\infty^\nu$. Hence, the solution operator has norm
$$ \| L(u_{i-1})^{-1} \|_{L^2, H^1} \le \frac{C \|\kappa(u_{i-1}(x))\|_\infty^{\nu}}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))}. $$
To upgrade this to $H^2$, we rewrite the PDE in the form
$$ (-\Delta + a)\, u_i^f = (a - \kappa^2(u_{i-1}))\, u_i^f + \kappa^\nu(u_{i-1})\, f, $$
where the right-hand side now belongs to $L^2(T^d)$. By inverting the operator $-\Delta + aI$, we obtain an equation for the solution $u_i^f$, which leads to the norm estimate
$$ \| u_i^f \|_{H^2} \le \frac{1}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))} \left( \frac{C \|\kappa^2(u_{i-1})\|_\infty \|\kappa(u_{i-1})\|_\infty^\nu}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))} + \|\kappa(u_{i-1})\|_\infty^\nu \right) \| f \|_{L^2} \le \frac{C \|\kappa(u_{i-1})\|_\infty^\nu \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))^2} \| f \|_{L^2} $$
after choosing $a = \inf_x \kappa^2(u_{i-1}(x))/2$. A similar procedure leads to the desired estimate for $L_N(u_{i-1}^N)^{-1}$.

We will later need the following technical lemma to establish convergence.
Lemma 3.2. Let $u_{i-1}$ and $u_{i-1}^N$ be bounded functions and let $\kappa$ be a continuously differentiable positive function. The mappings $L_N(u_{i-1}^N)^{-1}$ and $L(u_{i-1})^{-1}$ satisfy
$$ \| L_N(u_{i-1}^N)^{-1} - L(u_{i-1})^{-1} \|_{L^2, L^1} \le N^{-1/d}\, G_1(u_{i-1}) + G_2(u_{i-1}, u_{i-1}^N)\, \| u_{i-1}^N - u_{i-1} \|_{L^1}^{1/2}, $$
where
$$ G_1(u_{i-1}) = \frac{C \max(1, \|\kappa^2(u_{i-1})\|_\infty) \max(\|\kappa^\nu(u_{i-1})\|_\infty, \|\kappa^{-\nu}(u_{i-1})\|_\infty)}{\min(1, \inf \kappa^2(u_{i-1}(x)))^2}, $$
$$ G_2(u_{i-1}, u_{i-1}^N) = \frac{C \max(1, \|\kappa^\nu(u_{i-1})\|_\infty) \max\big(\|\kappa(u_{i-1}^N)\|_\infty, \|\kappa(u_{i-1})\|_\infty\big)^{\nu/2}}{\min(1, \inf_x \kappa^2(u_{i-1}^N(x))) \min(1, \inf_x \kappa^2(u_{i-1}(x)))} \times \max_{t \in B} (\kappa'(t))^{1/2}, $$
and the set $B$ is the interval $[\min(\inf_x u^N(x), \inf_x u(x)), \max(\|u^N\|_\infty, \|u\|_\infty)]$.

Proof. We partition $T := L_N(u_{i-1}^N)^{-1} - L(u_{i-1})^{-1}$ into two parts
$$ T = (L_N(u_{i-1}^N)^{-1} - L_N(u_{i-1})^{-1}) + (L_N(u_{i-1})^{-1} - L(u_{i-1})^{-1}) = T_1 + T_2. $$
By Cea's lemma [38], the term $T_2$ has an upper bound
$$ \| T_2 \|_{L^2, L^2} \le \frac{C \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))} \| (I - \widetilde{P}_N) L(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \|_{L^2, H^1}\, \|\kappa^\nu(u_{i-1})\|_\infty, $$
where $\widetilde{P}_N$ is the orthogonal projection onto $H_N$ in $H^1$ and $I$ is the identity operator. Let $f \in L^2$ and denote $g := L(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \widetilde{P}_N f$. Then, writing the case $d = 2$ for concreteness,
$$ \|(I - \widetilde{P}_N) g\|_{H^1}^2 = \sum_{|k_1|, |k_2| > n} (1 + k_1^2 + k_2^2)\, |\hat{g}_{k_1,k_2}|^2 = \sum_{|k_1|, |k_2| > n} (1 + k_1^2 + k_2^2)^{-1}\, |((-\Delta + 1) g)^{\wedge}|^2_{k_1,k_2} \le n^{-2} \|(-\Delta + 1) g\|_{L^2}^2 \le C N^{-2/d} \| L(u_{i-1})^{-1} \|_{L^2,H^2}^2 \|\kappa(u_{i-1}(x))^{-\nu}\|_\infty^2 \| f \|_{L^2}^2, $$
where $-\Delta + 1 : H^2 \to L^2$ is continuous, $g^{\wedge}$ denotes the Fourier transform, and $N = (2n+1)^d$. Hence,
$$ \| T_2 \|_{L^2, L^2} \le \frac{C N^{-1/d} \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf \kappa^2(u_{i-1}(x)))^2} \max(\|\kappa^\nu(u_{i-1})\|_\infty, \|\kappa^{-\nu}(u_{i-1})\|_\infty). $$
In the term $T_1$, we need to tackle the difference of $\kappa$-terms. We aim to use $L^p$-estimates in order to later allow induction with respect to different layers. We partition $T_1$ into simpler terms
$$ \| T_1 \|_{L^2, L^1} \le \| L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) \big(\kappa^\nu(u_{i-1}^N) - \kappa^\nu(u_{i-1})\big) \|_{L^2, L^1} + \| \big( L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) - L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \big) \kappa^\nu(u_{i-1}) \|_{L^2, L^1} = \| T_3 \| + \| T_4 \|. $$
In the term $T_4$, we apply the resolvent identity
$$ L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) = L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) + L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \big(\kappa^2(u_{i-1}) - \kappa^2(u_{i-1}^N)\big) L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N), $$
which leads to
$$ \| T_4 \|_{L^2, L^1} \le \| L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \|_{L^{3/2}, L^1}\, \| \kappa^2(u_{i-1}) - \kappa^2(u_{i-1}^N) \|_{L^6, L^{3/2}} \times \| L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) \|_{H^{-1}, H^1}\, \| \kappa^\nu(u_{i-1}) \|_\infty. $$
The space $L^{3/2}$ embeds continuously into $H^{-1}$, and $\|A\|_{L^{3/2}, L^1} \le \|A\|_{H^{-1}, H^1}$ for any operator $A$. Hence, by the Lax–Milgram theorem,
$$ \| L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \|_{L^{3/2}, L^1} \le \frac{C \|\kappa^{-\nu}(u_{i-1})\|_\infty}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))}. $$
Next, we regard $\kappa^2(u_{i-1}) - \kappa^2(u_{i-1}^N)$ as a multiplication operator from $L^6$ to $L^{3/2}$. That is, a function $g$ defines a multiplication operator, which takes a function $f$ to the product of the functions $g$ and $f$. By Hölder's inequality, any multiplication operator $g : L^6 \to L^{3/2}$ has norm $\|g\|_{L^2}$. A similar treatment for the term $T_3$ gives
$$ \| T_3 \|_{L^2, L^1} \le \| L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) \|_{L^{3/2}, L^1}\, \| \kappa^\nu(u_{i-1}^N) - \kappa^\nu(u_{i-1}) \|_{L^6, L^{3/2}}. $$
The difference of $\kappa$ terms in the estimates for $T_3$ and $T_4$ reduces to the difference of the functions $u_{i-1}$ through a series of elementary estimates:
$$ \| \kappa^{a}(u_{i-1}^N) - \kappa^{a}(u_{i-1}) \|_{L^2}^2 = \int_{T^d} | \kappa^{a}(u_{i-1}^N(x)) - \kappa^{a}(u_{i-1}(x)) |^2 \, \mathrm{d}x \le 2 \max\big(\|\kappa^{a}(u_{i-1}^N)\|_\infty, \|\kappa^{a}(u_{i-1})\|_\infty\big) \int \Bigg| \int_{u_{i-1}(x)}^{u_{i-1}^N(x)} \kappa'(t) \, \mathrm{d}t \Bigg| \, \mathrm{d}x \le C \max\big(\|\kappa^{a}(u_{i-1}^N)\|_\infty, \|\kappa^{a}(u_{i-1})\|_\infty\big) \max_{t \in B} (\kappa'(t))\, \| u_{i-1}^N - u_{i-1} \|_{L^1}, $$
where $a = \nu, 2$, and $B$ is the interval $[\min(\inf_x u^N(x), \inf_x u(x)), \max(\|u^N\|_\infty, \|u\|_\infty)]$.
Remark 1. When $\kappa$ is a continuously differentiable function with bounds
$$ c_0 \exp(-a_0 |t|) \le \kappa(t), \quad |\kappa'(t)| \le c_1 \exp(a_1 |t|), $$
where $c_0, c_1, a_0, a_1 > 0$, the functions $G_1$ and $G_2$ can be taken to be
$$ G_1(u_{i-1}) = C_1 \exp(C_2 \|u_{i-1}\|_\infty), \qquad G_2(u_{i-1}, u_{i-1}^N) = C_1 \exp(C_2 (\|u_{i-1}\|_\infty + \|u_{i-1}^N\|_\infty)). $$
The next lemma shows that the SPDE system (6) forces the layers to be Hölder-continuous.
Lemma 3.3.
Let $\kappa$ be a positive continuous function. Let $0 < \alpha < 1$ for $d = 1, 2$, and $0 < \alpha < 1/2$ for $d = 3$. The bounded weak solution $u_i$, $i = 0, \ldots, J$, of the SPDE system (6) is $\alpha$-Hölder continuous with probability 1, and has the form
$$ u_i(x) = \sum_{p=-\infty}^{\infty} \hat{w}_i(p)\, L(u_{i-1})^{-1} \phi_p(x) =: L(u_{i-1})^{-1} w_i(x), \qquad i = 0, \ldots, J. \qquad (8) $$

Proof. We will show that if $u_{i-1}$ is continuous, then $u_i$ is Hölder-continuous. This proves continuity inductively, since the zeroth layer $u_0$ has constant $u_{-1}$. It is enough to verify continuity after conditioning on $u_{i-1}$, since $P(u_i \in C^{0,\alpha}(T^d)) = E[P(u_i \in C^{0,\alpha}(T^d) \mid u_{i-1})] = 1$ with probability 1 if and only if $P(u_i \in C^{0,\alpha}(T^d) \mid u_{i-1}) = 1$. After conditioning, $u_i$ is a Gaussian field, to which we apply the Kolmogorov continuity criterion. To this end, we calculate
$$ E[|u_i(x) - u_i(x')|^{2b} \mid u_{i-1}] = C_b \sup_{\|f\|_{L^2} \le 1} |L(u_{i-1})^{-1} f(x) - L(u_{i-1})^{-1} f(x')|^{2b}, $$
where we used the Itô isometry and wrote the $L^2$ norm of $g$ as $\sup_{\|f\|_{L^2} \le 1} \langle f, g \rangle$. When $u_{i-1}$ is continuous, the function $L(u_{i-1})^{-1} f(x)$ belongs to $H^2$ by Lemma 3.1. By Sobolev's embedding theorem [36], $H^2$ embeds into $C^{0,\alpha}$ for $d \le 3$, which yields
$$ E[|u_i(x) - u_i(x')|^{2b} \mid u_{i-1}] \le C |x - x'|^{2b\alpha}, $$
which implies Hölder continuity with any index smaller than $(2b\alpha - d)/(2b) = \alpha - d/(2b)$, where $b$ can be arbitrarily large. To complete the proof of existence of the solution, we insert the solution candidate (8) into the SPDE system (6) and do direct calculations. Uniqueness of the solution follows from the Lax–Milgram theorem.

We will show that $u_i^N$ converges in probability to $u_i$. The proof uses uniform tightness of the distributions of $u_i^N - u_i$, which is a necessary condition for the convergence. We recall sufficient conditions for uniform tightness (see p. 61 in [39]).
Lemma 3.4. The random fields $U_N$ are uniformly tight on $C(T^d)$ if and only if there exists a function $K : C(T^d) \to [0, \infty)$ with the following properties:
(1) the set $\{g \in C(T^d) : K(g) \le C\}$ is compact for any $C > 0$;
(2) $K(U_N) < \infty$ almost surely for every $N$; and
(3) $\sup_N E[K(U_N)] < \infty$.

We will need several iterations of the logarithm, so we define the iterated composition by setting $F_1(x) = \ln(1 + x)$, $F_0(x) = x$, and $F_{n+1}(x) = F_1 \circ F_n(x)$.

Remark 2. The function $F_i$ is increasing and, moreover, subadditive on non-negative numbers. That is, $F_n(x + y) \le F_n(x) + F_n(y)$ for all $x, y \ge 0$, which follows by induction from the subadditivity $\ln(1 + x + y) \le \ln((1 + x)(1 + y)) = \ln(1 + x) + \ln(1 + y)$ of each $F_1$. A similar procedure shows that $F_n(xy) \le F_n(x) + F_n(y)$.
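As a quick numerical sanity check of Remark 2 (a small sketch of ours, not part of the proof):

```python
import numpy as np

def F(n, x):
    """Iterated logarithm: F_0(x) = x, F_{n+1}(x) = ln(1 + F_n(x))."""
    for _ in range(n):
        x = np.log1p(x)
    return x

# Spot-check the subadditivity F_n(x + y) <= F_n(x) + F_n(y) at random points.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, size=(2, 1000))
for n in range(1, 4):
    assert np.all(F(n, x + y) <= F(n, x) + F(n, y) + 1e-12)
```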
Lemma 3.5. Let $d = 1, 2$, or $3$. Let $\kappa$ be a continuous function with bounds $c_0 \exp(-a_0 |t|) \le \kappa(t) \le c_1 \exp(a_1 |t|)$, where $c_0, c_1, a_0, a_1 > 0$. Let $u_i^N$, $i = 0, \ldots, J$, solve the system (7) and let $u_i$, $i = 0, \ldots, J$, solve the system (6). Then the random fields $u_i^N$ are uniformly tight on $C(T^d)$, the random fields $u_i^N - u_i$ are uniformly tight on $C(T^d)$, and the vector-valued random fields $(u_i^N, u_i)$ are uniformly tight on $C(T^d; \mathbb{R}^2)$.

Proof. We equip the Hölder space $C^{0,\alpha}(T^d; \mathbb{R}^p)$ with its usual norm
$$ \| g \|_\alpha = \sup_{x \ne y} \frac{|g(x) - g(y)|}{|x - y|^\alpha} + \sup_x |g(x)|. $$
Here $\alpha$ is chosen as in Lemma 3.3. We use the Kolmogorov–Chentsov tightness criterion (see [40]) to show the uniform tightness of the zeroth-order layers, which are Gaussian. The moment bound $E[|u_0^N(x) - u_0^N(x')|^a] \le C |x - x'|^{d + b}$ follows from choosing large enough $a$ in
$$ E[|u_0^N(x) - u_0^N(x')|^a] \le C \sup_{\|f\|_{L^2} \le 1} \frac{|L_N(\ln(\kappa_0))^{-1} f(x) - L_N(\ln(\kappa_0))^{-1} f(x')|^a}{|x - x'|^{a\alpha}}\, |x - x'|^{a\alpha} \le C \| L_N(\ln(\kappa_0))^{-1} \|_{L^2, C^{0,\alpha}}^a\, |x - x'|^{a\alpha}, $$
where we applied the Itô isometry and the definition of the $L^2$ norm as a supremum. By Lemma 3.1 and Sobolev's embedding theorem, the operator norm of $L_N(\ln(\kappa_0))^{-1}$ is bounded for any $0 < \alpha < 1/2$. Since $\kappa_0$ is a constant, the bound is uniform. The cases of $u_0^N - u_0$ and $(u_0^N, u_0)$ follow similarly with the help of the triangle inequality.

For the other layers, we use Lemma 3.4, where we choose $K(g) = F_i(\|g\|_\alpha)$. For simplicity, we demonstrate Condition 1 only for $i = 3$, since the generalization is clear. The set
$$ \{ g : K(g) \le C \} = \{ g : \| g \|_\alpha \le \exp(\exp(\exp(C) - 1) - 1) - 1 \} $$
is clearly a closed set, which contains bounded equicontinuous functions. The set $A \subset C(T^d)$ is then compact by the Arzelà–Ascoli theorem (see [41]). Moreover, the fields $u_i^N - u_i$ are almost surely $\alpha$-Hölder continuous, by Lemma 3.3 and since the approximations belong to $H_N$. Hence, Condition 2 holds. To show Condition 3, we write $u_i^N$ as $L_N(u_{i-1}^N)^{-1} w_i^N$ and $u_i$ as $L(u_{i-1})^{-1} w_i$. By the triangle inequality and the subadditivity of $F_i$ (see Remark 2), we can check the boundedness for $u_i^N$ and $u_i$ separately. Since the procedure is the same for both terms, we only show here the case of $u_i^N$. By Jensen's inequality,
$$ E[\ln(1 + \|u_i^N\|_\alpha) \mid u_{i-1}^N] \le \ln \big( E[1 + \| L_N(u_{i-1}^N)^{-1} w_i^N \|_\alpha \mid u_{i-1}^N] \big). \qquad (9) $$
Conditioning on $u_{i-1}^N$ lets us compute expectations of Gaussian variables. Instead of attacking directly the expectation in Equation (9), we seek a zero-mean Gaussian random field $U$ with larger variance than the conditioned $u_i^N$. Then also certain other expectations of $u_i^N$ are bounded by expectations of $U$. Under conditioning, the variances of the random variables $\int \phi(x)\, L_N(u_{i-1}^N)^{-1} w_i^N(x) \, \mathrm{d}x$ are
$$ E[\langle w_i^N, (L_N(u_{i-1}^N)^{-1})^* \phi \rangle^2 \mid u_{i-1}^N] = \| P_N (L_N(u_{i-1}^N)^{-1})^* \phi \|_{L^2}^2 \le \| L_N(u_{i-1}^N)^{-1} \|_{L^2, H^2}^2\, \| \phi \|_{H^{-2}}^2, $$
where the last inequality follows from properties of the adjoint operator and Lemma 3.1. Set $U$ to be a zero-mean Gaussian random field on $T^d$ whose covariance is defined by the equations $E[\langle U, \phi \rangle^2] = \|\phi\|_{H^{-2}}^2$ for all smooth $\phi$. Then $U$ has sample paths in the Hölder space $C^{0,\alpha}(T^d)$ by Sobolev's embedding theorem and Kolmogorov's continuity theorem (see [42]). By Fernique's theorem (e.g. [43]), the expectation $E[\|U\|_\alpha]$ is finite. Since norms are absolutely continuous functions, the conditional expectation of $\|u_i^N\|_\alpha$ is also bounded by $\| L_N(u_{i-1}^N)^{-1} \|_{L^2, H^2}\, E[\|U\|_\alpha]$ (see Corollary 3.3.7 in [43]). Together with Lemma 3.1 this gives
$$ E[\ln(1 + \|u_i^N\|_\alpha) \mid u_{i-1}^N] \le \ln \left[ \frac{c\, \|\kappa(u_{i-1}^N(x))\|_\infty^\nu \max(1, \|\kappa^2(u_{i-1}^N(x))\|_\infty)}{\min(1, \inf \kappa^2(u_{i-1}^N(x)))^2}\, E[\|U\|_\alpha] \right] \le c_{\alpha,\nu} (1 + \| u_{i-1}^N \|_\infty), \qquad (10) $$
where we applied the bounds of $\kappa$ and the elementary inequalities $\max(a, b\exp(c)) \le \max(a, b) \exp(|c|)$, $\min(1, \inf \exp(g(x))) \ge \exp(-\|g\|_\infty)$, $(1 + ab) \le (1 + a)(1 + b)$ and $1 + a \le 2\max(1, a)$. Here we can choose the constants larger than 1. When $i = 1$, the expectations (10) are uniformly bounded, since the expectations of $\|u_0^N\|_\infty$ are bounded. Then Lemma 3.4 shows that the $u_1^N$ are uniformly tight. For the subsequent layers, we need to operate multiple times with the logarithm and Jensen's inequality through the inductive steps
$$ E[F_i(\|u_i^N\|_\alpha)] \le E[F_i(E[\|u_i^N\|_\alpha \mid u_{i-1}^N])] \le E[F_i(c(1 + \|u_{i-1}^N\|_\infty))] \le C + E[F_{i-1}(\|u_{i-1}^N\|_\alpha)], $$
with the help of the additivity properties of $F_i$ from Remark 2.
Theorem 3.6. Let $d = 1, 2$, or $3$. Let $\kappa$ be a continuously differentiable function with bounds $c_0 \exp(-a_0|t|) \le \kappa(t), |\kappa'(t)| \le c_1 \exp(a_1 |t|)$, where $c_0, c_1, a_0, a_1 > 0$. The solutions $(u_0^N, \ldots, u_J^N)$ of the Galerkin system (7) converge in probability to the weak solution $(u_0, \ldots, u_J)$ of the SPDE system (6) on $L^2(T^d; \mathbb{R}^{J+1})$ as $N \to \infty$.

Proof. It is enough to show componentwise convergence. Uniform tightness on $C(T^d)$ implies uniform tightness on $L^2(T^d)$, which in turn implies relative compactness in the weak topology of distributions. Hence, by Lemma 3.5, each subsequence of the distributions of $u_i^N - u_i$ has a weakly convergent subsequence, say $u_i^{N_k} - u_i$. Recall that convergence in probability is equivalent to the convergence of
$$ \tau_N := E[\min(1, \| u_i^N - u_i \|_{L^2})], $$
where $\min(1, \|\cdot\|_{L^2})$ is now a bounded continuous function. By weak convergence of distributions, the subsequence $\tau_{N_k}$ has some limit. It remains to verify that the limits of $u_i^{N_k} - u_i$ in distribution are zero. The characteristic functions of $u_i^N - u_i$ converge to 1 if
$$ \tau_N(\phi) := E[\min(1, |\langle u_i^N - u_i, \phi \rangle|)] $$
converges to zero for every $\phi \in L^2$. This formulation makes the calculation of expectations manageable. The essential difference to other approaches arises from the monotonicity of conditional expectations. Namely, since $\min(1, G) \le 1$ and $\min(1, G) \le G$,
$$ E[\min(1, G) \mid \Sigma] = \min(1, E[\min(1, G) \mid \Sigma]) \le \min(1, E[G \mid \Sigma]). \qquad (11) $$
Especially,
$$ \tau_N(\phi) \le E[\min(1, E[|\langle u_i^N - u_i, \phi \rangle| \mid w_0, \ldots, w_{i-1}])], \qquad (12) $$
which lets us handle $u_i$ and its approximation $u_i^N$ easily. Moreover, the cutoff function $\min(1, \cdot)$ is nondecreasing and subadditive. Under conditioning on $w_0, \ldots, w_{i-1}$, the random fields $u_i^N$ and $u_i$ in Equation (12) become Gaussian, which enables us to compute
$$ \tau_N(\phi) = E[\min(1, c \| (P_N (L_N(u_{i-1}^N)^{-1})^* - (L(u_{i-1})^{-1})^*) \phi \|_{L^2})] \le E\big[ \min\big(1, c_\phi \| L_N(u_{i-1}^N)^{-1} - L(u_{i-1})^{-1} \|_{L^2, L^1}\big) + \min\big(1, c \| (P_N - I)(L(u_{i-1})^{-1})^* \phi \|_{L^2}\big) \big] =: \tau_N^1(\phi) + \tau_N^2(\phi) $$
via the Itô isometry and the properties of adjoints. Since $(L(u_{i-1})^{-1})^* \phi \in L^2$, the term $\tau_N^2(\phi)$ converges to zero. We apply Lemma 3.2 to the difference of the operators $L_N(u_{i-1}^N)^{-1}$ and $L(u_{i-1})^{-1}$ in $\tau_N^1(\phi)$, which leads to a well-behaving estimate
$$ \tau_N^1(\phi) \le E\left[ \min\left(1, C_\phi N^{-1/d} \exp(C \|u_{i-1}\|_\infty) \right) \right] + E[\min(1, C_\phi \exp(C(\|u_{i-1}\|_\infty + \|u_{i-1}^N\|_\infty))\, \| u_{i-1}^N - u_{i-1} \|_{L^1}^{1/2})] =: \tau_N^3(\phi) + \tau_N^4(\phi). $$
The first term $\tau_N^3(\phi)$ converges to zero by Lebesgue's dominated convergence theorem. For the term $\tau_N^4(\phi)$, we use the uniform tightness of the vector-valued random fields $(u_{i-1}^N, u_{i-1})$ shown in Lemma 3.5. Let $K_{i-1} = K_{i-1}(\epsilon) \subset C(T^d; \mathbb{R}^2)$ be a compact set for which $P((u_{i-1}^N, u_{i-1}) \in K^C) < \epsilon$. Then
$$ \tau_N^4(\phi) = E[\min(1, C_\phi \exp(C(\|u_{i-1}\|_\infty + \|u_{i-1}^N\|_\infty))\, \| u_{i-1}^N - u_{i-1} \|_{L^1}^{1/2}) (1_K + 1_{K^C})] \le E[\min(1, C_\epsilon \| u_{i-1}^N - u_{i-1} \|_{L^1})]^{1/2} + E[1_{K^C}] $$
by Lemma 3.2, Remark 1 and Jensen's inequality. Thus $\tau_N^4(\phi)$ converges to zero if $u_{i-1}^N - u_{i-1}$ converges to zero in probability on $L^1$. When $i = 0$, the above procedure shows that $u_0^N$ converges to $u_0$ in probability, because the sublayers are then constants. By induction, $u_i^N$ converges in probability to $u_i$. Prohorov's theorem [40] hands us weak convergence on the space of continuous functions.
Corollary. The distributions of $u_i^N$ converge weakly to the distribution of $u_i$ on $C(T^d)$, and the joint distribution of $(u_0^N, \ldots, u_J^N)$ converges weakly to the joint distribution of $(u_0, \ldots, u_J)$ on $C(T^d, \mathbb{R}^{J+1})$.

Proof. The random fields $u_i^N$ are tight on $C(T^d)$ by Lemma 3.5. By Prohorov's theorem, the closure of their distributions forms a sequentially compact set, implying the existence of a weak limit. By Theorem 3.6, each weakly converging subsequence of the distributions has the same limit, that is, the distribution of $u_i$. Indeed, since convergence in probability on $L^2(T^d)$ implies weak convergence on $L^2(T^d)$, the characteristic functions $E[\exp(i \langle u_i^N, \phi \rangle)]$ converge to $E[\exp(i \langle u_i, \phi \rangle)]$ for all $\phi \in L^2(T^d)$. The set of bounded functions $u \mapsto \exp(i \langle u, \phi \rangle)$, where the $\phi$ are smooth on $T^d$, separates the functions in $C(T^d)$. Hence, the limits of these characteristic functions are enough to identify the weak limit on $C(T^d)$. The joint distribution is handled similarly.

We recall a posterior convergence result from [44] with the notation used in [14].
Theorem 3.7. Let the posterior distribution $\mu^y$ of an unknown $u$ given an observation $y$ have the Radon–Nikodym density
$$ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu}(u) \propto \exp(-\Phi(u, y)) $$
with respect to the prior distribution $\mu$ of $u$, where $\Phi(\cdot, y) \ge -C_y$. Let $u^N$ be an approximation of $u$ and let the posterior distribution of $u^N$ given an observation $y^N$ also have the Radon–Nikodym density
$$ \frac{\mathrm{d}\mu_N^{y_N}}{\mathrm{d}\mu_N}(u) \propto \exp(-\Phi(u, y^N)) $$
with respect to the prior distribution $\mu_N$ of $u^N$. If the prior distributions $\mu_N$ converge weakly to $\mu$, then the posterior distributions $\mu_N^{y_N}$ converge weakly to $\mu^y$.

Especially, the joint prior distributions of $(u_0^N, \ldots, u_J^N)$ converge weakly on $C(T^d, \mathbb{R}^{J+1})$ to the joint distribution of $(u_0, \ldots, u_J)$. Hence, the corresponding posteriors also converge weakly.
4. Bayesian inference algorithm
In this section, we develop a Bayesian inference procedure to sample the Fourier coefficients from a posterior distribution where the prior is given by multi-layered Gaussian fields. Let us work directly on the Fourier coefficients $\mathbf{u}_J$ of $u_J$, and denote by $\mu(\mathrm{d}\mathbf{u}_J) = P(\mathrm{d}\mathbf{u}_J)$ and $\mu^y(\mathrm{d}\mathbf{u}_J) = P(\mathrm{d}\mathbf{u}_J \mid \mathbf{y})$ the prior and the posterior distribution of $\mathbf{u}_J$, the unknown target field, respectively, when the measurement is given by
$$ \mathbf{y} = \mathbf{H} \mathbf{u}_J + \mathbf{e}. \qquad (13) $$
In this equation, $\mathbf{y}$ is a vector which contains all of the measurements. The elements of the matrix $\mathbf{H}$ correspond to the linear mappings $\{h_k\}$ for the respective Fourier components. With $m$ the number of measurements taken, the measurement noise $\mathbf{e}$ is an $m$-dimensional Gaussian random vector with zero mean and covariance $\mathbf{E}$. The posterior distribution $\mu^y$ is absolutely continuous with respect to the prior $\mu$, and the density of the posterior with respect to the prior is given by
$$ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu}(\mathbf{u}_J) = \frac{1}{Z} \exp(-\Phi(\mathbf{u}_J, \mathbf{y})), \qquad (14) $$
where $\Phi(\mathbf{u}_J, \mathbf{y}) := \frac{1}{2} \| \mathbf{E}^{-1/2} (\mathbf{y} - \mathbf{H}\mathbf{u}_J) \|^2$ is the potential function and the normalization constant is $Z := \int \exp(-\Phi(\mathbf{u}_J, \mathbf{y}))\, \mu(\mathrm{d}\mathbf{u}_J)$. The posterior probability $\mu^y(\mathrm{d}\mathbf{u}_J)$ can be written in the following form:
$$ \mu^y(\mathrm{d}\mathbf{u}_J) \propto \exp(-\Phi(\mathbf{u}_J, \mathbf{y}))\, \mu(\mathrm{d}\mathbf{u}_J). \qquad (15) $$
In the next section, we describe the MCMC algorithm to sample from the posterior distribution.

Consider the case of a hierarchical prior distribution $\mu(\mathrm{d}\mathbf{u}_J)$ with $J$ hyperprior layers, which we construct as
$$ \mathbf{L}_0 \mathbf{u}_0 = \mathbf{w}_0, \qquad (16a) $$
$$ \mathbf{L}(\mathbf{u}_{j-1})\, \mathbf{u}_j = \mathbf{w}_j, \qquad j = 1, \cdots, J, \qquad (16b) $$
where $\mathbf{u}_j$ contains the Fourier coefficients of the field $u_j$, for $j = 0, \ldots, J$. We fix $g(x) = \exp(-x)$. Since $u_{-1} = \ln(\kappa_0)$ is a constant, we can write $\mathbf{L}_0 = \mathbf{L}(\mathbf{u}_{-1})$, where the $k$-th element of $\mathbf{u}_{-1}$ is equal to $\ln(\kappa_0)\, \delta_{k,0}$, where $\delta_{k,0}$ is the Kronecker delta. By (16), we can define a linear transformation from $\mathbf{w}_j$ to $\mathbf{u}_j$ for $j > 0$:
$$ \mathbf{u}_j = \widetilde{U}(\mathbf{w}_j, \mathbf{u}_{j-1}) := \mathbf{L}(\mathbf{u}_{j-1})^{-1} \mathbf{w}_j. \qquad (17) $$
Using (17), we can define a transformation from $\mathbf{w}$ to $\mathbf{u}$ as follows:
$$ \mathbf{u} = U(\mathbf{w}) = \big(\widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1}),\ \widetilde{U}(\mathbf{w}_1, \cdot) \circ \widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1}),\ \ldots,\ \widetilde{U}(\mathbf{w}_J, \cdot) \circ \cdots \circ \widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1})\big). \qquad (18) $$
The dependence of $\mathbf{u}_j$ on $\mathbf{u}_{j-1}$, $j = 1, \ldots, J$, leads us to
$$ \mu(\mathrm{d}\mathbf{u}_J) = P(\mathrm{d}\mathbf{u}_0) \prod_{j=1}^{J} P(\mathrm{d}\mathbf{u}_j \mid \mathbf{u}_{j-1}). \qquad (19) $$
When evaluating the posterior function (15), it is necessary to compute the log determinant of $\mathbf{L}(\mathbf{u}_{j-1})$ in $P(\mathrm{d}\mathbf{u}_j \mid \mathbf{u}_{j-1})$ for each $1 < j \le J$. These calculations are expensive in general [45]. There is also a singularity issue if we sample directly from $P(\mathrm{d}\mathbf{u} \mid \mathbf{y})$ as $N$ approaches infinity [13]. We can avoid these issues by using the reparametrization (17), where instead of sampling the Fourier coefficients $\mathbf{u}$, we sample the Fourier coefficients of the noises $\mathbf{w}$; this is called the non-centered algorithm [22, 46]. That is, we can write
$$ \frac{\mathrm{d}\tilde{\mu}^y}{\mathrm{d}\tilde{\mu}}(\mathbf{w}) = \frac{1}{\tilde{Z}} \exp\big(-\tilde{\Phi}(\mathbf{w}, \mathbf{y})\big) := \frac{1}{\tilde{Z}} \exp(-\Phi(U(\mathbf{w}), \mathbf{y})). \qquad (20) $$
In this equation, $\tilde{\mu} := P(\mathrm{d}\mathbf{w})$ and $\tilde{\mu}^y := P(\mathrm{d}\mathbf{w} \mid \mathbf{y})$ are the prior and posterior of $\mathbf{w}$, respectively. The preconditioned Crank–Nicolson (pCN) algorithm [16] can be used to sample from $P(\mathrm{d}\mathbf{w} \mid \mathbf{y})$, and it is well defined even in the case where $N$ goes to infinity, using the fact that the prior for $\mathbf{w}$ is a standard Gaussian distribution. One implementation of the pCN algorithm is given by Algorithm 1.

Due to the linearity assumption on the forward model (13), we can leverage standard Gaussian regression in addition to the non-centered reparametrization. This procedure has been proposed in [13] for general deep Gaussian fields. Here, we adapt this algorithm to our Galerkin method. The resulting algorithm is of Metropolis-within-Gibbs type [47], where the Fourier coefficients $\{\mathbf{u}_j\}$, $j = 0, \ldots, J-1$, are sampled with pCN while $\mathbf{u}_J$ is sampled directly. The details are as follows. By marginalization of $\mathbf{u}_{J-1}$, we can write $P(\mathrm{d}\mathbf{u}_J \mid \mathbf{y}) = \int P(\mathrm{d}\mathbf{u}_J \mid \mathbf{u}_{J-1}, \mathbf{y})\, P(\mathrm{d}\mathbf{u}_{J-1} \mid \mathbf{y})$. Furthermore, by standard Gaussian regression, we can sample directly from $P(\mathrm{d}\mathbf{u}_J \mid \mathbf{u}_{J-1}, \mathbf{y})$ by using
$$ V(\mathbf{u}_{J-1}, \mathbf{y}) := \begin{pmatrix} \mathbf{E}^{-1/2} \mathbf{H} \\ \mathbf{L}(\mathbf{u}_{J-1}) \end{pmatrix}^{\dagger} \left( \begin{pmatrix} \mathbf{E}^{-1/2} \mathbf{y} \\ \mathbf{0} \end{pmatrix} + \tilde{\mathbf{v}} \right), \qquad (21) $$
where $\mathbf{u}_{J-1} = \widetilde{U}(\mathbf{w}_{J-1}, \cdot) \circ \cdots \circ \widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1})$ and $\tilde{\mathbf{v}} \sim N(\mathbf{0}, \mathbf{I})$. Writing $\mathbf{y} = \mathbf{H} \mathbf{L}(\mathbf{u}_{J-1})^{-1} \mathbf{w}_J + \mathbf{e}$, the conditional probability density of $\mathbf{y}$ given $\mathbf{u}_{J-1}$ is given by $p(\mathbf{y} \mid \mathbf{u}_{J-1}) = N(\mathbf{y} \mid \mathbf{0},\ \mathbf{H} \mathbf{L}(\mathbf{u}_{J-1})^{-1} \mathbf{L}(\mathbf{u}_{J-1})^{-\top} \mathbf{H}^{\top} + \mathbf{E})$. Let $\bar{\mathbf{w}} = (\mathbf{w}_0, \ldots, \mathbf{w}_{J-1})$. To obtain samples from $P(\mathrm{d}\mathbf{u}_{J-1} \mid \mathbf{y})$, we can use the reparametrization (17) and Algorithm 1 to sample from $P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y})$. The probability distribution $P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y})$ is given as follows:
$$ P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y}) \propto \exp\big(-\tilde{\Psi}(\bar{\mathbf{w}}, \mathbf{y})\big)\, P(\mathrm{d}\bar{\mathbf{w}}), \qquad (22a) $$
$$ \tilde{\Psi}(\bar{\mathbf{w}}, \mathbf{y}) := \Psi(U(\bar{\mathbf{w}}), \mathbf{y}) = \frac{1}{2} \| \mathbf{y} \|_{\mathbf{Q}}^2 + \frac{1}{2} \log\det(\mathbf{Q}), \qquad (22b) $$
$$ \mathbf{Q} = \mathbf{H} \mathbf{L}(\mathbf{u}_{J-1})^{-1} \mathbf{L}(\mathbf{u}_{J-1})^{-\top} \mathbf{H}^{\top} + \mathbf{E}. \qquad (22c) $$
To sample from $P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y})$ using Algorithm 1, we use $J - 1$ and $\tilde{\Psi}$ in place of $J$ and $\tilde{\Phi}$, respectively.
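A compact sketch of the resulting sampler is given below, assuming a routine `L_of(u_prev)` that assembles $\mathbf{L}(\mathbf{u}_{j-1})$ as in (5), a potential `log_phi` implementing $\tilde{\Phi}$ (or $\tilde{\Psi}$), and a real-valued parametrization of the coefficients; all names here are ours:

```python
import numpy as np

def transform_layers(w_list, L_of, u_minus1):
    """Non-centered map (17)-(18): white-noise coefficients (w_0, ..., w_J)
    to the layer coefficients (u_0, ..., u_J) via u_j = L(u_{j-1})^{-1} w_j."""
    u_prev, u_list = u_minus1, []
    for w in w_list:
        u_prev = np.linalg.solve(L_of(u_prev), w)
        u_list.append(u_prev)
    return u_list

def pcn_step(w, y, s, log_phi, rng):
    """One preconditioned Crank-Nicolson move (Algorithm 1) on the white-noise
    coefficients w, whose prior is standard Gaussian; s in (0, 1] is the step."""
    w_prop = np.sqrt(1.0 - s ** 2) * w + s * rng.standard_normal(w.shape)
    log_ratio = log_phi(w, y) - log_phi(w_prop, y)  # Φ̃(w, y) - Φ̃(w̃, y)
    if log_ratio > np.log(rng.uniform()):
        return w_prop, 1   # accept
    return w, 0            # reject
```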
5. Numerical results
In this section, we present examples of Bayesian inversion using the multi-layered Gaussian prior presented in the previous section. Our main focus in this section is to show the effectiveness of the proposed finite-dimensional approximation method for selected examples. Therefore, we will not discuss the properties of the MCMC algorithm used to generate the samples, as they are based on the MCMC algorithms described in [48, 13]. We aim at an acceptance ratio between 25 and 50 %, which is obtained by tuning the pCN step size $s$ in Algorithm 1. Our experience in the numerical implementations below indicates that the MCMC algorithm based on the pCN and the non-centered parametrization is quite robust. For the one- and two-hyperprior-layer implementations, the step size $s$ used in the first and second examples does not need to be extremely small.

Algorithm 1: preconditioned Crank–Nicolson algorithm.
Input: $J$, $\mathbf{w}$, $\mathbf{y}$, $\tilde{\Phi}(\mathbf{w}, \mathbf{y})$
Output: accepted, $\mathbf{w}$
  accepted = 0
  draw $\mathbf{w}' \sim N(\mathbf{0}, \mathbf{I})$
  $\tilde{\mathbf{w}} = \sqrt{1 - s^2}\, \mathbf{w} + s\, \mathbf{w}'$
  logRatio $= \tilde{\Phi}(\mathbf{w}, \mathbf{y}) - \tilde{\Phi}(\tilde{\mathbf{w}}, \mathbf{y})$
  draw $\omega \sim \mathrm{Uniform}[0, 1]$
  if logRatio $> \ln(\omega)$ then
    $\mathbf{w} = \tilde{\mathbf{w}}$
    accepted = 1
  end if

5.1. Signal denoising

In this section, we consider the application of the proposed technique to the non-parametric denoising of two piecewise smooth signals. The first test signal is a rectangular pulse which is zero except on a subinterval $[a, b] \subset (0, 1)$; the second combines a smooth bell-shaped bump with a rectangular pulse taking the values $1$ and $-1$ on consecutive subintervals:
$$ \upsilon_{\mathrm{rect}}(t) = \begin{cases} 1, & t \in [a, b], \\ 0, & \text{otherwise}, \end{cases} \qquad \upsilon_{\mathrm{bell,rect}}(t) = \begin{cases} \exp\big(-\tfrac{1}{(t - t_0)^2}\big), & t \in (0, t_1], \\ 1, & t \in [t_2, t_3], \\ -1, & t \in (t_3, t_4], \\ 0, & \text{otherwise}, \end{cases} \qquad (23) $$
where $0 < t_0 < t_1 < t_2 < t_3 < t_4 < 1$ denote the breakpoints of the signals. Previously, in [24], for $J = 1$ and with the unknown signal $\upsilon_{\mathrm{bell,rect}}(t)$, we demonstrated that upon increasing the number of Fourier basis functions, the $L^2$ error between the ground truth and the posterior sample mean decreased significantly. In this section, we compare the estimation results using one and two hyperprior layers, respectively. To allow high variation near points of discontinuity, the length-scale of $\upsilon$ is expected to be smaller around the discontinuities than in the rest of the domain. We take measurements on a uniform one-dimensional grid and set the standard deviation of the measurement noise to $0.1$. The proposed algorithm is tested with a number of Fourier basis functions of the form $N = 2^k - 1$.
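A sketch of the two test signals in (23) is given below; the numeric breakpoints are illustrative placeholders, not the exact values used in the experiments:

```python
import numpy as np

def upsilon_rect(t, a=0.4, b=0.6):
    """Rectangular test signal: one on [a, b], zero elsewhere.
    The breakpoints a, b are placeholder values for illustration."""
    return np.where((t >= a) & (t <= b), 1.0, 0.0)

def upsilon_bell_rect(t, t0=0.15, t1=0.3, t2=0.5, t3=0.65, t4=0.8):
    """Bell-plus-rectangle test signal from (23); breakpoints are placeholders."""
    out = np.zeros_like(t)
    bump = (t > 0) & (t <= t1)
    out[bump] = np.exp(-1.0 / (t[bump] - t0) ** 2)  # smooth bell-shaped bump
    out[(t >= t2) & (t <= t3)] = 1.0
    out[(t > t3) & (t <= t4)] = -1.0
    return out
```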
After estimation, we reconstruct the signal using the inverse Fourier transform on a finer, equally spaced grid between zero and one. We take ten million samples for each MCMC run. To have a fair comparison, we use the same measurement record for each run with different $J$.

Figures 2 and 3 show the reconstructed signals and their respective length-scale estimates. It can be clearly seen that the addition of another hyperprior layer improves the reconstruction of the unknown signals. For $J = 2$, although there are no sudden drops near the points of discontinuity, the length-scale value is sufficiently high in the smooth part of the $\upsilon$ signal, and substantially low near the points of discontinuity. This variation translates into better smoothness detection, as can be seen in Figures 2a, 3a, 2c, and 3c. In contrast, Figures 2b and 3b show that when $J = 1$, the posterior sample means considerably overfit the data in the smooth parts of the signals. Although the length-scale drops suddenly near the discontinuities of the ground truth $\upsilon$, the variation of $\ell$ is limited (see Figures 2d and 3d). This contributes to decreased smoothness in the smooth regions of the sample mean. The quantitative performance is given in Table 1.

Table 1. Quantitative performance comparison of the shallow Gaussian field prior inversion for the one-dimensional signals in Section 5.1.

         υ_rect               υ_bell,rect
J    L² error   PSNR      L² error   PSNR
1    1.044      24.325    1.527      21.603
2    0.922      25.605    1.475      21.801

5.2. X-ray tomography

In this section, we apply the methods that we have developed to the X-ray tomography reconstruction problem. For the tomography problem, the linear functional is given by a line integration known as the Radon transform [49, 28]. Assume that the field of interest has support in a circle with center at $(1/2, 1/2)$ and radius $1/2$. For a measurement at distance $r$ with detection angle $\theta$, we can write the Radon transform of the Fourier basis $\phi_k = \exp(i 2\pi \mathbf{k}^\top \mathbf{x})$ as
$$ \langle \phi_k, H_{r,\theta} \rangle = \int_\Omega \chi\big(x - \tfrac{1}{2}, y - \tfrac{1}{2}\big)\, \exp(i 2\pi \mathbf{k}^\top \mathbf{x})\, \delta\Big(r - \big((x - \tfrac{1}{2}) \cos\theta + (y - \tfrac{1}{2}) \sin\theta\big)\Big)\, \mathrm{d}\mathbf{x}, $$
where $\chi(x, y)$ is an indicator function with support in $\{x^2 + y^2 < 1/4\}$. Introducing a rotation matrix $R_\theta$ and $\mathbf{p} = [p\ q]^\top$, with $x' = x - \tfrac{1}{2}$, $y' = y - \tfrac{1}{2}$, and $\mathbf{p} = R_\theta \mathbf{x}'$, we can write
$$ \langle \phi_k, H_{r,\theta} \rangle = \exp(i\pi(k_x + k_y)) \int_\Omega \chi(x', y')\, \exp\big(i 2\pi (R_\theta \mathbf{k})^\top \mathbf{p}\big)\, \delta(r - p)\, \mathrm{d}\mathbf{p}. $$
Using the assumptions above and $\tilde{\mathbf{k}} = [\tilde{k}_x\ \tilde{k}_y]^\top = R_\theta \mathbf{k}$, we end up with
$$ \langle \phi_k, H_{r,\theta} \rangle = \exp(i\pi(k_x + k_y))\, \exp(i 2\pi \tilde{k}_x r)\, \frac{\sin\big(2\pi \tilde{k}_y \sqrt{1/4 - r^2}\big)}{\pi \tilde{k}_y}. $$
Notice that when $\tilde{k}_y = 0$ we replace the above equation with its limit, that is,
$$ \lim_{\tilde{k}_y \to 0} \langle \phi_k, H_{r,\theta} \rangle = 2\sqrt{1/4 - r^2}\, \exp(i\pi(k_x + k_y))\, \exp(i 2\pi \tilde{k}_x r). $$
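The closed form above is easy to evaluate numerically; a minimal sketch under the conventions of this section (function and variable names are ours):

```python
import numpy as np

def radon_fourier_coeff(k, r, theta):
    """Closed-form Radon transform <φ_k, H_{r,θ}> of the Fourier basis
    φ_k(x) = exp(i 2π kᵀx) over the disc of radius 1/2 centred at (1/2, 1/2)."""
    kx, ky = k
    kt_x = kx * np.cos(theta) + ky * np.sin(theta)    # k̃ = R_θ k
    kt_y = -kx * np.sin(theta) + ky * np.cos(theta)
    phase = np.exp(1j * np.pi * (kx + ky)) * np.exp(2j * np.pi * kt_x * r)
    half_chord = np.sqrt(max(0.25 - r ** 2, 0.0))     # √(1/4 - r²)
    if np.isclose(kt_y, 0.0):
        return 2.0 * half_chord * phase               # the k̃_y -> 0 limit
    return phase * np.sin(2.0 * np.pi * kt_y * half_chord) / (np.pi * kt_y)
```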
We modify the MCMC implementation used for the previous one-dimensional example to suit a GPU architecture. For our test comparison in X-ray tomography, the Shepp–Logan phantom at $511 \times 511$ resolution is used (see Figure 4a). We will use this phantom to evaluate the proposed method.

Figure 2. Simulation results of the example in Section 5.1. Figures 2a, 2c, 2b, and 2d describe the lowest random fields $\upsilon$ for $J = 2$ and $J = 1$, and their length-scales. In Figures 2a and 2b, the blue and black lines are the mean Fourier inverse of the samples within the 95 % confidence shades of the lowest layer, and the original unknown signals, respectively. The blue lines and the shades in the remaining figures are sample means and 95 % confidence intervals of the estimated length-scales.

Table 2. Quantitative performance comparison of the shallow Gaussian field Bayesian inversion for the tomography application in Section 5.2.

            Shallow-GP   FBP      Tikhonov
PSNR        22.246       23.399   21.673
L² error    44.795       47.156   42.145
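For reference, the error metrics reported in Tables 1 and 2 can be computed as follows; this is a sketch, and the peak-value convention in the PSNR is our assumption:

```python
import numpy as np

def l2_error(est, truth):
    """Discrete L2 error between an estimate and the ground truth."""
    return np.sqrt(np.sum((est - truth) ** 2))

def psnr(est, truth, peak=None):
    """Peak signal-to-noise ratio in dB; `peak` defaults to max of the truth."""
    peak = np.max(np.abs(truth)) if peak is None else peak
    mse = np.mean((est - truth) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```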
We take 45 sparsely sampled full projections out of 180. The measurement is corrupted by white Gaussian noise with standard deviation $0.2$. Due to the restriction in GPU memory, $J$ is set to one. To speed up the Fourier transform and inverse Fourier transform computations, we use the FFT and IFFT routines from CuPy [50]. For a performance comparison, we perform filtered back projection (FBP) on the sinogram using the iradon routine from scikit-image [51].

Figure 3. Similar to Figure 2, but with the unknown $\upsilon_{\mathrm{bell,rect}}$.
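A minimal sketch of the FBP baseline with scikit-image, mirroring the 45-angle sparse-projection setup (the phantom file name is hypothetical):

```python
import numpy as np
from skimage.transform import radon, iradon

rng = np.random.default_rng(0)
phantom = np.load("shepp_logan_511.npy")              # assumed 511x511 array
theta = np.linspace(0.0, 180.0, 45, endpoint=False)   # 45 angles out of 180
sinogram = radon(phantom, theta=theta)                # forward Radon transform
sinogram += 0.2 * rng.standard_normal(sinogram.shape) # noise with std 0.2
fbp = iradon(sinogram, theta=theta)                   # filtered back projection
```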
A Tikhonov regularization is also used to reconstruct the Fourier coefficients of the unknown field $u_J$ [52]. The Tikhonov regularization parameter $\lambda$ is selected based on the best $L^2$ error and PSNR performance. We set the Fourier basis number $n$ to 31. The total number of parameters for each layer is 1985, which makes the total number of parameters for all layers 3970. The total number of parameters in this example is greatly reduced compared to [53], where each pixel in the target image counts as a parameter; for our example, that translates to 261121 parameters.

Figure 4 shows that, compared to the FBP and Tikhonov regularization reconstructions, the posterior sample mean of our MCMC method results in an image with less streak artifact and noise. The features of the phantom appear much clearer than in the FBP and Tikhonov reconstructions. As examples, the mouth parts, the dark circles between the eyes and at the forehead, and the two eyes are all relatively much clearer than in the other two reconstructions. Nonetheless, since we set $n$ to only 31, the edge of the phantom face, which has very high values, is not fully recovered. This results in a slightly larger $L^2$ error (44.795 against 42.145 for the Tikhonov regularization method) but a higher PSNR (22.246 compared to the Tikhonov regularization method, 21.673). One might need to increase the number of Fourier basis functions $n$ in our proposed Bayesian method to get a much better reconstruction. However, it is not possible to accomplish this within our current setup due to a restriction on the GPU memory. We can fairly conclude that the use of a shallow Gaussian field prior with $n = 31$ results in a highly reduced amount of artifact at the expense of light blur at the edges. As in the one-dimensional example, adding another Gaussian field layer might help to increase the sharpness of the edges in the X-ray tomography application.

Figure 4. Simulation result of a shallow SPDE with $J = 1$ and $n = 31$, where the number of projections is 45. Figures 4b and 4d show the posterior mean of the field $\upsilon$ and the Tikhonov regularization result, while Figure 4c shows the FBP reconstruction.
6. Conclusion
We have presented a multi-layered Gaussian-field Bayesian inversion method based on a Galerkin discretization. We have shown that our approach enjoys a convergence-in-probability property to the weak solution of the forward model, and that this implies weak convergence of the joint posterior distribution of the Gaussian fields. This gives an assurance that our proposed method is well defined and robust upon increasing the number of Fourier basis functions. Using the non-centered version of the preconditioned Crank–Nicolson algorithm, we have shown that, for a one-dimensional denoising problem, using two hyperprior layers achieves smoothing and edge detection of the unknown at the same time. For the X-ray tomography problem, with a single hyperprior layer and a very small number of Fourier basis functions ($n = 31$), the posterior sample mean of our proposed approach gives an image with less streak artifact and noise compared to the FBP and Tikhonov regularization reconstructions. Although traces of streak artifact and edge blurring are still present, the $L^2$ error and PSNR of our proposed method sit in the middle of those from the FBP and Tikhonov regularization. Furthermore, adding another Gaussian field layer might help to increase the sharpness of the edges in the X-ray tomography application. One future outlook is to apply the method to real data.

Acknowledgments
The authors would like to thank the Academy of Finland for financial support.
References

[1] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems. Springer, 2004.
[2] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press Ltd, 2005.
[3] M. J. Heaton, A. Datta, A. O. Finley, R. Furrer, J. Guinness, R. Guhaniyogi, F. Gerber, R. B. Gramacy, D. Hammerling, M. Katzfuss, F. Lindgren, D. W. Nychka, F. Sun, and A. Zammit-Mangion, “A case study competition among methods for analyzing large spatial data,” J. Agr. Biol. Envir. St., vol. 24, pp. 398–425, Dec 2018.
[4] C. J. Paciorek and M. J. Schervish, “Nonstationary covariance functions for Gaussian process regression,” in Adv. Neur. Inf. Proc. Sys., pp. 273–280, 2004.
[5] G.-A. Fuglstad, D. Simpson, F. Lindgren, and H. Rue, “Does non-stationary spatial data always require non-stationary random fields?,” Spat. Stat., vol. 14, pp. 505–531, Nov 2015.
[6] C. J. Paciorek, Non-Stationary Gaussian Processes for Regression and Spatial Modelling. PhD thesis, 2003.
[7] E. Snelson, Z. Ghahramani, and C. E. Rasmussen, “Warped Gaussian processes,” in Adv. Neur. Inf. Proc. Sys., pp. 337–344, 2004.
[8] F. Lindgren, H. Rue, and J. Lindström, “An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach,” J. Royal Stat. Soc. B, vol. 73, no. 4, pp. 423–498, 2011.
[9] L. Roininen, M. Girolami, S. Lasanen, and M. Markkanen, “Hyperpriors for Matérn fields with applications in Bayesian inversion,” Inverse Probl. & Imaging, vol. 13, no. 1, pp. 1–29, 2019.
[10] K. Monterrubio-Gómez, L. Roininen, S. Wade, T. Damoulas, and M. Girolami, “Posterior inference for sparse hierarchical non-stationary models,” Comp. Stat. & Data Anal., p. 106954, Mar 2020.
[11] A. C. Damianou and N. D. Lawrence, “Deep Gaussian processes,” Artif. Intell. & Stat., 2013.
[12] D. Duvenaud, O. Rippel, R. P. Adams, and Z. Ghahramani, “Avoiding pathologies in very deep networks,” Artif. Intell. & Stat., 2014.
[13] M. M. Dunlop, M. A. Girolami, A. M. Stuart, and A. L. Teckentrup, “How deep are deep Gaussian processes?,” J. Mach. Learn. Res., vol. 19, no. 54, pp. 1–46, 2018.
[14] A. M. Stuart, “Inverse problems: A Bayesian perspective,” Acta Numerica, vol. 19, pp. 451–559, 2010.
[15] M. Dashti and A. M. Stuart, “The Bayesian approach to inverse problems,” in Handbook of Uncertainty Quantification, pp. 311–428, Springer International Publishing, 2017.
[16] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, “MCMC methods for functions: Modifying old algorithms to make them faster,” Statist. Sci., vol. 28, no. 3, pp. 424–446, 2013.
[17] K. J. H. Law, “Proposals which speed up function-space MCMC,” J. Comp. & Appl. Math., vol. 262, pp. 127–138, 2014.
[18] A. Beskos, M. Girolami, S. Lan, P. E. Farrell, and A. M. Stuart, “Geometric MCMC for infinite-dimensional inverse problems,” J. Comp. Phys., vol. 335, pp. 327–351, 2017.
[19] D. Rudolf and B. Sprungk, “On a generalization of the preconditioned Crank–Nicolson Metropolis algorithm,” Found. Comp. Math., vol. 18, pp. 309–343, 2018.
[20] M. Hairer, A. M. Stuart, and S. J. Vollmer, “Spectral gaps for a Metropolis–Hastings algorithm in infinite dimensions,” Ann. Appl. Probab., vol. 24, pp. 2455–2490, Dec 2014.
[21] V. Chen, M. M. Dunlop, O. Papaspiliopoulos, and A. M. Stuart, “Dimension-robust MCMC in Bayesian inverse problems.”
[22] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld, “A general framework for the parametrization of hierarchical models,” Stat. Sci., vol. 22, pp. 59–73, Feb 2007.
[23] Z. Hu, Z. Yao, and J. Li, “On an adaptive preconditioned Crank–Nicolson MCMC algorithm for infinite dimensional Bayesian inference,” J. Comp. Phys., vol. 332, pp. 492–503, Mar 2017.
[24] M. Emzir, S. Lasanen, Z. Purisha, and S. Särkkä, “Hilbert-space reduced-rank methods for deep Gaussian processes,” in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2019.
[25] A. Solin and S. Särkkä, “Hilbert space methods for reduced-rank Gaussian process regression,” Stat. Comput., vol. 30, pp. 419–446, Aug 2019.
[26] M. Lassas and S. Siltanen, “Can one use total variation prior for edge-preserving Bayesian inversion?,” Inverse Problems, vol. 20, pp. 1537–1563, Aug 2004.
[27] J. Kaipio and E. Somersalo, “Statistical inverse problems: Discretization, model reduction and inverse crimes,” J. Comp. & Appl. Math., vol. 198, pp. 493–504, Jan 2007.
[28] F. Natterer, The Mathematics of Computerized Tomography (Classics in Applied Mathematics). SIAM: Society for Industrial and Applied Mathematics, 2001.
[29] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics, 2005.
[30] D. Li, J. Svensson, H. Thomsen, F. Medina, A. Werner, and R. Wolf, “Bayesian soft X-ray tomography using non-stationary Gaussian processes,” Rev. Sci. Inst., vol. 84, p. 083506, Aug 2013.
[31] C. Plagemann, K. Kersting, and W. Burgard, “Nonstationary Gaussian process regression using point estimates of local smoothness,” in Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15–19, 2008, Proceedings, Part II, 2008.
[32] Z. Purisha, C. Jidling, N. Wahlström, T. B. Schön, and S. Särkkä, “Probabilistic approach to limited-data computed tomography reconstruction,” Inverse Probl., vol. 35, p. 105004, Sep 2019.
[33] T. Frese, C. Bouman, and K. Sauer, “Adaptive wavelet graph model for Bayesian tomographic reconstruction,” IEEE Trans. Imag. Proc., vol. 11, pp. 756–770, Jul 2002.
[34] Y. Chen, Y. Li, W. Yu, L. Luo, W. Chen, and C. Toumoulin, “Joint-MAP tomographic reconstruction with patch similarity based mixture prior model,” Multiscale Modeling & Simulation, vol. 9, pp. 1399–1419, Oct 2011.
[35] V. Antun, F. Renna, C. Poon, B. Adcock, and A. C. Hansen, “On instabilities of deep learning in image reconstruction and the potential costs of AI,” PNAS, p. 201907377, May 2020.
[36] L. C. Evans, An Introduction to Stochastic Differential Equations. American Mathematical Society, 2014.
[37] S. Lasanen, L. Roininen, and J. M. Huttunen, “Elliptic boundary value problems with Gaussian white noise loads,” Stoch. Proc. & Appl., vol. 128, no. 11, pp. 3607–3627, 2018.
[38] S. C. Brenner and L. R. Scott, The Mathematical Theory of Finite Element Methods, vol. 15 of Texts in Applied Mathematics. Springer, New York, third ed., 2008.
[39] V. I. Bogachev, Weak Convergence of Measures, vol. 234 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2018.
[40] O. Kallenberg, Foundations of Modern Probability. Probability and its Applications (New York), Springer-Verlag, New York, second ed., 2002.
[41] W. Rudin, Principles of Mathematical Analysis, vol. 39 of International Series in Pure and Applied Mathematics. McGraw-Hill, 1976.
[42] M. B. Marcus and J. Rosen, Markov Processes, Gaussian Processes, and Local Times, vol. 100 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2006.
[43] V. I. Bogachev, Gaussian Measures, vol. 62 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 1998.
[44] S. Lasanen, “Non-Gaussian statistical inverse problems. Part II: Posterior convergence for approximated unknowns,” Inverse Probl. Imaging, vol. 6, no. 2, pp. 267–287, 2012.
[45] K. Dong, D. Eriksson, H. Nickisch, D. Bindel, and A. G. Wilson, “Scalable log determinants for Gaussian process kernel learning,” in Advances in Neural Information Processing Systems, pp. 6327–6337, 2017.
[46] Y. Yu and X.-L. Meng, “To center or not to center: That is not the question. An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency,” J. Comp. & Graph. Stat., vol. 20, pp. 531–570, Jan 2011.
[47] D. Gamerman and H. F. Lopes, Markov Chain Monte Carlo. Taylor & Francis Inc, 2006.
[48] T. Cui, K. J. Law, and Y. M. Marzouk, “Dimension-independent likelihood-informed MCMC,” J. Comp. Phys., vol. 304, pp. 109–137, Jan 2016.
[49] S. R. Deans, The Radon Transform and Some of Its Applications. Wiley, 1983.
[50] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis, “CuPy: A NumPy-compatible library for NVIDIA GPU calculations,” in Proceedings of the 25th ACM International Conference on Multimedia (MM '17), pp. 1217–1220, 2017.
[51] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, and T. Yu, “scikit-image: image processing in Python,” PeerJ, vol. 2, p. e453, Jun 2014.
[52] J. Mueller and S. Siltanen, Linear and Nonlinear Inverse Problems with Practical Applications. Philadelphia: Society for Industrial and Applied Mathematics, 2012.
[53] J. Suuronen, M. Emzir, S. Lasanen, S. Särkkä, and L. Roininen, “Enhancing industrial X-ray tomography by data-centric statistical methods,” arXiv preprint arXiv:2003.03814.