Non-Stationary Multi-layered Gaussian Priors for Bayesian Inversion
Muhammad Emzir, Sari Lasanen, Zenith Purisha, Lassi Roininen, Simo Särkkä
1) Department of Electrical Engineering and Automation, Aalto University, P.O. Box 12200, FI-00076 Aalto, Finland
2) Sodankylä Geophysical Observatory, University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland
3) School of Engineering Science, Lappeenranta-Lahti University of Technology, P.O. Box 20, FI-53851 Lappeenranta, Finland
E-mail: [email protected]
March 2020
Abstract.
In this article, we study Bayesian inverse problems with multi-layered Gaussian priors. We first describe the conditionally Gaussian layers in terms of a system of stochastic partial differential equations. We build the computational inference method using a finite-dimensional Galerkin method. We show that the proposed approximation has a convergence-in-probability property to the solution of the original multi-layered model. We then carry out Bayesian inference using the preconditioned Crank–Nicolson algorithm, which is modified to work with multi-layered Gaussian fields. We show via numerical experiments in signal deconvolution and computerized X-ray tomography problems that the proposed method can offer both smoothing and edge preservation at the same time.
1. Introduction
The Bayesian approach provides a consistent framework for obtaining solutions of inverse problems. By formulating the unknown as a random variable, the degree of information that is available can be encoded as a statistical prior. The ill-posedness of the problem is mitigated by reformulating the inverse problem as a well-posed extension in the space of probability distributions [1]. Among the statistical priors commonly used in Bayesian inverse problems is the Gaussian prior, which is relatively easy to manipulate, has a simple structure, and also has a close relation to traditional Tikhonov regularization. This approach has also attracted growing interest in the machine learning community, where the use of a Gaussian prior for Bayesian inference is known as Gaussian process regression [2]. When the unknown is a multivariate function, it is natural to model it as a random field. There is a vast amount of studies on Gaussian random fields and their applications.

Figure 1. Illustration of a chain of Gaussian fields. Each node (field) has an independent white noise input field and a length-scale parameter $\ell$. The length-scale $\ell$ is obtained as a function evaluated on the fields above it. These Gaussian fields are approximated in finite-dimensional Hilbert space using $\mathbf{L}(\mathbf{u}_{j-1})\mathbf{u}_j = \mathbf{w}_j$, see (4).

In many spatio-temporal inverse problems, it is often useful to start by working in an infinite-dimensional space. There are essentially two approaches to construct a Bayesian inference algorithm. The first approach is to discretize the forward map (e.g., via grid partitions or finite element meshes) and apply a Bayesian inference method in the finite-dimensional setting [1]. The second approach [14, 15] is to directly apply Bayesian inference to the infinite-dimensional problem and afterward apply a discretization method. The latter approach is possible through realizing that the posterior and prior probability distributions can be related via the Radon–Nikodym derivative, which can be generalized to function spaces [14]. For any of the aforementioned approaches, the sampling technique has to be carefully designed; traditional Markov chain Monte Carlo (MCMC) algorithms degenerate under mesh refinement, which motivates function-space methods such as the preconditioned Crank–Nicolson algorithm [16].
We use $u_j$ to denote the random field at layer $j$, where the total number of layers is $J+1$, that is, $J$ is the number of hyperprior layers. The collection of $J+1$ random fields $(u_0, u_1, \ldots, u_J)$ is denoted by $u$. Let $\{\phi_l\}$ be a basis formed from the orthogonal eigenfunctions of the Laplace operator with respect to some domain $\Omega$ with suitable boundary conditions. Let $N$ be the number of basis functions $\{\phi_l\}$ used in the Galerkin method and let $H_N$ denote their span. The Fourier transform of any random field $z$ is denoted by $\hat{z}$, where the Fourier coefficient for index $l$ is given by $\hat{z}(l) = \langle z, \phi_l \rangle$, where $\langle \cdot, \cdot \rangle$ is the standard $L^2$ inner product on the domain $\Omega$. We use bold characters to denote vectors or matrices with elements in $\mathbb{C}$ or $\mathbb{R}$. The Fourier coefficients of the random field $u_j$ for indices $-N$ to $N$ are collected in $\mathbf{u}_j$, that is, $\mathbf{u}_j = (\hat{u}_j(-N), \ldots, \hat{u}_j(N))$. We denote the $J+1$ collections of the Fourier coefficients of the random fields $u$ by $\mathbf{u} = (\mathbf{u}_0, \ldots, \mathbf{u}_J)$. The identity operator is denoted by $I$.
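For concreteness, on the unit interval the coefficients $\hat{z}(l)$ can be approximated from equispaced samples with the FFT. The following is a minimal sketch (the function name and the odd grid-size convention are ours, not from the paper):

```python
import numpy as np

def fourier_coeffs(z_grid):
    """Approximate ẑ(l) = <z, φ_l>, l = -N, ..., N, for φ_l(x) = exp(i 2π l x)
    on [0, 1), from samples of z on a uniform grid of odd size 2N + 1."""
    M = len(z_grid)  # M = 2N + 1
    # np.fft.fft returns coefficients for frequencies 0, ..., N, -N, ..., -1;
    # fftshift reorders them to -N, ..., N, and 1/M normalizes the quadrature.
    return np.fft.fftshift(np.fft.fft(z_grid)) / M
```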
2. Finite-dimensional approximations
Consider a Bayesian inverse problem on a Gaussian field where the unknown is a real-valued random field $\upsilon(x) : \Omega \to \mathbb{R}$ on a bounded domain $\Omega \subset \mathbb{R}^d$. We assume in the inverse problem that $\upsilon$ belongs to a Hilbert space $H$, specifically, $\upsilon \in H := L^2(\Omega)$. To carry out the Bayesian inference, a set of measurements is taken (either direct or indirect, in multiple locations). In this article, the measurement is assumed to be a linear operation on $\upsilon$ corrupted with additive noise, that is, $y_k = \langle \upsilon, h_k \rangle + e_k$, where $h_k$ is an element in $H$ representing a real linear functional on $H$, and $e_k$ is a zero-mean white noise with covariance matrix $\mathbf{E}$.

In the Bayesian framework, the estimation problem is equivalent to exploring the posterior distribution of $\upsilon$ given the measurements $\{y_k\}$. The Bayesian inversion approach for this problem starts with assuming that $\upsilon$ is a Gaussian field with a certain mean (assumed zero for simplicity) and covariance function $C(x, x')$. In the case when $C(x, x')$ is a Matérn covariance function, the Gaussian field $\upsilon$ can be generated from a stochastic partial differential equation of the form [8, 9]
$$ (1 - \ell^2 \Delta)^{\alpha/2}\, \upsilon(x) = \sqrt{\beta \ell^d}\, w(x), \qquad (1) $$
where $\alpha = \nu + d/2$, $d$ is the dimension of the space, $\nu$ is a smoothness parameter, $w(x)$ is a white noise on $\mathbb{R}^d$, $\ell$ is the length-scale constant of the Matérn covariance function $C$, and $\beta = \sigma^2\, 2^d \pi^{d/2}\, \Gamma(\alpha)/\Gamma(\nu)$ with $\sigma$ being a scale parameter.

To obtain a non-stationary field, we modify SPDE (1) so that the length-scale $\ell$ is modeled via another Gaussian field $u$ with Matérn covariance function. Namely, we select $\ell(x) = g(u(x))$, where $g$ is a smooth positive function $g : \mathbb{R} \to \mathbb{R}_+$. As in [9], we also require that $\ell$ satisfies $\sup_{x \in \Omega} \ell(x) < \infty$ and $\inf_{x \in \Omega} \ell(x) > 0$. In this article we restrict our attention to $\alpha = 2$; the results could also be extended to other cases. Introducing the spatially varying length-scale $\ell(x) = g(u(x))$ into (1), and since the length-scale is greater than zero with probability one, using $\kappa = 1/\ell$ we obtain the following SPDE
$$ \big(\kappa(u(x))^2 - \Delta\big)\, \upsilon(x) = \sqrt{\beta}\, \kappa(u(x))^{\nu}\, w(x). \qquad (2) $$

We approximate $\upsilon$ using the eigenfunctions of the Laplacian. With a suitable boundary condition, the Laplace operator can be expressed as $-\Delta \upsilon = \sum_{j=-\infty}^{\infty} \lambda_j \langle \upsilon, \phi_j \rangle \phi_j$, where $\{\phi_j\}$ is a complete set of orthonormal eigenfunctions of $\Delta$ and $\lambda_j > 0$ with $\lim_{j\to\infty} \lambda_j = \infty$ [36]. Observe that, in the sense of (2), $\kappa(u(x)) := 1/g(u(x)) \in L^\infty(\Omega)$ is a multiplication operator acting pointwise, that is, $(\kappa(u)\upsilon)(x) = \kappa(u(x))\upsilon(x)$ for all $x \in \Omega$ [37].

In what follows, we first describe the matrix representation of a chain of Gaussian fields generated from SPDEs of the form (2) on a $d$-dimensional domain. Then, in Section 3, we develop a convergence result of the Galerkin method developed here to the weak solution of (2).

Let us examine a periodic boundary condition on a $d$-dimensional box $\Omega$ with side length 1. Under this boundary condition, it is useful to consider $H$ as a complex Hilbert space, so that we can set $\phi_l = \exp(i c_d\, x^\top k(l))$ as Fourier complex basis functions for $d$ dimensions, for some constant $c_d$ and a multi-index $k(l)$ which is unique for every $l$. Let the finite-dimensional Hilbert subspace $H_N$ of $H$ be the span of $\phi_{-N}, \ldots, \phi_0, \ldots, \phi_N$. In what follows, we explicitly construct the multi-index $k(\cdot)$.

Without loss of generality, let us assume that every entry in $k(l)$ is between $-n$ and $n$. For any $-N \le l \le N$, where $N = ((2n+1)^d - 1)/2$, we would like to construct $k(l) = (k_1(l), \ldots, k_d(l))$ such that it is unique for each $-N \le l \le N$, and $k(l+m) = k(l) + k(m)$ for $-N \le l, m \le N$ with $\max_r |k_r(l+m)| \le n$, $r \le d$. The construction of $k : [-N, N] \to [-n, n]^d$ is as follows. Let the matrix $K(n)$ be given as
$$ K(n) = \begin{pmatrix} z_n \otimes e_n^{\otimes (d-1)} \\ e_n \otimes z_n \otimes e_n^{\otimes (d-2)} \\ \vdots \\ e_n^{\otimes (d-1)} \otimes z_n \end{pmatrix}, $$
where $z_n = (-n, -n+1, \ldots, n-1, n)$, $e_n = (1, 1, \ldots, 1)$, and $e_n^{\otimes d}$ is a Kronecker product of $e_n$ repeated $d$ times. The multi-index $k(l)$ is given by selecting the $(l + N + 1)$-th column of $K$. The linear relation is defined only if $-N \le l + m \le N$ and every element of the sum $k(l) + k(m)$ has values in $[-n, n]$. As an example, let $d = 2$ and $n = 1$. This gives us
$$ K(1) = \begin{pmatrix} -1 & -1 & -1 & 0 & 0 & 0 & 1 & 1 & 1 \\ -1 & 0 & 1 & -1 & 0 & 1 & -1 & 0 & 1 \end{pmatrix}. $$
It can be verified that $k(l)$ selected this way is both unique and linear, given that the summation result is inside the range. For example, $k(1) + k(2) = k(3)$. However, $k(1) + k(1)$ is not defined since the summation is outside the limit. In the following, if the context is clear, we will also use the multi-index $k(l)$ for the Fourier component of $u$, that is, $\hat{u}(k(l)) := \hat{u}(l)$.
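The construction of $K(n)$ is straightforward to implement with Kronecker products; below is a minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def multi_index_matrix(n, d):
    """Build the d x (2n+1)^d multi-index matrix K(n) whose columns are the
    Fourier multi-indices k(l), l = -N, ..., N with N = ((2n+1)**d - 1) // 2.
    Row r is the Kronecker pattern e_n^{(r)} (x) z_n (x) e_n^{(d-r-1)}."""
    z = np.arange(-n, n + 1)           # z_n = (-n, ..., n)
    e = np.ones(2 * n + 1, dtype=int)  # e_n = (1, ..., 1)
    rows = []
    for r in range(d):
        factors = [e] * r + [z] + [e] * (d - r - 1)
        row = factors[0]
        for f in factors[1:]:
            row = np.kron(row, f)
        rows.append(row)
    return np.vstack(rows)

K = multi_index_matrix(1, 2)   # the d = 2, n = 1 example from the text
N = (K.shape[1] - 1) // 2      # N = 4
k = lambda l: K[:, l + N]      # k(l) is the (l + N + 1)-th column
assert np.array_equal(k(1) + k(2), k(3))  # linearity inside the range
```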
Note that for $-N \le l, m \le N$, $\langle \phi_m u, \phi_l \rangle = \hat{u}(l - m)$ when $\max(|k(l) - k(m)|) \le n$, and zero elsewhere. Let us denote by $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{w}$ the finite-dimensional representations of $u$, $\upsilon$, and $w$, respectively. Let $M_N(\mathbf{u})$ be the matrix representation of the multiplication operator $u(x)$ on $H_N$. For a random field $r$ with Fourier coefficients $\mathbf{r} = (\hat{r}(-m) \cdots \hat{r}(m))^\top \in \mathbb{C}^{2m+1}$, let us write a Toeplitz matrix $T \in \mathbb{C}^{(m+1) \times (m+1)}$ with elements from $\mathbf{r}$ as follows:
$$ T(\mathbf{r}) = \begin{pmatrix} \hat{r}(0) & \cdots & \hat{r}(-m) \\ \vdots & \ddots & \vdots \\ \hat{r}(m) & \cdots & \hat{r}(0) \end{pmatrix}. $$
The previous discussion allows us to write, for $d = 1$, $M_N(\mathbf{u}) := T(\tilde{\mathbf{u}}) \in \mathbb{C}^{(2N+1) \times (2N+1)}$, where $\tilde{\mathbf{u}} = (\mathbf{0}_{1 \times N}, \mathbf{u}^\top, \mathbf{0}_{1 \times N})^\top \in \mathbb{C}^{(4N+1) \times 1}$.

It is also possible to construct $M_N(\mathbf{u})$ for $d > 1$. However, instead of working directly on $\mathbf{u}$, we need to work on the frequency indices. Let $\mathbf{J} \in \mathbb{R}^{(2n+1) \times (2n+1)}$ be a square matrix with all entries equal to one. Let also $Z^{(1)} = T(z_n) \otimes \mathbf{J}^{\otimes (d-1)}$, $Z^{(2)} = \mathbf{J} \otimes T(z_n) \otimes \mathbf{J}^{\otimes (d-2)}$, \ldots, $Z^{(d)} = \mathbf{J}^{\otimes (d-1)} \otimes T(z_n)$, respectively. Using these matrices, the $(l,m)$-th entry of $M_N(\mathbf{u})$ is given by
$$ M_N(\mathbf{u})_{l,m} = \hat{u}(\tilde{k}(l,m)), \qquad (3) $$
where $\tilde{k}(l,m) = \big(Z^{(1)}_{l,m}, \cdots, Z^{(d)}_{l,m}\big)$. In this equation, we assign $\hat{u}(\tilde{k}(l,m)) = 0$ when $\max(|\tilde{k}(l,m)|) > n$. The fraction of nonzero entries of $M_N$ tends to $(3/4)^d$ as $n$ approaches infinity.

The weak solution to (2) in the span of $H_N$ is equivalent to the following equation,
$$ \mathbf{L}(\mathbf{u})\, \mathbf{v} = \mathbf{w}, \qquad (4) $$
where $\mathbf{L}(\mathbf{u}) := \beta^{-1/2}\big(M_N(\kappa(u)^{d/2}) - M_N(\kappa(u)^{-\nu})\, \mathbf{D}\big)$ is the square root of the precision operator corresponding to $\upsilon$ in matrix form, $\mathbf{D}$ is a diagonal matrix, and $\mathbf{v}$, $\mathbf{w}$ are complex vectors of appropriate dimensions. The diagonal entries of $\mathbf{D}$ are given by $\mathbf{D}_{i,i} = -\lambda_i$. Upon computing $M_N(\kappa(u)^\gamma)$, we approximate $u$ by its projection onto $H_N$ if $u \notin H_N$.

An important numerical issue to note is that we cannot use an approximation of $M_N(\kappa(u^N)^\gamma)$ obtained by spectral decomposition, that is, $M_N(\kappa(u^N)^\gamma) \approx \mathbf{U}^\top \kappa(\mathbf{D}_u)^\gamma\, \mathbf{U}$ for a diagonal matrix $\mathbf{D}_u$ whose diagonal entries are the eigenvalues of $M_N(\mathbf{u}^N)$ and an orthonormal matrix $\mathbf{U}$. The reason is that the resulting matrix will not be of the form (3). Instead, we obtain the matrix $M_N(\kappa(u^N)^\gamma)$ by applying the Fourier transform directly to $\kappa(u^N)^\gamma$ and making use of (3). Writing (4) as $\mathbf{v} = \mathbf{L}(\mathbf{u})^{-1}\mathbf{w}$, we obtain a composition of a Gaussian field from a unit Gaussian field as given in [13].

In what follows, for simplicity and with a slight abuse of notation for $\mathbf{L}(\mathbf{u})$, if $r \in H_N$ we also use $\mathbf{L}(r) := \mathbf{L}(\sum_{l=-N}^{N} \hat{r}(l)\phi_l) = \mathbf{L}(\mathbf{r})$. Using this notation, the $J$ Gaussian field hyperpriors with zero-mean assumption can be written in the following form:
$$ \mathbf{L}(\mathbf{u}_{j-1})\, \mathbf{u}_j = \mathbf{w}_j. \qquad (5) $$
The hyperprior layers are $u_0, \ldots, u_{J-1}$, where the random field $u_0$ is stationary. Within this multilayer hyperprior setting, the unknown field $\upsilon$ is equivalent to $u_J$. With the assumption that each random field involved is real-valued, the number of elements in $\mathbf{u}_i$ is only $N+1$, since the remaining elements can be obtained by complex conjugation.
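To make the construction concrete, the following sketch assembles $T(\cdot)$, $M_N(\cdot)$, and $\mathbf{L}(\mathbf{u})$ for $d = 1$, assuming $\kappa(u) = \exp(u)$ (that is, $g(x) = \exp(-x)$, as fixed later in Section 4); the helper names are ours:

```python
import numpy as np
from scipy.linalg import toeplitz

def T(r_hat):
    """Toeplitz matrix T(r) with entries T_{l,m} = r̂(l - m), built from the
    Fourier coefficients r̂(-m), ..., r̂(m) of a field r."""
    m = (len(r_hat) - 1) // 2
    col = r_hat[m:]      # r̂(0), r̂(1), ..., r̂(m)
    row = r_hat[m::-1]   # r̂(0), r̂(-1), ..., r̂(-m)
    return toeplitz(col, row)

def M_N(u_hat):
    """M_N(u) for d = 1: coefficients outside [-N, N] are set to zero, so the
    (4N+1)-vector ũ = (0_N, u, 0_N) feeds the Toeplitz construction."""
    N = (len(u_hat) - 1) // 2
    u_tilde = np.concatenate([np.zeros(N, complex), u_hat, np.zeros(N, complex)])
    return T(u_tilde)

def L_matrix(u_hat, nu, beta, lam):
    """Sketch of L(u) = β^{-1/2}(M_N(κ(u)^{d/2}) - M_N(κ(u)^{-ν}) D) for d = 1,
    with κ(u) = exp(u) and D = diag(-λ_i). The powers of κ(u) are evaluated on
    a grid and transformed back to Fourier coefficients, in line with (3)."""
    N = (len(u_hat) - 1) // 2
    x = np.arange(2 * N + 1) / (2 * N + 1)
    freq = np.arange(-N, N + 1)
    u = (np.exp(2j * np.pi * np.outer(x, freq)) @ u_hat).real  # u(x) on the grid
    hat = lambda f: np.fft.fftshift(np.fft.fft(f)) / len(f)    # grid -> coeffs
    kappa = np.exp(u)                                          # κ(u(x)) = 1/ℓ(x)
    D = np.diag(-np.asarray(lam))
    return (M_N(hat(kappa ** 0.5)) - M_N(hat(kappa ** (-nu))) @ D) / np.sqrt(beta)
```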
3. Convergence analysis
We will show that the solution of the original SPDE system can be approximated with Galerkin methods on the $d$-dimensional torus $T^d$. In our convergence analysis, we restrict to $d \le 3$, since we rely on continuity of the sample paths (see Lemma 3.3). We start by carefully presenting the notation of the Galerkin method for analysis purposes. We denote the constant $\kappa_0$ by $\exp(u_{-1})$ for notational ease, that is, $u_{-1} := \ln(\kappa_0)$. The function $\kappa$ is taken to be a smooth function which is bounded from below and above by exponential functions. That is, $c_0 \exp(-a_0|x|) \le \kappa(x), \kappa'(x) \le c_1 \exp(a_1 x)$ for some $c_0, c_1, a_0, a_1 > 0$ and all $x \in \mathbb{R}$.

Almost surely bounded functions $u_0, \ldots, u_J$ are a weak solution of the SPDE system
$$ -\Delta u_i + \kappa^2(u_{i-1})\, u_i = \beta_i^{1/2} \kappa^{\nu}(u_{i-1})\, w_i, \qquad i = 0, \ldots, J, \qquad (6) $$
if they satisfy
$$ -\langle u_i, \Delta \phi \rangle + \langle \kappa^2(u_{i-1}) u_i, \phi \rangle = \beta_i^{1/2} \sum_{p=-\infty}^{\infty} \hat{w}_i(p)\, \langle \kappa^{\nu}(u_{i-1}) \phi_p, \phi \rangle, \qquad i = 0, \ldots, J, $$
for all $\phi \in C^\infty(T^d)$. Assuming that the functions $u_i$ are almost surely bounded guarantees that the inner product of $u_i$ and $\Delta\phi$ is well defined, as compared to the inner product of $\nabla u_i$ and $\nabla \phi$, which may not be well defined. Here we expressed the independent white noises $w_i$ on $T^d$ with the help of an orthonormal basis $\{\phi_p\}$ of $L^2(T^d)$ as random series $w_i = \sum_{p=-\infty}^{\infty} \hat{w}_i(p) \phi_p$, where the random coefficients $\hat{w}_i(p) \sim N(0, 1)$ are independent.

In the first approximation step for the SPDE system (6), we approximate the white noise $w_i$ by its projection $w_i^N(x)$ onto the subspace $H_N$. That is, $w_i^N(x) = \sum_{p=-N}^{N} \hat{w}_i(p) \phi_p(x)$. Then the Galerkin approximations $u_i^N$ of $u_i$ satisfy the system
$$ \langle \nabla u_i^N, \nabla \phi \rangle + \langle \kappa^2(u_{i-1}^N)\, u_i^N, \phi \rangle = \beta_i^{1/2} \langle \kappa^{\nu}(u_{i-1}^N)\, w_i^N, \phi \rangle, \qquad i = 0, \ldots, J, \qquad (7) $$
for all $\phi \in H_N$, where $u_{-1}^N(x) := \ln(\kappa_0)$. Here we are allowed to use inner products of $\nabla u_i^N$ and $\nabla \phi$, since the approximated white noise $w_i^N$ belongs to $H_N$.

We denote by $L(u_{i-1})^{-1}$ the solution operator which maps $f$ from the negatively indexed Sobolev space $H^{-1}(T^d)$ to the weak solution of $-\Delta u + \kappa^2(u_{i-1}) u = \beta_i^{1/2} \kappa^{\nu}(u_{i-1}) f$. Similarly, we denote by $L_N(u_{i-1}^N)^{-1}$ the solution operator which maps $f \in H^{-1}(T^d)$ to the Galerkin approximation $u^N \in H_N$ of the equation $-\Delta u + \kappa^2(u_{i-1}^N) u = \beta_i^{1/2} \kappa^{\nu}(u_{i-1}^N) f$. The matrix form of $L_N(u_{i-1}^N)^{-1}$ is given by $\mathbf{L}(\mathbf{u}_{i-1})^{-1}$ from Equation (5). The solution operators $L(u_{i-1})^{-1}$ and $L_N(u_{i-1}^N)^{-1}$ satisfy the norm estimates below; for notational simplicity, we set $\beta_i = 1$ from now on.
Lemma 3.1. Let $u_{i-1}$ and $u_{i-1}^N$ be bounded functions and let $\kappa$ be a positive continuous function. The mappings $L(u_{i-1})^{-1} : L^2(T^d) \to H^2(T^d)$ and $L_N(u_{i-1}^N)^{-1} : L^2(T^d) \to H^2(T^d)$ satisfy the norm estimates
$$ \| L(u_{i-1})^{-1} \|_{L^2, H^2} \le \frac{C \|\kappa(u_{i-1})\|_\infty^{\nu} \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))^2} $$
and
$$ \| L_N(u_{i-1}^N)^{-1} \|_{L^2, H^2} \le \frac{C \|\kappa(u_{i-1}^N)\|_\infty^{\nu} \max(1, \|\kappa^2(u_{i-1}^N)\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}^N(x)))^2}, $$
respectively.

Proof. By the Lax–Milgram theorem [38], $\|(-\Delta + \kappa^2(u_{i-1}) I)^{-1}\|_{H^{-1}, H^1} \le C / \min(1, \inf \kappa^2(u_{i-1}))$, and the multiplication operator has norm $\|\kappa^\nu(u_{i-1})\|_{L^2, L^2} \le \|\kappa(u_{i-1})\|_\infty^\nu$. Hence, the solution operator has norm
$$ \| L(u_{i-1})^{-1} \|_{L^2, H^1} \le \frac{C \|\kappa(u_{i-1}(x))\|_\infty^{\nu}}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))}. $$
To upgrade this to $H^2$, we rewrite the PDE in the form
$$ (-\Delta + a)\, u_i^f = (a - \kappa^2(u_{i-1}))\, u_i^f + \kappa^\nu(u_{i-1})\, f, $$
where the right-hand side now belongs to $L^2(T^d)$. By inverting the operator $-\Delta + aI$, we obtain an equation for the solution $u_i^f$, which leads to the norm estimate
$$ \| u_i^f \|_{H^2} \le \frac{1}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))} \left( \frac{C \|\kappa^2(u_{i-1})\|_\infty \|\kappa(u_{i-1})\|_\infty^\nu}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))} + \|\kappa(u_{i-1})\|_\infty^\nu \right) \| f \|_{L^2} \le \frac{C \|\kappa(u_{i-1})\|_\infty^\nu \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))^2} \| f \|_{L^2} $$
after choosing $a = \inf_x \kappa^2(u_{i-1}(x))/2$. A similar procedure leads to the desired estimate for $L_N(u_{i-1}^N)^{-1}$.

We will later need the following technical lemma to establish convergence.
Lemma 3.2. Let $u_{i-1}$ and $u_{i-1}^N$ be bounded functions and let $\kappa$ be a continuously differentiable positive function. The mappings $L_N(u_{i-1}^N)^{-1}$ and $L(u_{i-1})^{-1}$ satisfy
$$ \| L_N(u_{i-1}^N)^{-1} - L(u_{i-1})^{-1} \|_{L^2, L^1} \le N^{-1/d}\, G_1(u_{i-1}) + G_2(u_{i-1}, u_{i-1}^N)\, \| u_{i-1}^N - u_{i-1} \|_{L^1}^{1/2}, $$
where
$$ G_1(u_{i-1}) = \frac{C \max(1, \|\kappa^2(u_{i-1})\|_\infty) \max(\|\kappa^\nu(u_{i-1})\|_\infty, \|\kappa^{-\nu}(u_{i-1})\|_\infty)}{\min(1, \inf \kappa^2(u_{i-1}(x)))^2}, $$
$$ G_2(u_{i-1}, u_{i-1}^N) = \frac{C \max(1, \|\kappa^\nu(u_{i-1})\|_\infty) \max\big(\|\kappa(u_{i-1}^N)\|_\infty, \|\kappa(u_{i-1})\|_\infty\big)^{\nu/2}}{\min(1, \inf_x \kappa^2(u_{i-1}^N(x))) \min(1, \inf_x \kappa^2(u_{i-1}(x)))} \times \max_{t \in B} (\kappa'(t))^{1/2}, $$
and the set $B$ is the interval $[\min(\inf_x u^N(x), \inf_x u(x)), \max(\|u^N\|_\infty, \|u\|_\infty)]$.

Proof. We partition $T := L_N(u_{i-1}^N)^{-1} - L(u_{i-1})^{-1}$ into two parts
$$ T = (L_N(u_{i-1}^N)^{-1} - L_N(u_{i-1})^{-1}) + (L_N(u_{i-1})^{-1} - L(u_{i-1})^{-1}) = T_1 + T_2. $$
By Cea's lemma [38], the term $T_2$ has an upper bound
$$ \| T_2 \|_{L^2, L^2} \le \frac{C \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))} \| (I - \widetilde{P}_N) L(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \|_{L^2, H^1}\, \|\kappa^\nu(u_{i-1})\|_\infty, $$
where $\widetilde{P}_N$ is the orthogonal projection onto $H_N$ in $H^1$ and $I$ is the identity operator. Let $f \in L^2$ and denote $g := L(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \widetilde{P}_N f$. Then, writing the case $d = 2$ for concreteness,
$$ \|(I - \widetilde{P}_N) g\|_{H^1}^2 = \sum_{|k_1|, |k_2| > n} (1 + k_1^2 + k_2^2)\, |\hat{g}_{k_1,k_2}|^2 = \sum_{|k_1|, |k_2| > n} (1 + k_1^2 + k_2^2)^{-1}\, |((-\Delta + 1) g)^{\wedge}|^2_{k_1,k_2} \le n^{-2} \|(-\Delta + 1) g\|_{L^2}^2 \le C N^{-2/d} \| L(u_{i-1})^{-1} \|_{L^2,H^2}^2 \|\kappa(u_{i-1}(x))^{-\nu}\|_\infty^2 \| f \|_{L^2}^2, $$
where $-\Delta + 1 : H^2 \to L^2$ is continuous, $g^{\wedge}$ denotes the Fourier transform, and $N = (2n+1)^d$. Hence,
$$ \| T_2 \|_{L^2, L^2} \le \frac{C N^{-1/d} \max(1, \|\kappa^2(u_{i-1})\|_\infty)}{\min(1, \inf \kappa^2(u_{i-1}(x)))^2} \max(\|\kappa^\nu(u_{i-1})\|_\infty, \|\kappa^{-\nu}(u_{i-1})\|_\infty). $$
In the term $T_1$, we need to tackle the difference of $\kappa$-terms. We aim to use $L^p$-estimates in order to later allow induction with respect to different layers. We partition $T_1$ into simpler terms
$$ \| T_1 \|_{L^2, L^1} \le \| L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) \big(\kappa^\nu(u_{i-1}^N) - \kappa^\nu(u_{i-1})\big) \|_{L^2, L^1} + \| \big( L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) - L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \big) \kappa^\nu(u_{i-1}) \|_{L^2, L^1} = \| T_3 \| + \| T_4 \|. $$
In the term $T_4$, we apply the resolvent identity
$$ L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) = L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) + L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \big(\kappa^2(u_{i-1}) - \kappa^2(u_{i-1}^N)\big) L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N), $$
which leads to
$$ \| T_4 \|_{L^2, L^1} \le \| L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \|_{L^{3/2}, L^1}\, \| \kappa^2(u_{i-1}) - \kappa^2(u_{i-1}^N) \|_{L^6, L^{3/2}} \times \| L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) \|_{H^{-1}, H^1}\, \| \kappa^\nu(u_{i-1}) \|_\infty. $$
The space $L^{3/2}$ embeds continuously into $H^{-1}$, and $\|A\|_{L^{3/2}, L^1} \le \|A\|_{H^{-1}, H^1}$ for any operator $A$. Hence, by the Lax–Milgram theorem,
$$ \| L_N(u_{i-1})^{-1} \kappa^{-\nu}(u_{i-1}) \|_{L^{3/2}, L^1} \le \frac{C \|\kappa^{-\nu}(u_{i-1})\|_\infty}{\min(1, \inf_x \kappa^2(u_{i-1}(x)))}. $$
Next, we regard $\kappa^2(u_{i-1}) - \kappa^2(u_{i-1}^N)$ as a multiplication operator from $L^6$ to $L^{3/2}$. That is, a function $g$ defines a multiplication operator, which takes a function $f$ to the product of the functions $g$ and $f$. By Hölder's inequality, any multiplication operator $g : L^6 \to L^{3/2}$ has norm $\|g\|_{L^2}$. A similar treatment for the term $T_3$ gives
$$ \| T_3 \|_{L^2, L^1} \le \| L_N(u_{i-1}^N)^{-1} \kappa^{-\nu}(u_{i-1}^N) \|_{L^{3/2}, L^1}\, \| \kappa^\nu(u_{i-1}^N) - \kappa^\nu(u_{i-1}) \|_{L^6, L^{3/2}}. $$
The difference of $\kappa$ terms in the estimates for $T_3$ and $T_4$ reduces to the difference of the functions $u_{i-1}$ through a series of elementary estimates:
$$ \| \kappa^{a}(u_{i-1}^N) - \kappa^{a}(u_{i-1}) \|_{L^2}^2 = \int_{T^d} | \kappa^{a}(u_{i-1}^N(x)) - \kappa^{a}(u_{i-1}(x)) |^2 \, \mathrm{d}x \le 2 \max\big(\|\kappa^{a}(u_{i-1}^N)\|_\infty, \|\kappa^{a}(u_{i-1})\|_\infty\big) \int \Bigg| \int_{u_{i-1}(x)}^{u_{i-1}^N(x)} \kappa'(t) \, \mathrm{d}t \Bigg| \, \mathrm{d}x \le C \max\big(\|\kappa^{a}(u_{i-1}^N)\|_\infty, \|\kappa^{a}(u_{i-1})\|_\infty\big) \max_{t \in B} (\kappa'(t))\, \| u_{i-1}^N - u_{i-1} \|_{L^1}, $$
where $a = \nu, 2$, and $B$ is the interval $[\min(\inf_x u^N(x), \inf_x u(x)), \max(\|u^N\|_\infty, \|u\|_\infty)]$.
Remark 1. When $\kappa$ is a continuously differentiable function with bounds
$$ c_0 \exp(-a_0 |t|) \le \kappa(t), \quad |\kappa'(t)| \le c_1 \exp(a_1 |t|), $$
where $c_0, c_1, a_0, a_1 > 0$, the functions $G_1$ and $G_2$ can be taken to be
$$ G_1(u_{i-1}) = C_1 \exp(C_2 \|u_{i-1}\|_\infty), \qquad G_2(u_{i-1}, u_{i-1}^N) = C_1 \exp(C_2 (\|u_{i-1}\|_\infty + \|u_{i-1}^N\|_\infty)). $$
The next lemma shows that the SPDE system (6) forces the layers to be Hölder-continuous.
Lemma 3.3.
Let $\kappa$ be a positive continuous function. Let $0 < \alpha < 1$ for $d = 1, 2$, and $0 < \alpha < 1/2$ for $d = 3$. The bounded weak solution $u_i$, $i = 0, \ldots, J$, of the SPDE system (6) is $\alpha$-Hölder continuous with probability 1, and has the form
$$ u_i(x) = \sum_{p=-\infty}^{\infty} \hat{w}_i(p)\, L(u_{i-1})^{-1} \phi_p(x) =: L(u_{i-1})^{-1} w_i(x), \qquad i = 0, \ldots, J. \qquad (8) $$

Proof. We will show that if $u_{i-1}$ is continuous, then $u_i$ is Hölder-continuous. This proves continuity inductively, since the zeroth layer $u_0$ has constant $u_{-1}$. It is enough to verify continuity after conditioning on $u_{i-1}$, since $P(u_i \in C^{0,\alpha}(T^d)) = E[P(u_i \in C^{0,\alpha}(T^d) \mid u_{i-1})] = 1$ with probability 1 if and only if $P(u_i \in C^{0,\alpha}(T^d) \mid u_{i-1}) = 1$. After conditioning, $u_i$ is a Gaussian field, to which we apply the Kolmogorov continuity criterion. To this end, we calculate
$$ E[|u_i(x) - u_i(x')|^{2b} \mid u_{i-1}] = C_b \sup_{\|f\|_{L^2} \le 1} |L(u_{i-1})^{-1} f(x) - L(u_{i-1})^{-1} f(x')|^{2b}, $$
where we used the Itô isometry and wrote the $L^2$ norm of $g$ as $\sup_{\|f\|_{L^2} \le 1} \langle f, g \rangle$. When $u_{i-1}$ is continuous, the function $L(u_{i-1})^{-1} f(x)$ belongs to $H^2$ by Lemma 3.1. By Sobolev's embedding theorem [36], $H^2$ embeds into $C^{0,\alpha}$ for $d \le 3$, which yields
$$ E[|u_i(x) - u_i(x')|^{2b} \mid u_{i-1}] \le C |x - x'|^{2b\alpha}, $$
which implies Hölder continuity with any index smaller than $(2b\alpha - d)/(2b) = \alpha - d/(2b)$, where $b$ can be arbitrarily large. To complete the proof of existence of the solution, we insert the solution candidate (8) into the SPDE system (6) and do direct calculations. Uniqueness of the solution follows from the Lax–Milgram theorem.

We will show that $u_i^N$ converges in probability to $u_i$. The proof uses uniform tightness of the distributions of $u_i^N - u_i$, which is a necessary condition for the convergence. We recall sufficient conditions for uniform tightness (see p. 61 in [39]).
Lemma 3.4. The random fields $U_N$ are uniformly tight on $C(T^d)$ if and only if there exists a function $K : C(T^d) \to [0, \infty)$ with the following properties:
(1) the set $\{g \in C(T^d) : K(g) \le C\}$ is compact for any $C > 0$;
(2) $K(U_N) < \infty$ almost surely for every $N$; and
(3) $\sup_N E[K(U_N)] < \infty$.

We will need several iterations of the logarithm, so we define the iterated composition by setting $F_1(x) = \ln(1 + x)$, $F_0(x) = x$, and $F_{n+1}(x) = F_1 \circ F_n(x)$.

Remark 2. The function $F_i$ is increasing and, moreover, subadditive on non-negative numbers. That is, $F_n(x + y) \le F_n(x) + F_n(y)$ for all $x, y \ge 0$, which follows by induction from the subadditivity $\ln(1 + x + y) \le \ln((1 + x)(1 + y)) = \ln(1 + x) + \ln(1 + y)$ of each $F_1$. A similar procedure shows that $F_n(xy) \le F_n(x) + F_n(y)$.
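As a quick numerical sanity check of Remark 2 (a small sketch of ours, not part of the proof):

```python
import numpy as np

def F(n, x):
    """Iterated logarithm: F_0(x) = x, F_{n+1}(x) = ln(1 + F_n(x))."""
    for _ in range(n):
        x = np.log1p(x)
    return x

# Spot-check the subadditivity F_n(x + y) <= F_n(x) + F_n(y) at random points.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, size=(2, 1000))
for n in range(1, 4):
    assert np.all(F(n, x + y) <= F(n, x) + F(n, y) + 1e-12)
```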
Lemma 3.5. Let $d = 1, 2$, or $3$. Let $\kappa$ be a continuous function with bounds $c_0 \exp(-a_0 |t|) \le \kappa(t) \le c_1 \exp(a_1 |t|)$, where $c_0, c_1, a_0, a_1 > 0$. Let $u_i^N$, $i = 0, \ldots, J$, solve the system (7) and let $u_i$, $i = 0, \ldots, J$, solve the system (6). Then the random fields $u_i^N$ are uniformly tight on $C(T^d)$, the random fields $u_i^N - u_i$ are uniformly tight on $C(T^d)$, and the vector-valued random fields $(u_i^N, u_i)$ are uniformly tight on $C(T^d; \mathbb{R}^2)$.

Proof. We equip the Hölder space $C^{0,\alpha}(T^d; \mathbb{R}^p)$ with its usual norm
$$ \| g \|_\alpha = \sup_{x \ne y} \frac{|g(x) - g(y)|}{|x - y|^\alpha} + \sup_x |g(x)|. $$
Here $\alpha$ is chosen as in Lemma 3.3. We use the Kolmogorov–Chentsov tightness criterion (see [40]) to show the uniform tightness of the zeroth-order layers, which are Gaussian. The moment bound $E[|u_0^N(x) - u_0^N(x')|^a] \le C |x - x'|^{d + b}$ follows from choosing large enough $a$ in
$$ E[|u_0^N(x) - u_0^N(x')|^a] \le C \sup_{\|f\|_{L^2} \le 1} \frac{|L_N(\ln(\kappa_0))^{-1} f(x) - L_N(\ln(\kappa_0))^{-1} f(x')|^a}{|x - x'|^{a\alpha}}\, |x - x'|^{a\alpha} \le C \| L_N(\ln(\kappa_0))^{-1} \|_{L^2, C^{0,\alpha}}^a\, |x - x'|^{a\alpha}, $$
where we applied the Itô isometry and the definition of the $L^2$ norm as a supremum. By Lemma 3.1 and Sobolev's embedding theorem, the operator norm of $L_N(\ln(\kappa_0))^{-1}$ is bounded for any $0 < \alpha < 1/2$. Since $\kappa_0$ is a constant, the bound is uniform. The cases of $u_0^N - u_0$ and $(u_0^N, u_0)$ follow similarly with the help of the triangle inequality.

For the other layers, we use Lemma 3.4, where we choose $K(g) = F_i(\|g\|_\alpha)$. For simplicity, we demonstrate Condition 1 only for $i = 3$, since the generalization is clear. The set
$$ \{ g : K(g) \le C \} = \{ g : \| g \|_\alpha \le \exp(\exp(\exp(C) - 1) - 1) - 1 \} $$
is clearly a closed set, which contains bounded equicontinuous functions. The set $A \subset C(T^d)$ is then compact by the Arzelà–Ascoli theorem (see [41]). Moreover, the fields $u_i^N - u_i$ are almost surely $\alpha$-Hölder continuous, by Lemma 3.3 and since the approximations belong to $H_N$. Hence, Condition 2 holds. To show Condition 3, we write $u_i^N$ as $L_N(u_{i-1}^N)^{-1} w_i^N$ and $u_i$ as $L(u_{i-1})^{-1} w_i$. By the triangle inequality and the subadditivity of $F_i$ (see Remark 2), we can check the boundedness for $u_i^N$ and $u_i$ separately. Since the procedure is the same for both terms, we only show here the case of $u_i^N$. By Jensen's inequality,
$$ E[\ln(1 + \|u_i^N\|_\alpha) \mid u_{i-1}^N] \le \ln \big( E[1 + \| L_N(u_{i-1}^N)^{-1} w_i^N \|_\alpha \mid u_{i-1}^N] \big). \qquad (9) $$
Conditioning on $u_{i-1}^N$ lets us compute expectations of Gaussian variables. Instead of attacking directly the expectation in Equation (9), we seek a zero-mean Gaussian random field $U$ with larger variance than the conditioned $u_i^N$. Then also certain other expectations of $u_i^N$ are bounded by expectations of $U$. Under conditioning, the variances of the random variables $\int \phi(x)\, L_N(u_{i-1}^N)^{-1} w_i^N(x) \, \mathrm{d}x$ are
$$ E[\langle w_i^N, (L_N(u_{i-1}^N)^{-1})^* \phi \rangle^2 \mid u_{i-1}^N] = \| P_N (L_N(u_{i-1}^N)^{-1})^* \phi \|_{L^2}^2 \le \| L_N(u_{i-1}^N)^{-1} \|_{L^2, H^2}^2\, \| \phi \|_{H^{-2}}^2, $$
where the last inequality follows from properties of the adjoint operator and Lemma 3.1. Set $U$ to be a zero-mean Gaussian random field on $T^d$ whose covariance is defined by the equations $E[\langle U, \phi \rangle^2] = \|\phi\|_{H^{-2}}^2$ for all smooth $\phi$. Then $U$ has sample paths in the Hölder space $C^{0,\alpha}(T^d)$ by Sobolev's embedding theorem and Kolmogorov's continuity theorem (see [42]). By Fernique's theorem (e.g. [43]), the expectation $E[\|U\|_\alpha]$ is finite. Since norms are absolutely continuous functions, the conditional expectation of $\|u_i^N\|_\alpha$ is also bounded by $\| L_N(u_{i-1}^N)^{-1} \|_{L^2, H^2}\, E[\|U\|_\alpha]$ (see Corollary 3.3.7 in [43]). Together with Lemma 3.1 this gives
$$ E[\ln(1 + \|u_i^N\|_\alpha) \mid u_{i-1}^N] \le \ln \left[ \frac{c\, \|\kappa(u_{i-1}^N(x))\|_\infty^\nu \max(1, \|\kappa^2(u_{i-1}^N(x))\|_\infty)}{\min(1, \inf \kappa^2(u_{i-1}^N(x)))^2}\, E[\|U\|_\alpha] \right] \le c_{\alpha,\nu} (1 + \| u_{i-1}^N \|_\infty), \qquad (10) $$
where we applied the bounds of $\kappa$ and the elementary inequalities $\max(a, b\exp(c)) \le \max(a, b) \exp(|c|)$, $\min(1, \inf \exp(g(x))) \ge \exp(-\|g\|_\infty)$, $(1 + ab) \le (1 + a)(1 + b)$ and $1 + a \le 2\max(1, a)$. Here we can choose the constants larger than 1. When $i = 1$, the expectations (10) are uniformly bounded, since the expectations of $\|u_0^N\|_\infty$ are bounded. Then Lemma 3.4 shows that the $u_1^N$ are uniformly tight. For the subsequent layers, we need to operate multiple times with the logarithm and Jensen's inequality through the inductive steps
$$ E[F_i(\|u_i^N\|_\alpha)] \le E[F_i(E[\|u_i^N\|_\alpha \mid u_{i-1}^N])] \le E[F_i(c(1 + \|u_{i-1}^N\|_\infty))] \le C + E[F_{i-1}(\|u_{i-1}^N\|_\alpha)], $$
with the help of the additivity properties of $F_i$ from Remark 2.
Theorem 3.6. Let $d = 1, 2$, or $3$. Let $\kappa$ be a continuously differentiable function with bounds $c_0 \exp(-a_0|t|) \le \kappa(t), |\kappa'(t)| \le c_1 \exp(a_1 |t|)$, where $c_0, c_1, a_0, a_1 > 0$. The solutions $(u_0^N, \ldots, u_J^N)$ of the Galerkin system (7) converge in probability to the weak solution $(u_0, \ldots, u_J)$ of the SPDE system (6) on $L^2(T^d; \mathbb{R}^{J+1})$ as $N \to \infty$.

Proof. It is enough to show componentwise convergence. Uniform tightness on $C(T^d)$ implies uniform tightness on $L^2(T^d)$, which in turn implies relative compactness in the weak topology of distributions. Hence, by Lemma 3.5, each subsequence of the distributions of $u_i^N - u_i$ has a weakly convergent subsequence, say $u_i^{N_k} - u_i$. Recall that convergence in probability is equivalent to the convergence of
$$ \tau_N := E[\min(1, \| u_i^N - u_i \|_{L^2})], $$
where $\min(1, \|\cdot\|_{L^2})$ is now a bounded continuous function. By weak convergence of distributions, the subsequence $\tau_{N_k}$ has some limit. It remains to verify that the limits of $u_i^{N_k} - u_i$ in distribution are zero. The characteristic functions of $u_i^N - u_i$ converge to 1 if
$$ \tau_N(\phi) := E[\min(1, |\langle u_i^N - u_i, \phi \rangle|)] $$
converges to zero for every $\phi \in L^2$. This formulation makes the calculation of expectations manageable. The essential difference to other approaches arises from the monotonicity of conditional expectations. Namely, since $\min(1, G) \le 1$ and $\min(1, G) \le G$,
$$ E[\min(1, G) \mid \Sigma] = \min(1, E[\min(1, G) \mid \Sigma]) \le \min(1, E[G \mid \Sigma]). \qquad (11) $$
Especially,
$$ \tau_N(\phi) \le E[\min(1, E[|\langle u_i^N - u_i, \phi \rangle| \mid w_0, \ldots, w_{i-1}])], \qquad (12) $$
which lets us handle $u_i$ and its approximation $u_i^N$ easily. Moreover, the cutoff function $\min(1, \cdot)$ is nondecreasing and subadditive. Under conditioning on $w_0, \ldots, w_{i-1}$, the random fields $u_i^N$ and $u_i$ in Equation (12) become Gaussian, which enables us to compute
$$ \tau_N(\phi) = E[\min(1, c \| (P_N (L_N(u_{i-1}^N)^{-1})^* - (L(u_{i-1})^{-1})^*) \phi \|_{L^2})] \le E\big[ \min\big(1, c_\phi \| L_N(u_{i-1}^N)^{-1} - L(u_{i-1})^{-1} \|_{L^2, L^1}\big) + \min\big(1, c \| (P_N - I)(L(u_{i-1})^{-1})^* \phi \|_{L^2}\big) \big] =: \tau_N^1(\phi) + \tau_N^2(\phi) $$
via the Itô isometry and the properties of adjoints. Since $(L(u_{i-1})^{-1})^* \phi \in L^2$, the term $\tau_N^2(\phi)$ converges to zero. We apply Lemma 3.2 to the difference of the operators $L_N(u_{i-1}^N)^{-1}$ and $L(u_{i-1})^{-1}$ in $\tau_N^1(\phi)$, which leads to a well-behaving estimate
$$ \tau_N^1(\phi) \le E\left[ \min\left(1, C_\phi N^{-1/d} \exp(C \|u_{i-1}\|_\infty) \right) \right] + E[\min(1, C_\phi \exp(C(\|u_{i-1}\|_\infty + \|u_{i-1}^N\|_\infty))\, \| u_{i-1}^N - u_{i-1} \|_{L^1}^{1/2})] =: \tau_N^3(\phi) + \tau_N^4(\phi). $$
The first term $\tau_N^3(\phi)$ converges to zero by Lebesgue's dominated convergence theorem. For the term $\tau_N^4(\phi)$, we use the uniform tightness of the vector-valued random fields $(u_{i-1}^N, u_{i-1})$ shown in Lemma 3.5. Let $K_{i-1} = K_{i-1}(\epsilon) \subset C(T^d; \mathbb{R}^2)$ be a compact set for which $P((u_{i-1}^N, u_{i-1}) \in K^C) < \epsilon$. Then
$$ \tau_N^4(\phi) = E[\min(1, C_\phi \exp(C(\|u_{i-1}\|_\infty + \|u_{i-1}^N\|_\infty))\, \| u_{i-1}^N - u_{i-1} \|_{L^1}^{1/2}) (1_K + 1_{K^C})] \le E[\min(1, C_\epsilon \| u_{i-1}^N - u_{i-1} \|_{L^1})]^{1/2} + E[1_{K^C}] $$
by Lemma 3.2, Remark 1 and Jensen's inequality. Thus $\tau_N^4(\phi)$ converges to zero if $u_{i-1}^N - u_{i-1}$ converges to zero in probability on $L^1$. When $i = 0$, the above procedure shows that $u_0^N$ converges to $u_0$ in probability, because the sublayers are then constants. By induction, $u_i^N$ converges in probability to $u_i$. Prohorov's theorem [40] hands us weak convergence on the space of continuous functions.
Corollary. The distributions of $u_i^N$ converge weakly to the distribution of $u_i$ on $C(T^d)$, and the joint distribution of $(u_0^N, \ldots, u_J^N)$ converges weakly to the joint distribution of $(u_0, \ldots, u_J)$ on $C(T^d, \mathbb{R}^{J+1})$.

Proof. The random fields $u_i^N$ are tight on $C(T^d)$ by Lemma 3.5. By Prohorov's theorem, the closure of their distributions forms a sequentially compact set, implying the existence of a weak limit. By Theorem 3.6, each weakly converging subsequence of the distributions has the same limit, that is, the distribution of $u_i$. Indeed, since convergence in probability on $L^2(T^d)$ implies weak convergence on $L^2(T^d)$, the characteristic functions $E[\exp(i \langle u_i^N, \phi \rangle)]$ converge to $E[\exp(i \langle u_i, \phi \rangle)]$ for all $\phi \in L^2(T^d)$. The set of bounded functions $u \mapsto \exp(i \langle u, \phi \rangle)$, where the $\phi$ are smooth on $T^d$, separates the functions in $C(T^d)$. Hence, the limits of these characteristic functions are enough to identify the weak limit on $C(T^d)$. The joint distribution is handled similarly.

We recall a posterior convergence result from [44] with the notation used in [14].
Theorem 3.7. Let the posterior distribution $\mu^y$ of an unknown $u$ given an observation $y$ have the Radon–Nikodym density
$$ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu}(u) \propto \exp(-\Phi(u, y)) $$
with respect to the prior distribution $\mu$ of $u$, where $\Phi(\cdot, y) \ge -C_y$. Let $u^N$ be an approximation of $u$ and let the posterior distribution of $u^N$ given an observation $y^N$ also have the Radon–Nikodym density
$$ \frac{\mathrm{d}\mu_N^{y_N}}{\mathrm{d}\mu_N}(u) \propto \exp(-\Phi(u, y^N)) $$
with respect to the prior distribution $\mu_N$ of $u^N$. If the prior distributions $\mu_N$ converge weakly to $\mu$, then the posterior distributions $\mu_N^{y_N}$ converge weakly to $\mu^y$.

Especially, the joint prior distributions of $(u_0^N, \ldots, u_J^N)$ converge weakly on $C(T^d, \mathbb{R}^{J+1})$ to the joint distribution of $(u_0, \ldots, u_J)$. Hence, the corresponding posteriors also converge weakly.
4. Bayesian inference algorithm
In this section, we develop a Bayesian inference procedure to sample the Fourier coefficients from a posterior distribution where the prior is given by multi-layered Gaussian fields. Let us work directly on the Fourier coefficients $\mathbf{u}_J$ of $u_J$, and denote by $\mu(\mathrm{d}\mathbf{u}_J) = P(\mathrm{d}\mathbf{u}_J)$ and $\mu^y(\mathrm{d}\mathbf{u}_J) = P(\mathrm{d}\mathbf{u}_J \mid \mathbf{y})$ the prior and the posterior distribution of $\mathbf{u}_J$, the unknown target field, respectively, when the measurement is given by
$$ \mathbf{y} = \mathbf{H} \mathbf{u}_J + \mathbf{e}. \qquad (13) $$
In this equation, $\mathbf{y}$ is a vector which contains all of the measurements. The elements of the matrix $\mathbf{H}$ correspond to the linear mappings $\{h_k\}$ for the respective Fourier components. With $m$ the number of measurements taken, the measurement noise $\mathbf{e}$ is an $m$-dimensional Gaussian random vector with zero mean and covariance $\mathbf{E}$. The posterior distribution $\mu^y$ is absolutely continuous with respect to the prior $\mu$, and the density of the posterior with respect to the prior is given by
$$ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu}(\mathbf{u}_J) = \frac{1}{Z} \exp(-\Phi(\mathbf{u}_J, \mathbf{y})), \qquad (14) $$
where $\Phi(\mathbf{u}_J, \mathbf{y}) := \frac{1}{2} \| \mathbf{E}^{-1/2} (\mathbf{y} - \mathbf{H}\mathbf{u}_J) \|^2$ is the potential function and the normalization constant is $Z := \int \exp(-\Phi(\mathbf{u}_J, \mathbf{y}))\, \mu(\mathrm{d}\mathbf{u}_J)$. The posterior probability $\mu^y(\mathrm{d}\mathbf{u}_J)$ can be written in the following form:
$$ \mu^y(\mathrm{d}\mathbf{u}_J) \propto \exp(-\Phi(\mathbf{u}_J, \mathbf{y}))\, \mu(\mathrm{d}\mathbf{u}_J). \qquad (15) $$
In the next section, we describe the MCMC algorithm to sample from the posterior distribution.

Consider the case of a hierarchical prior distribution $\mu(\mathrm{d}\mathbf{u}_J)$ with $J$ hyperprior layers, which we construct as
$$ \mathbf{L}_0 \mathbf{u}_0 = \mathbf{w}_0, \qquad (16a) $$
$$ \mathbf{L}(\mathbf{u}_{j-1})\, \mathbf{u}_j = \mathbf{w}_j, \qquad j = 1, \cdots, J, \qquad (16b) $$
where $\mathbf{u}_j$ contains the Fourier coefficients of the field $u_j$, for $j = 0, \ldots, J$. We fix $g(x) = \exp(-x)$. Since $u_{-1} = \ln(\kappa_0)$ is a constant, we can write $\mathbf{L}_0 = \mathbf{L}(\mathbf{u}_{-1})$, where the $k$-th element of $\mathbf{u}_{-1}$ is equal to $\ln(\kappa_0)\, \delta_{k,0}$, where $\delta_{k,0}$ is the Kronecker delta. By (16), we can define a linear transformation from $\mathbf{w}_j$ to $\mathbf{u}_j$ for $j > 0$:
$$ \mathbf{u}_j = \widetilde{U}(\mathbf{w}_j, \mathbf{u}_{j-1}) := \mathbf{L}(\mathbf{u}_{j-1})^{-1} \mathbf{w}_j. \qquad (17) $$
Using (17), we can define a transformation from $\mathbf{w}$ to $\mathbf{u}$ as follows:
$$ \mathbf{u} = U(\mathbf{w}) = \big(\widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1}),\ \widetilde{U}(\mathbf{w}_1, \cdot) \circ \widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1}),\ \ldots,\ \widetilde{U}(\mathbf{w}_J, \cdot) \circ \cdots \circ \widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1})\big). \qquad (18) $$
The dependence of $\mathbf{u}_j$ on $\mathbf{u}_{j-1}$, $j = 1, \ldots, J$, leads us to
$$ \mu(\mathrm{d}\mathbf{u}_J) = P(\mathrm{d}\mathbf{u}_0) \prod_{j=1}^{J} P(\mathrm{d}\mathbf{u}_j \mid \mathbf{u}_{j-1}). \qquad (19) $$
When evaluating the posterior function (15), it is necessary to compute the log determinant of $\mathbf{L}(\mathbf{u}_{j-1})$ in $P(\mathrm{d}\mathbf{u}_j \mid \mathbf{u}_{j-1})$ for each $1 < j \le J$. These calculations are expensive in general [45]. There is also a singularity issue if we sample directly from $P(\mathrm{d}\mathbf{u} \mid \mathbf{y})$ as $N$ approaches infinity [13]. We can avoid these issues by using the reparametrization (17), where instead of sampling the Fourier coefficients $\mathbf{u}$, we sample the Fourier coefficients of the noises $\mathbf{w}$; this is called the non-centered algorithm [22, 46]. That is, we can write
$$ \frac{\mathrm{d}\tilde{\mu}^y}{\mathrm{d}\tilde{\mu}}(\mathbf{w}) = \frac{1}{\tilde{Z}} \exp\big(-\tilde{\Phi}(\mathbf{w}, \mathbf{y})\big) := \frac{1}{\tilde{Z}} \exp(-\Phi(U(\mathbf{w}), \mathbf{y})). \qquad (20) $$
In this equation, $\tilde{\mu} := P(\mathrm{d}\mathbf{w})$ and $\tilde{\mu}^y := P(\mathrm{d}\mathbf{w} \mid \mathbf{y})$ are the prior and posterior of $\mathbf{w}$, respectively. The preconditioned Crank–Nicolson (pCN) algorithm [16] can be used to sample from $P(\mathrm{d}\mathbf{w} \mid \mathbf{y})$, and it is well defined even in the case where $N$ goes to infinity, using the fact that the prior for $\mathbf{w}$ is a standard Gaussian distribution. One implementation of the pCN algorithm is given by Algorithm 1.

Due to the linearity assumption on the forward model (13), we can leverage standard Gaussian regression in addition to the non-centered reparametrization. This procedure has been proposed in [13] for general deep Gaussian fields. Here, we adapt this algorithm to our Galerkin method. The resulting algorithm is of Metropolis-within-Gibbs type [47], where the Fourier coefficients $\{\mathbf{u}_j\}$, $j = 0, \ldots, J-1$, are sampled with pCN while $\mathbf{u}_J$ is sampled directly. The details are as follows. By marginalization of $\mathbf{u}_{J-1}$, we can write $P(\mathrm{d}\mathbf{u}_J \mid \mathbf{y}) = \int P(\mathrm{d}\mathbf{u}_J \mid \mathbf{u}_{J-1}, \mathbf{y})\, P(\mathrm{d}\mathbf{u}_{J-1} \mid \mathbf{y})$. Furthermore, by standard Gaussian regression, we can sample directly from $P(\mathrm{d}\mathbf{u}_J \mid \mathbf{u}_{J-1}, \mathbf{y})$ by using
$$ V(\mathbf{u}_{J-1}, \mathbf{y}) := \begin{pmatrix} \mathbf{E}^{-1/2} \mathbf{H} \\ \mathbf{L}(\mathbf{u}_{J-1}) \end{pmatrix}^{\dagger} \left( \begin{pmatrix} \mathbf{E}^{-1/2} \mathbf{y} \\ \mathbf{0} \end{pmatrix} + \tilde{\mathbf{v}} \right), \qquad (21) $$
where $\mathbf{u}_{J-1} = \widetilde{U}(\mathbf{w}_{J-1}, \cdot) \circ \cdots \circ \widetilde{U}(\mathbf{w}_0, \mathbf{u}_{-1})$ and $\tilde{\mathbf{v}} \sim N(\mathbf{0}, \mathbf{I})$. Writing $\mathbf{y} = \mathbf{H} \mathbf{L}(\mathbf{u}_{J-1})^{-1} \mathbf{w}_J + \mathbf{e}$, the conditional probability density of $\mathbf{y}$ given $\mathbf{u}_{J-1}$ is given by $p(\mathbf{y} \mid \mathbf{u}_{J-1}) = N(\mathbf{y} \mid \mathbf{0},\ \mathbf{H} \mathbf{L}(\mathbf{u}_{J-1})^{-1} \mathbf{L}(\mathbf{u}_{J-1})^{-\top} \mathbf{H}^{\top} + \mathbf{E})$. Let $\bar{\mathbf{w}} = (\mathbf{w}_0, \ldots, \mathbf{w}_{J-1})$. To obtain samples from $P(\mathrm{d}\mathbf{u}_{J-1} \mid \mathbf{y})$, we can use the reparametrization (17) and Algorithm 1 to sample from $P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y})$. The probability distribution $P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y})$ is given as follows:
$$ P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y}) \propto \exp\big(-\tilde{\Psi}(\bar{\mathbf{w}}, \mathbf{y})\big)\, P(\mathrm{d}\bar{\mathbf{w}}), \qquad (22a) $$
$$ \tilde{\Psi}(\bar{\mathbf{w}}, \mathbf{y}) := \Psi(U(\bar{\mathbf{w}}), \mathbf{y}) = \frac{1}{2} \| \mathbf{y} \|_{\mathbf{Q}}^2 + \frac{1}{2} \log\det(\mathbf{Q}), \qquad (22b) $$
$$ \mathbf{Q} = \mathbf{H} \mathbf{L}(\mathbf{u}_{J-1})^{-1} \mathbf{L}(\mathbf{u}_{J-1})^{-\top} \mathbf{H}^{\top} + \mathbf{E}. \qquad (22c) $$
To sample from $P(\mathrm{d}\bar{\mathbf{w}} \mid \mathbf{y})$ using Algorithm 1, we use $J - 1$ and $\tilde{\Psi}$ in place of $J$ and $\tilde{\Phi}$, respectively.
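A compact sketch of the resulting sampler is given below, assuming a routine `L_of(u_prev)` that assembles $\mathbf{L}(\mathbf{u}_{j-1})$ as in (5), a potential `log_phi` implementing $\tilde{\Phi}$ (or $\tilde{\Psi}$), and a real-valued parametrization of the coefficients; all names here are ours:

```python
import numpy as np

def transform_layers(w_list, L_of, u_minus1):
    """Non-centered map (17)-(18): white-noise coefficients (w_0, ..., w_J)
    to the layer coefficients (u_0, ..., u_J) via u_j = L(u_{j-1})^{-1} w_j."""
    u_prev, u_list = u_minus1, []
    for w in w_list:
        u_prev = np.linalg.solve(L_of(u_prev), w)
        u_list.append(u_prev)
    return u_list

def pcn_step(w, y, s, log_phi, rng):
    """One preconditioned Crank-Nicolson move (Algorithm 1) on the white-noise
    coefficients w, whose prior is standard Gaussian; s in (0, 1] is the step."""
    w_prop = np.sqrt(1.0 - s ** 2) * w + s * rng.standard_normal(w.shape)
    log_ratio = log_phi(w, y) - log_phi(w_prop, y)  # Φ̃(w, y) - Φ̃(w̃, y)
    if log_ratio > np.log(rng.uniform()):
        return w_prop, 1   # accept
    return w, 0            # reject
```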
5. Numerical results
In this section, we present examples of Bayesian inversion using the multi-layered Gaussian prior presented in the previous section. Our main focus in this section is to show the effectiveness of the proposed finite-dimensional approximation method for selected examples. Therefore, we will not discuss the properties of the MCMC algorithm used to generate the samples, as they are based on the MCMC algorithms described in [48, 13]. We aim at an acceptance ratio between 25 and 50 %, which is obtained by tuning the pCN step size $s$ in Algorithm 1. Our experience in the numerical implementations below indicates that the MCMC algorithm based on the pCN and the non-centered parametrization is quite robust. For the one- and two-hyperprior-layer implementations, the step size $s$ used in the first and second examples does not need to be extremely small.

Algorithm 1: preconditioned Crank–Nicolson algorithm.
Input: $J$, $\mathbf{w}$, $\mathbf{y}$, $\tilde{\Phi}(\mathbf{w}, \mathbf{y})$
Output: accepted, $\mathbf{w}$
  accepted = 0
  draw $\mathbf{w}' \sim N(\mathbf{0}, \mathbf{I})$
  $\tilde{\mathbf{w}} = \sqrt{1 - s^2}\, \mathbf{w} + s\, \mathbf{w}'$
  logRatio $= \tilde{\Phi}(\mathbf{w}, \mathbf{y}) - \tilde{\Phi}(\tilde{\mathbf{w}}, \mathbf{y})$
  draw $\omega \sim \mathrm{Uniform}[0, 1]$
  if logRatio $> \ln(\omega)$ then
    $\mathbf{w} = \tilde{\mathbf{w}}$
    accepted = 1
  end if

5.1. Signal denoising

In this section, we consider the application of the proposed technique to the non-parametric denoising of two piecewise smooth signals. The first test signal is a rectangular pulse which is zero except on a subinterval $[a, b] \subset (0, 1)$; the second combines a smooth bell-shaped bump with a rectangular pulse taking the values $1$ and $-1$ on consecutive subintervals:
$$ \upsilon_{\mathrm{rect}}(t) = \begin{cases} 1, & t \in [a, b], \\ 0, & \text{otherwise}, \end{cases} \qquad \upsilon_{\mathrm{bell,rect}}(t) = \begin{cases} \exp\big(-\tfrac{1}{(t - t_0)^2}\big), & t \in (0, t_1], \\ 1, & t \in [t_2, t_3], \\ -1, & t \in (t_3, t_4], \\ 0, & \text{otherwise}, \end{cases} \qquad (23) $$
where $0 < t_0 < t_1 < t_2 < t_3 < t_4 < 1$ denote the breakpoints of the signals. Previously, in [24], for $J = 1$ and with the unknown signal $\upsilon_{\mathrm{bell,rect}}(t)$, we demonstrated that upon increasing the number of Fourier basis functions, the $L^2$ error between the ground truth and the posterior sample mean decreased significantly. In this section, we compare the estimation results using one and two hyperprior layers, respectively. To allow high variation near points of discontinuity, the length-scale of $\upsilon$ is expected to be smaller around the discontinuities than in the rest of the domain. We take measurements on a uniform one-dimensional grid and set the standard deviation of the measurement noise to $0.1$. The proposed algorithm is tested with a number of Fourier basis functions of the form $N = 2^k - 1$.
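A sketch of the two test signals in (23) is given below; the numeric breakpoints are illustrative placeholders, not the exact values used in the experiments:

```python
import numpy as np

def upsilon_rect(t, a=0.4, b=0.6):
    """Rectangular test signal: one on [a, b], zero elsewhere.
    The breakpoints a, b are placeholder values for illustration."""
    return np.where((t >= a) & (t <= b), 1.0, 0.0)

def upsilon_bell_rect(t, t0=0.15, t1=0.3, t2=0.5, t3=0.65, t4=0.8):
    """Bell-plus-rectangle test signal from (23); breakpoints are placeholders."""
    out = np.zeros_like(t)
    bump = (t > 0) & (t <= t1)
    out[bump] = np.exp(-1.0 / (t[bump] - t0) ** 2)  # smooth bell-shaped bump
    out[(t >= t2) & (t <= t3)] = 1.0
    out[(t > t3) & (t <= t4)] = -1.0
    return out
```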
After estimation, we reconstruct the signal using the inverse Fourier transform on a finer, equally spaced grid between zero and one. We take ten million samples for each MCMC run. To have a fair comparison, we use the same measurement record for each run with different $J$.

Figures 2 and 3 show the reconstructed signals and their respective length-scale estimates. It can be clearly seen that the addition of another hyperprior layer improves the reconstruction of the unknown signals. For $J = 2$, although there are no sudden drops near the points of discontinuity, the length-scale value is sufficiently high in the smooth part of the $\upsilon$ signal, and substantially low near the points of discontinuity. This variation translates into better smoothness detection, as can be seen in Figures 2a, 3a, 2c, and 3c. In contrast, Figures 2b and 3b show that when $J = 1$, the posterior sample means considerably overfit the data in the smooth parts of the signals. Although the length-scale drops suddenly near the discontinuities of the ground truth $\upsilon$, the variation of $\ell$ is limited (see Figures 2d and 3d). This contributes to decreased smoothness in the smooth regions of the sample mean. The quantitative performance is given in Table 1.

Table 1. Quantitative performance comparison of the shallow Gaussian field prior inversion for the one-dimensional signals in Section 5.1.

         υ_rect               υ_bell,rect
J    L² error   PSNR      L² error   PSNR
1    1.044      24.325    1.527      21.603
2    0.922      25.605    1.475      21.801

5.2. X-ray tomography

In this section, we apply the methods that we have developed to the X-ray tomography reconstruction problem. For the tomography problem, the linear functional is given by a line integration known as the Radon transform [49, 28]. Assume that the field of interest has support in a circle with center at $(1/2, 1/2)$ and radius $1/2$. For a measurement at distance $r$ with detection angle $\theta$, we can write the Radon transform of the Fourier basis $\phi_k = \exp(i 2\pi \mathbf{k}^\top \mathbf{x})$ as
$$ \langle \phi_k, H_{r,\theta} \rangle = \int_\Omega \chi\big(x - \tfrac{1}{2}, y - \tfrac{1}{2}\big)\, \exp(i 2\pi \mathbf{k}^\top \mathbf{x})\, \delta\Big(r - \big((x - \tfrac{1}{2}) \cos\theta + (y - \tfrac{1}{2}) \sin\theta\big)\Big)\, \mathrm{d}\mathbf{x}, $$
where $\chi(x, y)$ is an indicator function with support in $\{x^2 + y^2 < 1/4\}$. Introducing a rotation matrix $R_\theta$ and $\mathbf{p} = [p\ q]^\top$, with $x' = x - \tfrac{1}{2}$, $y' = y - \tfrac{1}{2}$, and $\mathbf{p} = R_\theta \mathbf{x}'$, we can write
$$ \langle \phi_k, H_{r,\theta} \rangle = \exp(i\pi(k_x + k_y)) \int_\Omega \chi(x', y')\, \exp\big(i 2\pi (R_\theta \mathbf{k})^\top \mathbf{p}\big)\, \delta(r - p)\, \mathrm{d}\mathbf{p}. $$
Using the assumptions above and $\tilde{\mathbf{k}} = [\tilde{k}_x\ \tilde{k}_y]^\top = R_\theta \mathbf{k}$, we end up with
$$ \langle \phi_k, H_{r,\theta} \rangle = \exp(i\pi(k_x + k_y))\, \exp(i 2\pi \tilde{k}_x r)\, \frac{\sin\big(2\pi \tilde{k}_y \sqrt{1/4 - r^2}\big)}{\pi \tilde{k}_y}. $$
Notice that when $\tilde{k}_y = 0$ we replace the above equation with its limit, that is,
$$ \lim_{\tilde{k}_y \to 0} \langle \phi_k, H_{r,\theta} \rangle = 2\sqrt{1/4 - r^2}\, \exp(i\pi(k_x + k_y))\, \exp(i 2\pi \tilde{k}_x r). $$
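The closed form above is easy to evaluate numerically; a minimal sketch under the conventions of this section (function and variable names are ours):

```python
import numpy as np

def radon_fourier_coeff(k, r, theta):
    """Closed-form Radon transform <φ_k, H_{r,θ}> of the Fourier basis
    φ_k(x) = exp(i 2π kᵀx) over the disc of radius 1/2 centred at (1/2, 1/2)."""
    kx, ky = k
    kt_x = kx * np.cos(theta) + ky * np.sin(theta)    # k̃ = R_θ k
    kt_y = -kx * np.sin(theta) + ky * np.cos(theta)
    phase = np.exp(1j * np.pi * (kx + ky)) * np.exp(2j * np.pi * kt_x * r)
    half_chord = np.sqrt(max(0.25 - r ** 2, 0.0))     # √(1/4 - r²)
    if np.isclose(kt_y, 0.0):
        return 2.0 * half_chord * phase               # the k̃_y -> 0 limit
    return phase * np.sin(2.0 * np.pi * kt_y * half_chord) / (np.pi * kt_y)
```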
We modify the MCMC implementation used for the previous one-dimensional example to suit a GPU architecture. For our test comparison in X-ray tomography, the Shepp–Logan phantom at $511 \times 511$ resolution is used (see Figure 4a). We will use this phantom to evaluate the proposed method.

Figure 2. Simulation results of the example in Section 5.1. Figures 2a, 2c, 2b, and 2d describe the lowest random fields $\upsilon$ for $J = 2$ and $J = 1$, and their length-scales. In Figures 2a and 2b, the blue and black lines are the mean Fourier inverse of the samples within the 95 % confidence shades of the lowest layer, and the original unknown signals, respectively. The blue lines and the shades in the remaining figures are sample means and 95 % confidence intervals of the estimated length-scales.

Table 2. Quantitative performance comparison of the shallow Gaussian field Bayesian inversion for the tomography application in Section 5.2.

            Shallow-GP   FBP      Tikhonov
PSNR        22.246       23.399   21.673
L² error    44.795       47.156   42.145
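For reference, the error metrics reported in Tables 1 and 2 can be computed as follows; this is a sketch, and the peak-value convention in the PSNR is our assumption:

```python
import numpy as np

def l2_error(est, truth):
    """Discrete L2 error between an estimate and the ground truth."""
    return np.sqrt(np.sum((est - truth) ** 2))

def psnr(est, truth, peak=None):
    """Peak signal-to-noise ratio in dB; `peak` defaults to max of the truth."""
    peak = np.max(np.abs(truth)) if peak is None else peak
    mse = np.mean((est - truth) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```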
We take 45 sparsely sampled full projections out of 180. The measurement is corrupted by white Gaussian noise with standard deviation $0.2$. Due to the restriction in GPU memory, $J$ is set to one. To speed up the Fourier transform and inverse Fourier transform computations, we use the FFT and IFFT routines from CuPy [50]. For a performance comparison, we perform filtered back projection (FBP) on the sinogram using the iradon routine from scikit-image [51].

Figure 3. Similar to Figure 2, but with the unknown $\upsilon_{\mathrm{bell,rect}}$.
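A minimal sketch of the FBP baseline with scikit-image, mirroring the 45-angle sparse-projection setup (the phantom file name is hypothetical):

```python
import numpy as np
from skimage.transform import radon, iradon

rng = np.random.default_rng(0)
phantom = np.load("shepp_logan_511.npy")              # assumed 511x511 array
theta = np.linspace(0.0, 180.0, 45, endpoint=False)   # 45 angles out of 180
sinogram = radon(phantom, theta=theta)                # forward Radon transform
sinogram += 0.2 * rng.standard_normal(sinogram.shape) # noise with std 0.2
fbp = iradon(sinogram, theta=theta)                   # filtered back projection
```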
A Tikhonov regularization is also used to reconstruct the Fourier coefficients of the unknown field $u_J$ [52]. The Tikhonov regularization parameter $\lambda$ is selected based on the best $L^2$ error and PSNR performance. We set the Fourier basis number $n$ to 31. The total number of parameters for each layer is 1985, which makes the total number of parameters for all layers 3970. The total number of parameters in this example is greatly reduced compared to [53], where each pixel in the target image counts as a parameter; for our example, that translates to 261121 parameters.

Figure 4 shows that, compared to the FBP and Tikhonov regularization reconstructions, the posterior sample mean of our MCMC method results in an image with less streak artifact and noise. The features of the phantom appear much clearer than in the FBP and Tikhonov reconstructions. As examples, the mouth parts, the dark circles between the eyes and at the forehead, and the two eyes are all relatively much clearer than in the other two reconstructions. Nonetheless, since we set $n$ to only 31, the edge of the phantom face, which has very high values, is not fully recovered. This results in a slightly larger $L^2$ error (44.795 against 42.145 for the Tikhonov regularization method) but a higher PSNR (22.246 compared to the Tikhonov regularization method, 21.673). One might need to increase the number of Fourier basis functions $n$ in our proposed Bayesian method to get a much better reconstruction. However, it is not possible to accomplish this within our current setup due to a restriction on the GPU memory. We can fairly conclude that the use of a shallow Gaussian field prior with $n = 31$ results in a highly reduced amount of artifact at the expense of light blur at the edges. As in the one-dimensional example, adding another Gaussian field layer might help to increase the sharpness of the edges in the X-ray tomography application.

Figure 4. Simulation result of a shallow SPDE with $J = 1$ and $n = 31$, where the number of projections is 45. Figures 4b and 4d show the posterior mean of the field $\upsilon$ and the Tikhonov regularization result, while Figure 4c shows the FBP reconstruction.
6. Conclusion
We have presented a multi-layered Gaussian-field Bayesian inversion method based on a Galerkin discretization. We have shown that our approach enjoys a convergence-in-probability property to the weak solution of the forward model, and that this implies weak convergence of the joint posterior distribution of the Gaussian fields. This gives an assurance that our proposed method is well defined and robust upon increasing the number of Fourier basis functions. Using the non-centered version of the preconditioned Crank–Nicolson algorithm, we have shown that, for a one-dimensional denoising problem, using two hyperprior layers achieves smoothing and edge detection of the unknown at the same time. For the X-ray tomography problem, with a single hyperprior layer and a very small number of Fourier basis functions ($n = 31$), the posterior sample mean of our proposed approach gives an image with less streak artifact and noise compared to the FBP and Tikhonov regularization reconstructions. Although traces of streak artifact and edge blurring are still present, the $L^2$ error and PSNR of our proposed method sit in the middle of those from the FBP and Tikhonov regularization. Furthermore, adding another Gaussian field layer might help to increase the sharpness of the edges in the X-ray tomography application. One future outlook is to apply the method to real data.

Acknowledgments
The authors would like to thank the Academy of Finland for financial support.
References

[1] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems. Springer, 2004.
[2] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press Ltd, 2005.
[3] M. J. Heaton, A. Datta, A. O. Finley, R. Furrer, J. Guinness, R. Guhaniyogi, F. Gerber, R. B. Gramacy, D. Hammerling, M. Katzfuss, F. Lindgren, D. W. Nychka, F. Sun, and A. Zammit-Mangion, “A case study competition among methods for analyzing large spatial data,” J. Agr. Biol. Envir. St., vol. 24, pp. 398–425, Dec 2018.
[4] C. J. Paciorek and M. J. Schervish, “Nonstationary covariance functions for Gaussian process regression,” in Adv. Neur. Inf. Proc. Sys., pp. 273–280, 2004.
[5] G.-A. Fuglstad, D. Simpson, F. Lindgren, and H. Rue, “Does non-stationary spatial data always require non-stationary random fields?,” Spat. Stat., vol. 14, pp. 505–531, Nov 2015.
[6] C. J. Paciorek, Non-Stationary Gaussian Processes for Regression and Spatial Modelling. PhD thesis, 2003.
[7] E. Snelson, Z. Ghahramani, and C. E. Rasmussen, “Warped Gaussian processes,” in Adv. Neur. Inf. Proc. Sys., pp. 337–344, 2004.
[8] F. Lindgren, H. Rue, and J. Lindström, “An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach,” J. Royal Stat. Soc. B, vol. 73, no. 4, pp. 423–498, 2011.
[9] L. Roininen, M. Girolami, S. Lasanen, and M. Markkanen, “Hyperpriors for Matérn fields with applications in Bayesian inversion,” Inverse Probl. & Imaging, vol. 13, no. 1, pp. 1–29, 2019.
[10] K. Monterrubio-Gómez, L. Roininen, S. Wade, T. Damoulas, and M. Girolami, “Posterior inference for sparse hierarchical non-stationary models,” Comp. Stat. & Data Anal., p. 106954, Mar 2020.
[11] A. C. Damianou and N. D. Lawrence, “Deep Gaussian processes,” Artif. Intell. & Stat., 2013.
[12] D. Duvenaud, O. Rippel, R. P. Adams, and Z. Ghahramani, “Avoiding pathologies in very deep networks,” Artif. Intell. & Stat., 2014.
[13] M. M. Dunlop, M. A. Girolami, A. M. Stuart, and A. L. Teckentrup, “How deep are deep Gaussian processes?,” J. Mach. Learn. Res., vol. 19, no. 54, pp. 1–46, 2018.
[14] A. M. Stuart, “Inverse problems: A Bayesian perspective,” Acta Numerica, vol. 19, pp. 451–559, 2010.
[15] M. Dashti and A. M. Stuart, “The Bayesian approach to inverse problems,” in Handbook of Uncertainty Quantification, pp. 311–428, Springer International Publishing, 2017.
[16] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, “MCMC methods for functions: Modifying old algorithms to make them faster,” Statist. Sci., vol. 28, no. 3, pp. 424–446, 2013.
[17] K. J. H. Law, “Proposals which speed up function-space MCMC,” J. Comp. & Appl. Math., vol. 262, pp. 127–138, 2014.
[18] A. Beskos, M. Girolami, S. Lan, P. E. Farrell, and A. M. Stuart, “Geometric MCMC for infinite-dimensional inverse problems,” J. Comp. Phys., vol. 335, pp. 327–351, 2017.
[19] D. Rudolf and B. Sprungk, “On a generalization of the preconditioned Crank–Nicolson Metropolis algorithm,” Found. Comp. Math., vol. 18, pp. 309–343, 2018.
[20] M. Hairer, A. M. Stuart, and S. J. Vollmer, “Spectral gaps for a Metropolis–Hastings algorithm in infinite dimensions,” Ann. Appl. Probab., vol. 24, pp. 2455–2490, Dec 2014.
[21] V. Chen, M. M. Dunlop, O. Papaspiliopoulos, and A. M. Stuart, “Dimension-robust MCMC in Bayesian inverse problems.”
[22] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld, “A general framework for the parametrization of hierarchical models,” Stat. Sci., vol. 22, pp. 59–73, Feb 2007.
[23] Z. Hu, Z. Yao, and J. Li, “On an adaptive preconditioned Crank–Nicolson MCMC algorithm for infinite dimensional Bayesian inference,” J. Comp. Phys., vol. 332, pp. 492–503, Mar 2017.
[24] M. Emzir, S. Lasanen, Z. Purisha, and S. Särkkä, “Hilbert-space reduced-rank methods for deep Gaussian processes,” in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2019.
[25] A. Solin and S. Särkkä, “Hilbert space methods for reduced-rank Gaussian process regression,” Stat. Comput., vol. 30, pp. 419–446, Aug 2019.
[26] M. Lassas and S. Siltanen, “Can one use total variation prior for edge-preserving Bayesian inversion?,” Inverse Problems, vol. 20, pp. 1537–1563, Aug 2004.
[27] J. Kaipio and E. Somersalo, “Statistical inverse problems: Discretization, model reduction and inverse crimes,” J. Comp. & Appl. Math., vol. 198, pp. 493–504, Jan 2007.
[28] F. Natterer, The Mathematics of Computerized Tomography (Classics in Applied Mathematics). SIAM: Society for Industrial and Applied Mathematics, 2001.
[29] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics, 2005.
[30] D. Li, J. Svensson, H. Thomsen, F. Medina, A. Werner, and R. Wolf, “Bayesian soft X-ray tomography using non-stationary Gaussian processes,” Rev. Sci. Inst., vol. 84, p. 083506, Aug 2013.
[31] C. Plagemann, K. Kersting, and W. Burgard, “Nonstationary Gaussian process regression using point estimates of local smoothness,” in Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15–19, 2008, Proceedings, Part II, 2008.
[32] Z. Purisha, C. Jidling, N. Wahlström, T. B. Schön, and S. Särkkä, “Probabilistic approach to limited-data computed tomography reconstruction,” Inverse Probl., vol. 35, p. 105004, Sep 2019.
[33] T. Frese, C. Bouman, and K. Sauer, “Adaptive wavelet graph model for Bayesian tomographic reconstruction,” IEEE Trans. Imag. Proc., vol. 11, pp. 756–770, Jul 2002.
[34] Y. Chen, Y. Li, W. Yu, L. Luo, W. Chen, and C. Toumoulin, “Joint-MAP tomographic reconstruction with patch similarity based mixture prior model,” Multiscale Modeling & Simulation, vol. 9, pp. 1399–1419, Oct 2011.
[35] V. Antun, F. Renna, C. Poon, B. Adcock, and A. C. Hansen, “On instabilities of deep learning in image reconstruction and the potential costs of AI,” PNAS, p. 201907377, May 2020.
[36] L. C. Evans, An Introduction to Stochastic Differential Equations. American Mathematical Society, 2014.
[37] S. Lasanen, L. Roininen, and J. M. Huttunen, “Elliptic boundary value problems with Gaussian white noise loads,” Stoch. Proc. & Appl., vol. 128, no. 11, pp. 3607–3627, 2018.
[38] S. C. Brenner and L. R. Scott, The Mathematical Theory of Finite Element Methods, vol. 15 of Texts in Applied Mathematics. Springer, New York, third ed., 2008.
[39] V. I. Bogachev, Weak Convergence of Measures, vol. 234 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2018.
[40] O. Kallenberg, Foundations of Modern Probability. Probability and its Applications (New York), Springer-Verlag, New York, second ed., 2002.
[41] W. Rudin, Principles of Mathematical Analysis, vol. 39 of International Series in Pure and Applied Mathematics. McGraw-Hill, 1976.
[42] M. B. Marcus and J. Rosen, Markov Processes, Gaussian Processes, and Local Times, vol. 100 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2006.
[43] V. I. Bogachev, Gaussian Measures, vol. 62 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 1998.
[44] S. Lasanen, “Non-Gaussian statistical inverse problems. Part II: Posterior convergence for approximated unknowns,” Inverse Probl. Imaging, vol. 6, no. 2, pp. 267–287, 2012.
[45] K. Dong, D. Eriksson, H. Nickisch, D. Bindel, and A. G. Wilson, “Scalable log determinants for Gaussian process kernel learning,” in Advances in Neural Information Processing Systems, pp. 6327–6337, 2017.
[46] Y. Yu and X.-L. Meng, “To center or not to center: That is not the question. An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency,” J. Comp. & Graph. Stat., vol. 20, pp. 531–570, Jan 2011.
[47] D. Gamerman and H. F. Lopes, Markov Chain Monte Carlo. Taylor & Francis Inc, 2006.
[48] T. Cui, K. J. Law, and Y. M. Marzouk, “Dimension-independent likelihood-informed MCMC,” J. Comp. Phys., vol. 304, pp. 109–137, Jan 2016.
[49] S. R. Deans, The Radon Transform and Some of Its Applications. Wiley, 1983.
[50] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis, “CuPy: A NumPy-compatible library for NVIDIA GPU calculations,” in Proceedings of the 25th ACM International Conference on Multimedia (MM '17), pp. 1217–1220, 2017.
[51] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, and T. Yu, “scikit-image: image processing in Python,” PeerJ, vol. 2, p. e453, Jun 2014.
[52] J. Mueller and S. Siltanen, Linear and Nonlinear Inverse Problems with Practical Applications. Philadelphia: Society for Industrial and Applied Mathematics, 2012.
[53] J. Suuronen, M. Emzir, S. Lasanen, S. Särkkä, and L. Roininen, “Enhancing industrial X-ray tomography by data-centric statistical methods,” arXiv preprint arXiv:2003.03814.