Analysis of the Gibbs sampler for hierarchical inverse problems
Sergios Agapiou, Johnathan M. Bardsley, Omiros Papaspiliopoulos, Andrew M. Stuart
Sergios Agapiou
Mathematics Institute, University of Warwick, Coventry CV4 7AL, United Kingdom ([email protected])
Johnathan M. Bardsley
Department of Mathematical Sciences, University of Montana, Missoula, MT 59812-0864, USA ([email protected])
Omiros Papaspiliopoulos
Department of Economics, Universitat Pompeu Fabra, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain ([email protected])
Andrew M. Stuart
Mathematics Institute, University of Warwick, Coventry CV4 7AL, United Kingdom ([email protected])
Abstract
Many inverse problems arising in applications come from continuum models where the unknown parameter is a field. In practice the unknown field is discretized, resulting in a problem in R^N, with an understanding that refining the discretization, that is increasing N, will often be desirable. In the context of Bayesian inversion this situation suggests the importance of two issues: (i) defining hyper-parameters in such a way that they are interpretable in the continuum limit N → ∞ and so that their values may be compared between different discretization levels; (ii) understanding the efficiency of algorithms for probing the posterior distribution as a function of large N. Here we address these two issues in the context of linear inverse problems subject to additive Gaussian noise, within a hierarchical modelling framework based on a Gaussian prior for the unknown field and an inverse-gamma prior for a hyper-parameter, namely the amplitude of the prior variance. The structure of the model is such that the Gibbs sampler can be easily implemented for probing the posterior distribution. Subscribing to the dogma that one should think infinite-dimensionally before implementing in finite dimensions, we present function space intuition and provide rigorous theory showing that, as N increases, the component of the Gibbs sampler for sampling the amplitude of the prior variance becomes increasingly slower. We discuss a reparametrization of the prior variance that is robust with respect to the increase in dimension; we give numerical experiments which exhibit that our reparametrization prevents the slowing down. Our intuition on the behaviour of the prior hyper-parameter, with and without reparametrization, is sufficiently general to include a broad class of nonlinear inverse problems as well as other families of hyper-priors.

Key words. Gaussian process priors, Markov chain Monte Carlo, inverse covariance operators, hierarchical models, diffusion limit.
We consider the possibly nonlinear inverse problem of recovering an unknown parameter u ∈ X from a noisy indirect observation y ∈ Y. We work in a framework where X is an infinite-dimensional separable Hilbert space with inner product ⟨·,·⟩ and norm ‖·‖, and Y is also a separable Hilbert space. We will be especially interested in the cases Y = X and Y = R^M. The unknown parameter and the observation are related through an additive noise model

y = G(u) + η,  (1)

where G : X → Y is the forward map, which is assumed to be continuous, and η is Gaussian noise,

η ∼ N(0, λ^{-1} C_1).  (2)

The linear operator C_1 : Y → Y is bounded and positive definite, and λ > 0 is the noise precision. We do not assume that C_1 is trace-class, thereby allowing the case of Gaussian white noise where it is the identity.

We adopt a Bayesian approach with a Gaussian prior on the unknown parameter u,

u | δ ∼ N(0, δ^{-1} C_0),  (3)

where C_0 : X → X is a positive definite and trace-class operator and δ > 0; u is assumed to be independent of the noise η. The trace-class assumption on C_0 ensures that draws from the prior on u | δ are in X. For a fixed u the likelihood is Gaussian, y | u, δ ∼ N(G(u), λ^{-1} C_1). We work under certain regularity conditions on the forward map G, which imply that the inverse problem is sufficiently ill-posed; in particular, for the noise model at hand, these conditions imply that the unknown u is not perfectly identifiable from a single realization of the data. Under the additional assumption that the prior on u | δ is such that the regularity conditions on G are satisfied in its support, it can be shown that, almost surely with respect to the data, the posterior on u | y, δ is well defined, non-degenerate and absolutely continuous with respect to the prior on u | δ, [39].

In what follows, we consider the hyper-parameter δ as a part of the inference problem, that is, we endow it with a prior P(δ); this leads to a hierarchical Bayesian model.
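To fix ideas, the observation model (1)-(3) is easy to simulate in a discretized setting. The sketch below works componentwise in a basis that simultaneously diagonalizes the prior covariance and a diagonal forward map; all eigenvalue choices (c_j = j^{-2} for C_0, k_j = j^{-1} for K, white noise C_1 = I) are illustrative assumptions for the demo, not the paper's examples.

```python
import math, random

random.seed(0)
N = 200                      # discretization level (spectral truncation)
lam, delta = 100.0, 4.0      # noise precision and prior precision amplitude
c0 = [j**-2.0 for j in range(1, N + 1)]   # eigenvalues of C_0 (assumed)
k  = [j**-1.0 for j in range(1, N + 1)]   # eigenvalues of a diagonal K (assumed)

# draw u | delta ~ N(0, delta^{-1} C_0), componentwise in the eigenbasis
u = [math.sqrt(cj/delta)*random.gauss(0, 1) for cj in c0]
# observation y = K u + eta,  eta ~ N(0, lambda^{-1} I): white noise, C_1 = I
y = [kj*uj + random.gauss(0, 1)/math.sqrt(lam) for kj, uj in zip(k, u)]
```

Increasing N refines the discretization; the decay of c_j is what makes the truncated prior draws converge in X.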
The potential for the use of hierarchical priors in inverse problems has been highlighted in [23], where the authors express the conviction that if a parameter is not known, it is a part of the inference problem; see also [11, 10], where conditionally Gaussian hierarchical models have been considered in finite-dimensional contexts. Returning to our setting, we note that of course in practice other aspects of the model, such as parameters that control the regularity of the draws from the prior, will also be part of the inference problem. Section 6 discusses how the results of this paper can be extended to such situations, but the focus here is the joint hierarchical inference on u and δ. Statistical inference is achieved by Markov chain Monte Carlo sampling from the resulting full posterior on u, δ | y, where by Bayes' rule

P(u, δ | y) ∝ P(y | u, δ) P(u | δ) P(δ) ∝ P(u | y, δ) P(δ | y).

A sufficient condition for this posterior to be well defined is that the prior P(δ) is proper.

Due to the nature of the pair (u, δ) ∈ X × [0, ∞), sampling can be achieved by a two-component Metropolis-within-Gibbs (MwG) algorithm. There is a range of possible parametrizations for this pair; the parametrization in terms of (u, δ) itself leads to what is called the centered algorithm (CA), [34]. This scheme alternates between simulating from u | y, δ and δ | y, u using Metropolis-Hastings steps. Each pair of such simulations is one algorithmic iteration, of a prescribed number k_max. For specific models the simulation from the two conditionals can be done directly, without Metropolis-Hastings, in which case the resultant algorithm is the Gibbs sampler. Note that the model structure implies that δ and y are conditionally independent given u, that is, δ | y, u ≡ δ | u. This is the defining property of the so-called centered parametrization of a hierarchical model, [34].

In practice the inverse problem and the algorithm are discretized and Bayesian inference is implemented in finite dimensions.
We then have two sources of error in the estimated posterior distribution: a) the approximation error due to the discretization of the unknown and the forward problem, that is, the discretization bias, discussed in a general Bayesian (non-hierarchical) inverse problem setting in [13]; b) the Monte Carlo error due to the use of a Markov chain Monte Carlo method to sample the discretized posterior distribution. Assuming that the discretization level of the unknown is N, the total error is of the order

1/N^s + C(N)/√k_max,  (4)

for some s > 0, where C(N) depends on the mixing properties of the particular algorithm used to probe the posterior. This picture allows the practitioner to get a rough idea of how to distribute the computational budget, by balancing investments in higher discretization levels with investments in longer chains, in order to achieve the desired error level in the estimated posterior distribution. In reality, of course, the constants that multiply these rates will be relevant and hard to determine.

There are four principal motivations for formulating the inverse problem and the simulation algorithms in infinite dimensions, while using consistent discretizations (in the sense of numerical analysis, see subsection 1.2) for the numerical implementation. First, such a formulation is often more faithful to the mathematical model that we wish to learn from the data. Second, it makes the inference comparable across different levels of discretization, so that the estimation of the model with increasing values of N corresponds to a reduction in the discretization bias at the cost of additional computation. Third, the prior distribution on hyper-parameters, such as δ, represents the same prior beliefs across different levels of discretization.
On the contrary, when the finite-dimensional model is not a consistent discretization of an infinite-dimensional one, the prior on the hyper-parameters might contain an amount of information that depends on the level of discretization chosen; see for example the last paragraph of subsection 1.2.2 below. Finally, practically useful algorithms can be designed for moderate or even small values of N by studying their behaviour in the asymptotic limit N → ∞. In fact, it is usually unrealistic to try to obtain practically useful theoretical results on the convergence of Markov chain Monte Carlo for sampling non-trivial targets, unless such asymptotic regimes are constructed and invoked. This is precisely the case with the Gibbs sampler and related MwG algorithms, which are particularly hard to analyse (see for example [32]). Similarly, conceiving of Metropolis-Hastings methods in the infinite-dimensional limit leads to algorithms with provably dimension-independent convergence properties, whilst standard methods have convergence properties which degenerate with increased refinement of the discretization; see [14] and the discussion therein.

In this paper we investigate theoretically and numerically the performance of MwG algorithms in the asymptotic regime of large N. In order to have a mathematically tractable analysis, we focus on linear inverse problems, see subsection 1.1. For these models, and under a commonly adopted prior on δ, the MwG becomes a Gibbs sampler. We establish a result on the mean drift and diffusion of the δ-chain in CA, which has the informal interpretation that C(N) is of the order N^{1/2}.
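Taking this N^{1/2} growth of C(N) at face value, equating the two terms in the total error (4) shows how many iterations each algorithm needs to reach the bias level; the sketch below is elementary arithmetic implied by (4), not code from the paper.

```python
# Total error model (4): err(N, kmax) = N**-s + C(N)/sqrt(kmax).
# To drive the Monte Carlo term down to the bias level N**-s,
# solve C(N)/sqrt(kmax) = N**-s for kmax.
def kmax_needed(N, s, C):
    return (C(N)*N**s)**2

s = 1.0
for N in (50, 100, 200):
    k_ca  = kmax_needed(N, s, lambda n: n**0.5)  # centered: C(N) ~ N^{1/2}
    k_rob = kmax_needed(N, s, lambda n: 1.0)     # robust:   C(N) bounded
    # k_ca = N**(2*s + 1) while k_rob = N**(2*s): a factor N more work
```

Equating C(N)/√k_max with N^{-s} gives k_max = (C(N) N^s)², that is, N^{2s+1} iterations for an algorithm with C(N) ≍ N^{1/2} versus N^{2s} when C(N) is bounded.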
An immediate consequence of this result is that, in order to minimize the total error in (4), k_max should be scaled like N^{2s+1}, whilst for algorithms for which C(N) is uniformly bounded with respect to N, the same error level can be achieved by scaling k_max like N^{2s}; we expect this to be the case for the non-centered algorithm introduced below. Although our theory concerns the linear conjugate setting and a specific prior on δ, a detailed understanding of the ideas underlying our proofs indicates that most of the details of the model, including linearity and the prior used on δ, do not really affect the validity of our main finding, that is, that CA deteriorates with N. The fundamental reason why this algorithm becomes unusable for large N is an absolute continuity property, a high-level description of which we now provide. Note, however, that proving the result in such generality is definitely beyond the scope of this paper.

In the infinite-dimensional limit, δ is an almost sure property of u | δ ∼ N(0, δ^{-1} C_0). This means that a single draw of u contains infinite information about the value of δ that generated it. In measure-theoretic terms, it means that the prior measures P(u | δ) and P(u | δ′) for δ ≠ δ′ are mutually singular, [16, Remark 2.10]. Recalling that we work under assumptions which imply that u | y, δ is absolutely continuous with respect to u | δ, we deduce that δ is also an almost sure property of u | y, δ. As a result, iterative simulation from the distributions u | y, δ and δ | y, u will fail in ever changing the initial value of δ. On the other hand, recall that we also work under assumptions that imply that u, hence δ, are not perfectly identifiable from the data. Therefore, δ | y is non-degenerate (provided the prior is non-degenerate) and hence any single value of δ has zero probability under the data.
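The claim that δ is an almost sure property of a single draw u | δ can be checked numerically: in an eigenbasis of C_0, the statistic N/‖C_0^{-1/2}u‖² recovers δ exactly as N → ∞. The following sketch uses the illustrative eigenvalues c_j = j^{-2} (an assumption made for the demo).

```python
import math, random

random.seed(1)

def delta_from_draw(N, delta):
    # one draw u | delta ~ N(0, delta^{-1} C_0), with C_0 eigenvalues c_j = j^{-2}
    s = 0.0
    for j in range(1, N + 1):
        c_j = j**-2.0
        u_j = math.sqrt(c_j/delta)*random.gauss(0, 1)
        s += u_j**2/c_j            # accumulates ||C_0^{-1/2} u||^2
    return N/s                     # -> delta almost surely as N -> infinity

# a single draw pins down delta ever more precisely as N grows
estimates = [delta_from_draw(N, delta=4.0) for N in (10, 100, 10000)]
```

Since ‖C_0^{-1/2}u‖² is δ^{-1} times a chi-squared variable with N degrees of freedom, the estimator has relative error of order N^{-1/2}: one draw identifies δ perfectly in the limit, which is exactly the mutual singularity described above.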
Putting these two observations together, we have that when iteratively simulating from u | y, δ and δ | y, u, the values of u will be changing along the iterations, but will in fact be sampled from a subspace which has probability zero under P(u | y). In other words, CA is reducible in infinite dimensions and will fail to sample from u, δ | y. Iterative conditional sampling of the finite-dimensional approximation of u, δ | y will be able to obtain samples from the (approximated) posterior distribution of δ, but will suffer from increasingly slow mixing as the discretization level N increases. In fact, the dependence between the discretized unknown parameter u and δ increases with N, and becomes infinitely strong in the limit N → ∞; it is this dependence that slows down the MwG.

In order to alleviate the undesirable effects of the strong dependence between the prior on u and δ, using intuition from [34, 37], we reparametrize the prior by writing u = δ^{-1/2} v, where v ∼ N(0, C_0) and δ ∼ P(δ). This results in a MwG algorithm which alternates between a step of updating v | y, δ and a step of updating δ | y, v; this is an example of a non-centered algorithm (NCA), [34]. Since v and δ are now a priori independent, and recalling that u is not perfectly identified by the data, the dependence of these two parameters is not perfect conditionally on the data. Thus, NCA is irreducible in infinite dimensions and is therefore robust with respect to the discretization level N. Hence, for NCA we expect that C(N) is uniformly bounded with respect to N; we show numerical evidence in support of this statement in section 5.

We will concentrate on the linear inverse problem case with gamma priors on δ, which has the convenient property of conditional conjugacy. Specifically, we restrict our attention to the case G = K, where K : X → Y is a bounded linear operator.
Then the posterior distribution u | y, δ is also Gaussian, u | y, δ ∼ N(m_{λ,δ}(y), C_{λ,δ}); see [29, 27], where formulae for the posterior mean and covariance operator are provided. When the prior distribution and the noise are specified in terms of precision operators (that is, inverse covariance operators), the following expressions for the posterior mean and precision are known to hold in a range of situations, [3, 4]:

C_{λ,δ}^{-1} = λ K* C_1^{-1} K + δ C_0^{-1},  (5)

C_{λ,δ}^{-1} m_{λ,δ}(y) = λ K* C_1^{-1} y.  (6)

For the main part of our analysis we assume that X = Y and that the discretization levels of the unknown and the data are the same. This assumption is not crucial to our results and we refer to the PhD thesis [2, section 4.5] for the more general statements. Furthermore, in section 5 we present numerical examples corresponding to both Y = X, with an increasing discretization level which is the same for both the unknown and the data, and Y = R^M for some fixed M, whilst the dimension of the discretization of the unknown is increased. The case Y = X arises for example when we observe the whole unknown function subject to blurring and noise, while the case Y = R^M can arise when we have available blurred and noisy observations of the unknown at only M spatial locations (see subsection 1.2.2). The two cases can also arise if we work in the spectral domain, depending on the availability of observations of a full or only a partial spectral expansion of a blurred noisy version of the unknown.

We denote by ⟨·,·⟩_{R^N} and ‖·‖_{R^N} the (possibly scaled) Euclidean inner product and norm in R^N, and by ‖·‖_N the induced operator norm for N × N matrices. Throughout the paper we assume that this norm and inner product on R^N are scaled so that, formally, the large N limit recovers the norm and inner product on the Hilbert space when, for example, spectral or finite difference approximations are made.
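In a setting where K, C_0 and C_1 are simultaneously diagonalizable (the spectrally truncated case studied later), (5) and (6) reduce to componentwise formulas; a minimal sketch, in which all eigenvalues and data values are made up for illustration:

```python
# Componentwise posterior precision (5) and mean (6) when K, C_0, C_1
# are all diagonal in the same basis (illustrative assumption):
lam, delta = 100.0, 4.0
k  = [1.0/j for j in range(1, 6)]       # eigenvalues of K (assumed)
c0 = [1.0/j**2 for j in range(1, 6)]    # eigenvalues of C_0 (assumed)
c1 = [1.0]*5                            # C_1 = I: white observational noise
y  = [0.8, 0.1, -0.3, 0.05, 0.02]       # observed coefficients (made up)

# (5): C^{-1} = lam K* C_1^{-1} K + delta C_0^{-1}, componentwise
prec = [lam*kj**2/c1j + delta/c0j for kj, c0j, c1j in zip(k, c0, c1)]
# (6): C^{-1} m = lam K* C_1^{-1} y  =>  m = lam k_j y_j / (c1_j prec_j)
mean = [lam*kj*yj/(c1j*pj) for kj, yj, c1j, pj in zip(k, y, c1, prec)]
```

For the first component, prec = 100·1 + 4·1 = 104 and mean = 100·0.8/104 ≈ 0.769, illustrating the shrinkage of the data towards the prior mean zero.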
Henceforward, we use boldface and regular typeface letters to distinguish between infinite- and finite-dimensional objects respectively. We assume that we have a way of computing discretizations y ∈ R^N of the observation y, and replace the operators K, C_0 and C_1 by the N × N matrices K, C_0 and C_1 respectively, which arise from a consistent, in the sense of numerical analysis, family of approximations of the corresponding operators. In this finite-dimensional setting, the unknown is u ∈ R^N and it is assigned a finite-dimensional Gaussian prior, u | δ ∼ N(0, δ^{-1} C_0). The noise distribution has Lebesgue density and the corresponding log-likelihood is quadratic in u. Thus, standard Bayesian linear theory (see e.g. [28]) implies that the posterior is also Gaussian, u | y, δ ∼ N(m_{λ,δ}(y), C_{λ,δ}), where m_{λ,δ}(y) and C_{λ,δ}^{-1} solve equations (5) and (6) with the infinite-dimensional quantities replaced by the corresponding finite-dimensional ones.

Bayesian modelling for finite-dimensional approximations of linear inverse problems using Gaussian priors and noise models was recently carried out in [6]. The approach consisted in simultaneous inference for the unknown u and the hyper-parameters λ and δ. We will concentrate on simultaneous inference on u and δ only, since λ can be efficiently estimated from a single high-dimensional realization of the data, for example using quadratic variation. We again refer the interested reader to the PhD thesis [2, Chapter 4] for theoretical and numerical results on the large N behaviour of λ when considered as part of the inference problem; we stress here that for low-dimensional data, the inference on λ is non-trivial.
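The remark that λ can be recovered from a single high-dimensional realization via quadratic variation can be illustrated as follows: successive differences of the data remove the smooth part of the signal, leaving increments which are approximately N(0, 2/λ). The signal, grid and parameter values below are assumptions made for this sketch.

```python
import math, random

random.seed(2)
# Data y_i = f(x_i) + eta_i on a fine grid, eta_i ~ N(0, 1/lam), f smooth.
N, lam = 10000, 25.0
y = [math.sin(3*math.pi*i/N) + random.gauss(0, 1)/math.sqrt(lam)
     for i in range(N)]

# Differencing kills the O(1/N) smooth increments, so
# (y_{i+1} - y_i) ~ N(0, 2/lam) approximately; hence the estimator:
qv = sum((y[i+1] - y[i])**2 for i in range(N - 1))   # quadratic variation
lam_hat = 2*(N - 1)/qv                               # ~ lam for large N
```

This is only a sketch of the idea; the relative error of the estimator decays like N^{-1/2}, which is why a single high-dimensional data realization suffices.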
In [6], a standard conditionally conjugate prior was used for the hyper-parameter, δ ∼ Gamma(α, β), which in this type of finite-dimensional Gaussian model is known to lead to a gamma conditional posterior distribution, [7, Chapter 5.2]:

δ | y, u ∼ Gamma(α + N/2, β + (1/2)‖C_0^{-1/2} u‖²_{R^N}).  (7)

The inference for this model was carried out using CA, which in this case is a Gibbs sampler (see Algorithm 1 in section 2 below), since both conditional distributions u | y, δ and δ | y, u belong to known parametric families and can be sampled directly. One of the main aims of this paper is to analyze the convergence of this algorithm in the large N limit. We also aim to exhibit, via numerical simulations, the deterioration of the performance of CA in the large N limit, as well as the benefits of reparametrizing the prior and using the corresponding NCA (see Algorithm 2 in section 2 below).

In order to aid the understanding of the paper, and in anticipation of the subsequent developments, we briefly describe two methods for passing from the continuum infinite-dimensional model in X to a discrete model in R^N. Here and elsewhere in the paper, we define a Gaussian white noise in R^N to be a random variable ζ given as ζ = Σ_{j=1}^N ζ_j e_j, where {e_j}_{j=1}^N is a basis in R^N which is orthonormal in the possibly scaled Euclidean inner product ⟨·,·⟩_{R^N}, and {ζ_j} is a sequence of independent standard Gaussian random variables in R.

Let {e_j}_{j∈N} be a complete orthonormal basis in X.
An element w ∈ X can be identified with the sequence {w_j}_{j∈N} of coefficients w_j := ⟨w, e_j⟩, and by Parseval's identity the Hilbert space norm of w can be replaced by the ℓ²-norm of the sequence of coefficients (similarly for the inner product). One can then discretize w by replacing it with w ∈ span{e_1, ..., e_N}, which is identified with the truncated sequence of coefficients {w_1, ..., w_N} ∈ R^N. The ℓ²-norm and inner product are then replaced by the Euclidean norm and inner product. Let Σ : X → X be a bounded operator which is diagonalizable in {e_j}_{j∈N} with eigenvalues {μ_j^Σ}_{j∈N}. The operator Σ can be identified with the sequence {μ_j^Σ}_{j∈N}, and we can discretize Σ at level N by replacing it with the finite rank operator which is identified with the N × N diagonal matrix Σ = diag(μ_1^Σ, ..., μ_N^Σ). If x ∼ N(0, Σ) is a Gaussian random variable in X, we can discretize by replacing x with x ∈ span{e_1, ..., e_N}, which is identified with a random variable with distribution N(0, Σ) in R^N. Equivalently, x is identified with Σ^{1/2} ζ, where ζ is a Gaussian white noise in R^N with respect to the standard orthonormal basis of Euclidean space. For more details see subsection 4.1.

Let X = L²(I), I = (0, 1), and let A be the negative Laplacian densely defined on X with domain H²(I) ∩ H¹₀(I), that is, with Dirichlet boundary conditions. We discretize the domain I using a grid of N equally spaced points {1/(N+1), ..., N/(N+1)}; we can restrict our attention to the interior points due to the Dirichlet boundary conditions. We define the inner product and norm in R^N

⟨u, v⟩_{R^N} = (1/(N+1)) Σ_{j=1}^N u_j v_j  and  ‖u‖_{R^N} = ((1/(N+1)) Σ_{j=1}^N u_j²)^{1/2}.

Note that the natural orthonormal basis on the N-dimensional space of grid points, with respect to the above norm and inner product, is {e_j}_{j=1}^N with e_j = {√(N+1) δ_ij}_{i=1}^N, where δ_ij is Kronecker's delta.
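The scaling of this grid inner product is chosen exactly so that the large N limit recovers the L² structure; a quick check with u(x) = sin(πx), whose squared L²(0,1) norm is 1/2:

```python
import math

N = 127

def ip(u, v):
    # scaled Euclidean inner product <u, v>_{R^N} = (1/(N+1)) sum u_j v_j
    return sum(ui*vi for ui, vi in zip(u, v))/(N + 1)

# grid values of u(x) = sin(pi x) at x_j = j/(N+1), j = 1, ..., N
u = [math.sin(math.pi*j/(N + 1)) for j in range(1, N + 1)]
# basis vector e_1 = sqrt(N+1)*(1, 0, ..., 0): unit norm in <.,.>_{R^N}
e1 = [math.sqrt(N + 1) if j == 1 else 0.0 for j in range(1, N + 1)]
nrm2 = ip(u, u)   # matches ||sin(pi .)||^2_{L^2(0,1)} = 1/2
```

For this particular u the discrete norm equals the continuum norm exactly, by the discrete orthogonality of the sine functions; for general u the agreement holds in the limit N → ∞.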
For a function u on I which vanishes on the boundary, we consider its discretization on the grid, hence u_j = u(j/(N+1)). We thus have a discrete approximation of X with norm and inner product which are the discrete analogues of the L²-norm and inner product. We use finite differences to discretize A. In particular, we replace A by the N × N matrix

A = (N+1)² T,

where T is the tridiagonal matrix with entries 2 on the diagonal and -1 on the first sub- and super-diagonals. If z ∼ N(0, Σ) is a Gaussian random variable in X, where Σ is a function of A (for example a power), we discretize z by considering the N-dimensional random variable z = Σ^{1/2} ζ defined on the grid, where Σ is the corresponding function of the matrix A and ζ is a Gaussian white noise with respect to {e_j}_{j=1}^N.

In subsection 5.2 we consider subsampling at a set of M equally spaced points amongst the N grid points, where (N+1)/(M+1) is a nonnegative power of 2. To this end, we define the matrix P ∈ R^{M×N} by

P_{i,j} = 1 if j = i(N+1)/(M+1), and 0 otherwise.

The matrix P maps the vector of values on the fine grid, {u(j/(N+1))}_{j=1}^N, to the subsampled vector of the values on the coarse grid, {u(i/(M+1))}_{i=1}^M. If we fix M and let N increase, then P corresponds to a discretization of the operator P : C(I) → R^M defined as M pointwise evaluations at the points x_i = i/(M+1), i = 1, ..., M, that is, (Pu)_i = u(i/(M+1)) for any continuous function u. A formal calculation suggests that the adjoint of the pointwise evaluation operator at x ∈ I is an operator mapping r ∈ R to rδ_x, where δ_x is the Dirac distribution at x. This suggests that P* : R^M → C(I) maps r ∈ R^M to the linear combination of Dirac distributions r_1 δ_{x_1} + ... + r_M δ_{x_M}. At the same time, the matrix P^T ∈ R^{N×M} maps the vector of values on the coarse grid, {u(i/(M+1))}_{i=1}^M, to a vector in R^N which is zero everywhere except at the i(N+1)/(M+1)-th components, where it is equal to u(i/(M+1)), i = 1, ..., M.
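As a sanity check on this discretization, the eigenvalues of A = (N+1)²T are (N+1)²(2 − 2cos(jπ/(N+1))) (the standard closed form for tridiag(−1, 2, −1)), which for fixed j converge to the Dirichlet Laplacian eigenvalues (jπ)², and the subsampling matrix P indeed picks out the coarse grid values; a small sketch:

```python
import math

N, M = 255, 31

def eig_A(j):
    # j-th eigenvalue of A = (N+1)^2 * tridiag(-1, 2, -1)
    return (N + 1)**2*(2 - 2*math.cos(j*math.pi/(N + 1)))

# for fixed j, eig_A(j) -> (j*pi)^2 as N -> infinity

# subsampling: P keeps every (N+1)/(M+1)-th value of the fine grid
stride = (N + 1)//(M + 1)                                # here 8, a power of 2
fine = [math.sin(math.pi*j/(N + 1)) for j in range(1, N + 1)]
coarse = [fine[i*stride - 1] for i in range(1, M + 1)]   # values at i/(M+1)
```

With N = 255 the first eigenvalue already agrees with π² ≈ 9.8696 to four digits, reflecting the second-order accuracy of the centred difference scheme.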
Combining, and in order to capture the effect of the Dirac distributions at the locations i/(M+1), we have that P* should be discretized using the matrix (N+1) P^T.

Note that if N(0, δ^{-1} T^{-1}) is used as a prior on u | δ at level N, where T is the N × N tridiagonal matrix in the definition of A, then this corresponds to having a prior with covariance matrix (N+1)² δ^{-1} A^{-1}. In particular, if δ ∼ Gamma(α, β), then (N+1)^{-2} δ ∼ Gamma(α, (N+1)² β), where in the large N limit the last gamma distribution converges to a point mass at zero, while A approximates A. This means that as N → ∞ the correlation structure of the prior is described by the limiting A, but with an amplitude which becomes larger and larger with ever increasing confidence; in other words, as N grows the prior on u | δ looks increasingly flat.

We use subscripts to make explicit the dependence of the δ-chain on the discretization level N, and superscripts to denote the iteration number in the Gibbs sampler. For a random variable x which depends on the mutually independent random variables z_1 and z_2, we use E^{z_1}[x] to denote the expectation of x with respect to z_1 for fixed z_2. We use x_1 =^L x_2 to denote that the random variables x_1 and x_2 have the same law. Finally, for two sequences of positive numbers {s_j} and {t_j}, we use the notation s_j ≍ t_j to mean that s_j/t_j is bounded away from zero and infinity uniformly in j.

In the next section we present the centered Gibbs and non-centered MwG algorithms in our assumed linear conjugate setting; we also discuss the option of integrating u out of the data likelihood, and the resulting marginal algorithm. In section 3 we present our main result on the deterioration of the centered Gibbs sampler, which holds under certain assumptions made at the discrete level and stated explicitly in the same section.
Our discrete level assumptions are typically inherited from Assumptions 3.1 on the underlying infinite-dimensional model, also stated in section 3, when consistent numerical discretizations are used. In section 4 we exhibit three classes of linear inverse problems satisfying our assumptions on the underlying infinite-dimensional model. For the first two of these classes, that is, a class of mildly ill-posed and a class of severely ill-posed linear inverse problems, both in a simultaneously diagonalizable setting, we also explicitly prove that our discrete level assumptions are inherited from the infinite-dimensional assumptions when discretizing via spectral truncation (see subsections 4.1 and 4.2). In section 5 we present supporting numerical evidence.

We now present in more detail the different algorithms for sampling u, δ | y in linear hierarchical inverse problems, and provide a high-level comparison of their relative merits in the asymptotic regime of large N. We first provide pseudo-code for the most natural algorithm for sampling u, δ | y in this linear conjugate setting, that is, the centered Gibbs sampler used in [6] and discussed in section 1.

Algorithm 1.
0. Initialize δ^(0) and set k = 0;
1. u^(k) ∼ N(m_{λ,δ^(k)}(y), C_{λ,δ^(k)});
2. δ^(k+1) ∼ Gamma(α + N/2, β + (1/2)‖C_0^{-1/2} u^(k)‖²_{R^N});
3. Set k = k + 1. If k < k_max return to step 1, otherwise stop.

We now formulate in more detail the non-centered algorithm introduced in section 1. We define the algorithm in the infinite-dimensional setting, and then discretize it. We reparametrize the prior by writing u = δ^{-1/2} v, where now v ∼ N(0, C_0), and the observation model becomes

y = δ^{-1/2} K v + η.  (8)

The MwG sampler is used to sample v, δ | y by iteratively sampling from the two conditionals. Recall from the discussion on CA in section 1 that δ | y, u ≡ δ | u, and note that δ | y, v no longer simplifies to δ | v, since even conditionally on v, δ and y are dependent; this is the non-centered property in the hierarchical model, [34]. Additionally, note that a practically useful way to sample from v | y, δ, which recycles available code for CA, is to first sample u | y, δ, as in CA, and then transform u to v via v = δ^{1/2} u. Finally, for reasons of efficiency described below, we prefer to sample τ = δ^{-1/2} instead of δ directly. In order to obtain the same Bayesian model as the one before the transformation, the prior distribution for τ should be the one obtained from the prior on δ after the 1/√δ transformation, that is, the square root of an inverse-gamma distribution. Of course, we can deterministically calculate δ = 1/τ² after each such update, to get δ-samples and proceed to the next conditional simulation in the algorithm.

The finite-dimensional discretization of the algorithm is obtained in the same way as for CA. We notice that the log-likelihood is quadratic in τ, for given v. We can exploit this property to sample τ efficiently. The conditional posterior τ | y, v is not Gaussian, because the prior on τ is not Gaussian; however, we can use the Gaussian distribution determined by the likelihood, viewed as a function of τ, as a proposal density in the Metropolis-Hastings step.
The likelihood, as a function of τ, is Gaussian, N(r_{λ,v}, q_{λ,v}), where

1/q_{λ,v} = λ ‖C_1^{-1/2} K v‖²_{R^N},  r_{λ,v}/q_{λ,v} = λ ⟨K* C_1^{-1} y, v⟩_{R^N},  (9)

hence easy to simulate from. Proposals generated in this way are immediately rejected if negative, and if not, they are accepted according to the Metropolis-Hastings ratio, which by construction only involves the prior density. Note that the same complication would arise had we chosen to work with δ instead of τ, since δ | y, v is also not a known distribution. The difference in that case is that there is no apparent good proposal density for the Metropolis-Hastings step, since the likelihood is not a known distribution as a function of δ.

We use the following MwG sampler, where p(·) denotes the density of the square root of the inverse-gamma distribution with parameters α, β:

Algorithm 2.
0) Initialize τ^(0), calculate δ^(0) = 1/(τ^(0))² and set k = 0;
1) u^(k) ∼ N(m_{λ,δ^(k)}(y), C_{λ,δ^(k)}); set v^(k) = (δ^(k))^{1/2} u^(k);
2) propose τ ∼ N(r_{λ,v^(k)}, q_{λ,v^(k)});
   if τ ≤ 0 reject; if τ > 0 accept with probability p(τ)/p(τ^(k)) ∧ 1, otherwise reject;
   if τ accepted set τ^(k+1) = τ, otherwise set τ^(k+1) = τ^(k); set δ^(k+1) = 1/(τ^(k+1))²;
3) Set k = k + 1. If k < k_max return to step 1, otherwise stop.

Given that u (hence Ku) and η are independent Gaussian random variables, the marginal distribution of the data y given δ is also Gaussian,

y | δ ∼ N(0, δ^{-1} K C_0 K* + λ^{-1} C_1).

One can then use Bayes' theorem to get that

P(δ | y) ∝ P(y | δ) P(δ).

This distribution can be sampled using the random walk Metropolis (RWM) algorithm. In order to get samples from u, δ | y, we alternate between drawing δ | y and updating u | y, δ. Furthermore, it is beneficial to the performance of the RWM to sample log(δ) | y instead of δ | y; of course, samples from log(δ) | y can be deterministically transformed to samples from δ | y. The resultant algorithm is what we call the marginal algorithm (MA). At the discrete level, MA is as follows, where p(·) now denotes the density of the logarithm of a gamma distribution with parameters α, β, and ρ = log(δ):

Algorithm 3.
0) Initialize ρ^(0), set δ^(0) = exp(ρ^(0)) and k = 0;
1) u^(k) ∼ N(m_{λ,δ^(k)}(y), C_{λ,δ^(k)});
2) propose ρ ∼ N(ρ^(k), s²);
   accept with probability (P(y | exp(ρ)) p(ρ))/(P(y | exp(ρ^(k))) p(ρ^(k))) ∧ 1, otherwise reject;
   if ρ accepted set ρ^(k+1) = ρ, otherwise set ρ^(k+1) = ρ^(k); set δ^(k+1) = exp(ρ^(k+1));
3) Set k = k + 1. If k < k_max return to step 1, otherwise stop.

We follow the rule-of-thumb proposed in [19] and choose the RWM proposal variance s² to achieve an acceptance probability of around 44%.

As discussed in section 1, and as is formally shown in section 3, CA deteriorates as the discretization level of the unknown, N, becomes larger. To get a first understanding of this phenomenon in the linear conjugate setting, note that the Gamma(α, β) distribution has mean αβ^{-1} and variance αβ^{-2}. Hence, for any µ >
0, as N grows, the random variable Gamma(α + N/2, β + µN/2) behaves like a Dirac distribution centred on µ^{-1}. Furthermore, we will show that, because of the consistency of the approximation of the operators defining the Bayesian inverse problem, together with the scaling of the norms on R^N to reproduce the Hilbert space norm limit, it is natural to assume that

‖C_0^{-1/2} u^(k)‖²_{R^N} ≈ (δ^(k))^{-1} N.

Using the limiting behaviour of the gamma distribution described above, this means that as the dimension N increases we have δ^(k+1) ≈ δ^(k); hence the δ-chain makes very small moves and slows down.

In contrast, both conditionals u | y, δ and δ | y, v sampled in NCA are non-degenerate even in the infinite-dimensional limit. Our numerical results show that this reparametrization is indeed robust with respect to the increase in dimension (see section 5), although establishing formally that a spectral gap exists for NCA in this limit is beyond the scope of this paper.

Similarly, both distributions u | y, δ and δ | y sampled in MA are non-degenerate in the continuum limit, hence MA is robust with respect to N. Moreover, MA is optimal with respect to the dependence between the two components of the algorithm, since the δ-chain is independent of the u-draws; there is a loss of efficiency due to the use of RWM to sample δ | y, but provided the proposal variance is optimally tuned, this will only have a minor effect on the performance of MA. For these reasons, in section 5 we use the optimally tuned MA as the gold standard with which we compare the performance of CA and NCA. Nevertheless, we stress here that:

i) MA requires at each iteration the potentially computationally expensive calculation of the square root and the determinant of the precision matrix of y | δ.
This makes the implementation of MA in large scale linear inverse problems less straightforward compared to CA and NCA.

ii) even though we view MA as a gold, albeit potentially expensive, standard in our linear setting, for nonlinear problems MA is not available. On the contrary, CA and NCA are straightforward to extend to the nonlinear case (see section 6); this is one of the principal motivations for studying the optimal parametrization of Gibbs sampling in this context.

In this section we present our theory concerning the behaviour of CA as the discretization level increases, in the linear inverse problem setting introduced in subsection 1.1. We first formulate our assumptions on the underlying infinite-dimensional model as well as a corresponding set of discrete-level assumptions, before presenting our main result on the large N behaviour of Algorithm 1.

We work under the following assumptions on the underlying infinite-dimensional linear inverse problem:
Assumptions 3.1. i) For any λ, δ > 0, we have m_{λ,δ}(y) ∈ D(C₀^{−1/2}) y-almost surely; that is, the posterior mean belongs to the Cameron-Martin space of the prior on u | δ;

ii) C₁^{−1/2} K C₀ K* C₁^{−1/2} is trace-class; that is, the prior is sufficiently regularizing.

Assumption 3.1(ii) implies the second and third conditions of the Feldman-Hajek theorem [17, Theorem 2.23]. Together with Assumption 3.1(i), they thus imply that y-almost surely u | y, δ is absolutely continuous with respect to u | δ, and hence the infinite-dimensional intuition on the behaviour of CA described in section 1 applies.

In the following, we assume that C₀ and C₁ are positive definite N × N matrices which are the discretizations of the positive definite operators C₀ and C₁ respectively, and the N × N matrix K is the discretization of the bounded operator K. Our analysis of the δ-chain is valid under the following assumptions at the discrete level:

Assumptions 3.2. i) For almost all data y, for any λ, δ > 0, there exists a constant c₁ = c₁(y; λ, δ) ≥ 0, independent of N, such that ‖C₀^{−1/2} m_{λ,δ}(y)‖_{R^N} ≤ c₁;

ii) there exists a constant c₂ ≥ 0, independent of N and y, such that Tr(C₁^{−1/2} K C₀ K* C₁^{−1/2}) ≤ c₂.

These assumptions are typically inherited from Assumptions 3.1 when consistent discretizations are used; see subsection 1.2 and section 4 for more details and examples.
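Before stating the main result, the degenerating gamma conditional that drives it is easy to check numerically. The sketch below, with illustrative values for α₀, β₀ and µ (these particular numbers are our assumption, not taken from the experiments), draws from Gamma(α₀ + N/2, β₀ + µN/2) for increasing N and exhibits the concentration around µ^{−1}:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, beta0, mu = 1.0, 1e-4, 2.0   # illustrative hyper-parameters; mu**-1 = 0.5

stds = []
for N in [32, 512, 8192]:
    shape = alpha0 + N / 2            # gamma shape: alpha0 + N/2
    rate = beta0 + mu * N / 2         # gamma rate:  beta0 + mu*N/2
    # numpy parametrizes the gamma by shape and scale = 1/rate
    draws = rng.gamma(shape, 1.0 / rate, size=100_000)
    stds.append(draws.std())
    print(f"N={N:5d}  mean={draws.mean():.4f}  std={draws.std():.4f}")
```

The printed standard deviations decay like N^{−1/2}, which is exactly the degeneracy exploited in the analysis of the δ-chain below.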
We now present our main result on the behaviour of Algorithm 1 in the asymptotic regime of large N. We start by noting that the two steps of updating u | y, δ and δ | y, u in Algorithm 1 can be compressed into a single update of δ involving the noise in the u-update. Indeed, we denote by δ^{(k+1)}_N the δ-draw in the (k+1)-st iteration of the Gibbs sampler when the problem is discretized in R^N. This draw is made using the previous draw of u | y, δ, which, assuming that δ^{(k)}_N = δ, is denoted by u^{(k)}_δ and can be written as

  u^{(k)}_δ = m_{λ,δ}(y) + C^{1/2}_{λ,δ} ζ,   (10)

where ζ is an N-dimensional Gaussian white noise representing the fluctuation in step 1, and C_{λ,δ}, m_{λ,δ} are given by the formulae (5), (6) respectively. Hence we have

  δ^{(k+1)}_N ∼ Gamma(α₀ + N/2, β₀ + ½ ‖C₀^{−1/2} u^{(k)}_δ‖²_{R^N}).   (11)

Assumptions 3.2 ensure that the squared norm appearing in (11) behaves like δ^{−1} N, as assumed in the discrete level intuition discussed in subsection 2.4. This is made precise in the following lemma, which forms the backbone of our analysis and is proved in subsection 7.2.

Lemma 3.3.
Under Assumptions 3.2, for any λ, δ > 0 we have

  β₀ + ½ ‖C₀^{−1/2} u^{(k)}_δ‖²_{R^N} = δ^{−1} N/2 + δ^{−1} √(N/2) W_{1,N} + F_N(δ),   (12)

where i) W_{1,N} depends only on the white noise ζ in (10), has mean zero and variance one, has higher order moments which are bounded uniformly in N, and converges weakly to a standard normal random variable as N → ∞; ii) F_N(δ) depends on the data y and y-almost surely has finite moments of all positive orders uniformly in N (where the expectation is taken with respect to ζ).

Combining this with the scaling property of the gamma distribution, as in the intuition described in subsection 2.4, we show that as the dimension increases the δ-chain makes increasingly smaller steps, and we quantify the scaling of this slowing down. Indeed, we prove that for large N the δ-chain makes moves which on average are of order N^{−1}, with fluctuations of order N^{−1/2}. As a result, it takes O(N) steps for the δ-chain to move by O(1).

Theorem 3.4.
Let λ > 0 and consider Algorithm 1 under Assumptions 3.2. In the limit N → ∞, we have almost surely with respect to y, where all the expectations are taken with respect to the randomness in the algorithm:

i) the expected step in the δ-chain scales like N^{−1}, that is, for any δ > 0,

  N E[δ^{(k+1)}_N − δ^{(k)}_N | δ^{(k)}_N = δ] = (α₀ + 1)δ − f_N(δ; y)δ² + O(N^{−1/2}),

where f_N(δ; y) is bounded uniformly in N. In particular, if there exists f(δ; y) ∈ R such that f_N(δ; y) → f(δ; y), then

  N E[δ^{(k+1)}_N − δ^{(k)}_N | δ^{(k)}_N = δ] = (α₀ + 1)δ − f(δ; y)δ² + o(1);

ii) the variance of the step also scales like N^{−1}, and in particular for any δ > 0,

  N Var[δ^{(k+1)}_N − δ^{(k)}_N | δ^{(k)}_N = δ] = 2δ² + O(N^{−1/2}).

Remark 3.5. i) The proof of Theorem 3.4 can be found in subsection 7.1 in the Appendix. Equation (17) is a key identity, as it very clearly separates the three sources of fluctuation in the draw δ^{(k+1)}_N, that is, the fluctuation in the Gaussian draw u | y, δ, the fluctuation in the gamma draw δ | y, u, and the fluctuation in the data.

ii) f_N(δ; y) := E_ζ[F_N(δ; y)], where F_N is defined in the proof of Lemma 3.3. The assumption on the convergence of f_N(δ; y) is trivially satisfied under Assumptions 3.2 if the discretization scheme used is such that, whenever the vector x ∈ R^N and the N × N matrix T are the discretizations at level N of x ∈ X and the linear operator T respectively, ‖T x‖_{R^N} is a non-decreasing sequence. This is the case for example in spectral truncation methods, when T is diagonalizable in the orthonormal basis used (see subsection 1.2.1).

Combining i) and ii), for large N we have, to leading order,

  δ^{(k+1)}_N − δ^{(k)}_N ≈ (1/N) ((α₀ + 1)δ^{(k)}_N − f_N(δ^{(k)}_N; y)(δ^{(k)}_N)²) + √(2/N) δ^{(k)}_N Ξ,   (13)

where Ξ is a real random variable with mean zero and variance one.
In the case where f_N has a limit, the last expression looks like an Euler-Maruyama discretization, with time step 1/N, of the stochastic differential equation

  dδ = ((α₀ + 1) − f(δ; y)δ) δ dt + √2 δ dW,   (14)

where W = W(t) is a standard Brownian motion. This is another manifestation of the fact that it takes O(N) steps for the δ-chain to make a move of O(1) size.

Note that (13) implies that the expected square jumping distance of the Markov chain for δ generated by CA is O(1/N). Recall (see for example [38] for a recent account) that this distance is defined as E[(δ^{(k+1)}_N − δ^{(k)}_N)²], where δ^{(k)}_N is drawn from the stationary distribution; hence, it is the expected squared step of the chain in stationarity. It is easy to check that it equals 2Var(δ^{(k)}_N)(1 − Corr(δ^{(k)}_N, δ^{(k+1)}_N)), where again all quantities are computed in stationarity. Although the expected square jumping distance is a sensible and practically useful measure of efficiency of a Markov chain, there is no explicit result that links it to the variance of Monte Carlo averages formed by using the output of the chain. This variance will depend not only on the autocorrelation at other lags, but also on the function being averaged. Still, it gives a rough idea: if the autocorrelation function associated to the identity function is geometrically decaying, with lag-1 autocorrelation ρ_N, then the variance of the sample average of k_max values δ^{(k)}_N in stationarity will be approximately Var(δ^{(k)}_N)(1 + ρ_N)/((1 − ρ_N) k_max). The point here is that ρ_N behaves like 1 − c/N for some c > 0, while Var(δ^{(k)}_N) is O(1). Hence, the Monte Carlo error associated with k_max draws in stationarity is O(√(N/k_max)).
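The identity relating the expected square jumping distance to the lag-1 autocorrelation quoted above can be checked on any stationary chain. The sketch below uses a synthetic AR(1) chain as a stand-in for a δ-chain in stationarity (the AR(1) model and its parameters are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stationary AR(1) chain with lag-1 autocorrelation rho and unit variance
rho, n = 0.95, 200_000
eps = np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
x = np.empty(n)
x[0] = rng.standard_normal()
for k in range(1, n):
    x[k] = rho * x[k - 1] + eps[k]

esjd = np.mean(np.diff(x) ** 2)   # E[(x_{k+1} - x_k)^2], estimated from the chain
identity = 2 * np.var(x) * (1 - np.corrcoef(x[:-1], x[1:])[0, 1])
print(esjd, identity)             # both close to 2 * (1 - rho) = 0.1
```

For a chain with ρ_N = 1 − c/N and O(1) stationary variance, this identity gives ESJD of order 1/N, which is the scaling established for CA.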
We now present three families of linear inverse problems satisfying Assumptions 3.1 on the underlying continuum model: a family of mildly ill-posed inverse problems, where the operators defining the problem are simultaneously diagonalizable, [25]; a family of severely ill-posed inverse problems, again in a diagonal setting, [26, 4]; and a family of mildly ill-posed inverse problems in a nondiagonal setting, [3]. We expect that Assumptions 3.2 will be satisfied by consistent discretizations of these models. Indeed, we show that our discrete level assumptions are satisfied if we discretize the two diagonal examples using spectral truncation (see subsection 1.2.1). Furthermore, in section 5 we provide numerical evidence that our ideas also apply in nondiagonal settings and when using other discretization schemes, in particular discretization via finite difference approximations (see subsection 1.2.2). We do not prove that discretization via finite differences satisfies our discrete level assumptions, as this is beyond the scope of this paper; we expect, however, this to be the case.
We consider the linear inverse problem setting of subsection 1.1, where K, C₀ and C₁ commute with each other and K*K, C₀ and C₁ are simultaneously diagonalizable with common complete orthonormal eigenbasis {e_j}_{j∈N}. Note that we do not assume that K and C₁ are compact, but we do assume that K*K and C₁ are both diagonalizable in {e_j}_{j∈N}; in particular, we allow K and C₁ to be the identity. For any w ∈ X, let w_j := ⟨w, e_j⟩. Let Σ be a positive definite and trace class operator in X which is diagonalizable in the orthonormal basis {e_j}_{j∈N}, with eigenvalues {µ^Σ_j}_{j∈N}. Then for ρ ∈ X, we can write a draw x ∼ N(ρ, Σ) as

  x = ρ + ∑_{j=1}^∞ √(µ^Σ_j) γ_j e_j,

where the γ_j are independent standard normal random variables in R; this is the Karhunen-Loève expansion [1, Chapter III.3]. In fact, the Karhunen-Loève expansion makes sense even if the µ^Σ_j are not summable, that is, if Σ is not trace class in X; the expansion then defines a Gaussian measure in a bigger space than X, in which Σ is trace class. This expansion suggests that, since we are in a simultaneously diagonalizable setting, we can use the Parseval identity and work entirely in the frequency domain as in subsection 1.2.1. Indeed, we identify an element w ∈ X with the sequence of coefficients {w_j}_{j∈N}, and the norm and inner product in X with the ℓ²-norm and inner product. Furthermore, we identify the operators C₀, C₁ and K with the sequences of their eigenvalues {µ^{C₀}_j}_{j∈N}, {µ^{C₁}_j}_{j∈N} and {µ^K_j}_{j∈N} respectively. Algebraic operations on the operators C₀, C₁, K are defined through the corresponding operations on the respective sequences.

We make the following assumptions on the spectral decay of K, C₀ and C₁:

Assumptions 4.1.
The eigenvalues of K*K, C₀ and C₁, denoted by (µ^K_j)², µ^{C₀}_j, µ^{C₁}_j respectively, satisfy

- (µ^K_j)² ≍ j^{−4ℓ}, ℓ ≥ 0;
- µ^{C₀}_j ≍ j^{−2α}, α > 1/2;
- µ^{C₁}_j ≍ j^{−2β}, β ≥ 0.

Let ν be the joint distribution of y and u, where u | δ ∼ N(0, δ^{−1}C₀) and y | u, δ ∼ N(Ku, λ^{−1}C₁). Then in this diagonal case, it is straightforward to show in the infinite-dimensional setting that the conditional posterior u | y, δ is ν-almost surely Gaussian, N(m_{λ,δ}(y), C_{λ,δ}), where C_{λ,δ} and m_{λ,δ}(y) satisfy (5) and (6) respectively. We make the following additional assumption:

Assumption 4.2.
The parameters α, β, ℓ in Assumptions 4.1 satisfy 2α + 4ℓ − 2β > 1.

We show that under Assumptions 4.1 and 4.2, Assumptions 3.1 on the underlying infinite-dimensional model are satisfied ν-almost surely. Without loss of generality assume δ = λ = 1. For Assumption 3.1(i), using the Karhunen-Loève expansion and Assumptions 4.1, we have

  E^ν ‖C₀^{−1/2} m(y)‖² ≤ c E^ν ∑_{j=1}^∞ [j^{2α−4ℓ+4β} / (j^{−4ℓ+2β} + j^{2α})²] (j^{−2ℓ−α} ζ_j + j^{−β} ξ_j)²,

where {ζ_j}_{j∈N}, {ξ_j}_{j∈N} are two independent sequences of independent standard normal random variables. The assumption 2α + 4ℓ − 2β > 1 secures that the right hand side is finite, hence m(y) ∈ D(C₀^{−1/2}) ν-almost surely. For Assumption 3.1(ii), the operator C₁^{−1/2} K C₀ K* C₁^{−1/2} has eigenvalues which decay like j^{−2α−4ℓ+2β}, and hence are summable by Assumption 4.2.

We define the Sobolev-like spaces H^t, t ∈ R: for t ≥
0, define

  H^t := {u ∈ X : ‖u‖²_{H^t} := ∑_{j=1}^∞ j^{2t} ⟨u, e_j⟩² < ∞},

and for t < 0, H^t := (H^{−t})*. We assume that the data take the following form:

Assumption 4.3. y = Ku† + λ^{−1/2} C₁^{1/2} ξ, where u† ∈ H^{β−2ℓ} is the underlying true solution and ξ is a Gaussian white noise, ξ ∼ N(0, I).

(Here α, β are not to be confused with α₀, β₀, used respectively as the shape and rate parameters of the gamma distribution.)

Note that under Assumptions 4.1, 4.2 and 4.3, it is straightforward to check that Assumption 3.1(i) is also satisfied ξ-almost surely. Indeed, using the Karhunen-Loève expansion we have

  E ‖C₀^{−1/2} m(y)‖² ≤ c E ∑_{j=1}^∞ [j^{2α−4ℓ+4β} / (j^{−4ℓ+2β} + j^{2α})²] (j^{−4ℓ} (u†_j)² + λ^{−1} j^{−2β} ξ_j²),

where {ξ_j}_{j∈N} is a sequence of independent standard normal random variables. The assumption 2α + 4ℓ − 2β > 1 and the fact that u† ∈ H^{β−2ℓ} secure that the right hand side is finite. Assumption 3.1(ii) is independent of y, hence it also holds by our previous considerations.

A natural way to discretize random draws in this setup is by truncating the Karhunen-Loève expansion, which is equivalent to the spectral truncation in subsection 1.2.1. We assume that we have discrete data of the form

  y = Ku† + λ^{−1/2} C₁^{1/2} ξ,

where K, C₁, u† and ξ are discretized as in subsection 1.2.1. The prior is also discretized using spectral truncation, u ∼ N(0, C₀).
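The spectral-truncation discretization just described amounts to truncating the Karhunen-Loève expansion and generating data coordinatewise. The following sketch uses illustrative exponents satisfying Assumptions 4.1 and 4.2 (α = 1.5, β = ℓ = 0, so that 2α + 4ℓ − 2β = 3 > 1) and a hypothetical true solution; all numerical values are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, alpha, ell, beta = 1024, 1.5, 0.0, 0.0   # illustrative: 2a + 4l - 2b = 3 > 1
lam, delta = 200.0, 1.0
j = np.arange(1, N + 1)

mu_C0 = j ** (-2 * alpha)    # eigenvalues of C0
mu_K = j ** (-2 * ell)       # eigenvalues of K, so (mu_K)^2 ~ j^{-4l}
mu_C1 = j ** (-2 * beta)     # eigenvalues of C1

# truncated Karhunen-Loeve draw from the prior u | delta ~ N(0, delta^{-1} C0)
u = np.sqrt(mu_C0 / delta) * rng.standard_normal(N)

# discrete data y = K u_true + lam^{-1/2} C1^{1/2} xi, with a hypothetical truth
u_true = j ** (-1.25) * np.sin(10 * j)
y = mu_K * u_true + np.sqrt(mu_C1 / lam) * rng.standard_normal(N)

# discrete trace bound of Assumptions 3.2(ii): sum_{j<=N} j^{-2a-4l+2b}
trace = np.sum(mu_K**2 * mu_C0 / mu_C1)
print(trace)   # bounded uniformly in N; here close to zeta(3) ~ 1.202
```

The partial sums of the trace stay bounded as N grows, which is exactly how the discrete-level Assumptions 3.2 are inherited from the continuum Assumptions 4.1 and 4.2 under this scheme.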
We show that Assumptions 3.2 are satisfied under Assumptions 4.1 and 4.2, for this data and discretization scheme. By Assumptions 4.1, there exists a constant c ≥ 0, independent of N, such that

  E ‖C₀^{−1/2} m(y)‖²_{R^N} ≤ c E ∑_{j=1}^N [j^{2α−4ℓ+4β} / (j^{−4ℓ+2β} + j^{2α})²] (j^{−4ℓ} (u†_j)² + λ^{−1} j^{−2β} ξ_j²),

where the right hand side is bounded uniformly in N, since we are summing nonnegative numbers and we have seen that under Assumptions 4.2 and 4.3 the corresponding infinite series is summable. Furthermore, again by Assumptions 4.1, there exists another constant c ≥ 0, independent of N, such that

  Tr(C₁^{−1/2} K C₀ K* C₁^{−1/2}) ≤ c ∑_{j=1}^N j^{−2α−4ℓ+2β},

where the right hand side is bounded uniformly in N, since we have seen that under Assumption 4.2 the corresponding infinite series is summable.

We consider the setting of [26, 4], that is, a situation similar to the previous example, where instead of having (µ^K_j)² ≍ j^{−4ℓ} we now have (µ^K_j)² ≍ e^{−s j^b}, for b, s >
0. The proof of the validity of Assumptions 3.1 ν-almost surely is identical to the proof in the previous example, where we now have the added advantage of the exponential decay of the eigenvalues of K*K. We can also prove that for data of the form y = Ku† + λ^{−1/2} C₁^{1/2} ξ, where now it suffices to have u† ∈ X, Assumptions 3.1 are satisfied ξ-almost surely. Finally, in a similar way to the previous example, Assumptions 3.2 are valid if we discretize this setup by spectral truncation (subsection 1.2.1).

We consider the setting of [3], that is, the linear inverse problem setting of subsection 1.1, where K*K, C₀ and C₁ are not necessarily simultaneously diagonalizable, but are related to each other via K ≃ C₀^ℓ and C₁ ≃ C₀^β for some ℓ, β ≥ 0, where ≃ is used loosely to indicate two operators which induce equivalent norms. As before, let ν be the joint distribution of y and u, where u | δ ∼ N(0, δ^{−1}C₀) and y | u, δ ∼ N(Ku, λ^{−1}C₁). Then, as in the simultaneously diagonalizable case examined above, the conditional posterior u | y, δ is ν-almost surely N(m_{λ,δ}(y), C_{λ,δ}), where C_{λ,δ} and m_{λ,δ}(y) satisfy (5) and (6) respectively (see [3, Theorem 2.1]). It is implicit in [3, Theorem 2.1] that m_{λ,δ}(y) ∈ D(C₀^{−1/2}) ν-almost surely, hence Assumption 3.1(i) holds ν-almost surely. Assumption 3.1(ii) also holds ν-almost surely, since if {φ_j}_{j∈N} is a complete orthonormal system of eigenfunctions of C₀ and {µ^{C₀}_j}_{j∈N} the corresponding eigenvalues, by [3, Assumption 3.1(3)] we have

  ‖C₁^{−1/2} K C₀^{1/2} φ_j‖² ≤ c ‖C₀^{(−β+2ℓ+1)/2} φ_j‖² = c (µ^{C₀}_j)^{−β+2ℓ+1},

which is summable by [3, Assumptions 3.1(1) and 3.1(2)]. Hence, C₁^{−1/2} K C₀^{1/2} is Hilbert-Schmidt, and thus C₁^{−1/2} K C₀ K* C₁^{−1/2} is trace-class.
We believe that Assumptions 3.2 on the discrete level are also satisfied in this example if consistent discretization methods are used; however, proving this is beyond the scope of the present paper.

We now present numerical simulations supporting our result in section 3 on the large N behaviour of CA described in subsection 2.1, and our intuition contained in subsection 2.4 on the benefits of the reparametrization described in subsection 2.2. We consider an instance and a modification of the mildly ill-posed diagonal setting presented in subsection 4.1. In subsection 5.1 we use spectral truncation (see subsections 1.2.1, 4.1) and in subsection 5.2 we use a finite differences approximation (see subsection 1.2.2).

We consider the simultaneously diagonalizable setup described in subsection 4.1, where X = L²(I), I = (0, 1), e_j(x) = √2 sin(jπx), x ∈ I, and define the operators K, C₀ and C₁ directly through their eigenvalues

  µ^K_j = 1, µ^{C₀}_j = j^{−3} and µ^{C₁}_j = 1, for all j ∈ N,

respectively. In particular, this is the normal mean model, in which one assumes observations of the form

  y_j = u_j + η_j, j ∈ N,

where η_j ∼ N(0, λ^{−1}) and the unknown is {u_j}_{j∈N} ∈ ℓ². This model is clearly equivalent to the white noise model

  y = u + η,   (15)

where η = λ^{−1/2} ξ and ξ is an unobserved Gaussian white noise; see subsection 1.2.1. Note that ξ, whose covariance function is a Dirac delta function, is not realizable in the basic Hilbert space X (instead X is the corresponding Cameron-Martin space), but it can be interpreted in process form, as for example in [8, 12] in the context of inverse problems. Although it can be argued that white noise data models are unrealistic at the very smallest scales, they are a useful idealization of noise which is best thought of as a continuous process with very short correlation lengthscales; in particular, if the correlation lengthscale is much smaller than the grid scale used, then it is reasonable to use a white noise model.
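For the normal mean model just described, one CA sweep consists of two explicit diagonal updates, and the slowing down of the δ-chain can be observed directly. The sketch below is our own minimal implementation, assuming prior eigenvalues decaying like j^{−3} as above, and illustrative choices λ = 200, α₀ = 1, β₀ = 10^{−4} and a hypothetical true signal; it prints the average δ-step, which shrinks as N grows:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha0, beta0, lam = 1.0, 1e-4, 200.0   # illustrative hyper-parameters

def ca_delta_chain(N, iters=5000):
    """Centered algorithm for y_j = u_j + noise, prior u | delta ~ N(0, delta^{-1} C0)."""
    j = np.arange(1, N + 1)
    c0 = j ** (-3.0)                          # eigenvalues of C0
    u_true = j ** (-1.25) * np.sin(10 * j)    # hypothetical true signal
    y = u_true + rng.standard_normal(N) / np.sqrt(lam)
    delta, deltas = 1.0, np.empty(iters)
    for k in range(iters):
        # step 1: u | y, delta is Gaussian with diagonal precision delta/c0 + lam
        prec = delta / c0 + lam
        u = lam * y / prec + rng.standard_normal(N) / np.sqrt(prec)
        # step 2: delta | y, u ~ Gamma(alpha0 + N/2, beta0 + ||C0^{-1/2} u||^2 / 2)
        delta = rng.gamma(alpha0 + N / 2, 1.0 / (beta0 + 0.5 * np.sum(u**2 / c0)))
        deltas[k] = delta
    return deltas

steps = [np.mean(np.abs(np.diff(ca_delta_chain(N)))) for N in [32, 512, 8192]]
print(steps)   # average delta-step shrinks, roughly like N^{-1/2}
```

This is exactly the compressed update (10)-(11) specialized to the diagonal setting; the shrinking step size is the empirical counterpart of Theorem 3.4.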
The white noise model (15) is an important statistical model which is known to be asymptotically equivalent to several standard statistical models, for example nonparametric regression, [9, 42]. It is also practically relevant, since it is a nontrivial special case of the deconvolution inverse problem, [22, 36]. Finally, it gives rise to Gaussian posterior distributions which are well studied in the sense of posterior consistency, see [25, 3, 36].

Defining A to be the negative Laplace operator in I with Dirichlet boundary conditions, we recognize that we use a Gaussian prior with covariance operator C₀ proportional to A^{−3/2}. Assumptions 4.1 are satisfied with α = 1.5, β = ℓ = 0; since 2α + 4ℓ − 2β = 3 >
1, Assumption 4.2 is also satisfied. We assume that we have data produced from the underlying true signal u†(x) = ∑_{j=1}^∞ u†_j √2 sin(jπx), for x ∈ I, where u†_j = j^{−1.25} sin(10j), and λ = 200; in particular, the coefficients of y are given as

  y_j = u†_j + λ^{−1/2} ξ_j,

where the ξ_j are standard normal random variables. It is straightforward to check that u† ∈ H^t for any t < 0.
75, hence Assumption 4.3 is also satisfied. According to the considerations in subsection 4.1, we thus have that Assumptions 3.2 hold when using the spectral truncation discretization method. This example is studied in [40], where the interest is in the asymptotic performance of the posterior in the small noise limit (see section 6).

We use the hierarchical setup presented in subsection 1.1 and implement Algorithms 1 (CA), 2 (NCA) and 3 (MA) contained in section 2, at discretization levels N = 32, 512 and 8192. The gamma hyper-prior parameters are α₀ = 1, β₀ = 10^{−4}, chosen to give uninformative hyper-priors, that is, hyper-priors whose variance is much larger than their mean. Following the discussion in subsection 2.4, we view MA as the gold standard and benchmark CA and NCA against it. We use 10^5 iterations and choose δ^{(0)} = 1 in all cases. In order to have fair comparisons, we use a fixed burn-in time of 10^4 iterations. We take the viewpoint that we have a fixed computational budget, hence we choose not to increase the burn-in time as N increases, as one could do if infinite resources were available.

In Figure 1 we plot the true solution, the noisy data and the sample means and credibility bounds using CA and NCA for N = 8192. The sample means and credibility bounds at other discretization levels of the unknown are similar and are therefore omitted.

Figure 1: Left: true solution (dashed black) and noisy data (blue continuous). Middle and right: true solution (dashed black), sample mean (red continuous) and 87.5% credibility bounds (shaded area) for CA (middle) and NCA (right). Dimension is N = 8192.

In Figure 2 we see that for CA, in small dimensions the δ-chain has a healthy mixing; however, as predicted by Theorem 3.4, as N increases it becomes increasingly slower and exhibits diffusive behaviour.
This is also reflected in the density plots, where we observe that as N increases, the kernel density estimates computed using CA look less and less like the density estimates computed using MA, which we consider to be optimal in this setting. In Figure 3 we see that for NCA, as expected, the δ-chain appears to be robust with respect to the increase in dimension; this is also reflected in the density estimates using NCA, which now look very close to the ones obtained using MA for all discretization levels.

Our observations in Figures 2 and 3 are supported by the autocorrelation plots presented in Figure 4. The rate of decay of correlations in the δ-chain in CA appears to decrease as the dimension increases, and in particular for N = 8192 the correlations seem not to decay at all. On the contrary, the rate of decay of correlations in the δ-chain in NCA appears not to be affected by the increase in dimension and is very similar to the one in MA.

Figure 2: CA: δ-chains (top) and kernel density estimates of the posterior on δ (bottom) for dimensions N = 32,
512 and 8192, left to right. In dashed red in the density plots is the density estimate using MA, considered as the gold standard.

Figure 3: NCA: δ-chains (top) and kernel density estimates of the posterior on δ (bottom) for dimensions N = 32,
512 and 8192, left to right. In dashed red in the density plots is the density estimate using MA, considered as a gold standard.

Figure 4: Autocorrelation functions of the δ-chain, for dimensions 32 (black), 512 (red) and 8192 (blue); left for MA, middle for CA, right for NCA.

We consider a modification of the simultaneously diagonalizable setup described in subsection 4.1, where X = L²(I), I = (0,
1), and we allow K to map X into R^M, hence the data are y ∈ R^M. This setting is not directly covered by the theoretical analysis presented in section 3; however, our theory readily generalizes to cover it, and we refer the interested reader to the PhD thesis [2, section 4.5] for more details. The generalized analysis again holds under Assumptions 3.2 on the discrete level, based on intuition which holds for problems satisfying Assumptions 3.1 on the underlying continuum model for the unknown u.

In particular, we consider the problem of recovering a true signal u† by observing a blurred version of it at M equally spaced points {1/(M+1), ..., M/(M+1)}, polluted by additive independent Gaussian noise of constant variance λ^{−1}. We define A to be the negative Laplacian with Dirichlet boundary conditions in I. We let P be defined as in subsection 1.2.2, define K̃ = (I + π^{−2}A)^{−1}, and consider the case K = P K̃, C₀ = A^{−1} and C₁ = I_M in the setting of subsection 1.1, where I_M is the M × M identity matrix. Notice that, due to the smoothing effect of K̃, the operator K is bounded in X. However, due to the presence of P, K is not simultaneously diagonalizable with C₀.

We now check that this problem satisfies Assumptions 3.1. Indeed, assuming without loss of generality that λ = δ = 1, by [39, Example 6.23] we have that the posterior covariance and mean satisfy (5) and (6), hence

  C₀^{−1/2} m(y) = C₀^{−1/2} (C₀^{−1} + K* K)^{−1} K* y = (I + C₀^{1/2} K* K C₀^{1/2})^{−1} C₀^{1/2} K* y,

where C₀^{1/2} K* y ∈ X, and (I + C₀^{1/2} K* K C₀^{1/2})^{−1} is bounded in X by the nonnegativity of C₀^{1/2} K* K C₀^{1/2}. Furthermore, we have that Tr(C₁^{−1/2} K C₀ K* C₁^{−1/2}) = Tr(K C₀ K*), which is finite since K C₀ K* is an M × M matrix.

We discretize this setup at level N, using the finite differences approximation as explained in subsection 1.2.2.
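The construction just outlined can be sketched in a few lines. The form of the sampling matrix P below (point evaluation at the observation points) is our assumption of what subsection 1.2.2 prescribes, and N, M are chosen as in the experiments:

```python
import numpy as np

N, M = 1023, 15
h = 1.0 / (N + 1)

# Dirichlet Laplacian on (0, 1): A = h^{-2} tridiag(-1, 2, -1)
A = (2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h**2

# smoothing operator K~ = (I + pi^{-2} A)^{-1}
K_tilde = np.linalg.inv(np.eye(N) + A / np.pi**2)

# P: point evaluation at the M observation points i/(M+1), i = 1, ..., M
# (an assumed form for the sampling operator of subsection 1.2.2)
idx = (N + 1) * np.arange(1, M + 1) // (M + 1) - 1
P = np.zeros((M, N))
P[np.arange(M), idx] = 1.0

K = P @ K_tilde           # discretized forward operator, M x N
C0 = np.linalg.inv(A)     # discretized prior covariance C0 = A^{-1}
print(K.shape, C0.shape)
```

In practice one would exploit the tridiagonal structure of A rather than form dense inverses; the dense version above is only meant to make the objects of the construction explicit.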
In particular, we discretize A, P and P* by replacing them with the matrices A, P and (N+1)P^T respectively, as in subsection 1.2.2; this induces a discretization of the operators K and C₀, by replacing them with the corresponding matrices K and C₀ calculated through the appropriate functions of A and P. In defining K, we also replace the identity operator by the N × N identity matrix. We do not prove that this discretization scheme satisfies Assumptions 3.2; however, we expect this to be the case.

We assume that we have data produced from an underlying true signal u† consisting of two indicator (boxcar) functions and a smooth sine bump, each supported on subintervals of I. In particular, we construct data of the form

  y = Ku† + λ^{−1/2} C₁^{1/2} ξ,

where λ = 100, using a discretization level N_c = 8192 for the unknown; we treat this discretization level as the continuum limit.

We implement Algorithms 1 (CA), 2 (NCA) and 3 (MA) for a constant number of observation points M = 15, and for discretization levels of the unknown N = 15, 127 and 1023. The gamma hyper-prior parameters are α₀ = 1, β₀ = 10^{−4}, chosen to give uninformative hyper-priors, that is, hyper-priors whose variance is much larger than their mean. Following the discussion in subsection 2.4, we view MA as the gold standard and benchmark CA and NCA against it. We use 10^5 iterations and choose δ^{(0)} = 1 in all cases. We again use a constant burn-in time of 10^4 iterations.

In Figure 5 we plot the true solution, the noisy data and the sample means and credibility bounds using CA and NCA for N = 1023. The sample means and credibility bounds at other discretization levels of the unknown are similar and are therefore omitted.

In Figure 6 we see that for CA, in small dimensions the δ-chain has a healthy mixing; however, as predicted by our theory, as N increases it becomes increasingly slower and exhibits diffusive behaviour.
This is also reflected in the density plots, where we observe that as N increases, the kernel density estimates computed using CA look less and less like the density estimates computed using MA, which we consider to be optimal in this setting. In Figure 7 we see that for NCA the δ-chain appears to be robust with respect to the increase in dimension; this is also reflected in the density estimates using NCA, which now look very close to the ones obtained using MA for all discretization levels.

Figure 5: True solution, noisy data, and sample means and credibility bounds using CA and NCA, for N = 1023 and M = 15.

Figure 6: CA: δ-chains (top) and kernel density estimates of the posterior on δ (bottom) for dimensions N = 15,
127 and 1023, left to right. In dashed red in the density plots is the density estimate using MA, considered as a gold standard.

Figure 7: NCA: δ-chains (top) and kernel density estimates of the posterior on δ (bottom) for dimensions N = 15, 127 and 1023, left to right. In dashed red in the density plots is the density estimate using MA, considered as a gold standard.

Figure 8: Autocorrelation functions of the δ-chain, for dimensions 15 (black), 127 (red) and 1023 (blue); left for MA, middle for CA, right for NCA.

Our observations in Figures 6 and 7 are supported by the autocorrelation plots presented in Figure 8. The rate of decay of correlations in the δ-chain in CA appears to decrease as the dimension increases, and in particular for large N the correlations seem to decay very slowly. On the contrary, the rate of decay of correlations in the δ-chain in NCA appears not to be affected by the increase in dimension and is relatively close to the one in MA.

We considered a hierarchical Bayesian approach to the general function-space inverse problem (1), with Gaussian priors on the unknown function u which depend on a variance-scaling parameter δ, itself endowed with a prior. We studied the finite-dimensional implementation of this setup and, in particular, examined the mixing properties of MwG algorithms for sampling the posterior as the discretization level N of the unknown increases. We provided measure-theoretic intuition suggesting that, under natural assumptions on the underlying function space model, as N increases, CA, which is the most natural algorithm in this setting, deteriorates (see section 1). We then used this intuition to propose a reparametrization of the prior for which the resultant algorithm, NCA, is expected to be robust with respect to the increase in N. In the linear-conjugate setting we formulated rigorous theory which quantifies the deterioration of CA in the asymptotic regime of large N (see section 3).

This theory holds under assumptions on the discrete level (Assumptions 3.2), which we expect to be inherited from our assumptions on the function-space model (Assumptions 3.1) when consistent discretizations are used. Indeed, we provided three families of linear inverse problems satisfying our assumptions on the underlying infinite-dimensional model (section 4), and for two of them, which are families of mildly and severely ill-posed problems in a simultaneously diagonal setting, we also showed that a spectral truncation method based on the common eigenbasis satisfies our discrete level assumptions (subsections 4.1 and 4.2). It would be interesting to show that discretization via finite differences of these examples also satisfies our discrete assumptions.

Our numerical results confirmed our theory on the deterioration of CA, as well as our intuition about the robustness of NCA in the large N limit. However, for NCA the δ-chain slows down in the small noise limit. This is because, even though v and δ are a priori independent, they both need to explain the data, and this creates an increasingly severe constraint as λ becomes large. Hence, δ and v concentrate near a lower dimensional manifold, where δ^{−1/2} K v ≈ y, and the Gibbs sampler mixes poorly (see Figure 9 for a numerical illustration of this effect in the example of subsection 5.1).
Although MA is robust in both the large N and the small noise limit, it can be prohibitively expensive for large scale inverse problems; new work is required to produce effective hierarchical algorithms in this small noise limit when N is large. We have considered the interweaving method of [41], which combines in each iteration centered and non-centered draws of δ, and the partially non-centered parametrizations of [33], in which the prior is reparametrized as u = δ^{−t/2} v_t, where v_t ∼ N(0, δ^{t−1} C₀), for some t ∈ [0, 1].

Figure 9: For λ = 200 and dimension N = 512: δ-chain (left) and kernel density estimate of the posterior on δ (right, black). In dashed red in the right plot is the density estimate using MA, considered as a gold standard.

In addition to [6], a similar hierarchical setup has been considered in [40] in the signal in white noise model (see subsection 5.1). The authors of [40] study a different aspect of the problem, namely the asymptotic performance of the posterior distribution in the small noise limit. This is motivated by results on posterior consistency suggesting that the optimal rates of contraction are achieved by rescaling the prior depending on the size of the noise [25, 3, 26, 4]. They also study an empirical Bayes method for estimating the value of the prior scaling from the data, and show that both methods achieve optimal posterior contraction rates over a range of regularity classes of the true solution. However, we have seen in this paper that the implementation of the hierarchical Bayesian method in the large dimensional limit is problematic. On the other hand, while the empirical Bayes method is appealing because of the lack of mixing issues, it involves solving an optimization problem which in more complicated models can be computationally demanding, and it does not provide uncertainty quantification of the prior scaling, which may be desirable.
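In the partially non-centered parametrization mentioned above, t = 0 recovers the centered and t = 1 the non-centered parametrization, while for every t the implied marginal prior on u is unchanged: u = δ^{−t/2} v_t with v_t ∼ N(0, δ^{t−1} C₀) always has law N(0, δ^{−1} C₀). A quick sanity check of this invariance, with illustrative scalar values of our choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
delta, c0 = 4.0, 2.5   # illustrative scalars; target variance delta^{-1} c0 = 0.625

variances = []
for t in [0.0, 0.5, 1.0]:
    # v_t ~ N(0, delta^{t-1} c0), then set u = delta^{-t/2} v_t
    v = rng.normal(0.0, np.sqrt(delta ** (t - 1) * c0), size=500_000)
    u = delta ** (-t / 2) * v
    variances.append(u.var())
print(variances)   # each close to 0.625, independently of t
```

What changes with t is not the prior on u but the posterior dependence between the hyper-parameter and the latent variable, which is what governs the mixing of the Gibbs sampler.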
Again we highlight the need for more research and new ideas in the small noise limit, when N is large.

An asymptotic regime which we have not yet investigated is the case where we have a sequence of N-dimensional linear inverse problems, with the relevant matrices being consistent discretizations of the underlying operators, while the noise becomes smaller as N grows larger, that is λ = λ(N) → ∞ as N → ∞. This is the limit of an infinite dimensional unknown which is also identifiable from the data. Since in this regime, as N grows larger, the supports of both δ | y, u and δ | y shrink to zero, we expect that there will be an optimal relationship between λ and N, for which CA will not deteriorate for large N.

Our theory on the slowing down of the δ-chain can be extended to cover nonlinear Gaussian-conjugate Bayesian inverse problems, and in particular the nonparametric drift estimation in the SDE setting considered in [31, 35, 30]; see [2, Chapter 4.5]. Again the main result holds under assumptions on the discrete level which we expect to be inherited by consistent discretizations from natural assumptions on the underlying infinite-dimensional model, which express that the posterior is absolutely continuous with respect to the prior.

Furthermore, our infinite-dimensional intuition extends to hierarchical setups for inference on other hyper-parameters, for instance the prior regularity parameter α, where C = A^{-α}, as studied in [24]. In Figure 10 we plot autocorrelation functions for the centered MwG algorithm used in this setting and the corresponding version of the non-centered algorithm; as before, we also implemented the corresponding marginal algorithm and use it as the gold standard. The underlying truth, the noise distribution and the discretization method are the same as in subsection 5.1, and we use an exponential hyper-prior on α.
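A minimal centered MwG sketch for the regularity hyper-parameter can be written as follows. The diagonal model, the prior eigenvalues j^{-α}, and the random-walk step size are illustrative assumptions, not our actual discretization: the sampler alternates a Gaussian draw of u | y, α with a Metropolis step on α | u.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch (assumed notation): diagonal prior u_j | alpha ~ N(0, j**(-alpha)),
# data y_j = u_j + lam**(-1/2) eps_j, and an Exp(1) hyper-prior on alpha.
N, lam = 512, 100.0
j = np.arange(1, N + 1, dtype=float)
alpha_true = 2.0
u_true = rng.standard_normal(N) * j ** (-alpha_true / 2)
y = u_true + lam ** (-0.5) * rng.standard_normal(N)

def log_prior_alpha(a):
    return -a if a > 0 else -np.inf       # Exp(1) hyper-prior

def centered_alpha_step(alpha, u, step=0.1):
    """MH update of alpha | u: Gaussian prior density of u at regularity alpha."""
    def lp(a):
        if a <= 0:
            return -np.inf
        var = j ** (-a)
        return -0.5 * np.sum(u ** 2 / var) - 0.5 * np.sum(np.log(var)) + log_prior_alpha(a)
    prop = alpha + step * rng.standard_normal()
    return prop if np.log(rng.uniform()) < lp(prop) - lp(alpha) else alpha

def mwg_centered(n_iter, alpha=1.0):
    out = np.empty(n_iter)
    for k in range(n_iter):
        prec = j ** alpha + lam           # u | y, alpha is Gaussian and diagonal
        u = lam * y / prec + rng.standard_normal(N) / np.sqrt(prec)
        alpha = centered_alpha_step(alpha, u)
        out[k] = alpha
    return out

chain = mwg_centered(500)
```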
The idea is the same as the intuition presented in section 1, since in infinite dimensions two Gaussian measures N(0, Σ₁) and N(0, Σ₂), where Σ₁ and Σ₂ are simultaneously diagonalizable with eigenvalues {j^{-α₁}}_{j∈N} and {j^{-α₂}}_{j∈N} respectively, are mutually singular unless α₁ = α₂. Indeed, our numerical simulations confirm again the deterioration of the centered algorithm and the robustness of the non-centered algorithm in the large N limit. More generally, as suggested in section 1, our intuition holds for inference on any prior on u which depends on a hyper-parameter θ, when it holds that u | y, θ is absolutely continuous with respect to u | θ almost surely with respect to the data, while u | θ and u | θ′ are mutually singular when θ ≠ θ′.

Figure 10: Autocorrelation functions of α-chain, dimensions 32 (black), 512 (red) and 8192 (blue); left for marginal, middle for centered, right for non-centered.

Returning to the general nonlinear setting discussed in section 1, we note that both Algorithms 1 and 2 are straightforward to generalize, however with a certain loss of efficiency compared to the linear-conjugate setting. The distribution of u | y, δ no longer belongs to a known parametric family of distributions, and thus has to be sampled using a Metropolis–Hastings step (for example one based on a Langevin diffusion). Moreover, for nonlinear inverse problems there is no longer an easy way of finding the marginal distribution y | δ, hence MA will not be an option. The so-called pseudo-marginal algorithm [5] might be an alternative for nonlinear problems, and has been recently employed to perform Bayesian inference using Gaussian process priors in [18].
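A nonlinear variant of the centered Gibbs sampler can be sketched as follows. The cubic forward map, the priors and all tuning constants below are hypothetical; the point is only that the u-update becomes a Metropolis step (a Langevin-based proposal could replace the random walk), while the δ-update stays conjugate because the prior on u remains Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch with hypothetical nonlinear forward map G and illustrative constants:
#   y = G(u) + lam**(-1/2) eta,  u | delta ~ N(0, delta**(-1) I),
#   delta ~ Gamma(a0, b0).
N, lam = 50, 10.0
a0, b0 = 1.0, 1e-3
G = lambda u: u + 0.1 * u ** 3            # hypothetical nonlinear forward map
u_true = rng.standard_normal(N)
y = G(u_true) + lam ** (-0.5) * rng.standard_normal(N)

def log_lik(u):
    r = y - G(u)
    return -0.5 * lam * (r @ r)

def mwg(n_iter, step=0.05):
    u = np.zeros(N)
    delta = 1.0
    deltas = np.empty(n_iter)
    for k in range(n_iter):
        # Metropolis step on u | y, delta (target: likelihood x Gaussian prior)
        u_prop = u + step * rng.standard_normal(N)
        dlog = (log_lik(u_prop) - log_lik(u)
                - 0.5 * delta * (u_prop @ u_prop - u @ u))
        if np.log(rng.uniform()) < dlog:
            u = u_prop
        # conjugate gamma draw for delta | u
        delta = rng.gamma(a0 + N / 2, 1.0 / (b0 + 0.5 * (u @ u)))
        deltas[k] = delta
    return deltas

d_chain = mwg(1000)
```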
An interesting research direction is the comparison of the performance of the two MwG algorithms with the pseudo-marginal algorithm in both the large N and the small noise limits.

Finally, our research agenda includes the extension to the hierarchical setting of the present paper of the analysis contained in [13] of the bias in the estimated posterior distribution due to the approximation of the forward problem.

In this section we present the proof of Theorem 3.4, as well as several technical results and lemmas. Subsection 7.1 contains the proof of Theorem 3.4, the backbone of which is Lemma 3.3, proved in subsection 7.2. In subsection 7.3 we state and prove a lemma on the negative moments of the rate parameter in the δ draw (11), which allows us to control the lower order terms arising in the proof of Theorem 3.4. Finally, in subsection 7.4, we prove several probability and linear algebra lemmas which are useful in our analysis.

We now prove Theorem 3.4 under Assumptions 3.2. Using the scaling property of the gamma distribution, \(\mathrm{Gamma}(\alpha,\beta)\stackrel{\mathcal{L}}{=}\beta^{-1}\mathrm{Gamma}(\alpha,1)\), we can write the \(\delta_N^{(k+1)}\) draw in (11) as
\[
\delta_N^{(k+1)}\stackrel{\mathcal{L}}{=}\frac{N\,\Gamma_{1,N}}{2\big(\beta_0+\tfrac12\|C_0^{-1/2}u^{(k)}_\delta\|^2_{\mathbb{R}^N}\big)},\tag{16}
\]
where \(\Gamma_{1,N}\sim\mathrm{Gamma}(\alpha_0+\tfrac{N}{2},\tfrac{N}{2})\) is independent of y and \(u^{(k)}_\delta\). Defining
\[
W_{1,N}:=\frac{\Gamma_{1,N}-1-\frac{2\alpha_0}{N}}{\sqrt{\frac{2}{N}+\frac{4\alpha_0}{N^2}}},
\]
we have \(\Gamma_{1,N}=1+\frac{2\alpha_0}{N}+\sqrt{\frac{2}{N}+\frac{4\alpha_0}{N^2}}\,W_{1,N}\), where for every N the random variable \(W_{1,N}\) has mean zero and variance one, third and fourth moments bounded uniformly in N (see Lemma 7.5), and is independent of the data y and of ζ, the Gaussian white noise expressing the fluctuation in \(u^{(k)}_\delta\). Concatenating, we get
\[
\delta_N^{(k+1)}\stackrel{\mathcal{L}}{=}\delta\;\frac{1+\frac{2\alpha_0}{N}+\sqrt{\frac{2}{N}+\frac{4\alpha_0}{N^2}}\,W_{1,N}}{1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N(\delta)\delta}{N}},\tag{17}
\]
where \(F_N=F_N(\delta)\) and \(W_{2,N}\) are constructed in the proof of Lemma 3.3 below, and we are now ready to prove Theorem 3.4:

Proof.
By the independence of \(W_{1,N}\) and ζ, and since \(\mathbb{E}[W_{1,N}]=0\), we have
\[
\mathbb{E}\big[\delta^{(k+1)}_N-\delta^{(k)}_N\,\big|\,\delta^{(k)}_N=\delta\big]
=\delta\,\mathbb{E}\Bigg[\frac{1+\frac{2\alpha_0}{N}+\sqrt{\frac{2}{N}+\frac{4\alpha_0}{N^2}}\,W_{1,N}}{1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}}-1\Bigg]
=\delta\,\mathbb{E}_\zeta\Bigg[\frac{\frac{2\alpha_0}{N}-\sqrt{\frac{2}{N}}\,W_{2,N}-\frac{2F_N\delta}{N}}{1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}}\Bigg].
\]
Using the identity \(\frac{1}{1+x}=1-x+\frac{x^2}{1+x}\), we get
\[
\mathbb{E}\big[\delta^{(k+1)}_N-\delta^{(k)}_N\,\big|\,\delta^{(k)}_N=\delta\big]
=\delta\,\mathbb{E}_\zeta\Bigg[\Big(\frac{2(\alpha_0-F_N\delta)}{N}-\sqrt{\frac{2}{N}}\,W_{2,N}\Big)\Big(1-\sqrt{\frac{2}{N}}\,W_{2,N}-\frac{2F_N\delta}{N}\Big)\Bigg]+\mathbb{E}_\zeta[e_{1,N}],
\]
where
\[
e_{1,N}=\delta\,\Big(\frac{2(\alpha_0-F_N\delta)}{N}-\sqrt{\frac{2}{N}}\,W_{2,N}\Big)\,
\frac{\big(\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}\big)^2}{1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}}.
\]
Using Hölder's inequality and the fact that \(F_N\) and \(W_{2,N}\) have moments of all positive orders which are bounded uniformly in N, we get
\[
\mathbb{E}\big[\delta^{(k+1)}_N-\delta^{(k)}_N\,\big|\,\delta^{(k)}_N=\delta\big]
=\frac{2}{N}\big((\alpha_0+1)\delta-\mathbb{E}_\zeta[F_N]\delta^2\big)+O(N^{-3/2})+\mathbb{E}_\zeta[e_{1,N}],
\]
almost surely with respect to y. For the residual \(e_{1,N}\), by the Cauchy–Schwarz inequality and (12), we have
\[
\mathbb{E}_\zeta[e_{1,N}]
\le\Bigg(\mathbb{E}\Big[\Big(\frac{2(\alpha_0-F_N\delta)}{N}-\sqrt{\frac{2}{N}}\,W_{2,N}\Big)^2\Big(W_{2,N}^2+\frac{2F_N^2\delta^2}{N}+\frac{2\sqrt{2}\,F_NW_{2,N}\delta}{\sqrt{N}}\Big)^2\Big]\Bigg)^{1/2}
\Bigg(\mathbb{E}\Big[\big(\beta_0+\tfrac12\|C_0^{-1/2}u^{(k)}_\delta\|^2_{\mathbb{R}^N}\big)^{-2}\Big]\Bigg)^{1/2}.
\]
The square root of the first expectation on the right hand side of the inequality is of order \(N^{-1/2}\), while by Lemma 7.1 the square root of the second expectation is of order \(N^{-1}\), for almost all y. Combining, we get that \(\mathbb{E}_\zeta[e_{1,N}]=O(N^{-3/2})\), almost surely with respect to y, hence
\[
\mathbb{E}\big[\delta^{(k+1)}_N-\delta^{(k)}_N\,\big|\,\delta^{(k)}_N=\delta\big]
=\frac{2}{N}\big((1+\alpha_0)\delta-\mathbb{E}_\zeta[F_N]\delta^2\big)+O(N^{-3/2}),
\]
y-almost surely.

For the variance of the step, we have
\[
\mathrm{Var}\big[\delta^{(k+1)}_N-\delta^{(k)}_N\,\big|\,\delta^{(k)}_N=\delta\big]
=\mathbb{E}\big[(\delta^{(k+1)}_N-\delta^{(k)}_N)^2\,\big|\,\delta^{(k)}_N=\delta\big]
-\Big(\mathbb{E}\big[\delta^{(k+1)}_N-\delta^{(k)}_N\,\big|\,\delta^{(k)}_N=\delta\big]\Big)^2,
\]
where by the first part of the proof the second term is \(O(N^{-2})\). Thus we need only consider the first term, which will be shown to be \(O(N^{-1})\). By equation (17) we have
\[
\mathbb{E}\big[(\delta^{(k+1)}_N-\delta^{(k)}_N)^2\,\big|\,\delta^{(k)}_N=\delta\big]
=\delta^2\,\mathbb{E}\Bigg[\Bigg(\frac{\frac{2\alpha_0}{N}+\sqrt{\frac{2}{N}+\frac{4\alpha_0}{N^2}}\,W_{1,N}-\sqrt{\frac{2}{N}}\,W_{2,N}-\frac{2F_N\delta}{N}}{1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}}\Bigg)^2\Bigg]
=\delta^2\,\mathbb{E}\Bigg[\frac{\frac{2W_{1,N}^2}{N}+\frac{2W_{2,N}^2}{N}+\frac{V_N}{N^{3/2}}}{\big(1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}\big)^2}\Bigg],
\]
where the random variable \(V_N\) depends only on \(W_{2,N}\) and \(F_N\) and has higher order moments which are bounded uniformly in N, y-almost surely (the dependence on \(W_{1,N}\) disappears by the independence of \(W_{1,N}\) and ζ and the fact that \(W_{1,N}\) has mean zero and variance one). Using the identity
\[
\frac{1}{(1+x)^2}=1-2x+\frac{x^2(3+2x)}{(1+x)^2},
\]
we get
\[
\mathbb{E}\big[(\delta^{(k+1)}_N-\delta^{(k)}_N)^2\,\big|\,\delta^{(k)}_N=\delta\big]
=\delta^2\,\mathbb{E}\Bigg[\Big(\frac{2W_{1,N}^2}{N}+\frac{2W_{2,N}^2}{N}+\frac{V_N}{N^{3/2}}\Big)\Big(1-2\sqrt{\frac{2}{N}}\,W_{2,N}-\frac{4F_N\delta}{N}\Big)\Bigg]+\mathbb{E}[e_{2,N}],
\]
where
\[
e_{2,N}=\delta^2\,\Big(\frac{2W_{1,N}^2}{N}+\frac{2W_{2,N}^2}{N}+\frac{V_N}{N^{3/2}}\Big)\,
\frac{\big(\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}\big)^2\Big(3+2\big(\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}\big)\Big)}{\big(1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}\big)^2}
:=\frac{E_N\,\delta^2}{\big(1+\sqrt{\frac{2}{N}}\,W_{2,N}+\frac{2F_N\delta}{N}\big)^2}.
\]
Using the fact that y-almost surely \(W_{2,N}\), \(F_N\) and \(V_N\) have moments of all positive orders which are bounded uniformly in N, by Hölder's inequality (we do not need to consider higher order moments for \(W_{1,N}\) here, because it is independent of \(W_{2,N}\) and \(F_N\), hence bounding the terms involving \(W_{1,N}\) does not require the use of Hölder's inequality, which needs higher moments), we get that
\[
\mathbb{E}\big[(\delta^{(k+1)}_N-\delta^{(k)}_N)^2\,\big|\,\delta^{(k)}_N=\delta\big]
=\frac{2\delta^2}{N}\big(\mathbb{E}[W_{1,N}^2]+\mathbb{E}[W_{2,N}^2]\big)+O(N^{-3/2})+\mathbb{E}[e_{2,N}],
\]
y-almost surely. For the residual \(e_{2,N}\), as before, using the Cauchy–Schwarz inequality and (12),
\[
\mathbb{E}[e_{2,N}]\le\frac{N^2}{4}\big(\mathbb{E}[E_N^2]\big)^{1/2}\Bigg(\mathbb{E}\Big[\big(\beta_0+\tfrac12\|C_0^{-1/2}u^{(k)}_\delta\|^2_{\mathbb{R}^N}\big)^{-4}\Big]\Bigg)^{1/2}.
\]
Since by Lemma 7.5 the first four moments of \(W_{1,N}\) are also bounded uniformly in N, the square root of the first expectation on the right hand side is of order \(N^{-2}\), while by Lemma 7.1 the square root of the second expectation is of order \(N^{-2}\), for almost all y. Combining, we get \(\mathbb{E}[e_{2,N}]=O(N^{-2})\), almost surely with respect to y, hence, since \(\mathbb{E}[W_{1,N}^2]=\mathbb{E}[W_{2,N}^2]=1\),
\[
\mathbb{E}\big[(\delta^{(k+1)}_N-\delta^{(k)}_N)^2\,\big|\,\delta^{(k)}_N=\delta\big]=\frac{4\delta^2}{N}+O(N^{-3/2}),
\]
y-almost surely. Concatenating, we get the result.

Proof.
Let \(\{e_j\}_{j=1}^N\) be any orthonormal basis of \(\mathbb{R}^N\) (with respect to the possibly scaled norm \(\|\cdot\|_{\mathbb{R}^N}\)) and for any \(w\in\mathbb{R}^N\) write \(w_j:=\langle w,e_j\rangle_{\mathbb{R}^N}\). We then have that \(\zeta=\sum_{j=1}^N\zeta_je_j\), where \(\{\zeta_j\}_{j=1}^N\) is a sequence of independent standard normal random variables. Using (10) we have
\[
\|C_0^{-1/2}u^{(k)}_\delta\|^2_{\mathbb{R}^N}
=\|C_0^{-1/2}m_{\lambda,\delta}(y)\|^2_{\mathbb{R}^N}
+\|C_0^{-1/2}C_{\lambda,\delta}^{1/2}\zeta\|^2_{\mathbb{R}^N}
+2\big\langle C_0^{-1/2}m_{\lambda,\delta}(y),\,C_0^{-1/2}C_{\lambda,\delta}^{1/2}\zeta\big\rangle_{\mathbb{R}^N}
:=A_N+B_N+C_N.
\]
Under Assumptions 3.2, we can analyze each term as follows:

A) By Assumption 3.2(i), for almost all data y, this term and all its positive integer powers are bounded uniformly in N.

B) For the second term we have
\[
\|C_0^{-1/2}C_{\lambda,\delta}^{1/2}\zeta\|^2_{\mathbb{R}^N}
=\big\langle C_{\lambda,\delta}^{1/2}C_0^{-1}C_{\lambda,\delta}^{1/2}\zeta,\,\zeta\big\rangle_{\mathbb{R}^N}
=\delta^{-1}\big\langle C_{\lambda,\delta}^{1/2}\big(C_{\lambda,\delta}^{-1}-\lambda K^*C_1^{-1}K\big)C_{\lambda,\delta}^{1/2}\zeta,\,\zeta\big\rangle_{\mathbb{R}^N}
=\delta^{-1}\|\zeta\|^2_{\mathbb{R}^N}-\delta^{-1}\lambda\|C_1^{-1/2}KC_{\lambda,\delta}^{1/2}\zeta\|^2_{\mathbb{R}^N}
:=b_{1,N}-b_{2,N},
\]
where

b1) \(b_{1,N}=\delta^{-1}\|\zeta\|^2_{\mathbb{R}^N}=\frac{N}{\delta}+\frac{1}{\delta}\sum_{j=1}^N(\zeta_j^2-1):=\frac{N}{\delta}+\frac{\sqrt{2N}}{\delta}W_{2,N}\), where, as \(N\to\infty\), \(W_{2,N}=\frac{1}{\sqrt{2N}}\sum_{j=1}^N(\zeta_j^2-1)\) converges weakly to a standard normal random variable by the Central Limit Theorem, and by Lemma 7.4 has all positive integer moments bounded uniformly in N;

b2) for \(b_{2,N}\) we have by Lemma 7.6(ii) that \(\mathbb{E}_\zeta[b_{2,N}]\) is uniformly bounded in N. In fact, using Lemma 7.3 together with Lemma 7.6(ii), we get that for any \(p\in\mathbb{N}\), \(\mathbb{E}_\zeta[b_{2,N}^p]\) is bounded independently of N.

C) For the third term we have
\[
\big\langle C_0^{-1/2}m_{\lambda,\delta}(y),\,C_0^{-1/2}C_{\lambda,\delta}^{1/2}\zeta\big\rangle_{\mathbb{R}^N}
=\big\langle(C_0^{-1/2}C_{\lambda,\delta}^{1/2})^*C_0^{-1/2}m_{\lambda,\delta}(y),\,\zeta\big\rangle_{\mathbb{R}^N}
=\sum_{j=1}^N\big((C_0^{-1/2}C_{\lambda,\delta}^{1/2})^*C_0^{-1/2}m_{\lambda,\delta}(y)\big)_j\,\zeta_j.
\]
It holds that
\[
\sum_{j=1}^N\big((C_0^{-1/2}C_{\lambda,\delta}^{1/2})^*C_0^{-1/2}m_{\lambda,\delta}(y)\big)_j^2
=\big\|(C_0^{-1/2}C_{\lambda,\delta}^{1/2})^*C_0^{-1/2}m_{\lambda,\delta}(y)\big\|^2_{\mathbb{R}^N},
\]
and we claim that the norm on the right hand side is uniformly bounded in N, y-almost surely. Indeed, by (5), the Cauchy–Schwarz inequality and the non-negative definiteness of the matrix \(C_0^{1/2}K^*C_1^{-1}KC_0^{1/2}\), we have
\[
\big\|(C_0^{-1/2}C_{\lambda,\delta}^{1/2})^*u\big\|^2_{\mathbb{R}^N}
=\big\langle C_0^{-1/2}C_{\lambda,\delta}C_0^{-1/2}u,\,u\big\rangle_{\mathbb{R}^N}
=\big\langle\delta^{-1}\big(I+\lambda\delta^{-1}C_0^{1/2}K^*C_1^{-1}KC_0^{1/2}\big)^{-1}u,\,u\big\rangle_{\mathbb{R}^N}
\le\big\|\delta^{-1}\big(I+\lambda\delta^{-1}C_0^{1/2}K^*C_1^{-1}KC_0^{1/2}\big)^{-1}u\big\|_{\mathbb{R}^N}\,\big\|u\big\|_{\mathbb{R}^N}
\le\delta^{-1}\big\|u\big\|^2_{\mathbb{R}^N}.
\]
Combining with Assumption 3.2(i), we get the claim, and therefore by Lemma 7.2 below we get that the third term has, y-almost surely, all even moments uniformly bounded in N.

We define \(F_N=\beta_0+\tfrac12\big(A_N-b_{2,N}+C_N\big)\) and observe that, since all terms have bounded moments of every order uniformly in N, y-almost surely, Hölder's inequality secures that \(F_N\) also has bounded moments of every order uniformly in N, almost surely with respect to y.

Lemma 7.1.
Let \(u^{(k)}_\delta\) be as in (10), for any \(\delta,\lambda>0\). Under Assumptions 3.2, we have
\[
\mathbb{E}_\zeta\Big[\big(\beta_0+\tfrac12\|C_0^{-1/2}u^{(k)}_\delta\|^2_{\mathbb{R}^N}\big)^{-2i}\Big]=O(N^{-2i}),\quad\text{as }N\to\infty,
\]
almost surely with respect to y, for \(i=1,2\).
Without loss of generality we consider the case \(\delta=\lambda=1\) and drop the λ and δ dependence in u, m and C. To de-clutter our notation we also drop the dependence of m on the data. Since \(\beta_0\ge0\), it suffices to consider the case \(\beta_0=0\). Formally, the random variable \(\|C_0^{-1/2}u^{(k)}\|^2_{\mathbb{R}^N}\) behaves like a chi-squared random variable with N degrees of freedom. We estimate the squared norm by a random variable \(Y_N\) of known moment generating function \(M_{Y_N}(t)\), and use the following formula from [15] for the calculation of negative moments of nonnegative random variables:
\[
\mathbb{E}[Y_N^{-l}]=\Gamma(l)^{-1}\int_0^\infty t^{l-1}M_{Y_N}(-t)\,dt,\qquad l\in\mathbb{N}.\tag{18}
\]
We begin by showing that there exists a constant \(c>0\), independent of N, such that \(\|C^{-1/2}C_0^{1/2}v\|^2_{\mathbb{R}^N}\le c\|v\|^2_{\mathbb{R}^N}\) for any \(v\in\mathbb{R}^N\). We have
\[
\|C^{-1/2}C_0^{1/2}v\|^2_{\mathbb{R}^N}
=\big\langle C_0^{1/2}C^{-1}C_0^{1/2}v,\,v\big\rangle_{\mathbb{R}^N}
=\big\langle\big(I+C_0^{1/2}K^*C_1^{-1}KC_0^{1/2}\big)v,\,v\big\rangle_{\mathbb{R}^N}
=\|v\|^2_{\mathbb{R}^N}+\|C_1^{-1/2}KC_0^{1/2}v\|^2_{\mathbb{R}^N}
\le(1+c_2)\|v\|^2_{\mathbb{R}^N},
\]
by Lemma 7.6(iii), which proves the claim with \(c=1+c_2\). The proved claim gives the estimate
\[
\|C_0^{-1/2}u^{(k)}\|^2_{\mathbb{R}^N}
=\|C_0^{-1/2}(m+C^{1/2}\zeta)\|^2_{\mathbb{R}^N}
=\|C_0^{-1/2}C^{1/2}(C^{-1/2}m+\zeta)\|^2_{\mathbb{R}^N}
\ge c^{-1}\|C^{-1/2}m+\zeta\|^2_{\mathbb{R}^N},
\]
hence it suffices to show that, almost surely with respect to y, we have \(\mathbb{E}_\zeta[Y_N^{-2i}]=O(N^{-2i})\) for \(i=1,2\), where \(Y_N:=\|C^{-1/2}m+\zeta\|^2_{\mathbb{R}^N}\). Indeed, let \(\{e_j\}_{j=1}^N\) be any orthonormal basis of \(\mathbb{R}^N\) (with respect to the possibly scaled norm \(\|\cdot\|_{\mathbb{R}^N}\)), and define \(w_j:=\langle w,e_j\rangle_{\mathbb{R}^N}\) for any \(w\in\mathbb{R}^N\). Then we have
\[
Y_N=\sum_{j=1}^N\big((C^{-1/2}m)_j+\zeta_j\big)^2,
\]
where \(\zeta_j\sim N(0,1)\) are the mutually independent components of the white noise ζ and the \((C^{-1/2}m)_j\) are independent of ζ; therefore \(Y_N\) is a non-central chi-squared random variable with N degrees of freedom and non-centrality parameter \(p_N:=\sum_{j=1}^N(C^{-1/2}m)_j^2\ge0\). The definition and properties of the non-central chi-squared distribution can be found in [21], where in particular we find the moment generating function of \(Y_N\),
\[
M_{Y_N}(t)=(1-2t)^{-N/2}\exp\Big(\frac{p_Nt}{1-2t}\Big),
\]
hence using (18) we have, for \(i=1,2\),
\[
\mathbb{E}_\zeta[Y_N^{-2i}]
=\Gamma(2i)^{-1}\int_0^\infty t^{2i-1}(1+2t)^{-N/2}\exp\Big(-\frac{p_Nt}{1+2t}\Big)\,dt
\le c\int_0^\infty t^{2i-1}(1+2t)^{-N/2}\,dt=O(N^{-2i}),
\]
provided \(N>4i\), where the last integral can be calculated by integration by parts.

Lemma 7.2.
Let \(\{X_j\}\) be a sequence of random variables such that \(X_j=c_jY_j\), where the \(Y_j\), \(j\in\mathbb{N}\), are independent and identically distributed random variables with finite even moments up to order 2r, \(r\in\mathbb{N}\), and zero odd moments, and the \(c_j\), \(j\in\mathbb{N}\), are deterministic real numbers. Then for any \(N\in\mathbb{N}\),
\[
\mathbb{E}\Big[\Big(\sum_{j=1}^NX_j\Big)^{2r}\Big]\le\kappa\Big(\sum_{j=1}^Nc_j^2\Big)^{r},
\]
where \(\kappa>0\) depends only on r and \(\mathbb{E}[Y_1^{2r}]\), and is independent of N.

Proof. Denote by \(m_n\) the 2n-th moment of \(Y_1\), \(m_n=\mathbb{E}[Y_1^{2n}]\). Observe that since, by Hölder's inequality, for \(0<s\le t\) we have \(\mathbb{E}[|Y_1|^s]^{1/s}\le\mathbb{E}[|Y_1|^t]^{1/t}\), it follows that for \(n_1,\dots,n_q>0\) with \(n_1+\dots+n_q=r\),
\[
m_{n_1}\cdots m_{n_q}\le\mathbb{E}[Y_1^{2r}]^{\frac{n_1+\dots+n_q}{r}}=\mathbb{E}[Y_1^{2r}].
\]
Combining with the fact that the random variables \(Y_j\) are independent with zero odd moments, so that in the multinomial expansion only terms in which every index appears an even number of times survive,
\[
\mathbb{E}\Big[\Big(\sum_{j=1}^NX_j\Big)^{2r}\Big]
=\sum_{j=1}^Nc_j^{2r}m_r
+\sum_{j\ne j'}\binom{2r}{2}c_j^{2r-2}m_{r-1}c_{j'}^{2}m_1+\dots
\le\kappa\Big(\sum_{j=1}^Nc_j^2\Big)^{r},
\]
where κ collects the finitely many, r-dependent multinomial coefficients multiplied by \(m_r\).

Lemma 7.3.
For any \(p\in\mathbb{N}\), there exists a constant \(c=c(p)\ge1\), independent of N, such that for any centered Gaussian random variable \(x_N\) in \(\mathbb{R}^N\), it holds that
\[
\mathbb{E}\big[\|x_N\|^{2p}_{\mathbb{R}^N}\big]\le c(p)\big(\mathbb{E}\big[\|x_N\|^{2}_{\mathbb{R}^N}\big]\big)^{p}.
\]
Proof.
Direct consequence of [17, Corollary 2.17].
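Formula (18), used in the proof of Lemma 7.1, can be checked numerically in the central chi-squared case, where \(\mathbb{E}[Y^{-1}]=1/(N-2)\) is known in closed form (the grid, cutoff and sample sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10                                    # degrees of freedom, N > 2

# MGF of a central chi-squared with N degrees of freedom: M(t) = (1-2t)**(-N/2),
# so formula (18) with l = 1 reads  E[1/Y] = integral_0^inf (1+2t)**(-N/2) dt.
t = np.linspace(0.0, 200.0, 200_001)
integrand = (1.0 + 2.0 * t) ** (-N / 2)
via_mgf = np.sum((integrand[1:] + integrand[:-1]) / 2) * (t[1] - t[0])  # trapezoid

closed_form = 1.0 / (N - 2)               # exact negative first moment
monte_carlo = np.mean(1.0 / rng.chisquare(N, size=200_000))
```

The truncation of the integral at t = 200 is harmless here since the integrand decays like \(t^{-N/2}\).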
Lemma 7.4.
Let \((\gamma_j)_{j\in\mathbb{N}}\) be a sequence of independent standard normal random variables and define
\[
G_N:=\frac{1}{\sqrt{2N}}\sum_{j=1}^N(\gamma_j^2-1).
\]
Then all the positive integer moments of \(G_N\) are bounded uniformly in N.

Proof. For \(k\in\mathbb{N}\), we have
\[
\mathbb{E}[G_N^k]=(2N)^{-\frac{k}{2}}\sum_{j_1,\dots,j_k=1}^N\mathbb{E}\big[(\gamma_{j_1}^2-1)\cdots(\gamma_{j_k}^2-1)\big].
\]
Since the \(\gamma_j^2-1\) are independent and identically distributed, the order of the sum in N is determined by the total number of non-zero terms in the summation. By independence and the fact that \(\mathbb{E}[\gamma_j^2-1]=0\), all the terms in the sum which contain an index \(j_i\) occurring only once in the product are equal to zero. We thus have that if k is even the sum on the right hand side is of order \(N^{k/2}\), while if k is odd it is of order \(N^{(k-1)/2}\). In both cases the k-th moment of \(G_N\) is bounded uniformly in N.

Lemma 7.5.
Let \(\Gamma_N\sim\mathrm{Gamma}(\alpha_0+\tfrac{N}{2},\tfrac{N}{2})\), for \(\alpha_0>0\), and define
\[
\Theta_N:=\frac{\Gamma_N-1-\frac{2\alpha_0}{N}}{\sqrt{\frac{2}{N}+\frac{4\alpha_0}{N^2}}}.
\]
Then the first four moments of \(\Theta_N\) are bounded uniformly in N.

Proof. The random variable \(\mathrm{Gamma}(a,1)\) has mean and variance a, and third and fourth central moments 2a and \(3a^2+6a\) respectively, [20]. Hence, by the scaling property of the gamma distribution, \(\Gamma_N\stackrel{\mathcal{L}}{=}\frac{2}{N}\,\mathrm{Gamma}(\alpha_0+\tfrac{N}{2},1)\) has mean \(1+\frac{2\alpha_0}{N}\), variance \(\frac{2}{N}+\frac{4\alpha_0}{N^2}\), and third and fourth central moments which are both of order \(N^{-2}\). It is thus straightforward to see that \(\Theta_N\) has mean zero, variance equal to one, and, since the denominator in \(\Theta_N\) is of order \(N^{-1/2}\), third and fourth moments which are \(O(N^{-1/2})\) and \(O(1)\) respectively.
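The scaling in Lemma 7.5 is easy to verify by simulation (the values of α₀, N and the sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
a0 = 1.0                                   # plays the role of alpha_0

# Gamma(a0 + N/2, rate N/2) has mean 1 + 2 a0/N and variance 2/N + 4 a0/N**2;
# the standardized Theta_N should have mean ~ 0, variance ~ 1, and a fourth
# moment that stays bounded (tending to 3) as N grows.
stats = {}
for N in (10, 100, 10_000):
    g = rng.gamma(a0 + N / 2, scale=2.0 / N, size=500_000)   # scale = 1/rate
    theta = (g - 1 - 2 * a0 / N) / np.sqrt(2 / N + 4 * a0 / N ** 2)
    stats[N] = (theta.mean(), (theta ** 2).mean(), (theta ** 4).mean())
```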
Under Assumptions 3.2, we have that for any \(\lambda,\delta>0\):

i) \(\mathrm{Tr}\big(C_1^{-1/2}KC_{\lambda,\delta}K^*C_1^{-1/2}\big)\le c_2\delta^{-1}\);

ii) \(\mathbb{E}_\theta\|C_1^{-1/2}KC_{\lambda,\delta}^{1/2}\theta\|^2_{\mathbb{R}^N}\le c_2\delta^{-1}\), where θ is a Gaussian white noise in \(\mathbb{R}^N\);

iii) \(\|C_1^{-1/2}KC_0^{1/2}\|_{2,N}\le\sqrt{c_2}\);

where \(c_2\) is defined in Assumption 3.2(ii).

Proof. i) By (5), we have
\[
C_1^{-1/2}KC_{\lambda,\delta}K^*C_1^{-1/2}
=\delta^{-1}C_1^{-1/2}KC_0^{1/2}\big(I+\lambda\delta^{-1}C_0^{1/2}K^*C_1^{-1}KC_0^{1/2}\big)^{-1}C_0^{1/2}K^*C_1^{-1/2},
\]
hence the fact that, for any matrix \(A\in\mathbb{R}^{N\times N}\) and any \(c>0\), it holds that \(\mathrm{Tr}(A(I+cA^*A)^{-1}A^*)\le\mathrm{Tr}(AA^*)\), together with Assumption 3.2(ii), gives the claim.

ii) It is well known that for \(x\sim N(0,\Sigma)\), \(\mathbb{E}\|x\|^2_{\mathbb{R}^N}=\mathrm{Tr}(\Sigma)\). Since for \(\theta\sim N(0,I)\) we have \(C_1^{-1/2}KC_{\lambda,\delta}^{1/2}\theta\sim N(0,\,C_1^{-1/2}KC_{\lambda,\delta}K^*C_1^{-1/2})\), the claim follows from part (i).

iii) It is well known that for any matrix \(A\in\mathbb{R}^{N\times N}\), the Euclidean norm satisfies \(\|A\|_{2,N}=\|A^*\|_{2,N}=\sqrt{\rho(A^*A)}\le\sqrt{\mathrm{Tr}(A^*A)}\), where \(\rho(B)\) is the spectral radius of the matrix B. Hence we have
\[
\|C_1^{-1/2}KC_0^{1/2}\|_{2,N}\le\sqrt{\mathrm{Tr}\big(C_1^{-1/2}KC_0K^*C_1^{-1/2}\big)}\le\sqrt{c_2},
\]
by Assumption 3.2(ii).

References

[1] R. J. Adler.
An introduction to continuity, extrema, and related topics for general Gaussian processes. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 12. Institute of Mathematical Statistics, Hayward, CA, 1990.

[2] S. Agapiou. Aspects of Bayesian inverse problems. Ph.D. Thesis, University of Warwick, 2013.

[3] S. Agapiou, S. Larsson, and A. M. Stuart. Posterior contraction rates for the Bayesian approach to linear ill-posed inverse problems. Stochastic Processes and their Applications, 2013.

[4] S. Agapiou, A. M. Stuart, and Y. X. Zhang. Bayesian posterior contraction rates for linear severely ill-posed inverse problems. arXiv:1210.1563, 2012.

[5] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, pages 697–725, 2009.

[6] J. M. Bardsley. MCMC-based image reconstruction with uncertainty quantification. SIAM J. Sci. Comput., 34(3):A1316–A1332, 2012.

[7] J. M. Bernardo and A. F. M. Smith. Bayesian theory, volume 405. Wiley, 2009.

[8] N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart. Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM Journal on Numerical Analysis, 45(6):2610–2636, 2007.

[9] The Annals of Statistics, 24(6):2384–2398, 1996.

[10] D. Calvetti, H. Hakula, S. Pursiainen, and E. Somersalo. Conditionally Gaussian hypermodels for cerebral source localization. SIAM Journal on Imaging Sciences, 2(3):879–909, 2009.

[11] D. Calvetti and E. Somersalo. Hypermodels in the Bayesian imaging framework. Inverse Problems, 24(3):034013, 2008.

[12] L. Cavalier. Nonparametric statistical inverse problems. Inverse Problems, 24(3):034004, 2008.

[13] S. L. Cotter, M. Dashti, and A. M. Stuart. Approximation of Bayesian inverse problems for PDEs. SIAM Journal on Numerical Analysis, 48(1):322–345, 2010.

[14] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.

[15] N. Cressie, A. S. Davis, J. L. Folks, and G. E. Policello II. The moment-generating function and negative integer moments. The American Statistician, 35(3):148–150, 1981.

[16] G. Da Prato. An introduction to infinite-dimensional analysis. Springer, 2005.

[17] G. Da Prato and J. Zabczyk. Stochastic equations in infinite dimensions, volume 44 of Encyclopedia of Mathematics and its Applications, 1992.

[18] M. Filippone and M. Girolami. Pseudo-Marginal Bayesian inference for Gaussian processes. 2013.

[19] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules.
Bayesian Statistics, 5:599–608, 1996.

[20] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous univariate distributions. Vol. 1. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York, second edition, 1994. A Wiley-Interscience Publication.

[21] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous univariate distributions. Vol. 2. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York, second edition, 1995. A Wiley-Interscience Publication.

[22] I. M. Johnstone, G. Kerkyacharian, D. Picard, and M. Raimondo. Wavelet deconvolution in a periodic setting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(3):547–573, 2004.

[23] J. P. Kaipio and E. Somersalo. Statistical and computational inverse problems, volume 160. Springer, 2005.

[24] B. T. Knapik, B. T. Szabó, A. W. van der Vaart, and J. H. van Zanten. Bayes procedures for adaptive inference in nonparametric inverse problems. arXiv:1209.3628, 2012.

[25] B. T. Knapik, A. W. van der Vaart, and J. H. van Zanten. Bayesian inverse problems with Gaussian priors. The Annals of Statistics, 39(5):2626–2657, 2011.

[26] B. T. Knapik, A. W. van der Vaart, and J. H. van Zanten. Bayesian recovery of the initial condition for the heat equation. Communications in Statistics - Theory and Methods, 42(7):1294–1313, 2013.

[27] M. S. Lehtinen, L. Paivarinta, and E. Somersalo. Linear inverse problems for generalised random variables. Inverse Problems, 5(4):599, 1989.

[28] Journal of the Royal Statistical Society. Series B (Methodological), pages 1–41, 1972.

[29] A. Mandelbaum. Linear estimators and measurable linear transformations on a Hilbert space.
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 65(3):385–397, 1984.

[30] F. van der Meulen, M. Schauer, and J. H. van Zanten. Reversible jump MCMC for nonparametric drift estimation for diffusion processes. Computational Statistics & Data Analysis, 2013.

[31] O. Papaspiliopoulos, Y. Pokern, G. O. Roberts, and A. M. Stuart. Nonparametric estimation of diffusions: a differential equations approach. Biometrika, 99(3):511–531, 2012.

[32] O. Papaspiliopoulos and G. O. Roberts. Stability of the Gibbs sampler for Bayesian hierarchical models. Ann. Statist., 36(1):95–117, 2008.

[33] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. Non-centered parameterisations for hierarchical models and data augmentation. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting. Oxford University Press, USA, 2003.

[34] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59–73, 2007.

[35] Y. Pokern, A. M. Stuart, and J. H. van Zanten. Posterior consistency via precision operators for Bayesian nonparametric drift estimation in SDEs. Stochastic Process. Appl., 123(2):603–628, 2013.

[36] K. Ray. Bayesian inverse problems with non-conjugate priors. Electronic Journal of Statistics, 7:2516–2549, 2013.

[37] G. O. Roberts and O. Stramer. On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika, 88(3):603–621, 2001.

[38] C. Sherlock and G. O. Roberts. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli, 15(3):774–798, 2009.

[39] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numer., 19:451–559, 2010.

[40] B. T. Szabó, A. W. van der Vaart, and J. H. van Zanten. Empirical Bayes scaling of Gaussian priors in the white noise model. Electronic Journal of Statistics, 7:991–1018, 2013.

[41] Y. Yu and X. L. Meng. To center or not to center: that is not the question—an ancillarity–sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.

[42] L. H. Zhao. Bayesian aspects of some nonparametric problems.