Data-Free Likelihood-Informed Dimension Reduction of Bayesian Inverse Problems
DData-Free Likelihood-Informed Dimension Reduction ofBayesian Inverse Problems
Tiangang Cui , Olivier Zahm School of Mathematics, Monash University, VIC 3800, Australia Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, FranceE-mail: [email protected], [email protected]
Abstract.
Identifying a low-dimensional informed parameter subspace offers a viable path toalleviating the dimensionality challenge in the sampled-based solution to large-scale Bayesianinverse problems. This paper introduces a novel gradient-based dimension reduction methodin which the informed subspace does not depend on the data. This permits an online-offlinecomputational strategy where the expensive low-dimensional structure of the problem is detectedin an offline phase, meaning before observing the data. This strategy is particularly relevantfor multiple inversion problems as the same informed subspace can be reused. The proposedapproach allows controlling the approximation error (in expectation over the data) of the posteriordistribution. We also present sampling strategies that exploit the informed subspace to drawefficiently samples from the exact posterior distribution. The method is successfully illustratedon two numerical examples: a PDE-based inverse problem with a Gaussian process prior and atomography problem with Poisson data and a Besov- B prior.
1. Introduction
The Bayesian approach to inverse problems builds a probabilistic representation of the parameterof interest conditioned on observed data. Denoting the parameter and data by x ∈ R d and y ∈ R m ,respectively, the solution to the inverse problem is encapsulated in the posterior distribution, whichhas the probability density function (pdf) π y pos ( x ) = 1 π data ( y ) L y ( x ) π pr ( x ) , (1)where π pr ( x ) denotes the prior density, L y ( x ) is the likelihood function, and π data ( y ) is the marginaldensity of the data y that can be expressed as π data ( y ) = (cid:90) R d L y ( x ) π pr ( x ) d x. (2)This way, one can encode the posterior into summary statistics, for example, moments, quantiles,or probabilities of some events of interest [31, 51, 52], to provide parameter inference and associateduncertainty quantification.In practice, computing these summary statistics requires dedicated methods to efficientlycharacterize the posterior distribution. Markov chain Monte Carlo (MCMC) methods [9, 38], a r X i v : . [ s t a t . C O ] F e b ata-Free Likelihood-Informed Dimension Reduction data-free strategy for constructing the informed subspace in which the computationally costly subspaceconstruction can be performed in an offline phase , meaning before observing any data sets. Inthe subsequent online phase , the data set is observed and the precomputed informed subspace isutilized to accelerate the inversion process. This computational strategy is particularly relevant forreal-time systems such as medical imaging where multiple inversions are needed.The rest of the paper is organized as follows. To begin, we introduce the problem setting inSection 2. In Section 3, we present a new data-free likelihood-informed approach to constructthe subspace. Denoting the Fisher information matrix of the likelihood function by I ( x ) = (cid:82) ( ∇ log L y ( x ))( ∇ log L y ( x )) (cid:62) L y ( x )d y , this approach defines the informed subspace as the rank- r dominant eigenspace of the matrix H = (cid:90) I ( x ) π pr ( x )d x, (3)with r (cid:28) d . This definition makes no particular assumption on the likelihood function, so it can beapplied to a wide range of measurement processes, e.g., Gaussian likelihood and Poisson likelihood.It also does not involve any particular data set y , and hence can be constructed offline. Given theinformed subspace, we approximate the posterior density π y pos ( x ) by (cid:101) π y pos ( x ) = (cid:101) π y pos ( x r ) π pr ( x ⊥ | x r ) , (4)where x r and x ⊥ denote respectively the informed and the non-informed components of x . We provethat the expected Kullback-Leibler (KL) divergence of the full posterior from its approximation is ata-Free Likelihood-Informed Dimension Reduction E [ D KL ( π Y pos || (cid:101) π Y pos )] ≤ κ (cid:88) i>r λ i ( H ) , (5)where the expectation is taken over the data Y ∼ π data ( y ), κ being the subspace Poincar´e constantof the prior [53, 54] and λ i ( H ) the i -th largest eigenvalue of H . This way, a problem with a fast decayin the spectrum of H yields an accurate low-dimensional posterior approximation in expectationover the data.In Section 4, we restrict the analysis to Gaussian likelihood. In this case, we show thatthe vector-valued extension [53] of the AS method [12], which reduces parameter dimensions viaapproximating forward models, also leads to the same data-free informed subspace as that obtainedusing (3). 
We can further show that, although the likelihood-informed approach and AS employdifferent approximations to the posterior density, the resulting approximations share the samestructure as shown in (4) and follow the same error bound as in (5).As suggested by (4), the factorized form of the approximate posterior densities allowsfor dimension-robust sampling. One can explore the low-dimensional intractable parameterreduced posterior (cid:101) π y pos ( x r ) using methods such as MCMC, followed by direct sampling of thehigh-dimensional but tractable conditional prior π pr ( x ⊥ | x r ). This strategy has been previouslyinvestigated, see [18, 54] and references therein. We provide a brief summary to this existingsampling strategy in Section 5. Despite the accelerated sampling offered by the informed subspace,the resulting inference results are subject to the dimension truncation error that is bounded in(5). In Section 6, by integrating the pseudo-marginal approach [1, 2] and the surrogate transitionapproach [11, 38, 39] into the abovementioned sampling strategy, we present new exact inference algorithms that can enjoy the same subspace acceleration while target on the full posterior. Ourexact inference algorithms only require minor modifications to the sampling strategy of [18, 54].While our dimension reduction method readily apply for Gaussian priors, its application tonon-Gaussian priors might not be straightforward. In Section 7, we show how to use the proposemethod for problems with Besov priors [21, 32, 34] which are commonly used in image reconstructionproblems.We demonstrate the accuracy of the proposed data-free LIS and the efficiency of new samplingstrategies on a range of problems. These include the identification of the diffusion coefficient of atwo-dimensional elliptic partial differential equation (PDE) with a Gaussian prior in Section 8 andPositron emission tomography (PET) with Poisson data and a Besov prior in Section 9.
2. Problem setting
For high-dimensional ill-posed inverse problems, the data are often informative only along a fewdirections in the parameter space. To detect and exploit this low-dimensional structure, weintroduce a projector P r ∈ R d × d of rank r (cid:28) d such that Im( P r ) is the informed subspace andKer( P r ) the non-informed one. This splits the parameter space as R d = Im( P r ) ⊕ Ker( P r ) , where the subspaces Im( P r ) and Ker( P r ) are not necessarily orthogonal unless P r is orthogonal.The fact that the data are only informative in Im( P r ) means there exists an approximation to theposterior density π y pos ( x ) ∝ L y ( x ) π pr ( x ) under the form (cid:101) π y pos ( x ) ∝ (cid:101) L y ( P r x ) π pr ( x ) , (6) ata-Free Likelihood-Informed Dimension Reduction x (cid:55)→ L y ( x ) is replaced by a ridge function x (cid:55)→ (cid:101) L y ( P r x ). Aridge function [46] is a function which is constant on a subspace, here Ker( P r ). Let x r = P r x and x ⊥ = ( I d − P r ) x be the components of x in Im( P r ) and Ker( P r ), respectively. We have theparameter decomposition x = x r + x ⊥ . Using a slight abuse of notation, we factorize the prior density as π pr ( x ) = π pr ( x r ) π pr ( x ⊥ | x r ), where π pr ( x r ) = (cid:90) Ker( P r ) π pr ( x r + x (cid:48)⊥ )d x (cid:48)⊥ and π pr ( x ⊥ | x r ) = π pr ( x r + x ⊥ ) /π pr ( x r )denote the marginal prior and the conditional prior. The approximate posterior (6) writes (cid:101) π y pos ( x r + x ⊥ ) ∝ (cid:0) (cid:101) L y ( x r ) π pr ( x r ) (cid:1)(cid:124) (cid:123)(cid:122) (cid:125) (cid:101) π y pos ( x r ) π pr ( x ⊥ | x r ) . This factorization shows that, under the approximate posterior density, the Bayesian update iseffective on the informed subspace Im( P r ) (first term (cid:101) π y pos ( x r )), while the non-informed subspaceKer( P r ) is characterized by the prior (second term π pr ( x ⊥ | x r )). This property will be exploitedlater on to design efficient sampling strategies for exploring both the approximate posterior and thefull posterior.The challenge of dimension reduction is to construct both the low-rank projector P r and theridge approximation (cid:101) L y such that the KL divergence of the full posterior from its approximation D KL ( π y pos || (cid:101) π y pos ) = (cid:90) log (cid:18) π y pos ( x ) (cid:101) π y pos ( x ) (cid:19) π y pos ( x )d x, can be controlled. In this work, we specifically focus on constructing a projector P r which isindependent on the data y and which allows to bound D KL ( π y pos || (cid:101) π y pos ).
3. Dimension reduction via optimal parameter-reduced likelihood
In this section, we first briefly review the optimal parameter-reduced likelihood and the data-dependent LIS proposed in [54], and then we will introduce the data-free LIS.
As shown in Section 2.1 of [54], for a given data set y and a given projector P r , the parameter-reduced likelihood function L ∗ ,y ( x r ) = (cid:90) Ker( P r ) L y ( x r + x ⊥ ) π pr ( x ⊥ | x r ) d x ⊥ , (7)is an optimal approximation in the sense that it minimizes (cid:101) L y (cid:55)→ D KL ( π y pos || (cid:101) π y pos ). We denote by π ∗ ,y pos ( x ) ∝ L ∗ ,y ( P r x ) π pr ( x ) , ata-Free Likelihood-Informed Dimension Reduction π ∗ ,y pos ( x r ) = (cid:82) Ker( P r ) π ∗ ,y pos ( x r + x ⊥ )d x ⊥ can be expressed as π ∗ ,y pos ( x r ) ∝ L ∗ ,y ( x r ) π pr ( x r ) (7) ∝ (cid:90) Ker( P r ) L y ( x r + x ⊥ ) π pr ( x r + x ⊥ )d x ⊥ (1) ∝ π y pos ( x r ) , (8)for all x r ∈ Im( P r ), where π y pos ( x r ) = (cid:82) Ker( P r ) π y pos ( x r + x (cid:48)⊥ )d x (cid:48)⊥ is the marginal density of the fullposterior. Thus, for any projector P r and any data y , the approximate posterior π ∗ ,y pos and the fullposterior π y pos have the same marginal density on Im( P r ). In summary we have π y pos ( x ) = π y pos ( x r ) π y pos ( x ⊥ | x r ) ,π ∗ ,y pos ( x ) (8) = π y pos ( x r ) π pr ( x ⊥ | x r ) , which shows that the optimal approximation π ∗ ,y pos ( x ) to π y pos ( x ) replaces the conditional posterior π y pos ( x ⊥ | x r ) with the conditional prior π pr ( x ⊥ | x r ). We denote by P r = P yr a projector built by a data-dependent approach. Ideally, we would liketo build P yr that minimizes D KL ( π y pos || π ∗ ,y pos ) over the manifold of rank- r projectors. However, thisnon-convex minimization problem can be challenge to solve. Instead, the strategy proposed in [54]minimizes an upper bound of the KL divergence obtained by logarithmic Sobolev inequalities, inwhich the following assumption on the prior density is adopted. Assumption 3.1 (Subspace logarithmic Sobolev inequality) . There exists a symmetric positivedefinite matrix Γ ∈ R d × d and a scalar κ > P r ∈ R d × d and for anycontinuously differentiable function h : R d → R the inequality (cid:90) R d h ( x ) log (cid:32) h ( x ) h P r ( x ) (cid:33) π pr ( x )d x ≤ κ (cid:90) R d (cid:107) ( I d − P r ) (cid:62) ∇ h ( x ) (cid:107) − π pr ( x )d x, holds, where h P r is the conditional expectation of h given by h P r ( x ) = (cid:82) Ker( P r ) h ( P r x + x ⊥ ) π pr ( x ⊥ | P r x ) d x ⊥ . Here the norm (cid:107) · (cid:107) Γ − is defined by (cid:107) v (cid:107) − = v (cid:62) Γ − v for any v ∈ R d .Theorem 1 in [54] gives sufficient conditions on the prior density such that Assumption 3.1holds. In particular, any Gaussian prior π pr = N ( m pr , Σ pr ) with mean m pr ∈ R d and non-singularcovariance matrix Σ pr ∈ R d × d satisfies Assumption 3.1 with κ = 1 and Γ = Σ − . As shown in [54,example 2], any Gaussian mixture also satisfies this assumption, but with a constant κ which mightnot be accessible in practice. We refer to [28, 36] for nicely written introductions to logarithmicSobolev inequalities and examples of distributions which satisfy it. Proposition 3.2.
Suppose π pr satisfies Assumption 3.1 and the likelihood function L y iscontinuously differentiable. Then, for any projector P r ∈ R d × d , the posterior approximation π ∗ ,y pos ( x ) ∝ (cid:0) L ∗ ,y ( x r ) π pr ( x r ) (cid:1) π pr ( x ⊥ | x r ) induced by the optimal parameter-reduced likelihood asin (7) satisfies D KL ( π y pos || π ∗ ,y pos ) ≤ κ (cid:0) Γ − ( I d − P (cid:62) r ) H ( y )( I d − P r ) (cid:1) , (9) ata-Free Likelihood-Informed Dimension Reduction H ( y ) ∈ R d × d is defined by H ( y ) = (cid:90) R d (cid:0) ∇ log L y ( x ))( ∇ log L y ( x ) (cid:1) (cid:62) π y pos ( x )d x. (10) Proof.
See the proof of Corollary 1 in [54].Proposition 3.2 gives an upper bound on D KL ( π y pos || π ∗ ,y pos ). The minimizer of this bound P yr ∈ arg min P r ∈ R d × d rank- r projector trace (cid:0) Γ − ( I d − P (cid:62) r ) H ( y )( I d − P r ) (cid:1) , can be obtained from the leading generalized eigenvectors of the matrix pair ( H ( y ) , Γ), see [53,Proposition 2.6]. Let ( λ yi , v yi ) ∈ R ≥ × R d denotes the i -th eigenpair of ( H ( y ) , Γ) such that H ( y ) v yi = λ yi Γ v yi , with ( v yi ) (cid:62) Γ v yj = δ i,j and λ yi ≥ λ yj for all i ≤ j . The image and the kernelof P yr are respectively defined as Im( P yr ) = span { v y , . . . , v yr } , Ker( P yr ) = span { v yr +1 , . . . , v yd } . (11)The resulting projector P yr yields an approximate posterior density π ∗ ,y pos that satisfies D KL ( π y pos || π ∗ ,y pos ) ≤ κ d (cid:88) i = r +1 λ yi . The above relation can be used to choose the rank r = rank( P yr ) to guarantee that the D KL ( π y pos || π ∗ ,y pos ) is bounded below some user-defined tolerance. A rapid decay in the spectrum( λ y , λ y , . . . ) ensures that one can choose a rank r that is much lower than the original dimension d .Note that the projector P yr may not be unique, unless there exists a spectral gap λ yr > λ yr +1 whichensures the r -dimensional dominant eigenspace of ( H ( y ) , Γ) is unique.
Remark . The projector defined in (11) is, in general, not aligned with thecanonical coordinates. However, in some parametrizations—for example, different components of x represent physical quantities of different nature—we may prefer coordinate selection than subspaceidentification to make the dimension reduction more interpretable. Denoting the i -th canonicalbasis vector of R d by e i , we let P yr = (cid:80) i ∈I e i e (cid:62) i , be the projector of rank r = I , which extractsthe components of x indexed by the index set I ⊂ { , . . . , d } such thatIm( P yr ) = span { x i : i ∈ I} , Ker( P yr ) = span { x i : i / ∈ I} . Using such a projector, the bound (9) becomes D KL ( π y pos || π ∗ ,y pos ) ≤ κ (cid:88) i/ ∈I (Γ − ) ii H ( y ) ii , which suggests to define the index set I that selects the r largest values of (Γ − ) ii H ( y ) ii . ata-Free Likelihood-Informed Dimension Reduction y , the projector P yr must be built after a data set hasbeen observed, see Algorithm 1. For scenarios where one wants to solve multiple inverse problemswith multiple data sets, the matrix H ( y ) and the resulting projector have to be reconstructed foreach data set. This can be a computationally challenging task. In addition, H ( y ) is defined as anexpectation over the high-dimensional posterior distribution, which further raises the computationalburden. Algorithm 1:
Data-dependent dimension reduction.
Requires: π pr satisfying Assumption 3.1, tolerance ε > r max Online phase: given the data y do: Compute the matrix H ( y ) using (10).Compute the generalized eigendecomposition H ( y ) v yi = λ yi Γ v yi .Find the smallest r such that κ (cid:80) di = r +1 λ yi ≤ ε . If r ≥ r max , set r = r max .Assemble the projector P yr using (11).Define the conditional expectation L ∗ ,y ( x r ) defined in (7). Return:
Approximate posterior π ∗ ,y pos ( x ) ∝ L ∗ ,y ( P yr x ) π pr ( x ). To overcome the abovementioned computational burden of recomputing the data-dependentprojector for every new data set, we present a new data-free dimension reduction method. Thekey idea is to control the KL divergence in expectation over the marginal density of data. Weintroduce an m -dimensional random vector Y ∼ π data ( y ) , where π data is the marginal density of data defined in (2). Note that the observed data y correspondsto a particular realization of Y . For a given projector P r independent on the data, replacing y with Y in (9) and taking the expectation over Y yields E (cid:2) D KL ( π Y pos || π ∗ ,Y pos ) (cid:3) ≤ κ (cid:0) Γ − ( I d − P (cid:62) r ) E [ H ( Y )]( I d − P r ) (cid:1) . (12)Here, the approximate posterior π ∗ ,Y pos depends on Y via the optimal likelihood L ∗ ,Y . Similar tothe data-dependent case, the leading generalized eigenvectors of the matrix pair ( E [ H ( Y )] , Γ) canbe used to obtain a projector that minimizes the error bound. However, in this case, the matrix E [ H ( Y )] is the expectation of H ( y ) over the marginal density of data, and thus it is independentof observed data. Let ( λ i , v i ) ∈ R ≥ × R d denotes the i -th eigenpair of ( E [ H ( Y )] , Γ) such that E [ H ( Y )] v i = λ i Γ v i , with v (cid:62) i Γ v j = δ i,j and λ yi ≥ λ yj for all i ≤ j . The data-free projector P r thatminimizes the right-hand side of (12) is given byIm( P r ) = span { v , . . . , v r } , Ker( P r ) = span { v r +1 , . . . , v d } . (13) ata-Free Likelihood-Informed Dimension Reduction π ∗ ,Y pos , the expectation of the KLdivergence D KL ( π Y pos || π ∗ ,Y pos ) can be controlled as E (cid:2) D KL ( π Y pos || π ∗ ,Y pos ) (cid:3) ≤ κ d (cid:88) i = r +1 λ i . (14) Remark . Inequality (14) gives a bound on D KL ( π Y pos || π ∗ ,Y pos ) inexpectation. In order to obtain a bound in high probability, let us use the Markov inequality P { D KL ( π Y pos || π ∗ ,Y pos ) ≤ ε } ≥ − ε − E [ D KL ( π Y pos || π ∗ ,Y pos )] for some ε >
0. Thus, for a given 0 < η ≤ κ (cid:80) di = r +1 λ i ≤ εη is sufficient to ensure that D KL ( π Y pos || π ∗ ,Y pos ) ≤ ε, holds with a probability greater than 1 − η . Remark . Similarly to Remark 3.3, instead of defining P r as in (13), wecan define a coordinate-aligned projector P r = (cid:80) i ∈I e i e (cid:62) i by selecting an index set I correspondingto the r largest values of (Γ − ) ii E [ H ( Y )] ii .Now we show that the matrix E [ H ( Y )] admits a simple expression in terms of the Fisherinformation matrix associated with the likelihood function. This leads to a computationallyconvenient way to construct the data-free projector. Recall that the likelihood L y ( x ), seen asa function of y , is the pdf of the data y conditionned on the parameter x ∈ R d . The Fisherinformation matrix associated with this family of pdf is I ( x ) = (cid:90) R m (cid:0) ∇ log L y ( x ))( ∇ log L y ( x ) (cid:1) (cid:62) L y ( x ) d y. (15)We can write E [ H ( Y )] = (cid:90) R m H ( y ) π data ( y )d y (10) = (cid:90) R m (cid:18)(cid:90) R d (cid:0) ∇ log L y ( x ))( ∇ log L y ( x ) (cid:1) (cid:62) π y pos ( x )d x (cid:19) π data ( y )d y (1) = (cid:90) R m × R d (cid:0) ∇ log L y ( x ))( ∇ log L y ( x ) (cid:1) (cid:62) L y ( x ) π pr ( x ) (cid:82) R d L y ( x (cid:48) ) π pr ( x (cid:48) )d x (cid:48) π data ( y )d x d y (2) = (cid:90) R m × R d (cid:0) ∇ log L y ( x ))( ∇ log L y ( x ) (cid:1) (cid:62) L y ( x ) π pr ( x )d x d y (15) = (cid:90) R d I ( x ) π pr ( x )d x, (16)which shows that the matrix E [ H ( Y )] is the expectation of the Fisher information matrix over theprior. This expression does not involve any expectation over the posterior density, which is a majoradvantage compared to the expression (10) of the data-dependent matrix H ( y ). The methodologypresented here is summarized in Algorithm 2. ata-Free Likelihood-Informed Dimension Reduction Algorithm 2:
Data-free dimension reduction
Requires: π pr satisfying Assumption 3.1, Fisher information matrix I ( x ) of L y , tolerance ε >
0, and maximal rank r max Offline phase
Compute the matrix H I = (cid:82) R d I ( x ) π pr ( x )d x .Compute the generalized eigendecomposition H I v i = λ i Γ v i .Find the smallest r such that κ (cid:80) di = r +1 λ i ≤ ε . If r ≥ r max , set r = r max . Return:
Projector P r defined by (13) Online phase: given the data y do: Define L ∗ ,y as the conditional expectation defined in (7). Return:
Approximate posterior π ∗ ,y pos ( x ) ∝ L ∗ ,y ( P r x ) π pr ( x ) Example 3.6 (Gaussian likelihood) . Consider the parameter-to-data map is represented by asmooth forward model G : R d → R m and corrupted by an additive Gaussian noise ξ obs ∼ N (0 , Σ obs )with non-singular covariance matrix Σ obs ∈ R m × m , i.e., y = G ( x ) + ξ obs , where ξ obs ∼ N (0 , Σ obs ) . The likelihood function takes the form L y ( x ) = Z − exp( − (cid:107) G ( x ) − y (cid:107) − ), where Z = (cid:112) (2 π ) m det(Σ obs ) is a normalizing constant. The Slepian-Bangs formula gives an explicit expressionfor the Fisher information matrix I ( x ) = ∇ G ( x ) (cid:62) Σ − ∇ G ( x ), where ∇ G ( x ) ∈ R m × d denotes theJacobian of the forward model G ( x ). By relation (16) we obtain E [ H ( Y )] = (cid:90) R d ∇ G ( x ) (cid:62) Σ − ∇ G ( x ) π pr ( x )d x. (17)A similar matrix was considered in [18] in the context of data-dependent dimension reduction. Themajor difference with (17) is that, in [18], the expectation is taken over the posterior density ratherthan over the prior.
4. Dimension reduction via parameter-reduced forward model
In the previous Section 3, the detection of the data-free informed subspace is based on anapproximation of the likelihood function. In this section, we present an alternative strategy which,under Gaussian likelihood assumption, consist in approximating the forward model instead of thelikelihood itself. This approach is similar to the vector-valued extension of the AS method [53] andstill yields error bounds for the expected KL divergence.As in Example 3.6, let us start with a Gaussian likelihood of the form L y ( x ) = 1 Z exp (cid:18) − (cid:107) G ( x ) − y (cid:107) − (cid:19) , (18)where x (cid:55)→ G ( x ) is a continuously differentiable forward model, Σ obs ∈ R m × m is a non-singularcovariance matrix and Z = (cid:112) (2 π ) m det(Σ obs ) a normalizing constant. Our goal is to build a ata-Free Likelihood-Informed Dimension Reduction x (cid:55)→ (cid:101) G ( P r x ). That is, we look for a likelihood approximation of the form (cid:101) L y ( P r x ) = 1 Z exp (cid:18) − (cid:107) (cid:101) G ( P r x ) − y (cid:107) − (cid:19) , (19)where P r is a low-rank projector and where (cid:101) G is some parameter-reduced function defined overKer( P r ). In general, this approximate likelihood (19) is different than the previous one L ∗ ,y , see(7), and therefore (cid:101) L y might not be optimal with respect to the KL divergence as discussed in Section3.1. The following proposition will guide the construction of the approximate forward model. Proposition 4.1.
Consider the posterior density π y pos ( x ) ∝ L y ( x ) π pr ( x ) with a Gaussian likelihoodas in (18). For any approximate forward model (cid:98) G : R d → R m , the resulting approximate likelihood (cid:98) L y ( x ) = Z exp( − (cid:107) (cid:98) G ( x ) − y (cid:107) − ) defines an approximate posterior density (cid:98) π y pos ( x ) ∝ (cid:98) L y ( x ) π pr ( x )such that E (cid:104) D KL ( π Y pos || (cid:98) π Y pos ) (cid:105) + D KL ( π data || (cid:98) π data ) = 12 (cid:90) R d (cid:107) G ( x ) − (cid:98) G ( x ) (cid:107) − π pr ( x )d x. Here the expectation is taken over Y ∼ π data and (cid:98) π data ( y ) = (cid:82) R d (cid:98) L y ( x ) π pr ( x ) d x is the approximatemarginal density of data. Proof.
See Appendix A.Using an approximate forward model in the form of (cid:98) G ( x ) = (cid:101) G ( P r x ), Proposition 4.1 ensuresthat the approximate posterior (cid:101) π y pos ( x ) ∝ (cid:101) L y ( P r x ) π pr ( x ) with (cid:101) L y ( P r x ) as in (19) satisfies E [ D KL ( π Y pos || (cid:101) π Y pos )] ≤ (cid:90) R d (cid:107) G ( x ) − (cid:101) G ( P r x ) (cid:107) − π pr ( x )d x, (20)This suggests to construct a ridge approximation (cid:101) G ( P r x ) to G ( x ) in the L π pr sense. To accomplishthis, we follow the methodology proposed in [53] for the approximation of multivariate functionusing gradient information. First, for any projector P r , the optimal function (cid:101) G ∗ that minimizes theright-hand side of (20) is the conditional expectation (cid:101) G ∗ ( x r ) = (cid:90) Ker( P r ) G ( x r + x ⊥ ) π pr ( x ⊥ | x r ) d x ⊥ . (21)Then, similarly to Assumption 3.1, we assume that π pr satisfies the following subspace Poincar´einequality. Assumption 4.2 (Subspace Poincar´e inequality) . There exists a symmetric positive definite matrixΓ ∈ R d × d and a scalar κ > P r ∈ R d × d and for any continuouslydifferentiable function h : R d → R the inequality (cid:90) R d (cid:0) h ( x ) − h P r ( x ) (cid:1) π pr ( x )d x ≤ κ (cid:90) R d (cid:107) ( I d − P r ) (cid:62) ∇ h ( x ) (cid:107) − π pr ( x )d x, holds, where h P r is the conditional expectation of h defined by h P r ( x ) = (cid:82) Ker( P r ) h ( P r x + x (cid:48)⊥ ) π pr ( x (cid:48)⊥ | P r x )d x (cid:48)⊥ . ata-Free Likelihood-Informed Dimension Reduction κ and the same Γ, see for instance [54, Corollary 2]. We refer to the recentcontributions [4, 43, 49] for examples of probability distribution which satisfy (subspace) Poincar´einequality. As for the logarithmic-Sobolev constant, the Poincar´e constant is hard to compute inpractice, except the case of Gaussian prior. Using similar arguments as in the proof of Proposition2.5 in [53], Assumption 4.2 allows to write (cid:90) R d (cid:107) G ( x ) − (cid:101) G ∗ ( P r x ) (cid:107) − π pr ( x )d x ≤ κ trace (cid:0) Γ − ( I d − P (cid:62) r ) H G ( I d − P r ) (cid:1) , (22)holds for any projector P r , where the matrix H G ∈ R d × d is defined by H G = (cid:90) R d ∇ G ( x ) (cid:62) Σ − ∇ G ( x ) π pr ( x )d x, (23)with ∇ G ( x ) the Jacobian matrix of G ( x ) given by ∇ G ( x ) = ∂G ∂x ( x ) . . . ∂G ∂x d ( x )... . . . ... ∂G m ∂x ( x ) . . . ∂G m ∂x d ( x ) . Again, the projector P Gr that minimizes the right-hand side of (22) can be constructed via thegeneralized eigenvalue problem H G v Gi = λ Gi Γ v Gi :Im( P Gr ) = span { v G , . . . , v Gr } , Ker( P Gr ) = span { v Gr +1 , . . . , v Gd } . (24)Using this projector to construct the approximate forward model (cid:101) G ∗ in (21) and the approximatelikelihood as in (19), Proposition 4.1 and the inequality in (22) yield E (cid:2) D KL ( π Y pos || (cid:101) π Y pos ) (cid:3) ≤ κ d (cid:88) i = r +1 λ Gi . (25)The methodology is summarized in Algorithm 3.The matrix H G used in this case takes the same form as the matrix E [ H ( Y )] in Section 3.3with the Gaussian likelihood (cf. Example 3.6), and hence results in the same data-free projector.However, the resulting approximate likelihood functions are not the same. Indeed in Section 3.3,the optimal approximate likelihood L ∗ ,y is given as the conditional expectation of the likelihoodfunction (cf. (7)), whereas here, (cid:101) L y is defined by the conditional expectation of the forward model (cid:101) G ∗ (cf. (21)). 
Using either the parameter-reduced likelihood in (7) or the parameter-reducedforward model in (21) results in the same parameter truncation error bound in terms of expected ata-Free Likelihood-Informed Dimension Reduction Algorithm 3:
Data-free dimension reduction via forward model approximation
Requires: π pr satisfying Assumption 3.1, Jacobian ∇ G ( x ), tolerance ε > r = r max . Offline phase
Compute the matrix H G defined in (23)Compute the generalized eigendecomposition H G v Gi = λ Gi Γ v Gi Find the smallest r such that κ (cid:80) di = r +1 λ i ≤ ε . If r ≥ r max , set r = r max Assemble the projector P Gr defined in (24)Define (cid:101) G ∗ as the conditional expectation (21) Return:
Approximate forward model x (cid:55)→ (cid:101) G ∗ ( P Gr x ) Online phase: given the data y do: Assemble (cid:101) L y ( P Gr x ) as in (19) Return:
Approximate posterior (cid:101) π y pos ( x ) ∝ (cid:101) L y ( P r x ) π pr ( x ) Remark . Despite the similarity between the approximate likelihood functions given in (7)and (19), these two approaches offer different computational characteristics. Given the data-freeinformed subspace, the optimal parameter-reduced forward model x r (cid:55)→ G ∗ ( x r ) can be furtherreplaced by a surrogate model x r (cid:55)→ G ROM ( x r ) constructed in the offline phase. The surrogatemodel can be obtained using tensor methods [8, 42], the reduced basis method [44], polynomialtechniques [35], etc., just to cite a few. All these approximation techniques do not scale wellwith the apparent parameter dimensions d , and thus parameter reduction can greatly improve thescalability of surrogate models.In contrast, the conditional expectation of the likelihood function in (7) cannot be replacedwith offline surrogate models because of the data-dependency of the likelihood.
5. Sampling the approximate posterior
Given a data-free informed subspace, the approximate posterior density has the factorized form (cid:101) π y pos ( x ) ∝ (cid:101) π y post ( x r ) π pr ( x ⊥ | x r ) , (26)with either (cid:101) π y post ( x r ) = π y post ( x r ) in the optimal parameter-reduced likelihood approach of Section3, or with (cid:101) π y post ( x r ) = (cid:101) L y ( x r ) π y pr ( x r ), (cid:101) L y ( x r ) ∝ exp( − (cid:107) (cid:101) G ∗ ( x r ) − y (cid:107) − ) in the optimal parameter-reduced forward model approach of Section 4. The factorization (26) naturally suggests a dimensionrobust way to sampling the approximate posterior. The sampling method consists in first drawingsamples x (1) r , x (2) r , ..., x ( K ) r from the low-dimensional density (cid:101) π y pos ( x r ) using either MCMC or SMCmethod. Then, for each sample x ( j ) r , we simulate a conditional prior sample x ( j ) ⊥ from π pr ( x ⊥ | x ( j ) r ).In the end, x ( j ) = x ( j ) r + x ( j ) ⊥ are samples from the approximate posterior (cid:101) π y pos ( x ).We emphasis here that the key is to be able to sample from the conditional prior π pr ( x ⊥ | x r ).This task is rather easy for Gaussian priors. We show in Section 7 how to sample from π pr ( x ⊥ | x r )for non-Gaussian priors with a particular structure that can be exploited. Remark . If the end goal is to compute expectation of some function h over of the approximate ata-Free Likelihood-Informed Dimension Reduction (cid:90) R d h ( x ) (cid:101) π pos ( x )d x = (cid:90) Im( P r ) (cid:32)(cid:90) Ker( P r ) h ( x r + x ⊥ ) π pr ( x ⊥ | x r )d x ⊥ (cid:33) (cid:101) π pos ( x r )d x r ≈ K K (cid:88) j =1 (cid:90) Ker( P r ) h ( x ( j ) r + x ⊥ ) π pr ( x ⊥ | x ( j ) r )d x ⊥ , where x (1) r , ..., x ( K ) r are samples from the approximate marginal posterior (cid:101) π y pos ( x r ). This way, ifthe expectation over the conditional prior π pr ( x ⊥ | x ( j ) r ) can be carried out analytically, one cancan simply avoid using conditional prior samples. Alternatively, the K conditional expectations (cid:82) h ( x ( j ) r + x ⊥ ) π pr ( x ⊥ | x ( j ) r )d x ⊥ can also be approximated via other accurate quadrature rule for π pr ( x ⊥ | x ( j ) r ). Either way, we assume that integration with respect to the conditional prior istractable.In Algorithm 4 we provide the details of an MCMC-based sampling procedure in whichthe approximate likelihood (defined by either optimal parameter-reduced likelihood or optimalparameter-reduced forward model) can be obtained as sample averages over the conditional prior π pr ( x ⊥ | x r ). To make these approximations generally applicable, we replace the conditional priorwith the marginal prior π pr ( x ⊥ ) in computing those conditional expectations in the Equations (27)and (28) in Algorithm 4. Note that the typical class of inverse problems equipped with a Gaussianprior π pr = N ( m pr , Σ pr ) is a special case. Since the projector P r is orthogonal with respect to Σ − ,the marginal prior π pr ( x ⊥ ) coincides with the conditional prior π pr ( x ⊥ | x r ).A remaining question is how to choose the sample size N for computing the conditionalexpectations in (27) and (28). The following heuristic is developed based on the optimal parameter-reduced forward model. Consider the exact parameter-reduced forward model (cid:101) G ∗ ( P r x ) = (cid:101) G ∗ ( x r )and its sample-averaged approximation (cid:101) G N ( P r x ) = (cid:101) G N ( x r ). 
The sample-averaged approximationdefines an approximate posterior density (cid:98) π y pos ( x ) ∝ exp (cid:16) − (cid:107) (cid:101) G N ( P r x ) − y (cid:107) − (cid:17) π pr ( x ) , that satisfies E [ D KL ( π Y pos || (cid:98) π Y pos )] (20) ≤ E (cid:20) (cid:90) R d (cid:107) G ( x ) − (cid:101) G N ( P r x ) (cid:107) − π pr ( x )d x (cid:21) = 12 E (cid:20)(cid:90) R d (cid:107) G ( x ) − (cid:101) G ∗ ( P r x ) (cid:107) − + (cid:107) (cid:101) G ∗ ( P r x ) − (cid:101) G N ( P r x ) (cid:107) − π pr ( x )d x (cid:21) = (cid:18) N (cid:19) (cid:90) R d (cid:107) G ( x ) − (cid:101) G ∗ ( P r x ) (cid:107) − π pr ( x )d x. (30)Here, the expectation is taken jointly over the data Y and the sample { x ( i ) ⊥ } Ni =1 . The aboveinequality directly follows from Proposition 4.1 and the fact that (cid:101) G ∗ ( P r x ) is the conditionalexpectation of G ( x ) over Ker( P r ). We refer to Theorem 3.2 in [12] for more details on thisderivation. Inequality (30) implies that the random approximate posterior (cid:98) π y pos ( x ) can be usedin place of (cid:101) π y pos ( x ), as the bounds on the expected Kullback-Leibler divergence in (20) and (30) arecomparable. In addition, this suggests that the sample size N in (28) does not have to be large. ata-Free Likelihood-Informed Dimension Reduction Algorithm 4:
MCMC-based approach for sampling the approximate posterior.
Requires:
A projector P r , a sample size N for approximating the likelihood, a totalposterior sample size K , and a proposal density q ( ·| x r ) on Im( P r ). Sample-averaged likelihood approximation
Draw N i.i.d. samples x (1) ⊥ , . . . , x ( N ) ⊥ from the marginal π pr ( x ⊥ ) if optimal parameter-reduced likelihood is used then (cid:101) L yN ( x r ) = 1 N N (cid:88) i =1 L y ( x r + x ( i ) ⊥ ) (27)for any x r ∈ Im( P r ) endif optimal parameter-reduced forward model is used then (cid:101) L yN ( x r ) ∝ exp (cid:16) − (cid:107) (cid:101) G N ( x r ) − y (cid:107) − (cid:17) , (cid:101) G N ( x r ) = 1 N N (cid:88) i =1 G ( x r + x ( i ) ⊥ ) (28)for any x r ∈ Im( P r ) endReturn: a sample-averaged approximate likelihood function (cid:101) L yN ( x r ) Subspace MCMC samplingfor j = 1 , , ..., K do Given the Markov chain state X ( j − r = x r , propose a candidate x † r ∼ q ( ·| x r )Evaluate the approximate likelihood function (cid:101) L yN ( x † r )Compute the acceptance probability α ( x † r | x r ) = min (cid:26) , (cid:101) L yN ( x † r ) π pr ( x † r ) q ( x r | x † r ) (cid:101) L yN ( x r ) π pr ( x r ) q ( x † r | x r ) (cid:27) . (29)With probability α ( x † r | x r ), accept x † r by setting X ( j ) r = x † r , otherwise reject x † r bysetting X ( j ) r = x r . endReturn: a Markov chain X (1) r , X (2) r , ..., X ( K ) r with invariant density (cid:101) π y pos ( x r ) Approximate posterior samplingfor j = 1 , , ..., K do Given the state X ( j ) r = x ( j ) r , draw a conditional prior sample x ( j ) ⊥ ∼ π pr ( ·| x ( j ) r )Compute the i -th approximate posterior sample x ( j ) = x ( j ) r + x ( j ) ⊥ endReturn: approximate marginal posterior samples x (1) , x (2) , ..., x ( K ) Even with N = 1, (20) and (30) differs only by a factor of 2. For the optimal parameter-reducedlikelihood function, it is not obvious how to obtain a similar bound for the sampled-averagedconditional expectation in (27), see for instance the result [54, Proposition 5]. In this case, weadopt the identity (30) as a heuristic. ata-Free Likelihood-Informed Dimension Reduction
6. Sampling from the exact posterior
In this section, we present new strategies for sampling the exact posterior by adding minormodifications to Algorithm 4.
For the optimal parameter-reduced likelihood approach, Algorithm 4 replaces the optimal likelihood L ∗ ,y ( x r ) with the sample-average (cid:101) L yN ( x r ) defined by (27) using frozen (fixed) samples { x ( i ) ⊥ } Ni =1 .This way, Algorithm 4 produces samples from an estimation to the posterior approximation π ∗ ,y pos ( x ) = π pos ( x r ) π pr ( x ⊥ | x r ). In this section, we first show that replacing the frozen sampleswith freshly drawing samples { x ( i ) ⊥ } Ni =1 at each MCMC iteration yields a pseudo-marginal MCMC[1] which samples exactly from π ∗ ,y pos ( x ). In addition, we also show that an appropriate recycling ofthe data generated by this modified algorithm allows obtaining samples from the exact posterior π y pos ( x ) itself.We propose to modify Algorithm 4 by replacing the acceptance rate α N ( x † r | x r ) in (29) with (cid:98) α N ( x † r | x r ) = min , π pr ( x † r ) (cid:16) N (cid:80) Ni =1 L y ( x † r + x † ( i ) ⊥ ) (cid:17) q ( x r | x † r ) π pr ( x r ) (cid:16) N (cid:80) Ni =1 L y ( x r + x ( i ) ⊥ ) (cid:17) q ( x † r | x r ) . (31)Here, { x ( i ) ⊥ } Ni =1 are i.i.d. samples from π pr ( x ⊥ | x r ) conditioned on the current state of the chain x r and { x † ( i ) ⊥ } Ni =1 are i.i.d. samples from π pr ( x ⊥ | x † r ) conditioned on the proposed candidate x † r .Compared to the previous acceptance rate (29) where { x ( i ) ⊥ } Ni =1 = { x † ( i ) ⊥ } Ni =1 where frozen, thenew acceptance rate (31) requires to redraw fresh samples at each proposal candidate x † r . This issummarized in Algorithm 5. Algorithm 5:
Pseudo-marginal MCMC for sampling the exact marginal posterior.
Requires:
A projector P r , a sample size N for approximating the likelihood, a totalposterior sample size K , and a proposal density q ( ·| x r ) on Im( P r ). for j = 1 , , ..., K do Given the previous state of the Markov chain X ( j − r = x r and the associated set ofconditional prior samples { X ( j − ,i ) ⊥ } Ni =1 = { x ( i ) ⊥ } Ni =1 Propose a candidate x † r ∼ q ( ·| x r )Draw N independent samples x † (1) ⊥ , ..., x † ( N ) ⊥ ∼ π pr ( x ⊥ | x † r )Compute the acceptance probability (cid:98) α ( x † r | x r ) as in (31)With probability (cid:98) α ( x † r | x r ), accept X ( j ) r = x † r and { X ( j,i ) ⊥ } Ni =1 = { x ( † ,i ) ⊥ } Ni =1 . Otherwise reject and set X ( j ) r = x r and { X ( j,i ) ⊥ } Ni =1 = { x ( i ) ⊥ } Ni =1 . endReturn: the Markov chain { ( X ( j ) r , { X ( j,i ) ⊥ } Ni =1 ) } Kj =1 In the next proposition we apply the analysis of pseudo-marginal MCMC [1] to show that π y pos ( x r ) is the invariant density of the Markov chain constructed by Algorithm 5. The key step is ata-Free Likelihood-Informed Dimension Reduction P r ) × Ker( P r ) N . Proposition 6.1.
Algorithm 5 constructs an ergodic Markov chain { ( X ( j ) r , { X ( j,i ) ⊥ } Ni =1 ) } j ≥ on theproduct space Im( P r ) × Ker( P r ) N with invariant density π y,N tar ( x r , { x ( i ) ⊥ } Ni =1 ) ∝ π pr ( x r ) (cid:32) N (cid:88) i =1 L y ( x r + x ( i ) ⊥ ) (cid:33) N (cid:89) j =1 π pr ( x ( j ) ⊥ | x r ) . (32)The marginal of this target density satisfies π y,N tar ( x r ) = π y pos ( x r ) so that the sequence { X ( j ) r } Nj =1 isan ergodic Markov chain with the invariant density π y pos ( x r ). Proof.
See Appendix B.
Remark N in Algorithm 5) . The statistical performance of pseudo-marginal methodsdepends on the variance of the sample-averaged estimate N (cid:80) Ni =1 L y ( x r + x ( i ) ⊥ ). This variance beinginversely proportional to the sample size N , a larger N may result in better statistical efficiencyof the MCMC chain. However, the computational cost per MCMC iteration increases linearlywith N , while the improvement of the statistical efficiency will not follow the same rate. We referthe readers to [3, 23] for a detailed discussion on this topic and only provide an interpretation asfollows. With N → ∞ , the Markov chain constructed by the pseudo-marginal MCMC convergesto that of an idealized standard MCMC, which has the acceptance probability defined by the sameproposal density and the exact evaluation of L ∗ ,y ( x r ). This way, even with a very large N , thestatistical efficiency of the pseudo-marginal MCMC cannot be improved further beyond that ofthe idealized standard MCMC. As suggested by [23], the standard deviation of the logarithm ofthe parameter-reduced likelihood estimate, var[log (cid:101) L yN ] , can be used to monitor the quality of thesample-averaged estimator.It is remarkable to observe that, for N = 1, the target density (32) becomes the true posterior π y, ( x r , x ⊥ ) = π y pos ( x r + x ⊥ ). This means that Algorithm 5 actually produces samples x = x r + x ⊥ from π y pos ( x ). For N >
1, we propose to recycle the Markov chain { X (1 ,i ) ⊥ } Ni =1 , . . . , { X ( K,i ) ⊥ } Ni =1 produced by Algorithm 5 in order to generate samples from the exact posterior π y pos ( x ). Thisprocedure is summarized in Algorithm 6 and a justification is provided in the following proposition. Proposition 6.3.
Let { ( X ( j ) r , { X ( j,i ) ⊥ } Ni =1 ) } j ≥ be a Markov chain generated by Algorithm 5. Forany j ≥ X ( j ) ⊥ ∈ { X ( j,i ) ⊥ } Ni =1 according to the probability P (cid:16) X ( j ) ⊥ = X ( j,k ) ⊥ (cid:12)(cid:12)(cid:12) X ( j ) r , { X ( j,i ) ⊥ } Ni =1 (cid:17) = L y ( X ( j ) r + X ( j,k ) ⊥ ) (cid:80) Ni =1 L y ( X ( j ) r + X ( j,i ) ⊥ ) , ≤ k ≤ N, (33)and we let X ( j ) = X ( j ) r + X ( j ) ⊥ . Then { X ( j ) } j ≥ is a Markov chain with the exact posterior π y pos ( x )as invariant density. Proof.
See Appendix C ata-Free Likelihood-Informed Dimension Reduction Algorithm 6:
Recycling the Markov chain generated by Algorithm 5 to generate exactposterior samples
Requires:
MCMC chain generated { ( X ( j ) r , { X ( j,i ) ⊥ } Ni =1 ) } Kj =1 by Algorithm 5 for j = 1 , , ..., K do Subsample X ( j ) ⊥ ∈ { X ( j,i ) ⊥ } Ni =1 according to the probability (33)Assemble X ( j ) = X ( j ) r + X ( j ) ⊥ endReturn: the Markov chain { X ( j ) } Kj =1 with invariant density π y pos ( x ) For the optimal parameter-reduced forward model, the marginal density of the resultingapproximate posterior does not coincide with that of the exact posterior in general. However,we can still modify the approximate inference algorithm 4 using the delayed acceptance technique[11, 38, 39] to explore the exact posterior. The delayed acceptance modifies Algorithm 4 by addinga second stage acceptance rejection within each MCMC iteration. Here we consider the sample-averaged likelihood (cid:101) L yN ( x r ) defined by either (27) (the optimal parameter-reduced likelihood) or(28) (the optimal parameter-reduced forward model), where the marginal prior sample set { x ( i ) ⊥ } Ni = i is prescribed. The following Proposition and Algorithm 7 detail this modification. Proposition 6.4.
Suppose we have a proposal distribution q ( ·| x r ) defined in the parameter reducedsubspace Im( P r ). We consider the following two stage Metropolis-Hastings method. In the firststage, we draw a proposal candidate x † r ∼ q ( ·| x r ). Then, with the probability α ( x † r | x r ) = min (cid:26) , (cid:101) L yN ( x † r ) π pr ( x † r ) q ( x r | x † r ) (cid:101) L yN ( x r ) π pr ( x r ) q ( x † r | x r ) (cid:27) , (34)we move the proposal candidate x † r to the next stage. In the second stage, we draw a proposalcandidate π pr ( x †⊥ | x † r ) in the complement subspace Ker( P r ) and then accept the pair of proposalcandidates ( x † r , x †⊥ ) with the probability β ( x † r , x †⊥ | x r , x ⊥ ) = min (cid:34) , L y ( x † r + x †⊥ ) (cid:101) L yN ( x r ) L y ( x r + x ⊥ ) (cid:101) L yN ( x † r ) (cid:35) . (35)Then, the above procedure constructs an ergodic Markov chain with the full posterior π y pos ( x ) asthe invariant density. Proof.
This result can be derived from the standard delayed acceptance [11]. For completeness, weprovide the proof in Appendix D.
Remark . It worth to note that the delayed acceptance also opens the door to further acceleratethe exact inference using surrogate models instead of the original forward model. The approximatelikelihood (cid:101) L yN ( x r ) is deterministic and dimension reduced, which makes it possible to furtherapproximate (cid:101) L yN ( x r ) using computationally fast surrogate models. In this case, the same delayedacceptance MCMC (Algorithm 7) can still produce ergodic Markov chains that converge to the ata-Free Likelihood-Informed Dimension Reduction Algorithm 7:
Delayed acceptance MCMC for sampling the exact posterior.
Requires:
A projector P r , a sample-averaged likelihood approximation defined inAlgorithm 4, a total sample size K , and a proposal density q ( ·| x r ) on Im( P r ). for j = 1 , , ..., K do Given the Markov chain state X ( j − = x r + x ⊥ , propose a candidate x † r ∼ q ( ·| x r )Compute the parameter-reduced likelihood (cid:101) L yN ( x † r ) using either using (27) or (28)With probability α ( x † r | x r ) in (34) move x † r to the next stage as followsPropose a candidate x †⊥ ∼ π pr ( ·| x † r )Compute the full likelihood L y ( x † r + x †⊥ )With probability β ( x † r , x †⊥ | x r , x ⊥ ) in (35) accept ( x † r , x † ), otherwise reject ( x † r , x † )Otherwise reject x † r endAccept: set X j = x † r + x †⊥ Reject: set X j = X ( j − Return: a Markov chain X (1) , X (2) , ..., X ( K ) with the invariant density π y pos ( x )full posterior π y pos ( x ). In contrast, the pseudo-marginal method requires an unbiased Monte Carloestimate of the exact marginal posterior π y pos ( x r ) at every iteration, which is not straightforwardto accelerate using surrogate models.
7. Non-Gaussian priors
The dimension reduction techniques presented in Sections 3 and 4 require one to evaluate themarginal prior density π pr ( x r ) = (cid:82) Ker( P r ) π pr ( x r + x ⊥ )d x ⊥ and draw samples from the conditionalprior π pr ( x ⊥ | x r ) = π pr ( x r + x ⊥ ) /π pr ( x r ). While these tasks are readily doable for Gaussiandistributions, it might not be the case in general. In this section, we use Besov priors as anexample to present strategies that can extend the proposed dimension reduction methods to somenon-Gaussian priors. Besov measure [21, 32, 34] naturally appears in image reconstruction problems in which thedetection of edges and interfaces is important. Following [32, 34], we construct Besov priors usingwavelet functions and consider functions on the one-dimensional torus T = (0 , ψ ∗ ∈ L ( T ), we can define an orthogonalbasis ψ j,k ( s ) = 2 j ψ ∗ (2 j s − k ) , j, k ∈ N ≥ , k ∈ [0 , j − . This way, given a smoothness parameter r > ≤ p, q ≤ ∞ , afunction f : s (cid:55)→ f ( s ) in the Besov space B rpq ( T ) can be written as f ( s ) = c + ∞ (cid:88) j =0 2 j − (cid:88) k =0 − j ( r + − p ) b j,k ψ j,k ( s ) , (36) ata-Free Likelihood-Informed Dimension Reduction (cid:107) f (cid:107) B rpq := (cid:18) | c | q + ∞ (cid:88) j =0 (cid:16) j − (cid:88) k =0 | b j,k | p (cid:17) qp (cid:19) q < ∞ . In a Bayesian setting, we can set p = q and define the Besov- B rpp prior with the pdf ‡ π pr ( f ) ∝ exp (cid:16) − γ (cid:107) f (cid:107) p B rpp (cid:17) , (37)where γ > D terms.This way, collecting all the coefficients into a parameter vector x = ( c , b , , b , , ..., b , , b , , ... ) ∈ R d , where d = 2 D +1 , the discretized Besov- B rpp prior can be equivalently expressed as a product-form distribution over the parameter x with the pdf π pr ( x ) = d (cid:89) i =1 π ( i )pr ( x i ) with π ( i )pr ( x i ) ∝ exp ( − γ | x i | p ) . (38) In general, we do not have closed form expressions for both the marginal π pr ( x r ) and the conditional π pr ( x ⊥ | x r ), unless the projector P r is aligned with the canonical basis. This leads to the constructionof reduced subspace by selecting a subset of canonical basis. As discussed in Remarks 3.3 and 3.5,this task can be achieved by identifying an index set I ⊂ { , . . . , d } with cardinality r such that I contains the indices of the r largest values of i (cid:55)→ (Γ − ) ii E [ H ( Y )] ii in the data-dependent case orthose of i (cid:55)→ (Γ − ) ii E [ H ( Y )] ii in the data-free case. This leads to the projector P r = (cid:80) i ∈I e i e (cid:62) i , where { e , . . . , e d } is the canonical basis of R d . Thus, the product-form of (38) yields the marginalprior and the conditional prior π pr ( x r ) = (cid:89) i ∈I π ( i )pr ( x i ) and π pr ( x ⊥ | x r ) = (cid:89) i/ ∈I π ( i )pr ( x i ) , respectively. In this formulation, evaluating the marginal prior density and drawing samples fromthe conditional prior become straightforward tasks. Remark . For 1 ≤ q <
2, the tails of π ( i )pr ( x i ) defined in (38) are heavier than Gaussian tails, andhence Assumptions 3.1 and 4.2 may not be satisfied. Nonetheless, one can still numerically applythe proposed dimension reduction methods without having the error bounds in (14) and (25). Inthis case, we set Γ to be the identity matrix in accordance with the fact that the prior componentsare independent and identically distributed. Remark . There exist other shrinkage priors similar to Besovpriors, in which the random function is expressed as a weighted linear combination of basis functionsand the associated random weights follow other type of heavy tail distributions. For example, thehorseshoe prior and the Student’s t prior. See [10] for further discussions and references therein.The coordinate selection technique introduced here may also be applicable to those shrinkage priors. ‡ This pdf is used for demonstrating the intuition rather than a rigorous characterization, as it is defined with respectto the (non-existent) infinite-dimensional Lebesgue measure. However, the finite-dimensional discretization of theBesov measure, which is used in numerical simulations, has a pdf in this form with respect to Lebesgue measure. ata-Free Likelihood-Informed Dimension Reduction Alternatively, we consider the case where the prior can be defined as the pushforward of the standardGaussian measure with pdf µ ( x ) ∝ exp( − (cid:107) x (cid:107) ) under a C -diffeomorphism T : R d → R d , whichtakes the form π pr ( x ) = T (cid:93) µ ( x ) . (39)In other words, π pr ( x ) is the pdf of the random vector X = T ( Z ) where Z ∼ µ ( z ). For the Besov- B rpp prior defined in (38), the diffeomorphism T has a diagonal form T ( z ) = ( T ( z ) , . . . , T d ( z d ))with T i ( z i ) = Φ − i (Ψ( z i )), where Φ i ( · ) is the cumulative density function (cdf) of π ( i )pr ( x i ) defined in(38) and Ψ( · ) is the cdf of the standard Gaussian. We provide details of the cdf Φ( · ) in AppendixE. The invertibility of T allows us to reparametrize the Bayesian inverse problem in terms of thevariable z = T − ( x ), which is endowed with the Gaussian prior µ . With this change of variable, thelikelihood function becomes z (cid:55)→ L y ( T ( z )), and thus the matrix H ( y ) used to reduce the dimensionof z should be H z ( y ) = (cid:90) R d ∇ T ( z ) (cid:62) (cid:0) ∇ log L y ( T ( z )))( ∇ log L y ( T ( z )) (cid:1) (cid:62) ∇ T ( z ) µ ( z )d z, in the data-dependent case and E [ H z ( Y )] in the data-dependent case. For the optimal parameter-reduced forward model in the Gaussian likelihood case (cf. Section 4), the forward model x (cid:55)→ G ( x )is replaced by z (cid:55)→ G ( T ( z )). This way, the matrix H G should be replaced by H Gz = (cid:90) R d ∇ T ( z ) (cid:62) ∇ G ( T ( z )) (cid:62) Σ − ∇ G ( T ( z )) ∇ T ( z ) µ ( z )d z. Using either of these matrices, we obtain a projector P r to reduce the dimension in the variable z = z r + z ⊥ , where z r = P r z and z ⊥ = ( I d − P r ) z . In term of the original variable x , thedimension reduction method allows one to identify x r = T ( P r T − ( x )) with the observed data,while x ⊥ = T (( I d − P r ) T − ( x )) is informed by the prior only. Since x r and x ⊥ are nonlinear withrespect to x , the resulting method can be interpreted as a nonlinear dimension reduction method.
8. Example 1: elliptic PDE
We first validate our methods using an inverse problem of identifying the coefficient of a two-dimensional elliptic PDE from point observations of its solution.
Consider the problem domain Ω = [0 , × [0 , ∂ Ω. We denote the spatial coordinateby s = ( s , s ) ∈ Ω. We model the steady state potential solution field p ( s ) for a given conductivityfield κ ( s ) and forcing function f ( s ) using the Poisson’s equation −∇ · ( κ ( s ) ∇ p ( s )) = f ( s ) , s ∈ Ω . (40)Let ∂ Ω n = { s ∈ ∂ Ω | s = 0 } ∪ { s ∈ ∂ Ω | s = 1 } denote the top and bottom boundaries, and ∂ Ω d = { s ∈ ∂ Ω | s = 0 } ∪ { s ∈ ∂ Ω | s = 1 } denote the left and right boundaries. We impose themixed boundary condition: p ( s ) = 0 , ∀ s ∈ ∂ Ω d , and ( κ ( s ) ∇ p ( s )) · (cid:126)n ( s ) = 0 , ∀ x ∈ ∂ Ω n , ata-Free Likelihood-Informed Dimension Reduction f ( s, t ) = c (cid:16) exp (cid:0) − r (cid:107) s − a (cid:107) (cid:1) − exp (cid:0) − r (cid:107) s − b (cid:107) (cid:1)(cid:17) , ∀ t ≥ , with r = 0 .
05, which is the superposition of two Gaussian-shaped sink/source terms centered at a = (0 . , .
5) and b = (2 . , . c = 6 × − . The conductivity field κ ( s ) isendowed with a log-normal prior. That is, letting x ( s ) = log κ ( s ), the Gaussian process prior for x ( s ) is defined by the stochastic PDE (see [37] and references therein): −(cid:52) x ( s ) + γx ( s ) = W ( s ) , s ∈ Ω , (41)where (cid:52) is the Laplace operator and W ( s ) is the white noise process. We impose a no-flux boundarycondition on the above SPDE and set γ = 10. Equations (40) and (41) are solved using the finiteelement method with bilinear basis functions. A mesh with 80 ×
80 elements is used in this example.This leads to n = 6400 dimensional discretised parameters.Figure 1: Setup of three test cases for the elliptic PDE example. The observation locations areshown as dots. Each column represent a test case, in which the top row shows the true conductivityfields and the bottom row shows the corresponding potential field.We generate three “true” conductivity fields from the prior distribution and use them tosimulate synthetic observed data sets. The true conductivity fields and the simulated potentialfields are shown in Figure 1. Observations of the potential fields are measured at the m = 36discrete locations shown as black dots in Figure 1. We set the standard derivation of the observationnoise to σ = 0 . ata-Free Likelihood-Informed Dimension Reduction We first compare the approximate posterior densities defined by the data-free dimension reductionwith that of the data-dependent dimension reduction and that of the truncated Karhunen–Lo´eveexpansion of the prior. We build five sets of projectors: the data-free projectors as detailed inSection 3.3, three sets of data-dependent projectors (see Section 3.2) that correspond to threesynthetic data sets, and projectors defined by leading eigenvectors of the prior covariance ( i.e. thetruncated Karhunen–Lo´eve, referred to as the “prior-based projector” from hereinafter). For eachof the data sets, the corresponding data-dependent projectors are constructed using the adaptiveMCMC algorithm of [54]. Each set consists of projectors with ranks r = 2 , , ..., . For eachprojector, we compute the KL divergences of the full posteriors from the approximated posteriordensities defined by the optimal parameter-reduced likelihood (7). The results are shown in the toprow of Figure 2. Using the same set of projectors, we also compare the KL divergences of the fullposteriors from the resulting approximated posterior densities defined by the optimal parameter-reduced forward model (21). The results are shown in the bottom row of Figure 2. ata-Free Likelihood-Informed Dimension Reduction N posterior samples, which yields D KL ( π y pos (cid:107) ˜ π y pos ) ≈ N N (cid:88) i =1 (cid:0) log L y ( x ( i ) ) − log ˜ L y ( x ( i ) ) (cid:1) +log (cid:16) N N (cid:88) i =1 exp (cid:0) log ˜ L y ( x ( i ) ) − log L y ( x ( i ) ) (cid:1)(cid:17) , where the second sample average accounts for the ratio between normalizing constants. Forapproximations that are close to the full posterior, using a reasonable number of (independent)posterior samples, e.g., 10 used here, make the standard deviations of the estimated KL divergenceinsignificant compared with the mean estimates in our numerical examples.We observe that the optimal parameter-reduced likelihood and the optimal parameter-reducedforward model result in approximate posteriors with similar accuracy. For sufficiently large ranks( r ≥ κ in this example. We plot the errors and the bounds (with κ = 1 forGaussian prior) in Figure 3, in which all approximate posterior densities are defined by the data-free projectors. In this example, we observe that the errors of the approximate posterior densitiesfollow the same trend as their corresponding error bounds. Note that both (14) and (25) give upperbounds on the expected KL divergence, and thus they may not bound the KL divergence for arealization of the data.Figure 3: Elliptic PDE example. 
We observe that the optimal parameter-reduced likelihood and the optimal parameter-reduced forward model result in approximate posteriors with similar accuracy, at least for sufficiently large ranks. We also compare the errors of the approximate posterior densities with the upper bounds (14) and (25), which involve the subspace Poincaré constant κ; in this example κ = 1 for the Gaussian prior. We plot the errors and the bounds in Figure 3, in which all approximate posterior densities are defined by the data-free projectors. In this example, we observe that the errors of the approximate posterior densities follow the same trend as their corresponding error bounds. Note that both (14) and (25) give upper bounds on the expected KL divergence, and thus they may not bound the KL divergence for a given realization of the data.

Figure 3: Elliptic PDE example. The bounds in (14) and (25) (with κ = 1) are compared to the KL divergence of the full posterior densities from the approximate posterior densities defined by the data-free projectors with various ranks. Left: approximate posteriors defined by the optimal parameter-reduced likelihood. Right: approximate posteriors defined by the optimal parameter-reduced forward model.

Table 1: Acronyms of the inference algorithms used in the numerical comparisons.
OL: approximate inference using Algorithm 4 and the optimal parameter-reduced likelihood function in (27)
PM: exact inference using the pseudo-marginal method (Algorithms 5 and 6)
OF: approximate inference using Algorithm 4 and the optimal parameter-reduced forward model in (28)
DA: exact inference using the delayed acceptance algorithm (Algorithm 7) with the approximation defined by the parameter-reduced forward model in (28)
H-MALA: exact inference using the Hessian-preconditioned Langevin MCMC [45]
PCN: exact inference using the preconditioned Crank–Nicolson MCMC [7, 15]

We demonstrate the sampling performance of the various approximate and exact inference algorithms introduced in Sections 5 and 6 using the posterior density conditioned on the first data set. All the methods used in the comparison and their acronyms are summarized in Table 1.

We use the Hessian-preconditioned Metropolis-adjusted Langevin algorithm (H-MALA) and the preconditioned Crank–Nicolson (PCN) MCMC as reference MCMC methods for sampling the full-dimensional posterior. Since H-MALA uses the low-rank decomposition of the Hessian matrix of the logarithm of the posterior density, computed at the maximum a posteriori point, to precondition MCMC, it can also be viewed as a data-dependent subspace-accelerated method. We refer to [17, 45] for a detailed discussion. In order to make a fair comparison with H-MALA, the MCMC algorithm we use on our data-free informed subspace is based on a Langevin proposal preconditioned by the same Hessian matrix used by H-MALA, projected onto the data-free informed subspace.

In Figure 4, the contours of the marginal posterior densities (marginalized onto the first two data-free LIS basis vectors) produced by the approximate inference methods (with r = 16 and r = 48) are compared with those produced by their exact inference modifications (with r = 48). We can observe that the results produced by the approximate inference methods approach those of their exact modifications as the rank of the informed subspace increases.

To measure the efficiency of the various MCMC methods, we use the average integrated autocorrelation time (IACT) of the parameters,

$$ \tau = \frac{1}{d}\sum_{i=1}^{d}\mathrm{IACT}(x_i), $$

where IACT(x_i) is the IACT of the i-th component of x. See [38, Section 12.7] for the definition of the IACT. The data-free projectors with different ranks r and two different sample sizes, N = 2 and N = 5, are used in this experiment. Here H-MALA and PCN are used as base cases to benchmark the MCMC methods accelerated by the informed subspace. All the methods are simulated for 5 × iterations and repeated 10 times to report the mean and the standard deviation of τ. The initial state of each simulation is randomly selected from a pre-computed Markov chain of posterior samples to avoid burn-in.
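The following is a small sketch of one common way to estimate the IACT and the average τ above from a stored chain; the truncation rule (stop at the first non-positive autocorrelation) is a simple choice made here for illustration and is not necessarily the estimator used in the paper.

```python
import numpy as np

def iact(chain, max_lag=None):
    """Integrated autocorrelation time 1 + 2*sum_k rho_k of a scalar chain,
    truncated at the first non-positive sample autocorrelation."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = x.size
    max_lag = max_lag or n // 3
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau

def average_iact(samples):
    """Average IACT over the d components of an (n_samples, d) chain."""
    return np.mean([iact(samples[:, j]) for j in range(samples.shape[1])])
```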
The results are reported in Table 2. For the approximate inference methods (OL and OF), the average IACTs consistently increase with the rank of the projectors, as the sampling performance of the Langevin proposal is expected to decay with the underlying parameter dimension. Both OL and OF produce significantly smaller IACTs compared with the full-dimensional H-MALA.

Figure 4: Elliptic PDE example, data set 1. (a): OL (with r = 16 and r = 48) and PM (with r = 48). (b): OF (with r = 16 and r = 48) and DA (with r = 48). Here (x_1, x_2) represent the directions spanned by the first two data-free LIS basis vectors.

Table 2: Elliptic PDE example. Average IACTs of the parameters computed by the various inference algorithms applied to the posterior conditioned on data set 1. The symbol - indicates poorly mixing Markov chains that do not give a reliable estimate of the IACT. All entries are reported as mean ± standard deviation. The columns report the IACTs of OL and PM together with the standard deviation of $\log \widetilde L^y_N$, the IACTs of OF and DA together with the second stage acceptance rate E[β], and the IACTs of H-MALA and PCN; the rows correspond to ranks r = 16, 24, 32, 40, 48 for each of the sample sizes N = 2 and N = 5.

The estimated reduced likelihood $\widetilde L^y_N$ in the PM method is a random estimator, whereas $\widetilde L^y_N$ in the OL method is deterministic because of the use of prescribed samples. The standard deviation of the logarithm of $\widetilde L^y_N$ in Table 2 confirms that low-rank projectors have rather large Monte Carlo errors, as the approximation accuracy is limited by both the sample size N and the rank of the projector. We observe that increasing either the rank or the sample size can narrow the gap between the IACTs produced by PM and its OL counterpart. This experiment clearly suggests that PM needs to balance the sample size N and the rank of the projector to achieve the optimal performance.

Compared to the OF method, the DA method (the exact inference counterpart of OF) produces the largest IACTs among all subspace inference methods. This result is not surprising, as the second stage acceptance/rejection of DA necessarily deteriorates the statistical performance [11]. In Table 2, we observe that the second stage acceptance rates, E[β], increase with more accurate approximations obtained with higher projector ranks and larger sample sizes. As a result, the gaps between the IACTs produced by OF and DA are smaller for higher projector ranks and larger sample sizes.

Overall, the approximate inference methods have better statistical performance than the other methods in this example (cf. Table 2) and can obtain reasonably accurate results, as shown in Figures 2 and 4. With the additional cost that comes in the form of either Monte Carlo error (PM) or the second stage acceptance/rejection (DA), the exact inference modifications produce Markov chains with larger IACTs. Among all the exact inference methods, PM produces smaller IACTs compared with the full-dimensional H-MALA, PCN, and DA.

It is worth mentioning that each iteration of the subspace MCMC methods needs N forward model simulations, whereas H-MALA requires only one forward model simulation per iteration. In this example, the approximate inference methods (OL and OF) with N = 2 still outperform H-MALA in terms of IACT per model evaluation. The exact inference methods, however, need more model evaluations than H-MALA to obtain the same number of effective samples (we show in the next section another example where H-MALA is outperformed by PM and DA). Notice that the forward model evaluations in each iteration can be embarrassingly parallelized: with parallel computing resources available, the subspace MCMC methods can still be more efficient than H-MALA in terms of the effective sample size per wall-clock time.
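To make the parallelization point concrete, the sketch below estimates the reduced likelihood $\widetilde L^y_N(x_r)$ by averaging the full likelihood over N conditional prior samples of the complement component, running the N forward model (likelihood) evaluations in parallel. The routine names sample_complement and log_likelihood are placeholders for problem-specific code, and the use of a process pool is only one possible parallelization, not the implementation used in the paper.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def log_reduced_likelihood(x_r, sample_complement, log_likelihood, N=5, workers=5):
    """Estimate log L~_N(x_r) = log (1/N) sum_i L^y(x_r + x_perp^(i)), with the
    x_perp^(i) drawn from the conditional prior and the N likelihood
    evaluations executed in parallel."""
    xs = [x_r + sample_complement(x_r) for _ in range(N)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        log_L = np.array(list(pool.map(log_likelihood, xs)))
    # log-mean-exp for numerical stability
    m = log_L.max()
    return m + np.log(np.mean(np.exp(log_L - m)))
```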
9. Example 2: PET with Poisson data
The second example is a two-dimensional PET imaging problem, where we aim to reconstruct the density of an object from integer-valued Poisson observed data. We use here a Besov prior, for which we assess the coordinate selection technique and the prior normalization method presented in Sections 7.2 and 7.3.
In PET imaging, the goal is to identify an object of interest located inside a domain Ω subjected to gamma rays. The rays travel through Ω from multiple sources and the detectors count the number of incident photons (thus the data are integer-valued), see Figure 5a. The object of interest is described by its mass density, which is represented by s ↦ exp(f(s)), where f: Ω → R follows a Besov-B prior with the Haar wavelet, see Section 7.1. The change of intensity of a gamma ray along a path ℓ_i(s), s ∈ Ω, can be modelled using Beer's law:

$$ I_{d,i} = I_{s,i}\,\exp\Big(-\int_{\ell_i(s)}\exp\big(f(s)\big)\,\mathrm{d}s\Big), \qquad (42) $$

where $I_{d,i} \in \mathbb{R}_{\geq 0}$ and $I_{s,i} \in \mathbb{R}_{\geq 0}$ are the intensities at the detector and at the source, respectively. We assume that all the gamma ray sources have the same intensity, $I_{s,i} = I_s$ for i = 1, ..., m.

In this example, the domain Ω is discretized into a regular grid with d cells and the logarithm of the density is assumed to be piecewise constant. This yields the discretized parameter x ∈ R^d. The line integrals in (42) are approximated by

$$ \int_{\ell_i(s)}\exp(f(s))\,\mathrm{d}s \approx \sum_{j=1}^{n} A_{ij}\exp(f_j), $$

where $A_{ij} \in \mathbb{R}_{\geq 0}$ is the length of the intersection between the line ℓ_i and cell j, and exp(f_j) is the discretized density in cell j. By discretizing the wavelet basis on the same grid and following the parametrization discussed in Section 7, we can write f = Bx, where B ∈ R^{d×d} consists of discretized basis functions and x consists of the associated coefficients. In this setting, x follows a product-form Laplace distribution given by (38) with p = 1 and with the scale parameter arbitrarily set to γ = 1. Given a total of m gamma ray paths and the corresponding matrix A ∈ R^{m×d}, the forward model G: R^d → R^m is defined as

$$ G(x) = I_s\,\exp\big(-A\exp(Bx)\big). \qquad (43) $$
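A minimal sketch of this forward model is given below. The matrices A and B are assumed to have been assembled elsewhere (ray-cell intersection lengths and the discretized wavelet basis, respectively); the function name is illustrative.

```python
import numpy as np

def pet_forward(x, A, B, I_s=10.0):
    """Discretised Beer's-law forward model G(x) = I_s * exp(-A @ exp(B @ x)).

    A: (m, d) ray-cell intersection lengths, B: (d, d) discretised wavelet
    basis, x: (d,) wavelet coefficients of the log-density."""
    f = B @ x                    # log-density on the grid
    attenuation = A @ np.exp(f)  # approximate line integrals of the density
    return I_s * np.exp(-attenuation)
```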
We consider the PET setup shown in Figure 5a: the problem domain Ω = [−, ] is discretised into a d = 64 × 64 regular grid, five radiation sources with intensity I_s = 10 are positioned with equal spacing on one side of a circle, spanning a 90° angle, and each source sends a fan of 30 gamma rays that are measured by detectors. This leads to m = 150 observations. The model setup is based on the code of [30].

Figure 5: The PET setup and the three test cases. (a) Discretised domain of interest Ω = [−, ] (mesh), radiation sources (red dots), and detectors (blue dots). (b) Top row: three density functions generated from the prior. Bottom row: corresponding intensity functions (solid lines) and measured counting data sets (dots).

The observed data set is a vector of counts y ∈ N^m, where each element y_i is associated with the i-th gamma ray in the model. For the i-th gamma ray, recall that the intensity at the detector is computed by G_i(x) for some input parameter x; the probability mass function of observing the counting data Y_i = y_i is then given by

$$ P(Y_i = y_i \,|\, x) = \frac{G_i(x)^{y_i}\exp(-G_i(x))}{y_i!}. $$

Supposing that we observe the counting data at all the detectors and that the measurement processes are independent, we can write the likelihood function for the complete data as

$$ L^y(x) = \prod_{i=1}^{m} P(Y_i = y_i \,|\, x) = \prod_{i=1}^{m}\frac{G_i(x)^{y_i}\exp(-G_i(x))}{y_i!}. \qquad (44) $$

As shown in Appendix F, the Fisher information matrix of the above likelihood function takes the form

$$ I(x) = \nabla G(x)^{\top} M(x)\,\nabla G(x), \qquad (45) $$

where M(x) is a diagonal matrix with $M_{ii}(x) = G_i(x)^{-1}$ along its diagonal.
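The sketch below evaluates the Fisher information (45) for this Poisson likelihood and averages it over prior samples to extract a dominant eigenspace. It assumes that the data-free construction of Section 3.3 amounts to an eigendecomposition of the prior-averaged Fisher information, it ignores the coordinate selection and prior normalization treatments of Sections 7.2 and 7.3, and the function names and sample sizes are illustrative.

```python
import numpy as np

def poisson_fisher(x, A, B, I_s=10.0):
    """Fisher information I(x) = grad G(x)^T M(x) grad G(x) of the Poisson
    likelihood (44)-(45), with M(x) = diag(1 / G_i(x))."""
    u = np.exp(B @ x)                   # discretised density exp(f)
    G = I_s * np.exp(-(A @ u))          # predicted detector intensities
    J = -G[:, None] * ((A * u) @ B)     # Jacobian of G, shape (m, d)
    return J.T @ (J / G[:, None])       # J^T diag(1/G) J

def data_free_subspace(A, B, r, n_prior=100, seed=None):
    """Monte Carlo average of the Fisher information over samples from the
    product-form Laplace prior (gamma = 1), and its rank-r dominant eigenspace."""
    rng = np.random.default_rng(seed)
    d = B.shape[1]
    H = np.zeros((d, d))
    for _ in range(n_prior):
        x = rng.laplace(scale=1.0, size=d)
        H += poisson_fisher(x, A, B) / n_prior
    eigval, eigvec = np.linalg.eigh(H)            # ascending order
    return eigvec[:, ::-1][:, :r], eigval[::-1]   # leading eigenvectors, eigenvalues
```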
We generate three "true" density functions from the prior distribution and use them to simulate synthetic data sets. The true density functions and the simulated data are shown in Figure 5b.

We first present the results obtained by applying the coordinate selection method (cf. Section 7.2). As in the first example, we first compare the accuracy of the approximate posterior densities defined by the various approaches and projectors, and then benchmark the performance of the MCMC methods. We adopt the same setup and acronyms as in Example 1 and Table 1.

For the approximate posterior densities, five sets of projectors built from selected coordinates are considered: the data-free projectors, three sets of data-dependent projectors (corresponding to the three data sets), and the prior-based truncated wavelet basis. Each set consists of projectors with ranks r = 2, . . . . The KL divergences of the full posteriors from the approximate posterior densities defined by the optimal parameter-reduced likelihood (7) are shown in the top row of Figure 6, while those of the optimal parameter-reduced forward model (21) are shown in the bottom row of Figure 6.

In this example, we observe results similar to those of the elliptic PDE example. The optimal parameter-reduced likelihood and the optimal parameter-reduced forward model result in approximate posteriors with similar accuracy. The most accurate approximate posterior densities are obtained by the data-dependent projectors of the corresponding data set, followed by those obtained by the data-free projectors. For each data set, the data-dependent projectors constructed using the other data sets result in less accurate approximations in general. However, the accuracy gaps between the data-free projectors and the data-dependent projectors (using other data sets) are not as significant as in the elliptic PDE example. This can be caused by either the coordinate selection method or the rather large data size in this example. Compared with the prior-based dimension reduction, which is also an offline method, the data-free construction offers significantly more accurate approximations in this example. Overall, the data-free dimension reduction provides reasonably accurate posterior approximations for the Poisson observation process considered here.

We also compare the errors of the approximate posterior densities with the bounds in (14) and (25), with κ being replaced by 1. Interestingly, we still observe that the errors of the approximate posterior densities follow the same trend as their corresponding bounds.

In Figure 8, the contours of the marginal posterior densities produced by the approximate inference methods (with r = 16 and r = 48) are compared with those produced by their exact inference modifications (with r = 48). In this example, we observe that the contours produced by the approximate inference methods are visually similar to those of the exact inference methods. In addition, with increasing ranks, the contours produced by the approximate inference methods approach those of the exact inference methods.
Figure 8: Same as Figure 4, but for PET and data set 1. Here r = 16, 48 are used in OL and OF, and r = 48 is used in PM and DA. Here (x_1, x_2) represent the first two coordinates selected by the data-free method.

Table 3: Same as Table 2, but for PET and data set 1. The coordinate selection is used in the dimension reduction. The ranks of the subspaces are chosen to be 16, 32, and 48, and results are reported for the sample sizes N = 2 and N = 5.

Again, we use the average IACT of the parameters, $\tau = \frac{1}{d}\sum_{i=1}^{d}\mathrm{IACT}(f_i)$, to measure the efficiency of the various MCMC methods. The results are reported in Table 3. Here both PCN and H-MALA are implemented to sample the posterior in the transformed coordinates equipped with a Gaussian prior (cf. Section 7.3).

For the approximate inference methods (OL and OF), the average IACTs consistently increase with the rank of the projectors, as the sampling performance of the Langevin proposal is expected to decay with the underlying parameter dimension. Both OL and OF produce significantly smaller IACTs compared with the full-dimensional PCN and H-MALA. Compared to the OL method, the performance of PM is again affected by the Monte Carlo error of the estimated reduced likelihood, cf. the standard deviation of $\log \widetilde L^y_N$ (with N = 2, 5) in Table 3.
Compared to the OF method, the DA method again produces the largest IACTs among all subspace inference methods. However, the loss of performance here is not as severe as in the elliptic PDE example, which is also reflected in the improved second stage acceptance rates, E[β]. Overall, the approximate inference methods have better statistical performance than the other methods in this example and can obtain reasonably accurate results, as shown in Figures 6 and 8. With improved approximation errors, the exact inference methods also produce Markov chains with better mixing. Among all the exact inference methods, PM produces significantly smaller IACTs compared with the other methods.

Then, we present the results obtained by applying the prior normalization method (cf. Section 7.3). The KL divergences of the full posteriors from the approximate posterior densities are shown in Figure 9. Here the result of the prior-based dimension reduction is not presented, as the prior in the transformed space has an identity covariance matrix. We observe results similar to those obtained by the coordinate selection. We notice that the accuracy gaps between the data-free projectors and the data-dependent projectors (built using other data sets) are more significant compared with those obtained by the coordinate selection. The comparison of the errors of the approximate posterior densities (defined by the data-free projectors) with the bounds in (14) and (25) is provided in Figure 10. Here we have κ = 1 because the transformed coordinates are endowed with a Gaussian prior. We observe that the errors of the approximate posterior densities follow the same trend as their corresponding bounds. The IACTs of the various MCMC methods are reported in Table 4. Again, the efficiency of the subspace MCMC methods defined by the prior normalization is very close to that defined by the coordinate selection. Overall, both the coordinate selection and the prior normalization can be applied in this example to obtain accurate reduced-dimensional posterior approximations and to derive efficient subspace MCMC methods.

Table 4: Same as Table 2, but for PET and data set 1. The prior normalization is used in the dimension reduction. The ranks of the subspaces are chosen to be 16, 32, and 48, and results are reported for the sample sizes N = 2 and N = 5.
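As a rough illustration of this kind of prior normalization, the sketch below maps a standard Gaussian reference coordinate to a Laplace-distributed coordinate by composing the two cumulative distribution functions (the p = 1 case of Appendix E). This is only one plausible realization of the transformation in Section 7.3, written under that assumption; the exact map used in the paper may differ.

```python
import numpy as np
from scipy.stats import norm

def laplace_cdf(x, gamma=1.0):
    """CDF of the density proportional to exp(-gamma*|x|) (Appendix E, p = 1)."""
    return 0.5 + 0.5 * np.sign(x) * (1.0 - np.exp(-gamma * np.abs(x)))

def laplace_icdf(u, gamma=1.0):
    """Inverse of the Laplace CDF above."""
    s = np.sign(u - 0.5)
    return -s * np.log(1.0 - 2.0 * np.abs(u - 0.5)) / gamma

def gaussian_to_laplace(z, gamma=1.0):
    """Map standard-normal coordinates z to Laplace-distributed coordinates x,
    so that sampling can be carried out in a space with a Gaussian prior."""
    return laplace_icdf(norm.cdf(z), gamma)
```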
10. Conclusion
We present a new data-free strategy for reducing the dimensionality of large-scale statistical inverse problems. Compared to existing gradient-based dimension reduction techniques, this new approach constructs the informed subspace without reference to the observed data, so the construction can be carried out offline. The approach is demonstrated on an elliptic PDE inverse problem with a Gaussian process prior and on a PET problem with Poisson data and a Besov-B prior.

Acknowledgments
T. Cui and O. Zahm would like to acknowledge support from the INRIA associate teamUNQUESTIONABLE. T. Cui also acknowledges support from the Australian Research Council.
Appendix A. Proof of Proposition 4.1
Recall that

$$ \pi^y_{\mathrm{pos}}(x) = \frac{L^y(x)\,\pi_{\mathrm{pr}}(x)}{\pi_{\mathrm{data}}(y)} \quad\text{and}\quad \widehat\pi^y_{\mathrm{pos}}(x) = \frac{\widehat L^y(x)\,\pi_{\mathrm{pr}}(x)}{\widehat\pi_{\mathrm{data}}(y)}. $$

By definition of $L^y(x)$ and $\widehat L^y(x)$ we have

$$ D_{\mathrm{KL}}(\pi^y_{\mathrm{pos}} \| \widehat\pi^y_{\mathrm{pos}}) = \int_{\mathbb{R}^d}\log\Big(\frac{\pi^y_{\mathrm{pos}}(x)}{\widehat\pi^y_{\mathrm{pos}}(x)}\Big)\,\pi^y_{\mathrm{pos}}(x)\,\mathrm{d}x = \log\frac{\widehat\pi_{\mathrm{data}}(y)}{\pi_{\mathrm{data}}(y)} + \int_{\mathbb{R}^d}\Big(-\tfrac12\|G(x)-y\|^2_{\Sigma^{-1}} + \tfrac12\|\widehat G(x)-y\|^2_{\Sigma^{-1}}\Big)\,\pi^y_{\mathrm{pos}}(x)\,\mathrm{d}x $$
$$ = \log\frac{\widehat\pi_{\mathrm{data}}(y)}{\pi_{\mathrm{data}}(y)} + \int_{\mathbb{R}^d}\Big(\tfrac12\|e(x)\|^2_{\Sigma^{-1}} - e(x)^\top\Sigma^{-1}\big(G(x)-y\big)\Big)\,\pi^y_{\mathrm{pos}}(x)\,\mathrm{d}x, \qquad (A.1) $$

where $e(x) = G(x) - \widehat G(x)$ does not depend on $y$. Next we replace $y$ by $Y \sim \pi_{\mathrm{data}}$ and take the expectation over $Y$. The first term in the above expression becomes

$$ \mathbb{E}\Big[\log\frac{\widehat\pi_{\mathrm{data}}(Y)}{\pi_{\mathrm{data}}(Y)}\Big] = \int_{\mathbb{R}^m}\log\Big(\frac{\widehat\pi_{\mathrm{data}}(y)}{\pi_{\mathrm{data}}(y)}\Big)\,\pi_{\mathrm{data}}(y)\,\mathrm{d}y = -D_{\mathrm{KL}}(\pi_{\mathrm{data}} \| \widehat\pi_{\mathrm{data}}). \qquad (A.2) $$

Next, by definition of $\pi^y_{\mathrm{pos}}(x)$, we have

$$ \mathbb{E}\Big[\int_{\mathbb{R}^d}\tfrac12\|e(x)\|^2_{\Sigma^{-1}}\,\pi^Y_{\mathrm{pos}}(x)\,\mathrm{d}x\Big] = \tfrac12\int_{\mathbb{R}^d\times\mathbb{R}^m}\|e(x)\|^2_{\Sigma^{-1}}\,\pi^y_{\mathrm{pos}}(x)\,\pi_{\mathrm{data}}(y)\,\mathrm{d}x\,\mathrm{d}y = \tfrac12\int_{\mathbb{R}^d\times\mathbb{R}^m}\|e(x)\|^2_{\Sigma^{-1}}\,L^y(x)\,\pi_{\mathrm{pr}}(x)\,\mathrm{d}x\,\mathrm{d}y = \tfrac12\int_{\mathbb{R}^d}\|G(x)-\widehat G(x)\|^2_{\Sigma^{-1}}\,\pi_{\mathrm{pr}}(x)\,\mathrm{d}x. \qquad (A.3) $$

The last equality uses the fact that $y \mapsto L^y(x)$ is a pdf, so that $\int_{\mathbb{R}^m} L^y(x)\,\mathrm{d}y = 1$. Using the same arguments, we have

$$ \mathbb{E}\Big[\int_{\mathbb{R}^d} e(x)^\top\Sigma^{-1}\big(G(x)-Y\big)\,\pi^Y_{\mathrm{pos}}(x)\,\mathrm{d}x\Big] = \int_{\mathbb{R}^d\times\mathbb{R}^m} e(x)^\top\Sigma^{-1}\big(G(x)-y\big)\,L^y(x)\,\pi_{\mathrm{pr}}(x)\,\mathrm{d}x\,\mathrm{d}y = \int_{\mathbb{R}^d} e(x)^\top\Sigma^{-1}G(x)\,\pi_{\mathrm{pr}}(x)\,\mathrm{d}x - \int_{\mathbb{R}^d} e(x)^\top\Sigma^{-1}\Big(\int_{\mathbb{R}^m} y\,L^y(x)\,\mathrm{d}y\Big)\,\pi_{\mathrm{pr}}(x)\,\mathrm{d}x = 0. \qquad (A.4) $$

The last equality is obtained by noting that the expectation of the data given the parameter $x$ is $\int_{\mathbb{R}^m} y\,L^y(x)\,\mathrm{d}y = G(x)$. Combining (A.2), (A.3) and (A.4) with (A.1), we obtain

$$ \mathbb{E}\big[D_{\mathrm{KL}}(\pi^Y_{\mathrm{pos}} \| \widehat\pi^Y_{\mathrm{pos}})\big] = -D_{\mathrm{KL}}(\pi_{\mathrm{data}} \| \widehat\pi_{\mathrm{data}}) + \tfrac12\int_{\mathbb{R}^d}\|G(x)-\widehat G(x)\|^2_{\Sigma^{-1}}\,\pi_{\mathrm{pr}}(x)\,\mathrm{d}x, $$

which concludes the proof.

Appendix B. Proof of Proposition 6.1
Consider a Metropolis-Hastings algorithm which targets the pdf $\pi^{y,N}_{\mathrm{tar}}$ defined by (32) using the following proposal density

$$ q\big(x_r^\dagger, \{x_\perp^{\dagger(i)}\}_{i=1}^N \,\big|\, x_r, \{x_\perp^{(i)}\}_{i=1}^N\big) = q(x_r^\dagger \,|\, x_r)\,\prod_{i=1}^{N}\pi_{\mathrm{pr}}\big(x_\perp^{\dagger(i)} \,\big|\, x_r^\dagger\big), \qquad (B.1) $$

where $q(x_r^\dagger | x_r)$ is the same proposal density as the one used at step 1 of Algorithm 5. The acceptance probability of this Metropolis-Hastings algorithm is given by

$$ \alpha\big(x_r^\dagger, \{x_\perp^{\dagger(i)}\}_{i=1}^N \,\big|\, x_r, \{x_\perp^{(i)}\}_{i=1}^N\big) = \min\left(1,\; \frac{\pi^{y,N}_{\mathrm{tar}}\big(x_r^\dagger, \{x_\perp^{\dagger(i)}\}_{i=1}^N\big)\, q\big(x_r, \{x_\perp^{(i)}\}_{i=1}^N \,\big|\, x_r^\dagger, \{x_\perp^{\dagger(i)}\}_{i=1}^N\big)}{\pi^{y,N}_{\mathrm{tar}}\big(x_r, \{x_\perp^{(i)}\}_{i=1}^N\big)\, q\big(x_r^\dagger, \{x_\perp^{\dagger(i)}\}_{i=1}^N \,\big|\, x_r, \{x_\perp^{(i)}\}_{i=1}^N\big)}\right) = \min\left(1,\; \frac{\pi_{\mathrm{pr}}(x_r^\dagger)\,\big(\sum_{i=1}^{N} L^y(x_r^\dagger + x_\perp^{\dagger(i)})\big)\, q(x_r \,|\, x_r^\dagger)}{\pi_{\mathrm{pr}}(x_r)\,\big(\sum_{i=1}^{N} L^y(x_r + x_\perp^{(i)})\big)\, q(x_r^\dagger \,|\, x_r)}\right), $$

which is precisely $\widehat\alpha_N(x_r^\dagger \,|\, x_r)$ defined in (31). Note that the first two steps of Algorithm 5 consist of drawing a sample $(x_r^\dagger, \{x_\perp^{\dagger(i)}\}_{i=1}^N)$ from the proposal (B.1). This way, Algorithm 5 can be interpreted as an MCMC algorithm which targets $\pi^{y,N}_{\mathrm{tar}}$. It remains to show that the marginal distribution $\pi^{y,N}_{\mathrm{tar}}(x_r)$ is the marginal posterior $\pi^y_{\mathrm{pos}}(x_r)$. We can write

$$ \pi^{y,N}_{\mathrm{tar}}(x_r) = \int_{\mathrm{Ker}(P_r)^N}\pi^{y,N}_{\mathrm{tar}}\big(x_r, x_\perp^{(1)}, \ldots, x_\perp^{(N)}\big)\,\mathrm{d}x_\perp^{(1)}\cdots\mathrm{d}x_\perp^{(N)} \propto \pi_{\mathrm{pr}}(x_r)\sum_{i=1}^{N}\int_{\mathrm{Ker}(P_r)} L^y\big(x_r + x_\perp^{(i)}\big)\,\pi_{\mathrm{pr}}\big(x_\perp^{(i)} \,\big|\, x_r\big)\,\mathrm{d}x_\perp^{(i)} \propto \int_{\mathrm{Ker}(P_r)} L^y(x_r + x_\perp)\,\pi_{\mathrm{pr}}(x_r + x_\perp)\,\mathrm{d}x_\perp \propto \pi^y_{\mathrm{pos}}(x_r), $$

which concludes the proof.

Appendix C. Proof of Proposition 6.3
Recall that $\{(X_r^{(j)}, \{X_\perp^{(j,i)}\}_{i=1}^N)\}_{j\geq 1}$ admits $\pi^{y,N}_{\mathrm{tar}}$ in (32) as its invariant density, see Proposition 6.1. It remains to prove that $\{X^{(j)}\}_{j\geq 1}$ admits $\pi^y_{\mathrm{pos}}(x)$ as its invariant density. For a given state $(X_r^{(j)}, \{X_\perp^{(j,i)}\}_{i=1}^N) = (x_r, \{x_\perp^{(i)}\}_{i=1}^N)$, we have $X^{(j)} = x_r + x_\perp$, where $x_\perp \in \{x_\perp^{(i)}\}_{i=1}^N$ is selected with probability

$$ \mathbb{P}\big(x_\perp = x_\perp^{(k)} \,\big|\, x_r, \{x_\perp^{(i)}\}_{i=1}^N\big) = \frac{L^y(x_r + x_\perp^{(k)})}{\sum_{i=1}^{N} L^y(x_r + x_\perp^{(i)})}, \qquad 1 \leq k \leq N. \qquad (C.1) $$

Thus, we need to prove that the pdf $\pi(x)$ of $x = x_r + x_\perp$ is the posterior density, $\pi(x) = \pi^y_{\mathrm{pos}}(x)$. We can write

$$ \pi(x) = \pi(x_r, x_\perp) = \int_{\mathrm{Ker}(P_r)^N}\pi\big(x_r, \{x_\perp^{(i)}\}_{i=1}^N, x_\perp\big)\,\mathrm{d}x_\perp^{(1)}\cdots\mathrm{d}x_\perp^{(N)} = \int_{\mathrm{Ker}(P_r)^N}\pi\big(x_\perp \,\big|\, x_r, \{x_\perp^{(i)}\}_{i=1}^N\big)\,\pi^{y,N}_{\mathrm{tar}}\big(x_r, \{x_\perp^{(i)}\}_{i=1}^N\big)\,\mathrm{d}x_\perp^{(1)}\cdots\mathrm{d}x_\perp^{(N)}, $$

where $\pi(x_\perp \,|\, x_r, \{x_\perp^{(i)}\}_{i=1}^N)$ is the pdf of $x_\perp$ conditioned on $(x_r, \{x_\perp^{(i)}\}_{i=1}^N)$. By construction we have

$$ \pi\big(x_\perp \,\big|\, x_r, \{x_\perp^{(i)}\}_{i=1}^N\big) \overset{(C.1)}{=} \frac{\sum_{k=1}^{N}\delta_{x_\perp^{(k)}}(x_\perp)\,L^y(x_r + x_\perp^{(k)})}{\sum_{i=1}^{N} L^y(x_r + x_\perp^{(i)})}, \qquad (C.2) $$

where $\delta_{x_\perp^{(k)}}$ denotes the Dirac mass at the point $x_\perp^{(k)}$. We can write

$$ \pi(x) = \int_{\mathrm{Ker}(P_r)^N}\frac{\sum_{k=1}^{N}\delta_{x_\perp^{(k)}}(x_\perp)\,L^y(x_r + x_\perp^{(k)})}{\sum_{i=1}^{N} L^y(x_r + x_\perp^{(i)})}\,\pi^{y,N}_{\mathrm{tar}}\big(x_r, \{x_\perp^{(i)}\}_{i=1}^N\big)\,\mathrm{d}x_\perp^{(1)}\cdots\mathrm{d}x_\perp^{(N)} \overset{(32)}{\propto} \sum_{k=1}^{N}\int_{\mathrm{Ker}(P_r)^N}\delta_{x_\perp^{(k)}}(x_\perp)\,L^y(x_r + x_\perp^{(k)})\,\pi_{\mathrm{pr}}(x_r)\prod_{i=1}^{N}\pi_{\mathrm{pr}}\big(x_\perp^{(i)} \,\big|\, x_r\big)\,\mathrm{d}x_\perp^{(1)}\cdots\mathrm{d}x_\perp^{(N)} $$
$$ \propto \sum_{k=1}^{N}\int_{\mathrm{Ker}(P_r)}\delta_{x_\perp^{(k)}}(x_\perp)\,L^y(x_r + x_\perp^{(k)})\,\pi_{\mathrm{pr}}(x_r)\,\pi_{\mathrm{pr}}\big(x_\perp^{(k)} \,\big|\, x_r\big)\,\mathrm{d}x_\perp^{(k)} \propto \sum_{k=1}^{N} L^y(x_r + x_\perp)\,\pi_{\mathrm{pr}}(x_r)\,\pi_{\mathrm{pr}}(x_\perp \,|\, x_r) \propto \pi^y_{\mathrm{pos}}(x_r + x_\perp), $$

which concludes the proof.

Appendix D. Proof of Proposition 6.4
To show the result of Proposition 6.4, we first interpret the first stage acceptance/rejection and the conditional prior sampling $\pi_{\mathrm{pr}}(x_\perp^\dagger \,|\, x_r^\dagger)$ as a joint proposal acting in the full parameter space $\mathrm{Im}(P_r) \oplus \mathrm{Ker}(P_r)$. The proposal $q(\cdot \,|\, x_r)$ and the acceptance probability $\alpha(x_r^\dagger \,|\, x_r)$ define an effective proposal distribution

$$ \bar q(x_r^\dagger \,|\, x_r) = q(x_r^\dagger \,|\, x_r)\,\alpha(x_r^\dagger \,|\, x_r) + \Big[1 - \int q(x_r' \,|\, x_r)\,\alpha(x_r' \,|\, x_r)\,\mathrm{d}x_r'\Big]\,\delta_{x_r}(x_r^\dagger), $$

where $\delta_{x_r}(\cdot)$ denotes the Dirac delta and the term in the bracket represents the probability of a proposal candidate being rejected. Then, we can define a joint proposal distribution

$$ Q(x_r^\dagger, x_\perp^\dagger \,|\, x_r, x_\perp) := Q(x_r^\dagger, x_\perp^\dagger \,|\, x_r) = \pi_{\mathrm{pr}}(x_\perp^\dagger \,|\, x_r^\dagger)\,\bar q(x_r^\dagger \,|\, x_r) $$

for the Metropolis-Hastings algorithm to sample the full posterior. Following exactly the same derivation as in [11], one can show that accepting $(x_r^\dagger, x_\perp^\dagger) \sim Q(\cdot, \cdot \,|\, x_r, x_\perp)$ with the probability

$$ \beta(x_r^\dagger, x_\perp^\dagger \,|\, x_r, x_\perp) = \min\left[1,\; \frac{\pi^y_{\mathrm{pos}}(x_r^\dagger + x_\perp^\dagger)\, Q(x_r, x_\perp \,|\, x_r^\dagger, x_\perp^\dagger)}{\pi^y_{\mathrm{pos}}(x_r + x_\perp)\, Q(x_r^\dagger, x_\perp^\dagger \,|\, x_r, x_\perp)}\right] $$

defines a Markov transition kernel with the full posterior $\pi^y_{\mathrm{pos}}(x_r + x_\perp)$ as its invariant distribution. Since the above acceptance probability is only used in the case where the first stage proposal candidate $x_r^\dagger$ is accepted, i.e., $x_r^\dagger \neq x_r$, we do not need to consider the Dirac delta term. This way, the above acceptance probability can be simplified to

$$ \beta(x_r^\dagger, x_\perp^\dagger \,|\, x_r, x_\perp) = \min\left[1,\; \frac{\pi^y_{\mathrm{pos}}(x_r^\dagger + x_\perp^\dagger)\,\pi_{\mathrm{pr}}(x_\perp \,|\, x_r)\, q(x_r \,|\, x_r^\dagger)\,\alpha(x_r \,|\, x_r^\dagger)}{\pi^y_{\mathrm{pos}}(x_r + x_\perp)\,\pi_{\mathrm{pr}}(x_\perp^\dagger \,|\, x_r^\dagger)\, q(x_r^\dagger \,|\, x_r)\,\alpha(x_r^\dagger \,|\, x_r)}\right]. $$

Substituting the identities

$$ \frac{\alpha(x_r \,|\, x_r^\dagger)}{\alpha(x_r^\dagger \,|\, x_r)} = \frac{\widetilde L^y_N(x_r)\,\pi_{\mathrm{pr}}(x_r)\, q(x_r^\dagger \,|\, x_r)}{\widetilde L^y_N(x_r^\dagger)\,\pi_{\mathrm{pr}}(x_r^\dagger)\, q(x_r \,|\, x_r^\dagger)} \quad\text{and}\quad \frac{\pi^y_{\mathrm{pos}}(x_r^\dagger + x_\perp^\dagger)\,\pi_{\mathrm{pr}}(x_\perp \,|\, x_r)}{\pi^y_{\mathrm{pos}}(x_r + x_\perp)\,\pi_{\mathrm{pr}}(x_\perp^\dagger \,|\, x_r^\dagger)} = \frac{L^y(x_r^\dagger + x_\perp^\dagger)\,\pi_{\mathrm{pr}}(x_r^\dagger)}{L^y(x_r + x_\perp)\,\pi_{\mathrm{pr}}(x_r)} $$

into the above expression, we obtain

$$ \beta(x_r^\dagger, x_\perp^\dagger \,|\, x_r, x_\perp) = \min\left[1,\; \frac{L^y(x_r^\dagger + x_\perp^\dagger)\,\widetilde L^y_N(x_r)}{L^y(x_r + x_\perp)\,\widetilde L^y_N(x_r^\dagger)}\right], $$

which is identical to the second stage acceptance probability in (35). Thus, the result follows.

Appendix E. Cumulative distribution function of $p(x) \propto \exp(-\gamma|x|^p)$

Given the pdf $p(x) = c_{\gamma,p}^{-1}\exp(-\gamma|x|^p)$, $x \in \mathbb{R}$, we want to find its normalizing constant $c_{\gamma,p}$ and its cdf. Using symmetry, the normalizing constant takes the form

$$ c_{\gamma,p} = 2\int_{0}^{\infty}\exp(-\gamma x^p)\,\mathrm{d}x, $$

and the cdf can be expressed as
$$ \Phi(x) = \begin{cases} \dfrac12 + \dfrac{1}{c_{\gamma,p}}\displaystyle\int_{0}^{x}\exp(-\gamma t^p)\,\mathrm{d}t, & x \geq 0,\\[1ex] \dfrac12 - \dfrac{1}{c_{\gamma,p}}\displaystyle\int_{0}^{-x}\exp(-\gamma t^p)\,\mathrm{d}t, & x < 0. \end{cases} $$

We first introduce the change of variable $z = \gamma t^p$, so that $t = (z/\gamma)^{1/p}$ and $\frac{\mathrm{d}t}{\mathrm{d}z} = p^{-1}\gamma^{-1/p} z^{1/p-1}$. This yields

$$ \int_{0}^{x}\exp(-\gamma t^p)\,\mathrm{d}t = p^{-1}\gamma^{-1/p}\int_{0}^{\gamma x^p} z^{1/p-1}\exp(-z)\,\mathrm{d}z =: p^{-1}\gamma^{-1/p}\,\Gamma_{\mathrm{lower}}\big(p^{-1}, \gamma x^p\big), $$

where $\Gamma_{\mathrm{lower}}(\cdot,\cdot)$ is the lower incomplete gamma function. Following a similar derivation, we obtain

$$ c_{\gamma,p} = 2\,p^{-1}\gamma^{-1/p}\,\Gamma\big(p^{-1}\big), $$

where $\Gamma(\cdot)$ is the gamma function. This way, we have the cdf

$$ \Phi(x) = \frac12 + \frac{\mathrm{sign}(x)}{2\,\Gamma(p^{-1})}\,\Gamma_{\mathrm{lower}}\Big(p^{-1}, \gamma\big(\mathrm{sign}(x)\,x\big)^p\Big). $$

There are two notable special cases. The Gaussian distribution $\mathcal{N}(0, \sigma^2)$ is obtained with $\gamma = (2\sigma^2)^{-1}$ and $p = 2$, in which case the cdf can be equivalently expressed using the error function. The Laplace distribution is obtained with $p = 1$, so that the cdf has the simpler (but equivalent) form

$$ \Phi(x) = \frac12 + \frac{\mathrm{sign}(x)}{2}\Big(1 - \exp\big(-\gamma\,\mathrm{sign}(x)\,x\big)\Big). $$
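For reference, the cdf derived above can be evaluated numerically with the regularized lower incomplete gamma function; a small sketch (not part of the paper) is:

```python
import numpy as np
from scipy.special import gammainc   # regularized lower incomplete gamma P(a, z)

def generalized_gaussian_cdf(x, gamma=1.0, p=1.0):
    """CDF of the density proportional to exp(-gamma*|x|**p), following the
    expression derived in Appendix E; gammainc already divides by Gamma(1/p)."""
    x = np.asarray(x, dtype=float)
    return 0.5 + 0.5 * np.sign(x) * gammainc(1.0 / p, gamma * np.abs(x) ** p)
```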
Appendix F. Derivation of Fisher information matrices

Here we derive the Fisher information matrix for the Poisson likelihood case. Recall the Fisher information matrix

$$ I(x) = \int_{\mathbb{R}^m}\big(\nabla\log L^y(x)\big)\big(\nabla\log L^y(x)\big)^{\top} L^y(x)\,\mathrm{d}y. \qquad (F.1) $$

Defining the predata $\eta = G(x)$, we can express the gradient of the log-likelihood function as

$$ \nabla_x \log L^y(x) = \nabla G(x)^{\top}\,\nabla_\eta \log L^y(\eta), \quad\text{where}\quad L^y(\eta) = \prod_{i=1}^{m}\frac{\eta_i^{y_i}\exp(-\eta_i)}{y_i!}, \quad\text{subject to}\quad \eta = G(x). $$

This way, the Fisher information matrix can be rewritten as

$$ I(x) = \nabla G(x)^{\top}\Big(\int_{\mathbb{R}^m}\big(\nabla\log L^y(\eta)\big)\big(\nabla\log L^y(\eta)\big)^{\top} L^y(\eta)\,\mathrm{d}y\Big)\nabla G(x), \qquad (F.2) $$

subject to $\eta = G(x)$. The term in the brackets of the above equation is the Fisher information matrix of the Poisson distribution, which is a diagonal matrix with

$$ \Big(\int_{\mathbb{R}^m}\big(\nabla\log L^y(\eta)\big)\big(\nabla\log L^y(\eta)\big)^{\top} L^y(\eta)\,\mathrm{d}y\Big)_{ii} = \frac{1}{\eta_i}. $$

Thus, the Fisher information matrix with respect to $x$ is

$$ I(x) = \nabla G(x)^{\top} M(x)\,\nabla G(x), \qquad (F.3) $$

where $M(x)$ is a diagonal matrix with $M_{ii}(x) = G_i(x)^{-1}$ along its diagonal.

References

[1] Christophe Andrieu, Gareth O Roberts, et al. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
[2] Christophe Andrieu, Matti Vihola, et al. Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability, 25(2):1030–1077, 2015.
[3] Christophe Andrieu, Matti Vihola, et al. Establishing some order amongst exact approximations of MCMCs. The Annals of Applied Probability, 26(5):2661–2696, 2016.
[4] Mario Bebendorf. A note on the Poincaré inequality for convex domains. Zeitschrift für Analysis und ihre Anwendungen, 22(4):751–756, 2003.
[5] Alexandros Beskos, Mark Girolami, Shiwei Lan, Patrick E Farrell, and Andrew M Stuart. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.
[6] Alexandros Beskos, Ajay Jasra, Kody Law, Youssef Marzouk, and Yan Zhou. Multilevel sequential Monte Carlo with dimension-independent likelihood-informed proposals. SIAM/ASA Journal on Uncertainty Quantification, 6(2):762–786, 2018.
[7] Alexandros Beskos, Gareth Roberts, Andrew Stuart, and Jochen Voss. MCMC methods for diffusion bridges. Stochastics and Dynamics, 8(03):319–350, 2008.
[8] Marie Billaud-Friess, Anthony Nouy, and Olivier Zahm. A tensor approximation method based on ideal minimal residual formulations for the solution of high-dimensional problems. ESAIM: Mathematical Modelling and Numerical Analysis, 48(6):1777–1806, 2014.
[9] S. Brooks, A. Gelman, G. Jones, and X. L. Meng, editors. Handbook of Markov Chain Monte Carlo. Taylor & Francis, 2011.
[10] Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009.
[11] J Andrés Christen and Colin Fox. Markov chain Monte Carlo using an approximation. Journal of Computational and Graphical Statistics, 14(4):795–810, 2005.
[12] Paul G Constantine, Eric Dow, and Qiqi Wang. Active subspace methods in theory and practice: applications to kriging surfaces. SIAM Journal on Scientific Computing, 36(4):A1500–A1524, 2014.
[13] Paul G Constantine, Carson Kent, and Tan Bui-Thanh. Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, 38(5):A2779–A2805, 2016.
[14] Andrea F Cortesi, Paul G Constantine, Thierry E Magin, and Pietro M Congedo. Forward and backward uncertainty quantification with active subspaces: application to hypersonic flows around a cylinder. Journal of Computational Physics, 407:109079, 2020.
[15] Simon L Cotter, Gareth O Roberts, Andrew M Stuart, David White, et al. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
[16] Tiangang Cui, Gianluca Detommaso, and Robert Scheichl. Multilevel dimension-independent likelihood-informed MCMC for large-scale inverse problems. arXiv preprint arXiv:1910.12431, 2019.
[17] Tiangang Cui, Kody JH Law, and Youssef M Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
[18] Tiangang Cui, James Martin, Youssef M Marzouk, Antti Solonen, and Alessio Spantini. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.
[19] Tiangang Cui and Xin T Tong. A unified performance analysis of likelihood-informed subspace methods. arXiv preprint arXiv:2101.02417, 2021.
[20] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[21] Masoumeh Dashti, Stephen Harris, and Andrew Stuart. Besov priors for Bayesian inverse problems. Inverse Problems & Imaging, 6(2):183, 2012.
[22] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
[23] Arnaud Doucet, Michael K Pitt, George Deligiannidis, and Robert Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.
[24] Alain Durmus, Eric Moulines, et al. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
[25] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research, 20(183):1–42, 2019.
[26] Ivan G Graham, Frances Y Kuo, James A Nichols, Robert Scheichl, Ch Schwab, and Ian H Sloan. Quasi-Monte Carlo finite element methods for elliptic PDEs with lognormal random coefficients. Numerische Mathematik, 131(2):329–368, 2015.
[27] Ivan G Graham, Frances Y Kuo, Dirk Nuyens, Robert Scheichl, and Ian H Sloan. Quasi-Monte Carlo methods for elliptic PDEs with random coefficients and applications. Journal of Computational Physics, 230(10):3668–3694, 2011.
[28] Alice Guionnet and B Zegarlinksi. Lectures on logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXVI, pages 1–134. Springer, 2003.
[29] W. Hastings. Monte Carlo sampling using Markov chains and their applications. Biometrika, 57:97–109, 1970.
[30] Jere Heikkinen. Statistical inversion theory in x-ray tomography. 2008.
[31] Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse Problems, volume 160. Springer Science & Business Media, 2006.
[32] Ville Kolehmainen, Matti Lassas, Kati Niinimäki, and Samuli Siltanen. Sparsity-promoting Bayesian inversion. Inverse Problems, 28(2):025005, 2012.
[33] Shiwei Lan. Adaptive dimension reduction to accelerate infinite-dimensional geometric Markov chain Monte Carlo. Journal of Computational Physics, 392:71–95, 2019.
[34] Matti Lassas, Eero Saksman, and Samuli Siltanen. Discretization-invariant Bayesian inversion and Besov space priors. Inverse Problems and Imaging, 3(1):87–122, 2009.
[35] Olivier Le Maître and Omar M Knio. Spectral Methods for Uncertainty Quantification: With Applications to Computational Fluid Dynamics. Springer Science & Business Media, 2010.
[36] Michel Ledoux. Logarithmic Sobolev inequalities for unbounded spin systems revisited. In Séminaire de Probabilités XXXV, pages 167–194. Springer, 2001.
[37] Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
[38] Jun S Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2001.
[39] Jun S Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032–1044, 1998.
[40] Youssef M Marzouk and Habib N Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862–1902, 2009.
[41] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[42] Anthony Nouy. Low-Rank Tensor Methods for Model Order Reduction, pages 857–882. Springer International Publishing, Cham, 2017.
[43] Mario Teixeira Parente, Jonas Wallin, Barbara Wohlmuth, et al. Generalized bounds for active subspaces. Electronic Journal of Statistics, 14(1):917–943, 2020.
[44] Anthony T Patera, Gianluigi Rozza, et al. Reduced basis approximation and a posteriori error estimation for parametrized partial differential equations, 2007.
[45] Noemi Petra, James Martin, Georg Stadler, and Omar Ghattas. A computational framework for infinite-dimensional Bayesian inverse problems, part II: stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, 36(4):A1525–A1555, 2014.
[46] Allan Pinkus. Ridge Functions, volume 205. Cambridge University Press, 2015.
[47] G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7:110–120, 1997.
[48] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
[49] Olivier Roustant, Franck Barthe, Bertrand Iooss, et al. Poincaré inequalities on intervals: application to sensitivity analysis. Electronic Journal of Statistics, 11(2):3081–3119, 2017.
[50] Alessio Spantini, Antti Solonen, Tiangang Cui, James Martin, Luis Tenorio, and Youssef Marzouk. Optimal low-rank approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 37(6):A2451–A2487, 2015.
[51] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
[52] Albert Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation, volume 89. SIAM, 2005.
[53] Olivier Zahm, Paul G Constantine, Clementine Prieur, and Youssef M Marzouk. Gradient-based dimension reduction of multivariate vector-valued functions. SIAM Journal on Scientific Computing, 42(1):A534–A558, 2020.
[54] Olivier Zahm, Tiangang Cui, Kody Law, Alessio Spantini, and Youssef Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712.