A unified performance analysis of likelihood-informed subspace methods
AA unified performance analysis oflikelihood-informed subspace methods
TIANGANG CUI , XIN TONG Monash University, School of Mathematics E-mail: [email protected] National University of Singapore, Department of Mathematics E-mail: [email protected]
The likelihood-informed subspace (LIS) method [e.g., Cui et al.,
Inverse Probl. , 30 (2014), 114015] offers a viableroute to reducing the dimensionality of high-dimensional probability distributions arisen in Bayesian inference.LIS identifies an intrinsic low-dimensional linear subspace where the target distribution differs the most fromsome tractable reference distribution. Such a subspace can be identified using the leading eigenvectors of a Grammatrix of the gradient of the log-likelihood function. Then, the original high-dimensional target distribution isapproximated through various forms of ridge approximations of the likelihood function, in which the approximatedlikelihood only has support on the intrinsic low-dimensional subspace. This approximation enables the design ofinference algorithms that can scale sub-linearly with the apparent dimensionality of the problem. Intuitively, theaccuracy of the approximation, and hence the performance of the inference algorithms, are influenced by threefactors—the dimension truncation error in identifying the subspace, Monte Carlo error in estimating the Grammatrices, and Monte Carlo error in constructing ridge approximations. This work establishes a unified frameworkto analysis each of these three factors and their interplay. Under mild technical assumptions, we establish errorbounds for a range of existing dimension reduction techniques based on the principle of LIS. Our error bounds alsoprovide useful insights into the accuracy comparison of these methods. In addition, we analyze the integration ofLIS with sampling methods such as Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). Wealso demonstrate our analyses on a linear inverse problem with Gaussian prior, which shows that all the estimatescan be dimension-independent if the prior covariance is a trace-class operator. Finally, we demonstrate variousaspects of our theoretical claims on several nonlinear inverse problems.
Keywords:
Dimension reduction; Approximation error; Likelihood-informed subspace; Monte Carlo estimation
1. Introduction
Many applications in science and engineering have to contend with expensive or intractable models thatare typically driven by high-dimensional or even infinite-dimensional random variables. For example,seismic imaging [15, 39], subsurface energy [20], glaciology [46], groundwater [26, 31], electricalimpedance tomography [32], density estimation [43], etc. Denoting the high-dimensional random vari-ables of interest by x ∈ X ⊆ R d , the associated target probability density often takes the form π ( x ) = 1 Z µ ( x ) f ( x ) , Z = (cid:90) µ ( x ) f ( x ) dx, (1)where we refer to Z , µ ( x ) , and f ( x ) as the normalization constant , the reference density and the likeli-hood function, respectively. In the most common scenario, the target density is the posterior defined byBayes’ rule, the reference density is the prior, and the likelihood function is often denoted by f ( x ; y ) for some observed data y . Here we drop the dependency of f on y for brevity unless otherwise required.In most of the aforementioned applications, the reference density µ ( x ) takes a simple form, e.g. aGaussian density or an elliptical density, so that the reference distribution, its marginal distributions,and its conditional distributions can be directly evaluated and sampled from. However, the likelihood1 a r X i v : . [ s t a t . C O ] J a n function f , which often encodes some highly nonlinear parameter-to-observable maps that representthe underlying systems, may introduce complicated nonlinear interactions among parameters. Whenthe parameter is also high-dimensional, generating samples from the target distribution using classicalmethods such as Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) can be acomputationally challenging task. The computational effort required for generating each independentsample from π ( x ) may scale super-linearly with the ambient parameter dimension d .In many high-dimensional problems, there often exists a low dimensional “effective" or “intrinsic"dimension. Designing scalable sampling methods that can utilize such property has been a focus inthe recent literature [1, 2, 3, 6, 7, 8, 21, 42, 47, 48, 54]. One effective strategy involves finding aparameter subspace X r with dimensionality d r (cid:28) d , so that the original density with high ambientparameter dimensions can be approximated by some low-dimensional parametrization. The recentlydeveloped likelihood informed subspace (LIS) method [17, 23, 57] offers a way to identify X r forhigh-dimensional target densities and approximates the target density via projections of the likelihoodfunction onto X r . For sampling related problems, such projections naturally implement subsequentMCMC and SMC computations on the reduced subspace X r . As a result, this may significantly lowerthe computation effort compared with implementations directly targeting the ambient space X . In thiswork, we focus on the analysis of the approximation accuracy of LIS method and its related samplingalgorithms. Dimension reduction techniques have been exploited to reduce the computational cost due to the pa-rameter dimension. When the target density π ( x ) has a known covariance matrix Σ , a common ap-proach is to carry the principle component analysis (PCA) or Karhunen–Loéve decomposition [33, 38]that identifies the leading eigenvectors of Σ to define the subspace X r . Then, the parameters in the com-plement subspace of X r are truncated from the inference. 
Other than the computational difficulties ofestimating the covariance matrix for high-dimensional non-Gaussian target densities, this approach isproven to be suboptimal even for problems with Gaussian reference densities and Gaussian likelihoodfunctions [50].Without truncating parameters from the inference procedure, LIS exploits an alternative way toapproximate target densities. The intuition underpinning the development of LIS is that the likelihoodfunction f ( x r , x ⊥ ) is often effectively supported on a low-dimensional subspace X r with dimension d r (cid:28) d . For a given subspace X r with dimension d r , we denote its complement subspace by X ⊥ anddefine projection operators P r and P ⊥ such that range( P r ) = X r and range( P ⊥ ) = X ⊥ . This way, aparameter x can be decomposed as x = x r + x ⊥ , x r = P r x ∈ X r , x ⊥ = P ⊥ x ∈ X ⊥ . (2)To avoid redundant notations, for a density ν ( x r , x ⊥ ) on R d , we use ν ( x r ) to denote its marginal on X r . This way, the target density can be decomposed as π ( x ) ≡ π ( x r , x ⊥ ) = π ( x r ) π ( x ⊥ | x r ) , where the marginal density and the conditional density take the form π ( x r ) = 1 Z µ ( x r ) (cid:90) f ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ , π ( x ⊥ | x r ) = f ( x r , x ⊥ ) µ ( x ⊥ | x r ) (cid:82) f ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ . (3) ccuracy of LIS methods f is effectively supported on X r , the above decom-position suggests that µ ( x ⊥ | x r ) can be a good approximation of π ( x ⊥ | x r ) . Thus, one can identify thesubspace X r and construct a suitable reduced dimensional target density π r ( x r ) to approximate π ( x r ) and obtain an approximation to the target π r ( x r , x ⊥ ) ∝ π r ( x r ) µ ( x ⊥ | x r ) . (4)The approximate target distribution can be efficiently sampled using a two-step strategy—one can firstapply MCMC or SMC to generate samples from the reduced dimensional target density π r ( x r ) , andthen draw independent samples from the conditional reference density µ ( x ⊥ | x r ) .The identification of the subspace X r is the key in constructing approximate densities in the from of(4). Several methods based on the derivative information of the likelihood function have been devel-oped for this purpose. Some examples include the use of Fisher information matrix [23, 22], Hessianmatrix of log f [16, 39], and gradient of log f [17, 57]. Here we focus on the analysis of the gradi-ent based techniques. Note that the gradient of the logarithm of the likelihood, ∇ log f ( x ) , indicates alocal direction at x in which the likelihood changes most rapidly, and the Gram matrix of ∇ log f ( x ) averaging over all possible outcomes of x can measure possible variations of the likelihood function.Depending on the choice of the distribution assigned to x , different Gram matrices have been consid-ered: H := (cid:90) ∇ log f ( x ) ∇ log f ( x ) (cid:62) µ ( x ) dx,H := (cid:90) ∇ log f ( x ) ∇ log f ( x ) (cid:62) π ( x ) dx. (5)When the gradient Gram matrix H k , k = { , } , is presented, the subspace spanned by eigenvectorsof the largest eigenvalues of H k preserves most of the variations of ∇ log f ( x ) . Thus, the first d r eigenvectors (from hereinafter referred to as the leading eigenvectors) of the gradient Gram matrix canbe used to construct the subspace X r .Both H and H can be numerically estimated using Monte Carlo integration. Independent samplesfrom the reference density µ ( x ) can be drawn to estimate H in a fairly easy way. 
In comparison,estimation of H is less trivial as samples from π ( x ) are needed. One may apply importance sampling H = 1 Z (cid:90) ∇ log f ( x ) ∇ log f ( x ) (cid:62) f ( x ) µ ( x ) dx, so that samples from µ ( x ) weighed by the likelihood function can be used to estimate H . However,the likelihood f ( x ) may concentrate in a small region for problems with informative data, and thusthe above importance sampling formula may suffer from low effective sample size. This way, adaptiveMCMC sampling or SMC sampling has to be used to estimate H . At the first glance, it appears thatthe matrix H is not an effective way to identify the subspace X r . However, our analysis provides a jus-tification of the superiority of H over H for the subspace identification in terms of the approximationaccuracy. Given a subspace X r , here we discuss three methods for building reduced dimensional target density.A natural choice is to use the marginal posterior in (3). Definition 1.1 (Marginal likelihood) . By marginalizing the likelihood function over the complementsubspace X ⊥ , one has f r ( x r ) := (cid:90) f ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ = E µ ( f ( x ) | x r ) . (6) This yields the reduced dimensional target density π f ( x r ) = Z f r ( x r ) µ ( x r ) and the approximate tar-get density π f ( x r , x ⊥ ) = π f ( x r ) µ ( x ⊥ | x r ) . Note that π f shares the same normalizing constant Z with the full target π . Closely related to themarginal likelihood approximation, we also consider the following approximation based on marginal-izing the square root of the likelihood and the logarithm of the likelihood. Definition 1.2 (Radical likelihood) . Defining the square root of the likelihood by g ( x ) := (cid:112) f ( x ) , themarginal function g r ( x r ) = E µ ( g ( x ) | x r ) defines the reduced dimensional target density π g ( x r ) = 1 Z g g r ( x r ) µ ( x r ) , Z g = (cid:90) g r ( x r ) µ ( x r ) dx r , (7) and the approximate target density π g ( x r , x ⊥ ) = π g ( x r ) µ ( x ⊥ | x r ) . Definition 1.3 (Log-likelihood) . Defining the logarithm of the likelihood by l ( x ) := log f ( x ) , themarginal function l r ( x r ) = E µ ( l ( x ) | x r ) defines the reduced dimensional target density π l ( x r , x ⊥ ) = 1 Z l exp( l r ( x r )) µ ( x r ) µ ( x ⊥ | x r ) , Z l = (cid:90) exp( l r ( x r )) µ ( x r ) dx r , (8) and the approximate target density π l ( x r , x ⊥ ) = π l ( x r ) µ ( x ⊥ | x r ) . Note that the combination of π l and the subspace defined by H is also known as the active subspace method [17] in the literature. To provide a unified discussion, here we view it as one specific scenarioof the LIS. While using π g and π l may seem less natural than using π f , they offer different theoreticaland computational aspects in later part of this paper compared with π f . Drawing independent andidentically distributed (i.i.d.) samples x i ⊥ , i = 1 , . . . , M π from µ ( x ⊥ | x r ) , the marginalization in all ofthe approximate likelihood functions f r , g r , and l r can be computed by Monte Carlo integration f M ( x r ) := 1 M π M π (cid:88) i =1 f ( x r , x i ⊥ ) , g M ( x r ) := 1 M π M π (cid:88) i =1 g ( x r , x i ⊥ ) , l M ( x r ) := 1 M π M π (cid:88) i =1 l ( x r , x i ⊥ ) . (9)Then, we denote the corresponding reduced dimensional target densities by π Mf ( x r ) ∝ f M ( x r ) µ ( x r ) , π Mg ( x r ) ∝ g M ( x r ) µ ( x r ) , π Ml ( x r ) ∝ exp( l M ( x r )) µ ( x r ) , respectively, and the corresponding approximate target densities in a similar way. 
Despite the overwhelming success of LIS in many applications, the use of the approximate target den-sities naturally introduces errors compared with solutions obtained from the full target densities. There ccuracy of LIS methods π r ( x ) , r = { f, g, l } . In Section 2, we derive error bounds on the difference between π r and π , quantified through either estimation error of some test function or various statisticaldivergences. The highlight is that all these errors can all be bounded by the spectrum of H or H .So if we have the oracle (true) values of H or H , we can find the optimal projection subspacewith performance guarantees. From the results, we will also observe that approximation error withsubspace estimated from H tends to be smaller than the one from H , and it is independent ofthe normalizing constant. In subspace estimation, this leads to a tradeoff between H and H : theformer is easier to estimate while the latter tends to have better approximation accuracy.2. Monte Carlo errors of π Mr ( x ) , r = { f, g, l } . In most practical cases, the approximate target π r ( x ) needs to be replaced by the Monte Carlo averaging version π Mr ( x ) using references samples fromthe complement subspace X ⊥ . In Section 3, we show that the Monte Carlo averaging incurs anaddition error that is about O (1 / √ M π ) times as large as the error of π r ( x ) . Therefore, M π doesnot need to be very large in practical settings.3. Monte Carlo errors in estimating H and H . The Gram matrices H and H have to be approx-imated by their Monte Carlo estimates (cid:98) H and (cid:98) H , respectively. The resulting sample averagedsubspace (cid:98) X r may lead to additional approximation errors. In Section 4, we establish bounds on theerror of the approximate target using (cid:98) X r instead of using the oracle subspace X r . These bounds onlydepend on the dimension of (cid:98) X r and the variances of H and H . Importantly, our bounds do notrely on eigenvalue gaps, which is a typical assumption used in dimension reduction but may havelimited practical applicability.4. Efficiency of LIS accelerated sampling. Using the approximate target in (4), we can implementMCMC on the low-dimensional π r ( x r ) and extend the low dimensional samples by adding sam-ples from the conditional reference density µ ( x ⊥ | x r ) . Section 5.1 studies the efficiency of thisalgorithm, and Proposition 5.1 shows the overall efficiency is mostly determined by the MCMC onthe low-dimensional X r . Section 5.2 discusses the connection between SMC and the LIS method,in particular how to use it to simplify the estimation of H .5. Dimension independence. LIS methods are mostly used in high dimensional problems, and hence itis important for the quantitative bounds to be dimension independent. We illustrate this is indeed thecase in Section 6 using a linear inverse problem. It also serves a concrete example to demonstratethe efficacy of our theoretical results.In addition, we also provide several numerical examples on nonlinear inverse problems to further verifyour results in Section 7. To keep the discussion focused, we allocate most of the technical proofs to theappendix.It is worth to acknowledge some of the related works to address some of the aforementioned ques-tions. In [17], Problems 1–3 are investigated in the context of active subspace method, i.e. the combi-nation of π l with X r estimated from H . An analysis similar to that of [17] has also been developed tofunction approximation problems with H in [45]. 
The work of [57] investigated Problems 1–3 for π f with X r estimated from H using Kullback–Leibler (KL) divergence. Our new results cover generalprojection densities π r , r = { f, g, l } , and the tradeoff between H and H . Moreover, our analysis forProblem 3 does not require the eigenvalue gap condition, which is assumed in [17, 57] and not easilyfulfilled in applications. We further acknowledge that the independence of eigenvalue gap in low-rankmatrix approximation has also been studied in [27], but not for the purpose of sample averaged sub-space estimation. Beyond Problem 1–3, our analysis also enables us to investigate Problems 4 and 5,which have practical significance but have not been previously addressed.
2. Accuracy of approximate target densities
Our starting point is to establish bounds on the errors of approximate densities π r ( x ) , r = { f, g, l } in Section 1.2. We consider two forms to quantify the approximation errors. The first way is throughthe estimation error. Suppose the goal is to estimate E π [ h ] for some function of interest h . Using theapproximation π r , we obtain an approximate estimate E π r [ h ] . The associated estimation error becomes E h ( π, π r ) := | E π [ h ] − E π r [ h ] | . (10)The second way is via statistical divergences, which are also known as f -divergences. Some popularchoices include the (squared) Hellinger distance D H ( π, ν ) = 12 (cid:90) (cid:32)(cid:115) π ( x ) λ ( x ) − (cid:115) ν ( x ) λ ( x ) (cid:33) λ ( x ) dx. where λ is a reference measure such as the Lebsegue; and the KL divergence D KL ( π, ν ) = (cid:90) log π ( x ) ν ( x ) π ( x ) dx. We present in Lemma A.1 a few of results regarding the relationship between these divergences andtheir connections with the estimation error E h . Various error forms can be useful for applying di-mension reduction in different inference tasks, as each inference task often has its “preferred” wayto quantify the error. For example, the optimization problems in transport maps [9, 40, 49] and Steinvariational methods [25, 37] are formulated by KL divergence, tensor train [19] and other approxima-tion methods, e.g., [36], gives bounds in Hellinger distance, and the min-max formulation in densityestimation methods such as [52, 53, 55] rely on the estimation error in (10). Unless otherwise specified,we only consider the estimations error and statistical divergences of the full-dimensional approximatetarget densities π r ( x ) , r = { f, g, l } rather than their reduced dimensional counterparts π r ( x r ) in therest of this paper.For different combinations of the approximate target densities, π r ( x ) , r = { f, g, l } , and subspaceconstruction methods, H k , k = { , } , our first main result discusses the a priori estimate of either E h ( π, π r ) or D ( · ) ( π, π r ) using the approximation subspace X r and spectral information of H k . Re-call that LIS uses the leading d r eigenvectors of H k to construct the subspace X r . So intuitively, theapproximation error is related to the truncated or residual eigenvalues of H k . The sum of the residualeigenvalues is given by R ( X r , H k ) := tr( P ⊥ H k P ⊥ ) , (11)where P ⊥ is the projector defined in (2). Note that (11) is well defined with any linear subspace X r andcomputable for a given H k , whereas most of statistical divergences do not have closed-form formulas.To build a connection between approximation errors of π r ( x ) and the residual function R ( X r , H k ) ,we assume the reference density µ ( x ) is “nice" with the subspace X r in the following sense: Assumption 2.1.
For any d r dimensional subspace X r and the parameter decomposition x = x r + x ⊥ defined in (2) , we decompose the reference density as µ ( x ) = µ ( x r ) µ ( x ⊥ | x r ) ccuracy of LIS methods Then µ ( x ⊥ | x r ) satisfies a κ -Poincaré inequality (PI) for all x r and any C function h : var µ ( x ⊥ | x r ) ( h ) ≤ κ (cid:90) (cid:107)∇ h ( x r , x ⊥ ) (cid:107) µ ( x ⊥ | x r ) dx ⊥ . Assumption 2.1 asserts a Poincaré-type inequality that is modified for our subspace approximations.In probability theory, it is well known that Poincaré-type inequalities hold for any log-strongly-concavedensity µ ( x ) . We refer the readers to [10] for a summary and its connection to other inequalities suchas the Brascamp–Lieb’s inequality [14] and the logarithmic Sobolev inequality [12, 28, 35, 44]. In thefollowing proposition, we provide a concrete example of Assumption 2.1 for the case that µ ( x ) is aslight perturbation from a log-strongly-concave density. Proposition 2.2.
Suppose µ ( x ) ∝ exp( − V ( x ) − U ( x )) and there are constants c, B > such that• For any x , the minimal eigenvalue of Hessian ∇ V ( x ) is larger than c ;• The variation in U is bounded in the sense that exp(sup x U ( x ) − inf x U ( x )) ≤ B ;Then Assumption 2.1 holds with κ = B /c . Proof.
See Appendix B.1.Under Assumption 2.1 we will show that E h ( π, π r ) and D ( · ) ( π, π r ) can be upper bounded by cer-tain fraction power of R ( X r , H k ) . We summarize the results in Table 1. The first row indicates thatif the Gram matrix H and the approximation methods π f are used, then E h ( π, π f ) is bounded by O ( (cid:112) R ( X r , H )) . The same applies to other entries in the table. Note that we have parenthesis around √ D KL for H and π f , since this scenario has been analysed in [57] under a similar assumption.Therefore we do not discuss bounds for D KL ( π, π f ) and focus on other bounds that have yet to beanalysed. approximation method marginalization approximation errors upper bounds H and π f f E h , D H , ( (cid:112) D KL ) O( R ( X r , H ) ) H and π g g = √ f E h , D H O( R ( X r , H ) ) H and π l l = log f E h , D H , (cid:112) D KL O( R ( X r , H ) ) Table 1.
A summary of approximation error bounds. The second column indicates the functions marginalized bythe approximate target densities.
We first consider the approximations ( H , π f ) and ( H , π g ) as described in Definitions 1.1 and 1.2,respectively. Proposition 2.3.
For a given subspace X r , the expected conditional variance of the radical likelihoodfunction g = √ f provides the following upper bounds on the squared Hellinger distance between π and π f and that between π and π g :1) D H ( π, π f ) ≤ Z (cid:90) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) µ ( x r ) dx r . D H ( π, π g ) ≤ Z (cid:90) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) µ ( x r ) dx r .In addition, the normalizing constants Z and Z g satisfies Z ≥ Z g . Proof.
See Appendix B.2.
Theorem 2.4.
Suppose the reference density µ satisfies Assumption 2.1, we have the following1) The Hellinger distance between π and π f is bounded by D H ( π, π f ) ≤ (cid:112) κ R ( X r , H ) . (12)
2) The estimation error with any L integrable function h is given by E h ( π, π f ) ≤ (cid:113) κ ( E π [ h ] + E π f [ h ]) R ( X r , H ) .
3) The above two claims also hold for the approximation π g . Proof.
Claim 1).
Recall the projector P r such that range( P r ) = X r . By PI of µ ( x ⊥ | x r ) , the expectedconditional variance of g satisfies var µ ( x ⊥ | x r ) (cid:2) g (cid:3) ≤ κ (cid:90) (cid:107) ( I − P r ) ∇ g ( x ) (cid:107) µ ( x ⊥ | x r ) dx ⊥ . Applying Proposition 2.3, we have D H ( π, π f ) ≤ κZ (cid:90) (cid:16) (cid:90) (cid:107) ( I − P r ) ∇ g ( x ) (cid:107) µ ( x ⊥ | x r ) dx ⊥ (cid:17) µ ( x r ) dx r = κZ (cid:90) (cid:107) ( I − P r ) ∇ g ( x ) (cid:107) µ ( x ) dx = κZ (cid:90) (cid:107) P ⊥ ∇ log g ( x ) (cid:107) g ( x ) µ ( x ) dx = κ (cid:90) (cid:107) P ⊥ ∇ log f ( x ) (cid:107) π ( x ) dx. Since (cid:107) P ⊥ ∇ log f ( x ) (cid:107) = P ⊥ ∇ log f ( x ) ∇ log f ( x ) (cid:62) P ⊥ , the result follows from (cid:90) (cid:107) P ⊥ ∇ log f ( x ) (cid:107) π ( x ) dx = R ( X r , H ) . Claim 2).
This result follows from Lemma A.1 and claim 1).
Claim 3).
The same proofs of claims 1) and 2) can be applied.Although the result of Theorem 2.4 claim 1) can also be obtained from Lemma A.1 claim 2) andCorollary 1 of [57] (which uses the logarithmic Sobolev inequality), our proof offers additional insightsinto the subspace construction. Proposition 2.3 connects the error of approximate target densities withthe κ -Poincaré inequality via the expected conditional variance. This may also lead to new subspaceconstruction techniques beyond the gradient based methods. ccuracy of LIS methods Remark 2.5.
Recalling the definitions of H k , we have H ≤ Z sup x f H . Thus, we have a directcorollary of Theorem 2.4 claims 1) and 3): D H ( π, π f ) ≤ (cid:114) κ sup x fZ R ( X r , H ) , D H ( π, π g ) ≤ (cid:114) κ sup x fZ R ( X r , H ) . (13) Similar bounds for the L distance between log π l and log π assuming sup x f = 1 can be found inTheorem 3.1 [17] with a more complicated pre-constant. In problems where f concentrates in a smallregion, the associated normalizing constant Z can be small. This way, the constant on the right handside of (13) can have a large value. In contrast, the only constant in (12) is κ , which is of value . withthe reference density is the standard Gaussian distribution. This partially explains that using H canbe suboptimal. Following this observation, we predict that the reduced subspace from H will performbetter than the one from H , especially when the likelihood function has a concentrated support. Thiswill be verified in our numerical examples. Comparing with π g and π f , marginalizing the log-likelihood has performance guarantee only withthe matrix H . Although H is usually easier to estimate in practice, the error bounds for π l are ingeneral weaker than those of π f and π g —they depend on additional constants that can take largevalues and the scaling with R is of power . Theorem 2.6.
Suppose the reference density µ satisfies Assumption 2.1, we have the following1) The error in KL-divergence is bounded by D KL ( π, π l ) ≤ √ κ (cid:107) f (cid:107) ,µ Z (cid:112) R ( X r , H ) , (cid:107) f (cid:107) ,µ := (cid:115)(cid:90) f ( x ) µ ( x ) dx ≥ Z. This also leads to an upper bound in Hellinger distance, since D H ( π, π l ) ≤ (cid:113) D KL ( π, π l ) .2) The estimation error is bounded by | E π [ h ] − E π l [ h ] | ≤ ( E π [ h ] + E π l [ h ]) (cid:114) (cid:107) f (cid:107) ,µ Z ( κ R ( X r , H )) . Proof.
See Appendix B.3.
3. Monte Carlo error of approximate target densities
The marginalization in the reduced dimensional likelihood approximations (cf. Definitions 1.1–1.3)often needs to be computed by Monte Carlo integration. Suppose i.i.d. samples x i ⊥ , i = 1 , . . . , M π , from µ ( x ⊥ | x r ) are used for constructing the sample averaged approximations, π Mr ( x ) , r = { f, g, l } .Denoting the average over all possible outcomes for x i ⊥ , i = 1 , . . . , M π , as E M , the following theoremsreveals the accuracy of π Mr . Theorem 3.1.
Under Assumption 2.1, the following bounds hold1) The expected Hellinger distance between π Mg and π g satisfies E M (cid:104) D H ( π Mg , π g ) (cid:105) ≤ √ κZ (cid:112) Z g M π (cid:112) R ( X r , H ) .
2) Given the conditional likelihood f ( x ⊥ | x r ) := f ( x ⊥ ,x r ) f r ( x r ) and C f = sup x r sup x ⊥ f ( x ⊥ | x r ) , then theexpected Hellinger distance between π Mf and π f satisfies E M (cid:104) D H ( π Mf , π f ) (cid:105) ≤ (cid:112) κC f √ M π (cid:112) R ( X r , H ) . Proof.
See Appendix C.1.Since the Hellinger distance enjoys the triangle inequality, so we have E M (cid:104) D H ( π Mr , π ) (cid:105) ≤ E M (cid:104) D H ( π Mr , π r ) (cid:105) + D H ( π r , π ) , r = { f, g } . This way, claims 1) and 2) and Theorem 2.4 reveals that the Monte Carlo averaging used by π Mr ( x ) , r = { f, g } incurs an addition error that is about O (1 / √ M π ) as large as the error of π r ( x ) , r = { f, g } . Incontrast, the KL-divergence does not satisfies the triangle inequality, we directly establish the boundon E M (cid:2) D KL ( π, π Ml ) (cid:3) as follows. Theorem 3.2.
Under Assumption 2.1, the expected L error of the marginalized log-likelihood isbounded by E M (cid:34)(cid:18)(cid:90) ( l M ( x r ) − l r ( x r )) µ ( x r ) dx r (cid:19) (cid:35) ≤ √ κ √ M π (cid:112) R ( X r , H ) . The expected KL-divergence of π from the approximation π Ml is bounded by E M (cid:104) D KL ( π, π M(cid:96) ) (cid:105) ≤ √ κ (cid:107) f (cid:107) ,µ Z (cid:18) √ M π (cid:19) (cid:112) R ( X r , H ) . Proof.
See Appendix C.2.Note that Theorem 3.1 claim 2) needs an additional assumption on the supremum of f ( x ⊥ | x r ) , whileclaim 1) does not, showing the analytical advantage of π g . Meanwhile, the requirement that f ( x ⊥ | x r ) is bounded is not restrictive in practice, since the conditional likelihood is expected to be flat in thecomplement subspace of X r . Another important remark is that Theorems 3.1 and 3.2 reveals that thesample size M π does not have to be large in practice, as the error D ( · ) ( π Mr , π ) is dominated by theprojection residual R ( X r , H ) and does not decay by using a large M π .
4. Sample-based Gram matrix estimation
In Sections 2 and 3, we have shown that the approximation errors are bounded by R ( X r , H k ) , k = { , } . In most of the scenarios, the gradient Gram matrix H k has to be obtained through Monte Carlointegration. Therefore, it is of practical interest to ask how does the sampling error of H k affects theoverall approximation error. Here we provide rigorous estimates of the upper bounds of the samplingerror in H k . ccuracy of LIS methods x i , i = 1 , . . . , M ν , from a density ν , then the Monte Carloestimators of H and H are given by (cid:98) H = 1 M ν M ν (cid:88) i =1 ∇ log f ( x i ) ∇ log f ( x i ) (cid:62) µ ( x i ) ν ( x i ) , (cid:98) H = 1 M ν M ν (cid:88) i =1 ∇ log f ( x i ) ∇ log f ( x i ) (cid:62) π ( x i ) ν ( x i ) . (14)For simplicity, we use E ν to denote the averaging over all sampling outcome of x i , i = 1 , . . . , M ν .For example, we have E ν [ (cid:98) H k ] = H k . We also define the one-sample variance of the matrix estimatorsunder the Frobenious norm (cid:107) · (cid:107) F byV ( H , ν ) := d (cid:88) i,j =1 var X ∼ ν (cid:20) ∂ i log f ( X ) ∂ j log f ( X ) µ ( X ) ν ( X ) (cid:21) = M ν E ν (cid:107) (cid:98) H − H (cid:107) F , (15)V ( H , ν ) := d (cid:88) i,j =1 var X ∼ ν (cid:20) ∂ i log f ( X ) ∂ j log f ( X ) π ( X ) ν ( X ) (cid:21) = M ν E ν (cid:107) (cid:98) H − H (cid:107) F . (16)Recall that in the oracle LIS procedure, the reduced subspace X r is obtained as the d r dimensionalleading eigensubspace of H k . The associated residual is given by R ( X r , H k ) = (cid:80) di = d r +1 λ i ( H k ) .Practically, we can only obtain the leading eigensubspace (cid:98) X r generated by (cid:98) H k . The associated approx-imation residual is R ( (cid:98) X r , H k ) (Note the oracle H k instead of (cid:98) H k is used here, because it governs theLIS approximation error). Then, it is natural to compare the residual R ( (cid:98) X r , H k ) obtained by MonteCarlo integration with that of the oracle H k . In addition, we have no access to (cid:80) di = d r +1 λ i ( H k ) but only (cid:80) di = d r +1 λ i ( (cid:98) H k ) in practice. Thus, it is also of interest to estimate the difference between R ( (cid:98) X r , H k ) and (cid:80) di = d r +1 λ i ( (cid:98) H k ) , where the latter can be used as a computable error estimate. Thefollowing variation of the Davis-Kahan Theorem [56] is useful to address these questions. Lemma 4.1.
Let Σ and (cid:98) Σ be two positive semidefinite matrices. Let (cid:98) X r be the r -dimensional leadingeigensubspace of (cid:98) Σ and (cid:98) P ⊥ be the orthogonal projection to its complementary subspace. Then thefollowing hold1) R ( (cid:98) X r , Σ) = (cid:98) P ⊥ Σ (cid:98) P ⊥ ≤ (cid:80) ni = r +1 λ i (Σ) + 2 √ r (cid:107) (cid:98) Σ − Σ (cid:107) F .2) R ( (cid:98) X r , Σ) = (cid:98) P ⊥ Σ (cid:98) P ⊥ ≤ (cid:80) ni = r +1 λ i ( (cid:98) Σ) + √ r (cid:107) (cid:98) Σ − Σ (cid:107) F + tr(Σ − (cid:98) Σ) . Proof.
See Appendix D.1. A unique feature of these bounds is that they do not depend on eigen-value gaps, which are usually necessary for finding the subspace correctly. Further implications will bediscussed in Remark 4.4.
Theorem 4.2.
Under Assumption 2.1, suppose (cid:98) H k , k = { , } is computed by (14) and the computedsubspace (cid:98) X r is spanned by the d r leading eigenvectors of (cid:98) H k and the oracle subspace X r is spannedby the d r leading eigenvectors of oracle H k . Then the following bounds hold, where the expectationtakes average over all random outcomes of (cid:98) H k .
1) In the oracle setting: E ν (cid:104) R ( (cid:98) X r , H k ) (cid:105) − R ( X r , H k ) ≤ (cid:112) d r V ( H k , ν ) √ M ν .
2) In the practical setting: E ν (cid:104) R ( (cid:98) X r , H k ) − (cid:80) di = d r +1 λ i ( (cid:98) H k ) (cid:105) ≤ (cid:112) d r V ( H k , ν ) √ M ν . Proof.
Using Lemma 4.1 claim 1) and the identify E ν [ (cid:107) H k − (cid:98) H k (cid:107) F ] ≤ (cid:113) E ν [ (cid:107) H k − (cid:98) H k (cid:107) F ] = (cid:112) d r V ( H k , ν ) √ M ν , claim 1) directly follows. Using the fact that E ν [ (cid:98) H k ] = H k , E ν (cid:104) tr( (cid:98) H k − H k ) (cid:105) = 0 , claim 2) followsfrom Lemma 4.1 claim 2).Theorem 4.2 claim 1) shows that the difference between the expected approximation residual usingthe sample averages defined in (14) and the oracle approximation residual is of the order of / √ M ν ,where the prefactor is controlled by the variance V ( H k , ν ) and dimension of subspace d r . This revealsthat, with increasing M ν , the approximation accuracy of the subspace given by the sample averagesbecomes closer to that of the oracle subspace. Theorem 4.2 claim 2) shows that the sum of resid-ual eigenvalues provides a reliable estimate of the approximation residual in expectation, where thereliability is controlled by the sample size M ν , the variance V ( H k , ν ) and dimension of subspace d r .In the following corollary, we combine Theorem 4.2 with the results in Section 2 to address a prac-tical question: given the Monte Carlo estimator (cid:98) H k , k = { , } , how to quantify the associate LIS ap-proximation error for estimating E π [ h ] . Similar upper bounds for the statistical divergences discussedin Section 2 can also be established. We do not present them for the sake of conciseness. Corollary 4.3.
For all bounded test function h , the estimation errors satisfy the following bounds:1) Given (cid:98) X r obtained from (cid:98) H the resulting ˆ π r , r = { f, g } , satisfies E ν [ E h ( π, ˆ π r )] ≤ κ (cid:32) E π [ h ] + E ν (cid:2) E ˆ π r [ h ] (cid:3) (cid:33) (cid:32) R ( X r , H ) + 2 (cid:112) d r V ( H , ν ) √ M ν (cid:33) , E ν [ E h ( π, ˆ π r )] ≤ κ (cid:32) E π [ h ] + E ν (cid:2) E ˆ π r [ h ] (cid:3) (cid:33) (cid:32) E ν (cid:20) d (cid:88) i = d r +1 λ i ( (cid:98) H ) (cid:21) + (cid:112) d r V ( H , ν ) √ M ν (cid:33) .
2) Given (cid:98) X r obtained from (cid:98) H , the resulting ˆ π l satisfies E ν [ E h ( π, ˆ π l )] ≤ κ (cid:32) (cid:107) f (cid:107) ,µ (cid:0) E π [ h ] + E ν (cid:2) E ˆ π l [ h ] (cid:3)(cid:1) Z (cid:33) (cid:32) R ( X r , H ) + 2 (cid:112) d r V ( H , ν ) √ M ν (cid:33) , E ν [ E h ( π, ˆ π l )] ≤ κ (cid:32) (cid:107) f (cid:107) ,µ (cid:0) E π [ h ] + E ν (cid:2) E ˆ π l [ h ] (cid:3)(cid:1) Z (cid:33) (cid:32) E ν (cid:34) d (cid:88) i = d r +1 λ i ( (cid:98) H ) (cid:35) + (cid:112) d r V ( H , ν ) √ M ν (cid:33) . Proof.
For claim 1), recall that Theorem 2.4 applies to any given subspace, including (cid:98) X r , so we have E ν E h ( π, ˆ π f ) ≤ (cid:113) κ ( E π [ h ] + E ν E ˆ π f [ h ]) E ν R ( (cid:98) X r , H ) , ccuracy of LIS methods E ν R ( (cid:98) X r , H ) in Theorem 4.2 to obtain thecorollary. Claim 2) can be shown in a similar way. Remark 4.4.
It is worth pointing out that our results did not discuss the difference between theestimated subspace (cid:98) X r and the subspace X r obtained from the oracle H k . While finding the differenceis possible using tools like the Davis–Khan theorem, such difference is usually inversely proportional tothe eigenvalue gap, i.e., O (cid:0) λ ( H k ) (cid:14) ( λ d r ( H k ) − λ d r +1 ( H k )) (cid:17) . This quantity can be very large if thematrix H k does not have a significant eigenvalue gap near the truncation dimension d r . For example,if λ m ( H k ) = m − , then ( λ d r ( H k ) − λ d r +1 ( H k )) − = O ( d r ) . For the numerical examples in Section7, we observe that the eigenvalue gap is in the order of − to − for a moderate d r . In otherwords, it is actually impractical to recover the subspace exactly.Fortunately, the accuracy of (cid:98) X r on different directions have very different impact on the resultingapproximate target density π r . Intuitively, the accuracy of π r has little dependence on the d r -th eigen-vector of H k , because it contributes little to the Gram matrix. But having accurate estimations for theseeigenvectors is the most difficult, since their eigenvalues are close to each other. Our analysis avoidsconsidering the accuracy of (cid:98) X r and focuses on the accuracy of π r , since the latter does not need theeigenvalue gap and is the purpose of identifying the subspace. As an importance sampling scheme, the proposal density ν plays an important role for the samplingaccuracy. In particular, the one-sample sampling variance of the Gram matrix can be bounded by thelikelihood ratio between ν and µ , or ν and π as follows. Proposition 4.5.
We have the following upper bounds for the sampling variance of the Gram matrixV ( H , ν ) ≤ E X ∼ ν (cid:20) (cid:107)∇ log f ( X ) (cid:107) µ ( X ) ν ( X ) (cid:21) , V ( H , ν ) ≤ E X ∼ ν (cid:20) (cid:107)∇ log f ( X ) (cid:107) π ( X ) ν ( X ) (cid:21) . Proof.
See Appendix D.2.Proposition 4.5 shows that one should make the ratios, µν and πν , close to one to minimize sam-pling variance for H and H , respectively. For estimating H , we can naturally use the the referencedistribution, which is easy to sample from, as the biasing distribution, i.e., ν = µ . For estimating H ,Proposition 4.5 claim 2) suggests that using the reference distribution may not be a feasible strategy.Consider a scenario where the likelihood function is bounded as f ≤ and the gradient of the log-likelihood is bounded as (cid:107)∇ log f ( x ) (cid:107) ≤ M f . Then the variance V ( H , µ ) is inversely quadratic inthe normalising constant Z , i.e., V ( H , µ ) ≤ M f (cid:14) Z . For a target density concentrating in a small re-gion of the parameter space, the normalising constant Z can takes a small value, and thus the varianceV ( H , µ ) can take a rather large value. This way, alternative strategies such as MCMC and SMC haveto be used to adaptively gather samples from the target distribution for estimating H , while the inter-mediate estimation of H provides approximate target densities that can be used to accelerate MCMCand SMC. Further details are presented in the next section.
5. Integration with MCMC and SMC
In this section, we discuss the integration of MCMC and SMC with the approximate target densitiesdefined by LIS for estimating the Gram matrix H .4 For a given target density q ( x ) , the Metropolis–Hastings (MH) method employs a proposal density p ( x, x (cid:48) ) and an acceptance/rejection step with the acceptance probability β ( x, x (cid:48) ) = 1 ∧ q ( x (cid:48) ) p ( x (cid:48) , x ) q ( x ) p ( x, x (cid:48) ) to construct a Markov chain of random variables with q ( x ) as the invariant density. With the subspaceidentified by the LIS approach, we can apply different strategies to different subspaces to accelerate theconvergence of MCMC. For a given subspace X r , we can formulate an MCMC transition kernel on X r that has one of the reduced dimensional target densities π r ( x r ) , r = { f, g, l } , as the invariant density.Then, combining the transition kernel on X r and the conditional reference distribution µ x ⊥ | x r , we candefine a transition Markov chain transition kernel that has the full target density π ( x ) as the invariantdensity. This procedure is summarized in Algorithm 1. Algorithm 1:
MCMC with LIS proposal
Input: target density π ( x ) , an initial state X = x , a LIS subspace X r , reduced dimensionallikelihood f r , r ∈ { f, g, l } and reduced dimensional target density π r , iteration count t Output: a Markov chain X , . . . , X t for j = 1 , . . . , t do Given the previous state X ( j − = x , decompose it as x = x r + x ⊥ based on the subspacedecomposition R d = X r ⊕ X ⊥ ; Generate a MCMC proposal x (cid:48) r ∼ p ( x r , · ) ; Let x (cid:48) r = x r with rejection probability − β ( x r , x (cid:48) r ) , β ( x r , x (cid:48) r ) = 1 ∧ π r ( x (cid:48) r ) p ( x (cid:48) r ,x r ) π r ( x r ) p ( x r ,x (cid:48) r ) ; Generate a proposal x (cid:48)⊥ ∼ µ ( x ⊥ | x r ) and set x (cid:48) = x (cid:48) r + x (cid:48)⊥ ; Compute the acceptance probability α ( x, x (cid:48) ) = 1 ∧ f ( x (cid:48) ) f r ( x ) f ( x ) f r ( x (cid:48) ) ; With probability α ( x, x (cid:48) ) , accept the complement proposal by setting X k = x (cid:48) , otherwisereject x (cid:48) by setting X j = x .The acceptances/rejections used in lines 4 and 7 of Algorithm 1 is well aligned with the approxi-mation of the target density. Since the reduced dimensional target density π r ( x r ) carries most of theinformation provided by the likelihood function, it may have a complicated structure to explore. How-ever, the rather low dimensionality of π r ( x r ) makes it possible to design efficient MCMC transitionkernels. Note that the product of the reduced dimensional target density and the conditional refer-ence density, π r ( x r ) µ ( x ⊥ | x r ) , defines an approximation to the full target, in which its accuracy hasbeen extensively analyzed in previous sections. This way, in the complement space X ⊥ , we embed µ ( x ⊥ | x r ) , which is an approximation to π ( x ⊥ | x r ) , into another MCMC transition kernel to explorethe full target π ( x ) . Thus, the efficiency of the complement subspace MCMC transition in lines 5–7 ofAlgorithm 1 should strongly depend on its acceptance rate. In the following proposition, we show that π ( x ) is indeed the invariant measure of Algorithm 1. In addition, we also provide a lower bound on thecomplement transition acceptance rate in line 7 of Algorithm 1. Proposition 5.1. π ( x ) is an invariant measure for Algorithm 1. Moreover, the average acceptancerate for x ⊥ part is lower bounded by E (cid:2) α ( X, X (cid:48) ) (cid:3) ≥ − √ D H ( π, π r ) . ccuracy of LIS methods Here X is a random sample from π and X (cid:48) is a proposal generated by Algorithm 1. Proof.
See Appendix E.1.Proposition 5.1 indicates that when running Algorithm 1, the acceptance rate of the MCMC transi-tion in the complement subspace X ⊥ is controlled by the accuracy of the approximate target density.One can anticipate that the acceptance rate in line 7 approaches if the approximation error approaches . In other words, the efficiency of Algorithm 1 depends largely on the efficiency of the MCMC onthe low dimensional X r . To implement Algorithm 1, we need the reduced dimensional likelihood f r , r ∈ { f, g, l } . This in practice can be replaced by the Monte Carlo version f Mr (cf. Definitions1.1–1.3), with accuracy guarantee provided by Theorems 3.1 and 3.2. It worths to mention that theapproximation π Mf ( x r ) provides an unbiased estimate of the marginal target density π ( x r ) . In [22],this is used together with pseudo-marginal technique [4, 5] to design alternative sampling methods.Another key ingredient in Algorithm 1 is the LIS subspace X r , which is obtained by estimatingeither the matrix H or the matrix H . While H is easy to compute, the resulting subspace may haveinferior approximation accuracy compared with that obtained by H . However the estimation of H often requires samples from the target π ( x ) (cf. Section 4). To resolve this dilemma, we consider toadaptively estimate H and the LIS X r within MCMC. The procedure is summarized in Algorithm 2. Algorithm 2:
MCMC with adaptive LIS
Input: target density π ( x ) , number of epochs K , iteration count t , and a truncation index K ∗ Output: a Markov chain X , . . . , X ( K +1) t and a subspace X r Generate X , . . . , X t from µ and compute H (0) = t (cid:80) ti =1 ∇ log f ( X i ) ∇ log f ( X i ) (cid:62) ; Find the leading eigenvectors { v , . . . , v d r } of H (0) to define X r = span { v , . . . , v d r } ; for j = 1 , . . . , K do Run Algorithm 1 with target density π , initial state X jt , and LIS subspace X r for t iterations to generate the Markov chain X jt +1 , . . . X jt + t ; Compute H ( j ) = t (cid:80) ti =1 ∇ log f ( X jt + i ) ∇ log f ( X jt + i ) (cid:62) ; Compute ¯ H = j +1 − min( j,K ∗ ) (cid:80) ji =min( j,K ∗ ) H ( i ) ; Find the leading eigenvectors { v , . . . , v d r } of ¯ H to define X r = span { v , . . . , v d r } ;Our starting point is an initial LIS X that can be estimated using H . Then, we run Algorithm 1using X to generate samples from π . We call this the first epoch. Admittedly, Algorithm 1 in this epochmight not be efficient, since X may not be a good subspace. However, we can re-estimate the matrix H using the samples in the first epoch and an improved LIS X . Then, the updated X is used in thenext epoch to run Algorithm 1. This procedure can be carried out iteratively, where each epoch createsbetter estimate of the gram matrix H and corresponding LIS. The truncation index K ∗ is introducedto discard burn-in samples from initial epochs in estimating H . This adaptive estimation procedureshare a lot of common features with the classical adaptive Metropolis of [29]. However, we often onlyneed an adaptation phase with a finite number of samples, e.g., in the order of , to estimate the LIS X r in many practical scenarios, as the subspace estimation error follows a Monte Carlo convergencerate (cf. Theorem 4.2). For target densities with complicated and multi-modal structures, SMC offers an efficient alternativeto MCMC. Here we present the integration of SMC with LIS. This integration also offers a layered6subspace construction procedure that is naturally embedded within SMC. In our context, SMC utilizesa sequence of densities π k , k = 0 , . . . , K , such that π = µ , π K = π , and each ratio π k +1 /π k has asmall variance. For example, one can obtain such a sequence using the tempering formula π k ( x ) = 1 Z k µ ( x ) f ( x ) β k , Z k = (cid:90) µ ( x ) f ( x ) β k dx, k = 0 , , . . . , K, where β k ≥ is an increasing sequence with β = 0 and β K = 1 . This way, given samples from π k ,one can apply importance sampling to obtain weighted samples from π k +1 and estimate associatedstatistics. Then, these statistics can be used to formulate MCMC transition kernels with the invariantdensity π k +1 to update the weighted samples. This procedure is summarized in Algorithm 3. Algorithm 3:
SMC for LIS proposal
Input: likelihood function f , reference density µ , tempering coefficients β k and iteration count t k for each level k = 1 , . . . , K Output:
Samples X , . . . , X T from π ∝ f µ Generate samples X , . . . , X T from π = µ ; for k = 0 , . . . , K − do Compute the weights W j = f β k +1 − β k ( X j ) for j = 1 , . . . , T ; Let H (cid:48) = T (cid:80) Tj =1 ∇ log f ( X j ) ∇ log f ( X j ) (cid:62) W j ; Find the leading eigenvectors { v , . . . , v d r } of H (cid:48) to define X r = span { v , . . . , v d r } ; for t = 1 , . . . , T do Draw a resampling index J from the categorical distribution with the probability massfunction P [ J = j ] ∝ W j , j = 1 , . . . , T , and set Y = X J ; Run MCMC Algorithm 1 with the invariant density π k ∝ µf β k , initial state Y , LISsubspace X r , iteration count t k , let Y , . . . Y t k be the output ; Update X t = Y t k ;In Algorithm 3, a resampling step is used to transform a weighted particle representation of π k +1 into an equally weighted particle representation, followed by MCMC updates. In practice, the temper-ature can be chosen adaptively. Given the weighting function f β k +1 − β k ( x ) , one can choose the nexttemperature β k +1 such that E (cid:104) f β k +1 − β k ( X ) (cid:105) E (cid:2) f β k +1 − β k ( X ) (cid:3) ≈ (cid:16)(cid:80) ni =1 f β k +1 − β k ( x i ) (cid:17) n (cid:0)(cid:80) ni =1 f β k +1 − β k ( x i ) (cid:1) ≡ ESSn < τ where τ ∈ (0 , is a prefixed threshold.For each temperature β k +1 , we construct the matrix H for the corresponding target density π k +1 using importance sampling with π k as the basing density, and then build MCMC transition kernelsdescribed in Algorithm 1. Denoting the matrix H for the density π k +1 by H ( k +1)1 , its importancesampling estimate takes the form H ( k +1)1 = E π k (cid:34) ∇ log (cid:18) π k +1 ( X ) π k ( X ) (cid:19) ∇ log (cid:18) π k +1 ( X ) π k ( X ) (cid:19) (cid:62) π k +1 ( X ) π k ( X ) (cid:35) = ( β k +1 − β k ) Z k Z k +1 E π k (cid:104) ∇ log f ( X ) ∇ log f ( X ) (cid:62) f ( X ) ( β k +1 − β k ) (cid:105) ccuracy of LIS methods ∝ E π k (cid:104) ∇ log f ( X ) ∇ log f ( X ) (cid:62) f ( X ) ( β k +1 − β k ) (cid:105) . (17)The introduction of the tempering sequence reduces the variance of each H ( k +1)1 compared to thedirect estimation of H from the reference µ . This can be characterized in the following proposition. Proposition 5.2.
Suppose (cid:107)∇ log f ( x ) (cid:107) f ( x ) δ with δ = β k +1 − β k is integrable under µ , then thevariance V k +1 ( H , π k ) = E π k (cid:20) δ (cid:107)∇ log f ( X ) (cid:107) π k +1 ( X ) π k ( X ) (cid:21) for the subspace estimation in SMC is finite. Proof.
See Appendix E.2.As long as (cid:107)∇ log f ( x ) (cid:107) f ( x ) δ is integrable under µ , the variance V k +1 ( H , π k ) is finite. Since δ = β k +1 − β k < , this is a much milder condition than the (cid:107)∇ log f ( x ) (cid:107) f ( x ) integrability re-quirement for the direct importance sampling formula in Proposition 4.5. Consider the scenario usedin Section 4 where the likelihood function is bounded as f ≤ and the gradient of the log-likelihoodis bounded as (cid:107)∇ log f ( x ) (cid:107) ≤ M f . The variance V k +1 ( H , π k ) in SMC is controlled by V k +1 ( H , π k ) ≤ δ M f Z k Z k +1 , which can be much smaller than the upper bound V ( H , µ ) ≤ M f (cid:14) Z of the direct importance sam-pling formula in Proposition 4.5.
6. Dimension independent approximation error for linear inverseproblems
LIS mainly targets high-dimensional problems with intrinsic low-dimensional structures. For problemsyield an infinite-dimensional limit, it is highly desired that the subspace approximation error (e.g., theresult in Corollary 4.3) is independent of the apparent dimension d . While the dimension indepen-dence of sampling methods has been extensively investigated in the literature, see [18] and referencestherein, there has been little investigation on the dimension independence of LIS. Intuitively, for LISto be dimension independent, one needs to have (i) H k , k = { , } , is trace-class for d → ∞ and (ii)the variance V ( H k , ν ) is bounded independently of d to ensure this. Although it remains an openquestion to establish conditions under which these properties hold for general likelihood functions,Bayesian inverse problems with Gaussian priors, Gaussian likelihood functions, and linear parameter-to-observable maps offers insight into the dimension independence property. Here we demonstrate thedimension independence property of these conditions in the Gaussian linear setting.We consider a Bayesian problem with unknown parameter z ∈ R d and the prior p ( z ) ∼ N (0 , Γ) . Alinear parameter-to-observable map is made by y = Gz + ξ, ξ ∼ N (0 , I d y ) . Applying a whitening transformation x = Γ − / z , we obtain x ∼ µ ( x ) = N (0 , I d ) and y = Ax + ξ, A = G Γ / . f ( x ) = exp (cid:18) − (cid:107) Ax − y (cid:107) (cid:19) . First we will establish a series of estimates for the quantities we derived in previous sections.
Proposition 6.1.
Denote C A = A (cid:62) A = G (cid:62) Γ G . The following hold1) The eigenvalues of H are controlled by λ i +1 ( H ) ≤ λ i ( C A ) .
2) The eigenvalues of H are controlled by λ i +1 ( H ) ≤ λ i ( C A ) λ i ( C A ) .
3) The normalizing constant is bounded by (cid:112) det( C A + I ) ≤ Z ≤ (cid:112) det( C A + I ) exp (cid:18) (cid:107) y (cid:107) (cid:19) .
4) The constant (cid:107) f (cid:107) ,µ /Z is bounded by (cid:107) f (cid:107) ,µ Z ≤ det( I + C A ) / exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) .
5) When the reference distribution µ is used for estimating the matrices H k , k = { , } , the variancesare bounded by V ( H , µ ) ≤ (cid:32) d (cid:88) i =1 λ i ( C A ) (cid:33) + (cid:107) A (cid:62) y (cid:107) ,V ( H , µ ) ≤ (cid:113) det( I + C A ) exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) (cid:32) d (cid:88) i =1 λ i ( C A ) λ i ( C A ) (cid:33) + (cid:107) A (cid:62) y (cid:107) .
6) The SMC sampling variance in Proposition 5.2 is bounded by V k +1 ( H , µ ) ≤ δ (cid:113) det( I + δ C A ) exp (cid:16) δ (cid:107) A (cid:62) y (cid:107) (cid:17) (cid:32)(cid:16) d (cid:88) i =1 λ i ( C A ) τ λ i ( C A ) (cid:17) + (cid:107) A (cid:62) y (cid:107) (cid:33) . where δ = β k +1 − β k and τ = β k +1 + δ . Proof.
See Appendix F.1.Claim 1) implies that the difference between the spectrum of H and the spectrum of C A is boundedin l norm. Claim 2) implies that the difference between the spectrum of H and the spectrum of C A ( I + C A ) − is bounded in l norm. Note that λ i ( C A ) λ i ( C A ) ≤ λ i ( C A ) , and the ratio between the two is large if λ i ( C A ) is large. Also recall that the approximation errors withsubspace obtained from H involves pre-constants such as /Z and (cid:107) f (cid:107) ,µ /Z , which are estimated inclaims 3) and 4), but not the subspace obtained from H . Thus, in the Gaussian linear setting, the errorestimates obtained from H will be much tighter than those obtained from H when the dominatingeigenvalues of C A are large, which is often the case in practice. ccuracy of LIS methods H is easier for the Monte Carlo based estimate than H . This can be seen fromthe bounds on the the variance of the Gram matrices in claim 5). Comparing with V ( H , µ ) , V ( H , µ ) has an additional dependence on (cid:113) det( I + C A ) and exp (cid:16) √ − √ (cid:107) y (cid:107) (cid:17) . But this estimation difficultycan be remedied by SMC, as shown in claim 6), variance V k +1 ( H , π k ) in SMC can decay to 1 if thetemperature increment, β k +1 − β k , is close to 0.In many applications, the spectrum of prior covariance Γ is assumed to be bounded by a polynomialdecay, i.e. λ j (Γ) ≤ C Γ j − α , α > . With α > / , the prior covariance is trace-class, and thus the priorhas measure 1 on some suitably constructed Banach space. This assumption is often used to analyzethe properties of linear inverse problems at the infinite-dimensional limit. In the following corollary,we replace the bounds in Proposition 6.1 with estimates obtained using this trace-class constraint todemonstrate the dimension scalability. Corollary 6.2.
Suppose the eigenvalues of Γ follows a polynomial decay, λ j (Γ) ≤ C Γ j − α with α > / , the observation matrix G has bounded l operator norm, the observed data have bounded l norm, then following estimates hold without dependence on the ambient dimension d . Consequentially,the estimation error E h for any bounded h is independent of d by Corollary 4.3.1) When using X r as the subspace spanned by the first d r eigenvectors of H k , k ∈ { , } , R ( X r , H k ) ≤ α − (cid:107) G (cid:107) C ( d r − − α .
2) The constant (cid:107) f (cid:107) ,µ /Z is bounded by (cid:107) f (cid:107) ,µ Z ≤ exp (cid:32) √ − √ (cid:107) y (cid:107) + α α − (cid:107) G (cid:107) C (cid:33) .
3) When the reference distribution µ is used for estimating the matrices H k , k = { , } , the variancesare bounded by V ( H , µ ) ≤ (cid:18) α (2 α − (cid:107) G (cid:107) C + (cid:107) A (cid:62) y (cid:107) (cid:19) ,V ( H , µ ) ≤ (cid:32) √ − √ (cid:107) y (cid:107) + α α − (cid:107) G (cid:107) C (cid:33) (cid:18) α (2 α − (cid:107) G (cid:107) C + (cid:107) A (cid:62) y (cid:107) (cid:19) .
4) The SMC sampling variance in Proposition 5.2 is bounded by V k +1 ( H , µ ) ≤ δ exp (cid:18) δ (cid:107) A (cid:62) y (cid:107) + δ α α − (cid:107) G (cid:107) C (cid:19) (cid:18) α (2 α − (cid:107) G (cid:107) C + (cid:107) A (cid:62) y (cid:107) (cid:19) . where δ = β k +1 − β k . Proof.
See Appendix F.2.
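To see the mechanism behind these dimension-independent estimates, the following short numerical check (a sketch with illustrative values of $C_\Gamma$, $\alpha$, and $d_r$, not taken from the paper) compares the residual sum of a polynomially decaying spectrum with the integral bound $C_\Gamma(d_r-1)^{1-\alpha}/(\alpha-1)$ used in Appendix F.2 as the ambient dimension grows.

```python
import numpy as np

# Illustrative values (not from the paper): polynomial-decay prior spectrum.
C_gamma, alpha, d_r = 1.0, 2.0, 16

for d in [10**2, 10**3, 10**4, 10**5]:
    lam = C_gamma * np.arange(1, d + 1, dtype=float) ** (-alpha)
    tail = lam[d_r:].sum()                                   # sum_{j > d_r} lambda_j
    bound = C_gamma * (d_r - 1) ** (1 - alpha) / (alpha - 1)  # integral bound, d-free
    print(f"d = {d:>6d}   tail sum = {tail:.6f}   bound = {bound:.6f}")
# The tail sum saturates as d grows, while the bound does not depend on d,
# which is what makes the corollary's estimates dimension independent.
```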
7. Numerical examples
Now we provide several numerical examples to illustrate the theoretical results developed in the preceding sections. We start with a synthetic linear inverse problem to demonstrate various likelihood approximation methods, and continue with a more practical nonlinear Bayesian inference problem governed by partial differential equations (PDEs).
In the first example, we consider a Bayesian inverse problem with linear observations and a log-normal prior. Applications of such problems can be found in the X-ray tomography and atmospheric remote sensing literature; see [30] and references therein. The parameter to be inferred is modeled by a Gaussian random vector $x \sim \mu(x) = N(0,\Gamma)$, where $\Gamma \in \mathbb{R}^{d\times d}$ is the covariance matrix. The observed data are modeled through
\[
y = G\exp(x) + \xi, \qquad \xi \sim N(0, \sigma^2 I_{d_y}),
\]
where $G \in \mathbb{R}^{d_y\times d}$ is a matrix. The observation likelihood is then given by
\[
f(x;y) \propto \exp\Big( -\frac{1}{2\sigma^2}\,\|y - G\exp(x)\|^2 \Big).
\]
We specify the prior covariance by setting $\Gamma = \mathrm{diag}(\gamma_1,\gamma_2,\ldots,\gamma_d)$ with $\gamma_j = \gamma_1 j^{\beta_\gamma}$. To create a random observation matrix $G$, we use the reduced singular value decomposition $G = U\Lambda V^\top$, where the matrices $U$ and $V$ are randomly and independently generated from the orthogonal group [51] and $\Lambda = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_{d_y})$ with $\lambda_j = \lambda_1 j^{\beta_\lambda}$. By using randomly generated $G$, we can confirm that the observed phenomena are not restricted to a specific $G$. In this example, we present three independently generated $G$, and we will see that the numerical results differ little among them.

The problem dimensions are set to $d = 500$ and $d_y = 50$. The variables prescribing $\Gamma$ and $\Lambda$ are given by $\gamma_1 = 4$, $\lambda_1 = 100$, and negative decay exponents $\beta_\gamma$ and $\beta_\lambda$. The standard deviation of the observation noise is $\sigma = 1$.
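A minimal sketch of this synthetic setup and of the Monte Carlo estimation of $H_0$ over the prior is given below. The decay exponents, sample size, and random seed are illustrative assumptions (the paper's exact values are not reproduced), and the gradient formula follows from the likelihood above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes and spectra (the decay exponents are assumptions).
d, d_y, sigma = 500, 50, 1.0
gamma = 4.0 * np.arange(1, d + 1, dtype=float) ** (-2.0)      # prior variances
lam = 100.0 * np.arange(1, d_y + 1, dtype=float) ** (-2.0)    # singular values of G

# Random observation matrix G = U diag(lam) V^T with orthonormal factors.
U, _ = np.linalg.qr(rng.standard_normal((d_y, d_y)))
V, _ = np.linalg.qr(rng.standard_normal((d, d_y)))
G = U @ np.diag(lam) @ V.T

# Synthetic data from a prior draw.
x_true = np.sqrt(gamma) * rng.standard_normal(d)
y = G @ np.exp(x_true) + sigma * rng.standard_normal(d_y)

def grad_log_f(x):
    """Gradient of the log-likelihood of y = G exp(x) + noise."""
    return np.exp(x) * (G.T @ (y - G @ np.exp(x))) / sigma**2

# Monte Carlo estimate of H0 = E_mu[ grad log f  grad log f^T ] over the prior.
M = 2000
H0 = np.zeros((d, d))
for _ in range(M):
    g = grad_log_f(np.sqrt(gamma) * rng.standard_normal(d))
    H0 += np.outer(g, g) / M

eigvals = np.linalg.eigvalsh(H0)[::-1]          # descending eigenvalues
residual = np.cumsum(eigvals[::-1])[::-1]       # sums of trailing eigenvalues
print("leading eigenvalues:", eigvals[:5])
print("residual after 32 modes:", residual[32])
```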
Figure 1. Synthetic example. Eigenvalues and the gaps of eigenvalues of the $H_k$ matrices for three randomly generated observation maps (R.M.) and the sums of the residual eigenvalues versus the projection dimension.

We present the numerical results based on three realizations of the randomly generated $G$ matrices. For each test case, we construct the matrix $H_0$ using Monte Carlo samples and construct the matrix $H_1$ using MCMC samples. As shown in Figure 1, for all three test cases, the $H_k$ matrices have decaying eigenvalues. The eigenvalues of the $H_1$ matrices are several orders of magnitude smaller than those of the $H_0$ matrices, and thus the sums of the residual eigenvalues, $R(X_r, H_1)$, are significantly lower than $R(X_r, H_0)$. This suggests that the dimension reduced posterior induced by the $H_1$ matrix should have better accuracy compared with that defined by the $H_0$ matrix in this example. This is confirmed in Figure 2, which shows the squared Hellinger distances between the true posterior and the three dimension reduced posteriors introduced in Section 3, namely $\pi^M_l$, $\pi^M_f$, and $\pi^M_g$. For all posterior approximation methods, the approximation subspaces defined by the $H_1$ matrices yield significantly smaller squared Hellinger distances than those of the $H_0$ matrices.

Figure 2. Synthetic example. Approximation errors of the dimension reduced posteriors versus projection dimension for different $H_k$ matrices and different approximation methods. From left to right, the approximation methods are $\pi^M_l$, $\pi^M_f$, and $\pi^M_g$, respectively. Sample size $M_\pi = 4$ is used in computing the conditional expectation.

Figure 3. Synthetic example with the first random observation matrix. The sums of the residual eigenvalues versus projection dimension for different $H_k$ matrices and subspaces $\widehat{X}_r$ computed using different sample sizes $M_\nu$ and different methods. From left to right: Monte Carlo estimation of $\widehat{H}_0$, MCMC estimation of $\widehat{H}_1$, and SMC estimation of $\widehat{H}_1$.

The eigenvalue gaps, $\lambda_r - \lambda_{r-1}$, are also plotted in Figure 1. Their values decay to zero quickly. In particular, for $H_1$, the gap is already very small at a moderate reduction dimension $r = 32$, and smaller still at a larger dimension $r = 128$. For $H_0$, the effective eigenvalue gaps are of similar size, since they need to be normalized by $\lambda_1$, which is large for $H_0$. This illustrates that it is important for the theoretical results to be independent of eigenvalue gaps, as explained in Remark 4.4.

Since all the numerical results are similar among the three randomly generated observation matrices, we focus on the first realization for the subsequent discussion. Next, we investigate the impact of the sample size $M_\nu$ for estimating the reduced dimensional subspace $\widehat{X}_r$. Figure 3 presents the quantiles of the sums of the residual errors for various sample-based subspace estimations. Although both $H_0$ and $H_1$ matrices have diminishing eigenvalue gaps in this example (cf. Figure 1), the sample-based estimations still exhibit sufficient accuracy in probing the dimension reduced subspaces. For example, with a rather small sample size $M_\nu = 100$, using either MCMC or SMC can lead to accurate subspace estimations for the approach based on the $H_1$ matrix. This confirms the efficacy of our analysis in Section 4. In addition, the SMC-based estimation has better accuracy compared to the MCMC-based estimation, which is also anticipated by our analysis in Section 5.

Figure 4. Synthetic example with the first random observation matrix. Approximation errors of the dimension reduced posteriors (with projection dimensions $d_r = \{8, 16, 32\}$) versus sample size $M_\pi$. From left to right, the approximation methods are $\pi^M_l$, $\pi^M_f$, and $\pi^M_g$, respectively. The error bars represent quantiles (from 10%) estimated using repeated runs.

Finally, we investigate the sample size $M_\pi$ for computing the conditional expectations in the various approximate posteriors, as discussed in Section 3. The results presented in Figure 4 confirm our analysis. For problems with rather small residual eigenvalues $R(X_r, H_k)$, a small sample size $M_\pi$ is sufficient for constructing an accurate estimation of the conditional expectations. Increasing $M_\pi$ only leads to a marginal improvement in the approximation accuracy in this example. The reasons behind this were explained at the end of Section 3.
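To make the role of $M_\pi$ concrete, the following minimal sketch shows how a marginalized ("ridge") likelihood $f_r(x_r)$ is estimated with $M_\pi$ complement samples. It assumes a whitened reference $\mu = N(0, I_d)$, so that the complement conditional is again standard Gaussian; the toy likelihood, dimensions, and sample sizes are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_likelihood(f, V_r, x, M_pi, rng):
    """Monte Carlo estimate of f_r(x_r) = E[ f(P_r x + P_perp X_perp) ],
    assuming the reference measure is N(0, I_d)."""
    d, _ = V_r.shape
    x_r = V_r @ (V_r.T @ x)                       # projection onto the LIS
    samples = []
    for _ in range(M_pi):
        z = rng.standard_normal(d)
        x_perp = z - V_r @ (V_r.T @ z)            # complement-space draw
        samples.append(f(x_r + x_perp))
    return np.mean(samples)

# Toy likelihood in whitened coordinates (illustrative only).
d, d_r = 20, 3
A = rng.standard_normal((5, d)) / np.sqrt(d)
y = rng.standard_normal(5)
f = lambda x: np.exp(-0.5 * np.sum((A @ x - y) ** 2))

V_r, _ = np.linalg.qr(rng.standard_normal((d, d_r)))   # stand-in LIS basis
x = rng.standard_normal(d)
print("f(x)        =", f(x))
print("f_r(x), M=4 =", ridge_likelihood(f, V_r, x, M_pi=4, rng=rng))
```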
We consider a classical Bayesian inverse problem governed by an elliptic PDE [26, 24]. Such problems find applications in subsurface structure reconstruction and oil reservoir management. Given a domain of interest $D$ with boundary $\partial D$, the potential function $s \mapsto u(s)$, $s \in D \subset \mathbb{R}^2$, is modeled by the PDE
\[
-\nabla\cdot\big(\kappa(s)\,\nabla u(s)\big) = 0, \qquad s \in D := (0,1)^2, \tag{18}
\]
with Dirichlet boundary conditions $u|_{s_1=0} = 1$ and $u|_{s_1=1} = 0$ on the left and right boundaries, and homogeneous Neumann conditions on the other boundaries. The diffusion coefficient $\kappa(s)$ must be positive, and thus it is often parametrized by its logarithm, i.e., $\kappa(s) = \exp(x(s))$. The goal is to infer the unknown parameter function $x(s)$ from $d_y$ incomplete observations of the potential function $u(s)$. A zero-mean Gaussian process prior with the exponential kernel
\[
K(s,s') = \exp\Big( -\frac{1}{\ell}\,\|s - s'\| \Big)
\]
is prescribed to the unknown parameter $x(s)$.

Given an arbitrary function $x(s)$, the PDE (18) cannot be solved analytically, so the functions $x(s)$ and $u(s)$ need to be discretized to solve (18) numerically. We tessellate the spatial domain $D$ with a uniform triangular grid with mesh size $h$, and then define continuous, piecewise linear basis functions $\{\phi_1(s),\ldots,\phi_d(s)\}$ with cardinality $d$. Then the infinite-dimensional functions $x(s)$ and $u(s)$ can be approximated by $x(s) \approx x_h(s) := \sum_{i=1}^d \phi_i(s)\,x_i$ and $u(s) \approx u_h(s) := \sum_{i=1}^d \phi_i(s)\,u_i$, respectively. After discretization, the unknown function $x_h(s)$ can be effectively represented by a coefficient vector $x = (x_1,\ldots,x_d)$, which yields a multivariate Gaussian prior $x \sim \mu(x) := N(0,\Gamma)$, where $\Gamma_{ij} = \int\!\!\int K(s,s')\,\phi_i(s)\,\phi_j(s')\,ds\,ds'$.

Figure 5. Setup of the PDE inverse problem. Left: the true parameter $x_{\mathrm{true}}(s)$ used for generating synthetic data. Right: the corresponding potential function $u(s; x_{\mathrm{true}})$ and observation locations.

For any given parameter coefficients $x$, the corresponding discretized potential function $u_h(s;x)$ is obtained by solving the Galerkin projection of the PDE (18). Observations are collected as $d_y = 19$ local averages of the potential function $u(s)$ over sub-domains $D_i \subset D$, $i = 1,\ldots,d_y$. The sub-domains are shown by the squares in Figure 5. To simulate the observable model outputs, we define the forward model $G: \mathbb{R}^d \mapsto \mathbb{R}^{d_y}$ with
\[
G_i(x) = \frac{1}{|D_i|}\int_{D_i} u_h(s;x)\,ds, \qquad i = 1,\ldots,d_y.
\]
Synthetic data for these $d_y$ local averages are produced by $y = G(x_{\mathrm{true}}) + \xi$, where $\xi \sim N(0,\sigma^2 I_{d_y})$ and $x_{\mathrm{true}}$ is a realization of the prior random variable. To investigate the impact of the observation noise in practical applications, we present three test cases with observational standard deviations $\sigma$ that correspond to signal-to-noise ratios of 10, 20, and 40, respectively. The resulting posterior distribution concentrates as $\sigma$ decreases.

As shown in Figure 6, for all three test cases, the eigenvalues of the $H_k$ matrices have decaying values and gaps, similar to the first numerical example. The eigenvalues of the $H_1$ matrices and the associated sums of the residual eigenvalues, $R(X_r,H_1)$, are several orders of magnitude smaller than the eigenvalues of the $H_0$ matrices and $R(X_r,H_0)$. In addition, the gap between $R(X_r,H_1)$ and $R(X_r,H_0)$ increases with decreasing $\sigma$.
This suggests that the accuracy improvement of the dimension reduced posterior induced by the $H_1$ matrix over that of the $H_0$ matrix can be further enhanced for more concentrated posterior distributions in this example. Moreover, with a smaller $\sigma$, the signal-to-noise ratio is larger and the posterior density differs more from the prior, so the sampling problem becomes more challenging. This can be seen from the eigenvalues in the smallest-noise case, which are several orders of magnitude larger than those in the largest-noise case.

Figure 6. Same as Figure 1, but for the PDE inverse problem with the three observation noise levels $\sigma$.

Figure 7. Same as Figure 2, but for the PDE inverse problem with the three observation noise levels $\sigma$. Sample size $M_\pi = 4$ is used in computing the conditional expectation.

Figure 8. Same as Figure 3, but for the PDE inverse problem with one fixed observation noise level $\sigma$.

In Figure 7, we compare the performance of using $H_0$ and $H_1$ with the three projection methods and noise scales. The performance is measured by the Hellinger distance from the true posterior. We can see that all three projection methods yield similar results. Approximations using $H_1$ consistently outperform those using $H_0$, especially when the reduction dimension increases. By reducing $\sigma$, the sampling problem becomes harder and the advantage of $H_1$ over $H_0$ becomes more pronounced: already at $d_r = 2$ the approximation error obtained with $H_1$ is several orders of magnitude smaller than that obtained with $H_0$.

Using one of the test cases, we investigate the impact of the sample size $M_\nu$ for estimating the reduced dimensional subspace $\widehat{X}_r$. Figure 8 presents the quantiles of the sums of the residual errors for various sample-based subspace estimations. Similar to the first numerical example, the diminishing eigenvalue gaps of the $H_k$ matrices (cf. Figure 6) do not impact the accuracy of the subspace estimations. With a rather small sample size $M_\nu = 100$, all methods (MC for $H_0$, and MCMC and SMC for $H_1$) lead to accurate subspace estimations.
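As a rough illustration of how such a Gaussian process prior can be discretized, the sketch below draws a realization of the log-diffusion field. It is an assumption-laden simplification: it uses nodal values $\Gamma_{ij} = K(s_i,s_j)$ on a uniform grid instead of the finite-element integrals above, and the correlation length $\ell$ is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discretize the exponential-kernel Gaussian prior on a uniform grid of the
# unit square (nodal approximation of Gamma; ell is an illustrative value).
n, ell = 20, 0.25
s1, s2 = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
nodes = np.column_stack([s1.ravel(), s2.ravel()])              # d = n*n nodes

dist = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
Gamma = np.exp(-dist / ell)                                    # K(s, s') = exp(-|s - s'| / ell)

# Draw a prior realization x(s) ~ N(0, Gamma) and form kappa = exp(x).
L = np.linalg.cholesky(Gamma + 1e-10 * np.eye(n * n))          # jitter for stability
x_sample = L @ rng.standard_normal(n * n)
kappa = np.exp(x_sample).reshape(n, n)                          # positive diffusion field
print("log-conductivity range:", x_sample.min(), x_sample.max())
```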
8. Conclusion
This paper has provided a step-by-step analysis of the accuracy of the LIS method for approximating high-dimensional intractable target probability densities. We have shown that the spectral information of the Gram matrices $H_k$, $k \in \{0,1\}$, leads to upper bounds on the errors of various oracle approximations. We have also generalized these upper bounds to the numerical implementation of the approximate probability densities, in which Monte Carlo averaging is applied both to the estimation of $H_k$, $k \in \{0,1\}$, and to the marginalization used in the construction of the approximate likelihood functions. Our analysis provides insights into the tradeoff between the use of $H_0$ and $H_1$ for constructing the LIS: while the approximations based on $H_1$ can have smaller approximation errors compared with those obtained from $H_0$, the matrix $H_1$ is often more difficult to estimate. Fortunately, this difficulty can be addressed by integrating the LIS estimation process into sampling tools such as MCMC and SMC. Thus, utilizing the error bounds we obtained, we have also discussed the performance of the integration of MCMC and SMC with the LIS. We have demonstrated our analysis on a linear Bayesian inverse problem, where all the error bounds presented in this paper are independent of the apparent parameter dimension under suitable technical assumptions that are commonly used in high- or infinite-dimensional inverse problems. Finally, we have provided numerical examples to further demonstrate the efficacy of our analysis on nonlinear problems.

This work leads to some future research directions for dimension reduction techniques. Firstly, our analysis of the linear Bayesian inverse problem shows that various approximation errors are dimension independent. We conjecture that this property will also hold for general nonlinear Bayesian inverse problems, yet finding the conditions that guarantee it remains an open problem. Secondly, our analysis indicates that the expected conditional variance of the square root of the likelihood provides a direct control of the approximation error. This may lead to new dimension reduction techniques that bypass the use of the Poincaré inequality and the gradient. Moreover, the analysis presented in this work can be further generalized to other types of log-concave reference distributions, for example, the Laplace distribution commonly used in sparsity-promoting learning. This may require further investigation of weighted Poincaré-type inequalities [11, 13] for building alternative $H_k$ matrices and subspace approximations.

Appendix A: Useful lemmas
Lemma A.1.
The following holds:

1) The estimation error of an $L^2$ function $h$ can be bounded by the Hellinger distance,
\[
\big| \mathbb{E}_\pi[h] - \mathbb{E}_\nu[h] \big| \le 2\sqrt{\mathbb{E}_\pi[h^2] + \mathbb{E}_\nu[h^2]}\; D_H(\pi,\nu).
\]

2) The Hellinger distance can be bounded by the square root of the KL divergence,
\[
D_H(\pi,\nu) \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(\pi,\nu)}.
\]

3) The total variation distance can be bounded by the Hellinger distance,
\[
D_{TV}(\pi,\nu) \le \sqrt{2}\, D_H(\pi,\nu).
\]

Proof.
Proof of claim 1).
Let $\lambda$ be a reference density, e.g. the Lebesgue density, for the Hellinger distance, so that
\[
D_H^2(\pi,\nu) = \frac12 \int \Big( \sqrt{\tfrac{\pi(x)}{\lambda(x)}} - \sqrt{\tfrac{\nu(x)}{\lambda(x)}} \Big)^2 \lambda(x)\,dx.
\]
Note that
\begin{align*}
\big| \mathbb{E}_\pi[h] - \mathbb{E}_\nu[h] \big|^2 &= \Big( \int \Big( \tfrac{\pi(x)}{\lambda(x)} - \tfrac{\nu(x)}{\lambda(x)} \Big) h(x)\,\lambda(x)\,dx \Big)^2 \\
&= \Big( \int \Big( \sqrt{\tfrac{\pi(x)}{\lambda(x)}} - \sqrt{\tfrac{\nu(x)}{\lambda(x)}} \Big)\Big( \sqrt{\tfrac{\pi(x)}{\lambda(x)}} + \sqrt{\tfrac{\nu(x)}{\lambda(x)}} \Big) h(x)\,\lambda(x)\,dx \Big)^2 \\
(\text{by Cauchy--Schwarz})\quad &\le 2\,D_H^2(\pi,\nu) \int \Big( \sqrt{\tfrac{\pi(x)}{\lambda(x)}} + \sqrt{\tfrac{\nu(x)}{\lambda(x)}} \Big)^2 h(x)^2\,\lambda(x)\,dx \\
(\text{by Young's ineq.})\quad &\le 2\,D_H^2(\pi,\nu) \int 2\Big( \tfrac{\pi(x)}{\lambda(x)} + \tfrac{\nu(x)}{\lambda(x)} \Big) h(x)^2\,\lambda(x)\,dx \\
&= 4\big( \mathbb{E}_\pi[h^2] + \mathbb{E}_\nu[h^2] \big)\, D_H^2(\pi,\nu).
\end{align*}

Proof of claim 2).

The result comes from the following:
\begin{align*}
D_{\mathrm{KL}}(\pi,\nu) &= \int \log\frac{\pi(x)}{\nu(x)}\,\pi(x)\,dx = -2\int \log\frac{\sqrt{\nu(x)}}{\sqrt{\pi(x)}}\,\pi(x)\,dx \\
(\text{by } -\log(1+x) \ge -x)\quad &\ge -2\int \Big( \frac{\sqrt{\nu(x)}}{\sqrt{\pi(x)}} - 1 \Big)\pi(x)\,dx \\
&= \int \big( \pi(x) + \nu(x) - 2\sqrt{\pi(x)\nu(x)} \big)\,dx = 2\,D_H^2(\pi,\nu).
\end{align*}

Proof of claim 3).

The result comes from the following:
\begin{align*}
D_{TV}^2(\pi,\nu) &= \Big( \frac12 \int \Big| \frac{\pi(x)}{\lambda(x)} - \frac{\nu(x)}{\lambda(x)} \Big|\,\lambda(x)\,dx \Big)^2 \\
&\le \frac14 \int \Big( \sqrt{\tfrac{\pi(x)}{\lambda(x)}} - \sqrt{\tfrac{\nu(x)}{\lambda(x)}} \Big)^2 \lambda(x)\,dx \int \Big( \sqrt{\tfrac{\pi(x)}{\lambda(x)}} + \sqrt{\tfrac{\nu(x)}{\lambda(x)}} \Big)^2 \lambda(x)\,dx \\
&\le D_H^2(\pi,\nu) \int \Big( \frac{\pi(x)}{\lambda(x)} + \frac{\nu(x)}{\lambda(x)} \Big)\lambda(x)\,dx = 2\,D_H^2(\pi,\nu).
\end{align*}

Lemma A.2.
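The three inequalities of Lemma A.1 can be checked numerically; the sketch below does so for two one-dimensional Gaussians using Riemann sums on a fine grid. The means, standard deviations, and test function are arbitrary illustrative choices, and the constants follow the statements as written above.

```python
import numpy as np

# Numerical check of the three bounds in Lemma A.1 for two 1-D Gaussians.
x, dx = np.linspace(-15.0, 15.0, 200001, retstep=True)
integrate = lambda vals: vals.sum() * dx          # Riemann sum on a uniform grid

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

pi, nu = gauss(x, 0.0, 1.0), gauss(x, 1.0, 1.5)
h = x**2                                          # an L^2 test function

D_H = np.sqrt(0.5 * integrate((np.sqrt(pi) - np.sqrt(nu)) ** 2))
D_KL = integrate(pi * np.log(pi / nu))
D_TV = 0.5 * integrate(np.abs(pi - nu))
err = abs(integrate(h * pi) - integrate(h * nu))

print("claim 1:", err, "<=", 2 * np.sqrt(integrate(h**2 * pi) + integrate(h**2 * nu)) * D_H)
print("claim 2:", D_H, "<=", np.sqrt(0.5 * D_KL))
print("claim 3:", D_TV, "<=", np.sqrt(2.0) * D_H)
```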
Consider two probability densities $\pi(x) = \frac{1}{Z_f} f(x)\mu(x)$ and $p(x) = \frac{1}{Z_h} h(x)\mu(x)$, where $Z_f = \int f(x)\mu(x)\,dx$ and $Z_h = \int h(x)\mu(x)\,dx$. Given the $L^2$ distance between $\sqrt{f}$ and $\sqrt{h}$,
\[
\|\sqrt{f}-\sqrt{h}\|_{2,\mu} = \Big( \int \big( \sqrt{f(x)} - \sqrt{h(x)} \big)^2 \mu(x)\,dx \Big)^{1/2},
\]
we have the following:

1) The normalizing constant difference is bounded as $\big| \sqrt{Z_f} - \sqrt{Z_h} \big| \le \|\sqrt{f}-\sqrt{h}\|_{2,\mu}$.

2) The squared Hellinger distance is bounded as $D_H^2(\pi,p) \le \frac{2}{Z_f}\,\|\sqrt{f}-\sqrt{h}\|_{2,\mu}^2$.

Proof.
Proof of claim 1).
Note that (cid:12)(cid:12) Z f − Z h (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) (cid:0) f ( x ) − h ( x ) (cid:1) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) ( (cid:112) f ( x ) − (cid:112) h ( x ))( (cid:112) f ( x ) + (cid:112) h ( x )) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) ( (cid:112) f ( x ) − (cid:112) h ( x )) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) / (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) ( (cid:112) f ( x ) + (cid:112) h ( x )) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) / ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) ( (cid:112) f ( x ) − (cid:112) h ( x )) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) / (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) f ( x ) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) / + (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) h ( x ) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) / (cid:33) = ( (cid:113) Z f + (cid:112) Z h ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) ( (cid:112) f ( x ) − (cid:112) h ( x )) µ ( x ) dx (cid:12)(cid:12)(cid:12)(cid:12) / Dividing both sides by ( (cid:112) Z f + √ Z h ) > we have the result. Proof of claim 2).
The squared Hellinger distance of π from p satisfies D H ( π (cid:107) p ) = 12 (cid:90) (cid:16)(cid:112) π ( x ) − (cid:112) p ( x ) (cid:17) dx = 12 (cid:90) (cid:32)(cid:115) f ( x ) Z f − (cid:115) h ( x ) Z h (cid:33) µ ( x ) dx = 12 Z f (cid:90) (cid:32)(cid:112) f ( x ) − (cid:112) h ( x ) + (cid:112) h ( x ) − (cid:112) h ( x ) (cid:115) Z f Z h (cid:33) µ ( x ) dx = 12 Z f (cid:90) (cid:32)(cid:112) f ( x ) − (cid:112) h ( x ) + (cid:112) h ( x ) (cid:32) − (cid:115) Z f Z h (cid:33)(cid:33) µ ( x ) dx (by Young’s ineq.) ≤ Z f (cid:90) (cid:16)(cid:112) f ( x ) − (cid:112) h ( x ) (cid:17) µ ( x ) dx + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:115) Z f Z h (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:90) h ( x ) µ ( x ) dx = 1 Z f (cid:18) (cid:107) f − h (cid:107) µ + (cid:12)(cid:12)(cid:12)(cid:112) Z h − (cid:113) Z f (cid:12)(cid:12)(cid:12) (cid:19) ≤ Z f (cid:107) f − h (cid:107) ,µ . Thus, the result follows.
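A similar numerical check can be made for Lemma A.2. In the sketch below, f and h are two illustrative unnormalized likelihoods against a standard Gaussian reference, and the integrals are approximated by Riemann sums.

```python
import numpy as np

# Numerical check of Lemma A.2 for a 1-D example.
x, dx = np.linspace(-12.0, 12.0, 100001, retstep=True)
integrate = lambda vals: vals.sum() * dx

mu = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # reference density
f = np.exp(-0.5 * (x - 1.0) ** 2)                 # two illustrative likelihoods
h = np.exp(-0.5 * ((x - 1.2) / 1.1) ** 2)

Z_f, Z_h = integrate(f * mu), integrate(h * mu)
pi, p = f * mu / Z_f, h * mu / Z_h
dist = np.sqrt(integrate((np.sqrt(f) - np.sqrt(h)) ** 2 * mu))   # ||sqrt f - sqrt h||_{2,mu}
D_H2 = 0.5 * integrate((np.sqrt(pi) - np.sqrt(p)) ** 2)

print("claim 1:", abs(np.sqrt(Z_f) - np.sqrt(Z_h)), "<=", dist)
print("claim 2:", D_H2, "<=", 2.0 / Z_f * dist**2)
```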
Lemma A.3.
Let $C$ be a positive semidefinite matrix and let $U$ be a rank-$p$ symmetric matrix. Then, for any $k$,
\[
\lambda_{k+p}(C+U) \le \lambda_k(C).
\]

Proof.
By the Courant–Fischer–Weyl min–max principle, for any symmetric matrix $C$,
\[
\lambda_{k+p}(C+U) = \min_{V} \Big\{ \max\big\{ x^\top (C+U)\,x : x \in V,\ \|x\| = 1 \big\} : \dim(V) = d-k-p+1 \Big\}.
\]
Let the eigenvectors of $C$ be $v_1,\ldots,v_d$ and the eigenvectors of $U$ with nonzero eigenvalues be $u_1,\ldots,u_p$. Now we pick
\[
V_\perp = \mathrm{span}\{ v_1,\ldots,v_{k-1},\, u_1,\ldots,u_p \},
\]
and its orthogonal complement $V'$ is a subspace of dimension at least $d-k-p+1$. Select any subspace $V$ of dimension $d-k-p+1$ from $V'$; then
\[
\lambda_{k+p}(C+U) \le \max_{x\in V,\,\|x\|=1} x^\top (C+U)\,x = \max_{x\in V,\,\|x\|=1} x^\top C x \le \lambda_k(C).
\]

Appendix B: Proofs in Section 2
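Lemma A.3 can also be verified numerically; the following sketch checks the eigenvalue shift for a random positive semidefinite C and a random rank-p symmetric (possibly indefinite) perturbation U. The matrix sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Check lambda_{k+p}(C + U) <= lambda_k(C) for PSD C and rank-p symmetric U.
d, p = 50, 3
B = rng.standard_normal((d, d))
C = B @ B.T                                        # positive semidefinite
W = rng.standard_normal((d, p))
U = W @ np.diag(rng.standard_normal(p)) @ W.T      # rank-p symmetric (indefinite)

eig_C = np.linalg.eigvalsh(C)[::-1]                # descending eigenvalues
eig_CU = np.linalg.eigvalsh(C + U)[::-1]

for k in range(1, d - p + 1):
    assert eig_CU[k + p - 1] <= eig_C[k - 1] + 1e-8
print("lambda_{k+p}(C+U) <= lambda_k(C) verified for all k")
```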
B.1. Proof of Proposition 2.2
Denote the density v ( x ) ∝ exp( − V ( x )) and the associated conditional density as v ( x ⊥ | x r ) . Note that µ ( x ⊥ | x r ) = µ ( x r , x ⊥ ) (cid:82) µ ( x r , x ⊥ ) dx ⊥ = exp( − V ( x r , x ⊥ )) exp( − U ( x r , x ⊥ )) (cid:82) exp( − V ( x r , x ⊥ )) exp( − U ( x r , x ⊥ )) dx ⊥ . Let c = inf x exp( − U ( x )) . Then exp( − U ( x r , x ⊥ )) ≤ Bc , so µ ( x ⊥ | x r ) ≤ exp( − V ( x r , x ⊥ )) Bc (cid:82) exp( − V ( x r , x ⊥ )) c dx ⊥ = Bv ( x ⊥ | x r ) Likewise µ ( x ⊥ | x r ) ≥ exp( − V ( x r , x ⊥ )) c (cid:82) exp( − V ( x r , x ⊥ )) c κdx ⊥ = B − v ( x ⊥ | x r ) ccuracy of LIS methods ∇ x ⊥ log V ( x ⊥ | x r ) = ∇ x ⊥ log V ( x ⊥ , x r ) , which is a sub-matrix of ∇ log V ( x ) , so its minimal eigenvalue is greater than c by the strong log-concavity. Then the Bakry-Emery principle Theorem 3.1 of [41] indicates that v ( x ⊥ | x r ) satisfies thePoincare inequality with coefficient c , i.e. for any h var v ( x ⊥ | x r ) [ h ] ≤ c (cid:90) (cid:107)∇ h (cid:107) dv ( x ⊥ | x r ) . Finally, we have the P.I. for µ ( x ⊥ | x r ) : var µ ( x ⊥ | x r ) [ h ] ≤ (cid:90) [ h ( x ) − E v ( x ⊥ | x r ) h ( x )] µ ( x ⊥ | x r ) dx ⊥ ≤ B (cid:90) [ h − E v ( x ⊥ | x r ) h ( x )] v ( x ⊥ | x r ) dx ⊥ = B var v ( x ⊥ | x r ) ( h ) ≤ Bc (cid:90) (cid:107)∇ x ⊥ h (cid:107) dv ( x ⊥ | x r ) ≤ B c (cid:90) (cid:107)∇ x ⊥ h (cid:107) dµ ( x ⊥ | x r ) . B.2. Proof of Proposition 2.3
Proof of claim 1).
Recall the squared Hellinger distance D H ( π, π f ) = 12 (cid:90) (cid:115) π ( x ) µ ( x ) − (cid:115) π f ( x ) µ ( x ) µ ( x ) dx r = 1 Z (cid:90) (cid:18) (cid:90) ( f ( x r , x ⊥ ) − f r ( x r ) ) µ ( x ⊥ | x r ) dx ⊥ (cid:19) µ ( x r ) dx r , (19)and definitions of f r ( x r ) and g r ( x r ) : f r ( x r ) = (cid:90) g ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ , g r ( x r ) = (cid:90) g ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ . We have the identity: var µ ( x ⊥ | x r ) (cid:2) g (cid:3) = f r ( x r ) − g r ( x r ) . The inner integral in (19) can be expressed as (cid:90) ( g ( x r , x ⊥ ) − f r ( x r ) ) µ ( x ⊥ | x r ) dx ⊥ = 12 (cid:90) (cid:0) g ( x r , x ⊥ ) + f r ( x r ) − g ( x r , x ⊥ ) f r ( x r ) (cid:1) µ ( x ⊥ | x r ) dx ⊥ = f r ( x r ) − f r ( x r ) g r ( x r ) . f r ( x r ) ≥ g r ( x r ) ≥ , and therefore f r ( x r ) ≥ f r ( x r ) g r ( x r ) ≥ g r ( x r ) ≥ . This leads to the inequality f r ( x r ) − f r ( x r ) g r ( x r ) ≤ f r ( x r ) − g r ( x r ) = var µ ( x ⊥ | x r ) (cid:2) g (cid:3) Applying this bound above to (19), we find that: D H ( π, π f ) ≤ Z (cid:90) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) µ ( x r ) dx r . Proof of claim 2).
Recall that the normalizing constant of π g takes the form Z g := (cid:90) g r ( x r ) µ ( x ) dx. The squared Hellinger distance can be written as D H ( π, π g ) = 12 (cid:90) (cid:16) (cid:90) (cid:16) Z g ( x r , x ⊥ ) − Z g g r ( x r ) (cid:17) µ ( x ⊥ | x r ) dx ⊥ (cid:17) µ ( x r ) dx r . (20)Using the identities var µ ( x ⊥ | x r ) (cid:2) g (cid:3) = f r ( x r ) − g r ( x r ) , and f r ( x r ) ≥ g r ( x r ) ≥ , we have Z = (cid:90) f ( x ) µ ( x ) dx = (cid:90) f r ( x r ) µ ( x r ) dx r ≥ (cid:90) g r ( x r ) µ ( x r ) dx r = (cid:90) g r ( x r ) µ ( x ) dx = Z g > , and therefore Z g Z ≤ and ZZ g ≥ . Then, we can bound the inner integral in (20) by (cid:90) (cid:16) Z g ( x r , x ⊥ ) − Z g g r ( x r ) (cid:17) µ ( x ⊥ | x r ) dx ⊥ = 1 Z (cid:16) f r ( x r ) + ZZ g g r ( x r ) − (cid:115) ZZ g g r ( x r ) (cid:17) ≤ Z (cid:16) f r ( x r ) + ZZ g g r ( x r ) − g r ( x r ) (cid:17) = 1 Z (cid:16) f r ( x r ) + Z − Z g Z g g r ( x r ) − g r ( x r ) (cid:17) = 1 Z (cid:16) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) + Z − Z g Z g g r ( x r ) (cid:17) . Substituting this upper bound into (20), we have D H ( π, π g ) ≤ Z (cid:90) (cid:16) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) + Z − Z g Z g g r ( x r ) (cid:17) µ ( x r ) dx r = 12 Z (cid:16) (cid:90) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) µ ( x r ) dx r + ( Z − Z g ) (cid:17) . ccuracy of LIS methods Z − Z g satisfies Z − Z g = (cid:90) (cid:16) (cid:90) ( g ( x r , x ⊥ ) − g r ( x r ) ) µ ( x ⊥ | x r ) dx ⊥ (cid:17) µ ( x r ) dx r = (cid:90) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) µ ( x r ) dx r . (21)In summary, we have D H ( π, π g ) ≤ Z (cid:16) (cid:90) var µ ( x ⊥ | x r ) (cid:2) g (cid:3) µ ( x r ) dx r (cid:17) . Order of normalizing constants).
The result of (21) also reveals that Z ≥ Z g . B.3. Proof of Theorem 2.6
Proof of claim 1).
Note that (cid:90) (log f ( x ) − l r ( x r )) µ ( x ⊥ | x r ) dx ⊥ = var µ ( x ⊥ | x r ) log f ( x ) ≤ κ (cid:90) (cid:107)∇ x ⊥ log f ( x ⊥ , x r ) (cid:107) µ ( x ⊥ | x r ) dx ⊥ = κ (cid:90) (cid:107) ( P ⊥ ∇ log f ( x ) (cid:107) µ ( x ⊥ | x r ) dx ⊥ Integrate both side with µ ( x r ) will yield (cid:90) (log f ( x ) − l r ( x r )) µ ( x ) dx ≤ κ R ( X r , H ) . Then by Cauchy–Schwarz, we find (cid:90) (log f ( x ) − l r ( x r )) µ ( x ) dx (cid:90) f ( x ) µ ( x ) dx ≥ Z (cid:18)(cid:90) log π ( x ) Zπ l ( x ) Z l π ( x ) dx (cid:19) , where Z l := (cid:90) exp( l r ( x r )) µ ( x r ) dx r = (cid:90) exp (cid:18)(cid:90) log f ( x ) µ ( x ⊥ | x r ) dx ⊥ (cid:19) µ ( x r ) dx r (by Jensen’s ineq.) ≤ (cid:90) (cid:18)(cid:90) f ( x ) µ ( x ⊥ | x r ) dx ⊥ (cid:19) µ ( x r ) dx r = Z. Moreover, it is well known that D KL ( π, π l ) = (cid:82) log π ( x ) π l ( x ) π ( x ) dx ≥ , so (cid:18)(cid:90) log π ( x ) Zπ l ( x ) Z l π ( x ) dx (cid:19) = ( D KL ( π, π l ) + log Z/Z l ) ≥ D KL ( π, π l ) . D KL ( π, π l ) ≤ (cid:18)(cid:90) log π ( x ) Zπ l ( x ) Z l π ( x ) dx (cid:19) ≤ Z (cid:90) (log f ( x ) − l r ( x r )) µ ( x ) dx (cid:90) f ( x ) µ ( x ) dx ≤ (cid:107) f (cid:107) ,µ κZ (cid:90) (cid:107) P ⊥ ∇ log f ( x ) (cid:107) µ ( x ) dx = (cid:107) f (cid:107) ,µ κZ R ( X r , H ) . Proof of claim 2).
Applying Lemma A.1 first claim 1) and then claim 2), we have | E π [ h ] − E π l [ h ] | ≤ (cid:113) E π [ h ] + E π l [ h ]) D H ( π, π l ) ≤ ( E π [ h ] + E π l [ h ]) (cid:112) D KL ( π, π l ) ≤ ( E π [ h ] + E π l [ h ]) (cid:114) (cid:107) f (cid:107) ,µ Z ( κ R ( X r , H )) . Thus, the result follows.
Appendix C: Proofs in Section 3
C.1. Proof of Theorem 3.1
Proof of claim 1).
Recall the function g takes the form g = √ f and using g M ( x r ) = 1 M π M π (cid:88) i =1 g ( x r , x i ⊥ ) and g r ( x r ) = (cid:90) g ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ , we have the corresponding approximate target densities π Mg ( x r , x ⊥ ) ∝ g M ( x r ) µ ( x r , x ⊥ ) and π g ( x r , x ⊥ ) ∝ g ( x r ) µ ( x r , x ⊥ ) , respectively. Note that E M [ g M ( x r )] = g r ( x r ) . Applying Lemma A.2claim 2), we have E M (cid:104) D H ( π Mg , π g ) (cid:105) ≤ Z g E M (cid:20)(cid:90) (cid:90) (cid:0) g M ( x r ) − g r ( x r ) (cid:1) µ ( x r ) dx r µ ( x ⊥ | x r ) dx ⊥ (cid:21) = 2 Z g M π (cid:90) var µ ( x ⊥ | x r ) ( g ( x r , x ⊥ )) µ ( x r ) dx r ≤ κZ g M π (cid:90) (cid:107)∇ x ⊥ g ( x ) (cid:107) µ ( x ) dx = 2 κZ g M π (cid:90) (cid:107)∇ x ⊥ log g ( x ) (cid:107) f ( x ) µ ( x ) dx = 2 κZZ g M π R ( X r , H ) . ccuracy of LIS methods Proof of claim 2).
We have the corresponding approximate target densities π Mf ( x r , x ⊥ ) ∝ f M ( x r ) µ ( x r , x ⊥ ) and π f ( x r , x ⊥ ) = Z f ( x r ) µ ( x r , x ⊥ ) where f M ( x r ) = 1 M π M π (cid:88) i =1 f ( x r , x i ⊥ ) and f r ( x r ) = (cid:90) f ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ . Using a similar technique in the proof of claim 1), we apply the inequality E M (cid:104) D H ( π Mf , π f ) (cid:105) ≤ Z E M (cid:20)(cid:90) (cid:90) (cid:0)(cid:112) f M ( x r ) − (cid:112) f r ( x r ) (cid:1) µ ( x r ) dx r µ ( x ⊥ | x r ) dx ⊥ (cid:21) = 2 Z (cid:90) E M (cid:104)(cid:0)(cid:112) f M ( x r ) − (cid:112) f r ( x r ) (cid:1) (cid:105) µ ( x r ) dx r , (22)here. Considering the identity ( (cid:112) f M ( x r ) − (cid:112) f r ( x r )) ≤ ( (cid:112) f M ( x r ) − (cid:112) f r ( x r )) ( (cid:112) f M ( x r ) /f r ( x r ) + 1) = ( f M ( x r ) − f r ( x r )) /f r ( x r ) , the inequality in (22) also satisfies E M (cid:104) D H ( π Mf , π f ) (cid:105) ≤ Z (cid:90) E M (cid:2) ( f M ( x r ) − f r ( x r )) (cid:3) f r ( x r ) µ ( x r ) dx r . (23)Then for each given x r , by independence of x i ⊥ we have E M (cid:104) ( f M ( x r ) − f r ( x r )) (cid:105) = 1 M π var µ ( x ⊥ | x r ) f ( x r , x ⊥ ) ≤ κM π (cid:90) (cid:107)∇ x ⊥ f ( x r , x ⊥ ) (cid:107) µ ( x ⊥ | x r ) dx ⊥ = κM π (cid:90) (cid:107)∇ x ⊥ log f ( x r , x ⊥ ) (cid:107) f ( x ) µ ( x ⊥ | x r ) dx ⊥ = κM π f r ( x r ) (cid:90) (cid:107)∇ x ⊥ log f ( x r , x ⊥ ) (cid:107) f ( x ⊥ | x r ) µ ( x ⊥ | x r ) dx ⊥ . Substituting the above identify into (23), we have E M (cid:104) D H ( π Mf , π f ) (cid:105) ≤ κZM π (cid:90) (cid:90) (cid:107)∇ x ⊥ log f ( x ) (cid:107) f ( x ⊥ | x r ) µ ( x ⊥ | x r ) dx ⊥ f r ( x r ) µ ( x r ) dx r = 2 κZM π (cid:90) (cid:107)∇ x ⊥ log f ( x ) (cid:107) f ( x ⊥ | x r ) f ( x ) µ ( x ) dx ≤ κC f M π R ( X r , H ) , where C f = sup x r sup x ⊥ f ( x ⊥ | x r ) . Then, the result follows from Jensen’s inequality.4 C.2. Proof of Theorem 3.2
First by Assumption 2.1, we have E M (cid:104) ( l M ( x r ) − l r ( x r )) (cid:105) = 1 M π var µ ( x ⊥ | x r ) [ l ( x r , x ⊥ )] , which leads to E M (cid:20)(cid:90) ( l M ( x r ) − l r ( x r )) µ ( x r ) dx r (cid:21) ≤ κM π (cid:90) var µ ( x ⊥ | x r ) [ l ( x r , x ⊥ )] µ ( x r ) dx r . By Jensen’s inequality, the expected L error of l M ( x r ) satisfies E M (cid:34)(cid:18)(cid:90) ( l M ( x r ) − l r ( x r )) µ ( x r ) dx r (cid:19) (cid:35) ≤ √ κ √ M π (cid:18)(cid:90) var µ ( x ⊥ | x r ) [ l ( x r , x ⊥ )] µ ( x r ) dx r (cid:19) . (24)Assumption 2.1 states that (cid:90) var µ ( x ⊥ | x r ) [ l ( x r , x ⊥ )] µ ( x r ) dx r ≤ κ (cid:90) (cid:107)∇ x ⊥ l ( x r , x ⊥ ) (cid:107) µ ( x ⊥ | x r ) dx ⊥ = κ R ( X r , H ) , together with (24), we have E M (cid:34)(cid:18)(cid:90) ( l M ( x r ) − l r ( x r )) µ ( x r ) dx r (cid:19) (cid:35) ≤ √ κ √ M π (cid:112) R ( X r , H ) . To obtain the KL divergence, we note that E M (cid:104) D KL ( π, π M(cid:96) ) (cid:105) = E M (cid:34)(cid:90) log π ( x ) π M(cid:96) ( x ) π ( x ) dx (cid:35) = E M (cid:20)(cid:90) ( (cid:96) ( x ) − (cid:96) M ( x r )) π ( x ) dx (cid:21) + E M (cid:20) log Z M Z (cid:21) , where Z M = (cid:82) e (cid:96) M ( x r ) µ ( x r ) dx r . The expectation of Z M satisfies E M [ Z M ] = E M (cid:20)(cid:90) e (cid:96) M ( x r ) µ ( x r ) dx r (cid:21) = (cid:90) E M (cid:34) exp (cid:32) M M (cid:88) i =1 (cid:96) ( x r , x i ⊥ ) (cid:33)(cid:35) µ ( x r ) dx r ≤ (cid:90) (cid:32) E M (cid:34) M (cid:89) i =1 exp (cid:16) (cid:96) ( x r , x i ⊥ ) (cid:17)(cid:35)(cid:33) /M µ ( x r ) dx r = (cid:90) (cid:32) M (cid:89) i =1 (cid:90) exp (cid:16) (cid:96) ( x r , x i ⊥ ) (cid:17) µ ( x i ⊥ | x r ) dx i ⊥ (cid:33) /M µ ( x r ) dx r ccuracy of LIS methods = (cid:90) (cid:90) f ( x r , x ⊥ ) µ ( x ⊥ | x r ) dx ⊥ µ ( x r ) dx r = Z. Therefore, by Jensen’s inequality, we have E M (cid:20) log Z M Z (cid:21) ≤ log E M (cid:20) Z M Z (cid:21) ≤ . Thus, the expected KL satisfies E M (cid:2) D KL ( π, π (cid:96) M ) (cid:3) ≤ E M (cid:20)(cid:90) ( (cid:96) ( x ) − (cid:96) r ( x r ) + (cid:96) r ( x r ) − (cid:96) M ( x r )) π ( x ) dx (cid:21) = (cid:90) ( (cid:96) ( x ) − (cid:96) r ( x r )) π ( x ) dx + E M (cid:20)(cid:90) ( (cid:96) r ( x r ) − (cid:96) M ( x r )) π ( x ) dx (cid:21) . (25)Applying Cauchy–Schwarz inequality, the first term in (25) can be bounded by (cid:90) ( (cid:96) ( x ) − (cid:96) r ( x r )) π ( x ) dx ≤ (cid:107) f (cid:107) ,µ Z (cid:18)(cid:90) ( (cid:96) ( x ) − (cid:96) r ( x r )) µ ( x ) dx (cid:19) = (cid:107) f (cid:107) ,µ Z (cid:18)(cid:90) var µ ( x ⊥ | x r ) [ (cid:96) ( x r , x ⊥ )] µ ( x r ) dx r (cid:19) , and the second term in (25) can be bounded by E M (cid:20)(cid:90) ( (cid:96) r ( x r ) − (cid:96) M ( x r )) π ( x ) dx (cid:21) ≤ (cid:107) f (cid:107) ,µ Z E M (cid:34)(cid:18)(cid:90) ( (cid:96) r ( x ) − (cid:96) M ( x r )) µ ( x r ) dx r (cid:19) (cid:35) . Thus, applying the bound on E M [ · ] in (24) and Assumption 2.1, we have E M (cid:2) D KL ( π, π (cid:96) M ) (cid:3) ≤ (cid:107) f (cid:107) ,µ Z (cid:18) √ M π (cid:19) (cid:18)(cid:90) var µ ( x ⊥ | x r ) [ (cid:96) ( x r , x ⊥ )] µ ( x r ) dx r (cid:19) / ≤ (cid:107) f (cid:107) ,µ Z (cid:18) √ M π (cid:19) (cid:18) κ (cid:90) (cid:107)∇ x ⊥ l ( x r , x ⊥ ) (cid:107) µ ( x ) dx (cid:19) / = √ κ (cid:107) f (cid:107) ,µ Z (cid:18) √ M π (cid:19) (cid:112) R ( X r , H ) . Appendix D: Proofs in Section 4
D.1. Proof of Lemma 4.1
Given two positive semidefinite matrices Σ , (cid:98) Σ ∈ R n × n . Let (cid:98) V r be the matrix consisting of the r leadingorthonormal eigenvectors of (cid:98) Σ such that (cid:98) V (cid:62) r (cid:98) V r = I r , and (cid:98) Λ r be the r × r diagonal matrices consistingof the r leading eigenvalues of (cid:98) Σ as its diagonal entries. Similarly, we denote V r and Λ r be the ma-trices consisting of the r leading orthonormal eigenvectors of Σ and the r leading eigenvalues of Σ ,respectively. We can define the orthogonal projectors (cid:98) P r = (cid:98) V r (cid:98) V (cid:62) r and (cid:98) P ⊥ = I − (cid:98) P r .6 Proof of claim 1).
Since tr( (cid:98) P ⊥ Σ (cid:98) P ⊥ ) = tr(Σ (cid:98) P ⊥ ) = tr(Σ (cid:98) P ⊥ ) = tr(Σ) − tr(Σ (cid:98) P r ) , we have tr( (cid:98) P ⊥ Σ (cid:98) P ⊥ ) = tr(Σ) − tr(Λ r ) + tr(Λ r ) − tr(Σ (cid:98) P r ) + tr( (cid:98) Σ (cid:98) P r ) − tr( (cid:98) Σ (cid:98) P r )= tr(Σ) − tr(Λ r ) + tr(Λ r ) − tr( (cid:98) Σ (cid:98) P r ) + tr(( (cid:98) Σ − Σ) (cid:98) P r ) . The definition of the eigenvalue problem (cid:98) Σ (cid:98) V r = (cid:98) V r (cid:98) Λ r gives tr( (cid:98) Σ (cid:98) P r ) = tr( (cid:98) Σ (cid:98) V r (cid:98) V (cid:62) r ) = tr( (cid:98) Λ r ) . To-gether with tr(Σ) = (cid:80) ni =1 λ i (Σ) , we have tr( (cid:98) P ⊥ Σ (cid:98) P ⊥ ) = n (cid:88) i = r +1 λ i (Σ) + tr(Λ r − (cid:98) Λ r ) + tr(( (cid:98) Σ − Σ) (cid:98) P r ) . (26)The term tr(Λ r − (cid:98) Λ r ) satisfies tr(Λ r − (cid:98) Λ r ) ≤ r (cid:88) i =1 (cid:12)(cid:12)(cid:12) λ i (Σ) − λ i ( (cid:98) Σ) (cid:12)(cid:12)(cid:12) ≤ √ r (cid:32) r (cid:88) i =1 (cid:16) λ i (Σ) − λ i ( (cid:98) Σ) (cid:17) (cid:33) / . Since Σ and (cid:98) Σ are both symmetric, applying Theorem 6.11 of [34], we have r (cid:88) i =1 (cid:16) λ i (Σ) − λ i ( (cid:98) Σ) (cid:17) ≤ n (cid:88) i =1 (cid:16) λ i (Σ) − λ i ( (cid:98) Σ) (cid:17) ≤ n (cid:88) i =1 λ i (Σ − (cid:98) Σ) = (cid:107) Σ − (cid:98) Σ (cid:107) F , which leads to tr(Λ r − (cid:98) Λ r ) ≤ √ r (cid:107) Σ − (cid:98) Σ (cid:107) F . (27)Since for any matrix A ∈ R r × r , it satisfies tr( A ) ≤ √ r (cid:113)(cid:80) ri =1 A ii ≤ √ r (cid:107) A (cid:107) F , the term tr( (cid:98) P r ( (cid:98) Σ − Σ)) satisfies tr( (cid:98) P r ( (cid:98) Σ − Σ)) = tr( (cid:98) V (cid:62) r ( (cid:98) Σ − Σ) (cid:98) V r ) ≤ √ r (cid:107) (cid:98) V (cid:62) r ( (cid:98) Σ − Σ) (cid:98) V r (cid:107) F ≤ √ r (cid:107) (cid:98) Σ − Σ (cid:107) F , (28)where the last inequality follows from the property (cid:107) AB (cid:107) F = tr( ABB (cid:62) A (cid:62) ) = tr( BB (cid:62) A (cid:62) A ) ≤ (cid:107) BB (cid:62) (cid:107) tr( A (cid:62) A ) = (cid:107) B (cid:107) (cid:107) A (cid:107) F . Substituting (27) and (28) into (26), the result of claim 1) follows.
Proof of claim 2).
The approximation residual can be expressed as tr( (cid:98) P ⊥ Σ (cid:98) P ⊥ ) = tr( (cid:98) P ⊥ (cid:98) Σ (cid:98) P ⊥ ) + tr( (cid:98) P ⊥ (Σ − (cid:98) Σ) (cid:98) P ⊥ )= n (cid:88) i = r +1 λ i ( (cid:98) Σ) + tr( (cid:98) P ⊥ (Σ − (cid:98) Σ)) + tr( (cid:98) P r (Σ − (cid:98) Σ)) − tr( (cid:98) P r (Σ − (cid:98) Σ))= n (cid:88) i = r +1 λ i ( (cid:98) Σ) + tr(Σ − (cid:98) Σ) + tr( (cid:98) P r ( (cid:98) Σ − Σ)) . Applying (28), then the result of claim 2) follows. ccuracy of LIS methods D.2. Proof of Proposition 4.5
For any $(i,j)$-th component of $V(H_0,\nu)$, we have
\[
\mathrm{var}_{X\sim\nu}\Big[ \partial_i \log f(X)\, \partial_j \log f(X)\, \frac{\mu(X)}{\nu(X)} \Big] \le \mathbb{E}_{X\sim\nu}\Big[ \Big( \partial_i \log f(X)\, \partial_j \log f(X)\, \frac{\mu(X)}{\nu(X)} \Big)^2 \Big].
\]
This way, summing over all indices, we have
\[
V(H_0,\nu) \le \mathbb{E}_{X\sim\nu}\Big[ \sum_{i,j=1}^d \Big( \partial_i \log f(X)\, \partial_j \log f(X)\, \frac{\mu(X)}{\nu(X)} \Big)^2 \Big] = \mathbb{E}_{X\sim\nu}\Big[ \|\nabla \log f(X)\|^4 \Big( \frac{\mu(X)}{\nu(X)} \Big)^2 \Big].
\]
The bound on $V(H_1,\nu)$ can be shown in a similar way by replacing $\mu$ with $\pi$.

Appendix E: Proofs in Section 5.1
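To illustrate the quantity being bounded, the sketch below estimates an $H_0$-type Gram matrix by importance sampling from a proposal $\nu$ and compares the summed per-sample variances of its entries with the fourth-moment bound above. The Gaussian setup, proposal, and sample sizes are illustrative assumptions, and the normalization of $V$ used here (per-sample variance) may differ from the paper's exact definition.

```python
import numpy as np

rng = np.random.default_rng(3)

# Importance-sampling estimate of H0 = E_mu[grad log f grad log f^T] from nu,
# and the fourth-moment quantity that bounds the summed entrywise variance.
d, y, sig = 2, np.array([1.0, -0.5]), 0.5
grad_log_f = lambda x: (y - x) / sig**2            # f(x) = exp(-|x - y|^2 / (2 sig^2))

nu_std = 1.5                                       # mu = N(0, I), nu = N(0, nu_std^2 I)
log_w = lambda x: -0.5 * np.sum(x**2) * (1.0 - 1.0 / nu_std**2) + d * np.log(nu_std)

M, n_rep = 200, 500
estimates = np.zeros((n_rep, d, d))
fourth_moment = 0.0
for r in range(n_rep):
    for _ in range(M):
        x = nu_std * rng.standard_normal(d)        # X ~ nu
        g, w = grad_log_f(x), np.exp(log_w(x))     # w = mu(x) / nu(x)
        estimates[r] += np.outer(g, g) * w / M
        fourth_moment += np.sum(g**2) ** 2 * w**2 / (M * n_rep)

summed_variance = M * estimates.var(axis=0).sum()  # ~ sum of per-sample variances
print("summed entrywise variance:", summed_variance)
print("bound E_nu[|grad log f|^4 (mu/nu)^2]:", fourth_moment)
```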
E.1. Proof of Proposition 5.1
We denote the composite transition density of lines 3–5 of Algorithm 1 as Q ( x, x (cid:48) ) . We first verify thedetailed balanced relation π ( x ) Q ( x, x (cid:48) ) = π ( x (cid:48) ) Q ( x (cid:48) , x ) . Note that for x r (cid:54) = x (cid:48) r , x ⊥ (cid:54) = x (cid:48)⊥ , the overalltransition density is Q ( x, x (cid:48) ) = p ( x r , x (cid:48) r ) β ( x r , x (cid:48) r ) µ ( x (cid:48)⊥ | x (cid:48) r ) α ( x, x (cid:48) ) . This leads to π ( x ) Q ( x, x (cid:48) ) = π ( x ) p ( x r , x (cid:48) r ) β ( x r , x (cid:48) r ) µ ( x (cid:48)⊥ | x (cid:48) r ) α ( x, x (cid:48) ) ( by detailed balance of β ) = π ( x ) π r ( x (cid:48) r ) π r ( x r ) p ( x (cid:48) r , x r ) β ( x (cid:48) r , x r ) µ ( x (cid:48)⊥ | x (cid:48) r ) α ( x, x (cid:48) )= p ( x (cid:48) r , x r ) β ( x (cid:48) r , x r ) π ( x (cid:48) ) π ( x ) π ( x (cid:48) ) π r ( x (cid:48) r ) π r ( x r ) µ ( x (cid:48)⊥ | x (cid:48) r ) (cid:18) ∧ f ( x (cid:48) ) f r ( x r ) f ( x ) f r ( x (cid:48) r ) (cid:19) = p ( x (cid:48) r , x r ) β ( x (cid:48) r , x r ) π ( x (cid:48) ) f ( x ) µ ( x ) f ( x (cid:48) ) µ ( x (cid:48) ) f r ( x (cid:48) r ) µ ( x (cid:48) ) f r ( x r ) µ ( x r ) (cid:18) ∧ f ( x (cid:48) ) f r ( x r ) f ( x ) f r ( x (cid:48) r ) (cid:19) = p ( x (cid:48) r , x r ) β ( x (cid:48) r , x r ) µ ( x ⊥ | x r ) π ( x (cid:48) ) f ( x ) f r ( x (cid:48) r ) f ( x (cid:48) ) f r ( x r ) (cid:18) ∧ f ( x (cid:48) ) f r ( x r ) f ( x ) f r ( x (cid:48) r ) (cid:19) = p ( x (cid:48) r , x r ) β ( x (cid:48) r , x r ) µ ( x ⊥ | x r ) π ( x (cid:48) ) (cid:18) ∧ f ( x ) f r ( x (cid:48) r ) f ( x (cid:48) ) f r ( x r ) (cid:19) = p ( x (cid:48) r , x r ) β ( x (cid:48) r , x r ) µ ( x ⊥ | x r ) π ( x (cid:48) ) α ( x (cid:48) , x ) = π ( x (cid:48) ) Q ( x (cid:48) , x ) . For the case x r = x (cid:48) r , x ⊥ (cid:54) = x (cid:48)⊥ , the overall transition density is Q ( x, x (cid:48) ) = δ x r = x (cid:48) r β c ( x r ) µ ( x (cid:48)⊥ | x r ) α ( x, x (cid:48) ) , β c ( x r ) = 1 − (cid:90) p ( x r , y r ) β ( x r , y r ) dy r . x r = x (cid:48) r , we have π ( x ) Q ( x, x (cid:48) ) = π ( x ) δ x r = x (cid:48) r β c ( x r ) µ ( x (cid:48)⊥ | x r ) α ( x, x (cid:48) )= δ x r = x (cid:48) r β c ( x r ) f r ( x (cid:48) r ) f r ( x r ) π ( x ) π ( x (cid:48) ) π ( x (cid:48) ) µ ( x (cid:48)⊥ | x r ) α ( x, x (cid:48) )= δ x r = x (cid:48) r β c ( x r ) f r ( x (cid:48) r ) f r ( x r ) f ( x ) µ ( x ⊥ | x r ) f ( x (cid:48) ) µ ( x (cid:48)⊥ | x (cid:48) r ) π ( x (cid:48) ) µ ( x (cid:48)⊥ | x r ) α ( x, x (cid:48) )= π ( x (cid:48) ) δ x r = x (cid:48) r β c ( x r ) µ ( x ⊥ | x r ) α ( x (cid:48) , x ) = π ( x (cid:48) ) Q ( x (cid:48) , x ) . Note that if the proposal is rejected for the x ⊥ part, then the x r is also rejected. So the case that x r (cid:54) = x (cid:48) r , x ⊥ = x (cid:48)⊥ does not exist. Finally, the detailed balanced relationship is trivial if x = x (cid:48) . Inconclusion, the detailed balance relation holds, so π is the invariant density of Algorithm 1.Next we investigate the acceptance rate of complement transition. If we denote the MCMC transitionprobability for the x r part as P ( x r , x (cid:48) r ) = p ( x r , x (cid:48) r ) β ( x r , x (cid:48) r ) + β c ( x r ) δ x r = x (cid:48) r . Note that the acceptance probability can also be written as ∧ f ( x (cid:48) ) f r ( x r ) f ( x ) f r ( x (cid:48) r ) = 1 ∧ π ( x (cid:48) ) π r ( x ) π ( x ) π r ( x (cid:48) ) . 
Then, the acceptance rate is given by E (cid:2) α ( X, X (cid:48) ) (cid:3) = (cid:90) π ( x ) P ( x r , x (cid:48) r ) µ ( x (cid:48)⊥ | x (cid:48) r ) (cid:18) ∧ π ( x (cid:48) ) π r ( x ) π ( x ) π r ( x (cid:48) ) (cid:19) dxdx (cid:48) = (cid:90) P ( x r , x (cid:48) r ) µ ( x (cid:48)⊥ | x (cid:48) r ) π r ( x ) (cid:20) π ( x ) π r ( x ) ∧ π ( x (cid:48) ) π r ( x (cid:48) ) (cid:21) dxdx (cid:48) Therefore if we denote the likelihood ratio b ( x ) = π ( x ) π r ( x ) , then the average rejection probability is − E (cid:2) α ( X, X (cid:48) ) (cid:3) = (cid:90) P ( x r , x (cid:48) r ) π r ( x ) µ ( x (cid:48)⊥ | x (cid:48) r ) (cid:2) (1 − b ( x (cid:48) )) ∨ (1 − b ( x )) (cid:3) dxdx (cid:48) . To continue, we note that for any b ≥ , − b ≤ − √ b ≤ | − √ b | , therefore (1 − b ( x )) ∨ (1 − b ( x (cid:48) )) ≤ | − (cid:112) b ( x ) | ∨ | − (cid:112) b ( x (cid:48) ) | ≤ | − (cid:112) b ( x ) | + | − (cid:112) b (cid:48) ( x ) | . As a consequence, − E (cid:2) α ( X, X (cid:48) ) (cid:3) ≤ (cid:90) P ( x r , x (cid:48) r ) π r ( x ) µ ( x (cid:48)⊥ | x (cid:48) r ) | − (cid:112) b ( x ) | dxdx (cid:48) + (cid:90) P ( x r , x (cid:48) r ) π r ( x ) µ ( x (cid:48)⊥ | x (cid:48) r ) | − (cid:112) b ( x (cid:48) ) | dxdx (cid:48) (note that ( P ( x r , x (cid:48) r ) π r ( x ) µ ( x (cid:48)⊥ | x (cid:48) r ) = P ( x (cid:48) r , x r ) π r ( x (cid:48) ) µ ( x ⊥ | x r ) ) = 2 (cid:90) P ( x r , x (cid:48) r ) π r ( x ) µ ( x (cid:48)⊥ | x (cid:48) r ) | − (cid:112) b ( x ) | dxdx (cid:48) = 4 (cid:90) π r ( x ) | − (cid:112) b ( x ) | dx. ccuracy of LIS methods (cid:90) π r ( x ) | − (cid:112) b ( x ) | dx ≤ (cid:115)(cid:90) π r ( x )(1 − (cid:112) b ( x )) dx (cid:115)(cid:90) π r ( x ) dx ≤ √ D H ( π, π r ) . In summary, we have E (cid:2) α ( X, X (cid:48) ) (cid:3) ≥ − √ D H ( π, π r ) . E.2. Proof of Proposition 5.2
Recall that π k ( x ) = f ( x ) β k Z k , π k +1 ( x ) π k ( x ) = Z k f ( x ) δ Z k +1 , ∇ log π k +1 ( x ) π k ( x ) = δ ∇ log f ( x ) . By Proposition 4.5, we have V k +1 ( H , π k ) ≤ δ Z k Z k +1 (cid:90) (cid:107)∇ log f ( x ) (cid:107) f ( x ) δ f ( x ) β k µ ( x ) dx = δ Z k Z k +1 (cid:90) (cid:107)∇ log f ( x ) (cid:107) f ( x ) β k +1 + δ µ ( x ) dx ≤ δ Z k Z k +1 max (cid:26)(cid:90) (cid:107)∇ log f ( x ) (cid:107) µ ( x ) dx, (cid:90) (cid:107)∇ log f ( x ) (cid:107) f ( x ) δ µ ( x ) dx (cid:27) , since β k ≤ for all k . Appendix F: Proofs in Section 6
F.1. Proof of Proposition 6.1
For the linear inverse problem, the likelihood function and its log-gradients are given by f ( x ) = exp (cid:18) − (cid:107) Ax − y (cid:107) (cid:19) , ∇ log f ( x ) = A (cid:62) ( Ax − y ) . The posterior of x is given by π ( x ) ∼ N (( C A + I ) − u, ( C A + I ) − ) , C A = A (cid:62) A, u = A (cid:62) y. Proof of claim 1).
The H matrix is given by H = E µ (cid:104) ∇ log f ( x ) ∇ log f ( x ) (cid:62) = A (cid:62) ( AA (cid:62) + yy (cid:62) ) A = C A + U (cid:105) . where U = A (cid:62) yy (cid:62) A (cid:62) is a rank 1 matrix. Apply Lemma A.3 with H = C A + U , we have λ k +1 ( H ) ≤ λ k ( C A ) . Proof of claim 2).
The H matrix is given by H = E π (cid:104) ∇ log f ( x ) ∇ log f ( x ) (cid:62) (cid:105) = E π (cid:104) A (cid:62) ( Ax − y )( Ax − y ) (cid:62) A (cid:105) . For a given y , the vector A (cid:62) ( Ax − y ) follows a Gaussian distribution with the mean A (cid:62) (cid:16) A ( C A + I ) − u − y (cid:17) = C A ( C A + I ) − u − A (cid:62) y = C A ( C A + I ) − u − u = − ( C A + I ) − u and the covariance C A ( C A + I ) − C A . This way, we have H = C A ( C A + I ) − C A + ( C A + I ) − U ( C A + I ) − , where ( C A + I ) − U ( C A + I ) − is a rank- matrix. Thus, by Lemma A.3, we have λ k +1 ( H ) ≤ λ k ( C A ( C A + I ) − C A ) . Finally note that eigenvectors of C A ( I + C A ) − C A are identical with the eigenvectors of C A , with v (cid:62) k C A ( I + C A ) − C A v k = λ k ( C A ) λ k ( C A ) , and we have our claim. Proof of claim 3).
Note that normalizing constant Z is given as the integration over the function Z = (cid:90) (2 π ) − d/ exp (cid:18) − (cid:107) Ax − y (cid:107) − (cid:107) x (cid:107) (cid:19) dx, which is equivalent to (2 π ) − d y / Z = (cid:90) (2 π ) − d/ − d y / exp (cid:18) − (cid:107) Ax − y (cid:107) − (cid:107) x (cid:107) (cid:19) dx, where the integrand of the right hand side is the joint probability density of x and y . This way, (2 π ) − d y / Z is the marginal probability density of y follows the Gaussian distribution N (0 , I + AA (cid:62) ) .Thus, we have Z = det( I + AA (cid:62) ) − / exp (cid:18) − y (cid:62) ( I + AA (cid:62) ) − y (cid:19) . Since det( I + AA (cid:62) ) = det( I + A (cid:62) A ) = det( I + C A ) , we have (cid:112) det( I + C A ) ≥ Z ≥ (cid:112) det( I + C A ) exp (cid:18) − (cid:107) y (cid:107) (cid:19) . Then, the result follows.
Proof of claim 4).
We have (cid:107) f (cid:107) ,µ = (2 π ) − d/ (cid:90) exp (cid:18) −(cid:107) Ax − y (cid:107) − (cid:107) x (cid:107) (cid:19) dx = (2 π ) d y / (cid:20) (2 π ) − d/ − d y / (cid:90) exp (cid:18) − (cid:107) ˜ Ax − ˜ y (cid:107) − (cid:107) x (cid:107) (cid:19) dx (cid:21) ccuracy of LIS methods ˜ A = √ A and ˜ y = √ y . Note that the term in the squared bracket is equivalent to the normalizingconstant of the posterior defined by the prior N (0 , I d ) , the parameter-to-observable map ˜ A , and thedata ˜ y . This way, applying a similar identity to that in claim 3), we have (cid:107) f (cid:107) ,µ = det( I + ˜ A ˜ A (cid:62) ) − / exp (cid:18) −
12 ˜ y (cid:62) ( I + ˜ A ˜ A (cid:62) ) − ˜ y (cid:19) = 1 (cid:112) det( I + 2 C A ) exp (cid:16) − y (cid:62) ( I + 2 AA (cid:62) ) − y (cid:17) . This leads to (cid:107) f (cid:107) ,µ Z = det( I + C A ) / det( I + 2 C A ) / exp (cid:18) y (cid:62) ( I + AA (cid:62) ) − y − y (cid:62) ( I + 2 AA (cid:62) ) − y (cid:19) . (29)Using the singular value decomposition of A , one can show that ( I + AA (cid:62) ) − − ( I + 2 AA (cid:62) ) − (cid:22) √ − √ I where A (cid:22) B denotes B − A is positive semidefinite. Thus, we obtain the upper bound (cid:107) f (cid:107) ,µ Z ≤ (cid:32) d (cid:89) k =1 λ i ( C A ) + λ i ( C A ) λ i ( C A ) (cid:33) / exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) ≤ d (cid:89) k =1 (cid:16) λ i ( C A ) (cid:17) / exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) = det( I + C A ) / exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) . Proof of claim 5).
Recall that ∇ log f ( x ) = A (cid:62) ( Ax − y ) , we introduce the random variable ζ = A (cid:62) Ax − A (cid:62) y that follows the Gaussian distribution p ( ζ ) = N ( − A (cid:62) y, C A ) . Employing the upperbound established in Proposition 4.5, we have V ( H , µ ) ≤ E µ (cid:104) (cid:107)∇ log f ( X ) (cid:107) (cid:105) = E ζ ∼ p (cid:104)(cid:16) d (cid:88) i =1 ζ i (cid:17) (cid:105) = E ζ ∼ p (cid:104) (cid:88) i,j ζ i ζ j (cid:105) . For i = j , we have E [ ζ i ] = 3 (cid:0) E [ ζ i ] (cid:1) , and for i (cid:54) = j , we have E [ ζ i ζ j ] ≤ (cid:113) E [ ζ i ] E [ ζ j ] = 3 E [ ζ i ] E [ ζ j ] . This leads to E ζ ∼ p (cid:104) (cid:88) i,j ζ i ζ j (cid:105) ≤ (cid:88) i,j E [ ζ i ] E [ ζ j ] = 3 (cid:16) d (cid:88) i E [ ζ i ] (cid:17) . Since ζ ∼ N ( − A (cid:62) y, C A ) , we have d (cid:88) i E [ ζ i ] = tr( C A ) + tr( A (cid:62) yy (cid:62) A ) = d (cid:88) i =1 λ i ( C A ) + (cid:107) A (cid:62) y (cid:107) . V ( H , µ ) ≤ (cid:16) d (cid:88) i =1 λ i ( C A ) + (cid:107) A (cid:62) y (cid:107) (cid:17) ≤ (cid:16) d (cid:88) i =1 λ i ( C A ) (cid:17) + 6 (cid:107) A (cid:62) y (cid:107) . For the case k = 1 , we have the identity V ( H , µ ) ≤ E µ (cid:20) (cid:107)∇ log f ( X ) (cid:107) π ( X ) µ ( X ) (cid:21) = 1 Z (cid:90) (cid:107)∇ log f ( x ) (cid:107) f ( x ) µ ( x ) dx. Defining a new distribution with the density π ( x ) = 1 Z f ( x ) µ ( x ) dx,π ( x ) has the mean I + 2 C A ) − A (cid:62) y , the covariance ( I + 2 C A ) − , and the normalizing constant Z = (cid:107) f (cid:107) ,µ that satisfies Z Z ≤ d (cid:89) k =1 (cid:16) λ i ( C A ) (cid:17) / exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) = (cid:113) det( I + C A ) exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) , as the result of claim 4). We express the upper bound on the variance as V ( H , µ ) ≤ E µ (cid:20) (cid:107)∇ log f ( X ) (cid:107) π ( X ) µ ( X ) (cid:21) = Z Z E π (cid:104) (cid:107)∇ log f ( X ) (cid:107) (cid:105) = Z Z E π (cid:104) (cid:107) C A X − A (cid:62) y ) (cid:107) (cid:105) . Similar to the proof of the first part, we can introduce ζ = C A X − A (cid:62) y , which follows the Gaussiandistribution with the mean C A ( I + 2 C A ) − A (cid:62) y − A (cid:62) y = − ( I + 2 C A ) − A (cid:62) y, and the covariance C A ( I + 2 C A ) − C A . This leads to d (cid:88) i E [ ζ i ] = tr( C A ( I + 2 C A ) − C A ) + (cid:107) ( I + 2 C A ) − A (cid:62) y (cid:107) = d (cid:88) i =1 λ i ( C A ) λ i ( C A ) + (cid:107) ( I + 2 C A ) − A (cid:62) y (cid:107) ≤ d (cid:88) i =1 λ i ( C A ) λ i ( C A ) + (cid:107) A (cid:62) y (cid:107) . Thus, following a similar derivation to the case k = 0 , we have V ( H , µ ) ≤ (cid:113) det( I + C A ) exp (cid:32) √ − √ (cid:107) y (cid:107) (cid:33) (cid:32) d (cid:88) i =1 λ i ( C A ) λ i ( C A ) (cid:33) + (cid:107) A (cid:62) y (cid:107) . ccuracy of LIS methods Proof of claim 6).
Consider the tempered target density π β ( x ) = 1 Z β f ( x ) β µ ( x ) = 1(2 π ) d/ Z β exp (cid:18) − (cid:107) (cid:112) βAx − (cid:112) βy (cid:107) − (cid:107) x (cid:107) (cid:19) , where the normalizing constant takes the form Z β = det( I + βAA (cid:62) ) − / exp (cid:18) − β y (cid:62) ( I + βAA (cid:62) ) − y (cid:19) . We use the shorthand notations π k and Z k to denote π β k and Z β k , respectively. Let δ = β k +1 − β k ,we have V k +1 ( H , π k ) ≤ Z k Z k +1 E π k (cid:104) δ (cid:107)∇ log f ( X ) (cid:107) f ( X ) δ (cid:105) = δ Z k Z k +1 E µ (cid:104) (cid:107)∇ log f ( X ) (cid:107) f ( X ) δ + β k (cid:105) . Following a similar procedure in the proof of claim 5), we define a new distribution with the density π τ ( x ) = 1 Z τ f ( x ) τ µ ( x ) dx, where τ = 2 δ + β k = 2 β k +1 − β k = β k +1 + δ . The density π τ ( x ) has the mean τ ( I + τ C A ) − A (cid:62) y and the covariance ( I + τ C A ) − . We express the upper bound on the variance as V k +1 ( H , µ ) ≤ δ Z k Z k +1 E µ (cid:104) (cid:107)∇ log f ( X ) (cid:107) f ( X ) τ (cid:105) = δ Z τ Z k Z k +1 E π τ (cid:104) (cid:107)∇ log f ( X ) (cid:107) (cid:105) = δ Z τ Z k Z k +1 E π τ (cid:104) (cid:107) C A X − A (cid:62) y (cid:107) (cid:105) . (30)Then, we introduce ζ = C A X − A (cid:62) y , which follows the Gaussian distribution with the mean − ( I + τ C A ) − A (cid:62) y, and the covariance C A ( I + τ C A ) − C A . This leads to d (cid:88) i E [ ζ i ] = tr( C A ( I + τ C A ) − C A ) + (cid:107) ( I + τ C A ) − A (cid:62) y (cid:107) ≤ d (cid:88) i =1 λ i ( C A ) τ λ i ( C A ) + (cid:107) A (cid:62) y (cid:107) , which yields E π τ (cid:104) (cid:107) C A X − A (cid:62) y ) (cid:107) (cid:105) ≤ (cid:32)(cid:16) d (cid:88) i =1 λ i ( C A ) τ λ i ( C A ) (cid:17) + (cid:107) A (cid:62) y (cid:107) (cid:33) . (31)4The ratio between normalizing constants in (30) can be expressed as Z τ Z k Z k +1 = det( I + β k +1 AA (cid:62) ) (cid:112) det( I + β k AA (cid:62) ) det( I + τ AA (cid:62) ) exp (cid:18) y (cid:62) T y (cid:19) , where T = 2 β k +1 ( I + β k +1 AA (cid:62) ) − − β k ( I + β k AA (cid:62) ) − − τ ( I + τ AA (cid:62) ) − . In the above equation, the ratio between determinants can be expressed as det( I + β k +1 AA (cid:62) ) (cid:112) det( I + β k AA (cid:62) ) det( I + τ AA (cid:62) ) = (cid:32) d (cid:89) i =1 β k +1 λ i ( C A ) + β k +1 λ i ( C A ) β k +1 λ i ( C A ) + β k τ λ i ( C A ) (cid:33) / . Since β k +1 − ( β k +1 − β k ) = β k (2 β k +1 − β k ) , which gives β k +1 = δ + β k τ , and thus det( I + β k +1 AA (cid:62) ) (cid:112) det( I + β k AA (cid:62) ) det( I + τ AA (cid:62) ) = d (cid:89) i =1 (cid:18) δ λ i ( C A ) β k +1 λ i ( C A ) + β k τ λ i ( C A ) (cid:19) / ≤ d (cid:89) i =1 (cid:16) δ λ i ( C A ) (cid:17) / . (32)Since the matrix T and the matrix AA (cid:62) share the same eigenvectors, the eigenvalues of T can beexpressed as λ i ( T ) = β k β k +1 λ i ( AA (cid:62) ) − β k β k λ i ( AA (cid:62) ) + τ β k +1 λ i ( AA (cid:62) ) − τ τ λ i ( AA (cid:62) )= − β k δλ i ( AA (cid:62) )(1 + β k +1 λ i ( AA (cid:62) ))(1 + β k λ i ( AA (cid:62) )) + τ δλ i ( AA (cid:62) )(1 + β k +1 λ i ( AA (cid:62) ))(1 + τ λ i ( AA (cid:62) ))= δλ i ( AA (cid:62) )1 + β k +1 λ i ( AA (cid:62) ) (cid:18) τ τ λ i ( AA (cid:62) ) − β k β k λ i ( AA (cid:62) ) (cid:19) = 2 δ λ i ( AA (cid:62) )(1 + β k +1 λ i ( AA (cid:62) ))(1 + τ λ i ( AA (cid:62) ))(1 + β k λ i ( AA (cid:62) )) . This way, we have λ i ( T ) ≤ δ λ i ( AA (cid:62) ) , and thus T (cid:22) δ AA (cid:62) . 
This leads to exp (cid:18) y (cid:62) T y (cid:19) ≤ exp (cid:16) δ (cid:107) A (cid:62) y (cid:107) (cid:17) . Substituting the above inequality, (31), and (32) into (30), we have V k +1 ( H , µ ) ≤ δ (cid:113) det( I + δ C A ) exp (cid:16) δ (cid:107) A (cid:62) y (cid:107) (cid:17) (cid:32)(cid:16) d (cid:88) i =1 λ i ( C A ) τ λ i ( C A ) (cid:17) + (cid:107) A (cid:62) y (cid:107) (cid:33) . ccuracy of LIS methods F.2. Proof of Corollary 6.2
We will first show that the eigenvalues of C A are bounded by those of Γ . By the Courant–Fischer–Weylmin-max principle, we note that for any symmetric matrix Cλ r ( C ) = min V { max x ∈ V { x (cid:62) Cx, (cid:107) x (cid:107) = 1 } , dim ( V ) = d − r + 1 } . Let the eigenvectors of Γ be v , . . . , v d . Now we pick V = span { v r , . . . , v d } . Then for any x ∈ V ofunit l norm, x (cid:62) C A x = x (cid:62) Γ / G (cid:62) G Γ / x = (cid:107) G Γ / x (cid:107) ≤ (cid:107) G (cid:107) (cid:107) Γ / x (cid:107) ≤ (cid:107) G (cid:107) λ r (Γ) . Therefore, λ r ( C A ) ≤ (cid:107) G (cid:107) λ r (Γ) .The following identities are useful: For r ≥ , we have d (cid:88) j = r j − α ≤ (cid:90) ∞ r − x − α dx ≤ ( r − − α α − , and for r = 1 , we have d (cid:88) j =2 j − α ≤ αα − . For any a > , we have d (cid:89) j = r (1 + aλ j ( C A )) ≤ d (cid:89) j = r (1 + a (cid:107) G (cid:107) λ j (Γ)) ≤ exp a (cid:107) G (cid:107) d (cid:88) j = r λ j (Γ) = exp a (cid:107) G (cid:107) C Γ d (cid:88) j = r j − α . Thus, we have d (cid:88) i =1 λ i ( C A ) ≤ (cid:107) G (cid:107) C α α − , (33)and det( I + aC A ) ≤ exp (cid:18) a (cid:107) G (cid:107) C α α − (cid:19) . (34)Then, replacing the estimates in Proposition 6.1 with these upper bounds, the results follows. Acknowledgments
X. Tong’s research is supported by MOE Academic Research Funds R-146-000-292-114. T. Cui acknowledges support from the Australian Research Council.
References

[1] Agapiou, S., Dashti, M. and Helin, T. (2018). Rates of contraction of posterior distributions based on p-exponential priors. arXiv preprint arXiv:1811.12244.
[2] Agapiou, S., Papaspiliopoulos, O., Sanz-Alonso, D. and Stuart, A. M. (2017). Importance sampling: intrinsic dimension and computational cost. Statistical Science.
[3] Agapiou, S., Roberts, G. O. and Vollmer, S. J. (2018). Unbiased Monte Carlo: posterior estimation for intractable/infinite-dimensional models. Bernoulli.
[4] Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics.
[5] Andrieu, C. and Vihola, M. (2015). Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability.
[6] Beskos, A., Crisan, D. and Jasra, A. (2014). On the stability of sequential Monte Carlo methods in high dimensions. The Annals of Applied Probability.
[7] Beskos, A., Girolami, M., Lan, S., Farrell, P. E. and Stuart, A. M. (2017). Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics.
[8] Beskos, A., Jasra, A., Law, K., Marzouk, Y. and Zhou, Y. (2018). Multilevel sequential Monte Carlo with dimension-independent likelihood-informed proposals. SIAM/ASA Journal on Uncertainty Quantification.
[9] Bigoni, D., Zahm, O., Spantini, A. and Marzouk, Y. (2019). Greedy inference with layers of lazy maps. arXiv preprint arXiv:1906.00031.
[10] Bobkov, S. G. (1999). Isoperimetric and analytic inequalities for log-concave probability measures. The Annals of Probability.
[11] Bobkov, S. and Ledoux, M. (1997). Poincaré’s inequalities and Talagrand’s concentration phenomenon for the exponential distribution. Probability Theory and Related Fields.
[12] Bobkov, S. G. and Ledoux, M. (2000). From Brunn-Minkowski to Brascamp-Lieb and to logarithmic Sobolev inequalities. Geometric & Functional Analysis GAFA.
[13] Bobkov, S. G. and Ledoux, M. (2009). Weighted Poincaré-type inequalities for Cauchy and other convex measures. The Annals of Probability.
[14] Brascamp, H. J. and Lieb, E. H. (1976). On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. Journal of Functional Analysis.
[15] Bui-Thanh, T., Burstedde, C., Ghattas, O., Martin, J., Stadler, G. and Wilcox, L. C. (2012). Extreme-scale UQ for Bayesian inverse problems governed by PDEs. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
[16] Bui-Thanh, T., Ghattas, O., Martin, J. and Stadler, G. (2013). A computational framework for infinite-dimensional Bayesian inverse problems Part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, A2494–A2523.
[17] Constantine, P. G., Kent, C. and Bui-Thanh, T. (2016). Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, A2779–A2805.
[18] Cotter, S. L., Roberts, G. O., Stuart, A. M. and White, D. (2013). MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science.
[19] Cui, T. and Dolgov, S. (2020). Deep composition of tensor trains using squared inverse Rosenblatt transports. arXiv preprint arXiv:2007.06968.
[20] Cui, T., Fox, C. and O’Sullivan, M. J. (2011). Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis Hastings algorithm. Water Resources Research.
[21] Cui, T., Law, K. J. H. and Marzouk, Y. M. (2016). Dimension-independent likelihood-informed MCMC. Journal of Computational Physics.
[22] Cui, T. and Zahm, O. (2020). Data-free likelihood-informed dimension reduction of Bayesian inverse problems. hal preprint hal-02938064.
[23] Cui, T., Martin, J., Marzouk, Y. M., Solonen, A. and Spantini, A. (2014). Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems.
[24] Dashti, M. and Stuart, A. M. (2011). Uncertainty quantification and weak approximation of an elliptic inverse problem. SIAM Journal on Numerical Analysis.
[25] Detommaso, G., Cui, T., Marzouk, Y., Spantini, A. and Scheichl, R. (2018). A Stein variational Newton method. Advances in Neural Information Processing Systems.
[26] Dodwell, T. J., Ketelsen, C., Scheichl, R. and Teckentrup, A. L. (2019). Multilevel Markov chain Monte Carlo. SIAM Review.
[27] Drineas, P. and Ipsen, I. C. (2019). Low-rank matrix approximations do not need a singular value gap.
SIAM Journal on Matrix Analysis and Applications ROSS , L. (1975). Logarithmic sobolev inequalities.
American Journal of Mathematics AARIO , H., S
AKSMAN , E., T
AMMINEN , J. et al. (2001). An adaptive Metropolis algorithm.
Bernoulli AARIO , H., L
AINE , M., L
EHTINEN , M., S
AKSMAN , E. and T
AMMINEN , J. (2004). Markovchain Monte Carlo methods for high dimensional inversion in remote sensing.
Journal of theRoyal Statistical Society: series B (statistical methodology) GLESIAS , M. A., L IN , K. and S TUART , A. M. (2014). Well-posed Bayesian geometric inverseproblems arising in subsurface flow.
Inverse Problems AIPIO , J. P., K
OLEHMAINEN , V., S
OMERSALO , E. and V
AUHKONEN , M. (2000). Statisticalinversion and Monte Carlo sampling methods in electrical impedance tomography.
Inverse prob-lems ARHUNEN , K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung.
Ann. Acad.Sci. Fennicae. Ser. A. I. Math.-Phys ATO , T. (1982).
A Short Introduction to Perturbation Theory for Linear Operators . Springer-Verlag.[35] L
EDOUX , M. (1994). A simple analytic proof of an inequality by P. Buser.
Proceedings of theAmerican mathematical society IE , H. C., S ULLIVAN , T. J. and T
ECKENTRUP , A. (2019). Error bounds for some approximateposterior measures in Bayesian inference. arXiv preprint arXiv:1911.05669 .[37] L IU , Q. and W ANG , D. (2016). Stein variational gradient descent: A general purpose bayesianinference algorithm.
Advances in neural information processing systems OÈVE , M. (1978).
Probability theory, Vol. II , 4 ed.
Graduate Texts in Mathematics . Springer-Verlag, Berlin.[39] M ARTIN , J., W
ILCOX , L. C., B
URSTEDDE , C. and G
HATTAS , O. (2012). A stochastic NewtonMCMC method for large-scale statistical inverse problems with application to seismic inversion.
SIAM Journal on Scientific Computing A1460–A1487.[40] M
ARZOUK , Y., M
OSELHY , T., P
ARNO , M. and S
PANTINI , A. (2016). Sampling via measuretransport: An introduction.
Handbook of uncertainty quantification
OHAMAD , M. A. and S
APSIS , T. P. (2015). Probabilistic description of extreme events inintermittently unstable dynamical systems excited by correlated stochastic processes.
SIAM/ASAJournal on Uncertainty Quantification ORZFELD , M., T
ONG , X. T. and M
ARZOUK , Y. M. (2019). Localization for MCMC: samplinghigh-dimensional posterior distributions with local structure.
Journal of Computational Physics
URRAY , I., M AC K AY , D. and A DAMS , R. P. (2008). The Gaussian process density sampler.
Advances in Neural Information Processing Systems TTO , F. and V
ILLANI , C. (2000). Generalization of an inequality by Talagrand and links withthe logarithmic Sobolev inequality.
Journal of Functional Analysis
ARENTE , M. T., W
ALLIN , J., W
OHLMUTH , B. et al. (2020). Generalized bounds for activesubspaces.
Electronic Journal of Statistics ETRA , N., M
ARTIN , J., S
TADLER , G. and G
HATTAS , O. (2014). A computational frameworkfor infinite-dimensional Bayesian inverse problems, Part II: Stochastic Newton MCMC with ap-plication to ice sheet flow inverse problems.
SIAM Journal on Scientific Computing A1525–A1555.[47] R
UDOLF , D. and S
PRUNGK , B. (2018). On a generalization of the preconditioned Crank–Nicolson Metropolis algorithm.
Foundations of Computational Mathematics ANZ -A LONSO , D. (2018). Importance sampling and necessary sample size: an informationtheory approach.
SIAM/ASA Journal on Uncertainty Quantification PANTINI , A., B
IGONI , D. and M
ARZOUK , Y. (2018). Inference via low-dimensional couplings.
The Journal of Machine Learning Research PANTINI , A., S
OLONEN , A., C UI , T., M ARTIN , J., T
ENORIO , L. and M
ARZOUK , Y. (2015).Optimal low-rank approximations of Bayesian linear inverse problems.
SIAM Journal on Scien-tific Computing A2451–A2487.[51] S
TEWART , G. W. (1980). The efficient generation of random orthogonal matrices with an appli-cation to condition estimators.
SIAM Journal on Numerical Analysis ABAK , E. G., T
RIGILA , G. and Z
HAO , W. (2020). Conditional density estimation and simula-tion through optimal transport.
Machine Learning
ABAK , E. G. and T
URNER , C. V. (2013). A family of nonparametric density estimation algo-rithms.
Communications on Pure and Applied Mathematics ONG , X. T., M
ORZFELD , M. and M
ARZOUK , Y. M. (2020). MALA-within-Gibbs samplersfor high-dimensional distributions with sparse conditional structure.
SIAM Journal on ScientificComputing A1765–A1788.[55] T
RIGILA , G. and T
ABAK , E. G. (2016). Data-driven optimal transport.
Communications on Pureand Applied Mathematics U , Y., W ANG , T. and S
AMWORTH , R. J. (2015). A useful variant of the Davis–Kahan theoremfor statisticians.
Biometrika
AHM , O., C UI , T., L AW , K., S PANTINI , A. and M
ARZOUK , Y. (2018). Certified dimensionreduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712arXiv preprint arXiv:1807.03712