On Bayesian A- and D-optimal experimental designs in infinite dimensions
ALEN ALEXANDERIAN, PHILIP GLOOR, AND OMAR GHATTAS
Abstract.
We consider Bayesian linear inverse problems in infinite-dimensional separable Hilbert spaces, with a Gaussian prior measure and additive Gaussian noise model, and provide an extension of the concept of Bayesian D-optimality to the infinite-dimensional case. To this end, we derive the infinite-dimensional version of the expression for the Kullback-Leibler divergence from the posterior measure to the prior measure, which is subsequently used to derive the expression for the expected information gain. We also study the notion of Bayesian A-optimality in the infinite-dimensional setting, and extend the equivalence, well known in the finite-dimensional case, of the Bayes risk of the MAP estimator with the trace of the posterior covariance, for the Gaussian linear case, to the infinite-dimensional Hilbert space case.

1. Introduction
In a Bayesian inference problem one uses experimental (observed) data to update the prior state of knowledge about a parameter which often specifies certain properties of a mathematical model. The ingredients of a Bayesian inference problem include the prior measure, which encodes our prior knowledge about the inference parameter, experimental data, and the data likelihood, which describes the conditional distribution of the experimental data for a given model parameter. The solution of a Bayesian inference problem is a posterior probability law for the inference parameter. The quality of this solution, which can be measured using different criteria, depends to a large extent on the experimental data used in solving the inference problem. In practice, acquisition of such experimental data is often costly, as it requires deployment of scarce resources. Hence, the problem of optimal collection of experimental data, i.e. that of optimal experimental design (OED) [2, 17, 13], is an integral part of modeling under uncertainty. The basic problem of OED is to optimize a function of the experimental setup which describes, in a certain sense which needs to be specified, the statistical quality of the solution to the Bayesian inference problem. Note that what constitutes an experimental design depends on the application at hand. For example, in a problem involving diffusive transport of a contaminant, one may use measurements of concentration at sensor sites in the physical domain (at a certain point in time) to infer where the contaminant originated, i.e. the initial state of the concentration field. In this problem, an experimental design specifies the locations of the sensors in the physical domain. Note also that the inference parameter in this example, i.e. the initial concentration field, is a random function (random field) whose realizations belong to an appropriate function space.
Date: September 27, 2018.
2010 Mathematics Subject Classification.
Key words and phrases. Bayesian inference in Hilbert space; Gaussian measure; Kullback-Leibler divergence; Bayesian optimal experimental design; expected information gain; Bayes risk.
We consider the problem of design of experiments for inference problems whose inference parameter belongs to an infinite-dimensional separable Hilbert space. This is motivated by the recent interest in the Bayesian framework for inverse problems [16]. A Bayesian inverse problem involves inference of Hilbert space valued parameters that describe physical properties of mathematical models which are often governed by partial differential equations. The study of such problems requires a synthesis of ideas from inverse problem theory, PDE-constrained optimization, functional analysis, and probability and statistics, and has provided a host of interesting mathematical problems with a wide range of applications. The problem of design of experiments in this infinite-dimensional setting involves optimizing functionals of experimental designs which are defined in terms of operators on Hilbert spaces.

The precise definition of what is meant by an optimal design leads to the choice of a design criterion. A popular experimental design criterion, in the finite-dimensional case, is that of D-optimality, which seeks to minimize the determinant of the posterior covariance operator. The geometric intuition behind D-optimality is that of minimizing the volume of the uncertainty ellipsoid. Minimizing this determinant, however, is not meaningful in infinite dimensions, as the posterior covariance operator is a trace-class linear operator with eigenvalues that accumulate at zero. In the present work, we provide an extension of the concept of D-optimal design to the infinite-dimensional Hilbert space setting. In particular, we focus on the case of Bayesian linear inverse problems whose parameter space is an infinite-dimensional separable Hilbert space, which we denote by $\mathcal{H}$, and we assume a Gaussian prior measure and an additive Gaussian noise model. To study the concept of D-optimality in the infinite-dimensional setting we formulate the problem as that of maximizing the expected information gain, measured by the Kullback-Leibler (KL) divergence [12] from posterior to prior.
To be precise, if $\mu_{\mathrm{pr}}$ denotes the prior measure, $y$ is a vector of experimental data obtained using an experimental design specified by a vector of design parameters $\xi$, and $\mu_{\mathrm{post}}^{y,\xi}$ denotes the resulting posterior measure, the KL divergence from posterior to prior is given by,
\[
D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^{y,\xi} \,\|\, \mu_{\mathrm{pr}}\big) := \int_{\mathcal H} \log\Big\{ \frac{d\mu_{\mathrm{post}}^{y,\xi}}{d\mu_{\mathrm{pr}}} \Big\}\, d\mu_{\mathrm{post}}^{y,\xi}.
\]
(The argument of the logarithm in the above formula is the Radon-Nikodym derivative of the posterior measure with respect to the prior measure.) The experimental design criterion is then defined by averaging $D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^{y,\xi} \,\|\, \mu_{\mathrm{pr}}\big)$ over all possible experimental data. In a Bayesian inverse problem, this averaging over experimental data can be done as follows:
\[
\text{expected information gain} := \int_{\mathcal H} \int_{\mathcal Y} D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^{y,\xi} \,\|\, \mu_{\mathrm{pr}}\big)\, \pi_{\mathrm{like}}(y \,|\, u; \xi)\, dy\, \mu_{\mathrm{pr}}(du),
\]
where $\xi$ is a fixed design vector, $\mathcal Y$ denotes the space of experimental data, and $\pi_{\mathrm{like}}(y \,|\, u; \xi)$ is the data likelihood which specifies the distribution of $y$ for a given $u \in \mathcal H$. It is known in the finite-dimensional Gaussian linear case (i.e., an inference problem with Gaussian prior and noise distributions) that maximizing this expected information gain is equivalent to minimizing the determinant of the posterior covariance operator, i.e., the usual D-optimal design problem. While this does not directly extend to the infinite-dimensional case, it suggests a mathematically rigorous path to an infinite-dimensional analogue of Bayesian D-optimality. In the present work, we derive analytic expressions for the KL divergence from posterior to prior in a Hilbert space. This enables deriving the expression for the expected information gain, leading to the infinite-dimensional version of the Bayesian D-optimal experimental design criterion. We also discuss another popular experimental design criterion, that of A-optimality, in the infinite-dimensional setting.
An A-optimal design is one that minimizes the trace of the posterior covariance operator; i.e., if $\mathcal{C}_{\mathrm{post}}(\xi) : \mathcal H \to \mathcal H$ denotes the posterior covariance operator corresponding to an experimental design $\xi$, we seek to minimize $\operatorname{tr}\big(\mathcal{C}_{\mathrm{post}}(\xi)\big)$. In the statistics literature it is known (see e.g., [6]) that for a Gaussian linear inference problem in $\mathcal H = \mathbb{R}^n$, minimizing the trace of the posterior covariance matrix is equivalent to minimizing the average mean square error of the maximum a posteriori probability (MAP) estimator for the inference parameter. We provide an extension of this result to the infinite-dimensional Hilbert space setting, where we show that the trace of the posterior covariance operator (a positive, self-adjoint, and trace-class operator on $\mathcal H$) coincides with the average mean square error of the MAP estimator.

Note that the design vector $\xi$ enters the Bayesian inverse problem through the data likelihood. The exact nature of this dependence on $\xi$ is not essential to our discussion and hence, to keep the notation simple, we suppress the dependence on $\xi$ in our derivations. (See e.g., [6] for an overview of how an experimental design is incorporated in an inference problem in classical formulations.)

2. Background concepts
In this section, we outline the background concepts that are needed in the rest of this article. In what follows, $\mathcal H$ denotes an infinite-dimensional separable real Hilbert space, with inner-product $\langle\cdot,\cdot\rangle_{\mathcal H}$ and induced norm $\|\cdot\|_{\mathcal H} = \langle\cdot,\cdot\rangle_{\mathcal H}^{1/2}$.

2.1. Trace-class operators on $\mathcal H$. Let $\mathcal{L}(\mathcal H)$ denote the set of bounded linear operators on $\mathcal H$. We say $\mathcal A \in \mathcal{L}(\mathcal H)$ is positive if $\langle x, \mathcal A x\rangle_{\mathcal H} \ge 0$ for all $x \in \mathcal H$, and is strictly positive if $\langle x, \mathcal A x\rangle_{\mathcal H} > 0$ for all non-zero $x \in \mathcal H$. For $\mathcal A \in \mathcal{L}(\mathcal H)$, $|\mathcal A| = (\mathcal A^* \mathcal A)^{1/2}$, where $\mathcal A^*$ denotes the adjoint of $\mathcal A$. We say $\mathcal A$ is of trace-class if for any orthonormal basis $\{f_j\}_{j=1}^\infty$ of $\mathcal H$,
\[
\sum_{j=1}^\infty \langle |\mathcal A| f_j, f_j\rangle_{\mathcal H} < \infty.
\]
It is straightforward to show that the value of the above summation is invariant with respect to the choice of the orthonormal basis [14]. We denote by $\mathcal{L}_1(\mathcal H)$ the subspace of $\mathcal{L}(\mathcal H)$ consisting of trace-class operators. For $\mathcal A \in \mathcal{L}_1(\mathcal H)$,
\[
\operatorname{tr}(\mathcal A) = \sum_{j=1}^\infty \langle \mathcal A f_j, f_j\rangle_{\mathcal H},
\]
where the sum is finite and its value is independent of the choice of the orthonormal basis [7, 14]. Let $\mathcal{L}_1^{\mathrm{sym},+}(\mathcal H)$ be the subspace of positive self-adjoint operators in $\mathcal{L}_1(\mathcal H)$, and note that for $\mathcal A \in \mathcal{L}_1^{\mathrm{sym},+}(\mathcal H)$ there exists an orthonormal basis of eigenvectors, $\{e_j\}$, with corresponding (real, non-negative) eigenvalues, $\{\lambda_j\}$, and $\operatorname{tr}(\mathcal A) = \sum_{j=1}^\infty \langle \mathcal A e_j, e_j\rangle_{\mathcal H} = \sum_{j=1}^\infty \lambda_j$.

In what follows we shall make repeated use of the following result: if $\mathcal A \in \mathcal{L}_1(\mathcal H)$ and $\mathcal B \in \mathcal{L}(\mathcal H)$, then $\mathcal{AB}$ and $\mathcal{BA}$ both belong to $\mathcal{L}_1(\mathcal H)$ and $\operatorname{tr}(\mathcal{AB}) = \operatorname{tr}(\mathcal{BA})$; see e.g., [14] for details. Moreover, it is straightforward to show that if $\mathcal A$ is a trace-class operator and $\mathcal B : \mathcal H \to \mathbb{R}^q$ is a bounded linear operator, then $\mathcal A \mathcal B^* \mathcal B \in \mathcal{L}_1(\mathcal H)$ and $\operatorname{tr}(\mathcal A \mathcal B^* \mathcal B) = \operatorname{tr}(\mathcal B \mathcal A \mathcal B^*)$.

2.2. Borel probability measures on $\mathcal H$. We work with probability measures on the measurable space $\big(\mathcal H, \mathscr{B}(\mathcal H)\big)$, where $\mathscr{B}(\mathcal H)$ denotes the Borel sigma-algebra on $\mathcal H$; we refer to such measures as Borel probability measures.
Let $\mu$ be a Borel probability measure on $\mathcal H$ which has bounded first and second moments. The mean $m \in \mathcal H$ and covariance operator $\mathcal Q \in \mathcal{L}(\mathcal H)$ of $\mu$ are characterized as follows:
\[
\langle m, x\rangle_{\mathcal H} = \int_{\mathcal H} \langle z, x\rangle_{\mathcal H}\, \mu(dz), \qquad
\langle \mathcal Q x, y\rangle_{\mathcal H} = \int_{\mathcal H} \langle x, z - m\rangle_{\mathcal H}\, \langle y, z - m\rangle_{\mathcal H}\, \mu(dz),
\]
for all $x, y \in \mathcal H$. It is straightforward to show (see e.g., [8]) that $\mathcal Q$ belongs to $\mathcal{L}_1^{\mathrm{sym},+}(\mathcal H)$, and that
\[
\int_{\mathcal H} \|x\|_{\mathcal H}^2\, \mu(dx) = \operatorname{tr}(\mathcal Q) + \|m\|_{\mathcal H}^2. \tag{1}
\]

2.3. Gaussian measures on $\mathcal H$. In the present work, we shall be working with Gaussian measures on Hilbert spaces [8]; $\mu$ is a Gaussian measure on $(\mathcal H, \mathscr{B}(\mathcal H))$ if for every $x \in \mathcal H$ the linear functional $\langle x, \cdot\rangle_{\mathcal H}$, considered as a random variable from $(\mathcal H, \mathscr{B}(\mathcal H), \mu)$ to $(\mathbb R, \mathscr{B}(\mathbb R))$, is a (one-dimensional) Gaussian random variable. We refer the reader to [8] or [9] for the theory of Gaussian measures on Hilbert spaces. We denote a Gaussian measure with mean $m \in \mathcal H$ and covariance operator $\mathcal Q \in \mathcal{L}_1^{\mathrm{sym},+}(\mathcal H)$ by $\mathcal N(m, \mathcal Q)$. If $\mathcal Q$ satisfies $\ker(\mathcal Q) = \{0\}$, where $\ker(\mathcal Q)$ denotes the null space of $\mathcal Q$, we say that $\mathcal N(m, \mathcal Q)$ is a non-degenerate Gaussian measure.

In what follows, we shall use the following result, concerning the law of an affine transformation on $\mathcal H$: if $\mu = \mathcal N(m, \mathcal Q)$ is a Gaussian measure, $\mathcal A \in \mathcal{L}(\mathcal H)$, and $b \in \mathcal H$, then $T x = \mathcal A x + b$ is a random variable on $\mathcal H$ whose law is given by $\mu_T = \mu \circ T^{-1} = \mathcal N(\mathcal A m + b,\, \mathcal A \mathcal Q \mathcal A^*)$ [8]. Thus, in particular, we note that,
\[
\int_{\mathcal H} \|T x\|_{\mathcal H}^2\, \mu(dx) = \int_{\mathcal H} \|\xi\|_{\mathcal H}^2\, \mu_T(d\xi) = \operatorname{tr}(\mathcal A \mathcal Q \mathcal A^*) + \|\mathcal A m + b\|_{\mathcal H}^2,
\]
where the last equality uses (1). It follows that if $\mathcal A \in \mathcal{L}(\mathcal H)$ is a positive self-adjoint compact operator, and $\mu = \mathcal N(m, \mathcal Q)$ is a Gaussian measure, then
\[
\int_{\mathcal H} \langle \mathcal A x, x\rangle_{\mathcal H}\, \mu(dx) = \int_{\mathcal H} \big\|\mathcal A^{1/2} x\big\|_{\mathcal H}^2\, \mu(dx) = \operatorname{tr}(\mathcal A^{1/2} \mathcal Q \mathcal A^{1/2}) + \big\langle \mathcal A^{1/2} m, \mathcal A^{1/2} m\big\rangle_{\mathcal H} = \operatorname{tr}(\mathcal A \mathcal Q) + \langle \mathcal A m, m\rangle_{\mathcal H}. \tag{2}
\]
This shows that the well-known expression for the expectation of a quadratic form on $\mathbb R^n$ extends to the infinite-dimensional Hilbert space setting.
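The affine-transformation law above is easy to sanity-check numerically in finite dimensions. The following sketch (an illustration added here, not part of the original derivation; all matrices and the random seed are arbitrary test data) estimates $\int \|Tx\|^2\,\mu(dx)$ by Monte Carlo and compares it with $\operatorname{tr}(AQA^*) + \|Am + b\|^2$:

```python
# Monte Carlo sanity check of the affine-transformation law in R^n
# (an added illustration, not part of the paper; all data are arbitrary):
# if x ~ N(m, Q) and T x = A x + b, then
#   E ||T x||^2 = tr(A Q A^T) + ||A m + b||^2.
import numpy as np

rng = np.random.default_rng(0)
n = 4
m = rng.standard_normal(n)
R = rng.standard_normal((n, n))
Q = R @ R.T + n * np.eye(n)              # symmetric positive definite covariance
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Monte Carlo estimate of E ||A x + b||^2 under N(m, Q)
X = rng.multivariate_normal(m, Q, size=200_000)
mc_estimate = np.mean(np.sum((X @ A.T + b) ** 2, axis=1))

exact = np.trace(A @ Q @ A.T) + np.sum((A @ m + b) ** 2)
rel_err = abs(mc_estimate - exact) / exact
```

With a couple hundred thousand samples, the Monte Carlo estimate typically matches the exact value to well under one percent relative error.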
It can be shown that, as in the finite-dimensional case, identity (2) holds not just for Gaussian measures, but for any Borel probability measure with mean $m$ and covariance operator $\mathcal Q$; moreover, the only requirement on the operator $\mathcal A$ is boundedness. That is, we have the following result:

Lemma 1. Let $\mu$ be a Borel probability measure on $\mathcal H$ with mean $m \in \mathcal H$ and covariance operator $\mathcal Q \in \mathcal{L}_1^{\mathrm{sym},+}(\mathcal H)$, and let $\mathcal A \in \mathcal{L}(\mathcal H)$. Then,
\[
\int_{\mathcal H} \langle \mathcal A x, x\rangle_{\mathcal H}\, \mu(dx) = \operatorname{tr}(\mathcal A \mathcal Q) + \langle \mathcal A m, m\rangle_{\mathcal H}.
\]

Proof.
See Appendix A. □
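Lemma 1 requires no Gaussianity. As a quick finite-dimensional illustration (added here, not part of the paper; all matrices and the seed are arbitrary test data), the sketch below checks the identity for a deliberately non-Gaussian measure, $x = m + Bw$ with $w$ uniform on $[-1,1]^n$, whose covariance is $Q = \tfrac{1}{3} B B^T$; it also checks the cyclic trace property $\operatorname{tr}(AQ) = \operatorname{tr}(QA)$ used repeatedly in Section 2.1:

```python
# Finite-dimensional illustration of Lemma 1 (added here, not part of the
# paper; all matrices and the seed are arbitrary test data). The measure is
# deliberately non-Gaussian: x = m + B w with w uniform on [-1, 1]^n, so the
# covariance is Q = (1/3) B B^T. Lemma 1 asserts
#   E <A x, x> = tr(A Q) + <A m, m>   for any bounded A.
import numpy as np

rng = np.random.default_rng(1)
n = 5
m = rng.standard_normal(n)
B = rng.standard_normal((n, n))
A0 = rng.standard_normal((n, n))
A = A0 @ A0.T + np.eye(n)                 # positive definite test operator
Q = (1.0 / 3.0) * B @ B.T                 # covariance of x = m + B w

W = rng.uniform(-1.0, 1.0, size=(500_000, n))
X = m + W @ B.T
mc_quadratic = np.mean(np.einsum('ij,jk,ik->i', X, A, X))  # E <A x, x>
exact = np.trace(A @ Q) + m @ A @ m
rel_err = abs(mc_quadratic - exact) / exact

# the cyclic trace property tr(AB) = tr(BA) from Section 2.1
cyclic_gap = abs(np.trace(A @ Q) - np.trace(Q @ A))
```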
2.4. Kullback-Leibler divergence.
In probability theory the Kullback-Leibler (KL) divergence, also referred to as the relative entropy, is a measure of "distance" between two probability measures. This notion was defined in [12]. While the KL divergence is not a metric (it is non-symmetric and does not satisfy the triangle inequality), it is used commonly in probability theory to describe the distance of a measure $\mu_1$ from a reference measure $\mu_2$. Also, the KL divergence does satisfy some of the intuitive notions of distance; i.e. the KL divergence from $\mu_1$ to $\mu_2$ is non-negative and is zero if and only if the two measures are the same. Let $\mu_1$ and $\mu_2$ be two Borel probability measures and suppose $\mu_1$ is absolutely continuous with respect to $\mu_2$. The KL divergence from $\mu_1$ to $\mu_2$, denoted by $D_{\mathrm{kl}}(\mu_1 \| \mu_2)$, is defined as
\[
D_{\mathrm{kl}}(\mu_1 \| \mu_2) = \int_{\mathcal H} \log\Big\{ \frac{d\mu_1}{d\mu_2} \Big\}\, d\mu_1,
\]
where $\frac{d\mu_1}{d\mu_2}$ is the Radon-Nikodym derivative of $\mu_1$ with respect to $\mu_2$. In the case $\mu_1$ is not absolutely continuous with respect to $\mu_2$, the KL divergence is $+\infty$. Notice that for Borel probability measures on $\mathbb R^n$ that admit densities with respect to the Lebesgue measure, we may rewrite the definition of the KL divergence in terms of the densities; that is, if $p_1$ and $p_2$ are Lebesgue densities, i.e., probability density functions (pdfs), of $\mu_1$ and $\mu_2$ respectively, one has $D_{\mathrm{kl}}(\mu_1 \| \mu_2) = \int_{\mathbb R^n} \log\big( p_1(x)/p_2(x) \big)\, p_1(x)\, dx$. However, in an infinite-dimensional Hilbert space, where there is no Lebesgue measure, we are forced to work with the abstract definition of KL divergence presented above.

In this paper, we will be dealing with (non-degenerate) Gaussian measures on infinite-dimensional Hilbert spaces. For Gaussian measures on $\mathbb R^n$, one can use the expression for the (multivariate) Gaussian pdfs to derive the well-known analytic expression for the KL divergence between Gaussians. In the infinite-dimensional Hilbert space setting, not only do we not have access to pdfs, but given two Gaussian measures they are not necessarily equivalent.
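For intuition, the density-based formula can be checked directly in one dimension, where the KL divergence between $\mathcal N(m_1, s_1^2)$ and $\mathcal N(m_2, s_2^2)$ has the well-known closed form $\log(s_2/s_1) + \big(s_1^2 + (m_1 - m_2)^2\big)/(2 s_2^2) - 1/2$. The sketch below (an added illustration, not from the paper; the numerical values are arbitrary) evaluates $\int \log(p_1/p_2)\, p_1\, dx$ by trapezoidal quadrature and compares:

```python
# One-dimensional illustration of the density-based KL formula (added here,
# not part of the paper; the numerical values are arbitrary). For Gaussians
# N(m1, s1^2) and N(m2, s2^2) on R, the closed form is
#   D_kl = log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2.
import numpy as np

m1, s1 = 0.7, 1.2    # "posterior-like" Gaussian
m2, s2 = 0.0, 2.0    # "prior-like" reference Gaussian

x = np.linspace(-25.0, 25.0, 200_001)
p1 = np.exp(-0.5 * ((x - m1) / s1) ** 2) / (s1 * np.sqrt(2.0 * np.pi))
p2 = np.exp(-0.5 * ((x - m2) / s2) ** 2) / (s2 * np.sqrt(2.0 * np.pi))

f = p1 * np.log(p1 / p2)                  # integrand of the density formula
dx = x[1] - x[0]
kl_quadrature = float(np.sum(f[1:] + f[:-1]) * dx / 2.0)   # trapezoid rule

kl_closed_form = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2.0 * s2**2) - 0.5
```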
In fact, given a centered Gaussian measure $\mu$ on $\mathcal H$, shifting the mean gives, $\mu$-almost surely, a Gaussian measure which is singular with respect to $\mu$; see e.g., [8, Chapter 2]. However, in the present work we consider a special case, namely that of a Bayesian linear inverse problem on $\mathcal H$ with a Gaussian prior and an additive Gaussian noise model; in this case the posterior measure is also Gaussian and is equivalent to the prior [16], and thus, $D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^y \,\|\, \mu_{\mathrm{pr}}\big)$ is well-defined. Later in the paper, we will derive the expression for the KL divergence from posterior to prior in an infinite-dimensional Hilbert space, which we shall use to derive the expression for the expected information gain.

3. Bayesian linear inverse problems in a Hilbert space
We consider the problem of inference of a parameter $u$ which belongs to an infinite-dimensional Hilbert space $\mathcal H$. All our prior knowledge regarding the parameter $u$ is encoded in a Borel probability measure on $\mathcal H$, which we refer to as the prior measure and denote by $\mu_{\mathrm{pr}}$; here we assume that $\mu_{\mathrm{pr}}$ is a Gaussian measure $\mu_{\mathrm{pr}} = \mathcal N(u_{\mathrm{pr}}, \mathcal C_{\mathrm{pr}})$. Moreover, in what follows, we assume that $\ker(\mathcal C_{\mathrm{pr}}) = \{0\}$, i.e., $\mu_{\mathrm{pr}}$ is non-degenerate. The inference problem uses experimental data $y \in \mathcal Y$ to update the prior state of knowledge on the law of the parameter $u$. Here $\mathcal Y$ is the space of the experimental data, which in the present work is $\mathcal Y = \mathbb R^q$. We assume that $u$ is a model parameter which is related to experimental data $y \in \mathcal Y$ according to the following noise model,
\[
y = \mathcal G u + \eta. \tag{3}
\]
(Recall that two measures are called equivalent if they are absolutely continuous with respect to each other.)
The operator $\mathcal G : \mathcal H \to \mathcal Y$ is the parameter-to-observable map and is assumed to be a continuous linear mapping. In practice, for a given $u$, computing $\mathcal G u$ would involve the evaluation of a mathematical model with the parameter value $u$ followed by the application of a restriction operator to extract data at pre-specified locations in space and/or time. The discrepancy between the model output $\mathcal G u$ and experimental data $y$ is modeled by $\eta$, which is a random vector that accounts for experimental noise, i.e. noise associated with the process of collecting experimental data. We assume $\eta \sim \mathcal N(0, \Gamma_{\mathrm{noise}})$, and thus, the distribution of $y \,|\, u$ is Gaussian, $y \,|\, u \sim \mathcal N(\mathcal G u, \Gamma_{\mathrm{noise}})$, with pdf
\[
\pi_{\mathrm{like}}(y \,|\, u) = \frac{1}{Z_{\mathrm{like}}} \exp\Big\{ -\frac{1}{2} (\mathcal G u - y)^T \Gamma_{\mathrm{noise}}^{-1} (\mathcal G u - y) \Big\},
\]
where $Z_{\mathrm{like}} = (2\pi)^{q/2} \det(\Gamma_{\mathrm{noise}})^{1/2}$. In what follows, we denote
\[
\Phi(u; y) = \frac{1}{2} (\mathcal G u - y)^T \Gamma_{\mathrm{noise}}^{-1} (\mathcal G u - y). \tag{4}
\]

3.1. The Bayes formula and the posterior measure.
The solution of the Bayesian inverse problem is the posterior measure, describing the law of the parameter $u$ conditioned on the experimental data $y$, and is linked to the prior measure $\mu_{\mathrm{pr}}$ through the infinite-dimensional version of Bayes Theorem [16]:
\[
\frac{d\mu_{\mathrm{post}}^y}{d\mu_{\mathrm{pr}}} = \frac{1}{Z(y)}\, \pi_{\mathrm{like}}(y \,|\, u), \tag{5}
\]
where $Z(y)$ is the normalization constant. Notice that we can rewrite Bayes Theorem as,
\[
\frac{d\mu_{\mathrm{post}}^y}{d\mu_{\mathrm{pr}}} = \frac{1}{Z(y)} \exp\{-\Phi(u; y)\}, \tag{6}
\]
with $Z(y) = \int_{\mathcal H} \exp\{-\Phi(u; y)\}\, \mu_{\mathrm{pr}}(du)$. In the Gaussian linear case, it is possible to evaluate $Z(y)$ analytically; see Lemma 2 below.

As discussed above, we consider Bayesian linear inverse problems; i.e., Bayesian inverse problems involving a linear parameter-to-observable map $\mathcal G$. It is well known [16] that for a Gaussian linear inverse problem, as specified above, the solution is a Gaussian posterior measure $\mu_{\mathrm{post}}^y = \mathcal N\big(u_{\mathrm{post}}^y, \mathcal C_{\mathrm{post}}\big)$ with,
\[
\mathcal C_{\mathrm{post}} = (\mathcal G^* \Gamma_{\mathrm{noise}}^{-1} \mathcal G + \mathcal C_{\mathrm{pr}}^{-1})^{-1}, \qquad
u_{\mathrm{post}}^y = \mathcal C_{\mathrm{post}} (\mathcal G^* \Gamma_{\mathrm{noise}}^{-1} y + \mathcal C_{\mathrm{pr}}^{-1} u_{\mathrm{pr}}).
\]
In practice, the noise covariance matrix $\Gamma_{\mathrm{noise}}$ is often a multiple of the identity, $\Gamma_{\mathrm{noise}} = \sigma^2 I$, where $\sigma$ is the experimental noise level. In the derivations that follow, since there is no loss of generality, we take $\sigma = 1$. Generalizing the results to the cases where $\Gamma_{\mathrm{noise}}$ is an anisotropic diagonal matrix (uncorrelated observations with varying experimental noise levels) or, more generally, a $\Gamma_{\mathrm{noise}}$ that is symmetric and positive definite with nonzero off-diagonal entries (correlated observations) is straightforward. Moreover, for simplicity, we assume that the prior is a centered Gaussian, i.e. $u_{\mathrm{pr}} = 0$. Again, the generalization to the case of a non-centered prior measure is straightforward. With these simplifications, the mean and covariance of the posterior measure are given by,
\[
\mathcal C_{\mathrm{post}} = (\mathcal G^* \mathcal G + \mathcal C_{\mathrm{pr}}^{-1})^{-1}, \qquad u_{\mathrm{post}}^y = \mathcal C_{\mathrm{post}} \mathcal G^* y. \tag{7}
\]
In what follows, we use the notation,
\[
\mathcal H_{\mathrm{m}} = \mathcal G^* \mathcal G. \tag{8}
\]
The motivation behind this notation is that $\mathcal G^* \mathcal G$ is the Hessian of the functional $\Phi(u; y)$, which measures the magnitude of the misfit between experimental data $y$ and model prediction $\mathcal G u$. Note that in statistical terms, $\mathcal H_{\mathrm{m}}$ is the Hessian of the negative log-likelihood, which is also referred to as the Fisher information matrix. Another notation we shall use frequently is,
\[
\tilde{\mathcal H}_{\mathrm{m}} = \mathcal C_{\mathrm{pr}}^{1/2} \mathcal H_{\mathrm{m}} \mathcal C_{\mathrm{pr}}^{1/2}. \tag{9}
\]
Intuitively, this prior-preconditioned $\mathcal H_{\mathrm{m}}$ can be thought of as the information matrix which has been filtered through the prior. To further appreciate the notion of the prior-preconditioned misfit Hessian, we note that the second moment of the parameter-to-observable map, considered as a random variable $\mathcal G : (\mathcal H, \mathscr B(\mathcal H), \mu_{\mathrm{pr}}) \to (\mathbb R^q, \mathscr B(\mathbb R^q))$, is given by,
\[
\int_{\mathcal H} |\mathcal G u|^2\, \mu_{\mathrm{pr}}(du) = \int_{\mathcal H} \langle \mathcal G u, \mathcal G u\rangle_{\mathbb R^q}\, \mu_{\mathrm{pr}}(du) = \int_{\mathcal H} \langle \mathcal H_{\mathrm{m}} u, u\rangle_{\mathcal H}\, \mu_{\mathrm{pr}}(du) = \operatorname{tr}(\mathcal C_{\mathrm{pr}} \mathcal H_{\mathrm{m}}) = \operatorname{tr}(\tilde{\mathcal H}_{\mathrm{m}}).
\]

3.2. A spectral point of view of uncertainty reduction.
Let $\tilde{\mathcal H}_{\mathrm{m}}$ be the prior-preconditioned misfit Hessian as defined in (9) and denote
\[
\mathcal S = (I + \tilde{\mathcal H}_{\mathrm{m}})^{-1}. \tag{10}
\]
The posterior covariance operator $\mathcal C_{\mathrm{post}}$, given in (7), can be written as $\mathcal C_{\mathrm{post}} = \mathcal C_{\mathrm{pr}}^{1/2} (I + \tilde{\mathcal H}_{\mathrm{m}})^{-1} \mathcal C_{\mathrm{pr}}^{1/2} = \mathcal C_{\mathrm{pr}}^{1/2} \mathcal S\, \mathcal C_{\mathrm{pr}}^{1/2}$. We consider the quantity,
\[
\delta(\mathcal C_{\mathrm{pr}}, \mathcal C_{\mathrm{post}}) := \operatorname{tr}(\mathcal C_{\mathrm{pr}}) - \operatorname{tr}(\mathcal C_{\mathrm{post}}) = \operatorname{tr}\big( \mathcal C_{\mathrm{pr}}^{1/2} (I - \mathcal S) \mathcal C_{\mathrm{pr}}^{1/2} \big).
\]
For the class of Bayesian linear inverse problems considered in the present work, it is straightforward to show that $\delta(\mathcal C_{\mathrm{pr}}, \mathcal C_{\mathrm{post}}) \ge 0$. In particular, we note that if $\{\lambda_i\}$ and $\{e_i\}$ are the eigenvalues and the respective eigenvectors of $\tilde{\mathcal H}_{\mathrm{m}}$, then
\[
\langle e_i, (I - \mathcal S) e_i\rangle_{\mathcal H} = 1 - \langle e_i, \mathcal S e_i\rangle_{\mathcal H} = 1 - \frac{1}{1 + \lambda_i} = \frac{\lambda_i}{1 + \lambda_i} \ge 0, \qquad i = 1, 2, \ldots,
\]
which shows that $\delta(\mathcal C_{\mathrm{pr}}, \mathcal C_{\mathrm{post}}) = \operatorname{tr}\big(\mathcal C_{\mathrm{pr}}^{1/2} (I - \mathcal S) \mathcal C_{\mathrm{pr}}^{1/2}\big) \ge 0$. The quantity $\delta(\mathcal C_{\mathrm{pr}}, \mathcal C_{\mathrm{post}})$ can thus be considered a measure of variance (uncertainty) reduction. More precisely, we consider for each $i \ge 1$,
\[
\langle e_i, \mathcal C_{\mathrm{post}} e_i\rangle_{\mathcal H} = \int_{\mathcal H} \big\langle e_i, u - u_{\mathrm{post}}^y \big\rangle_{\mathcal H}^2\, \mu_{\mathrm{post}}^y(du),
\]
which measures the posterior variance of the coordinate of $u$ in the direction $e_i$.

Proposition 1. Let $\{\lambda_i, e_i\}_{i=1}^\infty$ be eigenpairs of $\tilde{\mathcal H}_{\mathrm{m}}$. Then, $\langle e_i, \mathcal C_{\mathrm{post}} e_i\rangle_{\mathcal H} \le \langle e_i, \mathcal C_{\mathrm{pr}} e_i\rangle_{\mathcal H}$ for all $i \ge 1$.

Proof. Note that for each $v \in \mathcal H$, $\mathcal S v = \sum_j (1 + \lambda_j)^{-1} \langle e_j, v\rangle_{\mathcal H}\, e_j$. Hence,
\[
\langle e_i, \mathcal C_{\mathrm{post}} e_i\rangle_{\mathcal H} = \big\langle e_i, \mathcal C_{\mathrm{pr}}^{1/2} \mathcal S\, \mathcal C_{\mathrm{pr}}^{1/2} e_i \big\rangle_{\mathcal H} = \big\langle \mathcal C_{\mathrm{pr}}^{1/2} e_i, \mathcal S\, \mathcal C_{\mathrm{pr}}^{1/2} e_i \big\rangle_{\mathcal H} = \sum_j (1 + \lambda_j)^{-1} \big\langle e_j, \mathcal C_{\mathrm{pr}}^{1/2} e_i \big\rangle_{\mathcal H}^2 \le \sum_j \big\langle e_j, \mathcal C_{\mathrm{pr}}^{1/2} e_i \big\rangle_{\mathcal H}^2 = \big\| \mathcal C_{\mathrm{pr}}^{1/2} e_i \big\|_{\mathcal H}^2 = \langle e_i, \mathcal C_{\mathrm{pr}} e_i\rangle_{\mathcal H},
\]
where the penultimate equality follows from Parseval's identity. □

Also,
\[
\operatorname{tr}(\mathcal C_{\mathrm{post}}) = \operatorname{tr}(\mathcal C_{\mathrm{pr}}) - \operatorname{tr}\big(\mathcal C_{\mathrm{pr}}^{1/2} (I - \mathcal S) \mathcal C_{\mathrm{pr}}^{1/2}\big) = \sum_{j=1}^\infty (1 - \alpha_j)\, \langle e_j, \mathcal C_{\mathrm{pr}} e_j\rangle_{\mathcal H},
\]
where $\alpha_j = \lambda_j / (1 + \lambda_j)$. Thus, for eigenvalues $\lambda_j$ that are large, we have $\alpha_j \approx 1$, which suggests that significant uncertainty reduction occurs in such directions. It is well known that for large classes of ill-posed Bayesian inverse problems, the eigenvalues $\lambda_i$ of $\tilde{\mathcal H}_{\mathrm{m}}$ decay rapidly to zero, with a relatively small number of dominant eigenvalues indicating the data-informed directions in the parameter space. This allows "focusing" the inference to low-dimensional subspaces of the parameter space $\mathcal H$. Such ideas have been used to develop efficient numerical algorithms for solution of infinite-dimensional Bayesian inverse problems in works such as [4, 11], and for algorithms for computing A-optimal experimental designs for infinite-dimensional Bayesian linear inverse problems in [1].

4. KL divergence from posterior to prior and expected information gain
Let us first motivate the discussion by recalling the form of the KL divergence from the posterior to prior in the finite-dimensional case. We use boldface letters for the finite-dimensional versions of the operators appearing in the Bayesian inverse problem. To indicate that we work in $\mathbb R^n$, we denote by $\mu_{\mathrm{pr},n}$ and $\mu_{\mathrm{post},n}^y$ the prior and posterior measures in the $n$-dimensional case. The following expression for $D_{\mathrm{kl}}\big(\mu_{\mathrm{post},n}^y \,\|\, \mu_{\mathrm{pr},n}\big)$ is well known:
\[
D_{\mathrm{kl}}\big(\mu_{\mathrm{post},n}^y \,\|\, \mu_{\mathrm{pr},n}\big) = \frac{1}{2} \Big[ -\log\Big( \frac{\det \mathbf{C}_{\mathrm{post}}}{\det \mathbf{C}_{\mathrm{pr}}} \Big) - n + \operatorname{tr}(\mathbf{C}_{\mathrm{pr}}^{-1} \mathbf{C}_{\mathrm{post}}) + \big\langle \mathbf{C}_{\mathrm{pr}}^{-1} \mathbf{u}_{\mathrm{post}}^y, \mathbf{u}_{\mathrm{post}}^y \big\rangle_{\mathbb R^n} \Big]. \tag{11}
\]
Note that the above expression is not meaningful in the infinite-dimensional case. For one thing, $n$ appears explicitly in the expression. Moreover, in the infinite-dimensional case, $\mathcal C_{\mathrm{pr}}$ is a trace-class operator whose eigenvalues accumulate at zero, so dividing by the determinant of the prior covariance is problematic as $n \to \infty$. Finally, in the infinite-dimensional case, $\mathcal C_{\mathrm{pr}}^{-1}$ is the inverse of a compact operator and hence is unbounded; therefore, the trace term, which involves the inverse of the prior covariance, needs clarification. However, if we reformulate the above expression, we obtain an expression that has meaning in the infinite-dimensional case.

A straightforward calculation shows that the first term on the right in (11) may be simplified:
\[
-\log\Big( \frac{\det \mathbf{C}_{\mathrm{post}}}{\det \mathbf{C}_{\mathrm{pr}}} \Big) = \log\Big( \frac{\det \mathbf{C}_{\mathrm{pr}}}{\det \mathbf{C}_{\mathrm{post}}} \Big) = \log\det\big( \mathbf{C}_{\mathrm{pr}} \mathbf{C}_{\mathrm{post}}^{-1} \big) = \log\det\big( \mathbf{C}_{\mathrm{pr}}^{1/2} (\mathbf{H}_{\mathrm{m}} + \mathbf{C}_{\mathrm{pr}}^{-1}) \mathbf{C}_{\mathrm{pr}}^{1/2} \big) = \log\det( \tilde{\mathbf{H}}_{\mathrm{m}} + \mathbf{I} ). \tag{12}
\]
Recall that, in general, if $\mathbf{A}$ is Hermitian, then there exists a unitary matrix $\mathbf{U}$ such that $\mathbf{D} = [\lambda_i \delta_{ij}] = \mathbf{U}^* \mathbf{A} \mathbf{U}$ is diagonal. In this case, the diagonal elements are the eigenvalues of $\mathbf{A}$, and
\[
\det(\mathbf{I} + \mathbf{A}) = \det(\mathbf{U}) \det(\mathbf{I} + \mathbf{D}) \det(\mathbf{U}^*) = \prod_{i=1}^n (1 + \lambda_i).
\]
In the infinite-dimensional setting, given a trace-class operator $\mathcal A \in \mathcal{L}_1^{\mathrm{sym},+}(\mathcal H)$,
\[
\lim_{n \to \infty} \log\Big( \prod_{i=1}^n (1 + \lambda_i(\mathcal A)) \Big) = \lim_{n \to \infty} \sum_{i=1}^n \log(1 + \lambda_i(\mathcal A)) \le \lim_{n \to \infty} \sum_{i=1}^n \lambda_i(\mathcal A) < \infty,
\]
so, motivated by the $n$-dimensional case, we may define the Fredholm determinant of $I + \mathcal A$ as
\[
\det(I + \mathcal A) = \prod_{i=1}^\infty (1 + \lambda_i(\mathcal A)),
\]
where $\lambda_i(\mathcal A)$ are the eigenvalues of $\mathcal A$ [15]. Hence, the final expression in equation (12) is meaningful in infinite dimensions. Next, we consider the term $-n + \operatorname{tr}(\mathbf{C}_{\mathrm{pr}}^{-1} \mathbf{C}_{\mathrm{post}})$:
\[
-n + \operatorname{tr}(\mathbf{C}_{\mathrm{pr}}^{-1} \mathbf{C}_{\mathrm{post}}) = -\operatorname{tr}(\mathbf{I}) + \operatorname{tr}(\mathbf{C}_{\mathrm{pr}}^{-1} \mathbf{C}_{\mathrm{post}}) = \operatorname{tr}(\mathbf{C}_{\mathrm{pr}}^{-1} \mathbf{C}_{\mathrm{post}} - \mathbf{I}) = \operatorname{tr}\big( (\mathbf{C}_{\mathrm{pr}}^{-1} - \mathbf{C}_{\mathrm{post}}^{-1}) \mathbf{C}_{\mathrm{post}} \big) = -\operatorname{tr}(\mathbf{H}_{\mathrm{m}} \mathbf{C}_{\mathrm{post}}),
\]
where in the last step we used the fact that $\mathbf{C}_{\mathrm{post}}^{-1} = \mathbf{H}_{\mathrm{m}} + \mathbf{C}_{\mathrm{pr}}^{-1}$. Notice that the argument of the trace in the final expression is in fact a trace-class operator in the infinite-dimensional case and has a well-defined trace. Combining (12) and the above trace identity, and defining the inner-product $\langle \mathbf{x}, \mathbf{y} \rangle_{\mathbf{C}_{\mathrm{pr}}^{-1}} = \big\langle \mathbf{C}_{\mathrm{pr}}^{-1/2} \mathbf{x}, \mathbf{C}_{\mathrm{pr}}^{-1/2} \mathbf{y} \big\rangle_{\mathbb R^n}$ for $\mathbf{x}, \mathbf{y} \in \mathbb R^n$, we rewrite (11):
\[
D_{\mathrm{kl}}\big(\mu_{\mathrm{post},n}^y \,\|\, \mu_{\mathrm{pr},n}\big) = \frac{1}{2} \Big[ \log\det( \tilde{\mathbf{H}}_{\mathrm{m}} + \mathbf{I} ) - \operatorname{tr}(\mathbf{H}_{\mathrm{m}} \mathbf{C}_{\mathrm{post}}) + \big\langle \mathbf{u}_{\mathrm{post}}^y, \mathbf{u}_{\mathrm{post}}^y \big\rangle_{\mathbf{C}_{\mathrm{pr}}^{-1}} \Big]. \tag{13}
\]
In Section 4.1 we derive, rigorously, alternate forms of the expression for the KL divergence from posterior to prior in the infinite-dimensional Hilbert space setting; as we shall see shortly, one of those forms is a direct extension of (13) to the infinite-dimensional case. The reason for introducing the weighted inner-product $\langle \cdot, \cdot \rangle_{\mathbf{C}_{\mathrm{pr}}^{-1}}$ will also become clear in the discussion that follows.

4.1. The KL-divergence from posterior to prior.
The following result, which is a consequence of Proposition 1.2.8 in [9], will be needed in what follows.
Proposition 2.
Let $\mathcal A \in \mathcal{L}(\mathcal H)$ be a positive self-adjoint operator, $\mu = \mathcal N(0, \mathcal Q)$ a non-degenerate Gaussian measure on $\mathcal H$, and $b \in \mathcal H$. Then,
\[
\int_{\mathcal H} \exp\Big\{ -\frac{1}{2} \langle \mathcal A x, x\rangle_{\mathcal H} + \langle b, x\rangle_{\mathcal H} \Big\}\, \mu(dx) = \det(I + \tilde{\mathcal A})^{-1/2} \exp\Big\{ \frac{1}{2} \big\| (I + \tilde{\mathcal A})^{-1/2} \mathcal Q^{1/2} b \big\|_{\mathcal H}^2 \Big\},
\]
where $\tilde{\mathcal A} = \mathcal Q^{1/2} \mathcal A \mathcal Q^{1/2}$.

In the following technical lemma, we calculate the expression for $Z$, introduced in equation (6).

Lemma 2.
Let $\Phi(u; y) = \frac{1}{2} (\mathcal G u - y)^T \Gamma_{\mathrm{noise}}^{-1} (\mathcal G u - y)$, as defined by equation (4). Then,
\[
Z(y) := \int_{\mathcal H} \exp\{-\Phi(u; y)\}\, \mu_{\mathrm{pr}}(du) = \exp\Big\{ -\frac{1}{2} |y|^2 \Big\}\, \det(I + \tilde{\mathcal H}_{\mathrm{m}})^{-1/2} \exp\Big\{ \frac{1}{2} \langle \mathcal C_{\mathrm{post}} b, b\rangle_{\mathcal H} \Big\},
\]
where $b = \mathcal G^* y$ and $\mathcal C_{\mathrm{post}} = (\mathcal G^* \mathcal G + \mathcal C_{\mathrm{pr}}^{-1})^{-1}$, as in equation (7).

Proof. First note that (recall that we have assumed $\Gamma_{\mathrm{noise}} = I$)
\[
\Phi(u; y) = \frac{1}{2} (\mathcal G u - y)^T (\mathcal G u - y) = \frac{1}{2} \langle \mathcal G u, \mathcal G u\rangle_{\mathbb R^q} - \langle \mathcal G u, y\rangle_{\mathbb R^q} + \frac{1}{2} \langle y, y\rangle_{\mathbb R^q} = \frac{1}{2} \langle \mathcal H_{\mathrm{m}} u, u\rangle_{\mathcal H} - \langle \mathcal G^* y, u\rangle_{\mathcal H} + \frac{1}{2} |y|^2. \tag{14}
\]
Therefore,
\[
\int_{\mathcal H} \exp\{-\Phi(u; y)\}\, \mu_{\mathrm{pr}}(du) = \exp\Big\{ -\frac{1}{2} |y|^2 \Big\} \int_{\mathcal H} \exp\Big\{ -\frac{1}{2} \langle \mathcal H_{\mathrm{m}} u, u\rangle_{\mathcal H} + \langle b, u\rangle_{\mathcal H} \Big\}\, \mu_{\mathrm{pr}}(du),
\]
where $b = \mathcal G^* y$. By Proposition 2 we have,
\[
\int_{\mathcal H} \exp\Big\{ -\frac{1}{2} \langle \mathcal H_{\mathrm{m}} u, u\rangle_{\mathcal H} + \langle b, u\rangle_{\mathcal H} \Big\}\, \mu_{\mathrm{pr}}(du) = \det(I + \mathcal C_{\mathrm{pr}}^{1/2} \mathcal H_{\mathrm{m}} \mathcal C_{\mathrm{pr}}^{1/2})^{-1/2} \exp\Big\{ \frac{1}{2} \big\| (I + \mathcal C_{\mathrm{pr}}^{1/2} \mathcal H_{\mathrm{m}} \mathcal C_{\mathrm{pr}}^{1/2})^{-1/2} \mathcal C_{\mathrm{pr}}^{1/2} b \big\|_{\mathcal H}^2 \Big\} = \det(I + \tilde{\mathcal H}_{\mathrm{m}})^{-1/2} \exp\Big\{ \frac{1}{2} \big\| (I + \tilde{\mathcal H}_{\mathrm{m}})^{-1/2} \mathcal C_{\mathrm{pr}}^{1/2} b \big\|_{\mathcal H}^2 \Big\}.
\]
The assertion of the lemma now follows, since $\mathcal C_{\mathrm{post}} = \mathcal C_{\mathrm{pr}}^{1/2} (I + \tilde{\mathcal H}_{\mathrm{m}})^{-1} \mathcal C_{\mathrm{pr}}^{1/2}$. □

The following result provides the expression for the KL divergence from posterior to prior:
Proposition 3.
Let $\mu_{\mathrm{pr}}$ be a centered Gaussian measure on $\mathcal H$, and let $\mu_{\mathrm{post}}^y = \mathcal N\big(u_{\mathrm{post}}^y, \mathcal C_{\mathrm{post}}\big)$ be the posterior measure for a Bayesian linear inverse problem with additive Gaussian noise model as described in Section 3. Then,
\[
D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^y \,\|\, \mu_{\mathrm{pr}}\big) = \frac{1}{2} \Big[ \log\det(I + \tilde{\mathcal H}_{\mathrm{m}}) - \operatorname{tr}(\mathcal H_{\mathrm{m}} \mathcal C_{\mathrm{post}}) - \big\langle u_{\mathrm{post}}^y, \mathcal G^* (\mathcal G u_{\mathrm{post}}^y - y) \big\rangle_{\mathcal H} \Big]. \tag{15}
\]

Proof.
Consider (6), and note that
\[
D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^y \,\|\, \mu_{\mathrm{pr}}\big) = \int_{\mathcal H} \log\Big\{ \frac{d\mu_{\mathrm{post}}^y}{d\mu_{\mathrm{pr}}} \Big\}\, \mu_{\mathrm{post}}^y(du) = -\log Z(y) - \int_{\mathcal H} \Phi(u; y)\, \mu_{\mathrm{post}}^y(du). \tag{16}
\]
Using (14) to expand $\Phi(u; y)$, the integral on the right becomes
\[
\int_{\mathcal H} \Phi(u; y)\, \mu_{\mathrm{post}}^y(du) = \frac{1}{2} \int_{\mathcal H} \langle \mathcal H_{\mathrm{m}} u, u\rangle_{\mathcal H}\, \mu_{\mathrm{post}}^y(du) - \int_{\mathcal H} \langle \mathcal G^* y, u\rangle_{\mathcal H}\, \mu_{\mathrm{post}}^y(du) + \frac{1}{2} |y|^2.
\]
The second integral evaluates to $\big\langle \mathcal G^* y, u_{\mathrm{post}}^y \big\rangle_{\mathcal H}$, by the definition of the mean of the measure, and the first integral is evaluated via the formula for the integral of a quadratic form:
\[
\int_{\mathcal H} \langle \mathcal H_{\mathrm{m}} u, u\rangle_{\mathcal H}\, \mu_{\mathrm{post}}^y(du) = \operatorname{tr}(\mathcal H_{\mathrm{m}} \mathcal C_{\mathrm{post}}) + \big\langle u_{\mathrm{post}}^y, \mathcal H_{\mathrm{m}} u_{\mathrm{post}}^y \big\rangle_{\mathcal H}.
\]
Using the expression for $Z$ from Lemma 2,
\[
-\log Z(y) = \frac{1}{2} |y|^2 - \log\det(I + \tilde{\mathcal H}_{\mathrm{m}})^{-1/2} - \frac{1}{2} \langle \mathcal C_{\mathrm{post}} \mathcal G^* y, \mathcal G^* y\rangle_{\mathcal H} = \frac{1}{2} |y|^2 + \frac{1}{2} \log\det(I + \tilde{\mathcal H}_{\mathrm{m}}) - \frac{1}{2} \big\langle u_{\mathrm{post}}^y, \mathcal G^* y \big\rangle_{\mathcal H},
\]
where we have also used the definition of $u_{\mathrm{post}}^y$. Substituting into equation (16), we obtain
\[
D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^y \,\|\, \mu_{\mathrm{pr}}\big) = \frac{1}{2} \log\det(I + \tilde{\mathcal H}_{\mathrm{m}}) - \frac{1}{2} \big\langle u_{\mathrm{post}}^y, \mathcal G^* y \big\rangle_{\mathcal H} - \frac{1}{2} \operatorname{tr}(\mathcal H_{\mathrm{m}} \mathcal C_{\mathrm{post}}) - \frac{1}{2} \big\langle u_{\mathrm{post}}^y, \mathcal H_{\mathrm{m}} u_{\mathrm{post}}^y \big\rangle_{\mathcal H} + \big\langle \mathcal G^* y, u_{\mathrm{post}}^y \big\rangle_{\mathcal H},
\]
which, after some algebraic manipulation and recalling that $\mathcal H_{\mathrm{m}} = \mathcal G^* \mathcal G$, yields the assertion of the proposition. □
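Proposition 3 can be sanity-checked in finite dimensions, where the classical Gaussian KL formula (11) is available. The sketch below (an added numerical illustration, not part of the paper; all matrices are arbitrary test data) builds a small Gaussian linear inverse problem via (7) and verifies that (11) agrees with the form in (15):

```python
# Finite-dimensional sanity check of Proposition 3 (added illustration, not
# part of the paper; all matrices are arbitrary test data): for a small
# Gaussian linear inverse problem built via (7), the classical Gaussian KL
# formula (11) must agree with the rewritten form (15) and with the weighted
# inner-product form (17).
import numpy as np

rng = np.random.default_rng(2)
n, q = 6, 4
M = rng.standard_normal((n, n))
C_pr = M @ M.T + n * np.eye(n)               # SPD prior covariance
G = rng.standard_normal((q, n))              # parameter-to-observable map
y = rng.standard_normal(q)                   # synthetic data (Gamma_noise = I)

C_pr_inv = np.linalg.inv(C_pr)
H_m = G.T @ G                                # misfit Hessian, eq. (8)
C_post = np.linalg.inv(H_m + C_pr_inv)       # posterior covariance, eq. (7)
u_post = C_post @ G.T @ y                    # posterior mean, eq. (7)

d, V = np.linalg.eigh(C_pr)                  # C_pr^{1/2} via spectral calculus
C_pr_half = V @ np.diag(np.sqrt(d)) @ V.T
H_tilde = C_pr_half @ H_m @ C_pr_half        # eq. (9)

_, logdet_ratio = np.linalg.slogdet(C_pr @ np.linalg.inv(C_post))
kl_classical = 0.5 * (logdet_ratio - n + np.trace(C_pr_inv @ C_post)
                      + u_post @ C_pr_inv @ u_post)          # eq. (11)

_, logdet_IH = np.linalg.slogdet(np.eye(n) + H_tilde)
kl_form15 = 0.5 * (logdet_IH - np.trace(H_m @ C_post)
                   - u_post @ (G.T @ (G @ u_post - y)))      # eq. (15)
kl_form17 = 0.5 * (logdet_IH - np.trace(H_m @ C_post)
                   + u_post @ C_pr_inv @ u_post)             # eq. (17)
```

All three quantities agree to floating-point accuracy, reflecting the algebraic identities used in the derivation.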
Let us note the following interpretation for the last term appearing in $D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^y \,\|\, \mu_{\mathrm{pr}}\big)$ given in (15). Consider the function $\Phi(u) = \frac{1}{2} (\mathcal G u - y)^T (\mathcal G u - y)$, which is the familiar misfit term in the deterministic interpretation of the corresponding linear inverse problem. (For notational simplicity we have suppressed the dependence of $\Phi$ on the data vector $y$.) Note that the variational derivative of $\Phi$ at a point $u \in \mathcal H$ in direction $h \in \mathcal H$ is given by,
\[
\Phi'(u) h = \frac{d}{d\varepsilon}\Big|_{\varepsilon=0} \Phi(u + \varepsilon h) = \langle \mathcal G u - y, \mathcal G h\rangle_{\mathbb R^q} = \langle \mathcal G^* (\mathcal G u - y), h\rangle_{\mathcal H}.
\]
Next, recall that the mean of the posterior, $u_{\mathrm{post}}^y$, of the present Bayesian linear inverse problem coincides with the MAP estimator for the inference parameter $u$ and is the global minimizer of the following regularized cost functional [16, 10]
\[
J(u) = \Phi(u) + \frac{1}{2} \langle u, u\rangle_{\mathcal C_{\mathrm{pr}}^{-1}},
\]
with minimization done over the space $\mathcal H_{\mu_{\mathrm{pr}}} = \operatorname{range}\big(\mathcal C_{\mathrm{pr}}^{1/2}\big) \subset \mathcal H$. The inner-product in the regularization term is given by $\langle x, y\rangle_{\mathcal C_{\mathrm{pr}}^{-1}} = \big\langle \mathcal C_{\mathrm{pr}}^{-1/2} x, \mathcal C_{\mathrm{pr}}^{-1/2} y \big\rangle_{\mathcal H}$ for $x, y \in \mathcal H_{\mu_{\mathrm{pr}}}$. We have, by the first order optimality conditions, $J'(u_{\mathrm{post}}^y) h = 0$ for every $h \in \mathcal H_{\mu_{\mathrm{pr}}}$, that is,
\[
\big\langle \mathcal G^* (\mathcal G u_{\mathrm{post}}^y - y), h \big\rangle_{\mathcal H} + \big\langle u_{\mathrm{post}}^y, h \big\rangle_{\mathcal C_{\mathrm{pr}}^{-1}} = 0, \qquad \text{for all } h \in \mathcal H_{\mu_{\mathrm{pr}}}.
\]
Thus, in particular, $-\big\langle \mathcal G^* (\mathcal G u_{\mathrm{post}}^y - y), u_{\mathrm{post}}^y \big\rangle_{\mathcal H} = \big\langle u_{\mathrm{post}}^y, u_{\mathrm{post}}^y \big\rangle_{\mathcal C_{\mathrm{pr}}^{-1}}$. This leads to the following alternate form of expression (15):
\[
D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^y \,\|\, \mu_{\mathrm{pr}}\big) = \frac{1}{2} \Big[ \log\det(I + \tilde{\mathcal H}_{\mathrm{m}}) - \operatorname{tr}(\mathcal H_{\mathrm{m}} \mathcal C_{\mathrm{post}}) + \big\langle u_{\mathrm{post}}^y, u_{\mathrm{post}}^y \big\rangle_{\mathcal C_{\mathrm{pr}}^{-1}} \Big]. \tag{17}
\]
Note that this expression for the KL divergence $D_{\mathrm{kl}}\big(\mu_{\mathrm{post}}^y \,\|\, \mu_{\mathrm{pr}}\big)$ is the direct extension to infinite dimensions of the corresponding expression in the case of $\mathcal H = \mathbb R^n$ as given in (13).

Remark.
A straightforward modification of the arguments leading to equation (15), for the case of a prior $\mu_{\mathrm{pr}} = N(u_{\mathrm{pr}}, \mathcal{C}_{\mathrm{pr}})$, gives
\[
D_{\mathrm{kl}}\big(\mu^y_{\mathrm{post}} \,\|\, \mu_{\mathrm{pr}}\big) = \tfrac12\Big[\log\det(I + \widetilde{\mathcal{H}}_m) - \mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}) - \big\langle u^y_{\mathrm{post}} - u_{\mathrm{pr}},\, \mathcal{G}^*(\mathcal{G}u^y_{\mathrm{post}} - \boldsymbol{y})\big\rangle_{\mathcal{H}}\Big].
\]
Moreover, in view of the argument leading to (17), we have
\[
D_{\mathrm{kl}}\big(\mu^y_{\mathrm{post}} \,\|\, \mu_{\mathrm{pr}}\big) = \tfrac12\Big[\log\det(I + \widetilde{\mathcal{H}}_m) - \mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}) + \big\langle u^y_{\mathrm{post}} - u_{\mathrm{pr}},\, u^y_{\mathrm{post}} - u_{\mathrm{pr}}\big\rangle_{\mathcal{C}^{-1}_{\mathrm{pr}}}\Big].
\]

Expected information gain.
Here we derive the expression for the expected information gain. We first prove the following technical lemma, which is needed in the proof of the main result in this section.
Lemma 3.
The following identities hold:
(1) $E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathcal{H}}\big\}\big\} = \mathrm{tr}(\widetilde{\mathcal{H}}_m)$;
(2) $E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{H}_m u^y_{\mathrm{post}}\big\rangle_{\mathcal{H}}\big\}\big\} = \mathrm{tr}(S\widetilde{\mathcal{H}}_m^2)$,
where $\widetilde{\mathcal{H}}_m$ and $S$ are as in (9) and (10), respectively.

Footnote: Given a Gaussian measure $\mu = N(m, \mathcal{C})$ on a Hilbert space $\mathcal{H}$, the space $\mathrm{range}(\mathcal{C}^{1/2})$ is called the Cameron-Martin space corresponding to the measure $\mu$. It is a known result (see, e.g., [8]) that if the Hilbert space $\mathcal{H}$ is infinite-dimensional, then $\mu\big(\mathrm{range}(\mathcal{C}^{1/2})\big) = 0$.

Proof.
We present the proof of the first identity; the second follows from a similar argument. Let us begin with the inner expectation. By the definition of $u^y_{\mathrm{post}}$ we have
\[
\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathcal{H}} = \big\langle \mathcal{C}_{\mathrm{post}}\mathcal{G}^*\boldsymbol{y}, \mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathcal{H}} = \big\langle \boldsymbol{y}, \mathcal{G}\mathcal{C}_{\mathrm{post}}\mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathbb{R}^q}.
\]
For clarity, let us denote $L = \mathcal{G}\mathcal{C}_{\mathrm{post}}\mathcal{G}^*$. Recall that $\boldsymbol{y}|u$ is distributed according to $N(\mathcal{G}u, \Gamma_{\mathrm{noise}})$, and that we have assumed $\Gamma_{\mathrm{noise}} = I$. Using the formula for the expectation of a quadratic form (on $\mathscr{Y} = \mathbb{R}^q$), Lemma 1, we have
\[
E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathcal{H}}\big\} = E_{\boldsymbol{y}|u}\big\{\langle \boldsymbol{y}, L\boldsymbol{y}\rangle_{\mathbb{R}^q}\big\} = \mathrm{tr}(L) + \langle \mathcal{G}u, L\mathcal{G}u\rangle_{\mathbb{R}^q} = \mathrm{tr}(L) + \langle u, \mathcal{G}^*L\mathcal{G}u\rangle_{\mathcal{H}}.
\]
By the comment at the end of Section 2.1, and recalling that $\mathcal{C}_{\mathrm{post}} = \mathcal{C}^{1/2}_{\mathrm{pr}} S \mathcal{C}^{1/2}_{\mathrm{pr}}$, we have
\[
\mathrm{tr}(L) = \mathrm{tr}(\mathcal{G}\mathcal{C}_{\mathrm{post}}\mathcal{G}^*) = \mathrm{tr}(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m) = \mathrm{tr}\big(\mathcal{C}^{1/2}_{\mathrm{pr}} S \mathcal{C}^{1/2}_{\mathrm{pr}}\mathcal{H}_m\big) = \mathrm{tr}\big(S \mathcal{C}^{1/2}_{\mathrm{pr}}\mathcal{H}_m\mathcal{C}^{1/2}_{\mathrm{pr}}\big) = \mathrm{tr}(S\widetilde{\mathcal{H}}_m). \tag{18}
\]
Therefore,
\[
E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathcal{H}}\big\} = \mathrm{tr}(S\widetilde{\mathcal{H}}_m) + \langle u, \mathcal{G}^*L\mathcal{G}u\rangle_{\mathcal{H}}. \tag{19}
\]
Next, to compute the outer expectation we proceed as follows (keeping in mind that $\mu_{\mathrm{pr}} = N(0, \mathcal{C}_{\mathrm{pr}})$). By Lemma 1,
\[
E_{\mu_{\mathrm{pr}}}\big\{\langle u, \mathcal{G}^*L\mathcal{G}u\rangle_{\mathcal{H}}\big\} = \int_{\mathcal{H}} \langle u, \mathcal{G}^*L\mathcal{G}u\rangle_{\mathcal{H}}\,\mu_{\mathrm{pr}}(du) = \mathrm{tr}(\mathcal{G}^*L\mathcal{G}\mathcal{C}_{\mathrm{pr}}),
\]
and
\[
\mathrm{tr}(\mathcal{G}^*L\mathcal{G}\mathcal{C}_{\mathrm{pr}}) = \mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}\mathcal{H}_m\mathcal{C}_{\mathrm{pr}}) = \mathrm{tr}\big(\mathcal{C}^{1/2}_{\mathrm{pr}}\mathcal{H}_m\mathcal{C}_{\mathrm{post}}\mathcal{H}_m\mathcal{C}^{1/2}_{\mathrm{pr}}\big) = \mathrm{tr}\big(\mathcal{C}^{1/2}_{\mathrm{pr}}\mathcal{H}_m\mathcal{C}^{1/2}_{\mathrm{pr}} S \mathcal{C}^{1/2}_{\mathrm{pr}}\mathcal{H}_m\mathcal{C}^{1/2}_{\mathrm{pr}}\big) = \mathrm{tr}(\widetilde{\mathcal{H}}_m S \widetilde{\mathcal{H}}_m) = \mathrm{tr}(S\widetilde{\mathcal{H}}_m^2).
\]
Thus, combining (18) and (19) with the last identity gives
\[
E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathcal{H}}\big\}\big\} = \mathrm{tr}(S\widetilde{\mathcal{H}}_m) + \mathrm{tr}(S\widetilde{\mathcal{H}}_m^2) = \mathrm{tr}\big(S\widetilde{\mathcal{H}}_m(I + \widetilde{\mathcal{H}}_m)\big) = \mathrm{tr}\big(\widetilde{\mathcal{H}}_m(I + \widetilde{\mathcal{H}}_m)S\big) = \mathrm{tr}(\widetilde{\mathcal{H}}_m),
\]
which is the first statement of the lemma. $\Box$

The following theorem is the main result of this section.
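Before stating it, the trace identities used in the proof of Lemma 3 can be spot-checked on a finite-dimensional surrogate. The sketch below uses illustrative random matrices (not from the paper) and assumes $\Gamma_{\mathrm{noise}} = I$.

```python
import numpy as np

# Spot-check of the trace identities from the proof of Lemma 3: with
# S = (I + H_tilde)^{-1} and C_post = C_pr^{1/2} S C_pr^{1/2},
#   tr(C_post H_m) = tr(S H_tilde),
#   tr(S H_tilde) + tr(S H_tilde^2) = tr(H_tilde).
# Illustrative random finite-dimensional instance; Gamma_noise = I.
rng = np.random.default_rng(1)
n, q = 6, 4

A = rng.standard_normal((n, n))
C_pr = A @ A.T + np.eye(n)                   # SPD prior covariance
G = rng.standard_normal((q, n))              # forward operator
H = G.T @ G                                  # misfit Hessian

w, V = np.linalg.eigh(C_pr)
C_half = V @ np.diag(np.sqrt(w)) @ V.T       # C_pr^{1/2}
H_tilde = C_half @ H @ C_half                # prior-preconditioned Hessian
S = np.linalg.inv(np.eye(n) + H_tilde)
C_post = C_half @ S @ C_half                 # C_post = C_pr^{1/2} S C_pr^{1/2}

tr = np.trace
print(np.isclose(tr(C_post @ H), tr(S @ H_tilde)))                            # prints True
print(np.isclose(tr(S @ H_tilde) + tr(S @ H_tilde @ H_tilde), tr(H_tilde)))  # prints True
```

Both identities hold exactly (up to floating-point error), since $S$ and $\widetilde{\mathcal{H}}_m$ commute.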
Theorem 1.
Let $\mu_{\mathrm{pr}}$ be a centered Gaussian prior measure on $\mathcal{H}$, and let $\mu^y_{\mathrm{post}} = N\big(u^y_{\mathrm{post}}, \mathcal{C}_{\mathrm{post}}\big)$ be the posterior measure for a Bayesian linear inverse problem with additive Gaussian noise model as described in Section 3. Then,
\[
E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{D_{\mathrm{kl}}\big(\mu^y_{\mathrm{post}} \,\|\, \mu_{\mathrm{pr}}\big)\big\}\big\} = \tfrac12 \log\det(I + \widetilde{\mathcal{H}}_m).
\]
By (15) we have
\[
E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{D_{\mathrm{kl}}\big(\mu^y_{\mathrm{post}} \,\|\, \mu_{\mathrm{pr}}\big)\big\}\big\} = \tfrac12\log\det(I + \widetilde{\mathcal{H}}_m) - \tfrac12\,\mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}) - \tfrac12\, E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*(\mathcal{G}u^y_{\mathrm{post}} - \boldsymbol{y})\big\rangle_{\mathcal{H}}\big\}\big\}. \tag{20}
\]
Using the previous lemma, we can proceed as follows:
\[
E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*(\mathcal{G}u^y_{\mathrm{post}} - \boldsymbol{y})\big\rangle_{\mathcal{H}}\big\}\big\}
= E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{H}_m u^y_{\mathrm{post}}\big\rangle_{\mathcal{H}}\big\}\big\} - E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{\big\langle u^y_{\mathrm{post}}, \mathcal{G}^*\boldsymbol{y}\big\rangle_{\mathcal{H}}\big\}\big\}
= \mathrm{tr}(S\widetilde{\mathcal{H}}_m^2) - \mathrm{tr}(\widetilde{\mathcal{H}}_m)
= \mathrm{tr}\big(S(\widetilde{\mathcal{H}}_m - S^{-1})\widetilde{\mathcal{H}}_m\big) = -\mathrm{tr}(S\widetilde{\mathcal{H}}_m).
\]
Thus, since $\mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}) = \mathrm{tr}(\widetilde{\mathcal{H}}_m S) = \mathrm{tr}(S\widetilde{\mathcal{H}}_m)$, the expression for the expected information gain in (20) simplifies to
\[
E_{\mu_{\mathrm{pr}}}\big\{E_{\boldsymbol{y}|u}\big\{D_{\mathrm{kl}}\big(\mu^y_{\mathrm{post}} \,\|\, \mu_{\mathrm{pr}}\big)\big\}\big\} = \tfrac12\log\det(I + \widetilde{\mathcal{H}}_m). \qquad\Box
\]
The result above provides the infinite-dimensional analogue of Bayesian D-optimality. As mentioned in the introduction, an experimental design $\xi$ enters the Bayesian inverse problem through the data likelihood. In the present Gaussian linear case, this dependence on $\xi$ is manifested through a $\xi$-dependent misfit Hessian, $\mathcal{H}_m = \mathcal{H}_m(\xi)$. Consequently, the D-optimal design problem in the infinite-dimensional Hilbert space setting is given by
\[
\underset{\xi \in \Xi}{\text{maximize}} \;\; \log\det\big(I + \widetilde{\mathcal{H}}_m(\xi)\big),
\]
where $\Xi$ is the design space, which needs to be specified in a given experimental design problem.

Remark. As mentioned earlier, in a large class of Bayesian inverse problems $\widetilde{\mathcal{H}}_m$ admits a low-rank approximation,
\[
\widetilde{\mathcal{H}}_m v \approx \sum_{i=1}^{r} \lambda_i \langle e_i, v\rangle_{\mathcal{H}}\, e_i, \qquad v \in \mathcal{H},
\]
where $r$ is the numerical rank of $\widetilde{\mathcal{H}}_m$ and $\{\lambda_i\}_{i=1}^{r}$ are the dominant eigenvalues of $\widetilde{\mathcal{H}}_m$ with respective eigenvectors $\{e_i\}_{i=1}^{r}$. Thus, one can use the approximation
\[
\log\det(I + \widetilde{\mathcal{H}}_m) \approx \sum_{i=1}^{r} \log(1 + \lambda_i),
\]
which provides an efficient means of approximating the expected information gain.

5. Expected mean square error of the MAP estimator and Bayesian A-optimality
In this section, we consider another well-known optimal experimental design criterion, Bayesian A-optimality, which aims to minimize the trace of the posterior covariance operator. It is well known in the statistics literature that, for inference problems with a finite-dimensional parameter, this is equivalent to minimizing the expected mean square error of the posterior mean which, in the case of a Bayesian linear inverse problem, coincides with the MAP estimator. In this section, we extend this result to the infinite-dimensional Hilbert space setting.

The MSE of the MAP estimator $u^y_{\mathrm{post}}$ is
\[
\mathrm{MSE}(u^y_{\mathrm{post}}; u) = E_{\boldsymbol{y}|u}\big\{\big\|u - u^y_{\mathrm{post}}\big\|^2_{\mathcal{H}}\big\}.
\]
The MSE is also referred to as the risk of the estimator $u^y_{\mathrm{post}}$, corresponding to a quadratic loss function. A straightforward calculation shows that
\[
\mathrm{MSE}(u^y_{\mathrm{post}}; u) = \big\|u - E_{\boldsymbol{y}|u}\{u^y_{\mathrm{post}}\}\big\|^2_{\mathcal{H}} + E_{\boldsymbol{y}|u}\big\{\big\|u^y_{\mathrm{post}} - E_{\boldsymbol{y}|u}\{u^y_{\mathrm{post}}\}\big\|^2_{\mathcal{H}}\big\}. \tag{21}
\]
Note that the first term in (21) quantifies the magnitude of the estimation bias, and the second term describes the variability of the estimator around its mean. The following technical lemma provides the expression for $\mathrm{MSE}(u^y_{\mathrm{post}}; u)$ in the infinite-dimensional Hilbert space setting.

Lemma 4.
Let $u^y_{\mathrm{post}}$ be the MAP estimator for $u$ as in (7). Then,
\[
\mathrm{MSE}(u^y_{\mathrm{post}}; u) = \big\|(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)u\big\|^2_{\mathcal{H}} + \mathrm{tr}(\mathcal{C}^2_{\mathrm{post}}\mathcal{H}_m).
\]
Consider the expression for
MSE( u y post ; u ) given in (21). For the first term inthe sum, we have u − E y | u (cid:8) u y post (cid:9) = u − E y | u {C post G ∗ y } = u − C post G ∗ G u = ( I − C post H m ) u. Next, note that ξ ( y ) = u y post − E y | u (cid:8) u y post (cid:9) has law µ = N (0 , Q ) with Q = ( C post G ∗ )( C post G ∗ ) ∗ = C post H m C post . Therefore,E y | u n(cid:13)(cid:13) u y post − E y | u (cid:8) u y post (cid:9)(cid:13)(cid:13) H o = Z H k ξ k H µ ( dξ ) = tr( C post H m C post ) = tr( C post H m ) . (cid:3) Next, we consider the average over the prior measure of the
MSE,
\[
E_{\mu_{\mathrm{pr}}}\big\{\mathrm{MSE}(u^y_{\mathrm{post}}; u)\big\} = \int_{\mathcal{H}}\int_{\mathscr{Y}} \big\|u - u^y_{\mathrm{post}}\big\|^2_{\mathcal{H}}\,\pi_{\mathrm{like}}(\boldsymbol{y}\,|\,u)\,d\boldsymbol{y}\,\mu_{\mathrm{pr}}(du),
\]
which is also known as the Bayes risk of the estimator $u^y_{\mathrm{post}}$, corresponding to a quadratic loss function [5, 3]. The following result extends the well-known connection between the Bayes risk of the MAP estimator and the trace of the posterior covariance, for a Bayesian linear inverse problem, to the infinite-dimensional Hilbert space setting.

Theorem 2.
Let $\mu_{\mathrm{pr}}$ be a centered Gaussian prior measure on $\mathcal{H}$, and let $\mu^y_{\mathrm{post}} = N\big(u^y_{\mathrm{post}}, \mathcal{C}_{\mathrm{post}}\big)$ be the posterior measure for a Bayesian linear inverse problem with additive Gaussian noise model as described in Section 3. Then, $E_{\mu_{\mathrm{pr}}}\big\{\mathrm{MSE}(u^y_{\mathrm{post}}; u)\big\} = \mathrm{tr}(\mathcal{C}_{\mathrm{post}})$.

Proof. By Lemma 4,
\[
E_{\mu_{\mathrm{pr}}}\big\{\mathrm{MSE}(u^y_{\mathrm{post}}; u)\big\} = \int_{\mathcal{H}} \big\|(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)u\big\|^2_{\mathcal{H}}\,\mu_{\mathrm{pr}}(du) + \mathrm{tr}(\mathcal{C}^2_{\mathrm{post}}\mathcal{H}_m), \tag{22}
\]
and since $(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)u \sim N\big(0, (\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)\mathcal{C}_{\mathrm{pr}}(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)^*\big) =: \mu$,
\[
\int_{\mathcal{H}} \big\|(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)u\big\|^2_{\mathcal{H}}\,\mu_{\mathrm{pr}}(du) = \int_{\mathcal{H}} \|\xi\|^2_{\mathcal{H}}\,\mu(d\xi) = \mathrm{tr}\big((\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)\mathcal{C}_{\mathrm{pr}}(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)^*\big).
\]
We proceed as follows:
\[
\mathrm{tr}\big((\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)\mathcal{C}_{\mathrm{pr}}(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)^*\big) = \mathrm{tr}\big((\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)^*(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m - I)\mathcal{C}_{\mathrm{pr}}\big)
= \mathrm{tr}(\mathcal{H}_m\mathcal{C}^2_{\mathrm{post}}\mathcal{H}_m\mathcal{C}_{\mathrm{pr}}) - \mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}\mathcal{C}_{\mathrm{pr}}) - \mathrm{tr}(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m\mathcal{C}_{\mathrm{pr}}) + \mathrm{tr}(\mathcal{C}_{\mathrm{pr}}).
\]
Let us first consider the last two terms; recalling that $\mathcal{C}_{\mathrm{post}} = \mathcal{C}^{1/2}_{\mathrm{pr}} S \mathcal{C}^{1/2}_{\mathrm{pr}}$, we have
\[
-\mathrm{tr}(\mathcal{C}_{\mathrm{post}}\mathcal{H}_m\mathcal{C}_{\mathrm{pr}}) + \mathrm{tr}(\mathcal{C}_{\mathrm{pr}}) = -\mathrm{tr}\big(\mathcal{C}^{1/2}_{\mathrm{pr}} S \mathcal{C}^{1/2}_{\mathrm{pr}}\mathcal{H}_m\mathcal{C}_{\mathrm{pr}}\big) + \mathrm{tr}(\mathcal{C}_{\mathrm{pr}}) = -\mathrm{tr}(\mathcal{C}_{\mathrm{pr}} S \widetilde{\mathcal{H}}_m) + \mathrm{tr}(\mathcal{C}_{\mathrm{pr}}) = \mathrm{tr}\big(\mathcal{C}_{\mathrm{pr}} S (S^{-1} - \widetilde{\mathcal{H}}_m)\big) = \mathrm{tr}(\mathcal{C}_{\mathrm{pr}} S) = \mathrm{tr}(\mathcal{C}_{\mathrm{post}}).
\]
Thus, by (22), we have
\[
E_{\mu_{\mathrm{pr}}}\big\{\mathrm{MSE}(u^y_{\mathrm{post}}; u)\big\} = \mathrm{tr}(\mathcal{C}^2_{\mathrm{post}}\mathcal{H}_m) + \mathrm{tr}(\mathcal{H}_m\mathcal{C}^2_{\mathrm{post}}\mathcal{H}_m\mathcal{C}_{\mathrm{pr}}) - \mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}\mathcal{C}_{\mathrm{pr}}) + \mathrm{tr}(\mathcal{C}_{\mathrm{post}}).
\]
Hence, showing that the first three terms sum to zero completes the proof. To this end, we note that $\mathrm{tr}(\mathcal{H}_m\mathcal{C}_{\mathrm{post}}\mathcal{C}_{\mathrm{pr}}) = \mathrm{tr}(\widetilde{\mathcal{H}}_m S \mathcal{C}_{\mathrm{pr}})$ and that
\[
\mathrm{tr}(\mathcal{C}^2_{\mathrm{post}}\mathcal{H}_m) + \mathrm{tr}(\mathcal{H}_m\mathcal{C}^2_{\mathrm{post}}\mathcal{H}_m\mathcal{C}_{\mathrm{pr}}) = \mathrm{tr}(\widetilde{\mathcal{H}}_m S \mathcal{C}_{\mathrm{pr}} S) + \mathrm{tr}(\widetilde{\mathcal{H}}_m S \mathcal{C}_{\mathrm{pr}} S \widetilde{\mathcal{H}}_m) = \mathrm{tr}\big(\widetilde{\mathcal{H}}_m S \mathcal{C}_{\mathrm{pr}} S (I + \widetilde{\mathcal{H}}_m)\big) = \mathrm{tr}(\widetilde{\mathcal{H}}_m S \mathcal{C}_{\mathrm{pr}}). \qquad\Box
\]
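The identities in Lemma 4 and Theorem 2 are easy to spot-check in finite dimensions. The following sketch (an illustrative random instance, not part of the paper's argument, with $\Gamma_{\mathrm{noise}} = I$) verifies both the cyclic-trace form of the variance term and that the prior-averaged MSE equals $\mathrm{tr}(\mathcal{C}_{\mathrm{post}})$.

```python
import numpy as np

# Finite-dimensional check of Lemma 4 and Theorem 2:
#   variance term: tr(C_post H C_post) = tr(C_post^2 H)   (cyclic trace),
#   Bayes risk:    tr((C_post H - I) C_pr (C_post H - I)^T) + tr(C_post^2 H)
#                  = tr(C_post).
# Illustrative random instance; Gamma_noise = I.
rng = np.random.default_rng(3)
n, q = 5, 3

A = rng.standard_normal((n, n))
C_pr = A @ A.T + np.eye(n)                       # SPD prior covariance
G = rng.standard_normal((q, n))                  # forward operator
H = G.T @ G                                      # misfit Hessian
C_post = np.linalg.inv(H + np.linalg.inv(C_pr))  # posterior covariance

tr = np.trace
B = C_post @ H - np.eye(n)                       # bias operator C_post H_m - I
variance_term = tr(C_post @ H @ C_post)
bayes_risk = tr(B @ C_pr @ B.T) + variance_term

print(np.isclose(variance_term, tr(C_post @ C_post @ H)))  # prints True
print(np.isclose(bayes_risk, tr(C_post)))                  # prints True
```

The second check is exactly the statement of Theorem 2 restricted to $\mathbb{R}^n$; minimizing $\mathrm{tr}(\mathcal{C}_{\mathrm{post}})$ (A-optimality) therefore minimizes the Bayes risk of the MAP estimator.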
Appendix A. Proof of Lemma 1
Let $\{e_i\}_{i=1}^{\infty}$ be a complete orthonormal set in $\mathcal{H}$, and denote by $\Pi_n$ the orthogonal projection of $\mathcal{H}$ onto $\mathrm{span}\{e_1, \ldots, e_n\}$; that is, for $x \in \mathcal{H}$, $\Pi_n(x) = \sum_{i=1}^{n} \langle e_i, x\rangle_{\mathcal{H}}\, e_i$. First note that
\[
\int_{\mathcal{H}} \langle \mathcal{A}x, x\rangle_{\mathcal{H}}\,\mu(dx) = \int_{\mathcal{H}} \langle \mathcal{A}(x - m), x - m\rangle_{\mathcal{H}}\,\mu(dx) + \int_{\mathcal{H}} \langle \mathcal{A}x, m\rangle_{\mathcal{H}}\,\mu(dx) + \int_{\mathcal{H}} \langle \mathcal{A}m, x\rangle_{\mathcal{H}}\,\mu(dx) - \langle \mathcal{A}m, m\rangle_{\mathcal{H}}.
\]
Now, by the definition of the mean of the measure, the last three terms sum to $\langle \mathcal{A}m, m\rangle_{\mathcal{H}}$. Thus, the rest of the proof consists of showing that $\int_{\mathcal{H}} \langle \mathcal{A}(x - m), x - m\rangle_{\mathcal{H}}\,\mu(dx) = \mathrm{tr}(\mathcal{A}Q)$. Note that for every $x \in \mathcal{H}$, $x - m = \lim_{n\to\infty} \Pi_n(x - m)$, and thus
\[
\lim_{n\to\infty} \langle \mathcal{A}\Pi_n(x - m), \Pi_n(x - m)\rangle_{\mathcal{H}} = \langle \mathcal{A}(x - m), x - m\rangle_{\mathcal{H}}.
\]
Moreover, we note that
\[
\big|\langle \mathcal{A}\Pi_n(x - m), \Pi_n(x - m)\rangle_{\mathcal{H}}\big| \le \|\mathcal{A}\|\,\|x - m\|^2_{\mathcal{H}},
\]
and since $Q$ is trace-class, the measure $\mu$ has a bounded second moment; hence $\int_{\mathcal{H}} \|x - m\|^2_{\mathcal{H}}\,\mu(dx) < \infty$. Therefore, we can apply the Lebesgue dominated convergence theorem to get
\[
\lim_{n\to\infty} \int_{\mathcal{H}} \langle \mathcal{A}\Pi_n(x - m), \Pi_n(x - m)\rangle_{\mathcal{H}}\,\mu(dx) = \int_{\mathcal{H}} \langle \mathcal{A}(x - m), x - m\rangle_{\mathcal{H}}\,\mu(dx). \tag{23}
\]
Next, let $\{e_i\}_{i=1}^{\infty}$ be the (complete) set of eigenvectors of $Q$ with corresponding (real) eigenvalues $\{\lambda_i\}_{i=1}^{\infty}$. We know that $\mathcal{A}Q$ is trace-class, with
\[
\mathrm{tr}(\mathcal{A}Q) = \sum_i \langle \mathcal{A}Q e_i, e_i\rangle_{\mathcal{H}} = \sum_i \lambda_i \langle \mathcal{A}e_i, e_i\rangle_{\mathcal{H}}. \tag{24}
\]
Also, note that
\[
\int_{\mathcal{H}} \langle \mathcal{A}\Pi_n(x - m), \Pi_n(x - m)\rangle_{\mathcal{H}}\,\mu(dx) = \sum_{i,j=1}^{n} \langle \mathcal{A}e_i, e_j\rangle_{\mathcal{H}} \int_{\mathcal{H}} \langle x - m, e_i\rangle_{\mathcal{H}}\,\langle x - m, e_j\rangle_{\mathcal{H}}\,\mu(dx) = \sum_{i,j=1}^{n} \langle \mathcal{A}e_i, e_j\rangle_{\mathcal{H}}\,\langle Q e_i, e_j\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \lambda_i \langle \mathcal{A}e_i, e_i\rangle_{\mathcal{H}}.
\]
Therefore, combining this last result with (23) and (24), we get
\[
\int_{\mathcal{H}} \langle \mathcal{A}(x - m), x - m\rangle_{\mathcal{H}}\,\mu(dx) = \lim_{n\to\infty} \int_{\mathcal{H}} \langle \mathcal{A}\Pi_n(x - m), \Pi_n(x - m)\rangle_{\mathcal{H}}\,\mu(dx) = \lim_{n\to\infty} \sum_{i=1}^{n} \lambda_i \langle \mathcal{A}e_i, e_i\rangle_{\mathcal{H}} = \mathrm{tr}(\mathcal{A}Q). \qquad\Box
\]

References

[1] A. Alexanderian, N. Petra, G. Stadler, and O. Ghattas. A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized $\ell_0$-sparsification.
SIAM Journal on Scientific Computing, 2014. To appear.
[2] A. C. Atkinson and A. N. Donev. Optimum Experimental Designs. Oxford University Press, 1992.
[3] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.
[4] T. Bui-Thanh, O. Ghattas, J. Martin, and G. Stadler. A computational framework for infinite-dimensional Bayesian inverse problems. Part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494–A2523, 2013.
[5] B. P. Carlin and T. A. Louis. Bayes and empirical Bayes methods for data analysis. Statistics and Computing, 7(2):153–154, 1997.
[6] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 10(3):273–304, 1995.
[7] J. B. Conway. A Course in Operator Theory. American Mathematical Society, 2000.
[8] G. Da Prato. An Introduction to Infinite-Dimensional Analysis. Springer, 2006.
[9] G. Da Prato and J. Zabczyk. Second Order Partial Differential Equations in Hilbert Spaces, volume 293. Cambridge University Press, 2002.
[10] M. Dashti, K. J. Law, A. M. Stuart, and J. Voss. MAP estimators and their consistency in Bayesian nonparametric inverse problems. Inverse Problems, 29, 2013.
[11] H. P. Flath, L. C. Wilcox, V. Akçelik, J. Hill, B. van Bloemen Waanders, and O. Ghattas. Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations. SIAM Journal on Scientific Computing, 33(1), 2011.
[12] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[13] F. Pukelsheim. Optimal Design of Experiments, volume 50. SIAM, 2006.
[14] M. Reed and B. Simon. Methods of Modern Mathematical Physics, Vol. 1: Functional Analysis. Academic Press, 1972.
[15] B. Simon. Notes on infinite determinants of Hilbert space operators. Advances in Mathematics, 24:244–273, 1977.
[16] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
[17] D. Uciński. Optimal Measurement Methods for Distributed Parameter System Identification. CRC Press, Boca Raton, 2005.
Institute for Computational Engineering and Sciences, The University of Texas at Austin
E-mail address: [email protected]

Mathematics Department, United States Naval Academy
E-mail address: [email protected]

Institute for Computational Engineering and Sciences, The University of Texas at Austin
E-mail address: