Hyperparameter Estimation in Bayesian MAP Estimation: Parameterizations and Consistency
Matthew M. Dunlop∗, Tapio Helin†, and Andrew M. Stuart‡
∗ Courant Institute of Mathematical Sciences, New York University, New York, New York, 10012, USA ([email protected]).
† School of Engineering Science, Lappeenranta University of Technology, Lappeenranta, 53850, Finland (tapio.helin@lut.fi).
‡ Computing & Mathematical Sciences, California Institute of Technology, Pasadena, California, 91125, USA ([email protected]).
Abstract.
The Bayesian formulation of inverse problems is attractive for three primary reasons: it provides a clear modelling framework; means for uncertainty quantification; and it allows for principled learning of hyperparameters. The posterior distribution may be explored by sampling methods, but for many problems it is computationally infeasible to do so. In this situation maximum a posteriori (MAP) estimators are often sought. Whilst these are relatively cheap to compute, and have an attractive variational formulation, a key drawback is their lack of invariance under change of parameterization. This is a particularly significant issue when hierarchical priors are employed to learn hyperparameters. In this paper we study the effect of the choice of parameterization on MAP estimators when a conditionally Gaussian hierarchical prior distribution is employed. Specifically we consider the centred parameterization, the natural parameterization in which the unknown state is solved for directly, and the noncentred parameterization, which works with a whitened Gaussian as the unknown state variable, and arises when considering dimension-robust MCMC algorithms; MAP estimation is well-defined in the nonparametric setting only for the noncentred parameterization. However, we show that MAP estimates based on the noncentred parameterization are not consistent as estimators of hyperparameters; conversely, we show that limits of finite-dimensional centred MAP estimators are consistent as the dimension tends to infinity. We also consider empirical Bayesian hyperparameter estimation, show consistency of these estimates, and demonstrate that they are more robust with respect to noise than centred MAP estimates. An underpinning concept throughout is that hyperparameters may only be recovered up to measure equivalence, a well-known phenomenon in the context of the Ornstein–Uhlenbeck process.
Key words.
Bayesian inverse problems, hierarchical Bayesian, MAP estimation, optimization, nonparametric inference, hyperparameter inference, consistency of estimators.
AMS subject classifications.
1. Introduction.
Let
X, Y be separable Hilbert spaces, and let A : X → Y be a linear map. We consider the problem of recovering a state u ∈ X from observations y ∈ Y given by

y = Au + η,  η ∼ N(0, Γ),   (1.1)

where η is random noise corrupting the observations. This is an example of a linear inverse problem, with the mapping u ↦ Au being the corresponding forward problem. In the applications we consider, X is typically an infinite-dimensional space of functions, and Y a finite-dimensional Euclidean space R^J.

Our focus in this paper is on the Bayesian approach to this inverse problem. We view y, u, η as random variables, assume that u and η are a priori independent with known distributions P(du) and P(dη), and seek the posterior distribution P(du | y). Bayes' theorem then states that

P(du | y) ∝ P(y | u) P(du).

In the hierarchical Bayesian approach the prior depends on hyperparameters θ which are appended to the state u to form the unknown. The prior on (u, θ) is factored as P(du, dθ) = P(du | θ) P(dθ) and Bayes' theorem states that

P(du, dθ | y) ∝ P(y | u) P(du | θ) P(dθ).

In this paper we study conditionally Gaussian priors in which P(du | θ) is a Gaussian measure for every fixed θ. Centred methods work directly with (u, θ) as unknowns, whilst noncentred methods work with (ξ, θ) where u = C(θ)^{1/2} ξ and ξ is, a priori, a Gaussian white noise; thus C(θ) is the covariance of u | θ. In the context of MCMC methods the use of noncentred variables has been demonstrated to confer considerable advantages. However the key message of this paper is that, when studying maximum a posteriori (MAP) estimation, and in particular consistency of learning hyperparameters θ in the data-rich limit, the centred parameterization is preferable to the noncentred parameterization.

The Bayesian approach is a fundamental and underpinning framework for statistical inference [7]. In the last decade it has started to become a practical computational tool for large scale inverse problems [23], realizing an approach to ill-posed inverse problems introduced in the 1970 paper [18]. The subject has developing mathematical foundations and attendant stability and approximation theories [16, 28, 29, 44, 39]. Furthermore, the subject of Bayesian posterior consistency is being systematically developed [4, 6, 35, 37, 27, 26, 41, 19, 20]. The paper [25] was the first to establish consistency in the context of hyperparameter learning, as we do here, and in doing so demonstrates that Bayesian methods have comparable capabilities to frequentist methods, regarding adaptation to smoothness, whilst also quantifying uncertainty. We comment further on the relationship of our work to [25] in more detail later in the paper, once the needed framework has been established.

For some problems it is still beyond reach to perform posterior sampling via MCMC or SMC methods. For this reason maximum a posteriori (MAP) estimation, which provides a point estimator of the unknown and has a variational formulation, remains an important practical computational tool [23].
Furthermore, MAP estimation links Bayesian inference with optimization approaches to inversion, and allows for the possibility of new optimization methods informed by the Bayesian perspective. As a consequence there is also a developing mathematical theory around MAP estimation for Bayesian ill-posed inverse problems, relating both to how to define a MAP estimator in the infinite-dimensional setting [2, 12, 15, 21, 22], and to the subject of posterior consistency of MAP estimators [5, 3, 15, 36, 38].

The focus of this paper is hierarchical Bayesian inversion with Gaussian priors such as the Whittle–Matérn and ARD priors. See [43, 33] for references to the literature in this area. In this context the question of centred versus noncentred parameterization is important in defining the problem [40]. This choice also has significant ramifications for algorithms: there are many examples of settings in which the noncentred approach is preferable to the centred approach within the context of Gibbs-based MCMC sampling [1, 17, 42, 47] and even non-Bayesian methods such as ensemble Kalman inversion [10]. Nonetheless, in the context of MAP estimation, we demonstrate in this paper that the message is rather different: centred methods are preferable.
The primary contributions of this paper are as follows:
• We demonstrate that, for MAP estimation, centred parameterizations are preferable to noncentred parameterizations when a goal of the inference is recovery of the hyperparameters θ. We provide conditions on the data model and prior distribution that lead to theorems describing the recovery, or lack of recovery, of the true hyperparameters in the simultaneous large data/small noise limit.
• We extend the theory to empirical Bayesian estimation of hyperparameters; we also demonstrate additional robustness that this method has over the centred parameterization.
• We demonstrate the precise sense in which hyperparameter recovery holds only up to measure equivalence.
In section 2 we introduce the Bayesian setting in which we work, emphasizing hierarchical Gaussian priors and describing the centred and noncentred formulations. In section 3 we review the concept of MAP estimation in a general setting, describing issues associated with working directly in infinite dimensions, and discussing different choices of parameterization. Section 4 contains the theoretical results concerning consistency of hyperparameter estimation, setting up the data-rich scenario, and studying the properties of hyperparameter estimators for the centred, noncentred and empirical Bayes settings; we show in particular the applicability of the theory to the case of hierarchical Whittle–Matérn priors and Automatic Relevance Determination (ARD) priors. In section 5 numerical results are given which illustrate the foregoing theory. In section 6 we conclude. Some lemmas required in the analysis are given in an appendix.
2. Bayesian Inverse Problems.
In this section we introduce the Bayesian hierarchical approach to the solution of inverse problems of the form considered in the introduction. In subsection 2.1 we describe the construction of the posterior distribution using Bayes' theorem in both finite and infinite dimensions, including a discussion of the Gaussian priors that we use when discussing consistency of MAP estimators. Subsection 2.2 is devoted to hierarchical priors, centred versus noncentred parameterization and sampling methods associated with the different choices of hierarchical parameterization; this sets the context for our results comparing MAP estimation with centred and noncentred parameterizations.
We describe the likelihood and posterior arising from a Bayesian treatment of the inverse problem of interest, in the setting of Gaussian random field priors.
In this paper we focus on the case where the prior is (in the hierarchical case, conditionally) Gaussian. Recall that a probability measure µ on X is a Gaussian measure if ℓ♯µ is a Gaussian measure on R for any bounded linear ℓ : X → R (given a measure µ on X and a measurable map T : X → X′, T♯µ denotes the pushforward measure on X′, defined by (T♯µ)(A) = µ(T⁻¹(A)) for all measurable A ⊆ X′); equivalently, µ is a Gaussian measure if u ∼ µ implies that ℓ(u) is a Gaussian random variable on R for any such ℓ. If X is a space of functions on a domain D ⊆ R^d, a random variable u on X with law µ is referred to as a Gaussian process on D (also sometimes termed a Gaussian random field). Such a Gaussian process is characterized completely by its mean function m : D → R and covariance function c : D × D → R:

m(x) = E^µ(u(x)) for all x ∈ D,
c(x, x′) = E^µ(u(x) − m(x))(u(x′) − m(x′)) for all x, x′ ∈ D.

Equivalently, it is characterized by its mean m ∈ X and covariance operator C : X → X,

m = E^µ(u),  C = E^µ(u − m) ⊗ (u − m).

When X = L²(D), the covariance function is related to the covariance operator by

(Cϕ)(x) = ∫_D c(x, x′) ϕ(x′) dx′ for all ϕ ∈ X, x ∈ D,

that is, C is the integral operator with kernel c. In particular, if C is the inverse of a differential operator, c is the Green's function for that operator. We now detail a number of Gaussian processes that arise as examples throughout the paper.

Example 2.1 (Ornstein–Uhlenbeck).
Let D = [0, 1]. Given σ, ℓ > 0, define the covariance function

c_OU(t, t′; σ, ℓ) = σ² exp(−|t − t′|/ℓ).

This is the covariance associated with the stationary Ornstein–Uhlenbeck process on [0, 1] defined by

du_t = −(u_t/ℓ) dt + √(2σ²/ℓ) dW_t,  u_0 ∼ N(0, σ²),
Figure 2.1.
Three sample paths each from Gaussian processes on [0, 1] with (left) Ornstein–Uhlenbeck, (middle) squared exponential, and (right) Matérn (ν = 3/2) covariance functions.

where σ² is the variance and ℓ the length-scale. The sample paths of this process are almost surely Hölder with any exponent less than one half, everywhere in D.

Given observation of u_t over any interval I ⊆ D, the diffusion coefficient 2σ²/ℓ may be found exactly by, for example, looking at quadratic variation. To see this, we rewrite in terms of (σ², β²) = (σ², 2σ²/ℓ), instead of treating (σ², ℓ) as the hyperparameters. With this parameterization we obtain

du_t = −(β²/2σ²) u_t dt + β dW_t,  u_0 ∼ N(0, σ²).

By Girsanov's theorem, the law of u is equivalent to that of βW_t for any choice of σ². Almost sure properties are shared between equivalent measures and for this reason it is possible to recover β from observation of u_t over any interval I ⊆ D, as it is from observation of βW_t [42]. However joint recovery of β and σ² requires more data, such as observation of a sample path on [0, ∞); see the discussion in [46].

Note also that the covariance function underlying this construction can be generalized to more general D ⊆ R^d, using |·| to denote the Euclidean norm on R^d – it is then typically referred to as the exponential covariance function.
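To make the identifiability discussion concrete, the following sketch (illustrative only, and not part of the original analysis) draws an exact sample of the stationary Ornstein–Uhlenbeck process on a fine grid by factorizing the covariance matrix built from c_OU, and then estimates the diffusion coefficient β² = 2σ²/ℓ from the quadratic variation of the path; σ² and ℓ are not separately identified by this statistic.

```python
import numpy as np

def sample_ou_path(sigma2, ell, n=2000, seed=0):
    """Draw an exact sample of the stationary OU process on a uniform grid of [0, 1]
    via a Cholesky factorization of the covariance matrix built from c_OU."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n + 1)
    C = sigma2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)   # c_OU(t, t'; sigma, ell)
    L = np.linalg.cholesky(C + 1e-12 * np.eye(n + 1))             # small jitter for stability
    return t, L @ rng.standard_normal(n + 1)

sigma2, ell = 1.0, 0.2
t, u = sample_ou_path(sigma2, ell)
# The quadratic variation of the path over [0, 1] estimates the diffusion coefficient
# beta^2 = 2 * sigma^2 / ell; only this combination is recoverable from a single interval.
qv = np.sum(np.diff(u) ** 2)
print("quadratic variation:", qv, "   2*sigma^2/ell:", 2 * sigma2 / ell)
```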
Example 2.2 (Squared Exponential). Let D ⊆ R^d. Given σ, ℓ > 0, define the covariance function

c_SE(x, x′; σ, ℓ) = σ² exp(−|x − x′|²/(2ℓ²)).

Then the corresponding Gaussian process has samples which are almost surely infinitely differentiable everywhere; the parameters σ², ℓ represent variance and length-scale as for the Ornstein–Uhlenbeck covariance.

Example 2.3 (Whittle–Matérn).
Let D ⊆ R^d. The Matérn (or Whittle–Matérn) covariance function provides an interpolation between the previous two examples in terms of sample regularity. The parameters σ², ℓ > 0 have the same meaning as in the previous two examples and, additionally, we introduce the regularity parameter ν > 0. Define the covariance function

c_WM(x, x′; σ, ℓ, ν) = σ² (2^{1−ν}/Γ(ν)) (|x − x′|/ℓ)^ν K_ν(|x − x′|/ℓ)

where K_ν is the modified Bessel function of the second kind of order ν. (Some authors include a factor √(2ν) before the distances |x − x′|; we omit it here for consistency with works such as [31, 43], which are key to the application of results in this paper.) Then the corresponding Gaussian process has samples which possess up to ν (fractional) weak derivatives almost surely; if the domain D is suitably regular they also possess up to ν Hölder derivatives almost surely. Note that we have

c_WM(x, x′; σ, ℓ/√(2ν), ν) → c_OU(x, x′; σ, ℓ) as ν → 1/2,
c_WM(x, x′; σ, ℓ/√(2ν), ν) → c_SE(x, x′; σ, ℓ) as ν → ∞.

If D = R^d, the covariance function c_WM(x, x′; σ, ℓ, ν) is the Green's function for the fractional differential operator given by

(2.1)  L(σ, ℓ, ν) = [Γ(ν) / (σ² ℓ^d Γ(ν + d/2) (4π)^{d/2})] (I − ℓ²∆)^{ν + d/2}.

This is the precision operator for the Gaussian measure. The corresponding covariance operator is given by C(σ, ℓ, ν) = L(σ, ℓ, ν)^{−1}. On more general domains D ⊆ R^d, boundary conditions must be imposed on the Laplacian in order to ensure the invertibility of L(σ, ℓ, ν); this generally affects the stationarity of samples, however conditions may be chosen such that stationarity of samples is (approximately) preserved [14, 24].

Finally, observe that if −∆ on a bounded domain D, subject to appropriate boundary conditions, diagonalizes with eigenbasis {ϕ_j} and corresponding eigenvalues {λ_j}, then C(σ, ℓ, ν) diagonalizes in the same basis with eigenvalues {µ_j(σ, ℓ, ν)},

(2.2)  µ_j(σ, ℓ, ν) = σ² ℓ^d [(4π)^{d/2} Γ(ν + d/2) / Γ(ν)] (1 + ℓ²λ_j)^{−ν − d/2}.

This is used later when considering consistency of point estimates.
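As a small illustration (not taken from the paper), the eigenvalue formula (2.2) can be evaluated directly once the Laplacian eigenvalues are known; the sketch below assumes the negative Laplacian on D = (0, 1) with eigenvalues λ_j = π²j², and all parameter values are illustrative.

```python
import numpy as np
from math import gamma, pi

def wm_eigenvalue(j, sigma2, ell, nu, d=1):
    """Eigenvalue mu_j(sigma, ell, nu) from (2.2), assuming -Laplacian eigenvalues
    lambda_j = (pi * j)^2 on D = (0, 1)."""
    lam = (pi * j) ** 2
    const = sigma2 * ell**d * (4 * pi) ** (d / 2) * gamma(nu + d / 2) / gamma(nu)
    return const * (1.0 + ell**2 * lam) ** (-nu - d / 2)

print([wm_eigenvalue(j, sigma2=1.0, ell=1.0, nu=1.5) for j in range(1, 6)])
# In this one-dimensional setting the eigenvalues decay like lambda_j^(-nu - d/2),
# i.e. like j^(-2*nu - d): larger nu (smoother samples) gives faster decay.
```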
Using the model (1.1), assuming η ⊥ u, we have y | u ∼ N(Au, Γ) and so

P(y | u) ∝ exp(−Φ(u; y)),  (2.3a)
Φ(u; y) = ½‖Au − y‖²_Γ,  (2.3b)

where we have introduced the notation ‖z‖²_T := ⟨z, T⁻¹z⟩ for strictly positive-definite matrix or operator T on a Hilbert space with inner-product ⟨·, ·⟩; here we use the Euclidean inner-product on Y = R^J. Other data models, such as those involving multiplicative or non-Gaussian noise, may lead to more complicated likelihood functions – we focus on Gaussian additive noise in this article for both clarity of presentation and analytical tractability.
The posterior distribution is the law of the unknown state u given the data y, that is, the law P(u | y). Bayes' theorem shows how to construct the posterior in terms of the prior and likelihood – at the level of densities, it says formally that

P(u | y) = P(y | u) P(u) / P(y).

If X = R^n and the prior µ = N(0, C) is Gaussian, so that the above densities exist, the posterior density is then given by

P(u | y) ∝ exp(−Φ(u; y) − ½‖u‖²_C),  (2.4)

where the (data-dependent) proportionality constant is given by 1/P(y) and

P(y) = ∫_X exp(−Φ(u; y) − ½‖u‖²_C) du.

In the setting where X is a function space and µ = N(0, C) is a centred Gaussian random process prior, the posterior measure µ^y is given by

µ^y(du) ∝ exp(−Φ(u; y)) µ(du).  (2.5)

The proportionality constant is given by 1/P(y) and

P(y) = ∫_X exp(−Φ(u; y)) µ(du).

The posterior itself is Gaussian in this conjugate setting: µ^y = N(m, C_post), where, formally,

m = CA*(Γ + ACA*)⁻¹y,  C_post = C − CA*(Γ + ACA*)⁻¹AC.  (2.6)

(In infinite dimensions justification of these formulae requires careful specification of the functional analytic setting [30].)

In more general cases, such as when the forward map is non-linear or the prior is only conditionally Gaussian, sampling typically cannot be performed directly, and methods such as MCMC or SMC must be used instead to target the posterior. We note here that when the prior is Gaussian, MCMC and SMC methods are available for targeting the posterior that are well-defined on function space and possess dimension-independent convergence properties [9, 13, 8] – the existence of such methods is important when considering the choice of hierarchical parameterization in the next subsection.
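In finite dimensions the conjugate formulas (2.6) can be implemented directly; the following sketch (illustrative, with arbitrary problem sizes rather than any used in the paper) computes the posterior mean and covariance for the linear-Gaussian model.

```python
import numpy as np

def gaussian_posterior(A, Gamma, C, y):
    """Posterior mean and covariance (2.6) for y = A u + eta, eta ~ N(0, Gamma), u ~ N(0, C)."""
    S = Gamma + A @ C @ A.T                   # data-space covariance Gamma + A C A*
    K = C @ A.T @ np.linalg.inv(S)
    m = K @ y                                 # m = C A* (Gamma + A C A*)^{-1} y
    C_post = C - K @ A @ C                    # C - C A* (Gamma + A C A*)^{-1} A C
    return m, C_post

rng = np.random.default_rng(0)
n, J = 8, 5
A = rng.normal(size=(J, n))
C = np.eye(n)
Gamma = 0.1 * np.eye(J)
u_true = rng.normal(size=n)
y = A @ u_true + rng.multivariate_normal(np.zeros(J), Gamma)
m, C_post = gaussian_posterior(A, Gamma, C, y)
print("posterior mean:", m)
```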
The choice of a particular prior distribution with fixedparameters may be too restrictive in practice. For example, if a Whittle–Mat´ern Gaussiandistribution is chosen, good estimates of the regularity parameter ν or length-scale ℓ maynot be known, and differing choices of these parameters can lead to very different estimatesunder the posterior [34]. In the Bayesian paradigm we may treat these parameters asunknown random variables and place a prior distribution upon them. We now describealgorithmic issues arising from how we choose to parameterize the resulting Bayesian inverseproblem. We denote the hyperparameters by θ ∈ Θ, andassume Θ is finite-dimensional. Denoting ρ the Lebesgue density of the prior on θ , wedefine the conditionally Gaussian prior distribution on ( u, θ ) ∈ X × Θ by µ (d u, d θ ) = ν (d u ; θ ) ρ ( θ ) d θ (2.7)where ν (d u ; θ ) = N (0 , C ( θ )). Bayes’ theorem is applied as above, and the posterior is nowa measure on the product space Z = X × Θ: µ y (d u, d θ ) ∝ exp( − Φ( u ; y )) µ (d u, d θ ) . (2.8)As in the non-hierarchical setting, it is desirable to produce samples from the posterior inorder to perform inference. The posterior is no longer Gaussian even when the forwardmap is linear, and so we cannot sample it directly. We can however take advantage ofthe conditional Gaussianity of the prior and the existence of dimension-robust MCMCsampling algorithms, as outlined in Algorithm 2.1. Algorithm 2.1
Centred Metropolis-within-Gibbs
Choose u^(1) ∈ X, θ^(1) ∈ Θ.
for k = 1 : K do
  Generate u^(k) → u^(k+1) with a dimension-robust MCMC algorithm targeting u | y, θ^(k).
  Generate θ^(k) → θ^(k+1) with an MCMC algorithm targeting θ | y, u^(k+1).
end for
return {(u^(k), θ^(k))}_{k=1}^K.

However, even though the update u^(k) → u^(k+1) uses a dimension-robust algorithm, the update θ^(k) → θ^(k+1) can be problematic even though it is only targeting a finite-dimensional distribution. The acceptance probability for a proposed update θ^(k) → θ′ involves the Radon–Nikodym derivative between the Gaussian distributions ν(·; θ^(k)) and ν(·; θ′). Such a derivative does not exist in general – by the Feldman–Hájek theorem Gaussian measures in infinite dimensions are either equivalent or singular, and the restrictive conditions required for equivalence mean that in many naturally occurring situations, two Gaussian measures corresponding to different values of θ are singular. In practice this means that, for Algorithm 2.1, any updates to θ have vanishingly small acceptance probability with respect to increasingly fine discretization of X; see [42] for a seminal analysis of this phenomenon. In the next subsubsection we discuss how this problem can be circumvented by means of a reparameterization.

In the natural or centred parameterization [40], we treat the input u to the forward map as an unknown in the problem. However, the conditional nature of the prior on the pair (u, θ) leads to sampling problems related to measure singularity as described above. We therefore look for a way of parameterizing the prior that avoids this. We first make the observation that if ξ ∼ N(0, I), then for any fixed θ ∈ Θ we have

C(θ)^{1/2} ξ ∼ N(0, C(θ)) = ν(du; θ).

Therefore, if we choose ξ ∼ N(0, I) and θ ∼ ρ independently, we have

(C(θ)^{1/2} ξ, θ) ∼ ν(du; θ) ρ(dθ) = µ(du, dθ).

We can hence write a sample from µ as a deterministic transform of a sample from the product measure N(0, I) × ρ – this reparameterization is referred to as noncentring in the literature [40]. It has the advantage that we may pass it to the posterior distribution by sampling an appropriate surrogate distribution instead of directly targeting the posterior. We now make the preceding statement precise. Let X̄ be a space of distributions that white noise samples ξ ∼ N(0, I) belong to almost surely, and define the product spaces Z = X × Θ, Z̄ = X̄ × Θ. Define the mapping T : Z̄ → Z by T(ξ, θ) = (C(θ)^{1/2} ξ, θ). Then we have the following.

Proposition 2.4 (Noncentring).
Let µ y denote the hierarchical posterior (2.8) on Z withprior µ . Define the measures ¯ µ , ¯ µ y on ¯ Z by ¯ µ = N (0 , I ) × ρ and ¯ µ y (d ξ, d θ ) ∝ exp (cid:0) − Φ( T ( ξ, θ )) (cid:1) ¯ µ (d ξ, d θ ) . Then µ = T ♯ ¯ µ and µ y = T ♯ ¯ µ y .Proof. The first equality follows from the preceding discussion, and the second from astandard property of pushforward measures: Z f ( x ) ( T ♯ µ )(d x ) = Z f ( T ( y )) µ (d y ) . The key consequence of this proposition is that if we sample ( ξ, θ ) ∼ ¯ µ y , we have T ( ξ, θ ) ∼ µ y . We therefore use MCMC to target ¯ µ y instead of µ y – since the field ξ andhyperparameter θ are independent under the prior, the previous measure singularity issuesdisappear. This leads us to Algorithm 2.2. Here we have implicitly extended Φ : X → R to Φ : Z → R via projection: Φ( u, θ ) ≡ Φ( u ). Algorithm 2.2
Noncentred Metropolis-within-Gibbs
Choose ξ^(1) ∈ X̄, θ^(1) ∈ Θ.
for k = 1 : K do
  Generate ξ^(k) → ξ^(k+1) with a dimension-robust MCMC algorithm targeting ξ | y, θ^(k).
  Generate θ^(k) → θ^(k+1) with an MCMC algorithm targeting θ | y, ξ^(k+1).
end for
return {T(ξ^(k), θ^(k))}_{k=1}^K.

Making the choice of noncentred variables over centred variables leads, in the context of Gibbs-based MCMC, to significant improvement in algorithmic performance, as detailed in a number of papers [42, 40, 47, 1, 11]. However, as we demonstrate in the remainder of this paper, for MAP estimation different considerations come into play, and centred methods are preferable.
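A minimal finite-dimensional sketch of Algorithm 2.2 is given below. It is not the authors' implementation: the proposal choices (a pCN update for ξ | y, θ and a random-walk Metropolis update for θ | y, ξ), the white-noise likelihood, and all step sizes are illustrative assumptions.

```python
import numpy as np

def noncentred_gibbs(y, A, gamma, sqrt_C, log_rho, theta0, n_iter=5000,
                     beta=0.2, step=0.1, seed=0):
    """Sketch of Algorithm 2.2: pCN updates for xi | y, theta and random-walk
    Metropolis updates for theta | y, xi, in a discretized (finite-dimensional) setting."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    xi = rng.standard_normal(n)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))

    def log_like(xi_, theta_):
        u = sqrt_C(theta_) @ xi_                   # u = C(theta)^{1/2} xi
        r = A @ u - y
        return -0.5 * np.dot(r, r) / gamma**2      # -Phi(T(xi, theta); y), white noise Gamma = gamma^2 I

    samples = []
    for _ in range(n_iter):
        # pCN proposal: reversible with respect to the N(0, I) prior on xi,
        # so the acceptance probability involves only the likelihood ratio.
        xi_prop = np.sqrt(1.0 - beta**2) * xi + beta * rng.standard_normal(n)
        if np.log(rng.uniform()) < log_like(xi_prop, theta) - log_like(xi, theta):
            xi = xi_prop
        # Random-walk Metropolis on the (finite-dimensional) hyperparameter theta.
        theta_prop = theta + step * rng.standard_normal(theta.size)
        log_acc = (log_like(xi, theta_prop) + log_rho(theta_prop)
                   - log_like(xi, theta) - log_rho(theta))
        if np.log(rng.uniform()) < log_acc:
            theta = theta_prop
        samples.append((sqrt_C(theta) @ xi, theta.copy()))   # return T(xi, theta)
    return samples
```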
3. Point Estimation.
Sampling of the posterior distribution, for example using MCMCmethods as mentioned in the previous section, or SMC methods as in [8], may be pro-hibitively expensive computationally if a large number of samples are required. It is thendesirable to find a point estimate for the solution to the problem, as opposed to the fullposterior distribution. The conditional mean is one such point estimate, but this typicallyrequires samples in order to be computed. Two alternative point estimates that we studyin this paper, and define in this section, are the MAP estimate and the empirical Bayes(EB) estimate, both of which can be computed through optimization procedures. Theformer can be interpreted as the mode of the posterior distribution, and the latter as acompromise between the mean and the mode. In subsection 3.1 we introduce the basicMAP estimator and discuss its properties under change of variables. In subsection 3.2 wegeneralize to centred and noncentred hierarchical formulations; mapping from one formu-lation to the other may be viewed as a hyperparameter dependent change of variables. Insubsection 3.3 we define the empirical Bayes estimator.
In this subsection we review the definition of a MAP estimator in infinite dimensions and discuss its dependence on the choice of parameterization.
Suppose first that X = R^n and the posterior admits a Lebesgue density: µ^y(du) ∝ π^y(u) du. A MAP estimate, or mode of the posterior distribution, is then any point u ∈ X that maximizes π^y. Equivalently, it is any point that minimizes −log π^y, which is usually more stable to deal with numerically. When the prior µ = N(0, C) is taken to be Gaussian and the data model (1.1), (2.3) is used, so that the posterior density is given by (2.4), a point u ∈ X is hence a MAP estimator if and only if it minimizes the functional

I(u) = Φ(u; y) + ½‖u‖²_C.  (3.1)

The existence of a Lebesgue density is central to this definition of MAP estimate. We, however, are primarily interested in the case that X is infinite-dimensional and a more general definition is therefore required. Dashti et al. [15] introduced such a generalization as follows.

Definition 3.1.
Let µ be a Borel probability measure on a Banach space X, and denote by B_δ(u) the ball of radius δ centred at u ∈ X. A point u* ∈ X is said to be a MAP estimator for the measure µ if

lim_{δ→0} [ µ(B_δ(u*)) / max_{u∈X} µ(B_δ(u)) ] = 1.

More general definitions have subsequently been introduced [21, 12], but for the measures considered in this article they are equivalent to the definition above. If a Gaussian prior ν = N(0, C) is chosen and the data model (1.1), (2.3) is used so that the posterior distribution is given by (2.5), then it is known [15] that a point u is a MAP estimator if and only if it minimizes the Onsager–Machlup functional given by (3.1); the quadratic penalty term is the Cameron–Martin norm associated to the Gaussian measure on the Hilbert space X. This provides an explicit link between Bayesian and classical (Tikhonov) regularization. Note that, as distinct from the finite-dimensional case, the quadratic term in I(u) is infinite at almost every point of the space X: µ({u ∈ X | ‖u‖_C < ∞}) = 0. Although we have framed this discussion for the linear inverse problem (1.1) subject to additive Gaussian noise, it applies to the nonlinear setting, with Gaussian priors, and Φ is simply the negative log-likelihood; however for this paper we consider only linear inverse problems with additive Gaussian noise and Φ is given by (2.3b).

MAP estimation makes a deep connection to classical applied mathematics approaches to inversion via optimization and for this reason it has an important place in the theory of Bayesian inversion. However an often-cited criticism of MAP estimation within the statistics community is that the methodology depends on the choice of parameterization of the model. To see this, assume again that X = R^n and that the posterior admits a Lebesgue density π^y(u), so that the MAP estimator maximizes π^y. Suppose that we have a (smooth) bijective map T : X → X, and instead write the unknown as u = T(ξ) for some new coordinates ξ. Then the posterior in the coordinates ξ is given by

π̄^y(ξ) = π^y(T(ξ)) × |det(∇T(ξ))|,

that is, for any bounded measurable f : X → R we have

∫_X f(u) π^y(u) du = ∫_X f(T(ξ)) π̄^y(ξ) dξ.

Due to the presence of this Jacobian determinant, the MAP estimators using the two coordinates generally differ. If there were no determinant term, we would have equivalence of the MAP estimators in the following sense, which is straightforward to verify.
Proposition 3.2. ξ* ∈ arg max π^y(T(·)) if and only if T(ξ*) ∈ arg max π^y(·).

It is natural to study how this issue of reparameterization affects MAP estimators for hierarchical problems. In the previous section we chose a reparameterization in order to enable robust sampling. We show, however, that this reparameterization has undesirable effects on MAP estimation for hyperparameters.
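The dependence on parameterization is easy to see numerically. The following toy one-dimensional sketch (not from the paper; the density and the map T are illustrative) compares the MAP estimate in the original coordinates with the image of the MAP estimate computed in transformed coordinates, where the Jacobian term enters.

```python
import numpy as np

# Toy one-dimensional posterior density (unnormalized): Gaussian likelihood times Gaussian prior.
def log_pi(u, y=1.0, gamma=0.5, c=1.0):
    return -0.5 * (u - y) ** 2 / gamma**2 - 0.5 * u**2 / c

u_grid = np.linspace(-3.0, 3.0, 200001)
map_u = u_grid[np.argmax(log_pi(u_grid))]

# Reparameterize u = T(xi) = xi**3; the density in xi picks up the Jacobian |T'(xi)| = 3*xi^2.
xi_grid = np.linspace(-1.5, 1.5, 200001)
log_pi_xi = log_pi(xi_grid**3) + np.log(3.0 * xi_grid**2 + 1e-300)
map_xi = xi_grid[np.argmax(log_pi_xi)]

print("MAP in u coordinates:     ", map_u)
print("T(MAP in xi coordinates): ", map_xi**3)   # generally differs, due to the Jacobian term
```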
In this subsection we extend the definition of a MAP estimator to the centred and noncentred hierarchical parameterizations introduced in the previous section.
We are interested in the case where µ on Z = X × Θ is given by (2.7). The dependence of the covariance operator onthe hyperparameter θ means that we cannot directly apply the above result for Gaussianmeasures to write down the Onsager-Machlup functional, as the normalization factor forthe measure ν (d u ; θ ) depends on θ . If X = R n is finite dimensional, we may write downthe Onsager-Machlup functional as I C ( u, θ ) = Φ( u ; y ) + 12 k u k C ( θ ) + 12 log det C ( θ ) − log ρ ( θ ) . Now consider the case where n → ∞ and R n represents approximation of an infinitedimensional space X . Since the limiting operator C ( θ ) is symmetric and compact, then thedeterminant of finite dimensional approximations tends to zero as n → ∞ . Additionally,the set of points for which the quadratic term is finite may depend on the hyperparameter θ – in particular such sets for different values of θ may intersect only at the origin. As anexample of this latter phenomenon, consider the Whittle–Mat´ern process with precisionoperator L given by (2.1). The quadratic penalty term is h u, Lu i X and, for different valuesof ν these correspond to different Sobolev space penalizations. In summary, both thedefinition and optimization of the functional I C may be problematic in infinite dimensions;we show in what follows that this is also true for sequences of finite dimensional problemswhich approach the infinite dimensional limit.Assuming now X = R n , if we fix θ ∈ Θ, then we can optimize I C ( · , θ ) to find u ( θ ) ∈ X such that J C ( θ ) := I C ( u ( θ ) , θ ) ≤ I C ( u, θ ) for all u ∈ X. In the linear setting (1.1) that is our focus, using (2.6), we have u ( θ ) = C ( θ ) A ∗ ( Γ + A C ( θ ) A ∗ ) − y. (3.2)We may then optimize J C ( · ) to find θ ∗ ∈ Θ such that J C ( θ ∗ ) = I C ( u ( θ ∗ ) , θ ∗ ) ≤ I C ( u ( θ ) , θ ) ≤ I C ( u, θ ) for all u ∈ X, θ ∈ Θ . YPERPARAMETER ESTIMATION IN BAYESIAN INVERSE PROBLEMS 13
The task of optimizing I C is hence reduced to that of optimizing J C . In the next section westudy the behaviour of minimizers of J C as the quality of the data increases. If we work with the noncentredcoordinates introduced in the previous section, the joint prior measure is the independentproduct of a Gaussian measure on ¯ X and with the hyperprior on Θ. MAP estimators canhence be seen to be well-defined on the infinite-dimensional space ¯ Z = ¯ X × Θ, and to beequivalent to minimizers of the Onsager-Machlup functional I NC ( ξ, θ ) = Φ( C ( θ ) / ξ ; y ) + 12 k ξ k I − log ρ ( θ ) . Note that if we reverse the transformation and write ( ξ, θ ) = T − ( u, θ ) = ( C ( θ ) − / u, θ ),we could equivalently define I NC on Z = X × Θ by I NC ( u, θ ) = Φ( u ; y ) + 12 k u k C ( θ ) − log ρ ( θ ) , in view of Proposition 3.2. This is I C , with the the problematic log-determinant termsubtracted.As in the centred case, we can now fix θ and optimize I NC ( · , θ ) over to find ξ ( θ ) ∈ ¯ X such that J NC ( θ ) := I NC ( ξ ( θ ) , θ ) ≤ I NC ( ξ, θ ) for all ξ ∈ ¯ X. Again, in the linear setting (1.1), using (2.6), we have that ξ ( θ ) is given by ξ ( θ ) = C ( θ ) / A ∗ ( Γ + A C ( θ ) A ∗ ) − y. Note that u ( θ ) = C ( θ ) / ξ ( θ ), which is consistent with Proposition 3.2. However, note that J C = J NC : only the former has the log-determinant term, and so the MAP estimate for thehyperparameters typically differs between the two parameterizations. Remark J C and J NC is related to thevolume term arising from change of parameterization, consider the case X = R n . We startwith the measure µ (d u, d θ ) ∝ exp( − I C ( u, θ ))d θ = exp (cid:18) − Φ( u ; y ) − k u k C ( θ ) −
12 log det C ( θ ) + log ρ ( θ ) (cid:19) d u d θ =: f ( u, θ ) d u d θ. We make the transformation ( u, θ ) = T ( ξ, θ ) = ( C ( θ ) / ξ, θ ). The density in these newcoordinates is now given by h ( ξ, θ ) = f ( T ( ξ, θ )) × | det( ∇ T ( ξ, θ )) | . The Jacobian determinant may be calculated asdet( ∇ T ( ξ, θ )) = det( ∇ ξ T ( ξ, θ )) det( ∇ θ T ( ξ, θ )) = det( C ( θ ) / ) det( I ) = det( C ( θ )) / . The log determinant terms hence cancel, giving h ( ξ, θ ) ∝ exp (cid:18) − Φ( C ( θ ) / ξ ; y ) − k ξ k I + log ρ ( θ ) (cid:19) = exp( − I NC ( ξ, θ )) . Instead of jointly optimizing over the state u and hyperparameters θ , we may integrate out the state to obtain a measure just on θ . Inthis case, one considers finding the mode of the marginal measure P (d θ | y ) = Z X µ y (d u, d θ ) = (cid:18) P ( y ) Z X exp( − Φ( u ; y )) ν (d u ; θ ) (cid:19) ρ ( θ ) d θ. The corresponding functional we wish to optimize to find θ is hence given by J E ( θ ) = − log (cid:18)Z X exp( − Φ( u ; y )) ν (d u ; θ ) (cid:19) − log ρ ( θ ) . (3.3)In general the above functional cannot be written down more explicitly due to the in-tractability of the integral. When X = R n is finite-dimensional, the integral may be ap-proximated using a Monte Carlo average over samples { u j } Mj =1 ∼ exp( − Φ( u ; y )) ν (d u ; θ ′ )for any fixed θ ′ ∈ Θ: J E ( θ ) = − log (cid:18)Z X ν ( u ; θ ) ν ( u ; θ ′ ) exp( − Φ( u ; y )) ν (d u ; θ ′ ) (cid:19) − log ρ ( θ ) ≈ − log M X j =1 exp (cid:18) k u j k C ( θ ′ ) − k u j k C ( θ ) + 12 log det C ( θ ′ ) C ( θ ) − (cid:19) − log ρ ( θ )=: J E ( θ ; θ ′ , { u j } ) , where the log-sum-exp trick may be used numerically to avoid underflow [32, § J E via Algorithm 3.1, which alternates approx-imating the integral above via samples from the conditional posterior given the currenthyperparameter values, and optimizing over the hyperparameters given these samples, aform of expectation-maximization (EM) algorithm. The sampling in each step is typicallybe performed using a dimension-robust MCMC algorithm, such as the pCN algorithm; theresulting random sequence { θ ( k ) } can then be averaged, for example, to produce a singlehyperparameter estimate.In the linear setting (1.1), the integral in (3.3) can be computed analytically usingGaussian structure. Rather than calculate the integral above directly, we note that wemay rewrite the data in noncentred coordinates as y = A C ( θ ) / ξ + η, η ∼ N (0 , Γ ) YPERPARAMETER ESTIMATION IN BAYESIAN INVERSE PROBLEMS 15
Algorithm 3.1
EM Algorithm
Choose initial estimate θ^(1) for the hyperparameter.
for k = 1 : K do
  Sample {u_j^(k)}_{j=1}^M ∼ exp(−Φ(u; y)) ν(du; θ^(k)).
  θ^(k+1) ← argmin_{θ∈Θ} J_E(θ; θ^(k), {u_j^(k)}).
end for

where ξ ∼ N(0, I); from this it can be seen that P(y | θ) = N(0, Γ + AC(θ)A*). Thus, by Bayes' theorem,

P(dθ | y) ∝ det(Γ + AC(θ)A*)^{−1/2} exp(−½‖y‖²_{Γ + AC(θ)A*}) ρ(θ) dθ.

Modes of this marginal measure are then given by minimizers of the functional

J_E(θ) = ½‖y‖²_{Γ + AC(θ)A*} + ½ log det(Γ + AC(θ)A*) − log ρ(θ).

Despite involving norms and determinants on the data space rather than the state space, the form of J_E is actually very similar to that of J_C, as is shown in the following section.

Remark. We refer to the mode of the marginal measure P(dθ | y) as the empirical Bayes estimator for θ. Such a choice can also be considered as a regularized maximum likelihood estimator, where the hyperparameter density acts as a regularizer.
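The three objectives can be compared directly in a finite-dimensional linear-Gaussian setting. The sketch below (illustrative, not the authors' code) profiles out the state using (3.2) and then adds or omits the relevant log-determinant terms, so that J_C and J_NC differ only by ½ log det C(θ) while J_E works on the data space.

```python
import numpy as np

def objectives(theta, y, A, Gamma, C_of_theta, log_rho):
    """Profiled centred objective J_C, noncentred objective J_NC, and the
    marginal (empirical Bayes) objective J_E for the linear-Gaussian model."""
    C = C_of_theta(theta)
    S = Gamma + A @ C @ A.T
    u = C @ A.T @ np.linalg.solve(S, y)                 # u(theta), equation (3.2)
    r = A @ u - y
    phi = 0.5 * r @ np.linalg.solve(Gamma, r)           # Phi(u(theta); y)
    quad = 0.5 * u @ np.linalg.solve(C, u)              # (1/2) ||u(theta)||^2_{C(theta)}
    J_NC = phi + quad - log_rho(theta)
    J_C = J_NC + 0.5 * np.linalg.slogdet(C)[1]          # extra (1/2) log det C(theta) term
    J_E = (0.5 * y @ np.linalg.solve(S, y)
           + 0.5 * np.linalg.slogdet(S)[1] - log_rho(theta))
    return J_C, J_NC, J_E
```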
4. Consistency of Point Estimators.
In the previous section we derived three different functionals, J_C, J_NC and J_E. Optimizing each of these functionals leads to different estimates of the hyperparameters of the same underlying statistical model. In this section we study the behaviour of these estimates in a data-rich scenario. In subsection 4.1 we spell out the precise data model that we use; it corresponds to a finite, dimension N, truncation of the linear inverse problem (1.1), and since subsequent limit theorems focus on the situation in which the observational noise standard deviation γ is small, we write the resulting functionals to be optimized as J_C^{N,γ}, J_NC^{N,γ} and J_E^{N,γ}. Proposition 4.4 gives the exact form for the resulting functionals and demonstrates the similar form taken by J_C^{N,γ} and J_E^{N,γ}, whilst also showing that J_NC^{N,γ} is substantially different. Subsection 4.2 contains the limit theorems which characterize the three different estimators in the data-rich limit. Theorem 4.8 shows that the centred and empirical Bayes approaches recover the true parameter value whilst the noncentred approach does not. In subsection 4.3 we discuss examples.

In order to analyse the behaviour of these minimizers, we work in the simplified setup where the forward map A is linear, and A*A is simultaneously diagonalizable with the family of covariance operators. Specifically, we make the following assumptions.

Assumptions 4.1.
We assume in what follows that:(i) The map A ∗ A and family of prior covariance operators { C ( θ ) } θ ∈ Θ are strictly pos-itive and simultaneously diagonalizable with orthonormal eigenbasis { ϕ j } , and wehave A ∗ Aϕ j = a j ϕ j , C ( θ ) ϕ j = µ j ( θ ) ϕ j for all j ∈ N , θ ∈ Θ . (ii) The noise covariance Γ = γ I is white.Remark Γ is non-degenerate: we may work with the transformed data Γ − / y and re-define A as Γ − / A . We could hence equivalently replace A ∗ A with A ∗ Γ − A in the firstassumption.We choose the basis { ψ j } for Y given by ψ j = Aϕ j / k Aϕ j k = Aϕ j /a j ; it can readily bechecked that this is an orthonormal basis. Assume that the true state u † that generatesthe data is drawn from the distribution N (0 , C ( θ † )) for some θ † ∈ Θ. We define the data y γ ∈ Y by y γ = Au † + γη, η ∼ N (0 , I ) , where we have made the dependence of the data on γ explicit. We define individualobservations y γj ∈ R of the data y γ ∈ Y as y γj := h y γ , ψ j i = 1 a j h Au † , Aϕ j i + γ h η, ψ j i = 1 a j h u † , A ∗ Aϕ j i + γ h η, ψ j i = a j u † j + γη j , η j iid ∼ N (0 , , j ∈ N , (4.1)where u † j := h u † , ϕ j i . It is convenient to note that we have the equality in distribution withthe noncentred-type representation y γj d = q a j µ j ( θ † ) + γ ξ † j , ξ † j iid ∼ N (0 , , j ∈ N . (4.2)As we establish results regarding convergence of minimizers in probability, there is no lossin generality in assuming that the data is given by (4.2) instead of (4.1).The infinite collection of scalar problems (4.1) is equivalent to the full infinite-dimensional problem. We consider a sequence of finite-dimensional problems arising fromtaking the first N of these observations, so that data provided for the N th problem is givenby y γj = a j u † j + γη j , η j iid ∼ N (0 , , j = 1 , . . . , N. (4.3) YPERPARAMETER ESTIMATION IN BAYESIAN INVERSE PROBLEMS 17
We take the prior distribution for these problems to be the projection of the full prioronto the span of the first N eigenfunctions { ϕ j } Nj =1 , so that both the state and the dataare finite-dimensional. To motivate why we use this projection of the prior distribution,we look at the structure of the likelihood. Writing y γ N for the vector of observations( y γ , . . . , y γN ) ∈ R N , the negative log-likelihood of y γ N given u takes the formΦ γ ( u ; y γ N ) = 12 γ N X j =1 |h Au − y γ , ψ j i| = 12 γ N X j =1 (cid:12)(cid:12)(cid:12)(cid:12) a j h A ∗ Au, ϕ j i − h y γ , ψ j i (cid:12)(cid:12)(cid:12)(cid:12) = 12 γ N X j =1 | a j u j − y γj | where u j := h u, ϕ j i . The posterior on u j for j > N is hence uninformed by the observationsand remains the same as the prior. To be more explicit, for the N th problem we choosethe conditional prior distribution ν N ( · ; θ ) = P ♯N ν ( · ; θ ), where P N : X → R N is givenby ( P N u ) j = u j for j = 1 , . . . , N . Since ν ( · ; θ ) = N (0 , C ( θ )) is Gaussian on X , this isequivalent to saying ν N ( · ; θ ) = N (0 , P N C ( θ ) P ∗ N ) is Gaussian on R N .We denote by J N,γ C , J N,γ NC and J N,γ E the functionals J C , J NC and J E respectively con-structed for these finite dimensional problems. We study the convergence of estimatesof the hyperparameter θ to its true value θ † in the simultaneous limit of the number ofobservations y γ , . . . , y γN going to infinity and the noise level γ going to zero. Remark J N,γ NC and J N,γ E ; J N,γ C however does change. Nonetheless, if the non-truncated prior is used to writedown J N,γ C , poor estimates for hyperparameters are obtained as the prior then dominatesover the observations, see subsection 5.2 for an illustration.For brevity, in what follows we use the notation f ( θ ) ∝ g ( θ ) to mean that f ( θ ) = αg ( θ ) + β for some constants α, β – note that f and g then have the same minimizers. Proposition 4.4.
Define s_j^γ(θ) = a_j²µ_j(θ) + γ². Then we have

J_C^{N,γ}(θ) ∝ (1/2N) Σ_{j=1}^N [ (y_j^γ)²/s_j^γ(θ) − log( µ_j(θ†)/µ_j(θ) ) ] − (1/N) log ρ(θ),  (4.4)

J_NC^{N,γ}(θ) ∝ (1/2N) Σ_{j=1}^N (y_j^γ)²/s_j^γ(θ) − (1/N) log ρ(θ),  (4.5)

J_E^{N,γ}(θ) ∝ (1/2N) Σ_{j=1}^N [ (y_j^γ)²/s_j^γ(θ) − log( s_j^γ(θ†)/s_j^γ(θ) ) ] − (1/N) log ρ(θ).  (4.6)

Remark. The proportionality in (4.4) and (4.6) implicitly involves the shifts

J_C^{N,γ}(θ) ↦ J_C^{N,γ}(θ) − (1/2N) Σ_{j=1}^N log µ_j(θ†),  J_E^{N,γ}(θ) ↦ J_E^{N,γ}(θ) − (1/2N) Σ_{j=1}^N log s_j^γ(θ†).

These do not affect minimizers, as the shifts are constant in θ. These transformations are useful in the next section in the derivation of a limiting functional as N → ∞ and γ → 0.
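Before turning to the proof, a short sketch of the diagonalized objectives may be helpful. It generates data from the spectral representation (4.2) and evaluates (4.4)–(4.6) up to the stated affine transformations; the eigenvalue sequences, hyperprior and parameter values are illustrative assumptions, not those of the paper.

```python
import numpy as np

def spectral_objectives(theta, y, a, gamma, mu, log_rho, theta_true):
    """Evaluate the (affinely transformed) objectives (4.4)-(4.6) for the spectral data
    model y_j = a_j u_j + gamma eta_j, where s_j(theta) = a_j^2 mu_j(theta) + gamma^2."""
    N = len(y)
    s, s_true = a**2 * mu(theta) + gamma**2, a**2 * mu(theta_true) + gamma**2
    J_NC = 0.5 * np.mean(y**2 / s) - log_rho(theta) / N
    J_C = J_NC - 0.5 * np.mean(np.log(mu(theta_true) / mu(theta)))
    J_E = J_NC - 0.5 * np.mean(np.log(s_true / s))
    return J_C, J_NC, J_E

# Illustrative spectral model (not the paper's): a_j = 1/j, mu_j(theta) = theta * j^{-2}.
N, gamma, theta_true = 500, 1e-7, 2.0
j = np.arange(1, N + 1)
a, mu, log_rho = 1.0 / j, (lambda th: th * j**-2.0), (lambda th: 0.0)   # flat hyperprior
rng = np.random.default_rng(0)
y = np.sqrt(a**2 * mu(theta_true) + gamma**2) * rng.standard_normal(N)  # representation (4.2)

thetas = np.linspace(0.5, 4.0, 201)
vals = np.array([spectral_objectives(t, y, a, gamma, mu, log_rho, theta_true) for t in thetas])
print("argmin J_C :", thetas[vals[:, 0].argmin()])    # close to theta_true
print("argmin J_NC:", thetas[vals[:, 1].argmin()])    # pushed to the largest theta in the grid
print("argmin J_E :", thetas[vals[:, 2].argmin()])    # close to theta_true
```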
Instead of the expression for u ( θ ) given by (3.2), we use the alternative expres-sion u ( θ ) = ( A ∗ Γ − A + C ( θ ) − ) − A ∗ Γ − y = 1 γ (cid:18) γ A ∗ A + C ( θ ) − (cid:19) − A ∗ y which follows from the Sherman–Morrison–Woodbury formula. Using the simultaneousdiagonalizability, we then have that u j ( θ ) := h u ( θ ) , ϕ j i = 1 γ a j γ + 1 µ j ( θ ) ! − h A ∗ y, ϕ j i = a j µ j ( θ ) s γj ( θ ) y γj . Now consider the functional J N,γ ( θ ) := Φ γ ( u ; θ ) + 12 k u k C ( θ ) . We may calculate J N,γ ( θ ) = 12 γ N X j =1 ( a j u j ( θ ) − y γj ) + 12 N X j =1 u j ( θ ) µ j ( θ )= 12 N X j =1 ( y γj ) γ a j µ j ( θ ) s γj ( θ ) − ! + a j µ j ( θ ) s γj ( θ ) = 12 N X j =1 ( y γj ) s γj ( θ ) (cid:20) γ (cid:16) a j µ j ( θ ) − s γj ( θ ) (cid:17) + a j µ j ( θ ) (cid:21) = 12 N X j =1 ( y γj ) s γj ( θ ) . The expression for J N,γ NC then follows. For J N,γ C , we note that12 log det C ( θ ) = 12 N X j =1 log µ j ( θ ) ∝ − N X j =1 log µ j ( θ † ) µ j ( θ ) YPERPARAMETER ESTIMATION IN BAYESIAN INVERSE PROBLEMS 19 from which the result follows. Finally we deal with the empirical Bayes case J N,γ E . Observethat 12 k y γ k Γ + A C ( θ ) A ∗ = 12 N X i,j =1 y γi y γj h ψ i , ( Γ + A C ( θ ) A ∗ ) − ψ j i = 12 N X i,j =1 y γi y γj · a i a j h ϕ i , A ∗ ( Γ + A C ( θ ) A ∗ ) − Aϕ j i . Using the Sherman–Morrison–Woodbury identity again, we may write A ∗ ( Γ + A C ( θ ) A ∗ ) − A = A ∗ Γ − A − A ∗ Γ − A ( A ∗ Γ − A + C ( θ ) − ) − A ∗ Γ − A = 1 γ A ∗ A − γ A ∗ A (cid:18) γ A ∗ A + C ( θ ) − (cid:19) − γ A ∗ A, and so by the simultaneous diagonalizability, and orthonormality of { ϕ j } ,12 k y k Γ + A C ( θ ) A ∗ = 12 N X j =1 ( y γi ) a j a j γ − a j γ a j γ − µ j ( θ ) ! − a j γ = 12 N X j =1 ( y γi ) γ " − a j µ j ( θ ) s γj ( θ ) = 12 N X j =1 ( y γi ) s γj ( θ ) . To deal with the log-determinant term, we use Lemma A.1 to see that12 log det( Γ + A C ( θ ) A ∗ ) = 12 log det( A ∗ Γ A + A ∗ A C ( θ ) A ∗ A ) −
12 log det( AA ∗ ) . Since { ϕ j } is an orthonormal basis for X , the first determinant may be calculated as12 log det( A ∗ Γ A + A ∗ A C ( θ ) A ∗ A ) = 12 N X j =1 log( a j γ + a j µ j ( θ )) = 12 N X j =1 log( a j s γj ( θ ))and so 12 log det( Γ + A C ( θ ) A ∗ ) ∝ − N X j =1 log s γj ( θ † ) s γj ( θ )from which the result follows. We study convergence of the minimizers of the ran-dom functionals J N,γ C , J N,γ NC and J N,γ E in the simultaneous limit N → ∞ and γ →
0. Weestablish that, if the noise level decays sufficiently fast relative to the smallest value of theproduct of the singular values and the prior covariance, for the truncated problem, thenthe true hyperparameter is recovered in the cases of the centred MAP and empirical Bayesestimates. We also establish that it is not recovered in the case of the noncentred MAPestimate.Let γ N > N observations are taken. We define s γ N j ( θ ) = a j µ j ( θ ) + γ N as in Proposition 4.4, and define b Nj ( θ ) = s γ N j ( θ † ) s γ N j ( θ ) . In order to establish the convergence, we make the following assumptions.
Assumptions 4.6.
We assume in what follows that:
(i) Θ ⊆ R^k is compact.
(ii) min_{j=1,...,N} a_j²µ_j(θ)/γ_N² → ∞ as N → ∞ for all θ ∈ Θ.
(iii) g(θ, θ†) := lim_{j→∞} µ_j(θ†)/µ_j(θ) exists for all θ ∈ Θ, and the map θ ↦ g(θ, θ†) − log g(θ, θ†) is lower semicontinuous.
(iv) If g(θ, θ†) = 1, then θ = θ†.
(v) The maps θ ↦ log µ_j(θ) are Lipschitz on Θ for each j ∈ N, with Lipschitz constants uniformly bounded in j.
(vi) The maps θ ↦ b_j^N(θ) are Lipschitz on Θ for each j = 1, . . . , N, N ∈ N, with Lipschitz constants uniformly bounded in j, N.
(vii) The map θ ↦ log ρ(θ) is Lipschitz on Θ.

Assumption (i) is made to avoid complications with hyperparameter estimates potentially diverging. Assumption (ii) gives the rate at which the noise must decay relative to the decay of the singular values of the (whitened) forward map – the more ill-posed the problem is, and the weaker the prior is, the faster the noise must vanish. Assumption (iii) allows a limiting functional to be identified, and (iv) is an identifiability assumption which allows us to identify the true hyperparameter. Assumptions (v)–(vii) are made to ensure the functionals J_C^{N,γ_N}, J_NC^{N,γ_N}, J_E^{N,γ_N} are also Lipschitz with Lipschitz constants (almost surely) uniformly bounded in N; note that when combined with the assumed compactness of Θ, we thus obtain existence of minimizers of these functionals over Θ.

Remark. Rather than making the noise level γ_N a function of the number of observations, we could also consider having the number of observations N_γ as a function of the noise level – this may be more appropriate in practice as one may not have control over the noise level. In this case, one would need to replace Assumption (ii) with

min_{j=1,...,N_γ} a_j²µ_j(θ)/γ² → ∞ as γ → 0.

We work with γ_N to make the arguments clearer: our sequences of functionals are indexed by a discrete rather than continuous parameter.
Let Assumptions hold, and let { θ N C } , { θ N E } , { θ N NC } denote sequences ofminimizers over Θ of { J N,γ N C } , { J N,γ N E } , { J N,γ N NC } respectively.(i) θ N C , θ N E → θ † in probability as N → ∞ .(ii) Assume further that g ( · , θ † ) has a unique minimizer θ ∗ . Then θ N NC → θ ∗ in proba-bility as N → ∞ .Remark rates of the empirical estimator, like us in probability, and they usethis to deduce that the empirical posterior on u contracts around the ground truth at anoptimal rate. Whereas we assume data to be generated according to P ( y | θ † ), Knapik et al.consider P ( y | u † ) as the data generating distribution and the function u † implicitly identifiesthe true regularity θ † . Remark θ ∗ = θ † , and so the result concerning theconvergence of { θ N NC } is a negative result: the true hyperparameter is not recovered. Proof of Theorem . We establish the result in full for J N,γ N C , and note the smallmodifications required to establish the results for J N,γ N E and J N,γ N NC . We start by provingitem (i). We have J N,γ N C ( θ ) = 12 N N X j =1 " ( y γ N j ) s γ N j ( θ ) − log µ j ( θ † ) µ j ( θ ) − N log ρ ( θ ) . We rewrite y γ N j using the representation (4.2): y γ N j d = q s γ N j ( θ † ) ζ j , ζ j iid ∼ N (0 , , and so J N,γ N C ( θ ) d = 12 N N X j =1 (cid:20) b Nj ( θ ) ζ j − log µ j ( θ † ) µ j ( θ ) (cid:21) − N log ρ ( θ ) . We can see formally from the assumptions that, for each θ ∈ Θ, b Nj ( θ ) → g ( θ, θ † ) as j, N → ∞ , and so the strong law of large numbers suggests that J N,γ N C ( θ ) → J C ( θ ) := 12 g ( θ, θ † ) −
12 log g ( θ, θ † )almost surely. Observe that J C is minimized if and only if g ( θ, θ † ) = 1, which by Assump-tions 4.6(iv) occurs if and only if θ = θ † . We hence wish to establish convergence of theminimizers of J N,γ N C to that of J C . In order to show this convergence, we use the approach of[45]. Specifically we use the result of Exercise 3.2.3, which follows from Corollary 3.2.3(ii)and the Arzel`a-Ascoli theorem. We must establish that:(a) J N,γ N C converges pointwise in probability to J C ;(b) the maps θ J N,γ N C ( θ ) are Lipschitz on Θ for each N , with (random) Lipschitzcoefficients uniformly bounded in N almost surely;(c) J C is lower semicontinuous with a unique minimum at θ † ; and(d) θ N C = O P (1).The point (c) is true by assumption, and (d) follows since Θ is compact. To establish thatpoint (a) holds, we note that it suffices to show that, in probability, for each θ ∈ Θ, (cid:12)(cid:12)(cid:12)(cid:12) N N X j =1 b Nj ( θ )( ζ − (cid:12)(cid:12)(cid:12)(cid:12) → (cid:12)(cid:12)(cid:12)(cid:12) N N X j =1 (cid:18) b Nj ( θ ) − log µ j ( θ † ) µ j ( θ ) (cid:19) − N log ρ ( θ ) − J C ( θ ) (cid:12)(cid:12)(cid:12)(cid:12) → . (4.9)Note that the expression (4.9) is deterministic. Define the map G N ( θ ) = 12 N N X j =1 b Nj ( θ )( ζ j − . We show that G N ( θ ) → θ ∈ Θ; since the limit is constant, the convergencethen also occurs in probability. Combining Lemma A.2 with Assumptions 4.6(ii),(iii), wesee that 1 N N X j =1 b Nj ( θ ) → g ( θ, θ † )(4.10)for each θ ∈ Θ. The proof of Lemma A.2 implies, in particular, that the sequence { b Nj ( θ ) } j,N is uniformly bounded for each θ . Since ζ j iid ∼ χ , we have that the charac- YPERPARAMETER ESTIMATION IN BAYESIAN INVERSE PROBLEMS 23 teristic function of G N ( θ ) satisfies E (cid:16) exp (cid:0) itG N ( θ ) (cid:1)(cid:17) = N Y j =1 E (cid:18) exp (cid:18) it · N b Nj ( θ )( ζ j − (cid:19)(cid:19) = N Y j =1 − b Nj ( θ ) itN ! − exp (cid:18) − it · N b Nj ( θ ) (cid:19) = exp − N X j =1 " log − b Nj ( θ ) itN ! + b Nj ( θ ) itN = exp − N X j =1 − b Nj ( θ ) itN − b Nj ( θ ) itN ! − O ( N − ) + b Nj ( θ ) itN = exp − N X j =1 " b Nj ( θ ) t N − O ( N − ) . From the boundedness of { b Nj ( θ ) } j,N , we deduce that the sum in the exponent tends tozero as N → ∞ . It follows that E (exp ( itG N ( θ ))) → exp(0) = E Z ∼ δ (exp( itZ ))and so G N ( θ ) → N N X j =1 (cid:18) b Nj ( θ ) − log µ j ( θ † ) µ j ( θ ) (cid:19) − N log ρ ( θ ) − J C ( θ )= 12 N N X j =1 (cid:16) b Nj ( θ ) − g ( θ, θ † ) (cid:17) − N N X j =1 (cid:18) log µ j ( θ † ) µ j ( θ ) − log g ( θ, θ † ) (cid:19) − N log ρ ( θ ) . The first sum vanishes as N → ∞ due to the convergence (4.10), the second vanishesdue to Assumptions 4.6(ii), and third clearly vanishes. The convergence (4.9) follows, andhence so does the pointwise convergence in probability J N,γ N C → J C . It remains to show Here log refers to the principal branch of the complex logarithm – note that we are bounded away fromthe branch cut since the argument always has real part 1. the Lipschitz condition (b). We have, for any θ , θ ∈ Θ, | J N,γ N C ( θ ) − J N,γ N C ( θ ) | ≤ N N X j =1 | b Nj ( θ ) − b Nj ( θ ) | ζ j + 12 N N X j =1 | log µ j ( θ ) − log µ j ( θ ) | + 12 | log ρ ( θ ) − log ρ ( θ ) | . By Assumptions 4.6(v)-(vii) the Lipschitz property follows. 
The almost sure boundednessof the Lipschitz constants follows from the strong law of large numbers, since the i.i.d.random variables ζ j have finite second moments.In the case of J N,γ N E , the limiting functional is the same: J E = J C . The proof forconvergence of minimizers differs only in the expression (4.9), wherein the logarithmic termin the sum is replaced by log b Nj ( θ ); this does not affect the convergence of the expression.We now study (ii). The functional J N,γ N NC differs from J N,γ N C only in the absence of thelogarithmic term – it is easy to see that the limiting functional is then given by J N,γ NC ( θ ) := 12 g ( θ, θ † ) , and that the required conditions (a)–(d) above are satisfied, since existence of a uniqueminimizer θ ∗ of g ( · , θ † ) is assumed. The same result from [45] may then be used to obtainthe stated result. Remark N (0 , C ( θ † )) and N (0 , C ( θ )) are equivalent if and only if ∞ X j =1 (cid:18) µ j ( θ † ) µ j ( θ ) − (cid:19) < ∞ which in particular implies that the limit g ( θ, θ † ) is identically 1. The limiting functionalis hence minimized by any θ that gives rise to an equivalent measure. Remark g ( θ, θ † ) is infinite whenever θ = θ † . Even though this limit clearly identifies the true hyperparameters, Theorem 4.8does not directly apply, since, for example, Assumptions 4.6(iii),(vii) cannot hold. Oneapproach to avoid this is to replace the objective functional J N,γ N C ( θ ) by J N,γ N C ( θ ) ε N forsome positive sequence ε N → t t ε N is strictly increasing for all N . Such a sequence { ε N } may be chosenin practice to be such that ( µ N ( θ † ) /µ N ( θ )) ε N converges to a finite value for each θ as N → ∞ . Examples of situations where these infinite limits occur, and appropriate choicesof sequences { ε N } to obtain finite limits, are discussed in what follows. YPERPARAMETER ESTIMATION IN BAYESIAN INVERSE PROBLEMS 25
We now provide examples which elucidate Theorem 4.8.
Example 4.13 (Whittle–Matérn).
Consider the case where the conditional Gaussian priors are Whittle–Matérn distributions on a bounded domain D ⊆ R^d. As mentioned in Example 2.3, the covariance operators diagonalize in the eigenbasis {ϕ_j} of the Laplacian on D. Since D is bounded we simply define the Whittle–Matérn process to have covariance given by the inverse of (2.1), and where we equip the Laplacian with Dirichlet, Neumann or periodic boundary conditions; we note that for all three such sets of boundary conditions, the eigenvalues λ_j of the negative Laplacian tend to infinity. We first consider the case where we are hierarchical about the standard deviation σ and the length-scale ℓ, and denote θ = (σ, ℓ) ∈ Θ. Fixing the regularity parameter ν > 0, the eigenvalues are given by

µ_j(θ) = κ(ν) σ² ℓ^d (1 + ℓ²λ_j)^{−ν−d/2} = κ(ν) σ² ℓ^{−2ν} (ℓ^{−2} + λ_j)^{−ν−d/2}

for some constant κ(ν). We may then calculate

g(θ, θ†) = lim_{j→∞} (σ†²/σ²)(ℓ/ℓ†)^{2ν} [ (ℓ^{−2} + λ_j)/((ℓ†)^{−2} + λ_j) ]^{ν+d/2} = (σ†²/σ²)(ℓ/ℓ†)^{2ν}.

We then see that g(θ, θ†) = 1 if and only if σ²ℓ^{−2ν} = σ†²(ℓ†)^{−2ν}. This equality is satisfied by infinitely many pairs (σ, ℓ). In order to apply Theorem 4.8 we require that the equality is only satisfied by the true hyperparameters. Therefore, instead of attempting to infer the pair (σ, ℓ), we attempt to infer the pair (σ², β²) := (σ², σ²ℓ^{−2ν}); this is closely related to the discussion around the Ornstein–Uhlenbeck process in Example 2.1. We then have

µ_j(θ) = κ(ν) β² ((β²/σ²)^{1/ν} + λ_j)^{−ν−d/2}

which leads to

g(θ, θ†) = (β†/β)².

When σ² is fixed, by applying Theorem 4.8, we can deduce that the parameter β is identifiable via the centred MAP and empirical Bayesian methods; the proof that the requisite assumptions are satisfied under appropriate conditions is provided in Lemma A.3. In particular, assuming the algebraic decay a_j ≍ j^{−a} and γ_N ≍ N^{−w}, Assumption 4.6(ii) is equivalent to

w > a + ν/d + 1/2.  (4.11)

(This condition is slightly weaker than that required for measure equivalence – for the measures to be equivalent we require in addition that d ≤ 3, see for example Theorem 1 in [17].)
We also see that the parameter β is not identifiable via the noncentred MAP method, since g ( · ; θ ) is minimized by taking β as large as possible.In the case where we are hierarchical about the regularity parameter ν , the assumptionsof Theorem do not hold. Nonetheless, the limiting functional can still be formallycalculated as J C ( ν ) = ( ∞ ν = ν † ν = ν † which is clearly minimized if and only if ν = ν † . As discussed in Remark , we canrescale to obtain a finite limiting functional; in this case making the choice ε N = 1 / log(1 + λ N ) achieves this. Example 4.14 (Automatic Relevance Determination).
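A quick numerical check of the preceding computation is straightforward (the values below are illustrative, with λ_j taken as Laplacian eigenvalues on (0, 1)): the ratio µ_j(θ†)/µ_j(θ) approaches (β†/β)², and equals one in the limit whenever (σ², ℓ) lies on the curve σ²ℓ^{−2ν} = σ†²(ℓ†)^{−2ν}.

```python
import numpy as np
from math import pi

def mu_ratio(j, sigma2, ell, sigma2_t, ell_t, nu, d=1):
    """Ratio mu_j(theta_true)/mu_j(theta) for Whittle-Matern eigenvalues, assuming
    lambda_j = (pi*j)^2 on (0, 1); the j-independent constant kappa(nu) cancels."""
    lam = (pi * j) ** 2
    mu = lambda s2, l: s2 * l ** (-2 * nu) * (l**-2 + lam) ** (-nu - d / 2)
    return mu(sigma2_t, ell_t) / mu(sigma2, ell)

nu, sigma2_t, ell_t = 1.5, 1.0, 1.0
# A different (sigma2, ell) pair on the same curve sigma^2 * ell^{-2 nu} = sigma2_t * ell_t^{-2 nu}:
sigma2 = 4.0
ell = (sigma2_t / sigma2) ** (-1 / (2 * nu)) * ell_t
for j in [1, 10, 100, 1000]:
    print(j, mu_ratio(j, sigma2, ell, sigma2_t, ell_t, nu))   # tends to (beta_t/beta)^2 = 1
```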
The Automatic Relevance Deter-mination (ARD) kernel is typically defined by c ( x, x ′ ; θ ) = σ exp − d X k =1 (cid:18) x k − x ′ k θ k (cid:19) ! . This is the Green’s function for the anisotropic heat equation at time t = 1 : ∂u∂t ( t, x ) = d X k =1 θ k ∂u ∂x k ( t, x ) , u (0 , x ) = σ ξ ( x ) . The corresponding covariance operator is hence given by C ( θ ) = σ exp(∆ θ ) := σ exp − d X k =1 θ k ∂ ∂x k ! . On rectangular domains this family of operators is simultaneously diagonalizable underthe Laplacian eigenbasis. For example, if D = (0 , d and we impose Dirichlet boundaryconditions on the Laplacian, then the eigenvalues are given by µ i ,...,i d ( θ ) = σ exp − π d X k =1 θ k i k ! . The results we have concerning consistency are given in terms of eigenvalues indexed bya single index j rather than a multi-index ( i , . . . , i d ) . Rather than consider a particularenumeration of the multi-indices, we instead aim to infer each hyperparameter θ k individ-ually by only sending i k → ∞ – this amounts to taking a subset of the observations. Theproblem of inferring each θ k is then essentially equivalent to inference of the length-scaleparameter of squared exponential prior with d = 1 .Note that Theorem does not apply in this case – the limiting functional J C is infiniteeverywhere except for the true hyperparameter, as was the case when inferring the parameter YPERPARAMETER ESTIMATION IN BAYESIAN INVERSE PROBLEMS 27 ν in the previous example. Again, following Remark , we can rescale to obtain a finiteobjective function; in this case making the choice ε N = 1 /N suffices. ARD versions ofgeneral Whittle–Mat´ern covariances can also be obtained by replacing the negative Laplacian − ∆ with its anistropic analogue − ∆ θ within the precision operator. It can be verified thatthe requisite assumptions for Theorem are satisfied in this case when ν < ∞ ; the proofis almost identical to that of Lemma A.3 and is hence omitted for brevity.
5. Numerical Experiments.
In this section we present a number of numerical experiments in order both to validate the theory presented and to illustrate how the theory may extend beyond what has been proven. Subsection 5.1 introduces a diagonalizable deblurring problem which is considered in the subsequent subsections. Subsection 5.2 looks at the behaviour of minimizers of J_C^{N,γ} with and without the prior truncation, as discussed in Remark 4.3. Subsection 5.3 looks at the traces of the errors between the hyperparameter estimates and the truth, comparing the convergence rates of the different functionals. Subsection 5.4 considers the setup of Example 4.13, wherein the variance and length-scale parameters are to be jointly inferred; the minimizers are confirmed numerically to lie on the curve of hyperparameters which give rise to equivalent measures. Finally, subsection 5.5 considers settings that enable us to test whether the assumptions of the theory are sharp – in particular we see that they appear sharp only for the centred MAP approach, with the empirical Bayes estimates appearing to be more robust with respect to noise.

In this subsection we consider the case that the forward map is given by a linear blurring operator. Let {ϕ_j}_{j=0}^∞ denote the cosine Fourier basis on D = (0, 1), with ϕ_j(x) = √2 cos(πjx) for j ≥ 1, and define A : L^2(D) → L^2(D) by

⟨Au, ϕ_j⟩ = j^{-2} ⟨u, ϕ_j⟩ for j ≥ 1,  ⟨Au, ϕ_0⟩ = 0.

The map A may then be viewed as the solution operator f ↦ u for the problem

−∆u(x) = π^2 f(x) for x ∈ D,  u′(0) = u′(1) = 0,  ∫_D u(x) dx = 0.  (5.1)

It could equivalently be viewed as a convolution operator, writing

(Au)(x) = ∫_D G(x, x′) u(x′) dx′

where G(x, x′) is the Green's function for the system (5.1). This choice of forward operator is convenient as it diagonalizes in the same basis as the Whittle–Matérn covariance operators on D, which are what we use throughout this subsection. In Figure 5.1 we show the true state u† that we fix throughout this subsection, and its image Au† under A. It is drawn from a Whittle–Matérn distribution with parameters σ† = ℓ† = 1, ν† = 3/2.
To be explicit, in the notation of section 4, we have a_j = 1/j^2 and

µ_j(θ) = σ^2 ℓ^{-2ν} (π^2 j^2 + ℓ^{-2})^{-ν-1/2}.

We first provide some numerical justification for the truncation of the prior at the same level as the observations when using the centred parameterization, as discussed in Remark 4.3. We fix a maximum discretization level N_max = 10, and look at the behaviour of minimizers of the two functionals

J_C^{N,γ_N}(θ) ∝ Σ_{j=1}^N (y_j^γ)^2 / s_j^γ(θ) − Σ_{j=1}^N log( µ_j(θ†)/µ_j(θ) ) − log ρ(θ),

Ĵ_C^{N,γ_N}(θ) ∝ Σ_{j=1}^N (y_j^γ)^2 / s_j^γ(θ) − Σ_{j=1}^{N_max} log( µ_j(θ†)/µ_j(θ) ) − log ρ(θ),

as N is increased. We consider a conditional Whittle–Matérn prior, treating the inverse length-scale θ = ℓ^{-1} as a hyperparameter, and set γ_N = N^{-w} with w chosen so that (4.11) is satisfied. In Figure 5.2 we show how the errors between the estimated inverse length-scales and the truth compare between the two functionals as N increases. It can be seen that the error for the truncated prior is bounded above by that for the full prior, as expected.

We now compare numerically the behaviour of optimizers of the three functionals J_C^{N,γ_N}, J_NC^{N,γ_N} and J_E^{N,γ_N}, and verify that the conclusions of Theorem 4.8 hold. As above, we consider a conditional Whittle–Matérn prior with the inverse length-scale θ = ℓ^{-1} as a hyperparameter, and choose γ_N as above. In Figure 5.3 we show how the errors between the three sequences of minimizers and the truth compare as N increases. We see that the noncentred MAP error diverges, as expected: the limiting functional is given by

J_NC(θ) = (1/2) (ℓ/ℓ†)^{2ν},

which is minimized as ℓ^{-1} → ∞. The empirical Bayes and centred MAP errors both generally decrease as N is increased, again as expected, with the empirical Bayes estimate slightly outperforming the centred MAP estimate for moderate N; the noncentred MAP estimator fails to converge. Also in Figure 5.3 we show the same errors averaged over 1000 independent realizations of the truth u† ∼ N(0, C(θ†)) and noise η ∼ N(0, I), and the same behaviour is observed. A reason for the empirical Bayes estimate outperforming the centred MAP estimate for moderate N may be that the terms in the summation in the functional (4.6) take the form x_j − log x_j, which is minimized by x_j = 1, rather than x_j − log x_j^ε for some x_j^ε ≈ x_j as in (4.4). For larger N there is very little difference between the two functionals, since γ_N → 0.
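The following sketch is a minimal illustration of this experimental setup (it is not the authors' code; the optimizer, the grid of values of N, the noise exponent w = 5 and the use of a flat hyperprior are all assumptions made for the illustration). It simulates data from the diagonal model y_j = a_j u_j + γ_N η_j with a_j = 1/j^2 and a Whittle–Matérn prior, and minimizes a centred objective of the form above over the inverse length-scale θ = ℓ^{-1}:

import numpy as np
from scipy.optimize import minimize_scalar

def mu(j, ell, sigma=1.0, nu=1.5):
    # Prior eigenvalues mu_j(theta) = sigma^2 ell^(-2 nu) (pi^2 j^2 + ell^(-2))^(-nu - 1/2).
    return sigma**2 * ell**(-2 * nu) * (np.pi**2 * j**2 + ell**(-2))**(-nu - 0.5)

def simulate(N, gamma, rng):
    j = np.arange(1, N + 1).astype(float)
    a = j**(-2.0)                                      # forward map a_j = 1/j^2
    u = rng.standard_normal(N) * np.sqrt(mu(j, 1.0))   # true hyperparameters sigma = ell = 1
    y = a * u + gamma * rng.standard_normal(N)         # y_j = a_j u_j + gamma_N eta_j
    return j, a, y

def J_centred(inv_ell, j, a, y, gamma):
    # Centred objective, up to theta-independent constants and with a flat hyperprior:
    # sum_j y_j^2 / s_j(theta) + sum_j log mu_j(theta), where s_j(theta) = a_j^2 mu_j(theta) + gamma^2.
    m = mu(j, 1.0 / inv_ell)
    s = a**2 * m + gamma**2
    return np.sum(y**2 / s) + np.sum(np.log(m))

rng = np.random.default_rng(0)
for N in [50, 200, 1000]:
    gamma = float(N)**(-5.0)          # gamma_N = N^(-w) with w = 5, inside the regime w > 4 discussed below
    j, a, y = simulate(N, gamma, rng)
    res = minimize_scalar(J_centred, bounds=(1e-2, 1e2), args=(j, a, y, gamma), method="bounded")
    print(N, res.x)                   # expected to approach the true inverse length-scale 1

The empirical Bayes analogue replaces the term Σ_j log µ_j(θ) by Σ_j log s_j(θ), i.e. the negative log marginal likelihood of the data given θ.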
We now consider the same setup as the previous subsubsection, but treat both the inverse length-scale ℓ^{-1} and the standard deviation σ as hyperparameters: θ = (σ, ℓ^{-1}). The resulting family of conditional prior measures is then equivalent along any curve {(σ, ℓ^{-1}) | σℓ^{-ν} = constant}, as discussed in Example 4.13, and so the hyperparameters cannot be identified beyond this curve. This is illustrated in Figure 5.4. In the top row we plot the functional J_C^{N,γ_N}(σ, ℓ) for (σ, ℓ^{-1}) ranging over a bounded square, with N increasing from left to right. In the bottom row we plot the sets

{ (σ, ℓ^{-1}) | ℓ^{-1} ∈ argmin_{ℓ^{-1}} J_C^{N,γ_N}(σ, ℓ) },  { (σ, ℓ^{-1}) | σ ∈ argmin_{σ} J_C^{N,γ_N}(σ, ℓ) },

along with the curve σℓ^{-ν} = σ†(ℓ†)^{-ν}, i.e. g(·, θ†)^{-1}(1); the global minimizer (i.e. the intersection of these sets) is also shown as a green dot. We see that the sets of minimizers concentrate on the limiting curve g(·, θ†)^{-1}(1) as N is increased. For reference, we also consider the same experiments, but working with the reparameterization θ = (σ, β) introduced in Example 4.13, so that β should be identifiable. In Figure 5.5 we see that this is indeed the case, with the curves now concentrating on the line β = σ†(ℓ†)^{-ν} = 1 as N is increased.

We now choose to be hierarchical about just the inverse length-scale θ = ℓ^{-1}. In the theory we made Assumptions 4.6(ii), concerning the decay rate of the forward map and covariance singular values relative to the decay of the noise level. For Whittle–Matérn priors, assuming the algebraic decay a_j ≍ j^{-a} and γ_N ≍ N^{-w}, the condition on w required for Assumptions 4.6(ii) to hold is given by (4.11). In the setup considered here, this translates to w > 4. We now investigate numerically whether this condition is sharp, making the three choices γ_N = N^{-w} for w = 3.5, 4, 4.5. The resulting error traces are shown in Figure 5.6 for the centred MAP and empirical Bayesian methods. The condition appears likely to be sharp for the centred optimization, given that convergence fails at the borderline case. For the empirical Bayesian optimization, however, the condition does not appear to be necessary, with convergence occurring in all cases, suggesting it is a more stable estimator than the MAP estimator. In light of Remark 4.7, we also consider the same setup, but now fixing the noise level γ and choosing the number of observations in terms of it: making the choice N_γ = γ^{-1/w}, the condition on w, equivalent to (4.7), is the same as before. In Figure 5.7 we show the errors, as γ is decreased, for the choices w = 3.5, 4, 4.5, and the same trends are observed as in Figure 5.6.
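As a rough illustration of this sharpness comparison (again a sketch under the same illustrative assumptions as the earlier snippet, not the authors' experiment; the empirical Bayes objective is taken here to be the negative log marginal likelihood of the data given θ, with a flat hyperprior), one can compare the centred and empirical Bayes estimates at the borderline scaling w = 4:

import numpy as np
from scipy.optimize import minimize_scalar

def mu(j, ell, sigma=1.0, nu=1.5):
    # Whittle-Matern prior eigenvalues, as in the earlier sketch.
    return sigma**2 * ell**(-2 * nu) * (np.pi**2 * j**2 + ell**(-2))**(-nu - 0.5)

def estimate(objective):
    # Minimize a scalar objective over the inverse length-scale on a bounded interval.
    return minimize_scalar(objective, bounds=(1e-2, 1e2), method="bounded").x

rng = np.random.default_rng(1)
w = 4.0                                               # borderline value of the condition w > 4
for N in [100, 400, 1600]:
    gamma = float(N)**(-w)
    j = np.arange(1, N + 1).astype(float)
    a = j**(-2.0)
    y = a * rng.standard_normal(N) * np.sqrt(mu(j, 1.0)) + gamma * rng.standard_normal(N)
    s = lambda t: a**2 * mu(j, 1.0 / t) + gamma**2
    J_C = lambda t: np.sum(y**2 / s(t)) + np.sum(np.log(mu(j, 1.0 / t)))   # centred MAP objective
    J_E = lambda t: np.sum(y**2 / s(t)) + np.sum(np.log(s(t)))             # empirical Bayes objective
    print(N, abs(estimate(J_C) - 1.0), abs(estimate(J_E) - 1.0))           # errors of the two estimates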
6. Conclusions.
Learning hyperparameters in Bayesian hierarchical inference is important in two main contexts: when the hyperparameters themselves are the primary object of inference, and the underlying quantity which depends on them a priori is viewed as a nuisance parameter; and when the hyperparameters themselves are not of direct interest, but choosing them carefully aids in inferring the underlying quantity which depends on them a priori. In both settings it is of interest to understand when hyperparameters can be accurately inferred from data. In this paper we have studied this question within the context of MAP estimation. Our work suggests the benefits of using the centred parameterization over the noncentred one, and also supports the use of empirical Bayes procedures. This is interesting because the relative merits of centring and noncentring in this context differ from what is found for sampling methods such as MCMC.

The theorem is confined to a straightforward situation, concerning linear inverse problems, in which the relevant operators are simultaneously diagonalizable. It also imposes conditions on the parameters defining the problem; numerical experiments indicate that these are sharp for the centred MAP estimator, but not for the empirical Bayes estimator, demonstrating that the latter is preferable. It would be of interest to push the boundaries of the theory outside this regime, to the non-diagonal setting and even to nonlinear inverse problems. It would also be of interest to study fully Bayesian posterior inference for the hyperparameters, and Bernstein–von Mises theorems; this may be related to the re-scalings needed at the end of Examples 4.13 and 4.14.

Figure 5.1. The true state u† used in the numerical experiments, and its image Au† under the blurring operator A.

Figure 5.2. The trace of the errors between the minimizers of J_C^{N,γ_N} and Ĵ_C^{N,γ_N} and the true hyperparameter, as defined in subsection 5.2, as N is increased.

Figure 5.3. Comparison of the errors between the minimizers of the three functionals J_E^{N,γ_N}, J_C^{N,γ_N}, J_NC^{N,γ_N} and the true hyperparameter, as N is increased. The left figure shows the error traces for a single realization of the truth and the noise, and the right figure shows the errors averaged over 1000 such realizations.

Acknowledgements.
The work of AMS and MMD is funded by US National Science Foundation (NSF) grant DMS 1818977 and AFOSR Grant FA9550-17-1-0185.
REFERENCES [1]
S. Agapiou, J. M. Bardsley, O. Papaspiliopoulos, and A. M. Stuart, Analysis of the Gibbs sampler for hierarchical inverse problems, SIAM/ASA Journal on Uncertainty Quantification, 2 (2014), pp. 511–544. [2]
S. Agapiou, M. Burger, M. Dashti, and T. Helin, Sparsity-promoting and edge-preserving maximum a posteriori estimators in non-parametric Bayesian inverse problems, Inverse Problems, 34 (2018), p. 045002. [3]
S. Agapiou, M. Dashti, and T. Helin, Rates of contraction of posterior distributions based on p-exponential priors, arXiv preprint arXiv:1811.12244, (2018). [4] S. Agapiou, S. Larsson, and A. M. Stuart, Posterior contraction rates for the Bayesian approach to linear ill-posed inverse problems, Stochastic Processes and their Applications, 123 (2013), pp. 3828–3860. [5]
S. Agapiou and P. Mathé, Posterior contraction in Bayesian inverse problems under Gaussian priors, in New Trends in Parameter Identification for Mathematical Models, Springer, 2018, pp. 1–29. [6]
S. Agapiou, A. M. Stuart, and Y.-X. Zhang, Bayesian posterior contraction rates for linear severely ill-posed inverse problems, Journal of Inverse and Ill-posed Problems, 22 (2014), pp. 297–321. [7]
J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer Science & Business Media, 2013. [8]
A. Beskos, A. Jasra, E. A. Muzaffer, and A. M. Stuart, Sequential Monte Carlo methods for Bayesian elliptic inverse problems, Statistics and Computing, 25 (2015), pp. 727–737. [9]
A. Beskos, G. Roberts, A. Stuart, and J. Voss, MCMC methods for diffusion bridges, Stochastics and Dynamics, 8 (2008), pp. 319–350. [10] N. K. Chada, M. A. Iglesias, L. Roininen, and A. M. Stuart, Parameterizations for ensemble Kalman inversion, Inverse Problems, 34 (2018), p. 055009. [11]
Figure 5.4. (Top) The objective function J_C^{N,γ_N}(σ, ℓ) for increasing values of N (left to right). (Bottom) The locations of the minimizers of each J_C^{N,γ_N}(σ, ℓ) across each row and column of the computed approximations (blue), the curve of parameters that produce equivalent measures to the true parameter (black), and the global optimizer (green).
V. Chen, M. M. Dunlop, O. Papaspiliopoulos, and A. M. Stuart, Dimension-Robust MCMC in Bayesian Inverse Problems, arXiv preprint arXiv:1806.00519, (2018). [12]
C. Clason, T. Helin, R. Kretschmann, and P. Piiroinen, Generalized modes in Bayesian inverse problems, arXiv preprint arXiv:1806.00519, (2018). [13]
S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, MCMC methods for functions: modifying old algorithms to make them faster, Statistical Science, 28 (2013), pp. 424–446. [14]
Y. Daon and G. Stadler, Mitigating the influence of the boundary on PDE-based covariance operators, Inverse Problems & Imaging, 12 (2018). [15]
M. Dashti, K. J. Law, A. M. Stuart, and J. Voss, MAP estimators and their consistency in Bayesian nonparametric inverse problems, Inverse Problems, 29 (2013), p. 095017. [16]
M. Dashti and A. M. Stuart, The Bayesian approach to inverse problems, Springer International Publishing, 2017, pp. 311–428. [17]
M. M. Dunlop, M. A. Iglesias, and A. M. Stuart, Hierarchical Bayesian level set inversion, Statistics and Computing, (2016), pp. 1–30. [18]
J. N. Franklin, Well-posed stochastic extensions of ill-posed linear problems, Journal of Mathematical Analysis and Applications, 31 (1970), pp. 682–716. [19]
S. Gugushvili, A. van der Vaart, and D. Yan, Bayesian linear inverse problems in regularity scales, arXiv preprint arXiv:1802.08992, (2018). [20]
S. Gugushvili, A. W. van der Vaart, and D. Yan, Bayesian inverse problems with partial observations, Transactions of A. Razmadze Mathematical Institute, 172 (2018), pp. 388–403.
Figure 5.5. (Top) The objective function J_C^{N,γ_N}(σ, β) for increasing values of N (left to right). (Bottom) The locations of the minimizers of each J_C^{N,γ_N}(σ, β) across each row and column of the computed approximations (blue), the curve of parameters that produce equivalent measures to the true parameter (black), and the global optimizer (green).
Figure 5.6. Traces of the errors between optimizers of J_C^{N,γ_N}(θ) (left) and J_E^{N,γ_N}(θ) (right) and the true hyperparameter, as N is increased. Here the noise level γ_N is taken as γ_N = N^{-w} for w = 3.5, 4, 4.5.
[21] T. Helin and M. Burger, Maximum a posteriori probability estimates in infinite-dimensional Bayesian inverse problems, Inverse Problems, 31 (2015), p. 085009. [22]
T. Helin and M. Lassas, Hierarchical models in statistical inverse problems and the Mumford–Shah functional, Inverse Problems, 27 (2010), p. 015008. [23] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, vol. 160, Springer Science & Business Media, 2006. [24]
Figure 5.7. Traces of the errors between optimizers of J_C^{N_γ,γ}(θ) (left) and J_E^{N_γ,γ}(θ) (right) and the true hyperparameter, as γ is decreased. Here the number of observations N_γ is taken as N_γ = γ^{-1/w} for w = 3.5, 4, 4.5.
U. Khristenko, L. Scarabosio, P. Swierczynski, E. Ullmann, and B. Wohlmuth, Analysis of boundary effects on PDE-based sampling of Whittle–Matérn random fields, arXiv preprint arXiv:1809.07570, (2018). [25]
B. T. Knapik, B. T. Szabó, A. W. van der Vaart, and J. H. van Zanten, Bayes procedures for adaptive inference in inverse problems for the white noise model, Probability Theory and Related Fields, 164 (2016), pp. 771–813. [26]
B. T. Knapik, A. W. van der Vaart, and J. H. van Zanten, Bayesian recovery of the initial condition for the heat equation, Communications in Statistics - Theory and Methods, 42 (2013), pp. 1294–1313. [27]
B. T. Knapik, A. W. van der Vaart, J. H. van Zanten, et al., Bayesian inverse problems with Gaussian priors, The Annals of Statistics, 39 (2011), pp. 2626–2657. [28]
S. Lasanen, Non-Gaussian statistical inverse problems. Part I: Posterior distributions, Inverse Problems & Imaging, 6 (2012), pp. 215–266. [29]
S. Lasanen, Non-Gaussian statistical inverse problems. Part II: Posterior convergence for approximated unknowns, Inverse Problems & Imaging, 6 (2012). [30]
M. S. Lehtinen, L. Paivarinta, and E. Somersalo, Linear inverse problems for generalised random variables, Inverse Problems, 5 (1989), p. 599. [31]
F. Lindgren, H. Rue, and J. Lindström, An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach, Journal of the Royal Statistical Society. Series B: Statistical Methodology, 73 (2011), pp. 423–498. [32]
K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012. [33]
R. M. Neal, Bayesian Learning for Neural Networks, PhD thesis, University of Toronto, 1995. [34]
R. M. Neal, Monte Carlo implementation of Gaussian process models for Bayesian regression and classification, arXiv preprint physics/9701026, (1997). [35]
R. Nickl, Bernstein-von Mises theorems for statistical inverse problems I: Schrödinger equation, arXiv preprint arXiv:1707.01764, (2017). [36]
R. Nickl and K. Ray, Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions, arXiv preprint arXiv:1810.01702, (2018). [37]
R. Nickl and J. Söhl, Bernstein-von Mises theorems for statistical inverse problems II: Compound Poisson processes, arXiv preprint arXiv:1709.07752, (2017).
[38]
R. Nickl, S. van de Geer, and S. Wang, Convergence rates for penalised least squares estimators in PDE-constrained regression problems, arXiv preprint arXiv:1809.08818, (2018). [39]
H. Owhadi, C. Scovel, and T. Sullivan, On the brittleness of Bayesian inference, SIAM Review, 57 (2015), pp. 566–582. [40]
O. Papaspiliopoulos, G. O. Roberts, and M. Sköld, A general framework for the parametrization of hierarchical models, Statistical Science, (2007), pp. 59–73. [41]
K. Ray, Bayesian inverse problems with non-conjugate priors, Electronic Journal of Statistics, 7 (2013), pp. 2516–2549. [42]
G. O. Roberts and O. Stramer, On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm, Biometrika, 88 (2001), pp. 603–621. [43]
L. Roininen, J. M. J. Huttunen, and S. Lasanen, Whittle–Matérn priors for Bayesian statistical inversion with applications in electrical impedance tomography, Inverse Problems & Imaging, 8 (2014), pp. 561–586. [44]
A. M. Stuart, Inverse problems: a Bayesian perspective, in Acta Numerica, vol. 19, Cambridge University Press, 2010, pp. 451–559. [45]
A. W. van der Vaart and J. A. Wellner, Weak convergence, in Weak Convergence and Empirical Processes, Springer, 1996, pp. 16–28. [46]
J. H. van Zanten, A note on consistent estimation of multivariate parameters in ergodic diffusion models, Scandinavian Journal of Statistics, 28 (2001), pp. 617–623. [47]
Y. Yu and X.-L. Meng, To center or not to center: that is not the question – an ancillarity–sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency, Journal of Computational and Graphical Statistics, 20 (2011), pp. 531–570.
Appendix A. Supporting Lemmas.
In this appendix we provide a number of lemmas that are used during proofs and examples in the main text.
Lemma A.1.
Let m ≥ n, A ∈ R^{m×n} and Q ∈ R^{m×m}. Then det(A^*QA) = det(Q) det(AA^*).

Proof.

Let A = UΣV^* be the singular value decomposition of A, with U ∈ R^{m×m}, V ∈ R^{n×n} unitary, and Σ ∈ R^{m×n}. Then we have

det(A^*QA) = det(VΣ^*U^*QUΣV^*) = det(Σ^*U^*QUΣ) det(V^*V) = det(Σ^*U^*QUΣ).

We have that

(Σ^*U^*QUΣ)_{ij} = Σ_{ii} (U^*QU)_{ij} Σ_{jj} if i, j ≤ m, and 0 if i > m or j > m,

so that (Σ^*U^*QUΣ)_{ij} = (Σ̂ U^*QU Σ̂)_{ij}, where Σ̂ ∈ R^{m×m} is given by Σ̂_{ij} = Σ_{ij}. Since all matrices are now square, we see that

det(Σ^*U^*QUΣ) = det(Σ̂ U^*QU Σ̂) = det(Q) det(U^*U) det(Σ̂)^2 = det(Q) det(ΣΣ^*) = det(Q) det(AA^*).

Lemma A.2.
Let {a_j}, {µ_j}, {µ̄_j} and {γ_j} be positive sequences with µ̄_j/µ_j → g ≥ 0. Then if

min_{j=1,...,N} a_j^2 µ_j / γ_N^2 → ∞ as N → ∞,

we have

(1/N) Σ_{j=1}^N (a_j^2 µ̄_j + γ_N^2)/(a_j^2 µ_j + γ_N^2) → g as N → ∞.

Proof.

We write

(a_j^2 µ̄_j + γ_N^2)/(a_j^2 µ_j + γ_N^2) = (µ̄_j + γ_N^2/a_j^2)/(µ_j + γ_N^2/a_j^2)
  = µ̄_j/µ_j + ( (µ̄_j + γ_N^2/a_j^2)/(µ_j + γ_N^2/a_j^2) − µ̄_j/µ_j )
  = µ̄_j/µ_j + (γ_N^2/a_j^2)(µ_j − µ̄_j) / ( (µ_j + γ_N^2/a_j^2) µ_j )
  = µ̄_j/µ_j + ( 1 / (a_j^2 µ_j/γ_N^2 + 1) ) (1 − µ̄_j/µ_j).

Now observe that

| (1/N) Σ_{j=1}^N (a_j^2 µ̄_j + γ_N^2)/(a_j^2 µ_j + γ_N^2) − g |
  ≤ | (1/N) Σ_{j=1}^N ( (a_j^2 µ̄_j + γ_N^2)/(a_j^2 µ_j + γ_N^2) − µ̄_j/µ_j ) | + | (1/N) Σ_{j=1}^N µ̄_j/µ_j − g |.

The second term on the right hand side tends to zero by the assumed convergence. From the above, the first term is equal to

| (1/N) Σ_{j=1}^N ( 1 / (a_j^2 µ_j/γ_N^2 + 1) ) (1 − µ̄_j/µ_j) | ≤ max_{j=1,...,N} ( 1 / (a_j^2 µ_j/γ_N^2 + 1) ) |1 − µ̄_j/µ_j| ≤ C ( min_{j=1,...,N} a_j^2 µ_j/γ_N^2 + 1 )^{-1},

again using the assumed convergence of the ratio µ̄_j/µ_j; the result follows.

In the following, given two sequences {a_j}, {b_j}, we write a_j ≍ b_j if there exist constants c_1, c_2 > 0 such that c_1 a_j ≤ b_j ≤ c_2 a_j for all j.

Lemma A.3.
Let Θ = [β_−, β_+] ⊆ (0, ∞). Given ν, σ > 0, d ∈ N and a positive sequence λ_j ≍ j^{2/d}, define

µ_j(β) = β^2 ( (β/σ)^{2/ν} + λ_j )^{-ν-d/2}.

Assume that a_j ≍ j^{-a} and γ_N ≍ N^{-w}, where w, a > 0 are such that

w > a + ν/d + 1/2.

Then Assumptions 4.6(i)–(vi) hold.

Proof.

(i) This is true by assumption.

(ii) We assume without loss of generality that a_j is monotonically decreasing and λ_j is monotonically increasing. Then

min_{j=1,...,N} a_j^2 µ_j(β)/γ_N^2 = a_N^2 µ_N(β)/γ_N^2.

We may bound the right hand side as

a_N^2 µ_N(β)/γ_N^2 ≍ N^{2(w−a)} β^2 ( (β/σ)^{2/ν} + λ_N )^{-ν-d/2} ≍ N^{2(w − a − ν/d − 1/2)},

which diverges given the assumption on the parameters.

(iii) In Example 4.13 it is demonstrated that

g(β, β†) = lim_{j→∞} µ_j(β†)/µ_j(β) = (β†/β)^2

for all β ∈ Θ. The map β ↦ g(β, β†) − log g(β, β†) is clearly continuous on Θ, and so in particular lower semicontinuous.

(iv) This is clearly true.

(v) We have that

log µ_j(β) = 2 log β − (ν + d/2) log( (β/σ)^{2/ν} + λ_j ),

which is smooth on Θ, and so

| (d/dβ) log µ_j(β) | = | 2β^{-1} − (ν + d/2)(2/ν) β^{-1} (β/σ)^{2/ν} ( (β/σ)^{2/ν} + λ_j )^{-1} | ≤ 2β_−^{-1} + (ν + d/2)(2/ν) β_−^{-1}.

It follows that log µ_j(β) is Lipschitz with Lipschitz constants bounded in j.

(vi) The map b_j^N(β) is smooth on Θ, and we have that

| (b_j^N)′(β) | = | b_j^N(β) | | µ_j′(β) | / (µ_j(β) + γ_N^2/a_j^2) ≤ | b_j^N(β) | | µ_j′(β)/µ_j(β) | = | b_j^N(β) | | (d/dβ) log µ_j(β) |,

and the final term on the right hand side is uniformly bounded by part (v). Finally observe that

| b_j^N(β) | ≤ µ_j(β†)/µ_j(β) + γ_N^2/(a_j^2 µ_j(β)).

The first term can be seen to be uniformly bounded by noting that β_− > 0, and the second term by using part (ii). The map b_j^N(β) is hence Lipschitz with Lipschitz constants bounded in j, N.
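As a closing numerical illustration (a sketch only, combining the statements of Lemmas A.2 and A.3 as reconstructed above; the values a = 2, ν = 3/2, d = 1, w = 5, σ = 1 and the eigenvalue sequence λ_j = π^2 j^2 are arbitrary choices satisfying the hypotheses), the Cesàro average appearing in Lemma A.2, with µ_j = µ_j(β) and µ̄_j = µ_j(β†), indeed approaches g(β, β†) = (β†/β)^2:

import numpy as np

def mu(j, beta, sigma=1.0, nu=1.5, d=1):
    # mu_j(beta) = beta^2 ((beta/sigma)^(2/nu) + lambda_j)^(-nu - d/2), with lambda_j = pi^2 j^(2/d).
    lam = np.pi**2 * j**(2.0 / d)
    return beta**2 * ((beta / sigma)**(2.0 / nu) + lam)**(-nu - d / 2.0)

beta, beta_true = 2.0, 1.0
for N in [10**3, 10**4, 10**5]:
    j = np.arange(1, N + 1, dtype=float)
    a = j**(-2.0)                          # a_j ~ j^(-a) with a = 2
    gamma = float(N)**(-5.0)               # gamma_N = N^(-w), w = 5 > a + nu/d + 1/2 = 4
    avg = np.mean((a**2 * mu(j, beta_true) + gamma**2) / (a**2 * mu(j, beta) + gamma**2))
    print(N, avg, (beta_true / beta)**2)   # the average approaches g(beta, beta_true) = 1/4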