Bayesian inverse problems with unknown operators
Mathias Trabs
Universität Hamburg
Abstract
We consider the Bayesian approach to linear inverse problems when the underlying operator depends on an unknown parameter. Allowing for finite dimensional as well as infinite dimensional parameters, the theory covers several models with different levels of uncertainty in the operator. Using product priors, we prove contraction rates for the posterior distribution which coincide with the optimal convergence rates up to logarithmic factors. In order to adapt to the unknown smoothness, an empirical Bayes procedure is constructed based on Lepski's method. The procedure is illustrated in numerical examples.
MSC 2010 Classification:
Primary: 62G05; Secondary: 62G08, 62G20.
Keywords and phrases:
Rate of contraction, posterior distribution, product priors, ill-posed linear inverse problems, empirical Bayes, non-parametric estimation.
1 Introduction

Bayesian procedures to solve inverse problems have become increasingly popular in recent years, cf. Stuart [31]. In the inverse problem literature the underlying operator of the forward problem is typically assumed to be known. In practice, there might however be some uncertainty in the operator which has to be taken into account by the procedure. While there are some frequentist approaches in the statistical literature to solve inverse problems with an unknown operator, the Bayesian point of view has not yet been analysed. The aim of this work is to fill this gap.

Let $f \in L^2(D)$ be a function on a domain $D \subseteq \mathbb{R}^d$ and $K_\vartheta : L^2(D) \to L^2(Q)$, $Q \subseteq \mathbb{R}^q$, be an injective, continuous linear operator depending on some parameter $\vartheta \in \Theta$. We consider the linear inverse problem
\[ Y = K_\vartheta f + \varepsilon Z, \quad (1.1) \]
where $Z$ is Gaussian white noise in $L^2(Q)$ and $\varepsilon > 0$ is the noise level which converges to zero asymptotically. If the operator $K_\vartheta$ is known, the inverse problem to recover $f$ non-parametrically, i.e. as an element of the infinite dimensional space $L^2(D)$, from the observation $Y$ is well studied, see for instance Cavalier [5]. The Bayesian approach has been analysed by Knapik et al. [22] with Gaussian priors, by Ray [30] with non-conjugate priors and in many subsequent articles including [1, 2, 21, 23]. Also non-linear inverse problems have been successfully solved via Bayesian methods, for example, [3, 9, 27, 28, 29, 35].

Focussing on linear inverse problems, we will extend the Bayesian methodology to unknown operators. To this end, the unknown parameter $\vartheta \in \Theta$ is introduced in (1.1), where $K_\vartheta$ may depend non-linearly on $\vartheta$. Unknown operators are relevant in numerous applications. Examples include semi-blind and blind deconvolution for image analysis. Therein, the operator is given by $K_\vartheta f = \int g_\vartheta(\cdot - y)f(y)\,dy$ with some unknown convolution kernel $g_\vartheta$ [4, 20, 32]. More general integral operators such as singular layer potential operators appear in the context of partial differential equations, cf. examples in [7, 15]. If the coefficients of the underlying PDE are unknown, then the operator itself is only partially known. A typical example of this type is the backwards heat equation where the solution $u$ of the PDE $\frac{\partial}{\partial t}u = \vartheta\Delta u$ (with Dirichlet boundary conditions) is observed at some time $t_0$ and the aim is to estimate the initial value function $f = u(0,\cdot)$. Here, we take into account an unknown diffusivity parameter $\vartheta > 0$. The solution $u(t_0,\cdot)$ depends linearly on $f$ and the resulting operator admits a singular value decomposition (SVD) with respect to the sine basis and with $\vartheta$-dependent singular values $\rho_{\vartheta,k} = e^{-\vartheta\pi^2k^2t_0}$, $k \geq 1$, cf. Section 6. In particular, the resulting inverse problem is severely ill-posed.

Even without measurement errors the target function $f$ is in general not identifiable any more for unknown operators, i.e., there may be several solutions $(\vartheta, f)$ to the equation $Y = K_\vartheta f$. For instance, if $K_\vartheta$ admits an SVD $K_\vartheta\varphi_k = \rho_k\psi_k$ for orthonormal systems $(\varphi_k)_{k\geq 1}$, $(\psi_k)_{k\geq 1}$ and unknown singular values $\vartheta = (\rho_k)_{k\geq 1}$, then we have $K_\vartheta f = K_{\vartheta/a}(af)$ for any function $f \in L^2(D)$ and any scalar $a > 0$. We thus require some extra information.

There are different approaches in the inverse problem literature to deal with this identifiability problem, particularly in the context of semi-blind or blind deconvolution. One approach is to find the so called minimum norm solution which has a minimal distance to some a priori estimates for $\vartheta$ and $f$, cf.
[4, 20]. Another idea is to assume that some approximation of the unknown operator is available for the reconstruction of $f$, cf. [18, 32]. Similarly, we may assume to have some noisy observation of the unknown parameter $\vartheta$ which then allows to construct an estimator for $K_\vartheta$. In this paper we will study this last setting. More precisely, we suppose that the parameter set $\Theta$ is (a subset of) some Hilbert space and we consider the additional sample
\[ T = \vartheta + \delta W \quad (1.2) \]
where $W$ is white noise on $\Theta$, independent of $Z$, and $\delta > 0$ is some noise level. Thereby, $\vartheta$ is considered as a nuisance parameter and we will not impose any regularity assumptions on $\vartheta$. Our aim is the estimation of $f$ from the observations (1.1) and (1.2). This setting includes several exemplary models with different levels of uncertainty in the operator $K_\vartheta$:

A If $\Theta \subseteq \mathbb{R}^p$, we have a parametric characterization of the operator $K_\vartheta$ and $T$ can be understood as an independent estimator for $\vartheta$.

B Cavalier and Hengartner [6] have studied the case where the eigenfunctions of $K_\vartheta$ are known, but only noisy observations of the singular values $(\rho_k)_{k\geq 1}$ are observed: $T_k = \rho_k + \delta W_k$, $k \geq 1$, with i.i.d. standard normal $(W_k)_k$. In this case $\Theta = \ell^2$, supposing $K_\vartheta$ is Hilbert–Schmidt, and $\vartheta = (\rho_k)_k$ is the sequence of singular values of $K_\vartheta$.

C Efromovich and Koltchinskii [11], Hoffmann and Reiß [16] as well as Marteau [25] have assumed the operator to be completely unknown and considered additional observations of the form
\[ L = K + \delta W \]
where the operator $K$ is blurred by some independent white noise $W$ on the space of linear operators from $L^2(D)$ to $L^2(Q)$ with some noise level $\delta$. Fixing bases $(e_k)$ and $(h_k)$ of $L^2(D)$ and $L^2(Q)$, respectively, $K$ is characterised by the infinite matrix $\vartheta = (\langle Ke_k, h_l\rangle)_{k,l\geq 1} \in \mathbb{R}^{\mathbb{N}\times\mathbb{N}}$ and $W$ can be identified with the random matrix $(\langle We_k, h_l\rangle)_{k,l\geq 1}$ consisting of i.i.d. standard Gaussian entries.

In contrast to the just mentioned articles [6, 11, 16, 25], we will investigate the Bayesian approach. We thus put a prior distribution $\Pi$ on $(f,\vartheta) \in L^2(D)\times\Theta$. Denoting the probability density of $(Y,T)$ under the parameters $(f,\vartheta)$ with respect to some reference measure by $p_{f,\vartheta}$, the posterior distribution given the observations $(Y,T)$ is given by Bayes' theorem:
\[ \Pi(B\,|\,Y,T) = \frac{\int_B p_{f,\vartheta}(Y,T)\,d\Pi(f,\vartheta)}{\int_{L^2\times\Theta}p_{f,\vartheta}(Y,T)\,d\Pi(f,\vartheta)} \quad (1.3) \]
for measurable subsets $B \subseteq L^2(D)\times\Theta$. Due to the white noise model, the density $p_{f,\vartheta}$ inherits the nice structure from the normal distribution, cf. Section 2. Although we cannot hope for nice conjugate pairs of prior and posterior distribution due to the non-linear structure of $(f,\vartheta)\mapsto K_\vartheta f$, there are efficient Markov chain Monte Carlo algorithms to draw from $\Pi(\cdot\,|\,Y,T)$, cf. Tierney [34].

To analyse the behaviour of the posterior distribution, we will take a frequentist point of view and assume the observations are generated under some true, but unknown $f_0 \in L^2(D)$ and $\vartheta_0 \in \Theta$. In a first step we will identify general conditions on a prior $\Pi$ under which the posterior $\Pi(f\in\cdot\,|\,Y,T)$ for $f$ concentrates in a neighbourhood of $f_0$ with a certain rate of contraction $\xi_{\varepsilon,\delta}$: We show for some constant $D > 0$ the convergence
\[ \Pi\big(f \in L^2(D) : \|f - f_0\| > D\xi_{\varepsilon,\delta}\,\big|\,Y,T\big) \to 0 \quad (1.4) \]
in $P_{f_0,\vartheta_0}$-probability as $\varepsilon$ and $\delta$ go to zero.
This contraction result verifies that the whole probability mass of the posterior distribution is asymptotically located in a small ball around $f_0$ with radius of order $\xi_{\varepsilon,\delta}\downarrow 0$. Hence, draws from the posterior distribution will be close to the unknown function $f_0$ with high probability. This especially implies that the posterior mean and the posterior median are consistent estimators of the unknown function $f_0$. Interestingly, the difficulty to recover $f$ from $(Y,T)$ is the same in all three above mentioned models.

The proof of the contraction result follows general principles developed by Ghosal et al. [12]. The analysis of the posterior distribution requires to control both the numerator in (1.3) and the normalising constant. To find a lower bound for the latter, a so-called small ball probability condition is imposed ensuring that the prior puts some minimal weight in a neighbourhood of the truth. Given this bound, the contraction theorem can be shown by constructing sufficiently powerful tests for the hypothesis $H_0 : f = f_0$ against the alternative $H_1 : \|f - f_0\| > D\xi_{\varepsilon,\delta}$ for the constant $D > 0$ from (1.4). To find the test, we follow Giné and Nickl [13] and use a plug-in test based on a frequentist estimator. This estimator is obtained by the Galerkin projection method, as proposed in [11, 16] for Model C.

The main difficulty is that without structural assumptions on $\Theta$, e.g. if $\Theta = \ell^2$, an infinite dimensional nuisance parameter $\vartheta$ cannot be consistently estimated. We thus cannot expect a concentration of $\Pi(\vartheta\in\cdot\,|\,Y,T)$. Why should then $\Pi(K_\vartheta f\in\cdot\,|\,Y,T)$ concentrate around the truth? Fortunately, $K_\vartheta f$ is regular, such that a finite dimensional projection suffices to reconstruct $f$ with high accuracy. Under the reasonable assumption that the projection of $K_\vartheta$ depends only on a finite dimensional projection $P_j\vartheta$ of $\vartheta$, we can indeed estimate $f$ without estimating the full $\vartheta$. Similarly, we show in the Bayesian setting that a concentration of this finite dimensional projection $P_j\vartheta$ is sufficient, resulting in a small ball probability condition depending only on $f$ and $P_j\vartheta$.

The conditions of the general result are verified in the mildly ill-posed case and in the severely ill-posed case, assuming some Sobolev regularity of $f$. We use a truncated product prior for $f$ and a product prior on $\vartheta$. Choosing the truncation level $J$ of the prior in an optimal way, the resulting contraction rates coincide, up to logarithmic factors, with the minimax optimal rates which are known in several models. These rates are indeed the same as for the known parameter case, cf. Ray [30], if $\delta = O(\varepsilon)$.

Since the optimal level $J$ depends on the unknown regularity $s$ of $f$, a data-driven procedure to select $J$ is desirable. There are basically two ways to tackle this problem. Setting a hyperprior on $s$, a hierarchical Bayes procedure could be considered. Alternatively, although not purely Bayesian, we can try to select some $\hat J$ empirically from the observations $Y, T$ and then use this $\hat J$ in the Bayes procedure. Both possibilities are only rarely studied for inverse problems. Using known operators, the few articles on this topic include Knapik et al. [21], considering both constructions with a Gaussian prior on $f$, and Ray [30], who has considered a sieve prior which could be interpreted as a hierarchical Bayes procedure. We will follow the second path and choose $J$ empirically using Lepski's method [24] which yields an easy to implement procedure (note that [21] used a maximum likelihood approach to estimate $s$).
We prove that the final adaptive procedure attains the same rate as the optimized non-adaptive method.

This paper is organised as follows: The posterior distribution is derived in Section 2. The general contraction theorem is presented in Section 3. In Section 4 specific rates for Sobolev regular functions $f$ in the mildly and the severely ill-posed case are determined using a truncated product prior. An adaptive choice of the truncation level is constructed in Section 5. In Section 6 we discuss the implementation of the Bayes method using a Markov chain Monte Carlo algorithm and illustrate the method in two numerical examples. All proofs are postponed to Section 7.

2 Setting and posterior distribution
Let us fix some notation: $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ denote the scalar product and the norm of $L^2(Q)$ or $\Theta$. We write $x \lesssim y$ if there is some universal constant $C > 0$ such that $x \leq Cy$. If $x \lesssim y$ and $y \lesssim x$, we write $x \simeq y$. We recall that the noise process $Z$ in (1.1) is the standard iso-normal process, i.e., $\langle g, Z\rangle$ is $N(0,\|g\|^2)$-distributed for any $g \in L^2(Q)$ and covariances are given by $\mathbb{E}[\langle Z,g_1\rangle\langle Z,g_2\rangle] = \langle g_1,g_2\rangle$ for all $g_1,g_2 \in L^2(Q)$. We write $Z \sim N(0,\mathrm{Id})$. Note that $Z$ cannot be realised as an element of $L^2(Q)$, but only as a Gaussian process $g \mapsto \langle g, Z\rangle$.

The observation scheme (1.1) is equivalent to observing
\[ \langle Y, g\rangle = \langle K_\vartheta f, g\rangle + \varepsilon\langle Z, g\rangle \quad\text{for all } g \in L^2(Q). \]
Choosing an orthonormal basis $(\varphi_k)_{k\geq 1}$ of $L^2(Q)$ with respect to the standard $L^2$-scalar product, we obtain the series representation
\[ Y_k := \langle Y,\varphi_k\rangle = \langle K_\vartheta f,\varphi_k\rangle + \varepsilon Z_k \]
for i.i.d. random variables $Z_k \sim N(0,1)$, $k \geq 1$. Note that the distribution of $(Z_k)$ does not depend on $\vartheta$. If $K_\vartheta$ is compact, it might be tempting to choose $(\varphi_k)$ from the singular value decomposition of $K_\vartheta$, simplifying $\langle K_\vartheta f,\varphi_k\rangle$. However, such a basis of eigenfunctions will in general depend on the unknown $\vartheta$ and thus cannot be used. Since $(Z_k)_{k\geq 1}$ are i.i.d., the distribution of the vector $(Y_k)_{k\geq 1}$ is given by
\[ P^Y_{\vartheta,f} = \bigotimes_{k\geq 1} N\big(\langle K_\vartheta f,\varphi_k\rangle,\varepsilon^2\big). \]
By Kakutani's theorem, cf. Da Prato [8, Theorem 2.7], $P^Y_{\vartheta,f}$ is equivalent to the law $P^Y_0 = \bigotimes_{k\geq 1}N(0,\varepsilon^2)$ of the white noise $\varepsilon Z$. Writing $\langle K_\vartheta f, Z\rangle := \sum_{k\geq 1}\langle K_\vartheta f,\varphi_k\rangle Z_k$ with some abuse of notation, since $Z$ is not in $L^2(Q)$, we obtain the density
\[ \frac{dP^Y_{\vartheta,f}}{dP^Y_0} = \exp\Big(\frac{1}{\varepsilon}\langle K_\vartheta f, Z\rangle - \frac{1}{2\varepsilon^2}\sum_{k\geq 1}\langle K_\vartheta f,\varphi_k\rangle^2\Big) = \exp\Big(\frac{1}{\varepsilon^2}\langle K_\vartheta f, Y\rangle - \frac{1}{2\varepsilon^2}\|K_\vartheta f\|^2\Big), \]
where we have used $Y_k = \varepsilon Z_k$ under $P^Y_0$ for the second equality.

Since any continuous operator $K_\vartheta$ can be described by the infinite matrix $(\langle K_\vartheta\varphi_j,\varphi_k\rangle)_{j,k\geq 1}$, we may assume without loss of generality that $\Theta \subseteq \ell^2$. The distribution of $T$ is then similarly given by $P^T_\vartheta = \bigotimes_{k\geq 1}N(\vartheta_k,\delta^2)$, being equivalent to $P^T_0 = \bigotimes_{k\geq 1}N(0,\delta^2)$. Writing $T = (T_k)_{k\geq 1}$ and $\langle\vartheta,T\rangle = \sum_{k\geq 1}\vartheta_kT_k$, we obtain the density
\[ \frac{dP^T_\vartheta}{dP^T_0} = \exp\Big(\frac{1}{\delta^2}\langle\vartheta,T\rangle - \frac{1}{2\delta^2}\|\vartheta\|^2\Big). \]
Therefore, the likelihood of the observations $(Y,T)$ with respect to $P^Y_0\otimes P^T_0$ is given by
\[ \frac{d(P^Y_{\vartheta,f}\otimes P^T_\vartheta)}{d(P^Y_0\otimes P^T_0)} = \exp\Big(\frac{1}{\varepsilon^2}\langle K_\vartheta f,Y\rangle - \frac{1}{2\varepsilon^2}\|K_\vartheta f\|^2 + \frac{1}{\delta^2}\langle\vartheta,T\rangle - \frac{1}{2\delta^2}\|\vartheta\|^2\Big). \quad (2.1) \]
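To fix ideas, the following sketch simulates truncated versions of the coefficient sequences $(Y_k)$ and $(T_k)$ and evaluates the log-likelihood (2.1) on these coefficients. It is a minimal illustration, not part of the model: the diagonal operator (a Model B situation), the truncation level `N`, the decay of the coefficients and all variable names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                  # truncation level of the (in principle infinite) sequences
eps, delta = 1e-3, 1e-3  # noise levels of Y and T

k = np.arange(1, N + 1)
f0 = k ** -1.5           # illustrative coefficients <f_0, phi_k> with Sobolev-type decay
theta0 = k ** -2.0       # illustrative singular values of a diagonal operator (Model B)

# sequence-space observations: Y_k = <K_theta f, phi_k> + eps Z_k, T_k = theta_k + delta W_k
Y = theta0 * f0 + eps * rng.standard_normal(N)
T = theta0 + delta * rng.standard_normal(N)

def log_likelihood(f, theta):
    """Log-likelihood (2.1) of (Y, T) restricted to the first N coefficients."""
    Kf = theta * f       # diagonal action: <K_theta f, phi_k> = theta_k f_k
    return (Y @ Kf / eps ** 2 - 0.5 * (Kf @ Kf) / eps ** 2
            + T @ theta / delta ** 2 - 0.5 * (theta @ theta) / delta ** 2)
```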
Applying a prior $\Pi$ on the parameter $(f,\vartheta) \in L^2(D)\times\Theta$, we obtain the posterior distribution
\[ \Pi(B\,|\,Y,T) = \frac{\int_B e^{\varepsilon^{-2}\langle K_\vartheta f,Y\rangle - (2\varepsilon^2)^{-1}\|K_\vartheta f\|^2 + \delta^{-2}\langle\vartheta,T\rangle - (2\delta^2)^{-1}\|\vartheta\|^2}\,d\Pi(f,\vartheta)}{\int_{L^2\times\Theta} e^{\varepsilon^{-2}\langle K_\vartheta f,Y\rangle - (2\varepsilon^2)^{-1}\|K_\vartheta f\|^2 + \delta^{-2}\langle\vartheta,T\rangle - (2\delta^2)^{-1}\|\vartheta\|^2}\,d\Pi(f,\vartheta)}, \quad B \in \mathcal{B}, \quad (2.2) \]
with the Borel $\sigma$-algebra $\mathcal{B}$ on $L^2(Q)\times\Theta$. Under the frequentist assumption that $Y$ and $T$ are generated under some $f_0$ and $\vartheta_0$, we obtain the representation
\[ \Pi(B\,|\,Y,T) = \frac{\int_B p_{f,\vartheta}(Z,W)\,d\Pi(f,\vartheta)}{\int_{L^2\times\Theta}p_{f,\vartheta}(Z,W)\,d\Pi(f,\vartheta)}, \quad B \in \mathcal{B}, \quad (2.3) \]
for
\[ p_{f,\vartheta}(z,w) := \exp\Big(\frac{1}{\varepsilon}\langle K_\vartheta f - K_{\vartheta_0}f_0, z\rangle - \frac{1}{2\varepsilon^2}\|K_\vartheta f - K_{\vartheta_0}f_0\|^2 + \frac{1}{\delta}\langle\vartheta - \vartheta_0, w\rangle - \frac{1}{2\delta^2}\|\vartheta - \vartheta_0\|^2\Big) \]
corresponding to the density of $P^Y_{\vartheta,f}\otimes P^T_\vartheta$ with respect to $P^Y_{\vartheta_0,f_0}\otimes P^T_{\vartheta_0}$.

Note that even if a Gaussian prior is chosen, the posterior distribution is in general not Gaussian, since $\vartheta \mapsto K_\vartheta$ might be non-linear. Hence, the posterior distribution cannot be explicitly calculated in most cases, but has to be approximated by an MCMC algorithm, see for instance Tierney [34] and Section 6.

3 A general contraction theorem

For simplicity we throughout suppose $D = Q$ such that $L^2 := L^2(D) = L^2(Q)$ and assume $K_\vartheta$ to be self-adjoint. The general case is discussed in Remark 7.

Taking a frequentist point of view, we assume that the observations (1.1) and (1.2) are generated by some fixed unknown $f_0 \in L^2$ and $\vartheta_0 \in \Theta$. As a first main result the following theorem gives general conditions on the prior which ensure a contraction rate for the posterior distribution from (2.3) around the true $f_0$.

Let $(\varphi_{(j,l)} : j \in \mathcal{I}, l \in \mathcal{Z}_j)$ be an orthonormal basis of $L^2$. We use here the double-index notation which is especially common for wavelet bases, but also the single-indexed notation is included if $\mathcal{Z}_j$ contains only one element. For any index $k = (j,l)$ we write $|k| := j$. Let moreover $V_j = \mathrm{span}\{\varphi_k : |k| \leq j\}$ be a sequence of approximation spaces with dimensions $d_j \in \mathbb{N}$ associated to $(\varphi_k)$. We impose the following compatibility assumption on $(\varphi_k)$:

Assumption 1.
There is some $m \in \mathbb{N}$ such that $\langle K_\vartheta\varphi_l,\varphi_k\rangle = 0$ for any $\vartheta \in \Theta$ if $\big||l| - |k|\big| > m$.

If $K_\vartheta$ is compact and admits an orthonormal basis of eigenfunctions $(e_k)_{k\geq 1}$ being independent of $\vartheta$, then this assumption is trivially satisfied for $(\varphi_k) = (e_k)$ and $m = 0$. On the other hand this assumption allows for more flexibility for the considered approximation spaces and can be compared to Condition 1 by Ray [30]. As a typical example, the possibly $\vartheta$-dependent eigenfunctions $(e_{\vartheta,k})$ of $K_\vartheta$ may be the trigonometric basis of $L^2$ while the $V_j$ are generated by band-limited wavelets.

Having $(\varphi_k)$ and thus $V_j$ fixed, we write $\|A\|_{V_j\to V_j} := \sup_{v\in V_j,\|v\|=1}\|Av\|$ for the operator norm of any bounded linear operator $A : V_j\to V_j$, where $V_j$ is equipped with the $L^2$-norm. We denote by $P_j$ the orthogonal projection of $L^2$ onto $V_j$ and define the operator $K_{\vartheta,j} := P_jK_\vartheta|_{V_j}$ as the restriction of $K_\vartheta$ to an operator from $V_j$ to $V_j$. Note that $K_{\vartheta,j}$ is given by the finite dimensional matrix $(\langle K_\vartheta\varphi_k,\varphi_l\rangle)_{|k|\leq j,|l|\leq j} \in \mathbb{R}^{d_j\times d_j}$.

Assumption 2.
Let $K_{\vartheta,j}$ depend only on a finite dimensional projection $P_j\vartheta := (\vartheta_1,\dots,\vartheta_{l_j})$ of $\vartheta \in \Theta$ for some integer $1 \leq l_j \leq d_j^2$, $j \in \mathcal{I}$. Moreover, let $K_{\vartheta,j}$ be Lipschitz continuous with respect to $\vartheta$ in the following sense:
\[ \|K_{\vartheta,j} - K_{\vartheta',j}\|_{V_j\to V_j} \leq L\|P_j(\vartheta - \vartheta')\|_j \quad\text{for all } \vartheta,\vartheta' \in \Theta, \]
where $L > 0$ is a constant being independent of $j,\vartheta,\vartheta'$ and where $\|\cdot\|_j$ is a norm on $P_j\Theta$. We suppose that the norm $\|\cdot\|_j$ satisfies
\[ P\big(\|P_jW\|_j > C(\kappa + \sqrt{d_j})\big) \leq \exp(-c\kappa^2). \]

Although projections on $L^2$ and on $\Theta$ are both denoted by $P_j$, it will always be clear from the context which is used, such that this abuse of notation is quite convenient. Since $K_{\vartheta,j}$ is fully described by a $d_j\times d_j$ matrix, we naturally have the upper bound $l_j \leq d_j^2$. Let us illustrate the previous assumptions in the models A, B and C from the introduction:

Examples 3.
1. In Model A we have a finite dimensional parameter space $\Theta \subseteq \mathbb{R}^p$ with fixed $p \in \mathbb{N}$. Assumption 1 is, for instance, satisfied if $K_\vartheta f = g_\vartheta * f$ is a convolution operator with a kernel $g_\vartheta$ whose Fourier transform has compact support and if we choose a band-limited wavelet basis. Note that in this case we do not have to know the SVD of $K_\vartheta$. For Assumption 2 we may choose $P_j = \mathrm{Id}$ and $\|\cdot\|_j = |\cdot|$ as the Euclidean distance on $\mathbb{R}^p$, leading to the Lipschitz condition $\|K_{\vartheta,j} - K_{\vartheta',j}\|_{V_j\to V_j} \leq L|\vartheta - \vartheta'|$. Then $P(|W| > \sqrt{p}\,\kappa) \leq 2pe^{-\kappa^2/2}$ follows from the Gaussian concentration of $W$.

2. In Model B let $K_\vartheta$ be compact and let $(e_i)_{i\geq 1}$ be an orthonormal basis consisting of eigenfunctions with corresponding eigenvalues $(\rho_{\vartheta,i})_{i\geq 1}$ and let $(\varphi_k)$ be a wavelet basis fulfilling $d_j \simeq 2^{dj}$. Then Assumption 1 is satisfied if there is some $C > 0$ such that $\langle e_i,\varphi_k\rangle \neq 0$ only if $C^{-1}2^{dk} \leq i \leq C2^{dk}$. Since then $\langle e_k, v\rangle = 0$ for any $v \in V_j$ if $k \geq C2^{dj}$, we moreover have for any $v \in V_j$
\[ \big\|(K_{\vartheta,j} - K_{\vartheta',j})v\big\| = \Big\|P_j\sum_{i\geq 1}(\rho_{\vartheta,i} - \rho_{\vartheta',i})\langle e_i,v\rangle e_i\Big\| \leq \sup_{i\leq C2^{dj}}|\rho_{\vartheta,i} - \rho_{\vartheta',i}|\Big(\sum_{i\leq C2^{dj}}\langle e_i,v\rangle^2\Big)^{1/2} \leq \sup_{i\leq C2^{dj}}|\rho_{\vartheta,i} - \rho_{\vartheta',i}|\,\|v\|. \]
We thus choose $l_j = C2^{dj} \simeq d_j$ and $\|\cdot\|_j$ as the supremum norm on $P_j\Theta$. Since the $W_k$ are i.i.d. Gaussian, we have for some $c > 0$
\[ P\Big(\sup_{k\leq C2^{dj}}|W_k| > \kappa + \sqrt{c\log d_j}\Big) \leq C2^{dj}e^{-(\kappa+\sqrt{c\log d_j})^2/2} \leq Ce^{-\kappa^2/2}. \]
Therefore, Assumption 2 is satisfied.

3. In Model C the projected operators $K_{\vartheta,j}$ are given by $\mathbb{R}^{d_j\times d_j}$ matrices. Assumption 1 is satisfied if and only if all $K_{\vartheta,j}$ are band matrices with some fixed bandwidth $m$ independent of $j$ and $\vartheta$. To verify Assumption 2, $\|\cdot\|_j$ can be chosen as the operator norm or spectral norm of these matrices. The Lipschitz condition is then obviously satisfied. Moreover, $P_jWP_j$ is an $\mathbb{R}^{d_j\times d_j}$ random matrix where all entries are i.i.d. $N(0,1)$ random variables. A standard result for i.i.d. random matrices is the bound $\mathbb{E}[\|P_jWP_j\|_{V_j\to V_j}] \lesssim \sqrt{d_j}$ for the operator norm, cf. [33, Cor. 2.3.5]. Together with the Borell–Sudakov–Tsirelson concentration inequality for Gaussian processes, cf. [14, Thm. 2.5.8], we immediately obtain the concentration inequality in Assumption 2.

Finally, the degree of ill-posedness of $K_\vartheta$ can be quantified by the smoothing effect of the operator:

Assumption 4.
For a decreasing sequence $(\sigma_j)_j \subseteq (0,\infty)$ and some constant $Q \geq 1$ let the operator $K_\vartheta$ satisfy
\[ Q^{-1}\sum_k\sigma_{|k|}\langle f,\varphi_k\rangle^2 \leq \langle K_\vartheta f, f\rangle \leq Q\sum_k\sigma_{|k|}\langle f,\varphi_k\rangle^2 \quad\text{for all } f \in L^2 \text{ and } \vartheta \in \Theta. \]

Note that Assumptions 1 and 4 with $\sigma_j \downarrow 0$ imply that $K_\vartheta$ is compact, because it can be approximated by the operator sequence $K_\vartheta P_j$ having finite dimensional ranges. The rate of decay of $\sigma_j$ will determine the degree of ill-posedness of the inverse problem. If $\sigma_j$ decays polynomially or exponentially, we obtain a mildly or severely ill-posed problem, respectively.

Recall that the nuisance parameter $\vartheta$ cannot be consistently estimated without additional assumptions. Therefore, we study the contraction rate of the marginal posterior distribution $\Pi(f\in\cdot\,|\,Y,T)$. While we allow for a general prior $\Pi_f$ on $L^2$ for $f$, we will use a product prior on $\vartheta$. For densities $\beta_k$ on $\mathbb{R}$ we thus consider prior distributions of the form
\[ d\Pi(f,\vartheta) = d\Pi_f(f)\otimes\bigotimes_{k\geq 1}\beta_k(\vartheta_k)\,d\vartheta_k. \quad (3.1) \]

Theorem 5. Consider the model (1.1), (1.2) generated by some $f_0 \in L^2$ and $\vartheta_0 \in \Theta$ with $\varepsilon = \varepsilon_n \to 0$ and $\delta = \delta_n \to 0$ for $n \to \infty$, respectively, and let Assumptions 1, 2 and 4 be satisfied. Let $\Pi_n$ be a sequence of prior distributions of the form (3.1) on the Borel $\sigma$-algebra on $L^2\times\Theta$. Let $(\kappa_n),(\xi_n)$ be two positive sequences converging to zero and $(j_n)$ a sequence of integers with $j_n \to \infty$. Suppose $\kappa_n/(\varepsilon_n\vee\delta_n) \to \infty$ as $n \to \infty$ as well as
\[ d_{j_n} \leq c_1\frac{\kappa_n^2}{(\varepsilon_n\vee\delta_n)^2}, \qquad \frac{\kappa_n}{\sigma_{j_n}} \leq c_2\xi_n \qquad\text{and}\qquad \frac{\delta_n}{\sigma_{j_n}}\sqrt{d_{j_n}} \to 0 \]
for constants $c_1, c_2 > 0$ and all $n \geq 1$. Suppose $f_0$ satisfies $\|f_0\| \leq R$ and $\|f_0 - P_{j_n}(f_0)\| \leq C_1\xi_n$ for some $R, C_1 > 0$. Let $\mathcal{F}_n \subseteq \{f \in L^2 : \|f - P_{j_n}f\| \leq C_1\xi_n\}$ be a sequence and $C_2 > 0$ such that
\[ \Pi_n(L^2\setminus\mathcal{F}_n) \leq e^{-(C_2+4)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}. \quad (3.2) \]
Moreover assume for sufficiently large $n$
\[ \Pi_n\Big((f,\vartheta)\in V_{j_n}\times\Theta : \frac{\|P_{j_n+m}(K_\vartheta f - K_{\vartheta_0}f_0)\|}{\varepsilon_n} + \frac{\|P_{j_n+m}(\vartheta - \vartheta_0)\|}{\delta_n} \leq \frac{\kappa_n}{\varepsilon_n\vee\delta_n}\Big) \geq e^{-C_2\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}. \quad (3.3) \]
Then there exists a finite constant
$D > 0$ such that the posterior distribution from (2.3) satisfies
\[ \Pi_n(f \in V_{j_n} : \|f - f_0\| > D\xi_n\,|\,Y,T) \to 0 \quad (3.4) \]
as $n \to \infty$ in $P_{f_0,\vartheta_0}$-probability.

Theorem 5 states that the posterior distribution $\Pi(f\in\cdot\,|\,Y,T)$ is consistent and concentrates asymptotically its whole probability mass in a ball around the true $f_0$ with decaying radius $D\xi_n \downarrow 0$, that is, the posterior "contracts to $f_0$" with the rate $\xi_n$. This result is similar to Ray [30, Theorem 2.1], who has proven a corresponding theorem for known operators. However, the contraction rate is now determined by the maximum $\varepsilon\vee\delta$ instead of $\varepsilon$, which is natural in view of the results by Hoffmann and Reiß [16] who have included the case $\delta > \varepsilon$ in their frequentist analysis.

To gain some intuition on the interplay between $\kappa_n$ and the noise level $\varepsilon_n\vee\delta_n$, let us set for simplicity $m = 0$ in Assumption 1 and $\varepsilon_n = \delta_n$. Using Assumption 4 (with Lemma 14) and Assumption 2, we then can decompose for $f \in V_{j_n}$
\[ \|f - f_0\| \leq \|P_{j_n}f_0 - f_0\| + \|f - P_{j_n}f_0\| \lesssim \|P_{j_n}f_0 - f_0\| + \sigma_{j_n}^{-1}\|K_\vartheta f - K_\vartheta P_{j_n}f_0\| \leq \|P_{j_n}f_0 - f_0\| + \sigma_{j_n}^{-1}\|K_\vartheta f - K_{\vartheta_0}P_{j_n}f_0\| + \sigma_{j_n}^{-1}L\|P_{j_n}(\vartheta - \vartheta_0)\|_{j_n}\|f_0\|. \]
The first term in the last line is the approximation error being bounded by $\xi_n$. It corresponds to the classical bias. Indeed, the prior sequence $\Pi_n$ is concentrated on a subset of $\{f : \|f - P_{j_n}f\| \leq C_1\xi_n\}$ due to (3.2), such that the projection of $f_0$ to the level $j_n$ serves as reference measure for the prior and the deterministic error remains bounded by $\xi_n$. The last two terms in the previous display correspond to the stochastic errors in $f$ and $\vartheta$ and are of the order $\kappa_n/\sigma_{j_n}$, owing to the minimal spread of $\Pi_n$ imposed by the small ball probability condition (3.3). In particular, we recover the ill-posedness of the inverse problem due to $\sigma_{j_n} \to 0$ in the denominator. To obtain the best possible contraction rate, we need to choose $j_n$ in a way that ensures that $\xi_n$ is close to $\kappa_n/\sigma_{j_n}$, i.e., we will balance the deterministic and the stochastic error. The conditions on the dimension $d_{j_n}$ are mild technical assumptions.

The crucial small ball probability assumption (3.3) ensures that the prior sequence $\Pi_n$ has some minimal mass in a neighbourhood of the underlying $f_0$ and $\vartheta_0$. The distance from $(f_0,\vartheta_0)$ is measured in a (semi-)metric which reflects the structure of our inverse problem. If $\varepsilon_n = \delta_n$, it would be sufficient if $\|K_\vartheta f - K_{\vartheta_0}f_0\|$ and $\|\vartheta - \vartheta_0\|$ are smaller than $\kappa_n$. However, condition (3.3) is more subtle. Firstly, the maximum of $\varepsilon$ and $\delta$ on the right-hand side within the probability introduces some difficulties. The prior has to weight a smaller neighbourhood of $K_{\vartheta_0}f_0$ or $\vartheta_0$, respectively, depending on whether $\varepsilon$ is smaller than $\delta$ or the other way around. If, for instance, $\varepsilon < \delta$, the contraction rate is determined by $\delta$ but the prior has to put enough probability to the smaller $\varepsilon$-ball around $K_{\vartheta_0}f_0$. We see such effects also in the construction of lower bounds, cf. [16], where we may have in the extreme case a $\delta$ distance between $f$ and $f_0$ while $K_\vartheta f = K_{\vartheta_0}f_0$. Secondly, (3.3) depends only on finite dimensional projections of both $K_\vartheta f$ and $\vartheta$.
This is particularly important as we do not assume any regularity conditions on $\vartheta_0$, such that we cannot expect the projection remainder $(\mathrm{Id} - P_{j+m})\vartheta_0$ to be small.

To allow for this relaxed small ball probability condition, the contraction rate is restricted to the set $V_{j_n}$. The result can be extended to $L^2$ by appropriate constructions of the prior; in particular, if the support of $\Pi_n$ is contained in $V_{j_n}$, we can immediately replace $V_{j_n}$ by $L^2$ in (3.4). Another possibility are general product priors if the basis is chosen according to the singular value decomposition of $K_\vartheta$.

To prove Theorem 5, we use the techniques by Ghosal et al. [12, Thm. 2.1], cf. also [14, Thm. 7.3.5]. A main step is the construction of tests for the testing problem
\[ H_0 : f = f_0 \qquad\text{vs.}\qquad H_1 : f \in \mathcal{F}_n,\ \|f - f_0\| \geq D\xi_n. \]
To this end, we first study a frequentist estimator of $f$ which then allows to construct a plug-in test as proposed by Giné and Nickl [13].

The natural estimator for $\vartheta$ is $T$ itself. In order to estimate $f$, we use a linear Galerkin method based on the perturbed operator $K_T$, similar to the approaches in [11, 16]. We thus aim for a solution $\hat f_{\varepsilon,\delta} \in V_j$ to
\[ \langle K_T\hat f_{\varepsilon,\delta}, v\rangle = \langle Y, v\rangle \quad\text{for all } v \in V_j. \quad (3.5) \]
Choosing $v \in \{\varphi_k : |k| \leq j\}$, we obtain a system of linear equations depending only on the projected operator $K_{T,j}$. There is a unique solution if $K_{T,j}$ is invertible. Noting that for the unperturbed operator $K_{\vartheta,j}$ Assumption 4 implies $\|K_{\vartheta,j}^{-1}\|_{V_j\to V_j} \leq Q\sigma_j^{-1}$ (cf. Lemma 14 below), we set
\[ \hat f_j := \begin{cases} K_{T,j}^{-1}P_jY, & \text{if } \|K_{T,j}^{-1}\|_{V_j\to V_j} \leq \tau/\sigma_j, \\ 0, & \text{otherwise}, \end{cases} \quad (3.6) \]
for a projection level $j$ and a cut-off parameter $\tau > 0$.
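For illustration, a minimal implementation of (3.6) in the diagonal setting of Model B, where $K_{T,j}$ is multiplication by the first $d_j$ entries of $T$ and the stability cut-off $\|K_{T,j}^{-1}\|_{V_j\to V_j} \leq \tau/\sigma_j$ reduces to a lower bound on $\min_k|T_k|$. The function name and the diagonal simplification are our own choices for this sketch.

```python
import numpy as np

def galerkin_estimator(Y, T, d_j, sigma_j, tau):
    """Estimator (3.6) for a diagonal operator (Model B): K_{T,j} = diag(T_1,...,T_{d_j}).

    Returns the first d_j coefficients of f_hat, or zero if the stability
    cut-off ||K_{T,j}^{-1}|| <= tau / sigma_j is violated.
    """
    Tj = T[:d_j]
    # for a diagonal matrix, the V_j -> V_j norm of the inverse is 1 / min_k |T_k|
    if np.min(np.abs(Tj)) < sigma_j / tau:
        return np.zeros(d_j)
    return Y[:d_j] / Tj
```

With the sequences simulated in the sketch of Section 2, for instance, `galerkin_estimator(Y, T, d_j=10, sigma_j=0.01, tau=2.0)` recovers the leading coefficients of $f_0$.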
Adopting ideas from [13, 16], we obtain the following non-asymptotic concentration result for the estimator $\hat f_j$.

Proposition 6. Let $j \in \mathbb{N}$, $\kappa > 0$ such that $d_j \leq C'\kappa^2/(\varepsilon\vee\delta)^2$ for some $C' > 0$. Under Assumptions 2 and 4 there are constants $c, C > 0$ such that, if $\delta\sigma_j^{-1}(\kappa + \sqrt{d_j}) \leq c\,\frac{\tau - Q}{\tau Q}$ and $\tau > Q$, then $\hat f_j$ from (3.6) fulfils
\[ P_{f,\vartheta}\Big(\|\hat f_j - f\| \geq C\sigma_j^{-1}(\|f\|\vee 1)\kappa + \|f - P_jf\|\Big) \leq e^{-\kappa^2/(\varepsilon\vee\delta)^2}. \]

Note that some care will be needed to analyse the above mentioned tests, since also the stochastic error term $\sigma_j^{-1}(\|f\|\vee 1)\kappa$ depends on the unknown function $f$ and, for instance, a Gaussian prior on $f$ will not sufficiently concentrate on a fixed ball $\{f \in L^2 : \|f\| \leq R\}$.

Remark 7. While the assumption that $K_\vartheta$ is self-adjoint simplifies the analysis and the presentation of our approach, the methodology can be generalised to general compact operators $K_\vartheta$. In this case Assumption 4 should be replaced by the assumption $\|K_\vartheta f\|^2 \simeq \sum_k\sigma_{|k|}^2\langle f,\varphi_k\rangle^2$, which is consistent with the original condition, cf. Remark 15. The Galerkin projection method (3.5) can then be generalised to solve
\[ \langle K_T^*K_T\hat f_{\varepsilon,\delta}, v\rangle = \langle Y, K_Tv\rangle \quad\text{for all } v \in V_j, \]
cf. Cohen et al. [7, Appendix A]. This modified estimator should have a similar behaviour as above, such that we can construct the tests which we need to prove Theorem 5. The rest of the proof of the contraction theorem and the subsequent results would remain as before.

4 A truncated product prior and the resulting rates
For the ease of clarity we fix an ($S$-regular) wavelet basis $(\varphi_k)_{k\in\{-1,0,1,\dots\}\times\mathbb{Z}}$ of $L^2$ with the associated approximation spaces $V_j = \mathrm{span}\{\varphi_k : |k| \leq j\}$. We write $|k| = |(j,l)| = j$ as before. Investigating a bounded domain $D \subseteq \mathbb{R}^d$, we have in particular $d_j \simeq 2^{jd}$. The regularity of $f$ will be measured in the Sobolev balls
\[ H^s(R) := \Big\{f \in L^2([0,1]^d) : \|f\|_{H^s}^2 := \sum_{j=-1}^\infty 2^{2sj}\sum_l\langle f,\varphi_{j,l}\rangle^2 \leq R^2\Big\}, \quad s \in \mathbb{R}. \quad (4.1) \]
We will use Jackson's inequality and the Bernstein inequality: For $-S < s \leq t < S$ and $f \in H^t$, $g \in V_j$ we have
\[ \|(\mathrm{Id} - P_j)f\|_{H^s} \lesssim 2^{-j(t-s)}\|f\|_{H^t} \quad\text{and}\quad \|g\|_{H^t} \lesssim 2^{j(t-s)}\|g\|_{H^s}. \quad (4.2) \]

Remark 8. The subsequent analysis applies also to the trigonometric as well as the sine basis in the case of periodic functions. Considering more specifically $L^2_{per}([0,1]) \cap \{f \in L^2([0,1]) : f(0) = f(1) = 0\}$, we may set $\varphi_k = \sqrt{2}\sin(\pi k\,\cdot)$ for $k \in \mathbb{N}$. Since $\|f\|_{H^s}^2 \simeq \sum_{k\geq 1}k^{2s}\langle f,\varphi_k\rangle^2$ holds for any $f \in L^2_{per}([0,1])$, it is then easy to see that the inequalities (4.2) are satisfied for $V_j = \mathrm{span}\{\varphi_1,\dots,\varphi_j\}$ if $2^j$ is replaced by $j$. Alternatively, we may set $V_j = \mathrm{span}\{\varphi_1,\dots,\varphi_{2^j}\}$ which gives exactly (4.2).

For $\vartheta$ we use the product prior as in (3.1) with a fixed density $\beta_k = \beta$. For $f$ we also apply a product prior. More precisely, we take a prior $\Pi_f$ determined by the random series
\[ f(x) = \sum_{|k|\leq J}\tau_{|k|}\Phi_k\varphi_k(x), \quad x \in [0,1]^d, \]
for a sequence $(\tau_j)_{j\geq -1}$, i.i.d. random coefficients $\Phi_k$ (independent of $\vartheta_k$) distributed according to a density $\alpha$ and a cut-off $J \in \mathbb{N}$. Hence,
\[ d\Pi(\vartheta,f) = \prod_{|k|\leq J}\tau_{|k|}^{-1}\alpha(\tau_{|k|}^{-1}f_k)\,df_k \cdot \prod_{k\geq 1}\beta(\vartheta_k)\,d\vartheta_k. \quad (4.3) \]
Under appropriate conditions on the distributions $\alpha, \beta$ and on $J$ we will verify the conditions of Theorem 5.
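For illustration, a draw from the truncated series prior behind (4.3) can be generated as follows, here with the sine basis of Remark 8 and standard normal coefficients $\Phi_k$ (a Gaussian density is one admissible choice for $\alpha$, as used in Section 6); all concrete parameter values are arbitrary choices for this sketch.

```python
import numpy as np

def draw_prior_function(J, tau, x, rng):
    """One draw f = sum_{k<=J} tau_k Phi_k phi_k with phi_k(x) = sqrt(2) sin(pi k x)."""
    k = np.arange(1, J + 1)
    Phi = rng.standard_normal(J)                          # i.i.d. coefficients Phi_k
    basis = np.sqrt(2) * np.sin(np.pi * np.outer(k, x))   # shape (J, len(x))
    return (tau * Phi) @ basis

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 512)
f_draw = draw_prior_function(J=16, tau=np.arange(1, 17) ** -1.0, x=x, rng=rng)
```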
Assumption 9. There are constants $\gamma,\Gamma > 0$ such that the densities $\alpha$ and $\beta$ satisfy
\[ \alpha(x)\wedge\beta(x) \geq \Gamma e^{-\gamma|x|} \quad\text{for all } x \in \mathbb{R}. \]

Assumption 9 is very weak and is satisfied for many distributions with unbounded support, for example, Gaussian, Cauchy, Laplace distributions or Student's $t$-distribution. Also uninformative priors where $\alpha$ or $\beta$ are constant are included. A consequence of the previous assumption is that any random variable $\Phi$ with probability density $\alpha$ (or $\beta$) satisfies
\[ P(|\Phi - x| \leq \kappa) \geq \Gamma\int_{|y|\leq\kappa}e^{-\gamma|y+x|}\,dy \geq 2\Gamma\kappa e^{-\gamma(|x|+\kappa)} \quad\text{for all } \kappa > 0, x \in \mathbb{R}. \quad (4.4) \]
This lower bound will be helpful to verify the small ball probabilities (3.3).

To apply Theorem 5, we choose $J = j_n$ to ensure that the support of $\Pi_f$ lies in $\{f \in \mathcal{F} : \|P_{j_n}(f) - f\| \leq C_1\xi_n\}$. Note that the optimal $j_n$ is not known in practice. We will discuss a data-driven choice of $J$ in Section 5. Alternatively to truncating the random series for $f$, the small bias condition could be satisfied if $(\tau_j)$ decays sufficiently fast and $\alpha$ has bounded support, as is the case for uniform wavelet priors.

We start with the mildly ill-posed case, imposing $\sigma_j = 2^{-jt}$ for some $t > 0$ in Assumption 4. In this case the operators $K_\vartheta$ are naturally adapted to the Sobolev scale, since then $K_\vartheta : L^2 \to H^t$ is continuous with $\|K_\vartheta f\|_{H^t} \lesssim \|f\|$, cf. Remark 15.

Theorem 10. Let $\varepsilon^\eta \lesssim \delta \lesssim \varepsilon$ for some $\eta > 1$ and let Assumptions 1 and 2 with $l_j \lesssim 2^{jd}$, Assumption 4 with $\sigma_j = 2^{-jt}$ for some $t > 0$ as well as Assumption 9 be fulfilled. Then the posterior distribution from (2.3) with prior given by (4.3), where $J$ is chosen such that $2^J = \big(\varepsilon^2\log(1/\varepsilon)\big)^{-1/(2s+2t+d)}$ and $cj^{-1}2^{-j(2s_1+d)} \leq \tau_j^2 \leq Cj$ for constants $c, C > 0$ and some $0 < s_1 < s$, satisfies for any $f_0 \in H^s(R)$ and $\vartheta_0 \in \Theta$
\[ \Pi_n\Big(f \in L^2 : \|f - f_0\| > D\big(\varepsilon^2\log(1/\varepsilon)\big)^{s/(2s+2t+d)}\,\Big|\,Y,T\Big) \to 0 \]
with some constant $D > 0$ and in $P_{f_0,\vartheta_0}$-probability.

Remark 11. This theorem is restricted to the case $\varepsilon \gtrsim \delta$. However, its proof reveals that in the special case where $m = 0$, for instance, if $(\varphi_k)$ are eigenfunctions, the condition $\varepsilon^\eta \lesssim \delta \lesssim \varepsilon$ can be weakened to $\log\delta \simeq \log\varepsilon$, which also allows for $\varepsilon < \delta$. The second restriction is $l_j \lesssim 2^{jd}$, which is especially satisfied in Model B of unknown eigenvalues in the singular value decomposition of $K_\vartheta$. Larger $l_j$ could be incorporated if we put some structure on $\Theta$ which allows for applying a different prior on $\vartheta$ with better concentration of $P_j\vartheta$.

The contraction rate coincides with the minimax optimal convergence rate, as determined in [6, 16] for specific settings of $\vartheta\mapsto K_\vartheta$, up to the logarithmic term. The conditions on $\tau_j$ are very weak and allow for a large flexibility in the choice of the prior; particularly, a constant $\tau_j = 1$ for all $j$ is included. In contrast, the choice of the cut-off parameter $J$ is crucial and depends on the regularity $s$ of $f_0$ and the ill-posedness $t$ of the operator.

In the severely ill-posed case the contraction rate deteriorates to a logarithmic dependence on $\varepsilon\vee\delta$ and coincides again with the minimax optimal rate.

Theorem 12.
Let $\log\varepsilon \simeq \log\delta$ and let Assumptions 1, 2, Assumption 4 with $\sigma_j = \exp(-r2^{jt})$ for some $r, t > 0$ as well as Assumption 9 be fulfilled. Then the posterior distribution from (2.3) with prior given by (4.3), where $J$ is chosen such that $2^J = \big(-\frac{1}{r}\log(\varepsilon\vee\delta)\big)^{1/t}$ and $2^{-j(2s+t+d)} \leq \tau_j^2 \leq \exp(C2^{jt})$ for a constant $C > 0$, satisfies for any $f_0 \in H^s(R)$ and $\vartheta_0 \in \Theta$
\[ \Pi_n\Big(f \in L^2 : \|f - f_0\| > D\big(\log(\varepsilon\vee\delta)^{-1}\big)^{-s/t}\,\Big|\,Y,T\Big) \to 0 \]
with some constant $D > 0$ and in $P_{f_0,\vartheta_0}$-probability.

5 Adaptive choice of the truncation level

We saw above that the choice of the projection level $J$ of the prior depends on the unknown regularity $s$ (and the ill-posedness $t$) in order to achieve the optimal rate. We will now discuss how $J$ can be chosen purely data-driven, resulting in an empirical Bayes procedure that adapts to $s$. Noting that the choice of $J$ in Theorem 12 is already independent of $s$, we focus on the mildly ill-posed case and $\delta \lesssim \varepsilon$.

The method is based on the observation that all conditions on the level $j_n$ in Theorem 5 are monotone (in the sense that they are also satisfied for all $j$ smaller than the optimal $j_n$) except for the bias condition $\|f_0 - P_{j_n}f_0\| \lesssim \xi_n$. Given the optimal $J_o = j_n$, the so-called oracle, the result in Theorem 10 continues to hold for any empirically chosen $\hat J$ satisfying $\hat J \leq J_o$ and $\|f_0 - P_{\hat J}f_0\| \lesssim \xi_n$. To find $\hat J$, we use Lepski's method [24], which is generally known for these two properties.

In Proposition 6 we saw that the variance of the estimator $\hat f_j$ from (3.6) is of the order $\varepsilon^2d_j/\sigma_j^2 = \varepsilon^22^{2jt+jd}$. For some fixed lower bound $s_0$ on the regularity $s$ of $f_0 \in H^s$ let
\[ J_\varepsilon = \Big\lfloor\frac{\log\varepsilon^{-1}}{(s_0+t+d/2)\log 2}\Big\rfloor, \]
where $\lfloor x\rfloor$ denotes the largest integer smaller than $x$. The choice of $J_\varepsilon$ allows for applying the concentration inequality from Proposition 6 to all $\hat f_j$ with $j \leq J_\varepsilon$. We then choose
\[ \hat J := \min\big\{j \in \{0,\dots,J_\varepsilon\} : \|\hat f_i - \hat f_j\| \leq \Delta\varepsilon(\log\varepsilon^{-1})^{1/2}2^{i(t+d/2)}\ \forall i > j\big\} \]
for a constant $\Delta \in (0,\infty)$ which can be chosen by the practitioner. The idea of the choice of $\hat J$ is as follows: Starting with large $j$, the projection estimator $\hat f_j$ has a small bias, but a standard deviation of order $\varepsilon2^{j(t+d/2)}$. Decreasing $j$ reduces the variance while the bias increases. At the point where there is some $i > j$ such that $\|\hat f_i - f_0\| + \|\hat f_j - f_0\| \geq \|\hat f_i - \hat f_j\|$ is larger than the order of the variance, the bias starts dominating the estimation error. At this point we stop lowering $j$ and select $\hat J$.
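A minimal sketch of this selection rule, assuming a routine `f_hat(j)` that returns the coefficient vector of the Galerkin estimator on $V_j$; the zero-padding used to compare estimators across levels and the default value of `Delta` are our own choices.

```python
import numpy as np

def lepski_J(f_hat, J_eps, eps, t, d, Delta=0.5):
    """Smallest j in {0,...,J_eps} such that ||f_hat_i - f_hat_j|| stays below the
    noise band Delta * eps * sqrt(log 1/eps) * 2^{i(t+d/2)} for all finer levels i > j."""
    est = [f_hat(j) for j in range(J_eps + 1)]       # est[j]: coefficients on V_j
    def dist(i, j):                                  # L2-distance, f_hat_j zero-padded
        a = est[i]
        b = np.pad(est[j], (0, a.size - est[j].size))
        return np.linalg.norm(a - b)
    band = lambda i: Delta * eps * np.sqrt(np.log(1 / eps)) * 2 ** (i * (t + d / 2))
    for j in range(J_eps + 1):
        if all(dist(i, j) <= band(i) for i in range(j + 1, J_eps + 1)):
            return j
    return J_eps
```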
Theorem 13. Let $\varepsilon^\eta \lesssim \delta \lesssim \varepsilon$ for some $\eta > 1$ and let Assumptions 1 and 2 with $l_j \lesssim 2^{jd}$, Assumption 4 with $\sigma_j = 2^{-jt}$ for some $t > 0$ as well as Assumption 9 be fulfilled. Then the posterior distribution from (2.3) with prior given by (4.3) with $\hat J$ instead of $J$ and $cj^{-1}2^{-j(2s_1+d)} \leq \tau_j^2 \leq Cj$ for constants $c, C > 0$ and some $0 < s_1 < s$ satisfies for any $f_0 \in H^s(R)$ and $\vartheta_0 \in \Theta$
\[ \Pi_n\Big(f \in L^2 : \|f - f_0\| > D(\log\varepsilon^{-1})^\chi\varepsilon^{2s/(2s+2t+d)}\,\Big|\,Y,T\Big) \to 0 \]
with some constant $D > 0$, $\chi = (4s+2t+d)/(2s+2t+d)$ and in $P_{f_0,\vartheta_0}$-probability.

Note that the empirical Bayes procedure is adaptive with respect to $s$ and the Sobolev radius $R$. Compared to Theorem 10, where the oracle choice for $J$ is used, we only lose a logarithmic factor for adaptivity.

6 Numerical examples

6.1 Heat equation

To illustrate the previous theory, we consider the heat equation
\[ \frac{\partial}{\partial t}u(x,t) = \vartheta\frac{\partial^2}{\partial x^2}u(x,t), \qquad u(\cdot,0) = f, \qquad u(0,t) = u(1,t) = 0, \quad (6.1) \]
with Dirichlet boundary conditions at $x = 0$ and $x = 1$ and some initial value function $f \in L^2([0,1])$ satisfying $f(0) = f(1) = 0$. Different to [23, 30] we take an unknown diffusivity parameter $\vartheta > 0$ into account. A solution to (6.1) is observed at some time $t_0 > 0$:
\[ Y = u(\cdot,t_0) + \varepsilon Z \quad (6.2) \]
with white noise $Z$ on $L^2([0,1])$. The aim is to recover $f$ from $Y$.

The solution $u(\cdot,t_0)$ depends linearly on $f$ via an operator $K_\vartheta$ which is diagonalised by the sine basis $e_k = \sqrt{2}\sin(\pi k\,\cdot)$, $k \geq 1$, of $L^2_{per}([0,1])$, building a system of eigenfunctions of the Laplace operator. The corresponding eigenvalues of $K_\vartheta$ are given by $\rho_{\vartheta,k} := e^{-\vartheta\pi^2k^2t_0}$, $k \geq 1$, and we obtain the singular value decomposition
\[ K_\vartheta f = \sum_{k\geq 1}\langle f, e_k\rangle\rho_{\vartheta,k}e_k = \sum_{k\geq 1}\langle f, e_k\rangle e^{-\vartheta\pi^2k^2t_0}\sqrt{2}\sin(\pi k\,\cdot). \]
Note that $K_\vartheta$ depends on $\vartheta$ only via its eigenvalues $\rho_{\vartheta,k}$, while the eigenfunctions and thus the considered basis are independent of $\vartheta$. Moreover, the dependence of $\rho_{\vartheta,k}$ on $\vartheta$ is non-linear. From the decay of the eigenvalues we see that the resulting inverse problem is severely ill-posed with $\sigma_j = \exp(-\vartheta\pi^2t_0j^2)$. Since we can easily construct pairs $(\vartheta, f)$ and $(\vartheta', f')$ with $K_\vartheta f = K_{\vartheta'}f'$, the function $f$ is indeed not identifiable based only on the observation $Y$ and we need the additional observation $T = \vartheta + \delta W$ for some $W \sim N(0,1)$.

Since the eigenfunctions are independent of $\vartheta$, we can choose the basis $\varphi_k = e_k$ thanks to Remark 8. We moreover apply the truncated product prior (4.3) with centered normal densities $\alpha$ and $\beta$ and fixed variances $\tau^2$ and $\sigma^2$. In our numerical example we set
\[ t_0 = 0.1, \qquad f_0(x) = 4x(1-x)(8x-5) \qquad\text{and}\qquad \vartheta_0 = 1, \quad (6.3) \]
reproducing the same setting as considered in [23], but taking the unknown $\vartheta$ into account. The Fourier coefficients of $f_0$ with respect to the sine basis $\varphi_k$ are given by
\[ f_{0,k} = \langle f_0,\varphi_k\rangle = \frac{8\sqrt{2}\,\big(11(-1)^{k+1} - 13\big)}{\pi^3k^3}, \quad k \geq 1. \]
By the decay of the coefficients, we have $f_0 \in H^s$ for every $s < 5/2$.

To implement our Bayes procedure, we need to sample from the posterior distribution which is not explicitly accessible. Fortunately, using independent normal $N(0,\tau^2)$ priors on the coefficients $f_k = \langle f,\varphi_k\rangle$, we see from (2.2) that at least the conditional posterior distribution of $f$ given $\vartheta, Y, T$ can be explicitly computed as
\[ \Pi(f\in\cdot\,|\,\vartheta,Y,T) = \bigotimes_{k\leq J}N\Big(\frac{\varepsilon^{-2}\rho_{\vartheta,k}^{-1}}{\varepsilon^{-2} + \rho_{\vartheta,k}^{-2}\tau^{-2}}Y_k,\ \frac{\rho_{\vartheta,k}^{-2}}{\varepsilon^{-2} + \rho_{\vartheta,k}^{-2}\tau^{-2}}\Big). \quad (6.4) \]
Profiting from this known conditional posterior distribution, we use a Gibbs sampler to draw (approximately) from the unconditional posterior distribution of $f$ given $Y, T$, cf. [34]. Given some initial $\vartheta^{(0)}$, the algorithm alternates between draws from $f^{(i+1)}\,|\,\vartheta = \vartheta^{(i)}, Y, T$ and $\vartheta^{(i+1)}\,|\,f = f^{(i+1)}, Y, T$ for $i \in \mathbb{N}$. The second conditional distribution is not explicitly given, due to the non-linear dependence of $\rho_{\vartheta,k}$ on $\vartheta$. We apply a standard Metropolis–Hastings algorithm to approximate the distribution of $\vartheta\,|\,f, Y, T$, using a random walk with $N(0,v^2)$ increments as proposal chain. A similar Metropolis-within-Gibbs method has been used in [21] in a comparable simulation task.
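The following sketch summarises this Metropolis-within-Gibbs scheme, assuming the first $J$ sine coefficients of $Y$ (as an array) and the scalar observation $T$ as inputs. The proposal step size `v`, the initialisation and all variable names are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def rho(theta, k, t0=0.1):
    """Eigenvalues of the heat semigroup: rho_{theta,k} = exp(-theta pi^2 k^2 t0)."""
    return np.exp(-theta * np.pi ** 2 * k ** 2 * t0)

def log_post_theta(theta, f, Y, T, k, eps, delta, s2):
    """Unnormalised log-density of theta | f, Y, T with an N(0, s2) prior on theta."""
    if theta <= 0:
        return -np.inf
    Kf = rho(theta, k) * f
    return (Y @ Kf / eps ** 2 - 0.5 * (Kf @ Kf) / eps ** 2
            + theta * T / delta ** 2 - 0.5 * theta ** 2 / delta ** 2
            - 0.5 * theta ** 2 / s2)

def gibbs(Y, T, J, eps, delta, tau2=1.0, s2=1.0, v=0.01, n_iter=5000):
    """Metropolis-within-Gibbs: exact draws of f | theta via (6.4), RW-MH for theta."""
    k = np.arange(1, J + 1)
    theta = max(T, 0.1)                        # initialise from the direct observation T
    f_draws, theta_draws = [], []
    for _ in range(n_iter):
        # conjugate coordinatewise update of f | theta, Y, equivalent to (6.4)
        r = rho(theta, k)
        var = 1.0 / (r ** 2 / eps ** 2 + 1.0 / tau2)
        f = var * r * Y / eps ** 2 + np.sqrt(var) * rng.standard_normal(J)
        # random-walk Metropolis step for theta | f, Y, T
        prop = theta + v * rng.standard_normal()
        log_acc = (log_post_theta(prop, f, Y, T, k, eps, delta, s2)
                   - log_post_theta(theta, f, Y, T, k, eps, delta, s2))
        if np.log(rng.uniform()) < log_acc:
            theta = prop
        f_draws.append(f)
        theta_draws.append(theta)
    return np.array(f_draws), np.array(theta_draws)
```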
Using the sequence $(\vartheta^{(i)})_i$ from this algorithm, the final Markov chain Monte Carlo (MCMC) approximation of $\Pi(f\in\cdot\,|\,Y,T)$ is then given by the average
\[ \frac{1}{M}\sum_{m=1}^M\Pi\big(f\in\cdot\,|\,\vartheta = \vartheta^{(B+m\cdot l)}, Y, T\big) \]
for sufficiently large $B, M, l \in \mathbb{N}$, where we again profit from the explicitly given conditional posterior distribution (6.4).

Figure 1 shows the typical posterior mean and 20 draws from the posterior distribution in a simulation using $\varepsilon = \delta = 10^{-3}$ and $10^{-5}$. In both cases the projection level is chosen as $J = 4 \simeq \sqrt{-\log(\varepsilon)}$. Especially for the smaller noise level, the common intersections of all sampled functions are conspicuous. They reflect a quite low variance of the posterior distribution in the first coefficients compared to a relatively large variance already for the second coefficient $f_2$, due to the severe ill-posedness, cf. (6.4).

[Figure 1: Heat equation with unknown diffusivity parameter: True function (black), projection estimator (blue), posterior mean (red, solid) and 20 draws from the posterior distribution (red, dotted) with $\varepsilon = \delta = 10^{-3}$ (left) and $10^{-5}$ (right).]

As a reference estimator the Galerkin projector $\hat f_J$ from (3.6) is plotted, too. We see that for $\varepsilon = 10^{-3}$ the posterior mean is much closer to the true function, indicating an efficiency gain of the Bayesian procedure compared to the projection estimator. For $\varepsilon = 10^{-5}$ both estimators coincide almost perfectly. As shown by the theory, the figure illustrates that the posterior distribution concentrates around the truth for smaller noise levels. Monte Carlo simulations based on 500 iterations yield a root mean integrated squared error (RMISE) of 0.3353 and 0.0512 for $\varepsilon = 10^{-3}$ and $\varepsilon = 10^{-5}$, respectively. For the posterior mean of $\vartheta$ we observe a root mean squared error of the order $10^{-3}$ and $10^{-5}$, respectively. Additionally, Table 1 reports the RMISE for several different combinations of the noise levels $\varepsilon$ and $\delta$.

[Table 1: RMISE for several combinations of the noise levels $\varepsilon$ and $\delta$.]

6.2 Blind deconvolution

Another example is the deconvolution problem occurring for instance in image processing, cf. Johnstone et al. [19]. The aim is to recover some unknown 1-periodic function $f$ from the observations $Y = K_\vartheta f + \varepsilon Z$ with
\[ K_\vartheta f := g_\vartheta * f := \int_0^1 f(\cdot - x)g_\vartheta(x)\,dx, \]
where $g_\vartheta \in L^2_{per}$ is some 1-periodic convolution kernel (more generally it might be a signed measure). Since the convolution operator $K_\vartheta$ is smoothing, the inverse problem is ill-posed. If the kernel $g_\vartheta$ is unknown, the problem is called blind deconvolution, occurring in many applications [4, 20, 32]. In a density estimation setting this problem has already been intensively investigated, cf. [10, 17, 18, 26] among others. However, the Bayesian perspective on this problem seems not thoroughly studied.

We consider the trigonometric basis
\[ \varphi_0 = 1, \qquad \varphi_{j,1} = \sqrt{2}\cos(2\pi j\,\cdot), \qquad \varphi_{j,2} = \sqrt{2}\sin(2\pi j\,\cdot), \qquad j \in \mathbb{N}, \]
with the corresponding approximation spaces $V_J = \mathrm{span}(\varphi_{j,l} : j \leq J, l \in \{1,2\})$. Assuming $g_\vartheta$ is symmetric, we have $\langle g_\vartheta,\varphi_{j,2}\rangle = 0$ and
\[ K_\vartheta\varphi_0 = \langle g_\vartheta,\varphi_0\rangle\varphi_0, \qquad K_\vartheta\varphi_{j,l} = \sum_m\langle g_\vartheta,\varphi_{m,1}\rangle(\varphi_{j,l}*\varphi_{m,1}) = \frac{1}{\sqrt{2}}\langle g_\vartheta,\varphi_{j,1}\rangle\varphi_{j,l}, \qquad j \in \mathbb{N},\ l \in \{1,2\}, \]
by the angle sum identities (for non-symmetric kernels $K_\vartheta$ could be diagonalised by the complex-valued Fourier basis).
We thus obtain the singular value decomposition $K_\vartheta f = \sum_k\rho_{\vartheta,k}f_k\varphi_k$, again in multi-index notation $k = (j,l)$, $j \in \mathbb{N}$, $l \in \{1,2\}$, where $\rho_{\vartheta,k} = \langle g_\vartheta,\varphi_{|k|,1}\rangle/\sqrt{2}$ and $f_k = \langle f,\varphi_k\rangle$. Depending on the regularity of $g_\vartheta$ and thus the decay of $\langle g_\vartheta,\varphi_{j,1}\rangle$, the problem is mildly or severely ill-posed.

If the convolution kernel is fully unknown, we parametrise $g_\vartheta = \vartheta$ by all (symmetric) 1-periodic kernels $\vartheta$. Due to the SVD, we then can identify $g_\vartheta$ with the singular values, that is, we set $\vartheta = (\rho_{\vartheta,k})_k$. The sample $T$ can be understood as training data, where the convolution experiment is applied to all basis functions $f \in \{\varphi_{j,l}\}$. In this scenario we obtain $\varepsilon = \delta$.

[Figure 2: Adaptive deconvolution for the Laplace kernel with $\varepsilon = \delta = 10^{-2}$ (left) and $\varepsilon = \delta = 10^{-3}$ (right): True function (black), projection estimator (blue), posterior mean (red, solid) and 20 draws from the posterior distribution (red, dotted).]

In our simulation $\vartheta_0$ is given by the periodic Laplace kernel $g_{\vartheta_0}(x) = C_he^{-|x|/h}\mathbf{1}_{[-1/2,1/2]}(x)$ with normalisation constant $C_h = \big(2h(1 - e^{-1/(2h)})\big)^{-1}$ and fixed bandwidth $h = 0.05$. Hence, we have $\rho_{\vartheta_0,0} = 1$ and for $k \in \mathbb{N}\times\{1,2\}$
\[ \rho_{\vartheta_0,k} = \frac{2h^{-1}C_h}{4\pi^2|k|^2 + h^{-2}}\Big(1 - e^{-1/(2h)}\cos(\pi|k|) + e^{-1/(2h)}2\pi|k|h\sin(\pi|k|)\Big). \]
In particular, the degree of ill-posedness is $t = 2$. We moreover use $f_0$ from (6.3).
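As a sanity check of the displayed singular values, the closed form can be compared with a direct numerical evaluation of $\langle g_{\vartheta_0},\varphi_{|k|,1}\rangle/\sqrt{2} = \int g_{\vartheta_0}(x)\cos(2\pi|k|x)\,dx$; for integer $|k|$ the sine term vanishes. The quadrature grid below is an arbitrary choice.

```python
import numpy as np

h = 0.05
C_h = 1.0 / (2 * h * (1 - np.exp(-1 / (2 * h))))  # kernel normalisation, so rho_0 = 1

def rho(j):
    """Closed-form singular values rho_{theta_0,k}, |k| = j, for integer j."""
    return (2 * C_h / h) * (1 - np.exp(-1 / (2 * h)) * np.cos(np.pi * j)) \
           / (4 * np.pi ** 2 * j ** 2 + h ** -2)

# numerical counterpart <g, cos(2 pi j .)> on [-1/2, 1/2] via the trapezoidal rule
x = np.linspace(-0.5, 0.5, 200_001)               # grid containing the kink at x = 0
g = C_h * np.exp(-np.abs(x) / h)
for j in (1, 2, 5):
    y = g * np.cos(2 * np.pi * j * x)
    numeric = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))
    assert abs(numeric - rho(j)) < 1e-7
```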
To implement the empirical Bayes procedure with the trigonometric basis and corresponding approximation spaces $V_j = \mathrm{span}(\varphi_k : |k| \leq j)$, we need to replace $2^j$ by $j$ as mentioned in Remark 8. Choosing some $b > 1$ and setting
\[ J_\varepsilon = \Big\lfloor\frac{\log\varepsilon^{-1}}{(s_0+t+d/2)\log b}\Big\rfloor \]
for some lower bound $s_0 \leq s$, the selection rule then reads as
\[ \hat J := \min\big\{j \in \{1, b, b^2, \dots, b^{J_\varepsilon}\} : \|\hat f_i - \hat f_j\| \leq \Delta\varepsilon(\log\varepsilon^{-1})^{1/2}i^{t+d/2}\ \forall i > j\big\}. \]
Using again Gaussian product priors for $f$ and $\vartheta$, the posterior distribution can be approximated similarly as described in Section 6.1. However, the nuisance parameter $\vartheta$ is now infinite dimensional. Here, we can profit from the truncated product structure of the prior, which implies that the posterior distribution only depends on the $\hat J$-dimensional projection $P_{\hat J}\vartheta$ (note that Assumption 1 is satisfied with $m = 0$). More precisely, we only have to draw from the posterior given by
\[ \Pi(B\,|\,Y,T) = \frac{1}{C}\int_B\exp\Big(\frac{1}{\varepsilon^2}\langle P_{\hat J}K_\vartheta f, Y\rangle - \frac{1}{2\varepsilon^2}\|P_{\hat J}K_\vartheta f\|^2 + \frac{1}{\delta^2}\langle P_{\hat J}\vartheta, T\rangle - \frac{1}{2\delta^2}\|P_{\hat J}\vartheta\|^2\Big)\,d\Pi(f,\vartheta), \]
with normalisation constant $C > 0$ and for all Borel sets $B \subseteq L^2\times P_{\hat J}\Theta$, cf. the proof of Theorem 5. Therefore, a Gibbs sampler can be used to draw successively the coordinates of $P_{\hat J}\vartheta$ with a Metropolis–Hastings algorithm and to iterate as above with draws of $f$. This simulation approach is not restricted to this particular example, but applies generally. Note that in the specific deconvolution setting, the map $\vartheta\mapsto K_\vartheta f$ is linear for fixed $f$, such that $\vartheta\,|\,f,Y,T$ can be directly sampled from a Gaussian distribution.

For $\varepsilon = \delta = 10^{-2}$ and $\varepsilon = \delta = 10^{-3}$ a typical trajectory of the posterior mean and 20 draws from the posterior are presented in Figure 2, where the Lepski rule has chosen $J = 3$ (i.e. 7 basis functions) and $J = 5$ (11 basis functions), respectively. For the larger noise level, the posterior mean slightly improves on the Galerkin projector, while for the smaller noise level both estimators basically coincide. We see a much better concentration of the posterior distribution than in the severely ill-posed case discussed previously. In a Monte Carlo simulation for $\varepsilon = \delta = 10^{-2}$ based on 500 iterations in this setting, the posterior mean for $f$ achieved a RMISE of 0.1142, which is approximately $8.6\%$ of $\|f_0\|$. The Lepski method has chosen $J \in \{3,4\}$ with high relative frequency. For $\varepsilon = \delta = 10^{-3}$ the simulation yields a RMISE of 0.0174, which is approximately $1.3\%$ of $\|f_0\|$, and projection levels $J \in \{4,5\}$ in most of the Monte Carlo iterations.

7 Proofs

We first study some smoothing properties of the operator $K_\vartheta$.
Lemma 14. Under Assumption 4 we have $\|K_{\vartheta,j}^{-1}\|_{V_j\to V_j} \leq Q\sigma_j^{-1}$ for all $\vartheta \in \Theta$.

Proof. For $g \in V_j$ the function $h = K_{\vartheta,j}^{-1}g \in V_j$ is given by the unique solution to the linear system $\langle K_\vartheta h, v\rangle = \langle g, v\rangle$ for all $v \in V_j$. Assumption 4 then yields
\[ \sigma_j\|h\|^2 = \sigma_j\sum_{|k|\leq j}\langle h,\varphi_k\rangle^2 \leq \sum_{|k|\leq j}\sigma_{|k|}\langle h,\varphi_k\rangle^2 \leq Q\langle K_\vartheta h, h\rangle \leq Q\|h\|\sup_{v\in V_j:\|v\|\leq 1}\langle K_\vartheta h, v\rangle = Q\|h\|\sup_{v\in V_j:\|v\|\leq 1}\langle g, v\rangle = Q\|h\|\|g\|. \]
Therefore, $\sigma_j\|K_{\vartheta,j}^{-1}g\| \leq Q\|g\|$ holds true for all $g \in V_j$.

Remark 15. As soon as $(\sigma_j)$ decays at least geometrically, Assumptions 1 and 4 also yield $\|K_\vartheta f\|^2 \lesssim \sum_k\sigma_{|k|}^2\langle f,\varphi_k\rangle^2$. Indeed, we have for any $f \in L^2$ such that the right-hand side is finite:
\[ \|K_\vartheta f\|^2 = \sum_k|\langle K_\vartheta P_{|k|+m}f,\varphi_k\rangle|^2 \leq \sum_k\|K_\vartheta^{1/2}P_{|k|+m}f\|^2\|K_\vartheta^{1/2}\varphi_k\|^2 \lesssim \sum_k\sum_{|l|\leq|k|+m}\sigma_{|k|}\sigma_{|l|}\langle f,\varphi_l\rangle^2 = \sum_l\Big(\sum_{|k|\geq(|l|-m)\vee 0}\sigma_{|k|}\Big)\sigma_{|l|}\langle f,\varphi_l\rangle^2 \lesssim \sum_l\sigma_{|l|}^2\langle f,\varphi_l\rangle^2. \]

Proof of Proposition 6. To simplify the notation, we abbreviate $P = P_{f,\vartheta}$ in the sequel and define the operator $\Delta_{T,j} := K_{T,j} - K_{\vartheta,j}$. Set for $\gamma \in (0, 1 - Q/\tau)$
\[ \Omega_{T,j} := \{\|K_{\vartheta,j}^{-1}\Delta_{T,j}\|_{V_j\to V_j} \leq \gamma\}. \]
Lemma 14 yields
\[ P\big(\Omega_{T,j}^c\big) \leq P\big(\|K_{\vartheta,j}^{-1}\|_{V_j\to V_j}\|\Delta_{T,j}\|_{V_j\to V_j} > \gamma\big) \leq P\big(\|\Delta_{T,j}\|_{V_j\to V_j} > \gamma\sigma_j/Q\big). \]
Under Assumption 2 we have, due to the condition $\delta\sigma_j^{-1}(\kappa + \sqrt{d_j}) \leq \gamma/(CQL)$,
\[ P\big(\Omega_{T,j}^c\big) \leq P\big(\|\Delta_{T,j}\|_{V_j\to V_j} > \gamma\sigma_j/Q\big) \leq P\Big(\|P_jW\|_j \geq \frac{\gamma\sigma_j}{QL\delta}\Big) \leq P\big(\|P_jW\|_j > C(\kappa + \sqrt{d_j})\big) \leq e^{-c\kappa^2}. \]
We thus may restrict to $\Omega_{T,j}$, on which the operator $K_{T,j} = K_{\vartheta,j}(\mathrm{Id} + K_{\vartheta,j}^{-1}\Delta_{T,j})$ is invertible, satisfying
\[ \|K_{T,j}^{-1}\|_{V_j\to V_j} \leq \|(\mathrm{Id} + K_{\vartheta,j}^{-1}\Delta_{T,j})^{-1}\|_{V_j\to V_j}\|K_{\vartheta,j}^{-1}\|_{V_j\to V_j} \leq \frac{1}{1-\gamma}\|K_{\vartheta,j}^{-1}\|_{V_j\to V_j} \leq \frac{Q}{(1-\gamma)\sigma_j}, \]
where we used Lemma 14 in the last step. Hence, for $\gamma \leq 1 - Q/\tau$ we have $\Omega_{T,j} \subseteq \{\|K_{T,j}^{-1}\|_{V_j\to V_j} \leq \tau\sigma_j^{-1}\}$. Therefore, we can decompose on $\Omega_{T,j}$
\[ \|\hat f_j - f\|^2 = \|P_jf - f\|^2 + \|\hat f_j - P_jf\|^2 \leq \|P_jf - f\|^2 + \|K_{T,j}^{-1}P_jY - P_jf\|^2. \quad (7.1) \]
The first term is the usual bias. For the second term in (7.1) we write on $\Omega_{T,j}$
\[ K_{T,j}^{-1}P_jY - P_jf = \big((\mathrm{Id} + K_{\vartheta,j}^{-1}\Delta_{T,j})^{-1} - \mathrm{Id}\big)P_jf + \varepsilon(\mathrm{Id} + K_{\vartheta,j}^{-1}\Delta_{T,j})^{-1}K_{\vartheta,j}^{-1}P_jZ = -(\mathrm{Id} + K_{\vartheta,j}^{-1}\Delta_{T,j})^{-1}K_{\vartheta,j}^{-1}\Delta_{T,j}P_jf + \varepsilon(\mathrm{Id} + K_{\vartheta,j}^{-1}\Delta_{T,j})^{-1}K_{\vartheta,j}^{-1}P_jZ. \]
Since $\|(\mathrm{Id} + K_{\vartheta,j}^{-1}\Delta_{T,j})^{-1}\|_{V_j\to V_j} \leq 1/(1-\gamma)$ on $\Omega_{T,j}$, we obtain
\[ \|K_{T,j}^{-1}P_jY - P_jf\| \leq \frac{1}{1-\gamma}\|K_{\vartheta,j}^{-1}\|_{V_j\to V_j}\|\Delta_{T,j}\|_{V_j\to V_j}\|P_jf\| + \frac{\varepsilon}{1-\gamma}\|K_{\vartheta,j}^{-1}\|_{V_j\to V_j}\|P_jZ\| \leq \frac{Q}{(1-\gamma)\sigma_j}\big(\|f\|\|\Delta_{T,j}\|_{V_j\to V_j} + \varepsilon\|P_jZ\|\big). \quad (7.2) \]
To deduce a concentration inequality for $\|P_jZ\|$, we proceed as proposed in [13]: For a countable dense subset $B$ of the unit ball in $L^2$, we have $\|P_jZ\| = \sup_{f\in B}|P_jZ(f)|$. The Borell–Sudakov–Tsirelson inequality [14, Thm. 2.5.8] yields for any $\kappa > 0$
\[ P\big(\|P_jZ\| \geq \kappa + \mathbb{E}[\|P_jZ\|]\big) \leq P\Big(\sup_{f\in B}|P_jZ(f)| - \mathbb{E}\Big[\sup_{f\in B}|P_jZ(f)|\Big] \geq \kappa\Big) \leq e^{-\kappa^2/(2\sigma^2)} \]
with $\sigma^2 = \sup_{f\in B}\mathrm{Var}(P_jZ(f)) \leq \|f\|^2 \leq 1$. Since
\[ \mathbb{E}[\|P_jZ\|] \leq \mathbb{E}[\|P_jZ\|^2]^{1/2} = \Big(\sum_{|k|\leq j}\mathbb{E}[Z_k^2]\Big)^{1/2} = d_j^{1/2} \]
and $d_j \lesssim \varepsilon^{-2}\kappa^2$, we find for some constant $C > 0$
\[ P\big(\varepsilon\|P_jZ\| \geq C\kappa\big) \leq P\big(\varepsilon\|P_jZ\| \geq \kappa + \varepsilon d_j^{1/2}\big) \leq e^{-\kappa^2/(2\varepsilon^2)}. \]
Under Assumption 2 and due to $d_j \lesssim \delta^{-2}\kappa^2$, we analogously obtain
\[ P\big(\|\Delta_{T,j}\|_{V_j\to V_j} \geq C\kappa\big) \leq e^{-\kappa^2/(2\delta^2)}. \]
In combination with (7.2), the asserted concentration inequality is proven.
Proof of Theorem 5. We prove the theorem in two steps.
Step 1:
We construct tests $\Psi_n = \Psi_n(Y,T)$ such that
\[ \mathbb{E}_{f_0,\vartheta_0}[\Psi_n] \to 0 \qquad\text{and}\qquad \sup_{f\in\mathcal{F}_n,\vartheta\in\Theta:\|f-f_0\|\geq D\xi_n}\mathbb{E}_{f,\vartheta}[1-\Psi_n] \leq e^{-(C_2+4)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}. \quad (7.3) \]
Based on the estimator $\hat f_{j_n}$ from (3.6), we set
\[ \Psi_n := \mathbf{1}\big\{\|\hat f_{j_n} - f_0\| \geq D_0\xi_n\big\} \qquad\text{with}\qquad D_0 = 2Cc_2\sqrt{C_2+4}\,R + 2C_1 \]
with the constant $C$ from Proposition 6. Due to Proposition 6 and $\kappa_n/\sigma_{j_n} \leq c_2\xi_n$, we then have
\[ \mathbb{E}_{f_0,\vartheta_0}[\Psi_n] \leq P_{f_0,\vartheta_0}\Big(\|\hat f_{j_n} - f_0\| \geq C\sigma_{j_n}^{-1}\sqrt{C_2+4}\,\kappa_n(\|f_0\|\vee 1) + \|f_0 - P_{j_n}f_0\|\Big) \leq e^{-(C_2+4)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}, \]
converging to 0.

On the alternative we set $D = D_1(1+R)$ for $D_1 = 2\max\big(C_1 + D_0,\ Cc_2\sqrt{C_2+4}\big)$. For any $\vartheta \in \Theta$ and any $f \in \mathcal{F}_n$ with $\|f - f_0\| \geq D_1(1+R)\xi_n$ we have $(2 - D_1\xi_n)\|f - f_0\| \geq D_1(1+R)\xi_n$ for sufficiently small $\xi_n \downarrow 0$. Therefore,
\[ 2\|f - f_0\| \geq D_1\big(1 + R + \|f - f_0\|\big)\xi_n \geq D_1\big(\|f_0\| + \|f - f_0\|\big)\xi_n \vee D_1\xi_n \geq D_1(\|f\|\vee 1)\xi_n \geq 2C\sigma_{j_n}^{-1}\sqrt{C_2+4}\,\kappa_n(\|f\|\vee 1) + 2(C_1 + D_0)\xi_n, \quad (7.4) \]
where the last inequality holds by the choice of $D_1$. We obtain
\[ \mathbb{E}_{f,\vartheta}[1-\Psi_n] = P_{f,\vartheta}\big(\|\hat f_{j_n} - f_0\| < D_0\xi_n\big) \leq P_{f,\vartheta}\big(\|\hat f_{j_n} - f\| > \|f - f_0\| - D_0\xi_n\big) \leq P_{f,\vartheta}\big(\|\hat f_{j_n} - f\| > C\sigma_{j_n}^{-1}\sqrt{C_2+4}\,\kappa_n(\|f\|\vee 1) + C_1\xi_n\big). \]
Proposition 6 yields again $\mathbb{E}_{f,\vartheta}[1-\Psi_n] \leq e^{-(C_2+4)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}$.

Step 2:
Step 2:
Since $E_{f_0,\vartheta_0}[\Psi_n]\to0$, it suffices to prove that
\[ \Pi_n\big(f\in V_{j_n}:\|f-f_0\|>D\xi_n\mid Y,T\big)(1-\Psi_n)=\frac{\int_{f\in V_{j_n}:\|f-f_0\|>D\xi_n,\,\vartheta\in\Theta}p_{f,\vartheta}(Z,W)\,\mathrm d\Pi_n(f,\vartheta)\,(1-\Psi_n)}{\int_{f\in\mathcal F,\,\vartheta\in\Theta}p_{f,\vartheta}(Z,W)\,\mathrm d\Pi_n(f,\vartheta)}\to0 \]
in $P_{f_0,\vartheta_0}$-probability. Due to Assumption 1, we have $K_\vartheta P_{j_n}=P_{j_n+m}K_\vartheta P_{j_n}=K_{\vartheta,j_n+m}P_{j_n}$. Hence, restricted on $f\in V_{j_n}$, we obtain
\begin{align*}
p_{f,\vartheta}(z,w) &= \exp\Big(\frac1\varepsilon\big\langle K_{\vartheta,j_n+m}f-K_{\vartheta_0}f_0,z\big\rangle-\frac1{2\varepsilon^2}\big\|K_{\vartheta,j_n+m}f-K_{\vartheta_0}f_0\big\|^2+\frac1\delta\langle\vartheta-\vartheta_0,w\rangle-\frac1{2\delta^2}\|\vartheta-\vartheta_0\|^2\Big)\\
&= \exp\Big(\frac1\varepsilon\big\langle K_{\vartheta,j_n+m}f-K_{\vartheta_0}f_0,z\big\rangle-\frac1{2\varepsilon^2}\big\|K_{\vartheta,j_n+m}f-P_{j_n+m}K_{\vartheta_0}f_0\big\|^2-\frac1{2\varepsilon^2}\big\|(\mathrm{Id}-P_{j_n+m})K_{\vartheta_0}f_0\big\|^2\\
&\qquad\qquad+\frac1\delta\langle\vartheta-\vartheta_0,w\rangle-\frac1{2\delta^2}\|\vartheta-\vartheta_0\|^2\Big).
\end{align*}
Since we assume that $K_{\vartheta,j_n+m}$ depends only on $P_{j_n+m}\vartheta=(\vartheta_1,\dots,\vartheta_{l_{j_n+m}})$ and $\Pi_\vartheta$ is a product prior in $(\vartheta_k)$, we may rewrite
\[ \Pi_n\big(f\in V_{j_n}:\|f-f_0\|>D\xi_n\mid Y,T\big)(1-\Psi_n)\le\frac{\int_{f\in\mathcal F\cap V_{j_n}:\|f-f_0\|>D\xi_n,\,\vartheta\in\Theta}p^{(j_n)}_{f,\vartheta}(Z,W)\,\mathrm d\Pi_n(f,\vartheta)\,(1-\Psi_n)}{\int_{f\in\mathcal F\cap V_{j_n},\,\vartheta\in\Theta}p^{(j_n)}_{f,\vartheta}(Z,W)\,\mathrm d\Pi_n(f,\vartheta)} \tag{7.5} \]
with
\[ p^{(j_n)}_{f,\vartheta}(z,w)=\exp\Big(\frac1\varepsilon\big\langle P_{j_n+m}(K_\vartheta f-K_{\vartheta_0}f_0),z\big\rangle-\frac1{2\varepsilon^2}\big\|P_{j_n+m}(K_\vartheta f-K_{\vartheta_0}f_0)\big\|^2+\frac1\delta\big\langle P_{j_n+m}(\vartheta-\vartheta_0),w\big\rangle-\frac1{2\delta^2}\big\|P_{j_n+m}(\vartheta-\vartheta_0)\big\|^2\Big). \]
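Since the reduced likelihood $p^{(j_n)}_{f,\vartheta}$ is a finite-dimensional Gaussian density ratio, it can be evaluated directly from coefficient vectors. The following sketch (the helper name is hypothetical; all arguments are assumed to be already projected onto the relevant coordinates) mirrors the displayed formula:

```python
import numpy as np

def log_p_jn(K_theta_f, K_theta0_f0, theta, theta0, z, w, eps, delta):
    """Log of the projected likelihood ratio p^(j_n)_{f,theta}(z, w).

    K_theta_f, K_theta0_f0 and z are coefficient vectors after projection
    onto V_{j_n+m}; theta, theta0 and w collect the first l_{j_n+m}
    coordinates; eps and delta are the two noise levels.
    """
    dK = K_theta_f - K_theta0_f0    # P_{j_n+m}(K_theta f - K_theta0 f_0)
    dth = theta - theta0            # P_{j_n+m}(theta - theta_0)
    return (dK @ z / eps - dK @ dK / (2 * eps**2)
            + dth @ w / delta - dth @ dth / (2 * delta**2))

# toy usage with hypothetical coefficient vectors
rng = np.random.default_rng(1)
v = rng.standard_normal(8)
print(log_p_jn(v, np.zeros(8), v[:3], np.zeros(3),
               rng.standard_normal(8), rng.standard_normal(3),
               eps=0.1, delta=0.1))
```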
We can proceed as in the proofs of Theorems 7.3.1 and 7.3.5, respectively, in [14]. First we need a lower bound for the denominator in (7.5). Defining the event
\[ B_n:=\Big\{(f,\vartheta)\in V_{j_n}\times\Theta:\frac{\|P_{j_n+m}(K_\vartheta f-K_{\vartheta_0}f_0)\|^2}{\varepsilon_n^2}+\frac{\|P_{j_n+m}(\vartheta-\vartheta_0)\|^2}{\delta_n^2}\le\frac{\kappa_n^2}{(\varepsilon_n\vee\delta_n)^2}\Big\}, \]
we obtain
\[ P_{f_0,\vartheta_0}\Big(\int_{f\in\mathcal F\cap V_{j_n},\vartheta\in\Theta}p^{(j_n)}_{f,\vartheta}(Z,W)\,\mathrm d\Pi_n(f,\vartheta)\ge e^{-(C_1+2)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}\Big)\ge P_{f_0,\vartheta_0}\Big(\int_{B_n}p^{(j_n)}_{f,\vartheta}(Z,W)\,\frac{\mathrm d\Pi_n(f,\vartheta)}{\Pi_n(B_n)}\ge e^{-2\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}\Big)\ge1-\frac{(\varepsilon_n\vee\delta_n)^2}{\kappa_n^2}, \]
where the first inequality is due to the small ball probability (3.3) and the second inequality follows along the lines of Lemma 7.3.4 in [14]. Using this bound for the denominator together with Markov's inequality and Fubini's theorem, the probability that (7.5) is larger than some $r>0$ is bounded by
\begin{align*}
&P_{f_0,\vartheta_0}\Big(\Pi_n\big(f\in V_{j_n}:\|f-f_0\|>D\xi_n\mid Y,T\big)(1-\Psi_n)\ge r\Big)\\
&\quad\le P_{f_0,\vartheta_0}\Big(e^{(C_1+2)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}(1-\Psi_n)\int_{f,\vartheta:\|f-f_0\|>D\xi_n}p^{(j_n)}_{f,\vartheta}(Z,W)\,\mathrm d\Pi_n(f,\vartheta)\ge r\Big)+\frac{(\varepsilon_n\vee\delta_n)^2}{\kappa_n^2}\\
&\quad\le\frac{e^{(C_1+2)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}}{r}\,E_{f_0,\vartheta_0}\Big[(1-\Psi_n)\int_{f,\vartheta:\|f-f_0\|>D\xi_n}p^{(j_n)}_{f,\vartheta}(Z,W)\,\mathrm d\Pi_n(f,\vartheta)\Big]+\frac{(\varepsilon_n\vee\delta_n)^2}{\kappa_n^2}\\
&\quad\le\frac{e^{(C_1+2)\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}}{r}\int_{f,\vartheta:\|f-f_0\|>D\xi_n}E_{f_0,\vartheta_0}\big[(1-\Psi_n)\,p^{(j_n)}_{f,\vartheta}(Z,W)\big]\,\mathrm d\Pi_n(f,\vartheta)+\frac{(\varepsilon_n\vee\delta_n)^2}{\kappa_n^2}.
\end{align*}
Note that $p^{(j_n)}_{f,\vartheta}$ corresponds to the density of the law of $(Y',T')$, where
\[ Y'=P_{j_n+m}K_\vartheta f+(\mathrm{Id}-P_{j_n+m})K_{\vartheta_0}f_0+\varepsilon Z\quad\text{and}\quad T'=P_{j_n+m}\vartheta+(\mathrm{Id}-P_{j_n+m})\vartheta_0+\delta W, \]
with respect to $P^Y_{\vartheta_0,f_0}\otimes P^T_{\vartheta_0}$, and we have $\Psi_n(Y,T)=\Psi_n(Y',T')$ by construction. Therefore, we can apply Step 1 to bound the previous display and conclude
\[ P_{f_0,\vartheta_0}\Big(\Pi_n\big(f\in V_{j_n}:\|f-f_0\|>D\xi_n\mid Y,T\big)\ge r\Big)\lesssim\frac1re^{-2\kappa_n^2/(\varepsilon_n\vee\delta_n)^2}+\frac{(\varepsilon_n\vee\delta_n)^2}{\kappa_n^2}+E_{f_0,\vartheta_0}[\Psi_n]. \tag{7.6} \]
It remains to note that for any $r>0$ the right-hand side converges to zero as $n\to\infty$.

For the sake of brevity we omit the subscript $n$ in the proof. $c_1,c_2,\dots$ will denote positive, universal constants. We will choose $\kappa$, $\xi$ and $j=J$ according to
\[ \xi\simeq\Big((\varepsilon\vee\delta)^2\log\frac1{\varepsilon\vee\delta}\Big)^{s/(2s+2t+d)},\qquad\kappa\simeq\Big((\varepsilon\vee\delta)^2\log\frac1{\varepsilon\vee\delta}\Big)^{(s+t)/(2s+2t+d)},\qquad2^j=\kappa^{-1/(s+t)}. \tag{7.7} \]
It is not difficult to see that these choices satisfy the requirements of Theorem 5, and $\|f_0-P_jf_0\|\lesssim\xi$ holds by (4.2). Moreover, the support of $\Pi_f$ lies in $V_j$ such that (3.2) is trivially satisfied for $\mathcal F_n=\{f:\|f-P_jf\|\le C_2\xi\}$.
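For concreteness, the calibration (7.7) can be evaluated numerically. The following sketch (function name hypothetical, constants omitted) computes $\xi$, $\kappa$ and the dyadic cut-off $2^j$ for given noise levels and smoothness parameters:

```python
import numpy as np

def tuning_mildly_ill_posed(eps, delta, s, t, d):
    """Choices (7.7) for the mildly ill-posed case, up to constants:
    returns (xi, kappa, 2^j)."""
    nu = max(eps, delta)                  # eps v delta
    base = nu**2 * np.log(1 / nu)         # (eps v delta)^2 * log(1/(eps v delta))
    xi = base**(s / (2 * s + 2 * t + d))  # contraction rate
    kappa = base**((s + t) / (2 * s + 2 * t + d))
    two_j = kappa**(-1 / (s + t))         # dyadic cut-off 2^j
    return xi, kappa, two_j

# e.g. smoothness s=2, degree of ill-posedness t=1, dimension d=1:
print(tuning_mildly_ill_posed(1e-3, 1e-3, s=2.0, t=1.0, d=1.0))
```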
It only remains to verify the small ball probability (3.3). Owing to $P_{j+m}K_\vartheta=P_{j+m}K_\vartheta P_{j+2m}=P_{j+m}K_{\vartheta,j+2m}$, (4.2) and $\|K_\vartheta f\|\lesssim\|f\|_{H^{-t}}$, we can estimate for any $f\in V_j$
\begin{align*}
\|P_{j+m}(K_\vartheta f-K_{\vartheta_0}f_0)\| &\le\|P_{j+m}(K_\vartheta-K_{\vartheta_0})f\|+\|P_{j+m}K_{\vartheta_0,j+2m}(f-f_0)\|\\
&\le\|(K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m})f\|+\|K_{\vartheta_0}P_{j+2m}(f-f_0)\|\\
&\lesssim R\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|_{V_{j+2m}\to V_{j+2m}}+\|f-P_{j+2m}f_0\|_{H^{-t}}\\
&\le R\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|_{V_{j+2m}\to V_{j+2m}}+\|f-P_jf_0\|_{H^{-t}}+\|P_{j+2m}(\mathrm{Id}-P_j)f_0\|_{H^{-t}},
\end{align*}
with $\|(\mathrm{Id}-P_j)f_0\|_{H^{-t}}\lesssim2^{-j(s+t)}\|f_0\|_{H^s}$ being of the order $O(\kappa\varepsilon/(\varepsilon\vee\delta))$ due to $\varepsilon\gtrsim\delta$ and the choice of $j$ as in (7.7). We obtain
\begin{align*}
&\Pi\Big(f\in\mathcal F\cap V_j,\vartheta\in\Theta:\varepsilon^{-2}\|P_{j+m}(K_\vartheta f-K_{\vartheta_0}f_0)\|^2+\delta^{-2}\|P_{j+m}(\vartheta-\vartheta_0)\|^2\le\kappa^2/(\varepsilon\vee\delta)^2\Big)\\
&\quad\ge\Pi\Big(f\in\mathcal F\cap V_j,\vartheta\in\Theta:\frac1\varepsilon\|f-P_jf_0\|_{H^{-t}}+\frac R\varepsilon\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|_{V_{j+2m}\to V_{j+2m}}+\frac1\delta\|P_{j+m}(\vartheta-\vartheta_0)\|\le\frac{c_1\kappa}{\varepsilon\vee\delta}\Big)\\
&\quad\ge\Pi_f\Big(f\in\mathcal F\cap V_j:\|f-P_jf_0\|_{H^{-t}}\le\frac{c_2\varepsilon\kappa}{\varepsilon\vee\delta}\Big)\times\Pi_\vartheta\Big(\frac R\varepsilon\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|_{V_{j+2m}\to V_{j+2m}}+\frac1\delta\|P_{j+m}(\vartheta-\vartheta_0)\|\le\frac{c_2\kappa}{\varepsilon\vee\delta}\Big), \tag{7.8}
\end{align*}
where the last line follows from the independence of $f$ and $\vartheta$ under $\Pi$. The first term can be bounded using the product structure and the estimate (4.4). Setting $\tilde\kappa=\frac{\varepsilon\kappa}{\varepsilon\vee\delta}$ and taking $\log\tau_j^{-1}\lesssim j$ into account, we obtain
\begin{align*}
\Pi_f\Big(f\in\mathcal F\cap V_j:\|f-P_jf_0\|_{H^{-t}}\le\frac{c_2\varepsilon\kappa}{\varepsilon\vee\delta}\Big) &=\Pi_f\Big(f\in\mathcal F\cap V_j:\sum_{|k|\le j}2^{-2t|k|}(f_k-f_{0,k})^2\le c_2^2\tilde\kappa^2\Big)\\
&\ge\prod_{|k|\le j}\Pi_f\big(|f_k-f_{0,k}|\le c_3\tilde\kappa2^{(t-d/2)|k|}\big)\\
&\ge\exp\Big(c_42^{jd}\log(2\Gamma)+c_5\sum_{|k|\le j}\big((2t-d)|k|-\log\tau_{|k|}+\log\tilde\kappa\big)-\gamma\sum_{|k|\le j}\tau_{|k|}^{-1}\big(|f_{0,k}|+c_3\tilde\kappa2^{(t-d/2)|k|}\big)\Big)\\
&\ge\exp\Big(c_62^{jd}(\log\tilde\kappa-j)-\gamma\max_{l\le j}\big(2^{-sl}\tau_l^{-1}\big)\|f_0\|_{H^s}-c_7\tilde\kappa\tau_j^{-1}2^{tj}\Big).
\end{align*}
Since $\kappa\simeq2^{-j(s+t)}$, we have $\log\tilde\kappa^{-1}\lesssim j+\log\frac{\varepsilon\vee\delta}\varepsilon\lesssim j$. From the assumptions on $\tau_j$ we thus deduce
\[ \Pi_f\Big(f\in\mathcal F\cap V_j:\|f-P_jf_0\|_{H^{-t}}\le\frac{c_2\varepsilon\kappa}{\varepsilon\vee\delta}\Big)\ge e^{c_62^{jd}(\log\tilde\kappa-j)}\ge e^{-c_8j2^{jd}}. \tag{7.9} \]
By the Lipschitz continuity $\|K_{\vartheta,j}-K_{\vartheta_0,j}\|_{V_j\to V_j}\lesssim\|P_j(\vartheta-\vartheta_0)\|$, the second term in (7.8) is bounded from below by
\[ \Pi_\vartheta\Big(\Big(\frac1\varepsilon+\frac1\delta\Big)\|P_{j+2m}(\vartheta-\vartheta_0)\|\le\frac{c_9\kappa}{\varepsilon\vee\delta}\Big)\ge\Pi_\vartheta\Big(\|P_{j+2m}(\vartheta-\vartheta_0)\|\le\frac{c_{10}(\varepsilon\wedge\delta)\kappa}{\varepsilon\vee\delta}\Big). \]
Due to Assumption 9 and using again (4.4), we can estimate for $\bar\kappa=\frac{(\varepsilon\wedge\delta)\kappa}{\varepsilon\vee\delta}$:
\begin{align*}
\Pi_\vartheta\big(\|P_{j+2m}(\vartheta-\vartheta_0)\|\le c_1\bar\kappa\big) &\ge\prod_{k\le l_{j+2m}}\Pi_\vartheta\big(|\vartheta_k-\vartheta_{0,k}|\le c_1\bar\kappa/\sqrt{l_{j+2m}}\big)\\
&\ge\exp\Big(c_2l_{j+2m}\log(2\Gamma)+c_3l_{j+2m}\log\bar\kappa-c_4l_{j+2m}\log l_{j+2m}-\gamma\sum_{k\le l_{j+2m}}\big(|\vartheta_{0,k}|+c_1\bar\kappa/\sqrt{l_{j+2m}}\big)\Big)\\
&\ge\exp\big(c_3l_{j+2m}\log\bar\kappa-c_4l_{j+2m}\log l_{j+2m}-c_5\|\vartheta_0\|-c_6\bar\kappa\big)\\
&\ge\exp\Big(-c_7l_{j+2m}\big(\log(\bar\kappa^{-1})+\log l_{j+2m}\big)\Big),
\end{align*}
where we have used in the last step that $\bar\kappa\le\kappa\to0$. Because $\varepsilon^\eta\lesssim\delta$ implies $\log\bar\kappa^{-1}\lesssim j+\log\frac{\varepsilon\vee\delta}{\varepsilon\wedge\delta}\lesssim j$, we find in combination with $\log l_{j+2m}\lesssim j$ that
\[ \Pi_\vartheta\big(\|P_{j+2m}(\vartheta-\vartheta_0)\|\le c_1\bar\kappa\big)\ge e^{-c_8jl_{j+2m}}\gtrsim e^{-c_9j2^{jd}}. \tag{7.10} \]
Therefore, (3.3) follows from $j2^{jd}\lesssim\kappa^2(\varepsilon\vee\delta)^{-2}$, which is satisfied due to (7.7), in combination with (7.8), (7.9) and (7.10).

The proof is similar to the previous one. The choices of $\kappa$, $\xi$ and $j$ given by
\[ \xi\simeq\Big(\log\frac1{\varepsilon\vee\delta}\Big)^{-s/t},\qquad\kappa\simeq(\varepsilon\vee\delta)^{1/2},\qquad2^j=\Big(-\frac1{r_0}\log\Big(\frac{\kappa\varepsilon}{\varepsilon\vee\delta}\Big)\Big)^{1/t} \tag{7.11} \]
satisfy the conditions of Theorem 5. Especially, we have $\|f_0-P_jf_0\|\lesssim2^{-js}\|f_0\|_{H^s}\lesssim\xi$ and $2^J=2^j(1+o(1))$ because of $\log\varepsilon\simeq\log\delta$. Since $\|K_\vartheta f\|^2\lesssim\sum_ke^{-2r_02^{|k|t}}f_k^2$, we estimate for any $f\in\mathcal F\cap V_j$
\begin{align*}
\|P_{j+m}(K_\vartheta f-K_{\vartheta_0}f_0)\|^2 &\le2\|(K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m})f\|^2+2\|K_{\vartheta_0}P_{j+2m}(f-f_0)\|^2\\
&\lesssim\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|^2_{V_{j+2m}\to V_{j+2m}}+\sum_ke^{-2r_02^{|k|t}}\langle f-P_{j+2m}f_0,\varphi_k\rangle^2\\
&\le\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|^2_{V_{j+2m}\to V_{j+2m}}+4\sum_{|k|\le j}e^{-2r_02^{|k|t}}|f_k-f_{0,k}|^2+4\sum_{|k|>j}e^{-2r_02^{|k|t}}f_{0,k}^2. \tag{7.12}
\end{align*}
Using
\[ \sum_{|k|>j}e^{-2r_02^{|k|t}}f_{0,k}^2\le2^{-2js}e^{-2r_02^{jt}}\|f_0\|^2_{H^s}\lesssim e^{-2r_02^{jt}}, \]
together with the choice of $j$ from (7.11), the last term in (7.12) is $O(\kappa^2\varepsilon^2/(\varepsilon\vee\delta)^2)$. Analogously to (7.8) we obtain for some $c_1>0$
\begin{align*}
&\Pi\Big(f\in\mathcal F\cap V_j,\vartheta\in\Theta:\frac1{\varepsilon^2}\|P_{j+m}(K_\vartheta f-K_{\vartheta_0}f_0)\|^2+\frac1{\delta^2}\|P_{j+m}(\vartheta-\vartheta_0)\|^2\le\frac{\kappa^2}{(\varepsilon\vee\delta)^2}\Big)\\
&\quad\ge\Pi_f\Big(f\in\mathcal F\cap V_j:\sum_{|k|\le j}e^{-2r_02^{|k|t}}|f_k-f_{0,k}|^2\le\frac{c_1\varepsilon^2\kappa^2}{(\varepsilon\vee\delta)^2}\Big)\times\Pi_\vartheta\Big(\frac1\varepsilon\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|_{V_{j+2m}\to V_{j+2m}}+\frac1\delta\|P_{j+m}(\vartheta-\vartheta_0)\|\le\frac{c_1\kappa}{\varepsilon\vee\delta}\Big). \tag{7.13}
\end{align*}
The second factor is the same as in the proof of Theorem 10. Taking into account that $\log\delta\simeq\log\varepsilon$ and (7.11) imply
\[ -\log\Big(\frac{(\varepsilon\wedge\delta)\kappa}{\varepsilon\vee\delta}\Big)=-\log\kappa-\log\Big(\frac{\varepsilon\wedge\delta}{\varepsilon\vee\delta}\Big)\simeq2^{jt}, \]
we find
\[ \Pi_\vartheta\Big(\frac1\varepsilon\|K_{\vartheta,j+2m}-K_{\vartheta_0,j+2m}\|_{V_{j+2m}\to V_{j+2m}}+\frac1\delta\|P_{j+m}(\vartheta-\vartheta_0)\|\le\frac{c_1\kappa}{\varepsilon\vee\delta}\Big)\ge\Pi_\vartheta\Big(\|P_{j+2m}(\vartheta-\vartheta_0)\|\le\frac{c_2(\varepsilon\wedge\delta)\kappa}{\varepsilon\vee\delta}\Big)\ge e^{-c_32^{jt}l_{j+2m}}. \tag{7.14} \]
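The coordinate-wise small ball calculus behind (7.9) and (7.10) is easy to reproduce numerically. The sketch below lower-bounds the prior mass of the ball $\{\|P(\vartheta-\vartheta_0)\|\le c\bar\kappa\}$ by a product of one-dimensional probabilities, assuming, purely for illustration, i.i.d. standard Gaussian coordinates (the prior conditions of Assumption 9 and (4.4) in the paper are more general):

```python
import numpy as np
from scipy.stats import norm

def log_small_ball_lower_bound(theta0, kappa_bar, c=1.0):
    """Log of the product lower bound for the prior small-ball probability
    prod_k P(|theta_k - theta_{0,k}| <= c*kappa_bar/sqrt(l)) over k <= l,
    assuming (for illustration only) i.i.d. standard Gaussian coordinates."""
    l = len(theta0)
    r = c * kappa_bar / np.sqrt(l)              # per-coordinate radius
    # P(|N(0,1) - theta_{0,k}| <= r) for each coordinate k
    probs = norm.cdf(theta0 + r) - norm.cdf(theta0 - r)
    return np.sum(np.log(probs))

theta0 = 1.0 / (1 + np.arange(20.0))            # hypothetical true coefficients
print(log_small_ball_lower_bound(theta0, kappa_bar=0.1))
```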
Setting $\tilde\kappa=\frac{\varepsilon\kappa}{\varepsilon\vee\delta}$ and applying (4.4), we obtain for the first term
\begin{align*}
\Pi_f\Big(f\in\mathcal F\cap V_j:\sum_{|k|\le j}e^{-2r_02^{|k|t}}\langle f-f_0,\varphi_k\rangle^2\le\frac{c_1\varepsilon^2\kappa^2}{(\varepsilon\vee\delta)^2}\Big) &\ge\prod_{|k|\le j}\Pi_f\big(|f_k-f_{0,k}|\le c_4\tilde\kappa2^{-d|k|/2}e^{r_02^{|k|t}}\big)\\
&\ge\exp\Big(c_52^{jd}\log(2\Gamma)+c_6\sum_{|k|\le j}\big(r_02^{|k|t}-d|k|-\log\tau_{|k|}+\log\tilde\kappa\big)-\gamma\sum_{|k|\le j}\tau_{|k|}^{-1}\big(|f_{0,k}|+c_4\tilde\kappa2^{-dj/2}e^{r_02^{|k|t}}\big)\Big)\\
&\ge\exp\Big(c_72^{jd}(\log\tilde\kappa-2^{jt})-c_8\max_{l\le j}\big(2^{-sl}\tau_l^{-1}\big)\|f_0\|_{H^s}-c_9\tilde\kappa\tau_j^{-1}e^{r_02^{jt}}\Big).
\end{align*}
From the assumptions on $\tau_j$ and $-\log\tilde\kappa\lesssim2^{jt}$ we thus deduce
\[ \Pi_f\Big(f\in\mathcal F\cap V_j:\sum_{|k|\le j}e^{-2r_02^{|k|t}}\langle f-f_0,\varphi_k\rangle^2\le\frac{c_1\varepsilon^2\kappa^2}{(\varepsilon\vee\delta)^2}\Big)\ge e^{c_72^{jd}(\log\tilde\kappa-2^{jt})}\ge e^{-c_{10}2^{j(d+t)}}. \tag{7.15} \]
Combining (7.14) and (7.15) yields
\[ \Pi\Big(f\in\mathcal F\cap V_j,\vartheta\in\Theta:\frac1{\varepsilon^2}\|P_{j+m}(K_\vartheta f-K_{\vartheta_0}f_0)\|^2+\frac1{\delta^2}\|P_{j+m}(\vartheta-\vartheta_0)\|^2\le\frac{\kappa^2}{(\varepsilon\vee\delta)^2}\Big)\ge e^{-c_{11}2^{jt}(2^{jd}+l_{j+2m})}. \]
Therefore, (3.3) follows from $2^{j(2d+t)}\simeq\big(\log(\kappa^{-1})\big)^{(2d+t)/t}\lesssim\kappa^2(\varepsilon\vee\delta)^{-2}$, which holds by the choice of $\kappa$ from (7.11).

Let us introduce the oracle which balances the bias and the variance term:
\[ J_o:=\min\Big\{j\le J_\varepsilon:R2^{-js}\le C_1R\sqrt{\log(1/\varepsilon)}\,\varepsilon2^{j(t+d/2)}\Big\}, \]
where $C_1$ is the constant from Proposition 6 and $R$ is the radius of the Hölder ball. As $\varepsilon\to0$ we see that
\[ 2^{J_o}\simeq\big(\varepsilon^2\log(\varepsilon^{-1})\big)^{-1/(2s+2t+d)}, \]
which coincides with the choice of $2^j$ in the proof of Theorem 10. The rest of the proof is divided into three steps.
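As a hedged numerical sketch of the oracle, the function below (name and default constants hypothetical) scans the dyadic levels and returns the first $j$ at which the bias $R2^{-js}$ is dominated by the variance proxy $C_1R\sqrt{\log(1/\varepsilon)}\,\varepsilon2^{j(t+d/2)}$:

```python
import numpy as np

def oracle_level(eps, s, t, d, R=1.0, C1=1.0, J_max=30):
    """Smallest dyadic level j with
    R * 2^(-j*s) <= C1 * R * sqrt(log(1/eps)) * eps * 2^(j*(t+d/2)),
    i.e. the bias-variance balancing oracle J_o (constants illustrative)."""
    for j in range(J_max + 1):
        bias = R * 2.0**(-j * s)
        stdev = C1 * R * np.sqrt(np.log(1 / eps)) * eps * 2.0**(j * (t + d / 2))
        if bias <= stdev:
            return j
    return J_max

for eps in [1e-2, 1e-3, 1e-4]:
    print(eps, oracle_level(eps, s=2.0, t=1.0, d=1.0))
```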
Step 1:
We will prove that $\hat J\le J_o$ with probability approaching one. We have for sufficiently small $\varepsilon$
\begin{align*}
P_{f_0,\vartheta_0}(\hat J>J_o) &= P_{f_0,\vartheta_0}\big(\exists\,i>j\ge J_o:\|\hat f_i-\hat f_j\|>\Delta\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{i(t+d/2)}\big)\\
&\le\sum_{i>j\ge J_o}P_{f_0,\vartheta_0}\big(\|\hat f_i-\hat f_j\|>\Delta\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{i(t+d/2)}\big)\\
&\le\sum_{i>j\ge J_o}P_{f_0,\vartheta_0}\big(\|\hat f_i-f_0\|+\|\hat f_j-f_0\|>\Delta\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{i(t+d/2)}\big)\\
&\le2J_\varepsilon\sum_{j\ge J_o}P_{f_0,\vartheta_0}\big(\|\hat f_j-f_0\|>2C_1R\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{j(t+d/2)}\big).
\end{align*}
By definition of $J_o$ we have for every $j\ge J_o$ and $f_0\in H^s(R)$ that
\[ \|f_0-P_jf_0\|\le R2^{-js}\le C_1R\sqrt{\log(1/\varepsilon)}\,\varepsilon2^{j(t+d/2)}. \]
Hence, for $\varepsilon$ sufficiently small we obtain
\[ P_{f_0,\vartheta_0}(\hat J>J_o)\le2J_\varepsilon\sum_{j\ge J_o}P_{f_0,\vartheta_0}\big(\|\hat f_j-f_0\|>C_1\|f_0\|\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{j(t+d/2)}+\|f_0-P_jf_0\|\big). \]
For $j\le J_\varepsilon$ we then have $\varepsilon2^{j(t+d/2)}\to0$ and the concentration inequality from Proposition 6 can be applied to $\hat f_j$ for any $\kappa\in(C_2^{-1}2^{jd/2}\varepsilon,\,C_22^{-jt}\varepsilon^{-1})$ for a certain constant $C_2>0$. We can choose $\kappa=2^{jd/2}\varepsilon\sqrt{\log\varepsilon^{-1}}$ to obtain
\[ P_{f_0,\vartheta_0}(\hat J>J_o)\le6J_\varepsilon^2e^{-2^{J_od}\log\varepsilon^{-1}}\le6J_\varepsilon^2\varepsilon\to0. \]
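Step 1 suggests that $\hat J$ is a Lepski-type selector based on pairwise comparisons of the estimators $\hat f_j$. A minimal sketch of such a rule, with illustrative constants (the precise definition of $\hat J$ is given earlier in the paper), reads:

```python
import numpy as np

def lepski_level(f_hat, eps, t, d, Delta=1.0):
    """Lepski-type selection suggested by Step 1 (constants illustrative):
    f_hat is a list of coefficient vectors (padded to a common length) for
    j = 0, ..., J_eps; choose the smallest j such that
    ||f_hat_i - f_hat_j|| <= Delta * eps * sqrt(log(1/eps)) * 2^(i*(t+d/2))
    holds for all i > j."""
    J = len(f_hat) - 1
    thr = lambda i: Delta * eps * np.sqrt(np.log(1 / eps)) * 2.0**(i * (t + d / 2))
    for j in range(J + 1):
        if all(np.linalg.norm(f_hat[i] - f_hat[j]) <= thr(i)
               for i in range(j + 1, J + 1)):
            return j
    return J

# toy usage with hypothetical coefficient vectors padded to length 2^5
rng = np.random.default_rng(2)
f_hat = [np.pad(rng.standard_normal(2**j), (0, 2**5 - 2**j)) for j in range(6)]
print(lepski_level(f_hat, eps=1e-3, t=1.0, d=1.0))
```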
Step 2:
In order to prove the adaptive contraction rate, we replace the test $\Psi_n$ from the proof of Theorem 5 by
\[ \tilde\Psi:=\mathbf 1\big\{\|\hat f_{\hat J}-f_0\|\ge\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\big\}, \]
which requires verifying (7.3) for $\tilde\Psi$ and
\[ \kappa=\big(\varepsilon^2\log(1/\varepsilon)\big)^{(s+t)/(2s+2t+d)},\qquad\xi\simeq\big(\log\varepsilon^{-1}\big)\big(\varepsilon^2\log(\varepsilon^{-1})\big)^{s/(2s+2t+d)}. \tag{7.16} \]
Note that $\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\simeq\big(\varepsilon^2\log(\varepsilon^{-1})\big)^{s/(2s+2t+d)}$ by the choice of the oracle $J_o$. Thanks to Step 1 we have
\begin{align*}
E_{f_0,\vartheta_0}[\tilde\Psi] &= P_{f_0,\vartheta_0}\big(\|\hat f_{\hat J}-f_0\|\ge\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\big)\\
&\le P_{f_0,\vartheta_0}\big(\|\hat f_{\hat J}-f_0\|\ge\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)},\,\hat J\le J_o\big)+6J_\varepsilon^2\varepsilon\\
&\le P_{f_0,\vartheta_0}\big(\|\hat f_{J_o}-f_0\|\ge\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}-\|\hat f_{\hat J}-\hat f_{J_o}\|,\,\hat J\le J_o\big)+6J_\varepsilon^2\varepsilon.
\end{align*}
By construction of $\hat J$ we have $\|\hat f_{\hat J}-\hat f_{J_o}\|\le\tfrac12\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}$ on the event $\{\hat J\le J_o\}$. Therefore,
\begin{align*}
E_{f_0,\vartheta_0}[\tilde\Psi] &\le P_{f_0,\vartheta_0}\big(\|\hat f_{J_o}-f_0\|\ge\tfrac12\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\big)+6J_\varepsilon^2\varepsilon\\
&\le P_{f_0,\vartheta_0}\big(\|\hat f_{J_o}-f_0\|\ge C_1R\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\big)+6J_\varepsilon^2\varepsilon\le\varepsilon+6J_\varepsilon^2\varepsilon\to0,
\end{align*}
where the last bound follows from Proposition 6 exactly as in Step 1. For any $f\in\mathcal F_n$ with $\|f-f_0\|\ge C_3\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}$ for a sufficiently large constant $C_3$ and any $\vartheta\in\Theta$, we obtain on the alternative, with an argument as in (7.4),
\begin{align*}
E_{f,\vartheta}[1-\tilde\Psi] &= P_{f,\vartheta}\big(\|\hat f_{\hat J}-f_0\|\le\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\big)\\
&\le P_{f,\vartheta}\big(\|\hat f_{\hat J}-f\|\ge\|f-f_0\|-\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\big)\\
&\le P_{f,\vartheta}\big(\|\hat f_{\hat J}-f\|\ge C_4(1+\|f\|)\varepsilon\sqrt{\log\varepsilon^{-1}}\,2^{J_o(t+d/2)}\big)\le J_\varepsilon e^{-c_12^{J_od}\log\varepsilon^{-1}}
\end{align*}
for some $C_4,c_1>0$. Since $J_\varepsilon\simeq\log\varepsilon^{-1}$ and
\[ \big(\log\varepsilon^{-1}\big)2^{J_od}\simeq\big(\log\varepsilon^{-1}\big)\big(\varepsilon^2\log(1/\varepsilon)\big)^{-d/(2s+2t+d)}\simeq\kappa^2\varepsilon^{-2}, \]
we indeed have $E_{f,\vartheta}[1-\tilde\Psi]\le e^{-C'\kappa^2\varepsilon^{-2}}$ for some constant $C'>0$.
Step 3:
With the previous preparations we can now prove the adaptive contraction result. Given $\tilde\Psi$, we have for any $r>0$ and $\xi$ from (7.16)
\begin{align*}
P_{f_0,\vartheta_0}\Big(\Pi_n\big(f\in\mathcal F:\|f-f_0\|>M\xi\mid Y,T\big)>r\Big) &\le P_{f_0,\vartheta_0}\Big(\Pi_n\big(f\in\mathcal F\cap V_{\hat J}:\|f-f_0\|>M\xi\mid Y,T\big)(1-\tilde\Psi)>r,\,\hat J\le J_o\Big)+6J_\varepsilon^2\varepsilon\\
&\le\sum_{j\le J_o}P_{f_0,\vartheta_0}\Big(\Pi_n\big(f\in\mathcal F\cap V_j:\|f-f_0\|>M\xi\mid Y,T\big)(1-\tilde\Psi)>r,\,\hat J=j\Big)+6J_\varepsilon^2\varepsilon.
\end{align*}
We can now handle each term in the sum exactly as in the proof of Theorem 5. It suffices to note the following: First, on the event $\{\hat J=j\}$ the test $\tilde\Psi$ depends only on the projection of $Y$ onto $V_j$ and on the $(j+m)$-projection of $T$, respectively. Second, if the small ball probability condition (3.3) is satisfied for $J_o$, as verified in the proof of Theorem 10, then by monotonicity it is also satisfied for all $j\le J_o$. We thus conclude from (7.6)
\[ P_{f_0,\vartheta_0}\Big(\Pi_n\big(f\in\mathcal F:\|f-f_0\|>M\xi\mid Y,T\big)>r\Big)\lesssim\frac{J_o}re^{-\kappa^2\varepsilon^{-2}}+J_o\varepsilon^2\kappa^{-2}+6J_\varepsilon^2\varepsilon\to0. \]

Acknowledgement

This work is the result of three conferences which I attended in 2017. The first two, on Bayesian inverse problems in Cambridge and in Leiden, led to my interest in this problem. At the third conference in Luminy, in honour of Oleg Lepski's and Alexandre B. Tsybakov's 60th birthdays, I realised that Lepski's method can be used to construct an empirical Bayes procedure. I want to thank the organisers of these three conferences. The helpful comments by two anonymous referees are gratefully acknowledged.
References

[1] Agapiou, S., Larsson, S., and Stuart, A. M. (2013). Posterior contraction rates for the Bayesian approach to linear ill-posed inverse problems. Stochastic Process. Appl., 123(10):3828-3860.
[2] Agapiou, S., Stuart, A. M., and Zhang, Y.-X. (2014). Bayesian posterior contraction rates for linear severely ill-posed inverse problems. J. Inverse Ill-Posed Probl., 22(3):297-321.
[3] Bochkina, N. (2013). Consistency of the posterior distribution in generalized linear inverse problems. Inverse Problems, 29(9):095010, 43.
[4] Burger, M. and Scherzer, O. (2001). Regularization methods for blind deconvolution and blind source separation problems. Math. Control Signals Systems, 14(4):358-383.
[5] Cavalier, L. (2008). Nonparametric statistical inverse problems. Inverse Problems, 24(3):034004, 19.
[6] Cavalier, L. and Hengartner, N. W. (2005). Adaptive estimation for inverse problems with noisy operators. Inverse Problems, 21(4):1345-1361.
[7] Cohen, A., Hoffmann, M., and Reiß, M. (2004). Adaptive wavelet Galerkin methods for linear inverse problems. SIAM J. Numer. Anal., 42(4):1479-1501.
[8] Da Prato, G. (2006). An introduction to infinite-dimensional analysis. Universitext. Springer-Verlag, Berlin. Revised and extended from the 2001 original by Da Prato.
[9] Dashti, M., Law, K. J. H., Stuart, A. M., and Voss, J. (2013). MAP estimators and their consistency in Bayesian nonparametric inverse problems. Inverse Problems, 29(9):095017, 27.
[10] Dattner, I., Reiß, M., and Trabs, M. (2016). Adaptive quantile estimation in deconvolution with unknown error distribution. Bernoulli, 22(1):143-192.
[11] Efromovich, S. and Koltchinskii, V. (2001). On inverse problems with unknown operators. IEEE Trans. Inform. Theory, 47(7):2876-2894.
[12] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist., 28(2):500-531.
[13] Giné, E. and Nickl, R. (2011). Rates of contraction for posterior distributions in L^r-metrics, 1 ≤ r ≤ ∞. Ann. Statist., 39(6):2883-2911.
[14] Giné, E. and Nickl, R. (2016). Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York.
[15] Gugushvili, S., van der Vaart, A., and Yan, D. (2018). Bayesian linear inverse problems in regularity scales. arXiv preprint arXiv:1802.08992.
[16] Hoffmann, M. and Reiß, M. (2008). Nonlinear estimation for linear inverse problems with error in the operator. Ann. Statist., 36(1):310-336.
[17] Johannes, J. (2009). Deconvolution with unknown error distribution. Ann. Statist., 37(5A):2301-2323.
[18] Johannes, J., Van Bellegem, S., and Vanhems, A. (2011). Convergence rates for ill-posed inverse problems with an unknown operator. Econometric Theory, 27(3):522-545.
[19] Johnstone, I. M., Kerkyacharian, G., Picard, D., and Raimondo, M. (2004). Wavelet deconvolution in a periodic setting. J. R. Stat. Soc. Ser. B Stat. Methodol., 66(3):547-573.
[20] Justen, L. and Ramlau, R. (2006). A non-iterative regularization approach to blind deconvolution. Inverse Problems, 22(3):771-800.
[21] Knapik, B. T., Szabó, B. T., van der Vaart, A. W., and van Zanten, J. H. (2016). Bayes procedures for adaptive inference in inverse problems for the white noise model. Probab. Theory Related Fields, 164(3-4):771-813.
[22] Knapik, B. T., van der Vaart, A. W., and van Zanten, J. H. (2011). Bayesian inverse problems with Gaussian priors. Ann. Statist., 39(5):2626-2657.
[23] Knapik, B. T., van der Vaart, A. W., and van Zanten, J. H. (2013). Bayesian recovery of the initial condition for the heat equation. Comm. Statist. Theory Methods, 42(7):1294-1313.
[24] Lepski, O. V. (1990). A problem of adaptive estimation in Gaussian white noise. Teor. Veroyatnost. i Primenen., 35(3):459-470.
[25] Marteau, C. (2006). Regularization of inverse problems with unknown operator. Math. Methods Statist., 15(4):415-443 (2007).
[26] Neumann, M. H. (1997). On the effect of estimating the error density in nonparametric deconvolution. J. Nonparametr. Statist., 7(4):307-330.
[27] Nickl, R. (2017). Bernstein-von Mises theorems for statistical inverse problems I: Schrödinger equation. arXiv preprint arXiv:1707.01764.
[28] Nickl, R. and Söhl, J. (2017). Bernstein-von Mises theorems for statistical inverse problems II: compound Poisson processes. arXiv preprint arXiv:1709.07752.
[29] Nickl, R. and Söhl, J. (2017). Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions. Ann. Statist., 45(4):1664-1693.
[30] Ray, K. (2013). Bayesian inverse problems with non-conjugate priors. Electron. J. Stat., 7:2516-2549.
[31] Stuart, A. M. (2010). Inverse problems: a Bayesian perspective. Acta Numer., 19:451-559.
[32] Stück, R., Burger, M., and Hohage, T. (2012). The iteratively regularized Gauss-Newton method with convex constraints and applications in 4Pi microscopy. Inverse Problems, 28(1):015012, 16.
[33] Tao, T. (2012). Topics in random matrix theory, volume 132 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI.
[34] Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist., 22(4):1701-1762. With discussion and a rejoinder by the author.
[35] Vollmer, S. J. (2013). Posterior consistency for Bayesian inverse problems through stability and regression results.