On Bayesian estimation and proximity operators
R. GRIBONVAL AND M. NIKOLOVA
Abstract.
There are two major routes to address the ubiquitous family of inverse problems appearing in signal and image processing, such as denoising or deblurring. A first route relies on Bayesian modeling, where prior probabilities are used to embody models of both the distribution of the unknown variables and their statistical dependence with the observed data. The estimation process typically relies on the minimization of an expected loss (e.g. minimum mean squared error, or MMSE). The second route has received much attention in the context of sparse regularization and compressive sensing: it consists in designing (often convex) optimization problems involving the sum of a data fidelity term and a penalty term promoting certain types of unknowns (e.g., sparsity, promoted through an $\ell^1$ norm).

Well known relations between these two approaches have led to some widely spread misconceptions. In particular, while the so-called Maximum A Posteriori (MAP) estimate with a Gaussian noise model does lead to an optimization problem with a quadratic data-fidelity term, we disprove through explicit examples the common belief that the converse would be true. It has already been shown [7, 9] that for denoising in the presence of additive Gaussian noise, for any prior probability on the unknowns, MMSE estimation can be expressed as a penalized least squares problem, with the apparent characteristics of a MAP estimation problem with Gaussian noise and a (generally) different prior on the unknowns. In other words, the variational approach is rich enough to build all possible MMSE estimators associated to additive Gaussian noise via a well chosen penalty.

We generalize these results beyond Gaussian denoising and characterize noise models for which the same phenomenon occurs. In particular, we prove that with (a variant of) Poisson noise and any prior probability on the unknowns, MMSE estimation can again be expressed as the solution of a penalized least squares optimization problem. For additive scalar denoising the phenomenon holds if and only if the noise distribution is log-concave. In particular, Laplacian denoising can (perhaps surprisingly) be expressed as the solution of a penalized least squares problem. In the multivariate case, the same phenomenon occurs when the noise model belongs to a particular subset of the exponential family. For multivariate additive denoising, the phenomenon holds if and only if the noise is white and Gaussian.

This work and the companion paper [10] are dedicated to the memory of Mila Nikolova, who passed away prematurely in June 2018. Mila dedicated much of her energy to bring the technical content to completion during the spring of 2018. The first author did his best to finalize the papers as Mila would have wished. He should be held responsible for any possible imperfection in the final manuscript.

R. Gribonval, Univ Rennes, Inria, CNRS, IRISA, [email protected]; M. Nikolova, CMLA, CNRS and Ecole Normale Supérieure de Cachan, Université Paris-Saclay, 94235 Cachan, France.

1. Introduction and overview
Inverse problems in signal and image processing consist in estimating an unknown signal $x$ given an indirect observation $y$ that may have suffered from blurring, noise, saturation, etc. The two main routes to address such problems are variational approaches and Bayesian estimation.

Variational approaches: a signal estimate is the solution of an optimization problem
(1) $\hat{x} \in \arg\min_x D(x, y) + \varphi(x)$
where $D(x, y)$ is a data-fidelity measure, and $\varphi$ is a penalty promoting desirable properties of the estimate $\hat{x}$ such as, e.g., sparsity. A typical example is linear inverse problems, where one assumes $y = Lx + e$ with $L$ some known linear operator (e.g., a blurring operator) and $e$ some error / noise. The most common data-fidelity term is quadratic, which in combination with an $\ell^1$ sparsity enforcing penalty leads to the well known Basis Pursuit Denoising approach
(2) $\hat{x}_{\mathrm{BPDN}}(y) := \arg\min_x \tfrac12 \|y - Lx\|^2 + \lambda \|x\|_1.$

Bayesian estimation: $x$ is modeled as a realization of a random variable $X$ (with "prior" probability $p_X(x)$) and $y$ as the realization of a random variable $Y$ (with conditional probability distribution $p_{Y|X}(y|x)$). An estimator is designed as the function $y \mapsto \hat{x}(y)$ that minimizes in expectation some specified cost $C(x, \hat{x})$, i.e., that minimizes
(3) $\mathbb{E}_{X,Y}\, C(X, \hat{x}(Y))$
where the pair $(X, Y)$ is drawn according to the joint distribution $p_{X,Y}(x, y) = p_{Y|X}(y|x)\, p_X(x)$. A typical example is Minimum Mean Square Error (MMSE) estimation. The cost is the quadratic error $C(x, \hat{x}) = \|\hat{x} - x\|^2$, and the optimal estimator is the conditional expectation, also called Conditional Mean or Posterior Mean
(4) $y \mapsto \hat{x}_{\mathrm{MMSE}}(y) := \mathbb{E}(X \mid Y = y) = \int x\, p_{X|Y}(x|y)\, dx.$
By Bayes law $p_{X|Y}(x|y) = \frac{p_X(x)\, p_{Y|X}(y|x)}{p_Y(y)}$, with $p_Y(y)$ the marginal distribution of $Y$.

While MMSE estimation yields the expected value of the a posteriori probability distribution $p_{X|Y}(x|y)$ of $X$, Maximum A Posteriori (MAP) estimation selects its mode, i.e. the most probable $x$ with respect to this distribution,
$\hat{x}_{\mathrm{MAP}}(y) := \arg\min_x \{-\log p_{X|Y}(x|y)\} = \arg\min_x \{-\log p_{Y|X}(y|x) - \log p_X(x)\}.$
MAP is directly connected to variational approaches, hence its popularity. However, this is not usually considered as a proper Bayesian estimator although it can be seen as minimizing (3) with the "pseudo-cost" $C(x, \hat{x}) := \delta(x - \hat{x})$. We will soon come back to the Bayesian interpretation of MAP estimation and its pitfalls.

Many other costs can be used, e.g. $C(x, \hat{x}) = \|x - \hat{x}\|$ yields the conditional spatial median. Banerjee et al [3] show that if $C(x, \hat{x})$ (defined on $\mathbb{R}^n \times \mathbb{R}^n$) is the Bregman divergence $D_h(x, \hat{x})$ of a strictly convex proper differentiable function $h$ [5] (by definition $D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y\rangle$; this is usually not symmetric, i.e. $D_h(x, y) \neq D_h(y, x)$), then the conditional mean is the unique minimizer of (3) [3, Theorem 1]. Vice-versa, they prove under mild smoothness assumptions on $C(x, \hat{x})$ that if the conditional mean is the unique minimizer of (3) for any pair of random variables $X, Y$ then
(5) $C(x, \hat{x}) = D_h(x, \hat{x}), \quad \forall x, \hat{x}$
for some strictly convex differentiable function $h$.
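To make the distinction between the two estimation principles concrete, the following minimal Python sketch (our own illustration, not part of the original development) compares the MAP estimator and the MMSE estimator (4) on a scalar toy model; the Laplacian prior, Gaussian noise level and grid parameters are assumptions made for the demo.

```python
import numpy as np

# Toy scalar model: X ~ Laplacian(0, b), Y = X + N(0, sigma^2).
b, sigma = 1.0, 0.5

def log_prior(x):
    return -np.abs(x) / b

def log_likelihood(y, x):
    return -0.5 * (y - x) ** 2 / sigma ** 2

def map_estimate(y, grid):
    # MAP: maximize the (unnormalized) log-posterior over a grid.
    log_post = log_likelihood(y, grid) + log_prior(grid)
    return grid[np.argmax(log_post)]

def mmse_estimate(y, grid):
    # MMSE: posterior mean, computed by numerical integration on the grid.
    w = np.exp(log_likelihood(y, grid) + log_prior(grid))
    return np.sum(grid * w) / np.sum(w)

grid = np.linspace(-10, 10, 20001)
for y in [0.3, 1.0, 3.0]:
    print(y, map_estimate(y, grid), mmse_estimate(y, grid))
```

For small observations the MAP estimate is exactly zero (soft-thresholding) while the MMSE estimate is a smooth shrinkage of $y$, which is the gap discussed next.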
1.1. The MAP vs MMSE quid pro quo. A common quid pro quo between proponents of the two approaches revolves around the MAP interpretation of variational approaches [12]. In the particular case of a linear inverse problem with white Gaussian noise $p_{Y|X}(y|x) \propto \exp\left(-\frac{\|y-Lx\|^2}{2\sigma^2}\right)$, denoting $\varphi_X(x) := -\sigma^2 \log p_X(x)$, MAP estimation reads
(6) $\hat{x}_{\mathrm{MAP}} = \arg\min_x \frac{1}{2\sigma^2}\|y - Lx\|^2 - \log p_X(x) = \arg\min_x \frac12\|y - Lx\|^2 + \varphi_X(x).$
Thus, if one assumes a Gaussian noise model and if one chooses MAP as an estimation principle then this results in a variational problem shaped as (1) with a quadratic data-fidelity term and a penalty which is a scaled version of the negative log-prior.

It has been shown [17] that, except in very special circumstances (Gaussian prior and Gaussian noise), "the MAP approach is not relevant in the applications where the data-observation and the prior models are accurate". Thus, (6) leads to a suboptimal estimator when the considered data is indeed distributed as $p_X(x) \propto \exp(-\varphi_X(x)/\sigma^2)$ with Gaussian noise. As an example consider compressive sensing where $y = Lx + e$ with Gaussian i.i.d. $e$ and $L$ an underdetermined measurement matrix. When $X$ is a random vector with i.i.d. Laplacian entries, $\varphi_X(x) \propto \|x\|_1$ and the MAP estimator is Basis Pursuit Denoising (2). It has been shown [8] to have poorer performance (in the highly compressed regime and in the limit of low noise) than a variational estimator (1) with quadratic data-fidelity and quadratic penalty $\varphi(x) \propto \|x\|_2^2$, aka Tikhonov regression or ridge regression.

Unfortunately, a widely spread misconception has led to a "reverse reading" of optimization problems associated to variational approaches. For example, even though it is true that one obtains (2) as a MAP under additive Gaussian noise with a Laplacian signal prior, by no means does this imply that the use of (2) to build an estimator is necessarily motivated by the choice of MAP as an estimation principle and the belief that the Laplacian prior is a good description of the distribution of $X$. Instead, as widely documented in the literature on sparse regularization, the main reason for choosing the $\ell^1$ penalty is simply to promote sparse solutions: any minimizer of (2) is bound to have (many) zero entries (in particular when the parameter $\lambda$ is large), which is desirable when prior knowledge indicates that $x$ is "compressible", that is to say, well approximated by a sparse vector.

As demonstrated e.g. in [8] (see also [2] for related results), a random vector $X \in \mathbb{R}^n$ with entries drawn i.i.d. from a Laplacian distribution is, with high probability, not compressible; on the opposite, when the entries of $X$ are drawn i.i.d. according to a heavy-tailed distribution, $X$ is typically compressible. Thus, despite having the apparent characteristics of the MAP estimator with a Laplacian prior, (2) in fact approximates well the MMSE estimator with a heavy-tailed prior.
1.2. Writing certain convex variational estimators as proper Bayes estimators.
In penalized least squares regression for linear inverse problems in $\mathbb{R}^n$, the variational estimator
(7) $\hat{x}(y) \in \arg\min_{x \in \mathbb{R}^n} \frac12\|y - Lx\|^2 + \varphi(x)$
is traditionally interpreted as a MAP estimator (6) with Gaussian noise and log-concave prior $p_X(x) \propto \exp(-\varphi(x))$. When the penalty $\varphi$ is convex differentiable and $\frac12\|Lx\|^2 + \varphi(x)$ has at least linear growth at infinity, by [6, Theorem 1] the estimator (7) is also a proper Bayesian estimator minimizing (3), with a prior distribution $p_X(x) \propto \exp(-\varphi(x))$ and $y = Lx + e$ where $e$ is white Gaussian noise, for the cost
(8) $C(x, \hat{x}) = \frac12\|L(x - \hat{x})\|^2 + D_\varphi(\hat{x}, x)$
involving the Bregman divergence $D_\varphi$. The results in [6] are actually expressed with colored Gaussian noise and $\|\cdot\|^2$ replaced by the corresponding weighted (Mahalanobis) quadratic norm in (7) and (8). Further extensions to infinite dimension have been considered in [11]. In other words, in the context of linear inverse problems with additive Gaussian noise and a log-concave prior, a variational approach (7) having all the apparent features of the MAP estimator is in fact also a proper Bayesian estimator with the specific cost (8).

Remark 1. The reader may notice that the cost (8) is a Bregman divergence $C(x, \hat{x}) = D_h(\hat{x}, x)$, with $h(x) := \frac12\|Lx\|^2 + \varphi(x)$. The order of arguments is reversed compared to (5), which is neither an error nor a coincidence: by the results of Banerjee [3], when $C(x, \hat{x}) = D_h(x, \hat{x})$ for some $h$, the Bayes estimator minimizing (3) is simply the MMSE. Here, in (7)-(8), this does not happen: the MAP estimator does not match the MMSE estimator except in certain very specific cases where the Bregman divergence is indeed symmetric.
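As a small illustration of the cost (8) and of the asymmetry noted in Remark 1, the Python sketch below (ours; the specific quadratic-plus-penalty choice of $h$ and all numerical values are assumptions for the demo) evaluates the Bregman divergence $D_h(\hat{x}, x) = h(\hat{x}) - h(x) - \langle \nabla h(x), \hat{x} - x\rangle$ for $h(x) = \frac12\|Lx\|^2 + \varphi(x)$ with a smooth penalty.

```python
import numpy as np

def bregman(h, grad_h, u, v):
    # D_h(u, v) = h(u) - h(v) - <grad h(v), u - v>  (not symmetric in general)
    return h(u) - h(v) - np.dot(grad_h(v), u - v)

# h(x) = 0.5 * ||L x||^2 + phi(x), with a smooth penalty phi chosen here as a scaled
# squared norm so that the gradient is available in closed form.
rng = np.random.default_rng(0)
L = rng.standard_normal((5, 3))
lam = 0.1
h = lambda x: 0.5 * np.linalg.norm(L @ x) ** 2 + lam * np.linalg.norm(x) ** 2
grad_h = lambda x: L.T @ (L @ x) + 2 * lam * x

x, x_hat = rng.standard_normal(3), rng.standard_normal(3)
print(bregman(h, grad_h, x_hat, x), bregman(h, grad_h, x, x_hat))  # generally different
```

The two printed values differ, which is precisely why the order of arguments in (8) versus (5) matters.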
1.3. Writing certain MMSE estimators using an expression analogous to a MAP estimator. The results of Burger et al strongly interleave the prior model $p_X(x) \propto e^{-\varphi(x)}$, the observation model $p_{Y|X}(y|x)$ embodied by the linear operator $L$, and the task cost $C_{L,\varphi}(x, \hat{x})$. In particular, the task ("what we want to do") becomes dependent on the data model ("what we believe"). From a data processing perspective the above approach is not fully satisfactory: it seems more natural to first choose a relevant task (e.g., MMSE estimation) and a reasonable model (e.g., a prior $p_X(x)$), and then to design a penalty (the tool used to solve the task given the model) based on these choices.

In this spirit, in the context of additive white Gaussian denoising, [7] showed that, for any signal prior $p_X(x)$, the MMSE estimator (4) is the unique solution (and unique stationary point) of a variational optimization problem
(9) $\hat{x}_{\mathrm{MMSE}}(y) = \arg\min_x \frac12\|y - x\|^2 + \widetilde{\varphi}_X(x)$
with $\widetilde{\varphi}_X(x)$ some penalty that depends on the considered signal prior $p_X(x)$. In other words, the MMSE has all the apparent features of a MAP estimator with Gaussian noise and a "pseudo" signal prior $\widetilde{p}_X(x) \propto e^{-\widetilde{\varphi}_X(x)}$. Except in the very special case of a Gaussian prior, the pseudo-prior $\widetilde{p}_X(x)$ differs from the prior $p_X(x)$ that defines the MMSE estimator (4). This result has been extended to MMSE estimation for inverse problems with additive colored Gaussian noise [9]. Unser and co-authors [13, 1] exploit these results for certain MMSE estimation problems. Louchet and Moisan [14, Theorem 3.9] consider the specific case of the MMSE estimator associated to a total variation image prior $p_X(x)$ and establish the same property through connections with the notion of proximity operator of a convex lsc function.

1.4. Contribution: MMSE estimators that can be expressed as proximity operators.
We extend the general results of [8, 9] beyond Gaussian denoising using a characterization of proximity operators of possibly nonconvex penalties obtained in a companion paper [10]. Our extension goes substantially beyond Gaussian denoising, including scalar Poisson denoising, scalar denoising in the presence of additive noise with any log-concave distribution, and multivariate denoising for certain noise distributions belonging to the exponential family.
1.4.1. Scalar denoising.
Proposition 1. Consider the scalar Poisson noise model where the conditional probability distribution of the integer random variable $Y \in \mathbb{N}$ given $x \in \mathbb{R}_+^*$ is
$p_{Y|X}(Y = n \mid x) = \frac{x^n}{n!}\, e^{-x}, \quad \forall n \in \mathbb{N},$
and let $p_X$ be any probability distribution for a positive random variable $X > 0$. There is a (possibly nonconvex) penalty function $\widetilde{\varphi}_X : \mathbb{R}_+^* \to \mathbb{R} \cup \{+\infty\}$ such that, for any $n \in \mathbb{N}$,
(10) $\mathbb{E}(X \mid Y = n) = \arg\min_{x > 0} \left\{ \tfrac12 (n - x)^2 + \widetilde{\varphi}_X(x) \right\}.$
Note that the arg min is strictly positive since the conditional expectation is strictly positive. The penalty $\widetilde{\varphi}_X$ depends, among other things, on the probability distribution $p_X$ of $X$.

Proposition 2. Consider an additive noise model $Y = X + Z$ where the random variables $X, Z \in \mathbb{R}$ are independent. The conditional probability distribution of the random variable $Y \in \mathbb{R}$ given $x \in \mathbb{R}$ is $p_{Y|X}(y|x) = p_Z(y - x)$. Assume that $p_Z(z) > 0$ for any $z \in \mathbb{R}$ and that $z \mapsto F(z) := -\log p_Z(z)$ is continuous. The following properties are equivalent:
(a) the function $F$ is convex (i.e., the noise distribution is log-concave);
(b) for any prior probability distribution $p_X$ on the random variable $X \in \mathbb{R}$, the conditional expectation $\mathbb{E}(X \mid Y = y)$ is well defined for any $y \in \mathbb{R}$, and there is a (possibly nonconvex) penalty function $\widetilde{\varphi}_X : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ such that for any $y \in \mathbb{R}$
(11) $\mathbb{E}(X \mid Y = y) = \arg\min_{x \in \mathbb{R}} \left\{ \tfrac12 (y - x)^2 + \widetilde{\varphi}_X(x) \right\}.$
The penalty $\widetilde{\varphi}_X$ depends, among other things, on the probability distribution $p_X$ of $X$.

Proposition 1 and Proposition 2 are proved in Section 2 as corollaries of a more general result (Lemma 1) on scalar MMSE estimation. Let us insist that while the naive interpretation of an optimization problem such as (10) (resp. (11)) would be that of MAP estimation with Gaussian noise, it actually corresponds to MMSE estimation with Poisson (resp. log-concave) noise.
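To see Proposition 1 at work numerically, the sketch below (our illustration; the Gamma prior and the grid discretization are assumptions) computes the posterior mean $\mathbb{E}(X \mid Y = n)$ for scalar Poisson noise and checks that it is nondecreasing in $n$, which, as recalled in Section 2.1, is exactly what makes a scalar map a proximity operator.

```python
import numpy as np
from scipy.stats import gamma

# Prior on X > 0: a Gamma distribution (an arbitrary choice for the demo).
prior = gamma(a=2.0, scale=1.5)
x = np.linspace(1e-4, 60, 200000)            # integration grid on (0, +inf)
px = prior.pdf(x)

def posterior_mean(n):
    # q(n|x) = x^n e^{-x} / n!   (the n! cancels in the ratio below)
    log_q = n * np.log(x) - x
    w = np.exp(log_q - log_q.max()) * px     # stabilized unnormalized posterior
    return np.sum(x * w) / np.sum(w)

f = np.array([posterior_mean(n) for n in range(30)])
print(np.all(np.diff(f) >= 0))               # True: f is nondecreasing, hence a prox
```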
Classical log-concave examples include generalized Gaussian noise [15, 4], where $p_{Y|X}(y|x) \propto \exp\left(-\left|\frac{x-y}{\sigma}\right|^\gamma\right)$ with $1 \leq \gamma < \infty$. This includes Gaussian noise for $\gamma = 2$, but also Laplacian noise for $\gamma = 1$, and in all cases the MMSE estimator can always be written as a "pseudo-MAP" with a "Gaussian" (i.e., quadratic) data-fidelity term and an adapted "pseudo-prior" $\widetilde{\varphi}_X$. For $0 < \gamma < 1$ the noise distribution is not log-concave and the MMSE estimator cannot always be written as in (11).

Example 1. Consider $X$ a scalar random variable with a Laplacian prior distribution, and $Y = X + cZ$ where $Z$ is an independent scalar Laplacian noise variable and $c > 0$. While the MAP estimator reads $\arg\min_x \{|y - x| + c|x|\}$, by Proposition 2 the MMSE estimator can be expressed as
$f(y) = \arg\min_x \left\{ \tfrac12 (y - x)^2 + \varphi(x) \right\}$
with some penalty $\varphi$. The details of the analytic derivation of $f$ are in Appendix A.8. The corresponding potential $\psi$ can be obtained by numerical integration, and the penalty $\varphi$ is characterized by (16): $\varphi(f(y)) = y f(y) - f(y)^2/2 - \psi(y)$. It can be plotted using the pairs $f(y), \varphi(f(y))$. Figure 1 provides the shape of $f(y)$, of the corresponding potential $\psi(y)$, and of the corresponding penalty $\varphi(x)$ for $c = 0.9$.
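The following Python sketch (ours, not the analytic derivation of Appendix A.8) approximates the MMSE map $f(y)$ of Example 1 by numerical integration, recovers $\psi$ by numerically integrating $f$, and evaluates the penalty through the relation $\varphi(f(y)) = y f(y) - f(y)^2/2 - \psi(y)$; the grids and the unit prior scale are assumptions.

```python
import numpy as np

c = 0.9                                       # noise scale of Example 1
xg = np.linspace(-60, 60, 120001)             # integration grid for x
log_prior = -np.abs(xg)                       # Laplacian prior (unit scale, assumed)

def f(y):
    # MMSE: E(X | Y = y) for Y = X + c Z, Z Laplacian
    log_post = log_prior - np.abs(y - xg) / c
    w = np.exp(log_post - log_post.max())
    return np.sum(xg * w) / np.sum(w)

yg = np.linspace(-20, 20, 2001)
fy = np.array([f(y) for y in yg])

# psi is a convex potential whose derivative is f (cf. Theorem 1); integrate numerically.
psi = np.concatenate(([0.0], np.cumsum(0.5 * (fy[1:] + fy[:-1]) * np.diff(yg))))

# Penalty evaluated at the points f(y), via (16) with K = 0.
phi_at_fy = yg * fy - 0.5 * fy ** 2 - psi
```

Plotting `fy`, `psi` and `phi_at_fy` against `yg` reproduces the qualitative shapes shown in Figure 1.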
1.4.2. Multivariate denoising. The above scalar examples straightforwardly extend to the multivariate setting in the special case where both the noise model and the prior $p_X$ are completely separable, cf Example 2 in Section 2. In general however, the multivariate setting is quite different from the scalar one. For example in $\mathbb{R}^n$, $n \geq 2$, MMSE estimation in the presence of additive Laplacian noise (resp. of Poisson noise) cannot always be written as in (9), depending on the prior $p_X$, see Example 3 (resp. Example 4) in Section 2. In contrast, we show (Lemma 5) that the MMSE estimator can always be expressed as in (9) with particular noise models of the exponential family
(12) $p_{Y|X}(y|x) = \exp\left( c\langle x, y\rangle - a(x) - b(y) \right)$
for some $c \geq 0$ and functions $a(\cdot)$, $b(\cdot)$. This form is in fact essentially necessary (Lemma 4). A primary example is additive Gaussian white noise, where
(13) $p_{Y|X}(y|x) = C \exp\left( -\frac{\|y-x\|^2}{2\sigma^2} \right) = \exp\left( \log C - \frac{\|y\|^2}{2\sigma^2} - \frac{\|x\|^2}{2\sigma^2} + \frac{\langle x, y\rangle}{\sigma^2} \right),$
and we recover the main results of [7], using the following definition.

Definition 1. A random variable $X$ with values in $H$ is non-degenerate if there is no affine hyperplane $V \subset H$ such that $X \in V$ almost surely, i.e., if $p_X(\langle X, v\rangle = d) < 1$ for all nonzero $v \in H$ and $d \in \mathbb{R}$.

Figure 1. $f(y)$ (top), $\psi(y)$ (middle) and $\varphi(y)$ (bottom) for Example 1 with $c = 0.9$.

Proposition 3. Consider an additive noise model $Y = X + Z \in \mathbb{R}^n$ where the random variables $X, Z$ are independent and $p_Z(z) \propto \exp\left( -\frac{\|z\|^2}{2\sigma^2} \right)$. Then for any prior distribution $p_X$ there is a (possibly nonconvex) penalty $\widetilde{\varphi}_X : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ such that for any $y \in \mathbb{R}^n$
(14) $\mathbb{E}(X \mid Y = y) = \arg\min_{x \in \mathbb{R}^n} \left\{ \tfrac12 \|y - x\|^2 + \widetilde{\varphi}_X(x) \right\}.$
The penalty $\widetilde{\varphi}_X$ depends, among other things, on the probability distribution $p_X$ of $X$. Further, if $X$ is non-degenerate then the function $y \mapsto \mathbb{E}(X \mid Y = y)$ is injective and there is a choice of $\widetilde{\varphi}_X$ such that for any $y$, $\mathbb{E}(X \mid Y = y)$ is the unique stationary point (and global minimizer) of the rhs of (14).

Remark 2. It is further known [7] that in this Gaussian context the conditional mean estimator is $C^\infty$, and that $e^{-\widetilde{\varphi}_X(x)}$ is (up to renormalization) a proper prior density. It is also known [9] that $\widetilde{\varphi}_X$ is convex (resp. additively separable) if and only if the marginal $p_Y(y)$ is log-concave (resp. separable); this holds in particular as soon as the prior $p_X$ is log-concave (resp. separable). Extending such characterizations beyond Gaussian denoising is postponed to further work.

This is the only example of centered additive noise with smooth density of the form (12).
Proposition 4. Consider a multivariate centered additive noise model, i.e. with $p_{Y|X}(y|x) = p_Z(y - x)$. Assume that $p_Z(z) > 0$ for any $z \in H$ and denote $z \mapsto F(z) := -\log p_Z(z)$. If $F$ is continuous and $p_{Y|X}$ is of the form (12) then $p_{Y|X}(y|x)$ is in fact Gaussian, with the expression (13) for some $\sigma > 0$.

The proof is in Appendix A.6. Despite Proposition 4 and the apparent connection between the quadratic fidelity term in (14) and the Gaussian log-likelihood $-\log p_{Y|X}(y|x)$, noise models of the form (12) are more general. For example, they cover a variant of multivariate Poisson denoising.

Proposition 5. Consider a random variable $Y \in \mathbb{N}^n$ with conditional distribution given $x \in (\mathbb{R}_+^*)^n$ expressed as
$p_{Y|X}(Y = y \mid x) = \prod_{i=1}^n \frac{x_i^{y_i}}{y_i!}\, e^{-x_i}, \quad \forall y \in \mathbb{N}^n,$
and let $p_X$ be any probability distribution for a multivariate positive random variable $X$. There is a (possibly nonconvex) penalty function $\widetilde{\varphi}_X : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ such that the MMSE estimator of the (entrywise) logarithm $\log X = (\log X_i)_{i=1}^n$ satisfies for any $y \in \mathbb{N}^n$
(15) $\mathbb{E}(\log X \mid Y = y) = \arg\min_{\xi \in \mathbb{R}^n} \left\{ \tfrac12 \|y - \xi\|^2 + \widetilde{\varphi}_X(\xi) \right\}.$
The penalty $\widetilde{\varphi}_X$ depends, among other things, on the probability distribution $p_X$ of $X$. Further, if $\log X$ is non-degenerate then the function $y \mapsto \mathbb{E}(\log X \mid Y = y)$ is injective and there is a choice of $\widetilde{\varphi}_X$ such that for any $y$, $\mathbb{E}(\log X \mid Y = y)$ is the unique stationary point (and global minimizer) of the rhs of (15).

In Proposition 5 the noise model is not Gaussian, yet MMSE estimation can be expressed in a variational form "looking like" Gaussian denoising due to the quadratic data-fidelity term in (15). Proposition 3 and Proposition 5 are proved in Section 2 as corollaries of Lemma 5.
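To illustrate Proposition 5 numerically, the sketch below (ours; the correlated log-normal prior, the extension of the Poisson kernel to real $y$ via $\Gamma(y+1)$, and the grid integration are assumptions made for the demo) estimates $\mathbb{E}(\log X \mid Y = y)$ for a non-separable prior in dimension $n = 2$ and checks that its Jacobian, estimated by finite differences, is nearly symmetric, which is the signature of a (smooth) proximity operator recalled in Section 2.1.

```python
import numpy as np
from math import lgamma

# Non-separable prior on (log X1, log X2): a correlated Gaussian (an assumption).
g = np.linspace(-3, 3, 241)
Z1, Z2 = np.meshgrid(g, g, indexing="ij")
rho = 0.8
log_prior = -(Z1**2 - 2*rho*Z1*Z2 + Z2**2) / (2*(1 - rho**2))

def post_mean_logx(y):
    # E(log X | Y = y) for the multivariate Poisson model, written in z = log x.
    ll = (y[0]*Z1 - np.exp(Z1) - lgamma(y[0]+1)) + (y[1]*Z2 - np.exp(Z2) - lgamma(y[1]+1))
    w = np.exp(ll + log_prior - (ll + log_prior).max())
    return np.array([np.sum(Z1*w), np.sum(Z2*w)]) / np.sum(w)

# Jacobian of y -> E(log X | Y = y) by central finite differences.
y0, eps = np.array([3.0, 5.0]), 1e-3
J = np.column_stack([(post_mean_logx(y0 + eps*e) - post_mean_logx(y0 - eps*e)) / (2*eps)
                     for e in np.eye(2)])
print(J, np.abs(J[0, 1] - J[1, 0]))   # off-diagonal entries nearly equal: symmetric Jacobian
```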
1.5. Discussion and extensions.
While this paper primarily focuses on characterizing when the conditional mean estimator is a proximity operator, it is natural to wonder under which noise models the conditional mean estimator can be expressed as the solution of another variational optimization problem such as (2) or more generally (1), with properly chosen data fidelity term and penalty. This has been done [9] for linear inverse problems with colored Gaussian noise, and we expect that it is possible to combine the corresponding proof techniques with those of [10] to establish the existence of a penalty such that the MMSE, which is a priori expressed as an integral, is in fact the solution of an (often convex) variational problem. From a practical point of view, a challenging perspective would be to identify how to exploit this property to design efficient estimation algorithms.
2. When is MMSE estimation a proximity operator?
We now provide our main general results on the connections between Bayesian estimation and variational approaches. After some reminders on the expression of the conditional mean in terms of a marginal and a conditional distribution, we define a class of "proto" conditional mean estimators based on a given "proto" conditional distribution $q(y|x)$. Then, we focus on the scalar case where we characterize (proto) conditional distributions that lead to (proto) conditional mean estimators that are proximity operators. Finally we consider the vector case.

2.1. Reminders on proximity operators.
Let $H$ be a Hilbert space equipped with an inner product denoted $\langle\cdot,\cdot\rangle$ and a norm denoted $\|\cdot\|$. In the finite dimensional case, one may simply consider $H = \mathbb{R}^n$. A function $\theta : H \to \mathbb{R} \cup \{+\infty\}$ is proper iff there is $x \in H$ such that $\theta(x) < +\infty$, i.e., $\mathrm{dom}(\theta) \neq \emptyset$, where $\mathrm{dom}(\theta) := \{x \in H \mid \theta(x) < \infty\}$. It is lower semi-continuous (lsc) if for any $x_0 \in H$, $\liminf_{x \to x_0} \theta(x) \geq \theta(x_0)$, or equivalently if the set $\{x \in H : \theta(x) > \alpha\}$ is open for every $\alpha \in \mathbb{R}$. The proximity operator of a given (possibly nonconvex) proper penalty function $\varphi : H \to \mathbb{R} \cup \{+\infty\}$ is the set-valued operator
$y \mapsto \mathrm{prox}_\varphi(y) := \arg\min_{x \in H} \left\{ \tfrac12\|y - x\|^2 + \varphi(x) \right\}.$
A primary example is soft-thresholding $f(y) := y(1 - 1/|y|)_+$, $y \in H := \mathbb{R}$, which is the proximity operator of the absolute value function $\varphi(x) := |x|$. Let us recall that if $\theta$ is a convex function, $u \in H$ is a subgradient of $\theta$ at $x \in H$ iff $\theta(x') - \theta(x) \geq \langle u, x' - x\rangle$, $\forall x' \in H$. The subdifferential at $x$ is the collection of all subgradients of $\theta$ at $x$ and is denoted $\partial\theta(x)$.

The following results are proved in the companion paper [10].

Theorem 1. [10, Theorem 1] Consider $\mathcal{Y}$ a non-empty subset of $H$. A function $f : \mathcal{Y} \to H$ is a proximity operator of some penalty $\varphi$ (i.e. $f(y) \in \mathrm{prox}_\varphi(y)$ for any $y \in \mathcal{Y}$) if, and only if, there exists a convex lsc function $\psi : H \to \mathbb{R} \cup \{+\infty\}$ such that for any $y \in \mathcal{Y}$, $f(y) \in \partial\psi(y)$.

When the domain $\mathcal{Y}$ is convex, [10, Theorem 3] implies that there is a number $K \in \mathbb{R}$ such that the functions $f$, $\varphi$ and $\psi$ in Theorem 1 satisfy
(16) $\psi(y) = \langle y, f(y)\rangle - \tfrac12\|f(y)\|^2 - \varphi(f(y)) + K, \quad \forall y \in \mathcal{Y}.$

Corollary 1. [10, Corollary 4] Consider an arbitrary non-empty subset $\mathcal{Y} \subset \mathbb{R}$. A function $f : \mathcal{Y} \to \mathbb{R}$ is the proximity operator of some penalty $\varphi$ if, and only if, it is nondecreasing.

Theorem 2. [10, Theorem 2] Let $\mathcal{Y}$ be an open convex subset of $H$ and $f : \mathcal{Y} \to H$ be $C^1(\mathcal{Y})$. The following properties are equivalent:
(a) $f$ is a proximity operator of a function $\varphi$ (i.e. $f(y) \in \mathrm{prox}_\varphi(y)$ for any $y \in \mathcal{Y}$);
(b) there is a convex $C^1$ function $\psi : H \to \mathbb{R} \cup \{+\infty\}$ such that $f(y) = \nabla\psi(y)$ for all $y \in \mathcal{Y}$;
(c) the differential $Df(y)$ is a symmetric positive semi-definite operator for any $y \in \mathcal{Y}$.

Corollary 2. [10, Corollary 3] Let $\mathcal{Y} \subset H$ be open and convex, and $f : \mathcal{Y} \to H$ be $C^1$ with $Df(y) \succ 0$ for any $y \in \mathcal{Y}$. Then $f$ is injective and there is $\varphi : H \to \mathbb{R} \cup \{+\infty\}$ such that $\mathrm{prox}_\varphi(y) = \{f(y)\}$, $\forall y \in \mathcal{Y}$, and $\mathrm{dom}(\varphi) = \mathrm{Im}(f)$. Moreover, if $x \in H$ is a stationary point of $x \mapsto \tfrac12\|y - x\|^2 + \varphi(x)$ then $x = f(y)$.

Here a continuous linear operator $L : H \to H$ is symmetric if $\langle x, Ly\rangle = \langle Lx, y\rangle$ for any $x, y \in H$. A symmetric continuous linear operator is positive semi-definite if $\langle x, Lx\rangle \geq 0$ for any $x \in H$; this is denoted $L \succeq 0$. It is positive definite if $\langle x, Lx\rangle > 0$ for any nonzero $x \in H$; this is denoted $L \succ 0$. A point $u$ is a stationary point of $\varrho : H \to \mathbb{R} \cup \{+\infty\}$ if $\nabla\varrho(u) = 0$; in particular $\varrho$ is then proper on a neighborhood of $u$.
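As a quick sanity check of these reminders, the Python sketch below (ours) verifies numerically that soft-thresholding is the proximity operator of the absolute value, and that it is nondecreasing as the scalar characterization above requires; the grids are arbitrary.

```python
import numpy as np

def soft_threshold(y, t=1.0):
    # Soft-thresholding: prox of t * |x|
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_abs_numeric(y, t=1.0):
    # Brute-force prox: minimize 0.5*(y - x)^2 + t*|x| over a fine grid.
    x = np.linspace(-10, 10, 200001)
    return x[np.argmin(0.5 * (y - x) ** 2 + t * np.abs(x))]

ys = np.linspace(-5, 5, 41)
assert np.allclose([prox_abs_numeric(y) for y in ys], soft_threshold(ys), atol=1e-3)
assert np.all(np.diff(soft_threshold(ys)) >= 0)   # nondecreasing, as expected of a scalar prox
print("soft-thresholding = prox of |x|, and it is nondecreasing")
```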
2.2. Reminders on conditional expectation. Consider a pair of random variables $(X, Y)$ with values in some Hilbert spaces ($Y$ may in fact be valued in any set equipped with a probability measure; we restrict to a Hilbert space for simplicity), with joint probability density function (pdf) $p_{X,Y}(x, y)$ and marginals $p_Y(y)$, $p_X(x)$; all pdfs are with respect to some reference measure on $H$, typically the Lebesgue measure in finite dimension. For $y$ such that $p_Y(y) > 0$, the conditional distribution of $X$ given $Y = y$ is $p_{X|Y}(x|y) = p_{X,Y}(x, y)/p_Y(y)$. When $\|x\|$ is integrable with respect to $p_{X|Y}(\cdot|y)$, the conditional expectation of $X$ given $Y = y$ is
$\mathbb{E}(X \mid Y = y) = \int x\, p_{X|Y}(x|y)\, dx.$
By Bayes rule, the conditional distribution and the marginal $p_Y(y)$ satisfy
$p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{p_Y(y)} \quad \text{and} \quad p_Y(y) = \int p_{X,Y}(x, y)\, dx = \int p_{Y|X}(y|x)\, p_X(x)\, dx.$
Denoting $q(y|x) := p_{Y|X}(y|x)$ we thus have
$\mathbb{E}(X \mid Y = y) = \frac{\int x\, p_{Y|X}(y|x)\, p_X(x)\, dx}{p_Y(y)} = \frac{\mathbb{E}_X\left( X q(y|X) \right)}{\mathbb{E}_X\left( q(y|X) \right)}.$
As we will see (Corollary 1), the conditional mean has the same expression in related settings such as scalar Poisson denoising. Considering a function $q(y|x)$ that plays the role of a "proto" conditional distribution of the observation $y$ given the unknown $x$, we can define "proto" conditional expectation functions in order to characterize when the conditional expectation is a proximity operator.

2.3. "Proto" conditional distributions and "proto" conditional expectations.
Definition 2. Consider $\mathcal{X}, \mathcal{Y} \subset H$ and a "proto" conditional distribution $q : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$, $(x, y) \mapsto q(y|x)$. Let $P$ be a probability distribution on $\mathcal{X}$ such that
(17) $\mathbb{E}_{X \sim P}\left( (1 + \|X\|)\, q(y|X) \right) < \infty, \quad \forall y \in \mathcal{Y}.$
The "proto" marginal distribution $q_P(y)$ of $q(y|x)$ is defined as the function $y \in \mathcal{Y} \mapsto q_P(y) := \mathbb{E}_{X \sim P}\left( q(y|X) \right)$. (If $p_{Y|X}(y|x) = q(y|x)$ is a well-defined conditional probability then $q_P(y) = \int p_{X,Y}(x, y)\, dx$ is simply the marginal distribution of the random variable $Y$, where $p_{X,Y}(x, y) = p_{Y|X}(y|x) P(x) = q(y|x) P(x)$.) On its support $\mathcal{Y}_P := \{y \in \mathcal{Y} : q_P(y) > 0\}$, the "proto" conditional mean $f_P(y)$ is defined as the function
(18) $y \in \mathcal{Y}_P \mapsto f_P(y) := \frac{\mathbb{E}_{X \sim P}\left( X q(y|X) \right)}{\mathbb{E}_{X \sim P}\left( q(y|X) \right)}.$

2.4. Scalar denoising.
In the scalar case, we fully characterize proto conditional distributions $q(y|x)$ such that the conditional mean estimator $f_P$ defined by (18) is a proximity operator.

Lemma 1. Consider $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}$ and $q : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$. For $P$ a probability distribution on $\mathcal{X}$ such that (17) holds, let $f_P$, $\mathcal{Y}_P$ be defined as in Definition 2. The following properties are equivalent:
(a) For any $P$ satisfying (17), $f_P$ is a proximity operator;
(b) For any $x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$ with $x' > x$ and $y' > y$, $\quad q(y'|x')\, q(y|x) - q(y'|x)\, q(y|x') \geq 0$.
If $q(y|x) > 0$ on $\mathcal{X} \times \mathcal{Y}$ then (a)-(b) are further equivalent to:
(c) For any $y, y' \in \mathcal{Y}$ with $y' > y$, the function $x \mapsto \log q(y'|x) - \log q(y|x)$ is non-decreasing;
(d) For any $x, x' \in \mathcal{X}$ with $x' > x$, the function $y \mapsto \log q(y|x') - \log q(y|x)$ is non-decreasing.
If $\frac{\partial}{\partial y} \log q(y|x)$ exists on $\mathcal{X} \times \mathcal{Y}$, then (a)-(b) are further equivalent to:
(e) For any $y \in \mathcal{Y}$, the function $x \mapsto \frac{\partial}{\partial y} \log q(y|x)$ is non-decreasing.
If $\frac{\partial}{\partial x} \log q(y|x)$ exists on $\mathcal{X} \times \mathcal{Y}$, then (a)-(b) are further equivalent to:
(f) For any $x \in \mathcal{X}$, the function $y \mapsto \frac{\partial}{\partial x} \log q(y|x)$ is non-decreasing.

The proof is postponed to Annex A.1. Two applications are scalar Poisson denoising (Proposition 1) and denoising in the presence of additive noise (Proposition 2).
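Condition (b) of Lemma 1 is easy to probe numerically: the sketch below (ours; the Poisson kernel and the grids are just a test case) checks the sign condition $q(y'|x')q(y|x) - q(y'|x)q(y|x') \geq 0$ on a grid, and also checks that the resulting proto conditional mean is nondecreasing for a randomly drawn discrete prior.

```python
import numpy as np
from math import lgamma

def q(y, x):
    # Scalar Poisson "proto" kernel q(y|x) = x^y e^{-x} / Gamma(y+1)
    return np.exp(y * np.log(x) - x - lgamma(y + 1))

xs = np.linspace(0.1, 10, 40)
ys = np.arange(0, 15)

# Condition (b) of Lemma 1: for x' > x and y' > y, q(y'|x') q(y|x) - q(y'|x) q(y|x') >= 0.
ok = all(
    q(yp, xp) * q(y, x) - q(yp, x) * q(y, xp) >= -1e-15
    for i, x in enumerate(xs) for xp in xs[i + 1:]
    for j, y in enumerate(ys) for yp in ys[j + 1:]
)
print("condition (b) holds on the grid:", ok)

# Proto conditional mean (18) for a random discrete prior P supported on the x-grid.
rng = np.random.default_rng(1)
w = rng.random(xs.size); w /= w.sum()
fP = np.array([np.sum(xs * q(y, xs) * w) / np.sum(q(y, xs) * w) for y in ys])
print("f_P nondecreasing:", np.all(np.diff(fP) >= 0))
```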
Proof. [Proof of Proposition 1] Consider $\mathcal{X} := \mathbb{R}_+^*$, $\mathcal{Y} := \mathbb{R}_+$, and the proto-conditional distribution $q(y|x) := \frac{x^y}{\Gamma(y+1)} e^{-x}$, which is defined and strictly positive on $\mathcal{X} \times \mathcal{Y}$. For any $y \in \mathcal{Y}$ we have $\sup_{x \in \mathcal{X}} (1 + |x|)\, q(y|x) < \infty$, hence property (17) holds for any distribution $P$. Definition 2 yields $q_P(y) > 0$ on $\mathcal{Y}$, hence $\mathcal{Y}_P = \mathcal{Y}$. Setting $P = p_X$, as $p_{Y|X}(Y = y \mid x) = q(y|x)$ for any $y \in \mathcal{Y}' := \mathbb{N}$, we get $\mathbb{E}(X \mid Y = y) = f_P(y)$ with $f_P$ given by (18). On $\mathcal{X} \times \mathcal{Y}$ we have $\log q(y|x) = y \log x - \log\Gamma(y+1) - x$, hence $y \mapsto \frac{\partial}{\partial x}\log q(y|x) = y/x - 1$ is non-decreasing for every $x \in \mathcal{X}$. The fact that $f_P$ is a proximity operator, i.e., the existence of $\widetilde{\varphi}_X$, follows by Lemma 1(a) ⇔ (f). □

Proof. [Proof of Proposition 2] Consider $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, $q(y|x) := p_{Y|X}(y|x) = \exp(-F(y-x))$, and observe that $\log q(y|x) = -F(y-x)$. For $P := p_X$ a probability distribution satisfying (17), reasoning as in the proof of Proposition 1 we have $q_P(y) > 0$ on $\mathcal{Y}$, hence $\mathcal{Y}_P = \mathcal{Y}$ and $\mathbb{E}(X \mid Y = y) = f_P(y)$.

Consider first the case where $F$ is $C^1$. Then, $F$ is convex iff $u \mapsto F'(u)$ is non-decreasing, which is equivalent to $y \mapsto \frac{\partial}{\partial x}\log q(y|x) = F'(y-x)$ being non-decreasing. By Lemma 1(f) ⇔ (a) this is equivalent to the fact that $f_P$ is a proximity operator for any $P$ satisfying (17). To conclude, we exploit a simple lemma proved in Appendix A.7.

Lemma 2. If $G : \mathbb{R} \to \mathbb{R}$ is convex and satisfies $\int_{\mathbb{R}} e^{-G(x)}\, dx < \infty$ then $\sup_{x \in \mathbb{R}} (1 + |x|)\, e^{-G(x)} < \infty$.

Hence:
• if $F$ is not convex then there exists $P$ satisfying (17) such that $f_P$ is not a proximity operator, hence Proposition 2(b) cannot hold;
• if $F$ is convex then for any $y$ the function $G : x \mapsto F(y-x)$ is convex and $\int_{\mathbb{R}} e^{-G(x)}\, dx < \infty$, hence, by Lemma 2, the function $x \mapsto (1 + |x|)\, q(y|x)$ is bounded. As a result, (17) holds for any distribution $P$. As just shown, $f_P$ is thus a proximity operator for any $P$.
This establishes the equivalence Proposition 2(a) ⇔ (b) when $F$ is $C^1$.

To extend the result when $F$ is only continuous, we reason similarly using Lemma 1(d) ⇔ (a). A bit more work is needed to show that Lemma 1(d) holds iff $F$ is convex, as we now establish. With the change of variable $u = y - x'$, $h = x' - x > 0$, Lemma 1(d) is equivalent to:
(19) $\forall h > 0, \quad u \mapsto F(u + h) - F(u)$ is non-decreasing.
When (19) holds we have for any $u_1 < u_2$ (using $h := (u_2 - u_1)/2$)
$F\left(\tfrac{u_1+u_2}{2}\right) - F(u_1) = F(u_1 + h) - F(u_1) \leq F\left(\tfrac{u_1+u_2}{2} + h\right) - F\left(\tfrac{u_1+u_2}{2}\right) = F(u_2) - F\left(\tfrac{u_1+u_2}{2}\right),$
hence $F\left(\tfrac{u_1+u_2}{2}\right) \leq \frac{F(u_1) + F(u_2)}{2}$. As $F$ is continuous, this is well known to imply that $F$ is convex. Vice-versa, when $F$ is convex, given $u$ and $h, h' > 0$, we show that $F(u_4) - F(u_3) \geq F(u_2) - F(u_1)$ where $u_1 := u$, $u_2 := u + h$, $u_3 := u + h'$, $u_4 := u + h + h'$. Two cases are possible: either $u_1 < u_2 < u_3 < u_4$ or $u_1 < u_3 \leq u_2 < u_4$. We treat the latter, the first one can be handled similarly. Since $F$ is convex and $u_3 \leq u_2$, there exist $a \in \partial F(u_3)$ and $b \in \partial F(u_2)$ with $a \leq b$, and we get $a(u_3 - u_1) = a h' \leq b h' = b(u_4 - u_2)$. As a result
$F(u_4) - F(u_3) = F(u_4) - F(u_2) + F(u_2) - F(u_3) \geq b(u_4 - u_2) + F(u_2) - F(u_3) \geq a(u_3 - u_1) + F(u_2) - F(u_3) \geq F(u_3) - F(u_1) + F(u_2) - F(u_3) = F(u_2) - F(u_1).$ □

Example 2. [Completely separable model] Consider a completely separable model: the entries $X_i \in \mathbb{R}$, $1 \leq i \leq n$, of the random variable $X \in \mathbb{R}^n$ are drawn independently with prior distributions $P_i$, i.e., $p_X(x) = \prod_{i=1}^n p_{X_i}(x_i)$; the conditional distribution of $Y \in \mathbb{R}^n$ given $x \in \mathbb{R}^n$ corresponds to independent noise on each coordinate, $p_{Y|X}(y|x) = \prod_{i=1}^n p_{Y_i|X_i}(y_i|x_i)$. This implies that the conditional expectation can be computed coordinate by coordinate. Assuming further that $p_{Y_i|X_i}(y_i|x_i) \propto q_i(y_i|x_i)$ where $q_i$ satisfies (b), we obtain for $1 \leq i \leq n$
$\mathbb{E}(X \mid Y = y)_i = \mathbb{E}(X_i \mid Y_i = y_i) = \arg\min_{x_i \in \mathbb{R}} \left\{ \tfrac12 (y_i - x_i)^2 + \varphi_{P_i}(x_i) \right\}, \quad \forall y \in \mathbb{R}^n,$
and as a result
$\mathbb{E}(X \mid Y = y) = \arg\min_{x \in \mathbb{R}^n} \left\{ \tfrac12 \|y - x\|^2 + \sum_{i=1}^n \varphi_{P_i}(x_i) \right\}, \quad \forall y \in \mathbb{R}^n.$
Hence, the conditional mean is the proximity operator of the penalty $\varphi_P(x) := \sum_{i=1}^n \varphi_{P_i}(x_i)$.

2.5. Multivariate denoising. In dimension $1 \leq n \leq \infty$, we have the following result.

Lemma 3. Consider $\mathcal{X}, \mathcal{Y} \subset H$ and $q : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$. For $P$ a probability distribution on $\mathcal{X}$ such that (17) holds, let $f_P$, $\mathcal{Y}_P$ be defined as in Definition 2. Assume $f_P$ is a proximity operator for any $P$ satisfying (17). Then, for any $x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$:
(a) if $\nabla_x \log q(y|x)$ and $\nabla_x \log q(y'|x)$ exist, there is a scalar $c = c(x, y, y') \geq 0$ such that
(20) $\nabla_x \log q(y'|x) - \nabla_x \log q(y|x) = c\,(y' - y);$
(b) if $\nabla_y \log q(y|x)$ and $\nabla_y \log q(y|x')$ exist, there is a scalar $c' = c'(x, x', y) \geq 0$ such that
(21) $\nabla_y \log q(y|x') - \nabla_y \log q(y|x) = c'\,(x' - x).$
The proof is postponed to Annex A.2.

Remark 3. In (a)-(b) the gradients are only assumed to exist at particular points $x, x', y, y'$, leading to the necessary conditions (20)-(21) at these points.

A first consequence is that MMSE estimation with additive Laplacian noise (resp. with Poisson noise) in dimension $n \geq 2$ cannot always be written as in (9), unlike the scalar case $n = 1$ (cf Proposition 1, Proposition 2).

Example 3. Consider a multivariate Laplacian noise model: given $x \in \mathcal{X} = \mathbb{R}^n$, the conditional probability of the random vector $Y$ on $\mathcal{Y} = \mathbb{R}^n$ is defined by $p(y|x) \propto q(y|x) := e^{-\|x - y\|_1}$. As $\log q(y|x) = -\|x - y\|_1$, given $y = (y_i)_{i=1}^n$ and $x = (x_i)_{i=1}^n$ such that $x_i \neq y_i$, $\forall 1 \leq i \leq n$, $\log q(\cdot, y)$ is differentiable at $x$ and $\nabla_x \log q(y|x) = -\mathrm{sign}(x - y)$. Hence, for $x \in \mathcal{X}$, $y, y' \in \mathcal{Y}$ such that $x_i \notin \{y_i, y'_i\}$ for $1 \leq i \leq n$ we have
(22) $\nabla_x \log q(y'|x) - \nabla_x \log q(y|x) = -\left( \mathrm{sign}(x - y') - \mathrm{sign}(x - y) \right).$
• The scalar case $n = 1$ is covered by Proposition 2 since $q$ is log-concave. For any distribution $P$ on the scalar random variable $X \in \mathbb{R}$, the MMSE estimator $f_P$ is a proximity operator.
• For $n \geq 2$, this is no longer the case. Consider $x = 0$ and $y \in (\mathbb{R}_+^*)^n$ such that $y_1 \neq y_2$, and $y' = -y$. The vector $-(\mathrm{sign}(x - y') - \mathrm{sign}(x - y)) = -2 \cdot \mathbf{1}_n$ is not proportional to the vector $(y' - y) = 2y'$, whose first two entries are distinct. Hence (22) is incompatible with condition (20) and by Lemma 3, there exists a prior distribution $P$ such that $f_P$ is not a proximity operator.

Remark 4. As the noise distribution is separable (multivariate Laplacian noise corresponds to i.i.d. scalar noise on each coordinate), by Example 2 the MMSE estimator is in fact a proximity operator as soon as the prior $P$ is also separable (i.e., if $X$ has independent entries $X_i$). However, for $n \geq 2$, we have just shown that there exists a (non-separable) prior $P$ such that $f_P$ is not a proximity operator.

Example 4. Consider a multivariate Poisson noise model: given $x \in \mathcal{X} := (\mathbb{R}_+^*)^n$, the conditional probability of the random vector of integers $Y$ on $\mathcal{Y} := \mathbb{N}^n$ is defined by $p(y|x) = q(y|x) := \prod_{i=1}^n \left( \frac{x_i^{y_i} e^{-x_i}}{\Gamma(y_i + 1)} \right)$. We observe that
$\log q(y|x) = \sum_{i=1}^n \left( y_i \ln x_i - x_i - \log\Gamma(y_i + 1) \right).$
Given $y \in \mathcal{Y}$, the function $x \mapsto \log q(y|x)$ is differentiable on $\mathcal{X}$ with $\nabla_x \log q(y|x) = (y_i/x_i - 1)_{i=1}^n$, hence for $x \in \mathcal{X}$, $y, y' \in \mathcal{Y}$ we have $\nabla_x \log q(y'|x) - \nabla_x \log q(y|x) = ((y'_i - y_i)/x_i)_{i=1}^n$.
• The case $n = 1$ is covered by Proposition 1: $f_P$ is a proximity operator for any prior $P$.
• For $n \geq 2$ this is again no longer the case. Consider, e.g., $x \in \mathcal{X}$ with $x_1 \neq x_2$ and $y, y' \in \mathcal{Y}$ such that $y_i \neq y'_i$, $i = 1, 2$. The vector $((y'_i - y_i)/x_i)_{i=1}^n$ cannot be proportional to $y' - y$, hence (20) cannot hold. By Lemma 3, there is a prior $P$ such that $f_P$ is not a proximity operator.
Remark 4 again applies: for a separable prior $P$, $f_P$ is a proximity operator, yet there exists a (non-separable) prior $P$ such that $f_P$ is not a proximity operator.

For smooth enough proto-conditional distributions we have the following corollary of Lemma 3.

Lemma 4. Let $H$ be of dimension $2 \leq n \leq \infty$ and consider open sets $\mathcal{X}, \mathcal{Y} \subset H$ where $\mathcal{X}$ is connected. Let $q : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+^*$ be such that $x \mapsto \nabla_x \log q(y|x)$ is continuous for any $y \in \mathcal{Y}$, and $y \mapsto \nabla_y \log q(y|x)$ is continuous for any $x \in \mathcal{X}$. For $P$ a probability distribution on $\mathcal{X}$ satisfying (17), let $f_P$, $\mathcal{Y}_P$ be defined as in Definition 2. If $f_P$ is a proximity operator for any such $P$, then there exist $c \geq 0$, $a \in C^1(\mathcal{X})$, $b \in C^1(\mathcal{Y})$ such that
(23) $q(y|x) = \exp\left( -a(x) - b(y) + c\langle x, y\rangle \right), \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}.$
The proof is postponed to Annex A.3. A converse result also holds.
Lemma 5. Consider $\mathcal{X}, \mathcal{Y} \subset H$ where $\mathcal{Y}$ is open and convex. Assume that $q : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+^*$ has the expression (23) where $c \geq 0$, $a : \mathcal{X} \to \mathbb{R}$, $b : \mathcal{Y} \to \mathbb{R}$, with $b$ $C^1(\mathcal{Y})$ and $b'$ locally bounded (if $H = \mathbb{R}^n$ local boundedness is automatic by compactness arguments; the assumption is useful in infinite dimension). Let $P$ be a probability distribution satisfying: for any $y \in \mathcal{Y}$, there is $r = r(y) > 0$ such that
(24) $\mathbb{E}_{X \sim P}\left\{ (1 + \|X\|) \cdot \exp\left( -a(X) + cr\|X\| + c\langle X, y\rangle \right) \right\} < \infty.$
Then (17) holds and, using the notations of Definition 2, we have:
• the proto-conditional mean $f_P$ is differentiable on $\mathcal{Y}_P = \mathcal{Y}$;
• its differential is symmetric positive semi-definite, i.e., $Df_P(y) \succeq 0$, $\forall y \in \mathcal{Y}$;
• the proto-conditional mean $f_P$ is the proximity operator of some penalty function $\varphi_P$.
Moreover if $c > 0$ and if the support of the distribution $P$ is not included in any hyperplane (i.e. if $P(\langle X, u\rangle = d) < 1$ for any nonzero $u \in H$ and any $d \in \mathbb{R}$), then $\varphi_P$ can be chosen such that:
• $Df_P(y) \succ 0$ for any $y$;
• the proto-conditional mean $f_P$ is injective;
• for any $y$, $f_P(y)$ is the unique stationary point (and global minimizer) of $x \mapsto \tfrac12\|y - x\|^2 + \varphi_P(x)$.
The proof is postponed to Appendix A.4.
Remark 5. The case $c = 0$ corresponds to independent random variables $X$ and $Y$ governed by
$p_{Y|X}(y|x) = \frac{q(y|x)}{\int q(y'|x)\, dy'} = \frac{\exp(-a(x) - b(y))}{\int \exp(-a(x) - b(y'))\, dy'} = \frac{\exp(-b(y))}{\int \exp(-b(y'))\, dy'}.$

A consequence of Lemma 5 is that we recover the results of [7]. We also cover a variant of multivariate Poisson denoising that leads again to a proximity operator even for $n \geq 2$.

2.5.1. Proof of Proposition 3. Consider white Gaussian noise, i.e. $p(y|x) \propto q(y|x)$ with $q(y|x) := \exp\left( -\frac{c}{2}\|x - y\|^2 \right)$; $\mathcal{X} = \mathcal{Y} = H = \mathbb{R}^n$. Observe that $\log q(y|x) = -\frac{c}{2}\|x - y\|^2 = c\langle x, y\rangle - a(x) - b(y)$ with $a(x) := \frac{c}{2}\|x\|^2$, $b(y) := \frac{c}{2}\|y\|^2$. Since
$\sup_{x \in \mathcal{X}} (1 + \|x\|) \exp\left( -a(x) + (cr + c\|y\|)\|x\| \right) \leq \sup_{u \in \mathbb{R}_+} (1 + u)\, e^{-cu^2/2 + (cr + c\|y\|)u} < \infty,$
any probability distribution $P$ satisfies (24), hence we can apply Lemma 5.

2.5.2. Proof of Proposition 5.
Consider $\mathcal{X} := (\mathbb{R}_+^*)^n$, $\mathcal{Y}' = \mathbb{N}^n$, $\mathcal{Y} := (\mathbb{R}_+)^n$. For $(x, y) \in \mathcal{X} \times \mathcal{Y}'$ we have $p(y|x) = q(y|x)$ where
$\log q(y|x) = \sum_{i=1}^n \left( y_i \log x_i - x_i - \log\Gamma(y_i + 1) \right)$
is defined on $\mathcal{X} \times \mathcal{Y}$. Denoting $z := (\log x_i)_{i=1}^n \in \mathcal{Z} := \mathbb{R}^n$ we have $\widetilde{p}(y|z) := \widetilde{q}(y|z)$ where
$\log \widetilde{q}(y|z) := \langle z, y\rangle - a(z) - b(y)$
with $a(z) := \sum_{i=1}^n e^{z_i}$ and $b(y) := \sum_{i=1}^n \log\Gamma(y_i + 1)$. For any $y \in \mathcal{Y}$ and $r > 0$, using arguments similar to those in the proof of Lemma 2, we get that
$\sup_{z \in \mathcal{Z}} (1 + \|z\|) \exp\left( -a(z) + (r + \|y\|)\|z\| \right) < \infty,$
hence for any distribution $p_X$ on $\mathcal{X}$, denoting $P$ the resulting distribution of the random variable $Z = \log X \in \mathcal{Z}$, $P$ necessarily satisfies (24) for any $y \in \mathcal{Y}$ and $r > 0$. As $b$ is $C^1(\mathcal{Y})$, both $b$ and $b'$ are locally bounded, hence we can again apply Lemma 5 to get that $f_P(y) = \mathbb{E}_{Z \sim P}(Z \mid Y = y) = \mathbb{E}(\log X \mid Y = y)$ is a proximity operator. This concludes the proof.

Appendix A. Proofs
A.1.
Proof of Lemma 1. (a) ⇒ (b). First, we establish a necessary condition that any function q ( x | y ) must satisfy to ensure that (18) defines a proximity operator for any prior probabilitydistribution P on the random variable X . This condition will be re-used for the proof ofLemma 3. Lemma . Consider X , Y ⊂ H and q : X × Y → R + , ( x, y ) (cid:55)→ q ( y | x ) . Assume that for anyprobability distribution P such that (17) holds , f P is the proximity operator of some penalty ϕ P (i.e. f P ( y ) ∈ prox ϕ P ( y ) , ∀ y ∈ Y P ), with Y P and f P as in Definition 2. Then the function q satisfies (25) ∀ x, x (cid:48) ∈ X , y, y (cid:48) ∈ Y , (cid:104) x (cid:48) − x, y (cid:48) − y (cid:105) (cid:0) q ( y (cid:48) | x (cid:48) ) q ( y | x ) − q ( y (cid:48) | x ) q ( y | x (cid:48) ) (cid:1) (cid:62) . Even though (25) is not intuitive, its main interest is that it depends only on the “proto”conditional distribution q ( y | x ) but not on the prior distribution P on X . It necessary holds whenthe ”proto” conditional distribution q ( y | x ) ensures that for any prior probability distribution P on X , the ”proto” conditional expectation f P is a proximity operator. Proof . Given x, x (cid:48) ∈ X , y, y (cid:48) ∈ Y , consider the probability distribution P := ( δ x + δ x (cid:48) ). It isstraightforward that P satisfies (17).Now, denoting λ := ( q ( y | x )+ q ( y | x (cid:48) ))( q ( y (cid:48) | x )+ q ( y (cid:48) | x (cid:48) )), we distinguish two cases dependingwhether λ = 0 or not.First, if λ = 0 then one must have q ( y | x ) + q ( y | x (cid:48) ) = 0, or q ( y (cid:48) | x ) + q ( y (cid:48) | x (cid:48) ) = 0, or both.Without loss of generality we treat the case q ( y | x ) + q ( y | x (cid:48) ) = 0 (the other cases are treatedsimilarly). Since q is non-negative, this implies q ( y | x ) = q ( y | x (cid:48) ) = 0, hence q ( y (cid:48) | x (cid:48) ) q ( y | x ) − q ( y (cid:48) | x ) q ( y | x (cid:48) ) = 0 and (25) trivially holds.Now, consider the case λ >
0, i.e. q ( y | x ) + q ( y | x (cid:48) ) > q ( y (cid:48) | x ) + q ( y (cid:48) | x (cid:48) ) >
0. By thedefinition of q P and Y P in Definition 2, this implies y, y (cid:48) ∈ Y P and thus q P ( y ) > q P ( y (cid:48) ) > f P is a proximity operator, hence by Theorem 1 thereexists a convex lsc function ψ P such that f P ( y ) ∈ ∂ψ P ( y ) and f P ( y (cid:48) ) ∈ ∂ψ P ( y (cid:48) ). From thedefinition of a subdifferential, this implies that ψ P ( y (cid:48) ) − ψ P ( y ) (cid:62) (cid:104) f P ( y ) , y (cid:48) − y (cid:105) and that ψ P ( y ) − ψ P ( y (cid:48) ) (cid:62) (cid:104) f P ( y (cid:48) ) , y − y (cid:48) (cid:105) . Adding both inequalities yields (cid:62) (cid:104) f P ( y ) − f P ( y (cid:48) ) , y (cid:48) − y (cid:105) . Since by (18) f P ( y ) = xq ( y | x ) + x (cid:48) q ( y | x (cid:48) ) q ( y | x ) + q ( y | x (cid:48) ) and f P ( y (cid:48) ) = xq ( y (cid:48) | x ) + x (cid:48) q ( y (cid:48) | x (cid:48) ) q ( y (cid:48) | x ) + q ( y (cid:48) | x (cid:48) )we obtain with straightforward computations that0 (cid:54) λ (cid:104) f P ( y (cid:48) ) − f P ( y ) , y (cid:48) − y (cid:105) = (cid:104) x (cid:48) − x, y (cid:48) − y (cid:105) (cid:0) q ( y (cid:48) | x (cid:48) ) q ( y | x ) − q ( y (cid:48) | x ) q ( y | x (cid:48) ) (cid:1) . (cid:3) or equivalently (cid:104) f P ( y (cid:48) ) − f P ( y ) , y (cid:48) − y (cid:105) (cid:62)
0. Since this holds for any y, y (cid:48) ∈ Y , f P is a monotone operator(see, e.g., [16]). N BAYESIAN ESTIMATION AND PROXIMITY OPERATORS 17
In the scalar case where, (25) is equivalent to Lemma 1(b), hence Lemma 6 establishes(a) ⇒ (b).(b) ⇒ (a). Since we consider the scalar case, the assumption that (b) holds implies (25).Consider y, y (cid:48) ∈ Y P and X, X (cid:48) two independent random variables with distribution P . Write q P ( y ) q P ( y (cid:48) ) (cid:0) f P ( y (cid:48) ) − f P ( y ) (cid:1) (18) = q P ( y ) E X (cid:48) (cid:0) X (cid:48) q ( y (cid:48) | X (cid:48) ) (cid:1) − q P ( y (cid:48) ) E X (cid:48) (cid:0) X (cid:48) q ( y | X (cid:48) ) (cid:1) = E X (cid:0) q ( y | X ) (cid:1) E X (cid:48) (cid:0) X (cid:48) q ( y (cid:48) | X (cid:48) ) (cid:1) − E X (cid:0) q ( y (cid:48) | X ) (cid:1) E X (cid:48) (cid:0) X (cid:48) q ( y | X (cid:48) ) (cid:1) = E X,X (cid:48) (cid:16) X (cid:48) (cid:0) q ( y | X ) q ( y (cid:48) | X (cid:48) ) − q ( y (cid:48) | X ) q ( y | X (cid:48) ) (cid:1) (cid:17) (26) X ↔ X (cid:48) = E X (cid:48) ,X (cid:16) X (cid:0) q ( y | X (cid:48) ) q ( y (cid:48) | X ) − q ( y (cid:48) | X (cid:48) ) q ( y | X ) (cid:1) (cid:17) (27) ((26)+(27)) / = E X (cid:48) ,X (cid:16) X (cid:48) − X (cid:0) q ( y (cid:48) | X (cid:48) ) q ( y | X ) − q ( y (cid:48) | X ) q ( y | X (cid:48) ) (cid:1) (cid:17) If follows that( y (cid:48) − y ) (cid:0) f P ( y (cid:48) ) − f P ( y ) (cid:1) = q P ( y ) q P ( y (cid:48) ) · E X (cid:48) ,X (cid:110) ( X (cid:48) − X )( y (cid:48) − y )2 (cid:0) q ( y (cid:48) | X (cid:48) ) q ( y | X ) − q ( y (cid:48) | X ) q ( y | X (cid:48) ) (cid:1)(cid:111) (25) (cid:62) , hence f P is non-decreasing on the set Y P . By Corollary 1, f P is thus a proximity operator.(b) ⇔ (c) ⇔ (d) when q ( y | x ) > X × Y . We sketch the proof of the equivalence of (b) and(c). The equivalence between (b) and (d) follows similarly. Denote Q ( x ; y, y (cid:48) ) := log q ( y (cid:48) | x ) − log q ( y | x ). Property (b) holds if and only if for any x, x (cid:48) ∈ X , y, y (cid:48) ∈ Y with x (cid:48) > x , y (cid:48) > y we have q ( y (cid:48) | x (cid:48) ) q ( y | x ) (cid:62) q ( y (cid:48) | x ) q ( y | x (cid:48) ), which is equivalent to log q ( y (cid:48) | x (cid:48) ) + log q ( y | x ) − log q ( y (cid:48) | x ) − log q ( y | x (cid:48) ) (cid:62)
0, that is to say Q ( x (cid:48) ; y, y (cid:48) ) − Q ( x ; y, y (cid:48) ) (cid:62) . i.e., x (cid:55)→ Q ( x ; y, y (cid:48) ) isnon-decreasing as soon as y (cid:48) > y .(c) ⇔ (e) and (d) ⇔ (f). This is straightforward using, e.g., that ∂∂y log q ( y | x ) = lim y (cid:48) → y Q ( x ; y,y (cid:48) ) y (cid:48) − y .A.2. Proof of Lemma 3.
We establish (a).(b) is obtained similarly by reversing the roles of x and y .Since f P is a proximity operator for any probability distribution P such that (17) holds, byLemma 6 it follows that property (25) holds.Consider h ∈ H such that (cid:104) y (cid:48) − y, h (cid:105) >
0. As log q ( · , y ) is differentiable at x , we have x (cid:48) := x + εh ∈ X for ε > (cid:104) x (cid:48) − x, y (cid:48) − y (cid:105) = ε (cid:104) h, y (cid:48) − y (cid:105) >
0. By (25) we have q ( y (cid:48) | x (cid:48) ) q ( y | x ) (cid:62) q ( y (cid:48) | x ) q ( y | x (cid:48) ). Taking the logarithm yieldslog q ( y (cid:48) | x + εh ) − log q ( y (cid:48) | x ) (cid:62) log q ( y | x + εh ) − log q ( y | x )As log q ( · , y ) and log q ( · , y (cid:48) ) are both differentiable at x , it follows that (cid:10) ∇ x log q ( y (cid:48) | x ) , εh (cid:11) + o ( ε ) (cid:62) (cid:104)∇ x log q ( y | x ) , εh (cid:105) + o ( ε )hence (cid:104)∇ x log q ( y (cid:48) | x ) − ∇ x log q ( y | x ) , h (cid:105) (cid:62)
0. Since this holds for any h such that (cid:104) y (cid:48) − y, h (cid:105) > c (cid:62) x, y, y (cid:48) ) such that (20) holds. A.3.
Proof of Lemma 4.
Since dim H (cid:62) X and Y are open, the affine dimension of X (resp. of Y ) exceeds two. We use the following lemma. Lemma . Let
Z ⊂ H be of affine dimension at least two (i.e., there is no pair z , z ∈ H suchthat Z ⊂ { tz + (1 − t ) z , t ∈ R } ). Assume that the function θ : Z → H satisfies (28) ∀ z, z (cid:48) ∈ Z , ∃ c ( z, z (cid:48) ) ∈ R , θ ( z ) − θ ( z (cid:48) ) = c ( z, z (cid:48) ) ( z − z (cid:48) ) . Then, there exists c ∈ R such that ∀ z, z (cid:48) ∈ Z , θ ( z ) − θ ( z (cid:48) ) = c ( z − z (cid:48) ) . Proof . Step 1.
We show that if z , z , z ∈ Z are not aligned (i.e., if they are affinely indepen-dent) then c ( z i , z j ) = c ( z , z ) for all 1 (cid:54) i (cid:54) = j (cid:54) c ij := c ( z i , z j ). By symmetry(28) yields c ij = c ji , ∀ i (cid:54) = j . Moreover, summing up(28) with z = z i , z (cid:48) = z j over ( i, j ) ∈ { (1 , , (2 , , (3 , } yields0 = c ( z − z ) + c ( z − z ) + c ( z − z ) = ( c − c ) z + ( c − c ) z + ( c − c ) z . As the coefficients in the right hand side sum to zero, affine independence yields c = c = c . Step 2.
As the affine dimension of Z exceeds two it is not a singleton and we can choosean arbitrary pair z (cid:54) = z ∈ Z . Define c := c ( z , z ). We show that for any x, y ∈ Z we have c ( x, y ) = c ( y, x ) = c .Define S := { z + (1 − t ) z , t ∈ R } the affine hull of z , z and observe that Z ∩ S c (cid:54) = ∅ .First, consider x ∈ Z ∩ S c . For y = z , as z , z , x are not aligned, by Step 1 we have c ( x, z ) = c ( z , x ) = c . For y ∈ Z ∩ S\{ z } , as x, y, z are not aligned, by Step 1 again we get c ( x, y ) = c ( y, x ) = c ( x, z ) = c . This establishes the result for any x ∈ Z ∩ S c and y ∈ Z ∩ S .Second, consider x, y ∈ Z ∩ S c . As just shown, we have c ( x, z ) = c ( z , y ) = c , hence by (28) θ ( x ) − θ ( y ) = θ ( x ) − θ ( z ) + θ ( z ) − θ ( y ) = c ( x, z ) ( x − z ) + c ( z , y ) ( z − y ) = c ( x − y ) . Finally consider x, y ∈ Z ∩ S . Let z ∈ Z ∩ S c be arbitrary. As c ( x, z ) = c ( z, y ) = c , (28)yields θ ( x ) − θ ( y ) = θ ( x ) − θ ( z ) + θ ( z ) − θ ( y ) = c ( x, z ) ( x − z ) + c ( z, y ) ( z − y ) = c ( x − y ) . (cid:3) For a given x ∈ X , consider the function y (cid:55)→ θ x ( y ) := ∇ x log q ( y | x ). By Lemma 3(a), θ x satisfies (28) and the constants c ( z, z (cid:48) ) are non-negative. By Lemma 7 there is c x ∈ R + suchthat(29) θ x ( y ) − θ x ( y (cid:48) ) = c x ( y − y (cid:48) ) , ∀ y, y (cid:48) ∈ Y . As a result y (cid:55)→ θ x ( y ) is differentiable with D y ∇ x log q ( y | x ) = c x Id. Similarly, given y ∈ Y ,with (cid:37) y ( x ) := ∇ y log q ( y | x ), there is d y ∈ R + such that(30) (cid:37) y ( x ) − (cid:37) y ( x (cid:48) ) = d y ( x − x (cid:48) ) , ∀ x, x (cid:48) ∈ X hence D x ∇ y log q ( y | x ) = d y Id.
When ( x, y ) (cid:55)→ log q ( y | x ) is C , by Schwarz’ theorem we have D x ∇ y log q ( y | x ) = D y ∇ x log q ( y | x )for any x, y ∈ X × Y . Thus, c x Id = d y Id for any x, y ∈ X × Y hence c x = d y = c (cid:62) x, y and(31) ∇ x log q ( y | x ) − ∇ x log q ( y (cid:48) | x ) = c ( y − y (cid:48) ) , ∀ x ∈ X , ∀ y, y (cid:48) ∈ Y . Let us now show that (31) also holds with the considered weaker assumption on q . For this, fixsome arbitrary x ∈ X and x (cid:48) ∈ X close enough so that { x + t ( x (cid:48) − x ) } ⊂ X (remember that X is open so this is possible). Consider y, y (cid:48) ∈ Y . Denote f ( t ) := log q ( y | x + t ( x (cid:48) − x )) − log q ( y (cid:48) | x + t ( x (cid:48) − x )) , t ∈ [0 , . As x (cid:55)→ ∇ x log q ( y | x ) is assumed to be continuous, the function f is C on [0 ,
1] and by (29)we have f (cid:48) ( t ) = (cid:104)∇ x log q ( y | x + t ( x (cid:48) − x )) − ∇ x log q ( y (cid:48) | x + t ( x (cid:48) − x )) , x (cid:48) − x (cid:105) = c x + t ( x (cid:48) − x ) (cid:104) y − y (cid:48) , x (cid:48) − x (cid:105) . As a resultlog q ( y | x (cid:48) ) − log q ( y (cid:48) | x (cid:48) ) − log q ( y | x ) + log q ( y (cid:48) | x ) = f (1) − f (0) = (cid:90) f (cid:48) ( t ) dt = (cid:90) c x + t ( x (cid:48) − x ) dt (cid:104) y − y (cid:48) , x (cid:48) − x (cid:105) ., By (30), taking the gradient of both sides with respect to y yields d y ( x (cid:48) − x ) = (cid:90) c x + t ( x (cid:48) − x ) dt ( x (cid:48) − x ) . Thus, d y does not depend on y . Similarly, c x does not depend on x . This establishes (30).Fix an arbitrary y ∈ Y and denote H ( x, y ) := log q ( y | x ) − c (cid:104) x, y (cid:105) and a ( x ) := − H ( x, y ).We obtain ∇ x ( H ( x, y ) + a ( x )) = ∇ x ( H ( x, y ) − H ( x, y )) = 0 , ∀ x ∈ X , ∀ y ∈ Y . Since X is open and connected, hence path-connected, it follows that for any y ∈ Y there is b ( y ) ∈ R such that: H ( x, y ) + a ( x ) = − b ( y ) for all x ∈ X , i.e.,log q ( y | x ) = − a ( x ) − b ( y ) + c (cid:104) x, y (cid:105) . As both x (cid:55)→ ∇ y log q ( y | x ) and y (cid:55)→ ∇ x log q ( y | x ) are C , both a ( · ) and b ( · ) are C .A.4. Proof of Lemma 5.
We use the following lemma, which proof is slightly postponed.
Lemma . Consider two subsets X , Y ⊂ H and a ”proto” conditional distribution q : X × Y → R + . Let V ⊂ Y be an open set such that the gradient ∇ y q ( y | x ) exists for any x ∈ X , y ∈ V .Define (32) C q, V ( x ) := sup y ∈V { q ( y | x ) + (cid:107)∇ y q ( y | x ) (cid:107) H } ∈ [0 , ∞ ] Consider P a probability distribution such that (33) E X ∼ P { (1 + (cid:107) X (cid:107) H ) C q, V ( X ) } < ∞ . Then (17) holds, the set V P := { y ∈ V : q P ( y ) > } = V ∩ Y P is open, and • the ”proto” marginal distribution q P ( y ) := E X ∼ P ( q ( y | X )) is differentiable on V ; • the ”proto” conditional mean f P defined by (18) is differentiable on V P with (34) Df P ( y ) = q P ( y ) E X ∼ P E X (cid:48) ∼ P (cid:16) q ( y | X ) q ( y | X (cid:48) ) · ( X (cid:48) − X ) (cid:0) ∇ Ty log q ( y | X (cid:48) ) − ∇ Ty log q ( y | X ) (cid:1) (cid:17) where by convention ∇ y log q ( y | x ) = 0 when q ( y | x ) = 0 . Remark . The assumption (33) is chosen for simplicity but could be relaxed.
By (23) and the fact that $b \in C^1(\mathcal{Y})$ we have, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$q(y|x) + \|\nabla_y q(y|x)\| = q(y|x)\big(1 + \|\nabla_y \log q(y|x)\|\big) = \exp\!\big(-a(x) - b(y) + c\langle x, y\rangle\big)\big(1 + \|{-\nabla b(y)} + c\,x\|\big) \leq \exp\!\big(-a(x) + c\langle x, y\rangle\big)\, \exp(-b(y))\,\big(1 + \|\nabla b(y)\| + c\big)\,(1 + \|x\|).$$
Consider $y \in \mathcal{Y}$. Since $\mathcal{Y}$ is open, there is $r_0 > 0$ such that for $0 < r < r_0$ the closed ball $\mathcal{V} = \overline{B}(y, r)$ is a neighborhood of $y$ satisfying $\mathcal{V} \subset \mathcal{Y}$. By the local boundedness of $\nabla b$, $b$ is locally Lipschitz hence locally bounded, hence there is $r_1 > 0$ such that $C(\mathcal{V}) := \sup_{y' \in \mathcal{V}} \exp(-b(y'))\,\big(1 + \|\nabla b(y')\| + c\big) < \infty$ for any $0 < r < r_1$. For $0 < r < \min(r_0, r_1)$ we have $\sup_{y' \in \mathcal{V}} \exp\!\big(c\langle x, y' - y\rangle\big) \leq \exp(c\,r\,\|x\|) < \infty$, hence for any $x \in \mathcal{X}$
$$C_{q,\mathcal{V}}(x) := \sup_{y' \in \mathcal{V}} \big\{ q(y'|x) + \|\nabla_y q(y'|x)\| \big\} \leq C(\mathcal{V}) \sup_{y' \in \mathcal{V}} \exp\!\big(-a(x) + c\langle x, y' - y\rangle + c\langle x, y\rangle\big)\,(1 + \|x\|) \leq C(\mathcal{V})\, \exp\!\big(-a(x) + c\,r\,\|x\| + c\langle x, y\rangle\big)\,(1 + \|x\|).$$
By assumption (24) we have $\mathbb{E}_{X \sim P} \{ (1 + \|X\|)\, C_{q,\mathcal{V}}(X) \} < \infty$ when $r > 0$ is small enough. Since $q(y|x) > 0$ for every $x \in \mathcal{X}$, we have $q_P(y) > 0$, i.e., $y \in \mathcal{Y}_P$, showing that $\mathcal{Y} = \mathcal{Y}_P$. By Lemma 8, $f_P$ is differentiable at $y$ with differential given by (34). Finally, by (23) we have
$$(x' - x)\big(\nabla_y^T \log q(y|x') - \nabla_y^T \log q(y|x)\big) = c\,(x' - x)(x' - x)^T \succeq 0, \quad \forall\, x, x' \in \mathcal{X},$$
hence $Df_P(y)$ is the expectation of a symmetric positive semi-definite operator. (For $u, v \in \mathcal{H}$, $v^T : \mathcal{H} \to \mathbb{R}$ denotes the linear form $x \mapsto \langle v, x \rangle$, and $uv^T : \mathcal{H} \to \mathcal{H}$ the linear operator $x \mapsto \langle v, x \rangle\, u$.) As a result, $Df_P(y)$ is symmetric positive semi-definite. As this holds for any $y \in \mathcal{Y}$, and since $\mathcal{Y}$ is open and convex, by Theorem 2, $f_P$ is a proximity operator, i.e., there exists a function $\varphi_P$ such that $f_P = \mathrm{prox}_{\varphi_P}$.

Now, assume that $c > 0$ and consider $u \in \mathcal{H}$ such that $\langle u, Df_P(y)\,u \rangle = 0$. As
$$\langle u, Df_P(y)\,u \rangle = \frac{c}{2\, q_P(y)^2}\, \mathbb{E}_{X \sim P} \mathbb{E}_{X' \sim P}\, q(y|X)\, q(y|X')\, \langle u, X' - X \rangle^2$$
and $q(y|x)\, q(y|x')\, \langle u, x' - x \rangle^2 \geq 0$ for all $x, x'$, by Markov's inequality we obtain that $q(y|X)\, q(y|X')\, \langle u, X' - X \rangle^2 = 0$ almost surely on the draw of $(X, X')$.
As $q(y|x) > 0$ for all $x$, this implies that $\langle u, X' - X \rangle = 0$ almost surely, hence there exists $d \in \mathbb{R}$ such that $\langle u, X \rangle = d$ almost surely. Since we assume that $P(\langle u, X \rangle = d) < 1$ for every $d \in \mathbb{R}$ and every nonzero $u \in \mathcal{H}$, it follows that $u = 0$. This shows that $Df_P(y) \succ 0$. $\square$
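The definiteness dichotomy above can also be observed numerically. The sketch below (ours; a 2D Gaussian noise model with $c = 1$ and arbitrary discrete priors) evaluates the right-hand side of (34): with a prior whose support spans the plane, both eigenvalues of $Df_P(y)$ are positive, while a prior supported on a line yields a zero eigenvalue.

    import numpy as np

    rng = np.random.default_rng(0)

    def q(y, x):        # q(y|x) = exp(<x,y> - |x|^2/2 - |y|^2/2), i.e. Gaussian noise, c = 1
        return np.exp(x @ y - 0.5 * x @ x - 0.5 * y @ y)

    def DfP(y, xs, ps):  # right-hand side of (34); grad_y log q(y|x') - grad_y log q(y|x) = x' - x
        qs = np.array([q(y, x) for x in xs])
        M = np.zeros((2, 2))
        for pi, qi, xi in zip(ps, qs, xs):
            for pj, qj, xj in zip(ps, qs, xs):
                d = (xj - xi).reshape(-1, 1)
                M += pi * pj * qi * qj * (d @ d.T)
        return M / (2 * (ps @ qs) ** 2)

    y = np.array([0.3, -0.2])
    ps = np.full(5, 0.2)
    xs_full = rng.normal(size=(5, 2))                                       # prior not supported on a line
    xs_line = np.array([[t, 2.0 * t] for t in (-1.0, 0.0, 0.5, 1.0, 2.0)])  # prior supported on a line

    print(np.linalg.eigvalsh(DfP(y, xs_full, ps)))   # two positive eigenvalues
    print(np.linalg.eigvalsh(DfP(y, xs_line, ps)))   # one eigenvalue numerically zero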
A.5. Proof of Lemma 8. By (32), $0 \leq q(y|x) \leq C_{q,\mathcal{V}}(x)$ for any $y \in \mathcal{V}$, $x \in \mathcal{X}$, hence the numerator in (18), $n(y) := \mathbb{E}_{X \sim P}\{ X\, q(y|X) \}$, is well-defined. As $\|\nabla_y q(y|x)\|_{\mathcal{H}} \leq C_{q,\mathcal{V}}(x)$, the quantity $\mathbb{E}_{X \sim P}\{ X\, \nabla_y^T q(y|X) \}$ is similarly well-defined. Denote
$$\Delta(x, y, h) := q(y+h|x) - q(y|x) - \langle \nabla_y q(y|x), h \rangle.$$
As $|\Delta(x, y, h)| \leq 2\,C_{q,\mathcal{V}}(x)\, \|h\|$ for any $x \in \mathcal{X}$, $y \in \mathcal{V}$ and $h$ with $\|h\|_{\mathcal{H}}$ small enough, and as $\lim_{\|h\| \to 0} \Delta(x, y, h)/\|h\| = 0$ for any $x \in \mathcal{X}$, $y \in \mathcal{V}$, by the dominated convergence theorem it follows that for any $y \in \mathcal{V}$
$$\lim_{\|h\|_{\mathcal{H}} \to 0} \tfrac{1}{\|h\|} \big\| n(y+h) - n(y) - \mathbb{E}_{X \sim P}\{ X\, \langle \nabla_y q(y|X), h \rangle \} \big\| = 0,$$
showing that $n$ is differentiable on $\mathcal{V}$ with differential $Dn(y) = \mathbb{E}_{X \sim P}\{ X\, \nabla_y^T q(y|X) \}$. Similar arguments exploiting (32) show that $q_P$ is differentiable on $\mathcal{V}$ with differential $Dq_P(y) = \nabla^T q_P(y)$ where $\nabla q_P(y) = \mathbb{E}_{X \sim P}\{ \nabla_y q(y|X) \}$. In particular, $q_P$ is continuous on $\mathcal{V}$, hence $\mathcal{V}_P = \mathcal{V} \cap q_P^{-1}((0, \infty))$ is open.

For $y \in \mathcal{V}_P$, the denominator in (18) is $q_P(y) > 0$.
By standard calculus $f_P$ is differentiable at $y$ and
$$Df_P(y) = \frac{q_P(y)\, Dn(y) - n(y)\, Dq_P(y)}{q_P(y)^2} = \frac{\mathbb{E}_X\{ X \nabla_y^T q(y|X) \} \cdot \mathbb{E}_{X'}\{ q(y|X') \} - \mathbb{E}_X\{ X q(y|X) \} \cdot \mathbb{E}_{X'}\{ \nabla_y^T q(y|X') \}}{q_P(y)^2},$$
hence
$$q_P(y)^2\, Df_P(y) = \mathbb{E}_{X, X'}\big\{ X \big( q(y|X')\, \nabla_y^T q(y|X) - q(y|X)\, \nabla_y^T q(y|X') \big) \big\}.$$
Now, we distinguish two cases:
• for $x, x'$ such that $q(y|x)\, q(y|x') > 0$, using $\nabla_y \log q = (\nabla_y q)/q$ wherever $q > 0$, we get
(35) $q(y|x')\, \nabla_y^T q(y|x) - q(y|x)\, \nabla_y^T q(y|x') = q(y|x)\, q(y|x') \big( \nabla_y^T \log q(y|x) - \nabla_y^T \log q(y|x') \big)$;
• for $x, x'$ such that $q(y|x)\, q(y|x') = 0$, we have $q(y|x) = 0$ or (non-exclusively) $q(y|x') = 0$. For example assume $q(y|x) = 0$. As $y' \mapsto q(y'|x)$ is non-negative, it has a local minimum at $y' = y$, and as it is differentiable this implies $\nabla_y q(y|x) = 0$. Similarly if $q(y|x') = 0$ then $\nabla_y q(y|x') = 0$. As a result (35) remains valid with the convention $\nabla_y \log q = 0$ where $q = 0$.
With the above observations we rewrite
(36) $q_P(y)^2\, Df_P(y) = \mathbb{E}_{X, X'}\big( X\, q(y|X)\, q(y|X') \big( \nabla_y^T \log q(y|X) - \nabla_y^T \log q(y|X') \big) \big) = \mathbb{E}_{X, X'}\big( -X\, q(y|X)\, q(y|X') \big( \nabla_y^T \log q(y|X') - \nabla_y^T \log q(y|X) \big) \big)$
and, exchanging the roles of $X$ and $X'$,
(37) $q_P(y)^2\, Df_P(y) = \mathbb{E}_{X, X'}\big( -X'\, q(y|X')\, q(y|X) \big( \nabla_y^T \log q(y|X) - \nabla_y^T \log q(y|X') \big) \big) = \mathbb{E}_{X, X'}\big( X'\, q(y|X)\, q(y|X') \big( \nabla_y^T \log q(y|X') - \nabla_y^T \log q(y|X) \big) \big).$
We conclude by taking the half sum of (36) and (37).

A.6. Proof of Proposition 4.
Since $p_{Y|X}$ is of the form (12) we have $a(x) + b(y) = F(x - y) + c\langle x, y\rangle$ for any $x, y \in \mathcal{H}$. A consequence is that $a(x) = F(x) - b(0)$ and $b(y) = F(-y) - a(0)$ for any $x, y$. In particular $a(0) + b(0) = F(0)$. It follows that $F(x-y) + c\langle x, y\rangle = a(x) + b(y) = F(x) + F(-y) - a(0) - b(0) = F(x) + F(-y) - F(0)$ for any $x, y$. Denoting $G(z) := F(z) - F(0)$, we get $G(x-y) + c\langle x, y\rangle = G(x) + G(-y)$. Specializing to $x = y$ yields $G(0) + c\|x\|^2 = G(x) + G(-x)$ for any $x$; since $G(0) = 0$, this gives $c\|x\|^2 = G(x) + G(-x)$ for all $x$. Denoting $A(z)$, $B(z)$ the odd and the even parts of $G(z)$ we get $B(z) = c\|z\|^2/2$ and
$$A(x-y) + c\|x-y\|^2/2 + c\langle x, y\rangle = G(x-y) + c\langle x, y\rangle = G(x) + G(-y) = A(x) + A(-y) + c\|x\|^2/2 + c\|y\|^2/2$$
for any $x, y$. Thus, $A(x-y) = A(x) + A(-y)$ for any $x, y$ and, as $A$ is continuous (by the continuity of $F$), it follows that $A$ is linear: there is $\mu \in \mathcal{H}$ such that $A(x) = -c\langle x, \mu\rangle$ for all $x$, so that $G(z) = c\|z\|^2/2 - c\langle z, \mu\rangle = \tfrac{c}{2}\big( \|z - \mu\|^2 - \|\mu\|^2 \big)$ and $F(z) = \tfrac{c}{2}\|z - \mu\|^2 + d$ with $d \in \mathbb{R}$. As the noise is centered, we conclude that $\mu = 0$.
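A small symbolic check (ours, with sympy; the scalar case and the candidate functions are chosen for illustration only) confirms the rigidity just proved: the quadratic $F(z) = \tfrac{c}{2}(z-\mu)^2 + d$ satisfies $F(x-y) + c\,xy = F(x) + F(-y) - F(0)$ identically, whereas a non-quadratic choice such as $F(z) = |z|$ violates it.

    import sympy as sp

    x, y, c, mu, d = sp.symbols('x y c mu d', real=True)

    def F(t):                       # quadratic candidate from the proof
        return c * (t - mu) ** 2 / 2 + d

    print(sp.simplify(F(x - y) + c * x * y - (F(x) + F(-y) - F(0))))   # 0: identity holds

    def Fabs(t):                    # a non-quadratic candidate (take c = 1)
        return sp.Abs(t)

    gap = Fabs(x - y) + x * y - (Fabs(x) + Fabs(-y) - Fabs(0))
    print(gap.subs({x: 1, y: 1}))   # -1: the identity fails, so |.| is excluded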
A.7. Proof of Lemma 2. Equivalently, we show that $G(x) - \log(1 + |x|)$ is bounded from below. Since $\int_{\mathbb{R}} e^{-G(x)}\, dx < \infty$ and $G$ is convex, there is $a \in \mathbb{R}$ such that $G$ is non-increasing on $(-\infty, a]$ and non-decreasing on $[a, +\infty)$ with $\lim_{|x| \to \infty} G(x) = +\infty$. By convexity again, it follows that there are $x_1 < a < x_2$ and $u_1 < 0 < u_2$ such that $G(x) \geq G(x_1) + u_1(x - x_1)$ and $G(x) \geq G(x_2) + u_2(x - x_2)$ for any $x \in \mathbb{R}$.

On $[x_2, +\infty)$ we have $G(x) - \log(1 + |x|) \geq G(x_2) + u_2(x - x_2) - \log(1 + |x|)$, hence
$$\inf_{x \in [x_2, \infty)} \{ G(x) - \log(1 + |x|) \} \geq G(x_2) - u_2 x_2 + \inf_{x \geq x_2} \{ u_2 x - \log(1 + |x|) \} > -\infty.$$
Similarly $\inf_{x \in (-\infty, x_1]} \{ G(x) - \log(1 + |x|) \} > -\infty$. Finally, as $G$ is convex on $[x_1, x_2]$ it is continuous on this compact interval, hence
$$\inf_{x \in [x_1, x_2]} \{ G(x) - \log(1 + |x|) \} = \min_{x \in [x_1, x_2]} \{ G(x) - \log(1 + |x|) \} > -\infty.$$
Putting the pieces together establishes the result.

A.8. Worked example: scalar denoising with Laplacian noise and Laplacian prior.
Consider $p_{Y|X}(y|x) \propto \exp(-|y-x|) =: q(y|x)$ and $p_X(x) = \tfrac{c}{2} \exp(-c|x|)$.
Consider $y > 0$ (the case $y < 0$ is symmetric). We have
$$|y - x| + c|x| = \begin{cases} y - (1+c)x & \text{if } x < 0 \\ y - (1-c)x & \text{if } 0 \leq x \leq y \\ (1+c)x - y & \text{if } x > y \end{cases}$$
hence for $c \neq 1$ we have
$$\tfrac{2}{c}\, q_P(y) = \int_{-\infty}^{+\infty} \exp(-|y-x| - c|x|)\, dx = \int_{-\infty}^{0} e^{(1+c)x - y}\, dx + \int_{0}^{y} e^{(1-c)x - y}\, dx + \int_{y}^{+\infty} e^{y - (1+c)x}\, dx$$
$$= e^{-y} \int_{0}^{+\infty} e^{-(1+c)x}\, dx + e^{-y} \int_{0}^{y} e^{(1-c)x}\, dx + e^{-cy} \int_{0}^{+\infty} e^{-(1+c)x}\, dx$$
$$= \frac{e^{-y} + e^{-cy}}{1+c} + e^{-y}\, \frac{e^{(1-c)y} - 1}{1-c} = \frac{e^{-y} + e^{-cy}}{1+c} + \frac{e^{-cy} - e^{-y}}{1-c}$$
and
$$\tfrac{2}{c} \int_{-\infty}^{+\infty} x\, q(y|x)\, p_X(x)\, dx = \int_{-\infty}^{0} x\, e^{(1+c)x - y}\, dx + \int_{0}^{y} x\, e^{(1-c)x - y}\, dx + \int_{y}^{+\infty} x\, e^{y - (1+c)x}\, dx$$
$$= -e^{-y} \int_{0}^{+\infty} x\, e^{-(1+c)x}\, dx + e^{-y} \int_{0}^{y} x\, e^{(1-c)x}\, dx + \int_{0}^{+\infty} (y + x)\, e^{y - (1+c)(y+x)}\, dx$$
$$= \big( e^{-cy} - e^{-y} \big) \int_{0}^{+\infty} x\, e^{-(1+c)x}\, dx + e^{-y} \int_{0}^{y} x\, e^{(1-c)x}\, dx + y\, e^{-cy} \int_{0}^{+\infty} e^{-(1+c)x}\, dx$$
$$= \frac{e^{-cy} - e^{-y}}{(1+c)^2} + \frac{y\, e^{-cy}}{1+c} + e^{-y} \Big[ \Big( \frac{x}{1-c} - \frac{1}{(1-c)^2} \Big) e^{(1-c)x} \Big]_{0}^{y} = \frac{e^{-cy} - e^{-y}}{(1+c)^2} + \frac{y\, e^{-cy}}{1+c} + \frac{\big((1-c)y - 1\big)\, e^{-cy} + e^{-y}}{(1-c)^2}.$$
These derivations make it possible to express analytically
$$f(y) := \mathbb{E}(X \mid Y = y) = \frac{\int_{-\infty}^{+\infty} x\, q(y|x)\, p_X(x)\, dx}{q_P(y)}.$$
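As a cross-check of these closed-form expressions (our own verification, not in the paper; it uses scipy, and the values $c = 0.5$ and $y = 1.3$ are arbitrary), the two integrals can be evaluated numerically and compared with the formulas above; their ratio gives the MMSE estimate $f(y)$.

    import numpy as np
    from scipy.integrate import quad

    c, y = 0.5, 1.3                 # any c != 1 and y > 0

    g = lambda x: np.exp(-np.abs(y - x) - c * np.abs(x))        # (2/c) * q(y|x) p_X(x)

    den_num, _ = quad(g, -np.inf, np.inf)                       # (2/c) q_P(y), numerically
    num_num, _ = quad(lambda x: x * g(x), -np.inf, np.inf)      # (2/c) * int x q(y|x) p_X(x) dx

    den_cf = (np.exp(-y) + np.exp(-c * y)) / (1 + c) + (np.exp(-c * y) - np.exp(-y)) / (1 - c)
    num_cf = ((np.exp(-c * y) - np.exp(-y)) / (1 + c) ** 2
              + y * np.exp(-c * y) / (1 + c)
              + (((1 - c) * y - 1) * np.exp(-c * y) + np.exp(-y)) / (1 - c) ** 2)

    print(den_num, den_cf)          # should coincide up to quadrature error
    print(num_num, num_cf)
    print(num_cf / den_cf)          # the MMSE estimate f(y) = E(X | Y = y)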
References

[1] Arash Amini, Ulugbek Kamilov, Emrah Bostan, and Michael A. Unser. Bayesian Estimation for Continuous-Time Sparse Stochastic Processes. IEEE Trans. Signal Processing, 61(4):907–920, 2013.
[2] Arash Amini, Michael A. Unser, and Farokh Marvasti. Compressibility of Deterministic and Random Infinite Sequences. IEEE Trans. Information Theory, 59(11):5193–5201, 2011.
[3] Arindam Banerjee, Xin Guo, and Hui Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. Information Theory, 51(7):2664–2669, 2005.
[4] Murat Belge, Misha Kilmer, and Eric Miller. Wavelet domain image restoration with adaptive edge-preserving regularization. IEEE Trans. Image Process., 9(4):597–608, 2000.
[5] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[6] Martin Burger and Felix Lucka. Maximum a posteriori estimates in linear inverse problems with log-concave priors are proper Bayes estimators. Inverse Problems, 30(11):114004–22, October 2014.
[7] Rémi Gribonval. Should Penalized Least Squares Regression be Interpreted as Maximum A Posteriori Estimation? IEEE Trans. Signal Processing, 59(5):2405–2410, 2011.
[8] Rémi Gribonval, Volkan Cevher, and Michael E. Davies. Compressible Distributions for High-Dimensional Statistics. IEEE Trans. Information Theory, 58(8):5016–5034, 2012.
[9] Rémi Gribonval and Pierre Machart. Reconciling "priors" and "priors" without prejudice? In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26 (NIPS), pages 2193–2201, 2013.
[10] Rémi Gribonval and Mila Nikolova. A characterization of proximity operators. https://hal.inria.fr/hal-01835101, July 2018.
[11] T. Helin and M. Burger. Maximum a posteriori probability estimates in infinite-dimensional Bayesian inverse problems. Inverse Problems, 31(8):085009, August 2015.
[12] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Signal Processing. Prentice Hall, 1993.
[13] Abbas Kazerouni, Ulugbek Kamilov, Emrah Bostan, and Michael A. Unser. Bayesian Denoising: From MAP to MMSE Using Consistent Cycle Spinning. IEEE Signal Process. Lett., 20(3):249–252, 2013.
[14] Cécile Louchet and Lionel Moisan. Posterior Expectation of the Total Variation Model: Properties and Experiments. SIAM J. Imaging Sci., 6(4):2640–2684, January 2013.
[15] P. Mathieu, M. Antonini, M. Barlaud, and I. Daubechies. Image coding using wavelet transform. IEEE Trans. Image Process., 1(2):205–220, 1992.
[16] Jean-Jacques Moreau. Proximité et dualité dans un espace Hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.
[17] Mila Nikolova. Model distortions in Bayesian MAP reconstruction.