Accelerating proximal Markov chain Monte Carlo by using an explicit stabilised method
Luis Vargas, Marcelo Pereyra, Konstantinos C. Zygalakis

School of Mathematics, University of Edinburgh, Edinburgh, Scotland
School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, Scotland
Maxwell Institute for Mathematical Sciences, Bayes Centre, 47 Potterrow, Edinburgh, Scotland

March 20, 2020
Abstract
We present a highly efficient proximal Markov chain Monte Carlo methodology to perform Bayesian computation in imaging problems. Similarly to previous proximal Monte Carlo approaches, the proposed method is derived from an approximation of the Langevin diffusion. However, instead of the conventional Euler-Maruyama approximation that underpins existing proximal Monte Carlo methods, here we use a state-of-the-art orthogonal Runge-Kutta-Chebyshev stochastic approximation [2] that combines several gradient evaluations to significantly accelerate its convergence speed, similarly to accelerated gradient optimisation methods. The proposed methodology is demonstrated via a range of numerical experiments, including non-blind image deconvolution, hyperspectral unmixing, and tomographic reconstruction, with total-variation and $\ell_1$-type priors. Comparisons with Euler-type proximal Monte Carlo methods confirm that the Markov chains generated with our method exhibit significantly faster convergence speeds, achieve larger effective sample sizes, and produce lower mean square estimation errors at equal computational budget.

1 Introduction

Imaging sciences study theory, methods, models, and algorithms to solve imaging problems, such as image denoising [29], deblurring [9, 5], compressive sensing reconstruction [34], super-resolution [44], tomographic reconstruction [5], inpainting [52], source separation [30], and phase retrieval [21].

There are currently three main formal paradigms to formulate and solve imaging problems: the variational framework [16], machine learning [6], and the Bayesian statistical framework [42]. In this paper we focus on the Bayesian framework, which is an intrinsically probabilistic paradigm where the data observation process and the available prior knowledge are represented by using statistical models, and where solutions are derived by using inference techniques stemming from Bayesian decision theory [42]. The Bayesian framework is particularly well equipped to address imaging problems in which uncertainty plays an important role, such as medical imaging or remote sensing problems where it is necessary or desirable to quantify the uncertainty in the delivered solutions to inform decisions or conclusions, see, e.g., [39, 41].
The framework is also well adapted to blind, semi-blind, and unsupervised problems involving partially unknown models (e.g., unspecified regularisation parameters or observation operators) [51, 23]. Bayesian model selection techniques also allow the objective comparison of several potential models to analyse the observed imaging data, even in cases where there is no ground truth available [19].

In this paper we focus on the computational aspects of performing Bayesian inferences in imaging problems. Modern Bayesian computation methods suitable for imaging sciences can be broadly grouped in three categories: stochastic Markov chain Monte Carlo (MCMC) methods that are computationally expensive but robust, and which can be applied to a wide range of models and inferences; optimisation methods that are significantly more efficient by comparison, but which are only useful for point estimation and some other specific inferences; and deterministic approximation methods such as variational Bayes and message passing methods, which are efficient and support more complex inferences, but whose approximation errors are difficult to control.

The remainder of the paper is organised as follows. Section 2 introduces the class of Bayesian imaging models considered and recalls existing proximal MCMC methodology. Section 3 presents the proposed accelerated method together with its analysis for Gaussian models. Section 4 demonstrates the methodology on a range of numerical experiments with $\ell_1$ and TV priors. Conclusions and perspectives for future work are reported in Section 5. Proofs are finally reported in Appendices A and B.
2 Bayesian inference in imaging inverse problems

We consider imaging problems involving an unknown image $x \in \mathbb{R}^d$ and some observed data $y \in \mathbb{C}^p$, related to $x$ through a statistical model with likelihood function $p(y|x)$. In particular, we are interested in problems where the recovery of $x$ from $y$ is ill-conditioned or ill-posed (i.e., either the problem does not admit a unique solution that changes continuously with $y$, or there exists a unique solution but it is not stable w.r.t. small perturbations in $y$). For example, problems of the form $y = Ax + w$ with $w \sim N(0, \sigma^2 I_p)$ and $\sigma^2 > 0$, where the observation operator $A \in \mathbb{C}^{p \times d}$ is rank deficient, or problems where $A^\top A$ is full rank but has a poor condition number. As mentioned previously, such problems are ubiquitous in imaging sciences and have been the focus of significant research efforts [16, 31].

In this paper we adopt a Bayesian approach to regularise the estimation problem and deliver meaningful estimates of $x$, as well as uncertainty quantification for the solutions delivered. More precisely, we represent $x$ as a random quantity with prior distribution $p(x)$ promoting expected properties (e.g., sparsity, piecewise-regularity, smoothness, etc.), and base our inferences on the posterior distribution [31]

$$\pi(x) \triangleq p(x|y) = \frac{p(y|x)\,p(x)}{\int_{\mathbb{R}^d} p(y|\tilde x)\,p(\tilde x)\,d\tilde x}, \qquad (2.1)$$

henceforth denoted by $\pi$. We focus on log-concave models of the following form

$$\pi(x) = \frac{e^{-f(x)-g(x)}}{\int_{\mathbb{R}^d} e^{-f(s)-g(s)}\,ds}, \qquad (2.2)$$

where $f : \mathbb{R}^d \to \mathbb{R}$ and $g : \mathbb{R}^d \to (-\infty, \infty]$ are two lower-bounded functions satisfying the following conditions:

1. $f$ is convex and Lipschitz continuously differentiable with constant $L_f$, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|$ for all $x, y \in \mathbb{R}^d$;
2. $g$ is proper, convex, and lower semicontinuous, but potentially non-smooth.

This class of models is widely used in imaging sciences, and includes, for instance, analysis models of the form $f(x) = \|y - Ax\|^2/(2\sigma^2)$ and $g(x) = \theta\|\Psi x\|_\dagger + \iota_S(x)$ for some dictionary or representation $\Psi$, norm or pseudo-norm $\|\cdot\|_\dagger$, and a hard constraint $S \subset \mathbb{R}^d$ on the solution space (for any $S \subset \mathbb{R}^d$, the indicator $\iota_S$ takes value $\iota_S(x) = 0$ if $x \in S$, and $\iota_S(x) = +\infty$ otherwise). Also note that we do not assume that $p(x|y)$ belongs to the exponential family.

As mentioned previously, posterior distributions of the form (2.2) are log-concave, which is an important property for Bayesian inference because it guarantees the existence of all posterior moments and hence of moment-based estimators such as the minimum mean square error (MMSE) estimator. Log-concavity also plays a central role in maximum-a-posteriori (MAP) estimation, given by

$$\hat x_{\mathrm{MAP}} = \operatorname*{arg\,max}_x \pi(x) = \operatorname*{arg\,min}_x f(x) + g(x),$$

which is the main estimation strategy in imaging sciences. The popularity of MAP estimation stems from the fact that it is a convex optimisation problem that can be efficiently solved by using modern proximal splitting optimisation techniques [16]. Some forms of approximate uncertainty quantification can also be computed by using proximal splitting techniques (see [41] and references therein).

However, most Bayesian analyses require using specialised computational statistics techniques to calculate expectations and probabilities w.r.t. $\pi$. For example, computing Bayesian estimators (e.g., MMSE estimation), calibrating unknown model parameters (e.g., regularisation parameters), performing Bayesian model selection and predictive model checks, and reporting (exact) credible regions and hypothesis tests.
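To make the model class (2.2) concrete, the short sketch below evaluates the two potentials for a hypothetical analysis model with a Gaussian likelihood and an $\ell_1$ prior with a non-negativity constraint. The operator A, the weights, and the dimensions are illustrative assumptions rather than quantities used in the paper.

```python
import numpy as np

# Hypothetical analysis model of the form (2.2):
#   f(x) = ||y - A x||^2 / (2 sigma^2)   (smooth, with L_f = ||A||^2 / sigma^2)
#   g(x) = theta * ||x||_1 + i_S(x),     S = non-negative orthant (non-smooth)
rng = np.random.default_rng(0)
d, p, sigma, theta = 64, 32, 0.1, 1.0
A = rng.standard_normal((p, d)) / np.sqrt(p)   # illustrative forward operator
x_true = np.maximum(rng.standard_normal(d), 0.0)
y = A @ x_true + sigma * rng.standard_normal(p)

def f(x):
    """Smooth negative log-likelihood."""
    r = y - A @ x
    return 0.5 * (r @ r) / sigma**2

def grad_f(x):
    return -A.T @ (y - A @ x) / sigma**2

def g(x):
    """Non-smooth negative log-prior: l1 penalty plus hard constraint x >= 0."""
    return theta * np.abs(x).sum() if np.all(x >= 0) else np.inf

# Lipschitz constant of grad f, needed by the stepsize restrictions used later:
L_f = np.linalg.norm(A, 2)**2 / sigma**2
print("L_f =", L_f, "  -log posterior at x_true (up to a constant):", f(x_true) + g(x_true))
```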
From a Bayesian computation viewpoint, such analyses typically require using a high-dimensional Markov chain Monte Carlo (MCMC) method to simulate samples from $x|y$, followed by Monte Carlo integration [40]. Unfortunately, this approach has traditionally been computationally too expensive for wide adoption in imaging sciences, limiting the impact of Bayesian statistics in this field. Alternatively, one can also perform approximate inferences by using deterministic surrogate methods [40]. However, these can exhibit convergence issues and offer few theoretical guarantees, and hence they have not been widely adopted either.

Recent works have sought to address these limitations of Bayesian computation methodology by developing new and highly efficient MCMC methods tailored for imaging sciences, particularly by using techniques from proximal optimisation that are already widely adopted in the field. These so-called proximal MCMC methods [19, 48, 10] have been an important step towards promoting Bayesian imaging techniques, as they are easy to implement, have significantly reduced computing times, and improve theoretical guarantees on the solutions delivered. However, there remain some fundamental features of modern optimisation methodology that have not yet been replicated in proximal MCMC approaches. In particular, modern optimisation methods rely strongly on acceleration techniques to achieve faster convergence rates and improve their robustness to poor conditioning [50]. In this paper, we accelerate proximal MCMC methods to improve their convergence properties.

Proximal MCMC methods are derived from the overdamped Langevin diffusion, which we recall below. For clarity, we first introduce the approach for models that are smooth, and then explain the generalisation to non-smooth models.
Consider the task of generating samples from some generic target density $\bar\pi$ that is continuously differentiable on $\mathbb{R}^d$. Langevin MCMC methods address this task by using the overdamped Langevin stochastic differential equation (SDE), given by

$$dX_t = \nabla\log\bar\pi(X_t)\,dt + \sqrt{2}\,dW_t, \qquad (2.3)$$

where $(W_t)_{t\ge 0}$ is a $d$-dimensional Brownian motion. Under mild regularity assumptions, this SDE has a unique strong solution and admits $\bar\pi$ as unique invariant distribution. Consequently, if we could solve (2.3) and let $t\to\infty$, this would provide Monte Carlo samples from $\bar\pi$ useful for Bayesian computation. This strategy is particularly computationally efficient when $\bar\pi$ is log-concave because in that case $X_t$ converges in distribution to $\bar\pi$ exponentially fast with a good rate [18].

Unfortunately, it is generally not possible to solve (2.3) exactly, and discrete approximations of $X_t$ need to be considered instead. In particular, most algorithms use the Euler-Maruyama (EM) discretisation [33]:

$$X_{n+1} = X_n + \delta\,\nabla\log\bar\pi(X_n) + \sqrt{2\delta}\,Z_{n+1}, \qquad (2.4)$$

where $\delta > 0$ is a given stepsize and $(Z_n)_{n\ge 1}$ is a sequence of i.i.d. $d$-dimensional standard Gaussian random variables. This MCMC method is known as the unadjusted Langevin algorithm (ULA) [43].

Under some regularity assumptions, namely $\bar L$-Lipschitz continuity of $\nabla\log\bar\pi$ and $\delta < 1/\bar L$, the Markov chain $(X_n)_{n\ge 0}$ is ergodic with stationary distribution $\bar\pi_\delta(x)$ close to $\bar\pi$ [18]. Additionally, when $\bar\pi$ is log-concave, ULA inherits the favourable properties of (2.3) and converges to $\bar\pi_\delta(x)$ geometrically fast with good convergence rates, offering an efficient Bayesian computation methodology for large problems [18].

The estimation bias [4] associated with targeting $\bar\pi_\delta(x)$ instead of $\bar\pi$ can be reduced by decreasing $\delta$, and vanishes as $\delta\to 0$. However, decreasing $\delta$ deteriorates the convergence properties of the chain and amplifies the associated non-asymptotic bias and variance. Therefore, to apply ULA to large problems in a computationally efficient way it is necessary to use values of $\delta$ that are close to the stability limit $1/\bar L$, at the expense of some asymptotic bias. Notice that it is also possible to remove the asymptotic bias by combining ULA with a Metropolis-Hastings correction step targeting $\bar\pi$, leading to the so-called Metropolis-adjusted Langevin algorithm (MALA) [43]. This strategy is widely used in computational statistics for medium-sized problems. However, in large problems such as imaging problems, using a Metropolis-Hastings correction may dramatically deteriorate the convergence speed [19].

We now consider the class of models $\pi$ given by (2.2), which are not smooth. Unfortunately, ULA and MALA cannot be directly applied to such models, as they require Lipschitz differentiability of $\log\pi$. Proximal MCMC methods address this difficulty by carefully constructing a smooth approximation $\pi^\lambda$ that by construction satisfies all the regularity conditions required by ULA and MALA, and which can be made arbitrarily close to the original model $\pi$ by tuning a regularisation parameter $\lambda > 0$. This strategy, originally proposed in [38], can be implemented in different ways.
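Before detailing these constructions, note that the ULA recursion (2.4) itself takes only a few lines of code. A minimal sketch for a smooth target, using a toy anisotropic Gaussian chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma_inv = np.diag([1.0, 100.0])          # toy anisotropic Gaussian target
grad_log_pi = lambda x: -Sigma_inv @ x     # gradient of log pi

L_bar = 100.0                              # Lipschitz constant of grad log pi
delta = 0.9 / L_bar                        # stepsize within the stability restriction

def ula(x0, n_iter):
    """Unadjusted Langevin algorithm, Eq. (2.4)."""
    x, chain = x0.copy(), []
    for _ in range(n_iter):
        x = x + delta * grad_log_pi(x) + np.sqrt(2 * delta) * rng.standard_normal(x.size)
        chain.append(x.copy())
    return np.array(chain)

samples = ula(np.zeros(2), 5000)
print("empirical variances:", samples.var(axis=0))  # roughly diag(Sigma) = (1, 0.01)
```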
One such construction, used in [19], replaces the non-smooth term $g$ in (2.2) with its Moreau-Yosida (MY) envelope

$$g^\lambda(x) = \min_{y\in\mathbb{R}^d}\left\{ g(y) + \frac{1}{2\lambda}\|x-y\|^2 \right\},$$

to construct the approximation

$$\pi^\lambda(x) = \frac{e^{-f(x)-g^\lambda(x)}}{\int_{\mathbb{R}^d} e^{-f(s)-g^\lambda(s)}\,ds},$$

which has the following key properties that are useful for Bayesian computation [19]:

- For all $\lambda > 0$, $\pi^\lambda$ defines a proper density on $\mathbb{R}^d$.
- For all $\lambda > 0$, $\pi^\lambda$ is log-concave and Lipschitz continuously differentiable with

$$\nabla\log\pi^\lambda(x) = -\nabla f(x) - \nabla g^\lambda(x) = -\nabla f(x) - \frac{1}{\lambda}\left(x - \mathrm{prox}_g^\lambda(x)\right), \qquad (2.5)$$

with Lipschitz constant $L = L_f + 1/\lambda$, and where for all $x\in\mathbb{R}^d$

$$\mathrm{prox}_g^\lambda(x) = \operatorname*{arg\,min}_{u\in\mathbb{R}^d}\; g(u) + \frac{1}{2\lambda}\|x-u\|^2.$$

- The approximation $\pi^\lambda$ converges to $\pi$ in total-variation norm; i.e., $\lim_{\lambda\to 0}\|\pi^\lambda - \pi\|_{TV} = 0$.

Given the smooth approximation $\pi^\lambda$, we define the auxiliary Langevin SDE

$$dX_t = \nabla\log\pi^\lambda(X_t)\,dt + \sqrt{2}\,dW_t, \qquad (2.6)$$

and derive the MYULA Markov chain by discretising this SDE by the EM method

$$X_{n+1} = X_n - \delta\,\nabla f(X_n) - \frac{\delta}{\lambda}\left(X_n - \mathrm{prox}_g^\lambda(X_n)\right) + \sqrt{2\delta}\,Z_{n+1}. \qquad (2.7)$$

If necessary, the asymptotic bias can then be removed by complementing MYULA with a Metropolis-Hastings step [38], which is useful for benchmarking purposes [19, 12]. Notice that one can also consider other approximations constructed by applying the Moreau-Yosida envelope directly to $f + g$ [38], or by replacing the Moreau-Yosida envelope with a forward-backward envelope [38, 7]. It is also possible to apply the Moreau-Yosida envelope separately to $f$ and $g$ and integrate MYULA (with or without Metropolisation) within an auxiliary-variable Gibbs sampling scheme, see [48].

As mentioned previously, despite being relatively recent, proximal MCMC methods have already been successfully applied to many large-scale inference problems related to imaging sciences [19, 12, 48, 35] and machine learning [22, 46, 10, 11].

A main limitation of ULA, MALA, and their proximal variants is that they are all derived from the EM approximation (2.4) of the Langevin SDE. This approximation is mainly used because it is computationally efficient in high dimensions, it is easy to implement, and it can be rigorously theoretically analysed. However, the EM approximation is not particularly suitable for problems that are ill-conditioned or ill-posed, as its performance is very sensitive to the anisotropy of the target density, which is a common feature of imaging problems. More precisely, in order to be useful for Bayesian computation, the EM approximation of the Langevin SDE (2.6) has to be numerically stable. For MYULA, this requires using a stepsize $\delta < 1/L$ with $L = L_f + 1/\lambda$, where we recall that $L_f$ is the Lipschitz constant of $\nabla f$ and that $\lambda$ controls the quality of the approximation $\pi^\lambda$ of $\pi$. This restriction essentially guarantees that the chain moves slowly enough to follow changes in $\nabla\log\pi^\lambda$ in a numerically stable manner, particularly along directions of fast change. However, this is problematic when $\pi^\lambda$ has some directions or regions of the parameter space that change relatively very slowly, as the chain will struggle to properly explore the solution space and will require a very large number of iterations to converge.
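The MYULA recursion (2.7) only adds a proximal term to the EM step. A sketch for the case $g(x) = \theta\|x\|_1$, whose proximity mapping is the soft-thresholding operator; the quadratic $f$ and the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, theta, sigma = 10, 1.0, 0.5
y = rng.standard_normal(d)

grad_f = lambda x: (x - y) / sigma**2            # f(x) = ||x - y||^2 / (2 sigma^2)
L_f = 1.0 / sigma**2

# prox of g = theta*||.||_1 with parameter lam is soft-thresholding:
prox_g = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam * theta, 0.0)

lam = 1.0 / L_f                                  # MY smoothing parameter
L = L_f + 1.0 / lam                              # Lipschitz constant of grad log pi^lambda
delta = 0.9 / L                                  # stepsize within the MYULA restriction

def myula_step(x):
    """One MYULA iteration, Eq. (2.7)."""
    drift = -delta * grad_f(x) - (delta / lam) * (x - prox_g(x, lam))
    return x + drift + np.sqrt(2 * delta) * rng.standard_normal(d)

x = np.zeros(d)
for _ in range(5000):
    x = myula_step(x)
print("final state:", x[:3])
```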
In imaging models, this slow-convergence regime typically arises when the likelihood $p(y|x)$ has identifiability issues (e.g., if it involves an observation operator $A$ for which $A^\top A$ is badly conditioned or rank deficient), or if we seek to use a small value of $\lambda$ to bring $\pi^\lambda$ close to $\pi$.

To highlight this issue, we report below two simple illustrative experiments where MYULA is applied to a two-dimensional Gaussian distribution. In this case there is no non-smooth term $g$ and the time-step restriction is dictated by the Lipschitz constant of $f$, but the same phenomenon arises in more general models. In the first experiment we consider $\mu_1 = (0, 0)$ and $\Sigma_1 = \mathrm{diag}(1, 10^{-2})$ (i.e., $L_f = 10^2$); whereas in the second experiment we use $\mu_2 = (0, 0)$ and $\Sigma_2 = \mathrm{diag}(1, 10^{-4})$ (i.e., $L_f = 10^4$). The results are presented in Figure 1. Notice that in the first case MYULA explores the distribution very well, showing a good rate of decay in the autocorrelation functions of both components. However, in the second case, MYULA exhibits poor convergence properties as it struggles to explore the first component.

Figure 1: Two-dimensional Gaussian distribution: (a) samples generated by MYULA using the target distribution $N(\mu_1, \Sigma_1)$ with $\delta = 2/(L+\ell) \approx 1.98\times 10^{-2}$, where $L = 1/\sigma_2^2 = 100$ and $\ell = 1/\sigma_1^2 = 1$; and (d) samples generated by MYULA using the target distribution $N(\mu_2, \Sigma_2)$ with $\delta = 2/(L+\ell) \approx 2\times 10^{-4}$, where $L = 1/\sigma_2^2 = 10^4$ and $\ell = 1/\sigma_1^2 = 1$. Autocorrelation functions of the (b),(e) first and (c),(f) second components (i.e., $x_1$ and $x_2$) of the samples generated by MYULA, having $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ as target distributions, respectively.

This limitation of the EM approximation could be partially mitigated by preconditioning the gradient $\nabla\log\pi^\lambda$ by considering a Langevin SDE on an appropriate Riemannian manifold, as recommended in [26], in a spirit akin to natural gradient descent and Newton optimisation methods. The preconditioning procedure proposed in [26] is very effective but too expensive for imaging models because it requires evaluating quantities related to second and third order derivatives of $\log\pi^\lambda$ and performing expensive matrix operations. Conversely, simple procedures such as preconditioning with a pseudo-inverse of the Hessian matrix of the log-likelihood function are computationally efficient but do not typically lead to significant improvements in performance because they do not take into account the geometry of the log-prior. The development of computationally efficient yet effective preconditioning strategies for imaging models is an active research topic, see, e.g., [37, 36].

Moreover, a different strategy to improve the convergence rate of the EM approximation is to substitute both $f$ and $g$ by their regularised envelopes $f^\lambda$ and $g^\lambda$, so that the Lipschitz constant of $\nabla\log\pi^\lambda$ is bounded by $2/\lambda$, at the expense of additional bias. One can then use a single MYULA kernel targeting $\pi^\lambda$, or alternatively a splitting scheme involving two MYULA kernels with $\lambda = \delta$ to separately address $f^\lambda$ and $g^\lambda$, as recommended recently in [48]. That splitting leads to a Markov chain that is by construction numerically stable and potentially much faster than MYULA, at the expense of some further estimation bias.
Note that splitting schemes that combine Gibbs sampling with relaxations are a highly promising direction of research, as they could potentially lead to algorithms with dimension-free convergence rates [49].

It is worth mentioning at this point that one can also consider other dynamics to derive Markov chains with potentially better convergence properties, namely the Hamiltonian dynamic, which leads to the Hamiltonian Monte Carlo (HMC) algorithm [40, 14]. However, HMC uses a Verlet integrator that, despite being superior in other ways, has the same stepsize restrictions as the Euler method and hence also struggles to address problems that are poorly conditioned. Also, HMC uses a Metropolis correction that can be dramatically inefficient in large problems such as imaging problems.

In this paper we propose to fundamentally improve proximal MCMC methods for imaging by using state-of-the-art numerical SDE approximation strategies that significantly outperform the conventional EM scheme. More precisely, we focus on a class of explicit stabilised methods that are specifically designed to deal with the time-step restriction, called stochastic orthogonal Runge-Kutta-Chebyshev methods (SK-ROCK) [2]. The idea, in a nutshell, is to cleverly combine several evaluations of the gradient $\nabla\log\pi^\lambda(x)$ in a way that allows for taking larger time-steps, and thus breaking the stability barrier of MYULA. As mentioned previously, the same strategy can then be applied to other MCMC methods that internally use MYULA (e.g., [48]), or variants of MYULA with other approximations of $\pi$ (e.g., [38, 7]), although this is beyond the scope of this paper and will be investigated in future works.

3 Proposed method

We propose to significantly accelerate Bayesian computation for imaging problems by using the state-of-the-art explicitly stabilised SK-ROCK scheme [2] to approximate the Langevin SDE (2.6) associated with $\pi^\lambda$, instead of the basic EM discretisation scheme that underpins MYULA and other proximal MCMC methods. From a numerical analysis viewpoint, this is a highly advanced Runge-Kutta stochastic integration scheme that extends the deterministic Chebyshev method [1] to SDEs, and uses a damping strategy to stabilise the stochastic term. Crucially, its implementation is straightforward, as it only requires knowledge of the gradient operator $\nabla\log\pi^\lambda(x)$ given by (2.5), which is also used in MYULA. However, unlike MYULA, which uses a single evaluation of $\nabla\log\pi^\lambda(x)$ per iteration, the considered Runge-Kutta scheme performs $s\in\mathbb{N}^*$ evaluations of $\nabla\log\pi^\lambda(x)$ at carefully chosen extrapolated points determined by Chebyshev polynomials. In this regard, the stochastic integration scheme is morally similar to accelerated optimisation methods that also use several gradient evaluations and extrapolation techniques to significantly improve their convergence properties. In fact, the deterministic Runge-Kutta-Chebyshev method was recently shown to have similar theoretical convergence properties to Nesterov's accelerated optimisation algorithms in the case of strongly convex functions [20].

The proposed proximal SK-ROCK method is presented in Algorithm 1 below, where $T_s$ denotes the Chebyshev polynomial of order $s$ of the first kind, defined recursively by $T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$.
Algorithm 1 SK-ROCK algorithm

Set $X_0\in\mathbb{R}^d$, $\lambda > 0$, $n\in\mathbb{N}$, $s\in\mathbb{N}^*$, $\eta = 0.05$
Compute $\ell_s = (s - 0.5)^2\,(2 - 4\eta/3) - 1.5$
Compute $\omega_0 = 1 + \eta/s^2$, $\omega_1 = T_s(\omega_0)/T_s'(\omega_0)$, $\mu_1 = \omega_1/\omega_0$, $\nu_1 = s\,\omega_1/2$, $k_1 = s\,\omega_1/\omega_0$
Choose $\delta\in(0, \delta_s^{\max}]$, where $\delta_s^{\max} = \ell_s/(L_f + 1/\lambda)$
for $i = 0 : n$ do
  $Z_{i+1}\sim N(0, I_d)$
  $K_0 = X_i$
  $K_1 = X_i + \mu_1\delta\,\nabla\log\pi^\lambda(X_i + \nu_1\sqrt{2\delta}\,Z_{i+1}) + k_1\sqrt{2\delta}\,Z_{i+1}$
  for $j = 2 : s$ do
    Compute $\mu_j = 2\omega_1 T_{j-1}(\omega_0)/T_j(\omega_0)$, $\nu_j = 2\omega_0 T_{j-1}(\omega_0)/T_j(\omega_0)$, $k_j = -T_{j-2}(\omega_0)/T_j(\omega_0) = 1 - \nu_j$
    $K_j = \mu_j\delta\,\nabla\log\pi^\lambda(K_{j-1}) + \nu_j K_{j-1} + k_j K_{j-2}$
  end for
  $X_{i+1} = K_s$
end for

The two main parameters of the algorithm are the number of stages $s\in\mathbb{N}^*$ and the step-size $\delta\in(0, \delta_s^{\max}]$. Notice that the range of admissible values for $\delta$ is controlled by $s$: for any $s\in\mathbb{N}^*$, the maximum allowed step-size is given by $\delta_s^{\max} = \ell_s/(L_f + 1/\lambda)$ with $\ell_s = (s - 0.5)^2(2 - 4\eta/3) - 1.5$ and $\eta = 0.05$ [2]. Violating this upper bound leads to a potentially explosive Markov chain. Also note that in the case of $s = 1$ the method reduces to MYULA.

The values of $\delta$ and $s$ are subject to standard bias-variance trade-offs. On the one hand, to optimise the mixing properties of the algorithm one would like to choose $\delta$ as large as possible. MYULA, based on the EM method, requires setting $\delta < \delta^{\max} = 1/(L_f + 1/\lambda)$ for stability, but in SK-ROCK one can in principle take $\delta$ arbitrarily large by increasing the value of $s$. However, this would also increase the asymptotic bias and the computational cost per iteration. In our numerical experiments we found that a good trade-off in terms of bias, variance, and computational cost per iteration is achieved by using a moderate number of stages (of the order of 10 to 15) together with a value of $\delta$ that is close to the maximum allowed step-size $\delta_s^{\max}$. As a general rule for imaging problems, we recommend using $s = 15$ in problems that are strongly log-concave, and $s = 10$ otherwise. Lastly, it is worth mentioning at this point that we also considered other alternatives to the EM scheme, namely the Runge-Kutta scheme of [3], but found that SK-ROCK delivers the best performance for imaging models (the results with alternative schemes are not reported in the paper because of lack of space).

To illustrate the benefits of using the proximal SK-ROCK method instead of MYULA, we repeat the two Gaussian experiments reported in Figure 1 with Algorithm 1. The results are shown in Figure 2, where we have set the number of stages $s$ optimally by using (3.7). Observe that, because the SK-ROCK method is allowed to use a larger stepsize $\delta$ in a stable manner, it produces, for the same computational cost (i.e., number of gradient evaluations), samples that are significantly less correlated than MYULA with respect to the slow component. We also observe in Figure 2 that this allows SK-ROCK to explore the target distribution more accurately.
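For reference, Algorithm 1 can be transcribed into code almost line by line. The sketch below is a direct implementation under the stated parameter choices; the target in the usage line is the second illustrative Gaussian from Figure 1, and the derivative identity $T_s'(\omega_0) = s\,U_{s-1}(\omega_0)$ is used to avoid symbolic differentiation.

```python
import numpy as np

def cheb_TU(s, x):
    """First- and second-kind Chebyshev polynomials T_0..T_s and U_0..U_s at x."""
    T, U = [1.0, x], [1.0, 2.0 * x]
    for _ in range(2, s + 1):
        T.append(2 * x * T[-1] - T[-2])
        U.append(2 * x * U[-1] - U[-2])
    return T, U

def skrock(grad_log_pi_lam, x0, n_iter, s, L, eta=0.05, rng=None):
    """SK-ROCK sampler (a direct transcription of Algorithm 1).
    L = L_f + 1/lambda; the stepsize is set to the maximum delta_s_max."""
    rng = rng or np.random.default_rng()
    ell_s = (s - 0.5) ** 2 * (2.0 - 4.0 * eta / 3.0) - 1.5
    delta = ell_s / L
    w0 = 1.0 + eta / s ** 2
    T, U = cheb_TU(s, w0)
    w1 = T[s] / (s * U[s - 1])           # T_s'(w0) = s * U_{s-1}(w0)
    mu1, nu1, k1 = w1 / w0, s * w1 / 2.0, s * w1 / w0
    x, chain = np.asarray(x0, float).copy(), []
    for _ in range(n_iter):
        z = rng.standard_normal(x.size)
        K_prev = x
        K = x + mu1 * delta * grad_log_pi_lam(x + nu1 * np.sqrt(2 * delta) * z) \
              + k1 * np.sqrt(2 * delta) * z
        for j in range(2, s + 1):
            mu_j = 2.0 * w1 * T[j - 1] / T[j]
            nu_j = 2.0 * w0 * T[j - 1] / T[j]
            k_j = 1.0 - nu_j             # equals -T_{j-2}(w0)/T_j(w0)
            K, K_prev = mu_j * delta * grad_log_pi_lam(K) + nu_j * K + k_j * K_prev, K
        x = K
        chain.append(x.copy())
    return np.array(chain)

# usage on the second Gaussian example of Figure 1 (no non-smooth term, L = L_f = 1e4):
Sigma_inv = np.diag([1.0, 1e4])
chain = skrock(lambda u: -Sigma_inv @ u, np.zeros(2), n_iter=2000, s=16, L=1e4)
print("empirical variances:", chain[500:].var(axis=0))  # roughly (1, 1e-4), up to bias
```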
Figure 2: Two-dimensional Gaussian distribution: (a) samples generated by the SK-ROCK algorithm ($s = 2$) using the target distribution $N(\mu_1, \Sigma_1)$ with $\delta \approx 4.8\times 10^{-2}$, and (d) samples generated with $s = 16$ using the target distribution $N(\mu_2, \Sigma_2)$ with $\delta \approx 4.8\times 10^{-2}$. Autocorrelation functions of the (b),(e) first and (c),(f) second components (i.e., $x_1$ and $x_2$) of the samples generated by the SK-ROCK algorithm, having $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ as target distributions, respectively.

3.1 Convergence analysis for Gaussian models

To the best of our knowledge, it is not possible to establish general complexity results for Runge-Kutta-Chebyshev methods by using existing analysis techniques, and we are currently investigating new bespoke techniques to study SK-ROCK. This is an important difference w.r.t. the EM scheme used in MYULA, for which there are detailed non-asymptotic convergence results available that can be used to characterise its computational complexity [18]. Nevertheless, it is possible to get an intuition for the computational complexity of SK-ROCK by theoretically analysing its convergence properties for a $d$-dimensional Gaussian target distribution with density $\pi(x)\propto\exp(-0.5\,x^\top\Sigma^{-1}x)$ and $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$. More precisely, we study the convergence of SK-ROCK in the 2-Wasserstein distance, as a function of the number of gradient evaluations and the condition number $\kappa = \sigma_{\max}^2/\sigma_{\min}^2$, and compare it with MYULA. This is achieved by analysing in full generality the numerical solution of the Langevin SDE associated with $\pi$, given by

$$dX_t = -\Sigma^{-1}X_t\,dt + \sqrt{2}\,dW_t, \qquad (3.1)$$

by a one-step numerical integrator, which yields (in general) a recurrence of the form

$$X^i_{n+1} = R(z_i)\,X^i_n + \sqrt{2\delta}\,\tilde R(z_i)\,\xi^i_{n+1}, \qquad \xi^i_{n+1}\sim N(0, 1), \qquad (3.2)$$

where $z_i = -\delta/\sigma_i^2$ and $X_0 = (x_0^1, \dots, x_0^d)^\top$ is a deterministic initial condition. For the EM scheme used in MYULA we have $R(z) = 1 + z$ and $\tilde R(z) = 1$, and for SK-ROCK we have that [2]

$$R(z) = \frac{T_s(\omega_0 + \omega_1 z)}{T_s(\omega_0)}, \qquad \tilde R(z) = \frac{U_{s-1}(\omega_0 + \omega_1 z)}{U_{s-1}(\omega_0)}\left(1 + \frac{\omega_1 z}{2}\right), \qquad (3.3)$$

where $T_s, U_s$ are Chebyshev polynomials of the first and second kind respectively, and $\omega_0 = 1 + \eta/s^2$, $\omega_1 = T_s(\omega_0)/T_s'(\omega_0)$.

By using the fact that Gaussian distributions are closed under linear transformations, and assuming that the initial condition $X_0$ is deterministic, we derive the distribution of $X_n$ for any $\delta > 0$ and obtain the following general result that holds for the EM (MYULA) method and for SK-ROCK. The proof is reported in Appendix A.
Proposition 3.1. Let $\pi(x)\propto\exp(-0.5\,x^\top\Sigma^{-1}x)$ with $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$, and let $Q^n$ be the probability measure associated with $n$ iterations of the generic Markov kernel (3.2). Then the 2-Wasserstein distance between $\pi$ and $Q^n$ is given by

$$W_2^2(\pi; Q^n) = \sum_{i=1}^d \left( D_n(z_i, x_0^i) + B_n(z_i, \sigma_i) \right), \qquad (3.4)$$

where

$$D_n(z, u) = \left(R(z)\right)^{2n} u^2, \qquad B_n(z, u) = \left[ u - \sqrt{2\delta}\,\tilde R(z)\left( \frac{1 - (R(z))^{2n}}{1 - (R(z))^2} \right)^{1/2} \right]^2.$$

In addition, the following bound holds

$$W_2(\pi; Q^{n+1}) \le W_2(\pi; \tilde\pi) + C\,W_2(\tilde\pi, Q^n), \qquad (3.5)$$

where

$$\tilde\pi = N\!\left(0,\; \mathrm{diag}\left\{ \frac{2\delta\,(\tilde R(z_i))^2}{1 - (R(z_i))^2} \right\}_{i=1}^d \right)$$

is the numerical invariant measure and

$$C = \max_{1\le i\le d} |R(z_i)|. \qquad (3.6)$$

The bound (3.5) can now be used to compare the EM and the SK-ROCK method in terms of how many gradient evaluations are required to achieve $W_2(\pi; Q^n) < \varepsilon$ for some desired accuracy level $\varepsilon > 0$. We see that the $W_2$ distance between $\pi$ and $Q^n$ involves two terms. The first term $W_2(\pi; \tilde\pi)$ relates directly to the asymptotic bias of the method [4] (recall that, without a Metropolis correction step, any generic approximation of (3.1) will have some asymptotic bias because it will not exactly converge to $\pi$). The second term $C\,W_2(\tilde\pi, Q^n)$ relates to the convergence of the chain to the stationary distribution $\tilde\pi$, with the constant $C$ controlling the convergence rate. In imaging problems, the computational complexity is usually largely dominated by the second term in (3.5) because of the dimensionality involved.

Figure 3: Wasserstein distance bounds, Gaussian analysis: Minimum number of gradient evaluations of the EM and SK-ROCK methods needed to achieve $W_2(\pi; Q^n) < \varepsilon\,W_2(\pi; Q^0)$, given different condition numbers $\kappa$, for two accuracy levels $\varepsilon$ (panels (a) and (b)).

For the case of the EM (MYULA) method it is known [17] that, with a suitable choice of $\delta$, the number of gradient evaluations that one needs to take in order to achieve $W_2(\pi; Q^n) < \varepsilon$ is of order $O(\kappa)$, where we recall that $\kappa = \sigma_{\max}^2/\sigma_{\min}^2$ is the condition number of $\Sigma$. For SK-ROCK, the number of gradient evaluations depends on the choice of $s$ and $\delta$. Our focus is on problems where $\kappa$ is large, where the optimal performance is achieved by minimising $C$ by setting the number of internal stages $s$ of each step to be

$$s = \left\lceil \sqrt{\frac{\eta\,(\kappa - 1)}{2}} \right\rceil, \qquad (3.7)$$

with $\eta = 0.05$, and

$$\delta = \frac{\omega_0 - 1}{\omega_1\,\ell}, \qquad \ell = \frac{1}{\sigma_{\max}^2}, \qquad (3.8)$$

so that $C \approx (\sqrt\kappa - 1)/(\sqrt\kappa + 1)$ (see [20] and Appendix B for details). In that case, and under the assumption that $W_2(\pi, \tilde\pi)\ll\varepsilon$ so that $W_2(\pi; Q^n)$ is dominated by the term $C\,W_2(\tilde\pi, Q^n)$ related to convergence to $\tilde\pi$, we observe that the number of gradient evaluations required to achieve $W_2(\pi; Q^n) < \varepsilon$ is of the order of $O(\sqrt\kappa)$ instead of $O(\kappa)$, similarly to the behaviour of accelerated algorithms in optimisation [13]. These convergence results are illustrated in Figure 3, where we plot the number of gradient evaluations required to achieve $W_2(\pi; Q^n) < \varepsilon$ as a function of the condition number $\kappa$ for the EM method and for SK-ROCK, where $\pi$ is a $d$-dimensional Gaussian distribution with mean zero and covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$, with decreasing diagonal elements uniformly spread between $\sigma_1^2 = 1$ and $\sigma_d^2 = 1/\kappa$.

One can also simplify the non-asymptotic $W_2$ results of Appendix A to obtain non-asymptotic results for the estimation bias of the EM and SK-ROCK methods for the mean of Gaussian target densities (this is a weaker analysis than convergence in $W_2$).
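The quantities in Proposition 3.1 are straightforward to evaluate numerically. The sketch below compares the squared $W_2$ distance (3.4) of EM and SK-ROCK on an ill-conditioned two-dimensional Gaussian at an equal budget of gradient evaluations; the budget, initial point, and stepsizes are illustrative assumptions consistent with the choices discussed above.

```python
import numpy as np

def cheb_TU(s, x):
    T, U = [1.0, x], [1.0, 2.0 * x]
    for _ in range(2, s + 1):
        T.append(2 * x * T[-1] - T[-2])
        U.append(2 * x * U[-1] - U[-2])
    return T, U

def R_skrock(z, s, eta=0.05):
    """Stability functions R and R~ of (3.3)."""
    w0 = 1.0 + eta / s ** 2
    T0, U0 = cheb_TU(s, w0)
    w1 = T0[s] / (s * U0[s - 1])
    Tz, Uz = cheb_TU(s, w0 + w1 * z)
    return Tz[s] / T0[s], (Uz[s - 1] / U0[s - 1]) * (1.0 + w1 * z / 2.0)

def w2_sq_mode(n, delta, sig2, x0, R, Rt):
    """One diagonal mode of the squared W2 distance (3.4)."""
    D = R ** (2 * n) * x0 ** 2
    var_n = 2.0 * delta * Rt ** 2 * (1.0 - R ** (2 * n)) / (1.0 - R ** 2)
    return D + (np.sqrt(sig2) - np.sqrt(var_n)) ** 2

sig2 = np.array([1.0, 1e-4])       # target variances, so kappa = 1e4
x0 = np.array([3.0, 3.0])          # deterministic initial condition
budget = 1600                      # gradient evaluations for both methods

# EM with the MYULA-type stepsize delta = 1/L:
d_em = 1.0 / 1e4
w2_em = sum(w2_sq_mode(budget, d_em, v, u, 1.0 - d_em / v, 1.0)
            for v, u in zip(sig2, x0))

# SK-ROCK with s = 16 stages (so budget/s iterations) and delta from (3.8):
s, eta = 16, 0.05
w0 = 1.0 + eta / s ** 2
T0, U0 = cheb_TU(s, w0)
w1 = T0[s] / (s * U0[s - 1])
d_sk = (w0 - 1.0) / w1             # here ell = 1/sigma_max^2 = 1
w2_sk = sum(w2_sq_mode(budget // s, d_sk, v, u, *R_skrock(-d_sk / v, s))
            for v, u in zip(sig2, x0))

print("W2^2 at equal budget   EM: %.3f   SK-ROCK: %.5f" % (w2_em, w2_sk))
```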
As in the case of the $W_2$ analysis, the number of gradient evaluations needed to attain a prescribed non-asymptotic bias for the mean is of order $O(\sqrt\kappa)$ for SK-ROCK, whereas it is of order $O(\kappa)$ for the EM method. Both methods are asymptotically unbiased for the mean for Gaussian models.

We emphasise at this point that there are situations where one would not observe any acceleration by using SK-ROCK, namely situations in which a very accurate solution is required and the bound (3.5) is dominated by the asymptotic bias term $W_2(\pi, \tilde\pi)$. In that case, instead of using MYULA or SK-ROCK with a very small $\delta$, we would recommend using the P-MALA method described in [38], which combines an EM approximation with a MH correction.

3.2 Mean-square stability analysis

We conclude this section by discussing the mean-square stability properties of SK-ROCK and the EM method. In particular, we consider the following test equation, widely used in the numerical analysis literature [27, 28] to benchmark SDE solvers:

$$dX(t) = \gamma X(t)\,dt + \mu X(t)\,dW(t), \qquad X(0) = 1, \qquad (3.9)$$

where $\gamma, \mu\in\mathbb{R}$, which has the solution $X(t) = \exp[(\gamma - \mu^2/2)t + \mu W(t)]$. It is easy to show using Ito calculus that when $2\gamma + \mu^2 < 0$,

$$\lim_{t\to\infty}\mathbb{E}(|X(t)|^2) = 0.$$

We want to understand for what range of time-steps $\delta$ a numerical discretisation $X_n$ of (3.9) behaves in a similar manner as $n\to\infty$, i.e., $\mathbb{E}(|X_n|^2)\to 0$. In the case of EM one has

$$X_{n+1} = X_n + \delta\gamma X_n + \sqrt\delta\,\mu X_n Z_{n+1},$$

and hence

$$\mathbb{E}(|X_{n+1}|^2) = R(p, q)\,\mathbb{E}(|X_n|^2), \qquad R(p, q) = (1 + p)^2 + q^2, \qquad p = \delta\gamma,\; q = \sqrt\delta\,\mu.$$

In order to have $\mathbb{E}(|X_n|^2)\to 0$ one needs $R(p, q) < 1$. We visualise the values of admissible $(p, q)$ for the EM method in Figure 4(a), where we can see that only a very small portion of the true mean-square stability domain ($p + q^2/2 < 0$) is covered by it (anything on the left-hand side of the dotted line in Figure 4(a)-(b) belongs to the true stability domain). This implies that when one or both of the parameters $\gamma, \mu$ are large, one needs to choose a very small $\delta$ in order to be stable (for example, when $\mu = 0$ one recovers the stability condition $\delta < -2\gamma^{-1}$ for the Langevin SDE). In the case of SK-ROCK one has $R(p, q) = R(p)^2 + \tilde R(p)^2 q^2$, where $R$ and $\tilde R$ are given by (3.3). Similarly to the case of the EM method, we plot the mean-square stability domain of SK-ROCK in Figure 4(b). As we can see, a significantly larger portion of the true mean-square stability domain is now covered when compared to the EM method. One can show, using the properties of Chebyshev polynomials [2], that for SK-ROCK the coverage of the mean-square stability domain increases quadratically in $s$; i.e., if $(p, q)\in\{p + q^2/2 < 0\} \cap \{|p| < C(\eta)s^2\}$ then $R(p, q) < 1$ for the SK-ROCK method. In contrast, if for comparison one considers $s$ steps of the EM method, the corresponding coverage of the mean-square stability domain is only linear in $s$. This means that for the same number of gradient evaluations $s$, one can choose a much larger time-step $\delta$ for SK-ROCK and still integrate equation (3.9) in a stable manner. The spikes observed in Figure 4(b) at specific values of $p$ correspond to roots of the polynomial $\tilde R(p)$ defined in (3.3); these are determined by the values of $s$ and $\eta$, and by the roots of the Chebyshev polynomial of the second kind $U_{s-1}$.
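A quick numerical check of these stability claims, under the same damping assumption $\eta = 0.05$, is to evaluate $R(p, q)$ along an interior ray $q^2 = -1.9p$ of the true domain and see how far each method remains mean-square stable:

```python
import numpy as np

def cheb_TU(s, x):
    T, U = [1.0, x], [1.0, 2.0 * x]
    for _ in range(2, s + 1):
        T.append(2 * x * T[-1] - T[-2])
        U.append(2 * x * U[-1] - U[-2])
    return T, U

def ms_factor_em(p, q):
    """Mean-square amplification factor of EM for the test equation (3.9)."""
    return (1.0 + p) ** 2 + q ** 2

def ms_factor_skrock(p, q, s=10, eta=0.05):
    w0 = 1.0 + eta / s ** 2
    T0, U0 = cheb_TU(s, w0)
    w1 = T0[s] / (s * U0[s - 1])         # T_s'(w0) = s * U_{s-1}(w0)
    Tz, Uz = cheb_TU(s, w0 + w1 * p)
    R = Tz[s] / T0[s]
    Rt = (Uz[s - 1] / U0[s - 1]) * (1.0 + w1 * p / 2.0)
    return R ** 2 + Rt ** 2 * q ** 2

# walk along the interior ray q^2 = -1.9*p of the true domain (p + q^2/2 < 0)
for p in [-0.05, -1.0, -10.0, -50.0, -150.0]:
    q = np.sqrt(-1.9 * p)
    print("p=%7.2f   EM stable: %5s   SK-ROCK(s=10) stable: %s"
          % (p, ms_factor_em(p, q) < 1, ms_factor_skrock(p, q) < 1))
```

Running this shows EM stable only for very small |p|, while SK-ROCK with s = 10 remains stable far into the domain, in line with the quadratic-in-s coverage discussed above.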
4 Numerical experiments

In this section we demonstrate the proposed SK-ROCK proximal MCMC methodology with a range of numerical experiments related to image deconvolution, tomographic reconstruction, and hyperspectral unmixing. We have selected these experiments to represent a wide variety of configurations in terms of ill-posedness and ill-conditioning, strict and strong log-concavity, and dimensionality of $y$ and $x$. Following our previous recommendation, in the experiments related to image deconvolution and hyperspectral unmixing the model is strongly log-concave so we use $s = 15$, whereas for the tomography experiment we use $s = 10$. We report comparisons with the MYULA method [19] to highlight the benefits of using the SK-ROCK discretisation as opposed to the conventional EM discretisation used in Langevin and Hamiltonian algorithms [40], and because MYULA underpins other proximal MCMC algorithms such as the auxiliary Gibbs sampler of [48].

To make the comparisons fair, in all experiments we use the same number of gradient (and proximal operator) evaluations for MYULA and SK-ROCK, and compare their computational efficiency in several ways (the efficiency of an MCMC method is not an absolute quantity as it depends on the estimator considered).

Figure 4: Mean-square stability domains for (a) EM and (b) SK-ROCK (with $s = 10$, $\eta = 0.05$) in the $p$-$q$ plane; the EM region is shown in dark grey and the SK-ROCK region in light grey. The dashed line represents the upper boundary of the true mean-square stability domain.

Because our aim is to illustrate the performance of SK-ROCK in Bayesian imaging problems, here we use the MYULA and SK-ROCK samples to compute the following quantities: 1) the minimum mean square error solution given by the posterior mean $\mathbb{E}(x|y)$, which is a classic image point estimator; 2) the marginal posterior variances or standard deviations for the image pixels, which provide an indication of the performance of the methods in uncertainty quantification tasks; 3) the effective sample size (ESS) of the fastest mixing component of the chain, calculated in stationarity; and 4) the ESS of the slowest mixing component of the chain, also calculated in stationarity. These fast and slow components correspond to the one-dimensional subspaces where the Markov chains achieve their highest and lowest convergence rates respectively, and which we have identified via an estimate of the first and last eigenvectors of the sample posterior covariance. We choose to report ESS values because these are intuitive quantities that are directly related to the variance of the Monte Carlo estimators, and hence provide an indication of the accuracy of the methods, up to estimation bias.

In addition to reporting estimates, we use autocorrelation plots to visually compare the convergence properties of both methods (again, we report the autocorrelation function for the fastest and the slowest components of the Markov chains). We also show the evolution of the estimation MSE across iterations, and display the estimates of the marginal (pixelwise) standard deviations. These latter are useful for illustrating the differences in the performance of the methods, as second order moments are more difficult to estimate by Monte Carlo integration than the posterior mean.

Notice that because the methods are compared at equal computational budget they do not produce the same number of samples, as their complexity per iteration is different. More precisely, if the MYULA chain has $n$ samples, then the SK-ROCK chain has only $n/s$ samples, which is considerably lower.
However, experiments show that SK-ROCK usually delivers higher ESS values because of its superior convergence properties. Recall that $\mathrm{ESS} = n\{1 + 2\sum_k \rho(k)\}^{-1}$, where $n$ is the total number of samples and $\sum_k \rho(k)$ is the sum of the $K$ monotone sample autocorrelations, which we estimated with the initial monotone sequence estimator [25]. Note that the computation of ESS values is well-posed because $p(x|y)$ is log-concave; if $p(x|y)$ were heavy-tailed or multi-modal then we would need to consider robust efficiency indicators [24]. Similarly, to make the comparison of autocorrelation plots fair with regards to computational complexity, in all autocorrelation plots we apply a 1-in-$s$ thinning to the MYULA chain to artificially boost its autocorrelation function decay rate by a factor of $s$.

4.1 One-dimensional illustrative examples

We start our numerical experiments by studying two simple one-dimensional distributions, namely the Laplace distribution and the uniform distribution on $[-1, 1]$, for which we can also perform computations exactly. Since both of these distributions are not Lipschitz differentiable, we employ the corresponding Moreau-Yosida approximation using a very small value of $\lambda$ to bring $\pi^\lambda$ very close to $\pi$ and deliver a good approximation. This implies that the largest stepsize $\delta$ that can be used for MYULA, $\delta^{\max} = 1/(L_f + 1/\lambda) \le \lambda$, is dramatically small. We set $\delta$ close to this limit for MYULA and run the corresponding chain for a fixed budget of iterations, to create a situation where MYULA struggles to deliver a good approximation and that highlights the superior performance of SK-ROCK.

For SK-ROCK we use $s = 15$ and set $\delta$ as explained in Algorithm 1. Notice that we choose the (regularised) Laplace and uniform distributions to illustrate the performance of the methods in two different scenarios: the regularised Laplace distribution is strongly log-concave near the mode and only strictly log-concave in the tails, which is problematic for the Langevin diffusion because the gradient remains constant as $|x|$ grows, whereas the regularised uniform distribution is flat over $[-1, 1]$, and hence has most of its mass in regions where the gradient is zero, and is then strongly log-concave in the tails.

Figures 5 and 6 display the histogram approximations of the distributions obtained with the two methods, as well as the autocorrelation functions of the generated Markov chains. Observe that in both cases SK-ROCK significantly outperforms MYULA, which struggles to deliver a good approximation due to the stepsize limitation and the limited number of iterations (this phenomenon is particularly clearly captured by the difference in decay speed in the autocorrelation plots). These results are quantitatively summarised in Tables 1 and 2 respectively, where we highlight that SK-ROCK delivers an ESS that is many times larger than MYULA's, while also achieving higher accuracy as measured by the Kullback-Leibler (KL) divergence between the empirical distribution and $\pi^\lambda$.

Table 1: Values of the stepsize $\delta$, effective sample sizes (ESS), and KL-divergence of the EM (MYULA) and SK-ROCK algorithms for the one-dimensional Laplace distribution. Rows: MYULA, SK-ROCK with $s = 10$, and SK-ROCK with $s = 15$; the last column reports the speed-up w.r.t. MYULA.

Table 2: Values of the stepsize $\delta$, effective sample sizes (ESS), and KL-divergence of the EM (MYULA) and SK-ROCK algorithms for the one-dimensional uniform distribution. Rows: MYULA, SK-ROCK with $s = 10$, and SK-ROCK with $s = 15$; the last column reports the speed-up w.r.t. MYULA.
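The ESS figures reported in the tables follow the definition recalled above. The sketch below implements the estimator with a simplified version of the initial monotone sequence truncation in the spirit of [25]; the AR(1) chain in the usage lines is an illustrative assumption with a known theoretical ESS.

```python
import numpy as np

def ess(x):
    """Effective sample size n / (1 + 2*sum_k rho(k)), truncating the sum with a
    simplified initial positive/monotone sequence rule in the spirit of [25]."""
    n = len(x)
    xc = x - x.mean()
    f = np.fft.rfft(xc, 2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n] / n    # FFT-based autocovariances
    rho = acov / acov[0]
    s, prev = 0.0, np.inf
    for k in range(1, n - 1, 2):
        pair = rho[k] + rho[k + 1]                 # paired autocorrelations
        if pair < 0.0:                             # stop at first negative pair
            break
        pair = min(pair, prev)                     # enforce monotone decrease
        s += pair
        prev = pair
    return n / (1.0 + 2.0 * s)

# check on an AR(1) chain, whose ESS is approximately n*(1-phi)/(1+phi):
rng = np.random.default_rng(3)
phi, n = 0.95, 50000
chain = np.zeros(n)
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + np.sqrt(1 - phi ** 2) * rng.standard_normal()
print("ESS =", round(ess(chain)), "  theory ~", round(n * (1 - phi) / (1 + phi)))
```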
For completeness, we also report the results obtained using SK-ROCK with $s = 10$.

It is worth emphasising at this point that we could improve the ESS performance of both methods by increasing the value of $\lambda$, at the expense of some additional bias. In the case of the uniform distribution this would lead to a considerable number of samples outside the true support $[-1, 1]$.
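In these two examples the proximal operators are explicit, so the gradient (2.5) of the smoothed model is available in closed form. A sketch, with $\lambda$ treated as a free illustrative value:

```python
import numpy as np

lam = 1e-2   # Moreau-Yosida parameter (illustrative value)

# Laplace target: g(x) = |x| (and f = 0). Its prox is soft-thresholding.
prox_laplace = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Uniform target on [-1,1]: g = indicator of [-1,1]. Its prox is the projection.
prox_uniform = lambda x: np.clip(x, -1.0, 1.0)

# gradient (2.5) of the smoothed log-density (f = 0 in both cases):
grad_log_laplace = lambda x: -(x - prox_laplace(x)) / lam
grad_log_uniform = lambda x: -(x - prox_uniform(x)) / lam

# e.g. the smoothed uniform has zero gradient inside the support and a
# strongly concave quadratic tail outside it:
for x in [-2.0, -0.5, 0.0, 1.5]:
    print(x, grad_log_uniform(x))
```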
Figure 5: One-dimensional Laplace distribution: Histograms computed with (a) the $n$ samples generated by MYULA and (b) the $n/s$ samples generated by SK-ROCK from the approximated Laplace distribution, using a small approximation parameter $\lambda$ and $s = 15$ for the SK-ROCK method. (c) Autocorrelation functions of the samples.
Figure 6: One-dimensional uniform distribution: Histograms computed with (a) the $n$ samples generated by MYULA and (b) the $n/s$ samples generated by SK-ROCK from the approximated uniform distribution, using a small approximation parameter $\lambda$ and $s = 15$ for the SK-ROCK method. (c) Autocorrelation functions of the samples.

4.2 Image deconvolution with total-variation prior

We now consider a non-blind image deconvolution problem, where we seek to recover a high-resolution image $x\in\mathbb{R}^d$ from a blurred and noisy observation $y = Hx + \epsilon$, where $H$ is a known blur operator and $\epsilon\sim N(0, \sigma^2 I_d)$. This problem is ill-conditioned (i.e., $H$ is nearly singular), thus yielding highly noise-sensitive solutions. To make the estimation problem well-posed, we use a total-variation norm prior that promotes solutions with spatial regularity. The resulting posterior distribution is given by

$$p(x|y) \propto \exp\left( -\|y - Hx\|^2/(2\sigma^2) - \beta\,TV(x) \right), \qquad (4.1)$$

where $TV(x)$ represents the total-variation pseudo-norm [45, 15], and $\sigma, \beta\in\mathbb{R}_+$ are model hyper-parameters that we assume fixed (in this experiment the value of $\beta$ was determined using the method of [23]).

Figure 7 presents an experiment with the cameraman test image of size $d = 256\times 256$ pixels, depicted in Figure 7(a). Figure 7(b) shows an artificially blurred and noisy observation $y$, generated by using a uniform blur and a noise level related to a blurred signal-to-noise ratio of 40 dB. We use MYULA and SK-ROCK to draw Monte Carlo samples from (4.1) using $\lambda = L_f^{-1}$. To make the comparison fair, we generate $n$ samples using MYULA and $n/s$ samples using SK-ROCK with $s = 15$. We then use the generated samples to compute two quantities: 1) the minimum mean square error (MMSE) estimator of $x|y$, given by the posterior mean; and 2) the pixel-wise (marginal) posterior standard deviation, which provides an indication of the level of confidence in each pixel value, as measured by the model. This quantity is useful to highlight features in the image that are difficult to accurately determine; in image deconvolution problems these are the exact locations of edges and contours in the image. Notice that computing standard deviations requires computing second order statistical moments, which is more difficult than estimating the posterior mean, and hence requires a larger number of effective samples to produce stable estimates.

Observe in Figures 7(c)-(f) that while the estimates of the posterior mean obtained with MYULA and SK-ROCK are visually similar, the estimates of the pixel-wise standard deviations obtained with SK-ROCK are noticeably more accurate and in agreement with the results obtained by sampling the true posterior with an asymptotically unbiased Metropolised algorithm, see [38, Example 4.1]. In particular, the standard deviations estimated with SK-ROCK accurately capture the uncertainty in the location of the contours in the image, whereas MYULA produces very noisy results as it struggles to estimate second order moments because of the stepsize limitation and limited computation budget (with a sufficiently large number of iterations, MYULA would produce similar results to SK-ROCK).

Moreover, to rigorously analyse the convergence properties of the two methods and compute autocorrelation functions, we generated $n$ samples with MYULA and $n/s$ samples with SK-ROCK ($s = 15$).
We then used these samples to determine the fastest and slowest components of each chain and measured their autocorrelation functions. (The chain's slowest (fastest) component was identified by computing the approximated singular value decomposition of the chain's covariance matrix and selecting the component of the samples with the largest (smallest) singular value.) We also computed trace plots for the chains by using $T(x) = \log\pi^\lambda(x|y)$ as scalar statistic, which is particularly interesting because it determines the typical set of $x|y$ [39]. These trace plots clearly illustrate how the methods behave during their transient regime, and then how they behave once the chains have converged to the typical set. (The statistic $T(x) = \log p(x|y)$ is very useful for analysing the bias of high-dimensional log-concave distributions because these concentrate sharply on the typical set $T(x)\approx\mathbb{E}(T(X))$ [39].)

Figure 8(a) shows the convergence of the Markov chains to the typical set $\{x : T(x)\approx\mathbb{E}[T(x)|y]\}$. Moreover, Figure 8(b) shows the last samples of the chains (again with a 1-in-$s$ thinning for MYULA). Additionally, we have included the summary statistic $\mathbb{E}(T(X))$ calculated by a very long run of the P-MALA algorithm [38], which targets (4.1) exactly, in order to study the bias of the methods. We can see that, for this experiment, the bias of SK-ROCK is slightly increased in comparison to MYULA; however, it has significantly better mixing properties that result in a better exploration of the typical set. Lastly, the superior convergence properties of SK-ROCK are also clearly illustrated by the autocorrelation plots of Figure 8(c), which show the autocorrelation functions for the slowest components of the chains, and where again we observe a dramatic improvement in decay rate (we have again used a 1-in-$s$ thinning for MYULA for fair comparison). Table 3 reports the associated ESS values for this experiment, where we note that SK-ROCK with $s = 15$ outperforms MYULA by a factor of 21.77 in terms of computational efficiency for the slowest component.
Figure 7: Cameraman experiment: (a) Original image of dimension $256\times 256$ pixels; (b) blurred and noisy observation with SNR = 40 dB; (c) posterior mean estimated from the MYULA samples and (d) from the SK-ROCK samples ($s = 15$); (e) pixel-wise standard deviation estimated from the MYULA samples and (f) from the SK-ROCK samples ($s = 15$).
Figure 8: Cameraman experiment: (a) Convergence to the typical set of the posterior distribution (4.1) for the first MYULA samples and the first SK-ROCK ($s = 15$) samples. (b) Last values of $\log\pi^\lambda(x)$. (c) Autocorrelation function for the slowest component.

Table 3: Cameraman experiment: Summary of the results after generating $n$ samples with MYULA and $n/s$ samples with SK-ROCK ($s = 15$), at equal computing time per method. Columns: stepsize $\delta$, ESS of the slowest and fastest components, and speed-up w.r.t. MYULA for the slowest and fastest components. Rows: MYULA, SK-ROCK ($s = 10$), and SK-ROCK ($s = 15$); the speed-up of SK-ROCK ($s = 15$) for the slowest component is 21.77.
We conclude this experiment by comparing the two methods in terms of the estimation MSE against the true image. Figure 9 shows the evolution of the estimation error for the MMSE solution, as estimated by MYULA and SK-ROCK, as a function of the number of gradient and proximal operator evaluations. Again, observe that the acceleration properties of SK-ROCK lead to a dramatic improvement in convergence speed, and consequently to a significantly more accurate computation of the MMSE estimator for a given computational budget.
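The gradient of the smooth term in (4.1) is cheap to evaluate when H is modelled as a circular convolution, since Hx and its adjoint reduce to FFTs. A sketch; the uniform kernel size and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma = 256, 0.5
x_true = rng.random((N, N))

# uniform blur modelled as circular convolution, applied via the FFT
k = np.zeros((N, N))
k[:9, :9] = 1.0 / 81.0                  # 9x9 uniform kernel (illustrative size)
k = np.roll(k, (-4, -4), axis=(0, 1))   # centre the kernel
K = np.fft.fft2(k)

H = lambda x: np.real(np.fft.ifft2(K * np.fft.fft2(x)))
Ht = lambda x: np.real(np.fft.ifft2(np.conj(K) * np.fft.fft2(x)))

y = H(x_true) + sigma * rng.standard_normal((N, N))

def grad_f(x):
    """Gradient of f(x) = ||y - Hx||^2 / (2 sigma^2)."""
    return Ht(H(x) - y) / sigma**2

# Lipschitz constant of grad f: largest squared singular value of H over sigma^2
L_f = np.max(np.abs(K))**2 / sigma**2
print("L_f =", L_f)

def tv(x):
    """Isotropic total-variation pseudo-norm appearing in the prior of (4.1)."""
    dx = np.diff(x, axis=0, append=x[-1:, :])
    dy = np.diff(x, axis=1, append=x[:, -1:])
    return np.sum(np.sqrt(dx**2 + dy**2))

print("TV(x_true) =", tv(x_true))
```

The non-smooth TV term has no closed-form proximity mapping, so in practice its prox is computed with an inner iterative solver, which is the standard approach in proximal MCMC implementations.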
Figure 9: Cameraman experiment: Mean squared error (MSE) between the posterior mean estimate of each algorithm and the true image, measured using the $n$ samples from MYULA and the $n/s$ samples from SK-ROCK ($s = 15$), in the stationary regime.

4.3 Hyperspectral unmixing

We now present an application to hyperspectral unmixing [32]. Given a hyperspectral image $y\in\mathbb{R}^{m\times d}$ with $m$ spectral bands and $d$ pixels, the unmixing problem assumes that the observed scene is composed of $k$ materials or endmembers, each with a characteristic spectral response $a_j\in\mathbb{R}^m$ for $j\in\{1, \dots, k\}$, and seeks to determine the proportions or abundances $x_{j,i}$ of each material $j\in\{1, \dots, k\}$ in each image pixel $i\in\{1, \dots, d\}$. Here we consider the widely used linear mixing model $y = Ax + w$, where $A = \{a_1, \dots, a_k\}\in\mathbb{R}^{m\times k}$ is a spectral library gathering the spectral responses of the materials, $x\in\mathbb{R}^{k\times d}$ gathers the abundance maps, and $w\sim N(0, \sigma^2 I_{m\times d})$ is additive Gaussian noise.
Moreover, following [30], we expect $x$ to be sparse, since most image pixels contain only a subset of the materials. Also, we expect materials to exhibit some degree of spatial coherence and regularity. In order to promote solutions with these characteristics, we use the $\ell_1$-TV prior proposed in [30] for this type of problem,

$$p(x) \propto \exp\{-\alpha\|x\|_1 - \beta\,TV(x)\}\,\mathbf{1}_{\mathbb{R}^n_+}(x),$$

where $\alpha > 0$ and $\beta > 0$ are hyper-parameters that we assume fixed (in this experiment we use $\alpha = 25$ and $\beta = 185$, determined using the method of [23]). The resulting posterior distribution is given by [30]

$$p(x|y) \propto \exp\left[ -\|y - Ax\|^2/(2\sigma^2) - \alpha\|x\|_1 - \beta\,TV(x) \right]\mathbf{1}_{\mathbb{R}^n_+}(x). \qquad (4.2)$$

Table 4: Hyperspectral experiment: Summary of the results after generating $n$ samples with MYULA and $n/s$ samples with SK-ROCK, at equal computing time per method. Columns: stepsize $\delta$, ESS of the slowest and fastest components, and speed-up w.r.t. MYULA for the slowest and fastest components. Rows: MYULA, SK-ROCK ($s = 10$), and SK-ROCK ($s = 15$).
Figure 10 presents an experiment with a synthetic dataset from [30] of size $n = 75\times 75 = 5625$ pixels, with 5 materials and a noise amplitude corresponding to a high signal-to-noise ratio, see [30] for details. Figure 10(a) presents the evolution of the estimation MSE between the true abundance maps and the posterior mean as estimated by MYULA and SK-ROCK (with $s = 15$), as a function of the number of gradient and proximal operator evaluations (using a value of $\lambda$ of the order of $L_f^{-1}$, as recommended in [19, Section 3.3]). As in previous experiments, observe that the posterior means estimated with SK-ROCK converge dramatically faster than the ones calculated with MYULA, clearly exhibiting the benefits of the proposed methodology. Moreover, for illustration, Figures 10(c)-(e) respectively show the estimated abundance maps for the fourth endmember for MYULA and SK-ROCK ($s = 15$), as well as the pixel-wise (marginal) standard deviations for the abundances of this material. Again, as in previous experiments, we notice that the estimates obtained with SK-ROCK are noticeably more precise than the ones of MYULA, which would require a larger number of iterations to accurately estimate these second order statistical moments.

To further compare the convergence properties of the two methods we repeated the experiment and generated $n$ samples with MYULA and $n/s$ samples with SK-ROCK ($s = 15$) to make the comparisons fair. Figure 11(a) presents trace plots for the two chains during their transient regimes using $T(x) = \log p(x|y)$ as summary statistic, as a function of the number of gradient and proximal operator evaluations; observe that SK-ROCK attains the typical set of $x|y$ significantly faster than MYULA, similarly to the previous experiments. Figure 11(b) presents similar trace plots for the two chains in stationarity. Additionally, as we did in the cameraman experiment, we have included the summary statistic $\mathbb{E}(T(X))$ calculated by a very long run of the P-MALA algorithm, which targets (4.2) exactly, in order to study the bias of the methods. As can be seen clearly, SK-ROCK presents a lower bias than MYULA, and also exhibits better mixing properties. The good convergence properties of SK-ROCK can be clearly observed in the autocorrelation plots of Figure 11(c), which correspond to the slowest components of the chains as determined by their covariance structure, and where we have again applied the 1-in-$s$ thinning to the MYULA chain for fairness of comparison. Table 4 reports the ESS values for this experiment; in particular, observe that SK-ROCK clearly outperforms MYULA in terms of ESS for both the slowest and the fastest components of the chain.
Figure 10: Hyperspectral experiment: (a) Mean squared error (MSE) between the posterior mean of each algorithm and the true image (fractional abundances of endmembers 1 to 5), measured using the samples from MYULA (solid line) and from SK-ROCK (dash-dot line, $s = 15$), in logarithmic scale. (b) True fractional abundances of endmember 4 ($75\times 75$ pixels), (c) posterior mean as estimated with the samples generated with MYULA and (d) with SK-ROCK. (e) Standard deviation of the samples generated by MYULA and (f) SK-ROCK.

Figure 11:
Hyperspectral experiment: (a) Convergence to the typical set of the posterior distribution (4.2) for the first MYULA samples and the first SK-ROCK ($s = 15$) samples. (b) Last values of $\log\pi^\lambda(x)$. (c) Autocorrelation function for the slowest component.

4.4 Tomographic reconstruction

We conclude this section with a tomographic image reconstruction experiment. We have selected this problem to illustrate the proposed methodology in a setting where the posterior distribution is strictly log-concave. The lack of strong log-concavity has a clear negative impact on the convergence properties of the continuous-time Langevin SDE (2.3) [18], and also impacts the convergence properties of the MYULA and SK-ROCK approximations.

In tomographic image reconstruction we seek to recover an image $x\in\mathbb{R}^d$ from an observation $y\in\mathbb{C}^p$ related to $x$ by a linear Fourier model $y = AFx + \xi$, where $F$ is the discrete Fourier transform operator on $\mathbb{C}^d$, $A\in\mathbb{C}^{p\times d}$ is a (sparse) tomographic subsampling mask, and $\xi\sim N(0, \sigma^2 I_p)$. Typically $d\gg p$, making the estimation problem strongly ill-posed. We address this difficulty by using a total-variation prior to regularise the estimation problem and promote solutions with certain spatial regularity properties. From Bayes' theorem, the posterior $p(x|y)$ is given by

$$p(x|y) \propto \exp\left[ -\|y - AFx\|^2/(2\sigma^2) - \beta\,TV(x) \right], \qquad (4.3)$$

with hyper-parameters $\sigma, \beta\in\mathbb{R}_+$ assumed fixed (in this experiment we use $\beta = 10$).

Figure 12 presents an experiment with the Shepp-Logan phantom test image of size $d = 128\times 128$ pixels, which we use to generate a noisy observation $y$ by measuring a subset of the original Fourier coefficients, corrupted with additive Gaussian noise of small standard deviation $\sigma$ (to improve visibility, Figure 12(b) shows the amplitude of the Fourier coefficients in logarithmic scale; unobserved coefficients are depicted in black). Following on from this, we use MYULA and SK-ROCK with $s = 10$ to generate $n$ and $n/s$ samples respectively from $p(x|y)$, with $\lambda$ of the order of $L_f^{-1}$ as recommended in [19, Section 3.3]. We then use these samples to compute two quantities: 1) the MMSE estimators, displayed in Figures 12(c)-(d); and 2) the (marginal) standard deviations of the amplitude of the Fourier coefficients of $x|y$, depicted in Figures 12(e)-(f) in logarithmic scale. Observe that, in this experiment, both methods deliver good and similar results with the number of samples available, with MYULA producing slightly less accurate standard deviation estimates. More interestingly, notice from Figures 12(e)-(f) that in this tomographic experiment the uncertainty is concentrated in the unobserved medium frequencies, whereas in the deconvolution experiment uncertainty was predominant in the high frequencies.

Moreover, to analyse the convergence properties of the two methods we compute autocorrelation functions from the $n$ samples generated with MYULA and the $n/s$ samples generated with SK-ROCK ($s = 10$). We use said samples to determine the fastest and slowest components of each chain and measure their autocorrelation functions. Table 5 reports the associated ESS values, which show that SK-ROCK outperforms MYULA in terms of ESS for the slowest component of the chain. These superior convergence properties can be clearly observed in Figure 13(c), which presents the autocorrelation plots for the slowest components of the chains. For completeness, Table 5 also reports the values obtained with SK-ROCK with $s = 5$.

Finally, as in previous experiments, Figure 13(a) presents trace plots for the two chains during their burn-in stages; again we can see that SK-ROCK reaches the typical set of $x|y$ significantly faster than MYULA.
Figure 13(b) shows the log π(x) trace of both methods in the stationary regime; similarly to the cameraman and hyperspectral experiments, we have also included the entropy E(T(X)) of the distribution, calculated from a very long run of the P-MALA algorithm, which targets (4.3) exactly. As can be clearly seen, SK-ROCK presents a lower bias than MYULA.
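The ESS values reported in Table 5 are obtained from the chains' autocorrelation functions. The sketch below shows a standard estimator, ESS = N / (1 + 2 Σ_k ρ_k), truncating the autocorrelation sum at the first negative lag; this is a simplified variant of the approach in [25], and the function names are ours rather than the paper's.

    import numpy as np

    def autocorrelation(chain):
        """Empirical autocorrelation function of a scalar chain."""
        x = np.asarray(chain, dtype=float)
        x = x - x.mean()
        n = len(x)
        acf = np.correlate(x, x, mode="full")[n - 1:]  # lags 0, 1, ..., n-1
        return acf / acf[0]

    def effective_sample_size(chain):
        """ESS = N / (1 + 2 * sum of positive-lag autocorrelations), with the
        sum truncated at the first negative autocorrelation estimate."""
        rho = autocorrelation(chain)[1:]
        negative = np.where(rho < 0)[0]
        cutoff = negative[0] if negative.size else len(rho)
        return len(chain) / (1.0 + 2.0 * rho[:cutoff].sum())

Applying this estimator to the slowest and fastest components of the MYULA and SK-ROCK chains reproduces the kind of per-component comparison summarised in Table 5.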
Figure 12: Tomography experiment: (a) Shepp-Logan phantom image (128 × 128 pixels), (b) tomographic observation y (amplitude of Fourier coefficients in logarithmic scale). Posterior mean of x|y as estimated with (c) MYULA (n samples) and (d) SK-ROCK (n/s samples, s = 10). Standard deviations of the amplitude of the Fourier coefficients of x|y as estimated with (e) MYULA (n samples) and (f) SK-ROCK (n/s samples, s = 10).

Table 5: Tomography experiment: summary of the results after generating n samples with MYULA and n/s samples with SK-ROCK. Computing time: hours per method.

Method             Stepsize δ   ESS (slow comp.)   ESS (fast comp.)   Speed-up (slow comp.)   Speed-up (fast comp.)
MYULA              . × 10−      . × 10             . × 10             -                       -
SK-ROCK (s = 5)    . × 10−      . × 10             . × 10             .05                     1.
SK-ROCK (s = 10)   . × 10−      . × 10             . × 10             .23                     0.
Figure 13: Tomography experiment: (a) Convergence to the typical set of the posterior distribution (4.3) for the first n MYULA samples and the first n/s SK-ROCK (s = 10) samples. (b) Last values of log π(x) from the MYULA and SK-ROCK (s = 10) chains. (c) Autocorrelation function for the slowest component.

5 Conclusions and perspectives for future work

This paper presented a new proximal MCMC method to perform Bayesian computation in inverse problems related to imaging sciences. Similarly to previous proximal MCMC methods, the methodology is derived from a Moreau-Yosida regularised overdamped Langevin diffusion process. However, instead of the conventional EM discrete approximation of the Langevin diffusion, the proposed method employs a significantly more advanced discrete approximation that has better convergence properties in problems that are ill-posed and ill-conditioned.

The explicit EM approximation struggles in problems that are ill-posed or ill-conditioned because of the corresponding severe time-step restrictions. The same issues arise for gradient descent and proximal gradient optimisation algorithms, which also suffer from time-step restrictions. In optimisation, this has been successfully addressed by using accelerated proximal optimisation algorithms [8]. The SK-ROCK approximation used in this paper achieves a similar acceleration. For Gaussian models, we rigorously establish the acceleration of the Markov chains in the 2-Wasserstein distance as a function of the condition number κ when moderate accuracy is required. The superior behaviour of our method is further demonstrated with a range of numerical experiments, including non-blind image deconvolution, tomographic reconstruction, and hyperspectral unmixing, with total-variation and ℓ1 priors. The generated Markov chains exhibit faster mixing, achieve larger effective sample sizes, and produce lower mean square estimation errors at equal computational budget. This makes it possible, for example, to accurately estimate high-order statistical moments and to perform uncertainty quantification analyses in a more computationally efficient way.

Furthermore, this paper opens a number of interesting directions for future research. For example, as mentioned previously, the poor performance of EM approximations can also be mitigated by using auxiliary-variable Gibbs splitting schemes [48], which have similarities with the ADMM optimisation algorithm, at the expense of additional estimation bias. Given that ADMM optimisation can be accelerated, it would be interesting to study sampling methods that combine SK-ROCK with splitting schemes to reduce that bias or achieve greater computational efficiency. Another important perspective is to theoretically analyse the non-asymptotic convergence properties of SK-ROCK for non-Gaussian log-concave models and derive bounds in total-variation and Wasserstein metrics; this is highly technical and will require developing new analysis techniques. It would also be interesting to explore possible Metropolis-adjusted variants of the stochastic Runge-Kutta-Chebyshev methods discussed in this paper, and to investigate empirical Bayesian computation algorithms that combine SK-ROCK with stochastic gradient descent, which could be useful for estimating unknown model parameters such as regularisation parameters [23].

A Wasserstein distance - Gaussian process
We begin by computing the distribution Q_n of the n-th sample generated by the approximation (3.2). We work in the one-dimensional case; the results extend easily to higher dimensions, as shown at the end of this appendix. First, notice that the solution of (3.2) can be expressed by the following recursive formula:

$$X_n = (R(z))^n X_0 + \sqrt{2\delta} \sum_{i=1}^{n} (R(z))^{n-i} R(z)\, \xi_i,$$

where X_0 is the initial condition of the problem. Taking expectations on both sides of the latter equation, we have

$$\mathbb{E}(X_n) = (R(z))^n X_0.$$

Then, we compute the variance as follows:

$$\mathbb{E}(X_n^2) - \big(\mathbb{E}(X_n)\big)^2 = 2\delta \sum_{i=1}^{n} (R(z))^{2(n-i)} (R(z))^2 = 2\delta\, (R(z))^2\, \frac{(R(z))^{2n} - 1}{(R(z))^2 - 1}.$$

Thus, the approximate distribution Q_n of the n-th sample produced by the numerical scheme (3.2) is given by

$$Q_n = \mathcal{N}\!\left( (R(z))^n X_0,\; 2\delta\, (R(z))^2\, \frac{(R(z))^{2n} - 1}{(R(z))^2 - 1} \right).$$

We can now compute the Wasserstein distance between the two univariate Gaussian distributions P and Q_n:

$$W_2^2(P; Q_n) = (R(z))^{2n} X_0^2 + \left[ \sigma - \sqrt{2\delta}\, R(z) \left( \frac{1 - (R(z))^{2n}}{1 - (R(z))^2} \right)^{1/2} \right]^2.$$

As mentioned at the beginning of this appendix, this result extends trivially to a d-dimensional Gaussian distribution: let P ∼ N(0, Σ) with Σ = diag(σ_1^2, ..., σ_d^2) and X_0 = (x_1, ..., x_d)^T; we then obtain

$$W_2^2(P; Q_n) = \sum_{i=1}^{d} (R(z_i))^{2n} x_i^2 + \sum_{i=1}^{d} \left[ \sigma_i - \sqrt{2\delta}\, R(z_i) \left( \frac{1 - (R(z_i))^{2n}}{1 - (R(z_i))^2} \right)^{1/2} \right]^2,$$

where z_i = −δ/σ_i^2. This concludes the proof.

B Explicit bound for the Wasserstein distance
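The closed-form law Q_n derived above can be checked numerically by simulating the scalar recursion X_{n+1} = R(z)(X_n + √(2δ) ξ_{n+1}) for a fixed value of R(z) and comparing the empirical moments with the formulas; a minimal sketch is shown below, in which R(z) is left as an abstract scalar and all numerical values are arbitrary illustrative choices of ours.

    import numpy as np

    # Arbitrary illustrative values with |R(z)| < 1.
    r, delta, x0, n_steps, n_chains = 0.9, 0.1, 2.0, 50, 200_000

    rng = np.random.default_rng(0)
    x = np.full(n_chains, x0)  # deterministic initial condition X_0
    for _ in range(n_steps):
        x = r * (x + np.sqrt(2 * delta) * rng.standard_normal(n_chains))

    # Moments of Q_n from the derivation above.
    mean_theory = r**n_steps * x0
    var_theory = 2 * delta * r**2 * (r**(2 * n_steps) - 1) / (r**2 - 1)
    print(x.mean(), mean_theory)  # should agree up to Monte Carlo error
    print(x.var(), var_theory)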
We begin by applying the triangle inequality to W_2(P; Q_{n+1}):

$$W_2(P; Q_{n+1}) \le W_2(P; \tilde{Q}) + W_2(\tilde{Q}; Q_{n+1}), \qquad (B.1)$$

where Q̃ is the unique invariant distribution to which (3.2) converges when n → ∞; in each component it is given by

$$\tilde{Q} = \mathcal{N}\!\left( 0,\; \frac{2\delta\, (R(z))^2}{1 - (R(z))^2} \right).$$

Thus, we have that

$$W_2^2(\tilde{Q}; Q_{n+1}) = \sum_{i=1}^{d} (R(z_i))^{2n+2} x_i^2 + \sum_{i=1}^{d} \left[ \left( \frac{2\delta\,(R(z_i))^2}{1 - (R(z_i))^2} \right)^{1/2} - \sqrt{2\delta}\, R(z_i) \left( \frac{1 - (R(z_i))^{2n+2}}{1 - (R(z_i))^2} \right)^{1/2} \right]^2 = \sum_{i=1}^{d} \left[ (R(z_i))^{2n+2} x_i^2 + \frac{2\delta\,(R(z_i))^2}{1 - (R(z_i))^2} \left( 1 - \sqrt{1 - (R(z_i))^{2n+2}} \right)^2 \right]. \qquad (B.2)$$

It is easy to prove the following property:

$$1 - \sqrt{1 - x^{\,n+1}} \le x \left( 1 - \sqrt{1 - x^{\,n}} \right), \qquad x \in (0, 1). \qquad (B.3)$$

Applying (B.3) with x = (R(z_i))^2 in (B.2), we have

$$W_2^2(\tilde{Q}; Q_{n+1}) \le \sum_{i=1}^{d} \left[ (R(z_i))^{2n+2} x_i^2 + \frac{2\delta\,(R(z_i))^2}{1 - (R(z_i))^2}\, (R(z_i))^4 \left( 1 - \sqrt{1 - (R(z_i))^{2n}} \right)^2 \right] \le \sum_{i=1}^{d} \left[ (R(z_i))^{2n} x_i^2 + \frac{2\delta\,(R(z_i))^2}{1 - (R(z_i))^2} \left( 1 - \sqrt{1 - (R(z_i))^{2n}} \right)^2 \right] (R(z_i))^2 \le \max_{1 \le i \le d} (R(z_i))^2\; W_2^2(\tilde{Q}; Q_n).$$

Thus, (B.1) becomes

$$W_2(P; Q_{n+1}) \le W_2(P; \tilde{Q}) + \max_{1 \le i \le d} |R(z_i)|\; W_2(\tilde{Q}; Q_n).$$

Letting

$$C = \max_{1 \le i \le d} |R(z_i)|$$

and iterating the last bound n + 1 times, we finally have that

$$W_2(P; Q_{n+1}) \le W_2(P; \tilde{Q}) + C^{\,n+1}\, W_2(\tilde{Q}; Q_0),$$

concluding the proof.

To minimise the bound in the latter expression, we seek to accelerate the decay of the constant C, which is determined by R(z) in the stochastic ROCK methods; this follows closely the approach in [20]. In particular, in order to bound |R(z_i)| by one, we need |ω_0 + ω_1 z_i| ≤ 1; in other words, for every i we need

$$-1 \le \omega_0 - \omega_1\, \delta/\sigma_i^2 \le 1.$$

Let L := max_{1≤i≤d} 1/σ_i^2 and ℓ := min_{1≤i≤d} 1/σ_i^2, so that it suffices to have

$$-1 \le \omega_0 - \omega_1 L\delta \le \omega_0 - \omega_1 \ell\delta \le 1.$$

The right-most inequality yields

$$\delta \ge \frac{\omega_0 - 1}{\ell\, \omega_1}. \qquad (B.4)$$

We choose the smallest such δ to have an efficient algorithm, i.e. δ = (ω_0 − 1)/(ℓω_1). Substituting this choice into the left-most inequality, we obtain

$$\kappa := \frac{L}{\ell} \le \frac{\omega_0 + 1}{\omega_0 - 1} = \frac{2s^2}{\eta} + 1,$$

where κ is the condition number of our Gaussian problem and we have used ω_0 = 1 + η/s^2. We again choose the smallest s to have an efficient algorithm, and the latter expression determines the parameter s as

$$s = \left[ \sqrt{\frac{\eta\,(\kappa - 1)}{2}} \right], \qquad (B.5)$$

where [x] denotes the integer rounding of x.
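In practice, (B.4) and (B.5) translate into a simple recipe for choosing the number of stages s and the stepsize δ from the condition number κ. The sketch below assumes the standard Chebyshev choices ω_0 = 1 + η/s^2 and ω_1 = T_s(ω_0)/T_s'(ω_0) (see, e.g., [1, 2]); the function name and the default damping value are ours, and s is rounded upwards here as the conservative way to satisfy the stability constraint.

    import numpy as np
    from numpy.polynomial.chebyshev import Chebyshev

    def sk_rock_parameters(kappa, ell, eta=0.05):
        """Number of stages s from (B.5) and stepsize delta from (B.4),
        given the condition number kappa = L / ell and damping eta."""
        s = max(1, int(np.ceil(np.sqrt(eta * (kappa - 1) / 2))))
        w0 = 1 + eta / s**2                 # omega_0 = 1 + eta / s^2
        T_s = Chebyshev.basis(s)            # Chebyshev polynomial T_s
        w1 = T_s(w0) / T_s.deriv()(w0)      # omega_1 = T_s(w0) / T_s'(w0)
        delta = (w0 - 1) / (ell * w1)       # smallest stepsize from (B.4)
        return s, delta

    # Example: a problem with condition number 1e4 and ell = 1.
    print(sk_rock_parameters(1e4, 1.0))

Note that s grows only like the square root of κ, which is the source of the acceleration over EM-type schemes discussed in the conclusions.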
References

[1] A. Abdulle, Explicit stabilized Runge-Kutta methods, Encyclopedia of Applied and Computational Mathematics, (2015), pp. 460–468.
[2] A. Abdulle, I. Almuslimani, and G. Vilmart, Optimal explicit stabilized integrator of weak order one for stiff and ergodic stochastic differential equations, SIAM/ASA Journal on Uncertainty Quantification, 6 (2018), pp. 937–964.
[3] A. Abdulle and S. Cirilli, S-ROCK: Chebyshev methods for stiff stochastic differential equations, SIAM Journal on Scientific Computing, 30 (2008), pp. 997–1014.
[4] A. Abdulle, G. Vilmart, and K. Zygalakis, High order numerical approximation of the invariant measure of ergodic SDEs, SIAM Journal on Numerical Analysis, 52 (2014), pp. 1600–1622.
[5] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, Fast image recovery using variable splitting and constrained optimization, IEEE Transactions on Image Processing, 19 (2010), pp. 2345–2356.
[6] S. Arridge, P. Maass, O. Öktem, and C.-B. Schönlieb, Solving inverse problems using data-driven models, Acta Numerica, 28 (2019), pp. 1–174.
[7] Y. Atchade and A. Bhattacharyya, Regularization and computation with high-dimensional spike-and-slab posterior distributions, arXiv e-prints, (2018), arXiv:1803.10282.
[8] A. Beck, First-Order Methods in Optimization, vol. 25 of MOS-SIAM Series on Optimization, SIAM, Philadelphia, PA, 2017.
[9] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[10] P. Bianchi, A. Salim, and S. Schechtman, Passty Langevin, in Conference on Machine Learning (CAp) 2019, Toulouse, France, June 2019.
[11] N. Brosse, A. Durmus, M. Pereyra, and E. Moulines, Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo, in Conference on Learning Theory (COLT) 2017, Amsterdam, Netherlands, Sep. 2017.
[12] X. Cai, M. Pereyra, and J. D. McEwen, Uncertainty quantification for radio interferometric imaging – I. Proximal MCMC methods, Monthly Notices of the Royal Astronomical Society, 480 (2018), pp. 4154–4169.
[13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization, 28 (2018), pp. 1751–1772.
[14] L. Chaari, J. Tourneret, C. Chaux, and H. Batatia, A Hamiltonian Monte Carlo method for non-smooth energy sampling, IEEE Transactions on Signal Processing, 64 (2016), pp. 5585–5594.
[15] A. Chambolle, An algorithm for total variation minimization and applications, Journal of Mathematical Imaging and Vision, 20 (2004), pp. 89–97.
[16] A. Chambolle and T. Pock, An introduction to continuous optimization for imaging, Acta Numerica, 25 (2016), pp. 161–319.
[17] A. Durmus, S. Majewski, and B. Miasojedow, Analysis of Langevin Monte Carlo via convex optimization, Journal of Machine Learning Research, 20 (2019), pp. 1–46.
[18] A. Durmus and E. Moulines, Nonasymptotic convergence analysis for the unadjusted Langevin algorithm, Annals of Applied Probability, 27 (2017), pp. 1551–1587.
[19] A. Durmus, E. Moulines, and M. Pereyra, Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau, SIAM Journal on Imaging Sciences, 11 (2018), pp. 473–506.
[20] A. Eftekhari, B. Vandereycken, G. Vilmart, and K. C. Zygalakis, Explicit stabilised gradient descent for faster strongly convex optimisation, arXiv e-prints, (2018).
[21] V. Elser, T. Lan, and T. Bendory, Benchmark problems for phase retrieval, SIAM Journal on Imaging Sciences, 11 (2018), pp. 2429–2455.
[22] C. Elvira, P. Chainais, and N. Dobigeon, Bayesian antisparse coding, IEEE Transactions on Signal Processing, 65 (2017), pp. 1660–1672.
[23] A. Fernandez Vidal, V. De Bortoli, M. Pereyra, and A. Durmus, Maximum likelihood estimation of regularisation parameters in high-dimensional inverse problems: an empirical Bayesian approach, arXiv e-prints, (2019), arXiv:1911.11709.
[24] A. Vehtari, A. Gelman, D. Simpson, B. Carpenter, and P.-C. Bürkner, Rank-normalization, folding, and localization: an improved R̂ for assessing convergence of MCMC, arXiv e-prints, (2019), arXiv:1903.08008.
[25] C. J. Geyer, Practical Markov chain Monte Carlo, Statistical Science, 7 (1992), pp. 473–483.
[26] M. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (2011), pp. 123–214.
[27] D. J. Higham, A-stability and stochastic mean-square stability, BIT Numerical Mathematics, 40 (2000), pp. 404–409.
[28] D. J. Higham, Mean-square and asymptotic stability of the stochastic theta method, SIAM Journal on Numerical Analysis, 38 (2000), pp. 753–769.
[29] A. Houdard, C. Bouveyron, and J. Delon, High-dimensional mixture models for unsupervised image denoising (HDMI), SIAM Journal on Imaging Sciences, 11 (2018), pp. 2815–2846.
[30] M.-D. Iordache, J. M. Bioucas-Dias, and A. Plaza, Total variation spatial regularization for sparse hyperspectral unmixing, IEEE Transactions on Geoscience and Remote Sensing, 50 (2012), pp. 4484–4502.
[31] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, vol. 160, Springer Science & Business Media, 2006.
[32] N. Keshava and J. Mustard, Spectral unmixing, IEEE Signal Processing Magazine, 19 (2002), pp. 44–57.
[33] P. E. Kloeden and E. Platen, Numerical Solution of Stochastic Differential Equations, vol. 23 of Applications of Mathematics (New York), Springer-Verlag, Berlin, 1992.
[34] F. Lucka, N. Huynh, M. Betcke, E. Zhang, P. Beard, B. Cox, and S. Arridge, Enhancing compressed sensing 4D photoacoustic tomography by simultaneous motion estimation, SIAM Journal on Imaging Sciences, 11 (2018), pp. 2224–2253.
[35] M.-C. Corbineau, D. Kouame, E. Chouzenoux, J.-Y. Tourneret, and J.-C. Pesquet, Preconditioned P-ULA for joint deconvolution-segmentation of ultrasound images, IEEE Signal Processing Letters, 26 (2019), pp. 1456–1460.
[36] Y. Marnissi, A. Benazza-Benyahia, E. Chouzenoux, and J.-C. Pesquet, Majorize-minimize adapted Metropolis-Hastings algorithm. Application to multichannel image recovery, in 2014 22nd European Signal Processing Conference (EUSIPCO), Sep. 2014, pp. 1332–1336.
[37] Y. Marnissi, E. Chouzenoux, J.-C. Pesquet, and A. Benazza-Benyahia, An auxiliary variable method for Langevin based MCMC algorithms, in 2016 IEEE Statistical Signal Processing Workshop (SSP), June 2016, pp. 1–5.
[38] M. Pereyra, Proximal Markov chain Monte Carlo algorithms, Statistics and Computing, 26 (2016), pp. 745–760.
[39] M. Pereyra, Maximum-a-posteriori estimation with Bayesian confidence regions, SIAM Journal on Imaging Sciences, 10 (2017), pp. 285–302.
[40] M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. O. Hero, and S. McLaughlin, A survey of stochastic simulation and optimization methods in signal processing, IEEE Journal of Selected Topics in Signal Processing, 10 (2016), pp. 224–241.
[41] A. Repetti, M. Pereyra, and Y. Wiaux, Scalable Bayesian uncertainty quantification in imaging inverse problems via convex optimization, SIAM Journal on Imaging Sciences, 12 (2019), pp. 87–118.
[42] C. Robert, The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, Springer Science & Business Media, 2007.
[43] G. O. Roberts and R. L. Tweedie, Exponential convergence of Langevin distributions and their discrete approximations, Bernoulli, 2 (1996), pp. 341–363.
[44] Y. Romano, M. Elad, and P. Milanfar, The little engine that could: regularization by denoising (RED), SIAM Journal on Imaging Sciences, 10 (2017), pp. 1804–1844.
[45] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena, 60 (1992), pp. 259–268.
[46] M. Vono, N. Dobigeon, and P. Chainais, Sparse Bayesian binary logistic regression using the split-and-augmented Gibbs sampler, in 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 2018, pp. 1–6.
[47] M. Vono, N. Dobigeon, and P. Chainais, Bayesian image restoration under Poisson noise and log-concave prior, in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 1712–1716.
[48] M. Vono, N. Dobigeon, and P. Chainais, Split-and-augmented Gibbs sampler – Application to large-scale inference problems, IEEE Transactions on Signal Processing, 67 (2019), pp. 1648–1661.
[49] M. Vono, D. Paulin, and A. Doucet, Efficient MCMC sampling with dimension-free convergence rate using ADMM-type splitting, arXiv e-prints, (2019), arXiv:1905.11937.
[50] A. Wibisono, A. C. Wilson, and M. I. Jordan, A variational perspective on accelerated methods in optimization, Proceedings of the National Academy of Sciences, 113 (2016), pp. E7351–E7358.
[51] D. Wipf and H. Zhang, Revisiting Bayesian blind deconvolution, Journal of Machine Learning Research, 15 (2014), pp. 3775–3814.
[52] J. Ye, Y. Han, and