Proximity Operators of Discrete Information Divergences
Mireille El Gheche, Giovanni Chierchia, Jean-Christophe Pesquet
PProximity Operators of Discrete InformationDivergences – Extended Version
Mireille El Gheche ∗ Giovanni Chierchia † Jean-Christophe Pesquet ‡ April 27, 2017
Abstract
Information divergences allow one to assess how close two distributions are fromeach other. Among the large panel of available measures, a special attention hasbeen paid to convex ϕ -divergences, such as Kullback-Leibler, Jeffreys, Hellinger, Chi-Square, Renyi, and I α divergences. While ϕ -divergences have been extensively studiedin convex analysis, their use in optimization problems often remains challenging. Inthis regard, one of the main shortcomings of existing methods is that the minimizationof ϕ -divergences is usually performed with respect to one of their arguments, possiblywithin alternating optimization techniques. In this paper, we overcome this limitationby deriving new closed-form expressions for the proximity operator of such two-variablefunctions. This makes it possible to employ standard proximal methods for efficientlysolving a wide range of convex optimization problems involving ϕ -divergences. In ad-dition, we show that these proximity operators are useful to compute the epigraphicalprojection of several functions. The proposed proximal tools are numerically validatedin the context of optimal query execution within database management systems, wherethe problem of selectivity estimation plays a central role. Experiments are carried outon small to large scale scenarios. Divergence measures play a crucial role in evaluating the dissimilarity between two infor-mation sources. The idea of quantifying how much information is shared between twoprobability distributions can be traced back to the work by Pearson [1] and Hellinger [2].Later, Shannon [3] introduced a powerful mathematical framework that links the notion of ∗ Universit´e de Bordeaux, IMS UMR 5218 and IMB UMR 5251, Talence, France. † Universit´e Paris-Est, LIGM UMR 8049, CNRS, ENPC, ESIEE Paris, UPEM, Noisy-le-Grand, France. ‡ Center for Visual Computing, CentraleSup´elec, University Paris-Saclay, Chatenay-Malabry, France. a r X i v : . [ c s . I T ] A p r nformation with communications and related areas, laying the foundations for informationtheory. However, information theory was not just a product of Shannon’s work, it wasthe result of fundamental contributions made by many distinct individuals, from a varietyof backgrounds, who took his ideas and expanded upon them. As a result, informationtheory has broadened to applications in statistical inference, natural language processing,cryptography, neurobiology, quantum computing, and other forms of data analysis. Im-portant sub-fields of information theory are algorithmic information theory, informationquantification, and source/channel coding. In this context, a key measure of informationis the Kullback-Leibler divergence [4], which can be regarded as an instance of the widerclass of ϕ -divergences [5–7], including also Jeffreys, Hellinger, Chi-square, R´enyi, and I α divergences [8]. The Kullback-Leibler (KL) divergence is known to play a prominent role in the computa-tion of channel capacity and rate-distortion functions. One can address these problems withthe celebrated alternating minimization algorithm proposed by Blahut and Arimoto [9, 10].However, other approaches based on geometric programming may provide more efficientnumerical solutions [11]. As the KL divergence is a Bregman distance, optimization prob-lems involving this function can also be addressed by using the alternating minimizationapproach proposed by Bauschke et al [12] (see also [13] for recent related works). 
However,the required optimization steps may be difficult to implement, and the convergence of thealgorithm is only guaranteed under restrictive conditions. Moreover, a proximal algorithmgeneralizing the EM algorithm was investigated in [14], where the KL divergence is a metricfor maximizing a log-likelihood.The generalized KL divergence (also called I-divergence) is widely used in inverse prob-lems for recovering a signal of interest from an observation degraded by Poisson noise. Insuch a case, the generalized KL divergence is usually employed as a data fidelity term.The resulting optimization approach can be solved through an alternating projection tech-nique [15], where both the data fidelity term and the regularization term are based onthe KL divergence. The problem was formulated in a similar manner by Richardson andLucy [16, 17], whereas more general forms of the regularization functions were consideredby others [18–25]. In particular, some of these works are grounded on proximal splittingmethods [19, 22, 23]. These methods offer efficient and flexible solutions to a wide classof possibly nonsmooth convex minimization problems (see [26, 27] and references therein).However, in all the aforementioned works, one of the two variables of the KL divergence isfixed. 2 .2 Other Divergences
Recently, the authors in [28, 29] defined a new measure called Total KL divergence, whichhas the benefit of being invariant to transformations from a special linear group. On theother side, the classical symmetrization of KL divergence, also known as Jeffreys-Kullback(JK) divergence [30], was recently used in the k -means algorithm as a replacement of thesquared difference [31,32], yielding analytical expression of the divergence centroids in termsof the Lambert W function.The Hellinger (Hel) divergence was originally introduced by Beran [33] and later redis-covered under different names [34–37], such as Jeffreys-Masutita distance. In the field ofinformation theory, the Hel divergence is commonly used for nonparametric density estima-tion [38, 39], statistics, and data analytics [40], as well as machine learning [41].The Chi-square divergence was introduced by Pearson [1], who used it to quantitativelyassess whether an observed phenomenon tends to confirm or deny a given hypothesis. Thiswork heavily contributed to the development of modern statistics. In 1984, the journal Science referred to it as “one of the 20 most important scientific breakthroughs”. Moreover,Chi-square was also successfully applied in different contexts, such as information theory andsignal processing, as a dissimilarity measure between two probability distributions [7, 42].R´enyi divergence was introduced as a measure of information related to the R´enyi entropy[43]. According to the definition by Harremo [44], R´enyi divergence measures “how much aprobabilistic mixture of two codes can be compressed”. It has been studied and applied inmany areas [36, 45, 46], including image registration and alignement problems [47].The I α divergence was originally proposed by Chernoff [48] to statistically evaluate theefficiency of an hypothesis test. Subsequently, it was recognized as an instance of moregeneral classes of divergences [6], such as the ϕ -divergences [49] and the Bregman divergences[50], and further extended by many researchers [46, 50–52]. The I α divergence has been alsoconsidered in the context of Non-negative Matrix Factorization, where the hyperparameter α is associated with characteristics of a learning machine [53]. To the best of our knowledge, existing approaches for optimizing convex criteria involving ϕ -divergences are often restricted to specific cases, such as performing the minimizationw.r.t. one of the divergence arguments. In order to take into account both arguments,one may resort to alternating minimization schemes, but only in the case when specificassumptions are met. Otherwise, there exist some approaches that exploit the presence ofadditional moment constraints [54], or the equivalence between ϕ -divergences and some lossfunctions [55], but they provide little insight into the numerical procedure for solving theresulting optimization problems.In the context of proximal methods, there exists no general approach for performing3he minimization w.r.t. both the arguments of a ϕ -divergence. 
This limitation can beexplained by the fact that a few number of closed-form expressions are available for theproximity operator of non-separable convex functions, as opposed to separable ones [26, 56].Some examples of such functions are the Euclidean norm [57], the squared Euclidean normcomposed with an arbitrary linear operator [57], a separable function composed with anorthonormal or semi-orthogonal linear operator [57], the max function [58], the quadratic-over-linear function [59–61], and the indicator function of some closed convex sets [57, 62].In this work, we develop a novel proximal approach that allows us to address moregeneral forms of optimization problems involving ϕ -divergences. Our main contribution isthe derivation of new closed-form expressions for the proximity operator of such functions.This makes it possible to employ standard proximal methods for efficiently solving a widerange of convex optimization problems involving ϕ -divergences. In addition to its flexibility,the proposed approach leads to parallel algorithms that can be efficiently implemented onboth multicore and GPGPU architectures [63]. The remaining of the paper is organized as follows. Section 2 presents the general form ofthe optimization problem that we aim at solving. Section 3 studies the proximity operatorof ϕ -divergences and some of its properties. Section 4 details the closed-form expressionsof the aforementioned proximity operators. Section 5 makes the connection with epigraph-ical projections. Section 6 illustrates the application to selectivity estimation for queryoptimization in database management systems. Finally, Section 7 concludes the paper. Throughout the paper, Γ ( H ) denotes the class of convex functions f defined on a realHilbert space H and taking their values in ] −∞ , + ∞ ] which are lower-semicontinuous and proper(i.e. their domain dom f on which they take finite values is nonempty). (cid:107) · (cid:107) and (cid:104)· | ·(cid:105) de-note the norm and the scalar product of H , respectively. The Moreau subdifferential of f at x ∈ H is ∂f ( x ) = (cid:8) u ∈ H (cid:12)(cid:12) ( ∀ y ∈ H ) (cid:104) y − x | u (cid:105) + f ( x ) ≤ f ( y ) (cid:9) . If f ∈ Γ ( H ) is Gˆateauxdifferentiable at x , ∂f ( x ) = {∇ f ( x ) } where ∇ f ( x ) denotes the gradient of f at x . Theconjugate of f is f ∗ ∈ Γ ( H ) such that ( ∀ u ∈ H ) f ∗ ( u ) = sup x ∈H (cid:0) (cid:104) x | u (cid:105) − f ( x ) (cid:1) . Theproximity operator of f is the mapping prox f : H → H defined as [64]( ∀ x ∈ H ) prox f ( x ) = argmin y ∈H f ( y ) + 12 (cid:107) x − y (cid:107) . (1)Let C be a nonempty closed convex subset C of H . The indicator function of C is definedas ( ∀ x ∈ H ) ι C ( x ) = (cid:40) x ∈ C + ∞ otherwise. (2)4he elements of a vector x ∈ H = R N are denoted by x = ( x ( (cid:96) ) ) ≤ (cid:96) ≤ N , whereas I N is the N × N identity matrix. The objective of this paper is to address convex optimization problems involving a discreteinformation divergence. In particular, the focus is put on the following formulation.
Problem 2.1
Let D be a function in Γ ( R P × R P ). Let A and B be matrices in R P × N ,and let u and v be vectors in R P . For every s ∈ { , . . . , S } , let R s be a function in Γ ( R K s )and T s ∈ R K s × N . We want tominimize x ∈ R N D ( Ax + u, Bx + v ) + S (cid:88) s =1 R s ( T s x ) . (3)Note that the functions D and ( R s ) ≤ s ≤ S are allowed to take the value + ∞ , so thatProblem 2.1 can include convex constraints by letting some of the functions R s be equal tothe indicator function ι C s of some nonempty closed convex set C s . In inverse problems, R s may also model some additional prior information, such as the sparsity of coefficients aftersome appropriate linear transform T s . A special case of interest in information theory arises by decomposing x into two vectors p ∈ R P (cid:48) and q ∈ R Q (cid:48) , that is x = [ p (cid:62) q (cid:62) ] (cid:62) with N = P (cid:48) + Q (cid:48) . Indeed, set u = v = 0, A = [ A (cid:48)
0] with A (cid:48) ∈ R P × P (cid:48) , B = [0 B (cid:48) ] with B (cid:48) ∈ R P × Q (cid:48) and, for every s ∈ { , . . . , S } , T s = [ U s V s ] with U s ∈ R K s × P (cid:48) and V s ∈ R K s × Q (cid:48) . Then, Problem 2.1 takes the followingform: Problem 2.2
Let A (cid:48) , B (cid:48) , ( U s ) ≤ s ≤ S , and ( V s ) ≤ s ≤ S be matrices as defined above. Let D bea function in Γ ( R P × R P ) and, for every s ∈ { , . . . , S } , let R s be a function in Γ ( R K s ).We want to minimize ( p,q ) ∈ R P (cid:48) × R Q (cid:48) D ( A (cid:48) p, B (cid:48) q ) + S (cid:88) s =1 R s ( U s p + V s q ) . (4)Several tasks can be formulated within this framework, such as the computation of channelcapacity and rate-distortion functions [9, 10], the selection of log-optimal portfolios [65],maximum likelihood estimation from incomplete data [66], soft-supervised learning for textclassification [67], simultaneously estimating a regression vector and an additional modelparameter [61] or the image gradient distribution and a parametric model distribution [68],as well as image registration [69], deconvolution [70], and recovery [15]. We next detail animportant application example in source coding.5 xample 2.3 Assume that a discrete memoryless source E , taking its values in a finitealphabet { e , . . . , e P } with probability P ( E ), is to be encoded by a compressed signal (cid:98) E in terms of a second alphabet { (cid:98) e , . . . , (cid:98) e P } . Furthermore, for every j ∈ { , . . . , P } and k ∈ { , . . . , P } , let δ ( k,j ) be the distortion induced when substituting (cid:98) e k for e j . We wishto find an encoding P ( (cid:98) E | E ) that yields a point on the rate-distortion curve at a givendistortion value δ ∈ ]0 , + ∞ [. It is well-known [71] that this amounts to minimizing themutual information I between E and (cid:98) E , more precisely the rate-distortion function R isgiven by R ( δ ) = min P ( (cid:98) E | E ) I ( E, (cid:98) E ) , (5)subject to the constraint P (cid:88) j =1 P (cid:88) k =1 δ ( k,j ) P ( E = e j ) P ( (cid:98) E = (cid:98) e k | E = e j ) ≤ δ. (6)The mutual information can be written as [9, Theorem 4(a)]min P ( (cid:98) E ) P (cid:88) j =1 P (cid:88) k =1 P ( E = e j , (cid:98) E = (cid:98) e k ) ln (cid:32) P ( (cid:98) E = (cid:98) e k , E = e j ) P ( E = e j ) P ( (cid:98) E = (cid:98) e k ) (cid:33) , (7)subject to the constraint P (cid:88) k =1 P ( (cid:98) E = (cid:98) e k ) = 1 . (8)Moreover, the constraint in (6) can be reexpressed as P (cid:88) j =1 P (cid:88) k =1 δ ( k,j ) P ( E = e j , (cid:98) E = (cid:98) e k ) ≤ δ, (9)with ( ∀ j ∈ { , . . . , P } ) P (cid:88) k =1 P ( E = e j , (cid:98) E = (cid:98) e k ) = P ( E = e j ) . (10)The unknown variables are thus the vectors p = (cid:0) P ( E = e j , (cid:98) E = (cid:98) e k ) (cid:1) ≤ j ≤ P , ≤ k ≤ P ∈ R P P (11)and q = (cid:0) P ( (cid:98) E = (cid:98) e k ) (cid:1) ≤ k ≤ P ∈ R P , (12)whose optimal values are solutions to the problem:minimize p ∈ C ∩ C ,q ∈ C D ( p, r ⊗ q ) (13)6here r = (cid:0) P ( E = e j ) (cid:1) ≤ j ≤ P ∈ R P , ⊗ denotes the Kronecker product, D is the Kullback-Leibler divergence, and C , C , C are the closed convex sets corresponding to the linearconstraints (8), (9), (10), respectively. The above formulation is a special case of Problem2.2 in which P = P (cid:48) = P P , Q (cid:48) = P , A (cid:48) = I P , B (cid:48) is such that ( ∀ q ∈ R Q (cid:48) ) B (cid:48) q = r ⊗ q , S = 3, V = I Q (cid:48) , U = U = I P , U and V = V are null matrices, and ( ∀ s ∈ { , , } ) R s isthe indicator function of the constraint convex set C s . 
We will focus on additive information measures of the form (cid:0) ∀ ( p, q ) ∈ R P × R P (cid:1) D ( p, q ) = P (cid:88) i =1 Φ( p ( i ) , q ( i ) ) , (14)where Φ ∈ Γ ( R × R ) is the perspective function [72] on [0 , + ∞ [ × ]0 , + ∞ [ of a function ϕ : R → [0 , + ∞ ] belonging to Γ ( R ) and twice differentiable on ]0 , + ∞ [. In other words, Φis defined as follows: for every ( υ, ξ ) ∈ R ,Φ( υ, ξ ) = ξ ϕ (cid:16) υξ (cid:17) if υ ∈ [0 , + ∞ [ and ξ ∈ ]0 , + ∞ [ υ lim ζ → + ∞ ϕ ( ζ ) ζ if υ ∈ ]0 , + ∞ [ and ξ = 00 if υ = ξ = 0+ ∞ otherwise , (15)where the above limit is guaranteed to exist [73, Sec. 2.3]. Moreover, if ϕ is a strictly convexfunction such that ϕ (1) = ϕ (cid:48) (1) = 0 , (16)the function D in (14) belongs to the class of ϕ -divergences [5, 74]. Then, for every ( p, q ) ∈ [0 , + ∞ [ P × [0 , + ∞ [ P , D ( p, q ) ≥ D ( p, q ) = 0 ⇔ p = q. (18)Examples of ϕ -divergences will be provided in Sections 4.1, 4.2, 4.3, 4.4 and 4.6. For athorough investigation of the rich properties of ϕ -divergences, the reader is refered to [5,6,75].Other divergences (e.g., R´enyi divergence) are expressed as (cid:0) ∀ ( p, q ) ∈ R P × R P (cid:1) D g ( p, q ) = g (cid:0) D ( p, q ) (cid:1) (19)where g is an increasing function. Then, provided that g (cid:0) ϕ (1) (cid:1) = 0, D g ( p, q ) ≥ p (cid:62) q (cid:62) ] (cid:62) ∈ C with C = (cid:110) x ∈ [0 , P (cid:12)(cid:12) P (cid:88) i =1 x ( i ) = 1 and P (cid:88) i =1 x ( P + i ) = 1 (cid:111) . (20)7rom an optimization standpoint, minimizing D or D g (possibly subject to constraints)makes no difference, hence we will only address problems involving D in the rest of thispaper. Proximity operators will be fundamental tools in this paper. We first recall some of theirkey properties.
Proposition 2.4 [64, 72]
Let f ∈ Γ ( H ) . Then, (i) For every x ∈ H , prox f x ∈ dom f . (ii) For every ( x, x ) ∈ H x = prox f ( x ) ⇔ x − x ∈ ∂f ( x ) . (21)(iii) For every ( x, z ) ∈ H , prox f ( · + z ) ( x ) = prox f ( x + z ) − z. (22)(iv) For every ( x, z ) ∈ H and for every α ∈ R , prox f + (cid:104) z |·(cid:105) + α ( x ) = prox f ( x − z ) . (23)(v) Let f ∗ be the conjugate function of f . For every x ∈ H and for every γ ∈ ]0 , + ∞ [ , prox γf ∗ ( x ) = x − γ prox f/γ ( x/γ ) . (24)(vi) Let G be a real Hilbert space and let T : G → H be a bounded linear operator, with theadjoint denoted by T ∗ . If T T ∗ = κ Id and κ ∈ ]0 , + ∞ [ , then for all x ∈ H prox f ◦ T ( x ) = x + 1 κ T ∗ (cid:0) prox κf ( T x ) − T x (cid:1) . (25)Numerous additional properties of proximity operators are mentioned in [26, 27].In this paper, we will be mainly concerned with the determination of the proximityoperator of the function D defined in (14) with H = R P × R P . The next result emphasizesthat this task reduces to the calculation of the proximity operator of a real function of twovariables. 8 roposition 2.5 Let D be defined by (14) where Φ ∈ Γ ( R ) and let γ ∈ ]0 , + ∞ [ . Let u ∈ R P and v ∈ R P . Then, for every p ∈ R P and for every q ∈ R P , prox γD ( · + u, · + v ) ( p, q ) = ( p − u, q − v ) (26) where, for every i ∈ { , . . . , P } , ( p ( i ) , q ( i ) ) = prox γ Φ ( p ( i ) + u ( i ) , q ( i ) + v ( i ) ) . (27) Proof . The result is a straightforward consequence of [26, Table 10.1ix] and Proposi-tion 2.4(iii), by setting f = D and z = ( u, v ).Note that, although an extensive list of proximity operators of one-variable real functionscan be found in [26], few results are available for real functions of two variables [57,59,60,62].An example of such a result is provided below. Proposition 2.6
Let ϕ ∈ Γ ( R ) be an even differentiable function on R \{ } . Let Φ : R → ] −∞ , + ∞ ] be defined as: ( ∀ ( ν, ξ ) ∈ R )Φ( ν, ξ ) = (cid:40) ϕ ( ν − ξ ) if ( ν, ξ ) ∈ [0 , + ∞ [ + ∞ otherwise . (28) Then, for every ( ν, ξ ) ∈ R , prox Φ ( ν, ξ ) = (cid:0) ν + ξ + π , ν + ξ − π (cid:1) if | π | < ν + ξ (0 , π ) if π > and π ≥ ν + ξ ( π , if π > and π ≥ ν + ξ (0 , otherwise, (29) with π = prox ϕ ( ν − ξ ) , π = prox ϕ ( ξ ) and π = prox ϕ ( ν ) .Proof . See Appendix AThe above proposition provides a simple characterization of the proximity operators ofsome distances defined for nonnegative-valued vectors. However, the assumptions made inProposition 2.6 are not satisfied by the class of functions Φ considered in Section 2.2. In thenext section, we will propose two algorithms for solving a general class of convex problemsinvolving these functions Φ. Indeed, none of the considered ϕ -divergences can be expressed as a function of the difference betweenthe two arguments. .4 Proximal splitting algorithms As soon as we know how to calculate the proximity operators of the functions involvedin Problem 2.1, various proximal methods can be employed to solve it numerically. Twoexamples of such methods are given subsequently.The first algorithm is PPXA+ [76] which constitutes an extension of PPXA (ParallelProXimal Agorithm) proposed in [57]. As it can be seen in [77,78], PPXA+ is an augmentedLagrangian-like methods (see also [76, Sec. 6]).
Algorithm 1
PPXA+
Initialization ( ω , . . . , ω S ) ∈ ]0 , + ∞ [ S +1 , ( t , , t , ) ∈ R P × R P , t , ∈ R K , . . . , t S +1 , ∈ R K S Q = (cid:16) ω A (cid:62) A + ω B (cid:62) B + S (cid:88) s =1 ω s T (cid:62) s T s (cid:17) − x = Q (cid:16) ω A (cid:62) t , + ω B (cid:62) t , + S (cid:88) s =1 ω s T (cid:62) s t s +1 , (cid:17) . For n = 0 , , . . . ( r ,n , r ,n ) = prox ω − D ( · + u, · + v ) ( t ,n , t ,n ) + e ,n For s = 0 , , . . . S (cid:4) r s +1 ,n = prox ω − s R s ( t s +1 ,n ) + e s,n y n = Q (cid:16) ω A (cid:62) r ,n + ω B (cid:62) r ,n + S (cid:88) s =1 ω s T (cid:62) s r s +1 ,n (cid:17) λ n ∈ ]0 , t ,n +1 = t ,n + λ n (cid:16) A (2 y n − x n ) − r ,n (cid:17) t ,n +1 = t ,n + λ n (cid:16) B (2 y n − x n ) − r ,n (cid:17) For s = 0 , , . . . S (cid:106) t s +1 ,n +1 = t s +1 ,n + λ n (cid:16) T s (2 y n − x n ) − r s +1 ,n (cid:17) x n +1 = x n + λ n ( y n − x n ) . In this algorithm, ω , . . . , ω S are weighting factors and ( λ n ) n ≥ are relaxation factors. Forevery n ≥
0, the variables e ,n ∈ R P × R P , e ,n ∈ R K , . . . , e S,n ∈ R K S model possible errorsin the computation of the proximity operators. For instance, these errors arise when theproximity operator is not available in a closed form, and one needs to compute it through10nner iterations. Under some technical conditions, the convergence of PPXA+ is guaranteed. Proposition 2.7 [76, Corollary 5.3]
Suppose that the following assumptions hold. (i)
The matrix A (cid:62) A + B (cid:62) B + S (cid:88) s =1 T (cid:62) s T s is invertible. (ii) There exists ˇ x ∈ R N such that A ˇ x + u ∈ ]0 , + ∞ [ P B ˇ x + v ∈ ]0 , + ∞ [ P ( ∀ s ∈ { , . . . , S } ) T s ˇ x ∈ ri(dom R s ) . (30)(iii) There exists λ ∈ ]0 , such that, for every n ∈ N , λ ≤ λ n +1 ≤ λ n < . (31)(iv) For every s ∈ { , . . . , S } , (cid:88) n ∈ N (cid:107) e s,n (cid:107) < + ∞ . (32) If the set of solutions to Problem 2.1 is nonempty, then any sequence ( x n ) n ∈ N generated byAlgorithm 1 converges to an element of this set. It can be noticed that, at each iteration n , PPXA+ requires to solve a linear system inorder to compute the intermediate variable y n . The computational cost of this operationmay be high when N is large. Proximal primal-dual approaches [79–86] allow us to circum-vent this difficulty. An example of such an approach is the Monotone+Lipschitz ForwardBackward Forward (M+LFBF) method [83] which takes the following form.11 lgorithm 2 M+LFBF
Initialization ( t , , t , ) ∈ R P × R P , t , ∈ R K , . . . , t S +1 , ∈ R K S x ∈ R N , β = (cid:16) (cid:107) A (cid:107) + (cid:107) B (cid:107) + S (cid:88) s =1 (cid:107) T s (cid:107) (cid:17) / ,ε ∈ ]0 , / ( β + 1)[ . For n = 0 , , . . . γ n ∈ [ ε, (1 − ε ) /β ] (cid:98) x n = x n − γ n ( A (cid:62) t ,n + B (cid:62) t ,n + S (cid:88) s =1 T (cid:62) s t s +1 ,n ) (cid:0)(cid:98) t ,n , (cid:98) t ,n (cid:1) = (cid:0) t ,n , t ,n (cid:1) + γ n (cid:0) Ax n , Bx n (cid:1) ( r ,n , r ,n ) = ( (cid:98) t ,n , (cid:98) t ,n ) − γ n prox γ − n D ( · + u, · + v ) ( γ − n (cid:98) t ,n , γ − n (cid:98) t ,n ) + e ,n (cid:0)(cid:101) t ,n , (cid:101) t ,n (cid:1) = (cid:0) r ,n , r ,n (cid:1) + γ n (cid:0) A (cid:98) x n , B (cid:98) x n (cid:1)(cid:0) t ,n +1 , t ,n +1 (cid:1) = (cid:0) t ,n , t ,n (cid:1) − (cid:0)(cid:98) t ,n , (cid:98) t ,n (cid:1) + (cid:0)(cid:101) t ,n , (cid:101) t ,n (cid:1) For s = 0 , , . . . S (cid:98) t s +1 ,n = t s +1 ,n + γ n T s x n r s +1 ,n = (cid:98) t s +1 ,n − γ n prox γ − n R s ( γ − n (cid:98) t s +1 ,n ) + e s,n (cid:101) t s +1 ,n = r s +1 ,n + γ n T s (cid:98) x n t s +1 ,n +1 = t s +1 ,n − (cid:98) t s +1 ,n + (cid:101) t s +1 ,n (cid:101) x n = (cid:98) x n − γ n ( A (cid:62) r ,n + B (cid:62) r ,n + S (cid:88) s =1 T (cid:62) s r s +1 ,n ) x n +1 = x n − (cid:98) x n + (cid:101) x n . In this algorithm, ( γ n ) n ≥ is a sequence of step-sizes, and e ,n ∈ R P × R P , e ,n ∈ R K ,. . . , e S,n ∈ R K S correspond to possible errors in the computation of proximity operators.The convergence is secured by the following result. Proposition 2.8 [83, Theorem 4.2]
Suppose that the following assumptions hold. (i)
There exists ˇ x ∈ R N such that (30) holds. (ii) ( ∀ s ∈ { , . . . , S } ) (cid:80) n ∈ N (cid:107) e s,n (cid:107) < + ∞ .If the set of solutions to Problem 2.1 is nonempty, then any sequence ( x n ) n ∈ N generated byAlgorithm 2 converges to an element of this set.
12t is worth highlighting that these two algorithms share two interesting features: manyoperations can be implemented in parallel (e.g., the loops on s ), there is a tolerance toerrors in the computation of the proximity operators. Recently, random block-coordinateversions of proximal algorithms have been proposed (see [87] and references therein) furtherimproving the flexibility of these methods. As shown by Proposition 2.5, we need to compute the proximity operator of a scaled versionof a function Φ ∈ Γ ( R ) as defined in (15). In the following, Θ denotes a primitive on]0 , + ∞ [ of the function ζ (cid:55)→ ζϕ (cid:48) ( ζ − ). The following functions will subsequently play animportant role: ϑ − : ]0 , + ∞ [ → R : ζ (cid:55)→ ϕ (cid:48) ( ζ − ) (33) ϑ + : ]0 , + ∞ [ → R : ζ (cid:55)→ ϕ ( ζ − ) − ζ − ϕ (cid:48) ( ζ − ) . (34)A first technical result is as follows. Lemma 3.1
Let γ ∈ ]0 , + ∞ [ , let ( υ, ξ ) ∈ R , and define χ − = inf (cid:8) ζ ∈ ]0 , + ∞ [ (cid:12)(cid:12) ϑ − ( ζ ) < γ − υ (cid:9) (35) χ + = sup (cid:8) ζ ∈ ]0 , + ∞ [ (cid:12)(cid:12) ϑ + ( ζ ) < γ − ξ (cid:9) (36) (with the usual convention inf ∅ = + ∞ and sup ∅ = −∞ ). If χ − (cid:54) = + ∞ , the function ψ : ]0 , + ∞ [ → R : ζ (cid:55)→ ζϕ ( ζ − ) − Θ( ζ ) + γ − υ ζ − γ − ξζ (37) is strictly convex on ] χ − , + ∞ [ . In addition, if (i) χ − (cid:54) = + ∞ and χ + (cid:54) = −∞ (ii) lim ζ → χ − ζ>χ − ψ (cid:48) ( ζ ) < ζ → χ + ψ (cid:48) ( ζ ) > then ψ admits a unique minimizer (cid:98) ζ on ] χ − , + ∞ [ , and (cid:98) ζ < χ + . roof . The derivative of ψ is, for every ζ ∈ ]0 , + ∞ [, ψ (cid:48) ( ζ ) = ϕ ( ζ − ) − ( ζ + ζ − ) ϕ (cid:48) ( ζ − ) + γ − υζ − γ − ξ = ζ (cid:0) γ − υ − ϑ − ( ζ ) (cid:1) + ϑ + ( ζ ) − γ − ξ. (38)The function ϑ − is decreasing as the convexity of ϕ yields( ∀ ζ ∈ ]0 , + ∞ [) ϑ (cid:48)− ( ζ ) = − ζ − ϕ (cid:48)(cid:48) ( ζ − ) ≤ . (39)This allows us to deduce thatif (cid:8) ζ ∈ ]0 , + ∞ [ (cid:12)(cid:12) ϑ − ( ζ ) < γ − υ (cid:9) (cid:54) = ∅ , then ] χ − , + ∞ [= (cid:8) ζ ∈ ]0 , + ∞ [ (cid:12)(cid:12) ϑ − ( ζ ) < γ − υ (cid:9) . (40)Similarly, the function ϑ + is increasing as the convexity of ϕ yields( ∀ ζ ∈ ]0 , + ∞ [) ϑ (cid:48) + ( ζ ) = ζ − ϕ (cid:48)(cid:48) ( ζ − ) ≥ (cid:8) ζ ∈ ]0 , + ∞ [ (cid:12)(cid:12) ϑ + ( ζ ) < γ − ξ (cid:9) (cid:54) = ∅ , then ]0 , χ + [= (cid:8) ζ ∈ ]0 , + ∞ [ (cid:12)(cid:12) ϑ + ( ζ ) < γ − ξ (cid:9) . (42)If ( χ − , χ + ) ∈ ]0 , + ∞ [ , then (38) leads to ψ (cid:48) ( χ − ) = ϑ + ( χ − ) − γ − ξ (43) ψ (cid:48) ( χ + ) = χ + (cid:0) γ − υ − ϑ − ( χ + ) (cid:1) . (44)So, Conditions (ii) and (iii) are equivalent to ϑ + ( χ − ) − γ − ξ < χ + (cid:0) γ − υ − ϑ − ( χ + ) (cid:1) > . (46)In view of (40) and (42), these inequalities are satisfied if and only if χ − < χ + . Thisinequality is also obviously satisfied if χ − = 0 or χ + = + ∞ . In addition, we have:( ∀ ζ ∈ ]0 , + ∞ [) ψ (cid:48)(cid:48) ( ζ ) = γ − υ − ϑ − ( ζ ) + ζ − (1 + ζ − ) ϕ (cid:48)(cid:48) ( ζ − ) . (47)When ζ > χ − (cid:54) = + ∞ , γ − υ − ϑ − ( ζ ) >
0, and the convexity of ϕ yields ψ (cid:48)(cid:48) ( ζ ) >
0. Thisshows that ψ is strictly convex on ] χ − , + ∞ [.If Conditions (i)-(iii) are satisfied, due to the continuity of ψ (cid:48) , there exists (cid:98) ζ ∈ ] χ − , χ + [such that ψ (cid:48) ( (cid:98) ζ ) = 0. Because of the strict convexity of ψ on ] χ − , + ∞ [, (cid:98) ζ is the uniqueminimizer of ψ on this interval.The required assumptions in the previous lemma can often be simplified as stated below.14 emma 3.2 Let γ ∈ ]0 , + ∞ [ and ( υ, ξ ) ∈ R . If ( χ − , χ + ) ∈ ]0 , + ∞ [ , then Conditions (ii)and (iii) in Lemma 3.1 are equivalent to: χ − < χ + . If χ − ∈ ]0 , + ∞ [ and χ + = + ∞ (resp. χ − = 0 and χ + ∈ ]0 , + ∞ [ ), Conditions (ii)-(iii) are satisfied if and only if lim ζ → + ∞ ψ (cid:48) ( ζ ) > (resp. lim ζ → ζ> ψ (cid:48) ( ζ ) < ).Proof . If ( χ − , χ + ) ∈ ]0 , + ∞ [ , we have already shown that Conditions (ii) and (iii) aresatisfied if and only χ − < χ + .If χ − ∈ ]0 , + ∞ [ and χ + = + ∞ (resp. χ − = 0 and χ + ∈ ]0 , + ∞ [), we still have ψ (cid:48) ( χ − ) = ϑ + ( χ − ) − γ − ξ < ψ (cid:48) ( χ + ) = χ + (cid:0) γ − υ − ϑ − ( χ + ) (cid:1) > , (49)which shows that Condition (ii) (resp. Condition (iii)) is always satisfied.By using the same expressions of χ − and χ + as in the previous lemmas, we obtain thefollowing characterization of the proximity operator of any scaled version of Φ: Proposition 3.3
Let γ ∈ ]0 , + ∞ [ and ( υ, ξ ) ∈ R . prox γ Φ ( υ, ξ ) ∈ ]0 , + ∞ [ if and only ifConditions (i)-(iii) in Lemma 3.1 are satisfied. When these conditions hold, prox γ Φ ( υ, ξ ) = (cid:0) υ − γ ϑ − ( (cid:98) ζ ) , ξ − γ ϑ + ( (cid:98) ζ ) (cid:1) (50) where (cid:98) ζ < χ + is the unique minimizer of ψ on ] χ − , + ∞ [ .Proof . For every ( υ, ξ ) ∈ R , such that Conditions (i)-(iii) in Lemma 3.1 hold, let υ = υ − γ ϑ − ( (cid:98) ζ ) (51) ξ = ξ − γ ϑ + ( (cid:98) ζ ) (52)where the existence of (cid:98) ζ ∈ ] χ − , χ + [ is guaranteed by Lemma 3.1. As consequences of (40)and (42), υ and ξ are positive. In addition, since ψ (cid:48) ( (cid:98) ζ ) = 0 ⇔ (cid:98) ζ (cid:0) γ − υ − ϑ − ( (cid:98) ζ ) (cid:1) = γ − ξ − ϑ + ( (cid:98) ζ ) (53)we derive from (51) and (52) that (cid:98) ζ = ξ/υ >
0. This allows us to re-express (51) and (52)as υ − υ + γϕ (cid:48) (cid:16) υξ (cid:17) = 0 (54) ξ − ξ + γ (cid:18) ϕ (cid:16) υξ (cid:17) − υξ ϕ (cid:48) (cid:16) υξ (cid:17)(cid:19) = 0 , (55)15hat is υ − υ + γ ∂ Φ ∂υ ( υ, ξ ) = 0 (56) ξ − ξ + γ ∂ Φ ∂ξ ( υ, ξ ) = 0 . (57)The latter equations are satisfied if and only if [26]( υ, ξ ) = prox γ Φ ( υ, ξ ) . (58)Conversely, for every ( υ, ξ ) ∈ R , let ( υ, ξ ) = prox γ Φ ( υ, ξ ). If ( υ, ξ ) ∈ ]0 , + ∞ [ , ( υ, ξ )satisfies (54) and (55). By setting (cid:101) ζ = ξ/υ >
0, after simple calculations, we find υ = υ − γ ϑ − ( (cid:101) ζ ) > ξ = ξ − γ ϑ + ( (cid:101) ζ ) > ψ (cid:48) ( (cid:101) ζ ) = 0 . (61)According to (40) and (42), (59) and (60) imply that χ − (cid:54) = + ∞ , χ + (cid:54) = −∞ , and (cid:101) ζ ∈ ] χ − , χ + [.In addition, according to Lemma 3.1, ψ (cid:48) is strictly increasing on ] χ − , + ∞ [ (since ψ is strictlyconvex on this interval). Hence, ψ (cid:48) has a limit at χ − (which may be equal to −∞ when χ − = −∞ ), and Condition (ii) is satisfied. Similarly, ψ (cid:48) has a limit at χ + (possibly equalto + ∞ when χ + = + ∞ ), and Condition (iii) is satisfied. Remark 3.4
In (15), a special case arises when( ∀ ζ ∈ ]0 , + ∞ [) ϕ ( ζ ) = (cid:101) ϕ ( ζ ) + ζ (cid:101) ϕ ( ζ − ) (62)where (cid:101) ϕ is a twice differentiable convex function on ]0 , + ∞ [. Then Φ takes a symmetricform, leading to L -divergences. It can then be deduced from (34) that, for every ζ ∈ ]0 , + ∞ [, ϑ − ( ζ ) = ϑ + ( ζ − ) = (cid:101) ϕ ( ζ ) + (cid:101) ϕ (cid:48) ( ζ − ) − ζ (cid:101) ϕ (cid:48) ( ζ ) . (63) Let us now apply the results in the previous section to the functionΦ( υ, ξ ) = υ ln (cid:18) υξ (cid:19) + ξ − υ if ( υ, ξ ) ∈ ]0 , + ∞ [ ξ if υ = 0 and ξ ∈ [0 , + ∞ [+ ∞ otherwise. (64)16his is a function in Γ ( R ) satisfying (15) with( ∀ ζ ∈ ]0 , + ∞ [) ϕ ( ζ ) = ζ ln ζ − ζ + 1 . (65) Proposition 4.1
The proximity operator of γ Φ with γ ∈ ]0 , + ∞ [ is, for every ( υ, ξ ) ∈ R , prox γ Φ ( υ, ξ ) = (cid:40) ( υ, ξ ) if exp ( γ − υ ) > − γ − ξ (0 , otherwise, (66) where υ = υ + γ ln (cid:98) ζ (67) ξ = ξ + γ (cid:16)(cid:98) ζ − − (cid:17) (68) and (cid:98) ζ is the unique minimizer on ] exp( − γ − υ ) , + ∞ [ of ψ : ]0 , + ∞ [ → R : (69) ζ (cid:55)→ (cid:16) ζ − (cid:17) ln ζ + 12 (cid:16) γ − υ − (cid:17) ζ + (1 − γ − ξ ) ζ. Proof . For every ( υ, ξ ) ∈ R , ( υ, ξ ) = prox γ Φ ( υ, ξ ) is such that ( υ, ξ ) ∈ dom Φ [88]. Let usfirst note that υ ∈ ]0 , + ∞ [ ⇔ ( υ, ξ ) ∈ ]0 , + ∞ [ . (70)We are now able to apply Proposition 3.3, where ψ is given by (69) and, for every ζ ∈ ]0 , + ∞ [, Θ( ζ ) = ζ (cid:18) − ln ζ (cid:19) − ϑ − ( ζ ) = − ln ζ (72) ϑ + ( ζ ) = 1 − ζ − . (73)In addition, χ − = exp( − γ − υ ) (74) χ + = (cid:40) (1 − γ − ξ ) − if ξ < γ + ∞ otherwise. (75)According to (70) and Proposition 3.3, υ ∈ ]0 , + ∞ [ if and only if Conditions (i)-(iii) inLemma 3.1 hold. Since χ − ∈ ]0 , + ∞ [ and lim ζ → + ∞ ψ (cid:48) ( ζ ) = + ∞ , Lemma 3.2 shows thatthese conditions are satisfied if and only if ξ ≥ γ or (cid:0) ξ < γ and exp( − υ/γ ) < (1 − γ − ξ ) − (cid:1) , (76)17hich is equivalent to exp( υ/γ ) > − γ − ξ. (77)Under this assumption, Proposition 3.3 leads to the expressions (67) and (68) of the prox-imity operator, where (cid:98) ζ is the unique minimizer on ] exp( − υ/γ ) , + ∞ [ of the function ψ .We have shown that υ > ⇔ (77). So, υ = 0 when (77) is not satisfied. Then, theexpression of ξ simply reduces to the asymmetric soft-thresholding rule [89]: ξ = (cid:40) ξ − γ if ξ > γ γ − υ ) ≤ − γ − ξ ⇒ ξ < γ , so that ξ is necessarily equal to 0. Remark 4.2
More generally, we can derive the proximity operator of (cid:101) Φ( υ, ξ ) = υ ln (cid:18) υξ (cid:19) + κ ( ξ − υ ) if ( υ, ξ ) ∈ ]0 , + ∞ [ κξ if υ = 0 and ξ ∈ [0 , + ∞ [+ ∞ otherwise, (79)where κ ∈ R . Of particular interest in the literature is the case when κ = 0 [9, 10, 19, 22].From Proposition 2.4(iv), we get, for every γ ∈ ]0 , + ∞ [ and for every ( υ, ξ ) ∈ R ,prox γ (cid:101) Φ ( υ, ξ ) = prox γ Φ ( υ + γκ − γ, ξ − γκ + γ ) , (80)where prox γ Φ is provided by Proposition 4.1. Remark 4.3
It can be noticed that ψ (cid:48) ( (cid:98) ζ ) = (cid:98) ζ ln (cid:98) ζ + γ − υ (cid:98) ζ − (cid:98) ζ − + 1 − γ − ξ = 0 (81)is equivalent to (cid:98) ζ − exp (cid:16)(cid:98) ζ − (cid:0)(cid:98) ζ − + γ − ξ − (cid:1)(cid:17) = exp( γ − υ ) . (82)In the case where ξ = γ , the above equation reduces to2 (cid:98) ζ − exp (cid:0) (cid:98) ζ − (cid:1) = 2 exp(2 γ − υ ) ⇔ (cid:98) ζ = (cid:18) W (2 e γ − υ ) (cid:19) / (83)where W is the Lambert W function [90]. When ξ (cid:54) = γ , although a closed-form expressionof (82) is not available, efficient numerical methods to compute (cid:98) ζ can be developed.18 emark 4.4 To minimize ψ in (69), we need to find the zero on ] exp( − γ − υ ) , + ∞ [ of thefunction: ( ∀ ζ ∈ ]0 , + ∞ [) ψ (cid:48) ( ζ ) = ζ ln ζ + γ − υ ζ − ζ − + 1 − γ − ξ. (84)This can be performed by Algorithm 3, the convergence of which is proved in Appendix B. Algorithm 3
Newton method for minimizing (69).
Set (cid:98) ζ = exp( − γ − υ ) For n = 0 , , . . . (cid:106) (cid:98) ζ n +1 = (cid:98) ζ n − ψ (cid:48) ( (cid:98) ζ n ) /ψ (cid:48)(cid:48) ( (cid:98) ζ n ) . Let us now consider the symmetric form of (64) given byΦ( υ, ξ ) = ( υ − ξ ) (cid:0) ln υ − ln ξ ) if ( υ, ξ ) ∈ ]0 , + ∞ [ υ = ξ = 0+ ∞ otherwise. (85)This function belongs to Γ ( R ) and satisfies (15) and (62) with( ∀ ζ ∈ ]0 , + ∞ [) (cid:101) ϕ ( ζ ) = − ln ζ. (86) Proposition 4.5
The proximity operator of γ Φ with γ ∈ ]0 , + ∞ [ is, for every ( υ, ξ ) ∈ R , prox γ Φ ( υ, ξ ) = (cid:40) ( υ, ξ ) if W ( e − γ − υ ) W ( e − γ − ξ ) < , otherwise (87) where υ = υ + γ (cid:0) ln (cid:98) ζ + (cid:98) ζ −
1) (88) ξ = ξ − γ (cid:0) ln (cid:98) ζ − (cid:98) ζ − + 1) (89) and (cid:98) ζ is the unique minimizer on ] W ( e − γ − υ ) , + ∞ [ of ψ : ]0 , + ∞ [ → R : ζ (cid:55)→ (cid:16) ζ ζ − (cid:17) ln ζ + ζ (cid:16) γ − υ − (cid:17) ζ − γ − ξζ. (90)19 roof . We apply Proposition 3.3 where ψ is given by (90) and, for every ζ ∈ ]0 , + ∞ [,Θ( ζ ) = ζ (cid:16) − ζ −
12 ln ζ (cid:17) (91) ϑ − ( ζ ) = ϑ + ( ζ − ) = − ln ζ − ζ + 1 . (92)The above equalities have been derived from (62) and (63). It can be deduced from (35),(36) and (92) that χ − + ln χ − = 1 − γ − υ (93) χ − + ln( χ − ) = 1 − γ − ξ (94)that is χ − = W ( e − γ − υ ) (95) χ + = (cid:0) W ( e − γ − ξ ) (cid:1) − . (96)According to Proposition 3.3, prox γ Φ ( υ, ξ ) ∈ ]0 , + ∞ [ if and only if Conditions (i)-(iii)in Lemma 3.1 hold. Lemma 3.2 shows that these conditions are satisfied if and only if W ( e − γ − υ ) W ( e − γ − ξ ) < . (97)Under this assumption, the expression of the proximity operator follows from Proposition3.3 and (92).We have shown that prox γ Φ ( υ, ξ ) ∈ ]0 , + ∞ [ ⇔ (97). Since prox γ Φ ( υ, ξ ) ∈ dom Φ, wenecessarily get prox γ Φ ( υ, ξ ) = (0 , Remark 4.6
To minimize ψ in (90), we need to find the zero on [ χ − , χ + ] of the function:( ∀ ζ ∈ ]0 , + ∞ [) ψ (cid:48) ( ζ ) = ( ζ + 1) ln ζ + ζ − ζ − + ζ + (cid:16) γ − υ − (cid:17) ζ + 1 − γ − ξ. (98)This can be performed by Algorithm 4, the convergence of which is proved in Appendix C. Algorithm 4
Projected Newton for minimizing (90).
Set (cid:98) ζ ∈ [ χ − , χ + ] (see (95)–(96) for the bound expressions) For n = 0 , , . . . (cid:106) (cid:98) ζ n +1 = P [ χ − ,χ + ] (cid:16)(cid:98) ζ n − ψ (cid:48) ( (cid:98) ζ n ) /ψ (cid:48)(cid:48) ( (cid:98) ζ n ) (cid:17) . emark 4.7 From a numerical standpoint, to avoid the arithmetic overflow in the expo-nentiations when γ − υ or γ − ξ tend to −∞ , one can use the asymptotic approximation ofthe Lambert W function for large values: for every τ ∈ [1 , + ∞ [, τ − ln τ + 12 ln ττ ≤ W (cid:0) e τ (cid:1) ≤ τ − ln τ + ee − ττ , (99)with equality only if τ = 1 [91]. Let us now consider the function of Γ ( R ) given byΦ( υ, ξ ) = (cid:40) ( √ υ − √ ξ ) if ( υ, ξ ) ∈ [0 , + ∞ [ + ∞ otherwise. (100)This symmetric function satisfies (15) and (62) with( ∀ ζ ∈ ]0 , + ∞ [) (cid:101) ϕ ( ζ ) = ζ − (cid:112) ζ. (101) Proposition 4.8
The proximity operator of γ Φ with γ ∈ ]0 , + ∞ [ is, for every ( υ, ξ ) ∈ R , prox γ Φ ( υ, ξ ) = (cid:40) ( υ, ξ ) if υ ≥ γ or (cid:16) − υγ (cid:17) (cid:16) − ξγ (cid:17) < , otherwise, (102) where υ = υ + γ ( ρ −
1) (103) ξ = ξ + γ (cid:0) ρ − − (cid:1) (104) and ρ is the unique solution on ]max(1 − γ − υ, , + ∞ [ of ρ + (cid:0) γ − υ − (cid:1) ρ + (cid:0) − γ − ξ (cid:1) ρ − . (105) Proof . For every ( υ, ξ ) ∈ R , ( υ, ξ ) = prox γ Φ ( υ, ξ ) is such that ( υ, ξ ) ∈ [0 , + ∞ [ [88]. Byusing the notation of Proposition 3.3 and by using Remark 3.4, we have that, for every ζ ∈ ]0 , + ∞ [, Θ( ζ ) = ζ − ζ / + 1 (106) ϑ − ( ζ ) = ϑ + ( ζ − ) = 1 − (cid:112) ζ (107)21nd χ − = (cid:40) (1 − γ − υ ) if υ < γ χ + = (cid:40) (1 − γ − ξ ) − if ξ < γ + ∞ otherwise. (109)According to Proposition 3.3, ( υ, ξ ) ∈ ]0 , + ∞ [ if and only if Conditions (i)-(iii) in Lemma3.1 hold. Under these conditions, Proposition 3.3 leads to υ = υ + γ ( (cid:98) ζ / −
1) (110) ξ = ξ + γ (cid:16)(cid:98) ζ − / − (cid:17) (111)where (cid:98) ζ is the unique minimizer on ] χ − , + ∞ [ of the function defined as, for every ζ ∈ ]0 , + ∞ [, ψ ( ζ ) = 25 ζ / − ζ / + γ − υ − ζ + (1 − γ − ξ ) ζ. (112)This means that (cid:98) ζ is the unique solution on ] χ − , + ∞ [ of the equation: ψ (cid:48) ( (cid:98) ζ ) = (cid:98) ζ / − (cid:98) ζ − / + ( γ − υ − (cid:98) ζ + 1 − γ − ξ = 0 . (113)By setting ρ = (cid:98) ζ / , (105) is obtained.Since lim ζ → ζ> ψ (cid:48) ( ζ ) = −∞ and lim ζ → + ∞ ψ (cid:48) ( ζ ) = + ∞ , Lemma 3.2 shows that Condi-tions (i)-(iii) are satisfied if and only if υ < γ, ξ < γ, and (1 − γ − υ ) < (1 − γ − ξ ) − or υ < γ and ξ ≥ γ or υ ≥ γ and ξ < γ or υ ≥ γ and ξ ≥ γ (114)or, equivalently υ < γ and (1 − γ − υ )(1 − γ − ξ ) < υ ≥ γ. (115)In turn, when (115) is not satisfied, we necessarily have υ = 0 or ξ = 0. In the firstcase, the expression of ξ is simply given by the asymmetric soft-thresholding rule in (78).Similarly, in the second case, we have υ = (cid:40) υ − γ if υ > γ υ > γ or ξ > γ , (114) is always satisfied, so that υ = ξ = 0.Altogether, the above results yield the expression of the proximity operator in (102).22 .4 Chi square divergence Let us now consider the function of Γ ( R ) given byΦ( υ, ξ ) = ( υ − ξ ) ξ if υ ∈ [0 , + ∞ [ and ξ ∈ ]0 , + ∞ [0 if υ = ξ = 0+ ∞ otherwise. (117)This function satisfies (15) with( ∀ ζ ∈ ]0 , + ∞ [) ϕ ( ζ ) = ( ζ − . (118) Proposition 4.9
The proximity operator of γ Φ with γ ∈ ]0 , + ∞ [ is, for every ( υ, ξ ) ∈ R , prox γ Φ ( υ, ξ ) = ( υ, ξ ) if υ > − γ and ξ > − (cid:18) υ + υ γ (cid:19)(cid:0) , max { ξ − γ, } (cid:1) otherwise , (119) where υ = υ + 2 γ (1 − ρ ) (120) ξ = ξ + γ ( ρ −
1) (121) and ρ is the unique solution on ]0 , γ − υ/ of ρ + (cid:0) γ − ξ (cid:1) ρ − γ − υ − . (122) Proof . By proceeding similarly to the proof of Proposition 4.8, we have that, for every ζ ∈ ]0 , + ∞ [, Θ( ζ ) = 2 ζ − ζ , ϑ − ( ζ ) = 2( ζ − − ϑ + ( ζ ) = 1 − ζ − , and χ − = (cid:40) (cid:0) γ − υ (cid:1) − if υ > − γ + ∞ otherwise (123) χ + = (cid:40)(cid:0) − γ − ξ (cid:1) − / if ξ < γ + ∞ otherwise. (124)According to Proposition 3.3, ( υ, ξ ) ∈ ]0 , + ∞ [ if and only if Conditions (i)-(iii) in Lemma3.1 hold. Then, ( υ, ξ ) = prox γ Φ ( υ, ξ ) is such that υ = υ + 2 γ (1 − (cid:98) ζ − ) and ξ = ξ + γ ( (cid:98) ζ − − (cid:98) ζ is the unique minimizer on ] χ − , + ∞ [ of the function: ψ : ]0 , + ∞ [ → R : ζ (cid:55)→ (cid:16) γ − υ (cid:17) ζ − (1 + γ − ξ ) ζ − ζ − . (cid:98) ζ is the unique solution on ] χ − , + ∞ [ of the equation: ψ (cid:48) ( (cid:98) ζ ) = (2 + γ − υ ) (cid:98) ζ − − γ − ξ − (cid:98) ζ − = 0 . (125)By setting ρ = (cid:98) ζ − , (122) is obtained. Lemma 3.2 shows that Conditions (ii) and (iii) aresatisfied if and only if υ > − γ, ξ < γ, and 22 + γ − υ < (cid:112) − γ − ξ or υ > − γ and ξ ≥ γ (126)or, equivalently, υ > − γ, ξ < γ, and 1 − γ − ξ < (cid:0) γ ) − υ (cid:1) or υ > − γ and ξ ≥ γ. (127)When (127) does not hold, we necessarily have υ = 0. The end of the proof is similar tothat of Proposition 4.8. Let α ∈ ]1 , + ∞ [ and consider the below function of Γ ( R )Φ( υ, ξ ) = υ α ξ α − if υ ∈ [0 , + ∞ [ and ξ ∈ ]0 , + ∞ [0 if υ = ξ = 0+ ∞ otherwise, (128)which corresponds to the case when( ∀ ζ ∈ ]0 , + ∞ [) ϕ ( ζ ) = ζ α . (129)Note that the above function Φ allows us to generate the R´enyi divergence up to a logtransform and a multiplicative constant. Proposition 4.10
The proximity operator of γ Φ with γ ∈ ]0 , + ∞ [ is, for every ( υ, ξ ) ∈ R , prox γ Φ ( υ, ξ ) = ( υ, ξ ) if υ > and γ α − ξ − α < (cid:18) υα (cid:19) αα − (cid:0) , max { ξ, } (cid:1) otherwise , (130) where υ = υ − γα (cid:98) ζ − α (131) ξ = ξ + γ ( α − (cid:98) ζ − α (132)24 nd (cid:98) ζ is the unique solution on ]( αγυ − ) α − , + ∞ [ of γ − υ (cid:98) ζ α − γ − ξ (cid:98) ζ α − α (cid:98) ζ + 1 − α = 0 . (133) Proof . We proceed similarly to the previous examples by noticing that, for every ζ ∈ ]0 , + ∞ [,Θ( ζ ) = (cid:40) α − α ζ − α if α (cid:54) = 3 α ln ζ if α = 3 (134) ϑ − ( ζ ) = αζ − α (135) ϑ + ( ζ ) = (1 − α ) ζ − α (136) ψ (cid:48) ( ζ ) = (1 − α ) ζ − α − αζ − α + γ − υζ − γ − ξ (137)and χ − = (cid:16) γαυ (cid:17) α − if υ > ∞ otherwise (138) χ + = (cid:16) γ (1 − α ) ξ (cid:17) /α if ξ < ∞ otherwise. (139)Note that (133) becomes a polynomial equation when α is a rational number. In partic-ular, when α = 2, it reduces to the cubic equation: ρ + (cid:0) γ − ξ (cid:1) ρ − γ − υ = 0 (140)with (cid:98) ζ = ρ − . I α divergence Let α ∈ ]0 ,
1[ and consider the function of Γ ( R ) given byΦ( υ, ξ ) = (cid:40) αυ + (1 − α ) ξ − υ α ξ − α if ( υ, ξ ) ∈ [0 , + ∞ [ + ∞ otherwise (141)which corresponds to the case when( ∀ ζ ∈ ]0 , + ∞ [) ϕ ( ζ ) = 1 − α + αζ − ζ α . (142)25 roposition 4.11 The proximity operator of γ Φ with γ ∈ ]0 , + ∞ [ is, for every ( υ, ξ ) ∈ R , prox γ Φ ( υ, ξ ) = ( υ, ξ ) if υ ≥ γα or − ξγ (1 − α ) < (cid:18) − υγα (cid:19) αα − (0 , otherwise , (143) where υ = υ + γα ( (cid:98) ζ − α −
1) (144) ξ = ξ + γ (1 − α )( (cid:98) ζ − α −
1) (145) and (cid:98) ζ is the unique solution on (cid:3)(cid:0) max { − υγα , } (cid:1) − α , + ∞ (cid:2) of α (cid:98) ζ + ( γ − υ − α ) (cid:98) ζ α +1 + (1 − α − γ − ξ ) (cid:98) ζ α = 1 − α. (146) Proof . We have then, for every ζ ∈ ]0 , + ∞ [,Θ( ζ ) = α (cid:16) ζ − ζ − α − α (cid:17) (147) ϑ − ( ζ ) = α (1 − ζ − α ) (148) ϑ + ( ζ ) = (1 − α )(1 − ζ − α ) (149) ψ (cid:48) ( ζ ) = αζ − α + (cid:16) υγ − α (cid:17) ζ + α − ζ α + 1 − α − ξγ (150)and χ − = (cid:16) − υγα (cid:17) − α if υ < γα χ + = (cid:16) − ξγ (1 − α ) (cid:17) − /α if ξ < γ (1 − α )+ ∞ otherwise. (152)The result follows by noticing that lim ζ → ζ> ψ (cid:48) ( ζ ) = −∞ and lim ζ → + ∞ ψ (cid:48) ( ζ ) = + ∞ .As for the Renyi divergence, (146) becomes a polynomial equation when α is a rationalnumber. Remark 4.12
We can also derive the proximity operator of (cid:101) Φ( υ, ξ ) = (cid:26) κ (cid:0) αυ + (1 − α ) ξ (cid:1) − υ α ξ − α if ( υ, ξ ) ∈ [0 , + ∞ [ + ∞ otherwise, (153)where κ ∈ R . From Proposition 2.4(iv), we get, for every γ ∈ ]0 , + ∞ [ and for every( υ, ξ ) ∈ R , prox γ (cid:101) Φ ( υ, ξ ) = prox γ Φ (cid:0) υ + γ (1 − κ ) α, ξ + γ (1 − κ )(1 − α ) (cid:1) , (154)where prox γ Φ is provided by Proposition 4.11.26able 1: Conjugate function ϕ ∗ of the restriction of ϕ to [0 , + ∞ [.Divergence ϕ ( ζ ) ϕ ∗ ( ζ ∗ ) ζ > ζ ∗ ∈ R Kullback-Leibler ζ ln ζ − ζ + 1 e ζ ∗ − ζ −
1) ln ζ W ( e − ζ ∗ ) + (cid:0) W ( e − ζ ∗ ) (cid:1) − + ζ ∗ − ζ − √ ζ ζ ∗ − ζ ∗ if ζ ∗ < ∞ otherwiseChi square ( ζ − ζ ∗ ( ζ ∗ + 4)4 if ζ ∗ ≥ − − α ∈ ]1 , + ∞ [ ζ α ( α − (cid:16) ζ ∗ α (cid:17) αα − if ζ ∗ ≥
00 otherwiseI α , α ∈ ]0 ,
1[ 1 − α + αζ − ζ α (1 − α ) (cid:16)(cid:16) − ζ ∗ α (cid:17) αα − − (cid:17) if ζ ∗ ≤ α + ∞ otherwise Proximal methods iterate a sequence of steps in which proximity operators are evalu-ated. The efficient computation of these operators is thus essential for dealing with high-dimensional convex optimization problems. In the context of constrained optimization, atleast one of the additive terms of the global cost to be minimized consists of the indicatorfunction of a closed convex set, whose proximity operator reduces to the projections ontothis set. However, if we except a few well-known cases, such projection does not admit aclosed-form expression. The resolution of large-scale optimization problems involving nontrivial constraints is thus quite challenging. This difficulty can be circumvented when theconstraint can be expressed as the lower-level set of some separable function, by makinguse of epigraphical projection techniques. Such approaches have attracted interest in thelast years [25, 62, 92–95]. The idea consists of decomposing the constraint of interest into27he intersection of a half-space and a number of epigraphs of simple functions. For thisapproach to be successful, it is mandatory that the projection onto these epigraphs can beefficiently computed.The next proposition shows that the expressions of the projection onto the epigraph of awide range of functions can be deduced from the expressions of the proximity operators of ϕ -divergences. In particular, in Table 1, for each of the ϕ -divergences presented in Section 3,we list the associated functions ϕ ∗ for which such projections can thus be derived. Proposition 5.1
Let ϕ : R → [0 , + ∞ ] be a function in Γ ( R ) which is twice differentiableon ]0 , + ∞ [ . Let Φ be the function defined by (15) and ϕ ∗ ∈ Γ ( R ) the Fenchel-conjugatefunction of the restriction of ϕ on [0 , + ∞ [ , defined as ( ∀ ζ ∗ ∈ R ) ϕ ∗ ( ζ ∗ ) = sup ζ ∈ [0 , + ∞ [ ζζ ∗ − ϕ ( ζ ) . (155) Let the epigraph of ϕ ∗ be defined as epi ϕ ∗ = (cid:8) ( υ ∗ , ξ ∗ ) ∈ R (cid:12)(cid:12) ϕ ∗ ( υ ∗ ) ≤ ξ ∗ (cid:9) . (156) Then, the projection onto epi ϕ ∗ is: for every ( υ ∗ , ξ ∗ ) ∈ R , P epi ϕ ∗ ( υ ∗ , ξ ∗ ) = ( υ ∗ , − ξ ∗ ) − prox Φ ( υ ∗ , − ξ ∗ ) . (157) Proof . The conjugate function of Φ is, for every ( υ, ξ ) ∈ R ,Φ ∗ ( υ ∗ , ξ ∗ ) = sup ( υ,ξ ) ∈ R υυ ∗ + ξξ ∗ − Φ( υ, ξ ) . (158)From the definition of Φ, we deduce that, for all ( υ, ξ ) ∈ R ,Φ ∗ ( υ ∗ , ξ ∗ ) = sup (cid:110) sup ( υ,ξ ) ∈ [0 , + ∞ [ × ]0 , + ∞ [ (cid:16) υυ ∗ + ξξ ∗ − ξϕ (cid:16) υξ (cid:17)(cid:17) , sup υ ∈ ]0 , + ∞ [ (cid:16) υυ ∗ − lim ξ → ξ> ξϕ (cid:16) υξ (cid:17)(cid:17) , (cid:111) (159)= sup (cid:110) sup ( υ,ξ ) ∈ [0 , + ∞ [ × ]0 , + ∞ [ (cid:16) υυ ∗ + ξξ ∗ − ξϕ (cid:16) υξ (cid:17)(cid:17) , (cid:111) (160)= sup { ι epi ϕ ∗ ( υ ∗ , − ξ ∗ ) , } (161)= ι epi ϕ ∗ ( υ ∗ , − ξ ∗ ) , (162)where the equality in (161) stems from [72, Example 13.8]. Then, (157) follows from theconjugation property of the proximity operator (see Proposition 2.4 (v)).28 Experimental results
To illustrate the potential of our results, we consider a query optimization problem indatabase management systems where the optimal query execution plan depends on theaccurate estimation of the proportion of tuples that satisfy the predicates in the query. Morespecifically, every request formulated by a user can be viewed as an event in a probabilityspace (Ω , T , P ), where Ω is a finite set of size N . In order to optimize request fulfillment,it is useful to accurately estimate the probabilities, also called selectivities , associated witheach element of Ω. To do so, rough estimations of the probabilities of a certain number P ofevents can be inferred from the history of formulated requests and some a priori knowledge.Let x = ( x ( n ) ) ≤ n ≤ N ∈ R N be the vector of sought probabilities, and let z = ( z ( i ) ) ≤ i ≤ P ∈ [0 , P be the vector of roughly estimated probabilities. The problem of selectivity estimationis equivalent to the following constrained entropy maximization problem [96]:minimize x ∈ R N N (cid:88) n =1 x ( n ) ln x ( n ) s . t . Ax = z, N (cid:88) n =1 x ( n ) = 1 ,x ∈ [0 , N , (163)where A ∈ R P × N is a binary matrix establishing the theoretical link between the probabili-ties of each event and the probabilities of the elements of Ω belonging to it.Unfortunately, due to the inaccuracy of the estimated probabilities, the intersectionbetween the affine constraints Ax = z and the other ones may be empty, making the aboveproblem infeasible. In order to overcome this issue, we propose to jointly estimate theselectivities and the feasible probabilities. Our idea consists of reformulating Problem (163)by introducing the divergence between Ax and an additional vector y of feasible probabilities.This allows us to replace the constraint Ax = z with an (cid:96) k -ball centered in z , yieldingminimize ( x,y ) ∈ R N × R P D ( Ax, y ) + λ N (cid:88) n =1 x ( n ) ln x ( n ) s . t . (cid:107) y − z (cid:107) k ≤ η, N (cid:88) n =1 x ( n ) = 1 ,x ∈ [0 , N , (164)where D is defined in (14), λ and η are some positive constants, whereas k ∈ [1 , + ∞ [ (thechoice k = 2 yields the Euclidean ball).To demonstrate the validity of this approach, we compare it with the following methods:(i) a relaxed version of Problem (163), in which the constraint Ax = z is replaced with a29quared Euclidean distance:minimize x ∈ R N (cid:107) Ax − z (cid:107) + λ N (cid:88) n =1 x ( n ) ln x ( n ) s . t . N (cid:88) n =1 x ( n ) = 1 ,x ∈ [0 , N , (165)or with ϕ -divergence D :minimize x ∈ R N D ( Ax, z ) + λ N (cid:88) n =1 x ( n ) ln x ( n ) s . t . N (cid:88) n =1 x ( n ) = 1 ,x ∈ [0 , N , (166)where λ is some positive constant;(ii) the two-step procedure in [95], which consists of finding a solution (cid:98) x tominimize x ∈ R N Q (cid:0) Ax, z (cid:1) s . t . N (cid:88) n =1 x ( n ) = 1 ,x ∈ [0 , N , (167)and then solving (163) by replacing z with (cid:98) z = A (cid:98) x . Hereabove, for every y ∈ R P , Q ( y, z ) = (cid:80) Pi =1 φ ( y ( i ) /z ( i ) ) is a sum of quotient functions, i.e. φ ( ξ ) = ξ, if ξ ≥ ,ξ − , if 0 < ξ < , + ∞ , otherwise . (168)For the numerical evaluation, we adopt an approach similar to [95], and we first considerthe following low-dimensional setting: A = , z = . . . . . . , (169)for which there exists no x ∈ [0 , + ∞ [ N such that Ax = z . To assess the quality of thesolutions x ∗ obtained with the different methods, we evaluate the max-quotient between Ax ∗ and z , that is [95] Q ∞ ( Ax ∗ , z ) = max ≤ i ≤ P φ (cid:18) [ Ax ∗ ] ( i ) z ( i ) (cid:19) . 
(170)30able 2: Comparison of Q ∞ -scoresProblem (164) (166) (167)+(163) [95] (165) ϕ KL / Q ∞ -scores (lower is better) obtained with the different approaches.For all the considered ϕ -divergences, the proposed approach performs favorably with re-spect to the state-of-the-art, the KL divergence providing the best performance among thepanel of considered ϕ -divergences. For the sake of fairness, the hyperparameters λ and η were hand-tuned in order to get the best possible score for each compared method. Thegood performance of our approach is related to the fact that ϕ -divergences are well suitedfor the estimation of probability distributions.Figure 1 next shows the computational time for solving Problem (164) for various di-mensions N of the selectivity vector to be estimated, with A and z randomly generated soas to keep the ratio N/P equal to 7 /
6. To make this comparison, the primal-dual prox-imal method recalled in Algorithm 2 was implemented in MATLAB R2015, by using thestopping criterion (cid:107) x n +1 − x n (cid:107) < − (cid:107) x n (cid:107) . We then measured the execution times on anIntel i5 CPU at 3.20 GHz with 12 GB of RAM. The results show that all the considered ϕ -divergences can be efficiently optimized, with no significant computational time differencesbetween them. In this paper, we have shown how to solve convex optimization problems involving discreteinformation divergences by using proximal methods. We have carried out a thorough studyof the properties of the proximity operators of ϕ -divergences, which has led us to derivenew tractable expressions of them. In addition, we have related these expressions to theprojection onto the epigraph of a number of convex functions.Finally, we have illustrated our results on a selectivity estimation problem, where ϕ -divergences appear to be well suited for the estimation of the sought probability distri-butions. Moreover, computational time evaluations allowed us to show that the proposednumerical methods provide efficient solutions for solving large-scale optimization problems. Note that the Renyi divergence is not suitable for the considered application, because it tends to favorsparse solutions. N in Problem (164). A Proof of Proposition 2.6
Let ( ν, ξ ) ∈ R . From Proposition 2.4(i), we know that prox Φ ( ν, ξ ) ∈ [0 , + ∞ [ . By usingProposition 2.4(ii), we have the following equivalences: (cid:40) ( ν, ξ ) ∈ ]0 , + ∞ [ ( ν, ξ ) = prox Φ ( ν, ξ ) ⇔ ( ν, ξ ) ∈ ]0 , + ∞ [ ν − ν ∈ ∂ϕ ( ν − ξ ) ξ − ξ ∈ − ∂ϕ ( ν − ξ ) ⇔ (cid:40) ( ν, ξ ) ∈ ]0 , + ∞ [ ( ν, ξ ) = prox ˜Φ ( ν, ξ ) , (171)where ˜Φ : ( ν, ξ ) (cid:55)→ ϕ ( ν − ξ ). By using now Proposition 2.4(vi), we get(171) ⇔ (cid:40) ( ν, ξ ) ∈ ]0 , + ∞ [ ( ν, ξ ) = ( ν, ξ ) + (cid:0) prox ϕ ( ν − ξ ) − ν + ξ (cid:1) (1 , − ⇔ ( ν, ξ ) ∈ ]0 , + ∞ [ ν = (cid:0) ν + ξ + prox ϕ ( ν − ξ ) (cid:1) ξ = (cid:0) ν + ξ − prox ϕ ( ν − ξ ) (cid:1) ⇔ (cid:12)(cid:12) prox ϕ ( ν − ξ ) (cid:12)(cid:12) < ν + ξν = (cid:0) ν + ξ + prox ϕ ( ν − ξ ) (cid:1) ξ = (cid:0) ν + ξ − prox ϕ ( ν − ξ ) (cid:1) . (172)32imilarly, we have (cid:40) ν = 0 , ξ ∈ ]0 , + ∞ [( ν, ξ ) = prox Φ ( ν, ξ ) ⇔ ν = 0 , ξ ∈ ]0 , + ∞ [ ν − ϕ (cid:48) ( − ξ ) ∈ ∂ι [0 , + ∞ [ (0) =] − ∞ , ξ − ξ − ϕ (cid:48) ( − ξ ) = 0 ⇔ ν = 0 , ξ ∈ ]0 , + ∞ [ ξ ≥ ν + ξξ = prox ϕ ( −· ) ξ ⇔ prox ϕ ( −· ) ξ ∈ ]0 , + ∞ [ ∩ [ ν + ξ, + ∞ [ ν = 0 ξ = prox ϕ ( −· ) ξ ⇔ prox ϕ ξ ∈ ]0 , + ∞ [ ∩ [ ν + ξ, + ∞ [ ν = 0 ξ = prox ϕ ξ, (173)where the last equivalence results from the assumption that ϕ is even. Symmetrically, (cid:40) ν ∈ ]0 , + ∞ [ , ξ = 0( ν, ξ ) = prox Φ ( ν, ξ ) ⇔ prox ϕ ν ∈ ]0 , + ∞ [ ∩ [ ν + ξ, + ∞ [ ν = prox ϕ νξ = 0 . (174) B Convergence proof of Algorithm 3
We aim at finding the unique zero on ] exp( − γ − υ ) , + ∞ [ of the function ψ (cid:48) given by (84)along with its derivatives:( ∀ ζ ∈ ]0 , + ∞ [) ψ (cid:48)(cid:48) ( ζ ) = 1 + ln ζ + γ − υ + ζ − , (175) ψ (cid:48)(cid:48)(cid:48) ( ζ ) = ζ − − ζ − . (176)To do so, we employ the Newton method given in Algorithm 3, the convergence of which ishere established. Assume that • (cid:0) υ, ξ (cid:1) ∈ R are such that exp( γ − υ ) > − γ − ξ , • (cid:98) ζ is the zero on ]exp( − γ − υ ) , + ∞ [ of ψ (cid:48) ,33 ( (cid:98) ζ n ) n ∈ N is the sequence generated by Algorithm 3, • (cid:15) n = (cid:98) ζ n − (cid:98) ζ for every n ∈ N .We first recall a fundamental property of the Newton method, and then we proceed to theactual convergence proof. Lemma B.1
For every $n\in\mathbb{N}$,
\[
\varepsilon_{n+1} = \varepsilon_n^2\, \frac{\psi'''(\varrho_n)}{2\,\psi''(\widehat{\zeta}_n)}, \tag{177}
\]
where $\varrho_n$ is between $\widehat{\zeta}_n$ and $\widehat{\zeta}$.

Proof. The definition of $\varepsilon_{n+1}$ yields
\[
\varepsilon_{n+1} = \widehat{\zeta}_n - \frac{\psi'(\widehat{\zeta}_n)}{\psi''(\widehat{\zeta}_n)} - \widehat{\zeta}
= \frac{\varepsilon_n\,\psi''(\widehat{\zeta}_n) - \psi'(\widehat{\zeta}_n)}{\psi''(\widehat{\zeta}_n)}. \tag{178}
\]
Moreover, for every $\widehat{\zeta}_n \in \,]0,+\infty[$, the second-order Taylor expansion of $\psi'$ around $\widehat{\zeta}_n$ is
\[
\psi'(\widehat{\zeta}) = \psi'(\widehat{\zeta}_n) + \psi''(\widehat{\zeta}_n)(\widehat{\zeta}-\widehat{\zeta}_n) + \frac{1}{2}\,\psi'''(\varrho_n)(\widehat{\zeta}-\widehat{\zeta}_n)^2, \tag{179}
\]
where $\varrho_n$ is between $\widehat{\zeta}_n$ and $\widehat{\zeta}$. From the above equality, we deduce that $\psi'(\widehat{\zeta}) = \psi'(\widehat{\zeta}_n) - \psi''(\widehat{\zeta}_n)\,\varepsilon_n + \frac{1}{2}\,\psi'''(\varrho_n)\,\varepsilon_n^2 = 0$, that is $\psi'(\widehat{\zeta}_n) = \psi''(\widehat{\zeta}_n)\,\varepsilon_n - \frac{1}{2}\,\psi'''(\varrho_n)\,\varepsilon_n^2$. Plugging this expression into (178) yields (177).

Proposition B.2
The sequence $(\widehat{\zeta}_n)_{n\in\mathbb{N}}$ converges to $\widehat{\zeta}$.

Proof. The assumption $\exp(\gamma-\upsilon) > 1 - \gamma - \xi$ implies that $\psi'$ is negative at the initial value $\widehat{\zeta}_0 = \exp(-\gamma+\upsilon)$, that is
\[
\psi'(\widehat{\zeta}_0) = -\exp(\gamma-\upsilon) + 1 - \gamma - \xi < 0. \tag{180}
\]
Moreover, $\psi'$ is increasing on $[\exp(-\gamma+\upsilon),+\infty[$, since
\[
\big(\forall \zeta \in \big[\exp(-\gamma+\upsilon),+\infty\big[\big) \quad \psi''(\zeta) > 0, \tag{181}
\]
and $\psi'$ is concave on $]0,\sqrt{2}]$ and convex on $[\sqrt{2},+\infty[$, since
\[
\big(\forall \zeta \in \big]\sqrt{2},+\infty\big[\big) \quad \psi'''(\zeta) > 0, \tag{182}
\]
\[
\big(\forall \zeta \in \big]0,\sqrt{2}\big[\big) \quad \psi'''(\zeta) < 0. \tag{183}
\]
To prove the convergence, we consider the following cases:

• Case $\widehat{\zeta} \le \sqrt{2}$: $\psi'$ is increasing and concave on $[\widehat{\zeta}_0,\sqrt{2}]$. Hence, $(\widehat{\zeta}_n)_{n\in\mathbb{N}}$ monotonically increases to $\widehat{\zeta}$ [97].

• Case $\sqrt{2} \le \widehat{\zeta}_0 < \widehat{\zeta}$: $\psi'$ is increasing and convex on $[\widehat{\zeta}_0,+\infty[$. Hence, Lemma B.1 yields $\varepsilon_1 = \widehat{\zeta}_1 - \widehat{\zeta} > 0$.
It then follows from standard properties of the Newton method applied to an increasing convex function that $(\widehat{\zeta}_n)_{n\ge 1}$ monotonically decreases to $\widehat{\zeta}$ [97].

• Case $\widehat{\zeta}_0 < \sqrt{2} < \widehat{\zeta}$: as $\psi'$ is negative and increasing on $[\widehat{\zeta}_0,\widehat{\zeta}[$, the quantity $-\psi'/\psi''$ is positive and lower bounded on $[\widehat{\zeta}_0,\sqrt{2}]$:
\[
\big(\forall \zeta \in [\widehat{\zeta}_0,\sqrt{2}]\big) \quad -\frac{\psi'(\zeta)}{\psi''(\zeta)} \ge -\frac{\psi'(\sqrt{2})}{\psi''(\widehat{\zeta}_0)} > 0 \tag{184}
\]
(indeed, $-\psi'$ is positive and decreasing on this interval, while (183) implies that $\psi''$ is decreasing on it, so that $\psi'' \le \psi''(\widehat{\zeta}_0)$). There thus exists $k \in \mathbb{N}^*$ such that $\widehat{\zeta}_0 < \widehat{\zeta}_1 < \cdots < \widehat{\zeta}_k$ and $\widehat{\zeta}_k > \sqrt{2}$.
Then, the convergence of $(\widehat{\zeta}_n)_{n\ge k}$ follows from the same arguments as in the previous case.
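The Newton iteration analyzed above can be sketched in a few lines, assuming for $\psi'$ the antiderivative of (175) that is consistent with (180); the parameter values, names, and tolerance below are illustrative only.

```python
# A sketch of the Newton iteration studied in this appendix, assuming
# (consistently with (175), (176) and (180))
#   psi'(z)  = z*ln(z) + (gamma - upsilon)*z - 1/z + 1 - gamma - xi,
#   psi''(z) = 1 + ln(z) + gamma - upsilon + z**(-2).
# Names, parameter values and the tolerance are illustrative only.
import math

def newton_zero(gamma, upsilon, xi, tol=1e-12, max_iter=100):
    dpsi  = lambda z: z * math.log(z) + (gamma - upsilon) * z - 1.0 / z + 1.0 - gamma - xi
    d2psi = lambda z: 1.0 + math.log(z) + gamma - upsilon + z ** (-2)
    z = math.exp(-gamma + upsilon)     # initial point, where psi' < 0 by (180)
    for _ in range(max_iter):
        step = dpsi(z) / d2psi(z)      # psi'' > 0 on [exp(-gamma+upsilon), +oo[ by (181)
        z -= step
        if abs(step) <= tol * max(1.0, z):
            break
    return z

# exp(gamma - upsilon) > 1 - gamma - xi holds for these values, as required.
print(newton_zero(gamma=1.0, upsilon=0.5, xi=0.3))
```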
C Convergence proof of Algorithm 4

We aim at finding the unique zero on $\left]W(e^{-\gamma+\upsilon}),+\infty\right[$ of the function $\psi'$ given by (98), whose derivative reads
\[
(\forall \zeta \in \,]0,+\infty[) \quad \psi''(\zeta) = \ln\zeta + \frac{1}{\zeta} + \frac{1}{\zeta^2} + 2\zeta + \gamma - \upsilon. \tag{185}
\]
To do so, we employ the projected Newton algorithm, whose global convergence is guaranteed for any initial value by the following condition [98]:
\[
(\forall a \in \,]0,+\infty[)(\forall b \in \,]a,+\infty[) \quad \psi''(a) + \psi''(b) > \frac{\psi'(b)-\psi'(a)}{b-a}, \tag{186}
\]
which is equivalent to
\[
(b-a)\,\psi''(a) + (b-a)\,\psi''(b) - \psi'(b) + \psi'(a) > 0. \tag{187}
\]
Condition (187) can be rewritten as follows
\[
\begin{aligned}
&(b-a)\Big(\ln a + \tfrac{1}{a} + \tfrac{1}{a^2} + 2a + \gamma - \upsilon\Big) + (b-a)\Big(\ln b + \tfrac{1}{b} + \tfrac{1}{b^2} + 2b + \gamma - \upsilon\Big)\\
&\quad + \Big((a+1)\ln a + a^2 - \tfrac{1}{a} + (\gamma-\upsilon-1)\,a - \gamma - \xi\Big)\\
&\quad - \Big((b+1)\ln b + b^2 - \tfrac{1}{b} + (\gamma-\upsilon-1)\,b - \gamma - \xi\Big) > 0,
\end{aligned} \tag{188}
\]
which, after some simplification, boils down to
\[
(b+1)\ln a - (a+1)\ln b + \frac{b}{a} + \frac{b}{a^2} - \frac{a}{b} - \frac{a}{b^2} - \frac{2}{a} + \frac{2}{b} + b^2 - a^2 + (\gamma-\upsilon+1)(b-a) > 0. \tag{189}
\]
We now show that Condition (189) holds true because $b > a$ and its left-hand side is the sum of the two terms
\[
(b+1)\ln a - (a+1)\ln b + \frac{b}{a} + \frac{b}{a^2} - \frac{a}{b} - \frac{a}{b^2} - \frac{2}{a} + \frac{2}{b} + b^2 - a^2 > 0 \tag{190}
\]
and
\[
(\gamma-\upsilon+1)(b-a) > 0. \tag{191}
\]
Indeed, (190) can be rewritten as
\[
(\forall a \in \,]0,+\infty[)(\forall b \in \,]a,+\infty[) \quad g(a,b) - g(b,a) > 0,
\]
where
\[
g(x,y) = -(x+1)\ln y - \frac{x}{y} + \frac{y}{x^2} - \frac{2}{x} + y^2.
\]
Therefore, we shall demonstrate that, for every $b > a > 0$, $g$ is decreasing w.r.t. the first argument and increasing w.r.t. the second argument, i.e.
\[
g(a,b) > g(b,b) \quad\text{and}\quad g(b,b) > g(b,a), \tag{192}
\]
which implies that $g(a,b) > g(b,a)$. To prove these two inequalities, we will study the derivatives of $g$ with respect to its arguments. The conditions in (192) indeed follow from
\[
(\forall y \in \,]0,+\infty[)(\forall x \in \,]0,y[) \quad \frac{\partial g}{\partial x}(x,y) < 0, \tag{193}
\]
\[
(\forall x \in \,]0,+\infty[)(\forall y \in \,]x,+\infty[) \quad \frac{\partial g}{\partial y}(x,y) > 0. \tag{194}
\]
The first and second partial derivatives of $g$ w.r.t. $x$ read
\[
(\forall y \in \,]0,+\infty[)(\forall x \in \,]0,y[) \quad \frac{\partial g}{\partial x}(x,y) = -\ln y - \frac{1}{y} - \frac{2y}{x^3} + \frac{2}{x^2},
\]
\[
\frac{\partial^2 g}{\partial x^2}(x,y) = \frac{6y}{x^4} - \frac{4}{x^3} = \frac{6y-4x}{x^4} > 0.
\]
Since $\partial^2 g/\partial x^2$ is strictly positive, $\partial g/\partial x$ is strictly increasing w.r.t. $x$ and
\[
\lim_{x\to y} \frac{\partial g}{\partial x}(x,y) = -\ln y - \frac{1}{y} = \ln\frac{1}{y} - \frac{1}{y} < 0.
\]
Therefore, Condition (193) holds, and $g$ is decreasing with respect to $x$.

The first and second partial derivatives of $g$ w.r.t. $y$ read
\[
(\forall x \in \,]0,+\infty[)(\forall y \in \,]x,+\infty[) \quad \frac{\partial g}{\partial y}(x,y) = -\frac{x}{y} - \frac{1}{y} + \frac{x}{y^2} + \frac{1}{x^2} + 2y,
\]
\[
\frac{\partial^2 g}{\partial y^2}(x,y) = \frac{x}{y^2} + \frac{1}{y^2} - \frac{2x}{y^3} + 2.
\]
For every $y \in [1,+\infty[$,
\[
(\forall x \in \,]0,y[) \quad \frac{\partial^2 g}{\partial y^2}(x,y) = \frac{x}{y^2} + \frac{1}{y^2} - \frac{2x}{y^3} + 2 > 0
\]
(since $x < y$ and $y \ge 1$ imply $2x/y^3 < 2/y^2 \le 2$), so $\partial g/\partial y$ is strictly increasing w.r.t. $y$ (as $\partial^2 g/\partial y^2$ is strictly positive), and
\[
(\forall x \in [1,+\infty[) \quad \lim_{y\to x} \frac{\partial g}{\partial y}(x,y) = 2x - 1 + \frac{1}{x^2} > 0,
\]
\[
(\forall x \in \,]0,1[) \quad \frac{\partial g}{\partial y}(x,1) = \frac{1}{x^2} + 1 > 0.
\]
For every $y \in \,]0,1[$,
\[
(\forall x \in \,]0,y[) \quad \frac{\partial g}{\partial y}(x,y) = -\frac{x}{y} + \frac{x}{y^2} - \frac{1}{y} + \frac{1}{x^2} + 2y = \frac{x-xy}{y^2} - \frac{1}{y} + \frac{1}{x^2} + 2y.
\]
Since $x < y < 1$, this implies that $xy < x$ and $x^2 < xy < y$, so that
\[
(\forall x \in \,]0,y[) \quad \frac{\partial g}{\partial y}(x,y) = \underbrace{\frac{x-xy}{y^2}}_{>0} + \underbrace{\frac{1}{x^2} - \frac{1}{y}}_{>0} + 2y > 0.
\]
As Condition (194) holds, $g$ is increasing with respect to $y$.
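Condition (186) can also be spot-checked numerically. The sketch below assumes the explicit expressions of $\psi'$ and $\psi''$ appearing in (188) and (185); the parameter values and the sampling scheme are illustrative only.

```python
# A numerical spot-check of the global-convergence condition (186), assuming
# (consistently with (185) and (188))
#   psi'(z)  = (z + 1)*ln(z) + z**2 - 1/z + (gamma - upsilon - 1)*z - gamma - xi,
#   psi''(z) = ln(z) + 1/z + 1/z**2 + 2*z + gamma - upsilon.
# Parameter values are illustrative only.
import math, random

gamma, upsilon, xi = 1.0, 0.5, 0.3

def dpsi(z):
    return (z + 1) * math.log(z) + z**2 - 1.0/z + (gamma - upsilon - 1.0)*z - gamma - xi

def d2psi(z):
    return math.log(z) + 1.0/z + 1.0/z**2 + 2.0*z + gamma - upsilon

random.seed(0)
for _ in range(10_000):
    a = random.uniform(1e-3, 10.0)
    b = a + random.uniform(1e-3, 10.0)
    # condition (186): psi''(a) + psi''(b) > (psi'(b) - psi'(a)) / (b - a)
    assert d2psi(a) + d2psi(b) > (dpsi(b) - dpsi(a)) / (b - a)
print("Condition (186) verified on all sampled pairs.")
```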
References

[1] K. Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can reasonably be supposed to have arisen from random sampling," Phil. Mag., vol. 50, pp. 157–175, 1900.
[2] E. Hellinger, "Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen," J. für die reine und angewandte Mathematik, vol. 136, pp. 210–271, 1909.
[3] C. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 623–656, Jul. 1948.
[4] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
[5] I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Magyar Tud. Akad. Mat. Kutató Int. Közl., vol. 8, pp. 85–108, 1963.
[6] A. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," J. R. Stat. Soc. Series B Stat. Methodol., vol. 28, no. 1, pp. 131–142, 1966.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory, New York, USA: Wiley-Interscience, 1991.
[8] I. Sason and S. Verdú, "f-divergence inequalities," IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
[9] R. E. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. Inf. Theory, vol. 18, no. 4, pp. 460–473, Jul. 1972.
[10] S. Arimoto, "An algorithm for computing the capacity of arbitrary discrete memoryless channels," IEEE Trans. Inf. Theory, vol. 18, no. 1, pp. 14–20, Jan. 1972.
[11] M. Chiang and S. Boyd, "Geometric programming duals of channel capacity and rate distortion," IEEE Trans. Inf. Theory, vol. 50, no. 2, pp. 245–258, Feb. 2004.
[12] H. H. Bauschke, P. L. Combettes, and D. Noll, "Joint minimization with alternating Bregman proximity operators," Pac. J. Optim., vol. 2, no. 3, pp. 401–424, Sep. 2006.
[13] P. L. Combettes and Q. V. Nguyen, "Solving composite monotone inclusions in reflexive Banach spaces by constructing best Bregman approximations from their Kuhn-Tucker set," J. Convex Anal., vol. 23, no. 2, May 2016.
[14] S. Chrétien and A. O. Hero, "On EM algorithms and their proximal generalizations," Control Optim. Calc. Var., vol. 12, pp. 308–326, Jan. 2008.
[15] C. L. Byrne, "Iterative image reconstruction algorithms based on cross-entropy minimization," IEEE Trans. Image Process., vol. 2, no. 1, pp. 96–103, Jan. 1993.
[16] W. Richardson, "Bayesian-based iterative method of image restoration," J. Opt. Soc. Am. A, vol. 62, no. 1, pp. 55–59, Jan. 1972.
[17] L. B. Lucy, "An iterative technique for the rectification of observed distributions," Astron. J., vol. 79, no. 6, pp. 745–754, 1974.
[18] J. A. Fessler, "Hybrid Poisson/polynomial objective functions for tomographic image reconstruction from transmission scans," IEEE Trans. Image Process., vol. 4, no. 10, pp. 1439–1450, Oct. 1995.
[19] F.-X. Dupé, M. J. Fadili, and J.-L. Starck, "A proximal iteration for deconvolving Poisson noisy images using sparse representations," IEEE Trans. Image Process., vol. 18, no. 2, pp. 310–321, Feb. 2009.
[20] R. Zanella, P. Boccacci, L. Zanni, and M. Bertero, "Efficient gradient projection methods for edge-preserving removal of Poisson noise," Inverse Probl., vol. 25, no. 4, 2009.
[21] P. Piro, S. Anthoine, E. Debreuve, and M. Barlaud, "Combining spatial and temporal patches for scalable video indexing," Multimed. Tools Appl., vol. 48, no. 1, pp. 89–104, May 2010.
[22] N. Pustelnik, C. Chaux, and J.-C. Pesquet, "Parallel proximal algorithm for image restoration using hybrid regularization," IEEE Trans. Image Process., vol. 20, no. 9, pp. 2450–2462, Nov. 2011.
[23] T. Teuber, G. Steidl, and R. H. Chan, "Minimization and parameter estimation for seminorm regularization models with I-divergence constraints," Inverse Probl., vol. 29, pp. 1–28, 2013.
[24] M. Carlavan and L. Blanc-Féraud, "Sparse Poisson noisy image deblurring," IEEE Trans. Image Process., vol. 21, no. 4, pp. 1834–1846, Apr. 2012.
[25] S. Harizanov, J.-C. Pesquet, and G. Steidl, "Epigraphical projection for solving least squares Anscombe transformed constrained optimization problems," in Scale-Space and Variational Methods in Computer Vision, A. Kuijper et al., Ed., vol. 7893 of Lect. Notes Comput. Sc., pp. 125–136, 2013.
[26] P. L. Combettes and J.-C. Pesquet, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Eds., pp. 185–212. Springer-Verlag, New York, 2011.
[27] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2014.
[28] B. C. Vemuri, L. Meizhu, S.-I. Amari, and F. Nielsen, "Total Bregman divergence and its applications to DTI analysis," IEEE Trans. Med. Imag., vol. 30, no. 2, pp. 475–483, Feb. 2011.
[29] L. Meizhu, B. C. Vemuri, S.-I. Amari, and F. Nielsen, "Total Bregman divergence and its applications to shape retrieval," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, Jun. 2010, pp. 3463–3468.
[30] H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proc. R. Soc. Lond. A Math. Phys. Sci., vol. 186, no. 1007, pp. 453–461, 1946.
[31] F. Nielsen and R. Nock, "Sided and symmetrized Bregman centroids," IEEE Trans. Inf. Theory, vol. 55, no. 6, pp. 2882–2904, 2009.
[32] F. Nielsen, "Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms," IEEE Signal Process. Lett., vol. 20, no. 7, pp. 657–660, Jul. 2013.
[33] R. Beran, "Minimum Hellinger distance estimates for parametric models," Ann. Stat., vol. 5, no. 3, pp. 445–463, 1977.
[34] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall International, 1982.
[35] A. L. Gibbs and F. E. Su, "On choosing and bounding probability metrics," Int. Stat. Rev., vol. 70, no. 3, pp. 419–435, 2002.
[36] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4394–4412, 2006.
[37] T. W. Rauber, T. Braun, and K. Berns, "Probabilistic distance measures of the Dirichlet and beta distributions," Pattern Recogn., vol. 41, no. 2, pp. 637–645, 2008.
[38] L. LeCam, "Convergence of estimates under dimensionality restrictions," Ann. Stat., vol. 1, no. 1, pp. 38–53, 1973.
[39] S. van de Geer, "Hellinger-consistency of certain nonparametric maximum likelihood estimators," Ann. Stat., vol. 21, no. 1, pp. 14–44, 1993.
[40] L. Chang-Hwan, "A new measure of rule importance using Hellinger divergence," in Int. Conf. on Data Analytics, Barcelona, Spain, Sep. 2012, pp. 103–106.
[41] D. Cieslak, T. Hoens, N. Chawla, and W. Kegelmeyer, "Hellinger distance decision trees are robust and skew-insensitive," Data Min. Knowl. Discov., vol. 24, no. 1, pp. 136–158, 2012.
[42] I. Park, S. Seth, M. Rao, and J. C. Principe, "Estimation of symmetric chi-square divergence for point processes," in Proc. Int. Conf. Acoust. Speech Signal Process., Prague, Czech Republic, May 2011, pp. 2016–2019.
[43] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symp. on Math. Statist. and Prob., Berkeley, California, Jun. 1961, vol. 1, pp. 547–561.
[44] P. Harremoës, "Interpretations of Rényi entropies and divergences," Physica A: Statistical Mechanics and its Applications, vol. 365, no. 1, pp. 57–62, 2006.
[45] I. Vajda, Theory of Statistical Inference and Information, Kluwer, Dordrecht, 1989.
[46] F. Liese and I. Vajda, Convex Statistical Distances, Teubner, 1987.
[47] A. O. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Process. Mag., vol. 19, no. 5, pp. 85–95, 2002.
[48] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations," Ann. Stat., vol. 23, pp. 493–507, 1952.
[49] I. Csiszár, "Information measures: A critical survey," IEEE Trans. Inf. Theory, vol. A, pp. 73–86, 1974.
[50] S.-I. Amari, "Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes," IEEE Trans. Inf. Theory, vol. 55, no. 11, pp. 4925–4931, 2009.
[51] J. Zhang, "Divergence function, duality, and convex analysis," Neural Comput., vol. 16, pp. 159–195, 2004.
[52] T. Minka, "Divergence measures and message passing," Tech. Rep., Microsoft Research Technical Report (MSR-TR-2005), 2005.
[53] A. Cichocki, L. Lee, Y. Kim, and S. Choi, "Non-negative matrix factorization with α-divergence," Pattern Recogn. Lett., vol. 29, no. 9, pp. 1433–1440, 2008.
[54] I. Csiszár and F. Matúš, "On minimization of multivariate entropy functionals," in IEEE Information Theory Workshop, Jun. 2009, pp. 96–100.
[55] X. L. Nguyen, M. J. Wainwright, and M. I. Jordan, "On surrogate loss functions and f-divergences," Ann. Stat., vol. 37, no. 2, pp. 876–904, 2009.
[56] C. Chaux, P. L. Combettes, J.-C. Pesquet, and V. R. Wajs, "A variational formulation for frame-based inverse problems," Inverse Probl., vol. 23, no. 4, pp. 1495–1518, Jun. 2007.
[57] P. L. Combettes and J.-C. Pesquet, "A proximal decomposition method for solving convex variational inverse problems," Inverse Probl., vol. 24, no. 6, pp. 065014, Dec. 2008.
[58] L. Condat, "Fast projection onto the simplex and the l1 ball," Math. Program. Series A, 2015, to appear.
[59] J.-D. Benamou and Y. Brenier, "A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem," Numer. Math., vol. 84, no. 3, pp. 375–393, 2000.
[60] J.-D. Benamou, Y. Brenier, and K. Guittet, "The Monge-Kantorovich mass transfer and its computational fluid mechanics formulation," Int. J. Numer. Meth. Fluids, vol. 40, pp. 21–30, 2002.
[61] P. L. Combettes and C. L. Müller, "Perspective functions: Proximal calculus and applications in high-dimensional statistics," J. Math. Anal. Appl., 2016.
[62] G. Chierchia, N. Pustelnik, J.-C. Pesquet, and B. Pesquet-Popescu, "Epigraphical projection and proximal tools for solving constrained convex optimization problems," Signal Image Video P., vol. 9, no. 8, pp. 1737–1749, Nov. 2015.
[63] R. Gaetano, G. Chierchia, and B. Pesquet-Popescu, "Parallel implementations of a disparity estimation algorithm based on a proximal splitting method," in Proc. Int. Conf. Visual Commun. Image Process., San Diego, USA, 2012.
[64] J. J. Moreau, "Proximité et dualité dans un espace hilbertien," Bull. Soc. Math. France, vol. 93, pp. 273–299, 1965.
[65] T. Cover, "An algorithm for maximizing expected log investment return," IEEE Trans. Inf. Theory, vol. 30, no. 2, pp. 369–373, 1984.
[66] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc. B, pp. 1–38, Dec. 1977.
[67] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proc. of EMNLP, 2008, pp. 1090–1099.
[68] G. Tartavel, G. Peyré, and Y. Gousseau, "Wasserstein loss for image synthesis and restoration," SIAM J. Imaging Sci., vol. 9, no. 4, pp. 1726–1755, 2016.
[69] M. El Gheche, A. Jezierska, J.-C. Pesquet, and J. Farah, "A proximal approach for signal recovery based on information measures," in Proc. Eur. Signal Process. Conf., Marrakech, Morocco, Sep. 2013, pp. 1–5.
[70] M. El Gheche, J.-C. Pesquet, and J. Farah, "A proximal approach for optimization problems involving Kullback divergences," in Proc. Int. Conf. Acoust. Speech Signal Process., Vancouver, Canada, May 2013, pp. 5984–5988.
[71] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, 2nd edition, 2006.
[72] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, New York, Jan. 2011.
[73] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms, Part I: Fundamentals, vol. 305 of Grundlehren der mathematischen Wissenschaften, Springer-Verlag, Berlin, Heidelberg, N.Y., 2nd edition, 1996.
[74] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.
[75] M. Basseville, "Distance measures for signal processing and pattern recognition," European J. Signal Process., vol. 18, no. 4, pp. 349–369, 1989.
[76] J.-C. Pesquet and N. Pustelnik, "A parallel inertial proximal optimization method," Pac. J. Optim., vol. 8, no. 2, pp. 273–305, Apr. 2012.
[77] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, "An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems," IEEE Trans. Image Process., vol. 20, no. 3, pp. 681–695, Mar. 2011.
[78] S. Setzer, G. Steidl, and T. Teuber, "Deblurring Poissonian images by split Bregman techniques," J. Visual Communication and Image Representation, vol. 21, no. 3, pp. 193–199, Apr. 2010.
[79] G. Chen and M. Teboulle, "A proximal-based decomposition method for convex minimization problems," Math. Program., vol. 64, no. 1–3, pp. 81–101, Mar. 1994.
[80] E. Esser, X. Zhang, and T. Chan, "A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science," SIAM J. Imaging Sci., vol. 3, no. 4, pp. 1015–1046, Dec. 2010.
[81] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," J. Math. Imag. Vis., vol. 40, no. 1, pp. 120–145, May 2011.
[82] L. M. Briceño-Arias and P. L. Combettes, "A monotone + skew splitting model for composite monotone inclusions in duality," SIAM J. Opt., vol. 21, no. 4, pp. 1230–1250, Oct. 2011.
[83] P. L. Combettes and J.-C. Pesquet, "Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators," Set-Valued Var. Anal., vol. 20, no. 2, pp. 307–330, Jun. 2012.
[84] B. C. Vũ, "A splitting algorithm for dual monotone inclusions involving cocoercive operators," Adv. Comput. Math., vol. 38, no. 3, pp. 667–681, Apr. 2013.
[85] L. Condat, "A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms," J. Optimiz. Theory App., vol. 158, no. 2, pp. 460–479, Aug. 2013.
[86] N. Komodakis and J.-C. Pesquet, "Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems," IEEE Signal Process. Mag., vol. 32, no. 6, pp. 31–54, Nov. 2015.
[87] J.-C. Pesquet and A. Repetti, "A class of randomized primal-dual algorithms for distributed optimization," J. Nonlinear Convex Anal., vol. 16, no. 12, Dec. 2015.
[88] J. J. Moreau, "Fonctions convexes duales et points proximaux dans un espace hilbertien," C. R. Acad. Sci., vol. 255, pp. 2897–2899, 1962.
[89] P. L. Combettes and J.-C. Pesquet, "Proximal thresholding algorithm for minimization over orthonormal bases," SIAM J. Optim., vol. 18, no. 4, pp. 1351–1376, Nov. 2007.
[90] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, "On the Lambert W function," Adv. Comput. Math., vol. 5, no. 1, pp. 329–359, 1996.
[91] A. Hoorfar and M. Hassani, "Inequalities on the Lambert W function and hyperpower function," J. Inequal. Pure Appl. Math., vol. 9, no. 2, pp. 5–9, 2008.
[92] M. Tofighi, K. Kose, and A. E. Cetin, "Signal reconstruction framework based on Projections onto Epigraph Set of a Convex cost function (PESC)," Preprint arXiv:1402.2088, 2014.
[93] S. Ono and I. Yamada, "Second-order Total Generalized Variation constraint," in Proc. Int. Conf. Acoust. Speech Signal Process., Florence, Italy, May 2014, pp. 4938–4942.
[94] P.-W. Wang, M. Wytock, and J. Z. Kolter, "Epigraph projections for fast general convex programming," in Proc. of ICML, 2016.
[95] G. Moerkotte, M. Montag, A. Repetti, and G. Steidl, "Proximal operator of quotient functions with application to a feasibility problem in query optimization," J. Comput. Appl. Math., vol. 285, pp. 243–255, Sep. 2015.
[96] V. Markl, P. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. Tran, "Consistent selectivity estimation via maximum entropy," VLDB J., vol. 16, no. 1, pp. 55–76, 2007.
[97] D. R. Kincaid and E. W. Cheney, Numerical Analysis: Mathematics of Scientific Computing, vol. 2, American Mathematical Soc., 2002.
[98] L. Thorlund-Petersen, "Global convergence of Newton's method on an interval,"