A characterization of proximity operators
aa r X i v : . [ m a t h . C A ] F e b A CHARACTERIZATION OF PROXIMITY OPERATORS
R´EMI GRIBONVAL AND MILA NIKOLOVA
Abstract.
We characterize proximity operators, that is to say functions that map a vectorto a solution of a penalized least squares optimization problem. Proximity operators of convexpenalties have been widely studied and fully characterized by Moreau. They are also widelyused in practice with nonconvex penalties such as the ℓ pseudo-norm, yet the extension ofMoreau’s characterization to this setting seemed to be a missing element of the literature. Wecharacterize proximity operators of (convex or nonconvex) penalties as functions that are thesubdifferential of some convex potential. This is proved as a consequence of a more generalcharacterization of so-called Bregman proximity operators of possibly nonconvex penalties interms of certain convex potentials. As a side effect of our analysis, we obtain a test to verifywhether a given function is the proximity operator of some penalty, or not. Many well-knownshrinkage operators are indeed confirmed to be proximity operators. However, we prove thatwindowed Group-LASSO and persistent empirical Wiener shrinkage – two forms of so-called so-cial sparsity shrinkage– are generally not the proximity operator of any penalty; the exception iswhen they are simply weighted versions of group-sparse shrinkage with non-overlapping groups. Keywords:
Proximity operator; Convex regularization; Nonconvex regularization; Subdiffer-ential; Shrinkage operator; Social sparsity; Group sparsity Introduction and overview
Proximity operators have become an important ingredient of nonsmooth optimization, wherea huge body of work has demonstrated the power of iterative proximal algorithms to addresslarge-scale variational optimization problems. While these techniques have been thoroughlyanalyzed and understood for proximity operators involving convex penalties, there is a definitetrend towards the use of proximity operators of nonconvex penalties such as the ℓ penalty [7, 8].This paper extends existing characterizations of proximity operators – which are specializedfor convex penalties – to the nonconvex case. A particular motivation is to understand whethercertain thresholding rules known as social sparsity shrinkage , which have been successfully ex-ploited in the context of certain linear inverse problems, are proximity operators. Another mo-tivation is to characterize when Bayesian estimation with the conditional mean estimator (also This work and the companion paper [19] are dedicated to the memory of Mila Nikolova, who passed awayprematurely in June 2018. Mila dedicated much of her energy to bring the technical content to completion duringthe spring of 2018. The first author did his best to finalize the papers as Mila would have wished. He should beheld responsible for any possible imperfection in the final manuscript.R. Gribonval ([email protected]) was with Univ Rennes, Inria, CNRS, IRISA when this work was conducted.He is now with Univ Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, F-69342, Lyon, France.M. Nikolova, CMLA, CNRS and ENS de Cachan, Universit´e Paris-Saclay, 94235 Cachan, France. known as minimum mean square error estimation or MMSE) can be expressed as a proximityoperator. This is the object of a companion paper [19] characterizing when certain variationalapproaches to address inverse problems can in fact be considered as Bayesian approaches.1.1.
Characterization of proximity operators.
Let H be a Hilbert space equipped with aninner product h· , ·i and a norm k · k . This includes the case H = R n , and most of the text can beread with this simpler setting in mind. The proximity operator of a function ϕ : H → R mapseach y ∈ H to the solutions of a penalized least-squares problem y prox ϕ ( y ) := arg min x ∈H (cid:8) k y − x k + ϕ ( x ) (cid:9) Formally, a proximity operator is set-valued as there may be several solutions to this problem,or the set of solutions may be empty. A primary example is the soft-thresholding function f ( y ) := y max(1 − / | y | , y ∈ R , which is the proximity operator of the absolute valuefunction ϕ ( x ) := | x | .Proximity operators can be defined for certain generalized functions ϕ : H → R ∪ { + ∞} .A particular example is the projection onto a given closed convex set C ⊂ H , which can bewritten as proj C = prox ϕ with ϕ the indicator function of C , i.e., ϕ ( x ) = 0 if x ∈ C , ϕ ( x ) = + ∞ otherwise. For the sake of precision and brevity, we use the following definition: Definition . Let
Y ⊂ H be non-empty. A function f : Y → H is a proximity operator of afunction ϕ : H → R ∪ { + ∞} if, and only if, f ( y ) ∈ prox ϕ ( y ) for each y ∈ Y . In convex analysis, this corresponds to the notion of a selection of the set-valued mappingprox ϕ .A characterization of proximity operators of convex lower semicontinuous (l.s.c.) functions is due to Moreau. It involves the subdifferential ∂θ ( x ) of a convex l.s.c. function θ at x , i.e., theset of all its subgradients at x [14, Chapter III.2] . Proposition . [26, Corollary 10.c] A function f : H → H defined everywhere is the proximityoperator of a proper convex l.s.c. function ϕ : H → R ∪ { + ∞} if, and only if the followingconditions hold jointly: (a) there exists a (convex l.s.c.) function ψ such that for each y ∈ H , f ( y ) ∈ ∂ψ ( y ) ; (b) f is nonexpansive, i.e. k f ( y ) − f ( y ′ ) k k y − y ′ k , ∀ y, y ′ ∈ H . We extend Moreau’s result to possibly nonconvex functions ϕ on subdomains of H by simplyrelaxing the non-expansivity condition: Theorem . Let
Y ⊂ H be non-empty. A function f : Y → H is a proximity operator of afunction ϕ : H → R ∪ { + ∞} if, and only if, there exists a convex l.s.c. function ψ : H → R ∪ { + ∞} such that for each y ∈ Y , f ( y ) ∈ ∂ψ ( y ) . See Section 2.1 for detailed notations and reminders on convex analysis and differentiability in Hilbert spaces.
CHARACTERIZATION OF PROXIMITY OPERATORS 3
This is proved in Section 2 as a particular consequence of our main result, Theorem 3, whichcharacterizes functions such that f ( y ) ∈ arg min x ∈H { D ( x, y ) + ϕ ( x ) } for certain types of data-fidelity terms D ( x, y ). Among others, the data-fidelity terms covered by Theorem 3 include: • the Euclidean distance D ( x, y ) = k y − x k , which is the data-fidelity associated toproximity operators; • its variant D ( x, y ) = k y − M x k with M some linear operator; and • Bregman divergences [9], leading to an analog of Theorem 1 to characterize so-calledBregman proximity operators [12] (see Corollary 5 in Section 2).Theorem 3 further implies that the functions ϕ and ψ in Theorem 1 can be chosen such that(1) ψ ( y ) = h y, f ( y ) i − k f ( y ) k − ϕ ( f ( y )) , ∀ y ∈ Y . This is a particular instance of a more general result valid for all considered data-fidelity terms.Another consequence of Theorem 3 (see Corollary 4 in Section 2) is that for the considereddata-fidelity terms D ( x, y ), if f : Y → H can be written as f ( y ) ∈ arg min x ∈H { D ( x, y ) + ϕ ( x ) } for some (possibly nonconvex) function ϕ and if its image Im( f ) := f ( Y ) is a convex set (e.g.,if Im( f ) = H ) then the function x D ( x, y ) + ϕ ( x ) is convex on Im( f ).This is reminiscent of observations on convex optimization with nonconvex penalties [28, 31]and on the hidden convexity of conditional mean estimation under additive Gaussian noise[17, 18, 25, 1]. The latter is extended to other noise models in the companion paper [19].1.2. The case of smooth proximity operators.
The smoothness of a proximity operator f = prox ϕ and that of the corresponding functions ϕ and ψ , cf (1), are inter-related, leading toa characterization of continuous proximity operators . Corollary . Let
Y ⊂ H be non-empty and open and f : Y → H be C . The following areequivalent: (a) f is a proximity operator of a function ϕ : H → R ∪ { + ∞} ; (b) there exists a convex C ( Y ) function ψ such that f ( y ) = ∇ ψ ( y ) for each y ∈ Y . This is established in Section 2.6 as a particular consequence of our second main result,Corollary 6. There, we also prove that when f is a proximity operator of some ϕ , the Lipschitzproperty of f with Lipschitz constant L is equivalent to the convexity of x ϕ ( x )+ (cid:0) − L (cid:1) k x k .Moreau’s characterization (Proposition 1) corresponds to the special case L = 1. Next, wecharacterize C proximity operators on convex domains more explicitly using the differential of f . Theorem . Let
Y ⊂ H be non-empty, open and convex, and f : Y → H be C . The followingproperties are equivalent: (a) f is a proximity operator of a function ϕ : H → R ∪ { + ∞} ; (b) there exists a convex C ( Y ) function ψ such that f ( y ) = ∇ ψ ( y ) for each y ∈ Y ; See Section 2.1 for brief reminders on the notion of continuity / differentiability in Hilbert spaces.
R´EMI GRIBONVAL AND MILA NIKOLOVA (c) the differential Df ( y ) is a symmetric positive semi-definite operator for each y ∈ Y .Proof . Since f is C , the equivalence (a) ⇔ (b) is a consequence of Corollary 1. We nowestablish (b) ⇔ (c). Since Y is convex it is simply connected, and as Y is open by Poincar’slemma (see [16, Theorem 6.6.3] when H = R n ) the differential Df is symmetric if, and only if, f is the gradient of some C function ψ . By [6, Proposition 17.7], the function ψ is convex iff ∇ ψ (cid:23) Y , i.e. iff Df (cid:23) Y . (cid:3) Corollary . Let
Y ⊂ H be a set with non-empty interior int( Y ) = ∅ , y ∈ int( Y ) , and f : Y → H be a proximity operator. If f is C in a neighborhood of y , then Df ( y ) is symmetricpositive semi-definite.Proof . Restrict f to any open convex neighborhood Y ′ ⊂ Y of y and apply Theorem 2. (cid:3) Remark . Differentials are perhaps more familiar to some readers in the context of multivariatecalculus: when y = ( y j ) nj =1 ∈ H = R n and f ( y ) = ( f i ( y )) ni =1 , Df ( y ) is identified to the Jacobianmatrix J f ( y ) = ( ∂f i ∂y j ) i,j n . The rows of
J f ( y ) are the transposed gradients ∇ f i ( y ). The differential is symmetric if themixed derivatives satisfy ∂f i ∂y j = ∂f j ∂y i for all i = j . When n = 3, this corresponds to f beingan irrotational vector field . More generally, this characterizes the fact that f is a so-called conservative field , i.e., a vector field that is the gradient of some potential function. As theJacobian is the Hessian of this potential, it is positive definite if the potential is convex.Finally we provide conditions ensuring that f is a proximity operator and that f ( y ) is theonly critical point of the corresponding optimization problem. Corollary . Let
Y ⊂ H be open and convex, and f : Y → H be C with Df ( y ) ≻ on Y .Then f is injective and there is ϕ : H → R ∪ { + ∞} such that prox ϕ ( y ) = { f ( y ) } , ∀ y ∈ Y and dom( ϕ ) = Im( f ) . Moreover, if Df ( y ) is boundedly invertible on Y then ϕ is C on Y and foreach y ∈ Y , the only critical point of x k y − x k + ϕ ( x ) is x = f ( y ) . This is established in Appendix A.6.
Remark . In finite dimension H = R n , Df ( y ) is boundedly invertible as soon as Df ( y ) ≻ Df ( y ) ≻ f ( y ) is the unique criticalpoint. This is no longer the case in infinite dimension. Indeed, consider H = ℓ ( N ) and f : y =( y n ) n ∈ N f ( y ) := ( y n / ( n + 1)) n ∈ N . As f is linear, its differential is Df ( y ) = f for every y ∈ H .As h f ( y ) , y i = P n ∈ N y n / ( n + 1) > y ∈ H we have Df ( y ) ≻ n ∈ N and z ∈ R we have z/ ( n + 1) = arg min x ∈ R ( z − x ) + nx / f = prox ϕ with ϕ : x = ( x n ) n ∈ N ϕ ( x ) := P n ∈ N nx n /
2. Setting ϕ ( x ) = ϕ ( x )for x ∈ Im( f ), ϕ ( x ) = + ∞ otherwise, we have prox ϕ = f and dom( ϕ ) = Im( f ) = { x ∈ A continuous linear operator L : H → H is symmetric if h x, Ly i = h Lx, y i for each x, y ∈ H . A symmetriccontinuous linear operator is positive semi-definite if h x, Lx i > x ∈ H . This is denoted L (cid:23)
0. It ispositive definite if h x, Lx i > x ∈ H . This is denoted L ≻ CHARACTERIZATION OF PROXIMITY OPERATORS 5 H , P n ∈ N ( n + 1) x n < ∞} . Yet, as no point in dom( ϕ ) admits any open neighborhood in H , ϕ is nowhere differentiable and every x ∈ H is a critical point of x k y − x k + ϕ ( x ). Terminology.
Proximity operators often appear in the context of penalized least squaresregression, where ϕ is called a penalty , and from now on we will adopt this terminology. Inlight of Corollary 1, a continuous proximity operator is exactly characterized as a gradient of aconvex function ψ . In the terminology of physics, a proximity operator is thus a conservativefield associated to a convex potential . In the language of convex analysis, subdifferentials ofconvex functions are characterized as maximal cyclically monotone operators [30, Theorem B].1.3. Organization of the paper.
The proof of our most general results, Theorem 3 and Corol-lary 6 (and the fact that they imply Theorem 1, (1), Corollary 4 and Corollary 1) are establishedin Section 2, where we also discuss their consequences in terms of Bregman proximity opera-tors and illustrate them on concrete examples. As Theorem 1 and its corollaries characterizewhether a function f is a proximity operator and study its smoothness in relation to that of thecorresponding penalty and potential, they are particularly useful when f is not explicitly built as a proximity operator. This is the case of so-called social shrinkage operators (see e.g. [22]).We conclude the paper by showing in Section 3 that social shrinkage operators are generally not the proximity operator of any penalty.1.4. Discussion.
In light of the extension to nonconvex penalties of Moreau’s characterizationof proximity operators of convex (l.s.c.) penalties (Proposition 1), the nonexpansivity of theproximity operator f determines whether the underlying penalty ϕ is convex or not. While non-expansivity certainly plays a role in the convergence analysis of iterative proximal algorithmsbased on convex penalties, the adaptation of such an analysis when the proximity operator isLipschitz rather than nonexpansive, using Proposition 1, is an interesting perspective.The characterization of smooth proximity operators as the gradients of convex potentials,which also appear in optimal transport (see e.g., [36]), suggests that further work is neededto better understand the connections between these concepts and tools. This could possiblylead to simplified arguments where the strong machinery of convex analysis may be used moreexplicitly despite the apparent lack of convexity of the optimization problems associated tononconvex penalties. 2. Main results
We now state our main results, Theorem 3 and Corollary 6, and prove a number of theirconsequences including Theorem 1, (1), Corollary 4 and Corollary 1 which were advertized inSection 1. The most technical proofs are postponed to the Appendix.2.1.
Detailed notations.
The indicator function of a set S is denoted χ S ( x ) := (cid:26) x ∈ S , + ∞ if x
6∈ S . The domain of a function θ : H → R ∪ { + ∞} is defined and denoted by dom( θ ) := { x ∈ H | θ ( x ) < ∞} . Given Y ⊂ H and a function f : Y → H , the image of Y under f is denoted byIm( f ). A function θ : H → R ∪ { + ∞} is proper iff there is x ∈ H such that θ ( x ) < + ∞ , i.e., R´EMI GRIBONVAL AND MILA NIKOLOVA dom( θ ) = ∅ . It is lower semicontinuous (l.s.c.) if for each x ∈ H , lim inf x → x θ ( x ) > θ ( x ), orequivalently if the set { x ∈ H : θ ( x ) > α } is open for every α ∈ R . A subgradient of a convexfunction θ : H → R ∪ { + ∞} at x is any u ∈ H such that θ ( x ′ ) − θ ( x ) > h u, x ′ − x i , ∀ x ′ ∈ H .A function with k continuous derivatives is called a C k function. The notation C k ( X ) is usedto specify a C k function on an open domain X . Thus C is the space of continuous functions,whereas C is the space of continuously differentiable functions [11, p. 327]. The gradient of a C scalar function θ at x is denoted ∇ θ ( x ).The segment between two elements x, x ′ ∈ H is the set [ x, x ′ ] := { tx + (1 − t ) x ′ , t ∈ [0 , } .A finite union of segments [ x i − , x i ], 1 i n , n ∈ N , where x = x and x n = x ′ is calleda polygonal path between x and x ′ . A non-empty subset C ⊂ H is polygonally connectediff between each pair x, x ′ ∈ C there is a polygonal path with all its segments included in C ,[ x i − , x i ] ⊂ C . Remark . The notion of polygonal-connectedness is a bit stronger than that of connect-edness. Indeed, polygonal-connectedness implies the classical topological property of path-connectedness, which in turn implies connectedness. However there are path-connected setsthat are not polygonally-connected – e.g., the unit circle in R is path-connected, but no twopoints are polygonally-connected, and there are connected sets that are not path-connected.Yet, every open connected set is polygonally-connected, see [16, Theorem 2.5.2] for a statementin R n .2.2. Main theorem.
Theorem . Consider H and H ′ two Hilbert spaces , and Y ⊂ H ′ a non-empty set. Let a : Y → R ∪ { + ∞} , b : H → R ∪ { + ∞} , A : Y → H and B : H → H ′ be arbitrary functions. Consider f : Y → H and denote
Im( f ) the image of Y under f . (a) Let D ( x, y ) := a ( y ) − h x, A ( y ) i + b ( x ) . The following properties are equivalent: (i) there is ϕ : H → R ∪ { + ∞} such that f ( y ) ∈ arg min x ∈H { D ( x, y ) + ϕ ( x ) } for each y ∈ Y ; (ii) there is a convex l.s.c. g : H → R ∪ { + ∞} such that A ( f − ( x )) ⊂ ∂g ( x ) for each x ∈ Im( f ) ;When they hold, ϕ (resp. g ) can be chosen given g (resp. ϕ ) so that g ( x ) + χ Im( f ) = b ( x ) + ϕ ( x ) . (b) Let ϕ and g satisfy (ai) and (aii) , respectively, and let C ⊂
Im( f ) be polygonally connected.Then there is K ∈ R such that g ( x ) = b ( x ) + ϕ ( x ) + K, ∀ x ∈ C . (2)(c) Let e D ( x, y ) := a ( y ) − h B ( x ) , y i + b ( x ) . The following properties are equivalent: (i) there is ϕ : H → R ∪ { + ∞} such that f ( y ) ∈ arg min x ∈H { e D ( x, y ) + ϕ ( x ) } for each y ∈ Y ; see Appendix A.1 for some reminders on Fr´echet derivatives in Hilbert spaces. For the sake of simplicity we use the same notation h· , ·i for the inner products h x, A ( y ) i (between elements of H ) and h B ( x ) , y i (between elements of H ′ ). The reader can inspect the proof of Theorem 3 to check that the resultstill holds if we consider Banach spaces H and H ′ , H ⋆ and ( H ′ ) ⋆ their duals, and A : Y → H ⋆ , B : H → ( H ′ ) ⋆ . CHARACTERIZATION OF PROXIMITY OPERATORS 7 (ii) there is a convex l.s.c. ψ : H ′ → R ∪ { + ∞} such that B ( f ( y )) ∈ ∂ψ ( y ) for each y ∈ Y . ϕ (resp. ψ ) can be chosen given ψ (resp. ϕ ) so that ψ ( y ) = h B ( f ( y ) , y i − b ( f ( y )) − ϕ ( f ( y )) on Y . (d) Let ϕ and ψ satisfy (ci) and (cii) , respectively, and let C ′ ⊂ Y be polygonally connected.Then there is K ′ ∈ R such that ψ ( y ) = h B ( f ( y )) , y i − b ( f ( y )) − ϕ ( f ( y )) + K ′ , ∀ y ∈ C ′ . (3)The proof of Theorem 3 is postponed to Appendix A.4. As stated in (a) (resp. (c)), thefunctions can be chosen such that the relation (2) (resp. (3)) holds on Im( f ) (resp. on Y ) with K = K ′ = 0. As the functions ϕ, g, ψ are at best defined up to an additive constant, we providein (b) (resp. (d)) conditions ensuring that adding a constant is indeed the unique degree offreedom. The role of polygonal-connectedness will be illustrated on examples in Section 2.7. Example . In the context of linear inverse problems one often encounters optimization prob-lems involving functions expressed as k y − M x k + ϕ ( x ) with M some linear operator. Such func-tions fit into the framework of Theorem 3 using a ( y ) := k y k , b ( x ) := k M x k , A ( y ) := M ⋆ y ,and B ( x ) := M x , where M ⋆ is the adjoint of M . Among other consequences one gets that f : Y → H is a generalized proximity operator of this type for some penalty ϕ if, and only if,there is a convex l.s.c. ψ such that M f ( y ) ∈ ∂ψ ( y ) for each y ∈ Y .Examples where the data-fidelity term is a so-called Bregman divergence are detailed inSection 2.4 below. This covers the case of standard proximity operators where D ( x, y ) = k y − x k .2.3. Convexity in proximity operators of nonconvex penalties.
An interesting conse-quence of Theorem 3 is that the optimization problem associated to (generalized) proximityoperators is in a sense always convex, even when the considered penalty ϕ is not convex. Corollary . Consider H , H ′ two Hilbert spaces. Let Y ⊂ H ′ be non-empty and f : Y → H .Assume that there is ϕ : H → R ∪ { + ∞} such that f ( y ) ∈ arg min x ∈H { D ( x, y ) + ϕ ( x ) } for each y ∈ Y , with D ( x, y ) = a ( y ) − h x, A ( y ) i + b ( x ) as in Theorem 3 (a) . Then (a) the function x b ( x ) + ϕ ( x ) is convex on each convex subset C ⊂
Im( f ) ; (b) if Im( f ) is convex, then the function x ∈ Im( f ) D ( x, y ) + ϕ ( x ) is convex, ∀ y ∈ Y .Similarly, if there is ϕ : H → R ∪ { + ∞} such that f ( y ) ∈ arg min x ∈H n e D ( x, y ) + ϕ ( x ) o foreach y ∈ Y , with e D ( x, y ) = a ( y ) − h B ( x ) , y i + b ( x ) as in Theorem 3 (c) then y
7→ h B ( f ( y )) , y i − b ( f ( y )) − ϕ ( f ( y )) is convex on each convex subset C ′ ⊂ Y .Proof . (a) follows from Theorem 3(a)-(b). (b) follows from (a) and the definition of D . Theproof of the result with e D instead of D is similar. (cid:3) Corollary 4(b) might seem surprising as, given a nonconvex penalty ϕ , one may expect theoptimization problem min x D ( x, y ) + ϕ ( x ) to be nonconvex. However, as noticed e.g. by [27, 28,31], there are nonconvex penalties such that this problem with D ( x, y ) := k y − x k is in factconvex. Corollary 4 establishes that this convexity property indeed holds whenever the imageIm( f ) of the resulting function f is a convex set. A particular case is that of functions f built as R´EMI GRIBONVAL AND MILA NIKOLOVA conditional expectations in the context of additive Gaussian denoising, which have been shown[17] to be proximity operators. Extensions of this phenomenon for conditional mean estimationwith other noise models are discussed in the companion paper [19].2.4.
Application to Bregman proximity operators.
The squared Euclidean norm is a par-ticular
Bregman divergence , and Theorem 3 characterizes generalized proximity operators definedwith such divergences. The Bregman divergence, known also as D -function, was introduced in[9] for strictly convex differentiable functions on so-called linear topological spaces. For thegoals of our study, it will be enough to consider that h : H → R ∪ { + ∞} is proper, convex anddifferentiable on a Hilbert space. Definition . Let h : H → R ∪ { + ∞} be proper convex and differentiable on its open domain dom( h ) . The Bregman divergence (associated with h ) between x and y is defined by (4) D h : H × H → [0 , + ∞ ] : ( x, y ) → ( h ( x ) − h ( y ) − h∇ h ( y ) , x − y i , if y ∈ dom( h );+ ∞ , otherwise In Theorem 3(a) one obtains D ( x, y ) = D h ( x, y ) by setting a ( y ) = + ∞ and A ( y ) arbitrary if y / ∈ dom( h ) and, for y ∈ dom( h ) and each x ∈ H ,(5) a ( y ) := h∇ h ( y ) , y i − h ( y ) b ( x ) := h ( x ) and A ( y ) = ∇ h ( y )The lack of symmetry of the Bregman divergence suggests to consider also D h ( y, x ). In Theo-rem 3(c) one obtains e D ( x, y ) = D h ( y, x ) using b ( x ) = + ∞ and B ( x ) arbitrary for x / ∈ dom( h )and, for x ∈ dom( h ) and each y ∈ H ,(6) a ( y ) := h ( y ) b ( x ) := h∇ h ( x ) , x i − h ( x ) and B ( x ) = ∇ h ( x )The next claim is an application of Theorem 3 with D ( x, y ) = D h ( x, y ) and e D ( x, y ) = D h ( y, x ). We thus consider the so-called Bregman proximity operators which were introduced in[12]. We will focus on the characterization of these operators defined by y arg min x ∈H { D h ( x, y )+ ϕ ( x ) } and y arg min x ∈H { D h ( y, x ) + ϕ ( x ) } . Such operators have been further studied in [5]with an emphasis on the notion of viability, which is essential for these operators to be usefulin the context of iterative algorithms. Corollary . Consider f : Y → H . Let h : H → R ∪ { + ∞} be a proper convex function thatis differentiable on its open domain dom( h ) . Let D h read as in (4) . (a) The following properties are equivalent: (i) there is ϕ : H → R ∪ { + ∞} such that f ( y ) ∈ arg min x ∈H { D h ( x, y ) + ϕ ( x ) } , ∀ y ∈ Y ; (ii) there is a convex l.s.c. g : H → R ∪ { + ∞} s.t. ∇ h ( f − ( x )) ⊂ ∂g ( x ) , ∀ x ∈ Im( f ) ;When they hold, ϕ (resp. g ) can be chosen given g (resp. ϕ ) so that g ( x ) + χ Im( f ) = h ( x ) + ϕ ( x ) . (b) Let ϕ and g satisfy (ai) and (aii) , respectively, and let C ⊂
Im( f ) be polygonally connected.Then there is K ∈ R such that g ( x ) = h ( x ) + ϕ ( x ) + K, ∀ x ∈ C . (c) The following properties are equivalent:
CHARACTERIZATION OF PROXIMITY OPERATORS 9 (i) there is ϕ : H → R ∪ { + ∞} such that f ( y ) ∈ arg min x ∈H { D h ( y, x ) + ϕ ( x ) } , ∀ y ∈ Y ; (ii) there is a convex l.s.c. ψ : H → R ∪ { + ∞} such that ∇ h ( f ( y )) ∈ ∂ψ ( y ) , ∀ y ∈ Y . ϕ can be chosen given ψ (resp. ψ given ϕ ) s.t. ψ ( y ) = h∇ h ( f ( y )) , y − f ( y ) i + h ( f ( y )) − ϕ ( f ( y )) , ∀ y ∈ Y . (d) Let ϕ and ψ satisfy (ci) (cii) , respectively, and let C ′ ⊂ Y be polygonally connected. Thenthere is K ′ ∈ R such that ψ ( y ) = (cid:10) ∇ h ( f ( y )) , y − f ( y ) (cid:11) + h ( f ( y )) − ϕ ( f ( y )) + K ′ , ∀ y ∈ C ′ . Proof . (a) and (b) use (5). Further, (c) and (d) use (6). (cid:3)
Specialization to (standard) proximity operators.
Standard (Hilbert space) prox-imity operators correspond to taking as the Bregman divergence D h ( x, y ) = k y − x k , whichis associated to h ( x ) := k x k . An immediate consequence of Corollary 2.4 is the followingtheorem, which implies Theorem 1 and (1). Theorem . Let
Y ⊂ H be non-empty, and f : Y → H . (a) The following properties are equivalent: (i) there is ϕ : H → R ∪ { + ∞} such that f ( y ) ∈ prox ϕ ( y ) for each y ∈ Y ; (ii) there is a convex l.s.c. g : H → R ∪ { + ∞} such that f − ( x ) ⊂ ∂g ( x ) for each x ∈ Im( f ) ; (iii) there is a convex l.s.c. ψ : H → R ∪ { + ∞} such that f ( y ) ∈ ∂ψ ( y ) for each y ∈ Y .When they hold, there exists a choice of ϕ, g, ψ satisfying (ai) - (aii) - (aiii) such that g ( x ) + χ Im( f ) = k x k + ϕ ( x ) , ∀ x ∈ H ; ψ ( y ) = h y, f ( y ) i − k f ( y ) k − ϕ ( f ( y )) , ∀ y ∈ Y . (b) Let ϕ , g and ψ satisfy (ai) , (aii) and (aiii) , respectively. Let C ⊂
Im( f ) and C ′ ⊂ Y bepolygonally connected. Then there exist K, K ′ ∈ R such that g ( x ) = k x k + ϕ ( x ) + K, ∀ x ∈ C ;(7) ψ ( y ) = h y, f ( y ) i − k f ( y ) k − ϕ ( f ( y )) + K ′ , ∀ y ∈ C ′ . (8)2.6. Local smoothness of proximity operators.
Theorem 4 characterizes proximity oper-ators in terms of three functions: a (possibly nonconvex) penalty ϕ , a convex potential ψ ,and another convex function g . As we now show, the properties of these functions are tightlyinter-related. First we extend Moreau’s characterization (Proposition 1) as follows: Proposition . Consider f : H → H defined everywhere, and
L > . The following areequivalent:(1) there is ϕ : H → R ∪ { + ∞} s.t. f ( y ) ∈ prox ϕ ( y ) on H , and x ϕ ( x ) + (1 − L ) k x k isconvex l.s.c;(2) the following conditions hold jointly: (a) there exists a (convex l.s.c.) function ψ such that for each y ∈ H , f ( y ) ∈ ∂ψ ( y ) ; (b) f is L -Lipschitz, i.e. k f ( y ) − f ( y ′ ) k L k y − y ′ k , ∀ y, y ′ ∈ H . Proof . (1) ⇒ (2a). Simply observe that f is a proximity operator and use Theorem 4(ai) ⇒ (aiii).(1) ⇒ (2b). The function ˜ ϕ ( z ) := L ( ϕ ( Lz ) + (1 − L ) k Lz k ) is convex l.s.c. by assumption. Weprove below that ˜ f := f /L is a proximity operator of ˜ ϕ . By Proposition 1 ˜ f is thus non-expansive, i.e., f is L -Lipschitz.To show ˜ f ( y ) ∈ prox ˜ ϕ ( y ) for each y ∈ H , observe that ϕ ( x ) = L ˜ ϕ ( x/L ) − (1 − L ) k x k . For each x ∈ H k y − x k + ϕ ( x ) = k y k − h y, x i + k x k + L ˜ ϕ ( x/L ) − (1 − L ) k x k = k y k − h y, x i + k x k L + L ˜ ϕ ( x/L )= k y k − L h y, z i + L k z k + L ˜ ϕ ( z )= (1 − L ) k y k + L (cid:0) k y − z k + ˜ ϕ ( z ) (cid:1) , with z = x/L. Since x = f ( y ) is a minimizer of the left-hand-side, z = f ( y ) /L = ˜ f ( y ) is a minimizer of theright hand side, hence ˜ f is a proximity operator of ˜ ϕ as claimed.(2a) and (2b) ⇒ (1). By (2a) the function ˜ ψ ( y ) := ψ ( y ) /L is convex l.s.c and f ( y ) /L ∈ ∂ ˜ ψ ( y ).By Theorem 4(aiii) ⇒ (ai) ˜ f := f /L is therefore a proximity operator. Since f is L -Lipschitz, ˜ f is non-expansive hence by Proposition 1 ˜ f is a proximity operator of some convex l.s.c penalty˜ ϕ . The function ϕ ( x ) := L ˜ ϕ ( x/L ) − (1 − L ) k x k is such that ϕ ( x ) + (1 − L ) k x k = L ˜ ϕ ( x/L )is convex l.s.c. as claimed. By the same argument as above, as z = ˜ f ( y ) is a minimizer of k y − z k + ˜ ϕ ( x ), x = Lz = f ( y ) is a minimizer of k y − x k + ϕ ( x ), showing that f is indeeda proximity operator of ϕ . (cid:3) Next we consider additional properties of these functions.
Corollary . Let
Y ⊂ H and f : Y → H . Consider three functions ϕ , g , ψ on H satisfying theequivalent properties (ai) , (aii) and (aiii) of Theorem 4, respectively. Let k > be an integer. (a) Consider an open set
V ⊂ Y . The following two properties are equivalent: (i) ψ is C k +1 ( V ) ; (ii) f is C k ( V ) ;When one of them holds, we have f ( y ) = ∇ ψ ( y ) , ∀ y ∈ V . (b) Consider an open set
X ⊂
Im( f ) . The following three properties are equivalent: (i) ϕ is C k +1 ( X ) ; (ii) g is C k +1 ( X ) ; (iii) the restriction e f of f to the set f − ( X ) is injective and ( e f ) − is C k ( X ) .When one of them holds, e f is a bijection between f − ( X ) and X , and we have ( e f ) − ( x ) = ∇ g ( x ) = x + ∇ ϕ ( x ) , ∀ x ∈ X . Before proving this corollary, let us first mention that the characterization of any continuous proximity operator f as the gradient of a C convex potential ψ , i.e., f = ∇ ψ , is a directconsequence of Corollary 6(a) and Theorem 1. This establishes Corollary 1 from Section 1.The proof of Corollary 6 relies on the following technical lemma which we prove in Appen-dix A.5 as a consequence of [6, Prop 17.41]. CHARACTERIZATION OF PROXIMITY OPERATORS 11
Lemma . Consider a function ̺ : H → H , a function θ : H → R ∪ { + ∞} and an open set X ⊂ dom( ̺ ) ∩ dom( θ ) ⊂ H . Assume that θ is subdifferentiable at each x ∈ X and that (9) ∀ x ∈ X ̺ ( x ) ∈ ∂θ ( x ) Then the following statements are equivalent: (a) ̺ is continuous on X ; (b) θ is continuously differentiable on X i.e., its gradient ∇ θ ( x ) is continuous on X .When one of the statements holds, { ̺ ( x ) } = {∇ θ ( x ) } = ∂θ ( x ) for each x ∈ X .Proof . [Proof of Corollary 6](ai) ⇔ (aii) By assumption ψ satisfies Theorem 4(cii), i.e., f ( y ) ∈ ∂ψ ( y ), ∀ y ∈ V . ByLemma 1 with ̺ := f and the convex function θ := ψ , f is C ( V ) if and only if ψ is C ( V ) andwhen one of these holds, f = ∇ ψ on V . This proves the result for k = 0. The extension to k > ⇔ (bii) Consider x ∈ V . As V is open there is an open ball B x such that x ∈ B x ⊂ V .Noticing that B x is polygonally connected, by Theorem 4-(b), there is K ∈ R such that g ( x ′ ) = k x ′ k + ϕ ( x ′ ) + K for each x ′ ∈ B x . Hence g is C k +1 ( B x ) if and only if ϕ is C k +1 ( B x ), and ∇ g ( x ′ ) = x ′ + ∇ ϕ ( x ′ ) on B x . As this holds for each x ∈ V , the equivalence holds on V .(bii) ⇒ (biii) By (bii), g is C k +1 ( X ) hence ∂g ( x ) = {∇ g ( x ) } for each x ∈ X . By Theo-rem 4(aii), f − ( x ) ⊂ ∂g ( x ) for each x ∈ Im( f ). Combining both facts yields(10) y = ∇ g ( f ( y )) ∀ y ∈ f − ( X ) . Consider y, y ′ ∈ f − ( X ) such that f ( y ) = f ( y ′ ). Then y = ∇ g ( f ( y )) = ∇ g ( f ( y ′ )) = y ′ ,which shows that f is injective on f − ( X ). Consequently, e f is a bijection between f − ( X ) and X , hence the inverse function ( e f ) − is well defined. Inserting y = ( e f ) − ( x ) into (10) yields( e f ) − ( x ) = ∇ g ( x ) for each x ∈ X . Then, since g is C k +1 ( X ), it follows that ( e f ) − is C k ( X ).(biii) ⇒ (bii) Consider x ∈ X . As e f is injective on f − ( X ) by (biii), there is a unique y ∈ f − ( X ) such that x = f ( y ). Using that f − ( x ) ⊂ ∂g ( x ) by Theorem 4(aii) shows that( e f ) − ( x ) = y ∈ ∂g ( x ). Since ( e f ) − is C k ( X ), using Lemma 1 with ̺ := ( e f ) − and θ := g provesthat ( e f ) − ( x ) = ∇ g ( x ) ∀ x ∈ X Since ( e f ) − is C k ( X ) it follows that g is C k +1 ( X ). (cid:3) Illustration using classical examples.
Theorem 1 and its corollaries characterize whethera function f is a proximity operator. This is particularly useful when f is not explicitly built as a proximity operator. We illustrate this with a few examples. We begin with H = R , whereproximity operators happen to have a particularly simple characterization. Corollary . Let
Y ⊂ R be non-empty. A function f : Y → R is the proximity operator ofsome penalty ϕ if, and only if, f is nondecreasing.Proof . By Theorem 1 we just need to prove that a scalar function f : Y → R belongs to thesub-gradient of a convex function if, and only if, f is non-decreasing. When f is continuous and Y is an open interval, a primitive ψ of f is indeed convex if, and only if, ψ ′ = f is non-decreasing [6, Proposition 17.7]. We now prove the result for more general Y and f . First, if f ( y ) ∈ ∂ψ ( y ) for each y ∈ Y where ψ : R → R ∪ { + ∞} is convex, then by [21, Theorem 4.2.1 (i)] f is non-decreasing. To prove the converse define a := inf { y : y ∈ Y} , I := ( a, ∞ ) if a / ∈ Y (resp. I := [ a, ∞ ) if a ∈ Y ), and set ¯ f ( x ) := sup y ∈Y ,y x f ( y ) ∈ R ∪ { + ∞} for each x ∈ I , ¯ f ( x ) = + ∞ ,for x / ∈ I . By construction ¯ f is non-decreasing. If f is non-decreasing on Y then ¯ f ( y ) = f ( y )for each y ∈ Y hence Y ⊂ dom( ¯ f ) ⊂ I and dom( ¯ f ) is an interval. Choose an arbitrary b ∈ Y .As ¯ f is monotone it is integrable on each bounded interval one can define ψ ( x ) := R xb ¯ f ( t ) dt foreach x ∈ dom( ¯ f ) (with the usual convention that if x < b then R xb = − R bx ) and ψ ( x ) := + ∞ for x / ∈ dom( ¯ f ). Consider x ∈ dom( ¯ f ). Since ¯ f is non-increasing for h > x + h ∈ dom( ¯ f )we have ψ ( x + h ) − ψ ( x ) = R x + hx ¯ f ( t ) dt > ¯ f ( x ) h ; similarly for h x − h ∈ dom( ¯ f )we have ψ ( x ) − ψ ( x − h ) = R xx − h ¯ f ( t ) dt ¯ f ( x )( − h ), hence ψ ( x − h ) − ψ ( x ) > ¯ f ( x ) h . Combiningboth results shows ψ ( y ) − ψ ( x ) > ¯ f ( x )( y − x ) for each x, y ∈ dom( ¯ f ). This establishes that¯ f ( x ) ∈ ∂ψ ( x ) for each x ∈ dom( ¯ f ), hence that ψ is convex on its domain dom( ψ ) = dom( ¯ f ). Toconclude, simply observe that for y ∈ Y ⊂ dom( ¯ f ) we have f ( y ) = ¯ f ( y ) ∈ ∂ψ ( y ). (cid:3) Example . In Y = [0 , ⊂ R = H , consider x < x < . . . < x q −
C >
Example . In Y = H = R consider f ( y ) := , if | y | < C ( y − , if y > C ( y + 1) , if y − Cy max(1 − / | y | , . This function has the same shape as the classical soft-thresholding operator, but is scaled by amultiplicative factor C . When C = 1 , f is the soft-thresholding operator which is the proximityoperator of the absolute value, ϕ ( x ) = | x | , which is convex. For C > , as f is expansive, byProposition 1 it cannot be the proximity operator of any convex function. Yet, as f is mono-tonically increasing, f ( y ) is a subgradient of its “primitive” ψ ( y ) = C (max( | y | − , = C y (max(1 − / | y | , = f ( y )2 C which is convex. Moreover, by Corollary 7, f is still the prox-imity operator of some (necessarily nonconvex) function ϕ ( x ) . By (1) , up to an additive constant K ∈ R , ϕ satisfies ϕ ( f ( y )) = yf ( y ) − f ( y ) − ψ ( y ) = yf ( y ) − C C f ( y ) , ∀ y ∈ R For x > , writing x = f ( y ) with y = f − ( x ) = 1 + x/C yields ϕ ( x ) = ϕ ( f ( y )) = (1 + x/C ) x − C C x . Similar considerations for x < and for x = 0 show that ϕ ( x ) = | x | + (cid:0) C − (cid:1) x .When C > , ϕ is indeed not bounded from below, and not convex. When is social shrinkage a proximity operator ?
We conclude this paper by studying so-called social shrinkage operators, which have beenintroduced to mimic classical sparsity promoting proximity operators when certain types ofstructured sparsity are targeted. We show that the characterization of proximity operatorsobtained in this paper provides answers to questions raised by Kowalski et al [22] and by Varo-queaux et al [35] on the link between such non-separable shrinkage operators and proximityoperators.Most proximity operators are indeed not separable. A classical example is the proximityoperator associated to mixed ℓ norms, which enforces group-sparsity. Example . Consider a partition G = { G , . . . , G p } of J , n K , theinterval of integers from to n , into disjoint sets called groups . Let x G be the restriction of x ∈ R n to its entries indexed by G ∈ G , and define the group ℓ norm , or mixed ℓ norm , as (11) ϕ ( x ) := X G ∈G k x G k . The proximity operator f ( y ) := prox λϕ is the group-sparsity shrinkage operator with threshold λ (12) ∀ i ∈ G, f i ( y ) := y i (cid:18) − λ k y G k (cid:19) + . CHARACTERIZATION OF PROXIMITY OPERATORS 15
The group-LASSO penalty (11) appeared in statistics in the thesis of Bakin [4, Chapter 2].It was popularized by Yuan and Lin [37] who introduced an iterative shrinkage algorithm toaddress the corresponding optimization problem. A generalization is Group Empirical Wiener/ Group Non-negative Garrotte, see e.g. [15](13) ∀ i ∈ G, f i ( y ) := y i (cid:18) − λ k y G k (cid:19) + , see also [2] for a review of thresholding rules, and [3] for a review on sparsity-inducing penalties.To account for varied types of structured sparsity, [23, 24] empirically introduced the so-calledWindowed Group-LASSO. A weighted version for audio applications was further developed in[32] which coins the notion of persistency , and the term social sparsity was coined in [22] tocover Windowed Group-LASSO, as well as other structured shrinkage operators. As furtherdescribed in these papers, the main motivation of such social shrinkage operators is to obtainflexible ways of taking into account (possibly overlapping) neighborhoods of a coefficient index i rather than disjoint groups of indices to decide whether or not to set a coefficient to zero. Theseare summarized in the definition below. Definition . Consider a family N i ⊂ J , n K , i ∈ J , n K of sets such that i ∈ N i . The set N i is called a neighborhood of its index i . Consider nonnegative weight vectors w i = ( w iℓ ) nℓ =1 such that supp( w i ) = N i . Windowed Group Lasso (WG-LASSO) shrinkage isdefined as f ( y ) := ( f i ( y )) ni =1 with (14) ∀ i, f i ( y ) := y i (cid:18) − λ k diag( w i ) y k (cid:19) + and Persistent Empirical Wiener (PEW) shrinkage (see [33] for the unweighted version) with (15) ∀ i, f i ( y ) := y i (cid:18) − λ k diag( w i ) y k (cid:19) + . Kowalski et al [22] write “ while the classical proximity operators are directly linked to convexregression problems with mixed norm priors on the coefficients, [those] new, structured, shrinkageoperators cannot be directly linked to a convex minimization problem ”. Similarly, Varoquaux etal [35] write that Windowed Group Lasso “ is not the proximal operator of a known penalty ”.They leave open the question of whether social shrinkage is the proximity operator of some yetto be discovered penalty. Using Theorem 2, we answer these questions for generalized socialshrinkage operators. The answer is negative unless the involved neighborhoods form a partition. Definition . Consider subsets N i ⊂ J , n K and nonnegativeweight vectors w i ∈ R n + such that i ∈ N i and supp( w i ) = N i for each i ∈ J , n K . Consider λ > and a family of C ( R ∗ + ) scalar functions h i , i ∈ J , n K such that h ′ i ( t ) = 0 for t ∈ R ∗ + . Ageneralized social shrinkage operator is defined as f ( y ) := ( f i ( y )) ni =1 with f i ( y ) := ( y i h i (cid:0) k diag( w i ) y k (cid:1) , if k diag( w i ) y k > λ, otherwise . that are explicitly constructed as the proximity operator of a convex l.s.c. penalty, e.g., soft-thresholding. We let the reader check that the above definition covers Group LASSO (12), Windowed Group-LASSO (14), Group Empirical Wiener (13) and Persistent Empirical Wiener shrinkage (15).
Lemma . Let f : R n → R n be a generalized social shrinkage operator and N i ⊂ J , n K , w i ∈ R n + , i ∈ J , n K be the corresponding families of neighborhoods and weight vectors. If f is a proximityoperator then there exists a partition G = { G p } Pp =1 of the set J , n K of indices such that: for each p and all i, j ∈ G p we have w i = w j and supp( w i ) = supp( w j ) = G p . As a consequence for i ∈ G p , j ∈ G q with p = q , the weight vectors w i and w j have disjoint support. The proof of Lemma 2 is postponed to Appendix A.7. An immediate consequence of thislemma is that if f is a generalized social shrinkage operator, then the neighborhood system N i = supp( w i ) coincides with the groups G from the partition G . In particular, the neighborhoodsystem must form a partition. By contraposition we get the following corollary: Corollary . Consider non-negative weights { w i } as in Definition 4 and { N i } the correspond-ing neighborhood system. Assume that there exists i, j such that N i = N j and N i ∩ N j = ∅ . • Let f be the WG-LASSO shrinkage (14) . There is no penalty ϕ such that f = prox ϕ . • Let f be the PEW shrinkage (15) . There is no penalty ϕ such that f = prox ϕ . In other words, WG-LASSO / PEW can be a proximity operator only if the neighborhoodsystem has no overlap , i.e. with “plain” Group-LASSO (12) / Group Empirical Wiener (13).
Acknowledgements
The first author wishes to thank Laurent Condat, Jean-Christophe Pesquet and Patrick-LouisCombettes for their feedback that helped improve an early version of this paper, as well as theanonymous reviewers for many insightful comments that improved it much further.
Appendix A. Proofs
The proofs of technical results of Section 2 are provided in Sections A.4 (Theorem 3), A.5(Lemma 1), A.6 (Corollary 3) and A.7 (Lemma 2). As a preliminary we give brief reminders onsome useful but classical notions in Sections A.1-A.3.A.1.
Brief reminders on (Fr´echet) differentials and gradients in Hilbert spaces.
Con-sider H , H ′ two Hilbert spaces. A function θ : X → H ′ where X ⊂ H is an open domain is(Fr´echet) differentiable at x if there exists a continuous linear operator L : H → H ′ such thatlim h → k θ ( x + h ) − θ ( x ) − L ( h ) k H ′ / k h k H = 0. The linear operator L is called the differential of θ at x and denoted Dθ ( x ). When H ′ = R , L belongs to the dual of H , hence there is u ∈ H –called the gradient of θ at x and denoted ∇ θ ( x )– such that L ( h ) = h u, h i , ∀ h ∈ H .A.2. Subgradients and subdifferentials for possibly nonconvex functions.
We adopt agentle definition which is familiar when θ is a convex function. Although this is possibly lesswell-known by non-experts, this definition is also valid when θ is possibly nonconvex, see e.g.[6, Definition 16.1]. CHARACTERIZATION OF PROXIMITY OPERATORS 17
Definition . Let θ : H → R ∪ { + ∞} be a proper function. The subdifferential ∂θ ( x ) of θ at x is the set of all u ∈ H , called subgradient s of θ at x , such that (16) θ ( x ′ ) > θ ( x ) + h u, x ′ − x i , ∀ x ′ ∈ H . If x dom( θ ) , then ∂θ ( x ) = ∅ . The function θ is subdifferentiable at x ∈ H if ∂θ ( x ) = ∅ . Thedomain of ∂θ is dom( ∂θ ) := { x ∈ H , ∂θ ( x ) = ∅ } . It satisfies dom( ∂θ ) ⊂ dom( θ ) . Fact 1.
When ∂θ ( x ) = ∅ the inequality in (16) is trivial for each x ′ dom( θ ) since it amountsto + ∞ = θ ( x ′ ) − θ ( x ) > h u, x ′ − x i . Definition 5 leads to the well-known Fermat’s rule [6, Theorem 16.3]
Theorem . Let θ : H → R ∪ { + ∞} be a proper function. A point x ∈ dom( θ ) is a globalminimizer of θ if and only if ∈ ∂θ ( x ) . If θ has a global minimizer at x , then by Theorem 5 the set ∂θ ( x ) is non-empty. However, ∂θ ( x ) can be empty, e.g., at local minimizers that are not the global minimizer: Example . Let θ ( x ) = x − cos( πx ). The global minimum of θ is reached at x = 0 where ∂θ ( x ) = f ′ ( x ) = 0. At x = ± . θ has local minimizers where ∂θ ( x ) = ∅ (even though θ is C ∞ ). For | x | < .
53 one has ∂θ ( x ) = ∇ θ ( x ) with θ ′′ ( x ) > . < | x | < . ∂θ ( x ) = ∅ .The proof of the following lemma is a standard exercice in convex analysis [6, Exercice 16.8]. Lemma . Let θ : H → R ∪ { + ∞} be a proper function such that (a) dom( θ ) is convex and (b) ∂θ ( x ) = ∅ for each x ∈ dom( θ ) . Then θ is a convex function. Definition . (Lower convex envelope of a function)Let θ : H → R ∪ { + ∞} be proper with dom( ∂θ ) = ∅ . Its lower convex envelope, denoted ˘ θ , isthe pointwise supremum of all the convex lower-semicontinuous functions minorizing θ (17) ˘ θ ( x ) := sup { ̺ ( x ) | ̺ : H → R ∪ { + ∞} , ̺ convex l.s.c. , ̺ ( z ) θ ( z ) , ∀ z ∈ H} , ∀ x ∈ H . The function ˘ θ is proper, convex and lower-semicontinuous. It satisfies (18) ˘ θ ( x ) θ ( x ) , ∀ x ∈ H . Proposition . Let θ : H → R ∪ { + ∞} be proper with dom( ∂θ ) = ∅ . For any x ∈ dom( ∂θ ) we have ˘ θ ( x ) = θ ( x ) , ∂θ ( x ) = ∂ ˘ θ ( x ) .Proof . As ∂θ ( x ) = ∅ , by [6, Proposition 13.45], ˘ θ is the so-called biconjugate θ ∗∗ of θ [6,Definition 13.1]. Moreover, [6, Proposition 16.5] yields θ ∗∗ ( x ) = θ ( x ) and ∂θ ∗∗ ( x ) = ∂θ ( x ). (cid:3) We need to adapt [6, Proposition 17.31] to the case where θ is proper but possibly nonconvex,with a stronger assumption of Fr´echet (instead of Gˆateaux) differentiability. Proposition . If ∂θ ( x ) = ∅ and θ is (Fr´echet) differentiable at x then ∂θ ( x ) = {∇ θ ( x ) } . also known as convex hull, [29, p. 57],[21, Definition 2.5.3] Proof . Consider u ∈ ∂θ ( x ). As θ is differentiable at x there is an open ball B centered at 0 suchthat x + h ∈ dom( θ ) for each h ∈ B . For each h ∈ B , Definition 5 yields θ ( x − h ) − θ ( x ) > h u, − h i and θ ( x + h ) − θ ( x ) > h u, h i hence − ( θ ( x − h ) − θ ( x )) h u, h i θ ( x + h ) − θ ( x ). Since θ is Fr´echet differentiable at x , letting k h k tend to zero yields − ( h∇ θ ( x ) , − h i + o ( k h k )) h u, h i h∇ θ ( x ) , h i + o ( k h k )hence h u − ∇ θ ( x ) , h i = o ( k h k ), ∀ h ∈ B . This shows that u = ∇ θ ( x ). (cid:3) A.3.
Characterizing functions with a given subdifferential.
Corollary 9 below generalizesa result of Moreau [26, Proposition 8.b] characterizing functions by their subdifferential. It showsthat one only needs the subdifferentials to intersect. We begin in dimension one.
Lemma . Consider a , a : R → R ∪ { + ∞} convex functions such that dom( a i ) = dom( ∂a i ) =[0 , and ∂a ( t ) ∩ ∂a ( t ) = ∅ on [0 , . Then there exists a constant K ∈ R such that a ( t ) − a ( t ) = K on [0 , .Proof . As a i is convex it is continuous on (0 ,
1) [21, Theorem 3.1.1, p16]. Moreover, by [21,Proposition 3.1.2] we have a i (0) > lim t → ,t> a i ( t ) =: a i (0 + ), and since ∂a i (0) = ∅ , there is u i ∈ ∂a i (0) such that a i ( t ) > a i (0) + u i ( t −
0) for each t ∈ [0 ,
1] hence a i (0 + ) > a i (0). Thisshows that a i (0 + ) = a i (0), and similarly lim t → ,t< a i ( t ) = a i (1), hence a i is continuous on [0 , , a i is differentiable on [0 ,
1] except on a countable set B i ⊂ [0 , t ∈ [0 , \ ( B ∪ B ) and i ∈ { , } , Proposition 4 yields ∂a i ( t ) = { a ′ i ( t ) } hence the function δ := a − a is continuous on [0 ,
1] and differentiable on [0 , \ ( B ∪ B ). For t ∈ I \ ( B ∪ B ), { a ′ ( t ) } ∩ { a ′ ( t ) } = ∂a ( t ) ∩ ∂a ( t ) = ∅ , hence a ′ ( t ) = a ′ ( t ) and δ ′ ( t ) = 0. A classical exercice in real analysis [34, Example 4] is to show that if a function f is continuous on an interval, anddifferentiable with zero derivative except on a countable set, then f is constant. As B ∪ B iscountable it follows δ is constant on (0 , , , (cid:3) Corollary . Let θ , θ : H → R ∪ { + ∞} be proper and C ⊂ H a non-empty polygonallyconnected set. Assume that for each z ∈ C , ∂θ ( z ) ∩ ∂θ ( z ) = ∅ ; then there is a constant K ∈ R such that θ ( x ) − θ ( x ) = K , ∀ x ∈ C . Remark . Note that the functions θ i and the set C are not assumed to be convex. Proof . The proof is in two parts.(i) Assume that C is convex and fix some x ∗ ∈ C . Consider x ∈ C , and define a i ( t ) := θ i ( x ∗ + t ( x − x ∗ )), for i = 0 , t ∈ [0 , a i ( t ) = + ∞ if t [0 , C is convex, for a proof see e.g. (in french) https://fr.wikipedia.org/wiki/Lemme_de_Cousin section 4.9, version from13/01/2019. CHARACTERIZATION OF PROXIMITY OPERATORS 19 z t := x ∗ + t ( x − x ∗ ) ∈ C hence for each t ∈ [0 ,
1] there exists u t ∈ ∂θ ( z t ) ∩ ∂θ ( z t ). By Definition 5for each t, t ′ ∈ [0 , a i ( t ′ ) − a i ( t ) = θ i ( x ∗ + t ′ ( x − x ∗ )) − θ i ( x ∗ + t ( x − x ∗ )) > h u t , ( t ′ − t )( x − x ∗ ) i = h u t , x − x ∗ i ( t ′ − t ) . For t ∈ [0 ,
1] and t ′ ∈ R \ [0 ,
1] since a i ( t ′ ) = + ∞ the inequality a i ( t ′ ) − a i ( t ) > h u t , x − x ∗ i ( t ′ − t )also obviously holds, hence h u t , x − x ∗ i ∈ ∂a i ( t ), i = 0 ,
1. Thus ∂a i ( t ) = ∅ for each t ∈ [0 , a i is convex on [0 ,
1] for i = 0 ,
1, and h u t , x − x ∗ i ∈ ∂a ( t ) ∩ ∂a ( t ) for each t ∈ [0 , K ∈ R such that a ( t ) − a ( t ) = K for each t ∈ [0 , θ ( x ) − θ ( x ) = a (1) − a (1) = a (0) − a (0) = θ ( x ∗ ) − θ ( x ∗ ) = K. As this holds for each x ∈ C , we have established the result as soon as C is convex.(ii) Now we prove the result when C is polygonally connected. Fix some x ∗ ∈ C and define K := θ ( x ∗ ) − θ ( x ∗ ). Consider x ∈ C : by the definition of polygonal connectedness, there existsan integer n > x j ∈ C , 0 j n with x = x ∗ and x n = x such that the (convex)segments C j = [ x j , x j +1 ] = { tx j + (1 − t ) x j +1 , t ∈ [0 , } satisfy C j ⊂ C . Since each C j is convex,the result established in (i) implies that θ ( x j +1 ) − θ ( x j +1 ) = θ ( x j ) − θ ( x j ) for 0 j < n .This shows that θ ( x ) − θ ( x ) = θ ( x n ) − θ ( x n ) = . . . = θ ( x ) − θ ( x ) = θ ( x ∗ ) − θ ( x ∗ ) = K . (cid:3) A.4.
Proof of Theorem 3.
The indicator function of a set S is denoted χ S ( x ) := (cid:26) x ∈ S , + ∞ if x
6∈ S . (ai) ⇒ (aii). We introduce the function θ : H → R ∪ { + ∞} by(19) θ := b + ϕ + χ Im( f ) . Consider x ∈ Im( f ). By definition x = f ( y ) where y ∈ Y , hence by (ai) x is a global minimizerof x ′
7→ { D ( x ′ , y ) + ϕ ( x ′ ) } . Therefore, we have(20) ∀ x ′ ∈ H , −h A ( y ) , x ′ i + b ( x ′ ) + ϕ ( x ′ ) + χ Im( f ) ( x ′ ) | {z } = θ ( x ′ ) > −h A ( y ) , x i + b ( x ) + ϕ ( x ) + χ Im( f ) ( x ) | {z } = θ ( x ) which is equivalent to ∀ x ′ ∈ H θ ( x ′ ) > θ ( x ) + h A ( y ) , x ′ − x i (21)meaning that A ( y ) ∈ ∂θ ( f ( y )). As this holds for each y ∈ Y such that f ( y ) = x , we get A ( f − ( x )) ⊂ ∂θ ( x ). Consider g := ˘ θ according to Definition 6. Since g is convex l.s.c. and ∀ x ∈ Im( f ) , ∂θ ( x ) = ∅ , (22)by Proposition 3, ∂θ ( x ) = ∂g ( x ) and θ ( x ) = g ( x ) for each x ∈ Im( f ). This establishes (aii)with g := g = ˘ θ . (aii) ⇒ (ai). Set θ := g + χ Im( f ) . By (aii), ∂g ( x ) = ∅ for each x ∈ Im( f ). Since dom( ∂g ) ⊂ dom( g ) it follows that Im( f ) ⊂ dom( g ) and consequentlydom( θ ) = Im( f ) . Consider y ∈ Y and x := f ( y ) so that x ∈ Im( f ), hence θ ( x ) = g ( x ) and A ( y ) ∈ A ( f − ( x )) ⊂ ∂g ( x ) where the inclusion comes from (aii). It follows that for each ( x, x ′ ) ∈ Im( f ) × H one has θ ( x ′ ) = g ( x ′ ) + χ Im( f ) ( x ′ ) > g ( x ′ ) > g ( x ) + h A ( y ) , x ′ − x i = θ ( x ) + h A ( y ) , x ′ − x i , showing that A ( y ) ∈ ∂θ ( x ). This is equivalent to (21) with θ := θ , and since dom( θ ) = Im( f ),the inequality in (20) holds with ϕ ( x ) := θ ( x ) − b ( x ), i.e., x is a global minimizer of D ( x ′ , y ) + ϕ ( x ′ ). Since this holds for each y ∈ Y , this establishes (ai) with ϕ := θ − b = g − b + χ Im( f ) .(b). Consider ϕ and g satisfying (ai) and (aii), respectively. Let g := ˘ θ with θ defined in (19).Following the arguments of (ai) ⇒ (aii) we obtain that g (just as g ) satisfies (aii). For each x ∈ C we thus have ∂g ( x ) ∩ ∂g ( x ) ⊃ A ( f − ( x )) = ∅ with g, g convex l.s.c. functions. Hence, byCorollary 9, since C is polygonally connected, there is a constant K such that g ( x ) = g ( x ) + K , ∀ x ∈ C . To establish the relation (2) between g and ϕ we now show that g ( x ) = b ( x ) + ϕ ( x )on C . By (22) and Proposition 3 we have ˘ θ ( x ) = θ ( x ) for each x ∈ Im( f ), hence as C ⊂
Im( f )we obtain g ( x ) := ˘ θ ( x ) = θ ( x ) = b ( x ) + ϕ ( x ) for each x ∈ C . This establishes (2).(ci) ⇒ (cii). Define ̺ ( y ) := ( + ∞ , ∀ y / ∈ Yh B ( f ( y )) , y i − b ( f ( y )) − ϕ ( f ( y )) , ∀ y ∈ Y . (23)Consider y ∈ Y . From (ci), for each y ′ the global minimizer of x e D ( x, y ′ ) + ϕ ( x ) is reachedat x ′ = f ( y ′ ). Hence, for x = f ( y ) we have −h B ( f ( y ′ )) , y ′ i + b ( f ( y ′ ))+ ϕ ( f ( y ′ )) −h B ( x ) , y ′ i + b ( x )+ ϕ ( x ) = −h B ( f ( y )) , y ′ i + b ( f ( y ))+ ϕ ( f ( y ))Using this inequality we obtain that ∀ y ′ ∈ Y , ̺ ( y ′ ) − ̺ ( y ) = −h B ( f ( y )) , y i + b ( f ( y )) + ϕ ( f ( y )) + h B ( f ( y ′ )) , y ′ i − b ( f ( y ′ )) − ϕ ( f ( y ′ )) > h B ( f ( y )) , y ′ i − h B ( f ( y )) , y i > h B ( f ( y )) , y ′ − y i This shows that B ( f ( y )) ∈ ∂̺ ( y ) . (24)Set ψ := ˘ ̺ according to Definition 6. Then the function ψ is convex l.s.c. and for each y ∈ Y thefunction B ( f ( y )) is well defined, so ∂̺ ( y ) = ∅ . Hence, by Proposition 3, ∂̺ ( y ) = ∂ ˘ ̺ ( y ) = ∂ψ ( y )and ̺ ( y ) = ˘ ̺ ( y ) = ψ ( y ) for each y ∈ Y . This establishes (cii) with ψ := ψ = ˘ ̺ . In general, we may have g = g as there is no connectedness assumption on dom( θ ). CHARACTERIZATION OF PROXIMITY OPERATORS 21 (cii) ⇒ (ci). Define h : Y → R by h ( y ) := h B ( f ( y )) , y i − ψ ( y )Since B ( f ( y ′ )) ∈ ∂ψ ( y ′ ) with ψ convex by (cii), applying Definition 5 to ∂ψ yields ψ ( y ) − ψ ( y ′ ) > h y − y ′ , B ( f ( y ′ )) i . Using this inequality, one has(25) ∀ y, y ′ ∈ Y h ( y ′ ) − h ( y ) = h B ( f ( y ′ )) , y ′ i − ψ ( y ′ ) − h B ( f ( y )) , y i + ψ ( y ) > h B ( f ( y ′ )) , y ′ i − h B ( f ( y )) , y i + h B ( f ( y ′ )) , y − y ′ i = (cid:10) B ( f ( y ′ ) − B ( f ( y )) , y (cid:11) Noticing that for each x ∈ Im( f ) there is y ∈ Y such that x = f ( y ), we can define θ : H → R ∪ { + ∞} obeying dom( θ ) = Im( f ) by θ ( x ) := (cid:26) h ( y ) with y ∈ f − ( x ) if x ∈ Im( f )+ ∞ otherwiseFor x ∈ Im( f ), as f ( y ) = f ( y ′ ) = x for each y, y ′ ∈ f − ( x ), applying (25) yields h ( y ′ ) − h ( y ) > h ( y ′ ) = h ( y ), hence the definition of θ ( x ) does not depend of which y ∈ f − ( x ) ischosen.For x ′ ∈ Im( f ) we write x ′ = f ( y ′ ). Using (25) and the definition of θ yields θ ( x ′ ) − θ ( f ( y )) = θ ( f ( y ′ )) − θ ( f ( y )) = h ( y ′ ) − h ( y ) > h B ( f ( y ′ )) − B ( f ( y )) , y i = h B ( x ′ ) − B ( f ( y )) , y i . that is to say θ ( x ′ ) − h B ( x ′ ) , y i > θ ( f ( y )) − h B ( f ( y )) , y i , ∀ x ′ ∈ Im( f ) . This also trivially holds for x ′ / ∈ Im( f ). Setting ϕ ( x ) := θ ( x ) − b ( x ) for each x ∈ H , and replacing θ by b + ϕ in the inequality above yields a ( y ) − h B ( x ′ ) , y i + b ( x ′ ) + ϕ ( x ′ ) > a ( y ) − h B ( f ( y )) , y i + b ( f ( y )) + ϕ ( f ( y )) , ∀ x ′ ∈ H showing that f ( y ) ∈ arg min x ′ { e D ( x ′ , y ) + ϕ ( x ′ ) } . As this holds for each y ∈ Y , ϕ satisfies (ci).(d). Consider ϕ and ψ satisfying (ci) and (cii), respectively. Using the arguments of (ci) ⇒ (cii),the function ψ := ˘ ̺ with ̺ defined in (23) satisfies (cii). As ψ and ψ both satisfy (cii), for each y ∈ C ′ we have ∂ψ ( y ) ∩ ∂ψ ( y ) ⊃ B ( f ( y )) = ∅ with ψ, ψ convex l.s.c. functions. Hence, byCorollary 9, since C ′ is polygonally connected, there is a constant K ′ such that ψ ( y ) = ψ ( y )+ K ′ , ∀ y ∈ C ′ . By (24), ∂̺ ( y ) = ∅ for each y ∈ Y , hence by Proposition 3 we have ˘ ̺ ( y ) = ̺ ( y ) foreach y ∈ Y . As C ′ ⊂ Y , it follows that ψ ( y ) = ˘ ̺ ( y ) = ̺ ( y ) for each y ∈ C ′ . 
This establishes (3).A.5. Proof of Lemma 1.
Proof . Without loss of generality we prove the equivalence forthe convex envelope ˘ θ instead of θ : indeed by Proposition 3, since ∂θ ( x ) = ∅ on X we have˘ θ ( x ) = θ ( x ) and ∂ ˘ θ ( x ) = ∂θ ( x ) on X .(a) ⇒ (b). By [6, Prop 17.41(iii) ⇒ (i)], as ˘ θ is convex l.s.c. and ̺ is a selection of its subdifferentialwhich is continuous at each x ∈ X , ˘ θ is (Frchet) differentiable at each x ∈ X . By Proposition 4we get ∂ ˘ θ ( x ) = {∇ ˘ θ ( x ) } = { ̺ ( x ) } on X . Since ̺ is continuous, x
7→ ∇ ˘ θ ( x ) is continuous on X .(b) ⇒ (a). Since ˘ θ is differentiable on X , by Proposition 4 we have ∂ ˘ θ ( x ) = {∇ ˘ θ ( x ) } on X . By(9) it follows that ̺ ( x ) = ∇ ˘ θ ( x ) on X . Since ∇ ˘ θ is continuous on X , so is ̺ . (cid:3) A.6.
Proof of Corollary 3.
By Theorem 2, as $Y$ is open and convex and $f$ is $C^1(Y)$ with $Df(y)$ symmetric positive semi-definite for each $y \in Y$, there is a function $\varphi_0$ and a convex l.s.c. function $\psi \in C^2(Y)$ such that $\nabla\psi(y) = f(y) \in \mathrm{prox}_{\varphi_0}(y)$ for each $y \in Y$. We define $\varphi(x) := \varphi_0(x) + \chi_{\mathrm{Im}(f)}(x)$ and let the reader check that $f(y) \in \mathrm{prox}_\varphi(y)$ for each $y \in Y$. By construction, $\mathrm{dom}(\varphi) = \mathrm{Im}(f)$.

Uniqueness of the global minimizer. Consider $\tilde{f}$ any function such that $\tilde{f}(y) \in \mathrm{prox}_\varphi(y)$ for each $y$. This implies
\[
\tfrac12\|y - f(y)\|^2 + \varphi(f(y)) = \tfrac12\|y - \tilde{f}(y)\|^2 + \varphi(\tilde{f}(y)) = \min_{x \in \mathcal{H}} \{\tfrac12\|y - x\|^2 + \varphi(x)\}, \quad \forall y \in Y. \tag{26}
\]
By Corollary 1 there is a convex l.s.c. function $\tilde{\psi}$ such that $\tilde{f}(y) \in \partial\tilde{\psi}(y)$ for each $y \in Y$. Since $Y$ is convex it is polygonally connected, hence by Theorem 4(b) and (26) there are $K, K' \in \mathbb{R}$ such that
\[
\psi(y) - K = \tfrac12\|y\|^2 - \tfrac12\|y - f(y)\|^2 - \varphi(f(y)) = \tfrac12\|y\|^2 - \tfrac12\|y - \tilde{f}(y)\|^2 - \varphi(\tilde{f}(y)) = \tilde{\psi}(y) - K', \quad \forall y \in Y. \tag{27}
\]
Thus $\tilde{\psi}$ is also $C^2(Y)$ and $\tilde{f}(y) \in \partial\tilde{\psi}(y) = \{\nabla\psi(y)\} = \{f(y)\}$ for each $y \in Y$. This shows that $\tilde{f}(y) = f(y)$ for each $y$, hence $f(y)$ is the unique global minimizer on $\mathcal{H}$ of $x \mapsto \tfrac12\|y - x\|^2 + \varphi(x)$, i.e., $\mathrm{prox}_\varphi(y) = \{f(y)\}$.
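For a concrete check on a toy example (not needed for the proof), take $Y = \mathcal{H} = \mathbb{R}^n$ and $f(y) = y/2$, so that $f \in C^1(Y)$ with $Df(y) = \frac12\,\mathrm{Id} \succ 0$. One may take $\psi(y) = \frac14\|y\|^2$ (so that $\nabla\psi = f$) and $\varphi(x) = \frac12\|x\|^2$: the function $x \mapsto \frac12\|y - x\|^2 + \frac12\|x\|^2$ is strictly convex with unique minimizer $x = y/2$, hence $\mathrm{prox}_\varphi(y) = \{f(y)\}$, and (27) holds with $K = 0$ since
\[
\tfrac12\|y\|^2 - \tfrac12\|y - f(y)\|^2 - \varphi(f(y)) = \tfrac12\|y\|^2 - \tfrac18\|y\|^2 - \tfrac18\|y\|^2 = \tfrac14\|y\|^2 = \psi(y).
\]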
Injectivity of $f$. The proof follows that of [17, Lemma 1]. Given $y \neq y'$, define $v := y' - y \neq 0$ and $\theta(t) := \langle f(y + tv), v\rangle$ for $t \in [0, 1]$; since $Y$ is convex this is well defined. As $f \in C^1(Y)$ and $Df(y + tv) \succ 0$, the function $\theta$ is $C^1([0, 1])$ with $\theta'(t) = \langle Df(y + tv)\, v, v\rangle > 0$ for each $t$. If we had $f(y) = f(y')$ then by Rolle's theorem there would be $t \in (0, 1)$ such that $\theta'(t) = 0$, contradicting the fact that $\theta'(t) > 0$.

Differentiability of $\varphi$. If $Df(y)$ is boundedly invertible for each $y \in Y$, then by the inverse function theorem $\mathrm{Im}(f)$ is open and $f^{-1} : \mathrm{Im}(f) \to Y$ is $C^1$. Given $x \in \mathrm{Im}(f)$, denoting $u := f^{-1}(x)$, (27) yields
\begin{align*}
\varphi(x) = \varphi(f(u)) &= -(\psi(u) - K) + \tfrac12\|u\|^2 - \tfrac12\|u - f(u)\|^2\\
&= -(\psi(f^{-1}(x)) - K) + \tfrac12\|f^{-1}(x)\|^2 - \tfrac12\|f^{-1}(x) - x\|^2.
\end{align*}
Since $\psi$ is $C^2$ and $f^{-1}$ is $C^1$, it follows that $\varphi$ is $C^1$.
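On the same toy example $f(y) = y/2$, the matrix $Df(y) = \frac12\,\mathrm{Id}$ is boundedly invertible, $\mathrm{Im}(f) = \mathbb{R}^n$ is open and $f^{-1}(x) = 2x$ is $C^1$; the displayed formula with $\psi(u) = \frac14\|u\|^2$ and $K = 0$ gives
\[
\varphi(x) = -\tfrac14\|2x\|^2 + \tfrac12\|2x\|^2 - \tfrac12\|2x - x\|^2 = \left(-1 + 2 - \tfrac12\right)\|x\|^2 = \tfrac12\|x\|^2,
\]
which is indeed $C^1$, matching the potential exhibited above.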
Global minimum is the unique critical point. The proof is inspired by that of [17, Theorem 1]. Consider $x$ a critical point of $\theta : x \mapsto \tfrac12\|y - x\|^2 + \varphi(x)$, i.e., since $\varphi$ is $C^1$, a point where $\nabla\theta(x) = 0$. Since $\mathrm{dom}(\varphi) = \mathrm{Im}(f)$ there is some $v \in Y$ such that $x = f(v)$. Moreover, as $\varphi$ is $C^1$ on the open set $\mathrm{Im}(f)$, the gradient $\nabla\theta(x)$ is well defined and $\nabla\theta(x) = 0$. On the one hand, denoting $\varrho(u) := (\theta \circ f)(u) = \tfrac12\|y - f(u)\|^2 + \varphi(f(u))$, we have $\nabla\varrho(u) = Df(u)\,\nabla\theta(f(u))$ for each $u \in Y$. On the other hand, for each $u \in Y$, as $f(u) = \nabla\psi(u)$ we also have
\begin{align*}
\varrho(u) &= \tfrac12\|y\|^2 + \tfrac12\|f(u)\|^2 - \langle y, f(u)\rangle + \varphi(f(u)) = \tfrac12\|y\|^2 + \langle u - y, f(u)\rangle - (\psi(u) - K),\\
\nabla\varrho(u) &= Df(u)(u - y) + f(u) - \nabla\psi(u) = Df(u)(u - y).
\end{align*}
For $u = v$ we get $Df(v)(v - y) = \nabla\varrho(v) = Df(v)\,\nabla\theta(f(v)) = Df(v)\,\nabla\theta(x) = 0$. As $Df(v) \succ 0$, $v = y$, hence $x = f(y)$. $\Box$
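Again on the toy example, $\theta(x) = \frac12\|y - x\|^2 + \frac12\|x\|^2$ has $\nabla\theta(x) = 2x - y$, whose unique zero is $x = y/2 = f(y)$: the unique critical point coincides with the global minimizer, as claimed.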
A.7. Proof of Lemma 2.
As a preliminary, let us compute the entries of the $n \times n$ matrix associated to $Df(y)$: for all $i, j \in \llbracket 1, n\rrbracket$ with $i \neq j$,
\[
\frac{\partial f_i}{\partial y_j}(y) = \begin{cases} 0 & \text{if } \|\mathrm{diag}(w_i)\, y\| < \lambda,\\[2pt] 2\,(w_{ij})^2\, y_i y_j\, h_i'\big(\|\mathrm{diag}(w_i)\, y\|^2\big) & \text{if } \|\mathrm{diag}(w_i)\, y\| > \lambda. \end{cases} \tag{28}
\]
NB: if $\|\mathrm{diag}(w_i)\, y\| = \lambda$ then $f$ may not be differentiable at $y$; this case will not be useful below. The proof exploits Corollary 2, which shows that if $f$ is a proximity operator then $Df(y)$ is symmetric in each open set where it is well defined.

Let $f$ be a generalized social shrinkage operator as described in Lemma 2 and consider $\mathcal{G} = \{G_1, \ldots, G_p\}$ the partition of $\llbracket 1, n\rrbracket$ into disjoint groups corresponding to the equivalence classes defined by the equivalence relation between indices: for $i, j \in \llbracket 1, n\rrbracket$, $i \sim j$ if and only if $w_i = w_j$. Given $G \in \mathcal{G}$, denote $w_G$ the weight vector shared by all $i \in G$. If $f$ is a proximity operator then we show that for each $G \in \mathcal{G}$, we have $\mathrm{supp}(w_G) = G$.

For $i \in G$, by Definition 4 we have $i \in N_i = \mathrm{supp}(w_i) = \mathrm{supp}(w_G)$, establishing that
\[
G \subset \mathrm{supp}(w_G). \tag{29}
\]
Note that the inclusion (29) is true even if $f$ is not a proximity operator. From now on we assume that $f$ is a proximity operator, and consider a group $G \in \mathcal{G}$. To prove that $G = \mathrm{supp}(w_G)$, we will establish that for each $i, j \in \llbracket 1, n\rrbracket$,
\[
\text{if there exists } y \in \mathbb{R}^n \text{ such that } \|\mathrm{diag}(w_j)\, y\| \neq \|\mathrm{diag}(w_i)\, y\|, \text{ then } w_{ij} = 0 \text{ and } w_{ji} = 0. \tag{30}
\]
To see why this allows us to conclude, consider $j \in \mathrm{supp}(w_G)$ and $i \in G$. As $N_i := \mathrm{supp}(w_i) = \mathrm{supp}(w_G)$, we obtain that $j \in N_i$, i.e., $w_{ij} \neq 0$. By (30), it follows that $\|\mathrm{diag}(w_j)\, y\| = \|\mathrm{diag}(w_i)\, y\|$ for each $y$. As $w_i, w_j$ have non-negative entries, this means that $w_i = w_j$. As $i \in G$, this implies $j \in G$ by the very definition of $G$ as an equivalence class. This shows $\mathrm{supp}(w_G) \subset G$. Using also (29), we conclude that $\mathrm{supp}(w_G) = G$.

Let us now prove (30). Consider a given pair $i, j \in \llbracket 1, n\rrbracket$. Assume that $\|\mathrm{diag}(w_j)\, y\| \neq \|\mathrm{diag}(w_i)\, y\|$ for at least one vector $y$. Without loss of generality assume that $a := \|\mathrm{diag}(w_j)\, y\| < \|\mathrm{diag}(w_i)\, y\| =: b$. Rescaling $y$ by a factor $c = 2\lambda/(a + b)$ yields the existence of $y$ such that, for the considered pair $i, j$,
\[
\|\mathrm{diag}(w_j)\, y\| < \lambda < \|\mathrm{diag}(w_i)\, y\|. \tag{31}
\]
By continuity, perturbing $y$ if needed we can also assume that for this pair $i, j$ we have $y_i y_j \neq 0$. By (28), as (31) holds in a neighborhood of $y$, $f$ is $C^1$ at $y$ and its partial derivatives for the considered pair $i, j$ satisfy
\[
\frac{\partial f_i}{\partial y_j}(y) = 2\,(w_{ij})^2\, y_i y_j\, h_i'\big(\|\mathrm{diag}(w_i)\, y\|^2\big) \quad \text{and} \quad \frac{\partial f_j}{\partial y_i}(y) = 0.
\]
Since $f$ is a proximity operator, by Corollary 2 we have $\frac{\partial f_i}{\partial y_j}(y) = \frac{\partial f_j}{\partial y_i}(y)$. It follows that, for the considered pair $i, j$,
\[
(w_{ij})^2\, y_i y_j\, h_i'\big(\|\mathrm{diag}(w_i)\, y\|^2\big) = 0.
\]
As $y_i y_j \neq 0$ and $h_i'(t) \neq 0$ for $t \neq 0$, we obtain $w_{ij} = 0$. To conclude, we now show that $w_{ji} = 0$. As $w_{ij} = 0$, $f_i$ is in fact independent of $y_j$ and $\frac{\partial f_i}{\partial y_j}$ is identically zero on $\mathbb{R}^n$. By scaling $y$ as needed, we get a vector $y'$ such that $y'_i y'_j \neq 0$ and $\lambda < \|\mathrm{diag}(w_j)\, y'\| < \|\mathrm{diag}(w_i)\, y'\|$. Reasoning as above yields $2\,(w_{ji})^2\, y'_j y'_i\, h_j'\big(\|\mathrm{diag}(w_j)\, y'\|^2\big) = \frac{\partial f_j}{\partial y_i}(y') = \frac{\partial f_i}{\partial y_j}(y') = 0$, hence $w_{ji} = 0$. We thus obtain that $w_{ij} = w_{ji} = 0$ as claimed, establishing (30) and therefore $G = \mathrm{supp}(w_G)$. $\Box$
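To illustrate the symmetry obstruction at the heart of this proof, consider a toy configuration with weights chosen for illustration only: $n = 2$, $w_1 = (1, 1)$ and $w_2 = (0, 1)$, so that the neighborhoods $N_1 = \{1, 2\}$ and $N_2 = \{2\}$ overlap without being equal. For $y$ with $|y_2| < \lambda < \|y\|$ and $y_1 y_2 \neq 0$, formula (28) gives
\[
\frac{\partial f_1}{\partial y_2}(y) = 2\, y_1 y_2\, h_1'(\|y\|^2) \neq 0 \quad \text{while} \quad \frac{\partial f_2}{\partial y_1}(y) = 0,
\]
so $Df(y)$ is not symmetric and, by Corollary 2, such an overlapping-window shrinkage cannot be a proximity operator.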
References

[1] Madhu Advani and Surya Ganguli. An equivalence between high dimensional Bayes optimal inference and M-estimation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 3378–3386. Curran Associates, Inc., 2016.
[2] Anestis Antoniadis. Wavelet methods in statistics: Some recent developments and their applications. Statistics Surveys, 1(0):16–55, 2007.
[3] Francis Bach. Optimization with sparsity-inducing penalties. FNT in Machine Learning, 4(1):1–106, 2011.
[4] Sergey Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, School of Mathematical Sciences, Australian National University, 1999.
[5] Heinz H. Bauschke, Jonathan M. Borwein, and Patrick L. Combettes. Bregman monotone optimization algorithms. SIAM J. Control and Optimization, 42(2):596–636, 2003.
[6] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces – Second Edition. Springer International Publishing, Cham, 2017.
[7] Thomas Blumensath and Michael E. Davies. Iterative hard thresholding for compressed sensing. Appl. Comp. Harm. Anal., 27(3):265–274, November 2009.
[8] Kristian Bredies, Dirk A. Lorenz, and Stefan Reiterer. Minimization of non-smooth, non-convex functionals by iterative thresholding. Journal of Optimization Theory and Applications, 165(1):78–112, July 2014.
[9] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[10] T. Tony Cai and Bernard W. Silverman. Incorporating information on neighbouring coefficients into wavelet estimation. Sankhyā: The Indian Journal of Statistics, Series B (1960–2002), 63(Special Issue on Wavelets):127–148, 2001.
[11] H. Cartan. Cours de calcul différentiel. Collection Méthodes. Editions Hermann, 1977.
[12] Y. Censor and S. A. Zenios. Proximal minimization algorithm with d-functions. J. Optim. Theory Appl., 73(3):451–464, 1992.
[13] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal thresholding algorithm for minimization over orthonormal bases. SIAM J. Optim., 18:1351–1376, 2007.
[14] I. Ekeland and T. Turnbull. Infinite-Dimensional Optimization and Convexity. Chicago Lectures in Mathematics. The University of Chicago Press, 1983.
[15] Cédric Févotte and Matthieu Kowalski. Hybrid sparse and low-rank time-frequency signal decomposition. In EUSIPCO, pages 464–468, 2015.
[16] Antonio Galbis and Manuel Maestre. Vector Analysis Versus Vector Calculus. Universitext. Springer US, Boston, MA, 2012.
[17] Rémi Gribonval. Should penalized least squares regression be interpreted as maximum a posteriori estimation? IEEE Transactions on Signal Processing, 59(5):2405–2410, 2011.
[18] Rémi Gribonval and Pierre Machart. Reconciling "priors" and "priors" without prejudice? In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26 (NIPS), pages 2193–2201, 2013.
[19] Rémi Gribonval and Mila Nikolova. On Bayesian estimation and proximity operators. Applied and Computational Harmonic Analysis, pages 1–25, 2019.
[20] Peter Hall, Spiridon I. Penev, Gerard Kerkyacharian, and Dominique Picard. Numerical performance of block thresholded wavelet estimators. Statistics and Computing, 1997.
[21] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, vol. I. Springer-Verlag, Berlin, 1996.
[22] Matthieu Kowalski, Kai Siedenburg, and Monika Dörfler. Social sparsity! Neighborhood systems enrich structured shrinkage operators. IEEE Trans. Signal Processing, 2013.
[23] Matthieu Kowalski and Bruno Torrésani. Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal, Image and Video Processing, 3(3):251–264, 2009.
[24] Matthieu Kowalski and Bruno Torrésani. Structured sparsity: from mixed norms to structured shrinkage. In Rémi Gribonval, editor, SPARS'09 – Signal Processing with Adaptive Sparse Structured Representations, Saint-Malo, France, April 2009. Inria Rennes – Bretagne Atlantique.
[25] Cécile Louchet and Lionel Moisan. Posterior expectation of the total variation model: Properties and experiments. SIAM J. Imaging Sci., 6(4):2640–2684, January 2013.
[26] Jean-Jacques Moreau. Proximité et dualité dans un espace Hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.
[27] M. Nikolova. Estimation of binary images by minimizing convex criteria. In Proceedings 1998 International Conference on Image Processing (ICIP98), pages 108–112. IEEE Comput. Soc., 1998.
[28] Ankit Parekh and Ivan W. Selesnick. Convex denoising using non-convex tight frame regularization. IEEE Signal Process. Lett., 2015.
[29] R. Tyrrell Rockafellar and Roger J. B. Wets. Variational Analysis, volume 317 of Grundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 1998.
[30] Ralph Rockafellar. On the maximal monotonicity of subdifferential mappings. Pacific Journal of Mathematics, 33(1):209–216, 1970.
[31] Ivan W. Selesnick. Sparse regularization via convex analysis. IEEE Trans. Signal Processing, 65(17):4481–4494, 2017.
[32] K. Siedenburg and M. Dörfler. Structured sparsity for audio signals. In Proc. 14th Int. Conf. on Digital Audio Effects (DAFx-11), Paris, 2011.
[33] Kai Siedenburg, Matthieu Kowalski, and Monika Dörfler. Audio declipping with social sparsity. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1577–1581. IEEE, 2014.
[34] Brian S. Thomson. Rethinking the elementary real analysis course. The American Mathematical Monthly, 2007.
[35] Gaël Varoquaux, Matthieu Kowalski, and Bertrand Thirion. Social-sparsity brain decoders: faster spatial sparsity. In International Workshop on Pattern Recognition in Neuroimaging, Trento, 2016.
[36] C. Villani. Optimal Transport – Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften – A Series of Comprehensive Studies in Mathematics. Springer-Verlag, Berlin, Heidelberg, 2009.
[37] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.