Duality-based Higher-order Non-smooth Optimization on Manifolds
DDuality-based Higher-order Non-smooth Optimization onManifolds
Willem Diepeveen ∗ Jan Lellmann † Abstract
We propose a method for solving non-smooth optimization problems on manifolds. Inorder to obtain superlinear convergence, we apply a Riemannian Semi-smooth Newtonmethod to a non-smooth non-linear primal-dual optimality system based on a recent ex-tension of Fenchel duality theory to Riemannian manifolds. We also propose an inexactversion of the Riemannian Semi-smooth Newton method and prove conditions for local lin-ear and superlinear convergence. Numerical experiments on (cid:96) -TV-like problems confirmsuperlinear convergence on manifolds with positive and negative curvature. Energy-based modeling in image- and data analysis requires the numerical minimization oflarge-scale energy functions. Due to the growing popularity of sparsity-based approaches suchas compressive sensing [47] and total variation-based image processing [91], these energies oftenincorporate non-smooth terms.First-order methods based on (sub-)gradients for minimizing such energies have becomevery popular due to their robustness [41, 40, 107, 102, 33, 25]; see also the survey in [49].However, their convergence rate in the general case is typically limited to O ( (cid:15) ) iterationsfor achieving (cid:15) -suboptimality, which can sometimes be improved to O ( √ (cid:15) ) using accelerationstrategies [63, 88, 105].Options for obtaining a superlinear convergence rate include interior point methods [54]or Newton-like methods, such as Quasi Newton [108, 36], Proximal Newton [13, 83, 75, 30,109], Forward Backward Newton [84] and Semi-smooth Newton [58, 82, 29, 106, 78]. In thiswork we will focus on a generalization of the Semi-smooth Newton (SSN) method, which wasoriginally proposed in [89] and initially used in [44, 98], but popularized through optimal controlapplications in the early 2000s [65], before being discovered by the image processing community[58, 82, 29]. We also refer to [92, 106] for some recent variants.On the modeling side, there has been an increasing interest in manifold-valued data pro-cessing: apart from statistical [86, 85, 53, 74] and PDE approaches [71, 35] to smooth dataprocessing on manifolds, non-smooth variational approaches on Riemannian manifolds havebeen gaining momentum as well in the 2010s. Typical applications include non-linear colorspaces [34] such as the Chromaticity Brightness model ( S × R ) and the Hue Saturation Valuemodel ( S × R ), InSAR imaging [81] ( S ), Diffusion Tensor imaging [11] ( P (3), see Fig. 1) andElectron Backscatter Diffraction imaging [4] ( SO (3)). Also several branches in data science ∗ Department of Applied Mathematics and Theoretical Physics, University of Cambridge ([email protected]). † Institute of Mathematics and Image Computing, University of L¨ubeck ([email protected]). a r X i v : . [ m a t h . O C ] F e b ave become more involved in using the geometry of problems. Examples include sparse prin-cipal component analysis, compressed mode analysis in physics, unsupervised feature selectionand sparse blind convolution, where in these cases the Stiefel manifold captures all geometricinformation; see [37] for an overview.In this work, we consider optimization problems of the forminf p ∈M { F ( p ) + G (Λ( p )) } , (1)where F : M → R and G : N → R are non-smooth functions mapping into the extended realline R := R ∪ {±∞} , M and N are Riemannian manifolds, and Λ : M → N is differentiable.Unlike the Euclidean case M = R n , to our knowledge no intrinsic higher-order solvers withsuperlinear convergence exist in the manifold-valued case. 
Our goal is to close this gap.Figure 1: Diffusion-Tensor Magnetic Resonance Imaging (DT-MRI) allows precise fiber tracingin the human brain by modeling the diffusion directions of water molecules using multivariatenormal distributions, which are represented by elements on the manifold of symmetric positivedefinite matrices P (3). The resulting optimization problems are manifold-valued and oftennon-smooth, which necessitates specialized numerical optimization algorithms. Image: Caminoproject [42]. There have been multiple attempts to generalize algorithms for both smooth and non-smoothoptimization on manifolds. Early work can be traced back to [79], where a manifold constraintis included in the optimization problem and enforced using projections. Using this so-called extrinsic approach, algorithms from the real-valued case can be made to work on manifolds aswell, if they can be reasonably embedded into en Euclidean space. While this technique doesnot capture the manifold structure specifically, recent contributions still tend to this approachfor optimization [73, 26, 38, 72, 110]. A step towards exploiting the intrinsic geometry ofmanifolds is to rely on local charts, which provide a linear subspace in which methods for real-valued optimization can be applied. However, both embedding and localization approachessuffer from serious drawbacks, which are discussed extensively in [59, Sec. 1]. Hence, intrinsicmethods that do not rely on charts have become very popular.Pioneering work using these ideas in smooth optimization on manifolds was done in the1994 with [94, 99], whose authors formulated several algorithms such as gradient descent, New-ton’s method and conjugate gradient on Riemannian manifolds. From then on, the community2tarted working on generalizing other algorithms to Riemannian manifolds [2, 27, 32], special-izing algorithms [48, 1, 64] and the application to real-world problems [5, 6]. An extensiveoverview of first-and second-order methods for smooth optimization on matrix manifolds canbe found in [3].The development of non-smooth Riemannian optimization methods appears to have oc-curred somewhat independently in the image analysis- and in the non-smooth optimizationcommunities. Whereas the former mostly relied on methods that generalize convexity, the lat-ter has been focussed on finding new notions and restrictions to generalize manifold algorithmsfrom the smooth to non-smooth setting. We give a short survey of both, as we believe thateither provide valuable insights.
Manifold-valued Imaging
The non-smooth Rudin-Osher-Fatemi (ROF) model [91] is theprototype of modern image analysis. This model revolves around the notion of Total Variation(TV), which has also been generalized to the S manifold in the early 1990s [55] and has beenextended to general Riemannian manifolds in the second half of the 2000s [56, 57]. Althoughearly numerical attempts for TV-regularized problems on manifolds originate in the early 2000s[34], the majority of the initial contributions in TV-based models on manifolds were proposedin the early 2010s [97, 43, 100]. These initial models had the major drawback that they wereproposed for specific manifolds. In [77], the authors reformulated the variational problemon arbitrary Riemannian manifolds into a multi-label optimization problem. In [103], thegeneralized ROF model was formulated in a fully intrinsic fashion.When entering the second half of the 2010s we see generalizations to several models emerg-ing. A second-order model for cyclic data was proposed in [21], and soon, popular real-valuedmodels were extended to the manifold case: a general second-order method [9], infimal con-volution models [18, 19], and TGV for manifold-valued imaging [28, 19]. At the same timespecialized models were extended to applications such as inpainting [24], segmentation [104], ormanifold-valued inverse problems [12, 95]. Additionally, new problem settings arose with theemergence TV for manifold-valued data on graphs [23].As non-smooth models were extended to manifolds, so did the solvers. The proposed al-gorithms were typically proposed on Hadamard manifolds : Riemannian manifolds that arecomplete, simply connected and have non-positive sectional curvature. On these manifolds,convexity can be generalized, which has been crucial in many of the convergence proofs. Pro-posed algorithms include the Proximal Point Algorithm (PPA) [10] and its extension, the CyclicProximal Point Algorithm (CPPA) [8]; Iteratively Reweighted Least Squares (IRLS) [61, 17]and an adaptation in [62]; the Parallel Douglas-Racheford Algorithm (PDRA) [22]; and the re-cently derived exact and linearized Riemannian Chambolle Pock Algorithms (eRCPA/lRCPA)[20]. Despite the Hadamard constraint, it should be noted that these algorithms perform well onmanifolds with positive curvature in experiments. Methods that do not rely on the Hadamardconstraint include exact methods as in [96], which uses exact solutions for (cid:96) -TV with sphericaldata, and functional lifting [101]. These methods come with their own shortcomings: for theformer this is limited applicability and for the latter this is high computational cost. Non-smooth Optimization on Manifolds
For non-smooth optimization on manifolds, thepioneering works include [51], in which the subgradient method was extended to Riemannianmanifolds and was shown to converge on Hadamard manifolds, and [52], whose authors extendedthe proximal map and proved convergence on Hadamard manifolds. Further development innon-smooth optimization was accelerated in the late 2000s by the introduction of the proximalsubdifferential for manifolds [7], the Clarke generalized subdifferential for manifolds [68], and aframework for duality on
CAT (0) metric spaces [70].3y the start of the 2010s, researchers started moving from Hadamard spaces to the non-Hadamard case. New developments include the introduction of a non-smooth version of theKurdyka-Lojasiewicz (KL) inequality used to show the convergence of PPA on general Rie-mannian manifolds [14], convergence of subgradient descent for functions satisfying the KLinequality to a singular critical point [66], and a new approach to the convergence of PPA thatextends previous results to a broader class of functions [15].Around 2015, the first numerical implementations of these and new algorithms receivedattention: [59] introduced a non-smooth trust region method for Riemannian manifolds andshowed global convergence, [60] used an approximate subdifferential and proposed a descentmethod with global convergence, [69] proposed a gradient sampling algorithm and showedits global convergence, and [67] proposed a line search algorithm and generalized the Wolfeconditions for Riemannian manifolds. Recently, a higher-order method was introduced: theRiemannian Semi-smooth Newton method (RSSN) [45], although to our knowledge no publicly-available implementation has existed so far. This method will form the basis of our approach.
The contributions of this work are two-fold:
1. Primal-Dual Riemannian Semi-smooth Newton (PD-RSSN) for non-smooth op-timization
While the RSSN method [45] allows to find zeros of generic vector fields on man-ifolds in the same way that the Newton method allows to solve nonlinear systems of equations,the application to finding minimizers of non-smooth optimization problems is not straightfor-ward, as the optimality conditions typically are in inclusion form. Using the recently proposedgeneralized Fenchel duality theory on manifolds [20], we construct a primal-dual optimalitysystem in the form of a vector field, which is then solved by RSSN. Overall, this providesa superlinearly convergent primal-dual scheme (PD-RSSN) for non-smooth Riemannian opti-mization.
2. Expanding the theoretical framework of Riemannian Semi-smooth Newton
Asin the classical Newton method, the proposed method requires to solve Newton-type systemin each step. For larger-scale problems, solving for the Newton steps exactly is often notefficient. Therefore we propose an inexact version of RSSN and provide a convergence proof(theorem 3.5). We show that at least linear convergence can be expected in the inexact case.The theoretical results are validated by our numerical experiments.This work is based on the theses [92] and [46].
In section 2, basic notation from differential geometry and Riemannian geometry is coveredand the necessary definitions for manifold Fenchel duality theory are discussed along with theresulting non-smooth optimality systems. We also summarize the basic notions necessary forapplying the Riemannian Semi-smooth Newton method. In section 3, we present the inexactRiemannian Semi-smooth Newton method and give a local convergence proof. In section 4,we discuss how RSSN can be used to solve the non-smooth optimality system from section 2,which leads to the proposed PD-RSSN method. The application of our method to (cid:96) -TV-likeproblems is discussed in section 5, and numerical results are shown in section 6. We concludeand summarize in section 7. 4 Preliminaries
In the first half of this section, the notation from differential and Riemannian geometry issummarized. In the second part, we briefly recapitulate manifold Fenchel duality theory andthe Riemannian Semi-smooth Newton method.
For details regarding differential geometry and Riemannian geometry, we refer the reader tobooks such as [76, 93, 31].We write M and N for Riemannian manifolds. The tangent space at p ∈ M is denotedby T p M and for tangent vectors we write X p and Y p , or simply X and Y . For the tangentbundle we have T M := (cid:83) p ∈M T p M . Similarly, we write T ∗ p M for the cotangent space, ξ p and η p or simply ξ and η for covectors, and T ∗ M := (cid:83) p ∈M T ∗ p M for the cotangent bundle.A cotangent vector ξ ∈ T ∗ p M acts on a tangent vector X ∈ T p M through the duality pairing (cid:104) ξ, X (cid:105) p := ξ ( X ) ∈ R .We assume a Riemannian manifold is equipped with a metric. For some point p ∈ M themetric tensor is denoted by ( · , · ) p : T p M × T p M → R and the norm induced by the Riemannianmetric is written as (cid:107) · (cid:107) p . The Riemannian distance between points p, q ∈ M is denoted by d M ( p, q ). For the open metric ball of radius r > p ∈ M induced by this distancewe write B r ( p ) := { y ∈ M | d M ( p, q ) < r } . (2)Through this metric, for p ∈ M we can define the musical isomorphisms (cid:91) : T p M → T ∗ p M as (cid:104) X (cid:91) , Y (cid:105) p = ( X, Y ) p for all Y ∈ T p M (3)and its inverse (cid:93) : T ∗ p M → T p M as( ξ (cid:93) , X ) p = (cid:104) ξ, X (cid:105) p for all X ∈ T p M . (4)The metric can also be used to construct a unique affine connection, the Levi-Civita con-nection, which is denoted by ∇ ( · ) ( · ). For geodesics γ : [0 , → M , we typically write γ p,q ( t )and γ p,X ( t ) to denote a minimizing geodesic connecting p, q ∈ M and a geodesic starting from p ∈ M with velocity ˙ γ p,X (0) = X ∈ T p M . The latter defines the exponential map, which isdenoted by exp p : G p → M where G p ⊂ T p M is the set on which ˙ γ p,X =: exp p ( X ) is defined.Furthermore, if G p = T p M , the manifold is called complete. For G (cid:48) p ⊂ G p a metric ball withradius r p on which exp p is a diffeomorphism, the logarithmic map is defined and we denote itby log p : exp( G (cid:48) p ) → G (cid:48) p . This radius r p is called the injectivity radius. Importantly, on simplyconnected, complete Riemannian manifolds with non-positive sectional curvature, the exp p andlog p maps are globally defined. Such manifolds are called Hadamard manifolds.
Finally, wewrite parallel transport of a vector X ∈ T p M from p to q ∈ M as P p → q X and parallel transportof a covector ξ ∈ T ∗ p M as P p → q ξ , where the latter is defined through the musical isomorphisms,as P p → q ξ := (cid:0) P p → q ξ (cid:93) (cid:1) (cid:91) . (5) In this section, generalizations of classical notions from non-smooth analysis to manifolds pro-posed in [20] are discussed. For the equivalent basic notions of convex analysis in vector spaces5e refer the reader to [90, 39]. The section culminates in the recently derived primal-dual op-timality conditions proposed by the authors of [20]. In the following, R := R ∪ {±∞} denotesthe extended real line. The notion of convexity can be defined on strongly convex subsets of Riemannian manifolds.
Definition 2.1 (strongly convex set, [93, Def. IV.5.1]) . A subset
C ⊂ M of a Riemannianmanifold M is said to be strongly convex if, for all p, q ∈ C , a minimal geodesic γ p,q between p and q exists, is unique, and lies completely in C . The well-known notions of properness, convexity and lower semi-continuity can be general-ized as follows.
Definition 2.2 (proper, [20, Def. 2.11.i]) . A function F : M → R is proper if dom F := { x ∈M| F ( x ) < ∞} (cid:54) = ∅ and F ( x ) > −∞ holds for all x ∈ M . Definition 2.3 (convex, [20, Def. 2.11.ii]) . Suppose that
C ⊂ M is strongly convex. A properfunction F : M → R is called (geodesically) convex on C ⊂ M if, for all p, q ∈ C , thecomposition F ◦ γ p,q ( t ) is a convex function on [0 , in the classical sense. Definition 2.4 (epigraph, [20, Def. 2.11.iii]) . Suppose that
A ⊂ M . The epigraph of afunction F : A → R is defined as epi F := { ( x, α ) ∈ A × R | F ( x ) ≤ α } . (6) Definition 2.5 (lower semi-continuous, [20, Def. 2.11.iv]) . Suppose that
A ⊂ M . A properfunction F : A → R is called lower semi-continuous (lsc) if epi F is closed. The notions of subdifferentials and proximal mappings can also be extended to manifoldsusing the exponential map and the geodesic distance:
Definition 2.6 (subdifferential,[99, Def. 3.4.4]) . Suppose that
C ⊂ M is strongly convex. The subdifferential ∂ M F on C of a proper, convex function F : C → R at a point p ∈ C is defined as ∂ M F ( p ) := (cid:8) ξ ∈ T ∗ p M| F ( q ) ≥ F ( p ) + (cid:104) ξ, log p q (cid:105) p for all q ∈ C (cid:9) . (7) Definition 2.7 (proximal mapping, [52]) . Let M be a Riemannian manifold, F : M → R beproper, and λ > . The proximal map of F is defined as prox λF ( p ) := arg min q ∈M (cid:26) λ d M ( p, q ) + F ( q ) (cid:27) . (8)Tangent and cotangent spaces play an important role in the generalization of the Fenchel-dual functions. In this context, well-definedness of the exponential and the logarithmic mapis ensured by restricting ourselves to the following subset, which is a localized variant of thepre-image of the exponential map: Definition 2.8 ([20, Def 2.8]) . Let
C ⊂ M and p ∈ C . We define the tangent subset L C ,p ⊂ T p M as L C ,p := (cid:8) X ∈ T p M | exp p X ∈ C and (cid:107) X (cid:107) p = d M (exp p X, p ) (cid:9) . (9)6ith these basic notions, Fenchel duality theory can be generalized as in [20]. The Fenchelconjugate or Fenchel dual is defined by introducing a base point m on the manifold. Definition 2.9 ( m -Fenchel conjugate,[20, Def. 3.1]) . Suppose that F : C → R and m ∈ C . The m -Fenchel conjugate of F is defined as the function F ∗ m : T ∗ p M → R such that F ∗ m ( ξ m ) := sup X ∈L C,m {(cid:104) ξ m , X (cid:105) − F (exp m X ) } , ξ m ∈ T ∗ m M . (10)For the Fenchel biconjugate, we can then define the following. Definition 2.10 (( mm (cid:48) )-Fenchel biconjugate, [20, Def. 3.5]) . Suppose that F : C → R and m, m (cid:48) ∈ C . Then the ( mm (cid:48) )-Fenchel biconjugate function F mm (cid:48) : C → R is defined as F ∗∗ mm (cid:48) ( p ) := sup ξ m (cid:48) ∈T ∗ m (cid:48) M {(cid:104) ξ m (cid:48) , log m (cid:48) p (cid:105) − F ∗ m ( P m (cid:48) → m ξ m (cid:48) ) } , p ∈ C . (11)Basic properties and a more elaborate discussion of these generalized conjugate functionsare covered in [20]. We will only focus on the relation between F ∗∗ mm and F .First, we note that geodesic convexity is often too strong a condition, so a weaker conditionis used. Given a set C ⊂ M , m ∈ C , and a function F : C → R , we define f m : T m M → R by f m ( X ) := (cid:26) F (exp m X ) , X ∈ L C ,m , + ∞ , X / ∈ L C ,m . (12)It turns out that one can look at the convexity of the function f m : T m M → R in the usualvector space sense on T m M . For a more elaborate discussion of the discrepancy between theconvexity of f m and (geodesic) convexity of F we refer to [20, Example 3.10].Now, the Fenchel-Moreau-Rockafellar theorem can also be extended to the manifold case: Theorem 2.11 ([20, Thm. 3.13]) . Suppose that
C ⊂ M is strongly convex and m ∈ C . Let F : C → R be proper. If f m is lsc and convex on T m M , then F = F ∗∗ mm . In this case F ∗ m isproper as well. In this section the first-order optimality conditions for a minimization problem of the forminf p ∈C { F ( p ) + G (Λ( p )) } (13)are discussed. Here C ⊂ M and
D ⊂ N are strongly convex sets, F : C → R and G : D → R are proper functions, and Λ : M → N is a differentiable map, possibly nonlinear, such thatΛ( C ) ⊂ D . Furthermore, we assume that F : C → R is geodesically convex and that g n ( X ) := (cid:26) G (exp n X ) , X ∈ L D ,n , + ∞ , X / ∈ L D ,n , (14)is proper, convex and lsc on T n N for some n ∈ D .Under these assumptions, eq. (13) can be rewritten into the following saddle-point formu-lation inf p ∈C sup ξ n ∈T ∗ n N {(cid:104) ξ n , log n Λ( p ) (cid:105) + F ( p ) − G ∗ n ( ξ n ) } (15)7he authors of [20] propose two pairs of optimality conditions which we will also refer to asthe exact and linearized optimality conditions for the primal and dual variables . The proposed exact optimality conditions are P m → p (cid:0) − ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) ∈ ∂ M F ( p ) , (16)log n Λ( p ) ∈ ∂G ∗ n ( ξ n ) , (17)where D m Λ ∗ : T ∗ Λ( m ) N → T ∗ m M is the adjoint operator of D m Λ. As shown in [20], this systemcan be rewritten into p = prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (18) ξ n = prox τG ∗ n (cid:16) ξ n + τ (log n Λ ( p )) (cid:91) (cid:17) . (19)Note that the name “exact” is slightly misleading, as the linearizationΛ( p ) ≈ exp Λ( m ) D Λ( m ) [log m p ] (20)was needed to obtain eq. (16).The linearized optimality conditions can be obtained by linearizing both the primal and thedual optimality condition. That is, for n := Λ( m ) , we want to solveinf p ∈M inf ξ n ∈T ∗ n N { F ( p ) + (cid:104) ξ n , D m Λ [log m p ] (cid:105) − G ∗ n ( ξ n ) } , (21)and obtain the optimality system for the general n ∈ N case [20] P m → p (cid:0) − ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) ∈ ∂ M F ( p ) , (22) D m Λ [log m p ] ∈ ∂G ∗ n ( ξ n ) , (23)which can be rewritten into p = prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (24) ξ n = prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) . (25)The linearized system has so far a better theoretical basis than the exact system. The followingweak duality result holds: Theorem 2.12 ([20, Thm. 4.2]) . Let n := Λ( m ) . The dual problem of inf p ∈M (cid:110) F ( p ) + G (exp Λ( m ) D m Λ [log m p ]) (cid:111) (26) is given by sup ξ n ∈T ∗ n N { F ∗ m ( − ( D m Λ) ∗ [ ξ n ]) − G ∗ n ( ξ n ) } (27) and weak duality holds, i.e., inf p ∈M (cid:110) F ( p ) + G (exp Λ( m ) D m Λ [log m p ]) (cid:111) ≥ sup ξ n ∈T ∗ n N {− F ∗ m ( − ( D m Λ) ∗ [ ξ n ]) − G ∗ n ( ξ n ) } . (28)If one performs a suitable fixed-point iteration on eqs. (24) and (25), one obtains the lin-earized Riemannian Chambolle Pock algorithm, which has been shown to converge on Hadamardmanifolds [20, Thm. 4.3]. In this work, we aim to construct a higher-order method instead.8 .3 The Riemannian Semi-smooth Newton Method This section is devoted to the
Riemannian Semi-smooth Newton (RSSN) method, which wepropose to use for solving eq. (13).
Remark 2.13.
Throughout this and the next section, we will develop RSSN on some manifold M (cid:48) . This is not the original manifold M defining the feasible set; we will later apply the RSSNmethod to M (cid:48) = M × T ∗ n N (and p (cid:48) = ( p, ξ n ) ). In order not to clutter the notation, we stillwrite M instead of M (cid:48) (and p instead of p (cid:48) ) throughout this and the next section. In the real-valued case, the Newton method is formulated for solving a smooth system ofequations X ( p ) = 0 for some non-linear map X : R d → R d , where – in the optimizationcontext – this map would implement optimality conditions for the minimization problem. Inthe manifold case, there is no unique generalization. One possibility is to consider a vector field X : M → T M and define the Newton iteration through the covariant derivative: p k +1 = exp p k ( − [ ∇ ( · ) X ] − p k X ( p k )) (29)in order to find a zero 0 = X ( p ) of the vector field. In this section, we will focus on ageneralized covariant derivative approach for finding zeros of semi-smooth vector fields, whichwill be applied by rewriting eqs. (24) and (25) into a vector field in section 4. Throughout thissection we will use the notions and results developed in [45]. Definition 2.14 ((locally) Lipschitz, [45, Def. 6]) . A vector field X on M is said to be Lipschitz continuous on Ω ⊂ M if there exists a constant L > such that, for all p, q ∈ Ω ,there holds (cid:107)P p → q X ( p ) − X ( q ) (cid:107) q ≤ Ld M ( p, q ) , ∀ p, q ∈ Ω , (30) Moreover, for a point p ∈ M , if there exists δ > such that X is Lipschitz continuous on theopen ball B δ ( p ) , then X is said to be Lipschitz continuous at p. Moreover, if for all p ∈ M , X is Lipschitz continuous at p , then X is said to be locally Lipschitz continuous on M . Subsequently, we can generalize Rademacher’s theorem to Lipschitz vector fields.
Theorem 2.15 ([45, Thm. 10]) . If X is a locally Lipschitz continuous vector field on M , then X is almost everywhere differentiable on M . Hence, it makes sense to define the generalized covariant derivative.
Definition 2.16 (Clarke generalized covariant derivative, [45, Def. 11]) . The
Clarke general-ized covariant derivative ∂ C X of a locally Lipschitz continuous vector field X at a point p ∈ M is defined as the set-valued mapping ∂ C X ( p ) : T p M → T p M , ∂ C X ( p ) := co (cid:26) V ∈ L ( T p M ) : ∃ { p k } ⊂ D X , lim k →∞ p k = p, V = lim k →∞ P p k → p ∇ X ( p k ) (cid:27) , (31) where D X ⊂ M is the set on which X is differentiable, “co” represents the convex hull and L ( T p M ) is the vector space consisting of all bounded linear operators from T p M to T p M . In addition to local Lipschitzness, we will also need the directional derivative for the notionof semi-smoothness. We follow the definition in [45].9 efinition 2.17 (directional derivative) . The directional derivative of a vector field X on M at p ∈ M in the direction v ∈ T p M is defined by X (cid:48) ( p, v ) := lim t (cid:38) t (cid:104) P exp p ( tv ) → p X (cid:0) exp p ( tv ) (cid:1) − X ( p ) (cid:105) ∈ T p M , (32) whenever the limit exists. If this directional derivative exists for every v , then X is said to be directionally differentiable at p . Finally, we are able to generalize the notion of semi-smoothness to vector fields.
Definition 2.18 (semi-smooth vector field, [45, Def. 18]) . A vector field X on M that isLipschitz continuous at p ∈ M and directionally differentiable at q ∈ B δ ( p ) for all directionsin T p M , is said to be semi-smooth at p iff for every (cid:15) > there exists < δ < r p , where r p isthe injectivity radius, such that (cid:107) X ( p ) − P q → p (cid:2) X ( q ) + V q log q p (cid:3) (cid:107) p ≤ (cid:15)d M ( p, q ) , ∀ q ∈ B δ ( p ) , ∀ V q ∈ ∂ C X ( q ) . (33) The vector field X is said to be µ -order semi-smooth at p for < µ ≤ iff there exist (cid:15) > and < δ < r p such that (cid:107) X ( p ) − P q → p (cid:2) X ( q ) + V q log q p (cid:3) (cid:107) p ≤ (cid:15)d M ( p, q ) µ , ∀ q ∈ B δ ( p ) , ∀ V q ∈ ∂ C X ( q ) . (34)The Riemannian Semi-smooth Newton method for finding a zero of a vector field, i.e., X ( p ) = 0, is shown in algorithm 1. It extends on the classical Newton method by replacingthe classical Jacobian ∇ X ( p k ) by an element from the Clarke generalized covariant derivative ∂ C X ( p k ), and performing the update step using the exponential map. Algorithm 1
Riemannian Semi-smooth NewtonInitialization: p ∈ M , k := 0 while not converged do Choose any V ( p k ) ∈ ∂ C X ( p k )Solve V ( p k ) d k = − X ( p k ) in the vector space T p k M p k +1 := exp p k ( d k ) k := k + 1 end while We have the following local convergence result.
Theorem 2.19 ([45, Thm. 19]) . Let X be a locally Lipschitz-continuous vector field on M and p ∗ ∈ M be a solution of the problem X ( p ) = 0 . Assume that X is semi-smooth at p ∗ and thatall V p ∗ ∈ ∂ C X ( p ∗ ) are invertible. Then there exists a δ > such that, for every starting point p ∈ B δ ( p ∗ ) \ { p ∗ } , the sequence ( p k ) k ≥ generated by algorithm 1 is well-defined, contained in B δ ( p ∗ ) and converges superlinearly to p ∗ . If additionally X is µ -order semi-smooth at p ∗ , thenthe convergence of ( p k ) k ≥ to p ∗ is of order µ . Inexact Riemannian Semi-smooth Newton
Before moving to applications, we will consider a generalization of RSSN, the
Inexact Rieman-nian Semi-smooth Newton (IRSSN) method in algorithm 2. In contrast to the exact version,it only requires to solve the linear system up to a residual term r k . The reader should bearin mind that remark 2.13 still holds for the remainder of this section. The main motivationfor this method is that solving the Newton system with high precision can be very expensive.Solving the system inexactly, for example using an iterative method, can potentially amelioratethis problem. Algorithm 2
Inexact Riemannian Semi-smooth NewtonInitialization: p ∈ M , a ≥ , k := 0 while not converged do Choose V k ( p k ) ∈ ∂ C X ( p k )Solve V k ( p k ) d k = − X ( p k ) + r k in T p k M where (cid:107) r k (cid:107) ( p k ) ≤ a k (cid:107) X ( p k ) (cid:107) ( p k ) p k +1 := exp p k ( d k )Choose a k +1 ≥ k := k + 1 end while In this section, we focus on proving theorem 3.5: a local convergence result for algorithm 2on Riemannian manifolds. The proof is based on the ideas of the Inexact Semi-smooth Newtonmethods in R n as discussed in [80, 50]. The technical details are inspired by the convergenceproof for Riemannian Semi-smooth Newton [45]. Starting with the technicalities, we first need to account for curvature. In particular, we needto account for how geodesics spread, which we can formalize in a single quantity [45, Def. 2]:
Definition 3.1.
Let p ∈ M and r p be the radius of injectivity of M at p. Define the quantity K p := sup (cid:40) d M (cid:0) exp q u, exp q v (cid:1) (cid:107) u − v (cid:107) q : q ∈ B r p ( p ) , u, v ∈ T q M , u (cid:54) = v, (cid:107) v (cid:107) q ≤ r p , (cid:107) u − v (cid:107) q ≤ r p (cid:41) . (35)The following remark from [45] provides some intuition: Remark 3.2.
This number K p measures how fast the geodesics spread apart in M . In par-ticular, when u = 0 ∈ T q M or more generally when u and v are on the same line through , d M (cid:0) exp q u, exp q v (cid:1) = (cid:107) u − v (cid:107) q . Hence, K p ≥ , for all p ∈ M . When M has non-negativesectional curvature, the geodesics spread apart less than the rays, i.e., d M (cid:0) exp p u, exp p v (cid:1) ≤(cid:107) u − v (cid:107) q and, in this case, K p = 1 for all p ∈ M . Next, remember the definition of an operator norm.
Definition 3.3.
Let p ∈ M . The norm of a linear map A : T p M → T p M is defined by (cid:107) A (cid:107) p := sup {(cid:107) Av (cid:107) p : v ∈ T p M , (cid:107) v (cid:107) p ≤ } . (36)11e have the following result. Lemma 3.4 ([45, Lemma 17]) . Let X be a locally Lipschitz continuous vector field on M . As-sume that all elements V p ∈ ∂ C X ( p ) are invertible at base point p ∈ M and let λ p ≥ max (cid:8) (cid:107) V − p (cid:107) p : V p ∈ ∂ C X ( p ) (cid:9) . Then, for every (cid:15) > satisfying (cid:15)λ p < , there exists < δ < r p such that all V q ∈ ∂ C X ( q ) are invertible on B δ ( p ) and (cid:107) V − q (cid:107) q ≤ λ p − (cid:15)λ p , ∀ q ∈ B δ ( p ) , ∀ V q ∈ ∂ C X ( q ) . (37) With these tools we can move on to the main result of this section.
Theorem 3.5.
Let X be locally Lipschitz continuous vector field on M and p ∗ ∈ M be asolution of problem the X ( p ) = 0 . Assume that X is semi-smooth at p ∗ and that all V p ∗ ∈ ∂ C X ( p ∗ ) are invertible. Then the following statements hold:(i) There exist a > and δ > such that, for every p ∈ B δ ( p ∗ ) and a k ≤ a , the sequence ( p k ) k ≥ generated by algorithm 2 is well-defined, is contained in B δ ( p ∗ ) and convergesQ-linearly to the solution p ∗ .(ii) If the sequence ( p k ) k ≥ generated by algorithm 2 converges to the solution p ∗ and further (cid:107) r k (cid:107) ( p k ) ∈ o (cid:0) (cid:107) X ( p k ) (cid:107) ( p k ) (cid:1) , then the rate of convergence is Q -superlinear.(iii) If the sequence ( p k ) k ≥ generated by algorithm 2 converges to the solution p ∗ , X is µ -order semi-smooth at p ∗ , and (cid:107) r k (cid:107) ( p k ) ∈ O (cid:16) (cid:107) X ( p k ) (cid:107) µ ( p k ) (cid:17) , then the rate of convergenceis of Q -order µ .Proof. (i) Let K p ∗ be as defined in definition 3.1 and let r p ∗ be the injectivity radius. Since X is locally Lipschitz, there exist constants ˆ δ > L such that, for all p ∈ B ˆ δ ( p ∗ ) , (cid:107) X ( p ) (cid:107) p = (cid:107)P p ∗ → p X ( p ∗ ) − X ( p ) (cid:107) p ≤ Ld M ( p, p ∗ ) . (38)The equality holds since X ( p ∗ ) = 0 and parallel transport is linear.Now, since all V p ∗ ∈ ∂ C X ( p ∗ ) are invertible at p ∗ ∈ M by assumption, we can take λ p ∗ ≥ max {(cid:107) V − p ∗ (cid:107) p ∗ : V p ∗ ∈ ∂ C X ( p ∗ ) } . Furthermore, take a < λ p ∗ LK p ∗ , choose a k ≤ a ∀ k ∈ N and (cid:15) satisfying (cid:15)λ p ∗ (1 + K p ∗ ) < − aλ p ∗ LK p ∗ . As (cid:15)λ p ∗ <
1, by lemma 3.4 we can find a0 < δ < min { ˆ δ, r p ∗ } such that, for all p ∈ B δ ( p ∗ ) and V p ∈ ∂ C X ( p ) , (cid:107) V − p (cid:107) p ≤ λ p ∗ − (cid:15)λ p ∗ . (39)From the semi-smoothness of X , (cid:107) X ( p ∗ ) − P p → p ∗ (cid:2) X ( p ) + V p log p p ∗ (cid:3) (cid:107) ( p ∗ ) ≤ (cid:15)d M ( p, p ∗ ) (40)holds according to eq. (33).We now show that for our chosen δ the Newton iteration is well-defined. Let k ∈ N andassume that p k ∈ B δ ( p ∗ ). Let d k be such that (cid:107) V p k d k + X ( p k ) (cid:107) ( p k ) ≤ a k (cid:107) X ( p k ) (cid:107) ( p k ) . (41)12hen (cid:107) log p k p ∗ − d k (cid:107) ( p k ) = (cid:107) log p k p ∗ + V − p k X ( p k ) − V − p k ( V p k d k + X ( p k )) (cid:107) ( p k ) (42) ≤ (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) + (cid:107) V − p k (cid:107) ( p k ) (cid:107) V p k d k + X ( p k ) (cid:107) ( p k ) (43) eq. (41) ≤ (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) + a k (cid:107) V − p k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) ( p k ) . (44)Since X ( p ∗ ) = 0 and parallel transport is an isometry, we see that (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) = (cid:107) V − p k (cid:0) V p k log p k p ∗ + X ( p k ) (cid:1) (cid:107) ( p k ) (45) ≤ (cid:107) V − p k (cid:107) ( p k ) (cid:107)P p k → p ∗ (cid:0) X ( p k ) + V p k log p k p ∗ (cid:1) (cid:107) ( p ∗ ) (46) ≤ (cid:107) V − p k (cid:107) ( p k ) (cid:107) X ( p ∗ ) − P p k → p ∗ (cid:0) X ( p k ) + V p k log p k p ∗ (cid:1) (cid:107) ( p ∗ ) . (47)Substituting eq. (47) back into eq. (44), we find (cid:107) log p k p ∗ − d k (cid:107) ( p k ) (48) ≤ (cid:107) V − p k (cid:107) ( p k ) (cid:0) (cid:107) X ( p ∗ ) − P p k → p ∗ (cid:0) X ( p k ) + V p k log p k p ∗ (cid:1) (cid:107) ( p ∗ ) + a k (cid:107) X ( p k ) (cid:107) ( p k ) (cid:1) (49) eq. (39) ,eq. (40) ,eq. (38) ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15)d M ( p k , p ∗ ) + a k Ld M ( p k , p ∗ )) (50) a k ≤ a ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) d M ( p k , p ∗ ) . (51)Now note that, since K p ∗ ≥ λ p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) ≤ λ p ∗ K p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) . (52)For our choice of a and (cid:15), we find (cid:15)λ p ∗ (1 + K p ∗ ) < − aλ p ∗ LK p ∗ (53) ⇔ (cid:15)λ p ∗ + (cid:15)λ p ∗ K p ∗ + aλ p ∗ LK p ∗ < ⇔ λ p ∗ K p ∗ ( (cid:15) + aL ) < − (cid:15)λ p ∗ (55) ⇔ λ p ∗ K p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) < . (56)Since d M ( p k , p ∗ ) < δ , we obtain from combining eq. (51), eq. (52) and eq. (56) (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) < d M ( p k , p ∗ ) < δ ≤ r p ∗ . (57)Moreover, we have (cid:107) log p k p ∗ (cid:107) ( p k ) = d M ( p k , p ∗ ) ≤ r p ∗ . Hence, we find (see definition 3.1) d M (exp p k ( d k ) , p ∗ ) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) (58)and observe d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) algorithm 2 = d M (exp p k ( d k ) , p ∗ ) d M ( p k , p ∗ ) eq. (58) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) d M ( p k , p ∗ ) (59) eq. (51) ≤ λ p ∗ K p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) eq. (56) < . (60)13rom this result we conclude by induction that if we choose p ∈ B δ ( p ∗ ) as in the assumption,we have p k ∈ B δ ( p ∗ ) ∀ k ∈ N and convergence is Q-linear.(ii) The second part is very similar. Let K p ∗ , r p ∗ , and ˆ δ with corresponding L as before.Choose (cid:15) > (cid:15)λ p ∗ (1 + 2 K p ∗ ) < < δ < min { ˆ δ, r p ∗ } such that eq. (39) andeq. (40) hold. 
Due to the assumption (cid:107) r k (cid:107) ( p k ) ∈ o (cid:0) (cid:107) X ( p k ) (cid:107) ( p k ) (cid:1) , the assumption that p k → p ∗ ,and that X is continuous, we have that (cid:107) X ( p k ) (cid:107) ( p k ) →
0. Moreover for large enough k , (cid:107) r k (cid:107) ( p k ) < (cid:15)δ. (61)Because of the convergence assumption p k → p ∗ , we also have for large k that d M ( p k , p ∗ ) < δ .Consequently, using the similar steps that lead to eq. (51) in the proof of (i), we see (cid:107) log p k p ∗ − d k (cid:107) ( p k ) ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15)d M ( p k , p ∗ ) + (cid:107) r k (cid:107) ( p k ) ) (62) ≤ (cid:15)λ p ∗ − (cid:15)λ p ∗ δ ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ δ. (63)For our choice of (cid:15) we find (cid:15)λ p ∗ (1 + 2 K p ∗ ) < ⇔ (cid:15)λ p ∗ K p ∗ < − (cid:15)λ p ∗ (65) ⇔ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ < . (66)Since d M ( p k , p ∗ ) < δ , we obtain from combining eq. (63) and eq. (66) (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) < δ ≤ r p ∗ . (67)Again, we have (cid:107) log p k p ∗ (cid:107) ( p k ) = d M ( p k , p ∗ ) ≤ r p ∗ , which leads to (see definition 3.1) d M (exp p k ( d k ) , p ∗ ) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) . (68)Finally we see that, for large k , d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) algorithm 2 = d M (exp p k ( d k ) , p ∗ ) d M ( p k , p ∗ ) eq. (68) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) d M ( p k , p ∗ ) (69) eq. (62) ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ + λ p ∗ K p ∗ − (cid:15)λ p ∗ (cid:107) r k (cid:107) ( p k ) d M ( p k , p ∗ ) . (70)By (cid:107) X ( p k ) (cid:107) ( p k ) ≤ Ld M ( p k , p ∗ ), we can continue ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ + 1 L λ p ∗ K p ∗ − (cid:15)λ p ∗ (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) ( p k ) . (71)Note that this result holds for all (arbitrarily small) (cid:15) > (cid:15)λ p ∗ (1 + 2 K p ∗ ) <
1. Hence,we can focus solely on the residual term. Then, by the assumption (cid:107) r k (cid:107) ( p k ) ∈ o ( (cid:107) X ( p k ) (cid:107) ( p k ) ),lim k →∞ d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) ≤ lim k →∞ λ p ∗ K p ∗ L (1 − (cid:15)λ p ∗ ) (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) ( p k ) = 0 , (72)14nd we conclude that the convergence is superlinear.(iii) Again this is very similar to the previous cases. Let µ denote the order of the semi-smoothness and let K p ∗ , r p ∗ , and ˆ δ with corresponding L as before. For (cid:15) > (cid:15)λ p ∗ <
1, choose 0 < δ < min { ˆ δ, r p ∗ } satisfying (cid:15)λ p ∗ (1 + 2 δ µ K p ∗ ) < (cid:13)(cid:13) X ( p ∗ ) − P p → p ∗ (cid:2) X ( p ) + V p exp − p p ∗ (cid:3)(cid:13)(cid:13) ( p ∗ ) ≤ (cid:15)d M ( p, p ∗ ) µ (73)hold. Then for large enough k we have by the same reasoning as for establishing eq. (61), that (cid:107) r k (cid:107) ( p k ) < (cid:15)δ µ . (74)Similar to eq. (66), (cid:107) log p k p ∗ − d k (cid:107) ( p k ) ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15)d M ( p k , p ∗ ) µ + (cid:107) r k (cid:107) ( p k ) ) (75) ≤ (cid:15)λ p ∗ δ µ − (cid:15)λ p ∗ δ ≤ (cid:15)λ p ∗ δ µ K p ∗ − (cid:15)λ p ∗ δ (76)follows, and for our choice of (cid:15) and δ we find (cid:15)λ p ∗ (1 + 2 δ µ K p ∗ ) < ⇔ (cid:15)λ p ∗ δ µ K p ∗ − (cid:15)λ p ∗ < . (77)Using similar arguments as for establishing eq. (71) in (ii), we obtain d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ + 1 L λ p ∗ K p ∗ − (cid:15)λ p ∗ (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) µ ( p k ) . (78)As (cid:15) can be arbitrarily small, we can again focus on the second term as in (ii). Finally, we seeby the assumption (cid:107) r k (cid:107) ( p k ) = O ( (cid:107) X ( p k ) (cid:107) µ ( p k ) ) thatlim k →∞ d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) µ ≤ lim k →∞ λ p ∗ K p ∗ L (1 − (cid:15)λ p ∗ ) (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) µ ( p k ) =: M (79)for some M >
0. Therefore we conclude that the convergence is of Q-order 1 + µ .From (ii) we obtain the following practical condition if one can ensure that the relative errorin solving the Newton system approaches zero. Corollary 3.5.1.
If the sequence ( p k ) k ≥ generated by algorithm 2 converges to the solution p ∗ and the sequence { a k } converges to zero, then the rate of convergence is Q -superlinear. Remark 3.6.
In the real-valued case (ii) and (iii) are formulated stronger: if a convergentsequence exists, the converse statements in (ii) and (iii) also hold. However, the proof in [50]relies heavily on the linearity of R d and is therefore much harder to translate to the manifoldcase. This remains an open problem. Before moving on to applications, we will elaborate on some key observations in the theory of the(inexact) RSSN method. In particular, we can make the following observations regarding the15onvergence behavior of both algorithms. By a continuity argument, we see that the minimalradius of convergence is determined by (cid:15) . This (cid:15) must satisfy the inequality (that is for a k = 0) (cid:15)λ p ∗ (1 + K p ∗ ) < . (80)In other words, a too small (cid:15) can give trouble in getting RSSN to work.We see that we run into trouble in two cases: a large λ p ∗ and a large K p ∗ correspondence.This comes down to the following scenarios. The generalized covariant derivative is close to singular
Here a large λ p ∗ onlyadmits a very small (cid:15) , which in turn results in a small convergence region. The manifold has negative curvature
For negatively-curved manifolds we do not havean a priori estimate for K p ∗ . If this value becomes arbitrarily large, we cannot expect a largeregion of convergence. We will now merge the ideas of Riemannian Semi-smooth Newton (RSSN) and Fenchel dualitytheory on manifolds in order to solve the original probleminf p ∈M { F ( p ) + G (Λ( p )) } . (81)Using linearization, we saw in section 2.2 that solving eq. (81) can be approximated by solvinginf p ∈M sup ξ n ∈T ∗ n N { F ( p ) + (cid:104) ξ n , D m Λ [log m p ] (cid:105) − G ∗ n ( ξ n ) } , (82)which is characterized by the optimality conditions p = prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (83) ξ n = prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) , (84)where σ, τ >
0. In this section we focus on rewriting this system of non-linear equations into aform that is amenable to the Riemannian Semi-smooth Newton method, i.e., into the problemof finding a zero of a vector field.While the dual variable ξ n lives in a vector space and (84) immediately translates into ξ n − prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) = 0 , (85)the primal optimality condition (83) is an equation in M . In order to obtain a vector fieldform, we apply the logarithm, which yields elements from the tangent bundle T M : − log p prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) = 0 . (86)Such a zero implies eq. (83). Then, we define the vector field X : M × T ∗ n N → T M × T ∗ n N as X ( p, ξ n ) := − log p prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) ξ n − prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) . (87)16hich allows to apply RSSN. We refer to this approach as Primal-Dual Riemannian Semi-smooth Newton (PD-RSSN).
Remark 4.1.
Here it becomes obvious why first transforming the set-valued equivalent opti-mality conditions in eqs. (24) and (25) into the prox-based equality form is crucial: we can now differentiate X in order to apply the Newton-based RSSN method. If the vector field X in eq. (87) is smooth, we obtain the covariant derivative [46, Sec. 6.4.2]: ∇ Y p + η ξn X = −∇ Y p log ( · ) f ( p, ξ n ) − D p log p f ( · , ξ n )[ Y p ] − D ξ n log p f ( p, · )[ η ξ n ] − D p f ( · , ξ n )[ Y p ] + η ξ n − ∇ η ξn f ( p, · ) , (88)where f ( p, ξ n ) := prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (89) f ( p, ξ n ) := prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) . (90)However, we generally cannot assume smoothness of X and therefore require a generalizationof the differential. First, consider the following generalization of Rademacher’s theorem. Theorem 4.2.
Let M and N be smooth manifolds. If F : M → N is a locally Lipschitzcontinuous function, then F is almost everywhere differentiable on M .Proof. The proof is the same as [45, Thm. 10] with a general manifold N instead of T M .The generalized differential can now be defined as follows.
Definition 4.3 (Clarke generalized differential) . The
Clarke generalized differential D C F ofa locally Lipschitz continuous function F : M → N at p ∈ M is defined as the set-valuedmapping D C F ( p ) : T p M → T F ( p ) N , D C F ( p ) := co (cid:26) V ∈ L (cid:0) T p M , T F ( p ) N (cid:1) : ∃ { p k } ⊂ D F , lim k →∞ p k = p, V = lim k →∞ D p k F (cid:27) , (91) where D F ⊂ M is the set on which F is differentiable, “co” represents the convex hull and L (cid:0) T p M , T F ( p ) N (cid:1) denotes the vector space consisting of all bounded linear operator from T p M to T F ( p ) N . In block notation, the generalized covariant derivative of X is of the form ∂ C X ( p, ξ n ) = (cid:20) − ∂ C (log ( · ) f ( p, ξ n ))( p ) − D C (log p f ( · , ξ n ))( p ) − D C (log p f ( p, · ))( ξ n ) − D C ( f ( · , ξ n ))( p ) I − ∂ C ( f ( p, · ))( ξ n ) (cid:21) . (92)Under the special assumption that the manifold of interest is symmetric , the four componentsin eq. (92) can be computed explicitly up to the differentials of the proximal maps using Jacobifields [87, Lemma 2.3]. For the technical details we refer to the first author’s thesis [46, Sec.6.4.3 and 6.4.4]. For typical imaging applications this is convenient, as many typical manifoldsof interest, in particular S n and P ( n ), are symmetric. Remark 4.4.
Semi-smoothness of X does not follow directly from the proposed construction.In general it is problem-specific but relatively straightforward to prove, as it can be shown thatpiecewise smooth functions are semi-smooth. Application to (cid:96) -TV-like Functionals We consider the classical Rudin-Osher-Fatemi denoising problem, extended to the manifold-valued setting: Let M be a Riemannian manifold, d , d ∈ N be the dimensions of the image,and h ∈ M d × d be the noisy input data. We are interested in solving the isotropic ( q = 2)and anisotropic ( q = 1) discrete ROF model [20]inf p ∈M d × d α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:107) T ( p ) (cid:107) p,q, , (93)where α > T : M d × d → T M d × d × is the non-linear finite difference operator( T ( p )) i,j,k := ∈ T p i,j M if i = d and k = 10 ∈ T p i,j M if j = d and k = 2log p i,j p i +1 ,j ∈ T p i,j M if i < d and k = 1log p i,j p i,j +1 ∈ T p i,j M if j < d and k = 2 (94)and with the norm defined as (cid:107) T ( p ) (cid:107) p,q, := d ,d (cid:88) i,j =1 ( (cid:107) ( T ( p )) i,j, (cid:107) qp i,j + (cid:107) ( T ( p )) i,j, (cid:107) qp i,j ) q . (95)The first step is to apply Fenchel duality theory in order to rewrite model eq. (93) into anappropriate linearized saddle-point form with corresponding optimality conditions.First, as T ( p ) ∈ T p M d × d × T p M d × d ∼ = T p,p M d × d × , we write the base point ( p, p ) of T ( p ) simply as p , and consequently write T p M d × d × for T ( p ) as well. Then, using duality ofthe (cid:107) · (cid:107) p,q, norm and the fact that parallel transport is an isometry, we can choose m ∈ M and write (cid:107) T ( p ) (cid:107) p,q, = (cid:107) P p → m T ( p ) (cid:107) m,q, = sup η m ∈ T ∗ m M d × d × (cid:104) η m , P p → m T ( p ) (cid:105) − ι B q ∗ ( η m ) , (96)where q ∗ such that q + q ∗ = 1, B q ∗ := (cid:8) ν m ∈ T ∗ m M d × d × | (cid:107) ν m (cid:107) m,q ∗ , ∞ ≤ (cid:9) (97)= (cid:8) ν m ∈ T ∗ m M d × d × | max (cid:107) ( ν m ) i,j, : (cid:107) m,q ∗ ≤ (cid:9) , (98)and ι B q ∗ ( η m ) := (cid:26) η m ∈ B q ∗ , ∞ if η m / ∈ B q ∗ . (99)Now we choose n := 0 ∈ T m M d × d × as the zero vector. We can write (cid:104) η m , P p → m T ( p ) (cid:105) = (cid:104) η m , P p → m T ( p ) − n (cid:105) . (100)Remember that we can decompose the tangent bundle tangent space into a vector part η m anda point part ζ m . Let ξ n := ( ζ m , η m ) ∈ T ∗ m M d × d × × T ∗ m M d × d × ∼ = T ∗ n T M d × d × , then (cid:104) η m , P p → m T ( p ) − n (cid:105) = sup ζ m ∈ T ∗ m M d × d × (cid:104) ( ζ m , η m ) , (log m p, P p → m T ( p ) − n ) (cid:105) − ι { } ( ζ m ) (101)= sup ζ m ∈ T ∗ m M d × d × (cid:104) ( ζ m , η m ) , log n T ( p ) (cid:105) − ι { } ( ζ m ) , (102)18here we used the definition of the logarithmic map on the tangent bundle in the last step.Finally bringing everything together, we find a saddle-point problem in the desired form:inf p ∈M d × d sup ξ n ∈T ∗ n T M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:104) ξ n , log n T ( p ) (cid:105) n − ι { }× B q ∗ ( ξ n ) . (103)Choosing m such that T ( m ) = n , the corresponding linearized saddle-point problem isinf p ∈M d × d sup ξ n ∈T ∗ n T M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:104) ξ n , D m T [log m p ] (cid:105) − ι { }× B q ∗ ( ξ n ) , (104)on which we can now apply the PD-RSSN method as in the previous chapter. Finally, it remainsto find values for m such that both T ( m ) = n and n = 0 hold. We must have m ij = ˜ m for i = 1 , . . . , d and j = 1 , . . . 
, d (105)for some fixed ˜ m ∈ M . Adaptations to the optimization problem
There is one serious numerical issue withsolving eq. (104) using an PD-RSSN approach: the generalized covariant derivative can benon-invertible. For the real-valued case this can be solved by adding dual regularization, whichwill also be the strategy for the manifold case: We add a quadratic penalty term for the duals ξ n , inf p ∈M d × d sup ξ n ∈T ∗ n T M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:104) ξ n , D m T [log m p ] (cid:105) − ι { }× B q ∗ ( ξ n ) − β (cid:107) ξ n (cid:107) n , (106)where 0 < β (cid:28) ξ n will be zero to get a smaller matrix representation of the generalized covariantderivative, i.e., we solveinf p ∈M d × d sup ξ m ∈T ∗ m M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:10) ξ m , ∇ ( · ) T [log m p ] (cid:11) − ι B q ∗ ( ξ m ) − β (cid:107) ξ m (cid:107) m (107)where we use that the value of the vector part of the differential of T is that of the covariantderivative. We obtain the reduced optimality conditions p = prox σF (cid:18) exp p (cid:18) P m → p (cid:16) − σ (cid:0) ∇ ( · ) T (cid:1) ∗ [ ξ m ] (cid:17) (cid:93) (cid:19)(cid:19) , (108) ξ m = prox τG ∗ m,q (cid:16) ξ m + τ (cid:0) ∇ ( · ) T [log m p ] (cid:1) (cid:91) (cid:17) , (109)and the reduced vector field X : M d × d × T ∗ m M d × d × → T p M d × d × T ∗ m M d × d × X ( p, ξ m ) := − log p prox σF (cid:18) exp p (cid:18) P m → p (cid:16) − σ (cid:0) ∇ ( · ) T (cid:1) ∗ [ ξ m ] (cid:17) (cid:93) (cid:19)(cid:19) ξ m − prox τG ∗ m,q (cid:16) ξ n + τ (cid:0) ∇ ( · ) T [log m p ] (cid:1) (cid:91) (cid:17) , (110)19here F ( p ) := d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) and G ∗ m,q ( ξ m ) := ι B q ∗ ( ξ m ) + β (cid:107) ξ m (cid:107) m . (111)For the expressions of the proximal maps and the differentials we again refer the reader to thethesis [46, Sec. 6.5]; the differential of T and its adjoint can be found in [20, Sec. 5] In this section we will explore the behavior of the Riemannian Semi-smooth Newton methodin several numerical experiments with the ROF model on both the S and P (3) manifold.This choice of manifolds allows to observe the behavior on both positively as well as negativelycurved spaces.In order to verify basic performance of the complete method, we first consider a 1D prob-lem with known minimizer. Then we provide a detailed runtime analysis and comparison tolRCPA [20]. Finally, we numerically validate the convergence rates predicted by theorem 3.5.Throughout this section we will use the relative (residual) error (cid:15) krel := (cid:107) X ( p k , ξ km ) (cid:107) ( p k ,ξ km ) (cid:107) X ( p , ξ m ) (cid:107) ( p ,ξ m ) (112)as measure for convergence.The proposed algorithms were implemented in Julia version 1.3.0 using Manopt.jl [16]and evaluated on a 2.4 GHz Intel Core i7 with 8 GB RAM. The source code will be madeavailable on GitHub.
In this first experiment we investigate the progression of the relative error as well as the distanceto the exact solution for a problem with known minimizer. We consider the 1-dimensionalpiecewise constant signal from [20]: h ∈ M (cid:96) h i := (cid:26) ˆ p if i ≤ (cid:96) ˆ p if i > (cid:96) . (113)For the (cid:96) -TV problem as in eq. (93) with this signal, the exact minimizer is known: for α > p , ˆ p ∈ M , the minimizer p ∗ is given by p ∗ i := (cid:26) p ∗ if i ≤ (cid:96)p ∗ if i > (cid:96) , (114)where p ∗ = γ ˆ p , ˆ p ( δ ) , p ∗ = γ ˆ p , ˆ p ( δ ) , and δ = min (cid:26) , α(cid:96) d M (ˆ p , ˆ p ) (cid:27) . (115)A proof can be found in [46, Appendix 2]Furthermore, since d = 1, the operator T reduces to the 1-dimensional non-linear differenceoperator, i.e., we have T : M (cid:96) → T M (cid:96) . This also means that the isotropic ( q = 2) andanisotropic ( q = 1) cases reduce to the same functional.For our tests we set β = 0, (cid:96) = 10, α = 5 , and σ = τ = . For the stopping criterion, we used (cid:15) rel = 10 − . Furthermore, for a given primal base point m ∈ M we chose n = T ( m ) ∈ T M (cid:96) n is the zero vector, it is enoughto choose m i the same in every grid point.For both manifolds we investigated the influence of the starting point. For the “cold start”we set p := h , ξ n := 0. For the “warm start” we again set p := h , but modified the dual startto ξ n := prox τG ∗ m, (cid:16) ξ n + τ (cid:0) ∇ ( · ) T (cid:2) log m p (cid:3)(cid:1) (cid:91) (cid:17) , the result of a single lRCPA dual step.In the following we denote by ˜ p the solution obtained from our method. The distance of ˜ p to the exact solution p ∗ are summarized in table 1.Cold start Warm start M d M ( p ∗ , ˜ p ) d M ( p ∗ , ˜ p ) S P (3) 2.1286297e-13 Table 1: Distance between the exact minimizer p ∗ of the 1D piecewise constant (cid:96) -TV problemand the final iterate ˜ p of our method on the S and the P (3) manifold. Using a warm start witha rough estimate for primal and dual variables, the result is very close to the exact minimizer. Case 1: M = S We choseˆ p := 1 √ , , (cid:62) , ˆ p := 1 √ , − , (cid:62) (116)and m i := (1 , , (cid:62) for all i = 1 , ..., (cid:96) .With a cold start, PD-RSSN gets stuck in a local minimum and is terminated after 50iterations. With the warm start, convergence to the solution of the linearized system is achievedafter only two iterations in 0.469 seconds total. As it turns out this is the same point as theROF minimizer, i.e., d M ( p ∗ , ˜ p ) = 0 as well. Case 2: M = P (3) Here we choseˆ p := exp I (cid:18) (cid:107) X (cid:107) I X (cid:19) , ˆ p := exp I (cid:18) − (cid:107) X (cid:107) I X (cid:19) , with X := , (117)where X ∈ T I P (3) and I is the identity matrix, and pick m i := I for all i = 1 , ..., (cid:96) .In this setting, we observed no difference between warm start and cold start (table 1). Withthe cold start, PD-RSSN converges in 1 iteration in 4.61 seconds at a distance of d M ( p ∗ , p l ) =2 . · − from the exact solution. With the warm start, PD-RSSN converges after 1 iterationin 4.202 seconds at an almost identical distance of d M ( p ∗ , p l ) = 1 . · − . (cid:96) -TV In this experiment we aim to compare the runtime performance of our PD-RSSN-based methodto that of the lRCPA algorithm at different accuracies. In contrast to the previous section, wefocus on the more realistic 2D setting. 
ℓ²-TV

In this experiment we aim to compare the runtime performance of our PD-RSSN-based method to that of the lRCPA algorithm at different accuracies. In contrast to the previous section, we focus on the more realistic 2D setting. We chose a dual regularization strength of \beta = 10^{-…}, as numerical experience showed that the generalized covariant derivative for both S² and P(3) can become singular without dual regularization.

[Figure 2: Results and progression of the relative error and \ell^2-TV cost for the proposed PD-RSSN method for solving an S²-valued problem with known minimizer. Panels: (a) input; (b) exact minimizer; (c) cold-start PD-RSSN solution; (d) warm-start PD-RSSN solution; (e) progression of the cold-start relative error; (f) progression of the cold-start \ell^2-TV cost. With a cold start, PD-RSSN gets stuck in a local minimum. With a warm-start strategy, the algorithm converges superlinearly to the exact minimizer.]

For the S² problem, we used a 20 × 20 artificial S² rotations image from [9], which is implemented in Manopt.jl, with a half rotation around each axis. For the P(3) problem, we used a 10 × 10 artificial P(3) image from [22], which is also implemented in Manopt.jl. Although lRCPA is not guaranteed to converge on positively curved manifolds, it performed well in recent work [20]. The expectation is for lRCPA to be faster at the start but to suffer from slow tail convergence. Hence, our method should give better performance for higher-accuracy solutions.
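The stabilizing effect of \beta can be read off the proximal map of G^*_{m,q} from (111): the quadratic term turns the plain projection onto the dual ball into a scaled projection, since prox_{\tau G^*}(\eta) = proj_B(\eta / (1 + \tau\beta)) by an elementary computation. A minimal sketch for the isotropic case q = 2, assuming (our assumptions, not the paper's code) that the cotangent vector at one pixel is represented as a plain coefficient vector and that the dual ball has radius 1:

    using LinearAlgebra

    # prox of τ·G* with G*(η) = ι_B(η) + (β/2)‖η‖²: scale, then project onto B.
    function prox_dual(η::AbstractVector, τ, β)
        ζ = η ./ (1 + τ * β)       # the quadratic penalty shrinks the dual variable
        n = norm(ζ)
        return n > 1 ? ζ ./ n : ζ  # projection onto the unit ball
    end

Heuristically, the \beta-term contributes a multiple of the identity to the dual block of the generalized derivative, which is what counteracts the singularity observed above.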
Initialization
For our numerical experiment, we measured the (CPU) runtime until the algorithms achieved \varepsilon_{\mathrm{rel}} \in \{10^{-…}, 10^{-…}, 10^{-…}\}. With a cold start, and even with a dual warm start, our method diverges, as can be expected from local methods. Therefore, we prepended lRCPA pre-steps with a coarse stopping criterion of \varepsilon_{\mathrm{rel}} = 1/….

Case 1: M = S²  We computed an isotropically regularized (q = 2) \ell^2-TV solution on the above data with \alpha = 1.5. We set m_{i,j} := (0, 0, 1)^\top for all i, j = 1, ..., 20, so that again n is the zero vector. We initialized both PD-RSSN and lRCPA with \sigma = \tau = 0.35. For lRCPA we used \gamma = 0.….

[Figure 3: Results of the proposed PD-RSSN method for the P(3)-valued problem with known minimizer (compare fig. 2). Panels: (a) input; (b) exact minimizer; (c) cold-start PD-RSSN solution; (d) warm-start PD-RSSN solution; (e) progression of the cold-start relative error; (f) progression of the cold-start \ell^2-TV cost. The method converges superlinearly to the exact minimizer for both cold and warm start.]

The pre-steps took 2.437 seconds. The resulting runtimes are shown in table 2. The solutions of lRCPA and PD-RSSN at \varepsilon_{\mathrm{rel}} = 10^{-…}, along with the progress of the relative error and the isotropic \ell^2-TV cost, are shown in fig. 4.

Method      \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}
            Time (s)   It.                          Time (s)   It.                          Time (s)   It.
lRCPA       …          697                          608.781    2886                         …          …
PD-RSSN     330.422    10                           346.187    11                           …          …

Table 2: Runtimes (with \beta = 10^{-…}) until achieving the relative errors \varepsilon_{\mathrm{rel}} for the two-dimensional S² problem. For high-accuracy solutions the proposed higher-order PD-RSSN method outperforms the first-order lRCPA algorithm with respect to runtime on a manifold with positive curvature.

The results confirm what we expect from a higher-order method: for higher accuracies with relative error \varepsilon_{\mathrm{rel}} smaller than approximately 10^{-…}, PD-RSSN clearly outperforms lRCPA. Furthermore, we see the superlinear convergence more clearly than in the 1D case.

Case 2: M = P(3)  Again we solved the isotropically regularized \ell^2-TV problem, with \alpha = 0.… and m_{i,j} := I for all i, j = 1, ..., 10, so that n is the zero vector. We initialized both PD-RSSN and lRCPA with \sigma = \tau = 0.4. For lRCPA we used \gamma = 0.….

[Figure 4: Runtime behavior of lRCPA and the proposed higher-order PD-RSSN method for an S²-valued problem. Panels: (a) noisy input; (b) result; (c) progression of the relative error; (d) progression of the isotropic \ell^2-TV cost. The lRCPA method exhibits slow tail convergence as is typical for first-order methods. After a few first-order pre-steps (shown in green), PD-RSSN converges superlinearly within 13 iterations to an optimal solution.]

The solutions returned by lRCPA and PD-RSSN at \varepsilon_{\mathrm{rel}} = 10^{-…}, along with the error progress and the (isotropic) \ell^2-TV cost, are shown in fig. 5. We note that lRCPA effectively stalled before reaching the relative error of 10^{-…} and was terminated after 2500 iterations.

As for the S² example, for higher accuracies PD-RSSN clearly outperforms lRCPA. The latter even seems unable to reach such high accuracies in reasonable time, which aligns with the fact that first-order methods generally show sublinear convergence. For PD-RSSN, we again observe superlinear convergence.
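Schematically, the pipeline used in both 2D experiments (first-order lRCPA pre-steps to a coarse tolerance, then PD-RSSN steps to the target tolerance) can be summarized in a small driver loop. This is a sketch under our own naming, with hypothetical step functions lrcpa_step! and rssn_step!, not the paper's actual code:

    # Run lRCPA until a coarse tolerance, then PD-RSSN to the target tolerance.
    function solve_with_warmstart!(state; ε_coarse, ε_target, relerr,
                                   lrcpa_step!, rssn_step!, kmax = 10_000)
        k = 0
        while relerr(state) > ε_coarse && k < kmax   # first-order pre-steps
            lrcpa_step!(state); k += 1
        end
        while relerr(state) > ε_target && k < kmax   # locally superlinear Newton phase
            rssn_step!(state); k += 1
        end
        return state, k
    end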
It is interesting to see if the convergence rates for the proposed inexact method in algorithm 2, as predicted by theorem 3.5, hold in practice. In particular, we expect linear convergence for a constant relative residual (a_k constant) and superlinear convergence for a_k \to 0. Given the similar behavior of the methods on positively and negatively curved manifolds in the previous experiments, we only consider the S² manifold.

Method      \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}
            Time (s)   It.                          Time (s)   It.                          Time (s)   It.
lRCPA       …          97                           1213.327   …                            …          ≥ 2500
PD-RSSN     …          …                            …          …                            …          …

Table 3: Runtimes until achieving the relative errors \varepsilon_{\mathrm{rel}} for the two-dimensional P(3) problem. As in the positively-curved S² case (table 2), the proposed PD-RSSN algorithm outperforms lRCPA at higher accuracies on this negatively-curved manifold. lRCPA was terminated after 2500 iterations before achieving \varepsilon_{\mathrm{rel}} = 10^{-…}.

[Figure 5: Results on the negatively-curved manifold P(3) (compare fig. 4); first-order pre-steps are shown in green. Panels: (a) noisy input; (b) result; (c) progression of the relative error; (d) progression of the isotropic \ell^2-TV cost. The proposed PD-RSSN method converges superlinearly within 12 iterations after the pre-steps, while again lRCPA suffers from slow tail convergence.]

We used \ell^2-TV to denoise Bernoulli's Lemniscate [9], which is a figure-8 on the 2-sphere. Note that this is once again a 1D problem. To this end, we sampled 128 S²-valued points on the Lemniscate curve and distorted them with Gaussian noise with variance \delta = 0.01. We employed the \ell^2-TV model with \alpha = 0.… and the base point m set to the intersection point of the Lemniscate:

m_i := \tfrac{1}{\sqrt{2}}(…)^\top, \quad \text{for } i = 1, \dots, 128.   (118)

We did not use dual regularization, i.e., \beta = 0. Again we performed lRCPA pre-steps with \sigma = \tau = 0.…, \gamma = 0.2, and \varepsilon_{\mathrm{rel}} = 10^{-…} in order to warm-start PD-IRSSN. In order to simulate inexact solution of the Newton steps, we defined three different progressions of the relative residual,

a_k^1 := 0, \quad a_k^2 := \tfrac{1}{5}, \quad a_k^3 := \tfrac{1}{5k},   (119)

and randomly chose the corresponding absolute residual as

r_i^k := a_i^k \, \|X(p^k, \xi_n^k)\|_{(p^k, \xi_n^k)} \, U_{(p^k, \xi_n^k)}, \quad \text{for } i = 1, 2, 3,   (120)

where U_{(p^k, \xi_n^k)} is a tangent vector drawn from a normal distribution with unit covariance matrix. We then solved (to numerical precision) the modified inexact system for the step d^k:

V_{(p^k, \xi_n^k)}[d^k] = X(p^k, \xi_n^k) + r_i^k.   (121)

This allows us to precisely control the residual and therefore simulate the behavior of PD-IRSSN at different accuracies. From the error progression, we estimated a convergence rate q by

q_k := \log\Big( \tfrac{\|X(p^k, \xi_n^k)\|}{\|X(p^{k-1}, \xi_n^{k-1})\|} \Big) \Big/ \log\Big( \tfrac{\|X(p^{k-1}, \xi_n^{k-1})\|}{\|X(p^{k-2}, \xi_n^{k-2})\|} \Big).   (122)

The results are shown in fig. 6. As predicted by theorem 3.5, for the exact method with zero residual a^1 we obtain clear superlinear convergence. For constant relative residual a^2, convergence is clearly linear with rate q = 1, and for decreasing relative residual a^3, convergence is again superlinear (although hard to judge due to the rapid convergence), in agreement with the theorem.

[Figure 6: Convergence behavior of the Primal-Dual Inexact Riemannian Semi-smooth Newton method (PD-IRSSN) for recovering Bernoulli's Lemniscate from noisy input data. Panels: (a) Bernoulli's Lemniscate (grey), the noisy input (orange) and the result (blue); (b) estimated convergence rates after the pre-steps; (c) progression of the relative error of the lRCPA pre-steps (blue) and the three PD-IRSSN schemes; (d) progression of the \ell^2-TV cost of the lRCPA pre-steps (blue) and the three PD-IRSSN schemes. As predicted by theorem 3.5, exact PD-IRSSN, i.e., PD-RSSN, (a^1) and PD-IRSSN with progressively increasing accuracy (a^3) converge superlinearly. For constant relative accuracy (a^2), we still observe at least linear convergence.]
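To make the rate estimation concrete, the following minimal Julia sketch (our illustration, not the paper's code) implements the three residual schedules from (119) and the estimator (122), operating on a recorded history of residual norms \|X(p^k, \xi_n^k)\|.

    # Residual schedules from (119).
    a1(k) = 0.0        # exact Newton step: superlinear convergence expected
    a2(k) = 1 / 5      # constant relative residual: (at least) linear convergence
    a3(k) = 1 / (5k)   # vanishing relative residual: superlinear convergence

    # Convergence-rate estimate (122) from a history of residual norms;
    # the j-th returned value estimates the order q at iterate j + 2.
    estimate_rates(norms) =
        [log(norms[k] / norms[k-1]) / log(norms[k-1] / norms[k-2]) for k in 3:length(norms)]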
Conclusion

In this work we have proposed the Primal-Dual Riemannian Semi-smooth Newton (PD-RSSN) method as a higher-order alternative for solving non-smooth variational problems on manifolds. Although only locally convergent, the method experimentally performs well on isotropic and anisotropic 1D and 2D \ell^2-TV problems. On both S² and P(3), representing manifolds with positive and negative curvature, we observe superlinear convergence, which allows us to generate high-accuracy solutions. In particular, PD-RSSN can be used to overcome the slow tail convergence that is a common bottleneck of first-order methods.

Moreover, we have proposed an inexact variant of RSSN (IRSSN) with provable local convergence rates. The theoretical results are supported by numerical experiments, in which we observe at least linear convergence in the inexact case, and superlinear convergence when the accuracy is progressively increased.

Acknowledgments

We would like to thank Ronny Bergmann for fruitful discussions regarding differential-geometric interpretations of earlier work and his Julia library Manopt.jl.

References
[1] T. E. Abrudan, J. Eriksson, and V. Koivunen, Steepest descent algorithms for optimization under unitary matrix constraint, IEEE Transactions on Signal Processing, 56 (2008), pp. 1134-1147, https://doi.org/10.1109/TSP.2007.908999.
[2] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Foundations of Computational Mathematics, 7 (2007), pp. 303-330, https://doi.org/10.1007/s10208-005-0179-9.
[3] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2009, https://doi.org/10.1515/9781400830244.
[4] B. L. Adams, S. I. Wright, and K. Kunze, Orientation imaging: the emergence of a new microscopy, Metallurgical Transactions A, 24 (1993), pp. 819-831, https://doi.org/10.1007/BF02656503.
[5] R. L. Adler, J.-P. Dedieu, J. Y. Margulies, M. Martens, and M. Shub, Newton's method on Riemannian manifolds and a geometric model for the human spine, IMA Journal of Numerical Analysis, 22 (2002), pp. 359-390, https://doi.org/10.1093/imanum/22.3.359.
[6] B. Afsari, R. Tron, and R. Vidal, On the convergence of gradient descent for finding the Riemannian center of mass, SIAM Journal on Control and Optimization, 51 (2013), pp. 2230-2260, https://doi.org/10.1137/12086282X.
[7] D. Azagra and J. Ferrera, Proximal calculus on Riemannian manifolds, Mediterranean Journal of Mathematics, 2 (2005), pp. 437-450, https://doi.org/10.1007/s00009-005-0056-4.
[8] M. Bačák, Computing medians and means in Hadamard spaces, SIAM Journal on Optimization, 24 (2014), pp. 1542-1566, https://doi.org/10.1137/140953393.
[9] M. Bačák, R. Bergmann, G. Steidl, and A. Weinmann, A second order nonsmooth variational model for restoring manifold-valued images, SIAM Journal on Scientific Computing, 38 (2016), pp. A567-A597, https://doi.org/10.1137/15M101988X.
[10] S. Banert, Backward-backward splitting in Hadamard spaces, Journal of Mathematical Analysis and Applications, 414 (2014), pp. 656-665, https://doi.org/10.1016/j.jmaa.2014.01.054.
[11] P. J. Basser, J. Mattiello, and D. LeBihan, MR diffusion tensor spectroscopy and imaging, Biophysical Journal, 66 (1994), pp. 259-267, https://doi.org/10.1016/S0006-3495(94)80775-1.
[12] M. Baust, A. Weinmann, M. Wieczorek, T. Lasser, M. Storath, and N. Navab, Combined tensor fitting and TV regularization in diffusion tensor imaging based on a Riemannian manifold approach, IEEE Transactions on Medical Imaging, 35 (2016), pp. 1972-1989, https://doi.org/10.1109/TMI.2016.2528820.
[13] S. Becker and J. Fadili, A quasi-Newton proximal splitting method, in Advances in Neural Information Processing Systems, 2012, pp. 2618-2626, https://doi.org/10.5555/2999325.2999427.
[14] G. Bento, J. Neto, and P. Oliveira, Convergence of inexact descent methods for nonconvex optimization on Riemannian manifolds, arXiv preprint arXiv:1103.4828, (2011), https://arxiv.org/abs/1103.4828.
[15] G. d. C. Bento, J. X. da Cruz Neto, and P. R. Oliveira, A new approach to the proximal point method: convergence on general Riemannian manifolds, Journal of Optimization Theory and Applications, 168 (2016), pp. 743-755, https://doi.org/10.1007/s10957-015-0861-2.
[16] R. Bergmann, Manopt.jl: Optimization on Manifolds in Julia, 2019.
[17] R. Bergmann, R. H. Chan, R. Hielscher, J. Persch, and G. Steidl, Restoration of manifold-valued images by half-quadratic minimization, arXiv preprint arXiv:1505.07029, (2015), https://doi.org/10.3934/ipi.2016001.
[18] R. Bergmann, J. H. Fitschen, J. Persch, and G. Steidl, Infimal convolution coupling of first and second order differences on manifold-valued images, in International Conference on Scale Space and Variational Methods in Computer Vision, Springer, 2017, pp. 447-459, https://doi.org/10.1007/978-3-319-58771-4_36.
[19] R. Bergmann, J. H. Fitschen, J. Persch, and G. Steidl, Priors with coupled first and second order differences for manifold-valued image processing, Journal of Mathematical Imaging and Vision, 60 (2018), pp. 1459-1481, https://doi.org/10.1007/s10851-018-0840-y.
[20] R. Bergmann, R. Herzog, M. S. Louzeiro, D. Tenbrinck, and J. Vidal-Núñez, Fenchel duality theory and a primal-dual algorithm on Riemannian manifolds, arXiv preprint arXiv:1908.02022, (2019), https://arxiv.org/abs/1908.02022v2.
[21] R. Bergmann, F. Laus, G. Steidl, and A. Weinmann, Second order differences of cyclic data and applications in variational denoising, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2916-2953, https://doi.org/10.1137/140969993.
[22] R. Bergmann, J. Persch, and G. Steidl, A parallel Douglas-Rachford algorithm for minimizing ROF-like functionals on images with values in symmetric Hadamard manifolds, SIAM Journal on Imaging Sciences, 9 (2016), pp. 901-937, https://doi.org/10.1137/15M1052858.
[23] R. Bergmann and D. Tenbrinck, A graph framework for manifold-valued data, SIAM Journal on Imaging Sciences, 11 (2018), pp. 325-360, https://doi.org/10.1137/17M1118567.
[24] R. Bergmann and A. Weinmann, Inpainting of cyclic data using first and second order differences, in International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, 2015, pp. 155-168.
[25] D. P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical Programming, 129 (2011), p. 163, https://doi.org/10.1007/s10107-011-0472-0.
[26] J. Bolte, S. Sabach, and M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming, 146 (2014), pp. 459-494, https://doi.org/10.1007/s10107-013-0701-9.
[27] M. Bortoloti, T. Fernandes, O. Ferreira, and J. Yuan, Damped Newton's method on Riemannian manifolds, arXiv preprint arXiv:1803.05126, (2018), https://arxiv.org/abs/1803.05126.
[28] K. Bredies, M. Holler, M. Storath, and A. Weinmann, Total generalized variation for manifold-valued data, SIAM Journal on Imaging Sciences, 11 (2018), pp. 1785-1848, https://doi.org/10.1137/17M1147597.
[29] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ₁-regularized optimization, Mathematical Programming, 159 (2016), pp. 435-467, https://doi.org/10.1007/s10107-015-0965-3.
[30] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Mathematical Programming, 157 (2016), pp. 375-396, https://doi.org/10.1007/s10107-015-0941-y.
[31] M. P. do Carmo, Riemannian geometry, Birkhäuser, 1992.
[32] R. Castro, G. Di Giorgi, and W. Sierra, Secant method on Riemannian manifolds, arXiv preprint arXiv:1712.02655, (2017), https://arxiv.org/abs/1712.02655.
[33] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of Mathematical Imaging and Vision, 40 (2011), pp. 120-145, https://doi.org/10.1007/s10851-010-0251-1.
[34] T. F. Chan, S. H. Kang, and J. Shen, Total variation denoising and enhancement of color images based on the CB and HSV color models, Journal of Visual Communication and Image Representation, 12 (2001), pp. 422-435, https://doi.org/10.1006/jvci.2001.0491.
[35] C. Chefd'Hotel, D. Tschumperlé, R. Deriche, and O. Faugeras, Regularizing flows for constrained matrix-valued images, Journal of Mathematical Imaging and Vision, 20 (2004), pp. 147-162, https://doi.org/10.1023/B:JMIV.0000011920.58935.9c.
[36] D.-Q. Chen, Fixed point algorithm based on quasi-Newton method for convex minimization problem with application to image deblurring, 2014, https://arxiv.org/abs/1412.4438.
[37] S. Chen, S. Ma, A. Man-Cho So, and T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization, 30 (2020), pp. 210-239, https://doi.org/10.1137/18M122457X.
[38] W. Chen, H. Ji, and Y. You, An augmented Lagrangian method for ℓ₁-regularized optimization problems with orthogonality constraints, SIAM Journal on Scientific Computing, 38 (2016), pp. B570-B592, https://doi.org/10.1137/140988875.
[39] C. Clason and T. Valkonen, Introduction to nonsmooth analysis and optimization, arXiv preprint arXiv:2001.00216, (2020).
[40] P. L. Combettes and J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 564-574, https://doi.org/10.1109/JSTSP.2007.910264.
[41] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation, 4 (2005), pp. 1168-1200, https://doi.org/10.1137/050626090.
[42] P. Cook, Y. Bai, S. Nedjati-Gilani, K. Seunarine, M. Hall, G. Parker, and D. C. Alexander, Camino: open-source diffusion-MRI reconstruction and processing, in 14th Scientific Meeting of the International Society for Magnetic Resonance in Medicine, vol. 2759, Seattle WA, USA, 2006, p. 2759.
[43] D. Cremers and E. Strekalovskiy, Total cyclic variation and generalizations, Journal of Mathematical Imaging and Vision, 47 (2013), pp. 258-277, https://doi.org/10.1007/s10851-012-0396-1.
[44] T. De Luca, F. Facchinei, and C. Kanzow, A semismooth equation approach to the solution of nonlinear complementarity problems, Mathematical Programming, 75 (1996), pp. 407-439, https://doi.org/10.1007/BF02592192.
[45] F. R. de Oliveira and O. P. Ferreira, Newton method for finding a singularity of a special class of locally Lipschitz continuous vector fields on Riemannian manifolds, arXiv preprint arXiv:1810.11636, (2018), https://arxiv.org/abs/1810.11636.
[46] W. Diepeveen, Non-smooth higher-order optimization on manifolds, master's thesis, 2020.
[47] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289-1306, https://doi.org/10.1109/TIT.2006.871582.
[48] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM Journal on Matrix Analysis and Applications, 20 (1998), pp. 303-353, https://doi.org/10.1137/S0895479895290954.
[49] E. Esser, X. Zhang, and T. F. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science, SIAM Journal on Imaging Sciences, 3 (2010), pp. 1015-1046, https://doi.org/10.1137/09076934X.
[50] F. Facchinei, A. Fischer, and C. Kanzow, Inexact Newton methods for semismooth equations with applications to variational inequality problems, in Nonlinear Optimization and Applications, Springer, 1996, pp. 125-139.
[51] O. Ferreira and P. Oliveira, Subgradient algorithm on Riemannian manifolds, Journal of Optimization Theory and Applications, 97 (1998), pp. 93-104, https://doi.org/10.1023/A:1022675100677.
[52] O. Ferreira and P. Oliveira, Proximal point algorithm on Riemannian manifolds, Optimization, 51 (2002), pp. 257-270, https://doi.org/10.1080/02331930290019413.
[53] P. T. Fletcher and S. Joshi, Riemannian geometry for the statistical analysis of diffusion tensor data, Signal Processing, 87 (2007), pp. 250-262, https://doi.org/10.1016/j.sigpro.2005.12.018.
[54] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Mathematical Programming Computation, 6 (2014), pp. 1-31, https://doi.org/10.1007/s12532-013-0063-6.
[55] M. Giaquinta, G. Modica, and J. Souček, Variational problems for maps of bounded variation with values in S¹, Calculus of Variations and Partial Differential Equations, 1 (1993), pp. 87-121, https://doi.org/10.1007/BF02163266.
[56] M. Giaquinta and D. Mucci, The BV-energy of maps into a manifold: relaxation and density results, Annali della Scuola Normale Superiore di Pisa-Classe di Scienze, 5 (2006), pp. 483-548.
[57] M. Giaquinta and D. Mucci, Maps of bounded variation with values into a manifold: total variation and relaxed energy, Pure and Applied Mathematics Quarterly, 3 (2007), pp. 513-538, https://doi.org/10.4310/PAMQ.2007.v3.n2.a6.
[58] R. Griesse and D. A. Lorenz, A semismooth Newton method for Tikhonov functionals with sparsity constraints, Inverse Problems, 24 (2008), p. 035007, https://doi.org/10.1088/0266-5611/24/3/035007.
[59] P. Grohs and S. Hosseini, Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds, IMA Journal of Numerical Analysis, 36 (2016), pp. 1167-1192, https://doi.org/10.1093/imanum/drv043.
[60] P. Grohs and S. Hosseini, ε-subgradient algorithms for locally Lipschitz functions on Riemannian manifolds, Advances in Computational Mathematics, 42 (2016), pp. 333-360, https://doi.org/10.1007/s10444-015-9426-z.
[61] P. Grohs and M. Sprecher, Total variation regularization by iteratively reweighted least squares on Hadamard spaces and the sphere, preprint, 39 (2014).
[62] P. Grohs and M. Sprecher, Total variation regularization on Riemannian manifolds by iteratively reweighted minimization, Information and Inference: A Journal of the IMA, 5 (2016), pp. 353-378, https://doi.org/10.1093/imaiai/iaw011.
[63] B. He, Y. You, and X. Yuan, On the convergence of primal-dual hybrid gradient algorithm, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2526-2537, https://doi.org/10.1137/140963467.
[64] J. He, J. Wang, and J.-C. Yao, Convergence criteria of Newton's method on Lie groups, Fixed Point Theory and Applications, 2013 (2013), p. 293, https://doi.org/10.1186/1687-1812-2013-293.
[65] M. Hintermüller, K. Ito, and K. Kunisch, The primal-dual active set strategy as a semismooth Newton method, SIAM Journal on Optimization, 13 (2002), pp. 865-888, https://doi.org/10.1137/s1052623401383558.
[66] S. Hosseini, Convergence of nonsmooth descent methods via Kurdyka-Łojasiewicz inequality on Riemannian manifolds, INS Preprint No. 1523, Hausdorff Center for Mathematics and Institute for Numerical Simulation, University of Bonn, 2015.
[67] S. Hosseini, W. Huang, and R. Yousefpour, Line search algorithms for locally Lipschitz functions on Riemannian manifolds, SIAM Journal on Optimization, 28 (2018), pp. 596-619, https://doi.org/10.1137/16M1108145.
[68] S. Hosseini and M. Pouryayevali, Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds, Nonlinear Analysis: Theory, Methods & Applications, 74 (2011), pp. 3884-3895, https://doi.org/10.1016/j.na.2011.02.023.
[69] S. Hosseini and A. Uschmajew, A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds, SIAM Journal on Optimization, 27 (2017), pp. 173-189, https://doi.org/10.1137/16M1069298.
[70] B. A. Kakavandi and M. Amini, Duality and subdifferential for convex functions on complete CAT(0) metric spaces, Nonlinear Analysis: Theory, Methods & Applications, 73 (2010), pp. 3450-3455, https://doi.org/10.1016/j.na.2010.07.033.
[71] R. Kimmel and N. Sochen, Orientation diffusion or how to comb a porcupine, Journal of Visual Communication and Image Representation, 13 (2002), pp. 238-248, https://doi.org/10.1006/jvci.2001.0501.
[72] A. Kovnatsky, K. Glashoff, and M. M. Bronstein, MADMM: a generic algorithm for non-smooth optimization on manifolds, in European Conference on Computer Vision, Springer, 2016, pp. 680-696, https://doi.org/10.1007/978-3-319-46454-1_41.
[73] R. Lai and S. Osher, A splitting method for orthogonality constrained problems, Journal of Scientific Computing, 58 (2014), pp. 431-449, https://doi.org/10.1007/s10915-013-9740-x.
[74] F. Laus, M. Nikolova, J. Persch, and G. Steidl, A nonlocal denoising algorithm for manifold-valued images using second order statistics, SIAM Journal on Imaging Sciences, 10 (2017), pp. 416-448, https://doi.org/10.1137/16M1087114.
[75] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM Journal on Optimization, 24 (2014), pp. 1420-1443, https://doi.org/10.1137/130921428.
[76] J. M. Lee, Smooth manifolds, in Introduction to Smooth Manifolds, Springer, 2013, pp. 1-31, https://doi.org/10.1007/978-1-4419-9982-5.
[77] J. Lellmann, E. Strekalovskiy, S. Koetter, and D. Cremers, Total variation regularization for functions with values in a manifold, in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2944-2951, https://doi.org/10.1109/ICCV.2013.366.
[78] X. Li, D. Sun, and K.-C. Toh, A highly efficient semismooth Newton augmented Lagrangian method for solving LASSO problems, SIAM Journal on Optimization, 28 (2018), pp. 433-458, https://doi.org/10.1137/16M1097572.
[79] D. G. Luenberger, The gradient projection method along geodesics, Management Science, 18 (1972), pp. 620-631, https://doi.org/10.1287/mnsc.18.11.620.
[80] J. Martínez and L. Qi, Inexact Newton methods for solving nonsmooth equations, Journal of Computational and Applied Mathematics, 60 (1995), pp. 127-145, https://doi.org/10.1016/0377-0427(94)00088-I.
[81] D. Massonnet and K. L. Feigl, Radar interferometry and its application to changes in the earth's surface, Reviews of Geophysics, 36 (1998), pp. 441-500, https://doi.org/10.1029/97RG03139.
[82] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ₁-optimization, SIAM Journal on Optimization, 24 (2014), pp. 298-333, https://doi.org/10.1137/120892167.
[83] H. Pan, Z. Jing, M. Lei, R. Liu, B. Jin, and C. Zhang, A sparse proximal Newton splitting method for constrained image deblurring, Neurocomputing, 122 (2013), pp. 245-257, https://doi.org/10.1016/j.neucom.2013.06.027.
[84] P. Patrinos, L. Stella, and A. Bemporad, Forward-backward truncated Newton methods for convex composite optimization, arXiv preprint arXiv:1402.6655, (2014), https://arxiv.org/abs/1402.6655.
[85] X. Pennec, Intrinsic statistics on Riemannian manifolds: basic tools for geometric measurements, Journal of Mathematical Imaging and Vision, 25 (2006), p. 127, https://doi.org/10.1007/s10851-006-6228-4.
[86] X. Pennec, P. Fillard, and N. Ayache, A Riemannian framework for tensor computing, International Journal of Computer Vision, 66 (2006), pp. 41-66, https://doi.org/10.1007/s11263-005-3222-z.
[87] J. Persch, Optimization Methods for Manifold-valued Image Processing, Verlag Dr. Hut, 2018.
[88] T. Pock and A. Chambolle, Diagonal preconditioning for first order primal-dual algorithms in convex optimization, in 2011 International Conference on Computer Vision, IEEE, 2011, pp. 1762-1769, https://doi.org/10.1109/ICCV.2011.6126441.
[89] L. Qi and J. Sun, A nonsmooth version of Newton's method, Mathematical Programming, 58 (1993), pp. 353-367, https://doi.org/10.1007/BF01581275.
[90] R. T. Rockafellar and R. J.-B. Wets, Variational analysis, vol. 317, Springer Science & Business Media, 2009, https://doi.org/10.1007/978-3-642-02431-3.
[91] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena, 60 (1992), pp. 259-268, https://doi.org/10.1016/0167-2789(92)90242-F.
[92] C. Rust, J. Lellmann, and T. Vogt, Semiglatte Optimierungsverfahren zweiter Ordnung in der Bildverarbeitung, master's thesis, University of Lübeck, 2017.
[93] T. Sakai, Riemannian geometry, vol. 149, American Mathematical Soc., 1996.
[94] S. T. Smith, Optimization techniques on Riemannian manifolds, Fields Institute Communications, 3 (1994), pp. 113-135.
[95] M. Storath and A. Weinmann, Wavelet sparse regularization for manifold-valued data, arXiv preprint arXiv:1808.00505, (2018), https://arxiv.org/abs/1808.00505.
[96] M. Storath, A. Weinmann, and M. Unser, Exact algorithms for L¹-TV regularization of real-valued or circle-valued signals, SIAM Journal on Scientific Computing, 38 (2016), pp. A614-A630, https://doi.org/10.1137/15M101796X.
[97] E. Strekalovskiy and D. Cremers, Total variation for cyclic structures: convex relaxation and efficient minimization, in CVPR 2011, IEEE, 2011, pp. 1905-1911, https://doi.org/10.1109/CVPR.2011.5995573.
[98] D. Sun and J. Han, Newton and quasi-Newton methods for a class of nonsmooth equations and related problems, SIAM Journal on Optimization, 7 (1997), pp. 463-480, https://doi.org/10.1137/S1052623494274970.
[99] C. Udriste, Convex functions and optimization methods on Riemannian manifolds, vol. 297, Springer Science & Business Media, 1994.
[100] T. Valkonen, K. Bredies, and F. Knoll, Total generalized variation in diffusion tensor imaging, SIAM Journal on Imaging Sciences, 6 (2013), pp. 487-525, https://doi.org/10.1137/120867172.
[101] T. Vogt, E. Strekalovskiy, D. Cremers, and J. Lellmann, Lifting methods for manifold-valued variational problems, arXiv preprint arXiv:1908.03776, (2019), https://arxiv.org/abs/1908.03776.
[102] Y. Wang, J. Yang, W. Yin, and Y. Zhang, A new alternating minimization algorithm for total variation image reconstruction, SIAM Journal on Imaging Sciences, 1 (2008), pp. 248-272, https://doi.org/10.1137/080724265.
[103] A. Weinmann, L. Demaret, and M. Storath, Total variation regularization for manifold-valued data, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2226-2257, https://doi.org/10.1137/130951075.
[104] A. Weinmann, L. Demaret, and M. Storath, Mumford-Shah and Potts regularization for manifold-valued data, Journal of Mathematical Imaging and Vision, 55 (2016), pp. 428-445, https://doi.org/10.1007/s10851-015-0628-2.
[105] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing, 57 (2009), pp. 2479-2493, https://doi.org/10.1109/TSP.2009.2016892.
[106] X. Xiao, Y. Li, Z. Wen, and L. Zhang, A regularized semi-smooth Newton method with projection steps for composite convex programs, Journal of Scientific Computing, 76 (2018), pp. 364-389, https://doi.org/10.1007/s10915-017-0624-3.
[107] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, Bregman iterative algorithms for ℓ₁-minimization with applications to compressed sensing, SIAM Journal on Imaging Sciences, 1 (2008), pp. 143-168, https://doi.org/10.1137/070703983.
[108] J. Yu, S. Vishwanathan, S. Günter, and N. N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization problems in machine learning, Journal of Machine Learning Research, 11 (2010), pp. 1145-1200, https://doi.org/10.1145/1390156.1390309.
[109] M.-C. Yue, Z. Zhou, and A. M.-C. So, A family of inexact SQA methods for non-smooth convex minimization with provable convergence guarantees based on the Luo-Tseng error bound property, Mathematical Programming, 174 (2019), pp. 327-358, https://doi.org/10.1007/s10107-018-1280-6.
[110] H. Zhu, X. Zhang, D. Chu, and L.-Z. Liao, Nonconvex and nonsmooth optimization with generalized orthogonality constraints: an approximate augmented Lagrangian method, Journal of Scientific Computing, 72 (2017), pp. 331-372, https://doi.org/10.1007/s10915-017-0359-1.