Duality-based Higher-order Non-smooth Optimization on Manifolds
DDuality-based Higher-order Non-smooth Optimization onManifolds
Willem Diepeveen ∗ Jan Lellmann † Abstract
We propose a method for solving non-smooth optimization problems on manifolds. Inorder to obtain superlinear convergence, we apply a Riemannian Semi-smooth Newtonmethod to a non-smooth non-linear primal-dual optimality system based on a recent ex-tension of Fenchel duality theory to Riemannian manifolds. We also propose an inexactversion of the Riemannian Semi-smooth Newton method and prove conditions for local lin-ear and superlinear convergence. Numerical experiments on (cid:96) -TV-like problems confirmsuperlinear convergence on manifolds with positive and negative curvature. Energy-based modeling in image- and data analysis requires the numerical minimization oflarge-scale energy functions. Due to the growing popularity of sparsity-based approaches suchas compressive sensing [47] and total variation-based image processing [91], these energies oftenincorporate non-smooth terms.First-order methods based on (sub-)gradients for minimizing such energies have becomevery popular due to their robustness [41, 40, 107, 102, 33, 25]; see also the survey in [49].However, their convergence rate in the general case is typically limited to O ( (cid:15) ) iterationsfor achieving (cid:15) -suboptimality, which can sometimes be improved to O ( √ (cid:15) ) using accelerationstrategies [63, 88, 105].Options for obtaining a superlinear convergence rate include interior point methods [54]or Newton-like methods, such as Quasi Newton [108, 36], Proximal Newton [13, 83, 75, 30,109], Forward Backward Newton [84] and Semi-smooth Newton [58, 82, 29, 106, 78]. In thiswork we will focus on a generalization of the Semi-smooth Newton (SSN) method, which wasoriginally proposed in [89] and initially used in [44, 98], but popularized through optimal controlapplications in the early 2000s [65], before being discovered by the image processing community[58, 82, 29]. We also refer to [92, 106] for some recent variants.On the modeling side, there has been an increasing interest in manifold-valued data pro-cessing: apart from statistical [86, 85, 53, 74] and PDE approaches [71, 35] to smooth dataprocessing on manifolds, non-smooth variational approaches on Riemannian manifolds havebeen gaining momentum as well in the 2010s. Typical applications include non-linear colorspaces [34] such as the Chromaticity Brightness model ( S × R ) and the Hue Saturation Valuemodel ( S × R ), InSAR imaging [81] ( S ), Diffusion Tensor imaging [11] ( P (3), see Fig. 1) andElectron Backscatter Diffraction imaging [4] ( SO (3)). Also several branches in data science ∗ Department of Applied Mathematics and Theoretical Physics, University of Cambridge ([email protected]). † Institute of Mathematics and Image Computing, University of L¨ubeck ([email protected]). a r X i v : . [ m a t h . O C ] F e b ave become more involved in using the geometry of problems. Examples include sparse prin-cipal component analysis, compressed mode analysis in physics, unsupervised feature selectionand sparse blind convolution, where in these cases the Stiefel manifold captures all geometricinformation; see [37] for an overview.In this work, we consider optimization problems of the forminf p ∈M { F ( p ) + G (Λ( p )) } , (1)where F : M → R and G : N → R are non-smooth functions mapping into the extended realline R := R ∪ {±∞} , M and N are Riemannian manifolds, and Λ : M → N is differentiable.Unlike the Euclidean case M = R n , to our knowledge no intrinsic higher-order solvers withsuperlinear convergence exist in the manifold-valued case. 
Our goal is to close this gap.Figure 1: Diffusion-Tensor Magnetic Resonance Imaging (DT-MRI) allows precise fiber tracingin the human brain by modeling the diffusion directions of water molecules using multivariatenormal distributions, which are represented by elements on the manifold of symmetric positivedefinite matrices P (3). The resulting optimization problems are manifold-valued and oftennon-smooth, which necessitates specialized numerical optimization algorithms. Image: Caminoproject [42]. There have been multiple attempts to generalize algorithms for both smooth and non-smoothoptimization on manifolds. Early work can be traced back to [79], where a manifold constraintis included in the optimization problem and enforced using projections. Using this so-called extrinsic approach, algorithms from the real-valued case can be made to work on manifolds aswell, if they can be reasonably embedded into en Euclidean space. While this technique doesnot capture the manifold structure specifically, recent contributions still tend to this approachfor optimization [73, 26, 38, 72, 110]. A step towards exploiting the intrinsic geometry ofmanifolds is to rely on local charts, which provide a linear subspace in which methods for real-valued optimization can be applied. However, both embedding and localization approachessuffer from serious drawbacks, which are discussed extensively in [59, Sec. 1]. Hence, intrinsicmethods that do not rely on charts have become very popular.Pioneering work using these ideas in smooth optimization on manifolds was done in the1994 with [94, 99], whose authors formulated several algorithms such as gradient descent, New-ton’s method and conjugate gradient on Riemannian manifolds. From then on, the community2tarted working on generalizing other algorithms to Riemannian manifolds [2, 27, 32], special-izing algorithms [48, 1, 64] and the application to real-world problems [5, 6]. An extensiveoverview of first-and second-order methods for smooth optimization on matrix manifolds canbe found in [3].The development of non-smooth Riemannian optimization methods appears to have oc-curred somewhat independently in the image analysis- and in the non-smooth optimizationcommunities. Whereas the former mostly relied on methods that generalize convexity, the lat-ter has been focussed on finding new notions and restrictions to generalize manifold algorithmsfrom the smooth to non-smooth setting. We give a short survey of both, as we believe thateither provide valuable insights.
Manifold-valued Imaging
The non-smooth Rudin-Osher-Fatemi (ROF) model [91] is theprototype of modern image analysis. This model revolves around the notion of Total Variation(TV), which has also been generalized to the S manifold in the early 1990s [55] and has beenextended to general Riemannian manifolds in the second half of the 2000s [56, 57]. Althoughearly numerical attempts for TV-regularized problems on manifolds originate in the early 2000s[34], the majority of the initial contributions in TV-based models on manifolds were proposedin the early 2010s [97, 43, 100]. These initial models had the major drawback that they wereproposed for specific manifolds. In [77], the authors reformulated the variational problemon arbitrary Riemannian manifolds into a multi-label optimization problem. In [103], thegeneralized ROF model was formulated in a fully intrinsic fashion.When entering the second half of the 2010s we see generalizations to several models emerg-ing. A second-order model for cyclic data was proposed in [21], and soon, popular real-valuedmodels were extended to the manifold case: a general second-order method [9], infimal con-volution models [18, 19], and TGV for manifold-valued imaging [28, 19]. At the same timespecialized models were extended to applications such as inpainting [24], segmentation [104], ormanifold-valued inverse problems [12, 95]. Additionally, new problem settings arose with theemergence TV for manifold-valued data on graphs [23].As non-smooth models were extended to manifolds, so did the solvers. The proposed al-gorithms were typically proposed on Hadamard manifolds : Riemannian manifolds that arecomplete, simply connected and have non-positive sectional curvature. On these manifolds,convexity can be generalized, which has been crucial in many of the convergence proofs. Pro-posed algorithms include the Proximal Point Algorithm (PPA) [10] and its extension, the CyclicProximal Point Algorithm (CPPA) [8]; Iteratively Reweighted Least Squares (IRLS) [61, 17]and an adaptation in [62]; the Parallel Douglas-Racheford Algorithm (PDRA) [22]; and the re-cently derived exact and linearized Riemannian Chambolle Pock Algorithms (eRCPA/lRCPA)[20]. Despite the Hadamard constraint, it should be noted that these algorithms perform well onmanifolds with positive curvature in experiments. Methods that do not rely on the Hadamardconstraint include exact methods as in [96], which uses exact solutions for (cid:96) -TV with sphericaldata, and functional lifting [101]. These methods come with their own shortcomings: for theformer this is limited applicability and for the latter this is high computational cost. Non-smooth Optimization on Manifolds
For non-smooth optimization on manifolds, thepioneering works include [51], in which the subgradient method was extended to Riemannianmanifolds and was shown to converge on Hadamard manifolds, and [52], whose authors extendedthe proximal map and proved convergence on Hadamard manifolds. Further development innon-smooth optimization was accelerated in the late 2000s by the introduction of the proximalsubdifferential for manifolds [7], the Clarke generalized subdifferential for manifolds [68], and aframework for duality on
CAT (0) metric spaces [70].3y the start of the 2010s, researchers started moving from Hadamard spaces to the non-Hadamard case. New developments include the introduction of a non-smooth version of theKurdyka-Lojasiewicz (KL) inequality used to show the convergence of PPA on general Rie-mannian manifolds [14], convergence of subgradient descent for functions satisfying the KLinequality to a singular critical point [66], and a new approach to the convergence of PPA thatextends previous results to a broader class of functions [15].Around 2015, the first numerical implementations of these and new algorithms receivedattention: [59] introduced a non-smooth trust region method for Riemannian manifolds andshowed global convergence, [60] used an approximate subdifferential and proposed a descentmethod with global convergence, [69] proposed a gradient sampling algorithm and showedits global convergence, and [67] proposed a line search algorithm and generalized the Wolfeconditions for Riemannian manifolds. Recently, a higher-order method was introduced: theRiemannian Semi-smooth Newton method (RSSN) [45], although to our knowledge no publicly-available implementation has existed so far. This method will form the basis of our approach.
The contributions of this work are two-fold:
1. Primal-Dual Riemannian Semi-smooth Newton (PD-RSSN) for non-smooth op-timization
While the RSSN method [45] allows to find zeros of generic vector fields on man-ifolds in the same way that the Newton method allows to solve nonlinear systems of equations,the application to finding minimizers of non-smooth optimization problems is not straightfor-ward, as the optimality conditions typically are in inclusion form. Using the recently proposedgeneralized Fenchel duality theory on manifolds [20], we construct a primal-dual optimalitysystem in the form of a vector field, which is then solved by RSSN. Overall, this providesa superlinearly convergent primal-dual scheme (PD-RSSN) for non-smooth Riemannian opti-mization.
2. Expanding the theoretical framework of Riemannian Semi-smooth Newton
Asin the classical Newton method, the proposed method requires to solve Newton-type systemin each step. For larger-scale problems, solving for the Newton steps exactly is often notefficient. Therefore we propose an inexact version of RSSN and provide a convergence proof(theorem 3.5). We show that at least linear convergence can be expected in the inexact case.The theoretical results are validated by our numerical experiments.This work is based on the theses [92] and [46].
In section 2, basic notation from differential geometry and Riemannian geometry is coveredand the necessary definitions for manifold Fenchel duality theory are discussed along with theresulting non-smooth optimality systems. We also summarize the basic notions necessary forapplying the Riemannian Semi-smooth Newton method. In section 3, we present the inexactRiemannian Semi-smooth Newton method and give a local convergence proof. In section 4,we discuss how RSSN can be used to solve the non-smooth optimality system from section 2,which leads to the proposed PD-RSSN method. The application of our method to (cid:96) -TV-likeproblems is discussed in section 5, and numerical results are shown in section 6. We concludeand summarize in section 7. 4 Preliminaries
In the first half of this section, the notation from differential and Riemannian geometry issummarized. In the second part, we briefly recapitulate manifold Fenchel duality theory andthe Riemannian Semi-smooth Newton method.
For details regarding differential geometry and Riemannian geometry, we refer the reader tobooks such as [76, 93, 31].We write M and N for Riemannian manifolds. The tangent space at p ∈ M is denotedby T p M and for tangent vectors we write X p and Y p , or simply X and Y . For the tangentbundle we have T M := (cid:83) p ∈M T p M . Similarly, we write T ∗ p M for the cotangent space, ξ p and η p or simply ξ and η for covectors, and T ∗ M := (cid:83) p ∈M T ∗ p M for the cotangent bundle.A cotangent vector ξ ∈ T ∗ p M acts on a tangent vector X ∈ T p M through the duality pairing (cid:104) ξ, X (cid:105) p := ξ ( X ) ∈ R .We assume a Riemannian manifold is equipped with a metric. For some point p ∈ M themetric tensor is denoted by ( · , · ) p : T p M × T p M → R and the norm induced by the Riemannianmetric is written as (cid:107) · (cid:107) p . The Riemannian distance between points p, q ∈ M is denoted by d M ( p, q ). For the open metric ball of radius r > p ∈ M induced by this distancewe write B r ( p ) := { y ∈ M | d M ( p, q ) < r } . (2)Through this metric, for p ∈ M we can define the musical isomorphisms (cid:91) : T p M → T ∗ p M as (cid:104) X (cid:91) , Y (cid:105) p = ( X, Y ) p for all Y ∈ T p M (3)and its inverse (cid:93) : T ∗ p M → T p M as( ξ (cid:93) , X ) p = (cid:104) ξ, X (cid:105) p for all X ∈ T p M . (4)The metric can also be used to construct a unique affine connection, the Levi-Civita con-nection, which is denoted by ∇ ( · ) ( · ). For geodesics γ : [0 , → M , we typically write γ p,q ( t )and γ p,X ( t ) to denote a minimizing geodesic connecting p, q ∈ M and a geodesic starting from p ∈ M with velocity ˙ γ p,X (0) = X ∈ T p M . The latter defines the exponential map, which isdenoted by exp p : G p → M where G p ⊂ T p M is the set on which ˙ γ p,X =: exp p ( X ) is defined.Furthermore, if G p = T p M , the manifold is called complete. For G (cid:48) p ⊂ G p a metric ball withradius r p on which exp p is a diffeomorphism, the logarithmic map is defined and we denote itby log p : exp( G (cid:48) p ) → G (cid:48) p . This radius r p is called the injectivity radius. Importantly, on simplyconnected, complete Riemannian manifolds with non-positive sectional curvature, the exp p andlog p maps are globally defined. Such manifolds are called Hadamard manifolds.
Finally, wewrite parallel transport of a vector X ∈ T p M from p to q ∈ M as P p → q X and parallel transportof a covector ξ ∈ T ∗ p M as P p → q ξ , where the latter is defined through the musical isomorphisms,as P p → q ξ := (cid:0) P p → q ξ (cid:93) (cid:1) (cid:91) . (5) In this section, generalizations of classical notions from non-smooth analysis to manifolds pro-posed in [20] are discussed. For the equivalent basic notions of convex analysis in vector spaces5e refer the reader to [90, 39]. The section culminates in the recently derived primal-dual op-timality conditions proposed by the authors of [20]. In the following, R := R ∪ {±∞} denotesthe extended real line. The notion of convexity can be defined on strongly convex subsets of Riemannian manifolds.
Definition 2.1 (strongly convex set, [93, Def. IV.5.1]) . A subset
C ⊂ M of a Riemannianmanifold M is said to be strongly convex if, for all p, q ∈ C , a minimal geodesic γ p,q between p and q exists, is unique, and lies completely in C . The well-known notions of properness, convexity and lower semi-continuity can be general-ized as follows.
Definition 2.2 (proper, [20, Def. 2.11.i]) . A function F : M → R is proper if dom F := { x ∈M| F ( x ) < ∞} (cid:54) = ∅ and F ( x ) > −∞ holds for all x ∈ M . Definition 2.3 (convex, [20, Def. 2.11.ii]) . Suppose that
C ⊂ M is strongly convex. A properfunction F : M → R is called (geodesically) convex on C ⊂ M if, for all p, q ∈ C , thecomposition F ◦ γ p,q ( t ) is a convex function on [0 , in the classical sense. Definition 2.4 (epigraph, [20, Def. 2.11.iii]) . Suppose that
A ⊂ M . The epigraph of afunction F : A → R is defined as epi F := { ( x, α ) ∈ A × R | F ( x ) ≤ α } . (6) Definition 2.5 (lower semi-continuous, [20, Def. 2.11.iv]) . Suppose that
A ⊂ M . A properfunction F : A → R is called lower semi-continuous (lsc) if epi F is closed. The notions of subdifferentials and proximal mappings can also be extended to manifoldsusing the exponential map and the geodesic distance:
Definition 2.6 (subdifferential,[99, Def. 3.4.4]) . Suppose that
C ⊂ M is strongly convex. The subdifferential ∂ M F on C of a proper, convex function F : C → R at a point p ∈ C is defined as ∂ M F ( p ) := (cid:8) ξ ∈ T ∗ p M| F ( q ) ≥ F ( p ) + (cid:104) ξ, log p q (cid:105) p for all q ∈ C (cid:9) . (7) Definition 2.7 (proximal mapping, [52]) . Let M be a Riemannian manifold, F : M → R beproper, and λ > . The proximal map of F is defined as prox λF ( p ) := arg min q ∈M (cid:26) λ d M ( p, q ) + F ( q ) (cid:27) . (8)Tangent and cotangent spaces play an important role in the generalization of the Fenchel-dual functions. In this context, well-definedness of the exponential and the logarithmic mapis ensured by restricting ourselves to the following subset, which is a localized variant of thepre-image of the exponential map: Definition 2.8 ([20, Def 2.8]) . Let
C ⊂ M and p ∈ C . We define the tangent subset L C ,p ⊂ T p M as L C ,p := (cid:8) X ∈ T p M | exp p X ∈ C and (cid:107) X (cid:107) p = d M (exp p X, p ) (cid:9) . (9)6ith these basic notions, Fenchel duality theory can be generalized as in [20]. The Fenchelconjugate or Fenchel dual is defined by introducing a base point m on the manifold. Definition 2.9 ( m -Fenchel conjugate,[20, Def. 3.1]) . Suppose that F : C → R and m ∈ C . The m -Fenchel conjugate of F is defined as the function F ∗ m : T ∗ p M → R such that F ∗ m ( ξ m ) := sup X ∈L C,m {(cid:104) ξ m , X (cid:105) − F (exp m X ) } , ξ m ∈ T ∗ m M . (10)For the Fenchel biconjugate, we can then define the following. Definition 2.10 (( mm (cid:48) )-Fenchel biconjugate, [20, Def. 3.5]) . Suppose that F : C → R and m, m (cid:48) ∈ C . Then the ( mm (cid:48) )-Fenchel biconjugate function F mm (cid:48) : C → R is defined as F ∗∗ mm (cid:48) ( p ) := sup ξ m (cid:48) ∈T ∗ m (cid:48) M {(cid:104) ξ m (cid:48) , log m (cid:48) p (cid:105) − F ∗ m ( P m (cid:48) → m ξ m (cid:48) ) } , p ∈ C . (11)Basic properties and a more elaborate discussion of these generalized conjugate functionsare covered in [20]. We will only focus on the relation between F ∗∗ mm and F .First, we note that geodesic convexity is often too strong a condition, so a weaker conditionis used. Given a set C ⊂ M , m ∈ C , and a function F : C → R , we define f m : T m M → R by f m ( X ) := (cid:26) F (exp m X ) , X ∈ L C ,m , + ∞ , X / ∈ L C ,m . (12)It turns out that one can look at the convexity of the function f m : T m M → R in the usualvector space sense on T m M . For a more elaborate discussion of the discrepancy between theconvexity of f m and (geodesic) convexity of F we refer to [20, Example 3.10].Now, the Fenchel-Moreau-Rockafellar theorem can also be extended to the manifold case: Theorem 2.11 ([20, Thm. 3.13]) . Suppose that
C ⊂ M is strongly convex and m ∈ C . Let F : C → R be proper. If f m is lsc and convex on T m M , then F = F ∗∗ mm . In this case F ∗ m isproper as well. In this section the first-order optimality conditions for a minimization problem of the forminf p ∈C { F ( p ) + G (Λ( p )) } (13)are discussed. Here C ⊂ M and
D ⊂ N are strongly convex sets, F : C → R and G : D → R are proper functions, and Λ : M → N is a differentiable map, possibly nonlinear, such thatΛ( C ) ⊂ D . Furthermore, we assume that F : C → R is geodesically convex and that g n ( X ) := (cid:26) G (exp n X ) , X ∈ L D ,n , + ∞ , X / ∈ L D ,n , (14)is proper, convex and lsc on T n N for some n ∈ D .Under these assumptions, eq. (13) can be rewritten into the following saddle-point formu-lation inf p ∈C sup ξ n ∈T ∗ n N {(cid:104) ξ n , log n Λ( p ) (cid:105) + F ( p ) − G ∗ n ( ξ n ) } (15)7he authors of [20] propose two pairs of optimality conditions which we will also refer to asthe exact and linearized optimality conditions for the primal and dual variables . The proposed exact optimality conditions are P m → p (cid:0) − ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) ∈ ∂ M F ( p ) , (16)log n Λ( p ) ∈ ∂G ∗ n ( ξ n ) , (17)where D m Λ ∗ : T ∗ Λ( m ) N → T ∗ m M is the adjoint operator of D m Λ. As shown in [20], this systemcan be rewritten into p = prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (18) ξ n = prox τG ∗ n (cid:16) ξ n + τ (log n Λ ( p )) (cid:91) (cid:17) . (19)Note that the name “exact” is slightly misleading, as the linearizationΛ( p ) ≈ exp Λ( m ) D Λ( m ) [log m p ] (20)was needed to obtain eq. (16).The linearized optimality conditions can be obtained by linearizing both the primal and thedual optimality condition. That is, for n := Λ( m ) , we want to solveinf p ∈M inf ξ n ∈T ∗ n N { F ( p ) + (cid:104) ξ n , D m Λ [log m p ] (cid:105) − G ∗ n ( ξ n ) } , (21)and obtain the optimality system for the general n ∈ N case [20] P m → p (cid:0) − ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) ∈ ∂ M F ( p ) , (22) D m Λ [log m p ] ∈ ∂G ∗ n ( ξ n ) , (23)which can be rewritten into p = prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (24) ξ n = prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) . (25)The linearized system has so far a better theoretical basis than the exact system. The followingweak duality result holds: Theorem 2.12 ([20, Thm. 4.2]) . Let n := Λ( m ) . The dual problem of inf p ∈M (cid:110) F ( p ) + G (exp Λ( m ) D m Λ [log m p ]) (cid:111) (26) is given by sup ξ n ∈T ∗ n N { F ∗ m ( − ( D m Λ) ∗ [ ξ n ]) − G ∗ n ( ξ n ) } (27) and weak duality holds, i.e., inf p ∈M (cid:110) F ( p ) + G (exp Λ( m ) D m Λ [log m p ]) (cid:111) ≥ sup ξ n ∈T ∗ n N {− F ∗ m ( − ( D m Λ) ∗ [ ξ n ]) − G ∗ n ( ξ n ) } . (28)If one performs a suitable fixed-point iteration on eqs. (24) and (25), one obtains the lin-earized Riemannian Chambolle Pock algorithm, which has been shown to converge on Hadamardmanifolds [20, Thm. 4.3]. In this work, we aim to construct a higher-order method instead.8 .3 The Riemannian Semi-smooth Newton Method This section is devoted to the
Riemannian Semi-smooth Newton (RSSN) method, which wepropose to use for solving eq. (13).
Remark 2.13.
Throughout this and the next section, we will develop RSSN on some manifold M (cid:48) . This is not the original manifold M defining the feasible set; we will later apply the RSSNmethod to M (cid:48) = M × T ∗ n N (and p (cid:48) = ( p, ξ n ) ). In order not to clutter the notation, we stillwrite M instead of M (cid:48) (and p instead of p (cid:48) ) throughout this and the next section. In the real-valued case, the Newton method is formulated for solving a smooth system ofequations X ( p ) = 0 for some non-linear map X : R d → R d , where – in the optimizationcontext – this map would implement optimality conditions for the minimization problem. Inthe manifold case, there is no unique generalization. One possibility is to consider a vector field X : M → T M and define the Newton iteration through the covariant derivative: p k +1 = exp p k ( − [ ∇ ( · ) X ] − p k X ( p k )) (29)in order to find a zero 0 = X ( p ) of the vector field. In this section, we will focus on ageneralized covariant derivative approach for finding zeros of semi-smooth vector fields, whichwill be applied by rewriting eqs. (24) and (25) into a vector field in section 4. Throughout thissection we will use the notions and results developed in [45]. Definition 2.14 ((locally) Lipschitz, [45, Def. 6]) . A vector field X on M is said to be Lipschitz continuous on Ω ⊂ M if there exists a constant L > such that, for all p, q ∈ Ω ,there holds (cid:107)P p → q X ( p ) − X ( q ) (cid:107) q ≤ Ld M ( p, q ) , ∀ p, q ∈ Ω , (30) Moreover, for a point p ∈ M , if there exists δ > such that X is Lipschitz continuous on theopen ball B δ ( p ) , then X is said to be Lipschitz continuous at p. Moreover, if for all p ∈ M , X is Lipschitz continuous at p , then X is said to be locally Lipschitz continuous on M . Subsequently, we can generalize Rademacher’s theorem to Lipschitz vector fields.
Theorem 2.15 ([45, Thm. 10]) . If X is a locally Lipschitz continuous vector field on M , then X is almost everywhere differentiable on M . Hence, it makes sense to define the generalized covariant derivative.
Definition 2.16 (Clarke generalized covariant derivative, [45, Def. 11]) . The
Clarke general-ized covariant derivative ∂ C X of a locally Lipschitz continuous vector field X at a point p ∈ M is defined as the set-valued mapping ∂ C X ( p ) : T p M → T p M , ∂ C X ( p ) := co (cid:26) V ∈ L ( T p M ) : ∃ { p k } ⊂ D X , lim k →∞ p k = p, V = lim k →∞ P p k → p ∇ X ( p k ) (cid:27) , (31) where D X ⊂ M is the set on which X is differentiable, “co” represents the convex hull and L ( T p M ) is the vector space consisting of all bounded linear operators from T p M to T p M . In addition to local Lipschitzness, we will also need the directional derivative for the notionof semi-smoothness. We follow the definition in [45].9 efinition 2.17 (directional derivative) . The directional derivative of a vector field X on M at p ∈ M in the direction v ∈ T p M is defined by X (cid:48) ( p, v ) := lim t (cid:38) t (cid:104) P exp p ( tv ) → p X (cid:0) exp p ( tv ) (cid:1) − X ( p ) (cid:105) ∈ T p M , (32) whenever the limit exists. If this directional derivative exists for every v , then X is said to be directionally differentiable at p . Finally, we are able to generalize the notion of semi-smoothness to vector fields.
Definition 2.18 (semi-smooth vector field, [45, Def. 18]) . A vector field X on M that isLipschitz continuous at p ∈ M and directionally differentiable at q ∈ B δ ( p ) for all directionsin T p M , is said to be semi-smooth at p iff for every (cid:15) > there exists < δ < r p , where r p isthe injectivity radius, such that (cid:107) X ( p ) − P q → p (cid:2) X ( q ) + V q log q p (cid:3) (cid:107) p ≤ (cid:15)d M ( p, q ) , ∀ q ∈ B δ ( p ) , ∀ V q ∈ ∂ C X ( q ) . (33) The vector field X is said to be µ -order semi-smooth at p for < µ ≤ iff there exist (cid:15) > and < δ < r p such that (cid:107) X ( p ) − P q → p (cid:2) X ( q ) + V q log q p (cid:3) (cid:107) p ≤ (cid:15)d M ( p, q ) µ , ∀ q ∈ B δ ( p ) , ∀ V q ∈ ∂ C X ( q ) . (34)The Riemannian Semi-smooth Newton method for finding a zero of a vector field, i.e., X ( p ) = 0, is shown in algorithm 1. It extends on the classical Newton method by replacingthe classical Jacobian ∇ X ( p k ) by an element from the Clarke generalized covariant derivative ∂ C X ( p k ), and performing the update step using the exponential map. Algorithm 1
Riemannian Semi-smooth NewtonInitialization: p ∈ M , k := 0 while not converged do Choose any V ( p k ) ∈ ∂ C X ( p k )Solve V ( p k ) d k = − X ( p k ) in the vector space T p k M p k +1 := exp p k ( d k ) k := k + 1 end while We have the following local convergence result.
Theorem 2.19 ([45, Thm. 19]) . Let X be a locally Lipschitz-continuous vector field on M and p ∗ ∈ M be a solution of the problem X ( p ) = 0 . Assume that X is semi-smooth at p ∗ and thatall V p ∗ ∈ ∂ C X ( p ∗ ) are invertible. Then there exists a δ > such that, for every starting point p ∈ B δ ( p ∗ ) \ { p ∗ } , the sequence ( p k ) k ≥ generated by algorithm 1 is well-defined, contained in B δ ( p ∗ ) and converges superlinearly to p ∗ . If additionally X is µ -order semi-smooth at p ∗ , thenthe convergence of ( p k ) k ≥ to p ∗ is of order µ . Inexact Riemannian Semi-smooth Newton
Before moving to applications, we will consider a generalization of RSSN, the
Inexact Rieman-nian Semi-smooth Newton (IRSSN) method in algorithm 2. In contrast to the exact version,it only requires to solve the linear system up to a residual term r k . The reader should bearin mind that remark 2.13 still holds for the remainder of this section. The main motivationfor this method is that solving the Newton system with high precision can be very expensive.Solving the system inexactly, for example using an iterative method, can potentially amelioratethis problem. Algorithm 2
Inexact Riemannian Semi-smooth NewtonInitialization: p ∈ M , a ≥ , k := 0 while not converged do Choose V k ( p k ) ∈ ∂ C X ( p k )Solve V k ( p k ) d k = − X ( p k ) + r k in T p k M where (cid:107) r k (cid:107) ( p k ) ≤ a k (cid:107) X ( p k ) (cid:107) ( p k ) p k +1 := exp p k ( d k )Choose a k +1 ≥ k := k + 1 end while In this section, we focus on proving theorem 3.5: a local convergence result for algorithm 2on Riemannian manifolds. The proof is based on the ideas of the Inexact Semi-smooth Newtonmethods in R n as discussed in [80, 50]. The technical details are inspired by the convergenceproof for Riemannian Semi-smooth Newton [45]. Starting with the technicalities, we first need to account for curvature. In particular, we needto account for how geodesics spread, which we can formalize in a single quantity [45, Def. 2]:
Definition 3.1.
Let p ∈ M and r p be the radius of injectivity of M at p. Define the quantity K p := sup (cid:40) d M (cid:0) exp q u, exp q v (cid:1) (cid:107) u − v (cid:107) q : q ∈ B r p ( p ) , u, v ∈ T q M , u (cid:54) = v, (cid:107) v (cid:107) q ≤ r p , (cid:107) u − v (cid:107) q ≤ r p (cid:41) . (35)The following remark from [45] provides some intuition: Remark 3.2.
This number K p measures how fast the geodesics spread apart in M . In par-ticular, when u = 0 ∈ T q M or more generally when u and v are on the same line through , d M (cid:0) exp q u, exp q v (cid:1) = (cid:107) u − v (cid:107) q . Hence, K p ≥ , for all p ∈ M . When M has non-negativesectional curvature, the geodesics spread apart less than the rays, i.e., d M (cid:0) exp p u, exp p v (cid:1) ≤(cid:107) u − v (cid:107) q and, in this case, K p = 1 for all p ∈ M . Next, remember the definition of an operator norm.
Definition 3.3.
Let p ∈ M . The norm of a linear map A : T p M → T p M is defined by (cid:107) A (cid:107) p := sup {(cid:107) Av (cid:107) p : v ∈ T p M , (cid:107) v (cid:107) p ≤ } . (36)11e have the following result. Lemma 3.4 ([45, Lemma 17]) . Let X be a locally Lipschitz continuous vector field on M . As-sume that all elements V p ∈ ∂ C X ( p ) are invertible at base point p ∈ M and let λ p ≥ max (cid:8) (cid:107) V − p (cid:107) p : V p ∈ ∂ C X ( p ) (cid:9) . Then, for every (cid:15) > satisfying (cid:15)λ p < , there exists < δ < r p such that all V q ∈ ∂ C X ( q ) are invertible on B δ ( p ) and (cid:107) V − q (cid:107) q ≤ λ p − (cid:15)λ p , ∀ q ∈ B δ ( p ) , ∀ V q ∈ ∂ C X ( q ) . (37) With these tools we can move on to the main result of this section.
Theorem 3.5.
Let X be locally Lipschitz continuous vector field on M and p ∗ ∈ M be asolution of problem the X ( p ) = 0 . Assume that X is semi-smooth at p ∗ and that all V p ∗ ∈ ∂ C X ( p ∗ ) are invertible. Then the following statements hold:(i) There exist a > and δ > such that, for every p ∈ B δ ( p ∗ ) and a k ≤ a , the sequence ( p k ) k ≥ generated by algorithm 2 is well-defined, is contained in B δ ( p ∗ ) and convergesQ-linearly to the solution p ∗ .(ii) If the sequence ( p k ) k ≥ generated by algorithm 2 converges to the solution p ∗ and further (cid:107) r k (cid:107) ( p k ) ∈ o (cid:0) (cid:107) X ( p k ) (cid:107) ( p k ) (cid:1) , then the rate of convergence is Q -superlinear.(iii) If the sequence ( p k ) k ≥ generated by algorithm 2 converges to the solution p ∗ , X is µ -order semi-smooth at p ∗ , and (cid:107) r k (cid:107) ( p k ) ∈ O (cid:16) (cid:107) X ( p k ) (cid:107) µ ( p k ) (cid:17) , then the rate of convergenceis of Q -order µ .Proof. (i) Let K p ∗ be as defined in definition 3.1 and let r p ∗ be the injectivity radius. Since X is locally Lipschitz, there exist constants ˆ δ > L such that, for all p ∈ B ˆ δ ( p ∗ ) , (cid:107) X ( p ) (cid:107) p = (cid:107)P p ∗ → p X ( p ∗ ) − X ( p ) (cid:107) p ≤ Ld M ( p, p ∗ ) . (38)The equality holds since X ( p ∗ ) = 0 and parallel transport is linear.Now, since all V p ∗ ∈ ∂ C X ( p ∗ ) are invertible at p ∗ ∈ M by assumption, we can take λ p ∗ ≥ max {(cid:107) V − p ∗ (cid:107) p ∗ : V p ∗ ∈ ∂ C X ( p ∗ ) } . Furthermore, take a < λ p ∗ LK p ∗ , choose a k ≤ a ∀ k ∈ N and (cid:15) satisfying (cid:15)λ p ∗ (1 + K p ∗ ) < − aλ p ∗ LK p ∗ . As (cid:15)λ p ∗ <
1, by lemma 3.4 we can find a0 < δ < min { ˆ δ, r p ∗ } such that, for all p ∈ B δ ( p ∗ ) and V p ∈ ∂ C X ( p ) , (cid:107) V − p (cid:107) p ≤ λ p ∗ − (cid:15)λ p ∗ . (39)From the semi-smoothness of X , (cid:107) X ( p ∗ ) − P p → p ∗ (cid:2) X ( p ) + V p log p p ∗ (cid:3) (cid:107) ( p ∗ ) ≤ (cid:15)d M ( p, p ∗ ) (40)holds according to eq. (33).We now show that for our chosen δ the Newton iteration is well-defined. Let k ∈ N andassume that p k ∈ B δ ( p ∗ ). Let d k be such that (cid:107) V p k d k + X ( p k ) (cid:107) ( p k ) ≤ a k (cid:107) X ( p k ) (cid:107) ( p k ) . (41)12hen (cid:107) log p k p ∗ − d k (cid:107) ( p k ) = (cid:107) log p k p ∗ + V − p k X ( p k ) − V − p k ( V p k d k + X ( p k )) (cid:107) ( p k ) (42) ≤ (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) + (cid:107) V − p k (cid:107) ( p k ) (cid:107) V p k d k + X ( p k ) (cid:107) ( p k ) (43) eq. (41) ≤ (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) + a k (cid:107) V − p k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) ( p k ) . (44)Since X ( p ∗ ) = 0 and parallel transport is an isometry, we see that (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) = (cid:107) V − p k (cid:0) V p k log p k p ∗ + X ( p k ) (cid:1) (cid:107) ( p k ) (45) ≤ (cid:107) V − p k (cid:107) ( p k ) (cid:107)P p k → p ∗ (cid:0) X ( p k ) + V p k log p k p ∗ (cid:1) (cid:107) ( p ∗ ) (46) ≤ (cid:107) V − p k (cid:107) ( p k ) (cid:107) X ( p ∗ ) − P p k → p ∗ (cid:0) X ( p k ) + V p k log p k p ∗ (cid:1) (cid:107) ( p ∗ ) . (47)Substituting eq. (47) back into eq. (44), we find (cid:107) log p k p ∗ − d k (cid:107) ( p k ) (48) ≤ (cid:107) V − p k (cid:107) ( p k ) (cid:0) (cid:107) X ( p ∗ ) − P p k → p ∗ (cid:0) X ( p k ) + V p k log p k p ∗ (cid:1) (cid:107) ( p ∗ ) + a k (cid:107) X ( p k ) (cid:107) ( p k ) (cid:1) (49) eq. (39) ,eq. (40) ,eq. (38) ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15)d M ( p k , p ∗ ) + a k Ld M ( p k , p ∗ )) (50) a k ≤ a ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) d M ( p k , p ∗ ) . (51)Now note that, since K p ∗ ≥ λ p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) ≤ λ p ∗ K p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) . (52)For our choice of a and (cid:15), we find (cid:15)λ p ∗ (1 + K p ∗ ) < − aλ p ∗ LK p ∗ (53) ⇔ (cid:15)λ p ∗ + (cid:15)λ p ∗ K p ∗ + aλ p ∗ LK p ∗ < ⇔ λ p ∗ K p ∗ ( (cid:15) + aL ) < − (cid:15)λ p ∗ (55) ⇔ λ p ∗ K p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) < . (56)Since d M ( p k , p ∗ ) < δ , we obtain from combining eq. (51), eq. (52) and eq. (56) (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) < d M ( p k , p ∗ ) < δ ≤ r p ∗ . (57)Moreover, we have (cid:107) log p k p ∗ (cid:107) ( p k ) = d M ( p k , p ∗ ) ≤ r p ∗ . Hence, we find (see definition 3.1) d M (exp p k ( d k ) , p ∗ ) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) (58)and observe d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) algorithm 2 = d M (exp p k ( d k ) , p ∗ ) d M ( p k , p ∗ ) eq. (58) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) d M ( p k , p ∗ ) (59) eq. (51) ≤ λ p ∗ K p ∗ − (cid:15)λ p ∗ ( (cid:15) + aL ) eq. (56) < . (60)13rom this result we conclude by induction that if we choose p ∈ B δ ( p ∗ ) as in the assumption,we have p k ∈ B δ ( p ∗ ) ∀ k ∈ N and convergence is Q-linear.(ii) The second part is very similar. Let K p ∗ , r p ∗ , and ˆ δ with corresponding L as before.Choose (cid:15) > (cid:15)λ p ∗ (1 + 2 K p ∗ ) < < δ < min { ˆ δ, r p ∗ } such that eq. (39) andeq. (40) hold. 
Due to the assumption (cid:107) r k (cid:107) ( p k ) ∈ o (cid:0) (cid:107) X ( p k ) (cid:107) ( p k ) (cid:1) , the assumption that p k → p ∗ ,and that X is continuous, we have that (cid:107) X ( p k ) (cid:107) ( p k ) →
0. Moreover for large enough k , (cid:107) r k (cid:107) ( p k ) < (cid:15)δ. (61)Because of the convergence assumption p k → p ∗ , we also have for large k that d M ( p k , p ∗ ) < δ .Consequently, using the similar steps that lead to eq. (51) in the proof of (i), we see (cid:107) log p k p ∗ − d k (cid:107) ( p k ) ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15)d M ( p k , p ∗ ) + (cid:107) r k (cid:107) ( p k ) ) (62) ≤ (cid:15)λ p ∗ − (cid:15)λ p ∗ δ ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ δ. (63)For our choice of (cid:15) we find (cid:15)λ p ∗ (1 + 2 K p ∗ ) < ⇔ (cid:15)λ p ∗ K p ∗ < − (cid:15)λ p ∗ (65) ⇔ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ < . (66)Since d M ( p k , p ∗ ) < δ , we obtain from combining eq. (63) and eq. (66) (cid:107) log p k p ∗ + V − p k X ( p k ) (cid:107) ( p k ) < δ ≤ r p ∗ . (67)Again, we have (cid:107) log p k p ∗ (cid:107) ( p k ) = d M ( p k , p ∗ ) ≤ r p ∗ , which leads to (see definition 3.1) d M (exp p k ( d k ) , p ∗ ) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) . (68)Finally we see that, for large k , d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) algorithm 2 = d M (exp p k ( d k ) , p ∗ ) d M ( p k , p ∗ ) eq. (68) ≤ K p ∗ (cid:107) log p k p ∗ − d k (cid:107) ( p k ) d M ( p k , p ∗ ) (69) eq. (62) ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ + λ p ∗ K p ∗ − (cid:15)λ p ∗ (cid:107) r k (cid:107) ( p k ) d M ( p k , p ∗ ) . (70)By (cid:107) X ( p k ) (cid:107) ( p k ) ≤ Ld M ( p k , p ∗ ), we can continue ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ + 1 L λ p ∗ K p ∗ − (cid:15)λ p ∗ (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) ( p k ) . (71)Note that this result holds for all (arbitrarily small) (cid:15) > (cid:15)λ p ∗ (1 + 2 K p ∗ ) <
1. Hence,we can focus solely on the residual term. Then, by the assumption (cid:107) r k (cid:107) ( p k ) ∈ o ( (cid:107) X ( p k ) (cid:107) ( p k ) ),lim k →∞ d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) ≤ lim k →∞ λ p ∗ K p ∗ L (1 − (cid:15)λ p ∗ ) (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) ( p k ) = 0 , (72)14nd we conclude that the convergence is superlinear.(iii) Again this is very similar to the previous cases. Let µ denote the order of the semi-smoothness and let K p ∗ , r p ∗ , and ˆ δ with corresponding L as before. For (cid:15) > (cid:15)λ p ∗ <
1, choose 0 < δ < min { ˆ δ, r p ∗ } satisfying (cid:15)λ p ∗ (1 + 2 δ µ K p ∗ ) < (cid:13)(cid:13) X ( p ∗ ) − P p → p ∗ (cid:2) X ( p ) + V p exp − p p ∗ (cid:3)(cid:13)(cid:13) ( p ∗ ) ≤ (cid:15)d M ( p, p ∗ ) µ (73)hold. Then for large enough k we have by the same reasoning as for establishing eq. (61), that (cid:107) r k (cid:107) ( p k ) < (cid:15)δ µ . (74)Similar to eq. (66), (cid:107) log p k p ∗ − d k (cid:107) ( p k ) ≤ λ p ∗ − (cid:15)λ p ∗ ( (cid:15)d M ( p k , p ∗ ) µ + (cid:107) r k (cid:107) ( p k ) ) (75) ≤ (cid:15)λ p ∗ δ µ − (cid:15)λ p ∗ δ ≤ (cid:15)λ p ∗ δ µ K p ∗ − (cid:15)λ p ∗ δ (76)follows, and for our choice of (cid:15) and δ we find (cid:15)λ p ∗ (1 + 2 δ µ K p ∗ ) < ⇔ (cid:15)λ p ∗ δ µ K p ∗ − (cid:15)λ p ∗ < . (77)Using similar arguments as for establishing eq. (71) in (ii), we obtain d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) ≤ (cid:15)λ p ∗ K p ∗ − (cid:15)λ p ∗ + 1 L λ p ∗ K p ∗ − (cid:15)λ p ∗ (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) µ ( p k ) . (78)As (cid:15) can be arbitrarily small, we can again focus on the second term as in (ii). Finally, we seeby the assumption (cid:107) r k (cid:107) ( p k ) = O ( (cid:107) X ( p k ) (cid:107) µ ( p k ) ) thatlim k →∞ d M ( p k +1 , p ∗ ) d M ( p k , p ∗ ) µ ≤ lim k →∞ λ p ∗ K p ∗ L (1 − (cid:15)λ p ∗ ) (cid:107) r k (cid:107) ( p k ) (cid:107) X ( p k ) (cid:107) µ ( p k ) =: M (79)for some M >
0. Therefore we conclude that the convergence is of Q-order 1 + µ .From (ii) we obtain the following practical condition if one can ensure that the relative errorin solving the Newton system approaches zero. Corollary 3.5.1.
If the sequence ( p k ) k ≥ generated by algorithm 2 converges to the solution p ∗ and the sequence { a k } converges to zero, then the rate of convergence is Q -superlinear. Remark 3.6.
In the real-valued case (ii) and (iii) are formulated stronger: if a convergentsequence exists, the converse statements in (ii) and (iii) also hold. However, the proof in [50]relies heavily on the linearity of R d and is therefore much harder to translate to the manifoldcase. This remains an open problem. Before moving on to applications, we will elaborate on some key observations in the theory of the(inexact) RSSN method. In particular, we can make the following observations regarding the15onvergence behavior of both algorithms. By a continuity argument, we see that the minimalradius of convergence is determined by (cid:15) . This (cid:15) must satisfy the inequality (that is for a k = 0) (cid:15)λ p ∗ (1 + K p ∗ ) < . (80)In other words, a too small (cid:15) can give trouble in getting RSSN to work.We see that we run into trouble in two cases: a large λ p ∗ and a large K p ∗ correspondence.This comes down to the following scenarios. The generalized covariant derivative is close to singular
Here a large λ p ∗ onlyadmits a very small (cid:15) , which in turn results in a small convergence region. The manifold has negative curvature
For negatively-curved manifolds we do not havean a priori estimate for K p ∗ . If this value becomes arbitrarily large, we cannot expect a largeregion of convergence. We will now merge the ideas of Riemannian Semi-smooth Newton (RSSN) and Fenchel dualitytheory on manifolds in order to solve the original probleminf p ∈M { F ( p ) + G (Λ( p )) } . (81)Using linearization, we saw in section 2.2 that solving eq. (81) can be approximated by solvinginf p ∈M sup ξ n ∈T ∗ n N { F ( p ) + (cid:104) ξ n , D m Λ [log m p ] (cid:105) − G ∗ n ( ξ n ) } , (82)which is characterized by the optimality conditions p = prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (83) ξ n = prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) , (84)where σ, τ >
0. In this section we focus on rewriting this system of non-linear equations into aform that is amenable to the Riemannian Semi-smooth Newton method, i.e., into the problemof finding a zero of a vector field.While the dual variable ξ n lives in a vector space and (84) immediately translates into ξ n − prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) = 0 , (85)the primal optimality condition (83) is an equation in M . In order to obtain a vector fieldform, we apply the logarithm, which yields elements from the tangent bundle T M : − log p prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) = 0 . (86)Such a zero implies eq. (83). Then, we define the vector field X : M × T ∗ n N → T M × T ∗ n N as X ( p, ξ n ) := − log p prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) ξ n − prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) . (87)16hich allows to apply RSSN. We refer to this approach as Primal-Dual Riemannian Semi-smooth Newton (PD-RSSN).
Remark 4.1.
Here it becomes obvious why first transforming the set-valued equivalent opti-mality conditions in eqs. (24) and (25) into the prox-based equality form is crucial: we can now differentiate X in order to apply the Newton-based RSSN method. If the vector field X in eq. (87) is smooth, we obtain the covariant derivative [46, Sec. 6.4.2]: ∇ Y p + η ξn X = −∇ Y p log ( · ) f ( p, ξ n ) − D p log p f ( · , ξ n )[ Y p ] − D ξ n log p f ( p, · )[ η ξ n ] − D p f ( · , ξ n )[ Y p ] + η ξ n − ∇ η ξn f ( p, · ) , (88)where f ( p, ξ n ) := prox σF (cid:16) exp p (cid:16) P m → p (cid:0) − σ ( D m Λ) ∗ (cid:2) P n → Λ( m ) ξ n (cid:3)(cid:1) (cid:93) (cid:17)(cid:17) , (89) f ( p, ξ n ) := prox τG ∗ n (cid:16) ξ n + τ (cid:0) P Λ( m ) → n D m Λ [log m p ] (cid:1) (cid:91) (cid:17) . (90)However, we generally cannot assume smoothness of X and therefore require a generalizationof the differential. First, consider the following generalization of Rademacher’s theorem. Theorem 4.2.
Let M and N be smooth manifolds. If F : M → N is a locally Lipschitzcontinuous function, then F is almost everywhere differentiable on M .Proof. The proof is the same as [45, Thm. 10] with a general manifold N instead of T M .The generalized differential can now be defined as follows.
Definition 4.3 (Clarke generalized differential) . The
Clarke generalized differential D C F ofa locally Lipschitz continuous function F : M → N at p ∈ M is defined as the set-valuedmapping D C F ( p ) : T p M → T F ( p ) N , D C F ( p ) := co (cid:26) V ∈ L (cid:0) T p M , T F ( p ) N (cid:1) : ∃ { p k } ⊂ D F , lim k →∞ p k = p, V = lim k →∞ D p k F (cid:27) , (91) where D F ⊂ M is the set on which F is differentiable, “co” represents the convex hull and L (cid:0) T p M , T F ( p ) N (cid:1) denotes the vector space consisting of all bounded linear operator from T p M to T F ( p ) N . In block notation, the generalized covariant derivative of X is of the form ∂ C X ( p, ξ n ) = (cid:20) − ∂ C (log ( · ) f ( p, ξ n ))( p ) − D C (log p f ( · , ξ n ))( p ) − D C (log p f ( p, · ))( ξ n ) − D C ( f ( · , ξ n ))( p ) I − ∂ C ( f ( p, · ))( ξ n ) (cid:21) . (92)Under the special assumption that the manifold of interest is symmetric , the four componentsin eq. (92) can be computed explicitly up to the differentials of the proximal maps using Jacobifields [87, Lemma 2.3]. For the technical details we refer to the first author’s thesis [46, Sec.6.4.3 and 6.4.4]. For typical imaging applications this is convenient, as many typical manifoldsof interest, in particular S n and P ( n ), are symmetric. Remark 4.4.
Semi-smoothness of X does not follow directly from the proposed construction.In general it is problem-specific but relatively straightforward to prove, as it can be shown thatpiecewise smooth functions are semi-smooth. Application to (cid:96) -TV-like Functionals We consider the classical Rudin-Osher-Fatemi denoising problem, extended to the manifold-valued setting: Let M be a Riemannian manifold, d , d ∈ N be the dimensions of the image,and h ∈ M d × d be the noisy input data. We are interested in solving the isotropic ( q = 2)and anisotropic ( q = 1) discrete ROF model [20]inf p ∈M d × d α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:107) T ( p ) (cid:107) p,q, , (93)where α > T : M d × d → T M d × d × is the non-linear finite difference operator( T ( p )) i,j,k := ∈ T p i,j M if i = d and k = 10 ∈ T p i,j M if j = d and k = 2log p i,j p i +1 ,j ∈ T p i,j M if i < d and k = 1log p i,j p i,j +1 ∈ T p i,j M if j < d and k = 2 (94)and with the norm defined as (cid:107) T ( p ) (cid:107) p,q, := d ,d (cid:88) i,j =1 ( (cid:107) ( T ( p )) i,j, (cid:107) qp i,j + (cid:107) ( T ( p )) i,j, (cid:107) qp i,j ) q . (95)The first step is to apply Fenchel duality theory in order to rewrite model eq. (93) into anappropriate linearized saddle-point form with corresponding optimality conditions.First, as T ( p ) ∈ T p M d × d × T p M d × d ∼ = T p,p M d × d × , we write the base point ( p, p ) of T ( p ) simply as p , and consequently write T p M d × d × for T ( p ) as well. Then, using duality ofthe (cid:107) · (cid:107) p,q, norm and the fact that parallel transport is an isometry, we can choose m ∈ M and write (cid:107) T ( p ) (cid:107) p,q, = (cid:107) P p → m T ( p ) (cid:107) m,q, = sup η m ∈ T ∗ m M d × d × (cid:104) η m , P p → m T ( p ) (cid:105) − ι B q ∗ ( η m ) , (96)where q ∗ such that q + q ∗ = 1, B q ∗ := (cid:8) ν m ∈ T ∗ m M d × d × | (cid:107) ν m (cid:107) m,q ∗ , ∞ ≤ (cid:9) (97)= (cid:8) ν m ∈ T ∗ m M d × d × | max (cid:107) ( ν m ) i,j, : (cid:107) m,q ∗ ≤ (cid:9) , (98)and ι B q ∗ ( η m ) := (cid:26) η m ∈ B q ∗ , ∞ if η m / ∈ B q ∗ . (99)Now we choose n := 0 ∈ T m M d × d × as the zero vector. We can write (cid:104) η m , P p → m T ( p ) (cid:105) = (cid:104) η m , P p → m T ( p ) − n (cid:105) . (100)Remember that we can decompose the tangent bundle tangent space into a vector part η m anda point part ζ m . Let ξ n := ( ζ m , η m ) ∈ T ∗ m M d × d × × T ∗ m M d × d × ∼ = T ∗ n T M d × d × , then (cid:104) η m , P p → m T ( p ) − n (cid:105) = sup ζ m ∈ T ∗ m M d × d × (cid:104) ( ζ m , η m ) , (log m p, P p → m T ( p ) − n ) (cid:105) − ι { } ( ζ m ) (101)= sup ζ m ∈ T ∗ m M d × d × (cid:104) ( ζ m , η m ) , log n T ( p ) (cid:105) − ι { } ( ζ m ) , (102)18here we used the definition of the logarithmic map on the tangent bundle in the last step.Finally bringing everything together, we find a saddle-point problem in the desired form:inf p ∈M d × d sup ξ n ∈T ∗ n T M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:104) ξ n , log n T ( p ) (cid:105) n − ι { }× B q ∗ ( ξ n ) . (103)Choosing m such that T ( m ) = n , the corresponding linearized saddle-point problem isinf p ∈M d × d sup ξ n ∈T ∗ n T M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:104) ξ n , D m T [log m p ] (cid:105) − ι { }× B q ∗ ( ξ n ) , (104)on which we can now apply the PD-RSSN method as in the previous chapter. Finally, it remainsto find values for m such that both T ( m ) = n and n = 0 hold. We must have m ij = ˜ m for i = 1 , . . . , d and j = 1 , . . . 
, d (105)for some fixed ˜ m ∈ M . Adaptations to the optimization problem
There is one serious numerical issue withsolving eq. (104) using an PD-RSSN approach: the generalized covariant derivative can benon-invertible. For the real-valued case this can be solved by adding dual regularization, whichwill also be the strategy for the manifold case: We add a quadratic penalty term for the duals ξ n , inf p ∈M d × d sup ξ n ∈T ∗ n T M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:104) ξ n , D m T [log m p ] (cid:105) − ι { }× B q ∗ ( ξ n ) − β (cid:107) ξ n (cid:107) n , (106)where 0 < β (cid:28) ξ n will be zero to get a smaller matrix representation of the generalized covariantderivative, i.e., we solveinf p ∈M d × d sup ξ m ∈T ∗ m M d × d × α d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) + (cid:10) ξ m , ∇ ( · ) T [log m p ] (cid:11) − ι B q ∗ ( ξ m ) − β (cid:107) ξ m (cid:107) m (107)where we use that the value of the vector part of the differential of T is that of the covariantderivative. We obtain the reduced optimality conditions p = prox σF (cid:18) exp p (cid:18) P m → p (cid:16) − σ (cid:0) ∇ ( · ) T (cid:1) ∗ [ ξ m ] (cid:17) (cid:93) (cid:19)(cid:19) , (108) ξ m = prox τG ∗ m,q (cid:16) ξ m + τ (cid:0) ∇ ( · ) T [log m p ] (cid:1) (cid:91) (cid:17) , (109)and the reduced vector field X : M d × d × T ∗ m M d × d × → T p M d × d × T ∗ m M d × d × X ( p, ξ m ) := − log p prox σF (cid:18) exp p (cid:18) P m → p (cid:16) − σ (cid:0) ∇ ( · ) T (cid:1) ∗ [ ξ m ] (cid:17) (cid:93) (cid:19)(cid:19) ξ m − prox τG ∗ m,q (cid:16) ξ n + τ (cid:0) ∇ ( · ) T [log m p ] (cid:1) (cid:91) (cid:17) , (110)19here F ( p ) := d ,d (cid:88) i,j =1 d M ( p i,j , h i,j ) and G ∗ m,q ( ξ m ) := ι B q ∗ ( ξ m ) + β (cid:107) ξ m (cid:107) m . (111)For the expressions of the proximal maps and the differentials we again refer the reader to thethesis [46, Sec. 6.5]; the differential of T and its adjoint can be found in [20, Sec. 5] In this section we will explore the behavior of the Riemannian Semi-smooth Newton methodin several numerical experiments with the ROF model on both the S and P (3) manifold.This choice of manifolds allows to observe the behavior on both positively as well as negativelycurved spaces.In order to verify basic performance of the complete method, we first consider a 1D prob-lem with known minimizer. Then we provide a detailed runtime analysis and comparison tolRCPA [20]. Finally, we numerically validate the convergence rates predicted by theorem 3.5.Throughout this section we will use the relative (residual) error (cid:15) krel := (cid:107) X ( p k , ξ km ) (cid:107) ( p k ,ξ km ) (cid:107) X ( p , ξ m ) (cid:107) ( p ,ξ m ) (112)as measure for convergence.The proposed algorithms were implemented in Julia version 1.3.0 using Manopt.jl [16]and evaluated on a 2.4 GHz Intel Core i7 with 8 GB RAM. The source code will be madeavailable on GitHub.
In this first experiment we investigate the progression of the relative error as well as the distanceto the exact solution for a problem with known minimizer. We consider the 1-dimensionalpiecewise constant signal from [20]: h ∈ M (cid:96) h i := (cid:26) ˆ p if i ≤ (cid:96) ˆ p if i > (cid:96) . (113)For the (cid:96) -TV problem as in eq. (93) with this signal, the exact minimizer is known: for α > p , ˆ p ∈ M , the minimizer p ∗ is given by p ∗ i := (cid:26) p ∗ if i ≤ (cid:96)p ∗ if i > (cid:96) , (114)where p ∗ = γ ˆ p , ˆ p ( δ ) , p ∗ = γ ˆ p , ˆ p ( δ ) , and δ = min (cid:26) , α(cid:96) d M (ˆ p , ˆ p ) (cid:27) . (115)A proof can be found in [46, Appendix 2]Furthermore, since d = 1, the operator T reduces to the 1-dimensional non-linear differenceoperator, i.e., we have T : M (cid:96) → T M (cid:96) . This also means that the isotropic ( q = 2) andanisotropic ( q = 1) cases reduce to the same functional.For our tests we set β = 0, (cid:96) = 10, α = 5 , and σ = τ = . For the stopping criterion, we used (cid:15) rel = 10 − . Furthermore, for a given primal base point m ∈ M we chose n = T ( m ) ∈ T M (cid:96) n is the zero vector, it is enoughto choose m i the same in every grid point.For both manifolds we investigated the influence of the starting point. For the “cold start”we set p := h , ξ n := 0. For the “warm start” we again set p := h , but modified the dual startto ξ n := prox τG ∗ m, (cid:16) ξ n + τ (cid:0) ∇ ( · ) T (cid:2) log m p (cid:3)(cid:1) (cid:91) (cid:17) , the result of a single lRCPA dual step.In the following we denote by ˜ p the solution obtained from our method. The distance of ˜ p to the exact solution p ∗ are summarized in table 1.Cold start Warm start M d M ( p ∗ , ˜ p ) d M ( p ∗ , ˜ p ) S P (3) 2.1286297e-13 Table 1: Distance between the exact minimizer p ∗ of the 1D piecewise constant (cid:96) -TV problemand the final iterate ˜ p of our method on the S and the P (3) manifold. Using a warm start witha rough estimate for primal and dual variables, the result is very close to the exact minimizer. Case 1: M = S We choseˆ p := 1 √ , , (cid:62) , ˆ p := 1 √ , − , (cid:62) (116)and m i := (1 , , (cid:62) for all i = 1 , ..., (cid:96) .With a cold start, PD-RSSN gets stuck in a local minimum and is terminated after 50iterations. With the warm start, convergence to the solution of the linearized system is achievedafter only two iterations in 0.469 seconds total. As it turns out this is the same point as theROF minimizer, i.e., d M ( p ∗ , ˜ p ) = 0 as well. Case 2: M = P (3) Here we choseˆ p := exp I (cid:18) (cid:107) X (cid:107) I X (cid:19) , ˆ p := exp I (cid:18) − (cid:107) X (cid:107) I X (cid:19) , with X := , (117)where X ∈ T I P (3) and I is the identity matrix, and pick m i := I for all i = 1 , ..., (cid:96) .In this setting, we observed no difference between warm start and cold start (table 1). Withthe cold start, PD-RSSN converges in 1 iteration in 4.61 seconds at a distance of d M ( p ∗ , p l ) =2 . · − from the exact solution. With the warm start, PD-RSSN converges after 1 iterationin 4.202 seconds at an almost identical distance of d M ( p ∗ , p l ) = 1 . · − . (cid:96) -TV In this experiment we aim to compare the runtime performance of our PD-RSSN-based methodto that of the lRCPA algorithm at different accuracies. In contrast to the previous section, wefocus on the more realistic 2D setting. 
ℓ²-TV

In this experiment we aim to compare the runtime performance of our PD-RSSN-based method to that of the lRCPA algorithm at different accuracies. In contrast to the previous section, we focus on the more realistic 2D setting. We chose a dual regularization strength of \beta = 10^{-…}, as numerical experience showed that the generalized covariant derivative for both S² and P(3) can become singular without dual regularization.

[Figure 2: Results and progression of the relative error and \ell^2-TV cost for the proposed PD-RSSN method for solving an S²-valued problem with known minimizer. Panels: (a) input; (b) exact minimizer; (c) cold-start PD-RSSN solution; (d) warm-start PD-RSSN solution; (e) progression of the cold-start relative error; (f) progression of the cold-start \ell^2-TV cost. With a cold start, PD-RSSN gets stuck in a local minimum. With a warm-start strategy, the algorithm converges superlinearly to the exact minimizer.]

For the S² problem, we used a 20 × 20 artificial S² rotations image from [9], which is implemented in Manopt.jl, with a half rotation around each axis. For the P(3) problem, we used a 10 × 10 artificial P(3) image from [22], which is also implemented in Manopt.jl. Although lRCPA is not guaranteed to converge on positively curved manifolds, it performed well in recent work [20]. The expectation is for lRCPA to be faster at the start but to suffer from slow tail convergence. Hence, our method should give better performance for higher-accuracy solutions.
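The stabilizing effect of \beta can be read off the proximal map of G^*_{m,q} from (111): the quadratic term turns the plain projection onto the dual ball into a scaled projection, since prox_{\tau G^*}(\eta) = proj_B(\eta / (1 + \tau\beta)) by an elementary computation. A minimal sketch for the isotropic case q = 2, assuming (our assumptions, not the paper's code) that the cotangent vector at one pixel is represented as a plain coefficient vector and that the dual ball has radius 1:

    using LinearAlgebra

    # prox of τ·G* with G*(η) = ι_B(η) + (β/2)‖η‖²: scale, then project onto B.
    function prox_dual(η::AbstractVector, τ, β)
        ζ = η ./ (1 + τ * β)       # the quadratic penalty shrinks the dual variable
        n = norm(ζ)
        return n > 1 ? ζ ./ n : ζ  # projection onto the unit ball
    end

Heuristically, the \beta-term contributes a multiple of the identity to the dual block of the generalized derivative, which is what counteracts the singularity observed above.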
Initialization
For our numerical experiment, we measured the (CPU) runtime until the algorithms achieved \varepsilon_{\mathrm{rel}} \in \{10^{-…}, 10^{-…}, 10^{-…}\}. With a cold start, and even with a dual warm start, our method diverges, as can be expected from local methods. Therefore, we prepended lRCPA pre-steps with a coarse stopping criterion of \varepsilon_{\mathrm{rel}} = 1/….

Case 1: M = S²  We computed an isotropically regularized (q = 2) \ell^2-TV solution on the above data with \alpha = 1.5. We set m_{i,j} := (0, 0, 1)^\top for all i, j = 1, ..., 20, so that again n is the zero vector. We initialized both PD-RSSN and lRCPA with \sigma = \tau = 0.35. For lRCPA we used \gamma = 0.….

[Figure 3: Results of the proposed PD-RSSN method for the P(3)-valued problem with known minimizer (compare fig. 2). Panels: (a) input; (b) exact minimizer; (c) cold-start PD-RSSN solution; (d) warm-start PD-RSSN solution; (e) progression of the cold-start relative error; (f) progression of the cold-start \ell^2-TV cost. The method converges superlinearly to the exact minimizer for both cold and warm start.]

The pre-steps took 2.437 seconds. The resulting runtimes are shown in table 2. The solutions of lRCPA and PD-RSSN at \varepsilon_{\mathrm{rel}} = 10^{-…}, along with the progress of the relative error and the isotropic \ell^2-TV cost, are shown in fig. 4.

Method      \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}
            Time (s)   It.                          Time (s)   It.                          Time (s)   It.
lRCPA       …          697                          608.781    2886                         …          …
PD-RSSN     330.422    10                           346.187    11                           …          …

Table 2: Runtimes (with \beta = 10^{-…}) until achieving the relative errors \varepsilon_{\mathrm{rel}} for the two-dimensional S² problem. For high-accuracy solutions the proposed higher-order PD-RSSN method outperforms the first-order lRCPA algorithm with respect to runtime on a manifold with positive curvature.

The results confirm what we expect from a higher-order method: for higher accuracies with relative error \varepsilon_{\mathrm{rel}} smaller than approximately 10^{-…}, PD-RSSN clearly outperforms lRCPA. Furthermore, we see the superlinear convergence more clearly than in the 1D case.

Case 2: M = P(3)  Again we solved the isotropically regularized \ell^2-TV problem, with \alpha = 0.… and m_{i,j} := I for all i, j = 1, ..., 10, so that n is the zero vector. We initialized both PD-RSSN and lRCPA with \sigma = \tau = 0.4. For lRCPA we used \gamma = 0.….

[Figure 4: Runtime behavior of lRCPA and the proposed higher-order PD-RSSN method for an S²-valued problem. Panels: (a) noisy input; (b) result; (c) progression of the relative error; (d) progression of the isotropic \ell^2-TV cost. The lRCPA method exhibits slow tail convergence as is typical for first-order methods. After a few first-order pre-steps (shown in green), PD-RSSN converges superlinearly within 13 iterations to an optimal solution.]

The solutions returned by lRCPA and PD-RSSN at \varepsilon_{\mathrm{rel}} = 10^{-…}, along with the error progress and the (isotropic) \ell^2-TV cost, are shown in fig. 5. We note that lRCPA effectively stalled before reaching the relative error of 10^{-…} and was terminated after 2500 iterations.

As for the S² example, for higher accuracies PD-RSSN clearly outperforms lRCPA. The latter even seems unable to reach such high accuracies in reasonable time, which aligns with the fact that first-order methods generally show sublinear convergence. For PD-RSSN, we again observe superlinear convergence.
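Schematically, the pipeline used in both 2D experiments (first-order lRCPA pre-steps to a coarse tolerance, then PD-RSSN steps to the target tolerance) can be summarized in a small driver loop. This is a sketch under our own naming, with hypothetical step functions lrcpa_step! and rssn_step!, not the paper's actual code:

    # Run lRCPA until a coarse tolerance, then PD-RSSN to the target tolerance.
    function solve_with_warmstart!(state; ε_coarse, ε_target, relerr,
                                   lrcpa_step!, rssn_step!, kmax = 10_000)
        k = 0
        while relerr(state) > ε_coarse && k < kmax   # first-order pre-steps
            lrcpa_step!(state); k += 1
        end
        while relerr(state) > ε_target && k < kmax   # locally superlinear Newton phase
            rssn_step!(state); k += 1
        end
        return state, k
    end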
It is interesting to see if the convergence rates for the proposed inexact method in algorithm 2, as predicted by theorem 3.5, hold in practice. In particular, we expect linear convergence for a constant relative residual (a_k constant) and superlinear convergence for a_k \to 0. Given the similar behavior of the methods on positively and negatively curved manifolds in the previous experiments, we only consider the S² manifold.

Method      \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}    \varepsilon_{\mathrm{rel}} = 10^{-…}
            Time (s)   It.                          Time (s)   It.                          Time (s)   It.
lRCPA       …          97                           1213.327   …                            …          ≥ 2500
PD-RSSN     …          …                            …          …                            …          …

Table 3: Runtimes until achieving the relative errors \varepsilon_{\mathrm{rel}} for the two-dimensional P(3) problem. As in the positively-curved S² case (table 2), the proposed PD-RSSN algorithm outperforms lRCPA at higher accuracies on this negatively-curved manifold. lRCPA was terminated after 2500 iterations before achieving \varepsilon_{\mathrm{rel}} = 10^{-…}.

[Figure 5: Results on the negatively-curved manifold P(3) (compare fig. 4); first-order pre-steps are shown in green. Panels: (a) noisy input; (b) result; (c) progression of the relative error; (d) progression of the isotropic \ell^2-TV cost. The proposed PD-RSSN method converges superlinearly within 12 iterations after the pre-steps, while again lRCPA suffers from slow tail convergence.]

We used \ell^2-TV to denoise Bernoulli's Lemniscate [9], which is a figure-8 on the 2-sphere. Note that this is once again a 1D problem. To this end, we sampled 128 S²-valued points on the Lemniscate curve and distorted them with Gaussian noise with variance \delta = 0.01. We employed the \ell^2-TV model with \alpha = 0.… and the base point m set to the intersection point of the Lemniscate:

m_i := \tfrac{1}{\sqrt{2}}(…)^\top, \quad \text{for } i = 1, \dots, 128.   (118)

We did not use dual regularization, i.e., \beta = 0. Again we performed lRCPA pre-steps with \sigma = \tau = 0.…, \gamma = 0.2, and \varepsilon_{\mathrm{rel}} = 10^{-…} in order to warm-start PD-IRSSN. In order to simulate inexact solution of the Newton steps, we defined three different progressions of the relative residual,

a_k^1 := 0, \quad a_k^2 := \tfrac{1}{5}, \quad a_k^3 := \tfrac{1}{5k},   (119)

and randomly chose the corresponding absolute residual as

r_i^k := a_i^k \, \|X(p^k, \xi_n^k)\|_{(p^k, \xi_n^k)} \, U_{(p^k, \xi_n^k)}, \quad \text{for } i = 1, 2, 3,   (120)

where U_{(p^k, \xi_n^k)} is a tangent vector drawn from a normal distribution with unit covariance matrix. We then solved (to numerical precision) the modified inexact system for the step d^k:

V_{(p^k, \xi_n^k)}[d^k] = X(p^k, \xi_n^k) + r_i^k.   (121)

This allows us to precisely control the residual and therefore simulate the behavior of PD-IRSSN at different accuracies. From the error progression, we estimated a convergence rate q by

q_k := \log\Big( \tfrac{\|X(p^k, \xi_n^k)\|}{\|X(p^{k-1}, \xi_n^{k-1})\|} \Big) \Big/ \log\Big( \tfrac{\|X(p^{k-1}, \xi_n^{k-1})\|}{\|X(p^{k-2}, \xi_n^{k-2})\|} \Big).   (122)

The results are shown in fig. 6. As predicted by theorem 3.5, for the exact method with zero residual a^1 we obtain clear superlinear convergence. For constant relative residual a^2, convergence is clearly linear with rate q = 1, and for decreasing relative residual a^3, convergence is again superlinear (although hard to judge due to the rapid convergence), in agreement with the theorem.

[Figure 6: Convergence behavior of the Primal-Dual Inexact Riemannian Semi-smooth Newton method (PD-IRSSN) for recovering Bernoulli's Lemniscate from noisy input data. Panels: (a) Bernoulli's Lemniscate (grey), the noisy input (orange) and the result (blue); (b) estimated convergence rates after the pre-steps; (c) progression of the relative error of the lRCPA pre-steps (blue) and the three PD-IRSSN schemes; (d) progression of the \ell^2-TV cost of the lRCPA pre-steps (blue) and the three PD-IRSSN schemes. As predicted by theorem 3.5, exact PD-IRSSN, i.e., PD-RSSN, (a^1) and PD-IRSSN with progressively increasing accuracy (a^3) converge superlinearly. For constant relative accuracy (a^2), we still observe at least linear convergence.]
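To make the rate estimation concrete, the following minimal Julia sketch (our illustration, not the paper's code) implements the three residual schedules from (119) and the estimator (122), operating on a recorded history of residual norms \|X(p^k, \xi_n^k)\|.

    # Residual schedules from (119).
    a1(k) = 0.0        # exact Newton step: superlinear convergence expected
    a2(k) = 1 / 5      # constant relative residual: (at least) linear convergence
    a3(k) = 1 / (5k)   # vanishing relative residual: superlinear convergence

    # Convergence-rate estimate (122) from a history of residual norms;
    # the j-th returned value estimates the order q at iterate j + 2.
    estimate_rates(norms) =
        [log(norms[k] / norms[k-1]) / log(norms[k-1] / norms[k-2]) for k in 3:length(norms)]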
Conclusion

In this work we have proposed the Primal-Dual Riemannian Semi-smooth Newton (PD-RSSN) method as a higher-order alternative for solving non-smooth variational problems on manifolds. Although only locally convergent, the method experimentally performs well on isotropic and anisotropic 1D and 2D \ell^2-TV problems. On both S² and P(3), representing manifolds with positive and negative curvature, we observe superlinear convergence, which allows us to generate high-accuracy solutions. In particular, PD-RSSN can be used to overcome the slow tail convergence that is a common bottleneck of first-order methods.

Moreover, we have proposed an inexact variant of RSSN (IRSSN) with provable local convergence rates. The theoretical results are supported by numerical experiments, in which we observe at least linear convergence in the inexact case, and superlinear convergence when the accuracy is progressively increased.

Acknowledgments

We would like to thank Ronny Bergmann for fruitful discussions regarding differential-geometric interpretations of earlier work and his Julia library Manopt.jl.

References
[1] T. E. Abrudan, J. Eriksson, and V. Koivunen, Steepest descent algorithms for optimization under unitary matrix constraint, IEEE Transactions on Signal Processing, 56 (2008), pp. 1134-1147, https://doi.org/10.1109/TSP.2007.908999.
[2] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Foundations of Computational Mathematics, 7 (2007), pp. 303-330, https://doi.org/10.1007/s10208-005-0179-9.
[3] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2009, https://doi.org/10.1515/9781400830244.
[4] B. L. Adams, S. I. Wright, and K. Kunze, Orientation imaging: the emergence of a new microscopy, Metallurgical Transactions A, 24 (1993), pp. 819-831, https://doi.org/10.1007/BF02656503.
[5] R. L. Adler, J.-P. Dedieu, J. Y. Margulies, M. Martens, and M. Shub, Newton's method on Riemannian manifolds and a geometric model for the human spine, IMA Journal of Numerical Analysis, 22 (2002), pp. 359-390, https://doi.org/10.1093/imanum/22.3.359.
[6] B. Afsari, R. Tron, and R. Vidal, On the convergence of gradient descent for finding the Riemannian center of mass, SIAM Journal on Control and Optimization, 51 (2013), pp. 2230-2260, https://doi.org/10.1137/12086282X.
[7] D. Azagra and J. Ferrera, Proximal calculus on Riemannian manifolds, Mediterranean Journal of Mathematics, 2 (2005), pp. 437-450, https://doi.org/10.1007/s00009-005-0056-4.
[8] M. Bačák, Computing medians and means in Hadamard spaces, SIAM Journal on Optimization, 24 (2014), pp. 1542-1566, https://doi.org/10.1137/140953393.
[9] M. Bačák, R. Bergmann, G. Steidl, and A. Weinmann, A second order nonsmooth variational model for restoring manifold-valued images, SIAM Journal on Scientific Computing, 38 (2016), pp. A567-A597, https://doi.org/10.1137/15M101988X.
[10] S. Banert, Backward-backward splitting in Hadamard spaces, Journal of Mathematical Analysis and Applications, 414 (2014), pp. 656-665, https://doi.org/10.1016/j.jmaa.2014.01.054.
[11] P. J. Basser, J. Mattiello, and D. LeBihan, MR diffusion tensor spectroscopy and imaging, Biophysical Journal, 66 (1994), pp. 259-267, https://doi.org/10.1016/S0006-3495(94)80775-1.
[12] M. Baust, A. Weinmann, M. Wieczorek, T. Lasser, M. Storath, and N. Navab, Combined tensor fitting and TV regularization in diffusion tensor imaging based on a Riemannian manifold approach, IEEE Transactions on Medical Imaging, 35 (2016), pp. 1972-1989, https://doi.org/10.1109/TMI.2016.2528820.
[13] S. Becker and J. Fadili, A quasi-Newton proximal splitting method, in Advances in Neural Information Processing Systems, 2012, pp. 2618-2626, https://doi.org/10.5555/2999325.2999427.
[14] G. Bento, J. Neto, and P. Oliveira, Convergence of inexact descent methods for nonconvex optimization on Riemannian manifolds, arXiv preprint arXiv:1103.4828, (2011), https://arxiv.org/abs/1103.4828.
[15] G. d. C. Bento, J. X. da Cruz Neto, and P. R. Oliveira, A new approach to the proximal point method: convergence on general Riemannian manifolds, Journal of Optimization Theory and Applications, 168 (2016), pp. 743-755, https://doi.org/10.1007/s10957-015-0861-2.
[16] R. Bergmann, Manopt.jl: Optimization on Manifolds in Julia, 2019.
[17] R. Bergmann, R. H. Chan, R. Hielscher, J. Persch, and G. Steidl, Restoration of manifold-valued images by half-quadratic minimization, arXiv preprint arXiv:1505.07029, (2015), https://doi.org/10.3934/ipi.2016001.
[18] R. Bergmann, J. H. Fitschen, J. Persch, and G. Steidl, Infimal convolution coupling of first and second order differences on manifold-valued images, in International Conference on Scale Space and Variational Methods in Computer Vision, Springer, 2017, pp. 447-459, https://doi.org/10.1007/978-3-319-58771-4_36.
[19] R. Bergmann, J. H. Fitschen, J. Persch, and G. Steidl, Priors with coupled first and second order differences for manifold-valued image processing, Journal of Mathematical Imaging and Vision, 60 (2018), pp. 1459-1481, https://doi.org/10.1007/s10851-018-0840-y.
[20] R. Bergmann, R. Herzog, M. S. Louzeiro, D. Tenbrinck, and J. Vidal-Núñez, Fenchel duality theory and a primal-dual algorithm on Riemannian manifolds, arXiv preprint arXiv:1908.02022, (2019), https://arxiv.org/abs/1908.02022v2.
[21] R. Bergmann, F. Laus, G. Steidl, and A. Weinmann, Second order differences of cyclic data and applications in variational denoising, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2916-2953, https://doi.org/10.1137/140969993.
[22] R. Bergmann, J. Persch, and G. Steidl, A parallel Douglas-Rachford algorithm for minimizing ROF-like functionals on images with values in symmetric Hadamard manifolds, SIAM Journal on Imaging Sciences, 9 (2016), pp. 901-937, https://doi.org/10.1137/15M1052858.
[23] R. Bergmann and D. Tenbrinck, A graph framework for manifold-valued data, SIAM Journal on Imaging Sciences, 11 (2018), pp. 325-360, https://doi.org/10.1137/17M1118567.
[24] R. Bergmann and A. Weinmann, Inpainting of cyclic data using first and second order differences, in International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, 2015, pp. 155-168.
[25] D. P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical Programming, 129 (2011), p. 163, https://doi.org/10.1007/s10107-011-0472-0.
[26] J. Bolte, S. Sabach, and M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming, 146 (2014), pp. 459-494, https://doi.org/10.1007/s10107-013-0701-9.
[27] M. Bortoloti, T. Fernandes, O. Ferreira, and J. Yuan, Damped Newton's method on Riemannian manifolds, arXiv preprint arXiv:1803.05126, (2018), https://arxiv.org/abs/1803.05126.
[28] K. Bredies, M. Holler, M. Storath, and A. Weinmann, Total generalized variation for manifold-valued data, SIAM Journal on Imaging Sciences, 11 (2018), pp. 1785-1848, https://doi.org/10.1137/17M1147597.
[29] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak, A family of second-order methods for convex ℓ₁-regularized optimization, Mathematical Programming, 159 (2016), pp. 435-467, https://doi.org/10.1007/s10107-015-0965-3.
[30] R. H. Byrd, J. Nocedal, and F. Oztoprak, An inexact successive quadratic approximation method for L-1 regularized optimization, Mathematical Programming, 157 (2016), pp. 375-396, https://doi.org/10.1007/s10107-015-0941-y.
[31] M. P. do Carmo, Riemannian geometry, Birkhäuser, 1992.
[32] R. Castro, G. Di Giorgi, and W. Sierra, Secant method on Riemannian manifolds, arXiv preprint arXiv:1712.02655, (2017), https://arxiv.org/abs/1712.02655.
[33] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of Mathematical Imaging and Vision, 40 (2011), pp. 120-145, https://doi.org/10.1007/s10851-010-0251-1.
[34] T. F. Chan, S. H. Kang, and J. Shen, Total variation denoising and enhancement of color images based on the CB and HSV color models, Journal of Visual Communication and Image Representation, 12 (2001), pp. 422-435, https://doi.org/10.1006/jvci.2001.0491.
[35] C. Chefd'Hotel, D. Tschumperlé, R. Deriche, and O. Faugeras, Regularizing flows for constrained matrix-valued images, Journal of Mathematical Imaging and Vision, 20 (2004), pp. 147-162, https://doi.org/10.1023/B:JMIV.0000011920.58935.9c.
[36] D.-Q. Chen, Fixed point algorithm based on quasi-Newton method for convex minimization problem with application to image deblurring, 2014, https://arxiv.org/abs/1412.4438.
[37] S. Chen, S. Ma, A. Man-Cho So, and T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization, 30 (2020), pp. 210-239, https://doi.org/10.1137/18M122457X.
[38] W. Chen, H. Ji, and Y. You, An augmented Lagrangian method for ℓ₁-regularized optimization problems with orthogonality constraints, SIAM Journal on Scientific Computing, 38 (2016), pp. B570-B592, https://doi.org/10.1137/140988875.
[39] C. Clason and T. Valkonen, Introduction to nonsmooth analysis and optimization, arXiv preprint arXiv:2001.00216, (2020).
[40] P. L. Combettes and J.-C. Pesquet, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 564-574, https://doi.org/10.1109/JSTSP.2007.910264.
[41] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation, 4 (2005), pp. 1168-1200, https://doi.org/10.1137/050626090.
[42] P. Cook, Y. Bai, S. Nedjati-Gilani, K. Seunarine, M. Hall, G. Parker, and D. C. Alexander, Camino: open-source diffusion-MRI reconstruction and processing, in 14th Scientific Meeting of the International Society for Magnetic Resonance in Medicine, vol. 2759, Seattle WA, USA, 2006, p. 2759.
[43] D. Cremers and E. Strekalovskiy, Total cyclic variation and generalizations, Journal of Mathematical Imaging and Vision, 47 (2013), pp. 258-277, https://doi.org/10.1007/s10851-012-0396-1.
[44] T. De Luca, F. Facchinei, and C. Kanzow, A semismooth equation approach to the solution of nonlinear complementarity problems, Mathematical Programming, 75 (1996), pp. 407-439, https://doi.org/10.1007/BF02592192.
[45] F. R. de Oliveira and O. P. Ferreira, Newton method for finding a singularity of a special class of locally Lipschitz continuous vector fields on Riemannian manifolds, arXiv preprint arXiv:1810.11636, (2018), https://arxiv.org/abs/1810.11636.
[46] W. Diepeveen, Non-smooth higher-order optimization on manifolds, master's thesis, 2020.
[47] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289-1306, https://doi.org/10.1109/TIT.2006.871582.
[48] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM Journal on Matrix Analysis and Applications, 20 (1998), pp. 303-353, https://doi.org/10.1137/S0895479895290954.
[49] E. Esser, X. Zhang, and T. F. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science, SIAM Journal on Imaging Sciences, 3 (2010), pp. 1015-1046, https://doi.org/10.1137/09076934X.
[50] F. Facchinei, A. Fischer, and C. Kanzow, Inexact Newton methods for semismooth equations with applications to variational inequality problems, in Nonlinear Optimization and Applications, Springer, 1996, pp. 125-139.
[51] O. Ferreira and P. Oliveira, Subgradient algorithm on Riemannian manifolds, Journal of Optimization Theory and Applications, 97 (1998), pp. 93-104, https://doi.org/10.1023/A:1022675100677.
[52] O. Ferreira and P. Oliveira, Proximal point algorithm on Riemannian manifolds, Optimization, 51 (2002), pp. 257-270, https://doi.org/10.1080/02331930290019413.
[53] P. T. Fletcher and S. Joshi, Riemannian geometry for the statistical analysis of diffusion tensor data, Signal Processing, 87 (2007), pp. 250-262, https://doi.org/10.1016/j.sigpro.2005.12.018.
[54] K. Fountoulakis, J. Gondzio, and P. Zhlobich, Matrix-free interior point method for compressed sensing problems, Mathematical Programming Computation, 6 (2014), pp. 1-31, https://doi.org/10.1007/s12532-013-0063-6.
[55] M. Giaquinta, G. Modica, and J. Souček, Variational problems for maps of bounded variation with values in S¹, Calculus of Variations and Partial Differential Equations, 1 (1993), pp. 87-121, https://doi.org/10.1007/BF02163266.
[56] M. Giaquinta and D. Mucci, The BV-energy of maps into a manifold: relaxation and density results, Annali della Scuola Normale Superiore di Pisa-Classe di Scienze, 5 (2006), pp. 483-548.
[57] M. Giaquinta and D. Mucci, Maps of bounded variation with values into a manifold: total variation and relaxed energy, Pure and Applied Mathematics Quarterly, 3 (2007), pp. 513-538, https://doi.org/10.4310/PAMQ.2007.v3.n2.a6.
[58] R. Griesse and D. A. Lorenz, A semismooth Newton method for Tikhonov functionals with sparsity constraints, Inverse Problems, 24 (2008), p. 035007, https://doi.org/10.1088/0266-5611/24/3/035007.
[59] P. Grohs and S. Hosseini, Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds, IMA Journal of Numerical Analysis, 36 (2016), pp. 1167-1192, https://doi.org/10.1093/imanum/drv043.
[60] P. Grohs and S. Hosseini, ε-subgradient algorithms for locally Lipschitz functions on Riemannian manifolds, Advances in Computational Mathematics, 42 (2016), pp. 333-360, https://doi.org/10.1007/s10444-015-9426-z.
[61] P. Grohs and M. Sprecher, Total variation regularization by iteratively reweighted least squares on Hadamard spaces and the sphere, preprint, 39 (2014).
[62] P. Grohs and M. Sprecher, Total variation regularization on Riemannian manifolds by iteratively reweighted minimization, Information and Inference: A Journal of the IMA, 5 (2016), pp. 353-378, https://doi.org/10.1093/imaiai/iaw011.
[63] B. He, Y. You, and X. Yuan, On the convergence of primal-dual hybrid gradient algorithm, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2526-2537, https://doi.org/10.1137/140963467.
[64] J. He, J. Wang, and J.-C. Yao, Convergence criteria of Newton's method on Lie groups, Fixed Point Theory and Applications, 2013 (2013), p. 293, https://doi.org/10.1186/1687-1812-2013-293.
[65] M. Hintermüller, K. Ito, and K. Kunisch, The primal-dual active set strategy as a semismooth Newton method, SIAM Journal on Optimization, 13 (2002), pp. 865-888, https://doi.org/10.1137/s1052623401383558.
[66] S. Hosseini, Convergence of nonsmooth descent methods via Kurdyka-Łojasiewicz inequality on Riemannian manifolds, INS Preprint No. 1523, Hausdorff Center for Mathematics and Institute for Numerical Simulation, University of Bonn, 2015.
[67] S. Hosseini, W. Huang, and R. Yousefpour, Line search algorithms for locally Lipschitz functions on Riemannian manifolds, SIAM Journal on Optimization, 28 (2018), pp. 596-619, https://doi.org/10.1137/16M1108145.
[68] S. Hosseini and M. Pouryayevali, Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds, Nonlinear Analysis: Theory, Methods & Applications, 74 (2011), pp. 3884-3895, https://doi.org/10.1016/j.na.2011.02.023.
[69] S. Hosseini and A. Uschmajew, A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds, SIAM Journal on Optimization, 27 (2017), pp. 173-189, https://doi.org/10.1137/16M1069298.
[70] B. A. Kakavandi and M. Amini, Duality and subdifferential for convex functions on complete CAT(0) metric spaces, Nonlinear Analysis: Theory, Methods & Applications, 73 (2010), pp. 3450-3455, https://doi.org/10.1016/j.na.2010.07.033.
[71] R. Kimmel and N. Sochen, Orientation diffusion or how to comb a porcupine, Journal of Visual Communication and Image Representation, 13 (2002), pp. 238-248, https://doi.org/10.1006/jvci.2001.0501.
[72] A. Kovnatsky, K. Glashoff, and M. M. Bronstein, MADMM: a generic algorithm for non-smooth optimization on manifolds, in European Conference on Computer Vision, Springer, 2016, pp. 680-696, https://doi.org/10.1007/978-3-319-46454-1_41.
[73] R. Lai and S. Osher, A splitting method for orthogonality constrained problems, Journal of Scientific Computing, 58 (2014), pp. 431-449, https://doi.org/10.1007/s10915-013-9740-x.
[74] F. Laus, M. Nikolova, J. Persch, and G. Steidl, A nonlocal denoising algorithm for manifold-valued images using second order statistics, SIAM Journal on Imaging Sciences, 10 (2017), pp. 416-448, https://doi.org/10.1137/16M1087114.
[75] J. D. Lee, Y. Sun, and M. A. Saunders, Proximal Newton-type methods for minimizing composite functions, SIAM Journal on Optimization, 24 (2014), pp. 1420-1443, https://doi.org/10.1137/130921428.
[76] J. M. Lee, Smooth manifolds, in Introduction to Smooth Manifolds, Springer, 2013, pp. 1-31, https://doi.org/10.1007/978-1-4419-9982-5.
[77] J. Lellmann, E. Strekalovskiy, S. Koetter, and D. Cremers, Total variation regularization for functions with values in a manifold, in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2944-2951, https://doi.org/10.1109/ICCV.2013.366.
[78] X. Li, D. Sun, and K.-C. Toh, A highly efficient semismooth Newton augmented Lagrangian method for solving LASSO problems, SIAM Journal on Optimization, 28 (2018), pp. 433-458, https://doi.org/10.1137/16M1097572.
[79] D. G. Luenberger, The gradient projection method along geodesics, Management Science, 18 (1972), pp. 620-631, https://doi.org/10.1287/mnsc.18.11.620.
[80] J. Martínez and L. Qi, Inexact Newton methods for solving nonsmooth equations, Journal of Computational and Applied Mathematics, 60 (1995), pp. 127-145, https://doi.org/10.1016/0377-0427(94)00088-I.
[81] D. Massonnet and K. L. Feigl, Radar interferometry and its application to changes in the earth's surface, Reviews of Geophysics, 36 (1998), pp. 441-500, https://doi.org/10.1029/97RG03139.
[82] A. Milzarek and M. Ulbrich, A semismooth Newton method with multidimensional filter globalization for ℓ₁-optimization, SIAM Journal on Optimization, 24 (2014), pp. 298-333, https://doi.org/10.1137/120892167.
[83] H. Pan, Z. Jing, M. Lei, R. Liu, B. Jin, and C. Zhang, A sparse proximal Newton splitting method for constrained image deblurring, Neurocomputing, 122 (2013), pp. 245-257, https://doi.org/10.1016/j.neucom.2013.06.027.
[84] P. Patrinos, L. Stella, and A. Bemporad, Forward-backward truncated Newton methods for convex composite optimization, arXiv preprint arXiv:1402.6655, (2014), https://arxiv.org/abs/1402.6655.
[85] X. Pennec, Intrinsic statistics on Riemannian manifolds: basic tools for geometric measurements, Journal of Mathematical Imaging and Vision, 25 (2006), p. 127, https://doi.org/10.1007/s10851-006-6228-4.
[86] X. Pennec, P. Fillard, and N. Ayache, A Riemannian framework for tensor computing, International Journal of Computer Vision, 66 (2006), pp. 41-66, https://doi.org/10.1007/s11263-005-3222-z.
[87] J. Persch, Optimization Methods for Manifold-valued Image Processing, Verlag Dr. Hut, 2018.
[88] T. Pock and A. Chambolle, Diagonal preconditioning for first order primal-dual algorithms in convex optimization, in 2011 International Conference on Computer Vision, IEEE, 2011, pp. 1762-1769, https://doi.org/10.1109/ICCV.2011.6126441.
[89] L. Qi and J. Sun, A nonsmooth version of Newton's method, Mathematical Programming, 58 (1993), pp. 353-367, https://doi.org/10.1007/BF01581275.
[90] R. T. Rockafellar and R. J.-B. Wets, Variational analysis, vol. 317, Springer Science & Business Media, 2009, https://doi.org/10.1007/978-3-642-02431-3.
[91] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena, 60 (1992), pp. 259-268, https://doi.org/10.1016/0167-2789(92)90242-F.
[92] C. Rust, J. Lellmann, and T. Vogt, Semiglatte Optimierungsverfahren zweiter Ordnung in der Bildverarbeitung, master's thesis, University of Lübeck, 2017.
[93] T. Sakai, Riemannian geometry, vol. 149, American Mathematical Soc., 1996.
[94] S. T. Smith, Optimization techniques on Riemannian manifolds, Fields Institute Communications, 3 (1994), pp. 113-135.
[95] M. Storath and A. Weinmann, Wavelet sparse regularization for manifold-valued data, arXiv preprint arXiv:1808.00505, (2018), https://arxiv.org/abs/1808.00505.
[96] M. Storath, A. Weinmann, and M. Unser, Exact algorithms for L¹-TV regularization of real-valued or circle-valued signals, SIAM Journal on Scientific Computing, 38 (2016), pp. A614-A630, https://doi.org/10.1137/15M101796X.
[97] E. Strekalovskiy and D. Cremers, Total variation for cyclic structures: convex relaxation and efficient minimization, in CVPR 2011, IEEE, 2011, pp. 1905-1911, https://doi.org/10.1109/CVPR.2011.5995573.
[98] D. Sun and J. Han, Newton and quasi-Newton methods for a class of nonsmooth equations and related problems, SIAM Journal on Optimization, 7 (1997), pp. 463-480, https://doi.org/10.1137/S1052623494274970.
[99] C. Udriste, Convex functions and optimization methods on Riemannian manifolds, vol. 297, Springer Science & Business Media, 1994.
[100] T. Valkonen, K. Bredies, and F. Knoll, Total generalized variation in diffusion tensor imaging, SIAM Journal on Imaging Sciences, 6 (2013), pp. 487-525, https://doi.org/10.1137/120867172.
[101] T. Vogt, E. Strekalovskiy, D. Cremers, and J. Lellmann, Lifting methods for manifold-valued variational problems, arXiv preprint arXiv:1908.03776, (2019), https://arxiv.org/abs/1908.03776.
[102] Y. Wang, J. Yang, W. Yin, and Y. Zhang, A new alternating minimization algorithm for total variation image reconstruction, SIAM Journal on Imaging Sciences, 1 (2008), pp. 248-272, https://doi.org/10.1137/080724265.
[103] A. Weinmann, L. Demaret, and M. Storath, Total variation regularization for manifold-valued data, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2226-2257, https://doi.org/10.1137/130951075.
[104] A. Weinmann, L. Demaret, and M. Storath, Mumford-Shah and Potts regularization for manifold-valued data, Journal of Mathematical Imaging and Vision, 55 (2016), pp. 428-445, https://doi.org/10.1007/s10851-015-0628-2.
[105] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing, 57 (2009), pp. 2479-2493, https://doi.org/10.1109/TSP.2009.2016892.
[106] X. Xiao, Y. Li, Z. Wen, and L. Zhang, A regularized semi-smooth Newton method with projection steps for composite convex programs, Journal of Scientific Computing, 76 (2018), pp. 364-389, https://doi.org/10.1007/s10915-017-0624-3.
[107] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, Bregman iterative algorithms for ℓ₁-minimization with applications to compressed sensing, SIAM Journal on Imaging Sciences, 1 (2008), pp. 143-168, https://doi.org/10.1137/070703983.
[108] J. Yu, S. Vishwanathan, S. Günter, and N. N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization problems in machine learning, Journal of Machine Learning Research, 11 (2010), pp. 1145-1200, https://doi.org/10.1145/1390156.1390309.
[109] M.-C. Yue, Z. Zhou, and A. M.-C. So, A family of inexact SQA methods for non-smooth convex minimization with provable convergence guarantees based on the Luo-Tseng error bound property, Mathematical Programming, 174 (2019), pp. 327-358, https://doi.org/10.1007/s10107-018-1280-6.
[110] H. Zhu, X. Zhang, D. Chu, and L.-Z. Liao, Nonconvex and nonsmooth optimization with generalized orthogonality constraints: an approximate augmented Lagrangian method, Journal of Scientific Computing, 72 (2017), pp. 331-372, https://doi.org/10.1007/s10915-017-0359-1.