Accelerated Algorithms for Convex and Non-Convex Optimization on Manifolds
Lizhen Lin, Bayan Saparbayeva, Michael Minyi Zhang, David B. Dunson
[email protected], [email protected], [email protected], [email protected]
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA.
Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, USA.
Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong, China.
Department of Statistical Science, Duke University, Durham, NC, USA.
October 20, 2020
Abstract
We propose a general scheme for solving convex and non-convex optimization problems on manifolds. The central idea is that, by adding a multiple of the squared retraction distance to the objective function in question, we "convexify" the objective function and solve a series of convex sub-problems in the optimization procedure. One of the key challenges for optimization on manifolds is the difficulty of verifying the complexity of the objective function, e.g., whether the objective function is convex or non-convex, and the degree of non-convexity. Our proposed algorithm adapts to the level of complexity in the objective function. We show that when the objective function is convex, the algorithm provably converges to the optimum and leads to accelerated convergence. When the objective function is non-convex, the algorithm will converge to a stationary point. Our proposed method unifies insights from Nesterov's original idea for accelerating gradient descent algorithms with recent developments in optimization algorithms in Euclidean space. We demonstrate the utility of our algorithms on several manifold optimization tasks such as estimating intrinsic and extrinsic Fréchet means on spheres and low-rank matrix factorization with Grassmann manifolds applied to the Netflix rating data set.

Introduction
Optimization is a near-ubiquitous tool used in a wide range of disciplines, including the physical sciences, applied mathematics, engineering, and the social sciences. Formally, it aims to maximize or minimize some quantitative criterion, namely the objective function, with respect to some parameters of interest. In many complex learning problems in modern data science, the parameters are naturally defined on a manifold. The emerging field of statistics on manifolds based on Fréchet means (Bhattacharya and Bhattacharya, 2012; Bhattacharya and Lin, 2017) can be viewed as one of the notable examples of optimization on general manifolds. Another example can be found in building scalable recommender systems, where extracting a low-rank matrix involves an optimization problem over a Grassmann manifold (Boumal et al., 2019). Recent developments in geometric deep learning, where the input or output layer is constrained to be on a Riemannian manifold (Lohit and Turaga, 2017; Huang and Gool, 2017; Huang et al., 2017), constitute another important class of applications. Other applications arise in diverse areas ranging from medical imaging analysis, Procrustes shape matching, dimension reduction, and dynamic subspace tracking to problems involving ranking and orthogonality constraints, among many others. This proliferation of manifold-valued applications demands fundamental development of models, algorithms, and theory for solving optimization problems over non-Euclidean spaces.

The current literature on optimization over manifolds mainly focuses on extending existing Euclidean space algorithms, such as Newton's method (Smith, 2014; Ring and Wirth, 2012), conjugate gradient descent (Edelman et al., 1998; Nishimori et al., 2008), steepest descent (Absil et al., 2010), trust-region methods (Absil et al., 2007; Boumal and Absil, 2011), and others. Many of the objective functions in manifold optimization problems are very complex. One of the key challenges in solving such problems lies in the difficulty of verifying the convexity and the degree of convexity of the objective function. Current approaches cannot adapt to the complexity of the problem at hand on manifolds.

We take a major step towards addressing these issues by proposing a general scheme to solve convex and non-convex optimization problems on manifolds using gradient-based algorithms originally designed for convex functions. The key idea is to "convexify" the objective function by adding a multiple of the squared retraction distance. The proposed algorithm does not require knowledge of whether the objective function is convex, but automatically converges to an optimum if the function is strongly convex. When the objective is non-convex, it achieves rapid convergence to a stationary point. The proposed algorithm is a generalization of Nesterov acceleration (Nesterov, 2004), which improves the convergence rate of gradient descent algorithms. Our algorithm (which we call $\mathcal{A}_2$) takes any general existing optimization method (which we call $\mathcal{A}$), originally designed for convex functions, and converts it into a method applicable to non-convex functions.

Similar schemes have been explored for optimization problems in Euclidean space (Paquette et al., 2018). Generalizations to arbitrary manifolds, however, require fundamentally novel theoretical development. In the Euclidean case, the gradient steps are taken along lines, whereas on a manifold we take them along retraction curves, which crucially affects the result and raises the difficulty of proving convergence.
Also, for manifolds it is not trivial to correctly convexify a weakly convex function; weakly convex functions are the broad class of non-convex functions on manifolds we consider, and they account for most of the interesting examples of non-convex functions in machine learning. We propose a novel idea to convexify the objective locally with the help of the retraction. Key features of our algorithm include adaptation to the unknown weak convexity of the objective function and automatic Nesterov acceleration. The proposed algorithm can be used to accelerate a broad class of base algorithms $\mathcal{A}$, including gradient descent as well as parallel optimization approaches (see Saparbayeva et al., 2018).

Our paper is organized as follows. In Section 2, we introduce related work on accelerated optimization algorithms. Next, we present our proposed acceleration algorithm on manifolds in Section 3 and present theoretical convergence results. In Section 4, we consider a simulation study of estimating Fréchet means and a real data example using the Netflix prize data set in a matrix completion problem.

Related Work

Liu et al. (2017) propose accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. This is a direct generalization of Nesterov's original linear extrapolation mechanism to general Riemannian manifolds via a non-linear operator. One drawback of Liu et al. (2017) is that the accelerated step of their algorithm involves exactly solving non-trivial implicit equations.

Zhang and Sra (2018a) later proposed a computationally tractable accelerated gradient algorithm and a novel estimation sequence for convergence analysis. Our approach is fundamentally different from theirs. We regularize an objective function with a squared retraction distance (see Proposition 1), solve a sequence of convex subproblems, adapt to the degree of weak convexity of the objective function, and produce accelerated rates for convex objectives. Even in the convex case, our approach can deal with a much broader class of retraction-based convex functions.

Paquette et al. (2018) propose a general scheme called "Catalyst acceleration" for solving general optimization problems in Euclidean space, which has inspired some of the ideas in our work. Similar ideas have been explored for convex functions in Euclidean space in both theory and practice (Lin et al., 2017). However, optimization problems on manifolds are of a fundamentally different nature and require the development of substantially new tools and theory.

There is also an interesting line of work proposing fast algorithms for stochastic optimization on manifolds (see Zhang et al., 2016; Zhang et al., 2018; Zhou et al., 2019; Bonnabel, 2013), which employs very different techniques such as minibatching, variance reduction, and utilizing the uncertainty of inputs. Methods like Zhang and Sra (2018b) propose optimization algorithms analogous to Nesterov-type algorithms for manifold spaces.

Figure 1: Illustration of a retraction map on a manifold.
Retraction-Based Convexity on Manifolds

We first define general retraction-based weakly convex, convex, and strongly convex functions by generalizing from their geodesic-based counterparts. We then prove an important proposition showing that a non-convex function can be transformed into a convex one simply by adding a multiple of the squared retraction distance to the objective function.
Definition 1. A retraction on a manifold $M$ is a smooth mapping $R : TM \to M$ from its tangent bundle with the following properties:

1. $R_\theta(0_\theta) = R(\theta, 0_\theta) = \theta$, where $0_\theta$ denotes the zero vector on the tangent space $T_\theta M$;
2. For any point $\theta \in M$, the differential of the retraction mapping at the zero vector $0_\theta \in T_\theta M$ equals the identity mapping on $T_\theta M$, that is, $d(R_\theta)_{0_\theta} = \mathrm{id}_{T_\theta M}$, where $\mathrm{id}_{T_\theta M}$ denotes the identity mapping on $T_\theta M$.

The exponential map on a Riemannian manifold can be viewed as a special case of the retraction map, and the inverse-exponential map is a special case of the inverse-retraction map. A good choice of retraction map can lead to a substantial reduction in computational burden compared to the exponential map. We see an example in Section 4.2 on the choice of a retraction map for the Grassmannian; Figure 1 provides a visualization of a retraction map.

We first define the retraction distance function on $M$:
$$d_R(\theta_1, \theta_2) = \big\| R^{-1}_{\theta_1}\theta_2 \big\|.$$
Since the differential of the retraction map at zero is the identity, there is a small enough neighborhood $D$ of the point $\theta$ where the inverse retraction map $R^{-1}_\theta$ is bi-Lipschitz continuous in $D$, i.e., $d_R$ satisfies the inequalities
$$K_1\, d_R(\vartheta_1, \vartheta_2) \le \big\| R^{-1}_\theta\vartheta_1 - R^{-1}_\theta\vartheta_2 \big\| \le K_2\, d_R(\vartheta_1, \vartheta_2), \qquad \vartheta_1, \vartheta_2 \in D,$$
for constants $0 < K_1 \le 1 \le K_2$.

In addition, we also require the squared retraction distance function to be $R_0$-strongly retraction convex around $\vartheta$; that is, for some $\delta > 0$ and a constant $0 < R_0 \le 1$, the following inequality holds:
$$d_R^2(\theta_2, \vartheta) \ge d_R^2(\theta_1, \vartheta) + \big\langle \nabla d_R^2(\theta_1, \vartheta),\, R^{-1}_{\theta_1}\theta_2 \big\rangle + R_0\, d_R^2(\theta_1, \theta_2), \tag{1}$$
whenever $d_R(\theta_i, \vartheta) < \delta$, $i = 1, 2$. Because the differential of $R_\vartheta$ at the zero vector $0_\vartheta \in T_\vartheta M$ equals the identity mapping, in a small neighborhood of $\vartheta$ the squared retraction distance function behaves like the squared norm function, which is strongly convex.
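To make these objects concrete, the following is a minimal Python sketch (ours, for illustration) of the retraction used later in the Fréchet mean experiments on the sphere, $R_\vartheta(v) = (\vartheta+v)/|\vartheta+v|$, together with its inverse and the induced retraction distance $d_R$; the function names are our own.

```python
import numpy as np

def retract(theta, v):
    """Sphere retraction R_theta(v) = (theta + v) / |theta + v|."""
    w = theta + v
    return w / np.linalg.norm(w)

def inv_retract(theta, phi):
    """Inverse retraction on the sphere: R_theta^{-1}(phi) = phi/(theta^T phi) - theta."""
    return phi / np.dot(theta, phi) - theta

def retraction_distance(theta1, theta2):
    """d_R(theta1, theta2) = || R_{theta1}^{-1} theta2 ||."""
    return np.linalg.norm(inv_retract(theta1, theta2))

# Sanity checks of the two defining properties of a retraction:
rng = np.random.default_rng(0)
theta = rng.normal(size=3); theta /= np.linalg.norm(theta)
assert np.allclose(retract(theta, np.zeros(3)), theta)        # R_theta(0_theta) = theta
v = 1e-6 * rng.normal(size=3); v -= np.dot(v, theta) * theta  # small tangent vector
assert np.allclose(retract(theta, v) - theta, v, atol=1e-9)   # d(R_theta)_0 = identity
```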
Consider a function $f : M \to \bar{\mathbb{R}}$ and a point $\theta$ with $f(\theta)$ finite. The $R$-subdifferential of $f$ at $\theta$ is the set
$$\partial f(\theta) = \Big\{ v \in T_\theta M : f(\vartheta) \ge f(\theta) + \big\langle v, R^{-1}_\theta\vartheta \big\rangle + o\big(d_R(\theta, \vartheta)\big)\ \ \forall\, \vartheta \in M \Big\}.$$
We now define the notion of convex functions on manifolds with respect to the retraction map.
Definition 3.
A function $f$ is convex with respect to the retraction $R$ if for any points $\theta_1, \theta_2 \in M$ the inequality
$$f(\theta_2) \ge f(\theta_1) + \big\langle v, R^{-1}_{\theta_1}\theta_2 \big\rangle, \qquad v \in \partial f(\theta_1), \tag{2}$$
holds. We are now ready to define one of the most important classes of non-convex functions, the weakly convex functions, which cover many interesting applications of non-convex optimization in machine learning.

Definition 4.
A function $f$ is $\rho$-weakly convex with respect to the retraction $R$ if for any points $\theta_1, \theta_2 \in M$ the inequality
$$f(\theta_2) \ge f(\theta_1) + \big\langle v, R^{-1}_{\theta_1}\theta_2 \big\rangle - \rho\, d_R^2(\theta_1, \theta_2), \qquad v \in \partial f(\theta_1), \tag{3}$$
holds. Given the strong retraction convexity of the squared retraction distance (see (1)), we can regularize a weakly convex function $f$ by adding the term $\kappa\, d_R^2(\theta, \vartheta)$ and turn it into a convex function via the following proposition.

Proposition 1.
Let $d_R$ be a retraction distance whose square is strongly retraction convex, i.e., satisfies inequality (1), on the subset $D \subset M$. Then the function $f$ is $R_0\kappa$-weakly convex in $D$ if and only if the function $h_\kappa(\theta, \vartheta) = f(\theta) + \kappa\, d_R^2(\theta, \vartheta)$ is convex in $D$.

Proof. Let $f$ be $R_0\kappa$-weakly convex. Then for any $\theta_1, \theta_2 \in D$,
$$f(\theta_2) \ge f(\theta_1) + \big\langle \partial f(\theta_1), R^{-1}_{\theta_1}\theta_2 \big\rangle - R_0\kappa\, d_R^2(\theta_1, \theta_2) \ge f(\theta_1) + \big\langle \partial f(\theta_1), R^{-1}_{\theta_1}\theta_2 \big\rangle + \kappa\, d_R^2(\theta_1, \vartheta) + \big\langle \kappa\nabla d_R^2(\theta_1, \vartheta), R^{-1}_{\theta_1}\theta_2 \big\rangle - \kappa\, d_R^2(\theta_2, \vartheta),$$
where the second inequality uses (1). This implies
$$h_\kappa(\theta_2, \vartheta) \ge h_\kappa(\theta_1, \vartheta) + \big\langle \partial h_\kappa(\theta_1, \vartheta), R^{-1}_{\theta_1}\theta_2 \big\rangle.$$
The converse follows by reversing the argument.

For functions defined on a Euclidean space there is a characterization of weak convexity equivalent to (3):
$$f\big(R_\vartheta(\lambda R^{-1}_\vartheta\theta)\big) \le \lambda f(\theta) + (1-\lambda) f(\vartheta) + \rho\lambda(1-\lambda)\, d_R^2(\theta, \vartheta). \tag{4}$$
Over a manifold, however, there is no such straightforward equivalence. This is because the distance function $d_R$ does not satisfy the following equality:
$$d_R^2\big(R_{\theta_1}(\lambda R^{-1}_{\theta_1}\theta_2), \vartheta\big) = \lambda\, d_R^2(\theta_2, \vartheta) + (1-\lambda)\, d_R^2(\theta_1, \vartheta) - \lambda(1-\lambda)\, d_R^2(\theta_1, \theta_2). \tag{5}$$
Nevertheless, in some neighborhood of $\vartheta$, for some $\delta > 0$, the following inequality holds:
$$d_R^2\big(R_{\theta_1}(\lambda R^{-1}_{\theta_1}\theta_2), \vartheta\big) \le \lambda\, d_R^2(\theta_2, \vartheta) + (1-\lambda)\, d_R^2(\theta_1, \vartheta) - \lambda(1-\lambda) R_0\, d_R^2(\theta_1, \theta_2), \tag{6}$$
where $d_R(\theta_1, \vartheta) < \delta$ and $d_R(\theta_2, \vartheta) < \delta$. Therefore the function $f$ is $\rho$-weakly convex with respect to the retraction $R$ if for any points $\theta, \vartheta \in M$ with $d_R(\theta, \vartheta) < \delta$ and any $\lambda \in [0,1]$, the approximate secant inequality holds:
$$f\big(R_\vartheta(\lambda R^{-1}_\vartheta\theta)\big) \le \lambda f(\theta) + (1-\lambda) f(\vartheta) + \rho\lambda(1-\lambda)\, d_R^2(\theta, \vartheta).$$

Acceleration Algorithms on Manifolds

In this section, we propose our acceleration algorithms for convex and non-convex functions on manifolds. We first minimize a convex subproblem of the objective function $f$ with some existing approach $\mathcal{A}$ (such as a gradient descent algorithm), where the subproblem takes the form
$$h^*(\vartheta) = \min_{\theta \in M}\Big\{ f(\theta) + \kappa\, d_R^2(\theta, \vartheta) \Big\},$$
for some $\kappa > 0$. Proposition 1 ensures the convexity of the subproblem for an appropriate level of regularization. Therefore, with an existing approach $\mathcal{A}$, we define the proximal operator
$$p(\vartheta) = \operatorname{prox}_{f/\kappa}(\vartheta) = \arg\min_{\theta \in M}\Big\{ f(\theta) + \kappa\, d_R^2(\theta, \vartheta) \Big\},$$
where $\vartheta$ is a prox-center. For optimizing $p(\vartheta)$, we focus on methods $\mathcal{A}$ having linear convergence rates. Specifically, a minimization algorithm $\mathcal{A}$, generating a sequence of iterates $(\theta_k)_{k \ge 0}$, has a linear convergence rate if there exist $\tau_{\mathcal{A},f} \in (0,1)$ and a constant $C_{\mathcal{A},f} \in \mathbb{R}$ such that
$$f(\theta_k) - f^* \le C_{\mathcal{A},f}\,(1 - \tau_{\mathcal{A},f})^k,$$
where $f^*$ is the minimum value of $f$. There are multiple optimization algorithms on manifolds with linear convergence rates for strongly convex functions, including gradient descent, conjugate gradient descent, MASAGA (Babanezhad et al., 2018), RSVRG (Zhang et al., 2016), and many others.

For a proximal center $\vartheta$ and a smoothing parameter $\kappa$, we let $h_\kappa(\theta, \vartheta) = f(\theta) + \kappa\, d_R^2(\theta, \vartheta)$. At the $k$-th iteration, given a previous iterate $\theta_{k-1}$ and the extrapolation term $\tilde\vartheta_{k-1}$, we perform the following steps:

1. Proximal point step:
$$\bar\theta_k \approx \arg\min_{\theta \in M} h_\kappa(\theta, \theta_{k-1}).$$

2. Accelerated proximal point step:
$$\vartheta_k = R_{\theta_{k-1}}\big(\alpha_k R^{-1}_{\theta_{k-1}}\tilde\vartheta_{k-1}\big), \qquad \tilde\theta_k \approx \arg\min_{\theta \in M} h_\kappa(\theta, \vartheta_k), \qquad \tilde\vartheta_k = R_{\theta_{k-1}}\Big(\tfrac{1}{\alpha_k} R^{-1}_{\theta_{k-1}}\tilde\theta_k\Big), \qquad \alpha_{k+1}^2 = (1 - \alpha_{k+1})\,\alpha_k^2.$$

One needs a stopping criterion for the subproblems, since we cannot use the functional gap as a stopping criterion here as in the convex case.
A stationarity-based stopping criterion is adopted, which consists of two conditions:
• Descent condition: $h_\kappa(\theta, \vartheta) \le h_\kappa(\vartheta, \vartheta)$;

• Adaptive stationary condition: $\operatorname{dist}\big(0_\theta, \partial_\theta h_\kappa(\theta, \vartheta)\big) < \kappa\, d_R(\theta, \vartheta)$.

Here, $\operatorname{dist}(\cdot, \cdot)$ denotes the standard Euclidean distance on the tangent space. Recall that a multiple of the squared retraction distance is added to $f$ to make the subproblem convex. So if the weak-convexity parameter $\rho$ is known, then one should set $\kappa > \rho$ to make the problem convex. In this case, it is proven that the number of inner calls to $\mathcal{A}$ for the subproblems
$$\min_{\vartheta \in M} h_\kappa(\vartheta, \theta) \tag{7}$$
can be bounded given a proper initialization point $\vartheta_0$:

• if $f$ is smooth, then set $\vartheta_0 = \theta$;

• if $f = f_0 + \psi$, where $f_0$ is $L$-smooth, then set $\vartheta_0 = \operatorname{prox}_{\eta\psi}\big(R_\theta(-\eta\nabla f_0(\theta))\big)$ with $\eta \le \frac{1}{L + \kappa}$.

However, in general one does not have knowledge of $\rho$. Thus we propose a method, $\mathcal{A}_1$ (Algorithm 1), that handles the convexity problem adaptively. Our idea is to let $\mathcal{A}$ run on the subproblem for $T$ predefined iterations, output the point $\bar\theta_T$, and check whether a sufficient decrease occurs. If the subproblem is convex, then the aforementioned descent and adaptive stationary conditions are guaranteed. If either of the conditions is violated, then the subproblem is deemed non-convex; in this case, we double the value of $\kappa$ and repeat the previous steps.

The tuning parameter $\kappa$ should be chosen big enough to ensure the convexity of the subproblems and simultaneously small enough to obtain the optimal complexity, by not letting the subproblem deviate too far from the original objective function. Thus we introduce $\kappa_{cvx}$ as an $\mathcal{A}$-dependent smoothing parameter. Notice that the linear convergence rate $\tau_{\mathcal{A},h_\kappa}$ of $\mathcal{A}$ is independent of the prox-center and varies with $\kappa$. We define $\kappa_{cvx}$ as
$$\kappa_{cvx} = \operatorname*{arg\,max}_{\kappa > 0}\ \frac{\tau_{\mathcal{A},h_\kappa}}{\sqrt{L + \kappa}}.$$

Algorithm 1: $\mathcal{A}_1$: The Adaptation Algorithm on Manifolds
input: a point $\theta \in M$, the smoothing parameter $\kappa$, and the number of iterations $T$
repeat
    Compute $\bar\theta_T \approx \arg\min_{\vartheta \in M} h_\kappa(\vartheta, \theta)$ by running $T$ iterations of $\mathcal{A}$, using the initialization strategy described below Equation (7).
    If $h_\kappa(\bar\theta_T, \theta) > h_\kappa(\theta, \theta)$ or $\operatorname{dist}\big(0_{\bar\theta_T}, \partial h_\kappa(\bar\theta_T, \theta)\big) > \kappa\, d_R(\bar\theta_T, \theta)$, then go to repeat, replacing $\kappa$ with $2\kappa$.
until $h_\kappa(\bar\theta_T, \theta) \le h_\kappa(\theta, \theta)$ and $\operatorname{dist}\big(0_{\bar\theta_T}, \partial h_\kappa(\bar\theta_T, \theta)\big) \le \kappa\, d_R(\bar\theta_T, \theta)$;
output: $(\bar\theta_T, \kappa)$

Finally, for an initial estimate $\theta_0 \in M$, smoothing parameters $\kappa_0$ and $\kappa_{cvx}$, an optimization algorithm $\mathcal{A}$, and a stopping criterion based on a fixed budget $T$ and $S$, we have the following acceleration algorithm, $\mathcal{A}_2$, for the manifold (Algorithm 2).

Algorithm 2: $\mathcal{A}_2$: Acceleration Algorithm on Manifolds
Initialize $\tilde\vartheta_0 = \theta_0$, $\alpha_1 = 1$.
repeat for $k = 1, 2, \dots$
1. Compute $(\bar\theta_k, \kappa_k) = \mathcal{A}_1(\theta_{k-1}, \kappa_{k-1}, T)$.
2. Compute $\vartheta_k = R_{\theta_{k-1}}\big(\alpha_k R^{-1}_{\theta_{k-1}}\tilde\vartheta_{k-1}\big)$ and apply $S_k = S\log(k+2)$ iterations of $\mathcal{A}$ to find
$$\tilde\theta_k \approx \arg\min_{\theta \in M} h_{\kappa_{cvx}}(\theta, \vartheta_k),$$
using the initialization strategy described below (7).
3. Update $\tilde\vartheta_k$ and $\alpha_{k+1}$:
$$\tilde\vartheta_k = R_{\theta_{k-1}}\Big(\tfrac{1}{\alpha_k} R^{-1}_{\theta_{k-1}}\tilde\theta_k\Big), \qquad \alpha_{k+1} = \frac{\sqrt{\alpha_k^4 + 4\alpha_k^2} - \alpha_k^2}{2}.$$
4. Choose $\theta_k$ to be any point satisfying $f(\theta_k) = \min\{f(\bar\theta_k), f(\tilde\theta_k)\}$;
until the stopping criterion $\operatorname{dist}\big(0_{\bar\theta_k}, \partial f(\bar\theta_k)\big) < \varepsilon$.

Remark 1. Note that there are two sequences, $\{\tilde\theta_k\}$ and $\{\bar\theta_k\}$, in Algorithm $\mathcal{A}_2$. Since the extrapolation step is designed for the convex case, the sequence $\{\tilde\theta_k\}$ approximates the optimal point at an accelerated rate, which means that it approaches the optimal point faster than the sequence $\{\bar\theta_k\}$. Intuitively, when a point from $\{\bar\theta_k\}$ is chosen, the method falls back on the initial algorithm $\mathcal{A}$ and adapts the smoothing parameter to our objective, implying that the Nesterov step failed to accelerate convergence.

In the adaptation method $\mathcal{A}_1(\theta_{k-1}, \kappa_{k-1}, T)$, the resulting $\bar\theta_k$ and $\kappa_k$ have to satisfy the following inequalities:
$$\operatorname{dist}\big(0_{\bar\theta_k}, \partial h_{\kappa_k}(\bar\theta_k, \theta_{k-1})\big) < \kappa_k\, d_R(\bar\theta_k, \theta_{k-1}) \tag{8}$$
and
$$h_{\kappa_k}(\bar\theta_k, \theta_{k-1}) \le h_{\kappa_k}(\theta_{k-1}, \theta_{k-1}). \tag{9}$$
The resulting $\tilde\theta_k$ needs to satisfy the condition that if the function $f$ is convex, then
$$\operatorname{dist}\big(0_{\tilde\theta_k}, \partial h_{\kappa_{cvx}}(\tilde\theta_k, \vartheta_k)\big) < \frac{\kappa_{cvx}}{(k+1)^2}\, d_R(\tilde\theta_k, \vartheta_k). \tag{10}$$
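Before turning to the analysis, here is a minimal Python sketch of Algorithms 1 and 2 for the sphere setting of the experiments, using plain retraction-based gradient descent as the base method $\mathcal{A}$. The fixed step size, the stand-in for $\nabla d_R^2$, and all function names are our own simplifying assumptions, not the paper's specification.

```python
import numpy as np

def retract(theta, v):                      # sphere retraction
    w = theta + v
    return w / np.linalg.norm(w)

def inv_retract(theta, phi):                # its inverse
    return phi / np.dot(theta, phi) - theta

def d_R2(theta, phi):                       # squared retraction distance
    return np.sum(inv_retract(theta, phi) ** 2)

def grad_h(grad_f, theta, center, kappa):
    """Gradient of h_kappa(., center); -2 R_theta^{-1}(center) stands in for
    grad d_R^2(., center) (exact in the Euclidean case), then project to T_theta."""
    g = grad_f(theta) - 2.0 * kappa * inv_retract(theta, center)
    return g - np.dot(g, theta) * theta

def base_method_A(grad_f, center, kappa, T, step=0.1):
    """T steps of gradient descent on the subproblem h_kappa(., center)."""
    theta = center.copy()                   # smooth f: initialize at the prox-center
    for _ in range(T):
        theta = retract(theta, -step * grad_h(grad_f, theta, center, kappa))
    return theta

def adapt_A1(f, grad_f, theta, kappa, T):
    """Algorithm 1: double kappa until descent + adaptive stationarity hold."""
    while True:
        bar = base_method_A(grad_f, theta, kappa, T)
        descent = f(bar) + kappa * d_R2(bar, theta) <= f(theta)
        stat = np.linalg.norm(grad_h(grad_f, bar, theta, kappa)) \
               <= kappa * np.sqrt(d_R2(bar, theta))
        if descent and stat:
            return bar, kappa
        kappa *= 2.0

def accelerate_A2(f, grad_f, theta0, kappa0, kappa_cvx, T, S, n_outer=100):
    """Algorithm 2: adaptive proximal point + Nesterov extrapolation."""
    theta, tilde_vartheta, kappa, alpha = theta0, theta0, kappa0, 1.0
    for k in range(1, n_outer + 1):
        bar, kappa = adapt_A1(f, grad_f, theta, kappa, T)               # step 1
        vartheta = retract(theta, alpha * inv_retract(theta, tilde_vartheta))
        n_inner = max(1, int(S * np.log(k + 2)))                        # S_k = S log(k+2)
        tilde = base_method_A(grad_f, vartheta, kappa_cvx, n_inner)     # step 2
        tilde_vartheta = retract(theta, inv_retract(theta, tilde) / alpha)  # step 3
        alpha = 0.5 * (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2)
        theta = bar if f(bar) <= f(tilde) else tilde                    # step 4
    return theta
```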
We then have the following lemma:

Lemma 1. Suppose $\theta$ satisfies $\operatorname{dist}\big(0_\theta, \partial h_\kappa(\theta, \vartheta)\big) < \varepsilon$, and $\big\|\nabla d_R^2(\theta, \vartheta)\big\| \le K_3\, d_R(\theta, \vartheta)$. Then the following inequality holds:
$$\operatorname{dist}\big(0_\theta, \partial f(\theta)\big) \le \varepsilon + \kappa K_3\, d_R(\theta, \vartheta).$$
We can find $v \in \partial h_\kappa(\theta, \vartheta)$ with $\|v\| \le \varepsilon$. Taking into account that $\partial h_\kappa(\theta, \vartheta) = \partial f(\theta) + \kappa\nabla d_R^2(\theta, \vartheta)$, the result follows.

Since we assume the retraction distance function $d_R$ is continuous, we can deduce that the vector field $\nabla d_R^2(\theta, \vartheta)$ is continuous, so the conditions of Lemma 1 are very mild. Also, as mentioned previously, the squared retraction distance function $d_R^2(\cdot, \vartheta)$ acts like a squared norm function in a small neighborhood of $\vartheta$.

We define the following retraction-based strongly convex functions:

Definition 5.
A function $f$ is $\mu$-strongly convex with respect to the retraction $R$ if for any points $\theta_1, \theta_2 \in M$ and some $\mu > 0$ the inequality
$$f(\theta_2) \ge f(\theta_1) + \big\langle v, R^{-1}_{\theta_1}\theta_2 \big\rangle + \mu\, d_R^2(\theta_1, \theta_2), \qquad v \in \partial f(\theta_1), \tag{11}$$
holds. We then have the following convergence analysis for the acceleration algorithm $\mathcal{A}_2$:

Theorem 1.
Fix real-valued constants $\kappa_0, \kappa_{cvx} > 0$ and a point $\theta_0 \in M$. Set $\kappa_{\max} = \max_{k \ge 1}\kappa_k$. Suppose that the number of iterations $T$ is such that $\bar\theta_k$ satisfies (17), and $\big\|\nabla d_R^2(\theta, \vartheta)\big\| \le K_3\, d_R(\theta, \vartheta)$. Define $f^* = \lim_{k\to\infty} f(\theta_k)$. Then for any $N \ge 1$, the iterate sequence generated by the acceleration algorithm satisfies
$$\min_{j=1,\dots,N}\Big\{\operatorname{dist}^2\big(0_{\bar\theta_j}, \partial f(\bar\theta_j)\big)\Big\} \le \frac{\kappa_{\max}(1+K_3)^2}{N}\big(f(\theta_0) - f^*\big).$$
If in addition the function $f$ is $\kappa_{cvx}(K_1K_2 - R_0)$-strongly convex and $S_k$ is chosen so that $\tilde\theta_k$ satisfies (20), then
$$f(\theta_N) - f^* \le \frac{8\kappa_{cvx}K_1K_2}{(N+1)^2}\, d_R^2(\theta^*, \theta_0), \tag{12}$$
where $\theta^*$ is any minimizer of the function $f$. The detailed proof of this theorem can be found in the Appendix.
Remark 2.
If the original method $\mathcal{A}$ has a linear rate of convergence, then our method $\mathcal{A}_2$ also converges to the local minimum in the strongly convex case. If knowledge of the strong convexity is given, some existing methods can achieve an optimal linear rate for smooth and convex functions (Zhang and Sra, 2016); however, it is in general extremely difficult to verify convexity of a function on a manifold, and our method adapts without requiring knowledge of the convexity. Note that our algorithm also applies to the subgradient descent method for non-smooth functions, where one takes an element of the subdifferential instead of the gradient of the function. In this case, for a strongly convex objective, the subgradient method converges to the optimum with an $O(1/N)$ rate of convergence (see Zhang and Sra (2016)). Thus our accelerated rate $O(1/N^2)$ can be considered optimal for strongly convex functions on the manifold.

Numerical Experiments

To examine the convergence and acceleration rates of our proposed algorithm, we first apply our method to the estimation of both intrinsic and extrinsic Fréchet means on spheres, where one has the exact optimum for comparison in the case of the extrinsic mean. We also apply our algorithm to the Netflix movie-ranking data set as an example of optimization over Grassmannian manifolds in the low-rank matrix completion problem.
Fréchet Means on Spheres

We first consider the problem of estimating Fréchet means on manifolds (Fréchet, 1948). In this simple example, we have observations $\{x_1, \dots, x_N\}$ that lie on a sphere $\mathbb{S}^d$, and our goal is to estimate the sample mean
$$\hat\theta = \operatorname*{arg\,min}_{\theta \in \mathbb{S}^d} f(\theta), \qquad f(\theta) = \sum_{i=1}^N \rho^2(\theta, x_i). \tag{13}$$
If $\rho$ is the distance metric of the embedding Euclidean space, then there exists a closed-form solution $\hat\theta = \sum_{i=1}^N x_i \big/ \big\|\sum_{i=1}^N x_i\big\|$, which is the projection of the Euclidean mean $\bar x$ onto the sphere (Bhattacharya and Bhattacharya, 2012). This is called the extrinsic mean. When $\rho$ is taken to be the geodesic or intrinsic distance, $\hat\theta$ is called the intrinsic mean. We consider estimation of both extrinsic and intrinsic means using our method compared to other optimization techniques. One simple example of a retraction map for $\mathbb{S}^d$ is
$$R_\vartheta(v) = \frac{\vartheta + v}{|\vartheta + v|},$$
where $|\cdot|$ is the Euclidean norm in $\mathbb{R}^{d+1}$. The corresponding inverse retraction has the expression
$$R^{-1}_\vartheta\theta = \frac{\theta}{\vartheta^T\theta} - \vartheta.$$

We compare our accelerated method against gradient descent optimization, a Newton-type optimization scheme, DANE (Shamir et al., 2014), and a Nesterov-type method, RAGD (Zhang and Sra, 2018b), adapted for manifolds. For all the experiments in this section, we optimized the step size using Armijo-condition backtracking line search (Armijo, 1966), where we reduce the step size by a factor of 0.95 until the required decrease in the loss is met. For our manifold Catalyst algorithm we fix the $\mathcal{A}_2$ budget $S$ and the number of $\mathcal{A}_1$ iterations $T$, and the cutoff parameter for $\mathcal{A}_1$ is initialized at 0.1. For the DANE results we fix the regularization term, and for RAGD we fix the shrinkage parameter. Our synthetic data set consists of 10,000 observations generated i.i.d. from a multivariate $N(0, I)$ distribution and projected onto the sphere.

We run each optimization routine for 100 iterations. Figures 2 and 3 show that our accelerated method converges to an optimum in fewer iterations than the competing methods, for an intrinsic mean as well as an extrinsic mean example, both in terms of the loss function value and the norm of the loss gradient. Moreover, in the intrinsic mean example, our method obtains a smaller loss function value and gradient norm than the competing methods. In the extrinsic mean example, our method obtains a comparable loss function value, and MSE between the learned parameter and the closed-form expression of the sample mean, in fewer iterations, and obtains a smaller gradient norm than the competing methods.

By explicit calculation, we show that the objective functions are strongly convex over a neighborhood of any point on the manifold (see the Appendix for a proof). This is a highly non-trivial task for general objective functions, hence necessitating an adaptive method such as ours. Moreover, in the extrinsic mean example, since we have a closed-form expression of the Fréchet mean, we also show that our optimization approach converges to the true extrinsic mean in terms of mean squared error faster than the other optimization methods.
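As an illustration of this setup (not the authors' experiment code), here is a small NumPy sketch that generates data as above, computes the closed-form extrinsic mean, and runs retraction-based gradient descent on the intrinsic Fréchet objective; the step size is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # i.i.d. Gaussians projected onto S^2

# Extrinsic mean: closed form, the projection of the Euclidean mean onto the sphere.
extrinsic_mean = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))

def grad_intrinsic(theta):
    """Riemannian gradient of f(theta) = sum_i arccos(theta^T x_i)^2."""
    c = np.clip(X @ theta, -1.0, 1.0)
    # d/dtheta arccos(c)^2 = -2 arccos(c)/sqrt(1-c^2) * x_i, then project to T_theta.
    coef = -2.0 * np.arccos(c) / np.sqrt(np.maximum(1.0 - c**2, 1e-12))
    g = (coef[:, None] * X).sum(axis=0)
    return g - np.dot(g, theta) * theta

theta = X[0].copy()                                # initialize at a data point
for _ in range(100):
    v = -(1.0 / len(X)) * grad_intrinsic(theta)    # conservative fixed step (ours)
    theta = (theta + v) / np.linalg.norm(theta + v)  # sphere retraction R_theta(v)

print("intrinsic mean estimate:", theta)
print("extrinsic mean (closed form):", extrinsic_mean)
```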
Figure 3: Extrinsic mean comparison on spheres.

Low-Rank Matrix Completion on the Netflix Data

Next, we consider an application of our algorithm to the Netflix movie rating data set. This data set of over a million entries, $X \in \mathbb{R}^{M\times N}$, consists of $M$ movies and $N$ users, in which only a sparse subset of the user–movie pairs have ratings. To build a better recommendation system, we can frame the prediction of users' ratings for movies as a low-rank matrix completion problem: we learn a rank-$r$ subspace $U \in \operatorname{Gr}(M, r)$ on the Grassmann manifold that optimizes, over the set of observed entries $(i,j) \in \Omega$, the loss function
$$L(U) = \sum_{(i,j)\in\Omega}\big((UW)_{ij} - X_{ij}\big)^2 + \lambda\sum_{(i,j)\notin\Omega}(UW)_{ij}^2, \tag{14}$$
where $W$ is an $r$-by-$N$ matrix. Each user $k$ has the loss function
$$L(U, k) = \big| c_k \circ \big(U w_k(U) - X_k\big)\big|^2,$$
where $\circ$ is the Hadamard product, $(w_k)_i = W_{ik}$, and
$$(c_k)_i = \begin{cases} 1, & \text{if } (i,k) \in \Omega, \\ \sqrt{\lambda}, & \text{if } (i,k) \notin \Omega, \end{cases} \qquad (X_k)_i = \begin{cases} X_{ik}, & \text{if } (i,k) \in \Omega, \end{cases}$$
$$\phantom{(X_k)_i = {}} \begin{cases} 0, & \text{if } (i,k) \notin \Omega, \end{cases} \qquad w_k(U) = \big(U^T\operatorname{diag}(c_k \circ c_k)\,U\big)^{-1} U^T\big(c_k \circ c_k \circ X_k\big).$$
This results in the following gradient:
$$\nabla L(U, k) = 2\big(c_k \circ c_k \circ (U w_k(U) - X_k)\big)\, w_k(U)^T = 2\operatorname{diag}(c_k \circ c_k)\big(U w_k(U) - X_k\big)\, w_k(U)^T.$$
For this problem on the Grassmann manifold, we have the retraction map
$$R_U(V) = U + V \tag{15}$$
and the inverse retraction map
$$R^{-1}_U V = V - U(U^TU)^{-1}U^TV. \tag{16}$$

Figure 4: Results for the parallel (right) and reduced (left) Netflix example.

We compare our method against a standard gradient descent method on a subset of the data where we observe only a million ratings (a small fraction of the full data set). In this setting we fix the matrix rank $r$ and set the regularization parameter $\lambda = 0.01$. Figure 4 shows that our accelerated method attains a smaller loss function value and test set MSE, and a nearly identical loss gradient norm, faster than RAGD, DANE, or a typical gradient descent approach.

On a larger scale, we apply a parallelized version of our accelerated method and ILEA, a communication-efficient parallel algorithm on manifolds proposed in Saparbayeva et al. (2018), to the full Netflix data set. We randomly distribute the data across 64 processors and run the optimization routine for 200 iterations. In Figure 4, we again see the steady acceleration that our method provides in terms of the loss function value across iterations and the gradient norm of the loss, though ILEA obtains a slightly better test set MSE than our method.
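For concreteness, here is a small self-contained sketch (ours) of the per-user coefficients, gradient, and the retraction pair (15)–(16) described above, run on a tiny random matrix standing in for the Netflix data; the step size and problem dimensions are arbitrary illustrative choices.

```python
import numpy as np

def w_k(U, c, x):
    """w_k(U) = (U^T diag(c∘c) U)^{-1} U^T (c∘c∘x) for one user."""
    C2 = c * c
    return np.linalg.solve(U.T @ (C2[:, None] * U), U.T @ (C2 * x))

def grad_user(U, c, x):
    """Per-user gradient: 2 diag(c∘c)(U w - x) w^T."""
    w = w_k(U, c, x)
    return 2.0 * ((c * c) * (U @ w - x))[:, None] @ w[None, :]

def retract(U, V):
    """R_U(V) = U + V, Equation (15)."""
    return U + V

def inv_retract(U, V):
    """R_U^{-1}(V) = V - U (U^T U)^{-1} U^T V, Equation (16)."""
    return V - U @ np.linalg.solve(U.T @ U, U.T @ V)

# Tiny synthetic stand-in: M movies, N users, rank r, observation mask Omega.
rng = np.random.default_rng(1)
M, N, r, lam = 50, 80, 3, 0.01
X = rng.normal(size=(M, r)) @ rng.normal(size=(r, N))
Omega = rng.random((M, N)) < 0.3
U = np.linalg.qr(rng.normal(size=(M, r)))[0]

for it in range(50):                        # batch gradient descent over users
    G = np.zeros_like(U)
    for k in range(N):
        c = np.where(Omega[:, k], 1.0, np.sqrt(lam))
        G += grad_user(U, c, np.where(Omega[:, k], X[:, k], 0.0))
    H = inv_retract(U, G)                   # horizontal component of the gradient
    U = retract(U, -1e-3 * H)               # re-orthonormalization omitted for brevity
```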
Conclusion and Discussion

We propose a general scheme for solving non-convex optimization problems on manifolds that yields theoretical guarantees of convergence to a stationary point when the objective function is non-convex. When the objective function is convex, it leads to accelerated convergence rates for a large class of first-order methods, as we show in our numerical examples. One interesting future direction we want to pursue is proposing accelerated algorithms on statistical manifolds (manifolds of densities or distributions) by employing information-geometric techniques, and applying these algorithms to accelerate the convergence and mixing of MCMC algorithms.
Appendix

We first introduce a simple lemma.
Lemma 2.
Suppose the sequence $\{\alpha_k\}_{k \ge 1}$ is produced by $\mathcal{A}_2$. Then the following bounds hold for all $k \ge 1$:
$$\frac{\sqrt{2}}{k+2} \le \alpha_k \le \frac{2}{k+1}.$$
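As a quick numerical sanity check of Lemma 2 (a sketch of ours, assuming the standard Catalyst initialization $\alpha_1 = 1$):

```python
import numpy as np

# Check the bounds of Lemma 2 for the Nesterov sequence
# alpha_{k+1}^2 = (1 - alpha_{k+1}) alpha_k^2, starting from alpha_1 = 1.
alpha = 1.0
for k in range(1, 10_001):
    assert np.sqrt(2) / (k + 2) <= alpha <= 2 / (k + 1) + 1e-12
    # positive root of a^2 + alpha_k^2 a - alpha_k^2 = 0 (the update in Algorithm 2)
    alpha = 0.5 * (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2)
```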
Proof of Theorem 1. The conditions
$$\operatorname{dist}\big(0_{\bar\theta_k}, \partial h_{\kappa_k}(\bar\theta_k, \theta_{k-1})\big) < \kappa_k\, d_R(\bar\theta_k, \theta_{k-1}) \quad\text{and}\quad h_{\kappa_k}(\bar\theta_k, \theta_{k-1}) \le h_{\kappa_k}(\theta_{k-1}, \theta_{k-1}) \tag{17}$$
imply that $\{f(\theta_k)\}_{k \ge 0}$ is monotonically decreasing. From this,
$$f(\theta_{k-1}) = h_{\kappa_k}(\theta_{k-1}, \theta_{k-1}) \ge h_{\kappa_k}(\bar\theta_k, \theta_{k-1}) \ge f(\theta_k) + \kappa_k\, d_R^2(\bar\theta_k, \theta_{k-1}). \tag{18}$$
Using condition (17), we apply Lemma 1 with $\vartheta = \theta_{k-1}$, $\theta = \bar\theta_k$, and $\varepsilon = \kappa_k\, d_R(\bar\theta_k, \theta_{k-1})$; hence
$$\operatorname{dist}\big(0_{\bar\theta_k}, \partial f(\bar\theta_k)\big) \le \kappa_k(1 + K_3)\, d_R(\bar\theta_k, \theta_{k-1}).$$
Combining the above inequality with (18), one has
$$\operatorname{dist}^2\big(0_{\bar\theta_k}, \partial f(\bar\theta_k)\big) \le \kappa_k^2(1 + K_3)^2\, d_R^2(\bar\theta_k, \theta_{k-1}) \le \kappa_{\max}(1 + K_3)^2\big(f(\theta_{k-1}) - f(\theta_k)\big). \tag{19}$$
Summing over $j = 1, \dots, N$, we conclude
$$\min_{j=1,\dots,N}\Big\{\operatorname{dist}^2\big(0_{\bar\theta_j}, \partial f(\bar\theta_j)\big)\Big\} \le \frac{\kappa_{\max}(1 + K_3)^2}{N}\sum_{j=1}^N\big(f(\theta_{j-1}) - f(\theta_j)\big) \le \frac{\kappa_{\max}(1 + K_3)^2}{N}\big(f(\theta_0) - f^*\big).$$

Now let $v_k \in \partial h_{\kappa_{cvx}}(\tilde\theta_k, \vartheta_k)$. Since the function $f$ is $\kappa_{cvx}(K_1K_2 - R_0)$-strongly convex, the function $h_{\kappa_{cvx}}(\cdot, \vartheta_k)$ is $\kappa_{cvx}K_1K_2$-strongly convex:
$$f(\theta) + \kappa_{cvx}\, d_R^2(\theta, \vartheta_k) \ge f(\tilde\theta_k) + \kappa_{cvx}\, d_R^2(\tilde\theta_k, \vartheta_k) + \kappa_{cvx}K_1K_2\, d_R^2(\tilde\theta_k, \theta) + \big\langle v_k, R^{-1}_{\tilde\theta_k}\theta \big\rangle.$$
We substitute $\theta = R_{\theta_{k-1}}\big(\alpha_k R^{-1}_{\theta_{k-1}}\theta^*\big)$, where $\theta^*$ is any minimizer of $f$. Using the convexity of $f$ along retractions,
$$f(\theta) \le \alpha_k f(\theta^*) + (1 - \alpha_k) f(\theta_{k-1}),$$
the stopping criterion
$$\operatorname{dist}\big(0_{\tilde\theta_k}, \partial h_{\kappa_{cvx}}(\tilde\theta_k, \vartheta_k)\big) < \frac{\kappa_{cvx}}{(k+1)^2}\, d_R(\tilde\theta_k, \vartheta_k), \tag{20}$$
i.e., $\|v_k\| < \frac{\kappa_{cvx}}{(k+1)^2}\, d_R(\tilde\theta_k, \vartheta_k)$, the definitions $\vartheta_k = R_{\theta_{k-1}}\big(\alpha_k R^{-1}_{\theta_{k-1}}\tilde\vartheta_{k-1}\big)$ and $\tilde\vartheta_k = R_{\theta_{k-1}}\big(\tfrac{1}{\alpha_k} R^{-1}_{\theta_{k-1}}\tilde\theta_k\big)$, together with the bi-Lipschitz property of the inverse retraction to pass between the tangent spaces at $\tilde\theta_k$ and $\theta_{k-1}$, we obtain
$$f(\theta_k) \le \alpha_k f(\theta^*) + (1 - \alpha_k) f(\theta_{k-1}) + \kappa_{cvx}\alpha_k^2 K_1K_2\big(d_R^2(\theta^*, \tilde\vartheta_{k-1}) - d_R^2(\theta^*, \tilde\vartheta_k)\big) - \kappa_{cvx}\, d_R^2(\tilde\theta_k, \vartheta_k) + \frac{\kappa_{cvx}\alpha_k K_1K_2}{(k+1)^2}\, d_R(\tilde\theta_k, \vartheta_k)\, d_R(\theta^*, \tilde\vartheta_k). \tag{21}$$
Let $\mu_k = \frac{1}{(k+1)^2}$.
Completing the square yields
$$-\kappa_{cvx}\, d_R^2(\tilde\theta_k, \vartheta_k) + \kappa_{cvx}\alpha_k\mu_k K_1K_2\, d_R(\tilde\theta_k, \vartheta_k)\, d_R(\theta^*, \tilde\vartheta_k) \le \frac{K_1^2K_2^2\,\kappa_{cvx}\alpha_k^2\mu_k^2}{4}\, d_R^2(\theta^*, \tilde\vartheta_k) \le K_1K_2\,\kappa_{cvx}\alpha_k^2\mu_k\, d_R^2(\theta^*, \tilde\vartheta_k),$$
and subtracting $f^* = f(\theta^*)$ from both sides of (21), we obtain
$$f(\theta_k) - f^* \le (1 - \alpha_k)\big(f(\theta_{k-1}) - f^*\big) + \kappa_{cvx}\alpha_k^2 K_1K_2\big(d_R^2(\theta^*, \tilde\vartheta_{k-1}) - d_R^2(\theta^*, \tilde\vartheta_k)\big) + K_1K_2\,\kappa_{cvx}\alpha_k^2\mu_k\, d_R^2(\theta^*, \tilde\vartheta_k)$$
$$= (1 - \alpha_k)\big(f(\theta_{k-1}) - f^*\big) + \kappa_{cvx}\alpha_k^2 K_1K_2\, d_R^2(\theta^*, \tilde\vartheta_{k-1}) - \kappa_{cvx}\alpha_k^2 K_1K_2(1 - \mu_k)\, d_R^2(\theta^*, \tilde\vartheta_k).$$
Dividing by $\alpha_k^2$, one obtains
$$\frac{f(\theta_k) - f^*}{\alpha_k^2} + \kappa_{cvx}K_1K_2(1 - \mu_k)\, d_R^2(\theta^*, \tilde\vartheta_k) \le \frac{1 - \alpha_k}{\alpha_k^2}\big(f(\theta_{k-1}) - f^*\big) + \kappa_{cvx}K_1K_2\, d_R^2(\theta^*, \tilde\vartheta_{k-1}).$$
Denote $A_k = 1 - \mu_k$. Using the equality $\frac{1 - \alpha_k}{\alpha_k^2} = \frac{1}{\alpha_{k-1}^2}$, we derive the following recursion:
$$\frac{f(\theta_k) - f^*}{\alpha_k^2} + \kappa_{cvx}K_1K_2 A_k\, d_R^2(\theta^*, \tilde\vartheta_k) \le \frac{f(\theta_{k-1}) - f^*}{\alpha_{k-1}^2} + \kappa_{cvx}K_1K_2\, d_R^2(\theta^*, \tilde\vartheta_{k-1}) \le A_{k-1}^{-1}\left(\frac{f(\theta_{k-1}) - f^*}{\alpha_{k-1}^2} + \kappa_{cvx}K_1K_2 A_{k-1}\, d_R^2(\theta^*, \tilde\vartheta_{k-1})\right).$$
The last inequality holds because $0 < A_k \le 1$. Iterating $N$ times, we deduce
$$\frac{f(\theta_N) - f^*}{\alpha_N^2} \le \frac{f(\theta_N) - f^*}{\alpha_N^2} + \kappa_{cvx}K_1K_2 A_N\, d_R^2(\theta^*, \tilde\vartheta_N) \le \kappa_{cvx}K_1K_2\, d_R^2(\theta^*, \theta_0)\prod_{k=1}^{N-1} A_k^{-1}.$$
Since $\prod_{k=1}^{\infty} A_k^{-1} \le 2$, together with the bound on $\alpha_N$ from Lemma 2 we conclude
$$f(\theta_N) - f^* \le 2\alpha_N^2\,\kappa_{cvx}K_1K_2\, d_R^2(\theta^*, \theta_0) \le \frac{8\kappa_{cvx}K_1K_2}{(N+1)^2}\, d_R^2(\theta^*, \theta_0).$$
Hence
$$f(\theta_N) - f^* \le \frac{8\kappa_{cvx}K_1K_2}{(N+1)^2}\, d_R^2(\theta^*, \theta_0).$$

We now prove that the objective functions for estimating both the intrinsic and extrinsic Fréchet means on the sphere in Section 4 are strongly convex.
Proof.
To prove the strong convexity of the intrinsic mean objective on the sphere $\mathbb{S}^n$, we prove the strong convexity of the squared intrinsic distance function from a point $x_0 \in \mathbb{S}^n$,
$$d_g^2(x, x_0) = \arccos^2(x^Tx_0).$$
For the geodesic from a point $x_1 \in \mathbb{S}^n$ to a point $x_2 \in \mathbb{S}^n$,
$$\gamma(\lambda) = \exp_{x_1}\big(\lambda\log_{x_1}x_2\big) = \cos\big(\lambda\arccos(x_1^Tx_2)\big)\, x_1 + \sin\big(\lambda\arccos(x_1^Tx_2)\big)\,\frac{x_2 - (x_1^Tx_2)\,x_1}{\sqrt{1 - (x_1^Tx_2)^2}},$$
we need to show the following inequality:
$$d_g^2(x_0, \gamma(\lambda)) \le (1-\lambda)\, d_g^2(x_0, x_1) + \lambda\, d_g^2(x_0, x_2) - \lambda(1-\lambda)\,\mu\, d_g^2(x_1, x_2).$$
For the sake of brevity, use the notation $d_1 = \arccos(x_0^Tx_1)$, $d_2 = \arccos(x_0^Tx_2)$, and $d_3 = \arccos(x_1^Tx_2)$. Then we must show
$$\arccos^2\left(\cos(\lambda d_3)\cos(d_1) + \sin(\lambda d_3)\,\frac{\cos(d_2) - \cos(d_3)\cos(d_1)}{\sin(d_3)}\right) \le (1-\lambda)\, d_1^2 + \lambda\, d_2^2 - \lambda(1-\lambda)\,\mu\, d_3^2. \tag{22}$$
Alternatively, we can prove the subgradient inequality
$$\arccos^2(x_0^Tx_2) \ge \arccos^2(x_0^Tx_1) - 2\,(\log_{x_1}x_0)^T\log_{x_1}x_2 + \mu\arccos^2(x_1^Tx_2),$$
that is,
$$d_2^2 \ge d_1^2 - 2\left(d_1\,\frac{x_0 - \cos(d_1)\,x_1}{\sin(d_1)}\right)^T\left(d_3\,\frac{x_2 - \cos(d_3)\,x_1}{\sin(d_3)}\right) + \mu\, d_3^2 = d_1^2 - 2 d_1 d_3\,\frac{\cos(d_2) - \cos(d_1)\cos(d_3)}{\sin(d_1)\sin(d_3)} + \mu\, d_3^2.$$
This last inequality was checked to hold in Wolfram Mathematica for $d_1, d_2 \in [0, \pi/4]$ and $d_3 \in \big[\,|d_1 - d_2|,\ d_1 + d_2\,\big]$, for a fixed positive constant $\mu$.

To prove the strong convexity of the Fréchet objective for estimating the extrinsic mean on the sphere $\mathbb{S}^n$, we prove the strong convexity of the squared extrinsic distance function from a point $x_0 \in \mathbb{S}^n$,
$$d_e^2(x, x_0) = 2\,(1 - x^Tx_0).$$
For the same geodesic $\gamma(\lambda)$ from $x_1$ to $x_2$ as above, we need to show that
$$d_e^2(x_0, \gamma(\lambda)) \le (1-\lambda)\, d_e^2(x_0, x_1) + \lambda\, d_e^2(x_0, x_2) - \lambda(1-\lambda)\,\mu\, d_g^2(x_1, x_2).$$
Therefore we have to prove
$$2\left(1 - \cos(\lambda d_3)\cos(d_1) - \sin(\lambda d_3)\,\frac{\cos(d_2) - \cos(d_3)\cos(d_1)}{\sin(d_3)}\right) \le 2\big(1 - (1-\lambda)\cos(d_1) - \lambda\cos(d_2)\big) - \lambda(1-\lambda)\,\mu\, d_3^2. \tag{23}$$
Alternatively, we need to show
$$2\,(1 - x_0^Tx_2) \ge 2\,(1 - x_0^Tx_1) - 2\big(x_0 - (x_1^Tx_0)\,x_1\big)^T\log_{x_1}x_2 + \mu\arccos^2(x_1^Tx_2).$$
Thus
$$2\big(1 - \cos(d_2)\big) \ge 2\big(1 - \cos(d_1)\big) - 2\big(x_0 - \cos(d_1)\,x_1\big)^T\left(d_3\,\frac{x_2 - \cos(d_3)\,x_1}{\sin(d_3)}\right) + \mu\, d_3^2 = 2\big(1 - \cos(d_1)\big) - 2 d_3\,\frac{\cos(d_2) - \cos(d_1)\cos(d_3)}{\sin(d_3)} + \mu\, d_3^2.$$
This last inequality was likewise verified in Wolfram Mathematica for $d_1, d_2 \in [0, \pi/4]$ and $d_3 \in \big[\,|d_1 - d_2|,\ d_1 + d_2\,\big]$.
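The same grid verification can be reproduced numerically; below is a small NumPy sketch (ours, standing in for the Mathematica check) that grid-checks inequality (22), with the strong-convexity constant $\mu$ left as a user-supplied parameter since its value is not fixed above.

```python
import numpy as np

def lhs_22(d1, d2, d3, lam):
    """Left side of (22): squared geodesic distance from x0 to gamma(lam)."""
    c = (np.cos(lam * d3) * np.cos(d1)
         + np.sin(lam * d3) * (np.cos(d2) - np.cos(d3) * np.cos(d1)) / np.sin(d3))
    return np.arccos(np.clip(c, -1.0, 1.0)) ** 2

def rhs_22(d1, d2, d3, lam, mu):
    """Right side of (22)."""
    return (1 - lam) * d1**2 + lam * d2**2 - lam * (1 - lam) * mu * d3**2

def check_22(mu, n=40):
    """Grid-check (22) for d1, d2 in [0, pi/4], admissible d3, and lam in (0, 1)."""
    ds = np.linspace(1e-3, np.pi / 4, n)
    lams = np.linspace(0.01, 0.99, n)
    worst = np.inf
    for d1 in ds:
        for d2 in ds:
            lo, hi = abs(d1 - d2) + 1e-6, d1 + d2 - 1e-6
            if lo >= hi:
                continue
            for d3 in np.linspace(lo, hi, n):
                gap = rhs_22(d1, d2, d3, lams, mu) - lhs_22(d1, d2, d3, lams)
                worst = min(worst, gap.min())
    return worst  # nonnegative on the grid <=> (22) holds there

print(check_22(mu=0.1))  # mu = 0.1 is an illustrative test value of ours
```

Acknowledgments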
Lizhen Lin would like to thank Dong Quan Nguyen for very helpful discussions. Lizhen Lin acknowledges support from NSF grants IIS 1663870 and DMS CAREER 1654579 and DARPA grant N66001-17-1-4041. Bayan Saparbayeva was partially supported by DARPA N66001-17-1-4041.

References
Absil, P.-A., Baker, C., and Gallivan, K. (2007). Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330.

Absil, P.-A., Mahony, R., and Sepulchre, R. (2010). Optimization on manifolds: methods and applications. In Diehl, M., Glineur, F., Jarlebring, E., and Michiels, W., editors, Recent Advances in Optimization and its Applications in Engineering, pages 125–144, Berlin, Heidelberg. Springer Berlin Heidelberg.

Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3.

Babanezhad, R., Laradji, I. H., Shafaei, A., and Schmidt, M. O. (2018). MASAGA: A linearly-convergent stochastic first-order method for optimization on manifolds. In ECML/PKDD.

Bhattacharya, A. and Bhattacharya, R. (2012). Nonparametric Inference on Manifolds: With Applications to Shape Spaces. IMS Monographs. Cambridge University Press.

Bhattacharya, R. and Lin, L. (2017). Omnibus CLTs for Fréchet means and nonparametric inference on non-Euclidean spaces. Proceedings of the American Mathematical Society, 145:413–428.

Bonnabel, S. (2013). Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229.

Boumal, N. and Absil, P. (2011). RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24, pages 406–414. Curran Associates, Inc.

Boumal, N., Absil, P.-A., and Cartis, C. (2019). Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33.

Edelman, A., Arias, T., and Smith, S. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353.

Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l'Institut Henri Poincaré, 10:215–310.

Huang, Z. and Gool, L. V. (2017). A Riemannian network for SPD matrix learning. In AAAI.

Huang, Z., Wan, C., Probst, T., and Gool, L. V. (2017). Deep learning on Lie groups for skeleton-based action recognition. In CVPR, pages 1243–1252.

Lin, H., Mairal, J., and Harchaoui, Z. (2017). Catalyst acceleration for first-order convex optimization: From theory to practice. Journal of Machine Learning Research, 18(1):7854–7907.

Liu, Y., Shang, F., Cheng, J., Cheng, H., and Jiao, L. (2017). Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4868–4877. Curran Associates, Inc.

Lohit, S. and Turaga, P. K. (2017). Learning invariant Riemannian geometric representations using deep nets, pages 1329–1338.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.

Nishimori, Y., Akaho, S., and Plumbley, M. D. (2008). Natural Conjugate Gradient on Complex Flag Manifolds for Complex Independent Subspace Analysis, pages 165–174. Springer Berlin Heidelberg, Berlin, Heidelberg.

Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., and Harchaoui, Z. (2018). Catalyst for gradient-based nonconvex optimization. In Storkey, A. and Perez-Cruz, F., editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 613–622, Playa Blanca, Lanzarote, Canary Islands. PMLR.

Ring, W. and Wirth, B. (2012). Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627.

Saparbayeva, B., Zhang, M., and Lin, L. (2018). Communication efficient parallel algorithms for optimization on manifolds. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 3574–3584. Curran Associates, Inc.

Shamir, O., Srebro, N., and Zhang, T. (2014). Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages II-1000–II-1008. JMLR.org.

Smith, S. T. (2014). Optimization techniques on Riemannian manifolds. arXiv e-prints, arXiv:1407.5965.

Zhang, H., Reddi, S. J., and Sra, S. (2016). Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4599–4607. Curran Associates Inc.

Zhang, H. and Sra, S. (2016). First-order methods for geodesically convex optimization. In Proceedings of Machine Learning Research, volume 49, pages 1617–1638, Columbia University, New York, New York, USA. PMLR.

Zhang, H. and Sra, S. (2018a). An estimate sequence for geodesically convex optimization. In Bubeck, S., Perchet, V., and Rigollet, P., editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1703–1723. PMLR.

Zhang, H. and Sra, S. (2018b). Towards Riemannian accelerated gradient methods. arXiv e-prints.

Zhang, J., Zhang, H., and Sra, S. (2018). R-SPIDER: A fast Riemannian stochastic optimization algorithm with curvature independent rate. arXiv e-prints, arXiv:1811.04194.

Zhou, P., Yuan, X.-T., and Feng, J. (2019). Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of Machine Learning Research, volume 89. PMLR.